A recent article in Nature [1] is among the latest installments in the long-running crusade to turn statistics right side up [2]. Certain aspects of the article might have been better handled, but that should not be the focus here; the piece is a great contribution toward fostering better understanding of statistics. Anyone interested in understanding research and statistics should read the article in its entirety, but at the heart of the matter is a call to retire the notion of “statistical significance”, which is usually declared when a research finding has a p value < 0.05 (some fields use more stringent thresholds). The authors are frustrated with the slavish and myopic reverence for, and poor understanding of, “statistical significance” (and the resultant unscientific behaviors) that permeate the literature. But to many accustomed to the revered cut-off of 0.05, their proposal may seem borderline heretical.
Is it, though?
Hardly. The 0.05 cut-off is, in fact, entirely arbitrary (hence “statistical significance” appearing in quotes above; the quotation marks are dropped hereafter for ease of reading) [3]. Additionally, the longstanding misuse and abuse of p values and statistical significance [3-7] make the authors’ frustration and proposal understandable. Nor is this the first time something like this has been proposed: two of the same authors suggested it in 2018 in response to a proposal to adopt a more stringent threshold for declaring something statistically significant (itself not a new idea) [8-11]. In fact, there were calls to abandon the practice of dichotomizing results into statistically significant and non-statistically significant at least as early as the 1960s [12-13], and these warnings were still being offered toward the end of the last millennium. Consider, for instance, Doug Altman’s rebuke in 1991, offered with his typical clarity:
The cut-off level for statistical significance is usually taken at 0.05, but sometimes at 0.01. These cut-offs are arbitrary and have no specific importance. It is ridiculous to interpret the results of a study differently according to whether the P value obtained was, say, 0.055 or 0.045. These P values should lead to very similar conclusions, not diametrically opposed ones. [Additionally,] estimation via confidence intervals [is] greatly preferred. The use of a cut-off for P leads to treating the analysis as a process for making a decision. Within this framework, it is customary (but unwise) to consider that a statistically significant effect is a real one, and conversely that a non-significant result indicates that there is no effect. Forcing a choice between significant and non-significant obscures the uncertainty present whenever we draw inferences from a sample. [14, pp168-169]
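To make Altman’s point concrete, here is a minimal sketch in Python (using SciPy; the effect estimates and standard errors are contrived purely for illustration): two hypothetical studies with identical point estimates and nearly identical precision land on opposite sides of p = 0.05, even though their confidence intervals are virtually indistinguishable.

```python
from scipy import stats

# Two hypothetical studies: identical effect estimates, nearly identical precision.
effect = 0.50
studies = {"Study A": 0.249, "Study B": 0.261}   # contrived standard errors

for label, se in studies.items():
    z = effect / se
    p = 2 * stats.norm.sf(abs(z))                # two-sided p value
    lo, hi = effect - 1.96 * se, effect + 1.96 * se
    print(f"{label}: p = {p:.3f}, 95% CI ({lo:.2f} to {hi:.2f})")

# Study A: p ~ 0.045; Study B: p ~ 0.055 -- near-identical evidence,
# yet a bright-line rule would call one "significant" and the other not.
```

The confidence intervals (roughly 0.01 to 0.99 versus -0.01 to 1.01) make plain that the two studies convey essentially the same information, which a significant/non-significant dichotomy would obscure.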
Fast forward to 2019, and The American Statistician devoted an entire supplemental issue to discussions of this matter. The issue begins with an editorial aptly titled Moving to a World Beyond “p < 0.05” [7]; two of the editorial’s three authors also authored the American Statistical Association (ASA) 2016 statement on p values [6]. The editorial notes, eloquently:
The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase (1925), Edgeworth’s (1885) original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny. But that idea has been irretrievably lost. Statistical significance was never meant to imply scientific importance, and the confusion of the two was decried soon after its widespread use (Boring 1919). Yet a full century later the confusion persists.
And so the tool has become the tyrant. The problem is not simply use of the word “significant,” although the statistical and ordinary language meanings of the word are indeed now hopelessly confused (Ghose 2013); the term should be avoided for that reason alone. The problem is a larger one, however: using bright-line rules for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making (ASA statement, Principle 3). A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
For example, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern (2006) famously observed, the difference between “significant” and “not significant” is not itself statistically significant. [7, p2]
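To illustrate Gelman and Stern’s observation with a toy calculation (a contrived sketch with made-up numbers, not taken from the editorial): two studies with identical precision can yield one “significant” and one “non-significant” result, while the difference between their estimates is nowhere near significant.

```python
from scipy import stats

# Two hypothetical studies with made-up estimates and standard errors.
est_a, se_a = 0.50, 0.25   # Study A: z = 2.0 -> p ~ 0.046 ("significant")
est_b, se_b = 0.25, 0.25   # Study B: z = 1.0 -> p ~ 0.32  ("not significant")

def two_sided_p(est, se):
    return 2 * stats.norm.sf(abs(est / se))

print(f"Study A: p = {two_sided_p(est_a, se_a):.3f}")
print(f"Study B: p = {two_sided_p(est_b, se_b):.3f}")

# The difference between the two estimates, however, is far from "significant".
diff = est_a - est_b
se_diff = (se_a**2 + se_b**2) ** 0.5
print(f"A vs B difference: p = {two_sided_p(diff, se_diff):.3f}")
```

Here the A-versus-B comparison yields p of roughly 0.48, so concluding that Study A “worked” and Study B “did not” based on their significance labels would be unwarranted.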
One should note that nothing here calls for abandoning the reporting of p values per se. These can still be reported, and the misuse, abuse, and poor understanding of p values are not the p value’s fault [15]. That said, there are other inferential measures people should be knowledgeable about for a better understanding of inferential statistics (e.g., confidence intervals, credible intervals, Bayes factors, false-positive risks), because they convey more, and more useful, information than p values, even though no measure is perfect. A simple false-positive risk calculation, sketched below, shows one way a p value alone can mislead.
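As one illustration, here is a toy false-positive risk calculation in the spirit of Wacholder and colleagues and Colquhoun (a simplified sketch; the priors and the assumed power of 0.8 are assumptions chosen for illustration, not figures from the article): the proportion of “significant” findings that are false positives depends heavily on how plausible the tested hypotheses were to begin with, something no p value can convey on its own.

```python
def false_positive_risk(prior_real: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Fraction of results declared 'significant' at level alpha that are false positives,
    given the prior probability that a real effect exists (a simplified textbook-style formula)."""
    false_pos = alpha * (1 - prior_real)   # no real effect, yet the test is "significant"
    true_pos = power * prior_real          # real effect, correctly detected
    return false_pos / (false_pos + true_pos)

for prior in (0.5, 0.1, 0.01):
    print(f"prior probability of a real effect = {prior:>4}: "
          f"false-positive risk = {false_positive_risk(prior):.0%}")
```

With these assumed inputs, the false-positive risk runs from about 6% when half of tested hypotheses are true to roughly 86% when only 1 in 100 is, even though every flagged result carries the same “p < 0.05” label.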
Although discourse on this matter should continue (and it is), what we need much more than statistical significance are statistically minded researchers, clinicians, journal editors, and the like who are transparent in their practices, meticulous in their analyses, judicious in their inferences, and comfortable with uncertainty and the limitations of “knowing”.
So, then, what’s significant about (statistical) significance?
Borrowing from Gelman and Stern [16], not much.
Read additional thoughts from Dr. Martin Mayer on the Nature proposal to retire statistical significance and see references cited in this article.
Read more about p values and statistical significance from Dr. Martin Mayer.