p-value


In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Reporting p-values of statistical tests is common practice in academic publications of many quantitative fields. Since the precise meaning of the p-value is hard to grasp, misuse is widespread and has been a major topic in metascience.
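As a minimal illustration of the definition (a sketch with hypothetical coin-flip data, not drawn from any historical source), the one-sided p-value for observing 9 heads in 10 flips of a presumed-fair coin is the probability of a result at least that extreme, namely 9 or 10 heads:

```python
from math import comb

# Hypothetical data: 9 heads in 10 flips. Under H0 (fair coin), the one-sided
# p-value is the probability of a result at least as extreme: 9 or 10 heads.
n, k = 10, 9
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(p_value)  # 11/1024 ≈ 0.0107
```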

History


P-value computations date back to the 1700s, where they were computed for the human sex ratio at birth and used to assess statistical significance against the null hypothesis of equal probability of male and female births; this amounted to a simple nonparametric test (see sign test § History).

The same question was later addressed by Pierre-Simon Laplace, who instead used a parametric test, modeling the number of male births with a binomial distribution:

In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.
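A calculation of this kind can be sketched with the normal approximation to the binomial distribution; the birth counts below are illustrative stand-ins, not Laplace's published figures:

```python
from statistics import NormalDist

# Illustrative counts (not Laplace's actual data): test H0: P(boy) = 1/2.
boys, girls = 251_527, 241_945
n = boys + girls
mean, sd = n / 2, (n / 4) ** 0.5    # Binomial(n, 1/2) mean and std. dev.
z = (boys - mean) / sd              # standardized excess of boys
p_value = 1 - NormalDist().cdf(z)   # one-sided: P(at least this many boys)
print(z, p_value)                   # z ≈ 13.6; p so small it underflows to 0.0
```

With the excess this many standard deviations above the null expectation, the p-value is effectively zero, matching the conclusion that the effect was real.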

The p-value was first formally introduced by Karl Pearson, in his Pearson's chi-squared test, using the chi-squared distribution and notated as capital P.

The use of the p-value in statistics was popularized by Ronald Fisher, where it plays a central role in his approach to the subject. In his influential book Statistical Methods for Research Workers (1925), Fisher proposed the level p = 0.05, or a 1 in 20 chance of being exceeded by chance, as a limit for statistical significance, and applied this to a normal distribution (as a two-tailed test), thus yielding the rule of two standard deviations (on a normal distribution) for statistical significance (see 68–95–99.7 rule).
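The correspondence between the two-tailed p = 0.05 criterion and "two standard deviations" can be checked directly; a quick sketch using Python's standard library (not part of the original account):

```python
from statistics import NormalDist

# A two-tailed test at p = 0.05 puts 2.5% in each tail of the normal
# distribution; the matching cutoff is the 97.5th percentile.
cutoff = NormalDist().inv_cdf(1 - 0.05 / 2)
print(cutoff)  # ≈ 1.96 standard deviations, informally "two"
```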

He then computed a table of values, similar to Elderton but, importantly, reversed the roles of χ2 and p. That is, rather than computing p for different values of χ2 (and degrees of freedom n), he computed values of χ2 that yield intended p-values, specifically 0.99, 0.98, 0.95, 0.90, 0.80, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.02, and 0.01. That allowed computed values of χ2 to be compared against cutoffs and encouraged the use of p-values (particularly 0.05, 0.02, and 0.01) as cutoffs, instead of computing and reporting p-values themselves. The same type of tables were then compiled in Fisher & Yates (1938), which cemented the approach.
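A few entries of such a reversed table can be reproduced with a modern library (this sketch assumes SciPy, which is not mentioned in the text): instead of computing p from χ2, the inverse survival function gives the χ2 value whose upper-tail probability equals an intended p.

```python
from scipy.stats import chi2

# Fisher-style table: for each degrees-of-freedom value, find the chi-squared
# cutoff whose upper-tail probability equals the intended p-value.
for df in (1, 2, 3):
    row = {p: round(chi2.isf(p, df), 3) for p in (0.99, 0.95, 0.05, 0.01)}
    print(df, row)
# For df = 1: p = 0.05 -> χ2 ≈ 3.841, p = 0.01 -> χ2 ≈ 6.635
```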

As an illustration of the application of p-values to the design and interpretation of experiments, in his following book The Design of Experiments (1935), Fisher presented the lady tasting tea experiment, which is the archetypal example of the p-value.

To evaluate a lady's claim that she could tell by taste whether the milk or the tea had been added to the cup first, she was presented with 8 cups, 4 prepared each way, and asked to identify the preparation of each (knowing there were 4 of each). The null hypothesis was that she had no such ability, the test was Fisher's exact test, and the p-value was 1/70 ≈ 0.014, so Fisher was willing to reject the null hypothesis (consider the outcome highly unlikely to be due to chance) if all were classified correctly. In the actual experiment, Bristol correctly classified all 8 cups.
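The arithmetic behind this p-value (and the 6-cup variant Fisher discusses below) is short enough to sketch directly: with n cups, half prepared each way and the taster knowing the split, guessing picks one of C(n, n/2) equally likely classifications, only one of which is entirely correct.

```python
from math import comb

# Probability of classifying every cup correctly by pure guessing, when the
# taster knows exactly half the cups were prepared each way.
def p_all_correct(n_cups: int) -> float:
    return 1 / comb(n_cups, n_cups // 2)

print(p_all_correct(8))  # 1/70 ≈ 0.014, the design Fisher used
print(p_all_correct(6))  # 1/20 = 0.05, the smaller design he considers below
```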

Fisher reiterated the p = 0.05 threshold and explained its rationale, stating:

It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.

He also applies this threshold to the design of experiments, noting that had only 6 cups been presented (3 of each), a perfect classification would have only yielded a p-value of 1/20 = 0.05, which would not have met this level of significance. Fisher also underlined the interpretation of p as the long-run proportion of values at least as extreme as the data, assuming the null hypothesis is true.

In later editions, Fisher explicitly contrasted the use of the p-value for statistical inference in science with the Neyman–Pearson method, which he terms "Acceptance Procedures". Fisher emphasizes that while fixed levels such as 5%, 2%, and 1% are convenient, the exact p-value can be used, and the strength of evidence can and will be revised with further experimentation. In contrast, decision procedures require a clear-cut decision, yielding an irreversible action, and the procedure is based on costs of error, which, he argues, are inapplicable to scientific research.