On the fifth day of Christmas, my true love sent to me: NO P-VALUES
Clear stopping rules
Bayes testing
Anscombe's Quartet
And fair stats communication.
You may be wondering about the relationship I have with my true love at this point. My husband (to his credit) has never sent me any of the above. Such gifts clearly vary with the tastes of different disciplines: my friend Phil, who is a reformed physicist, apparently received the "5 sigma rule" from his true love. All I can say is that his love is extremely stringent in these matters.
But why avoid sending me p-values? If we follow Pierre's advice from the first day of Christmas, they are not consistent with fair statistical communication, particularly failing on transparency and clarity. Here is an excerpt from our introductory chapter (written by me and Maurits) on why p-values might not mean what you think they mean.
"The NHST approach, fiercely promoted by Fisher in the 1930s (Fisher, 1934; and also Pearson, Fisher, & Inman, 1994), has become the gold standard in many disciplines including quantitative evaluations in HCI. However, the approach is rather counter-intuitive, and consequently many researchers misinterpret the meaning of the p-value. To illustrate this point, Oakes (1986) posed a series of true/false questions regarding the interpretation of p-values to seventy experienced researchers and discovered that only two had a sound understanding of the underlying concept of significance.
So what does a p-value actually mean? “…the p-value is the probability of obtaining the observed value of a sample statistic (such as t, F, ) or a more extreme value if the data were generated from a null-hypothesis population and sampled according to the intention of the experimenter” (Oakes 1986, p.293). Because p-values are based on the idea that probabilities are long run frequencies, they are properties of a collective of events rather than of a single event. They do not give the probability of a hypothesis being true or false for this particular experiment; they only describe the long term Type I error rate for a class of hypothetical experiments – most of which the researcher has not conducted.
The erroneous interpretation of the p-value by many researchers brings up the first actual threat to the validity of conclusions that are supported by null-hypothesis testing. Researchers often interpret the p-value as quantifying the probability that the null hypothesis is true. Thus, a p-value smaller than .05 indicates to large groups of researchers – consciously or unconsciously – that the probability that the null hypothesis is true (e.g. μ = μ0) is very small.
Under this misinterpretation the p-value would quantify P(H0|D) – the probability of the null hypothesis (H0) given the data collected in the experiment. However, the correct interpretation of the p-value is rather different: it quantifies P(D|H0) – the probability of the data (or of more extreme data) given that H0 is true. Researchers who state that it is very unlikely that the null hypothesis is true based on a low p-value are attributing an incorrect meaning to the p-value.
The difference between P(H0|D) and P(D|H0) is not merely a statistical quirk that is unlikely to affect anything in practice. Not at all. The following example shows why the misconception is both wrong and hugely important: consider the probability of being dead after being shot, P(D=true|S=true). Most would estimate this to be very high, say 0.99. However, the mere observation of a dead person does not lead most people to believe that the corpse was shot – after all, there are many possible ways to die which don’t involve shooting. P(S=true|D=true) is estimated to be rather small. Luckily, the relationship between P(D|H0) and P(H0|D) is well-known and given by Bayes' rule (Barnard & Bayes, 1958). Thus, if one wishes to estimate P(H0|D) – which we often do – the proper analytical tools are at our disposal." (see my gift from the 4th day of Christmas).
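To make the shooting example concrete, here is a minimal Python sketch of Bayes' rule. It is not from the chapter. Only the 0.99 comes from the excerpt; the base rate of being shot (`p_shot`) and the chance of being dead without having been shot (`p_dead_given_not_shot`) are made-up numbers chosen purely for illustration, and the function name `bayes_posterior` is my own.

```python
def bayes_posterior(prior, likelihood, likelihood_if_false):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' rule."""
    # Total probability of observing B, whether or not A holds.
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence


# From the excerpt: being dead after being shot is very likely.
p_dead_given_shot = 0.99

# Illustrative assumptions only (not from the chapter):
p_shot = 0.0001               # hypothetical base rate of having been shot
p_dead_given_not_shot = 0.01  # hypothetical chance of being dead otherwise

p_shot_given_dead = bayes_posterior(
    prior=p_shot,
    likelihood=p_dead_given_shot,
    likelihood_if_false=p_dead_given_not_shot,
)

print(f"P(dead | shot) = {p_dead_given_shot:.2f}")
print(f"P(shot | dead) = {p_shot_given_dead:.4f}")
```

With these assumed inputs, P(shot | dead) comes out at roughly 0.01 even though P(dead | shot) is 0.99. Change the assumed base rate and the answer changes with it, which is exactly why a p-value alone cannot tell you P(H0|D).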
I am personally not arguing for a full-out ban on p-values (although one psychology journal has done so). I am arguing for using them carefully and appropriately as part of a wider statistical argument. I would also like reviewers to accept other statistical approaches rather than favouring NHST practices over others.
Modern Statistical Methods in Human Computer Interaction (edited by Judy Robertson and Maurits Kaptein) will be published by Springer in early 2016. Here is the table of contents:
Modern Statistical Methods for HCI
Preface
1. An introduction to Modern Statistical Methods for HCI.
J. Robertson & M.C. Kaptein
Section 1: Getting Started With Data Analysis.
2. Getting started with [R]: a brief introduction
L. Ippel.
3. Descriptive statistics, Graphs, and Visualization.
J. Young & J. Wessnitzer
4. Handling missing data
T. Baguley & M. Andrews
Section 2: Classical Null Hypothesis Significance Testing done properly
5. Effect sizes and Power in HCI
K. Yatani
6. Using R for repeated and time-series observations
D. Fry & K. Wazny
7. Non-parametric Statistics in Human-Computer Interaction
J.O. Wobbrock and M. Kay
Section 3: Bayesian Inference
8. Bayesian Inference
M. Tsikerdekis
9. Bayesian Testing of Constrained Hypotheses
J. Mulder
Section 4: Advanced modeling in HCI
10. Latent Variable Models
A. Beaujean & G. Morgan
11. Using Generalized Linear (Mixed) Models in HCI
M.C. Kaptein
12. Mixture models: Latent profile and latent class analysis
D. Oberski
Section 5: Improving statistical practice in HCI
13. Fair Statistical Communication in HCI
P. Dragicevic
14. Improving statistical practice in HCI
J. Robertson & M.C. Kaptein
Online supplementary materials