On the fourth day of Christmas, my true love sent to me: Clear stopping rules.
Simmons et al.'s paper about researcher degrees of freedom (such as lax stopping rules for data collection) is a disturbing read. In one of our own chapters, Maurits and I describe this study before going on to recommend some guidelines for authors and reviewers in HCI to avoid such problems. Here is an excerpt:
"Simmons et al. recently introduced the concept of “researcher degrees of freedom” to describe the series of decisions which researchers must make when designing, running, analysing and reporting experiments. The result of ambiguity when making such decisions is that researchers often (without intention of deceit) perform multiple alternative statistical analyses and then choose to report the form of analysis which found statistically significant results. The authors show through a cleverly constructed simulation study that “It is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis” (Simmons, Nelson & Simonsohn, 2011, p1). The authors illustrate this by reporting evidence in support of the hypothesis that listening to the song “When I’m 64” reduces chronological age. The simulation study investigated how the significance of the results changed when different forms of analyses were conducted on a simulated data set randomly drawn from the normal distribution, repeated 15000 times. The different forms of analysis manipulated common researcher degrees of freedom: selecting dependant variables, setting sample size, the use of covariates and reporting only selected subsets of experimental conditions. If the researcher is flexible about when she stops data collection – a common practice although it is not at all advocated in the NHST literature – and gathers more data after doing a significance test, the false positive rate increases by 50%. By using covariates, such as the ubiquitous gender factor, the researcher can introduce a false positive rate of 11%, and by reporting only a subset of the data, a false positive rate of 12.6% is produced. Finally, if our researcher was too flexible in all four of these areas, the false positive rate would be 61%.
Simmons et al. (2011) recommend a set of ten simple practices which reduce the problem of false positives. First, they recommend that authors decide in advance on their criteria for when to stop data collection. Given the prevalence of underpowered tests, it would be sensible to use stopping rules based on power analysis. The second guideline, providing at least 20 observations per cell, is directly related to power and reduces the risk of Type II errors. According to guidelines 3 and 4, authors would be required to report all the variables collected and all experimental conditions investigated, even if they are not included in the final analysis in the paper. This would enable readers to judge how selectively authors report their results. Guideline 5 recommends that if observations (such as outliers) are removed, authors should report the results both with and without their removal, to enable the reader to determine what effect it has. To reduce the occurrence of false positives through the introduction of covariates, guideline 6 recommends that results should be reported with and without covariates. As changes in authorial practices require monitoring by reviewers, Simmons et al. (2011) propose four related guidelines for reviewers, starting with ensuring that authors follow the first six recommendations. Following these guidelines is likely to lead to publications with fewer significant and less perfect-seeming results; reviewers are therefore encouraged to be more tolerant of imperfections in results (but less tolerant of imperfections in methodology!). Reviewers are also asked to check that authors have demonstrated that their results do not depend on arbitrary decisions made while conducting the analysis, and that the analytic decisions are consistent across studies reported in the same paper. Lastly, Simmons et al. (2011) invite reviewers to make themselves unpopular by requiring that authors conduct replications of studies where data collection or analysis methods are not compelling."
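To make the stopping-rules point concrete, here is a minimal R sketch (my own illustration, not the code from Simmons et al. or from our chapter) of the optional-stopping scenario. Both groups are drawn from the same normal distribution, so any “significant” difference is a false positive; the starting sample of 20 per group and the 10 extra observations per group are assumptions chosen to mirror the setup described in the excerpt.

```r
set.seed(1)

# One simulated experiment: test after n_start observations per group; if the
# result is not significant, collect n_extra more per group and test again.
one_run <- function(n_start = 20, n_extra = 10, alpha = 0.05) {
  a <- rnorm(n_start)
  b <- rnorm(n_start)
  if (t.test(a, b)$p.value < alpha) return(TRUE)  # false positive at first look
  a <- c(a, rnorm(n_extra))
  b <- c(b, rnorm(n_extra))
  t.test(a, b)$p.value < alpha                    # false positive at second look
}

false_positives <- replicate(15000, one_run())
mean(false_positives)  # roughly 0.07 to 0.08, well above the nominal 0.05
```

Just two looks at the data push the false positive rate from the nominal 5% to roughly 7% to 8%, which is the 50% inflation mentioned in the excerpt; each further peek inflates it more.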
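And since the first guideline is to decide the stopping criterion in advance, one sensible way to do so is a power analysis, for example with base R's power.t.test. The medium effect size (Cohen's d = 0.5) and the 80% power target below are illustrative assumptions, not values from the chapter; you would plug in the smallest effect worth detecting in your own study.

```r
# How many participants per group are needed to detect a medium effect
# (d = 0.5) in a two-sample t-test with 80% power at alpha = .05?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# n comes out at about 64 per group; declared before data collection starts,
# that is a clear stopping rule.
```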
Modern Statistical Methods for HCI (edited by Judy Robertson and Maurits Kaptein) will be published by Springer in early 2016. Here is the table of contents:
Modern Statistical Methods for HCI
1. An introduction to Modern Statistical Methods for HCI
J. Robertson & M.C. Kaptein
Section 1: Getting Started With Data Analysis
2. Getting started with [R]: a brief introduction
3. Descriptive Statistics, Graphs, and Visualization
J. Young & J. Wessnitzer
4. Handling missing data
T. Baguley & M. Andrews
Section 2: Classical Null Hypothesis Significance Testing done properly
5. Effect sizes and Power in HCI
6. Using R for repeated and time-series observations
D. Fry & K. Wazny
7. Non-parametric Statistics in Human-Computer Interaction
J.O. Wobbrock & M. Kay
Section 3: Bayesian Inference
8. Bayesian Inference
9. Bayesian Testing of Constrained Hypotheses
Section 4: Advanced modeling in HCI
10. Latent Variable Models
A. Beaujean & G. Morgan
11. Using Generalized Linear (Mixed) Models in HCI
12. Mixture models: Latent profile and latent class analysis
Section 5: Improving statistical practice in HCI
13. Fair Statistical Communication in HCI
14. Improving statistical practice in HCI
J. Robertson & M.C. Kaptein
Online supplementary materials