*On the sixth day of Christmas, my true love sent to me:*

*Effect size reporting*

Suppose my love likes to quantify his affection. Like the "Guess How Much I Love You" bunny, he gives numerical estimates of his love, perhaps in comparison to another suitor. He uses real-world measurements to do so (usually units of distance), rather than standardised measures of effect size such as Cohen's d. This is a reasonable decision, as real-world units are easier to interpret (and more poetic). I suppose he might use Cohen's d if he wanted to give me the context of related effect sizes in the literature, or if he were trying to establish the positive feelings of rabbits towards a range of species.

But suppose I were trying to estimate the feelings of my true love for me using a strategy similar to Null Hypothesis Significance Testing (NHST). Then I might adopt an arbitrary threshold and make a binary decision based on it. If p < .05, he loves me; if p > .05, he loves me not. I suspect this might not be the path to happiness.

This is probably not a very helpful way to introduce the concept of effect sizes. For a more cogent discussion of why effect sizes matter, and a guide to computing them in R, I recommend Koji Yatani's chapter instead. Here is a rather clearer explanation than mine, written by Koji in his characteristically friendly style:

"Even if you have a significant result, it is “significant” only in the context of statistics. It does not necessarily mean that the difference you have is meaningful or important in a practical setting. Imagine that you are reviewing a very typical interaction technique paper. It compares two techniques: one is what the authors developed, and the other is a conventional technique. Suppose that the performance time of some tasks was improved with a new interaction technique by 1% with the standard deviation of 0.1%. In this case, their NHST would show a significant difference. So should you automatically accept that paper because it shows significant improvements backed up with well-known statistical testing methods? No – it is probably too early to decide the fate of the paper. 1% improvement may be a difference between 10 and 9.9 sec. Would this 100 msec. be really important? Maybe so for fighter pilots, but we definitely need more contextual information to assess the magnitude of this improvement.

Now you are reviewing another paper. In this paper, the performance time was improved with a new technique by 15%, but the standard deviation was also as large as 15%. The results would not show a significant difference because the standard deviation is too large. So would your conclusion be that the new technique is useless? I would say not. 15% improvement in average (e.g., 10 sec *vs*. 8.5 sec) could have an impact on user experience. There may have been some factors that caused such variances. For example, some participants might have been able to use experience from other interactive systems (e.g., heavy smartphone users might be able to adapt to new touch interaction more quickly than others). Other researchers might come up with an even better technique which can remove that variance while maintaining faster performance. Thus, the developed technique may be important enough to be shared even though the present paper could not find a significant difference. Assuming the paper is of high quality in other aspects, I think that it should be accepted (though I would ask the authors to explain why the standard deviation was so large and how their technique could be improved to make it smaller). Thus, we should not overrate statistically-significant results and underrate non-statistically-significant results. All results should be interpreted in a larger context."

In summary, if you are reporting quantitative results, report either standardised effect sizes or means and standard deviations so that the reader can compute the effect sizes if necessary. And if you are reading a paper, take a look at the descriptive statistics before you get too carried away by a significant result.
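To make that advice concrete, here is a minimal sketch of how a reader could recover a standardised effect size from reported means and standard deviations. It computes Cohen's d with a pooled standard deviation; the group sizes of 20 are my own hypothetical numbers, not part of Koji's example:

```python
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, given each group's
    mean (m), standard deviation (s), and sample size (n)."""
    # Pooled standard deviation, weighting each group's variance
    # by its degrees of freedom
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                     / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Koji's first scenario: 10 sec vs 9.9 sec with SD 0.1 sec,
# assuming (hypothetically) 20 participants per group
d = cohens_d(10.0, 0.1, 20, 9.9, 0.1, 20)
print(d)  # d = 1.0: a "large" standardised effect for a 100 ms difference
```

Note how this illustrates Koji's point from the other direction: a difference of only 100 ms yields a conventionally "large" d when the variance is tiny, which is exactly why the raw units and the practical context still matter.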

Modern Statistical Methods in Human Computer Interaction (edited by Judy Robertson and Maurits Kaptein) will be published by Springer in early 2016. Here is the table of contents:

Modern Statistical Methods for HCI

Preface

1. An introduction to Modern Statistical Methods for HCI.

J. Robertson & M.C. Kaptein

Section 1: Getting Started With Data Analysis.

2. Getting started with [R]: a brief introduction

L. Ippel.

3. Descriptive statistics, Graphs, and Visualization.

J. Young & J. Wessnitzer

4. Handling missing data

T. Baguley & M. Andrews

Section 2: Classical Null Hypothesis Significance Testing done properly

5. Effect sizes and Power in HCI

K. Yatani

6. Using R for repeated and time-series observations

D. Fry & K. Wazny

7. Non-parametric Statistics in Human-Computer Interaction

J.O. Wobbrock & M. Kay

Section 3: Bayesian Inference

8. Bayesian Inference

M. Tsikerdekis

9. Bayesian Testing of Constrained Hypotheses

J. Mulder

Section 4: Advanced modeling in HCI

10. Latent Variable Models

A. Beaujean & G. Morgan

11. Using Generalized Linear (Mixed) Models in HCI

M.C. Kaptein

12. Mixture models: Latent profile and latent class analysis

D. Oberski

Section 5: Improving statistical practice in HCI

13. Fair Statistical Communication in HCI

P. Dragicevic

14. Improving statistical practice in HCI

J. Robertson & M.C. Kaptein

Online supplementary materials
