Let me tell you the story of an evaluation we did of FitQuest, which is an exergame designed to encourage children to take more exercise. It is the story of how my understanding of evaluation methodology - of research itself - changed as our design was let loose in the unforgiving environment of ordinary schools.
I realise that I like methodology perhaps more than is natural. It makes me obscurely happy. Hopefully this post might persuade you that methodology matters, even if you hate it with a dark passion. If you’re a Human-Computer Interaction (HCI) researcher, you might object to what follows. If so, let’s start talking. If you’re a public health researcher, you can feel free to shake your head smugly.
(Note: When I am talking about HCI researchers here, I am referring to those who design or evaluate technology for social good, such as education or health.)
Part 1: Why I inflicted a randomised controlled trial on myself
I'll start at the beginning. When I did my PhD I thought that you could bring cool new technology to classrooms to make things better. I wasn't particularly guilty of considering technology as a solution looking for an educational problem, but let's just say that the educational problems I picked were perhaps not of the most pressing concern. I have certainly reviewed papers over the years which seem to consider kids as a prime excuse to try out technological innovations. Conferences like Interaction Design and Children (IDC) have become quite wary of such work, and papers published there now often have a sound basis in theory. I think this has been a welcome change over the years since IDC first started.
When Andrew Macvean and I started working on FitQuest, we picked a genuinely hard real world problem: trying to increase children's physical activity. The stakes here are high - physical inactivity is a global pandemic. Being physically inactive is as unhealthy as smoking, and yet a worryingly low proportion of kids exercise for the recommended time per week. We wanted to design a game which would improve the situation. But I wanted to establish whether the game did improve things or not. I was frustrated with typical HCI projects where applying user centred design approaches early in the process buys you a "get out of jail free" card on doing a decent summative evaluation.
I wanted to know how to answer the questions "does this behaviour change technology work? Does FitQuest encourage kids to be more active?" We realised quite quickly that doing a pilot study for a single session wouldn't be enough to establish this. Getting fitter is something which happens over time. We didn't particularly care whether kids liked it the first time they played it. We were interested to see whether there could be a motivational push from playing a game which would keep the kids interested in exercising time and again. So we needed a longitudinal study. Andy did pilot studies over seven weeks in two schools and presented case studies which revealed interesting patterns of behaviour for children with different self-efficacy profiles. The pilot revealed a novelty effect in that kids did not enjoy the game so much after the initial few weeks, so we redesigned the game to include goal setting as a behaviour change technique which might promote sustained improvements in physical activity. That is, kids may be motivated to play the game over a sustained period because they wanted to achieve particular goals.
It is worth pausing here to consider the stage of the research and development process we had reached. From an HCI point of view, we had been following a user centred design process which is captured by this diagram from Andy’s thesis:
Figure 1: FitQuest user centred design process
As you can see, we had been through several iterations of technology development and several user consultation studies over the course of about 3 years. How does this relate to the processes used to develop interventions in public health, which is a natural home discipline for physical activity and behaviour change? Figure 2 shows the Medical Research Council’s (MRC) suggested process for developing complex interventions. I put the orange box labelled “Technology development process” hovering uncertainly next to the “development” and “feasibility and pilot testing” phases in the MRC process. It isn’t obvious to me whether the user centred design process only gets the technology to the point where intervention development might start, or whether it would be included as part of development and feasibility/pilot testing. Certainly people in public health devote a good deal of time to these stages even when they are simply working with leaflet interventions, so it might be that the social context and supporting materials (teacher education in using the technology, or user manuals) for a game like FitQuest should undergo the same sort of development as the technology itself. Maybe technology interventions should be called “super-complex” interventions and get their own diagram.
Regardless, the evaluation stage in the MRC process required extending the studies we had conducted so far. I wanted to scale up. If there is one thing I have learned from my unsettling journey into the world of critiquing statistical methods, it is that sample size is enormously important because it determines statistical power, and therefore the chance that you will fail to detect an effect which is really there (a type 2 error). In fact, low power also inflates the proportion of statistically significant findings that turn out to be false positives.
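To make the power point concrete, here is a minimal sketch using the standard normal approximation for a two-sample comparison of means. The effect size and group sizes are illustrative assumptions, not figures from the FitQuest study:

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_arm, alpha=0.05):
    """Approximate power of a two-sample z-test for a standardised
    effect size d with n_per_arm participants per group
    (normal approximation to the two-sample t-test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n_per_arm / 2) - z_crit)

# A small study of ~20 per group has little chance of detecting a
# medium effect (d = 0.5), while ~64 per group reaches the
# conventional 80% power threshold:
print(round(power_two_sample(0.5, 20), 2))   # ≈ 0.35
print(round(power_two_sample(0.5, 64), 2))   # ≈ 0.81
```

With a third of the power gone, a study of 20 per group will miss a genuine medium-sized effect roughly two times out of three, which is why scaling up mattered.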
Part 2: Why randomised controlled trials are not the whole answer
I decided to run a cluster randomised controlled trial (RCT) to evaluate the impact of FitQuest on children’s self-efficacy to exercise and their step count. We recruited 10 schools and randomly allocated them to treatment or control groups. The treatment group used FitQuest for 5 weeks during their physical education (PE) classes. This was the first time I had attempted an RCT and it’s hard! There are a lot of logistics associated with it and a lot of reporting guidelines to follow if you take it seriously. My reason for choosing this design was that we needed lots of kids to give us a good chance of detecting effects if there are any to be found. As FitQuest was designed for schools, we needed to work with entire classes and take account of this in the analysis, otherwise we wouldn’t be able to tell whether other factors such as the teacher or school environment were having an influence rather than the software. Factoring in the structure of schools required a cluster design. And I reckon, by the time you’re working with that many schools and kids, you might as well randomise to make the design more robust. I think for physical activity researchers this study might be described as a “pilot RCT”, which raises a horrifying question about how long it would take to get to a full RCT. We’re already talking about 5 years to go from initial development to completion of the pilot RCT. This might not seem long to public health researchers, but people who publish in HCI are used to very rapid publication about technology innovation.
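The two cluster-specific ingredients can be sketched in a few lines: the design effect, which says how much clustering inflates the sample size you need, and allocation of whole schools (not individual children) to arms. The class size, ICC, and school names here are illustrative assumptions, not the study's actual parameters:

```python
import random

def design_effect(cluster_size, icc):
    """Variance inflation from analysing clustered (school-level) data:
    1 + (m - 1) * ICC, where m is the cluster size and ICC is the
    intra-cluster correlation coefficient."""
    return 1 + (cluster_size - 1) * icc

def randomise_schools(schools, seed=1):
    """Randomly allocate whole schools to treatment or control arms,
    keeping the two arms balanced in size."""
    rng = random.Random(seed)       # fixed seed so the allocation is auditable
    shuffled = list(schools)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# With classes of ~25 pupils and a modest ICC of 0.05, each child
# effectively counts for less than half an independent participant:
print(round(design_effect(25, 0.05), 2))  # 2.2

treatment, control = randomise_schools([f"school_{i}" for i in range(1, 11)])
```

A design effect of 2.2 means an individually randomised sample size must be more than doubled once clustering is taken into account, which is part of why "lots of kids" quickly becomes "lots of schools".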
So we (me, Andrew Macvean and Stuart Gray) ran the study. We did it in two waves, from October to December and then from February to April. After the first wave, we thought we’d look at the initial data to write it up for a paper. If we had stopped at that point, after two intervention schools (around 60 kids), we’d have been happy. It looked like FitQuest was improving children’s self-efficacy (confidence) in physical activity, which is widely considered to be an early indication of longer term behavioural change. This number of users over this time period is considered quite respectable in HCI so we could have stopped there. But if we had got that initial paper published it would have been very misleading, because when the full RCT results were in, it turned out that FitQuest did not increase physical activity, and it didn’t improve self-efficacy. We can’t conclude whether it works or not because in fact the kids didn’t use it for anything like the time we recommended (this is referred to as low treatment fidelity, also known as a Type III error). Due to all kinds of pressures on the school timetable, the schools only used the game for under 50% of the 150 minutes they agreed to. And of course the thing about physical activity is that you have to do it for long enough for health benefits to take place. No matter how cool a game FitQuest might be, players need to spend time playing it or it won’t be effective.
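A fidelity check of this kind is simple to run from log files: compare the dose each school actually delivered against the agreed dose. The per-school minutes below are made-up figures for illustration; only the 150-minute planned dose comes from the study:

```python
PLANNED_MINUTES = 150  # the dose the schools agreed to deliver

def fidelity(minutes_played, planned=PLANNED_MINUTES):
    """Treatment fidelity: the fraction of the agreed dose actually delivered."""
    return minutes_played / planned

# Hypothetical per-school play time recovered from log files:
usage = {"school_a": 70, "school_b": 55, "school_c": 110}

# Flag schools delivering under half the agreed dose:
low = {s: round(fidelity(m), 2) for s, m in usage.items() if fidelity(m) < 0.5}
print(low)  # {'school_a': 0.47, 'school_b': 0.37}
```

When most clusters fall below such a threshold, a null outcome says more about delivery than about whether the intervention itself can work.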
This would have all been wretchedly demoralising, but for the fact that we had observations and interviews, as well as a lot of log file data which we could use to build explanations about user behaviour beyond the quantitative outcome measures. Understanding these patterns of behaviour will help us to redesign the game to have a better chance of improving physical activity. A lot of contextual factors emerged from the qualitative data relating to using a mobile phone game in school (such as whether the teachers found it appropriate for PE lessons, and the school policy on mobile phones in the playground) which need to be addressed, and which I feel only started to emerge when we worked with a wider range of schools.
Part 3: A realist approach to HCI evaluation
When I started analysing the data from the randomised controlled trial, I had some very interesting conversations with Ruth Jepson about realist evaluation. Here the mantra is not simply “what works?” but rather “what works for whom under what circumstances?” In Wells’s fascinating paper, in which she interviews the grant holders and others involved in major randomised controlled trials (Wells, Williams, Treweek, Coyle, & Taylor, 2012), she discovered that a lot of interesting and potentially problematic contextual details about how interventions unfold in different settings are lost in the sanitised way in which they are reported in the literature. Realist evaluation responds to the messiness of the real world by acknowledging that it is highly likely that different groups will have different reactions to interventions, and sets about trying to document this so that mid-range theories (e.g. relating to behavioural change) may be synthesised from findings over a range of studies.
So in fact, in asking whether FitQuest works, I was asking the wrong question. It would be more fruitful to ask “Which aspects of FitQuest work for which users and in which settings?” The answer to that is probably “FitQuest works for kids who set achievable goals, in schools where there is an interested adult to encourage them and there is opportunity to play it.” Which is probably a more useful answer for other researchers who might want to build on the work, or for policy makers who are trying to decide which schools might benefit from investment in new technology of this type. Now I am collaborating with some colleagues who also have exergame datasets to see what we can learn about mid-range theories of behaviour change from looking at how users in different demographic groups respond to exergames intended to enhance physical activity. I think this kind of sharing of data should be done more often in HCI because it would allow the field to build more coherence. At the moment the field seems to me to be characterised by a disparate set of small, relatively short term studies based on innovative technology prototypes. HCI researchers might plead that it is too expensive to conduct larger studies, but even if you accept that (biting your tongue on questions like “well how do other disciplines do it then?”), you might acknowledge that pooling resources with other research teams around the same research question makes sense.
If we took a realist approach to HCI evaluation, we might use a cycle similar to that shown below, adapted from Pawson’s wheel of evaluation.
One of the most important stages in this cycle, for me, is identifying the programme theory, because it requires you to specify carefully why you think the software will be effective: which aspects of it in particular do you think will make the difference (based on the literature or on consulting stakeholders)? I wish I had started doing this years ago because of the clarity of focus it brings.
I found looking into the realist paradigm eye-opening (as someone new to social science) because from my reading I realised that social scientists don’t necessarily expect interventions to work per se! I find this quote particularly sobering: “The expected value of any net impact assessment of any large scale social program is zero” (Rossi, in (Pawson, 2013)). I don’t think I have ever encountered this attitude in HCI researchers, who seem like a wildly optimistic bunch in comparison. We generally assume we can design technology which will make a positive difference. The realist perspective on this would probably be “Sure! You can design technology which will make a difference for some people, some of the time.” So then the goal for HCI evaluation becomes to establish which groups of users the technology will benefit, how much and what socio-technical factors matter in its use.
The last few years on the FitQuest project have been a huge learning curve for me. Working with people in another discipline (with seemingly higher standards of evidence) has helped me enormously and I am very grateful. I hope my understanding of evaluation continues to evolve in the coming years.
Pawson, R. (2013). The Science of Evaluation: A Realist Manifesto (1st ed.). London: SAGE Publications Ltd.
Wells, M., Williams, B., Treweek, S., Coyle, J., & Taylor, J. (2012). Intervention description is not enough: evidence from an in-depth multiple case study on the untold role and impact of context in randomised controlled trials of seven complex interventions. Trials, 13(1), 95. doi:10.1186/1745-6215-13-95