by Erin Bradner
As a user researcher in a company that practices user-centered design, I rarely find myself justifying user research. Autodesk values design research because we consistently derive actionable design direction from it. Specifically, in the division I work in at Autodesk, user research was long ago justified and ROIed. To use Eric Schaffer’s term, user research has been institutionalized. What remains somewhat of a mystery among non-researchers at Autodesk is how we do what we do with usability results: how do we know we’ve found a legitimate problem with such a small sample of users?
I got that ‘how do you know it’s not noise’ question recently. We had just completed a usability study of the AutoCAD software installer. We had observed and interviewed 11 customers on the same tasks: finding the download on Autodesk.com, downloading, installing and launching the software. We found that the download button was misleading (it didn’t look clickable enough) and that some of the error messaging in the installer was cryptic (“ADR Not Empty.” Huh?).
Within the week, a colleague reported that she had already begun responding to the usability issues. Jumping on the highest-severity issue first, she and her team had redesigned the download button and had A/B tested the new design against the old one. Her testing showed that the redesign produced a 5% improvement in click-through rates, which in her words was "considerable" in light of the rigorous testing the download page had previously undergone. While our small-sample research showed that only three of our 11 users had experienced a problem with the download button, her statistically significant results from the A/B test immediately showed that the usability fix yielded a considerable increase in click-throughs.
Shortly after the results from the A/B testing were announced, a colleague from the web analytics team stopped me in the hall and asked: “How did you do it? How did you detect what amounted to a statistically significant design flaw with just 11 users?”
I gave my colleague the answer I’d given many times before: with small-sample usability research we report patterns not p-values. We use our experience and expertise to separate high-severity problems from low. My colleague seemed satisfied with my response. But I went back to my desk dissatisfied. It struck me that my standard response – it’s patterns not p-values – eschews the small-sample statistics that is the secret sauce behind usability.
I decided it was time to revise my standard response to give credit to small-sample statistics. Two minutes on Google turns up no shortage of discussion on the topic of small-sample usability research ranging from the academic to the polemical to the pragmatic. To summarize:
The time-tested minimum sample for a usability test is five users. Testing the same tasks with five users, we’re not likely to see most of the problems in an interface; we’re only likely to see most of the problems that affect 30% or more of users. With five users, we’ll detect approximately 85% of the problems in an interface, given that the probability that any one user encounters a given problem is about 30%. These numbers are based on the binomial distribution.
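As a rough sketch of the arithmetic behind these numbers: if a problem affects a fraction p of users, the chance that at least one of n test users runs into it is 1 - (1 - p)^n. The function name below is illustrative and not taken from any usability toolkit.

```python
def detection_probability(p: float, n: int) -> float:
    """Chance that at least one of n users hits a problem that affects a fraction p of users."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Probability that all n users miss the problem is (1 - p)^n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Compare a frequent problem (30%) with a rare one (5%) at small sample sizes.
    for n in (1, 3, 5, 11, 20):
        frequent = detection_probability(0.30, n)
        rare = detection_probability(0.05, n)
        print(f"n={n:2d}  p=30%: {frequent:.0%}   p=5%: {rare:.0%}")
```

With p at exactly 0.30 and five users this comes out to about 83%; the commonly quoted 85% corresponds to a p closer to 0.31, which gives about 84%.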
In practice, in my department at Autodesk, this means that when we’re testing with small samples:
- we pick up high-frequency (30% or greater) issues, not low-frequency issues.
- when we know how likely we want to be to pick up a specific problem, we calculate the number of users we need to test with.
- we put confidence intervals on our data when reasonable.
- we triangulate our findings with statistics from other data sources when available, such as feature usage data, web analytics and support calls.
- we strive to test between 5 and 20 users, iteratively, for each study.
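The sample-size calculation mentioned in the list can be sketched by solving 1 - (1 - p)^n >= target for n. This is a minimal illustration under the same binomial assumption; the function and parameter names are hypothetical.

```python
import math


def users_needed(p: float, target: float) -> int:
    """Smallest n such that at least one of n users hits a problem
    of frequency p with probability >= target."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - target for n and round up to a whole user.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.30, 0.10):
        for target in (0.80, 0.90):
            print(f"problem frequency {p:.0%}, target {target:.0%}: "
                  f"{users_needed(p, target)} users")
```

Under these assumptions, an 80% chance of seeing a 30%-frequency problem takes 5 users, while a 90% chance of seeing a 10%-frequency problem takes 22, which is why low-frequency issues fall outside what small samples reliably catch.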
In the case of the download button not looking like a button, we observed 3 of 11 users struggle. From this, we estimated the impact on all users. Calculating the probability and confidence interval, we were 95% confident that between 9% and 57%* of all users would fail to notice that the button is clickable.
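For readers who want to reproduce an interval like this one, here is a minimal sketch. The post credits Jeff Sauro's calculators but does not name the interval method; the adjusted Wald (Agresti-Coull) interval is assumed here because it reproduces the quoted range for 3 of 11. The function name is illustrative.

```python
import math
from statistics import NormalDist


def adjusted_wald_interval(failures: int, n: int, confidence: float = 0.95):
    """Adjusted Wald (Agresti-Coull) confidence interval for a proportion.

    Assumption: this is one common small-sample method, not necessarily
    the exact one used in the original study.
    """
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need 0 <= failures <= n and n > 0")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Add z^2/2 pseudo-failures and z^2 pseudo-trials before estimating.
    n_adj = n + z ** 2
    p_adj = (failures + z ** 2 / 2.0) / n_adj
    margin = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


if __name__ == "__main__":
    low, high = adjusted_wald_interval(3, 11)
    print(f"95% interval: {low:.0%} to {high:.0%}")
```

The width of the interval is the honest part of a small-sample result: 11 users cannot pin the failure rate down precisely, but the lower bound alone suggests the problem is not a one-off.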
Since writing this post, I have not yet been asked in the hallway ‘how do you do it.’ I can’t say yet if returning to my roots and quoting small-sample statistics will be a more satisfying response than ‘patterns not p-values’. Feel free to weigh in here!
*We used calculators and insights developed by Jeff Sauro at www.measuringusability.com.
Great article Erin. I just stumbled on it from your Facebook mention.
Posted by: Marc Goldman | March 13, 2011 at 11:09 PM
The statistical significance of usability studies is a hot topic, perhaps more so in engineering-driven cultures, which are inherently quantitative.
I appreciate your article and mathematical acumen, but to defend small-group usability studies with statistical arguments misses the point and exacerbates number fetishism. The real point, in my view at least, is that small-scale qualitative usability studies demonstrate the existence of problems, whereas larger-scale quantitative studies demonstrate significance. (Or as Nielsen puts it, the former produce qualitative insights and the latter quantitative statistics. Furthermore, quantitative studies require more participants, more careful experimental design, more time, and more money.)
So next time someone disputes the statistical significance of usability study results, ask them the following: "We're a quality-driven company. How many people need to trip over the product that we just released before we fix it?" One person is usually enough : )
Posted by: Aneesh Karve | September 20, 2010 at 02:01 PM