by Melissa Dawe on October 5, 2009
As a user researcher with a primarily qualitative background, I have to confess that when I was asked to conduct a usability benchmark study on AutoCAD, I was not exactly jumping out of my chair. Frankly, I was wary of the quantitative emphasis of the method and the proposal to reduce the whole user experience down to a single number. I was also more than slightly nervous about designing a benchmark study for a product as complex as AutoCAD.
Despite my trepidation, I ultimately developed respect and appreciation for usability benchmarking. I would even call myself a convert. I observed the power in a simple usability scorecard (that even execs can understand!) to raise awareness of usability in an organization. I also found that the data collected in a benchmark study can be more than just numbers, and qualitative observations can bring focus to long-existing usability issues that might otherwise get lost.
What is a Usability Benchmark?
Whereas a typical usability study focuses on specific features or aspects of an application, a usability benchmark measures the overall usability of the app. It provides a single score that can then be used to compare usability of the entire application release after release.
To design the AutoCAD usability benchmark we relied heavily on Sauro and Kindlund's excellent work in quantifying usability with the SUM metric (see www.measuringusability.com). The metric combines three aspects of usability: effectiveness, efficiency and satisfaction. These measures are standardized into z-scores and then combined into a single score.
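To make the mechanics concrete, here is a minimal Python sketch of a SUM-style roll-up for a single task. The task data, targets, and the simple averaging of the components are hypothetical illustrations of the idea, not the exact formulas Sauro and Kindlund specify.

```python
# Minimal sketch of a SUM-style roll-up for one benchmark task.
# All targets and observations below are hypothetical illustrations.
from math import erf, sqrt
from statistics import mean, stdev

def z_to_pct(z):
    """Standard normal CDF: the percentage equivalent of a z-score
    (what Excel's NORMSDIST returns)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def z_vs_target(values, target, higher_is_better=True):
    """z-score of the sample mean relative to a target value."""
    m, s = mean(values), stdev(values)
    return (m - target) / s if higher_is_better else (target - m) / s

# Hypothetical observations for one task
task_times   = [310, 275, 420, 360, 295]   # seconds; target: 400 s or less
satisfaction = [5.2, 4.8, 6.1, 5.5, 4.9]   # 1-7 scale; target: 5.0 or more
completion   = 4 / 5                       # fraction of users who succeeded

eff_pct = z_to_pct(z_vs_target(task_times, 400, higher_is_better=False))
sat_pct = z_to_pct(z_vs_target(satisfaction, 5.0))

# Combine effectiveness, efficiency, and satisfaction into one score
# (a simple average of percentages -- one of several possible roll-ups).
sum_score = mean([completion, eff_pct, sat_pct])
print(f"Task score: {sum_score:.0%}")
```

Task-level scores like this can then be averaged across the full task set to produce the single benchmark number.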
Sounds simple enough, right? Nothing is simple when you’re talking about an application as complex as AutoCAD. We encountered three major hurdles:
- Deciding which types of users to run through the benchmark
- Coming up with a good set of benchmark tasks
- Communicating results using the SUM metric
Hurdle One: Determining the Participant Profile
Selecting target users for a benchmark study is important because it greatly affects the tasks you choose and how you measure success (e.g. how you determine your task target times). Are you focusing on users who have never used the product (i.e. measuring first experience)? Are you targeting expert users in order to focus on changes in efficiency?
In addition to skill level, your users may also vary in other important ways. Like other widely used applications with large feature sets, such as Microsoft Word or Excel, AutoCAD has a diverse range of users. It has been used for 25 years by millions of people to design very different things, from skyscrapers to boats to skateboards.
I assembled an advisory team, and we decided on two target user profiles: experienced AEC (architecture, engineering, and construction) AutoCAD users, and students or novice users who'd received some training in AutoCAD. We ran a separate “First Experience” study for new users without training.
Hurdle Two: Developing a Set of Benchmark Tasks
Deciding on the right set of tasks to benchmark was a daunting challenge. AutoCAD is a complex, professional application with 25 years under its belt - it has thousands of commands and tools and semester-length college courses dedicated to learning it. How can you benchmark the entire product in a reasonable-length user session? I toyed with the idea of serving free unlimited coffee and challenging users to stay awake for 24 hours, but then remembered the ethics section of my behavioral psychology class and went with a more humane approach.
Ultimately we developed a dozen high-level tasks that covered core areas of the product. The tasks were piloted and refined so that they could be reasonably completed in a 3-hour session. The tasks did not prescribe which features or tools to use since the tasks needed to remain relevant in future releases and accommodate UI changes and new features.
Once we'd perfected our tasks, we weren't quite done - we needed to determine success criteria for each task. Should we consider a typo a task failure? At the end of the day you are going to say "n people succeeded in this task," so it is crucial you have team buy-in on how you define success. For our tasks, success was output-oriented -- users could take any path through the interface, as long as they sufficiently reproduced the goal drawing.
Hurdle Three: Communicating results using the SUM Metric
The biggest challenge with using SUM came when we were getting ready to present our data. The SUM score is expressed as a z-score, which in practice falls roughly between -4 and +4. We had a table of scores displaying 0.4, 0.8, 1.2, and so on. We realized that if we announced to our management, "AutoCAD plotting received a usability score of 0.4!" we were sure to get a room full of blank stares.
To make the scores easier to understand, we converted the z-scores back to their percentage equivalents (MS Excel's NORMSDIST function will do this for you), and then we had a new problem. When the mean equals the target (i.e., the target is met), the z-score is 0. A z-score of 0 translates to 50%, which is not exactly a compelling score to indicate we met our usability goal. To address this, we redefined our target numbers as the minimum acceptable level of usability, which gives the 50% value a more intuitive meaning. With that translation, what we considered good usability fell in the 80% to 90% range, far more meaningful for our broad audience.
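As a rough illustration of that re-centering (with made-up numbers, not our actual study data), here is the same z-to-percentage conversion in Python, scored first against the original goal and then against a minimum-acceptable target:

```python
# Sketch of the target re-centering described above (made-up numbers).
from math import erf, sqrt

def z_to_pct(z):
    # Standard normal CDF, same conversion as Excel's NORMSDIST
    return 0.5 * (1 + erf(z / sqrt(2)))

observed_mean, sd = 400, 60   # hypothetical mean task time and std. dev. (seconds)

# Scored against the original goal time of 400 s, meeting the goal gives z = 0 -> 50%.
goal = 400
print(z_to_pct((goal - observed_mean) / sd))                # 0.50

# Scored against a "minimum acceptable" time of 480 s, the same performance
# lands in a range that reads as good usability.
minimum_acceptable = 480
print(z_to_pct((minimum_acceptable - observed_mean) / sd))  # ~0.91
```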
Benefits of Usability Benchmarking, or "Why I'm a convert"
Despite my early wariness, I believe that developing a usability benchmark is absolutely worth the effort. For one, it has the potential to raise awareness and impact of usability in an organization – by providing a single score, usability can be tracked by high level management along with performance, sales, and other product metrics.
But the benchmark study had a hidden benefit that is perhaps even more important than the single score. For three straight hours (resulting in a total of over 50 hours of video), we had the opportunity to observe users working in AutoCAD in an uninterrupted, fairly natural way. The high-level tasks gave users a lot of room to choose how to work, giving us insight into the paths users take and the issues they experience. We were able to uncover patterns of stumbling blocks that revealed small bugs or design problems that had not been reported or whose importance wasn’t previously understood. We compiled a "Top 10 List" of usability issues encountered by each user profile (experts and novices), and these have been added to the roadmap for future releases.
I can't say that I was disappointed when the benchmark study was over and I could go back to doing site visits and qualitative research. But I've become convinced that usability benchmarking is an important method in a user research program, as it measures and communicates the usability of a product in a way the entire organization can understand. It's not as sexy as video clips and user quotes, but it gives you some lovely bar charts and sometimes bar charts are just what your audience needs to invest more in usability. And after all, isn’t communicating in your audience’s language what usability is all about?
Hey Melissa,
Nice report, stuff I can use. Speaking of quantitative tools, have you looked at CogTool? I did a tutorial in San Diego with Bonnie John and think it might be useful for analyzing very repetitive parts of a GUI. I liked how you managed to scrape out some qualitative data from the videos. Regards from your old cubicle mate, Stefan in Spain
Posted by: Stefan Carmien | October 07, 2009 at 12:17 AM
Hi Stefan,
Thanks for the feedback! Yes, gleaning the qualitative data from the videos was one of the most rewarding parts of the study. We rarely get video with that many users performing the same tasks. We were able to identify some nice patterns. I haven't played with CogTool yet, but I'll check it out; it looks useful.
Cheers,
Melissa
Posted by: Melissa Dawe | October 07, 2009 at 10:19 AM