Friday, September 11, 2009

Sep 12 - Lewis, Sample Sizes for Usability Tests: Mostly Math, Not Magic

Sample Sizes for Usability Tests: Mostly Math, Not Magic.
James R. Lewis IBM Corp. jimlewis@us.ibm.com
USER EXPERIENCE, VOL. 4, ISSUE 4, 2005


Perhaps the most important factor is the economics of usability testing. For many practitioners, usability tests are fairly expensive events, with much of the expense in the variable cost of the number of participants observed (which includes cost of participants, cost of observers, cost of lab, and limited time to obtain data to provide to developers in a timely fashion).

Usability testing includes three key components: representative participants, representative tasks, and representative environments, with participants’ activities monitored by one or more observers [2].
They can be formal or informal, think-aloud or not, use low-fidelity prototypes or working systems.
They can have a primary focus on task-level measurements (summative testing) or problem discovery (formative testing).

The IBM practice at that time, based on papers published by Alphonse Chapanis and colleagues [1, 5], was to observe about five to six participants per iteration for problem discovery. Chapanis had asserted that after you’d observed six participants, you would have seen about all of the problems you were going to see.
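The "five to six participants" rule can be motivated by the binomial problem-discovery model Lewis develops elsewhere [11]: if each participant independently encounters a given problem with probability p, then n participants are expected to find a proportion 1 - (1 - p)^n of the problems. A quick sketch — the p = 0.31 value is the often-cited average from the early problem-discovery literature, not a figure from this article:

```python
def discovered(p, n):
    """Expected proportion of problems found by n participants, assuming
    each problem is seen by any one participant with probability p
    (independent, homogeneous problems)."""
    return 1 - (1 - p) ** n

# With the often-cited average p ~= 0.31, five participants already
# uncover roughly 85% of the problems in scope.
for n in range(1, 7):
    print(n, round(discovered(0.31, n), 3))
```

Under these assumptions the curve flattens quickly, which is where claims like "six participants show you about all the problems you will see" come from — but only for this p and this definition of "about all."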

THE GOAL: PROBLEM DISCOVERY
You can’t really talk about discovering 90 percent of all possible usability problems across all possible users, tasks, and environments.
You can establish a problem discovery goal given a sampled population of users, a defined set of tasks, and a defined set of environments.
Change the population of users, tasks, or environments, and all bets are off. But this is better than nothing. If your problem discovery rate is starting to go down, then change one or all of these elements of usability.
Test from a different population of users, using different tasks, in different environments. You’ll discover different problems.

My comments: In other words, the results of usability testing are not highly repeatable or reproducible.

In 2001, Spool and Schroeder published the results of a large-scale usability evaluation in which they concluded that five users were “nowhere near enough” to find all (or even 85 percent) of the usability problems in the Web sites they were studying.
Perfetti and Landesman [17], discussing related research, stated:
When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious problems for the first time with some of our later users. What was even more surprising to us was that repeat usability problems did not increase as testing progressed. These findings clearly undermine the belief that five users will be enough to catch nearly 85 percent of the usability problems on a Web site. In our tests, we found only 35 percent of all usability problems after the first five users. We estimated over 600 total problems on this particular online music site. Based on this estimate, it would have taken us 90 tests to discover them all!
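These figures are roughly what the binomial discovery model would predict for a site with a low average problem-discovery probability. A back-of-envelope check (mine, not from the article), working backwards from the reported 35 percent after five users:

```python
# Infer the average per-participant discovery probability p from
# Perfetti and Landesman's figure: 35% of problems found after 5 users.
# Model: found(n) = 1 - (1 - p)**n, so p = 1 - (1 - 0.35)**(1/5).
p = 1 - (1 - 0.35) ** (1 / 5)
print(round(p, 3))            # roughly 0.08

# With such a low p, their "90 tests to find them all" is plausible:
found_90 = 1 - (1 - p) ** 90
print(round(found_90, 4))     # very close to 1.0
```

This treats p as a single average, which Lewis cautions is a simplification — individual problems have very different discovery probabilities — but it shows the 5-user and 90-user figures are internally consistent rather than paradoxical.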

Discussion
If a practitioner says that five participants are all you need to discover most of the problems that will occur in a usability test, it’s likely that this practitioner typically works in contexts with a fairly high value of p (the average probability that a single participant encounters a given problem) and fairly low problem discovery goals.
If another practitioner says that he’s been running a study for three months, has observed 50 participants, and is continuing to discover new problems every few participants, then it’s likely that he has a somewhat lower value of p, a higher problem discovery goal, and lots of cash (or a low cost audience of participants).
Neither practitioner is necessarily wrong—they’re just working in different usability testing spaces.
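Both practitioners' positions fall out of the same sample-size formula: solving 1 - (1 - p)^n >= goal for n gives n >= ln(1 - goal) / ln(1 - p). A sketch — the specific p and goal values below are illustrative choices of mine, not taken from the article:

```python
import math

def sample_size(p, goal):
    """Smallest n with 1 - (1 - p)**n >= goal, per the binomial
    problem-discovery model."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# High p, modest goal: the "five users" regime.
print(sample_size(0.31, 0.80))   # 5
# Low p, ambitious goal: dozens of sessions are justified.
print(sample_size(0.05, 0.95))   # 59
```

The same formula yields 5 participants or 59 depending only on the assumed p and the discovery goal, which is exactly Lewis's point about different usability testing spaces.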


References that I may want to read further in future:
5. Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared for IBM, available on request from J. R. Lewis.
11. Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier.
12. Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.
13. Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
14. Lewis, J. R. (2006). Usability testing. In G. Salvendy (ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
17. Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 4, 2006 from http://www.uie.com/articles/eight_is_not_enough/
18. Sauro, J. (2006). UI problem discovery sample size. Retrieved July 20, 2006 from the Measuring Usability website: http://www.measuringusability.com/samplesize/problem_discovery.php
19. Spool, J., & Schroeder, W. (2001). Testing web sites: Five users is nowhere near enough. In CHI 2001 Extended Abstracts (pp. 285-286). New York: ACM Press.
