Sunday, September 20, 2009

Sep 21 - Nielsen, Quantitative Studies: How Many Users to Test? (Alertbox)


Quantitative Studies: How Many Users to Test?


Summary: When collecting usability metrics, testing 20 users typically offers a reasonably tight confidence interval.


Introduction
We can define usability in terms of quality metrics, such as learning time, efficiency of use, memorability, user errors, and subjective satisfaction. Sadly, few projects collect such metrics because doing so is expensive: it requires four times as many users as simple user testing.
Many users are required because of the substantial individual differences in user performance. When you measure people, you'll always get some who are really fast and some who are really slow. Given this, you need to average these measures across a fairly large number of observations to smooth over the variability.

Standard Deviation for Web Usability Data

We know from previous analysis that user performance on websites follows a normal distribution. This is fortunate, because normal distributions are fairly easy to deal with statistically. By knowing just two numbers -- the mean and the standard deviation -- you can draw the bell curve that represents your data.
I analyzed 1,520 measures of user time-on-task performance for 70 different tasks from a broad spectrum of websites and intranets. Across these many studies, the standard deviation was 52% of the mean values. For example, if it took an average of 10 minutes to complete a certain task, then the standard deviation for that metric would be 5.2 minutes.
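The SD-as-percent-of-mean figure is easy to compute for your own study. A minimal sketch, using a small set of hypothetical task times (illustrative numbers only, not the article's data):

```python
import statistics

# Hypothetical time-on-task measurements (seconds) for one task,
# including a long slow tail as usability data typically has.
times = [140, 180, 200, 220, 250, 280, 320, 400, 520, 700]

mean = statistics.mean(times)
sd = statistics.stdev(times)  # sample standard deviation

# Normalize the SD by the mean -- the figure the article reports
# (52% across the 1,520 measures it analyzed).
sd_pct_of_mean = 100 * sd / mean
print(f"mean = {mean:.0f} s, SD = {sd:.0f} s ({sd_pct_of_mean:.0f}% of mean)")
```

Expressing the SD relative to the mean is what lets one number (52%) summarize tasks whose absolute times range from seconds to many minutes.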

Removing Outliers

To compute the standard deviation, I first removed the outliers representing excessively slow users. Is this reasonable to do? In some ways, no: slow users are real, and you should consider them when assessing a design's quality. Thus, even though I recommend removing outliers from the statistical analyses, you shouldn't forget about them. Do a qualitative analysis of outliers' test sessions and find out what "bad luck" (i.e., bad design) conspired to drag down their performance.
For most statistical analyses, however, you should eliminate the outliers. Because they occur randomly, you might have more outliers in one study than in another, and these few extreme values can seriously skew your averages and other conclusions.
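The article doesn't spell out the exact cutoff it used for "excessively slow." One common convention -- an assumption here, not necessarily Nielsen's rule -- is the 1.5 x IQR criterion, trimming only the slow tail since that's the side where bad luck inflates the average:

```python
import statistics

def remove_slow_outliers(times):
    """Drop excessively slow measurements using the 1.5 * IQR rule.

    This cutoff is a common convention, not necessarily the one the
    article used. Only the slow tail is trimmed: the concern is
    sessions whose "bad luck" drags the average time up.
    """
    q1, _, q3 = statistics.quantiles(times, n=4)  # quartiles
    cutoff = q3 + 1.5 * (q3 - q1)
    return [t for t in times if t <= cutoff]

# Hypothetical task times (seconds) with one extreme slow session.
times = [180, 210, 240, 260, 300, 330, 360, 1900]
print(remove_slow_outliers(times))  # the 1900 s session is dropped
```

Whatever rule you choose, apply it identically across the studies you intend to compare, or the comparison itself is skewed.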
The only reason to compute statistics is to compare them with other statistics.

Estimating Margin of Error

The following chart shows the margin of error for testing various numbers of users, assuming that you want a 90% confidence interval (blue curve). This means that 90% of the time the measured value falls within the interval, 5% of the time it falls below it, and 5% of the time it falls above it. For practical Web projects, you really don't need a more accurate interval than this.

Determining the Number of Users to Test
In the chart, the margin of error is expressed as a percent of the mean value of your usability metric.
For example, if you test 10 users, the margin of error is +/- 27% of the mean. This means that if the mean task time is 300 seconds (five minutes), then your margin of error is +/- 81 seconds. Your confidence interval thus goes from 219 seconds to 381 seconds: 90% of the time you're inside this interval; 5% of the time you're below 219, and 5% of the time you're above 381.
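The chart's numbers can be reproduced approximately from the standard formula for a confidence interval on a mean: margin of error = z * SD / sqrt(n), with z of roughly 1.645 for a two-sided 90% interval and SD at 52% of the mean (the article's figure). A sketch, using the worked example above:

```python
import math

Z_90 = 1.645       # two-sided 90% confidence, normal approximation
SD_RATIO = 0.52    # the article's finding: SD is 52% of the mean

def margin_of_error(n_users):
    """Margin of error as a fraction of the mean, for n measured users."""
    return Z_90 * SD_RATIO / math.sqrt(n_users)

mean_time = 300  # seconds -- the article's worked example
moe = margin_of_error(10)
low, high = mean_time * (1 - moe), mean_time * (1 + moe)
print(f"n=10: +/- {moe:.0%} -> {low:.0f} to {high:.0f} s")
# n=10 gives about +/- 27%, i.e. roughly 219 to 381 seconds,
# matching the article's example.
```

The formula also explains why tightening the interval is so costly: halving the margin of error requires quadrupling the number of users.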

This is a rather wide confidence interval, which is why I usually recommend testing with 20 users when collecting quantitative usability metrics. With 20 users, you'll probably have one outlier (since 6% of users are outliers), so you'll include data from 19 users in your average. This makes your confidence interval go from 243 to 357 seconds, since the margin of error is +/- 19% for testing 19 users.

You might say that this is still a wide confidence interval, but the truth is that it's extremely expensive to tighten it up further. To get a margin of error of +/- 10%, you need data from 71 users, so you'd have to test 76 to account for the five likely outliers.
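Inverting the same approximation gives the sample size needed for a target margin of error. Under the normal approximation it lands in the low 70s for a +/- 10% target, in line with the article's figure of 71 (the small gap comes from rounding and from whether a z- or t-value is used):

```python
import math

Z_90 = 1.645       # two-sided 90% confidence, normal approximation
SD_RATIO = 0.52    # SD as a fraction of the mean, per the article

def users_needed(target_margin):
    """Smallest n whose 90% margin of error is within the target
    (expressed as a fraction of the mean)."""
    return math.ceil((Z_90 * SD_RATIO / target_margin) ** 2)

print(users_needed(0.10))  # 74 under this approximation; the article reports 71
```

Either way, the quadratic growth in n is the point: each further tightening of the interval gets disproportionately more expensive.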
Testing 76 users is a complete waste of money for almost all practical development projects. You can get good-enough data on four different designs by testing each of them with 20 users, rather than blow your budget on only slightly better metrics for a single design.

In practice, a confidence interval of +/- 19% is ample for most goals. Mainly, you're going to compare two designs to see which one performs best. And the average difference between websites is 68% -- much more than the margin of error.
Quantitative vs. Qualitative
Based on the above analysis, my recommendation is to test 20 users in quantitative studies. This is very expensive, because test users are hard to come by and require systematic recruiting to actually represent your target audience.

Luckily, you don't have to measure usability to improve it. Usually, it's enough to test with a handful of users and revise the design in the direction indicated by a qualitative analysis of their behavior.
When you see several people being stumped by the same design element, you don't really need to know how much the users are being delayed. If it's hurting users, change it or get rid of it.

You can usually run a qualitative study with 5 users, so quantitative studies are about 4 times as expensive.
Because they're expensive and difficult to get right, I usually warn against quantitative studies. The first several usability studies you perform should be qualitative. Only after your organization has matured in integrating usability into the design lifecycle, and you're routinely performing usability studies, should you start including a few quantitative studies in the mix.

Source:
Jakob Nielsen's Alertbox, June 26, 2006:
Quantitative Studies: How Many Users to Test?
http://www.useit.com/alertbox/quantitative_testing.html
