Friday, September 25, 2009

Sep 25 - Nielsen, Risks of Quantitative Studies (Alertbox)

Summary: Number fetishism leads usability studies astray by focusing on statistical analyses that are often false, biased, misleading, or overly narrow. Better to emphasize insights and qualitative research.

Risks of Quantitative Studies

There are two main types of user research: quantitative (statistics) and qualitative (insights).
The key benefit of quantitative studies is simple: they boil a complex situation down to a single number that's easy to grasp and discuss. I exploit this communicative clarity myself, for example, in reporting that using websites is 206% more difficult for users with disabilities and 122% more difficult for senior citizens than for mainstream users.

Beware Number Fetishism

When I read reports from other people's research, I usually find that their qualitative study results are more credible and trustworthy than their quantitative results. It's a dangerous mistake to believe that statistical research is somehow more scientific or credible than insight-based observational research. In fact, most statistical research is less credible than qualitative studies.

User interfaces and usability are highly contextual, and their effectiveness depends on a broad understanding of human behavior.

Fixating on numbers rather than qualitative insights has driven many usability studies astray. As the following points illustrate, quantitative approaches are inherently risky in a host of ways.

Random Results

Researchers often perform statistical analysis to determine whether numeric results are "statistically significant." By convention, they deem an outcome significant if there is less than 5% probability that it could have occurred randomly rather than signifying a true phenomenon.
This sounds reasonable, but it implies that one out of twenty "significant" results might be random if researchers rely purely on quantitative methods.
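To see what that threshold means in practice, here is a minimal simulation sketch (toy data, not from any real study): both groups are drawn from the same distribution, so every "significant" difference the t-test reports is a false alarm, and such false alarms show up about once in every twenty tests.

```python
# Toy simulation: how often a two-sample t-test reports p < 0.05
# when there is no real difference between the groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
trials = 10_000
false_positives = 0

for _ in range(trials):
    # Both "conditions" come from the same distribution, so any
    # "significant" difference is purely random.
    a = rng.normal(loc=0.0, scale=1.0, size=20)
    b = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / trials:.3f}")  # ~0.05, about 1 in 20
```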

Luckily, most good researchers -- especially those in the user-interface field -- use more than a simple quantitative analysis. Thus, they typically have insights beyond simple statistics when they publish a paper, which drives down, but doesn't eliminate, bogus findings.

There's a reverse phenomenon as well: Sometimes a true finding is statistically insignificant because of the experiment's design. Perhaps the study didn't include enough participants to observe a major -- but rare -- finding in sufficient numbers. It would therefore be wrong to dismiss issues as irrelevant just because they don't show up in quantitative study results.
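A rough back-of-the-envelope sketch shows how easily this happens (the 2% problem rate and the sample sizes below are assumptions chosen purely for illustration): the chance of ever seeing a rare problem grows only slowly with the number of participants.

```python
# Illustrative arithmetic (assumed numbers, not from any study): the chance
# of observing at least one instance of a rare usability problem in a study.
def p_observed_at_least_once(problem_rate: float, participants: int) -> float:
    """Probability that a problem affecting `problem_rate` of users
    shows up for at least one of `participants` test users."""
    return 1 - (1 - problem_rate) ** participants

for n in (5, 20, 100):
    print(f"{n:3d} participants -> {p_observed_at_least_once(0.02, n):.1%}")
# 5 -> ~9.6%, 20 -> ~33.2%, 100 -> ~86.7%
```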

Pulling Correlations Out of a Hat

If you measure enough variables, you will inevitably discover that some seem to correlate. Run all your stats through the software and a few "significant" correlations will surely pop out. (Remember: one out of twenty analyses is "significant" even when there is no underlying true phenomenon.)

Studies that measure seven metrics will generate twenty-one possible correlations between the variables. Thus, on average, such studies will have one bogus correlation that the statistics program deems "significant," even if the issues being measured have no real connection.
In my Web Usability 2004 project, we collected metrics on fifty-three different aspects of user behavior on websites. There are thus 1,378 possible correlations that I could throw into the hopper. Even if we didn't discover anything at all in the study, about sixty-nine correlations would emerge as "statistically significant."
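The arithmetic is easy to verify with a short sketch (the "metrics" below are simulated noise, not data from the Web Usability 2004 project): with fifty-three unrelated variables, roughly sixty-nine pairs still come out "significant".

```python
# Sketch of the pair counts and the expected number of spurious
# "significant" correlations at the 5% level when the metrics are pure noise.
from math import comb

import numpy as np
from scipy.stats import pearsonr

for n_metrics in (7, 53):
    pairs = comb(n_metrics, 2)
    print(f"{n_metrics} metrics -> {pairs} pairs, "
          f"~{pairs * 0.05:.0f} expected false positives")

# Quick check with random data: 53 unrelated "metrics" for 100 "users".
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 53))
spurious = sum(
    pearsonr(data[:, i], data[:, j])[1] < 0.05
    for i in range(53) for j in range(i + 1, 53)
)
print("Spurious 'significant' correlations found:", spurious)  # roughly 69
```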

Overlooking Covariants

Even when a correlation represents a true phenomenon, it can be misleading if the real action concerns a third variable that is related to the two you're studying.

For example, studies show that intelligence declines with birth order. In other words, a person who was a first-born child will on average have a higher IQ than someone who was born second. Third-, fourth-, fifth-born children and so on have progressively lower average IQs. This data seems to present a clear warning to prospective parents: Don't have too many kids, or they'll come out increasingly stupid. Not so.
There's a hidden third variable at play: smarter parents tend to have fewer children. When you want to measure the average IQ of first-born children, you sample the offspring of all parents, regardless of how many kids they have. But when you measure the average IQ of fifth-born children, you're obviously sampling only the offspring of parents who have five or more kids. There will thus be a bigger percentage of low-IQ children in the latter sample, giving us the true -- but misleading -- conclusion that fifth-born children have lower average IQs than first-born children. Any given couple can have as many children as they want, and their younger children are unlikely to be significantly less intelligent than their older ones. When you measure intelligence based on a random sample from the available pool of children, however, you're ignoring the parents, who are the true cause of the observed data.

(Update added 2007: The newest research suggests that there may actually be a tiny advantage in IQ for first-born children after correcting for family size and the parents' economic and educational status. But the point remains that you have to correct for these covariants, and when you do so, the IQ difference is much less than plain averages may lead you to believe.)
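A toy simulation makes the selection effect easy to see (every number below is an assumption chosen for illustration, not real demographic data): child IQ depends only on parental IQ, never on birth order, yet the averages still decline by birth order because only lower-IQ parents contribute children at the higher birth orders.

```python
# Toy simulation of the hidden covariant: smarter parents have fewer kids,
# and birth order itself has no effect on any child's IQ.
import numpy as np

rng = np.random.default_rng(2)
n_families = 50_000

parent_iq = rng.normal(100, 15, n_families)
# Assumption for illustration: higher parental IQ -> fewer children on average.
n_children = np.clip(
    np.round(6 - (parent_iq - 70) / 15 + rng.normal(0, 1, n_families)),
    1, 8,
).astype(int)

by_order = {order: [] for order in range(1, 6)}
for iq, kids in zip(parent_iq, n_children):
    for order in range(1, min(kids, 5) + 1):
        # Child IQ tracks parental IQ plus noise; birth order plays no role.
        by_order[order].append(iq + rng.normal(0, 10))

for order, iqs in by_order.items():
    print(f"birth order {order}: mean child IQ {np.mean(iqs):.1f} (n={len(iqs)})")
# Mean IQ falls steadily from first-born to fifth-born even though the
# simulation never gives birth order any causal effect.
```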

As a Web example, you might observe that longer link texts are positively correlated with user success. This doesn't mean that you should write long links. Website designers are the hidden covariant here: clueless designers tend to use short text links like "more," "click here," and made-up words.

Over-Simplified Analysis

To get good statistics, you must tightly control the experimental conditions -- often so tightly that the findings don't generalize to real problems in the real world.
This is a common problem for university research, where the test subjects tend to be undergraduate students rather than mainstream users. Also, instead of testing real websites with their myriad contextual complexities, many academic studies test scaled-back designs with a small page count and simplified content.

For example, it's easy to run a study that shows breadcrumbs are useless: just give users directed tasks that require them to go in a straight line to the desired destination and stop there. Such users will (rightly) ignore any breadcrumb trail. Breadcrumbs are still recommended for many sites, of course. Not only are they lightweight, and thus unlikely to interfere with direct-movement users, but they're helpful to users who arrive deep within a site via search engines and direct links. Breadcrumbs give these users context and help users who are doing comparisons by offering direct access to higher levels of the information architecture.

Usability-in-the-large is often neglected by narrow research that doesn't consider, for example, revisitation behavior, search engine visibility, and multi-user decision-making.

Distorted Measurements

It's easy to prejudice a usability study by helping the users at the wrong time or by using the wrong tasks. In fact, you can prove virtually anything you want if you design the study accordingly. This is often a factor behind "sponsored" studies that purport to show that one vendor's products are easier to use than a competitor's products.

Even if the experimenters aren't fraudulent, it's easy to get hoodwinked by methodological weaknesses, such as directing the users' attention to specific details on the screen.
The very fact that you're asking about some design elements rather than others makes users notice them more and thus changes their behavior.

Many Web advertising studies are misleading, possibly because most such studies come from advertising agencies.
The most common distortion is the novelty effect: whenever a new advertising format is introduced, it's always accompanied by a study showing that the new type of ad generates more user clicks. Sure, that's because the new format enjoys a temporary advantage: it gathers user attention simply because it's new and users have yet to train themselves to ignore it.
The study might be genuine as far as it goes, but it says nothing about the new advertising format's long-term advantages once the novelty effect wears off.

Publication Bias

Editors follow the "man bites dog" principle to highlight new and interesting stories. While understandable, this preference for new and different findings imposes a significant bias in the results that get exposure.

Usability is a very stable field. User behavior is pretty much the same year after year. I keep finding the same results in study after study, as do many others. Every now and then, a bogus result emerges and publication bias ensures that it gets much more attention than it deserves.

Consider the question of Web page download time. Everyone knows that faster is better. Interaction design theory has documented the importance of response times since 1968, and this importance has been seen empirically in countless Web studies since 1995. E-commerce sites that speed up response times sell more. The day your server is slow, you lose traffic. (This happened to me recently: on January 14, Tog got "slashdotted"; because we share a server, my site lost 10% of its normal pageviews for a Wednesday when AskTog's increased traffic slowed useit.com down.)
If twenty people study download times, nineteen will conclude that faster is better. But again: one of every twenty statistical analyses will give the wrong result, and this one study might be widely discussed simply because it's new. The nineteen correct studies, in contrast, might easily escape mention.

Judging Bizarre Results

Bizarre results are sometimes supported by seemingly convincing numbers. You can use the issues I've raised here as a sanity check: Did the study pull correlations out of a hat? Was it biased or overly narrow? Was it promoted purely because it's different? Or was it just a fluke?
Typically, you'll discover that deviant findings should be ignored.
The broad concepts of human behavior in interactive systems are stable and easy to understand. The exceptions usually turn out to be exactly that: exceptions.

In 1989, for example, I published a paper on discount usability engineering, stating that small, fast user studies are superior to larger studies, and that testing with about five users is typically sufficient.
This was quite contrary to the prevailing wisdom at the time, which was dominated by big-budget testing. In the fifteen years since my original claim, several other researchers have reached similar conclusions, and we developed a mathematical model to substantiate the theory behind my empirical observation. Today, almost everyone who does user testing has concluded that they learn most of what they'll ever learn with about five users.

A single study, even one of mine, isn't enough to establish a new general rule. But four or five studies pointing in the same direction constitute a trend, which greatly enhances a finding's credibility as a general phenomenon.
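For readers who want to see the shape of that mathematical model, the commonly cited form estimates the share of usability problems found by n test users as 1 - (1 - L)^n, with L around 0.31 in the Nielsen and Landauer data; the sketch below simply evaluates a few points (treating L = 0.31 as universal is a simplifying assumption).

```python
# Sketch of the curve behind the "five users" claim; lambda = 0.31 is the
# average problem-discovery rate reported by Nielsen and Landauer, and
# treating it as universal is a simplifying assumption.
def share_of_problems_found(n_users: int, lam: float = 0.31) -> float:
    """Expected fraction of usability problems found by n test users."""
    return 1 - (1 - lam) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {share_of_problems_found(n):.0%} of problems found")
# With lambda = 0.31, five users already uncover roughly 85% of the problems.
```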

Quantitative Studies: Intrinsic Risks

All the reasons I've listed for quantitative studies being misleading indicate bad research; it's possible to do good quantitative research and derive valid insights from measurements. But doing so is expensive and difficult.
Quantitative studies must be done exactly right in every detail or the numbers will be deceptive. There are so many pitfalls that you're likely to land in one of them and get into trouble.

If you rely on numbers without insights, you don't have backup when things go wrong. You'll stumble down the wrong path, because that's where the numbers will lead.

Qualitative studies are less brittle and thus less likely to break under the strain of a few methodological weaknesses. Even if your study isn't perfect in every last detail, you'll still get mostly good results from a qualitative method that relies on understanding users and their observed behavior.
Yes, experts get better results than beginners from qualitative studies.

But for quantitative studies, only the best experts get any valid results at all, and only then if they're extremely careful.

Source:
Jakob Nielsen's Alertbox, March 1, 2004:
Risks of Quantitative Studies
http://www.useit.com/alertbox/20040301.html
