Friday, September 4, 2009

Sep 4 - Sauro & Kindlund, A Method to Standardize Usability Metrics Into a Single Score.





A Method to Standardize Usability Metrics Into a Single Score.
Jeff Sauro. PeopleSoft, Inc., Denver, Colorado USA. Jeff_Sauro@peoplesoft.com
Erika Kindlund. Intuit, Inc., Mountain View, California USA. Erika_Kindlund@intuit.com

CHI 2005, April 2–7, 2005, Portland, Oregon, USA

ABSTRACT
Current methods to represent system or task usability in a single metric do not include all the ANSI and ISO defined usability aspects: effectiveness, efficiency & satisfaction. We propose a method to simplify all the ANSI and ISO aspects of usability into a single, standardized and summated usability metric (SUM). In four data sets, totaling 1860 task observations, we show that these aspects of usability are correlated and equally weighted and present a quantitative model for usability. Using standardization techniques from Six Sigma, we propose a scalable process for standardizing disparate usability metrics and show how Principal Components Analysis can be used to establish appropriate weighting for a summated model.
SUM provides one continuous variable for summative usability evaluations that can be used in regression analysis, hypothesis testing and usability reporting.


In a summative usability evaluation, several metrics are available to the analyst for benchmarking the usability of a product. There is general agreement from the standards bodies ANSI 2001 [2] and ISO 9241 pt. 11 [18] as to what the dimensions of usability are (effectiveness, efficiency and satisfaction) and, to a lesser extent, which metrics are most commonly used to quantify those dimensions.
Effectiveness includes measures for completion rates and errors, efficiency is measured from time on task and satisfaction is summarized using any of a number of standardized satisfaction questionnaires (either collected on a task-by-task basis or at the end of a test session) [2],[18].

There have been attempts to derive a single measure for the construct of usability.
Babiker et al [3] derived a single metric for usability in hypertext systems using objective performance measures only.
Questionnaires such as the SUMI [22,23], PSSUQ [27], QUIS [7] and SUS [5] have users provide a subjective assessment of recently completed tasks or specific product issues, and claim to derive a reliable and low-cost standardized measure of the overall usability or quality of use of a system.
While the authors of these questionnaires do not necessarily intend for the questionnaires to act as a single measure of usability (e.g. “QUIS was designed to assess users' subjective satisfaction with specific aspects of the human-computer interface” [7]), they are often used by practitioners as a way to measure usability with one number. Such usage is often not discouraged by the questionnaires’ instructions (e.g. “SUMI is the only commercially available questionnaire for the assessment of the usability of software” [22] and “The SUS scale is a Likert scale and yields a single number representing a composite measure of the overall usability of the system [5]”).
McGee [32,33] uses a geometric averaging procedure, Usability Magnitude Estimation (UME), to standardize ratios of participants' subjective assessment ratings on tasks and derive a single score for task usability. His research identifies the potential for a standardized measure of usability to support usability comparisons across products, of the same product over time, at lower levels of detail, and of tasks common to multiple products.
Lewis used a rank-based system when assessing competing products [25]. This approach creates a rank score comprised of both users’ objective performance measures and subjective assessment, but the resulting metric only represents a relative comparison between like-products with similar tasks.

My Comments: I may consider using the quantitative usability concept and a single, combined usability score.

METHOD
Four summative usability tests were conducted to collect the common metrics as described above (task completion, error counts, task times and satisfaction scores) as well as several other metrics as suggested in Dumas and Redish [11], and Nielsen [39].
For measuring satisfaction we created a questionnaire containing semantic distance scales with five points, similar to the ASQ created by Lewis [26] (see Table 5 below). The questionnaire included questions on task experience, ease of task, time on task, and overall task satisfaction.
The questionnaires were administered immediately after each task to improve accuracy [16]. The four usability tests were conducted in a controlled usability lab setting over a two-year period. Participants were asked to complete the tasks to the best of their ability and the administrator only intervened when the participant indicated they were done or gave up.
At the end of the test session, users were given "post-test" satisfaction questions, similar to those in SUMI and SUS, that asked about overall product usability.
Data was collected from 129 participants completing a total of 57 tasks. Participants varied in their application experience, gender, and industry.

RESULTS
Examining the Relationships between the Metrics

To attempt to combine the metrics into a single usability score we examined the relationship among the four primary variables for each task observation. We generated a correlation matrix with all four variables from all four data sets plus a combined data set containing data from all tests.
As can be seen in the lower right cell of Table 1, the Pearson product-moment correlation coefficients between satisfaction and task completion are consistent with prior correlation analyses (that is, displaying moderate and significant correlations between .3 and .5) [26, 29]. What's more, the positive correlations between the subjective measure (satisfaction) and the objective measures (time, errors and completion) are also consistent with Nielsen's 1994 meta-analysis [38] (although the subjective measures were preferences rather than satisfaction in that study).
Frøkjær et al. [12] earlier made the case for including all aspects (effectiveness, efficiency and satisfaction) when measuring the usability of a system, since these aspects did not always correlate in the data they reviewed. We agree with Frøkjær et al.'s conclusion to measure all aspects of usability, however, not because they do not correlate with each other (our data clearly shows the opposite), but because each measure adds information not contained in the other measures.
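As a rough illustration of the per-task-observation correlation matrix described above, the sketch below builds one from synthetic data; the column names and values are my own assumptions, not the paper's data sets.

```python
# A minimal sketch of the correlation matrix among the four primary variables.
# The data are randomly generated stand-ins, not the paper's observations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100  # hypothetical number of task observations

obs = pd.DataFrame({
    "time":         rng.normal(120, 30, n),   # task time in seconds
    "satisfaction": rng.uniform(1, 5, n),     # composite of 5-point scales
    "completion":   rng.integers(0, 2, n),    # 1 = completed, 0 = failed
    "errors":       rng.poisson(1.0, n),      # error count for the task
})

# Pearson product-moment correlations among the four primary variables
print(obs.corr(method="pearson"))
```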


Principal Components Analysis (PCA) was used as the statistical tool for the analysis.
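The sketch below shows one way such an analysis might be run: z-score the four metrics, fit a PCA, and inspect whether the first component loads the measures roughly equally (the rationale given later for equal weighting). The data and column names are invented for illustration.

```python
# Illustrative PCA on standardized usability metrics; synthetic data only.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 100
obs = pd.DataFrame({
    "time":         rng.normal(120, 30, n),
    "satisfaction": rng.uniform(1, 5, n),
    "completion":   rng.integers(0, 2, n),
    "errors":       rng.poisson(1.0, n),
})

z = (obs - obs.mean()) / obs.std(ddof=1)   # z-score each column first

pca = PCA(n_components=4)
pca.fit(z)

print("variance explained:", pca.explained_variance_ratio_)
print("first-component loadings:",
      dict(zip(obs.columns, pca.components_[0].round(2))))
```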


The goal then becomes standardizing the four variables (time, satisfaction, completion and errors).

STANDARDIZING USABILITY METRICS
To standardize each of the usability metrics we created a z-score type value or z-equivalent. For the continuous data (time and average satisfaction), we subtracted the mean value from a specification limit and divided by the standard deviation. For discrete data (completion rates and errors) we divided the unacceptable conditions (defects) by all opportunities for defects.
This method of standardization was adapted from the process sigma metric used in Six Sigma [4],[17], [43]. See Sauro & Kindlund [44] for a more detailed discussion on how to standardize these metrics from raw usability data.
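A minimal sketch of the two standardization routes, as I read them from the description above (the authoritative treatment is in Sauro & Kindlund [44]); the function names and sign conventions are my assumptions.

```python
# Two helpers sketching the standardization described above. Continuous
# metrics use a specification limit; discrete metrics use a
# defects-per-opportunities proportion looked up on the standard normal.
import numpy as np
from scipy import stats

def z_equivalent_continuous(values, spec_limit):
    """(spec limit - mean) / standard deviation, per the description above.
    Whether a higher z means 'better' depends on whether the spec limit is
    an upper bound (time) or a lower bound (satisfaction)."""
    values = np.asarray(values, dtype=float)
    return (spec_limit - values.mean()) / values.std(ddof=1)

def z_equivalent_discrete(defects, opportunities):
    """Normal deviate for the proportion of non-defective opportunities."""
    p_defect = defects / opportunities
    return stats.norm.ppf(1 - p_defect)
```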

Standardizing Task Completion
We can assume that all users want to successfully complete tasks, so a defect in task completion can be identified as an instance of a user failing a task. An opportunity for a defect in task completion is simply each instance of a user attempting a task. Therefore, we standardized task completion as the ratio of failed tasks to attempted tasks. This proportion of defects per opportunities has a corresponding z-equivalent that can be looked up in a standard normal table.
For example, a task completion rate of 80% would have the z-equivalent of .841.
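For instance, treating 8 completed tasks out of 10 attempts as the defect data (counts invented), the standard normal look-up reproduces the value quoted above:

```python
# 80% completion = 20% defects; its z-equivalent is about 0.84 (the .841
# quoted above, allowing for rounding). Counts are hypothetical.
from scipy import stats

completed, attempted = 8, 10
p_defect = (attempted - completed) / attempted   # failed / attempted = 0.20
z_completion = stats.norm.ppf(1 - p_defect)      # standard normal look-up
print(round(z_completion, 3))                    # -> 0.842
```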

Standardizing Error Rates
Each error instance is unique, yet all are associated with the more general "opportunity" to make an error in this component of the task. Once the task's error opportunities have been identified, the defect proportion is calculated by dividing the total number of errors by the error opportunities; the corresponding z-equivalent can then be found from the standard normal distribution.
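A short sketch of that calculation with made-up counts (the number of error opportunities per task has to be identified by the analyst, as noted above):

```python
# Error-rate standardization: errors / opportunities gives the defect
# proportion, whose complement is looked up on the standard normal.
from scipy import stats

errors = 12                 # total errors observed on the task (invented)
opportunities = 10 * 4      # e.g. 10 users x 4 identified opportunities (assumed)
p_defect = errors / opportunities
z_errors = stats.norm.ppf(1 - p_defect)
print(round(z_errors, 3))
```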

Standardizing Satisfaction Scores
As described in the Methods section, we used a post-task questionnaire containing 5-point semantic distance scales with the end points labeled (e.g. 5: Very Easy to 1: Very Difficult). For the analysis we created a composite satisfaction score by averaging the responses from questions on overall ease, satisfaction and perceived task time (see Table 5).
To standardize the composite score we looked to the literature for a logical specification limit. Prior research across numerous usability studies suggests that systems with “good-usability” typically have a mean rating of 4 on a 1-5 scale and 5.6 on a 1-7 scale [38]. Therefore we set the specification limit to 4. To arrive at a standardized z-equivalent for composite satisfaction we subtracted the average rating of a user’s satisfaction score from 4 and divided by the standard deviation.
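A sketch of that step with invented composite ratings. Note that the sign convention (whether the mean is subtracted from the limit or vice versa) determines whether a higher z means more or less satisfied, so consult the full paper [44] before reusing this; the code follows the sentence above literally.

```python
# Satisfaction standardization against the spec limit of 4 on a 1-5 scale,
# following the description above. Ratings are invented.
import numpy as np

ratings = np.array([4.3, 3.7, 4.7, 4.0, 3.3, 4.7, 5.0, 3.0])  # composite per user
spec_limit = 4.0
z_satisfaction = (spec_limit - ratings.mean()) / ratings.std(ddof=1)
print(round(z_satisfaction, 3))
```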

Standardizing Task Times
Identifying ideal task times presents an interesting challenge: how long is too long for any given task?
Once the ideal task time has been set for each task, standardizing the task time involves subtracting the raw task time from the specification limit and dividing by the standard deviation to arrive at the z-equivalent.
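Sketching that step with invented times and an assumed spec limit (reference [42] above discusses one way to choose such a limit):

```python
# Task-time standardization: distance from the spec limit in standard
# deviations. Per-user z values average out to (spec - mean) / sd.
import numpy as np

task_times = np.array([95, 120, 150, 80, 200, 110, 135, 90], dtype=float)  # seconds
spec_limit = 160.0                      # assumed maximum acceptable time (illustrative)
z_per_user = (spec_limit - task_times) / task_times.std(ddof=1)
z_time = z_per_user.mean()
print(round(z_time, 3))
```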

Creating a Single, Standardized and Summated Usability Metric: SUM
We created a single, standardized and summated usability metric for each task by averaging together the four standardized values based on the equal weighting of the coefficients from the Principal Components Analysis.
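Putting the pieces together, a SUM value for one task is then a plain average of the four z-equivalents; the numbers below are illustrative only, not results from the paper.

```python
# SUM for one task: the equally weighted average of the four z-equivalents,
# with equal weighting justified by the PCA loadings discussed earlier.
import numpy as np

z_completion   = 0.84   # from the task-completion standardization (example value)
z_errors       = 0.52   # from the error-rate standardization (example value)
z_satisfaction = 0.40   # from the satisfaction standardization (example value)
z_time         = 0.65   # from the task-time standardization (example value)

sum_score = np.mean([z_completion, z_errors, z_satisfaction, z_time])
print(round(sum_score, 3))
```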

CONCLUSION
A single, standardized and summated usability metric (SUM) cannot and should not take the place of diagnostic qualitative usability improvements typically found in formative evaluations. When a summative evaluation is used to quantitatively assess the “before and after” impact of design changes, the advantage of one score is in its ability to summarize the majority of variance in four integral summative usability measures.
SUM has two additional advantages. First it provides one continuous variable that can be used in regression analysis, hypothesis testing and in the same ways existing metrics are used to report usability. Second, a single metric based on logical specification limits provides an idea of how usable a task or product is without having to reference historical data. This score can then be used to report against other key business metrics.

References that I may want to read further in the future:
1. Abran, A., Surya, W., Khelifi, A., Rilling, J., Seffah, A., Robert, F. (2003). Consolidating the ISO Usability Models. Paper presented at 11th annual International Software Quality Management Conference.
2. ANSI (2001). Common industry format for usability test reports (ANSI-NCITS 354-2001). Washington, DC: American National Standards Institute.
5. Brooke, J. (1996). SUS: A “quick and dirty” usability scale. In P. Jordan, B. Thomas, and B. Weerdmeester (Eds.), Usability Evaluation in Industry (pp.189-194). London: Taylor and Francis. See also http://www.cee.hw.ac.uk/~ph/sus.html
8. Cordes, R. E (1984). Application of Magnitude Estimation for Evaluating Software Ease of Use. In Gavriel Salvendy (Ed.) First USA-Japan Conference on Human Computer Interaction, Amsterdam: Elsevier Science Publishers.
12. Frøkjær, E., Hertzum, M., and Hornbæk, K. (2000) Measuring usability: are effectiveness, efficiency, and satisfaction really correlated? In Proc. CHI 2000, (pp.345-352). Washington, D.C.: ACM Press.
13. Gliem, J. and Gliem, R. (2003). Calculating, Interpreting, and Reporting Cronbach’s Alpha Reliability Coefficient for Likert-Type Scales. In 2003 Midwest Research to Practice Conference in Adult, Continuing and Community Education. Columbus, OH.
22. Kirakowski, J. (1996). The Software Usability Measurement Inventory: Background and usage. In P. Jordan, B. Thomas, and B. Weerdmeester (Eds.), Usability Evaluation in Industry (pp. 169-178). London, UK: Taylor and Francis. (Also, see http://www.ucc.ie/hfrg/questionnaires/sumi/index.html )
23. Kirakowski, J., and Corbett, M. (1993). SUMI: The Software Usability Measurement Inventory. British Journal of Educational Technology, 24, 210-212.
25. Lewis, J. R. (1991). A Rank-Based Method for the Usability Comparison of Competing Products. In Proceedings of the Human Factors and Ergonomics Society 35th Annual Meeting, San Francisco, California (pp. 1312-1316).
26. Lewis, J. R. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23, 78-81.
27. Lewis, J. R. (1992). Psychometric evaluation of the Post-Study System Usability Questionnaire: The PSSUQ. In Proceedings of the Human Factors Society 36th Annual Meeting (pp. 1259-1263). Atlanta, GA: Human Factors Society.
28. Lewis, J. R. (1993). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use (Tech. Report 54.786). Boca Raton, FL: IBM Corp. http://drjim.0catch.com/usabqtr.pdf
29. Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7, 57-78.
32. McGee, M. (2003). Usability magnitude estimation. In Proceedings of the Human Factors and Ergonomics Society 47th Annual Meeting (pp. 691-695).
33. McGee, M. (2004). Master usability scaling: magnitude estimation and master scaling applied to usability measurement. In Proc. CHI 2004 (pp. 335-342). Washington, D.C.: ACM Press.
38. Nielsen, J. and Levy, J. (1994). Measuring Usability: Preference vs. Performance. Communications of the ACM, 37, pp. 66-76.
42. Sauro, J. (2004) How long should a task take? Identifying Spec Limits for Task Times in Usability Tests. Retrieved September 13, 2004, from Measuring Usability Web site : http://measuringusability.com/time_specs.htm
43. Sauro, J. (2004) How Do You Calculate a Z-Score? Retrieved September 13, 2004, from Measuring Usability Web site: http://measuringusability.com/z_calc.htm
44. Sauro, J., and Kindlund, E. (in press). Making Sense of Usability Metrics: Usability and Six Sigma. In Proceedings of the 14th Annual Conference of the Usability Professionals Association, Montreal, Canada.
