Michael Yeap...PhD Candidate: Lewis

Friday, October 9, 2009

Oct 9 - CSUQ: Computer System Usability Questionnaire - Lewis of IBM

The Computer System Usability Questionnaire (CSUQ)

The PSSUQ research was preliminary for two reasons. First, the sample size for the factor analysis was small, consisting of data from only 48 participants. Second, the PSSUQ data came from a usability study. This setting may have influenced the correlations among the items and, therefore, the resultant factors.
The purpose of this research (Lewis, 1992a) was to use a slightly revised version of the PSSUQ, the Computer System Usability Questionnaire (CSUQ), to obtain a database of sufficient size to calculate stable factors from a mailed survey.
If the same factors emerged from this research as from the PSSUQ research, the study would demonstrate the potential usefulness of the questionnaire across different user groups and different research settings.

Item Selection and Construction
The CSUQ is identical to the PSSUQ (Lewis, 1991c), except that the wording of the items does not refer to a usability testing situation. For example, Item 3 of the PSSUQ states, "I could effectively complete the tasks and scenarios using this system," but Item 3 of the CSUQ states, "I can effectively complete my work using this system." (See the appendix for the CSUQ items.)

Psychometric Evaluation

The mail survey using the CSUQ.
The participants were 825 IBM employees who worked at nine IBM development sites: Atlanta, Austin, Bethesda, Boca Raton, Dallas, Raleigh, Rochester, San Jose, and Tucson. I used a random number generator to select the participants' names from the IBM electronic mail directory (CALLUP), and mailed them each a copy of the CSUQ with a cover letter. Responses from the returned questionnaires that arrived within 3 months of mailing made up the database for this study.

Factor analysis.
Forty-six percent (377) of the participants returned the questionnaire.
A principal factor analysis of the returned questionnaires produced the scree plot shown in Figure 3. The scree plot was similar to that found for the PSSUQ, indicating that an appropriate factor analysis should solve for three factors. Table 6 shows the varimax-rotated 3-factor solution. The selection criterion for the factor loadings was 0.5, shown in bold type in the table.
The factor analysis showed that Item 8 ("I believe I became productive quickly using this system"), which was not a part of the original PSSUQ, should be part of Factor 1. Item 15 ("The organization of information on the system screens is clear"), which loaded on two factors in the PSSUQ study, loaded on only Factor 2 in the current study. In the PSSUQ study and in the current study, Item 19 ("Overall, I am satisfied with this system") loaded on both Factors 1 and 3, and is not part of any subscale.
Otherwise, the factor structure of the CSUQ is very similar to that of the PSSUQ, so the CSUQ and PSSUQ subscales have the same names.
The three factors accounted for 98.6% of the variability in the rating data.

Reliability.
In all cases, coefficient alpha exceeded 0.89, indicating acceptable scale reliability. The estimates of coefficient alpha for the CSUQ were .93 for SYSUSE, .91 for INFOQUAL, .89 for INTERQUAL, and .95 for the OVERALL scale. The values of coefficient alpha for the CSUQ scales were within 0.03 of those for the PSSUQ scales.
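As a reading aid, coefficient alpha can be computed from raw item ratings with the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of the summed scale). The sketch below is only an illustration: the function name and the sample ratings are hypothetical and are not taken from the report, and it assumes every respondent answered every item.

```python
from statistics import variance


def coefficient_alpha(rows):
    """Coefficient (Cronbach's) alpha for complete data.

    rows: one list of k item ratings per respondent (no missing values).
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 7-point ratings from five respondents on a 3-item scale.
ratings = [
    [1, 2, 1],
    [3, 3, 4],
    [5, 6, 5],
    [2, 2, 3],
    [7, 6, 7],
]
alpha = coefficient_alpha(ratings)
```

Items that rise and fall together across respondents push alpha toward 1, which is why the subscale values reported above indicate internally consistent scales.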

Validity/Sensitivity.
After establishing scale reliability, the next step in psychometric evaluation is to determine scale validity. However, without a concurrent or predicted measurement, it is impossible to obtain a quantitative measure of validity in the traditional psychometric sense. An indirect way to assess validity is to examine scale sensitivity to variables that should systematically affect the scale. The sensitivity analyses of the PSSUQ (Lewis, 1992b) showed significant effects of user group (business professional with mouse experience, business professional without mouse experience, and secretary/clerk without mouse experience) on the OVERALL, SYSUSE, INFOQUAL, and INTERQUAL scales. The type of computer system the participant used during the study significantly affected the INFOQUAL scale.

A comprehensive listing of the influence of respondent characteristics on the CSUQ scores is outside the scope of this paper. However, the significant findings are similar to those for the PSSUQ. The type of computer that respondents used significantly affected their responses only for the INFOQUAL score (F(5,311)=2.14, p=0.06). The number of years of experience with their computer system affected respondents' scores for OVERALL (F(4,294)=3.12, p=0.02), SYSUSE (F(4,332)=2.05, p=0.09), INFOQUAL (F(4,311)=2.59, p=0.04) and INTERQUAL (F(4,322)=2.47, p=0.04). The respondents' range of experience with computer systems (number of different computer systems that they reported having used) affected scores for OVERALL (F(3,294)=2.77, p=0.04), INFOQUAL (F(3,311)=2.60, p=0.05) and INTERQUAL (F(3,322)=2.14, p=0.10).
These significant findings provide indirect support to the hypothesis that these scales are valid.

Discussion
The key results from this study are
(1) a demonstration of stable factors for the CSUQ (and, by extension, for the PSSUQ) and
(2) evidence that the questionnaire works well in non-laboratory settings.
The CSUQ scales are comparable to the PSSUQ scales, both in terms of reliability and validity (indicated by similarity in the sensitivity analyses).
These findings substantially enhance the usefulness of the CSUQ and PSSUQ to usability practitioners. Researchers who conduct usability studies (either laboratory or non-laboratory) can use this questionnaire to assess user satisfaction with system usability.

My Comments: This would be helpful if I eventually choose to develop a usability questionnaire as a usability evaluation tool.

IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use
Technical Report 54.786
James R. Lewis
Human Factors Group
Boca Raton, FL

Source: http://drjim.0catch.com/usabqtr.pdf

Oct 9 - PSSUQ: Post-Study System Usability Questionnaire - Lewis of IBM

The Post-Study System Usability Questionnaire (PSSUQ)

The Post-Study System Usability Questionnaire (PSSUQ) is currently a 19-item instrument for assessing user satisfaction with system usability. (See the appendix for a copy of the questionnaire items.)
Participants need more time to complete the PSSUQ than the ASQ (about 10 minutes to complete the PSSUQ), but only complete it once, at the end of a usability study. Completing the PSSUQ allows participants to provide an overall evaluation of the system they used.

After the 48 participants in the office-applications usability study (Lewis, Henry, & Mack, 1990) completed all the scenarios, they rated their system with the PSSUQ. This data allowed preliminary psychometric evaluation of the PSSUQ (Lewis, 1992b).
This earlier version of the PSSUQ (Lewis, 1992b) had only 18 items, with the items in a different order than shown in the appendix. Recently, a series of investigations using decision support systems revealed a common set of five system characteristics associated with usability by several different user groups (Doug Antonelli, personal communication, January 5, 1991). The original 18-item PSSUQ addressed four of these five system characteristics. The 19-item version of the PSSUQ contains an additional item to cover the fifth of these five system characteristics.

Item Construction
The items are 7-point graphic scales, anchored at the end points with the terms "Strongly agree" for 1, "Strongly disagree" for 7, and a "Not applicable" (N/A) point outside the scale.

Item Selection
A group of usability evaluators selected the items on the basis of their comprehensive content regarding hypothesized constituents of usability. For example, the items assess such system characteristics as ease of use, ease of learning, simplicity, effectiveness, information, and the user interface.

Psychometric Evaluation

Factor analysis.
The scree plot for an exploratory principal factors analysis of the PSSUQ data indicated that a 3-factor solution was appropriate (see Figure 2), so the overall scale defined by the full set of items contained three subscales. Table 5 shows the varimax-rotated factor pattern, revealing the structure of the subscales. Bold type in Table 5 highlights factor loadings that exceeded .5. Items that loaded highly on two factors were ambiguous regarding the appropriate subscale of which they should be a component, so they did not become a component of any subscale. (See the appendix to examine the content of these items.)
One of the most difficult tasks following this type of exploratory factor analysis is naming the factors. After considering a number of alternatives, a group of human factors engineers named the factors (and their corresponding subscales) System Usefulness (SYSUSE), Information Quality (INFOQUAL), and Interface Quality (INTERQUAL). These three factors account for 87% of the variability in the data.
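The subscale-assignment rule described above (keep an item on a factor when its loading exceeds .5, and drop it from every subscale when it loads highly on two factors) can be sketched in a few lines of Python. The function name and the loadings are hypothetical and are not the values in Table 5. Comparing absolute values and leaving out items that load on no factor are my assumptions, since the report does not address those cases.

```python
def assign_items(loadings, criterion=0.5):
    """Map each item to the single factor it loads on, or None.

    loadings: dict of item label -> list of rotated loadings, one per factor.
    An item is assigned only if exactly one loading exceeds the criterion.
    """
    assignment = {}
    for item, row in loadings.items():
        # Assumption: compare absolute values so negative loadings also count.
        high = [factor for factor, value in enumerate(row) if abs(value) > criterion]
        assignment[item] = high[0] if len(high) == 1 else None
    return assignment


# Hypothetical rotated loadings for three items on three factors.
example_loadings = {
    "item_a": [0.70, 0.20, 0.10],  # clear loading on the first factor
    "item_b": [0.60, 0.55, 0.10],  # loads on two factors, so it is ambiguous
    "item_c": [0.30, 0.20, 0.40],  # no loading exceeds the criterion
}
example_assignment = assign_items(example_loadings)
```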

Reliability.
Coefficient alpha analyses showed that the reliability of the overall summative scale (OVERALL) was .97, and ranged from .91 to .96 for the three subscales (SYSUSE=.96, INFOQUAL=.91, and INTERQUAL=.91). Therefore, the overall scale and the three subscales have excellent reliability.

Validity.
Correlation analyses support the validity of the scales. The OVERALL scale correlated highly with the sum of the ASQ ratings that participants gave after completing each scenario (r(20)=.80, p=.0001). OVERALL also correlated significantly with the percentage of successful scenario completion (r(29)=-.40, p=.026). The SYSUSE (r(36)=-.40, p=.006) and INTERQUAL (r(35)=-.29, p=.08) correlated with the percentage of successful scenario completion.

Sensitivity.
In the sensitivity ANOVAs, the overall scale and all three subscales indicated significant differences among the user groups (OVERALL: F(2,29)=4.35, p=.02; SYSUSE: F(2,36)=6.9, p=.003; INFOQUAL: F(2,33)=3.68, p=.04; INTERQUAL: F(2,33)=3.74, p=.03). INFOQUAL showed a significant system effect (F(2,33)=3.18, p=.05).

Discussion
These findings have limited generalizability because the sample size for the factor analysis was relatively small. The usual recommendation would be 90 participants for this questionnaire.
However, the factor analysis and reliability analyses suggest that it is reasonable to define three subscales from this set of items. The PSSUQ has reasonable concurrent validity when compared with successful scenario completion rates and the ASQ scores. The overall scale and the subscales are reasonably sensitive.
The evidence provided sufficient justification to use the PSSUQ to measure user satisfaction with system usability in usability studies, but also suggested that it would be prudent to collect more data in different circumstances to extend the generalizability of the findings.

My Comments: This would be helpful if I eventually choose to develop a usability questionnaire as a usability evaluation tool.

Oct 9 - ASQ: After-Scenario Questionnaire - Lewis of IBM

The After-Scenario Questionnaire (ASQ)

In a scenario-based usability study, participants use a product, such as a computer application, to do a series of realistic tasks.
The After-Scenario Questionnaire (ASQ) is a three-item questionnaire that IBM usability evaluators have used to assess participant satisfaction after the completion of each scenario. (See the appendix for a copy of the questionnaire.)
The items address three important components of user satisfaction with system usability: ease of task completion, time to complete a task, and adequacy of support information (on-line help, messages, and documentation). Because the questionnaire is very short, it takes very little time for participants to complete – an important practical consideration for usability studies.
Usability professionals have used these items (or very similar items) in usability studies at IBM for many years, but a recent series of studies has provided a database of sufficient size to allow a preliminary psychometric evaluation of the ASQ.
The ASQ items are the constituent items for a summative, or Likert, scale (McIver & Carmines, 1981; Nunnally, 1978). In developing summative scales, it is important to consider item construction, item selection and psychometric evaluation.

Item Construction
The items are 7-point graphic scales, anchored at the end points with the terms "Strongly agree" for 1 and "Strongly disagree" for 7, and a Not Applicable (N/A) point outside the scale, as shown in the appendix.

Item Selection
The content of the items reflects components of usability that usability professionals at IBM have generally considered important.

Psychometric Evaluation
The office-applications studies. Scenario-based usability studies of three office application systems (Lewis, Henry, & Mack, 1990) provided the data for a psychometric evaluation of the ASQ. Forty-eight employees of temporary help agencies participated in the studies, with 15 hired in Hawthorne, New York; 15 hired in Boca Raton, Florida; and 18 hired in Southbury, Connecticut. Each set of participants consisted of one-third participants with clerical/secretarial work experience and no mouse experience (SECNO), one-third business professionals with no mouse experience (BPNO), and one-third business professionals with at least three months of mouse experience (BPMS). All participants had at least three months of experience using some type of computer system. They had no programming training or experience, and had no (or very limited) knowledge of operating systems.
Popular word-processing applications, mail applications, calendar applications, and spreadsheet applications installed in three different operating environments comprised the three office systems (hereafter referred to as System I, System II and System III). All three environments allowed windowing, used a mouse as a pointing device, and allowed a certain amount of integration among the applications. The systems differed in details of implementation, but were generally similar. The three word-processing and spreadsheet applications were similar, but the mail and calendar applications differed considerably. The studies contained eight scenarios in common.
Participants began the study with a brief lab tour, read a description of the study's purpose and the day's agenda, and completed a background questionnaire. Participants using System I completed an interactive tutorial shipped with the system. Tutors provided the other participants with a brief demonstration about how to move, point and select with a mouse; how to open the icons for each product; and how to maximize and minimize windows.
After this system exploration period (usually about 1 hour), participants performed the scenarios, completing the ASQ as they finished each scenario. While the participant performed the scenario, an observer logged the participant's activities. If the participant completed the scenario without assistance and produced the correct output, then he or she completed the scenario successfully. Either after completing all scenarios or at the end of the workday (with some scenarios never attempted), participants provided an overall system rating with the Post-Study System Usability Questionnaire (PSSUQ) (Lewis, 1992b; Lewis, Henry, & Mack, 1990).
Participants usually needed a full work day (8 hours) to complete the study.
At the end of the three studies, the researchers entered the responses to the ASQ, PSSUQ, and the scenario completion data into a database.
From this database, it was possible to conduct an exploratory factor analysis, reliability analyses, validity analyses, and a sensitivity analysis.

Factor analysis.
Due to the design of this study (eight scenarios and a 3-item questionnaire), either an 8-factor or 3-factor solution would have been reasonable. An 8-factor solution could indicate grouping by scenario, and a 3-factor solution could indicate grouping by item type. Figure 1 shows the scree plot for the eigenvalues.
The scree plot for this analysis did not support a 3-factor solution, but did support an 8-factor solution. The rotated factor pattern is in Table 2. Using a selection criterion of .5 for the factor loadings (indicated with bold type), a clear relationship existed between the factors and the scenarios. The eight factors accounted for almost all (94%) of the variance in the data.

Reliability.
For the eight summative scales derived from the eight factors, all the coefficient alphas exceeded .90. Coefficient alphas this large were surprising because each scale contained only three items, and reliability is largely a function of the number of scale items (Nunnally, 1978).

Validity.
The correlation between the ASQ scores and scenario failure or success (coded as 0=failure and 1=success) was -.40 (n=48, p<.01). This result showed that participants who successfully completed a scenario tended to give lower (more favorable) ASQ ratings – evidence of concurrent validity.
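To make the direction of this correlation concrete, here is a small Python sketch of a Pearson correlation between a 0/1 success code and ASQ scores. The data are hypothetical and only illustrate why success (coded 1) paired with low, favorable ratings yields a negative coefficient. The function name is my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scenario outcomes (0 = failure, 1 = success) and ASQ scores.
success = [1, 1, 1, 0, 0]
asq_scores = [1.5, 2.0, 2.5, 5.0, 6.0]
r = pearson_r(success, asq_scores)  # negative: success goes with lower ratings
```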

Sensitivity.
Of the 48 participants, 27 completed all of the ASQ items for all of the scenarios. This reduced database was appropriate for an analysis-of-variance (ANOVA) to assess the sensitivity of the ASQ. Specifically, did the ASQ scores discriminate among the different systems, user groups, or scenarios in the three usability studies? The main effect of Scenario was highly significant (F(7,126)=8.92, p<.0001).
The Scenario by System interaction was also significant (F(14,126)=1.75, p=.05). These results suggest that the ASQ scale score is a reasonably sensitive measure.

Discussion
These findings have limited generalizability because the sample size for the factor analysis was relatively small. The usual recommendation would require 120 participants for this analysis (5 participants x 8 scenarios/participant x 3 items/scenario). On the other hand, the resulting factor structure was very clear.
The psychometric evaluation of this questionnaire showed that it is reasonable to condense the three ASQ items into a single scale through summation (or, equivalently, averaging). The available evidence indicates that the ASQ is reliable, valid, and sensitive. This condensation should allow easier interpretation and reporting of results when usability practitioners use the ASQ.
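The sample-size arithmetic quoted in this discussion (5 participants for each variable entering the factor analysis) is easy to check. The helper below is a hypothetical illustration of that rule of thumb as the report states it, not a general recommendation.

```python
def recommended_sample_size(n_variables, participants_per_variable=5):
    """Rule-of-thumb sample size: a fixed number of participants per variable."""
    return n_variables * participants_per_variable


# The ASQ analysis had 8 scenarios x 3 items per scenario = 24 variables.
asq_variables = 8 * 3
asq_recommended = recommended_sample_size(asq_variables)
```

With 24 variables the rule gives 120 participants, against the 48 who took part in the studies.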

My Comments: This would be helpful if I eventually choose to develop a usability questionnaire as a usability evaluation tool.

Thursday, October 8, 2009

Oct 9 - Lewis, IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use

IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use

ABSTRACT
This paper describes recent research in subjective usability measurement at IBM. The focus of the research was the application of psychometric methods to the development and evaluation of questionnaires that measure user satisfaction with system usability. The primary goals of this paper are to (1) discuss the psychometric characteristics of four IBM questionnaires that measure user satisfaction with computer system usability, and (2) provide the questionnaires, with administration and scoring instructions. Usability practitioners can use these questionnaires with confidence to help them measure users' satisfaction with the usability of computer systems.

Introduction

Customers want usable products, and developers strive to produce them. It follows that an important part of modern product engineering, both hardware and software, must be the measurement of usability.
Measuring usability is particularly difficult because usability is not a unidimensional product or user characteristic, but emerges as a multidimensional characteristic in the context of users performing tasks with a product in a specific environment (Bevan, Kirakowski, & Maissel, 1991; Shackel, 1984).
However, if you are unable to measure usability, how can you judge your product against your competitors', or even your own previous versions of the product?

Subjective and Objective Evaluation

Most usability evaluations gather both subjective and objective quantitative data in the context of realistic scenarios-of-use, as well as descriptions of the problems representative participants have trying to complete the scenarios.
Subjective data are measures of participants' opinions or attitudes concerning their perception of usability.
Objective data are measures of participants' performance (such as scenario completion time and successful scenario completion rate).

Objective usability measures include, but are not limited to, scenario completion time, successful scenario completion rate, and time spent recovering from errors (Whiteside, Bennett, & Holtzblatt, 1988). Subjective usability measures are usually responses to Likert-type questionnaire items that assess user attitude concerning attributes such as system ease-of-use and interface likeability (Alty, 1992).
Most usability evaluators collect both objective and subjective data.

Research Focus

The focus of this research was the application of psychometric methods to the development and evaluation of standard questionnaires to assess subjective usability.
The goal of psychometrics is to establish the quality of psychological measures (Nunnally, 1978). Is a measure reliable in the sense that it is consistent? Given a reliable measure, is it valid (measures the intended attribute)? Finally, is the measure appropriately sensitive to experimental manipulations?
Psychometrics is a well-developed field, but usability researchers have only recently used these methods to develop and evaluate questionnaires to assess usability (Sweeney & Dillon, 1987).
In contrast to other recent computer-user satisfaction questionnaires (Chin, Diehl, & Norman, 1988; Kirakowski & Dillon, 1988; LaLomia & Sidowski, 1990) the IBM questionnaires are specifically for use in the context of scenario-based usability testing (Lewis, 1991a; Lewis, 1991b; Lewis, 1991c; Lewis, 1992b; Lewis, Henry, & Mack, 1990), although additional research has indicated that one may be useful as an instrument for field evaluation (Lewis, 1992a). Usability practitioners can use these questionnaires to enhance their current usability methods. (The four IBM questionnaires appear in the appendix.)
Before describing the psychometric properties of the IBM questionnaires, I will briefly review the relevant elements of psychometric practice. (For a comprehensive discussion of psychometrics, see Nunnally, 1978.)

----------
ASQ
PSQ
PSSUQ
CSUQ
-----------

General Discussion

Although user satisfaction with system usability is only one component of the multifaceted construct of usability (Bevan et al., 1991), it is a very important component in many situations.
It is especially important when a primary design goal is user satisfaction.

This paper has described the psychometric qualities of four questionnaires that assess user satisfaction with system usability: the ASQ, PSQ, PSSUQ and CSUQ.

The ASQ and PSQ are both after-scenario questionnaires, intended for use in a scenario-based usability testing situation. They contain essentially the same items, but the ASQ uses a 7-point scale and the PSQ uses a 5-point scale.
Using data from very different scenario-based usability studies (one a study of software office applications, the other a study of printers), their factor analyses, validity analyses, and sensitivity analyses were virtually identical. Obtaining the same results in different settings with different user groups provides strong evidence that these results are generalizable, and the questionnaires have wide applicability. Because the ASQ has substantially better reliability than the PSQ, usability practitioners should use the ASQ rather than the PSQ as their after-scenario questionnaire.

The PSSUQ and CSUQ are both overall satisfaction questionnaires. The PSSUQ items are appropriate for a usability testing situation, and the CSUQ items are appropriate for a field testing situation. Otherwise, the questionnaires are identical.
The psychometric evaluations of the PSSUQ (using data from a usability study) and the CSUQ (using data from a mail survey) were virtually identical. As with the after-scenario questionnaires, this consistency provides strong evidence of generalizability of results and wide applicability of the questionnaires.

Because these questionnaires have acceptable psychometric properties, usability practitioners can use them with confidence as standardized measurements of satisfaction for usability studies and tests (ASQ, PSSUQ) or field research (CSUQ).
(Practitioners should note that nothing prevents the addition of items to these questionnaires if a particular situation suggests the need. However, using these questionnaires as the foundation for special-purpose questionnaires ensures that practitioners can score the scales and subscales from the questionnaires, maintaining the advantages of standardized measurement.)

Standardized satisfaction measurements offer many advantages to the usability practitioner (Nunnally, 1978). Specifically, standardized measurements provide:

1 Objectivity.
A standardized measurement supports objectivity because it allows usability practitioners to independently verify the measurement statements of other practitioners.

2 Quantification.
Standardized measurements allow practitioners to report results in finer detail than they could using only personal judgment. Standardization also permits practitioners to use powerful methods of mathematics and statistics to better understand their results (Nunnally, 1978).

3 Communication.
It is easier for practitioners to communicate effectively when standardized measures are available. Inadequate efficiency and fidelity of communication in any field is an impediment to progress.

4 Economy.
Developing standardized measures requires a substantial amount of work. However, once developed, they are economical. There is rarely any need to re-evaluate standardized measures.

5 Scientific generalization.
Scientific generalization is at the heart of scientific work. Standardization is essential for assessing the generalization of results.

Conclusion
In conclusion, these questionnaires should be valuable additions to the repertoire of techniques that usability practitioners apply in the design and evaluation of computer systems.

Oct 9 - Printer-Scenario Questionnaire (PSQ) - Lewis of IBM

Printer-Scenario Questionnaire (PSQ)

Administration and Scoring.
As indicated in the body of the paper, use the ASQ rather than the PSQ.

Instructions and Items.
The questionnaire's instructions and items are:
For each of the items below, please circle the response that best describes your experience with the printer for this scenario.

1. Time to Complete Task

1 = Acceptable as is -- less time than expected
2 = Acceptable as is -- about right
3 = Needs slight improvement
4 = Needs moderate improvement
5 = Needs a lot of improvement
- = Unable to evaluate
Comments:

2. Ease of Performing Tasks

1 = Acceptable as is -- very easy
2 = Acceptable as is -- easy
3 = Needs slight improvement
4 = Needs moderate improvement
5 = Needs a lot of improvement
- = Unable to evaluate
Comments:

3. Satisfaction with Instructions/Publications

1 = Acceptable as is -- very satisfied
2 = Acceptable as is -- satisfied
3 = Needs slight improvement
4 = Needs moderate improvement
5 = Needs a lot of improvement
- = Unable to evaluate
Comments:

Oct 9 - After-Scenario Questionnaire (ASQ) - Lewis of IBM

After-Scenario Questionnaire (ASQ)

Administration and Scoring.
Give the questionnaire to a participant after he or she has completed a scenario during a usability evaluation. Average (with the arithmetic mean) the scores from the three items to obtain the ASQ score for a participant's satisfaction with the system for a given scenario. Low scores are better than high scores due to the anchors used in the 7-point scales. If a participant does not answer an item or marks N/A, average the remaining items to obtain the ASQ score.
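The scoring rule above can be sketched in Python as follows. The function name is hypothetical, and representing an unanswered or N/A item as None is my own convention.

```python
def asq_score(item_ratings):
    """ASQ score for one participant and one scenario.

    item_ratings: the three item ratings (1-7); use None for N/A or no answer.
    Returns the arithmetic mean of the answered items, or None if none were answered.
    """
    answered = [rating for rating in item_ratings if rating is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


# Hypothetical participant who marked the third item N/A.
score = asq_score([2, 3, None])
```

Here the score is the mean of the two answered items. Lower values indicate greater satisfaction because 1 is anchored at "strongly agree".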

Instructions and Items.
The questionnaire's instructions and items are:
For each of the statements below, circle the rating of your choice.

Likert scale:
1 = strongly agree
2
3
4
5
6
7 = strongly disagree

1. Overall, I am satisfied with the ease of completing this task.

2. Overall, I am satisfied with the amount of time it took to complete this task.

3. Overall, I am satisfied with the support information (on-line help, messages, documentation) when completing this task.

Oct 9 - Computer System Usability Questionnaire (CSUQ)

Computer System Usability Questionnaire (CSUQ)

Administration and Scoring.
Use the CSUQ rather than the PSSUQ when the usability study is in a non-laboratory setting. Appendix Table 1 contains the rules for calculating the CSUQ and PSSUQ scores.
_____________________________________________________________________________
Appendix Table 1. Rules for Calculating CSUQ/PSSUQ Scores
_____________________________________________________________________________
Score Name     Average the Responses to:
_____________________________________________________________________________
OVERALL        Items 1 through 19
SYSUSE         Items 1 through 8
INFOQUAL       Items 9 through 15
INTERQUAL      Items 16 through 18
_____________________________________________________________________________
Average the scores from the appropriate items to obtain the scale and subscale scores. Low scores are better than high scores due to the anchors used in the 7-point scales. If a participant does not answer an item or marks "N/A," then average the remaining item scores.
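A minimal Python sketch of the rules in Appendix Table 1 follows. The item ranges come from the table. The function name, the dictionary layout, and the use of None for "N/A" or unanswered items are my own assumptions.

```python
# Item ranges copied from Appendix Table 1 (item numbers are 1-based, inclusive).
SCALES = {
    "OVERALL": range(1, 20),     # Items 1 through 19
    "SYSUSE": range(1, 9),       # Items 1 through 8
    "INFOQUAL": range(9, 16),    # Items 9 through 15
    "INTERQUAL": range(16, 19),  # Items 16 through 18
}


def score_questionnaire(responses):
    """Return the four CSUQ/PSSUQ scores for one respondent.

    responses maps item number (1-19) to a rating from 1 to 7.
    Items marked N/A or left blank are None or simply missing.
    """
    scores = {}
    for name, items in SCALES.items():
        answered = [responses[i] for i in items if responses.get(i) is not None]
        # Average only the answered items; None if nothing was answered.
        scores[name] = sum(answered) / len(answered) if answered else None
    return scores


# Hypothetical respondent: rates every item 2, marks item 9 N/A, rates item 19 as 4.
example = {i: 2 for i in range(1, 20)}
example[9] = None
example[19] = 4
example_scores = score_questionnaire(example)
```

Item 19 counts toward OVERALL only, so in this example it raises the overall score without changing any subscale score.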

Instructions and Items.
The questionnaire's instructions and items are:
This questionnaire (which starts on the following page) gives you an opportunity to express your satisfaction with the usability of your primary computer system. Your responses will help us understand what aspects of the system you are particularly concerned about and the aspects that satisfy you.
To as great a degree as possible, think about all the tasks that you have done with the system while you answer these questions.
Please read each statement and indicate how strongly you agree or disagree with the statement by circling a number on the scale. If a statement does not apply to you, circle N/A.
Whenever it is appropriate, please write comments to explain your answers.
Thank you!

Likert scale:
1 = strongly agree
2
3
4
5
6
7 = strongly disagree

1. Overall, I am satisfied with how easy it is to use this system.

2. It is simple to use this system.

3. I can effectively complete my work using this system.

4. I am able to complete my work quickly using this system.

5. I am able to efficiently complete my work using this system.

6. I feel comfortable using this system.

7. It was easy to learn to use this system.

8. I believe I became productive quickly using this system.

9. The system gives error messages that clearly tell me how to fix problems.

10. Whenever I make a mistake using the system, I recover easily and quickly.

11. The information (such as on-line help, on-screen messages and other documentation) provided with this system is clear.

12. It is easy to find the information I need.

13. The information provided with the system is easy to understand.

14. The information is effective in helping me complete my work.

15. The organization of information on the system screens is clear.

Note: The interface includes those items that you use to interact with the system. For example, some components of the interface are the keyboard, the mouse, the screens (including their use of graphics and language).

16. The interface of this system is pleasant.

17. I like using the interface of this system.

18. This system has all the functions and capabilities I expect it to have.

19. Overall, I am satisfied with this system.

IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use
Technical Report 54.786
James R. Lewis
Human Factors Group
Boca Raton, FL

Source: http://drjim.0catch.com/usabqtr.pdf

Oct 9 - Post-Study System Usability Questionnaire (PSSUQ) - Lewis of IBM

The Post-Study System Usability Questionnaire (PSSUQ)

Administration and Scoring.
Give the PSSUQ to participants after they have completed all the scenarios in a usability study.
You can calculate four scores from the responses to the PSSUQ items:
* the overall satisfaction score (OVERALL),
* system usefulness (SYSUSE),
* information quality (INFOQUAL) and
* interface quality (INTERQUAL).
Because research on an alternative form of the PSSUQ (the Computer System Usability Questionnaire, or CSUQ) confirmed and clarified (and slightly modified) the factor structure of the questionnaire, refer to Appendix Table 1 in the next section of this appendix for the current scoring rules of the PSSUQ.

Instructions and Items.
The questionnaire's instructions and items are:
This questionnaire, which starts on the following page, gives you an opportunity to tell us your reactions to the system you used. Your responses will help us understand what aspects of the system you are particularly concerned about and the aspects that satisfy you.
To as great a degree as possible, think about all the tasks that you have done with the system while you answer these questions.
Please read each statement and indicate how strongly you agree or disagree with the statement by circling a number on the scale. If a statement does not apply to you, circle N/A.
Please write comments to elaborate on your answers.
After you have completed this questionnaire, I'll go over your answers with you to make sure I understand all of your responses.
Thank you!

Likert scale
1 = strongly agree
2
3
4
5
6
7 = strongly disagree

1. Overall, I am satisfied with how easy it is to use this system.

2. It was simple to use this system.

3. I could effectively complete the tasks and scenarios using this system.

4. I was able to complete the tasks and scenarios quickly using this system.

5. I was able to efficiently complete the tasks and scenarios using this system.

6. I felt comfortable using this system.

7. It was easy to learn to use this system.

8. I believe I could become productive quickly using this system.

9. The system gave error messages that clearly told me how to fix problems.

10. Whenever I made a mistake using the system, I could recover easily and quickly.

11. The information (such as on-line help, on-screen messages and other documentation) provided with this system was clear.

12. It was easy to find the information I needed.

13. The information provided for the system was easy to understand.

14. The information was effective in helping me complete the tasks and scenarios.

15. The organization of information on the system screens was clear.

Note: The interface includes those items that you use to interact with the system. For example, some components of the interface are the keyboard, the mouse, the screens (including their use of graphics and language).

16. The interface of this system was pleasant.

17. I liked using the interface of this system.

18. This system has all the functions and capabilities I expect it to have.

19. Overall, I am satisfied with this system.

Appendix Table 1. Rules for Calculating CSUQ/PSSUQ Scores
_____________________________________________________________________________
Score Name > Average the Responses to:
_____________________________________________________________________________
OVERALL > Items 1 through 19
SYSUSE > Items 1 through 8
INFOQUAL > Items 9 through 15
INTERQUAL > Items 16 through 18
_____________________________________________________________________________

IBM Computer Usability Satisfaction Questionnaires:
Psychometric Evaluation and Instructions for Use
Technical Report 54.786
James R. Lewis
Human Factors Group
Boca Raton, FL

Source: http://drjim.0catch.com/usabqtr.pdf

Oct 9 - Lewis, Tradeoffs in the Design of the IBM Computer Usability Satisfaction Questionnaires

Tradeoffs in the Design of the IBM Computer Usability Satisfaction Questionnaires
James R. Lewis

1 Introduction

Psychometrics is a well-developed field in psychology, and usability researchers
began to use psychometric methods to develop and evaluate questionnaires to
assess usability a little over ten years ago (Sweeney & Dillon, 1987). The goal
of psychometrics is to establish the quality of psychological measures
(Nunnally, 1978).

2 Brief Review of Psychometric Practice

Reliability goals.
In psychometrics, reliability is quantified consistency, typically estimated using coefficient alpha (Nunnally, 1978). Coefficient alpha can range from 0 (no reliability) to 1 (perfect reliability). Measures of individual aptitude (such as IQ tests or college entrance exams) should have a minimum reliability of .90 (preferably a reliability of .95). For other research or evaluation, measurement reliability should be at least .70 (Landauer, 1988).
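
As a rough illustration of how coefficient alpha is computed, here is a minimal sketch using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `coefficient_alpha` and the ratings are hypothetical, and the sketch assumes complete data (no skipped items).

```python
from statistics import pvariance

def coefficient_alpha(rows):
    """Cronbach's coefficient alpha for a set of item ratings.

    `rows` is a list of participants; each participant is a list of
    item ratings of the same length, with no missing values.
    """
    k = len(rows[0])                      # number of items in the scale
    items = list(zip(*rows))              # transpose to per-item columns
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])  # variance of summed scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical ratings: 4 participants x 3 items on a 7-point scale.
ratings = [[1, 2, 1], [2, 2, 3], [4, 5, 4], [6, 6, 7]]
alpha = coefficient_alpha(ratings)
```

The same variance estimator has to be used for the items and for the totals; population variance is used here, but sample variance would give the same ratio.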

Validity goals.
Validity is the measurement of the extent to which a questionnaire measures what it claims to measure. Researchers commonly use the Pearson correlation coefficient to assess criterion-related validity (the relationship between the measure of interest and a different concurrent or predictive measure). Moderate correlations (with absolute values as small as .30 to .40) are often large enough to justify the use of psychometric instruments (Nunnally, 1978).
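
For reference, the Pearson correlation mentioned above can be computed directly from its definition. The sketch below is a plain-Python illustration; the function name `pearson_r` and the paired scores are hypothetical and do not come from Lewis's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical pairing: questionnaire scores (lower is better) against
# the proportion of tasks each participant completed.
satisfaction = [1.5, 2.0, 3.5, 4.0, 5.5]
completion = [0.9, 1.0, 0.7, 0.8, 0.4]
r = pearson_r(satisfaction, completion)
```

With low scores meaning higher satisfaction, a negative `r` here would be the direction that supports validity, which is why the text speaks of absolute values.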

Sensitivity goals.
A questionnaire that is reliable and valid should also be sensitive – capable of detecting appropriate differences. Statistically significant differences in the magnitudes of questionnaire scores for different systems or other usability-related manipulations provide evidence for sensitivity.

Goals of factor analysis.
Factor analysis is a statistical procedure that examines the correlations among variables to discover clusters of related variables (Nunnally, 1978). Because summated (Likert) scales are more reliable than single-item scales (Nunnally, 1978) and it is easier to present and interpret a smaller number of scores, it is common to conduct a factor analysis to determine if there is a statistical basis for the formation of summative scales.

3 Tradeoffs Considered in the Development of the IBM Questionnaires

Number of scale steps.
The more scale steps in a questionnaire the better, but with rapidly diminishing returns (Nunnally, 1978). As the number of scale steps increases from 2 to 20, there is an initial rapid increase in reliability, but it tends to level off at about 7 steps. After 11 steps there is little gain in reliability from increasing the number of steps. The number of steps is most important for single-item assessments, but is usually less important when summing scores over a number of items.
This turned out to be true in the case of the IBM questionnaires (Lewis, 1995). Coefficient alpha exceeded .89 for all instruments using 7-point scales. Coefficient alpha for a questionnaire using 5-point scales ranged from .64 to .93 and averaged .80. A related analysis using the same data (Lewis, 1993) showed that the mean difference of the 7-point scales correlated more strongly than the mean difference of the 5-point scales with the observed significance levels of t-tests. For these reasons, we currently use 7-point rather than 5-point scales.

Calculating scale scores.
From psychometric theory (Nunnally, 1978), scale reliability is a function of the interrelatedness of scale items, the number of scale steps per item, and the number of items in a scale. If a participant chooses not to answer an item, the effect would be to slightly reduce the reliability of the scale in that instance. In most cases, the remaining items should offer a reasonable estimate of the appropriate scale score.
From a practical standpoint, averaging the answered items to obtain the scale score enhances the flexibility of use of the questionnaire, because if an item is not appropriate in a specific context and users choose not to answer it, the questionnaire is still useful. Also, users who do not answer every item can stay in the sample. Finally, averaging items to obtain scale scores does not affect the statistical properties of the scores, and standardizes the range of scale scores, making them easier to interpret and compare. For example, with items based on 7-point scales, all the summative scales would also have scores that range from 1 to 7. For these reasons, we average the responses given by a participant across the items for each scale.

Unidimensional or multidimensional instrument.
The developer of a questionnaire can have the goal of creating a unidimensional or multidimensional instrument (McIver & Carmines, 1981).
A unidimensional instrument will typically require fewer items, so it will take less time to administer and provides a straightforward measurement because it has no subscales. A multidimensional instrument, because it measures several subscales related to the higher-level, overall scale, typically requires more items.
For example, the System Usability Scale (Brooke, 199?), a unidimensional instrument, contains ten items. The PSSUQ, a multidimensional instrument that provides measurements for three subscales as well as the overall measurement, contains 19 items.

Control of potential response bias or consistency in item alignment.
It is a common practice in questionnaire development to vary the tone of items so that, typically, half of the items elicit agreement and the other half elicit disagreement. The purpose of this is to control potential response bias. An alternative approach, less commonly used, is to align the items consistently.
Probably the most common criticism I’ve seen of the IBM questionnaires is that they do not use the standard control for potential response bias. Our rationale in consistently aligning the items was to make it as easy as possible for participants to complete the questionnaire. With consistent item alignment, the proper way to mark responses on the scales is clearer and requires less interpretive effort on the part of the participant. Even if this results in some response bias, typical use of the IBM questionnaires is to compare systems or experimental conditions. In this context of use, any systematic response bias will cancel out across comparisons.
I have seen the caution expressed that a frustrated or lazy participant will simply choose one end point or the other and mark all items the same way. With all items aligned in the same way, this could lead to the erroneous conclusion that the participant held a strong belief (either positive or negative) regarding the usability of the system.
With items constructed in the standard way, such a set of responses would indicate a neutral opinion. Although this characteristic of the standard approach is appealing, I have seen no evidence of such participant behavior, at least not in the hundreds of PSSUQs that I have personally scored.

To norm or not to norm.
When a questionnaire has norms, data exists that allows researchers to interpret individual and average scores as greater or smaller than the expected norm scores. In some contexts (field studies, standard single-system usability studies), this can be a tremendous advantage. In other contexts (multiple-system comparative usability studies, other types of experiments), it might provide no particular advantage.
When I performed the psychometric qualification of the CSUQ, I acquired a fair amount of data suitable for norms. I never published the norms because they were considered IBM Confidential. Those norms are now about 10 years out of date, and I no longer use them. The only instruments I know of that appear to have useful norms are those created by Kirakowski and his colleagues (Kirakowski & Corbett, 1993; Kirakowski & Dillon, 1988).
Researchers should be cautious in the use of such norms, however, because differences between the contexts in which the norms were gathered and the use of the instrument could be misleading.

4 Advantages of Using Psychometrically Qualified Instruments

Despite any controversies regarding decisions made in the development of such questionnaires, standardized satisfaction measurements (whichever questionnaire you choose to use) offer many advantages to the usability practitioner (Nunnally, 1978).

Specifically, standardized measurements (even without norms) provide objectivity, replicability, quantification, economy, communication, and scientific generalization. Standardization also permits practitioners to use powerful methods of mathematics and statistics to better understand their results (Nunnally, 1978).
The level of measurement of an instrument (ratio, interval, ordinal) does not limit permissible arithmetic operations or related statistical operations, but does limit the permissible interpretations of the results of these operations (Harris, 1985).
Measurements using Likert scales are ordinal. Suppose you compare two products with the PSSUQ, and Product A receives a score of 2.0 versus Product B's score of 4.0. Given a significant comparison, you could say that Product A had more satisfying usability characteristics than Product B (an ordinal claim), but you could not say that Product A was twice as satisfying as B (a ratio claim).

In conclusion, psychometrically qualified, standardized questionnaires can be valuable additions to practitioners’ repertoire of usability evaluation techniques.

5 References

Brooke, J. (199?). SUS – A quick and dirty usability scale. Unpublished paper.
Harris, R. J. (1985). A primer of multivariate statistics. Orlando, FL: Academic Press.
Kirakowski, J., & Corbett, M. (1993). SUMI: The software usability measurement inventory. British Journal of Educational Technology, 24, 210-212.
Kirakowski, J., & Dillon, A. (1988). The computer user satisfaction inventory (CUSI): Manual and scoring key. Cork, Ireland: Human Factors Research Group, University College of Cork.
Landauer, T. K. (1988). Research methods in human-computer interaction. In M. Helander (Ed.), Handbook of Human-Computer Interaction (pp. 905-928). New York, NY: Elsevier.
Lewis, J. R. (1993). Multipoint scales: Mean and median differences and observed significance levels. International Journal of Human-Computer Interaction, 5, 383-392.
Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7, 57-78.
McIver, J. P., & Carmines, E. G. (1981). Unidimensional scaling. Sage University Paper Series on Quantitative Applications in the Social Sciences, series no. 07-024, Beverly Hills, CA: Sage Publications.
Nunnally, J. C. (1978). Psychometric Theory. New York, NY: McGraw-Hill.
Sweeney, M., & Dillon A. (1987). Methodologies employed in the psychological evaluation of HCI. In Proceedings of Human-Computer Interaction -- INTERACT '87 (pp. 367-373).

Tradeoffs in the Design of the IBM Computer Usability Satisfaction Questionnaires
James R. Lewis
International Business Machines Corp.

Source: http://drjim.0catch.com/hci99jrl.pdf

Friday, September 18, 2009

Sep 19 - Nielsen, 25 Years in Usability (Alertbox)

25 Years in Usability

Summary: Since I started in 1983, the usability field has grown by 5,000%. It's a wonderful job — and still a promising career choice for new people.

Evolution: How Things Have Changed

The field's main difference today compared to when I started is size: it is much larger now. In 1983, usability was a narrow discipline pursued by a few people, largely confined to academia, phone companies (mainly Bell Labs), and a few pockets of enlightenment in the biggest computer companies.
When we met at conferences, we all knew each other. Although new people did join the field (as I did in '83), the new membership rate was about a handful per year. All told, there were maybe 1,000 usability people in the world (primarily in the U.S. and the U.K.).
Today, by my estimates, there might be as many as 50,000 full-time usability professionals in the world, supplemented by about half a million people with part-time usability responsibilities or interest.

Highlights of 1983

Usability basics haven't changed in 25 years. Methods for user testing were already well established by 1983 — the year that John Gould and Clayton Lewis presented a paper outlining 3 main principles for successful design:
* Establish an early focus on users and run field studies before starting any design work.
* Conduct empirical usability studies throughout development.
* Use an iterative design process.

These are the same 3 things we teach today as the most important usability steps. The main difference now is that Gould and Lewis talked about collecting quantitative measurements during their tests, whereas I've emphasized faster, qualitative studies for most projects since I started evangelizing "discount usability" in 1989.

In general, it's good for usability professionals to have experience with many generations of user interface technologies — this allows you to:
* Generalize the underlying issues in interaction design: when you see something every year for 25 years, you know there's some truth to it.
* Avoid being swayed by the surface appearance of the latest gizmo.

Personal Retrospective

I've been happy with my career choice. I've had a great time in every single one of these 25 years, with so many interesting studies and exciting findings.
Usability allows us to make everyday life more satisfying by empowering people to control their destiny and their technology...
...usability also strengthens business by making companies more profitable through increased sales and higher productivity.

Usability as a Career

If a young person today asked me whether usability is still a good career choice, I wouldn't hesitate to say yes. If anything, usability is a better career now than when I started.
In 1983, usability was an oppressed discipline. We few pioneers had to struggle against the prevailing attitude that computing is about power and features — not ease of use and a pleasurable user experience.
Today, usability is widely recognized as one of the key drivers of website profitability. Not a day passes without a big-shot CEO declaring support for better user experience.
...usability works — it adds vastly more value to design projects than it costs, and companies tend to add more and more usability over time as they experience this payoff in their own projects...

We have job security as long as there's stupid design in the world, and that's forever: every new technology that comes along will be abused.
Come join us. You'll have a great time. I certainly am, and will enjoy continuing to keep pace with this ever-growing field.

source:
Jakob Nielsen's Alertbox, April 21, 2008:
25 Years in Usability
http://www.useit.com/alertbox/25-years-usability.html

Friday, September 11, 2009

Sep 12 - Lewis, Sample Sizes for Usability Tests: Mostly Math, Not Magic

Sample Sizes for Usability Tests: Mostly Math, Not Magic.
James R. Lewis IBM Corp. jimlewis@us.ibm.com
USER EXPERIENCE, VOL. 4, ISSUE 4, 2005

Perhaps the most important factor is the economics of usability testing. For many practitioners, usability tests are fairly expensive events, with much of the expense in the variable cost of the number of participants observed (which includes cost of participants, cost of observers, cost of lab, and limited time to obtain data to provide to developers in a timely fashion).

Usability testing includes three key components: representative participants, representative tasks, and representative environments, with participants’ activities monitored by one or more observers [2].
They can be formal or informal, think-aloud or not, use low-fidelity prototypes or working systems.
They can have a primary focus on task-level measurements (summative testing) or problem discovery (formative testing).

The IBM practice at that time, based on papers published by Alphonse Chapanis and colleagues [1, 5], was to observe about five to six participants per iteration for problem discovery. Chapanis had asserted that after you’d observed six participants, you would have seen about all of the problems you were going to see.

THE GOAL: PROBLEM DISCOVERY
You can’t really talk about discovering 90 percent of all possible usability problems across all possible users, tasks, and environments.
You can establish a problem discovery goal given a sampled population of users, a defined set of tasks, and a defined set of environments.
Change the population of users, tasks, or environments, and all bets are off. But this is better than nothing. If your problem discovery rate is starting to go down, then change one or all of these elements of usability.
Test from a different population of users, using different tasks, in different environments. You’ll discover different problems.

My comments: In other words, the results of usability testing do not have high repeatability or reproducibility.

In 2001, Spool and Schroeder published the results of a large-scale usability evaluation in which they concluded that five users were “nowhere near enough” to find all (or even 85 percent) of the usability problems in the Web sites they were studying.
Perfetti and Landesman [17], discussing related research, stated:
When we tested the site with 18 users, we identified 247 total obstacles-to-purchase. Contrary to our expectations, we saw new usability problems throughout the testing sessions. In fact, we saw more than five new obstacles for each user we tested. Equally important, we found many serious problems for the first time with some of our later users. What was even more surprising to us was that repeat usability problems did not increase as testing progressed. These findings clearly undermine the belief that five users will be enough to catch nearly 85 percent of the usability problems on a Web site. In our tests, we found only 35 percent of all usability problems after the first five users. We estimated over 600 total problems on this particular online music site. Based on this estimate, it would have taken us 90 tests to discover them all!

Discussion
If a practitioner says that five participants are all you need to discover most of the problems that will occur in a usability test, it’s likely that this practitioner is typically working in contexts that have a fairly high value of p and fairly low problem discovery goals.
If another practitioner says that he’s been running a study for three months, has observed 50 participants, and is continuing to discover new problems every few participants, then it’s likely that he has a somewhat lower value of p, a higher problem discovery goal, and lots of cash (or a low cost audience of participants).
Neither practitioner is necessarily wrong—they’re just working in different usability testing spaces.
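
The trade-off between p and the problem discovery goal can be made concrete with a small sketch. It assumes the usual binomial problem-discovery model, in which the expected proportion of problems found after n participants is 1 - (1 - p)^n; the excerpt above does not spell out the formula, so treat this as an assumption based on the cited Lewis (1993) paper title. The function names and the example values of p are hypothetical.

```python
from math import ceil, log

def discovery_rate(p, n):
    """Expected proportion of problems seen at least once after n participants,
    assuming each problem occurs independently with probability p per participant."""
    return 1 - (1 - p) ** n

def participants_needed(p, goal):
    """Smallest n for which discovery_rate(p, n) reaches `goal` (0 < goal < 1)."""
    return ceil(log(1 - goal) / log(1 - p))

# Hypothetical contexts: a high-p study versus a low-p study, same 85% goal.
n_high_p = participants_needed(0.5, 0.85)
n_low_p = participants_needed(0.05, 0.85)
```

Under this model the first practitioner (high p, modest goal) needs only a handful of participants, while the second (low p, higher goal) needs many more, which matches the point that neither is necessarily wrong.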

References that I may want to read further in future:
5. Chapanis, A. (1981). Evaluating ease of use. Unpublished manuscript prepared for IBM, available on request from J. R. Lewis.
11. Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier.
12. Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.
13. Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
14. Lewis, J. R. (2006). Usability testing. In G. Salvendy (ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
17. Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 4, 2006 from http://www.uie.com/articles/eight_is_not_enough/
18. Sauro, J. (2006). UI problem discovery sample size. Downloaded from Measuring Usability website, July 20, 2006: http://www.measuringusability.com/samplesize/problem_discovery.php
19. Spool, J., & Schroeder, W. (2001). Testing web sites: Five users is nowhere near enough. In CHI 2001 Extended Abstracts (pp. 285-286). New York: ACM Press.

Friday, September 4, 2009

Sep 4 - Sauro & Kindlund, A Method to Standardize Usability Metrics Into a Single Score.

A Method to Standardize Usability Metrics Into a Single Score.
Jeff Sauro. PeopleSoft, Inc., Denver, Colorado USA. Jeff_Sauro@peoplesoft.com
Erika Kindlund. Intuit, Inc., Mountain View, California USA. Erika_Kindlund@intuit.com

CHI 2005, April 2–7, 2005, Portland, Oregon, USA

ABSTRACT
Current methods to represent system or task usability in a single metric do not include all the ANSI and ISO defined usability aspects: effectiveness, efficiency & satisfaction. We propose a method to simplify all the ANSI and ISO aspects of usability into a single, standardized and summated usability metric (SUM). In four data sets, totaling 1860 task observations, we show that these aspects of usability are correlated and equally weighted and present a quantitative model for usability. Using standardization techniques from Six Sigma, we propose a scalable process for standardizing disparate usability metrics and show how Principal Components Analysis can be used to establish appropriate weighting for a summated model.
SUM provides one continuous variable for summative usability evaluations that can be used in regression analysis, hypothesis testing and usability reporting.

In a summative usability evaluation, several metrics are available to the analyst for benchmarking the usability of a product. There is general agreement from the standards boards ANSI 2001[2] and ISO 9241 pt.11[18] as to what the dimensions of usability are (effectiveness, efficiency & satisfaction) and to a lesser extent which metrics are most commonly used to quantify those dimensions.
Effectiveness includes measures for completion rates and errors, efficiency is measured from time on task and satisfaction is summarized using any of a number of standardized satisfaction questionnaires (either collected on a task-by-task basis or at the end of a test session) [2],[18].

There have been attempts to derive a single measure for the construct of usability.
Babiker et al [3] derived a single metric for usability in hypertext systems using objective performance measures only.
Questionnaires such as the SUMI [22,23], PSSUQ[27], QUIS[7] and SUS[5] have users provide a subjective assessment of recently completed tasks or specific product issues and claim to derive a reliable and low-cost standardized measure of the overall usability or quality of use of a system.
While the authors of these questionnaires do not necessarily intend for the questionnaires to act as a single measure of usability (e.g. “QUIS was designed to assess users' subjective satisfaction with specific aspects of the human-computer interface” [7]), they are often used by practitioners as a way to measure usability with one number. Such usage is often not discouraged by the questionnaires’ instructions (e.g. “SUMI is the only commercially available questionnaire for the assessment of the usability of software” [22] and “The SUS scale is a Likert scale and yields a single number representing a composite measure of the overall usability of the system [5]”).
McGee uses a geometric averaging procedure (UME) to standardize ratios of participants’ subjective assessment ratings on tasks to derive a single score for task usability. His research identifies the potential for a standardized measure of usability to support usability comparisons across products, the same product over time, at lower levels of detail, and of tasks common to multiple products.
Lewis used a rank-based system when assessing competing products [25]. This approach creates a rank score comprised of both users’ objective performance measures and subjective assessment, but the resulting metric only represents a relative comparison between like-products with similar tasks.

My Comments: I may consider using a quantitative usability concept and a single, combined usability score.

METHOD
Four summative usability tests were conducted to collect the common metrics as described above (task completion, error counts, task times and satisfaction scores) as well as several other metrics as suggested in Dumas and Redish [11], and Nielsen [39].
For measuring satisfaction we created a questionnaire containing semantic distance scales with five points, similar to the ASQ created by Lewis [26] (see Table 5 below). The questionnaire included questions on task experience, ease of task, time on task, and overall task satisfaction.
The questionnaires were administered immediately after each task to improve accuracy [16]. The four usability tests were conducted in a controlled usability lab setting over a two-year period. Participants were asked to complete the tasks to the best of their ability and the administrator only intervened when the participant indicated they were done or gave up.
At the end of the test session, “post-test” satisfaction questions similar to those in SUMI and SUS that asked about overall product usability were given to users.
Data was collected from 129 total participants completing a total of 57 tasks. Participants varied in their application experience, gender, and industries.

RESULTS
Examining the Relationships between the Metrics
To attempt to combine the metrics into a single usability score we examined the relationship among the four primary variables for each task observation. We generated a correlation matrix with all four variables from all four data sets plus a combined data set containing data from all tests.
As can be seen in the lower right cell of Table 1, the Pearson Product Moment correlation coefficients between satisfaction and task completion are consistent with prior correlation analyses (that is, displaying moderate and significant correlations between .3 and .5) [26, 29]. What’s more, the positive correlation between subjective measures (satisfaction) and objective measures (time, errors and completion) are also consistent with Nielsen’s 1994 meta-analysis [38] (although the subjective measures were preferences instead of satisfaction in that study).
Frøkjær et al [12] earlier has made the case for including all aspects (effectiveness, efficiency and satisfaction) when measuring the usability of a system since it was found that these aspects did not always correlate in the data they reviewed. We agree with Frøkjær et al’s conclusion to measure all aspects of usability, however, not because they do not correlate with each other (our data clearly shows the opposite), but because each measure adds additional information not contained in the other measures.

Principal Component Analysis (PCA) was used as the statistical tool for this analysis.

The goal then becomes standardizing the four variables (time, satisfaction, completion and errors).

STANDARDIZING USABILITY METRICS
To standardize each of the usability metrics we created a z-score type value or z-equivalent. For the continuous data (time and average satisfaction), we subtracted the mean value from a specification limit and divided by the standard deviation. For discrete data (completion rates and errors) we divided the unacceptable conditions (defects) by all opportunities for defects.
This method of standardization was adapted from the process sigma metric used in Six Sigma [4], [17], [43]. See Sauro & Kindlund [44] for a more detailed discussion on how to standardize these metrics from raw usability data.

Standardizing Task Completion
We can assume that all users want to successfully complete tasks, so a defect in task completion can be identified as an instance of a user failing a task. An opportunity for a defect in task completion is simply each instance of a user attempting a task. Therefore, we standardized task completion as the ratio of failed tasks to attempted tasks. This proportion of defects per opportunities has a corresponding z-equivalent that can be looked up in a standard normal table.
For example, a task completion rate of 80% (a defect rate of 20%) has a z-equivalent of .841.
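The lookup described above can be sketched in a few lines. This is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` stands in for the printed standard normal table; the function name `completion_z_equivalent` is hypothetical and does not come from the paper.

```python
from statistics import NormalDist


def completion_z_equivalent(failed: int, attempted: int) -> float:
    """Return the z-equivalent of a task completion rate.

    A defect is a failed attempt; an opportunity is any attempt.
    (Hypothetical helper, written only to illustrate the text.)
    """
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    defect_rate = failed / attempted
    if not 0.0 < defect_rate < 1.0:
        # The inverse normal CDF is unbounded at 0 and 1, so a task with
        # 0% or 100% failures has no finite z-equivalent here.
        raise ValueError("defect rate must be strictly between 0 and 1")
    # z value that has (1 - defect_rate) of the standard normal area below it
    return NormalDist().inv_cdf(1.0 - defect_rate)


# 2 failures in 10 attempts, i.e. an 80% completion rate
z_completion = completion_z_equivalent(failed=2, attempted=10)
print(f"{z_completion:.4f}")  # 0.8416, which the text reports as .841
```

How to treat tasks that every user passed (or every user failed) is left open in this sketch; it simply rejects those inputs rather than guess at an adjustment.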

Standardizing Error Rates
Each error instance is unique, yet all are associated with the more general “opportunity” to make an error in this component of the task. Once the task’s error opportunities have been identified, the proportion of defects is calculated by dividing the total number of errors by the error opportunities. This proportion can then be converted to a z-equivalent using the standard normal deviate.
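A rough sketch of the error-rate standardization, under the assumption that the total number of opportunities is the number of error opportunities identified for the task multiplied by the number of users who attempted it. That counting rule, the function name `error_z_equivalent`, and the numbers used are all illustrative assumptions, not taken from the paper.

```python
from statistics import NormalDist


def error_z_equivalent(total_errors: int, total_opportunities: int) -> float:
    """Convert errors per opportunity into a z-equivalent.

    Hypothetical helper: the way opportunities are counted is an
    assumption made for this example.
    """
    if total_opportunities <= 0:
        raise ValueError("total_opportunities must be positive")
    defect_rate = total_errors / total_opportunities
    if not 0.0 < defect_rate < 1.0:
        # No finite z value exists when the proportion is 0, 1, or more
        raise ValueError("defect rate must be strictly between 0 and 1")
    # Standard normal deviate with (1 - defect_rate) of the area below it
    return NormalDist().inv_cdf(1.0 - defect_rate)


# Illustrative numbers: 4 opportunities per attempt x 20 users = 80 opportunities
z_errors = error_z_equivalent(total_errors=12, total_opportunities=80)
print(f"{z_errors:.3f}")  # 1.036
```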

Standardizing Satisfaction Scores
As described in the Methods section, we used a post-task questionnaire containing 5-point semantic distance scales with the end points labeled (e.g., 5: Very Easy to 1: Very Difficult). For the analysis we created a composite satisfaction score by averaging the responses to the questions on overall ease, satisfaction and perceived task time (see Table 5).
To standardize the composite score we looked to the literature for a logical specification limit. Prior research across numerous usability studies suggests that systems with “good-usability” typically have a mean rating of 4 on a 1-5 scale and 5.6 on a 1-7 scale [38]. Therefore we set the specification limit to 4. To arrive at a standardized z-equivalent for composite satisfaction we subtracted the average rating of a user’s satisfaction score from 4 and divided by the standard deviation.

Standardizing Task Times
Identifying ideal task times presents an interesting challenge: how long is too long for any given task?
Once the ideal task time has been set for each task, standardizing the task time involves subtracting the raw task time from the specification limit and dividing by the standard deviation to arrive at the z-equivalent.
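As a small worked sketch of that calculation: the example below assumes the task-level value uses the mean observed time and the sample standard deviation, and the times and the 120-second specification limit are made-up numbers. The name `time_z_equivalent` is hypothetical.

```python
from statistics import mean, stdev


def time_z_equivalent(task_times: list[float], spec_limit: float) -> float:
    """Return (spec limit - mean time) / standard deviation.

    Positive values mean users were, on average, faster than the limit.
    Using the sample standard deviation is an assumption of this sketch.
    """
    if len(task_times) < 2:
        raise ValueError("need at least two task times")
    sd = stdev(task_times)
    if sd == 0:
        raise ValueError("standard deviation is zero")
    return (spec_limit - mean(task_times)) / sd


# Hypothetical task times in seconds and a hypothetical spec limit
times = [80.0, 90.0, 100.0, 110.0, 120.0]
z_time = time_z_equivalent(times, spec_limit=120.0)
print(f"{z_time:.3f}")  # 1.265
```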

Creating a Single, Standardized and Summated Usability Metric: SUM
We created a single, standardized and summated usability metric for each task by averaging together the four standardized values based on the equal weighting of the coefficients from the Principal Components Analysis.
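The equal-weight averaging step might look like the following. It is only a sketch: the function name `sum_score` and the four input values are hypothetical, and the inputs are assumed to have been standardized already.

```python
from statistics import mean


def sum_score(z_completion: float, z_errors: float,
              z_satisfaction: float, z_time: float) -> float:
    """Average the four standardized values with equal weights."""
    return mean([z_completion, z_errors, z_satisfaction, z_time])


# Hypothetical standardized values for one task
score = sum_score(z_completion=0.84, z_errors=1.04,
                  z_satisfaction=0.50, z_time=1.26)
print(f"{score:.2f}")  # 0.91
```

Equal weights keep the combined score easy to explain: no single measure dominates, which matches the equal weighting the text attributes to the Principal Components Analysis.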

CONCLUSION
A single, standardized and summated usability metric (SUM) cannot and should not take the place of diagnostic qualitative usability improvements typically found in formative evaluations. When a summative evaluation is used to quantitatively assess the “before and after” impact of design changes, the advantage of one score is in its ability to summarize the majority of variance in four integral summative usability measures.
SUM has two additional advantages. First, it provides one continuous variable that can be used in regression analysis and hypothesis testing, and in the same ways existing metrics are used to report usability. Second, a single metric based on logical specification limits provides an idea of how usable a task or product is without having to reference historical data. This score can then be used to report against other key business metrics.

References that I may want to read further in future:
1. Abran, A., Surya, W., Khelifi, A., Rilling, J., Seffah, A., Robert, F. (2003). Consolidating the ISO Usability Models. Paper presented at 11th annual International Software Quality Management Conference.
2. ANSI (2001). Common industry format for usability test reports (ANSI-NCITS 354-2001). Washington, DC: American National Standards Institute.
5. Brooke, J. (1996). SUS: A “quick and dirty” usability scale. In P. Jordan, B. Thomas, and B. Weerdmeester (Eds.), Usability Evaluation in Industry (pp. 189-194). London: Taylor and Francis. See also http://www.cee.hw.ac.uk/~ph/sus.html
8. Cordes, R. E. (1984). Application of Magnitude Estimation for Evaluating Software Ease of Use. In Gavriel Salvendy (Ed.), First USA-Japan Conference on Human Computer Interaction. Amsterdam: Elsevier Science Publishers.
12. Frøkjær, E., Hertzum, M., and Hornbæk, K. (2000). Measuring usability: are effectiveness, efficiency, and satisfaction really correlated? In Proc. CHI 2000 (pp. 345-352). Washington, D.C.: ACM Press.
13. Gliem, J. and Gliem, R. (2003). Calculating, Interpreting, and Reporting Cronbach’s Alpha Reliability Coefficient for Likert-Type Scales. In 2003 Midwest Research to Practice Conference in Adult, Continuing and Community Education. Columbus, OH.
22. Kirakowski, J. (1996). The Software Usability Measurement Inventory: Background and usage. In P. Jordan, B. Thomas, and B. Weerdmeester (Eds.), Usability Evaluation in Industry (pp. 169-178). London, UK: Taylor and Francis. (Also, see http://www.ucc.ie/hfrg/questionnaires/sumi/index.html )
23. Kirakowski, J., and Corbett, M. (1993). SUMI: The Software Usability Measurement Inventory. British Journal of Educational Technology, 24, 210-212.
25. Lewis, J. (1991). A Rank-Based Method for the Usability Comparison of Competing Products. In Proceedings of the Human Factors and Ergonomics Society 35th Annual Meeting, San Francisco, California (pp. 1312-1316).
26. Lewis, J. R. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23, 78-81.
27. Lewis, J. R. (1992). Psychometric evaluation of the Post-Study System Usability Questionnaire: The PSSUQ. In Proceedings of the Human Factors Society 36th Annual Meeting (pp. 1259-1263). Atlanta, GA: Human Factors Society.
28. Lewis, J. R. (1993). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use (Tech. Report 54.786). Boca Raton, FL: IBM Corp. http://drjim.0catch.com/usabqtr.pdf
29. Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7, 57-78.
32. McGee, M. (2003). Usability magnitude estimation. Proc. HFES, 47th Annual Meeting (pp. 691-695).
33. McGee, M. (2004). Master usability scaling: magnitude estimation and master scaling applied to usability measurement. In Proc. CHI 2004 (pp. 335-342). Washington, D.C.: ACM Press.
38. Nielsen, J. and Levy, J. (1994). Measuring Usability: Preference vs. Performance. Communications of the ACM, 37, pp. 66-76.
42. Sauro, J. (2004). How long should a task take? Identifying Spec Limits for Task Times in Usability Tests. Retrieved September 13, 2004, from Measuring Usability Web site: http://measuringusability.com/time_specs.htm
43. Sauro, J. (2004). How Do You Calculate a Z-Score? Retrieved September 13, 2004, from Measuring Usability Web site: http://measuringusability.com/z_calc.htm
44. Sauro, J. and Kindlund, E. (In Press). Making Sense of Usability Metrics: Usability and Six Sigma. In Proceedings of the 14th Annual Conference of the Usability Professionals Association, Montreal, Canada.
