Friday, October 9, 2009

Oct 10 - Tullis & Stetson, A Comparison of Questionnaires for Assessing Website Usability

A Comparison of Questionnaires for Assessing Website Usability
Thomas S. Tullis, Fidelity Investments
Jacqueline N. Stetson, Fidelity Investments and Bentley College

 
Various questionnaires have been reported in the literature for assessing the perceived usability of an interactive system, e.g.:
–Questionnaire for User Interface Satisfaction (QUIS) (1988)
–Computer System Usability Questionnaire (CSUQ) (1995)
–System Usability Scale (SUS) (1996)
 
A slightly different approach was taken by Microsoft with their "Product Reaction Cards" (2002)
And we have been using our own questionnaire for several years in our Usability Lab at Fidelity Investments
 
Problem
How well do these questionnaires apply to the assessment of Websites?
Do any of these questionnaires work well, as an adjunct to a usability test, with relatively small numbers of users?


Our Study
Limited ourselves to questionnaires in the published literature
–Did not include commercial services for evaluating website usability (e.g., WAMMI, RelevantView, NetRaker, Vividence).
We studied five questionnaires:
–SUS
–QUIS
–CSUQ
–Microsoft’s "Words"
–Our own questionnaire
 
Questionnaire #1: SUS
•Developed at Digital Equipment Corp.
•Consists of ten items.
•Adapted by replacing "system" with "website".
•Each item is a statement (positive or negative) and a rating on a five-point scale of "Strongly Disagree" to "Strongly Agree".
 
Questionnaire #2: QUIS
•Developed at the University of Maryland.
•Original questionnaire had 27 questions.
–We dropped 3 that did not seem relevant to Websites (e.g., "Remembering names and use of commands").
•"System" was replaced by "website" and the term "screen" was replaced by "web page".
•Each question is a rating on a ten-point scale with appropriate anchors.
 
Questionnaire #3: CSUQ
•Developed at IBM.
•Composed of 19 questions.
•"System" or "computer system" was replaced by "website".
•Each question is a statement and a rating on a seven-point scale of "Strongly Disagree" to "Strongly Agree".
 
Questionnaire #4: Words
•Based on the 118 words used by Microsoft on their Product Reaction Cards.
–Some positive (e.g., "Convenient")
–Some negative (e.g., "Unattractive")
•Each word was presented with a check-box
–Users were asked to choose the words that best describe their interaction with the website.
–Could choose as many or as few words as they wished.

Questionnaire #5: Ours
•Developed in-house; we have been using it for several years in our usability tests of websites.
•Composed of nine statements (e.g., "This website is visually appealing") to which the user responds on a seven-point scale from "Strongly Disagree" to "Strongly Agree".
•Points of the scale are numbered -3, -2, -1, 0, 1, 2, 3.
–Obvious neutral point at 0.
 
A Live Experiment!
•We’re going to compare two sites:
–CircuitCity.com
–Outpost.com
•Task 1: Your digital camera uses SmartMedia cards. Find the least expensive external reader (USB) for your PC that will read them.
•Task 2: You do lots of hiking. Find the least expensive personal GPS with map capability and at least 8 MB of memory.

 
Method of Our Study
•Conducted entirely on our company Intranet.
•123 of our employees participated.
•Each participant was randomly assigned to one of the five questionnaire conditions.
•Each was asked to perform two tasks on each of two well-known personal financial information sites.
•Sites studied:
–Finance.Yahoo.com
–Kiplinger.com
–Hereafter referred to only as "Site 1" and "Site 2". Don’t assume which is which.
•Tasks:
–Find the highest price in the past year for a share of .
–Find the mutual fund with the highest 3-year return.
•Order of presentation of the two sites was randomized.
•After completing (or at least attempting) the two tasks on a site, the user was presented with the questionnaire for their randomly selected condition.
•Each user completed the same questionnaire for both sites.

 
Data Analysis
•For each participant, an overall score was calculated for each website by averaging all of the ratings on the questionnaire that was used.
–All scales had been coded internally so that the "better" end corresponded to higher numbers.
–These were converted to percentages by dividing each score by the maximum score possible on that scale.
–For example, a rating of 3 on SUS was converted to a percentage by dividing that by 5 (the maximum score for SUS), giving a percentage of 60%.
•Special treatment for the "Words" condition since it did not involve rating scales:
–Before the study, we classified each of the words as being "Positive" or "Negative".
–Not grouped or identified as such to the participants.
–For each participant, an overall score was calculated by dividing the number of "Positive" words chosen by the total number of words that person selected.
–If someone selected 8 positive words and 10 words total, that yielded a score of 80%.
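
The scoring rules above can be sketched in Python as follows. This is only an illustration, not the authors' code: the function names rating_score and words_score are hypothetical, and the sketch assumes the ratings have already been coded so that higher numbers are better and the top of the scale equals scale_max.

```python
def rating_score(ratings, scale_max):
    """Average the coded ratings, then divide by the scale maximum.

    Assumes each rating is already coded so that higher is better
    and the best possible rating equals scale_max.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    mean_rating = sum(ratings) / len(ratings)
    return mean_rating / scale_max


def words_score(selected_words, positive_words):
    """Fraction of the selected words that were classified as positive."""
    if not selected_words:
        raise ValueError("at least one selected word is required")
    positives = sum(1 for word in selected_words if word in positive_words)
    return positives / len(selected_words)


# The SUS example from the text: a rating of 3 on a five-point scale.
sus_example = rating_score([3], scale_max=5)

# The Words example from the text: 8 positive words out of 10 selected.
# The word labels here are placeholders, not the actual card words.
positive = {f"pos{i}" for i in range(8)}
chosen = sorted(positive) + ["neg0", "neg1"]
words_example = words_score(chosen, positive)
```

Multiplying either result by 100 gives the percentage form used in the slides.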
 
Results
•Calculated frequency distributions for the ratings, converted to percentages, for:
–Each questionnaire
–Both websites

Bar Charts are used to compare the results. See http://www.upassoc.org/usability_resources/conference/2004/UPA-2004-TullisStetson.pdf

My Comments: This study will be a good benchmark and reference for my research.

 
Results: Summary
•All five questionnaires showed that Site 1 was significantly preferred over Site 2 (p<.01).
•The largest mean difference (74% vs. 38%) was found using the Words questionnaire, but this was also the questionnaire that yielded the greatest variability.

Analysis of Sub-samples
•Next we analyzed randomly selected sub-samples of the data at size 6, 8, 10, 12, and 14.
–20 random samples for each size
•For each sample, a t-test was conducted to determine whether the results showed that Site 1 was significantly better than Site 2 (the conclusion from the full dataset).
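
The sub-sampling procedure can be sketched roughly as below. Several details are assumptions, since the slides do not state them: the test is taken to be a paired t-test (each participant rated both sites), the critical value t_critical is supplied by the caller (for example, 2.015 is the one-tailed 5% value for 5 degrees of freedom, i.e. a sample of 6), and the names paired_t and subsample_accuracy are hypothetical.

```python
import math
import random
import statistics


def paired_t(site1_scores, site2_scores):
    """Paired t statistic for Site 1 minus Site 2 (assumed test type)."""
    diffs = [a - b for a, b in zip(site1_scores, site2_scores)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # No variation in the differences: the statistic is degenerate.
        if mean_diff == 0:
            return 0.0
        return math.copysign(math.inf, mean_diff)
    return mean_diff / (sd / math.sqrt(len(diffs)))


def subsample_accuracy(scores, size, t_critical, n_samples=20, seed=0):
    """Share of random sub-samples in which Site 1 beats Site 2 significantly.

    scores is a list of (site1_score, site2_score) pairs, one per participant.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(scores, size)  # draw participants without replacement
        site1 = [pair[0] for pair in sample]
        site2 = [pair[1] for pair in sample]
        if paired_t(site1, site2) > t_critical:
            hits += 1
    return hits / n_samples
```

Calling subsample_accuracy once per questionnaire and per sample size (6, 8, 10, 12, 14) would give the kind of accuracy figures discussed next, under the assumptions stated above.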
 
Analysis of Sub-samples
•Accuracy of the results increases as the sample size gets larger.
•With a sample size of only 6, all of the questionnaires yield accuracy of only 30-40%.
–60-70% of the time, at that sample size, you would fail to find a significant difference between the two sites.
•Accuracy of some of the questionnaires increases more quickly than that of others.
–SUS jumps up to about 75% accuracy at a size of 8.
 
Caveats
•Results were undoubtedly influenced by:
–The sites studied.
–The tasks used.
•We have only addressed the question of whether a given questionnaire was able to reliably distinguish between the ratings of one site vs. the other.
–Often you care more about how well the results help guide a redesign.
 

Conclusions
•One of the simplest questionnaires studied, SUS (with only 10 rating scales), yielded among the most reliable results across sample sizes.
–Also the only one whose questions all address different aspects of the user’s reaction to the website as a whole.
•For the conditions of this study, sample sizes of at least 12-14 participants are needed to get reasonably reliable results.

Source:
http://www.upassoc.org/usability_resources/conference/2004/UPA-2004-TullisStetson.pdf
