Thursday, October 8, 2009

Oct 9 - SUMI: Software Usability Measurement Inventory

The de facto industry standard evaluation questionnaire for assessing quality of use of software by end users.

What is SUMI?

The Software Usability Measurement Inventory is a rigorously tested and proven method of measuring software quality from the end user's point of view.
SUMI is a consistent method for assessing the quality of use of a software product or prototype, and can assist with the detection of usability flaws before a product is shipped.
It is backed by an extensive reference database embedded in an effective analysis and report generation tool.

Who should use SUMI?

SUMI is recommended to any organisation which wishes to measure the perceived quality of use of software, whether as a developer, a consumer of software, or a purchaser/consultant. SUMI is increasingly being used by software procurers to set quality of use requirements.
SUMI also assists managers in identifying the most appropriate software for their organisation. It is well documented that giving staff quality tools to work with contributes to their overall efficiency and to the quality of their work output.

Our customers have used SUMI effectively to:
* assess new products during product evaluation
* make comparisons between products or versions of products
* set targets for future application developments.

SUMI has been used specifically within development environments to:
* set verifiable goals for quality of use attainment
* track achievement of targets during product development
* highlight good and bad aspects of an interface.

SUMI is the de facto industry standard questionnaire for analysing users' responses to desktop software or software applications provided through the internet.

Why use SUMI?

SUMI is the only commercially available questionnaire for the assessment of the usability of software which has been developed, validated, and standardised on an international basis.
SUMI is available in a wide range of languages, and each language version is carefully translated and validated by native speakers of the language.
SUMI enables measurement of some of the user-orientated requirements expressed in the European Directive on Minimum Health and Safety Requirements for Work with Display Screen Equipment (90/270/EEC).
SUMI is mentioned in the ISO 9241 standard as a recognised method of testing user satisfaction.

What does SUMI look like?

SUMI consists of 50 statements, to each of which the user replies Agree, Don't Know, or Disagree.

Here are some example statements (item number and wording):
* Item 1: This software responds too slowly to inputs.
* Item 3: The instructions and prompts are helpful.
* Item 13: The way that system information is presented is clear and understandable.
* Item 22: I would not like to use this software every day.
You may also take a look at a sample questionnaire (UK wording) in PDF format.

How do I administer SUMI and how long does it take?

It takes a user about 3 minutes to fill out the questionnaire, perhaps a few minutes longer on the internet version.
One way of administering it is on paper: print out the SUMI form and get your user to make marks on the page. It takes an analyst about one minute to type each user's responses into a file for scoring by SUMISCO or to send to HFRG for scoring.
Alternatively, you may decide to go for the online, internet-based option.
You can also do a hybrid: serve your own HTML pages on your intranet and either send the results to HFRG for analysis, or purchase SUMISCO and analyse them yourself.
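To make the scoring step concrete, here is a minimal sketch in Python. The response file format, the 3/2/1 coding, and the choice of reverse-keyed items are all assumptions made for this example; real SUMI scoring is done by SUMISCO (or by HFRG) against the standardisation database, which this sketch does not attempt to replicate.

# Minimal sketch of scoring typed-in SUMI-style responses.
# Assumed (not the real SUMISCO) file format: one line per user,
# 50 characters, each "A" (Agree), "U" (Don't Know) or "D" (Disagree).

# Hypothetical set of negatively worded items (e.g. item 1, "This
# software responds too slowly to inputs."), reverse-keyed so that a
# higher number always means a more favourable response.
NEGATIVE_ITEMS = {1, 22}  # illustrative only

CODING = {"A": 3, "U": 2, "D": 1}    # positively worded items
REVERSED = {"A": 1, "U": 2, "D": 3}  # negatively worded items

def score_user(line):
    """Return a raw total score for one user's 50 responses."""
    total = 0
    for item_no, answer in enumerate(line, start=1):
        key = REVERSED if item_no in NEGATIVE_ITEMS else CODING
        total += key[answer]
    return total

with open("responses.txt") as f:  # hypothetical file name
    for user_no, line in enumerate(f, start=1):
        print(f"user {user_no}: raw total {score_user(line.strip())}")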

How many users do I need?

Online SUMI might require sample sizes of at least about 30 unless your respondents are well targeted.
However, we know that paper SUMI will give you reliable results with as few as 12 users. This is because you are able to control the quality of your user sample directly when administering SUMI on paper.
You can use fewer users if you wish, but be aware that your results may then be less representative of the true user population. In fact, SUMI has yielded useful information with user sample sizes of four or five.
However, this question is a bit like asking 'how long is a piece of string?' You should try to get as many users as you can within your timeframe.
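The arithmetic behind these recommendations is the standard error of a mean. SUMI scale scores are usually described as standardised to a mean of 50 and a standard deviation of 10; taking that figure as a working assumption, a quick calculation shows how the margin of error on a group's mean score shrinks as the sample grows:

# Rough margin of error for a mean SUMI scale score, assuming the
# population standard deviation is about 10 (the usual standardisation
# figure; treated here as an assumption, not a measured value).
import math

SD = 10.0
for n in (5, 12, 30):
    se = SD / math.sqrt(n)   # standard error of the mean
    half_width = 1.96 * se   # approximate 95% confidence half-width
    print(f"n = {n:2d}: mean accurate to about +/- {half_width:.1f} points")

On these assumptions, with 12 users the mean is pinned down to within roughly +/- 5.7 points; with 30 users it tightens to about +/- 3.6.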

Source: http://sumi.ucc.ie/whatis.html

3. SUMI Development

Work on SUMI started in late 1990. One of the work packages entrusted to the HFRG within the MUSiC project was to develop questionnaire methods of assessing usability. The objectives of this work package were:
* to examine the CUSI Competence scale, expand it, and extract further subscales if warranted by the evidence;
* to achieve an international standardisation database for the new questionnaire and to validate its use in commercial environments.

Both these objectives were achieved by the end of the project. The SUMI questionnaire was first published in 1993 and has been widely disseminated since then, both in Europe and in the United States.

3.1 Psychometric development

SUMI started with an initial item pool of over 150 items, assembled from previously reported studies (including many reviewed above), from discussions with actual end users about their experiences with information technology, and from suggestions given by experts in HCI and software engineers working in the MUSiC project. The items were examined for consistency of perceived meaning by getting 10 subject matter experts to allocate each item to content areas. Items were then rewritten or eliminated if they produced inconsistent allocations.

The questionnaire developers opted for a Likert scaling approach, both for historical reasons (the CUSI questionnaire was Likert-scaled) and because this is considered to be a natural way of eliciting opinions about a software product. Different types of scales in use in questionnaire design within HCI are discussed in Kirakowski and Corbett (1990). The implication is that each item is considered to have roughly similar importance, and that the strength of a user's opinion can be estimated by summing or averaging the individual ratings of strength of opinion for each item. Many items are used in order to overcome variability due to extraneous or irrelevant factors.
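A toy simulation (all numbers invented for illustration) makes that last point: if each item rating is the user's true opinion plus independent item-specific noise, the average over many items is far less noisy than any single item.

# Toy simulation: averaging many Likert items cancels item-specific
# noise. All numbers here are invented for illustration.
import random

random.seed(1)
TRUE_OPINION = 3.5            # hypothetical "true" attitude on a 1-5 scale
N_USERS, N_ITEMS = 1000, 10

def noisy_rating():
    # true opinion plus item-specific noise, clipped to the 1-5 range
    return min(5.0, max(1.0, TRUE_OPINION + random.gauss(0, 1)))

single = [noisy_rating() for _ in range(N_USERS)]
averaged = [sum(noisy_rating() for _ in range(N_ITEMS)) / N_ITEMS
            for _ in range(N_USERS)]

def spread(xs):  # standard deviation
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

print("spread of a single item:  ", round(spread(single), 2))
print("spread of a 10-item mean: ", round(spread(averaged), 2))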

This procedure produced the first questionnaire form, which consisted of 75 satisfactory items. The respondents had to decide whether they agreed strongly, agreed, didn't know, disagreed, or disagreed strongly with each of the 75 items in relation to the software they were evaluating.
Questionnaires were administered to 139 end users from a range of organisations (this was called sample 1). The respondents completed the inventory at their work place with the software they were evaluating near at hand. All these respondents were genuine end users who were using the software to accomplish task goals within their organisations for their daily work. The resulting matrix of inter-correlations between items was factor analysed and the items were observed to relate to a number of different meaningful areas of user perception of usability. Five to six groupings emerged which gave acceptable measures of internal consistency and score distributions.
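For readers unfamiliar with the technique, factor analysis looks for a small number of latent dimensions that account for the pattern of correlations between items. The sketch below shows the general shape of such an analysis on synthetic data (the real responses from the 139 users are, of course, not reproduced here); scikit-learn's FactorAnalysis stands in for whatever software the original analysts used.

# Sketch of factor-analysing item responses; synthetic data only.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 139, 75, 5

# Fabricate responses driven by a few latent factors plus noise.
latent = rng.normal(size=(n_users, n_factors))
loadings = rng.normal(size=(n_factors, n_items))
responses = latent @ loadings + rng.normal(scale=2.0, size=(n_users, n_items))

fa = FactorAnalysis(n_components=n_factors).fit(responses)

# fa.components_ holds the estimated loading of every item on each
# factor; items loading heavily on the same factor form a candidate
# subscale, analogous to the groupings described in the text.
for f in range(n_factors):
    top_items = np.argsort(-np.abs(fa.components_[f]))[:10]
    print(f"factor {f}: strongest items {sorted(top_items.tolist())}")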

Revisions were made to some items to centralise means and improve item variances, and then the ten best items with highest factor loadings were retained for each grouping. The number of groups of items was set to five. Items were revised in the light of critique from the industrial partners of MUSiC in order to reflect the growing trend towards Graphical User Interfaces. A number of users had remarked that it was difficult to make a judgement over five categories of response for some items. After some discussion, it was decided to change the response categories to three: Agree, Don't Know, and Disagree.

This produced the second questionnaire form of 50 items, in which each subscale was represented by 10 different items.
Typical items from this version are (item number and wording):
* Item 1: This software responds too slowly to inputs.
* Item 3: The instructions and prompts are helpful.
* Item 13: The way that system information is presented is clear and understandable.
* Item 22: I would not like to use this software every day.
A new sample (sample 2) of data from 143 users in a commercial environment was collected. Analysis of this sample of data showed that item response rates, scale reliabilities, and item-scale correlations were similar to or better than those in the first form's sample. Analyses of variance showed that the questionnaire differentiated between different software systems in the sample. After analysis of sample 2, a few items were revised slightly to improve their scale properties. The subscales were given descriptive labels by the questionnaire developers. These were:
* Efficiency
* Affect
* Helpfulness
* Control
* Learnability.

The precise meaning of these subscales is given in the SUMI manual, but in general, the Affect subscale measures (as before, in CUSI) the user's general emotional reaction to the software -- it may be glossed as Likeability. Efficiency measures the degree to which users feel that the software assists them in their work and is related to the concept of transparency. Helpfulness measures the degree to which the software is self-explanatory, as well as more specific things like the adequacy of help facilities and documentation. The Control dimension measures the extent to which the user feels in control of the software, as opposed to being controlled by the software, when carrying out the task. Learnability, finally, measures the speed and facility with which the user feels that they have been able to master the system, or to learn how to use new features when necessary.

At this time, a validity study was carried out in a company which requested to remain anonymous. In this company, two versions of the same editing software were installed for the use of the programmer teams. The users carried out the same kinds of tasks in the same environments, but each programmer team used only one of the software versions for most of the time. There were 20 users for version 1, and 22 for version 2. Analysis of variance showed a significant effect for the SUMI scales, for the difference between systems, and for the interaction between SUMI scales and systems. This last finding was important as it indicated that the SUMI scales were not responding en masse but were discriminating between differential levels of components of usability. Table 1 shows the means and standard deviations of the SUMI scales for the two software versions.
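For readers who want to see the shape of such an analysis, here is a sketch using statsmodels on fabricated long-format data (the study's actual scores are not reproduced, and the effects coded in are invented). Note one simplification: each user contributes a score on every scale, so the original analysis would properly be a repeated-measures design; the plain two-way ANOVA below is only an approximation of the idea.

# Sketch: two-way ANOVA with a SUMI-scale x software-version
# interaction, on fabricated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
scales = ["Efficiency", "Affect", "Helpfulness", "Control", "Learnability"]
rows = []
for version, n_users in (("v1", 20), ("v2", 22)):
    for _ in range(n_users):
        for scale in scales:
            # invented effect: version 2 scores higher on two scales,
            # which is what produces a scale-by-version interaction
            bump = 5 if version == "v2" and scale in ("Affect", "Efficiency") else 0
            rows.append({"version": version, "scale": scale,
                         "score": 50 + bump + rng.normal(scale=10)})
df = pd.DataFrame(rows)

model = smf.ols("score ~ C(scale) * C(version)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and interaction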

In fact, Version 2 was considerably more popular among its users, and the users of this version considered that they were able to carry out their tasks more efficiently with it. Learnability was not considered to be an issue by either group of users. The Data Processing manager of the company reviewed the results with the questionnaire development team and, in his opinion, the results confirmed informal feedback and observation. We later learnt that the company decided to switch to Version 2, not only on the basis of our results but also because that was the general feeling.

At this stage, the Global scale was also derived. The Global scale consists of 25 items out of the 50 which loaded most heavily on a general usability factor. Because of the larger number of items which contribute to the Global scale, reliabilities are correspondingly higher. The Global scale was produced in order to represent the single construct of perceived quality of use better than a simple average of all the items of the questionnaire.
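The relationship between scale length and reliability invoked here is the Spearman-Brown prophecy formula: lengthening a scale by a factor k raises its reliability rho to k*rho / (1 + (k - 1)*rho). With an illustrative (not published) reliability of 0.85 for a 10-item subscale, the formula predicts what 25 items of similar quality would achieve:

# Spearman-Brown prophecy formula: predicted reliability when a scale
# is lengthened by factor k. The 0.85 starting value is illustrative,
# not a published SUMI figure.
def spearman_brown(rho, k):
    return k * rho / (1 + (k - 1) * rho)

rho_10 = 0.85  # hypothetical reliability of a 10-item subscale
for n_items in (10, 25, 50):
    k = n_items / 10
    print(f"{n_items} items: predicted reliability {spearman_brown(rho_10, k):.2f}")

On these assumptions a 25-item scale comes out at about 0.93, which is why the reliabilities of the 25-item Global scale are correspondingly higher than those of the 10-item subscales.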
...................
...................
Source: http://sumi.ucc.ie/sumipapp.html
