Wednesday, November 4, 2009

Nov 5 - Somervell, Heuristic Comparison Experiment (PhD dissertation)

Chapter 5
Heuristic Comparison Experiment

5.1 Introduction

Now that there is a set of heuristics tailored for the large screen information exhibit system class, this set can be compared to more established heuristics. The purpose of the comparison is to show the utility of the new heuristic set. The comparison needs to be fair so that the effectiveness of the new method can be determined accurately.
To assess whether the new set of heuristics provides better usability results than existing alternative sets, we conducted a comparison experiment in which each of three sets of heuristics was used to evaluate three separate large screen information exhibits. We then compared the results of each set through several metrics to determine the better evaluation methods for large screen information exhibits.

5.2 Approach

The following sections provide descriptions of the heuristics used, the comparison method, and the systems used in this experiment.

5.2.1 Heuristic Sets

We used three different sets of usability heuristics, each at a different level of specificity for application to large screen information exhibits, ranging from a set completely designed for this particular system class, to a generic set applicable to a wide range of interactive systems.

Nielsen

The least specific set of heuristics was taken from Nielsen and Mack [70]. This set is intended for use on any interactive system, though it is mostly targeted towards desktop applications. It has been in use since around 1990 and, despite years of testing and criticism, remains popular with usability practitioners. This set is not tailored for large screen information exhibits in any way and has no relation to the critical parameters for notification systems.

Visibility of system status
Match between system and real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation

Figure 5.1: Nielsen’s heuristics. General heuristics that apply to most interfaces. Found in [70].

Berry

The second heuristic set used in this comparison test was created for general notification systems by Berry [9]. This set is based on the critical parameters associated with notification systems [62], but only at a cursory level. This set is more closely tied to large screen information exhibits than Nielsen’s, in that large screen information exhibits are a subset of notification systems, but it is still generic with regard to the specifics of the LSIE system class.

Notifications should be timely
Notifications should be reliable
Notification displays should be consistent (within priority levels)
Information should be clearly understandable by the user
Allow for shortcuts to more information
Indicate status of notification system
Flexibility and efficiency of use
Provide context of notifications
Allow adjustment of notification parameters to fit user goals

Figure 5.2: Berry’s heuristics. Tailored more towards Notification Systems in general. Found in [9].

Somervell

The final heuristic set is the one created in this work, as reported in Chapter 4. This set is tailored specifically to large screen information exhibits, and thus would be the most specific method of the three when targeting this type of system. It is based on specific levels of the critical parameters associated with the LSIE system class.

5.2.2 Comparison Technique

To determine which of the three sets is better suited for formative evaluation of large screen information exhibits, we use a current set of comparison metrics that rely upon several measures of a method’s ability to uncover usability problems through an evaluation.
The comparison method we are using typically relies on five separate measures to assess the utility of a given UEM for one’s particular needs, but we will only use a subset in this particular comparison study. Hartson et al. report that thoroughness, validity, effectiveness, reliability, and downstream utility are appropriate measures for comparing evaluation methods [40].
Specifically, our comparison method capitalizes on thoroughness, validity, effectiveness, and reliability, omitting the downstream utility measure because long-term studies would be required to illustrate downstream utility.

Thoroughness

This measure gives an indication of a method’s ability to uncover a significant percentage of the problems in a given system. Thoroughness consists of a simple calculation of the number of problems uncovered by a single UEM divided by the total number of problems found by all three methods.
thoroughness = (# of problems found by target UEM) / (# of problems found by all methods)

Validity

Validity refers to the ability of a method to uncover the types of problems that real users would experience in day to day use of the system, as opposed to simple or minor problems. Validity is measured as the number of real problems found divided by the total number of real problems identified in the system.
validity = (# of problems found by target UEM) / (# of problems in the system)
The number of real problems in the system refers to the problem set identified through some standard method that is separate from the method being tested.

Effectiveness

Effectiveness combines the previous two metrics into a single assessment of the method. This measure is calculated by multiplying the thoroughness score by the validity score.
effectiveness = thoroughness × validity
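
To make these formulas concrete, the following Python sketch (ours, not from the dissertation; the identifiers and example numbers are hypothetical) computes the three metrics from problem sets represented as Python sets.

    # Minimal sketch: computing thoroughness, validity, and effectiveness from
    # problem sets represented as Python sets of problem identifiers.

    def thoroughness(found_by_uem, found_by_all):
        """Problems found by the target UEM / problems found by all methods."""
        return len(found_by_uem) / len(found_by_all)

    def validity(found_by_uem, real_problems):
        """Real problems found by the target UEM / real problems in the system."""
        return len(found_by_uem & real_problems) / len(real_problems)

    def effectiveness(found_by_uem, found_by_all, real_problems):
        """Thoroughness multiplied by validity."""
        return thoroughness(found_by_uem, found_by_all) * validity(found_by_uem, real_problems)

    # Hypothetical example: a UEM that finds 27 of the 28 problems found by any
    # method, measured against a real problem set of 33 problems.
    uem = {f"p{i}" for i in range(27)}
    everyone = {f"p{i}" for i in range(28)}
    real = {f"p{i}" for i in range(33)}
    print(thoroughness(uem, everyone), validity(uem, real), effectiveness(uem, everyone, real))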

Reliability

Reliability is a measure of the consistency of the results of several evaluators using the method. This is sometimes referred to as inter-rater reliability. Here it is treated as agreement between the usability problem sets produced by different people using a given method.
We calculate it both from the differences among all of the evaluators for a specific system and from the total number of agreements among the evaluators; thus two measures are used to provide a more robust measurement of the reliability of the heuristic sets:
reliability-d = difference among evaluators for a specific method
reliability-a = average agreement among evaluators for a specific method


For calculating reliability, Hartson et al. recommend using a method from Sears [81] that depends on the ratio of the standard deviation of the number of problems found to the average number found [40]. This measure of reliability is overly complicated for our current needs, so a more traditional measure that relies upon actual rater differences is used instead.

5.2.3 Systems

Three systems were used in the comparison study, providing a range of applications for which each heuristic set would be used in an analytic evaluation. The intent was to provide enough variability in the test systems to tease out differences in the methods.

1. Source Viewer
2. Plasma Poster
3. Notification Collage

Why Source Viewer?
The Source Viewer was chosen as a target system for this study because we wanted an example of a real system that has been in regular use for an extended period. We immediately thought of command and control situations. Potential candidates included local television stations, local air traffic control towers, electrical power companies, and telephone exchange stations. We finally settled on local television command and control after limited responses from the other candidates.

Why Plasma Poster?
We wanted to include the Plasma Poster because it is one of very few LSIE systems that has seen some success in terms of long-term usage and acceptance. It saw over a year of deployment in a large research laboratory, with usage data and user feedback reported in [20].
This lengthy deployment and data collection period provides ample evidence of typical usability problems. We can use the published reports as support for our problem sets and, coupled with developer feedback, effectively validate the problem set for this system.

Why Notification Collage?
We chose the Notification Collage as the third system for several reasons.
First, we wanted to increase the validity of any results we find. By using more systems, we get a better picture of the “goodness” of the heuristic sets, especially if we get consistent results across all three systems.
Second, we wanted to explicitly show that the heuristic set created in this work actually uncovers the issues that went into the creation process. In other words, since the Notification Collage was one of the systems that led to this heuristic set, using that set on the Notification Collage should uncover most of the issues with that system.
Finally, we chose the Notification Collage from the original five systems because we had the most developer feedback on that system, and, like the Plasma Poster, it has seen reasonable deployment and use.

5.2.4 Hypotheses

We have three main hypotheses to test in this experiment:

1. Somervell’s set of heuristics has a higher validity score for the Notification Collage.
We believed this was true because the Notification Collage was used in the creation of Somervell’s heuristics, thus those heuristics should identify most or all of the issues in the Notification Collage.

2. More specific heuristics have higher thoroughness, validity, and reliability measures.
We felt this was true because more specific methods are more closely related to the systems in this study. Indeed, in Chapter 3 we discussed how previous work suggests that system-class level heuristics would be best. This experiment illustrates that case for heuristic evaluation of large screen information exhibits.

3. Generic methods require more time for evaluators to complete the study.
This seems logical because a more generic heuristic set would require more interpretation and thought. Hence we felt that evaluators using Nielsen’s set would take longer to complete the system evaluations, which would provide further impetus for developing system-class UEMs.

5.2.5 Identifying Problem Sets

One problem identified in other UEM comparison studies involves the calculation of specific metrics that rely upon something referred to as the “real” problem set (see [40]). In most cases, this problem set is the union of the problems found by each of the methods in the comparison study. In other words, each UEM is applied in a standard usability evaluation of a system, and the “real” problem set is simply the union of the problems found by each of the methods.
This comparison study faced the same challenge. Instead of relying on evaluators to produce sets of problems from each method and then using the union of those problem sets as the “real” problem set, analysis and testing were performed on the target systems beforehand, and the problem reports from those efforts were used to establish a standard set of problems for each system.

Source Viewer Problem Set
To determine the problems experienced by the users of this system, a field study was conducted. Two interviews with the users of the large screen system were conducted, along with direct observation.

Plasma Poster Problem Set
Analytic evaluation augmented with developer feedback and literature review served as the method for determining the real problem set for the Plasma Poster. We employed the same claims analysis technique that we used in the creation process to identify typical usability tradeoffs for the Plasma Poster. After identifying the usability issues, we asked the developers of the system to verify the tradeoffs.

Notification Collage Problem Set
To validate the problem set for the Notification Collage, we contacted the developers of the system and asked them to check each tradeoff as it pertained to the behavior of real users. The developers were given a list of the tradeoffs found in our claims analysis (from Chapter 4) and asked to verify each tradeoff according to their observations of real user behavior.


5.3 Testing Methodology

This experiment involves a 3x3 mixed factors design. We have three levels of heuristics (Nielsen, Berry, and Somervell) and three systems (Source Viewer, Plasma Poster, and Notification Collage).
The heuristics variable is a between-subjects variable because each evaluator sees only one set of heuristics. The system variable is within-subjects because each participant sees all three systems. For example, evaluator 1 saw only Nielsen’s heuristics, but used those to evaluate all three systems.
We used a balanced Latin Square to minimize learning effects from system presentation order. Thus, we needed a minimum of 18 participants (6 per heuristic set) to ensure coverage of the systems in the Latin Square balancing.
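
As a rough sketch (ours, not the actual assignment procedure) of why six evaluators per heuristic set are needed: with three systems there are six possible presentation orders, and assigning one order to each of six evaluators in a group covers them all.

    # Sketch: enumerating the six presentation orders used to counterbalance the
    # three systems within one heuristic-set group (6 orders x 3 sets = 18 evaluators).
    from itertools import permutations

    systems = ["Source Viewer", "Plasma Poster", "Notification Collage"]
    orders = list(permutations(systems))
    assert len(orders) == 6

    for evaluator, order in enumerate(orders, start=1):
        print(f"Evaluator {evaluator}: " + " -> ".join(order))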

5.3.1 Participants

As shown in Table 5.1, we needed a minimum of 18 evaluators for this study.
Twenty-one computer science graduate students who had completed a course on Usability Engineering volunteered for participation as inspectors. Six participants were assigned to each heuristic set, to cover each of the order assignments. Three additional students volunteered and they were randomly assigned a presentation order.
These participants all had knowledge of usability evaluation, as well as analytic and empirical methods. Furthermore, each was familiar with heuristic evaluation. Some of the participants were not familiar with the claim structure used in this study, but they were able to understand the tradeoff concept immediately.
Unfortunately, one of the participants failed to complete the experiment. This individual apparently decided the effort required to complete the test was too much, and thus filled out the questionnaire using a set pattern.
This makes the final number of participants 20, with seven for Nielsen’s heuristics, seven for Berry’s heuristics, and six for Somervell’s heuristics.

5.3.2 Materials

Each target system was described in one to three short scenarios, and screen shots were provided to the evaluators. The goal was to provide the evaluators with a sense of the display and its intended usage.
This material is sufficient for the heuristic inspection technique according to Nielsen and Mack [70]. This setup ensured that each of the heuristic sets would be used with the same material, thereby reducing the number of random variables in the execution of this experiment.
A description of the heuristic set to be used was also provided to the evaluators. This description included a listing of the heuristics and accompanying text clarification. This clarification helps a person understand the intent and meaning of a specific heuristic, hopefully aiding in assessment. These descriptions were taken from [70] and [9] for Nielsen and Berry respectively.

Armed with the materials for the experiment, each evaluator then rated each of the heuristics on a 7-point Likert scale, based on whether or not they felt that the heuristic applied to a claim describing a design tradeoff in the interface. Thus they judged whether or not a specific heuristic applies to the claim, and how strongly.
Marks of four or higher indicate agreement that the heuristic applies to the claim; otherwise the evaluator is indicating disagreement that the heuristic applies.

5.3.3 Questionnaire

As mentioned earlier, the evaluators in this experiment provided their feedback through a Likert scale, with agreement ratings for each of the heuristics in the set. In addition to this feedback, each evaluator also rated the claim in terms of how much they felt it actually applied to the interface in question.
By indicating their level of agreement that the claim applies to the interface, evaluators give us feedback on whether usability experts actually think the claim is appropriate for the interface in question.
After rating each of the heuristics for a claim, we also asked each evaluator to rate the severity the claim would hold if it were indeed a usability problem in the interface.

5.3.4 Measurements Recorded

The data collected in this experiment consists of each evaluator’s rating of the claim applicability, each heuristic rating for an individual claim, and the evaluator’s assessment of the severity of the usability problem. This data was collected for each of the thirty-three claims across the three systems.
In addition to the above measures, we also collected data on the evaluator’s experience with usability evaluation, heuristics, and large screen information exhibits. This evaluator information was collected through survey questions before the evaluation was started.
After the evaluators completed the test, they recorded the amount of time they spent on the task. This was a self-reported value, as each evaluator worked at their own pace and in their own location.
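
For concreteness, the collected data might be organized as records like the following (a hypothetical sketch; the field names are ours, not the dissertation’s).

    # Hypothetical record types for the collected data; field names are illustrative.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class ClaimResponse:
        evaluator_id: int
        system: str                        # Source Viewer, Plasma Poster, or Notification Collage
        claim_id: int
        applicability: int                 # 1-7 Likert: how much the claim applies to the interface
        heuristic_ratings: Dict[str, int]  # 1-7 Likert rating for each heuristic in the assigned set
        severity: int                      # severity the claim would hold if it were a real problem

    @dataclass
    class EvaluatorInfo:
        evaluator_id: int
        heuristic_set: str                 # Nielsen, Berry, or Somervell
        experience: str                    # self-reported experience with usability evaluation
        minutes_spent: float               # self-reported time to complete the evaluation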

5.4 Results

Twenty-one evaluators provided feedback on 33 different claims across three systems. Each evaluator provided either 10 or 12 question responses per claim, depending on the heuristic set used (Nielsen’s set contains 10 heuristics whereas the others have 8; the claim applicability and severity ratings account for the remaining two responses). This means we have either 330 or 396 answers to consider, per evaluator.
Fortunately, this data was separable into manageable chunks, dealing with applicability, severity, and heuristic ratings; as well as evaluator experience levels and time to complete for each method.

5.4.1 Participant Experience

As for individual evaluator abilities, the average experience level with usability evaluation, across all three heuristic sets, was “amateur”. This means that overall, for each heuristic set, we had comparable evaluator experience.

5.4.2 Applicability Scores

To indicate whether or not a heuristic set applied to a given claim (or problem), evaluators marked their agreement with the statement “the heuristic applies to the claim”. Each of the heuristics was marked on a 7-point Likert scale, with each evaluator indicating their level of agreement with the statement.
Using this applicability measure, the responses were averaged for a single claim across all of the evaluators. Averaging across evaluators allows assessment of the overall “applicability” of the heuristic to the claim.
This applicability score is used to determine whether any of the heuristics applied to the issue described in the claim. If a heuristic received an average rating greater than or equal to five (at least “somewhat agree”), then that heuristic was considered to apply to the issue in the claim.
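
A minimal sketch (ours, with made-up ratings) of the applicability calculation: average a heuristic’s ratings across evaluators for one claim and compare the average against the cutoff of five.

    # Sketch with made-up ratings: average one heuristic's Likert ratings across
    # evaluators for a single claim, then apply the cutoff of 5.
    from statistics import mean

    CUTOFF = 5  # average >= 5 ("somewhat agree" or better) means the heuristic applies

    ratings = {
        "Heuristic A": [6, 5, 5, 6, 4, 5],   # hypothetical ratings from six evaluators
        "Heuristic B": [3, 4, 2, 4, 3, 3],
    }

    for heuristic, scores in ratings.items():
        applicability = mean(scores)
        verdict = "applies" if applicability >= CUTOFF else "does not apply"
        print(f"{heuristic}: average {applicability:.2f} -> {verdict}")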

Overall Applicability

Considering all 33 claims together (found in all three systems), one-way analysis of variance (ANOVA) indicates significant differences among the three heuristic sets for applicability (F(2, 855) = 3.0, MSE = 49.7, p < 0.05). Further pair-wise t-tests reveal that Somervell’s set of heuristics had significantly higher applicability ratings than both Berry’s (df = 526, t = 3.32, p < 0.05) and Nielsen’s sets (df = 592, t = 11.56, p < 0.05). In addition, Berry’s heuristics had significantly higher applicability scores than Nielsen’s set (df = 592, t = 5.94, p < 0.05).
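
For readers who want to reproduce this kind of analysis, a one-way ANOVA followed by pairwise t-tests can be run with SciPy roughly as follows (a sketch with stand-in data, not the study’s ratings).

    # Sketch with stand-in data: one-way ANOVA across three groups of
    # applicability ratings, followed by pairwise t-tests.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    somervell = rng.normal(5.5, 1.0, 100)   # made-up rating samples
    berry = rng.normal(5.0, 1.0, 100)
    nielsen = rng.normal(4.2, 1.0, 100)

    f_stat, p_value = stats.f_oneway(somervell, berry, nielsen)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

    pairs = [("Somervell vs Berry", somervell, berry),
             ("Somervell vs Nielsen", somervell, nielsen),
             ("Berry vs Nielsen", berry, nielsen)]
    for label, a, b in pairs:
        t_stat, p_value = stats.ttest_ind(a, b)
        print(f"{label}: t = {t_stat:.2f}, p = {p_value:.4f}")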

5.4.3 Thoroughness

Recall that thoroughness is measured as the number of problems found by a single method, divided by the number uncovered by all of the methods. This requires a breakdown of the total number of claims into the numbers for each system.
Plasma Poster has 14 claims, Notification Collage has eight claims, and the Source Viewer has 11 claims. We look at thoroughness measures for each system. To calculate the thoroughness measures for the data we have collected, we count the number of claims “covered” by the target heuristic set.
Here we are defining covered to mean that at least one of the heuristics in the set had an average agreement rating of at least five. Why five?
On the Likert scale, five indicates somewhat agree. If we require the average score across all of the evaluators to be greater than or equal to five for a single heuristic, we capture only those heuristics that truly apply to the claim in question.
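
Tying the coverage definition to the thoroughness formula, a short sketch (with hypothetical applicability scores) counts a claim as covered by a set when at least one of its heuristics averages five or higher, and divides by the number of claims covered by any set.

    # Sketch with hypothetical applicability scores: thoroughness from coverage.
    CUTOFF = 5

    # scores[set_name][claim_id] = average applicability score per heuristic (made-up)
    scores = {
        "Set X": {1: [5.2, 4.1], 2: [4.4, 6.0], 3: [3.8, 4.2]},
        "Set Y": {1: [5.5, 3.9], 2: [4.0, 4.3], 3: [4.9, 4.1]},
    }

    covered = {name: {claim for claim, heur in claims.items() if max(heur) >= CUTOFF}
               for name, claims in scores.items()}
    covered_by_any = set.union(*covered.values())

    for name, claims in covered.items():
        print(f"{name}: thoroughness = {len(claims)}/{len(covered_by_any)}"
              f" = {len(claims) / len(covered_by_any):.2f}")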

Overall Thoroughness
Across all three heuristic sets, 28 of 33 claims had applicability scores of at least five. Somervell’s heuristics had the highest thoroughness rating of the three heuristic sets with 96% (27 of 28 claims). Berry’s heuristics came next with a thoroughness score of 86% (24 of 28), and Nielsen’s heuristics had a score of 61% (17 of 28).

5.4.4 Validity

Validity measures the UEM’s ability to uncover real usability problems in a system [40].
Here the full set of problems in the system is used as the real problem set (as discussed in earlier sections).
As with thoroughness, the applicability scores determine the validity each heuristic set held for the three systems. As before, we used the cutoff value of five on the Likert scale to indicate applicability of the heuristic to the claim. An average rating of five or higher indicates that the heuristic applied to the claim in question.

Overall Validity
Similar to thoroughness, validity scores were calculated across all three systems. Out of 33 total claims, only 28 showed applicability scores of at least five across all three heuristic sets. Somervell’s heuristics had the highest validity, with 27 of 33 claims yielding applicability scores of at least five, for a validity score of 82%. Berry’s heuristics had the next highest validity, with 24 of 33 claims, for a validity score of 73%. Nielsen’s heuristics had the lowest validity score, with 17 of 33 claims, for a score of 52%.

5.4.5 Effectiveness

Effectiveness is calculated by multiplying thoroughness by validity. UEMs that have high thoroughness and high validity will have high effectiveness scores. A low score on either of these measures will reduce the effectiveness score.

Overall Effectiveness
Considering the effectiveness scores across all three systems reveals that Somervell’s heuristics had the highest effectiveness with a score of 0.79. Berry’s heuristics came next with a score of 0.62. Nielsen’s heuristics had the lowest overall effectiveness with a score of 0.31.
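
As a quick check, multiplying the rounded overall thoroughness and validity figures reported above reproduces these effectiveness scores to within rounding.

    # Quick check using the rounded overall figures; small discrepancies with the
    # reported scores are due to rounding of the intermediate percentages.
    for name, t, v in [("Somervell", 0.96, 0.82), ("Berry", 0.86, 0.73), ("Nielsen", 0.61, 0.52)]:
        print(f"{name}: {t} x {v} = {t * v:.2f}")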

5.4.6 Reliability – Differences

Recall that the reliability of each heuristic set is measured in two ways: one relying upon the actual differences among the evaluators, the other upon the average number of agreements among the evaluators.
Here we focus on the former. For example, Berry’s set has eight heuristics, so consider calculating the differences in the ratings for the first heuristic for the first claim in the Plasma Poster. This difference is found by taking the absolute difference between each pair of evaluator ratings, summing those differences, and dividing by the number of pairs (i.e., the average pairwise difference). Suppose one evaluator rated the first heuristic as a 6 (agree), another as a 4 (neutral), and a third as a 5 (somewhat agree). The pairwise differences are 2, 1, and 1, so the average difference is 4/3, or about 1.33.
We then averaged the differences for every heuristic on a given claim to get an overall difference score for that claim, with a lower score indicating higher reliability (zero difference indicates complete reliability). These average differences provide a measure for the reliability of the heuristic set.
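
A minimal sketch (ours) of this average-difference calculation, reproducing the 1.33 figure from the worked example above.

    # Sketch: average absolute pairwise difference among evaluator ratings for one
    # heuristic on one claim; lower is more reliable, zero means perfect agreement.
    from itertools import combinations

    def average_difference(ratings):
        pairs = list(combinations(ratings, 2))
        return sum(abs(a - b) for a, b in pairs) / len(pairs)

    print(average_difference([6, 4, 5]))   # (2 + 1 + 1) / 3 = 1.33...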

Overall Reliability Differences
Considering all 33 claims across the three systems gives an overall indication of the average differences for the heuristic sets. One-way ANOVA suggests significant differences among the three heuristic sets (F(2, 23) = 23.02, MSE = 0.84, p < 0.05).
Pair-wise t-tests show that Somervell’s heuristics had significantly lower average differences than both Berry’s heuristics (df = 14, t = 4.3, p < 0.05) and Nielsen’s heuristics (df = 16, t = 6.8, p < 0.05). No significant differences were found between Berry’s heuristics and Nielsen’s heuristics (df = 16, t = 1.43, p = 0.17), but Berry’s set had a slightly lower average difference (Berry: M = 2.02, SD = 0.21; Nielsen: M = 2.14, SD = 0.13).

5.4.7 Reliability – Agreement

In addition to the average differences, a further measure of reliability was calculated by counting the number of agreements among the evaluators, then dividing by the total number of possible agreements. This calculation provides a measure of the agreement rating for each heuristic.
For example, consider the previous three evaluators and their ratings (6, 5, and 4). The agreement rating in this case would be:
agreement = 0 / 3 = 0
because none of the evaluators agreed on the rating, but there were potentially three agreements (if they had all given the same rating). Averages across all of the claims for a given system were then taken. This provides an assessment of the average agreement for each heuristic as it pertains to a given system.
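
A matching sketch of the agreement calculation; the same ratings of 6, 5, and 4 give none of the three possible pairwise agreements.

    # Sketch: agreement as the fraction of evaluator pairs giving identical ratings.
    from itertools import combinations

    def agreement(ratings):
        pairs = list(combinations(ratings, 2))
        return sum(1 for a, b in pairs if a == b) / len(pairs)

    print(agreement([6, 5, 4]))   # 0.0 -- no pair agreed out of three possible
    print(agreement([5, 5, 4]))   # 0.33... -- one of three possible agreements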

Overall Agreement
Taking all 33 claims into consideration, one-way ANOVA indicates significant differences among the three heuristic sets for evaluator agreement (F(2, 23) = 6.31, MSE = 0.01, p = 0.01). Pair-wise t-tests show that both Somervell’s heuristics and Berry’s heuristics had significantly higher agreement than Nielsen’s set (df = 16, t = 2.99, p = 0.01 and df = 16, t = 3.7, p < 0.05, respectively). No significant differences were found between Somervell’s and Berry’s heuristics (df = 14, t = 0.46, p = 0.65).

5.4.8 Time Spent

Recall that we also asked the evaluators to report the amount of time (in minutes) they spent completing the evaluation. This measure is valuable in assessing the cost of each method in terms of effort required. It was anticipated that the time required would be similar across the methods.
Averaging reported times across evaluators for each method suggests that Somervell’s set required the least amount of time (M = 103.17, SD = 27.07), but one-way ANOVA reveals no significant differences (F(2, 17) = 0.26, p = 0.77). Berry’s set required the most time (M = 119.14, SD = 60.69) while Nielsen’s set (M = 104.29, SD = 38.56) required slightly more than Somervell’s.

5.5 Discussion

So what does all this statistical analysis mean? What do we know about the three heuristic sets? How have we supported or refuted our hypotheses through this analysis?

5.5.1 Hypotheses Revisited

1. Somervell’s set of heuristics will have a higher validity score for the Notification Collage.
2. More specific heuristics will have higher thoroughness, validity, and reliability measures.
3. Generic methods will require more time for evaluators to complete the study.


Hypothesis 1
For hypothesis one, we discovered that Somervell’s heuristics indeed held the highest validity score for the Notification Collage (see Figure 5.7).
However, this validity score was not 100%, as had been expected. What does this mean? It simply illustrates the differences among the evaluators who participated in this study: they did not think that any of the heuristics applied to one of the claims from the Notification Collage. It can be noted, though, that the applicability scores for that particular claim were very close to the cutoff level we chose for agreement (5 or greater on a 7-point scale).
Still, evidence suggests that hypothesis 1 holds.

Hypothesis 2
We find evidence to support this hypothesis based on the scores on each of the three measures: thoroughness, validity, and reliability. In each case, the more specific methods had better ratings than Nielsen’s heuristics.
Overall, one could argue that Somervell’s set of heuristics is most suitable for evaluating large screen information exhibits, though one must concede that Berry’s heuristics could also be used with some effectiveness.

Hypothesis 3
We did not find evidence to support this hypothesis. As reported, there were no significant differences in the times required to complete the evaluations for the three methods.
However, Somervell’s and Nielsen’s sets took about 15 fewer minutes, on average, to complete than Berry’s. This does not indicate that the more generic method (Nielsen’s) required more time.
So what would cause the evaluators to take more time with Berry’s method? Initial speculation suggests that this set’s use of terminology associated with Notification Systems [62] (see Figure 5.2 for the list of heuristics), including references to the critical parameters of interruption, reaction, and comprehension, could have increased the interpretation time required to understand each of the heuristics.

5.6 Summary

We have described an experiment to compare three sets of heuristics, representing different levels of generality/specificity, in their ability to evaluate three different LSIE systems. Information on the systems used, test setup, and data collection and analysis has been provided. This test was performed to illustrate the utility that system-class specific methods provide by showing how they are better suited to evaluation of interfaces from that class.
In addition, this work has provided important validation of the creation method used in developing these new heuristics.
We have shown that a system-class specific set of heuristics provides better thoroughness, validity, and reliability than more generic sets (like Nielsen’s). The implication is that, without great effort to tailor them, generic evaluation tools do not provide usability data as effective as that from a more specific tool.


Source: Somervell, Jacob. Developing Heuristic Evaluation Methods for Large Screen Information Exhibits Based on Critical Parameters. [Dissertation, PhD in Computer Science and Applications] Virginia Polytechnic Institute and State University. June 22, 2004.
