Tuesday, November 10, 2009
I restructured Chapter 2 into two main sections.
2.1 Mobile Learning
2.2 Usability
Today, I wrote mainly on What is Usability?
Also wrote a little on What is Mobile Learning? and Usability Evaluation Methods.
Wanted to continue writing more on UEM but feeling slightly mentally tired.
Sunday, November 8, 2009
Week of Nov 2-7: Progress Report
Was in Jakarta from Nov 1 (Sun) till Nov 7 (Sat).
Got a lot of thorough reading done:
Rubin, Jeffrey. Handbook of Usability Testing.
Somervell, Jacob. Developing Heuristic Evaluation Methods for Large Screen Information Exhibits Based on Critical Parameters. [Dissertation, PhD in Computer Science and Applications]
Baker, Kevin F. Heuristic Evaluation of Shared Workspace Groupware based on the Mechanics of Collaboration. [Thesis, M.Sc.]
Good benchmark for me to write my chapters 1 and 2.
Through Rubin's book, I have come to understand more deeply the who, what, when, and how of usability testing.
Thursday, November 5, 2009
Nov 5,6 - Baker, Evaluation Methodology (MSc thesis)
Heuristic Evaluation of Shared Workspace Groupware
Chapter 4
Evaluation Methodology
Having formulated two sets of groupware heuristics from two inter-related frameworks in Chapter 3, the next logical task in my research is to validate these heuristics. To that extent, this chapter describes the two-step methodology used to carry out this objective.
The first step was a pilot study whereby the groupware heuristics were reviewed and subsequently modified prior to conducting the main research study, the second step. The main research study was set up to mirror the methodology and terminology employed by Nielsen to validate his heuristics (Nielsen and Molich 1990, Nielsen 1992).
For our purposes, two groups of inspectors with varying degrees of expertise in HCI and CSCW evaluated two groupware systems, a toy groupware editor called GroupDraw (Roseman and Greenberg 1994), and a very substantial commercial groupware system called Groove (www.groove.net). Their goal was to record as many usability problems as possible that violated the groupware heuristics.
4.1 Objective
As per my research goals (see Section 1.4), the objective of validating the heuristics is to:
“Demonstrate that the adapted heuristic evaluation for groupware remains a ‘discount’ usability technique by analyzing the ability of inspectors to identify problems in collaborative applications”.
As a means to execute the stated objective, I revisit Nielsen’s motivations for traditional heuristic evaluation, which he designed as a usability engineering methodology that can be done cheaply and quickly while producing useful results (Nielsen and Molich 1990, Mack and Nielsen 1994).
To briefly elaborate the three key terms:
1. Cheaply.
Nielsen’s heuristic evaluation places a low demand on resources. Nonexperts can carry out inspections; therefore, it is not confined to using more costly experts (the caveat: experts will produce better results) (Nielsen 1992).
This is practical because heuristic evaluation is, in practice, relatively easy to learn and to apply (Mack and Nielsen 1994). Consequently, extensive resources are not required for training. Finally, heuristic evaluation requires only the evaluator, the interface, paper, and pencil. No special equipment or facilities are necessary.
2. Quickly.
Heuristic evaluations do not require advance planning nor do they require a large amount of time to conduct (Nielsen and Molich 1990). Typically, the evaluation of an interface can be done within an hour or two for a simple interface and somewhat longer for more complex interfaces.
3. Useful results.
Despite performing heuristic evaluations with less control and formality than would be entailed by formal user testing, this technique provably produces useful results (Nielsen and Molich 1990, Mack and Nielsen 1994). As discussed in Chapter 2, only a few evaluators (3-5) can discover a majority (~75%) of interface bugs with varying severity. Fixing these bugs can in turn improve the usability of products.
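As a rough numerical sketch of the "3-5 evaluators find ~75% of problems" claim, the snippet below evaluates the cumulative discovery curve commonly attributed to Nielsen and Landauer, found(i)/N = 1 - (1 - λ)^i. The detection probability λ ≈ 0.31 is an assumed ballpark average, not a figure taken from this chapter.

```python
# Sketch (not from the thesis): expected fraction of usability problems found
# by i independent evaluators, using the cumulative model
#   found(i) / N = 1 - (1 - lam)**i
# lam is the per-evaluator probability of detecting any given problem;
# 0.31 is an assumed average, not a number reported in this chapter.
LAM = 0.31

def fraction_found(i: int, lam: float = LAM) -> float:
    """Expected proportion of all problems uncovered by i evaluators."""
    return 1 - (1 - lam) ** i

for i in range(1, 6):
    print(f"{i} evaluator(s): ~{fraction_found(i):.0%}")
# 3 evaluators -> ~67%, 5 evaluators -> ~84%, bracketing the ~75% figure above.
```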
Within my main research study, I look to validate heuristics applied to groupware evaluation to see if it remains a discount usability technique that is quick and cheap to perform while producing useful results.
To gauge this cost-effectiveness, I conducted the evaluation in the following manner:
• I chose as inspectors people with knowledge of human computer interaction, but limited knowledge of Computer Supported Cooperative Work.
• Prior to performing their evaluations, inspectors received a basic one hour training lecture and a packet of written materials (see Appendix A) explaining the heuristics.
• Inspectors self-selected the length of time and amount of effort they would put into completing the evaluation.
4.2 Pilot study
The groupware heuristics used by the inspectors to perform their evaluation of the two systems are presented in Appendix A.5 and A.6. A quick glance over the mechanics of collaboration heuristics reveals that their explanation and format differ from the same heuristics presented in the previous chapter.
Due to time constraints and my secondary focus on the Locales Framework heuristics, the pilot study was conducted with only the mechanics of collaboration heuristics.
4.2.1 Participants
Three professional HCI and groupware practitioners with 5 to 17 years of relevant experience were recruited to review the mechanics of collaboration heuristics. All three were also currently engaged in a project that involved re-designing the user interface for a real-time, shared workspace collaborative application.
4.2.2 Method
Each participant received a copy of the mechanics of collaboration heuristics similar to what is found in Chapter 3.
Each was asked to address the following two questions:
1. Do I understand the principles of the heuristic?
2. Would I be able to apply this heuristic as part of a heuristic evaluation?
The objective was to gain informal feedback regarding the comprehensibility of each heuristic and its suitability to be applied in the context of an evaluation. Feedback was gathered in the form of written comments on the original handouts and verbal comments recorded from interviews with each individual after their review.
4.2.3 Results
Overall, the reviewers’ feedback was positive regarding the content of each mechanics of collaboration heuristic. They expressed that beginning a heuristic with its supporting theory—derived from studies of face-to-face interactions—helped to establish its motivation. In addition, the reviewers considered the heuristics to be practical since they included examples of techniques employed by groupware systems to comply with each one.
Despite the positive feedback, two main areas of concerns surfaced in response to the reviewers answering the two aforementioned questions.
The first area of concern surrounds the heuristics’ comprehension. Each reviewer had to re-read each heuristic (in some cases, several times) in order to comfortably understand the concepts.
The second area of concern centered on the ability of the reviewers to apply the heuristics as part of a groupware evaluation. ...the reviewers concluded that it would be awkward to use the heuristics in their current format to effectively evaluate groupware systems.
4.2.4 Discussion
Prior to conducting the main research study, the mechanics of collaboration heuristics (and subsequently the Locales Framework heuristics) had to be revised to address the reviewers’ concerns. I did not want the bottleneck to ‘good’ results to be the inability of the inspectors to understand and apply the heuristics.
To address the first concern, all of the heuristics were re-written with the intent of making them an ‘easier read’. Domain specific terms were replaced with more common terms. Sentences were shortened. In addition, all new concepts introduced by the heuristics were clearly defined and spelt out.
The second concern regarding the practicality of the heuristics in their current format raised an interesting issue, one that I had not considered up until this point. It is one thing to have all the pertinent theory encapsulated in a set of heuristics, but it is another to ensure that the heuristics are packaged as a practitioner’s training tool to facilitate conducting a heuristic evaluation.
The naïve approach is to structure these heuristics as a checklist whereby a practitioner is presented with a series of items that can be systematically checked off during an inspection.
In response to the reviewers’ comments, the mechanics of collaboration heuristics were re-structured in an attempt to facilitate their role as a tool.
The first section of each heuristic was divided into two new sections “Theory” and “What this means for groupware”.
“Theory” provides the underlying principles behind each heuristic and is not critical to performing an evaluation.
The intent of the next section, “What this means for groupware”, is to provide all the pertinent information that an inspector should consult when performing a heuristic evaluation.
The final section, “Techniques used in groupware”, remained essentially the same as the “Typical groupware support” section from the early version of the heuristics.
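Purely as an illustration of the re-packaged format described above, here is a minimal sketch that models one heuristic as a record with the three sections ("Theory", "What this means for groupware", "Techniques used in groupware"). The class name, field names, and example content (a loose paraphrase of the "Provide protection" heuristic) are hypothetical and not taken from the thesis materials.

```python
# Illustrative only: one possible in-memory representation of a re-packaged heuristic.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GroupwareHeuristic:
    title: str
    theory: str                    # background principles; optional reading for inspectors
    meaning_for_groupware: str     # the material an inspector consults during an evaluation
    techniques_used_in_groupware: List[str] = field(default_factory=list)  # example compliance techniques

protection = GroupwareHeuristic(
    title="Provide protection",
    theory="In face-to-face work, social protocols keep people from interfering with each other's work.",
    meaning_for_groupware="Guard against users inadvertently altering or destroying work that others have done.",
    techniques_used_in_groupware=["object locking", "access permissions", "undo of remote changes"],
)
```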
4.2.5 Sanity check
To ensure that all comments had been adequately addressed, the same three professionals reviewed the revised mechanics of collaboration heuristics. All were comfortable that their comments had been addressed. They viewed the new heuristics as easier to read and understand. This set of heuristics and its training material is presented in Appendix A.5.
4.2.6 Locales Framework heuristics
Although the pilot study was conducted with the mechanics of collaboration heuristics, the findings were transferable to the Locales Framework heuristics. Consequently, the latter were re-written to address the issues in a similar manner to the mechanics of collaboration heuristics.
The only difference between the two sets of heuristics is the lack of a “Techniques used in groupware” section with the locales heuristics.
4.3 Main research study
Subsequent to revising both sets of heuristics in response to the pilot study findings, my next step was to see if these adapted heuristics could be used for “heuristic evaluation of groupware and remain a ‘discount’ usability technique”.
To do this, I analyze the ability of inspectors to identify problems in two collaborative applications.
4.3.1 Participants
To assess the practicality of the groupware heuristics, I looked at the ability of individuals with minimal training in HCI but with varying levels of knowledge in CSCW to apply them. We recruited several groups fitting these criteria and validated our demographics through a questionnaire.
Participants were categorized as two evaluator types: novice and regular.
• Novice evaluators were 16 students in their 3rd or 4th year of a University computer science program. All had completed one full course in HCI and were currently enrolled in a second senior-level advanced undergraduate HCI course. When asked, the majority indicated some experience with designing and evaluating graphical user interfaces. However, few had any substantive knowledge regarding CSCW interface design principles. Consequently, the group consisted of “novices” with respect to CSCW usability but not with respect to computers and HCI.
• Regular specialists were 2 professors and 9 students working on their graduate degrees in computer science. All had a history of research, applied work, and/or class work in groupware and CSCW, as well as conventional user interface design and evaluation. These individuals were labeled ‘regular specialists’ since, in contrast to the former group, they were knowledgeable of groupware fundamentals.
Except for the professors, all participants were students. Due to their limited availability, professional HCI/CSCW practitioners from industry were not employed as part of our research.
4.3.2 Materials
To help inspectors conduct their heuristic evaluation, we gave them a training packet, workstations, and the two groupware systems.
Training packet.
The training packet (Appendix A) consisted of two sets of groupware heuristics: one based on the mechanics of collaboration (A.5) and the other on the Locales Framework (A.6). These were revised versions in accordance with the pilot study findings.
The packet also contains the following forms:
• a consent form outlining the inspectors’ participation in the research study (A.1);
• a background information questionnaire to help ascertain each inspector’s knowledge and experience in the areas of HCI and CSCW (A.2);
• pre- and post-training feedback forms containing questions assessing the ease with which the inspectors were able to comprehend the heuristics before and after their training session (A.4); and
• problem reports designed in accordance with Nielsen (1994a) and Cox (1998) and used by the inspectors to capture their usability problems (A.7).
Workstations.
The novice evaluators conducted the heuristic evaluations of the two groupware systems on four PC workstations located in a single row in the undergraduate computer lab.
...Installation instructions (Appendix A.3) were provided on how to set-up the software.
Groupware systems.
As part of the study, the participants evaluated two quite different shared visual workspaces contained in two real-time groupware systems: GroupDraw and Groove.
GroupDraw is an object-oriented ‘toy’ drawing program built to show people how to program the GroupKit groupware toolkit (Roseman and Greenberg 1996).
Groove is a virtual space for real-time, small group interactions. Users create “shared spaces” to communicate and collaborate with one another. Changes made to a shared space by one participant are automatically synchronized with all other computers.
Its functionality includes:
1. Communication tools – live voice over the Internet, instant messaging, text-based chat, and threaded discussion.
2. Content sharing tools – shared files, pictures, and contacts.
3. Joint activity tools – co-Web browsing, multiple-user drawing and editing, group calendar.
4.3.3 Method
The heuristic evaluation of the GroupDraw and Groove interfaces followed Nielsen’s standard recommendations (refer to chapter 2 for details).
This involved administering an orientation session to each group of inspectors prior to the evaluation process.
However, I did not conduct a debriefing session due to the geographical separation between the inspectors and the researcher (myself).
Orientation session.
Prior to the session, each participant was asked to:
• sign the consent form (A.1);
• fill-out the background information questionnaire (A.2);
• read the detailed written description of the groupware heuristics (A.5 and A.6); and
• complete the pre-training feedback form (A.4).
Given the number of inspectors (27 in total) and their locations (22 in Calgary and 5 in Saskatoon), I conducted three separate 90-minute orientation sessions for the three audiences.
* Collected the signed consent form, the completed background questionnaire and pre-training feedback form from each inspector.
* Inspectors were handed a blank post-training feedback form (this form was identical to the pre-training feedback form) and an ample supply of blank problem reports for their heuristic evaluation of the systems.
* Conducted a one-hour training session on the proposed groupware heuristics. This included a review of the theory supporting each heuristic, how to apply them during an evaluation, and real-time groupware examples that illustrated compliance and non-compliance with each heuristic.
* Participants then filled out a post-training feedback form so that I could gauge how well they comprehended the groupware heuristics upon receiving the training.
* Provided the inspectors with an overview of the two groupware systems under test, GroupDraw and Groove.
* For GroupDraw, the inspectors were asked to evaluate only the shared workspace (Figure 4.1 bottom) and the Notes functionality (Figure 4.2) as the means for communicating with one another. For Groove, I asked the inspectors to evaluate the Outliner tool (Figure 4.3) as well as the text chat and audio link.
* General instructions were given regarding the process for conducting the heuristic evaluation. The novices were to inspect both systems with only the mechanics of collaboration heuristics. The regular specialists were to assess the groupware interfaces with both the mechanics of collaboration and Locales Framework heuristics.
Evaluation process.
Each inspector dictated when, where, and how to perform the evaluation. As with traditional heuristic evaluation, they could use the heuristics to systematically review all the functionality. Alternatively, they could walk through an imaginary task of their own making and verify how each step of the task complied with the heuristics.
Inspectors had complete control over the length of time and amount of effort they put into completing the evaluation.
In some instances the evaluation was performed in pairs; at other times it involved four or five inspectors working concurrently.
4.3.4 Data collection.
For each problem uncovered, the inspectors completed a separate problem report by recording a description of the problem, the violated heuristic, a severity rating, and an (optional) solution to the problem. They judged a ‘major’ severity rating as one that represented a significant obstacle to effective collaboration, while a ‘minor’ rating was one that could be worked around by the participant. A blank problem report is found in Appendix A.7.
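As a minimal sketch of the report structure just described (the actual form is in Appendix A.7 and is not reproduced here; the type and field names below are illustrative, not the thesis's):

```python
# Hypothetical sketch of the problem report fields described above.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    MINOR = "minor"   # can be worked around by the participant
    MAJOR = "major"   # significant obstacle to effective collaboration

@dataclass
class ProblemReport:
    description: str          # what went wrong during the inspection
    violated_heuristic: str   # e.g. "Provide protection"
    severity: Severity
    solution: Optional[str] = None   # optional suggested fix

report = ProblemReport(
    description="Another user's cursor is invisible, so remote edits appear without warning",
    violated_heuristic="Provide consequential communication of an individual's embodiment",
    severity=Severity.MAJOR,
)
```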
With respect to formulating severity ratings for each usability problem, Nielsen (1994a) states that evaluators have difficulty performing this step during the evaluation process since they are more focused on finding new usability problems. In addition, each evaluator will not find all the usability problems in the system; therefore, the severity ratings will be incomplete since they only reflect those problems found by the evaluator.
These original problem reports form the raw data for my analysis in the next chapter (see Appendix B for all problem reports).
4.4 Conclusion
This chapter reviewed the methodology for the pilot study and the subsequent changes to the groupware heuristics in preparation for the main study. Next, the main research study was introduced via a detailed description of its methodology. Of primary importance is that we used people that we felt were reasonable approximations of the actual practitioners we would expect to do groupware evaluations, and that our methodology echoed Nielsen’s traditional heuristic evaluation methodology.
Source: Baker, Kevin F. Heuristic Evaluation of Shared Workspace Groupware based on the Mechanics of Collaboration. [Thesis, M.Sc.] University of Calgary, Calgary, Alberta, Canada. May 2002.
Nov 5 - Baker, Groupware Heuristics (MSc thesis)
Chapter 3
Groupware Heuristics
Nielsen’s existing heuristics were derived from how well they explained usability problems resident within single-user systems. However, these heuristics are insufficient for groupware evaluation since they do not cater to groupware usability, which is distinctly different from single-user usability.
Single-user usability has been defined as the degree to which a system is effective, efficient, and pleasant to use, given a certain set of users and tasks (e.g., Shackel 1990).
Within this context, usability emphasizes ‘task work’: how a person performs the domain tasks and activities that result in the end products like drawings, documents, or models.
Groupware must also support task work to proceed effectively, efficiently, and pleasantly; however, these systems must go one step further and support teamwork—the work of working together—in order to be truly usable.
Thus we can define groupware usability as the degree to which a system supports both single-user usability and teamwork. While this definition is a starting point, we need a better understanding of what we actually mean by ‘support for teamwork’.
Teamwork involves activities ranging from low level mechanical acts necessary for almost any type of real time collaboration, to those that are social and affective in nature. If we want to apply the heuristic evaluation methodology to groupware, we need new heuristics that identify those interface aspects necessary for effective teamwork.
Using these, the inspector can then examine the interface to see if adequate support is provided to a group. The problem is that developing new heuristics to evaluate teamwork is complicated since—unlike the single-user interface literature that helped Nielsen set up his heuristics—there is no broad corpus of design guidelines specific to teamwork issues.
As a starting point, I have instead adapted two sources as the basis for two potential sets of groupware heuristics.
1. Mechanics of Collaboration (Gutwin and Greenberg 2000)
2. Locales Framework (Fitzpatrick 1998)
These sources were chosen because they contain some of the few theories or frameworks dealing with teamwork that were specifically created with groupware in mind.
Gutwin and Greenberg’s (2000) mechanics of collaboration identifies those activities that support the mechanics of how people interact over a shared visual workspace. These activities include group members communicating, providing assistance, coordinating activity, dividing labour, and monitoring each other’s work.
I believe that heuristics derived from these mechanics can be applied to task-based groupware (e.g., shared whiteboards, brainstorming tools, etc.), which comprise the majority of existing groupware systems today.
In contrast, Fitzpatrick’s (1998) Locales Framework deals more with the social issues surrounding the use of groupware. Because it addresses the social rather than the mechanical aspects of teamwork, I believe that heuristics derived from the Locales Framework are better suited for evaluating design subtleties of how social interaction is supported in general groupware environments that support a broad variety of tasks (such as Teamwave Workplace), rather than the mechanical aspects of task-specific groupware. While there is some overlap between heuristics suggested by the mechanics of collaboration and by the Locales Framework, they are inspired by quite different conceptual frameworks.
I divide the rest of this chapter into two main parts. The first part (Section 3.1) provides a brief description of the mechanics of collaboration, and how I adapted them into eight heuristics. The second part (Section 3.2) follows a similar structure; details of the Locales Framework are presented along with a brief description of the five heuristics stemming from this framework.
3.1 Mechanics of collaboration
3.1.1 Background
The mechanics of collaboration frames the low level actions and interactions that small groups of people do if they are to complete a collaborative task effectively. These mechanics are specific to shared workspace groupware. They include communication, coordination, planning, monitoring, assistance, and protection.
Gutwin developed this framework both from his experience building shared workspace systems and observing how they were used, and from extensive research on shared workspace usage and theory developed by others (e.g., Bly 1988, Tang and Leifer 1988, Tang 1991, Gutwin 1997, Gutwin and Greenberg 1999).
3.1.2 Heuristics for supporting the mechanics of collaboration
I believe that the framework can help inspectors identify usability problems of both groupware prototypes and existing systems. While the framework was developed with low-cost evaluation methods in mind, I had to adapt, restructure, and rephrase it as heuristics, and augment it with a few other important points omitted from the framework.
I should emphasize that this was done in cooperation with Gutwin and Greenberg, and there has been a mutual debate and evolution of both my heuristics and how the actual mechanics are being articulated by them over time.
The resulting eight mechanics of collaboration heuristics are listed in Table 3.1.
1. Provide the means for intentional and appropriate verbal communication
The prevalent form of communication between group members is verbal conversations. This establishes a common understanding of the task at hand. Support verbal exchanges or a viable alternative.
2. Provide the means for intentional and appropriate gestural communication
Allow explicit gestures and other visual actions to be visible since they are done in direct support of the conversation and help convey task information. Support illustration, emblem and deixis.
3. Provide consequential communication of an individual’s embodiment
A person’s body interacting with a computational workspace must unintentionally give off information to others. This is the primary mechanism for maintaining awareness and sustaining teamwork. Couple unintentional body language with both the workspace and its artifacts, and the conversation.
4. Provide consequential communication of shared artifacts (i.e. artifact feedthrough)
Make artifacts expressive so they give off information as they are manipulated. Support artifact feedthrough.
5. Provide protection
Protect users from inadvertently interfering with work that others are doing now, or altering or destroying work that they have done. Provide mechanisms to support social protocols and/or implement technical means to ensure protection.
6. Manage the transitions between tightly and loosely-coupled collaboration
Users should be able to focus their attention on different parts of the workspace when performing individual work in order to maintain awareness of others. Provide techniques for making relevant parts of the workspace visible.
7. Support people with the coordination of their actions
Support awareness of others’ activities to ensure people can coordinate their actions in order to avoid conflicts and make tasks happen in the correct sequence.
8. Facilitate finding collaborators and establishing contact
Provide information on potential collaborators so that they can be easily found and their availability for group work can be determined. Initiation of contact should be possible with minimal effort.
======Table 3.1 Mechanics of Collaboration heuristics=======
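To connect Table 3.1 back to the "checklist" packaging idea raised in Section 4.2.4 above, here is a purely illustrative sketch that treats the eight heuristic titles as a simple check-off list for an inspector; none of this code is part of the thesis materials.

```python
# Illustrative only: the eight mechanics of collaboration heuristics from
# Table 3.1 treated as a simple inspection checklist.
MECHANICS_HEURISTICS = [
    "Provide the means for intentional and appropriate verbal communication",
    "Provide the means for intentional and appropriate gestural communication",
    "Provide consequential communication of an individual's embodiment",
    "Provide consequential communication of shared artifacts (artifact feedthrough)",
    "Provide protection",
    "Manage the transitions between tightly and loosely-coupled collaboration",
    "Support people with the coordination of their actions",
    "Facilitate finding collaborators and establishing contact",
]

def print_checklist(heuristics):
    """Print the heuristics as a check-off list an inspector could work through."""
    for n, h in enumerate(heuristics, start=1):
        print(f"[ ] {n}. {h}")

print_checklist(MECHANICS_HEURISTICS)
# In a fuller tool, each checked-off violation would feed a problem report
# like the one sketched under Section 4.3.4 above.
```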
3.1.3 Summary
In summary, these eight heuristics look at the essential mechanical acts of collaboration. The physics of the everyday world, together with people’s natural actions, means that these mechanics ‘just happen’. In the computer world, they must be explicitly recognized as well as designed and implemented into the groupware system.
3.2 Locales Framework
3.2.1 Background
The Locales Framework is an approach to help people understand the nature of social activity and teamwork, and how a locale (or place) can support these activities (Fitzpatrick et al. 1996, Fitzpatrick 1998). More formally, the Locales Framework comprises five highly interdependent and overlapping aspects, as summarized below.
1. Locale foundations define a collection of people and artifacts (tools, information, objects) in relation to the central purpose of the social world. A locale within a social world is best considered as a “center” of collective purpose that is part of a dynamic and continually evolving system. Locales are fluid places with social meaning that may be mapped onto physical spaces.
2. Mutuality considers those interactions within locales that maintain a sense of shared place. Mutuality includes presence information that people and artifacts make available to others, and how people maintain awareness of that information. It also includes capabilities that entities have to transmit and receive information, and how entities choose from these capabilities to create a particular presence-awareness level.
3. Individual view over multiple locales acknowledges that individuals can be participating in many locales. Each person’s view is an aggregation of their views onto their locales of interest. People also manifest a “view intensity” onto particular locales as they vary their focus and participation across locales.
4. Interaction trajectories concern how courses of action evolve and change over time. In essence, people come into locales and social worlds with past experiences, plans, and actions.
5. Civic structures concern how interactions fit within a broader communal level. Civic structures can be considered a “meta-locale” that describe how social worlds and locales relate to one another, how people find their way between them, and how new locales are formed and old ones dissipated.
3.2.2 Heuristics for supporting the Locales Framework
The Locales Framework was not originally developed as a usability evaluation method, but rather as a means to understand the nature of social practices in the workaday world and to help inform groupware design. Regardless, this framework seems amenable to being expanded into a set of groupware heuristics, for several reasons.
First, this is a general framework for understanding the fundamental aspects of teamwork. It describes a small set of inter-dependent perspectives of the characteristics of teamwork, where each could be recast as a heuristic.
Second, it has been validated as a way to understand existing work practices, and to motivate the design of new systems.
Greenberg et al. (1999) first introduced the Locales Framework heuristics. Based on the text and descriptions found in Chapters 1 and 8 of Fitzpatrick (1998), each of the aforementioned aspects was rewritten as a separate heuristic asking whether a groupware’s interface afforded certain social phenomena. Table 3.2 presents a brief summary of these heuristics.
1. Provide centers (locales)
Provide virtual locales as the site, means and resources for a group to pursue team and task work. The system should collect people, artifacts and resources in relation to the central purpose of the social world.
2. Provide awareness (mutuality) within locales
People and artifacts must make presence information available to others. People must be aware of others and their activities as they evolve.
3. Allow individual views
Individuals should be able to adapt their own idiosyncratic view of the locale or aggregate multiple locales in a way that reflects their responsibilities, activities, and interests.
4. Allow people to manage and stay aware of their evolving interactions over time
People must be able to manage and stay aware of their evolving interactions. This involves control over past, present and future aspects of routine and non-routine work.
5. Provide a way to organize and relate locales to one another (civic structures)
Locales are rarely independent of one another: people need a way to structure the locales in a meaningful way, to find their way between locales, to create new locales, and to remove old ones.
======Table 3.2 Locales Framework heuristics======
3.3 Conclusion
In support of my research goal, I have developed two sets of heuristics for the purpose of identifying usability problems in groupware.
I would not claim that all groupware applications should support all heuristics. Rather, the heuristics can be used to suggest areas of system strengths, and also areas of weakness that might need to be compensated for in other ways.
While these heuristics are tentative, I believe they are good candidates. Unlike Nielsen’s heuristics, each set is derived from a well-defined theory or framework of group work.
Source: Baker, Kevin F. Heuristic Evaluation of Shared Workspace Groupware based on the Mechanics of Collaboration. [Thesis, M.Sc.] University of Calgary, Calgary, Alberta, Canada. May 2002.
Groupware Heuristics
Nielsen’s existing heuristics were derived from how well they explained usability problems resident within single-user systems. However, these heuristics are insufficient for groupware evaluation since they do not cater to groupware usability, which is distinctly different from single-user usability.
Single-user usability has been defined as the degree to which a system is effective, efficient, and pleasant to use, given a certain set of users and tasks (e.g., Shackel 1990).
Within this context, usability emphasizes ‘task work’: how a person performs the domain tasks and activities that result in the end products like drawings, documents, or models.
Groupware must also support task work to proceed effectively, efficiently, and pleasantly; however, these systems must go one step further and support teamwork—the work of working together—in order to be trulyusable.
Thus we can define groupware usability as the degree to which a system supports both single-user usability and teamwork. While this definition is a starting point, we need a better understanding of what we actually mean by ‘support for teamwork’.
Teamwork involves activities ranging from low level mechanical acts necessary for almost any type of real time collaboration, to those that are social and affective in nature. If we want to apply the heuristic evaluation methodology to groupware, we need new heuristics that identify those interface aspects necessary for effective teamwork.
Using these, the inspector can then examine the interface to see if adequate support is provided to a group. The problem is that developing new heuristics to evaluate teamwork is complicated since—unlike the single-user interface literature that helped Nielsen set up his heuristics—there is no broad corpus of design guidelines specific to teamwork issues.
As a starting point, I have instead adapted two sources as the basis for two potential sets of groupware heuristics.
1. Mechanics of Collaboration (Gutwin and Greenberg 2000)
2. Locales Framework (Fitzpatrick 1998)
These sources were chosen because they contain some of the few theories or frameworks dealing with teamwork that were specifically created with groupware in mind.
Gutwin and Greenberg’s (2000) mechanics of collaboration identifies those activities that support the mechanics of how people interact over a shared visual workspace. These activities include group members communicating, providing assistance, coordinating activity, dividing labour, and monitoring each other’s work.
I believe that heuristics derived from these mechanics can be applied to task-based groupware (e.g., shared whiteboards, brainstorming tools, etc.), which comprise the majority of existing groupware systems today.
In contrast, Fitzpatrick’s (1998) Locales Framework deals more with the social issues surrounding the use of groupware. Because they define social versus mechanical aspects of teamwork, I believe that heuristics derived from the Locales Framework are better suited for evaluating design subtleties of how social interaction is supported in general groupware environments that support a broad variety of tasks (such as Teamwave Workplace) rather than the mechanical aspects of task-specific groupware. While there is some overlap between heuristics suggested by the mechanics of collaboration and by the Locales Framework, they are both inspired by quite different conceptual frameworks.
I divide the rest of this chapter into two main parts. The first part (Section 3.1) provides a brief description of the mechanics of collaboration, and how I adapted them into eight heuristics. The second part (Section 3.2) follows a similar structure; details of the Locales Framework are presented along with a brief description of the five heuristics stemming from this framework.
3.1 Mechanics of collaboration
3.1.1 Background
The mechanics of collaboration frames the low level actions and interactions that small groups of people do if they are to complete a collaborative task effectively. These mechanics are specific to shared workspace groupware. They include communication, coordination, planning, monitoring, assistance, and protection.
Gutwin developed this framework both from his experience building shared workspace systems and how they were used, and from an extensive research of shared workspace usage and theory developed by others (e.g., Bly 1988, Tang and Leifer 1988, Tang 1991, Gutwin 1997, Gutwin and Greenberg 1999).
3.1.2 Heuristics for supporting the mechanics of collaboration
I believe that the framework can help inspectors identify usability problems of both groupware prototypes and existing systems. While the framework was developed with low-cost evaluation methods in mind, I had to adapt, restructure, and rephrase it as heuristics, and augment it with a few other important points omitted from the framework.
I should emphasize that this was done in cooperation with Gutwin and Greenberg, and there has been a mutual debate and evolution of both my heuristics and how the actual mechanics are being articulated by them over time.
The resulting eight mechanics of collaboration heuristics are listed in Table 3.1.
1. Provide the means for intentional and appropriate verbal communication
The prevalent form of communication between group members is verbal conversations. This establishes a common understanding of the task at hand. Support verbal exchanges or a viable alternative.
2. Provide the mean for intentional and appropriate gestural communication
Allow explicit gestures and other visual actions to be visible since they are done in direct support of the conversation and help convey task information. Support illustration, emblem and deixis.
3. Provide consequential communication of an individual’s embodiment
A person’s body interacting with a computational workspace must unintentionally give off information to others. This is the primary mechanism for maintaining awareness and sustaining teamwork. Couple unintentional body language with both the workspace and its artifacts, and the conversation.
4. Provide consequential communication of shared artifacts (i.e. artifact feedthrough)
Make artifacts expressive so they give off information as they are manipulated. Support artifact feedthrough.
5. Provide protection
Protect users from inadvertently interfering with work that others are doing now, or altering or destroying work that they have done. Provide mechanisms to support social protocols and/or implement technical means to ensure protection.
6. Manage the transitions between tightly and loosely-coupled collaboration
Users should be able to focus their attention on different parts of the workspace when performing individual work in order to maintain awareness of others. Provide techniques for making relevant parts of the workspace visible.
7. Support people with the coordination of their actions
Support awareness of others’ activities to ensure people can coordinate their actions in order to avoid conflicts and make tasks happen in the correct sequence.
8. Facilitate finding collaborators and establishing contact
Provide information on potential collaborators so that they can be easily found and their availability for group work can be determined. Initiation of contact should be possible with minimal effort.
======Table 3.1 Mechanics of Collaboration heuristics======
3.1.3 Summary
In summary, these eight heuristics look at the essential mechanical acts of collaboration. The physics of the everyday world as well as people’s natural actions means these mechanics ‘just happen’. In the computer world, they must be explicitly recognized as well as designed and implemented into the groupware system.
3.2 Locales Framework
3.2.1 Background
The Locales Framework is an approach to help people understand the nature of social activity and teamwork, and how a locale (or place) can support these activities (Fitzpatrick et al. 1996, Fitzpatrick 1998). More formally, the Locales Framework comprises five highly interdependent and overlapping aspects, as summarized below.
1. Locale foundations define a collection of people and artifacts (tools, information, objects) in relation to the central purpose of the social world. A locale within a social world is best considered as a “center” of collective purpose that is part of a dynamic and continually evolving system. Locales are fluid places with social meaning that may be mapped onto physical spaces.
2. Mutuality considers those interactions within locales that maintain a sense of shared place. Mutuality includes presence information that people and artifacts make available to others, and how people maintain awareness of that information. It also includes capabilities that entities have to transmit and receive information, and how entities choose from these capabilities to create a particular presence-awareness level.
3. Individual view over multiple locales acknowledges that individuals can be participating in many locales. Each person’s view is an aggregation of their views onto their locales of interest. People also manifest a “view intensity” onto particular locales as they vary their focus and participation across locales.
4. Interaction trajectories concern how courses of action evolve and change over time. In essence, people come into locales and social worlds with past experiences, plans, and actions.
5. Civic structures concern how interactions fit within a broader communal level. Civic structures can be considered a “meta-locale” that describe how social worlds and locales relate to one another, how people find their way between them, and how new locales are formed and old ones dissipated.
3.2.2 Heuristics for supporting the Locales Framework
The Locales Framework was not originally developed as a usability evaluation method, but rather as a means to understand the nature of social practices in the workaday world and to help inform groupware design. Regardless, this framework seems amenable to being recast as a set of groupware heuristics for several reasons.
First, this is a general framework for understanding the fundamental aspects of teamwork. It describes a small set of inter-dependent perspectives of the characteristics of teamwork, where each could be recast as a heuristic.
Second, it has been validated as a way to understand existing work practices, and to motivate the design of new systems.
Greenberg et al. (1999) first introduced the Locales Framework heuristics. Based on the text and descriptions found in Chapters 1 and 8 of Fitzpatrick (1998), each of the aforementioned aspects was rewritten as a separate heuristic asking whether a groupware’s interface afforded certain social phenomena. Table 3.2 presents a brief summary of these heuristics.
1. Provide centers (locales)
Provide virtual locales as the site, means and resources for a group to pursue team and task work. The system should collect people, artifacts and resources in relation to the central purpose of the social world.
2. Provide awareness (mutuality) within locales
People and artifacts must make presence information available to others. People must be aware of others and their activities as they evolve.
3. Allow individual views
Individuals should be able to adapt their own idiosyncratic view of the locale or aggregate multiple locales in a way that reflects their responsibilities, activities, and interests.
4. Allow people to manage and stay aware of their evolving interactions over time
People must be able to manage and stay aware of their evolving interactions. This involves control over past, present and future aspects of routine and non-routine work.
5. Provide a way to organize and relate locales to one another (civic structures)
Locales are rarely independent of one another: people need a way to structure the locales in a meaningful way, to find their way between locales, to create new locales, and to remove old ones.
======Table 3.2 Locales Framework heuristics======
3.3 Conclusion
In support of my research goal, I have developed two sets of heuristics for the purpose of identifying usability problems in groupware.
I would not claim that all groupware applications should support all heuristics. Rather, the heuristics can be used to suggest areas of system strengths, and also areas of weakness that might need to be compensated for in other ways.
While these heuristics are tentative, I believe they are good candidates. Unlike Nielsen’s heuristics, each set is derived from a well-defined theory or framework of group work.
Source: Baker, Kevin F. Heuristic Evaluation of Shared Workspace Groupware based on the Mechanics of Collaboration. [Thesis, M.Sc.] University of Calgary, Calgary, Alberta, Canada. May 2002.
Labels: Baker, heuristic evaluation, usability criteria
Wednesday, November 4, 2009
Nov 5 - Baker, Heuristic Evaluation of Shared Workspace Groupware (MSc thesis)
Kevin F. Baker
Comments: My interest in Baker's work stems from having encountered it indirectly in other people's dissertations and research.
Publications from this Research
An earlier version of the mechanics of collaboration heuristics (similar to Chapter 3) has appeared in the following peer-reviewed publication:
Baker, K., Greenberg, S. and Gutwin, C. (2001) Heuristic Evaluation of Groupware Based on the Mechanics of Collaboration. In M. Little and L. Nigay (Eds) Engineering for Human-Computer Interaction, LNCS Vol 2254, pp. 123-139.
The methodology, analysis, and results from this research study (Chapters 4 and 5) have been summarized in the report listed below. This report has been peer-reviewed and accepted for the upcoming ACM Computer Supported Cooperative Work conference (CSCW 2002).
Baker, K., Greenberg, S. and Gutwin, C. (2002) Empirical Development of a Heuristic Evaluation Methodology for Shared Workspace Groupware. Report 2002-700-03, Department of Computer Science, University of Calgary, Alberta, Canada.
Abstract
Despite the increasing availability of groupware, most systems are not widely used. One main reason is that groupware is difficult to evaluate. In particular, there are no discount usability evaluation methodologies that can discover problems specific to teamwork.
In this research study, I adapt Nielsen’s heuristic evaluation methodology, designed originally for single user applications, to help inspectors rapidly, cheaply, and effectively identify usability problems within groupware systems.
Specifically, I take Gutwin and Greenberg’s (2000) mechanics of collaboration and restate them as heuristics for the purposes of discovering problems in shared visual work surfaces for distance-separated groups.
As a secondary objective, I revise existing Locales Framework heuristics and assess their compatibility with the mechanics.
I evaluate the practicality of both sets of heuristics by having individuals with varying degrees of HCI and CSCW expertise use them to uncover usability problems in two groupware systems. The results imply that practitioners can effectively inspect and evaluate groupware with the mechanics of collaboration heuristics, using them to identify obstacles to real-time interactions over shared workspaces.
The Locales Framework heuristics are not as promising: while inspectors do identify problems inhibiting groupware acceptance, their practicality is limited and they require further improvements.
Chapter 1
Introduction
1.2 A brief survey of single-user evaluation techniques
Research in HCI has developed a multitude of evaluation techniques for analyzing and then improving the usability of conventional single user interfaces. Each methodology highlights different usability issues and identifies different types of problems; therefore, evaluators can choose and mix an appropriate technique to fit the needs and nuances of their situation (McGrath 1996).
There are three primary categories that have been used to distinguish these different types of methods: user observations, field studies, and interface inspections.
1.2.1 User observations
Techniques in this category are conducted in a lab, ideally using a representative sample of the eventual users performing tasks that depict how the product will be used in the “real” world. Evaluators uncover problems, called ‘usability bugs’, by observing the participants completing the tasks with the interface under evaluation.
User observation methodologies include controlled experiments and usability testing.
Controlled experiments.
Controlled experiments are used to establish a cause-and-effect relationship; it must be shown with certainty that the variation of experimental factors (i.e. the independent variables), and only those factors, could have caused the effect observed in the data (i.e., the dependent variable).
Rigorous control is used in these experiments to ensure that all other uncontrolled variables (i.e., confounding variables) do not affect the results and their interpretation.
Usability testing.
The goal of usability testing is to identify and rectify usability deficiencies. This is in conjunction with the intent to create products that are easy to learn, satisfying to use, and that provide high utility and functionality for the users (Rubin 1994).
Designers can manipulate the design of the product to allow them to see how particular features encourage or discourage usability. Participants can do several perhaps unrelated tasks that allow an evaluator to see how the human computer system performs over a broad set of expected uses. The product itself can be changed as the test progresses. For example, if pilot testing reveals certain problems, then the product can be modified midway to correct them.
When the product is tested with one individual, the participant is encouraged to think aloud during the test. This involves users talking out loud while they are performing a particular task in order to reflect cognitive processes. An evaluator observes the participant performing the task in question by focusing on occurrences such as errors made and difficulties experienced.
The information collected can then be applied to remedy the observed usability problems by going through another design iteration of the product, eventually leading to another usability test.
1.2.2 Field studies
A significant problem with performing an evaluation within the laboratory is the failure to account for conditions, context, and tasks that are central to the system’s real world use. Part of this failure stems from the fact that many developers of systems often have only partial or naïve knowledge of the “real world” setting where the end system will be used.
Field studies allow us to study systems in use on real tasks in real work settings, and to observe or discover important factors that are not easily found in a laboratory setting.
Two field study techniques are ethnography and contextual inquiry.
Ethnography
Ethnography is a naturalistic methodology grounded in sociology and anthropology (Bentley et al. 1992, Hughes et al. 1994, Randall 1996). Its premise is that human activities are socially organized; therefore, it looks into patterns of collaboration and interaction.
Randall (1996) stresses four features of ethnography that make it distinct as a method:
1. Naturalistic: involves studying real people and their activities within their natural environment. Only by studying work under these circumstances can one rightfully inform the system’s design.
2. Prolonged: it takes time to form a coherent view of what is going on especially for a complex domain.
3. Seeks to elicit the social world from the point of view of those who inhabit it: the appropriate level of analysis is the significance of the behaviour and not the behaviour itself.
4. Data resists formalization: the methodology stresses the importance of context; therefore, there is no ‘right’ data to be collected.
Data is gathered by the ethnographer observing and recording participants in their environment as they go about their work activities using the technology and tools available to them. This includes focusing on social relationships and how they affect the nature of work. To understand what the culture is doing, the ethnographer must immerse themselves within the cultural framework.
The goal of an ethnographic study for system design is to identify routine practices, problems, and possibilities for development within a given activity or setting. The data gathered usually takes the form of field notes but can be supplemented by audio and video data.
Contextual inquiry
To facilitate designing products, contextual inquiry employs an interview methodology to gain knowledge of what people do within their real world context (Holtzblatt and Beyer 1996, 1999).
Specifically, this is accomplished by first conducting interviews through observations and discussions with users as they work. Target users are representatives of those for whom the system is being developed.
1.2.3 Inspection methods
Inspection methods have evaluators ‘inspect’ an interface for usability bugs according to a set of criteria, usually related to how individuals see and perform a task. These methods use judgement as a source of feedback when evaluating specific elements of a user interface (Mack and Nielsen 1994). Inspection techniques include heuristic evaluations, task-centered walkthroughs, pluralistic walkthroughs, and cognitive walkthroughs.
Heuristic evaluation
Heuristic evaluation is a widely accepted discount evaluation method for diagnosing potential usability problems in user interfaces (Mack and Nielsen 1994, Nielsen 1992,1993, 1994a,b).
With this methodology, a small number of usability experts visually inspect an interface and judge its compliance with recognized usability principles (the “heuristics”) (Nielsen 1992, 1993, 1994a).
Heuristics are general rules used to describe common properties of usable interfaces (Nielsen 1994a). During a heuristic evaluation, heuristics help evaluators focus their attention on aspects of an interface that are often trouble spots, making detection of usability problems easier. Noncompliant aspects of the interface are captured as interface bug reports, where evaluators describe the problem, its severity, and perhaps even suggestions of how to fix it.
Through a process called results synthesis, these raw usability problem reports are then transformed into a cohesive set of design recommendations that are passed on to developers (Cox 1998).
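As a side note for my own understanding: here is a minimal Python sketch of what a problem report and a crude results-synthesis step might look like. The record fields and the grouping rule are my own assumptions for illustration, not a format prescribed by Nielsen, Cox, or Baker.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ProblemReport:
    evaluator: str
    heuristic: str        # which heuristic the interface violates
    description: str      # what the evaluator observed
    severity: int         # e.g. 0 (cosmetic) .. 4 (catastrophic)
    suggestion: str = ""  # optional fix idea

def synthesize(reports):
    # Crude stand-in for results synthesis: group raw reports by the
    # violated heuristic so duplicates can be merged into recommendations.
    grouped = defaultdict(list)
    for r in reports:
        grouped[r.heuristic].append(r)
    return grouped

reports = [
    ProblemReport("E1", "Visibility of system status", "No feedback while a file saves", 3),
    ProblemReport("E2", "Visibility of system status", "Save gives no progress indication", 3),
    ProblemReport("E1", "Error prevention", "Delete has no confirmation step", 4),
]
for heuristic, items in synthesize(reports).items():
    print(heuristic, "->", len(items), "report(s)")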
Cognitive walkthroughs
Cognitive walkthroughs (Wharton 1994) are intended to evaluate the design of an interface for ease of learning, particularly by exploration. This is an extension of a model of learning by exploration proposed by Polson and Lewis (1990). The model is related to Norman’s theory of action that forms the theoretical foundation for his work on cognitive engineering (Norman 1988). Cognitive walkthroughs also incorporate the construction-integration model developed by Kintsch (1988).
These ideas help the evaluators examine how the interface guides the user to generate the correct goals and sub goals to perform the required task, and to select the necessary actions to fulfill each goal.
Task-centered walkthroughs
The task-centered walkthrough is a discount usability variation of cognitive walkthroughs.
It was developed as one step in the task-centered design process (Lewis and Rieman 1994). This process looks to involve end users in the design process and provide context to the evaluation of the interface in question.
Pluralistic walkthroughs
Pluralistic walkthroughs (Bias 1994) are meetings where users, developers, and human factors people step through an interface for the purposes of identifying usability problems.
A pre-defined scenario dictates the participants’ interaction with the interface. The scenario ensures that the participants confront the screens just as they would during the successful conduct of the specified task online.
The walkthrough begins when the participants are presented a hardcopy snapshot of the first screen they would encounter in the scenario. Participants are asked to write on the hardcopy of the first panel the actions they would perform while attempting the specified task. After all participants have written their independent responses, the walkthrough administrator announces the “right” answer. The participants verbalize their responses and discuss potential usability problems due to “incorrect” answers.
1.3 Problems applying single-user techniques to groupware evaluation
1.3.3 Inspection methods
As in single-user applications, groupware must effectively support task work. However, groupware must also support teamwork, the ‘work of working together’. Inspection methods are thus limited when we use them ‘as-is’, for they do not address the teamwork components necessary for effective collaboration with groupware.
For example, Nielsen lists many heuristics to guide inspectors, yet none address ‘bugs’ particular to groupware usability.
Similarly, a cognitive walkthrough used to evaluate groupware gave mixed and somewhat inconclusive results (Erback and Hook 1994). Other researchers are providing a framework for typical groupware scenarios that can form a stronger basis for walkthroughs (Cugini et al. 1997).
I speculate in this research study that some of these inspection techniques can be altered to evaluate groupware.
Specifically, I chose to adapt Nielsen’s heuristic evaluation methodology since it is popular with both researchers and industry for several important reasons. It is low cost in terms of time, since an evaluation can be completed in a few hours. End-users are not required; therefore, resources are inexpensive. Because the heuristics are well documented and worked examples have been made available (e.g., Nielsen 1994a, b), they can be easy to learn and apply. Also, heuristic evaluation is becoming part of the standard HCI curriculum (e.g., Greenberg 1996) and is thus known to many HCI practitioners. Non-usability experts can also use this technique fairly successfully (Nielsen 1994a). As well, it is cost-effective: an aggregate of 3-5 usability specialists will typically identify ~75% of all known usability problems for a given interface (Nielsen 1994b). All these factors contribute to the significant uptake of heuristic evaluation in today’s industry, since the technique can be easily and cost-effectively integrated into existing development processes while producing immediate results.
In expanding heuristic evaluation for the purposes of evaluating groupware, I look to capitalize on all these factors that make this methodology a success.
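My own aside on the ~75% figure: Nielsen and Landauer's problem-discovery model estimates the proportion of known problems found by i independent evaluators as 1 - (1 - L)^i, where L is the average fraction found by a single evaluator. The L = 0.31 in the small Python sketch below is the often-cited cross-study average from Nielsen's work, not a number from Baker's thesis, so the output is only indicative.

# Nielsen/Landauer discovery model; lam = 0.31 is an assumed average
# per-evaluator discovery rate taken from Nielsen's published studies.
lam = 0.31
for i in (1, 3, 5):
    found = 1 - (1 - lam) ** i
    print(f"{i} evaluator(s): ~{found:.0%} of known problems")
# Prints roughly 31%, 67%, and 84% -- bracketing the ~75% cited for 3-5 inspectors.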
1.4 Problem statement and research goals
The motivation behind this research is that current real-time distributed groupware systems are awkward and cumbersome to use, a situation partly caused by the lack of practical groupware evaluation methodologies.
My general research goal is to develop and validate a groupware evaluation methodology that is practical in terms of time, cost, logistics, and evaluator experience, while still identifying significant problems in a groupware system.
To narrow the scope, I adapt an existing discount usability technique to real-time, distributed groupware supporting shared workspaces. Real-time distributed groupware encompasses collaborative systems that enable multiple people to work together at the same time but from different locations. A shared workspace is “a bounded space where people can see and manipulate artifacts related to their activities” (Gutwin 1997). This application genre is very common (e.g., real-time systems for sharing views of conventional applications).
Specifically, I focused on heuristic evaluation because, in its current state, this methodology satisfies the practicality criteria of time, cost, logistics, and evaluator experience while still identifying significant problems in single-user systems. I believe that this technique and its strengths can be extended to assessing collaborative systems.
From this general research goal, my specific research sub-goals follow.
1. I will propose a new set of heuristics that can be used within the heuristic evaluation methodology to detect usability problems in real-time, distributed groupware with a shared workspace.
2. I will demonstrate that the adapted heuristic evaluation for groupware remains a ‘discount’ usability technique by analyzing the ability of inspectors to identify problems in collaborative applications.
1.5 Research direction
At this point, I need to elaborate on the circumstances, and my resulting decisions, that led to the main thrust of my research study: to derive and validate groupware heuristics based on the mechanics of collaboration. The purpose is to explain why the Locales Framework heuristics receive a disproportionately smaller share of my attention.
My original objective was to build upon Greenberg et al.’s (1999) preliminary work on the Locales Framework heuristics. While conventional heuristics are easy to learn and apply, an outstanding concern from the original study was that heuristics based on the Locales Framework are complex, which in turn might require a greater level of evaluator training and experience. To that extent, I set out to assess these heuristics by studying how well inspectors unfamiliar with the Locales Framework were able to apply these heuristics to identify usability problems in groupware systems.
Shortly afterwards Gutwin and Greenberg (2000) introduced the mechanics of collaboration framework. This framework was created with low-cost evaluation methods for groupware in mind; therefore, I decided to refocus my research in this direction.
Subsequently, this research study concentrates on creating and validating the mechanics of collaboration heuristics.
While I still explore the Locales Framework heuristics, they are not my primary area of interest, and hence I have devoted less time and effort in this study to their validation.
1.6 Research overview
Chapter 2 chronicles Jakob Nielsen’s design, validation, and subsequent evolution of his original 10 heuristics for the purposes of evaluating single-user interfaces. ...I believe it is necessary to provide a brief history on how the existing heuristic methodology was developed, validated, and updated.
Chapter 3 describes in detail the eight heuristics derived from Gutwin and Greenberg’s (2000) mechanics of collaboration framework. These heuristics form the basis for the rest of the research. In addition, five complementary heuristics evolving from the Locales Framework (Fitzpatrick 1998) are also briefly introduced.
Chapter 4 details the two-step methodology I employed to validate the groupware heuristics as a discount usability method for groupware.
First, a pilot study was conducted to review and subsequently improve the heuristics. Next, two categories of inspectors with varying levels of expertise in HCI and CSCW used the revised heuristics to evaluate two groupware systems. The resulting problem reports form the raw data for the forthcoming analysis.
Chapter 5 describes the results synthesis process employed to transform the inspectors’ raw problem reports into a consolidated list of usability problems for each groupware system.
Next, I systematically analyze these lists to derive conclusions regarding the practicality of both sets of groupware heuristics.
Finally, I discuss some of the factors affecting my results and how I interpret these results.
Chapter 6 summarizes how the goals of my research have been satisfied and the contributions made. In addition, I look to the future and discuss what still needs to be done to help evolve the heuristics for the purposes of developing a robust and effective low-cost technique for evaluating groupware.
Source: Baker, Kevin F. Heuristic Evaluation of Shared Workspace Groupware based on the Mechanics of Collaboration. [Thesis, M.Sc.] University of Calgary, Calgary, Alberta, Canada. May 2002.
Nov 5 - Somervell, Heuristic Comparison Experiment (PhD dissertation)
Chapter 5
Heuristic Comparison Experiment
5.1 Introduction
Now that there is a set of heuristics tailored for the large screen information exhibit system class, a comparison of this set to more established types of heuristics can be done. The purpose of this comparison is to show the utility of the new heuristic set. The comparison needs to be fair so that the effectiveness of the new method can be determined accurately.
To assess whether the new set of heuristics provides better usability results than existing alternative sets, we conducted a comparison experiment in which each of three sets of heuristics was used to evaluate three separate large screen information exhibits. We then compared the results of each set through several metrics to determine which is the better evaluation method for large screen information exhibits.
5.2 Approach
The following sections provide descriptions of the heuristics used, the comparison method, and the systems used in this experiment.
5.2.1 Heuristic Sets
We used three different sets of usability heuristics, each at a different level of specificity for application to large screen information exhibits, ranging from a set completely designed for this particular system class, to a generic set applicable to a wide range of interactive systems.
Nielsen
The least specific set of heuristics was taken from Nielsen and Mack [70]. This set is intended for use on any interactive system, mostly targeted towards desktop applications. Furthermore, this set has been in use since around 1990. It has been tested and criticized for years, but still remains popular with usability practitioners. Again, this set is not tailored for large screen information exhibits in any way and has no relation to the critical parameters for notification systems.
Visibility of system status
Match between system and real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation
Figure 5.1: Nielsen’s heuristics. General heuristics that apply to most interfaces. Found in [70].
Berry
The second heuristic set used in this comparison test was created for general notification systems by Berry [9]. This set is based on the critical parameters associated with notification systems [62], but only at cursory levels. This set is more closely tied to large screen information exhibits than Nielsen’s method, in that large screen information exhibits are a subset of notification systems, but it is still generic in nature with regard to the specifics surrounding the LSIE system class.
Notifications should be timely
Notifications should be reliable
Notification displays should be consistent (within priority levels)
Information should be clearly understandable by the user
Allow for shortcuts to more information
Indicate status of notification system
Flexibility and efficiency of use
Provide context of notifications
Allow adjustment of notification parameters to fit user goals
Figure 5.2: Berry’s heuristics. Tailored more towards Notification Systems in general. Found in [9].
Somervell
The final heuristic set is the one created in this work, as reported in Chapter 4. This set is tailored specifically to large screen information exhibits, and thus would be the most specific method of the three when targeting this type of system. It is based on specific levels of the critical parameters associated with the LSIE system class.
5.2.2 Comparison Technique
To determine which of the three sets is better suited for formative evaluation of large screen information exhibits, we use a current set of comparison metrics that rely upon several measures of a method’s ability to uncover usability problems through an evaluation.
The comparison method we are using typically relies on five separate measures to assess the utility of a given UEM for one’s particular needs, but we will only use a subset in this particular comparison study. Hartson et al. report that thoroughness, validity, effectiveness, reliability, and downstream utility are appropriate measures for comparing evaluation methods [40]. Specifically, our comparison method capitalizes on thoroughness, validity, effectiveness, and reliability, abandoning the downstream utility measure. We made this choice because long-term studies are required to illustrate downstream utility.
Thoroughness
This measure gives an indication of a method’s ability to uncover a significant percentage of the problems in a given system. Thoroughness consists of a simple calculation of the number of problems uncovered by a single UEM divided by the total number of problems found by all three methods.
thoroughness = (# of problems found by target UEM) / (# of problems found by all methods)
Validity
Validity refers to the ability of a method to uncover the types of problems that real users would experience in day to day use of the system, as opposed to simple or minor problems. Validity is measured as the number of real problems found divided by the total number of real problems identified in the system.
validity = (# of real problems found by target UEM) / (# of real problems in the system)
The number of real problems in the system refers to the problem set identified through some standard method that is separate from the method being tested.
Effectiveness
Effectiveness combines the previous two metrics into a single assessment of the method. This measure is calculated by multiplying the thoroughness score by the validity score.
effectiveness = thoroughness X validity
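To make the arithmetic concrete for myself, here is a worked toy example of these three scores in Python. The counts are invented purely to show the calculation and assume the "real" problem set was established independently, as the chapter describes.

# Toy counts only -- not Somervell's data.
found_by_uem = 14           # problems reported using this heuristic set
found_by_all_methods = 20   # union of problems reported by all three sets
real_found_by_uem = 10      # of the 14, how many appear in the independent "real" set
real_in_system = 16         # size of the independently established "real" problem set

thoroughness = found_by_uem / found_by_all_methods   # 14/20 = 0.70
validity = real_found_by_uem / real_in_system        # 10/16 = 0.625
effectiveness = thoroughness * validity              # ~0.44
print(thoroughness, validity, effectiveness)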
Reliability
Reliability is a measure of the consistency of the results of several evaluators using the method; it is sometimes referred to as inter-rater reliability. The measure is taken as the agreement between the usability problem sets produced by different people using a given method.
It is calculated both from the differences among all of the evaluators for a specific system and from the total number of agreements among the evaluators; thus, two measures are used to provide a more robust measurement of the reliability of the heuristic sets:
reliability-d = difference among evaluators for a specific method
reliability-a = average agreement among evaluators for a specific method
For calculating reliability, Hartson et al. recommend using a method from Sears [81] that depends on the ratio of the standard deviation of the number of problems found to the average number found [40]. This measure of reliability is overly complicated for our current needs, so a more traditional measure that relies upon actual rater differences is used instead.
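Since the text leaves the exact computation open, the Python sketch below is only one possible operationalization: reliability-d as the spread in how many problems each evaluator reports, and reliability-a as the average pairwise overlap between their problem sets. Both choices, and the example problem sets, are my own assumptions.

from itertools import combinations
from statistics import pstdev

# Hypothetical problem sets from three evaluators using the same heuristic set.
evaluator_problems = {
    "E1": {"p1", "p2", "p3", "p5"},
    "E2": {"p1", "p3", "p4"},
    "E3": {"p2", "p3", "p5", "p6"},
}

# reliability-d: difference among evaluators, here the spread in problem counts.
reliability_d = pstdev([len(s) for s in evaluator_problems.values()])

# reliability-a: average pairwise agreement (overlap relative to the union).
def agreement(a, b):
    return len(a & b) / len(a | b)

pairs = list(combinations(evaluator_problems.values(), 2))
reliability_a = sum(agreement(a, b) for a, b in pairs) / len(pairs)
print(reliability_d, reliability_a)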
5.2.3 Systems
Three systems were used in the comparison study, providing a range of applications to which each heuristic set would be applied in an analytic evaluation. The intent was to provide enough variability in the test systems to tease out differences in the methods.
1. Source Viewer
2. Plasma Poster
3. Notification Collage
Why Source Viewer?
The Source Viewer was chosen as a target system for this study because we wanted an example of a real system that has been in regular use for an extended period. We immediately thought of command and control situations. Potential candidates included local television stations, local air traffic control towers, electrical power companies, and telephone exchange stations. We finally settled on local television command and control after limited responses from the other candidates.
Why Plasma Poster?
We wanted to include the Plasma Poster because it is one of very few LSIE systems that has seen some success in terms of long term usage and acceptance. It has seen over a year of deployment in a large research laboratory, with its usage and user feedback reported in [20].
This lengthy deployment and data collection period provides ample evidence for typical usability problems. We can use the published reports as support for our problem sets. Coupled with developer feedback, we can effectively validate the problem set for this system.
Why Notification Collage?
We chose the Notification Collage as the third system for several reasons.
First, we wanted to increase the validity of any results we find. By using more systems, we get a better picture of the “goodness” of the heuristic sets, especially if we get consistent results across all three systems.
Second, we wanted to explicitly show that the heuristic set we created in this work actually uncovers the issues that went into that creation process. In other words, since the Notification Collage was one of the systems that led to this heuristic set, using that set on the Notification Collage should uncover most of the issues with that system.
Finally, we wanted to use the Notification Collage out of the original five because we had the most developer feedback on that system, and like the Plasma Poster, it has seen reasonable deployment and use.
5.2.4 Hypotheses
We have three main hypotheses to test in this experiment:
1. Somervell’s set of heuristics has a higher validity score for the Notification Collage than the other two sets.
We believed this was true because the Notification Collage was used in the creation of Somervell’s heuristics, thus those heuristics should identify most or all of the issues in the Notification Collage.
2. More specific heuristics have higher thoroughness, validity, and reliability measures.
We felt this was true because more specific methods are more closely related to the systems in this study. Indeed, in Chapter 3 we discussed how previous work suggests that system-class level heuristics would be best. This experiment illustrates that case for heuristic evaluation of large screen information exhibits.
3. Generic methods require more time for evaluators to complete the study.
This seems logical because a more generic heuristic set requires more interpretation and thought; hence we felt that evaluators using Nielsen’s set would take longer to complete the system evaluations, providing further impetus for developing system-class UEMs.
5.2.5 Identifying Problem Sets
One problem identified in other UEM comparison studies involves the calculation of specific metrics that rely upon something referred to as the “real” problem set (see [40]). In most cases, this problem set is the union of the problems found by each of the methods in the comparison study. In other words, each UEM is applied in a standard usability evaluation of a system, and the “real” problem set is simply the union of the problems found by each of the methods.
This comparison study faced the same challenge. Instead of relying on evaluators to produce sets of problems from each method and then using the union of those problem sets as the “real” problem set, analysis and testing were performed on the target systems beforehand, and the problem reports from those efforts were used to establish a standard set of problems for each system.
Source Viewer Problem Set
To determine the problems experienced by the users of this system, a field study was conducted, consisting of two interviews with the users of the large screen system as well as observation.
Plasma Poster Problem Set
Analytic evaluation augmented with developer feedback and literature review served as the method for determining the real problem set for the Plasma Poster. We employed the same claims analysis technique that we used in the creation process to identify typical usability tradeoffs for the Plasma Poster. After identifying the usability issues, we asked the developers of the system to verify the tradeoffs.
Notification Collage Problem Set
To validate the problem set for the Notification Collage, we contacted the developers of the system and asked them to check each tradeoff as it pertained to the behavior of real users. The developers were given a list of the tradeoffs found in our claims analysis (from Chapter 4) and asked to verify each tradeoff according to their observations of real user behavior.
5.3 Testing Methodology
This experiment involves a 3x3 mixed factors design. We have three levels of heuristics (Nielsen, Berry, and Somervell) and three systems (Source Viewer, Plasma Poster, and Notification Collage).
The heuristics variable is a between-subjects variable because each evaluator sees only one set of heuristics. The system variable is within-subjects because each participant sees all three systems. For example, evaluator 1 saw only Nielsen’s heuristics, but used those to evaluate all three systems.
We used a balanced Latin Square to ensure learning effects from system presentation order would be minimized. Thus, we needed a minimum of 18 participants (6 per heuristic set) to ensure coverage of the systems in the Latin Square balancing.
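A minimal Python sketch of that counterbalancing, under my own assumption about the exact orderings used: for three systems, taking all six presentation orders gives each system every position, and every immediate predecessor, exactly twice (for n = 3 this coincides with a Williams-style balanced square), and six evaluators per heuristic set covers every order once.

from itertools import permutations

systems = ["Source Viewer", "Plasma Poster", "Notification Collage"]
heuristic_sets = ["Nielsen", "Berry", "Somervell"]

# For three conditions, the six permutations balance both serial position
# and immediate carryover (each system follows every other system twice).
orders = list(permutations(systems))

# One evaluator per (heuristic set, order) pair -> 3 x 6 = 18 evaluators minimum.
assignment = {}
pid = 1
for hset in heuristic_sets:
    for order in orders:
        assignment[f"P{pid:02d}"] = (hset, order)
        pid += 1

for participant, (hset, order) in list(assignment.items())[:3]:
    print(participant, hset, " -> ".join(order))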
5.3.1 Participants
As shown in Table 5.1, we needed a minimum of 18 evaluators for this study.
Twenty-one computer science graduate students who had completed a course on Usability Engineering volunteered for participation as inspectors. Six participants were assigned to each heuristic set, to cover each of the order assignments. Three additional students volunteered and they were randomly assigned a presentation order.
These participants all had knowledge of usability evaluation, as well as analytic and empirical methods. Furthermore, each was familiar with heuristic evaluation. Some of the participants were not familiar with the claim structure used in this study, but they were able to understand the tradeoff concept immediately.
Unfortunately, one of the participants failed to complete the experiment. This individual apparently decided the effort required to complete the test was too much, and thus filled out the questionnaire using a set pattern.
This makes the final number of participants 20, with seven for Nielsen’s heuristics, seven for Berry’s heuristics, and six for Somervell’s heuristics.
5.3.2 Materials
Each target system was described in one to three short scenarios, and screen shots were provided to the evaluators. The goal was to provide the evaluators with a sense of the display and its intended usage.
This material is sufficient for the heuristic inspection technique according to Nielsen and Mack [70]. This setup ensured that each of the heuristic sets would be used with the same material, thereby reducing the number of random variables in the execution of this experiment.
A description of the heuristic set to be used was also provided to the evaluators. This description included a listing of the heuristics and accompanying text clarification. This clarification helps a person understand the intent and meaning of a specific heuristic, hopefully aiding in assessment. These descriptions were taken from [70] and [9] for Nielsen and Berry respectively.
Armed with the materials for the experiment, the evaluator then proceeded to rate each of the heuristics using a 7-point Likert scale, based on whether or not they felt that the heuristic applied to a claim describing a design tradeoff in the interface. Thus they are judging whether or not a specific heuristic applies to the claim, and how much so.
Marks of four or higher indicate agreement that the heuristic applies to the claim, otherwise the evaluator is indicating disagreement that the heuristic applies.
5.3.3 Questionnaire
As mentioned earlier, the evaluators in this experiment provided their feedback through a Likert scale, with agreement ratings for each of the heuristics in the set. In addition to this feedback, each evaluator also rated the claim in terms of how much they felt it actually applied to the interface in question.
By indicating their agreement level with the claim to the interface, we get feedback on whether usability experts actually think the claim is appropriate for the interface in question.
After rating each of the heuristics for the claims, we also asked each evaluator to rank the severity that the claim would hold, if the claim were indeed a usability problem in the interface.
5.3.4 Measurements Recorded
The data collected in this experiment consists of each evaluator’s rating of the claim applicability, each heuristic rating for an individual claim, and the evaluator’s assessment of the severity of the usability problem. This data was collected for each of the thirty claims across the three systems.
In addition to the above measures, we also collected data on the evaluator’s experience with usability evaluation, heuristics, and large screen information exhibits. This evaluator information was collected through survey questions before the evaluation was started.
After the evaluators completed the test, they recorded the amount of time they spent on the task. This was a self reported value as each evaluator worked at his/her own pace and in their own location.
5.4 Results
Twenty-one evaluators provided feedback on 33 different claims across three systems. Each evaluator ended up providing either 10 or 12 question responses per claim, depending on the heuristic set used (Nielsen’s set has 10 in it, whereas the others only have 8). This means we have either 330 or 396 answers to consider, per evaluator.
Fortunately, this data was separable into manageable chunks, dealing with applicability, severity, and heuristic ratings; as well as evaluator experience levels and time to complete for each method.
5.4.1 Participant Experience
As for individual evaluator abilities, the average experience level with usability evaluation, across all three systems, was “amateur”. This means that overall, for each heuristic set, we had comparable experience for the evaluators assigned to that set.
5.4.2 Applicability Scores
To indicate whether or not a heuristic set applied to a given claim (or problem), evaluators marked their agreement with the statement “the heuristic applies to the claim”. This agreement rating indicates that a specific heuristic applied to the claim. Each of the heuristics was marked on a 7-point Likert scale by the evaluators, indicating his/her level of agreement with the statement.
Using this applicability measure, the responses were averaged for a single claim across all of the evaluators. Averaging across evaluators allows assessment of the overall “applicability” of the heuristic to the claim.
This applicability score is used to determine whether any of the heuristics applied to the issue described in the claim. If a heuristic received an “agree” rating, average greater than or equal to five, then that heuristic was thought to have applied to the issue in the claim.
Overall Applicability
Considering all 33 claims together (found in all three systems), one-way analysis of variance (ANOVA) indicates significant differences among the three heuristic sets for applicability (F2,855) = 3.0,MSE = 49.7, p < 0.05). Further pair-wise t-tests reveal that Somervell’s set of heuristics had significantly higher applicability ratings over both Berry’s (df = 526, t = 3.32, p < 0.05) and Nielsen’s sets (df = 592, t = 11.56, p < 0.05). In addition, Berry’s heuristics had significantly higher applicability scores over Nielsen’s set (df = 592, t = 5.94, p < 0.05).
5.4.3 Thoroughness
Recall that thoroughness is measured as the number of problems found by a single method, divided by the number uncovered by all of the methods. This requires a breakdown of the total number of claims into the numbers for each system.
Plasma Poster has 14 claims, Notification Collage has eight claims, and the Source Viewer has 11 claims. We look at thoroughness measures for each system. To calculate the thoroughness measures for the data we have collected, we count the number of claims “covered” by the target heuristic set.
Here we are defining covered to mean that at least one of the heuristics in the set had an average agreement rating of at least five. Why five?
On the Likert scale, five indicates somewhat agree. If we require that the average score across all of the evaluators to be greater than or equal to five for a single heuristic, we are only capturing those heuristics that truly apply to the claim in question.
Overall Thoroughness
Across all three heuristic sets, 28 of 33 claims had applicability scores higher than five. Somervell’s heuristics had the highest thoroughness rating of the three heuristic sets with 96% (27 of 28 claims). Berry’s heuristics came next with a thoroughness score of 86% (24 of 28) and Nielsen’s heuristics had a score of 61 (17 of 28).
5.4.4 Validity
Validity measures the UEM’s ability to uncover real usability problems in a system [40].
Here the full set of problems in the system is used as the real problem set (as discussed in earlier sections).
As with thoroughness, the applicability scores determine the validity each heuristic set held for the three systems. As before, we used the cutoff value of five on the Likert scale to indicate applicability of the heuristic to the claim. An average rating of five or higher indicates that the heuristic applied to the claim in question.
Overall Validity
Similar to thoroughness, validity scores were calculated across all three systems. Out of 33 total claims, only 28 showed applicability scores greater than five across all three heuristic sets. Somervell’s heuristics had the highest validity, with 27 of 33 claims yielding applicability scores greater than five, for a validity score of 82%. Berry’s heuristics had the next highest validity with 24 of 33 claims, for a validity score of 73%. Nielsen’s heuristics had the lowest validity score, with 17 of 33 claims for a score of 52%.
5.4.5 Effectiveness
Effectiveness is calculated by multiplying thoroughness by validity. UEMs that have high thoroughness and high validity will have high effectiveness scores. A low score on either of these measures will reduce the effectiveness score.
Overall Effectiveness
Considering the effectiveness scores across all three systems reveals that Somervell’s heuristics had the highest effectiveness with a score of 0.79. Berry’s heuristics came next with a score of 0.62. Nielsen’s heuristics had the lowest overall effectiveness with a score of 0.31.
5.4.6 Reliability – Differences
Recall that the reliability of each heuristic set is measured in two ways: one relying upon the actual differences among the evaluators, the other upon the average number of agreements among the evaluators.
Here we focus on the former. For example, Berry’s set has eight heuristics, so consider calculating the differences in the ratings for the first heuristic for the first claim in the Plasma Poster. This difference is found by subtracting the ratings of each evaluator from every other evaluator and summing up each of the differences, then dividing by the number of differences (or the average difference). Suppose that an evaluator rated the first heuristic with a 6 (agree) and another rated it as a 4 (neutral) and a third rated it as a 5 (somewhat agree). The difference in this is 1.33.
We then averaged the differences for every heuristic on a given claim to get an overall difference score for that claim, with a lower score indicating higher reliability (zero difference indicates complete reliability). These average differences provide a measure for the reliability of the heuristic set.
Overall Reliability Differences
Considering all 33 claims across the three systems gives an overall indication of the average differences for the heuristic sets. One-way ANOVA suggests significant differences among the three heuristic sets (F(2, 23) = 23.02,MSE = 0.84, p < 0.05).
Pair-wise t-tests show that Somervell’s heuristics had significantly lower average differences than both Berry’s heuristics (df = 14, t = 4.3, p < 0.05) and Nielsen’s heuristics (df = 16, t = 6.8, p < 0.05). No significant differences were found between Berry’s heuristics and Nielsen’s heuristics (df = 16, t = 1.43, p = 0.17), but Berry’s set had a slightly lower average difference (MB = 2.02, SDB = 0.21; MN = 2.14, SDN = 0.13).
5.4.7 Reliability – Agreement
In addition to the average differences, a further measure of reliability was calculated by counting the number of agreements among the evaluators, then dividing by the total number of possible agreements. This calculation provides a measure of the agreement rating for each heuristic.
For example, consider the previous three evaluators and their ratings (6, 5, and 4). The agreement rating in this case would be:
agreement = 0 / 3 = 0
because none of the evaluators agreed on the rating, but there were potentially three agreements (if they had all given the same rating). Averages across all of the claims for a given system were then taken. This provides an assessment of the average agreement for each heuristic as it pertains to a given system.
Overall Agreement
Taking all 33 claims into consideration, one-way ANOVA indicates significant differences among the three heuristic sets for evaluator agreement (F(2, 23) = 6.31,MSE = 0.01, p = 0.01). Pairwise t-tests show that both Somervell’s heuristics and Berry’s heuristics had significantly higher agreement than Nielsen’s set (df = 16, t = 2.99, p = 0.01 and df = 16, t = 3.7, p < 0.05 respectively). No significant differences were found between Somervell’s and Berry’s heuristics (df = 14, t = 0.46, p = 0.65).
5.4.8 Time Spent
Recall that we also asked the evaluators to report the amount of time (in minutes) they spent completing this evaluation. This measure is valuable in assessing the cost of the methods in terms of effort required. It was anticipated that the time required for each method would be similar across the methods.
Averaging reported times across evaluators for each method suggests that Somervell’s set required the least amount of time (M = 103.17, SD = 27.07), but one-way ANOVA reveals no significant differences (F(2, 17) = 0.26, p = 0.77). Berry’s set required the most time (M = 119.14, SD = 60.69) while Nielsen’s set (M = 104.29, SD = 38.56) required slightly more than Somervell’s.
5.5 Discussion
So what does all this statistical analysis mean? What do we know about the three heuristic sets? How have we supported or refuted our hypotheses through this analysis?
5.5.1 Hypotheses Revisited
1. Somervell’s set of heuristics will have a higher validity score for the Notification Collage.
2. More specific heuristics will have higher thoroughness, validity, and reliability measures.
3. Generic methods will require more time for evaluators to complete the study.
Hypothesis 1
For hypothesis one, we discovered that Somervell’s heuristics indeed held the highest validity score for the Notification Collage (see Figure 5.7).
However, this validity score was not 100%, as was expected. What does this mean? It simply illustrates the difference in the evaluators who participated in this study. They did not think that any of the heuristics applied to one of the claims from the Notification Collage. Although, it can be noted that the applicability scores for that particular claims were very close to the cutoff level we chose for agreement (that being 5 or greater on a 7-point scale).
Still, evidence suggests that hypothesis 1 holds.
Hypothesis 2
We find evidence to support this hypothesis based on the scores on each of the three measures: thoroughness, validity, and reliability. In each case, more specific methods had the better ratings over Nielsen’s heuristics for each measure.
Overall one could argue that Somervell’s set of heuristics is most suitable for evaluating large screen information exhibits, but must concede that Berry’s heuristics could also be used with some effectiveness.
Hypothesis 3
We did not find evidence to support this hypothesis. As reported, there were no significant differences in the times required to complete the evaluations for the three methods.
However, Somervell’s and Nielsen’s sets took about 15 fewer minutes, on average, to complete. This does not indicate that the more generic method (Nielsen’s) required more time.
So what would cause the evaluators to take more time with Berry’s method? Initial speculation would suggest that this set uses terminology associated with Notification Systems [62] (see Figure 5.2 for listing of heuristics), including reference to the critical parameters of interruption, reaction, and comprehension, and thus could have increased the interpretation time required to understand each of the heuristics.
5.6 Summary
We have described an experiment to compare three sets of heuristics, representing different levels of generality/specificity, in their ability to evaluate three different LSIE systems. Information on the systems used, test setup, and data collection and analysis has been provided. This test was performed to illustrate the utility that system-class specific methods provide by showing how they are better suited to evaluation of interfaces from that class.
In addition, this work has provided important validation of the creation method used in developing these new heuristics.
We have shown that a system-class specific set of heuristics provides better thoroughness, validity, and reliability than more generic sets (like Nielsen’s). The implication being that without great effort to tailor these generic evaluation tools, they do not provide as effective usability data as a more specific tool.
Source: Somervell, Jacob. Developing Heuristic Evaluation Methods for Large Screen Information Exhibits Based on Critical Parameters. [Dissertation, PhD in Computer Science and Applications] Virginia Polytechnic Institute and State University. June 22, 2004.
Heuristic Comparison Experiment
5.1 Introduction
Now that there is a set of heuristics tailored for the large screen information exhibit system class, a comparison of this set to more established types of heuristics can be done. The purpose of this comparison would be to show the utility of this new heuristic set. This comparison needs to be fair, so that determining the effectiveness of the new method will be accurate.
To assess whether the new set of heuristics provides better usability results than existing alternative sets, we conducted a comparison experiment in which each of three sets of heuristics was used to evaluate three separate large screen information exhibits. We then compared the results of each set through several metrics to determine the better evaluation method for large screen information exhibits.
5.2 Approach
The following sections provide descriptions of the heuristics used, the comparison method, and the systems used in this experiment.
5.2.1 Heuristic Sets
We used three different sets of usability heuristics, each at a different level of specificity for application to large screen information exhibits, ranging from a set completely designed for this particular system class, to a generic set applicable to a wide range of interactive systems.
Nielsen
The least specific set of heuristics was taken from Nielsen and Mack [70]. This set is intended for use on any interactive system, mostly targeted towards desktop applications. Furthermore, this set has been in use since around 1990. It has been tested and criticized for years, but still remains popular with usability practitioners. Again, this set is not tailored for large screen information exhibits in any way and has no relation to the critical parameters for notification systems.
Visibility of system status
Match between system and real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation
Figure 5.1: Nielsen’s heuristics. General heuristics that apply to most interfaces. Found in [70].
Berry
The second heuristic set used in this comparison test was created for general notification systems by Berry [9]. This set is based on the critical parameters associated with notification systems [62], but only at cursory levels. This set is more closely tied to large screen information exhibits than Nielsen’s method in that large screen information exhibits are a subset of notification systems, but this set is still generic in nature with regards to the specifics surrounding the LSIE system class.
Notifications should be timely
Notifications should be reliable
Notification displays should be consistent (within priority levels)
Information should be clearly understandable by the user
Allow for shortcuts to more information
Indicate status of notification system
Flexibility and efficiency of use
Provide context of notifications
Allow adjustment of notification parameters to fit user goals
Figure 5.2: Berry’s heuristics. Tailored more towards Notification Systems in general. Found in [9].
Somervell
The final heuristic set is the one created in this work, as reported in Chapter 4. This set is tailored specifically to large screen information exhibits, and thus would be the most specific method of the three when targeting this type of system. It is based on specific levels of the critical parameters associated with the LSIE system class.
5.2.2 Comparison Technique
To determine which of the three sets is better suited for formative evaluation of large screen information exhibits, we use a current set of comparison metrics that rely upon several measures of a method’s ability to uncover usability problems through an evaluation.
The comparison method we are using typically relies on five separate measures to assess the utility of a given UEM for one’s particular needs, but we will only use a subset in this particular comparison study.
Hartson et al. report that thoroughness, validity, effectiveness, reliability, and downstream utility are appropriate measures for comparing evaluation methods [40].
Specifically, our comparison method capitalizes on thoroughness, validity, effectiveness, and reliability, abandoning the downstream utility measure. We made this choice because long-term studies are required to illustrate downstream utility.
Thoroughness
This measure gives an indication of a method’s ability to uncover a significant percentage of the problems in a given system. Thoroughness consists of a simple calculation of the number of problems uncovered by a single UEM divided by the total number of problems found by all three methods.
thoroughness = (# of problems found by target UEM) / (# of problems found by all methods)
Validity
Validity refers to the ability of a method to uncover the types of problems that real users would experience in day to day use of the system, as opposed to simple or minor problems. Validity is measured as the number of real problems found divided by the total number of real problems identified in the system.
validity = (# of problems found by target UEM) / (# of problems in the system)
The number of real problems in the system refers to the problem set identified through some standard method that is separate from the method being tested.
Effectiveness
Effectiveness combines the previous two metrics into a single assessment of the method. This measure is calculated by multiplying the thoroughness score by the validity score.
effectiveness = thoroughness X validity
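My note (not from the dissertation): the three formulas can be expressed as a small Python sketch. For concreteness the counts below reuse the figures later reported for Berry's set; everything else is my own illustration.

# Counts for one heuristic set (here: the figures later reported for Berry's set)
found_by_uem = 24      # problems found by the target heuristic set
found_by_all = 28      # problems found by all three sets combined
real_problems = 33     # problems in the pre-established "real" problem set

thoroughness = found_by_uem / found_by_all    # 24/28, about 0.86
validity = found_by_uem / real_problems       # 24/33, about 0.73
effectiveness = thoroughness * validity       # about 0.62

print(round(thoroughness, 2), round(validity, 2), round(effectiveness, 2))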
Reliability
Reliability is a measure of the consistency of the results when several evaluators use the method; this is also sometimes referred to as inter-rater reliability. Here, reliability is taken as the agreement between the usability problem sets produced by different people using a given method.
This measure is calculated from the differences among all of the evaluators for a specific system as well as by the total number of agreements among the evaluators, thus two measures are used to provide a more robust measurement of the reliability of the heuristic sets:
reliability-d = difference among evaluators for a specific method
reliability-a = average agreement among evaluators for a specific method
For calculating reliability, Hartson et al. recommend using a method from Sears [81] that depends on the ratio of the standard deviation of the number of problems found to the average number found [40]. This measure of reliability is overly complicated for our current needs, so a more traditional measure that relies upon actual rater differences is used instead.
5.2.3 Systems
Three systems were used in the comparison study providing a range of applications for which each heuristic would be used in an analytic evaluation. The intent was to provide enough variability in the test systems to tease out differences in the methods.
1. Source Viewer
2. Plasma Poster
3. Notification Collage
Why Source Viewer?
The Source Viewer was chosen as a target system for this study because we wanted an example of a real system that has been in regular use for an extended period. We immediately thought of command and control situations. Potential candidates included local television stations, local air traffic control towers, electrical power companies, and telephone exchange stations. We finally settled on local television command and control after limited responses from the other candidates.
Why Plasma Poster?
We wanted to include the Plasma Poster because it is one of very few LSIE systems that has seen some success in terms of long term usage and acceptance. It has seen over a year of deployment in a large research laboratory, with reports on usage and user feedback reported in [20].
This lengthy deployment and data collection period provides ample evidence for typical usability problems. We can use the published reports as support for our problem sets. Coupled with developer feedback, we can effectively validate the problem set for this system.
Why Notification Collage?
We chose the Notification Collage as the third system for several reasons.
First, we wanted to increase the validity of any results we find. By using more systems, we get a better picture of the “goodness” of the heuristic sets, especially if we get consistent results across all three systems.
Secondly, we wanted to explicitly show that the heuristic set we created in this work actually uncovered the issues that went into that creation process. In other words, since the Notification Collage was one of the systems that led to this heuristic set, using that set on the Notification Collage should uncover most of the issues with that system.
Finally, we wanted to use the Notification Collage out of the original five because we had the most developer feedback on that system, and like the Plasma Poster, it has seen reasonable deployment and use.
5.2.4 Hypotheses
We have three main hypotheses to test in this experiment:
1. Somervell’s set of heuristics has a higher validity score for the Notification Collage.
We believed this was true because the Notification Collage was used in the creation of Somervell’s heuristics, thus those heuristics should identify most or all of the issues in the Notification Collage.
2. More specific heuristics have higher thoroughness, validity, and reliability measures.
We felt this was true because more specific methods are more closely related to the systems in this study. Indeed, from Chapter 3, we discussed how previous work suggests system-class level heuristics would be best. This experiment illustrates this case for heuristic evaluation of large screen information exhibits.
3. Generic methods require more time for evaluators to complete the study.
This seems logical because a more generic heuristic set would require more interpretation and thought, hence we felt that those evaluators who use Nielsen’s set would take longer to complete the system evaluations, providing further impetus for developing system-class UEMs.
5.2.5 Identifying Problem Sets
One problem identified in other UEM comparison studies involves the calculation of specific metrics that rely upon something referred to as the “real” problem set (see [40]). In most cases, this problem set is the union of the problems found by each of the methods in the comparison study. In other words, each UEM is applied in a standard usability evaluation of a system, and the “real” problem set is simply the union of the problems found by each of the methods.
This comparison study also faced the same challenge. Instead of relying on evaluators to produce sets of problems from each method, then using the union of those problem sets as the “real” problem set, analysis and testing was performed on the target systems beforehand and the problem reports from those efforts were used to come up with a standard set of problems for each system.
Source Viewer Problem Set
To determine the problems experienced by users of this system, a field study was conducted, consisting of two interviews with users of the large screen system as well as direct observation.
Plasma Poster Problem Set
Analytic evaluation augmented with developer feedback and literature review served as the method for determining the real problem set for the Plasma Poster. We employed the same claims analysis technique that we used in the creation process to identify typical usability tradeoffs for the Plasma Poster. After identifying the usability issues, we asked the developers of the system to verify the tradeoffs.
Notification Collage Problem Set
To validate the problem set for the Notification Collage, we contacted the developers of the system and asked them to check each tradeoff as it pertained to the behavior of real users. The developers were given a list of the tradeoffs found in our claims analysis (from Chapter 4) and asked to verify each tradeoff according to their observations of real user behavior.
5.3 Testing Methodology
This experiment involves a 3x3 mixed factors design. We have three levels of heuristics (Nielsen, Berry, and Somervell) and three systems (Source Viewer, Plasma Poster, and Notification Collage).
The heuristics variable is a between-subjects variable because each evaluator sees only one set of heuristics. The system variable is within-subjects because each participant sees all three systems. For example, evaluator 1 saw only Nielsen’s heuristics, but used those to evaluate all three systems.
We used a balanced Latin Square to ensure learning effects from system presentation order would be minimized. Thus, we needed a minimum of 18 participants (6 per heuristic set) to ensure coverage of the systems in the Latin Square balancing.
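My note (not from the dissertation): a minimal Python sketch of how a balanced Latin Square ordering for three systems could be generated. The function and the Williams-style construction are my own illustration, not Somervell's actual procedure.

SYSTEMS = ["Source Viewer", "Plasma Poster", "Notification Collage"]

def balanced_latin_square(conditions):
    # Williams-style first row of condition indices: 0, 1, n-1, 2, n-2, ...
    n = len(conditions)
    seq, lo, hi = [0], 1, n - 1
    while lo <= hi:
        seq.append(lo)
        lo += 1
        if lo <= hi:
            seq.append(hi)
            hi -= 1
    # Each later row shifts the first row by one; for an odd number of
    # conditions the mirrored rows are added to balance carryover effects.
    rows = [[(s + shift) % n for s in seq] for shift in range(n)]
    if n % 2 == 1:
        rows += [list(reversed(r)) for r in rows]
    return [[conditions[i] for i in row] for row in rows]

for order in balanced_latin_square(SYSTEMS):
    print(order)

For three systems this yields six presentation orders, which matches the minimum of six evaluators per heuristic set.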
5.3.1 Participants
As shown in Table 5.1, we needed a minimum of 18 evaluators for this study.
Twenty-one computer science graduate students who had completed a course on Usability Engineering volunteered for participation as inspectors. Six participants were assigned to each heuristic set, to cover each of the order assignments. Three additional students volunteered and they were randomly assigned a presentation order.
These participants all had knowledge of usability evaluation, as well as analytic and empirical methods. Furthermore, each was familiar with heuristic evaluation. Some of the participants were not familiar with the claim structure used in this study, but they were able to understand the tradeoff concept immediately.
Unfortunately, one of the participants failed to complete the experiment. This individual apparently decided the effort required to complete the test was too much, and thus filled out the questionnaire using a set pattern.
This makes the final number of participants 20, with seven for Nielsen’s heuristics, seven for Berry’s heuristics, and six for Somervell’s heuristics.
5.3.2 Materials
Each target system was described in one to three short scenarios, and screen shots were provided to the evaluators. The goal was to provide the evaluators with a sense of the display and its intended usage.
This material is sufficient for the heuristic inspection technique according to Nielsen and Mack [70]. This setup ensured that each of the heuristic sets would be used with the same material, thereby reducing the number of random variables in the execution of this experiment.
A description of the heuristic set to be used was also provided to the evaluators. This description included a listing of the heuristics and accompanying text clarification. This clarification helps a person understand the intent and meaning of a specific heuristic, hopefully aiding in assessment. These descriptions were taken from [70] and [9] for Nielsen and Berry respectively.
Armed with the materials for the experiment, each evaluator then rated each of the heuristics on a 7-point Likert scale, based on whether they felt the heuristic applied to a claim describing a design tradeoff in the interface. Thus they judged whether a specific heuristic applied to the claim, and to what degree.
Marks of four or higher indicate agreement that the heuristic applies to the claim, otherwise the evaluator is indicating disagreement that the heuristic applies.
5.3.3 Questionnaire
As mentioned earlier, the evaluators in this experiment provided their feedback through a Likert scale, with agreement ratings for each of the heuristics in the set. In addition to this feedback, each evaluator also rated the claim in terms of how much they felt it actually applied to the interface in question.
By having evaluators indicate their level of agreement that the claim applies to the interface, we get feedback on whether usability experts actually think the claim is appropriate for the interface in question.
After rating each of the heuristics for a claim, we also asked each evaluator to rate the severity the claim would hold if it were indeed a usability problem in the interface.
5.3.4 Measurements Recorded
The data collected in this experiment consists of each evaluator’s rating of the claim applicability, each heuristic rating for an individual claim, and the evaluator’s assessment of the severity of the usability problem. This data was collected for each of the 33 claims across the three systems.
In addition to the above measures, we also collected data on the evaluator’s experience with usability evaluation, heuristics, and large screen information exhibits. This evaluator information was collected through survey questions before the evaluation was started.
After the evaluators completed the test, they recorded the amount of time they spent on the task. This was a self-reported value, as each evaluator worked at their own pace and in their own location.
5.4 Results
Twenty-one evaluators provided feedback on 33 different claims across three systems. Each evaluator provided either 10 or 12 question responses per claim, depending on the heuristic set used (Nielsen’s set has 10 heuristics whereas the others have 8, and each claim also received an applicability rating and a severity rating). This means we have either 330 or 396 answers to consider per evaluator.
Fortunately, this data was separable into manageable chunks, dealing with applicability, severity, and heuristic ratings; as well as evaluator experience levels and time to complete for each method.
5.4.1 Participant Experience
As for individual evaluator abilities, the average experience level with usability evaluation, across all three heuristic sets, was “amateur”. This means that overall, for each heuristic set, we had comparable experience among the evaluators assigned to that set.
5.4.2 Applicability Scores
To indicate whether or not a heuristic set applied to a given claim (or problem), evaluators marked their agreement with the statement “the heuristic applies to the claim”. Each heuristic was rated on a 7-point Likert scale, indicating the evaluator’s level of agreement with the statement.
Using this applicability measure, the responses were averaged for a single claim across all of the evaluators. Averaging across evaluators allows assessment of the overall “applicability” of the heuristic to the claim.
This applicability score is used to determine whether any of the heuristics applied to the issue described in the claim. If a heuristic received an average rating of at least five (“somewhat agree” or higher), then that heuristic was considered to have applied to the issue in the claim.
Overall Applicability
Considering all 33 claims together (found in all three systems), one-way analysis of variance (ANOVA) indicates significant differences among the three heuristic sets for applicability (F(2, 855) = 3.0, MSE = 49.7, p < 0.05). Further pair-wise t-tests reveal that Somervell’s set of heuristics had significantly higher applicability ratings over both Berry’s (df = 526, t = 3.32, p < 0.05) and Nielsen’s sets (df = 592, t = 11.56, p < 0.05). In addition, Berry’s heuristics had significantly higher applicability scores over Nielsen’s set (df = 592, t = 5.94, p < 0.05).
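My note (not from the dissertation): a rough sketch of how such a one-way ANOVA and pair-wise t-tests could be run with SciPy. The arrays are random placeholders standing in for the per-response applicability ratings; the sizes are arbitrary, not the study's actual counts.

import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
# Placeholder 1-7 Likert ratings, one array per heuristic set
nielsen = rng.integers(1, 8, size=300).astype(float)
berry = rng.integers(1, 8, size=280).astype(float)
somervell = rng.integers(1, 8, size=280).astype(float)

f_stat, p_overall = f_oneway(nielsen, berry, somervell)  # one-way ANOVA
t_sb, p_sb = ttest_ind(somervell, berry)                 # pair-wise comparisons
t_sn, p_sn = ttest_ind(somervell, nielsen)
t_bn, p_bn = ttest_ind(berry, nielsen)
print(f_stat, p_overall, t_sb, t_sn, t_bn)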
5.4.3 Thoroughness
Recall that thoroughness is measured as the number of problems found by a single method, divided by the number uncovered by all of the methods. This requires a breakdown of the total number of claims into the numbers for each system.
Plasma Poster has 14 claims, Notification Collage has eight claims, and the Source Viewer has 11 claims. We look at thoroughness measures for each system. To calculate the thoroughness measures for the data we have collected, we count the number of claims “covered” by the target heuristic set.
Here we define “covered” to mean that at least one of the heuristics in the set had an average agreement rating of at least five. Why five?
On the Likert scale, five indicates “somewhat agree”. By requiring the average score across all of the evaluators to be at least five for a single heuristic, we only capture those heuristics that truly apply to the claim in question.
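My note (not from the dissertation): a small sketch of the coverage test with made-up average ratings, to make the cutoff concrete.

CUTOFF = 5  # "somewhat agree" on the 7-point scale

def covered_claims(avg_ratings):
    # avg_ratings maps a claim id to the average rating of each heuristic in the set.
    # A claim counts as covered if at least one heuristic averages >= CUTOFF.
    return {claim for claim, ratings in avg_ratings.items()
            if any(r >= CUTOFF for r in ratings)}

# Made-up averages for three claims under a three-heuristic set
example = {"claim1": [5.2, 3.1, 4.0],
           "claim2": [2.5, 2.8, 3.0],
           "claim3": [6.1, 5.5, 4.9]}

covered = covered_claims(example)   # {"claim1", "claim3"}
thoroughness = len(covered) / 3     # assuming all 3 claims were covered by some method
print(covered, thoroughness)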
Overall Thoroughness
Across all three heuristic sets, 28 of 33 claims had applicability scores higher than five. Somervell’s heuristics had the highest thoroughness rating of the three heuristic sets with 96% (27 of 28 claims). Berry’s heuristics came next with a thoroughness score of 86% (24 of 28) and Nielsen’s heuristics had a score of 61% (17 of 28).
5.4.4 Validity
Validity measures the UEM’s ability to uncover real usability problems in a system [40].
Here the full set of problems in the system is used as the real problem set (as discussed in earlier sections).
As with thoroughness, the applicability scores determine the validity each heuristic set held for the three systems. As before, we used the cutoff value of five on the Likert scale to indicate applicability of the heuristic to the claim. An average rating of five or higher indicates that the heuristic applied to the claim in question.
Overall Validity
Similar to thoroughness, validity scores were calculated across all three systems. Out of 33 total claims, only 28 showed applicability scores greater than five across all three heuristic sets. Somervell’s heuristics had the highest validity, with 27 of 33 claims yielding applicability scores greater than five, for a validity score of 82%. Berry’s heuristics had the next highest validity with 24 of 33 claims, for a validity score of 73%. Nielsen’s heuristics had the lowest validity score, with 17 of 33 claims for a score of 52%.
5.4.5 Effectiveness
Effectiveness is calculated by multiplying thoroughness by validity. UEMs that have high thoroughness and high validity will have high effectiveness scores. A low score on either of these measures will reduce the effectiveness score.
Overall Effectiveness
Considering the effectiveness scores across all three systems reveals that Somervell’s heuristics had the highest effectiveness with a score of 0.79. Berry’s heuristics came next with a score of 0.62. Nielsen’s heuristics had the lowest overall effectiveness with a score of 0.31.
5.4.6 Reliability – Differences
Recall that the reliability of each heuristic set is measured in two ways: one relying upon the actual differences among the evaluators, the other upon the average number of agreements among the evaluators.
Here we focus on the former. For example, Berry’s set has eight heuristics, so consider calculating the differences in the ratings for the first heuristic for the first claim in the Plasma Poster. This difference is found by taking the absolute difference between each pair of evaluators’ ratings, summing those differences, and dividing by the number of pairs (i.e., the average pairwise difference). Suppose one evaluator rated the first heuristic a 6 (agree), another a 4 (neutral), and a third a 5 (somewhat agree). The average difference is (2 + 1 + 1) / 3 = 1.33.
We then averaged the differences for every heuristic on a given claim to get an overall difference score for that claim, with a lower score indicating higher reliability (zero difference indicates complete reliability). These average differences provide a measure for the reliability of the heuristic set.
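My note (not from the dissertation): the same calculation as a few lines of Python, reproducing the 1.33 example.

from itertools import combinations

def average_difference(ratings):
    # Mean absolute difference over all pairs of evaluator ratings
    pairs = list(combinations(ratings, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

print(round(average_difference([6, 4, 5]), 2))  # 1.33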
Overall Reliability Differences
Considering all 33 claims across the three systems gives an overall indication of the average differences for the heuristic sets. One-way ANOVA suggests significant differences among the three heuristic sets (F(2, 23) = 23.02, MSE = 0.84, p < 0.05).
Pair-wise t-tests show that Somervell’s heuristics had significantly lower average differences than both Berry’s heuristics (df = 14, t = 4.3, p < 0.05) and Nielsen’s heuristics (df = 16, t = 6.8, p < 0.05). No significant differences were found between Berry’s heuristics and Nielsen’s heuristics (df = 16, t = 1.43, p = 0.17), but Berry’s set had a slightly lower average difference (M = 2.02, SD = 0.21 for Berry; M = 2.14, SD = 0.13 for Nielsen).
5.4.7 Reliability – Agreement
In addition to the average differences, a further measure of reliability was calculated by counting the number of agreements among the evaluators, then dividing by the total number of possible agreements. This calculation provides a measure of the agreement rating for each heuristic.
For example, consider the previous three evaluators and their ratings (6, 5, and 4). The agreement rating in this case would be:
agreement = 0 / 3 = 0
because none of the evaluators agreed on the rating, but there were potentially three agreements (if they had all given the same rating). Averages across all of the claims for a given system were then taken. This provides an assessment of the average agreement for each heuristic as it pertains to a given system.
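My note (not from the dissertation): the corresponding agreement ratio as a small sketch.

from itertools import combinations

def agreement(ratings):
    # Fraction of evaluator pairs that gave identical ratings
    pairs = list(combinations(ratings, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)

print(agreement([6, 5, 4]))  # 0.0 -- no agreeing pair out of three possible
print(agreement([5, 5, 4]))  # 0.33... -- one agreeing pair out of three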
Overall Agreement
Taking all 33 claims into consideration, one-way ANOVA indicates significant differences among the three heuristic sets for evaluator agreement (F(2, 23) = 6.31, MSE = 0.01, p = 0.01). Pairwise t-tests show that both Somervell’s heuristics and Berry’s heuristics had significantly higher agreement than Nielsen’s set (df = 16, t = 2.99, p = 0.01 and df = 16, t = 3.7, p < 0.05 respectively). No significant differences were found between Somervell’s and Berry’s heuristics (df = 14, t = 0.46, p = 0.65).
5.4.8 Time Spent
Recall that we also asked the evaluators to report the amount of time (in minutes) they spent completing this evaluation. This measure is valuable in assessing the cost of the methods in terms of effort required. It was anticipated that the time required for each method would be similar across the methods.
Averaging reported times across evaluators for each method suggests that Somervell’s set required the least amount of time (M = 103.17, SD = 27.07), but one-way ANOVA reveals no significant differences (F(2, 17) = 0.26, p = 0.77). Berry’s set required the most time (M = 119.14, SD = 60.69) while Nielsen’s set (M = 104.29, SD = 38.56) required slightly more than Somervell’s.
5.5 Discussion
So what does all this statistical analysis mean? What do we know about the three heuristic sets? How have we supported or refuted our hypotheses through this analysis?
5.5.1 Hypotheses Revisited
1. Somervell’s set of heuristics will have a higher validity score for the Notification Collage.
2. More specific heuristics will have higher thoroughness, validity, and reliability measures.
3. Generic methods will require more time for evaluators to complete the study.
Hypothesis 1
For hypothesis one, we discovered that Somervell’s heuristics indeed held the highest validity score for the Notification Collage (see Figure 5.7).
However, this validity score was not the 100% that was expected. What does this mean? It simply reflects differences among the evaluators who participated in this study: they did not think that any of the heuristics applied to one of the claims from the Notification Collage. It can be noted, though, that the applicability scores for that particular claim were very close to the cutoff level we chose for agreement (5 or greater on the 7-point scale).
Still, evidence suggests that hypothesis 1 holds.
Hypothesis 2
We find evidence to support this hypothesis based on the scores on each of the three measures: thoroughness, validity, and reliability. In each case, more specific methods had the better ratings over Nielsen’s heuristics for each measure.
Overall, one could argue that Somervell’s set of heuristics is the most suitable for evaluating large screen information exhibits, but one must concede that Berry’s heuristics could also be used with some effectiveness.
Hypothesis 3
We did not find evidence to support this hypothesis. As reported, there were no significant differences in the times required to complete the evaluations for the three methods.
However, Somervell’s and Nielsen’s sets took about 15 fewer minutes than Berry’s, on average, to complete. This does not indicate that the more generic method (Nielsen’s) required more time.
So what would cause the evaluators to take more time with Berry’s method? Our initial speculation is that this set uses terminology associated with Notification Systems [62] (see Figure 5.2 for a listing of the heuristics), including references to the critical parameters of interruption, reaction, and comprehension, which could have increased the interpretation time required to understand each of the heuristics.
5.6 Summary
We have described an experiment to compare three sets of heuristics, representing different levels of generality/specificity, in their ability to evaluate three different LSIE systems. Information on the systems used, test setup, and data collection and analysis has been provided. This test was performed to illustrate the utility that system-class specific methods provide by showing how they are better suited to evaluation of interfaces from that class.
In addition, this work has provided important validation of the creation method used in developing these new heuristics.
We have shown that a system-class specific set of heuristics provides better thoroughness, validity, and reliability than more generic sets (like Nielsen’s). The implication is that, unless great effort is spent tailoring them, generic evaluation tools do not provide usability data as effective as that from a more specific tool.
Source: Somervell, Jacob. Developing Heuristic Evaluation Methods for Large Screen Information Exhibits Based on Critical Parameters. [Dissertation, PhD in Computer Science and Applications] Virginia Polytechnic Institute and State University. June 22, 2004.
Labels: heuristic evaluation, Somervell, usability criteria
Nov 4 - Somervell, Heuristics Creation (PhD dissertation)
Chapter 4
Heuristics Creation
4.1 Introduction
Ensuring usability is an ongoing challenge for software developers. Myriad testing techniques exist, leading to a trade-off between implementation cost and results effectiveness.
Usability testing techniques are broken down into analytical and empirical types.
Analytical methods involve inspection of the system, typically by experts in the application field, who identify problems through a walkthrough process.
Empirical methods leverage people who could be real users of the application in controlled tests of specific aspects of the system, often to determine efficiency in performing tasks with the system.
Using either type has advantages and disadvantages, but practitioners typically have limited budgets for usability testing. Thus, they need to use techniques that give useful results while not requiring significant funds. Analytic methods fit this requirement more readily for formative evaluation stages.
With the advent of new technologies and non-traditional interfaces, analytic techniques like heuristics hold the key to early and effective interface evaluation.
There are problems with using analytical methods (like heuristics) that can decrease the validity of results [21]. These problems come from applying a small set of guidelines to a wide range of systems, necessitating interpretation of evaluation results. This illustrates how generic guidelines are not readily applicable to all systems [40], and more specific heuristics are necessary.
Our goal was to create a more specific set, tailored to this system class, yet one still generic enough to apply to all systems in the class.
LSIEs focus on very specific user goals based on the critical parameters of interruption, reaction, and comprehension. Differing levels of each parameter (high, medium, or low) define different system classes [62]. We focus on LSIEs, which require medium interruption, low to high reaction, and high comprehension.
4.2 Motivation
Tremendous effort has been devoted to the study of usability evaluation, specifically in comparing analytic to empirical methods.
Nielsen’s heuristics are probably the most notable set of analytical techniques, developed to facilitate formative usability testing [71, 70]. They have come under fire for their claims that heuristic evaluations are comparable to user testing, yet require fewer test subjects. Comparisons of user testing to heuristic evaluation are numerous [48, 50, 90].
Some have worked to develop targeted heuristics for specific application types.
Baker et al. report on adapting heuristic evaluation to groupware systems [5]. They show that applying heuristic evaluation methods to groupware systems is effective and efficient for formative usability evaluation.
Mankoff et al. compare an adapted set of heuristics to Nielsen’s original set [56]. They studied ambient displays with both sets of heuristics and determined that their adapted set is better suited to ambient displays.
4.3 Processes Involved
How does one create a set of heuristics anyway? We could follow the steps of previous researchers and just use pre-existing heuristics, then reason about the target system class, hopefully coming up with a list of new heuristics that prove useful.
Nielsen and Molich explicitly state that the heuristics come from years of experience and reflection. This is not surprising, as the heuristics emerged some 30 years after the first graphical interfaces appeared. In the case of Nielsen and Mack, they at least validated their method by using it in the analysis of several systems after they had created their set.
The two studies mentioned above relied upon vague descriptions of theoretical underpinnings [5] or simple tweaking of existing heuristics [56].
Our approach to this lack of structure in creating heuristics is to take a logical look at how one might uncover heuristics for a particular type of system. Basically, to gain insight about a certain type of system, one could analyze several example applications in that system class, based on the critical parameters for the class, and then use the results of that analysis to categorize and group the issues discovered into re-usable design guidelines or heuristics.
The process involves the following stages:
• selection of target systems.
• inspection of these systems. An approach like claims analysis [15] provides the necessary structure for knowledge extraction and a consistent representation.
• classifying design implications. Leveraging the underlying critical parameters can help organize the claims found in terms of impacts to those parameters.
• categorizing design implications. Scenario Based Design [77] provides a mechanism for categorizing design knowledge into manageable parts.
• extracting high level design guidance. Based on the groupings developed in the previous step, high level design guidelines can be formulated in terms of design issues.
• synthesizing potential heuristics. By matching and relating similar issues, heuristics can be synthesized.
4.4 Selecting Systems
The first step in the creation process requires careful selection of example systems to inspect and analyze for uncovering existing problems in the systems. The idea is to uncover typical issues inherent in that specific type of system.
Our goal was to use a representative set of systems from the LSIE class. We wanted systems that had been in use for a while, with reports on usage or studies on usability to help validate the analysis we would perform on the systems. We chose the following five LSIE systems, including some from our own work and some from other well-documented design efforts, to investigate further in the creation process:
• GAWK [31] This system provides teachers and students an overview and history of current project work by group and time, on a public display in the classroom.
• Photo News Board [85] This system provides photos of news stories in four categories, shown on a large display in a break room or lab.
• Notification Collage [36] This system provides users with communication information and various data from others in the shared space on a large screen.
• What’s Happening? [94, 95] This system shows relevant information (news, traffic, weather) to members of a local group on a large, wall display.
• BlueBoard [78] This system allows members in a local setting to view information pages about what is occurring in their location (research projects, meetings, events).
These five systems were chosen as a representative set of large screen information exhibits. The GAWK and Photo News Board were created in local labs and thus we have access to the developers and potential user classes. The other three are some of the more famous and familiar ones found in recent literature.
4.5 Analyzing Systems
Now that we have selected our target systems, we must determine the typical usability issues and problems inherent in them. Performing usability analysis or testing of these systems reveals the issues and problems each system holds. To find usability problems we can conduct analytic or empirical investigations, recording the issues we find.
We chose to use an analytic evaluation approach to the five aforementioned LSIEs, based on arguments from Section 3.3. We wanted to uncover as many usability concerns as possible, so we chose claims analysis [15, 77] as the analytic vehicle with which we investigated our systems.
4.5.1 Claims Analysis
Claims analysis is a method for determining the impacts design decisions have on user goals for a specific piece of software [15, 77]. Claims are statements about a design element reflecting a positive or negative effect resulting from using the design element in a system [15].
Claims analysis involves inspection and reflection on the wordings of specific claims to determine the psychological impacts a design artifact may have on a user [15]. The wordings are the actual words used to describe positive and negative effects of the claims. The impacts are the overall psychological effect on the user.
4.5.2 System Claims
Claims were made for each of the five systems that were inspected. These claims focused on design artifacts and overall goals of the systems. These claims are based on typical usage, as exemplified by the scenarios shown for each system. On average, there were over 50 claims made per system.
Table 4.2 shows the breakdown of the numbers of claims found for each system.
Each claim dealt with some design element in the interface, showing upsides or downsides resulting from a particular design choice.
These claims can be thought of as problem indicators, unveiling potential problems with the system being able to support the user goals. These problem indicators include positive aspects of design choices as well. By including the good with the bad, we gain fuller understanding of the underlying design issues.
4.5.3 Validating Claims
How do we know that the claims we found through our analysis represent the “real” design challenges in the systems? This is a fair question and one that must be addressed. We need to verify that the claims we are using to extract design guidance for LSIE systems are actually representative of real user problems encountered during use of those systems. We tackled this problem through several different techniques.
For the GAWK and Photo News Board, we relied upon existing empirical studies [85] to validate the claims we found for those systems.
For the Notification Collage we relied upon discussion and feedback from the system developers. We sent the list of claims and scenarios to Saul Greenberg and Michael Rounding and asked them to verify that the claims we made for the Notification Collage were typical of what they observed users actually doing with the system. Michael Rounding provided a thorough response that indicated most of the claims were indeed correct and experienced by real users of the system.
A similar effort was attempted with both the What’s Happening? and BlueBoard systems. The developers of these systems were contacted but no specific feedback was provided on our claims. However, John Stasko, co-developer of the What’s Happening? system, gave interview feedback on the system and provided a publication [95] that served as validation material for the claims. This report details user experiments done with the What’s Happening? system. Using this report, we were able to verify that most of the claims we made for the system were experienced in those experiments.
Unfortunately, none of the developers of the BlueBoard system responded to our request. We were able to use existing literature on the system to verify some of the claims, but the reports on user behavior in [78] did not provide enough material to validate all of the claims we found for that system.
4.6 Categorizing Claims
Now that we have analyzed several systems in the LSIE class, and we have over 250 claims about design decisions for those systems, how do we make sense of it all and glean reusable design guidance in the form of heuristics? To make sense of the claims we have, we need to group and categorize similar claims.
This requires a framework to ensure consistent classification and facilitate final heuristic synthesis from the classification. This is where the idea of critical parameters plays an important role.
4.6.1 Classifying Claims Using the IRC Framework
Recall that notification systems can be classified by their level of impact on interruption, reaction, and comprehension [62]. This classification scheme can be simplified to reflect a high, medium, or low impact to each of interruption, reaction, and comprehension.
In other words, we can take a single claim and classify it according to the impact it would have on the user goals associated with the system.
For example, we have a claim about the collage metaphor from the Notification Collage system that suggests that the lack of organization can hinder efforts to find information. This claim would be classified as “high” interruption because it increases the time required to find a piece of information. It could also be classified as “low” comprehension because it reduces a person’s ability to understand the information quickly and accurately. It is perfectly acceptable to have the claim fit into both classifications.
4.6.2 Assessing Goal Impact
Determining the impact a claim has on the user goals was done through inspection and reflection techniques. Each claim was read and approached from the scenarios for the system, trying to identify if the claim had an impact on the user goals. A claim impacted a user goal if it was determined through the wording of the claim that one of interruption, reaction, or comprehension was modified by the design element.
To assign user goal impacts to the claims, a team of experts should assess each claim.
These experts should have extensive knowledge of the system class, and the critical parameters that define that class. Knowledge of claims analysis techniques and/or usability evaluation are highly recommended.
We used a two- person team of experts.
Differences occurred when these classifications were not compatible.
Agreement was measured as the number of claims with the same classification divided by the total number of claims. We found that initial agreement on the claims was near 94% and after discussion was 100% for all claims.
This calculation comes from the fact that out of 253 individual claims, 237 were classified by the inspectors as impacting the user goals in the same way, i.e. all of the experts agreed on the same classification.
4.6.3 Categorization Through Scenario Based Design
Heuristics Creation
4.1 Introduction
Ensuring usability is an ongoing challenge for software developers. Myriad testing techniques exist, each presenting a trade-off between implementation cost and the effectiveness of the results.
Usability testing techniques are broken down into analytical and empirical types.
Analytical methods involve inspection of the system, typically by experts in the application field, who identify problems through a walkthrough process.
Empirical methods rely on people who could be real users of the application, testing specific aspects of the system under controlled conditions, often to determine how efficiently tasks can be performed with the system.
Using either type has advantages and disadvantages, but practitioners typically have limited budgets for usability testing. Thus, they need to use techniques that give useful results while not requiring significant funds. Analytic methods fit this requirement more readily for formative evaluation stages.
With the advent of new technologies and non-traditional interfaces, analytic techniques like heuristics hold the key to early and effective interface evaluation.
There are problems with using analytical methods (like heuristics) that can decrease the validity of results [21]. These problems come from applying a small set of guidelines to a wide range of systems, necessitating interpretation of evaluation results. This illustrates how generic guidelines are not readily applicable to all systems [40], and more specific heuristics are necessary.
Our goal was to create a more specific set, tailored to this system class, yet still generic enough to apply to all systems in this class.
LSIEs focus on very specific user goals based on the critical parameters of interruption, reaction, and comprehension. Differing levels of each parameter (high, medium, or low) define different system classes [62]. We focus on LSIEs, which require medium interruption, low to high reaction, and high comprehension.
4.2 Motivation
Tremendous effort has been devoted to the study of usability evaluation, specifically in comparing analytic to empirical methods.
Nielsen’s heuristics are probably the most notable analytical technique, developed to facilitate formative usability testing [71, 70]. They have come under fire for the claim that heuristic evaluation is comparable to user testing yet requires fewer test subjects. Comparisons of user testing to heuristic evaluation are numerous [48, 50, 90].
Some have worked to develop targeted heuristics for specific application types.
Baker et al. report on adapting heuristic evaluation to groupware systems [5]. They show that applying heuristic evaluation methods to groupware systems is effective and efficient for formative usability evaluation.
Mankoff et al. compare an adapted set of heuristics to Nielsen’s original set [56]. They studied ambient displays with both sets of heuristics and determined that their adapted set is better suited to ambient displays.
4.3 Processes Involved
How does one create a set of heuristics anyway? We could follow the steps of previous researchers and just use pre-existing heuristics, then reason about the target system class, hopefully coming up with a list of new heuristics that prove useful.
Nielsen and Molich explicitly state that the heuristics come from years of experience and reflection. This is not surprising, as the heuristics emerged some 30 years after graphical interfaces first appeared. In the case of Nielsen and Mack, they at least validated their method by using it to analyze several systems after they had created their set.
The two studies mentioned above relied either upon vague descriptions of theoretical underpinnings [5] or upon simple tweaking of existing heuristics [56].
Our approach to this lack of structure in creating heuristics is to take a logical look at how one might uncover or discover heuristics for a particular type of system. Basically, to gain insight about a certain type of system, one could analyze several example applications in that system class based on the critical parameters for that system class, and then use the results of that analysis to categorize and group the issues discovered into re-usable design guidelines or heuristics.
These stages involve (a rough sketch of the full pipeline, in code, follows this list):
• selection of target systems.
• inspection of these systems. An approach like claims analysis [15] provides the necessary structure for knowledge extraction and a consistent representation.
• classification of design implications. Leveraging the underlying critical parameters helps organize the claims found in terms of impacts to those parameters.
• categorization of design implications. Scenario Based Design [77] provides a mechanism for categorizing design knowledge into manageable parts.
• extraction of high-level design guidance. Based on the groupings developed in the previous step, high-level design guidelines can be formulated in terms of design issues.
• synthesis of potential heuristics. By matching and relating similar issues, heuristics can be synthesized.
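To keep the overall flow in mind while reading the rest of the chapter, here is a rough sketch (my own framing, not code from the dissertation) of the six stages as a Python pipeline. Every function body is a trivial placeholder, since each stage is a manual, expert-driven activity; the sketch only shows the order of the stages and the shape of the data handed between them.

def select_target_systems(system_class):
    # Section 4.4: pick representative, well-documented systems in the class.
    if system_class != "LSIE":
        return []
    return ["GAWK", "Photo News Board", "Notification Collage",
            "What's Happening?", "BlueBoard"]

def claims_analysis(system):
    # Section 4.5: record claims about design elements (done by hand in reality).
    return [f"hypothetical claim about a design element in {system}"]

def classify_and_categorize(claims):
    # Sections 4.6.1-4.6.3: assign IRC impacts and SBD categories to each claim.
    return {("information/layout", "comprehension", "low"): claims}

def extract_issues(problem_tree):
    # Section 4.7.2: team discussion distills each node into one or more issues.
    return [f"issue covering {len(c)} claim(s) in {area}"
            for (area, _, _), c in problem_tree.items()]

def synthesize_heuristics(issues):
    # Section 4.7.3: related issues are merged into a small set of heuristics.
    return [f"heuristic synthesized from {len(issues)} issue(s)"]

systems = select_target_systems("LSIE")
claims = [c for s in systems for c in claims_analysis(s)]
heuristics = synthesize_heuristics(extract_issues(classify_and_categorize(claims)))
print(heuristics)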
4.4 Selecting Systems
The first step in the creation process requires careful selection of example systems to inspect and analyze for uncovering existing problems in the systems. The idea is to uncover typical issues inherent in that specific type of system.
Our goal was to use a representative set of systems from the LSIE class. We wanted systems that had been in use for a while, with reports on usage or studies on usability to help validate the analysis we would perform on them. We chose the following five LSIE systems, including some from our own work and some from other well-documented design efforts, to investigate further in the creation process.
• GAWK [31] This system provides teachers and students an overview and history of current project work by group and time, on a public display in the classroom.
• Photo News Board [85] This system provides photos of news stories in four categories, shown on a large display in a break room or lab.
• Notification Collage [36] This system provides users with communication information and various data from others in the shared space on a large screen.
• What’s Happening? [94, 95] This system shows relevant information (news, traffic, weather) to members of a local group on a large wall display.
• BlueBoard [78] This system allows members in a local setting to view information pages about what is occurring in their location (research projects, meetings, events).
These five systems were chosen as a representative set of large screen information exhibits. The GAWK and Photo News Board were created in local labs and thus we have access to the developers and potential user classes. The other three are some of the more famous and familiar ones found in recent literature.
4.5 Analyzing Systems
Now that we have selected our target systems, we must determine the typical usability issues and problems inherent in them. Performing usability analysis or testing of these systems uncovers the issues and problems each system holds. To find usability problems we can conduct analytic or empirical investigations, recording the issues we find.
We chose an analytic evaluation approach for the five aforementioned LSIEs, based on the arguments from Section 3.3. We wanted to uncover as many usability concerns as possible, so we chose claims analysis [15, 77] as the analytic vehicle with which we investigated our systems.
4.5.1 Claims Analysis
Claims analysis is a method for determining the impacts design decisions have on user goals for a specific piece of software [15, 77]. Claims are statements about a design element reflecting a positive or negative effect resulting from using the design element in a system [15].
Claims analysis involves inspection and reflection on the wordings of specific claims to determine the psychological impacts a design artifact may have on a user [15]. The wordings are the actual words used to describe positive and negative effects of the claims. The impacts are the overall psychological effect on the user.
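As a concrete illustration of what a recorded claim might look like, here is a minimal Python sketch. The Claim class, its field names, and the example wordings are my own assumptions for illustration; the dissertation records claims as prose.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    system: str            # which LSIE the claim was made for
    design_feature: str    # the design element the claim is about
    upsides: List[str] = field(default_factory=list)    # positive wordings
    downsides: List[str] = field(default_factory=list)  # negative wordings

# Hypothetical example based on the Notification Collage discussion below.
collage_claim = Claim(
    system="Notification Collage",
    design_feature="collage metaphor (unorganized placement of items)",
    upsides=["supports casual, opportunistic posting of media items"],
    downsides=["lack of organization can hinder efforts to find information"],
)
print(collage_claim.design_feature)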
4.5.2 System Claims
Claims were made for each of the five systems that were inspected. These claims focused on design artifacts and overall goals of the systems. These claims are based on typical usage, as exemplified by the scenarios shown for each system. On average, there were over 50 claims made per system.
Table 4.2 shows the breakdown of the numbers of claims found for each system.
Each claim dealt with some design element in the interface, showing upsides or downsides resulting from a particular design choice.
These claims can be thought of as problem indicators, unveiling potential problems with the system being able to support the user goals. These problem indicators include positive aspects of design choices as well. By including the good with the bad, we gain fuller understanding of the underlying design issues.
4.5.3 Validating Claims
How do we know that the claims we found through our analysis represent the “real” design challenges in the systems? This is a fair question and one that must be addressed. We need to verify that the claims we are using to extract design guidance for LSIE systems are actually representative of real user problems encountered during use of those systems. We tackled this problem through several different techniques.
For the GAWK and Photo News Board, we relied upon existing empirical studies [85] to validate the claims we found for those systems.
For the Notification Collage we relied upon discussion and feedback from the system developers. We sent the list of claims and scenarios to Saul Greenberg and Michael Rounding and asked them to verify that the claims we made for the Notification Collage were typical of what they observed users actually doing with the system. Michael Rounding provided a thorough response that indicated most of the claims were indeed correct and experienced by real users of the system.
A similar effort was attempted with both the What’s Happening? and Blue Board systems. The developers of these systems were contacted but no specific feedback was provided on our claims. However, John Stasko, co-developer of the What’s Happening? system, provided interview feedback on the system and supplied a publication [95] that served as validation material for the claims. This report provides details on user experiments done with the What’s Happening? system. Using this report, we were able to verify that most of the claims we made for the system were experienced in those experiments.
Unfortunately, none of the developers of the Blue Board system responded to our request. We were able to use existing literature on the system to verify some of the claims but the reports on user behavior in [78] did not provide enough material to validate all of the claims we found for that system.
4.6 Categorizing Claims
Now that we have analyzed several systems in the LSIE class and have over 250 claims about design decisions for those systems, how do we make sense of it all and glean reusable design guidance in the form of heuristics? To do so, we need to group and categorize similar claims.
This requires a framework to ensure consistent classification and facilitate final heuristic synthesis from the classification. This is where the idea of critical parameters plays an important role.
4.6.1 Classifying Claims Using the IRC Framework
Recall that notification systems can be classified by their level of impact on interruption, reaction, and comprehension [62]. This classification scheme can be simplified to reflect a high, medium, or low impact to each of interruption, reaction, and comprehension.
In other words, we can take a single claim and classify it according to the impact it would have on the user goals associated with the system.
For example, we have a claim about the collage metaphor from the Notification Collage system that suggests that the lack of organization can hinder efforts to find information. This claim would be classified as “high” interruption because it increases the time required to find a piece of information. It could also be classified as “low” comprehension because it reduces a person’s ability to understand the information quickly and accurately. It is perfectly acceptable to have the claim fit into both classifications.
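A hedged sketch of how the IRC classification of that Notification Collage claim could be recorded; the Level enum and the dictionary layout are illustrative assumptions, not the dissertation's notation.

from enum import Enum

class Level(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

# A claim may impact more than one parameter, so the classification is a dict
# from parameter name to level; parameters the claim does not affect are omitted.
collage_claim_irc = {
    "interruption": Level.HIGH,    # finding information takes longer
    "comprehension": Level.LOW,    # information is harder to understand quickly
}

for parameter, level in collage_claim_irc.items():
    print(f"{parameter}: {level.value}")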
4.6.2 Assessing Goal Impact
Determining the impact a claim has on the user goals was done through inspection and reflection. Each claim was read in the context of the scenarios for its system to identify whether it had an impact on the user goals. A claim impacted a user goal if the wording of the claim indicated that interruption, reaction, or comprehension was modified by the design element.
To assign user goal impacts to the claims, a team of experts should assess each claim.
These experts should have extensive knowledge of the system class and of the critical parameters that define that class. Knowledge of claims analysis techniques and/or usability evaluation is highly recommended.
We used a two-person team of experts. Each expert classified every claim independently; differences occurred when these independent classifications were not compatible, and were resolved through discussion.
Agreement was measured as the number of claims given the same classification divided by the total number of claims. We found that initial agreement was near 94% and, after discussion, 100% for all claims.
This figure comes from the fact that, out of 253 individual claims, 237 were classified by the inspectors as impacting the user goals in the same way, i.e. both experts agreed on the same classification.
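The agreement measure is simple enough to check directly; the following snippet just reproduces the arithmetic behind the reported figure.

# Sketch of the agreement measure described above: claims given the same IRC
# classification by both inspectors, divided by all claims.
same_classification = 237   # claims both inspectors classified identically
total_claims = 253

initial_agreement = same_classification / total_claims
print(f"Initial agreement: {initial_agreement:.1%}")   # ~93.7%, i.e. near 94%
# The remaining 16 claims were discussed until consensus, giving 100% agreement.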
4.6.3 Categorization Through Scenario Based Design
Categorization is needed to separate the claims into manageable groups. By focusing on related claims, similar design tradeoffs can be considered together. An interface design methodology is useful because these approaches often provide a built-in structure that facilitates claims categorization.
Possible design methodologies include Scenario Based Design [77], User Centered Design [73], and Norman’s Stages of Action [72].
Scenario based design (SBD) [77] is an interface design methodology that relies on scenarios about typical usage of a target system.
Activity Design
Activity design involves what users can and cannot accomplish with the system, at a high level [77]. These are the tasks that the interface supports, ones that the users would otherwise not be able to accomplish.
Activity design encompasses metaphors and supported/unsupported activities [77].
Information Design
Information design deals with how information is shown and how the interface looks [77]. Design decisions for information presentation directly impact comprehension, as well as interruption. Identifying the impacts of information design decisions on user goals can lead to effective design guidelines.
We chose to use the following sub-categories for refining the information design category: use of screen space, foreground and background colors, use of fonts, use of audio, use of animation, and layout. These sub-categories were chosen because they cover almost all of the design issues relevant to information design [77].
Interaction Design
Interaction design focuses on how a user would interact with a system (clicking, typing, etc.) [77]. This includes recognizing affordances, understanding the behavior of interface controls, knowing the expected transitions of states in the interface, support for error recovery and undo operations, feedback about task goals, and configurability options for different user classes [77].
Categorization
Armed with the above categories, we are now able to group individual claims into an organized structure, thereby facilitating further analysis and reuse. So how do we know in which area a particular claim should go? This again is done through group analysis and discussion regarding the wording of the claim. The claim wordings typically indicate which category of SBD applies, and any disagreements can be handled through discussion and mitigation.
Similar to the classification effort, this categorization process relied upon the claim wordings for correct placement within the SBD categories. The sub-categories for each of activity, information, and interaction provide 14 areas in which claims may be placed.
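A small sketch of the bucketing step. The 2 + 6 + 6 sub-category split below is one plausible reading of the sub-categories named above (the dissertation's exact 14 areas may differ), and the example placements are invented.

from collections import defaultdict

SBD_AREAS = {
    "activity": ["metaphors", "supported/unsupported activities"],
    "information": ["screen space", "foreground/background colors", "fonts",
                    "audio", "animation", "layout"],
    "interaction": ["affordances", "control behavior", "state transitions",
                    "error recovery/undo", "feedback", "configurability"],
}

# Hypothetical placements decided by discussing each claim's wording.
placements = [
    ("collage metaphor hinders finding information", ("activity", "metaphors")),
    ("small banner text is hard to read from a distance", ("information", "fonts")),
]

# Sanity check: every placement names one of the defined sub-categories.
assert all(sub in SBD_AREAS[cat] for _, (cat, sub) in placements)

buckets = defaultdict(list)
for claim_wording, area in placements:
    buckets[area].append(claim_wording)

for (category, sub_category), claims in buckets.items():
    print(f"{category}/{sub_category}: {len(claims)} claim(s)")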
Unclassified Claims
Some claims were deemed unclassified because they did not impact interruption, reaction, or comprehension. While it is possible to situate such claims within the SBD categories, any claim that does not impact one of the three user goals was simply recorded as unclassified.
4.7 Synthesis Into Heuristics
After classifying the problems within the framework, we needed to extract usable design recommendations from those problems.
This required re-inspection of the claim groupings to determine the underlying causes of the issues.
Since the problems come from different systems, we get a broad look at potential design flaws. Identifying and recognizing these flaws in these representative systems can help other designers avoid making those same mistakes in their work.
4.7.1 Visualizing the Problem Tree
To better understand how claims impacted the user goals of each of the systems, a problem tree was created to aid in the visualization of the dispersion of the claims within different areas of the SBD categories.
A problem tree is a collection of claims for a system class, organized by categories, sub-categories, and critical parameter. It serves as a representation of the design knowledge that is collected from the claims analysis process.
A node in the problem tree refers to a collection of claims that fits within a single category (from SBD) with a single classification (from the critical parameters). A leaf in the tree refers to a single claim, and is attached to some node in the tree.
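A minimal sketch of the problem tree as a data structure, assuming nodes keyed by an SBD category plus a critical-parameter classification and leaves holding individual claims; class and field names are my own, and the example claims are invented.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ProblemNode:
    sbd_category: str           # e.g. "information/layout"
    parameter: str              # "interruption", "reaction", or "comprehension"
    level: str                  # "low", "medium", or "high"
    claims: List[str] = field(default_factory=list)   # the leaves

tree: List[ProblemNode] = [
    ProblemNode("activity/metaphors", "interruption", "high",
                ["collage metaphor hinders finding information"]),
    ProblemNode("information/layout", "comprehension", "low",
                ["unstructured placement reduces quick understanding"]),
]

for node in tree:
    print(f"{node.sbd_category} [{node.parameter}={node.level}]: "
          f"{len(node.claims)} claim(s)")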
4.7.2 Identifying Issues
To glean reusable design guidance from the individual claims, team discussion was used. A team of experts familiar with the claims analysis process and the problem tree considered each node in the tree, aiming to identify one or more issues that capture the claims within that node.
Issues are design statements, more general than individual claims.
This effort produced 22 issues that covered the 333 claims.
4.7.3 Issues to Heuristics
Armed with the 22 high-level issues, we needed to extract a smaller set of high-level design heuristics from them. Twenty-two issues are too many to manage in formative heuristic evaluation [66], and in many cases the issues were similar or related, suggesting opportunities for concatenation and grouping. This similarity allowed us to create higher-level, more generic heuristics to capture the issues.
We created eight final heuristics, capturing the 22 issues discovered in the earlier process.
Table 4.7 provides an example of how we moved from the issues to the heuristics. In most instances, two or three issues could be combined into a single heuristic. However some of the issues were already at a high level and were taken directly into the heuristic list.
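To illustrate the concatenation step, here is a hypothetical grouping of invented issue statements under two of the final heuristics; Table 4.7 in the dissertation gives the actual mapping.

# The heuristic texts come from the list below; the issue texts are invented
# placeholders to show the shape of the issues-to-heuristics grouping.
issue_groups = {
    "Judicious use of animation is necessary for effective design.": [
        "multiple simultaneous animations distract viewers",
        "items that move without warning are hard to track",
        "smooth transitions help viewers notice new content",
    ],
    "Avoid the use of audio.": [
        "audio notifications disturb bystanders in shared settings",
    ],
}

for heuristic, issues in issue_groups.items():
    print(f"{heuristic}  <- derived from {len(issues)} issue(s)")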
4.7.4 Heuristics
Here is the list of heuristics that can be used to guide evaluation of large screen information exhibits.
Explanatory text follows each heuristic, to clarify and illustrate how the heuristics could impact evaluation. Each is general enough to be applied to many systems in this application class, yet they all address the unique user goals of large screen information exhibits.
• Appropriate color schemes should be used for supporting information understanding.
Try using cool colors such as blue or green for background or borders. Use warm colors like red and yellow for highlighting or emphasis.
• Layout should reflect the information according to its intended use.
Time-based information should use a sequential layout; topical information should use categorical, hierarchical, or grid layouts. Screen space should be allocated according to information importance.
• Judicious use of animation is necessary for effective design.
Multiple, separate animations should be avoided. Indicate current and target locations if items are to be automatically moved around the display. Introduce new items with slower, smooth transitions. Highlighting related information is an effective technique for showing relationships among data.
• Use text banners only when necessary.
Reading text on a large screen takes time and effort. If a banner is necessary, keep it at the top or bottom of the screen. Use sans serif fonts to facilitate reading, and make sure the font sizes are large enough.
• Show the presence of information, but not the details.
Use icons to represent larger information structures, or to provide an overview of the information space, but not the detailed information; viewing information details is better suited to desktop interfaces. The magnitude or density of the information dictates the representation mechanism (text vs. icons, for example).
• Using cyclic displays can be useful, but care must be taken in implementation.
Indicate “where” the display is in the cycle (e.g. 1 of 5 items, or a progress bar). Timings (both for single item presence and total cycle time) should be appropriate and allow users to understand content without being distracted.
• Avoid the use of audio.
Audio is distracting, and on a large public display, could be detrimental to others in the setting. Furthermore, lack of audio can reinforce the idea of relying on the visual system for information exchange.
• Eliminate or hide configurability controls.
Large public displays should be configured one time by an administrator. Allowing multiple users to change settings can increase confusion and distraction caused by the display. Changing the interface too often prevents users from learning the interface.
Source: Somervell, Jacob. Developing Heuristic Evaluation Methods for Large Screen Information Exhibits Based on Critical Parameters. [Dissertation, PhD in Computer Science and Applications] Virginia Polytechnic Institute and State University. June 22, 2004.
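To make the eight heuristics easier to apply during an inspection, here is a rough sketch (my own, not from the dissertation) that encodes them as a simple checklist and records hypothetical violations per heuristic.

LSIE_HEURISTICS = [
    "Appropriate color schemes should be used for supporting information understanding.",
    "Layout should reflect the information according to its intended use.",
    "Judicious use of animation is necessary for effective design.",
    "Use text banners only when necessary.",
    "Show the presence of information, but not the details.",
    "Using cyclic displays can be useful, but care must be taken in implementation.",
    "Avoid the use of audio.",
    "Eliminate or hide configurability controls.",
]

# Hypothetical findings from one inspection session: heuristic index -> notes.
violations = {
    2: ["two tickers animate simultaneously and compete for attention"],
    6: ["news items play a sound when they refresh"],
}

for index, heuristic in enumerate(LSIE_HEURISTICS):
    notes = violations.get(index, [])
    status = f"{len(notes)} violation(s)" if notes else "no issues recorded"
    print(f"{index + 1}. {heuristic} -> {status}")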
Labels: heuristic evaluation, Somervell, usability criteria