Monday, May 28, 2012

Draft: Steps in a Defect Outbreak Investigation

     

Introduction

Although it may not be obvious at first, the steps of a disease outbreak investigation correspond closely to the steps of two separate software development processes:

  • Most obviously: the defect investigation and resolution process conducted against software already developed and made available to users
  • Less obviously, but more importantly: the software development process itself before it is formally complete or released to users

 

Investigating an Epidemic of Defect-Infected Software

After reading the following excerpt from a CDC education site, you can see a very clear analogy between disease outbreak investigation and "defect" outbreak investigation:

"In investigating an outbreak, speed is essential, but getting the right answer is essential, too. To satisfy both requirements, epidemiologists approach investigations systematically, using the following 10 steps:

  1. Prepare for field work
  2. Establish the existence of an outbreak
  3. Verify the diagnosis
  4. Define and identify cases
  5. Describe and orient the data in terms of time, place, and person
  6. Develop hypotheses
  7. Evaluate hypotheses
  8. Refine hypotheses and carry out additional studies
  9. Implement control and prevention measures
  10. Communicate findings

The steps are presented here in conceptual order. In practice, however, several may be done at the same time, or they may be done in a different order. For example, control measures should be implemented as soon as the source and mode of transmission are known, which may be early or late in any particular outbreak investigation."

Source: http://www.cdc.gov/excite/classroom/outbreak/steps.htm

Visualizing a Food-Borne Illness Investigation

To help form a mental picture of this process and its iterative, feedback-driven nature, consider the following diagram from CDC's OutbreakNet web site:

[Figure: Steps in a Foodborne Outbreak Investigation]
Source: http://www.cdc.gov/outbreaknet/investigations/figure_outbreak_process.html

Notice the loop that branches off from step four. It shows that controlling the outbreak, or learning how to prevent it from recurring, is a feedback cycle: the results and theoretical implications of hypotheses and experiments are evaluated against continued empirical observations in the field.

Mapping Disease Outbreak Steps to Software Defect Resolution Steps

Assume below that there is a browser-based application that is used by epidemiologists and public health professionals in the field, investigating outbreaks. When these users discover problems with the software, they must report those problems to the development team that maintains the software.

In the table below, the right-hand column represents the actions that members of the development team must take to investigate and resolve reported problems. Of course, this process is not at all specific to public health information systems; it applies to just about any system that is "deployed in the field".

| Disease Outbreak Investigation Step | Defect Root-Cause Investigation & Resolution Step |
| --- | --- |
| Prepare for field work | Establish defect communication channels |
| Establish the presence of an outbreak | Establish confirmed presence of a defect |
| Verify the diagnosis | Verify the steps to reproduce |
| Define and identify cases | Define and identify type of issue |
| Describe and orient the data in terms of time, place, and person | Describe and orient the defect report in terms of time, browser, operating system, user role and other system-specific characteristics |
| Develop hypotheses | Reproduce defect under controlled conditions |
| Evaluate hypotheses | Determine root cause |
| Refine hypotheses and carry out additional studies | Develop a "fix" for the defect and perform system regression testing |
| Implement control and prevention measures | Implement control and prevention measures |
| Communicate findings | Deploy fix and communicate resolution |

 

Details of Defect Root-Cause Investigation and Resolution Process Steps

 

Prepare for Field Work : Establish defect communication channels

End users are the ones "in the field", so management must provide communication channels that enable them to give feedback: automatic error trapping within deployed systems, plus telephone, email, and issue tracking systems.
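As a minimal sketch of the "automatic error trapping" channel, here is one way a Python component could turn an unhandled exception into a structured report. Everything named here is an assumption for illustration: `report_defect` and `REPORTED_DEFECTS` stand in for whatever issue tracker or email gateway a team actually uses, and `save_case_record` is a made-up operation.

```python
import functools
import traceback
from datetime import datetime, timezone

# Hypothetical in-memory "channel"; a real system would post to an
# issue tracker, send email, or write to a monitored log instead.
REPORTED_DEFECTS = []


def report_defect(report):
    """Hypothetical stand-in for the team's defect communication channel."""
    REPORTED_DEFECTS.append(report)


def trap_errors(func):
    """Report any unhandled exception raised by func, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            report_defect({
                "when": datetime.now(timezone.utc).isoformat(),
                "where": func.__name__,
                "error": repr(exc),
                "trace": traceback.format_exc(),
            })
            raise  # the user still sees the failure; the team now hears about it
    return wrapper


@trap_errors
def save_case_record(record):
    # Illustrative operation that fails on incomplete input.
    return record["patient_id"]
```

Calling `save_case_record({})` raises `KeyError` as before, but a report describing when and where it happened is also appended to `REPORTED_DEFECTS`, so the team does not depend on the user picking up the phone.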

Establish the presence of an outbreak : Establish confirmed presence of a defect

When users report an issue as a defect, the team must be able to verify whether it really is a defect, or perhaps some other kind of malfunction or a user training issue (misunderstanding, lack of experience, etc.).

Verify the diagnosis : Verify the steps to reproduce

To investigate the defect, the team must document the precise sequence of actions taken by the user. These are needed to reproduce the same issue under controlled testing conditions. This normally involves a dedicated "test" environment, often implemented with virtual machine software that can mimic the end user's environment as closely as possible. There are, in fact, entire businesses that exist solely to provide this capability to other companies.

Define and identify cases : Define and identify type of issue

Not all reported issues fall neatly into the category of "defect". Other categories include:

  • Request for enhancement
  • Difficulty using feature
  • Annoyance with a feature

It is still important to classify each reported item for the purpose of improving the system based upon its users' experiences. This helps with a planning and prioritization process we will discuss in future articles.
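To make the classification concrete, here is a small sketch that tags each reported item with one of the categories above and counts them. The `IssueType` names, the `summarize` helper, and the sample reports are all invented for illustration; a real team would use the categories of its own issue tracker.

```python
from collections import Counter
from enum import Enum


class IssueType(Enum):
    DEFECT = "defect"
    ENHANCEMENT_REQUEST = "request for enhancement"
    USABILITY_DIFFICULTY = "difficulty using feature"
    ANNOYANCE = "annoyance with a feature"


def summarize(classified_reports):
    """Count reports per category so every item, defect or not, informs planning."""
    return Counter(issue_type for _, issue_type in classified_reports)


# Hypothetical reports, each already classified by the team.
reports = [
    ("Export button throws an error", IssueType.DEFECT),
    ("Please add a map view", IssueType.ENHANCEMENT_REQUEST),
    ("Cannot find the search box", IssueType.USABILITY_DIFFICULTY),
    ("Saving a form crashes the page", IssueType.DEFECT),
]
counts = summarize(reports)
```

Here `counts[IssueType.DEFECT]` is 2, and a category with no reports, such as `IssueType.ANNOYANCE`, simply counts as 0.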

Describe and orient the data in terms of time, place, and person : Describe and orient the defect report in terms of time, browser, operating system, user role and other system-specific characteristics

Document conditions that can help characterize the event for detailed investigation by the management team. For a browser-based application, this includes, at minimum, the characterizing features listed above. You might wonder why "time" would be important. It can be important for a number of reasons, including:

  • Patterns of system load (how many people are using it at the same time) which vary by time of day
  • Complications with other transactions undertaken by different users (concurrency violations, deadlocks, threading issues, etc.)
  • System dependencies upon other, external systems -- many systems in corporate or government environments depend on "Single Sign On" servers being functional. When these systems fail, your own system may fail, but your users will still blame you :-)
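A rough sketch of what "orienting" defect reports might look like in code: each report records the characterizing features listed above, and a small helper tallies reports along any one of them. The `DefectReport` fields, the `tally` helper, and the sample data are assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DefectReport:
    reported_at: datetime      # "time"
    browser: str               # the software analogue of "place"
    operating_system: str
    user_role: str             # the software analogue of "person"
    summary: str


def tally(reports, key):
    """Count reports by whatever characteristic the key function extracts."""
    return Counter(key(report) for report in reports)


# Hypothetical reports collected from the field.
reports = [
    DefectReport(datetime(2012, 5, 28, 9, 5), "Firefox", "Windows 7",
                 "field investigator", "Login times out"),
    DefectReport(datetime(2012, 5, 28, 9, 40), "Internet Explorer", "Windows XP",
                 "field investigator", "Login times out"),
    DefectReport(datetime(2012, 5, 28, 14, 15), "Firefox", "Windows 7",
                 "administrator", "Report page is blank"),
]

by_hour = tally(reports, lambda r: r.reported_at.hour)
by_browser = tally(reports, lambda r: r.browser)
by_role = tally(reports, lambda r: r.user_role)
```

In this made-up data, two of the three reports fall in the 9 o'clock hour and share the same summary, which is the kind of clustering that would point toward load or an external dependency rather than a browser-specific bug.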

Develop Hypotheses : Reproduce Defect within Controlled Environment

Following directly from the previous step, the team must reproduce the defect by performing the same steps reported by the user who encountered it. Reproducing a defect is not entirely the same as performing the steps in a scientific experiment, but it's pretty close! A test has three parts:

  • Starting conditions (including any steps to take to prepare the conditions)
  • Series of steps to execute
  • Expected observations

After the test steps are executed, the actual observations are compared against the expected observations to evaluate the results. When attempting to reproduce a defect, if the actual observations do not match (meaning the bug cannot be reproduced!), it does not automatically mean that the user was incorrect or lying. The cause could be:

  • Faulty or incomplete starting conditions
  • Incorrect or incomplete execution of steps
  • Incorrect or improperly measured expected observations or actual observations
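The three parts of a test, and one of the ways a reproduction can fail, can be sketched as follows. This is only an illustration under invented assumptions: `ReproductionAttempt` and `add_case` are hypothetical names, and the "defect" is a made-up case list that accepts the same case ID twice.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ReproductionAttempt:
    setup: Callable[[], Any]            # starting conditions
    steps: List[Callable[[Any], Any]]   # series of steps to execute
    expected: Any                       # the observation the user reported

    def run(self):
        """Build the starting state, apply each step, return the actual observation."""
        state = self.setup()
        for step in self.steps:
            state = step(state)
        return state

    def reproduced(self):
        """The defect is reproduced when actual and reported observations match."""
        return self.run() == self.expected


def add_case(case_id):
    """Hypothetical buggy step: appends a case ID without checking for duplicates."""
    def step(cases):
        return cases + [case_id]
    return step


# The user reported seeing the same case listed twice.
faithful = ReproductionAttempt(
    setup=list,
    steps=[add_case("A-1"), add_case("A-1")],
    expected=["A-1", "A-1"],
)

# Same report, but the tester performed only one of the two steps.
incomplete = ReproductionAttempt(
    setup=list,
    steps=[add_case("A-1")],
    expected=["A-1", "A-1"],
)
```

`faithful.reproduced()` is `True`, while `incomplete.reproduced()` is `False` even though the defect is real: the user was right, and the execution of steps was incomplete.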

As you can see, this step is a subject for an entire article of its own.

Evaluate Hypotheses : Determine Root-Cause

As foreshadowed in the previous step, this step may involve a number of trial-and-error "hypotheses" depending upon:

  • How well-reported the defect is
  • How well-executed the reproduction process is
  • How well-constructed the system itself is according to quality engineering practices


Refine hypotheses and carry out additional studies : Develop a "fix" for the defect and perform system regression testing

This simplifies a much larger series of steps, but the most important part is that a "regression test" is performed, which verifies that the resolution of this specific defect does not introduce additional defects in other areas (or even in the same area!). This will be the subject of its own article as well.
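As a small, hedged illustration of the idea, suppose the reported defect was that a made-up function, `normalize_case_id`, did not strip surrounding whitespace. The sketch below uses Python's standard `unittest` module: one test pins down the fix, and the others re-check behavior that already worked. The function and the test names are hypothetical.

```python
import unittest


def normalize_case_id(raw):
    """Hypothetical function after the fix: surrounding whitespace is now stripped."""
    return raw.strip().upper()


class NormalizeCaseIdTests(unittest.TestCase):
    def test_reported_defect_is_fixed(self):
        # The behavior the user reported as broken.
        self.assertEqual(normalize_case_id("  a-17 "), "A-17")

    def test_existing_uppercase_behavior_is_unchanged(self):
        # Behavior that worked before the fix and must still work.
        self.assertEqual(normalize_case_id("b-2"), "B-2")

    def test_already_normalized_id_is_unchanged(self):
        self.assertEqual(normalize_case_id("C-3"), "C-3")


def run_regression_suite():
    """Run every test, old and new, and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeCaseIdTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The point of the design is that the whole suite runs after every fix, not just the new test: `run_regression_suite().wasSuccessful()` should be `True` before the fix is deployed.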

Implement control and prevention measures : Implement control and prevention measures

In this step, the team must reflect upon the root cause of the defect and analyze what it can do to prevent similar defects from occurring again. I've sometimes heard this kind of activity described as a "post-mortem" in the realm of public health. Visions of autopsies and mortality are not always omens for success, so in the realm of systems development it's often called a "retrospective". You can learn more about retrospectives at http://www.retrospectives.com/.

Communicate findings : Deploy fix and communicate resolution

This step involves updating the system with the defect corrected (and the rest of the potentially affected system fully regression tested). Naturally, this step is very system-specific. In a future article, however, I will demonstrate how to structure a modern web application so that its specific features, or modules, are independently version-controlled and independently upgradable. These practices allow teams to improve user experience with little to no "down-time", and even to experiment with new functionality without the cost of expensive, time-consuming "whole system rewrites".

2 comments:

  1. Great Analogy!

    To me the real success-mantra in both the cases would be a shift from cure based effort to prevention based effort.

    Nevertheless, good points presented very intelligently.

  2. Thanks, Anshu. In the next post, I'll start adapting the model so that it applies to the product *under development*, prior to release, and will compare that to public health surveillance and monitoring, statistics, population health metrics, etc: the tools of proactive prevention.
