"Lies, Damn Lies and Statistics"

“There are lies, damn lies and statistics” – Mark Twain

As the quotation reminds us, the same numbers can be used to support many, often opposing, causes. Even though numbers are just numbers, they can be used to tell a story.

What you do with the messages developed from the metrics you collect is important in its own right. Messages become tools (or weapons) to motivate. Motivation can range from the positive (look how well you are doing) to the negative (look how badly you are doing) to the ultimatum (do better, or else). Here we will discuss what happens when there is no message, or when the message and the data aren't synchronized.

If You Act (or React) Irrationally, Bad Things Happen:

“Good numbers go bad when middle management dictates what the metrics program will report in order to improve or make a less than stellar project look better than it really is.” — RaeAnn Hamilton, TDS Telecom

Good numbers go bad when the reaction they cause is irrational. One example of how measures can incent or create bad behavior is the ‘single measure syndrome’ (SMS). SMS occurs when it is decided from on high that a whole organization can be maximized on a single metric, such as time-to-market. Measuring an organization, department, or project on a single metric might sound like a good idea, but life is more complicated: a single metric, however impressive, can have unintended consequences. For example, one means of maximizing time-to-market might be to reduce quality (forget testing; fast is what counts). In this example, is the problem the behavior, the use of just one metric, or the metric itself? Arguably, the most rational behavior is to maximize the measure being focused on, so the problem would appear to be the behavior a single metric creates. That is basic human nature; it just might not be the best outcome for the organization. Think about what it is you want to incentivize. Is time-to-market the real goal in this case, or is something more balanced?
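The dynamic above can be made concrete with a small sketch. Everything here is invented for illustration: the strategy names, the scores, and the weights are hypothetical, and the point is only that ranking on one metric rewards the strategy that sacrifices everything else.

```python
# Hypothetical illustration of the 'single measure syndrome': ranking on one
# metric vs. a weighted blend. All names, scores, and weights are invented.

projects = {
    # strategy: scores on each metric (higher is better)
    "skip-testing": {"time_to_market": 0.95, "quality": 0.40},
    "balanced":     {"time_to_market": 0.75, "quality": 0.85},
}

def single_metric_rank(metric):
    """Pick the 'best' strategy using one metric only."""
    return max(projects, key=lambda p: projects[p][metric])

def balanced_rank(weights):
    """Pick the best strategy using a weighted blend of metrics."""
    def score(p):
        return sum(w * projects[p][m] for m, w in weights.items())
    return max(projects, key=score)

print(single_metric_rank("time_to_market"))                    # rewards cutting quality
print(balanced_rank({"time_to_market": 0.5, "quality": 0.5}))  # rewards the trade-off
```

With only time-to-market in view, the quality-cutting strategy wins; give quality any meaningful weight and the ranking flips.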

Patterns, Patterns Everywhere:

It is human to ascribe meaning to data and then to act on that meaning (a cognitive bias). Measurement organizations use this basic premise to drive activity; it is the organizational psychology behind the adage, “you get what you measure.” For example, in Agile projects using a burn-down chart, remaining effort reported above the ideal line for two or three consecutive days is generally interpreted as a sign that the team needs to change behavior. The pattern acts as the trigger rather than a single observation. Because numbers and actions are intertwined, the behavioral implications must be examined before numbers are deployed pell-mell.
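The burn-down trigger described above can be sketched as a short function. The sprint length, total effort, streak threshold, and daily figures are all invented for the example; the point is that the flag fires on a run of days above the ideal line, not on one bad day.

```python
# Sketch of a pattern-based burn-down trigger: flag the team only when
# remaining effort sits above the ideal line for `streak` consecutive days.
# Sprint length, effort totals, and daily actuals are invented.

def ideal_remaining(total_effort, sprint_days, day):
    """Straight-line ideal burn-down from total_effort to zero."""
    return total_effort * (1 - day / sprint_days)

def needs_attention(actual_remaining, total_effort, sprint_days, streak=3):
    """True once actuals exceed the ideal line for `streak` days in a row."""
    run = 0
    for day, actual in enumerate(actual_remaining, start=1):
        if actual > ideal_remaining(total_effort, sprint_days, day):
            run += 1
            if run >= streak:
                return True
        else:
            run = 0  # a single good day resets the pattern
    return False

# 10-day sprint, 100 units of work; the team runs above the line on days 1-3.
actuals = [92, 84, 78, 72, 66, 55, 44, 33, 20, 0]
print(needs_attention(actuals, total_effort=100, sprint_days=10))  # True
```

A single observation above the line resets to a run of one; only the sustained pattern triggers a change in behavior.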

Measure What You Think You Are Measuring:

When measures and metrics are linked to unrelated items, combined with the logical backing of studious people, the results create ramifications that are, at best, interesting: for example, measuring productivity when you are interested in quality, or time-to-market when you are interested in customer satisfaction. Do these represent Good Numbers Gone Bad or merely chaos? In the long run, the ramifications of mismatches lead to poor decisions and abandoned metrics programs. Measuring toilet paper usage and relating it to productivity is a particularly absurd example, where the logic is that higher usage of toilet paper reflects longer working hours, which should result in more output (of some sort or another). While that example was created as a class exercise, it is possible to find similar examples in the wild. Less absurd mismatches include deciding that effort, or the cost of effort, is a direct proxy for productivity. Effort is only the input to productivity; without an output such as software or widgets, the equation is incomplete. Using half of an equation as a tool might not create the results expected.
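The "half an equation" point can be shown in a few lines. The convention sketched here is the standard one (productivity as output divided by input); the team names and figures are invented.

```python
# Minimal sketch: productivity needs both halves of the equation.
# All figures are invented for illustration.

def productivity(output_units, effort_hours):
    """Output per unit of input; effort alone is only the denominator."""
    return output_units / effort_hours

# Two hypothetical teams with identical effort but different output:
team_a = productivity(output_units=400, effort_hours=1000)  # 0.4 units/hour
team_b = productivity(output_units=250, effort_hours=1000)  # 0.25 units/hour

# A metric based on effort alone would score these teams as identical;
# the output side is what actually separates them.
print(team_a, team_b)
```

Measuring effort (or its cost) without measuring output ranks both teams the same, which is exactly the mismatch the paragraph describes.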

One Metric To Rule Them All?:

Not all metrics can be used for all projects. If you can’t easily answer the question “does this relate?” for each metric, the information generated through measurement and analysis will provide little or no value. Stratification is a requirement for analysis. The goal is to understand the differences between groups of work so that, when you make a comparison, you can discern what is driving the difference (or whether there is a difference at all). Comparing package implementations, hardware-intensive projects, and custom development is rational only if you understand that there will be differences and what those differences mean. Examples abound of organizations that have failed to stratify like projects into groups for comparison. Failing this simple precaution lets Good Numbers Go Bad.
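Stratifying before comparing can be sketched simply. The project names, category labels, and defect rates below are invented; the point is that averaging within strata compares like with like, while one blended average hides what drives the difference.

```python
# Sketch of stratification before comparison. Project data are invented.

from collections import defaultdict
from statistics import mean

projects = [
    {"name": "ERP rollout",    "type": "package",  "defect_rate": 2.1},
    {"name": "Billing portal", "type": "custom",   "defect_rate": 4.8},
    {"name": "CRM install",    "type": "package",  "defect_rate": 1.9},
    {"name": "Firmware load",  "type": "hardware", "defect_rate": 6.5},
    {"name": "Data pipeline",  "type": "custom",   "defect_rate": 5.2},
]

def stratified_means(projects, metric):
    """Average a metric within each stratum so like is compared with like."""
    groups = defaultdict(list)
    for p in projects:
        groups[p["type"]].append(p[metric])
    return {t: mean(values) for t, values in groups.items()}

print(stratified_means(projects, "defect_rate"))
```

Judging any one project against the unstratified mean (about 4.1 here) would mix package, custom, and hardware work into a single number and obscure whether a difference reflects performance or merely project type.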

They Are Everywhere, They Are Everywhere!:

There are many items that are important to measure. Measurement can tell you the state of your IT practice while providing focus, and it is sometimes treated as a silver bullet. Because it seems important to measure many activities within an IT organization, many measurement teams conclude that measuring everything is important. Unfortunately, measuring what is really important is rarely easy or straightforward. When presented with obstacles, many metrics programs let Good Numbers Go Bad by measuring something, anything. “Quick, do something!” is the attitude. When organizations slip into “measure something” mode, what gets measured is often not related to the organization’s target behavior (the real needs). When measures are not related to the target behavior, it is easy to breed unexpected behaviors (not indeterminate or unpredictable, just not what was expected).

For example, one organization determined that personal capability was a key metric: more capability would translate into higher productivity and quality. During the research into the topic, it was determined that capability was too difficult or “touchy-feely” to measure directly. The organization decided that counting requirements was a rough proxy for systems capability, and if systems capability went up, it must be a reflection of personal capability. So, of course, they measured requirements. One unanticipated behavior was that the requirements became more granular (actually more consistent), which created an appearance of increased capability that could not be sustained (or easily proved) after the initial baseline of the measure.
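The proxy failure in that example can be reduced to a toy demonstration. The requirement texts are invented; the mechanism is the one described above: when the metric is a count, rewriting the same scope at finer granularity inflates the number without any change in capability.

```python
# Hypothetical sketch of the requirements-count proxy being gamed by
# granularity. Requirement texts are invented for illustration.

# Scope at the initial baseline: one coarse requirement.
coarse = ["User can manage an account"]

# The same delivered scope, rewritten as finer-grained requirements
# after the baseline was set:
granular = [
    "User can create an account",
    "User can edit an account",
    "User can close an account",
]

def capability_proxy(requirements):
    """The flawed metric: requirement count stands in for capability."""
    return len(requirements)

print(capability_proxy(coarse), capability_proxy(granular))
# The count triples while the delivered scope is unchanged -- an apparent
# capability gain that cannot be sustained.
```

Once every requirement is already fine-grained, the "gain" disappears, which is why the measured improvement could not be sustained past the first baseline.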