"What can your data do for you?" the ad touts, implying that all it takes to develop insights is to analyze the data. This is true when the analysis is grounded in the data, but being grounded in the data is a lot more difficult than it sounds. Analytics built on aggregate metrics tend to make implicit assumptions about how the data is distributed. Averages and totals are not suited to every kind of distribution.

Let me explain.

It is not uncommon to see pivot tables or dashboards that show the sum of sales by region and the average sales per customer, and then to see decisions made on the basis of those numbers. For example, a company may find that it has three regions with the following total and average sales:

Region A

  • Sum of sales: $10,000
  • Avg sale per customer: $500

Region B

  • Sum of sales: $20,000
  • Avg sale per customer: $1,000

Region C

  • Sum of sales: $30,000
  • Avg sale per customer: $1,500

If this were all the information given to us, and we had to identify the region most likely to have the top customers for benchmarking purposes, our answer would lean toward Region C. Some might think Region B is a contender, but Region A is definitely out. What these metrics don’t show is the actual distribution of the data. And when we look at numbers like these, we naturally assume that the spread (the standard deviation) is similar across all regions.

However, suppose I tell you that Region A sales are heavily skewed: 5 of the 20 customers were government customers who together bought $9,000, while the remaining 15 customers accounted for the other $1,000. In other words, the average sale per customer for those 5 alone was $1,800, while for the other 15 it was only about $67. Combine this with the fact that the other two regions are more evenly distributed around the mean, with a standard deviation of $100, and our focus switches from Region C to Region A, because something interesting is happening there. Put differently, if I add one more metric, the average sales of the top 5 customers (standard deviation would work as well, but may not be intuitive for all users), my focus shifts from Region C to Region A.

Region A

  • Sum of sales: $10,000
  • Avg sale per customer: $500
  • Avg sales of top 5 customers: $1,800
  • Total sales of top 5 customers: $9,000

Region B

  • Sum of sales: $20,000
  • Avg sale per customer: $1,000
  • Avg sales of top 5 customers: $1,100

Region C

  • Sum of sales: $30,000
  • Avg sale per customer: $1,500
  • Avg sales of top 5 customers: $1,600
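
For readers who want to try this on their own numbers, here is a minimal Python sketch of the Region A figures. It rests on an assumption that is not in the data above: since only group totals are given, it pretends every customer within a group spent exactly that group's average. The `summarize` helper and its `top_n` parameter are hypothetical names used only for illustration.

```python
from statistics import mean, median, pstdev

# Assumption: each customer spent exactly their group's average.
# Only the group totals are known: $9,000 across 5 customers
# and $1,000 across the other 15.
region_a = [9000 / 5] * 5 + [1000 / 15] * 15


def summarize(sales, top_n=5):
    """Return the usual aggregates plus a few distribution-aware ones."""
    ranked = sorted(sales, reverse=True)
    return {
        "total": sum(sales),
        "mean": mean(sales),
        "median": median(sales),
        "top_n_mean": mean(ranked[:top_n]),
        "std_dev": pstdev(sales),  # population standard deviation
    }


summary = summarize(region_a)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under that assumption the mean stays at $500, but the median drops to about $67 and the standard deviation comes out at roughly $750, far above the $100 quoted for the other two regions. Any one of these extra numbers would have flagged Region A as worth a closer look.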

The distribution of the data is therefore extremely important, especially when we use metrics such as averages and totals to make decisions about groups of data. Grouping variables such as region are often arbitrary, and nothing guarantees that the data is distributed the same way within every group.

To give a more drastic example: if we put Warren Buffett in a group with 999 not-so-well-off individuals, or even paupers, and computed the average net worth, we would have 1,000 individuals who all seem to be millionaires.
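
The same effect can be sketched in a few lines of Python. The figures here are purely illustrative assumptions, not real net worth data: one person at $100 billion and 999 people at $1,000 each.

```python
from statistics import mean, median

# Illustrative assumption only: one very wealthy person
# plus 999 people with $1,000 each.
net_worths = [100_000_000_000] + [1_000] * 999

average_net_worth = mean(net_worths)
median_net_worth = median(net_worths)

print(f"Average net worth: ${average_net_worth:,.0f}")
print(f"Median net worth:  ${median_net_worth:,.0f}")
```

With these made-up numbers the average lands above $100 million while the median stays at $1,000, which is why the median is often the safer summary for heavily skewed data.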

My question for the audience: when your company develops these dashboards, does it consider the data distributions and automatically change the metric for skewed distributions?

And, more fundamentally, how do you evaluate the effectiveness of the metric itself and its applicability to the decisions that may be made using it?