The Importance of Low-Level Data

The quncertain results graph used to look like this,

and now it looks like this:

The data visualization is now more interactive, and gives the audience the ability to drill down to specific pieces of data.

For context, the chart above is a tool that users of quncertain.com use to predict future events. They assign a numeric plausibility to a future event, and then observe how often such events actually occur. Over time, they can calibrate their estimates so that events they rate as 70% plausible actually occur about 70% of the time. The chart above is from a calibration step of “predicting” events that happened in the past.
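To make the idea of calibration concrete, here is a minimal Python sketch. It is not how quncertain.com computes anything; the function name and the sample predictions are made up for illustration. It groups predictions by their stated plausibility and compares each group to how often the events actually occurred.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Group (plausibility, occurred) pairs by stated plausibility.

    Returns {plausibility: (observed_frequency, count)}.
    """
    buckets = defaultdict(list)
    for plausibility, occurred in predictions:
        buckets[plausibility].append(1 if occurred else 0)
    return {
        p: (sum(outcomes) / len(outcomes), len(outcomes))
        for p, outcomes in sorted(buckets.items())
    }


# Hypothetical predictions: (stated plausibility, did the event occur?)
predictions = [
    (0.7, True), (0.7, True), (0.7, False), (0.7, True), (0.7, True),
    (0.7, True), (0.7, False), (0.7, True), (0.7, False), (0.7, True),
    (0.9, True), (0.9, True), (0.9, False), (0.9, True),
]

table = calibration_table(predictions)
for p, (freq, n) in table.items():
    print(f"stated {p:.0%}  observed {freq:.0%}  (n={n})")
```

In this invented sample the 70% predictions are well calibrated, while the 90% predictions came true only 75% of the time. Keeping the count next to each frequency matters: four predictions are far too few to draw a conclusion from.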

But this post is specifically about charts. And about authors who cite data, audiences that scrutinize the data, and how presenting large amounts of non-abstracted data increases the trust of the audience in the author’s conclusions. Data presented without a large amount of context should make an audience suspicious; the possibility that the author is doctoring the data becomes more plausible.

Charts can misrepresent the underlying facts in many ways. Here are three common misrepresentations:

1. Averaging and other abstractions

Abstractions allow an audience to see higher-level ideas faster and more easily than inspecting a table of numeric values. Showing the average high temperature for a particular month expresses a range of related measurements in a single value. Data visualizations like time-series charts, bar graphs, area charts, etc, are also designed to explore high-level ideas built on lower-level data. But all these abstractions can be misleading, even the widely-used concept of average.

Because we are taught averages when we’re young and because we use them every day, we might forget that averages are misleading unless the data being averaged is symmetric and closely grouped. If the data is spread out, or has a skew, averages mask important information. This is why whenever an average (or median, or any single-number summary) is provided, the author should place next to it a histogram showing the distribution of the points being averaged. Current technology allows us to make this enhancement easily, even in journal articles and news articles.

Metrics like “Customer Lifetime Value” have averages built into them. But a histogram will usually show that the lifetime value of your customers varies widely, and probably has long tails. Looking at this distribution is probably much more revealing than looking at an average. Only data that looks like it has a close-to-central (or close-to-gaussian, close-to-normal) distribution should be represented as just a number. However, even in the case of centrally-distributed data, the audience has no assurance this is the case without being able to see the distribution themselves. So the distribution should be shown in all cases.

Examples:

American mean net worth is $301,000. American median net worth is $45,000.

US Family mean income is $89,000, median income is $67,000 (https://projects.quncertain.com/us_income/income_data_2014.html)
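A gap between mean and median like the ones above is easy to reproduce. The Python sketch below uses made-up customer lifetime values (not real data) and a hypothetical helper, text_histogram, to show how a single large value pulls the mean away from the typical customer, and how even a crude histogram exposes it.

```python
from statistics import mean, median


def text_histogram(values, bin_width):
    """Count values per bin; returns sorted (bin_start, count) pairs.

    Only non-empty bins are returned.
    """
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return sorted(counts.items())


# Made-up customer lifetime values, in dollars
lifetime_values = [20, 25, 30, 30, 35, 40, 45, 50, 60, 900]

avg = mean(lifetime_values)
mid = median(lifetime_values)
print(f"mean = {avg}, median = {mid}")

# Print one row per non-empty bin, with a bar of '#' characters
for start, count in text_histogram(lifetime_values, 50):
    print(f"{start:>4}-{start + 49:<4} {'#' * count}")
```

Here the mean is 123.5 and the median is 37.5, yet nine of the ten customers are worth 60 or less. Neither number alone tells the audience that; the histogram does.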

2. Misclassification

Authors will sometimes split data into groups and then summarize each group. Conclusions are drawn by comparing the summaries of each group. Obviously, these conclusions are only as valid as the accuracy of the classification. And yet usually when an author presents a chart like this, there is no mechanism for an audience to inspect the classifications. Because of this, it’s common for data to be misclassified, both intentionally and unintentionally.

Authors have the ability to inspect raw data, and manipulate classification rules before drawing a line between (for example) high-value customers and low-value customers. Audiences do not have this ability. Maybe changing the classification rule slightly has a large impact on the resulting metrics, and makes the author’s conclusions suspect. Even when classification rules are sound, there is value from the ability to inspect group membership.
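As a rough illustration of how sensitive a summary can be to where the line is drawn, the Python sketch below splits made-up customer spending figures at two nearby thresholds. The function name, the data, and the thresholds are all assumptions for the example.

```python
from statistics import mean


def split_summary(values, threshold):
    """Split values into high (>= threshold) and low groups and summarize each."""
    high = [v for v in values if v >= threshold]
    low = [v for v in values if v < threshold]
    return {
        "high_count": len(high),
        "high_mean": mean(high) if high else None,
        "low_mean": mean(low) if low else None,
    }


# Made-up annual spend per customer, in dollars
spend = [90, 95, 98, 99, 100, 101, 102, 105, 400, 450]

for threshold in (100, 110):
    s = split_summary(spend, threshold)
    print(
        f"threshold {threshold}: {s['high_count']} high-value customers, "
        f"high mean {s['high_mean']:.1f}, low mean {s['low_mean']:.1f}"
    )
```

Moving the threshold from 100 to 110 shrinks the “high-value” group from six customers to two and roughly doubles its average spend. An audience that only sees one of these two summaries has no way to know the other exists.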

The only solution to this problem is to allow the audience to inspect the individual members that make up the groups. Audience members can spot-check individual data points that they’re knowledgeable about. If every point they spot-check turns out to be properly classified, they will rightly gain confidence that the author’s conclusions are sound.

Examples:

Arguments about official vs real unemployment rates

Bill Clinton’s “I did not have sex with that woman”

3. Undocumented Data Filtering / Outlier Removal

Data filtering is a subset of classification. In data filtering, the author removes data they consider irrelevant or distorting. Hopefully that statement sounded backwards, because if we are judging the data and an author’s abstraction of the data, then unquestionably the data is the ‘realer’ thing. Abstractions (averages, charts, graphs) are the things that distort the data. Modifying the data because it distorts an abstraction is disingenuous, and should make an audience less certain of an author’s conclusions.

For handling things like typos and garbage data, data filtering may be necessary. But even in this case, the removed data should be included somewhere and should be inspectable by the audience. The author should describe why the data was removed. And all data that was NOT removed should be handled with additional uncertainty, knowing that the input to the author’s abstraction is likely still susceptible to typos and other inaccuracies.

Giving the audience the ability to look at the data included and excluded allows them to spot-check outliers they are aware of and see that they are handled reasonably in the chart.
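One way to keep filtering inspectable is to never throw a record away silently. The Python sketch below is only an illustration, with a hypothetical filter_with_audit function, made-up temperature readings, and made-up rules. It returns the excluded records alongside the kept ones, each tagged with the reason it was removed.

```python
def filter_with_audit(records, rules):
    """Apply named exclusion rules in order.

    Returns (kept, excluded). Each excluded entry keeps the original
    record and the name of the first rule that matched it.
    """
    kept, excluded = [], []
    for record in records:
        reason = next((name for name, rule in rules if rule(record)), None)
        if reason is None:
            kept.append(record)
        else:
            excluded.append({"record": record, "reason": reason})
    return kept, excluded


# Made-up temperature readings, in degrees Fahrenheit
readings = [
    {"id": 1, "temp_f": 71.0},
    {"id": 2, "temp_f": 68.5},
    {"id": 3, "temp_f": 7100.0},
    {"id": 4, "temp_f": None},
    {"id": 5, "temp_f": 74.2},
]

# Rules are checked in order, so the missing-value check runs before
# the numeric comparison
rules = [
    ("missing value", lambda r: r["temp_f"] is None),
    ("implausible value, likely a typo", lambda r: r["temp_f"] > 200),
]

kept, excluded = filter_with_audit(readings, rules)
print("kept ids:", [r["id"] for r in kept])
for entry in excluded:
    print(f"excluded id {entry['record']['id']}: {entry['reason']}")
```

Publishing the excluded list next to the chart lets a reader confirm that reading 3 really does look like a typo, and check whether an outlier they know about was kept or dropped.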

Examples:

Cherry-picking endpoints on time-series graphs to show a desired trend

Regression models that don’t strictly satisfy the rules for regression

All the issues I have discussed can be largely solved by authors providing access to raw and lower-level data with their metrics, charts, and graphs. Dashboards in analytics, CRM, and financial systems show the power and elegance of being able to drill down into data. Charts and figures in news articles, journal publications, and business and conference presentations have a long way to go.

I hope to write some more about data transparency in the future and work on tools that help with this effort. Please take a look at quncertain.com, and make some predictions!
