The Triumph and Folly of Proxy Data

Note: This article was first published on 8 March 2018.

“The data doesn’t exist.”

The phrase is music to my ears. I have always had a deep love affair with information that is unavailable, unrecorded, unreachable, or seemingly impossible to understand.

When clients or other consultants call me in to solve data problems, they often preface my introduction by saying, “The data we need doesn’t exist, so I’m not sure what you can actually do.”

The truth is that information is everywhere, all around us.

And just because we can’t see it as such does not mean that it does not exist. There are several reasons why we might not be able to identify the information we believe we need to answer a question.

Perhaps our conceptualization or operationalization of the imagined necessary data is flawed. Perhaps we are looking to resources that, for one reason or another, will not or cannot produce the information we need.

If the data, as defined by the question and hypothesis, truly does not exist as such, we use proxy data, meaning measurable stand-ins for the quantity we cannot observe directly, to get as close to the desired data as possible.

Proxy data, in the hands of the inexperienced, can lead to inappropriate assumptions, erroneous conclusions, and flawed analysis. However, the identification of appropriate proxy data by expert data scientists can open doors to discovery and create cornerstones for collecting exact data points for future inquiry.