Catch Me If You Can: Why Objectively Defining Survey Data Quality is Our Biggest Challenge

In the insights industry, experts have described 2022 as the Year of Data Quality. There is no doubt that it has been a hot topic of discussion and debate throughout the year. However, there is common ground: most agree there is no silver bullet for addressing data quality issues in surveys.

As the Swiss cheese model suggests, the best chance of preventing survey fraud and poor data quality comes from thinking of the problem in terms of layers of protection implemented throughout the research process.

To this end, the Insights Association Data Integrity Initiative Council has published a hands-on toolkit. It includes a Checks of Integrity Framework with concrete data integrity measures for all phases of survey research: pre-survey, in-survey, and post-survey.

The biggest challenge still remains: objectively defining data quality

What constitutes good data quality remains nebulous. We can agree on what is very bad data, such as gibberish open-ended responses. However, identifying poor-quality data is rarely so simple. Deciding which responses to keep or remove from a dataset is often a tough call, and these calls are often based on our own personal assumptions and tolerance for imperfection.

Because objectively defining data quality is difficult, researchers have developed a wide range of in-survey checks, including instructional manipulation checks, low-incidence questions, speeder flags, straight-lining detection, red herring questions, and open-end review, that act as predictors of poor-quality participants. But, like data quality itself, these predictors are subjective in nature.
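To make that subjectivity concrete, here is a minimal sketch of two such checks, straight-lining detection and a crude gibberish filter for open ends. The column names (grid_q1..grid_q3, open_end) and the thresholds are illustrative assumptions, not a standard, and real checks would be considerably more nuanced.

```python
import pandas as pd

# Illustrative sketch of two common in-survey checks. Column names
# (grid_q1..grid_q3, open_end) and thresholds are hypothetical.
def flag_straight_liner(row, grid_cols):
    """Flag respondents who give the identical answer to every item in a grid."""
    return row[grid_cols].nunique() == 1

def flag_gibberish_open_end(text, min_length=5):
    """Crude heuristic: too short, or no vowels at all (e.g. 'kjhgf')."""
    text = str(text).strip()
    return len(text) < min_length or not any(ch in "aeiou" for ch in text.lower())

df = pd.DataFrame({
    "grid_q1": [5, 2, 3], "grid_q2": [5, 4, 3], "grid_q3": [5, 1, 3],
    "open_end": ["Great value for money", "kjhgf", "ok"],
})
grid_cols = ["grid_q1", "grid_q2", "grid_q3"]
df["straight_liner"] = df.apply(flag_straight_liner, axis=1, grid_cols=grid_cols)
df["gibberish_oe"] = df["open_end"].apply(flag_gibberish_open_end)
print(df[["straight_liner", "gibberish_oe"]])
```

Note how even this toy filter flags a terse but legitimate answer like "ok" alongside genuine gibberish: the rules embody someone's judgment about what counts as a poor response.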

The lack of objectivity leads to miscategorizing participants

The in-survey checks typically built into surveys inadvertently miscategorize participants, producing both false positives (i.e. incorrectly flagging valid respondents as problematic) and false negatives (i.e. incorrectly accepting problematic respondents as valid).

In fact, these in-survey checks may penalize human error too harshly while, at the same time, making it too easy for experienced participants, whether fraudsters or professional survey takers, to fall through the cracks. As an example, most surveys exclude speeders: participants who complete the survey too quickly to have provided thoughtful responses.

While researchers are likely to agree on what is unreasonably fast (or bot-fast!), there is no consensus on what is a little too fast. Is it the fastest 10% of the sample? Or those completing in less than 33% of the median duration?
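As a concrete illustration, here is a small sketch, assuming completion times are available in seconds, of how those two rules can flag different respondents from the same sample. The durations and cut-offs are made up for the example.

```python
import pandas as pd

# Sketch comparing two common (and subjective) speeder rules.
# 'duration_sec' values are hypothetical completion times in seconds.
def flag_speeders(durations: pd.Series) -> pd.DataFrame:
    median = durations.median()
    return pd.DataFrame({
        "fastest_10pct": durations <= durations.quantile(0.10),  # rule 1: fastest 10% of the sample
        "below_33pct_of_median": durations < 0.33 * median,      # rule 2: under a third of the median
    })

durations = pd.Series([180, 240, 310, 95, 60, 400, 520, 45, 290, 350], name="duration_sec")
flags = flag_speeders(durations)
print(flags.sum())  # the two rules often flag different sets of respondents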

The subjectivity baked into these rules can result in researchers flagging honest participants who read and process information faster, or who are simply less engaged with the category. Meanwhile, researchers might not flag participants with excessively long response times: the crawlers who could be translating the survey or fraudulently filling out more than one survey at once.

Improving our hit rate

These errors have a serious impact on the research. On the one hand, false positives can have negative consequences such as providing a poor survey experience and alienating honest participants.

Is this not a compelling enough reason to avoid false positives? Then think about the extra days of fieldwork needed to replace participants. On the other hand, false negatives can cause researchers to draw conclusions based on dubious data, which leads to bad business decisions.

Our ultimate goal as responsible researchers is to minimize these errors. To achieve this, it’s critical that we shift our focus to understanding which data integrity measures are most effective at flagging the right participants. With this in mind, using advanced analytics (e.g. Root Likelihood in conjoint or MaxDiff) to identify randomly answering, poor-quality participants presents a huge opportunity.
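For readers less familiar with the metric, here is a minimal sketch of the Root Likelihood (RLH) idea: the geometric mean of the probabilities that a respondent's estimated utilities assign to the choices they actually made, compared against the chance level for the task design. The probabilities below and the 1.2× chance cut-off are illustrative assumptions; in practice they would come from a conjoint or MaxDiff utility estimation.

```python
import numpy as np

# Minimal sketch of the Root Likelihood (RLH) idea. The per-task choice
# probabilities and the cut-off used here are hypothetical.
def rlh(chosen_probs):
    """Geometric mean of the predicted probabilities of the observed choices."""
    chosen_probs = np.asarray(chosen_probs)
    return np.exp(np.mean(np.log(chosen_probs)))

# Example: 10 MaxDiff tasks with 4 alternatives each, so chance-level RLH = 1/4.
engaged = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.51, 0.68, 0.60]
random_like = [0.27, 0.22, 0.30, 0.24, 0.26, 0.23, 0.28, 0.25, 0.21, 0.29]

chance = 1 / 4
for label, probs in [("engaged", engaged), ("random-like", random_like)]:
    score = rlh(probs)
    status = "above" if score > 1.2 * chance else "near"
    print(f"{label}: RLH = {score:.2f} ({status} chance)")
```

A respondent whose RLH sits near the chance level is answering no better than random guessing, which is a far more objective signal than a speeding threshold alone.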

Onwards and upwards

In 2022, much worthwhile effort was devoted to raising awareness and educating insights professionals, especially on how to identify and mitigate issues in survey response quality. Moving forward, researchers need a better understanding of which data integrity measures are most effective at objectively identifying problematic respondents in order to minimize false positives and false negatives.