The ways your data is lying to you

These days, even grocery chains are data companies as much as they are food retailers. Through its customer loyalty points program, Loblaws harvests detailed information on consumer behaviour: what customers buy, how often, and how likely they are to buy again. This information can help drive managerial decision-making, but only if the data is used correctly.

“Data can be misleading,” said Najib Mozahem on the McGill Delve podcast. He’s an Assistant Professor of Information Systems at McGill University and a data scientist at Air Canada. “We always have to be careful.”

While data might imply truth or scientific rigour, that's not always the case, said Mozahem. If you're not careful, numbers can deceive you. Here are three ways your data might be lying to you.

1. Aggregating out the truth

A data point is a fact, said Mozahem. Someone clicked a link, purchased a product, or answered a survey question. Collect enough of these facts, and a story starts to emerge about a given phenomenon. This happens through aggregation.

Aggregation is where data gets its power, said Mozahem. It’s how YouTube can predict and recommend videos you might enjoy. If you often click on content about astronomy, music, and technology, it’s not a stretch to think you’ll also enjoy a video of an astronaut playing guitar in space. This kind of prediction is only possible because YouTube knows you and your viewing history, and can make suggestions accordingly.

But aggregation also comes with risk.

“You might be able to uncover certain patterns, but you lose some of the details,” said Mozahem.

It might not have been you who clicked those videos, but your daughter – you just forgot to log out of your account. YouTube doesn’t know that, so now it falsely assumes that you’re interested in the same content.

“This is why we always have to be careful,” said Mozahem.
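As a rough illustration of how aggregation blurs detail (the viewers, topics, and numbers below are invented for this sketch, not taken from the podcast), consider one shared account whose history mixes two very different people:

```python
# Hypothetical watch-time data (minutes per session) for one shared account,
# actually generated by two viewers with opposite interests.
parent = {"astronomy": [12, 15, 14], "cartoons": [0, 1, 0]}
child = {"astronomy": [1, 0, 2], "cartoons": [18, 20, 16]}

def mean(xs):
    return sum(xs) / len(xs)

# Aggregated at the account level, both topics look moderately popular...
account = {topic: parent[topic] + child[topic] for topic in parent}
for topic, minutes in account.items():
    print(topic, round(mean(minutes), 1))

# ...but per viewer, the pattern is stark: each person watches one topic
# heavily and the other almost never. The "average viewer" the account-level
# numbers describe does not exist.
```

The account-level averages land in a bland middle (roughly 7 and 9 minutes), even though neither viewer behaves anything like that. A recommender working only from the aggregate would serve both people poorly.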

2. Dirty data

Clean data is essential for accurate analysis, said Mozahem. That means reviewing every row and column of a dataset to verify its integrity. If you spot a few missing cells, it's important to understand why, because those holes could skew your results enough to distort your decision-making.

For example, in a national quality of life survey, what if five per cent of your collected data didn’t show up in your final spreadsheet? You might be tempted to move on. But that missing five per cent could have been your only data on the poorest members of the country. If they’re not represented in your tables, you’re not getting an accurate picture of the situation. And you might squander an opportunity to support these vulnerable communities.

Cleaning up data is not glamorous work, concedes Mozahem. It requires patience, attention to detail, and occasionally some investigation. But it is crucial to sound analyses and, by extension, sound decision-making.
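A toy sketch makes the missing-rows problem concrete. The income figures here are invented, and the scenario is simplified, but the mechanism is the one described above: if the lowest-income respondents silently drop out, every summary statistic drifts upward.

```python
import random

random.seed(0)

# Hypothetical survey of 100 households: 95 in a middle-income band,
# 5 low-income households.
incomes = [50_000 + random.randint(-10_000, 10_000) for _ in range(95)]
incomes += [12_000 + random.randint(-2_000, 2_000) for _ in range(5)]

full_mean = sum(incomes) / len(incomes)

# Now suppose the 5 low-income rows never made it into the final
# spreadsheet. The analysis runs fine; nothing errors out. The average
# simply shifts up, and the people most in need vanish from the picture.
observed = incomes[:95]
observed_mean = sum(observed) / len(observed)

print("true mean:", round(full_mean))
print("mean with missing rows:", round(observed_mean))
```

Nothing in the code flags a problem, which is exactly the danger: the only way to catch it is to notice that rows are missing and investigate why.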

3. Personal bias

“Torture the data long enough, and it’ll tell you whatever you want,” said Mozahem.

He’s referring to when data analysts process data to achieve a specific outcome, rather than the most accurate one. This can happen for a variety of reasons, but Mozahem attributes it most often to managerial pressures. For example, after spending significant time and money developing a machine learning model, managers might pressure an analyst to artificially increase the model’s accuracy – even if that accuracy can’t be maintained long-term. In this case, the goal isn’t to create a model that can produce reliable results. It’s to show that the model works, regardless of accuracy, to justify the spending, said Mozahem.
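One common way data gets "tortured" is by trying many variations until something looks predictive purely by chance. The sketch below is invented for illustration: the outcome is literally coin flips, so no feature can truly predict it, yet searching over enough random features still surfaces one with impressive-looking accuracy.

```python
import random

random.seed(1)
n = 30

# A hypothetical outcome with zero real signal: 30 coin flips.
outcome = [random.randint(0, 1) for _ in range(n)]

def accuracy(feature):
    # Fraction of cases where the feature matches the outcome.
    return sum(f == o for f, o in zip(feature, outcome)) / n

# "Torture" the data: generate 1,000 random candidate features and
# keep only the best-looking one.
best = max(
    accuracy([random.randint(0, 1) for _ in range(n)])
    for _ in range(1000)
)

print("best accuracy found:", best)  # typically well above the honest 0.5
```

Reported in isolation, that best number looks like a working model; it is really just the winner of a thousand coin-flip contests, and it would collapse on fresh data. This is why an independent reviewer who asks "how many things did you try?" is so valuable.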

“That’s why it’s important to get someone else to look at the project,” he said.

Extra eyes from other data scientists can help identify blind spots and false assumptions, improving the overall accuracy of the model. But this is only possible if there’s a workplace culture that allows for failure and encourages this kind of peer review.

“It’s up to management,” he said.

In the end, Mozahem’s message is simple: data is powerful, but only when handled with care. While data can imply truth, many factors can interfere with its accuracy. The best place to start is with clean data practices, which lay the groundwork for any future analyses. Then you can worry about accurately reading your aggregated results, and empowering your data scientists to process that data honestly.

On the McGill Delve podcast, Mozahem further expands on these points, highlighting the importance of strong data pipelines and exploring the role of AI in data analysis. Search “McGill Delve” on Apple Podcasts or in your favourite podcast player.

Written by Eric Dicaire, Managing Editor, McGill Delve

Featured experts

Najib Mozahem
Assistant Professor, Information Systems
McGill University