Six common errors when using data: Part two
Published: 24 Apr 2018
With such a volume of information available to planners nowadays, it's essential to be aware of the pitfalls data can present.
Research and evidence are the backbone of what planners do. But we live in the age of ‘big data’, with information coming at us from all angles and multiple sources. All this insight is useful, but how do we sort, interpret, and act on such a volume fairly and accurately?
One starting point is to know what not to do. Here are the final three of six common pitfalls to avoid when using data.
4. Seeing correlation as causation
Our minds want to find meaningful cause-and-effect patterns in any dataset, even if a relationship isn’t clearly indicated by the evidence. For example, the data tells us that children who don’t have breakfast perform less well at school. Does that mean not having breakfast is the cause of poor academic performance? It’s more likely to be symptomatic of coming from a low-income family with low educational aspiration. Correlations may help identify related factors that point towards a common cause – or, as Edward Tufte, an expert on the visual presentation of information, says: “Correlation is not causation but it sure is a hint.”
Tyler Vigen has fun with spurious correlations at: tylervigen.com/spurious-correlations
5. Relying on the average
Averages are useful for giving an overview, but they can conceal significant trends by making it easy to overlook polarisations that may be the most telling features of a dataset. An example might be using the average income of an area as a key indicator of its character when there are pockets of deprivation within a more affluent district. Other demographic information (e.g. age) also needs to be treated with care when presented as an average, as do transport statistics (e.g. average traffic volume), health figures, and so on. Planners need to be aware of how data is spread and concentrated.
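The income example can be made concrete with a toy ward in Python (all figures invented): a comfortable majority and a deprived pocket produce a reassuring average that says nothing about the pocket.

```python
from statistics import mean

# Hypothetical ward of 100 households: 80 on a comfortable income
# and a deprived pocket of 20.
incomes = [42_000] * 80 + [12_000] * 20

print(mean(incomes))   # 36000 -- the average alone looks comfortable
print(min(incomes))    # 12000 -- but a fifth of households earn far less
```

A summary that also reports the spread (minimum, quartiles, or the share below a threshold) would flag the deprived pocket that the mean hides.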
6. Taking the most frequent to be the most significant
We tend to favour uncomplicated solutions – and may lean towards commonalities to support that desire. Let’s say you’re surveying employment patterns in a district with a view to planning for future employment needs. You find three significant employers: a department store (450 employees), a clothing factory (210 employees), and a bank (190 employees). It makes sense to plan around these three major players in the local economy, right? But what happens if you factor in the ‘Other’ column that covers all the smaller employers and self-employed people in the district, and adds up to 941 people – more than the big three combined? With a ‘long tail’ such as this, who takes priority, and why?
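The arithmetic behind the example can be checked in a few lines (figures taken from the example above):

```python
# Employment counts from the survey example.
employees = {"department store": 450, "clothing factory": 210, "bank": 190}
other = 941  # all smaller employers and the self-employed combined

big_three = sum(employees.values())
print(big_three)          # 850
print(other > big_three)  # True: the 'long tail' outweighs the big three
```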
Reality is often more nuanced and messy than we would like it to be. How do you deal with that?
The quality of outputs in planning is contingent on the quality of inputs (otherwise known as ‘garbage in, garbage out’).
Furthermore, we are bound to plan for the future based on past events and present needs. When using data to inform planning, there are inevitably unknowns.
Writing in Data Points: Visualization That Means Something, statistician Nathan Yau suggests that data should perhaps have its own “golden rule”: “Treat others’ data as you would want your data treated.”
He adds: “Data is an abstraction of real life and can be complicated, but if you gather enough context, you can at least put forth a solid effort to make sense of it.”
The Alliance for Useful Evidence’s guide to using research evidence cites Professor Philip E Tetlock’s study of 80,000 expert predictions, which found that the majority were wrong and that ‘dart-throwing monkeys’ would have been better at forecasting the future.
Image credit | Shutterstock