"Statistics never lie, but lovers often do..." (J. Tinga, Antonio vs. Reyes, 484 SCRA 353 (2006))
With all due respect to Justice Tinga, statistics and data do lie, and they do so quite often.
Cherry picking
By selecting or cherry-picking data, the trend of global warming can be made to appear, mistakenly, to have stopped, as in the period from 1998 to 2012, which is actually a random contrary fluctuation.
What it is
Cherry picking is also known as suppressing evidence or the fallacy of incomplete evidence.
It means selecting, using, and presenting only the subset of the data that agrees with your claims and beliefs.
This becomes really dangerous when paired with people’s confirmation bias.
How to avoid it
As a creator, always be faithful to your data and results, especially when they do not fully agree with your claims.
As a consumer, ask for the complete dataset or ask yourself: “What am I not being told?”
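A minimal sketch of how a cherry-picked window can hide a trend. The series is synthetic (a hypothetical steady rise plus a repeating fluctuation), not real temperature data, and the names `slope`, `series`, and the 10-step window are illustrative assumptions.

```python
import math

def slope(ys):
    """Least-squares slope of ys against their index 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Synthetic series (assumption): rise of 0.02 per step plus a repeating wiggle.
series = [0.02 * t + 0.3 * math.sin(2 * math.pi * t / 20) for t in range(60)]

# Trend over the complete dataset.
full_trend = slope(series)

# Cherry-pick: search for the 10-step window with the lowest slope.
window = 10
start = min(range(len(series) - window + 1),
            key=lambda s: slope(series[s:s + window]))
picked_trend = slope(series[start:start + window])

print(f"trend over all data:      {full_trend:+.4f} per step")
print(f"trend over picked window: {picked_trend:+.4f} per step")
```

The full series rises, yet the selected window slopes downward. Anyone shown only that window would come away with the opposite conclusion.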
Data dredging
An example of data produced by data dredging through a bot operated by Tyler Vigen, apparently showing a close link between the number of letters in the winning word of a spelling bee competition and the number of people in the US killed by venomous spiders. It's obviously a coincidence: with so many possible comparisons between things happening in the world, it is easy to find unrelated data that show similar trends.
What it is
Data dredging refers to the misuse of data analysis to find patterns in data that can be presented as statistically significant.
It means seeking correlation where there is none: performing countless statistical tests on the data and reporting only the ones that show a correlation.
If you combine enough time with a large enough dataset, you are bound to find things that appear to be correlated.
How to avoid it
Always be upfront about what you are testing, e.g. by committing to a hypothesis in the analyze step of the data pipeline, before you look for patterns.
Accept that sometimes things that seem to be correlated aren’t.
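To see how easily dredging produces a "finding", the sketch below compares one series of pure noise against many other series of pure noise and keeps the best match. Everything here is randomly generated; the sizes (`n_points`, `n_candidates`) and the seed are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n_points, n_candidates = 10, 1000

# One "target" series and many unrelated candidates, all pure noise.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_candidates)]

# Dredge: test every candidate and report only the strongest match.
correlations = [pearson(target, c) for c in candidates]
best = max(correlations, key=abs)

print(f"strongest correlation among {n_candidates} unrelated series: {best:+.2f}")
```

None of the candidates has any relationship to the target, yet the best one looks strongly correlated. One common guard is to tighten the significance threshold as the number of tests grows, for example by dividing it by the number of tests (the Bonferroni correction).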
Survivorship bias
This hypothetical pattern of damage on returning aircraft shows locations where they can sustain damage and still return home. Reinforcing the aircraft in the most commonly hit areas would be a result of survivorship bias, because crucial data from fatally damaged planes is being ignored; planes hit in other places presumably did not survive.
What it is
Survivorship bias is the logical error of drawing conclusions from an incomplete dataset composed of data that has survived a selection process and overlooking those that did not, typically because of their lack of visibility.
Examples:
Bullet patterns of WW2 aircraft returning from the war
College dropouts being billionaires—for every 1 dropout who ended up becoming a billionaire, how many thousands more did not?
How to avoid it
Always try to look at the full picture and ask yourself if you are overlooking anything or if something is missing in your data.
Ask yourself: "Did my data undergo any selection or trimming process prior to my analysis?"
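The aircraft example can be sketched as a small simulation. The sections and the per-section loss probabilities below are hypothetical values chosen for illustration, not historical figures.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable
sections = ["engine", "cockpit", "fuselage", "wings"]

# Hypothetical chance that a hit in each section brings the plane down.
loss_prob = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}

all_hits, surviving_hits = Counter(), Counter()
for _ in range(10_000):
    hit = rng.choice(sections)           # every section is equally likely to be hit
    all_hits[hit] += 1
    if rng.random() >= loss_prob[hit]:   # the plane makes it home
        surviving_hits[hit] += 1

def share(counter, key):
    """Fraction of all recorded hits that landed on the given section."""
    return counter[key] / sum(counter.values())

for s in sections:
    print(f"{s:9s} all planes: {share(all_hits, s):.2f}   "
          f"returning planes: {share(surviving_hits, s):.2f}")
```

Among returning planes, engine hits look rare and wing hits look common, even though every section was hit equally often. The sections that look least damaged in the surviving data are the ones where a hit was most often fatal.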
Sampling bias
What it is
Sampling bias occurs when a sample is selected in such a way that some members of the intended population have a lower or higher chance of being included than others.
This results in conclusions drawn from a dataset that is not representative of the population you are trying to understand.
Example:
Using an online poll to determine whether students are in favor of online classes.
How to avoid it
Always try to use a representative sample of your population in your analysis.
Choose an appropriate and robust sampling method.
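A minimal sketch of the online-poll example. The population below is entirely hypothetical: it assumes that students with reliable internet access are more likely to favor online classes, and that only those students can answer an online poll.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 1,000 students:
# each entry is (has_reliable_internet, favors_online_classes).
population = (
    [(True, True)] * 480 + [(True, False)] * 120 +
    [(False, True)] * 80 + [(False, False)] * 320
)

def favor_rate(students):
    """Fraction of the given students who favor online classes."""
    return sum(favors for _, favors in students) / len(students)

true_rate = favor_rate(population)

# Online poll: only students with reliable internet can respond.
online_poll = [s for s in population if s[0]]
online_rate = favor_rate(online_poll)

# Simple random sample: every student has the same chance of selection.
random_sample = rng.sample(population, 200)
sample_rate = favor_rate(random_sample)

print(f"whole population:     {true_rate:.2f}")
print(f"online poll:          {online_rate:.2f}")
print(f"simple random sample: {sample_rate:.2f}")
```

The online poll overstates support because the way respondents are reached is tied to the opinion being measured. The simple random sample lands close to the population rate with far fewer respondents.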
Cobra effect
The story goes something like this: back in colonial India, the top Brit in charge decided there were too many cobras around Delhi. To reduce the population, the authorities put in place a cash reward, or bounty, for anyone who brought in a dead cobra. The intention was clear. Legend has it that people did bring in cobras reliably, but only because some enterprising souls had started breeding cobras for the very purpose of collecting the bounty. As the legend goes, once the bounty was scrapped, the breeders released their now-worthless snakes, leaving Delhi with more cobras than before.
What it is
The cobra effect occurs when an attempted solution to a problem makes it worse, as an unintended result of setting the wrong incentives.
False causality
What it is
False causality is the belief that because two events occur together, or one immediately after the other, one must have caused the other.
Correlation does not imply causation.
How to avoid it
Never assume causation based on correlation alone.
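One way correlation arises without causation is a shared hidden cause. The sketch below uses made-up variables (`temperature`, `ice_cream`, `sunburns`), all randomly generated: neither outcome influences the other, but both depend on temperature. Removing the temperature effect from each with a simple least-squares fit is one possible check, not the only one.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares linear fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden common cause and two outcomes that each depend on it.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
sunburns = [t + rng.gauss(0, 0.5) for t in temperature]

raw = pearson(ice_cream, sunburns)
controlled = pearson(residuals(ice_cream, temperature),
                     residuals(sunburns, temperature))

print(f"raw correlation:             {raw:+.2f}")
print(f"after removing temperature:  {controlled:+.2f}")
```

The raw correlation is strong, but it all but vanishes once the common cause is accounted for, so ice cream sales explain nothing about sunburns on their own.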
Danger of summary metrics
Four different datasets look identical when examined using simple summary statistics, but vary considerably when graphed.
What it is
Reliance on summary metrics blurs out differences in the dataset. Some datasets may have the same summary metrics (e.g. mean, variance, correlation) but be totally different from each other.
How to avoid it
As a provider, show or open up the data used for the study instead of just the summary statistics.
As a consumer, always look at, or ask for, the data behind the summary statistics used.
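The four-datasets description matches the classic Anscombe's quartet; the sketch below assumes that is the figure being described and computes the summary statistics for each of its datasets. The helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# name -> (mean of x, mean of y, variance of y, correlation)
summary = {
    name: (mean(xs), mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, vy, r) in summary.items():
    print(f"{name:>3}: mean x={mx:.2f}  mean y={my:.2f}  var y={vy:.2f}  r={r:.3f}")
```

All four rows print nearly the same numbers, yet a scatter plot of each dataset shows a line with noise, a curve, a line with one outlier, and a vertical stack with one far-off point. Only looking at the data reveals the difference.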
Other data fallacies
Simpson's paradox - A phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
Gambler's fallacy - The erroneous belief that if an event has occurred more frequently than normal in the past, it is less likely to occur in the future, even though it has been established that the probability of such events does not depend on what happened in the past.
Hawthorne effect - Also known as the Observer Effect. This is the phenomenon where the actions and behaviors of the subjects of a study change because they are aware that they are being observed/monitored.
McNamara fallacy - Being too focused on what can easily be observed and assuming that what cannot is irrelevant. This leads to decisions based solely on quantitative observations (i.e., metrics, hard data, statistics) while all qualitative factors are ignored.
Publication bias - The tendency for the outcome of a study or experiment, rather than the robustness of its methodology, to influence whether it is published. This results in an imbalance in published papers in favor of positive results when, in reality, more studies using the same methodology but showing negative or inconclusive results may exist.
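Simpson's paradox, listed above, is easier to believe with numbers in hand. The sketch below uses illustrative counts for two hypothetical treatments, A and B, split into "small" and "large" cases; the names and figures are assumptions made for the example.

```python
# (successes, trials) per treatment and group; illustrative counts only.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def group_rate(treatment, group):
    """Success rate of a treatment within a single group."""
    successes, trials = results[treatment][group]
    return successes / trials

def overall_rate(treatment):
    """Success rate of a treatment with both groups pooled together."""
    successes = sum(s for s, _ in results[treatment].values())
    trials = sum(t for _, t in results[treatment].values())
    return successes / trials

for group in ("small", "large"):
    print(f"{group:5s}  A: {group_rate('A', group):.3f}   B: {group_rate('B', group):.3f}")
print(f"all    A: {overall_rate('A'):.3f}   B: {overall_rate('B'):.3f}")
```

Treatment A wins within each group but loses once the groups are pooled, because A was mostly applied to the harder "large" cases and B to the easier "small" ones. The group sizes, not the treatments, drive the combined result.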
For more information, you can refer to Data fallacies by Geckoboard.