Causality has played a major role in the development of modern thought and still does. We love to explain things, and that obsession has made us seekers of a cause for every effect. Why do we love to explain things? Well, the process of explaining things makes the majority of us feel smart and satisfies the curiosity of the remaining minority. I thought about what made me write this piece and decided on curiosity as the cause. Of course, I am faking. I found some really funny correlations and thought you might get a few good laughs too.
But seriously though, why should we care whether it is correlation or causation? Aren’t they both the same, two sides of the same coin? Of course not! Just because two events appear to be correlated does not mean there is causation (i.e., that one caused the other). We fall prey to so many phony causal explanations in our day-to-day life (in most cases a mere correlation exists) that our quality of life suffers. It is imperative, then, that we fix this.
Before we get into the nitty-gritty details of this confusion, let us define these terms.
Causation: Something (cause) gives rise to something else (effect) under certain conditions. And we call this the principle of causality or causation. Here is an example: One has to heat a pot of water to boil it. Heat is the cause and boiling water is the effect. Cause must always precede effect. Pretty simple!
Correlation: A statistical term that describes the relationship between two variables or events. The strength of a linear relationship can be calculated with the Pearson correlation coefficient. A positive correlation between two variables or events simply means that both variables move in the same direction (it is very important to note here that one variable does not have to cause the other variable to move). Similarly, a negative correlation means that the two variables move in opposite directions. A relationship does not have to be linear; it can be exponential, quadratic, etc. But our present discussion is limited to linear correlation.
You are probably thinking right about now, “I know that, Sherlock. Stop patronizing me and go find someone else to amuse yourself.” But hold on Watson, are you sure you have a good command of these tricky concepts? We may think we do, but very rarely can we disabuse our fellow citizens of the faulty causal arguments supporting various controversial practices and policies. Here is a short list of issues we face in our society: racism, discrimination in general, superstitions, etc. A lot of these came into practice due to selective observations made by some pseudo-intellectuals. One of my friends commented that even smart people mistake correlation for causation. You will be shocked to know of some crazy (definitely faulty) views held by a few very famous people. I am only presenting one such case here but there are a lot of cases, unfortunately.
Dr. Larry Summers, former President of Harvard University (2001-2006), famously argued in 2005 that men outperform women in math and science because of biological differences. This is a classic example of faulty causation. His argument is entirely based on some data points, the effects (e.g., number of women in science and engineering fields, average test scores, etc.), that he considered relevant to unearth the cause (biological differences). He used biological differences as a cause simply because that is the only difference he could use unequivocally. He ignored various sociological factors that could have caused the disparity that we see in our society today. Why did he make this mistake? I don’t know for sure, but I think he mistook correlation for causation. Similar faulty views based on race, nationality, etc. are prevalent in our society now. How can we blame the common man when an intellectual of his stature can make the same mistake?
Mistaking correlation for causation is a major problem, so let us try to cure ourselves of this malady. Let us analyze a few examples (some of them are really silly) to clearly differentiate correlation from causation. Our ultimate goal is to be able to effectively analyze every causal relationship that we come across and potentially change questionable common practices. David Hume and Karl Popper published extensively on causality and I highly recommend their works.
A friend of mine sent me this joke about a kid’s conversation with his mother on a plane. It is a very good starting point to highlight faulty causation:
“Wish they didn’t turn on that seatbelt sign so much! Every time they do, it gets bumpy.”
Did Maine’s divorce rate drop between 2000 and 2009 because US per capita margarine consumption went down during the same period?
If you ever come across people making up crazy causal relationships using a strong correlation co-efficient as a trump card (did you notice what I did there), please use the following example and try to convince them not to eat margarine (especially if they are Americans) as that might help lower Maine’s (The Great Pine Tree State) divorce rate. Please see Figure-1 for some compelling data.
Figure-1: Maine’s divorce rate versus US per capita Margarine consumption.
Amazing, isn’t it? Americans cut back on margarine consumption by a whopping 50% between the years 2000 and 2009 (from 8 pounds (3.6 kg) to 4 pounds (1.8 kg) per person per year). Surprisingly, Maine’s divorce rate also dropped by almost 20% during the same period. Look at the correlation co-efficient, an impressive +0.99. In case you are wondering what that co-efficient is, +1 means the two variables are perfectly positively correlated and -1 means the two variables are perfectly negatively correlated. The correlation co-efficient always lies between -1 and +1. The farther the number is from 0 (in either direction), the stronger the linear relationship. But do you really think either one of them is causing the other?
I am sure you can always find some nut out there who might say, “Well, as ‘Mericans reduced eating margarine their personal relationships got better. Residents of Maine are true ‘Mericans. The chart makes perfect sense.” If you counter the nut by asking, “Then how come we don’t see this in the Golden State of California or in Greece?”, the nut’s most likely response would be, “Californians are not ‘Mericans, they are tofu eating, gopher loving, just crazy folk, I tell ya. What da ya mean by Greece? We love McDonald’s grease!” Here is another example.
Did the increase in spending on science, space, and technology by the US govt. cause the increase in the number of suicides by hanging, strangulation and suffocation?
Here is a perfect argument to cut funding for NASA (one of my favorite organizations). Please see Figure-2 for some entertainment.
Figure-2: US spending on science and technology versus suicides.
Crazy, is it not? As the US government’s spending on Science, Space, and Technology increased, the number of suicides by hanging, strangulation and suffocation also increased between 1999 and 2009. The correlation co-efficient is +0.99. Again, some anti-govt. nut might say, “Dang man! I knew it. The US govt. is strategically eliminating smart men (obviously it can’t be women, Dr. Summers already told us that women are not good enough in science and technology because of their body parts) by increasing the science budget.” If we probe him further on why these smart men commit suicide only by hanging, strangulation and suffocation and not by other methods, the nut may answer, “Duh! Don’t you know? That’s why they are called smart people. They know stuff like that.”
I got really curious, thought “why stop here?”, and ran a correlation between margarine consumption and suicides.
What impact did the drop in margarine consumption have on the number of suicides by hanging, strangulation and suffocation?
Please see Figure-3 for the results.
Figure-3: US per capita margarine consumption versus suicides.
If you are wondering what da heck… dude. You got it. I found a negative correlation for demonstration. As Americans cut back on margarine consumption, US suicides by the methods mentioned in the above plot went up! And the correlation is a pretty healthy -0.88. Of course, fans of margarine could not breathe without margarine. What do we do now?
Why stop there!
But did the drop in per capita margarine consumption help reduce the number of people who died of starvation in the US? Really?
Yes, really. Please see Figure-4 for life-saving details.
Figure-4: Margarine per capita consumption in the US versus number of people starved to death in the US.
The number of deaths due to starvation dropped by more than 85% when per capita margarine consumption dropped by 50%. Apparently too much margarine starves people and in extreme cases may even kill them. This is very confusing now. Let us summarize what we have learned so far.
The drop in margarine consumption lowered Maine’s divorce rate (I say this is good, and boo to Mr. Fabio of “I Can’t Believe It’s Not Butter” fame.)
The drop in margarine consumption increased suicides by hanging, strangulation, and suffocation (Definitely not good.)
But the drop in margarine consumption also lowered the number of deaths from starvation in the US (Good.)
We are in a pickle now. If we try to help married couples in the state of Maine by encouraging Americans to cut back on margarine consumption, we may inadvertently increase the number of suicides by some very painful methods but could also lower the number of deaths by starvation. What have we gotten ourselves into? Let us for a second assume that saving lives is more important than saving marriages. Should we recommend an increase in margarine consumption? Let us look at the magnitude of the increase in the number of suicides versus margarine consumption. Americans cut back ~4 pounds (1.8 kg) of margarine per person per year between 2000 and 2009, and 3,300 additional people committed suicide. But the number of deaths by starvation dropped from 120 to 14 during the same period. See how hard public policy can be with messed-up causality! I am sure you got the point, but let us find one more negatively correlated data set.
What did the drop in honey-producing bee colonies do to juvenile arrests for possession of marijuana in the US?
As the honey-producing bee colonies dwindled in the US, juvenile arrests for possession of marijuana in the US jumped between 1990 and 2009. That is a 20-year data set with a healthy negative correlation co-efficient of -0.93. Please see Figure-5 for more details. One interpretation could be that the drop in honey-producing bee colonies increased the price of honey, and the ‘Merican juveniles, who were sucking on honey before, could not afford it anymore and switched to sucking on joints. Not likely, you say? Well, here is another hypothesis. Maybe the police were busy raising honey bee colonies before, but with the drop in the number of colonies they hit the streets, arresting juveniles for promoting an organic lifestyle.
I hope you can see my desperate pathetic attempts at coming up with some funny causal relationships. Hey! At least I am trying. How about two more examples? I hope you are having as much fun as I am.
Figure-5: Honey bee colonies versus juvenile arrests for marijuana in the US.
Did my Civil engineering graduate friend’s decision not to pursue a PhD save lives?
I say it did. One of my friends from graduate school decided to end his graduate studies after earning a Master’s degree but often wondered if it was the right decision. I am sure he will be totally relieved after seeing the chart below (Figure-6).
Figure-6: Number of Civil engineering doctorates awarded versus suicides.
Holy cannoli, Batman! Can you believe it? Apparently humanity lost faith as more Civil engineering doctors started popping out. Also, look at the correlation co-efficient, a structurally strong 0.95.
But on a serious note though, isn’t it clear that we can always pick and choose the data points we want to fit our own hypotheses? Clearly, it is very important that we don’t fall for the standard statistical tricks played by pseudo-intellectuals and pundits.
You may say, “Well, with all due respect Faker dude, you picked a bunch of random examples (what is this obsession with suicides and arrests) to make your point. But I am not stupid enough to believe in any of the scenarios you mentioned above.” Point is well taken. I shared these cases because I found the data behind these cases with relative ease (thanks to Tylervigen.com). However, the arguments and takeaways are still valid. Let us consider a few serious cases then.
Does precipitation in New York impact how much rain Vermonters get?
Finally! Here is a dataset that looks at precipitation levels in New York and Vermont for more than three decades. Please see Figure-7 for the correlation data. The correlation co-efficient is +0.89. Not bad at all, but do we really think one caused the other? Isn’t it more likely that these states share a border and similar weather patterns, and hence have similar precipitation levels?
This is a classic case of strong correlation but no causation.
Figure-7: Average daily precipitation in New York and Vermont.
Please take the time to go through the full article; it is worth a read. This is a classic example of data misinterpretation and borderline manipulation. Here is a quick summary of the article: Dr. John Yudkin was a British nutritionist who pioneered research on sugar’s impact on human health and obesity. Unfortunately, the food industry and an influential US-based scientist, Dr. Ancel Keys, pretty much destroyed Dr. Yudkin’s work. Dr. Keys showed a correlation between saturated fat and heart disease and, without excluding other causes, erroneously established causality between the two. His charismatic personality and some compelling correlation charts (similar to the ones shown above) convinced the majority of global medical boards to recommend low-fat diets with large cokes. Please see the article Part Unum – Deep Thought’s answer to the ultimate question of Life is not 42… It is Complexity for an in-depth analysis. Many generations of people mistakenly shunned fat, consumed loads of sugar, and suffered for decades (and still do). Data misinterpretation and faulty causality were the root cause.
The link here has a very good summary of the vitamin D controversy and is worth a read. The key takeaway is that not all vitamin deficiencies cause or exacerbate diseases.
Another article on hormone replacement therapy is a good read as well. Please click on hormone therapies for the link to the full article.
The key point of this write-up is this: do not simply assume a causal relationship just because a correlation between two variables is found. Figure-8 below shows the correlation co-efficients of various data sets.
Figure-8: Correlation-coefficients of various data sets.
One can have a +1 or -1 correlation with varying degrees of slope (i.e., impact). Standard correlation tests can also miss complex (for example, non-linear) relationships. Hence we should always pay more attention to the hypothesis and underlying intuition than simply believing the strength of the correlation.
Data analytics and Artificial Intelligence (AI) are the current buzzwords, and it is imperative that data scientists and AI programmers clearly distinguish between correlation and causation. How can we teach a machine to learn if we can’t even figure out the difference ourselves? Who knows, maybe Skynet gets smarter soon enough and teaches us. But till then we should be more discriminating while evaluating causal claims made by so-called experts, politicians, marketers, and especially media personalities.
I did not get into the formulae behind correlation, co-variance, standard deviation, etc. in this write-up but if you are interested in the math behind all of these funny charts, please leave a note and I will do my best to come up with an appropriate article.