Lecture 5: Causation and Correlation
I. Correlation
Correlation is the tendency of two variables to move together. A correlation can be positive (meaning that the variables tend to move in the same direction) or negative (meaning the variables tend to move in opposite directions). An example of a positive correlation is age and income; people who are older tend to earn more money. An example of a negative correlation is latitude and temperature; as latitude increases (that is, as you move away from the equator), temperature tends to decrease.
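More specifically, we have a statistical formula for the correlation coefficient, r:
$$ r = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y}, \qquad s_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}, \quad s_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
Here \bar{x} and \bar{y} are the averages of x and y. Notice the numerator will tend to be large and positive when above-average values of x happen along with above-average values of y, and below-average values of x with below-average values of y; it will be large and negative when above-average values of one variable go with below-average values of the other. The denominator is the product of the two variables’ standard deviations; in other words, we are scaling for how much these variables vary without reference to each other.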
The correlation coefficient always has a value between -1 and 1. When it’s 1, we have a perfect positive correlation; when it’s -1, we have a perfect negative correlation. A perfect correlation means that if you know one variable’s value, you automatically know the other’s as well. For instance, the ages of any two people are perfectly correlated; if you know one person’s age, then as long as you know the difference in their birthdates, you know the other person’s age, too. When the correlation coefficient is zero, there is no correlation; knowing one variable’s value does not help you predict the other’s, at least not with a straight-line relationship.
If you square the correlation coefficient, you get something called the coefficient of determination, or r². It always lies between 0 and 1, and it has a nice interpretation: it tells you the fraction of variation in one variable that can be explained, or predicted, by variation in the other. For example, suppose r² = 0.4 for income and age. That would mean differences in age explain 40% of differences in income; the remaining 60% would have to be explained by other factors.
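As a rough illustration of the two quantities just described, here is a minimal Python sketch that computes the correlation coefficient and its square directly from the standard (Pearson) definition. The function name `pearson_r` and all of the data are made up for this example; they are not from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how x and y move together around their averages.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each variable varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Hypothetical data: two people born 3 years apart, observed in five years.
older = [10, 20, 30, 40, 50]
younger = [age - 3 for age in older]
r_ages = pearson_r(older, younger)      # perfect positive correlation

# Hypothetical ages and incomes (in thousands); invented numbers.
ages = [25, 35, 45, 55, 65]
incomes = [30, 52, 48, 70, 61]
r_income = pearson_r(ages, incomes)
r_squared = r_income ** 2               # coefficient of determination

print("r for the two ages:", round(r_ages, 6))
print("r for age and income:", round(r_income, 3))
print("r squared for age and income:", round(r_squared, 3))
```

The common factor of 1/n cancels between numerator and denominator, which is why the sketch works with plain sums rather than averages.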
However, we’re using “explain” and “predict” in a very specific way here. It is only a property of the variables’ numerical values and how they tend to go together. That does not necessarily mean that changes in one variable cause changes in the other. For example, age and income might be correlated, but age may not cause higher income; it’s just that older people tend to have more experience. Correlation is only a statement of numerical facts; it says nothing about cause and effect. As we will see, causation is a much more complex matter.
II. Correlation versus Causation
Causation = cause and effect; we are talking about how one thing will tend, other things equal, to result in another thing. It is often very difficult to identify a true causal relationship.
Example: RadioLab podcast on “Secrets of Success.” What causes success? Is it innate ability? Timing/opportunity? Love of the activity (motivation)? Practice?
Things to consider: (1) Maybe there is not just one answer. Multiple factors contribute to success. Some may have one cause, others a different cause. (2) Some factors may only indirectly cause the outcome. E.g., motivation might matter only because it affects practice. (3) Some factors may interact with each other. E.g., the effectiveness of practice may depend on innate talent. E.g., the use of talent may be dependent on some degree of luck. (4) Some things may have both a direct and indirect effect. E.g., talent has both a direct effect by making you just better, and an indirect effect by increasing your motivation (people like to do what they’re good at).
Example: Republicans report having more satisfying sex lives than Democrats. According to an ABC News poll (http://abcnews.go.com/Primetime/News/story?id=180291), Republicans are more likely to report being very satisfied with their sex life than Democrats (by a margin of 56% to 47%). This is true even if you control for being in a committed relationship (87% versus 76%), so it’s not just that Republicans are more likely to be in such relationships. What’s going on? Does this mean being a Republican causes better sex lives? [Good for demonstrating C→A&B. Turns out men are both more likely to be Republican and more likely to be happy with their sex lives.]
Example: People who have had more sex partners are more likely to get divorced. Does this mean having more sex partners causes divorce? (http://agoraphilia.blogspot.com/2007/02/sexual-correlation-and-causation.html)
[Good for demonstrating C→A&B and also B→A. One possibility is that possession of conservative values results in both fewer divorces and fewer sex partners. Another possibility is that getting married tends to cause fewer sex partners (because you stop adding more), while getting divorced tends to cause more sex partners (because you start adding them again).]
Example: Is the President responsible for the economy’s performance on his watch? Obviously, the President has some influence on economic policy. But there are lots of confounding factors. (a) Effects can be lagged in time, so that some economic effects are the responsibility of the previous president. (b) Business cycles can be driven by non-political factors, such as changes in underlying factors in the economy. (c) A president might get voted out of office because people think he’s responsible for the recession, and as a result the new president comes in just as the economy is recovering.
Simplest form of causation: A→B. That is, when A happens, that means B will also happen. We say A causes B to happen. When we observe a correlation between A and B, people will often reach the conclusion that A→B. But there are many other possibilities:
B→A; call this reverse causation.
C→A & B; call this external causation.
A→B & C→B; A and C each independently cause B; call this multiple causation.
(A&C)→B; A and C together cause B; call this joint causation.
A→C→B; call this indirect causation.
C→A→B; this is also indirect causation, but with a different order of events.
A unrelated to B; we call this a coincidence. B happened for unrelated reasons.
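To make external causation (C→A & B) concrete, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the variables are random numbers, C feeds into both A and B, and A never enters the formula for B. Even so, A and B come out clearly correlated, while an unrelated variable D does not.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    top = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    bottom = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
             math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return top / bottom

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000

c = [rng.gauss(0, 1) for _ in range(n)]        # the outside factor C
a = [ci + rng.gauss(0, 1) for ci in c]         # C -> A
b = [ci + rng.gauss(0, 1) for ci in c]         # C -> B (A plays no role)
d = [rng.gauss(0, 1) for _ in range(n)]        # unrelated to everything

r_ab = pearson_r(a, b)   # sizable, although A does not cause B
r_ad = pearson_r(a, d)   # close to zero: no connection at all

print("correlation of A and B:", round(r_ab, 2))
print("correlation of A and D:", round(r_ad, 2))
```

The correlation between A and B here is real as a numerical fact, but reading it as A→B would be a mistake; by construction, changing A would do nothing to B.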
III. The Need for Controls
Because of all these other factors that can cause spurious correlations (especially C→A&B), we need to use controls. That means trying to find data for which the other factor (C) is held constant, so we can test whether we still have a relationship consistent with A→B.
Example: What are the causes of higher income? Education is an obvious answer. Another obvious answer is IQ. But wait… IQ may also cause one to get higher education. So maybe we have external causation: IQ is the real cause, it leads to higher income – and also higher education as a side effect. To find out how much education really does, we need to control for IQ by having at least some people in our data set who have similar IQ’s but different amounts of education. And we’d also like some people with similar education but different IQ’s, to test to see whether IQ has an effect independent of education.
Much of statistics is concerned with trying to control for confounding factors. There are various methods for doing this. One is the experimental method, which directly holds constant some factors while varying others. Another is multiple regression, which uses a large sample of data in which several factors happen to vary, and estimates the effect of each factor while statistically holding the others constant.
But even with good controls, we still cannot verify true causation. The best we can do is rule out certain alternative hypotheses, thereby strengthening the case for causation. Say we have a theory that education causes higher income. Someone might challenge that with the IQ hypothesis. To rule that out, we use IQ as a control and show there is still a correlation between education and income. That lends support to our theory, but it doesn’t prove it, because it’s still possible the correlation is just a coincidence or caused by yet another factor we’ve failed to control for.
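The logic of using IQ as a control can be sketched with a simulation. The Python below is only an illustration under invented assumptions: IQ is reduced to three groups, the numbers are random, and the names `simulate` and `correlations` are hypothetical. It builds two made-up worlds, one where education has no effect on income and one where it does, and compares the overall education-income correlation with the correlation among people in the same IQ group.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    top = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    bottom = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
             math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return top / bottom

def simulate(education_effect, n=6000, seed=1):
    """Hypothetical world: IQ group raises education and income;
    education itself adds `education_effect` to income."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        iq_group = rng.choice([0, 1, 2])             # the control, C
        education = iq_group + rng.gauss(0, 1)       # C -> A
        income = iq_group + education_effect * education + rng.gauss(0, 1)
        rows.append((iq_group, education, income))
    return rows

def correlations(rows):
    """Education-income correlation overall and within each IQ group."""
    overall = pearson_r([r[1] for r in rows], [r[2] for r in rows])
    within = {}
    for g in (0, 1, 2):
        group = [r for r in rows if r[0] == g]       # hold C constant
        within[g] = pearson_r([r[1] for r in group], [r[2] for r in group])
    return overall, within

no_effect_overall, no_effect_within = correlations(simulate(0.0))
real_effect_overall, real_effect_within = correlations(simulate(1.0))

print("no education effect: overall", round(no_effect_overall, 2),
      "within groups", {g: round(r, 2) for g, r in no_effect_within.items()})
print("real education effect: overall", round(real_effect_overall, 2),
      "within groups", {g: round(r, 2) for g, r in real_effect_within.items()})
```

In the first world the correlation shows up overall but largely disappears once IQ group is held constant; in the second it survives the control. As the text says, surviving the control supports the education theory without proving it, since some other uncontrolled factor could still be at work.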
The “ceteris paribus” assumption. Most of the claims we make in economics have this form; we’re saying that a causal relationship holds as long as other things are equal. The law of demand says a higher price will induce people to buy less – but that’s assuming income and preferences are constant. So we try to control for those other factors to isolate the effect of price.
IV. Necessary and Sufficient Conditions
These terms are often, but not always, related to causation. In some cases, they refer not to causation but to strictly logical relationships.
We say A is a sufficient condition of B if having A guarantees having B. We can write this A→B, or “If A then B.” Note that this does not mean that B→A. For example, being a poodle is a sufficient condition for being a dog. Thus, poodle→dog. But if you have a dog, it might not be a poodle.
We say A is a necessary condition of B if B cannot happen without A. We can write this B→A or A←B. (Think that through. If A is necessary for B, then if B is true, we know A must have happened.) For example, being a dog is a necessary condition for being a poodle.
In the example of dogs and poodles, the necessary and sufficient conditions are opposite sides of the coin. Poodle is sufficient for dog, and dog is necessary for poodle. But it doesn’t have to be that way. There are many cases where you may have sufficient conditions without necessary ones, and vice versa. For example, more education might be sufficient for higher income (that is, other things equal, more education leads to higher income). But even if that’s definitely true, you cannot conclude from someone having a higher income that they also have more education, because other things (like athletic ability) might lead to higher income without an education.
How do necessary and sufficient conditions show up in statistics? Both can show up in correlations. If A is necessary for B, then we will expect to see B only in cases where A occurs, and not-B when A does not occur, resulting in a correlation. (However, if A really is strictly necessary for B, then there should not be a single counterexample; that is, there should be no data points with B and not A.) If A is sufficient for B, then we will expect to see A and B frequently occurring together as well, since whenever A happens, B must also happen. (However, if A really is strictly sufficient for B, there should be no data points with A and not B.) Keep in mind that whenever other factors are also involved, as in the case of multiple causation, there will not be perfect correlations. We may try to control for them all, but we’re unlikely to succeed because the world is complex.
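The counterexample tests described above are easy to state in code. The sketch below is a hypothetical illustration in Python: the function names and the list of pets are invented, and each observation is a pair (A is true, B is true).

```python
def is_consistent_with_sufficient(observations):
    """A sufficient for B: no data point has A without B."""
    return all(b for a, b in observations if a)

def is_consistent_with_necessary(observations):
    """A necessary for B: no data point has B without A."""
    return all(a for a, b in observations if b)

# Hypothetical pets, recorded as (is_poodle, is_dog).
pets = [
    (True, True),    # a poodle
    (False, True),   # a dog that is not a poodle
    (False, False),  # a cat
    (True, True),    # another poodle
]

# A = poodle, B = dog
poodle_sufficient_for_dog = is_consistent_with_sufficient(pets)
poodle_necessary_for_dog = is_consistent_with_necessary(pets)

# Swap the roles: A = dog, B = poodle
swapped = [(is_dog, is_poodle) for is_poodle, is_dog in pets]
dog_necessary_for_poodle = is_consistent_with_necessary(swapped)

print("poodle sufficient for dog:", poodle_sufficient_for_dog)
print("poodle necessary for dog:", poodle_necessary_for_dog)
print("dog necessary for poodle:", dog_necessary_for_poodle)
```

A single counterexample is enough to reject a strict necessary or sufficient condition, but passing the check only means the data are consistent with it; a larger sample could still turn up a counterexample.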