Technically, I’m cheating with the “causal Friday” title, since, while regressions do identify associations that exist when controlling for other variables, these associations aren’t always of the causal variety. (This is particularly true when not all relevant factors can be controlled for.) But I choose to not be too persnickety because I think the comic is funny and wanted to share it.
Okay, you should have known better than to believe that I was going to avoid “too persnickety.” Personally, I won’t decide whether I am suspicious of the linear regression until someone tells me whether the slope is statistically significant. Also, if there are multiple explanatory variables that affect an outcome, a scatter plot that only looks at one of them at a time will generally look like a mess even when all of the variables are individually important. In related news, this is a good opportunity to talk about the distinction between estimated effects (i.e. regression coefficients) and R-squared. (Don’t stop reading if you aren’t super into econometrics- I promise to make this make sense.)
Let’s say an economist is trying to model how much coffee I drink. (In reality, this is not necessary- the regression would just have a really big constant term, but go with me here.) Unfortunately, the only data available to use as an explanatory variable is income. Obviously, there are a lot more factors that affect my coffee consumption than just my income, so it shouldn’t surprise you that if I were to plot coffee consumption as a function of income (where each data point is a month of time, let’s say) I would get something that looks like the scatter plot above.
Let’s say that I’m measuring my income in hundreds of dollars and the estimated slope of the regression line is 0.01. This means that, on average, each hundred dollar increase in income is associated with 0.01 more coffees per month. If the numbers show that this estimate is statistically significant, then it’s pretty unlikely that this association exists in the data by random chance. Let’s also say that the R-squared of the regression is 0.06, like in the picture. This means that changes in my income only explain 6 percent of the variation in my coffee consumption.
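To see how a small slope, a significant p-value, and a low R-squared can all show up at once, here’s a minimal simulation. The numbers are made up to roughly match the example (a big constant term, a true slope of 0.01, lots of noise)- this is an illustrative sketch, not anyone’s actual coffee data.

```python
# Hypothetical simulation: coffee consumption depends only weakly on income
# plus a lot of noise, so the slope is statistically significant while
# R-squared stays small.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 1000                                   # months of (simulated) data
income = rng.normal(50, 10, n)             # income in hundreds of dollars
coffee = 60 + 0.01 * income + rng.normal(0, 0.4, n)  # big constant, tiny slope

fit = linregress(income, coffee)
print(f"slope     = {fit.slope:.4f}")
print(f"p-value   = {fit.pvalue:.4g}")
print(f"R-squared = {fit.rvalue**2:.3f}")
```

With a large enough sample, the p-value on the slope comes out tiny even though R-squared sits down around 0.06- exactly the combination described above.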
My point is that these two conclusions aren’t in conflict with one another- it’s entirely possible for a relationship to be statistically significant and yet explain only a small fraction of what is going on. (This happens a lot in finance, actually, and an R-squared of 0.06 wouldn’t generally be seen as a red flag just because there is so much unexplainable noise in the data.) Sure, the result would be more impressive with a higher R-squared, but it’s largely a matter of personal judgment whether explaining, say, 6 percent of a phenomenon is worth talking about. (Not gonna lie- some economics journals vote no on this question.)
That said, I do recommend watching out for a red flag of a slightly different sort- one of the conditions for a regression to be valid is that your explanatory variables are uncorrelated with all the relevant stuff that you aren’t controlling for (your error term, in technical terms). In the case of my coffee regression, my result is valid only if my income isn’t correlated with whatever else could be causing variation in my coffee consumption (hours worked, for example). I can tell you personally that that is a lot of stuff.
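That red flag has a name- omitted variable bias- and it’s easy to manufacture in a simulation. Here, hours worked is a hypothetical omitted variable that drives both income and coffee consumption; income has no true effect at all, yet the regression of coffee on income alone finds a “significant” slope anyway. Again, all numbers are invented for illustration.

```python
# Hypothetical omitted variable bias: "hours_worked" affects coffee AND is
# correlated with income, but is left out of the regression. The true
# coefficient on income is zero, yet the estimated slope is not.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
n = 1000
hours_worked = rng.normal(160, 20, n)               # hours per month
income = 0.5 * hours_worked + rng.normal(0, 5, n)   # correlated with hours
coffee = 40 + 0.2 * hours_worked + rng.normal(0, 3, n)  # income plays NO role

fit = linregress(income, coffee)
print(f"estimated slope on income = {fit.slope:.3f}")
print(f"p-value = {fit.pvalue:.4g}")
```

The slope comes out solidly positive and statistically significant, even though income has zero causal effect in this setup- the coefficient is just picking up the hours-worked channel.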
I’m now tempted to perform a neural net analysis of my coffee consumption in order to see if I could get Rexthor out of it.