Causal Friday: The Dumbest Differences-in-Differences Ever, Viral Video Edition…

It’s causal Friday, so I’m poking around in some data trying to make a case for cause and effect. More specifically, I’m drowning in a differences-in-differences analysis trying to construct a proper control group. (In this case, identifying a control group isn’t conceptually difficult, it’s just really annoying to pull the data.)

So what is a differences-in-differences analysis? It’s…well, pretty much exactly what it sounds like- it’s kind of nice when that happens. Here’s a little thing I wrote up a while ago for my team at work (who had more of a data science than social science background), or you can consult Wikipedia. (note: you will never convince me it’s “difference-in-differences” and not “differences-in-differences”) The general principle behind differences-in-differences is that you can’t just do a before-after comparison to identify the effect of an event, since you’d also need to know what the before-afters look like for stuff that wasn’t subjected to the event. For example, consider the following hypothetical question:

Fairlife milk initiated a new marketing campaign at the beginning of 2018, and so far sales of Fairlife milk are 5 percent higher than for the same portion of 2017. Should the marketing campaign be considered a success?

Hopefully it’s at least somewhat intuitive that the answer should be “I dunno, how do the sales of non-Fairlife milk compare to last year?” If milk sales more generally are, say, also up 5 percent, it’s not particularly likely that the marketing campaign is doing much. On the other hand, if sales for the milk industry were generally down compared to last year, the marketing campaign should be viewed much more favorably.
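To make the arithmetic concrete, here is a minimal sketch of the milk comparison. The sales figures and the helper names (`pct_change`, `diff_in_diff`) are made up for illustration. Only the 5 percent Fairlife increase comes from the hypothetical above.

```python
def pct_change(before, after):
    """Percent change from the before period to the after period."""
    return 100.0 * (after - before) / before


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Treated group's percent change minus the control group's percent change."""
    return pct_change(treated_before, treated_after) - pct_change(
        control_before, control_after
    )


# Hypothetical sales in arbitrary units (assumed numbers, not real data).
fairlife_2017, fairlife_2018 = 100.0, 105.0  # Fairlife is up 5 percent

# Scenario A: all other milk is also up 5 percent.
did_a = diff_in_diff(fairlife_2017, fairlife_2018, 1000.0, 1050.0)

# Scenario B: all other milk is down 3 percent.
did_b = diff_in_diff(fairlife_2017, fairlife_2018, 1000.0, 970.0)

print(f"Scenario A: {did_a:+.1f} percentage points")
print(f"Scenario B: {did_b:+.1f} percentage points")
```

In scenario A the campaign gets credit for roughly nothing, while in scenario B the same 5 percent bump looks more like an 8 point effect, which is the whole reason for dragging a control group into it.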

So this is what I was trying to do, but with music sales. Remember this?

Is this the best way ever to quit your job? Marina Shifrin resigns with Kanye West dance video

Ever wanted to quit your job? Why don’t you take a leaf out of video producer Marina Shifrin’s book and do it through the medium of “interpretive dance”?


For context, I’m using this as a motivating example for a larger analysis on music sampling. So I dutifully went through and identified two other songs from the same Kanye album that were about as popular as “Gone” before the video above went viral, and then I looked up the sales of all three songs (this is way more annoying than you’d think it should be) before and after the video’s posting date so I could do a very careful and nuanced analysis. Clearly I didn’t think things through, since, well…

This is only for the song used in the viral video, so I don’t technically have a comparison group (yet), but I mean COME ON…nonetheless, I persisted and added my control group:

(I changed the scale of the graph so it looked just wonky rather than useless.) You’ll be pleased to know that my confidence in the video causing a sales bump has not decreased…but let’s calculate some differences in differences anyway (it’s not really possible to run a regression here). So here are the numbers for 4 weeks before and 4 weeks after:

These numbers are pretty easy to interpret- an effect is positive if the differences-in-differences numbers are positive and vice versa. (An effect is nonexistent if the difference is close to zero.) Now I guess technically I should run a test to see whether the differences are *statistically* different from zero, but 1. that’s kind of hard with 3 data points, and 2. I mean come on.
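For anyone who wants to replicate the calculation, here is a rough sketch of the 4-weeks-before versus 4-weeks-after comparison. The weekly sales numbers are placeholders I invented so the code runs (they are NOT the real figures), and `control_1` and `control_2` are stand-in names for the two comparison songs.

```python
from statistics import mean

# Placeholder weekly unit sales: 4 weeks before and 4 weeks after the
# video's posting date. Invented for illustration, not the actual data.
sales = {
    "Gone": {"before": [100, 110, 90, 100], "after": [900, 700, 500, 300]},
    "control_1": {"before": [120, 100, 110, 110], "after": [115, 105, 100, 120]},
    "control_2": {"before": [90, 95, 105, 110], "after": [100, 90, 110, 100]},
}


def before_after_diff(series):
    """Average weekly sales after the event minus average weekly sales before."""
    return mean(series["after"]) - mean(series["before"])


treated_diff = before_after_diff(sales["Gone"])

# One differences-in-differences number per control song.
did_by_control = {
    name: treated_diff - before_after_diff(sales[name])
    for name in ("control_1", "control_2")
}

for name, did in did_by_control.items():
    print(f"Gone vs {name}: {did:+.0f} units per week")
```

A positive number means the treated song gained more (or lost less) than the control did over the same window.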

The real punch line in all of this is the fact that the video has been taken down on copyright grounds…I’m, um, not sure you’re doing it right, record label…

Causal Friday: But What About All The People Who Didn’t Declare Bankruptcy?

I’m not quite sure why this is in the news again now (ok, I can read: there was a critique of the paper), but it is, so let’s take a look at it. Here’s the setup:

In 2005, she [Warren], along with David Himmelstein, Deborah Thorne and Steffie Woolhandler, published a paper in the journal Health Affairs documenting a memorable statistic: More than 40 percent of all bankruptcies in America were a result of medical problems, they wrote. In 2009, they updated that research with an even more startling number: Medical bills were responsible for more than 62 percent of all American bankruptcies.

One main problem here is the way that “medical problems” or “medical bills” is conveniently defined, but let’s put that aside for the moment. Logically, these statements mean that the study authors claim that medical bills cause bankruptcy. If you look at the methodology, however, you’ll see that the researchers get to this notion of causality by only looking at those people who declared bankruptcy, and…well, that’s definitely not how causality works.

To see why, let’s consider a logically equivalent situation that has the benefit of being far more obviously absurd. Namely, I could claim that having two legs causes bankruptcy because most people who declare bankruptcy have two legs. (I said most, do not start with me.) In this scenario, your brain’s knee-jerk reaction is probably “hey wait, but most people who don’t declare bankruptcy also have two legs, so this makes no sense.” And your brain is not wrong! For some reason though, people aren’t primed in the same way to conjecture that a large chunk of people have medical bills, not just those who declare bankruptcy. (I think the bill threshold to be counted in this context was $1,000, and really, who among us…)
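Here is a small sketch of why looking only at the bankrupt tells you nothing. It builds a hypothetical world where bankruptcy is, by construction, completely unrelated to medical bills. The population size, the 62 percent share of people with bills, and the 1 percent bankruptcy rate are all assumptions picked for illustration, not estimates of anything.

```python
def summarize(population, share_with_bills, bankruptcy_rate):
    """Describe a made-up world where everyone faces the same bankruptcy rate."""
    with_bills = population * share_with_bills
    without_bills = population - with_bills

    # Same bankruptcy rate in both groups, so bills have zero effect here.
    bankrupt_with = with_bills * bankruptcy_rate
    bankrupt_without = without_bills * bankruptcy_rate

    return {
        "share_of_bankrupt_with_bills": bankrupt_with
        / (bankrupt_with + bankrupt_without),
        "bankruptcy_rate_with_bills": bankrupt_with / with_bills,
        "bankruptcy_rate_without_bills": bankrupt_without / without_bills,
    }


result = summarize(population=1_000_000, share_with_bills=0.62, bankruptcy_rate=0.01)

for label, value in result.items():
    print(f"{label}: {value:.1%}")
```

In this world about 62 percent of the bankrupt have medical bills, yet the bankruptcy rate is the same with or without them. The only comparison that can reveal an effect is the one between the last two numbers, and that comparison requires the people who didn’t declare bankruptcy.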

We see this principle get abused and debunked in a legal sense all the time- and by that I obviously mean on Law and Order. IIRC, there’s an episode where a defense attorney argues that video games lead to murder because the murdery kid played video games and Jack McCoy is all “but what about all of the kids who play video games and don’t kill anyone?” I’m pretty sure Jack won the case, and we all learned an important lesson about holding ourselves to this standard of logic even when the bad conclusion is not intuitively absurd.

But wait, there’s more…my commentary on the matter led to this tweet from a friend of mine:

My initial reaction in a nutshell:

In this case, my lizard brain was tempted to answer no, since, well duh. But, technically speaking yeah, you need a counterfactual group of people who don’t get shot (maybe not exactly the same size) to see how many of them die within an hour. I guess the thing here is that we generally know enough about human robustness to conclude that the chance of randomly kicking the bucket in the next hour without getting shot is negligibly small. (*knocks on wood*) But even then, I’m implicitly assuming that people who get shot aren’t systematically close to death anyway…which, I’m pretty sure this is just a different episode of Law and Order. (that link is a pretty good test of whether law school is for you tbh)

So what have we learned from this? Yes, in general, to establish that X causes Y, you need to look at both people who experience X and don’t experience X. (That said, I think you could rule out the possibility that X causes Y by only looking at people who experience X if they don’t also experience Y.) BUT, we can sometimes make do by only looking at people who experience X if we have a good read on what would happen if people didn’t experience X. Be careful though, since this only works if the people who experience X resemble the overall population, since otherwise selection bias bites us in the ass.

This shortcut, if you will, is more than just a thought exercise- in pharmaceutical trials, for example, it’s supposed to be the case that half of subjects get a placebo rather than the treatment being tested. This is ethically fine when the treatment is, say, Viagra, but what if it’s a cancer treatment? The ethics of denying people a potentially successful treatment for the sake of the pure scientific method are far less clear, especially when we’ve previously learned what happens when people don’t get the treatment. (spoiler: they die)

Note, however, that none of this “domain knowledge,” loosely speaking, allows us to conclude that X causes Y by only looking at people who experience Y. You could get closer by establishing that people who don’t experience Y generally don’t experience X (i.e. people who don’t declare bankruptcy don’t have medical bills), but even then you’ve only established correlation, and correlation does not imply causation.

Coming back to the original study and the ensuing debate…I can’t help but be annoyed with Elizabeth Warren here (even though in general I like her quite a bit), specifically because she’s stamping her academic credentials on information that is either poorly thought out or in bad faith. And for what purpose, even? Medical bills aren’t a problem only because they can result in bankruptcy; bankruptcy just happens to be an outcome that is enticingly easy to observe. In a way, the authors are overcomplicating things, since a $10,000 medical bill imposes a cost of $10,000 on a household, full stop, and that matters in the amount of, wait for it, $10,000. Focusing on bankruptcy as the dealbreaker outcome even kind of gives the impression of “oh, your medical bill didn’t result in you lying in a ditch somewhere? You’re fine then!” which is, well, probably not fine…but I know someone who might think otherwise: