Let’s begin by working through a situation that has been quite popular recently…suppose that you show the following photo to 10 of your friends:
Now, let’s say that 8 of your friends saw the dress as blue and black and 2 saw it as white and gold. You’d probably feel pretty comfortable asserting that, if you were to poll more people, more would see the dress as blue and black than white and gold. What about if your sample had 6 reporting blue and black and 4 reporting white and gold? You’d probably think something along the lines of “yeah, I know the result isn’t split 50/50, but it’s not that weird to get a little away from 50/50 in a sample even when the population is divided 50/50.” Neither of these statements is particularly unreasonable…but where do you draw the line? What if your sample reports 7 for blue and black and 3 for white and gold?
This is what tests of statistical significance are supposed to help out with.* Wouldn’t it be nice to know how likely it would be that your sample would give a vote at least as lopsided as 7-3 if the population really were split 50/50? This is what a statistical “p-value” tells us. If that value is sufficiently small, we say to ourselves “self, you know what? It’s pretty darn unlikely that I would see what I’m seeing from my sample if the population were really split 50/50 on this issue- maybe it’s time to entertain the notion that more people think the dress is blue and black than think it is white and gold.” (In reality, I think the white/gold camp wins out, but this is my story, so just go with it.) This is what statistical hypothesis testing does.
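If you want to see where that p-value actually comes from, here’s a rough illustration (my own sketch, not anything from the journal article): a two-sided binomial test of the dress poll against a 50/50 null, in plain Python:

```python
from math import comb

def binom_pvalue(k, n, p=0.5):
    """Two-sided binomial test: probability of an outcome at least
    as unlikely as seeing k 'successes' in n trials, assuming the
    true population proportion is p."""
    prob = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    observed = prob(k)
    # Sum the probability of every possible outcome that is no more
    # likely than the one we actually observed.
    return sum(prob(i) for i in range(n + 1) if prob(i) <= observed + 1e-12)

# 7 of 10 friends see blue and black:
print(binom_pvalue(7, 10))  # 0.34375
# 8 of 10 friends see blue and black:
print(binom_pvalue(8, 10))  # 0.109375
```

So even the 8-2 split has roughly an 11% chance of happening (or something more extreme) under a 50/50 population- which is exactly why eyeballing sample counts is a dangerous game and researchers lean on these tests in the first place.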
Sounds pretty compelling, right? If so, then I hope for your sake that you aren’t a social psych researcher, since the Journal of Basic and Applied Social Psychology decided to ban statistical significance testing in all of the articles that it publishes. (For you Bayesians out there, they aren’t too happy with you either but are willing to consider your analyses on a case-by-case basis.) Okay, I get that the generally accepted practice of considering a finding with a p-value of 0.05 or less as significant and everything else garbage isn’t without its problems, most obviously that researchers have incentives to finagle their analyses to sneak in under this threshold, but what on earth are researchers supposed to do instead? (i.e. what is the counterfactual to statistical hypothesis testing? So meta.)
I have some suggestions:
- Just look at your data- if your graph traces out the shape of an animal, count it as meaningful. Like this:
- Just wave your hands and talk forcefully until people take your result seriously. (This seems to have worked for macroeconomists for a while now.)
- Ask your pets- right paw = significant, left paw = not significant. (If you have a bird, you could use the result to line the cage and…well, you figure it out, since I can’t decide if bird crap indicates significance or the lack thereof.)
The downside I suppose is that none of these approaches really has the gravitas normally associated with scientific rigor, so I’m at a bit of a loss. Seriously though, I don’t understand what researchers are supposed to do instead- the article mentions something about descriptive statistics, but the point of the statistical analysis that I referred to above is to give some context as to whether differences in descriptive statistics are large enough to be worth paying attention to.
As I said, statistical analysis is not without its flaws, but there are a number of far less controversial and likely more productive steps that the journal could have taken:
- Pre-registration of experimental trials- if there is a record of what was tried experimentally, then it’s clearer how many things were tried in order to get a result that looks “good.” (The American Economic Association has started doing this, but it’s not mandatory yet.)
- Publication of p-values and confidence intervals- rather than just declare something as “significant” because it meets some arbitrary p-value threshold (results are often simply given a number of asterisks to indicate significance), explicitly report the p-value itself- i.e. how likely a result at least as extreme as yours would be if it were just random chance- and give a confidence interval or error bars for point estimates.
- Publication of negative results- if journals published papers where the “null hypothesis” (i.e. the uninteresting hypothesis that the researchers are looking to refute) can’t be rejected, then researchers would have less of an incentive to fiddle with their analyses to make it look like they meet the threshold of statistical significance. This, coupled with pre-registration of experiments, would cut down on what is known as “publication bias,” or the tendency for readers to see only the studies that showed the result that researchers were looking for (while the other studies get put in the circular file or whatever).
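To make the confidence-interval suggestion concrete, here’s one way (my choice of method, not the journal’s) to put an interval around the dress poll result- a 95% Wilson score interval for the true blue-and-black share, given 7 of 10 votes:

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """95% Wilson score confidence interval for a proportion,
    given k successes out of n trials (z = 1.96 for 95% coverage)."""
    phat = k / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(7, 10)
print(f"7/10 blue-and-black: 95% CI roughly ({lo:.2f}, {hi:.2f})")
```

The interval comes out to roughly (0.40, 0.89)- it straddles 0.5, which tells the reader directly that a 50/50 population is still quite consistent with the data, without anyone having to wave an asterisk around.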
I guess this makes me thankful that I’m an economist, since if I ever write a paper that reads “well, my cat and I think this result looks pretty good, how about you?” it will be because I wanted to and not because I had to.
* Yes, I know that this doesn’t have to do with causality specifically, but this same method is used for analyses that attempt to tease out cause and effect.