There was a time, not long ago, when people studying brains gained access to new and incredibly powerful instruments for measuring which parts of the brain were more active than others. They could measure many, many different places, so many that they could sometimes even tell what emotions a person was feeling just by looking at the readings!
This rush of data made researchers giddy. They kept looking for more and more interesting details about how brains worked. They would look at the data in lots of different ways, starting from the raw activity data. This raw data about brain activity was especially exciting: there were little sparks of activity everywhere! Most researchers were cautious, though, and only considered activity convincingly demonstrated after they had accounted for the fact that they were looking at lots and lots of measurements all at once.
Some researchers wanted to also talk about what they saw before that correction, though. They said, sure, the corrected results are the most certain, but the raw results are at least worth talking about! What if we miss something due to the correction?
And some academic journals kept publishing papers that discussed the implications of the raw results. It was very hard to convince the researchers that what they were looking at was too misleading to be useful.
Then, a few researchers looking to test the new instruments came up with an idea. They would look for emotions in fish! They scanned the brain of a salmon, while showing it photos commonly used to stimulate human emotions. In their raw data they found several places the salmon brain was stimulated by the images! What could have caused that surprising result? Important areas to investigate!
Of course, the salmon was dead.
The corrections in this story have a name: multiple comparisons corrections. They aren’t just a “nice to do”; skipping them when you’ve made multiple comparisons undermines the entire analysis. You’ll find emotions in dead salmon. They aren’t hard to do, either. The problem isn’t difficulty, but that the pre-correction analysis looks so alluring.
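To show how little work a correction takes, here is a minimal sketch in Python of two common ones, Bonferroni and Holm. Neither the salmon study nor any other study mentioned here is being reproduced: the choice of these two methods is mine, and the five p-values are made up purely for illustration.

```python
ALPHA = 0.05  # the usual significance threshold

def uncorrected(p_values, alpha=ALPHA):
    """Compare each p-value to alpha as if it were the only test run."""
    return [p <= alpha for p in p_values]

def bonferroni(p_values, alpha=ALPHA):
    """Bonferroni: compare each p-value to alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=ALPHA):
    """Holm's step-down method: walk through the p-values from smallest to
    largest, loosening the threshold at each step, and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails too
    return reject

# Hypothetical p-values for five outcome measures (invented for this example).
p_values = [0.004, 0.03, 0.04, 0.20, 0.012]

print(uncorrected(p_values))  # [True, True, True, False, True]
print(bonferroni(p_values))   # [True, False, False, False, False]
print(holm(p_values))         # [True, False, False, False, True]
```

With these invented numbers, four of the five results look significant before correction, and only one or two survive it. Which correction is appropriate depends on the study; the point is only that the arithmetic is short.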
A more stylized example might help. As you’ve likely heard, if you use a common statistical approach (comparing “p-values” to a chosen threshold, in this case p < 0.05), then about one time in twenty you’ll find an effect where none exists. Imagine you’re testing whether eating a small amount of candy makes someone do better on a test. You give candy to some of the test-takers, and you don’t find an effect. Then someone else says, “Well, sure, not on that high-stakes test you did, but we should really give candy out throughout the course to some people and then check a few different outcome measures, just to be sure.” That’s a great experiment, and there’s absolutely no problem running it.
But if you do, you have to apply the same correction you’d apply when scanning the brains of dead salmon. Otherwise, results that look interesting will turn out to be just noise. Heck, look at enough outcomes and you’re basically guaranteed to find one that, before correction, looks statistically significant. “Candy improves educational outcomes!” makes a fine headline, but a claim resting on that flawed statistical analysis is no more real than “Emotions found in (dead) salmon!”
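How quickly does “basically guaranteed” arrive? Here is a rough sketch in Python, under two simplifying assumptions of mine: no real effect exists for any outcome, and the tests are independent (real outcome measures are usually correlated, which changes the exact numbers but not the direction). In that case each p-value is uniformly distributed between 0 and 1, and the chance of at least one false positive among m tests is 1 - 0.95^m.

```python
import random

ALPHA = 0.05  # the usual significance threshold

def chance_of_false_positive(n_tests, alpha=ALPHA):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulated_chance(n_tests, alpha=ALPHA, n_studies=20000, seed=0):
    """Check the formula by simulation: when there is no effect, a p-value
    is just a uniform random number between 0 and 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies

for m in (1, 5, 20, 100):
    print(m, round(chance_of_false_positive(m), 3), round(simulated_chance(m), 3))
```

By the formula, five outcomes already give roughly a 23% chance of a spurious “finding”, twenty give about 64%, and a hundred give better than 99%.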
The issue is more complex than I’m laying out, but what I’m laying out is also the bare minimum standard for a statistical analysis trying to show a difference worth taking seriously. I’m not even arguing that effects that fail to meet corrected statistical significance should be ignored! But if the effect isn’t statistically significant, there needs to be some other justification for looking at it, such as a large estimated magnitude. This is particularly the case in the social sciences, because the study of humans has an interesting property: everything has effects. When our context is different, we’re different. Make a precise enough measurement, which means running a large enough study, and absolutely every change in human environments will result in statistically significant changes in outcomes!
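The claim about large studies can be sketched with a little arithmetic. The Python below is a simplified illustration, not a real analysis: it assumes two equal-sized groups, an outcome with a standard deviation of 1, a simple two-sample z-test, and an observed difference that lands exactly on a hypothetical true difference of one hundredth of a standard deviation. All of those choices are mine.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_value_for_fixed_effect(effect_in_sd, n_per_group):
    """p-value when the observed difference between two groups equals
    effect_in_sd (in standard deviation units) and each group has n_per_group people."""
    standard_error = math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(effect_in_sd / standard_error)

TINY_EFFECT = 0.01  # hypothetical: one hundredth of a standard deviation

for n in (100, 10_000, 100_000, 1_000_000):
    print(n, p_value_for_fixed_effect(TINY_EFFECT, n))
```

The effect never changes, yet the p-value slides from far above 0.05 with a hundred people per group to below 0.05 at a hundred thousand per group, and to vanishingly small at a million. Significance here reflects the size of the study, not the importance of the effect.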
(I’ll skip a long explanation of why statistical significance in social science isn’t totally useless, given that everything in social science is statistically significant at a high enough resolution. I follow the advice of a number of prominent statisticians in the social sciences that p-values and significance aren’t the best way to look at things, but I also understand that their dominance in fields such as education is hard to overcome. They still have useful properties, provided care is taken in their interpretation.)
This brings us around to a recent study circulating on the use of OER (Open Educational Resources) versus traditional textbooks. This is really, really interesting! And they look at multiple outcomes, like our candy example, which is fantastic! But because they do (at least) five comparisons, and don’t discuss how they did a multiple comparisons correction at all, there’s no way to be confident in their claims of a demonstrated difference. They don’t report enough data to approximate the correction from the paper, either. For three of their five comparisons they report no effect estimates, and for the two they do report, there’s nothing to calibrate them against, so we can’t tell whether the effects are large.
Did they find emotions in dead salmon? Their failure to report enough detail about their methods and results means we can’t be sure, but it seems pretty likely. That undermines their claims of improved results (and increased intensity) with OER, which take up much of their discussion. That’s too bad, because what can reasonably be concluded from the unfortunately limited results and data they present is that OER courses show little to no difference in outcomes compared to ones using purchased textbooks. I think that is a great result, and one that should be spread widely.
Be careful of multiple comparisons when looking at data. They’re one of the most common ways to find spurious results, because, as humans, we search for novelty, and one of the most novel-looking things is a result that isn’t true. For example, compute correlations between a handful of data sets and look at the largest, and you’ve done a huge number of comparisons. Or make a bunch of graphs in a Business Intelligence tool and pick out the most striking one. Or, when running a study, try several ways of analyzing the data until you find one where the effects seem to ‘make sense’.
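The correlation case is easy to try for yourself. Below is a small Python sketch under assumptions of my own: ten data sets of thirty points each, all of them pure random noise with no relationship to one another. It counts the pairwise comparisons and reports the largest correlation found among them.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def largest_noise_correlation(n_series=10, n_points=30, seed=0):
    """Generate unrelated random series, correlate every pair, and return
    the number of pairs along with the largest absolute correlation."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    pairs = list(combinations(range(n_series), 2))
    strongest = max(abs(pearson(series[i], series[j])) for i, j in pairs)
    return len(pairs), strongest

n_pairs, strongest = largest_noise_correlation()
print(n_pairs, "comparisons; largest |r| among pure noise:", round(strongest, 2))
```

Ten data sets means 45 pairs, so “look at the largest” quietly picks the winner of 45 comparisons. Change the seed and the winner changes, which is a hint about how much it means.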
In any of these you’re virtually guaranteed to find at least one result that doesn’t really exist. Is it the one you’re looking at?