It is standard practice to exclude extreme data points before analysis, and there are good reasons for doing so. Response times, for instance, typically contain a few implausibly short or long values that are trimmed before any test is run.
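In R, such trimming is often a one-line filter. The cutoffs below are purely illustrative choices, not recommendations from the original script:

```r
# Made-up response times in milliseconds.
rt <- c(350, 480, 120, 5200, 610, 2950, 40)

# Fixed-bound exclusion: keep only values inside an a priori window.
rt_clean <- rt[rt > 200 & rt < 3000]

# Data-driven exclusion: drop values more than 2.5 SD from the mean.
rt_clean2 <- rt[abs(rt - mean(rt)) < 2.5 * sd(rt)]
```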
A tricky question, however, is how to set the threshold for excluding data points. Exploring several possible thresholds during the analysis is akin to performing multiple comparisons: it artificially increases the chance of finding a low p-value. The simulations below show that this can inflate the false-positive rate considerably.
Take-home message: as a reviewer, if you see a suspiciously convenient exclusion threshold, some skepticism is justified.
[The R script for these simulations is here. Feel free to modify it, e.g., by increasing the number of data points per simulation for more robust results.]
To evaluate the situation, we ran 1,000 simulations. Each simulation followed these steps:

1. Generate data for two conditions from the same distribution (with `rnorm`), so that any apparent difference between conditions is spurious.
2. Run the test once on the full data, with no exclusion, and record its p-value.
3. Re-run the test under each of 25 possible exclusion thresholds and record the minimum of the resulting p-values.
Since the two conditions are meaningless by construction, these simulations should mostly yield non-significant tests: at the significance threshold p < .05, the expected false-positive rate is 5%.
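A minimal sketch of the core of these simulations is shown below. The sample size, the use of `t.test`, and the 25 thresholds (1.5 to 3.9 SD) are assumptions made for illustration; the actual R script may use different settings, so the resulting rates will not match the table exactly.

```r
set.seed(1)
n_sim <- 1000
thresholds <- seq(1.5, 3.9, length.out = 25)   # assumed grid of 25 exclusion thresholds

p_no_excl <- numeric(n_sim)   # one p-value per simulation, no exclusion
p_min     <- numeric(n_sim)   # minimum p-value over the 25 thresholds

for (i in seq_len(n_sim)) {
  x <- rnorm(100)             # condition A: pure noise
  y <- rnorm(100)             # condition B: same distribution, so no true effect
  p_no_excl[i] <- t.test(x, y)$p.value
  p_thr <- sapply(thresholds, function(k) {
    xk <- x[abs(x - mean(x)) < k * sd(x)]   # exclude extreme values in A
    yk <- y[abs(y - mean(y)) < k * sd(y)]   # exclude extreme values in B
    t.test(xk, yk)$p.value
  })
  p_min[i] <- min(p_thr)      # "best" p-value across all exclusion choices
}

mean(p_no_excl < .05)   # should stay close to the nominal 5%
mean(p_min     < .05)   # inflated false-positive rate
```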
| Simulation | Number of false positives | Rate (%) | p-value threshold for a 5% FPR |
|---|---|---|---|
| One p-value per simulation (no exclusion) | 52/1000 | 5.2 | .0477 |
| Minimum of the 25 p-values (25 exclusion thresholds) | 239/1000 | 23.9 | .00923 |
| Minimum of the 25 p-values, extreme values replaced rather than removed | 72/1000 | 7.2 | .0363 |
| Minimum of the 25 p-values, exclusions based on an independent measure | 114/1000 | 11.4 | .0181 |
The results show that freely exploring exclusion thresholds inflates the false-positive rate well above the nominal 5% (here, 23.9% with 25 candidate thresholds). As a rough correction, multiplying the reported p-value by 4 or 5 is often sufficient, and replacing extreme values rather than excluding them keeps the inflation much smaller in the first place (7.2%).
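A sketch of the "replace rather than exclude" rule, assuming replacement means capping values at the chosen bound (one common choice, sometimes called winsorizing; the script may implement it differently):

```r
# Hypothetical replacement rule: instead of removing values beyond k standard
# deviations from the mean, set them to the nearest bound.
replace_extremes <- function(v, k = 2.5) {
  lo <- mean(v) - k * sd(v)
  hi <- mean(v) + k * sd(v)
  pmin(pmax(v, lo), hi)   # out-of-range values are capped at lo or hi
}

x <- rnorm(100)
x_replaced <- replace_extremes(x, 2.5)            # same sample size, extremes capped
x_excluded <- x[abs(x - mean(x)) < 2.5 * sd(x)]   # exclusion shrinks the sample
```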
[The simulations can be modified in the R script by changing the test or the data distribution (e.g., replacing `rnorm`).]