How evil is it to exclude extreme data points without setting an a priori threshold for doing so?

It is standard to exclude extreme data points before analysis, and there are good reasons for doing so. For instance, in response-time data, very short responses often reflect anticipation and very long ones lapses of attention, so neither tells us much about the process under study.

A tricky question, however, is how to set the threshold for excluding data points. Exploring several possible thresholds during the analysis phase is akin to performing multiple comparisons, artificially increasing the chance of finding a low p-value. Fortunately, the simulations below show that the resulting inflation of false positives, while real, is much milder than truly independent comparisons would produce, and it can be corrected for.

Take-home message: as a reviewer, if you see a suspicious exclusion threshold, be wary but not excessively skeptical: in the worst case simulated below, multiplying the reported p-value by five restores the nominal false-positive rate.

Simulations

[The R script for these simulations is here. Feel free to modify it, e.g., by increasing the number of simulations for more stable false-positive-rate estimates.]

To evaluate the situation, we ran 1000 simulations, each following these steps (a minimal R sketch appears after the list):

  1. 40 data points were sampled from a Gaussian distribution.
  2. The first half of these data points were arbitrarily called "condition A" and the second half "condition B".
  3. We collected the p-value from a t-test comparing conditions A and B.
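For concreteness, here is a minimal sketch of one such baseline simulation in R; the function and variable names are ours and need not match the script's:

```r
# One baseline simulation: two meaningless "conditions" drawn from the
# same Gaussian, compared with a t-test.
simulate_once <- function(n = 40) {
  x <- rnorm(n)                 # step 1: n points from a Gaussian
  a <- x[1:(n / 2)]             # step 2: first half  -> "condition A"
  b <- x[(n / 2 + 1):n]         #         second half -> "condition B"
  t.test(a, b)$p.value          # step 3: p-value of the A-vs-B comparison
}

set.seed(1)                     # any seed; for reproducibility only
p <- replicate(1000, simulate_once())
mean(p < .05)                   # empirical false-positive rate, expected ~5%
```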

Because both conditions are drawn from the same distribution, any significant difference is a false positive: at the significance threshold p < .05 we expect a 5% false-positive rate. In the exclusion variants below, the t-test is rerun under each of 25 exclusion possibilities and only the smallest of the 25 p-values is kept, mimicking an analyst who keeps trying thresholds until one "works".

| Simulation | Number of false positives | Rate (%) | p-value threshold for 5% FPR |
|---|---|---|---|
| One core p-value per simulation (no exclusion) | 52/1000 | 5.2 | .0477 |
| Min of all 25 p-values (25 exclusion possibilities) | 239/1000 | 23.9 | .00923 |
| Min of all 25 p-values (25 exclusion possibilities), with replacement | 72/1000 | 7.2 | .0363 |
| Min of all 25 p-values (exclusions based on an independent measure) | 114/1000 | 11.4 | .0181 |
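As an illustration of the worst-case row, here is a hedged sketch of threshold exploration: the t-test is rerun for a grid of exclusion cutoffs and only the smallest p-value is reported. The particular grid of 25 cutoffs (1.0 to 3.4 standard deviations) is our assumption; the original script's 25 exclusion possibilities may be defined differently:

```r
# Worst-case variant (sketch): rerun the t-test under a grid of candidate
# exclusion cutoffs and report only the smallest p-value. The grid of 25
# cutoffs below is an assumption; the script's 25 possibilities may differ.
min_p_over_cutoffs <- function(n = 40, cutoffs = seq(1.0, 3.4, by = 0.1)) {
  x <- rnorm(n)
  g <- rep(c("A", "B"), each = n / 2)       # meaningless condition labels
  ps <- sapply(cutoffs, function(k) {
    keep <- abs(x - mean(x)) <= k * sd(x)   # exclude points beyond k SDs
    nA <- sum(keep & g == "A")
    nB <- sum(keep & g == "B")
    if (nA < 2 || nB < 2) return(NA)        # guard: t-test needs both groups
    t.test(x[keep & g == "A"], x[keep & g == "B"])$p.value
  })
  min(ps, na.rm = TRUE)                     # the "best" p-value the analyst keeps
}

set.seed(1)
p_min <- replicate(1000, min_p_over_cutoffs())
mean(p_min < .05)   # inflated false-positive rate (cf. ~24% in the table)
```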

The results show that freely exploring exclusion thresholds degrades the specificity of the test: the false-positive rate climbs from 5.2% to 23.9% in the worst case. Multiplying the p-value by 4 or 5 is therefore often a sufficient correction, and replacing extreme values rather than filtering them out is the better practice (7.2% versus 23.9% false positives); a sketch of the replacement approach follows.
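One common way to replace rather than exclude is winsorizing, i.e., clipping extreme values to the cutoff. We assume this is what the "with replacement" row does; the original script may instead substitute, say, the sample mean:

```r
# Replacement variant (sketch): pull extreme points back to the cutoff
# (winsorizing) instead of dropping them. Assumption: this matches the
# table's "with replacement" row; the original script may differ.
winsorize <- function(x, k = 2) {
  lo <- mean(x) - k * sd(x)
  hi <- mean(x) + k * sd(x)
  pmin(pmax(x, lo), hi)   # values beyond k standard deviations are clipped
}
```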

Further Questions

[Corresponding simulations can be modified in the R script by changing the test or data distribution (e.g., replacing "rnorm").]
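For instance, the baseline harness can be parameterized over the sampling distribution to probe skewed data; the names below are illustrative, not the script's:

```r
# Sketch for further exploration: pass the sampler in as an argument so
# the same harness can test other distributions (or, similarly, other tests).
simulate_once_with <- function(sampler = rnorm, n = 40) {
  x <- sampler(n)
  t.test(x[1:(n / 2)], x[(n / 2 + 1):n])$p.value
}

set.seed(1)
p_exp <- replicate(1000, simulate_once_with(rexp))  # skewed, exponential data
mean(p_exp < .05)                                   # false-positive rate under skew
```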
