How evil is it to exclude extreme data points without setting an a priori threshold for doing so?

It is standard to exclude extreme datapoints from an analysis, and there are good reasons for doing so. For instance, if we look at response times, an absurdly long response presumably reflects a lapse of attention or a distraction rather than the process under study, so keeping it would mostly add noise.

A tricky question, however, is: how do we set the threshold for excluding datapoints? In this note we worry that exploring several possible thresholds during the analysis phase may be the same sin as multiple comparisons: by taking multiple perspectives on the data set, we artificially increase the chance of finding a low p-value. Fortunately, simulations show that it does not screw up the p-values too much (more precisely, the rate of false positives does not increase dramatically).

Take-home message: as a reviewer, if you see suspicious thresholds, don't be overly skeptical.

Simulations

[The R script for these simulations is here; I would be happy to see the results for a higher number of data points per simulation, that is, replacing 'Part <- 40' with 'Part <- 80' for instance, so that more values are excluded (although the proportion excluded would stay the same).]

To evaluate the situation, we ran 1000 simulations. Each of these simulations was as follows (a minimal R sketch is given after the list):

  (a) 40 datapoints were sampled from a Gaussian distribution.
  (b) The first half of these datapoints was arbitrarily labeled "condition A" and the second half "condition B".
  (c) We collected the p-value from a t-test comparing conditions A and B.
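
As a point of reference, here is a minimal sketch of this core simulation in R. The variable name Part is taken from the script quoted above; everything else is my own naming and not necessarily that of the original script.

    ## Core simulation: 1000 fake "experiments" with no real effect
    set.seed(1)
    n_sim <- 1000
    Part  <- 40                              # datapoints per simulation

    p_core <- replicate(n_sim, {
      x     <- rnorm(Part)                   # (a) sample from a Gaussian
      condA <- x[1:(Part / 2)]               # (b) arbitrary split into two "conditions"
      condB <- x[(Part / 2 + 1):Part]
      t.test(condA, condB)$p.value           # (c) p-value of the t-test
    })

    mean(p_core < .05)                       # empirical false positive rate, expected around 5%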

In an ideal world, such simulations, in which the conditions are meaningless, should yield non-significant tests. But of course, running the above 1000 times, we expect about 5% (false) positive outcomes at the significance threshold p < .05.
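
How much the observed count can fluctuate around 50 purely by chance can be checked directly, since under the null the number of false positives follows a Binomial(1000, .05) distribution (a one-line check, not part of the original script):

    ## Central 95% range of the number of false positives expected by chance
    qbinom(c(.025, .975), size = 1000, prob = .05)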

What happens to this number of false positives if we allow ourselves to ignore part of the data? Let's say we investigate several RT thresholds beyond which datapoints are considered outliers. To assess this, we ran three more tests on the same simulated data (a rough R sketch of these variants follows the list):

  1. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained by excluding 0%, 2.5%, 5%, 7.5% or 10% of the data at the upper end, crossed with the same proportions at the lower end (5 x 5 = 25 combinations, including the original no-exclusion test).
  2. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained as above but with replacement rather than exclusion: datapoints beyond a cutoff were not excluded but replaced with the cutoff value (i.e., winsorized).
  3. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained as above but with the exclusion criterion based on another, independently simulated set of values. This would correspond to a case in which you analyze some kind of response but exclude datapoints based on response times.
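
Here is a rough R sketch of these three variants. It follows the description above but is my own implementation, so the details (function names, the way cutoffs are computed) are not necessarily those of the original script.

    ## Exclusion proportions explored at each end of the distribution
    props <- c(0, .025, .05, .075, .10)

    ## Minimum of the 25 p-values obtained by crossing lower and upper cutoffs.
    ## 'crit' is the variable on which the cutoffs are set (by default the
    ## analyzed variable itself; pass an independent vector for variant 3).
    min_p <- function(x, cond, crit = x, method = c("exclude", "winsorize")) {
      method <- match.arg(method)
      ps <- c()
      for (lo in props) for (hi in props) {
        lo_cut <- quantile(crit, lo)         # lower cutoff
        hi_cut <- quantile(crit, 1 - hi)     # upper cutoff
        y    <- x
        keep <- rep(TRUE, length(x))
        if (method == "exclude") {           # variants 1 and 3: drop extreme points
          keep <- crit >= lo_cut & crit <= hi_cut
        } else {                             # variant 2: clamp them to the cutoff
          y[crit < lo_cut] <- lo_cut
          y[crit > hi_cut] <- hi_cut
        }
        ps <- c(ps, t.test(y[keep & cond == "A"],
                           y[keep & cond == "B"])$p.value)
      }
      min(ps)
    }

    ## Example: false positive rate of the exclusion variant (variant 1)
    set.seed(2)
    p_excl <- replicate(1000, {
      x    <- rnorm(40)
      cond <- rep(c("A", "B"), each = 20)
      min_p(x, cond, method = "exclude")
    })
    mean(p_excl < .05)

    ## Variant 3 would pass an independent criterion, e.g.
    ## min_p(x, cond, crit = rnorm(40), method = "exclude")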

These manipulations should mechanically increase the rate of false positives, since the target p-value is replaced with a lower value based not on the result of a single t-test but on the results of 25 tests (including the original one). The rates of false positives, as well as the p-value significance threshold needed to obtain a false positive rate (FPR) of 5%, are given in the following table:

Simulation                                                            Number of false positives   p-value for a 5% FPR
0. Core p-value                                                        52/1000  (5.2%)            .0477
1. Min of 25 p-values with exclusion                                  239/1000 (23.9%)            .00923
2. Min of 25 p-values with replacement                                 72/1000  (7.2%)            .0363
3. Min of 25 p-values with exclusion based on an independent measure  114/1000 (11.4%)            .0181
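
The two right-hand columns of the table can be recomputed from any vector of simulated p-values, for instance with the p_excl vector from the sketch above (the exact numbers will differ from the table, since the sketch is not the original script):

    sum(p_excl < .05)       # number of false positives out of 1000
    mean(p_excl < .05)      # false positive rate
    quantile(p_excl, .05)   # p-value threshold that would yield a 5% FPR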

Overall, the results of these simulations show that exploring several thresholds extensively does not reduce the specificity of the test dramatically. In particular, even though the worst case rests on 25 repeated tests, multiplying the p-value by 4 or 5 is a sufficient correction (far less than the factor of 25 a full Bonferroni correction would demand), and a factor of 2 or 3 is already a rather safe option for the milder variants (replacement, or exclusion based on an independent measure).

Hence, even a wild exploration of possible thresholds (who would really go as far as systematically trying 4 or 5 thresholds on each side and on both?) does not increase the rate of false positives unreasonably. Intuitively, the reason is that excluding a couple of datapoints does not alter the core of the data (at least in a Gaussian world), especially when the excluded datapoints are extreme ones, which presumably belong to a rare class of events observed by chance.

Further questions

[Note: Corresponding simulations should be easy to run with small modifications of the R script: change the test and/or the distribution from which the data are sampled (the "rnorm" command); a couple of possibilities are sketched below.]
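
For instance, here is a sketch of two such modifications; the specific distribution and test are my own choices, meant only as illustrations.

    ## Skewed, RT-like data instead of Gaussian data...
    x    <- rlnorm(40, meanlog = 6, sdlog = .3)   # log-normal "response times" in ms
    cond <- rep(c("A", "B"), each = 20)

    ## ...and a non-parametric test instead of the t-test
    wilcox.test(x[cond == "A"], x[cond == "B"])$p.value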