How evil is it to exclude extreme data points without setting an a priori threshold for doing so?

It is standard to exclude extreme datapoints from an analysis, and there are good reasons for doing so. For instance, if we look at response times, an absurdly long response presumably reflects a lapse of attention or a distraction rather than the process under study, so keeping it would mostly add noise.

A tricky question, however, is: how do we set the threshold for excluding datapoints? In this note we worry that exploring several possible thresholds during the analysis phase may be the same sin as multiple comparisons: by taking multiple perspectives on the data set, we artificially increase the chance of finding a low p-value. Fortunately, simulations show that it does not screw up the p-values too much (more precisely, the rate of false positives does not increase dramatically).

Take-home message: as a reviewer, if you see suspicious thresholds, don't be overly skeptical.

Simulations

[The R script for these simulations is here; I would be happy to see the results for a higher number of data points per simulation, that is, replacing 'Part <- 40' with 'Part <- 80' for instance, so that more values are excluded (although the proportion excluded would stay the same).]

To evaluate the situation, we ran 1000 simulations. Each of these simulations was as follows (a minimal R sketch is given after the list):

  (a) 40 datapoints were sampled from a Gaussian distribution.
  (b) The first half of these datapoints was arbitrarily labeled "condition A" and the second half "condition B".
  (c) We collected the p-value from a t-test comparing conditions A and B.
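
As a point of reference, here is a minimal sketch of this core simulation in R. The variable name Part is taken from the script quoted above; everything else is my own naming and not necessarily that of the original script.

    ## Core simulation: 1000 fake "experiments" with no real effect
    set.seed(1)
    n_sim <- 1000
    Part  <- 40                              # datapoints per simulation

    p_core <- replicate(n_sim, {
      x     <- rnorm(Part)                   # (a) sample from a Gaussian
      condA <- x[1:(Part / 2)]               # (b) arbitrary split into two "conditions"
      condB <- x[(Part / 2 + 1):Part]
      t.test(condA, condB)$p.value           # (c) p-value of the t-test
    })

    mean(p_core < .05)                       # empirical false positive rate, expected around 5%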

In an ideal world, such simulations, in which the conditions are meaningless, should yield non-significant tests. But of course, running the above 1000 times, we expect about 5% (false) positive outcomes at the significance threshold p < .05.
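
How much the observed count can fluctuate around 50 purely by chance can be checked directly, since under the null the number of false positives follows a Binomial(1000, .05) distribution (a one-line check, not part of the original script):

    ## Central 95% range of the number of false positives expected by chance
    qbinom(c(.025, .975), size = 1000, prob = .05)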

What happens to this number of false positives if we allow ourselves to ignore part of the data? Let's say we investigate several RT thresholds beyond which datapoints are considered outliers. To assess this, we ran three more tests on the same simulated data (a rough R sketch of these variants follows the list):

  1. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained by excluding 0%, 2.5%, 5%, 7.5% or 10% of the data at the upper end, crossed with the same proportions at the lower end (5 x 5 = 25 combinations, including the original no-exclusion test).
  2. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained as above but with replacement rather than exclusion: datapoints beyond a cutoff were not excluded but replaced with the cutoff value (i.e., winsorized).
  3. We replaced the p-value of the t-test in (c) with the minimum of 25 p-values, obtained as above but with the exclusion criterion based on another, independently simulated set of values. This would correspond to a case in which you analyze some kind of response but exclude datapoints based on response times.
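
Here is a rough R sketch of these three variants. It follows the description above but is my own implementation, so the details (function names, the way cutoffs are computed) are not necessarily those of the original script.

    ## Exclusion proportions explored at each end of the distribution
    props <- c(0, .025, .05, .075, .10)

    ## Minimum of the 25 p-values obtained by crossing lower and upper cutoffs.
    ## 'crit' is the variable on which the cutoffs are set (by default the
    ## analyzed variable itself; pass an independent vector for variant 3).
    min_p <- function(x, cond, crit = x, method = c("exclude", "winsorize")) {
      method <- match.arg(method)
      ps <- c()
      for (lo in props) for (hi in props) {
        lo_cut <- quantile(crit, lo)         # lower cutoff
        hi_cut <- quantile(crit, 1 - hi)     # upper cutoff
        y    <- x
        keep <- rep(TRUE, length(x))
        if (method == "exclude") {           # variants 1 and 3: drop extreme points
          keep <- crit >= lo_cut & crit <= hi_cut
        } else {                             # variant 2: clamp them to the cutoff
          y[crit < lo_cut] <- lo_cut
          y[crit > hi_cut] <- hi_cut
        }
        ps <- c(ps, t.test(y[keep & cond == "A"],
                           y[keep & cond == "B"])$p.value)
      }
      min(ps)
    }

    ## Example: false positive rate of the exclusion variant (variant 1)
    set.seed(2)
    p_excl <- replicate(1000, {
      x    <- rnorm(40)
      cond <- rep(c("A", "B"), each = 20)
      min_p(x, cond, method = "exclude")
    })
    mean(p_excl < .05)

    ## Variant 3 would pass an independent criterion, e.g.
    ## min_p(x, cond, crit = rnorm(40), method = "exclude")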

These manipulations should mechanically increase the rate of false positives, since the target p-value is replaced with a lower value based not on the result of a single t-test but on the results of 25 tests (including the original one). The rates of false positives, as well as the p-value significance threshold needed to obtain a false positive rate (FPR) of 5%, are given in the following table:

Simulation                                                            Number of false positives   p-value for a 5% FPR
0. Core p-value                                                        52/1000  (5.2%)            .0477
1. Min of 25 p-values with exclusion                                  239/1000 (23.9%)            .00923
2. Min of 25 p-values with replacement                                 72/1000  (7.2%)            .0363
3. Min of 25 p-values with exclusion based on an independent measure  114/1000 (11.4%)            .0181
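
The two right-hand columns of the table can be recomputed from any vector of simulated p-values, for instance with the p_excl vector from the sketch above (the exact numbers will differ from the table, since the sketch is not the original script):

    sum(p_excl < .05)       # number of false positives out of 1000
    mean(p_excl < .05)      # false positive rate
    quantile(p_excl, .05)   # p-value threshold that would yield a 5% FPR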

Overall, the results of these simulations show that exploring several thresholds extensively does not reduce the specificity of the test dramatically. In particular, even though the worst case rests on 25 repeated tests, multiplying the p-value by 4 or 5 is a sufficient correction (far less than the factor of 25 a full Bonferroni correction would demand), and a factor of 2 or 3 is already a rather safe option for the milder variants (replacement, or exclusion based on an independent measure).

Hence, even a wild exploration of possible thresholds (who would really go as far as systematically trying 4 or 5 thresholds on each side and on both?) does not increase the rate of false positives unreasonably. Intuitively, the reason is that excluding a couple of datapoints does not alter the core of the data (at least in a Gaussian world), especially when the excluded datapoints are extreme ones, which presumably belong to a rare class of events observed by chance.

Further questions

[Note: Corresponding simulations should be easy to run with small modifications of the R script: change the test and/or the distribution from which the data are sampled (the "rnorm" command); a couple of possibilities are sketched below.]
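
For instance, here is a sketch of two such modifications; the specific distribution and test are my own choices, meant only as illustrations.

    ## Skewed, RT-like data instead of Gaussian data...
    x    <- rlnorm(40, meanlog = 6, sdlog = .3)   # log-normal "response times" in ms
    cond <- rep(c("A", "B"), each = 20)

    ## ...and a non-parametric test instead of the t-test
    wilcox.test(x[cond == "A"], x[cond == "B"])$p.value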