It is standard to exclude extreme datapoints from an analysis, and there are good reasons for doing so. For instance, if we look at response times:
Take home message: as a reviewer, if you see suspicious thresholds, don't be overly skeptical.
[The R script for these simulations is here. I would be happy to see the results for a higher number of data points per simulation, that is, replacing 'Part <- 40' with 'Part <- 80' for instance, so that more values are excluded (although the same proportion would be excluded).]
To evaluate the situation, we ran 1000 simulations. Each simulation proceeded as follows:
In an ideal world, such simulations, in which the conditions are meaningless, should yield a non-significant test. But of course, running the above 1000 times, we expect about 5% of the outcomes to be (false) positives at the significance threshold p<.05.
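As a quick sanity check on that 5% figure, the number of false positives among 1000 independent null simulations follows a binomial distribution, so we can compute how much it should fluctuate by chance. The function name below is only illustrative and is not taken from the R script.

```python
import math


def null_false_positive_summary(n_sims: int, alpha: float) -> tuple[float, float]:
    """Expected count and standard deviation of false positives when the
    null hypothesis is true in every one of n_sims independent simulations."""
    expected = n_sims * alpha
    sd = math.sqrt(n_sims * alpha * (1 - alpha))  # binomial standard deviation
    return expected, sd


expected, sd = null_false_positive_summary(1000, 0.05)
print(f"expected false positives: {expected:.0f} +/- {sd:.1f}")
```

With 1000 simulations, this gives about 50 false positives with a standard deviation of about 7, so a count such as the 52/1000 reported for the core p-value in the table is well within ordinary sampling noise.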
What happens to this number of false positives if we allow ourselves to ignore part of the data? Let's say we investigate several RT thresholds beyond which we consider that the datapoints are outliers. To assess this, we ran three more tests on the same simulated data:
These manipulations should mechanically increase the rate of false positives, since the target p-value is replaced with a lower value based not on the result of a single t-test, but on the results of 25 tests (including the original one). The rates of false positives, as well as the p-value significance threshold needed to obtain a false positive rate (FPR) of 5%, are given in the following table:
Simulation | Number of false positives | False positive rate | p-value threshold for a 5% FPR |
---|---|---|---|
0. Core p-value | 52/1000 | 5.2% | .0477 |
1. Min of 25 p-values with exclusion | 239/1000 | 23.9% | .00923 |
2. Min of 25 p-values with replacement | 72/1000 | 7.2% | .0363 |
3. Min of 25 p-values with exclusion based on an independent measure | 114/1000 | 11.4% | .0181 |
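To put the rates in the table in perspective, it helps to compare them with what 25 fully independent tests would give. The short calculation below makes that comparison; the independence scenario is a hypothetical reference point, not something that was simulated, and the dictionary keys are just labels for the table rows.

```python
def fpr_if_independent(alpha: float, n_tests: int) -> float:
    """Chance of at least one p < alpha among n_tests independent null tests."""
    return 1 - (1 - alpha) ** n_tests


bound = fpr_if_independent(0.05, 25)

# False positive rates taken from the table above.
observed = {
    "exclusion": 0.239,
    "replacement": 0.072,
    "exclusion on independent measure": 0.114,
}

print(f"25 independent tests: {bound:.1%}")
for name, rate in observed.items():
    print(f"{name}: {rate:.1%}")
```

Independent tests would push the false positive rate above 70%. The observed rates are far lower because the 25 tests are run on almost the same data and are therefore strongly correlated.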
Overall, the results of these simulations show that exploring several thresholds extensively does not reduce the specificity of the test dramatically. In particular, even though it is based on 25 repeated tests, multiplying the p-value by 4 or 5 is a roughly sufficient correction, and 2 or 3 is already a rather safe option.
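The multipliers can be read directly off the last column of the table: the factor by which a p-value would need to be multiplied is the nominal 5% divided by the threshold that restores a 5% FPR. Here is that arithmetic, with a Bonferroni factor of 25 shown for comparison (Bonferroni is not used in the simulations; it is only a familiar reference point).

```python
ALPHA = 0.05
N_TESTS = 25  # Bonferroni would multiply each p-value by this factor

# p-value thresholds giving a 5% FPR, copied from the table above.
thresholds = {
    "core": 0.0477,
    "exclusion": 0.00923,
    "replacement": 0.0363,
    "exclusion on independent measure": 0.0181,
}

# Factor by which a p-value must be multiplied to stay at a 5% FPR.
factors = {name: ALPHA / threshold for name, threshold in thresholds.items()}

for name, factor in factors.items():
    print(f"{name}: multiply p by {factor:.2f} (Bonferroni: {N_TESTS})")
```

The largest factor, for plain exclusion, comes out at about 5.4. The other two procedures need factors of about 1.4 and 2.8, all far below the Bonferroni factor of 25.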
Hence, even wild exploration of possible thresholds (who would do such a thing as systematically trying 4 or 5 thresholds on each side, in every combination?) does not increase the rate of false positives unreasonably. Intuitively, excluding a couple of data points does not alter the core of the data (at least in a Gaussian world), especially if we exclude extreme data points, which presumably belong to a rare class of events that were observed by chance.
[Note: Corresponding simulations should be easy to run with small modifications of the R script: change the test and/or the distribution from which the data is sampled (the "rnorm" command).]
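For readers without R at hand, here is a rough Python sketch of this kind of simulation. It is not a port of the R script, and several things in it are assumptions of mine: two conditions of 40 values each, cutoffs expressed in pooled standard deviations with five options per side (hence 25 combinations, the first being no exclusion), and a normal approximation in place of the t-test so that only the standard library is needed. The `draw` parameter plays the role of the "rnorm" call and can be swapped for another distribution.

```python
import itertools
import random
from statistics import NormalDist

CUTOFFS = (None, 3.0, 2.5, 2.0, 1.5)  # assumed cutoffs in SD units; None = no exclusion


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation,
    used here instead of a t-test)."""
    (ma, va), (mb, vb) = mean_var(a), mean_var(b)
    z = (ma - mb) / (va / len(a) + vb / len(b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def trim(values, mean, sd, low, high):
    """Drop values more than low SDs below or high SDs above the mean."""
    lo = float("-inf") if low is None else mean - low * sd
    hi = float("inf") if high is None else mean + high * sd
    return [v for v in values if lo <= v <= hi]


def simulate(n_sims=1000, n=40, alpha=0.05,
             draw=lambda rng: rng.gauss(0.0, 1.0), seed=1):
    """Return (core FPR, FPR of the minimum p-value over all cutoff pairs)."""
    rng = random.Random(seed)
    core_hits = min_hits = 0
    for _ in range(n_sims):
        # Both conditions come from the same distribution: the null is true.
        a = [draw(rng) for _ in range(n)]
        b = [draw(rng) for _ in range(n)]
        mean, var = mean_var(a + b)
        sd = var ** 0.5
        p_values = [
            p_value(trim(a, mean, sd, low, high), trim(b, mean, sd, low, high))
            for low, high in itertools.product(CUTOFFS, CUTOFFS)
        ]
        core_hits += p_values[0] < alpha  # first pair is (None, None)
        min_hits += min(p_values) < alpha
    return core_hits / n_sims, min_hits / n_sims


core_rate, min_rate = simulate()
print(f"core FPR: {core_rate:.3f}, min-of-25 FPR: {min_rate:.3f}")
```

Because the design differs from the original, the numbers it prints should not be expected to match the table. To try a skewed distribution, for example, one could call `simulate(draw=lambda rng: rng.expovariate(1.0))`.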