Preprocessing and cross-validation

From http://www.eigenvector.com/faq/index.php?id=153:

In general, preprocessing should be done inside the cross-validation routine. If you preprocess outside of the cross-validation algorithm (before calling crossval), you will bias the cross-validation results and likely overfit your model. The reason is that the preprocessing will be based on the ENTIRE set of data, but the cross-validation’s validity REQUIRES that the preprocessing be based ONLY on specific subsets of the data. Why? Read on:

Cross-validation splits your data up into “n” subsets (let’s say 3 for simplicity). Say you have 12 samples and you’re only doing mean centering as your preprocessing (again, for simplicity). Cross-validation is going to take your 12 samples and split them into 3 groups (4 samples in each group).
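
As a concrete picture of that split, here is a minimal sketch using scikit-learn’s KFold; the 12 one-variable samples are made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(12, 1).astype(float)  # 12 samples, one made-up variable

# KFold with 3 splits leaves out one group of 4 samples at a time.
for fold, (cal_idx, val_idx) in enumerate(KFold(n_splits=3).split(X), start=1):
    print(f"fold {fold}: calibration={cal_idx}, validation={val_idx}")
# fold 1: calibration=[ 4  5  6  7  8  9 10 11], validation=[0 1 2 3]
# fold 2: calibration=[ 0  1  2  3  8  9 10 11], validation=[4 5 6 7]
# fold 3: calibration=[ 0  1  2  3  4  5  6  7], validation=[8 9 10 11]
```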

In each cycle of the cross-validation, the algorithm leaves out one of those 3 groups (4 samples, the “validation set”) and does both the preprocessing and the model building from the remaining 8 samples (the “calibration set”). Recall that the preprocessing step here is to calculate the mean of the data and subtract it. The algorithm then applies the preprocessing and model to the 4-sample validation set and looks at the error (and repeats this for each of the 3 groups). Here, applying the preprocessing means taking the mean calculated from the 8 calibration samples and subtracting it from the 4 left-out samples.
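
A rough numpy sketch of that per-fold procedure (the variable names and data are illustrative, not the actual crossval internals):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))  # 12 samples, 5 made-up variables

for cal_idx, val_idx in KFold(n_splits=3).split(X):
    X_cal, X_val = X[cal_idx], X[val_idx]
    mean = X_cal.mean(axis=0)        # mean computed from the 8 calibration samples only
    X_cal_centered = X_cal - mean    # center the calibration set
    X_val_centered = X_val - mean    # apply that SAME 8-sample mean to the 4 left-out samples
    # ...fit the model on X_cal_centered, then evaluate it on X_val_centered...
```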

That last part is the key to why preprocessing BEFORE crossval is bad: when preprocessing is done INSIDE cross-validation (as it should be), the mean is calculated from the 8 samples that were left in and subtracted from them, and that same 8-sample mean is also subtracted from the 4 samples left out by cross-validation. If you mean-center BEFORE cross-validation, however, the mean is calculated from all 12 samples. Even though the rules of cross-validation say that the preprocessing (the mean) and the model must be calculated from only the calibration set, doing the preprocessing outside of cross-validation lets every sample, including the ones that are supposed to be held out, influence the preprocessing. In other words, the validation samples leak into the calibration step.
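
To see the two orderings side by side, here is a hypothetical scikit-learn sketch; StandardScaler (mean-centering plus unit-variance scaling) stands in for plain mean-centering, and Ridge is just an arbitrary model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=12, n_features=5, noise=1.0, random_state=0)
cv = KFold(n_splits=3)

# WRONG: the scaler sees all 12 samples before cross-validation ever splits them.
X_leaky = StandardScaler().fit_transform(X)
scores_leaky = cross_val_score(Ridge(), X_leaky, y, cv=cv)

# RIGHT: the pipeline refits the scaler inside each fold on the 8 calibration
# samples only, then applies those statistics to the 4 validation samples.
scores_ok = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X, y, cv=cv)
```

Putting the preprocessing step inside a Pipeline is exactly what makes cross_val_score re-estimate it per fold; that is the generic way to get the “inside cross-validation” behavior the FAQ describes.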

With mean-centering, the effect isn’t as bad as it is for something like GLSW or OSC. These “multivariate filters” are far stronger preprocessing methods, and operating on the entire data set can have a significant influence on the covariance (read: a much bigger “cheating” effect, and thus more overfitting). The one time it doesn’t matter is when the preprocessing methods are “row-wise” only; methods that operate on each sample independently are not a problem. Methods like smoothing, derivatives, baselining, or normalization (other than MSC when it is based on the mean) operate on each sample independently, so adding or removing samples from the data set has no effect on the others (see the sketch below). In fact, to save time, our cross-validation routine recognizes when row-wise operations come first in the preprocessing sequence and does them outside of the cross-validation loop. The only time you can’t do these in advance is when a non-row-wise method happens prior to the row-wise method.
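
To see why row-wise methods are harmless outside cross-validation, here is a small sketch using an SNV-style per-sample normalization (chosen here just as an example of a row-wise transform):

```python
import numpy as np

def snv(X):
    # Row-wise standard normal variate: each sample is centered and scaled
    # using only its OWN mean and standard deviation, never the other rows.
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))

# Because the transform is row-wise, removing samples doesn't change the rest,
# so applying it before cross-validation leaks nothing across folds.
assert np.allclose(snv(X)[:8], snv(X[:8]))
```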


This was originally published here: https://calvinmccarter.wordpress.com/2015/08/19/preprocessing-and-cross-validation/

#ML