A note on overfitting in prediction
A typical case of overfitting in data analysis is that PCA seems promising in distinguishing cases from controls. The separation, however, is not a true signal but overfitting, which creeps in at feature preselection.
The data are simulated as below:
1 \(M=10000\) markers were sampled with allele frequencies drawn uniformly between 0.2 and 0.8, for \(N=200\) samples.
2 \(Y=0\) or \(1\) is randomly assigned to each sample.
So there is no true association between markers and phenotype.
In marker preselection, a simple regression of \(Y\) on each marker is conducted, and markers are ranked by their p-values. By applying a p-value cutoff, say \(<0.01\), \(<0.05\), etc., different numbers of markers are retained, and PCA is conducted again on the retained markers only. Then, “cases” and “controls” can be separated, as the code below demonstrates.
\(R^2=\frac{1}{1+n/m}\), which is close to 1 when \(m\) grows far greater than \(n\) (here \(m\) is the number of selected markers and \(n\) the sample size).
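As a quick numeric illustration of the formula (the snippet below only evaluates the expression itself, under the reading of \(m\) and \(n\) above):

# R^2 = 1/(1 + n/m) for n = 200 and increasing m
n = 200
m = c(20, 200, 2000, 20000)
round(1 / (1 + n/m), 3)  # 0.091 0.500 0.909 0.990; approaches 1 as m >> n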
# simulation
M = 10000  # number of markers
N = 200    # sample size
P = runif(M, 0.2, 0.8)  # ancestral allele frequencies
G = matrix(0, N, M)
for (i in 1:N) {
  G[i,] = rbinom(M, 2, P)  # genotypes sampled under Hardy-Weinberg
}
Y = rbinom(N, 1, 0.5)  # phenotype assigned at random: no true association

Gs = apply(G, 2, scale)  # standardize each marker
GG = Gs %*% t(Gs) / M    # genetic relationship matrix
EigenG = eigen(GG)
plot(EigenG$vectors[,1], EigenG$vectors[,2], main="all snps", col=(Y+1), cex=0.5, pch=16)

# preselection: simple regression of Y on each marker
para = matrix(0, M, 2)
for (i in 1:M) {
  model = lm(Y ~ Gs[,i])
  para[i,1] = summary(model)$coefficients[2,3]^2  # squared t statistic
  para[i,2] = summary(model)$coefficients[2,4]    # p-value
}
hist(para[,2])  # p-values are uniform under the null

pcut = c(0.001, 0.01, 0.05, 0.5, 0.95, 1)  # p-value cutoffs to try
layout(matrix(1:6, 3, 2))
for (i in 1:length(pcut)) {
  idx = which(para[,2] < pcut[i])    # markers passing the cutoff
  Gs1 = Gs[, idx]
  GG1 = Gs1 %*% t(Gs1) / ncol(Gs1)   # relationship matrix from selected markers only
  eG1 = eigen(GG1)
  plot(eG1$vectors[,1], eG1$vectors[,2], col=(Y+1), cex=0.5, pch=16, main=paste("p-cut", pcut[i]))
}
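Two follow-up checks, using the objects defined above. First, the number of markers passing each cutoff is roughly \(M \times\) cutoff; that is, preselection under the null retains purely random markers:

# under the null, about M * pcut markers pass each cutoff
cbind(pcut, expected = M * pcut, observed = sapply(pcut, function(p) sum(para[,2] < p)))

Second, a minimal sketch of the remedy (the split into the first and last 100 samples is an assumption for illustration): if markers are preselected on one half of the samples only, the held-out half shows no case-control separation in the PCA plot.

# preselect markers on training samples only, then do PCA on the held-out samples
train = 1:100
test = 101:200
p.tr = apply(Gs[train,], 2, function(g) summary(lm(Y[train] ~ g))$coefficients[2,4])
idx = which(p.tr < 0.05)
Gs2 = Gs[test, idx]
GG2 = Gs2 %*% t(Gs2) / ncol(Gs2)
eG2 = eigen(GG2)
plot(eG2$vectors[,1], eG2$vectors[,2], col=(Y[test]+1), cex=0.5, pch=16, main="preselect on train only")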