A simulation-based look at how the percentiles for components (e.g. SATM and SATCR) correspond to the percentiles for composites (e.g. SATM + SATCR). Motivated by the discussion at "Harvard admissions and meritocracy".
This analysis is grossly oversimplified (no floor/ceiling, no discretization, and it ignores the selected nature of admissions SAT scores), but hopefully it is helpful for reasoning about the problem.
Based on the simulation at "Likelihood Estimates for Outcome Rank Given Correlated Variable".
Started at 2014-09-20 08:50:23
First we generate a simulated distribution. For simplicity we use mean = 0 and sd = 1 for both components. The correlation is set to 0.5, rounding the value of 0.485 for the correlation of SATM and SATV given in Table 3 - correlation matrix for HSGPA, SATW, SATV, SATM, and AP Credit.
# These variables define our distribution
Xmean <- 0
Ymean <- 0
Xsd <- 1
Ysd <- 1
XYcor <- 0.5
# Derived variables
XYmean <- c(Xmean, Ymean)
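# Build the covariance from the sds and correlation (identical to the unit-variance case when both sds are 1)
XYcov <- matrix(c(Xsd^2, XYcor * Xsd * Ysd,
                  XYcor * Xsd * Ysd, Ysd^2), nrow=2, ncol=2)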
# Create our multivariate normal distribution using mvrnorm in MASS
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.0.3
# Make this reproducible
set.seed(42)
# Don't make sample too large here, car scatterplot can be slow
examplePopSize <- 1e5
myDrawsExample <- mvrnorm(examplePopSize, mu=XYmean, Sigma=XYcov)
myDrawsExample <- data.frame(myDrawsExample)
colnames(myDrawsExample) <- c("X", "Y")
# Optionally allow truncating the distribution
# Here we use +/- 3 SD which is close to the SAT floor/ceiling
truncate <- FALSE
if (truncate) {
  myDrawsExample$X <- pmax(-3, pmin(3, myDrawsExample$X))
  myDrawsExample$Y <- pmax(-3, pmin(3, myDrawsExample$Y))
}
myDrawsExample$XYsum <- myDrawsExample$X + myDrawsExample$Y
# Check the mean, sd, and cor
apply(myDrawsExample, 2, mean)
##         X         Y     XYsum
## -0.002828 -0.004319 -0.007147
apply(myDrawsExample, 2, sd)
##     X     Y XYsum
## 1.004 1.002 1.738
cor(myDrawsExample)
##            X      Y  XYsum
## X     1.0000 0.5007 0.8666
## Y     0.5007 1.0000 0.8659
## XYsum 0.8666 0.8659 1.0000
# Plot the distribution
require(car)
## Loading required package: car
## Warning: package 'car' was built under R version 3.0.3
scatterplot(Y ~ X, data=myDrawsExample, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Example Distribution with Correlation =", XYcor))
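As a sanity check on the simulated sd and correlation values above, the standard deviation of the sum and its correlation with each component have simple closed forms when both components have unit variance. The following is a minimal sketch under that assumption, with the same correlation of 0.5; the names `rho`, `sdSum`, and `corComponentSum` are illustrative and not part of the original analysis.

```r
# Assumed inputs: unit-variance components with correlation rho
rho <- 0.5

# Var(X + Y) = Var(X) + Var(Y) + 2 * Cov(X, Y) = 2 + 2 * rho
sdSum <- sqrt(2 + 2 * rho)

# Cov(X, X + Y) = 1 + rho, so cor(X, X + Y) = (1 + rho) / sdSum
corComponentSum <- (1 + rho) / sdSum

print(round(c(sdSum = sdSum, corComponentSum = corComponentSum), 4))
```

These come out at roughly 1.73 and 0.87, in line with the simulated values of 1.738 and about 0.866.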
Compare component and composite thresholds for different percentiles. Notice the symmetry of the simplified case as the percentile varies around 50%.
The difference between the sum of the two component thresholds and the composite threshold (in component SDs) is given for each case.
The important things to notice are:
- The percentiles for the composite value (red diagonals) are not the same as the sum of the percentiles for the component values (blue diagonals).
- This effect is strongest at the extremes (e.g. 1st and 99th percentiles) and weakest near the mean.
- This effect is symmetric around the 50th percentile with the composite threshold always being closer to the mean.
I also tried this with a ceiling/floor at +/- 3 SD, and it did not make a significant difference in the results. This does NOT, however, capture the effect of the SAT ceiling on the college admissions percentiles.
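The ceiling/floor comparison can be sketched without the plotting code. This is a rough illustration, not the original analysis: it generates the correlated pair with base R (`Y = rho * X + sqrt(1 - rho^2) * Z`) instead of `mvrnorm`, and the helper names `quantileGap` and `clamp` are hypothetical.

```r
set.seed(42)
n <- 1e5
rho <- 0.5

# Correlated standard normals built from two independent draws
# (an assumed stand-in for MASS::mvrnorm)
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Sum of the component quantiles minus the quantile of the composite
quantileGap <- function(x, y, p) {
  unname(quantile(x, p) + quantile(y, p) - quantile(x + y, p))
}

# Clamp values to [-limit, limit] to mimic a floor/ceiling
clamp <- function(v, limit = 3) pmax(-limit, pmin(limit, v))

percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)
gapRaw <- sapply(percVals, function(p) quantileGap(x, y, p))
gapClamped <- sapply(percVals, function(p) quantileGap(clamp(x), clamp(y), p))

print(round(data.frame(percentile = 100 * percVals,
                       raw = gapRaw, clamped = gapClamped), 3))
```

Because a +/- 3 SD limit lies outside the 1st and 99th percentiles of each component, clamping only touches a small fraction of draws, so the two columns should be nearly identical.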
percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)
#perc <- 0.999
for (perc in percVals) {
  # reset.par=FALSE must be an argument to scatterplot(), not an element of
  # levels, for lines to be overlaid on the plot afterwards
  # delta compensates for an apparent systematic offset; it is an approximate
  # hack and may be unnecessary (try 0) when reset.par=FALSE takes effect
  delta <- 0.3
  savePar <- par(c("mfcol","mar"))
  scatterplot(Y ~ X, data=myDrawsExample, ellipse=TRUE, smoother=FALSE,
              levels=c(0.5, 0.95, 0.99, 0.999, 0.9999), reset.par=FALSE)
  title(paste("Example Distribution, Percentile =", 100 * perc))
  # Component thresholds (red vertical and horizontal lines)
  percX <- quantile(myDrawsExample$X, perc)
  abline(v=percX+delta, col="red")
  percY <- quantile(myDrawsExample$Y, perc)
  abline(h=percY+delta, col="red")
  # Composite threshold (red diagonal): X + Y = percXYsum
  percXYsum <- quantile(myDrawsExample$XYsum, perc)
  abline(a=percXYsum+2*delta, b=-1, col="red")
  # Compare percentile of composite to sum of component percentiles (blue diagonal)
  abline(a=percX+percY+2*delta, b=-1, col="blue")
  # Quantify the difference
  print(paste("For the", 100*perc, "percentile the difference is", round(percX+percY-percXYsum,2)))
  par(savePar)
}
## [1] "For the 99 percentile the difference is 0.61"
## [1] "For the 75 percentile the difference is 0.18"
## [1] "For the 50 percentile the difference is -0.01"
## [1] "For the 25 percentile the difference is -0.18"
## [1] "For the 1 percentile the difference is -0.62"
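The simulated differences above can be compared with a closed form under the same simplifying assumptions (standard normal components, correlation 0.5, no truncation). Each component's quantile at percentile p is `qnorm(p)`, and the sum is normal with sd `sqrt(2 + 2 * rho)`, so the gap is `qnorm(p) * (2 - sqrt(2 + 2 * rho))`. A minimal sketch; `rho` and `analyticDiff` are illustrative names.

```r
rho <- 0.5
percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)

# Sum of the two component thresholds at each percentile
sumOfComponentQuantiles <- 2 * qnorm(percVals)

# Threshold of the composite, which is normal with sd sqrt(2 + 2 * rho)
compositeQuantile <- qnorm(percVals, mean = 0, sd = sqrt(2 + 2 * rho))

analyticDiff <- sumOfComponentQuantiles - compositeQuantile
print(data.frame(percentile = 100 * percVals,
                 difference = round(analyticDiff, 2)))
```

The closed form gives about 0.62, 0.18, 0, -0.18, and -0.62, so the small asymmetries in the simulated values (0.61 vs -0.62, and -0.01 at the median) look like sampling noise.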
Completed at 2014-09-20 08:53:08