Percentiles of Components and Composites

A simulation-based look at how the percentiles for components (e.g. SATM and SATCR) correspond to the percentiles for composites (e.g. SATM + SATCR). Motivated by the discussion at Harvard admissions and meritocracy.

This analysis is grossly oversimplified (no floor/ceiling, no discretization, and it ignores the selected nature of admissions SAT scores), but hopefully it is helpful for reasoning about the problem.

Based on the simulation at Likelihood Estimates for Outcome Rank Given Correlated Variable.

Started at 2014-09-20 08:50:23

Simulated Distribution

First we generate a simulated distribution. For simplicity both components use mean = 0 and sd = 1. The correlation is set to 0.5, rounded from the value of 0.485 for the correlation of SATM and SATV given in Table 3 - correlation matrix for HSGPA, SATW, SATV, SATM, and AP Credit.

# These variables define our distribution
Xmean <- 0
Ymean <- 0
Xsd <- 1
Ysd <- 1
XYcor <- 0.5

# Derived variables
XYmean <- c(Xmean, Ymean)
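# Build the covariance from the SDs and correlation so changing Xsd/Ysd takes effect
XYcovar <- XYcor * Xsd * Ysd
XYcov <- matrix(c(Xsd^2, XYcovar, XYcovar, Ysd^2), nrow=2, ncol=2)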

# Create our multivariate normal distribution using mvrnorm in MASS
require(MASS)

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.0.3

# Make this reproducible
set.seed(42)
 
# Don't make sample too large here, car scatterplot can be slow
examplePopSize <- 1e5
myDrawsExample <- mvrnorm(examplePopSize, mu=XYmean, Sigma=XYcov)
myDrawsExample <- data.frame(myDrawsExample)
colnames(myDrawsExample) <- c("X", "Y")

# Optionally allow truncating the distribution
# Here we use +/- 3 SD which is close to the SAT floor/ceiling
truncate <- FALSE
if (truncate) {
  myDrawsExample$X <- pmax(-3, pmin(3, myDrawsExample$X))
  myDrawsExample$Y <- pmax(-3, pmin(3, myDrawsExample$Y))
}

myDrawsExample$XYsum <- myDrawsExample$X + myDrawsExample$Y

# Check the mean, sd, and cor
apply(myDrawsExample, 2, mean)

##         X         Y     XYsum 
## -0.002828 -0.004319 -0.007147

apply(myDrawsExample, 2, sd)

##     X     Y XYsum 
## 1.004 1.002 1.738

cor(myDrawsExample)

##            X      Y  XYsum
## X     1.0000 0.5007 0.8666
## Y     0.5007 1.0000 0.8659
## XYsum 0.8666 0.8659 1.0000
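The sample values above can be compared with the closed-form moments of a sum of two correlated variables. The following is a minimal sketch, assuming the same SDs and correlation as the simulation; the names `XYsumSd` and `XsumCor` are introduced here for illustration and are not part of the original analysis.

```r
# Closed-form check of the composite's sd and its correlation with a component.
# Assumes the same parameters as the simulation above.
Xsd <- 1
Ysd <- 1
XYcor <- 0.5

# Var(X + Y) = Var(X) + Var(Y) + 2 * Cov(X, Y)
XYsumSd <- sqrt(Xsd^2 + Ysd^2 + 2 * XYcor * Xsd * Ysd)

# Cor(X, X + Y) = (Var(X) + Cov(X, Y)) / (sd(X) * sd(X + Y))
XsumCor <- (Xsd^2 + XYcor * Xsd * Ysd) / (Xsd * XYsumSd)

print(round(c(XYsumSd = XYsumSd, XsumCor = XsumCor), 4))
```

With these parameters the composite sd is sqrt(3), about 1.732, and the component-composite correlation is sqrt(0.75), about 0.866, which is close to the simulated 1.738 and 0.8666.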

# Plot the distribution
require(car)

## Loading required package: car

## Warning: package 'car' was built under R version 3.0.3

scatterplot(Y ~ X, data=myDrawsExample, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Example Distribution with Correlation =", XYcor))

Figure: Example Distribution with Correlation = 0.5

Component and Composite Thresholds for Different Percentiles

Compare component and composite thresholds for different percentiles. Notice the symmetry of the simplified case as the percentile varies around 50%.

The difference between the two values (in component SDs) is given for each case.

The important things to notice are:
- The percentile threshold for the composite value (red diagonal) is not the same as the sum of the percentile thresholds for the component values (blue diagonal).
- This effect is strongest at the extremes (e.g. 1st and 99th percentiles) and weakest near the mean.
- This effect is symmetric around the 50th percentile with the composite threshold always being closer to the mean.

I also tried this with a floor/ceiling at +/- 3 SD and it did not make a significant difference in the results (but this does NOT capture the effect of the SAT ceiling on the college admissions percentiles).
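The floor/ceiling comparison can be sketched as follows. This is an illustration under stated assumptions, not the original run: it builds the correlated normals in base R instead of with mvrnorm, and the helper `percDiff` and the variables `rawDiff` and `truncDiff` are hypothetical names introduced here.

```r
set.seed(42)
n <- 1e5
XYcor <- 0.5

# Base-R construction of correlated standard normals (stand-in for mvrnorm)
X <- rnorm(n)
Y <- XYcor * X + sqrt(1 - XYcor^2) * rnorm(n)

# Sum of the component thresholds minus the composite threshold
percDiff <- function(x, y, perc) {
  unname(quantile(x, perc) + quantile(y, perc) - quantile(x + y, perc))
}

# Clamp both components to +/- 3 SD, as in the truncate option above
Xt <- pmax(-3, pmin(3, X))
Yt <- pmax(-3, pmin(3, Y))

percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)
rawDiff <- sapply(percVals, function(p) percDiff(X, Y, p))
truncDiff <- sapply(percVals, function(p) percDiff(Xt, Yt, p))

# Side-by-side comparison of the differences with and without clamping
print(round(cbind(perc = percVals, rawDiff, truncDiff), 3))
```

Clamping leaves the component thresholds at these percentiles unchanged, because they sit inside +/- 3 SD. Only the composite threshold can move, and only through the small fraction of draws where a component exceeds the clamp.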

percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)
#perc <- 0.999

for (perc in percVals) {
  # Pass reset.par=FALSE to scatterplot so the plot coordinates are kept
  # and lines can be overlaid. It must be an argument of scatterplot itself,
  # not an element of the levels vector.
  # delta is a manual offset; raise it from 0 only if the overlaid lines
  # are still visibly misaligned.
  delta <- 0
  savePar <- par(c("mfcol","mar"))
  scatterplot(Y ~ X, data=myDrawsExample, ellipse=TRUE, smoother=FALSE,
              levels=c(0.5, 0.95, 0.99, 0.999, 0.9999), reset.par=FALSE)
  title(paste("Example Distribution, Percentile =", 100 * perc))
  
  percX <- quantile(myDrawsExample$X, perc)
  abline(v=percX+delta, col="red")
  percY <- quantile(myDrawsExample$Y, perc)
  abline(h=percY+delta, col="red")
  percXYsum <- quantile(myDrawsExample$XYsum, perc)
  abline(a=percXYsum+2*delta, b=-1, col="red")
  # Compare percentile of composite to sum of component percentiles
  abline(a=percX+percY+2*delta, b=-1, col="blue")
  # Quantify the difference
  print(paste("For the", 100*perc, "percentile the difference is", round(percX+percY-percXYsum,2)))
  par(savePar)
}

Figure: Example Distribution, Percentile = 99

## [1] "For the 99 percentile the difference is 0.61"

Figure: Example Distribution, Percentile = 75

## [1] "For the 75 percentile the difference is 0.18"

Figure: Example Distribution, Percentile = 50

## [1] "For the 50 percentile the difference is -0.01"

Figure: Example Distribution, Percentile = 25

## [1] "For the 25 percentile the difference is -0.18"

Figure: Example Distribution, Percentile = 1

## [1] "For the 1 percentile the difference is -0.62"
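For this idealized bivariate normal case the differences can also be computed in closed form, which gives a check on the simulated values. The following is a sketch assuming standard normal components with correlation 0.5; `zVals`, `XYsumSd`, and `theoryDiff` are names introduced here for illustration.

```r
XYcor <- 0.5
percVals <- c(0.99, 0.75, 0.5, 0.25, 0.01)

# Component threshold at each percentile (standard normal quantile)
zVals <- qnorm(percVals)

# The composite X + Y is normal with mean 0 and sd sqrt(2 + 2 * XYcor)
XYsumSd <- sqrt(2 + 2 * XYcor)

# Sum of the two component thresholds minus the composite threshold
theoryDiff <- 2 * zVals - XYsumSd * zVals
print(round(theoryDiff, 2))
```

The difference is (2 - sqrt(3)) times the normal quantile, which gives roughly 0.62, 0.18, 0, -0.18, and -0.62 for the five percentiles, in line with the simulated 0.61, 0.18, -0.01, -0.18, and -0.62. The factor 2 - sqrt(2 + 2 * XYcor) shrinks to zero as the correlation approaches 1, so the gap comes entirely from the components being imperfectly correlated.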

Completed at 2014-09-20 08:53:08

Rich Seiter

Saturday, September 20, 2014