Likehood Estimates for Outcome Rank Given Correlated Variable

Based on idea described in my comment at http://infoproc.blogspot.com/2014/07/success-ability-and-all-that.html
Also see the original LessWrong post

Given a contributing factor X and a related outcome variable Y (with correlation r). Calculate likelihood that a given X will have the largest Y. An example case might be height as X and basketball ability as Y. Here we will focus on abstract normal variables with mean 0 and standard deviation 1. I have tried various correlations, but have primarily been using 0.4 or 0.65. Correlation for this simulation printed below.

Ideally this would be computed exactly. This analysis uses a simulation instead.

Also see the discussion of thresholded subsamples and their effect on correlations near the end.

First we generate a single set of data to give a sense of what we will see.

Sys.time()

## [1] "2014-07-31 10:24:15 PDT"

# These variables define our distribution
Xmean <- 0
Ymean <- 0
Xsd <- 1
Ysd <- 1
#XYcor <- 0.65
XYcor <- 0.4
#XYcor <- 0.2
#XYcor <- 0.0

# Derived variables
XYmean <- c(Xmean, Ymean)
XYcov <- matrix(c(1, XYcor, XYcor, 1), nrow=2, ncol=2)

# Create our multivariate normal distribution using mvrnorm in MASS
require(MASS)

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.0.3

#make this reproducible
set.seed(2)
 
# Don't make sample too large here, car scatterplot can be slow
examplePopSize <- 1e5
myDrawsExample <- mvrnorm(examplePopSize, mu=XYmean, Sigma=XYcov)
myDrawsExample <- data.frame(myDrawsExample)
colnames(myDrawsExample) <- c("X", "Y")

#check the mean, sd, and cor
apply(myDrawsExample, 2, mean)

##        X        Y 
## 0.001723 0.003428

apply(myDrawsExample, 2, sd)

##      X      Y 
## 1.0002 0.9992

cor(myDrawsExample)

##        X      Y
## X 1.0000 0.4033
## Y 0.4033 1.0000

# Plot the distribution
require(car)

## Loading required package: car

## Warning: package 'car' was built under R version 3.0.3

scatterplot(Y ~ X, data=myDrawsExample, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Example Distribution with Correlation =", XYcor))

plot of chunk example

Simulate Multiple Populations

Sys.time()

## [1] "2014-07-31 10:26:24 PDT"

quick <- FALSE
if (quick) {
  # Quick version for debugging
  numRuns <- 1e2
  popSize <- 1e4  
} else {
  # These values take about 15 minutes
  numRuns <- 1e4
  popSize <- 1e5  
}

# Also record X for a top portion of Y
topProp <- 0.001 # Try 0.1%
topN <- popSize*topProp

XOfLargestY <- 1:numRuns
XOfTopPropY <- 1:(numRuns*topN)

#make this reproducible
set.seed(2)
for (i in 1:numRuns) {
  myDraws <- mvrnorm(popSize, mu=XYmean, Sigma=XYcov)
  myDraws <- data.frame(myDraws)
  colnames(myDraws) <- c("X", "Y")
  XOfLargestY[i] <- myDraws$X[which.max(myDraws$Y)]
  XOfTopPropY[((i-1)*topN+1):(i*topN)] <- myDraws$X[head(order(myDraws$Y, decreasing=TRUE), n=topN)]
}
Sys.time()

## [1] "2014-07-31 10:40:49 PDT"

Look at the Results

How often does the largest Y have a given X value?

require(lattice)

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 3.0.3

densityplot(XOfLargestY)

plot of chunk unnamed-chunk-1

# Take a closer look at this distribution
summary(XOfLargestY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2.16    1.14    1.78    1.76    2.39    5.11

mean(XOfLargestY)

## [1] 1.764

sd(XOfLargestY)

## [1] 0.923

How likely is it that any given person at level X will have the best Y?

I’m not sure about these calculations (especially the attempt to normalize to a probability). Feedback welcomed.

obs_dvals <- density(XOfLargestY)
dvals <- dnorm(obs_dvals$x,mean=0,sd=1)
prob <- data.frame(x = obs_dvals$x,
                   y = obs_dvals$y / dvals / popSize # Attempt to normalize this to prob
                   )

xyplot(y ~ x, prob, type="l")

plot of chunk unnamed-chunk-2

xyplot(log10(y) ~ x, prob, type="l")

plot of chunk unnamed-chunk-2

I believe the somewhat anomalous results at the extremes are due to sampling error. For example, when I increased the popSize from 1e4 to 1e5 (numRuns = 1e2) the peak and dip at the right moved right by about 0.7 to correspond. Then when I increased popSize to 1e6 the peak moved another ~0.3 right. The center of the XOfLargestY distribution also shifted right and the probabilities for 3-4SD X decreased.

Now Look at the Top 0.1% of Y

How often does the top 0.1% of Y have a given X value?

densityplot(XOfTopPropY)

plot of chunk unnamed-chunk-3

# Take a closer look at this distribution
summary(XOfTopPropY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.900   0.724   1.350   1.350   1.970   5.590

mean(XOfTopPropY)

## [1] 1.345

sd(XOfTopPropY)

## [1] 0.922

How likely is it that any given person at level X will be in the top 0.1% of Y?

obs_dvals1 <- density(XOfTopPropY)
dvals1 <- dnorm(obs_dvals1$x,mean=0,sd=1)
prob1 <- data.frame(x = obs_dvals1$x,
                    y = obs_dvals1$y / dvals1 / popSize * topN # Attempt to normalize this to prob
                    )

xyplot(y ~ x, prob1, type="l")

plot of chunk unnamed-chunk-4

xyplot(log10(y) ~ x, prob1, type="l")

plot of chunk unnamed-chunk-4

Plot the densities together.

plot(density(XOfLargestY), xlim=c(min(XOfTopPropY), max(XOfTopPropY)), col="red",
#plot(density(XOfLargestY), xlim=c(min(XOfLargestY), max(XOfLargestY)), col="red",
     main=paste("Compare Densities of Largest and Top", topProp*100, "% of Y"),
     xlab="X")
lines(density(XOfTopPropY))
legend("topright", legend=c("Largest", paste("Top", topProp*100, "%")), lty=1, col=c("red", "black"))

plot of chunk unnamed-chunk-5

Note limiting of X axis. I think this is the valid range for inference.

#plot(prob$x, log10(prob$y), xlim=c(min(XOfTopPropY), max(XOfTopPropY)),
plot(prob$x, log10(prob$y), xlim=c(min(XOfLargestY), max(XOfLargestY)),
     ylim=c(min(log10(prob1$y)),max(log10(prob1$y))),
     col="red",
     main=paste("Compare Per-Person Likelihood of Largest and Top",
                topProp*100, "% of Y"),
     xlab="X", ylab="log10(probability)", type="l")
lines(prob1$x, log10(prob1$y))
legend("topleft", legend=c("Largest", paste("Top", topProp*100, "%")),
       lty=1, col=c("red", "black"))

plot of chunk unnamed-chunk-6

Comments

Clearly this simulation model has many assumptions. The two that particularly concern me are:
1. Normality (e.g. IQ is believed to have fat tails)
2. That the sample population is large enough. I saw consistent changes when increasing popSize. I finally chose 1e5 as a compromise.

I think it best to only really consider the results in the X range where XOfLargestY has at least one value (and possibly ignore apparent extreme outliers).

Berkson’s Paradox

This Less Wrong comment notes the relevance of Berkson’s Paradox(http://en.wikipedia.org/wiki/Berkson's_paradox)

Though not strictly applicable here (X and Y are not independent) it does raise the question of what correlations look like if we look at selected portions of the population.

Use the single example population created above. Look at > 2SD for X, Y, and both.

Notice how results change for different correlations. As correlations become small the sample with both >2SD becomes smaller.

Overall it looks like truncating the distributions decreases the measured correlations with the largest effect being if both X and Y are truncated. Compare the correlations below with the correlation of 0.4 for the full distribution.

myDrawsSampleY2SD <- myDrawsExample[myDrawsExample$Y > 2,]

#check the mean, sd, and cor
apply(myDrawsSampleY2SD, 2, mean)

##      X      Y 
## 0.9441 2.3741

apply(myDrawsSampleY2SD, 2, sd)

##      X      Y 
## 0.9044 0.3313

cor(myDrawsSampleY2SD)

##        X      Y
## X 1.0000 0.1592
## Y 0.1592 1.0000

# Plot the distribution
require(car)
scatterplot(Y ~ X, data=myDrawsSampleY2SD, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Sample with Y > 2SD,  N =", nrow(myDrawsSampleY2SD)))

plot of chunk unnamed-chunk-7

myDrawsSampleX2SD <- myDrawsExample[myDrawsExample$X > 2,]

#check the mean, sd, and cor
apply(myDrawsSampleX2SD, 2, mean)

##      X      Y 
## 2.3753 0.9313

apply(myDrawsSampleX2SD, 2, sd)

##      X      Y 
## 0.3381 0.9257

cor(myDrawsSampleX2SD)

##        X      Y
## X 1.0000 0.1157
## Y 0.1157 1.0000

# Plot the distribution
require(car)
scatterplot(Y ~ X, data=myDrawsSampleX2SD, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Sample with X > 2SD,  N =", nrow(myDrawsSampleX2SD)))

plot of chunk unnamed-chunk-8

myDrawsSampleXorY2SD <- myDrawsExample[(myDrawsExample$X > 2) | (myDrawsExample$Y > 2),]

#check the mean, sd, and cor
apply(myDrawsSampleXorY2SD, 2, mean)

##     X     Y 
## 1.614 1.596

apply(myDrawsSampleXorY2SD, 2, sd)

##      X      Y 
## 0.9966 1.0093

cor(myDrawsSampleXorY2SD)

##         X       Y
## X  1.0000 -0.5454
## Y -0.5454  1.0000

# Plot the distribution
require(car)
scatterplot(Y ~ X, data=myDrawsSampleXorY2SD, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Sample with either X or Y > 2SD,  N =", nrow(myDrawsSampleXorY2SD)))

plot of chunk unnamed-chunk-9

Correlation is large and negative at -0.5454

myDrawsSampleXY2SD <- myDrawsExample[(myDrawsExample$X > 2) & (myDrawsExample$Y > 2),]

#check the mean, sd, and cor
apply(myDrawsSampleXY2SD, 2, mean)

##     X     Y 
## 2.408 2.443

apply(myDrawsSampleXY2SD, 2, sd)

##      X      Y 
## 0.3629 0.3751

cor(myDrawsSampleXY2SD)

##         X       Y
## X 1.00000 0.03751
## Y 0.03751 1.00000

# Plot the distribution
require(car)
scatterplot(Y ~ X, data=myDrawsSampleXY2SD, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Sample with both X and Y > 2SD,  N =", nrow(myDrawsSampleXY2SD)))

plot of chunk unnamed-chunk-10

Compare the correlation for this group of 0.0375 to the correlation for the original data of 0.4033

And lastly look above a diagonal line.

myDrawsSampleXYsum <- myDrawsExample[(myDrawsExample$Y + myDrawsExample$X) > 2,]

#check the mean, sd, and cor
apply(myDrawsSampleXYsum, 2, mean)

##     X     Y 
## 1.408 1.407

apply(myDrawsSampleXYsum, 2, sd)

##      X      Y 
## 0.6481 0.6518

cor(myDrawsSampleXYsum)

##         X       Y
## X  1.0000 -0.4226
## Y -0.4226  1.0000

# Plot the distribution
require(car)
scatterplot(Y ~ X, data=myDrawsSampleXYsum, ellipse=TRUE,
            levels=c(0.5, 0.95, 0.99, 0.999, 0.9999))
title(paste("Sample with (X + Y) > 2,  N =", nrow(myDrawsSampleXYsum)))

plot of chunk unnamed-chunk-11

Correlation is large and negative at -0.4226

Pear Shaped Scatterplot?

There is a popular meme that IQ is correlated with accomplishment, but above IQ 120 (say) that correlation disappears.

A pear shaped scatterplot (combination of elliptical moderate correlation at lower levels and circular no correlation above a threshold) has been proposed as a way of thinking about this. See IQ, Creativity and the Twisted Pear by Grady Towers.

I believe the decreased correlation observed for thresholded subsamples above is a more parsimonious explanation for the popular observation of decreased correlation above a threshold (compared to no actual correlation in part of the distribution) and that the pear shaped scatterplot does not actually represent reality. (As an aside, if my conclusion is correct I find it ironic that the IQ/Pear article appeared in the newsletter for a high IQ society.)

See also the density simulations above for an explanation of how the bulk of high achievers in Y being not the highest X is consistent with high X still giving increasing returns in Y without a threshold effect.

Additional data/analysis/comments welcomed.

This reference has a relevant discussion: Understanding Correlation: Factors that Affect the Size of r and PDF. For example:

It is well known among statisticians that, other things being equal, the value of r will be greater if there is more variability among the observations than if there is less variability. However, many researchers are unaware of this fact (Glass & Hopkins, 1996), and it is common for students in basic statistics courses (and even some students in intermediate- and advanced-level courses) to have difficulty comprehending this concept. Examples of this characteristic of r can be found quite easily, and it is often termed range restriction, restriction of range, or truncated range by authors of statistics and measurement textbooks…

Which is then followed by an SAT and College GPA example.

Other Notes

A relevant point by another commenter: True or False: Half of All 7-Footers are in the NBA

Completed at 2014-07-31 10:41:42