Introduction

The PISA 2015 results have been published. In general, the scores are much the same as in the previous iteration. One country, however, apparently made a very large improvement: Argentina gained about half a standard deviation. Why? Steve Sailer notes in his commentary that Argentina had previously been unusually diligent about representing its entire student body in the sample, while other countries seem to have weighed the relative benefits of accurate results and a flattering public image rather differently.

The purpose of this brief R Notebook is to explore the effects of excluding poorly scoring students.

Packages

Load some packages we need.

#packages
library(pacman)
#kirkegaard supplies standardize() and GG_denhist(); magrittr supplies
#%<>% and multiply_by(); purrrlyr supplies by_row()
p_load(kirkegaard, dplyr, ggplot2, magrittr, purrrlyr, plotly)
options(digits = 2)

The student body

Barring problems with the test, the distribution of cognitive ability as assessed by standardized tests (and even many ad hoc tests) is roughly normal in the general population. This holds for scholastic tests such as PISA as well. We can therefore model the data approximately with normal distributions, which simplifies the analysis considerably.

In this scenario, local bureaucrats are deciding whether to exclude some students from the testing, and if so, how many and by what method. The assumption is that they begin with a large, representative sample of their students. This can be visualized as:

#distribution
mean_score = 500
sd_score = 90
sample_size = 10000

#generate normal data
set.seed(1)
data_frame(
  score = rnorm(sample_size, mean = mean_score, sd = sd_score)
) %>% 
  ggplot(aes(score)) +
  geom_density() +
  theme_bw()

For simplicity, we use the PISA scale, assuming a mean of 500 and standard deviation of 90.

Excluding students

Now, suppose we’re the bureaucrats and we want to make our country appear better than it is. Perhaps we can get away with excluding 10% of the students. The most effective approach is to exclude exactly the bottom 10% of scorers, like this:

set.seed(1)

#what is cutoff score for bottom 10%?
(cutoff = qnorm(.1, mean = mean_score, sd = sd_score))
## [1] 385
#generate normal data
exclude_lower = data_frame(
  score = rnorm(sample_size, mean = mean_score, sd = sd_score),
  inclusion = (score > cutoff) %>% plyr::mapvalues(c(T, F), c("Included", "Excluded")) %>% factor
)

#shading the regions under the curve takes a bit of work in ggplot2
#http://stackoverflow.com/questions/12429333/how-to-shade-a-region-under-a-curve-using-ggplot2

#calculate densities manually
exclude_lower_dens = density(x = exclude_lower$score, kernel = "gaussian")
which_keep = exclude_lower_dens$x < cutoff
exclude_lower_dens_df_below = data_frame(x = c(exclude_lower_dens$x[which_keep], cutoff),
                                   y = c(exclude_lower_dens$y[which_keep], 0)
                                   )
exclude_lower_dens_df_above = data_frame(x = c(exclude_lower_dens$x[!which_keep], cutoff),
                                   y = c(exclude_lower_dens$y[!which_keep], 0)
                                   )
#the closing points at (cutoff, 0) must be added, otherwise the polygons are badly misshapen!

#plot with a polygon
exclude_lower %>% 
  ggplot(aes(score)) +
  geom_density() +
  geom_polygon(data = exclude_lower_dens_df_below, aes(x, y), fill = "red", alpha = .4) +
  geom_polygon(data = exclude_lower_dens_df_above, aes(x, y), fill = "green", alpha = .4) +
  theme_bw()

The two subsamples created by this approach will differ in mean scores. How large is the difference?

(mean_scores = exclude_lower %>% group_by(inclusion) %>% summarize(mean = mean(score)))
## # A tibble: 2 × 2
##   inclusion  mean
##      <fctr> <dbl>
## 1  Excluded   340
## 2  Included   518

Quite large: 177.78 points, which is about 1.98 d. As a result, the included sample's mean has increased by about 17.92 points over the full sample's.
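
As a sanity check, these values can be derived analytically from the truncated normal distribution; the simulated numbers deviate slightly owing to sampling error. A minimal sketch, using the standard formulas for the means of the two truncated parts:

#analytic check: means of a normal distribution truncated at the cutoff
#E[X | X > c] = mu + sd * dnorm(z) / (1 - pnorm(z)), where z = (c - mu) / sd
#E[X | X < c] = mu - sd * dnorm(z) / pnorm(z)
z = qnorm(.1) #the cutoff in z units
mean_included = mean_score + sd_score * dnorm(z) / (1 - pnorm(z)) #about 518
mean_excluded = mean_score - sd_score * dnorm(z) / pnorm(z) #about 342
mean_included - mean_excluded #expected group difference, about 176 points
mean_included - mean_score #expected gain to the published mean, about 18 points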

A more realistic case

The above assumes that we are able to exactly predict how well students will do on the test, or that we are able to exclude exactly the subset of students we want after the testing. This is not very realistic. In practice, we would have to rely on various labels we can attach to students, some reasonable (e.g. deaf), some less so (e.g. ADHD, learning disability, dyslexic, low social status parent). This won't produce as large an effect as we saw above, because these labels do not covary perfectly with PISA scores. But it will work to some degree.

To simulate a more realistic case, we need to assume some level of accuracy of such label-based exclusion. Perhaps using a variety of labels and demographics, we are able to achieve 50% predictive validity, a correlation of .50. We then proceed to use that to exclude students. What do the results look like?

set.seed(2) #don't reuse seed 1: that would regenerate the exact same numbers and yield correlations of 1.0

#generate additional scores
exclude_lower %<>% mutate(
  #standardized scores (z values, mean 0, sd 1)
  score_z = standardize(score),
  #mix with random noise
  #the sd of the noise is set by the leftover variance:
  #r = .50 means the true score accounts for 25% of the variance, so the
  #noise must supply the remaining 75%, i.e. have sd sqrt(.75) = .87
  #this keeps the noisy score on the standardized scale (sd = 1)
  noisy_score_z = (score_z * .5) + rnorm(sample_size, mean = 0, sd = sqrt(1 - (.5^2))),
  #convert back to the PISA scale
  noisy_score = (noisy_score_z * sd_score) + mean_score
)

#did we do it right? if so, the new measure should have the same mean and sd
mean(exclude_lower$noisy_score)
## [1] 501
sd(exclude_lower$noisy_score)
## [1] 91
#and correlate about .5 with the real scores (the factor column is dropped)
cor(exclude_lower[-2])
##               score score_z noisy_score_z noisy_score
## score          1.00    1.00          0.51        0.51
## score_z        1.00    1.00          0.51        0.51
## noisy_score_z  0.51    0.51          1.00        1.00
## noisy_score    0.51    0.51          1.00        1.00

With these more realistic numbers, we do the same as before, namely split the student sample into included and excluded groups, but this time based on the noisy scores:

#new exclusion vector
exclude_lower$inclusion_noisy = (exclude_lower$noisy_score < cutoff) %>% plyr::mapvalues(c(T, F), c("Excluded", "Included")) %>% factor

#plot
GG_denhist(exclude_lower, "score", group = "inclusion_noisy") +
  scale_fill_discrete(name = "Status") +
  scale_x_continuous(breaks = seq(100, 1000, 100))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that though the excluded group did worse, it strongly overlaps with the included group. How large is the difference?

(mean_scores2 = exclude_lower %>% group_by(inclusion_noisy) %>% summarize(mean = mean(score)))
## # A tibble: 2 × 2
##   inclusion_noisy  mean
##            <fctr> <dbl>
## 1        Excluded   417
## 2        Included   509
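
As an aside, the overlap noted above can be quantified directly. A minimal sketch, using one simple measure among several possible: the share of excluded students who nonetheless score above the original mean of 500.

#share of excluded students scoring above the pre-exclusion mean
exclude_lower %>% 
  filter(inclusion_noisy == "Excluded") %>% 
  summarize(above_mean = mean(score > mean_score))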

The difference in means, then, is still large: about 91.49 points, or about 1.02 d. We could have predicted this: since selection is now based on a correlate with a validity of .50, the induced difference in true scores is about half as large as under perfectly accurate exclusion. The gain in the published mean from excluding low scorers is likewise about half as large.
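
This can be checked analytically as well: under bivariate normality, selecting on a correlate shifts the mean of the true scores by r times the shift produced by selecting on the true scores directly, so the truncated-normal results from before are simply multiplied by the assumed validity. A quick sketch (the simulated 91.49 runs a bit above the expected value owing to sampling error):

#analytic check: with selection accuracy r, the shifts scale by r
r = .5
z = qnorm(.1)
mean_included_noisy = mean_score + r * sd_score * dnorm(z) / (1 - pnorm(z)) #about 509
mean_excluded_noisy = mean_score - r * sd_score * dnorm(z) / pnorm(z) #about 421
mean_included_noisy - mean_excluded_noisy #expected difference, about 88 points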

If we used a totally inaccurate way to exclude students (an accuracy of 0), the mean of the student body after exclusions would be the same as before. This means that the gain from excluding 10% of the student sample is bounded above by the roughly 18 points (17.92 in our simulation) obtained with perfectly accurate exclusion, and bounded below by 0. Thus, if we are willing to assume a particular level of accuracy among bureaucrats and combine it with the reported number of exclusions, we can adjust published PISA scores for student exclusions:

set.seed(1)
#combinations for the plot
parameters = expand.grid(exclude_frac = seq(0, .50, .01),
                         accuracy = seq(0, 1, .01))

#calculate gains from exclusions
exclusion_result = parameters %>% by_row(..f = function(row) {
  #sim data on the z scale; means and sd's don't matter here
  tmp_d = data_frame(
    score = rnorm(sample_size),
    #noisy variant with the given accuracy
    score_noisy = score * row$accuracy + rnorm(sample_size) * sqrt(1 - (row$accuracy^2)),
    #exclusion status
    status = score_noisy > qnorm(row$exclude_frac)
  )
  
  #mean of the included group, converted to PISA points
  tmp_d$score[tmp_d$status] %>% 
    mean %>% 
    multiply_by(90)
}, .to = "gain", .collate = "rows")

#plot
gg = ggplot(exclusion_result) +
  geom_tile(aes(x = exclude_frac, y = accuracy, fill = gain)) +
  scale_x_continuous(name = "Excluded %", labels = scales::percent) +
  ylab("Accuracy of exclusions (correlation)") +
  theme_bw()
plotly::ggplotly(gg)
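
Finally, a hypothetical usage sketch of such an adjustment. The reported mean, exclusion rate, and assumed accuracy below are made-up inputs for illustration, not actual PISA figures; the estimated inflation is read off the simulated grid and subtracted from the published score.

#hypothetical example: a country publishes a mean of 520 after excluding
#7% of its students; we assume the exclusions had an accuracy of .50
reported_mean = 520 #made-up published score
reported_exclusions = .07 #made-up exclusion rate
assumed_accuracy = .50 #assumption about the bureaucrats

#look up the estimated gain in the simulated grid
#near() avoids floating point mismatches with the grid values
estimated_gain = exclusion_result %>% 
  filter(near(exclude_frac, reported_exclusions),
         near(accuracy, assumed_accuracy)) %>% 
  .$gain

#adjusted score
reported_mean - estimated_gain

Since each cell in the grid is based on a single simulated sample, the lookup is somewhat noisy; averaging over neighboring cells, or rerunning the simulation with more repetitions per cell, would give a more stable estimate.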