Introduction

In our analysis of the ‘nlschools’ dataset, we applied the exploration and visualization techniques taught throughout the course that we judged would most clearly convey how each variable affected pupils’ language test scores and how the variables interacted with each other. Our goal was to understand these relationships and to provide a foundational model for predicting school performance among pupils with similar characteristics.

Analysis

Preliminaries:

The ‘nlschools’ dataset is composed of the variables ‘lang’ for language test score, ‘IQ’ for verbal IQ, ‘class’ for class ID, ‘GS’ for class size, ‘SES’ for the socio-economic status of the pupil’s family, and ‘COMB’, which indicates whether the pupil was taught in a multi-grade class.

We aimed to understand how the language test score is associated with the other variables, along with the associative relationships among the other variables in the dataset.

# load libraries
library(ggplot2)
library(plyr)
library(reshape2)
library(tidyverse)
library(splines)
library(ISLR)
library(gam)
library(boot)
library(broom)
library(mclust)
library(car)

# load data
data(nlschools, package = 'MASS')
# explore and extract basic statistics from data set
View(nlschools)
summary(nlschools)
##       lang             IQ            class            GS       
##  Min.   : 9.00   Min.   : 4.00   15580  :  33   Min.   :10.00  
##  1st Qu.:35.00   1st Qu.:10.50   5480   :  31   1st Qu.:23.00  
##  Median :42.00   Median :12.00   15980  :  31   Median :27.00  
##  Mean   :40.93   Mean   :11.83   16180  :  31   Mean   :26.51  
##  3rd Qu.:48.00   3rd Qu.:13.00   18380  :  31   3rd Qu.:31.00  
##  Max.   :58.00   Max.   :18.00   5580   :  30   Max.   :39.00  
##                                  (Other):2100                  
##       SES        COMB    
##  Min.   :10.00   0:1658  
##  1st Qu.:20.00   1: 629  
##  Median :27.00           
##  Mean   :27.81           
##  3rd Qu.:35.00           
##  Max.   :50.00           
## 
str(nlschools)
## 'data.frame':    2287 obs. of  6 variables:
##  $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
##  $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
##  $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# function that finds quantiles to create QQ plot
Find.QQ <- function(.data, col.name, pooled.data) {
  n.pts <- min(length(.data[, col.name]), length(pooled.data))
  probs <- seq(from = 0, to = 1, length.out = n.pts)
  q1 <- quantile(.data[, col.name], probs = probs)
  q2 <- quantile(pooled.data, probs = probs)
  return(data.frame(group.data = q1, pooled.data = q2, quantile = probs))
}

# function that finds the best spline fit through K-fold cross-validation
glm.cv.loop <- function(data, formula.text, DF.vector, K=10) {
  require(boot) 
  cv.scores <- rep(0, times = length(DF.vector))
  for (i in seq_along(DF.vector)) {
    # get the fitted model for the current value of DF
    # (the formula text references DF, which is looked up in this function's environment)
    DF <- DF.vector[i]
    spline.model <- glm(as.formula(formula.text), data = data)
    # run K-fold cross-validation 
    cv <- cv.glm(data = data, glmfit = spline.model, K=K)
    # extract the cross-validation score, stored by position so DF.vector need not start at 1
    cv.scores[i] <- cv$delta[1]
  }
  # make the plot
  data.out <- data.frame(df = DF.vector, cv.scores = cv.scores)
  cv.plot <- ggplot(data = data.out, mapping = aes(x=df, y=cv.scores)) + geom_point() + 
    labs(x='df', title='Cross-Validation Scores') + theme(plot.title = element_text(hjust = 0.5))
  # return a list containing the scores and the plot
  return(list(scores = cv.scores, plot = cv.plot))
}  

Task 1: Are there discrepancies in IQ or SES in the different classes, or when grouping by multi-grade vs. non-multi-grade classes?

Methods:

We ran scatter plots for the variables ‘IQ’ and ‘SES’, each as a function of class size. There were 133 classrooms designated by the variable ‘class’, but we opted to use the number of pupils per classroom, ‘GS’, in order to identify patterns between this variable and the others in the dataset. Using class ID would only have given information about particular classrooms instead of leveraging the natural groupings present in the data.

  • We determined the degrees of freedom, ‘df’, for each scatter plot by running cross-validation for ‘IQ’ and ‘SES’, each against class size.
  • Following this, we plotted QQ plots for both variables, comparing the grouped data for each class size against the pooled data.
  • Lastly, we plotted ‘IQ’ and ‘SES’ against the type of class, i.e. whether or not the class was multi-grade.

Results:

  • From both the scatter plot with 5 df and the QQ plots, we found no noticeable discrepancy between the IQ distributions of the 28 classroom sizes and the pooled IQ distribution. Data were sparse for classroom sizes 10 and 39, and somewhat below the pooled data for classroom sizes 14 and 15. Most other classroom-size distributions tracked the pooled data closely, lying on or just above the reference line for most IQs.
  • Multi-grade classes’ IQ distributions were mostly, albeit slightly, below those of non-multi-grade classes.
  • Socio-economic status appeared lower among pupils in small class sizes (10–16, 18, 20) and in the very largest class sizes (37–39). In the remaining, middle class sizes, pupils’ socio-economic status was somewhat closer to and above the pooled data. The scatter plot indicates socio-economic status increases with class size up to roughly 31 pupils, after which it decreases rapidly, although with little confidence given the wide error bands at this extreme of class sizes.
  • Pupils’ socio-economic status in multi-grade classes was also slightly below that of pupils in non-multi-grade classes.
# IQ
# cross-validation with natural splines and DF from 1:10 (note K=10)
IQ.by.class.size.cv <- glm.cv.loop(data = nlschools, formula.text = 'IQ ~ ns(GS, df=DF)', DF.vector = 1:10)
# cv plot
IQ.by.class.size.cv$plot
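
The scatter plot below uses df = 5, read off the cross-validation plot above. As a minimal sketch (not part of the original code), the minimizing df can also be extracted from the scores returned by ‘glm.cv.loop’; since ‘DF.vector’ is 1:10 here, the position of the minimum equals the df, though the exact minimizer can vary between runs because ‘cv.glm’ uses random folds.

# sketch: position of the lowest CV score (equals the df because DF.vector = 1:10);
# cv.glm folds are random, so this can vary slightly between runs
which.min(IQ.by.class.size.cv$scores)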

# scatter plot of IQ as a function of class size based on 'df' from cv function above
(IQ.by.class.size <- ggplot(data = nlschools, mapping = aes(x=GS, y=IQ)) + 
  geom_point() + geom_smooth(method = 'lm', formula = y ~ ns(x, df = 5)) + 
  labs(x='Class Size', title = 'IQ vs. Class Size') + 
  theme(plot.title = element_text(hjust = 0.5)))

# group by class size 'GS' and apply 'Find.QQ' to 'IQ' within each group
QQ.IQ.df <- ddply(nlschools, .(GS), Find.QQ, col.name = 'IQ', pooled.data = nlschools$IQ)

# QQ plots for class size groups vs. all IQs
ggplot(data = QQ.IQ.df, mapping = aes(x=pooled.data, y=group.data)) + geom_point() + geom_abline() + 
  facet_wrap('GS', nrow = 4) + 
  labs(title = 'QQ plots, IQ by Class Size \n (grouped vs. pooled data)') + 
  theme(plot.title = element_text(hjust = 0.5))

# QQ plot (using base-qqplot) for multi-grade class type vs. IQ
with(nlschools, qqplot(x = IQ[COMB == 0], y = IQ[COMB == 1], 
            main='QQ Plot, IQ by Multi-Grade Class Type', 
            xlab = 'Multi-Grade = NO', ylab = 'Multi-Grade = YES')
     )
abline(0, 1)

# SES
# cross-validation with natural splines and DF from 1:10 (note K=10)
SES.by.class.size.cv <- glm.cv.loop(data = nlschools, formula.text = 'SES ~ ns(GS, df=DF)', DF.vector = 1:10)
# cv plot
SES.by.class.size.cv$plot

# scatter plot of SES as a function of class size based on 'df' from cv function above
(SES.by.class.size <- ggplot(data = nlschools, mapping = aes(x=GS, y=SES)) + 
  geom_point() + geom_smooth(method = 'lm', formula = y ~ ns(x, df = 4)) + 
  labs(x='Class Size', y='Socio-Economic Status', title = 'Socio-Economic Status vs. Class Size') + 
  theme(plot.title = element_text(hjust = 0.5)))

# group by class size 'GS' and apply 'Find.QQ' to 'SES' within each group
QQ.SES.df <- ddply(nlschools, .(GS), Find.QQ, col.name = 'SES', pooled.data = nlschools$SES)

# QQ plots for class size groups vs. socio-economic status of all pupils' families
ggplot(data = QQ.SES.df, mapping = aes(x=pooled.data, y=group.data)) + geom_point() + geom_abline() + 
  facet_wrap('GS', nrow = 4) + 
  labs(title = 'QQ plots, Socio-Economic Status by Class Size \n (grouped vs. pooled data)') + 
  theme(plot.title = element_text(hjust = 0.5))

# QQ plot (using base-qqplot) for multi-grade class type vs. socio-economic status
with(nlschools, qqplot(x = SES[COMB == 0], y = SES[COMB == 1], 
            main='QQ Plot, Socio-Economic Status \n by Multi-Grade Class Type', 
            xlab = 'Multi-Grade = NO', ylab = 'Multi-Grade = YES')
     )
abline(0, 1)
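
As a small numeric companion to the class-type QQ plots above (a sketch, not part of the original output), the group medians of ‘IQ’ and ‘SES’ by ‘COMB’ can be computed directly:

# sketch: median IQ and SES by multi-grade class type
with(nlschools, tapply(IQ, COMB, median))
with(nlschools, tapply(SES, COMB, median))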

Task 2: When did students perform better or worse on the language exam? Describe which variables had the most important effects.

Methods:

We used three approaches to arrive at and corroborate our best predictive model:

  • stepwise variable selection (also used to determine the number of ‘df’ to be used in our spline model plot)
  • forward variable selection
  • backward variable selection

Results:

  • Based on the three models run (stepwise, forward and backward variable selection), IQ, socio-economic status and multi-grade type class appeared to constitute the best predictive model for language test score. Class size did not.
  • Both backward and forward variable selection models yielded the same AIC score = 8859.83, and were consistent with the stepwise model in variable selection and discrimination.
  • The greater IQ and socio-economic status were, the higher language test scores were.
  • Conversely, multi-grade type class settings were indicative of lower language test scores.
# choose model order and df for each spline
scope.list <- list(
  'IQ' = ~ 1 + IQ + ns(IQ, df = 2) + ns(IQ, df = 3) + ns(IQ, df = 4), 
  'GS' = ~ 1 + GS + ns(GS, df = 2) + ns(GS, df = 3) + ns(GS, df = 4), 
  'SES' = ~ 1 + SES + ns(SES, df = 2) + ns(SES, df = 3) + ns(SES, df = 4), 
  'COMB' = ~ 1 + COMB
  )

# start model using 'gam' function
start.model <- gam(lang ~ 1, data = nlschools)
# choose model order via 'step.gam' (renamed 'step.Gam' in recent versions of the gam package)
spline.step <- step.gam(start.model, scope = scope.list)
## Start:  lang ~ 1; AIC= 16545.2 
## Step:1 lang ~ IQ ; AIC= 15483.69 
## Step:2 lang ~ IQ + SES ; AIC= 15376.73 
## Step:3 lang ~ IQ + SES + COMB ; AIC= 15352.05
# corroborate results via 'forward' and 'backward' variable selection
# multiple regression model with all variables, excluding class ID
full.model <- lm(lang ~ IQ + GS + SES + COMB, data = nlschools)
confint(full.model)
##                   2.5 %      97.5 %
## (Intercept)  7.58553203 11.78377197
## IQ           2.24558682  2.53556667
## GS          -0.07588235  0.02397305
## SES          0.12020159  0.17545454
## COMB1       -2.32217483 -1.04667564
# backward variable selection
back.step <- step(full.model)
## Start:  AIC=8860.79
## lang ~ IQ + GS + SES + COMB
## 
##        Df Sum of Sq    RSS    AIC
## - GS    1        50 109699 8859.8
## <none>              109649 8860.8
## - COMB  1      1289 110938 8885.5
## - SES   1      5291 114940 8966.6
## - IQ    1     50231 159880 9721.3
## 
## Step:  AIC=8859.83
## lang ~ IQ + SES + COMB
## 
##        Df Sum of Sq    RSS    AIC
## <none>              109699 8859.8
## - COMB  1      1287 110986 8884.5
## - SES   1      5241 114940 8964.6
## - IQ    1     50231 159930 9720.0
confint(back.step)
##                  2.5 %     97.5 %
## (Intercept)  7.3483277 10.7220510
## IQ           2.2455832  2.5355655
## SES          0.1189295  0.1739182
## COMB1       -2.3207618 -1.0452635
# forward variable selection
# initial model: use no variables
empty.model <- lm(lang ~ 1, data = nlschools)
# step with direction = 'forward': add variables that decrease AIC
# scope: largest model allowed
forward.step <- step(empty.model, direction = 'forward', scope = formula(full.model))
## Start:  AIC=10052.97
## lang ~ 1
## 
##        Df Sum of Sq    RSS     AIC
## + IQ    1     68916 116402  8991.5
## + SES   1     23398 161920  9746.3
## + COMB  1      2678 182640 10021.7
## <none>              185317 10053.0
## + GS    1        80 185237 10054.0
## 
## Step:  AIC=8991.46
## lang ~ IQ
## 
##        Df Sum of Sq    RSS    AIC
## + SES   1    5415.9 110986 8884.5
## + COMB  1    1462.0 114940 8964.6
## <none>              116402 8991.5
## + GS    1       0.1 116401 8993.5
## 
## Step:  AIC=8884.5
## lang ~ IQ + SES
## 
##        Df Sum of Sq    RSS    AIC
## + COMB  1    1286.8 109699 8859.8
## <none>              110986 8884.5
## + GS    1      47.8 110938 8885.5
## 
## Step:  AIC=8859.83
## lang ~ IQ + SES + COMB
## 
##        Df Sum of Sq    RSS    AIC
## <none>              109699 8859.8
## + GS    1    49.934 109649 8860.8
confint(forward.step)
##                  2.5 %     97.5 %
## (Intercept)  7.3483277 10.7220510
## IQ           2.2455832  2.5355655
## SES          0.1189295  0.1739182
## COMB1       -2.3207618 -1.0452635
# fit the model selected above; 'spline.step' kept linear terms for IQ and SES (no spline terms) plus COMB
spline.model <- lm(lang ~ IQ + SES + COMB, data = nlschools)
plot.gam(spline.model, se = TRUE, cex.lab = 1) # plot each fitted term separately, with standard error bands
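
Since the introduction frames this as a foundational model for prediction, the following is a minimal sketch (not part of the original analysis) of using the selected model to predict the language score of a hypothetical pupil; the covariate values are made up purely for illustration.

# sketch: predicted language test score for a hypothetical pupil
# (IQ = 12, SES = 30, non-multi-grade class; illustrative values only)
new.pupil <- data.frame(IQ = 12, SES = 30,
                        COMB = factor('0', levels = levels(nlschools$COMB)))
predict(spline.model, newdata = new.pupil, interval = 'prediction')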

We also applied clustering in an attempt to discover any discernible patterns in the data not evident via supervised learning.

A two-cluster analysis of the variables produced a few noteworthy findings:

  • Language test scores vs. verbal IQ appeared to separate into an inner and an outer cluster. In both clusters, verbal IQ had a positive, roughly linear effect on language test score. We also noticed that at the highest verbal IQs, language test scores plateaued.
  • Smaller class sizes were concentrated in a narrower range of language test scores, with the spread widening toward the larger class sizes.
  • Apart from a tendency of the spread of pupils’ family socio-economic status to narrow as ‘SES’ increased, there wasn’t a discernible separation between socio-economic status and language test score; both clusters narrowed as ‘SES’ increased.
  • Multi-grade class types were clearly delineated between the two clusters, and both groups had language test scores across the full ‘lang’ range.
# cluster dataset with 1:2 clusters, returning best fit
clust <- Mclust(nlschools[, 1:6], G = 1:2)
nlschools$cluster <- clust$classification

# relabel variable names
labels <- c('language\n test score', 'verbal.IQ', 'class ID', 'class size', 'socio-economic\n status', 'multi-grade\n class type')

# scatter plot matrix
pairs(nlschools[, 1:6], col = nlschools$cluster, cex = .5, cex.labels = 1, labels = labels)
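
To complement the scatter plot matrix, here is a short sketch (not part of the original analysis) of per-cluster summaries of the numeric variables and of how class types split across the two clusters:

# sketch: mean of each numeric variable within the two clusters
aggregate(cbind(lang, IQ, GS, SES) ~ cluster, data = nlschools, FUN = mean)
# sketch: how multi-grade and non-multi-grade classes split across clusters
table(cluster = nlschools$cluster, COMB = nlschools$COMB)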

Task 3: Do you think there are interactions in the effects of the variables on the language exam score? Speculate as to the cause of any such effects that you think should be included.

Methods:

We ran co-plots of ‘lang’ vs. various combinations of the variables that make up the model from our Task 2 analysis. We used 7 conditioning intervals as the basis for each co-plot.

Specifically, we plotted ‘lang’ vs. the other variables in our model given conditioning intervals:

  • ‘lang’ vs. ‘IQ’ given ‘COMB’ and ‘SES’
  • ‘lang’ vs. ‘SES’ given ‘IQ’ and ‘COMB’
  • ‘lang’ vs. ‘COMB’ given ‘IQ’
  • ‘lang’ vs. ‘COMB’ given ‘SES’

Results:

  • The distribution of language test scores vs. IQ was similar across all socio-economic status intervals: ‘lang’ increased as IQ increased. The distribution was narrower in multi-grade classes, suggesting that fewer pupils benefited from this effect.
  • ‘lang’ vs. ‘SES’ increased most steeply in non-multi-grade classes at low IQs; the other distributions were comparatively flat toward higher IQs.
  • ‘lang’ was lower in the lower IQ and ‘SES’ intervals for multi-grade classes, while it was flat across the higher intervals for non-multi-grade classes.
# coplot of 'lang' vs. 'IQ' given 'SES' and 'COMB'
coplot(lang ~ IQ | SES + COMB, data = nlschools, number = 7,  panel = panel.smooth, 
       rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)

# coplot of 'lang' vs. 'SES' given 'IQ' and 'COMB'
coplot(lang ~ SES | IQ + COMB, data = nlschools, number = 7, panel = panel.smooth, 
       rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)

# coplot of 'lang' vs. 'COMB' given 'IQ'
coplot(lang ~ COMB | IQ, data = nlschools, number = 7, panel = panel.smooth, 
       rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)

# coplot of 'lang' vs. 'COMB' given 'SES'
coplot(lang ~ COMB | SES, data = nlschools, number = 7, panel = panel.smooth, 
       rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)
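
As a complementary check on the co-plots (a sketch, not part of the original analysis), interaction terms can be added to the model selected in Task 2 and compared against the additive fit; the interaction structure below is only one plausible choice.

# sketch: compare the additive model to one allowing IQ and SES effects to differ by class type
additive.model    <- lm(lang ~ IQ + SES + COMB, data = nlschools)
interaction.model <- lm(lang ~ IQ * COMB + SES * COMB, data = nlschools)
anova(additive.model, interaction.model)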

Conclusions

Throughout the analysis of the ‘nlschools’ dataset, we gained the following insights: