In the analysis of the ‘nlschools’ dataset, we utilized the exploration and visualizing techniques taught throughout the course that we thought would more clearly convey how each variable affected language test scores among pupils and how they interacted with each other. Our goal was to understand these relationships and provide a foundational model for prediction of school performance among pupils presenting similar characteristics.
The ‘nlschools’ dataset is comprised of variables ‘lang’ for language test score, ‘IQ’ for verbal.IQ, ‘class’ for class ID, ‘GS’ for class size, ‘SES’ for socio-economic status, and ‘COMB’ which describes whether the pupils were taugh in a multi-grade class.
We aimed to understand causal associations between the language test score and other variables, along with other associative relationships amongst the other variables in the dataset.
# load libraries
library(ggplot2)
library(plyr)
library(reshape2)
library(tidyverse)
library(splines)
library(ISLR)
library(gam)
library(boot)
library(broom)
library(mclust)
library(car)
# load data
data(nlschools, package = 'MASS')
# explore and extract basic statistics from data set
View(nlschools)
summary(nlschools)
## lang IQ class GS
## Min. : 9.00 Min. : 4.00 15580 : 33 Min. :10.00
## 1st Qu.:35.00 1st Qu.:10.50 5480 : 31 1st Qu.:23.00
## Median :42.00 Median :12.00 15980 : 31 Median :27.00
## Mean :40.93 Mean :11.83 16180 : 31 Mean :26.51
## 3rd Qu.:48.00 3rd Qu.:13.00 18380 : 31 3rd Qu.:31.00
## Max. :58.00 Max. :18.00 5580 : 30 Max. :39.00
## (Other):2100
## SES COMB
## Min. :10.00 0:1658
## 1st Qu.:20.00 1: 629
## Median :27.00
## Mean :27.81
## 3rd Qu.:35.00
## Max. :50.00
##
str(nlschools)
## 'data.frame': 2287 obs. of 6 variables:
## $ lang : int 46 45 33 46 20 30 30 57 36 36 ...
## $ IQ : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
## $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ GS : int 29 29 29 29 29 29 29 29 29 29 ...
## $ SES : int 23 10 15 23 10 10 23 10 13 15 ...
## $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# function that finds quantiles to create QQ plot
Find.QQ <- function(.data, col.name, pooled.data) {
n.pts <- min(length(.data[, col.name]), length(pooled.data))
probs <- seq(from = 0, to = 1, length.out = n.pts)
q1 <- quantile(.data[, col.name], probs = probs)
q2 <- quantile(pooled.data, probs = probs)
return(data.frame(group.data = q1, pooled.data = q2, quantile = probs))
}
# function that finds fit through cross-validation
glm.cv.loop <- function(data, formula.text, DF.vector, K=10) {
require(boot)
cv.scores <- rep(0, times = length(DF.vector))
for (DF in DF.vector) {
# get the fitted model for current value of DF
spline.model <- glm(as.formula(formula.text), data = data)
# run K-fold cross-validation
cv <- cv.glm(data = data, glmfit = spline.model, K=K)
# extract the cross-validation score
cv.scores[DF] <- cv$delta[1]
}
# make the plot
data.out <- data.frame(df = DF.vector, cv.scores = cv.scores)
cv.plot <- ggplot(data = data.out, mapping = aes(x=df, y=cv.scores)) + geom_point() +
labs(x='df', title='Cross-Validation Scores') + theme(plot.title = element_text(hjust = 0.5))
# return a list containing the scores and the plot
return(list(scores = cv.scores, plot = cv.plot))
}
IQ or SES in the different classes, or when grouping by multi-grade vs. non-multi-grade classes?We run scatter plots for variables ‘IQ’ and ‘SES’, each as a function of class size. There were 132 class rooms designated by variable ‘class’, but we opted for utilizing the number of pupils per classroom variable, ‘GS’, in order to identify patterns occasioned by this variable and all others in the dataset. Utilizing class ID would just have given information on the particular classroom instead of leveraging any natural groupings present in the data.
# IQ
# cross-validation with natural splines and DF from 1:10 (note K=10)
IQ.by.class.size.cv <- glm.cv.loop(data = nlschools, formula.text = 'IQ ~ ns(GS, df=DF)', DF.vector = 1:10)
# cv plot
IQ.by.class.size.cv$plot
# scatter plot of IQ as a function of class size based on 'df' from cv function above
(IQ.by.class.size <- ggplot(data = nlschools, mapping = aes(x=GS, y=IQ)) +
geom_point() + geom_smooth(method = 'lm', formula = y ~ ns(x, df = 5)) +
labs(x='Class Size', title = 'IQ vs. Class Size') +
theme(plot.title = element_text(hjust = 0.5)))
# group class size by 'IQ' and call 'Find.QQ' function
QQ.IQ.df <- ddply(nlschools, .(GS), Find.QQ, col.name = 'IQ', pooled.data = nlschools$IQ)
# QQ plots for class size groups vs. all IQs
ggplot(data = QQ.IQ.df, mapping = aes(x=pooled.data, y=group.data)) + geom_point() + geom_abline() +
facet_wrap('GS', nrow = 4) +
labs(title = 'QQ plots, IQ by Class Size \n (grouped vs. pooled data)') +
theme(plot.title = element_text(hjust = 0.5))
# QQ plot (using base-qqplot) for multi-grade class type vs. IQ
with(nlschools, qqplot(x = IQ[COMB == 0], y = IQ[COMB == 1],
main='QQ Plot, IQ by Multi-Grade Class Type',
xlab = 'Multi-Grade = NO', ylab = 'Multi-Grade = YES')
)
abline(0, 1)
# SES
# cross-validation with natural splines and DF from 1:10 (note K=10)
SES.by.class.size.cv <- glm.cv.loop(data = nlschools, formula.text = 'SES ~ ns(GS, df=DF)', DF.vector = 1:10)
# cv plot
SES.by.class.size.cv$plot
# scatter plot of SES as a function of class size based on 'df' from cv function above
(SES.by.class.size <- ggplot(data = nlschools, mapping = aes(x=GS, y=SES)) +
geom_point() + geom_smooth(method = 'lm', formula = y ~ ns(x, df = 4)) +
labs(x='Class Size', y='Socio-Economic Status', title = 'Socio-Economic Status vs. Class Size') +
theme(plot.title = element_text(hjust = 0.5)))
# group class size by 'SES' and call 'Find.QQ' function
QQ.SES.df <- ddply(nlschools, .(GS), Find.QQ, col.name = 'SES', pooled.data = nlschools$SES)
# QQ plots for class size groups vs. socio-economic status of all pupils' families
ggplot(data = QQ.SES.df, mapping = aes(x=pooled.data, y=group.data)) + geom_point() + geom_abline() +
facet_wrap('GS', nrow = 4) +
labs(title = 'QQ plots, Socio-Economic Status by Class Size \n (grouped vs. pooled data)') +
theme(plot.title = element_text(hjust = 0.5))
# QQ plot (using base-qqplot) for multi-grade class type vs. socio-economic status
with(nlschools, qqplot(x = SES[COMB == 0], y = SES[COMB == 1],
main='QQ Plot, Socio-Economic Status \n by Multi-Grade Class Type',
xlab = 'Multi-Grade = NO', ylab = 'Multi-Grade = YES')
)
abline(0, 1)
We used three models to arrive and corroborate our best predictive model: * stepwise variable selection (also used to determine the number of ‘df’ to be used in our spline model plot) * forward variable selection * backward variable selection
# choose model order and df for each spline
scope.list <- list(
'IQ' = ~ 1 + IQ + ns(IQ, df = 2) + ns(IQ, df = 3) + ns(IQ, df = 4),
'GS' = ~ 1 + SES + ns(GS, df = 2) + ns(GS, df = 3) + ns(GS, df = 4),
'SES' = ~ 1 + SES + ns(SES, df = 2) + ns(SES, df = 3) + ns(SES, df = 4),
'COMB' = ~ 1 + COMB
)
# start model using 'gam' function
start.model <- gam(lang ~ 1, data = nlschools)
# choose model order via 'step.gam' function
spline.step <- step.gam(start.model, scope = scope.list)
## Start: lang ~ 1; AIC= 16545.2
## Step:1 lang ~ IQ ; AIC= 15483.69
## Step:2 lang ~ IQ + SES ; AIC= 15376.73
## Step:3 lang ~ IQ + SES + COMB ; AIC= 15352.05
# corroborate results via 'forward' and 'backward' variable selection
# multiple regression model with all variables, excluding class ID
full.model <- lm(lang ~ IQ + GS + SES + COMB, data = nlschools)
confint(full.model)
## 2.5 % 97.5 %
## (Intercept) 7.58553203 11.78377197
## IQ 2.24558682 2.53556667
## GS -0.07588235 0.02397305
## SES 0.12020159 0.17545454
## COMB1 -2.32217483 -1.04667564
# backward variable selection
back.step <- step(full.model)
## Start: AIC=8860.79
## lang ~ IQ + GS + SES + COMB
##
## Df Sum of Sq RSS AIC
## - GS 1 50 109699 8859.8
## <none> 109649 8860.8
## - COMB 1 1289 110938 8885.5
## - SES 1 5291 114940 8966.6
## - IQ 1 50231 159880 9721.3
##
## Step: AIC=8859.83
## lang ~ IQ + SES + COMB
##
## Df Sum of Sq RSS AIC
## <none> 109699 8859.8
## - COMB 1 1287 110986 8884.5
## - SES 1 5241 114940 8964.6
## - IQ 1 50231 159930 9720.0
confint(back.step)
## 2.5 % 97.5 %
## (Intercept) 7.3483277 10.7220510
## IQ 2.2455832 2.5355655
## SES 0.1189295 0.1739182
## COMB1 -2.3207618 -1.0452635
# forward variable selection
# initial model: use no variables
empty.model <- lm(lang ~ 1, data = nlschools)
# step with direction = 'forward': add variables to increase AIC
# scope: largest model allowed
forward.step <- step(empty.model, direction = 'forward', scope = formula(full.model))
## Start: AIC=10052.97
## lang ~ 1
##
## Df Sum of Sq RSS AIC
## + IQ 1 68916 116402 8991.5
## + SES 1 23398 161920 9746.3
## + COMB 1 2678 182640 10021.7
## <none> 185317 10053.0
## + GS 1 80 185237 10054.0
##
## Step: AIC=8991.46
## lang ~ IQ
##
## Df Sum of Sq RSS AIC
## + SES 1 5415.9 110986 8884.5
## + COMB 1 1462.0 114940 8964.6
## <none> 116402 8991.5
## + GS 1 0.1 116401 8993.5
##
## Step: AIC=8884.5
## lang ~ IQ + SES
##
## Df Sum of Sq RSS AIC
## + COMB 1 1286.8 109699 8859.8
## <none> 110986 8884.5
## + GS 1 47.8 110938 8885.5
##
## Step: AIC=8859.83
## lang ~ IQ + SES + COMB
##
## Df Sum of Sq RSS AIC
## <none> 109699 8859.8
## + GS 1 49.934 109649 8860.8
confint(forward.step)
## 2.5 % 97.5 %
## (Intercept) 7.3483277 10.7220510
## IQ 2.2455832 2.5355655
## SES 0.1189295 0.1739182
## COMB1 -2.3207618 -1.0452635
# fit splines to each variable given by new model and using df from 'spline.step'
spline.model <- lm(lang ~ IQ + SES + COMB, data = nlschools)
plot.gam(spline.model, se = TRUE, cex.lab = 1) # plot each spline separately
We also applied clustering in an attempt to discover any discernable patterns in the data not evident via supervised learning.
Two-cluster variable analysis produced a few worth-noting findings:
# cluster dataset with 1:2 clusters, returning best fit
clust <- Mclust(nlschools[, 1:6], G = 1:2)
nlschools$cluster <- clust$classification
# relabel variable names
labels <- c('language\n test score', 'verbal.IQ', 'class ID', 'class size', 'socio-economic\n status', 'multi-grade\n class type')
# scatter plot matrix
pairs(nlschools[, 1:6], col = nlschools$cluster, cex = .5, cex.labels = 1, labels = labels)
We run co-plots of ‘lang’ vs. various combinations of the variables that make up the model given our previous analysis on Task 2. We used 7 conditioning intervals as the basis for each co-plot.
Specifically, we plotted ‘lang’ vs. the other variables in our model given conditioning intervals:
# coplot of 'lang' vs. 'IQ' given 'SES' and 'COMB'
coplot(lang ~ IQ | SES + COMB, data = nlschools, number = 7, panel = panel.smooth,
rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)
# coplot of 'lang' vs. 'SES' given 'IQ' and 'COMB'
coplot(lang ~ SES | IQ + COMB, data = nlschools, number = 7, panel = panel.smooth,
rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)
# coplot of 'lang' vs. 'COMB' given 'IQ'
coplot(lang ~ COMB | IQ, data = nlschools, number = 7, panel = panel.smooth,
rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)
# coplot of 'lang' vs. 'COMB' given 'SES'
coplot(lang ~ COMB | SES, data = nlschools, number = 7, panel = panel.smooth,
rows = 1, overlap = .7, cex = .7, span = .3, lwd = 2)
Throughout the analysis of the ‘nlschools’ dataset, we learned the following insights: