Predictive Model for NY State Test Scores

Predictive Analysis Overview
Multiple Linear Regression
Decision Tree Analysis
Monte Carlo Model

Predictive Analysis Overview

Each year the New York City Department of Education administers State Tests to most of the 1.1 million children that attend public schools in the city. The NYC DOE publishes test score results catagorically, measuring both Common Core Math and English results by district, grade and school as follows . . .

1 far below proficient  
2 below proficient  
3 proficient  
4 above proficient  
5 passing - cohort acheiving levels 3 or 4

Explore, analyze and model this data set containing approximately 15000 records. Each record represents a school’s per grade test scores from the years 2013 to 2016 inclusive. Passing percentage statistics are provided for each record. The objective is to build a multiple linear regression model on the training data to predict the percentage passing for a school/grade in future years given the current predictors.

Multiple Least Squares Regression

The response variable is deterministic and answers the simple question, what percentage of students in the cohert pass the state test?

Estimating the Regression Coefficients

The coefficients \(β0\) and \(β1\) are initially unknown, and must be estimated based on the available training data. Recall that linear regression uses the least squares approach to estimate the coefficients.

We now fit a basic regression model from the training data and inspect the predicted results from testing the data.

Multiple Linear Regression

Descriptive Statistics & Exploration

Based upon the data summaries, we see there are 11 NA’s in the Lvl_Pass_Pct variables. The missing data will be removed to save time since its a small proportion of the full dataset. The percentage Passed and Headcount variables are both right skewed distributions and the District data is multi-modal distribution. Year is a uniform distribution.

Note the historgrams below for a better visualliztion of the distribution and skew of these datasets.

summary(training)

##     District      School_Num     Year        Headcount    
##  10     : 200   18     :  18   2013:3572   Min.   :  6.0  
##  27     : 169   20     :  18   2014:   0   1st Qu.: 58.0  
##  31     : 168   41     :  18   2015:   0   Median : 86.0  
##  2      : 150   46     :  18   2016:   0   Mean   :109.3  
##  9      : 148   50     :  18               3rd Qu.:127.0  
##  11     : 135   89     :  18               Max.   :791.0  
##  (Other):2602   (Other):3464                              
##   Lvl_Pass_Pct   
##  Min.   :  0.00  
##  1st Qu.: 10.50  
##  Median : 20.30  
##  Mean   : 25.44  
##  3rd Qu.: 35.70  
##  Max.   :100.00  
##

par(mfrow = c(3,2))
Z <- cbind(X, Y)
colnames(Z) <- colnames(cbind(X,Y))
for (i in 1:ncol(Z)) {
  variable <- names(Z[i])
  hist(as.numeric(Z[,i]), xlab = variable, main = variable)
}

par(mfrow = c(3,2))

for (i in 1:ncol(Z)) {
  variable <- names(Z[i])
  d <- density(as.numeric(Z[,i]), na.rm = TRUE)
  plot(d, main = variable)
  polygon(d, col="red")
}

d <- density(as.numeric(MyData[,5]), na.rm = TRUE)
plot(d, main = variable)
polygon(d, col="red")

par(mfrow = c(2,4))
Z <- cbind(X,Y)
colnames(Z) <- c(colnames(X),"Lvl_Pass_Pct")
pairs(Z)

Data Modelling and Prediction

Using the training data set, build a multiple linear regression models and use it to predict passing rate for the ELA test.

Multicollinearity

Since the p-value is small, we accept the null hypothesis that there is not multicollinearity and conclude that the data supports the existence of some correlation. Multicollinearity can underestimate standard error and make predictors appear significant when they are not.

dwtest(model1)

## 
##  Durbin-Watson test
## 
## data:  model1
## DW = 1.9873, p-value = 0.1904
## alternative hypothesis: true autocorrelation is greater than 0

Mean Squared Error and RMSE

For this model, standard error of the mean (RMSE) is fairly large relative to the target variable. In this model, the standard deviation of the unexplained variance in Lvl_Pass_Pct is 15.59584 which is a large deviation from the 29.43609 average passed encountered in the data.

data.frame("MSE" = summary(model1)$sigma^2, "RMSE" = summary(model1)$sigma)

##        MSE     RMSE
## 1 243.2301 15.59584

R^2 and Adjusted R^2

R2 represents the percent change in Pass Percentage explained by the predictor variables. R2 is somewhat low for this model but shows predictive value.

data.frame("r.squared" = summary(model1)$r.squared, "adj.r.squared" = summary(model1)$adj.r.squared)

##   r.squared adj.r.squared
## 1 0.3541081     0.3482679

F-Statistic

The F-test evaluates the null hypothesis that all the regression coefficients are equal to zero versus the alternative that at least one does not.

summary(model1)$fstatistic

##      value      numdf      dendf 
##   60.63265   32.00000 3539.00000

Residuals

The Residuals vs Fitted plot shows that the residuals appear to have a linear pattern with outliers that show a heteroschedstic pattern. The Normal Q-Q plot shows that the residuals appear to be a somewhat rightward skewed distributed. The Scale-Location plot appears to show some heteroscedasticity since the line is not horizontal with equally (randomly) spread points. The Residuals vs Leverage plot shows some extreme values outside the Cooks distance (dashed curve) that influence the (solid) regression line.

par(mfrow = c(2,2))
plot(model1)

Predictions

df_predictions <- data.frame(predict(model1, testing))
par(mfrow = c(2,2))
hist(df_predictions$predict.model1..testing., xlab = 'Predicted Pass%', main = 'Predicted Pass%')
hist(training$Lvl_Pass_Pct, xlab = 'Training Pass%', main = 'Training Pass%')

Decision Tree Analysis

Multiple Least Squares Regression

The response variable is now catagorical and answers the simple question, does the school have greater than 50 percent of students in the cohert pass the state test?

library(dplyr)
library(ISLR)
library(tree)

MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv", header=TRUE, sep=",",stringsAsFactors=FALSE)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData$Lvl_Pass_Pct <- ifelse(MyData$Lvl_Pass_Pct > 50,"Yes","No")
MyData$Lvl_Pass_Pct <- as.factor(MyData$Lvl_Pass_Pct)
MyData <- data.frame(MyData[,c(2,4,5,8,10,20)])


training <- filter(MyData, Year == 2013)

summary(MyData)

##    Borough             District       School_Num         Year     
##  Length:14704       Min.   : 1.00   Min.   :  1.0   Min.   :2013  
##  Class :character   1st Qu.: 9.00   1st Qu.: 80.0   1st Qu.:2014  
##  Mode  :character   Median :16.00   Median :171.0   Median :2015  
##                     Mean   :16.48   Mean   :204.4   Mean   :2015  
##                     3rd Qu.:25.00   3rd Qu.:288.0   3rd Qu.:2016  
##                     Max.   :32.00   Max.   :971.0   Max.   :2016  
##    Headcount     Lvl_Pass_Pct
##  Min.   :  1.0   No :12276   
##  1st Qu.: 55.0   Yes: 2428   
##  Median : 84.0               
##  Mean   :105.5               
##  3rd Qu.:123.0               
##  Max.   :804.0

#split data into testing and training sets using
## 50% of the sample size
smp_size <- floor(0.5 * nrow(MyData))

## set the seed to make your partition reproductible
set.seed(2)
train_ind <- sample(seq_len(nrow(MyData)), size = smp_size)

training_data <- MyData[train_ind, ]
testing_data <- MyData[-train_ind, ]
testing_predict <- testing_data$Lvl_Pass_Pct

# fit the tree model using training data
tree_model <- tree(Lvl_Pass_Pct~., training_data)
plot(tree_model)
text(tree_model,cex=.75, pretty=0)

# check model's predictive potential
tree_pred <- predict(tree_model, testing_data, type='class')
mean(tree_pred != testing_predict)

## [1] 0.1444505

Pruning & Overfitting

If you plot the complexity of the curve fit by the deviance factor, deviance is minimized by fitting a 3rd degree equation. However, the effect of pruning to this degree is minimal as the predictive potential between the original and pruned model show the same result.

# cross validate to test where to stop pruning
set.seed(3)
cv.tree <- cv.tree(tree_model, FUN=prune.misclass)
names(cv.tree)

## [1] "size"   "dev"    "k"      "method"

plot(cv.tree$size, cv.tree$dev, type = 'b')

#prune the tree
pruned_model = prune.misclass(tree_model, best = 3)
plot(pruned_model)
text(pruned_model, pretty = 0)

# recheck model's predictive potential
tree_pred <- predict(pruned_model, testing_data, type='class')
mean(tree_pred != testing_predict)

## [1] 0.1444505

Monte Carlo Model

Definition: Monte Carlo is the art of approximating an expectation by the sample mean of a function of simulated random variables.

Monte Carlo approximations to distributions.

A simple extension of this is to sample the density distribution of the data set, approximate it with a Poisson probability distribution and show a Monte Carlo distribution of expected values shown by the histogram below.

Simulation

library(MASS)

MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv", 
header=TRUE, sep=",",stringsAsFactors=FALSE)
names = c('District','School_Num','Year')
MyData[,names] <- lapply(MyData[,names] , factor)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData <- data.frame(MyData[,c(4,5,8,10,20)])
summary(MyData)

##     District       School_Num      Year        Headcount    
##  10     :  800   18     :   72   2013:3583   Min.   :  1.0  
##  27     :  695   20     :   72   2014:3652   1st Qu.: 55.0  
##  31     :  695   41     :   72   2015:3707   Median : 84.0  
##  2      :  638   46     :   72   2016:3762   Mean   :105.5  
##  9      :  623   89     :   72               3rd Qu.:123.0  
##  11     :  571   206    :   72               Max.   :804.0  
##  (Other):10682   (Other):14272                              
##   Lvl_Pass_Pct   
##  Min.   :  0.00  
##  1st Qu.: 12.90  
##  Median : 24.30  
##  Mean   : 29.35  
##  3rd Qu.: 41.90  
##  Max.   :100.00  
##

hist(MyData$Lvl_Pass_Pct)

fitdistr(trunc(MyData[complete.cases(MyData),5]), "poisson")

##      lambda  
##   28.9161453 
##  ( 0.0443458)

with(MyData, tapply(Lvl_Pass_Pct, District, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))

##                        1                        2                        3 
## "M (SD) = 31.19 (25.88)" "M (SD) = 56.21 (23.25)" "M (SD) = 40.74 (30.96)" 
##                        4                        5                        6 
## "M (SD) = 25.43 (21.46)" "M (SD) = 17.28 (16.11)" "M (SD) = 20.15 (15.00)" 
##                        7                        8                        9 
## "M (SD) = 13.19 (11.05)" "M (SD) = 19.62 (12.84)" "M (SD) = 15.37 (11.02)" 
##                       10                       11                       12 
## "M (SD) = 19.86 (12.48)" "M (SD) = 22.91 (12.28)"  "M (SD) = 12.77 (6.81)" 
##                       13                       14                       15 
## "M (SD) = 25.13 (18.44)" "M (SD) = 24.23 (17.28)" "M (SD) = 41.35 (26.17)" 
##                       16                       17                       18 
## "M (SD) = 20.22 (14.80)" "M (SD) = 22.25 (15.44)" "M (SD) = 24.74 (16.86)" 
##                       19                       20                       21 
## "M (SD) = 17.13 (10.63)" "M (SD) = 42.32 (17.95)" "M (SD) = 34.96 (17.15)" 
##                       22                       23                       24 
## "M (SD) = 36.41 (16.76)" "M (SD) = 15.21 (12.70)" "M (SD) = 37.10 (14.90)" 
##                       25                       26                       27 
## "M (SD) = 45.82 (13.25)" "M (SD) = 59.95 (11.73)" "M (SD) = 29.28 (17.12)" 
##                       28                       29                       30 
## "M (SD) = 36.95 (19.77)" "M (SD) = 27.58 (13.18)" "M (SD) = 34.89 (21.34)" 
##                       31                       32 
## "M (SD) = 37.40 (16.83)" "M (SD) = 19.77 (16.22)"

#get the lambda and sd for poisson and use this for the fitted model
parms <- fitdistr(MyData[complete.cases(MyData),5], "poisson")
parms

##      lambda   
##   29.35401251 
##  ( 0.04468029)

lambda <- parms$estimate
sd_x     <- parms$sd  

#fitted model
#sample of 500 random values from distribution
rpois(500,lambda)

##   [1] 22 32 32 32 24 26 35 28 21 36 40 41 36 29 27 36 23 29 36 31 35 29 29
##  [24] 37 25 37 35 36 22 26 20 24 20 39 27 27 25 31 39 32 32 30 26 38 39 26
##  [47] 30 22 33 33 30 30 26 34 24 35 29 39 26 35 31 33 33 39 27 29 32 25 27
##  [70] 27 33 27 32 39 35 31 22 26 27 26 19 29 27 26 26 33 27 21 25 43 18 31
##  [93] 25 35 23 28 23 25 31 39 29 26 26 22 33 19 29 33 27 31 29 28 25 26 25
## [116] 29 29 34 27 21 25 21 37 30 23 31 25 22 39 40 19 26 29 27 30 26 17 21
## [139] 35 32 27 36 23 36 29 33 24 22 43 23 35 27 36 23 24 38 25 28 32 24 23
## [162] 27 27 25 27 24 25 26 36 34 29 31 30 23 30 26 27 37 32 36 28 16 35 29
## [185] 16 30 35 24 35 30 29 28 35 24 28 28 36 29 27 35 23 31 29 30 29 30 27
## [208] 33 32 26 25 33 24 24 26 30 30 38 22 25 20 27 24 40 34 32 24 27 26 39
## [231] 33 38 26 25 33 35 25 32 12 29 37 32 28 29 30 19 27 31 29 30 42 33 46
## [254] 23 34 20 20 22 28 35 32 34 25 28 15 20 22 28 21 30 25 29 35 35 21 25
## [277] 29 17 44 22 36 25 33 29 30 28 30 18 23 29 21 34 36 41 23 29 25 27 34
## [300] 33 30 27 19 26 24 36 23 14 32 26 30 23 29 24 31 23 21 15 27 22 25 39
## [323] 30 40 31 29 27 32 32 29 34 31 33 29 34 29 29 30 33 28 31 34 24 21 29
## [346] 25 24 33 45 29 36 29 26 23 35 21 33 30 29 28 23 33 39 20 26 29 34 39
## [369] 19 27 39 37 27 27 14 34 30 32 30 31 30 30 28 23 31 26 27 36 28 37 25
## [392] 23 34 25 45 34 26 25 38 40 26 30 35 32 27 33 20 28 23 23 25 33 32 19
## [415] 29 32 33 23 29 29 20 24 29 31 23 26 32 37 19 34 21 32 25 24 34 27 31
## [438] 22 34 31 22 32 34 40 35 34 36 29 26 20 31 36 41 25 35 34 25 24 26 24
## [461] 21 29 26 37 21 28 27 33 27 46 25 25 33 28 31 33 35 21 31 24 20 22 23
## [484] 34 30 35 32 30 27 29 19 26 31 31 26 29 28 19 27 26

barplot(table(rpois(10000,lambda)),xlab='Percentage Passed',ylab='Frequency', main='Poisson Distribution of State Test Scores')

References

For more details on using R Markdown see http://rmarkdown.rstudio.com. https://rpubs.com/josezuniga/253955 https://stats.idre.ucla.edu/r/dae/poisson-regression/ http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf

Predictive Model for NY State Test Scores

Scott Karr

May 14, 2017

Predictive Analysis Overview

Multiple Least Squares Regression

Estimating the Regression Coefficients

Multiple Linear Regression

Descriptive Statistics & Exploration

Data Modelling and Prediction

Multicollinearity

Mean Squared Error and RMSE

R^2 and Adjusted R^2

F-Statistic

Residuals

Predictions

Decision Tree Analysis

Multiple Least Squares Regression

Pruning & Overfitting

Monte Carlo Model

Simulation

References