Predictive Analysis Overview

Each year the New York City Department of Education administers State Tests to most of the 1.1 million children that attend public schools in the city. The NYC DOE publishes test score results catagorically, measuring both Common Core Math and English results by district, grade and school as follows . . .

1 far below proficient  
2 below proficient  
3 proficient  
4 above proficient  
5 passing - cohort acheiving levels 3 or 4

Explore, analyze and model this data set containing approximately 15000 records. Each record represents a school’s per grade test scores from the years 2013 to 2016 inclusive. Passing percentage statistics are provided for each record. The objective is to build a multiple linear regression model on the training data to predict the percentage passing for a school/grade in future years given the current predictors.

Multiple Least Squares Regression

The response variable is deterministic and answers the simple question, what percentage of students in the cohert pass the state test?

Estimating the Regression Coefficients

The coefficients \(β0\) and \(β1\) are initially unknown, and must be estimated based on the available training data. Recall that linear regression uses the least squares approach to estimate the coefficients.

We now fit a basic regression model from the training data and inspect the predicted results from testing the data.

Multiple Linear Regression

Descriptive Statistics & Exploration

Based upon the data summaries, we see there are 11 NA’s in the Lvl_Pass_Pct variables. The missing data will be removed to save time since its a small proportion of the full dataset. The percentage Passed and Headcount variables are both right skewed distributions and the District data is multi-modal distribution. Year is a uniform distribution.

Note the historgrams below for a better visualliztion of the distribution and skew of these datasets.

summary(training)
##     District      School_Num     Year        Headcount    
##  10     : 200   18     :  18   2013:3572   Min.   :  6.0  
##  27     : 169   20     :  18   2014:   0   1st Qu.: 58.0  
##  31     : 168   41     :  18   2015:   0   Median : 86.0  
##  2      : 150   46     :  18   2016:   0   Mean   :109.3  
##  9      : 148   50     :  18               3rd Qu.:127.0  
##  11     : 135   89     :  18               Max.   :791.0  
##  (Other):2602   (Other):3464                              
##   Lvl_Pass_Pct   
##  Min.   :  0.00  
##  1st Qu.: 10.50  
##  Median : 20.30  
##  Mean   : 25.44  
##  3rd Qu.: 35.70  
##  Max.   :100.00  
## 
par(mfrow = c(3,2))
Z <- cbind(X, Y)
colnames(Z) <- colnames(cbind(X,Y))
for (i in 1:ncol(Z)) {
  variable <- names(Z[i])
  hist(as.numeric(Z[,i]), xlab = variable, main = variable)
}

par(mfrow = c(3,2))

for (i in 1:ncol(Z)) {
  variable <- names(Z[i])
  d <- density(as.numeric(Z[,i]), na.rm = TRUE)
  plot(d, main = variable)
  polygon(d, col="red")
}

d <- density(as.numeric(MyData[,5]), na.rm = TRUE)
plot(d, main = variable)
polygon(d, col="red")

par(mfrow = c(2,4))
Z <- cbind(X,Y)
colnames(Z) <- c(colnames(X),"Lvl_Pass_Pct")
pairs(Z)

Data Modelling and Prediction

Using the training data set, build a multiple linear regression models and use it to predict passing rate for the ELA test.

Multicollinearity

Since the p-value is small, we accept the null hypothesis that there is not multicollinearity and conclude that the data supports the existence of some correlation. Multicollinearity can underestimate standard error and make predictors appear significant when they are not.

dwtest(model1)
## 
##  Durbin-Watson test
## 
## data:  model1
## DW = 1.9873, p-value = 0.1904
## alternative hypothesis: true autocorrelation is greater than 0
Mean Squared Error and RMSE

For this model, standard error of the mean (RMSE) is fairly large relative to the target variable. In this model, the standard deviation of the unexplained variance in Lvl_Pass_Pct is 15.59584 which is a large deviation from the 29.43609 average passed encountered in the data.

data.frame("MSE" = summary(model1)$sigma^2, "RMSE" = summary(model1)$sigma)
##        MSE     RMSE
## 1 243.2301 15.59584
R^2 and Adjusted R^2

R2 represents the percent change in Pass Percentage explained by the predictor variables. R2 is somewhat low for this model but shows predictive value.

data.frame("r.squared" = summary(model1)$r.squared, "adj.r.squared" = summary(model1)$adj.r.squared)
##   r.squared adj.r.squared
## 1 0.3541081     0.3482679
F-Statistic

The F-test evaluates the null hypothesis that all the regression coefficients are equal to zero versus the alternative that at least one does not.

summary(model1)$fstatistic
##      value      numdf      dendf 
##   60.63265   32.00000 3539.00000
Residuals

The Residuals vs Fitted plot shows that the residuals appear to have a linear pattern with outliers that show a heteroschedstic pattern. The Normal Q-Q plot shows that the residuals appear to be a somewhat rightward skewed distributed. The Scale-Location plot appears to show some heteroscedasticity since the line is not horizontal with equally (randomly) spread points. The Residuals vs Leverage plot shows some extreme values outside the Cooks distance (dashed curve) that influence the (solid) regression line.

par(mfrow = c(2,2))
plot(model1)

Predictions
df_predictions <- data.frame(predict(model1, testing))
par(mfrow = c(2,2))
hist(df_predictions$predict.model1..testing., xlab = 'Predicted Pass%', main = 'Predicted Pass%')
hist(training$Lvl_Pass_Pct, xlab = 'Training Pass%', main = 'Training Pass%')

Decision Tree Analysis

Multiple Least Squares Regression

The response variable is now catagorical and answers the simple question, does the school have greater than 50 percent of students in the cohert pass the state test?

library(dplyr)
library(ISLR)
library(tree)

MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv", header=TRUE, sep=",",stringsAsFactors=FALSE)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData$Lvl_Pass_Pct <- ifelse(MyData$Lvl_Pass_Pct > 50,"Yes","No")
MyData$Lvl_Pass_Pct <- as.factor(MyData$Lvl_Pass_Pct)
MyData <- data.frame(MyData[,c(2,4,5,8,10,20)])


training <- filter(MyData, Year == 2013)

summary(MyData)
##    Borough             District       School_Num         Year     
##  Length:14704       Min.   : 1.00   Min.   :  1.0   Min.   :2013  
##  Class :character   1st Qu.: 9.00   1st Qu.: 80.0   1st Qu.:2014  
##  Mode  :character   Median :16.00   Median :171.0   Median :2015  
##                     Mean   :16.48   Mean   :204.4   Mean   :2015  
##                     3rd Qu.:25.00   3rd Qu.:288.0   3rd Qu.:2016  
##                     Max.   :32.00   Max.   :971.0   Max.   :2016  
##    Headcount     Lvl_Pass_Pct
##  Min.   :  1.0   No :12276   
##  1st Qu.: 55.0   Yes: 2428   
##  Median : 84.0               
##  Mean   :105.5               
##  3rd Qu.:123.0               
##  Max.   :804.0
#split data into testing and training sets using
## 50% of the sample size
smp_size <- floor(0.5 * nrow(MyData))

## set the seed to make your partition reproductible
set.seed(2)
train_ind <- sample(seq_len(nrow(MyData)), size = smp_size)

training_data <- MyData[train_ind, ]
testing_data <- MyData[-train_ind, ]
testing_predict <- testing_data$Lvl_Pass_Pct

# fit the tree model using training data
tree_model <- tree(Lvl_Pass_Pct~., training_data)
plot(tree_model)
text(tree_model,cex=.75, pretty=0)

# check model's predictive potential
tree_pred <- predict(tree_model, testing_data, type='class')
mean(tree_pred != testing_predict)
## [1] 0.1444505
Pruning & Overfitting

If you plot the complexity of the curve fit by the deviance factor, deviance is minimized by fitting a 3rd degree equation. However, the effect of pruning to this degree is minimal as the predictive potential between the original and pruned model show the same result.

# cross validate to test where to stop pruning
set.seed(3)
cv.tree <- cv.tree(tree_model, FUN=prune.misclass)
names(cv.tree)
## [1] "size"   "dev"    "k"      "method"
plot(cv.tree$size, cv.tree$dev, type = 'b')

#prune the tree
pruned_model = prune.misclass(tree_model, best = 3)
plot(pruned_model)
text(pruned_model, pretty = 0)

# recheck model's predictive potential
tree_pred <- predict(pruned_model, testing_data, type='class')
mean(tree_pred != testing_predict) 
## [1] 0.1444505

Monte Carlo Model

Definition: Monte Carlo is the art of approximating an expectation by the sample mean of a function of simulated random variables.

Monte Carlo approximations to distributions.

A simple extension of this is to sample the density distribution of the data set, approximate it with a Poisson probability distribution and show a Monte Carlo distribution of expected values shown by the histogram below.

Simulation
library(MASS)

MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv", 
header=TRUE, sep=",",stringsAsFactors=FALSE)
names = c('District','School_Num','Year')
MyData[,names] <- lapply(MyData[,names] , factor)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData <- data.frame(MyData[,c(4,5,8,10,20)])
summary(MyData)
##     District       School_Num      Year        Headcount    
##  10     :  800   18     :   72   2013:3583   Min.   :  1.0  
##  27     :  695   20     :   72   2014:3652   1st Qu.: 55.0  
##  31     :  695   41     :   72   2015:3707   Median : 84.0  
##  2      :  638   46     :   72   2016:3762   Mean   :105.5  
##  9      :  623   89     :   72               3rd Qu.:123.0  
##  11     :  571   206    :   72               Max.   :804.0  
##  (Other):10682   (Other):14272                              
##   Lvl_Pass_Pct   
##  Min.   :  0.00  
##  1st Qu.: 12.90  
##  Median : 24.30  
##  Mean   : 29.35  
##  3rd Qu.: 41.90  
##  Max.   :100.00  
## 
hist(MyData$Lvl_Pass_Pct)

fitdistr(trunc(MyData[complete.cases(MyData),5]), "poisson")
##      lambda  
##   28.9161453 
##  ( 0.0443458)
with(MyData, tapply(Lvl_Pass_Pct, District, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                        1                        2                        3 
## "M (SD) = 31.19 (25.88)" "M (SD) = 56.21 (23.25)" "M (SD) = 40.74 (30.96)" 
##                        4                        5                        6 
## "M (SD) = 25.43 (21.46)" "M (SD) = 17.28 (16.11)" "M (SD) = 20.15 (15.00)" 
##                        7                        8                        9 
## "M (SD) = 13.19 (11.05)" "M (SD) = 19.62 (12.84)" "M (SD) = 15.37 (11.02)" 
##                       10                       11                       12 
## "M (SD) = 19.86 (12.48)" "M (SD) = 22.91 (12.28)"  "M (SD) = 12.77 (6.81)" 
##                       13                       14                       15 
## "M (SD) = 25.13 (18.44)" "M (SD) = 24.23 (17.28)" "M (SD) = 41.35 (26.17)" 
##                       16                       17                       18 
## "M (SD) = 20.22 (14.80)" "M (SD) = 22.25 (15.44)" "M (SD) = 24.74 (16.86)" 
##                       19                       20                       21 
## "M (SD) = 17.13 (10.63)" "M (SD) = 42.32 (17.95)" "M (SD) = 34.96 (17.15)" 
##                       22                       23                       24 
## "M (SD) = 36.41 (16.76)" "M (SD) = 15.21 (12.70)" "M (SD) = 37.10 (14.90)" 
##                       25                       26                       27 
## "M (SD) = 45.82 (13.25)" "M (SD) = 59.95 (11.73)" "M (SD) = 29.28 (17.12)" 
##                       28                       29                       30 
## "M (SD) = 36.95 (19.77)" "M (SD) = 27.58 (13.18)" "M (SD) = 34.89 (21.34)" 
##                       31                       32 
## "M (SD) = 37.40 (16.83)" "M (SD) = 19.77 (16.22)"
#get the lambda and sd for poisson and use this for the fitted model
parms <- fitdistr(MyData[complete.cases(MyData),5], "poisson")
parms 
##      lambda   
##   29.35401251 
##  ( 0.04468029)
lambda <- parms$estimate
sd_x     <- parms$sd  

#fitted model
#sample of 500 random values from distribution
rpois(500,lambda) 
##   [1] 22 32 32 32 24 26 35 28 21 36 40 41 36 29 27 36 23 29 36 31 35 29 29
##  [24] 37 25 37 35 36 22 26 20 24 20 39 27 27 25 31 39 32 32 30 26 38 39 26
##  [47] 30 22 33 33 30 30 26 34 24 35 29 39 26 35 31 33 33 39 27 29 32 25 27
##  [70] 27 33 27 32 39 35 31 22 26 27 26 19 29 27 26 26 33 27 21 25 43 18 31
##  [93] 25 35 23 28 23 25 31 39 29 26 26 22 33 19 29 33 27 31 29 28 25 26 25
## [116] 29 29 34 27 21 25 21 37 30 23 31 25 22 39 40 19 26 29 27 30 26 17 21
## [139] 35 32 27 36 23 36 29 33 24 22 43 23 35 27 36 23 24 38 25 28 32 24 23
## [162] 27 27 25 27 24 25 26 36 34 29 31 30 23 30 26 27 37 32 36 28 16 35 29
## [185] 16 30 35 24 35 30 29 28 35 24 28 28 36 29 27 35 23 31 29 30 29 30 27
## [208] 33 32 26 25 33 24 24 26 30 30 38 22 25 20 27 24 40 34 32 24 27 26 39
## [231] 33 38 26 25 33 35 25 32 12 29 37 32 28 29 30 19 27 31 29 30 42 33 46
## [254] 23 34 20 20 22 28 35 32 34 25 28 15 20 22 28 21 30 25 29 35 35 21 25
## [277] 29 17 44 22 36 25 33 29 30 28 30 18 23 29 21 34 36 41 23 29 25 27 34
## [300] 33 30 27 19 26 24 36 23 14 32 26 30 23 29 24 31 23 21 15 27 22 25 39
## [323] 30 40 31 29 27 32 32 29 34 31 33 29 34 29 29 30 33 28 31 34 24 21 29
## [346] 25 24 33 45 29 36 29 26 23 35 21 33 30 29 28 23 33 39 20 26 29 34 39
## [369] 19 27 39 37 27 27 14 34 30 32 30 31 30 30 28 23 31 26 27 36 28 37 25
## [392] 23 34 25 45 34 26 25 38 40 26 30 35 32 27 33 20 28 23 23 25 33 32 19
## [415] 29 32 33 23 29 29 20 24 29 31 23 26 32 37 19 34 21 32 25 24 34 27 31
## [438] 22 34 31 22 32 34 40 35 34 36 29 26 20 31 36 41 25 35 34 25 24 26 24
## [461] 21 29 26 37 21 28 27 33 27 46 25 25 33 28 31 33 35 21 31 24 20 22 23
## [484] 34 30 35 32 30 27 29 19 26 31 31 26 29 28 19 27 26
barplot(table(rpois(10000,lambda)),xlab='Percentage Passed',ylab='Frequency', main='Poisson Distribution of State Test Scores')