Each year the New York City Department of Education administers State Tests to most of the 1.1 million children that attend public schools in the city. The NYC DOE publishes test score results catagorically, measuring both Common Core Math and English results by district, grade and school as follows . . .
1 far below proficient
2 below proficient
3 proficient
4 above proficient
5 passing - cohort acheiving levels 3 or 4
Explore, analyze and model this data set containing approximately 15000 records. Each record represents a school’s per grade test scores from the years 2013 to 2016 inclusive. Passing percentage statistics are provided for each record. The objective is to build a multiple linear regression model on the training data to predict the percentage passing for a school/grade in future years given the current predictors.
The response variable is deterministic and answers the simple question, what percentage of students in the cohert pass the state test?
The coefficients \(β0\) and \(β1\) are initially unknown, and must be estimated based on the available training data. Recall that linear regression uses the least squares approach to estimate the coefficients.
We now fit a basic regression model from the training data and inspect the predicted results from testing the data.
Based upon the data summaries, we see there are 11 NA’s in the Lvl_Pass_Pct variables. The missing data will be removed to save time since its a small proportion of the full dataset. The percentage Passed and Headcount variables are both right skewed distributions and the District data is multi-modal distribution. Year is a uniform distribution.
Note the historgrams below for a better visualliztion of the distribution and skew of these datasets.
summary(training)
## District School_Num Year Headcount
## 10 : 200 18 : 18 2013:3572 Min. : 6.0
## 27 : 169 20 : 18 2014: 0 1st Qu.: 58.0
## 31 : 168 41 : 18 2015: 0 Median : 86.0
## 2 : 150 46 : 18 2016: 0 Mean :109.3
## 9 : 148 50 : 18 3rd Qu.:127.0
## 11 : 135 89 : 18 Max. :791.0
## (Other):2602 (Other):3464
## Lvl_Pass_Pct
## Min. : 0.00
## 1st Qu.: 10.50
## Median : 20.30
## Mean : 25.44
## 3rd Qu.: 35.70
## Max. :100.00
##
par(mfrow = c(3,2))
Z <- cbind(X, Y)
colnames(Z) <- colnames(cbind(X,Y))
for (i in 1:ncol(Z)) {
variable <- names(Z[i])
hist(as.numeric(Z[,i]), xlab = variable, main = variable)
}
par(mfrow = c(3,2))
for (i in 1:ncol(Z)) {
variable <- names(Z[i])
d <- density(as.numeric(Z[,i]), na.rm = TRUE)
plot(d, main = variable)
polygon(d, col="red")
}
d <- density(as.numeric(MyData[,5]), na.rm = TRUE)
plot(d, main = variable)
polygon(d, col="red")
par(mfrow = c(2,4))
Z <- cbind(X,Y)
colnames(Z) <- c(colnames(X),"Lvl_Pass_Pct")
pairs(Z)
Using the training data set, build a multiple linear regression models and use it to predict passing rate for the ELA test.
Since the p-value is small, we accept the null hypothesis that there is not multicollinearity and conclude that the data supports the existence of some correlation. Multicollinearity can underestimate standard error and make predictors appear significant when they are not.
dwtest(model1)
##
## Durbin-Watson test
##
## data: model1
## DW = 1.9873, p-value = 0.1904
## alternative hypothesis: true autocorrelation is greater than 0
For this model, standard error of the mean (RMSE) is fairly large relative to the target variable. In this model, the standard deviation of the unexplained variance in Lvl_Pass_Pct is 15.59584 which is a large deviation from the 29.43609 average passed encountered in the data.
data.frame("MSE" = summary(model1)$sigma^2, "RMSE" = summary(model1)$sigma)
## MSE RMSE
## 1 243.2301 15.59584
R2 represents the percent change in Pass Percentage explained by the predictor variables. R2 is somewhat low for this model but shows predictive value.
data.frame("r.squared" = summary(model1)$r.squared, "adj.r.squared" = summary(model1)$adj.r.squared)
## r.squared adj.r.squared
## 1 0.3541081 0.3482679
The F-test evaluates the null hypothesis that all the regression coefficients are equal to zero versus the alternative that at least one does not.
summary(model1)$fstatistic
## value numdf dendf
## 60.63265 32.00000 3539.00000
The Residuals vs Fitted plot shows that the residuals appear to have a linear pattern with outliers that show a heteroschedstic pattern. The Normal Q-Q plot shows that the residuals appear to be a somewhat rightward skewed distributed. The Scale-Location plot appears to show some heteroscedasticity since the line is not horizontal with equally (randomly) spread points. The Residuals vs Leverage plot shows some extreme values outside the Cooks distance (dashed curve) that influence the (solid) regression line.
par(mfrow = c(2,2))
plot(model1)
df_predictions <- data.frame(predict(model1, testing))
par(mfrow = c(2,2))
hist(df_predictions$predict.model1..testing., xlab = 'Predicted Pass%', main = 'Predicted Pass%')
hist(training$Lvl_Pass_Pct, xlab = 'Training Pass%', main = 'Training Pass%')
The response variable is now catagorical and answers the simple question, does the school have greater than 50 percent of students in the cohert pass the state test?
library(dplyr)
library(ISLR)
library(tree)
MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv", header=TRUE, sep=",",stringsAsFactors=FALSE)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData$Lvl_Pass_Pct <- ifelse(MyData$Lvl_Pass_Pct > 50,"Yes","No")
MyData$Lvl_Pass_Pct <- as.factor(MyData$Lvl_Pass_Pct)
MyData <- data.frame(MyData[,c(2,4,5,8,10,20)])
training <- filter(MyData, Year == 2013)
summary(MyData)
## Borough District School_Num Year
## Length:14704 Min. : 1.00 Min. : 1.0 Min. :2013
## Class :character 1st Qu.: 9.00 1st Qu.: 80.0 1st Qu.:2014
## Mode :character Median :16.00 Median :171.0 Median :2015
## Mean :16.48 Mean :204.4 Mean :2015
## 3rd Qu.:25.00 3rd Qu.:288.0 3rd Qu.:2016
## Max. :32.00 Max. :971.0 Max. :2016
## Headcount Lvl_Pass_Pct
## Min. : 1.0 No :12276
## 1st Qu.: 55.0 Yes: 2428
## Median : 84.0
## Mean :105.5
## 3rd Qu.:123.0
## Max. :804.0
#split data into testing and training sets using
## 50% of the sample size
smp_size <- floor(0.5 * nrow(MyData))
## set the seed to make your partition reproductible
set.seed(2)
train_ind <- sample(seq_len(nrow(MyData)), size = smp_size)
training_data <- MyData[train_ind, ]
testing_data <- MyData[-train_ind, ]
testing_predict <- testing_data$Lvl_Pass_Pct
# fit the tree model using training data
tree_model <- tree(Lvl_Pass_Pct~., training_data)
plot(tree_model)
text(tree_model,cex=.75, pretty=0)
# check model's predictive potential
tree_pred <- predict(tree_model, testing_data, type='class')
mean(tree_pred != testing_predict)
## [1] 0.1444505
If you plot the complexity of the curve fit by the deviance factor, deviance is minimized by fitting a 3rd degree equation. However, the effect of pruning to this degree is minimal as the predictive potential between the original and pruned model show the same result.
# cross validate to test where to stop pruning
set.seed(3)
cv.tree <- cv.tree(tree_model, FUN=prune.misclass)
names(cv.tree)
## [1] "size" "dev" "k" "method"
plot(cv.tree$size, cv.tree$dev, type = 'b')
#prune the tree
pruned_model = prune.misclass(tree_model, best = 3)
plot(pruned_model)
text(pruned_model, pretty = 0)
# recheck model's predictive potential
tree_pred <- predict(pruned_model, testing_data, type='class')
mean(tree_pred != testing_predict)
## [1] 0.1444505
Definition: Monte Carlo is the art of approximating an expectation by the sample mean of a function of simulated random variables.
Monte Carlo approximations to distributions.
A simple extension of this is to sample the density distribution of the data set, approximate it with a Poisson probability distribution and show a Monte Carlo distribution of expected values shown by the histogram below.
library(MASS)
MyData <- read.csv(file="https://raw.githubusercontent.com/scottkarr/IS608FinalProject/master/data/ELAScoresBySchool.csv",
header=TRUE, sep=",",stringsAsFactors=FALSE)
names = c('District','School_Num','Year')
MyData[,names] <- lapply(MyData[,names] , factor)
MyData$Lvl_Pass_Pct <- as.numeric(MyData$Lvl_Pass_Pct)
MyData[is.na(MyData)] <- 0
MyData <- data.frame(MyData[,c(4,5,8,10,20)])
summary(MyData)
## District School_Num Year Headcount
## 10 : 800 18 : 72 2013:3583 Min. : 1.0
## 27 : 695 20 : 72 2014:3652 1st Qu.: 55.0
## 31 : 695 41 : 72 2015:3707 Median : 84.0
## 2 : 638 46 : 72 2016:3762 Mean :105.5
## 9 : 623 89 : 72 3rd Qu.:123.0
## 11 : 571 206 : 72 Max. :804.0
## (Other):10682 (Other):14272
## Lvl_Pass_Pct
## Min. : 0.00
## 1st Qu.: 12.90
## Median : 24.30
## Mean : 29.35
## 3rd Qu.: 41.90
## Max. :100.00
##
hist(MyData$Lvl_Pass_Pct)
fitdistr(trunc(MyData[complete.cases(MyData),5]), "poisson")
## lambda
## 28.9161453
## ( 0.0443458)
with(MyData, tapply(Lvl_Pass_Pct, District, function(x) {
sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
## 1 2 3
## "M (SD) = 31.19 (25.88)" "M (SD) = 56.21 (23.25)" "M (SD) = 40.74 (30.96)"
## 4 5 6
## "M (SD) = 25.43 (21.46)" "M (SD) = 17.28 (16.11)" "M (SD) = 20.15 (15.00)"
## 7 8 9
## "M (SD) = 13.19 (11.05)" "M (SD) = 19.62 (12.84)" "M (SD) = 15.37 (11.02)"
## 10 11 12
## "M (SD) = 19.86 (12.48)" "M (SD) = 22.91 (12.28)" "M (SD) = 12.77 (6.81)"
## 13 14 15
## "M (SD) = 25.13 (18.44)" "M (SD) = 24.23 (17.28)" "M (SD) = 41.35 (26.17)"
## 16 17 18
## "M (SD) = 20.22 (14.80)" "M (SD) = 22.25 (15.44)" "M (SD) = 24.74 (16.86)"
## 19 20 21
## "M (SD) = 17.13 (10.63)" "M (SD) = 42.32 (17.95)" "M (SD) = 34.96 (17.15)"
## 22 23 24
## "M (SD) = 36.41 (16.76)" "M (SD) = 15.21 (12.70)" "M (SD) = 37.10 (14.90)"
## 25 26 27
## "M (SD) = 45.82 (13.25)" "M (SD) = 59.95 (11.73)" "M (SD) = 29.28 (17.12)"
## 28 29 30
## "M (SD) = 36.95 (19.77)" "M (SD) = 27.58 (13.18)" "M (SD) = 34.89 (21.34)"
## 31 32
## "M (SD) = 37.40 (16.83)" "M (SD) = 19.77 (16.22)"
#get the lambda and sd for poisson and use this for the fitted model
parms <- fitdistr(MyData[complete.cases(MyData),5], "poisson")
parms
## lambda
## 29.35401251
## ( 0.04468029)
lambda <- parms$estimate
sd_x <- parms$sd
#fitted model
#sample of 500 random values from distribution
rpois(500,lambda)
## [1] 22 32 32 32 24 26 35 28 21 36 40 41 36 29 27 36 23 29 36 31 35 29 29
## [24] 37 25 37 35 36 22 26 20 24 20 39 27 27 25 31 39 32 32 30 26 38 39 26
## [47] 30 22 33 33 30 30 26 34 24 35 29 39 26 35 31 33 33 39 27 29 32 25 27
## [70] 27 33 27 32 39 35 31 22 26 27 26 19 29 27 26 26 33 27 21 25 43 18 31
## [93] 25 35 23 28 23 25 31 39 29 26 26 22 33 19 29 33 27 31 29 28 25 26 25
## [116] 29 29 34 27 21 25 21 37 30 23 31 25 22 39 40 19 26 29 27 30 26 17 21
## [139] 35 32 27 36 23 36 29 33 24 22 43 23 35 27 36 23 24 38 25 28 32 24 23
## [162] 27 27 25 27 24 25 26 36 34 29 31 30 23 30 26 27 37 32 36 28 16 35 29
## [185] 16 30 35 24 35 30 29 28 35 24 28 28 36 29 27 35 23 31 29 30 29 30 27
## [208] 33 32 26 25 33 24 24 26 30 30 38 22 25 20 27 24 40 34 32 24 27 26 39
## [231] 33 38 26 25 33 35 25 32 12 29 37 32 28 29 30 19 27 31 29 30 42 33 46
## [254] 23 34 20 20 22 28 35 32 34 25 28 15 20 22 28 21 30 25 29 35 35 21 25
## [277] 29 17 44 22 36 25 33 29 30 28 30 18 23 29 21 34 36 41 23 29 25 27 34
## [300] 33 30 27 19 26 24 36 23 14 32 26 30 23 29 24 31 23 21 15 27 22 25 39
## [323] 30 40 31 29 27 32 32 29 34 31 33 29 34 29 29 30 33 28 31 34 24 21 29
## [346] 25 24 33 45 29 36 29 26 23 35 21 33 30 29 28 23 33 39 20 26 29 34 39
## [369] 19 27 39 37 27 27 14 34 30 32 30 31 30 30 28 23 31 26 27 36 28 37 25
## [392] 23 34 25 45 34 26 25 38 40 26 30 35 32 27 33 20 28 23 23 25 33 32 19
## [415] 29 32 33 23 29 29 20 24 29 31 23 26 32 37 19 34 21 32 25 24 34 27 31
## [438] 22 34 31 22 32 34 40 35 34 36 29 26 20 31 36 41 25 35 34 25 24 26 24
## [461] 21 29 26 37 21 28 27 33 27 46 25 25 33 28 31 33 35 21 31 24 20 22 23
## [484] 34 30 35 32 30 27 29 19 26 31 31 26 29 28 19 27 26
barplot(table(rpois(10000,lambda)),xlab='Percentage Passed',ylab='Frequency', main='Poisson Distribution of State Test Scores')
For more details on using R Markdown see http://rmarkdown.rstudio.com. https://rpubs.com/josezuniga/253955 https://stats.idre.ucla.edu/r/dae/poisson-regression/ http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf