DS 805: Statistical Learning
For this project, our group analyzed a combination of race and ethnicity data from the National Center for Education Statistics (NCES), which provides the Common Core of Data (CCD), and educational funding data from various governing entities.
The ElSi (Elementary/Secondary Information System) Table Generator, which we used to pull the data, can be found here.
The datasets are joined on school district name and include the number of students of each ethnicity (White, Black, Hispanic, etc.) in each district, as well as the dollar amounts of funding from different sources (federal, state, local, and miscellaneous).
Our goal for this project was to create multiple predictive models that could classify a given school district into one of four main locales: City, Suburb, Town, or Rural. We implemented Linear Discriminant Analysis, K-Nearest Neighbor, Classification Tree, Random Forest, Bagging, Boosting, and Support Vector Machine models and compared their predictive performance. We hypothesize that being able to predict locale classifications could help reveal a relationship between locale and how the educational system is structured.
## Rows: 15,292
## Columns: 37
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ Row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ AgencyName <chr> "21ST CENTURY CHARTER SCH OF GARY…
## $ State <chr> "Indiana", "Pennsylvania", "Arizo…
## $ FallMembership_16_17 <int> 888, 961, 64, 2278, 1069, 629, 12…
## $ RevenueLocalSources_16_17 <dbl> 516000, 12717000, 47000, 363000, …
## $ RevenueGeneral_16_17 <dbl> 9342000, 12848000, 492000, 201980…
## $ RevenueStateSources_16_17 <dbl> 7563000, 40000, 440000, 17778000,…
## $ RevenueFederalSources_16_17 <int> 1263000, 91000, 5000, 2057000, 14…
## $ CurrentExpendituresSecEdu_16_17 <dbl> 7447000, 11940000, 481000, 184750…
## $ CurrentExpenditures_16_17 <dbl> 8637000, 12528000, 495000, 185050…
## $ RevenuePerPupil_16_17 <int> 10520, 13369, 7688, 8867, 11573, …
## $ RevenueLocalSourcesPerPupil_16_17 <int> 581, 13233, 734, 159, 458, 167, 1…
## $ RevenueStateSourcesPerPupil_16_17 <int> 8517, 42, 6875, 7804, 9792, 7593,…
## $ RevenueFederalSourcesPerPupil_16_17 <int> 1422, 95, 78, 903, 1323, 1727, 73…
## $ AgencyID <int> 1800046, 4200091, 400328, 4800095…
## $ AgencyType_18_19 <chr> "7-Independent Charter District",…
## $ StartOfYearStatus_18_19 <chr> "1-Open", "1-Open", "1-Open", "1-…
## $ TotalStudentsAllGrades <int> 883, 1235, 65, 2084, 1409, 701, 1…
## $ AmerIndAlaska0tive <int> 0, 4, 2, 4, 2, 1, 0, 0, 0, 1, 5, …
## $ Asian_AsianPacificIsl <int> 0, 18, 0, 1, 7, 0, 3, 0, 0, 2, 1,…
## $ Hispanic <int> 23, 97, 22, 89, 1276, 7, 9, 51, 1…
## $ Black <int> 837, 128, 1, 1951, 53, 688, 48, 1…
## $ White <int> 0, 918, 37, 7, 61, 2, 53, 19, 393…
## $ Hawaiian_PacificIsl <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, …
## $ TwoOrMoreRaces <int> 23, 70, 3, 30, 10, 3, 17, 6, 2, 2…
## $ AmerIndAlaska0tive. <chr> "0.0%", "0.3%", "3.1%", "0.2%", "…
## $ Asian_AsianPacificIsl. <chr> "0.0%", "1.5%", "0.0%", "0.0%", "…
## $ Hispanic. <chr> "2.6%", "7.9%", "33.8%", "4.3%", …
## $ Black. <chr> "94.8%", "10.4%", "1.5%", "93.6%"…
## $ White. <chr> "0.0%", "74.3%", "56.9%", "0.3%",…
## $ Hawaiian_PacificIsl. <chr> "0.0%", "0.0%", "0.0%", "0.1%", "…
## $ TwoOrMoreRaces. <chr> "2.6%", "5.7%", "4.6%", "1.4%", "…
## $ CountyName_18_19 <chr> "Marion County", "Chester County"…
## $ UrbanCentricLocale_18_19 <chr> "13-City: Small", "21-Suburb: Lar…
## $ Size <chr> "City", "Suburb", "Town", "City",…
## $ Size_0_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, …
For the numerical variables, we plotted box plots grouped by Size, our locale variable.
## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow
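The box-plot code is not shown; the warning above arises from adding very large integer values when computing box-plot statistics, which converting the columns to double avoids. A minimal sketch of one such plot, assuming ggplot2 is loaded and using the column names from the glimpse above:
#revenue per pupil by locale; as.numeric() guards against integer-overflow warnings, and a log scale tames the right skew
ggplot(df, aes(x = Size, y = as.numeric(RevenuePerPupil_16_17), fill = Size)) +
geom_boxplot() +
scale_y_log10() +
theme_bw()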
For our locale variable, Size, we plotted a bar chart to visualize the total number of districts in each locale.
totals = data.frame(table(df$Size))
ggplot(data=df, aes(x= Size, fill = Size)) +
geom_bar()+
geom_text(aes(Var1, Freq +250, label = Freq, fill = NULL), data = totals)+
theme_bw()
We decided to take a stratified sample of our dataset using Size. With this data especially, it is important to take a sample that is representative of the entire population. Had we taken a simple random sample, it might not have included a sufficient number of districts from any given type of community (rural, urban, suburban, etc.). Our training sample, which includes 80% of the full dataset, therefore incorporates adequate representation from each type of community.
set.seed(123)
train <- df %>%
group_by(Size) %>%
sample_frac(0.8) %>%
ungroup()
train = as.data.frame(train)
x <- rbind(df, train)
test <- x[!duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(df),]
c(nrow(df),nrow(train),nrow(test))
## [1] 15292 12234 3058
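As a quick check that the stratification preserved the class balance, one can compare class proportions (a minimal sketch):
#proportions of each locale should be nearly identical in the full data and the training sample
round(prop.table(table(df$Size)), 3)
round(prop.table(table(train$Size)), 3)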
Our first classification model was created using LDA and the variables relating to Fall Membership, State and Federal Revenue, Total Number of Students, and the number of Asian/Pacific Island, Hispanic, Black, and White students at each school.
lda_m = lda(Size~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train)
lda_m
## Call:
## lda(Size ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, data = train)
##
## Prior probabilities of groups:
## City Rural Suburb Town
## 0.1324996 0.4677129 0.2373713 0.1624162
##
## Group means:
## FallMembership_16_17 RevenueStateSources_16_17
## City 7123.607 46993333
## Rural 1053.213 7019138
## Suburb 5778.626 35606790
## Town 2192.207 14687769
## RevenueFederalSources_16_17 TotalStudentsAllGrades Asian_AsianPacificIsl
## City 9441581 6965.738 424.01049
## Rural 1088182 1043.786 11.75463
## Suburb 4828927 5768.187 381.17252
## Town 2484004 2179.128 24.88123
## Hispanic Black White
## City 2570.7162 1478.16780 2110.365
## Rural 130.1393 87.71111 758.144
## Suburb 1527.3660 815.16736 2765.421
## Town 439.2899 187.95974 1391.416
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## FallMembership_16_17 3.273630e-04 3.354737e-04 -2.698444e-04
## RevenueStateSources_16_17 -1.324175e-08 -2.431901e-09 -2.171819e-10
## RevenueFederalSources_16_17 5.220148e-08 -7.795008e-08 1.119885e-07
## TotalStudentsAllGrades -2.667240e-04 -1.934473e-03 8.980301e-04
## Asian_AsianPacificIsl -2.571882e-04 1.637371e-03 -1.550364e-03
## Hispanic -1.123044e-04 1.681713e-03 -7.760396e-04
## Black -8.214917e-05 1.623065e-03 -9.948925e-04
## White -2.204662e-04 1.960617e-03 -4.870504e-04
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6518 0.3060 0.0421
## True Values
## Predictions City Rural Suburb Town
## City 28 3 13 1
## Rural 334 1407 620 489
## Suburb 43 20 93 7
## Town 0 0 0 0
## [1] 0.500327
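The prediction code for the confusion matrix above is not shown; a minimal sketch of how it could be reproduced, assuming lda_m and test as defined earlier:
pred.lda <- predict(lda_m, newdata = test)$class
table(Predictions = pred.lda, "True Values" = test$Size)
lda.error <- mean(pred.lda != test$Size) #the ~0.50 testing error rate reported above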
We implemented a linear discriminant analysis rather than a logistic regression because logistic regression is typically used for binary responses; with more than two response classes, its parameter estimates can become unstable, so we opted for LDA. Linear discriminant analysis classifies observations using the prior probabilities \(\pi_k\), estimated from the class proportions in the training data, and the posterior probabilities \(P_k(x) = P(Y = k \mid X = x)\), computed via Bayes' theorem: \(P_k(x) = \pi_k f_k(x) / \sum_{l=1}^{K} \pi_l f_l(x)\), where \(f_k\) is the class-conditional density of \(X\). The LDA model assigns an observation \(X = x\) to the class \(k\) for which \(P_k(x)\) is largest.
This LDA model classified locale with roughly a 50% testing error rate; generally speaking, this is fairly poor performance for a classification model. Another important note is that the model never predicted "Town" for a single district, correctly or incorrectly. Even if this model had achieved a lower testing error rate than the following models, failing to consider an entire classification level could lead to major errors down the line.
We applied a KNN classification to the training data.
#using same variables as LDA
knn.train=train[,c(8,9,19,21,22,23,24)]
knn.test=test[,c(8,9,19,21,22,23,24)]
knn.trainLabels=as.factor(train$Size)
knn.testLabels=as.factor(test$Size)
#finding optimal K-value
set.seed(123)
k.grid=1:100
error=rep(0, length(k.grid))
for (i in seq_along(k.grid)) {
pred = class::knn(train = scale(knn.train),
test = scale(knn.test),
cl = knn.trainLabels,
k = k.grid[i])
error[i] = mean(knn.testLabels !=pred)
}
min(error)
## [1] 0.3662525
which.min(error)
## [1] 25
plot(error, type = "b", cex = 1, pch = 20,
xlab = "k", ylab = "Error")
abline(h = min(error), col = "red", lty = 5)
pred.knn <- class::knn(train = knn.train, test = knn.test, cl = knn.trainLabels, k=which.min(error))
## True Values
## Predictions City Rural Suburb Town
## City 77 26 37 6
## Rural 212 1171 309 269
## Suburb 84 131 296 125
## Town 32 102 84 97
## [1] 0.4633748
K-Nearest Neighbor classification operates on the idea that similar observations lie close to one another. K is a hyperparameter that determines how many of the nearest training observations the classification is based on. For example, with a k-value of 1, for each point in the test data the model computes the distance to every observation in the training data and classifies the point according to the single nearest training observation. With a k-value of 5, the model computes the same distances, sorts them in ascending order, and bases its classification on the 5 nearest observations, typically by majority vote.
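As a toy illustration of the role of k (using R's built-in iris data, not our project data):
#k = 1 classifies by the single nearest training point; k = 15 takes a majority vote over 15 neighbors
set.seed(1)
idx <- sample(nrow(iris), 100)
pred1 <- class::knn(iris[idx, 1:4], iris[-idx, 1:4], iris$Species[idx], k = 1)
pred15 <- class::knn(iris[idx, 1:4], iris[-idx, 1:4], iris$Species[idx], k = 15)
c(mean(pred1 != iris$Species[-idx]), mean(pred15 != iris$Species[-idx]))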
We searched over a grid of k-values from 1 to 100 to find the value that produces the lowest testing error rate. In this case, the optimal k-value is 25, and the reported testing error rate is 46.34%. Note that the grid search scaled the predictors while the final pred.knn fit did not, which is why the final error exceeds the 36.63% minimum found during the search; see the sketch after this paragraph. While this is a modest improvement over the LDA model, it is still not a very accurate model. Additionally, a relatively high k-value of 25 increases computational expense and bias, and makes the boundaries between classes less distinct.
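A sketch of the final fit rerun on scaled predictors, consistent with the grid search:
#rerunning the final model with the same scaling used during the grid search
pred.knn.scaled <- class::knn(train = scale(knn.train),
test = scale(knn.test),
cl = knn.trainLabels,
k = which.min(error))
mean(knn.testLabels != pred.knn.scaled) #should reproduce min(error), about 0.3663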
#We created four tree-based models: Classification Tree, Bagging, Boosting, and Random Forest, and report only the model with the lowest testing error rate in our comparison. Code for the three unused models is below; code for our final model for this section is located in the chunk after this one.
#Random Tree/Classification Tree
model.ct = rpart(Size~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, train, method="class", parms=list(split="gini"))
model.ct
## n= 12234
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12234 6512 Rural (0.13249959 0.46771293 0.23737126 0.16241622)
## 2) Black< 15.5 5970 1562 Rural (0.03417085 0.73835846 0.08190955 0.14556114) *
## 3) Black>=15.5 6264 3849 Suburb (0.22621328 0.20977011 0.38553640 0.17848020)
## 6) Asian_AsianPacificIsl< 38.5 3576 2498 Rural (0.23042506 0.30145414 0.23266219 0.23545861)
## 12) White< 136.5 1013 370 City (0.63474827 0.09279368 0.22112537 0.05133268) *
## 13) White>=136.5 2563 1579 Rural (0.07062037 0.38392509 0.23722201 0.30823254)
## 26) Asian_AsianPacificIsl< 6.5 810 356 Rural (0.06666667 0.56049383 0.15308642 0.21975309) *
## 27) Asian_AsianPacificIsl>=6.5 1753 1141 Town (0.07244723 0.30233885 0.27609812 0.34911580)
## 54) RevenueFederalSources_16_17< 1634000 822 506 Suburb (0.10583942 0.27372263 0.38442822 0.23600973) *
## 55) RevenueFederalSources_16_17>=1634000 931 513 Town (0.04296455 0.32760473 0.18045113 0.44897959) *
## 7) Asian_AsianPacificIsl>=38.5 2688 1105 Suburb (0.22061012 0.08779762 0.58891369 0.10267857) *
cp_opt <- model.ct$cptable[which.min(model.ct$cptable[, "xerror"]), "CP"]
model.ct_opt <- prune(tree = model.ct, cp = cp_opt)
model.ct_opt
## n= 12234
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12234 6512 Rural (0.13249959 0.46771293 0.23737126 0.16241622)
## 2) Black< 15.5 5970 1562 Rural (0.03417085 0.73835846 0.08190955 0.14556114) *
## 3) Black>=15.5 6264 3849 Suburb (0.22621328 0.20977011 0.38553640 0.17848020)
## 6) Asian_AsianPacificIsl< 38.5 3576 2498 Rural (0.23042506 0.30145414 0.23266219 0.23545861)
## 12) White< 136.5 1013 370 City (0.63474827 0.09279368 0.22112537 0.05133268) *
## 13) White>=136.5 2563 1579 Rural (0.07062037 0.38392509 0.23722201 0.30823254)
## 26) Asian_AsianPacificIsl< 6.5 810 356 Rural (0.06666667 0.56049383 0.15308642 0.21975309) *
## 27) Asian_AsianPacificIsl>=6.5 1753 1141 Town (0.07244723 0.30233885 0.27609812 0.34911580)
## 54) RevenueFederalSources_16_17< 1634000 822 506 Suburb (0.10583942 0.27372263 0.38442822 0.23600973) *
## 55) RevenueFederalSources_16_17>=1634000 931 513 Town (0.04296455 0.32760473 0.18045113 0.44897959) *
## 7) Asian_AsianPacificIsl>=38.5 2688 1105 Suburb (0.22061012 0.08779762 0.58891369 0.10267857) *
pred.tree=predict(model.ct, newdata=test, type="class")
confusionMatrix(
factor(pred.tree),
factor(test$Size)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 156 25 44 11
## Rural 52 1203 149 257
## Suburb 179 123 490 125
## Town 18 79 43 104
##
## Overall Statistics
##
## Accuracy : 0.6387
## 95% CI : (0.6213, 0.6557)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4455
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.38519 0.8413 0.6749 0.20926
## Specificity 0.96985 0.7187 0.8169 0.94533
## Pos Pred Value 0.66102 0.7243 0.5344 0.42623
## Neg Pred Value 0.91176 0.8375 0.8898 0.86034
## Prevalence 0.13244 0.4676 0.2374 0.16252
## Detection Rate 0.05101 0.3934 0.1602 0.03401
## Detection Prevalence 0.07717 0.5432 0.2999 0.07979
## Balanced Accuracy 0.67752 0.7800 0.7459 0.57729
## [1] 0.6386527
#Bagging Algorithm
set.seed(123)
model.bag=bagging(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train, coob = TRUE)
print(model.bag)
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = factor(Size) ~ FallMembership_16_17 +
## RevenueStateSources_16_17 + RevenueFederalSources_16_17 +
## TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic +
## Black + White, data = train, coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.3498
pred.bag = predict(model.bag,newdata=test, type = "class")
confusionMatrix(factor(pred.bag),
factor(test$Size))
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 240 30 90 18
## Rural 31 1194 111 211
## Suburb 119 85 459 83
## Town 15 121 66 185
##
## Overall Statistics
##
## Accuracy : 0.6795
## 95% CI : (0.6627, 0.6961)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5207
##
## Mcnemar's Test P-Value : 6.437e-06
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.59259 0.8350 0.6322 0.3722
## Specificity 0.94798 0.7832 0.8769 0.9211
## Pos Pred Value 0.63492 0.7718 0.6153 0.4780
## Neg Pred Value 0.93843 0.8438 0.8845 0.8832
## Prevalence 0.13244 0.4676 0.2374 0.1625
## Detection Rate 0.07848 0.3905 0.1501 0.0605
## Detection Prevalence 0.12361 0.5059 0.2440 0.1266
## Balanced Accuracy 0.77029 0.8091 0.7546 0.6467
## [1] 0.6795291
ctrl <- trainControl(method = "cv", number = 5)
set.seed(123)
caret_model <- train(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White,
data = train,
method = "treebag",
trControl = ctrl)
vip(caret_model)
#Boosting Algorithm
set.seed(123)
model.boos <- gbm(formula = factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, distribution="multinomial", data=train, n.trees = 10000)
print(model.boos)
## gbm(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, distribution = "multinomial", data = train,
## n.trees = 10000)
## A gradient boosted model with multinomial loss function.
## 10000 iterations were performed.
## There were 8 predictors of which 8 had non-zero influence.
boost_table <- summary(model.boos, plotit=FALSE)
ggplot(boost_table) +
aes(x = reorder(var, rel.inf), weight = rel.inf) +
geom_bar(fill = "#0c4c8a") +
labs(x = "Variable of Interest", y = "Relative Influence") +
coord_flip() +
theme_bw()
pred.boost=predict(model.boos, newdata=test,n.trees=10000, distribution="multinomial", type="response")
head(pred.boost)
## , , 10000
##
## City Rural Suburb Town
## [1,] 0.3234466128 0.03873182 0.41004138 0.227780187
## [2,] 0.2331930389 0.02364980 0.49929565 0.243861514
## [3,] 0.0004012989 0.90449070 0.02117469 0.073933317
## [4,] 0.5597331158 0.12738613 0.30558275 0.007298003
## [5,] 0.3808144898 0.08705119 0.43822694 0.093907376
## [6,] 0.7978765227 0.04032131 0.09540377 0.066398398
#The above predictions lose some interpretability, so we take the highest-probability classification as the model's prediction for each observation and reformat the confusion matrix accordingly.
boost.pred.df <- as.data.frame(pred.boost)
colnames(boost.pred.df) = c("City", "Rural", "Suburb", "Town")
boost.pred.df[] <- t(apply(boost.pred.df, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(boost.pred.df!="0",arr.ind=TRUE)
boost.pred.df[w] <- colnames(boost.pred.df)[w[,"col"]]
head(boost.pred.df)
## City Rural Suburb Town
## 1 0 0 Suburb 0
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 0 0 Suburb 0
## 6 City 0 0 0
boost.pred.vector <- as.vector(t(boost.pred.df)[t(boost.pred.df)!=0])
accuracy(test[,"Size"], boost.pred.vector)
## [1] 0.6602354
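The replace-and-relabel steps above can also be written more compactly; an equivalent approach (a sketch, assuming pred.boost as returned by predict() above):
#pred.boost is an n x 4 x 1 array; drop the n.trees dimension, then take the highest-probability column per row
probs <- pred.boost[, , 1]
pred.labels <- colnames(probs)[max.col(probs)] #identical to boost.pred.vector above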
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 220 46 128 11
## Rural 29 1198 87 116
## Suburb 90 124 439 73
## Town 20 238 77 162
##
## Overall Statistics
##
## Accuracy : 0.6602
## 95% CI : (0.6431, 0.677)
## No Information Rate : 0.5252
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4874
##
## Mcnemar's Test P-Value : 2.002e-11
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.61281 0.7460 0.6005 0.44751
## Specificity 0.93146 0.8402 0.8767 0.87574
## Pos Pred Value 0.54321 0.8378 0.6047 0.32596
## Neg Pred Value 0.94761 0.7494 0.8748 0.92191
## Prevalence 0.11740 0.5252 0.2390 0.11838
## Detection Rate 0.07194 0.3918 0.1436 0.05298
## Detection Prevalence 0.13244 0.4676 0.2374 0.16252
## Balanced Accuracy 0.77213 0.7931 0.7386 0.66163
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
model.boos.cv <- gbm(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White,
distribution = "multinomial",
train,n.trees = 10000,
cv.folds = 3)
ntree.cv.opt=gbm.perf(model.boos.cv, method="cv")
## [1] "Optimal ntrees (OOB Estimate): 210"
## [1] "Optimal ntrees (CV Estimate): 502"
pred.1=predict(object = model.boos,
newdata = test,
n.trees = ntree.oob.opt)
pred.2=predict(object = model.boos.cv,
newdata = test,
n.trees = ntree.cv.opt)
df1 <- as.data.frame(pred.1)
colnames(df1) = c("City", "Rural", "Suburb", "Town")
df1[] <- t(apply(df1, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(df1!="0",arr.ind=TRUE)
df1[w] <- colnames(df1)[w[,"col"]]
head(df1)
## City Rural Suburb Town
## 1 0 0 0 Town
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 City 0 0 0
## 6 City 0 0 0
v1 <- as.vector(t(df1)[t(df1)!=0])
df2 <- as.data.frame(pred.2)
colnames(df2) = c("City", "Rural", "Suburb", "Town")
df2[] <- t(apply(df2, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(df2!="0",arr.ind=TRUE)
df2[w] <- colnames(df2)[w[,"col"]]
head(df2)
## City Rural Suburb Town
## 1 0 0 0 Town
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 0 0 Suburb 0
## 6 City 0 0 0
## [1] 0.323087
## [1] 0.3191629
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 230 38 122 15
## Rural 33 1209 82 106
## Suburb 80 125 457 64
## Town 11 212 88 186
##
## Overall Statistics
##
## Accuracy : 0.6808
## 95% CI : (0.664, 0.6973)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5198
##
## Mcnemar's Test P-Value : 1.285e-10
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.64972 0.7633 0.6101 0.50135
## Specificity 0.93528 0.8501 0.8835 0.88426
## Pos Pred Value 0.56790 0.8455 0.6295 0.37425
## Neg Pred Value 0.95326 0.7697 0.8748 0.92776
## Prevalence 0.11576 0.5180 0.2449 0.12132
## Detection Rate 0.07521 0.3954 0.1494 0.06082
## Detection Prevalence 0.13244 0.4676 0.2374 0.16252
## Balanced Accuracy 0.79250 0.8067 0.7468 0.69280
model.rf=randomForest(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train, importance=TRUE,proximity=TRUE)
print(model.rf)
##
## Call:
## randomForest(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 + RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic + Black + White, data = train, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.3%
## Confusion matrix:
## City Rural Suburb Town class.error
## City 976 152 434 59 0.3979025
## Rural 117 4837 344 424 0.1546662
## Suburb 348 492 1853 211 0.3619146
## Town 54 930 264 739 0.6280825
#ntree=120
mtry.rf = tuneRF(x = train[,c(8,9,19,21,22,23,24)],
y = factor(train$Size),
ntreeTry = 120)
## mtry = 2 OOB error = 31.66%
## Searching left ...
## mtry = 1 OOB error = 31.46%
## 0.006196747 0.05
## Searching right ...
## mtry = 4 OOB error = 31.98%
## -0.01006971 0.05
#mtry=1
model.rf=randomForest(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train, importance=TRUE,proximity=TRUE,ntree=120,mtry=1)
print(model.rf)
##
## Call:
## randomForest(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 + RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic + Black + White, data = train, importance = TRUE, proximity = TRUE, ntree = 120, mtry = 1)
## Type of random forest: classification
## Number of trees: 120
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 32.3%
## Confusion matrix:
## City Rural Suburb Town class.error
## City 933 182 451 55 0.4244294
## Rural 97 4854 342 429 0.1516952
## Suburb 334 541 1814 215 0.3753444
## Town 55 971 280 681 0.6572723
pred.rf= predict(model.rf, newdata = test, type = "class")
confusionMatrix(data = pred.rf, reference = as.factor(test$Size))
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 227 29 83 12
## Rural 37 1218 131 244
## Suburb 131 74 463 72
## Town 10 109 49 169
##
## Overall Statistics
##
## Accuracy : 0.6792
## 95% CI : (0.6623, 0.6957)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5141
##
## Mcnemar's Test P-Value : 5.942e-16
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.56049 0.8517 0.6377 0.34004
## Specificity 0.95326 0.7469 0.8812 0.93440
## Pos Pred Value 0.64672 0.7472 0.6257 0.50148
## Neg Pred Value 0.93424 0.8515 0.8865 0.87946
## Prevalence 0.13244 0.4676 0.2374 0.16252
## Detection Rate 0.07423 0.3983 0.1514 0.05526
## Detection Prevalence 0.11478 0.5330 0.2420 0.11020
## Balanced Accuracy 0.75688 0.7993 0.7595 0.63722
## [1] 0.3207979
After testing all four tree-based models, we chose to include the random forest model because it had the lowest testing error rate of the four. Random forests are an ensemble method similar to bagging, but typically with better performance. The algorithm grows a specified number of classification trees: for each tree \(i = 1, \dots, B\), it draws a bootstrap sample from the training data and grows a classification tree on that sample. Every time a split is considered, a random subset of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors; this is where random forest differs from bagging. While bagging considers all \(p\) predictors at each split, the random forest considers only a subset, which decorrelates the trees by injecting randomness into the tree-growing process and generally improves performance.
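For classification, randomForest defaults to \(m = \lfloor\sqrt{p}\rfloor\) predictors per split; with the \(p = 8\) predictors used here:
#randomForest's default mtry for classification is floor(sqrt(p)), matching "No. of variables tried at each split: 2" in the first fit above
floor(sqrt(8))
## [1] 2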
As previously stated, our random forest model had the lowest testing error rate of the four models, at 32.08%. Even after tuning the hyperparameters "ntree" (the number of trees grown in the forest) and "mtry" (the number of variables considered at each split) using out-of-bag error estimates, this was the lowest error rate we could achieve. While this model demonstrated the best performance thus far, a 32% error rate would still not be acceptable were this model to be put into practice.
Here, we applied an SVM model to our training data.
train_svm <- train
test_svm <- test
svm_model<- svm(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train_svm, type = "C-classification", kernel = "radial", scale = TRUE, cost=0.1)
svm_model
##
## Call:
## svm(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, data = train_svm, type = "C-classification",
## kernel = "radial", cost = 0.1, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
##
## Number of Support Vectors: 9241
#re-estimating model using hyperparameter optimization
set.seed(123)
tune.radial=tune(svm, factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train,
kernel ="radial",
type = "C-classification",
ranges =list(cost=c(0.1,1,10,100, 1000),
gamma=c(0.5,1,2,3,4)))
summary(tune.radial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1000 2
##
## - best performance: 0.3424883
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.4336269 0.012910335
## 2 1e+00 0.5 0.3863813 0.014317147
## 3 1e+01 0.5 0.3631680 0.012164559
## 4 1e+02 0.5 0.3518880 0.011556264
## 5 1e+03 0.5 0.3446123 0.009226288
## 6 1e-01 1.0 0.4176061 0.014100933
## 7 1e+00 1.0 0.3740384 0.010928441
## 8 1e+01 1.0 0.3545038 0.012259197
## 9 1e+02 1.0 0.3463289 0.010402223
## 10 1e+03 1.0 0.3438772 0.009039447
## 11 1e-01 2.0 0.4039552 0.012258918
## 12 1e+00 2.0 0.3653746 0.011879806
## 13 1e+01 2.0 0.3499259 0.012153882
## 14 1e+02 2.0 0.3456751 0.009757681
## 15 1e+03 2.0 0.3424883 0.011946144
## 16 1e-01 3.0 0.3951277 0.012777033
## 17 1e+00 3.0 0.3590812 0.011377972
## 18 1e+01 3.0 0.3478826 0.011084377
## 19 1e+02 3.0 0.3436323 0.010297758
## 20 1e+03 3.0 0.3476369 0.009956892
## 21 1e-01 4.0 0.3890783 0.014664634
## 22 1e+00 4.0 0.3561390 0.012769976
## 23 1e+01 4.0 0.3462478 0.011040095
## 24 1e+02 4.0 0.3465742 0.011552186
## 25 1e+03 4.0 0.3490264 0.010512489
bestmod=tune.radial$best.model
pred_test1=predict(svm_model, test_svm)
pred_test2=predict(bestmod, test_svm)
svm.error1 <- 1-mean(pred_test1 == test_svm$Size)
svm.error2 <- 1-mean(pred_test2 == test_svm$Size)
table(pred=predict(tune.radial$best.model ,
newdata=train),true=train$Size)
## true
## pred City Rural Suburb Town
## City 1140 66 187 21
## Rural 311 5289 650 1153
## Suburb 150 184 1951 134
## Town 20 183 116 679
A Support Vector Machine, or SVM, is a supervised machine learning algorithm generally used for classification or regression. The model constructs a hyperplane (or set of hyperplanes) that separates the data into classes. SVMs generally require tuning of their hyperparameters to obtain the best predictive performance. Because we specified a radial kernel, \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\), our hyperparameters are the cost and \(\gamma\). We found suitable values using the tune() function, providing a list of candidate values for each hyperparameter.
The SVM model performed similarly to the random forest model, but with a slightly higher testing error rate of 33.55%; compare this to the 44.21% testing error rate of the SVM without hyperparameter tuning. While tuning yields a significant improvement over the untuned SVM, the model still under-performs the random forest. Additionally, the SVM is much more computationally expensive than the random forest and takes far longer to fit, especially when cross-validating over a grid of hyperparameter values.
#lda.error, knn.error, and forest.error were computed in earlier chunks (not shown) as the misclassification rates on the test set
TestError = data.frame(lda.error, knn.error, forest.error, svm.error2)
TestError=round(TestError, 4)
colnames(TestError) = c("LDA", "KNN", "Random Forest", "SVM")
rownames(TestError) = "Testing Error"
kable(TestError, format.args = list(big.mark = ","), caption = "Comparing Testing Error Rates Between Classification Models") %>%
kable_styling("striped")| LDA | KNN | Random Forest | SVM | |
|---|---|---|---|---|
| Testing Error | 0.5003 | 0.4634 | 0.3208 | 0.3355 |
In comparing the testing error rates of the classification models, we found that the random forest model had the lowest, and was therefore the most accurate predictor of school locale given the predictor variables in this dataset. While it holds the lowest error rate (32.08%), that rate is still above 30%, which leaves much room for improvement. Incorporating external data may improve these models' accuracy; promising additions include the graduation rate of each school, job placement rates, and population data for the town or city in which each district is located. Such data may provide more insight into each institution and lead to more accurate classifications.
This project was very useful for gaining an understanding of how different models operate, how they classify observations differently, and how tuning model hyperparameters can affect predictive performance. Running these models on real data reinforced our understanding of how they work and made it easier to communicate exactly how they classify observations. In terms of R skills, we had to get a bit creative with our data; in particular, narrowing our classification levels from 12 down to 4 took some time. This project was a neat introduction to the basics of statistical learning, and we are excited to continue learning about predictive modeling.