DS 805: Statistical Learning
For this project, our group analyzed a combination of race and ethnicity data from the National Center for Education Statistics (NCES), which provides the Common Core of Data (CCD), and educational funding data from various governing entities.
The ElSi (Elementary/Secondary Information System) Table Generator, which we used to pull the data, can be found here.
The datasets are joined on school district name and include the number of students of each ethnicity (White, Black, Hispanic, etc.) in each district, as well as the dollar amounts of funding from different sources (federal, state, local, and miscellaneous).
Our goal for this project was to create multiple predictive models that could classify a given school district into one of four main locales: City, Suburb, Town, or Rural. We implemented Linear Discriminant Analysis, K-Nearest Neighbor, Classification Tree, Random Forest, Bagging, Boosting, and Support Vector Machine models and compared their predictive performance. We hypothesize that being able to predict locale classifications could help reveal a relationship between locale and how the educational system is structured.
## Rows: 15,292
## Columns: 37
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ Row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ AgencyName <chr> "21ST CENTURY CHARTER SCH OF GARY…
## $ State <chr> "Indiana", "Pennsylvania", "Arizo…
## $ FallMembership_16_17 <int> 888, 961, 64, 2278, 1069, 629, 12…
## $ RevenueLocalSources_16_17 <dbl> 516000, 12717000, 47000, 363000, …
## $ RevenueGeneral_16_17 <dbl> 9342000, 12848000, 492000, 201980…
## $ RevenueStateSources_16_17 <dbl> 7563000, 40000, 440000, 17778000,…
## $ RevenueFederalSources_16_17 <int> 1263000, 91000, 5000, 2057000, 14…
## $ CurrentExpendituresSecEdu_16_17 <dbl> 7447000, 11940000, 481000, 184750…
## $ CurrentExpenditures_16_17 <dbl> 8637000, 12528000, 495000, 185050…
## $ RevenuePerPupil_16_17 <int> 10520, 13369, 7688, 8867, 11573, …
## $ RevenueLocalSourcesPerPupil_16_17 <int> 581, 13233, 734, 159, 458, 167, 1…
## $ RevenueStateSourcesPerPupil_16_17 <int> 8517, 42, 6875, 7804, 9792, 7593,…
## $ RevenueFederalSourcesPerPupil_16_17 <int> 1422, 95, 78, 903, 1323, 1727, 73…
## $ AgencyID <int> 1800046, 4200091, 400328, 4800095…
## $ AgencyType_18_19 <chr> "7-Independent Charter District",…
## $ StartOfYearStatus_18_19 <chr> "1-Open", "1-Open", "1-Open", "1-…
## $ TotalStudentsAllGrades <int> 883, 1235, 65, 2084, 1409, 701, 1…
## $ AmerIndAlaska0tive <int> 0, 4, 2, 4, 2, 1, 0, 0, 0, 1, 5, …
## $ Asian_AsianPacificIsl <int> 0, 18, 0, 1, 7, 0, 3, 0, 0, 2, 1,…
## $ Hispanic <int> 23, 97, 22, 89, 1276, 7, 9, 51, 1…
## $ Black <int> 837, 128, 1, 1951, 53, 688, 48, 1…
## $ White <int> 0, 918, 37, 7, 61, 2, 53, 19, 393…
## $ Hawaiian_PacificIsl <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, …
## $ TwoOrMoreRaces <int> 23, 70, 3, 30, 10, 3, 17, 6, 2, 2…
## $ AmerIndAlaska0tive. <chr> "0.0%", "0.3%", "3.1%", "0.2%", "…
## $ Asian_AsianPacificIsl. <chr> "0.0%", "1.5%", "0.0%", "0.0%", "…
## $ Hispanic. <chr> "2.6%", "7.9%", "33.8%", "4.3%", …
## $ Black. <chr> "94.8%", "10.4%", "1.5%", "93.6%"…
## $ White. <chr> "0.0%", "74.3%", "56.9%", "0.3%",…
## $ Hawaiian_PacificIsl. <chr> "0.0%", "0.0%", "0.0%", "0.1%", "…
## $ TwoOrMoreRaces. <chr> "2.6%", "5.7%", "4.6%", "1.4%", "…
## $ CountyName_18_19 <chr> "Marion County", "Chester County"…
## $ UrbanCentricLocale_18_19 <chr> "13-City: Small", "21-Suburb: Lar…
## $ Size <chr> "City", "Suburb", "Town", "City",…
## $ Size_0_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, …
For the numerical variables, we plotted box plots grouped by Size, our locale variable.
## Warning in x[floor(d)] + x[ceiling(d)]: NAs produced by integer overflow
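The box-plot code is not shown; the warning above arises from adding very large integer values when computing box-plot statistics, which converting the columns to double avoids. A minimal sketch of one such plot, assuming ggplot2 is loaded and using the column names from the glimpse above:
#revenue per pupil by locale; as.numeric() guards against integer-overflow warnings, and a log scale tames the right skew
ggplot(df, aes(x = Size, y = as.numeric(RevenuePerPupil_16_17), fill = Size)) +
geom_boxplot() +
scale_y_log10() +
theme_bw()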
For our locale variable, Size, we plotted a bar chart to visualize the total number of districts in each locale.
totals = data.frame(table(df$Size))
ggplot(data=df, aes(x= Size, fill = Size)) +
geom_bar()+
geom_text(aes(Var1, Freq +250, label = Freq, fill = NULL), data = totals)+
theme_bw()
We decided to take a stratified sample of our dataset using Size. With this data especially, it is important to take a sample that is representative of the entire population. Had we taken a simple random sample, it might not have included a sufficient number of districts from any given type of community (rural, urban, suburban, etc.). Our training sample, which includes 80% of the full dataset, therefore incorporates adequate representation from each type of community.
set.seed(123)
train <- df %>%
group_by(Size) %>%
sample_frac(0.8) %>%
ungroup()
train = as.data.frame(train)
x <- rbind(df, train)
test <- x[!duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(df),]
c(nrow(df),nrow(train),nrow(test))
## [1] 15292 12234 3058
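As a quick check that the stratification preserved the class balance, one can compare class proportions (a minimal sketch):
#proportions of each locale should be nearly identical in the full data and the training sample
round(prop.table(table(df$Size)), 3)
round(prop.table(table(train$Size)), 3)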
Our first classification model was created using LDA and the variables relating to Fall Membership, State and Federal Revenue, Total Number of Students, and the number of Asian/Pacific Island, Hispanic, Black, and White students at each school.
lda_m = lda(Size~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train)
lda_m
## Call:
## lda(Size ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, data = train)
##
## Prior probabilities of groups:
## City Rural Suburb Town
## 0.1324996 0.4677129 0.2373713 0.1624162
##
## Group means:
## FallMembership_16_17 RevenueStateSources_16_17
## City 7123.607 46993333
## Rural 1053.213 7019138
## Suburb 5778.626 35606790
## Town 2192.207 14687769
## RevenueFederalSources_16_17 TotalStudentsAllGrades Asian_AsianPacificIsl
## City 9441581 6965.738 424.01049
## Rural 1088182 1043.786 11.75463
## Suburb 4828927 5768.187 381.17252
## Town 2484004 2179.128 24.88123
## Hispanic Black White
## City 2570.7162 1478.16780 2110.365
## Rural 130.1393 87.71111 758.144
## Suburb 1527.3660 815.16736 2765.421
## Town 439.2899 187.95974 1391.416
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## FallMembership_16_17 3.273630e-04 3.354737e-04 -2.698444e-04
## RevenueStateSources_16_17 -1.324175e-08 -2.431901e-09 -2.171819e-10
## RevenueFederalSources_16_17 5.220148e-08 -7.795008e-08 1.119885e-07
## TotalStudentsAllGrades -2.667240e-04 -1.934473e-03 8.980301e-04
## Asian_AsianPacificIsl -2.571882e-04 1.637371e-03 -1.550364e-03
## Hispanic -1.123044e-04 1.681713e-03 -7.760396e-04
## Black -8.214917e-05 1.623065e-03 -9.948925e-04
## White -2.204662e-04 1.960617e-03 -4.870504e-04
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6518 0.3060 0.0421
## True Values
## Predictions City Rural Suburb Town
## City 28 3 13 1
## Rural 334 1407 620 489
## Suburb 43 20 93 7
## Town 0 0 0 0
## [1] 0.500327
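The prediction code for the confusion matrix above is not shown; a minimal sketch of how it could be reproduced, assuming lda_m and test as defined earlier:
pred.lda <- predict(lda_m, newdata = test)$class
table(Predictions = pred.lda, "True Values" = test$Size)
lda.error <- mean(pred.lda != test$Size) #the ~0.50 testing error rate reported above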
We implemented a linear discriminant analysis rather than a logistic regression because logistic regression is typically used for binary responses; with more than two response classes, its parameter estimates can become unstable, so we opted for LDA. Linear discriminant analysis classifies observations using the prior probabilities \(\pi_k\), estimated from the class proportions in the training data, and the posterior probabilities \(P_k(x) = P(Y = k \mid X = x)\), computed via Bayes' theorem: \(P_k(x) = \pi_k f_k(x) / \sum_{l=1}^{K} \pi_l f_l(x)\), where \(f_k\) is the class-conditional density of \(X\). The LDA model assigns an observation \(X = x\) to the class \(k\) for which \(P_k(x)\) is largest.
This LDA model classified locale with roughly a 50% testing error rate; generally speaking, this is fairly poor performance for a classification model. Another important note is that the model never predicted "Town" for a single district, correctly or incorrectly. Even if this model had achieved a lower testing error rate than the following models, failing to consider an entire classification level could lead to major errors down the line.
We applied a KNN classification to the training data.
#using same variables as LDA
knn.train=train[,c(8,9,19,21,22,23,24)]
knn.test=test[,c(8,9,19,21,22,23,24)]
knn.trainLabels=as.factor(train$Size)
knn.testLabels=as.factor(test$Size)
#finding optimal K-value
set.seed(123)
k.grid=1:100
error=rep(0, length(k.grid))
for (i in seq_along(k.grid)) {
pred = class::knn(train = scale(knn.train),
test = scale(knn.test),
cl = knn.trainLabels,
k = k.grid[i])
error[i] = mean(knn.testLabels !=pred)
}
min(error)
## [1] 0.3662525
which.min(error)
## [1] 25
plot(error, type = "b", cex = 1, pch = 20,
xlab = "k", ylab = "Error")
abline(h = min(error), col = "red", lty = 5)
pred.knn <- class::knn(train = knn.train, test = knn.test, cl = knn.trainLabels, k=which.min(error))
## True Values
## Predictions City Rural Suburb Town
## City 77 26 37 6
## Rural 212 1171 309 269
## Suburb 84 131 296 125
## Town 32 102 84 97
## [1] 0.4633748
K-Nearest Neighbor classification operates on the idea that similar observations lie close to one another. K is a hyperparameter that determines how many of the nearest training observations the classification is based on. For example, with a k-value of 1, for each point in the test data the model computes the distance to every observation in the training data and classifies the point according to the single nearest training observation. With a k-value of 5, the model computes the same distances, sorts them in ascending order, and bases its classification on the 5 nearest observations, typically by majority vote.
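As a toy illustration of the role of k (using R's built-in iris data, not our project data):
#k = 1 classifies by the single nearest training point; k = 15 takes a majority vote over 15 neighbors
set.seed(1)
idx <- sample(nrow(iris), 100)
pred1 <- class::knn(iris[idx, 1:4], iris[-idx, 1:4], iris$Species[idx], k = 1)
pred15 <- class::knn(iris[idx, 1:4], iris[-idx, 1:4], iris$Species[idx], k = 15)
c(mean(pred1 != iris$Species[-idx]), mean(pred15 != iris$Species[-idx]))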
We searched over a grid of k-values from 1 to 100 to find the value that produces the lowest testing error rate. In this case, the optimal k-value is 25, and the reported testing error rate is 46.34%. Note that the grid search scaled the predictors while the final pred.knn fit did not, which is why the final error exceeds the 36.63% minimum found during the search; see the sketch after this paragraph. While this is a modest improvement over the LDA model, it is still not a very accurate model. Additionally, a relatively high k-value of 25 increases computational expense and bias, and makes the boundaries between classes less distinct.
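A sketch of the final fit rerun on scaled predictors, consistent with the grid search:
#rerunning the final model with the same scaling used during the grid search
pred.knn.scaled <- class::knn(train = scale(knn.train),
test = scale(knn.test),
cl = knn.trainLabels,
k = which.min(error))
mean(knn.testLabels != pred.knn.scaled) #should reproduce min(error), about 0.3663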
#We created four tree-based models: Classification Tree, Bagging, Boosting, and Random Forest, and report only the model with the lowest testing error rate in our comparison. Code for the three unused models is below; code for our final model for this section is located in the chunk after this one.
#Random Tree/Classification Tree
model.ct = rpart(Size~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, train, method="class", parms=list(split="gini"))
model.ct
## n= 12234
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12234 6512 Rural (0.13249959 0.46771293 0.23737126 0.16241622)
## 2) Black< 15.5 5970 1562 Rural (0.03417085 0.73835846 0.08190955 0.14556114) *
## 3) Black>=15.5 6264 3849 Suburb (0.22621328 0.20977011 0.38553640 0.17848020)
## 6) Asian_AsianPacificIsl< 38.5 3576 2498 Rural (0.23042506 0.30145414 0.23266219 0.23545861)
## 12) White< 136.5 1013 370 City (0.63474827 0.09279368 0.22112537 0.05133268) *
## 13) White>=136.5 2563 1579 Rural (0.07062037 0.38392509 0.23722201 0.30823254)
## 26) Asian_AsianPacificIsl< 6.5 810 356 Rural (0.06666667 0.56049383 0.15308642 0.21975309) *
## 27) Asian_AsianPacificIsl>=6.5 1753 1141 Town (0.07244723 0.30233885 0.27609812 0.34911580)
## 54) RevenueFederalSources_16_17< 1634000 822 506 Suburb (0.10583942 0.27372263 0.38442822 0.23600973) *
## 55) RevenueFederalSources_16_17>=1634000 931 513 Town (0.04296455 0.32760473 0.18045113 0.44897959) *
## 7) Asian_AsianPacificIsl>=38.5 2688 1105 Suburb (0.22061012 0.08779762 0.58891369 0.10267857) *
cp_opt <- model.ct$cptable[which.min(model.ct$cptable[, "xerror"]), "CP"]
model.ct_opt <- prune(tree = model.ct, cp = cp_opt)
model.ct_opt
## n= 12234
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12234 6512 Rural (0.13249959 0.46771293 0.23737126 0.16241622)
## 2) Black< 15.5 5970 1562 Rural (0.03417085 0.73835846 0.08190955 0.14556114) *
## 3) Black>=15.5 6264 3849 Suburb (0.22621328 0.20977011 0.38553640 0.17848020)
## 6) Asian_AsianPacificIsl< 38.5 3576 2498 Rural (0.23042506 0.30145414 0.23266219 0.23545861)
## 12) White< 136.5 1013 370 City (0.63474827 0.09279368 0.22112537 0.05133268) *
## 13) White>=136.5 2563 1579 Rural (0.07062037 0.38392509 0.23722201 0.30823254)
## 26) Asian_AsianPacificIsl< 6.5 810 356 Rural (0.06666667 0.56049383 0.15308642 0.21975309) *
## 27) Asian_AsianPacificIsl>=6.5 1753 1141 Town (0.07244723 0.30233885 0.27609812 0.34911580)
## 54) RevenueFederalSources_16_17< 1634000 822 506 Suburb (0.10583942 0.27372263 0.38442822 0.23600973) *
## 55) RevenueFederalSources_16_17>=1634000 931 513 Town (0.04296455 0.32760473 0.18045113 0.44897959) *
## 7) Asian_AsianPacificIsl>=38.5 2688 1105 Suburb (0.22061012 0.08779762 0.58891369 0.10267857) *
pred.tree=predict(model.ct, newdata=test, type="class")
confusionMatrix(
factor(pred.tree),
factor(test$Size)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 156 25 44 11
## Rural 52 1203 149 257
## Suburb 179 123 490 125
## Town 18 79 43 104
##
## Overall Statistics
##
## Accuracy : 0.6387
## 95% CI : (0.6213, 0.6557)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4455
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.38519 0.8413 0.6749 0.20926
## Specificity 0.96985 0.7187 0.8169 0.94533
## Pos Pred Value 0.66102 0.7243 0.5344 0.42623
## Neg Pred Value 0.91176 0.8375 0.8898 0.86034
## Prevalence 0.13244 0.4676 0.2374 0.16252
## Detection Rate 0.05101 0.3934 0.1602 0.03401
## Detection Prevalence 0.07717 0.5432 0.2999 0.07979
## Balanced Accuracy 0.67752 0.7800 0.7459 0.57729
## [1] 0.6386527
#Bagging Algorithm
set.seed(123)
model.bag=bagging(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train, coob = TRUE)
print(model.bag)
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = factor(Size) ~ FallMembership_16_17 +
## RevenueStateSources_16_17 + RevenueFederalSources_16_17 +
## TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic +
## Black + White, data = train, coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.3498
pred.bag = predict(model.bag,newdata=test, type = "class")
confusionMatrix(factor(pred.bag),
factor(test$Size))
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 240 30 90 18
## Rural 31 1194 111 211
## Suburb 119 85 459 83
## Town 15 121 66 185
##
## Overall Statistics
##
## Accuracy : 0.6795
## 95% CI : (0.6627, 0.6961)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5207
##
## Mcnemar's Test P-Value : 6.437e-06
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.59259 0.8350 0.6322 0.3722
## Specificity 0.94798 0.7832 0.8769 0.9211
## Pos Pred Value 0.63492 0.7718 0.6153 0.4780
## Neg Pred Value 0.93843 0.8438 0.8845 0.8832
## Prevalence 0.13244 0.4676 0.2374 0.1625
## Detection Rate 0.07848 0.3905 0.1501 0.0605
## Detection Prevalence 0.12361 0.5059 0.2440 0.1266
## Balanced Accuracy 0.77029 0.8091 0.7546 0.6467
## [1] 0.6795291
ctrl <- trainControl(method = "cv", number = 5)
set.seed(123)
caret_model <- train(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White,
data = train,
method = "treebag",
trControl = ctrl)
vip(caret_model)
#Boosting Algorithm
set.seed(123)
model.boos <- gbm(formula = factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, distribution="multinomial", data=train, n.trees = 10000)
print(model.boos)
## gbm(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, distribution = "multinomial", data = train,
## n.trees = 10000)
## A gradient boosted model with multinomial loss function.
## 10000 iterations were performed.
## There were 8 predictors of which 8 had non-zero influence.
boost_table <- summary(model.boos, plotit=FALSE)
ggplot(boost_table) +
aes(x = reorder(var, rel.inf), weight = rel.inf) +
geom_bar(fill = "#0c4c8a") +
labs(x = "Variable of Interest", y = "Relative Influence") +
coord_flip() +
theme_bw()
pred.boost=predict(model.boos, newdata=test,n.trees=10000, distribution="multinomial", type="response")
head(pred.boost)
## , , 10000
##
## City Rural Suburb Town
## [1,] 0.3234466128 0.03873182 0.41004138 0.227780187
## [2,] 0.2331930389 0.02364980 0.49929565 0.243861514
## [3,] 0.0004012989 0.90449070 0.02117469 0.073933317
## [4,] 0.5597331158 0.12738613 0.30558275 0.007298003
## [5,] 0.3808144898 0.08705119 0.43822694 0.093907376
## [6,] 0.7978765227 0.04032131 0.09540377 0.066398398
#The above predictions lose some interpretability, so we take the highest-probability classification as the model's prediction for each observation and reformat the confusion matrix accordingly.
boost.pred.df <- as.data.frame(pred.boost)
colnames(boost.pred.df) = c("City", "Rural", "Suburb", "Town")
boost.pred.df[] <- t(apply(boost.pred.df, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(boost.pred.df!="0",arr.ind=TRUE)
boost.pred.df[w] <- colnames(boost.pred.df)[w[,"col"]]
head(boost.pred.df)
## City Rural Suburb Town
## 1 0 0 Suburb 0
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 0 0 Suburb 0
## 6 City 0 0 0
boost.pred.vector <- as.vector(t(boost.pred.df)[t(boost.pred.df)!=0])
accuracy(test[,"Size"], boost.pred.vector)
## [1] 0.6602354
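The replace-and-relabel steps above can also be written more compactly; an equivalent approach (a sketch, assuming pred.boost as returned by predict() above):
#pred.boost is an n x 4 x 1 array; drop the n.trees dimension, then take the highest-probability column per row
probs <- pred.boost[, , 1]
pred.labels <- colnames(probs)[max.col(probs)] #identical to boost.pred.vector above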
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 220 46 128 11
## Rural 29 1198 87 116
## Suburb 90 124 439 73
## Town 20 238 77 162
##
## Overall Statistics
##
## Accuracy : 0.6602
## 95% CI : (0.6431, 0.677)
## No Information Rate : 0.5252
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4874
##
## Mcnemar's Test P-Value : 2.002e-11
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.61281 0.7460 0.6005 0.44751
## Specificity 0.93146 0.8402 0.8767 0.87574
## Pos Pred Value 0.54321 0.8378 0.6047 0.32596
## Neg Pred Value 0.94761 0.7494 0.8748 0.92191
## Prevalence 0.11740 0.5252 0.2390 0.11838
## Detection Rate 0.07194 0.3918 0.1436 0.05298
## Detection Prevalence 0.13244 0.4676 0.2374 0.16252
## Balanced Accuracy 0.77213 0.7931 0.7386 0.66163
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
model.boos.cv <- gbm(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White,
distribution = "multinomial",
train,n.trees = 10000,
cv.folds = 3)
ntree.cv.opt=gbm.perf(model.boos.cv, method="cv")
## [1] "Optimal ntrees (OOB Estimate): 210"
## [1] "Optimal ntrees (CV Estimate): 502"
pred.1=predict(object = model.boos,
newdata = test,
n.trees = ntree.oob.opt)
pred.2=predict(object = model.boos.cv,
newdata = test,
n.trees = ntree.cv.opt)
df1 <- as.data.frame(pred.1)
colnames(df1) = c("City", "Rural", "Suburb", "Town")
df1[] <- t(apply(df1, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(df1!="0",arr.ind=TRUE)
df1[w] <- colnames(df1)[w[,"col"]]
head(df1)
## City Rural Suburb Town
## 1 0 0 0 Town
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 City 0 0 0
## 6 City 0 0 0
v1 <- as.vector(t(df1)[t(df1)!=0])
df2 <- as.data.frame(pred.2)
colnames(df2) = c("City", "Rural", "Suburb", "Town")
df2[] <- t(apply(df2, 1, function(x) replace(x, x!= max(x, na.rm = TRUE), 0)))
w <- which(df2!="0",arr.ind=TRUE)
df2[w] <- colnames(df2)[w[,"col"]]
head(df2)
## City Rural Suburb Town
## 1 0 0 0 Town
## 2 0 0 Suburb 0
## 3 0 Rural 0 0
## 4 City 0 0 0
## 5 0 0 Suburb 0
## 6 City 0 0 0
## [1] 0.323087
## [1] 0.3191629
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 230 38 122 15
## Rural 33 1209 82 106
## Suburb 80 125 457 64
## Town 11 212 88 186
##
## Overall Statistics
##
## Accuracy : 0.6808
## 95% CI : (0.664, 0.6973)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5198
##
## Mcnemar's Test P-Value : 1.285e-10
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.64972 0.7633 0.6101 0.50135
## Specificity 0.93528 0.8501 0.8835 0.88426
## Pos Pred Value 0.56790 0.8455 0.6295 0.37425
## Neg Pred Value 0.95326 0.7697 0.8748 0.92776
## Prevalence 0.11576 0.5180 0.2449 0.12132
## Detection Rate 0.07521 0.3954 0.1494 0.06082
## Detection Prevalence 0.13244 0.4676 0.2374 0.16252
## Balanced Accuracy 0.79250 0.8067 0.7468 0.69280
model.rf=randomForest(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train, importance=TRUE,proximity=TRUE)
print(model.rf)
##
## Call:
## randomForest(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 + RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic + Black + White, data = train, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.3%
## Confusion matrix:
## City Rural Suburb Town class.error
## City 976 152 434 59 0.3979025
## Rural 117 4837 344 424 0.1546662
## Suburb 348 492 1853 211 0.3619146
## Town 54 930 264 739 0.6280825
#ntree=120
mtry.rf = tuneRF(x = train[,c(8,9,19,21,22,23,24)],
y = factor(train$Size),
ntreeTry = 120)
## mtry = 2 OOB error = 31.66%
## Searching left ...
## mtry = 1 OOB error = 31.46%
## 0.006196747 0.05
## Searching right ...
## mtry = 4 OOB error = 31.98%
## -0.01006971 0.05
#mtry=1
model.rf=randomForest(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train, importance=TRUE,proximity=TRUE,ntree=120,mtry=1)
print(model.rf)
##
## Call:
## randomForest(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 + RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl + Hispanic + Black + White, data = train, importance = TRUE, proximity = TRUE, ntree = 120, mtry = 1)
## Type of random forest: classification
## Number of trees: 120
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 32.3%
## Confusion matrix:
## City Rural Suburb Town class.error
## City 933 182 451 55 0.4244294
## Rural 97 4854 342 429 0.1516952
## Suburb 334 541 1814 215 0.3753444
## Town 55 971 280 681 0.6572723
pred.rf= predict(model.rf, newdata = test, type = "class")
confusionMatrix(data = pred.rf, reference = as.factor(test$Size))
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Rural Suburb Town
## City 227 29 83 12
## Rural 37 1218 131 244
## Suburb 131 74 463 72
## Town 10 109 49 169
##
## Overall Statistics
##
## Accuracy : 0.6792
## 95% CI : (0.6623, 0.6957)
## No Information Rate : 0.4676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5141
##
## Mcnemar's Test P-Value : 5.942e-16
##
## Statistics by Class:
##
## Class: City Class: Rural Class: Suburb Class: Town
## Sensitivity 0.56049 0.8517 0.6377 0.34004
## Specificity 0.95326 0.7469 0.8812 0.93440
## Pos Pred Value 0.64672 0.7472 0.6257 0.50148
## Neg Pred Value 0.93424 0.8515 0.8865 0.87946
## Prevalence 0.13244 0.4676 0.2374 0.16252
## Detection Rate 0.07423 0.3983 0.1514 0.05526
## Detection Prevalence 0.11478 0.5330 0.2420 0.11020
## Balanced Accuracy 0.75688 0.7993 0.7595 0.63722
## [1] 0.3207979
After testing all four tree-based models, we chose to include the random forest model because it had the lowest testing error rate of the four. Random forests are an ensemble method similar to bagging, but typically with better performance. The algorithm grows a specified number of classification trees: for each tree \(i = 1, \dots, B\), it draws a bootstrap sample from the training data and grows a classification tree on that sample. Every time a split is considered, a random subset of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors; this is where random forest differs from bagging. While bagging considers all \(p\) predictors at each split, the random forest considers only a subset, which decorrelates the trees by injecting randomness into the tree-growing process and generally improves performance.
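For classification, randomForest defaults to \(m = \lfloor\sqrt{p}\rfloor\) predictors per split; with the \(p = 8\) predictors used here:
#randomForest's default mtry for classification is floor(sqrt(p)), matching "No. of variables tried at each split: 2" in the first fit above
floor(sqrt(8))
## [1] 2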
As previously stated, our random forest model had the lowest testing error rate of the four models, at 32.08%. Even after tuning the hyperparameters "ntree" (the number of trees grown in the forest) and "mtry" (the number of variables considered at each split) using out-of-bag error estimates, this was the lowest error rate we could achieve. While this model demonstrated the best performance thus far, a 32% error rate would still not be acceptable were this model to be put into practice.
Here, we applied an SVM model to our training data.
train_svm <- train
test_svm <- test
svm_model<- svm(factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data = train_svm, type = "C-classification", kernel = "radial", scale = TRUE, cost=0.1)
svm_model
##
## Call:
## svm(formula = factor(Size) ~ FallMembership_16_17 + RevenueStateSources_16_17 +
## RevenueFederalSources_16_17 + TotalStudentsAllGrades + Asian_AsianPacificIsl +
## Hispanic + Black + White, data = train_svm, type = "C-classification",
## kernel = "radial", cost = 0.1, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
##
## Number of Support Vectors: 9241
#re-estimating model using hyperparameter optimization
set.seed(123)
tune.radial=tune(svm, factor(Size)~FallMembership_16_17+RevenueStateSources_16_17+RevenueFederalSources_16_17+TotalStudentsAllGrades+Asian_AsianPacificIsl+Hispanic+Black+White, data=train,
kernel ="radial",
type = "C-classification",
ranges =list(cost=c(0.1,1,10,100, 1000),
gamma=c(0.5,1,2,3,4)))
summary(tune.radial)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1000 2
##
## - best performance: 0.3424883
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.4336269 0.012910335
## 2 1e+00 0.5 0.3863813 0.014317147
## 3 1e+01 0.5 0.3631680 0.012164559
## 4 1e+02 0.5 0.3518880 0.011556264
## 5 1e+03 0.5 0.3446123 0.009226288
## 6 1e-01 1.0 0.4176061 0.014100933
## 7 1e+00 1.0 0.3740384 0.010928441
## 8 1e+01 1.0 0.3545038 0.012259197
## 9 1e+02 1.0 0.3463289 0.010402223
## 10 1e+03 1.0 0.3438772 0.009039447
## 11 1e-01 2.0 0.4039552 0.012258918
## 12 1e+00 2.0 0.3653746 0.011879806
## 13 1e+01 2.0 0.3499259 0.012153882
## 14 1e+02 2.0 0.3456751 0.009757681
## 15 1e+03 2.0 0.3424883 0.011946144
## 16 1e-01 3.0 0.3951277 0.012777033
## 17 1e+00 3.0 0.3590812 0.011377972
## 18 1e+01 3.0 0.3478826 0.011084377
## 19 1e+02 3.0 0.3436323 0.010297758
## 20 1e+03 3.0 0.3476369 0.009956892
## 21 1e-01 4.0 0.3890783 0.014664634
## 22 1e+00 4.0 0.3561390 0.012769976
## 23 1e+01 4.0 0.3462478 0.011040095
## 24 1e+02 4.0 0.3465742 0.011552186
## 25 1e+03 4.0 0.3490264 0.010512489
bestmod=tune.radial$best.model
pred_test1=predict(svm_model, test_svm)
pred_test2=predict(bestmod, test_svm)
svm.error1 <- 1-mean(pred_test1 == test_svm$Size)
svm.error2 <- 1-mean(pred_test2 == test_svm$Size)
table(pred=predict(tune.radial$best.model ,
newdata=train),true=train$Size)
## true
## pred City Rural Suburb Town
## City 1140 66 187 21
## Rural 311 5289 650 1153
## Suburb 150 184 1951 134
## Town 20 183 116 679
A Support Vector Machine, or SVM, is a supervised machine learning algorithm generally used for classification or regression. The model constructs a hyperplane (or set of hyperplanes) that separates the data into classes. SVMs generally require tuning of their hyperparameters to obtain the best predictive performance. Because we specified a radial kernel, \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\), our hyperparameters are the cost and \(\gamma\). We found suitable values using the tune() function, providing a list of candidate values for each hyperparameter.
The SVM model performed similarly to the random forest model, but with a slightly higher testing error rate of 33.55%; compare this to the 44.21% testing error rate of the SVM without hyperparameter tuning. While tuning yields a significant improvement over the untuned SVM, the model still under-performs the random forest. Additionally, the SVM is much more computationally expensive than the random forest and takes far longer to fit, especially when cross-validating over a grid of hyperparameter values.
#lda.error, knn.error, and forest.error were computed in earlier chunks (not shown) as the misclassification rates on the test set
TestError = data.frame(lda.error, knn.error, forest.error, svm.error2)
TestError=round(TestError, 4)
colnames(TestError) = c("LDA", "KNN", "Random Forest", "SVM")
rownames(TestError) = "Testing Error"
kable(TestError, format.args = list(big.mark = ","), caption = "Comparing Testing Error Rates Between Classification Models") %>%
kable_styling("striped")| LDA | KNN | Random Forest | SVM | |
|---|---|---|---|---|
| Testing Error | 0.5003 | 0.4634 | 0.3208 | 0.3355 |
In comparing the testing error rates of the classification models, we found that the random forest model had the lowest, and was therefore the most accurate predictor of school locale given the predictor variables in this dataset. While it holds the lowest error rate (32.08%), that rate is still above 30%, which leaves much room for improvement. Incorporating external data may improve these models' accuracy; promising additions include the graduation rate of each school, job placement rates, and population data for the town or city in which each district is located. Such data may provide more insight into each institution and lead to more accurate classifications.
This project was very useful for gaining an understanding of how different models operate, how they classify observations differently, and how tuning model hyperparameters can affect predictive performance. Running these models on real data reinforced our understanding of how they work and made it easier to communicate exactly how they classify observations. In terms of R skills, we had to get a bit creative with our data; in particular, narrowing our classification levels from 12 down to 4 took some time. This project was a neat introduction to the basics of statistical learning, and we are excited to continue learning about predictive modeling.