Choose a dataset from a source in Assignment #1, or another dataset of your choice.
Again, as in Assignment 1, using the public library data.
PLDS2022 is the 2022 (most recent) version of the Public Library
Survey Data. Found here, the data is a census survey of over 190
variables collected annually from over 9,000 libraries in the US,
including all 50 states, DC and outlying territories.
It is available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey
PLDS2022<-
read.csv("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
A reminder, from last assignment, the data contains:
Create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
First classification problem is to identify a binary variable PLDS2022$ODFINE, indicating if the library system charges overdue fines.
Second classification problem is to identify locale (PLDS2022$LOCALE_ADD), which is one of 4 values: City, Suburb, Town, Rural. Note: This variable actually starts out as 12 values, but we simplify it.
In both cases, the question we are trying to answer: How does the admin structure, funding source, physical book stats like expenditures, number of volumes (Books/Audio/Video) & Circulation and other variables predict these classifiers.
NOTE: THIS IS THE SAME AS ASSIGNMENT 1.
There is some cleaning of the data that is needed before thinking about any transformations.
To start, filter out those library systems that:
* do not meet all the criteria in the FSCS Public Library
Definition
* temporarily closed for FY 2022 (STATSTRU, Structure Change Code,
‘23’)
* Have neither a Central Library nor Branches (they have only
bookmobiles)
* Did not answer the question about ODFINE (Overdue fines).
This leaves 7,912 rows.
We will focus ODFINE, through the lens library type, revenue source, population served and location as well as on physical library services, and remove other columns relating to actual revenue amounts, staffing levels and ebook and other electronic services. This keeps 20 columns.
And add factors to LOCALE_ADD, which is the the geographic location in terms of the size of the community in which the library system is located and the proximity of that community to urban and metropolitan areas. We simplified this to City, Suburb, Town and Rural. The original data divides this further into S/M/L for City and Suburb and Fringe/Distant/Remote for Town and Rural.
Of the remaining rows, 4,586 have ODFINE set to N (library systems that do not charge overdue fines), and 3,326 have ODFINE set to Y.
locNUM <- c(11,12,13,
21,22,23,
31,32,33,
41,42,43)
locNAME <- c("City", "City", "City",
"Suburb", "Suburb", "Suburb",
"Town", "Town", "Town",
"Rural", "Rural", "Rural")
PLDS2022 <- PLDS2022 |>
filter(C_FSCS == "Y") |>
filter(ODFINE != "M") |>
select("STABR", "C_RELATN", "C_LEGBAS", "C_ADMIN","GEOCODE", "POPU_LSA", "CENTLIB", "BRANLIB","BKMOB", "LOCGVT", "STGVT",
"FEDGVT", "OTHINCM", "PRMATEXP", "BKVOL", "AUDIO_PH", "VIDEO_PH", "ODFINE", "PHYSCIR","LOCALE_ADD") |>
filter (PHYSCIR != -3) |>
filter (CENTLIB+BRANLIB>0)
PLDS2022$LOCALE_ADD <-factor(PLDS2022$LOCALE_ADD,
levels=locNUM,
labels=locNAME)
PLDS2022$ODFINE <- as.factor(PLDS2022$ODFINE)
table(PLDS2022$ODFINE)
##
## N Y
## 4586 3326
Split the data into train and test for ODFINE.
set.seed(888)
#split based on ODFINE
trainFine_rw <- createDataPartition(
y=PLDS2022$ODFINE, # this is our response var
p=.7, # 70% in training set
list=FALSE
)
trainF <- PLDS2022[trainFine_rw,]
testF <- PLDS2022[-trainFine_rw,]
Although the focus here is on the decision trees, we will keep some
of the Data Exploration from the prior assignment.
[For conciseness, took out some of the exploratory visualizations done
is assignment 1.]
Box Plots are a way to look at the numeric values - first we will do a quick and dirty peak, which we will refine as we go.
One can see how skewed the data is, and how many outliers there are.
par(mar = c(3, 4, 1, 2))
par(mfrow = c(3,4))
boxplot(POPU_LSA~ODFINE, ylab="Pop of Service Area",
data = trainF)
boxplot(BRANLIB~ODFINE, ylab="# of Branch Libraries",
data = trainF)
boxplot(BKMOB~ODFINE, ylab="# of Book Mobiles",
data = trainF)
boxplot(LOCGVT~ODFINE, ylab="Local Govt Funding",
data = trainF)
boxplot(STGVT~ODFINE, ylab="State Govt Funding",
data = trainF)
boxplot(FEDGVT~ODFINE, ylab="Fed Govt Funding",
data = trainF)
boxplot(OTHINCM~ODFINE, ylab="Other Funding",
data = trainF)
boxplot(PRMATEXP~ODFINE, ylab="Expenditures on Print Material",
data = trainF)
boxplot(BKVOL~ODFINE, ylab="# of Phys Books",
data = trainF)
boxplot(AUDIO_PH~ODFINE, ylab="# of Phys AudioBooks",
data = trainF)
boxplot(VIDEO_PH~ODFINE, ylab="# of Phys Videos",
data = trainF)
boxplot(PHYSCIR~ODFINE, ylab="Phys Circulation",
data = trainF)
Next, we will try faceting the box plots by LOCALE_AD, we will try with just one variable, so it is larger. Still, the scale of the large urban libraries dwarfs the others and makes it hard to compare.
ggplot(trainF, aes(x=ODFINE, y=PHYSCIR, fill=LOCALE_ADD)) +
geom_boxplot()+
scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-9))
Let try some of the Pre-Processing features in caret.
We will use the Yeo-Johnson transformation on the continuous predictors then center and scale them.
#run the preProcess on all but 18, ODFINE, the target
preProcValues <- preProcess(trainF[,-18], method = c("center", "scale", "YeoJohnson"))
preProcValues2 <- preProcess(testF[,-18], method = c("center", "scale", "YeoJohnson"))
preProcValues
## Created from 5540 samples and 19 variables
##
## Pre-processing:
## - centered (13)
## - ignored (6)
## - scaled (13)
## - Yeo-Johnson transformation (10)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTransformed <- predict(preProcValues, trainF)
testTransformed <- predict(preProcValues2, testF)
ggplot(trainTransformed, aes(x=ODFINE, y=PHYSCIR, fill=LOCALE_ADD)) +
geom_boxplot()
Ooh, look at that! So much better! Let’s look at the other variables
now.
p2<-ggplot(trainTransformed, aes(x=ODFINE, y=POPU_LSA,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p3<-ggplot(trainTransformed, aes(x=ODFINE, y=BRANLIB,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p4<-ggplot(trainTransformed, aes(x=ODFINE, y=BKMOB,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p5<-ggplot(trainTransformed, aes(x=ODFINE, y=LOCGVT,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p6<-ggplot(trainTransformed, aes(x=ODFINE, y=STGVT,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p7<-ggplot(trainTransformed, aes(x=ODFINE, y=FEDGVT,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p8<-ggplot(trainTransformed, aes(x=ODFINE, y=OTHINCM,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p8<-ggplot(trainTransformed, aes(x=ODFINE, y=PRMATEXP,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p9<-ggplot(trainTransformed, aes(x=ODFINE, y=BKVOL,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p10<-ggplot(trainTransformed, aes(x=ODFINE, y=AUDIO_PH,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p11<-ggplot(trainTransformed, aes(x=ODFINE, y=VIDEO_PH,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
p12<-ggplot(trainTransformed, aes(x=ODFINE, y=PHYSCIR,
fill=LOCALE_ADD)) +
geom_boxplot()+
theme(axis.text = element_text(size = 6))+
theme(legend.position = "none")
ggarrange(p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12)
The final value used for the model was cp = 0.003434951, which has an accuracy of 65.16%.
ctrlF <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
set.seed(888)
dtree_fitF <- train(ODFINE ~., data = trainTransformed,
method = "rpart",
parms = list(split = "information"),
trControl=ctrlF,
tuneLength = 10)
dtree_fitF
## CART
##
## 5540 samples
## 19 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 4986, 4986, 4986, 4986, 4987, 4986, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.003434951 0.6516277 0.26299220
## 0.003864319 0.6486818 0.25627741
## 0.004293688 0.6460294 0.24877139
## 0.004508373 0.6471121 0.24985244
## 0.004723057 0.6465706 0.24832788
## 0.004937742 0.6466302 0.24778360
## 0.005689137 0.6444640 0.24102437
## 0.017604122 0.6125757 0.10870039
## 0.023615286 0.6095685 0.09312038
## 0.034993559 0.5977131 0.05710906
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003434951.
Looking at the plot of the decision tree, one sees many of the decisions are made on the STABR** variable - the State Abbreviation, where ** is replaced by the state (TX, Texas, is the top one - STABRTX), which coincides with our observation last week that geography seems to be a large indicator of fines. The other variables used BKVOL (Book Volumes), VIDEO_PH (Physical Videos), AUDIO_PH (Physical Audio Books) relate to the size of the library system. And LOCGVT, also prominent in the tree, is how much funding is received from the system’s local government.
prp(dtree_fitF$finalModel,
box.palette = "Blues",
tweak = 1.2)
In the first attempt to run the test data against the model, there was an error. Both HI (Hawaii) and MP (US Territory Northern Mariana Islands) only have one library system, so they will only have one row, which will either end up in train or test. The error indicated an issue when introducing new values in a category field in the test data (values not in training). Internet search suggests best option is to remove those rows.
Therefore, removing those rows from testTransformed to keep rolling.
The test data has an accuracy of 65.57%.
testTransformed <- testTransformed|>
filter (!(STABR %in% c('HI','MP')))
predF <- predict(dtree_fitF, newdata = testTransformed)
confusionMatrix(predF, testTransformed$ODFINE)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 1140 581
## Y 235 414
##
## Accuracy : 0.6557
## 95% CI : (0.6362, 0.6748)
## No Information Rate : 0.5802
## P-Value [Acc > NIR] : 3.049e-14
##
## Kappa : 0.2575
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8291
## Specificity : 0.4161
## Pos Pred Value : 0.6624
## Neg Pred Value : 0.6379
## Prevalence : 0.5802
## Detection Rate : 0.4810
## Detection Prevalence : 0.7262
## Balanced Accuracy : 0.6226
##
## 'Positive' Class : N
##
Now we will run a decision tree to predict another variable, LOCALE_ADD (City, Suburb, Town, Rural).
We start by dividing the data into train and test with LOCAL_ADD as the response variable.
set.seed(888)
#split based on LOCALE_ADD
trainLoc_rw <- createDataPartition(
y=PLDS2022$LOCALE_ADD, # this is our response var
p=.7, # 70% in training set
list=FALSE
)
trainL <- PLDS2022[trainLoc_rw,]
testL <- PLDS2022[-trainLoc_rw,]
We run the same preprocessing as before.
#run the preProcess on all but 20, LOCALE_ADD, the target
prePTrainL <- preProcess(trainL[,-20], method = c("center", "scale", "YeoJohnson"))
prePTestL <- preProcess(testL[,-20], method = c("center", "scale", "YeoJohnson"))
preProcValues
## Created from 5540 samples and 19 variables
##
## Pre-processing:
## - centered (13)
## - ignored (6)
## - scaled (13)
## - Yeo-Johnson transformation (10)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTrsfmL <- predict(prePTrainL, trainL)
testTrsfmL <- predict(prePTestL, testL)
And then fit the model. The final value used for the model was cp = 0.003153579, which has an accuracy of 70.1%.
ctrlL <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
set.seed(888)
dtree_fitL <- train(LOCALE_ADD ~., data = trainTrsfmL,
method = "rpart",
parms = list(split = "information"),
trControl=ctrlF,
tuneLength = 10)
dtree_fitL
## CART
##
## 5541 samples
## 19 predictor
## 4 classes: 'City', 'Suburb', 'Town', 'Rural'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 4988, 4987, 4987, 4986, 4987, 4986, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.002838221 0.7008321 0.5656440
## 0.003153579 0.7010127 0.5660472
## 0.004099653 0.7002312 0.5651419
## 0.004415011 0.6993901 0.5639255
## 0.005045727 0.6988472 0.5632275
## 0.009618417 0.6824839 0.5398161
## 0.013718070 0.6579389 0.5074605
## 0.025228635 0.6433201 0.4848976
## 0.067328918 0.5756459 0.3699289
## 0.260170293 0.4869358 0.1529773
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003153579.
Plotting the model shows that the biggest factor is POP_LSA, Population of the Library Service Area. Again, LOCGVT (amount of funding by the local government) is a factor. These two are sufficient to divide the CITY from RURAL results. All the rest classifies SUBURB and TOWN, starting with GEOCODE (a code describing the geographic area served), C_LEGBAS (Legal Basis Code), number of Branches (BRANLIB). Interesting that at the bottom of the tree, it looks at 3 states, PA, NJ and MA to decide between SUBURB and TOWN.
prp(dtree_fitL$finalModel,
box.palette = "Greens",
tweak = 1.2)
Applying the model to the test data, results in an accuracy of 69.78%.
testTrsfmL <- testTrsfmL|>
filter (!(GEOCODE %in% c('SE1','SS1')))
predL <- predict(dtree_fitL, newdata = testTrsfmL)
confusionMatrix(predL, testTrsfmL$LOCALE_ADD)
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Suburb Town Rural
## City 104 71 10 8
## Suburb 27 314 100 15
## Town 25 143 405 161
## Rural 2 52 102 830
##
## Overall Statistics
##
## Accuracy : 0.6978
## 95% CI : (0.6788, 0.7162)
## No Information Rate : 0.428
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5611
##
## Mcnemar's Test P-Value : 2.477e-13
##
## Statistics by Class:
##
## Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity 0.65823 0.5414 0.6564 0.8185
## Specificity 0.95975 0.9206 0.8122 0.8849
## Pos Pred Value 0.53886 0.6886 0.5518 0.8418
## Neg Pred Value 0.97518 0.8610 0.8703 0.8670
## Prevalence 0.06669 0.2448 0.2604 0.4280
## Detection Rate 0.04390 0.1325 0.1710 0.3504
## Detection Prevalence 0.08147 0.1925 0.3098 0.4162
## Balanced Accuracy 0.80899 0.7310 0.7343 0.8517
reference: https://www.geeksforgeeks.org/binary-classification-or-unknown-class-in-random-forest-in-r/
The first Random Forest, for ODFINE, is the same as in assignment 1. Fitting the model to the test data, we get an accuracy of 65.53%. Compare this to the accuracy of the decision tree: 65.57%. It is very similar, just a tad bit worse.
set.seed(888)
RFmodel <- randomForest(ODFINE ~ ., data = trainTransformed, importance = TRUE, ntree = 500)
RFmodel
##
## Call:
## randomForest(formula = ODFINE ~ ., data = trainTransformed, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 33.3%
## Confusion matrix:
## N Y class.error
## N 2525 686 0.2136406
## Y 1159 1170 0.4976385
RFpred<- predict(RFmodel, newdata = testTransformed)
conf_matrix <- confusionMatrix(RFpred, testTransformed$ODFINE)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 1112 554
## Y 263 441
##
## Accuracy : 0.6553
## 95% CI : (0.6357, 0.6744)
## No Information Rate : 0.5802
## P-Value [Acc > NIR] : 4.217e-14
##
## Kappa : 0.2626
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8087
## Specificity : 0.4432
## Pos Pred Value : 0.6675
## Neg Pred Value : 0.6264
## Prevalence : 0.5802
## Detection Rate : 0.4692
## Detection Prevalence : 0.7030
## Balanced Accuracy : 0.6260
##
## 'Positive' Class : N
##
Also want to try a Random Forest model on LOCALE_ADD. This has an accuracy of 73.87%. The decision tree accuracy was 69.78%, so this is an improvement.
set.seed(888)
RFmodelL <- randomForest(LOCALE_ADD ~ ., data = trainTrsfmL, importance = TRUE, ntree = 500)
RFmodelL
##
## Call:
## randomForest(formula = LOCALE_ADD ~ ., data = trainTrsfmL, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 26.15%
## Confusion matrix:
## City Suburb Town Rural class.error
## City 239 111 18 3 0.3557951
## Suburb 105 892 234 125 0.3421829
## Town 19 189 927 309 0.3580332
## Rural 2 83 251 2034 0.1417722
RFpredL<- predict(RFmodelL, newdata = testTrsfmL)
conf_matrixL <- confusionMatrix(RFpredL, testTrsfmL$LOCALE_ADD)
conf_matrixL
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Suburb Town Rural
## City 95 40 8 6
## Suburb 42 373 66 21
## Town 19 107 415 120
## Rural 2 60 128 867
##
## Overall Statistics
##
## Accuracy : 0.7387
## 95% CI : (0.7205, 0.7563)
## No Information Rate : 0.428
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6159
##
## Mcnemar's Test P-Value : 3.798e-06
##
## Statistics by Class:
##
## Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity 0.60127 0.6431 0.6726 0.8550
## Specificity 0.97558 0.9279 0.8596 0.8598
## Pos Pred Value 0.63758 0.7430 0.6278 0.8202
## Neg Pred Value 0.97162 0.8891 0.8817 0.8880
## Prevalence 0.06669 0.2448 0.2604 0.4280
## Detection Rate 0.04010 0.1575 0.1752 0.3660
## Detection Prevalence 0.06290 0.2119 0.2790 0.4462
## Balanced Accuracy 0.78842 0.7855 0.7661 0.8574