Introduction

Select a file

Choose a dataset from a source in Assignment #1, or another dataset of your choice.

Again, as in Assignment 1, we are using the public library data.

PLDS2022 is the 2022 (most recent) release of the Public Library Survey data: a census survey of over 190 variables collected annually from over 9,000 libraries in the US, covering all 50 states, DC, and outlying territories.
It is available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey

Load data: PLDS2022

PLDS2022<-
read.csv("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")

A reminder, from last assignment, the data contains:

  • 9,248 rows, each of which represents a US public library system
  • 192 columns, of which 94 are character and 98 are numeric
  • no NA values as such, but the data dictionary uses sentinel values (M, -1, -3, -9) for missing data; see the sketch after this list
    (Data Dictionary in Appendix A of: https://www.imls.gov/sites/default/files/2024-06/2022_pls_data_file_documentation.pdf)
  • Some fields, like FSCSKEY (a unique ID), LIBNAME (Library Name) and various address fields are unique to each row, and we will remove them prior to any other transformation.
  • Some of the Character fields are factors, for example, C_ADMIN is the Administrative Structure Code, which can be one of three values identifying how the library systems are organized.
  • For the numeric fields, looking at the small histograms, we see much of the numeric data is right skewed.
  • There are also many “imputation flags” indicating that other fields are imputed. For our purposes, we will remove and ignore these.
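As a sketch of how the sentinel values and imputation flags could be handled up front (assuming, per the data dictionary, that the flag columns are named F_*; we do not actually run this, since the cleanup below filters sentinels case by case):

PLDS2022_sketch <- PLDS2022 |>
  select(-starts_with("F_")) |>   # drop imputation flag columns (assumed F_* naming)
  mutate(across(where(is.numeric),
                \(x) ifelse(x %in% c(-1, -3, -9), NA, x)))  # recode sentinels as NA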

Decision Trees

Create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.

Switch variables to generate 2 decision trees and compare the results. Create a random forest and analyze the results.

The first classification problem is to predict the binary variable PLDS2022$ODFINE, indicating whether the library system charges overdue fines.

The second classification problem is to predict locale (PLDS2022$LOCALE_ADD), which is one of four values: City, Suburb, Town, Rural. Note: this variable actually starts out with 12 values, but we simplify it.

In both cases, the question we are trying to answer is: how well do the administrative structure, funding sources, physical collection statistics (expenditures; numbers of book, audio, and video volumes; circulation), and other variables predict these outcomes?

Initial Clean up

NOTE: THIS IS THE SAME AS ASSIGNMENT 1.

There is some cleaning of the data that is needed before thinking about any transformations.

To start, filter out those library systems that:
* do not meet all the criteria in the FSCS Public Library Definition,
* were temporarily closed for FY 2022 (STATSTRU, Structure Change Code, '23'),
* have neither a Central Library nor Branches (they have only bookmobiles), or
* did not answer the question about ODFINE (overdue fines).
This leaves 7,912 rows.

We will focus on ODFINE through the lens of library type, revenue source, population served, and location, as well as physical library services, and remove other columns relating to actual revenue amounts, staffing levels, and ebook and other electronic services. This keeps 20 columns.

And add factors to LOCALE_ADD, which describes the geographic location in terms of the size of the community in which the library system is located and that community's proximity to urban and metropolitan areas. We simplified this to City, Suburb, Town, and Rural; the original data divides each further into Small/Medium/Large for City and Suburb and Fringe/Distant/Remote for Town and Rural.

Of the remaining rows, 4,586 have ODFINE set to N (library systems that do not charge overdue fines), and 3,326 have ODFINE set to Y.

locNUM <- c(11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43)
locNAME <- c("City", "City", "City",
             "Suburb", "Suburb", "Suburb",
             "Town", "Town", "Town",
             "Rural", "Rural", "Rural")

PLDS2022 <- PLDS2022 |>
  filter(C_FSCS == "Y") |>          # meets the FSCS public library definition
  filter(ODFINE != "M") |>          # answered the overdue-fines question
  select("STABR", "C_RELATN", "C_LEGBAS", "C_ADMIN", "GEOCODE", "POPU_LSA",
         "CENTLIB", "BRANLIB", "BKMOB", "LOCGVT", "STGVT", "FEDGVT",
         "OTHINCM", "PRMATEXP", "BKVOL", "AUDIO_PH", "VIDEO_PH", "ODFINE",
         "PHYSCIR", "LOCALE_ADD") |>
  filter(PHYSCIR != -3) |>          # drop the sentinel value for missing circulation
  filter(CENTLIB + BRANLIB > 0)     # has at least a central library or a branch

PLDS2022$LOCALE_ADD <- factor(PLDS2022$LOCALE_ADD,
                              levels = locNUM,
                              labels = locNAME)
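A quick sanity check that every locale code matched a level; any code outside the 11-43 range would have become NA in the factor() call above:

sum(is.na(PLDS2022$LOCALE_ADD))  # expect 0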

PLDS2022$ODFINE <- as.factor(PLDS2022$ODFINE)
table(PLDS2022$ODFINE)
## 
##    N    Y 
## 4586 3326
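These counts also give the majority-class baseline, the "No Information Rate" that confusionMatrix() reports later: always predicting N would already be right about 58% of the time.

round(prop.table(table(PLDS2022$ODFINE)), 3)  # N ~ 0.58, Y ~ 0.42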

Split the data

Split the data into train and test for ODFINE.

set.seed(888)
#split based on ODFINE 
trainFine_rw <- createDataPartition(
  y=PLDS2022$ODFINE, # this is our response var
  p=.7,              # 70% in training set
  list=FALSE
)

trainF <- PLDS2022[trainFine_rw,]
testF <- PLDS2022[-trainFine_rw,]
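createDataPartition() stratifies on the response, so the class balance should be nearly identical in train and test; a quick check:

round(prop.table(table(trainF$ODFINE)), 3)
round(prop.table(table(testF$ODFINE)), 3)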

Data Exploration

Although the focus here is on the decision trees, we will keep some of the data exploration from the prior assignment.
[For conciseness, some of the exploratory visualizations done in Assignment 1 are omitted.]

Box plots for numeric data

Box plots are a way to look at the numeric values. First we do a quick and dirty peek, which we will refine as we go.

One can see how skewed the data is and how many outliers there are.

par(mar = c(3, 4, 1, 2))
par(mfrow = c(3,4))
boxplot(POPU_LSA~ODFINE, ylab="Pop of Service Area", 
        data = trainF)
boxplot(BRANLIB~ODFINE, ylab="# of Branch Libraries",
        data = trainF)
boxplot(BKMOB~ODFINE, ylab="# of Book Mobiles",
        data = trainF)
boxplot(LOCGVT~ODFINE, ylab="Local Govt Funding", 
        data = trainF)
boxplot(STGVT~ODFINE, ylab="State Govt Funding", 
        data = trainF)
boxplot(FEDGVT~ODFINE, ylab="Fed Govt Funding",
        data = trainF)
boxplot(OTHINCM~ODFINE, ylab="Other Funding", 
        data = trainF)
boxplot(PRMATEXP~ODFINE, ylab="Expenditures on Print Material", 
        data = trainF)
boxplot(BKVOL~ODFINE, ylab="# of Phys Books", 
        data = trainF)
boxplot(AUDIO_PH~ODFINE, ylab="# of Phys AudioBooks", 
        data = trainF)
boxplot(VIDEO_PH~ODFINE, ylab="# of Phys Videos", 
        data = trainF)
boxplot(PHYSCIR~ODFINE, ylab="Phys Circulation", 
        data = trainF)

Next, we will try faceting the box plots by LOCALE_ADD, starting with just one variable so the plot is larger. Still, the scale of the large urban libraries dwarfs the others and makes comparison hard.

ggplot(trainF, aes(x = ODFINE, y = PHYSCIR, fill = LOCALE_ADD)) +
  geom_boxplot() +
  scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-6))
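An alternative to rescaling the labels is a log axis, which tames the skew for display without transforming the data itself (a sketch; the +1 avoids taking the log of zero for systems with no circulation):

ggplot(trainF, aes(x = ODFINE, y = PHYSCIR + 1, fill = LOCALE_ADD)) +
  geom_boxplot() +
  scale_y_log10(labels = label_comma())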

preProcess

Let's try some of the pre-processing features in caret.

We will use the Yeo-Johnson transformation on the continuous predictors then center and scale them.
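For intuition: for non-negative x (all of our counts are), Yeo-Johnson computes ((x+1)^lambda - 1)/lambda, which approaches log(x+1) as lambda goes to 0. Since caret estimates lambda near 0 for most of these variables (see the output below), it behaves much like a log transform. A hand-rolled sketch:

yj <- function(x, lambda) {
  # Yeo-Johnson for non-negative x only; our counts are all >= 0
  if (abs(lambda) < 1e-8) log1p(x) else ((x + 1)^lambda - 1) / lambda
}
yj(c(0, 10, 1000, 1e6), lambda = 0.05)  # the long right tail gets pulled in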

# run preProcess on every column except 18 (ODFINE, the target)
preProcValues <- preProcess(trainF[,-18], method = c("center", "scale", "YeoJohnson"))
# note: fitting a second preProcess on the test set can put the test data on a
# slightly different scale than train; predict(preProcValues, testF) would avoid that
preProcValues2 <- preProcess(testF[,-18], method = c("center", "scale", "YeoJohnson"))

preProcValues
## Created from 5540 samples and 19 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (6)
##   - scaled (13)
##   - Yeo-Johnson transformation (10)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTransformed <- predict(preProcValues, trainF)
testTransformed <- predict(preProcValues2, testF)
ggplot(trainTransformed, aes(x=ODFINE, y=PHYSCIR, fill=LOCALE_ADD)) + 
  geom_boxplot()

Ooh, look at that! So much better! Let’s look at the other variables now.

vars <- c("POPU_LSA", "BRANLIB", "BKMOB", "LOCGVT", "STGVT", "FEDGVT",
          "OTHINCM", "PRMATEXP", "BKVOL", "AUDIO_PH", "VIDEO_PH", "PHYSCIR")

# one boxplot per numeric predictor, assembled into a grid
plots <- lapply(vars, function(v) {
  ggplot(trainTransformed, aes(x = ODFINE, y = .data[[v]], fill = LOCALE_ADD)) +
    geom_boxplot() +
    labs(y = v) +
    theme(axis.text = element_text(size = 6),
          legend.position = "none")
})

ggarrange(plotlist = plots)

Model 1: Decision Tree for ODFINE

The final value used for the model was cp = 0.003434951, which gives a cross-validated accuracy of 65.16%.

ctrlF <- trainControl(method = "repeatedcv", 
                      number = 10,
                      repeats = 3)
set.seed(888)
dtree_fitF <- train(ODFINE ~., data = trainTransformed,
                    method = "rpart",
                    parms = list(split = "information"),
                    trControl=ctrlF,
                    tuneLength = 10)

dtree_fitF
## CART 
## 
## 5540 samples
##   19 predictor
##    2 classes: 'N', 'Y' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 4986, 4986, 4986, 4986, 4987, 4986, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa     
##   0.003434951  0.6516277  0.26299220
##   0.003864319  0.6486818  0.25627741
##   0.004293688  0.6460294  0.24877139
##   0.004508373  0.6471121  0.24985244
##   0.004723057  0.6465706  0.24832788
##   0.004937742  0.6466302  0.24778360
##   0.005689137  0.6444640  0.24102437
##   0.017604122  0.6125757  0.10870039
##   0.023615286  0.6095685  0.09312038
##   0.034993559  0.5977131  0.05710906
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003434951.

Looking at the plot of the decision tree, one sees that many of the splits are made on dummy variables derived from STABR, the state abbreviation (for example STABRTX, Texas, at the top). This coincides with our observation last week that geography seems to be a strong indicator of fines. The other variables used, BKVOL (book volumes), VIDEO_PH (physical videos), and AUDIO_PH (physical audiobooks), relate to the size of the library system. LOCGVT, also prominent in the tree, is how much funding the system receives from its local government.

prp(dtree_fitF$finalModel, 
    box.palette = "Blues", 
    tweak = 1.2)
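Beyond eyeballing the plot, the fitted rpart object stores a variable-importance vector summarizing how much each (dummy-coded) predictor contributes across all splits:

head(sort(dtree_fitF$finalModel$variable.importance, decreasing = TRUE), 10)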

In the first attempt to run the test data through the model, there was an error. Both HI (Hawaii) and MP (the US territory of the Northern Mariana Islands) have only one library system each, so that single row lands in either train or test, never both. The error occurs because the test data then contains factor levels the model never saw during training. An internet search suggests the simplest fix is to remove those rows.

Therefore, we remove them from testTransformed to keep rolling.
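A more general way to spot such one-off levels before predicting, rather than waiting for the error (the same check works for any factor column):

# levels present in test but absent from train will break predict(); here: HI, MP
setdiff(unique(testTransformed$STABR), unique(trainTransformed$STABR))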

The test data has an accuracy of 65.57%.

testTransformed <- testTransformed |>
  filter(!(STABR %in% c("HI", "MP")))

predF <- predict(dtree_fitF, newdata = testTransformed)

confusionMatrix(predF, testTransformed$ODFINE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N 1140  581
##          Y  235  414
##                                           
##                Accuracy : 0.6557          
##                  95% CI : (0.6362, 0.6748)
##     No Information Rate : 0.5802          
##     P-Value [Acc > NIR] : 3.049e-14       
##                                           
##                   Kappa : 0.2575          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8291          
##             Specificity : 0.4161          
##          Pos Pred Value : 0.6624          
##          Neg Pred Value : 0.6379          
##              Prevalence : 0.5802          
##          Detection Rate : 0.4810          
##    Detection Prevalence : 0.7262          
##       Balanced Accuracy : 0.6226          
##                                           
##        'Positive' Class : N               
## 

Model 2: Decision Tree on LOCALE_ADD

Now we will run a decision tree to predict another variable, LOCALE_ADD (City, Suburb, Town, Rural).

We start by dividing the data into train and test with LOCALE_ADD as the response variable.

set.seed(888)
#split based on LOCALE_ADD 
trainLoc_rw <- createDataPartition(
  y=PLDS2022$LOCALE_ADD, # this is our response var
  p=.7,              # 70% in training set
  list=FALSE
)

trainL <- PLDS2022[trainLoc_rw,]
testL <- PLDS2022[-trainLoc_rw,]

We run the same preprocessing as before.

#run the preProcess on all but 20, LOCALE_ADD, the target
prePTrainL <- preProcess(trainL[,-20], method = c("center", "scale", "YeoJohnson"))
prePTestL <- preProcess(testL[,-20], method = c("center", "scale", "YeoJohnson"))

prePTrainL
trainTrsfmL <- predict(prePTrainL, trainL)
testTrsfmL <- predict(prePTestL, testL)

And then fit the model. The final value used for the model was cp = 0.003153579, which gives a cross-validated accuracy of 70.1%.

ctrlL <- trainControl(method = "repeatedcv", 
                      number = 10,
                      repeats = 3)
set.seed(888)
dtree_fitL <- train(LOCALE_ADD ~., data = trainTrsfmL,
                    method = "rpart",
                    parms = list(split = "information"),
                    trControl=ctrlL,
                    tuneLength = 10)

dtree_fitL
## CART 
## 
## 5541 samples
##   19 predictor
##    4 classes: 'City', 'Suburb', 'Town', 'Rural' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 4988, 4987, 4987, 4986, 4987, 4986, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.002838221  0.7008321  0.5656440
##   0.003153579  0.7010127  0.5660472
##   0.004099653  0.7002312  0.5651419
##   0.004415011  0.6993901  0.5639255
##   0.005045727  0.6988472  0.5632275
##   0.009618417  0.6824839  0.5398161
##   0.013718070  0.6579389  0.5074605
##   0.025228635  0.6433201  0.4848976
##   0.067328918  0.5756459  0.3699289
##   0.260170293  0.4869358  0.1529773
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.003153579.

Plotting the model shows that the biggest factor is POPU_LSA, the population of the library service area. Again, LOCGVT (funding from local government) is a factor. These two are sufficient to separate City from Rural. The rest of the tree classifies Suburb and Town, starting with GEOCODE (a code describing the geographic area served), C_LEGBAS (legal basis code), and the number of branches (BRANLIB). Interestingly, at the bottom of the tree it looks at three states, PA, NJ, and MA, to decide between Suburb and Town.

prp(dtree_fitL$finalModel, 
    box.palette = "Greens", 
    tweak = 1.2)

As with the states in Model 1, two GEOCODE levels ('SE1' and 'SS1') are not represented in the training data, so we remove those rows first. Applying the model to the test data then results in an accuracy of 69.78%.

testTrsfmL <- testTrsfmL |>
  filter(!(GEOCODE %in% c('SE1','SS1')))

predL <- predict(dtree_fitL, newdata = testTrsfmL)

confusionMatrix(predL, testTrsfmL$LOCALE_ADD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction City Suburb Town Rural
##     City    104     71   10     8
##     Suburb   27    314  100    15
##     Town     25    143  405   161
##     Rural     2     52  102   830
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6978          
##                  95% CI : (0.6788, 0.7162)
##     No Information Rate : 0.428           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5611          
##                                           
##  Mcnemar's Test P-Value : 2.477e-13       
## 
## Statistics by Class:
## 
##                      Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity              0.65823        0.5414      0.6564       0.8185
## Specificity              0.95975        0.9206      0.8122       0.8849
## Pos Pred Value           0.53886        0.6886      0.5518       0.8418
## Neg Pred Value           0.97518        0.8610      0.8703       0.8670
## Prevalence               0.06669        0.2448      0.2604       0.4280
## Detection Rate           0.04390        0.1325      0.1710       0.3504
## Detection Prevalence     0.08147        0.1925      0.3098       0.4162
## Balanced Accuracy        0.80899        0.7310      0.7343       0.8517

Model 3: Random Forest for ODFINE

reference: https://www.geeksforgeeks.org/binary-classification-or-unknown-class-in-random-forest-in-r/

The first Random Forest, for ODFINE, is the same as in Assignment 1. Fitting the model to the test data, we get an accuracy of 65.53%. Compare this to the decision tree's accuracy of 65.57%: very similar, just a touch worse.

set.seed(888)

RFmodel <- randomForest(ODFINE ~ ., data = trainTransformed, importance = TRUE, ntree = 500)

RFmodel
## 
## Call:
##  randomForest(formula = ODFINE ~ ., data = trainTransformed, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 33.3%
## Confusion matrix:
##      N    Y class.error
## N 2525  686   0.2136406
## Y 1159 1170   0.4976385
RFpred<- predict(RFmodel, newdata = testTransformed)

conf_matrix <- confusionMatrix(RFpred, testTransformed$ODFINE)
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N 1112  554
##          Y  263  441
##                                           
##                Accuracy : 0.6553          
##                  95% CI : (0.6357, 0.6744)
##     No Information Rate : 0.5802          
##     P-Value [Acc > NIR] : 4.217e-14       
##                                           
##                   Kappa : 0.2626          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8087          
##             Specificity : 0.4432          
##          Pos Pred Value : 0.6675          
##          Neg Pred Value : 0.6264          
##              Prevalence : 0.5802          
##          Detection Rate : 0.4692          
##    Detection Prevalence : 0.7030          
##       Balanced Accuracy : 0.6260          
##                                           
##        'Positive' Class : N               
## 
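Since the forest was fit with importance = TRUE, varImpPlot() shows which predictors it leans on (mean decrease in accuracy and in Gini):

varImpPlot(RFmodel, n.var = 10, main = "Variable importance: ODFINE forest")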

Model 4: Random Forest for LOCALE_ADD

We also want to try a Random Forest model on LOCALE_ADD. This has a test accuracy of 73.87%; the decision tree's accuracy was 69.78%, so this is a clear improvement.

set.seed(888)

RFmodelL <- randomForest(LOCALE_ADD ~ ., data = trainTrsfmL, importance = TRUE, ntree = 500)

RFmodelL
## 
## Call:
##  randomForest(formula = LOCALE_ADD ~ ., data = trainTrsfmL, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 26.15%
## Confusion matrix:
##        City Suburb Town Rural class.error
## City    239    111   18     3   0.3557951
## Suburb  105    892  234   125   0.3421829
## Town     19    189  927   309   0.3580332
## Rural     2     83  251  2034   0.1417722
RFpredL<- predict(RFmodelL, newdata = testTrsfmL)

conf_matrixL <- confusionMatrix(RFpredL, testTrsfmL$LOCALE_ADD)
conf_matrixL
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction City Suburb Town Rural
##     City     95     40    8     6
##     Suburb   42    373   66    21
##     Town     19    107  415   120
##     Rural     2     60  128   867
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7387          
##                  95% CI : (0.7205, 0.7563)
##     No Information Rate : 0.428           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6159          
##                                           
##  Mcnemar's Test P-Value : 3.798e-06       
## 
## Statistics by Class:
## 
##                      Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity              0.60127        0.6431      0.6726       0.8550
## Specificity              0.97558        0.9279      0.8596       0.8598
## Pos Pred Value           0.63758        0.7430      0.6278       0.8202
## Neg Pred Value           0.97162        0.8891      0.8817       0.8880
## Prevalence               0.06669        0.2448      0.2604       0.4280
## Detection Rate           0.04010        0.1575      0.1752       0.3660
## Detection Prevalence     0.06290        0.2119      0.2790       0.4462
## Balanced Accuracy        0.78842        0.7855      0.7661       0.8574
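Both forests used the default mtry of 4 (the number of variables tried at each split). A cheap follow-up would be to tune mtry against the OOB error, for example with tuneRF (a sketch; predL_df is just a helper name here, and column 20 is LOCALE_ADD, the response):

set.seed(888)
predL_df <- trainTrsfmL[, -20] |>
  mutate(across(where(is.character), as.factor))  # non-formula interface needs factors
tuneRF(x = predL_df, y = trainTrsfmL$LOCALE_ADD,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01)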