Introduction

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.

Reference:
https://www.geeksforgeeks.org/support-vector-machine-classifier-implementation-in-r-with-caret-package/

Select a files

As in Assignment 1 & 2, using the public library data.

PLDS2022 is the 2022 (most recent) version of the Public Library Survey Data. Found here, the data is a census survey of over 190 variables collected annually from over 9,000 libraries in the US, including all 50 states, DC and outlying territories.
It is available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey

Load data: PLDS2022

PLDS2022<-
read.csv("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")

A reminder, from last assignment, the data contains:

  • 9,248 rows, each of which represents a US public library system
  • 192 columns, of which 94 are character and 98 are numeric
  • no missing values, but in the data dictionary they use other values for missing data M, -1, -3, -9
    (Data Dictionary in Appendix A of: https://www.imls.gov/sites/default/files/2024-06/2022_pls_data_file_documentation.pdf)
  • Some fields, like FSCSKEY (a unique ID), LIBNAME (Library Name) and various address fields are unique to each row, and we will remove them prior to any other transformation.
  • Some of the Character fields are factors, for example, C_ADMIN is the Administrative Structure Code, which can be one of three values identifying how the library systems are organized.
  • For the numeric fields, looking at the small histograms, we see much of the numeric data is right skewed.
  • There are also many “imputation flags” indicating that other fields are imputed. For our purposes, we will remove and ignore these.

Problem to be solved

Solve a classification problem and predict the outcome of a particular feature or detail of the data used.

First classification problem is to identify a binary variable PLDS2022$ODFINE, indicating if the library system charges overdue fines.

Second classification problem is to identify locale (PLDS2022$LOCALE_ADD), which is one of 4 values: City, Suburb, Town, Rural. Note: This variable actually starts out as 12 values, but we simplify it.

In both cases, the question we are trying to answer: How does the admin structure, funding source, physical book stats like expenditures, number of volumes (Books/Audio/Video) & Circulation and other variables predict these classifiers.

Initial Clean up

NOTE: THIS IS THE SAME AS ASSIGNMENT 1 & 2.

There is some cleaning of the data that is needed before thinking about any transformations.

To start, filter out those library systems that:
* do not meet all the criteria in the FSCS Public Library Definition
* temporarily closed for FY 2022 (STATSTRU, Structure Change Code, ‘23’)
* Have neither a Central Library nor Branches (they have only bookmobiles)
* Did not answer the question about ODFINE (Overdue fines).
This leaves 7,912 rows.

We will focus ODFINE, through the lens library type, revenue source, population served and location as well as on physical library services, and remove other columns relating to actual revenue amounts, staffing levels and ebook and other electronic services. This keeps 20 columns.

And add factors to LOCALE_ADD, which is the the geographic location in terms of the size of the community in which the library system is located and the proximity of that community to urban and metropolitan areas. We simplified this to City, Suburb, Town and Rural. The original data divides this further into S/M/L for City and Suburb and Fringe/Distant/Remote for Town and Rural.

Of the remaining rows, 4,586 have ODFINE set to N (library systems that do not charge overdue fines), and 3,326 have ODFINE set to Y.

locNUM <- c(11,12,13,
21,22,23,
31,32,33,
41,42,43)
locNAME <- c("City", "City", "City",
"Suburb", "Suburb", "Suburb",
"Town", "Town", "Town",
"Rural", "Rural", "Rural")
PLDS2022 <- PLDS2022 |>
filter(C_FSCS == "Y") |>
filter(ODFINE != "M") |> 
select("STABR", "C_RELATN", "C_LEGBAS", "C_ADMIN","GEOCODE", "POPU_LSA", "CENTLIB", "BRANLIB","BKMOB", "LOCGVT", "STGVT",
"FEDGVT", "OTHINCM", "PRMATEXP", "BKVOL", "AUDIO_PH", "VIDEO_PH", "ODFINE", "PHYSCIR","LOCALE_ADD") |>
filter (PHYSCIR != -3) |>
filter (CENTLIB+BRANLIB>0)

PLDS2022$LOCALE_ADD <-factor(PLDS2022$LOCALE_ADD,
levels=locNUM,
labels=locNAME)

PLDS2022$ODFINE <- as.factor(PLDS2022$ODFINE)
table(PLDS2022$ODFINE)
## 
##    N    Y 
## 4586 3326

Split the data

Split the data into train and test for ODFINE.

set.seed(888)
#split based on ODFINE 
trainFine_rw <- createDataPartition(
  y=PLDS2022$ODFINE, # this is our response var
  p=.7,              # 70% in training set
  list=FALSE
)

trainF <- PLDS2022[trainFine_rw,]
testF <- PLDS2022[-trainFine_rw,]

preProcess

We will use the Yeo-Johnson transformation on the continuous predictors then center and scale them.

#run the preProcess on all but 18, ODFINE, the target
preProcValues <- preProcess(trainF[,-18], method = c("center", "scale", "YeoJohnson"))
preProcValues2 <- preProcess(testF[,-18], method = c("center", "scale", "YeoJohnson"))

preProcValues
## Created from 5540 samples and 19 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (6)
##   - scaled (13)
##   - Yeo-Johnson transformation (10)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTransformed <- predict(preProcValues, trainF)
testTransformed <- predict(preProcValues2, testF)

Model 1: SVM for ODFINE

The final values used for the model were \(\sigma\) = 0.008477156 and C = 1, and it had an accuracy of 67%.

# Code to train the SVM 
set.seed(888) 
# set the 10 fold crossvalidation with AU 
# to pick for us what we call the best model 
control <- trainControl(method="cv",
                        number=10, 
                        classProbs = TRUE) 
metric <- "Accuracy"
model <- train(ODFINE ~., data = trainTransformed, 
               method = "svmRadial",
               tuneLength = 8,
               preProc = c("center","scale"),
               metric=metric, 
               trControl=control)
## line search fails -0.8890749 0.0644336 2.202849e-05 -2.059628e-05 -2.343563e-08 1.262587e-08 -7.762975e-13line search fails -0.9046858 0.03680744 1.017828e-05 -8.728268e-06 -1.102259e-08 4.717244e-09 -1.533644e-13line search fails -0.8925669 0.04904657 1.827101e-05 -1.656212e-05 -1.95334e-08 9.56522e-09 -5.153153e-13
model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5540 samples
##   19 predictor
##    2 classes: 'N', 'Y' 
## 
## Pre-processing: centered (93), scaled (93) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4986, 4986, 4986, 4986, 4987, 4986, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##    0.25  0.6592091  0.2702431
##    0.50  0.6631793  0.2835542
##    1.00  0.6740617  0.3109550
##    2.00  0.6709391  0.3059762
##    4.00  0.6696778  0.3017814
##    8.00  0.6687769  0.2980897
##   16.00  0.6676965  0.2958492
##   32.00  0.6669722  0.2940954
## 
## Tuning parameter 'sigma' was held constant at a value of 0.008477156
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.008477156 and C = 1.

Looking at the plot of the accuracy graph, one sees that the highest accuracy is achieved at C=1.

plot(model)

As before, we take out STABR = HI and MP as those two have only one library system per state / territory.

The test data has an accuracy of 67.3%.

testTransformed <- testTransformed|>
  filter (!(STABR %in% c('HI','MP')))

predict <- predict(model, newdata = testTransformed) 
confusionMatrix(predict, testTransformed$ODFINE)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N 1075  476
##          Y  300  519
##                                           
##                Accuracy : 0.6726          
##                  95% CI : (0.6533, 0.6915)
##     No Information Rate : 0.5802          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.311           
##                                           
##  Mcnemar's Test P-Value : 3.34e-10        
##                                           
##             Sensitivity : 0.7818          
##             Specificity : 0.5216          
##          Pos Pred Value : 0.6931          
##          Neg Pred Value : 0.6337          
##              Prevalence : 0.5802          
##          Detection Rate : 0.4536          
##    Detection Prevalence : 0.6544          
##       Balanced Accuracy : 0.6517          
##                                           
##        'Positive' Class : N               
## 

Model 2: SVM on LOCALE_ADD

Now we will run a SVM to predict another variable, LOCALE_ADD (City, Suburb, Town, Rural).

We start by dividing the data into train and test with LOCAL_ADD as the response variable.

set.seed(888)
#split based on LOCALE_ADD 
trainLoc_rw <- createDataPartition(
  y=PLDS2022$LOCALE_ADD, # this is our response var
  p=.7,              # 70% in training set
  list=FALSE
)

trainL <- PLDS2022[trainLoc_rw,]
testL <- PLDS2022[-trainLoc_rw,]

We run the same preprocessing as before.

#run the preProcess on all but 20, LOCALE_ADD, the target
prePTrainL <- preProcess(trainL[,-20], method = c("center", "scale", "YeoJohnson"))
prePTestL <- preProcess(testL[,-20], method = c("center", "scale", "YeoJohnson"))

preProcValues
## Created from 5540 samples and 19 variables
## 
## Pre-processing:
##   - centered (13)
##   - ignored (6)
##   - scaled (13)
##   - Yeo-Johnson transformation (10)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTrsfmL <- predict(prePTrainL, trainL)
testTrsfmL <- predict(prePTestL, testL)

The final values used for the model were \(\sigma\) = 0.008880636 and C = 1, and it had an accuracy of 72%.

set.seed(888) 
# control and metric are already set, but repeating here as a comment to make it clear 
#control <- trainControl(method="cv",
#                        number=10, 
#                        classProbs = TRUE) 
#metric <- "Accuracy"
model2 <- train(LOCALE_ADD ~., data = trainTrsfmL, 
               method = "svmRadial",
               tuneLength = 8,
               preProc = c("center","scale"),
               metric=metric, 
               trControl=control)

model2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5541 samples
##   19 predictor
##    4 classes: 'City', 'Suburb', 'Town', 'Rural' 
## 
## Pre-processing: centered (91), scaled (91) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4988, 4987, 4987, 4986, 4987, 4986, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##    0.25  0.7061970  0.5723535
##    0.50  0.7146811  0.5833922
##    1.00  0.7204608  0.5912137
##    2.00  0.7191934  0.5895658
##    4.00  0.7144983  0.5831476
##    8.00  0.7087231  0.5743099
##   16.00  0.7049328  0.5686800
##   32.00  0.6926633  0.5503994
## 
## Tuning parameter 'sigma' was held constant at a value of 0.008880636
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.008880636 and C = 1.

Looking at the plot of the accuracy graph, one sees that the highest accuracy is achieved at C=1.

plot(model2)

Applying the model to the test data, results in an accuracy of 70%.

testTrsfmL <- testTrsfmL|>
  filter (!(GEOCODE  %in% c('SE1','SS1')))

predict2 <- predict(model2, newdata = testTrsfmL) 
confusionMatrix(predict2, testTrsfmL$LOCALE_ADD)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction City Suburb Town Rural
##     City     80     29   19     7
##     Suburb   56    345   62    31
##     Town     20    174  426   148
##     Rural     2     32  110   828
## 
## Overall Statistics
##                                        
##                Accuracy : 0.7087       
##                  95% CI : (0.69, 0.727)
##     No Information Rate : 0.428        
##     P-Value [Acc > NIR] : < 2.2e-16    
##                                        
##                   Kappa : 0.5749       
##                                        
##  Mcnemar's Test P-Value : 3.818e-13    
## 
## Statistics by Class:
## 
##                      Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity              0.50633        0.5948      0.6904       0.8166
## Specificity              0.97512        0.9167      0.8048       0.8937
## Pos Pred Value           0.59259        0.6984      0.5547       0.8519
## Neg Pred Value           0.96509        0.8747      0.8807       0.8669
## Prevalence               0.06669        0.2448      0.2604       0.4280
## Detection Rate           0.03377        0.1456      0.1798       0.3495
## Detection Prevalence     0.05699        0.2085      0.3242       0.4103
## Balanced Accuracy        0.74073        0.7558      0.7476       0.8551