Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.
As in Assignment 1 & 2, using the public library data.
PLDS2022 is the 2022 (most recent) version of the Public Library
Survey Data. Found here, the data is a census survey of over 190
variables collected annually from over 9,000 libraries in the US,
including all 50 states, DC and outlying territories.
It is available here: https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey
PLDS2022<-
read.csv("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
A reminder, from last assignment, the data contains:
Solve a classification problem and predict the outcome of a particular feature or detail of the data used.
First classification problem is to identify a binary variable PLDS2022$ODFINE, indicating if the library system charges overdue fines.
Second classification problem is to identify locale (PLDS2022$LOCALE_ADD), which is one of 4 values: City, Suburb, Town, Rural. Note: This variable actually starts out as 12 values, but we simplify it.
In both cases, the question we are trying to answer: How does the admin structure, funding source, physical book stats like expenditures, number of volumes (Books/Audio/Video) & Circulation and other variables predict these classifiers.
NOTE: THIS IS THE SAME AS ASSIGNMENT 1 & 2.
There is some cleaning of the data that is needed before thinking about any transformations.
To start, filter out those library systems that:
* do not meet all the criteria in the FSCS Public Library
Definition
* temporarily closed for FY 2022 (STATSTRU, Structure Change Code,
‘23’)
* Have neither a Central Library nor Branches (they have only
bookmobiles)
* Did not answer the question about ODFINE (Overdue fines).
This leaves 7,912 rows.
We will focus ODFINE, through the lens library type, revenue source, population served and location as well as on physical library services, and remove other columns relating to actual revenue amounts, staffing levels and ebook and other electronic services. This keeps 20 columns.
And add factors to LOCALE_ADD, which is the the geographic location in terms of the size of the community in which the library system is located and the proximity of that community to urban and metropolitan areas. We simplified this to City, Suburb, Town and Rural. The original data divides this further into S/M/L for City and Suburb and Fringe/Distant/Remote for Town and Rural.
Of the remaining rows, 4,586 have ODFINE set to N (library systems that do not charge overdue fines), and 3,326 have ODFINE set to Y.
locNUM <- c(11,12,13,
21,22,23,
31,32,33,
41,42,43)
locNAME <- c("City", "City", "City",
"Suburb", "Suburb", "Suburb",
"Town", "Town", "Town",
"Rural", "Rural", "Rural")
PLDS2022 <- PLDS2022 |>
filter(C_FSCS == "Y") |>
filter(ODFINE != "M") |>
select("STABR", "C_RELATN", "C_LEGBAS", "C_ADMIN","GEOCODE", "POPU_LSA", "CENTLIB", "BRANLIB","BKMOB", "LOCGVT", "STGVT",
"FEDGVT", "OTHINCM", "PRMATEXP", "BKVOL", "AUDIO_PH", "VIDEO_PH", "ODFINE", "PHYSCIR","LOCALE_ADD") |>
filter (PHYSCIR != -3) |>
filter (CENTLIB+BRANLIB>0)
PLDS2022$LOCALE_ADD <-factor(PLDS2022$LOCALE_ADD,
levels=locNUM,
labels=locNAME)
PLDS2022$ODFINE <- as.factor(PLDS2022$ODFINE)
table(PLDS2022$ODFINE)
##
## N Y
## 4586 3326
Split the data into train and test for ODFINE.
set.seed(888)
#split based on ODFINE
trainFine_rw <- createDataPartition(
y=PLDS2022$ODFINE, # this is our response var
p=.7, # 70% in training set
list=FALSE
)
trainF <- PLDS2022[trainFine_rw,]
testF <- PLDS2022[-trainFine_rw,]
We will use the Yeo-Johnson transformation on the continuous predictors then center and scale them.
#run the preProcess on all but 18, ODFINE, the target
preProcValues <- preProcess(trainF[,-18], method = c("center", "scale", "YeoJohnson"))
preProcValues2 <- preProcess(testF[,-18], method = c("center", "scale", "YeoJohnson"))
preProcValues
## Created from 5540 samples and 19 variables
##
## Pre-processing:
## - centered (13)
## - ignored (6)
## - scaled (13)
## - Yeo-Johnson transformation (10)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTransformed <- predict(preProcValues, trainF)
testTransformed <- predict(preProcValues2, testF)
The final values used for the model were \(\sigma\) = 0.008477156 and C = 1, and it had an accuracy of 67%.
# Code to train the SVM
set.seed(888)
# set the 10 fold crossvalidation with AU
# to pick for us what we call the best model
control <- trainControl(method="cv",
number=10,
classProbs = TRUE)
metric <- "Accuracy"
model <- train(ODFINE ~., data = trainTransformed,
method = "svmRadial",
tuneLength = 8,
preProc = c("center","scale"),
metric=metric,
trControl=control)
## line search fails -0.8890749 0.0644336 2.202849e-05 -2.059628e-05 -2.343563e-08 1.262587e-08 -7.762975e-13line search fails -0.9046858 0.03680744 1.017828e-05 -8.728268e-06 -1.102259e-08 4.717244e-09 -1.533644e-13line search fails -0.8925669 0.04904657 1.827101e-05 -1.656212e-05 -1.95334e-08 9.56522e-09 -5.153153e-13
model
## Support Vector Machines with Radial Basis Function Kernel
##
## 5540 samples
## 19 predictor
## 2 classes: 'N', 'Y'
##
## Pre-processing: centered (93), scaled (93)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4986, 4986, 4986, 4986, 4987, 4986, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.6592091 0.2702431
## 0.50 0.6631793 0.2835542
## 1.00 0.6740617 0.3109550
## 2.00 0.6709391 0.3059762
## 4.00 0.6696778 0.3017814
## 8.00 0.6687769 0.2980897
## 16.00 0.6676965 0.2958492
## 32.00 0.6669722 0.2940954
##
## Tuning parameter 'sigma' was held constant at a value of 0.008477156
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.008477156 and C = 1.
Looking at the plot of the accuracy graph, one sees that the highest accuracy is achieved at C=1.
plot(model)
As before, we take out STABR = HI and MP as those two have only one library system per state / territory.
The test data has an accuracy of 67.3%.
testTransformed <- testTransformed|>
filter (!(STABR %in% c('HI','MP')))
predict <- predict(model, newdata = testTransformed)
confusionMatrix(predict, testTransformed$ODFINE)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 1075 476
## Y 300 519
##
## Accuracy : 0.6726
## 95% CI : (0.6533, 0.6915)
## No Information Rate : 0.5802
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.311
##
## Mcnemar's Test P-Value : 3.34e-10
##
## Sensitivity : 0.7818
## Specificity : 0.5216
## Pos Pred Value : 0.6931
## Neg Pred Value : 0.6337
## Prevalence : 0.5802
## Detection Rate : 0.4536
## Detection Prevalence : 0.6544
## Balanced Accuracy : 0.6517
##
## 'Positive' Class : N
##
Now we will run a SVM to predict another variable, LOCALE_ADD (City, Suburb, Town, Rural).
We start by dividing the data into train and test with LOCAL_ADD as the response variable.
set.seed(888)
#split based on LOCALE_ADD
trainLoc_rw <- createDataPartition(
y=PLDS2022$LOCALE_ADD, # this is our response var
p=.7, # 70% in training set
list=FALSE
)
trainL <- PLDS2022[trainLoc_rw,]
testL <- PLDS2022[-trainLoc_rw,]
We run the same preprocessing as before.
#run the preProcess on all but 20, LOCALE_ADD, the target
prePTrainL <- preProcess(trainL[,-20], method = c("center", "scale", "YeoJohnson"))
prePTestL <- preProcess(testL[,-20], method = c("center", "scale", "YeoJohnson"))
preProcValues
## Created from 5540 samples and 19 variables
##
## Pre-processing:
## - centered (13)
## - ignored (6)
## - scaled (13)
## - Yeo-Johnson transformation (10)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.05, 0.09, 0.08, -0.19, 0.14, 0.05, -0.12, 0.11, 0.14, 0.01
trainTrsfmL <- predict(prePTrainL, trainL)
testTrsfmL <- predict(prePTestL, testL)
The final values used for the model were \(\sigma\) = 0.008880636 and C = 1, and it had an accuracy of 72%.
set.seed(888)
# control and metric are already set, but repeating here as a comment to make it clear
#control <- trainControl(method="cv",
# number=10,
# classProbs = TRUE)
#metric <- "Accuracy"
model2 <- train(LOCALE_ADD ~., data = trainTrsfmL,
method = "svmRadial",
tuneLength = 8,
preProc = c("center","scale"),
metric=metric,
trControl=control)
model2
## Support Vector Machines with Radial Basis Function Kernel
##
## 5541 samples
## 19 predictor
## 4 classes: 'City', 'Suburb', 'Town', 'Rural'
##
## Pre-processing: centered (91), scaled (91)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4988, 4987, 4987, 4986, 4987, 4986, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.7061970 0.5723535
## 0.50 0.7146811 0.5833922
## 1.00 0.7204608 0.5912137
## 2.00 0.7191934 0.5895658
## 4.00 0.7144983 0.5831476
## 8.00 0.7087231 0.5743099
## 16.00 0.7049328 0.5686800
## 32.00 0.6926633 0.5503994
##
## Tuning parameter 'sigma' was held constant at a value of 0.008880636
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.008880636 and C = 1.
Looking at the plot of the accuracy graph, one sees that the highest accuracy is achieved at C=1.
plot(model2)
Applying the model to the test data, results in an accuracy of 70%.
testTrsfmL <- testTrsfmL|>
filter (!(GEOCODE %in% c('SE1','SS1')))
predict2 <- predict(model2, newdata = testTrsfmL)
confusionMatrix(predict2, testTrsfmL$LOCALE_ADD)
## Confusion Matrix and Statistics
##
## Reference
## Prediction City Suburb Town Rural
## City 80 29 19 7
## Suburb 56 345 62 31
## Town 20 174 426 148
## Rural 2 32 110 828
##
## Overall Statistics
##
## Accuracy : 0.7087
## 95% CI : (0.69, 0.727)
## No Information Rate : 0.428
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5749
##
## Mcnemar's Test P-Value : 3.818e-13
##
## Statistics by Class:
##
## Class: City Class: Suburb Class: Town Class: Rural
## Sensitivity 0.50633 0.5948 0.6904 0.8166
## Specificity 0.97512 0.9167 0.8048 0.8937
## Pos Pred Value 0.59259 0.6984 0.5547 0.8519
## Neg Pred Value 0.96509 0.8747 0.8807 0.8669
## Prevalence 0.06669 0.2448 0.2604 0.4280
## Detection Rate 0.03377 0.1456 0.1798 0.3495
## Detection Prevalence 0.05699 0.2085 0.3242 0.4103
## Balanced Accuracy 0.74073 0.7558 0.7476 0.8551