Choose a dataset, select a methodology from weeks 1-10 (Supervised Learning) and another from weeks 11-15 (Unsupervised Learning).
This is a problem from my work that I have been meaning to dive into for a while, so it is exciting to use the skills I have gotten in this class for immediate practical use.
Each row has data for 1 branch within Brooklyn Public Library, including:
This has 52 rows of 47 variables. A small dataset, but one that is well understood so I can focus on the modeling. The variables are as follows:
Branch - The name of the branch. Note that 7 branches are excluded as they were closed for renovation. Also Central and Center for Brooklyn History are excluded as they operate differently than our neighborhood branches.
Category - Has values Circ, Tech, Prog or Mixed depending on which is the primary services provided to the community.
AM fields - 6 Boolean fields which are TRUE when the branch metric is > 175% of the median for that metric. These fields are used in the staffing model.
Metrics - 6 fields that show the metric per hour (which normalized the value when different branches close for short periods during the year) plus a 7th field with the actual hours open.
Staffing - 3 fields which identify staffing issues at a branch.
Demographics - 29 fields relating to the Branch Area Demographics from the 2020 Census data. All Census tracts in Brooklyn are assigned to a branch to make up a Branch Area.
BranchData <- read.csv("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/BranchData.csv")
summary(BranchData)
## Branch Category AM.Holds AM.Circ
## Length:52 Length:52 Mode :logical Mode :logical
## Class :character Class :character FALSE:39 FALSE:36
## Mode :character Mode :character TRUE :13 TRUE :16
##
##
##
##
## AM.TechProg AM.CompSess AM.ProgSess AM.Visits
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:39 FALSE:46 FALSE:49 FALSE:41
## TRUE :13 TRUE :6 TRUE :3 TRUE :11
##
##
##
##
## Holds Circ CompSess TechProg
## Min. : 0.7405 Min. : 8.885 Min. : 1.634 Min. :0.001711
## 1st Qu.: 3.1047 1st Qu.: 20.766 1st Qu.: 4.420 1st Qu.:0.030076
## Median : 5.7295 Median : 34.928 Median : 5.584 Median :0.060561
## Mean : 7.6438 Mean : 44.527 Mean : 6.024 Mean :0.074625
## 3rd Qu.: 9.0243 3rd Qu.: 66.166 3rd Qu.: 7.390 3rd Qu.:0.092690
## Max. :33.0621 Max. :142.705 Max. :11.811 Max. :0.243370
##
## ProgSess Visits ActHrs Staff.LH
## Min. :0.1579 Min. : 14.55 Min. :1578 Min. : 0.250
## 1st Qu.:0.2171 1st Qu.: 25.69 1st Qu.:2303 1st Qu.: 0.475
## Median :0.2945 Median : 34.03 Median :2331 Median : 1.550
## Mean :0.3128 Mean : 40.41 Mean :2320 Mean : 3.449
## 3rd Qu.:0.3626 3rd Qu.: 46.92 3rd Qu.:2338 3rd Qu.: 6.050
## Max. :0.8256 Max. :143.68 Max. :3149 Max. :17.000
## NA's :21
## Vacc DffvsModel TotPop AvgMedInc
## Min. :0.000 Min. :-2.00000 Min. :16793 Min. : 28565
## 1st Qu.:1.000 1st Qu.:-1.00000 1st Qu.:33772 1st Qu.: 53507
## Median :1.000 Median : 0.00000 Median :41123 Median : 60866
## Mean :1.577 Mean :-0.01923 Mean :46438 Mean : 66866
## 3rd Qu.:3.000 3rd Qu.: 1.00000 3rd Qu.:58294 3rd Qu.: 79188
## Max. :4.000 Max. : 2.00000 Max. :93585 Max. :158798
##
## HholdBroadbandPrct Poverty PovertyPrct Unemployment
## Min. :0.1700 Min. : 1041 Min. :0.0400 Min. : 419
## 1st Qu.:0.4800 1st Qu.: 4056 1st Qu.:0.1100 1st Qu.:1003
## Median :0.6950 Median : 7552 Median :0.1700 Median :1251
## Mean :0.7988 Mean : 8580 Mean :0.1748 Mean :1437
## 3rd Qu.:1.0500 3rd Qu.:11837 3rd Qu.:0.2300 3rd Qu.:1762
## Max. :1.8100 Max. :30450 Max. :0.3300 Max. :3179
##
## UnemployPrct Eng.Vwell Eng.VwellPrct Less9Gd
## Min. :0.02000 Min. : 642 Min. :0.0200 Min. : 162
## 1st Qu.:0.02000 1st Qu.: 3424 1st Qu.:0.0900 1st Qu.:1540
## Median :0.03000 Median : 6008 Median :0.1400 Median :1992
## Mean :0.03212 Mean : 9021 Mean :0.1854 Mean :2616
## 3rd Qu.:0.04000 3rd Qu.:11827 3rd Qu.:0.2425 3rd Qu.:3033
## Max. :0.06000 Max. :35690 Max. :0.4900 Max. :9986
##
## X9.12GdNoDeg HS CollNoDeg Assoc Bach
## Min. : 182 Min. : 776 Min. : 765 Min. : 411 Min. : 1996
## 1st Qu.:1383 1st Qu.: 5126 1st Qu.: 2672 1st Qu.:1451 1st Qu.: 4496
## Median :1964 Median : 7634 Median : 3766 Median :1904 Median : 6806
## Mean :2577 Mean : 7881 Mean : 3976 Mean :1900 Mean : 6687
## 3rd Qu.:3714 3rd Qu.: 9253 3rd Qu.: 4841 3rd Qu.:2282 3rd Qu.: 8102
## Max. :9358 Max. :20199 Max. :10620 Max. :3918 Max. :14306
##
## Grad Under5yrs X5.9yrs X10.14yrs X15.19yrs
## Min. : 479 Min. : 892 Min. : 790 Min. : 545 Min. : 619
## 1st Qu.: 2446 1st Qu.: 1994 1st Qu.:1876 1st Qu.:1855 1st Qu.:1534
## Median : 4004 Median : 2613 Median :2194 Median :2360 Median :2162
## Mean : 4316 Mean : 3131 Mean :2694 Mean :2659 Mean :2365
## 3rd Qu.: 5938 3rd Qu.: 3698 3rd Qu.:3386 3rd Qu.:3279 3rd Qu.:2877
## Max. :11975 Max. :11484 Max. :8772 Max. :7814 Max. :6394
##
## X20.24yrs X25.34yrs X35.44yrs X45.54yrs X55.59yrs
## Min. : 575 Min. : 2005 Min. : 1937 Min. : 1806 Min. :1150
## 1st Qu.:1897 1st Qu.: 4823 1st Qu.: 4360 1st Qu.: 3837 1st Qu.:1911
## Median :2238 Median : 6684 Median : 5444 Median : 4900 Median :2518
## Mean :2728 Mean : 7827 Mean : 5933 Mean : 5084 Mean :2548
## 3rd Qu.:3501 3rd Qu.: 9526 3rd Qu.: 7405 3rd Qu.: 5761 3rd Qu.:2924
## Max. :6060 Max. :17388 Max. :11306 Max. :10491 Max. :4836
##
## X60.64yrs X65.74yrs X75.84.yrs X85.years.and.over
## Min. : 820 Min. :1444 Min. : 705 Min. : 210.0
## 1st Qu.:1640 1st Qu.:2628 1st Qu.:1348 1st Qu.: 550.2
## Median :2308 Median :3291 Median :1645 Median : 701.5
## Mean :2357 Mean :3506 Mean :1858 Mean : 839.3
## 3rd Qu.:2953 3rd Qu.:4335 3rd Qu.:2305 3rd Qu.:1064.8
## Max. :4542 Max. :6771 Max. :5524 Max. :2895.0
##
This data is pretty clean, as it comes from my day job, and that is what I do. However, there is one field Staff.LH (Lost Hours related to staff shortage) that is only populated when the branch has lost hours due to staff shortage, otherwise it is NA. In this case we will populate the NAs with 0.
BranchData <- BranchData |>
mutate (Staff.LH=ifelse(is.na(Staff.LH), 0, Staff.LH))
BranchData |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Missing_NAs")
## # A tibble: 47 × 2
## Variable Missing_NAs
## <chr> <int>
## 1 Branch 0
## 2 Category 0
## 3 AM.Holds 0
## 4 AM.Circ 0
## 5 AM.TechProg 0
## 6 AM.CompSess 0
## 7 AM.ProgSess 0
## 8 AM.Visits 0
## 9 Holds 0
## 10 Circ 0
## # ℹ 37 more rows
Splitting the data proportion to the target variable.
set.seed(888)
trainInd <- createDataPartition(
y=BranchData$Category, # this is our target var
p=.7, # 70% in training set
list=FALSE
)
BD_train <- BranchData[trainInd,]
BD_test <- BranchData[-trainInd,]
Bar charts help me visualize the Boolean data.
Here we are counting how many rows are TRUE or FALSE in the various “AM” (above 175% of Median) fields for different metrics. As you can see, and as one would expect, the values are mostly FALSE. But note that there are significantly less TRUE values for both Computer Sessions and Program Sessions than for the other metrics.
g1 <- BD_train |> group_by(AM.Holds) |>
summarise(counts=n(), .groups = "drop")
g2 <- BD_train |> group_by(AM.Circ) |>
summarise(counts=n(), .groups = "drop")
g3 <- BD_train |> group_by(AM.TechProg) |>
summarise(counts=n(), .groups = "drop")
g4 <- BD_train |> group_by(AM.CompSess) |>
summarise(counts=n(), .groups = "drop")
g5 <- BD_train |> group_by(AM.ProgSess) |>
summarise(counts=n(), .groups = "drop")
g6 <- BD_train |> group_by(AM.Visits) |>
summarise(counts=n(), .groups = "drop")
colors <- c(c("#09769A", "#72A98F"))
b1 <- ggplot(g1, aes(fill=as.factor(AM.Holds),
x = AM.Holds, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
b2 <- ggplot(g2, aes(fill=as.factor(AM.Circ),
x = AM.Circ, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
b3 <- ggplot(g3, aes(fill=as.factor(AM.TechProg),
x =AM.TechProg, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
b4 <- ggplot(g4, aes(fill=as.factor(AM.CompSess),
x =AM.CompSess, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
b5 <- ggplot(g5, aes(fill=as.factor(AM.ProgSess),
x =AM.ProgSess, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
b6 <- ggplot(g6, aes(fill=as.factor(AM.Visits),
x= AM.Visits, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = colors,
labels=c('N', 'Y'))+
theme_classic(base_size = 8)+
theme(legend.position="none")
ggarrange(b1,b2,b3,b4,b5,b6)
Using Histograms and Box Plots to look at the metrics per hour type fields shows the data is skewed, unevenly distributed and has outliers.
metric_columns <- c('Holds','Circ', 'CompSess','ProgSess', 'TechProg', 'ActHrs')
mlt.train.metric <- BD_train |>
select (one_of(metric_columns))
mlt.train.metric$ID <- rownames(mlt.train.metric) # Assign row names to ID
mlt.train.metric <- melt(mlt.train.metric, id.vars = "ID") # Melt the data
# Create histograms of the predictors
ggplot(data = mlt.train.metric, aes(x = value)) +
geom_histogram(binwidth = 3, fill = "skyblue", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Distributions of Metric Predictors", x = "Predictors", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
plot_list <- lapply(metric_columns[1:6], function(col) {
ggplot(BD_train, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "skyblue")
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
As with the metrics, using Histograms and Box Plots to look at the staff related fields shows this data is also skewed, unevenly distributed and has outliers.
staff_columns <- c('Staff.LH','Vacc', 'DffvsModel')
mlt.train.staff <- BD_train |>
select (one_of(staff_columns))
mlt.train.staff$ID <- rownames(mlt.train.staff) # Assign row names to ID
mlt.train.staff <- melt(mlt.train.staff, id.vars = "ID") # Melt the data
# Create histograms of the predictors
ggplot(data = mlt.train.staff, aes(x = value)) +
geom_histogram(binwidth = 3, fill = "#A5D6A7", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Distributions of Staff Predictors", x = "Predictors", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
plot_list <- lapply(staff_columns[1:3], function(col) {
ggplot(BD_train, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "#A5D6A7")
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
As with the other data, using the demographic fields are also skewed, unevenly distributed and have outliers. We can also see quite clearly that the scale of the data varies significantly.
#Original
demo_columns <- c('TotPop', 'AvgMedInc', 'HholdBroadbandPrct', 'Poverty', 'PovertyPrct','Unemployment', 'UnemployPrct', 'Eng.Vwell', 'Eng.VwellPrct', 'Less9Gd', 'X9.12GdNoDeg', 'HS', 'CollNoDeg', 'Assoc', 'Bach', 'Grad', 'Under5yrs', 'X5.9yrs', 'X10.14yrs', 'X15.19yrs', 'X20.24yrs', 'X25.34yrs', 'X35.44yrs', 'X45.54yrs', 'X55.59yrs', 'X60.64yrs', 'X65.74yrs', 'X75.84.yrs', 'X85.years.and.over')
mlt.train.demo <- BD_train |>
select (one_of(demo_columns))
mlt.train.demo$ID <- rownames(mlt.train.demo) # Assign row names to ID
mlt.train.demo <- melt(mlt.train.demo, id.vars = "ID") # Melt the data
# Create histograms of the predictors
ggplot(data = mlt.train.demo, aes(x = value)) +
geom_histogram(binwidth = 50, fill = "#CE93D8", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Distributions of Demographic Predictors", x = "Predictors", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
plot_list <- lapply(demo_columns[1:9], function(col) {
ggplot(BD_train, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
theme(axis.title=element_text(size=8))
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
plot_list <- lapply(demo_columns[10:18], function(col) {
ggplot(BD_train, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
theme(axis.title=element_text(size=8))
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
plot_list <- lapply(demo_columns[19:29], function(col) {
ggplot(BD_train, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
theme(axis.title=element_text(size=8))
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))
As seen though the exploration, some of the data is skewed, there are zeros in the data, and value ranges are very mixed. Pre-processing the data with Yeo-Johnson Transformation will help the models.
The Branch column is just the name, basically a “dumb identifier”. There is no information in that column, so it is removed.
#Remove Branch column, which is just an identifier
BD_train <- BD_train |> select(-Branch)
BD_test <- BD_test |> select(-Branch)
#run the preProcess on all but the first column, Category, the target
preProcValues <- preProcess(BD_train[,-1], method = c("center", "scale", "YeoJohnson"))
preProcValues2 <- preProcess(BD_test[,-1], method = c("center", "scale", "YeoJohnson"))
preProcValues
## Created from 37 samples and 45 variables
##
## Pre-processing:
## - centered (39)
## - ignored (6)
## - scaled (39)
## - Yeo-Johnson transformation (35)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.15624 -0.05067 0.20818 0.12739 0.30371 1.10529
trainXfmd <- predict(preProcValues, BD_train)
testXfmd <- predict(preProcValues2, BD_test)
This is from the first half of class, when covering Supervised Learning models. Both a Random Forest and an SVM with be trained and tested to predict the classification of Category, a variable with four possible values.
There are a couple of challenges here. Though the data is clean, it is sparse - only 57 rows, and there very few rows with certain predictors. How each model handles this will be a factor in how well they predict Category.
set.seed(888)
trainXfmd$Category <- as.factor(trainXfmd$Category)
RFmodel <- randomForest(x = trainXfmd[-1],
y = trainXfmd$Category,
ntree = 500)
RFmodel
##
## Call:
## randomForest(x = trainXfmd[-1], y = trainXfmd$Category, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 45.95%
## Confusion matrix:
## Circ Mixed Prog Tech class.error
## Circ 9 3 1 1 0.3571429
## Mixed 6 2 0 2 0.8000000
## Prog 3 0 0 0 1.0000000
## Tech 0 1 0 9 0.1000000
testXfmd$Category <- as.factor(testXfmd$Category)
RFpred<- predict(RFmodel, newdata = testXfmd)
conf_matrix <- confusionMatrix(RFpred, testXfmd$Category)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Circ Mixed Prog Tech
## Circ 3 2 0 0
## Mixed 2 2 0 1
## Prog 0 0 0 0
## Tech 1 0 1 3
##
## Overall Statistics
##
## Accuracy : 0.5333
## 95% CI : (0.2659, 0.7873)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 0.2131
##
## Kappa : 0.3226
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity 0.5000 0.5000 0.00000 0.7500
## Specificity 0.7778 0.7273 1.00000 0.8182
## Pos Pred Value 0.6000 0.4000 NaN 0.6000
## Neg Pred Value 0.7000 0.8000 0.93333 0.9000
## Prevalence 0.4000 0.2667 0.06667 0.2667
## Detection Rate 0.2000 0.1333 0.00000 0.2000
## Detection Prevalence 0.3333 0.3333 0.00000 0.3333
## Balanced Accuracy 0.6389 0.6136 0.50000 0.7841
iv<-importance(RFmodel)
head(iv[order(-iv[, 1]), ],10)
## TechProg Holds Circ Eng.VwellPrct Eng.Vwell
## 2.2451970 1.5651095 1.5503528 1.3285477 1.3073187
## ActHrs ProgSess Unemployment Grad Visits
## 1.0029183 0.8768685 0.7814947 0.7591445 0.7400279
set.seed(83)
# Define a grid of hyperparameters
tuneGrid <- expand.grid(
mtry = c(1, 2, 3, 4, 5) # Number of variables to consider at each split
)
# Set up control for cross-validation
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
# Train the random forest model with hyperparameter tuning
rf_tune_model <- train(
Category ~ .,
data = trainXfmd,
method = "rf",
trControl = trainControl,
tuneGrid = tuneGrid,
importance = TRUE
)
# View the tuned model results
print(rf_tune_model)
# Best hyperparameters found
rf_tune_model$bestTune
# Make predictions on the test data using the tuned model
predictions <- predict(rf_tune_model, newdata = testXfmd)
# Evaluate accuracy using confusion matrix
confusionMatrix(predictions, testXfmd$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Circ Mixed Prog Tech
## Circ 4 2 0 0
## Mixed 2 2 0 1
## Prog 0 0 0 0
## Tech 0 0 1 3
##
## Overall Statistics
##
## Accuracy : 0.6
## 95% CI : (0.3229, 0.8366)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 0.09505
##
## Kappa : 0.4118
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity 0.6667 0.5000 0.00000 0.7500
## Specificity 0.7778 0.7273 1.00000 0.9091
## Pos Pred Value 0.6667 0.4000 NaN 0.7500
## Neg Pred Value 0.7778 0.8000 0.93333 0.9091
## Prevalence 0.4000 0.2667 0.06667 0.2667
## Detection Rate 0.2667 0.1333 0.00000 0.2000
## Detection Prevalence 0.4000 0.3333 0.00000 0.2667
## Balanced Accuracy 0.7222 0.6136 0.50000 0.8295
This plot shows the change in accuracy as the number of predictors considered at each split changes. As shown, 4 predictors yields the best accuracy.
plot(rf_tune_model)
iv_tune<-varImp(rf_tune_model, scale = FALSE)
iv_tune
## rf variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 45)
##
## Circ Mixed Prog Tech
## Circ 4.71313 0.7878 1.2669 8.43098
## Holds 2.73365 1.2479 0.6258 7.42157
## Eng.VwellPrct 0.44302 -0.3671 0.2085 7.07667
## Eng.Vwell 2.31381 1.4392 -0.6549 6.36474
## TechProg 6.03145 1.3428 -1.1108 4.63164
## ProgSess 4.43348 1.9093 1.7106 2.27185
## AM.TechProgTRUE 4.02417 2.3266 1.4171 4.35325
## UnemployPrct 3.52139 4.2051 3.5064 3.60157
## CompSess -0.41762 -1.3542 3.9428 -0.69789
## Visits -1.21384 0.1178 -0.4715 3.31778
## Unemployment 1.21754 0.5710 3.0517 3.30682
## AM.CircTRUE 0.64365 0.6000 1.0010 2.96635
## X10.14yrs 0.21960 2.7902 -1.8133 -1.03279
## X35.44yrs 0.10013 -0.7446 2.7338 -0.17122
## X9.12GdNoDeg 2.68339 -0.4661 1.7027 2.21356
## Grad 0.95396 2.5355 2.2015 2.37712
## Assoc 1.43934 -0.7836 2.4143 -0.21148
## X25.34yrs -0.53376 -0.7805 0.0000 2.00901
## ActHrs 0.25441 1.9644 -0.2425 0.03556
## TotPop -0.04515 -0.7047 0.6863 1.96231
# Code to train the SVM
set.seed(888)
# set the 10 fold crossvalidation with AU
# to pick for us what we call the best model
control <- trainControl(method="cv",
number=10,
classProbs = TRUE)
metric <- "Accuracy"
SVMmodel <- train(Category ~., data = trainXfmd,
method = "svmRadial",
tuneLength = 8,
preProc = c("center","scale"),
metric=metric,
trControl=control)
SVMmodel
plot(SVMmodel)
SVMpredict <- predict(SVMmodel, newdata = testXfmd)
confusionMatrix(SVMpredict, testXfmd$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Circ Mixed Prog Tech
## Circ 5 3 0 3
## Mixed 0 1 0 1
## Prog 0 0 0 0
## Tech 1 0 1 0
##
## Overall Statistics
##
## Accuracy : 0.4
## 95% CI : (0.1634, 0.6771)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 0.5968
##
## Kappa : 0.0559
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity 0.8333 0.25000 0.00000 0.0000
## Specificity 0.3333 0.90909 1.00000 0.8182
## Pos Pred Value 0.4545 0.50000 NaN 0.0000
## Neg Pred Value 0.7500 0.76923 0.93333 0.6923
## Prevalence 0.4000 0.26667 0.06667 0.2667
## Detection Rate 0.3333 0.06667 0.00000 0.0000
## Detection Prevalence 0.7333 0.13333 0.00000 0.1333
## Balanced Accuracy 0.5833 0.57955 0.50000 0.4091
control_tune <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3, # Repeating the CV process 3 times to reduce variance
classProbs = TRUE, # For handling class probabilities
summaryFunction = defaultSummary, # Using a more detailed performance measure for classification
savePredictions = "all") # Save all predictions to analyze later
# Specify metric to optimize for (done above, here for completeness)
#metric <- "Accuracy"
# Create a grid of hyperparameters for tuning
tuneGrid <- expand.grid(sigma = c(0.01, 0.05, 0.1, 0.2), # Testing different values of sigma (kernel parameter)
C = c(0.5, 1, 2, 5, 10)) # Testing different values of C (penalty term)
# Train the SVM model with radial basis kernel
SVMmodel_tune <- train(Category ~ ., data = trainXfmd,
method = "svmRadial",
tuneGrid = tuneGrid, # Specifying a custom hyperparameter grid
preProc = c("center", "scale", "pca"), # Standardize features
metric = metric, # accuracy
trControl = control_tune) # Use the updated control_tune
# Output the results
SVMmodel_tune
## Support Vector Machines with Radial Basis Function Kernel
##
## 37 samples
## 45 predictors
## 4 classes: 'Circ', 'Mixed', 'Prog', 'Tech'
##
## Pre-processing: centered (45), scaled (45), principal component
## signal extraction (45)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 34, 34, 32, 34, 32, 33, ...
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.01 0.5 0.3605556 0.073115416
## 0.01 1.0 0.3650000 0.094691172
## 0.01 2.0 0.3800000 0.101018059
## 0.01 5.0 0.3627778 0.053269601
## 0.01 10.0 0.3483333 0.047284110
## 0.05 0.5 0.3533333 0.055524322
## 0.05 1.0 0.3777778 0.094526356
## 0.05 2.0 0.4105556 0.114554220
## 0.05 5.0 0.3388889 0.005584027
## 0.05 10.0 0.3966667 0.096712851
## 0.10 0.5 0.3833333 0.095705369
## 0.10 1.0 0.4105556 0.121234256
## 0.10 2.0 0.3994444 0.100594525
## 0.10 5.0 0.3927778 0.094248437
## 0.10 10.0 0.3388889 0.017273358
## 0.20 0.5 0.3950000 0.110463950
## 0.20 1.0 0.4161111 0.132990652
## 0.20 2.0 0.3944444 0.087860181
## 0.20 5.0 0.3500000 0.028762131
## 0.20 10.0 0.3638889 0.052733527
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2 and C = 1.
confusionMatrix(SVMmodel_tune)
## Cross-Validated (10 fold, repeated 3 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Circ Mixed Prog Tech
## Circ 27.0 17.1 6.3 9.9
## Mixed 9.0 7.2 1.8 9.9
## Prog 0.0 0.9 0.0 0.0
## Tech 1.8 1.8 0.0 7.2
##
## Accuracy (average) : 0.4144
SVMpredict_tune <- predict(SVMmodel_tune, newdata = testXfmd)
confusionMatrix(SVMpredict_tune, testXfmd$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Circ Mixed Prog Tech
## Circ 6 4 0 3
## Mixed 0 0 1 1
## Prog 0 0 0 0
## Tech 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4
## 95% CI : (0.1634, 0.6771)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 0.5968
##
## Kappa : 0.0288
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity 1.0000 0.0000 0.00000 0.0000
## Specificity 0.2222 0.8182 1.00000 1.0000
## Pos Pred Value 0.4615 0.0000 NaN NaN
## Neg Pred Value 1.0000 0.6923 0.93333 0.7333
## Prevalence 0.4000 0.2667 0.06667 0.2667
## Detection Rate 0.4000 0.0000 0.00000 0.0000
## Detection Prevalence 0.8667 0.1333 0.00000 0.0000
## Balanced Accuracy 0.6111 0.4091 0.50000 0.5000
# Calculate feature importance
importanceSVMtun <- varImp(SVMmodel_tune, scale = FALSE) # scale = FALSE to use raw importance values
print(importanceSVMtun)
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 45)
##
## Circ Mixed Prog Tech
## Unemployment 0.9286 0.6571 1.0000 0.9286
## X20.24yrs 0.8571 0.5500 1.0000 0.8571
## TechProg 0.7000 0.9786 0.7000 0.9786
## X35.44yrs 0.9762 0.5357 0.9667 0.9762
## CollNoDeg 0.7381 0.7214 0.9667 0.7381
## CompSess 0.9286 0.5429 0.9667 0.9286
## Circ 0.7857 0.9643 0.7333 0.9643
## TotPop 0.8810 0.5429 0.9333 0.8810
## Holds 0.8810 0.9286 0.8000 0.9286
## UnemployPrct 0.9286 0.6893 0.7500 0.9286
## X45.54yrs 0.8810 0.5429 0.9000 0.8810
## X10.14yrs 0.8333 0.5500 0.9000 0.8333
## X25.34yrs 0.8571 0.5500 0.9000 0.8571
## Poverty 0.7619 0.5714 0.9000 0.7619
## X60.64yrs 0.8333 0.5571 0.9000 0.8333
## Bach 0.8810 0.6929 0.8333 0.8810
## X9.12GdNoDeg 0.8333 0.5643 0.8667 0.8333
## HholdBroadbandPrct 0.7500 0.6036 0.8667 0.7500
## X55.59yrs 0.8333 0.5429 0.8667 0.8333
## HS 0.8095 0.6786 0.8667 0.8095
For the Unsupervised Learning models, the model will run against the entire data set - there is no training and testing because it is unsupervised.
To do this, we will go back to the original data, but take out the Branch and Category, aa branch is unique to each row, and what we want from this is to compare it to the Category, not to learn from the Category.
Therefore, start with the BranchData dataframe. Recall that the data is well structured, and the NAs were replaced with 0s in the Staff.LH, the only place where there were NAs.
We need to reapply the pre-processed with Yeo-Johnson Transformation to this version of the data to to help the models, since the various features are in different units / scales.
BranchDataNoBC <- BranchData |> select(-Branch, -Category)
preProcValues3 <- preProcess(BranchDataNoBC, method = c("center", "scale", "YeoJohnson"))
BDXfmd <- predict(preProcValues3, BranchDataNoBC)
# Compute the distance matrix (Euclidean distance by default)
dist_matrix <- dist(BDXfmd, method = "euclidean")
# Perform hierarchical clustering (using the "ward.D" method, which minimizes variance)
hclust_model <- hclust(dist_matrix, method = "ward.D")
# Plot the dendrogram
# Create a basic dendrogram plot
plot(hclust_model, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.9)
# Cutting the dendrogram to form clusters
clusters <- cutree(hclust_model, k = 3) # Cut into 3 clusters
table(clusters)
## clusters
## 1 2 3
## 13 21 18
# Calculate silhouette scores (use hclust model or cluster labels)
silhouette_scores <- silhouette(clusters, dist(BDXfmd))
# Plot silhouette scores
plot(silhouette_scores, main = "Silhouette Plot")
A lower WCSS value indicates that the observations within each cluster are more similar to each other. This is similar to the inertia in K-means clustering.
# Compute WCSS for hierarchical clustering (using Euclidean distance)
wcss <- sum(silhouette_scores[, 3])
cat("WCSS: ", wcss)
## WCSS: 8.117217
Compares the clustering result with the true labels, adjusting for chance. An ARI score of 1 indicates perfect agreement, while a score of 0 indicates random clustering. Negative values indicate worse-than-random clustering. Here we are not looking at “true labels” as the desired result, merely want to compare the clusters with the categories.
# Add the cluster labels to the original data and transformed data
BranchData$Cluster <- clusters
BDXfmd$Cluster <- clusters
# View the data with cluster labels
BranchData[c('Branch','Category','Cluster')]
## Branch Category Cluster
## 1 Adams Street Circ 1
## 2 Arlington Mixed 2
## 3 Bay Ridge Circ 3
## 4 Borough Park Mixed 2
## 5 Brighton Beach Mixed 2
## 6 Brooklyn Heights Circ 1
## 7 Brower Park Circ 1
## 8 Brownsville Tech 3
## 9 Bushwick Tech 2
## 10 Canarsie Tech 3
## 11 Clinton Hill Mixed 1
## 12 Coney Island Tech 3
## 13 Cortelyou Circ 1
## 14 Crown Heights Circ 2
## 15 Cypress Hills Tech 2
## 16 DeKalb Circ 2
## 17 Dyker Circ 1
## 18 East Flatbush Tech 2
## 19 Eastern Parkway Tech 2
## 20 Flatbush Mixed 2
## 21 Flatlands Mixed 3
## 22 Fort Hamilton Circ 3
## 23 Gerritsen Beach Prog 1
## 24 Gravesend Circ 3
## 25 Greenpoint Mixed 1
## 26 Highlawn Circ 2
## 27 Homecrest Prog 3
## 28 Jamaica Bay Prog 3
## 29 Kensington Mixed 2
## 30 Kings Bay Mixed 3
## 31 Kings Highway Mixed 2
## 32 Macon Tech 2
## 33 Marcy Circ 2
## 34 McKinley Park Circ 2
## 35 Midwood Circ 2
## 36 Mill Basin Circ 3
## 37 New Lots Tech 3
## 38 New Utrecht Mixed 2
## 39 Pacific Mixed 1
## 40 Paerdegat Tech 3
## 41 Park Slope Circ 1
## 42 Rugby Tech 3
## 43 Saratoga Tech 3
## 44 Sheepshead Bay Prog 1
## 45 Spring Creek Mixed 3
## 46 Stone Avenue Tech 3
## 47 Sunset Park Circ 2
## 48 Ulmer Park Circ 3
## 49 Walt Whitman Tech 1
## 50 Washington Irving Circ 2
## 51 Williamsburgh Mixed 2
## 52 Windsor Terrace Circ 1
#v1
BDXfmd$Cluster <- as.numeric(factor(BDXfmd$Cluster))
numeric_data <- BDXfmd[, sapply(BDXfmd, is.numeric)] # Select only numeric columns
correlations <- sapply(numeric_data, function(x) cor(x, BDXfmd$Cluster, use = "complete.obs"))
# View correlations
correlations
## Holds Circ CompSess TechProg
## -0.547614808 -0.427455771 -0.074240980 -0.005613741
## ProgSess Visits ActHrs Staff.LH
## -0.155752614 -0.484552506 0.094963029 0.064531203
## Vacc DffvsModel TotPop AvgMedInc
## -0.140574799 0.077549188 0.151702415 -0.496460128
## HholdBroadbandPrct Poverty PovertyPrct Unemployment
## 0.338717546 0.198191087 0.213801300 0.147195791
## UnemployPrct Eng.Vwell Eng.VwellPrct Less9Gd
## 0.073384941 0.159670301 0.078927379 0.228924462
## X9.12GdNoDeg HS CollNoDeg Assoc
## 0.256289932 0.400674842 0.334516545 0.473736898
## Bach Grad Under5yrs X5.9yrs
## -0.325859884 -0.426431095 0.108287372 0.292472289
## X10.14yrs X15.19yrs X20.24yrs X25.34yrs
## 0.274339646 0.325084540 0.207453596 -0.148871479
## X35.44yrs X45.54yrs X55.59yrs X60.64yrs
## -0.135627600 0.200239185 0.357292382 0.409926376
## X65.74yrs X75.84.yrs X85.years.and.over Cluster
## 0.379025056 0.409641212 0.290252423 1.000000000
# Create a data frame for plotting
correlations_df <- data.frame(
Feature = names(correlations),
Correlation = correlations
)
# Create a bar plot of the correlations
ggplot(correlations_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip the axes to make it easier to read
labs(title = "Correlation Between Features and Cluster Labels",
x = "Feature",
y = "Correlation")
# Filter correlations with absolute value greater than 0.4
cor_filtered <- correlations[abs(correlations) > 0.4]
# Create a data frame for plotting
correlations_df_filtered <- data.frame(
Feature = names(cor_filtered),
Correlation = cor_filtered
)
# Create a bar plot of the filtered correlations
ggplot(correlations_df_filtered, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip the axes to make it easier to read
labs(title = "Filtered Correlations (|Correlation| > 0.4) Between Features and Cluster Labels",
x = "Feature",
y = "Correlation")
https://www.statology.org/k-means-clustering-in-r/
Looking for a bend to find optimal # of clusters
#want to start with the full dataset, minus the hier cluster
BDXfmd2 <- BDXfmd |> select (-Cluster)
fviz_nbclust(BDXfmd2, kmeans, method = "wss")
compares the total intra-cluster variation for different values of k with their expected values for a distribution with no clustering.
Highest point is 10
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(BDXfmd2,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
#make this example reproducible
set.seed(333)
#perform k-means clustering with k = 3 clusters
km <- kmeans(BDXfmd2, centers = 3, nstart = 25)
#view results
km
## K-means clustering with 3 clusters of sizes 19, 14, 19
##
## Cluster means:
## AM.Holds AM.Circ AM.TechProg AM.CompSess AM.ProgSess AM.Visits Holds
## 1 0.2631579 0.42105263 0.3157895 0.26315789 0.00000000 0.2631579 0.2610716
## 2 0.5714286 0.50000000 0.1428571 0.07142857 0.14285714 0.4285714 0.7203759
## 3 0.0000000 0.05263158 0.2631579 0.00000000 0.05263158 0.0000000 -0.7918749
## Circ CompSess TechProg ProgSess Visits ActHrs
## 1 0.3120854 0.6170751 0.02552011 -0.05536921 0.2526034 0.22103238
## 2 0.5266989 -0.2526294 -0.05514990 0.23605261 0.6152349 -0.26791067
## 3 -0.7001793 -0.4309271 0.01511666 -0.11856429 -0.7059344 -0.02362452
## Staff.LH Vacc DffvsModel TotPop AvgMedInc HholdBroadbandPrct
## 1 -0.44566898 -0.13976970 0.16630643 1.0605967 -0.2901546 0.89157535
## 2 0.02972848 0.26427429 -0.16229268 -0.9558937 1.0377547 -1.18541639
## 3 0.42376378 -0.05495873 -0.04672235 -0.3562540 -0.4745068 -0.01811064
## Poverty PovertyPrct Unemployment UnemployPrct Eng.Vwell Eng.VwellPrct
## 1 0.9259059 0.5104218 0.7111268 -0.20815176 0.7938899 0.4159729
## 2 -0.9891599 -0.7582079 -0.7365783 0.00270327 -0.7526519 -0.3821599
## 3 -0.1970512 0.0482577 -0.1683849 0.20615988 -0.2393043 -0.1343814
## Less9Gd X9.12GdNoDeg HS CollNoDeg Assoc Bach
## 1 0.8569059 0.8911982 0.7761952 0.63552570 0.6096363 0.6526520
## 2 -0.9437207 -1.0218078 -1.1988750 -0.98289203 -1.0990621 0.1155505
## 3 -0.1615328 -0.1382872 0.1071864 0.08871053 0.2001989 -0.7377945
## Grad Under5yrs X5.9yrs X10.14yrs X15.19yrs X20.24yrs X25.34yrs
## 1 0.3243099 0.9573258 0.9513591 0.9539822 0.90531941 0.9177132 0.8549477
## 2 0.4954729 -0.8441204 -1.1127319 -1.0795916 -1.13110659 -1.0202033 -0.4106724
## 3 -0.6893952 -0.3353423 -0.1314514 -0.1584937 -0.07187245 -0.1659845 -0.5523470
## X35.44yrs X45.54yrs X55.59yrs X60.64yrs X65.74yrs X75.84.yrs
## 1 0.8761387 0.8980255 0.79032088 0.76447646 0.81503765 0.67021081
## 2 -0.4408472 -0.9343608 -1.06626878 -1.09606839 -1.06260930 -0.96743954
## 3 -0.5513039 -0.2095491 -0.00464915 0.04315288 -0.03206237 0.04263938
## X85.years.and.over
## 1 0.5370043347
## 2 -0.7299036071
## 3 0.0008193758
##
## Clustering vector:
## [1] 2 3 2 1 1 2 2 3 1 3 2 3 2 1 1 1 2 1 3 1 3 3 2 3 2 1 3 3 1 3 1 1 1 1 1 3 3 1
## [39] 2 3 2 3 3 2 3 3 1 3 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 481.3518 380.4744 385.9225
## (between_SS / total_SS = 38.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#plot results of final k-means model
fviz_cluster(km, data = BDXfmd2)