Introduction

Choose a dataset, select a methodology from weeks 1-10 (Supervised Learning) and another from weeks 11-15 (Unsupervised Learning).

Describe the problem you are trying to solve.
Describe your data and what you did to prepare the data for analysis.
Methodologies you used for analyzing the data.
What’s the purpose of the analysis performed.
Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Problem to be Solved

This is a problem from my work that I have been meaning to dive into for a while, so it is exciting to use the skills I have gotten in this class for immediate practical use.

Traditionally, we have thought of branches as providing varying services to their communities based on need
We have anecdotally thought of this in 4 categories: Tech, Programs, Circ and Mixed.
For Methodology 1 (Supervised), I will look at predicting branches into these categories.
For Methodology 2 (Unsupervised), I will see how branches might be grouped without training for a labeled result.
We will then compare the results, and see what facets of the data were used in each, and how the branches that are grouped in each way are similar and different.

Data

Each row has data for 1 branch within Brooklyn Public Library, including:

Last 12 months metrics per hour
Demographic info
Staffing

This has 52 rows of 47 variables. A small dataset, but one that is well understood so I can focus on the modeling. The variables are as follows:

Branch - The name of the branch. Note that 7 branches are excluded as they were closed for renovation. Also Central and Center for Brooklyn History are excluded as they operate differently than our neighborhood branches.

Category - Has values Circ, Tech, Prog or Mixed depending on which is the primary services provided to the community.

AM fields - 6 Boolean fields which are TRUE when the branch metric is > 175% of the median for that metric. These fields are used in the staffing model.

Metrics - 6 fields that show the metric per hour (which normalized the value when different branches close for short periods during the year) plus a 7th field with the actual hours open.

Staffing - 3 fields which identify staffing issues at a branch.

Demographics - 29 fields relating to the Branch Area Demographics from the 2020 Census data. All Census tracts in Brooklyn are assigned to a branch to make up a Branch Area.

BranchData <- read.csv("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/BranchData.csv")

summary(BranchData)

##     Branch            Category          AM.Holds        AM.Circ       
##  Length:52          Length:52          Mode :logical   Mode :logical  
##  Class :character   Class :character   FALSE:39        FALSE:36       
##  Mode  :character   Mode  :character   TRUE :13        TRUE :16       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  AM.TechProg     AM.CompSess     AM.ProgSess     AM.Visits      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:39        FALSE:46        FALSE:49        FALSE:41       
##  TRUE :13        TRUE :6         TRUE :3         TRUE :11       
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##      Holds              Circ            CompSess         TechProg       
##  Min.   : 0.7405   Min.   :  8.885   Min.   : 1.634   Min.   :0.001711  
##  1st Qu.: 3.1047   1st Qu.: 20.766   1st Qu.: 4.420   1st Qu.:0.030076  
##  Median : 5.7295   Median : 34.928   Median : 5.584   Median :0.060561  
##  Mean   : 7.6438   Mean   : 44.527   Mean   : 6.024   Mean   :0.074625  
##  3rd Qu.: 9.0243   3rd Qu.: 66.166   3rd Qu.: 7.390   3rd Qu.:0.092690  
##  Max.   :33.0621   Max.   :142.705   Max.   :11.811   Max.   :0.243370  
##                                                                         
##     ProgSess          Visits           ActHrs        Staff.LH     
##  Min.   :0.1579   Min.   : 14.55   Min.   :1578   Min.   : 0.250  
##  1st Qu.:0.2171   1st Qu.: 25.69   1st Qu.:2303   1st Qu.: 0.475  
##  Median :0.2945   Median : 34.03   Median :2331   Median : 1.550  
##  Mean   :0.3128   Mean   : 40.41   Mean   :2320   Mean   : 3.449  
##  3rd Qu.:0.3626   3rd Qu.: 46.92   3rd Qu.:2338   3rd Qu.: 6.050  
##  Max.   :0.8256   Max.   :143.68   Max.   :3149   Max.   :17.000  
##                                                   NA's   :21      
##       Vacc         DffvsModel           TotPop        AvgMedInc     
##  Min.   :0.000   Min.   :-2.00000   Min.   :16793   Min.   : 28565  
##  1st Qu.:1.000   1st Qu.:-1.00000   1st Qu.:33772   1st Qu.: 53507  
##  Median :1.000   Median : 0.00000   Median :41123   Median : 60866  
##  Mean   :1.577   Mean   :-0.01923   Mean   :46438   Mean   : 66866  
##  3rd Qu.:3.000   3rd Qu.: 1.00000   3rd Qu.:58294   3rd Qu.: 79188  
##  Max.   :4.000   Max.   : 2.00000   Max.   :93585   Max.   :158798  
##                                                                     
##  HholdBroadbandPrct    Poverty       PovertyPrct      Unemployment 
##  Min.   :0.1700     Min.   : 1041   Min.   :0.0400   Min.   : 419  
##  1st Qu.:0.4800     1st Qu.: 4056   1st Qu.:0.1100   1st Qu.:1003  
##  Median :0.6950     Median : 7552   Median :0.1700   Median :1251  
##  Mean   :0.7988     Mean   : 8580   Mean   :0.1748   Mean   :1437  
##  3rd Qu.:1.0500     3rd Qu.:11837   3rd Qu.:0.2300   3rd Qu.:1762  
##  Max.   :1.8100     Max.   :30450   Max.   :0.3300   Max.   :3179  
##                                                                    
##   UnemployPrct       Eng.Vwell     Eng.VwellPrct       Less9Gd    
##  Min.   :0.02000   Min.   :  642   Min.   :0.0200   Min.   : 162  
##  1st Qu.:0.02000   1st Qu.: 3424   1st Qu.:0.0900   1st Qu.:1540  
##  Median :0.03000   Median : 6008   Median :0.1400   Median :1992  
##  Mean   :0.03212   Mean   : 9021   Mean   :0.1854   Mean   :2616  
##  3rd Qu.:0.04000   3rd Qu.:11827   3rd Qu.:0.2425   3rd Qu.:3033  
##  Max.   :0.06000   Max.   :35690   Max.   :0.4900   Max.   :9986  
##                                                                   
##   X9.12GdNoDeg        HS          CollNoDeg         Assoc           Bach      
##  Min.   : 182   Min.   :  776   Min.   :  765   Min.   : 411   Min.   : 1996  
##  1st Qu.:1383   1st Qu.: 5126   1st Qu.: 2672   1st Qu.:1451   1st Qu.: 4496  
##  Median :1964   Median : 7634   Median : 3766   Median :1904   Median : 6806  
##  Mean   :2577   Mean   : 7881   Mean   : 3976   Mean   :1900   Mean   : 6687  
##  3rd Qu.:3714   3rd Qu.: 9253   3rd Qu.: 4841   3rd Qu.:2282   3rd Qu.: 8102  
##  Max.   :9358   Max.   :20199   Max.   :10620   Max.   :3918   Max.   :14306  
##                                                                               
##       Grad         Under5yrs        X5.9yrs       X10.14yrs      X15.19yrs   
##  Min.   :  479   Min.   :  892   Min.   : 790   Min.   : 545   Min.   : 619  
##  1st Qu.: 2446   1st Qu.: 1994   1st Qu.:1876   1st Qu.:1855   1st Qu.:1534  
##  Median : 4004   Median : 2613   Median :2194   Median :2360   Median :2162  
##  Mean   : 4316   Mean   : 3131   Mean   :2694   Mean   :2659   Mean   :2365  
##  3rd Qu.: 5938   3rd Qu.: 3698   3rd Qu.:3386   3rd Qu.:3279   3rd Qu.:2877  
##  Max.   :11975   Max.   :11484   Max.   :8772   Max.   :7814   Max.   :6394  
##                                                                              
##    X20.24yrs      X25.34yrs       X35.44yrs       X45.54yrs       X55.59yrs   
##  Min.   : 575   Min.   : 2005   Min.   : 1937   Min.   : 1806   Min.   :1150  
##  1st Qu.:1897   1st Qu.: 4823   1st Qu.: 4360   1st Qu.: 3837   1st Qu.:1911  
##  Median :2238   Median : 6684   Median : 5444   Median : 4900   Median :2518  
##  Mean   :2728   Mean   : 7827   Mean   : 5933   Mean   : 5084   Mean   :2548  
##  3rd Qu.:3501   3rd Qu.: 9526   3rd Qu.: 7405   3rd Qu.: 5761   3rd Qu.:2924  
##  Max.   :6060   Max.   :17388   Max.   :11306   Max.   :10491   Max.   :4836  
##                                                                               
##    X60.64yrs      X65.74yrs      X75.84.yrs   X85.years.and.over
##  Min.   : 820   Min.   :1444   Min.   : 705   Min.   : 210.0    
##  1st Qu.:1640   1st Qu.:2628   1st Qu.:1348   1st Qu.: 550.2    
##  Median :2308   Median :3291   Median :1645   Median : 701.5    
##  Mean   :2357   Mean   :3506   Mean   :1858   Mean   : 839.3    
##  3rd Qu.:2953   3rd Qu.:4335   3rd Qu.:2305   3rd Qu.:1064.8    
##  Max.   :4542   Max.   :6771   Max.   :5524   Max.   :2895.0    
##

Initial Clean Up

This data is pretty clean, as it comes from my day job, and that is what I do. However, there is one field Staff.LH (Lost Hours related to staff shortage) that is only populated when the branch has lost hours due to staff shortage, otherwise it is NA. In this case we will populate the NAs with 0.

BranchData <- BranchData |> 
  mutate (Staff.LH=ifelse(is.na(Staff.LH), 0, Staff.LH))

BranchData |> 
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), 
               names_to = "Variable", 
               values_to = "Missing_NAs")

## # A tibble: 47 × 2
##    Variable    Missing_NAs
##    <chr>             <int>
##  1 Branch                0
##  2 Category              0
##  3 AM.Holds              0
##  4 AM.Circ               0
##  5 AM.TechProg           0
##  6 AM.CompSess           0
##  7 AM.ProgSess           0
##  8 AM.Visits             0
##  9 Holds                 0
## 10 Circ                  0
## # ℹ 37 more rows

Split Data

Splitting the data proportion to the target variable.

set.seed(888)

trainInd <- createDataPartition(
  y=BranchData$Category, # this is our target var
  p=.7,              # 70% in training set
  list=FALSE
)

BD_train <- BranchData[trainInd,]
BD_test <- BranchData[-trainInd,]

Data Exploration

Bar Charts for Boolean Data

Bar charts help me visualize the Boolean data.

Here we are counting how many rows are TRUE or FALSE in the various “AM” (above 175% of Median) fields for different metrics. As you can see, and as one would expect, the values are mostly FALSE. But note that there are significantly less TRUE values for both Computer Sessions and Program Sessions than for the other metrics.

g1 <- BD_train |> group_by(AM.Holds) |>
  summarise(counts=n(), .groups = "drop")

g2 <- BD_train |> group_by(AM.Circ) |> 
  summarise(counts=n(), .groups = "drop")

g3 <- BD_train |> group_by(AM.TechProg) |>
  summarise(counts=n(), .groups = "drop")

g4 <- BD_train |> group_by(AM.CompSess) |> 
  summarise(counts=n(), .groups = "drop")

g5 <- BD_train |> group_by(AM.ProgSess) |> 
  summarise(counts=n(), .groups = "drop")

g6 <- BD_train |> group_by(AM.Visits) |> 
  summarise(counts=n(), .groups = "drop")

colors <- c(c("#09769A", "#72A98F"))

b1 <- ggplot(g1, aes(fill=as.factor(AM.Holds), 
                     x = AM.Holds, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors, 
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

b2 <- ggplot(g2, aes(fill=as.factor(AM.Circ), 
                     x = AM.Circ, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors,
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

b3 <- ggplot(g3, aes(fill=as.factor(AM.TechProg), 
                     x =AM.TechProg, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors,
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

b4 <- ggplot(g4, aes(fill=as.factor(AM.CompSess), 
                     x =AM.CompSess, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors,
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

b5 <- ggplot(g5, aes(fill=as.factor(AM.ProgSess), 
                     x =AM.ProgSess, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors,
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

b6 <- ggplot(g6, aes(fill=as.factor(AM.Visits), 
                     x= AM.Visits, y = counts, )) +
  geom_bar(position="dodge", stat='identity' ) +
  scale_fill_manual(values = colors,
                    labels=c('N', 'Y'))+
  theme_classic(base_size = 8)+
  theme(legend.position="none")

ggarrange(b1,b2,b3,b4,b5,b6)

Histograms & Box Plots of Metrics

Using Histograms and Box Plots to look at the metrics per hour type fields shows the data is skewed, unevenly distributed and has outliers.

metric_columns <- c('Holds','Circ', 'CompSess','ProgSess', 'TechProg', 'ActHrs')

mlt.train.metric <- BD_train |> 
  select (one_of(metric_columns))
mlt.train.metric$ID <- rownames(mlt.train.metric)  # Assign row names to ID
mlt.train.metric <- melt(mlt.train.metric, id.vars = "ID")  # Melt the data

# Create histograms of the predictors
ggplot(data = mlt.train.metric, aes(x = value)) +
  geom_histogram(binwidth = 3, fill = "skyblue", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Metric Predictors", x = "Predictors", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines

plot_list <- lapply(metric_columns[1:6], function(col) {
  ggplot(BD_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "skyblue")
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))

Histograms & Box Plots of Staff Variables

As with the metrics, using Histograms and Box Plots to look at the staff related fields shows this data is also skewed, unevenly distributed and has outliers.

staff_columns <- c('Staff.LH','Vacc', 'DffvsModel')

mlt.train.staff <- BD_train |> 
  select (one_of(staff_columns))
mlt.train.staff$ID <- rownames(mlt.train.staff)  # Assign row names to ID
mlt.train.staff <- melt(mlt.train.staff, id.vars = "ID")  # Melt the data

# Create histograms of the predictors
ggplot(data = mlt.train.staff, aes(x = value)) +
  geom_histogram(binwidth = 3, fill = "#A5D6A7", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Staff Predictors", x = "Predictors", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines

plot_list <- lapply(staff_columns[1:3], function(col) {
  ggplot(BD_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "#A5D6A7")
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))

Histograms & Box Plots of Demographic Variables

As with the other data, using the demographic fields are also skewed, unevenly distributed and have outliers. We can also see quite clearly that the scale of the data varies significantly.

#Original
demo_columns <- c('TotPop', 'AvgMedInc', 'HholdBroadbandPrct', 'Poverty', 'PovertyPrct','Unemployment', 'UnemployPrct', 'Eng.Vwell', 'Eng.VwellPrct', 'Less9Gd', 'X9.12GdNoDeg', 'HS', 'CollNoDeg', 'Assoc', 'Bach', 'Grad', 'Under5yrs', 'X5.9yrs', 'X10.14yrs', 'X15.19yrs', 'X20.24yrs', 'X25.34yrs', 'X35.44yrs', 'X45.54yrs', 'X55.59yrs', 'X60.64yrs', 'X65.74yrs', 'X75.84.yrs', 'X85.years.and.over')

mlt.train.demo <- BD_train |> 
  select (one_of(demo_columns))
mlt.train.demo$ID <- rownames(mlt.train.demo)  # Assign row names to ID
mlt.train.demo <- melt(mlt.train.demo, id.vars = "ID")  # Melt the data

# Create histograms of the predictors
ggplot(data = mlt.train.demo, aes(x = value)) +
  geom_histogram(binwidth = 50, fill = "#CE93D8", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Demographic Predictors", x = "Predictors", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines

plot_list <- lapply(demo_columns[1:9], function(col) {
  ggplot(BD_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
    theme(axis.title=element_text(size=8))
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))

plot_list <- lapply(demo_columns[10:18], function(col) {
  ggplot(BD_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
    theme(axis.title=element_text(size=8))
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))

plot_list <- lapply(demo_columns[19:29], function(col) {
  ggplot(BD_train, aes_string(y = col)) +
    geom_boxplot(outlier.color = "red", fill = "#CE93D8")+
    theme(axis.title=element_text(size=8))
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 3))

Pre Processing

As seen though the exploration, some of the data is skewed, there are zeros in the data, and value ranges are very mixed. Pre-processing the data with Yeo-Johnson Transformation will help the models.

The Branch column is just the name, basically a “dumb identifier”. There is no information in that column, so it is removed.

#Remove Branch column, which is just an identifier
BD_train <- BD_train |> select(-Branch)
BD_test <- BD_test |> select(-Branch)

#run the preProcess on all but the first column, Category, the target
preProcValues <- preProcess(BD_train[,-1], method = c("center", "scale", "YeoJohnson"))
preProcValues2 <- preProcess(BD_test[,-1], method = c("center", "scale", "YeoJohnson"))

preProcValues

## Created from 37 samples and 45 variables
## 
## Pre-processing:
##   - centered (39)
##   - ignored (6)
##   - scaled (39)
##   - Yeo-Johnson transformation (35)
## 
## Lambda estimates for Yeo-Johnson transformation:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.15624 -0.05067  0.20818  0.12739  0.30371  1.10529

trainXfmd <- predict(preProcValues, BD_train)
testXfmd <- predict(preProcValues2, BD_test)

Methodology 1

This is from the first half of class, when covering Supervised Learning models. Both a Random Forest and an SVM with be trained and tested to predict the classification of Category, a variable with four possible values.

There are a couple of challenges here. Though the data is clean, it is sparse - only 57 rows, and there very few rows with certain predictors. How each model handles this will be a factor in how well they predict Category.

Model 1: Random Forest

set.seed(888)

trainXfmd$Category <- as.factor(trainXfmd$Category)

RFmodel <- randomForest(x = trainXfmd[-1], 
                             y = trainXfmd$Category, 
                             ntree = 500) 

RFmodel

## 
## Call:
##  randomForest(x = trainXfmd[-1], y = trainXfmd$Category, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 45.95%
## Confusion matrix:
##       Circ Mixed Prog Tech class.error
## Circ     9     3    1    1   0.3571429
## Mixed    6     2    0    2   0.8000000
## Prog     3     0    0    0   1.0000000
## Tech     0     1    0    9   0.1000000

Predictions with Initial RF Model

testXfmd$Category <- as.factor(testXfmd$Category)

RFpred<- predict(RFmodel, newdata = testXfmd)

conf_matrix <- confusionMatrix(RFpred, testXfmd$Category)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Circ Mixed Prog Tech
##      Circ     3     2    0    0
##      Mixed    2     2    0    1
##      Prog     0     0    0    0
##      Tech     1     0    1    3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.2659, 0.7873)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 0.2131          
##                                           
##                   Kappa : 0.3226          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity               0.5000       0.5000     0.00000      0.7500
## Specificity               0.7778       0.7273     1.00000      0.8182
## Pos Pred Value            0.6000       0.4000         NaN      0.6000
## Neg Pred Value            0.7000       0.8000     0.93333      0.9000
## Prevalence                0.4000       0.2667     0.06667      0.2667
## Detection Rate            0.2000       0.1333     0.00000      0.2000
## Detection Prevalence      0.3333       0.3333     0.00000      0.3333
## Balanced Accuracy         0.6389       0.6136     0.50000      0.7841

iv<-importance(RFmodel)
head(iv[order(-iv[, 1]), ],10)

##      TechProg         Holds          Circ Eng.VwellPrct     Eng.Vwell 
##     2.2451970     1.5651095     1.5503528     1.3285477     1.3073187 
##        ActHrs      ProgSess  Unemployment          Grad        Visits 
##     1.0029183     0.8768685     0.7814947     0.7591445     0.7400279

Tuning the RF

set.seed(83)

# Define a grid of hyperparameters
tuneGrid <- expand.grid(
  mtry = c(1, 2, 3, 4, 5)  # Number of variables to consider at each split
)

# Set up control for cross-validation
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)

# Train the random forest model with hyperparameter tuning

rf_tune_model <- train(
  Category ~ ., 
  data = trainXfmd, 
  method = "rf", 
  trControl = trainControl, 
  tuneGrid = tuneGrid, 
  importance = TRUE
)

# View the tuned model results
print(rf_tune_model)

# Best hyperparameters found
rf_tune_model$bestTune

Predictions using Tuned RF Model

# Make predictions on the test data using the tuned model
predictions <- predict(rf_tune_model, newdata = testXfmd)

# Evaluate accuracy using confusion matrix
confusionMatrix(predictions, testXfmd$Category)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Circ Mixed Prog Tech
##      Circ     4     2    0    0
##      Mixed    2     2    0    1
##      Prog     0     0    0    0
##      Tech     0     0    1    3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6             
##                  95% CI : (0.3229, 0.8366)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 0.09505         
##                                           
##                   Kappa : 0.4118          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity               0.6667       0.5000     0.00000      0.7500
## Specificity               0.7778       0.7273     1.00000      0.9091
## Pos Pred Value            0.6667       0.4000         NaN      0.7500
## Neg Pred Value            0.7778       0.8000     0.93333      0.9091
## Prevalence                0.4000       0.2667     0.06667      0.2667
## Detection Rate            0.2667       0.1333     0.00000      0.2000
## Detection Prevalence      0.4000       0.3333     0.00000      0.2667
## Balanced Accuracy         0.7222       0.6136     0.50000      0.8295

Plot of Tunded RF

This plot shows the change in accuracy as the number of predictors considered at each split changes. As shown, 4 predictors yields the best accuracy.

plot(rf_tune_model)

Feature Importance with Tuned RF

iv_tune<-varImp(rf_tune_model, scale = FALSE)

iv_tune

## rf variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 45)
## 
##                     Circ   Mixed    Prog     Tech
## Circ             4.71313  0.7878  1.2669  8.43098
## Holds            2.73365  1.2479  0.6258  7.42157
## Eng.VwellPrct    0.44302 -0.3671  0.2085  7.07667
## Eng.Vwell        2.31381  1.4392 -0.6549  6.36474
## TechProg         6.03145  1.3428 -1.1108  4.63164
## ProgSess         4.43348  1.9093  1.7106  2.27185
## AM.TechProgTRUE  4.02417  2.3266  1.4171  4.35325
## UnemployPrct     3.52139  4.2051  3.5064  3.60157
## CompSess        -0.41762 -1.3542  3.9428 -0.69789
## Visits          -1.21384  0.1178 -0.4715  3.31778
## Unemployment     1.21754  0.5710  3.0517  3.30682
## AM.CircTRUE      0.64365  0.6000  1.0010  2.96635
## X10.14yrs        0.21960  2.7902 -1.8133 -1.03279
## X35.44yrs        0.10013 -0.7446  2.7338 -0.17122
## X9.12GdNoDeg     2.68339 -0.4661  1.7027  2.21356
## Grad             0.95396  2.5355  2.2015  2.37712
## Assoc            1.43934 -0.7836  2.4143 -0.21148
## X25.34yrs       -0.53376 -0.7805  0.0000  2.00901
## ActHrs           0.25441  1.9644 -0.2425  0.03556
## TotPop          -0.04515 -0.7047  0.6863  1.96231

Model 2: SVM

# Code to train the SVM 
set.seed(888) 
# set the 10 fold crossvalidation with AU 
# to pick for us what we call the best model 
control <- trainControl(method="cv",
                        number=10, 
                        classProbs = TRUE) 
metric <- "Accuracy"
SVMmodel <- train(Category ~., data = trainXfmd, 
               method = "svmRadial",
               tuneLength = 8,
               preProc = c("center","scale"),
               metric=metric, 
               trControl=control)

SVMmodel

plot(SVMmodel)

Predictions with initial SVM Model

SVMpredict <- predict(SVMmodel, newdata = testXfmd) 
confusionMatrix(SVMpredict, testXfmd$Category)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Circ Mixed Prog Tech
##      Circ     5     3    0    3
##      Mixed    0     1    0    1
##      Prog     0     0    0    0
##      Tech     1     0    1    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4             
##                  95% CI : (0.1634, 0.6771)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 0.5968          
##                                           
##                   Kappa : 0.0559          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity               0.8333      0.25000     0.00000      0.0000
## Specificity               0.3333      0.90909     1.00000      0.8182
## Pos Pred Value            0.4545      0.50000         NaN      0.0000
## Neg Pred Value            0.7500      0.76923     0.93333      0.6923
## Prevalence                0.4000      0.26667     0.06667      0.2667
## Detection Rate            0.3333      0.06667     0.00000      0.0000
## Detection Prevalence      0.7333      0.13333     0.00000      0.1333
## Balanced Accuracy         0.5833      0.57955     0.50000      0.4091

Tuning SVM

control_tune <- trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 3,  # Repeating the CV process 3 times to reduce variance
                        classProbs = TRUE,  # For handling class probabilities
                        summaryFunction = defaultSummary,  # Using a more detailed performance measure for classification
                        savePredictions = "all")  # Save all predictions to analyze later

# Specify metric to optimize for (done above, here for completeness)
#metric <- "Accuracy"

# Create a grid of hyperparameters for tuning
tuneGrid <- expand.grid(sigma = c(0.01, 0.05, 0.1, 0.2),  # Testing different values of sigma (kernel parameter)
                        C = c(0.5, 1, 2, 5, 10))  # Testing different values of C (penalty term)

# Train the SVM model with radial basis kernel
SVMmodel_tune <- train(Category ~ ., data = trainXfmd,
                  method = "svmRadial", 
                  tuneGrid = tuneGrid,  # Specifying a custom hyperparameter grid
                  preProc = c("center", "scale", "pca"),  # Standardize features
                  metric = metric,  #  accuracy
                  trControl = control_tune)  # Use the updated control_tune

# Output the results
SVMmodel_tune

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 37 samples
## 45 predictors
##  4 classes: 'Circ', 'Mixed', 'Prog', 'Tech' 
## 
## Pre-processing: centered (45), scaled (45), principal component
##  signal extraction (45) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 34, 34, 32, 34, 32, 33, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C     Accuracy   Kappa      
##   0.01    0.5  0.3605556  0.073115416
##   0.01    1.0  0.3650000  0.094691172
##   0.01    2.0  0.3800000  0.101018059
##   0.01    5.0  0.3627778  0.053269601
##   0.01   10.0  0.3483333  0.047284110
##   0.05    0.5  0.3533333  0.055524322
##   0.05    1.0  0.3777778  0.094526356
##   0.05    2.0  0.4105556  0.114554220
##   0.05    5.0  0.3388889  0.005584027
##   0.05   10.0  0.3966667  0.096712851
##   0.10    0.5  0.3833333  0.095705369
##   0.10    1.0  0.4105556  0.121234256
##   0.10    2.0  0.3994444  0.100594525
##   0.10    5.0  0.3927778  0.094248437
##   0.10   10.0  0.3388889  0.017273358
##   0.20    0.5  0.3950000  0.110463950
##   0.20    1.0  0.4161111  0.132990652
##   0.20    2.0  0.3944444  0.087860181
##   0.20    5.0  0.3500000  0.028762131
##   0.20   10.0  0.3638889  0.052733527
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2 and C = 1.

confusionMatrix(SVMmodel_tune)

## Cross-Validated (10 fold, repeated 3 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction Circ Mixed Prog Tech
##      Circ  27.0  17.1  6.3  9.9
##      Mixed  9.0   7.2  1.8  9.9
##      Prog   0.0   0.9  0.0  0.0
##      Tech   1.8   1.8  0.0  7.2
##                             
##  Accuracy (average) : 0.4144

Predictions with Tuned SVM

SVMpredict_tune <- predict(SVMmodel_tune, newdata = testXfmd) 
confusionMatrix(SVMpredict_tune, testXfmd$Category)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Circ Mixed Prog Tech
##      Circ     6     4    0    3
##      Mixed    0     0    1    1
##      Prog     0     0    0    0
##      Tech     0     0    0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4             
##                  95% CI : (0.1634, 0.6771)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 0.5968          
##                                           
##                   Kappa : 0.0288          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Circ Class: Mixed Class: Prog Class: Tech
## Sensitivity               1.0000       0.0000     0.00000      0.0000
## Specificity               0.2222       0.8182     1.00000      1.0000
## Pos Pred Value            0.4615       0.0000         NaN         NaN
## Neg Pred Value            1.0000       0.6923     0.93333      0.7333
## Prevalence                0.4000       0.2667     0.06667      0.2667
## Detection Rate            0.4000       0.0000     0.00000      0.0000
## Detection Prevalence      0.8667       0.1333     0.00000      0.0000
## Balanced Accuracy         0.6111       0.4091     0.50000      0.5000

SVM Feature Importance

# Calculate feature importance
importanceSVMtun <- varImp(SVMmodel_tune, scale = FALSE)  # scale = FALSE to use raw importance values
print(importanceSVMtun)

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 45)
## 
##                      Circ  Mixed   Prog   Tech
## Unemployment       0.9286 0.6571 1.0000 0.9286
## X20.24yrs          0.8571 0.5500 1.0000 0.8571
## TechProg           0.7000 0.9786 0.7000 0.9786
## X35.44yrs          0.9762 0.5357 0.9667 0.9762
## CollNoDeg          0.7381 0.7214 0.9667 0.7381
## CompSess           0.9286 0.5429 0.9667 0.9286
## Circ               0.7857 0.9643 0.7333 0.9643
## TotPop             0.8810 0.5429 0.9333 0.8810
## Holds              0.8810 0.9286 0.8000 0.9286
## UnemployPrct       0.9286 0.6893 0.7500 0.9286
## X45.54yrs          0.8810 0.5429 0.9000 0.8810
## X10.14yrs          0.8333 0.5500 0.9000 0.8333
## X25.34yrs          0.8571 0.5500 0.9000 0.8571
## Poverty            0.7619 0.5714 0.9000 0.7619
## X60.64yrs          0.8333 0.5571 0.9000 0.8333
## Bach               0.8810 0.6929 0.8333 0.8810
## X9.12GdNoDeg       0.8333 0.5643 0.8667 0.8333
## HholdBroadbandPrct 0.7500 0.6036 0.8667 0.7500
## X55.59yrs          0.8333 0.5429 0.8667 0.8333
## HS                 0.8095 0.6786 0.8667 0.8095

Methodology 2

Pre Processing

For the Unsupervised Learning models, the model will run against the entire data set - there is no training and testing because it is unsupervised.

To do this, we will go back to the original data, but take out the Branch and Category, aa branch is unique to each row, and what we want from this is to compare it to the Category, not to learn from the Category.

Therefore, start with the BranchData dataframe. Recall that the data is well structured, and the NAs were replaced with 0s in the Staff.LH, the only place where there were NAs.

We need to reapply the pre-processed with Yeo-Johnson Transformation to this version of the data to to help the models, since the various features are in different units / scales.

BranchDataNoBC <- BranchData |> select(-Branch, -Category)

preProcValues3 <- preProcess(BranchDataNoBC, method = c("center", "scale", "YeoJohnson"))

BDXfmd <- predict(preProcValues3, BranchDataNoBC)

Hierachical Clustering

#  Compute the distance matrix (Euclidean distance by default)
dist_matrix <- dist(BDXfmd, method = "euclidean")

# Perform hierarchical clustering (using the "ward.D" method, which minimizes variance)
hclust_model <- hclust(dist_matrix, method = "ward.D")

# Plot the dendrogram
# Create a basic dendrogram plot
plot(hclust_model, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "", cex = 0.9)

# Cutting the dendrogram to form clusters
clusters <- cutree(hclust_model, k = 3)  # Cut into 3 clusters

table(clusters)

## clusters
##  1  2  3 
## 13 21 18

# Calculate silhouette scores (use hclust model or cluster labels)
silhouette_scores <- silhouette(clusters, dist(BDXfmd))

# Plot silhouette scores
plot(silhouette_scores, main = "Silhouette Plot")

A lower WCSS value indicates that the observations within each cluster are more similar to each other. This is similar to the inertia in K-means clustering.

# Compute WCSS for hierarchical clustering (using Euclidean distance)
wcss <- sum(silhouette_scores[, 3])
cat("WCSS: ", wcss)

## WCSS:  8.117217

Compares the clustering result with the true labels, adjusting for chance. An ARI score of 1 indicates perfect agreement, while a score of 0 indicates random clustering. Negative values indicate worse-than-random clustering. Here we are not looking at “true labels” as the desired result, merely want to compare the clusters with the categories.

# Add the cluster labels to the original data and transformed data
BranchData$Cluster <- clusters
BDXfmd$Cluster <- clusters


# View the data with cluster labels
BranchData[c('Branch','Category','Cluster')]

##               Branch Category Cluster
## 1       Adams Street     Circ       1
## 2          Arlington    Mixed       2
## 3          Bay Ridge     Circ       3
## 4       Borough Park    Mixed       2
## 5     Brighton Beach    Mixed       2
## 6   Brooklyn Heights     Circ       1
## 7        Brower Park     Circ       1
## 8        Brownsville     Tech       3
## 9           Bushwick     Tech       2
## 10          Canarsie     Tech       3
## 11      Clinton Hill    Mixed       1
## 12      Coney Island     Tech       3
## 13         Cortelyou     Circ       1
## 14     Crown Heights     Circ       2
## 15     Cypress Hills     Tech       2
## 16            DeKalb     Circ       2
## 17             Dyker     Circ       1
## 18     East Flatbush     Tech       2
## 19   Eastern Parkway     Tech       2
## 20          Flatbush    Mixed       2
## 21         Flatlands    Mixed       3
## 22     Fort Hamilton     Circ       3
## 23   Gerritsen Beach     Prog       1
## 24         Gravesend     Circ       3
## 25        Greenpoint    Mixed       1
## 26          Highlawn     Circ       2
## 27         Homecrest     Prog       3
## 28       Jamaica Bay     Prog       3
## 29        Kensington    Mixed       2
## 30         Kings Bay    Mixed       3
## 31     Kings Highway    Mixed       2
## 32             Macon     Tech       2
## 33             Marcy     Circ       2
## 34     McKinley Park     Circ       2
## 35           Midwood     Circ       2
## 36        Mill Basin     Circ       3
## 37          New Lots     Tech       3
## 38       New Utrecht    Mixed       2
## 39           Pacific    Mixed       1
## 40         Paerdegat     Tech       3
## 41        Park Slope     Circ       1
## 42             Rugby     Tech       3
## 43          Saratoga     Tech       3
## 44    Sheepshead Bay     Prog       1
## 45      Spring Creek    Mixed       3
## 46      Stone Avenue     Tech       3
## 47       Sunset Park     Circ       2
## 48        Ulmer Park     Circ       3
## 49      Walt Whitman     Tech       1
## 50 Washington Irving     Circ       2
## 51     Williamsburgh    Mixed       2
## 52   Windsor Terrace     Circ       1

#v1
BDXfmd$Cluster <- as.numeric(factor(BDXfmd$Cluster))  

numeric_data <- BDXfmd[, sapply(BDXfmd, is.numeric)]  # Select only numeric columns 

correlations <- sapply(numeric_data, function(x) cor(x, BDXfmd$Cluster, use = "complete.obs"))

# View correlations
correlations

##              Holds               Circ           CompSess           TechProg 
##       -0.547614808       -0.427455771       -0.074240980       -0.005613741 
##           ProgSess             Visits             ActHrs           Staff.LH 
##       -0.155752614       -0.484552506        0.094963029        0.064531203 
##               Vacc         DffvsModel             TotPop          AvgMedInc 
##       -0.140574799        0.077549188        0.151702415       -0.496460128 
## HholdBroadbandPrct            Poverty        PovertyPrct       Unemployment 
##        0.338717546        0.198191087        0.213801300        0.147195791 
##       UnemployPrct          Eng.Vwell      Eng.VwellPrct            Less9Gd 
##        0.073384941        0.159670301        0.078927379        0.228924462 
##       X9.12GdNoDeg                 HS          CollNoDeg              Assoc 
##        0.256289932        0.400674842        0.334516545        0.473736898 
##               Bach               Grad          Under5yrs            X5.9yrs 
##       -0.325859884       -0.426431095        0.108287372        0.292472289 
##          X10.14yrs          X15.19yrs          X20.24yrs          X25.34yrs 
##        0.274339646        0.325084540        0.207453596       -0.148871479 
##          X35.44yrs          X45.54yrs          X55.59yrs          X60.64yrs 
##       -0.135627600        0.200239185        0.357292382        0.409926376 
##          X65.74yrs         X75.84.yrs X85.years.and.over            Cluster 
##        0.379025056        0.409641212        0.290252423        1.000000000

# Create a data frame for plotting
correlations_df <- data.frame(
  Feature = names(correlations),
  Correlation = correlations
)

# Create a bar plot of the correlations
ggplot(correlations_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip the axes to make it easier to read
  labs(title = "Correlation Between Features and Cluster Labels",
       x = "Feature",
       y = "Correlation")

# Filter correlations with absolute value greater than 0.4
cor_filtered <- correlations[abs(correlations) > 0.4]

# Create a data frame for plotting
correlations_df_filtered <- data.frame(
  Feature = names(cor_filtered),
  Correlation = cor_filtered
)
# Create a bar plot of the filtered correlations
ggplot(correlations_df_filtered, aes(x = reorder(Feature, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip the axes to make it easier to read
  labs(title = "Filtered Correlations (|Correlation| > 0.4) Between Features and Cluster Labels",
       x = "Feature",
       y = "Correlation")

Kmeans clusering

https://www.statology.org/k-means-clustering-in-r/

Looking for a bend to find optimal # of clusters

#want to start with the full dataset, minus the hier cluster

BDXfmd2 <- BDXfmd |> select (-Cluster)

fviz_nbclust(BDXfmd2, kmeans, method = "wss")

compares the total intra-cluster variation for different values of k with their expected values for a distribution with no clustering.

Highest point is 10

#calculate gap statistic based on number of clusters
gap_stat <- clusGap(BDXfmd2,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

#make this example reproducible
set.seed(333)

#perform k-means clustering with k = 3 clusters
km <- kmeans(BDXfmd2, centers = 3, nstart = 25)

#view results
km

## K-means clustering with 3 clusters of sizes 19, 14, 19
## 
## Cluster means:
##    AM.Holds    AM.Circ AM.TechProg AM.CompSess AM.ProgSess AM.Visits      Holds
## 1 0.2631579 0.42105263   0.3157895  0.26315789  0.00000000 0.2631579  0.2610716
## 2 0.5714286 0.50000000   0.1428571  0.07142857  0.14285714 0.4285714  0.7203759
## 3 0.0000000 0.05263158   0.2631579  0.00000000  0.05263158 0.0000000 -0.7918749
##         Circ   CompSess    TechProg    ProgSess     Visits      ActHrs
## 1  0.3120854  0.6170751  0.02552011 -0.05536921  0.2526034  0.22103238
## 2  0.5266989 -0.2526294 -0.05514990  0.23605261  0.6152349 -0.26791067
## 3 -0.7001793 -0.4309271  0.01511666 -0.11856429 -0.7059344 -0.02362452
##      Staff.LH        Vacc  DffvsModel     TotPop  AvgMedInc HholdBroadbandPrct
## 1 -0.44566898 -0.13976970  0.16630643  1.0605967 -0.2901546         0.89157535
## 2  0.02972848  0.26427429 -0.16229268 -0.9558937  1.0377547        -1.18541639
## 3  0.42376378 -0.05495873 -0.04672235 -0.3562540 -0.4745068        -0.01811064
##      Poverty PovertyPrct Unemployment UnemployPrct  Eng.Vwell Eng.VwellPrct
## 1  0.9259059   0.5104218    0.7111268  -0.20815176  0.7938899     0.4159729
## 2 -0.9891599  -0.7582079   -0.7365783   0.00270327 -0.7526519    -0.3821599
## 3 -0.1970512   0.0482577   -0.1683849   0.20615988 -0.2393043    -0.1343814
##      Less9Gd X9.12GdNoDeg         HS   CollNoDeg      Assoc       Bach
## 1  0.8569059    0.8911982  0.7761952  0.63552570  0.6096363  0.6526520
## 2 -0.9437207   -1.0218078 -1.1988750 -0.98289203 -1.0990621  0.1155505
## 3 -0.1615328   -0.1382872  0.1071864  0.08871053  0.2001989 -0.7377945
##         Grad  Under5yrs    X5.9yrs  X10.14yrs   X15.19yrs  X20.24yrs  X25.34yrs
## 1  0.3243099  0.9573258  0.9513591  0.9539822  0.90531941  0.9177132  0.8549477
## 2  0.4954729 -0.8441204 -1.1127319 -1.0795916 -1.13110659 -1.0202033 -0.4106724
## 3 -0.6893952 -0.3353423 -0.1314514 -0.1584937 -0.07187245 -0.1659845 -0.5523470
##    X35.44yrs  X45.54yrs   X55.59yrs   X60.64yrs   X65.74yrs  X75.84.yrs
## 1  0.8761387  0.8980255  0.79032088  0.76447646  0.81503765  0.67021081
## 2 -0.4408472 -0.9343608 -1.06626878 -1.09606839 -1.06260930 -0.96743954
## 3 -0.5513039 -0.2095491 -0.00464915  0.04315288 -0.03206237  0.04263938
##   X85.years.and.over
## 1       0.5370043347
## 2      -0.7299036071
## 3       0.0008193758
## 
## Clustering vector:
##  [1] 2 3 2 1 1 2 2 3 1 3 2 3 2 1 1 1 2 1 3 1 3 3 2 3 2 1 3 3 1 3 1 1 1 1 1 3 3 1
## [39] 2 3 2 3 3 2 3 3 1 3 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 481.3518 380.4744 385.9225
##  (between_SS / total_SS =  38.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

#plot results of final k-means model
fviz_cluster(km, data = BDXfmd2)

DATA 622 Assignment 4

D Plunkett

2024-12-13

Introduction

Problem to be Solved

Data

Initial Clean Up

Split Data

Data Exploration

Bar Charts for Boolean Data

Histograms & Box Plots of Metrics

Histograms & Box Plots of Staff Variables

Histograms & Box Plots of Demographic Variables

Pre Processing

Methodology 1

Model 1: Random Forest

Predictions with Initial RF Model

Tuning the RF

Predictions using Tuned RF Model

Plot of Tunded RF

Feature Importance with Tuned RF

Model 2: SVM

Predictions with initial SVM Model

Tuning SVM

Predictions with Tuned SVM

SVM Feature Importance

Methodology 2

Pre Processing

Hierachical Clustering

Kmeans clusering