Introduction

This project will explore the relationship between local funding of public libraries (a measure of local support for public libraries) and the sustainability of the cities those libraries serve. The goal is to show that local funding of libraries is a key indicator of a city’s progress toward environmental, social, and economic sustainability. The focus will be on large urban cities and their libraries.

Note that all of the code is included in the Code Appendix.

Data

City Ranking Data

This data is very straightforward and well organized. It is the 2022 Sustainable City Ranking, with 194 rows. Each row is an observation of a US city. We will use the following variables and remove the rest.

  • City - City Name
  • State - State
  • Overall.Rank - the overall sustainability rank for the city

A summary of this data is shown below:

##      City              State            Overall.Rank   
##  Length:194         Length:194         Min.   :  1.00  
##  Class :character   Class :character   1st Qu.: 49.25  
##  Mode  :character   Mode  :character   Median : 97.50  
##                                        Mean   : 97.49  
##                                        3rd Qu.:145.75  
##                                        Max.   :194.00

Public Library (PL) Data

This data is also very clean, but a bit larger. Like the city ranking data, it is from 2022 (the latest available). It starts out with 9,248 rows of 192 variables. We will look at the variables as we use them, but first we want to limit the data to big urban cities.

The variable LOCALE_ADD contains a two-digit code that represents the location of the library system. LOCALE_ADD = 11 identifies the largest cities, defined in the PLS data dictionary as:
11–City, Large: Territory inside an urbanized area and inside a principal city with population of 250,000 or more.

We will limit the data to just those cities of interest.

We also want to limit to those library systems that meet the definition of a Public Library. The field C_FSCS is a Y/N flag, where Y means the system meets all the criteria of a Public Library.
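A condensed sketch of these two filters (the Code Appendix applies them as separate filter() calls):

#keep only large-city systems that meet the Public Library definition
pl_data <- pl_data |>
  filter(LOCALE_ADD == 11, C_FSCS == 'Y')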

There are also a couple of county libraries that overlap with city libraries (e.g., the Orange County library also serves the city of Santa Ana). In these cases, we want to focus only on the city libraries, so we will remove the county libraries.

There is also one state library, Hawaii. We will remove that as well.

We also want to drop all fields that have only one value, i.e., the same value for every row. These will not add anything meaningful to our analysis and will just add clutter.
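The Code Appendix does this with base R (sapply() over the columns); an equivalent dplyr-style sketch would be:

#keep only columns with more than one distinct value
pl_data <- pl_data |>
  dplyr::select(where(~ dplyr::n_distinct(.x) > 1))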

This leaves 83 rows of 156 variables: a very small data set, but one focused squarely on our area of interest. We will hold off exploring the other variables until we use them.

Put data together

Next, we want to put the two data sets together. To do this, we will join them, keeping all of the pl_data and just the rows from city_rank that match. We can match using the CITY/City fields in the two data sets. Note that all 83 public library cities are ranked in the city_rank data.

First, we need to make the City values in city_rank all uppercase, so they match those in the pl_data.

Next, there are a few special cases that need to be adjusted.
Special Case 1 is the city in MN, which is called SAINT PAUL in the pl_data, and ST. PAUL in the city_rank data. We will change the city_rank value to ensure it matches.

Special Case 2 is Memphis, TN. In the pl_data, the city name is MEMPHIS (S). This will be corrected to MEMPHIS. (Note we checked, and this does not mean Memphis South, or similar.)
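A sketch of these two adjustments, as in the Code Appendix:

#Special Case 1: match pl_data's spelling of Saint Paul
city_rank <- city_rank |>
  mutate(City = str_replace(City, 'ST. PAUL', 'SAINT PAUL'))

#Special Case 2: correct MEMPHIS (S) to MEMPHIS
pl_data$CITY[pl_data$FSCSKEY == 'TN0134'] <- 'MEMPHIS'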

Next, there are some cases where cities in different states share the same name (e.g., Arlington in both VA and TX). We need to make a field that concatenates the city and state fields in both data sets, and we will join on this field.

After confirming that the column city.state is unique in each dataset, we can join the data.

We use a left join, as we want all of the pl_data and only the city_rank data for matching cities.
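Condensed from the Code Appendix, the join key and the join itself look like this:

#build the same City + State key in both data sets, then left-join on it
pl_data   <- pl_data   |> mutate(city.state = paste(CITY, STABR))
city_rank <- city_rank |> mutate(city.state = paste(City, State))

pl_rank <- left_join(pl_data, city_rank, by = 'city.state')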

We have one more special case, the three library systems in NYC. The New York Public Library (NYPL), which serves the boroughs of Manhattan, the Bronx and Staten Island, has City = New York, which will match the city_rank data, so that works. But the city_rank data does not separate the boroughs of Brooklyn and Queens, which each have their own library system (BPL and QPL). Since these all serve the city of New York, we will copy the city_rank columns (the last 9) from NYPL to BPL and QPL.

Data Exploration

Operating Revenue Fields

There are four fields that represent operating revenue in the pl_data: LOCGVT (operating revenue from local government), STGVT (from state government), FEDGVT (from federal government) and OTHINCM (other operating revenue). We will use histograms and box plots to explore this data, which show that all four are heavily skewed, with outliers.

Local Funding as a % of Total Funding

We want to explore local funding as a percent of the total, so we will calculate that field (loc_pct) and have a look at its distribution. It is also skewed with outliers, but it shows that most library systems have local funding of at least 40% of the total; in fact, only 2 are below 50%.

To get a more even distribution, we create bins by quantile for this percentage of total funding.
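A sketch of the calculation and binning, as in the Code Appendix:

#local funding as a % of total, binned into quartiles
pl_rank <- pl_rank |>
  mutate(loc_pct = LOCGVT / TOTINCM,
         bin_lp  = cut(loc_pct,
                       breaks = quantile(loc_pct, probs = seq(0, 1, 0.25)),
                       labels = c("Q1", "Q2", "Q3", "Q4"),
                       include.lowest = TRUE))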

We plot the Overall (Sustainability) Rank against the % of local funding, colored by bin, and add a regression line, just to get a feel for what we are seeing.

Limit the data to fields of interest

We create a small data set with just the fields of interest for the linear model.

Building Models

We call set.seed() before fitting the models so the results are reproducible.

Cross Validation

Because the data set is so small, we want to use cross-validation. This train_control object will be used in all of the subsequent models.
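From the Code Appendix, the seed and the shared control object:

set.seed(888)
#10-fold cross-validation, reused by every model below
train_control <- trainControl(method = "cv", number = 10)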

Simple Linear Model

First, we want to set up a simple linear model. We do not expect great results here, but it gives us a baseline for comparing the other models.
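The model itself is a one-liner with caret (full code in the Code Appendix):

lm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F,
                  method = "lm", trControl = train_control)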

##       RMSE   Rsquared        MAE 
## 56.2990118  0.0411275 49.6386663

As expected, pretty bad results:

  • RMSE shows that, on average, the predictions deviate from the actuals by about 56 rank positions (see the definitions below).
  • \(R^2\): about 4% (0.041) of the variance of the target variable is explained by the model.
  • MAE shows that, on average, the model’s predictions are off by about 49.6 rank positions.
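For reference, with \(y_i\) the actual rank, \(\hat{y}_i\) the predicted rank, and \(n\) the number of library systems, these metrics are defined as:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\]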

Check the Assumptions

Just for thoroughness, we will check the linear model assumptions.

In the Residuals vs Fitted plot, we see more points to the right (the skewness), rather than the random scatter around zero we want for a linear model. This is not the funnel shape of heteroscedasticity, where the variance changes as the fitted values increase (which would violate one of the assumptions of linear regression - homoscedasticity).

The Normal Q-Q plot, showing heavy tails, reinforces the lack of normally distributed residuals.

The RMSE Across Cross-Validation Folds shows inconsistency in the RMSE across the folds - a sign that the model does NOT generalize well.

Note that we also tried some variations (Gamma regression and robust linear regression), with only very minor improvements.

Random Forest

Tree-based models like Random Forest and Decision Trees are inherently robust to outliers. Trees split the data based on the best features, and outliers typically don’t affect the splits significantly.

Random Forests in particular can be very effective at modeling data with outliers, as they build multiple trees using different subsets of the data.
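A condensed sketch from the Code Appendix; mtry (the number of predictors tried at each split) is tuned over 1 to 4:

tune_grid <- expand.grid(mtry = c(1, 2, 3, 4))

rf_model <- train(Overall.Rank ~ ., data = pl_rank,
                  method = "rf", trControl = train_control,
                  tuneGrid = tune_grid)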

## Random Forest 
## 
##  83 samples
## 160 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 75, 74, 75, 75, 75, 74, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     56.18068  0.3554825  49.77880
##   2     53.75106  0.3782917  47.77488
##   3     51.49646  0.3827737  45.81794
##   4     49.76102  0.3744031  43.85485
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.

This shows further improvement, but is still quite poor:

  • RMSE shows that, on average, the predictions deviate from the actuals by about 50 rank positions.
  • \(R^2\): about 37% (0.374) of the variance of the target variable (Overall.Rank) is explained by the model.
  • MAE shows that, on average, the model’s predictions are off by about 44 rank positions.

RF Variable Importance

Even though the model does a poor job of explaining the variance in Overall.Rank, we might learn something about what is important to the model.

XGBoost

With similar benefits to Random Forest, XGBoost is an even more powerful ensemble method to use on small, skewed data.
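Condensed from the Code Appendix: a model matrix handles the character fields, and the same matrix X is reused later for the SHAP values:

X <- model.matrix(Overall.Rank ~ . - 1, data = pl_rank)

xgb_model <- train(x = X, y = pl_rank$Overall.Rank,
                   method = "xgbTree",
                   trControl = train_control, tuneLength = 3)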

##       RMSE  Rsquared      MAE
## 1 56.05040 0.2824386 45.48761
## 2 47.16960 0.4065147 38.52283
## 3 42.85114 0.4943605 35.71385
## 4 53.56663 0.2823527 41.97079
## 5 51.51815 0.3020539 41.90481
## 6 46.60390 0.3704433 38.77632

Again, further improvement, but still quite poor. The ‘best’ (lowest-RMSE) results are:

  • RMSE shows that, on average, the predictions deviate from the actuals by about 43 rank positions.
  • \(R^2\): about 49% (0.494) of the variance of the target variable (Overall.Rank) is explained by the model.
  • MAE shows that, on average, the model’s predictions are off by about 35.7 rank positions.

XGB Variable Importance

As with the RF model, even though the model does a poor job of explaining the variance in Overall.Rank, we might still learn something about what is important to it. Here we use the shapviz package to get and plot the SHAP (SHapley Additive exPlanations) values. Recall that SHAP values break down a model’s prediction for a single data point, showing the contribution of each feature to that prediction.
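A condensed sketch of the shapviz calls from the Code Appendix:

#SHAP values for the final (best) XGBoost model
shp <- shapviz(xgb_model$finalModel, X_pred = X)
sv_importance(shp, kind = "both")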

More about SHAP values and how they are plotted:

  • Feature Importance: Features at the top are the most important in predicting the model output. You can see which features have the greatest influence on the prediction.
  • Feature Impact: The spread of the dots shows the range of SHAP values for that feature. A wider spread means that the feature has a more variable impact on the prediction across different data points.
  • Positive/Negative Impact: The direction of the dots (right for positive SHAP values, left for negative) shows whether the feature generally increases or decreases the model’s output.
  • Feature Value Influence: The color of the dots indicates whether higher or lower values of a feature contribute to higher or lower predictions.

Note on Variable Importance

The RF and XGB models were each run several times, with the seeds

  • 888
  • 333
  • 476
  • 562

in order to identify the variables that were consistently important, given the randomness inherent in the models. The results of these runs were captured in a spreadsheet (for expedience), and the variables that showed up consistently were used to form hypotheses for further study.

Appendix 1

Here are some portions of the code I did not end up using in my analysis, but which were interesting, so I kept them for future reference.

Variable Importance from Cross Validation

RF

With RF, we got similar results using recursive feature elimination (RFE) with cross-validation:

# Define control for RFE
ctrl_rf <- rfeControl(
  functions = rfFuncs,   # Built-in functions for RF
  method = "cv",         # Cross-validation
  number = 5,            # 5-fold CV
  verbose = FALSE
)

#define X & y
x = dplyr::select(pl_rank, -Overall.Rank)
y = pl_rank$Overall.Rank

# Run RFE
set.seed(123)
rfe_rf <- rfe(
  x = x,
  y = y,
  sizes = c(10,20,40,60,80,83),   # Try diff # of features
  rfeControl = ctrl_rf
)

# View results
rfe_rf
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
##         10 45.35   0.4127 36.77  8.319     0.2494 7.888        *
##         20 46.45   0.4017 39.46  9.044     0.2606 7.639         
##         40 46.75   0.3955 39.88  7.322     0.2300 6.173         
##         60 47.11   0.3889 40.28  6.152     0.2138 5.177         
##         80 48.03   0.3674 41.29  6.052     0.2062 5.183         
##         83 48.30   0.3626 41.35  6.141     0.2094 5.377         
##        160 48.66   0.3576 42.00  5.170     0.2074 4.478         
## 
## The top 5 variables (out of 10):
##    LONGITUD, ZIP_M, EBOOK, ZIP, MASTER

XGB

We could not get XGB to work with harvesting the feature importance via the cross-validation, and in any case we realized that doing so means averaging across best and non-best models, as opposed to running with different seeds and using the results from multiple ‘best’ models.

#THIS HAS ERRORS - do not eval, leaving until done just in case.

#define x again, this time with model.matrix to deal 
#with character fields and as.data.frame so the X & Y 
#still match up with same number of rows
x_xgb = as.data.frame(model.matrix(Overall.Rank ~ . - 1, data = pl_rank))


# Start from caret's default functions
xgbFuncs <- caretFuncs

# Use xgbTree as the model method
xgbFuncs$fit <- function(x, y, first, last, ...){
  train(
    x = x_xgb,
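    # NOTE: this uses the global x_xgb rather than the x subset that rfe()
    # passes in, which is a likely source of the errors mentioned above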
    y = y,
    method = "xgbTree",
    trControl = trainControl(method = "cv", number = 5),
    ...
  )
}

# RFE control
ctrl_xgb <- rfeControl(
  functions = xgbFuncs,
  method = "cv",
  number = 3,
  verbose = TRUE  # Set to TRUE to get more logs
)

# Run RFE
set.seed(123)
rfe_xgb <- rfe(
  x = x_xgb,
  y = y,
  sizes = c(13, 23, 43, 63),   # Try diff # of features
  rfeControl = ctrl_xgb
)

# View results
rfe_xgb

Per Capita variation

We took a look at some of the per capita values of the variables that came out in the first round of the variable importance work. But we realized it was important to dig deeper into which variables were considered (e.g., running the models multiple times and comparing the results), so we focused our limited time there. We left this as the foundation upon which to build in future iterations.
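From the Code Appendix, the per capita fields are calculated and the original fields dropped, so the models see only the per capita versions:

pl_rank_pc <- pl_rank |>
  mutate(master_pc   = MASTER   / POPU_LSA,
         salaries_pc = SALARIES / POPU_LSA,
         benefit_pc  = BENEFIT  / POPU_LSA,
         locgvt_pc   = LOCGVT   / POPU_LSA) |>
  dplyr::select(-MASTER, -SALARIES, -BENEFIT, -LOCGVT)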

Redo Random Forest

Running the same Random Forest Model, with the new data set that has per capita data calculated for the fields that showed importance in the prior model.

## Random Forest 
## 
##  83 samples
## 160 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 75, 75, 75, 75, 74, 74, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     56.41829  0.3283596  49.81589
##   2     53.93963  0.3625471  47.74514
##   3     51.63767  0.3797494  45.79009
##   4     50.18144  0.3767099  44.21810
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.

Here we see results very similar to the original RF model, and still quite poor:

  • RMSE shows that, on average, the predictions deviate from the actuals by about 50 rank positions.
  • \(R^2\): about 38% (0.377) of the variance of the target variable (Overall.Rank) is explained by the model.
  • MAE shows that, on average, the model’s predictions are off by about 44 rank positions.

Redo RF Variable Importance

Redo XGBoost

Running the same XGBoost, with the new data set that has per capita data calculated for the fields that showed importance in the prior model.

##       RMSE  Rsquared      MAE
## 1 48.53307 0.4151694 39.10948
## 2 45.19650 0.3816671 37.05474
## 3 41.39172 0.4208697 34.63164
## 4 49.45439 0.3236302 41.53255
## 5 47.56504 0.3505308 38.73774
## 6 41.44881 0.4252576 34.20331

Here we see results broadly similar to the original XGB model (a slightly lower RMSE and MAE, but also a lower \(R^2\)), and still quite poor:

  • RMSE shows that, on average, the predictions deviate from the actuals by about 41 rank positions.
  • \(R^2\): about 42% (0.421) of the variance of the target variable (Overall.Rank) is explained by the model.
  • MAE shows that, on average, the model’s predictions are off by about 34.6 rank positions.

Redo XGB Variable Importance

Code Appendix

library(tidyverse) 
library(ggpubr)     #ggarrange
library(reshape2)   #melt
library(scales)     #show axis values in millions, not scientific notation 
library(gridExtra)  #grid.arrange()
library(caret)      # For data preprocessing and modeling
library(ggfortify)  #for autoplot to test assumptions in lm
library(MASS)       # for Box-Cox
library(randomForest)
library(xgboost)
library(shapviz)    #SHAP values (for XGBoost) 

theme_set(theme_minimal())

pl_data <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
city_rank <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/SustainCityRank.csv")
# Cleaning up name of overall rank as it was too long
colnames(city_rank)[1] <- 'Overall.Rank'

#Just want a subset of the fields
city_rank <- city_rank |>
  dplyr::select(City, State, Overall.Rank)
summary(city_rank)
#filter to just big urban libraries
pl_data<- pl_data |> filter(LOCALE_ADD==11)
#filter to just those that meet the criteria for public libraries
pl_data<- pl_data |> filter(C_FSCS=='Y')
#clean where we have both county and city + HI
pl_data<- pl_data |> filter(FSCSKEY!='CA0084')  #Orange
pl_data<- pl_data |> filter(FSCSKEY!='CA0112')  #San Diego
pl_data<- pl_data |> filter(FSCSKEY!='PA0532')  #Allegheny
pl_data<- pl_data |> filter(FSCSKEY!='TX0101')  #Harris
pl_data<- pl_data |> filter(FSCSKEY!='TX0744')  #BiblioTech
pl_data<- pl_data |> filter(FSCSKEY!='HI0001')  #HI

#drop fields with the same value for every row
pl_data<- pl_data[, sapply(pl_data, function(x) length(unique(x)) > 1)]
#make the City data in city_rank all uppercase
city_rank$City <- str_to_upper(city_rank$City)
#Adjusting to match spelling of city name in pl_data
city_rank <- city_rank |> 
  mutate(City = str_replace(City,'ST. PAUL','SAINT PAUL'))
#Adjusting to match spelling of city name in pl_data
pl_data$CITY[pl_data$FSCSKEY == 'TN0134'] <- 'MEMPHIS'
#make a field that concatenates the city and state fields in both data sets to be used in join
pl_data <- pl_data |> 
  mutate(city.state = paste(CITY,STABR))

city_rank <- city_rank |>
  mutate(city.state = paste(City,State))
#check that city.state is unique in each dataset.

#NOTE results=FALSE, as we do not need to see this in the html 
duplicated_pl_data <- pl_data %>% 
  group_by(city.state) %>% 
  filter(n() > 1)

duplicated_city_rank <- city_rank %>% 
  group_by(city.state) %>% 
  filter(n() > 1)

duplicated_pl_data
duplicated_city_rank
#join the two data sets
pl_rank<-left_join (pl_data, city_rank, by = 'city.state')
#handle NYC, which has 3 distinct library systems
#get column names
city_rank_cols <-tail(names(pl_rank), 9) 

#copy the values we want
#need to specify dplyr::select bcs also have MASS
nypl_rank_data <- pl_rank |>  
  filter(FSCSKEY == 'NY0778') |> 
  dplyr::select(all_of(city_rank_cols)) 


#update the BPL and QPL values with the NYPL values
pl_rank <- pl_rank |>  
  mutate(across(all_of(city_rank_cols), 
                ~ if_else(FSCSKEY %in% c('NY0004', 'NY0562'), nypl_rank_data[[cur_column()]], .)))
#histograms
#the columns of interest
rev_cols <-c('LOCGVT', 'STGVT', 'FEDGVT', 'OTHINCM')

#small dataset with just those columns, pivoted 
rev_data <- pl_rank |> 
  dplyr::select (one_of(rev_cols))
rev_data$ID <- rownames(rev_data) 
rev_data <- melt(rev_data, id.vars = "ID")

# Create histograms of the revenue columns
ggplot(data = rev_data, aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Revenue Variables", x = "Predictors", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +  # Clean up grid lines
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1))  # Format x-axis in millions

#Box plots
plot_list <- lapply(rev_cols[1:4], function(col) {
  ggplot(pl_rank, aes(y = .data[[col]])) +
    geom_boxplot(outlier.color = "red", fill = "skyblue")+
    theme_minimal()+
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1)) 
 
})

# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 2))



#calc Local funding as a percent of the total
pl_rank <- pl_rank |>
  mutate(loc_pct = LOCGVT/TOTINCM)
# Create histograms of the % column
ggplot(data = pl_rank, aes(x = loc_pct)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) +  
  labs(title = "Distribution of Local Funding as % of Total", x = "% of Total", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines

#bins by quantile
pl_rank <- pl_rank |>
  mutate(bin_lp = cut(loc_pct,
            breaks = quantile(loc_pct, probs=seq(0,1,0.25)),
            labels = c("Q1", "Q2", "Q3", "Q4"),
            include.lowest = TRUE))
# Create bar of the bins
ggplot(data = pl_rank, aes(x = bin_lp)) +
  geom_bar(fill = "skyblue", color = "black", alpha = 0.8) + 
  labs(title = "Bins of Local Funding % ", x = "Bin", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())  # Clean up grid lines

ggplot(pl_rank, aes(x=loc_pct, y=Overall.Rank, color = bin_lp)) +
  geom_point() +
  geom_smooth(method=lm , color="skyblue", fill="#69b3a2", se=TRUE) +
   labs(title = "Overall Sustainability Rank vs % Local Funding",
       x = "Local % of Total Funding",
       y = "Sustainable Rank",
       color = "Bin (Quantile)")+
  theme_minimal()+
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
pl_rk_F <- pl_rank |>
  dplyr::select(loc_pct, bin_lp, Overall.Rank, 
                FSCSKEY,STABR,LIBNAME,CITY, POPU_LSA,
                C_LEGBAS, C_ADMIN, BRANLIB)

set.seed(888)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)  
#simple linear model
lm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "lm", trControl = train_control)

predictions <- predict(lm_model, newdata = pl_rk_F)
actuals <- pl_rk_F$Overall.Rank
postResample(predictions, actuals)
# Residuals and fitted values
fitted_values <- predict(lm_model, newdata = pl_rk_F)
residuals <- lm_model$finalModel$residuals

# Plot Residuals vs Fitted values
p1<-ggplot(data = data.frame(fitted_values, residuals), aes(x = fitted_values, y = residuals)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
# Q-Q plot of residuals
p2<-ggplot(data = data.frame(residuals), aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot")

# Cross-validation performance (e.g., RMSE) across folds
resampling_results <- lm_model$resample

# Plot RMSE across folds
p3<-ggplot(resampling_results, aes(x = Resample, y = RMSE)) +
  geom_boxplot() +
  labs(title = "RMSE Across Cross-Validation Folds", x = "Fold", y = "RMSE")

ggarrange(p1,p2,p3)
#Gamma Regression (family is passed through to glm)
gml_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "glm",
                   family = Gamma(link = "log"), trControl = train_control)

print(gml_model)
#Robust Linear Regression
rlm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "rlm", trControl = train_control)

print(rlm_model)
set.seed(717)
tune_grid <- expand.grid(mtry = c(1, 2, 3, 4))  

rf_model <- train(Overall.Rank ~ ., data = pl_rank, 
                  method = "rf", trControl = train_control,
                  tuneGrid = tune_grid)

print(rf_model)
#Variable Importance
v_imp <- varImp(rf_model, scale = FALSE)

plot(v_imp, top=15, main = "RF: Top 15 Variable Importances")
# Model matrix (for later reuse in SHAP)
X <- model.matrix(Overall.Rank ~ . - 1, data = pl_rank)


xgb_model <- suppressWarnings(train(x = X,
                    y = pl_rank$Overall.Rank,
                    method = "xgbTree", 
                    trControl = train_control,
                    tuneLength = 3,
                    verbosity = 0,
                    verbose = FALSE))

#pull the bestTune row(s); filtering on these four tuning parameters
#still leaves one row per combination of the remaining parameters
best_row <- xgb_model$results %>%
  dplyr::filter(
    nrounds == xgb_model$bestTune$nrounds,
    max_depth == xgb_model$bestTune$max_depth,
    eta == xgb_model$bestTune$eta,
    gamma == xgb_model$bestTune$gamma
    )
best_metrics <- best_row %>%
  dplyr::select(RMSE, Rsquared, MAE)

print(best_metrics)
#Get SHAP values for the final (best) model
xgb_fit <- xgb_model$finalModel

# Create "shapviz" object
shp <- shapviz(xgb_fit, X_pred = X)

sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)

#sv_dependence(shp, v = "LOCGVT", "auto")

# Define control for RFE
ctrl_rf <- rfeControl(
  functions = rfFuncs,   # Built-in functions for RF
  method = "cv",         # Cross-validation
  number = 5,            # 5-fold CV
  verbose = FALSE
)

#define X & y
x = dplyr::select(pl_rank, -Overall.Rank)
y = pl_rank$Overall.Rank

# Run RFE
set.seed(123)
rfe_rf <- rfe(
  x = x,
  y = y,
  sizes = c(10,20,40,60,80,83),   # Try diff # of features
  rfeControl = ctrl_rf
)

# View results
rfe_rf
#THIS HAS ERRORS - do not eval, leaving until done just in case.

#define x again, this time with model.matrix to deal 
#with character fields and as.data.frame so the X & Y 
#still match up with same number of rows
x_xgb = as.data.frame(model.matrix(Overall.Rank ~ . - 1, data = pl_rank))


# Start from caret's default functions
xgbFuncs <- caretFuncs

# Use xgbTree as the model method
xgbFuncs$fit <- function(x, y, first, last, ...){
  train(
    x = x_xgb,
    y = y,
    method = "xgbTree",
    trControl = trainControl(method = "cv", number = 5),
    ...
  )
}

# RFE control
ctrl_xgb <- rfeControl(
  functions = xgbFuncs,
  method = "cv",
  number = 3,
  verbose = TRUE  # Set to TRUE to get more logs
)

# Run RFE
set.seed(123)
rfe_xgb <- rfe(
  x = x_xgb,
  y = y,
  sizes = c(13, 23, 43, 63),   # Try diff # of features
  rfeControl = ctrl_xgb
)

# View results
rfe_xgb

#create data set with per capita values, dropping the original fields
pl_rank_pc <- pl_rank |>
  mutate (master_pc = MASTER / POPU_LSA,
          salaries_pc = SALARIES / POPU_LSA,
          benefit_pc = BENEFIT / POPU_LSA,
          locgvt_pc = LOCGVT / POPU_LSA) |>
  dplyr::select(-MASTER, -SALARIES, -BENEFIT, -LOCGVT)
rf_model_pc <- train(Overall.Rank ~ ., data = pl_rank_pc, 
                  method = "rf", trControl = train_control,
                  tuneGrid = tune_grid)

print(rf_model_pc)

v_imp_pc <- varImp(rf_model_pc, scale = FALSE)

plot(v_imp_pc, top=15, main = "RF: Top 15 Variable Importances (PC)")
# Model matrix (for later reuse in SHAP)
X_pc <- model.matrix(Overall.Rank ~ . - 1, 
                     data = pl_rank_pc)



xgb_model_pc <- suppressWarnings(train(x = X_pc,
                    y = pl_rank_pc$Overall.Rank,
                    method = "xgbTree", 
                    trControl = train_control,
                    tuneLength = 3,
                    verbosity = 0,
                    verbose = FALSE))

                    
best_row_pc <- xgb_model_pc$results %>%
  dplyr::filter(
    nrounds == xgb_model_pc$bestTune$nrounds,
    max_depth == xgb_model_pc$bestTune$max_depth,
    eta == xgb_model_pc$bestTune$eta,
    gamma == xgb_model_pc$bestTune$gamma
    )
best_metrics_pc <- best_row_pc %>%
  dplyr::select(RMSE, Rsquared, MAE)

print(best_metrics_pc)
xgb_fit_pc <- xgb_model_pc$finalModel

# Create "shapviz" object
shp_pc <- shapviz(xgb_fit_pc, X_pred = X_pc)

sv_importance(shp_pc, kind = "both", alpha = 0.2, width = 0.2)

#sv_dependence(shp, v = "LOCGVT", "auto")