This project will explore the relationship between local funding of public libraries (a measure of local support for public libraries) and the sustainability of those cities. The goal is to show that funding of libraries at a local level is a key indicator of a city’s progress toward environmental, social and economic sustainability. The focus will be on big urban cities and their libraries.
Note that all the code is included in the Appendix.
This data is straightforward and well organized. It is the 2022 Sustainable City Ranking, with 194 rows, each an observation of a US city. We will use the following variables and remove the rest.
A summary of this data is shown below:
## City State Overall.Rank
## Length:194 Length:194 Min. : 1.00
## Class :character Class :character 1st Qu.: 49.25
## Mode :character Mode :character Median : 97.50
## Mean : 97.49
## 3rd Qu.:145.75
## Max. :194.00
This data is also clean, but a bit larger. Like the city ranking data, it is from 2022 (the latest available). It starts out with 9,248 rows of 192 variables. We will look at the variables as we use them, but first we want to limit the data to just big urban cities.
The variable LOCALE_ADD is a two-digit code that represents the location of the library system. LOCALE_ADD = 11 identifies the largest cities, defined in the PLS data dictionary as:
11 - City, Large: Territory inside an urbanized area and inside a principal city with population of 250,000 or more.
We will limit the data to just those cities of interest.
We also want to limit to those library systems that meet the definition of a Public Library. The field C_FSCS is a Y/N field, with Y meaning the system meets all the criteria of a Public Library.
And finally, there are a couple of county libraries that overlap with city libraries (e.g., the Orange County library also serves the city of Santa Ana). In these cases, we want to focus only on the city libraries, so we will remove the county libraries.
There is also one state library, Hawaii. We will remove that as well.
We also want to drop all fields that have only one value, i.e., the same value in every row. These add nothing meaningful to our analysis and just add clutter.
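A condensed sketch of these filters (the full, step-by-step version is in the Appendix):
pl_data <- pl_data |>
  filter(LOCALE_ADD == 11,   #big urban cities only
         C_FSCS == 'Y') |>   #meets the Public Library criteria
  filter(!FSCSKEY %in% c('CA0084','CA0112','PA0532','TX0101','TX0744',  #overlapping county systems
                         'HI0001'))                                     #Hawaii state library
#drop fields with the same value in every row
pl_data <- pl_data[, sapply(pl_data, function(x) length(unique(x)) > 1)]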
This leaves 83 rows of 156 variables: a very small data set, but one focused on our area of interest. We will hold off exploring the other variables until we use them.
Next, we want to put the two data sets together. To do this, we will join the two, taking all of the pl_data and just the rows from city_rank that apply to those rows. We can match on the CITY/City fields in the two data sets. Note that all 83 public library cities are ranked in the city_rank data.
First, we need to make the City data in city_rank all uppercase, so it matches that in the pl_data.
Next, there are a few special cases that need to be adjusted.
Special Case 1 is the city in MN, which is called SAINT PAUL in the
pl_data, and ST. PAUL in the city_rank data. We will change the
city_rank value to ensure it matches.
Special Case 2 is Memphis, TN. In the pl_data, the city name is MEMPHIS (S). This will be corrected to MEMPHIS. (Note we checked, and this does not mean Memphis South, or similar.)
Next, there are some cases where cities in different states share the same name (e.g., Arlington in VA and TX). We need to make a field that concatenates the city and state fields in both data sets, and we will join on this field.
After confirming that the column city.state is unique in each dataset, we can join the data.
We use a left join, since we want all of the pl_data and only the city_rank rows with matching cities.
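A sketch of the join key and the join itself (as in the Appendix):
pl_data <- pl_data |>
  mutate(city.state = paste(CITY, STABR))
city_rank <- city_rank |>
  mutate(city.state = paste(City, State))
pl_rank <- left_join(pl_data, city_rank, by = 'city.state')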
We have one more special case, the three library systems in NYC. The New York Public Library (NYPL), which serves the boroughs of Manhattan, the Bronx and Staten Island, has City = New York, which will match the city_rank data, so that works. But the city_rank data does not separate the boroughs of Brooklyn and Queens, which each have their own library system (BPL and QPL). Since these all serve the city of New York, we will copy the city_rank columns (the last 9) from NYPL to BPL and QPL.
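This copy is done as in the Appendix, using the FSCSKEY identifiers of the three systems:
#copy the 9 city_rank columns from NYPL (NY0778) to BPL (NY0004) and QPL (NY0562)
city_rank_cols <- tail(names(pl_rank), 9)
nypl_rank_data <- pl_rank |>
  filter(FSCSKEY == 'NY0778') |>
  dplyr::select(all_of(city_rank_cols))
pl_rank <- pl_rank |>
  mutate(across(all_of(city_rank_cols),
                ~ if_else(FSCSKEY %in% c('NY0004', 'NY0562'),
                          nypl_rank_data[[cur_column()]], .)))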
There are four fields that represent operating revenue in the pl_data: LOCGVT (operating revenue from local government), STGVT (from state government), FEDGVT (from federal government) and OTHINCM (other operating revenue). We will use histograms and box plots to explore these fields; all are heavily skewed and have outliers.
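A condensed version of the histogram code (the Appendix also builds the corresponding box plots):
rev_cols <- c('LOCGVT', 'STGVT', 'FEDGVT', 'OTHINCM')
rev_data <- pl_rank |>
  dplyr::select(all_of(rev_cols)) |>
  mutate(ID = row_number()) |>
  melt(id.vars = "ID")
ggplot(rev_data, aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distributions of Revenue Variables", x = "Revenue", y = "Frequency") +
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1))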
## Local Funding as a % of Total Funding
We want to explore local funding as a percent of the total, so we will calculate that field (loc_pct) and look at its distribution. It is also skewed with outliers, but it shows that most library systems receive at least 40% of their funding locally; in fact, only 2 are below 50%.
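The calculation is simply local government revenue over total operating revenue (as in the Appendix):
pl_rank <- pl_rank |>
  mutate(loc_pct = LOCGVT / TOTINCM)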
To get a more even distribution, we create bins by quantile for this percentage of the total.
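The binning, as in the Appendix:
pl_rank <- pl_rank |>
  mutate(bin_lp = cut(loc_pct,
                      breaks = quantile(loc_pct, probs = seq(0, 1, 0.25)),
                      labels = c("Q1", "Q2", "Q3", "Q4"),
                      include.lowest = TRUE))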
Next we plot the Overall (Sustainability) Rank vs the % of local funding, colored by bin, and add a regression line, just to get a feel for what we are seeing.
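The plot is built as in the Appendix:
ggplot(pl_rank, aes(x = loc_pct, y = Overall.Rank, color = bin_lp)) +
  geom_point() +
  geom_smooth(method = lm, color = "skyblue", fill = "#69b3a2", se = TRUE) +
  labs(title = "Overall Sustainability Rank vs % Local Funding",
       x = "Local % of Total Funding",
       y = "Sustainable Rank",
       color = "Bin (Quantile)")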
We create a small data set with just the fields of interest for the linear model.
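The subset used for the linear model (as in the Appendix):
pl_rk_F <- pl_rank |>
  dplyr::select(loc_pct, bin_lp, Overall.Rank,
                FSCSKEY, STABR, LIBNAME, CITY, POPU_LSA,
                C_LEGBAS, C_ADMIN, BRANLIB)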
We call set.seed() before the models so results are reproducible. Because the data set is so small, we want to use cross-validation. This train_control will be used in all the subsequent models.
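As in the Appendix, we set the seed and define the shared cross-validation control:
set.seed(888)
#10-fold cross-validation, reused by all subsequent models
train_control <- trainControl(method = "cv", number = 10)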
First, we want to set up a simple linear model. We do not expect great results here, but it will be a place to start when comparing other models.
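A condensed version of the fit and its metrics (the Appendix breaks this into separate steps):
lm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F,
                  method = "lm", trControl = train_control)
postResample(predict(lm_model, newdata = pl_rk_F), pl_rk_F$Overall.Rank)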
## RMSE Rsquared MAE
## 56.2990118 0.0411275 49.6386663
As expected, these are pretty bad results.
Just for thoroughness, we will check the linear model assumptions.
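The diagnostic plots are built as follows (condensed from the Appendix):
fitted_values <- predict(lm_model, newdata = pl_rk_F)
residuals <- lm_model$finalModel$residuals
p1 <- ggplot(data.frame(fitted_values, residuals), aes(x = fitted_values, y = residuals)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "red") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
p2 <- ggplot(data.frame(residuals), aes(sample = residuals)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Normal Q-Q Plot")
p3 <- ggplot(lm_model$resample, aes(x = Resample, y = RMSE)) +
  geom_boxplot() +
  labs(title = "RMSE Across Cross-Validation Folds", x = "Fold", y = "RMSE")
ggarrange(p1, p2, p3)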
In the Residuals vs Fitted plot, we see a pattern of more points to the right (the skewness) rather than the random scatter around zero we want for a linear model. This is not the funnel shape of heteroscedasticity, where the variance changes as fitted values increase, which would violate one of the assumptions of linear regression (homoscedasticity).
The Normal Q-Q, showing heavy tails, reinforces the issue of the lack of normal distribution.
The RMSE Across Cross-Validation Folds shows inconsistency in the RMSE across the folds - a sign that the model does NOT generalize well.
Note that we also tried some variations (Gamma Regression and Robust Linear Regression), with only very minor improvements.
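Both variations were fit the same way via caret (as in the Appendix):
gml_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F,
                   method = "glm", trControl = train_control)
rlm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F,
                   method = "rlm", trControl = train_control)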
Tree-based models like Decision Trees and Random Forests are inherently robust to outliers: trees split the data based on the best features, and outliers typically do not affect the splits significantly.
A Random Forest in particular can be very effective at modeling data with outliers, as it builds multiple trees using different subsets of the data.
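The model is tuned over mtry and evaluated with the same cross-validation control (as in the Appendix):
set.seed(717)
tune_grid <- expand.grid(mtry = c(1, 2, 3, 4))
rf_model <- train(Overall.Rank ~ ., data = pl_rank,
                  method = "rf", trControl = train_control,
                  tuneGrid = tune_grid)
print(rf_model)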
## Random Forest
##
## 83 samples
## 160 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 75, 74, 75, 75, 75, 74, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 56.18068 0.3554825 49.77880
## 2 53.75106 0.3782917 47.77488
## 3 51.49646 0.3827737 45.81794
## 4 49.76102 0.3744031 43.85485
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
This shows further improvement, but is still quite poor:
Even though the model does a poor job of explaining the variance in
Overall.Rank, we might learn something about what is important to the
model.
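The variable importance plot comes from caret (as in the Appendix):
v_imp <- varImp(rf_model, scale = FALSE)
plot(v_imp, top = 15, main = "RF: Top 15 Variable Importances")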
With similar benefits to Random Forest, XGBoost is an even more powerful ensemble method for small, skewed data.
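XGBoost needs a numeric model matrix, so the character fields are expanded first (as in the Appendix):
X <- model.matrix(Overall.Rank ~ . - 1, data = pl_rank)
xgb_model <- train(x = X, y = pl_rank$Overall.Rank,
                   method = "xgbTree", trControl = train_control,
                   tuneLength = 3, verbosity = 0, verbose = FALSE)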
## RMSE Rsquared MAE
## 1 56.05040 0.2824386 45.48761
## 2 47.16960 0.4065147 38.52283
## 3 42.85114 0.4943605 35.71385
## 4 53.56663 0.2823527 41.97079
## 5 51.51815 0.3020539 41.90481
## 6 46.60390 0.3704433 38.77632
Again, further improvement, but still quite poor. ‘Best’ results are:
Similar to the RF model, though the model does a poor job of explaining the variance in Overall.Rank, we might learn something about what is important to the model. Here we are using library(shapviz) to get and plot the SHAP (SHapley Additive exPlanations) values. Recall that SHAP Values break down a model’s prediction for a single data point, showing the contribution of each feature to that prediction.
More about SHAP values and how they are plotted:
Feature Importance: Features at the top are the most important in predicting the model output. You can see which features have the greatest influence on the prediction.
Feature Impact: The spread of the dots shows the range of SHAP values for that feature. A wider spread means that the feature has a more variable impact on the prediction across different data points.
Positive/Negative Impact: The direction of the dots (right for positive SHAP values, left for negative) shows whether the feature generally increases or decreases the model’s output.
Feature Value Influence: The color of the dots indicates whether higher or lower values of a feature contribute to higher or lower predictions.
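The SHAP plot is produced as in the Appendix:
xgb_fit <- xgb_model$finalModel
shp <- shapviz(xgb_fit, X_pred = X)
sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)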
The RF and XGB models were each run 5 times with different seeds, in order to identify the variables that were consistently important given the randomness inherent in the models. The results of these runs were captured in a spreadsheet (for expedience), and the variables that showed up consistently were considered in hypotheses for further study.
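A sketch of how those repeated runs could be automated for the RF model; the actual runs were done one at a time and tallied in a spreadsheet, and the seed values here are purely illustrative:
seeds <- c(111, 222, 333, 444, 555)   #illustrative seeds, not the ones actually used
top_vars <- lapply(seeds, function(s) {
  set.seed(s)
  m <- train(Overall.Rank ~ ., data = pl_rank,
             method = "rf", trControl = train_control,
             tuneGrid = tune_grid)
  vi <- varImp(m, scale = FALSE)$importance
  rownames(vi)[order(-vi$Overall)][1:15]   #top 15 variables for this run
})
#variables appearing in most runs are candidates for further study
sort(table(unlist(top_vars)), decreasing = TRUE)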
Here are some portions of the code I did not end up using in my analysis, but that were interesting, so they are kept for future reference.
With RFE (recursive feature elimination) on the RF model, we got similar results:
# Define control for RFE
ctrl_rf <- rfeControl(
functions = rfFuncs, # Built-in functions for RF
method = "cv", # Cross-validation
number = 5, # 5-fold CV
verbose = FALSE
)
#define X & y
x = dplyr::select(pl_rank, -Overall.Rank)
y = pl_rank$Overall.Rank
# Run RFE
set.seed(123)
rfe_rf <- rfe(
x = x,
y = y,
sizes = c(10,20,40,60,80,83), # Try diff # of features
rfeControl = ctrl_rf
)
# View results
rfe_rf
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (5 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 10 45.35 0.4127 36.77 8.319 0.2494 7.888 *
## 20 46.45 0.4017 39.46 9.044 0.2606 7.639
## 40 46.75 0.3955 39.88 7.322 0.2300 6.173
## 60 47.11 0.3889 40.28 6.152 0.2138 5.177
## 80 48.03 0.3674 41.29 6.052 0.2062 5.183
## 83 48.30 0.3626 41.35 6.141 0.2094 5.377
## 160 48.66 0.3576 42.00 5.170 0.2074 4.478
##
## The top 5 variables (out of 10):
## LONGITUD, ZIP_M, EBOOK, ZIP, MASTER
We could not get XGB to work with harvesting the feature importance via the cross-validation, and in any case we realized that doing so means taking the average across best and not-best models, as opposed to running with different seeds and using results from multiple 'bests'.
#THIS HAS ERRORS - do not eval, leaving until done just in case.
#define x again, this time with model.matrix to deal
#with character fields and as.data.frame so the X & Y
#still match up with same number of rows
x_xgb = as.data.frame(model.matrix(Overall.Rank ~ . - 1, data = pl_rank))
# Start from caret's default functions
xgbFuncs <- caretFuncs
# Use xgbTree as the model method
xgbFuncs$fit <- function(x, y, first, last, ...){
train(
x = x, #use the feature subset passed in by rfe(); referencing the global x_xgb was the likely source of the errors
y = y,
method = "xgbTree",
trControl = trainControl(method = "cv", number = 5),
...
)
}
# RFE control
ctrl_xgb <- rfeControl(
functions = xgbFuncs,
method = "cv",
number = 3,
verbose = TRUE # Set to TRUE to get more logs
)
# Run RFE
set.seed(123)
rfe_xgb <- rfe(
x = x_xgb,
y = y,
sizes = c(13, 23, 43, 63), # Try diff # of features
rfeControl = ctrl_xgb
)
# View results
rfe_xgb
We took a look at some of the per capita values of the variables that came out in the first round of the Variable Importance work. But we realized it was important to dig deeper into which variables were considered (e.g., running the models multiple times and comparing results), so we focused our limited time there. This is left as a foundation to build on in future iterations.
Here we run the same Random Forest model with the new data set, which has per capita values calculated for the fields that showed importance in the prior model.
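The per capita fields replace the originals, and the model is refit as in the Appendix:
pl_rank_pc <- pl_rank |>
  mutate(master_pc = MASTER / POPU_LSA,
         salaries_pc = SALARIES / POPU_LSA,
         benefit_pc = BENEFIT / POPU_LSA,
         locgvt_pc = LOCGVT / POPU_LSA) |>
  dplyr::select(-MASTER, -SALARIES, -BENEFIT, -LOCGVT)
rf_model_pc <- train(Overall.Rank ~ ., data = pl_rank_pc,
                     method = "rf", trControl = train_control,
                     tuneGrid = tune_grid)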
## Random Forest
##
## 83 samples
## 160 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 75, 75, 75, 75, 74, 74, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 56.41829 0.3283596 49.81589
## 2 53.93963 0.3625471 47.74514
## 3 51.63767 0.3797494 45.79009
## 4 50.18144 0.3767099 44.21810
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
Here we see results very similar to the original RF model, and still quite poor:
Here we run the same XGBoost model with the new data set that has per capita data calculated for the fields that showed importance in the prior model.
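The model matrix is rebuilt from the per capita data set and the same XGBoost tuning is rerun (as in the Appendix):
X_pc <- model.matrix(Overall.Rank ~ . - 1, data = pl_rank_pc)
xgb_model_pc <- train(x = X_pc, y = pl_rank_pc$Overall.Rank,
                      method = "xgbTree", trControl = train_control,
                      tuneLength = 3, verbosity = 0, verbose = FALSE)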
## RMSE Rsquared MAE
## 1 48.53307 0.4151694 39.10948
## 2 45.19650 0.3816671 37.05474
## 3 41.39172 0.4208697 34.63164
## 4 49.45439 0.3236302 41.53255
## 5 47.56504 0.3505308 38.73774
## 6 41.44881 0.4252576 34.20331
Here we see results slightly worse than the original XGB model, and still quite poor:
library(tidyverse)
library(ggpubr) #ggarrange
library(reshape2) #melt
library(scales) #show axis vals in M & not sci not
library(gridExtra) #grid.arrange()
library(caret) # For data preprocessing and modeling
library(ggfortify) #for autoplot to test assumptions in lm
library(MASS) # for Box-Cox
library(randomForest)
library(xgboost)
library(shapviz) #SHAP values (for XGBoost)
theme_set(theme_minimal())
pl_data <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
city_rank <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/SustainCityRank.csv")
# Cleaning up name of overall rank as it was too long
colnames(city_rank)[1] <- 'Overall.Rank'
#Just want a subset of the fields
city_rank <- city_rank |>
dplyr::select(City, State, Overall.Rank)
summary(city_rank)
#filter to just big urban libraries
pl_data<- pl_data |> filter(LOCALE_ADD==11)
#filter to just those that meet the criteria for public libraries
pl_data<- pl_data |> filter(C_FSCS=='Y')
#clean where we have both county and city + HI
pl_data<- pl_data |> filter(FSCSKEY!='CA0084') #Orange
pl_data<- pl_data |> filter(FSCSKEY!='CA0112') #San Diego
pl_data<- pl_data |> filter(FSCSKEY!='PA0532') #Allegheny
pl_data<- pl_data |> filter(FSCSKEY!='TX0101') #Harris
pl_data<- pl_data |> filter(FSCSKEY!='TX0744') #BiblioTech
pl_data<- pl_data |> filter(FSCSKEY!='HI0001') #HI
#drop fields with the same value for every row
pl_data<- pl_data[, sapply(pl_data, function(x) length(unique(x)) > 1)]
#make the City data in city_rank all uppercase
city_rank$City <- str_to_upper(city_rank$City)
#Adjusting to match spelling of city name in pl_data
city_rank <- city_rank |>
mutate(City = str_replace(City,'ST. PAUL','SAINT PAUL'))
#Adjusting to match spelling of city name in pl_data
pl_data$CITY[pl_data$FSCSKEY == 'TN0134'] <- 'MEMPHIS'
#make a field that concatenates the city and state fields in both data sets to be used in join
pl_data <- pl_data |>
mutate(city.state = paste(CITY,STABR))
city_rank <- city_rank |>
mutate(city.state = paste(City,State))
#check that city.state is unique in each dataset.
#NOTE results=FALSE, as we do not need to see this in the html
duplicated_pl_data <- pl_data %>%
group_by(city.state) %>%
filter(n() > 1)
duplicated_city_rank <- city_rank %>%
group_by(city.state) %>%
filter(n() > 1)
duplicated_pl_data
duplicated_city_rank
#join the two data sets
pl_rank<-left_join (pl_data, city_rank, by = 'city.state')
#handle NYC, which has 3 distinct library systems
#get column names
city_rank_cols <-tail(names(pl_rank), 9)
#copy the values we want
#need to specify dplyr::select bcs also have MASS
nypl_rank_data <- pl_rank |>
filter(FSCSKEY == 'NY0778') |>
dplyr::select(all_of(city_rank_cols))
#update the BPL and QPL values with the NYPL values
pl_rank <- pl_rank |>
mutate(across(all_of(city_rank_cols),
~ if_else(FSCSKEY %in% c('NY0004', 'NY0562'), nypl_rank_data[[cur_column()]], .)))
#histograms
#the columns of interest
rev_cols <-c('LOCGVT', 'STGVT', 'FEDGVT', 'OTHINCM')
#small dataset with just those columns, pivoted
rev_data <- pl_rank |>
dplyr::select (one_of(rev_cols))
rev_data$ID <- rownames(rev_data)
rev_data <- melt(rev_data, id.vars = "ID")
# Create histograms of the revenue columns
ggplot(data = rev_data, aes(x = value)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Distributions of Revenue Variables", x = "Predictors", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + # Clean up grid lines
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1)) # Format x-axis in millions)
#Box plots
plot_list <- lapply(rev_cols[1:4], function(col) {
ggplot(pl_rank, aes_string(y = col)) +
geom_boxplot(outlier.color = "red", fill = "skyblue")+
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1))
})
# Display all boxplots in a grid
do.call(grid.arrange, c(plot_list, ncol = 2))
#calc Local funding as a percent of the total
pl_rank <- pl_rank |>
mutate(loc_pct = LOCGVT/TOTINCM)
# Create histograms of the % column
ggplot(data = pl_rank, aes(x = loc_pct)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) +
labs(title = "Distribution of Local Funding as % of Total", x = "% of Total", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
#bins by quantile
pl_rank <- pl_rank |>
mutate(bin_lp = cut(loc_pct,
breaks = quantile(loc_pct, probs=seq(0,1,0.25)),
labels = c("Q1", "Q2", "Q3", "Q4"),
include.lowest = TRUE))
# Create bar of the bins
ggplot(data = pl_rank, aes(x = bin_lp)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.8) +
labs(title = "Bins of Local Funding % ", x = "Bin", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
ggplot(pl_rank, aes(x=loc_pct, y=Overall.Rank, color = bin_lp)) +
geom_point() +
geom_smooth(method=lm , color="skyblue", fill="#69b3a2", se=TRUE) +
labs(title = "Overall Sustainability Rank vs % Local Funding",
x = "Local % of Total Funding",
y = "Sustainable Rank",
color = "Bin (Quantile)")+
theme_minimal()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
pl_rk_F <- pl_rank |>
dplyr::select(loc_pct, bin_lp, Overall.Rank,
FSCSKEY,STABR,LIBNAME,CITY, POPU_LSA,
C_LEGBAS, C_ADMIN, BRANLIB)
set.seed(888)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
#simple linear model
lm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "lm", trControl = train_control)
predictions <- predict(lm_model, newdata = pl_rk_F)
actuals <- pl_rk_F$Overall.Rank
postResample(predictions, actuals)
# Residuals and fitted values
fitted_values <- predict(lm_model, newdata = pl_rk_F)
residuals <- lm_model$finalModel$residuals
# Plot Residuals vs Fitted values
p1<-ggplot(data = data.frame(fitted_values, residuals), aes(x = fitted_values, y = residuals)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, col = "red") +
labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
# Q-Q plot of residuals
p2<-ggplot(data = data.frame(residuals), aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = "Normal Q-Q Plot")
# Cross-validation performance (e.g., RMSE) across folds
resampling_results <- lm_model$resample
# Plot RMSE across folds
p3<-ggplot(resampling_results, aes(x = Resample, y = RMSE)) +
geom_boxplot() +
labs(title = "RMSE Across Cross-Validation Folds", x = "Fold", y = "RMSE")
ggarrange(p1,p2,p3)
#Gamma Regression
gml_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "glm", trControl = train_control)
print(gml_model)
#Robust Linear Regression
rlm_model <- train(Overall.Rank ~ loc_pct, data = pl_rk_F, method = "rlm", trControl = train_control)
print(rlm_model)
set.seed(717)
tune_grid <- expand.grid(mtry = c(1, 2, 3, 4))
rf_model <- train(Overall.Rank ~ ., data = pl_rank,
method = "rf", trControl = train_control,
tuneGrid = tune_grid)
print(rf_model)
#Variable Importance
v_imp <- varImp(rf_model, scale = FALSE)
plot(v_imp, top=15, main = "RF: Top 15 Variable Importances")
# Model matrix (for later reuse in SHAP)
X <- model.matrix(Overall.Rank ~ . - 1, data = pl_rank)
xgb_model <- suppressWarnings(train(x = X,
y = pl_rank$Overall.Rank,
method = "xgbTree",
trControl = train_control,
tuneLength = 3,
verbosity = 0,
verbose = FALSE))
best_row <- xgb_model$results %>%
dplyr::filter(
nrounds == xgb_model$bestTune$nrounds,
max_depth == xgb_model$bestTune$max_depth,
eta == xgb_model$bestTune$eta,
gamma == xgb_model$bestTune$gamma,
)
best_metrics <- best_row %>%
dplyr::select(RMSE, Rsquared, MAE)
print(best_metrics)
#Get SHAP values for the final (best) model
xgb_fit <- xgb_model$finalModel
# Create "shapviz" object
shp <- shapviz(xgb_fit, X_pred = X)
sv_importance(shp, kind = "both", alpha = 0.2, width = 0.2)
#sv_dependence(shp, v = "LOCGVT", "auto")
# Define control for RFE
ctrl_rf <- rfeControl(
functions = rfFuncs, # Built-in functions for RF
method = "cv", # Cross-validation
number = 5, # 5-fold CV
verbose = FALSE
)
#define X & y
x = dplyr::select(pl_rank, -Overall.Rank)
y = pl_rank$Overall.Rank
# Run RFE
set.seed(123)
rfe_rf <- rfe(
x = x,
y = y,
sizes = c(10,20,40,60,80,83), # Try diff # of features
rfeControl = ctrl_rf
)
# View results
rfe_rf
#THIS HAS ERRORS - do not eval, leaving until done just in case.
#define x again, this time with model.matrix to deal
#with character fields and as.data.frame so the X & Y
#still match up with same number of rows
x_xgb = as.data.frame(model.matrix(Overall.Rank ~ . - 1, data = pl_rank))
# Start from caret's default functions
xgbFuncs <- caretFuncs
# Use xgbTree as the model method
xgbFuncs$fit <- function(x, y, first, last, ...){
train(
x = x, #use the feature subset passed in by rfe(); referencing the global x_xgb was the likely source of the errors
y = y,
method = "xgbTree",
trControl = trainControl(method = "cv", number = 5),
...
)
}
# RFE control
ctrl_xgb <- rfeControl(
functions = xgbFuncs,
method = "cv",
number = 3,
verbose = TRUE # Set to TRUE to get more logs
)
# Run RFE
set.seed(123)
rfe_xgb <- rfe(
x = x_xgb,
y = y,
sizes = c(13, 23, 43, 63), # Try diff # of features
rfeControl = ctrl_xgb
)
# View results
rfe_xgb
#create data set with per capita values, and not original
pl_rank_pc <- pl_rank |>
mutate (master_pc = MASTER / POPU_LSA,
salaries_pc = SALARIES / POPU_LSA,
benefit_pc = BENEFIT / POPU_LSA,
locgvt_pc = LOCGVT / POPU_LSA) |>
dplyr::select(-MASTER, -SALARIES, -BENEFIT, -LOCGVT)
rf_model_pc <- train(Overall.Rank ~ ., data = pl_rank_pc,
method = "rf", trControl = train_control,
tuneGrid = tune_grid)
print(rf_model_pc)
v_imp_pc <- varImp(rf_model_pc, scale = FALSE)
plot(v_imp_pc, top=15, main = "RF: Top 15 Variable Importances (PC)")
# Model matrix (for later reuse in SHAP)
X_pc <- model.matrix(Overall.Rank ~ . - 1,
data = pl_rank_pc)
xgb_model_pc <- suppressWarnings(train(x = X_pc,
y = pl_rank_pc$Overall.Rank,
method = "xgbTree",
trControl = train_control,
tuneLength = 3,
verbosity = 0,
verbose = FALSE))
best_row_pc <- xgb_model_pc$results %>%
dplyr::filter(
nrounds == xgb_model_pc$bestTune$nrounds,
max_depth == xgb_model_pc$bestTune$max_depth,
eta == xgb_model_pc$bestTune$eta,
gamma == xgb_model_pc$bestTune$gamma,
)
best_metrics_pc <- best_row_pc %>%
dplyr::select(RMSE, Rsquared, MAE)
print(best_metrics_pc)
xgb_fit_pc <- xgb_model_pc$finalModel
# Create "shapviz" object
shp_pc <- shapviz(xgb_fit_pc, X_pred = X_pc)
sv_importance(shp_pc, kind = "both", alpha = 0.2, width = 0.2)
#sv_dependence(shp, v = "LOCGVT", "auto")