The question we are trying to answer is:
How can we demonstrate the impact of public library service offerings at big urban libraries on the health outcomes of those in the communities they serve?
Both the library and community health data come in separate files per year, so to start, we load the library data (pl_YY) and the community health data (ch_YY) into separate data frames by year.
The Public Library Data comes from the IMLS Public Library Survey (PLS), which has surveyed US public libraries annually for over 25 years. Among much else, it records where each library system resides (urban, suburban, town, rural) and the levels of service it provides.
We have brought in the data from 2013-2022 (2022 is the latest available). Each year has over 9K rows and 100+ variables; each row represents one library system.
Here are the changes we make to the Library Data:
Rename LOCALE in 2013-16 -> LOCALE_ADD (the field name changed in 2017).
Limit to those library systems (rows) where C_FSCS=Y, meaning the library system meets the definition of a Public Library.
Since the health data does not include Puerto Rico, remove it from the library data.
Add a column to identify the year.
Align the columns so every year has the same set of fields.
Drop columns that are not needed.
Keep only library systems that have data for all 10 years.
Make a concatenated ‘county + state’ field, since counties may have the same names in different states (see the sketch after this list).
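A minimal sketch of the per-year cleaning, assuming df stands in for one year's pl_YY data frame and yr for its survey year (the full, year-by-year code is in the appendix):
# Sketch only: condensed per-year PLS cleaning (full code in the appendix).
clean_pls <- function(df, yr) {
  df |>
    filter(C_FSCS == 'Y') |>                 # keep true public libraries
    filter(STABR != 'PR') |>                 # health data lacks Puerto Rico
    mutate(year = as.character(yr),          # tag the survey year
           cnty.state = paste(CNTY, STABR))  # disambiguate county names
}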
Now we have a nice consistent set of 10 years, with 8814 library systems and 92 variables. This will be somewhat reduced as we connect to the community data.
The Community Health and Health Access data comes from County Health Rankings & Roadmaps, which has county level data from 2010 to 2024. The 2013-2022 data will be used to match the 10 years of library data.
Changes made to the Community Health Data, to match what is in the library data:
Convert all the state names into 2-letter state abbreviations.
Convert the county names to CAPS.
Make a concatenated ‘county + state’ field (as we did in the library data).
The 2013 data has OutcomesZ.Score and FactorsZ.Score where every other year has Quartile, so calculate the quartiles for that year's data (a short sketch follows this list).
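An abridged sketch of these steps for one year, where ch_yy stands in for that year's data frame and state_to_abbr is the lookup helper defined in the appendix:
# Sketch only: standardize one year of health data (full code in the appendix).
ch_yy <- ch_yy |>
  mutate(State = state_to_abbr(State),                     # full name -> 2-letter abbr
         cnty.state = str_to_upper(paste(County, State)))  # CAPS 'county + state' key
# 2013 only: derive the quartiles from the ranks
ch_13 <- ch_13 |>
  mutate(OutcomesQuartile = ntile(OutcomesRank, 4),
         FactorsQuartile  = ntile(FactorsRank, 4))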
Join health to library data for each year.
Union all the years.
A few counties did not match between the two data sets, so we remove those.
Some counties had NA or NR for the OutcomesRank, so remove those as well (again, not many).
Not all counties have a rank for all 10 years, so remove those not ranked every year.
Add in the total square feet per system.
Create a field with the total collection.
Some libraries have different names in different years (but the same FSCSKEY, so we know they are the same system). We will use the most recent name in all years for consistency.
Remove columns we do not need, now that everything is in one place.
The pl_data uses -1, -3 and -9 for various forms of NA. We make them all an explicit NA (see the sketch after this list).
Make sure we still have 10 years for each system.
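A condensed sketch of this assembly, matching the appendix code (pl_13, ch_13, etc. as built above; the NA recode relies on the fact that no real PLS measure here is negative):
# Sketch only: join each year, stack, filter, and recode NAs (full code in the appendix).
pl_ch13 <- left_join(pl_13, ch_13, by = 'cnty.state')
# ...repeat for 2014-2022, then stack all years:
pl_chALL <- rbind(pl_ch13, pl_ch14)   # ...through pl_ch22
pl_chALL <- pl_chALL |>
  filter(!is.na(State)) |>                              # county key did not match
  filter(!is.na(OutcomesRank), OutcomesRank != 'NR') |> # unranked counties
  mutate(across(where(is.numeric), ~ ifelse(.x < 0, NA, .x)))  # -1/-3/-9 -> NA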
Now we have a set of 86,300 rows (8,630 library systems over 10 years put together).
A quick look at some of the variables of interest.
We use scatter plots to show how library services (like the collection, top left) relate to the use of those services (circulation, i.e. the borrowing of material in the collection). By coloring points by year, we can see consistency in the patterns year over year.
We will be transforming all the columns into population adjusted z-scores. Here is a look at them BEFORE.
Here, we are making changes needed for the modeling.
We make a "_F" (final) dataset to use in modeling. At this point most fields are included, but this is the place to experiment with which fields get included or excluded.
Generate Z-Scores
Population Adjusted scores
The same variables, in their population-adjusted z-scores.
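Concretely, each variable is standardized with scale() and the z-score of the service-area population is then subtracted, so scores are relative to the population served; a sketch for one variable (the appendix applies this to all of them):
# Sketch only: population-adjusted z-score for a single variable (TOTCIR).
pl_chALL_F <- pl_chALL_F |>
  mutate(z.TOTCIR   = scale(TOTCIR),
         z.POPU_LSA = scale(POPU_LSA),
         pz.TOTCIR  = z.TOTCIR - z.POPU_LSA)  # adjust for population served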
As we can see, there are several cases where the correlation is at or near 0.7:
We will start with a factor analysis, then use that in a measurement model. We will then run variations on the SEM model (a sketch of the first two steps follows).
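A minimal sketch of those first two steps, matching the appendix code:
# Sketch only: one-factor exploratory FA, then a one-latent-variable CFA.
fa_result <- fa(pz_data, nfactors = 1)   # psych::fa
fa_result$loadings[, 1]                  # each variable's loading on the factor
cfa_model <- '
  LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN + pz.VISITS +
             pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS + pz.PITUSR
'
fit <- cfa(cfa_model, data = pz_data)    # lavaan::cfa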
Summarizing the results: The model could be improved, but the factors are capturing meaningful information.
The model is a poor fit. The SRMR of 0.065 is good, but SRMR alone is not enough to compensate for the poor RMSEA.
The use of all libraries does show slight improvement over looking at just urban libraries.
## rmsea cfi tli srmr
## 0.226 0.800 0.720 0.065
First, the model with a single latent variable, Library Performance (LibPerf).
We evaluate additional model fit indices (RMSEA, CFI, TLI, SRMR) for a more comprehensive understanding of model fit (a one-line sketch follows).
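These indices come straight from lavaan's fitMeasures(), as in the appendix:
# Sketch only: the four fit indices reported throughout.
fitMeasures(fitSEM1, c("rmsea", "cfi", "tli", "srmr"))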
As we saw with the Measurement Model, it is a poor fit, but an improvement over the model using just the urban libraries.
## rmsea cfi tli srmr
## 0.122 0.923 0.876 0.060
Next, the model with two latent variables: Library Service (LibServ) and Service Use (ServUse).
This is a poor fit, but an improvement over SEM1, as well as an improvement over SEM1 and SEM2 with just urban libraries.
## rmsea cfi tli srmr
## 0.114 0.936 0.890 0.051
Here we tried adding an additional variable, REGBOR (card holders), to the ServUse latent variable.
In addition, we add both Square Feet and Total Operating Expenditures (in addition to state, which was already there). These were suggested as ways to address a possible confound.
And because we are NOT using panel data here (see SEM4 below), we add year as an intercept. The additions amount to the extra model lines sketched below.
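Sketched from the full SEM3 specification in the appendix, the additions are:
# Sketch only: what SEM3 adds on top of SEM2 (lavaan syntax).
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.REGBOR
pz.SQ_FEET ~ 1
pz.TOTOPEXP ~ 1
year ~ 1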
This is worse than SEM2. We added a number of things here, so in future iterations we might try them one at a time.
## rmsea cfi tli srmr
## 0.175 0.707 0.624 0.215
For this version, we will use panel data.
We need to pivot the pz.* columns wide on year so lavaan can handle the panel data.
As this model is very complex, we write code to loop through all the latent variables, regressions, and covariances over the 10 years. An example of the generated lines for one year follows.
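For instance, for 2013 the loop generates lavaan lines like these (abridged; the generator code is in the appendix):
# Sketch only: lavaan lines the loop generates for a single year (2013).
LibServ_2013 =~ pz.all_coll_2013 + pz.HRS_OPEN_2013 + pz.TOTPRO_2013 + pz.GPTERMS_2013
ServUse_2013 =~ pz.TOTCIR_2013 + pz.VISITS_2013 + pz.TOTATTEN_2013 + pz.PITUSR_2013 + pz.REGBOR_2013
pz.OutcomesRank_2013 ~ LibServ_2013 + ServUse_2013 + pz.SQ_FEET_2013 + pz.TOTOPEXP_2013 + STABR
# plus autoregressive paths from 2014 on, e.g.:
LibServ_2014 ~ LibServ_2013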
## rmsea cfi tli srmr
## 0.161 0.422 0.401 0.385
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggpubr) #ggarrange
library(reshape2) #melt
library(scales) #show axis vals in M & not sci not
library(gridExtra)#grid.arrange()
library(lavaan) #SEM model
library(semPlot) #to plot the models
library(caret) #preProcess()
library(corrplot) #corrplot()
library(psych) #fa()
library(stats) #anova()
#Public Library Data
pl_22 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
pl_21 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY21_AE_pud21i.csv")
pl_20 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY20_AE_pud20i.csv")
pl_19 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY19_AE_pud19i.csv")
pl_18 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/pls_fy18_ae_pud18i.csv")
pl_17 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY17_AE_pud17i.csv")
pl_16 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2016_AE_pupld16a_updated.csv")
pl_15 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2015_AE_pupld15a.csv")
pl_14 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2014_AE_pupld14a.csv")
pl_13 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/Pupld13a.csv")
#Community Health Data
ch_22 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2022CHR.csv")
ch_21 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2021CHR.csv")
ch_20 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2020CHR.csv")
ch_19 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2019CHR.csv")
ch_18 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2018CHR.csv")
ch_17 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2017CHR.csv")
ch_16 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2016CHR.csv")
ch_15 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2015CHR.csv")
ch_14 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2014CHR.csv")
ch_13 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2013CHR.csv")
#To get total sq ft by system per year from the outlet files (IMLS PLS publishes these separately, by branch rather than library system), we bring in a summary file that has Year, Sq Ft and FSCSKEY (the unique ID for a library system).
sqft <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/SqFt%202013-2022.csv")
#list of the pl data for the years where the field is called LOCALE
pl_list <- list(pl_13, pl_14, pl_15, pl_16)
#Loop through each data frame and rename LOCALE-> LOCALE_ADD
pl_list <- lapply(pl_list, function(df) {
df |> rename(LOCALE_ADD = LOCALE)})
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
#Rebuild pl_list with all years, to use when looping through and filtering all the data.
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17,pl_18,pl_19,pl_20,
pl_21,pl_22)
# Loop through each data frame and apply the filter
#the library system meets the definition of a Public Library.
pl_list <- lapply(pl_list, function(df) {
df |> filter(C_FSCS=='Y')
})
# Loop through each data frame and apply the filter: exclude state abbr= PR
pl_list <- lapply(pl_list, function(df) {
df |> filter(STABR!='PR')
})
#Then set all the data back to individual tables
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]]
pl_22 <- pl_list[[10]]
#add year column to each of the pl_YY sets before unioning it.
pl_13$year <- '2013'
pl_14$year <- '2014'
pl_15$year <- '2015'
pl_16$year <- '2016'
pl_17$year <- '2017'
pl_18$year <- '2018'
pl_19$year <- '2019'
pl_20$year <- '2020'
pl_21$year <- '2021'
pl_22$year <- '2022'
#also in prep for unioning, we need to see where columns / column names may be different across years.
#first, get the column names for each dataframe
cols13 <- names(pl_13)
cols14 <- names(pl_14)
cols15 <- names(pl_15)
cols16 <- names(pl_16)
cols17 <- names(pl_17)
cols18 <- names(pl_18)
cols19 <- names(pl_19)
cols20 <- names(pl_20)
cols21 <- names(pl_21)
cols22 <- names(pl_22)
#put them together in a list
col_list <- list(cols13, cols14, cols15, cols16, cols17, cols18, cols19, cols20,
cols21, cols22)
#find which columns appear in all
common_col <- Reduce(intersect, col_list)
#and which are diff
diff_col <- lapply(col_list, function(f) setdiff(f, common_col))
print (diff_col)
#Changes made to columns in ALL years
#drop any column whose name contains 'F_' (the PLS flag fields)
#this removes some fields that are common to all years but are not
#needed, for cleaner code and fewer unneeded columns
#refresh pl_list since we made changes
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17, pl_18, pl_19, pl_20,
pl_21, pl_22)
#remove all fields that have F_
pl_list <- lapply(pl_list, function(df) {
df |> select(-contains('F_'))})
#Put data back
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]]
pl_22 <- pl_list[[10]]
#DB local now Elec Collection local
pl_13<- pl_13|> rename(EC_LO_OT=DB_LO_OT )
pl_14<-pl_14|> rename(EC_LO_OT= DB_LO_OT)
#DB state now Elec Collection state
pl_13<-pl_13|> rename(EC_ST = DB_ST)
pl_14<-pl_14|> rename(EC_ST = DB_ST)
#kid programs are now broken out by ages 0-5 and 6-11; add them together to recreate the old all-kids variables
pl_22$KIDATTEN <- pl_22$K0_5ATTEN + pl_22$K6_11ATTEN
pl_22$KIDPRO <- pl_22$K0_5PRO + pl_22$K6_11PRO
#Columns to remove only in 2013 and 2014 data
rm_col13.14 <-c('DATABASE','FIPSCO','FIPSPLAC','FIPSST', 'GAL','GALMS', 'SUBSCRIP', 'POSTMS')
pl_13.14list <-list(pl_13, pl_14)
pl_13.14list <- lapply(pl_13.14list, function(df) {
df |> select(-all_of(rm_col13.14))})
pl_13 <- pl_13.14list[[1]]
pl_14 <- pl_13.14list[[2]]
#Columns to remove only in 2020, 2021,2022 data
#anything that starts 'C19'
pl_20.21.22list <-list(pl_20,
pl_21,pl_22)
pl_20.21.22list <- lapply(pl_20.21.22list, function(df) {
df |> select(-contains('C19'))})
pl_20 <- pl_20.21.22list[[1]]
pl_21 <- pl_20.21.22list[[2]]
pl_22 <- pl_20.21.22list[[3]]
#Columns to remove only in 2016 - 2021 (inclusive) data
rm_col16.21 <-c('ELCONT','ELECCOLL','ELINFO','GNISPLAC',
'INCITSCO','INCITSST','LOCALE_MOD','PHYSCIR',
'TOTCOLL','WIFISESS')
pl_16.21list <-list(pl_16, pl_17,pl_18, pl_19,
pl_20,
pl_21)
pl_16.21list <- lapply(pl_16.21list, function(df) {
df |> select(-all_of(rm_col16.21))})
pl_16 <- pl_16.21list[[1]]
pl_17 <- pl_16.21list[[2]]
pl_18 <- pl_16.21list[[3]]
pl_19 <- pl_16.21list[[4]]
pl_20 <- pl_16.21list[[5]]
pl_21 <- pl_16.21list[[6]]
#other one-offs to remove so all columns match
pl_14 <-pl_14 |> select (-WIFISESS)
pl_15 <-pl_15 |> select (-c(ADDRTYPE,ELECCOLL,GNISPLAC,
INCITSCO,INCITSST,MSTATUS,
SCORE,SUBSCRIP,WIFISESS))
pl_16 <-pl_16 |> select (-c(MSTATUS,REAPLOCALE,
REAPLOCALE_MOD,SCORE,
SUBSCRIP,ADDRTYPE))
pl_17 <-pl_17 |> select (-c(GEOMATCH,REAPLOCALE_ADD,
                            REAPLOCALE_MOD,
                            SUBSCRIP))
pl_18 <-pl_18 |> select (-c(GEOMATCH,REAPLOCALE_ADD,
REAPLOCALE_MOD,WEBVISIT,
SUBSCRIP))
pl_19 <-pl_19 |> select (-c(GEOMATCH,WEBVISIT,
                            SUBSCRIP))
pl_20 <-pl_20 |>select(-c(GEOMTYPE,GEOSCORE,GEOSTATUS,
WEBVISIT,REFERRPT,VISITRPT))
pl_21 <-pl_21 |> select (-c(TOTPHYS,OTHPHYS,OTHPHCIR,
K0_5PRO,K6_11PRO,ADULTPRO,
GENPRO,ONPRO,OFFPRO,VIRPRO,
K0_5ATTEN,K6_11ATTEN,
ADULTATTEN,GENATTEN,ONATTEN,
OFFATTEN,VIRATTEN,TOTPRES,
TOTVIEWS,PITUSRRPT,WIFISRPT,
GEOMTYPE,GEOSCORE,GEOSTATUS,
WEBVISIT,REFERRPT,VISITRPT))
pl_22 <-pl_22 |>select(-c(ADULTATTEN,ADULTPRO,ELCONT,
ELECCOLL, ELINFO,GEOMTYPE,
GEOSCORE,GEOSTATUS,GENATTEN,
GENPRO, K0_5PRO,K0_5ATTEN,
K6_11PRO,K6_11ATTEN,LOCALE_MOD,
LSAGEOID,LSAGEORATIO,
LSAGEOTYPE,ODFINE, OFFATTEN,
OFFPRO,ONPRO,OTHPHCIR,
OTHPHYS, PHYSCIR, PITUSRRPT,
REFERRPT,TOTCOLL, TOTPHYS,
TOTPRES,TOTVIEWS, VIRPRO,
VIRATTEN, VISITRPT,WEBVISIT,
WIFISESS, WIFISRPT,ONATTEN))
#REPEAT SAME STEPS AS ABOVE TO SEE IF ANY ODD COLUMNS LEFT
#also in prep for unioning, we need to see where columns / column names may be different across years.
#first, get the column names for each dataframe
cols13 <- names(pl_13)
cols14 <- names(pl_14)
cols15 <- names(pl_15)
cols16 <- names(pl_16)
cols17 <- names(pl_17)
cols18 <- names(pl_18)
cols19 <- names(pl_19)
cols20 <- names(pl_20)
cols21 <- names(pl_21)
cols22 <- names(pl_22)
#put them together in a list
col_list <- list(cols13, cols14, cols15, cols16, cols17, cols18, cols19, cols20,
cols21, cols22)
#find which columns appear in all
common_col <- Reduce(intersect, col_list)
#and which are diff
diff_col <- lapply(col_list, function(f) setdiff(f, common_col))
print (diff_col)
#refresh pl_list since we made changes
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17, pl_18, pl_19, pl_20,
pl_21, pl_22)
names(pl_list) <- paste0("pl", 13:22)
# Combine keys with their source data frame names
key_df <- bind_rows(lapply(names(pl_list), function(name) {
data.frame(Key = pl_list[[name]]$FSCSKEY, Source = name)
}))
# Count in how many unique data frames each key appears
key_counts <- key_df %>%
distinct(Key, Source) %>% # Remove duplicates
group_by(Key) %>%
summarise(n_df = n(), .groups = "drop")
# Identify keys not present in all 10 data frames
not_in_all <- key_counts %>% filter(n_df < 10)
# Get the vector of Keys to remove
keys_to_remove <- not_in_all$Key
# Filter each df to remove those keys
df_list_filtered <- lapply(pl_list, function(df) {
df |> filter(!FSCSKEY %in% keys_to_remove)
})
for (i in 1:10) {
assign(paste0("pl_", i+12), df_list_filtered[[i]])
}
#list of the pl_YY for each year
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17,pl_18,pl_19,pl_20,
pl_21,pl_22)
# Loop through each data frame and add the county+state key
pl_list <- lapply(pl_list, function(df) {
df |> mutate(cnty.state = paste(CNTY,STABR))
})
#Put data back
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]]
pl_22 <- pl_list[[10]]
#for the health data
#NEED TO CONVERT STATE TO STATE ABBR
state_abbr <- c(
Alabama = "AL",Alaska = "AK",Arizona = "AZ",Arkansas = "AR",
California = "CA",Colorado = "CO",Connecticut = "CT",
Delaware = "DE", Florida = "FL", Georgia = "GA",Hawaii = "HI",
Idaho = "ID", Illinois = "IL", Indiana = "IN", Iowa = "IA",
Kansas = "KS", Kentucky = "KY", Louisiana = "LA", Maine = "ME",
Maryland = "MD",Massachusetts = "MA", Michigan = "MI",
Minnesota = "MN", Mississippi = "MS", Missouri = "MO",
Montana = "MT", Nebraska = "NE", Nevada = "NV",
New_Hampshire = "NH", New_Jersey = "NJ",New_Mexico = "NM",
New_York = "NY", North_Carolina = "NC", North_Dakota = "ND",
Ohio = "OH",Oklahoma = "OK", Oregon = "OR", Pennsylvania = "PA",
Rhode_Island = "RI", South_Carolina = "SC",South_Dakota = "SD",
Tennessee = "TN", Texas = "TX", Utah = "UT", Vermont = "VT",
Virginia = "VA", Washington = "WA", West_Virginia = "WV",
Wisconsin = "WI", Wyoming = "WY", District_of_Columbia = 'DC'
)
state_to_abbr <- function(state_name) {
state_name <- gsub(" ", "_", state_name)
state_abbr[state_name]
}
#Build the 'county + state' key in the ch data (and fix county-name mismatches)
ch_list <- list(ch_13,ch_14, ch_15, ch_16, ch_17,ch_18,ch_19,ch_20,
ch_21,ch_22)
# Loop through each data frame and standardize state and county names
ch_list <- lapply(ch_list, function(df) {
df |> mutate(State = state_to_abbr(State),
County = if_else(County=="Virginia Beach City",
"Virginia Beach", County),
County = if_else(County=="District of Columbia",
"Dist of Columbia", County),
cnty.state = str_to_upper(paste(County,State)))
})
ch_13 <- ch_list[[1]]
ch_14 <- ch_list[[2]]
ch_15 <- ch_list[[3]]
ch_16 <- ch_list[[4]]
ch_17 <- ch_list[[5]]
ch_18 <- ch_list[[6]]
ch_19 <- ch_list[[7]]
ch_20 <- ch_list[[8]]
ch_21 <- ch_list[[9]]
ch_22 <- ch_list[[10]]
#2013 has Z-scores instead of quartiles; derive the quartiles from the ranks
ch_13 <- ch_13 |>
mutate(OutcomesQuartile = ntile(OutcomesRank,4),
FactorsQuartile = ntile(FactorsRank,4)) |>
select (-c(OutcomesZ.Score,FactorsZ.Score))
pl_ch13<-left_join (pl_13, ch_13, by = 'cnty.state')
pl_ch14<-left_join (pl_14, ch_14, by = 'cnty.state')
pl_ch15<-left_join (pl_15, ch_15, by = 'cnty.state')
pl_ch16<-left_join (pl_16, ch_16, by = 'cnty.state')
pl_ch17<-left_join (pl_17, ch_17, by = 'cnty.state')
pl_ch18<-left_join (pl_18, ch_18, by = 'cnty.state')
pl_ch19<-left_join (pl_19, ch_19, by = 'cnty.state')
pl_ch20<-left_join (pl_20, ch_20, by = 'cnty.state')
pl_ch21<-left_join (pl_21, ch_21, by = 'cnty.state')
pl_ch22<-left_join (pl_22, ch_22, by = 'cnty.state')
pl_chALL <-rbind(pl_ch13,pl_ch14,pl_ch15,pl_ch16,pl_ch17,
pl_ch18,pl_ch19,pl_ch20,
pl_ch21,pl_ch22)
#filter where State (from ch data) is NA, indicating the cnty.state field did not match
pl_chALL <- pl_chALL |> filter(!is.na(State))
#filter NA or NR (Not Rated) for the OutcomesRank
pl_chALL <- pl_chALL |> filter(!is.na(OutcomesRank))
pl_chALL <- pl_chALL |> filter(OutcomesRank!='NR')
pl_chALL$OutcomesRank <- as.numeric(pl_chALL$OutcomesRank)
pl_chALL$FactorsRank <- as.numeric(pl_chALL$FactorsRank)
#keep only library systems with rows in all 10 years
pl_chALL <- pl_chALL |>
group_by(FSCSKEY) |>
filter(n() == 10) |>
ungroup()
#join in the sq ft data
sqft$year=as.character(sqft$year)
pl_chALL <- left_join(pl_chALL,sqft,
by = c("year"="year",
"FSCSKEY"="FSCSKEY"))
pl_chALL <- pl_chALL |> filter(!is.na(SQ_FEET))
#create collection fields.
#just in case, making sep phys and digital, then total
pl_chALL <- pl_chALL |>
mutate(phys_coll = BKVOL+AUDIO_PH+VIDEO_PH) |>
mutate(dig_coll = EBOOK+AUDIO_DL+VIDEO_DL) |>
mutate(all_coll = phys_coll + dig_coll)
#adj where library system changed names
pl_chALL <- pl_chALL |>
mutate(LIBNAME = if_else(FSCSKEY=="GA0022",
"FULTON COUNTY LIBRARY SYSTEM",
if_else(FSCSKEY=="IN0210",
"INDIANAPOLIS-MARION COUNTY PUBLIC LIBRARY",
if_else(FSCSKEY=="LA0058",
"NEW ORLEANS PUBLIC LIBRARY",
if_else(FSCSKEY=="TX0050",
"FORT WORTH PUBLIC LIBRARY",
LIBNAME)))))
# Vector of column names to remove
cols_to_remove <- c("ADDRES_M", "ADDRESS", "C_FSCS",
"C_RELATN", "CITY_M", "CNTY", "County",
"ENDDATE", "LSABOUND","ZIP","ZIP_M",
"ZIP4", "ZIP4_M","BENEFIT","BKMOB",
"CAP_REV","CAPITAL", "CBSA","CDCODE",
"CENBLOCK", "CENTRACT", "FCAP_REV",
"FEDGVT", "FIPS", "KIDATTEN", "KIDCIRCL",
"KIDPRO", "LATITUDE", "LCAP_REV",
"LIBRARIA", "LOANFM", "LOANTO", "LOCGVT",
"LONGITUD", "MASTER", "MICROF",
"OCAP_REV", "OTHINCM", "OTHMATEX",
"OTHOPEXP", "OTHPAID", "PHONE",
"REFERENC", "RSTATUS", "SALARIES",
"SCAP_REV", "STAFFEXP", "STATADDR",
"STATADDR", "STATSTRU", "STGVT",
"VIDEO_DL", "VIDEO_PH", "YAATTEN",
"YAPRO")
# Remove them
pl_chALL <- pl_chALL|>select(-all_of(cols_to_remove))
pl_chALL <- pl_chALL %>%
mutate(across(where(is.numeric), ~ ifelse(. < 0, NA, .)))
pl_chALL <- pl_chALL |>
group_by(FSCSKEY) |>
filter(n() == 10) |>
ungroup()
#writing the resulting data out so I can play in Tableau a bit
write.csv(pl_chALL, "C:/Users/diana/Documents/dmp/CUNY/698 Capstone - Final Research Project/Final Project/pl_chrALL.csv", row.names = FALSE)
p1<-ggplot(pl_chALL, aes(x=all_coll, y=TOTCIR, color=year)) +
geom_point() +
ggtitle("Collection vs Circulation")+
theme_minimal()+
theme(text=element_text(size=8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position="none")+
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))
p2<-ggplot(pl_chALL, aes(x=HRS_OPEN, y=VISITS, color=year)) +
geom_point() +
ggtitle("Hours Open vs Visits")+
theme_minimal()+
theme(text=element_text(size=8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position="none")+
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))
p3<-ggplot(pl_chALL, aes(x=TOTPRO, y=TOTATTEN, color=year)) +
geom_point() +
ggtitle("Programs vs Attendance")+
theme_minimal()+
theme(text=element_text(size=8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position="none")+
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))
p4<-ggplot(pl_chALL, aes(x=GPTERMS, y=PITUSR, color=year)) +
geom_point() +
ggtitle("Computers vs Sessions")+
theme_minimal()+
theme(text=element_text(size=8),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position="none")+
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))
ggarrange(p1,p2,p3,p4)
#the columns of interest
cols <-c('all_coll', 'TOTCIR', 'HRS_OPEN', 'VISITS', 'TOTPRO', 'TOTATTEN', 'GPTERMS', 'PITUSR')
#small dataset with just those columns, pivoted
sm_data <- pl_chALL|>
  select (all_of(cols))
sm_data$ID <- rownames(sm_data)
sm_data <- melt(sm_data, id.vars = "ID")
# Create histograms of the revenue columns
ggplot(data = sm_data, aes(x = value)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Library Data for SEM", x = "Variables", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + # Clean up grid lines
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1)) # Format x-axis in millions
pl_chALL_F <- pl_chALL |>
select (all_coll, TOTCIR, HRS_OPEN, VISITS, TOTPRO,
TOTATTEN, GPTERMS, PITUSR, FSCSKEY, year, POPU_LSA,
STABR, cnty.state, C_ADMIN, OutcomesRank,
FactorsRank, SQ_FEET,
TOTOPEXP, OBEREG, LOCALE_ADD,REGBOR )
pl_chALL_F <- pl_chALL_F |>
mutate (z.all_coll = scale(all_coll),
z.TOTCIR = scale(TOTCIR),
z.HRS_OPEN = scale(HRS_OPEN),
z.VISITS = scale(VISITS),
z.TOTPRO= scale(TOTPRO),
z.TOTATTEN = scale(TOTATTEN),
z.GPTERMS = scale(GPTERMS),
z.PITUSR = scale(PITUSR),
z.SQ_FEET = scale(SQ_FEET),
z.POPU_LSA = scale(POPU_LSA),
z.TOTOPEXP = scale(TOTOPEXP),
z.REGBOR = scale(REGBOR),
z.OutcomesRank = scale(OutcomesRank),
z.FactorsRank = scale(FactorsRank))
pl_chALL_F <- pl_chALL_F |>
mutate (pz.all_coll = z.all_coll-z.POPU_LSA,
pz.TOTCIR = z.TOTCIR-z.POPU_LSA,
pz.HRS_OPEN = z.HRS_OPEN-z.POPU_LSA,
pz.VISITS = z.VISITS-z.POPU_LSA,
pz.TOTPRO = z.TOTPRO-z.POPU_LSA,
pz.TOTATTEN = z.TOTATTEN-z.POPU_LSA,
pz.GPTERMS = z.GPTERMS-z.POPU_LSA,
pz.PITUSR = z.PITUSR-z.POPU_LSA,
pz.SQ_FEET = z.SQ_FEET-z.POPU_LSA,
pz.REGBOR = z.REGBOR-z.POPU_LSA,
pz.OutcomesRank = z.OutcomesRank-z.POPU_LSA,
pz.FactorsRank = z.FactorsRank-z.POPU_LSA,
pz.TOTOPEXP = z.TOTOPEXP-z.POPU_LSA
)
#the columns of interest
pz.cols <-c('pz.all_coll', 'pz.TOTCIR', 'pz.HRS_OPEN', 'pz.VISITS', 'pz.TOTPRO', 'pz.TOTATTEN', 'pz.GPTERMS', 'pz.PITUSR')
#small dataset with just those columns, pivoted
pz.sm_data <- pl_chALL_F|>
  select (all_of(pz.cols))
pz.sm_data$ID <- rownames(pz.sm_data)
pz.sm_data <- melt(pz.sm_data, id.vars = "ID")
# Create histograms of the revenue columns
ggplot(data = pz.sm_data, aes(x = value)) +
geom_histogram(fill = "skyblue", color = "black", alpha = 0.8, binwidth = 5) + # Adjust binwidth as needed
facet_wrap(~ variable, scales = "free") +
labs(title = "Library Data for SEM", x = "Variables", y = "Frequency") +
theme_minimal(base_size = 9) + # Use a minimal theme for better clarity
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # Clean up grid lines
pz_data <- pl_chALL_F |> select(starts_with('pz')) |> select(-c(pz.SQ_FEET, pz.OutcomesRank, pz.FactorsRank))
corr_matrix <- cor(pz_data, use = "pairwise.complete.obs")
corrplot(corr_matrix,
method = "circle",
type = "lower", #just the lower triangle
tl.col = "black", #label text color
tl.cex = 0.7, #label text font size
number.cex = 0.4, #corr coeff font size
addCoef.col = "skyblue", #corr coeff color
cl.cex = 0.4 ) #color palette #s font size
# Run factor analysis with one factor
fa_result <- fa(pz_data, nfactors = 1)
# Display the factor loadings (how much each variable contributes to the factor)
#commenting this out once we have seen initial results
#fa_result
fa_result$loadings[, 1]
# Define the measurement model
# Latent variable: Library Performance
model <- '
LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN +
pz.VISITS + pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS +
pz.PITUSR
'
# Fit the measurement model
fit <- cfa(model, data = pz_data)
# Commenting out the detailed summary until we get a promising model
#summary(fit, fit.measures = TRUE)
(fitIndices <- fitMeasures(fit, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fit, whatLabels = "est")
SEM1<- '
#latent vars
LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN +
pz.VISITS + pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS
+ pz.PITUSR
#regressions
pz.OutcomesRank ~ LibPerf
STABR ~ 1
#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'
fitSEM1 <- sem(model = SEM1, data= pl_chALL_F)
#Commenting out the detailed summary until we get a promising model
#summary(fitSEM1)
(fitIndices <- fitMeasures(fitSEM1, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM1, whatLabels = "est")
SEM2<- '
#latent vars
LibServ =~ pz.all_coll + pz.HRS_OPEN + pz.TOTPRO + pz.GPTERMS
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR
#regressions
pz.OutcomesRank ~ LibServ
pz.OutcomesRank ~ ServUse
STABR ~ 1
#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'
fitSEM2 <- sem(model = SEM2, data= pl_chALL_F)
#summary(fitSEM2)
(fitIndices2 <- fitMeasures(fitSEM2, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM2, whatLabels = "est")
SEM3<- '
#latent vars
LibServ =~ pz.all_coll + pz.HRS_OPEN + pz.TOTPRO + pz.GPTERMS
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.REGBOR
#regressions
pz.OutcomesRank ~ LibServ
pz.OutcomesRank ~ ServUse
STABR ~ 1
pz.SQ_FEET ~1
pz.TOTOPEXP ~1
year ~1
#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'
fitSEM3 <- sem(model = SEM3, data= pl_chALL_F)
#summary(fitSEM3)
(fitIndices3 <- fitMeasures(fitSEM3, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM3, whatLabels = "est")
pl_chALL_wide <- pl_chALL_F |>
pivot_wider(
id_cols = c(FSCSKEY,STABR),
# The year column
names_from = year,
# Pivot all pz.* variables (this includes pz.SQ_FEET and pz.TOTOPEXP)
values_from = c(starts_with("pz")),
# Result: pz*_2013 etc
names_sep = "_"
)
years <- 2013:2022
# Define variable parts
libserv_vars <- c("pz.all_coll", "pz.HRS_OPEN", "pz.TOTPRO", "pz.GPTERMS")
servuse_vars <- c("pz.TOTCIR", "pz.VISITS", "pz.TOTATTEN", "pz.PITUSR", "pz.REGBOR")
model_lines <- c()
# Loop over years to build latent factors and regressions
for (year in years) {
# Latent variables
libserv_line <- paste0("LibServ_", year, " =~ ", paste0(libserv_vars, "_", year, collapse = " + "))
servuse_line <- paste0("ServUse_", year, " =~ ", paste0(servuse_vars, "_", year, collapse = " + "))
# Regression line
reg_line <- paste0("pz.OutcomesRank_", year, " ~ LibServ_", year, " + ServUse_", year,
" + pz.SQ_FEET_", year,
" + pz.TOTOPEXP_", year, " + STABR")
# Residual covariances
resids <- c(
paste0("pz.all_coll_", year, " ~~ pz.TOTCIR_", year),
paste0("pz.HRS_OPEN_", year, " ~~ ",
paste0(c("pz.TOTCIR_", "pz.VISITS_", "pz.TOTATTEN_", "pz.PITUSR_", "pz.TOTPRO_"), year, collapse = " + ")),
paste0("pz.TOTPRO_", year, " ~~ pz.TOTATTEN_", year),
paste0("pz.GPTERMS_", year, " ~~ pz.PITUSR_", year)
)
model_lines <- c(model_lines, libserv_line, servuse_line, reg_line, resids)
}
# Autoregressive paths
for (i in 2:length(years)) {
y_now <- years[i]
y_prev <- years[i - 1]
model_lines <- c(model_lines,
paste0("LibServ_", y_now, " ~ LibServ_", y_prev),
paste0("ServUse_", y_now, " ~ ServUse_", y_prev),
paste0("pz.OutcomesRank_", y_now, " ~ pz.OutcomesRank_", y_prev)
)
}
# Combine all lines into final lavaan model string
SEM4 <- paste(model_lines, collapse = "\n")
# If we wanted to check, we could print the model string or write it to a file
#cat(SEM4)
#writeLines(SEM4, "SEM_Panel_Model_2013_2022.txt")
#fit the model using the string created above
fitSEM4 <- sem(SEM4, data = pl_chALL_wide)
#summary(fitSEM4)
(fitIndices4 <- fitMeasures(fitSEM4, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM4, whatLabels = "est")