Introduction

The question we are trying to answer is:

How can we demonstrate the impact of public library service offerings at big urban libraries on the health outcomes of those in the communities they serve?

Data

Both the library and community health data come as one file per year, so to start, we load the library data (pl_YY) and the community health data (ch_YY) into separate data frames by year.

Public Library (PL) Data

The Public Library data come from the IMLS Public Library Survey (PLS), which has surveyed US public libraries annually for over 25 years. It contains information about where each library system resides (urban, suburban, town, rural), the levels of service each provides, and much more.

We have brought in the data from 2013-2022 (2022 is the latest available). Each year has over 9K rows and 100+ variables; each row represents one library system.

Here are the changes we make to the Library Data:

  • Rename LOCALE in 2013-16 -> LOCALE_ADD (the field name changed in 2017).

  • Limit to those library systems (rows) where C_FSCS=Y, meaning the library system meets the definition of a Public Library.

  • Since the health data does not include Puerto Rico, it will be removed from the library data.

  • Add a column to identify year.

  • Align the columns so that every year has the same set of variables.

  • Get rid of columns that are not needed.

  • Keep only library systems that have data for all 10 years.

  • Make a concatenated ‘county + state’ field (since counties may have the same names in different states).

Now we have a consistent set of 10 years, with 8,814 library systems and 92 variables. This will be somewhat reduced as we connect to the community data.
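A condensed sketch of the per-year cleaning (the full code is in the appendix); pl_list holds the ten yearly data frames:

# keep only true public libraries and drop Puerto Rico, year by year
pl_list <- lapply(pl_list, function(df) {
  df |>
    filter(C_FSCS == 'Y') |>   # meets the Public Library definition
    filter(STABR != 'PR')      # the health data has no Puerto Rico
})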

Community Health (CH) Data

The Community Health and Health Access data come from County Health Rankings & Roadmaps, which has county-level data from 2010 to 2024. The 2013-2022 data will be used to match the 10 years of library data.

Changes made to the Community Health Data, to match what is in the library data:

  • Convert all the state names to the two-letter state abbreviations.

  • Convert the county names to uppercase.

  • Make a concatenated ‘county + state’ field (as we did in the library data).

  • The 2013 data has OutcomesZ.Score and FactorsZ.Score where every other year has Quartile, so we calculate the quartiles for that year's data from the ranks.
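A condensed sketch of these changes (full code in the appendix); ch_list holds the ten yearly data frames and state_to_abbr() is a small lookup defined in the appendix:

# abbreviate states and build the county + state key, year by year
ch_list <- lapply(ch_list, function(df) {
  df |> mutate(State = state_to_abbr(State),
               cnty.state = str_to_upper(paste(County, State)))
})

# 2013 only: derive the quartiles from the ranks
ch_13 <- ch_13 |>
  mutate(OutcomesQuartile = ntile(OutcomesRank, 4),
         FactorsQuartile  = ntile(FactorsRank, 4))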

Put all Data together

  • Join health to library data for each year.

  • Union all the years.

  • A few counties did not match in the 2 data sets, so we remove those.

  • Some counties had NA or NR for the OutcomesRank, so remove those as well (again, not many).

  • Not all counties have a rank for all 10 years, so remove those not ranked every year.

  • Add in the total square feet per system.

  • Create a field with the total collection.

  • Some libraries have different names in different years (but the same FSCSKEY, so we know they are the same system). We will use the most recent name in all years for consistency.

  • Remove columns we do not need, now that everything is in one place.

  • The pl_data uses -1, -3, and -9 for various forms of NA; we make them all an explicit NA (see the sketch after this list).

  • Make sure we still have 10 years for each system.

  • Now we have a set of 86,300 rows (8,630 library systems over 10 years).
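The key join and clean-up steps, condensed from the appendix:

# join health to library data for one year (repeated per year, then rbind-ed)
pl_ch13 <- left_join(pl_13, ch_13, by = 'cnty.state')

# recode the PLS sentinel values (-1, -3, -9) to explicit NA
pl_chALL <- pl_chALL |>
  mutate(across(where(is.numeric), ~ ifelse(. < 0, NA, .)))

# keep only systems that still have all 10 years
pl_chALL <- pl_chALL |> group_by(FSCSKEY) |> filter(n() == 10) |> ungroup()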

Data Exploration

A quick look at some of the variables of interest.

Service vs Use Variables

We use scatter plots to show how a library service (like the collection, on the top left) relates to the use of that service (circulation, the borrowing of material from the collection). By coloring each of the 10 years, we can see consistency in the patterns year over year.
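Each panel is built the same way; a minimal version of the first panel (the full theming and axis formatting are in the code appendix):

ggplot(pl_chALL, aes(x = all_coll, y = TOTCIR, color = year)) +
  geom_point() +
  ggtitle("Collection vs Circulation")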

Histograms - Before

We will be transforming all the columns into population-adjusted z-scores. Here is a look at them BEFORE.

Data Transformation

Here, we are making changes needed for the modeling.

  • We make a _F(inal) dataset to use in modeling. At this point most fields are included, but this is the place to experiment with which fields are included or excluded.

  • Generate z-scores.

  • Generate population-adjusted scores (see the sketch below).
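The population adjustment is a difference of z-scores: for each variable x, pz.x = scale(x) - scale(POPU_LSA). A minimal sketch for one variable (the appendix applies this to all of them):

pl_chALL_F <- pl_chALL_F |>
  mutate(z.VISITS   = scale(VISITS),
         z.POPU_LSA = scale(POPU_LSA),
         pz.VISITS  = z.VISITS - z.POPU_LSA)  # population-adjusted z-score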

Histograms - After

The same variables, in their population-adjusted z-scores.

Corrplot() of Variables in the Index

As we can see, there are several cases where the correlation is at or near 0.7:

  • Visits vs Total Circulation (0.70)
  • Visits vs Hours Open (0.66)
  • Visits vs Total Programs (0.68)
  • Visits vs Total Attendance (at programs) (0.72)
  • Visits vs Public Computer Use (0.74)
  • Total Programs vs Total Attendance (0.89)
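The matrix behind the plot is computed as follows (condensed; pz_data, built in the appendix, holds just the pz.* columns used in the index, and corrplot is loaded there):

corr_matrix <- cor(pz_data, use = "pairwise.complete.obs")
corrplot(corr_matrix, method = "circle", type = "lower")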

Modeling

We will start with a factor analysis, then use that in a measurement model. We will then run variations on the SEM model.

Factor Analysis

Summarizing the results: The model could be improved, but the factors are capturing meaningful information.
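The single-factor solution, condensed from the appendix (psych is loaded there):

fa_result <- fa(pz_data, nfactors = 1)   # single-factor solution via psych::fa()
fa_result$loadings[, 1]                  # each variable's loading on the factor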

Measurement Model
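The model in lavaan syntax, from the code appendix (pz_data holds the pz.* columns):

model <- '
  LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN +
             pz.VISITS + pz.TOTPRO + pz.TOTATTEN +
             pz.GPTERMS + pz.PITUSR
'
fit <- cfa(model, data = pz_data)
fitMeasures(fit, c("rmsea", "cfi", "tli", "srmr"))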

The model is a poor fit. The SRMR of 0.065 is good, but SRMR alone is not enough to compensate for the poor RMSEA.

Using all libraries does show a slight improvement over looking at just the urban libraries.

## rmsea   cfi   tli  srmr 
## 0.226 0.800 0.720 0.065

Measurement Model Plot

SEM Model1

First, the model with a single latent variable, Library Performance (LibPerf).
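The core of SEM1, condensed from the appendix (the residual covariances and the STABR intercept are in the full model string there):

SEM1 <- '
LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN + pz.VISITS +
           pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS + pz.PITUSR
pz.OutcomesRank ~ LibPerf   # regress health outcomes on the latent
'
fitSEM1 <- sem(model = SEM1, data = pl_chALL_F)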

SEM1: Evaluation

We evaluate additional model fit indices (e.g., RMSEA, CFI) for a more comprehensive understanding of model fit.

As we saw with the Measurement Model, it is a poor fit, but an improvement over the model using just the urban libraries.

## rmsea   cfi   tli  srmr 
## 0.122 0.923 0.876 0.060

SEM Model2

The model with two latent variables: Library Service (LibServ) and Service Use (ServUse).
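The two-latent specification, condensed from the appendix (residual covariances omitted here):

SEM2 <- '
LibServ =~ pz.all_coll + pz.HRS_OPEN + pz.TOTPRO + pz.GPTERMS
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR
pz.OutcomesRank ~ LibServ
pz.OutcomesRank ~ ServUse
'
fitSEM2 <- sem(model = SEM2, data = pl_chALL_F)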

SEM2: Evaluation

This is a poor fit, but an improvement over SEM1, as well as over both SEM1 and SEM2 fit to just the urban libraries.

## rmsea   cfi   tli  srmr 
## 0.114 0.936 0.890 0.051

SEM Model3

Here we tried adding an additional variable, REGBOR (card holders), to the ServUse latent variable.

In addition, we add both Square Feet and Total Operating Expenditures (alongside state, which was already there). These were suggested as ways to address a possible confound.

And because we are NOT using panel data here (see Model4), we add year as an intercept.
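The lines that change relative to SEM2 (excerpted from the full model string in the appendix):

ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.REGBOR  # REGBOR added
pz.SQ_FEET ~ 1     # confound controls added
pz.TOTOPEXP ~ 1
year ~ 1           # year as an intercept (no panel structure here)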

SEM3: Evaluation

This is worse than SEM2. We added a number of things here, so in future iterations, we might try one at a time.

## rmsea   cfi   tli  srmr 
## 0.175 0.707 0.624 0.215

SEM Model4

For this version we will use panel data.

Pivot

We need to pivot the pz.* (variable) columns wide on year so that lavaan can handle the panel data.
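Condensed from the appendix:

pl_chALL_wide <- pl_chALL_F |>
  pivot_wider(id_cols = c(FSCSKEY, STABR),
              names_from = year,
              values_from = starts_with("pz"))
# result: one row per system, with columns like pz.VISITS_2013 ... pz.VISITS_2022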

Build the Model String

As this model is very complex, we will write code to loop through all the latent variables, regressions, and covariances over the 10 years.
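For a single year (2013 shown here), one latent-variable line is built like this; the full loop in the appendix repeats it for every year, adds the regressions and residual covariances, and then collapses everything with paste(model_lines, collapse = "\n"):

yr <- 2013
libserv_vars <- c("pz.all_coll", "pz.HRS_OPEN", "pz.TOTPRO", "pz.GPTERMS")
paste0("LibServ_", yr, " =~ ", paste0(libserv_vars, "_", yr, collapse = " + "))
## "LibServ_2013 =~ pz.all_coll_2013 + pz.HRS_OPEN_2013 + pz.TOTPRO_2013 + pz.GPTERMS_2013"

Autoregressive paths (e.g., LibServ_2014 ~ LibServ_2013) then link each year to the previous one.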

## rmsea   cfi   tli  srmr 
## 0.161 0.422 0.401 0.385

To Try Next:

  • Library Index weighted by admin type and census region (See Create Index section of the Factor Analysis Sample.Rmd)
  • Add pz.TOTOPEXP & pz.SQ_FEET to Library Index?
  • Try to break down the changes between SEM2 and SEM3, and see if any single change makes an improvement.
  • Do a new SEM with the relationships suggested in the corrplot.
  • COVID year - remove 2020?
  • Play around with which variables are included and which are not.

Code Appendix

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) 
library(ggpubr)   #ggarrange
library(reshape2) #melt
library(scales)   #show axis values in M, not scientific notation
library(gridExtra)#grid.arrange()
library(lavaan)   #SEM model
library(semPlot)  #to plot the models
library(caret)    #preProcess()
library(corrplot) #corrplot()
library(psych)    #fa()
library(stats)    #anova()
#Public Library Data
pl_22 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY22_AE_pud22i.csv")
pl_21 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY21_AE_pud21i.csv")
pl_20 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY20_AE_pud20i.csv")
pl_19 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY19_AE_pud19i.csv")
pl_18 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/pls_fy18_ae_pud18i.csv")
pl_17 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY17_AE_pud17i.csv")
pl_16 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2016_AE_pupld16a_updated.csv")
pl_15 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2015_AE_pupld15a.csv")
pl_14 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/PLS_FY2014_AE_pupld14a.csv")
pl_13 <- read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/Pupld13a.csv")

#Community Health Data 
ch_22 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2022CHR.csv")
ch_21 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2021CHR.csv")
ch_20 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2020CHR.csv")
ch_19 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2019CHR.csv")
ch_18 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2018CHR.csv")
ch_17 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2017CHR.csv")
ch_16 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2016CHR.csv")
ch_15 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2015CHR.csv")
ch_14 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2014CHR.csv")
ch_13 <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/miscData/refs/heads/main/2013CHR.csv")

#To get total sq ft by system per year from the outlet files (separate IMLS PLS files, by branch rather than by library system), we bring in a summary file that has Year, Sq Ft, and FSCSKEY (the unique ID for a library system).

sqft <-read.csv ("https://raw.githubusercontent.com/dianaplunkett/PLDS-data/refs/heads/main/SqFt%202013-2022.csv")
#list of the pl data for each year where field is called LOCALE  
pl_list <- list(pl_13, pl_14, pl_15, pl_16)

#Loop through each data frame and rename LOCALE-> LOCALE_ADD
pl_list <- lapply(pl_list, function(df) {
  df |> rename(LOCALE_ADD = LOCALE)})
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]                          
#Make pl_list have all years, to use in looping through filtering all data.
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17,pl_18,pl_19,pl_20,
                pl_21,pl_22)

# Loop through each data frame and apply the filter
#where the library system meets the definition of a Public Library.
pl_list <- lapply(pl_list, function(df) {
  df |> filter(C_FSCS=='Y')
})
# Loop through each data frame and apply the filter: exclude state abbr= PR
pl_list <- lapply(pl_list, function(df) {
  df |> filter(STABR!='PR')
})


#Then set all the data back to individual tables
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]] 
pl_22 <- pl_list[[10]] 
#add year column to each of the pl_YY sets before unioning it.
pl_13$year <- '2013'
pl_14$year <- '2014'
pl_15$year <- '2015'
pl_16$year <- '2016'
pl_17$year <- '2017'
pl_18$year <- '2018'
pl_19$year <- '2019'
pl_20$year <- '2020'
pl_21$year <- '2021'
pl_22$year <- '2022'
#also in prep for unioning, we need to see where columns / column names may be different across years.

#first, get the columns names for each dataframe
cols13 <- names(pl_13)
cols14 <- names(pl_14)
cols15 <- names(pl_15)
cols16 <- names(pl_16)
cols17 <- names(pl_17)
cols18 <- names(pl_18)
cols19 <- names(pl_19)
cols20 <- names(pl_20)
cols21 <- names(pl_21)
cols22 <- names(pl_22)

#put them together in a list
col_list <- list(cols13, cols14, cols15, cols16, cols17, cols18, cols19, cols20, 
                 cols21, cols22)

#find which columns appear in all
common_col <- Reduce(intersect, col_list)

#and which are diff
diff_col <- lapply(col_list, function(f) setdiff(f, common_col))

print (diff_col)
#Changes made to columns in ALL years
#drop anything whose column name contains 'F_'
#this removes some fields that are common to all years,
#but they are not needed, so the code is cleaner with fewer unneeded columns

#refresh pl_list since we made changes
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17, pl_18, pl_19, pl_20, 
                pl_21, pl_22)

#remove all fields that have F_
pl_list <- lapply(pl_list, function(df) {
  df |> select(-contains('F_'))})

#Put data back
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]] 
pl_22 <- pl_list[[10]]  
#DB local now Elec Collection local
pl_13<- pl_13|> rename(EC_LO_OT=DB_LO_OT )
pl_14<-pl_14|> rename(EC_LO_OT= DB_LO_OT)

#DB state now Elec Collection state
pl_13<-pl_13|> rename(EC_ST = DB_ST)
pl_14<-pl_14|> rename(EC_ST = DB_ST)

#kid programs now broken out by age 0-5 and 6-11; add together to get old variable all kids
pl_22$KIDATTEN <- pl_22$K0_5ATTEN + pl_22$K6_11ATTEN
pl_22$KIDPRO <- pl_22$K0_5PRO + pl_22$K6_11PRO

#Columns to remove only in 2013 and 2014 data
rm_col13.14 <-c('DATABASE','FIPSCO','FIPSPLAC','FIPSST', 'GAL','GALMS', 'SUBSCRIP', 'POSTMS')

pl_13.14list <-list(pl_13, pl_14)
pl_13.14list <- lapply(pl_13.14list, function(df) {
  df |> select(-all_of(rm_col13.14))})
pl_13 <- pl_13.14list[[1]]
pl_14 <- pl_13.14list[[2]]

#Columns to remove only in 2020, 2021,2022 data
#anything that starts 'C19'

pl_20.21.22list <-list(pl_20, 
                      pl_21,pl_22)
pl_20.21.22list <- lapply(pl_20.21.22list, function(df) {
  df |> select(-contains('C19'))})
pl_20 <- pl_20.21.22list[[1]]
pl_21 <- pl_20.21.22list[[2]] 
pl_22 <- pl_20.21.22list[[3]] 
#Columns to remove only in 2016 - 2021 (inclusive) data
rm_col16.21 <-c('ELCONT','ELECCOLL','ELINFO','GNISPLAC',
                'INCITSCO','INCITSST','LOCALE_MOD','PHYSCIR',
                'TOTCOLL','WIFISESS')

pl_16.21list <-list(pl_16, pl_17,pl_18, pl_19,
                    pl_20, 
                    pl_21)
pl_16.21list <- lapply(pl_16.21list, function(df) {
  df |> select(-all_of(rm_col16.21))})
pl_16 <- pl_16.21list[[1]]
pl_17 <- pl_16.21list[[2]]
pl_18 <- pl_16.21list[[3]]
pl_19 <- pl_16.21list[[4]]
pl_20 <- pl_16.21list[[5]]
pl_21 <- pl_16.21list[[6]]
#other one-offs to remove so all columns match
pl_14 <-pl_14 |> select (-WIFISESS)
pl_15 <-pl_15 |> select (-c(ADDRTYPE,ELECCOLL,GNISPLAC,
                              INCITSCO,INCITSST,MSTATUS,
                              SCORE,SUBSCRIP,WIFISESS))
pl_16 <-pl_16 |> select (-c(MSTATUS,REAPLOCALE,
                                 REAPLOCALE_MOD,SCORE,
                                 SUBSCRIP,ADDRTYPE))
pl_17 <-pl_17 |> select (-c(GEOMATCH,REAPLOCALE_ADD,
                                 REAPLOCALE_MOD,
                                 SUBSCRIP))
pl_18 <-pl_18 |> select (-c(GEOMATCH,REAPLOCALE_ADD,
                                REAPLOCALE_MOD,WEBVISIT,
                                 SUBSCRIP))
pl_19 <-pl_19 |> select (-c(GEOMATCH,WEBVISIT,
                                 SUBSCRIP))
pl_20 <-pl_20 |>select(-c(GEOMTYPE,GEOSCORE,GEOSTATUS,
                          WEBVISIT,REFERRPT,VISITRPT))
pl_21 <-pl_21 |> select (-c(TOTPHYS,OTHPHYS,OTHPHCIR,
                              K0_5PRO,K6_11PRO,ADULTPRO,
                            GENPRO,ONPRO,OFFPRO,VIRPRO,
                                  K0_5ATTEN,K6_11ATTEN,
                            ADULTATTEN,GENATTEN,ONATTEN,
                              OFFATTEN,VIRATTEN,TOTPRES,
                            TOTVIEWS,PITUSRRPT,WIFISRPT,
                            GEOMTYPE,GEOSCORE,GEOSTATUS,
                            WEBVISIT,REFERRPT,VISITRPT))
pl_22 <-pl_22 |>select(-c(ADULTATTEN,ADULTPRO,ELCONT,
                          ELECCOLL, ELINFO,GEOMTYPE,
                          GEOSCORE,GEOSTATUS,GENATTEN,
                          GENPRO, K0_5PRO,K0_5ATTEN,
                          K6_11PRO,K6_11ATTEN,LOCALE_MOD,
                          LSAGEOID,LSAGEORATIO,
                          LSAGEOTYPE,ODFINE, OFFATTEN,
                          OFFPRO,ONPRO,OTHPHCIR,
                          OTHPHYS, PHYSCIR, PITUSRRPT,
                          REFERRPT,TOTCOLL, TOTPHYS,
                          TOTPRES,TOTVIEWS, VIRPRO,
                          VIRATTEN, VISITRPT,WEBVISIT,
                          WIFISESS, WIFISRPT,ONATTEN))

#REPEAT SAME STEPS AS ABOVE TO SEE IF ANY ODD COLUMNS LEFT

#also in prep for unioning, we need to see where columns / column names may be different across years.

#first, get the columns names for each dataframe
cols13 <- names(pl_13)
cols14 <- names(pl_14)
cols15 <- names(pl_15)
cols16 <- names(pl_16)
cols17 <- names(pl_17)
cols18 <- names(pl_18)
cols19 <- names(pl_19)
cols20 <- names(pl_20)
cols21 <- names(pl_21)
cols22 <- names(pl_22)

#put them together in a list
col_list <- list(cols13, cols14, cols15, cols16, cols17, cols18, cols19, cols20, 
                 cols21, cols22)

#find which columns appear in all
common_col <- Reduce(intersect, col_list)

#and which are diff
diff_col <- lapply(col_list, function(f) setdiff(f, common_col))

print (diff_col)
#refresh pl_list since we made changes
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17, pl_18, pl_19, pl_20, 
                pl_21, pl_22)

# Step 1: Name each data frame in the list by year
names(pl_list) <- paste0("pl", 13:22)

# Step 2: Combine Keys with their source df names
key_df <- bind_rows(lapply(names(pl_list), function(name) {
  data.frame(Key = pl_list[[name]]$FSCSKEY, Source = name)
}))

# Step 3: Count in how many unique data frames each Key appears
key_counts <- key_df %>%
  distinct(Key, Source) %>%              # Remove duplicates
  group_by(Key) %>%
  summarise(n_df = n(), .groups = "drop")

# Step 4: Filter Keys not in all 10 data frames
not_in_all <- key_counts %>% filter(n_df < 10)

# Get the vector of Keys to remove
keys_to_remove <- not_in_all$Key

# Filter each df to remove those keys
df_list_filtered <- lapply(pl_list, function(df) {
  df |> filter(!FSCSKEY %in% keys_to_remove)
})

for (i in 1:10) {
  assign(paste0("pl_", i+12), df_list_filtered[[i]])
}
#list of the pl_YY for each year  
pl_list <- list(pl_13, pl_14, pl_15, pl_16, pl_17,pl_18,pl_19,pl_20,
                pl_21,pl_22)

# Loop through each data frame and add the concatenated county + state field
pl_list <- lapply(pl_list, function(df) {
  df |> mutate(cnty.state = paste(CNTY,STABR))
})
#Put data back
pl_13 <- pl_list[[1]]
pl_14 <- pl_list[[2]]
pl_15 <- pl_list[[3]]
pl_16 <- pl_list[[4]]
pl_17 <- pl_list[[5]]
pl_18 <- pl_list[[6]]
pl_19 <- pl_list[[7]]
pl_20 <- pl_list[[8]]
pl_21 <- pl_list[[9]] 
pl_22 <- pl_list[[10]] 

#for the health data
#NEED TO CONVERT STATE TO STATE ABBR
state_abbr <- c(
  Alabama = "AL",Alaska = "AK",Arizona = "AZ",Arkansas = "AR", 
  California = "CA",Colorado = "CO",Connecticut = "CT",
  Delaware = "DE", Florida = "FL", Georgia = "GA",Hawaii = "HI", 
  Idaho = "ID", Illinois = "IL", Indiana = "IN", Iowa = "IA",
  Kansas = "KS", Kentucky = "KY", Louisiana = "LA", Maine = "ME",
  Maryland = "MD",Massachusetts = "MA", Michigan = "MI", 
  Minnesota = "MN", Mississippi = "MS", Missouri = "MO",
  Montana = "MT", Nebraska = "NE", Nevada = "NV",
  New_Hampshire = "NH", New_Jersey = "NJ",New_Mexico = "NM",
  New_York = "NY", North_Carolina = "NC", North_Dakota = "ND", 
  Ohio = "OH",Oklahoma = "OK", Oregon = "OR", Pennsylvania = "PA",
  Rhode_Island = "RI", South_Carolina = "SC",South_Dakota = "SD", 
  Tennessee = "TN", Texas = "TX", Utah = "UT", Vermont = "VT",
  Virginia = "VA", Washington = "WA", West_Virginia = "WV",
  Wisconsin = "WI", Wyoming = "WY", District_of_Columbia = 'DC'
)

state_to_abbr <- function(state_name) {
  state_name <- gsub(" ", "_", state_name)
  state_abbr[state_name]
}

#Concatenate the County and State fields in the ch data
ch_list <- list(ch_13,ch_14, ch_15, ch_16, ch_17,ch_18,ch_19,ch_20,
                ch_21,ch_22)


# Loop through each data frame: abbreviate the state, fix county-name mismatches, and build cnty.state
ch_list <- lapply(ch_list, function(df) {
  df |> mutate(State = state_to_abbr(State),
        County = if_else(County=="Virginia Beach City",
                         "Virginia Beach", County),
        County = if_else(County=="District of Columbia",
                         "Dist of Columbia", County),
    cnty.state = str_to_upper(paste(County,State)))
})

ch_13 <- ch_list[[1]]
ch_14 <- ch_list[[2]]
ch_15 <- ch_list[[3]]
ch_16 <- ch_list[[4]]
ch_17 <- ch_list[[5]]
ch_18 <- ch_list[[6]]
ch_19 <- ch_list[[7]]
ch_20 <- ch_list[[8]]
ch_21 <- ch_list[[9]] 
ch_22 <- ch_list[[10]] 


#2013 only has Z-scores; compute the quartiles from the ranks to match the other years
ch_13 <- ch_13 |>
  mutate(OutcomesQuartile = ntile(OutcomesRank,4),
         FactorsQuartile = ntile(FactorsRank,4)) |> 
  select (-c(OutcomesZ.Score,FactorsZ.Score))
  
pl_ch13<-left_join (pl_13, ch_13, by = 'cnty.state')
pl_ch14<-left_join (pl_14, ch_14, by = 'cnty.state')
pl_ch15<-left_join (pl_15, ch_15, by = 'cnty.state')
pl_ch16<-left_join (pl_16, ch_16, by = 'cnty.state')
pl_ch17<-left_join (pl_17, ch_17, by = 'cnty.state')
pl_ch18<-left_join (pl_18, ch_18, by = 'cnty.state')
pl_ch19<-left_join (pl_19, ch_19, by = 'cnty.state')
pl_ch20<-left_join (pl_20, ch_20, by = 'cnty.state')
pl_ch21<-left_join (pl_21, ch_21, by = 'cnty.state')
pl_ch22<-left_join (pl_22, ch_22, by = 'cnty.state')
pl_chALL <-rbind(pl_ch13,pl_ch14,pl_ch15,pl_ch16,pl_ch17,
                  pl_ch18,pl_ch19,pl_ch20,
                 pl_ch21,pl_ch22)
#filter where State (from ch data) is NA, indicating the cnty.state field did not match
pl_chALL <- pl_chALL |>  filter(!is.na(State))
#filter NA or NR (Not Rated) for the OutcomesRank
pl_chALL <- pl_chALL |>  filter(!is.na(OutcomesRank))
pl_chALL <- pl_chALL |>  filter(OutcomesRank!='NR')
pl_chALL$OutcomesRank <- as.numeric(pl_chALL$OutcomesRank)
pl_chALL$FactorsRank <- as.numeric(pl_chALL$FactorsRank)
#filter any county not ranked all 10 years
pl_chALL <- pl_chALL |> 
  group_by(FSCSKEY) |>
  filter(n() == 10) |>
  ungroup()
#join in the sq ft data
sqft$year=as.character(sqft$year)

pl_chALL <- left_join(pl_chALL,sqft, 
                      by = c("year"="year",
                             "FSCSKEY"="FSCSKEY"))

pl_chALL <- pl_chALL |> filter(!is.na(SQ_FEET))
#create collection fields.  
#just in case, making sep phys and digital, then total
pl_chALL <- pl_chALL |> 
  mutate(phys_coll = BKVOL+AUDIO_PH+VIDEO_PH) |>
  mutate(dig_coll = EBOOK+AUDIO_DL+VIDEO_DL) |>
  mutate(all_coll = phys_coll + dig_coll)

#adj where library system changed names
pl_chALL <- pl_chALL |> 
  mutate(LIBNAME = if_else(FSCSKEY=="GA0022",
                         "FULTON COUNTY LIBRARY SYSTEM",
                         if_else(FSCSKEY=="IN0210",
                         "INDIANAPOLIS-MARION COUNTY PUBLIC LIBRARY",
                         if_else(FSCSKEY=="LA0058",
                         "NEW ORLEANS PUBLIC LIBRARY",
                         if_else(FSCSKEY=="TX0050",
                         "FORT WORTH PUBLIC LIBRARY",
                         LIBNAME)))))
 

# Vector of column names to remove
cols_to_remove <- c("ADDRES_M", "ADDRESS", "C_FSCS",
                    "C_RELATN", "CITY_M", "CNTY", "County",
                    "ENDDATE", "LSABOUND","ZIP","ZIP_M",
                    "ZIP4", "ZIP4_M","BENEFIT","BKMOB",
                    "CAP_REV","CAPITAL", "CBSA","CDCODE",
                    "CENBLOCK", "CENTRACT", "FCAP_REV",
                    "FEDGVT", "FIPS", "KIDATTEN", "KIDCIRCL",
                    "KIDPRO", "LATITUDE", "LCAP_REV",
                    "LIBRARIA", "LOANFM", "LOANTO", "LOCGVT",
                    "LONGITUD", "MASTER", "MICROF",
                    "OCAP_REV", "OTHINCM", "OTHMATEX", 
                    "OTHOPEXP", "OTHPAID", "PHONE",
                    "REFERENC", "RSTATUS", "SALARIES", 
                    "SCAP_REV", "STAFFEXP", "STATADDR", 
                    "STATADDR", "STATSTRU", "STGVT",
                    "VIDEO_DL", "VIDEO_PH", "YAATTEN",
                    "YAPRO")

# Remove them
pl_chALL <- pl_chALL|>select(-all_of(cols_to_remove))
pl_chALL <- pl_chALL %>%
  mutate(across(where(is.numeric), ~ ifelse(. < 0, NA, .)))
pl_chALL <- pl_chALL |>
  group_by(FSCSKEY) |>
  filter(n() == 10) |>
  ungroup()
#writing the resulting data out so I can play in Tableau a bit
write.csv(pl_chALL, "C:/Users/diana/Documents/dmp/CUNY/698 Capstone - Final Research Project/Final Project/pl_chrALL.csv", row.names = FALSE)
p1<-ggplot(pl_chALL, aes(x=all_coll, y=TOTCIR, color=year)) +
  geom_point() +
  ggtitle("Collection vs Circulation")+
  theme_minimal()+
    theme(text=element_text(size=8),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          legend.position="none")+
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))

p2<-ggplot(pl_chALL, aes(x=HRS_OPEN, y=VISITS, color=year)) +
  geom_point() +
  ggtitle("Hours Open vs Visits")+
  theme_minimal()+
    theme(text=element_text(size=8),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          legend.position="none")+
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))

p3<-ggplot(pl_chALL, aes(x=TOTPRO, y=TOTATTEN, color=year)) +
  geom_point() +
  ggtitle("Programs vs Attendance")+
  theme_minimal()+
    theme(text=element_text(size=8),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          legend.position="none")+
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))

p4<-ggplot(pl_chALL, aes(x=GPTERMS, y=PITUSR, color=year)) +
  geom_point() +
  ggtitle("Computers vs Sessions")+
  theme_minimal()+
    theme(text=element_text(size=8),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          legend.position="none")+
    scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 1)) +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K", accuracy = 1))

ggarrange(p1,p2,p3,p4)
#the columns of interest
cols <-c('all_coll', 'TOTCIR', 'HRS_OPEN', 'VISITS', 'TOTPRO', 'TOTATTEN', 'GPTERMS', 'PITUSR')


#small dataset with just those columns, pivoted 
sm_data <- pl_chALL|> 
  select (one_of(cols))
sm_data$ID <- rownames(sm_data) 
sm_data <- melt(sm_data, id.vars = "ID")

# Create histograms of the selected columns
ggplot(data = sm_data, aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.8) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Library Data for SEM", x = "Variables", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +  # Clean up grid lines
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", accuracy = 0.1))  # Format x-axis in millions
pl_chALL_F <- pl_chALL |>
  select (all_coll, TOTCIR, HRS_OPEN, VISITS, TOTPRO, 
  TOTATTEN, GPTERMS, PITUSR, FSCSKEY, year, POPU_LSA,
  STABR, cnty.state, C_ADMIN, OutcomesRank,
  FactorsRank, SQ_FEET,
  TOTOPEXP, OBEREG, LOCALE_ADD,REGBOR )
pl_chALL_F <- pl_chALL_F |> 
  mutate (z.all_coll = scale(all_coll),
          z.TOTCIR = scale(TOTCIR),
          z.HRS_OPEN = scale(HRS_OPEN),
          z.VISITS = scale(VISITS),
          z.TOTPRO= scale(TOTPRO),
          z.TOTATTEN = scale(TOTATTEN),
          z.GPTERMS = scale(GPTERMS),
          z.PITUSR = scale(PITUSR),
          z.SQ_FEET = scale(SQ_FEET),
          z.POPU_LSA = scale(POPU_LSA),
          z.TOTOPEXP = scale(TOTOPEXP),
          z.REGBOR = scale(REGBOR),
          z.OutcomesRank = scale(OutcomesRank),
          z.FactorsRank = scale(FactorsRank)) 
pl_chALL_F <- pl_chALL_F |> 
  mutate (pz.all_coll = z.all_coll-z.POPU_LSA,
          pz.TOTCIR = z.TOTCIR-z.POPU_LSA,
          pz.HRS_OPEN = z.HRS_OPEN-z.POPU_LSA,
          pz.VISITS = z.VISITS-z.POPU_LSA,
          pz.TOTPRO = z.TOTPRO-z.POPU_LSA,
          pz.TOTATTEN = z.TOTATTEN-z.POPU_LSA,
          pz.GPTERMS = z.GPTERMS-z.POPU_LSA,
          pz.PITUSR = z.PITUSR-z.POPU_LSA, 
          pz.SQ_FEET = z.SQ_FEET-z.POPU_LSA,
          pz.REGBOR = z.REGBOR-z.POPU_LSA,
          pz.OutcomesRank = z.OutcomesRank-z.POPU_LSA,
          pz.FactorsRank = z.FactorsRank-z.POPU_LSA,
          pz.TOTOPEXP = z.TOTOPEXP-z.POPU_LSA
)
#the columns of interest
pz.cols <-c('pz.all_coll', 'pz.TOTCIR', 'pz.HRS_OPEN', 'pz.VISITS', 'pz.TOTPRO', 'pz.TOTATTEN', 'pz.GPTERMS', 'pz.PITUSR')


#small dataset with just those columns, pivoted 
pz.sm_data <- pl_chALL_F|> 
  select (one_of(pz.cols))
pz.sm_data$ID <- rownames(pz.sm_data) 
pz.sm_data <- melt(pz.sm_data, id.vars = "ID")

# Create histograms of the population-adjusted columns
ggplot(data = pz.sm_data, aes(x = value)) +
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.8, binwidth = 5) +  # Adjust binwidth as needed
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Library Data for SEM", x = "Variables", y = "Frequency") +
  theme_minimal(base_size = 9) +  # Use a minimal theme for better clarity
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())   # Clean up grid lines
  
pz_data <- pl_chALL_F |> select(starts_with('pz')) |> select(-c(pz.SQ_FEET,  pz.OutcomesRank, pz.FactorsRank)) 

corr_matrix <- cor(pz_data, use = "pairwise.complete.obs")
corrplot(corr_matrix, 
         method = "circle", 
         type = "lower",          #just the lower triangle
         tl.col = "black",        #label text color 
          tl.cex = 0.7,           #label text font size
         number.cex = 0.4,        #corr coeff font size
         addCoef.col = "skyblue", #corr coeff color
        cl.cex  = 0.4        )    #color palette #s font size
# Run factor analysis with one factor
fa_result <- fa(pz_data, nfactors = 1)

# Display the factor loadings (how much each variable contributes to the factor)
#commenting this out once we have seen initial results 
#fa_result 

fa_result$loadings[, 1]
# Define the measurement model
# Latent variable: Library Performance
model <- '
  LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN +
            pz.VISITS + pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS +
            pz.PITUSR
'
# Fit the measurement model
fit <- cfa(model, data = pz_data)

# Commenting out the detailed results until we get a promising model
#summary(fit, fit.measures = TRUE)
(fitIndices <- fitMeasures(fit, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fit, whatLabels = "est")
SEM1<- '
#latent vars
LibPerf =~ pz.all_coll + pz.TOTCIR + pz.HRS_OPEN +
            pz.VISITS + pz.TOTPRO + pz.TOTATTEN + pz.GPTERMS
            + pz.PITUSR

#regressions
pz.OutcomesRank ~ LibPerf
STABR ~ 1

#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'

fitSEM1 <- sem(model = SEM1, data= pl_chALL_F)

#Commenting out the detailed summary until we get a promising model
#summary(fitSEM1)

(fitIndices <- fitMeasures(fitSEM1, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM1, whatLabels = "est")
SEM2<- '
#latent vars
LibServ =~ pz.all_coll + pz.HRS_OPEN + pz.TOTPRO + pz.GPTERMS
            
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR              
#regressions
pz.OutcomesRank ~ LibServ
pz.OutcomesRank ~ ServUse
STABR ~ 1

#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'

fitSEM2 <- sem(model = SEM2, data= pl_chALL_F)

#summary(fitSEM2)

(fitIndices2 <- fitMeasures(fitSEM2, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM2, whatLabels = "est")
SEM3<- '
#latent vars
LibServ =~ pz.all_coll + pz.HRS_OPEN + pz.TOTPRO + pz.GPTERMS
            
ServUse =~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.REGBOR  

#regressions
pz.OutcomesRank ~ LibServ
pz.OutcomesRank ~ ServUse
STABR ~ 1
pz.SQ_FEET ~1
pz.TOTOPEXP ~1
year ~1

#residual covariances
pz.all_coll ~~ pz.TOTCIR
pz.HRS_OPEN ~~ pz.TOTCIR + pz.VISITS + pz.TOTATTEN + pz.PITUSR + pz.TOTPRO
pz.TOTPRO ~~ pz.TOTATTEN
pz.GPTERMS ~~ pz.PITUSR
'

fitSEM3 <- sem(model = SEM3, data= pl_chALL_F)

#summary(fitSEM3)

(fitIndices3 <- fitMeasures(fitSEM3, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM3, whatLabels = "est")
pl_chALL_wide <- pl_chALL_F |>
  pivot_wider(
    id_cols = c(FSCSKEY,STABR),
    # The year column
    names_from = year,
    # Pivot all pz.* variables (this includes pz.SQ_FEET and pz.TOTOPEXP)
    values_from = c(starts_with("pz")),
    # Result: pz*_2013 etc
    names_sep = "_"                   
  )


years <- 2013:2022

# Define variable parts
libserv_vars <- c("pz.all_coll", "pz.HRS_OPEN", "pz.TOTPRO", "pz.GPTERMS")
servuse_vars <- c("pz.TOTCIR", "pz.VISITS", "pz.TOTATTEN", "pz.PITUSR", "pz.REGBOR")

model_lines <- c()

# Loop over years to build latent factors and regressions
for (year in years) {
  # Latent variables
  libserv_line <- paste0("LibServ_", year, " =~ ", paste0(libserv_vars, "_", year, collapse = " + "))
  servuse_line <- paste0("ServUse_", year, " =~ ", paste0(servuse_vars, "_", year, collapse = " + "))
  
  # Regression line
  reg_line <- paste0("pz.OutcomesRank_", year, " ~ LibServ_", year, " + ServUse_", year,
                     " + pz.SQ_FEET_", year, 
                " + pz.TOTOPEXP_", year, " + STABR")
  
  # Residual covariances
  resids <- c(
    paste0("pz.all_coll_", year, " ~~ pz.TOTCIR_", year),
    paste0("pz.HRS_OPEN_", year, " ~~ ",
           paste0(c("pz.TOTCIR_", "pz.VISITS_", "pz.TOTATTEN_", "pz.PITUSR_", "pz.TOTPRO_"), year, collapse = " + ")),
    paste0("pz.TOTPRO_", year, " ~~ pz.TOTATTEN_", year),
    paste0("pz.GPTERMS_", year, " ~~ pz.PITUSR_", year)
  )
  
  model_lines <- c(model_lines, libserv_line, servuse_line, reg_line, resids)
}

# Autoregressive paths
for (i in 2:length(years)) {
  y_now <- years[i]
  y_prev <- years[i - 1]
  
  model_lines <- c(model_lines,
    paste0("LibServ_", y_now, " ~ LibServ_", y_prev),
    paste0("ServUse_", y_now, " ~ ServUse_", y_prev),
    paste0("pz.OutcomesRank_", y_now, " ~ pz.OutcomesRank_", y_prev)
  )
}

# Combine all lines into final lavaan model string
SEM4 <- paste(model_lines, collapse = "\n")

# If we wanted to check, we could print the string or write it to a file
#cat(SEM4)
#writeLines(SEM4, "SEM_Panel_Model_2013_2022.txt")

#the model using the string created above
fitSEM4 <- sem(SEM4, data = pl_chALL_wide)

#summary(fitSEM4)

(fitIndices4 <- fitMeasures(fitSEM4, c("rmsea", "cfi", "tli", "srmr")))
# Plot the model
semPaths(fitSEM4, whatLabels = "est")