library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
For my research paper I have chosen a data set that contains various key performance indicators (KPIs) of library services and library use across the fifty states plus Washington, DC. I have also added several economic indicators captured for each state. Both data sets were created and made available to the public by the Institute of Museum and Library Services (IMLS), an independent government agency that is the primary overseer of federal support and policy for the country’s museums and libraries. With this data I aim to explore the relationship between a state’s socioeconomic situation and the availability and use of its library resources.
research_data <- read.csv("state_data_combo.csv")
Below are summary statistics for the variables in the data set that I am most interested in.
summary(research_data$Poverty.Rate....)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.90 11.15 13.70 13.65 15.75 20.80
summary(research_data$Percent.with.no.home.Internet..2018.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.10 16.80 19.10 19.72 21.35 31.50
summary(research_data$Library.Visits.Per.Capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.770 2.290 2.330 3.025 3.820
summary(research_data$Total.Circulation.Per.Capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.640 4.365 5.460 6.038 7.535 12.170
summary(research_data$Registered.Users.Per.Capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.5000 0.4875 0.5600 0.7500
The data set has no missing values.
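One way to confirm this is to count the missing values in each column. The sketch below assumes the CSV loaded earlier is in the working directory; `count_missing` is a helper name introduced here for illustration and is not part of the original analysis.

```r
# Count the NA values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Assumption: state_data_combo.csv sits in the working directory.
if (file.exists("state_data_combo.csv")) {
  research_data <- read.csv("state_data_combo.csv")
  missing_by_column <- count_missing(research_data)

  # Show only the columns that have at least one missing value
  # (prints an empty vector when nothing is missing).
  print(missing_by_column[missing_by_column > 0])

  # Single TRUE/FALSE answer for the whole data frame.
  print(anyNA(research_data))
}
```

Note that `read.csv()` turns blank numeric fields into `NA`, but blank strings in text columns stay as `""` unless `na.strings` is set, so a check like this only covers values R already treats as missing.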
The histograms below show the frequency distribution of each of these variables across the states.
hist(research_data$Poverty.Rate....)
hist(research_data$Percent.with.no.home.Internet..2018.)
hist(research_data$Library.Visits.Per.Capita)
hist(research_data$Total.Circulation.Per.Capita)
hist(research_data$Registered.Users.Per.Capita)
At first glance, the selected variables appear to be roughly normally
distributed across the states, but I will examine that property of the
data more closely later.
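A more formal check could combine a Shapiro-Wilk test with normal Q-Q plots. The following is a minimal sketch under that assumption, not necessarily the method this paper will end up using; `check_normality` is a hypothetical helper, and the column names are the ones used in the summaries above.

```r
# Run a Shapiro-Wilk test on each named numeric column and
# collect the W statistic and p-value in one data frame.
check_normality <- function(df, columns) {
  results <- lapply(columns, function(col) {
    test <- shapiro.test(df[[col]])
    data.frame(variable = col,
               W = unname(test$statistic),
               p_value = test$p.value)
  })
  do.call(rbind, results)
}

# Assumption: state_data_combo.csv sits in the working directory.
if (file.exists("state_data_combo.csv")) {
  research_data <- read.csv("state_data_combo.csv")
  columns <- c("Poverty.Rate....",
               "Percent.with.no.home.Internet..2018.",
               "Library.Visits.Per.Capita",
               "Total.Circulation.Per.Capita",
               "Registered.Users.Per.Capita")

  print(check_normality(research_data, columns))

  # Q-Q plots: points close to the reference line suggest normality.
  old_par <- par(mfrow = c(2, 3))
  for (col in columns) {
    qqnorm(research_data[[col]], main = col)
    qqline(research_data[[col]])
  }
  par(old_par)
}
```

A small p-value from the Shapiro-Wilk test is evidence against normality; a large one only means the test found no clear departure from it.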
Here are some sample plots that show the relationship between some of the economic metrics and library use KPIs:
plot(research_data$Poverty.Rate...., research_data$Registered.Users.Per.Capita)
plot(research_data$Percent.with.no.home.Internet..2018., research_data$Library.Visits.Per.Capita)
plot(research_data$Percent.with.no.home.computer..2018., research_data$Children.s.Material.Circulation.Percentage)
Here are the correlations for the relationships shown in the plots above:
cor(research_data$Poverty.Rate...., research_data$Registered.Users.Per.Capita)
## [1] 0.1589795
cor(research_data$Percent.with.no.home.Internet..2018., research_data$Library.Visits.Per.Capita)
## [1] -0.3044928
cor(research_data$Percent.with.no.home.computer..2018., research_data$Children.s.Material.Circulation.Percentage)
## [1] -0.2514911
None of these correlations is strong: the largest in magnitude is about -0.30, between the percentage with no home Internet and library visits per capita. I will be digging into these relationships more soon, as well as testing the relationships between the other variables in the data set.
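A possible first step is to test whether each correlation differs from zero. The sketch below uses `cor.test()` from base R, which defaults to Pearson's correlation; `correlation_summary` is a helper name made up for this example, and the column pairs are the ones plotted above.

```r
# Summarise a Pearson correlation test between two columns:
# the estimate r, its 95% confidence interval, and the p-value.
correlation_summary <- function(df, x, y) {
  test <- cor.test(df[[x]], df[[y]])
  data.frame(x = x,
             y = y,
             r = unname(test$estimate),
             conf_low = test$conf.int[1],
             conf_high = test$conf.int[2],
             p_value = test$p.value)
}

# Assumption: state_data_combo.csv sits in the working directory.
if (file.exists("state_data_combo.csv")) {
  research_data <- read.csv("state_data_combo.csv")

  pairs <- list(
    c("Poverty.Rate....", "Registered.Users.Per.Capita"),
    c("Percent.with.no.home.Internet..2018.", "Library.Visits.Per.Capita"),
    c("Percent.with.no.home.computer..2018.",
      "Children.s.Material.Circulation.Percentage")
  )

  # One row per pair of variables.
  summaries <- lapply(pairs, function(p) {
    correlation_summary(research_data, p[1], p[2])
  })
  print(do.call(rbind, summaries))
}
```

A confidence interval that includes zero means the data cannot rule out there being no linear relationship between the two variables.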