Project 1: Public Library Survey

Author

N. Yasmin Bromir

Dataset Source

Source: IMLS Public Library Survey

# note: dataset should have both quantitative and categorical variables
# note: it should have at least 4 variables

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
survey_org <- read_csv("data/pls-fy21-admin.csv")
Rows: 9215 Columns: 195
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (94): STABR, FSCSKEY, LIBID, LIBNAME, ADDRESS, CITY, ZIP4, ADDRES_M, CI...
dbl (101): ZIP, ZIP_M, PHONE, POPU_LSA, POPU_UND, CENTLIB, BRANLIB, BKMOB, M...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction to Dataset

This dataset is the Public Libraries Survey, conducted by the Institute of Museum and Library Services (IMLS). It has been conducted in one form or another since 1988; responsibility for it moved from the National Center for Education Statistics to IMLS in 2007.

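The data is collected from approximately 9,000 public libraries with approximately 17,000 individual structures, such as main libraries, neighborhood branches, and bookmobiles. The file loaded above (pls-fy21-admin.csv) is the administrative-level file, so each of its 9,215 rows describes one library system rather than one building.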

The data contains:

  • contact information about each library structure (state, address, phone number, etc.),
  • the number of library visits for each library structure (including if it was closed for COVID),
  • circulation information (number of checkouts by group, reference visits, etc.),
  • collection information (types of materials owned and amount of each, etc.),
  • staffing details (staff with Master’s degrees, salaries, etc.),
  • and much more.

There are almost 200 variables that provide many identifying details for each library. However, we will cut the list down to approximately 10 variables or fewer.

Load the Data

Potential questions to explore and the variables needed:

  • average number of library visits for a state (or group of states) and where individual library structures fall on that average
    • scatterplot for individual actual library visits and line showing average number of visits
    • variables needed: state code, library name, visits, economic analysis code
  • salaries based on population served compared to state average
    • scatterplot for individual library salaries against population served and line showing the state average
    • variables needed: state code, library name, salaries, population served, economic analysis code
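
The first question above (where each library falls relative to its state average) could be sketched roughly as follows. This is only an illustration on a made-up toy table: `toy_visits`, `state_mean`, and `diff_from_mean` are hypothetical names, and the numbers are invented, not taken from the survey.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the survey data (values are invented for illustration)
toy_visits <- data.frame(
  STABR   = c("AK", "AK", "AK", "MD", "MD"),
  LIBNAME = c("A", "B", "C", "D", "E"),
  VISITS  = c(100, 300, 200, 1000, 3000)
)

# Attach each state's mean to every row, then measure the gap from it
visits_vs_state <- toy_visits %>%
  group_by(STABR) %>%
  mutate(state_mean = mean(VISITS, na.rm = TRUE),
         diff_from_mean = VISITS - state_mean) %>%
  ungroup()

# One point per library, with a dashed line at the state average
visits_plot <- ggplot(visits_vs_state, aes(x = LIBNAME, y = VISITS)) +
  geom_point() +
  geom_hline(aes(yintercept = state_mean), linetype = "dashed") +
  facet_wrap(~STABR, scales = "free")
```

Using `mutate()` after `group_by()` keeps one row per library, which is what a scatterplot needs; `summarise()` would collapse each state to a single row.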
head(survey_org)
# A tibble: 6 × 195
  STABR FSCSKEY LIBID    LIBNAME ADDRESS CITY    ZIP ZIP4  ADDRES_M CITY_M ZIP_M
  <chr> <chr>   <chr>    <chr>   <chr>   <chr> <dbl> <chr> <chr>    <chr>  <dbl>
1 AK    AK0001  AK0001-… ANCHOR… 34020 … ANCH… 99556 9150  P.O. BO… ANCHO… 99556
2 AK    AK0002  AK0002-… ANCHOR… 3600 D… ANCH… 99503 6055  3600 DE… ANCHO… 99503
3 AK    AK0003  AK0003-… ANDERS… 101 FI… ANDE… 99744 M     P.O. BO… ANDER… 99744
4 AK    AK0006  AK0006-… KUSKOK… 420 CH… BETH… 99559 M     P.O. BO… BETHEL 99559
5 AK    AK0007  AK0007-… BIG LA… 3140 S… WASI… 99623 9663  P.O. BO… BIG L… 99652
6 AK    AK0008  AK0008-… CANTWE… 1 SCHO… CANT… 99729 M     P.O. BO… CANTW… 99729
# ℹ 184 more variables: ZIP4_M <chr>, CNTY <chr>, PHONE <dbl>, C_RELATN <chr>,
#   C_LEGBAS <chr>, C_ADMIN <chr>, C_FSCS <chr>, GEOCODE <chr>, LSABOUND <chr>,
#   STARTDAT <chr>, ENDDATE <chr>, POPU_LSA <dbl>, F_POPLSA <chr>,
#   POPU_UND <dbl>, CENTLIB <dbl>, F_CENLIB <chr>, BRANLIB <dbl>,
#   F_BRLIB <chr>, BKMOB <dbl>, F_BKMOB <chr>, MASTER <dbl>, F_MASTER <chr>,
#   LIBRARIA <dbl>, F_LIBRAR <chr>, OTHPAID <dbl>, F_OTHSTF <chr>,
#   TOTSTAFF <dbl>, F_TOTSTF <chr>, LOCGVT <dbl>, F_LOCGVT <chr>, …
dim(survey_org)
[1] 9215  195
variable.names(survey_org)
  [1] "STABR"      "FSCSKEY"    "LIBID"      "LIBNAME"    "ADDRESS"   
  [6] "CITY"       "ZIP"        "ZIP4"       "ADDRES_M"   "CITY_M"    
 [11] "ZIP_M"      "ZIP4_M"     "CNTY"       "PHONE"      "C_RELATN"  
 [16] "C_LEGBAS"   "C_ADMIN"    "C_FSCS"     "GEOCODE"    "LSABOUND"  
 [21] "STARTDAT"   "ENDDATE"    "POPU_LSA"   "F_POPLSA"   "POPU_UND"  
 [26] "CENTLIB"    "F_CENLIB"   "BRANLIB"    "F_BRLIB"    "BKMOB"     
 [31] "F_BKMOB"    "MASTER"     "F_MASTER"   "LIBRARIA"   "F_LIBRAR"  
 [36] "OTHPAID"    "F_OTHSTF"   "TOTSTAFF"   "F_TOTSTF"   "LOCGVT"    
 [41] "F_LOCGVT"   "STGVT"      "F_STGVT"    "FEDGVT"     "F_FEDGVT"  
 [46] "OTHINCM"    "F_OTHINC"   "TOTINCM"    "F_TOTINC"   "SALARIES"  
 [51] "F_SALX"     "BENEFIT"    "F_BENX"     "STAFFEXP"   "F_TOSTFX"  
 [56] "PRMATEXP"   "F_PRMATX"   "ELMATEXP"   "F_ELMATX"   "OTHMATEX"  
 [61] "F_OTMATX"   "TOTEXPCO"   "F_TOCOLX"   "OTHOPEXP"   "F_OTHOPX"  
 [66] "TOTOPEXP"   "F_TOTOPX"   "LCAP_REV"   "F_LCAPRV"   "SCAP_REV"  
 [71] "F_SCAPRV"   "FCAP_REV"   "F_FCAPRV"   "OCAP_REV"   "F_OCAPRV"  
 [76] "CAP_REV"    "F_TCAPRV"   "CAPITAL"    "F_TCAPX"    "BKVOL"     
 [81] "F_BKVOL"    "EBOOK"      "F_EBOOK"    "AUDIO_PH"   "F_AUD_PH"  
 [86] "AUDIO_DL"   "F_AUD_DL"   "VIDEO_PH"   "F_VID_PH"   "VIDEO_DL"  
 [91] "F_VID_DL"   "TOTPHYS"    "OTHPHYS"    "EC_LO_OT"   "F_EC_L_O"  
 [96] "EC_ST"      "F_EC_ST"    "ELECCOLL"   "F_ELECOL"   "HRS_OPEN"  
[101] "F_HRS_OP"   "VISITS"     "F_VISITS"   "VISITRPT"   "REFERENC"  
[106] "F_REFER"    "REFERRPT"   "REGBOR"     "F_REGBOR"   "TOTCIR"    
[111] "F_TOTCIR"   "KIDCIRCL"   "F_KIDCIR"   "ELMATCIR"   "F_EMTCIR"  
[116] "PHYSCIR"    "F_PHYSCR"   "ELINFO"     "F_ELINFO"   "ELCONT"    
[121] "F_ELCONT"   "TOTCOLL"    "F_TOTCOL"   "OTHPHCIR"   "LOANTO"    
[126] "F_LOANTO"   "LOANFM"     "F_LOANFM"   "TOTPRO"     "F_TOTPRO"  
[131] "KIDPRO"     "F_KIDPRO"   "K0_5PRO"    "K6_11PRO"   "YAPRO"     
[136] "F_YAPRO"    "ADULTPRO"   "GENPRO"     "ONPRO"      "OFFPRO"    
[141] "VIRPRO"     "TOTATTEN"   "F_TOTATT"   "KIDATTEN"   "F_KIDATT"  
[146] "K0_5ATTEN"  "K6_11ATTEN" "YAATTEN"    "F_YAATT"    "ADULTATTEN"
[151] "GENATTEN"   "ONATTEN"    "OFFATTEN"   "VIRATTEN"   "TOTPRES"   
[156] "TOTVIEWS"   "GPTERMS"    "F_GPTERM"   "PITUSR"     "F_PITUSR"  
[161] "PITUSRRPT"  "WIFISESS"   "F_WIFISS"   "WIFISRPT"   "WEBVISIT"  
[166] "YR_SUB"     "OBEREG"     "RSTATUS"    "STATSTRU"   "STATNAME"  
[171] "STATADDR"   "LONGITUD"   "LATITUDE"   "INCITSST"   "INCITSCO"  
[176] "GNISPLAC"   "CNTYPOP"    "LOCALE_ADD" "LOCALE_MOD" "CENTRACT"  
[181] "CENBLOCK"   "CDCODE"     "CBSA"       "MICROF"     "GEOSTATUS" 
[186] "GEOSCORE"   "GEOMTYPE"   "C19CLOSE"   "C19PUBSV"   "C19ECRD2"  
[191] "C19REFER"   "C19OUTSD"   "C19XWIF2"   "C19XWIF3"   "C19STOTH"  

Clean up the Data

Include subtitles and detailed comments on all chunks to help audience understand intentions

# keep only the ten variables needed for the questions above
survey_subset <- subset(survey_org, select = c(STABR, LIBNAME, POPU_UND, MASTER, SALARIES, HRS_OPEN, VISITS, OBEREG, LOCALE_MOD, CDCODE))

survey_categories <- survey_subset %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "1", "New England")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "2", "Mid East")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "3", "Great Lakes")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "4", "Plains")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "5", "South East")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "6", "South West")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "7", "Rocky Mountains")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "8", "Far West")) %>% 
  mutate(OBEREG = replace(OBEREG, OBEREG == "9", "Outlying Areas")) %>% 
  
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "11", "City, Large")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "12", "City, Mid-size")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "13", "City, Small")) %>% 
  
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "21", "Suburb, Large")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "22", "Suburb, Mid-size")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "23", "Suburb, Small")) %>% 
  
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "31", "Town, Fringe")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "32", "Town, Distant")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "33", "Town, Remote")) %>% 
  
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "41", "Rural, Fringe")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "42", "Rural, Distant")) %>% 
  mutate(LOCALE_MOD = replace(LOCALE_MOD, LOCALE_MOD == "43", "Rural, Remote"))
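
As an aside, a long chain of `replace()` calls like the one above could also be written as a single lookup. The sketch below is one possible alternative, not the approach used in this project: `region_labels` and `toy_codes` are hypothetical names, the labels are copied from the recoding above, and it assumes the codes print as "1" through "9" when converted to text.

```r
# Lookup table: region code (as text) -> region label
region_labels <- c(
  "1" = "New England", "2" = "Mid East", "3" = "Great Lakes",
  "4" = "Plains", "5" = "South East", "6" = "South West",
  "7" = "Rocky Mountains", "8" = "Far West", "9" = "Outlying Areas"
)

# Toy codes standing in for the OBEREG column; 99 is a deliberately unknown code
toy_codes <- c(8, 3, 1, 99)

# Index the lookup by the code as text; codes missing from the table become NA
toy_regions <- unname(region_labels[as.character(toy_codes)])
toy_regions
```

One difference in behavior is worth noting: the `replace()` chain leaves an unrecognized code untouched, while the lookup turns it into `NA`, which makes unexpected codes easier to spot.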

survey_na <- survey_categories %>% 
  mutate(across(where(is.numeric), ~na_if(., -1))) %>% 
  mutate(across(where(is.numeric), ~na_if(., -3))) %>% 
  mutate(across(where(is.numeric), ~na_if(., -9)))

# -1 meant the data was missing
# -3 meant the administrative entity was temporarily closed
# -9 meant the data was suppressed for analytics or confidentiality (too few staff)
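
To see what the missing-value codes above do in practice, here is a small sketch on invented data. It handles all three codes in one pass instead of three; `toy_survey`, `missing_codes`, and `toy_clean` are hypothetical names and the values are made up.

```r
library(dplyr)

# Toy data using the sentinel codes described above (values are invented)
toy_survey <- data.frame(
  LIBNAME  = c("A", "B", "C", "D"),
  SALARIES = c(50000, -9, -1, 72000),
  HRS_OPEN = c(2040, 1404, -3, -1)
)

missing_codes <- c(-1, -3, -9)

# Replace any sentinel in any numeric column with NA in a single pass
toy_clean <- toy_survey %>%
  mutate(across(where(is.numeric), ~ replace(.x, .x %in% missing_codes, NA)))

# Count how many values each column lost
na_counts <- colSums(is.na(toy_clean))
na_counts
```

Collapsing the codes to `NA` is convenient for `mean()` and `sum()` with `na.rm = TRUE`, but it also discards the reason a value is absent (missing, closed, or suppressed), so it may be worth keeping the original columns if that distinction matters later.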

head(survey_na)
# A tibble: 6 × 10
  STABR LIBNAME       POPU_UND MASTER SALARIES HRS_OPEN VISITS OBEREG LOCALE_MOD
  <chr> <chr>            <dbl>  <dbl>    <dbl>    <dbl>  <dbl> <chr>  <chr>     
1 AK    ANCHOR POINT…     2123    0         NA     1404   5927 Far W… Rural, Re…
2 AK    ANCHORAGE PU…   288970   20.5  3766214     7596 159427 Far W… City, Lar…
3 AK    ANDERSON COM…      275    0         NA      420    325 Far W… Rural, Re…
4 AK    KUSKOKWIM CO…     6179    1      66813     2040   2100 Far W… Town, Rem…
5 AK    BIG LAKE PUB…     6942    1     209076       NA  18592 Far W… Rural, Di…
6 AK    CANTWELL COM…      190    0         NA      720    480 Far W… Rural, Re…
# ℹ 1 more variable: CDCODE <dbl>

Explore the Data with at least one data visualization

Some suggestions for visualizations include side-by-side box plots, histograms, bar graphs, scatterplots, treemaps, heatmaps, alluvials, or streamgraphs. The type of data you use will help determine which visualization you should use.

Region Information

  • 01–New England (CT ME MA NH RI VT)
  • 02–Mid East (DE DC MD NJ NY PA)
  • 03–Great Lakes (IL IN MI OH WI)
  • 04–Plains (IA KS MN MO NE ND SD)
  • 05–Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)
  • 06–Southwest (AZ NM OK TX)
  • 07–Rocky Mountains (CO ID MT UT WY)
  • 08–Far West (AK CA HI NV OR WA)
  • 09–Outlying Areas (AS GU MP PR VI)
# table showing total, mean, and median hours open by region
survey_hours <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(total_hours_open = sum(HRS_OPEN, na.rm = TRUE),
            mean_hours_open = mean(HRS_OPEN, na.rm = TRUE),
            median_hours_open = median(HRS_OPEN, na.rm = TRUE))
survey_hours 
# A tibble: 9 × 4
  OBEREG          total_hours_open mean_hours_open median_hours_open
  <chr>                      <dbl>           <dbl>             <dbl>
1 Far West                 1503004           3400.             1314 
2 Great Lakes              6497710           3465.             2268 
3 Mid East                 4215818           2839.             2104 
4 New England              1433647           1212.             1062 
5 Outlying Areas             14278           3570.             3662 
6 Plains                   3478009           2285.             1660.
7 Rocky Mountains          1040718           2965.             2006 
8 South East               6542187           5857.             2567 
9 South West               2324739           2877.             1944 
# table showing mean of library visits by region
survey_visits <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(mean_visits = mean(VISITS, na.rm = TRUE))
survey_visits
# A tibble: 9 × 2
  OBEREG          mean_visits
  <chr>                 <dbl>
1 Far West             88669.
2 Great Lakes          48439.
3 Mid East             44760.
4 New England          14879.
5 Outlying Areas       26878.
6 Plains               23752.
7 Rocky Mountains      73201.
8 South East           81147.
9 South West           39805.
# table showing number of libraries by region
survey_region <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(library_sum_per_region = n())
survey_region
# A tibble: 9 × 2
  OBEREG          library_sum_per_region
  <chr>                            <int>
1 Far West                           517
2 Great Lakes                       1887
3 Mid East                          1545
4 New England                       1270
5 Outlying Areas                       4
6 Plains                            1583
7 Rocky Mountains                    392
8 South East                        1164
9 South West                         853
# table showing info on librarians with Master's degrees by region
survey_masterlib <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(masters_librarians = mean(MASTER, na.rm = TRUE))
survey_masterlib
# A tibble: 9 × 2
  OBEREG          masters_librarians
  <chr>                        <dbl>
1 Far West                      9.59
2 Great Lakes                   3.88
3 Mid East                      4.54
4 New England                   2.21
5 Outlying Areas                1.5 
6 Plains                        1.05
7 Rocky Mountains               2.93
8 South East                    5.55
9 South West                    3.10
# table showing population data by region
survey_population <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(region_avg_population = mean(POPU_UND, na.rm = TRUE))
survey_population
# A tibble: 9 × 2
  OBEREG          region_avg_population
  <chr>                           <dbl>
1 Far West                      117576.
2 Great Lakes                    24080.
3 Mid East                       31399.
4 New England                    11683.
5 Outlying Areas                 93325 
6 Plains                         12530.
7 Rocky Mountains                30549.
8 South East                     73051.
9 South West                     45454.
# table showing library types by region
survey_type <- survey_na %>% 
  group_by(LOCALE_MOD) %>%
  count(., OBEREG)
survey_type
# A tibble: 81 × 3
# Groups:   LOCALE_MOD [10]
   LOCALE_MOD     OBEREG              n
   <chr>          <chr>           <int>
 1 City, Large    Far West           31
 2 City, Large    Great Lakes        10
 3 City, Large    Mid East           11
 4 City, Large    New England         1
 5 City, Large    Plains              6
 6 City, Large    Rocky Mountains     3
 7 City, Large    South East         12
 8 City, Large    South West         20
 9 City, Mid-size Far West           31
10 City, Mid-size Great Lakes        16
# ℹ 71 more rows

Exploration visualizations

ggplot(data = survey_na) +
  geom_bar(mapping = aes(x = OBEREG)) # geom_bar() counts rows itself, so no y aesthetic is needed

ggplot(data = survey_na) + stat_summary(
  mapping = aes(x = OBEREG, y = MASTER),
  fun.min = min,
  fun.max = max,
  fun = median)
Warning: Removed 28 rows containing non-finite values (`stat_summary()`).

ggplot(data = survey_na) + 
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD), position = "stack") + # fill colors the bars by locale
  scale_fill_brewer(palette = "Set3") + # the palette is chosen here, not inside aes()
  scale_x_discrete(labels = function(x) str_wrap(x, width = 7))

ggplot(data = survey_na, mapping = aes(x = OBEREG, colour = LOCALE_MOD)) +
  geom_bar(fill = NA, position = "stack", color = "Black") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5))

Final Visualizations

Visualization must include the following components:

  • Meaningful labels for axes
  • A title
  • At least 2 colors for distinguishing groups
  • Some sort of legend to make sense of colors, shapes, and sizes that describe any variables
ggplot(data = survey_na) + # data
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD), position = "stack") + # main data plot
  labs(title = "Locations of Libraries by Locale Mode & Region\n", # labels for graph
       x = "\nRegion of the United States",
       y = "# of Libraries by Locale Mode\n",
       caption = "New England (CT ME MA NH RI VT) // Mid East (DE DC MD NJ NY PA) // Great Lakes (IL IN MI OH WI) // Plains (IA KS MN MO NE ND SD)
       Southeast (AL AR FL GA KY LA MS NC SC TN VA WV) // Southwest (AZ NM OK TX) // Rocky Mountains (CO ID MT UT WY)
       Far West (AK CA HI NV OR WA) // Outlying Areas (AS GU MP PR VI)") +
  theme(plot.caption = element_text(size = 6)) + 
  scale_fill_brewer(name = "Locale Mode", palette = "Set3") + # coloring in the graph
  scale_x_discrete(labels = function(x) str_wrap(x, width = 7)) # narrows text space for x-axis

Short Essay

  • The source and topic of the data, any variables included, what kind of variables they are, how you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate). This part of the essay should be embedded AT THE BEGINNING OF THE MARKDOWN FILE, BEFORE YOU LOAD THE DATA.

  • (Parts b and c of the essay should be placed at the end of the document) What the visualization represents, any interesting patterns or surprises that arise within the visualization.

  • Anything that you might have shown that you could not get to work or that you wished you could have included.

Essay

My visualization came about because I’m interested in how libraries are spread apart and where they are located. I’ve worked in a variety of libraries: main branch for a large city, main branch for a smaller city, a fringe neighborhood single-building library. Seeing how different these communities are first-hand has made me wonder about the variety of locations of libraries and how that impacts our work.

Additionally, many conversations about libraries seem aimed at one extreme or the other: a heavy focus on the digital divide for rural libraries, or democratic use of programming space for more urban ones.

But the reality is that we have a lot of libraries that don’t fall into either of those spaces and instead occupy a middle ground. That’s why I wanted this visualization: to see where libraries fall in their “locale mode” and what the spread of those modes is across the United States’ regions.

Note: While studying my visual, I wished I had the states in each region listed, so I added a caption, though it is not the best look.

With the added caption, I can read my visual in a more practical way. For example, Plains and Mid East have similar totals but very different spreads of locale mode. Mid East leans heavily on suburban libraries, while Plains leans on “Rural, Remote”, “Rural, Distant”, and “Town, Remote”. Libraries in the Plains tend to sit farther from population centers and serve smaller populations, while Mid East libraries sit close to suburban communities with higher populations.

The Great Lakes having the tallest bar was really surprising to me. Its states appear to have a lot of libraries. Considering how often the populations of Texas and California come up, it surprises me that their regions don’t have higher library counts. I expected the Far West, with California, to have a much taller bar (it has 517 libraries in this file, compared with 1,887 for the Great Lakes).

I’d be interested to do further visualizations of each region with their own state breakdown and see who contributes to what locale mode. I’d also be interested to bring this visualization together with one that shows where librarians with Master’s degrees are located. Are they showing up in rural libraries or are they huddled around suburbs and towns?
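
A first pass at the Master's degree question might look something like the sketch below. It is only a rough idea on invented data: `toy_staff` and the summary column names are hypothetical, and `share_with_any` is just one possible way to ask whether a locale has degreed librarians at all.

```r
library(dplyr)

# Toy data: locale label and count of librarians with Master's degrees (invented values)
toy_staff <- data.frame(
  LOCALE_MOD = c("Rural, Remote", "Rural, Remote", "Suburb, Large",
                 "Suburb, Large", "City, Large"),
  MASTER     = c(0, 1, 4, NA, 20.5)
)

# Per locale: how many libraries, the average count, and the share with at least one
masters_by_locale <- toy_staff %>%
  group_by(LOCALE_MOD) %>%
  summarise(
    libraries      = n(),
    mean_masters   = mean(MASTER, na.rm = TRUE),
    share_with_any = mean(MASTER > 0, na.rm = TRUE)
  )
masters_by_locale
```

The mean alone can be pulled upward by a few large systems, so the share of libraries with any degreed librarian may say more about rural coverage.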

This is one spot where I can see the argument for removing some data: “Outlying Areas” covers U.S. territories and has only four libraries in this file. And yet seeing the smallest bar (a line, really) also suggests that there’s not enough investment in libraries in these spaces.

For next time:

  • perhaps create my own palette so that the similar locale modes share similar colors.
  • add a layer showing average number of libraries for a state by region
  • create some sort of parallel graph showing library community population
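
For the palette idea in the list above, one possible sketch is a named vector passed to `scale_fill_manual()`, with one hue per locale family and lighter shades for smaller or more remote places. The name `locale_palette`, the hex colors, and the toy data are my own assumptions; the names must match the locale labels used in the data exactly, and any names not present in the data are simply unused.

```r
library(ggplot2)

# Hypothetical palette: blues for cities, purples for suburbs,
# oranges for towns, greens for rural areas
locale_palette <- c(
  "City, Large"   = "#08519c", "City, Mid-size"   = "#3182bd", "City, Small"   = "#9ecae1",
  "Suburb, Large" = "#54278f", "Suburb, Mid-size" = "#756bb1", "Suburb, Small" = "#bcbddc",
  "Town, Fringe"  = "#a63603", "Town, Distant"    = "#e6550d", "Town, Remote"  = "#fdae6b",
  "Rural, Fringe" = "#006d2c", "Rural, Distant"   = "#31a354", "Rural, Remote" = "#a1d99b"
)

# Toy data standing in for survey_na (invented rows)
toy_libs <- data.frame(
  OBEREG     = c("Plains", "Plains", "Mid East", "Mid East"),
  LOCALE_MOD = c("Rural, Remote", "Town, Remote", "Suburb, Large", "City, Large")
)

# Same stacked bar idea as the final plot, with the manual palette swapped in
palette_plot <- ggplot(toy_libs) +
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD)) +
  scale_fill_manual(name = "Locale Mode", values = locale_palette)
```

Because the vector is named, each label keeps its color even when a region has no libraries in some locale.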