# note: dataset should have both quantitative and categorical variables
# note: it should have at least 4 variables
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
survey_org <- read_csv("data/pls-fy21-admin.csv")
Rows: 9215 Columns: 195
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (94): STABR, FSCSKEY, LIBID, LIBNAME, ADDRESS, CITY, ZIP4, ADDRES_M, CI...
dbl (101): ZIP, ZIP_M, PHONE, POPU_LSA, POPU_UND, CENTLIB, BRANLIB, BKMOB, M...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Introduction to Dataset
This dataset comes from the Public Libraries Survey, conducted by the Institute of Museum and Library Services (IMLS). The survey has been conducted in one form or another since 1988; responsibility for it moved to IMLS in 2007.
The data is collected from approx. 9,000 public libraries with approx. 17,000 individual structures, such as the main library, the neighborhood library branches, and bookmobiles.
The data contains:
contact information about each library structure (state, address, phone number, etc.),
the number of library visits for each library structure (including whether it was closed because of COVID),
circulation information (number of checkouts by group, reference visits, etc.),
collection information (types of materials owned and amount of each, etc.),
staffing details (staff with Master’s degrees, salaries, etc.),
and much more.
There are almost 200 variables that provide lots of identifying details for each library structure. However, we will be cutting the list down to about 10 variables or fewer.
Load the Data
Potential questions to explore and the variables needed:
average number of library visits for a state (or group of states) and where individual library structures fall on that average
scatterplot for individual actual library visits and line showing average number of visits
variables needed: state code, library name, visits, economic analysis code
salaries based on population served compared to state average
scatterplot of individual library salaries against population served, with a line showing the state average
variables needed: state code, library name, salaries, population served, economic analysis code
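The rendered output does not show the code that trims the survey to the listed variables or creates the `survey_na` object used in later chunks. Here is a minimal sketch of that kind of step on a toy tibble. It assumes that negative values such as -1 and -3 stand for missing data and that regions are stored as two-digit codes; the names `survey_toy`, `survey_na_toy`, `region_labels`, and `EXTRA` are hypothetical, and the toy values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the raw survey file; values are invented for illustration.
survey_toy <- tibble(
  STABR    = c("AK", "AK", "TX"),
  LIBNAME  = c("LIBRARY A", "LIBRARY B", "LIBRARY C"),
  POPU_UND = c(2123, 288970, 51000),
  SALARIES = c(-1, 3766214, 250000),  # -1 assumed to mean "missing"
  VISITS   = c(5927, 159427, -3),     # -3 assumed to mean "not applicable"
  OBEREG   = c("08", "08", "06"),     # assumed two-digit region codes
  EXTRA    = c("x", "y", "z")         # stands in for the unused columns
)

# Lookup from region code to label (only the codes present in the toy data)
region_labels <- c("06" = "South West", "08" = "Far West")

survey_na_toy <- survey_toy %>%
  # keep only the variables of interest
  select(STABR, LIBNAME, POPU_UND, SALARIES, VISITS, OBEREG) %>%
  mutate(
    # turn negative placeholder codes in numeric columns into NA
    across(where(is.numeric), ~ ifelse(.x < 0, NA, .x)),
    # replace region codes with readable labels
    OBEREG = unname(region_labels[OBEREG])
  )

survey_na_toy
```

Converting the placeholder codes to `NA` first is what lets the later summaries use `na.rm = TRUE` without the negative codes dragging down sums and means.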
# A tibble: 6 × 10
STABR LIBNAME POPU_UND MASTER SALARIES HRS_OPEN VISITS OBEREG LOCALE_MOD
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 AK ANCHOR POINT… 2123 0 NA 1404 5927 Far W… Rural, Re…
2 AK ANCHORAGE PU… 288970 20.5 3766214 7596 159427 Far W… City, Lar…
3 AK ANDERSON COM… 275 0 NA 420 325 Far W… Rural, Re…
4 AK KUSKOKWIM CO… 6179 1 66813 2040 2100 Far W… Town, Rem…
5 AK BIG LAKE PUB… 6942 1 209076 NA 18592 Far W… Rural, Di…
6 AK CANTWELL COM… 190 0 NA 720 480 Far W… Rural, Re…
# ℹ 1 more variable: CDCODE <dbl>
Explore the Data with at least one data visualization
Some suggestions for visualizations include side-by-side box plots, histograms, bar graphs, scatterplots, treemaps, heatmaps, alluvials, or streamgraphs. The type of data you use will help determine which visualization you should use.
Region Information
01–New England (CT ME MA NH RI VT)
02–Mid East (DE DC MD NJ NY PA)
03–Great Lakes (IL IN MI OH WI)
04–Plains (IA KS MN MO NE ND SD)
05–Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)
06–Southwest (AZ NM OK TX)
07–Rocky Mountains (CO ID MT UT WY)
08–Far West (AK CA HI NV OR WA)
09–Outlying Areas (AS GU MP PR VI)
# table showing various functions of hours open by region
survey_hours <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(
    total_hours_open = sum(HRS_OPEN, na.rm = TRUE),
    mean_hours_open = mean(HRS_OPEN, na.rm = TRUE),
    median_hours_open = median(HRS_OPEN, na.rm = TRUE)
  )
survey_hours
# A tibble: 9 × 4
OBEREG total_hours_open mean_hours_open median_hours_open
<chr> <dbl> <dbl> <dbl>
1 Far West 1503004 3400. 1314
2 Great Lakes 6497710 3465. 2268
3 Mid East 4215818 2839. 2104
4 New England 1433647 1212. 1062
5 Outlying Areas 14278 3570. 3662
6 Plains 3478009 2285. 1660.
7 Rocky Mountains 1040718 2965. 2006
8 South East 6542187 5857. 2567
9 South West 2324739 2877. 1944
# table showing mean of library visits by region
survey_visits <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(mean_visits = mean(VISITS, na.rm = TRUE))
survey_visits
# A tibble: 9 × 2
OBEREG mean_visits
<chr> <dbl>
1 Far West 88669.
2 Great Lakes 48439.
3 Mid East 44760.
4 New England 14879.
5 Outlying Areas 26878.
6 Plains 23752.
7 Rocky Mountains 73201.
8 South East 81147.
9 South West 39805.
# table showing number of libraries by region
survey_region <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(library_sum_per_region = n())
survey_region
# A tibble: 9 × 2
OBEREG library_sum_per_region
<chr> <int>
1 Far West 517
2 Great Lakes 1887
3 Mid East 1545
4 New England 1270
5 Outlying Areas 4
6 Plains 1583
7 Rocky Mountains 392
8 South East 1164
9 South West 853
# table showing info on librarians with Master's degrees by region
survey_masterlib <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(masters_librarians = mean(MASTER, na.rm = TRUE))
survey_masterlib
# A tibble: 9 × 2
OBEREG masters_librarians
<chr> <dbl>
1 Far West 9.59
2 Great Lakes 3.88
3 Mid East 4.54
4 New England 2.21
5 Outlying Areas 1.5
6 Plains 1.05
7 Rocky Mountains 2.93
8 South East 5.55
9 South West 3.10
# table showing population data by region
survey_population <- survey_na %>%
  group_by(OBEREG) %>%
  summarise(region_avg_population = mean(POPU_UND, na.rm = TRUE))
survey_population
# A tibble: 9 × 2
OBEREG region_avg_population
<chr> <dbl>
1 Far West 117576.
2 Great Lakes 24080.
3 Mid East 31399.
4 New England 11683.
5 Outlying Areas 93325
6 Plains 12530.
7 Rocky Mountains 30549.
8 South East 73051.
9 South West 45454.
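The next table counts libraries for each combination of locale mode and region, but the code that produced it is not shown in the rendered output. Here is a minimal sketch of one way to get a table of that shape (grouped by `LOCALE_MOD`, one row per locale/region pair), shown on invented toy data; `toy_locales` and `locale_by_region` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for survey_na; rows are invented for illustration.
toy_locales <- tibble(
  LOCALE_MOD = c("City, Large", "City, Large", "Rural, Remote",
                 "Rural, Remote", "Rural, Remote"),
  OBEREG     = c("Far West", "Far West", "Far West", "Plains", "Plains")
)

# Count libraries within each locale mode, split by region.
# Grouping first keeps the result grouped by LOCALE_MOD.
locale_by_region <- toy_locales %>%
  group_by(LOCALE_MOD) %>%
  count(OBEREG)

locale_by_region
```

These are the same counts that `geom_bar()` computes internally when it stacks `LOCALE_MOD` fills within each `OBEREG` bar, so a table like this is a useful check on the plot.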
# A tibble: 81 × 3
# Groups: LOCALE_MOD [10]
LOCALE_MOD OBEREG n
<chr> <chr> <int>
1 City, Large Far West 31
2 City, Large Great Lakes 10
3 City, Large Mid East 11
4 City, Large New England 1
5 City, Large Plains 6
6 City, Large Rocky Mountains 3
7 City, Large South East 12
8 City, Large South West 20
9 City, Mid-size Far West 31
10 City, Mid-size Great Lakes 16
# ℹ 71 more rows
Visualization must include the following components:
Meaningful labels for axes
A title
At least 2 colors for distinguishing groups
Some sort of legend to make sense of colors, shapes, and sizes that describe any variables
ggplot(data = survey_na) + # data
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD, color = "Set3"), position = "stack") + # main data plot
  labs(
    title = "Locations of Libraries by Locale Mode & Region\n", # labels for graph
    x = "\nRegion of the United States",
    y = "# of Libraries by Locale Mode\n",
    caption = "New England (CT ME MA NH RI VT) // Mid East (DE DC MD NJ NY PA) // Great Lakes (IL IN MI OH WI) // Plains (IA KS MN MO NE ND SD) Southeast (AL AR FL GA KY LA MS NC SC TN VA WV) // Southwest (AZ NM OK TX) // Rocky Mountains (CO ID MT UT WY) Far West (AK CA HI NV OR WA) // Outlying Areas (AS GU MP PR VI)"
  ) +
  theme(plot.caption = element_text(size = 6)) +
  scale_fill_brewer(name = "Locale Mode", palette = "Set3") + # coloring in the graph
  guides(colour = FALSE) + # removes legend for outline
  scale_x_discrete(labels = function(x) str_wrap(x, width = 7)) # narrows text space for x-axis
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
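The warning above comes from `guides(colour = FALSE)`. A related detail is that `color = "Set3"` inside `aes()` does not select a palette: it maps the outline colour to a constant string, which is what creates the extra legend in the first place. Here is a minimal sketch of two ways to avoid the warning, using invented toy data rather than the survey; `toy_plot_data`, `p1`, and `p2` are hypothetical names, and `"grey30"` is an arbitrary outline colour.

```r
library(ggplot2)

# Toy stand-in for survey_na; rows are invented for illustration.
toy_plot_data <- data.frame(
  OBEREG     = c("Plains", "Plains", "Mid East", "Mid East", "Mid East"),
  LOCALE_MOD = c("Rural, Remote", "Town, Remote", "Suburb, Large",
                 "Suburb, Large", "City, Large")
)

# Option 1: keep the mapped outline, but hide its legend with "none"
p1 <- ggplot(toy_plot_data) +
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD, colour = "outline")) +
  scale_fill_brewer(name = "Locale Mode", palette = "Set3") +
  guides(colour = "none")

# Option 2: set a constant outline outside aes(), so no legend is created
p2 <- ggplot(toy_plot_data) +
  geom_bar(aes(x = OBEREG, fill = LOCALE_MOD), colour = "grey30") +
  scale_fill_brewer(name = "Locale Mode", palette = "Set3")
```

Option 2 is usually the simpler choice when the outline is purely decorative, since there is no legend to suppress.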
Short Essay
The source and topic of the data, any variables included, what kind of variables they are, how you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate). This part of the essay should be embedded AT THE BEGINNING OF THE MARKDOWN FILE, BEFORE YOU LOAD THE DATA.
(Parts b and c of the essay should be placed at the end of the document) What the visualization represents, any interesting patterns or surprises that arise within the visualization.
Anything that you might have shown that you could not get to work or that you wished you could have included.
Essay
My visualization came about because I’m interested in how libraries are spread out and where they are located. I’ve worked in a variety of libraries: the main branch for a large city, the main branch for a smaller city, and a single-building library in a fringe neighborhood. Seeing firsthand how different these communities are has made me wonder about the variety of library locations and how that affects our work.
Additionally, many conversations about libraries seem aimed at one extreme or the other: a heavy focus on the digital divide for rural libraries, or democratic use of programming space for more urban ones.
But the reality is that a lot of libraries don’t fall into either of those categories and instead occupy a middle ground. That’s why I wanted this visualization: to see where libraries fall in their “locale mode” and what the spread of those modes is across the regions of the United States.
note: As I studied my visual, I wished I had the states listed for each region, so I added a caption, but it’s really not the best look.
With the added caption, I can read the visual in more practical terms. For example, Plains and Mid East have similar totals but very different spreads of locale mode. Mid East has a huge emphasis on “Suburb, Large” while Plains has an emphasis on “Rural, Remote”, “Rural, Distant”, and “Town, Remote”. Plains libraries tend to sit farther from the communities they serve and to serve smaller populations, whereas Mid East libraries are situated near their higher-population suburban communities.
The Great Lakes having the tallest bar was really surprising to me. Its states appear to have high numbers of libraries. Considering there’s always discussion about the populations of Texas and California, it surprises me that their regions don’t have higher library counts. I expected the Far West, with California, to have a much higher bar.
I’d be interested to do further visualizations of each region with their own state breakdown and see who contributes to what locale mode. I’d also be interested to bring this visualization together with one that shows where librarians with Master’s degrees are located. Are they showing up in rural libraries or are they huddled around suburbs and towns?
This is one spot where I can see the argument for removing some data, “Outlying Areas”, because it’s mostly data on territories that the United States owns. And yet, seeing the smallest bar - a line really - also shows that there’s not enough investment in libraries in these spaces.
For next time:
perhaps create my own palette so that the similar locale modes share similar colors.
add a layer showing average number of libraries for a state by region
create some sort of parallel graph showing library community population
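As a starting point for the custom palette idea above, here is a minimal sketch that gives each locale family (City, Suburb, Town, Rural) its own hue, with shades varying inside the family. It assumes the RColorBrewer package is installed. Six of the locale labels appear in this document's output; the other six spellings, such as "City, Small" and "Town, Fringe", are assumptions. `locale_palette` and `locale_fill_scale` are hypothetical names.

```r
library(ggplot2)

# One sequential ColorBrewer ramp per locale family.
# brewer.pal() returns light-to-dark shades; rev() puts the darkest first,
# so "Large" gets the darkest shade within City and Suburb.
locale_palette <- c(
  setNames(rev(RColorBrewer::brewer.pal(3, "Blues")),
           c("City, Large", "City, Mid-size", "City, Small")),
  setNames(rev(RColorBrewer::brewer.pal(3, "Purples")),
           c("Suburb, Large", "Suburb, Mid-size", "Suburb, Small")),
  # For Town and Rural, shades darken from Fringe to Remote
  setNames(RColorBrewer::brewer.pal(3, "Oranges"),
           c("Town, Fringe", "Town, Distant", "Town, Remote")),
  setNames(RColorBrewer::brewer.pal(3, "Greens"),
           c("Rural, Fringe", "Rural, Distant", "Rural, Remote"))
)

# Possible replacement for scale_fill_brewer(..., palette = "Set3") in the plot
locale_fill_scale <- scale_fill_manual(name = "Locale Mode",
                                       values = locale_palette)
```

Because the palette is a named vector, `scale_fill_manual()` matches colours to locale labels by name, so the colours stay stable even if some locale modes are missing from a filtered subset.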