These are steps I took to view, analyze, and plot the data.
1. Pull in the dataset as a csv and assign it the dataframe name publiclibrary.
publiclibrary <- read.csv("PLS_FY17_State_pud17i.csv")
2. Get the dimensions of the dataframe.
dim(publiclibrary)
## [1] 54 120
Wow, that is a lot of variables! There are 54 instances (rows) of 120 variables (columns).
3. Look at a summary of the variables in the dataframe.
summary(publiclibrary)
## STABR POPU_LSA F_POPLSA POPU_UND
## AK : 1 Min. : 53883 IF17: 1 Min. : 53883
## AL : 1 1st Qu.: 1425636 R_17:53 1st Qu.: 1357138
## AR : 1 Median : 4080292 Median : 3668491
## AS : 1 Mean : 5963516 Mean : 5811297
## AZ : 1 3rd Qu.: 7096462 3rd Qu.: 6919552
## CA : 1 Max. :39507301 Max. :39507301
## (Other):48
## POPU_ST CENTLIB F_CENLIB BRANLIB F_BRLIB
## Min. : 53883 Min. : 1.0 R_17:54 Min. : 1.0 R_17:54
## 1st Qu.: 1359070 1st Qu.: 56.0 1st Qu.: 28.5
## Median : 3999956 Median : 98.0 Median : 85.5
## Mean : 5971812 Mean :167.4 Mean :142.5
## 3rd Qu.: 6919608 3rd Qu.:232.5 3rd Qu.:188.5
## Max. :39523613 Max. :755.0 Max. :959.0
##
## BKMOB F_BKMOB MASTER F_MASTER LIBRARIA
## Min. : 0.00 R_17:54 Min. : 0.0 IF17:23 Min. : 0.0
## 1st Qu.: 3.00 1st Qu.: 122.0 R_17:31 1st Qu.: 232.1
## Median : 8.00 Median : 308.9 Median : 621.9
## Mean :12.44 Mean : 623.5 Mean : 920.0
## 3rd Qu.:16.50 3rd Qu.: 766.2 3rd Qu.:1212.7
## Max. :74.00 Max. :3668.8 Max. :4434.3
##
## F_LIBRAR OTHPAID F_OTHSTF TOTSTAFF F_TOTSTF
## IF17:21 Min. : 9 IF17:22 Min. : 10.0 IF17:21
## R_17:33 1st Qu.: 355 R_17:32 1st Qu.: 652.9 R_17:33
## Median :1049 Median : 1832.8
## Mean :1717 Mean : 2636.6
## 3rd Qu.:2259 3rd Qu.: 3354.5
## Max. :8645 Max. :12822.5
##
## LOCGVT F_LOCGVT STGVT F_STGVT
## Min. :0.000e+00 IF17:23 Min. : 0 IF17:22
## 1st Qu.:3.673e+07 R_17:31 1st Qu.: 600288 R_17:32
## Median :1.228e+08 Median : 3987370
## Mean :2.109e+08 Mean : 16369668
## 3rd Qu.:2.488e+08 3rd Qu.: 13527890
## Max. :1.466e+09 Max. :378391351
##
## FEDGVT F_FEDGVT OTHINCM F_OTHINC
## Min. : 0 IF17:24 Min. : 0 IF17:24
## 1st Qu.: 100846 R_17:30 1st Qu.: 3245873 R_17:30
## Median : 488692 Median : 7804848
## Mean : 874573 Mean : 17358687
## 3rd Qu.:1187377 3rd Qu.: 21076727
## Max. :7245771 Max. :185825422
##
## TOTINCM F_TOTINC SALARIES F_SALX
## Min. :4.080e+05 IF17:24 Min. : 190298 IF17:25
## 1st Qu.:5.078e+07 R_17:30 1st Qu.: 25405734 R_17:29
## Median :1.358e+08 Median : 64473930
## Mean :2.455e+08 Mean :111821686
## 3rd Qu.:2.973e+08 3rd Qu.:149094417
## Max. :1.557e+09 Max. :655508574
##
## BENEFIT F_BENX STAFFEXP F_TOSTFX
## Min. : 0 IF17:26 Min. : 209072 IF17:26
## 1st Qu.: 8235946 R_17:28 1st Qu.: 33077968 R_17:28
## Median : 21602092 Median : 86076022
## Mean : 41358633 Mean :153180320
## 3rd Qu.: 47853656 3rd Qu.:198609555
## Max. :314906940 Max. :945490515
##
## PRMATEXP F_PRMATX ELMATEXP F_ELMATX
## Min. : 783 IF17:25 Min. : 0 IF17:29
## 1st Qu.: 2955908 R_17:29 1st Qu.: 995829 R_17:25
## Median : 9404856 Median : 3987152
## Mean :13957847 Mean : 6926839
## 3rd Qu.:18360410 3rd Qu.: 8764068
## Max. :73298018 Max. :40301843
##
## OTHMATEX F_OTMATX TOTEXPCO F_TOCOLX
## Min. : 0 IF17:27 Min. : 783 IF17:26
## 1st Qu.: 609498 R_17:27 1st Qu.: 4314047 R_17:28
## Median : 2933638 Median : 16328964
## Mean : 4593255 Mean : 25477941
## 3rd Qu.: 6338296 3rd Qu.: 35361758
## Max. :29715225 Max. :134252461
##
## OTHOPEXP F_OTHOPX TOTOPEXP F_TOTOPX
## Min. : 4980 IF17:25 Min. :3.540e+05 IF17:26
## 1st Qu.: 10214168 R_17:29 1st Qu.:4.873e+07 R_17:28
## Median : 26288912 Median :1.300e+08
## Mean : 50929753 Mean :2.296e+08
## 3rd Qu.: 58241499 3rd Qu.:2.894e+08
## Max. :408888807 Max. :1.489e+09
##
## LCAP_REV F_LCAPRV SCAP_REV F_SCAPRV
## Min. : 0 IF17:25 Min. : 0 IF17:26
## 1st Qu.: 627904 R_17:29 1st Qu.: 1696 R_17:28
## Median : 5150793 Median : 86662
## Mean :13771209 Mean : 1371072
## 3rd Qu.:20070256 3rd Qu.: 850390
## Max. :78141200 Max. :11137312
##
## FCAP_REV F_FCAPRV OCAP_REV F_OCAPRV
## Min. : 0 IF17:26 Min. : 0 IF17:27
## 1st Qu.: 0 R_17:28 1st Qu.: 214365 R_17:27
## Median : 5748 Median : 1111219
## Mean : 230634 Mean : 5545490
## 3rd Qu.: 99918 3rd Qu.: 3350573
## Max. :4615950 Max. :86414059
##
## CAP_REV F_TCAPRV CAPITAL F_TCAPX
## Min. : 0 IF17:27 Min. : 0 IF17:26
## 1st Qu.: 3324510 R_17:27 1st Qu.: 4548049 R_17:28
## Median : 11062230 Median : 13116358
## Mean : 20918405 Mean : 23090344
## 3rd Qu.: 26687619 3rd Qu.: 34128145
## Max. :115049111 Max. :128773264
##
## BKVOL EBOOK F_EBOOK AUDIO_PH
## Min. : 43208 Min. : 200 IA17: 1 Min. : 125
## 1st Qu.: 3971801 1st Qu.: 964119 IF17:21 1st Qu.: 198587
## Median : 8822316 Median : 3627748 R_17:32 Median : 509842
## Mean :13264512 Mean : 8612249 Mean : 808678
## 3rd Qu.:16090848 3rd Qu.: 8578765 3rd Qu.: 947486
## Max. :68128577 Max. :81436463 Max. :3533454
##
## F_AUD_PH AUDIO_DL F_AUD_DL VIDEO_PH F_VID_PH
## IF17:28 Min. : 0 IA17: 1 Min. : 6 IF17:27
## R_17:26 1st Qu.: 794436 IF17:26 1st Qu.: 348534 R_17:27
## Median : 1509235 R_17:27 Median : 806348
## Mean : 4804103 Mean :1224140
## 3rd Qu.: 5108128 3rd Qu.:1381205
## Max. :46995879 Max. :5487048
##
## VIDEO_DL F_VID_DL EC_LO_OT F_EC_L_O EC_ST
## Min. : 0 IA17: 1 Min. : 0.0 IF17:22 Min. : 0
## 1st Qu.: 19338 IF17:34 1st Qu.: 298.2 R_17:32 1st Qu.: 1115
## Median : 92686 R_17:19 Median : 613.5 Median : 3741
## Mean : 299419 Mean : 1933.6 Mean : 5935
## 3rd Qu.: 271718 3rd Qu.: 1659.0 3rd Qu.: 7302
## Max. :3882019 Max. :15816.0 Max. :33993
##
## F_EC_ST ELECCOLL F_ELECOL SUBSCRIP F_PRSUB
## IF17: 8 Min. : 0 IF17:21 Min. : 0 IF17:27
## R_17:46 1st Qu.: 2225 R_17:33 1st Qu.: 5069 R_17:27
## Median : 4434 Median : 14371
## Mean : 7868 Mean : 29230
## 3rd Qu.:10054 3rd Qu.: 34884
## Max. :36594 Max. :314871
##
## HRS_OPEN F_HRS_OP VISITS F_VISITS
## Min. : 1883 IF17:21 Min. : 67848 IF17:28
## 1st Qu.: 270776 R_17:33 1st Qu.: 5952298 R_17:26
## Median : 501157 Median : 16269185
## Mean : 691125 Mean : 24440815
## 3rd Qu.: 928899 3rd Qu.: 31944010
## Max. :2482431 Max. :151058515
##
## REFERENC F_REFER REGBOR F_REGBOR
## Min. : 418 IA17: 1 Min. : 14929 IF17:24
## 1st Qu.: 755533 IF17:36 1st Qu.: 815876 R_17:30
## Median : 2609746 R_17:17 Median : 2114880
## Mean : 4449827 Mean : 3197103
## 3rd Qu.: 5086460 3rd Qu.: 3802074
## Max. :26644064 Max. :22813619
##
## TOTCIR KIDCIRCL F_KIDCIR ELMATCIR
## Min. : 18155 Min. : 11481 IF17:29 Min. : 0
## 1st Qu.: 6679753 1st Qu.: 2374436 R_17:25 1st Qu.: 754897
## Median : 24675570 Median : 9168036 Median : 2967962
## Mean : 40072886 Mean :14211448 Mean : 4721746
## 3rd Qu.: 55453038 3rd Qu.:21096745 3rd Qu.: 5966432
## Max. :208171464 Max. :85181814 Max. :21842312
##
## PHYSCIR ELINFO ELCONT
## Min. : -1 Min. : -1 Min. : 0
## 1st Qu.: 6008730 1st Qu.: 1487888 1st Qu.: 2746017
## Median : 22552907 Median : 4716450 Median : 7874830
## Mean : 35357630 Mean : 14886733 Mean : 19479759
## 3rd Qu.: 50820808 3rd Qu.: 10657050 3rd Qu.: 18551435
## Max. :186369124 Max. :326882313 Max. :327614571
##
## TOTCOLL LOANTO F_LOANTO LOANFM
## Min. : 0 Min. : 0 IF17:30 Min. : 0
## 1st Qu.: 9686763 1st Qu.: 41038 R_17:24 1st Qu.: 44351
## Median : 28705466 Median : 253296 Median : 264258
## Mean : 53438261 Mean : 1192748 Mean : 1187276
## 3rd Qu.: 65707292 3rd Qu.: 785919 3rd Qu.: 772888
## Max. :332931529 Max. :10494552 Max. :10311967
##
## F_LOANFM TOTPRO F_TOTPRO KIDPRO F_KIDPRO
## IF17:29 Min. : 30 IF17:25 Min. : 20 IF17:24
## R_17:25 1st Qu.: 28246 R_17:29 1st Qu.: 14876 R_17:30
## Median : 67359 Median : 38897
## Mean :100432 Mean : 55088
## 3rd Qu.:119211 3rd Qu.: 69221
## Max. :640908 Max. :270054
##
## YAPRO F_YAPRO TOTATTEN F_TOTATT KIDATTEN
## Min. : 0 IF17:25 Min. : 5562 IF17:23 Min. : 4695
## 1st Qu.: 2954 R_17:29 1st Qu.: 497408 R_17:31 1st Qu.: 349582
## Median : 7306 Median : 1621562 Median :1047597
## Mean : 9832 Mean : 2196522 Mean :1467218
## 3rd Qu.:11670 3rd Qu.: 2856180 3rd Qu.:1944277
## Max. :67178 Max. :11121961 Max. :7687968
##
## F_KIDATT YAATTEN F_YAATT GPTERMS F_GPTERM
## IF17:24 Min. : 0 IF17:25 Min. : 20 IF17:23
## R_17:30 1st Qu.: 35486 R_17:29 1st Qu.: 1407 R_17:31
## Median :112205 Median : 4367
## Mean :151418 Mean : 5576
## 3rd Qu.:195825 3rd Qu.: 7272
## Max. :901005 Max. :23489
##
## PITUSR F_PITUSR WIFISESS STARTDAT
## Min. : 6889 IF17:34 Min. : 1432 01/01/2016: 4
## 1st Qu.: 982062 R_17:20 1st Qu.: 636738 01/01/2017:13
## Median : 3448202 Median : 3132385 02/01/2016: 1
## Mean : 4783887 Mean : 6219529 04/01/2016: 2
## 3rd Qu.: 6010960 3rd Qu.: 8042690 07/01/2016:24
## Max. :27219883 Max. :42721358 10/01/2016: 8
## 12/01/2015: 2
## ENDDATE INCITSST YR_SUB OBEREG
## 06/30/2017:24 Min. : 1.00 Min. :2017 Min. :1.000
## 09/30/2017: 9 1st Qu.:17.25 1st Qu.:2017 1st Qu.:3.000
## 10/31/2017: 1 Median :30.50 Median :2017 Median :5.000
## 12/31/2017:20 Mean :30.96 Mean :2017 Mean :4.722
## 3rd Qu.:44.75 3rd Qu.:2017 3rd Qu.:6.750
## Max. :69.00 Max. :2017 Max. :9.000
##
4. Look at the first few records of the dataframe called publiclibrary.
head(publiclibrary)
## STABR POPU_LSA F_POPLSA POPU_UND POPU_ST CENTLIB F_CENLIB BRANLIB
## 1 AK 649636 R_17 649636 737080 79 R_17 16
## 2 AL 4858979 R_17 4858979 4858979 222 R_17 73
## 3 AR 2925255 R_17 2643928 2915918 54 R_17 181
## 4 AS 60300 R_17 60300 60300 1 R_17 1
## 5 AZ 10986792 R_17 6965897 6965897 85 R_17 136
## 6 CA 39507301 R_17 39507301 39523613 165 R_17 959
## F_BRLIB BKMOB F_BKMOB MASTER F_MASTER LIBRARIA F_LIBRAR OTHPAID
## 1 R_17 1 R_17 56.90 R_17 105.31 R_17 195.27
## 2 R_17 15 R_17 334.38 R_17 761.26 R_17 948.56
## 3 R_17 3 R_17 154.02 IF17 327.70 IF17 771.01
## 4 R_17 0 R_17 1.00 R_17 1.00 R_17 9.00
## 5 R_17 11 R_17 463.20 IF17 572.64 IF17 1570.02
## 6 R_17 50 R_17 3178.02 IF17 3405.90 IF17 8645.04
## F_OTHSTF TOTSTAFF F_TOTSTF LOCGVT F_LOCGVT STGVT F_STGVT FEDGVT
## 1 R_17 300.58 R_17 35535907 R_17 967811 R_17 760297
## 2 R_17 1709.82 R_17 95284671 R_17 4289664 R_17 884759
## 3 IF17 1098.71 IF17 71811926 IF17 5093835 IF17 3198
## 4 R_17 10.00 R_17 0 R_17 280252 R_17 116264
## 5 IF17 2142.66 IF17 181981565 IF17 1359870 IF17 1885210
## 6 IF17 12050.94 IF17 1466205141 IF17 16015943 IF17 3450577
## F_FEDGVT OTHINCM F_OTHINC TOTINCM F_TOTINC SALARIES F_SALX
## 1 R_17 1086505 R_17 38350520 R_17 14825147 IF17
## 2 R_17 9561929 R_17 110021023 R_17 56852173 R_17
## 3 IF17 7001346 IF17 83910305 IF17 36971331 IF17
## 4 R_17 11470 R_17 407986 R_17 190298 R_17
## 5 IF17 6529775 IF17 191756420 IF17 79923881 IF17
## 6 IF17 71094198 IF17 1556765859 IF17 630583575 IF17
## BENEFIT F_BENX STAFFEXP F_TOSTFX PRMATEXP F_PRMATX ELMATEXP F_ELMATX
## 1 8817590 IF17 23642737 IF17 1766174 R_17 590787 IF17
## 2 15207301 R_17 72059474 R_17 6236182 R_17 2078897 R_17
## 3 11420197 IF17 48391528 IF17 5218162 IF17 2662117 IF17
## 4 18774 R_17 209072 R_17 2772 R_17 0 R_17
## 5 30807861 IF17 110731742 IF17 15148150 IF17 8132366 IF17
## 6 314906940 IF17 945490515 IF17 73298018 IF17 40301843 IF17
## OTHMATEX F_OTMATX TOTEXPCO F_TOCOLX OTHOPEXP F_OTHOPX TOTOPEXP
## 1 83053 IF17 2440014 R_17 10232067 R_17 36314818
## 2 2156596 R_17 10471675 R_17 20920399 R_17 103451548
## 3 1831950 IF17 9712229 IF17 19591119 IF17 77694876
## 4 11614 R_17 14386 R_17 130493 R_17 353951
## 5 5337908 IF17 28618424 IF17 41277932 IF17 180628098
## 6 20652600 IF17 134252461 IF17 408888807 IF17 1488631783
## F_TOTOPX LCAP_REV F_LCAPRV SCAP_REV F_SCAPRV FCAP_REV F_FCAPRV OCAP_REV
## 1 R_17 88549 R_17 882017 R_17 0 R_17 543731
## 2 R_17 5584461 R_17 0 R_17 92425 R_17 282280
## 3 IF17 25424513 IF17 177084 IF17 0 IF17 2423208
## 4 R_17 0 R_17 0 R_17 0 R_17 0
## 5 IF17 7496725 IF17 40896 IF17 0 IF17 225762
## 6 IF17 68435601 IF17 22454 IF17 1067098 IF17 11962515
## F_OCAPRV CAP_REV F_TCAPRV CAPITAL F_TCAPX BKVOL EBOOK F_EBOOK
## 1 R_17 1514297 R_17 12964531 R_17 2241681 1190695 R_17
## 2 R_17 5959166 R_17 7740222 R_17 8835816 4551754 R_17
## 3 IF17 28024805 IF17 13268186 IF17 6078553 1426355 IF17
## 4 R_17 0 R_17 0 R_17 43208 1265 R_17
## 5 IF17 7763383 IF17 6039281 IF17 7585916 4377983 R_17
## 6 IF17 81487668 IF17 81808077 IF17 62110372 13594513 IF17
## AUDIO_PH F_AUD_PH AUDIO_DL F_AUD_DL VIDEO_PH F_VID_PH VIDEO_DL F_VID_DL
## 1 117588 IF17 899185 R_17 301449 R_17 15039 R_17
## 2 435091 R_17 1610947 IF17 674779 IF17 114098 IF17
## 3 256759 IF17 1488248 IF17 526151 IF17 75191 IF17
## 4 895 R_17 0 R_17 337 R_17 0 R_17
## 5 592002 IF17 4534761 IF17 1049065 IF17 285018 IF17
## 6 3328908 IF17 17262208 IF17 5317349 IF17 723077 IF17
## EC_LO_OT F_EC_L_O EC_ST F_EC_ST ELECCOLL F_ELECOL SUBSCRIP F_PRSUB
## 1 356 R_17 3794 R_17 4150 R_17 4397 R_17
## 2 1077 R_17 13603 R_17 14680 R_17 8543 R_17
## 3 369 IF17 3120 R_17 3489 IF17 9287 IF17
## 4 0 R_17 41 R_17 41 R_17 63 R_17
## 5 646 IF17 2970 R_17 3616 IF17 17410 IF17
## 6 4143 IF17 6 IF17 4149 IF17 86709 IF17
## HRS_OPEN F_HRS_OP VISITS F_VISITS REFERENC F_REFER REGBOR F_REGBOR
## 1 143746 R_17 3403598 IF17 356776 R_17 368512 R_17
## 2 628602 R_17 14583055 R_17 4114709 IF17 2620475 R_17
## 3 322503 IF17 10358181 IF17 2664467 IF17 1665795 IF17
## 4 2129 R_17 67848 R_17 418 R_17 14929 R_17
## 5 486826 IF17 24588717 IF17 5177089 IF17 3238665 IF17
## 6 2386702 IF17 151058515 IF17 18953587 IF17 22813619 IF17
## TOTCIR KIDCIRCL F_KIDCIR ELMATCIR PHYSCIR ELINFO ELCONT
## 1 4725553 1499911 IF17 608351 4117202 213926 822274
## 2 19974984 6768365 R_17 2654129 17320855 8385998 11040127
## 3 13900209 3918856 IF17 1748493 12151716 4420947 6169440
## 4 18155 11481 R_17 58 18097 40459 40517
## 5 45351653 12427033 IF17 7664214 37682262 14881869 22547840
## 6 208171464 85181814 IF17 21842312 186369124 68631591 89364987
## TOTCOLL LOANTO F_LOANTO LOANFM F_LOANFM TOTPRO F_TOTPRO KIDPRO
## 1 4939476 9403 IF17 10008 IF17 13771 R_17 8873
## 2 28360982 386422 R_17 418634 R_17 51354 R_17 25096
## 3 18321156 27872 IF17 26998 IF17 46138 IF17 26202
## 4 58614 0 R_17 0 R_17 282 R_17 282
## 5 59922415 507669 IF17 485720 IF17 85156 IF17 44026
## 6 254659592 3463927 IF17 3068809 IF17 404614 IF17 242357
## F_KIDPRO YAPRO F_YAPRO TOTATTEN F_TOTATT KIDATTEN F_KIDATT YAATTEN
## 1 R_17 1520 R_17 266184 R_17 193013 R_17 20407
## 2 R_17 6647 IF17 1273334 R_17 807789 R_17 116403
## 3 IF17 7295 IF17 1136567 IF17 671586 IF17 112913
## 4 R_17 0 R_17 5562 R_17 5562 R_17 0
## 5 IF17 9453 IF17 1708390 IF17 1142827 IF17 139245
## 6 IF17 40791 IF17 10249635 IF17 7687968 IF17 608684
## F_YAATT GPTERMS F_GPTERM PITUSR F_PITUSR WIFISESS STARTDAT
## 1 R_17 983 R_17 617059 IF17 589331 01/01/2016
## 2 IF17 5056 R_17 3729784 IF17 4087113 10/01/2016
## 3 IF17 2811 IF17 1526940 IF17 1623761 01/01/2017
## 4 R_17 20 R_17 6889 R_17 1432 10/01/2016
## 5 IF17 7364 IF17 5577313 IF17 10813345 07/01/2016
## 6 IF17 23489 IF17 27219883 IF17 22950041 07/01/2016
## ENDDATE INCITSST YR_SUB OBEREG
## 1 06/30/2017 2 2017 8
## 2 09/30/2017 1 2017 5
## 3 12/31/2017 5 2017 5
## 4 09/30/2017 60 2017 9
## 5 06/30/2017 4 2017 6
## 6 06/30/2017 6 2017 8
5. Pull in the tidyverse package to start managing the variables.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
6. Choose variables to explore in a smaller sub-dataframe.
Rename the STABR variable to be STATE.
publiclibrarysub <- publiclibrary %>%
select(STABR, POPU_LSA, CENTLIB, BRANLIB, BKMOB, TOTSTAFF, STAFFEXP, BKVOL, EBOOK, AUDIO_PH, AUDIO_DL, HRS_OPEN, VISITS, REFERENC, REGBOR, TOTCIR, TOTCOLL, GPTERMS, PITUSR, WIFISESS, STARTDAT, ENDDATE, YR_SUB, OBEREG)
publiclibrarysub <- rename(publiclibrarysub, STATE = STABR)
head(publiclibrarysub)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 AK 649636 79 16 1 300.58 23642737 2241681
## 2 AL 4858979 222 73 15 1709.82 72059474 8835816
## 3 AR 2925255 54 181 3 1098.71 48391528 6078553
## 4 AS 60300 1 1 0 10.00 209072 43208
## 5 AZ 10986792 85 136 11 2142.66 110731742 7585916
## 6 CA 39507301 165 959 50 12050.94 945490515 62110372
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 1190695 117588 899185 143746 3403598 356776 368512
## 2 4551754 435091 1610947 628602 14583055 4114709 2620475
## 3 1426355 256759 1488248 322503 10358181 2664467 1665795
## 4 1265 895 0 2129 67848 418 14929
## 5 4377983 592002 4534761 486826 24588717 5177089 3238665
## 6 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 4725553 4939476 983 617059 589331 01/01/2016 06/30/2017
## 2 19974984 28360982 5056 3729784 4087113 10/01/2016 09/30/2017
## 3 13900209 18321156 2811 1526940 1623761 01/01/2017 12/31/2017
## 4 18155 58614 20 6889 1432 10/01/2016 09/30/2017
## 5 45351653 59922415 7364 5577313 10813345 07/01/2016 06/30/2017
## 6 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## YR_SUB OBEREG
## 1 2017 8
## 2 2017 5
## 3 2017 5
## 4 2017 9
## 5 2017 6
## 6 2017 8
That is still a lot of variables but it is getting better!
The variables in this library dataset include mostly continuous variables with 1 categorical variable (OBEREG). The OBEREG variable represents the Bureau of Economic Analysis (BEA) region code to which that State belongs. The US is divided into 8 regions: 01 New England, 02 Mid East, 03 Great Lakes, 04 Plains, 05 South East, 06 South West, 07 Rocky Mountains, and 08 Far West.
Initially, I wanted to create another categorical variable by grouping the number of staff members (full-time equivalent) in ranges of 1,000. I started doing this on step 9 below, grouping them into “few” (less than 1,000 FTEs), “moderate” (1,000-1,999 FTEs), “high” (2,000-2,999 FTEs), and “veryhigh” (3,000+). I used the mutate command to create new columns indicating the staffing level for each State. I did not end up using this information in my final visualization (final viz is step 15).
As for cleaning up the dataset, the variables were all named correctly (no spaces or symbols) and were mostly integers so I did not change them but only renamed the STABR variable to be STATE as that made more sense to me (in step 6 above). There are dates within the dataset but these simply represent the range of 12-months of data which that State reported. The PLS uses 12-months of data but this does not equate to a particular calendar year so these date variables indicate the 12-month range used by that State.
I used select and filter commands to group the records within sub-dataframes. I used the arrange command to sort the records descending.
7. View the dimensions of the sub-dataframe called publiclibrarysub.
dim(publiclibrarysub)
## [1] 54 24
8. Filter out the territories and retain the 51 States in a dataframe called state.
state <- filter(publiclibrarysub, STATE != "AS", STATE != "GU", STATE != "MP")
head(state)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 AK 649636 79 16 1 300.58 23642737 2241681
## 2 AL 4858979 222 73 15 1709.82 72059474 8835816
## 3 AR 2925255 54 181 3 1098.71 48391528 6078553
## 4 AZ 10986792 85 136 11 2142.66 110731742 7585916
## 5 CA 39507301 165 959 50 12050.94 945490515 62110372
## 6 CO 5505446 96 165 13 3252.12 191787325 9517379
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 1190695 117588 899185 143746 3403598 356776 368512
## 2 4551754 435091 1610947 628602 14583055 4114709 2620475
## 3 1426355 256759 1488248 322503 10358181 2664467 1665795
## 4 4377983 592002 4534761 486826 24588717 5177089 3238665
## 5 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## 6 2815454 810860 2277173 637132 31975615 3679129 3624818
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 4725553 4939476 983 617059 589331 01/01/2016 06/30/2017
## 2 19974984 28360982 5056 3729784 4087113 10/01/2016 09/30/2017
## 3 13900209 18321156 2811 1526940 1623761 01/01/2017 12/31/2017
## 4 45351653 59922415 7364 5577313 10813345 07/01/2016 06/30/2017
## 5 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## 6 62438667 73670109 6413 6490654 4004762 01/01/2017 12/31/2017
## YR_SUB OBEREG
## 1 2017 8
## 2 2017 5
## 3 2017 5
## 4 2017 6
## 5 2017 8
## 6 2017 7
8. Plot the population against the total collection circulation for each state.
The labels were showing exponents so I adjusted the scale of the labels to make this more legible.
ggplot(data = state) +
geom_point(mapping = aes(x = POPU_LSA, y = TOTCOLL)) +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma)

It looks like there is a correlation between population and the total collection circulation, which makes sense, although the outlier is weird. I could explore that more fully but will leave it for now.
9. Plot TOTSTAFF to identify how to break that variable into categorical variables. The TOTSTAFF variable represents the number of full-time equivalent (FTE) staff that are employed by that State’s libraries.
ggplot(data = state) +
geom_point(mapping = aes(y = TOTSTAFF, x = STATE))

While the States are not legible, it looks like I can break the TOTSTAFF numerical variable into 4 categories in increments of 1,000: few, moderate, high, veryhigh.
10. Segment the TOTSTAFF variable. Use the arrange command to sort descending by TOTSTAFF (FTEs).
staffinglevels <- state %>%
mutate(few = (TOTSTAFF >0 & TOTSTAFF <1000), moderate = (TOTSTAFF >=1000 & TOTSTAFF <2000), high = (TOTSTAFF >=3000 & TOTSTAFF <4000), veryhigh = (TOTSTAFF >=4000)) %>%
arrange(desc(TOTSTAFF))
#This adds columns at the end of my data to show which states are "true" for few staff, or "true" for moderate staff, etc.
#The arrange command sorts the TOTSTAFF column descending.
topstaffing <- head(staffinglevels, n = 10)
#I decided to focus on the highest 10 staffing States instead, so I added n=10 to limit the dataframe for topstaffing.
head(topstaffing)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 NY 19542753 755 312 11 12822.53 943530658 68128577
## 2 CA 39507301 165 959 50 12050.94 945490515 62110372
## 3 OH 11510467 239 477 56 9110.92 475357681 39581237
## 4 IL 11714654 622 162 17 8758.59 496343452 39946369
## 5 TX 25699219 556 326 11 6916.71 373517642 38410864
## 6 FL 20790357 59 474 21 6340.03 334620281 28147847
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 28445930 3533454 6074003 2482431 100012193 26644064 10397405
## 2 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## 3 81436463 3367638 33758449 1927239 71895854 17735110 8440127
## 4 21830848 2835940 7855094 1921739 66174632 10261005 4995089
## 5 10331800 1954831 5663867 1935320 70521177 13528644 14320123
## 6 5201706 1880107 18368080 1275478 66753349 21779507 11640505
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 127222803 140044443 20973 17700652 15488574 04/01/2016 12/31/2017
## 2 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## 3 170442595 191426990 14305 15131182 19547124 01/01/2017 12/31/2017
## 4 107710331 91514945 15834 11617964 20971054 12/01/2015 06/30/2017
## 5 117304751 220885344 21824 13823532 42721358 02/01/2016 12/31/2017
## 6 99660103 110259833 16813 12529944 10257484 10/01/2016 09/30/2017
## YR_SUB OBEREG few moderate high veryhigh
## 1 2017 2 FALSE FALSE FALSE TRUE
## 2 2017 8 FALSE FALSE FALSE TRUE
## 3 2017 3 FALSE FALSE FALSE TRUE
## 4 2017 3 FALSE FALSE FALSE TRUE
## 5 2017 6 FALSE FALSE FALSE TRUE
## 6 2017 5 FALSE FALSE FALSE TRUE
The States with the highest staffing levels are NY, CA, OH, IL, TX, and Fl.
While the step above makes the numerical staffing total into a new categorical variable, I will hold on further staffing analysis of that for the moment.
11. Let’s looks at State populations. I’ll do this by using the arrange command to sort the dataframe, descending, by the population variable and assign the highest population States to a toppopulation dataframe.
populationlevels <- state %>%
arrange(desc(POPU_LSA))
toppopulation <- head(populationlevels, n = 10)
#I decided to focus on the highest 10 population States so I added n=10 to limit the dataframe for toppopulation.
head(toppopulation)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 CA 39507301 165 959 50 12050.94 945490515 62110372
## 2 TX 25699219 556 326 11 6916.71 373517642 38410864
## 3 FL 20790357 59 474 21 6340.03 334620281 28147847
## 4 NY 19542753 755 312 11 12822.53 943530658 68128577
## 5 PA 12427301 449 172 23 4669.93 245061274 24237863
## 6 IL 11714654 622 162 17 8758.59 496343452 39946369
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## 2 10331800 1954831 5663867 1935320 70521177 13528644 14320123
## 3 5201706 1880107 18368080 1275478 66753349 21779507 11640505
## 4 28445930 3533454 6074003 2482431 100012193 26644064 10397405
## 5 26462269 2115303 46995879 1484972 42251845 7275836 5328788
## 6 21830848 2835940 7855094 1921739 66174632 10261005 4995089
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## 2 117304751 220885344 21824 13823532 42721358 02/01/2016 12/31/2017
## 3 99660103 110259833 16813 12529944 10257484 10/01/2016 09/30/2017
## 4 127222803 140044443 20973 17700652 15488574 04/01/2016 12/31/2017
## 5 61829907 77918664 8127 6072550 8110110 01/01/2017 12/31/2017
## 6 107710331 91514945 15834 11617964 20971054 12/01/2015 06/30/2017
## YR_SUB OBEREG
## 1 2017 8
## 2 2017 6
## 3 2017 5
## 4 2017 2
## 5 2017 2
## 6 2017 3
The States with the highest population levels are CA, TX, FL, NY, PA, and IL.
12. Let’s look at total collection circulation of materials. Similar to toppopulation, I will use the arrange command to sort the dataframe, descending, by the total collection circulation variable and assign the highest collection circulation States to a topcollcirculation dataframe.
circulationlevels <- state %>%
arrange(desc(TOTCOLL))
topcollcirculation <- head(circulationlevels, n = 10)
#I decided to focus on the highest 10 collection circulation States so I added n=10 to limit the dataframe for topcollcirculation.
head(topcollcirculation)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 RI 1424662 47 24 2 645.18 37440495 3461135
## 2 CA 39507301 165 959 50 12050.94 945490515 62110372
## 3 TX 25699219 556 326 11 6916.71 373517642 38410864
## 4 OH 11510467 239 477 56 9110.92 475357681 39581237
## 5 NY 19542753 755 312 11 12822.53 943530658 68128577
## 6 FL 20790357 59 474 21 6340.03 334620281 28147847
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 2593546 152231 877905 159805 5669309 589418 443396
## 2 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## 3 10331800 1954831 5663867 1935320 70521177 13528644 14320123
## 4 81436463 3367638 33758449 1927239 71895854 17735110 8440127
## 5 28445930 3533454 6074003 2482431 100012193 26644064 10397405
## 6 5201706 1880107 18368080 1275478 66753349 21779507 11640505
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 6049216 332931529 1480 972285 426930 07/01/2016 06/30/2017
## 2 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## 3 117304751 220885344 21824 13823532 42721358 02/01/2016 12/31/2017
## 4 170442595 191426990 14305 15131182 19547124 01/01/2017 12/31/2017
## 5 127222803 140044443 20973 17700652 15488574 04/01/2016 12/31/2017
## 6 99660103 110259833 16813 12529944 10257484 10/01/2016 09/30/2017
## YR_SUB OBEREG
## 1 2017 1
## 2 2017 8
## 3 2017 6
## 4 2017 3
## 5 2017 2
## 6 2017 5
The States with the highest collection circulation levels are RI, CA, TX, OH, NY, and FL.
13. Let’s plot the total collection circulation against total population for the top 10 populated States.
popcirplot <- ggplot(toppopulation, aes(x = POPU_LSA,y = TOTCOLL, color = STATE)) +
geom_point()
popcirplot + ggtitle("State Population and Library Total Collection Circulation") + xlab("Population") + ylab("Collection Circulation") + labs(fill = "State") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma)

#Again, I added the scaling to make the axis tick labels display legibly.
This point plot looks ok but let’s keep going.
14. Let’s look at the total registered users for the States. I’ll use the arrange command and sort descending to collect the States with top registered users into a dataframe call topreguser.
registereduserlevels <- state %>%
arrange(desc(REGBOR))
topreguser <- head(registereduserlevels, n = 10)
#I decided to focus on the highest 10 registered user States so I added n=10 to limit the dataframe for topreguser.
head(topreguser)
## STATE POPU_LSA CENTLIB BRANLIB BKMOB TOTSTAFF STAFFEXP BKVOL
## 1 CA 39507301 165 959 50 12050.94 945490515 62110372
## 2 TX 25699219 556 326 11 6916.71 373517642 38410864
## 3 FL 20790357 59 474 21 6340.03 334620281 28147847
## 4 NY 19542753 755 312 11 12822.53 943530658 68128577
## 5 OH 11510467 239 477 56 9110.92 475357681 39581237
## 6 NC 10155942 70 319 20 3059.03 157518816 15526913
## EBOOK AUDIO_PH AUDIO_DL HRS_OPEN VISITS REFERENC REGBOR
## 1 13594513 3328908 17262208 2386702 151058515 18953587 22813619
## 2 10331800 1954831 5663867 1935320 70521177 13528644 14320123
## 3 5201706 1880107 18368080 1275478 66753349 21779507 11640505
## 4 28445930 3533454 6074003 2482431 100012193 26644064 10397405
## 5 81436463 3367638 33758449 1927239 71895854 17735110 8440127
## 6 6418337 666565 1002503 944993 32621293 6979972 5647452
## TOTCIR TOTCOLL GPTERMS PITUSR WIFISESS STARTDAT ENDDATE
## 1 208171464 254659592 23489 27219883 22950041 07/01/2016 06/30/2017
## 2 117304751 220885344 21824 13823532 42721358 02/01/2016 12/31/2017
## 3 99660103 110259833 16813 12529944 10257484 10/01/2016 09/30/2017
## 4 127222803 140044443 20973 17700652 15488574 04/01/2016 12/31/2017
## 5 170442595 191426990 14305 15131182 19547124 01/01/2017 12/31/2017
## 6 49231011 57826436 7611 5938721 4234491 07/01/2016 06/30/2017
## YR_SUB OBEREG
## 1 2017 8
## 2 2017 6
## 3 2017 5
## 4 2017 2
## 5 2017 3
## 6 2017 5
The States with the highest number of registered users are CA, TX, FL, NY, OH, and NC.
15. I want to compare the number of registered users to the State population for the top registered user States. I will do this using geom_point and will add titles, labels, and color.
popregplot <- ggplot(topreguser, aes(x = POPU_LSA, y = REGBOR, color = STATE)) +
geom_point(aes(size = 3))
popregplot + ggtitle("State Population and Number of Registered Users") + xlab("Population") + ylab("Registered Users") +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma)

#Again, I added the scaling to make the axis ticks display legibly.
#I made the bubble size 3 so the points were more visible.
It is not a surprise that as population increases, the number of registered users increases.
This visualization helps us see several aspects of the data. It is clear that there is a correlation between the population of a State and the number of registered public library users, with California having the highest population and highest number of registered users. This visualization shows us that California is far ahead of the 2nd most populated and registered-user State, Texas.
The visualization also shows a grouping of similarly populated States with similar numbers of registered users. Ohio appears to be an anomaly so that is one State that would be great to investigate further. Additional items I would like to explore are the relationship of population to number of libraries in the State and the number of visits compared to the number of computer terminals and wifi sessions by State. As a bonus, I would like to take the PLS data from other years and try combining the datasets to look at change over time.
I tried several commands and visualizations which did not work for me. I could not get the group_by command to work after I mutated the TOTSTAFF variable to indicate the categorical staffing level of each State. My results kept making every staffing level TRUE for every staffing variable option. I had trouble plotting bar graphs which I think related to my choice of x and y variables. Lastly, once I made the point size bigger for the point plot above, I could not figure out how to hide the “size” legend without also hiding the “STATE” legend. I also want to explore adjusting the decimal places on the data as the plot may have been better if I used “5M” instead of “5,000,000”, for example. Overall, I learned a great deal about working in “R” but I also realized it takes me a very long time to teach myself how to make successful plots.