challenge2

Author

Jaden Busch

Reading in the Dataset

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

For this challenge, I chose to use the railroad_2012_clean_county.csv dataset. In order to read in the dataset, I ran the command:

dataset <- read.csv("../challenge_datasets/railroad_2012_clean_county.csv")
head(dataset, 25)

   state               county total_employees
1     AE                  APO               2
2     AK            ANCHORAGE               7
3     AK FAIRBANKS NORTH STAR               2
4     AK               JUNEAU               3
5     AK    MATANUSKA-SUSITNA               2
6     AK                SITKA               1
7     AK SKAGWAY MUNICIPALITY              88
8     AL              AUTAUGA             102
9     AL              BALDWIN             143
10    AL              BARBOUR               1
11    AL                 BIBB              25
12    AL               BLOUNT             154
13    AL              BULLOCK              13
14    AL               BUTLER              29
15    AL              CALHOUN              45
16    AL             CHAMBERS              13
17    AL             CHEROKEE               9
18    AL              CHILTON              72
19    AL              CHOCTAW               7
20    AL               CLARKE              26
21    AL                 CLAY              10
22    AL             CLEBURNE               7
23    AL               COFFEE              14
24    AL              COLBERT             199
25    AL              CONECUH              11

Describing the Data

My interpretation is that the data was collected from some of the railroad stations around the US, with the columns corresponding to the location and number of employees at each station. The dataset is composed of 3 columns. The first column state includes the two letter acronym for a state which is stored as a character string. The second column county refers to the county within the corresponding state and is also stored as a character string. The third column total_employees has the total number of employees at the corresponding state/county pair, stored as an integer.

We can view the locations with most and least employees by computing the following:

mean(dataset$total_employees)

[1] 87.17816

sd(dataset$total_employees)

[1] 283.6359

dataset[which.min(dataset$total_employees), ] # minimum

  state county total_employees
6    AK  SITKA               1

dataset[which.max(dataset$total_employees), ] # maximum

    state county total_employees
659    IL   COOK            8207

We see that the mean number of employees is ~87.18 with a standard deviation of ~283.64. The minimum employees (or one of the tied minimums) is 1 in Sitka county in Alaska, and the maximum employees is 8207 in Cook county in Illinois.

We can also compute the number of times that each state appears in the dataframe:

state_counts <- table(dataset$state)
state_counts


 AE  AK  AL  AP  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
  1   6  67   1  72  15  55  57   8   1   3  67 152   3  99  36 103  92  95 119 
 LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR 
 63  12  24  16  78  86 115  78  53  94  49  89  10  21  29  12  61  88  73  33 
 PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
 65   5  46  52  91 221  25  92  14  39  69  53  22

We can see each state end its corresponding number of occurrences in the list. We see that there are two states which only occur a single time (AE and DC), while Texas occurs the most with 221 occurrences. We can also note that there are 53 states listed, so the dataset is also including things like AE or “Armed Forces Europe” as a state rather than the typical 50.

Provide Grouped Summary Statistics

We can view the number of employees that each state has using group_by and summarise. We can also compute some statistics of this grouping’s central tendency and dispersion.

by_state <- dataset %>% group_by(state) %>% summarise(state_employees = sum(total_employees)) %>% arrange(state_employees)
by_state

# A tibble: 53 × 2
   state state_employees
   <chr>           <int>
 1 AP                  1
 2 AE                  2
 3 HI                  4
 4 AK                103
 5 VT                259
 6 DC                279
 7 NH                393
 8 RI                487
 9 ME                654
10 NV                746
# ℹ 43 more rows

mean(by_state$state_employees)

[1] 4819.472

quantile(by_state$state_employees)

   0%   25%   50%   75%  100% 
    1  1917  3379  6092 19839

We can observe that that the state “AP” for Armed Forces Pacific has only a single employee, while Texas narrowly beats out Illinois for having the most total state_employees with 19839. Additionally, by looking at the mean and the quartile breakdown we see that the data gets more spread out as state_employees increases, with the first quartile only encompassing states within a a range of 1916 while the final quartile encapsulates a much greater range of 13747.

I’m also interested in seeing the data for just MA, since that’s where I live. To do this, I can use filter to see just those entries:

filter(dataset, state == "MA")

   state     county total_employees
1     MA BARNSTABLE              44
2     MA  BERKSHIRE              50
3     MA    BRISTOL             232
4     MA      ESSEX             314
5     MA   FRANKLIN             113
6     MA    HAMPDEN             202
7     MA  HAMPSHIRE              68
8     MA  MIDDLESEX             673
9     MA    NORFOLK             386
10    MA   PLYMOUTH             429
11    MA    SUFFOLK             558
12    MA  WORCESTER             310

Or, if I was only interested in seeing which MA counties were included in the dataset, and not interested in their total_employees, then I could use select:

filter(dataset, state == "MA") %>% select(county)

       county
1  BARNSTABLE
2   BERKSHIRE
3     BRISTOL
4       ESSEX
5    FRANKLIN
6     HAMPDEN
7   HAMPSHIRE
8   MIDDLESEX
9     NORFOLK
10   PLYMOUTH
11    SUFFOLK
12  WORCESTER

Looking at the result, I see that my home county of Middlesex is included in the dataset!

Lastly, let’s take a look at the statistics for the top 100 most populous state/county pairs.

top_100 <- arrange(dataset, desc(dataset$total_employees)) %>% head(100)
head(top_100, 10)

   state           county total_employees
1     IL             COOK            8207
2     TX          TARRANT            4235
3     NE          DOUGLAS            3797
4     NY          SUFFOLK            3685
5     VA INDEPENDENT CITY            3249
6     FL            DUVAL            3073
7     CA   SAN BERNARDINO            2888
8     CA      LOS ANGELES            2545
9     TX           HARRIS            2535
10    NE          LINCOLN            2289

loc_freq <- group_by(top_100, state) %>% summarise(num_locations = n()) %>% arrange(num_locations)
loc_freq

# A tibble: 35 × 2
   state num_locations
   <chr>         <int>
 1 AL                1
 2 AR                1
 3 DE                1
 4 IA                1
 5 ID                1
 6 KY                1
 7 LA                1
 8 MI                1
 9 OR                1
10 TN                1
# ℹ 25 more rows

quantile(loc_freq$num_locations)

  0%  25%  50%  75% 100% 
 1.0  1.0  2.0  3.5  9.0

We can see that the top 10 most populous state-county pairs go from 2289 to 8207 total_employees. Additionally, we observe that there are 35 unique states in the top 100, with many occurring only a single time. New York occurs the most times in the top 100 with 9 occurrences. The quantile breakdown shows us that the data is skewed right, with most states occurring under 3.5 times and only the top 25% occurring all the way up to 9 times.