library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
For this challenge, I chose to use the railroad_2012_clean_county.csv dataset. In order to read in the dataset, I ran the command:
dataset <- read.csv("../challenge_datasets/railroad_2012_clean_county.csv")
head(dataset, 25) state county total_employees
1 AE APO 2
2 AK ANCHORAGE 7
3 AK FAIRBANKS NORTH STAR 2
4 AK JUNEAU 3
5 AK MATANUSKA-SUSITNA 2
6 AK SITKA 1
7 AK SKAGWAY MUNICIPALITY 88
8 AL AUTAUGA 102
9 AL BALDWIN 143
10 AL BARBOUR 1
11 AL BIBB 25
12 AL BLOUNT 154
13 AL BULLOCK 13
14 AL BUTLER 29
15 AL CALHOUN 45
16 AL CHAMBERS 13
17 AL CHEROKEE 9
18 AL CHILTON 72
19 AL CHOCTAW 7
20 AL CLARKE 26
21 AL CLAY 10
22 AL CLEBURNE 7
23 AL COFFEE 14
24 AL COLBERT 199
25 AL CONECUH 11
My interpretation is that the data was collected from some of the railroad stations around the US, with the columns corresponding to the location and number of employees at each station. The dataset is composed of 3 columns. The first column state includes the two letter acronym for a state which is stored as a character string. The second column county refers to the county within the corresponding state and is also stored as a character string. The third column total_employees has the total number of employees at the corresponding state/county pair, stored as an integer.
We can view the locations with most and least employees by computing the following:
mean(dataset$total_employees)[1] 87.17816
sd(dataset$total_employees)[1] 283.6359
dataset[which.min(dataset$total_employees), ] # minimum state county total_employees
6 AK SITKA 1
dataset[which.max(dataset$total_employees), ] # maximum state county total_employees
659 IL COOK 8207
We see that the mean number of employees is ~87.18 with a standard deviation of ~283.64. The minimum employees (or one of the tied minimums) is 1 in Sitka county in Alaska, and the maximum employees is 8207 in Cook county in Illinois.
We can also compute the number of times that each state appears in the dataframe:
state_counts <- table(dataset$state)
state_counts
AE AK AL AP AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
1 6 67 1 72 15 55 57 8 1 3 67 152 3 99 36 103 92 95 119
LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR
63 12 24 16 78 86 115 78 53 94 49 89 10 21 29 12 61 88 73 33
PA RI SC SD TN TX UT VA VT WA WI WV WY
65 5 46 52 91 221 25 92 14 39 69 53 22
We can see each state end its corresponding number of occurrences in the list. We see that there are two states which only occur a single time (AE and DC), while Texas occurs the most with 221 occurrences. We can also note that there are 53 states listed, so the dataset is also including things like AE or “Armed Forces Europe” as a state rather than the typical 50.
We can view the number of employees that each state has using group_by and summarise. We can also compute some statistics of this grouping’s central tendency and dispersion.
by_state <- dataset %>% group_by(state) %>% summarise(state_employees = sum(total_employees)) %>% arrange(state_employees)
by_state# A tibble: 53 × 2
state state_employees
<chr> <int>
1 AP 1
2 AE 2
3 HI 4
4 AK 103
5 VT 259
6 DC 279
7 NH 393
8 RI 487
9 ME 654
10 NV 746
# ℹ 43 more rows
mean(by_state$state_employees)[1] 4819.472
quantile(by_state$state_employees) 0% 25% 50% 75% 100%
1 1917 3379 6092 19839
We can observe that that the state “AP” for Armed Forces Pacific has only a single employee, while Texas narrowly beats out Illinois for having the most total state_employees with 19839. Additionally, by looking at the mean and the quartile breakdown we see that the data gets more spread out as state_employees increases, with the first quartile only encompassing states within a a range of 1916 while the final quartile encapsulates a much greater range of 13747.
I’m also interested in seeing the data for just MA, since that’s where I live. To do this, I can use filter to see just those entries:
filter(dataset, state == "MA") state county total_employees
1 MA BARNSTABLE 44
2 MA BERKSHIRE 50
3 MA BRISTOL 232
4 MA ESSEX 314
5 MA FRANKLIN 113
6 MA HAMPDEN 202
7 MA HAMPSHIRE 68
8 MA MIDDLESEX 673
9 MA NORFOLK 386
10 MA PLYMOUTH 429
11 MA SUFFOLK 558
12 MA WORCESTER 310
Or, if I was only interested in seeing which MA counties were included in the dataset, and not interested in their total_employees, then I could use select:
filter(dataset, state == "MA") %>% select(county) county
1 BARNSTABLE
2 BERKSHIRE
3 BRISTOL
4 ESSEX
5 FRANKLIN
6 HAMPDEN
7 HAMPSHIRE
8 MIDDLESEX
9 NORFOLK
10 PLYMOUTH
11 SUFFOLK
12 WORCESTER
Looking at the result, I see that my home county of Middlesex is included in the dataset!
Lastly, let’s take a look at the statistics for the top 100 most populous state/county pairs.
top_100 <- arrange(dataset, desc(dataset$total_employees)) %>% head(100)
head(top_100, 10) state county total_employees
1 IL COOK 8207
2 TX TARRANT 4235
3 NE DOUGLAS 3797
4 NY SUFFOLK 3685
5 VA INDEPENDENT CITY 3249
6 FL DUVAL 3073
7 CA SAN BERNARDINO 2888
8 CA LOS ANGELES 2545
9 TX HARRIS 2535
10 NE LINCOLN 2289
loc_freq <- group_by(top_100, state) %>% summarise(num_locations = n()) %>% arrange(num_locations)
loc_freq# A tibble: 35 × 2
state num_locations
<chr> <int>
1 AL 1
2 AR 1
3 DE 1
4 IA 1
5 ID 1
6 KY 1
7 LA 1
8 MI 1
9 OR 1
10 TN 1
# ℹ 25 more rows
quantile(loc_freq$num_locations) 0% 25% 50% 75% 100%
1.0 1.0 2.0 3.5 9.0
We can see that the top 10 most populous state-county pairs go from 2289 to 8207 total_employees. Additionally, we observe that there are 35 unique states in the top 100, with many occurring only a single time. New York occurs the most times in the top 100 with 9 occurrences. The quantile breakdown shows us that the data is skewed right, with most states occurring under 3.5 times and only the top 25% occurring all the way up to 9 times.