The kable function is also very useful for getting a sense of what imported data files contain. It renders the entire imported .xslx or .csv files into a table within RMarkdown. However this dataset has 2,930 rows, so that can be a bit onerous. We can shorten it for now, showing only the first 25 rows, using [1:25, ].
kable(railroad_2012_clean_county [1:25, ], caption = '2012 US Railway Employee Data')
2012 US Railway Employee Data
| AE |
APO |
2 |
| AK |
ANCHORAGE |
7 |
| AK |
FAIRBANKS NORTH STAR |
2 |
| AK |
JUNEAU |
3 |
| AK |
MATANUSKA-SUSITNA |
2 |
| AK |
SITKA |
1 |
| AK |
SKAGWAY MUNICIPALITY |
88 |
| AL |
AUTAUGA |
102 |
| AL |
BALDWIN |
143 |
| AL |
BARBOUR |
1 |
| AL |
BIBB |
25 |
| AL |
BLOUNT |
154 |
| AL |
BULLOCK |
13 |
| AL |
BUTLER |
29 |
| AL |
CALHOUN |
45 |
| AL |
CHAMBERS |
13 |
| AL |
CHEROKEE |
9 |
| AL |
CHILTON |
72 |
| AL |
CHOCTAW |
7 |
| AL |
CLARKE |
26 |
| AL |
CLAY |
10 |
| AL |
CLEBURNE |
7 |
| AL |
COFFEE |
14 |
| AL |
COLBERT |
199 |
| AL |
CONECUH |
11 |
We can also use the “head” or “dim” functions to ascertain this information, but for our purposes here, “view” is sufficient, so let’s move on.
This dataset is clean, meaning that it doesn’t have missing variables and it does not need to be “tidied” before the information can be manipulated and analyzed. It is also a very simple dataset in terms of its structure. It only has three columns and three variables. It provides information on the number of railroad employees in each county of each state of the US, as well as the District of the Columbia, Armed Forces Pacific, and Armed Forces Europe (which means there are 53 “states” rather than “50”).
The simplest code can nevertheless provide interesting insights, and even some surprises. Let’s start with “arrange.” The “arrange” function allows us to first sort the employee column in ascending order and then secondarily alphabetically by state. Also, by selecting only total employees and state, we remove the “county” column from the results, for what that’s worth (not very much). Overall, this may not be a very useful presentation of the data, and in fact it may be misleading, because it needs to be remembered that these are COUNTY numbers that must still be aggregated to other county numbers in order to reflect state numbers. Nevertheless, we can get an idea of how “select” works when manipulating data.
railroad_2012_clean_county%>%
select(total_employees, state)%>%
arrange(total_employees)
## # A tibble: 2,930 x 2
## total_employees state
## <dbl> <chr>
## 1 1 AK
## 2 1 AL
## 3 1 AL
## 4 1 AP
## 5 1 AR
## 6 1 CA
## 7 1 CO
## 8 1 CO
## 9 1 CO
## 10 1 CO
## # ... with 2,920 more rows
Likewise, we can use “arrange” to find the counties with the most railroad employees per county, but with highest numbrs first. And in this case, I don’t need to use “select” because I will keep all three column in my results, which show that Cook County, IL (Chicago!) is the county that employs the most railroad workers in the United States, almost twice as many as #2, Tarrant County, TX, which I’m guessing is Houston (and I would have never guessed Nebraska for #3. Omaha?). It all makes sense so far.
railroad_2012_clean_county%>%
arrange(desc(total_employees))
## # A tibble: 2,930 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 IL COOK 8207
## 2 TX TARRANT 4235
## 3 NE DOUGLAS 3797
## 4 NY SUFFOLK 3685
## 5 VA INDEPENDENT CITY 3249
## 6 FL DUVAL 3073
## 7 CA SAN BERNARDINO 2888
## 8 CA LOS ANGELES 2545
## 9 TX HARRIS 2535
## 10 NE LINCOLN 2289
## # ... with 2,920 more rows
However, selecting all 3 variables changes the layout, which might be useful if you want certain columns to be next to one another for the sake of readability, etc.
railroad_2012_clean_county%>%
select(total_employees, state, county)%>%
arrange(desc(total_employees))
## # A tibble: 2,930 x 3
## total_employees state county
## <dbl> <chr> <chr>
## 1 8207 IL COOK
## 2 4235 TX TARRANT
## 3 3797 NE DOUGLAS
## 4 3685 NY SUFFOLK
## 5 3249 VA INDEPENDENT CITY
## 6 3073 FL DUVAL
## 7 2888 CA SAN BERNARDINO
## 8 2545 CA LOS ANGELES
## 9 2535 TX HARRIS
## 10 2289 NE LINCOLN
## # ... with 2,920 more rows
Next, let’s use the groub_by, summarise_all, and mean functions to find the average number of employees per state. And the winner is…Delaware! Delaware?
railroad_2012_clean_county %>%
group_by(state) %>%
select(total_employees) %>%
summarise_all(mean, na.rm=TRUE)%>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
## state total_employees
## <chr> <dbl>
## 1 DE 498.
## 2 NJ 397.
## 3 CT 324
## 4 MA 282.
## 5 NY 280.
## 6 DC 279
## 7 CA 239.
## 8 AZ 210.
## 9 PA 196.
## 10 MD 196.
## # ... with 43 more rows
The burning question now is: Which states have the most railway employees? There are other ones too, but let’s first tackle that one. We can get that information by using the group_by, summarise_all, and sum functions to find total employees per state.
If we arrange employees in descending order the top three states are Texas, Illinois and New York, in terms of total employees per state. The bottom three are Hawaii, Alaska, and Vermont (I have omitted Armed Forces Pacific and Armed Forces Europe, which occupy the final two positions).
railroad_2012_clean_county %>%
group_by(state) %>%
select(total_employees) %>%
summarise_all(sum, na.rm=TRUE) %>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
## state total_employees
## <chr> <dbl>
## 1 TX 19839
## 2 IL 19131
## 3 NY 17050
## 4 NE 13176
## 5 CA 13137
## 6 PA 12769
## 7 OH 9056
## 8 GA 8605
## 9 IN 8537
## 10 MO 8419
## # ... with 43 more rows
railroad_2012_clean_county %>%
group_by(state) %>%
select(total_employees) %>%
summarise_all(sum, na.rm=TRUE) %>%
arrange(state)
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
## state total_employees
## <chr> <dbl>
## 1 AE 2
## 2 AK 103
## 3 AL 4257
## 4 AP 1
## 5 AR 3871
## 6 AZ 3153
## 7 CA 13137
## 8 CO 3650
## 9 CT 2592
## 10 DC 279
## # ... with 43 more rows
So, what conclusions can we make from this data? Most simply, we can glean verifiable facts, such as that Texas employs the most railway workers (Gogle it, it’s true). What do we make from the fact that Delaware is #1 for average employees per county but #42 for total employees. Likewise, what do we make from the fact that Texas and Illinois, the two states that employ the most railway employees overall, don’t appear in the top ten states of average employees per county?
Geography: Lots of Land and the Urban/Rural Divide
One observation we can make by looking at the states with the highest average number of employees per county is that the top four are relatively small eastern seaboard states: Delaware, New Jersey, Connecticut, and Massachusetts. Number 5, New York, ranks third in terms of total employees. Also, is flanked by NJ and CT, where many people who live in New York reside.
Delaware and Maryland are Washington D.C. commuter states. They are also very small. D.C., the geographically smallest entity represented here, nevertheless ranks 6th for average employees per county. But it has only one variable for total employees, meaning that every calculation made about it grouped by “state” will return the same value: 279.
Massachusetts is a relatively small state with a major metropolitan area, Boston. So the numbers start to make sense when we eyeball the size of the state and the metropolitan areas its counties may be serving.
So now we are getting a clearer picture of what the narrative of this data might actually be. It’s probably not that Delaware is not just gaga for trains. Rather, I suspect that other factors must also be considered such as state land area, population/population density, and proximity to major metropolitan areas., but rather that its small size, along with a relatively higher population density and its proximity to major metropolitan areas means that it has a greater concentration of railway workers compared to Texas, which not only employs more railway workers than any other state, it has the most miles of track. It’s just that Texas is so big, so all the counties with only 2 or 3 total employees skews the averages.
railroad_2012_clean_county %>%
filter(total_employees > 5) %>%
group_by(state)%>%
select(total_employees) %>%
summarise_all(mean, na.rm=TRUE)%>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 50 x 2
## state total_employees
## <chr> <dbl>
## 1 DE 498.
## 2 NJ 397.
## 3 CT 324
## 4 CA 285.
## 5 NY 284.
## 6 MA 282.
## 7 DC 279
## 8 AZ 225
## 9 MD 224.
## 10 PA 213.
## # ... with 40 more rows
railroad_2012_clean_county %>%
filter(total_employees > 10) %>%
group_by(state)%>%
select(total_employees) %>%
summarise_all(mean, na.rm=TRUE)%>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 50 x 2
## state total_employees
## <chr> <dbl>
## 1 DE 498.
## 2 NJ 397.
## 3 CT 324
## 4 CA 311.
## 5 NY 299.
## 6 MA 282.
## 7 DC 279
## 8 MD 260.
## 9 NE 246.
## 10 AZ 242.
## # ... with 40 more rows
railroad_2012_clean_county %>%
filter(`state` == "TX") %>%
filter(total_employees > 20) %>%
select(total_employees) %>%
summarise_all(mean, na.rm=TRUE)%>%
arrange(desc(total_employees))
## # A tibble: 1 x 1
## total_employees
## <dbl>
## 1 188.
What’s the trainiest state? Another thing I want to try is to compare employee totals to population totals, both on a state level. Starting with states, what would the ranking look like if we compared total employees per state to total population? I’m going to make a simple data frame using the top 10 states in the averages and totals tables (20 states total). No doubt there is a more efficient way to do this, but I can practice vectors, and see if this is a rabbit hole worth descending into.
#The 10 states with the most total employees and the 10 states with the highest average number of employees per county (2012)
US_State <- c("TX", "IL", "NY", "NE", "CA", "PA", "OH", "GA", "IN", "MO", "DE", "NJ", "CT", "MA", "DC", "AZ", "MD")
#Total RR employees in 2012 for the states listed above in 2012
Sum_employees <- c(19839, 19131, 17050, 13176, 13137, 12769, 9056, 8605, 8537, 8419, 1495, 8329, 2592, 3379, 279, 3153, 4709)
#State total population in 2012
State_Pop_2012 <- c(26084481, 12882510, 19572932, 1853303, 37948800, 12767118, 11548923, 9901430, 6537703, 6024367, 915179, 8844942, 3594547, 6663005, 634924, 6554978, 5886992)
#Data frame
data_top10total <- data.frame(US_State, Sum_employees, State_Pop_2012)
kable(data_top10total, caption = 'State railway employees vs general total population')
State railway employees vs general total population
| TX |
19839 |
26084481 |
| IL |
19131 |
12882510 |
| NY |
17050 |
19572932 |
| NE |
13176 |
1853303 |
| CA |
13137 |
37948800 |
| PA |
12769 |
12767118 |
| OH |
9056 |
11548923 |
| GA |
8605 |
9901430 |
| IN |
8537 |
6537703 |
| MO |
8419 |
6024367 |
| DE |
1495 |
915179 |
| NJ |
8329 |
8844942 |
| CT |
2592 |
3594547 |
| MA |
3379 |
6663005 |
| DC |
279 |
634924 |
| AZ |
3153 |
6554978 |
| MD |
4709 |
5886992 |
#Let's find out what percentage of total population are railway employees. Note: There are only 17 rows because three states (NY, CA, PA) appear in both lists.
data_top10total %>%
mutate(employee_prevalence = Sum_employees/State_Pop_2012) %>%
arrange(desc(employee_prevalence))
## US_State Sum_employees State_Pop_2012 employee_prevalence
## 1 NE 13176 1853303 0.0071094689
## 2 DE 1495 915179 0.0016335602
## 3 IL 19131 12882510 0.0014850367
## 4 MO 8419 6024367 0.0013974912
## 5 IN 8537 6537703 0.0013058103
## 6 PA 12769 12767118 0.0010001474
## 7 NJ 8329 8844942 0.0009416681
## 8 NY 17050 19572932 0.0008711010
## 9 GA 8605 9901430 0.0008690664
## 10 MD 4709 5886992 0.0007998992
## 11 OH 9056 11548923 0.0007841424
## 12 TX 19839 26084481 0.0007605672
## 13 CT 2592 3594547 0.0007210923
## 14 MA 3379 6663005 0.0005071285
## 15 AZ 3153 6554978 0.0004810085
## 16 DC 279 634924 0.0004394227
## 17 CA 13137 37948800 0.0003461770
Ok, let’s do the same thing, as above, but for all states. I made a simple excel file to use.
library(readxl)
state_rrworkers_population_2012 <- read_csv("state_rrworkers_population-2012.csv")
## Rows: 51 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): State_abbv, US_state
## dbl (2): Total_rail_employees, Population_2012
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(state_rrworkers_population_2012)
state_rrworkers_population_2012 %>%
select(State_abbv, Total_rail_employees, Population_2012) %>%
mutate(emp_prev = Total_rail_employees/Population_2012) %>%
arrange(desc(emp_prev))
## # A tibble: 51 x 4
## State_abbv Total_rail_employees Population_2012 emp_prev
## <chr> <dbl> <dbl> <dbl>
## 1 NE 13176 1853303 0.00711
## 2 WY 2876 576305 0.00499
## 3 MT 3327 1003783 0.00331
## 4 ND 2204 701176 0.00314
## 5 KS 6092 2885257 0.00211
## 6 WV 3213 1856872 0.00173
## 7 DE 1495 915179 0.00163
## 8 IL 19131 12882510 0.00149
## 9 MO 8419 6024367 0.00140
## 10 MS 3932 2983816 0.00132
## # ... with 41 more rows
Emp_prev stands for employee prevalence; I’m n ot sure about the best term. Also, the numbers in the column aren’t displayed as percentages, but for example, in Nebraska, 0.0071% of the population are railway employees. What’s important is that these are ranked with highest percentages first, and once again, NE takes the #1 spot.
Positions 2 through 6 are occupied by states that we haven’t seen in a top ten before: Wyoming, Montana, ND, KS, and West Virginia. But there in position #7 is Delaware, still going strong. And it ranks higher than Illinois. Oh Delaware. Only three states in the previous T10 list stayed there: Nebraska, Illinois, and Missouri.
Observations of note:
California ranks almost dead last here at #46. Despite its total number of employees (#5), this number is eclipsed to its total population relative to other states. And that makes sense to me. “Trainy” doesn’t come to mind when I think of California.
Texas also took quite a leap down the list, from #1 to #29. New York fell from #3 to #23, Pennsylvania went from #6 to #17, Ohio slumped from #7 to #27, Georgia dove from #8 to #24, and Indiana stumbled from #9 to #13.
Texas and Illinois:
If we look at the case of Texas, we see that it has 221 counties, but 52 of them have fewer than five employees. To put that in perspective, you would need 55 counties at 5 employees per county to reach the DC county average (average, highest, lowest) number of 279. Or, you would need 12 counties to match the number of total employees in Delaware’s smallest county (Sussex; 62).
I’m new at this, but I’m guessing that skews the average. How would our list of average employees per county change if we filtered out counties with less than five employees