I’ll be looking at a dataset about the number of railroad employees per US county and using some basic data wrangling functions to extract and arrange information. After reading in the file, we can use the view function, which opens a new tab with the table in R Studio and also provides some information within R Markdown. Specifically, it tells us that there are 2,930 rows and 3 columns. Two of the columns have values delimited by character string values (chr) and one column has values delimited by double precision floating point numbers (dbl). It also tells us that the names of the columns are “state,” county," and “total employees.”

railroad_2012_clean_county <- read_csv("railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)

The kable function is also very useful for getting a sense of what imported data files contain. It renders the entire imported .xslx or .csv files into a table within RMarkdown. However this dataset has 2,930 rows, so that can be a bit onerous. We can shorten it for now, showing only the first 25 rows, using [1:25, ].

kable(railroad_2012_clean_county [1:25, ], caption = '2012 US Railway Employee Data')
2012 US Railway Employee Data
state county total_employees
AE APO 2
AK ANCHORAGE 7
AK FAIRBANKS NORTH STAR 2
AK JUNEAU 3
AK MATANUSKA-SUSITNA 2
AK SITKA 1
AK SKAGWAY MUNICIPALITY 88
AL AUTAUGA 102
AL BALDWIN 143
AL BARBOUR 1
AL BIBB 25
AL BLOUNT 154
AL BULLOCK 13
AL BUTLER 29
AL CALHOUN 45
AL CHAMBERS 13
AL CHEROKEE 9
AL CHILTON 72
AL CHOCTAW 7
AL CLARKE 26
AL CLAY 10
AL CLEBURNE 7
AL COFFEE 14
AL COLBERT 199
AL CONECUH 11

We can also use the “head” or “dim” functions to ascertain this information, but for our purposes here, “view” is sufficient, so let’s move on.

This dataset is clean, meaning that it doesn’t have missing variables and it does not need to be “tidied” before the information can be manipulated and analyzed. It is also a very simple dataset in terms of its structure. It only has three columns and three variables. It provides information on the number of railroad employees in each county of each state of the US, as well as the District of the Columbia, Armed Forces Pacific, and Armed Forces Europe (which means there are 53 “states” rather than “50”).

The simplest code can nevertheless provide interesting insights, and even some surprises. Let’s start with “arrange.” The “arrange” function allows us to first sort the employee column in ascending order and then secondarily alphabetically by state. Also, by selecting only total employees and state, we remove the “county” column from the results, for what that’s worth (not very much). Overall, this may not be a very useful presentation of the data, and in fact it may be misleading, because it needs to be remembered that these are COUNTY numbers that must still be aggregated to other county numbers in order to reflect state numbers. Nevertheless, we can get an idea of how “select” works when manipulating data.

railroad_2012_clean_county%>%
  select(total_employees, state)%>%
  arrange(total_employees)
## # A tibble: 2,930 x 2
##    total_employees state
##              <dbl> <chr>
##  1               1 AK   
##  2               1 AL   
##  3               1 AL   
##  4               1 AP   
##  5               1 AR   
##  6               1 CA   
##  7               1 CO   
##  8               1 CO   
##  9               1 CO   
## 10               1 CO   
## # ... with 2,920 more rows

Likewise, we can use “arrange” to find the counties with the most railroad employees per county, but with highest numbrs first. And in this case, I don’t need to use “select” because I will keep all three column in my results, which show that Cook County, IL (Chicago!) is the county that employs the most railroad workers in the United States, almost twice as many as #2, Tarrant County, TX, which I’m guessing is Houston (and I would have never guessed Nebraska for #3. Omaha?). It all makes sense so far.

railroad_2012_clean_county%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    state county           total_employees
##    <chr> <chr>                      <dbl>
##  1 IL    COOK                        8207
##  2 TX    TARRANT                     4235
##  3 NE    DOUGLAS                     3797
##  4 NY    SUFFOLK                     3685
##  5 VA    INDEPENDENT CITY            3249
##  6 FL    DUVAL                       3073
##  7 CA    SAN BERNARDINO              2888
##  8 CA    LOS ANGELES                 2545
##  9 TX    HARRIS                      2535
## 10 NE    LINCOLN                     2289
## # ... with 2,920 more rows

However, selecting all 3 variables changes the layout, which might be useful if you want certain columns to be next to one another for the sake of readability, etc.

railroad_2012_clean_county%>%
  select(total_employees, state, county)%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    total_employees state county          
##              <dbl> <chr> <chr>           
##  1            8207 IL    COOK            
##  2            4235 TX    TARRANT         
##  3            3797 NE    DOUGLAS         
##  4            3685 NY    SUFFOLK         
##  5            3249 VA    INDEPENDENT CITY
##  6            3073 FL    DUVAL           
##  7            2888 CA    SAN BERNARDINO  
##  8            2545 CA    LOS ANGELES     
##  9            2535 TX    HARRIS          
## 10            2289 NE    LINCOLN         
## # ... with 2,920 more rows

Now, let’s find the county in each state with the most employees. Until now, we have been dealing with all 2,930 rows of data. Grouping the data by state and using the slice function will narrow things down by returning 53 rows. Why 53? Remember that in addition to the 50 states, the file includes information about the District of Columbia, AE (Armed Forces Europe), and AP (Armed Forces Pacific).

railroad_2012_clean_county %>%
  arrange(desc(total_employees)) %>%
  select(total_employees, state, county)%>%
  group_by(state)%>%
  slice(1)%>%
  arrange(state)
## # A tibble: 53 x 3
## # Groups:   state [53]
##    total_employees state county              
##              <dbl> <chr> <chr>               
##  1               2 AE    APO                 
##  2              88 AK    SKAGWAY MUNICIPALITY
##  3             990 AL    JEFFERSON           
##  4               1 AP    APO                 
##  5             972 AR    PULASKI             
##  6             749 AZ    PIMA                
##  7            2888 CA    SAN BERNARDINO      
##  8             553 CO    ADAMS               
##  9            1561 CT    NEW HAVEN           
## 10             279 DC    WASHINGTON DC       
## # ... with 43 more rows

Next, let’s use the groub_by, summarise_all, and mean functions to find the average number of employees per state. And the winner is…Delaware! Delaware?

railroad_2012_clean_county %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarise_all(mean, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 DE               498.
##  2 NJ               397.
##  3 CT               324 
##  4 MA               282.
##  5 NY               280.
##  6 DC               279 
##  7 CA               239.
##  8 AZ               210.
##  9 PA               196.
## 10 MD               196.
## # ... with 43 more rows

Fun fact: Delaware ranks Number 1 in the US for most railroad employees per county. But it only has three counties and doesn’t rank number one in terms of total number of employees, which is not a surprise given it’s small population relative to other to other states. Still, while this information is very interesting, is it anything more than a fun fact? Can the number of railway employees per county tell us other things, such as about gas consumption, pollution/emissions, use/ridership, etc.? Or are the connections too tenuous? Is Delaware covered in train tracks? I don’t know.

filter(railroad_2012_clean_county, `state` == "DE") %>%
arrange(county)
## # A tibble: 3 x 3
##   state county     total_employees
##   <chr> <chr>                <dbl>
## 1 DE    KENT                   158
## 2 DE    NEW CASTLE            1275
## 3 DE    SUSSEX                  62
filter(railroad_2012_clean_county, `state` == "AR") %>%
arrange(county) %>%
  select(total_employees) 
## # A tibble: 72 x 1
##    total_employees
##              <dbl>
##  1              11
##  2              18
##  3              25
##  4              35
##  5               5
##  6               8
##  7               5
##  8               3
##  9              40
## 10              13
## # ... with 62 more rows

The burning question now is: Which states have the most railway employees? There are other ones too, but let’s first tackle that one. We can get that information by using the group_by, summarise_all, and sum functions to find total employees per state.

If we arrange employees in descending order the top three states are Texas, Illinois and New York, in terms of total employees per state. The bottom three are Hawaii, Alaska, and Vermont (I have omitted Armed Forces Pacific and Armed Forces Europe, which occupy the final two positions).

railroad_2012_clean_county %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarise_all(sum, na.rm=TRUE) %>% 
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 TX              19839
##  2 IL              19131
##  3 NY              17050
##  4 NE              13176
##  5 CA              13137
##  6 PA              12769
##  7 OH               9056
##  8 GA               8605
##  9 IN               8537
## 10 MO               8419
## # ... with 43 more rows
railroad_2012_clean_county %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarise_all(sum, na.rm=TRUE) %>% 
  arrange(state) 
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 AE                  2
##  2 AK                103
##  3 AL               4257
##  4 AP                  1
##  5 AR               3871
##  6 AZ               3153
##  7 CA              13137
##  8 CO               3650
##  9 CT               2592
## 10 DC                279
## # ... with 43 more rows

So, what conclusions can we make from this data? Most simply, we can glean verifiable facts, such as that Texas employs the most railway workers (Gogle it, it’s true). What do we make from the fact that Delaware is #1 for average employees per county but #42 for total employees. Likewise, what do we make from the fact that Texas and Illinois, the two states that employ the most railway employees overall, don’t appear in the top ten states of average employees per county?

Geography: Lots of Land and the Urban/Rural Divide

One observation we can make by looking at the states with the highest average number of employees per county is that the top four are relatively small eastern seaboard states: Delaware, New Jersey, Connecticut, and Massachusetts. Number 5, New York, ranks third in terms of total employees. Also, is flanked by NJ and CT, where many people who live in New York reside.

Delaware and Maryland are Washington D.C. commuter states. They are also very small. D.C., the geographically smallest entity represented here, nevertheless ranks 6th for average employees per county. But it has only one variable for total employees, meaning that every calculation made about it grouped by “state” will return the same value: 279.

Massachusetts is a relatively small state with a major metropolitan area, Boston. So the numbers start to make sense when we eyeball the size of the state and the metropolitan areas its counties may be serving.

So now we are getting a clearer picture of what the narrative of this data might actually be. It’s probably not that Delaware is not just gaga for trains. Rather, I suspect that other factors must also be considered such as state land area, population/population density, and proximity to major metropolitan areas., but rather that its small size, along with a relatively higher population density and its proximity to major metropolitan areas means that it has a greater concentration of railway workers compared to Texas, which not only employs more railway workers than any other state, it has the most miles of track. It’s just that Texas is so big, so all the counties with only 2 or 3 total employees skews the averages.

railroad_2012_clean_county %>%
filter(total_employees > 5) %>%
  group_by(state)%>%
  select(total_employees) %>%
  summarise_all(mean, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 50 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 DE               498.
##  2 NJ               397.
##  3 CT               324 
##  4 CA               285.
##  5 NY               284.
##  6 MA               282.
##  7 DC               279 
##  8 AZ               225 
##  9 MD               224.
## 10 PA               213.
## # ... with 40 more rows
railroad_2012_clean_county %>%
filter(total_employees > 10) %>%
  group_by(state)%>%
  select(total_employees) %>%
  summarise_all(mean, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 50 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 DE               498.
##  2 NJ               397.
##  3 CT               324 
##  4 CA               311.
##  5 NY               299.
##  6 MA               282.
##  7 DC               279 
##  8 MD               260.
##  9 NE               246.
## 10 AZ               242.
## # ... with 40 more rows
railroad_2012_clean_county %>%
  filter(`state` == "TX") %>% 
  filter(total_employees > 20) %>%
  select(total_employees) %>%
  summarise_all(mean, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## # A tibble: 1 x 1
##   total_employees
##             <dbl>
## 1            188.

What’s the trainiest state? Another thing I want to try is to compare employee totals to population totals, both on a state level. Starting with states, what would the ranking look like if we compared total employees per state to total population? I’m going to make a simple data frame using the top 10 states in the averages and totals tables (20 states total). No doubt there is a more efficient way to do this, but I can practice vectors, and see if this is a rabbit hole worth descending into.

#The 10 states with the most total employees and the 10 states with the highest average number of employees per county (2012) 
US_State <- c("TX", "IL", "NY", "NE", "CA", "PA", "OH", "GA", "IN", "MO", "DE", "NJ", "CT", "MA", "DC", "AZ", "MD") 
#Total RR employees in 2012 for the states listed above in 2012
Sum_employees <- c(19839, 19131, 17050, 13176, 13137, 12769, 9056, 8605, 8537, 8419, 1495, 8329, 2592, 3379, 279, 3153, 4709)
#State total population in 2012
State_Pop_2012 <- c(26084481, 12882510, 19572932, 1853303, 37948800, 12767118, 11548923, 9901430, 6537703, 6024367, 915179, 8844942, 3594547, 6663005, 634924, 6554978, 5886992)
#Data frame 
data_top10total <- data.frame(US_State, Sum_employees, State_Pop_2012)
kable(data_top10total, caption = 'State railway employees vs general total population')
State railway employees vs general total population
US_State Sum_employees State_Pop_2012
TX 19839 26084481
IL 19131 12882510
NY 17050 19572932
NE 13176 1853303
CA 13137 37948800
PA 12769 12767118
OH 9056 11548923
GA 8605 9901430
IN 8537 6537703
MO 8419 6024367
DE 1495 915179
NJ 8329 8844942
CT 2592 3594547
MA 3379 6663005
DC 279 634924
AZ 3153 6554978
MD 4709 5886992
#Let's find out what percentage of total population are railway employees. Note: There are only 17 rows because three states (NY, CA, PA) appear in both lists.
data_top10total %>% 
  mutate(employee_prevalence = Sum_employees/State_Pop_2012) %>%
  arrange(desc(employee_prevalence))
##    US_State Sum_employees State_Pop_2012 employee_prevalence
## 1        NE         13176        1853303        0.0071094689
## 2        DE          1495         915179        0.0016335602
## 3        IL         19131       12882510        0.0014850367
## 4        MO          8419        6024367        0.0013974912
## 5        IN          8537        6537703        0.0013058103
## 6        PA         12769       12767118        0.0010001474
## 7        NJ          8329        8844942        0.0009416681
## 8        NY         17050       19572932        0.0008711010
## 9        GA          8605        9901430        0.0008690664
## 10       MD          4709        5886992        0.0007998992
## 11       OH          9056       11548923        0.0007841424
## 12       TX         19839       26084481        0.0007605672
## 13       CT          2592        3594547        0.0007210923
## 14       MA          3379        6663005        0.0005071285
## 15       AZ          3153        6554978        0.0004810085
## 16       DC           279         634924        0.0004394227
## 17       CA         13137       37948800        0.0003461770

Whoa. Delaware is still holding strong at #2. I did not see that coming. Maybe DE is actually more trainy than I thought? Nebraska, #3 for total employees, takes the top spot for the number of railway employees as a percentage of total population.That result is not so remarkable.

The US Census Bureau provides information for population by county from 2010 to 2019 here: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-total.html#par_textimage_739801612

Ok, let’s do the same thing, as above, but for all states. I made a simple excel file to use.

library(readxl)
state_rrworkers_population_2012 <- read_csv("state_rrworkers_population-2012.csv")
## Rows: 51 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): State_abbv, US_state
## dbl (2): Total_rail_employees, Population_2012
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(state_rrworkers_population_2012)
state_rrworkers_population_2012 %>%
  select(State_abbv, Total_rail_employees, Population_2012) %>%
  mutate(emp_prev = Total_rail_employees/Population_2012) %>%
  arrange(desc(emp_prev))
## # A tibble: 51 x 4
##    State_abbv Total_rail_employees Population_2012 emp_prev
##    <chr>                     <dbl>           <dbl>    <dbl>
##  1 NE                        13176         1853303  0.00711
##  2 WY                         2876          576305  0.00499
##  3 MT                         3327         1003783  0.00331
##  4 ND                         2204          701176  0.00314
##  5 KS                         6092         2885257  0.00211
##  6 WV                         3213         1856872  0.00173
##  7 DE                         1495          915179  0.00163
##  8 IL                        19131        12882510  0.00149
##  9 MO                         8419         6024367  0.00140
## 10 MS                         3932         2983816  0.00132
## # ... with 41 more rows

Emp_prev stands for employee prevalence; I’m n ot sure about the best term. Also, the numbers in the column aren’t displayed as percentages, but for example, in Nebraska, 0.0071% of the population are railway employees. What’s important is that these are ranked with highest percentages first, and once again, NE takes the #1 spot.

Positions 2 through 6 are occupied by states that we haven’t seen in a top ten before: Wyoming, Montana, ND, KS, and West Virginia. But there in position #7 is Delaware, still going strong. And it ranks higher than Illinois. Oh Delaware. Only three states in the previous T10 list stayed there: Nebraska, Illinois, and Missouri.

Observations of note:

California ranks almost dead last here at #46. Despite its total number of employees (#5), this number is eclipsed to its total population relative to other states. And that makes sense to me. “Trainy” doesn’t come to mind when I think of California.

Texas also took quite a leap down the list, from #1 to #29. New York fell from #3 to #23, Pennsylvania went from #6 to #17, Ohio slumped from #7 to #27, Georgia dove from #8 to #24, and Indiana stumbled from #9 to #13.

Texas and Illinois:

If we look at the case of Texas, we see that it has 221 counties, but 52 of them have fewer than five employees. To put that in perspective, you would need 55 counties at 5 employees per county to reach the DC county average (average, highest, lowest) number of 279. Or, you would need 12 counties to match the number of total employees in Delaware’s smallest county (Sussex; 62).

I’m new at this, but I’m guessing that skews the average. How would our list of average employees per county change if we filtered out counties with less than five employees