I’ll be looking at a dataset about the number of railroad employees per US county and using some basic data wrangling functions to extract and arrange information. After reading in the file, we can use the view function, which opens a new tab with the table in R Studionand also also provides some information within R Markdown. Specifically, it tells us that there are 2,930 rows and 3 columns. Two of the columns have values delimited by character string values (chr) and one column has values delimited by double precision floating point numbers (dbl). It also tells us that the names of the columns are “state,” county," and “total employees.”

railroad_2012_clean_county <- read_csv("railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)

We can also use the “head” or “dim” functions to ascertain this information, but for our purposes here, “view” is sufficient, so let’s move on.

This dataset is clean, meaning that it doesn’t have missing variables and it does not need to be “tidied” before the information can be manipulated and analyzed. It is also a very simple dataset in terms of its structure. It only has three columns and three variables. It provides information on the number of railroad employees in each county of each state of the US, as well as the District of the Columbia, Armed Forces Pacific, and Armed Forces Europe (which means there are 53 “states” rather than “50”).

The simplest code can nevertheless provide interesting insights, and even some surprises. Let’s start with “arrange.” The “arrange” function allows us to first sort the employee column in ascending order and then secondarily alphabetically by state. Also, by selecting only total employees and state, we remove the “county” column from the results, for what that’s worth (not very much). Overall, this may not be a very useful presentation of the data, and in fact it may be misleading, because it needs to be remembered that these are COUNTY numbers that must still be aggregated to other county numbers in order to reflect state numbers. Nevertheless, we can get an idea of how “select” works when manipulating data.

railroad_2012_clean_county%>%
  select(total_employees, state)%>%
  arrange(total_employees)
## # A tibble: 2,930 x 2
##    total_employees state
##              <dbl> <chr>
##  1               1 AK   
##  2               1 AL   
##  3               1 AL   
##  4               1 AP   
##  5               1 AR   
##  6               1 CA   
##  7               1 CO   
##  8               1 CO   
##  9               1 CO   
## 10               1 CO   
## # ... with 2,920 more rows

Likewise, we can use “arrange” to find the counties with the most railroad employees per county, but with highest numbrs first. And in this case, I don’t need to use “select” because I will keep all three column in my results, which show that Cook County, IL (Chicago!) is the county that employs the most railroad workers in the United States, almost twice as many as #2, Tarrant County, TX, which I’m guessing is Houston (and I would have never guessed Nebraska for #3. Omaha?). It all makes sense so far.

railroad_2012_clean_county%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    state county           total_employees
##    <chr> <chr>                      <dbl>
##  1 IL    COOK                        8207
##  2 TX    TARRANT                     4235
##  3 NE    DOUGLAS                     3797
##  4 NY    SUFFOLK                     3685
##  5 VA    INDEPENDENT CITY            3249
##  6 FL    DUVAL                       3073
##  7 CA    SAN BERNARDINO              2888
##  8 CA    LOS ANGELES                 2545
##  9 TX    HARRIS                      2535
## 10 NE    LINCOLN                     2289
## # ... with 2,920 more rows

However, selecting all 3 variables changes the layout, which might be useful if you want certain columns to be next to one another for the sake of readability, etc.

railroad_2012_clean_county%>%
  select(total_employees, state, county)%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    total_employees state county          
##              <dbl> <chr> <chr>           
##  1            8207 IL    COOK            
##  2            4235 TX    TARRANT         
##  3            3797 NE    DOUGLAS         
##  4            3685 NY    SUFFOLK         
##  5            3249 VA    INDEPENDENT CITY
##  6            3073 FL    DUVAL           
##  7            2888 CA    SAN BERNARDINO  
##  8            2545 CA    LOS ANGELES     
##  9            2535 TX    HARRIS          
## 10            2289 NE    LINCOLN         
## # ... with 2,920 more rows

Now, let’s find the county in each state with the most employees. Until now, we have been dealing with all 2,930 rows of data. Grouping the data by state and using the slice function will narrow things down by returning 53 rows. Why 53? Remember that in addition to the 50 states, the file includes information about the District of Columbia, AE (Armed Forces Europe), and AP (Armed Forces Pacific).

railroad_2012_clean_county %>%
  arrange(desc(total_employees)) %>%
  select(total_employees, state, county)%>%
  group_by(state)%>%
  slice(1)%>%
  arrange(state)
## # A tibble: 53 x 3
## # Groups:   state [53]
##    total_employees state county              
##              <dbl> <chr> <chr>               
##  1               2 AE    APO                 
##  2              88 AK    SKAGWAY MUNICIPALITY
##  3             990 AL    JEFFERSON           
##  4               1 AP    APO                 
##  5             972 AR    PULASKI             
##  6             749 AZ    PIMA                
##  7            2888 CA    SAN BERNARDINO      
##  8             553 CO    ADAMS               
##  9            1561 CT    NEW HAVEN           
## 10             279 DC    WASHINGTON DC       
## # ... with 43 more rows

Just for kicks, let’s find the county in each state with the fewest employees.

railroad_2012_clean_county %>%
  arrange(total_employees) %>%
  select(total_employees, state, county)%>%
  group_by(state)%>%
  slice(1)%>%
  arrange(state)
## # A tibble: 53 x 3
## # Groups:   state [53]
##    total_employees state county       
##              <dbl> <chr> <chr>        
##  1               2 AE    APO          
##  2               1 AK    SITKA        
##  3               1 AL    BARBOUR      
##  4               1 AP    APO          
##  5               1 AR    NEWTON       
##  6               3 AZ    GREENLEE     
##  7               1 CA    MONO         
##  8               1 CO    BENT         
##  9              26 CT    TOLLAND      
## 10             279 DC    WASHINGTON DC
## # ... with 43 more rows

Next, let’s use the groub_by, summarise_all, and mean functions to find the average number of employees per state. And the winner is…Delaware! Delaware?

railroad_2012_clean_county %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarise_all(mean, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 DE               498.
##  2 NJ               397.
##  3 CT               324 
##  4 MA               282.
##  5 NY               280.
##  6 DC               279 
##  7 CA               239.
##  8 AZ               210.
##  9 PA               196.
## 10 MD               196.
## # ... with 43 more rows

Fun fact: Delaware ranks Number 1 in the US for most railroad employees per county. But it only has three counties and doesn’t rank number one in terms of total number of employees, which is not a surprise given it’s small population relative to other to other states. Still, while this information is very interesting, is it anything more than a fun fact? Can the number of railway employees per county tell us other things, such as about gas consumption, pollution/emissions, use/ridership, etc.? Or are the connections too tenuous? Is Delaware covered in train tracks? I don’t know.

filter(railroad_2012_clean_county, `state` == "DE")
## # A tibble: 3 x 3
##   state county     total_employees
##   <chr> <chr>                <dbl>
## 1 DE    KENT                   158
## 2 DE    NEW CASTLE            1275
## 3 DE    SUSSEX                  62

The burning question now is: Which states have the most railway employees? There are other ones too, but let’s first tackle that one. We can get that information by using the group_by, summarise_all, and sum functions to find total employees per state.

If we arrange employees in descending order the top three states are Texas, Illinois and New York, in terms of total employees per state. The bottom three are Hawaii, Alaska, and Vermont (I have omitted Armed Forces Pacific and Armed Forces Europe, which occupy the final two positions).

railroad_2012_clean_county %>%
  group_by(state) %>%
  select(total_employees) %>%
  summarise_all(sum, na.rm=TRUE)%>%
  arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
##    state total_employees
##    <chr>           <dbl>
##  1 TX              19839
##  2 IL              19131
##  3 NY              17050
##  4 NE              13176
##  5 CA              13137
##  6 PA              12769
##  7 OH               9056
##  8 GA               8605
##  9 IN               8537
## 10 MO               8419
## # ... with 43 more rows

So, what conclusions can we make from this data? Most simply, we can glean verifiable facts, such as that Texas employs the most railway workers (Google it, it’s true). What do we make from the fact that Delaware is #1 for average employees per county but #42 for total employees. Likewise, what do we make from the fact that Texas and Illinois, the two states that employ the most railway employees overall, don’t appear in the top ten states of average employees per county?

There’s more to explore…Namely, we should next look at how many counties each state has as well as how many of its counties employ a relatively low number of employees.