The first thing I need to do is read in the data. I am going to look at the data set about the number of railroad employees per US county.I use the read_csv and view commands. The view function will open a new tab with the table in R Studio; it will also provide some information within R Markdown. Specifically, it tells us that in this dataset there are 2,930 rows and 3 columns. Two of the columns have values delimited by character string values (chr) and one column has values delimited by double precision floating point numbers (dbl). It also tells us that the names of the columns are “state,” county," and “total employees.”

railroad_2012_clean_county <- read_csv("railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)

I can also use the “head” or “dim” functions to ascertain this information, but for our purposes here, the “view” function is sufficient, so let’s move on.

This dataset is clean, meaning that it doesn’t have missing variables and it doesn’t need to be “tidied” before the information can be analyzed. It is also a very simple dataset in terms of its structure. It only has three columns and three variables. It provides information on the number of railroad employees in each county of each state of the US, as well as the District of the Columbia, Armed Forces Pacific, and Armed Forces Europe.

The simplest code can nevertheless provide interesting insights, and even some surprises. Let’s get started with “arrange.” The “arrange” function allows us to first arrange the employee column in ascending order and then secondarily alphabetically by state. Also, by selecting only total employees and state, we remove the “county” column from the results, for what it’s worth. You will see, however, that this code still returns ALL 2,930 rows. Overall, it may not be a very useful presentation of the data, and in fact it may be misleading, because it needs to be remembered that these are COUNTY numbers that must still be aggregated to other county numbers in order to reflect state numbers. But, you can see what “select” does.

railroad_2012_clean_county%>%
  select(total_employees, state)%>%
  arrange(total_employees)
## # A tibble: 2,930 x 2
##    total_employees state
##              <dbl> <chr>
##  1               1 AK   
##  2               1 AL   
##  3               1 AL   
##  4               1 AP   
##  5               1 AR   
##  6               1 CA   
##  7               1 CO   
##  8               1 CO   
##  9               1 CO   
## 10               1 CO   
## # ... with 2,920 more rows

Likewise, I can also use “arrange” to find the counties with the most railroad employees per county. I use the same code as before, but tell R to show the results in descending order. And in this case, I don’t need to use “select” because I will keep all three column in my results, which show that Cook County, IL (Chicago!) is the county that employs the most railroad workers in the United States, almost twice as many as #2, Tarrant County, TX, which I’m guessing is Houston (and I would have never guessed Nebraska for #3. Omaha?). It makes sense that major metropolitan areas with lots of rail lines would need a lot of employees.

railroad_2012_clean_county%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    state county           total_employees
##    <chr> <chr>                      <dbl>
##  1 IL    COOK                        8207
##  2 TX    TARRANT                     4235
##  3 NE    DOUGLAS                     3797
##  4 NY    SUFFOLK                     3685
##  5 VA    INDEPENDENT CITY            3249
##  6 FL    DUVAL                       3073
##  7 CA    SAN BERNARDINO              2888
##  8 CA    LOS ANGELES                 2545
##  9 TX    HARRIS                      2535
## 10 NE    LINCOLN                     2289
## # ... with 2,920 more rows

However, selecting all 3 variables changes the layout, which might be useful if you want certain columns to be next to one another for the sake of readability, etc.

railroad_2012_clean_county%>%
  select(total_employees, state, county)%>%
  arrange(desc(total_employees))
## # A tibble: 2,930 x 3
##    total_employees state county          
##              <dbl> <chr> <chr>           
##  1            8207 IL    COOK            
##  2            4235 TX    TARRANT         
##  3            3797 NE    DOUGLAS         
##  4            3685 NY    SUFFOLK         
##  5            3249 VA    INDEPENDENT CITY
##  6            3073 FL    DUVAL           
##  7            2888 CA    SAN BERNARDINO  
##  8            2545 CA    LOS ANGELES     
##  9            2535 TX    HARRIS          
## 10            2289 NE    LINCOLN         
## # ... with 2,920 more rows

Unfortunately, one thing I can’t do yet that I’d like to do, is rank the 50 states + DC + AE + AP based on number of total of employees. The most I can do based on TOTAL numbers is to find the county in each state with the most employees. The following code will do that, returning 53 rows, rather than 2,930.

railroad_2012_clean_county %>%
  arrange(desc(total_employees)) %>%
  select(total_employees, state, county)%>%
  group_by(state)%>%
  slice(1)%>%
  arrange(state)
## # A tibble: 53 x 3
## # Groups:   state [53]
##    total_employees state county              
##              <dbl> <chr> <chr>               
##  1               2 AE    APO                 
##  2              88 AK    SKAGWAY MUNICIPALITY
##  3             990 AL    JEFFERSON           
##  4               1 AP    APO                 
##  5             972 AR    PULASKI             
##  6             749 AZ    PIMA                
##  7            2888 CA    SAN BERNARDINO      
##  8             553 CO    ADAMS               
##  9            1561 CT    NEW HAVEN           
## 10             279 DC    WASHINGTON DC       
## # ... with 43 more rows

Just for kicks, let’s find the counties with the fewest workers per county.

railroad_2012_clean_county %>%
  arrange(total_employees) %>%
  select(total_employees, state, county)%>%
  group_by(state)%>%
  slice(1)%>%
  arrange(state)
## # A tibble: 53 x 3
## # Groups:   state [53]
##    total_employees state county       
##              <dbl> <chr> <chr>        
##  1               2 AE    APO          
##  2               1 AK    SITKA        
##  3               1 AL    BARBOUR      
##  4               1 AP    APO          
##  5               1 AR    NEWTON       
##  6               3 AZ    GREENLEE     
##  7               1 CA    MONO         
##  8               1 CO    BENT         
##  9              26 CT    TOLLAND      
## 10             279 DC    WASHINGTON DC
## # ... with 43 more rows

However I can do something else that’s pretty cool…I can find the average number of employees per state by using the summarise, mean, and group_by functions. Then I can present the results in descending order using the arrange function. And the winner is…Delaware!

railroad_2012_clean_county%>%
  group_by(state) %>%
  summarise(mean = mean(total_employees))%>%
  arrange(desc(mean))
## # A tibble: 53 x 2
##    state  mean
##    <chr> <dbl>
##  1 DE     498.
##  2 NJ     397.
##  3 CT     324 
##  4 MA     282.
##  5 NY     280.
##  6 DC     279 
##  7 CA     239.
##  8 AZ     210.
##  9 PA     196.
## 10 MD     196.
## # ... with 43 more rows

Fun fact: Delaware ranks Number 1 in the US for most railroad employees per county. Delaware has only three counties and doesn’t ranked number one in terms of total number of employees, which is not a surprise given it’s small population relative to other to other states. Still, while this information is very interesting, it may limited in scope. Can the number of railway employees per county tell us other things, such as about gas consumption, pollution/emissions, use/ridership, etc.? Or will any such connections always be too tenuous? I don’t know. What I do know is that Delaware wins, now in credit card companies AND railroad employees!

filter(railroad_2012_clean_county, `state` == "DE")
## # A tibble: 3 x 3
##   state county     total_employees
##   <chr> <chr>                <dbl>
## 1 DE    KENT                   158
## 2 DE    NEW CASTLE            1275
## 3 DE    SUSSEX                  62