I’ll be looking at a dataset about the number of railroad employees per US county and using some basic data wrangling functions to extract and arrange information. After reading in the file, we can use the view function, which opens a new tab with the table in R Studionand also also provides some information within R Markdown. Specifically, it tells us that there are 2,930 rows and 3 columns. Two of the columns have values delimited by character string values (chr) and one column has values delimited by double precision floating point numbers (dbl). It also tells us that the names of the columns are “state,” county," and “total employees.”
railroad_2012_clean_county <- read_csv("railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)
We can also use the “head” or “dim” functions to ascertain this information, but for our purposes here, “view” is sufficient, so let’s move on.
This dataset is clean, meaning that it doesn’t have missing variables and it does not need to be “tidied” before the information can be manipulated and analyzed. It is also a very simple dataset in terms of its structure. It only has three columns and three variables. It provides information on the number of railroad employees in each county of each state of the US, as well as the District of the Columbia, Armed Forces Pacific, and Armed Forces Europe (which means there are 53 “states” rather than “50”).
The simplest code can nevertheless provide interesting insights, and even some surprises. Let’s start with “arrange.” The “arrange” function allows us to first sort the employee column in ascending order and then secondarily alphabetically by state. Also, by selecting only total employees and state, we remove the “county” column from the results, for what that’s worth (not very much). Overall, this may not be a very useful presentation of the data, and in fact it may be misleading, because it needs to be remembered that these are COUNTY numbers that must still be aggregated to other county numbers in order to reflect state numbers. Nevertheless, we can get an idea of how “select” works when manipulating data.
railroad_2012_clean_county%>%
select(total_employees, state)%>%
arrange(total_employees)
## # A tibble: 2,930 x 2
## total_employees state
## <dbl> <chr>
## 1 1 AK
## 2 1 AL
## 3 1 AL
## 4 1 AP
## 5 1 AR
## 6 1 CA
## 7 1 CO
## 8 1 CO
## 9 1 CO
## 10 1 CO
## # ... with 2,920 more rows
Likewise, we can use “arrange” to find the counties with the most railroad employees per county, but with highest numbrs first. And in this case, I don’t need to use “select” because I will keep all three column in my results, which show that Cook County, IL (Chicago!) is the county that employs the most railroad workers in the United States, almost twice as many as #2, Tarrant County, TX, which I’m guessing is Houston (and I would have never guessed Nebraska for #3. Omaha?). It all makes sense so far.
railroad_2012_clean_county%>%
arrange(desc(total_employees))
## # A tibble: 2,930 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 IL COOK 8207
## 2 TX TARRANT 4235
## 3 NE DOUGLAS 3797
## 4 NY SUFFOLK 3685
## 5 VA INDEPENDENT CITY 3249
## 6 FL DUVAL 3073
## 7 CA SAN BERNARDINO 2888
## 8 CA LOS ANGELES 2545
## 9 TX HARRIS 2535
## 10 NE LINCOLN 2289
## # ... with 2,920 more rows
However, selecting all 3 variables changes the layout, which might be useful if you want certain columns to be next to one another for the sake of readability, etc.
railroad_2012_clean_county%>%
select(total_employees, state, county)%>%
arrange(desc(total_employees))
## # A tibble: 2,930 x 3
## total_employees state county
## <dbl> <chr> <chr>
## 1 8207 IL COOK
## 2 4235 TX TARRANT
## 3 3797 NE DOUGLAS
## 4 3685 NY SUFFOLK
## 5 3249 VA INDEPENDENT CITY
## 6 3073 FL DUVAL
## 7 2888 CA SAN BERNARDINO
## 8 2545 CA LOS ANGELES
## 9 2535 TX HARRIS
## 10 2289 NE LINCOLN
## # ... with 2,920 more rows
Now, let’s find the county in each state with the most employees. Until now, we have been dealing with all 2,930 rows of data. Grouping the data by state and using the slice function will narrow things down by returning 53 rows. Why 53? Remember that in addition to the 50 states, the file includes information about the District of Columbia, AE (Armed Forces Europe), and AP (Armed Forces Pacific).
railroad_2012_clean_county %>%
arrange(desc(total_employees)) %>%
select(total_employees, state, county)%>%
group_by(state)%>%
slice(1)%>%
arrange(state)
## # A tibble: 53 x 3
## # Groups: state [53]
## total_employees state county
## <dbl> <chr> <chr>
## 1 2 AE APO
## 2 88 AK SKAGWAY MUNICIPALITY
## 3 990 AL JEFFERSON
## 4 1 AP APO
## 5 972 AR PULASKI
## 6 749 AZ PIMA
## 7 2888 CA SAN BERNARDINO
## 8 553 CO ADAMS
## 9 1561 CT NEW HAVEN
## 10 279 DC WASHINGTON DC
## # ... with 43 more rows
Just for kicks, let’s find the county in each state with the fewest employees.
railroad_2012_clean_county %>%
arrange(total_employees) %>%
select(total_employees, state, county)%>%
group_by(state)%>%
slice(1)%>%
arrange(state)
## # A tibble: 53 x 3
## # Groups: state [53]
## total_employees state county
## <dbl> <chr> <chr>
## 1 2 AE APO
## 2 1 AK SITKA
## 3 1 AL BARBOUR
## 4 1 AP APO
## 5 1 AR NEWTON
## 6 3 AZ GREENLEE
## 7 1 CA MONO
## 8 1 CO BENT
## 9 26 CT TOLLAND
## 10 279 DC WASHINGTON DC
## # ... with 43 more rows
Next, let’s use the groub_by, summarise_all, and mean functions to find the average number of employees per state. And the winner is…Delaware! Delaware?
railroad_2012_clean_county %>%
group_by(state) %>%
select(total_employees) %>%
summarise_all(mean, na.rm=TRUE)%>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
## state total_employees
## <chr> <dbl>
## 1 DE 498.
## 2 NJ 397.
## 3 CT 324
## 4 MA 282.
## 5 NY 280.
## 6 DC 279
## 7 CA 239.
## 8 AZ 210.
## 9 PA 196.
## 10 MD 196.
## # ... with 43 more rows
Fun fact: Delaware ranks Number 1 in the US for most railroad employees per county. But it only has three counties and doesn’t rank number one in terms of total number of employees, which is not a surprise given it’s small population relative to other to other states. Still, while this information is very interesting, is it anything more than a fun fact? Can the number of railway employees per county tell us other things, such as about gas consumption, pollution/emissions, use/ridership, etc.? Or are the connections too tenuous? Is Delaware covered in train tracks? I don’t know.
filter(railroad_2012_clean_county, `state` == "DE")
## # A tibble: 3 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 DE KENT 158
## 2 DE NEW CASTLE 1275
## 3 DE SUSSEX 62
The burning question now is: Which states have the most railway employees? There are other ones too, but let’s first tackle that one. We can get that information by using the group_by, summarise_all, and sum functions to find total employees per state.
If we arrange employees in descending order the top three states are Texas, Illinois and New York, in terms of total employees per state. The bottom three are Hawaii, Alaska, and Vermont (I have omitted Armed Forces Pacific and Armed Forces Europe, which occupy the final two positions).
railroad_2012_clean_county %>%
group_by(state) %>%
select(total_employees) %>%
summarise_all(sum, na.rm=TRUE)%>%
arrange(desc(total_employees))
## Adding missing grouping variables: `state`
## # A tibble: 53 x 2
## state total_employees
## <chr> <dbl>
## 1 TX 19839
## 2 IL 19131
## 3 NY 17050
## 4 NE 13176
## 5 CA 13137
## 6 PA 12769
## 7 OH 9056
## 8 GA 8605
## 9 IN 8537
## 10 MO 8419
## # ... with 43 more rows
So, what conclusions can we make from this data? Most simply, we can glean verifiable facts, such as that Texas employs the most railway workers (Google it, it’s true). What do we make from the fact that Delaware is #1 for average employees per county but #42 for total employees. Likewise, what do we make from the fact that Texas and Illinois, the two states that employ the most railway employees overall, don’t appear in the top ten states of average employees per county?
There’s more to explore…Namely, we should next look at how many counties each state has as well as how many of its counties employ a relatively low number of employees.