Introduction

Let’s start with what the heck is data wrangling? You can wrangle cattle, but wrangling data? Yes! After data is imported or read into the RStudio environment it needs to be maneuvered before it can be used for visualization or modeling.

First Steps

First, I am going to import data into R- In this case the Railroad_2012 data set which is a CSV file. While this is an already cleaned data set, it can still provide insight.

library(readr)
railroad_2012_clean_county <- read_csv("C:/Users/Bud/Desktop/DACSS601_Homework/HW_3/_data/railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)

The railroad data set was checked out, starting with the head and tail functions, which displayed the first 5 and last 5 rows of the data set.

head(railroad_2012_clean_county,5)   
## # A tibble: 5 x 3
##   state county               total_employees
##   <chr> <chr>                          <dbl>
## 1 AE    APO                                2
## 2 AK    ANCHORAGE                          7
## 3 AK    FAIRBANKS NORTH STAR               2
## 4 AK    JUNEAU                             3
## 5 AK    MATANUSKA-SUSITNA                  2
tail(railroad_2012_clean_county,5)
## # A tibble: 5 x 3
##   state county     total_employees
##   <chr> <chr>                <dbl>
## 1 WY    SUBLETTE                 3
## 2 WY    SWEETWATER             196
## 3 WY    UINTA                   49
## 4 WY    WASHAKIE                10
## 5 WY    WESTON                  37

Then using colnames which takes a look at column names of the data set.

colnames(railroad_2012_clean_county)
## [1] "state"           "county"          "total_employees"

The glimpse function which is part of the dplyr package which provides a synopsis of data was used.

glimpse(railroad_2012_clean_county)   
## Rows: 2,930
## Columns: 3
## $ state           <chr> "AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", ~
## $ county          <chr> "APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU", ~
## $ total_employees <dbl> 2, 7, 2, 3, 2, 1, 88, 102, 143, 1, 25, 154, 13, 29, 45~

The pivot_wider function was used to make the data set more manageable and readable, as well as filling in missing values from n/a to 0.

railroad_2012_clean_county %>%
  pivot_wider(names_from = state, values_from = total_employees,values_fill = 0)
## # A tibble: 1,709 x 54
##    county         AE    AK    AL    AP    AR    AZ    CA    CO    CT    DC    DE
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 APO             2     0     0     1     0     0     0     0     0     0     0
##  2 ANCHORAGE       0     7     0     0     0     0     0     0     0     0     0
##  3 FAIRBANKS ~     0     2     0     0     0     0     0     0     0     0     0
##  4 JUNEAU          0     3     0     0     0     0     0     0     0     0     0
##  5 MATANUSKA-~     0     2     0     0     0     0     0     0     0     0     0
##  6 SITKA           0     1     0     0     0     0     0     0     0     0     0
##  7 SKAGWAY MU~     0    88     0     0     0     0     0     0     0     0     0
##  8 AUTAUGA         0     0   102     0     0     0     0     0     0     0     0
##  9 BALDWIN         0     0   143     0     0     0     0     0     0     0     0
## 10 BARBOUR         0     0     1     0     0     0     0     0     0     0     0
## # ... with 1,699 more rows, and 42 more variables: FL <dbl>, GA <dbl>,
## #   HI <dbl>, IA <dbl>, ID <dbl>, IL <dbl>, IN <dbl>, KS <dbl>, KY <dbl>,
## #   LA <dbl>, MA <dbl>, MD <dbl>, ME <dbl>, MI <dbl>, MN <dbl>, MO <dbl>,
## #   MS <dbl>, MT <dbl>, NC <dbl>, ND <dbl>, NE <dbl>, NH <dbl>, NJ <dbl>,
## #   NM <dbl>, NV <dbl>, NY <dbl>, OH <dbl>, OK <dbl>, OR <dbl>, PA <dbl>,
## #   RI <dbl>, SC <dbl>, SD <dbl>, TN <dbl>, TX <dbl>, UT <dbl>, VA <dbl>,
## #   VT <dbl>, WA <dbl>, WI <dbl>, WV <dbl>, WY <dbl>

Next step - Some Data Wrangling

I was interested in the railroad county that had the most employees so I started with the arrange function which ranked the number of employees in descending order.It shows, not surprisingly that Cook county in Illinois had the most number of employees.

arrange(railroad_2012_clean_county,desc(total_employees))
## # A tibble: 2,930 x 3
##    state county           total_employees
##    <chr> <chr>                      <dbl>
##  1 IL    COOK                        8207
##  2 TX    TARRANT                     4235
##  3 NE    DOUGLAS                     3797
##  4 NY    SUFFOLK                     3685
##  5 VA    INDEPENDENT CITY            3249
##  6 FL    DUVAL                       3073
##  7 CA    SAN BERNARDINO              2888
##  8 CA    LOS ANGELES                 2545
##  9 TX    HARRIS                      2535
## 10 NE    LINCOLN                     2289
## # ... with 2,920 more rows

The arrange function was also used to display the total_employees column.

arrange(railroad_2012_clean_county,total_employees)
## # A tibble: 2,930 x 3
##    state county   total_employees
##    <chr> <chr>              <dbl>
##  1 AK    SITKA                  1
##  2 AL    BARBOUR                1
##  3 AL    HENRY                  1
##  4 AP    APO                    1
##  5 AR    NEWTON                 1
##  6 CA    MONO                   1
##  7 CO    BENT                   1
##  8 CO    CHEYENNE               1
##  9 CO    COSTILLA               1
## 10 CO    DOLORES                1
## # ... with 2,920 more rows

I was interested if there were any na values in the railroad data set. It shows there were none. Which makes sense because this was a clean data set.

 railroad_2012_clean_county %>%
  is.na() %>% 
  sum()
## [1] 0

The select function was used to determine the number of rows. In this data set there were 2930 rows.

select(railroad_2012_clean_county)
## # A tibble: 2,930 x 0

The select function separates the state column from the rest of the data set.

select(railroad_2012_clean_county,state)
## # A tibble: 2,930 x 1
##    state
##    <chr>
##  1 AE   
##  2 AK   
##  3 AK   
##  4 AK   
##  5 AK   
##  6 AK   
##  7 AK   
##  8 AL   
##  9 AL   
## 10 AL   
## # ... with 2,920 more rows

The filter function was used to determine which rail stations had fewer than 2 employees. Yeah,I was curious about this. It turns out there were 145 stations with less than 2 employees.

filter(railroad_2012_clean_county,total_employees < 2)
## # A tibble: 145 x 3
##    state county   total_employees
##    <chr> <chr>              <dbl>
##  1 AK    SITKA                  1
##  2 AL    BARBOUR                1
##  3 AL    HENRY                  1
##  4 AP    APO                    1
##  5 AR    NEWTON                 1
##  6 CA    MONO                   1
##  7 CO    BENT                   1
##  8 CO    CHEYENNE               1
##  9 CO    COSTILLA               1
## 10 CO    DOLORES                1
## # ... with 135 more rows

I was curious about the number of stations that had less than or equal to 2 employees, so I created a new object called subset_employees, that utilized the pipe operator, group_by function,and filter function to determine the stations with less than or equal to 2 employees.

subset_employees<-railroad_2012_clean_county %>% 
  group_by(total_employees) %>% 
  filter(total_employees<=2)
subset_employees
## # A tibble: 279 x 3
## # Groups:   total_employees [2]
##    state county               total_employees
##    <chr> <chr>                          <dbl>
##  1 AE    APO                                2
##  2 AK    FAIRBANKS NORTH STAR               2
##  3 AK    MATANUSKA-SUSITNA                  2
##  4 AK    SITKA                              1
##  5 AL    BARBOUR                            1
##  6 AL    HENRY                              1
##  7 AP    APO                                1
##  8 AR    MONTGOMERY                         2
##  9 AR    NEWTON                             1
## 10 AR    STONE                              2
## # ... with 269 more rows

The summarise function was used to learn what the mean of the total employees in all states. As it turns out the mean was 87.17816.

railroad_2012_clean_county %>% 
  summarise(mean(total_employees))
## # A tibble: 1 x 1
##   `mean(total_employees)`
##                     <dbl>
## 1                    87.2

In this block the rename function was used to change one of the column names, from total_rail_ employees to number_of_employees. This comes in handy if a column needs to be renamed.

data<-railroad_2012_clean_county
data<-rename(data,number_of_rail_employees = total_employees)
data
## # A tibble: 2,930 x 3
##    state county               number_of_rail_employees
##    <chr> <chr>                                   <dbl>
##  1 AE    APO                                         2
##  2 AK    ANCHORAGE                                   7
##  3 AK    FAIRBANKS NORTH STAR                        2
##  4 AK    JUNEAU                                      3
##  5 AK    MATANUSKA-SUSITNA                           2
##  6 AK    SITKA                                       1
##  7 AK    SKAGWAY MUNICIPALITY                       88
##  8 AL    AUTAUGA                                   102
##  9 AL    BALDWIN                                   143
## 10 AL    BARBOUR                                     1
## # ... with 2,920 more rows

Conclusion

This assignment showed me the power of the Tidyverse, which contains the dplyr package. A lot of insight can be gleaned from the railroad dataset with these tools,such as business, and municipal forecasting. For instance, stations with the fewest employees, such as the examples used of 2 or less, could be analyzed for either expansion or closure depending on the geographic proximity to other local stations,jobs and housing.