Let’s start with what the heck is data wrangling? You can wrangle cattle, but wrangling data? Yes! After data is imported or read into the RStudio environment it needs to be maneuvered before it can be used for visualization or modeling.
First, I am going to import data into R- In this case the Railroad_2012 data set which is a CSV file. While this is an already cleaned data set, it can still provide insight.
library(readr)
railroad_2012_clean_county <- read_csv("C:/Users/Bud/Desktop/DACSS601_Homework/HW_3/_data/railroad_2012_clean_county.csv")
## Rows: 2930 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): state, county
## dbl (1): total_employees
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(railroad_2012_clean_county)
The railroad data set was checked out, starting with the head and tail functions, which displayed the first 5 and last 5 rows of the data set.
head(railroad_2012_clean_county,5)
## # A tibble: 5 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 AE APO 2
## 2 AK ANCHORAGE 7
## 3 AK FAIRBANKS NORTH STAR 2
## 4 AK JUNEAU 3
## 5 AK MATANUSKA-SUSITNA 2
tail(railroad_2012_clean_county,5)
## # A tibble: 5 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 WY SUBLETTE 3
## 2 WY SWEETWATER 196
## 3 WY UINTA 49
## 4 WY WASHAKIE 10
## 5 WY WESTON 37
Then using colnames which takes a look at column names of the data set.
colnames(railroad_2012_clean_county)
## [1] "state" "county" "total_employees"
The glimpse function which is part of the dplyr package which provides a synopsis of data was used.
glimpse(railroad_2012_clean_county)
## Rows: 2,930
## Columns: 3
## $ state <chr> "AE", "AK", "AK", "AK", "AK", "AK", "AK", "AL", "AL", ~
## $ county <chr> "APO", "ANCHORAGE", "FAIRBANKS NORTH STAR", "JUNEAU", ~
## $ total_employees <dbl> 2, 7, 2, 3, 2, 1, 88, 102, 143, 1, 25, 154, 13, 29, 45~
The pivot_wider function was used to make the data set more manageable and readable, as well as filling in missing values from n/a to 0.
railroad_2012_clean_county %>%
pivot_wider(names_from = state, values_from = total_employees,values_fill = 0)
## # A tibble: 1,709 x 54
## county AE AK AL AP AR AZ CA CO CT DC DE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 APO 2 0 0 1 0 0 0 0 0 0 0
## 2 ANCHORAGE 0 7 0 0 0 0 0 0 0 0 0
## 3 FAIRBANKS ~ 0 2 0 0 0 0 0 0 0 0 0
## 4 JUNEAU 0 3 0 0 0 0 0 0 0 0 0
## 5 MATANUSKA-~ 0 2 0 0 0 0 0 0 0 0 0
## 6 SITKA 0 1 0 0 0 0 0 0 0 0 0
## 7 SKAGWAY MU~ 0 88 0 0 0 0 0 0 0 0 0
## 8 AUTAUGA 0 0 102 0 0 0 0 0 0 0 0
## 9 BALDWIN 0 0 143 0 0 0 0 0 0 0 0
## 10 BARBOUR 0 0 1 0 0 0 0 0 0 0 0
## # ... with 1,699 more rows, and 42 more variables: FL <dbl>, GA <dbl>,
## # HI <dbl>, IA <dbl>, ID <dbl>, IL <dbl>, IN <dbl>, KS <dbl>, KY <dbl>,
## # LA <dbl>, MA <dbl>, MD <dbl>, ME <dbl>, MI <dbl>, MN <dbl>, MO <dbl>,
## # MS <dbl>, MT <dbl>, NC <dbl>, ND <dbl>, NE <dbl>, NH <dbl>, NJ <dbl>,
## # NM <dbl>, NV <dbl>, NY <dbl>, OH <dbl>, OK <dbl>, OR <dbl>, PA <dbl>,
## # RI <dbl>, SC <dbl>, SD <dbl>, TN <dbl>, TX <dbl>, UT <dbl>, VA <dbl>,
## # VT <dbl>, WA <dbl>, WI <dbl>, WV <dbl>, WY <dbl>
I was interested in the railroad county that had the most employees so I started with the arrange function which ranked the number of employees in descending order.It shows, not surprisingly that Cook county in Illinois had the most number of employees.
arrange(railroad_2012_clean_county,desc(total_employees))
## # A tibble: 2,930 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 IL COOK 8207
## 2 TX TARRANT 4235
## 3 NE DOUGLAS 3797
## 4 NY SUFFOLK 3685
## 5 VA INDEPENDENT CITY 3249
## 6 FL DUVAL 3073
## 7 CA SAN BERNARDINO 2888
## 8 CA LOS ANGELES 2545
## 9 TX HARRIS 2535
## 10 NE LINCOLN 2289
## # ... with 2,920 more rows
The arrange function was also used to display the total_employees column.
arrange(railroad_2012_clean_county,total_employees)
## # A tibble: 2,930 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 AK SITKA 1
## 2 AL BARBOUR 1
## 3 AL HENRY 1
## 4 AP APO 1
## 5 AR NEWTON 1
## 6 CA MONO 1
## 7 CO BENT 1
## 8 CO CHEYENNE 1
## 9 CO COSTILLA 1
## 10 CO DOLORES 1
## # ... with 2,920 more rows
I was interested if there were any na values in the railroad data set. It shows there were none. Which makes sense because this was a clean data set.
railroad_2012_clean_county %>%
is.na() %>%
sum()
## [1] 0
The select function was used to determine the number of rows. In this data set there were 2930 rows.
select(railroad_2012_clean_county)
## # A tibble: 2,930 x 0
The select function separates the state column from the rest of the data set.
select(railroad_2012_clean_county,state)
## # A tibble: 2,930 x 1
## state
## <chr>
## 1 AE
## 2 AK
## 3 AK
## 4 AK
## 5 AK
## 6 AK
## 7 AK
## 8 AL
## 9 AL
## 10 AL
## # ... with 2,920 more rows
The filter function was used to determine which rail stations had fewer than 2 employees. Yeah,I was curious about this. It turns out there were 145 stations with less than 2 employees.
filter(railroad_2012_clean_county,total_employees < 2)
## # A tibble: 145 x 3
## state county total_employees
## <chr> <chr> <dbl>
## 1 AK SITKA 1
## 2 AL BARBOUR 1
## 3 AL HENRY 1
## 4 AP APO 1
## 5 AR NEWTON 1
## 6 CA MONO 1
## 7 CO BENT 1
## 8 CO CHEYENNE 1
## 9 CO COSTILLA 1
## 10 CO DOLORES 1
## # ... with 135 more rows
I was curious about the number of stations that had less than or equal to 2 employees, so I created a new object called subset_employees, that utilized the pipe operator, group_by function,and filter function to determine the stations with less than or equal to 2 employees.
subset_employees<-railroad_2012_clean_county %>%
group_by(total_employees) %>%
filter(total_employees<=2)
subset_employees
## # A tibble: 279 x 3
## # Groups: total_employees [2]
## state county total_employees
## <chr> <chr> <dbl>
## 1 AE APO 2
## 2 AK FAIRBANKS NORTH STAR 2
## 3 AK MATANUSKA-SUSITNA 2
## 4 AK SITKA 1
## 5 AL BARBOUR 1
## 6 AL HENRY 1
## 7 AP APO 1
## 8 AR MONTGOMERY 2
## 9 AR NEWTON 1
## 10 AR STONE 2
## # ... with 269 more rows
The summarise function was used to learn what the mean of the total employees in all states. As it turns out the mean was 87.17816.
railroad_2012_clean_county %>%
summarise(mean(total_employees))
## # A tibble: 1 x 1
## `mean(total_employees)`
## <dbl>
## 1 87.2
In this block the rename function was used to change one of the column names, from total_rail_ employees to number_of_employees. This comes in handy if a column needs to be renamed.
data<-railroad_2012_clean_county
data<-rename(data,number_of_rail_employees = total_employees)
data
## # A tibble: 2,930 x 3
## state county number_of_rail_employees
## <chr> <chr> <dbl>
## 1 AE APO 2
## 2 AK ANCHORAGE 7
## 3 AK FAIRBANKS NORTH STAR 2
## 4 AK JUNEAU 3
## 5 AK MATANUSKA-SUSITNA 2
## 6 AK SITKA 1
## 7 AK SKAGWAY MUNICIPALITY 88
## 8 AL AUTAUGA 102
## 9 AL BALDWIN 143
## 10 AL BARBOUR 1
## # ... with 2,920 more rows
This assignment showed me the power of the Tidyverse, which contains the dplyr package. A lot of insight can be gleaned from the railroad dataset with these tools,such as business, and municipal forecasting. For instance, stations with the fewest employees, such as the examples used of 2 or less, could be analyzed for either expansion or closure depending on the geographic proximity to other local stations,jobs and housing.