Github link contained a downloadable csv which was loaded into a raw dataframe. Two separate clients have given feedback into the project. Each ask will be broken into separate sections where possible.
# dowload file from github, save it locally in your home directory
download <- download.file('https://raw.githubusercontent.com/kelloggjohnd/DATA607/master/DOITT_SUBWAY_ENTRANCE.csv', destfile = "subway_entrance.csv", method = "wininet")
# manipulate the data into a data frame
data_raw <- data.frame(read.csv(file = "subway_entrance.csv", header = TRUE, sep = ","))
# QC step
head(data_raw)## OBJECTID URL
## 1 1734 http://web.mta.info/nyct/service/
## 2 1735 http://web.mta.info/nyct/service/
## 3 1736 http://web.mta.info/nyct/service/
## 4 1737 http://web.mta.info/nyct/service/
## 5 1738 http://web.mta.info/nyct/service/
## 6 1739 http://web.mta.info/nyct/service/
## NAME
## 1 Birchall Ave & Sagamore St at NW corner
## 2 Birchall Ave & Sagamore St at NE corner
## 3 Morris Park Ave & 180th St at NW corner
## 4 Morris Park Ave & 180th St at NW corner
## 5 Boston Rd & 178th St at SW corner
## 6 Boston Rd & E Tremont Ave at NW corner
## the_geom LINE
## 1 POINT (-73.86835600032798 40.84916900104506) 5-Feb
## 2 POINT (-73.86821300022677 40.84912800131844) 5-Feb
## 3 POINT (-73.87349900050798 40.84122300105249) 5-Feb
## 4 POINT (-73.8728919997833 40.84145300067447) 5-Feb
## 5 POINT (-73.87962300013866 40.84081500075867) 5-Feb
## 6 POINT (-73.88000500027815 40.840434000875874) 5-Feb
“The Line column is all over the place and needs individual lines split into separate columns.”
With the data loaded, we can now start the process to clean the data. The first chunk will adjust the Line column. It appears Excel auto-formatted some of the fields into a date.
# Keep the ObjectID for ease in later joins
subway_line_raw <- subset(data_raw, select = c(OBJECTID,LINE))
# SRT replace functions. I chose to keep these seperate instead of a Tidyr package for ease in QC.
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "1/2/2003", "1-2-3")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "Feb", "2")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "Mar", "3")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "Apr", "4")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "2005", "5")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "2006", "6")
subway_line_raw$LINE <- str_replace_all(subway_line_raw$LINE, "2003", "3")
# QC step
head(subway_line_raw)## OBJECTID LINE
## 1 1734 5-2
## 2 1735 5-2
## 3 1736 5-2
## 4 1737 5-2
## 5 1738 5-2
## 6 1739 5-2
The Line column has been cleaned of Excel’s autoformatting. We now need to start the tidying up. First is getting the Subway lines data into individual columns.
# Seperate the lines column into seperate columns
subway_lines<- subway_line_raw %>%
separate(LINE,c("Line1","Line2","Line3","Line4","Line5","Line6","Line7","Line8","Line9","Line10","Line11"), sep = "-", convert = FALSE) ## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 1905
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
Second step in the tidying process: entrance locations into separate fields (Main Street, Cross Street, Corner)
location_raw <- subset(data_raw, select = c(OBJECTID,NAME))
location_raw <- location_raw %>%
separate(NAME, c("Street", "Cross_Street", "Second_street"), sep = "&")%>%
separate(Cross_Street, c("Cross_Street", "Corner"), sep = "at") ## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1927
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
## Warning: Expected 2 pieces. Additional pieces discarded in 18 rows [362,
## 371, 389, 517, 756, 757, 758, 759, 760, 1275, 1276, 1358, 1359, 1442, 1443,
## 1444, 1851, 1914].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows
## [1227, 1228, 1924].
Third step in the tidying: separate the ‘Lat’ and ‘Long’ coordinates into separate columns
Forth and last step in the tidying: combine all the above subsets into one master Data Frame.
“A cool graphic could then be utilized marking the locations on a map.”
Using leaflet, a map of all the entrances with labels was created.
# Subset of the Positioning coordinates
Subway_points<- Subway%>%
unite(Name, Street:Cross_Street, sep = "_")%>%
select(Lat, Long, Name)%>%
mutate_at(vars(matches("Lat|Long")),funs(as.numeric))## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.