Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
# This is the R chunk for the required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
My first step in reprocessing after I had read in the two data sets was to rename each column so that they would be easier to use in R Studio and there was some naming cohesion for both two sets. Column Facilities is removed from the data set as it is untidy but out of scope. To join the two data sets I had to find a variable that they both had in common, this was the Location variable in both sets as it contained the name for each facility. However this variable in the LocationFacilityData also contained the type of station it was, as such it had to be separated into a different column.
Location was then tidied and both data sets were joined using their corresponding Location variable. The Transport_Mode variable was untidy as some cells contained multiple values, so was broken up so that each transport type had its own column stating whether the station had it or not. Each variable then had its data type checked and changed if necessary. Correct leveling for factors was done if they needed it.
The data set was then checked for NA, NAN, infinite and finite variables. The NA values for variables Morning_Peak and Afternoon_Peak were recoded using the mode value for their column. Outliers for the four Latitude and Longitude numerical values were checked using boxplots which found many, however using a scatter plot it was determined that these were not outliers as they were clearly mapping the routes that the facilities were positioned on and as such they were not removed.
Finally several Transformations were made on the Latitude and Longitude facility variables to see if they could better fit a normal distribution, while not perfect there was some improvement.
TfNSW data dictionary was used to define some variables - https://opendata.transport.nsw.gov.au/sites/default/files/TfNSW_LocFac_Op_Loadzone_OffStreet.pdf
LocationFacilityData Data set, read in as facilityData
This data set details the geolocation, facilities and other important information for train stations, ferry wharves and bus interchanges. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/public-transport-location-facilities-and-operators
variable description: Location - Name of facility TSN - Transit Stop Number for each stop at the Station/Wharf/Light Rail/Bus Interchange Facility_Latitude - The latitude coordinates of the facility
Facility_Longitude - The longitude coordinates of the facility EFA_ID - TfNSW data dictionary - “Not to be used” Phone - The phone number for the facility Facility_Address - the facility’s Address Facilities - The utilities available to the public at the facility for example bike racks, payphones etc Accessibility - Transport_Mode - The types of transport avaliable at the facility
Morning_Peak - The busiest time for this facility in the morning Afternoon_Peak - The busiest time for this facility in the afternoon
StationEntrances2020 Data set, read in as stationEntrance
This data set contains a list of every entrance for every station in NSW. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/train-station-entrance-locations
variable description: Location - Name of entrance location Street_Name - Name of street entrance is on Street_Type - Type of street entrance is on eg Avenue: Ave, Drive: Dr Entrance_Type - Type of entrance eg stairs, ramp, lift Entrance_Latitude - The latitude coordinates of the entrance Entrance_Longitude - The longitude coordinates of the entrance
Exit_Number - The number designated for an exit, if there is only one exit it is NA
# This is the R chunk for the Data Section
stationEntrance <- read_csv("StationEntrances2020_v4.csv")
## Parsed with column specification:
## cols(
## Train_Station = col_character(),
## Street_Name = col_character(),
## Street_Type = col_character(),
## Entrance_Type = col_character(),
## LAT = col_double(),
## LONG = col_double(),
## Exit_Number = col_double()
## )
facilityData <- read_csv("LocationFacilityData.csv")
## Parsed with column specification:
## cols(
## LOCATION_NAME = col_character(),
## TSN = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## EFA_ID = col_double(),
## PHONE = col_character(),
## ADDRESS = col_character(),
## FACILITIES = col_character(),
## ACCESSIBILITY = col_character(),
## TRANSPORT_MODE = col_character(),
## MORNING_PEAK = col_character(),
## AFTERNOON_PEAK = col_character()
## )
colnames(facilityData)
## [1] "LOCATION_NAME" "TSN" "LATITUDE" "LONGITUDE"
## [5] "EFA_ID" "PHONE" "ADDRESS" "FACILITIES"
## [9] "ACCESSIBILITY" "TRANSPORT_MODE" "MORNING_PEAK" "AFTERNOON_PEAK"
colnames(stationEntrance)
## [1] "Train_Station" "Street_Name" "Street_Type" "Entrance_Type"
## [5] "LAT" "LONG" "Exit_Number"
colnames(facilityData) <- c("Location","TSN","Facility_Latitude","Facility_Longitude","EFA_ID",
"Phone","Facility_Address","Facilities","Accessibility",
"Transport_Mode","Morning_Peak","Afternoon_Peak")
colnames(stationEntrance) <- c("Location","Street_Name","Street_Type","Entrance_Type",
"Entrance_Latitude","Entrance_Longitude","Exit_Number")
facilityData <- facilityData %>% select(-(Facilities))
The LocationFacilityData (LFD) Data set requires a lot of tidying while the StationEntrances2020 (SE) Data set is relatively tidy. I plan to join both data sets on their renamed “Location” variable. The SE location variable consists of only the name for each facility, while the LFD location variable has the name and with it the type of station eg. “Station”, “Wharf” this would need to be separated from the variable so a merge of the two data sets can be done. Also I consider both the location name and the station type to be individual variables thus to conform with tidy data principles they should be in their own column.
# This is the R chunk for the Tidy & Manipulate Data I
# Move station type to different column
station1 <- str_detect(facilityData$Location, "Station")
stop1 <- str_detect(facilityData$Location, "Coach Stop")
wharf1 <- str_detect(facilityData$Location, "Wharf")
lightRail1 <- str_detect(facilityData$Location, "Light Rail")
facilityData2 <- mutate(facilityData,
Station = station1,
Coach_Stop = stop1,
Wharf = wharf1,
Light_Rail = lightRail1)
facilityData2$Station <- as.character(facilityData2$Station)
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "TRUE", replacement = "Station")
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "FALSE", replacement = "")
facilityData2$Coach_Stop <- as.character(facilityData2$Coach_Stop)
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "TRUE", replacement = "Coach Stop")
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "FALSE", replacement = "")
facilityData2$Wharf <- as.character(facilityData2$Wharf)
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "TRUE", replacement = "Wharf")
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "FALSE", replacement = "")
facilityData2$Light_Rail <- as.character(facilityData2$Light_Rail)
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "TRUE", replacement = "Light Rail")
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "FALSE", replacement = "")
facilityData3 <- mutate(facilityData, Station_Type = "")
facilityData3$Station_Type <- paste(facilityData2$Station, facilityData2$Coach_Stop, facilityData2$Wharf, facilityData2$Light_Rail)
# delete column type from location
facilityData3$Location <- str_remove_all(facilityData3$Location, "Station")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Stop")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Coach")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Bus")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Wharf")
facilityData3$Location <- str_replace_all(facilityData3$Location, "Light Rail", "LR")
facilityData3$Location <- str_remove_all(facilityData3$Location, "northbound")
facilityData3$Location <- str_remove_all(facilityData3$Location, "southbound")
facilityData3$Location <- gsub("[][()]", "", facilityData3$Location)
# remove spaces
head(facilityData3$Location)
## [1] "Abbotsford " "Aberdeen " "Adamstown "
## [4] "Adelaide Central " "Adelong " "Albion Park "
facilityData3$Location <- str_trim(facilityData3$Location, side = "right")
head(facilityData3$Station_Type)
## [1] " Wharf " "Station " "Station " "Station "
## [5] " Coach Stop " "Station "
facilityData3$Station_Type <- str_trim(facilityData3$Station_Type, side = "both")
head(stationEntrance$Location)
## [1] "Aberdeen" "Aberdeen" "Adamstown" "Adamstown" "Adamstown" "Adamstown"
# join datasets
stationData <- stationEntrance %>% inner_join(facilityData3, by = "Location")
The variable Transport_Mode contains all the types of transport available at a facility as such the strings in this column can consist of multiple values eg “Train, Regional coach, Taxi”. To tidy this column I plan to mutate it into 7 different columns which will answer yes or no if the facility has that mode of transport.
# This is the R chunk for the Tidy & Manipulate Data II
# split transport_mode
train1 <- str_detect(stationData$Transport_Mode, "Train")
bus1 <- str_detect(stationData$Transport_Mode, "Bus")
taxi1 <- str_detect(stationData$Transport_Mode, "Taxi")
lightRail1 <- str_detect(stationData$Transport_Mode, "Light rail")
ferry1 <- str_detect(stationData$Transport_Mode, "Ferry")
regionalCoach1 <- str_detect(stationData$Transport_Mode, "Regional coach")
coach1 <- str_detect(stationData$Transport_Mode, "Coach")
stationData2 <- mutate(stationData,
Train = train1,
Bus = bus1,
Taxi = taxi1,
Light_Rail = lightRail1,
Ferry = ferry1,
Regional_Coach = regionalCoach1,
Coach = coach1)
# remove transport_mode
stationData2 <- stationData2 %>% select(-(Transport_Mode))
# wheelchair accessible
wheelchair <- str_detect(stationData$Accessibility, "This location is wheelchair accessible")
stationData2 <- mutate(stationData2,
Wheelchair_Accessible = wheelchair)
stationData2 <- stationData2 %>% select(-(Accessibility))
# This is the R chunk for the Understand Section
dim(stationData2)
## [1] 973 24
str(stationData2)
## tibble [973 x 24] (S3: tbl_df/tbl/data.frame)
## $ Location : chr [1:973] "Aberdeen" "Aberdeen" "Adamstown" "Adamstown" ...
## $ Street_Name : chr [1:973] "Macqueen" "Macqueen" "Park" "Park" ...
## $ Street_Type : chr [1:973] "St" "St" "Ave" "Ave" ...
## $ Entrance_Type : chr [1:973] "Ramp" "Stairs" "Path" "Path" ...
## $ Entrance_Latitude : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
## $ Entrance_Longitude : num [1:973] 151 151 152 152 152 ...
## $ Exit_Number : num [1:973] NA NA NA NA NA NA NA NA NA NA ...
## $ TSN : num [1:973] 233610 233610 228920 228920 228920 ...
## $ Facility_Latitude : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
## $ Facility_Longitude : num [1:973] 151 151 152 152 152 ...
## $ EFA_ID : num [1:973] 10101192 10101192 10101159 10101159 10101159 ...
## $ Phone : chr [1:973] "02 6543 1018" "02 6543 1018" "02 4962 9295" "02 4962 9295" ...
## $ Facility_Address : chr [1:973] "Macqueen St, Aberdeen NSW 2336" "Macqueen St, Aberdeen NSW 2336" "Park Ave, Adamstown NSW 2289" "Park Ave, Adamstown NSW 2289" ...
## $ Morning_Peak : chr [1:973] "6.00-10am" "6.00-10am" "6.00-10am" "6.00-10am" ...
## $ Afternoon_Peak : chr [1:973] "3.00-7pm" "3.00-7pm" "3.00-7pm" "3.00-7pm" ...
## $ Station_Type : chr [1:973] "Station" "Station" "Station" "Station" ...
## $ Train : logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ Bus : logi [1:973] FALSE FALSE TRUE TRUE TRUE TRUE ...
## $ Taxi : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Light_Rail : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Ferry : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Regional_Coach : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Coach : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Wheelchair_Accessible: logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...
# factoring
stationData2$Street_Type <- as.factor(stationData2$Street_Type)
stationData2$Entrance_Type <- as.factor(stationData2$Entrance_Type)
str(stationData2$Entrance_Type)
## Factor w/ 5 levels "Escalators","Lift",..: 4 5 3 3 5 4 4 4 4 5 ...
stationData2$Exit_Number <- as.factor(stationData2$Exit_Number)
stationData2$Station_Type <- as.factor(stationData2$Station_Type)
levels(stationData2$Street_Type)
## [1] "Ave" "Cl" "Cres" "Dr" "Hwy" "Lane" "Pde" "Pl" "Plza"
## [10] "Rd" "Rd W" "South" "Sq" "St" "St N" "T-way" "Tce" "Way"
# factorise and level
stationData3 <- stationData2
stationData3$Train <- as.character(stationData3$Train)
stationData3$Train <- str_replace(stationData3$Train, pattern = "TRUE", replacement = "Yes")
stationData3$Train <- str_replace(stationData3$Train, pattern = "FALSE", replacement = "No")
stationData3$Train <- factor(stationData3$Train, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Bus <- as.character(stationData3$Bus)
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "TRUE", replacement = "Yes")
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "FALSE", replacement = "No")
stationData3$Bus <- factor(stationData3$Bus, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Taxi <- as.character(stationData3$Taxi)
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "TRUE", replacement = "Yes")
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "FALSE", replacement = "No")
stationData3$Taxi <- factor(stationData3$Taxi, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Light_Rail <- as.character(stationData3$Light_Rail)
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "TRUE", replacement = "Yes")
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "FALSE", replacement = "No")
stationData3$Light_Rail <- factor(stationData3$Light_Rail, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Ferry <- as.character(stationData3$Ferry)
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "TRUE", replacement = "Yes")
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "FALSE", replacement = "No")
stationData3$Ferry <- factor(stationData3$Ferry, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Regional_Coach <- as.character(stationData3$Regional_Coach)
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "FALSE", replacement = "No")
stationData3$Regional_Coach <- factor(stationData3$Regional_Coach, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Coach <- as.character(stationData3$Coach)
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "FALSE", replacement = "No")
stationData3$Coach <- factor(stationData3$Coach, levels = c("Yes", "No"), ordered = TRUE)
stationData3$Wheelchair_Accessible <- as.character(stationData3$Wheelchair_Accessible)
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "TRUE", replacement = "Yes")
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "FALSE", replacement = "No")
stationData3$Wheelchair_Accessible <- factor(stationData3$Wheelchair_Accessible, levels = c("Yes", "No"), ordered = TRUE)
Methodology NA values were found in Exit_Number, Street_Type, Phone, Morning_Peak and Afternoon_Peak, however only the last two columns had anything done to rectify this. This was because not all streets have a type, not all facilities have multiple exits and not all facilities have a phone number, however all facilities should have a peak time. I used mode imputation as this would find the most common value in each column and replace the NA values with it as this is the most likely peak time for these unknown values.
# This is the R chunk for the Scan I
colSums(is.na(stationData3))
## Location Street_Name Street_Type
## 0 0 24
## Entrance_Type Entrance_Latitude Entrance_Longitude
## 0 0 0
## Exit_Number TSN Facility_Latitude
## 925 0 0
## Facility_Longitude EFA_ID Phone
## 0 0 87
## Facility_Address Morning_Peak Afternoon_Peak
## 0 18 18
## Station_Type Train Bus
## 0 0 0
## Taxi Light_Rail Ferry
## 0 0 0
## Regional_Coach Coach Wheelchair_Accessible
## 0 0 0
sapply(stationData3, function(x) sum(is.nan(x)))
## Location Street_Name Street_Type
## 0 0 0
## Entrance_Type Entrance_Latitude Entrance_Longitude
## 0 0 0
## Exit_Number TSN Facility_Latitude
## 0 0 0
## Facility_Longitude EFA_ID Phone
## 0 0 0
## Facility_Address Morning_Peak Afternoon_Peak
## 0 0 0
## Station_Type Train Bus
## 0 0 0
## Taxi Light_Rail Ferry
## 0 0 0
## Regional_Coach Coach Wheelchair_Accessible
## 0 0 0
sapply(stationData3, function(x) sum(is.infinite(x)))
## Location Street_Name Street_Type
## 0 0 0
## Entrance_Type Entrance_Latitude Entrance_Longitude
## 0 0 0
## Exit_Number TSN Facility_Latitude
## 0 0 0
## Facility_Longitude EFA_ID Phone
## 0 0 0
## Facility_Address Morning_Peak Afternoon_Peak
## 0 0 0
## Station_Type Train Bus
## 0 0 0
## Taxi Light_Rail Ferry
## 0 0 0
## Regional_Coach Coach Wheelchair_Accessible
## 0 0 0
sapply(stationData3, function(x) sum(is.finite(x)))
## Location Street_Name Street_Type
## 0 0 949
## Entrance_Type Entrance_Latitude Entrance_Longitude
## 973 973 973
## Exit_Number TSN Facility_Latitude
## 48 973 973
## Facility_Longitude EFA_ID Phone
## 973 973 0
## Facility_Address Morning_Peak Afternoon_Peak
## 0 0 0
## Station_Type Train Bus
## 973 973 973
## Taxi Light_Rail Ferry
## 973 973 973
## Regional_Coach Coach Wheelchair_Accessible
## 973 973 973
stationData3$Morning_Peak <- as.factor(stationData3$Morning_Peak)
stationData3$Morning_Peak <- impute(stationData3$Morning_Peak, fun = mode)
stationData3$Afternoon_Peak <- as.factor(stationData3$Afternoon_Peak)
stationData3$Afternoon_Peak <- impute(stationData3$Afternoon_Peak, fun = mode)
which(is.na(stationData3$Morning_Peak))
## named integer(0)
which(is.na(stationData3$Afternoon_Peak))
## named integer(0)
Methodology The only viable numerical variable to scan for outliers are the latitude and longitude variables for the entrance and facility locations. Looking at the boxplot there appears to be many outliers for each variable however when we create a scatter plot between the corresponding latitude and longitude values we can clearly see the routes that these stations lie along therefore it would be incorrect to remove any of the Univariate outliers found from the boxplots.
# This is the R chunk for the Scan II
# outliers lat and long
boxplot(stationData3$Entrance_Latitude)
boxplot(stationData3$Entrance_Longitude)
boxplot(stationData3$Facility_Latitude)
boxplot(stationData3$Facility_Longitude)
stationData3 %>% plot(Entrance_Latitude ~ Entrance_Longitude, data = .)
stationData3 %>% plot(Facility_Latitude ~ Facility_Longitude, data = .)
# This is the R chunk for the Transform Section
# Untransformed Distributions
#hist(stationData3$Entrance_Latitude)
#hist(stationData3$Entrance_Longitude)
#hist(stationData3$Facility_Latitude)
#hist(stationData3$Facility_Longitude)
# Facility_Longitude Distributions with Transformation
hist(stationData3$Facility_Longitude, main = "Facility_Longitude Untransformed")
#log10(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log10")
#log(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log")
#sqrt(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Square root")
#(stationData3$Facility_Longitude)^2 %>% hist(main = "Facility_Longitude Square")
(1/(stationData3$Facility_Longitude)) %>% hist(main = "Facility_Longitude Reciprocal")
#BoxCox(stationData3$Facility_Longitude, lambda = "auto") %>% hist(main = "Facility_Longitude Box-Cox")
# Facility_Latitude Distributions with Transformation
Fac_Lat_100 <- stationData3$Facility_Latitude + 100
hist(Fac_Lat_100, main = "Facility_Latitude Untransformed")
#log10(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log10")
#log(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log")
#sqrt(Fac_Lat_100) %>% hist(main = "Facility_Latitude Square root")
(Fac_Lat_100)^2 %>% hist(main = "Facility_Latitude Square")
#(1/(Fac_Lat_100)) %>% hist(main = "Facility_Latitude Reciprocal")
#BoxCox(Fac_Lat_100, lambda = "auto") %>% hist(main = "Facility_Latitude Box-Cox")