Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Executive Summary

My first step in reprocessing after I had read in the two data sets was to rename each column so that they would be easier to use in R Studio and there was some naming cohesion for both two sets. Column Facilities is removed from the data set as it is untidy but out of scope. To join the two data sets I had to find a variable that they both had in common, this was the Location variable in both sets as it contained the name for each facility. However this variable in the LocationFacilityData also contained the type of station it was, as such it had to be separated into a different column.

Location was then tidied and both data sets were joined using their corresponding Location variable. The Transport_Mode variable was untidy as some cells contained multiple values, so was broken up so that each transport type had its own column stating whether the station had it or not. Each variable then had its data type checked and changed if necessary. Correct leveling for factors was done if they needed it.

The data set was then checked for NA, NAN, infinite and finite variables. The NA values for variables Morning_Peak and Afternoon_Peak were recoded using the mode value for their column. Outliers for the four Latitude and Longitude numerical values were checked using boxplots which found many, however using a scatter plot it was determined that these were not outliers as they were clearly mapping the routes that the facilities were positioned on and as such they were not removed.

Finally several Transformations were made on the Latitude and Longitude facility variables to see if they could better fit a normal distribution, while not perfect there was some improvement.

Data

TfNSW data dictionary was used to define some variables - https://opendata.transport.nsw.gov.au/sites/default/files/TfNSW_LocFac_Op_Loadzone_OffStreet.pdf

LocationFacilityData Data set, read in as facilityData

This data set details the geolocation, facilities and other important information for train stations, ferry wharves and bus interchanges. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/public-transport-location-facilities-and-operators

variable description: Location - Name of facility TSN - Transit Stop Number for each stop at the Station/Wharf/Light Rail/Bus Interchange Facility_Latitude - The latitude coordinates of the facility
Facility_Longitude - The longitude coordinates of the facility EFA_ID - TfNSW data dictionary - “Not to be used” Phone - The phone number for the facility Facility_Address - the facility’s Address Facilities - The utilities available to the public at the facility for example bike racks, payphones etc Accessibility - Transport_Mode - The types of transport avaliable at the facility
Morning_Peak - The busiest time for this facility in the morning Afternoon_Peak - The busiest time for this facility in the afternoon

StationEntrances2020 Data set, read in as stationEntrance

This data set contains a list of every entrance for every station in NSW. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/train-station-entrance-locations

variable description: Location - Name of entrance location Street_Name - Name of street entrance is on Street_Type - Type of street entrance is on eg Avenue: Ave, Drive: Dr Entrance_Type - Type of entrance eg stairs, ramp, lift Entrance_Latitude - The latitude coordinates of the entrance Entrance_Longitude - The longitude coordinates of the entrance
Exit_Number - The number designated for an exit, if there is only one exit it is NA

# This is the R chunk for the Data Section

stationEntrance <- read_csv("StationEntrances2020_v4.csv")
## Parsed with column specification:
## cols(
##   Train_Station = col_character(),
##   Street_Name = col_character(),
##   Street_Type = col_character(),
##   Entrance_Type = col_character(),
##   LAT = col_double(),
##   LONG = col_double(),
##   Exit_Number = col_double()
## )
facilityData <- read_csv("LocationFacilityData.csv")
## Parsed with column specification:
## cols(
##   LOCATION_NAME = col_character(),
##   TSN = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   EFA_ID = col_double(),
##   PHONE = col_character(),
##   ADDRESS = col_character(),
##   FACILITIES = col_character(),
##   ACCESSIBILITY = col_character(),
##   TRANSPORT_MODE = col_character(),
##   MORNING_PEAK = col_character(),
##   AFTERNOON_PEAK = col_character()
## )
colnames(facilityData)
##  [1] "LOCATION_NAME"  "TSN"            "LATITUDE"       "LONGITUDE"     
##  [5] "EFA_ID"         "PHONE"          "ADDRESS"        "FACILITIES"    
##  [9] "ACCESSIBILITY"  "TRANSPORT_MODE" "MORNING_PEAK"   "AFTERNOON_PEAK"
colnames(stationEntrance)
## [1] "Train_Station" "Street_Name"   "Street_Type"   "Entrance_Type"
## [5] "LAT"           "LONG"          "Exit_Number"
colnames(facilityData) <- c("Location","TSN","Facility_Latitude","Facility_Longitude","EFA_ID",
                            "Phone","Facility_Address","Facilities","Accessibility",
                            "Transport_Mode","Morning_Peak","Afternoon_Peak")
colnames(stationEntrance) <- c("Location","Street_Name","Street_Type","Entrance_Type",
                               "Entrance_Latitude","Entrance_Longitude","Exit_Number")

facilityData <- facilityData %>% select(-(Facilities))

Tidy & Manipulate Data I

The LocationFacilityData (LFD) Data set requires a lot of tidying while the StationEntrances2020 (SE) Data set is relatively tidy. I plan to join both data sets on their renamed “Location” variable. The SE location variable consists of only the name for each facility, while the LFD location variable has the name and with it the type of station eg. “Station”, “Wharf” this would need to be separated from the variable so a merge of the two data sets can be done. Also I consider both the location name and the station type to be individual variables thus to conform with tidy data principles they should be in their own column.

# This is the R chunk for the Tidy & Manipulate Data I 


# Move station type to different column

station1 <- str_detect(facilityData$Location, "Station")
stop1 <- str_detect(facilityData$Location, "Coach Stop")
wharf1 <- str_detect(facilityData$Location, "Wharf")
lightRail1 <- str_detect(facilityData$Location, "Light Rail")

facilityData2 <- mutate(facilityData, 
                        Station = station1,
                        Coach_Stop = stop1,
                        Wharf = wharf1,
                        Light_Rail = lightRail1)

facilityData2$Station <- as.character(facilityData2$Station)
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "TRUE", replacement = "Station")
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "FALSE", replacement = "")

facilityData2$Coach_Stop <- as.character(facilityData2$Coach_Stop)
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "TRUE", replacement = "Coach Stop")
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "FALSE", replacement = "")

facilityData2$Wharf <- as.character(facilityData2$Wharf)
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "TRUE", replacement = "Wharf")
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "FALSE", replacement = "")

facilityData2$Light_Rail <- as.character(facilityData2$Light_Rail)
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "TRUE", replacement = "Light Rail")
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "FALSE", replacement = "")

facilityData3 <- mutate(facilityData, Station_Type = "")

facilityData3$Station_Type <- paste(facilityData2$Station, facilityData2$Coach_Stop, facilityData2$Wharf, facilityData2$Light_Rail)

# delete column type from location

facilityData3$Location <- str_remove_all(facilityData3$Location, "Station")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Stop")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Coach")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Bus")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Wharf")
facilityData3$Location <- str_replace_all(facilityData3$Location, "Light Rail", "LR")
facilityData3$Location <- str_remove_all(facilityData3$Location, "northbound")
facilityData3$Location <- str_remove_all(facilityData3$Location, "southbound")
facilityData3$Location <- gsub("[][()]", "", facilityData3$Location)

# remove spaces
head(facilityData3$Location)
## [1] "Abbotsford "        "Aberdeen "          "Adamstown "        
## [4] "Adelaide Central  " "Adelong  "          "Albion Park "
facilityData3$Location <- str_trim(facilityData3$Location, side = "right")

head(facilityData3$Station_Type)
## [1] "  Wharf "      "Station   "    "Station   "    "Station   "   
## [5] " Coach Stop  " "Station   "
facilityData3$Station_Type <- str_trim(facilityData3$Station_Type, side = "both")

head(stationEntrance$Location)
## [1] "Aberdeen"  "Aberdeen"  "Adamstown" "Adamstown" "Adamstown" "Adamstown"
# join datasets

stationData <- stationEntrance %>% inner_join(facilityData3, by = "Location")

Tidy & Manipulate Data II

The variable Transport_Mode contains all the types of transport available at a facility as such the strings in this column can consist of multiple values eg “Train, Regional coach, Taxi”. To tidy this column I plan to mutate it into 7 different columns which will answer yes or no if the facility has that mode of transport.

# This is the R chunk for the Tidy & Manipulate Data II 

# split transport_mode

train1 <- str_detect(stationData$Transport_Mode, "Train")
bus1 <- str_detect(stationData$Transport_Mode, "Bus")
taxi1 <- str_detect(stationData$Transport_Mode, "Taxi")
lightRail1 <- str_detect(stationData$Transport_Mode, "Light rail")
ferry1 <- str_detect(stationData$Transport_Mode, "Ferry")
regionalCoach1 <- str_detect(stationData$Transport_Mode, "Regional coach")
coach1 <- str_detect(stationData$Transport_Mode, "Coach")

stationData2 <- mutate(stationData, 
                        Train = train1,
                        Bus = bus1,
                        Taxi = taxi1,
                        Light_Rail = lightRail1,
                        Ferry = ferry1,
                        Regional_Coach = regionalCoach1,
                        Coach = coach1)

# remove transport_mode

stationData2 <- stationData2 %>% select(-(Transport_Mode))

# wheelchair accessible

wheelchair <- str_detect(stationData$Accessibility, "This location is wheelchair accessible")

stationData2 <- mutate(stationData2, 
                        Wheelchair_Accessible = wheelchair)

stationData2 <- stationData2 %>% select(-(Accessibility))

Understand

# This is the R chunk for the Understand Section

dim(stationData2)
## [1] 973  24
str(stationData2)
## tibble [973 x 24] (S3: tbl_df/tbl/data.frame)
##  $ Location             : chr [1:973] "Aberdeen" "Aberdeen" "Adamstown" "Adamstown" ...
##  $ Street_Name          : chr [1:973] "Macqueen" "Macqueen" "Park" "Park" ...
##  $ Street_Type          : chr [1:973] "St" "St" "Ave" "Ave" ...
##  $ Entrance_Type        : chr [1:973] "Ramp" "Stairs" "Path" "Path" ...
##  $ Entrance_Latitude    : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
##  $ Entrance_Longitude   : num [1:973] 151 151 152 152 152 ...
##  $ Exit_Number          : num [1:973] NA NA NA NA NA NA NA NA NA NA ...
##  $ TSN                  : num [1:973] 233610 233610 228920 228920 228920 ...
##  $ Facility_Latitude    : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
##  $ Facility_Longitude   : num [1:973] 151 151 152 152 152 ...
##  $ EFA_ID               : num [1:973] 10101192 10101192 10101159 10101159 10101159 ...
##  $ Phone                : chr [1:973] "02 6543 1018" "02 6543 1018" "02 4962 9295" "02 4962 9295" ...
##  $ Facility_Address     : chr [1:973] "Macqueen St, Aberdeen NSW 2336" "Macqueen St, Aberdeen NSW 2336" "Park Ave, Adamstown NSW 2289" "Park Ave, Adamstown NSW 2289" ...
##  $ Morning_Peak         : chr [1:973] "6.00-10am" "6.00-10am" "6.00-10am" "6.00-10am" ...
##  $ Afternoon_Peak       : chr [1:973] "3.00-7pm" "3.00-7pm" "3.00-7pm" "3.00-7pm" ...
##  $ Station_Type         : chr [1:973] "Station" "Station" "Station" "Station" ...
##  $ Train                : logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Bus                  : logi [1:973] FALSE FALSE TRUE TRUE TRUE TRUE ...
##  $ Taxi                 : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Light_Rail           : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Ferry                : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Regional_Coach       : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Coach                : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Wheelchair_Accessible: logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...
# factoring

stationData2$Street_Type <- as.factor(stationData2$Street_Type)

stationData2$Entrance_Type <- as.factor(stationData2$Entrance_Type)

str(stationData2$Entrance_Type)
##  Factor w/ 5 levels "Escalators","Lift",..: 4 5 3 3 5 4 4 4 4 5 ...
stationData2$Exit_Number <- as.factor(stationData2$Exit_Number)

stationData2$Station_Type <- as.factor(stationData2$Station_Type)

levels(stationData2$Street_Type)
##  [1] "Ave"   "Cl"    "Cres"  "Dr"    "Hwy"   "Lane"  "Pde"   "Pl"    "Plza" 
## [10] "Rd"    "Rd W"  "South" "Sq"    "St"    "St N"  "T-way" "Tce"   "Way"
# factorise and level

stationData3 <- stationData2

stationData3$Train <- as.character(stationData3$Train)
stationData3$Train <- str_replace(stationData3$Train, pattern = "TRUE", replacement = "Yes")
stationData3$Train <- str_replace(stationData3$Train, pattern = "FALSE", replacement = "No")
stationData3$Train <- factor(stationData3$Train, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Bus <- as.character(stationData3$Bus)
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "TRUE", replacement = "Yes")
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "FALSE", replacement = "No")
stationData3$Bus <- factor(stationData3$Bus, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Taxi <- as.character(stationData3$Taxi)
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "TRUE", replacement = "Yes")
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "FALSE", replacement = "No")
stationData3$Taxi <- factor(stationData3$Taxi, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Light_Rail <- as.character(stationData3$Light_Rail)
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "TRUE", replacement = "Yes")
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "FALSE", replacement = "No")
stationData3$Light_Rail <- factor(stationData3$Light_Rail, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Ferry <- as.character(stationData3$Ferry)
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "TRUE", replacement = "Yes")
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "FALSE", replacement = "No")
stationData3$Ferry <- factor(stationData3$Ferry, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Regional_Coach <- as.character(stationData3$Regional_Coach)
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "FALSE", replacement = "No")
stationData3$Regional_Coach <- factor(stationData3$Regional_Coach, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Coach <- as.character(stationData3$Coach)
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "FALSE", replacement = "No")
stationData3$Coach <- factor(stationData3$Coach, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Wheelchair_Accessible <- as.character(stationData3$Wheelchair_Accessible)
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "TRUE", replacement = "Yes")
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "FALSE", replacement = "No")
stationData3$Wheelchair_Accessible <- factor(stationData3$Wheelchair_Accessible, levels = c("Yes", "No"), ordered = TRUE)

Scan I

Methodology NA values were found in Exit_Number, Street_Type, Phone, Morning_Peak and Afternoon_Peak, however only the last two columns had anything done to rectify this. This was because not all streets have a type, not all facilities have multiple exits and not all facilities have a phone number, however all facilities should have a peak time. I used mode imputation as this would find the most common value in each column and replace the NA values with it as this is the most likely peak time for these unknown values.

# This is the R chunk for the Scan I

colSums(is.na(stationData3))
##              Location           Street_Name           Street_Type 
##                     0                     0                    24 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                   925                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                    87 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                    18                    18 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0
sapply(stationData3, function(x) sum(is.nan(x)))
##              Location           Street_Name           Street_Type 
##                     0                     0                     0 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                     0                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0
sapply(stationData3, function(x) sum(is.infinite(x)))
##              Location           Street_Name           Street_Type 
##                     0                     0                     0 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                     0                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0
sapply(stationData3, function(x) sum(is.finite(x)))
##              Location           Street_Name           Street_Type 
##                     0                     0                   949 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                   973                   973                   973 
##           Exit_Number                   TSN     Facility_Latitude 
##                    48                   973                   973 
##    Facility_Longitude                EFA_ID                 Phone 
##                   973                   973                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                   973                   973                   973 
##                  Taxi            Light_Rail                 Ferry 
##                   973                   973                   973 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                   973                   973                   973
stationData3$Morning_Peak <- as.factor(stationData3$Morning_Peak)
stationData3$Morning_Peak <- impute(stationData3$Morning_Peak, fun = mode)

stationData3$Afternoon_Peak <- as.factor(stationData3$Afternoon_Peak)
stationData3$Afternoon_Peak <- impute(stationData3$Afternoon_Peak, fun = mode)

which(is.na(stationData3$Morning_Peak))
## named integer(0)
which(is.na(stationData3$Afternoon_Peak))
## named integer(0)

Scan II

Methodology The only viable numerical variable to scan for outliers are the latitude and longitude variables for the entrance and facility locations. Looking at the boxplot there appears to be many outliers for each variable however when we create a scatter plot between the corresponding latitude and longitude values we can clearly see the routes that these stations lie along therefore it would be incorrect to remove any of the Univariate outliers found from the boxplots.

# This is the R chunk for the Scan II

# outliers lat and long

boxplot(stationData3$Entrance_Latitude)

boxplot(stationData3$Entrance_Longitude)

boxplot(stationData3$Facility_Latitude)

boxplot(stationData3$Facility_Longitude)

stationData3 %>% plot(Entrance_Latitude ~ Entrance_Longitude, data = .)

stationData3 %>% plot(Facility_Latitude ~ Facility_Longitude, data = .)

Transform

# This is the R chunk for the Transform Section

# Untransformed Distributions
#hist(stationData3$Entrance_Latitude)
#hist(stationData3$Entrance_Longitude)
#hist(stationData3$Facility_Latitude)
#hist(stationData3$Facility_Longitude)

# Facility_Longitude Distributions with Transformation 
hist(stationData3$Facility_Longitude, main = "Facility_Longitude Untransformed")

#log10(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log10")
#log(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log")
#sqrt(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Square root")
#(stationData3$Facility_Longitude)^2 %>% hist(main = "Facility_Longitude Square")
(1/(stationData3$Facility_Longitude)) %>% hist(main = "Facility_Longitude Reciprocal")

#BoxCox(stationData3$Facility_Longitude, lambda = "auto") %>% hist(main = "Facility_Longitude Box-Cox")

# Facility_Latitude Distributions with Transformation 
Fac_Lat_100 <- stationData3$Facility_Latitude + 100
hist(Fac_Lat_100, main = "Facility_Latitude Untransformed")

#log10(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log10")
#log(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log")
#sqrt(Fac_Lat_100) %>% hist(main = "Facility_Latitude Square root")
(Fac_Lat_100)^2 %>% hist(main = "Facility_Latitude Square")

#(1/(Fac_Lat_100)) %>% hist(main = "Facility_Latitude Reciprocal")
#BoxCox(Fac_Lat_100, lambda = "auto") %>% hist(main = "Facility_Latitude Box-Cox")