MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(stringr)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

Executive Summary

My first step in reprocessing after I had read in the two data sets was to rename each column so that they would be easier to use in R Studio and there was some naming cohesion for both two sets. Column Facilities is removed from the data set as it is untidy but out of scope. To join the two data sets I had to find a variable that they both had in common, this was the Location variable in both sets as it contained the name for each facility. However this variable in the LocationFacilityData also contained the type of station it was, as such it had to be separated into a different column.

Location was then tidied and both data sets were joined using their corresponding Location variable. The Transport_Mode variable was untidy as some cells contained multiple values, so was broken up so that each transport type had its own column stating whether the station had it or not. Each variable then had its data type checked and changed if necessary. Correct leveling for factors was done if they needed it.

The data set was then checked for NA, NAN, infinite and finite variables. The NA values for variables Morning_Peak and Afternoon_Peak were recoded using the mode value for their column. Outliers for the four Latitude and Longitude numerical values were checked using boxplots which found many, however using a scatter plot it was determined that these were not outliers as they were clearly mapping the routes that the facilities were positioned on and as such they were not removed.

Finally several Transformations were made on the Latitude and Longitude facility variables to see if they could better fit a normal distribution, while not perfect there was some improvement.

Data

TfNSW data dictionary was used to define some variables - https://opendata.transport.nsw.gov.au/sites/default/files/TfNSW_LocFac_Op_Loadzone_OffStreet.pdf

LocationFacilityData Data set, read in as facilityData

This data set details the geolocation, facilities and other important information for train stations, ferry wharves and bus interchanges. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/public-transport-location-facilities-and-operators

variable description: Location - Name of facility TSN - Transit Stop Number for each stop at the Station/Wharf/Light Rail/Bus Interchange Facility_Latitude - The latitude coordinates of the facility
Facility_Longitude - The longitude coordinates of the facility EFA_ID - TfNSW data dictionary - “Not to be used” Phone - The phone number for the facility Facility_Address - the facility’s Address Facilities - The utilities available to the public at the facility for example bike racks, payphones etc Accessibility - Transport_Mode - The types of transport avaliable at the facility
Morning_Peak - The busiest time for this facility in the morning Afternoon_Peak - The busiest time for this facility in the afternoon

StationEntrances2020 Data set, read in as stationEntrance

This data set contains a list of every entrance for every station in NSW. It was sourced from the NSW open-data website: https://opendata.transport.nsw.gov.au/dataset/train-station-entrance-locations

variable description: Location - Name of entrance location Street_Name - Name of street entrance is on Street_Type - Type of street entrance is on eg Avenue: Ave, Drive: Dr Entrance_Type - Type of entrance eg stairs, ramp, lift Entrance_Latitude - The latitude coordinates of the entrance Entrance_Longitude - The longitude coordinates of the entrance
Exit_Number - The number designated for an exit, if there is only one exit it is NA

Downloaded the data sets files from the NSW open data website
Put both csv files into the project folder made for this assignment
The readr package was used to read this data using the read_csv command
The column names were checked and appropriately renamed for ease of use in R studio and naming cohesion
The variable used for merging the data sets requires tidying and manipulation for it to be used so merging of the data sets will be done in Tidy & Manipulate Data I
column Facilities is removed from the data set as it is untidy but out of scope

# This is the R chunk for the Data Section

stationEntrance <- read_csv("StationEntrances2020_v4.csv")

## Parsed with column specification:
## cols(
##   Train_Station = col_character(),
##   Street_Name = col_character(),
##   Street_Type = col_character(),
##   Entrance_Type = col_character(),
##   LAT = col_double(),
##   LONG = col_double(),
##   Exit_Number = col_double()
## )

facilityData <- read_csv("LocationFacilityData.csv")

## Parsed with column specification:
## cols(
##   LOCATION_NAME = col_character(),
##   TSN = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   EFA_ID = col_double(),
##   PHONE = col_character(),
##   ADDRESS = col_character(),
##   FACILITIES = col_character(),
##   ACCESSIBILITY = col_character(),
##   TRANSPORT_MODE = col_character(),
##   MORNING_PEAK = col_character(),
##   AFTERNOON_PEAK = col_character()
## )

colnames(facilityData)

##  [1] "LOCATION_NAME"  "TSN"            "LATITUDE"       "LONGITUDE"     
##  [5] "EFA_ID"         "PHONE"          "ADDRESS"        "FACILITIES"    
##  [9] "ACCESSIBILITY"  "TRANSPORT_MODE" "MORNING_PEAK"   "AFTERNOON_PEAK"

colnames(stationEntrance)

## [1] "Train_Station" "Street_Name"   "Street_Type"   "Entrance_Type"
## [5] "LAT"           "LONG"          "Exit_Number"

colnames(facilityData) <- c("Location","TSN","Facility_Latitude","Facility_Longitude","EFA_ID",
                            "Phone","Facility_Address","Facilities","Accessibility",
                            "Transport_Mode","Morning_Peak","Afternoon_Peak")
colnames(stationEntrance) <- c("Location","Street_Name","Street_Type","Entrance_Type",
                               "Entrance_Latitude","Entrance_Longitude","Exit_Number")

facilityData <- facilityData %>% select(-(Facilities))

Tidy & Manipulate Data I

The LocationFacilityData (LFD) Data set requires a lot of tidying while the StationEntrances2020 (SE) Data set is relatively tidy. I plan to join both data sets on their renamed “Location” variable. The SE location variable consists of only the name for each facility, while the LFD location variable has the name and with it the type of station eg. “Station”, “Wharf” this would need to be separated from the variable so a merge of the two data sets can be done. Also I consider both the location name and the station type to be individual variables thus to conform with tidy data principles they should be in their own column.

using str_detect function vectors were made for each station type so as to record true or false if that type is detected within the location variable
mutate function was used to add these vectors onto the data set as columns
each was then converted into a string and then str_replace was used to change the “TRUE” into the station type and the “FALSE” as nothing
the column Station_Type was created using the mutate function and then paste function was used to merge the previously created station type variables.
then station type names were deleted from the location variable using str_remove_all and the brackets were removed using gsub
str_trim was used to remove white space from the ends of the variables
the data sets were then able to be merged using inner join on the variable location to create the stationData data set.

# This is the R chunk for the Tidy & Manipulate Data I 


# Move station type to different column

station1 <- str_detect(facilityData$Location, "Station")
stop1 <- str_detect(facilityData$Location, "Coach Stop")
wharf1 <- str_detect(facilityData$Location, "Wharf")
lightRail1 <- str_detect(facilityData$Location, "Light Rail")

facilityData2 <- mutate(facilityData, 
                        Station = station1,
                        Coach_Stop = stop1,
                        Wharf = wharf1,
                        Light_Rail = lightRail1)

facilityData2$Station <- as.character(facilityData2$Station)
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "TRUE", replacement = "Station")
facilityData2$Station <- str_replace(facilityData2$Station, pattern = "FALSE", replacement = "")

facilityData2$Coach_Stop <- as.character(facilityData2$Coach_Stop)
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "TRUE", replacement = "Coach Stop")
facilityData2$Coach_Stop <- str_replace(facilityData2$Coach_Stop, pattern = "FALSE", replacement = "")

facilityData2$Wharf <- as.character(facilityData2$Wharf)
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "TRUE", replacement = "Wharf")
facilityData2$Wharf <- str_replace(facilityData2$Wharf, pattern = "FALSE", replacement = "")

facilityData2$Light_Rail <- as.character(facilityData2$Light_Rail)
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "TRUE", replacement = "Light Rail")
facilityData2$Light_Rail <- str_replace(facilityData2$Light_Rail, pattern = "FALSE", replacement = "")

facilityData3 <- mutate(facilityData, Station_Type = "")

facilityData3$Station_Type <- paste(facilityData2$Station, facilityData2$Coach_Stop, facilityData2$Wharf, facilityData2$Light_Rail)

# delete column type from location

facilityData3$Location <- str_remove_all(facilityData3$Location, "Station")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Stop")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Coach")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Bus")
facilityData3$Location <- str_remove_all(facilityData3$Location, "Wharf")
facilityData3$Location <- str_replace_all(facilityData3$Location, "Light Rail", "LR")
facilityData3$Location <- str_remove_all(facilityData3$Location, "northbound")
facilityData3$Location <- str_remove_all(facilityData3$Location, "southbound")
facilityData3$Location <- gsub("[][()]", "", facilityData3$Location)

# remove spaces
head(facilityData3$Location)

## [1] "Abbotsford "        "Aberdeen "          "Adamstown "        
## [4] "Adelaide Central  " "Adelong  "          "Albion Park "

facilityData3$Location <- str_trim(facilityData3$Location, side = "right")

head(facilityData3$Station_Type)

## [1] "  Wharf "      "Station   "    "Station   "    "Station   "   
## [5] " Coach Stop  " "Station   "

facilityData3$Station_Type <- str_trim(facilityData3$Station_Type, side = "both")

head(stationEntrance$Location)

## [1] "Aberdeen"  "Aberdeen"  "Adamstown" "Adamstown" "Adamstown" "Adamstown"

# join datasets

stationData <- stationEntrance %>% inner_join(facilityData3, by = "Location")

Tidy & Manipulate Data II

The variable Transport_Mode contains all the types of transport available at a facility as such the strings in this column can consist of multiple values eg “Train, Regional coach, Taxi”. To tidy this column I plan to mutate it into 7 different columns which will answer yes or no if the facility has that mode of transport.

using str_detect function vectors were made for each transport mode so as to record true or false if that type is detected within the location variable
mutate function was used to add these vectors onto the data set as their respective columns
the select function was then used to remove the variable Transport_Mode from the data set
the same process was followed to create a Wheelchair_Accessible column based on the Accessibility
Accessibility column was then deleted as the rest of the variables are considered out of scope and would take to long to separate

# This is the R chunk for the Tidy & Manipulate Data II 

# split transport_mode

train1 <- str_detect(stationData$Transport_Mode, "Train")
bus1 <- str_detect(stationData$Transport_Mode, "Bus")
taxi1 <- str_detect(stationData$Transport_Mode, "Taxi")
lightRail1 <- str_detect(stationData$Transport_Mode, "Light rail")
ferry1 <- str_detect(stationData$Transport_Mode, "Ferry")
regionalCoach1 <- str_detect(stationData$Transport_Mode, "Regional coach")
coach1 <- str_detect(stationData$Transport_Mode, "Coach")

stationData2 <- mutate(stationData, 
                        Train = train1,
                        Bus = bus1,
                        Taxi = taxi1,
                        Light_Rail = lightRail1,
                        Ferry = ferry1,
                        Regional_Coach = regionalCoach1,
                        Coach = coach1)

# remove transport_mode

stationData2 <- stationData2 %>% select(-(Transport_Mode))

# wheelchair accessible

wheelchair <- str_detect(stationData$Accessibility, "This location is wheelchair accessible")

stationData2 <- mutate(stationData2, 
                        Wheelchair_Accessible = wheelchair)

stationData2 <- stationData2 %>% select(-(Accessibility))

Understand

the dimensions of the Merged data set stationData was checked using the function dim
the variable types were checked using the str function
variables Entrance_Type, Exit_Number and Station_Type are character types but should be factors, this change is made using the function as.factor on each variable
Columns Train to Wheelchair_accessible are converted to strings, “TRUE” is replaced with “Yes” and “FALSE” with “No” and then each is factorised and leveled

# This is the R chunk for the Understand Section

dim(stationData2)

## [1] 973  24

str(stationData2)

## tibble [973 x 24] (S3: tbl_df/tbl/data.frame)
##  $ Location             : chr [1:973] "Aberdeen" "Aberdeen" "Adamstown" "Adamstown" ...
##  $ Street_Name          : chr [1:973] "Macqueen" "Macqueen" "Park" "Park" ...
##  $ Street_Type          : chr [1:973] "St" "St" "Ave" "Ave" ...
##  $ Entrance_Type        : chr [1:973] "Ramp" "Stairs" "Path" "Path" ...
##  $ Entrance_Latitude    : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
##  $ Entrance_Longitude   : num [1:973] 151 151 152 152 152 ...
##  $ Exit_Number          : num [1:973] NA NA NA NA NA NA NA NA NA NA ...
##  $ TSN                  : num [1:973] 233610 233610 228920 228920 228920 ...
##  $ Facility_Latitude    : num [1:973] -32.2 -32.2 -32.9 -32.9 -32.9 ...
##  $ Facility_Longitude   : num [1:973] 151 151 152 152 152 ...
##  $ EFA_ID               : num [1:973] 10101192 10101192 10101159 10101159 10101159 ...
##  $ Phone                : chr [1:973] "02 6543 1018" "02 6543 1018" "02 4962 9295" "02 4962 9295" ...
##  $ Facility_Address     : chr [1:973] "Macqueen St, Aberdeen NSW 2336" "Macqueen St, Aberdeen NSW 2336" "Park Ave, Adamstown NSW 2289" "Park Ave, Adamstown NSW 2289" ...
##  $ Morning_Peak         : chr [1:973] "6.00-10am" "6.00-10am" "6.00-10am" "6.00-10am" ...
##  $ Afternoon_Peak       : chr [1:973] "3.00-7pm" "3.00-7pm" "3.00-7pm" "3.00-7pm" ...
##  $ Station_Type         : chr [1:973] "Station" "Station" "Station" "Station" ...
##  $ Train                : logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Bus                  : logi [1:973] FALSE FALSE TRUE TRUE TRUE TRUE ...
##  $ Taxi                 : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Light_Rail           : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Ferry                : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Regional_Coach       : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Coach                : logi [1:973] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Wheelchair_Accessible: logi [1:973] TRUE TRUE TRUE TRUE TRUE TRUE ...

# factoring

stationData2$Street_Type <- as.factor(stationData2$Street_Type)

stationData2$Entrance_Type <- as.factor(stationData2$Entrance_Type)

str(stationData2$Entrance_Type)

##  Factor w/ 5 levels "Escalators","Lift",..: 4 5 3 3 5 4 4 4 4 5 ...

stationData2$Exit_Number <- as.factor(stationData2$Exit_Number)

stationData2$Station_Type <- as.factor(stationData2$Station_Type)

levels(stationData2$Street_Type)

##  [1] "Ave"   "Cl"    "Cres"  "Dr"    "Hwy"   "Lane"  "Pde"   "Pl"    "Plza" 
## [10] "Rd"    "Rd W"  "South" "Sq"    "St"    "St N"  "T-way" "Tce"   "Way"

# factorise and level

stationData3 <- stationData2

stationData3$Train <- as.character(stationData3$Train)
stationData3$Train <- str_replace(stationData3$Train, pattern = "TRUE", replacement = "Yes")
stationData3$Train <- str_replace(stationData3$Train, pattern = "FALSE", replacement = "No")
stationData3$Train <- factor(stationData3$Train, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Bus <- as.character(stationData3$Bus)
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "TRUE", replacement = "Yes")
stationData3$Bus <- str_replace(stationData3$Bus, pattern = "FALSE", replacement = "No")
stationData3$Bus <- factor(stationData3$Bus, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Taxi <- as.character(stationData3$Taxi)
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "TRUE", replacement = "Yes")
stationData3$Taxi <- str_replace(stationData3$Taxi, pattern = "FALSE", replacement = "No")
stationData3$Taxi <- factor(stationData3$Taxi, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Light_Rail <- as.character(stationData3$Light_Rail)
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "TRUE", replacement = "Yes")
stationData3$Light_Rail <- str_replace(stationData3$Light_Rail, pattern = "FALSE", replacement = "No")
stationData3$Light_Rail <- factor(stationData3$Light_Rail, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Ferry <- as.character(stationData3$Ferry)
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "TRUE", replacement = "Yes")
stationData3$Ferry <- str_replace(stationData3$Ferry, pattern = "FALSE", replacement = "No")
stationData3$Ferry <- factor(stationData3$Ferry, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Regional_Coach <- as.character(stationData3$Regional_Coach)
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Regional_Coach <- str_replace(stationData3$Regional_Coach, pattern = "FALSE", replacement = "No")
stationData3$Regional_Coach <- factor(stationData3$Regional_Coach, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Coach <- as.character(stationData3$Coach)
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "TRUE", replacement = "Yes")
stationData3$Coach <- str_replace(stationData3$Coach, pattern = "FALSE", replacement = "No")
stationData3$Coach <- factor(stationData3$Coach, levels = c("Yes", "No"), ordered = TRUE)

stationData3$Wheelchair_Accessible <- as.character(stationData3$Wheelchair_Accessible)
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "TRUE", replacement = "Yes")
stationData3$Wheelchair_Accessible <- str_replace(stationData3$Wheelchair_Accessible, pattern = "FALSE", replacement = "No")
stationData3$Wheelchair_Accessible <- factor(stationData3$Wheelchair_Accessible, levels = c("Yes", "No"), ordered = TRUE)

Scan I

using the function colSums and is.na on the stationData data set the total number of NA values in each column was found
using the following function: sapply(stationData3, function(x) sum(is.nan(x))) the total number of nan values in each column was found
using the following function: sapply(stationData3, function(x) sum(is.infinite(x))) the total number of infinite values in each column was found
using the following function: sapply(stationData3, function(x) sum(is.finite(x))) the total number of finite values in each column was found
Morning_Peak and Afternoon_Peak were converted to factors and then the impute function was used on them to replace the NA values with the mode values for each column

Methodology NA values were found in Exit_Number, Street_Type, Phone, Morning_Peak and Afternoon_Peak, however only the last two columns had anything done to rectify this. This was because not all streets have a type, not all facilities have multiple exits and not all facilities have a phone number, however all facilities should have a peak time. I used mode imputation as this would find the most common value in each column and replace the NA values with it as this is the most likely peak time for these unknown values.

# This is the R chunk for the Scan I

colSums(is.na(stationData3))

##              Location           Street_Name           Street_Type 
##                     0                     0                    24 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                   925                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                    87 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                    18                    18 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0

sapply(stationData3, function(x) sum(is.nan(x)))

##              Location           Street_Name           Street_Type 
##                     0                     0                     0 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                     0                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0

sapply(stationData3, function(x) sum(is.infinite(x)))

##              Location           Street_Name           Street_Type 
##                     0                     0                     0 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                     0                     0                     0 
##           Exit_Number                   TSN     Facility_Latitude 
##                     0                     0                     0 
##    Facility_Longitude                EFA_ID                 Phone 
##                     0                     0                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                     0                     0                     0 
##                  Taxi            Light_Rail                 Ferry 
##                     0                     0                     0 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                     0                     0                     0

sapply(stationData3, function(x) sum(is.finite(x)))

##              Location           Street_Name           Street_Type 
##                     0                     0                   949 
##         Entrance_Type     Entrance_Latitude    Entrance_Longitude 
##                   973                   973                   973 
##           Exit_Number                   TSN     Facility_Latitude 
##                    48                   973                   973 
##    Facility_Longitude                EFA_ID                 Phone 
##                   973                   973                     0 
##      Facility_Address          Morning_Peak        Afternoon_Peak 
##                     0                     0                     0 
##          Station_Type                 Train                   Bus 
##                   973                   973                   973 
##                  Taxi            Light_Rail                 Ferry 
##                   973                   973                   973 
##        Regional_Coach                 Coach Wheelchair_Accessible 
##                   973                   973                   973

stationData3$Morning_Peak <- as.factor(stationData3$Morning_Peak)
stationData3$Morning_Peak <- impute(stationData3$Morning_Peak, fun = mode)

stationData3$Afternoon_Peak <- as.factor(stationData3$Afternoon_Peak)
stationData3$Afternoon_Peak <- impute(stationData3$Afternoon_Peak, fun = mode)

which(is.na(stationData3$Morning_Peak))

## named integer(0)

which(is.na(stationData3$Afternoon_Peak))

## named integer(0)

Scan II

the boxplot function was used to look at the possible outliers for Entrance_Latitude, Entrance_Longitude, Facility_Latitude and Facility_Longitude
scatter plots were made comparing Entrance_Latitude to Entrance_Longitude and Facility_Latitude to Facility_Longitude

Methodology The only viable numerical variable to scan for outliers are the latitude and longitude variables for the entrance and facility locations. Looking at the boxplot there appears to be many outliers for each variable however when we create a scatter plot between the corresponding latitude and longitude values we can clearly see the routes that these stations lie along therefore it would be incorrect to remove any of the Univariate outliers found from the boxplots.

# This is the R chunk for the Scan II

# outliers lat and long

boxplot(stationData3$Entrance_Latitude)

boxplot(stationData3$Entrance_Longitude)

boxplot(stationData3$Facility_Latitude)

boxplot(stationData3$Facility_Longitude)

stationData3 %>% plot(Entrance_Latitude ~ Entrance_Longitude, data = .)

stationData3 %>% plot(Facility_Latitude ~ Facility_Longitude, data = .)

Transform

using the hist functions histograms were made for each latitude and longitude variable to find out if they fit a normal distribution
As we can see none have a normal distribution, however both latitude values and both longitude values share the same distribution so only the facility long and lat values will be transformed.
Log, square root, reciprocal and Box-Cox transformations were used on Facility_Longitude and from it we can see that reciprocal has had the most success in attempting to normalize the variable.
As Facility_Latitude was comprised of negative values a constant of 100 was added to each to bring the values into the positive and saved as Fac_Lat_100
Log, square root, reciprocal and Box-Cox transformations were used on Fac_Lat_100 and from it we can see that Square has had the most success in attempting to normalize the variable.
other transformations commented out for page size limit

# This is the R chunk for the Transform Section

# Untransformed Distributions
#hist(stationData3$Entrance_Latitude)
#hist(stationData3$Entrance_Longitude)
#hist(stationData3$Facility_Latitude)
#hist(stationData3$Facility_Longitude)

# Facility_Longitude Distributions with Transformation 
hist(stationData3$Facility_Longitude, main = "Facility_Longitude Untransformed")

#log10(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log10")
#log(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Log")
#sqrt(stationData3$Facility_Longitude) %>% hist(main = "Facility_Longitude Square root")
#(stationData3$Facility_Longitude)^2 %>% hist(main = "Facility_Longitude Square")
(1/(stationData3$Facility_Longitude)) %>% hist(main = "Facility_Longitude Reciprocal")

#BoxCox(stationData3$Facility_Longitude, lambda = "auto") %>% hist(main = "Facility_Longitude Box-Cox")

# Facility_Latitude Distributions with Transformation 
Fac_Lat_100 <- stationData3$Facility_Latitude + 100
hist(Fac_Lat_100, main = "Facility_Latitude Untransformed")

#log10(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log10")
#log(Fac_Lat_100) %>% hist(main = "Facility_Latitude Log")
#sqrt(Fac_Lat_100) %>% hist(main = "Facility_Latitude Square root")
(Fac_Lat_100)^2 %>% hist(main = "Facility_Latitude Square")

#(1/(Fac_Lat_100)) %>% hist(main = "Facility_Latitude Reciprocal")
#BoxCox(Fac_Lat_100, lambda = "auto") %>% hist(main = "Facility_Latitude Box-Cox")