In this project, we are instructed to choose three of the data sets submitted by other students in the Week 5 discussion forum. For each data set, we are to read the information into R and use tidyr and dplyr to tidy and transform the data, then perform whatever analysis the discussion post requests.

A copy of this R Markdown file and the associated .csv and .sql files are located in my GitHub directory at:
https://github.com/stipton/CUNY-SPS/tree/master/DATA%20607%20Project%202
rm(list = ls())
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(ggplot2)
library(RMySQL)
## Loading required package: DBI
In my tidying and data analysis, I found it helpful to list both the head and the tail of data frames together, so I also defined a function to output head and tail together in a list.
headtail <- function(d, m = 5, n = m) {
list(HEAD = head(d, m), TAIL = tail(d, n))
}
# source: https://stackoverflow.com/questions/11600391/combining-head-and-tail-methods-in-r

Submitted by Albert Gilharry from the Ministry of Health, Belize.
The data set was relatively small and straightforward, so I selected it for the first part of my project as a warm-up to practice the basic skills of importing, tidying, and plotting.
I copied the information from Blackboard into a .CSV file and read the file into R, transforming it into a tbl_df format for easier viewing.
denge <- read.csv("dengefever.csv", stringsAsFactors = FALSE) %>%
tbl_df()
denge
## # A tibble: 14 x 10
## Community Type X2007 X2008 X2009 X2010 X2011 X2012 X2013
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Belize Ci~ Urban 1.28 0.690 2.68 5.13 3.92 6.53 11.7
## 2 San Pedro~ Urban 0.0200 0. 0.320 0.180 0.180 0.220 1.22
## 3 Caye Caul~ Urban 0.0700 0. 0.0900 0.0400 0.0200 0.110 0.200
## 4 Ladyville Rural 0.0600 0.0400 0.640 0.430 0.360 0.450 0.720
## 5 Hattievil~ Rural 0.0100 0.0200 0.0800 0.120 0.0900 0.130 0.170
## 6 Burrel Bo~ Rural 0.0200 0. 0.0700 0.0700 0.110 0.0900 0.400
## 7 Lords Bank Rural NA NA 0.0300 0.110 0.0700 0.150 0.160
## 8 Mahogany ~ Rural NA NA NA NA 0.0100 0.0100 0.0100
## 9 Sand Hill Rural 0.0100 0.0100 0.0900 0.100 0.0400 0.130 0.260
## 10 Bermudan ~ Rural NA NA NA 0.0100 0.0100 0.0100 0.0400
## 11 Biscayne Rural 0.0200 0. 0.0200 0.0400 0.0200 0.0600 0.140
## 12 Boston Rural NA NA NA 0.0100 0.0200 0.0100 0.0100
## 13 Unknown "" 0.0400 0.0200 0.0100 NA 0.300 0.0100 0.0100
## 14 Unknown -~ "" NA NA 0.0200 NA NA 0.0100 0.0300
## # ... with 1 more variable: X2014 <dbl>
To clean the data, I changed the two null values in the Type column to “Unknown.” This label will help with categorization later. I also renamed the columns to remove the X prefix from the year. Finally, I restructured the data into long format with the gather function, creating a “Year” variable.
denge$Type[13:14] <- "Unknown"
colnames(denge)[3:10] <- seq(2007, 2014)
denge <- gather(denge, "Year", "n", 3:10)

With the data tidied, I plotted the occurrence rate for each community over time. We see from the graph that while most communities have an occurrence rate that stays mostly below one in one hundred, the rate for Belize City skyrockets over this time period. This line graph is a useful comparison to show the striking contrast between Belize City and the other communities.
ggplot(denge, aes(Year, n, group = Community)) +
geom_line(aes(color = Community), size = 1)+
geom_point() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_y_continuous(breaks = seq(0, 18, 1))+
labs(title = "Dengue Fever occurrence for selected communities in the Belize District",
x = "Year",
y = "Rate per 100 people")## Warning: Removed 15 rows containing missing values (geom_path).
## Warning: Removed 18 rows containing missing values (geom_point).
I attempted to show what the graph would look like without the line for Belize City (the obvious outlier), but the filter call would not match the community names containing a space. Something to investigate further!
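One guess (not verified against the original file) is that stray whitespace or hidden characters in the Community values kept the exact string match in the commented-out code below from succeeding; a pattern-based filter, sketched here, might work instead:

# Hedged sketch: drop Belize City even if the stored name carries leading or
# trailing whitespace, by matching on a substring rather than the exact value.
denge %>%
  filter(!str_detect(Community, "Belize City")) %>%
  ggplot(aes(Year, n, group = Community)) +
  geom_line(aes(color = Community), size = 1) +
  geom_point()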
# denge %>%
# filter(Community != "Belize City") %>%
# ggplot(aes(Year, n, group = Community)) +
# geom_line(aes(color = Community), size = 1)+
# geom_point() +
# theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
# labs(title = "Dengue Fever occurrence for selected communities in the Belize District",
# x = "Year",
# y = "Rate per 100 people")I also created a jitter plot showing rates by community type. The jitter plot is useful to show that we have more data points for the rural communities than the urban communities, which will help inform how we view the difference between rural and urban communities.
ggplot(denge, aes(Type, n)) +
geom_jitter(aes(color = Community), size = 2) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
labs(title = "Dengue Fever occurrence by Community Type",
x = "Community Type",
y = "Rate per 100 people")## Warning: Removed 18 rows containing missing values (geom_point).
Submitted by Raj Kumar from http://www.childmortality.org
The data file is available to download from a link on the website’s main page as an Excel document. The first few rows contain title and reference information, so I deleted those rows and saved the result to my working directory as a .CSV file beginning with the header row.
Non-data header information from original Excel file
The Excel file also contains definitions of the variables, which I will use to help tidy and transform the data.
Variable definitions
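Rather than deleting the metadata rows by hand, they could also be skipped programmatically when reading the original Excel download (a sketch, assuming the readxl package; the file name and the number of rows to skip are guesses and would need to match the actual download):

# Hypothetical alternative to manual deletion: read the Excel file directly,
# skipping the leading metadata rows. Both the path and skip count are assumed.
library(readxl)
child_mort_raw <- read_excel("UNIGME Rates & Deaths_Under5.xlsx", skip = 6)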
child_mort <- read.csv("PRO2.1 UNIGME Rates & Deaths_Under5.csv", stringsAsFactors = FALSE) %>%
tbl_df()

After importing the .CSV file, I transformed the data set into tbl class, which allows me to examine the structure of the data more easily.
child_mort
## # A tibble: 594 x 405
## ISO.Code CountryName Uncertainty.bounds. U5MR.1950 U5MR.1951 U5MR.1952
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 AFG Afghanistan Lower NA NA NA
## 2 AFG Afghanistan Median NA NA NA
## 3 AFG Afghanistan Upper NA NA NA
## 4 AGO Angola Lower NA NA NA
## 5 AGO Angola Median NA NA NA
## 6 AGO Angola Upper NA NA NA
## 7 ALB Albania Lower NA NA NA
## 8 ALB Albania Median NA NA NA
## 9 ALB Albania Upper NA NA NA
## 10 AND Andorra Lower NA NA NA
## # ... with 584 more rows, and 399 more variables: U5MR.1953 <dbl>,
## # U5MR.1954 <dbl>, U5MR.1955 <dbl>, U5MR.1956 <dbl>, U5MR.1957 <dbl>,
## # U5MR.1958 <dbl>, U5MR.1959 <dbl>, U5MR.1960 <dbl>, U5MR.1961 <dbl>,
## # U5MR.1962 <dbl>, U5MR.1963 <dbl>, U5MR.1964 <dbl>, U5MR.1965 <dbl>,
## # U5MR.1966 <dbl>, U5MR.1967 <dbl>, U5MR.1968 <dbl>, U5MR.1969 <dbl>,
## # U5MR.1970 <dbl>, U5MR.1971 <dbl>, U5MR.1972 <dbl>, U5MR.1973 <dbl>,
## # U5MR.1974 <dbl>, U5MR.1975 <dbl>, U5MR.1976 <dbl>, U5MR.1977 <dbl>,
## # U5MR.1978 <dbl>, U5MR.1979 <dbl>, U5MR.1980 <dbl>, U5MR.1981 <dbl>,
## # U5MR.1982 <dbl>, U5MR.1983 <dbl>, U5MR.1984 <dbl>, U5MR.1985 <dbl>,
## # U5MR.1986 <dbl>, U5MR.1987 <dbl>, U5MR.1988 <dbl>, U5MR.1989 <dbl>,
## # U5MR.1990 <dbl>, U5MR.1991 <dbl>, U5MR.1992 <dbl>, U5MR.1993 <dbl>,
## # U5MR.1994 <dbl>, U5MR.1995 <dbl>, U5MR.1996 <dbl>, U5MR.1997 <dbl>,
## # U5MR.1998 <dbl>, U5MR.1999 <dbl>, U5MR.2000 <dbl>, U5MR.2001 <dbl>,
## # U5MR.2002 <dbl>, U5MR.2003 <dbl>, U5MR.2004 <dbl>, U5MR.2005 <dbl>,
## # U5MR.2006 <dbl>, U5MR.2007 <dbl>, U5MR.2008 <dbl>, U5MR.2009 <dbl>,
## # U5MR.2010 <dbl>, U5MR.2011 <dbl>, U5MR.2012 <dbl>, U5MR.2013 <dbl>,
## # U5MR.2014 <dbl>, U5MR.2015 <dbl>, U5MR.2016 <dbl>, IMR.1950 <dbl>,
## # IMR.1951 <dbl>, IMR.1952 <dbl>, IMR.1953 <dbl>, IMR.1954 <dbl>,
## # IMR.1955 <dbl>, IMR.1956 <dbl>, IMR.1957 <dbl>, IMR.1958 <dbl>,
## # IMR.1959 <dbl>, IMR.1960 <dbl>, IMR.1961 <dbl>, IMR.1962 <dbl>,
## # IMR.1963 <dbl>, IMR.1964 <dbl>, IMR.1965 <dbl>, IMR.1966 <dbl>,
## # IMR.1967 <dbl>, IMR.1968 <dbl>, IMR.1969 <dbl>, IMR.1970 <dbl>,
## # IMR.1971 <dbl>, IMR.1972 <dbl>, IMR.1973 <dbl>, IMR.1974 <dbl>,
## # IMR.1975 <dbl>, IMR.1976 <dbl>, IMR.1977 <dbl>, IMR.1978 <dbl>,
## # IMR.1979 <dbl>, IMR.1980 <dbl>, IMR.1981 <dbl>, IMR.1982 <dbl>,
## # IMR.1983 <dbl>, IMR.1984 <dbl>, IMR.1985 <dbl>, ...
There are several variables intermixed within this data set that I’d like to split into individual columns. ISO.Code, CountryName, and Uncertainty.bounds. already have their own columns. Additionally, I’d like to pull out the age range (Under-5, Infant, and Neonatal) and the year.
This data set also contains both rates of mortality and actual numbers of deaths for each country/year combination. My final tidied data set should have the rates and numbers in separate columns for each country/year.
To start, I followed Raj’s suggestion to simplify the analysis by focusing on only the median values for each country, removing the lower and upper bounds.
child_mort <- filter(child_mort, Uncertainty.bounds. == "Median") %>%
gather("Type.Year", "n", 4:405)I also gathered the data into a long format, defining a new column as Type.Year.
head(child_mort, n = 10)
## # A tibble: 10 x 5
## ISO.Code CountryName Uncertainty.bounds. Type.Year n
## <chr> <chr> <chr> <chr> <dbl>
## 1 AFG Afghanistan Median U5MR.1950 NA
## 2 AGO Angola Median U5MR.1950 NA
## 3 ALB Albania Median U5MR.1950 NA
## 4 AND Andorra Median U5MR.1950 NA
## 5 ARE United Arab Emirates Median U5MR.1950 NA
## 6 ARG Argentina Median U5MR.1950 NA
## 7 ARM Armenia Median U5MR.1950 NA
## 8 ATG Antigua and Barbuda Median U5MR.1950 NA
## 9 AUS Australia Median U5MR.1950 31.6
## 10 AUT Austria Median U5MR.1950 NA
To split Type.Year into two columns, I first reformatted the values in the column. Because some values contain multiple periods, I used the str_replace function with a regular expression to change the delimiter between the type and the year to a hyphen. Once changed, the separate function can split the column in two.
child_mort$Type.Year <- str_replace(child_mort$Type.Year, "\\.1", "-1")
child_mort$Type.Year <- str_replace(child_mort$Type.Year, "\\.2", "-2")
headtail(child_mort, n = 2) # verify that final periods are hyphens
## $HEAD
## # A tibble: 5 x 5
## ISO.Code CountryName Uncertainty.bounds. Type.Year n
## <chr> <chr> <chr> <chr> <dbl>
## 1 AFG Afghanistan Median U5MR-1950 NA
## 2 AGO Angola Median U5MR-1950 NA
## 3 ALB Albania Median U5MR-1950 NA
## 4 AND Andorra Median U5MR-1950 NA
## 5 ARE United Arab Emirates Median U5MR-1950 NA
##
## $TAIL
## # A tibble: 2 x 5
## ISO.Code CountryName Uncertainty.bounds. Type.Year n
## <chr> <chr> <chr> <chr> <dbl>
## 1 ZMB Zambia Median Neonatal.Deaths-2016 14505.
## 2 ZWE Zimbabwe Median Neonatal.Deaths-2016 12246.
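As an aside, the two str_replace calls above could be collapsed into a single pattern that targets only the period sitting directly before the four-digit year (a sketch of an equivalent alternative, not the approach actually used here):

# Hypothetical one-step replacement: hyphenate only the "." that precedes a
# four-digit year, leaving periods inside names like "Under.five.Deaths" alone.
child_mort$Type.Year <- str_replace(child_mort$Type.Year, "\\.(\\d{4})$", "-\\1")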
child_mort <- separate(child_mort, col = Type.Year, into = c("AgeRange", "Year"), sep = "-")
headtail(child_mort)
## $HEAD
## # A tibble: 5 x 6
## ISO.Code CountryName Uncertainty.bounds. AgeRange Year n
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 AFG Afghanistan Median U5MR 1950 NA
## 2 AGO Angola Median U5MR 1950 NA
## 3 ALB Albania Median U5MR 1950 NA
## 4 AND Andorra Median U5MR 1950 NA
## 5 ARE United Arab Emirates Median U5MR 1950 NA
##
## $TAIL
## # A tibble: 5 x 6
## ISO.Code CountryName Uncertainty.bounds. AgeRange Year n
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 WSM Samoa Median Neonatal.Deaths 2016 44.
## 2 YEM Yemen Median Neonatal.Deaths 2016 23371.
## 3 ZAF South Africa Median Neonatal.Deaths 2016 14550.
## 4 ZMB Zambia Median Neonatal.Deaths 2016 14505.
## 5 ZWE Zimbabwe Median Neonatal.Deaths 2016 12246.
Looking at the values in the n column, some are rates and some are counts, depending on the value in the AgeRange column. I used the filter command to split the rows by their AgeRange value and then joined the two filtered tables back together with the full_join function. I also cleaned up the column names and made sure they match for the join.
distinct(child_mort, AgeRange)
## # A tibble: 6 x 1
## AgeRange
## <chr>
## 1 U5MR
## 2 IMR
## 3 NMR
## 4 Under.five.Deaths
## 5 Infant.Deaths
## 6 Neonatal.Deaths
child_mort.rates <- child_mort %>%
filter(AgeRange == "U5MR" | AgeRange == "IMR" | AgeRange == "NMR")
child_mort.rates$AgeRange <- recode(child_mort.rates$AgeRange,
U5MR = "under five",
IMR = "infant",
NMR = "neonatal")
colnames(child_mort.rates) <- c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year", "rate")
child_mort.counts <- child_mort %>%
filter(!(AgeRange %in% c("U5MR", "IMR", "NMR")))
child_mort.counts$AgeRange <- recode(child_mort.counts$AgeRange,
Under.five.Deaths = "under five",
Infant.Deaths = "infant",
Neonatal.Deaths = "neonatal")
colnames(child_mort.counts) <- c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year", "deaths")
headtail(child_mort.rates)
## $HEAD
## # A tibble: 5 x 6
## ISOCode countryName uncertaintyBounds ageRange year rate
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 AFG Afghanistan Median under five 1950 NA
## 2 AGO Angola Median under five 1950 NA
## 3 ALB Albania Median under five 1950 NA
## 4 AND Andorra Median under five 1950 NA
## 5 ARE United Arab Emirates Median under five 1950 NA
##
## $TAIL
## # A tibble: 5 x 6
## ISOCode countryName uncertaintyBounds ageRange year rate
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 WSM Samoa Median neonatal 2016 9.20
## 2 YEM Yemen Median neonatal 2016 26.8
## 3 ZAF South Africa Median neonatal 2016 12.4
## 4 ZMB Zambia Median neonatal 2016 22.9
## 5 ZWE Zimbabwe Median neonatal 2016 22.9
headtail(child_mort.counts)
## $HEAD
## # A tibble: 5 x 6
## ISOCode countryName uncertaintyBounds ageRange year deaths
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 AFG Afghanistan Median under five 1950 NA
## 2 AGO Angola Median under five 1950 NA
## 3 ALB Albania Median under five 1950 NA
## 4 AND Andorra Median under five 1950 NA
## 5 ARE United Arab Emirates Median under five 1950 NA
##
## $TAIL
## # A tibble: 5 x 6
## ISOCode countryName uncertaintyBounds ageRange year deaths
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 WSM Samoa Median neonatal 2016 44.
## 2 YEM Yemen Median neonatal 2016 23371.
## 3 ZAF South Africa Median neonatal 2016 14550.
## 4 ZMB Zambia Median neonatal 2016 14505.
## 5 ZWE Zimbabwe Median neonatal 2016 12246.
child_mort <- full_join(child_mort.rates, child_mort.counts)
## Joining, by = c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year")
headtail(child_mort)
## $HEAD
## # A tibble: 5 x 7
## ISOCode countryName uncertaintyBounds ageRange year rate deaths
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 AFG Afghanistan Median under fi~ 1950 NA NA
## 2 AGO Angola Median under fi~ 1950 NA NA
## 3 ALB Albania Median under fi~ 1950 NA NA
## 4 AND Andorra Median under fi~ 1950 NA NA
## 5 ARE United Arab Emir~ Median under fi~ 1950 NA NA
##
## $TAIL
## # A tibble: 5 x 7
## ISOCode countryName uncertaintyBounds ageRange year rate deaths
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 WSM Samoa Median neonatal 2016 9.20 44.
## 2 YEM Yemen Median neonatal 2016 26.8 23371.
## 3 ZAF South Africa Median neonatal 2016 12.4 14550.
## 4 ZMB Zambia Median neonatal 2016 22.9 14505.
## 5 ZWE Zimbabwe Median neonatal 2016 22.9 12246.
The discussion post did not contain any specific instructions for analysis. To begin, I decided to create a visualization showing the average worldwide child mortality rates, separated out by the Age Range. The first plot shows the three groups together on one graph. I practiced using many different elements of the ggplot2 package to format the various pieces of the graph, such as the x-axis labels and the legend title.
child_mort %>%
group_by(ageRange, year) %>%
summarize(avg = mean(rate, na.rm = TRUE)) %>%
ggplot(aes(year, avg, fill = ageRange)) +
geom_bar(stat = 'identity', position = 'dodge') +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_discrete(name = "Year", breaks = seq(1950, 2016, 3)) +
scale_y_continuous(name = "Average Child Mortality Rate (out of 1000 live births)", breaks = seq(0, 180, 20)) +
ggtitle("Child Mortality Rates by Age Range", "Worldwide Average") +
scale_fill_discrete(name = "Age Range")## Warning: Removed 1 rows containing missing values (geom_bar).
Below I created a graph showing the same information, but using facet_grid to separate each Age Range into its own graph.
child_mort %>%
group_by(ageRange, year) %>%
summarize(avg = mean(rate, na.rm = TRUE)) %>%
ggplot(aes(year, avg)) +
geom_point() +
facet_grid(. ~ ageRange) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_discrete(name = "Year", breaks = seq(1950, 2016, 6)) +
scale_y_continuous(name = "Average Child Mortality Rate (out of 1000 live births)", breaks = seq(0, 180, 20)) +
ggtitle("Child Mortality Rates by Age Range", "Worldwide Average") +
scale_fill_discrete(name = "Age Range")## Warning: Removed 1 rows containing missing values (geom_point).
I noted that for both of these graphs, R automatically removed the missing data point for the 1950 neonatal mortality rate. I also noted that overall we see a welcome downward trend in child mortality, with the sharpest decrease occurring in the under-five age range during this time span.
Submitted by Jeremy O’Brien
This was the largest of my three data sets, so I decided to challenge myself to create a SQL database with the MTA information and connect it to another data set with additional information.
I created a .CSV file with Jeremy’s information from his github link (https://raw.githubusercontent.com/JeremyOBrien16/NYC-MTA-Weekly-MetroCard-Swipes/master/Fare_Card_History_for_Metropolitan_Transportation_Authority__MTA___Beginning_2010.csv). I wanted to find additional information to add to his data set, and I found a data set from the MTA website online containing line information for each station by the remote booth number listed in Jeremy’s data set (http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls).
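For reference, the raw fare-card file could also be pulled straight from that link into R (a sketch only; in this project the file is loaded into MySQL first and read from there):

# Read Jeremy's fare-card CSV directly from GitHub (sketch; not used below).
mta_url <- "https://raw.githubusercontent.com/JeremyOBrien16/NYC-MTA-Weekly-MetroCard-Swipes/master/Fare_Card_History_for_Metropolitan_Transportation_Authority__MTA___Beginning_2010.csv"
ridedata_raw <- read.csv(mta_url, stringsAsFactors = FALSE)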
Unable to find a list matching boroughs to the remote station IDs, I added the boroughs to the .CSV file manually. I then did an initial clean of the stationlines data set in the .CSV file before uploading to SQL, so that each remote station ID appears only once. In doing so, I lost the ability to track the subway division (IRT vs. IND, etc.), but for the purposes of this project I decided to focus solely on the subway lines and boroughs rather than the division data.
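The same uniqueness constraint could be expressed in R (a rough sketch, assuming the hand-edited station file were loaded as a data frame named stationlines_raw; the actual cleaning was done directly in the .CSV):

# Hypothetical: keep one row per remote booth ID, dropping the Division detail
# as described above. stationlines_raw is an assumed object name.
stationlines_clean <- stationlines_raw %>%
  distinct(Remote, .keep_all = TRUE) %>%
  select(Remote, Station, LineName, Borough)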
Additionally, in uploading the ridedata file to the SQL database (using the loadmta file found on my GitHub page), duplicate rows, as determined by the primary key defined for that table, were discarded. Of the 180,608 original entries, 179,679 unique observations were uploaded into the SQL database.
-- loadmta.sql
DROP TABLE IF EXISTS ridedata;
DROP TABLE IF EXISTS stationlines;
CREATE TABLE ridedata (
FromDate DATE,
ToDate DATE,
RemoteStationID CHAR(4),
Station VARCHAR(50),
FullFare INTEGER,
SeniorCitizen_Disabled INTEGER,
SevenDayADAFarecardAccessSystemUnlimited INTEGER,
ThirtyDayADAFarecardAccessSystemUnlimited INTEGER,
JointRailRoadTicket INTEGER,
SevenDayUnlimited INTEGER,
ThirtyDayUnlimited INTEGER,
FourteenDayReducedFareMediaUnlimited INTEGER,
OneDayUnlimited INTEGER,
FourteenDayUnlimited INTEGER,
SevenDayExpressBusPass INTEGER,
TransitCheckMetrocard INTEGER,
LIBSpecialSenior INTEGER,
RailRoadUnlimitedNoTrade INTEGER,
TransitCheckMetrocardAnnualMetrocard INTEGER,
MailandRideEZPassExpress INTEGER,
MailandRideUnlimited INTEGER,
Path2Trip INTEGER,
AirtranFullFare INTEGER,
Airtran30Day INTEGER,
Airtran10Trip INTEGER,
AirtranMonthly INTEGER,
CONSTRAINT PK_ridedata PRIMARY KEY (FromDate, ToDate, RemoteStationID)
);
LOAD DATA LOCAL INFILE 'C:/Users/Steve Tipton/Documents/CUNY/DATA 607/proj2/ridedata.csv'
INTO TABLE ridedata
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
CREATE TABLE stationlines (
Remote CHAR(4),
Station VARCHAR(50),
LineName VARCHAR(20),
Division CHAR(3),
CONSTRAINT PK_stationlines PRIMARY KEY (Remote, Station, Division)
);
LOAD DATA LOCAL INFILE 'C:/Users/Steve Tipton/Documents/CUNY/DATA 607/proj2/stationlines.csv'
INTO TABLE stationlines
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
select * from stationlines;

I connected MySQL to R and read in the station list table first.
driver <- dbDriver("MySQL")
con <- dbConnect(driver,
user = "root",
password = password, ## password hidden in previous code
dbname = "mta")
dbListTables(con)
## [1] "ridedata" "stationlines"
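As a quick sanity check (not part of the original load script), the de-duplicated row count mentioned above can be confirmed directly from R:

# Confirm how many rows survived the primary-key de-duplication (sketch).
dbGetQuery(con, "SELECT COUNT(*) AS n FROM ridedata")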
sql_getStationList <- "SELECT Remote, Station, LineName, Borough FROM stationlines"
stationList <- dbGetQuery(con, sql_getStationList)

The empty values in the Borough column imported as "" characters, so I converted them to NA values.
stationList[!(stationList$Borough %in% c("MANHT","BKLYN","QUEEN","BRONX")),4] <- NA

The LineName column of this table holds, for each station, a single character string concatenating all the lines that serve that station. I extracted that information from the strings into an atomic variable for each subway line.
## extract all unique strings from the LineName column and split into single characters
trainLines <- stationList$LineName %>%
unique() %>%
paste(collapse = "") %>%
str_split("") %>%
unlist() %>%
unique() ## retain a vector of unique single characters referencing all subway lines in the data set
trainLines <- trainLines[order(trainLines)]

Using my vector of unique train lines, I use two for loops to check whether each station in the stationlines data set is served by that line. As I complete the check for each line, I add its corresponding column of indicators to the data set. Lastly, I rename the newly added columns with the names of the train lines.
for(i in 1:length(trainLines)) {
  x <- vector(length = length(stationList$LineName)) ## indicator vector for line i
  for(j in 1:length(stationList$LineName)) {
    if(str_detect(stationList$LineName[j], trainLines[i])) {
      x[j] <- 1
    } else {
      x[j] <- 0
    }
  }
  stationList <- cbind.data.frame(stationList, x) ## append the indicator column
}
colnames(stationList)[5:length(stationList[1,])] <- trainLines

The stationList data set now has columns of 1's and 0's showing whether or not a line goes through each station.
headtail(stationList)
## $HEAD
## Remote Station LineName Borough 1 2 3 4 5 6 7 A B C D E F G J L M
## 1 R001 WHITEHALL ST R1 MANHT 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 R002 FULTON ST ACJZ2345 MANHT 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0
## 3 R003 CYPRESS HILLS J BKLYN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 4 R004 ELDERTS LANE JZ QUEEN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 5 R005 FOREST PARKWAY J QUEEN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## N P Q R S Z
## 1 0 0 0 1 0 0
## 2 0 0 0 0 0 1
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 1
## 5 0 0 0 0 0 0
##
## $TAIL
## Remote Station LineName Borough 1 2 3 4 5 6 7 A B C D E F G J
## 476 R548 CHRISTOPHER ST P MANHT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 477 R549 NEWARK HW BMEBE P <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 478 R550 LACKAWANNA P <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 479 R551 GROVE STREET P <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 480 R552 JOURNAL SQUARE P <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## L M N P Q R S Z
## 476 0 0 0 1 0 0 0 0
## 477 0 0 0 1 0 0 0 0
## 478 0 0 0 1 0 0 0 0
## 479 0 0 0 1 0 0 0 0
## 480 0 0 0 1 0 0 0 0
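For reference, the same station-by-line indicator matrix could be built without explicit loops by mapping str_detect over the vector of line characters (a sketch of an equivalent approach, not run here):

## Vectorized alternative to the nested loops above; fixed() treats each line
## character as a literal string rather than a regular expression.
flag_matrix <- sapply(trainLines, function(line) {
  as.integer(str_detect(stationList$LineName, fixed(line)))
})
## stationList <- cbind(stationList[, 1:4], flag_matrix)  # equivalent result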
Finally, I read the ridedata from SQL into R and joined the two data sets together with dplyr's left_join.
sql_getRidedata <- "SELECT * FROM ridedata"
ridedata <- dbGetQuery(con, sql_getRidedata)
colnames(ridedata)[3] <- "Remote" ## rename column to facilitate join
mtadata <- left_join(ridedata, stationList, by = "Remote")
headtail(mtadata)
## $HEAD
## FromDate ToDate Remote Station.x FullFare
## 1 2010-05-29 2010-06-04 R001 WHITEHALL STREET 56961
## 2 2010-05-29 2010-06-04 R002 FULTON ST & BROADWAY NASSAU 16210
## 3 2010-05-29 2010-06-04 R003 CYPRESS HILLS 3292
## 4 2010-05-29 2010-06-04 R004 75TH STREET & ELDERTS LANE 7774
## 5 2010-05-29 2010-06-04 R005 85TH STREET & FOREST PKWAY 8794
## SeniorCitizen_Disabled SevenDayADAFarecardAccessSystemUnlimited
## 1 1732 203
## 2 503 41
## 3 110 7
## 4 257 20
## 5 420 32
## ThirtyDayADAFarecardAccessSystemUnlimited JointRailRoadTicket
## 1 883 358
## 2 157 29
## 3 41 0
## 4 125 0
## 5 142 1
## SevenDayUnlimited ThirtyDayUnlimited
## 1 17654 32369
## 2 3536 11462
## 3 1404 1344
## 4 2878 4472
## 5 2997 5297
## FourteenDayReducedFareMediaUnlimited OneDayUnlimited
## 1 59 6733
## 2 8 451
## 3 0 36
## 4 1 79
## 5 0 88
## FourteenDayUnlimited SevenDayExpressBusPass TransitCheckMetrocard
## 1 1835 426 892
## 2 598 50 389
## 3 174 5 20
## 4 278 7 27
## 5 353 6 48
## LIBSpecialSenior RailRoadUnlimitedNoTrade
## 1 0 1035
## 2 0 90
## 3 0 4
## 4 0 9
## 5 0 10
## TransitCheckMetrocardAnnualMetrocard MailandRideEZPassExpress
## 1 4798 431
## 2 3578 83
## 3 154 9
## 4 409 20
## 5 317 9
## MailandRideUnlimited Path2Trip AirtranFullFare Airtran30Day
## 1 91 0 418 0
## 2 51 0 70 0
## 3 1 0 5 0
## 4 2 0 20 0
## 5 22 0 28 0
## Airtran10Trip AirtranMonthly Station.y LineName Borough 1 2 3 4 5 6
## 1 0 0 WHITEHALL ST R1 MANHT 1 0 0 0 0 0
## 2 0 0 FULTON ST ACJZ2345 MANHT 0 1 1 1 1 0
## 3 0 0 CYPRESS HILLS J BKLYN 0 0 0 0 0 0
## 4 0 0 ELDERTS LANE JZ QUEEN 0 0 0 0 0 0
## 5 0 0 FOREST PARKWAY J QUEEN 0 0 0 0 0 0
## 7 A B C D E F G J L M N P Q R S Z
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 2 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1
## 3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
## 5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
##
## $TAIL
## FromDate ToDate Remote Station.x FullFare
## 179675 2018-02-10 2018-02-16 R569 SBS-BX6 @ LIVINGSTON PLAZA 0
## 179676 2018-02-10 2018-02-16 R570 72ND STREET - 2 AVENUE 72085
## 179677 2018-02-10 2018-02-16 R571 86TH STREET - 2 AVENUE 55230
## 179678 2018-02-10 2018-02-16 R572 96TH STREET - 2 AVENUE 36272
## 179679 2018-02-10 2018-02-16 R573 SBS-Q52/53 @ LIVINGSTON PLAZA 0
## SeniorCitizen_Disabled SevenDayADAFarecardAccessSystemUnlimited
## 179675 0 0
## 179676 6629 282
## 179677 5694 262
## 179678 3241 225
## 179679 0 0
## ThirtyDayADAFarecardAccessSystemUnlimited JointRailRoadTicket
## 179675 0 0
## 179676 864 217
## 179677 810 59
## 179678 1006 40
## 179679 0 0
## SevenDayUnlimited ThirtyDayUnlimited
## 179675 0 0
## 179676 18717 50393
## 179677 17908 52231
## 179678 16670 37182
## 179679 0 0
## FourteenDayReducedFareMediaUnlimited OneDayUnlimited
## 179675 0 0
## 179676 1 0
## 179677 0 0
## 179678 0 0
## 179679 0 0
## FourteenDayUnlimited SevenDayExpressBusPass TransitCheckMetrocard
## 179675 0 0 0
## 179676 0 269 1280
## 179677 0 120 1289
## 179678 0 95 617
## 179679 0 0 0
## LIBSpecialSenior RailRoadUnlimitedNoTrade
## 179675 0 0
## 179676 827 194
## 179677 540 87
## 179678 509 41
## 179679 0 0
## TransitCheckMetrocardAnnualMetrocard MailandRideEZPassExpress
## 179675 0 0
## 179676 7086 3751
## 179677 7493 3017
## 179678 5330 1673
## 179679 0 0
## MailandRideUnlimited Path2Trip AirtranFullFare Airtran30Day
## 179675 0 0 0 0
## 179676 1637 0 625 0
## 179677 1682 0 695 0
## 179678 1059 0 389 0
## 179679 0 0 0 0
## Airtran10Trip AirtranMonthly Station.y LineName Borough 1 2 3 4
## 179675 0 0 <NA> <NA> <NA> NA NA NA NA
## 179676 0 0 <NA> <NA> <NA> NA NA NA NA
## 179677 0 0 <NA> <NA> <NA> NA NA NA NA
## 179678 0 0 <NA> <NA> <NA> NA NA NA NA
## 179679 0 0 <NA> <NA> <NA> NA NA NA NA
## 5 6 7 A B C D E F G J L M N P Q R S Z
## 179675 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179676 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179677 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179678 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179679 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I took a quick look at the stations in the ridedata set that did not have a match in the stationList data set. These are mostly bus depots and Select Bus Service fare machines, along with newer stations such as the 34th Street station at Hudson Yards and the stations along the Second Avenue Q line extension. For the purposes of this project, and since a great deal of the data did match, I decided to work with just the matched records.
z <- subset(mtadata, is.na(LineName))
distinct(z, Remote, Station.x)
## Remote Station.x
## 1 R026 SOUTH FERRY
## 2 R516 MTABC - COLLEGE POINT DEPOT
## 3 R072 34TH STREET - HUDSON YARDS
## 4 R072 34TH STREET - HUDSON YARDS
## 5 R572 96TH STREET - 2 AVENUE
## 6 R570 72ND STREET - 2 AVENUE
## 7 R571 86TH STREET - 2 AVENUE
## 8 R475 GRAND AVENUE DEPOT
## 9 R476 GUNHILL DEPOT
## 10 R477 YUKON DEPOT
## 11 R478 CASTLETON DEPOT
## 12 R479 QUEENS VILLAGE DEPOT
## 13 R480 CASEY STENGEL DEPOT
## 14 R481 JAMAICA DEPOT
## 15 R482 FRESH POND DEPOT
## 16 R483 EAST NEW YORK DEPOT
## 17 R484 JACKIE GLEASON DEPOT
## 18 R485 FLATBUSH DEPOT
## 19 R486 ULMER PARK DEPOT
## 20 R487 KINGSBRIDGE DEPOT
## 21 R488 JACKIE GLEASON DEPOT(2)
## 22 R489 MOTHER CLARA HALE DEPOT
## 23 R492 MANHATTANVILLE DEPOT
## 24 R493 CHARLESTON DEPOT
## 25 R494 WEST FARMS DEPOT
## 26 R495 MICHAEL J. QUILL DEPOT
## 27 R496 TUSKEGEE AIRMEN DEPOT
## 28 R515 MTABC - LA GUARDIA DEPOT
## 29 R517 MTABC - JFK DEPOT
## 30 R518 MTABC - FAR ROCKAWAY DEPOT
## 31 R519 MTABC - SPRING CREEK DEPOT
## 32 R520 MTABC - BAISLEY DEPOT
## 33 R521 MTABC - EASTCHESTER DEPOT
## 34 R523 MITCHELL FIELD BUS DEPOT
## 35 R524 ROCKVILLE CENTER BUS DEPOT
## 36 R530 WESTCHESTER CMF YONKERS DEPOT
## 37 R531 WESTCHESTER VALHALLA DEPOT
## 38 R533 MTABC - COLLEGE POINT DEPOT2*
## 39 R555 SBS-M15 @ LIVINGSTON PLAZA
## 40 R556 SBS-BX12 @ LIVINGSTON PLAZA
## 41 R557 SBS-M34 @ LIVINGSTON PLAZA
## 42 R558 SBS-BX41 @ LIVINGSTON PLAZA
## 43 R559 SBS-B44 @ LIVINGSTON PLAZA
## 44 R560 SBS-M60 @ LIVINGSTON PLAZA
## 45 R561 SBS-M86 @ LIVINGSTON PLAZA
## 46 R562 MTABC - EASTCHESTER 2
## 47 R563 SBS-B46 @ LIVINGSTON PLAZA
## 48 R564 SBS-Q44 @ LIVINGSTON PLAZA
## 49 R567 SBS-M23 @ LIVINGSTON PLAZA
## 50 R516 MTABC - COLLEGE POINT DEPOT
## 51 R528 METRO NORTH HUDSON RAIL DEPOT
## 52 R568 SBS-M79 @ LIVINGSTON PLAZA
## 53 R569 SBS-BX6 @ LIVINGSTON PLAZA
## 54 R573 SBS-Q52/53 @ LIVINGSTON PLAZA
mtadata <- subset(mtadata, !is.na(LineName))

In the discussion board post, one question arose around full fare vs. unlimited rides. Using the mutate function, I computed a full_unlim_score from the raw data. For each row, I took the difference between the total of all unlimited ride types and the total of full fare rides (regular plus senior/disabled). I used a difference rather than a ratio to avoid dividing by zero.
mtadata <- mutate(mtadata,
                  total_unlim = (SevenDayADAFarecardAccessSystemUnlimited + ThirtyDayADAFarecardAccessSystemUnlimited + SevenDayUnlimited + ThirtyDayUnlimited + FourteenDayReducedFareMediaUnlimited + OneDayUnlimited + FourteenDayUnlimited),
                  total_full = (FullFare + SeniorCitizen_Disabled),
                  full_unlim_score = total_unlim - total_full)
mtadata_score <- select(mtadata, FromDate, ToDate, Remote, Station.x, Borough, total_unlim, total_full, full_unlim_score)
head(mtadata_score)
## FromDate ToDate Remote Station.x Borough
## 1 2010-05-29 2010-06-04 R001 WHITEHALL STREET MANHT
## 2 2010-05-29 2010-06-04 R002 FULTON ST & BROADWAY NASSAU MANHT
## 3 2010-05-29 2010-06-04 R003 CYPRESS HILLS BKLYN
## 4 2010-05-29 2010-06-04 R004 75TH STREET & ELDERTS LANE QUEEN
## 5 2010-05-29 2010-06-04 R005 85TH STREET & FOREST PKWAY QUEEN
## 6 2010-05-29 2010-06-04 R006 WOODHAVEN BOULEVARD QUEEN
## total_unlim total_full full_unlim_score
## 1 59736 58693 1043
## 2 16253 16713 -460
## 3 3006 3402 -396
## 4 7853 8031 -178
## 5 8909 9214 -305
## 6 10486 8824 1662
First, I created a scatterplot comparing the Unlimited Ride swipes to the Full Fare swipes, using color to distinguish the boroughs. The two are, logically, positively correlated, and each borough shows a similar pattern.
library(scales) ## to format the axis numbers
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(mtadata_score[!(is.na(mtadata_score$Borough)),], aes(x=total_unlim, y=total_full)) +
geom_point(aes(color = Borough)) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(subtitle="Unlimited Ride vs Full Fare Swipes",
y="Full Fare Swipes",
x="Unlimited Ride Swipes",
title="Scatterplot")ggplot(mtadata_score[!(is.na(mtadata_score$Borough)),], aes(x=total_unlim, y=total_full)) +
geom_point(aes(color = Borough)) +
geom_smooth(se=F) +
facet_grid(. ~ Borough) +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(subtitle="Unlimited Ride vs Full Fare Swipes",
y="Full Fare Swipes",
x="Unlimited Ride Swipes",
title="Scatterplot by Borough")## `geom_smooth()` using method = 'gam'
Next, I grouped the information by Remote ID and took the average full_unlim_score for each station. I then gathered that information by train line and found the average difference for each line to create a bar graph. In this data, the B, D, F, M, and L lines lead in unlimited passes, while the A, C, E, and S lines favor full fare swipes.
avgByLine <- mtadata_score %>%
group_by(Remote) %>%
summarize(avg = mean(full_unlim_score, na.rm = TRUE)) %>%
arrange(avg) %>%
left_join(stationList) %>%
gather("Line", "YesNo", 6:28) %>%
mutate(line_score = avg * YesNo) %>%
group_by(Line) %>%
summarize(LineAvg = mean(line_score, na.rm = TRUE)) %>%
arrange(LineAvg) %>%
filter(Line != "P") %>% ## removing PATH trains from NYC MTA analysis
ggplot(aes(Line, LineAvg)) +
geom_bar(stat = "identity") +
labs(title = "Average Difference in Unlimited Ride vs. Full Fare Swipes",
x = "MTA Line",
y = "Average Difference")## Joining, by = "Remote"
print(avgByLine)

Note: the ggplot renders in RStudio but not when knitted. Below is an image of the completed ggplot captured from RStudio:
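One possible way to get the plot to render on knit (a guess on my part, not verified against the original .Rmd) is to build the summary table and the plot object in separate steps and print the plot explicitly inside an evaluated chunk:

## Hypothetical restructuring for troubleshooting the knit issue.
lineAvgs <- mtadata_score %>%
  group_by(Remote) %>%
  summarize(avg = mean(full_unlim_score, na.rm = TRUE)) %>%
  left_join(stationList, by = "Remote") %>%
  gather("Line", "YesNo", 6:28) %>%
  mutate(line_score = avg * YesNo) %>%
  group_by(Line) %>%
  summarize(LineAvg = mean(line_score, na.rm = TRUE)) %>%
  filter(Line != "P") ## removing PATH trains from NYC MTA analysis

p <- ggplot(lineAvgs, aes(Line, LineAvg)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Difference in Unlimited Ride vs. Full Fare Swipes",
       x = "MTA Line",
       y = "Average Difference")
print(p) ## explicit print so the plot is rendered when knitted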