Overview and Environment Preparation

In this project, we are instructed to choose three of the data sets submitted by other students in the Week 5 discussion forum. For each data set, we are to:

A copy of this R Markdown file and the associated .csv and .sql files are located in my Github directory at:

https://github.com/stipton/CUNY-SPS/tree/master/DATA%20607%20Project%202

rm(list = ls())
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stringr)
library(ggplot2)
library(RMySQL)
## Loading required package: DBI

In my tidying and data analysis, I found it helpful to list both the head and the tail of data frames together, so I also defined a function to output head and tail together in a list.

headtail <- function(d, m = 5, n = m) {
  list(HEAD = head(d, m), TAIL = tail(d, n))
}
# source: https://stackoverflow.com/questions/11600391/combining-head-and-tail-methods-in-r

Part One - Dengue Fever Occurrence in the Belize District

Submitted by Albert Gilharry from the Ministry of Health, Belize.

Create .CSV file and read into R

The data set was relatively small and straight-forward, so I selected it for the first part of my project as a warm-up to practice the basic skills of importing, tidying, and plotting.

I copied the information from Blackboard into a .CSV file and read the file into R, transforming it into a tbl_df format for easier viewing.

denge <- read.csv("dengefever.csv", stringsAsFactors = FALSE) %>% 
  tbl_df()
denge
## # A tibble: 14 x 10
##    Community  Type    X2007   X2008   X2009   X2010   X2011  X2012   X2013
##    <chr>      <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 Belize Ci~ Urban  1.28    0.690   2.68    5.13    3.92   6.53   11.7   
##  2 San Pedro~ Urban  0.0200  0.      0.320   0.180   0.180  0.220   1.22  
##  3 Caye Caul~ Urban  0.0700  0.      0.0900  0.0400  0.0200 0.110   0.200 
##  4 Ladyville  Rural  0.0600  0.0400  0.640   0.430   0.360  0.450   0.720 
##  5 Hattievil~ Rural  0.0100  0.0200  0.0800  0.120   0.0900 0.130   0.170 
##  6 Burrel Bo~ Rural  0.0200  0.      0.0700  0.0700  0.110  0.0900  0.400 
##  7 Lords Bank Rural NA      NA       0.0300  0.110   0.0700 0.150   0.160 
##  8 Mahogany ~ Rural NA      NA      NA      NA       0.0100 0.0100  0.0100
##  9 Sand Hill  Rural  0.0100  0.0100  0.0900  0.100   0.0400 0.130   0.260 
## 10 Bermudan ~ Rural NA      NA      NA       0.0100  0.0100 0.0100  0.0400
## 11 Biscayne   Rural  0.0200  0.      0.0200  0.0400  0.0200 0.0600  0.140 
## 12 Boston     Rural NA      NA      NA       0.0100  0.0200 0.0100  0.0100
## 13 Unknown    ""     0.0400  0.0200  0.0100 NA       0.300  0.0100  0.0100
## 14 Unknown -~ ""    NA      NA       0.0200 NA      NA      0.0100  0.0300
## # ... with 1 more variable: X2014 <dbl>

Tidy and Transform

To clean the data, I changed the two null values in the Type column to have the value “Unknown.” This label will help with categorization later. I also renamed the columns to remove the x from the year. Finally, I want to restructure this data in to the long format using the gather format on the variable “Year.”

denge$Type[13:14] <- "Unknown"
colnames(denge)[3:10] <- seq(2007, 2014)
denge <- gather(denge, "Year", "n", 3:10)

Analysis

With the data tidied, I plotted the occurrence rate for each community over time. We see from the graph that while most communities have an occurrence rate that stays mostly below one in one hundred, the rate for Belize City skyrockets over this time period. This line graph is a useful comparison to show the striking contrast between Belize City and the other communities.

ggplot(denge, aes(Year, n, group = Community)) +
  geom_line(aes(color = Community), size = 1)+
  geom_point() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 18, 1))+
  labs(title = "Dengue Fever occurrence for selected communities in the Belize District",
       x = "Year",
       y = "Rate per 100 people")
## Warning: Removed 15 rows containing missing values (geom_path).
## Warning: Removed 18 rows containing missing values (geom_point).

I attempted to show what a graph of the data would look like without the line for Belize City - the obvious outlier - but the filter command refused to work on any field with a space in it. To investigate further!

# denge %>%
#   filter(Community != "Belize City") %>%
#   ggplot(aes(Year, n, group = Community)) +
#     geom_line(aes(color = Community), size = 1)+
#     geom_point() +
#     theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
#     labs(title = "Dengue Fever occurrence for selected communities in the Belize District",
#          x = "Year",
#          y = "Rate per 100 people")

I also created a jitter plot showing rates by community type. The jitter plot is useful to show that we have more data points for the rural communities than the urban communities, which will help inform how we view the difference between rural and urban communities.

ggplot(denge, aes(Type, n)) +
  geom_jitter(aes(color = Community), size = 2) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Dengue Fever occurrence by Community Type",
       x = "Community Type",
       y = "Rate per 100 people")
## Warning: Removed 18 rows containing missing values (geom_point).

Part Two - Child Mortality Data

Submitted by Raj Kumar from http://www.childmortality.org

Create .CSV file and read into R

The data file is available to download from a link on the website’s main page as an Excel document. The first few rows contain title and reference information, so I deleted those rows and saved to my working directory as .CSV file starting only with the header rows.

Non-data header information from original Excel file

Non-data header information from original Excel file

The Excel file also contains definitions of the variables, which I will use to help tidy and transform the data.

Variable definitions

Variable definitions

child_mort <- read.csv("PRO2.1 UNIGME Rates & Deaths_Under5.csv", stringsAsFactors = FALSE) %>% 
  tbl_df()

Tidy and Transform

After importing the .CSV file, I transformed the data set into tbl class, which allows me to examine the structure of the data more easily.

child_mort
## # A tibble: 594 x 405
##    ISO.Code CountryName Uncertainty.bounds. U5MR.1950 U5MR.1951 U5MR.1952
##    <chr>    <chr>       <chr>                   <dbl>     <dbl>     <dbl>
##  1 AFG      Afghanistan Lower                      NA        NA        NA
##  2 AFG      Afghanistan Median                     NA        NA        NA
##  3 AFG      Afghanistan Upper                      NA        NA        NA
##  4 AGO      Angola      Lower                      NA        NA        NA
##  5 AGO      Angola      Median                     NA        NA        NA
##  6 AGO      Angola      Upper                      NA        NA        NA
##  7 ALB      Albania     Lower                      NA        NA        NA
##  8 ALB      Albania     Median                     NA        NA        NA
##  9 ALB      Albania     Upper                      NA        NA        NA
## 10 AND      Andorra     Lower                      NA        NA        NA
## # ... with 584 more rows, and 399 more variables: U5MR.1953 <dbl>,
## #   U5MR.1954 <dbl>, U5MR.1955 <dbl>, U5MR.1956 <dbl>, U5MR.1957 <dbl>,
## #   U5MR.1958 <dbl>, U5MR.1959 <dbl>, U5MR.1960 <dbl>, U5MR.1961 <dbl>,
## #   U5MR.1962 <dbl>, U5MR.1963 <dbl>, U5MR.1964 <dbl>, U5MR.1965 <dbl>,
## #   U5MR.1966 <dbl>, U5MR.1967 <dbl>, U5MR.1968 <dbl>, U5MR.1969 <dbl>,
## #   U5MR.1970 <dbl>, U5MR.1971 <dbl>, U5MR.1972 <dbl>, U5MR.1973 <dbl>,
## #   U5MR.1974 <dbl>, U5MR.1975 <dbl>, U5MR.1976 <dbl>, U5MR.1977 <dbl>,
## #   U5MR.1978 <dbl>, U5MR.1979 <dbl>, U5MR.1980 <dbl>, U5MR.1981 <dbl>,
## #   U5MR.1982 <dbl>, U5MR.1983 <dbl>, U5MR.1984 <dbl>, U5MR.1985 <dbl>,
## #   U5MR.1986 <dbl>, U5MR.1987 <dbl>, U5MR.1988 <dbl>, U5MR.1989 <dbl>,
## #   U5MR.1990 <dbl>, U5MR.1991 <dbl>, U5MR.1992 <dbl>, U5MR.1993 <dbl>,
## #   U5MR.1994 <dbl>, U5MR.1995 <dbl>, U5MR.1996 <dbl>, U5MR.1997 <dbl>,
## #   U5MR.1998 <dbl>, U5MR.1999 <dbl>, U5MR.2000 <dbl>, U5MR.2001 <dbl>,
## #   U5MR.2002 <dbl>, U5MR.2003 <dbl>, U5MR.2004 <dbl>, U5MR.2005 <dbl>,
## #   U5MR.2006 <dbl>, U5MR.2007 <dbl>, U5MR.2008 <dbl>, U5MR.2009 <dbl>,
## #   U5MR.2010 <dbl>, U5MR.2011 <dbl>, U5MR.2012 <dbl>, U5MR.2013 <dbl>,
## #   U5MR.2014 <dbl>, U5MR.2015 <dbl>, U5MR.2016 <dbl>, IMR.1950 <dbl>,
## #   IMR.1951 <dbl>, IMR.1952 <dbl>, IMR.1953 <dbl>, IMR.1954 <dbl>,
## #   IMR.1955 <dbl>, IMR.1956 <dbl>, IMR.1957 <dbl>, IMR.1958 <dbl>,
## #   IMR.1959 <dbl>, IMR.1960 <dbl>, IMR.1961 <dbl>, IMR.1962 <dbl>,
## #   IMR.1963 <dbl>, IMR.1964 <dbl>, IMR.1965 <dbl>, IMR.1966 <dbl>,
## #   IMR.1967 <dbl>, IMR.1968 <dbl>, IMR.1969 <dbl>, IMR.1970 <dbl>,
## #   IMR.1971 <dbl>, IMR.1972 <dbl>, IMR.1973 <dbl>, IMR.1974 <dbl>,
## #   IMR.1975 <dbl>, IMR.1976 <dbl>, IMR.1977 <dbl>, IMR.1978 <dbl>,
## #   IMR.1979 <dbl>, IMR.1980 <dbl>, IMR.1981 <dbl>, IMR.1982 <dbl>,
## #   IMR.1983 <dbl>, IMR.1984 <dbl>, IMR.1985 <dbl>, ...

There are several variables intermixed within this data set that I’d like to separate out into individual columns. The variables for ISO.Code, CountryName, and Uncertainty.bounds. are already separated out into columns. Additionally, I’d like to separate out the age range (Under-5, Infant, and Neonatal) and year.

This data set also contains both rates of mortality and actual numbers of deaths for each country/year combination. My final tidied data set should have the rates and numbers in separate columns for each country/year.

To start, I followed Raj’s suggestion to simplify the analysis by focusing on only the median values for each country, removing the lower and upper bounds.

child_mort <- filter(child_mort, Uncertainty.bounds. == "Median") %>% 
  gather("Type.Year", "n", 4:405)

I also gathered the data into a long format, defining a new column as Type.Year.

head(child_mort, n = 10)
## # A tibble: 10 x 5
##    ISO.Code CountryName          Uncertainty.bounds. Type.Year     n
##    <chr>    <chr>                <chr>               <chr>     <dbl>
##  1 AFG      Afghanistan          Median              U5MR.1950  NA  
##  2 AGO      Angola               Median              U5MR.1950  NA  
##  3 ALB      Albania              Median              U5MR.1950  NA  
##  4 AND      Andorra              Median              U5MR.1950  NA  
##  5 ARE      United Arab Emirates Median              U5MR.1950  NA  
##  6 ARG      Argentina            Median              U5MR.1950  NA  
##  7 ARM      Armenia              Median              U5MR.1950  NA  
##  8 ATG      Antigua and Barbuda  Median              U5MR.1950  NA  
##  9 AUS      Australia            Median              U5MR.1950  31.6
## 10 AUT      Austria              Median              U5MR.1950  NA

In order to split the Type.Year into two columns, I reformatted the data in the column. Since some of the values in the column had multiple periods, I used the str_replace function with a regular expression to change the delimiter between the type and the year into a hyphen. Once changed, I can use the separate function to split the column in two.

child_mort$Type.Year <- str_replace(child_mort$Type.Year, "\\.1", "-1")
child_mort$Type.Year <- str_replace(child_mort$Type.Year, "\\.2", "-2")
headtail(child_mort, n = 2) # verify that final periods are hyphens
## $HEAD
## # A tibble: 5 x 5
##   ISO.Code CountryName          Uncertainty.bounds. Type.Year     n
##   <chr>    <chr>                <chr>               <chr>     <dbl>
## 1 AFG      Afghanistan          Median              U5MR-1950    NA
## 2 AGO      Angola               Median              U5MR-1950    NA
## 3 ALB      Albania              Median              U5MR-1950    NA
## 4 AND      Andorra              Median              U5MR-1950    NA
## 5 ARE      United Arab Emirates Median              U5MR-1950    NA
## 
## $TAIL
## # A tibble: 2 x 5
##   ISO.Code CountryName Uncertainty.bounds. Type.Year                 n
##   <chr>    <chr>       <chr>               <chr>                 <dbl>
## 1 ZMB      Zambia      Median              Neonatal.Deaths-2016 14505.
## 2 ZWE      Zimbabwe    Median              Neonatal.Deaths-2016 12246.
child_mort <- separate(child_mort, col = Type.Year, into = c("AgeRange", "Year"), sep = "-")
headtail(child_mort)
## $HEAD
## # A tibble: 5 x 6
##   ISO.Code CountryName          Uncertainty.bounds. AgeRange Year      n
##   <chr>    <chr>                <chr>               <chr>    <chr> <dbl>
## 1 AFG      Afghanistan          Median              U5MR     1950     NA
## 2 AGO      Angola               Median              U5MR     1950     NA
## 3 ALB      Albania              Median              U5MR     1950     NA
## 4 AND      Andorra              Median              U5MR     1950     NA
## 5 ARE      United Arab Emirates Median              U5MR     1950     NA
## 
## $TAIL
## # A tibble: 5 x 6
##   ISO.Code CountryName  Uncertainty.bounds. AgeRange        Year       n
##   <chr>    <chr>        <chr>               <chr>           <chr>  <dbl>
## 1 WSM      Samoa        Median              Neonatal.Deaths 2016     44.
## 2 YEM      Yemen        Median              Neonatal.Deaths 2016  23371.
## 3 ZAF      South Africa Median              Neonatal.Deaths 2016  14550.
## 4 ZMB      Zambia       Median              Neonatal.Deaths 2016  14505.
## 5 ZWE      Zimbabwe     Median              Neonatal.Deaths 2016  12246.

Looking at the values in the n column, some are rates and some are counts, depending on the value in the AgeRange column. I use the filter command to split the rows out by their AgeRange value, and then I join the two filtered tables back together with the full_join funtion. I also clean up the column names and ensure that they match for the join.

distinct(child_mort, AgeRange)
## # A tibble: 6 x 1
##   AgeRange         
##   <chr>            
## 1 U5MR             
## 2 IMR              
## 3 NMR              
## 4 Under.five.Deaths
## 5 Infant.Deaths    
## 6 Neonatal.Deaths
child_mort.rates <- child_mort %>%
  filter(AgeRange == "U5MR" | AgeRange == "IMR" | AgeRange == "NMR")
child_mort.rates$AgeRange <- recode(child_mort.rates$AgeRange, 
                                     U5MR = "under five", 
                                     IMR = "infant", 
                                     NMR = "neonatal")
colnames(child_mort.rates) <- c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year", "rate")

child_mort.counts <- child_mort %>%
  filter(!(AgeRange %in% c("U5MR", "IMR", "NMR")))
child_mort.counts$AgeRange <- recode(child_mort.counts$AgeRange, 
                                     Under.five.Deaths = "under five", 
                                     Infant.Deaths = "infant", 
                                     Neonatal.Deaths = "neonatal")
colnames(child_mort.counts) <- c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year", "deaths")

headtail(child_mort.rates)
## $HEAD
## # A tibble: 5 x 6
##   ISOCode countryName          uncertaintyBounds ageRange   year   rate
##   <chr>   <chr>                <chr>             <chr>      <chr> <dbl>
## 1 AFG     Afghanistan          Median            under five 1950     NA
## 2 AGO     Angola               Median            under five 1950     NA
## 3 ALB     Albania              Median            under five 1950     NA
## 4 AND     Andorra              Median            under five 1950     NA
## 5 ARE     United Arab Emirates Median            under five 1950     NA
## 
## $TAIL
## # A tibble: 5 x 6
##   ISOCode countryName  uncertaintyBounds ageRange year   rate
##   <chr>   <chr>        <chr>             <chr>    <chr> <dbl>
## 1 WSM     Samoa        Median            neonatal 2016   9.20
## 2 YEM     Yemen        Median            neonatal 2016  26.8 
## 3 ZAF     South Africa Median            neonatal 2016  12.4 
## 4 ZMB     Zambia       Median            neonatal 2016  22.9 
## 5 ZWE     Zimbabwe     Median            neonatal 2016  22.9
headtail(child_mort.counts)
## $HEAD
## # A tibble: 5 x 6
##   ISOCode countryName          uncertaintyBounds ageRange   year  deaths
##   <chr>   <chr>                <chr>             <chr>      <chr>  <dbl>
## 1 AFG     Afghanistan          Median            under five 1950      NA
## 2 AGO     Angola               Median            under five 1950      NA
## 3 ALB     Albania              Median            under five 1950      NA
## 4 AND     Andorra              Median            under five 1950      NA
## 5 ARE     United Arab Emirates Median            under five 1950      NA
## 
## $TAIL
## # A tibble: 5 x 6
##   ISOCode countryName  uncertaintyBounds ageRange year  deaths
##   <chr>   <chr>        <chr>             <chr>    <chr>  <dbl>
## 1 WSM     Samoa        Median            neonatal 2016     44.
## 2 YEM     Yemen        Median            neonatal 2016  23371.
## 3 ZAF     South Africa Median            neonatal 2016  14550.
## 4 ZMB     Zambia       Median            neonatal 2016  14505.
## 5 ZWE     Zimbabwe     Median            neonatal 2016  12246.
child_mort <- full_join(child_mort.rates, child_mort.counts)
## Joining, by = c("ISOCode", "countryName", "uncertaintyBounds", "ageRange", "year")
headtail(child_mort)
## $HEAD
## # A tibble: 5 x 7
##   ISOCode countryName       uncertaintyBounds ageRange  year   rate deaths
##   <chr>   <chr>             <chr>             <chr>     <chr> <dbl>  <dbl>
## 1 AFG     Afghanistan       Median            under fi~ 1950     NA     NA
## 2 AGO     Angola            Median            under fi~ 1950     NA     NA
## 3 ALB     Albania           Median            under fi~ 1950     NA     NA
## 4 AND     Andorra           Median            under fi~ 1950     NA     NA
## 5 ARE     United Arab Emir~ Median            under fi~ 1950     NA     NA
## 
## $TAIL
## # A tibble: 5 x 7
##   ISOCode countryName  uncertaintyBounds ageRange year   rate deaths
##   <chr>   <chr>        <chr>             <chr>    <chr> <dbl>  <dbl>
## 1 WSM     Samoa        Median            neonatal 2016   9.20    44.
## 2 YEM     Yemen        Median            neonatal 2016  26.8  23371.
## 3 ZAF     South Africa Median            neonatal 2016  12.4  14550.
## 4 ZMB     Zambia       Median            neonatal 2016  22.9  14505.
## 5 ZWE     Zimbabwe     Median            neonatal 2016  22.9  12246.

Analysis

The discussion post did not contain any specific instructions for analysis. To begin, I decided to create a visualization showing the average worldwide child mortality rates, separated out by the Age Range. The first plot shows the three groups together on one graph. I practiced using many different elements of the ggplot2 package to format the various pieces of the graph, such as the x-axis labels and the legend title.

child_mort %>% 
  group_by(ageRange, year) %>% 
  summarize(avg = mean(rate, na.rm = TRUE)) %>%
  ggplot(aes(year, avg, fill = ageRange)) + 
    geom_bar(stat = 'identity', position = 'dodge') + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    scale_x_discrete(name = "Year", breaks = seq(1950, 2016, 3)) +
    scale_y_continuous(name = "Average Child Mortality Rate (out of 1000 live births)", breaks = seq(0, 180, 20)) +
    ggtitle("Child Mortality Rates by Age Range", "Worldwide Average") + 
    scale_fill_discrete(name = "Age Range")
## Warning: Removed 1 rows containing missing values (geom_bar).

Below I created a graph showing the same information, but using facet_grid to separate each Age Range into its own graph.

child_mort %>% 
  group_by(ageRange, year) %>% 
  summarize(avg = mean(rate, na.rm = TRUE)) %>%
  ggplot(aes(year, avg)) +
    geom_point() +
    facet_grid(. ~ ageRange) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    scale_x_discrete(name = "Year", breaks = seq(1950, 2016, 6)) +
    scale_y_continuous(name = "Average Child Mortality Rate (out of 1000 live births)", breaks = seq(0, 180, 20)) +
    ggtitle("Child Mortality Rates by Age Range", "Worldwide Average") + 
    scale_fill_discrete(name = "Age Range")
## Warning: Removed 1 rows containing missing values (geom_point).

I noted that for both of these graphs, R automatically removed the missing data point for neonatal mortality rate for 1950. I also noted that overall we see a very happy downward trend with in child mortality, with the sharpest decrease occuring in the under five age range during this time span.

Part Three - NYC MTA MetroCard Swipes

Submitted by Jeremy O’Brien

Create SQL database and read into R

With the most data of my three data sets, I decided to challenge myself to create a SQL database with the MTA information and connect it to another data set with additional information.

I created a .CSV file with Jeremy’s information from his github link (https://raw.githubusercontent.com/JeremyOBrien16/NYC-MTA-Weekly-MetroCard-Swipes/master/Fare_Card_History_for_Metropolitan_Transportation_Authority__MTA___Beginning_2010.csv). I wanted to find additional information to add to his data set, and I found a data set from the MTA website online containing line information for each station by the remote booth number listed in Jeremy’s data set (http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls).

Unable to find a list of stations with remote IDs to match on, I added the boroughs to the .CSV file manually. I then made an initial clean of the stationline data set in the .CSV file before uploading to SQL in order to have unique values for each remote station ID. In doing so, I lost the ability to track the subway division (IRT vs. IND, etc.) but for the purposes of this project, I decided to focus solely on the subway lines and boroughs instead of the division data.

Additionally, in uploading the ridedata file to the SQL database (using the loadmta file found on my github page), duplicate values as determined by my primary key set-up for that table were discarded. From the 180,608 original entries, 179,679 unique observations were uploaded into the SQL database.

-- loadmta.sql

DROP TABLE IF EXISTS ridedata;
DROP TABLE IF EXISTS stationlines;

CREATE TABLE ridedata (
    FromDate DATE,
    ToDate  DATE,
    RemoteStationID CHAR(4),
    Station VARCHAR(50),
    FullFare INTEGER,
    SeniorCitizen_Disabled INTEGER,
    SevenDayADAFarecardAccessSystemUnlimited INTEGER,
    ThirtyDayADAFarecardAccessSystemUnlimited INTEGER,
    JointRailRoadTicket INTEGER,
    SevenDayUnlimited INTEGER,
    ThirtyDayUnlimited INTEGER,
    FourteenDayReducedFareMediaUnlimited INTEGER,
    OneDayUnlimited INTEGER,
    FourteenDayUnlimited INTEGER,
    SevenDayExpressBusPass INTEGER,
    TransitCheckMetrocard INTEGER,
    LIBSpecialSenior INTEGER,
    RailRoadUnlimitedNoTrade INTEGER,
    TransitCheckMetrocardAnnualMetrocard INTEGER,
    MailandRideEZPassExpress INTEGER,
    MailandRideUnlimited INTEGER,
    Path2Trip INTEGER,
    AirtranFullFare INTEGER,
    Airtran30Day INTEGER,
    Airtran10Trip INTEGER,
    AirtranMonthly INTEGER,
    CONSTRAINT PK_ridedata PRIMARY KEY (FromDate, ToDate, RemoteStationID)
    );
  
LOAD DATA LOCAL INFILE 'C:/Users/Steve Tipton/Documents/CUNY/DATA 607/proj2/ridedata.csv' 
INTO TABLE ridedata 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

CREATE TABLE stationlines (
    Remote CHAR(4),
    Station VARCHAR(50),
    LineName VARCHAR(20),
    Division CHAR(3),
    CONSTRAINT PK_stationlines PRIMARY KEY (Remote, Station, Division)
    );
  
LOAD DATA LOCAL INFILE 'C:/Users/Steve Tipton/Documents/CUNY/DATA 607/proj2/stationlines.csv' 
INTO TABLE stationlines 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

select * from stationlines;

Tidy and Transform

I connected MySQL to R and read in the station list table first.

driver <- dbDriver("MySQL")
con <- dbConnect(driver, 
                 user = "root", 
                 password = password, ## password hidden in previous code 
                 dbname = "mta")

dbListTables(con)
## [1] "ridedata"     "stationlines"
sql_getStationList <- "SELECT Remote, Station, LineName, Borough FROM stationlines"
stationList <- dbGetQuery(con, sql_getStationList)

The empty values in the Borough column imported as “” characters, so I tidy them into “NA” values.

stationList[!(stationList$Borough %in% c("MANHT","BKLYN","QUEEN","BRONX")),4] <- NA

The LineNames column of this table has a single character string for each station with all the lines serviced by that station concatenated into one single string. I extracted the information from those strings into atomic variables for each subway line.

## extract all unique strings from the ListNames column and split into single characters
trainLines <- stationList$LineName %>%
  unique() %>%
  paste(collapse = "") %>%
  str_split("") %>%
  unlist() %>%
  unique() ## retain a vector of unique single characters referencing all subway lines in the data set

trainLines <- trainLines[order(trainLines)]

Using my vector of unique train lines, I use two for loops to check whether each station in the stationlines data set services that line. As I complete the check for each line, I add its corresponding column to the original data set. Lastly, I update the names of the newly added columns with the names of the train lines.

for(i in 1:length(trainLines)) {
  x <- vector(length = length(stationList$LineName))
  for(j in 1:length(stationList$LineName)) {
    if(str_detect(stationList$LineName[j], trainLines[i]) == TRUE) {
      x[j] <- 1
    } else {
      x[j] <-0
    }
  }
  stationList <- cbind.data.frame(stationList, x)
}

colnames(stationList)[5:length(stationList[1,])] <- trainLines

The stationlist data set now has columns of 1’s and 0’s showing whether or not a line goes through each station.

headtail(stationList)
## $HEAD
##   Remote        Station LineName Borough 1 2 3 4 5 6 7 A B C D E F G J L M
## 1   R001   WHITEHALL ST       R1   MANHT 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2   R002      FULTON ST ACJZ2345   MANHT 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0
## 3   R003  CYPRESS HILLS        J   BKLYN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 4   R004   ELDERTS LANE       JZ   QUEEN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 5   R005 FOREST PARKWAY        J   QUEEN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
##   N P Q R S Z
## 1 0 0 0 1 0 0
## 2 0 0 0 0 0 1
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 1
## 5 0 0 0 0 0 0
## 
## $TAIL
##     Remote         Station LineName Borough 1 2 3 4 5 6 7 A B C D E F G J
## 476   R548  CHRISTOPHER ST        P   MANHT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 477   R549 NEWARK HW BMEBE        P    <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 478   R550      LACKAWANNA        P    <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 479   R551    GROVE STREET        P    <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 480   R552  JOURNAL SQUARE        P    <NA> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##     L M N P Q R S Z
## 476 0 0 0 1 0 0 0 0
## 477 0 0 0 1 0 0 0 0
## 478 0 0 0 1 0 0 0 0
## 479 0 0 0 1 0 0 0 0
## 480 0 0 0 1 0 0 0 0

Finally, I read the ridedata from SQL into R and joined the two data sets together with the join functionality.

sql_getRidedata <- "SELECT * FROM ridedata"
ridedata <- dbGetQuery(con, sql_getRidedata)
colnames(ridedata)[3] <- "Remote" ## rename column to facilitate join
mtadata <- left_join(ridedata, stationList, by = "Remote")
headtail(mtadata)
## $HEAD
##     FromDate     ToDate Remote                      Station.x FullFare
## 1 2010-05-29 2010-06-04   R001 WHITEHALL STREET                  56961
## 2 2010-05-29 2010-06-04   R002 FULTON ST & BROADWAY NASSAU       16210
## 3 2010-05-29 2010-06-04   R003 CYPRESS HILLS                      3292
## 4 2010-05-29 2010-06-04   R004 75TH STREET & ELDERTS LANE         7774
## 5 2010-05-29 2010-06-04   R005 85TH STREET & FOREST PKWAY         8794
##   SeniorCitizen_Disabled SevenDayADAFarecardAccessSystemUnlimited
## 1                   1732                                      203
## 2                    503                                       41
## 3                    110                                        7
## 4                    257                                       20
## 5                    420                                       32
##   ThirtyDayADAFarecardAccessSystemUnlimited JointRailRoadTicket
## 1                                       883                 358
## 2                                       157                  29
## 3                                        41                   0
## 4                                       125                   0
## 5                                       142                   1
##   SevenDayUnlimited ThirtyDayUnlimited
## 1             17654              32369
## 2              3536              11462
## 3              1404               1344
## 4              2878               4472
## 5              2997               5297
##   FourteenDayReducedFareMediaUnlimited OneDayUnlimited
## 1                                   59            6733
## 2                                    8             451
## 3                                    0              36
## 4                                    1              79
## 5                                    0              88
##   FourteenDayUnlimited SevenDayExpressBusPass TransitCheckMetrocard
## 1                 1835                    426                   892
## 2                  598                     50                   389
## 3                  174                      5                    20
## 4                  278                      7                    27
## 5                  353                      6                    48
##   LIBSpecialSenior RailRoadUnlimitedNoTrade
## 1                0                     1035
## 2                0                       90
## 3                0                        4
## 4                0                        9
## 5                0                       10
##   TransitCheckMetrocardAnnualMetrocard MailandRideEZPassExpress
## 1                                 4798                      431
## 2                                 3578                       83
## 3                                  154                        9
## 4                                  409                       20
## 5                                  317                        9
##   MailandRideUnlimited Path2Trip AirtranFullFare Airtran30Day
## 1                   91         0             418            0
## 2                   51         0              70            0
## 3                    1         0               5            0
## 4                    2         0              20            0
## 5                   22         0              28            0
##   Airtran10Trip AirtranMonthly      Station.y LineName Borough 1 2 3 4 5 6
## 1             0              0   WHITEHALL ST       R1   MANHT 1 0 0 0 0 0
## 2             0              0      FULTON ST ACJZ2345   MANHT 0 1 1 1 1 0
## 3             0              0  CYPRESS HILLS        J   BKLYN 0 0 0 0 0 0
## 4             0              0   ELDERTS LANE       JZ   QUEEN 0 0 0 0 0 0
## 5             0              0 FOREST PARKWAY        J   QUEEN 0 0 0 0 0 0
##   7 A B C D E F G J L M N P Q R S Z
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 2 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1
## 3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
## 5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 
## $TAIL
##          FromDate     ToDate Remote                     Station.x FullFare
## 179675 2018-02-10 2018-02-16   R569    SBS-BX6 @ LIVINGSTON PLAZA        0
## 179676 2018-02-10 2018-02-16   R570        72ND STREET - 2 AVENUE    72085
## 179677 2018-02-10 2018-02-16   R571        86TH STREET - 2 AVENUE    55230
## 179678 2018-02-10 2018-02-16   R572        96TH STREET - 2 AVENUE    36272
## 179679 2018-02-10 2018-02-16   R573 SBS-Q52/53 @ LIVINGSTON PLAZA        0
##        SeniorCitizen_Disabled SevenDayADAFarecardAccessSystemUnlimited
## 179675                      0                                        0
## 179676                   6629                                      282
## 179677                   5694                                      262
## 179678                   3241                                      225
## 179679                      0                                        0
##        ThirtyDayADAFarecardAccessSystemUnlimited JointRailRoadTicket
## 179675                                         0                   0
## 179676                                       864                 217
## 179677                                       810                  59
## 179678                                      1006                  40
## 179679                                         0                   0
##        SevenDayUnlimited ThirtyDayUnlimited
## 179675                 0                  0
## 179676             18717              50393
## 179677             17908              52231
## 179678             16670              37182
## 179679                 0                  0
##        FourteenDayReducedFareMediaUnlimited OneDayUnlimited
## 179675                                    0               0
## 179676                                    1               0
## 179677                                    0               0
## 179678                                    0               0
## 179679                                    0               0
##        FourteenDayUnlimited SevenDayExpressBusPass TransitCheckMetrocard
## 179675                    0                      0                     0
## 179676                    0                    269                  1280
## 179677                    0                    120                  1289
## 179678                    0                     95                   617
## 179679                    0                      0                     0
##        LIBSpecialSenior RailRoadUnlimitedNoTrade
## 179675                0                        0
## 179676              827                      194
## 179677              540                       87
## 179678              509                       41
## 179679                0                        0
##        TransitCheckMetrocardAnnualMetrocard MailandRideEZPassExpress
## 179675                                    0                        0
## 179676                                 7086                     3751
## 179677                                 7493                     3017
## 179678                                 5330                     1673
## 179679                                    0                        0
##        MailandRideUnlimited Path2Trip AirtranFullFare Airtran30Day
## 179675                    0         0               0            0
## 179676                 1637         0             625            0
## 179677                 1682         0             695            0
## 179678                 1059         0             389            0
## 179679                    0         0               0            0
##        Airtran10Trip AirtranMonthly Station.y LineName Borough  1  2  3  4
## 179675             0              0      <NA>     <NA>    <NA> NA NA NA NA
## 179676             0              0      <NA>     <NA>    <NA> NA NA NA NA
## 179677             0              0      <NA>     <NA>    <NA> NA NA NA NA
## 179678             0              0      <NA>     <NA>    <NA> NA NA NA NA
## 179679             0              0      <NA>     <NA>    <NA> NA NA NA NA
##         5  6  7  A  B  C  D  E  F  G  J  L  M  N  P  Q  R  S  Z
## 179675 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179676 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179677 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179678 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 179679 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I took a quick look at any unmatched stations from the ridedata set that did not have a match in the stationlist data set. For the most part, these are newer stations, such as the 34th Street station in Hudson Yards or the new stations along the 2nd Avenue Q line extension. For the purposes of this project, and since there is still a great deal of matched information, I decided to work with just the matched information.

z <- subset(mtadata, is.na(LineName))
distinct(z, Remote, Station.x)
##    Remote                      Station.x
## 1    R026 SOUTH FERRY                   
## 2    R516 MTABC - COLLEGE POINT DEPOT   
## 3    R072 34TH STREET - HUDSON YARDS    
## 4    R072     34TH STREET - HUDSON YARDS
## 5    R572         96TH STREET - 2 AVENUE
## 6    R570         72ND STREET - 2 AVENUE
## 7    R571         86TH STREET - 2 AVENUE
## 8    R475             GRAND AVENUE DEPOT
## 9    R476                  GUNHILL DEPOT
## 10   R477                    YUKON DEPOT
## 11   R478                CASTLETON DEPOT
## 12   R479           QUEENS VILLAGE DEPOT
## 13   R480            CASEY STENGEL DEPOT
## 14   R481                  JAMAICA DEPOT
## 15   R482               FRESH POND DEPOT
## 16   R483            EAST NEW YORK DEPOT
## 17   R484           JACKIE GLEASON DEPOT
## 18   R485                 FLATBUSH DEPOT
## 19   R486               ULMER PARK DEPOT
## 20   R487              KINGSBRIDGE DEPOT
## 21   R488        JACKIE GLEASON DEPOT(2)
## 22   R489        MOTHER CLARA HALE DEPOT
## 23   R492           MANHATTANVILLE DEPOT
## 24   R493               CHARLESTON DEPOT
## 25   R494               WEST FARMS DEPOT
## 26   R495         MICHAEL J. QUILL DEPOT
## 27   R496          TUSKEGEE AIRMEN DEPOT
## 28   R515       MTABC - LA GUARDIA DEPOT
## 29   R517              MTABC - JFK DEPOT
## 30   R518     MTABC - FAR ROCKAWAY DEPOT
## 31   R519     MTABC - SPRING CREEK DEPOT
## 32   R520          MTABC - BAISLEY DEPOT
## 33   R521      MTABC - EASTCHESTER DEPOT
## 34   R523       MITCHELL FIELD BUS DEPOT
## 35   R524     ROCKVILLE CENTER BUS DEPOT
## 36   R530  WESTCHESTER CMF YONKERS DEPOT
## 37   R531     WESTCHESTER VALHALLA DEPOT
## 38   R533  MTABC - COLLEGE POINT DEPOT2*
## 39   R555     SBS-M15 @ LIVINGSTON PLAZA
## 40   R556    SBS-BX12 @ LIVINGSTON PLAZA
## 41   R557     SBS-M34 @ LIVINGSTON PLAZA
## 42   R558    SBS-BX41 @ LIVINGSTON PLAZA
## 43   R559     SBS-B44 @ LIVINGSTON PLAZA
## 44   R560     SBS-M60 @ LIVINGSTON PLAZA
## 45   R561     SBS-M86 @ LIVINGSTON PLAZA
## 46   R562          MTABC - EASTCHESTER 2
## 47   R563     SBS-B46 @ LIVINGSTON PLAZA
## 48   R564     SBS-Q44 @ LIVINGSTON PLAZA
## 49   R567     SBS-M23 @ LIVINGSTON PLAZA
## 50   R516    MTABC - COLLEGE POINT DEPOT
## 51   R528  METRO NORTH HUDSON RAIL DEPOT
## 52   R568     SBS-M79 @ LIVINGSTON PLAZA
## 53   R569     SBS-BX6 @ LIVINGSTON PLAZA
## 54   R573  SBS-Q52/53 @ LIVINGSTON PLAZA
mtadata <- subset(mtadata, !is.na(LineName))

Analysis

In the discussion board post, one question arose around full fare vs. unlimited rides. Using the mutate function, I computed a full_unlim_score from the raw data. For each line, I computed the difference between all unlimited ride types and full fare rides (regular and senior/ADA). I used difference to avoid dividing by zero with a ratio.

mtadata <- mutate(mtadata, total_unlim = (SevenDayADAFarecardAccessSystemUnlimited + ThirtyDayADAFarecardAccessSystemUnlimited + SevenDayUnlimited + ThirtyDayUnlimited + FourteenDayReducedFareMediaUnlimited + OneDayUnlimited + FourteenDayUnlimited), total_full = (FullFare + SeniorCitizen_Disabled), full_unlim_score = total_unlim - total_unlim)

mtadata_score <- select(mtadata, FromDate, ToDate, Remote, Station.x, Borough, total_unlim, total_full, full_unlim_score)
head(mtadata_score)
##     FromDate     ToDate Remote                      Station.x Borough
## 1 2010-05-29 2010-06-04   R001 WHITEHALL STREET                 MANHT
## 2 2010-05-29 2010-06-04   R002 FULTON ST & BROADWAY NASSAU      MANHT
## 3 2010-05-29 2010-06-04   R003 CYPRESS HILLS                    BKLYN
## 4 2010-05-29 2010-06-04   R004 75TH STREET & ELDERTS LANE       QUEEN
## 5 2010-05-29 2010-06-04   R005 85TH STREET & FOREST PKWAY       QUEEN
## 6 2010-05-29 2010-06-04   R006 WOODHAVEN BOULEVARD              QUEEN
##   total_unlim total_full full_unlim_score
## 1       59736      58693                0
## 2       16253      16713                0
## 3        3006       3402                0
## 4        7853       8031                0
## 5        8909       9214                0
## 6       10486       8824                0

First, I created a scatterplot to compare the Unlimited Ride swipes to the Full Fare, using color to distinguish the Borough. They are, logically, positively correlated, and each borough has a similar correlation pattern.

library(scales) ## to format the axis numbers
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(mtadata_score[!(is.na(mtadata_score$Borough)),], aes(x=total_unlim, y=total_full)) + 
  geom_point(aes(color = Borough)) + 
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(subtitle="Unlimited Ride vs Full Fare Swipes", 
       y="Full Fare Swipes", 
       x="Unlimited Ride Swipes", 
       title="Scatterplot")

ggplot(mtadata_score[!(is.na(mtadata_score$Borough)),], aes(x=total_unlim, y=total_full)) + 
  geom_point(aes(color = Borough)) +
  geom_smooth(se=F) + 
  facet_grid(. ~ Borough) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(subtitle="Unlimited Ride vs Full Fare Swipes", 
       y="Full Fare Swipes", 
       x="Unlimited Ride Swipes", 
       title="Scatterplot by Borough")
## `geom_smooth()` using method = 'gam'

Next, I grouped the information by Remote ID, then took the average full_unlim_score for each station. I gathered that information by train line and then found the average difference for each train line to create a bar graph. On average for this data, we see that the B/D/F/M/L are the leading train lines for unlimited passes and the A/C/E/S lines favor the full fare swipes.

avgByLine <- mtadata_score %>% 
  group_by(Remote) %>% 
  summarize(avg = mean(full_unlim_score, na.rm = TRUE)) %>%
  arrange(avg) %>%
  left_join(stationList) %>%
  gather("Line", "YesNo", 6:28) %>%
  mutate(line_score = avg * YesNo) %>%
  group_by(Line) %>%
  summarize(LineAvg = mean(line_score, na.rm = TRUE)) %>%
  arrange(LineAvg) %>%
  filter(Line != "P") %>% ## removing PATH trains from NYC MTA analysis
  ggplot(aes(Line, LineAvg)) +
    geom_bar(stat = "identity") +
    labs(title = "Average Difference in Unlimited Ride vs. Full Fare Swipes",
       x = "MTA Line",
       y = "Average Difference")
## Joining, by = "Remote"
print(avgByLine)

Note: The ggplot is working in RStudio but not when knitted. Here is an image file below with the completed ggplot from RStudio: