1 Project Overview

For this study, we investigated potentical causes that would affect NYC subways on-time performance (OTP). We chose this as our project as all of our team members commute within NYC and share the same frustration. Historically, NYC subways have not been efficient or clean, yet every year New Yorker’s are paying more for the same dissapointing service:

For our analysis, we used MTA subway data and historical weather patterns in NYC from the years 2011-2017. Our business question was: Is weather predictive of NYC subway on-time performance? After acquiring, cleaning and combining the dataframes, we ran a linear regression on the data. Unfortunately, the fit was poor and gave us litte insight into the relationship between weather and OTP. We then attempted to fit a polynomial to the data which gave a slightly better fit, however still the correlation was weak

Our findings showed that NYC weather patterns are NOT predictive of NYC subway OTP. We can speculate that weather does not play a role as most of the subways are underground. Perhaps weather plays a stronger role in OTP for NYC buses or outside railways like the LIRR or the Staten Island Railway.

2 Obtain the Data

Data was obtained from the MTA website and weather.gov.

For the MTA data, we uploaded it directly into R via a .CSV. We then subsetted the data to pull out only the OTP subway data and stored in an R dataframe.

About the MTA subway performance dataset:
Subway wait assessment for all lines
Customer injury rate
Elevator availability
Escalator availability
Mean distance between failure
On-time performance (OTP), total and for all lines
Total ridership

http://web.mta.info/developers/performance.html

For the weather data, we downloaded a .CSV and then placed into a MongoDB instance running in Microsoft Azure. We then filtered by year with the query:

weather.2017<-mgo$ find(‘{“YEAR”:{“$gt”:2007}}’).

The values were then stored into an R dataframe and converted from wide to long format using tidyverse gather.

About the weather.gov dataset:

temperatures were recorded in Central Park, NY
monthly average and annual temperatures from 1869 to 2018

https://www.weather.gov/media/okx/Climate/CentralPark/monthlyannualtemp.pdf

2.1 Subway Data Processing

# load MTA data 
data<- read.csv("https://raw.githubusercontent.com/ChristinaValore/stats-prob-606/master/Performance_NYCT.csv", header=TRUE, check.names = FALSE)

# subset to pull out ONLY the OTP (ON TIME %)
sub <-data[ which(data$PARENT_SEQ=='391690'), ]

# view the subsetted data
head(sub,10)

##      INDICATOR_SEQ PARENT_SEQ AGENCY_NAME          INDICATOR_NAME
## 1075        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1076        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1077        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1078        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1079        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1080        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1081        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1082        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1083        391691     391690 NYC Transit OTP (Terminal) - 1 Line
## 1084        391691     391690 NYC Transit OTP (Terminal) - 1 Line
##                                                                                                               DESCRIPTION
## 1075 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1076 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1077 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1078 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1079 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1080 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1081 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1082 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1083 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
## 1084 Subways weekday Terminal OTP evaluates performance based on schedule/service plan\xa0in effect, includes all delays.
##                CATEGORY FREQUENCY DESIRED_CHANGE INDICATOR_UNIT
## 1075 Service Indicators         M              U              %
## 1076 Service Indicators         M              U              %
## 1077 Service Indicators         M              U              %
## 1078 Service Indicators         M              U              %
## 1079 Service Indicators         M              U              %
## 1080 Service Indicators         M              U              %
## 1081 Service Indicators         M              U              %
## 1082 Service Indicators         M              U              %
## 1083 Service Indicators         M              U              %
## 1084 Service Indicators         M              U              %
##      DECIMAL_PLACES PERIOD_YEAR PERIOD_MONTH YTD_TARGET YTD_ACTUAL
## 1075              1        2009            6        0.0       86.8
## 1076              1        2009            7        0.0       90.0
## 1077              1        2009            8        0.0       88.3
## 1078              1        2009            9        0.0       89.6
## 1079              1        2009           10        0.0       90.2
## 1080              1        2009           11        0.0       91.1
## 1081              1        2009           12        0.0       91.2
## 1082              1        2010            1       90.9       93.5
## 1083              1        2010            2       90.9       92.2
## 1084              1        2010            3       90.9       92.2
##      MONTHLY_TARGET MONTHLY_ACTUAL
## 1075            0.0           86.8
## 1076            0.0           93.2
## 1077            0.0           84.7
## 1078            0.0           93.9
## 1079            0.0           92.6
## 1080            0.0           95.6
## 1081            0.0           91.7
## 1082           90.9           93.5
## 1083           90.9           90.7
## 1084           90.9           92.2

# remove columns that are not needed for this analysis
sub<- sub[c(-1:-3, -5:-10)]

2.2 Weather Data Processing

For the weather data processing, I used a CSV file from the weather.gov site. However, due to the fact that there several years included as part of this data, I decided to store it in an Azure hosted MongoDB instance and retrieve is via a simple MongoDB query.

The password is stored locally on a file for security reasons and the MongoDB instance is opened to only a few select IPs. In order to run this Markdown file, a MongoDB instance will need to be configured and the passowrd file and string will need to be updated.

file <- "https://raw.githubusercontent.com/dapolloxp/607-finalproject/master/monthlyannualtemp-converted.csv"

weather.data = read.csv(file, stringsAsFactors = FALSE)
cleaned.weather <- weather.data %>% select(-11)
missing.days <- c(77.5, 76, 68.5, 57.5, 47.5, 38)
cleaned.weather[150,8:13] <- missing.days
cleaned.weather[150,14] <- apply(cleaned.weather[10,2:13],1,mean)

#mongo_pwd_file <- "/Users/christinavalore/Desktop/password.txt"
mongo_pwd_file <- "/Users/davidapolinar/Dropbox/CUNYProjects/Srping2019/Data607/Week 13/mongopassword.txt"

mongo_password <-read.delim(mongo_pwd_file, header = FALSE, stringsAsFactors = FALSE)

#url <- paste0("mongodb://nyc.admin:", mongo_password$V1 , "@52.167.52.62:27017/weather",sep = "")
url <- paste0("mongodb://nyc.admin:", mongo_password$V1 , "@10.20.1.6:27017/weather",sep = "")
mgo <- mongo(collection = "nyc", db = "weather",  url=url, verbose = TRUE)
mgo$drop()
mgo$insert(cleaned.weather)

## 
Complete! Processed total of 150 rows.

## List of 5
##  $ nInserted  : num 150
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

weather.2017<-mgo$find('{"YEAR":{"$gt":2007}}')

## 
 Found 11 records...
 Imported 11 records. Simplifying into dataframe...

mgo$find('{"YEAR":{}}')

## 
 Imported 0 records. Simplifying into dataframe...

## data frame with 0 columns and 0 rows

3 Cleansing

For the subway data, we checked for the completeness of the data as some years were missing OTP values ands some subway lines only had collected data from the mid-year. We removed all incomplete data including any years after 2011 and incomplete OTP for the subway lines W and S Line 42 St.

For the weather data, after gathering the monthly averages by year, we converted the data from wide to long. In preperation for combined the two dataframes, we changed the month values to numerics, renamed the columns to match in both dataframes and finally, filtered the weather data to include only the same years as the subway data, 2011 - 2017.

Combining the data required that the year and month in both dataframes have matching variables and titles. We then did a left join on the dataframes and renames the columns to keep the column names in a standard format. After, we did a boolean comparison on the monthly actual vs monthly target OTP and created a new column, ON_TIME, with 1,0 values to indicate if the subway was on-time for that month.

Finally, we changed the percentage values for OTP to decimals, divided by approximately 20 days as this study only takes weekdays into account, and added into a new column called DAYS_ON_TIME. We then aggregated the values by month to see the average DAYS_ON_TIME and monthly weather.

3.1 Subway Data Cleansing

# check for data completeness to see how manhy subways were analyzed
unique(sub$INDICATOR_NAME)

##  [1] OTP (Terminal) - 1 Line        OTP (Terminal) - 2 Line       
##  [3] OTP (Terminal) - 3 Line        OTP (Terminal) - 4 Line       
##  [5] OTP (Terminal) - 5 Line        OTP (Terminal) - 6 Line       
##  [7] OTP (Terminal) - 7 Line        OTP (Terminal) - S Line 42 St.
##  [9] OTP (Terminal) - A Line        OTP (Terminal) - B Line       
## [11] OTP (Terminal) - C Line        OTP (Terminal) - D Line       
## [13] OTP (Terminal) - E Line        OTP (Terminal) - F Line       
## [15] OTP (Terminal) - S Fkln Line   OTP (Terminal) - G Line       
## [17] OTP (Terminal) - J Z Line      OTP (Terminal) - L Line       
## [19] OTP (Terminal) - M Line        OTP (Terminal) - N Line       
## [21] OTP (Terminal) - Q Line        OTP (Terminal) - R Line       
## [23] OTP (Terminal) - S Line Rock   OTP (Terminal) - W Line       
## 86 Levels:  Bus Passenger Wheelchair Lift Usage - NYCT Bus ...

# 24 subways were analyzed over 12 months so each year should have 288 values
table(sub$PERIOD_YEAR)

## 
## 2009 2010 2011 2012 2013 2014 2015 2016 2017 
##  168  236  276  267  276  276  276  278  288

# See if OTP was measured for all subway lines
table(sub$INDICATOR_NAME)

## 
##              Bus Passenger Wheelchair Lift Usage - NYCT Bus 
##                                                           0 
##                             % of Completed Trips - NYCT Bus 
##                                                           0 
##                   100th Street Depot - % of Completed Trips 
##                                                           0 
##                  126th Street Depot - % of Completed Trips  
##                                                           0 
##                  Casey Stengel Depot - % of Completed Trips 
##                                                           0 
##                      Castleton Depot - % of Completed Trips 
##                                                           0 
##                     Charleston Depot - % of Completed Trips 
##                                                           0 
##                      Collisions with Injury Rate - NYCT Bus 
##                                                           0 
##                    Customer Accident Injury Rate - NYCT Bus 
##                                                           0 
##                              Customer Injury Rate - Subways 
##                                                           0 
##                  East New York Depot - % of Completed Trips 
##                                                           0 
##                             Elevator Availability - Subways 
##                                                           0 
##                Employee Lost Time and Restricted Duty Rate  
##                                                           0 
##                            Escalator Availability - Subways 
##                                                           0 
##                       Flatbush Depot - % of Completed Trips 
##                                                           0 
##                     Fresh Pond Depot - % of Completed Trips 
##                                                           0 
##                   Grand Avenue Depot - % of Completed Trips 
##                                                           0 
##                       Gun Hill Depot - % of Completed Trips 
##                                                           0 
##                 Jackie Gleason Depot - % of Completed Trips 
##                                                           0 
##                        Jamaica Depot - % of Completed Trips 
##                                                           0 
##                    Kingsbridge Depot - % of Completed Trips 
##                                                           0 
##                 Manhattanville Depot - % of Completed Trips 
##                                                           0 
##                   Mean Distance Between Failures - NYCT Bus 
##                                                           0 
##     Mean Distance Between Failures - Staten Island Railway  
##                                                           0 
##                    Mean Distance Between Failures - Subways 
##                                                           0 
##                Meredith Avenue Depot - % of Completed Trips 
##                                                           0 
##               Michael J. Quill Depot - % of Completed Trips 
##                                                           0 
##                 On-Time Performance - Staten Island Railway 
##                                                           0 
##                              On-Time Performance (Terminal) 
##                                                           0 
##                                     OTP (Terminal) - 1 Line 
##                                                         101 
##                                     OTP (Terminal) - 2 Line 
##                                                         101 
##                                     OTP (Terminal) - 3 Line 
##                                                         101 
##                                     OTP (Terminal) - 4 Line 
##                                                         101 
##                                     OTP (Terminal) - 5 Line 
##                                                         101 
##                                     OTP (Terminal) - 6 Line 
##                                                         101 
##                                     OTP (Terminal) - 7 Line 
##                                                         101 
##                                     OTP (Terminal) - A Line 
##                                                         101 
##                                     OTP (Terminal) - B Line 
##                                                         101 
##                                     OTP (Terminal) - C Line 
##                                                         101 
##                                     OTP (Terminal) - D Line 
##                                                         101 
##                                     OTP (Terminal) - E Line 
##                                                         101 
##                                     OTP (Terminal) - F Line 
##                                                         101 
##                                     OTP (Terminal) - G Line 
##                                                         101 
##                                   OTP (Terminal) - J Z Line 
##                                                         101 
##                                     OTP (Terminal) - L Line 
##                                                         101 
##                                     OTP (Terminal) - M Line 
##                                                         101 
##                                     OTP (Terminal) - N Line 
##                                                         101 
##                                     OTP (Terminal) - Q Line 
##                                                         101 
##                                     OTP (Terminal) - R Line 
##                                                         101 
##                                OTP (Terminal) - S Fkln Line 
##                                                         101 
##                              OTP (Terminal) - S Line 42 St. 
##                                                          92 
##                                OTP (Terminal) - S Line Rock 
##                                                         101 
##                                     OTP (Terminal) - W Line 
##                                                          27 
##                 Queens Village Depot - % of Completed Trips 
##                                                           0 
##                                     Subway Wait Assessment  
##                                                           0 
##                             Subway Wait Assessment - 1 Line 
##                                                           0 
##                             Subway Wait Assessment - 2 Line 
##                                                           0 
##                             Subway Wait Assessment - 3 Line 
##                                                           0 
##                             Subway Wait Assessment - 4 Line 
##                                                           0 
##                             Subway Wait Assessment - 5 Line 
##                                                           0 
##                             Subway Wait Assessment - 6 Line 
##                                                           0 
##                             Subway Wait Assessment - 7 Line 
##                                                           0 
##                             Subway Wait Assessment - A Line 
##                                                           0 
##                             Subway Wait Assessment - B Line 
##                                                           0 
##                             Subway Wait Assessment - C Line 
##                                                           0 
##                             Subway Wait Assessment - D Line 
##                                                           0 
##                             Subway Wait Assessment - E Line 
##                                                           0 
##                             Subway Wait Assessment - F Line 
##                                                           0 
##                             Subway Wait Assessment - G Line 
##                                                           0 
##                           Subway Wait Assessment - J Z Line 
##                                                           0 
##                             Subway Wait Assessment - L Line 
##                                                           0 
##                             Subway Wait Assessment - M Line 
##                                                           0 
##                             Subway Wait Assessment - N Line 
##                                                           0 
##                             Subway Wait Assessment - Q Line 
##                                                           0 
##                             Subway Wait Assessment - R Line 
##                                                           0 
##                            Subway Wait Assessment - S 42 St 
##                                                           0 
##                             Subway Wait Assessment - S Fkln 
##                                                           0 
##                            Subway Wait Assessment - S Rock  
##                                                           0 
##                             Subway Wait Assessment - W Line 
##                                                           0 
##                      Total Paratransit Ridership - NYCT Bus 
##                                                           0 
##                                 Total Ridership - NYCT Bus  
##                                                           0 
##                                   Total Ridership - Subways 
##                                                           0 
##                     Ulmer Park Depot - % of Completed Trips 
##                                                           0 
## Wait Assessment - Subways (Inactive, Historic Calculations) 
##                                                           0 
##                     West Farms Depot - % of Completed Trips 
##                                                           0 
##                          Yukon Depot - % of Completed Trips 
##                                                           0

# data should be as complete as possible so remove the years 2009 and 2010 and standardize the remainder of the years be removing incomplete subway lines: S line 42 & the W line
sub.updated<-sub %>% 
  filter(PERIOD_YEAR >= "2011", INDICATOR_NAME != "OTP (Terminal) - W Line", INDICATOR_NAME != "OTP (Terminal) - S Line 42 St.")

# check on the data once more
table(sub.updated$PERIOD_YEAR)

## 
## 2011 2012 2013 2014 2015 2016 2017 
##  264  264  264  264  264  264  264

table(sub.updated$PERIOD_MONTH)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 154 154 154 154 154 154 154 154 154 154 154 154

3.2 Weather Data Cleansing

weather.2017.cleaned <- weather.2017 %>% select(-ANNUAL)
new <- gather(weather.2017.cleaned , 'JAN', 'FEB', 'MAR', 'APR','MAY','JUN','JUL','AUG', 'SEP','OCT','NOV','DEC',key = 'Month', value = 'Avg Monthly Temp')

Before converting the data from wide to long:

weather.2017

##    YEAR  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC ANNUAL
## 1  2008 36.5 35.8 42.6 54.9 60.1 74.0 78.4 73.8 68.8 55.1 45.8 38.1  55.30
## 2  2009 27.9 36.7 42.4 54.5 62.5 67.5 72.7 75.7 66.3 55.0 51.2 35.9  54.00
## 3  2010 32.5 33.1 48.2 57.9 65.3 74.7 81.3 77.4 71.1 58.1 47.9 32.8  56.70
## 4  2011 29.7 36.0 42.3 54.3 64.5 72.3 80.2 75.3 70.0 57.1 51.9 43.3  56.40
## 5  2012 37.3 40.9 50.9 54.8 65.1 71.0 78.8 76.7 68.8 58.0 43.9 41.5  57.30
## 6  2013 35.1 33.9 40.1 53.0 62.8 72.7 79.8 74.6 67.9 60.2 45.3 38.5  55.30
## 7  2014 28.6 31.6 37.7 52.3 64.0 72.5 76.1 74.5 69.7 59.6 45.3 40.5  54.40
## 8  2015 29.9 23.9 38.1 54.3 68.5 71.2 78.8 79.0 74.5 58.0 52.8 50.8  56.70
## 9  2016 34.5 37.7 48.9 53.3 62.8 72.3 78.7 79.2 71.8 58.8 49.8 38.3  57.20
## 10 2017 38.0 41.6 39.2 57.2 61.1 72.0 76.8 74.0 70.5 64.1 46.6 33.4  56.30
## 11 2018 31.7 42.0 40.1 49.5 66.9 71.7 77.5 76.0 68.5 57.5 47.5 38.0  53.55

After the conversion:

head(new %>% arrange(YEAR),n=20)

##    YEAR Month Avg Monthly Temp
## 1  2008   JAN             36.5
## 2  2008   FEB             35.8
## 3  2008   MAR             42.6
## 4  2008   APR             54.9
## 5  2008   MAY             60.1
## 6  2008   JUN             74.0
## 7  2008   JUL             78.4
## 8  2008   AUG             73.8
## 9  2008   SEP             68.8
## 10 2008   OCT             55.1
## 11 2008   NOV             45.8
## 12 2008   DEC             38.1
## 13 2009   JAN             27.9
## 14 2009   FEB             36.7
## 15 2009   MAR             42.4
## 16 2009   APR             54.5
## 17 2009   MAY             62.5
## 18 2009   JUN             67.5
## 19 2009   JUL             72.7
## 20 2009   AUG             75.7

# to combine the datasets, we must make the month's equal values
new [new== "JAN"]<-1
new [new== "FEB"]<-2
new [new== "MAR"]<-3
new [new== "APR"]<-4
new [new== "MAY"]<-5
new [new== "JUN"]<-6
new [new== "JUL"]<-7
new [new== "AUG"]<-8
new [new== "SEP"]<-9
new [new== "OCT"]<-10
new [new== "NOV"]<-11
new [new== "DEC"]<-12

new<-new %>%
  rename(
   PERIOD_YEAR = YEAR,
   PERIOD_MONTH = Month
  )

new.weather<-new %>% 
  filter(PERIOD_YEAR >= "2011" & PERIOD_YEAR <="2017")

3.3 Combined Data Cleansing

# check the structure of the two frames before merging as year and month variable types must match before merge
str(sub.updated)

## 'data.frame':    1848 obs. of  7 variables:
##  $ INDICATOR_NAME: Factor w/ 86 levels " Bus Passenger Wheelchair Lift Usage - NYCT Bus",..: 30 30 30 30 30 30 30 30 30 30 ...
##  $ PERIOD_YEAR   : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ PERIOD_MONTH  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ YTD_TARGET    : num  91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 ...
##  $ YTD_ACTUAL    : num  90.8 91.3 91.6 91.3 91.3 91.3 91.2 91.4 90.8 90.9 ...
##  $ MONTHLY_TARGET: num  91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 91.5 ...
##  $ MONTHLY_ACTUAL: num  90.8 91.9 92 90.5 91.3 89.2 90.6 92.8 85.8 91.4 ...

str(new.weather)

## 'data.frame':    84 obs. of  3 variables:
##  $ PERIOD_YEAR     : int  2011 2012 2013 2014 2015 2016 2017 2011 2012 2013 ...
##  $ PERIOD_MONTH    : chr  "1" "1" "1" "1" ...
##  $ Avg Monthly Temp: num  29.7 37.3 35.1 28.6 29.9 34.5 38 36 40.9 33.9 ...

sub.updated$PERIOD_YEAR<-as.factor(sub.updated$PERIOD_YEAR)
sub.updated$PERIOD_MONTH<-as.factor(sub.updated$PERIOD_MONTH)
new.weather$PERIOD_YEAR<-as.factor(new.weather$PERIOD_YEAR)
new.weather$PERIOD_MONTH<-as.factor(new.weather$PERIOD_MONTH)

# combine the two dataframes
combined<-left_join(sub.updated, new.weather, by= c("PERIOD_YEAR", "PERIOD_MONTH"))

## Warning: Column `PERIOD_MONTH` joining factors with different levels,
## coercing to character vector

# rename avg monthly temp column to conform to column naming standards
combined<-combined %>%
  rename(
 MONTHLY_TEMP=`Avg Monthly Temp`
  )

# compare the monthly actual OTP to the monthly target OTP and if the actual is >= then the value is true
combined$ON_TIME <- as.numeric(combined$MONTHLY_ACTUAL >= combined$MONTHLY_TARGET)

# change percenmtages to decimals
combined$MONTHLY_TARGET <-combined$MONTHLY_TARGET/100
combined$MONTHLY_ACTUAL <-combined$MONTHLY_ACTUAL/100
combined$YTD_TARGET <-combined$YTD_TARGET/100
combined$YTD_ACTUAL <-combined$YTD_ACTUAL/100

# approximately 20 weekdays in a month as the OTP is only measured for weekdays
combined$DAYS_ON_TIME<-combined$MONTHLY_ACTUAL*20

# aggregate the values to see avg's by month
agg.combined<- aggregate(list(DAYS_ON_TIME=combined$DAYS_ON_TIME,MONTHLY_TEMP=combined$MONTHLY_TEMP), by= list(PERIOD_MONTH=combined$PERIOD_MONTH), FUN=mean)

agg.combined$PERIOD_MONTH<- as.factor(agg.combined$PERIOD_MONTH)

4 Exploration

We plotted the data using a scatterplot and a best fit line, immediately we could see the correlation was not linear. We used a boxplot to look for outliers outside of the 1.5 IQR as these values can skew the model, there were none. Then, we used a density plot to check is the response variable was close to normal, DAYS_ON_TIME looks to be close to normal.

Finally, we checked the correlation between MONTHLY_TEMP and DAYS_ON_TIME. The correlation is very low with ONLY 21% of the variation of DAYS_ON_TIME being explained by MONTHLY_TEMP.

# scatterplot shows the best fit may not be linear
scatter.smooth(x=agg.combined$MONTHLY_TEMP,y=agg.combined$DAYS_ON_TIME, xlab = "Monthly Temperature", ylab="% Days On Time")

# boxplot to see outliers outside the 1.5 IQR
par(mfrow=c(1, 2))  # divide graph area in 2 columns
boxplot(agg.combined$MONTHLY_TEMP,sub=paste("Outlier rows: ", boxplot.stats(agg.combined$MONTHLY_TEMP)$out))  

boxplot(agg.combined$DAYS_ON_TIME, sub=paste("Outlier rows: ", boxplot.stats(agg.combined$DAYS_ON_TIME)$out))

# density plot to see if reponse variable DAYS_ON_TIME is close to normal
par(mfrow=c(1, 2))

plot(density(agg.combined$MONTHLY_TEMP), main="Density Plot: MONTHLY TEMP", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(agg.combined$MONTHLY_TEMP), 3)))  
polygon(density(agg.combined$MONTHLY_TEMP), col="red")

plot(density(agg.combined$DAYS_ON_TIME), main="Density Plot: DAYS ON TIME", ylab="Frequency", sub=paste("Skewness:", round(e1071::skewness(agg.combined$DAYS_ON_TIME), 3)))  
polygon(density(agg.combined$DAYS_ON_TIME), col="red")

# correlation is low indicating much of the variation by DAYS_ON_TIME is not explained by MONTHLY_TEMP
cor(agg.combined$MONTHLY_TEMP,agg.combined$DAYS_ON_TIME)

## [1] 0.2098454

5 Modeling

H0: Monthly temperature does NOT predict subway on-time performance H1: Monthly temperature DOES predict subway on-time performance

The first model used was a linear regression with the equation: DAYS_ON_TIME = 0.003394 * MONTHLY_TEMP + 14.672797

We need to evaluate the summary statistics to check for statistical signifigance: R-squared: higher value is better adjusted R-squared: higher value is better std.error: closer to zero is better F-statistic: higher is better t-statistic: greater than 1.96 for p-value to be less than .05

# Linear model: DAYS_ON_TIME =  0.003394 * MONTHLY_TEMP + 15.3295609 
fit<- lm(DAYS_ON_TIME ~ MONTHLY_TEMP, agg.combined)
summary(fit)

## 
## Call:
## lm(formula = DAYS_ON_TIME ~ MONTHLY_TEMP, data = agg.combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51322 -0.14475 -0.02435  0.17780  0.41354 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.672797   0.291602  50.318 2.32e-13 ***
## MONTHLY_TEMP  0.003394   0.005001   0.679    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2687 on 10 degrees of freedom
## Multiple R-squared:  0.04404,    Adjusted R-squared:  -0.05156 
## F-statistic: 0.4606 on 1 and 10 DF,  p-value: 0.5127

# Plot the values with a best fit line
plot(DAYS_ON_TIME ~ MONTHLY_TEMP, agg.combined, xlab="Monthly temperature", ylab="Subway days on-time", main="Temperature vs. Days On-time")
abline(fit)

# residual analysis 
plot(fit$fitted.values, fit$residuals, xlab="Fitted Values", ylab="Residuals",
     main="Fitted Values vs. Residuals")
abline(h=0)

qqnorm(fit$residuals)
qqline(fit$residuals)

Our initial linear model was a poor fit as:

R-squared: low with only 4% of the variation of on-time performance is explained by the weather
adjusted R-squared: is negative indicating no relationship
std.error: is close to zero
F-statistic: is very low at .46
t-statistic: is only .679 with a p-value greater than .05

The residuals plot also supports the poor fit as: * the residuals vary greatly and are not centered around 0 * the residuals are not nearly normal as the values are curved at the tails

Let us try to transform the model to a polynomial in hopes of seeing a better fit

# second degree polynomial fit: DAYS_ON_TIME = -0.0220708* MONTHLY_TEMP + 0.0002278 * MONTHLY_TEMP^2 + 15.3295609
fit2 <- lm(DAYS_ON_TIME ~ poly(MONTHLY_TEMP,2, raw=TRUE), agg.combined)
summary(fit2)

## 
## Call:
## lm(formula = DAYS_ON_TIME ~ poly(MONTHLY_TEMP, 2, raw = TRUE), 
##     data = agg.combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47278 -0.13118 -0.04407  0.13315  0.46768 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        15.3295609  1.3302477  11.524 1.09e-06
## poly(MONTHLY_TEMP, 2, raw = TRUE)1 -0.0220708  0.0504903  -0.437    0.672
## poly(MONTHLY_TEMP, 2, raw = TRUE)2  0.0002278  0.0004493   0.507    0.624
##                                       
## (Intercept)                        ***
## poly(MONTHLY_TEMP, 2, raw = TRUE)1    
## poly(MONTHLY_TEMP, 2, raw = TRUE)2    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2792 on 9 degrees of freedom
## Multiple R-squared:  0.07059,    Adjusted R-squared:  -0.136 
## F-statistic: 0.3418 on 2 and 9 DF,  p-value: 0.7194

# third degree polynomial fit: DAYS_ON_TIME = -1.141e-01* MONTHLY_TEMP + 1.946e-03* MONTHLY_TEMP^2 - 1.027e-05* MONTHLY_TEMP^3 + 1.690e+01
fit3 <- lm(DAYS_ON_TIME ~ poly(MONTHLY_TEMP,3, raw=TRUE), agg.combined)
summary(fit3)

## 
## Call:
## lm(formula = DAYS_ON_TIME ~ poly(MONTHLY_TEMP, 3, raw = TRUE), 
##     data = agg.combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44757 -0.14011 -0.03969  0.13230  0.47382 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         1.690e+01  6.175e+00   2.736   0.0256
## poly(MONTHLY_TEMP, 3, raw = TRUE)1 -1.141e-01  3.570e-01  -0.320   0.7575
## poly(MONTHLY_TEMP, 3, raw = TRUE)2  1.946e-03  6.611e-03   0.294   0.7759
## poly(MONTHLY_TEMP, 3, raw = TRUE)3 -1.027e-05  3.941e-05  -0.261   0.8010
##                                     
## (Intercept)                        *
## poly(MONTHLY_TEMP, 3, raw = TRUE)1  
## poly(MONTHLY_TEMP, 3, raw = TRUE)2  
## poly(MONTHLY_TEMP, 3, raw = TRUE)3  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2949 on 8 degrees of freedom
## Multiple R-squared:  0.07841,    Adjusted R-squared:  -0.2672 
## F-statistic: 0.2269 on 3 and 8 DF,  p-value: 0.8751

# fourth degree polynomial fit: DAYS_ON_TIME = -1.413e+00 * MONTHLY_TEMP + 3.899e-02* MONTHLY_TEMP^2 -4.657e-04 * MONTHLY_TEMP^3+ 2.041e-06* MONTHLY_TEMP^4 + 3.343e+01
fit4<-lm(DAYS_ON_TIME ~ poly(MONTHLY_TEMP,4, raw=TRUE), agg.combined)
summary(fit4)

## 
## Call:
## lm(formula = DAYS_ON_TIME ~ poly(MONTHLY_TEMP, 4, raw = TRUE), 
##     data = agg.combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46068 -0.12951 -0.02544  0.11706  0.41079 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         3.343e+01  2.953e+01   1.132    0.295
## poly(MONTHLY_TEMP, 4, raw = TRUE)1 -1.413e+00  2.295e+00  -0.616    0.558
## poly(MONTHLY_TEMP, 4, raw = TRUE)2  3.899e-02  6.496e-02   0.600    0.567
## poly(MONTHLY_TEMP, 4, raw = TRUE)3 -4.657e-04  7.951e-04  -0.586    0.576
## poly(MONTHLY_TEMP, 4, raw = TRUE)4  2.041e-06  3.559e-06   0.574    0.584
## 
## Residual standard error: 0.3081 on 7 degrees of freedom
## Multiple R-squared:  0.1198, Adjusted R-squared:  -0.3832 
## F-statistic: 0.2381 on 4 and 7 DF,  p-value: 0.9081

# plot the polynomial models
xx <- seq(0,80, length=50)
plot(x=agg.combined$MONTHLY_TEMP,y=agg.combined$DAYS_ON_TIME)
lines(xx, predict(fit, data.frame(MONTHLY_TEMP=xx)), col="red")
lines(xx, predict(fit2, data.frame(MONTHLY_TEMP=xx)), col="green")
lines(xx, predict(fit3, data.frame(MONTHLY_TEMP=xx)), col="blue")
lines(xx, predict(fit4, data.frame(MONTHLY_TEMP=xx)), col="purple")

# fit2: residual analysis 
plot(fit2$fitted.values, fit2$residuals, xlab="Fitted Values", ylab="Residuals",
     main="Poly 2: Fitted Values vs. Residuals")
abline(h=0)

qqnorm(fit2$residuals)
qqline(fit2$residuals)

# fit3: residual analysis 
plot(fit3$fitted.values, fit3$residuals, xlab="Fitted Values", ylab="Residuals",
     main="Poly 2: Fitted Values vs. Residuals")
abline(h=0)

qqnorm(fit3$residuals)
qqline(fit3$residuals)

# fit4: residual analysis 
plot(fit4$fitted.values, fit4$residuals, xlab="Fitted Values", ylab="Residuals",
     main="Poly 2: Fitted Values vs. Residuals")
abline(h=0)

qqnorm(fit4$residuals)
qqline(fit4$residuals)

The polynomial models of 2nd, 3rd, 4th degree pass through more points and visually look to be a close fit, however looking at the summary statistics we can see as we add in more intercepts the std.error and p-values become larger indicating poor fits.

The residual analysis on the polynomial models also shows even greater variance around the median and as the polynomial increases, the residuals at the tails become more skewed indicating the models are not nearly normal.

6 Conclusion

We accept the NULL hypothesis that weather is NOT a predictor of subway on-time performance. After attempting a linear and polynomial regression, our models failed to show a strong correlation between the two variables.

In order to accurately predict subway on-time performance, we may need to add several additional factors to our model such as:

equipment failuers
train issues
customer injury
weather
time of day (rush hour)
precipitation rates

Data 607: Final Project

Anthony Munoz, Christina Valore, David Apolinar.

5/10/2019