A key issue with this data set is that the window it covers does not contain enough time periods with rain, despite a very high total sample size of 42,649 observations. When comparing rain versus no-rain ridership for specific time periods, very different patterns emerge: mean ridership is higher during rain for some time slots, lower for others, and not statistically significantly different for the rest. The weekday results are the most reliable because weekdays occupy five days a week versus two, giving a larger sample to test. They show that subway ridership is lower during rain for the 20:00 time slot and higher during rain for the midnight time slot. The key conclusion is that subway traffic takes on different characteristics over the course of a day and a week, and that ridership in each time slot responds differently to rain. Although purely conjecture, this is likely the result of work-related travel: people have more discretion over traveling outside commuter time periods and less discretion during them.
Key steps for further development of this project (if time and money allowed) would be to 1) expand the data set, and 2) subset the data even further than was done here (for example, flag holidays or look at each weekday individually, as sketched below). These steps were not taken because of difficulties in obtaining a larger data set (see “Sidenote on the data”).
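For example, a minimal sketch of such a holiday flag (the column names here are illustrative assumptions, not part of the original analysis; Memorial Day fell on 2011-05-30):

# Hypothetical example: flag holidays and parse the date so each weekday
# could later be tested on its own
holidays <- as.Date("2011-05-30")
NY_subway <- NY_subway %>%
  mutate(date = as.Date(DATEn, format = "%m-%d-%y"),
         is_holiday = as.integer(date %in% holidays))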
In addition to answering the original question, this write-up also documents a few detours the project took while trying to form an answer.
Load Required R Packages:
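A minimal setup sketch, assuming the packages implied by the functions used below (the exact list is an assumption on my part):

# Packages inferred from the functions used in this report (assumed, not exhaustive)
library(dplyr)       # glimpse, filter, select, mutate, group_by, summarise
library(tidyr)       # spread
library(data.table)  # data.table
library(ggplot2)     # ggplot, geom_density, geom_vline
# NY_subway is assumed to have been loaded beforehand, e.g. with data.table::fread()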
Data set: May 1, 2011 - May 31, 2011, provided by Udacity. The data set covers all turnstiles in the Metropolitan Transportation Authority (MTA) New York City subway system.
#Data sample
glimpse(NY_subway)
## Observations: 42649
## Variables:
## $ UNIT (fctr) R003, R003, R003, R003, R003, R003, R003, R00...
## $ DATEn (fctr) 05-01-11, 05-01-11, 05-01-11, 05-01-11, 05-01...
## $ TIMEn (fctr) 00:00:00, 04:00:00, 12:00:00, 16:00:00, 20:00...
## $ ENTRIESn (int) 4388333, 4388333, 4388333, 4388333, 4388333, 4...
## $ EXITSn (int) 2911002, 2911002, 2911002, 2911002, 2911002, 2...
## $ ENTRIESn_hourly (dbl) 0, 0, 0, 0, 0, 15, 19, 488, 490, 231, 235, 74,...
## $ EXITSn_hourly (dbl) 0, 0, 0, 0, 0, 34, 40, 118, 132, 232, 405, 164...
## $ datetime (fctr) 2011-05-01 00:00:00, 2011-05-01 04:00:00, 201...
## $ hour (int) 0, 4, 12, 16, 20, 0, 4, 8, 12, 16, 20, 0, 4, 1...
## $ day_week (int) 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1...
## $ weekday (int) 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ station (fctr) CYPRESS HILLS, CYPRESS HILLS, CYPRESS HILLS, ...
## $ latitude (dbl) 40.68995, 40.68995, 40.68995, 40.68995, 40.689...
## $ longitude (dbl) -73.87256, -73.87256, -73.87256, -73.87256, -7...
## $ conds (fctr) Clear, Partly Cloudy, Mostly Cloudy, Mostly C...
## $ fog (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ precipi (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ pressurei (dbl) 30.22, 30.25, 30.28, 30.26, 30.28, 30.31, 30.2...
## $ rain (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tempi (dbl) 55.9, 52.0, 62.1, 57.9, 52.0, 50.0, 50.0, 53.1...
## $ wspdi (dbl) 3.5, 3.5, 6.9, 15.0, 10.4, 6.9, 4.6, 10.4, 11....
## $ meanprecipi (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ meanpressurei (dbl) 30.25800, 30.25800, 30.25800, 30.25800, 30.258...
## $ meantempi (dbl) 55.98000, 55.98000, 55.98000, 55.98000, 55.980...
## $ meanwspdi (dbl) 7.86, 7.86, 7.86, 7.86, 7.86, 8.25, 8.25, 8.25...
## $ weather_lat (dbl) 40.70035, 40.70035, 40.70035, 40.70035, 40.700...
## $ weather_lon (dbl) -73.88718, -73.88718, -73.88718, -73.88718, -7...
First, the MTA data for each turnstile in the transit system contain cumulative counters of the number of entries and exits at that turnstile. These readings are recorded roughly every 4 hours in the MTA data set. By taking the difference between the counter at time t and at time t-1, we can estimate the number of people entering and exiting through that turnstile in each time period over the time horizon.
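The provided data set already includes these differenced values (ENTRIESn_hourly and EXITSn_hourly), but as a rough sketch, assuming the cumulative counters and per-unit grouping, the calculation would look something like:

# Sketch only: derive per-period counts from the cumulative counters,
# computed within each turnstile unit in chronological order
# (datetime is stored as a factor here; its levels sort chronologically)
per_period <- NY_subway %>%
  arrange(UNIT, datetime) %>%
  group_by(UNIT) %>%
  mutate(entries_period = ENTRIESn - lag(ENTRIESn),
         exits_period   = EXITSn - lag(EXITSn)) %>%
  ungroup()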
However, that technique still presents a major challenge because the number of entries and exits at a turnstile by themselves do not directly correspond to ridership (i.e. the number of people actually in the transit system at time t).
For example, suppose that between 12pm and 4pm 10,000 people enter station XYZ and 10,000 people exit during the same time interval. If we use the 10,000 “gross” entry figure as our proxy for ridership, we would show ridership increasing by 10,000 when, in fact, the number of users in the system remained constant.
Irrespective of when the time horizon starts, there will always be people in the transit system, since it runs 24 hours a day. For our purposes, we want to track the changes in ridership from the start of the time horizon.
“Gross” entry and exit counts also present challenges for statistical testing: the turnstile counter only moves forward, so “gross” entries and exits can never be negative. The “gross” metrics are therefore bounded below by zero with a very high upper bound. That non-normal distribution prevents us from using parametric tests unless we apply a transformation (such as a natural log transformation). However, there is one option that solves both the ridership estimation problem and the distribution problem.
Creating a variable with “net” entries (entries minus exits) is more closely aligned with actual ridership and does not have the problems associated with a zero lower bound (i.e. a highly right-skewed distribution).
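A minimal sketch of that variable, assuming the hourly entry and exit columns already in the data set (the derivation itself is not echoed in this report):

# Sketch: "net" entries per period = entries minus exits in that period
NY_subway <- NY_subway %>%
  mutate(ENTRIES_hourly_net = ENTRIESn_hourly - EXITSn_hourly)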
The lower zero bound for “gross” entries leads to a highly skewed distribution that makes parametric testing difficult.
The same issue also applies to “gross” exits.
However, “net” entries has a more symmetrical distribution, closer to a t-distribution. This lets us avoid resorting to non-parametric testing and gives us a better proxy for ridership.
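A quick way to visualize the two distributions described above (a sketch only; the binwidth is an arbitrary choice):

# Sketch: compare the right-skewed "gross" entries with the roughly
# symmetrical "net" entries
ggplot(NY_subway, aes(x = ENTRIESn_hourly)) +
  geom_histogram(binwidth = 500) +
  ggtitle("Gross entries per period (bounded at zero, right skewed)")
ggplot(NY_subway, aes(x = ENTRIES_hourly_net)) +
  geom_histogram(binwidth = 500) +
  ggtitle("Net entries per period (roughly symmetrical)")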
The quantiles show that there is a slight left skew, but the overall suitability for parametric testing has significantly improved.
summary(NY_subway$ENTRIES_hourly_net)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31070.0 -167.0 95.0 525.1 811.0 28420.0
The last major data manipulation was to create a new rain indicator variable. The rain flag in the data set specifies whether rain occurred at some point during the day, not during each individual time slot. To address this, a new variable (new_rain_var) is created from the “precipi” variable, which quantifies the amount of rain (in inches) that fell during each individual time slot. “new_rain_var” is 1 if “precipi” is greater than 0, and 0 otherwise.
NY_subway <- NY_subway %>%
mutate (new_rain_var = ifelse(precipi == 0,0,1))
T-testing is the most appropriate statistical procedure for answering whether NYC subway ridership increases or decreases during periods of rain. From a statistical perspective, the question is whether mean ridership during periods of rain differs significantly from mean ridership during periods without rain, and comparing the means of two samples is exactly what a t-test does. Specifically, Welch’s t-test is used because it is more robust to unequal variances between the two samples than the standard two-sample t-test.
Why not use linear regression?
Linear regression is used to examine the relationship between predictor and dependent variables. The question for this project is not to describe that relationship but to answer a yes-or-no question about whether the means of two samples are significantly different. A regression model would also add complexity that isn’t necessary to determine whether ridership during rain periods differs significantly from non-rain periods.
Categorical variables (day of the week and time of day) are likely the key drivers of subway ridership; rain by itself is unlikely to have much explanatory power. When categorical variables carry such a large weight in the dependent variable, linear regression is not likely to be a great asset unless a quantitative variable also carries material explanatory power, or the qualitative and quantitative variables interact in a material fashion.
The data set doesn’t have many time periods with rain, which makes building a model with granular categorical variables impossible for some time slots.
T-testing answers the question posed in this project without the added complexity of linear regression, which doesn’t add any value in this case. In summary, linear regression is the wrong tool for the job!
Welch’s t-test:
Null hypothesis: rain ENTRIES_hourly_net (mean) = non-rain ENTRIES_hourly_net (mean)
Alternative hypothesis: rain ENTRIES_hourly_net (mean) != non-rain ENTRIES_hourly_net (mean)
Significance level: 0.05
Based on the p-value of 0.4743, we cannot reject the Null Hypothesis.
Note: 1) equal variance is not assumed; 2) a two-sided test is performed.
t.test(filter(NY_subway, precipi == 0) %>% select(ENTRIES_hourly_net),
       filter(NY_subway, precipi > 0) %>% select(ENTRIES_hourly_net),
       var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: filter(NY_subway, precipi == 0) %>% select(ENTRIES_hourly_net) and filter(NY_subway, precipi > 0) %>% select(ENTRIES_hourly_net)
## t = 0.71566, df = 3264.8, p-value = 0.4743
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -53.79891 115.64758
## sample estimates:
## mean of x mean of y
## 527.1483 496.2240
Not rejecting the Null Hypothesis does not necessarily mean it can be accepted. The power of the t-test performed (0.48) shows there is a substantial risk of committing a Type II error by accepting the null, so we should not form any conclusions based on this first test.
stats <- NY_subway %>%
group_by(new_rain_var) %>%
summarise(mean=mean(ENTRIES_hourly_net), sd=sd(ENTRIES_hourly_net), count=n())
# n and sd are length-2 vectors (one per rain group), so power.t.test() returns one power
# value per pair; power.t.test() assumes equal group sizes, so this is an approximation.
power.t.test(n = stats$count, sd = stats$sd, sig.level = 0.05,
delta = stats$mean[1] - stats$mean[2],
power=NULL,
type = c("two.sample"),
alternative = c("two.sided"),
strict = FALSE)
##
## Two-sample t test power calculation
##
## n = 39827, 2822
## delta = 30.92434
## sd = 2291.958, 2212.911
## sig.level = 0.05
## power = 0.47767726, 0.07562572
## alternative = two.sided
##
## NOTE: n is number in *each* group
For the second set of tests, we perform the same testing on a more granular view of the data set. We know that time of day and day of the week are going to be major drivers of subway ridership. If that’s true, we need to control for those variables when assessing whether rain is a statistically significant factor in subway ridership.
Now, we extract a list of the time intervals at which the observations were recorded using dplyr functions. From this we can see that all the data is binned into 6 time buckets.
times<-NY_subway %>%
select(TIMEn) %>%
distinct(TIMEn)
times
## TIMEn
## 1: 00:00:00
## 2: 04:00:00
## 3: 12:00:00
## 4: 16:00:00
## 5: 20:00:00
## 6: 08:00:00
# The dplyr pipeline above returns a data table of times; convert it to a character vector,
# which is needed for the lapply() calls later on.
times<-as.character(times$TIMEn)
Perform t-tests for the weekday time periods.
weekday_ttest <- lapply(times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & weekday == 1 & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & weekday == 1 & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekday_ttest)){
est_mean_clear <- c(est_mean_clear,weekday_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekday_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,weekday_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
weekday_summary<-data.table(times, p_values, est_mean_clear, est_mean_rain)
weekday_summary
## times p_values est_mean_clear est_mean_rain
## 1: 00:00:00 0.000 245.0258 585.67824
## 2: 04:00:00 0.662 16.8990 29.59427
## 3: 12:00:00 0.907 944.7769 970.89709
## 4: 16:00:00 0.569 836.2668 801.08367
## 5: 20:00:00 0.000 1201.7513 -760.75000
## 6: 08:00:00 0.455 456.0016 405.88455
Now, the same thing for weekend data except with a twist…
The only difference with the weekend data is that we run into a small-sample issue: (surprisingly) there are weekend rain observations only for the 4am and 8am time blocks. It apparently did not rain on the weekends in May during the other time slots.
# new_rain_var (created earlier) is 0 when precipi equals 0, 1 otherwise.
# Show the count of rain vs. no-rain observations for the weekends by the 6 time slots.
# NAs indicate there were no observations for that time slot / rain combination. As you can see,
# there are no weekend rain observations for the 00:00, 12:00, 16:00, and 20:00 time slots, so we
# can't run the t-tests for those slots.
NY_subway %>%
filter(weekday==0) %>%
group_by(TIMEn, new_rain_var) %>%
summarize(count = n()) %>%
spread(., key=new_rain_var, count, fill = NA)
## Source: local data table [6 x 3]
## Groups:
##
## TIMEn 0 1
## 1 00:00:00 2142 NA
## 2 04:00:00 1986 154
## 3 08:00:00 1335 152
## 4 12:00:00 2132 NA
## 5 16:00:00 2134 NA
## 6 20:00:00 2144 NA
# Create a new list of weekend times for which to run the t-tests. This could be done manually,
# but deriving it from the data preserves flexibility for future data sets: hard-coding the time
# slots would just create a failure point when running the same R code on data sets with
# different time horizons.
weekend_rain_times<-NY_subway %>%
filter(weekday==0 & new_rain_var ==1) %>%
select(TIMEn) %>%
distinct(TIMEn)
weekend_rain_times<-as.character(weekend_rain_times$TIMEn)
Run the weekend t-tests, extract the key results, and save them to a summary table.
weekend_ttest <- lapply(weekend_rain_times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & weekday == 0 & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & weekday == 0 & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekend_ttest)){
est_mean_clear <- c(est_mean_clear,weekend_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekend_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,weekend_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
weekend_summary<-data.table(weekend_rain_times, p_values, est_mean_clear, est_mean_rain)
weekend_summary
## weekend_rain_times p_values est_mean_clear est_mean_rain
## 1: 04:00:00 0.033 -9.620342 77.60390
## 2: 08:00:00 0.753 60.261423 54.58553
So, as mentioned in the Overview section, it is difficult to draw conclusions at this point despite seeing some statistically significant differences. I do think rain affects subway ridership, but to really understand how, a larger data set is needed to allow more granular subsetting. With a data set this small, we are very limited in how much subsetting we can do before running into not just small sample sizes, but zero sample sizes for some time slots.
But just to highlight that a relationship probably exists, here is a density plot of overall subway ridership during periods of rain and no rain, followed by a Welch’s t-test on the weekend subsets.
no_rain<- NY_subway %>%
filter(precipi == 0 & weekday ==0) %>%
select(ENTRIES_hourly_net)
rain<- NY_subway %>%
filter(precipi > 0 & weekday ==0) %>%
select(ENTRIES_hourly_net)
rain_data <- NY_subway %>%
mutate (new_rain_var = ifelse(precipi > 0,"Rain"," No Rain")) %>%
select(new_rain_var,ENTRIES_hourly_net)
rain_data_mean <- rain_data %>%
group_by(new_rain_var) %>%
summarize(ridership = mean(ENTRIES_hourly_net))
ggplot(rain_data, aes(x=ENTRIES_hourly_net, fill=new_rain_var)) +
geom_density(alpha=.7) +
xlim(c(-2000,2000)) +
ggtitle("Density of Rain versus No-Rain Ridership") +
geom_vline(data=rain_data_mean, aes(xintercept=ridership, colour=new_rain_var), linetype="dashed", size=1)
## Warning in loop_apply(n, do.ply): Removed 5863 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 468 rows containing non-finite
## values (stat_density).
t.test(rain,no_rain, var.equal=FALSE, conf.level=0.95 )
##
## Welch Two Sample t-test
##
## data: rain and no_rain
## t = -9.5086, df = 423.84, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -265.0056 -174.2127
## sample estimates:
## mean of x mean of y
## 66.16993 285.77908
What about dropping the weekday versus weekend split and testing each time slot regardless of day type?
time_ttest <- lapply(times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(time_ttest)){
est_mean_clear <- c(est_mean_clear,time_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,time_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,time_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
time_summary<-data.table(times, p_values, est_mean_clear, est_mean_rain)
time_summary
## times p_values est_mean_clear est_mean_rain
## 1: 00:00:00 0.000 259.547943 585.67824
## 2: 04:00:00 0.162 9.162897 42.49738
## 3: 12:00:00 0.393 783.484611 970.89709
## 4: 16:00:00 0.123 707.701821 801.08367
## 5: 20:00:00 0.000 972.007515 -760.75000
## 6: 08:00:00 0.707 353.872608 333.82321
So, the end conclusion is that rain has a negative impact on subway ridership in the 20:00 time slot (day independent) and a positive impact in the 00:00 (day independent), 04:00 (weekend), and 00:00 (weekday) time slots. Although one month of data with c. 43,000 observations seems like a large data set, its size is still a limiting factor because of the small number of periods with rain.
Linear regression model regressing ENTRIESn_hourly on hour and day of week (categorical variables), precipi, wspdi, and the precipi:wspdi interaction.
# Convert hour and day of week to factors
NY_subway$hour_f <- factor(NY_subway$hour)
NY_subway$day_week_f <- factor(NY_subway$day_week)
# Build linear model with the hour and day-of-week factors, precipi, wspdi,
# and the precipi:wspdi interaction as predictors
subway_lm <- lm(ENTRIESn_hourly ~ hour_f + day_week_f + precipi + wspdi + precipi:wspdi, data = NY_subway)
summary(subway_lm)
##
## Call:
## lm(formula = ENTRIESn_hourly ~ hour_f + day_week_f + precipi +
## wspdi + precipi:wspdi, data = NY_subway)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3857.7 -1300.8 -515.4 387.4 28956.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1595.009 48.074 33.178 < 2e-16 ***
## hour_f4 -1109.002 44.515 -24.913 < 2e-16 ***
## hour_f8 -505.073 47.744 -10.579 < 2e-16 ***
## hour_f12 1704.440 45.662 37.327 < 2e-16 ***
## hour_f16 1034.462 46.567 22.215 < 2e-16 ***
## hour_f20 1906.561 45.331 42.058 < 2e-16 ***
## day_week_f1 356.158 46.288 7.694 1.45e-14 ***
## day_week_f2 425.020 49.333 8.615 < 2e-16 ***
## day_week_f3 451.031 48.757 9.251 < 2e-16 ***
## day_week_f4 310.993 49.933 6.228 4.76e-10 ***
## day_week_f5 -576.972 49.722 -11.604 < 2e-16 ***
## day_week_f6 -868.714 46.542 -18.665 < 2e-16 ***
## precipi 3221.374 1746.498 1.844 0.0651 .
## wspdi -36.406 3.377 -10.780 < 2e-16 ***
## precipi:wspdi -131.181 112.805 -1.163 0.2449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2702 on 42634 degrees of freedom
## Multiple R-squared: 0.1625, Adjusted R-squared: 0.1622
## F-statistic: 590.7 on 14 and 42634 DF, p-value: < 2.2e-16
plot(subway_lm)
2.1 What approach did you use to compute the coefficients theta and produce predictions for ENTRIESn_hourly in your regression model?
OLS via the R lm() function
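For illustration only, the same coefficients can be recovered from the normal equations; lm() actually uses a QR decomposition, so treat this as a sanity-check sketch rather than the method itself:

# Sketch: OLS coefficients theta = (X'X)^{-1} X'y via the normal equations
X <- model.matrix(ENTRIESn_hourly ~ hour_f + day_week_f + precipi + wspdi + precipi:wspdi,
                  data = NY_subway)
y <- NY_subway$ENTRIESn_hourly
theta <- solve(crossprod(X), crossprod(X, y))
# all.equal(as.numeric(theta), as.numeric(coef(subway_lm)))  # should be TRUE up to tolerance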
2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?
Hour of day and day of week (as dummy variables), precipi, wspdi, and the precipi:wspdi interaction term.
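As a side note (a sketch, not part of the original analysis), the 0/1 dummy coding R generates for a factor can be inspected directly:

# Each factor level except the baseline gets its own 0/1 indicator column
contrasts(NY_subway$day_week_f)
head(model.matrix(~ day_week_f, data = NY_subway))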
2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model. Your reasons might be based on intuition. For example, a response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.” Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”
2.4 What are the parameters (also known as “coefficients” or “weights”) of the non-dummy features in your linear regression model?
The non-dummy coefficients, from the summary(subway_lm) output above, are: precipi = 3221.374, wspdi = -36.406, and precipi:wspdi = -131.181.
2.5 What is your model’s R2 (coefficient of determination) value?
Multiple R-squared = 0.1625 (adjusted R-squared = 0.1622), as reported in the summary(subway_lm) output above.
2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2 value?
The R^2 value means this model has very little explanatory power for predicting ridership. Even setting that aside, pretty much all of the required assumptions for linear regression are violated: the residuals do not have constant variance and are not normally distributed. I do not think a linear model is appropriate for explaining ridership. The key variables (time and day) are categorical, and the quantitative variables (such as rain) play a smaller role in ridership. As stated earlier, this is why the t-test is more appropriate for answering the original question of this project.
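A quick base-graphics sketch of diagnostics that make those violations visible (assuming the fitted subway_lm object from above):

# Residual diagnostics: heavy right tail and non-constant variance
par(mfrow = c(1, 2))
hist(resid(subway_lm), breaks = 100, main = "Residuals", xlab = "Residual")
plot(fitted(subway_lm), resid(subway_lm),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
par(mfrow = c(1, 1))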