A key issue with this data set is that the window it covers does not contain enough time periods with rain, despite a very high total sample size of 42,649 observations. When comparing rain versus no-rain ridership for specific time periods, very different patterns emerge: mean ridership is higher during rain for some time slots, lower for others, and not statistically significantly different for the rest. The weekday results are the most reliable because weekdays occupy five days a week versus two, giving a larger sample to test. They show that subway ridership is lower during rain for the 20:00 time slot and higher during rain for the midnight time slot. The key conclusion is that subway traffic takes on different characteristics over the course of a day and a week, and that ridership in each time slot responds differently to rain. Although purely conjecture, this is likely the result of work-related travel: people have more discretion over traveling outside commuter time periods and less discretion during them.
Key steps for further development of this project (if time and money allowed) would be to 1) expand the data set, and 2) subset the data even further than was done here (for example, flag holidays or look at each weekday individually, as sketched below). These steps were not taken because of difficulties in obtaining a larger data set (see “Sidenote on the data”).
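For example, a minimal sketch of such a holiday flag (the column names here are illustrative assumptions, not part of the original analysis; Memorial Day fell on 2011-05-30):

# Hypothetical example: flag holidays and parse the date so each weekday
# could later be tested on its own
holidays <- as.Date("2011-05-30")
NY_subway <- NY_subway %>%
  mutate(date = as.Date(DATEn, format = "%m-%d-%y"),
         is_holiday = as.integer(date %in% holidays))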
In addition to answering the original question, this write-up also documents a few detours the project took while trying to form an answer.
Load Required R Packages:
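A minimal setup sketch, assuming the packages implied by the functions used below (the exact list is an assumption on my part):

# Packages inferred from the functions used in this report (assumed, not exhaustive)
library(dplyr)       # glimpse, filter, select, mutate, group_by, summarise
library(tidyr)       # spread
library(data.table)  # data.table
library(ggplot2)     # ggplot, geom_density, geom_vline
# NY_subway is assumed to have been loaded beforehand, e.g. with data.table::fread()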
Data set: May 1, 2011 - May 31, 2011, provided by Udacity. The data set covers all turnstiles in the Metropolitan Transportation Authority (MTA) New York City subway system.
#Data sample
glimpse(NY_subway)
## Observations: 42649
## Variables:
## $ UNIT (fctr) R003, R003, R003, R003, R003, R003, R003, R00...
## $ DATEn (fctr) 05-01-11, 05-01-11, 05-01-11, 05-01-11, 05-01...
## $ TIMEn (fctr) 00:00:00, 04:00:00, 12:00:00, 16:00:00, 20:00...
## $ ENTRIESn (int) 4388333, 4388333, 4388333, 4388333, 4388333, 4...
## $ EXITSn (int) 2911002, 2911002, 2911002, 2911002, 2911002, 2...
## $ ENTRIESn_hourly (dbl) 0, 0, 0, 0, 0, 15, 19, 488, 490, 231, 235, 74,...
## $ EXITSn_hourly (dbl) 0, 0, 0, 0, 0, 34, 40, 118, 132, 232, 405, 164...
## $ datetime (fctr) 2011-05-01 00:00:00, 2011-05-01 04:00:00, 201...
## $ hour (int) 0, 4, 12, 16, 20, 0, 4, 8, 12, 16, 20, 0, 4, 1...
## $ day_week (int) 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1...
## $ weekday (int) 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ station (fctr) CYPRESS HILLS, CYPRESS HILLS, CYPRESS HILLS, ...
## $ latitude (dbl) 40.68995, 40.68995, 40.68995, 40.68995, 40.689...
## $ longitude (dbl) -73.87256, -73.87256, -73.87256, -73.87256, -7...
## $ conds (fctr) Clear, Partly Cloudy, Mostly Cloudy, Mostly C...
## $ fog (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ precipi (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ pressurei (dbl) 30.22, 30.25, 30.28, 30.26, 30.28, 30.31, 30.2...
## $ rain (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tempi (dbl) 55.9, 52.0, 62.1, 57.9, 52.0, 50.0, 50.0, 53.1...
## $ wspdi (dbl) 3.5, 3.5, 6.9, 15.0, 10.4, 6.9, 4.6, 10.4, 11....
## $ meanprecipi (dbl) 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ meanpressurei (dbl) 30.25800, 30.25800, 30.25800, 30.25800, 30.258...
## $ meantempi (dbl) 55.98000, 55.98000, 55.98000, 55.98000, 55.980...
## $ meanwspdi (dbl) 7.86, 7.86, 7.86, 7.86, 7.86, 8.25, 8.25, 8.25...
## $ weather_lat (dbl) 40.70035, 40.70035, 40.70035, 40.70035, 40.700...
## $ weather_lon (dbl) -73.88718, -73.88718, -73.88718, -73.88718, -7...
First, the MTA data for each turnstile in the transit system contain cumulative counters of the number of entries and exits at that turnstile. These readings are recorded roughly every 4 hours in the MTA data set. By taking the difference between the counter at time t and at time t-1, we can estimate the number of people entering and exiting through that turnstile in each time period over the time horizon.
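The provided data set already includes these differenced values (ENTRIESn_hourly and EXITSn_hourly), but as a rough sketch, assuming the cumulative counters and per-unit grouping, the calculation would look something like:

# Sketch only: derive per-period counts from the cumulative counters,
# computed within each turnstile unit in chronological order
# (datetime is stored as a factor here; its levels sort chronologically)
per_period <- NY_subway %>%
  arrange(UNIT, datetime) %>%
  group_by(UNIT) %>%
  mutate(entries_period = ENTRIESn - lag(ENTRIESn),
         exits_period   = EXITSn - lag(EXITSn)) %>%
  ungroup()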
However, that technique still presents a major challenge because the number of entries and exits at a turnstile by themselves do not directly correspond to ridership (i.e. the number of people actually in the transit system at time t).
For example, suppose that between 12pm and 4pm 10,000 people enter station XYZ and 10,000 people exit during the same time interval. If we use the 10,000 “gross” entry figure as our proxy for ridership, we would show ridership increasing by 10,000 when, in fact, the number of users in the system remained constant.
Irrespective of when the time horizon starts, there will always be people in the transit system, since it runs 24 hours a day. For our purposes, we want to track the changes in ridership from the start of the time horizon.
“Gross” entry and exit counts also present challenges for statistical testing: the turnstile counter only moves forward, so “gross” entries and exits can never be negative. The “gross” metrics are therefore bounded below by zero with a very high upper bound. That non-normal distribution prevents us from using parametric tests unless we apply a transformation (such as a natural log transformation). However, there is one option that solves both the ridership estimation problem and the distribution problem.
Creating a variable with “net” entries (entries minus exits) is more closely aligned with actual ridership and does not have the problems associated with a zero lower bound (i.e. a highly right-skewed distribution).
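A minimal sketch of that variable, assuming the hourly entry and exit columns already in the data set (the derivation itself is not echoed in this report):

# Sketch: "net" entries per period = entries minus exits in that period
NY_subway <- NY_subway %>%
  mutate(ENTRIES_hourly_net = ENTRIESn_hourly - EXITSn_hourly)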
The lower zero bound for “gross” entries leads to a highly skewed distribution that makes parametric testing difficult.
The same issue also applies to “gross” exits.
However, “net” entries has a more symmetrical distribution, closer to a t-distribution. This lets us avoid resorting to non-parametric testing and gives us a better proxy for ridership.
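A quick way to visualize the two distributions described above (a sketch only; the binwidth is an arbitrary choice):

# Sketch: compare the right-skewed "gross" entries with the roughly
# symmetrical "net" entries
ggplot(NY_subway, aes(x = ENTRIESn_hourly)) +
  geom_histogram(binwidth = 500) +
  ggtitle("Gross entries per period (bounded at zero, right skewed)")
ggplot(NY_subway, aes(x = ENTRIES_hourly_net)) +
  geom_histogram(binwidth = 500) +
  ggtitle("Net entries per period (roughly symmetrical)")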
The quantiles show that there is a slight left skew, but the overall suitability for parametric testing has significantly improved.
summary(NY_subway$ENTRIES_hourly_net)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31070.0 -167.0 95.0 525.1 811.0 28420.0
The last major data manipulation was to create a new rain indicator variable. The rain flag in the data set specifies whether rain occurred at some point during the day, not during each individual time slot. To address this, a new variable (new_rain_var) is created from the “precipi” variable, which quantifies the amount of rain (in inches) that fell during each individual time slot. “new_rain_var” is 1 if “precipi” is greater than 0, and 0 otherwise.
NY_subway <- NY_subway %>%
mutate (new_rain_var = ifelse(precipi == 0,0,1))
T-testing is the most appropriate statistical procedure for answering whether NYC subway ridership increases or decreases during periods of rain. From a statistical perspective, the question is whether mean ridership during periods of rain differs significantly from mean ridership during periods without rain, and comparing the means of two samples is exactly what a t-test does. Specifically, Welch’s t-test is used because it is more robust to unequal variances between the two samples than the standard two-sample t-test.
Why not use linear regression?
Linear regression is used to examine the relationship between predictor and dependent variables. The question for this project is not to describe that relationship but to answer a yes-or-no question about whether the means of two samples are significantly different. A regression model would also add complexity that isn’t necessary to determine whether ridership during rain periods differs significantly from non-rain periods.
Categorical variables (day of the week and time of day) are likely the key drivers of subway ridership; rain by itself is unlikely to have much explanatory power. When categorical variables carry such a large weight in the dependent variable, linear regression is not likely to be a great asset unless a quantitative variable also carries material explanatory power, or the qualitative and quantitative variables interact in a material fashion.
The data set doesn’t have many time periods with rain, which makes building a model with granular categorical variables impossible for some time slots.
T-testing answers the question posed in this project without the added complexity of linear regression, which doesn’t add any value in this case. In summary, linear regression is the wrong tool for the job!
Welch’s t-test:
Null hypothesis: rain ENTRIES_hourly_net (mean) = non-rain ENTRIES_hourly_net (mean)
Alternative hypothesis: rain ENTRIES_hourly_net (mean) != non-rain ENTRIES_hourly_net (mean)
Significance level: 0.05
Based on the p-value of 0.4743, we cannot reject the Null Hypothesis.
Note: 1) equal variance is not assumed; 2) a two-sided test is performed.
t.test(filter(NY_subway, precipi == 0) %>% select(ENTRIES_hourly_net),
       filter(NY_subway, precipi > 0) %>% select(ENTRIES_hourly_net),
       var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: filter(NY_subway, precipi == 0) %>% select(ENTRIES_hourly_net) and filter(NY_subway, precipi > 0) %>% select(ENTRIES_hourly_net)
## t = 0.71566, df = 3264.8, p-value = 0.4743
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -53.79891 115.64758
## sample estimates:
## mean of x mean of y
## 527.1483 496.2240
Not rejecting the Null Hypothesis does not necessarily mean it can be accepted. The power of the t-test performed (0.48) shows there is a substantial risk of committing a Type II error by accepting the null, so we should not form any conclusions based on this first test.
stats <- NY_subway %>%
group_by(new_rain_var) %>%
summarise(mean=mean(ENTRIES_hourly_net), sd=sd(ENTRIES_hourly_net), count=n())
# n and sd are length-2 vectors (one per rain group), so power.t.test() returns one power
# value per pair; power.t.test() assumes equal group sizes, so this is an approximation.
power.t.test(n = stats$count, sd = stats$sd, sig.level = 0.05,
delta = stats$mean[1] - stats$mean[2],
power=NULL,
type = c("two.sample"),
alternative = c("two.sided"),
strict = FALSE)
##
## Two-sample t test power calculation
##
## n = 39827, 2822
## delta = 30.92434
## sd = 2291.958, 2212.911
## sig.level = 0.05
## power = 0.47767726, 0.07562572
## alternative = two.sided
##
## NOTE: n is number in *each* group
For the second set of tests, we perform the same testing on a more granular view of the data set. We know that time of day and day of the week are going to be major drivers of subway ridership. If that’s true, we need to control for those variables when assessing whether rain is a statistically significant factor in subway ridership.
Now, we extract a list of the time intervals at which the observations were recorded using dplyr functions. From this we can see that all the data is binned into 6 time buckets.
times<-NY_subway %>%
select(TIMEn) %>%
distinct(TIMEn)
times
## TIMEn
## 1: 00:00:00
## 2: 04:00:00
## 3: 12:00:00
## 4: 16:00:00
## 5: 20:00:00
## 6: 08:00:00
# The dplyr pipeline above returns a data table of times; convert it to a character vector,
# which is needed for the lapply() calls later on.
times<-as.character(times$TIMEn)
Perform t-tests for the weekday time periods.
weekday_ttest <- lapply(times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & weekday == 1 & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & weekday == 1 & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekday_ttest)){
est_mean_clear <- c(est_mean_clear,weekday_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekday_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,weekday_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
weekday_summary<-data.table(times, p_values, est_mean_clear, est_mean_rain)
weekday_summary
## times p_values est_mean_clear est_mean_rain
## 1: 00:00:00 0.000 245.0258 585.67824
## 2: 04:00:00 0.662 16.8990 29.59427
## 3: 12:00:00 0.907 944.7769 970.89709
## 4: 16:00:00 0.569 836.2668 801.08367
## 5: 20:00:00 0.000 1201.7513 -760.75000
## 6: 08:00:00 0.455 456.0016 405.88455
Now, the same thing for weekend data except with a twist…
The only difference with the weekend data is that we run into a small-sample issue: (surprisingly) there are weekend rain observations only for the 4am and 8am time blocks. It apparently did not rain on the weekends in May during the other time slots.
# new_rain_var (created earlier) is 0 when precipi equals 0, 1 otherwise.
# Show the count of rain vs. no-rain observations for the weekends by the 6 time slots.
# NAs indicate there were no observations for that time slot / rain combination. As you can see,
# there are no weekend rain observations for the 00:00, 12:00, 16:00, and 20:00 time slots, so we
# can't run the t-tests for those slots.
NY_subway %>%
filter(weekday==0) %>%
group_by(TIMEn, new_rain_var) %>%
summarize(count = n()) %>%
spread(., key=new_rain_var, count, fill = NA)
## Source: local data table [6 x 3]
## Groups:
##
## TIMEn 0 1
## 1 00:00:00 2142 NA
## 2 04:00:00 1986 154
## 3 08:00:00 1335 152
## 4 12:00:00 2132 NA
## 5 16:00:00 2134 NA
## 6 20:00:00 2144 NA
# Create a new list of weekend times for which to run the t-tests. This could be done manually,
# but deriving it from the data preserves flexibility for future data sets: hard-coding the time
# slots would just create a failure point when running the same R code on data sets with
# different time horizons.
weekend_rain_times<-NY_subway %>%
filter(weekday==0 & new_rain_var ==1) %>%
select(TIMEn) %>%
distinct(TIMEn)
weekend_rain_times<-as.character(weekend_rain_times$TIMEn)
Run the weekend t-tests, extract the key results, and save them to a summary table.
weekend_ttest <- lapply(weekend_rain_times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & weekday == 0 & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & weekday == 0 & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekend_ttest)){
est_mean_clear <- c(est_mean_clear,weekend_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekend_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,weekend_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
weekend_summary<-data.table(weekend_rain_times, p_values, est_mean_clear, est_mean_rain)
weekend_summary
## weekend_rain_times p_values est_mean_clear est_mean_rain
## 1: 04:00:00 0.033 -9.620342 77.60390
## 2: 08:00:00 0.753 60.261423 54.58553
So, as mentioned in the Overview section, it is difficult to draw conclusions at this point despite seeing some statistically significant differences. I do think rain affects subway ridership, but to really understand how, a larger data set is needed to allow more granular subsetting. With a data set this small, we are very limited in how much subsetting we can do before running into not just small sample sizes, but zero sample sizes for some time slots.
But just to highlight that a relationship probably exists, here is a density plot of overall subway ridership during periods of rain and no rain, followed by a Welch’s t-test on the weekend subsets.
no_rain<- NY_subway %>%
filter(precipi == 0 & weekday ==0) %>%
select(ENTRIES_hourly_net)
rain<- NY_subway %>%
filter(precipi > 0 & weekday ==0) %>%
select(ENTRIES_hourly_net)
rain_data <- NY_subway %>%
mutate (new_rain_var = ifelse(precipi > 0,"Rain"," No Rain")) %>%
select(new_rain_var,ENTRIES_hourly_net)
rain_data_mean <- rain_data %>%
group_by(new_rain_var) %>%
summarize(ridership = mean(ENTRIES_hourly_net))
ggplot(rain_data, aes(x=ENTRIES_hourly_net, fill=new_rain_var)) +
geom_density(alpha=.7) +
xlim(c(-2000,2000)) +
ggtitle("Density of Rain versus No-Rain Ridership") +
geom_vline(data=rain_data_mean, aes(xintercept=ridership, colour=new_rain_var), linetype="dashed", size=1)
## Warning in loop_apply(n, do.ply): Removed 5863 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 468 rows containing non-finite
## values (stat_density).
t.test(rain,no_rain, var.equal=FALSE, conf.level=0.95 )
##
## Welch Two Sample t-test
##
## data: rain and no_rain
## t = -9.5086, df = 423.84, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -265.0056 -174.2127
## sample estimates:
## mean of x mean of y
## 66.16993 285.77908
What about dropping the weekday versus weekend split and testing each time slot regardless of day type?
time_ttest <- lapply(times, function(x) {
  t.test(filter(NY_subway, TIMEn == x & precipi == 0) %>% select(ENTRIES_hourly_net),
         filter(NY_subway, TIMEn == x & precipi > 0) %>% select(ENTRIES_hourly_net))
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(time_ttest)){
est_mean_clear <- c(est_mean_clear,time_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,time_ttest[[i]]$estimate[2])
p_values <- round(c(p_values,time_ttest[[i]]$p.value),3)
}
#Build the data frame with a summary of the t-tests
time_summary<-data.table(times, p_values, est_mean_clear, est_mean_rain)
time_summary
## times p_values est_mean_clear est_mean_rain
## 1: 00:00:00 0.000 259.547943 585.67824
## 2: 04:00:00 0.162 9.162897 42.49738
## 3: 12:00:00 0.393 783.484611 970.89709
## 4: 16:00:00 0.123 707.701821 801.08367
## 5: 20:00:00 0.000 972.007515 -760.75000
## 6: 08:00:00 0.707 353.872608 333.82321
So, the end conclusion is that rain has a negative impact on subway ridership in the 20:00 time slot (day independent) and a positive impact in the 00:00 (day independent), 04:00 (weekend), and 00:00 (weekday) time slots. Although one month of data with c. 43,000 observations seems like a large data set, its size is still a limiting factor because of the small number of periods with rain.
Linear regression model regressing ENTRIESn_hourly on hour and day of week (categorical variables), precipi, wspdi, and the precipi:wspdi interaction.
# Convert hour and day of week to factors
NY_subway$hour_f <- factor(NY_subway$hour)
NY_subway$day_week_f <- factor(NY_subway$day_week)
# Build linear model with the hour and day-of-week factors, precipi, wspdi,
# and the precipi:wspdi interaction as predictors
subway_lm <- lm(ENTRIESn_hourly ~ hour_f + day_week_f + precipi + wspdi + precipi:wspdi, data = NY_subway)
summary(subway_lm)
##
## Call:
## lm(formula = ENTRIESn_hourly ~ hour_f + day_week_f + precipi +
## wspdi + precipi:wspdi, data = NY_subway)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3857.7 -1300.8 -515.4 387.4 28956.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1595.009 48.074 33.178 < 2e-16 ***
## hour_f4 -1109.002 44.515 -24.913 < 2e-16 ***
## hour_f8 -505.073 47.744 -10.579 < 2e-16 ***
## hour_f12 1704.440 45.662 37.327 < 2e-16 ***
## hour_f16 1034.462 46.567 22.215 < 2e-16 ***
## hour_f20 1906.561 45.331 42.058 < 2e-16 ***
## day_week_f1 356.158 46.288 7.694 1.45e-14 ***
## day_week_f2 425.020 49.333 8.615 < 2e-16 ***
## day_week_f3 451.031 48.757 9.251 < 2e-16 ***
## day_week_f4 310.993 49.933 6.228 4.76e-10 ***
## day_week_f5 -576.972 49.722 -11.604 < 2e-16 ***
## day_week_f6 -868.714 46.542 -18.665 < 2e-16 ***
## precipi 3221.374 1746.498 1.844 0.0651 .
## wspdi -36.406 3.377 -10.780 < 2e-16 ***
## precipi:wspdi -131.181 112.805 -1.163 0.2449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2702 on 42634 degrees of freedom
## Multiple R-squared: 0.1625, Adjusted R-squared: 0.1622
## F-statistic: 590.7 on 14 and 42634 DF, p-value: < 2.2e-16
plot(subway_lm)
2.1 What approach did you use to compute the coefficients theta and produce predictions for ENTRIESn_hourly in your regression model?
OLS via the R lm() function
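For illustration only, the same coefficients can be recovered from the normal equations; lm() actually uses a QR decomposition, so treat this as a sanity-check sketch rather than the method itself:

# Sketch: OLS coefficients theta = (X'X)^{-1} X'y via the normal equations
X <- model.matrix(ENTRIESn_hourly ~ hour_f + day_week_f + precipi + wspdi + precipi:wspdi,
                  data = NY_subway)
y <- NY_subway$ENTRIESn_hourly
theta <- solve(crossprod(X), crossprod(X, y))
# all.equal(as.numeric(theta), as.numeric(coef(subway_lm)))  # should be TRUE up to tolerance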
2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?
Hour of day and day of week (as dummy variables), precipi, wspdi, and the precipi:wspdi interaction term.
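As a side note (a sketch, not part of the original analysis), the 0/1 dummy coding R generates for a factor can be inspected directly:

# Each factor level except the baseline gets its own 0/1 indicator column
contrasts(NY_subway$day_week_f)
head(model.matrix(~ day_week_f, data = NY_subway))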
2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model. Your reasons might be based on intuition. For example, a response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.” Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”
2.4 What are the parameters (also known as “coefficients” or “weights”) of the non-dummy features in your linear regression model?
The non-dummy coefficients, from the summary(subway_lm) output above, are: precipi = 3221.374, wspdi = -36.406, and precipi:wspdi = -131.181.
2.5 What is your model’s R2 (coefficient of determination) value?
Multiple R-squared = 0.1625 (adjusted R-squared = 0.1622), as reported in the summary(subway_lm) output above.
2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2 value?
The R^2 value means this model has very little explanatory power for predicting ridership. Even setting that aside, pretty much all of the required assumptions for linear regression are violated: the residuals do not have constant variance and are not normally distributed. I do not think a linear model is appropriate for explaining ridership. The key variables (time and day) are categorical, and the quantitative variables (such as rain) play a smaller role in ridership. As stated earlier, this is why the t-test is more appropriate for answering the original question of this project.
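A quick base-graphics sketch of diagnostics that make those violations visible (assuming the fitted subway_lm object from above):

# Residual diagnostics: heavy right tail and non-constant variance
par(mfrow = c(1, 2))
hist(resid(subway_lm), breaks = 100, main = "Residuals", xlab = "Residual")
plot(fitted(subway_lm), resid(subway_lm),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
par(mfrow = c(1, 1))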