Discussion Prompt

*Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not? —–

I will reuse the dataset from last week, albeit this time the null values have been imputed based on observations from other counters on the same days.

The data selected is counts of bicycles on the Fremont Bridge in Seattle, covering October 2012 to October 2017; thousands of bike crossings occur daily at this location. Understanding when cycling does and does not occur can aid with planning, such as when to increase transit frequency for when cycling is projected to decline.

Link: https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k

## Warning: package 'zoo' was built under R version 3.3.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Warning: package 'forecast' was built under R version 3.3.3
URL = "https://raw.githubusercontent.com/cspitmit03/data602-finalproject/master/allDailyDF.csv"


data = read.csv(URL)
data1 = data[, c(5, 14, 15, 16, 39)] #Limit data to counts, weekend indicator variables, daylight hours, and precipitation volume
data1$weekend = (data1$Saturday == "True" )| (data1$Sunday == "True")
data1 = data1[ , c(1,4,5,6)]

Let’s look at a few scatter plots

plot(data1$Sunlight, data1$Fremont, data = data1, main = "Hours of Daylight vs Bicycle Counts", pch = '.')
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

plot(data1$Precip, data1$Fremont, data = data1, main = "Inches of Rain vs Bicycle Counts", pch = '.')
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

plot(data1$Fremont ~ jitter(as.numeric(data1$weekend), factor = 0.5), data = data1, main = "Weekend vs not Weekend bicycle counts", pch = '.')

Now let’s fit our model. We will create an interaction variable of Precip*weekend, to see if perhaps cycling activity is more or less effected by rain on weekends - which is a plausible outcome, since cycling is far lower during the weekend, and has the imprint of being recreational in nature, rather than commuting to and from work or school. On weekdays, cycling peaks at rush hour times, while on the weekends, the peak is smaller and flatter, and occurs in the early afternoon.

fit = lm(Fremont ~ Precip  + weekend + Sunlight + Precip*weekend, data = data1)
print(summary(fit))
## 
## Call:
## lm(formula = Fremont ~ Precip + weekend + Sunlight + Precip * 
##     weekend, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3220.6  -403.2    37.1   423.0  2984.1 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -117.159     79.303  -1.477     0.14    
## Precip             -1324.974     78.553 -16.867  < 2e-16 ***
## weekendTRUE        -1743.308     37.833 -46.079  < 2e-16 ***
## Sunlight             284.752      6.253  45.540  < 2e-16 ***
## Precip:weekendTRUE   631.536    142.369   4.436 9.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 673.8 on 1850 degrees of freedom
## Multiple R-squared:  0.7432, Adjusted R-squared:  0.7427 
## F-statistic:  1339 on 4 and 1850 DF,  p-value: < 2.2e-16

Oddly enough, we find that the rain ‘damps’ cycling less on weekends than weekdays. This is contrary to what I expected, but it’s not hard to fathom ‘after the fact’ explanations. Perhaps the commuters on the weekdays bike for convenience, but will soon switch to other modes of transportation when the rain makes biking less convenient. Meanwhile, the recreational cyclists crave cycling, and won’t let a bit of rain keep them entirely from enjoying the route. We can confirm this with some cursory looks at the means.

When it rains, weekend riding averages 921 per day; On a dry weekend, it is 1874. On a rainy weekday, counts average 2359; on a dry weekday, 3711.

So we can see that in terms of absolute change, the weekend is more stable, but in terms of relative change, weekday cycling is more durable. All in all, the coefficient merely seems to reflect absolute and not relative changes.

mean(subset(data1, Precip > 0.00 & weekend == TRUE)$Fremont)
## [1] 921.1014
mean(subset(data1, Precip == 0.00 & weekend == TRUE)$Fremont)
## [1] 1873.713
mean(subset(data1, Precip > 0.00 & weekend == FALSE)$Fremont)
## [1] 2359.218
mean(subset(data1, Precip == 0.00 & weekend == FALSE)$Fremont)
## [1] 3710.784

Perhaps rain is so fatal to bicyle cravings, that it ought to have a quadratic term in addition to the linear term.

fit1 = lm(Fremont ~ poly(Precip, 2) + weekend + Sunlight + Precip*weekend, data = data1)
print(summary(fit1))
## 
## Call:
## lm(formula = Fremont ~ poly(Precip, 2) + weekend + Sunlight + 
##     Precip * weekend, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3051.3  -389.6    41.9   425.6  2966.0 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -152.930     76.062  -2.011   0.0445 *  
## poly(Precip, 2)1   -14149.903    801.219 -17.660  < 2e-16 ***
## poly(Precip, 2)2     6374.247    665.235   9.582  < 2e-16 ***
## weekendTRUE         -1742.339     36.937 -47.171  < 2e-16 ***
## Sunlight              276.045      6.172  44.726  < 2e-16 ***
## Precip                     NA         NA      NA       NA    
## weekendTRUE:Precip    652.639    139.015   4.695 2.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 657.9 on 1849 degrees of freedom
## Multiple R-squared:  0.7554, Adjusted R-squared:  0.7547 
## F-statistic:  1142 on 5 and 1849 DF,  p-value: < 2.2e-16

The coefficient on precipitation and precipitation squared are both statistically significant at the p < .01 level, though they have opposite signs which should give us some pause.

The adjusted r squared is .75, which is reasonably high, and should give us moderate confidence that the selected predictors are influential. Let’s take a look at the residuals

plot(resid(fit1), pch ='.', cex = 2)

We do see a distinct, consistent, cyclical pattern, suggesting that we have left out a key predictor. The durability of the pattern suggests that it is somehow linked to the date, yet isn’t fully encompassed by the hours of sunlight per day predictor.

Let’s take a look at a histogram of the residuals.

hist(resid(fit1))

The residuals assume an approximately normal distribution.

And lastly, a look at the qq plots.

qqnorm(resid(fit1))
qqline(resid(fit1))

We see considerable deviation at the extremes, again suggesting there are predictors missing that we ought to incorporate into our model.

Let’s try one more model, to resolve these difficulties. We will incorporate average temperature and average wind, which will vary with the seasons and influence cycling volume.

data2 = data1
data2$WindAvg = data$WindAvg
data2$TempHi = as.numeric(data$TempHi)
data2$TempLow = as.numeric(data$TempLow)
data2$TempAvg = as.numeric(data$TempAvg)
fit2 = lm(Fremont ~ poly(Precip, 2) + weekend + Sunlight + Precip*weekend + WindAvg + TempHi + TempLow, data = data2)
print(summary(fit2))
## 
## Call:
## lm(formula = Fremont ~ poly(Precip, 2) + weekend + Sunlight + 
##     Precip * weekend + WindAvg + TempHi + TempLow, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2709.39  -296.81    12.12   313.18  2924.43 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -505.299     80.713  -6.260 4.76e-10 ***
## poly(Precip, 2)1   -10759.388    729.000 -14.759  < 2e-16 ***
## poly(Precip, 2)2     3439.234    585.997   5.869 5.18e-09 ***
## weekendTRUE         -1726.032     31.291 -55.160  < 2e-16 ***
## Sunlight              132.316      8.210  16.116  < 2e-16 ***
## Precip                     NA         NA      NA       NA    
## WindAvg               -20.446      3.717  -5.501 4.30e-08 ***
## TempHi                 55.374      2.471  22.412  < 2e-16 ***
## TempLow               -24.257      2.971  -8.165 5.87e-16 ***
## weekendTRUE:Precip    662.080    117.783   5.621 2.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 557.2 on 1846 degrees of freedom
## Multiple R-squared:  0.8248, Adjusted R-squared:  0.824 
## F-statistic:  1086 on 8 and 1846 DF,  p-value: < 2.2e-16

The adjusted r squared has improved appreciably, to .82 from .75. Let’s see the residuals plot.

plot(resid(fit2), pch = '.', cex = 2)

Adding in the wind and temperature predictors has greatly reduced the periodic pattern to the residuals, but it still remains somewhat.

qqnorm(resid(fit2))
qqline(resid(fit2))

hist(resid(fit2))