In this project, we analyze a dataset from the Capital Bikeshare program in Washington, D.C. We will study the relationship between ride frequency and various calendar and weather metrics.
Bikeshare programs like Capital Bikeshare allow individuals to borrow bikes for short-term use. In U.S. cities, these programs have a relatively short history, with some only starting within the last 5-7 years. However, they have become increasingly popular and widely used in many cities, and many users view them as a reliable and important mode of short-term transportation, ideal for dense urban environments. These programs are credited with various benefits, such as reducing auto congestion and relieving some of the strain on public transportation systems (e.g., subway or bus).
We will address two questions regarding the relationship between ride frequency and weather:

- Is there a statistically significant relationship between daily ride frequency and temperature?
- Are casual riders more or less sensitive to temperature changes than registered riders?
These questions should be relevant to various constituents of bikeshare programs (members, owners, administrators, etc.), as the answers should help to better understand some of the drivers of bikeshare use, which in turn impacts operating costs, availability of bikes, and traffic patterns.
We will mainly use tools from the tidyverse package to do our analysis.
# load packages
library(tidyverse)
library(knitr)
The Capital Bikeshare dataset is available from the UC Irvine Machine Learning Repository. The repository contains two datasets with hourly and daily counts of bike rentals in the Capital Bikeshare program during 2011 and 2012, along with corresponding weather information.
The authors of the repository collected the ride data from the Capital Bikeshare program and combined it with corresponding weather and holiday information. They cleaned and tidied the data to produce the hourly and daily datasets. For this project, we work with the daily dataset.
At the outset of the project, we downloaded the dataset from the UC Irvine Machine Learning Repository and saved it as a CSV file in the project's GitHub repository.
We begin by loading the dataset from the GitHub repository and reviewing the data.
# load data from github repo or from disk
# from project repo:
file <- "https://raw.githubusercontent.com/kecbenson/DATA606/master/Project/day.csv"
# from local directory:
# file <- "day.csv"
df <- read_csv(file)
df
## # A tibble: 731 x 16
## instant dteday season yr mnth holiday weekday workingday
## <int> <date> <int> <int> <int> <int> <int> <int>
## 1 1 2011-01-01 1 0 1 0 6 0
## 2 2 2011-01-02 1 0 1 0 0 0
## 3 3 2011-01-03 1 0 1 0 1 1
## 4 4 2011-01-04 1 0 1 0 2 1
## 5 5 2011-01-05 1 0 1 0 3 1
## 6 6 2011-01-06 1 0 1 0 4 1
## 7 7 2011-01-07 1 0 1 0 5 1
## 8 8 2011-01-08 1 0 1 0 6 0
## 9 9 2011-01-09 1 0 1 0 0 0
## 10 10 2011-01-10 1 0 1 0 1 1
## # ... with 721 more rows, and 8 more variables: weathersit <int>,
## # temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>, casual <int>,
## # registered <int>, cnt <int>
The dataset includes 731 cases. Each case is a daily observation from 2011 and 2012 of 16 variables relating to (a) calendar information, (b) weather in the DC metro area, and (c) the number of bike rentals in the DC bike sharing program.
glimpse(df)
## Observations: 731
## Variables: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 20...
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ...
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.22695...
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.22927...
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0...
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.18690...
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25...
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1...
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1...
Response variables in the dataset include:
- casual: daily count of bike rentals (rides) by non-registered users (i.e., individuals who pay a per-trip usage fee)
- registered: daily count of bike rentals (rides) by registered users (i.e., individuals who pay an annual membership fee)
- cnt: daily count of bike rentals (rides) by all users, with cnt = casual + registered

These are all quantitative variables.
Explanatory variables in the dataset include:
- temp or atemp: normalized temperature or windchill in degrees Celsius
- hum: normalized humidity
- windspeed: normalized wind speed
- yr: 2011 (0) or 2012 (1)
- season: season (1 to 4 for winter through fall)
- mnth: month (1 to 12 for January through December)
- workingday: indicator of whether the day is a workday (1) or a weekend day / holiday (0)
- weekday: day of the week (0-6 for Sunday through Saturday)
- holiday: indicator of holiday (1) or not (0)
- weathersit: weather category (1 to 4), where 1 = “Clear / few clouds / partly cloudy / cloudy” and 4 = “Heavy rain / hail / thunderstorm / mist / snow / fog”

In addition, there are two variables relating to observation order (instant) and observation date (dteday).
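As an aside, for ad hoc plots it can be handy to attach readable labels to the coded calendar variables. The sketch below is illustrative only and is not used in the analysis; the season_name and weekday_name columns are hypothetical names, and the label text simply follows the codebook above.
# illustrative only: readable labels for the coded calendar variables (labels follow the codebook above)
df_labeled <- df %>%
  mutate(season_name  = factor(season, levels = 1:4,
                               labels = c("winter", "spring", "summer", "fall")),
         weekday_name = factor(weekday, levels = 0:6,
                               labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))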
Our analysis will be an observational study based on historical data.
We should address two issues relating to the scope of inference based on this dataset:
Causality: Because this is an observational study, not an experiment, we won’t be able to draw any conclusions as to causality. Our results will only indicate possible correlations between the explanatory variables and the response variable.
Generalizability: The dataset includes daily ride counts in the DC bikeshare program over a two-year period (2011 and 2012). The dataset was not randomly sampled, and furthermore, we don't have any basis on which to assess whether it is representative of all time periods of the DC bikeshare program or of other bikeshare programs in other cities. As a result, the findings of our study cannot be generalized to the overall population of daily ride counts for the DC program since inception, or more broadly to other bikeshare programs.
In fact, we will see below that the DC program underwent significant growth in ridership from 2011 to 2012, which suggests that the fitted relationships from our analysis will likely not apply to more recent ridership data.
To begin, let's review the dataset. We can divide the data into three sets of variables: calendar and timing variables, weather variables, and response variables.
First let’s review the calendar / timing variables. These variables have the expected periodic relationships among each other, for instance, the relationships between season and year or between month and weekday.
plot(df[ , 2:8])
Next let’s review the weather variables. It is clear that the temp and atemp variables are highly correlated, with a correlation coefficient of 99.2%.
plot(df[ , 9:13])
cor(df$temp, df$atemp)
## [1] 0.9917016
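As a supplementary check (a quick sketch; output omitted), the full correlation matrix of the weather variables can be computed directly:
# correlation matrix of the weather variables (rounded); temp and atemp are nearly collinear
round(cor(df[, c("temp", "atemp", "hum", "windspeed")]), 3)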
Finally, let's review the response variables casual, registered, and cnt. It is evident that the total ride count is dominated by the registered ride count; the two variables have a correlation coefficient of 94.6%.
plot(df[ , 14:16])
cor(df$registered, df$cnt)
## [1] 0.9455169
To facilitate the analysis, let's modify the dataset by:

- converting season, yr, mnth, holiday, weekday, workingday, and weathersit from integer variables into factor variables (with shorter names), and
- dropping atemp, since it's highly correlated with temp.

df1 <- df %>%
mutate(seas = as.factor(season), year = as.factor(yr), month = as.factor(mnth),
hlday = as.factor(holiday), wkday = as.factor(weekday), wrkday = as.factor(workingday),
weath = as.factor(weathersit)) %>%
select(obs = instant, date = dteday, year, seas, month, wrkday, wkday, hlday, weath, temp,
hum, windspeed, cnt, registered, casual)
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 15 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Date, format: "2011-01-01" "2011-01-02" ...
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
Summary statistics for the dataset are shown below, along with a histogram illustrating the distribution of total daily ride frequency. From the summary table, we see that the mean daily ride count is approximately 4,504 in total (cnt), consisting of 3,656 registered rides and 848 casual rides.
# summary statistics
summary(df1)
## obs date year seas month
## Min. : 1.0 Min. :2011-01-01 0:365 1:181 1 : 62
## 1st Qu.:183.5 1st Qu.:2011-07-02 1:366 2:184 3 : 62
## Median :366.0 Median :2012-01-01 3:188 5 : 62
## Mean :366.0 Mean :2012-01-01 4:178 7 : 62
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 8 : 62
## Max. :731.0 Max. :2012-12-31 10 : 62
## (Other):359
## wrkday wkday hlday weath temp hum
## 0:231 0:105 0:710 1:463 Min. :0.05913 Min. :0.0000
## 1:500 1:105 1: 21 2:247 1st Qu.:0.33708 1st Qu.:0.5200
## 2:104 3: 21 Median :0.49833 Median :0.6267
## 3:104 Mean :0.49538 Mean :0.6279
## 4:104 3rd Qu.:0.65542 3rd Qu.:0.7302
## 5:104 Max. :0.86167 Max. :0.9725
## 6:105
## windspeed cnt registered casual
## Min. :0.02239 Min. : 22 Min. : 20 Min. : 2.0
## 1st Qu.:0.13495 1st Qu.:3152 1st Qu.:2497 1st Qu.: 315.5
## Median :0.18097 Median :4548 Median :3662 Median : 713.0
## Mean :0.19049 Mean :4504 Mean :3656 Mean : 848.2
## 3rd Qu.:0.23321 3rd Qu.:5956 3rd Qu.:4776 3rd Qu.:1096.0
## Max. :0.50746 Max. :8714 Max. :6946 Max. :3410.0
##
# histogram of daily rental counts: all users
ggplot(df1) + geom_histogram(aes(x = cnt)) + labs(title = "Distribution of daily ride count: All users")
The histogram above seems multi-modal. We can explain the mode in the lower left tail by breaking this down into separate histograms of registered and casual ride counts.
ggplot(df1) + geom_histogram(aes(x = registered)) + labs(title = "Distribution of daily ride count: Registered users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df1) + geom_histogram(aes(x = casual)) + labs(title = "Distribution of daily ride count: Casual users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the statistical summary and histograms above, we can see that the distribution of total and registered ride counts are approximately symmetric, whereas the distribution of casual ride counts is highly skewed, with a long tail to the right. The long tail may arise from high ridership by casual users on holidays or weekends when the weather is nice.
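One informal way to probe this hypothesis (a sketch, not part of the main analysis) is to split the casual histogram by working day:
# sketch: casual ride count distributions on non-work days (0) vs. work days (1)
ggplot(df1) + geom_histogram(aes(x = casual), bins = 30) + facet_wrap(~ wrkday) +
  labs(title = "Distribution of daily ride count by working day: Casual users")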
Also, as an aside, it appears that the weath variable only takes on values of 1 to 3, not 1 to 4 as described in the data description of the UCI repository.
Let’s explore some of the calendar and timing relationships in the data.
First, it is evident from a time-series plot that the total daily ride count varies by year and season. In fact, we can easily see that the daily ride count grew substantially from year 0 (2011) to year 1 (2012).
# timing pattern - seasons
ggplot(df1, aes(x = date, y = cnt, color = seas)) + geom_point() +
labs(title = "Daily ride count varies by year and season")
# timing pattern - yr 0 vs 1; secular growth
ggplot(df1, aes(x = month, y = cnt, color = year)) + geom_point() + geom_smooth(aes(group = year), se = FALSE) +
labs(title = "Daily ride count grew significantly from 2011 to 2012")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Second, we can see that the total ride count follows a day-of-week pattern, and that there's a material difference between work days and non-work days.
# timing pattern - weekday
ggplot(df1, aes(x = date, y = cnt, color = wkday)) + geom_point() + geom_smooth(aes(group = wkday), se = FALSE) +
labs(title = "Daily ride count varies by day of the week")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# timing pattern - workingday
ggplot(df1, aes(x = date, y = cnt, color = wrkday)) + geom_point() + geom_smooth(aes(group = wrkday), se = FALSE) +
labs(title = "Daily ride count differs between work days and non-work days")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Finally, we note that registered and casual users exhibit different rental behavior during the week. Registered users tend to ride more during the work week (wkday of 1 to 5), whereas casual users tend to ride more on weekends (wkday of 0 and 6) and holidays.
# weekly pattern - registered
reg_wkly <- df1 %>% group_by(wkday, seas) %>% summarize(avg_reg = mean(registered))
ggplot(reg_wkly) + geom_bar(aes(x = wkday, y = avg_reg, fill = seas), stat = "identity", position = "dodge") +
labs(title = "Registered ride counts are highest during the work week")
# weekly pattern - casual
cas_wkly <- df1 %>% group_by(wkday, seas) %>% summarize(avg_cas = mean(casual))
ggplot(cas_wkly) + geom_bar(aes(x = wkday, y = avg_cas, fill = seas), stat = "identity", position = "dodge") +
labs(title = "Casual ride counts are highest on the weekends")
Next let’s explore some of the weather-related patterns in daily ride frequency. First we can see that ride count is highly sensitive to weather conditions.
# by weather
ggplot(df1, aes(x = date, y = cnt, color = weath)) + geom_point() + geom_smooth(aes(group = weath), se = FALSE) +
labs(title = "Total ride count varies by weather conditions")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Further, ride count is sensitive to temperature. Across all seasons, ride count increases with temperature up to a point, roughly 0.65 on the normalized temperature scale (temp); at higher temperatures, ride count declines.
# by temp
ggplot(df1, aes(x = temp, y = cnt, color = seas)) + geom_point() + geom_smooth(aes(group = seas), se = FALSE) +
labs(title = "Total ride count varies by season and temperature")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
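As a rough check of the approximate peak (a sketch only; the exact value depends on the smoothing span), we can fit a loess curve of cnt against temp and locate the maximum of the fitted curve:
# rough check: temperature at which the loess-smoothed total ride count peaks (default span)
lo_fit <- loess(cnt ~ temp, data = df1)
temp_grid <- seq(min(df1$temp), max(df1$temp), length.out = 200)
temp_grid[which.max(predict(lo_fit, newdata = data.frame(temp = temp_grid)))]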
Finally, it appears that registered users may be more sensitive to temperature changes than casual users. In particular, the slope of the regression line of ride count vs. temperature appears to be steeper for registered users compared to casual users, during cold weather (season = 1, positive slope) as well as hot weather (season = 3, negative slope). Whether this difference is statistically significant will be investigated in the regression analysis below.
# scatter plots of rental count vs temp, by season: registered vs. casual users
ggplot(df1, aes(x = temp, y = registered)) + geom_point(aes(color = seas)) + facet_wrap(~ seas) +
geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Registered users")
ggplot(df1, aes(x = temp, y = casual)) + geom_point(aes(color = seas)) + facet_wrap(~ seas) +
geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Casual users")
For our initial analysis, we develop linear models for daily ride frequency by registered and by casual riders, each using one calendar variable and one weather variable. We choose the seas (categorical) and temp (quantitative) variables for the analysis.
First we estimate the linear model for the registered ride count.
reg_m <- lm(registered ~ seas + temp, data = df1)
summary(reg_m)
##
## Call:
## lm(formula = registered ~ seas + temp, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4055.4 -915.0 -191.4 1108.0 3052.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 939.1 163.2 5.755 1.28e-08 ***
## seas2 515.1 171.5 3.003 0.00277 **
## seas3 347.3 225.4 1.540 0.12388
## seas4 1170.7 143.3 8.172 1.35e-15 ***
## temp 4467.4 451.0 9.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1247 on 726 degrees of freedom
## Multiple R-squared: 0.3644, Adjusted R-squared: 0.3609
## F-statistic: 104 on 4 and 726 DF, p-value: < 2.2e-16
The registered linear model can be expressed mathematically as: \[ \begin{aligned} \widehat{\text{registered}} &= \beta_0 + \beta_\text{seas} \cdot \text{seas} + \beta_\text{temp} \cdot \text{temp} \\ &= 939.1 + 515.1 \cdot \text{seas2} + 347.3 \cdot \text{seas3} + 1170.7 \cdot \text{seas4} + 4467.4 \cdot \text{temp} \end{aligned} \]
Similarly, we estimate the linear model for the casual ride count.
cas_m <- lm(casual ~ seas + temp, data = df1)
summary(cas_m)
##
## Call:
## lm(formula = casual ~ seas + temp, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1011.3 -357.6 -114.9 181.5 2436.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -193.27 74.28 -2.602 0.00946 **
## seas2 333.61 78.08 4.273 2.19e-05 ***
## seas3 142.91 102.61 1.393 0.16414
## seas4 172.16 65.21 2.640 0.00846 **
## temp 1773.97 205.28 8.642 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567.8 on 726 degrees of freedom
## Multiple R-squared: 0.32, Adjusted R-squared: 0.3163
## F-statistic: 85.41 on 4 and 726 DF, p-value: < 2.2e-16
The casual linear model can be expressed mathematically as: \[ \begin{aligned} \widehat{\text{casual}} &= \beta_0 + \beta_\text{seas} \cdot \text{seas} + \beta_\text{temp} \cdot \text{temp} \\ &= -193.27 + 333.61 \cdot \text{seas2} + 142.91 \cdot \text{seas3} + 172.16 \cdot \text{seas4} + 1773.97 \cdot \text{temp} \end{aligned} \]
The coefficients of both models can be interpreted as follows:

- The intercept is the predicted ride count in winter (the reference season, seas = 1) at a scaled temperature of zero.
- Each seas coefficient shifts that baseline for spring, summer, or fall relative to winter.
- The temp coefficient is the predicted change in ride count per unit increase in the scaled temperature temp.

Note that the temp variable is scaled and varies between 0.06 and 0.86, with a mean value of 0.50 (from the summary statistics table above). In this case, we can divide the temperature coefficient by 10 and interpret the slope for a 0.1 increment in the scaled temperature.
With this interpretation, the coefficients of the two models can be compared below:
| Interpretation | Registered Model | Casual Model |
|---|---|---|
| Baseline ride count in winter (ex. temp effect) | 939.1 | -193.27 |
| Baseline ride count in spring (ex. temp effect) | 939.1 + 515.1 = 1454.2 | -193.27 + 333.61 = 140.34 |
| Baseline ride count in summer (ex. temp effect) | 939.1 + 347.3 = 1286.4 | -193.27 + 142.91 = -50.36 |
| Baseline ride count in fall (ex. temp effect) | 939.1 + 1170.7 = 2109.2 | -193.27 + 172.16 = -21.11 |
| Incremental ride count per 0.1 increment in temp | 446.7 | 177.40 |
Note that the casual model has negative values for the baseline ride counts in all seasons other than spring. This reflects the fact that the baseline averages ignore the temperature effect, which on average contributes +887 (0.5 mean scaled temperature x 1774 coefficient) to the ride count for casual users throughout the year. The baseline ride count similarly understates the seasonal average ride counts in the registered model, where the temperature effect contributes on average +2233 (0.5 x 4467) to the ride count for registered users.
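To make this concrete, a short sketch (illustrative only; newdat is a hypothetical helper data frame) uses predict() to show that once the average temperature effect is included, the fitted casual ride count is positive in every season:
# illustrative check: predicted casual ride count in each season at the mean scaled temperature
newdat <- data.frame(seas = factor(1:4, levels = levels(df1$seas)), temp = mean(df1$temp))
predict(cas_m, newdata = newdat)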
Let’s check whether the conditions for regression are satisfied in both models.
The first condition, nearly normal residuals, can be assessed with a histogram of residuals and a normal probability plot of residuals, as shown below. The plots indicate that the residuals are not normally distributed and have significant skew, particularly for the casual ride count model. This condition is not satisfied, or at best is very weakly satisfied.
par(mfrow = c(1, 2))
# histogram of residuals
hist(reg_m$residuals)
hist(cas_m$residuals)
# normal probability plot of residuals.
qqnorm(reg_m$residuals, main = "Q-Q Plot: reg_m")
qqline(reg_m$residuals) # add diagonal line to the normal prob plot
qqnorm(cas_m$residuals, main = "Q-Q Plot: cas_m")
qqline(cas_m$residuals) # add diagonal line to the normal prob plot
The second condition, constant variability of residuals, can be evaluated using a plot of the absolute value of the residuals against their corresponding fitted values, as shown below. It appears that this condition is not satisfied, as the variability of the residuals increases for larger fitted values (the right side of the plots).
par(mfrow = c(1, 2))
# plot absolute value of residuals vs. fitted values
plot(abs(reg_m$residuals) ~ reg_m$fitted.values, main = "Abs(Resids.) vs. Fitted: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(abs(cas_m$residuals) ~ cas_m$fitted.values, main = "Abs(Resids.) vs. Fitted: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
The third condition, independent observations, is likely not satisfied, as the dataset consists of time series data, and both the calendar and weather effects will lead to correlations in ride counts over consecutive days. This effect can be visualized with a plot of the residuals vs. the order of observation. From the plots below, it is apparent that many pairs of sequential observations are closely related.
par(mfrow = c(1, 2))
# plot residuals vs. order of observation
plot(reg_m$residuals ~ df1$obs, main = "Resids. vs. Order of Obs.: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$obs, main = "Resids. vs. Order of Obs.: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
The fourth condition, a linear relationship between the explanatory variables and the response, can be evaluated using plots of the residuals against each predictor variable (seas and temp). Since seas is a categorical variable, the seas plots don't indicate whether the relationship with ride count is linear or non-linear; however, it is apparent that the variability of the residuals is not constant across seasons, particularly in the casual model cas_m. From the temp plots, the reg_m model appears to exhibit a linear relationship between temperature and ride count; unfortunately, this doesn't seem to hold for the cas_m model. This condition is only partially satisfied.
par(mfrow = c(1, 2))
# plot residuals vs. season
plot(reg_m$residuals ~ df1$seas, main = "Resids. vs. season: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$seas, main = "Resids. vs. season: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
# plot residuals vs. temp
plot(reg_m$residuals ~ df1$temp, main = "Resids. vs. temp.: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$temp, main = "Resids. vs. temp.: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
Overall, it appears that the conditions for inference using the linear regression models are not satisfied.
Let's assume that the conditions for inference are at least partially satisfied, so that we can proceed with the analysis.
First, for both models, the p-values for the estimated temp coefficients are extremely small (< 2e-16 \(\ll \alpha\) = 0.05), so we conclude that the temp coefficients are statistically significant. In other words, we reject the null hypothesis that \(\beta_\text{temp} = 0\) and conclude that there is a statistically significant relationship between daily ride frequency and temperature.
Second, given the magnitude of the temp coefficients in the two models (4,467 in reg_m vs. 1,774 in cas_m), it appears that registered riders are more sensitive to changes in temperature than casual riders. However, it should be noted that the average ride count is significantly higher for registered riders than for casual riders (3,656 vs. 848, respectively). To account for this difference, let’s normalize the comparison by scaling the temp coefficients by the mean ride counts:
| Model | temp coefficient | Mean ride count | temp coefficient / mean ride count |
|---|---|---|---|
| reg_m | 4467.4 | 3656 | 1.22 |
| cas_m | 1774.0 | 848 | 2.09 |
When normalized for the average daily ride count, we see that casual riders are more sensitive than registered riders to changes in temperature. This makes intuitive sense: riding by registered users is concentrated during the work week and is likely related to commuting, which riders will do regardless of the temperature. For casual riders, on the other hand, going for a bike ride on a holiday or weekend is likely an optional activity that depends heavily on attractive weather conditions.
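The normalized sensitivities in the table can be reproduced directly from the fitted model objects (a minimal sketch; output omitted):
# temp coefficient scaled by the mean daily ride count, for each user type
c(registered = unname(coef(reg_m)["temp"]) / mean(df1$registered),
  casual = unname(coef(cas_m)["temp"]) / mean(df1$casual))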
Next, let's try step-wise forward selection to see if we can improve the linear models by including more explanatory variables. To facilitate the process, we will simplify the data frame to the candidate explanatory variables plus the registered ride count, and define a helper function that adds each candidate variable to a given base formula and reports the resulting \(R^2\) and adjusted \(R^2\).
# simplify dataframe for analysis
reg_df <- df1 %>% select(3:12, registered, obs)
str(reg_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 12 variables:
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
# define function to do forward variable selection
reg_try <- function(try_form) {
# initialize results
r <- rep(NA, 10)
r1 <- rep(NA, 10)
# define formula
temp_form <- as.formula(paste(try_form, " + reg_df[[j]]"))
# add each variable one at a time
for (j in 1:10) {
trial <- lm(temp_form, data = reg_df)
r[j] <- summary(trial)$r.squared
r1[j] <- summary(trial)$adj.r.squared
}
results <- cbind(Var = colnames(reg_df)[1:10], R_Sq = round(r, 3), Adj_R_Sq = round(r1, 3)) %>% as.data.frame()
return(results)
}
Now let’s try each variable in a linear model of the form “registered ~ variable”. From the table of adjusted \(R^2\) results, we see that we should include year as the first variable to add.
reg_try("registered ~ ")
## Var R_Sq Adj_R_Sq
## 1 year 0.353 0.352
## 2 seas 0.278 0.275
## 3 month 0.294 0.284
## 4 wrkday 0.092 0.091
## 5 wkday 0.081 0.073
## 6 hlday 0.012 0.01
## 7 weath 0.079 0.077
## 8 temp 0.292 0.291
## 9 hum 0.008 0.007
## 10 windspeed 0.047 0.046
So let’s add the year variable and try again. month is the next variable to add.
# add year and try again.
reg_try("registered ~ year")
## Var R_Sq Adj_R_Sq
## 1 year 0.353 0.352
## 2 seas 0.633 0.631
## 3 month 0.649 0.643
## 4 wrkday 0.446 0.445
## 5 wkday 0.434 0.429
## 6 hlday 0.366 0.364
## 7 weath 0.412 0.41
## 8 temp 0.616 0.615
## 9 hum 0.354 0.352
## 10 windspeed 0.397 0.396
Next, we should add wrkday.
# choose month
reg_try("registered ~ year + month")
## Var R_Sq Adj_R_Sq
## 1 year 0.649 0.643
## 2 seas 0.684 0.678
## 3 month 0.649 0.643
## 4 wrkday 0.735 0.73
## 5 wkday 0.727 0.72
## 6 hlday 0.658 0.651
## 7 weath 0.7 0.694
## 8 temp 0.674 0.668
## 9 hum 0.665 0.659
## 10 windspeed 0.661 0.655
At this point we have three calendar variables selected (year, month, and wrkday). Let’s see where the linear model stands. From the model summary, all of the variables are significant, and the adjusted \(R^2\) is 73%. This looks pretty reasonable so far, with only 3 calendar variables.
summary(lm(registered ~ year + month + wrkday, data = reg_df))
##
## Call:
## lm(formula = registered ~ year + month + wrkday, data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5463.1 -356.3 99.3 523.1 1737.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.82 114.99 3.616 0.00032 ***
## year1 1858.44 59.93 31.012 < 2e-16 ***
## month2 355.81 148.68 2.393 0.01696 *
## month3 913.67 145.59 6.276 6.03e-10 ***
## month4 1468.18 146.71 10.007 < 2e-16 ***
## month5 2105.61 145.53 14.468 < 2e-16 ***
## month6 2487.88 146.78 16.950 < 2e-16 ***
## month7 2305.04 145.50 15.842 < 2e-16 ***
## month8 2424.83 145.63 16.650 < 2e-16 ***
## month9 2591.12 146.71 17.661 < 2e-16 ***
## month10 2221.39 145.51 15.266 < 2e-16 ***
## month11 1633.78 146.71 11.136 < 2e-16 ***
## month12 1055.88 145.50 7.257 1.03e-12 ***
## wrkday1 987.46 64.57 15.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 810.1 on 717 degrees of freedom
## Multiple R-squared: 0.7352, Adjusted R-squared: 0.7304
## F-statistic: 153.1 on 13 and 717 DF, p-value: < 2.2e-16
In the same manner, we follow the steps below, adding one variable at a time to maximize the adjusted \(R^2\).
# choose wrkday
reg_try("registered ~ year + month + wrkday")
## Var R_Sq Adj_R_Sq
## 1 year 0.735 0.73
## 2 seas 0.766 0.761
## 3 month 0.735 0.73
## 4 wrkday 0.735 0.73
## 5 wkday 0.739 0.732
## 6 hlday 0.736 0.73
## 7 weath 0.795 0.791
## 8 temp 0.754 0.749
## 9 hum 0.754 0.749
## 10 windspeed 0.746 0.741
# choose weath
reg_try("registered ~ year + month + wrkday + weath")
## Var R_Sq Adj_R_Sq
## 1 year 0.795 0.791
## 2 seas 0.826 0.822
## 3 month 0.795 0.791
## 4 wrkday 0.795 0.791
## 5 wkday 0.801 0.795
## 6 hlday 0.796 0.792
## 7 weath 0.795 0.791
## 8 temp 0.811 0.807
## 9 hum 0.795 0.791
## 10 windspeed 0.803 0.798
# choose seas
reg_try("registered ~ year + month + wrkday + weath + seas")
## Var R_Sq Adj_R_Sq
## 1 year 0.826 0.822
## 2 seas 0.826 0.822
## 3 month 0.826 0.822
## 4 wrkday 0.826 0.822
## 5 wkday 0.832 0.826
## 6 hlday 0.827 0.822
## 7 weath 0.826 0.822
## 8 temp 0.839 0.835
## 9 hum 0.826 0.822
## 10 windspeed 0.831 0.826
# choose temp
reg_try("registered ~ year + month + wrkday + weath + seas + temp")
## Var R_Sq Adj_R_Sq
## 1 year 0.839 0.835
## 2 seas 0.839 0.835
## 3 month 0.839 0.835
## 4 wrkday 0.839 0.835
## 5 wkday 0.845 0.84
## 6 hlday 0.84 0.836
## 7 weath 0.839 0.835
## 8 temp 0.839 0.835
## 9 hum 0.841 0.837
## 10 windspeed 0.844 0.839
# choose wkday
reg_try("registered ~ year + month + wrkday + weath + seas + temp + wkday")
## Var R_Sq Adj_R_Sq
## 1 year 0.845 0.84
## 2 seas 0.845 0.84
## 3 month 0.845 0.84
## 4 wrkday 0.845 0.84
## 5 wkday 0.845 0.84
## 6 hlday 0.845 0.84
## 7 weath 0.845 0.84
## 8 temp 0.845 0.84
## 9 hum 0.847 0.841
## 10 windspeed 0.85 0.844
Let's review the model at this point. It's apparent that many of the coefficient estimates are not statistically significant, as we've added too many collinear variables. For instance, the following explanatory variables are correlated (see the cross-tabulation sketch after the list):
- wrkday and wkday
- month and seas
- seas and temp
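A quick cross-tabulation (a sketch; output omitted) makes the overlap among the calendar variables explicit:
# cross-tabulations: wkday largely determines wrkday (apart from holidays), and month aligns with seas
table(wkday = df1$wkday, wrkday = df1$wrkday)
table(month = df1$month, seas = df1$seas)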
Let's redo the forward selection process, taking care to avoid these correlated variables. The steps are similar to those shown above, so here we just summarize the variables in the order selected:

- year
- seas (there's only a slight loss in adjusted \(R^2\) compared to month)
- wrkday
- weath
- temp

At this point, we can only increase the adjusted \(R^2\) by adding collinear variables, so let's take this as the final model and review:
reg_try("registered ~ year + seas + wrkday + weath + temp")
## Var R_Sq Adj_R_Sq
## 1 year 0.823 0.821
## 2 seas 0.823 0.821
## 3 month 0.839 0.835
## 4 wrkday 0.823 0.821
## 5 wkday 0.829 0.826
## 6 hlday 0.824 0.822
## 7 weath 0.823 0.821
## 8 temp 0.823 0.821
## 9 hum 0.824 0.822
## 10 windspeed 0.828 0.826
# try this as final model
reg_m <- lm(registered ~ year + seas + wrkday + weath + temp, data = reg_df)
summary(reg_m)
##
## Call:
## lm(formula = registered ~ year + seas + wrkday + weath + temp,
## data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3664.0 -332.5 71.1 425.1 1760.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.843 96.501 0.050 0.96
## year1 1752.270 49.091 35.694 < 2e-16 ***
## seas2 799.825 90.977 8.792 < 2e-16 ***
## seas3 798.127 119.685 6.669 5.14e-11 ***
## seas4 1388.169 76.135 18.233 < 2e-16 ***
## wrkday1 984.712 52.685 18.691 < 2e-16 ***
## weath2 -464.377 52.459 -8.852 < 2e-16 ***
## weath3 -1937.556 148.704 -13.030 < 2e-16 ***
## temp 3166.063 240.611 13.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 659.5 on 722 degrees of freedom
## Multiple R-squared: 0.8233, Adjusted R-squared: 0.8214
## F-statistic: 420.5 on 8 and 722 DF, p-value: < 2.2e-16
anova(reg_m)
## Analysis of Variance Table
##
## Response: registered
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 627553124 627553124 1443.02 < 2.2e-16 ***
## seas 3 497431507 165810502 381.27 < 2.2e-16 ***
## wrkday 1 150722248 150722248 346.58 < 2.2e-16 ***
## weath 2 112117317 56058658 128.90 < 2.2e-16 ***
## temp 1 75298533 75298533 173.14 < 2.2e-16 ***
## Residuals 722 313989243 434888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
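As a further check on the final registered model (a sketch; output omitted here), confidence intervals for the coefficient estimates can be obtained with confint():
# 95% confidence intervals for the coefficients of the final registered model
confint(reg_m, level = 0.95)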
In addition, the diagnostic plots look much better than previously for the simple two-variable model. The residuals are approximately normally distributed, appear to have approximately constant variance vs. the fitted values, and suggest a linear relationship between the explanatory variables and the response variables. However, the condition of independent observations is likely not met, as we still have daily observational data.
# diagnostic plots
par(mfrow = c(2, 2))
plot(reg_m, main = "Registered model reg_m")
Next, let's follow the same step-wise forward selection process for a model predicting ride frequency by casual users. We start by simplifying the data frame for casual ride counts and defining a helper function for the forward selection process.
# simplify dataframe for analysis
cas_df <- df1 %>% select(3:12, casual, obs)
str(cas_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 12 variables:
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed: num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
# define function to do forward variable selection
cas_try <- function(try_form) {
# initialize results
r <- rep(NA, 10)
r1 <- rep(NA, 10)
# define formula
temp_form <- as.formula(paste(try_form, " + cas_df[[j]]"))
# add each variable one at a time
for (j in 1:10) {
trial <- lm(temp_form, data = cas_df)
r[j] <- summary(trial)$r.squared
r1[j] <- summary(trial)$adj.r.squared
}
results <- cbind(Var = colnames(cas_df)[1:10], R_Sq = round(r, 3), Adj_R_Sq = round(r1, 3)) %>% as.data.frame()
return(results)
}
We follow the same step-wise procedure as above, and only show the final model. The variables added, in order, are:

- temp (only a very slight loss in adjusted \(R^2\) vs. month)
- wrkday
- year
- seas (only a slight loss in adjusted \(R^2\) vs. month)
- weath

cas_try("casual ~ temp + wrkday + year + seas + weath")
## Var R_Sq Adj_R_Sq
## 1 year 0.693 0.69
## 2 seas 0.693 0.69
## 3 month 0.714 0.706
## 4 wrkday 0.693 0.69
## 5 wkday 0.71 0.704
## 6 hlday 0.698 0.695
## 7 weath 0.693 0.69
## 8 temp 0.693 0.69
## 9 hum 0.695 0.692
## 10 windspeed 0.701 0.697
# try this as final model
cas_m <- lm(casual ~ temp + wrkday + year + seas + weath, data = cas_df)
summary(cas_m)
##
## Call:
## lm(formula = casual ~ temp + wrkday + year + seas + weath, data = cas_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1225.05 -228.73 -26.97 193.92 1859.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 260.77 55.96 4.660 3.76e-06 ***
## temp 1748.90 139.52 12.535 < 2e-16 ***
## wrkday1 -793.24 30.55 -25.965 < 2e-16 ***
## year1 296.82 28.47 10.427 < 2e-16 ***
## seas2 364.45 52.75 6.908 1.08e-11 ***
## seas3 162.72 69.40 2.345 0.0193 *
## seas4 206.39 44.15 4.675 3.51e-06 ***
## weath2 -161.69 30.42 -5.315 1.42e-07 ***
## weath3 -495.11 86.23 -5.742 1.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 382.4 on 722 degrees of freedom
## Multiple R-squared: 0.6932, Adjusted R-squared: 0.6898
## F-statistic: 203.9 on 8 and 722 DF, p-value: < 2.2e-16
anova(cas_m)
## Analysis of Variance Table
##
## Response: casual
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 101581307 101581307 694.678 < 2.2e-16 ***
## wrkday 1 103130970 103130970 705.275 < 2.2e-16 ***
## year 1 16726735 16726735 114.388 < 2.2e-16 ***
## seas 3 9236193 3078731 21.054 4.391e-13 ***
## weath 2 7907040 3953520 27.037 4.754e-12 ***
## Residuals 722 105576577 146228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, we can slightly improve the adjusted \(R^2\), but only at the expense of adding correlated variables. So let's stop here and take this as the final model. Reviewing the model, we see that all of the coefficient estimates are statistically significant and that, relative to the mean ride count, temp is a much stronger driver of casual ride counts than of registered ride counts (consistent with the normalized comparison below).
Finally, we review the diagnostic plots for the casual model. As in the case of the registered model, the diagnostics look much better than for the simple two-variable model. However, the model still has issues; in particular, the variance of the residuals is not constant, as seen from the plot of residuals vs. fitted values. On the other hand, the residuals appear to be approximately normally distributed and suggest a linear relationship between the explanatory variables and the response variable. As before, the condition of independent observations is likely not met, since we still have daily observational data.
# diagnostic plots
par(mfrow = c(2, 2))
plot(cas_m, main = "Casual model cas_m")
# mean ride counts, used below to normalize the temp coefficients
mean(cas_m$fitted.values)
## [1] 848.1765
mean(reg_m$fitted.values)
## [1] 3656.172
Let’s return to our original research questions regarding the relationship between daily ride frequency and temperature. We assume that the conditions for regression are satisfied for the registered and casual linear models; from the diagnostic plots in the previous sections, we saw that these conditions are mostly satisfied for the registered model but only partially satisfied for the casual model.
First, we note that the coefficient estimates for the temp variable are statistically significant in both models, with p-values < 2e-16. We can conclude from this that daily ride frequency does have a statistically significant relationship with temperature.
We summarize below the temperature coefficient estimates, both raw values and normalized values (scaled to the mean ride count):
| Model | temp coefficient | Mean ride count | temp coefficient / mean ride count |
|---|---|---|---|
| reg_m | 3166.1 | 3656 | 0.87 |
| cas_m | 1748.9 | 848 | 2.06 |
On a normalized basis, the casual model shows greater sensitivity to changes in temperature than the registered model, consistent with our observation from the simple two-variable model. We conclude that casual ride frequency is more sensitive to changes in temperature than registered ride frequency.
Some conclusions of our analysis:

- Daily ride frequency has a statistically significant relationship with temperature, for both registered and casual users.
- On a normalized basis, casual ride frequency is more sensitive to changes in temperature than registered ride frequency.
- The conditions for regression are only partially satisfied (particularly for the casual model), so these inferences should be interpreted with caution.
Suggestions for further work:

- Consider ways to better satisfy the conditions for regression, particularly in the case of casual ride counts; for instance, it may be worth transforming certain variables to reduce the non-linearity of the model residuals (see the sketch after this list).
- Try backward elimination and examine whether the results are consistent with the forward selection process used in this study.
- Perform the analysis using a simulation-based approach, which may mitigate some of the issues encountered here.
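As an illustration of the first suggestion, one possible starting point (a sketch only; not fitted or evaluated in this report, and cas_log_m is a hypothetical name) is to model the log of the casual ride count, which may stabilize the residual variance and reduce the right skew:
# sketch: refit the final casual model on the log scale (casual is always >= 2 here, so log() is safe)
cas_log_m <- lm(log(casual) ~ temp + wrkday + year + seas + weath, data = cas_df)
# then re-check the diagnostics, e.g.: par(mfrow = c(2, 2)); plot(cas_log_m)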