In this project, we analyze a dataset from the Capital Bikeshare program in Washington, D.C. We will study the relationship between ride frequency and various calendar and weather metrics.
Bikeshare programs like Capital Bikeshare allow individuals to borrow bikes for short-term use. In U.S. cities, these programs have a relatively short history, with some only starting within the last 5-7 years. However, they have become increasingly popular and widely used in many cities, and many users view them as a reliable and important mode of short-term transportation, ideal for dense urban environments. These programs are credited with various benefits, such as reducing auto congestion and relieving some of the strain on public transportation systems (e.g., subway or bus).
We will address two questions regarding the relationship between ride frequency and weather:

- Is there a statistically significant relationship between daily ride frequency and temperature?
- Are casual riders more or less sensitive to temperature changes than registered riders?
These questions should be relevant to various constituents of bikeshare programs (members, owners, administrators, etc.), as the answers should help to better understand some of the drivers of bikeshare use, which in turn impacts operating costs, availability of bikes, and traffic patterns.
We will mainly use tools from the tidyverse package to do our analysis.
# load packages
library(tidyverse)
library(knitr)
The Capital Bikeshare dataset is available from the UC Irvine Machine Learning Repository. The repository contains two datasets with hourly and daily counts of bike rentals in the Capital Bikeshare program during 2011 and 2012, along with corresponding weather information.
The authors of the repository collected the ride data from the Capital Bikeshare program and combined it with corresponding weather and holiday information. They cleaned and tidied the data to produce the hourly and daily datasets. For this project, we work with the daily dataset.
At the outset of the project, we downloaded the dataset from the UC Irvine Machine Learning Repository and saved it as a CSV file in the project's GitHub repository.
We begin by loading the dataset from the GitHub repository and reviewing the data.
# load data from github repo or from disk
# from project repo:
file <- "https://raw.githubusercontent.com/kecbenson/DATA606/master/Project/day.csv"
# from local directory:
# file <- "day.csv"
df <- read_csv(file)
df
## # A tibble: 731 x 16
## instant dteday season yr mnth holiday weekday workingday
## <int> <date> <int> <int> <int> <int> <int> <int>
## 1 1 2011-01-01 1 0 1 0 6 0
## 2 2 2011-01-02 1 0 1 0 0 0
## 3 3 2011-01-03 1 0 1 0 1 1
## 4 4 2011-01-04 1 0 1 0 2 1
## 5 5 2011-01-05 1 0 1 0 3 1
## 6 6 2011-01-06 1 0 1 0 4 1
## 7 7 2011-01-07 1 0 1 0 5 1
## 8 8 2011-01-08 1 0 1 0 6 0
## 9 9 2011-01-09 1 0 1 0 0 0
## 10 10 2011-01-10 1 0 1 0 1 1
## # ... with 721 more rows, and 8 more variables: weathersit <int>,
## # temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>, casual <int>,
## # registered <int>, cnt <int>
The dataset includes 731 cases. Each case is a daily observation from 2011 and 2012 of 16 variables relating to (a) calendar information, (b) weather in the DC metro area, and (c) the number of bike rentals in the DC bike sharing program.
glimpse(df)
## Observations: 731
## Variables: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 20...
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ...
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.22695...
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.22927...
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0...
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.18690...
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25...
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1...
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1...
Response variables in the dataset include:
- casual: daily count of bike rentals (rides) by non-registered users (i.e., individuals who pay a per-trip usage fee)
- registered: daily count of bike rentals (rides) by registered users (i.e., individuals who pay an annual membership fee)
- cnt: daily count of bike rentals (rides) by all users, with cnt = casual + registered

These are all quantitative variables.
Explanatory variables in the dataset include:
- temp or atemp: normalized temperature or windchill in degrees Celsius
- hum: normalized humidity
- windspeed: normalized wind speed
- yr: 2011 (0) or 2012 (1)
- season: season (1 to 4 for winter through fall)
- mnth: month (1 to 12 for January through December)
- workingday: indicator of whether the day is a workday (1) or a weekend day / holiday (0)
- weekday: day of the week (0-6 for Sunday through Saturday)
- holiday: indicator of holiday (1) or not (0)
- weathersit: weather category (1 to 4), where 1 = “Clear / few clouds / partly cloudy / cloudy” and 4 = “Heavy rain / hail / thunderstorm / mist / snow / fog”

In addition, there are two variables relating to observation order (instant) and observation date (dteday).
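As an aside, for ad hoc plots it can be handy to attach readable labels to the coded calendar variables. The sketch below is illustrative only and is not used in the analysis; the season_name and weekday_name columns are hypothetical names, and the label text simply follows the codebook above.
# illustrative only: readable labels for the coded calendar variables (labels follow the codebook above)
df_labeled <- df %>%
  mutate(season_name  = factor(season, levels = 1:4,
                               labels = c("winter", "spring", "summer", "fall")),
         weekday_name = factor(weekday, levels = 0:6,
                               labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))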
Our analysis will be an observational study based on historical data.
We should address two issues relating to the scope of inference based on this dataset:
Causality: Because this is an observational study, not an experiment, we won’t be able to draw any conclusions as to causality. Our results will only indicate possible correlations between the explanatory variables and the response variable.
Generalizability: The dataset includes daily ride counts in the DC bikeshare program over a two-year period (2011 and 2012). The dataset was not randomly sampled, and furthermore, we don't have any basis on which to assess whether it is representative of all time periods of the DC bikeshare program or of other bikeshare programs in other cities. As a result, the findings of our study cannot be generalized to the overall population of daily ride counts for the DC program since inception, or more broadly to other bikeshare programs.
In fact, we will see below that the DC program underwent significant growth in ridership from 2011 to 2012, which suggests that the fitted relationships from our analysis will likely not apply to more recent ridership data.
To begin, let's review the dataset. We can divide the data into three sets of variables: calendar and timing variables, weather variables, and response variables.
First let’s review the calendar / timing variables. These variables have the expected periodic relationships among each other, for instance, the relationships between season and year or between month and weekday.
plot(df[ , 2:8])
Next let’s review the weather variables. It is clear that the temp and atemp variables are highly correlated, with a correlation coefficient of 99.2%.
plot(df[ , 9:13])
cor(df$temp, df$atemp)
## [1] 0.9917016
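As a supplementary check (a quick sketch; output omitted), the full correlation matrix of the weather variables can be computed directly:
# correlation matrix of the weather variables (rounded); temp and atemp are nearly collinear
round(cor(df[, c("temp", "atemp", "hum", "windspeed")]), 3)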
Finally, let's review the response variables casual, registered, and cnt. It is evident that the total ride count is dominated by the registered ride count; the two variables have a correlation coefficient of 94.6%.
plot(df[ , 14:16])
cor(df$registered, df$cnt)
## [1] 0.9455169
To facilitate the analysis, let's modify the dataset by:

- converting season, yr, mnth, holiday, weekday, workingday, and weathersit from integer variables into factor variables (with shorter names), and
- dropping atemp, since it's highly correlated with temp.

df1 <- df %>%
mutate(seas = as.factor(season), year = as.factor(yr), month = as.factor(mnth),
hlday = as.factor(holiday), wkday = as.factor(weekday), wrkday = as.factor(workingday),
weath = as.factor(weathersit)) %>%
select(obs = instant, date = dteday, year, seas, month, wrkday, wkday, hlday, weath, temp,
hum, windspeed, cnt, registered, casual)
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 15 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Date, format: "2011-01-01" "2011-01-02" ...
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
Summary statistics for the dataset are shown below, along with a histogram illustrating the distribution of total daily ride frequency. From the summary table, we see that the mean daily ride count is approximately 4,504 in total (cnt), consisting of 3,656 registered rides and 848 casual rides.
# summary statistics
summary(df1)
## obs date year seas month
## Min. : 1.0 Min. :2011-01-01 0:365 1:181 1 : 62
## 1st Qu.:183.5 1st Qu.:2011-07-02 1:366 2:184 3 : 62
## Median :366.0 Median :2012-01-01 3:188 5 : 62
## Mean :366.0 Mean :2012-01-01 4:178 7 : 62
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 8 : 62
## Max. :731.0 Max. :2012-12-31 10 : 62
## (Other):359
## wrkday wkday hlday weath temp hum
## 0:231 0:105 0:710 1:463 Min. :0.05913 Min. :0.0000
## 1:500 1:105 1: 21 2:247 1st Qu.:0.33708 1st Qu.:0.5200
## 2:104 3: 21 Median :0.49833 Median :0.6267
## 3:104 Mean :0.49538 Mean :0.6279
## 4:104 3rd Qu.:0.65542 3rd Qu.:0.7302
## 5:104 Max. :0.86167 Max. :0.9725
## 6:105
## windspeed cnt registered casual
## Min. :0.02239 Min. : 22 Min. : 20 Min. : 2.0
## 1st Qu.:0.13495 1st Qu.:3152 1st Qu.:2497 1st Qu.: 315.5
## Median :0.18097 Median :4548 Median :3662 Median : 713.0
## Mean :0.19049 Mean :4504 Mean :3656 Mean : 848.2
## 3rd Qu.:0.23321 3rd Qu.:5956 3rd Qu.:4776 3rd Qu.:1096.0
## Max. :0.50746 Max. :8714 Max. :6946 Max. :3410.0
##
# histogram of daily rental counts: all users
ggplot(df1) + geom_histogram(aes(x = cnt)) + labs(title = "Distribution of daily ride count: All users")
The histogram above seems multi-modal. We can explain the mode in the lower left tail by breaking this down into separate histograms of registered and casual ride counts.
ggplot(df1) + geom_histogram(aes(x = registered)) + labs(title = "Distribution of daily ride count: Registered users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(df1) + geom_histogram(aes(x = casual)) + labs(title = "Distribution of daily ride count: Casual users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the statistical summary and histograms above, we can see that the distribution of total and registered ride counts are approximately symmetric, whereas the distribution of casual ride counts is highly skewed, with a long tail to the right. The long tail may arise from high ridership by casual users on holidays or weekends when the weather is nice.
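One informal way to probe this hypothesis (a sketch, not part of the main analysis) is to split the casual histogram by working day:
# sketch: casual ride count distributions on non-work days (0) vs. work days (1)
ggplot(df1) + geom_histogram(aes(x = casual), bins = 30) + facet_wrap(~ wrkday) +
  labs(title = "Distribution of daily ride count by working day: Casual users")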
Also, as an aside, it appears that the weath variable only takes on values of 1 to 3, not 1 to 4 as described in the data description of the UCI repository.
Let’s explore some of the calendar and timing relationships in the data.
First, it is evident from a time-series plot that the total daily ride count varies by year and season. In fact, we can easily see that the daily ride count grew substantially from year 0 (2011) to year 1 (2012).
# timing pattern - seasons
ggplot(df1, aes(x = date, y = cnt, color = seas)) + geom_point() +
labs(title = "Daily ride count varies by year and season")
# timing pattern - yr 0 vs 1; secular growth
ggplot(df1, aes(x = month, y = cnt, color = year)) + geom_point() + geom_smooth(aes(group = year), se = FALSE) +
labs(title = "Daily ride count grew significantly from 2011 to 2012")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Second, we can see that the total ride count follows a day-of-week pattern, and that there's a material difference between work days and non-work days.
# timing pattern - weekday
ggplot(df1, aes(x = date, y = cnt, color = wkday)) + geom_point() + geom_smooth(aes(group = wkday), se = FALSE) +
labs(title = "Daily ride count varies by day of the week")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# timing pattern - workingday
ggplot(df1, aes(x = date, y = cnt, color = wrkday)) + geom_point() + geom_smooth(aes(group = wrkday), se = FALSE) +
labs(title = "Daily ride count differs between work days and non-work days")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Finally, we note that registered and casual users exhibit different rental behavior during the week. Registered users tend to ride more during the work week (wkday of 1 to 5), whereas casual users tend to ride more on weekends (wkday of 0 and 6) and holidays.
# weekly pattern - registered
reg_wkly <- df1 %>% group_by(wkday, seas) %>% summarize(avg_reg = mean(registered))
ggplot(reg_wkly) + geom_bar(aes(x = wkday, y = avg_reg, fill = seas), stat = "identity", position = "dodge") +
labs(title = "Registered ride counts are highest during the work week")
# weekly pattern - casual
cas_wkly <- df1 %>% group_by(wkday, seas) %>% summarize(avg_cas = mean(casual))
ggplot(cas_wkly) + geom_bar(aes(x = wkday, y = avg_cas, fill = seas), stat = "identity", position = "dodge") +
labs(title = "Casual ride counts are highest on the weekends")
Next let’s explore some of the weather-related patterns in daily ride frequency. First we can see that ride count is highly sensitive to weather conditions.
# by weather
ggplot(df1, aes(x = date, y = cnt, color = weath)) + geom_point() + geom_smooth(aes(group = weath), se = FALSE) +
labs(title = "Total ride count varies by weather conditions")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Further, ride count is sensitive to temperature. Across all seasons, ride count increases with temperature up to a point, roughly 0.65 on the normalized temperature scale (temp); at higher temperatures, ride count declines.
# by temp
ggplot(df1, aes(x = temp, y = cnt, color = seas)) + geom_point() + geom_smooth(aes(group = seas), se = FALSE) +
labs(title = "Total ride count varies by season and temperature")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
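As a rough check of the approximate peak (a sketch only; the exact value depends on the smoothing span), we can fit a loess curve of cnt against temp and locate the maximum of the fitted curve:
# rough check: temperature at which the loess-smoothed total ride count peaks (default span)
lo_fit <- loess(cnt ~ temp, data = df1)
temp_grid <- seq(min(df1$temp), max(df1$temp), length.out = 200)
temp_grid[which.max(predict(lo_fit, newdata = data.frame(temp = temp_grid)))]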
Finally, it appears that registered users may be more sensitive to temperature changes than casual users. In particular, the slope of the regression line of ride count vs. temperature appears to be steeper for registered users compared to casual users, during cold weather (season = 1, positive slope) as well as hot weather (season = 3, negative slope). Whether this difference is statistically significant will be investigated in the regression analysis below.
# scatter plots of rental count vs temp, by season: registered vs. casual users
ggplot(df1, aes(x = temp, y = registered)) + geom_point(aes(color = seas)) + facet_wrap(~ seas) +
geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Registered users")
ggplot(df1, aes(x = temp, y = casual)) + geom_point(aes(color = seas)) + facet_wrap(~ seas) +
geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Casual users")
For our initial analysis, we develop linear models for daily ride frequency by registered and by casual riders, each using one calendar variable and one weather variable. We choose the seas (categorical) and temp (quantitative) variables for the analysis.
First we estimate the linear model for the registered ride count.
reg_m <- lm(registered ~ seas + temp, data = df1)
summary(reg_m)
##
## Call:
## lm(formula = registered ~ seas + temp, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4055.4 -915.0 -191.4 1108.0 3052.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 939.1 163.2 5.755 1.28e-08 ***
## seas2 515.1 171.5 3.003 0.00277 **
## seas3 347.3 225.4 1.540 0.12388
## seas4 1170.7 143.3 8.172 1.35e-15 ***
## temp 4467.4 451.0 9.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1247 on 726 degrees of freedom
## Multiple R-squared: 0.3644, Adjusted R-squared: 0.3609
## F-statistic: 104 on 4 and 726 DF, p-value: < 2.2e-16
The registered linear model can be expressed mathematically as: \[ \begin{aligned} \widehat{\text{registered}} &= \beta_0 + \beta_\text{seas} \cdot \text{seas} + \beta_\text{temp} \cdot \text{temp} \\ &= 939.1 + 515.1 \cdot \text{seas2} + 347.3 \cdot \text{seas3} + 1170.7 \cdot \text{seas4} + 4467.4 \cdot \text{temp} \end{aligned} \]
Similarly, we estimate the linear model for the casual ride count.
cas_m <- lm(casual ~ seas + temp, data = df1)
summary(cas_m)
##
## Call:
## lm(formula = casual ~ seas + temp, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1011.3 -357.6 -114.9 181.5 2436.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -193.27 74.28 -2.602 0.00946 **
## seas2 333.61 78.08 4.273 2.19e-05 ***
## seas3 142.91 102.61 1.393 0.16414
## seas4 172.16 65.21 2.640 0.00846 **
## temp 1773.97 205.28 8.642 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567.8 on 726 degrees of freedom
## Multiple R-squared: 0.32, Adjusted R-squared: 0.3163
## F-statistic: 85.41 on 4 and 726 DF, p-value: < 2.2e-16
The casual linear model can be expressed mathematically as: \[ \begin{aligned} \widehat{\text{casual}} &= \beta_0 + \beta_\text{seas} \cdot \text{seas} + \beta_\text{temp} \cdot \text{temp} \\ &= -193.27 + 333.61 \cdot \text{seas2} + 142.91 \cdot \text{seas3} + 172.16 \cdot \text{seas4} + 1773.97 \cdot \text{temp} \end{aligned} \]
The coefficients of both models can be interpreted as follows:

- The intercept is the predicted ride count in winter (the reference season, seas = 1) at a scaled temperature of zero.
- Each seas coefficient shifts that baseline for spring, summer, or fall relative to winter.
- The temp coefficient is the predicted change in ride count per unit increase in the scaled temperature temp.

Note that the temp variable is scaled and varies between 0.06 and 0.86, with a mean value of 0.50 (from the summary statistics table above). In this case, we can divide the temperature coefficient by 10 and interpret the slope for a 0.1 increment in the scaled temperature.
With this interpretation, the coefficients of the two models can be compared below:
| Interpretation | Registered Model | Casual Model |
|---|---|---|
| Baseline ride count in winter (ex. temp effect) | 939.1 | -193.27 |
| Baseline ride count in spring (ex. temp effect) | 939.1 + 515.1 = 1454.2 | -193.27 + 333.61 = 140.34 |
| Baseline ride count in summer (ex. temp effect) | 939.1 + 347.3 = 1286.4 | -193.27 + 142.91 = -50.36 |
| Baseline ride count in fall (ex. temp effect) | 939.1 + 1170.7 = 2109.2 | -193.27 + 172.16 = -21.11 |
| Incremental ride count per 0.1 increment in temp | 446.7 | 177.40 |
Note that the casual model has negative values for the baseline ride counts in all seasons other than spring. This reflects the fact that the baseline averages ignore the temperature effect, which on average contributes +887 (0.5 mean scaled temperature x 1774 coefficient) to the ride count for casual users throughout the year. The baseline ride count similarly understates the seasonal average ride counts in the registered model, where the temperature effect contributes on average +2233 (0.5 x 4467) to the ride count for registered users.
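To make this concrete, a short sketch (illustrative only; newdat is a hypothetical helper data frame) uses predict() to show that once the average temperature effect is included, the fitted casual ride count is positive in every season:
# illustrative check: predicted casual ride count in each season at the mean scaled temperature
newdat <- data.frame(seas = factor(1:4, levels = levels(df1$seas)), temp = mean(df1$temp))
predict(cas_m, newdata = newdat)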
Let’s check whether the conditions for regression are satisfied in both models.
The first condition, nearly normal residuals, can be assessed with a histogram of residuals and a normal probability plot of residuals, as shown below. The plots indicate that the residuals are not normally distributed and have significant skew, particularly for the casual ride count model. This condition is not satisfied, or at best is very weakly satisfied.
par(mfrow = c(1, 2))
# histogram of residuals
hist(reg_m$residuals)
hist(cas_m$residuals)
# normal probability plot of residuals.
qqnorm(reg_m$residuals, main = "Q-Q Plot: reg_m")
qqline(reg_m$residuals) # add diagonal line to the normal prob plot
qqnorm(cas_m$residuals, main = "Q-Q Plot: cas_m")
qqline(cas_m$residuals) # add diagonal line to the normal prob plot
The second condition, constant variability of residuals, can be evaluated using a plot of the absolute value of the residuals against their corresponding fitted values, as shown below. It appears that this condition is not satisfied, as the variability of the residuals increases for larger fitted values (the right side of the plots).
par(mfrow = c(1, 2))
# plot absolute value of residuals vs. fitted values
plot(abs(reg_m$residuals) ~ reg_m$fitted.values, main = "Abs(Resids.) vs. Fitted: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(abs(cas_m$residuals) ~ cas_m$fitted.values, main = "Abs(Resids.) vs. Fitted: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
The third condition, independent observations, is likely not satisfied, as the dataset consists of time series data, and both the calendar and weather effects will lead to correlations in ride counts over consecutive days. This effect can be visualized with a plot of the residuals vs. the order of observation. From the plots below, it is apparent that many pairs of sequential observations are closely related.
par(mfrow = c(1, 2))
# plot residuals vs. order of observation
plot(reg_m$residuals ~ df1$obs, main = "Resids. vs. Order of Obs.: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$obs, main = "Resids. vs. Order of Obs.: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
The fourth condition, a linear relationship between the explanatory variables and the response, can be evaluated using plots of the residuals against each predictor variable (seas and temp). Since seas is a categorical variable, the seas plots don't indicate whether the relationship with ride count is linear or non-linear; however, it is apparent that the variability of the residuals is not constant across seasons, particularly in the casual model cas_m. From the temp plots, the reg_m model appears to exhibit a linear relationship between temperature and ride count; unfortunately, this doesn't seem to hold for the cas_m model. This condition is only partially satisfied.
par(mfrow = c(1, 2))
# plot residuals vs. season
plot(reg_m$residuals ~ df1$seas, main = "Resids. vs. season: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$seas, main = "Resids. vs. season: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
# plot residuals vs. temp
plot(reg_m$residuals ~ df1$temp, main = "Resids. vs. temp.: reg_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
plot(cas_m$residuals ~ df1$temp, main = "Resids. vs. temp.: cas_m")
abline(h = 0, lty = 3) # add dashed line at y = 0
Overall, it appears that the conditions for inference using the linear regression models are not satisfied.
Let's assume that the conditions for inference are at least partially satisfied, so that we can proceed with the analysis.
First, for both models, the p-values for the estimated temp coefficients are extremely small (< 2e-16 \(\ll \alpha\) = 0.05), so we conclude that the temp coefficients are statistically significant. In other words, we reject the null hypothesis that \(\beta_\text{temp} = 0\) and conclude that there is a statistically significant relationship between daily ride frequency and temperature.
Second, given the magnitude of the temp coefficients in the two models (4,467 in reg_m vs. 1,774 in cas_m), it appears that registered riders are more sensitive to changes in temperature than casual riders. However, it should be noted that the average ride count is significantly higher for registered riders than for casual riders (3,656 vs. 848, respectively). To account for this difference, let’s normalize the comparison by scaling the temp coefficients by the mean ride counts:
| Model | temp coefficient | Mean ride count | temp coefficient / mean ride count |
|---|---|---|---|
| reg_m | 4467.4 | 3656 | 1.22 |
| cas_m | 1774.0 | 848 | 2.09 |
When normalized for the average daily ride count, we see that casual riders are more sensitive than registered riders to changes in temperature. This makes intuitive sense: riding by registered users is concentrated during the work week and is likely related to commuting, which riders will do regardless of the temperature. For casual riders, on the other hand, going for a bike ride on a holiday or weekend is likely an optional activity that depends heavily on attractive weather conditions.
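The normalized sensitivities in the table can be reproduced directly from the fitted model objects (a minimal sketch; output omitted):
# temp coefficient scaled by the mean daily ride count, for each user type
c(registered = unname(coef(reg_m)["temp"]) / mean(df1$registered),
  casual = unname(coef(cas_m)["temp"]) / mean(df1$casual))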
Next, let's try step-wise forward selection to see if we can improve the linear models by including more explanatory variables. To facilitate the process, we will simplify the data frame to the candidate explanatory variables plus the registered ride count, and define a helper function that adds each candidate variable to a given base formula and reports the resulting \(R^2\) and adjusted \(R^2\).
# simplify dataframe for analysis
reg_df <- df1 %>% select(3:12, registered, obs)
str(reg_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 12 variables:
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
# define function to do forward variable selection
reg_try <- function(try_form) {
# initialize results
r <- rep(NA, 10)
r1 <- rep(NA, 10)
# define formula
temp_form <- as.formula(paste(try_form, " + reg_df[[j]]"))
# add each variable one at a time
for (j in 1:10) {
trial <- lm(temp_form, data = reg_df)
r[j] <- summary(trial)$r.squared
r1[j] <- summary(trial)$adj.r.squared
}
results <- cbind(Var = colnames(reg_df)[1:10], R_Sq = round(r, 3), Adj_R_Sq = round(r1, 3)) %>% as.data.frame()
return(results)
}
Now let’s try each variable in a linear model of the form “registered ~ variable”. From the table of adjusted \(R^2\) results, we see that we should include year as the first variable to add.
reg_try("registered ~ ")
## Var R_Sq Adj_R_Sq
## 1 year 0.353 0.352
## 2 seas 0.278 0.275
## 3 month 0.294 0.284
## 4 wrkday 0.092 0.091
## 5 wkday 0.081 0.073
## 6 hlday 0.012 0.01
## 7 weath 0.079 0.077
## 8 temp 0.292 0.291
## 9 hum 0.008 0.007
## 10 windspeed 0.047 0.046
So let’s add the year variable and try again. month is the next variable to add.
# add year and try again.
reg_try("registered ~ year")
## Var R_Sq Adj_R_Sq
## 1 year 0.353 0.352
## 2 seas 0.633 0.631
## 3 month 0.649 0.643
## 4 wrkday 0.446 0.445
## 5 wkday 0.434 0.429
## 6 hlday 0.366 0.364
## 7 weath 0.412 0.41
## 8 temp 0.616 0.615
## 9 hum 0.354 0.352
## 10 windspeed 0.397 0.396
Next, we should add wrkday.
# choose month
reg_try("registered ~ year + month")
## Var R_Sq Adj_R_Sq
## 1 year 0.649 0.643
## 2 seas 0.684 0.678
## 3 month 0.649 0.643
## 4 wrkday 0.735 0.73
## 5 wkday 0.727 0.72
## 6 hlday 0.658 0.651
## 7 weath 0.7 0.694
## 8 temp 0.674 0.668
## 9 hum 0.665 0.659
## 10 windspeed 0.661 0.655
At this point we have three calendar variables selected (year, month, and wrkday). Let’s see where the linear model stands. From the model summary, all of the variables are significant, and the adjusted \(R^2\) is 73%. This looks pretty reasonable so far, with only 3 calendar variables.
summary(lm(registered ~ year + month + wrkday, data = reg_df))
##
## Call:
## lm(formula = registered ~ year + month + wrkday, data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5463.1 -356.3 99.3 523.1 1737.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.82 114.99 3.616 0.00032 ***
## year1 1858.44 59.93 31.012 < 2e-16 ***
## month2 355.81 148.68 2.393 0.01696 *
## month3 913.67 145.59 6.276 6.03e-10 ***
## month4 1468.18 146.71 10.007 < 2e-16 ***
## month5 2105.61 145.53 14.468 < 2e-16 ***
## month6 2487.88 146.78 16.950 < 2e-16 ***
## month7 2305.04 145.50 15.842 < 2e-16 ***
## month8 2424.83 145.63 16.650 < 2e-16 ***
## month9 2591.12 146.71 17.661 < 2e-16 ***
## month10 2221.39 145.51 15.266 < 2e-16 ***
## month11 1633.78 146.71 11.136 < 2e-16 ***
## month12 1055.88 145.50 7.257 1.03e-12 ***
## wrkday1 987.46 64.57 15.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 810.1 on 717 degrees of freedom
## Multiple R-squared: 0.7352, Adjusted R-squared: 0.7304
## F-statistic: 153.1 on 13 and 717 DF, p-value: < 2.2e-16
In the same manner, we follow the steps below, adding one variable at a time to maximize the adjusted \(R^2\).
# choose wrkday
reg_try("registered ~ year + month + wrkday")
## Var R_Sq Adj_R_Sq
## 1 year 0.735 0.73
## 2 seas 0.766 0.761
## 3 month 0.735 0.73
## 4 wrkday 0.735 0.73
## 5 wkday 0.739 0.732
## 6 hlday 0.736 0.73
## 7 weath 0.795 0.791
## 8 temp 0.754 0.749
## 9 hum 0.754 0.749
## 10 windspeed 0.746 0.741
# choose weath
reg_try("registered ~ year + month + wrkday + weath")
## Var R_Sq Adj_R_Sq
## 1 year 0.795 0.791
## 2 seas 0.826 0.822
## 3 month 0.795 0.791
## 4 wrkday 0.795 0.791
## 5 wkday 0.801 0.795
## 6 hlday 0.796 0.792
## 7 weath 0.795 0.791
## 8 temp 0.811 0.807
## 9 hum 0.795 0.791
## 10 windspeed 0.803 0.798
# choose seas
reg_try("registered ~ year + month + wrkday + weath + seas")
## Var R_Sq Adj_R_Sq
## 1 year 0.826 0.822
## 2 seas 0.826 0.822
## 3 month 0.826 0.822
## 4 wrkday 0.826 0.822
## 5 wkday 0.832 0.826
## 6 hlday 0.827 0.822
## 7 weath 0.826 0.822
## 8 temp 0.839 0.835
## 9 hum 0.826 0.822
## 10 windspeed 0.831 0.826
# choose temp
reg_try("registered ~ year + month + wrkday + weath + seas + temp")
## Var R_Sq Adj_R_Sq
## 1 year 0.839 0.835
## 2 seas 0.839 0.835
## 3 month 0.839 0.835
## 4 wrkday 0.839 0.835
## 5 wkday 0.845 0.84
## 6 hlday 0.84 0.836
## 7 weath 0.839 0.835
## 8 temp 0.839 0.835
## 9 hum 0.841 0.837
## 10 windspeed 0.844 0.839
# choose wkday
reg_try("registered ~ year + month + wrkday + weath + seas + temp + wkday")
## Var R_Sq Adj_R_Sq
## 1 year 0.845 0.84
## 2 seas 0.845 0.84
## 3 month 0.845 0.84
## 4 wrkday 0.845 0.84
## 5 wkday 0.845 0.84
## 6 hlday 0.845 0.84
## 7 weath 0.845 0.84
## 8 temp 0.845 0.84
## 9 hum 0.847 0.841
## 10 windspeed 0.85 0.844
Let's review the model at this point. It's apparent that many of the coefficient estimates are not statistically significant, as we've added too many collinear variables. For instance, the following explanatory variables are correlated (see the cross-tabulation sketch after the list):
- wrkday and wkday
- month and seas
- seas and temp
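A quick cross-tabulation (a sketch; output omitted) makes the overlap among the calendar variables explicit:
# cross-tabulations: wkday largely determines wrkday (apart from holidays), and month aligns with seas
table(wkday = df1$wkday, wrkday = df1$wrkday)
table(month = df1$month, seas = df1$seas)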
Let's redo the forward selection process, taking care to avoid these correlated variables. The steps are similar to those shown above, so here we just summarize the variables in the order selected:

- year
- seas (there's only a slight loss in adjusted \(R^2\) compared to month)
- wrkday
- weath
- temp

At this point, we can only increase the adjusted \(R^2\) by adding collinear variables, so let's take this as the final model and review:
reg_try("registered ~ year + seas + wrkday + weath + temp")
## Var R_Sq Adj_R_Sq
## 1 year 0.823 0.821
## 2 seas 0.823 0.821
## 3 month 0.839 0.835
## 4 wrkday 0.823 0.821
## 5 wkday 0.829 0.826
## 6 hlday 0.824 0.822
## 7 weath 0.823 0.821
## 8 temp 0.823 0.821
## 9 hum 0.824 0.822
## 10 windspeed 0.828 0.826
# try this as final model
reg_m <- lm(registered ~ year + seas + wrkday + weath + temp, data = reg_df)
summary(reg_m)
##
## Call:
## lm(formula = registered ~ year + seas + wrkday + weath + temp,
## data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3664.0 -332.5 71.1 425.1 1760.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.843 96.501 0.050 0.96
## year1 1752.270 49.091 35.694 < 2e-16 ***
## seas2 799.825 90.977 8.792 < 2e-16 ***
## seas3 798.127 119.685 6.669 5.14e-11 ***
## seas4 1388.169 76.135 18.233 < 2e-16 ***
## wrkday1 984.712 52.685 18.691 < 2e-16 ***
## weath2 -464.377 52.459 -8.852 < 2e-16 ***
## weath3 -1937.556 148.704 -13.030 < 2e-16 ***
## temp 3166.063 240.611 13.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 659.5 on 722 degrees of freedom
## Multiple R-squared: 0.8233, Adjusted R-squared: 0.8214
## F-statistic: 420.5 on 8 and 722 DF, p-value: < 2.2e-16
anova(reg_m)
## Analysis of Variance Table
##
## Response: registered
## Df Sum Sq Mean Sq F value Pr(>F)
## year 1 627553124 627553124 1443.02 < 2.2e-16 ***
## seas 3 497431507 165810502 381.27 < 2.2e-16 ***
## wrkday 1 150722248 150722248 346.58 < 2.2e-16 ***
## weath 2 112117317 56058658 128.90 < 2.2e-16 ***
## temp 1 75298533 75298533 173.14 < 2.2e-16 ***
## Residuals 722 313989243 434888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
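As a further check on the final registered model (a sketch; output omitted here), confidence intervals for the coefficient estimates can be obtained with confint():
# 95% confidence intervals for the coefficients of the final registered model
confint(reg_m, level = 0.95)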
In addition, the diagnostic plots look much better than previously for the simple two-variable model. The residuals are approximately normally distributed, appear to have approximately constant variance vs. the fitted values, and suggest a linear relationship between the explanatory variables and the response variables. However, the condition of independent observations is likely not met, as we still have daily observational data.
# diagnostic plots
par(mfrow = c(2, 2))
plot(reg_m, main = "Registered model reg_m")
Next, let's follow the same step-wise forward selection process for a model predicting ride frequency by casual users. We start by simplifying the data frame for casual ride counts and defining a helper function for the forward selection process.
# simplify dataframe for analysis
cas_df <- df1 %>% select(3:12, casual, obs)
str(cas_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 12 variables:
## $ year : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seas : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wrkday : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
## $ wkday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
## $ hlday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weath : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed: num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
# define function to do forward variable selection
cas_try <- function(try_form) {
# initialize results
r <- rep(NA, 10)
r1 <- rep(NA, 10)
# define formula
temp_form <- as.formula(paste(try_form, " + cas_df[[j]]"))
# add each variable one at a time
for (j in 1:10) {
trial <- lm(temp_form, data = cas_df)
r[j] <- summary(trial)$r.squared
r1[j] <- summary(trial)$adj.r.squared
}
results <- cbind(Var = colnames(cas_df)[1:10], R_Sq = round(r, 3), Adj_R_Sq = round(r1, 3)) %>% as.data.frame()
return(results)
}
We follow the same step-wise procedure as above, and only show the final model. The variables added, in order, are:

- temp (only a very slight loss in adjusted \(R^2\) vs. month)
- wrkday
- year
- seas (only a slight loss in adjusted \(R^2\) vs. month)
- weath

cas_try("casual ~ temp + wrkday + year + seas + weath")
## Var R_Sq Adj_R_Sq
## 1 year 0.693 0.69
## 2 seas 0.693 0.69
## 3 month 0.714 0.706
## 4 wrkday 0.693 0.69
## 5 wkday 0.71 0.704
## 6 hlday 0.698 0.695
## 7 weath 0.693 0.69
## 8 temp 0.693 0.69
## 9 hum 0.695 0.692
## 10 windspeed 0.701 0.697
# try this as final model
cas_m <- lm(casual ~ temp + wrkday + year + seas + weath, data = cas_df)
summary(cas_m)
##
## Call:
## lm(formula = casual ~ temp + wrkday + year + seas + weath, data = cas_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1225.05 -228.73 -26.97 193.92 1859.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 260.77 55.96 4.660 3.76e-06 ***
## temp 1748.90 139.52 12.535 < 2e-16 ***
## wrkday1 -793.24 30.55 -25.965 < 2e-16 ***
## year1 296.82 28.47 10.427 < 2e-16 ***
## seas2 364.45 52.75 6.908 1.08e-11 ***
## seas3 162.72 69.40 2.345 0.0193 *
## seas4 206.39 44.15 4.675 3.51e-06 ***
## weath2 -161.69 30.42 -5.315 1.42e-07 ***
## weath3 -495.11 86.23 -5.742 1.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 382.4 on 722 degrees of freedom
## Multiple R-squared: 0.6932, Adjusted R-squared: 0.6898
## F-statistic: 203.9 on 8 and 722 DF, p-value: < 2.2e-16
anova(cas_m)
## Analysis of Variance Table
##
## Response: casual
## Df Sum Sq Mean Sq F value Pr(>F)
## temp 1 101581307 101581307 694.678 < 2.2e-16 ***
## wrkday 1 103130970 103130970 705.275 < 2.2e-16 ***
## year 1 16726735 16726735 114.388 < 2.2e-16 ***
## seas 3 9236193 3078731 21.054 4.391e-13 ***
## weath 2 7907040 3953520 27.037 4.754e-12 ***
## Residuals 722 105576577 146228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, we can slightly improve the adjusted \(R^2\), but only at the expense of adding correlated variables. So let's stop here and take this as the final model. Reviewing the model, we see that all of the coefficient estimates are statistically significant and that, relative to the mean ride count, temp is a much stronger driver of casual ride counts than of registered ride counts (consistent with the normalized comparison below).
Finally, we review the diagnostic plots for the casual model. As in the case of the registered model, the diagnostics look much better than for the simple two-variable model. However, the model still has issues; in particular, the variance of the residuals is not constant, as seen from the plot of residuals vs. fitted values. On the other hand, the residuals appear to be approximately normally distributed and suggest a linear relationship between the explanatory variables and the response variable. As before, the condition of independent observations is likely not met, since we still have daily observational data.
# diagnostic plots
par(mfrow = c(2, 2))
plot(cas_m, main = "Casual model cas_m")
# mean ride counts, used below to normalize the temp coefficients
mean(cas_m$fitted.values)
## [1] 848.1765
mean(reg_m$fitted.values)
## [1] 3656.172
Let’s return to our original research questions regarding the relationship between daily ride frequency and temperature. We assume that the conditions for regression are satisfied for the registered and casual linear models; from the diagnostic plots in the previous sections, we saw that these conditions are mostly satisfied for the registered model but only partially satisfied for the casual model.
First, we note that the coefficient estimates for the temp variable are statistically significant in both models, with p-values < 2e-16. We can conclude from this that daily ride frequency does have a statistically significant relationship with temperature.
We summarize below the temperature coefficient estimates, both raw values and normalized values (scaled to the mean ride count):
| Model | temp coefficient | Mean ride count | temp coefficient / mean ride count |
|---|---|---|---|
| reg_m | 3166.1 | 3656 | 0.87 |
| cas_m | 1748.9 | 848 | 2.06 |
On a normalized basis, the casual model shows greater sensitivity to changes in temperature than the registered model, consistent with our observation from the simple two-variable model. We conclude that casual ride frequency is more sensitive to changes in temperature than registered ride frequency.
Some conclusions of our analysis:

- Daily ride frequency has a statistically significant relationship with temperature, for both registered and casual users.
- On a normalized basis, casual ride frequency is more sensitive to changes in temperature than registered ride frequency.
- The conditions for regression are only partially satisfied (particularly for the casual model), so these inferences should be interpreted with caution.
Suggestions for further work:

- Consider ways to better satisfy the conditions for regression, particularly in the case of casual ride counts; for instance, it may be worth transforming certain variables to reduce the non-linearity of the model residuals (see the sketch after this list).
- Try backward elimination and examine whether the results are consistent with the forward selection process used in this study.
- Perform the analysis using a simulation-based approach, which may mitigate some of the issues encountered here.
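As an illustration of the first suggestion, one possible starting point (a sketch only; not fitted or evaluated in this report, and cas_log_m is a hypothetical name) is to model the log of the casual ride count, which may stabilize the residual variance and reduce the right skew:
# sketch: refit the final casual model on the log scale (casual is always >= 2 here, so log() is safe)
cas_log_m <- lm(log(casual) ~ temp + wrkday + year + seas + weath, data = cas_df)
# then re-check the diagnostics, e.g.: par(mfrow = c(2, 2)); plot(cas_log_m)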