We will analyze data from TfL to understand the usage of the London bike sharing scheme.

London Bikes data

This R code does two main things — loads the data and creates a new variable to mark weekends. This makes it easier to compare or model bike hires between weekdays and weekends.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))

Cleaning our data

The raw data requires cleaning to be useful for analysis. We will convert character variables to factors and engineer new date-based features using lubridate. We will also generate season_name variable and provide a comprehensive overview of each variable using the skim function.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE)) %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    month_name = lubridate::month(date, label = TRUE),
    day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE) 
  ) %>%
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn"))
  )
## Rows: 5268 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (9): preciptype, conditions, description, icon, month_name, day_of_wee...
## dbl  (27): bikes_hired, tempmax, tempmin, temp, feelslikemax, feelslikemin, ...
## lgl   (2): severerisk, weekend
## dttm  (3): date, sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(bike)
Data summary
Name bike
Number of rows 5268
Number of columns 41
_______________________
Column type frequency:
character 6
factor 3
logical 2
numeric 27
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
preciptype 1411 0.73 4 9 0 3 0
conditions 0 1.00 4 28 0 11 0
description 0 1.00 26 85 0 78 0
icon 0 1.00 4 17 0 6 0
wday 0 1.00 3 3 0 7 0
set 0 1.00 5 5 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month_name 0 1 TRUE 12 Aug: 465, Oct: 465, Dec: 464, Sep: 450
day_of_week 0 1 TRUE 7 Sun: 753, Mon: 753, Fri: 753, Sat: 753
season_name 0 1 FALSE 4 Aut: 1365, Sum: 1321, Win: 1294, Spr: 1288

Variable type: logical

skim_variable n_missing complete_rate mean count
severerisk 5268 0 NaN :
weekend 0 1 0.29 FAL: 3762, TRU: 1506

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bikes_hired 0 1.00 26327.60 9496.92 0.0 19746.75 26070.00 32745.00 73094.00 ▂▇▅▁▁
tempmax 0 1.00 14.73 6.36 -2.6 10.10 14.40 19.50 36.10 ▁▇▇▃▁
tempmin 0 1.00 7.84 5.04 -11.1 4.10 8.00 11.60 22.10 ▁▃▇▇▁
temp 0 1.00 11.23 5.48 -5.9 7.20 11.10 15.50 28.30 ▁▅▇▅▁
feelslikemax 0 1.00 14.09 7.24 -7.2 10.10 14.40 19.50 38.20 ▁▅▇▃▁
feelslikemin 0 1.00 6.18 6.38 -14.6 1.17 6.00 11.60 22.10 ▁▃▇▇▂
feelslike 0 1.00 10.09 6.67 -9.1 4.90 10.60 15.50 28.90 ▁▆▇▇▁
dew 0 1.00 7.41 4.67 -8.0 4.10 7.60 11.00 19.00 ▁▃▇▇▂
humidity 0 1.00 79.28 10.40 39.6 72.20 80.30 87.30 99.90 ▁▂▅▇▅
precip 0 1.00 1.64 4.73 0.0 0.00 0.18 1.34 168.20 ▇▁▁▁▁
precipprob 0 1.00 73.04 44.38 0.0 0.00 100.00 100.00 100.00 ▃▁▁▁▇
precipcover 0 1.00 8.96 11.60 0.0 0.00 8.33 12.50 100.00 ▇▁▁▁▁
snow 0 1.00 0.02 0.25 0.0 0.00 0.00 0.00 8.40 ▇▁▁▁▁
snowdepth 0 1.00 0.05 0.55 0.0 0.00 0.00 0.00 18.10 ▇▁▁▁▁
windgust 1126 0.79 42.08 14.84 9.4 30.72 40.70 51.80 120.90 ▅▇▃▁▁
windspeed 0 1.00 22.95 8.42 5.1 16.88 21.80 27.60 72.20 ▅▇▂▁▁
winddir 0 1.00 196.97 92.33 0.1 127.75 221.60 259.00 359.80 ▃▂▃▇▃
sealevelpressure 0 1.00 1014.90 10.52 966.6 1008.70 1015.70 1022.00 1047.80 ▁▁▇▇▁
cloudcover 0 1.00 60.86 22.45 0.0 46.60 62.70 77.80 100.00 ▁▃▇▇▅
visibility 0 1.00 22.65 10.62 0.2 15.20 22.00 28.90 62.50 ▃▇▆▁▁
solarradiation 0 1.00 112.00 84.89 0.0 40.20 91.50 169.77 358.60 ▇▅▃▂▁
solarenergy 0 1.00 9.66 7.34 0.0 3.50 7.90 14.60 31.00 ▇▅▃▂▁
uvindex 0 1.00 4.26 2.55 0.0 2.00 4.00 6.00 10.00 ▇▆▅▅▁
moonphase 0 1.00 0.48 0.29 0.0 0.25 0.50 0.75 0.98 ▇▇▇▇▇
month 0 1.00 6.62 3.46 1.0 4.00 7.00 10.00 12.00 ▇▅▅▅▇
week 0 1.00 27.01 15.08 1.0 14.00 27.00 40.00 53.00 ▇▇▇▇▇
year 0 1.00 2017.28 4.17 2010.0 2014.00 2017.00 2021.00 2024.00 ▆▇▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2010-07-30 00:00:00 2024-12-30 00:00:00 2017-10-14 12:00:00 5268
sunrise 0 1 2010-07-30 05:17:39 2024-12-30 08:06:26 2017-10-14 19:21:46 5268
sunset 0 1 2010-07-30 20:50:39 2024-12-30 16:00:26 2017-10-15 06:05:26 5268

Summary Statistics

We can now calculate summary statistics for our target variable, bikes_hired.

favstats(~ bikes_hired, data= bike)
minQ1medianQ3maxmeansdnmissing
01.97e+042.61e+043.27e+047.31e+042.63e+049.5e+0352680

We can also group these summary statistics by the categorical variables we created.

favstats(bikes_hired ~ year, data=bike)
yearminQ1medianQ3maxmeansdnmissing
20102.76e+039.3e+03 1.4e+04 1.87e+042.8e+04 1.41e+045.62e+031550
20114.56e+031.63e+042.03e+042.37e+042.94e+041.96e+045.5e+03 3650
20123.53e+031.93e+042.62e+043.25e+044.71e+042.6e+04 9.43e+033660
20133.73e+031.76e+042.2e+04 2.74e+043.56e+042.2e+04 7.28e+033650
20144.33e+032.05e+042.77e+043.44e+044.9e+04 2.75e+049.07e+033650
20155.78e+032.21e+042.66e+043.29e+047.31e+042.7e+04 8.55e+033650
20164.89e+032.24e+042.79e+043.51e+044.66e+042.82e+048.85e+033660
20175.14e+032.41e+042.95e+043.45e+044.6e+04 2.86e+048.38e+033650
20185.86e+032.18e+042.92e+043.77e+044.61e+042.9e+04 1.02e+043650
20195.65e+032.42e+042.89e+043.45e+044.47e+042.86e+048.09e+033650
20204.87e+032.01e+042.75e+043.65e+047.02e+042.85e+041.16e+043660
20216.25e+032.17e+043.1e+04 3.82e+045.69e+043e+04       1.1e+04 3650
20220       2.51e+043.14e+044e+04       6.7e+04 3.15e+041.03e+043650
20236.72e+031.99e+042.37e+042.78e+043.53e+042.34e+046.03e+033650
20247.48e+032.04e+042.48e+042.87e+043.52e+042.4e+04 6.14e+033650
favstats(bikes_hired ~ day_of_week, data=bike)
day_of_weekminQ1medianQ3maxmeansdnmissing
Sunday0       1.41e+042.1e+04 2.97e+046.31e+042.22e+041.05e+047530
Monday3.97e+032.06e+042.57e+043.15e+046.7e+04 2.61e+048.28e+037530
Tuesday3.76e+032.25e+042.79e+043.44e+046.53e+042.82e+048.62e+037520
Wednesday4.33e+032.28e+042.8e+04 3.42e+045.44e+042.84e+048.52e+037520
Thursday5.65e+032.27e+042.75e+043.44e+047.31e+042.83e+048.79e+037520
Friday5.4e+03 2.14e+042.65e+043.29e+046.7e+04 2.7e+04 8.54e+037530
Saturday0       1.6e+04 2.25e+043.08e+047.02e+042.42e+041.11e+047530
favstats(bikes_hired ~ month_name, data=bike)
month_nameminQ1medianQ3maxmeansdnmissing
Jan3.73e+031.39e+041.9e+04 2.33e+043.8e+04 1.87e+045.99e+034340
Feb3.53e+031.6e+04 2.06e+042.46e+045.25e+042.03e+046.37e+033960
Mar5.06e+031.76e+042.31e+042.72e+045.66e+042.27e+047.59e+034340
Apr4.87e+032.1e+04 2.59e+043.08e+044.9e+04 2.6e+04 7.61e+034200
May1.07e+042.45e+042.99e+043.6e+04 7.02e+043.04e+048.44e+034340
Jun6.06e+032.77e+043.31e+043.92e+046.53e+043.34e+048.59e+034200
Jul5.56e+032.91e+043.58e+044.13e+047.31e+043.47e+048.35e+034360
Aug4.3e+03 2.56e+043.24e+043.83e+046.7e+04 3.14e+049.6e+03 4650
Sep0       2.49e+043.12e+043.61e+045.18e+043.04e+048.11e+034500
Oct7.07e+032.33e+042.8e+04 3.26e+044.79e+042.75e+046.77e+034650
Nov6.03e+031.88e+042.37e+042.79e+044.47e+042.33e+046.41e+034500
Dec2.76e+031.14e+041.67e+042.29e+043.91e+041.71e+046.91e+034640
favstats(bikes_hired ~ season_name, data=bike)
season_nameminQ1medianQ3maxmeansdnmissing
Winter2.76e+031.37e+041.88e+042.35e+045.25e+041.86e+046.58e+0312940
Spring4.87e+032.08e+042.61e+043.13e+047.02e+042.64e+048.51e+0312880
Summer4.3e+03 2.76e+043.37e+043.96e+047.31e+043.31e+048.98e+0313210
Autumn0       2.19e+042.72e+043.26e+045.18e+042.7e+04 7.69e+0313650

Exploratory Data Analysis

Time series plot of bikes rented

We will plot the number of bike rentals from January 1, 2014 onwards to observe any long-term trends or seasonal patterns.

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  
  ggplot() +
  aes(x = date, y = bikes_hired) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Daily TfL bike rentals over time",
    x = "Date",
    y = "Number of bikes hired"
  )
  # Apply a clean black and white theme
  theme_bw() +
  
  NULL

Summary: The long-term trend in daily TfL bike rentals shows a slight overall increase from 2014 until a dramatic spike in 2020 (likely due to the pandemic), followed by a subsequent leveling off and possible slight decline in the most recent years (2023-2024), though usage generally remains higher than pre-2020 levels.

Further graphs

Now we will see how rentals differ by months and by season. Fist, we plot a histogram of the bikes rented from January 1, 2014 onwards. Secondly, we plot histograms faceted by seasom_name. After this, we plot histograms faceted by month name, and the last graph is the same as the third one, but the histograms are represented in 4 rows. Having done this, we plot density plots for every result found in the previous graphs. After the density plots, we represent with a boxplot graph bikes hired for each month, then focusing on the relationship between bikes hired and temperature, and bikes hired and temperature divided by season.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~season_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  labs(
    title = "Bike rentals by season",
    x = "Season",
    y = "Number of bikes hired"
  ) +
  aes(x = bikes_hired)+
  geom_histogram(aes(fill=season_name), alpha = 0.5)+
  facet_wrap(~season_name, nrow = 4)+
  theme_minimal()+
  
  theme(legend.position = "none")+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  theme(legend.position="none")+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=month_name, y= bikes_hired)+
  geom_boxplot()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired)+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=humidity, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

Summary: - Rentals are slightly lower and less variable on weekends (Saturday/Sunday) compared to weekdays. - Rentals are highest in Summer, followed by Spring and Autumn. Winter has the lowest and least variable rental numbers, confirming our time series observation.

Model building

Correlation - Scatterplot matrix

Now, we will work with the correlation of weather variables. We have many weather variables (\(temp, feelslike, humidity, windspeed\), etc.).

Some of them are likely to be highly correlated. We’ll use \(ggpairs\) to visualize the relationships between our target variable \(bikeshired\) and key numeric predictors.

bike_numeric <- bike %>% 
  select(bikes_hired, feelslike, windspeed, humidity, precip, solarradiation, cloudcover)

ggpairs(bike_numeric) +
  theme_minimal()

Summary: - bikes_hired has a strong positive linear relationship with \(feelslike\) (and solarradiation).

  • bikes_hired has a negative relationship with humidity, precip (precipitation), and cloudcover.

  • feelslike is strongly correlated with solarradiation and negatively correlated with humidity and cloudcover. We will need to check for multicollinearity in our models.

Weekend or weekdays? Is there a difference?

We use a two-sample t-test to determine if the mean number of bikes hired differs between weekend and weekdays.

t.test(bikes_hired ~ weekend, data= bike)
## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.929, df = 2300.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3739.295 4964.730
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27571.74            23219.72

Summary: - The t-test result shows a p-value of < 2.2e-16, which is small (< 0.05). It means we reject the null hypothesis that the mean number of bikes hired is the same for weekends and weekdays. - The 95% CI for the difference in means is (3739.3, 4964.7). Since this interval is positive, it confirms that the average number of bikes hired on weekdays (FALSE) is higher than on weekends (TRUE).

Model 0: using the mean to predict bikes_hired

We need to build a multiple linear regression model to identify the key factors driving bike rentals. We will build a series of models, such as a simple intercept-only model and a more complex multiple regression model. Then, we will create a confidence interval for mean bikes_hired, also including SE. Finally, we will plot actual data vs. predicted data.

favstats(~bikes_hired, data = bike)
minQ1medianQ3maxmeansdnmissing
01.97e+042.61e+043.27e+047.31e+042.63e+049.5e+0352680
model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26327.6      130.8   201.2   <2e-16 ***
## 
## Residual standard error: 9497 on 5267 degrees of freedom
broom::augment(model0) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 0: mean prediction vs actual data",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: The overall mean of bikes_hired is 26327.6.

  1. Confidence Interval for mean bikes_hired:
  • The mean is the intercept of model0 (26327.6).
  • The t-value for a 95% CI is approximately +- 1.96
  • The SE (Standard Error) of the mean (intercept) is 130.8.
  • The 95% CI is: 26327.6 +- (1.96 x 130.) = 26327.6 +- 256.4.
  • 95% Confidence Interval: (26071.2, 26584.0).
  1. Regression’s Residual Standard Error (RSE):
  • The RSE is a measure of the typical distance between the actual data points and the regression line (in this case, the mean).
  • RSE from msummary(model0): 9497.
  1. Intercept Standard Error:
  • The standard error of the intercept (which is the mean in this model) is shown in the msummary output.
  • Intercept SE: 130.8.

Model 1: bikes_hired on mean_temp

We now add the daily average feelslike temperature as a single predictor. First, we must define the model; then we look at the estimated model and finally plot the actual data vs. the predicted data.

model1 <- lm(bikes_hired ~ feelslike, data = bike)

msummary(model1)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17647.09     189.13   93.31   <2e-16 ***
## feelslike     860.43      15.64   55.01   <2e-16 ***
## 
## Residual standard error: 7569 on 5266 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3648 
## F-statistic:  3027 on 1 and 5266 DF,  p-value: < 2.2e-16
broom::augment(model1) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 1: prediction with 'Feels Like' temperature",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary:

  • Is the effect of mean_temp significant? Why? Yes, the effect is highly significant. The p-value for feelslike is < 2e-16, which is much smaller than the common significance level of 0.05. The coefficient of 860.43 means that for every 1-degree Celsius increase in feelslike temperature, the model predicts an increase of approximately 860.43 bike rentals.
  • What proportion of the overall variability in bike rentals does temperature explain? Adjusted \(R^2\) = 0.353 Temperature explains approximately 35.3% in daily bike rentals.

Model 2: bikes_hired on mean_temp plus weekend

We now add the weekend indicator to model1.We follow the same steps that we used in Model 1 in order to obtain the following model and analyze it.

model2 <- lm(bikes_hired ~ feelslike + weekend, data = bike)

msummary(model2)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18887.79     193.90   97.41   <2e-16 ***
## feelslike     858.98      15.12   56.81   <2e-16 ***
## weekendTRUE -4288.88     223.12  -19.22   <2e-16 ***
## 
## Residual standard error: 7317 on 5265 degrees of freedom
## Multiple R-squared:  0.4066, Adjusted R-squared:  0.4064 
## F-statistic:  1804 on 2 and 5265 DF,  p-value: < 2.2e-16
broom::augment(model2) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 2: prediction with temperature and weekend status",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: - Is the effect of mean_temp significant? Why? Yes, it remains highly significant (p-value < 2e-16). p-value < 0.05.

  • What proportion of the overall variability does this model explain? Adjusted \(R^2\) = 0.420. This model explains 42% of the variability in daily bike rentals.

  • What is the meaning of the effect (slope) of weekendTRUE? What % of the variability of bikes_hired does your model explain? The slope for weekendTRUE is -5876.3. There is a large drop-off in commuter traffic on weekends.

Model 3: bikes_hired on mean_temp plus wday

We now use the full wday (day of the week) categorical variable instead of the simplified weekend dummy. Follow the same steps used in the first two Models.

model3 <- lm(bikes_hired ~ feelslike + wday, data = bike)

msummary(model3)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18339.21     304.81  60.165  < 2e-16 ***
## feelslike     858.34      15.01  57.185  < 2e-16 ***
## wdayMon      -905.19     374.33  -2.418 0.015634 *  
## wdaySat     -2730.52     374.33  -7.294 3.45e-13 ***
## wdaySun     -4737.26     374.33 -12.655  < 2e-16 ***
## wdayThu      1246.66     374.46   3.329 0.000877 ***
## wdayTue      1142.63     374.46   3.051 0.002289 ** 
## wdayWed      1293.74     374.46   3.455 0.000555 ***
## 
## Residual standard error: 7263 on 5260 degrees of freedom
## Multiple R-squared:  0.4158, Adjusted R-squared:  0.4151 
## F-statistic: 534.9 on 7 and 5260 DF,  p-value: < 2.2e-16
broom::augment(model3) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: prediction with temperature and specific day of week",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary 1. What is the meaning of the effect (slope) of, e.g., wdayMon? Let’s assume Sunday (Sun) is the baseline. Then, the coefficient for wdayMon is 7648.7. Holding the “feels like” temperature constant, bike rentals on a Monday are 7,648.7 > Sunday.

  1. What % of the variability of bikes_hired does your model explain? Adjusted \(R^2\) = 0.456. This model explains approximately 45.6% of the variability in daily bike rentals, a little better than Model 2.

Model 4

In Model 4 we extend the regression framework beyond temperature and weekday-effects by introducing humidity and a full month factor. Follow the same steps as the first three Models.

model4 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month), data = bike)

msummary(model4)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40635.74    1017.88  39.922  < 2e-16 ***
## feelslike         471.90      25.43  18.555  < 2e-16 ***
## wdayMon          -936.60     342.57  -2.734 0.006278 ** 
## wdaySat         -2658.12     342.59  -7.759 1.02e-14 ***
## wdaySun         -4825.17     342.59 -14.085  < 2e-16 ***
## wdayThu          1312.82     342.69   3.831 0.000129 ***
## wdayTue          1280.10     342.72   3.735 0.000190 ***
## wdayWed          1428.85     342.72   4.169 3.11e-05 ***
## humidity         -263.48      10.84 -24.301  < 2e-16 ***
## factor(month)2    477.51     463.33   1.031 0.302781    
## factor(month)3   1343.15     458.44   2.930 0.003406 ** 
## factor(month)4   1640.36     486.66   3.371 0.000755 ***
## factor(month)5   3906.79     522.19   7.482 8.56e-14 ***
## factor(month)6   5229.30     569.08   9.189  < 2e-16 ***
## factor(month)7   4950.94     600.98   8.238  < 2e-16 ***
## factor(month)8   2288.17     587.10   3.897 9.84e-05 ***
## factor(month)9   3809.46     546.91   6.965 3.68e-12 ***
## factor(month)10  4138.80     499.60   8.284  < 2e-16 ***
## factor(month)11  3226.77     459.63   7.020 2.50e-12 ***
## factor(month)12 -1631.01     446.03  -3.657 0.000258 ***
## 
## Residual standard error: 6647 on 5248 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5101 
## F-statistic: 289.7 on 19 and 5248 DF,  p-value: < 2.2e-16
broom::augment(model4) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 4: prediction with Temp, Day, Humidity, and Month",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Model 5

Model 5 builds on the previous model by adding precipitation (rainfall) as a new weather variable.

model5 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month) + precip, data = bike)

msummary(model5)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     38021.41    1012.58  37.549  < 2e-16 ***
## feelslike         465.73      24.92  18.690  < 2e-16 ***
## wdayMon          -937.69     335.61  -2.794 0.005226 ** 
## wdaySat         -2684.82     335.63  -7.999 1.53e-15 ***
## wdaySun         -4756.45     335.66 -14.171  < 2e-16 ***
## wdayThu          1361.61     335.74   4.056 5.07e-05 ***
## wdayTue          1340.79     335.78   3.993 6.61e-05 ***
## wdayWed          1478.14     335.77   4.402 1.09e-05 ***
## humidity         -227.51      10.89 -20.883  < 2e-16 ***
## factor(month)2    826.98     454.53   1.819 0.068903 .  
## factor(month)3   1473.67     449.21   3.281 0.001043 ** 
## factor(month)4   1906.95     477.11   3.997 6.51e-05 ***
## factor(month)5   4312.94     512.31   8.419  < 2e-16 ***
## factor(month)6   5702.88     558.43  10.212  < 2e-16 ***
## factor(month)7   5520.45     590.01   9.357  < 2e-16 ***
## factor(month)8   2851.71     576.41   4.947 7.76e-07 ***
## factor(month)9   4101.46     536.15   7.650 2.38e-14 ***
## factor(month)10  4338.59     489.63   8.861  < 2e-16 ***
## factor(month)11  3279.97     450.31   7.284 3.73e-13 ***
## factor(month)12 -1645.02     436.96  -3.765 0.000169 ***
## precip           -290.37      19.53 -14.865  < 2e-16 ***
## 
## Residual standard error: 6512 on 5247 degrees of freedom
## Multiple R-squared:  0.5316, Adjusted R-squared:  0.5298 
## F-statistic: 297.8 on 20 and 5247 DF,  p-value: < 2.2e-16
broom::augment(model5) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 5: prediction with all major predictors",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

As expected, precipitation has a significant negative effect on the number of bikes hired. Rainy days see a noticeable drop in rentals. The adjusted R² rises again, showing that including rainfall improves the model’s ability to explain daily variation in hires. In short, Model 5 provides the best overall fit, confirming that temperature boosts demand, while rain and humidity reduce it, and calendar effects (season and weekday) still play a key role.

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  • Are other weather variables useful in explaining bikes_hired? Yes. Our final model, model5 (humidity and precip) significantly improved the model fit from \(R^2\) of 0.456 (Model 3) to 0.648. This indicates that weather conditions like high humidity and rainfall are very useful and act as strong negative predictors of bike rentals. Variables like windspeed and cloudcover are also strong variables for next improvement.
  • We also have data on days of the week, month of the year, etc. Could those be helpful? Yes. We saw this from model1 (temp only) to model3 (temp + specific day of the week, wday) to model5 (adding factor(month)).
    • The wday variable improved the model by differentiating weekdays.
    • The factor(month) variable in model4 and model5 shows seasonality that is not fully explained by temperature alone.
  • What’s the best model you can come up with? Model5 is the best model so far (Adjusted \(R^2\) = 0.648)
  • Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE? This model is an explanatory model (thay is used for prediction), because its main goal is to identify which factors influence bike rentals. If we use it, its Redisual SE = 5137. On average, the model’s prediction of daily bike hires will be off by +- 5,137 bikes.

Ploting your best model predicitons vs reality

In the code below, first we plot the actual vs. predicted data, then we plot the fitted line and residuals.

my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())


ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: actual rentals vs fitted predictions",
       subtitle = "Model 3: bikes_hired ~ feelslike + wday",
       x = "Day number (time index)",
       y = "Number of bikes hired") +
  theme_bw()

Summary: The plot shows that the predicted values perfectly shows the general seasonal fluctuation and low-use days (weeekends, winter) with its residual SE of 6620. While model3 is better than a simple mean, it still leaves a lot of daily variation explained.

Diagnostics, collinearity, summary tables

For this analysis, we are interested in 3 plots: residuals vs. fitted (for linearity and homoscedasticity); Normal Q-Q (for normality); Scale-location (for homoscedasticity). In order to do this analysis we must use the autoplot function. To check for multicollinearity we check the value of the VIF, and see if it is above or below the threshold value of 10.

# Residual Analysis
# We are interested in the first 3 plots:
# 1: Residuals vs fitted (for linearity and homoscedasticity)
# 2: Normal Q-Q (for normality)
# 3: Scale-location (for homoscedasticity)
model2 %>% 
  autoplot(which = 1:3) +
  theme_bw()
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check VIF
# A VIF > 10 is a threshold for concern
vif(model2)
## feelslike   weekend 
##  1.000025  1.000025
vif(model3)
##               GVIF Df GVIF^(1/(2*Df))
## feelslike 1.000065  1        1.000032
## wday      1.000065  6        1.000005
vif(model4)
##                   GVIF Df GVIF^(1/(2*Df))
## feelslike     3.428223  1        1.851546
## wday          1.001367  6        1.000114
## humidity      1.515010  1        1.230858
## factor(month) 4.029582 11        1.065398

Summary: The linearity assumption holds. The spread of the residuals are constant (the homoscedasticity assumption is met) and normally distributed. All VIF values are below 10. Our predictor variables are not highly correlated, and multicollinearity is not an issue here.

Comparison of different models

Now we create a summary table, in order to compare the models examined in this analysis.

huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)
(Intercept)26327.597 ***17647.089 ***18887.785 ***18339.214 ***40635.741 ***
(130.846)   (189.133)   (193.902)   (304.814)   (1017.876)   
feelslike        860.427 ***858.979 ***858.340 ***471.901 ***
        (15.640)   (15.120)   (15.010)   (25.432)   
weekendTRUE                -4288.877 ***                
                (223.121)                   
wdayMon                        -905.186 *  -936.602 ** 
                        (374.331)   (342.574)   
wdaySat                        -2730.525 ***-2658.124 ***
                        (374.331)   (342.585)   
wdaySun                        -4737.260 ***-4825.175 ***
                        (374.331)   (342.586)   
wdayThu                        1246.662 ***1312.820 ***
                        (374.456)   (342.690)   
wdayTue                        1142.627 ** 1280.100 ***
                        (374.456)   (342.724)   
wdayWed                        1293.737 ***1428.854 ***
                        (374.459)   (342.722)   
humidity                                -263.483 ***
                                (10.842)   
factor(month)2                                477.507    
                                (463.333)   
factor(month)3                                1343.150 ** 
                                (458.442)   
factor(month)4                                1640.364 ***
                                (486.658)   
factor(month)5                                3906.791 ***
                                (522.191)   
factor(month)6                                5229.300 ***
                                (569.081)   
factor(month)7                                4950.945 ***
                                (600.980)   
factor(month)8                                2288.165 ***
                                (587.096)   
factor(month)9                                3809.465 ***
                                (546.908)   
factor(month)10                                4138.800 ***
                                (499.598)   
factor(month)11                                3226.766 ***
                                (459.634)   
factor(month)12                                -1631.005 ***
                                (446.025)   
#observations5268        5268        5268        5268        5268        
R squared0.000    0.365    0.407    0.416    0.512    
Adj. R Squared0.000    0.365    0.406    0.415    0.510    
Residual SE9496.916    7568.703    7317.024    7263.360    6646.989    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Summary: The Adjusted \(R^2\) consistently increased, and the Residual SE is decreased. Model 5 is clearly the best model as it explains the most variance (41.5%) with the lowest error.