We will analyze data from TfL to understand the usage of the London bike sharing scheme.

London Bikes data

This R code does two main things — loads the data and creates a new variable to mark weekends. This makes it easier to compare or model bike hires between weekdays and weekends.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))

Cleaning our data

The raw data requires cleaning to be useful for analysis. We will convert character variables to factors and engineer new date-based features using lubridate. We will also generate season_name variable and provide a comprehensive overview of each variable using the skim function.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE)) %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    month_name = lubridate::month(date, label = TRUE),
    day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE) 
  ) %>%
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn"))
  )
## Rows: 5268 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (9): preciptype, conditions, description, icon, month_name, day_of_wee...
## dbl  (27): bikes_hired, tempmax, tempmin, temp, feelslikemax, feelslikemin, ...
## lgl   (2): severerisk, weekend
## dttm  (3): date, sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(bike)
Data summary
Name bike
Number of rows 5268
Number of columns 41
_______________________
Column type frequency:
character 6
factor 3
logical 2
numeric 27
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
preciptype 1411 0.73 4 9 0 3 0
conditions 0 1.00 4 28 0 11 0
description 0 1.00 26 85 0 78 0
icon 0 1.00 4 17 0 6 0
wday 0 1.00 3 3 0 7 0
set 0 1.00 5 5 0 1 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month_name 0 1 TRUE 12 Aug: 465, Oct: 465, Dec: 464, Sep: 450
day_of_week 0 1 TRUE 7 Sun: 753, Mon: 753, Fri: 753, Sat: 753
season_name 0 1 FALSE 4 Aut: 1365, Sum: 1321, Win: 1294, Spr: 1288

Variable type: logical

skim_variable n_missing complete_rate mean count
severerisk 5268 0 NaN :
weekend 0 1 0.29 FAL: 3762, TRU: 1506

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bikes_hired 0 1.00 26327.60 9496.92 0.0 19746.75 26070.00 32745.00 73094.00 ▂▇▅▁▁
tempmax 0 1.00 14.73 6.36 -2.6 10.10 14.40 19.50 36.10 ▁▇▇▃▁
tempmin 0 1.00 7.84 5.04 -11.1 4.10 8.00 11.60 22.10 ▁▃▇▇▁
temp 0 1.00 11.23 5.48 -5.9 7.20 11.10 15.50 28.30 ▁▅▇▅▁
feelslikemax 0 1.00 14.09 7.24 -7.2 10.10 14.40 19.50 38.20 ▁▅▇▃▁
feelslikemin 0 1.00 6.18 6.38 -14.6 1.17 6.00 11.60 22.10 ▁▃▇▇▂
feelslike 0 1.00 10.09 6.67 -9.1 4.90 10.60 15.50 28.90 ▁▆▇▇▁
dew 0 1.00 7.41 4.67 -8.0 4.10 7.60 11.00 19.00 ▁▃▇▇▂
humidity 0 1.00 79.28 10.40 39.6 72.20 80.30 87.30 99.90 ▁▂▅▇▅
precip 0 1.00 1.64 4.73 0.0 0.00 0.18 1.34 168.20 ▇▁▁▁▁
precipprob 0 1.00 73.04 44.38 0.0 0.00 100.00 100.00 100.00 ▃▁▁▁▇
precipcover 0 1.00 8.96 11.60 0.0 0.00 8.33 12.50 100.00 ▇▁▁▁▁
snow 0 1.00 0.02 0.25 0.0 0.00 0.00 0.00 8.40 ▇▁▁▁▁
snowdepth 0 1.00 0.05 0.55 0.0 0.00 0.00 0.00 18.10 ▇▁▁▁▁
windgust 1126 0.79 42.08 14.84 9.4 30.72 40.70 51.80 120.90 ▅▇▃▁▁
windspeed 0 1.00 22.95 8.42 5.1 16.88 21.80 27.60 72.20 ▅▇▂▁▁
winddir 0 1.00 196.97 92.33 0.1 127.75 221.60 259.00 359.80 ▃▂▃▇▃
sealevelpressure 0 1.00 1014.90 10.52 966.6 1008.70 1015.70 1022.00 1047.80 ▁▁▇▇▁
cloudcover 0 1.00 60.86 22.45 0.0 46.60 62.70 77.80 100.00 ▁▃▇▇▅
visibility 0 1.00 22.65 10.62 0.2 15.20 22.00 28.90 62.50 ▃▇▆▁▁
solarradiation 0 1.00 112.00 84.89 0.0 40.20 91.50 169.77 358.60 ▇▅▃▂▁
solarenergy 0 1.00 9.66 7.34 0.0 3.50 7.90 14.60 31.00 ▇▅▃▂▁
uvindex 0 1.00 4.26 2.55 0.0 2.00 4.00 6.00 10.00 ▇▆▅▅▁
moonphase 0 1.00 0.48 0.29 0.0 0.25 0.50 0.75 0.98 ▇▇▇▇▇
month 0 1.00 6.62 3.46 1.0 4.00 7.00 10.00 12.00 ▇▅▅▅▇
week 0 1.00 27.01 15.08 1.0 14.00 27.00 40.00 53.00 ▇▇▇▇▇
year 0 1.00 2017.28 4.17 2010.0 2014.00 2017.00 2021.00 2024.00 ▆▇▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2010-07-30 00:00:00 2024-12-30 00:00:00 2017-10-14 12:00:00 5268
sunrise 0 1 2010-07-30 05:17:39 2024-12-30 08:06:26 2017-10-14 19:21:46 5268
sunset 0 1 2010-07-30 20:50:39 2024-12-30 16:00:26 2017-10-15 06:05:26 5268

Summary Statistics

We can now calculate summary statistics for our target variable, bikes_hired.

favstats(~ bikes_hired, data= bike)
minQ1medianQ3maxmeansdnmissing
01.97e+042.61e+043.27e+047.31e+042.63e+049.5e+0352680

We can also group these summary statistics by the categorical variables we created.

favstats(bikes_hired ~ year, data=bike)
yearminQ1medianQ3maxmeansdnmissing
20102.76e+039.3e+03 1.4e+04 1.87e+042.8e+04 1.41e+045.62e+031550
20114.56e+031.63e+042.03e+042.37e+042.94e+041.96e+045.5e+03 3650
20123.53e+031.93e+042.62e+043.25e+044.71e+042.6e+04 9.43e+033660
20133.73e+031.76e+042.2e+04 2.74e+043.56e+042.2e+04 7.28e+033650
20144.33e+032.05e+042.77e+043.44e+044.9e+04 2.75e+049.07e+033650
20155.78e+032.21e+042.66e+043.29e+047.31e+042.7e+04 8.55e+033650
20164.89e+032.24e+042.79e+043.51e+044.66e+042.82e+048.85e+033660
20175.14e+032.41e+042.95e+043.45e+044.6e+04 2.86e+048.38e+033650
20185.86e+032.18e+042.92e+043.77e+044.61e+042.9e+04 1.02e+043650
20195.65e+032.42e+042.89e+043.45e+044.47e+042.86e+048.09e+033650
20204.87e+032.01e+042.75e+043.65e+047.02e+042.85e+041.16e+043660
20216.25e+032.17e+043.1e+04 3.82e+045.69e+043e+04       1.1e+04 3650
20220       2.51e+043.14e+044e+04       6.7e+04 3.15e+041.03e+043650
20236.72e+031.99e+042.37e+042.78e+043.53e+042.34e+046.03e+033650
20247.48e+032.04e+042.48e+042.87e+043.52e+042.4e+04 6.14e+033650
favstats(bikes_hired ~ day_of_week, data=bike)
day_of_weekminQ1medianQ3maxmeansdnmissing
Sunday0       1.41e+042.1e+04 2.97e+046.31e+042.22e+041.05e+047530
Monday3.97e+032.06e+042.57e+043.15e+046.7e+04 2.61e+048.28e+037530
Tuesday3.76e+032.25e+042.79e+043.44e+046.53e+042.82e+048.62e+037520
Wednesday4.33e+032.28e+042.8e+04 3.42e+045.44e+042.84e+048.52e+037520
Thursday5.65e+032.27e+042.75e+043.44e+047.31e+042.83e+048.79e+037520
Friday5.4e+03 2.14e+042.65e+043.29e+046.7e+04 2.7e+04 8.54e+037530
Saturday0       1.6e+04 2.25e+043.08e+047.02e+042.42e+041.11e+047530
favstats(bikes_hired ~ month_name, data=bike)
month_nameminQ1medianQ3maxmeansdnmissing
Jan3.73e+031.39e+041.9e+04 2.33e+043.8e+04 1.87e+045.99e+034340
Feb3.53e+031.6e+04 2.06e+042.46e+045.25e+042.03e+046.37e+033960
Mar5.06e+031.76e+042.31e+042.72e+045.66e+042.27e+047.59e+034340
Apr4.87e+032.1e+04 2.59e+043.08e+044.9e+04 2.6e+04 7.61e+034200
May1.07e+042.45e+042.99e+043.6e+04 7.02e+043.04e+048.44e+034340
Jun6.06e+032.77e+043.31e+043.92e+046.53e+043.34e+048.59e+034200
Jul5.56e+032.91e+043.58e+044.13e+047.31e+043.47e+048.35e+034360
Aug4.3e+03 2.56e+043.24e+043.83e+046.7e+04 3.14e+049.6e+03 4650
Sep0       2.49e+043.12e+043.61e+045.18e+043.04e+048.11e+034500
Oct7.07e+032.33e+042.8e+04 3.26e+044.79e+042.75e+046.77e+034650
Nov6.03e+031.88e+042.37e+042.79e+044.47e+042.33e+046.41e+034500
Dec2.76e+031.14e+041.67e+042.29e+043.91e+041.71e+046.91e+034640
favstats(bikes_hired ~ season_name, data=bike)
season_nameminQ1medianQ3maxmeansdnmissing
Winter2.76e+031.37e+041.88e+042.35e+045.25e+041.86e+046.58e+0312940
Spring4.87e+032.08e+042.61e+043.13e+047.02e+042.64e+048.51e+0312880
Summer4.3e+03 2.76e+043.37e+043.96e+047.31e+043.31e+048.98e+0313210
Autumn0       2.19e+042.72e+043.26e+045.18e+042.7e+04 7.69e+0313650

Exploratory Data Analysis

Time series plot of bikes rented

We will plot the number of bike rentals from January 1, 2014 onwards to observe any long-term trends or seasonal patterns.

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  
  ggplot() +
  aes(x = date, y = bikes_hired) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Daily TfL bike rentals over time",
    x = "Date",
    y = "Number of bikes hired"
  )
  # Apply a clean black and white theme
  theme_bw() +
  
  NULL

Summary: The long-term trend in daily TfL bike rentals shows a slight overall increase from 2014 until a dramatic spike in 2020 (likely due to the pandemic), followed by a subsequent leveling off and possible slight decline in the most recent years (2023-2024), though usage generally remains higher than pre-2020 levels.

Further graphs

Now we will see how rentals differ by months and by season. Fist, we plot a histogram of the bikes rented from January 1, 2014 onwards. Secondly, we plot histograms faceted by seasom_name. After this, we plot histograms faceted by month name, and the last graph is the same as the third one, but the histograms are represented in 4 rows. Having done this, we plot density plots for every result found in the previous graphs. After the density plots, we represent with a boxplot graph bikes hired for each month, then focusing on the relationship between bikes hired and temperature, and bikes hired and temperature divided by season.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~season_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  labs(
    title = "Bike rentals by season",
    x = "Season",
    y = "Number of bikes hired"
  ) +
  aes(x = bikes_hired)+
  geom_histogram(aes(fill=season_name), alpha = 0.5)+
  facet_wrap(~season_name, nrow = 4)+
  theme_minimal()+
  
  theme(legend.position = "none")+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  theme(legend.position="none")+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=month_name, y= bikes_hired)+
  geom_boxplot()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired)+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=humidity, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

Summary: 1. Seasonal and monthly patterns The distribution plots (histograms, density plots, and boxplots) consistently show that the number of bikes hired increases significantly from winter to summer, peaking in July/August and dropping in late autumn/winter. - Winter (Dec-Feb): The distribution is centered at the lowest average daily rental count (around 20,000 to 25,000), with a narrow spread. - Summer (Jun-Aug): The distribution shifts to the highest average daily rental count (peaking near 40,000 in July), is wider, and includes the highest overall rental days (outliers up to ≈70,000). - Spring (Mar-May) & Autumn (Sep-Nov): These seasons represent transition periods. Spring shows a continuous increase in the mean rental count, and Autumn shows a decrease, with October still performing relatively well compared to November and December. Boxplots by Month: The boxplot visually confirms this progression, showing a monotonic increase in the median and interquartile range from January to July, and then a consistent decrease towards December. 2. Weather dependence (temperature and humidity) - A strong positive correlation with temperature- As the temperature rises, the number of rentals generally increases. The relationship is clearest and strongest in the Spring, Summer, and Autumn, whereas Winter shows a flatter, less temperature-sensitive trend at lower average rental counts. - A negative correlation with humidity across all seasons-As humidity increases, the number of rentals generally decreases across all seasons. Summer rentals are the highest overall, but still show this negative slope, suggesting that drier days, even in summer, are preferred for cycling.

Model building

Correlation - Scatterplot matrix

Now, we will work with the correlation of weather variables. We have many weather variables (\(temp, feelslike, humidity, windspeed\), etc.).

Some of them are likely to be highly correlated. We’ll use \(ggpairs\) to visualize the relationships between our target variable \(bikeshired\) and key numeric predictors.

bike_numeric <- bike %>% 
  select(bikes_hired, feelslike, windspeed, humidity, precip, solarradiation, cloudcover)

ggpairs(bike_numeric) +
  theme_minimal()

Summary: 1. Correlation with bikes hired (target variable) The strongest positive correlation with the number of bikes hired is feelslike (Corr: 0.604): As the perceived temperature increases, the number of bikes hired significantly increases. Another strong positive correlation with the number of bikes hired is solarradiation (Corr: 0.489): Higher sun intensity is associated with more rentals. The strongest negative correlatio with the number of bikes hired is humidity (Corr: -0.502): As humidity increases, the number of bikes hired decreases. Weak negative correlations with the number of bikes hired are windspeed, precipitation, and cloudcover (Corr: -0.244, -0.216, and -0.297, respectively), suggesting that high wind, rain, or overcast conditions slightly deter cycling. 2. Inter-correlations among predictors Several predictor variables are highly correlated with each other, which is important for model building (check potential multicollinearity). · humidity is strongly correlated with solarradiation (Corr: -0.516) , cloudcover (Corr: 0.479) and feelslike (Corr: -0.444): Sunny days tend to have lower humidity; More cloud cover is associated with higher humidity; Higher perceived temperatures are associated with lower humidity. · feelslike is strongly correlated with solarradiation (Corr: 0.505): Warmer days are often sunnier.

In summary, the best conditions for cycling (high rentals) are warm, sunny, and dry, while humidity is the strongest weather-related deterrent after temperature.

Weekend or weekdays? Is there a difference?

We use a two-sample t-test to determine if the mean number of bikes hired differs between weekend and weekdays.

t.test(bikes_hired ~ weekend, data= bike)
## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.929, df = 2300.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3739.295 4964.730
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27571.74            23219.72

Summary: - The t-test result shows a p-value of < 2.2e-16, which is small (< 0.05). It means we reject the null hypothesis that the mean number of bikes hired is the same for weekends and weekdays. - The 95% CI for the difference in means is (3739.3, 4964.7). Since this interval is positive, it confirms that the average number of bikes hired on weekdays (FALSE) is higher than on weekends (TRUE).

Model 0: using the mean to predict bikes_hired

We need to build a multiple linear regression model to identify the key factors driving bike rentals. We will build a series of models, such as a simple intercept-only model and a more complex multiple regression model. Then, we will create a confidence interval for mean bikes_hired, also including SE. Finally, we will plot actual data vs. predicted data.

favstats(~bikes_hired, data = bike)
minQ1medianQ3maxmeansdnmissing
01.97e+042.61e+043.27e+047.31e+042.63e+049.5e+0352680
model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26327.6      130.8   201.2   <2e-16 ***
## 
## Residual standard error: 9497 on 5267 degrees of freedom
broom::augment(model0) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 0: mean prediction vs actual data",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: The overall mean of bikes_hired is 26327.6.

  1. Confidence Interval for mean bikes_hired:
  • The mean is the intercept of model0 (26327.6).
  • The t-value for a 95% CI is approximately +- 1.96
  • The SE (Standard Error) of the mean (intercept) is 130.8.
  • The 95% CI is: 26327.6 +- (1.96 x 130.) = 26327.6 +- 256.4.
  • 95% Confidence Interval: (26071.2, 26584.0).
  1. Regression’s Residual Standard Error (RSE):
  • The RSE is a measure of the typical distance between the actual data points and the regression line (in this case, the mean).
  • RSE from msummary(model0): 9497.
  1. Intercept Standard Error:
  • The standard error of the intercept (which is the mean in this model) is shown in the msummary output.
  • Intercept SE: 130.8.

Model 1: bikes_hired on mean_temp

We now add the daily average feelslike temperature as a single predictor. First, we must define the model; then we look at the estimated model and finally plot the actual data vs. the predicted data.

model1 <- lm(bikes_hired ~ feelslike, data = bike)

msummary(model1)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17647.09     189.13   93.31   <2e-16 ***
## feelslike     860.43      15.64   55.01   <2e-16 ***
## 
## Residual standard error: 7569 on 5266 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3648 
## F-statistic:  3027 on 1 and 5266 DF,  p-value: < 2.2e-16
broom::augment(model1) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 1: prediction with 'Feels Like' temperature",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary:

  • Is the effect of mean_temp significant? Why? Yes, the effect is highly significant. The p-value for feelslike is < 2e-16, which is much smaller than the common significance level of 0.05. The coefficient of 860.43 means that for every 1-degree Celsius increase in feelslike temperature, the model predicts an increase of approximately 860.43 bike rentals.
  • What proportion of the overall variability in bike rentals does temperature explain? Adjusted \(R^2\) = 0.3648 Temperature explains approximately 36.48% in daily bike rentals.

Model 2: bikes_hired on mean_temp plus weekend

We now add the weekend indicator to model1.We follow the same steps that we used in Model 1 in order to obtain the following model and analyze it.

model2 <- lm(bikes_hired ~ feelslike + weekend, data = bike)

msummary(model2)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18887.79     193.90   97.41   <2e-16 ***
## feelslike     858.98      15.12   56.81   <2e-16 ***
## weekendTRUE -4288.88     223.12  -19.22   <2e-16 ***
## 
## Residual standard error: 7317 on 5265 degrees of freedom
## Multiple R-squared:  0.4066, Adjusted R-squared:  0.4064 
## F-statistic:  1804 on 2 and 5265 DF,  p-value: < 2.2e-16
broom::augment(model2) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 2: prediction with temperature and weekend status",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: - Is the effect of mean_temp significant? Why? Yes, it remains highly significant (p-value < 2e-16). p-value < 0.05. This allows us to reject the null hypothesis that the true coefficient is zero, confirming that perceived temperature and whether the day is weekend or not are reliable predictors of bike rentals.

  • What proportion of the overall variability does this model explain? Adjusted \(R^2\) = 0.4064. This model explains 40.64% of the variability in daily bike rentals.

  • What is the meaning of the effect (slope) of weekendTRUE? What % of the variability of bikes_hired does your model explain? The slope for weekendTRUE is -4288.88. There is a large drop-off in commuter traffic on weekends.It means that, holding temperature constant, the model predicts a decrease of 4,288.88 fewer bikes hired on a weekend day compared to a weekday (the baseline group). This drop implies that the TfL bike system is heavily used for weekday commuting.

Model 3: bikes_hired on mean_temp plus wday

We now use the full wday (day of the week) categorical variable instead of the simplified weekend dummy. Follow the same steps used in the first two Models.

model3 <- lm(bikes_hired ~ feelslike + wday, data = bike)

msummary(model3)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18339.21     304.81  60.165  < 2e-16 ***
## feelslike     858.34      15.01  57.185  < 2e-16 ***
## wdayMon      -905.19     374.33  -2.418 0.015634 *  
## wdaySat     -2730.52     374.33  -7.294 3.45e-13 ***
## wdaySun     -4737.26     374.33 -12.655  < 2e-16 ***
## wdayThu      1246.66     374.46   3.329 0.000877 ***
## wdayTue      1142.63     374.46   3.051 0.002289 ** 
## wdayWed      1293.74     374.46   3.455 0.000555 ***
## 
## Residual standard error: 7263 on 5260 degrees of freedom
## Multiple R-squared:  0.4158, Adjusted R-squared:  0.4151 
## F-statistic: 534.9 on 7 and 5260 DF,  p-value: < 2.2e-16
broom::augment(model3) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: prediction with temperature and specific day of week",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary 1. What is the meaning of the effect (slope) of, e.g., wdayMon? Let’s assume Sunday (Sun) is the baseline. Then, the coefficient for wdayMon is -905.19. It means that holding the ‘feels like’ temperature constant, the model predicts that bike rentals on a Monday are 905 fewer than on a Friday. Holding the “feels like” temperature constant, bike rentals on a Monday are 7,648.7 > Sunday.

  1. What % of the variability of bikes_hired does your model explain? Adjusted \(R^2\) = 0.4151. This model explains approximately 41.51% of the variability in daily bike rentals, a little better than Model 2.

Model 4

In Model 4 we extend the regression framework beyond temperature and weekday-effects by introducing humidity and a full month factor. Follow the same steps as the first three Models.

model4 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month), data = bike)

msummary(model4)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40635.74    1017.88  39.922  < 2e-16 ***
## feelslike         471.90      25.43  18.555  < 2e-16 ***
## wdayMon          -936.60     342.57  -2.734 0.006278 ** 
## wdaySat         -2658.12     342.59  -7.759 1.02e-14 ***
## wdaySun         -4825.17     342.59 -14.085  < 2e-16 ***
## wdayThu          1312.82     342.69   3.831 0.000129 ***
## wdayTue          1280.10     342.72   3.735 0.000190 ***
## wdayWed          1428.85     342.72   4.169 3.11e-05 ***
## humidity         -263.48      10.84 -24.301  < 2e-16 ***
## factor(month)2    477.51     463.33   1.031 0.302781    
## factor(month)3   1343.15     458.44   2.930 0.003406 ** 
## factor(month)4   1640.36     486.66   3.371 0.000755 ***
## factor(month)5   3906.79     522.19   7.482 8.56e-14 ***
## factor(month)6   5229.30     569.08   9.189  < 2e-16 ***
## factor(month)7   4950.94     600.98   8.238  < 2e-16 ***
## factor(month)8   2288.17     587.10   3.897 9.84e-05 ***
## factor(month)9   3809.46     546.91   6.965 3.68e-12 ***
## factor(month)10  4138.80     499.60   8.284  < 2e-16 ***
## factor(month)11  3226.77     459.63   7.020 2.50e-12 ***
## factor(month)12 -1631.01     446.03  -3.657 0.000258 ***
## 
## Residual standard error: 6647 on 5248 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5101 
## F-statistic: 289.7 on 19 and 5248 DF,  p-value: < 2.2e-16
broom::augment(model4) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 4: prediction with Temp, Day, Humidity, and Month",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary 1. The Adjusted \(R^2\) of model 4 is 0.5101. This model approximately explains 51.01% of the total variability in daily bike rentals. 2. The Residual Standard Error (RSE) has been reduced to 6,647 bikes, meaning the average prediction error is lower. 3. Perceived temperature (feelslike) and humidity are both highly significant predictors. Their p-values are both < 2e-16, which are much smaller than the common significance level of 0.05. 4. Day of week (wday): Confirms the strong commuting pattern. Using Friday as the baseline, rentals are higher on Tuesday, Wednesday, and Thursday (Wednesday is the highest: 1428.85 more than Friday), and lower on Monday, Saturday and Sunday (Sunday is the lowest: 4,825.17 fewer than Friday). 5. Month (factor(month)): Captures the seasonal component that isn’t fully explained by temperature. Using January (month 1) as the baseline: Peak Rentals are predicted for June (5,229.30 more than January), followed closely by July and October. December (1,631.01 fewer than January) is the only month with significantly lower rentals than January, confirming it as the least popular cycling month.

Model 5

Model 5 builds on the previous model by adding precipitation (rainfall) as a new weather variable.

model5 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month) + precip, data = bike)

msummary(model5)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     38021.41    1012.58  37.549  < 2e-16 ***
## feelslike         465.73      24.92  18.690  < 2e-16 ***
## wdayMon          -937.69     335.61  -2.794 0.005226 ** 
## wdaySat         -2684.82     335.63  -7.999 1.53e-15 ***
## wdaySun         -4756.45     335.66 -14.171  < 2e-16 ***
## wdayThu          1361.61     335.74   4.056 5.07e-05 ***
## wdayTue          1340.79     335.78   3.993 6.61e-05 ***
## wdayWed          1478.14     335.77   4.402 1.09e-05 ***
## humidity         -227.51      10.89 -20.883  < 2e-16 ***
## factor(month)2    826.98     454.53   1.819 0.068903 .  
## factor(month)3   1473.67     449.21   3.281 0.001043 ** 
## factor(month)4   1906.95     477.11   3.997 6.51e-05 ***
## factor(month)5   4312.94     512.31   8.419  < 2e-16 ***
## factor(month)6   5702.88     558.43  10.212  < 2e-16 ***
## factor(month)7   5520.45     590.01   9.357  < 2e-16 ***
## factor(month)8   2851.71     576.41   4.947 7.76e-07 ***
## factor(month)9   4101.46     536.15   7.650 2.38e-14 ***
## factor(month)10  4338.59     489.63   8.861  < 2e-16 ***
## factor(month)11  3279.97     450.31   7.284 3.73e-13 ***
## factor(month)12 -1645.02     436.96  -3.765 0.000169 ***
## precip           -290.37      19.53 -14.865  < 2e-16 ***
## 
## Residual standard error: 6512 on 5247 degrees of freedom
## Multiple R-squared:  0.5316, Adjusted R-squared:  0.5298 
## F-statistic: 297.8 on 20 and 5247 DF,  p-value: < 2.2e-16
broom::augment(model5) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 5: prediction with all major predictors",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

As expected, precipitation has a significant negative effect on the number of bikes hired. Rainy days see a noticeable drop in rentals. The adjusted R² rises again, showing that including rainfall improves the model’s ability to explain daily variation in hires. In short, Model 5 provides the best overall fit, confirming that temperature boosts demand, while rain and humidity reduce it, and calendar effects (season and weekday) still play a key role.

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  • Are other weather variables useful in explaining bikes_hired? Yes, absolutely. The correlation matrix and the progression of the models confirm that humidity and precipitation (precip) are highly significant and useful predictors for bike rentals, in addition to the base temperature (feelslike).
  • ⁠Humidity: Has a strong negative effect (Corr: −0.502), indicating that drier days lead to higher rentals.
  • ⁠Precipitation: Has a significant negative effect (β≈−290 in Model 5), confirming that rain is a deterrent.
  • ⁠The inclusion of these variables in Model 5 significantly lowered the prediction error and increased the explanatory power compared to using only temperature.
  • We also have data on days of the week, month of the year, etc. Could those be helpful? Yes. We saw this from model1 (temp only) to model3 (temp + specific day of the week, wday) to model5 (adding factor(month)).
    • The wday variable improved the model by differentiating weekdays.
    • The factor(month) variable in model4 and model5 shows seasonality that is not fully explained by temperature alone.
  • What’s the best model you can come up with? Model5 is the best model so far (Adjusted \(R^2\) = 0.5298)
  • Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE? This model is an explanatory model (that is used for prediction), because its main goal is to identify which factors influence bike rentals. If we use it, its Redisual SE = 6,512. It means that the model’s prediction for a single day is expected to be within approximately ±6,512 bikes of the actual number of bikes hired.

Ploting your best model predicitons vs reality

In the code below, first we plot the actual vs. predicted data, then we plot the fitted line and residuals.

my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())


ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: actual rentals vs fitted predictions",
       subtitle = "Model 3: bikes_hired ~ feelslike + wday",
       x = "Day number (time index)",
       y = "Number of bikes hired") +
  theme_bw()

Summary: The plot shows that the predicted values perfectly shows the general seasonal fluctuation and low-use days (weeekends, winter) with its residual SE of 6620. While model3 is better than a simple mean, it still leaves a lot of daily variation explained.

Diagnostics, collinearity, summary tables

For this analysis, we are interested in 3 plots: residuals vs. fitted (for linearity and homoscedasticity); Normal Q-Q (for normality); Scale-location (for homoscedasticity). In order to do this analysis we must use the autoplot function. To check for multicollinearity we check the value of the VIF, and see if it is above or below the threshold value of 10.

# Residual Analysis
# We are interested in the first 3 plots:
# 1: Residuals vs fitted (for linearity and homoscedasticity)
# 2: Normal Q-Q (for normality)
# 3: Scale-location (for homoscedasticity)
model2 %>% 
  autoplot(which = 1:3) +
  theme_bw()
## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check VIF
# A VIF > 10 is a threshold for concern
vif(model2)
## feelslike   weekend 
##  1.000025  1.000025
vif(model3)
##               GVIF Df GVIF^(1/(2*Df))
## feelslike 1.000065  1        1.000032
## wday      1.000065  6        1.000005
vif(model4)
##                   GVIF Df GVIF^(1/(2*Df))
## feelslike     3.428223  1        1.851546
## wday          1.001367  6        1.000114
## humidity      1.515010  1        1.230858
## factor(month) 4.029582 11        1.065398

Summary: The linearity assumption holds. The spread of the residuals are constant (the homoscedasticity assumption is met) and normally distributed. All VIF values are below 10. Our predictor variables are not highly correlated, and multicollinearity is not an issue here.

Comparison of different models

Now we create a summary table, in order to compare the models examined in this analysis.

huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)
(Intercept)26327.597 ***17647.089 ***18887.785 ***18339.214 ***40635.741 ***
(130.846)   (189.133)   (193.902)   (304.814)   (1017.876)   
feelslike        860.427 ***858.979 ***858.340 ***471.901 ***
        (15.640)   (15.120)   (15.010)   (25.432)   
weekendTRUE                -4288.877 ***                
                (223.121)                   
wdayMon                        -905.186 *  -936.602 ** 
                        (374.331)   (342.574)   
wdaySat                        -2730.525 ***-2658.124 ***
                        (374.331)   (342.585)   
wdaySun                        -4737.260 ***-4825.175 ***
                        (374.331)   (342.586)   
wdayThu                        1246.662 ***1312.820 ***
                        (374.456)   (342.690)   
wdayTue                        1142.627 ** 1280.100 ***
                        (374.456)   (342.724)   
wdayWed                        1293.737 ***1428.854 ***
                        (374.459)   (342.722)   
humidity                                -263.483 ***
                                (10.842)   
factor(month)2                                477.507    
                                (463.333)   
factor(month)3                                1343.150 ** 
                                (458.442)   
factor(month)4                                1640.364 ***
                                (486.658)   
factor(month)5                                3906.791 ***
                                (522.191)   
factor(month)6                                5229.300 ***
                                (569.081)   
factor(month)7                                4950.945 ***
                                (600.980)   
factor(month)8                                2288.165 ***
                                (587.096)   
factor(month)9                                3809.465 ***
                                (546.908)   
factor(month)10                                4138.800 ***
                                (499.598)   
factor(month)11                                3226.766 ***
                                (459.634)   
factor(month)12                                -1631.005 ***
                                (446.025)   
#observations5268        5268        5268        5268        5268        
R squared0.000    0.365    0.407    0.416    0.512    
Adj. R Squared0.000    0.365    0.406    0.415    0.510    
Residual SE9496.916    7568.703    7317.024    7263.360    6646.989    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Summary: The Adjusted \(R^2\) consistently increased, and the Residual SE is decreased. Model 5 is clearly the best model as it explains the most variance (41.5%) with the lowest error.