We will analyze data from TfL to understand the usage of the London bike sharing scheme.

London Bikes data

This R code does two main things — loads the data and creates a new variable to mark weekends. This makes it easier to compare or model bike hires between weekdays and weekends.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))

Cleaning our data

The raw data requires cleaning to be useful for analysis. We will convert character variables to factors and engineer new date-based features using lubridate. We will also generate season_name variable and provide a comprehensive overview of each variable using the skim function.

bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE)) %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    month_name = lubridate::month(date, label = TRUE),
    day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE) 
  ) %>%
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn"))
  )

## Rows: 5268 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (9): preciptype, conditions, description, icon, month_name, day_of_wee...
## dbl  (27): bikes_hired, tempmax, tempmin, temp, feelslikemax, feelslikemin, ...
## lgl   (2): severerisk, weekend
## dttm  (3): date, sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(bike)

Data summary
Name	bike
Number of rows	5268
Number of columns	41
_______________________
Column type frequency:
character	6
factor	3
logical	2
numeric	27
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
preciptype	1411	0.73	4	9	3
conditions	0	1.00	4	28	11
description	0	1.00	26	85	78
icon	0	1.00	4	17	6
wday	0	1.00	3	3	7
set	0	1.00	5	5	1

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
month_name	1	TRUE	12	Aug: 465, Oct: 465, Dec: 464, Sep: 450
day_of_week	1	TRUE	7	Sun: 753, Mon: 753, Fri: 753, Sat: 753
season_name	1	FALSE	4	Aut: 1365, Sum: 1321, Win: 1294, Spr: 1288

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
severerisk	5268	0	NaN	:
weekend	0	1	0.29	FAL: 3762, TRU: 1506

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bikes_hired	0	1.00	26327.60	9496.92	0.0	19746.75	26070.00	32745.00	73094.00	▂▇▅▁▁
tempmax	0	1.00	14.73	6.36	-2.6	10.10	14.40	19.50	36.10	▁▇▇▃▁
tempmin	0	1.00	7.84	5.04	-11.1	4.10	8.00	11.60	22.10	▁▃▇▇▁
temp	0	1.00	11.23	5.48	-5.9	7.20	11.10	15.50	28.30	▁▅▇▅▁
feelslikemax	0	1.00	14.09	7.24	-7.2	10.10	14.40	19.50	38.20	▁▅▇▃▁
feelslikemin	0	1.00	6.18	6.38	-14.6	1.17	6.00	11.60	22.10	▁▃▇▇▂
feelslike	0	1.00	10.09	6.67	-9.1	4.90	10.60	15.50	28.90	▁▆▇▇▁
dew	0	1.00	7.41	4.67	-8.0	4.10	7.60	11.00	19.00	▁▃▇▇▂
humidity	0	1.00	79.28	10.40	39.6	72.20	80.30	87.30	99.90	▁▂▅▇▅
precip	0	1.00	1.64	4.73	0.0	0.00	0.18	1.34	168.20	▇▁▁▁▁
precipprob	0	1.00	73.04	44.38	0.0	0.00	100.00	100.00	100.00	▃▁▁▁▇
precipcover	0	1.00	8.96	11.60	0.0	0.00	8.33	12.50	100.00	▇▁▁▁▁
snow	0	1.00	0.02	0.25	0.0	0.00	0.00	0.00	8.40	▇▁▁▁▁
snowdepth	0	1.00	0.05	0.55	0.0	0.00	0.00	0.00	18.10	▇▁▁▁▁
windgust	1126	0.79	42.08	14.84	9.4	30.72	40.70	51.80	120.90	▅▇▃▁▁
windspeed	0	1.00	22.95	8.42	5.1	16.88	21.80	27.60	72.20	▅▇▂▁▁
winddir	0	1.00	196.97	92.33	0.1	127.75	221.60	259.00	359.80	▃▂▃▇▃
sealevelpressure	0	1.00	1014.90	10.52	966.6	1008.70	1015.70	1022.00	1047.80	▁▁▇▇▁
cloudcover	0	1.00	60.86	22.45	0.0	46.60	62.70	77.80	100.00	▁▃▇▇▅
visibility	0	1.00	22.65	10.62	0.2	15.20	22.00	28.90	62.50	▃▇▆▁▁
solarradiation	0	1.00	112.00	84.89	0.0	40.20	91.50	169.77	358.60	▇▅▃▂▁
solarenergy	0	1.00	9.66	7.34	0.0	3.50	7.90	14.60	31.00	▇▅▃▂▁
uvindex	0	1.00	4.26	2.55	0.0	2.00	4.00	6.00	10.00	▇▆▅▅▁
moonphase	0	1.00	0.48	0.29	0.0	0.25	0.50	0.75	0.98	▇▇▇▇▇
month	0	1.00	6.62	3.46	1.0	4.00	7.00	10.00	12.00	▇▅▅▅▇
week	0	1.00	27.01	15.08	1.0	14.00	27.00	40.00	53.00	▇▇▇▇▇
year	0	1.00	2017.28	4.17	2010.0	2014.00	2017.00	2021.00	2024.00	▆▇▇▇▇

Variable type: POSIXct

skim_variable	complete_rate	min	max	median	n_unique
date	1	2010-07-30 00:00:00	2024-12-30 00:00:00	2017-10-14 12:00:00	5268
sunrise	1	2010-07-30 05:17:39	2024-12-30 08:06:26	2017-10-14 19:21:46	5268
sunset	1	2010-07-30 20:50:39	2024-12-30 16:00:26	2017-10-15 06:05:26	5268

Summary Statistics

We can now calculate summary statistics for our target variable, bikes_hired.

favstats(~ bikes_hired, data= bike)

min	Q1	median	Q3	max	mean	sd	n	missing
0	1.97e+04	2.61e+04	3.27e+04	7.31e+04	2.63e+04	9.5e+03	5268	0

We can also group these summary statistics by the categorical variables we created.

favstats(bikes_hired ~ year, data=bike)

year	min	Q1	median	Q3	max	mean	sd	n
2010	2.76e+03	9.3e+03	1.4e+04	1.87e+04	2.8e+04	1.41e+04	5.62e+03	155
2011	4.56e+03	1.63e+04	2.03e+04	2.37e+04	2.94e+04	1.96e+04	5.5e+03	365
2012	3.53e+03	1.93e+04	2.62e+04	3.25e+04	4.71e+04	2.6e+04	9.43e+03	366
2013	3.73e+03	1.76e+04	2.2e+04	2.74e+04	3.56e+04	2.2e+04	7.28e+03	365
2014	4.33e+03	2.05e+04	2.77e+04	3.44e+04	4.9e+04	2.75e+04	9.07e+03	365
2015	5.78e+03	2.21e+04	2.66e+04	3.29e+04	7.31e+04	2.7e+04	8.55e+03	365
2016	4.89e+03	2.24e+04	2.79e+04	3.51e+04	4.66e+04	2.82e+04	8.85e+03	366
2017	5.14e+03	2.41e+04	2.95e+04	3.45e+04	4.6e+04	2.86e+04	8.38e+03	365
2018	5.86e+03	2.18e+04	2.92e+04	3.77e+04	4.61e+04	2.9e+04	1.02e+04	365
2019	5.65e+03	2.42e+04	2.89e+04	3.45e+04	4.47e+04	2.86e+04	8.09e+03	365
2020	4.87e+03	2.01e+04	2.75e+04	3.65e+04	7.02e+04	2.85e+04	1.16e+04	366
2021	6.25e+03	2.17e+04	3.1e+04	3.82e+04	5.69e+04	3e+04	1.1e+04	365
2022	0	2.51e+04	3.14e+04	4e+04	6.7e+04	3.15e+04	1.03e+04	365
2023	6.72e+03	1.99e+04	2.37e+04	2.78e+04	3.53e+04	2.34e+04	6.03e+03	365
2024	7.48e+03	2.04e+04	2.48e+04	2.87e+04	3.52e+04	2.4e+04	6.14e+03	365

favstats(bikes_hired ~ day_of_week, data=bike)

day_of_week	min	Q1	median	Q3	max	mean	sd	n
Sunday	0	1.41e+04	2.1e+04	2.97e+04	6.31e+04	2.22e+04	1.05e+04	753
Monday	3.97e+03	2.06e+04	2.57e+04	3.15e+04	6.7e+04	2.61e+04	8.28e+03	753
Tuesday	3.76e+03	2.25e+04	2.79e+04	3.44e+04	6.53e+04	2.82e+04	8.62e+03	752
Wednesday	4.33e+03	2.28e+04	2.8e+04	3.42e+04	5.44e+04	2.84e+04	8.52e+03	752
Thursday	5.65e+03	2.27e+04	2.75e+04	3.44e+04	7.31e+04	2.83e+04	8.79e+03	752
Friday	5.4e+03	2.14e+04	2.65e+04	3.29e+04	6.7e+04	2.7e+04	8.54e+03	753
Saturday	0	1.6e+04	2.25e+04	3.08e+04	7.02e+04	2.42e+04	1.11e+04	753

favstats(bikes_hired ~ month_name, data=bike)

month_name	min	Q1	median	Q3	max	mean	sd	n
Jan	3.73e+03	1.39e+04	1.9e+04	2.33e+04	3.8e+04	1.87e+04	5.99e+03	434
Feb	3.53e+03	1.6e+04	2.06e+04	2.46e+04	5.25e+04	2.03e+04	6.37e+03	396
Mar	5.06e+03	1.76e+04	2.31e+04	2.72e+04	5.66e+04	2.27e+04	7.59e+03	434
Apr	4.87e+03	2.1e+04	2.59e+04	3.08e+04	4.9e+04	2.6e+04	7.61e+03	420
May	1.07e+04	2.45e+04	2.99e+04	3.6e+04	7.02e+04	3.04e+04	8.44e+03	434
Jun	6.06e+03	2.77e+04	3.31e+04	3.92e+04	6.53e+04	3.34e+04	8.59e+03	420
Jul	5.56e+03	2.91e+04	3.58e+04	4.13e+04	7.31e+04	3.47e+04	8.35e+03	436
Aug	4.3e+03	2.56e+04	3.24e+04	3.83e+04	6.7e+04	3.14e+04	9.6e+03	465
Sep	0	2.49e+04	3.12e+04	3.61e+04	5.18e+04	3.04e+04	8.11e+03	450
Oct	7.07e+03	2.33e+04	2.8e+04	3.26e+04	4.79e+04	2.75e+04	6.77e+03	465
Nov	6.03e+03	1.88e+04	2.37e+04	2.79e+04	4.47e+04	2.33e+04	6.41e+03	450
Dec	2.76e+03	1.14e+04	1.67e+04	2.29e+04	3.91e+04	1.71e+04	6.91e+03	464

favstats(bikes_hired ~ season_name, data=bike)

season_name	min	Q1	median	Q3	max	mean	sd	n
Winter	2.76e+03	1.37e+04	1.88e+04	2.35e+04	5.25e+04	1.86e+04	6.58e+03	1294
Spring	4.87e+03	2.08e+04	2.61e+04	3.13e+04	7.02e+04	2.64e+04	8.51e+03	1288
Summer	4.3e+03	2.76e+04	3.37e+04	3.96e+04	7.31e+04	3.31e+04	8.98e+03	1321
Autumn	0	2.19e+04	2.72e+04	3.26e+04	5.18e+04	2.7e+04	7.69e+03	1365

Exploratory Data Analysis

Time series plot of bikes rented

We will plot the number of bike rentals from January 1, 2014 onwards to observe any long-term trends or seasonal patterns.

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  
  ggplot() +
  aes(x = date, y = bikes_hired) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Daily TfL bike rentals over time",
    x = "Date",
    y = "Number of bikes hired"
  )
  # Apply a clean black and white theme
  theme_bw() +
  
  NULL

Summary: The long-term trend in daily TfL bike rentals shows a slight overall increase from 2014 until a dramatic spike in 2020 (likely due to the pandemic), followed by a subsequent leveling off and possible slight decline in the most recent years (2023-2024), though usage generally remains higher than pre-2020 levels.

Further graphs

Now we will see how rentals differ by months and by season. Fist, we plot a histogram of the bikes rented from January 1, 2014 onwards. Secondly, we plot histograms faceted by seasom_name. After this, we plot histograms faceted by month name, and the last graph is the same as the third one, but the histograms are represented in 4 rows. Having done this, we plot density plots for every result found in the previous graphs. After the density plots, we represent with a boxplot graph bikes hired for each month, then focusing on the relationship between bikes hired and temperature, and bikes hired and temperature divided by season.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  theme_bw()+
  NULL

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~season_name)+
  theme_bw()+
  NULL

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name)+
  theme_bw()+
  NULL

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  labs(
    title = "Bike rentals by season",
    x = "Season",
    y = "Number of bikes hired"
  ) +
  aes(x = bikes_hired)+
  geom_histogram(aes(fill=season_name), alpha = 0.5)+
  facet_wrap(~season_name, nrow = 4)+
  theme_minimal()+
  
  theme(legend.position = "none")+
  NULL

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  theme(legend.position="none")+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=month_name, y= bikes_hired)+
  geom_boxplot()+
  theme_bw()+
  NULL

bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired)+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL

## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL

## `geom_smooth()` using formula = 'y ~ x'

bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=humidity, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL

## `geom_smooth()` using formula = 'y ~ x'

Summary: - Rentals are slightly lower and less variable on weekends (Saturday/Sunday) compared to weekdays. - Rentals are highest in Summer, followed by Spring and Autumn. Winter has the lowest and least variable rental numbers, confirming our time series observation.

Model building

Correlation - Scatterplot matrix

Now, we will work with the correlation of weather variables. We have many weather variables (\(temp, feelslike, humidity, windspeed\), etc.).

Some of them are likely to be highly correlated. We’ll use \(ggpairs\) to visualize the relationships between our target variable \(bikeshired\) and key numeric predictors.

bike_numeric <- bike %>% 
  select(bikes_hired, feelslike, windspeed, humidity, precip, solarradiation, cloudcover)

ggpairs(bike_numeric) +
  theme_minimal()

Summary: - bikes_hired has a strong positive linear relationship with \(feelslike\) (and solarradiation).

bikes_hired has a negative relationship with humidity, precip (precipitation), and cloudcover.
feelslike is strongly correlated with solarradiation and negatively correlated with humidity and cloudcover. We will need to check for multicollinearity in our models.

Weekend or weekdays? Is there a difference?

We use a two-sample t-test to determine if the mean number of bikes hired differs between weekend and weekdays.

t.test(bikes_hired ~ weekend, data= bike)

## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.929, df = 2300.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3739.295 4964.730
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27571.74            23219.72

Summary: - The t-test result shows a p-value of < 2.2e-16, which is small (< 0.05). It means we reject the null hypothesis that the mean number of bikes hired is the same for weekends and weekdays. - The 95% CI for the difference in means is (3739.3, 4964.7). Since this interval is positive, it confirms that the average number of bikes hired on weekdays (FALSE) is higher than on weekends (TRUE).

Model 0: using the mean to predict `bikes_hired`

We need to build a multiple linear regression model to identify the key factors driving bike rentals. We will build a series of models, such as a simple intercept-only model and a more complex multiple regression model. Then, we will create a confidence interval for mean bikes_hired, also including SE. Finally, we will plot actual data vs. predicted data.

favstats(~bikes_hired, data = bike)

min	Q1	median	Q3	max	mean	sd	n	missing
0	1.97e+04	2.61e+04	3.27e+04	7.31e+04	2.63e+04	9.5e+03	5268	0

model0 <- lm(bikes_hired ~ 1, data= bike)
msummary(model0)

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26327.6      130.8   201.2   <2e-16 ***
## 
## Residual standard error: 9497 on 5267 degrees of freedom

broom::augment(model0) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 0: mean prediction vs actual data",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: The overall mean of bikes_hired is 26327.6.

Confidence Interval for mean bikes_hired:

The mean is the intercept of model0 (26327.6).
The t-value for a 95% CI is approximately +- 1.96
The SE (Standard Error) of the mean (intercept) is 130.8.
The 95% CI is: 26327.6 +- (1.96 x 130.) = 26327.6 +- 256.4.
95% Confidence Interval: (26071.2, 26584.0).

Regression’s Residual Standard Error (RSE):

The RSE is a measure of the typical distance between the actual data points and the regression line (in this case, the mean).
RSE from msummary(model0): 9497.

Intercept Standard Error:

The standard error of the intercept (which is the mean in this model) is shown in the msummary output.
Intercept SE: 130.8.

Model 1: `bikes_hired` on `mean_temp`

We now add the daily average feelslike temperature as a single predictor. First, we must define the model; then we look at the estimated model and finally plot the actual data vs. the predicted data.

model1 <- lm(bikes_hired ~ feelslike, data = bike)

msummary(model1)

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17647.09     189.13   93.31   <2e-16 ***
## feelslike     860.43      15.64   55.01   <2e-16 ***
## 
## Residual standard error: 7569 on 5266 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3648 
## F-statistic:  3027 on 1 and 5266 DF,  p-value: < 2.2e-16

broom::augment(model1) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 1: prediction with 'Feels Like' temperature",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary:

Is the effect of mean_temp significant? Why? Yes, the effect is highly significant. The p-value for feelslike is < 2e-16, which is much smaller than the common significance level of 0.05. The coefficient of 860.43 means that for every 1-degree Celsius increase in feelslike temperature, the model predicts an increase of approximately 860.43 bike rentals.
What proportion of the overall variability in bike rentals does temperature explain? Adjusted \(R^2\) = 0.353 Temperature explains approximately 35.3% in daily bike rentals.

Model 2: `bikes_hired` on `mean_temp` plus `weekend`

We now add the weekend indicator to model1.We follow the same steps that we used in Model 1 in order to obtain the following model and analyze it.

model2 <- lm(bikes_hired ~ feelslike + weekend, data = bike)

msummary(model2)

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18887.79     193.90   97.41   <2e-16 ***
## feelslike     858.98      15.12   56.81   <2e-16 ***
## weekendTRUE -4288.88     223.12  -19.22   <2e-16 ***
## 
## Residual standard error: 7317 on 5265 degrees of freedom
## Multiple R-squared:  0.4066, Adjusted R-squared:  0.4064 
## F-statistic:  1804 on 2 and 5265 DF,  p-value: < 2.2e-16

broom::augment(model2) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 2: prediction with temperature and weekend status",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary: - Is the effect of mean_temp significant? Why? Yes, it remains highly significant (p-value < 2e-16). p-value < 0.05.

What proportion of the overall variability does this model explain? Adjusted \(R^2\) = 0.420. This model explains 42% of the variability in daily bike rentals.
What is the meaning of the effect (slope) of weekendTRUE? What % of the variability of bikes_hired does your model explain? The slope for weekendTRUE is -5876.3. There is a large drop-off in commuter traffic on weekends.

Model 3: `bikes_hired` on `mean_temp` plus `wday`

We now use the full wday (day of the week) categorical variable instead of the simplified weekend dummy. Follow the same steps used in the first two Models.

model3 <- lm(bikes_hired ~ feelslike + wday, data = bike)

msummary(model3)

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18339.21     304.81  60.165  < 2e-16 ***
## feelslike     858.34      15.01  57.185  < 2e-16 ***
## wdayMon      -905.19     374.33  -2.418 0.015634 *  
## wdaySat     -2730.52     374.33  -7.294 3.45e-13 ***
## wdaySun     -4737.26     374.33 -12.655  < 2e-16 ***
## wdayThu      1246.66     374.46   3.329 0.000877 ***
## wdayTue      1142.63     374.46   3.051 0.002289 ** 
## wdayWed      1293.74     374.46   3.455 0.000555 ***
## 
## Residual standard error: 7263 on 5260 degrees of freedom
## Multiple R-squared:  0.4158, Adjusted R-squared:  0.4151 
## F-statistic: 534.9 on 7 and 5260 DF,  p-value: < 2.2e-16

broom::augment(model3) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: prediction with temperature and specific day of week",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Summary 1. What is the meaning of the effect (slope) of, e.g., wdayMon? Let’s assume Sunday (Sun) is the baseline. Then, the coefficient for wdayMon is 7648.7. Holding the “feels like” temperature constant, bike rentals on a Monday are 7,648.7 > Sunday.

What % of the variability of bikes_hired does your model explain? Adjusted \(R^2\) = 0.456. This model explains approximately 45.6% of the variability in daily bike rentals, a little better than Model 2.

Model 4

In Model 4 we extend the regression framework beyond temperature and weekday-effects by introducing humidity and a full month factor. Follow the same steps as the first three Models.

model4 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month), data = bike)

msummary(model4)

##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40635.74    1017.88  39.922  < 2e-16 ***
## feelslike         471.90      25.43  18.555  < 2e-16 ***
## wdayMon          -936.60     342.57  -2.734 0.006278 ** 
## wdaySat         -2658.12     342.59  -7.759 1.02e-14 ***
## wdaySun         -4825.17     342.59 -14.085  < 2e-16 ***
## wdayThu          1312.82     342.69   3.831 0.000129 ***
## wdayTue          1280.10     342.72   3.735 0.000190 ***
## wdayWed          1428.85     342.72   4.169 3.11e-05 ***
## humidity         -263.48      10.84 -24.301  < 2e-16 ***
## factor(month)2    477.51     463.33   1.031 0.302781    
## factor(month)3   1343.15     458.44   2.930 0.003406 ** 
## factor(month)4   1640.36     486.66   3.371 0.000755 ***
## factor(month)5   3906.79     522.19   7.482 8.56e-14 ***
## factor(month)6   5229.30     569.08   9.189  < 2e-16 ***
## factor(month)7   4950.94     600.98   8.238  < 2e-16 ***
## factor(month)8   2288.17     587.10   3.897 9.84e-05 ***
## factor(month)9   3809.46     546.91   6.965 3.68e-12 ***
## factor(month)10  4138.80     499.60   8.284  < 2e-16 ***
## factor(month)11  3226.77     459.63   7.020 2.50e-12 ***
## factor(month)12 -1631.01     446.03  -3.657 0.000258 ***
## 
## Residual standard error: 6647 on 5248 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5101 
## F-statistic: 289.7 on 19 and 5248 DF,  p-value: < 2.2e-16

broom::augment(model4) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 4: prediction with Temp, Day, Humidity, and Month",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

Model 5

Model 5 builds on the previous model by adding precipitation (rainfall) as a new weather variable.

model5 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month) + precip, data = bike)

msummary(model5)

##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     38021.41    1012.58  37.549  < 2e-16 ***
## feelslike         465.73      24.92  18.690  < 2e-16 ***
## wdayMon          -937.69     335.61  -2.794 0.005226 ** 
## wdaySat         -2684.82     335.63  -7.999 1.53e-15 ***
## wdaySun         -4756.45     335.66 -14.171  < 2e-16 ***
## wdayThu          1361.61     335.74   4.056 5.07e-05 ***
## wdayTue          1340.79     335.78   3.993 6.61e-05 ***
## wdayWed          1478.14     335.77   4.402 1.09e-05 ***
## humidity         -227.51      10.89 -20.883  < 2e-16 ***
## factor(month)2    826.98     454.53   1.819 0.068903 .  
## factor(month)3   1473.67     449.21   3.281 0.001043 ** 
## factor(month)4   1906.95     477.11   3.997 6.51e-05 ***
## factor(month)5   4312.94     512.31   8.419  < 2e-16 ***
## factor(month)6   5702.88     558.43  10.212  < 2e-16 ***
## factor(month)7   5520.45     590.01   9.357  < 2e-16 ***
## factor(month)8   2851.71     576.41   4.947 7.76e-07 ***
## factor(month)9   4101.46     536.15   7.650 2.38e-14 ***
## factor(month)10  4338.59     489.63   8.861  < 2e-16 ***
## factor(month)11  3279.97     450.31   7.284 3.73e-13 ***
## factor(month)12 -1645.02     436.96  -3.765 0.000169 ***
## precip           -290.37      19.53 -14.865  < 2e-16 ***
## 
## Residual standard error: 6512 on 5247 degrees of freedom
## Multiple R-squared:  0.5316, Adjusted R-squared:  0.5298 
## F-statistic: 297.8 on 20 and 5247 DF,  p-value: < 2.2e-16

broom::augment(model5) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 5: prediction with all major predictors",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()

As expected, precipitation has a significant negative effect on the number of bikes hired. Rainy days see a noticeable drop in rentals. The adjusted R² rises again, showing that including rainfall improves the model’s ability to explain daily variation in hires. In short, Model 5 provides the best overall fit, confirming that temperature boosts demand, while rain and humidity reduce it, and calendar effects (season and weekday) still play a key role.

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

Are other weather variables useful in explaining bikes_hired? Yes. Our final model, model5 (humidity and precip) significantly improved the model fit from \(R^2\) of 0.456 (Model 3) to 0.648. This indicates that weather conditions like high humidity and rainfall are very useful and act as strong negative predictors of bike rentals. Variables like windspeed and cloudcover are also strong variables for next improvement.
We also have data on days of the week, month of the year, etc. Could those be helpful? Yes. We saw this from model1 (temp only) to model3 (temp + specific day of the week, wday) to model5 (adding factor(month)).
- The wday variable improved the model by differentiating weekdays.
- The factor(month) variable in model4 and model5 shows seasonality that is not fully explained by temperature alone.
What’s the best model you can come up with? Model5 is the best model so far (Adjusted \(R^2\) = 0.648)
Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE? This model is an explanatory model (thay is used for prediction), because its main goal is to identify which factors influence bike rentals. If we use it, its Redisual SE = 5137. On average, the model’s prediction of daily bike hires will be off by +- 5,137 bikes.

Ploting your best model predicitons vs reality

In the code below, first we plot the actual vs. predicted data, then we plot the fitted line and residuals.

my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())


ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: actual rentals vs fitted predictions",
       subtitle = "Model 3: bikes_hired ~ feelslike + wday",
       x = "Day number (time index)",
       y = "Number of bikes hired") +
  theme_bw()

Summary: The plot shows that the predicted values perfectly shows the general seasonal fluctuation and low-use days (weeekends, winter) with its residual SE of 6620. While model3 is better than a simple mean, it still leaves a lot of daily variation explained.

Diagnostics, collinearity, summary tables

For this analysis, we are interested in 3 plots: residuals vs. fitted (for linearity and homoscedasticity); Normal Q-Q (for normality); Scale-location (for homoscedasticity). In order to do this analysis we must use the autoplot function. To check for multicollinearity we check the value of the VIF, and see if it is above or below the threshold value of 10.

# Residual Analysis
# We are interested in the first 3 plots:
# 1: Residuals vs fitted (for linearity and homoscedasticity)
# 2: Normal Q-Q (for normality)
# 3: Scale-location (for homoscedasticity)
model2 %>% 
  autoplot(which = 1:3) +
  theme_bw()

## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check VIF
# A VIF > 10 is a threshold for concern
vif(model2)

## feelslike   weekend 
##  1.000025  1.000025

vif(model3)

##               GVIF Df GVIF^(1/(2*Df))
## feelslike 1.000065  1        1.000032
## wday      1.000065  6        1.000005

vif(model4)

##                   GVIF Df GVIF^(1/(2*Df))
## feelslike     3.428223  1        1.851546
## wday          1.001367  6        1.000114
## humidity      1.515010  1        1.230858
## factor(month) 4.029582 11        1.065398

Summary: The linearity assumption holds. The spread of the residuals are constant (the homoscedasticity assumption is met) and normally distributed. All VIF values are below 10. Our predictor variables are not highly correlated, and multicollinearity is not an issue here.

Comparison of different models

Now we create a summary table, in order to compare the models examined in this analysis.

huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')

Comparison of models
	(1)	(2)	(3)	(4)	(5)
(Intercept)	26327.597 ***	17647.089 ***	18887.785 ***	18339.214 ***	40635.741 ***
	(130.846)	(189.133)	(193.902)	(304.814)	(1017.876)
feelslike		860.427 ***	858.979 ***	858.340 ***	471.901 ***
		(15.640)	(15.120)	(15.010)	(25.432)
weekendTRUE			-4288.877 ***
			(223.121)
wdayMon				-905.186 *	-936.602 **
				(374.331)	(342.574)
wdaySat				-2730.525 ***	-2658.124 ***
				(374.331)	(342.585)
wdaySun				-4737.260 ***	-4825.175 ***
				(374.331)	(342.586)
wdayThu				1246.662 ***	1312.820 ***
				(374.456)	(342.690)
wdayTue				1142.627 **	1280.100 ***
				(374.456)	(342.724)
wdayWed				1293.737 ***	1428.854 ***
				(374.459)	(342.722)
humidity					-263.483 ***
					(10.842)
factor(month)2					477.507
					(463.333)
factor(month)3					1343.150 **
					(458.442)
factor(month)4					1640.364 ***
					(486.658)
factor(month)5					3906.791 ***
					(522.191)
factor(month)6					5229.300 ***
					(569.081)
factor(month)7					4950.945 ***
					(600.980)
factor(month)8					2288.165 ***
					(587.096)
factor(month)9					3809.465 ***
					(546.908)
factor(month)10					4138.800 ***
					(499.598)
factor(month)11					3226.766 ***
					(459.634)
factor(month)12					-1631.005 ***
					(446.025)
#observations	5268	5268	5268	5268	5268
R squared	0.000	0.365	0.407	0.416	0.512
Adj. R Squared	0.000	0.365	0.406	0.415	0.510
Residual SE	9496.916	7568.703	7317.024	7263.360	6646.989
* p < 0.001; p < 0.01; * p < 0.05.

Summary: The Adjusted \(R^2\) consistently increased, and the Residual SE is decreased. Model 5 is clearly the best model as it explains the most variance (41.5%) with the lowest error.

TfL Bikes: Explore the relationship between bikes hired and a number of variables

Stream 6: Mariia Yevlash, Alessio Guglielmi, Ambre Drissi Qeytouni, Sammy Jenifen, Jason Lau, Malaika Mammen Mathew, Siying Wang

2025-10-23

London Bikes data

Cleaning our data

Summary Statistics

Exploratory Data Analysis

Time series plot of bikes rented

Further graphs

Model building

Correlation - Scatterplot matrix

Weekend or weekdays? Is there a difference?

Model 0: using the mean to predict `bikes_hired`

Model 1: `bikes_hired` on `mean_temp`

Model 2: `bikes_hired` on `mean_temp` plus `weekend`

Model 3: `bikes_hired` on `mean_temp` plus `wday`

Model 4

Model 5

Further variables/questions to explore on your own

Ploting your best model predicitons vs reality

Diagnostics, collinearity, summary tables

Comparison of different models

TfL Bikes: Explore the relationship between bikes hired and a number of variables

Stream 6: Mariia Yevlash, Alessio Guglielmi, Ambre Drissi Qeytouni, Sammy Jenifen, Jason Lau, Malaika Mammen Mathew, Siying Wang

2025-10-23

London Bikes data

Cleaning our data

Summary Statistics

Exploratory Data Analysis

Time series plot of bikes rented

Further graphs

Model building

Correlation - Scatterplot matrix

Weekend or weekdays? Is there a difference?

Model 0: using the mean to predict bikes_hired

Model 1: bikes_hired on mean_temp

Model 2: bikes_hired on mean_temp plus weekend

Model 3: bikes_hired on mean_temp plus wday

Model 4

Model 5

Further variables/questions to explore on your own

Ploting your best model predicitons vs reality

Diagnostics, collinearity, summary tables

Comparison of different models

Model 0: using the mean to predict `bikes_hired`

Model 1: `bikes_hired` on `mean_temp`

Model 2: `bikes_hired` on `mean_temp` plus `weekend`

Model 3: `bikes_hired` on `mean_temp` plus `wday`