We will analyze data from TfL to understand the usage of the London bike sharing scheme.
This R code does two main things — loads the data and creates a new variable to mark weekends. This makes it easier to compare or model bike hires between weekdays and weekends.
bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))The raw data requires cleaning to be useful for analysis. We will convert character variables to factors and engineer new date-based features using lubridate. We will also generate season_name variable and provide a comprehensive overview of each variable using the skim function.
bike <- read_csv(here::here("Data", "london_bikes.csv")) %>% 
  mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE)) %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date),
    month_name = lubridate::month(date, label = TRUE),
    day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE) 
  ) %>%
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn"))
  )## Rows: 5268 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (9): preciptype, conditions, description, icon, month_name, day_of_wee...
## dbl  (27): bikes_hired, tempmax, tempmin, temp, feelslikemax, feelslikemin, ...
## lgl   (2): severerisk, weekend
## dttm  (3): date, sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.| Name | bike | 
| Number of rows | 5268 | 
| Number of columns | 41 | 
| _______________________ | |
| Column type frequency: | |
| character | 6 | 
| factor | 3 | 
| logical | 2 | 
| numeric | 27 | 
| POSIXct | 3 | 
| ________________________ | |
| Group variables | None | 
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace | 
|---|---|---|---|---|---|---|---|
| preciptype | 1411 | 0.73 | 4 | 9 | 0 | 3 | 0 | 
| conditions | 0 | 1.00 | 4 | 28 | 0 | 11 | 0 | 
| description | 0 | 1.00 | 26 | 85 | 0 | 78 | 0 | 
| icon | 0 | 1.00 | 4 | 17 | 0 | 6 | 0 | 
| wday | 0 | 1.00 | 3 | 3 | 0 | 7 | 0 | 
| set | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 | 
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | 
|---|---|---|---|---|---|
| month_name | 0 | 1 | TRUE | 12 | Aug: 465, Oct: 465, Dec: 464, Sep: 450 | 
| day_of_week | 0 | 1 | TRUE | 7 | Sun: 753, Mon: 753, Fri: 753, Sat: 753 | 
| season_name | 0 | 1 | FALSE | 4 | Aut: 1365, Sum: 1321, Win: 1294, Spr: 1288 | 
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count | 
|---|---|---|---|---|
| severerisk | 5268 | 0 | NaN | : | 
| weekend | 0 | 1 | 0.29 | FAL: 3762, TRU: 1506 | 
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | 
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1.00 | 26327.60 | 9496.92 | 0.0 | 19746.75 | 26070.00 | 32745.00 | 73094.00 | ▂▇▅▁▁ | 
| tempmax | 0 | 1.00 | 14.73 | 6.36 | -2.6 | 10.10 | 14.40 | 19.50 | 36.10 | ▁▇▇▃▁ | 
| tempmin | 0 | 1.00 | 7.84 | 5.04 | -11.1 | 4.10 | 8.00 | 11.60 | 22.10 | ▁▃▇▇▁ | 
| temp | 0 | 1.00 | 11.23 | 5.48 | -5.9 | 7.20 | 11.10 | 15.50 | 28.30 | ▁▅▇▅▁ | 
| feelslikemax | 0 | 1.00 | 14.09 | 7.24 | -7.2 | 10.10 | 14.40 | 19.50 | 38.20 | ▁▅▇▃▁ | 
| feelslikemin | 0 | 1.00 | 6.18 | 6.38 | -14.6 | 1.17 | 6.00 | 11.60 | 22.10 | ▁▃▇▇▂ | 
| feelslike | 0 | 1.00 | 10.09 | 6.67 | -9.1 | 4.90 | 10.60 | 15.50 | 28.90 | ▁▆▇▇▁ | 
| dew | 0 | 1.00 | 7.41 | 4.67 | -8.0 | 4.10 | 7.60 | 11.00 | 19.00 | ▁▃▇▇▂ | 
| humidity | 0 | 1.00 | 79.28 | 10.40 | 39.6 | 72.20 | 80.30 | 87.30 | 99.90 | ▁▂▅▇▅ | 
| precip | 0 | 1.00 | 1.64 | 4.73 | 0.0 | 0.00 | 0.18 | 1.34 | 168.20 | ▇▁▁▁▁ | 
| precipprob | 0 | 1.00 | 73.04 | 44.38 | 0.0 | 0.00 | 100.00 | 100.00 | 100.00 | ▃▁▁▁▇ | 
| precipcover | 0 | 1.00 | 8.96 | 11.60 | 0.0 | 0.00 | 8.33 | 12.50 | 100.00 | ▇▁▁▁▁ | 
| snow | 0 | 1.00 | 0.02 | 0.25 | 0.0 | 0.00 | 0.00 | 0.00 | 8.40 | ▇▁▁▁▁ | 
| snowdepth | 0 | 1.00 | 0.05 | 0.55 | 0.0 | 0.00 | 0.00 | 0.00 | 18.10 | ▇▁▁▁▁ | 
| windgust | 1126 | 0.79 | 42.08 | 14.84 | 9.4 | 30.72 | 40.70 | 51.80 | 120.90 | ▅▇▃▁▁ | 
| windspeed | 0 | 1.00 | 22.95 | 8.42 | 5.1 | 16.88 | 21.80 | 27.60 | 72.20 | ▅▇▂▁▁ | 
| winddir | 0 | 1.00 | 196.97 | 92.33 | 0.1 | 127.75 | 221.60 | 259.00 | 359.80 | ▃▂▃▇▃ | 
| sealevelpressure | 0 | 1.00 | 1014.90 | 10.52 | 966.6 | 1008.70 | 1015.70 | 1022.00 | 1047.80 | ▁▁▇▇▁ | 
| cloudcover | 0 | 1.00 | 60.86 | 22.45 | 0.0 | 46.60 | 62.70 | 77.80 | 100.00 | ▁▃▇▇▅ | 
| visibility | 0 | 1.00 | 22.65 | 10.62 | 0.2 | 15.20 | 22.00 | 28.90 | 62.50 | ▃▇▆▁▁ | 
| solarradiation | 0 | 1.00 | 112.00 | 84.89 | 0.0 | 40.20 | 91.50 | 169.77 | 358.60 | ▇▅▃▂▁ | 
| solarenergy | 0 | 1.00 | 9.66 | 7.34 | 0.0 | 3.50 | 7.90 | 14.60 | 31.00 | ▇▅▃▂▁ | 
| uvindex | 0 | 1.00 | 4.26 | 2.55 | 0.0 | 2.00 | 4.00 | 6.00 | 10.00 | ▇▆▅▅▁ | 
| moonphase | 0 | 1.00 | 0.48 | 0.29 | 0.0 | 0.25 | 0.50 | 0.75 | 0.98 | ▇▇▇▇▇ | 
| month | 0 | 1.00 | 6.62 | 3.46 | 1.0 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▅▅▇ | 
| week | 0 | 1.00 | 27.01 | 15.08 | 1.0 | 14.00 | 27.00 | 40.00 | 53.00 | ▇▇▇▇▇ | 
| year | 0 | 1.00 | 2017.28 | 4.17 | 2010.0 | 2014.00 | 2017.00 | 2021.00 | 2024.00 | ▆▇▇▇▇ | 
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique | 
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2010-07-30 00:00:00 | 2024-12-30 00:00:00 | 2017-10-14 12:00:00 | 5268 | 
| sunrise | 0 | 1 | 2010-07-30 05:17:39 | 2024-12-30 08:06:26 | 2017-10-14 19:21:46 | 5268 | 
| sunset | 0 | 1 | 2010-07-30 20:50:39 | 2024-12-30 16:00:26 | 2017-10-15 06:05:26 | 5268 | 
We can now calculate summary statistics for our target variable, bikes_hired.
| min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.61e+04 | 3.27e+04 | 7.31e+04 | 2.63e+04 | 9.5e+03 | 5268 | 0 | 
We can also group these summary statistics by the categorical variables we created.
| year | min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|---|
| 2010 | 2.76e+03 | 9.3e+03 | 1.4e+04 | 1.87e+04 | 2.8e+04 | 1.41e+04 | 5.62e+03 | 155 | 0 | 
| 2011 | 4.56e+03 | 1.63e+04 | 2.03e+04 | 2.37e+04 | 2.94e+04 | 1.96e+04 | 5.5e+03 | 365 | 0 | 
| 2012 | 3.53e+03 | 1.93e+04 | 2.62e+04 | 3.25e+04 | 4.71e+04 | 2.6e+04 | 9.43e+03 | 366 | 0 | 
| 2013 | 3.73e+03 | 1.76e+04 | 2.2e+04 | 2.74e+04 | 3.56e+04 | 2.2e+04 | 7.28e+03 | 365 | 0 | 
| 2014 | 4.33e+03 | 2.05e+04 | 2.77e+04 | 3.44e+04 | 4.9e+04 | 2.75e+04 | 9.07e+03 | 365 | 0 | 
| 2015 | 5.78e+03 | 2.21e+04 | 2.66e+04 | 3.29e+04 | 7.31e+04 | 2.7e+04 | 8.55e+03 | 365 | 0 | 
| 2016 | 4.89e+03 | 2.24e+04 | 2.79e+04 | 3.51e+04 | 4.66e+04 | 2.82e+04 | 8.85e+03 | 366 | 0 | 
| 2017 | 5.14e+03 | 2.41e+04 | 2.95e+04 | 3.45e+04 | 4.6e+04 | 2.86e+04 | 8.38e+03 | 365 | 0 | 
| 2018 | 5.86e+03 | 2.18e+04 | 2.92e+04 | 3.77e+04 | 4.61e+04 | 2.9e+04 | 1.02e+04 | 365 | 0 | 
| 2019 | 5.65e+03 | 2.42e+04 | 2.89e+04 | 3.45e+04 | 4.47e+04 | 2.86e+04 | 8.09e+03 | 365 | 0 | 
| 2020 | 4.87e+03 | 2.01e+04 | 2.75e+04 | 3.65e+04 | 7.02e+04 | 2.85e+04 | 1.16e+04 | 366 | 0 | 
| 2021 | 6.25e+03 | 2.17e+04 | 3.1e+04 | 3.82e+04 | 5.69e+04 | 3e+04 | 1.1e+04 | 365 | 0 | 
| 2022 | 0 | 2.51e+04 | 3.14e+04 | 4e+04 | 6.7e+04 | 3.15e+04 | 1.03e+04 | 365 | 0 | 
| 2023 | 6.72e+03 | 1.99e+04 | 2.37e+04 | 2.78e+04 | 3.53e+04 | 2.34e+04 | 6.03e+03 | 365 | 0 | 
| 2024 | 7.48e+03 | 2.04e+04 | 2.48e+04 | 2.87e+04 | 3.52e+04 | 2.4e+04 | 6.14e+03 | 365 | 0 | 
| day_of_week | min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|---|
| Sunday | 0 | 1.41e+04 | 2.1e+04 | 2.97e+04 | 6.31e+04 | 2.22e+04 | 1.05e+04 | 753 | 0 | 
| Monday | 3.97e+03 | 2.06e+04 | 2.57e+04 | 3.15e+04 | 6.7e+04 | 2.61e+04 | 8.28e+03 | 753 | 0 | 
| Tuesday | 3.76e+03 | 2.25e+04 | 2.79e+04 | 3.44e+04 | 6.53e+04 | 2.82e+04 | 8.62e+03 | 752 | 0 | 
| Wednesday | 4.33e+03 | 2.28e+04 | 2.8e+04 | 3.42e+04 | 5.44e+04 | 2.84e+04 | 8.52e+03 | 752 | 0 | 
| Thursday | 5.65e+03 | 2.27e+04 | 2.75e+04 | 3.44e+04 | 7.31e+04 | 2.83e+04 | 8.79e+03 | 752 | 0 | 
| Friday | 5.4e+03 | 2.14e+04 | 2.65e+04 | 3.29e+04 | 6.7e+04 | 2.7e+04 | 8.54e+03 | 753 | 0 | 
| Saturday | 0 | 1.6e+04 | 2.25e+04 | 3.08e+04 | 7.02e+04 | 2.42e+04 | 1.11e+04 | 753 | 0 | 
| month_name | min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|---|
| Jan | 3.73e+03 | 1.39e+04 | 1.9e+04 | 2.33e+04 | 3.8e+04 | 1.87e+04 | 5.99e+03 | 434 | 0 | 
| Feb | 3.53e+03 | 1.6e+04 | 2.06e+04 | 2.46e+04 | 5.25e+04 | 2.03e+04 | 6.37e+03 | 396 | 0 | 
| Mar | 5.06e+03 | 1.76e+04 | 2.31e+04 | 2.72e+04 | 5.66e+04 | 2.27e+04 | 7.59e+03 | 434 | 0 | 
| Apr | 4.87e+03 | 2.1e+04 | 2.59e+04 | 3.08e+04 | 4.9e+04 | 2.6e+04 | 7.61e+03 | 420 | 0 | 
| May | 1.07e+04 | 2.45e+04 | 2.99e+04 | 3.6e+04 | 7.02e+04 | 3.04e+04 | 8.44e+03 | 434 | 0 | 
| Jun | 6.06e+03 | 2.77e+04 | 3.31e+04 | 3.92e+04 | 6.53e+04 | 3.34e+04 | 8.59e+03 | 420 | 0 | 
| Jul | 5.56e+03 | 2.91e+04 | 3.58e+04 | 4.13e+04 | 7.31e+04 | 3.47e+04 | 8.35e+03 | 436 | 0 | 
| Aug | 4.3e+03 | 2.56e+04 | 3.24e+04 | 3.83e+04 | 6.7e+04 | 3.14e+04 | 9.6e+03 | 465 | 0 | 
| Sep | 0 | 2.49e+04 | 3.12e+04 | 3.61e+04 | 5.18e+04 | 3.04e+04 | 8.11e+03 | 450 | 0 | 
| Oct | 7.07e+03 | 2.33e+04 | 2.8e+04 | 3.26e+04 | 4.79e+04 | 2.75e+04 | 6.77e+03 | 465 | 0 | 
| Nov | 6.03e+03 | 1.88e+04 | 2.37e+04 | 2.79e+04 | 4.47e+04 | 2.33e+04 | 6.41e+03 | 450 | 0 | 
| Dec | 2.76e+03 | 1.14e+04 | 1.67e+04 | 2.29e+04 | 3.91e+04 | 1.71e+04 | 6.91e+03 | 464 | 0 | 
| season_name | min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|---|
| Winter | 2.76e+03 | 1.37e+04 | 1.88e+04 | 2.35e+04 | 5.25e+04 | 1.86e+04 | 6.58e+03 | 1294 | 0 | 
| Spring | 4.87e+03 | 2.08e+04 | 2.61e+04 | 3.13e+04 | 7.02e+04 | 2.64e+04 | 8.51e+03 | 1288 | 0 | 
| Summer | 4.3e+03 | 2.76e+04 | 3.37e+04 | 3.96e+04 | 7.31e+04 | 3.31e+04 | 8.98e+03 | 1321 | 0 | 
| Autumn | 0 | 2.19e+04 | 2.72e+04 | 3.26e+04 | 5.18e+04 | 2.7e+04 | 7.69e+03 | 1365 | 0 | 
We will plot the number of bike rentals from January 1, 2014 onwards to observe any long-term trends or seasonal patterns.
bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  
  ggplot() +
  aes(x = date, y = bikes_hired) +
  geom_point(alpha = 0.4) +
  labs(
    title = "Daily TfL bike rentals over time",
    x = "Date",
    y = "Number of bikes hired"
  )
  # Apply a clean black and white theme
  theme_bw() +
  
  NULLSummary: The long-term trend in daily TfL bike rentals shows a slight overall increase from 2014 until a dramatic spike in 2020 (likely due to the pandemic), followed by a subsequent leveling off and possible slight decline in the most recent years (2023-2024), though usage generally remains higher than pre-2020 levels.
Now we will see how rentals differ by months and by season. Fist, we plot a histogram of the bikes rented from January 1, 2014 onwards. Secondly, we plot histograms faceted by seasom_name. After this, we plot histograms faceted by month name, and the last graph is the same as the third one, but the histograms are represented in 4 rows. Having done this, we plot density plots for every result found in the previous graphs. After the density plots, we represent with a boxplot graph bikes hired for each month, then focusing on the relationship between bikes hired and temperature, and bikes hired and temperature divided by season.
bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  theme_bw()+
  NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~season_name)+
  theme_bw()+
  NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name)+
  theme_bw()+
  NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density()+
  theme_bw()+
  NULLbike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  theme_bw()+
  NULLbike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  labs(
    title = "Bike rentals by season",
    x = "Season",
    y = "Number of bikes hired"
  ) +
  aes(x = bikes_hired)+
  geom_histogram(aes(fill=season_name), alpha = 0.5)+
  facet_wrap(~season_name, nrow = 4)+
  theme_minimal()+
  
  theme(legend.position = "none")+
  NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.bike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x = bikes_hired)+
  geom_density(aes(fill=season_name), alpha = 0.3)+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  theme(legend.position="none")+
  NULLbike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=month_name, y= bikes_hired)+
  geom_boxplot()+
  theme_bw()+
  NULLbike %>% 
  
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired)+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL## `geom_smooth()` using formula = 'y ~ x'bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=temp, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL## `geom_smooth()` using formula = 'y ~ x'bike %>% 
  filter(date >= lubridate::ymd("2014-01-01")) %>%  
  ggplot()+
  aes(x=humidity, y= bikes_hired,
       colour=season_name)+
  geom_point(alpha = 0.1)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL## `geom_smooth()` using formula = 'y ~ x'Summary: 1. Seasonal and monthly patterns The distribution plots (histograms, density plots, and boxplots) consistently show that the number of bikes hired increases significantly from winter to summer, peaking in July/August and dropping in late autumn/winter. - Winter (Dec-Feb): The distribution is centered at the lowest average daily rental count (around 20,000 to 25,000), with a narrow spread. - Summer (Jun-Aug): The distribution shifts to the highest average daily rental count (peaking near 40,000 in July), is wider, and includes the highest overall rental days (outliers up to ≈70,000). - Spring (Mar-May) & Autumn (Sep-Nov): These seasons represent transition periods. Spring shows a continuous increase in the mean rental count, and Autumn shows a decrease, with October still performing relatively well compared to November and December. Boxplots by Month: The boxplot visually confirms this progression, showing a monotonic increase in the median and interquartile range from January to July, and then a consistent decrease towards December. 2. Weather dependence (temperature and humidity) - A strong positive correlation with temperature- As the temperature rises, the number of rentals generally increases. The relationship is clearest and strongest in the Spring, Summer, and Autumn, whereas Winter shows a flatter, less temperature-sensitive trend at lower average rental counts. - A negative correlation with humidity across all seasons-As humidity increases, the number of rentals generally decreases across all seasons. Summer rentals are the highest overall, but still show this negative slope, suggesting that drier days, even in summer, are preferred for cycling.
Now, we will work with the correlation of weather variables. We have many weather variables (\(temp, feelslike, humidity, windspeed\), etc.).
Some of them are likely to be highly correlated. We’ll use \(ggpairs\) to visualize the relationships between our target variable \(bikeshired\) and key numeric predictors.
bike_numeric <- bike %>% 
  select(bikes_hired, feelslike, windspeed, humidity, precip, solarradiation, cloudcover)
ggpairs(bike_numeric) +
  theme_minimal()
Summary: 1. Correlation with bikes hired
(target variable) The strongest positive correlation with the
number of bikes hired is feelslike (Corr: 0.604): As the perceived
temperature increases, the number of bikes hired significantly
increases. Another strong positive correlation with the number of bikes
hired is solarradiation (Corr: 0.489): Higher sun intensity is
associated with more rentals. The strongest negative correlatio with the
number of bikes hired is humidity (Corr: -0.502): As humidity increases,
the number of bikes hired decreases. Weak negative correlations with the
number of bikes hired are windspeed, precipitation, and cloudcover
(Corr: -0.244, -0.216, and -0.297, respectively), suggesting that high
wind, rain, or overcast conditions slightly deter cycling. 2.
Inter-correlations among predictors Several predictor variables
are highly correlated with each other, which is important for model
building (check potential multicollinearity). · humidity is strongly
correlated with solarradiation (Corr: -0.516) , cloudcover (Corr: 0.479)
and feelslike (Corr: -0.444): Sunny days tend to have lower humidity;
More cloud cover is associated with higher humidity; Higher perceived
temperatures are associated with lower humidity. · feelslike is strongly
correlated with solarradiation (Corr: 0.505): Warmer days are often
sunnier.
In summary, the best conditions for cycling (high rentals) are warm, sunny, and dry, while humidity is the strongest weather-related deterrent after temperature.
We use a two-sample t-test to determine if the mean number of bikes hired differs between weekend and weekdays.
## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.929, df = 2300.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3739.295 4964.730
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27571.74            23219.72Summary: - The t-test result shows a p-value of < 2.2e-16, which is small (< 0.05). It means we reject the null hypothesis that the mean number of bikes hired is the same for weekends and weekdays. - The 95% CI for the difference in means is (3739.3, 4964.7). Since this interval is positive, it confirms that the average number of bikes hired on weekdays (FALSE) is higher than on weekends (TRUE).
bikes_hiredWe need to build a multiple linear regression model to identify the key factors driving bike rentals. We will build a series of models, such as a simple intercept-only model and a more complex multiple regression model. Then, we will create a confidence interval for mean bikes_hired, also including SE. Finally, we will plot actual data vs. predicted data.
| min | Q1 | median | Q3 | max | mean | sd | n | missing | 
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.61e+04 | 3.27e+04 | 7.31e+04 | 2.63e+04 | 9.5e+03 | 5268 | 0 | 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26327.6      130.8   201.2   <2e-16 ***
## 
## Residual standard error: 9497 on 5267 degrees of freedombroom::augment(model0) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 0: mean prediction vs actual data",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()
Summary: The overall mean of bikes_hired is
26327.6.
bikes_hired:bikes_hired on mean_tempWe now add the daily average feelslike temperature as a single predictor. First, we must define the model; then we look at the estimated model and finally plot the actual data vs. the predicted data.
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17647.09     189.13   93.31   <2e-16 ***
## feelslike     860.43      15.64   55.01   <2e-16 ***
## 
## Residual standard error: 7569 on 5266 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3648 
## F-statistic:  3027 on 1 and 5266 DF,  p-value: < 2.2e-16broom::augment(model1) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 1: prediction with 'Feels Like' temperature",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()
Summary:
mean_temp significant?
Why? Yes, the effect is highly significant. The p-value for
feelslike is < 2e-16, which is much smaller than the
common significance level of 0.05. The coefficient of 860.43 means that
for every 1-degree Celsius increase in feelslike temperature, the model
predicts an increase of approximately 860.43 bike rentals.bikes_hired on mean_temp plus
weekendWe now add the weekend indicator to model1.We follow the same steps that we used in Model 1 in order to obtain the following model and analyze it.
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18887.79     193.90   97.41   <2e-16 ***
## feelslike     858.98      15.12   56.81   <2e-16 ***
## weekendTRUE -4288.88     223.12  -19.22   <2e-16 ***
## 
## Residual standard error: 7317 on 5265 degrees of freedom
## Multiple R-squared:  0.4066, Adjusted R-squared:  0.4064 
## F-statistic:  1804 on 2 and 5265 DF,  p-value: < 2.2e-16broom::augment(model2) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 2: prediction with temperature and weekend status",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()
Summary: - Is the effect of 
mean_temp
significant? Why? Yes, it remains highly significant (p-value
< 2e-16). p-value < 0.05. This allows us to reject the
null hypothesis that the true coefficient is zero, confirming that
perceived temperature and whether the day is weekend or not are reliable
predictors of bike rentals.
What proportion of the overall variability does this model explain? Adjusted \(R^2\) = 0.4064. This model explains 40.64% of the variability in daily bike rentals.
What is the meaning of the effect (slope) of
weekendTRUE? What % of the variability of bikes_hired does
your model explain? The slope for weekendTRUE is
-4288.88. There is a large drop-off in commuter traffic on
weekends.It means that, holding temperature constant, the model predicts
a decrease of 4,288.88 fewer bikes hired on a weekend day compared to a
weekday (the baseline group). This drop implies that the TfL bike system
is heavily used for weekday commuting.
bikes_hired on mean_temp plus
wdayWe now use the full wday (day of the week) categorical variable instead of the simplified weekend dummy. Follow the same steps used in the first two Models.
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18339.21     304.81  60.165  < 2e-16 ***
## feelslike     858.34      15.01  57.185  < 2e-16 ***
## wdayMon      -905.19     374.33  -2.418 0.015634 *  
## wdaySat     -2730.52     374.33  -7.294 3.45e-13 ***
## wdaySun     -4737.26     374.33 -12.655  < 2e-16 ***
## wdayThu      1246.66     374.46   3.329 0.000877 ***
## wdayTue      1142.63     374.46   3.051 0.002289 ** 
## wdayWed      1293.74     374.46   3.455 0.000555 ***
## 
## Residual standard error: 7263 on 5260 degrees of freedom
## Multiple R-squared:  0.4158, Adjusted R-squared:  0.4151 
## F-statistic: 534.9 on 7 and 5260 DF,  p-value: < 2.2e-16broom::augment(model3) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: prediction with temperature and specific day of week",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()Summary 1. What is the meaning of the effect
(slope) of, e.g., wdayMon? Let’s assume Sunday
(Sun) is the baseline. Then, the coefficient for wdayMon is
-905.19. It means that holding the ‘feels like’ temperature constant,
the model predicts that bike rentals on a Monday are 905 fewer than on a
Friday. Holding the “feels like” temperature constant, bike rentals on a
Monday are 7,648.7 > Sunday.
In Model 4 we extend the regression framework beyond temperature and weekday-effects by introducing humidity and a full month factor. Follow the same steps as the first three Models.
model4 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month), data = bike)
msummary(model4)##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     40635.74    1017.88  39.922  < 2e-16 ***
## feelslike         471.90      25.43  18.555  < 2e-16 ***
## wdayMon          -936.60     342.57  -2.734 0.006278 ** 
## wdaySat         -2658.12     342.59  -7.759 1.02e-14 ***
## wdaySun         -4825.17     342.59 -14.085  < 2e-16 ***
## wdayThu          1312.82     342.69   3.831 0.000129 ***
## wdayTue          1280.10     342.72   3.735 0.000190 ***
## wdayWed          1428.85     342.72   4.169 3.11e-05 ***
## humidity         -263.48      10.84 -24.301  < 2e-16 ***
## factor(month)2    477.51     463.33   1.031 0.302781    
## factor(month)3   1343.15     458.44   2.930 0.003406 ** 
## factor(month)4   1640.36     486.66   3.371 0.000755 ***
## factor(month)5   3906.79     522.19   7.482 8.56e-14 ***
## factor(month)6   5229.30     569.08   9.189  < 2e-16 ***
## factor(month)7   4950.94     600.98   8.238  < 2e-16 ***
## factor(month)8   2288.17     587.10   3.897 9.84e-05 ***
## factor(month)9   3809.46     546.91   6.965 3.68e-12 ***
## factor(month)10  4138.80     499.60   8.284  < 2e-16 ***
## factor(month)11  3226.77     459.63   7.020 2.50e-12 ***
## factor(month)12 -1631.01     446.03  -3.657 0.000258 ***
## 
## Residual standard error: 6647 on 5248 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5101 
## F-statistic: 289.7 on 19 and 5248 DF,  p-value: < 2.2e-16broom::augment(model4) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 4: prediction with Temp, Day, Humidity, and Month",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()Summary 1. The Adjusted \(R^2\) of model 4 is 0.5101. This model approximately explains 51.01% of the total variability in daily bike rentals. 2. The Residual Standard Error (RSE) has been reduced to 6,647 bikes, meaning the average prediction error is lower. 3. Perceived temperature (feelslike) and humidity are both highly significant predictors. Their p-values are both < 2e-16, which are much smaller than the common significance level of 0.05. 4. Day of week (wday): Confirms the strong commuting pattern. Using Friday as the baseline, rentals are higher on Tuesday, Wednesday, and Thursday (Wednesday is the highest: 1428.85 more than Friday), and lower on Monday, Saturday and Sunday (Sunday is the lowest: 4,825.17 fewer than Friday). 5. Month (factor(month)): Captures the seasonal component that isn’t fully explained by temperature. Using January (month 1) as the baseline: Peak Rentals are predicted for June (5,229.30 more than January), followed closely by July and October. December (1,631.01 fewer than January) is the only month with significantly lower rentals than January, confirming it as the least popular cycling month.
Model 5 builds on the previous model by adding precipitation (rainfall) as a new weather variable.
model5 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month) + precip, data = bike)
msummary(model5)##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     38021.41    1012.58  37.549  < 2e-16 ***
## feelslike         465.73      24.92  18.690  < 2e-16 ***
## wdayMon          -937.69     335.61  -2.794 0.005226 ** 
## wdaySat         -2684.82     335.63  -7.999 1.53e-15 ***
## wdaySun         -4756.45     335.66 -14.171  < 2e-16 ***
## wdayThu          1361.61     335.74   4.056 5.07e-05 ***
## wdayTue          1340.79     335.78   3.993 6.61e-05 ***
## wdayWed          1478.14     335.77   4.402 1.09e-05 ***
## humidity         -227.51      10.89 -20.883  < 2e-16 ***
## factor(month)2    826.98     454.53   1.819 0.068903 .  
## factor(month)3   1473.67     449.21   3.281 0.001043 ** 
## factor(month)4   1906.95     477.11   3.997 6.51e-05 ***
## factor(month)5   4312.94     512.31   8.419  < 2e-16 ***
## factor(month)6   5702.88     558.43  10.212  < 2e-16 ***
## factor(month)7   5520.45     590.01   9.357  < 2e-16 ***
## factor(month)8   2851.71     576.41   4.947 7.76e-07 ***
## factor(month)9   4101.46     536.15   7.650 2.38e-14 ***
## factor(month)10  4338.59     489.63   8.861  < 2e-16 ***
## factor(month)11  3279.97     450.31   7.284 3.73e-13 ***
## factor(month)12 -1645.02     436.96  -3.765 0.000169 ***
## precip           -290.37      19.53 -14.865  < 2e-16 ***
## 
## Residual standard error: 6512 on 5247 degrees of freedom
## Multiple R-squared:  0.5316, Adjusted R-squared:  0.5298 
## F-statistic: 297.8 on 20 and 5247 DF,  p-value: < 2.2e-16broom::augment(model5) %>% 
  mutate(day=row_number()) %>% 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 5: prediction with all major predictors",
       y = "Bikes hired",
       x = "Day number (time)") +
  theme_bw()
As expected, precipitation has a significant negative effect on the
number of bikes hired. Rainy days see a noticeable drop in rentals. The
adjusted R² rises again, showing that including rainfall improves the
model’s ability to explain daily variation in hires. In short, Model 5
provides the best overall fit, confirming that temperature boosts
demand, while rain and humidity reduce it, and calendar effects (season
and weekday) still play a key role.
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
bikes_hired? Yes, absolutely. The correlation
matrix and the progression of the models confirm that humidity and
precipitation (precip) are highly significant and useful predictors for
bike rentals, in addition to the base temperature (feelslike).model1 (temp only) to model3 (temp + specific
day of the week, wday) to model5 (adding
factor(month)).
wday variable improved the model by differentiating
weekdays.factor(month) variable in model4 and
model5 shows seasonality that is not fully explained by
temperature alone.In the code below, first we plot the actual vs. predicted data, then we plot the fitted line and residuals.
my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())
ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  labs(title = "Model 3: actual rentals vs fitted predictions",
       subtitle = "Model 3: bikes_hired ~ feelslike + wday",
       x = "Day number (time index)",
       y = "Number of bikes hired") +
  theme_bw()Summary: The plot shows that the predicted values perfectly shows the general seasonal fluctuation and low-use days (weeekends, winter) with its residual SE of 6620. While model3 is better than a simple mean, it still leaves a lot of daily variation explained.
For this analysis, we are interested in 3 plots: residuals vs. fitted (for linearity and homoscedasticity); Normal Q-Q (for normality); Scale-location (for homoscedasticity). In order to do this analysis we must use the autoplot function. To check for multicollinearity we check the value of the VIF, and see if it is above or below the threshold value of 10.
# Residual Analysis
# We are interested in the first 3 plots:
# 1: Residuals vs fitted (for linearity and homoscedasticity)
# 2: Normal Q-Q (for normality)
# 3: Scale-location (for homoscedasticity)
model2 %>% 
  autoplot(which = 1:3) +
  theme_bw()## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.## feelslike   weekend 
##  1.000025  1.000025##               GVIF Df GVIF^(1/(2*Df))
## feelslike 1.000065  1        1.000032
## wday      1.000065  6        1.000005##                   GVIF Df GVIF^(1/(2*Df))
## feelslike     3.428223  1        1.851546
## wday          1.001367  6        1.000114
## humidity      1.515010  1        1.230858
## factor(month) 4.029582 11        1.065398Summary: The linearity assumption holds. The spread of the residuals are constant (the homoscedasticity assumption is met) and normally distributed. All VIF values are below 10. Our predictor variables are not highly correlated, and multicollinearity is not an issue here.
Now we create a summary table, in order to compare the models examined in this analysis.
huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')| (1) | (2) | (3) | (4) | (5) | |
|---|---|---|---|---|---|
| (Intercept) | 26327.597 *** | 17647.089 *** | 18887.785 *** | 18339.214 *** | 40635.741 *** | 
| (130.846) | (189.133) | (193.902) | (304.814) | (1017.876) | |
| feelslike | 860.427 *** | 858.979 *** | 858.340 *** | 471.901 *** | |
| (15.640) | (15.120) | (15.010) | (25.432) | ||
| weekendTRUE | -4288.877 *** | ||||
| (223.121) | |||||
| wdayMon | -905.186 * | -936.602 ** | |||
| (374.331) | (342.574) | ||||
| wdaySat | -2730.525 *** | -2658.124 *** | |||
| (374.331) | (342.585) | ||||
| wdaySun | -4737.260 *** | -4825.175 *** | |||
| (374.331) | (342.586) | ||||
| wdayThu | 1246.662 *** | 1312.820 *** | |||
| (374.456) | (342.690) | ||||
| wdayTue | 1142.627 ** | 1280.100 *** | |||
| (374.456) | (342.724) | ||||
| wdayWed | 1293.737 *** | 1428.854 *** | |||
| (374.459) | (342.722) | ||||
| humidity | -263.483 *** | ||||
| (10.842) | |||||
| factor(month)2 | 477.507 | ||||
| (463.333) | |||||
| factor(month)3 | 1343.150 ** | ||||
| (458.442) | |||||
| factor(month)4 | 1640.364 *** | ||||
| (486.658) | |||||
| factor(month)5 | 3906.791 *** | ||||
| (522.191) | |||||
| factor(month)6 | 5229.300 *** | ||||
| (569.081) | |||||
| factor(month)7 | 4950.945 *** | ||||
| (600.980) | |||||
| factor(month)8 | 2288.165 *** | ||||
| (587.096) | |||||
| factor(month)9 | 3809.465 *** | ||||
| (546.908) | |||||
| factor(month)10 | 4138.800 *** | ||||
| (499.598) | |||||
| factor(month)11 | 3226.766 *** | ||||
| (459.634) | |||||
| factor(month)12 | -1631.005 *** | ||||
| (446.025) | |||||
| #observations | 5268 | 5268 | 5268 | 5268 | 5268 | 
| R squared | 0.000 | 0.365 | 0.407 | 0.416 | 0.512 | 
| Adj. R Squared | 0.000 | 0.365 | 0.406 | 0.415 | 0.510 | 
| Residual SE | 9496.916 | 7568.703 | 7317.024 | 7263.360 | 6646.989 | 
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
Summary: The Adjusted \(R^2\) consistently increased, and the Residual SE is decreased. Model 5 is clearly the best model as it explains the most variance (41.5%) with the lowest error.