We will analyze data from TfL to understand the usage of the London bike sharing scheme.
This R code does two main things — loads the data and creates a new variable to mark weekends. This makes it easier to compare or model bike hires between weekdays and weekends.
bike <- read_csv(here::here("Data", "london_bikes.csv")) %>%
mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE))The raw data requires cleaning to be useful for analysis. We will convert character variables to factors and engineer new date-based features using lubridate. We will also generate season_name variable and provide a comprehensive overview of each variable using the skim function.
bike <- read_csv(here::here("Data", "london_bikes.csv")) %>%
mutate(weekend = if_else((wday == "Sat" | wday == "Sun"), TRUE, FALSE)) %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date),
month_name = lubridate::month(date, label = TRUE),
day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE)
) %>%
mutate(
season_name = case_when(
month_name %in% c("Dec", "Jan", "Feb") ~ "Winter",
month_name %in% c("Mar", "Apr", "May") ~ "Spring",
month_name %in% c("Jun", "Jul", "Aug") ~ "Summer",
month_name %in% c("Sep", "Oct", "Nov") ~ "Autumn",
),
season_name = factor(season_name,
levels = c("Winter", "Spring", "Summer", "Autumn"))
)## Rows: 5268 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): preciptype, conditions, description, icon, month_name, day_of_wee...
## dbl (27): bikes_hired, tempmax, tempmin, temp, feelslikemax, feelslikemin, ...
## lgl (2): severerisk, weekend
## dttm (3): date, sunrise, sunset
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| Name | bike |
| Number of rows | 5268 |
| Number of columns | 41 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| factor | 3 |
| logical | 2 |
| numeric | 27 |
| POSIXct | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| preciptype | 1411 | 0.73 | 4 | 9 | 0 | 3 | 0 |
| conditions | 0 | 1.00 | 4 | 28 | 0 | 11 | 0 |
| description | 0 | 1.00 | 26 | 85 | 0 | 78 | 0 |
| icon | 0 | 1.00 | 4 | 17 | 0 | 6 | 0 |
| wday | 0 | 1.00 | 3 | 3 | 0 | 7 | 0 |
| set | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month_name | 0 | 1 | TRUE | 12 | Aug: 465, Oct: 465, Dec: 464, Sep: 450 |
| day_of_week | 0 | 1 | TRUE | 7 | Sun: 753, Mon: 753, Fri: 753, Sat: 753 |
| season_name | 0 | 1 | FALSE | 4 | Aut: 1365, Sum: 1321, Win: 1294, Spr: 1288 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| severerisk | 5268 | 0 | NaN | : |
| weekend | 0 | 1 | 0.29 | FAL: 3762, TRU: 1506 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1.00 | 26327.60 | 9496.92 | 0.0 | 19746.75 | 26070.00 | 32745.00 | 73094.00 | ▂▇▅▁▁ |
| tempmax | 0 | 1.00 | 14.73 | 6.36 | -2.6 | 10.10 | 14.40 | 19.50 | 36.10 | ▁▇▇▃▁ |
| tempmin | 0 | 1.00 | 7.84 | 5.04 | -11.1 | 4.10 | 8.00 | 11.60 | 22.10 | ▁▃▇▇▁ |
| temp | 0 | 1.00 | 11.23 | 5.48 | -5.9 | 7.20 | 11.10 | 15.50 | 28.30 | ▁▅▇▅▁ |
| feelslikemax | 0 | 1.00 | 14.09 | 7.24 | -7.2 | 10.10 | 14.40 | 19.50 | 38.20 | ▁▅▇▃▁ |
| feelslikemin | 0 | 1.00 | 6.18 | 6.38 | -14.6 | 1.17 | 6.00 | 11.60 | 22.10 | ▁▃▇▇▂ |
| feelslike | 0 | 1.00 | 10.09 | 6.67 | -9.1 | 4.90 | 10.60 | 15.50 | 28.90 | ▁▆▇▇▁ |
| dew | 0 | 1.00 | 7.41 | 4.67 | -8.0 | 4.10 | 7.60 | 11.00 | 19.00 | ▁▃▇▇▂ |
| humidity | 0 | 1.00 | 79.28 | 10.40 | 39.6 | 72.20 | 80.30 | 87.30 | 99.90 | ▁▂▅▇▅ |
| precip | 0 | 1.00 | 1.64 | 4.73 | 0.0 | 0.00 | 0.18 | 1.34 | 168.20 | ▇▁▁▁▁ |
| precipprob | 0 | 1.00 | 73.04 | 44.38 | 0.0 | 0.00 | 100.00 | 100.00 | 100.00 | ▃▁▁▁▇ |
| precipcover | 0 | 1.00 | 8.96 | 11.60 | 0.0 | 0.00 | 8.33 | 12.50 | 100.00 | ▇▁▁▁▁ |
| snow | 0 | 1.00 | 0.02 | 0.25 | 0.0 | 0.00 | 0.00 | 0.00 | 8.40 | ▇▁▁▁▁ |
| snowdepth | 0 | 1.00 | 0.05 | 0.55 | 0.0 | 0.00 | 0.00 | 0.00 | 18.10 | ▇▁▁▁▁ |
| windgust | 1126 | 0.79 | 42.08 | 14.84 | 9.4 | 30.72 | 40.70 | 51.80 | 120.90 | ▅▇▃▁▁ |
| windspeed | 0 | 1.00 | 22.95 | 8.42 | 5.1 | 16.88 | 21.80 | 27.60 | 72.20 | ▅▇▂▁▁ |
| winddir | 0 | 1.00 | 196.97 | 92.33 | 0.1 | 127.75 | 221.60 | 259.00 | 359.80 | ▃▂▃▇▃ |
| sealevelpressure | 0 | 1.00 | 1014.90 | 10.52 | 966.6 | 1008.70 | 1015.70 | 1022.00 | 1047.80 | ▁▁▇▇▁ |
| cloudcover | 0 | 1.00 | 60.86 | 22.45 | 0.0 | 46.60 | 62.70 | 77.80 | 100.00 | ▁▃▇▇▅ |
| visibility | 0 | 1.00 | 22.65 | 10.62 | 0.2 | 15.20 | 22.00 | 28.90 | 62.50 | ▃▇▆▁▁ |
| solarradiation | 0 | 1.00 | 112.00 | 84.89 | 0.0 | 40.20 | 91.50 | 169.77 | 358.60 | ▇▅▃▂▁ |
| solarenergy | 0 | 1.00 | 9.66 | 7.34 | 0.0 | 3.50 | 7.90 | 14.60 | 31.00 | ▇▅▃▂▁ |
| uvindex | 0 | 1.00 | 4.26 | 2.55 | 0.0 | 2.00 | 4.00 | 6.00 | 10.00 | ▇▆▅▅▁ |
| moonphase | 0 | 1.00 | 0.48 | 0.29 | 0.0 | 0.25 | 0.50 | 0.75 | 0.98 | ▇▇▇▇▇ |
| month | 0 | 1.00 | 6.62 | 3.46 | 1.0 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▅▅▇ |
| week | 0 | 1.00 | 27.01 | 15.08 | 1.0 | 14.00 | 27.00 | 40.00 | 53.00 | ▇▇▇▇▇ |
| year | 0 | 1.00 | 2017.28 | 4.17 | 2010.0 | 2014.00 | 2017.00 | 2021.00 | 2024.00 | ▆▇▇▇▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2010-07-30 00:00:00 | 2024-12-30 00:00:00 | 2017-10-14 12:00:00 | 5268 |
| sunrise | 0 | 1 | 2010-07-30 05:17:39 | 2024-12-30 08:06:26 | 2017-10-14 19:21:46 | 5268 |
| sunset | 0 | 1 | 2010-07-30 20:50:39 | 2024-12-30 16:00:26 | 2017-10-15 06:05:26 | 5268 |
We can now calculate summary statistics for our target variable, bikes_hired.
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.61e+04 | 3.27e+04 | 7.31e+04 | 2.63e+04 | 9.5e+03 | 5268 | 0 |
We can also group these summary statistics by the categorical variables we created.
| year | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 2010 | 2.76e+03 | 9.3e+03 | 1.4e+04 | 1.87e+04 | 2.8e+04 | 1.41e+04 | 5.62e+03 | 155 | 0 |
| 2011 | 4.56e+03 | 1.63e+04 | 2.03e+04 | 2.37e+04 | 2.94e+04 | 1.96e+04 | 5.5e+03 | 365 | 0 |
| 2012 | 3.53e+03 | 1.93e+04 | 2.62e+04 | 3.25e+04 | 4.71e+04 | 2.6e+04 | 9.43e+03 | 366 | 0 |
| 2013 | 3.73e+03 | 1.76e+04 | 2.2e+04 | 2.74e+04 | 3.56e+04 | 2.2e+04 | 7.28e+03 | 365 | 0 |
| 2014 | 4.33e+03 | 2.05e+04 | 2.77e+04 | 3.44e+04 | 4.9e+04 | 2.75e+04 | 9.07e+03 | 365 | 0 |
| 2015 | 5.78e+03 | 2.21e+04 | 2.66e+04 | 3.29e+04 | 7.31e+04 | 2.7e+04 | 8.55e+03 | 365 | 0 |
| 2016 | 4.89e+03 | 2.24e+04 | 2.79e+04 | 3.51e+04 | 4.66e+04 | 2.82e+04 | 8.85e+03 | 366 | 0 |
| 2017 | 5.14e+03 | 2.41e+04 | 2.95e+04 | 3.45e+04 | 4.6e+04 | 2.86e+04 | 8.38e+03 | 365 | 0 |
| 2018 | 5.86e+03 | 2.18e+04 | 2.92e+04 | 3.77e+04 | 4.61e+04 | 2.9e+04 | 1.02e+04 | 365 | 0 |
| 2019 | 5.65e+03 | 2.42e+04 | 2.89e+04 | 3.45e+04 | 4.47e+04 | 2.86e+04 | 8.09e+03 | 365 | 0 |
| 2020 | 4.87e+03 | 2.01e+04 | 2.75e+04 | 3.65e+04 | 7.02e+04 | 2.85e+04 | 1.16e+04 | 366 | 0 |
| 2021 | 6.25e+03 | 2.17e+04 | 3.1e+04 | 3.82e+04 | 5.69e+04 | 3e+04 | 1.1e+04 | 365 | 0 |
| 2022 | 0 | 2.51e+04 | 3.14e+04 | 4e+04 | 6.7e+04 | 3.15e+04 | 1.03e+04 | 365 | 0 |
| 2023 | 6.72e+03 | 1.99e+04 | 2.37e+04 | 2.78e+04 | 3.53e+04 | 2.34e+04 | 6.03e+03 | 365 | 0 |
| 2024 | 7.48e+03 | 2.04e+04 | 2.48e+04 | 2.87e+04 | 3.52e+04 | 2.4e+04 | 6.14e+03 | 365 | 0 |
| day_of_week | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Sunday | 0 | 1.41e+04 | 2.1e+04 | 2.97e+04 | 6.31e+04 | 2.22e+04 | 1.05e+04 | 753 | 0 |
| Monday | 3.97e+03 | 2.06e+04 | 2.57e+04 | 3.15e+04 | 6.7e+04 | 2.61e+04 | 8.28e+03 | 753 | 0 |
| Tuesday | 3.76e+03 | 2.25e+04 | 2.79e+04 | 3.44e+04 | 6.53e+04 | 2.82e+04 | 8.62e+03 | 752 | 0 |
| Wednesday | 4.33e+03 | 2.28e+04 | 2.8e+04 | 3.42e+04 | 5.44e+04 | 2.84e+04 | 8.52e+03 | 752 | 0 |
| Thursday | 5.65e+03 | 2.27e+04 | 2.75e+04 | 3.44e+04 | 7.31e+04 | 2.83e+04 | 8.79e+03 | 752 | 0 |
| Friday | 5.4e+03 | 2.14e+04 | 2.65e+04 | 3.29e+04 | 6.7e+04 | 2.7e+04 | 8.54e+03 | 753 | 0 |
| Saturday | 0 | 1.6e+04 | 2.25e+04 | 3.08e+04 | 7.02e+04 | 2.42e+04 | 1.11e+04 | 753 | 0 |
| month_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Jan | 3.73e+03 | 1.39e+04 | 1.9e+04 | 2.33e+04 | 3.8e+04 | 1.87e+04 | 5.99e+03 | 434 | 0 |
| Feb | 3.53e+03 | 1.6e+04 | 2.06e+04 | 2.46e+04 | 5.25e+04 | 2.03e+04 | 6.37e+03 | 396 | 0 |
| Mar | 5.06e+03 | 1.76e+04 | 2.31e+04 | 2.72e+04 | 5.66e+04 | 2.27e+04 | 7.59e+03 | 434 | 0 |
| Apr | 4.87e+03 | 2.1e+04 | 2.59e+04 | 3.08e+04 | 4.9e+04 | 2.6e+04 | 7.61e+03 | 420 | 0 |
| May | 1.07e+04 | 2.45e+04 | 2.99e+04 | 3.6e+04 | 7.02e+04 | 3.04e+04 | 8.44e+03 | 434 | 0 |
| Jun | 6.06e+03 | 2.77e+04 | 3.31e+04 | 3.92e+04 | 6.53e+04 | 3.34e+04 | 8.59e+03 | 420 | 0 |
| Jul | 5.56e+03 | 2.91e+04 | 3.58e+04 | 4.13e+04 | 7.31e+04 | 3.47e+04 | 8.35e+03 | 436 | 0 |
| Aug | 4.3e+03 | 2.56e+04 | 3.24e+04 | 3.83e+04 | 6.7e+04 | 3.14e+04 | 9.6e+03 | 465 | 0 |
| Sep | 0 | 2.49e+04 | 3.12e+04 | 3.61e+04 | 5.18e+04 | 3.04e+04 | 8.11e+03 | 450 | 0 |
| Oct | 7.07e+03 | 2.33e+04 | 2.8e+04 | 3.26e+04 | 4.79e+04 | 2.75e+04 | 6.77e+03 | 465 | 0 |
| Nov | 6.03e+03 | 1.88e+04 | 2.37e+04 | 2.79e+04 | 4.47e+04 | 2.33e+04 | 6.41e+03 | 450 | 0 |
| Dec | 2.76e+03 | 1.14e+04 | 1.67e+04 | 2.29e+04 | 3.91e+04 | 1.71e+04 | 6.91e+03 | 464 | 0 |
| season_name | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Winter | 2.76e+03 | 1.37e+04 | 1.88e+04 | 2.35e+04 | 5.25e+04 | 1.86e+04 | 6.58e+03 | 1294 | 0 |
| Spring | 4.87e+03 | 2.08e+04 | 2.61e+04 | 3.13e+04 | 7.02e+04 | 2.64e+04 | 8.51e+03 | 1288 | 0 |
| Summer | 4.3e+03 | 2.76e+04 | 3.37e+04 | 3.96e+04 | 7.31e+04 | 3.31e+04 | 8.98e+03 | 1321 | 0 |
| Autumn | 0 | 2.19e+04 | 2.72e+04 | 3.26e+04 | 5.18e+04 | 2.7e+04 | 7.69e+03 | 1365 | 0 |
We will plot the number of bike rentals from January 1, 2014 onwards to observe any long-term trends or seasonal patterns.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot() +
aes(x = date, y = bikes_hired) +
geom_point(alpha = 0.4) +
labs(
title = "Daily TfL bike rentals over time",
x = "Date",
y = "Number of bikes hired"
)
# Apply a clean black and white theme
theme_bw() +
NULLSummary: The long-term trend in daily TfL bike rentals shows a slight overall increase from 2014 until a dramatic spike in 2020 (likely due to the pandemic), followed by a subsequent leveling off and possible slight decline in the most recent years (2023-2024), though usage generally remains higher than pre-2020 levels.
Now we will see how rentals differ by months and by season. Fist, we plot a histogram of the bikes rented from January 1, 2014 onwards. Secondly, we plot histograms faceted by seasom_name. After this, we plot histograms faceted by month name, and the last graph is the same as the third one, but the histograms are represented in 4 rows. Having done this, we plot density plots for every result found in the previous graphs. After the density plots, we represent with a boxplot graph bikes hired for each month, then focusing on the relationship between bikes hired and temperature, and bikes hired and temperature divided by season.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_histogram()+
theme_bw()+
NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_histogram()+
facet_wrap(~season_name)+
theme_bw()+
NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_histogram()+
facet_wrap(~month_name)+
theme_bw()+
NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_histogram()+
facet_wrap(~month_name, nrow = 4)+
theme_bw()+
NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_density()+
theme_bw()+
NULLbike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_density(aes(fill=season_name), alpha = 0.3)+
theme_bw()+
NULLbike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
labs(
title = "Bike rentals by season",
x = "Season",
y = "Number of bikes hired"
) +
aes(x = bikes_hired)+
geom_histogram(aes(fill=season_name), alpha = 0.5)+
facet_wrap(~season_name, nrow = 4)+
theme_minimal()+
theme(legend.position = "none")+
NULL## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x = bikes_hired)+
geom_density(aes(fill=season_name), alpha = 0.3)+
facet_wrap(~month_name, nrow = 4)+
theme_bw()+
theme(legend.position="none")+
NULLbike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x=month_name, y= bikes_hired)+
geom_boxplot()+
theme_bw()+
NULLbike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x=temp, y= bikes_hired)+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+
theme_bw()+
NULL## `geom_smooth()` using formula = 'y ~ x'
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x=temp, y= bikes_hired,
colour=season_name)+
geom_point(alpha = 0.1)+
geom_smooth(method = "lm", se=FALSE)+
theme_bw()+
NULL## `geom_smooth()` using formula = 'y ~ x'
bike %>%
filter(date >= lubridate::ymd("2014-01-01")) %>%
ggplot()+
aes(x=humidity, y= bikes_hired,
colour=season_name)+
geom_point(alpha = 0.1)+
geom_smooth(method = "lm", se=FALSE)+
theme_bw()+
NULL## `geom_smooth()` using formula = 'y ~ x'
Summary: - Rentals are slightly lower and less variable on weekends (Saturday/Sunday) compared to weekdays. - Rentals are highest in Summer, followed by Spring and Autumn. Winter has the lowest and least variable rental numbers, confirming our time series observation.
Now, we will work with the correlation of weather variables. We have many weather variables (\(temp, feelslike, humidity, windspeed\), etc.).
Some of them are likely to be highly correlated. We’ll use \(ggpairs\) to visualize the relationships between our target variable \(bikeshired\) and key numeric predictors.
bike_numeric <- bike %>%
select(bikes_hired, feelslike, windspeed, humidity, precip, solarradiation, cloudcover)
ggpairs(bike_numeric) +
theme_minimal()
Summary: - bikes_hired has a strong positive linear
relationship with \(feelslike\) (and
solarradiation).
bikes_hired has a negative relationship with humidity, precip (precipitation), and cloudcover.
feelslike is strongly correlated with solarradiation and negatively correlated with humidity and cloudcover. We will need to check for multicollinearity in our models.
We use a two-sample t-test to determine if the mean number of bikes hired differs between weekend and weekdays.
##
## Welch Two Sample t-test
##
## data: bikes_hired by weekend
## t = 13.929, df = 2300.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## 3739.295 4964.730
## sample estimates:
## mean in group FALSE mean in group TRUE
## 27571.74 23219.72
Summary: - The t-test result shows a p-value of < 2.2e-16, which is small (< 0.05). It means we reject the null hypothesis that the mean number of bikes hired is the same for weekends and weekdays. - The 95% CI for the difference in means is (3739.3, 4964.7). Since this interval is positive, it confirms that the average number of bikes hired on weekdays (FALSE) is higher than on weekends (TRUE).
bikes_hiredWe need to build a multiple linear regression model to identify the key factors driving bike rentals. We will build a series of models, such as a simple intercept-only model and a more complex multiple regression model. Then, we will create a confidence interval for mean bikes_hired, also including SE. Finally, we will plot actual data vs. predicted data.
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.97e+04 | 2.61e+04 | 3.27e+04 | 7.31e+04 | 2.63e+04 | 9.5e+03 | 5268 | 0 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26327.6 130.8 201.2 <2e-16 ***
##
## Residual standard error: 9497 on 5267 degrees of freedom
broom::augment(model0) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 0: mean prediction vs actual data",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()
Summary: The overall mean of bikes_hired is
26327.6.
bikes_hired:bikes_hired on mean_tempWe now add the daily average feelslike temperature as a single predictor. First, we must define the model; then we look at the estimated model and finally plot the actual data vs. the predicted data.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17647.09 189.13 93.31 <2e-16 ***
## feelslike 860.43 15.64 55.01 <2e-16 ***
##
## Residual standard error: 7569 on 5266 degrees of freedom
## Multiple R-squared: 0.365, Adjusted R-squared: 0.3648
## F-statistic: 3027 on 1 and 5266 DF, p-value: < 2.2e-16
broom::augment(model1) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 1: prediction with 'Feels Like' temperature",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()
Summary:
mean_temp significant?
Why? Yes, the effect is highly significant. The p-value for
feelslike is < 2e-16, which is much smaller than the
common significance level of 0.05. The coefficient of 860.43 means that
for every 1-degree Celsius increase in feelslike temperature, the model
predicts an increase of approximately 860.43 bike rentals.bikes_hired on mean_temp plus
weekendWe now add the weekend indicator to model1.We follow the same steps that we used in Model 1 in order to obtain the following model and analyze it.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18887.79 193.90 97.41 <2e-16 ***
## feelslike 858.98 15.12 56.81 <2e-16 ***
## weekendTRUE -4288.88 223.12 -19.22 <2e-16 ***
##
## Residual standard error: 7317 on 5265 degrees of freedom
## Multiple R-squared: 0.4066, Adjusted R-squared: 0.4064
## F-statistic: 1804 on 2 and 5265 DF, p-value: < 2.2e-16
broom::augment(model2) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 2: prediction with temperature and weekend status",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()
Summary: - Is the effect of
mean_temp
significant? Why? Yes, it remains highly significant (p-value
< 2e-16). p-value < 0.05.
What proportion of the overall variability does this model explain? Adjusted \(R^2\) = 0.420. This model explains 42% of the variability in daily bike rentals.
What is the meaning of the effect (slope) of
weekendTRUE? What % of the variability of bikes_hired does
your model explain? The slope for weekendTRUE is
-5876.3. There is a large drop-off in commuter traffic on
weekends.
bikes_hired on mean_temp plus
wdayWe now use the full wday (day of the week) categorical variable instead of the simplified weekend dummy. Follow the same steps used in the first two Models.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18339.21 304.81 60.165 < 2e-16 ***
## feelslike 858.34 15.01 57.185 < 2e-16 ***
## wdayMon -905.19 374.33 -2.418 0.015634 *
## wdaySat -2730.52 374.33 -7.294 3.45e-13 ***
## wdaySun -4737.26 374.33 -12.655 < 2e-16 ***
## wdayThu 1246.66 374.46 3.329 0.000877 ***
## wdayTue 1142.63 374.46 3.051 0.002289 **
## wdayWed 1293.74 374.46 3.455 0.000555 ***
##
## Residual standard error: 7263 on 5260 degrees of freedom
## Multiple R-squared: 0.4158, Adjusted R-squared: 0.4151
## F-statistic: 534.9 on 7 and 5260 DF, p-value: < 2.2e-16
broom::augment(model3) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 3: prediction with temperature and specific day of week",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()Summary 1. What is the meaning of the effect
(slope) of, e.g., wdayMon? Let’s assume Sunday
(Sun) is the baseline. Then, the coefficient for wdayMon is
7648.7. Holding the “feels like” temperature constant, bike rentals on a
Monday are 7,648.7 > Sunday.
In Model 4 we extend the regression framework beyond temperature and weekday-effects by introducing humidity and a full month factor. Follow the same steps as the first three Models.
model4 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month), data = bike)
msummary(model4)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40635.74 1017.88 39.922 < 2e-16 ***
## feelslike 471.90 25.43 18.555 < 2e-16 ***
## wdayMon -936.60 342.57 -2.734 0.006278 **
## wdaySat -2658.12 342.59 -7.759 1.02e-14 ***
## wdaySun -4825.17 342.59 -14.085 < 2e-16 ***
## wdayThu 1312.82 342.69 3.831 0.000129 ***
## wdayTue 1280.10 342.72 3.735 0.000190 ***
## wdayWed 1428.85 342.72 4.169 3.11e-05 ***
## humidity -263.48 10.84 -24.301 < 2e-16 ***
## factor(month)2 477.51 463.33 1.031 0.302781
## factor(month)3 1343.15 458.44 2.930 0.003406 **
## factor(month)4 1640.36 486.66 3.371 0.000755 ***
## factor(month)5 3906.79 522.19 7.482 8.56e-14 ***
## factor(month)6 5229.30 569.08 9.189 < 2e-16 ***
## factor(month)7 4950.94 600.98 8.238 < 2e-16 ***
## factor(month)8 2288.17 587.10 3.897 9.84e-05 ***
## factor(month)9 3809.46 546.91 6.965 3.68e-12 ***
## factor(month)10 4138.80 499.60 8.284 < 2e-16 ***
## factor(month)11 3226.77 459.63 7.020 2.50e-12 ***
## factor(month)12 -1631.01 446.03 -3.657 0.000258 ***
##
## Residual standard error: 6647 on 5248 degrees of freedom
## Multiple R-squared: 0.5119, Adjusted R-squared: 0.5101
## F-statistic: 289.7 on 19 and 5248 DF, p-value: < 2.2e-16
broom::augment(model4) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 4: prediction with Temp, Day, Humidity, and Month",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()Model 5 builds on the previous model by adding precipitation (rainfall) as a new weather variable.
model5 <- lm(bikes_hired ~ feelslike + wday + humidity + factor(month) + precip, data = bike)
msummary(model5)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38021.41 1012.58 37.549 < 2e-16 ***
## feelslike 465.73 24.92 18.690 < 2e-16 ***
## wdayMon -937.69 335.61 -2.794 0.005226 **
## wdaySat -2684.82 335.63 -7.999 1.53e-15 ***
## wdaySun -4756.45 335.66 -14.171 < 2e-16 ***
## wdayThu 1361.61 335.74 4.056 5.07e-05 ***
## wdayTue 1340.79 335.78 3.993 6.61e-05 ***
## wdayWed 1478.14 335.77 4.402 1.09e-05 ***
## humidity -227.51 10.89 -20.883 < 2e-16 ***
## factor(month)2 826.98 454.53 1.819 0.068903 .
## factor(month)3 1473.67 449.21 3.281 0.001043 **
## factor(month)4 1906.95 477.11 3.997 6.51e-05 ***
## factor(month)5 4312.94 512.31 8.419 < 2e-16 ***
## factor(month)6 5702.88 558.43 10.212 < 2e-16 ***
## factor(month)7 5520.45 590.01 9.357 < 2e-16 ***
## factor(month)8 2851.71 576.41 4.947 7.76e-07 ***
## factor(month)9 4101.46 536.15 7.650 2.38e-14 ***
## factor(month)10 4338.59 489.63 8.861 < 2e-16 ***
## factor(month)11 3279.97 450.31 7.284 3.73e-13 ***
## factor(month)12 -1645.02 436.96 -3.765 0.000169 ***
## precip -290.37 19.53 -14.865 < 2e-16 ***
##
## Residual standard error: 6512 on 5247 degrees of freedom
## Multiple R-squared: 0.5316, Adjusted R-squared: 0.5298
## F-statistic: 297.8 on 20 and 5247 DF, p-value: < 2.2e-16
broom::augment(model5) %>%
mutate(day=row_number()) %>%
ggplot()+
aes(x=day, y = bikes_hired)+
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 5: prediction with all major predictors",
y = "Bikes hired",
x = "Day number (time)") +
theme_bw()
As expected, precipitation has a significant negative effect on the
number of bikes hired. Rainy days see a noticeable drop in rentals. The
adjusted R² rises again, showing that including rainfall improves the
model’s ability to explain daily variation in hires. In short, Model 5
provides the best overall fit, confirming that temperature boosts
demand, while rain and humidity reduce it, and calendar effects (season
and weekday) still play a key role.
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
bikes_hired? Yes. Our final model, model5
(humidity and precip) significantly improved
the model fit from \(R^2\) of 0.456
(Model 3) to 0.648. This indicates that weather conditions like high
humidity and rainfall are very useful and act as strong negative
predictors of bike rentals. Variables like windspeed and
cloudcover are also strong variables for next
improvement.model1 (temp only) to model3 (temp + specific
day of the week, wday) to model5 (adding
factor(month)).
wday variable improved the model by differentiating
weekdays.factor(month) variable in model4 and
model5 shows seasonality that is not fully explained by
temperature alone.In the code below, first we plot the actual vs. predicted data, then we plot the fitted line and residuals.
my_best_model <- broom::augment(model3) %>%
mutate(day=row_number())
ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
geom_point(alpha = 0.2) +
geom_point(aes(y = .fitted),
shape = 1,
colour = "red", alpha = 0.2)+
labs(title = "Model 3: actual rentals vs fitted predictions",
subtitle = "Model 3: bikes_hired ~ feelslike + wday",
x = "Day number (time index)",
y = "Number of bikes hired") +
theme_bw()Summary: The plot shows that the predicted values perfectly shows the general seasonal fluctuation and low-use days (weeekends, winter) with its residual SE of 6620. While model3 is better than a simple mean, it still leaves a lot of daily variation explained.
For this analysis, we are interested in 3 plots: residuals vs. fitted (for linearity and homoscedasticity); Normal Q-Q (for normality); Scale-location (for homoscedasticity). In order to do this analysis we must use the autoplot function. To check for multicollinearity we check the value of the VIF, and see if it is above or below the threshold value of 10.
# Residual Analysis
# We are interested in the first 3 plots:
# 1: Residuals vs fitted (for linearity and homoscedasticity)
# 2: Normal Q-Q (for normality)
# 3: Scale-location (for homoscedasticity)
model2 %>%
autoplot(which = 1:3) +
theme_bw()## Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## feelslike weekend
## 1.000025 1.000025
## GVIF Df GVIF^(1/(2*Df))
## feelslike 1.000065 1 1.000032
## wday 1.000065 6 1.000005
## GVIF Df GVIF^(1/(2*Df))
## feelslike 3.428223 1 1.851546
## wday 1.001367 6 1.000114
## humidity 1.515010 1 1.230858
## factor(month) 4.029582 11 1.065398
Summary: The linearity assumption holds. The spread of the residuals are constant (the homoscedasticity assumption is met) and normally distributed. All VIF values are below 10. Our predictor variables are not highly correlated, and multicollinearity is not an issue here.
Now we create a summary table, in order to compare the models examined in this analysis.
huxreg(model0, model1, model2, model3, model4,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05
) %>%
set_caption('Comparison of models')| (1) | (2) | (3) | (4) | (5) | |
|---|---|---|---|---|---|
| (Intercept) | 26327.597 *** | 17647.089 *** | 18887.785 *** | 18339.214 *** | 40635.741 *** |
| (130.846) | (189.133) | (193.902) | (304.814) | (1017.876) | |
| feelslike | 860.427 *** | 858.979 *** | 858.340 *** | 471.901 *** | |
| (15.640) | (15.120) | (15.010) | (25.432) | ||
| weekendTRUE | -4288.877 *** | ||||
| (223.121) | |||||
| wdayMon | -905.186 * | -936.602 ** | |||
| (374.331) | (342.574) | ||||
| wdaySat | -2730.525 *** | -2658.124 *** | |||
| (374.331) | (342.585) | ||||
| wdaySun | -4737.260 *** | -4825.175 *** | |||
| (374.331) | (342.586) | ||||
| wdayThu | 1246.662 *** | 1312.820 *** | |||
| (374.456) | (342.690) | ||||
| wdayTue | 1142.627 ** | 1280.100 *** | |||
| (374.456) | (342.724) | ||||
| wdayWed | 1293.737 *** | 1428.854 *** | |||
| (374.459) | (342.722) | ||||
| humidity | -263.483 *** | ||||
| (10.842) | |||||
| factor(month)2 | 477.507 | ||||
| (463.333) | |||||
| factor(month)3 | 1343.150 ** | ||||
| (458.442) | |||||
| factor(month)4 | 1640.364 *** | ||||
| (486.658) | |||||
| factor(month)5 | 3906.791 *** | ||||
| (522.191) | |||||
| factor(month)6 | 5229.300 *** | ||||
| (569.081) | |||||
| factor(month)7 | 4950.945 *** | ||||
| (600.980) | |||||
| factor(month)8 | 2288.165 *** | ||||
| (587.096) | |||||
| factor(month)9 | 3809.465 *** | ||||
| (546.908) | |||||
| factor(month)10 | 4138.800 *** | ||||
| (499.598) | |||||
| factor(month)11 | 3226.766 *** | ||||
| (459.634) | |||||
| factor(month)12 | -1631.005 *** | ||||
| (446.025) | |||||
| #observations | 5268 | 5268 | 5268 | 5268 | 5268 |
| R squared | 0.000 | 0.365 | 0.407 | 0.416 | 0.512 |
| Adj. R Squared | 0.000 | 0.365 | 0.406 | 0.415 | 0.510 |
| Residual SE | 9496.916 | 7568.703 | 7317.024 | 7263.360 | 6646.989 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
Summary: The Adjusted \(R^2\) consistently increased, and the Residual SE is decreased. Model 5 is clearly the best model as it explains the most variance (41.5%) with the lowest error.