In this week’s data dive, I tried to achieve the highest R-Squared value for modeling bikes rented. Previous iterations focused on up to four feature variables. I used backward elimination to maximize the effectiveness of the model, starting with all variables and eliminating based on statistical significance and impact to model performance.
library(readr)
library(tidyr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
#Import Data set
SeoulBikeData <- read.csv("C:\\Users\\jakem\\Documents\\1. Classes\\INFO H510\\SeoulBikeData.csv")
#Clean Dates
SeoulBikeData <- SeoulBikeData |>
separate(date, into = c("day", "month", "year"), sep = "/") |>
mutate(
day = as.numeric(day),
month = as.numeric(month),
year = as.numeric(year)
)
SeoulBikeData <- SeoulBikeData %>%
mutate(month_name = factor(month, levels = 1:12, labels = month.name))
SeoulBikeData <- SeoulBikeData |>
mutate(
date = as.Date(paste(year,month,day, sep = "/", collapse = NULL))
)
df <- SeoulBikeData |>
filter(functioning_day=="Yes")
df$day_of_week <- weekdays(as.Date(df$date))
#Make a categorical column for time of day
df <- df |>
mutate(hour_group = case_when(
hour >= 0 & hour < 6 ~ "Night",
hour >= 6 & hour < 12 ~ "Morning",
hour >= 12 & hour < 18 ~ "Afternoon",
hour >= 18 & hour < 24 ~ "Evening"
)) |>
mutate(hour_group = factor(hour_group, levels = c("Night", "Morning", "Afternoon", "Evening")))
weekend <- c('Saturday','Sunday')
df <- df |>
mutate(weekend = case_when(
day_of_week %in% weekend ~ 1,
!(day_of_week %in% weekend) ~ 0,
))
The data set is a time-series which creates difficulties when using linear regression. The daily and yearly demand cycle of bikes rented adds another layer of complexity when modeling. Due to this, my focus with this model is predicting the average bikes rented during the weekday peak hours between 5pm and 8pm. The practical significance is using the month and weather data to know what to expect for the highest demand period of the day, which could assist with staffing. Past data dive’s have shown that weekends have different demand patterns than weekdays, so weekends were excluded from the model.
rush_df <- df |>
filter(hour %in% c(18, 19, 20, 21) & weekend == 0) |>
group_by(date) |>
summarize(rented = mean(rented_bikes),
temp = mean(temp_c),
wind = mean(wind_ms),
visibility = mean(visibility_10m),
radiation = mean(solar_radiation_mJ_m2),
snow = sum(snowfall_cm),
rain=sum(rainfall_mm),
holiday = max(holiday),
season = first(seasons),
humidity=mean(humid_pct),
month = first(month_name))
#Scale values
rush_df <- rush_df |>
mutate(temp=scale(temp),
wind=scale(wind),
visibility = scale(visibility),
radiation = scale(radiation),
rain = scale(rain),
humidity = scale(humidity))
#Make snow binary
rush_df <- rush_df |>
mutate(snow = case_when(
snow > 0 ~ 1,
snow == 0 ~ 0,
))
#Examine relationship between temperature and rented bikes
ggplot(data=rush_df, aes(x=temp,y=rented)) +
geom_point()+
ggtitle("Relationship Between Temperature and Rentals")
model <- lm(rented ~ temp + wind + visibility + radiation + snow + rain + holiday + month + humidity, data=rush_df)
summary(model)
##
## Call:
## lm(formula = rented ~ temp + wind + visibility + radiation +
## snow + rain + holiday + month + humidity, data = rush_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1298.10 -138.30 18.15 157.80 930.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 715.73 159.29 4.493 1.11e-05 ***
## temp 435.80 75.86 5.745 2.89e-08 ***
## wind -52.79 22.99 -2.296 0.022576 *
## visibility 44.89 31.78 1.412 0.159204
## radiation 264.84 42.10 6.290 1.57e-09 ***
## snow 52.43 114.23 0.459 0.646665
## rain -147.90 23.80 -6.213 2.40e-09 ***
## holidayNo Holiday 371.37 95.05 3.907 0.000123 ***
## monthFebruary -80.56 110.10 -0.732 0.465078
## monthMarch 75.40 126.70 0.595 0.552343
## monthApril 185.42 152.17 1.219 0.224260
## monthMay 329.86 174.90 1.886 0.060548 .
## monthJune 579.33 200.02 2.896 0.004139 **
## monthJuly 66.11 231.05 0.286 0.775032
## monthAugust 10.36 233.51 0.044 0.964646
## monthSeptember 626.66 202.27 3.098 0.002189 **
## monthOctober 770.62 154.15 4.999 1.14e-06 ***
## monthNovember 515.81 127.42 4.048 7.04e-05 ***
## monthDecember 48.72 100.74 0.484 0.629136
## humidity -131.15 33.48 -3.917 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 326.9 on 231 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.8491
## F-statistic: 75.03 on 19 and 231 DF, p-value: < 2.2e-16
This initial model had an R-Squared score of 0.85, meaning the model does a great job of explaining the variance in the data. However, the visibility and snow features are not statistically significant. These features were removed.
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## temp 13.464834 1 3.669446
## wind 1.236995 1 1.112203
## visibility 2.363328 1 1.537312
## radiation 4.147495 1 2.036540
## snow 1.505385 1 1.226941
## rain 1.325832 1 1.151448
## holiday 1.042377 1 1.020969
## month 40.508128 11 1.183233
## humidity 2.622726 1 1.619483
Temperature was chosen over dew point temperature because it yielded a high R-Squared value. Both could not be included in the model because of multicollinearity between them.
model <- lm(rented ~ temp + wind + radiation + rain + holiday + month + humidity, data=rush_df)
summary(model)
##
## Call:
## lm(formula = rented ~ temp + wind + radiation + rain + holiday +
## month + humidity, data = rush_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1278.70 -137.12 14.32 168.89 924.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 652.06 148.52 4.390 1.72e-05 ***
## temp 389.48 68.81 5.661 4.41e-08 ***
## wind -50.04 22.90 -2.185 0.029868 *
## radiation 270.59 41.61 6.503 4.76e-10 ***
## rain -151.82 23.64 -6.422 7.46e-10 ***
## holidayNo Holiday 378.28 94.78 3.991 8.82e-05 ***
## monthFebruary -84.66 103.81 -0.816 0.415602
## monthMarch 92.21 119.64 0.771 0.441643
## monthApril 222.38 144.87 1.535 0.126151
## monthMay 397.60 162.13 2.452 0.014930 *
## monthJune 667.97 184.25 3.625 0.000354 ***
## monthJuly 199.64 205.00 0.974 0.331147
## monthAugust 155.29 205.33 0.756 0.450240
## monthSeptember 755.81 175.74 4.301 2.51e-05 ***
## monthOctober 849.73 137.73 6.170 3.00e-09 ***
## monthNovember 538.09 121.86 4.415 1.54e-05 ***
## monthDecember 55.60 99.80 0.557 0.577951
## humidity -153.54 27.44 -5.595 6.18e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 327 on 233 degrees of freedom
## Multiple R-squared: 0.8593, Adjusted R-squared: 0.849
## F-statistic: 83.69 on 17 and 233 DF, p-value: < 2.2e-16
The model now has statistically significant features and high explanatory power. Next, I will check diagnostic plots.
par(mfrow = c(2, 3))
plot(model)
The residuals vs fitted plot shows points scattered randomly for most of the plot, but the left side shows a discernible curve which indicates the model may not be a good fit for the data.
The residuals stray from the line due to the skewness on both sides of the Q-Q plot. This indicates that the residuals are not normally distributed.
There are no discernible patterns in the scale-location plot and the points are randomly scattered around the line. This indicates the model have constant variance across the range of fitted values, called homoscedascity.
The residuals vs leverage plot reveals one highly influential point. If removed, this data point would likely significantly change the coefficients.
Based on the diagnostic plots, the model is not a good fit for the data. I believe part of this may be cause by the strange outliers in the data. There are many hours in the data with abnormally low bike rentals for no apparent reason. Instead of removing these points, I will address another issue with the data: the seasonality of bike share throughout the year. Month was used as a feature, but what if we only modeled one season at a time?
model <- lm(rented ~ temp + radiation + humidity, data=rush_df|>filter(season=='Summer'))
summary(model)
##
## Call:
## lm(formula = rented ~ temp + radiation + humidity, data = filter(rush_df,
## season == "Summer"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1339.26 -310.67 37.09 280.77 930.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2563.76 274.65 9.335 1.99e-13 ***
## temp -561.97 203.25 -2.765 0.00749 **
## radiation 256.40 74.84 3.426 0.00109 **
## humidity -453.62 96.23 -4.714 1.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 478.1 on 62 degrees of freedom
## Multiple R-squared: 0.5711, Adjusted R-squared: 0.5503
## F-statistic: 27.52 on 3 and 62 DF, p-value: 1.952e-11
After limiting the model to Summer evening peak, the rain, wind, holiday, and month features became statistically insignificant and were removed. The adjusted R-Squared value fell to 0.55. All that remained for the
par(mfrow = c(2, 3))
plot(model)
The diagnostic plots indicated this model is better suited for the data
than the previous model. The residuals are normally distributed, the
data is homoscedastic, and there are no influential points.
Interestingly, the temperature, solar radiation per square meter, and humidity remain as the only statistically significant variables. This makes intuitive sense for Summer, because when the temperature is high, humidity, and direct sunlight have strong influences on riding. This gives me an idea to make these relationships interaction terms.
model <- lm(rented ~ temp + humidity*radiation + temp*humidity, data=rush_df|>filter(season=='Summer'))
summary(model)
##
## Call:
## lm(formula = rented ~ temp + humidity * radiation + temp * humidity,
## data = filter(rush_df, season == "Summer"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1267.55 -86.94 34.91 182.92 656.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3434.98 221.90 15.480 < 2e-16 ***
## temp -975.86 153.19 -6.370 2.93e-08 ***
## humidity -1190.85 204.60 -5.820 2.45e-07 ***
## radiation 47.67 59.22 0.805 0.4240
## humidity:radiation 340.33 59.69 5.702 3.85e-07 ***
## temp:humidity 495.32 204.72 2.419 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 336.9 on 60 degrees of freedom
## Multiple R-squared: 0.7939, Adjusted R-squared: 0.7767
## F-statistic: 46.22 on 5 and 60 DF, p-value: < 2.2e-16
The interaction between temperature and solar radiation was not significant. This change brought the accuracy up to 0.777!
model <- lm(rented ~ temp + humidity*radiation + temp*humidity, data=rush_df)
summary(model)
##
## Call:
## lm(formula = rented ~ temp + humidity * radiation + temp * humidity,
## data = rush_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1248.47 -293.51 -48.32 324.92 980.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1409.88 29.06 48.522 < 2e-16 ***
## temp 595.89 40.57 14.687 < 2e-16 ***
## humidity -185.53 32.97 -5.627 5.01e-08 ***
## radiation 149.54 39.14 3.820 0.000169 ***
## humidity:radiation 253.01 41.82 6.049 5.40e-09 ***
## temp:humidity -229.96 36.15 -6.362 9.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 410.7 on 245 degrees of freedom
## Multiple R-squared: 0.7665, Adjusted R-squared: 0.7617
## F-statistic: 160.9 on 5 and 245 DF, p-value: < 2.2e-16
This models holds for the entire year as well. I examined the diagnostics one last time:
par(mfrow = c(2, 3))
plot(model)
The residuals vs fitted plot shows a pattern that resembles a wave. This
tells me that there is a somewhat non-linear relationship happening. The
Q-Q plot shows great normality of the residuals. The scale-location plot
is randomly distributed around the line, which shows homoscedacity.
There are no high influencing outliers that are skewing the results in
the residuals vs leverage plot.
Since I standardized all the predictors, the coefficients now represent the effect of a one standard deviation change in each predictor on the dependent variable rented bikes. This allows comparison of the magnitude of effects between variables. Temperature has the largest positive impact, while radiation has a large negative effect, and humidity negatively impacts rentals, but to a lesser extent. The interaction terms show how the effects of one variable depend on the other, all in terms of standard deviation changes.
For every 1 standard deviation increase in temperature, the rentals increase by 1322.589 units.
For every 1 standard deviation increase in humidity, the rentals decrease by 22.876 units.
For every 1 standard deviation increase in radiation, the rentals decrease by 4235.822 units.