Linear Models

In this week’s data dive, I tried to achieve the highest R-Squared value for modeling bikes rented. Previous iterations focused on up to four feature variables. I used backward elimination to maximize the effectiveness of the model, starting with all variables and eliminating based on statistical significance and impact to model performance.

Data Import and Cleaning

library(readr)
library(tidyr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(broom)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
#Import Data set
SeoulBikeData <- read.csv("C:\\Users\\jakem\\Documents\\1. Classes\\INFO H510\\SeoulBikeData.csv")

#Clean Dates
SeoulBikeData <- SeoulBikeData |>
  separate(date, into = c("day", "month", "year"), sep = "/") |>
  mutate(
    day = as.numeric(day),
    month = as.numeric(month),
    year = as.numeric(year)
  )

SeoulBikeData <- SeoulBikeData %>%
  mutate(month_name = factor(month, levels = 1:12, labels = month.name))

SeoulBikeData <- SeoulBikeData |>
  mutate(
    date = as.Date(paste(year,month,day, sep = "/", collapse = NULL))
  )


df <- SeoulBikeData |>
  filter(functioning_day=="Yes")

df$day_of_week <- weekdays(as.Date(df$date))

#Make a categorical column for time of day
df <- df |>
  mutate(hour_group = case_when(
    hour >= 0 & hour < 6  ~ "Night",
    hour >= 6 & hour < 12 ~ "Morning",
    hour >= 12 & hour < 18 ~ "Afternoon",
    hour >= 18 & hour < 24 ~ "Evening"
  )) |>
  mutate(hour_group = factor(hour_group, levels = c("Night", "Morning", "Afternoon", "Evening")))

weekend <- c('Saturday','Sunday')
df <- df |>
  mutate(weekend = case_when(
    day_of_week %in% weekend  ~ 1,
    !(day_of_week %in% weekend) ~ 0,
  ))

Linear Model

The data set is a time-series which creates difficulties when using linear regression. The daily and yearly demand cycle of bikes rented adds another layer of complexity when modeling. Due to this, my focus with this model is predicting the average bikes rented during the weekday peak hours between 5pm and 8pm. The practical significance is using the month and weather data to know what to expect for the highest demand period of the day, which could assist with staffing. Past data dive’s have shown that weekends have different demand patterns than weekdays, so weekends were excluded from the model.

rush_df <- df |>
  filter(hour %in% c(18, 19, 20, 21) & weekend == 0) |>
  group_by(date) |>
  summarize(rented = mean(rented_bikes), 
            temp = mean(temp_c),
            wind = mean(wind_ms),
            visibility = mean(visibility_10m),
            radiation = mean(solar_radiation_mJ_m2),
            snow = sum(snowfall_cm),
            rain=sum(rainfall_mm),
            holiday = max(holiday),
            season = first(seasons),
            humidity=mean(humid_pct),
            month = first(month_name))

#Scale values
rush_df <- rush_df |> 
  mutate(temp=scale(temp),
wind=scale(wind),
visibility = scale(visibility),
radiation = scale(radiation),
rain = scale(rain),
humidity = scale(humidity))
  
#Make snow binary
rush_df <- rush_df |>
  mutate(snow = case_when(
    snow > 0 ~ 1,
    snow == 0 ~ 0,
  ))

#Examine relationship between temperature and rented bikes
ggplot(data=rush_df, aes(x=temp,y=rented)) +
  geom_point()+
  ggtitle("Relationship Between Temperature and Rentals")

model <- lm(rented ~ temp + wind + visibility + radiation + snow + rain + holiday + month + humidity, data=rush_df)

summary(model)
## 
## Call:
## lm(formula = rented ~ temp + wind + visibility + radiation + 
##     snow + rain + holiday + month + humidity, data = rush_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1298.10  -138.30    18.15   157.80   930.05 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         715.73     159.29   4.493 1.11e-05 ***
## temp                435.80      75.86   5.745 2.89e-08 ***
## wind                -52.79      22.99  -2.296 0.022576 *  
## visibility           44.89      31.78   1.412 0.159204    
## radiation           264.84      42.10   6.290 1.57e-09 ***
## snow                 52.43     114.23   0.459 0.646665    
## rain               -147.90      23.80  -6.213 2.40e-09 ***
## holidayNo Holiday   371.37      95.05   3.907 0.000123 ***
## monthFebruary       -80.56     110.10  -0.732 0.465078    
## monthMarch           75.40     126.70   0.595 0.552343    
## monthApril          185.42     152.17   1.219 0.224260    
## monthMay            329.86     174.90   1.886 0.060548 .  
## monthJune           579.33     200.02   2.896 0.004139 ** 
## monthJuly            66.11     231.05   0.286 0.775032    
## monthAugust          10.36     233.51   0.044 0.964646    
## monthSeptember      626.66     202.27   3.098 0.002189 ** 
## monthOctober        770.62     154.15   4.999 1.14e-06 ***
## monthNovember       515.81     127.42   4.048 7.04e-05 ***
## monthDecember        48.72     100.74   0.484 0.629136    
## humidity           -131.15      33.48  -3.917 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326.9 on 231 degrees of freedom
## Multiple R-squared:  0.8606, Adjusted R-squared:  0.8491 
## F-statistic: 75.03 on 19 and 231 DF,  p-value: < 2.2e-16

This initial model had an R-Squared score of 0.85, meaning the model does a great job of explaining the variance in the data. However, the visibility and snow features are not statistically significant. These features were removed.

vif(model)
##                 GVIF Df GVIF^(1/(2*Df))
## temp       13.464834  1        3.669446
## wind        1.236995  1        1.112203
## visibility  2.363328  1        1.537312
## radiation   4.147495  1        2.036540
## snow        1.505385  1        1.226941
## rain        1.325832  1        1.151448
## holiday     1.042377  1        1.020969
## month      40.508128 11        1.183233
## humidity    2.622726  1        1.619483

Temperature was chosen over dew point temperature because it yielded a high R-Squared value. Both could not be included in the model because of multicollinearity between them.

model <- lm(rented ~ temp + wind + radiation + rain + holiday + month + humidity, data=rush_df)

summary(model)
## 
## Call:
## lm(formula = rented ~ temp + wind + radiation + rain + holiday + 
##     month + humidity, data = rush_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1278.70  -137.12    14.32   168.89   924.29 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         652.06     148.52   4.390 1.72e-05 ***
## temp                389.48      68.81   5.661 4.41e-08 ***
## wind                -50.04      22.90  -2.185 0.029868 *  
## radiation           270.59      41.61   6.503 4.76e-10 ***
## rain               -151.82      23.64  -6.422 7.46e-10 ***
## holidayNo Holiday   378.28      94.78   3.991 8.82e-05 ***
## monthFebruary       -84.66     103.81  -0.816 0.415602    
## monthMarch           92.21     119.64   0.771 0.441643    
## monthApril          222.38     144.87   1.535 0.126151    
## monthMay            397.60     162.13   2.452 0.014930 *  
## monthJune           667.97     184.25   3.625 0.000354 ***
## monthJuly           199.64     205.00   0.974 0.331147    
## monthAugust         155.29     205.33   0.756 0.450240    
## monthSeptember      755.81     175.74   4.301 2.51e-05 ***
## monthOctober        849.73     137.73   6.170 3.00e-09 ***
## monthNovember       538.09     121.86   4.415 1.54e-05 ***
## monthDecember        55.60      99.80   0.557 0.577951    
## humidity           -153.54      27.44  -5.595 6.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 327 on 233 degrees of freedom
## Multiple R-squared:  0.8593, Adjusted R-squared:  0.849 
## F-statistic: 83.69 on 17 and 233 DF,  p-value: < 2.2e-16

The model now has statistically significant features and high explanatory power. Next, I will check diagnostic plots.

Regression Diagnostics

par(mfrow = c(2, 3))
plot(model)

Residuals vs Fitted

The residuals vs fitted plot shows points scattered randomly for most of the plot, but the left side shows a discernible curve which indicates the model may not be a good fit for the data.

Q-Q

The residuals stray from the line due to the skewness on both sides of the Q-Q plot. This indicates that the residuals are not normally distributed.

Scale-Location

There are no discernible patterns in the scale-location plot and the points are randomly scattered around the line. This indicates the model have constant variance across the range of fitted values, called homoscedascity.

Residuals vs Leverage

The residuals vs leverage plot reveals one highly influential point. If removed, this data point would likely significantly change the coefficients.

Conclusion

Based on the diagnostic plots, the model is not a good fit for the data. I believe part of this may be cause by the strange outliers in the data. There are many hours in the data with abnormally low bike rentals for no apparent reason. Instead of removing these points, I will address another issue with the data: the seasonality of bike share throughout the year. Month was used as a feature, but what if we only modeled one season at a time?

model <- lm(rented ~ temp + radiation + humidity, data=rush_df|>filter(season=='Summer'))

summary(model)
## 
## Call:
## lm(formula = rented ~ temp + radiation + humidity, data = filter(rush_df, 
##     season == "Summer"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1339.26  -310.67    37.09   280.77   930.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2563.76     274.65   9.335 1.99e-13 ***
## temp         -561.97     203.25  -2.765  0.00749 ** 
## radiation     256.40      74.84   3.426  0.00109 ** 
## humidity     -453.62      96.23  -4.714 1.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 478.1 on 62 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5503 
## F-statistic: 27.52 on 3 and 62 DF,  p-value: 1.952e-11

After limiting the model to Summer evening peak, the rain, wind, holiday, and month features became statistically insignificant and were removed. The adjusted R-Squared value fell to 0.55. All that remained for the

par(mfrow = c(2, 3))
plot(model)

The diagnostic plots indicated this model is better suited for the data than the previous model. The residuals are normally distributed, the data is homoscedastic, and there are no influential points.

Interestingly, the temperature, solar radiation per square meter, and humidity remain as the only statistically significant variables. This makes intuitive sense for Summer, because when the temperature is high, humidity, and direct sunlight have strong influences on riding. This gives me an idea to make these relationships interaction terms.

model <- lm(rented ~ temp + humidity*radiation + temp*humidity, data=rush_df|>filter(season=='Summer'))

summary(model)
## 
## Call:
## lm(formula = rented ~ temp + humidity * radiation + temp * humidity, 
##     data = filter(rush_df, season == "Summer"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1267.55   -86.94    34.91   182.92   656.47 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3434.98     221.90  15.480  < 2e-16 ***
## temp                -975.86     153.19  -6.370 2.93e-08 ***
## humidity           -1190.85     204.60  -5.820 2.45e-07 ***
## radiation             47.67      59.22   0.805   0.4240    
## humidity:radiation   340.33      59.69   5.702 3.85e-07 ***
## temp:humidity        495.32     204.72   2.419   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 336.9 on 60 degrees of freedom
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7767 
## F-statistic: 46.22 on 5 and 60 DF,  p-value: < 2.2e-16

The interaction between temperature and solar radiation was not significant. This change brought the accuracy up to 0.777!

model <- lm(rented ~ temp + humidity*radiation + temp*humidity, data=rush_df)

summary(model)
## 
## Call:
## lm(formula = rented ~ temp + humidity * radiation + temp * humidity, 
##     data = rush_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1248.47  -293.51   -48.32   324.92   980.90 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1409.88      29.06  48.522  < 2e-16 ***
## temp                 595.89      40.57  14.687  < 2e-16 ***
## humidity            -185.53      32.97  -5.627 5.01e-08 ***
## radiation            149.54      39.14   3.820 0.000169 ***
## humidity:radiation   253.01      41.82   6.049 5.40e-09 ***
## temp:humidity       -229.96      36.15  -6.362 9.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 410.7 on 245 degrees of freedom
## Multiple R-squared:  0.7665, Adjusted R-squared:  0.7617 
## F-statistic: 160.9 on 5 and 245 DF,  p-value: < 2.2e-16

This models holds for the entire year as well. I examined the diagnostics one last time:

par(mfrow = c(2, 3))
plot(model)

The residuals vs fitted plot shows a pattern that resembles a wave. This tells me that there is a somewhat non-linear relationship happening. The Q-Q plot shows great normality of the residuals. The scale-location plot is randomly distributed around the line, which shows homoscedacity. There are no high influencing outliers that are skewing the results in the residuals vs leverage plot.

Coefficients

Since I standardized all the predictors, the coefficients now represent the effect of a one standard deviation change in each predictor on the dependent variable rented bikes. This allows comparison of the magnitude of effects between variables. Temperature has the largest positive impact, while radiation has a large negative effect, and humidity negatively impacts rentals, but to a lesser extent. The interaction terms show how the effects of one variable depend on the other, all in terms of standard deviation changes.

For every 1 standard deviation increase in temperature, the rentals increase by 1322.589 units.

For every 1 standard deviation increase in humidity, the rentals decrease by 22.876 units.

For every 1 standard deviation increase in radiation, the rentals decrease by 4235.822 units.