Simple Linear Regression Analysis

Author

Malaya Wilburd

Purpose

This analysis explores the relationship between temperature (°C) and hourly bike rental demand in Seoul using simple linear regression. The goal is to identify whether temperature is a statistically significant predictor of demand and to evaluate the strengths and limitations of a single-predictor model.

Data Pre-processing

Data Cleaning:

  • Manually converted column names to snake case for improved readability, easier referencing, and standardization.
SeoulBikes <- read.csv("SeoulBikeData.csv") #Importing dataset

colnames(SeoulBikes)  #Displaying column names
 [1] "date"                    "rented_bike_count"      
 [3] "hour"                    "temperature_c"          
 [5] "humidity_pct"            "wind_speed_mps"         
 [7] "visibility_10m"          "dew_point_temperature_c"
 [9] "solar_radiation_mj_m2"   "rainfall_mm"            
[11] "snowfall_cm"             "season"                 
[13] "holiday"                 "functioning_day"        

Exploring the Dataset

Observations:

  • No missing values
  • 14 columns
  • 8760 observations (365 days × 24 hours; the dates are in day/month/year format, so the data span December 2017 to November 2018)
  • Rented Bike Count is recorded every hour of a single day, so there are 24 observations ranging from hour 0 to 23 for a given day in the dataset.

Data Types:

  • Contains mainly numerical values; date, season, holiday, and functioning day are stored as character types.
str(SeoulBikes) #Displaying structure of dataset
'data.frame':   8760 obs. of  14 variables:
 $ date                   : chr  "1/12/2017" "1/12/2017" "1/12/2017" "1/12/2017" ...
 $ rented_bike_count      : int  254 204 173 107 78 100 181 460 930 490 ...
 $ hour                   : int  0 1 2 3 4 5 6 7 8 9 ...
 $ temperature_c          : num  -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
 $ humidity_pct           : int  37 38 39 40 36 37 35 38 37 27 ...
 $ wind_speed_mps         : num  2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
 $ visibility_10m         : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
 $ dew_point_temperature_c: num  -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
 $ solar_radiation_mj_m2  : num  0 0 0 0 0 0 0 0 0.01 0.23 ...
 $ rainfall_mm            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ snowfall_cm            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ season                 : chr  "Winter" "Winter" "Winter" "Winter" ...
 $ holiday                : chr  "No Holiday" "No Holiday" "No Holiday" "No Holiday" ...
 $ functioning_day        : chr  "Yes" "Yes" "Yes" "Yes" ...
sapply(SeoulBikes, anyNA) #Checking each column for missing (NA) values
                   date       rented_bike_count                    hour 
                  FALSE                   FALSE                   FALSE 
          temperature_c            humidity_pct          wind_speed_mps 
                  FALSE                   FALSE                   FALSE 
         visibility_10m dew_point_temperature_c   solar_radiation_mj_m2 
                  FALSE                   FALSE                   FALSE 
            rainfall_mm             snowfall_cm                  season 
                  FALSE                   FALSE                   FALSE 
                holiday         functioning_day 
                  FALSE                   FALSE 

Choosing a Variable

  • Temperature was chosen because it has the strongest linear correlation with Rented Bike Count (r = 0.54) of all the numeric predictors.

    • Temperature has a moderate, positive linear relationship to Rented Bike Count.
SeoulBikesNum <- SeoulBikes[sapply(SeoulBikes, is.numeric)]  #Keeping only the numeric columns

cor_matrix <- cor(SeoulBikesNum) #Creating correlation matrix
round(cor_matrix, 2)
                        rented_bike_count  hour temperature_c humidity_pct
rented_bike_count                    1.00  0.41          0.54        -0.20
hour                                 0.41  1.00          0.12        -0.24
temperature_c                        0.54  0.12          1.00         0.16
humidity_pct                        -0.20 -0.24          0.16         1.00
wind_speed_mps                       0.12  0.29         -0.04        -0.34
visibility_10m                       0.20  0.10          0.03        -0.54
dew_point_temperature_c              0.38  0.00          0.91         0.54
solar_radiation_mj_m2                0.26  0.15          0.35        -0.46
rainfall_mm                         -0.12  0.01          0.05         0.24
snowfall_cm                         -0.14 -0.02         -0.22         0.11
                        wind_speed_mps visibility_10m dew_point_temperature_c
rented_bike_count                 0.12           0.20                    0.38
hour                              0.29           0.10                    0.00
temperature_c                    -0.04           0.03                    0.91
humidity_pct                     -0.34          -0.54                    0.54
wind_speed_mps                    1.00           0.17                   -0.18
visibility_10m                    0.17           1.00                   -0.18
dew_point_temperature_c          -0.18          -0.18                    1.00
solar_radiation_mj_m2             0.33           0.15                    0.09
rainfall_mm                      -0.02          -0.17                    0.13
snowfall_cm                       0.00          -0.12                   -0.15
                        solar_radiation_mj_m2 rainfall_mm snowfall_cm
rented_bike_count                        0.26       -0.12       -0.14
hour                                     0.15        0.01       -0.02
temperature_c                            0.35        0.05       -0.22
humidity_pct                            -0.46        0.24        0.11
wind_speed_mps                           0.33       -0.02        0.00
visibility_10m                           0.15       -0.17       -0.12
dew_point_temperature_c                  0.09        0.13       -0.15
solar_radiation_mj_m2                    1.00       -0.07       -0.07
rainfall_mm                             -0.07        1.00        0.01
snowfall_cm                             -0.07        0.01        1.00

Visuals

Temperature

  • Temperatures range from approximately -20 °C to 40 °C.

    • Most observations fall between 0 °C and 30 °C.
  • The values seem to closely follow a normal distribution.

# Creating histogram for Temperature
hist(
  SeoulBikes$temperature_c,
  main = "Histogram of Temperature",
  xlab = "Temperature (°C)",
  col = "pink"
  )

Hourly Bike Rentals

  • Distribution is right-skewed.

    • Most hours have “lower” rental counts, with fewer high-demand hours.
# Creating histogram for Hourly Bike Rentals
hist(
  SeoulBikes$rented_bike_count,
  main = "Histogram of Hourly Bike Rentals",
  xlab = "Hourly Bike Rentals",
  col = "light blue"
  )

Simple Linear Model

Estimated Regression Equation:

  • \(\hat{y} = 329.9525 + 29.0811x\)
SeoulBikesModel <- lm(rented_bike_count ~ temperature_c, data=SeoulBikes)  #Fitting model

summary(SeoulBikesModel)  #Getting summary stats of model

Call:
lm(formula = rented_bike_count ~ temperature_c, data = SeoulBikes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1100.60  -336.57   -49.69   233.81  2525.19 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   329.9525     8.5411   38.63   <2e-16 ***
temperature_c  29.0811     0.4862   59.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 543.5 on 8758 degrees of freedom
Multiple R-squared:   0.29, Adjusted R-squared:   0.29 
F-statistic:  3578 on 1 and 8758 DF,  p-value: < 2.2e-16

Interpretation

Interpretation:

  • On average, rented bike count is expected to increase by approximately 29 bikes per hour for every 1 °C increase in temperature.

Significant?

Hypothesis Test:

\(H_0:\beta_1 = 0\)

\(H_a: \beta_1 \ne 0\)

\(\alpha = 0.01\)

\(p\text{-value} < 2.2 \times 10^{-16}\)

  • At the 1% significance level \((\alpha = 0.01)\), we have enough statistical evidence \((p\text{-value} < \alpha)\) to reject the null hypothesis and conclude that temperature is a statistically significant predictor of hourly bike rentals.
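
The test above can be reproduced by hand from the coefficient table. The sketch below is only an illustration: the variable names are ours, and the numbers are copied from the summary() output rather than recomputed from the data, so small rounding differences from R's own t value are expected.

```r
# Values copied from the summary() output above (not recomputed from the data)
b1    <- 29.0811   # slope estimate for temperature_c
se_b1 <- 0.4862    # standard error of the slope
df    <- 8758      # residual degrees of freedom

t_stat  <- b1 / se_b1                                   # test statistic for H0: beta_1 = 0
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value
t_crit  <- qt(1 - 0.01 / 2, df)                         # critical value at alpha = 0.01

t_stat > t_crit   # TRUE means H0 is rejected at the 1% level
```

The p-value here is so small that it is numerically indistinguishable from zero, which is why R reports it as a bound (< 2e-16) rather than an exact figure.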

R-Squared & F-Statistic

\(R^2 = 0.29\)

\(F = 3578\)

  • The model explains 29% of the variance in hourly rentals, suggesting that additional predictors are needed to better explain hourly bike rental demand. The F-statistic also shows that the model is highly significant; with a single predictor, this is equivalent to the t-test on the slope.
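
In simple linear regression these statistics are tied together: \(R^2\) is the squared correlation between predictor and response, and the F-statistic is the square of the slope's t value. As a quick consistency check (a sketch using values copied from the outputs above, with variable names of our own choosing):

```r
# Values copied from the outputs above
r      <- 0.54     # correlation between temperature_c and rented_bike_count
t_stat <- 59.82    # t value for the temperature_c slope
f_stat <- 3578     # reported F-statistic
df     <- 8758     # residual degrees of freedom

r_squared_from_r <- r^2                      # squared correlation
r_squared_from_f <- f_stat / (f_stat + df)   # R^2 = F / (F + df) with one predictor
f_from_t         <- t_stat^2                 # F = t^2 with one predictor

round(c(r_squared_from_r, r_squared_from_f), 2)   # both should agree with the reported R^2
round(f_from_t)                                   # should agree with the reported F-statistic
```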

Residual Analysis

Residual Plot of Bike Rentals vs Temperature

  • Increase in spread at higher temperatures.

    • Greater variability in bike demand during warmer conditions.
# Residual plot
plot(SeoulBikesModel$fitted.values, SeoulBikesModel$residuals,
     main = "Residual Plot",
     xlab = "Fitted Values",
     ylab = "Residuals",
     pch = 16,
     col = "gray70")
abline(h = 0, col = "orange", lwd = 2)
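
The widening spread can also be put into numbers rather than judged by eye. The helper below is a hypothetical addition, not part of the original analysis: `residual_spread` and its `n_bins` argument are names we made up. It splits the fitted values into equal-count bins and reports the standard deviation of the residuals in each bin.

```r
# Hypothetical helper (not part of the original analysis):
# standard deviation of residuals within equal-count bins of the fitted values.
residual_spread <- function(model, n_bins = 4) {
  fitted_vals <- fitted(model)
  resid_vals  <- residuals(model)
  # Bin edges at evenly spaced quantiles of the fitted values
  breaks <- quantile(fitted_vals, probs = seq(0, 1, length.out = n_bins + 1))
  bins   <- cut(fitted_vals, breaks = unique(breaks), include.lowest = TRUE)
  tapply(resid_vals, bins, sd)
}

# Intended usage with the model fitted above:
# residual_spread(SeoulBikesModel)
```

If the constant-variance assumption held, the values would be roughly equal across bins; standard deviations that grow from the lowest to the highest bin would be consistent with the fan shape in the plot.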

Confidence and Prediction Intervals

  • The limitations of using temperature as the sole predictor start to show in the model's confidence and prediction intervals. The confidence interval (green) captures the uncertainty around the estimated mean bike rental count at each temperature, while the prediction interval (red) reflects the range within which an individual hourly observation is expected to fall. The notably wide prediction interval highlights the considerable variability in bike rentals that temperature alone cannot account for, reinforcing the need for additional predictors in future modeling.
temp_range <- data.frame(temperature_c = seq(-20, 40, length.out = 100))

# Named conf_int / pred_int so the built-in constant `pi` is not masked
conf_int <- predict(SeoulBikesModel, temp_range, interval = "confidence")
pred_int <- predict(SeoulBikesModel, temp_range, interval = "prediction")

plot(SeoulBikes$temperature_c, SeoulBikes$rented_bike_count,
     main = "Bike Rentals vs Temperature",
     xlab = "Temperature (°C)",
     ylab = "Rented Bike Count",
     pch = 16, col = "gray70")
abline(SeoulBikesModel, col = "blue", lwd = 2)
lines(temp_range$temperature_c, conf_int[,"lwr"], col = "darkgreen", lty = 2)
lines(temp_range$temperature_c, conf_int[,"upr"], col = "darkgreen", lty = 2)
lines(temp_range$temperature_c, pred_int[,"lwr"], col = "red", lty = 2)
lines(temp_range$temperature_c, pred_int[,"upr"], col = "red", lty = 2)

legend("topleft",
       legend = c("Regression Line", "Confidence Interval", "Prediction Interval"),
       col = c("blue", "darkgreen", "red"),
       lty = c(1, 2, 2),
       lwd = c(2, 1, 1),  # matches the widths actually drawn (interval lines use the default lwd = 1)
       bty = "n")

Prediction at 22°C

predict(SeoulBikesModel, data.frame(temperature_c = 22), interval = "confidence")
       fit      lwr      upr
1 969.7367 955.4166 984.0568
predict(SeoulBikesModel, data.frame(temperature_c = 22), interval = "prediction")
       fit       lwr      upr
1 969.7367 -95.74402 2035.217
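  • At 22 °C, the model predicts ~970 rentals per hour, with a 95% confidence interval of (955.42, 984.06) for the true mean and a 95% prediction interval of (-95.74, 2035.22) for an individual observation. The negative lower bound is not a possible rental count; it reflects the large residual standard error (543.5) and the constant-variance assumption of the linear model.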
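
To see where these numbers come from, the intervals can be approximately reconstructed from quantities already printed above. This is a sketch under stated assumptions: the variable names are ours, the standard error of the mean is backed out of the reported confidence interval, and the residual standard error is used at its rounded value, so the results match predict() only up to rounding.

```r
# Values copied from the outputs above
b0 <- 329.9525   # intercept
b1 <- 29.0811    # slope for temperature_c
s  <- 543.5      # residual standard error (rounded in the summary output)
df <- 8758       # residual degrees of freedom

fit    <- b0 + b1 * 22                  # point prediction at 22 °C
t_crit <- qt(0.975, df)                 # critical value for 95% intervals

# Standard error of the estimated mean, backed out of the reported confidence interval
se_mean <- (984.0568 - 955.4166) / (2 * t_crit)

# A new observation adds the residual variance on top of the uncertainty in the mean
se_pred <- sqrt(s^2 + se_mean^2)

pred_lower <- fit - t_crit * se_pred
pred_upper <- fit + t_crit * se_pred
c(fit = fit, lwr = pred_lower, upr = pred_upper)
```

Because `se_pred` is dominated by `s`, the prediction interval is roughly the fitted value plus or minus two residual standard errors at any temperature, which is why it stays wide no matter how precisely the mean is estimated.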

Summary

Using simple linear regression, we modeled the relationship between temperature and hourly bike rental demand in Seoul across 8,760 observations. The estimated regression equation, \(\hat{y} = 329.95 + 29.08x\), indicates that each 1 °C increase in temperature is associated with an expected increase of ~29 rentals per hour. At the 1% significance level \((\alpha = 0.01)\), temperature was found to be a statistically significant predictor \((p\text{-value} < 2.2 \times 10^{-16})\), so we reject the null hypothesis that \(\beta_1 = 0\).

However, with an \(R^2\) of 0.29, temperature alone explains only 29% of the variation in hourly rentals. At 22 °C, the model predicts ~970 rentals per hour, with a 95% confidence interval of (955.42, 984.06) for the true mean and a 95% prediction interval of (-95.74, 2035.22) for an individual observation. The wide prediction interval, combined with the heteroscedasticity observed in the residual plot, suggests that while temperature is a meaningful predictor, additional variables such as hour of day, season, and weather conditions would be needed to build a more reliable model.