Machine Learning Project - Linear Regression

Suhas. P. K

2024-02-05

Introduction

Bike Sharing Demand - Forecast use of a city bikeshare system

You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.

Data Fields

  • datetime - hourly date + timestamp
  • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday - whether the day is considered a holiday
  • workingday - whether the day is neither a weekend nor holiday
  • weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp - temperature in Celsius
  • atemp - “feels like” temperature in Celsius
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of non-registered user rentals initiated
  • registered - number of registered user rentals initiated
  • count - number of total rentals

Let me establish that I am trying to predict count from my dataset.

Note:

For this project I will using ‘Bike sharing demand - forecast use of a city bikeshare system’ from kaggle. Unfortunately, this data is not available in the site. But as part of the course material, all the datasets were already given. So, I will be use the dataset which was given to me.

Exploratory Data Analysis

1. Read in bikeshare.csv file and set it to a datafram called bike.

bike <- read.csv("bikeshare.csv") 
head(bike)
##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1
if (!require(ggplot2)){
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require(ggdark)){
  install.packages("ggdark")
  library(ggdark)
}
if (!require(caTools)){
  install.packages("caTools")
  library(caTools)
}
if (!require(dplyr)){
  install.packages("dplyr")
  library(dplyr)
}

if (!require(patchwork)){
  install.packages("patchwork")
  library(patchwork)
}

2. Create a scatter plot of count vs temp. Set a good alpha value.

sc_plot1 <- ggplot(bike, aes(temp,count)) + geom_point(alpha=0.3,
                                           aes(color=temp)) +
  dark_theme_light()

sc_plot1

3. Plot count vs datetime as a scatterplot wihta color gradient based on temperature.

I will convert the datetime column into POSIXct before plotting.

bike$datetime <- as.POSIXct(bike$datetime)
sc_plot2 <- ggplot(bike, aes(datetime, count)) + 
  geom_point( aes(color=temp), alpha = 0.45) + scale_color_continuous(low = '#66b2ff',high='#ff9900') + dark_theme_light()

sc_plot2

Observation:

A seasonality to the data, for winter and summer. Also that bike rental counts are increasing in general. This may present a problem with using a linear regression model if the data is non-linear. Let’s have a quick overview of pros and cons right now of Linear Regression:

Pros:

  • Simple to explain.
  • Highly interpretative.
  • Model training and prediction are fast.
  • No tuning is required (excluding regularization).
  • Features don’t need scaling.
  • Can perform well with a small number of observations.
  • Well-understood.

Cons:

  • Assumes a linear relationship between the features and the response.
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias.
  • Can’t automatically learn feature interactions.

What is the correlation between temp and count?

cor(bike[,c('temp','count')])
##            temp     count
## temp  1.0000000 0.3944536
## count 0.3944536 1.0000000

Exploring the season data. Creating a boxplot, with the y axis indicating count and the x axis begin a box for each season.

b_plot1 <- ggplot(bike, aes(factor(season), count)) +
  geom_boxplot(aes(color = factor(season))) +
  scale_x_discrete(labels = c("spring", "summer", "fall", "winter")) +
  scale_color_discrete(breaks = c(1, 2, 3, 4), labels = c("spring", "summer", "fall", "winter")) +
  labs(x = "Season", color = "Season") +
  stat_summary(fun = "median", geom = "text", aes(label = round(after_stat(y), 2)),
               position = position_nudge(0.0, 0.5), color = "white") +
  dark_theme_light()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
b_plot1

Observation:

  • A line can’t capture a non-linear relationship.
  • There are more rentals in winter than in spring.
  • We know of these issues because of the growth of rental count, this isn’t due to the actual season!

Feature Engineering

A lot of times you’ll need to use domain knowledge and experience to engineer and create new features. Let’s go ahead and engineer some new features from the datetime column.

Create an “hour” column that takes the hour from the datetime column. You’ll probably need to apply some function to the entire datetime column and reassign it. Hint:

time.stamp <- bike$datetime[4]
format(time.stamp, "%H")
bike$hour <- sapply(bike$datetime, function(x){format(x,"%H")})
head(bike)
##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count hour
## 1    0.0000      3         13    16   00
## 2    0.0000      8         32    40   01
## 3    0.0000      5         27    32   02
## 4    0.0000      3         10    13   03
## 5    0.0000      0          1     1   04
## 6    6.0032      0          1     1   05

Now creating a scatter plot of count versus hour, with color scale based on temp. Only use bike data where workingday==1.

Optional Additions:

  • Use the additional layer: scale_color_gradientn(colors=c('color1',color2,etc..)) where the colors argument is a vector gradient of colors you choose, not just high and low.
  • Use position=position_jitter(w=1, h=0) inside of geom_point() and check out what it does.
pl1 <- ggplot(filter(bike,workingday==1), aes(hour,count))
pl1 <- pl1 + geom_point(position = position_jitter(w=1,h=0),aes(color=temp), alpha =0.2)
pl1 <- pl1 + scale_color_gradientn(colors = c('darkblue', 'blue', 'lightblue', 'green', 'yellow', 'orange', 'red')) + labs(title = "Working days")+ theme(axis.text.x = element_text(size =4))+dark_theme_light()
pl1

Now create the same plot for non working days:

pl2 <- ggplot(filter(bike,workingday==0), aes(hour,count))
pl2 <- pl2 + geom_point(position = position_jitter(w=1,h=0),aes(color=temp), alpha =0.2)
pl2 <- pl2 + scale_color_gradientn(colors = c('darkblue', 'blue', 'lightblue', 'green', 'yellow', 'orange', 'red'))+ labs(title = "Non-working days")+ theme(axis.text.x = element_text(size = 4)) +dark_theme_light()

pl2

combined_plot <- pl1 + pl2
combined_plot

Observation

Working Days Pattern:

  • Peak activity is observed during the morning at around 8 am.
  • Another peak occurs right after work hours, approximately around 5 pm.
  • There is some noticeable activity during lunchtime.

Non-Working Days Pattern:

  • Non-working days exhibit a more steady rise and fall pattern during the afternoon.
  • Unlike working days, there are no distinct peaks corresponding to typical morning and evening commuting times.

Building the model

Use lm() to build a model that predicts count based solely on the temp feature, name it temp.model.

temp.model <- lm(count~temp,bike)
summary(temp.model)
## 
## Call:
## lm(formula = count ~ temp, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -293.32 -112.36  -33.36   78.98  741.44 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0462     4.4394   1.362    0.173    
## temp          9.1705     0.2048  44.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared:  0.1556, Adjusted R-squared:  0.1555 
## F-statistic:  2006 on 1 and 10884 DF,  p-value: < 2.2e-16

The estimated intercept is 6.0462 and the temp coeffecient is 9.1705.

Model Testing

How many bike rental would we predict if the temperature was \(25^{o} C\)?

There are two ways to calculate:

  1. Using the values we just got the above.
  2. Using predict() function.
intercept <- 6.0462
slope <- 9.1705
predict_temp <- 25
prediction <- predict_temp*slope + intercept
prediction
## [1] 235.3087
temp.test <- data.frame(temp=c(25))
predict(temp.model,temp.test)
##        1 
## 235.3097

Model with many parameters.

Build a model that attempts to predict count based off of the following features : season,holiday, workingday, weather, temp, humidity, windspeed, hour(factor).

I will change the data from hour column to numeric type. For that I will be using as.numeric.

bike$hour <- sapply(bike$hour,as.numeric)

Now let me build my mode based on above mentioned fratures.

model <- lm(count ~ season + holiday + workingday + weather + temp + humidity + windspeed + hour, data = bike)
summary(model)
## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     temp + humidity + windspeed + hour, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -324.61  -96.88  -31.01   55.27  688.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.91369    8.45147   5.551 2.91e-08 ***
## season       21.70333    1.35409  16.028  < 2e-16 ***
## holiday     -10.29914    8.79069  -1.172    0.241    
## workingday   -0.71781    3.14463  -0.228    0.819    
## weather      -3.20909    2.49731  -1.285    0.199    
## temp          7.01953    0.19135  36.684  < 2e-16 ***
## humidity     -2.21174    0.09083 -24.349  < 2e-16 ***
## windspeed     0.20271    0.18639   1.088    0.277    
## hour          7.61283    0.21688  35.102  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3339 
## F-statistic:   683 on 8 and 10877 DF,  p-value: < 2.2e-16

Model Visualization

par(mfrow = c(2, 2))
plot(model)

Model Metrics

  1. R-squared : It measures the proportion of the response variable’s variance that is captured by the model. A higher R-squared indicates a better fit.
summary(model)$r.squared
## [1] 0.3343643
  1. Mean Squared Error (MSE): It measures the average squared difference between the observed and predicted values. Lower MSE values are better.
mse <- mean( (model$residuals)^2 )
mse
## [1] 21839.7
  1. Root mean square error (RMSE): It is the square root of MSE and provides the error in the original unit of the response variable.
rmse <- sqrt(mse)
rmse
## [1] 147.7826

Final remarks

  • A linear model like the one we chose which uses OLS won’t be able to take into account seasonality of our data, and will get thrown off by the growth in our dataset, accidentally attributing it towards the winter season, instead of realizing its just overall demand growing!

  • This sort of model doesn’t work well given our seasonal and time series data.