Introduction
Data Fields
- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - “feels like” temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
Let me establish that I am trying to predict
count from my dataset.
Note:
For this project I will using ‘Bike sharing demand - forecast use of a city bikeshare system’ from kaggle. Unfortunately, this data is not available in the site. But as part of the course material, all the datasets were already given. So, I will be use the dataset which was given to me.
Exploratory Data Analysis
2. Create a scatter plot of count vs temp.
Set a good alpha value.
sc_plot1 <- ggplot(bike, aes(temp,count)) + geom_point(alpha=0.3,
aes(color=temp)) +
dark_theme_light()
sc_plot1
3. Plot count vs datetime as a scatterplot
wihta color gradient based on temperature.
I will convert the datetime column into POSIXct before plotting.
bike$datetime <- as.POSIXct(bike$datetime)
sc_plot2 <- ggplot(bike, aes(datetime, count)) +
geom_point( aes(color=temp), alpha = 0.45) + scale_color_continuous(low = '#66b2ff',high='#ff9900') + dark_theme_light()
sc_plot2
Observation:
A seasonality to the data, for winter and summer. Also that bike rental counts are increasing in general. This may present a problem with using a linear regression model if the data is non-linear. Let’s have a quick overview of pros and cons right now of Linear Regression:
Pros:
- Simple to explain.
- Highly interpretative.
- Model training and prediction are fast.
- No tuning is required (excluding regularization).
- Features don’t need scaling.
- Can perform well with a small number of observations.
- Well-understood.
Cons:
- Assumes a linear relationship between the features and the response.
- Performance is (generally) not competitive with the best supervised learning methods due to high bias.
- Can’t automatically learn feature interactions.
What is the correlation between temp and
count?
cor(bike[,c('temp','count')])
## temp count
## temp 1.0000000 0.3944536
## count 0.3944536 1.0000000
Exploring the season data. Creating a boxplot, with the y axis indicating count and the x axis begin a box for each season.
b_plot1 <- ggplot(bike, aes(factor(season), count)) +
geom_boxplot(aes(color = factor(season))) +
scale_x_discrete(labels = c("spring", "summer", "fall", "winter")) +
scale_color_discrete(breaks = c(1, 2, 3, 4), labels = c("spring", "summer", "fall", "winter")) +
labs(x = "Season", color = "Season") +
stat_summary(fun = "median", geom = "text", aes(label = round(after_stat(y), 2)),
position = position_nudge(0.0, 0.5), color = "white") +
dark_theme_light()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
b_plot1
Observation:
- A line can’t capture a non-linear relationship.
- There are more rentals in winter than in spring.
- We know of these issues because of the growth of rental count, this isn’t due to the actual season!
Feature Engineering
A lot of times you’ll need to use domain knowledge and experience to engineer and create new features. Let’s go ahead and engineer some new features from the datetime column.
Create an “hour” column that takes the hour from the datetime column. You’ll probably need to apply some function to the entire datetime column and reassign it. Hint:
time.stamp <- bike$datetime[4]
format(time.stamp, "%H")
bike$hour <- sapply(bike$datetime, function(x){format(x,"%H")})
head(bike)
## datetime season holiday workingday weather temp atemp humidity
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75
## windspeed casual registered count hour
## 1 0.0000 3 13 16 00
## 2 0.0000 8 32 40 01
## 3 0.0000 5 27 32 02
## 4 0.0000 3 10 13 03
## 5 0.0000 0 1 1 04
## 6 6.0032 0 1 1 05
Now creating a scatter plot of count versus
hour, with color scale based on temp. Only use bike data
where workingday==1.
Optional Additions:
- Use the additional layer:
scale_color_gradientn(colors=c('color1',color2,etc..))where the colors argument is a vector gradient of colors you choose, not just high and low. - Use
position=position_jitter(w=1, h=0)inside ofgeom_point()and check out what it does.
pl1 <- ggplot(filter(bike,workingday==1), aes(hour,count))
pl1 <- pl1 + geom_point(position = position_jitter(w=1,h=0),aes(color=temp), alpha =0.2)
pl1 <- pl1 + scale_color_gradientn(colors = c('darkblue', 'blue', 'lightblue', 'green', 'yellow', 'orange', 'red')) + labs(title = "Working days")+ theme(axis.text.x = element_text(size =4))+dark_theme_light()
pl1
Now create the same plot for non working days:
pl2 <- ggplot(filter(bike,workingday==0), aes(hour,count))
pl2 <- pl2 + geom_point(position = position_jitter(w=1,h=0),aes(color=temp), alpha =0.2)
pl2 <- pl2 + scale_color_gradientn(colors = c('darkblue', 'blue', 'lightblue', 'green', 'yellow', 'orange', 'red'))+ labs(title = "Non-working days")+ theme(axis.text.x = element_text(size = 4)) +dark_theme_light()
pl2
combined_plot <- pl1 + pl2
combined_plot
Observation
Working Days Pattern:
- Peak activity is observed during the morning at around 8 am.
- Another peak occurs right after work hours, approximately around 5 pm.
- There is some noticeable activity during lunchtime.
Non-Working Days Pattern:
- Non-working days exhibit a more steady rise and fall pattern during the afternoon.
- Unlike working days, there are no distinct peaks corresponding to typical morning and evening commuting times.
Building the model
Use lm() to build a model that predicts count based
solely on the temp feature, name it temp.model.
temp.model <- lm(count~temp,bike)
summary(temp.model)
##
## Call:
## lm(formula = count ~ temp, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.32 -112.36 -33.36 78.98 741.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0462 4.4394 1.362 0.173
## temp 9.1705 0.2048 44.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared: 0.1556, Adjusted R-squared: 0.1555
## F-statistic: 2006 on 1 and 10884 DF, p-value: < 2.2e-16
The estimated intercept is 6.0462 and the
temp coeffecient is 9.1705.
Model Testing
How many bike rental would we predict if the temperature was \(25^{o} C\)?
There are two ways to calculate:
- Using the values we just got the above.
- Using
predict()function.
intercept <- 6.0462
slope <- 9.1705
predict_temp <- 25
prediction <- predict_temp*slope + intercept
prediction
## [1] 235.3087
temp.test <- data.frame(temp=c(25))
predict(temp.model,temp.test)
## 1
## 235.3097
Model with many parameters.
Build a model that attempts to predict count based off of the
following features : season,holiday,
workingday, weather, temp,
humidity, windspeed,
hour(factor).
I will change the data from hour column to numeric type.
For that I will be using as.numeric.
bike$hour <- sapply(bike$hour,as.numeric)
Now let me build my mode based on above mentioned fratures.
model <- lm(count ~ season + holiday + workingday + weather + temp + humidity + windspeed + hour, data = bike)
summary(model)
##
## Call:
## lm(formula = count ~ season + holiday + workingday + weather +
## temp + humidity + windspeed + hour, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.61 -96.88 -31.01 55.27 688.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91369 8.45147 5.551 2.91e-08 ***
## season 21.70333 1.35409 16.028 < 2e-16 ***
## holiday -10.29914 8.79069 -1.172 0.241
## workingday -0.71781 3.14463 -0.228 0.819
## weather -3.20909 2.49731 -1.285 0.199
## temp 7.01953 0.19135 36.684 < 2e-16 ***
## humidity -2.21174 0.09083 -24.349 < 2e-16 ***
## windspeed 0.20271 0.18639 1.088 0.277
## hour 7.61283 0.21688 35.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3339
## F-statistic: 683 on 8 and 10877 DF, p-value: < 2.2e-16
Model Visualization
par(mfrow = c(2, 2))
plot(model)
Model Metrics
- R-squared : It measures the proportion of the response variable’s variance that is captured by the model. A higher R-squared indicates a better fit.
summary(model)$r.squared
## [1] 0.3343643
- Mean Squared Error (MSE): It measures the average squared difference between the observed and predicted values. Lower MSE values are better.
mse <- mean( (model$residuals)^2 )
mse
## [1] 21839.7
- Root mean square error (RMSE): It is the square root of MSE and provides the error in the original unit of the response variable.
rmse <- sqrt(mse)
rmse
## [1] 147.7826
Final remarks
A linear model like the one we chose which uses OLS won’t be able to take into account seasonality of our data, and will get thrown off by the growth in our dataset, accidentally attributing it towards the winter season, instead of realizing its just overall demand growing!
This sort of model doesn’t work well given our seasonal and time series data.