An open-source bike-sharing rental dataset published on Kaggle was chosen for this project. Bike sharing is a popular service in which users download a phone application, locate bicycles, and rent one when needed. This project implements Linear Regression and Model Checking using Residual Analysis on the variables cnt (daily bike count), temp (temperature), and windspeed (wind speed) available in the dataset. Two models were built: one with cnt vs temp and the other with cnt vs windspeed. We test these two models against the primary assumptions of Linear Regression and conclude with the model that should be preferred for explaining the variation in bike count.
The dataset chosen for analysis describes the daily rental count of bikes across the 4 seasons of the year (Spring, Summer, Fall, Winter) for the years 2011-2012. It was obtained from Kaggle - Bike Sharing Data.
The dataset day.csv has the following fields:
instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr (only available in hour.csv): hour (0 to 23)
holiday: whether the day is a holiday or not
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, 0 otherwise
weathersit: weather situation (coded category)
temp: normalized temperature in Celsius. The values are divided by 41 (max)
atemp: normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: normalized humidity. The values are divided by 100 (max)
windspeed: normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes, including both casual and registered
## spec_tbl_df [731 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:731], format: "2011-01-01" "2011-01-02" ...
## $ season : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: num [1:731] 654 670 1229 1454 1518 ...
## $ cnt : num [1:731] 985 801 1349 1562 1600 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## instant dteday season yr mnth holiday weekday
## 0 0 0 0 0 0 0
## workingday weathersit temp atemp hum windspeed casual
## 0 0 0 0 0 0 0
## registered cnt
## 0 0
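The structure listing and the per-column missing-value counts above could be reproduced along the following lines. This is only a minimal sketch: the data frame `bike` is a small stand-in built inline (the first few values of four columns from the listing, with the later dates assumed to be consecutive days) rather than the real day.csv, so that the block runs on its own.

```r
# Hypothetical stand-in for day.csv: first five rows of four columns.
# Dates after the second one are assumed to be consecutive days.
bike <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-01-02", "2011-01-03",
                     "2011-01-04", "2011-01-05")),
  season = c(1, 1, 1, 1, 1),
  temp   = c(0.344, 0.363, 0.196, 0.200, 0.227),
  cnt    = c(985, 801, 1349, 1562, 1600)
)

# Column types and a preview of the values.
str(bike)

# Number of missing values per column.
na_counts <- colSums(is.na(bike))
print(na_counts)
```

A result of all zeros, as in the output above, means no imputation or row removal is needed before modelling.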
We can create new columns to gain more insights from the data. I want to create a column year by extracting the year from dteday.
## # A tibble: 8 x 3
## # Groups: year [2]
## year season count
## <chr> <fct> <dbl>
## 1 2011 1 150000
## 2 2011 2 347316
## 3 2011 3 419650
## 4 2011 4 326137
## 5 2012 1 321348
## 6 2012 2 571273
## 7 2012 3 641479
## 8 2012 4 515476
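One way to derive the year column and the seasonal totals shown above is sketched here. The grouped tibble in the output suggests the original used dplyr; this version swaps in base R (`format()` and `aggregate()`) so it needs no packages, and the tiny `bike` data frame with round counts is purely hypothetical.

```r
# Hypothetical toy data: five days spread over two years and two seasons.
bike <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-07-01", "2011-07-02",
                     "2012-01-01", "2012-07-01")),
  season = factor(c(1, 3, 3, 1, 3)),
  cnt    = c(100, 200, 300, 400, 500)
)

# Extract the four-digit year from the date as a character column.
bike$year <- format(bike$dteday, "%Y")

# Sum the daily counts within each year/season combination.
seasonal <- aggregate(cnt ~ year + season, data = bike, FUN = sum)
names(seasonal)[names(seasonal) == "cnt"] <- "count"

# Order rows by year, then season, to mirror the table above.
seasonal <- seasonal[order(seasonal$year, seasonal$season), ]
print(seasonal)
```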
The plot below shows the seasonal count of rented bikes for the two years 2011 and 2012. We observe that the Fall season has the highest number of rented bikes in both years.
We want to find the correlation coefficients for the continuous variables of interest: cnt, temp, atemp, hum, windspeed.
## cnt temp atemp hum windspeed
## cnt 1.0000000 0.6274940 0.6310657 -0.1006586 -0.2345450
## temp 0.6274940 1.0000000 0.9917016 0.1269629 -0.1579441
## atemp 0.6310657 0.9917016 1.0000000 0.1399881 -0.1836430
## hum -0.1006586 0.1269629 0.1399881 1.0000000 -0.2484891
## windspeed -0.2345450 -0.1579441 -0.1836430 -0.2484891 1.0000000
From the above output, we observe that cnt and temp have a moderately strong positive correlation (approx. 0.627), while cnt and windspeed have a weaker negative correlation (approx. -0.235). Both relationships are large enough to be worth examining, so these variables are taken into consideration for further analysis.
The correlation coefficients of temp and atemp with the other variables are very similar, and the two are almost perfectly correlated with each other (approx. 0.992). So, we can choose just one of them while building regression models.
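A correlation matrix like the one above can be computed with `cor()`. The sketch below uses simulated data, not the bike dataset: the column names match the document, but the coefficients used to generate `cnt` are arbitrary assumptions. It also shows `cor.test()`, which is one way to attach a formal p-value to a single coefficient, since the size of a correlation alone does not establish statistical significance.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the normalized predictors (assumed ranges).
temp      <- runif(n)
hum       <- runif(n)
windspeed <- runif(n, 0, 0.5)

# Simulated response; the coefficients are arbitrary, for illustration only.
cnt <- 1200 + 6600 * temp - 3000 * windspeed + rnorm(n, sd = 1500)

bike <- data.frame(cnt, temp, hum, windspeed)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(bike)
print(round(cor_mat, 3))

# Formal test of H0: the true correlation between cnt and temp is zero.
ct <- cor.test(bike$cnt, bike$temp)
print(ct$p.value)
```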
We build two simple linear regression models: one with cnt and temp, and the other with cnt and windspeed.
##
## Call:
## lm(formula = cnt ~ temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
Now, let us build the second model for cnt vs windspeed.
##
## Call:
## lm(formula = cnt ~ windspeed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4522.7 -1374.7 -74.6 1461.8 4544.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5621.2 185.1 30.374 < 2e-16 ***
## windspeed -5862.9 900.0 -6.514 1.36e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1884 on 729 degrees of freedom
## Multiple R-squared: 0.05501, Adjusted R-squared: 0.05372
## F-statistic: 42.44 on 1 and 729 DF, p-value: 1.36e-10
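The two fits above follow the usual `lm()` pattern. The sketch below repeats that pattern on simulated data, so the numbers will not match the outputs above; the object names `model1` and `model2` and the generating coefficients are assumptions made for illustration. It also extracts the adjusted R-squared of each fit and overlays the first regression line with `abline()`.

```r
set.seed(42)
n <- 300

# Simulated stand-ins for the normalized predictors and the response.
temp      <- runif(n)
windspeed <- runif(n, 0, 0.5)
cnt       <- 1200 + 6600 * temp - 2000 * windspeed + rnorm(n, sd = 1500)

# One simple linear regression per predictor.
model1 <- lm(cnt ~ temp)
model2 <- lm(cnt ~ windspeed)

# Adjusted R-squared of each fit, pulled from the summary objects.
adj_r2 <- c(
  model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared
)
print(round(adj_r2, 4))

# Scatter plot with the fitted line of the first model.
plot(temp, cnt, main = "cnt vs temp")
abline(model1, col = "red")
```

In the real outputs above, the same comparison gives an adjusted R-squared of 0.3929 for cnt vs temp against 0.05372 for cnt vs windspeed.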
Displaying the first few values in res_hat1.
## 1 2 3 4 5 6
## -2515.1554 -2827.3941 -1169.6385 -980.7841 -1121.7977 -965.6579
From the histogram below, we notice that the residuals for Model 1 are not normally distributed.
Displaying the first few values in res_hat2.
## 1 2 3 4 5 6
## -3695.472 -3362.990 -2816.339 -3119.351 -2925.374 -3490.040
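Residual vectors such as res_hat1 and res_hat2 are typically obtained with `resid()` and inspected with `head()` and `hist()`. The following is a minimal sketch on simulated data; the names `fit` and `res_hat` are hypothetical, and the Q-Q plot is an extra visual check that the original may not have used.

```r
set.seed(7)

# Simulated predictor and response (arbitrary coefficients).
x <- runif(200)
y <- 1200 + 6600 * x + rnorm(200, sd = 1500)

fit <- lm(y ~ x)

# Residuals: observed values minus fitted values.
res_hat <- resid(fit)
print(head(res_hat))

# Histogram of the residuals; a roughly bell-shaped histogram is
# consistent with normality, while skew or heavy tails argue against it.
hist(res_hat, breaks = 30, main = "Residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the line suggest normal residuals.
qqnorm(res_hat)
qqline(res_hat)
```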
A Shapiro-Wilk normality test is performed on the residual vectors of both models.
From the output below, the p-values for both models are close to 0 (3.392e-06 for Model-1 and 2.129e-06 for Model-2), so we reject the null hypothesis that the residuals come from a normal distribution.
##
## Shapiro-Wilk normality test
##
## data: res_hat1
## W = 0.98671, p-value = 3.392e-06
##
## Shapiro-Wilk normality test
##
## data: res_hat2
## W = 0.98616, p-value = 2.129e-06
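To show how `shapiro.test()` behaves, the sketch below applies it to two simulated vectors: one drawn from a normal distribution and one deliberately skewed. Both vectors are hypothetical and unrelated to the bike data; the point is only how to read the W statistic and the p-value.

```r
set.seed(3)

# One sample that really is normal, and one that is clearly skewed.
normal_res <- rnorm(300)
skewed_res <- rexp(300) - 1   # shifted exponential, mean 0 but right-skewed

sw_normal <- shapiro.test(normal_res)
sw_skewed <- shapiro.test(skewed_res)

# W close to 1 and a large p-value are consistent with normality;
# a small p-value leads to rejecting the null hypothesis of normality.
print(sw_normal)
print(sw_skewed)
```

Note that with several hundred observations the test can reject normality even when W is close to 1, as in the outputs above (W of about 0.987 for both models).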
##
## One Sample t-test
##
## data: res_hat1
## t = 1.1531e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -109.525 109.525
## sample estimates:
## mean of x
## 6.433118e-14
##
## One Sample t-test
##
## data: res_hat2
## t = 6.2361e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -136.7415 136.7415
## sample estimates:
## mean of x
## 4.343524e-13
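The one-sample t-tests above check whether the mean of the residuals differs from 0. A minimal sketch on simulated data follows (the names `fit`, `res_hat` and `tt` are hypothetical). One point worth keeping in mind: for a least-squares fit that includes an intercept, the residuals average to zero by construction, so this test is expected to return a p-value of essentially 1, exactly as seen above.

```r
set.seed(11)

# Simulated data and an ordinary least-squares fit with an intercept.
x <- runif(150)
y <- 2 + 3 * x + rnorm(150)
fit <- lm(y ~ x)
res_hat <- resid(fit)

# Two-sided one-sample t-test of H0: mean(res_hat) = 0.
tt <- t.test(res_hat, mu = 0)
print(tt)

# The sample mean of the residuals is zero up to floating-point error.
print(mean(res_hat))
```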
From the output below, the residuals for Model-2 appear to be scattered roughly symmetrically around 0. So, this model appears to fulfil the assumption of constant variance.
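A common way to judge constant variance is a plot of residuals against fitted values, looking for a band of roughly even width around 0. The sketch below draws such a plot for simulated data whose noise has constant spread by construction. The numeric `spread_ratio` is an informal companion check added here as an assumption, not something taken from the original analysis.

```r
set.seed(5)

# Simulated data with constant noise spread (homoscedastic by design).
x <- runif(300)
y <- 1000 + 5000 * x + rnorm(300, sd = 1000)
fit <- lm(y ~ x)

# Residuals vs fitted values, with a dashed reference line at 0.
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Informal check: compare residual spread in the lower and upper
# halves of the fitted values; a ratio near 1 suggests similar spread.
cutoff <- median(fitted(fit))
lower  <- resid(fit)[fitted(fit) <= cutoff]
upper  <- resid(fit)[fitted(fit) >  cutoff]
spread_ratio <- sd(upper) / sd(lower)
print(spread_ratio)
```

A funnel shape in the plot, or a ratio far from 1, would point to non-constant variance.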
abline :- This function is used to add straight lines, such as regression lines, to a plot. In this project, the regression lines for Model-1 and Model-2 were added to the respective plots using this function.
facet_wrap :- This function is used to produce multi-panel plots in ggplot2. In this project, facet_wrap() is used to produce the visualization of Bike Count across the two years 2011 and 2012. The plot was wrapped by the variable year, and the output shows two panels with bike counts for the years 2011 and 2012.
scale_y_continuous :- This function is used to configure a continuous y-axis scale. In this project, we plotted the daily count of Rental Bikes on the Y-Axis, which is a continuous value. This function was also used to format the Y-Axis labels with commas as thousands separators.
scale_fill_discrete :- This function was used to rename the labels of the legend in the Data Visualization plot in Section 4. The plot was filled by the discrete variable season, whose unique values are 1, 2, 3, 4. To rename them in the legend to Spring, Summer, Fall, Winter, we used this function.
ggtitle :- This function is generally used to give titles to plots. In this project, this function was used to give both the title and the subtitle to the plot in Section 4.
theme :- This function is used to customize the non-data elements of a plot, such as titles, labels, and fonts. In this project, theme() was used to rotate the X-Axis categorical labels 90 degrees counter-clockwise and to align them with the center of each bar in the panel.
scale_x_discrete :- This function is used to configure a discrete X-Axis scale. In this project, it was used to place the (discrete) season variable on the X-Axis and to replace its codes with the respective names of the Seasons.
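The ggplot2 functions described above can be combined as in the following sketch. It assumes the ggplot2 package (and scales, which is installed with it) is available, and it uses the seasonal totals from the table earlier in this report. The titles, the object names `totals`, `season_names` and `p`, and the choice of `geom_col()` are assumptions, since the original plotting code is not shown.

```r
library(ggplot2)

# Seasonal totals copied from the summary table earlier in the report.
totals <- data.frame(
  year   = rep(c("2011", "2012"), each = 4),
  season = factor(rep(1:4, times = 2)),
  count  = c(150000, 347316, 419650, 326137,
             321348, 571273, 641479, 515476)
)

# Mapping from season codes to readable names.
season_names <- c("1" = "Spring", "2" = "Summer", "3" = "Fall", "4" = "Winter")

p <- ggplot(totals, aes(x = season, y = count, fill = season)) +
  geom_col() +
  facet_wrap(~ year) +                                   # one panel per year
  scale_x_discrete(labels = season_names) +              # rename axis ticks
  scale_fill_discrete(name = "Season",
                      labels = season_names) +           # rename legend keys
  scale_y_continuous(labels = scales::comma) +           # 150,000 style labels
  ggtitle("Seasonal bike rentals",
          subtitle = "Totals for 2011 and 2012") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p)
```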
##                                  Model-1    Model-2
## Adjusted R-Squared               0.3929     0.05372
## Assumption of Normality          Violated   Violated
## Assumption of Zero Mean          Satisfied  Satisfied
## Assumption of Constant Variance  Violated   Satisfied