1 About the Project
2 Source and Description of Data
3 Data Importing and Wrangling
- 3.1 Missing Values
- 3.2 Creating new columns and dataframes
4 Data Visualization
5 Correlation between various variables of concern
6 Linear Regression Models
- 6.1 Model-1 for Bike Count vs Temperature
  - 6.1.1 Scatterplot for Bike Count vs Temperature
- 6.2 Model-2 for Bike Count vs Windspeed
  - 6.2.1 Scatterplot for Bike Count vs Windspeed
7 Residual Diagnostics
- 7.1 Residual plots for both Models
8 New functions used in this project
9 Conclusion

1 About the Project

An open-source rental bike sharing dataset published on Kaggle was chosen for this project. Bike sharing is a popular service in which users download a phone application, locate bicycles, and rent one when needed. This project implements linear regression and model checking using residual analysis on the variables cnt (daily bike count), temp (temperature), and windspeed (wind speed) available in the dataset. Two models were built: one with cnt vs temp and the other with cnt vs windspeed. We test these two models against the primary assumptions of linear regression and conclude with the model that should be preferred for explaining the variation in bike count.

2 Source and Description of Data

The dataset describes the daily rental count of bikes across the four seasons of a year (Spring, Summer, Fall, Winter) for the years 2011-2012. It was obtained from Kaggle - Bike Sharing Data.

The dataset day.csv has the following fields:-

instant: record index
dteday : date
season : season (1: spring, 2: summer, 3: fall, 4: winter)
yr : year (0: 2011, 1: 2012)
mnth : month ( 1 to 12)
hr (only available in hour.csv) : hour (0 to 23)
holiday : whether the day is a holiday or not
weekday : day of the week
workingday : 1 if day is neither weekend nor holiday, 0 otherwise.
weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp : Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes including both casual and registered
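
Since temp and windspeed are stored in normalized form, it can help to convert them back to their original scale when reading results. The following is a minimal sketch, assuming the normalization is a plain division by the stated maximum, as the field list says. The helper names `denormalize_temp` and `denormalize_wind` are hypothetical and introduced only for this illustration; the input values are the first three entries of each column as printed by str() in Section 3.

```r
# Scaling constants as stated in the field list above.
TEMP_MAX <- 41   # temp was divided by 41
WIND_MAX <- 67   # windspeed was divided by 67

# Hypothetical helper names, introduced only for this sketch.
denormalize_temp <- function(temp) temp * TEMP_MAX
denormalize_wind <- function(windspeed) windspeed * WIND_MAX

# First three values of each column as printed by str() in Section 3.
temp_norm <- c(0.344, 0.363, 0.196)
wind_norm <- c(0.16, 0.249, 0.248)

temp_celsius  <- denormalize_temp(temp_norm)
wind_original <- denormalize_wind(wind_norm)

print(round(temp_celsius, 1))
print(round(wind_original, 1))
```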

3 Data Importing and Wrangling

The dataset has 731 records and 16 variables.

## spec_tbl_df [731 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ season    : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: num [1:731] 654 670 1229 1454 1518 ...
##  $ cnt       : num [1:731] 985 801 1349 1562 1600 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
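
The structure above appears to come from readr (the spec_tbl_df class), but the import code itself is not shown. As a dependency-free sketch, the same kind of import can be done with base R's read.csv(). An inline two-row stand-in for day.csv is used here so the example runs on its own; its values are taken from the str() output above. With the real file, the `text = csv_text` argument would be replaced by the path to day.csv.

```r
# Inline stand-in for day.csv: the first two records, using values printed
# in the str() output above.
csv_text <- "instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,6,0,2,0.344,0.364,0.806,0.16,331,654,985
2,2011-01-02,1,0,1,0,0,0,2,0.363,0.354,0.696,0.249,131,670,801"

bike <- read.csv(text = csv_text, stringsAsFactors = FALSE)
bike$dteday <- as.Date(bike$dteday)   # parse the date column explicitly

str(bike)    # column types and a preview of the values
dim(bike)    # number of records and variables

# Sanity check: cnt is the sum of casual and registered users.
print(all(bike$casual + bike$registered == bike$cnt))
```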

3.1 Missing Values

There are no missing values in any of the columns in the dataset.

##    instant     dteday     season         yr       mnth    holiday    weekday 
##          0          0          0          0          0          0          0 
## workingday weathersit       temp      atemp        hum  windspeed     casual 
##          0          0          0          0          0          0          0 
## registered        cnt 
##          0          0
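
A per-column count of missing values like the one above can be produced with colSums(is.na(...)). The sketch below uses a small toy data frame; the NA is injected deliberately to show what a non-zero count looks like and does not occur in the real data.

```r
# Toy data frame; the NA in `hum` is deliberately injected for illustration
# and does not occur in the real data.
toy <- data.frame(
  temp = c(0.344, 0.363, 0.196),
  hum  = c(0.806, NA, 0.437),
  cnt  = c(985, 801, 1349)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value.
print(all(na_counts == 0))
```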

3.2 Creating new columns and dataframes

We can create new columns to gain more insights from the data. Here, a column year is created by extracting the year from dteday, and the bike counts are then totalled by year and season.

## # A tibble: 8 x 3
## # Groups:   year [2]
##   year  season  count
##   <chr> <fct>   <dbl>
## 1 2011  1      150000
## 2 2011  2      347316
## 3 2011  3      419650
## 4 2011  4      326137
## 5 2012  1      321348
## 6 2012  2      571273
## 7 2012  3      641479
## 8 2012  4      515476
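
The summary above looks like a grouped dplyr result; the code that produced it is not shown. One way to sketch the same steps in base R is given below. The 2011 rows use values shown in Section 3, while the 2012 rows are hypothetical and exist only so the grouping has a second year to work with.

```r
# Toy rows: the 2011 records use values shown in Section 3; the 2012 rows
# are hypothetical and exist only to give the grouping a second year.
toy <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-01-02", "2012-01-01", "2012-01-02")),
  season = c(1, 1, 1, 1),
  cnt    = c(985, 801, 2000, 2500)
)

# Extract the year as a character column and treat season as a factor,
# matching the <chr> and <fct> column types in the summary above.
toy$year   <- format(toy$dteday, "%Y")
toy$season <- factor(toy$season)

# Total count per year and season.
season_totals <- aggregate(cnt ~ year + season, data = toy, FUN = sum)
names(season_totals)[names(season_totals) == "cnt"] <- "count"
print(season_totals)
```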

4 Data Visualization

The plot below shows the seasonal count of rented bikes for the two years 2011 and 2012. We observe that the Fall season has the highest number of rented bikes in both years.

5 Correlation between various variables of concern

We want to find out the correlation coefficients for the continuous variables of our interest - cnt, temp, atemp, hum, windspeed.

##                  cnt       temp      atemp        hum  windspeed
## cnt        1.0000000  0.6274940  0.6310657 -0.1006586 -0.2345450
## temp       0.6274940  1.0000000  0.9917016  0.1269629 -0.1579441
## atemp      0.6310657  0.9917016  1.0000000  0.1399881 -0.1836430
## hum       -0.1006586  0.1269629  0.1399881  1.0000000 -0.2484891
## windspeed -0.2345450 -0.1579441 -0.1836430 -0.2484891  1.0000000

From the above output, we observe that cnt and temp have a correlation coefficient of approximately 0.627, a moderately strong positive linear association, while cnt and windspeed have a coefficient of approximately -0.235, a weak negative association. Both are larger in magnitude than the correlation between cnt and hum (-0.101), so temp and windspeed are taken into consideration for further analysis.
temp and atemp are almost perfectly correlated with each other (0.992) and have very similar correlations with the other variables. So, we can choose one of them while building regression models.
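
A correlation matrix alone does not show statistical significance. As a sketch of how it can be checked, the standard t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), can be computed directly from the coefficients above and the sample size. The function name `cor_t_test` is hypothetical; with the full data, cor.test() would carry out the same test.

```r
n      <- 731          # number of daily records
r_temp <- 0.6274940    # cor(cnt, temp) from the matrix above
r_wind <- -0.2345450   # cor(cnt, windspeed) from the matrix above

# t statistic and two-sided p-value for H0: the true correlation is zero.
# `cor_t_test` is a hypothetical helper name for this sketch.
cor_t_test <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t = t_stat, p = p_val)
}

res_temp <- cor_t_test(r_temp, n)
res_wind <- cor_t_test(r_wind, n)
print(res_temp)
print(res_wind)
```

The resulting t statistics are about 21.76 and -6.51, so both correlations differ significantly from zero even though the second one is weak.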

6 Linear Regression Models

We try to build two simple linear regression models - one with cnt and temp, and the other with cnt and windspeed.

6.1 Model-1 for Bike Count vs Temperature

## 
## Call:
## lm(formula = cnt ~ temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16

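From the above output, we observe that a one-unit increase in temp is associated with an increase of about 6640.7 in the expected number of rented bikes. Because temp is normalized (Celsius temperature divided by 41), one unit corresponds to the full 41-degree range.
The p-value for the slope is very close to 0, so we reject the null hypothesis that the slope is zero and conclude that there is a significant linear relationship between temperature and the bike count.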
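
To make the fitted line easier to read, the slope can be re-expressed per degree Celsius and the line can be used for a point prediction. This is a minimal sketch built from the coefficients printed above and the scaling constant of 41 given in Section 2; `predict_cnt` is a hypothetical helper name.

```r
# Coefficients copied from the Model-1 summary above.
intercept <- 1214.6
slope     <- 6640.7
TEMP_MAX  <- 41   # scaling constant from the field description

# Hypothetical helper: expected daily count for a normalized temperature.
predict_cnt <- function(temp_norm) intercept + slope * temp_norm

# Slope re-expressed per degree Celsius.
slope_per_degree <- slope / TEMP_MAX
print(round(slope_per_degree, 1))

# Expected count at a normalized temperature of 0.5 (20.5 degrees Celsius).
print(predict_cnt(0.5))
```

On this reading, each additional degree Celsius is associated with roughly 162 more rented bikes per day.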

6.1.1 Scatterplot for Bike Count vs Temperature

6.2 Model-2 for Bike Count vs Windspeed

Now, let us build the second model for cnt vs windspeed.

## 
## Call:
## lm(formula = cnt ~ windspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4522.7 -1374.7   -74.6  1461.8  4544.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5621.2      185.1  30.374  < 2e-16 ***
## windspeed    -5862.9      900.0  -6.514 1.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1884 on 729 degrees of freedom
## Multiple R-squared:  0.05501,    Adjusted R-squared:  0.05372 
## F-statistic: 42.44 on 1 and 729 DF,  p-value: 1.36e-10

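From the above output, we observe that a one-unit increase in windspeed is associated with a decrease of about 5862.9 in the expected rented bike count. Because windspeed is normalized (wind speed divided by 67), one unit corresponds to the full normalized range.
The p-value for the slope is very close to 0, so we reject the null hypothesis that the slope is zero and conclude that there is a significant linear relationship between bike count and windspeed.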
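
In a simple linear regression, the Multiple R-squared equals the squared Pearson correlation between response and predictor, and the F statistic equals the squared t statistic of the slope. The sketch below uses only numbers already printed in Sections 5 and 6 to confirm that the two model summaries are consistent with the correlation matrix; small differences come from rounding in the printed output.

```r
# Values copied from the correlation matrix and the two lm() summaries.
r_temp <- 0.6274940;  r2_model1 <- 0.3937
r_wind <- -0.2345450; r2_model2 <- 0.05501
t_temp <- 21.759;     f_model1  <- 473.5
t_wind <- -6.514;     f_model2  <- 42.44

# Squared correlations should reproduce the reported Multiple R-squared.
print(c(model1 = r_temp^2, model2 = r_wind^2))

# Squared slope t statistics should reproduce the reported F statistics.
print(c(model1 = t_temp^2, model2 = t_wind^2))
```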

6.2.1 Scatterplot for Bike Count vs Windspeed

7 Residual Diagnostics

7.1 Residual plots for both Models

Displaying the first few values in res_hat1.

##          1          2          3          4          5          6 
## -2515.1554 -2827.3941 -1169.6385  -980.7841 -1121.7977  -965.6579

From the below Histogram output, we notice that the residuals for Model 1 are not normally distributed.

Displaying the first few values in res_hat2.

##         1         2         3         4         5         6 
## -3695.472 -3362.990 -2816.339 -3119.351 -2925.374 -3490.040

From the below Histogram output, we notice that the residuals for Model 2 are not normally distributed.

7.1.1 Checking the assumption of Normality

The Shapiro-Wilk normality test is performed on the residual vectors of both models.
From the below output, we can infer that the p-value for Model-1 is close to 0, so we reject the null hypothesis that the residuals come from a normal distribution.

## 
##  Shapiro-Wilk normality test
## 
## data:  res_hat1
## W = 0.98671, p-value = 3.392e-06

From the below output, we can infer that the p-value for Model-2 is close to 0, so we reject the null hypothesis that the residuals come from a normal distribution.

## 
##  Shapiro-Wilk normality test
## 
## data:  res_hat2
## W = 0.98616, p-value = 2.129e-06
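
The steps behind this check can be sketched as follows: fit the model, extract the residual vector, and pass it to shapiro.test(). The example runs on simulated data with deliberately skewed errors, not on the bike data, so the numbers it prints are unrelated to the results above; the seed and coefficients are arbitrary choices.

```r
set.seed(42)   # arbitrary seed so the simulated data are reproducible

# Simulated predictor and response with strongly skewed (non-normal) errors.
x <- runif(500)
y <- 2 + 3 * x + (rexp(500) - 1)

fit <- lm(y ~ x)
res <- resid(fit)          # residual vector, analogous to res_hat1

# hist(res) would draw the corresponding histogram of residuals.
sw <- shapiro.test(res)    # H0: the residuals come from a normal distribution
print(sw)
```

A small p-value leads to rejecting normality, which is the expected outcome here because the simulated errors are exponential.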

7.1.2 Checking the assumption of a Zero Mean

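From the below output, we observe a p-value of 1 and a sample mean of about 6.4e-14, so we fail to reject the null hypothesis that the mean of the residuals is 0. The assumption of a zero mean is satisfied for Model-1.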

## 
##  One Sample t-test
## 
## data:  res_hat1
## t = 1.1531e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -109.525  109.525
## sample estimates:
##    mean of x 
## 6.433118e-14

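From the below output, we observe a p-value of 1 and a sample mean of about 4.3e-13, so we fail to reject the null hypothesis that the mean of the residuals is 0. The assumption of a zero mean is satisfied for Model-2.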

## 
##  One Sample t-test
## 
## data:  res_hat2
## t = 6.2361e-15, df = 730, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -136.7415  136.7415
## sample estimates:
##    mean of x 
## 4.343524e-13
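
The residual means in both t-test outputs are zero up to floating-point error. This is a general property of least-squares fits that include an intercept: the residuals sum to zero by construction, whatever the error distribution. A minimal sketch on simulated data (not the bike data, with an arbitrary seed and coefficients) illustrates it.

```r
set.seed(1)   # arbitrary seed for reproducibility

# Simulated data; the errors are deliberately non-normal and heteroscedastic.
x <- runif(200)
y <- 5 - 2 * x + (0.5 + x) * (rexp(200) - 1)

fit <- lm(y ~ x)            # the model includes an intercept
res <- resid(fit)

print(mean(res))            # zero up to floating-point error
tt <- t.test(res, mu = 0)   # same kind of test as in the output above
print(tt$p.value)
```

The one-sample t-test on such residuals therefore returns a p-value of essentially 1 even when other assumptions fail.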

7.1.3 Assumption of Constant Variance

From the below plots, we can see that there is some pattern (increasing and decreasing) in the residuals for Model-1 around 0. So, the assumption of constant variance is not met.

From the below output, for Model-2 the residuals seem to be roughly symmetrical around 0, with no pattern like the one seen for Model-1. So, this model appears to fulfil the assumption of constant variance.
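
The project judges constant variance visually. As a supplement that is not part of the original analysis, a formal check can be sketched with the studentized (Koenker) form of the Breusch-Pagan test, computed by hand in base R: regress the squared residuals on the predictor and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The data below are simulated with a spread that grows with x; the seed and coefficients are arbitrary.

```r
set.seed(7)   # arbitrary seed for reproducibility

# Simulated data whose error spread grows with x (heteroscedastic by design).
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.1 + 2 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (1 df for one predictor).
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value indicates that the residual variance changes with the predictor, which is expected for these simulated data.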

8 New functions used in this project

abline :- This function is used to add Regression lines to a plot. In this project, the regression lines for Model-1, Model-2 were added to the respective plots using this function.
facet_wrap :- This function is used to produce multi-panel plots in ggplot2. In this project, in order to produce the visualization for Bike Count across the two years 2011, 2012, facet_wrap() is used. The plot was wrapped with the variable year and in the output we can see two panels with bike counts for the years 2011 and 2012.
scale_y_continuous :- This function is used to set values for continuous y-axis scale aesthetics. In this project, we plotted daily count of Rental Bikes on Y-Axis, which is a continuous value. Also, to get the labels on Y-Axis with comma inserted between digits, this function was used.
scale_fill_discrete :- This function was used to rename the labels of the legend in the Data Visualization plot in Section 4. The plots were filled with discrete variable season. Their unique values are 1, 2, 3, 4. To rename them on the legend to Spring, Summer, Fall, Winter, we used this function.
ggtitle :- This function is generally used to give titles to plots. In this project, this function was used to give both the title and subtitle to the plot in Section 4.
theme :- This function is used to customize the non-data elements of a plot like the titles, labels, fonts, etc. In this project, theme() was used to rotate the X-Axis categorical labels 90 degrees in the counter-clockwise direction and to align them to the center of each bar in the panel.
scale_x_discrete :- This function is used to set discrete aesthetics on X-Axis. In this project, it was used to set season variables (discrete) on X-Axis and to rename them with the respective names of Seasons.
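
The ggplot2 functions described above can be combined as in the following sketch. It is not the project's original plotting code, which is not shown. The data are the seasonal totals from the summary table in Section 3.2; the title and subtitle strings are placeholders, and the ggplot2 and scales packages are assumed to be installed.

```r
library(ggplot2)

# Seasonal totals copied from the summary table in Section 3.2.
season_totals <- data.frame(
  year   = rep(c("2011", "2012"), each = 4),
  season = factor(rep(1:4, times = 2)),
  count  = c(150000, 347316, 419650, 326137,
             321348, 571273, 641479, 515476)
)
season_names <- c("1" = "Spring", "2" = "Summer", "3" = "Fall", "4" = "Winter")

p <- ggplot(season_totals, aes(x = season, y = count, fill = season)) +
  geom_col() +
  facet_wrap(~ year) +                            # one panel per year
  scale_x_discrete(labels = season_names) +       # rename x-axis ticks
  scale_fill_discrete(labels = season_names) +    # rename legend keys
  scale_y_continuous(labels = scales::comma) +    # comma-separated labels
  ggtitle("Rented bikes by season",               # placeholder title
          subtitle = "2011 and 2012") +           # placeholder subtitle
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

# print(p) would render the plot on the active graphics device.
```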

9 Conclusion

The below table summarizes the results of the project:-

##                                   Model-1   Model-2
## Adjusted R-Squared                 0.3929   0.05372
## Assumption of Normality          Violated  Violated
## Assumption of Zero Mean         Satisfied Satisfied
## Assumption of Constant Variance  Violated Satisfied

From the above table, we observe that Model-1 has an adjusted R-squared value of 0.3929, which implies that temperature explains about 39% of the variance in bike count.
The adjusted R-squared value for Model-2 is only about 5.4%. So, although the windspeed coefficient is statistically significant, windspeed explains very little of the variance in rental bike count.
To obtain a better regression model, this project can be extended by applying transformations to Model-1, since it explains far more of the variation but violates the normality and constant-variance assumptions.

Statistical Computing Final Project

Aditya Gopavajjula

11/9/2021