To predict bike usage comprehensively, we carried out the regression analysis in five steps: 1) Clean the data. 2) Assess the validity of the initial regression model against five major assumptions. 3) Use a boxplot of each discrete variable to illustrate usage patterns. 4) Based on the multicollinearity analysis in step 2 and an R-squared analysis, eliminate redundant variables and add dummy variables to build a refined regression model for usage prediction. 5) Interpret the coefficients of the new regression model and give recommendations. For concision, the output charts are hidden in this report. They can be shown by deleting ‘include=FALSE’ at the beginning of each R chunk.
We calculated the correlation of working day with the count of registered users and with the count of total users. We chose the count of registered users as our dependent variable for two reasons: 1) There is a strong positive correlation between working day and the count of registered users, whereas the correlation between working day and the total count is insignificant. Using registered users therefore makes fuller use of the variables and should give more accurate predictions. 2) For business purposes, we focus on the usage patterns of our current members, both to predict members’ usage behavior and to provide better service that enhances loyalty.
Unmodified regression equation:
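\[\begin{aligned}\widehat{\text{Registered}} ={}& 761.46 + 449.44\,X_{\text{season}} + 1754.87\,X_{\text{year}} - 23.31\,X_{\text{mnth}} - 243.16\,X_{\text{holiday}} + 42.05\,X_{\text{weekday}} \\ &+ 950.3\,X_{\text{workingday}} - 499.15\,X_{\text{weathersit}} + 888.25\,X_{\text{temp}} + 2611.88\,X_{\text{atemp}} - 607.23\,X_{\text{hum}} - 1709.1\,X_{\text{windspeed}}\end{aligned}\]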
As shown in R-markdown output 2.1, ‘weekday’ and ‘workingday’ are approximately linearly related to the count of registered users.
The p-value is approximately 0, so we reject the null hypothesis that there is no autocorrelation. In other words, the errors do not appear to be independent.
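The independence check can be illustrated with the Durbin-Watson statistic, assuming the test behind the reported p-value is of this type (the report does not name the test). The sketch below is in Python for illustration only; the function name `durbin_watson` and the toy residuals are hypothetical and are not taken from the report's R chunks. Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals (made up for illustration, not the report's data).
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
constant = [1.0, 1.0, 1.0, 1.0]       # each residual repeats the previous one

dw_alternating = durbin_watson(alternating)  # well above 2: negative autocorrelation
dw_constant = durbin_watson(constant)        # 0: strong positive autocorrelation
```

The statistic alone does not give a p-value; that comes from the test's reference distribution, which the report's R output already provides.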
The output in R-markdown 2.3 shows the pairwise correlations of the variables, which we use to check for multicollinearity. We found that ‘temp’ and ‘atemp’ are almost perfectly positively correlated (coefficient: 0.99). We will therefore use an R-squared comparison to decide which of the two to drop in the model modification stage later.
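The pairwise check described above can be sketched as follows. This is a minimal Python illustration, not the report's R code: the helper names `pearson` and `highly_correlated_pairs`, the 0.9 threshold, and the toy columns are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged_pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged_pairs.append((name_a, name_b, round(r, 2)))
    return flagged_pairs


# Toy data (hypothetical values, chosen so temp and atemp move together).
toy = {
    "temp": [0.20, 0.35, 0.50, 0.65, 0.80],
    "atemp": [0.22, 0.36, 0.49, 0.62, 0.76],
    "windspeed": [0.30, 0.10, 0.25, 0.15, 0.20],
}
flagged = highly_correlated_pairs(toy)  # only the temp/atemp pair is flagged
```

A pair flagged this way is a candidate for dropping one of its variables; which one to drop is a separate decision.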
We visualize the normality of the residuals. A significant number of residuals fall outside the range expected under normality, so the residuals are not normally distributed.
Fitted & Residual plot: As shown in the plot, the residuals are randomly scattered around 0, indicating that a linear model fits the data reasonably well, even though the variance becomes larger toward the end of the chart.
In this part, we use the multicollinearity analysis in 2.3 and the R-squared index to decide which redundant variables to eliminate and which dummy variables to add. As shown in output 2.3, ‘temp’ and ‘atemp’ are almost perfectly positively correlated (coefficient: 0.99). Similarly, season and month share a strong positive linear relationship, and the negative relationship between workingday and holiday is also significant. For each pair, we compared the R-squared of the models obtained by dropping one variable or the other. The models without temp, month, and holiday have a larger R-squared than the models without atemp, season, and working day, so we keep atemp rather than temp, season rather than month, and working day rather than holiday. These choices also make sense on other grounds: ‘atemp’ is the feels-like temperature, which is more meaningful from a business perspective, and working day already carries the information in holiday and weekday. For the dummy variables, we use spring as the baseline for season. We use weather situations 1 and 2 together as the baseline for weather, since they are collinear and describe similar conditions.
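The dummy coding described above can be sketched in Python as follows. The numeric codes are assumptions that the report does not state: season is taken to be coded 1 = spring, 2 = summer, 3 = fall, 4 = winter, and weather situations 3 and above are taken to be the snow/rain category. The function name `encode_dummies` is hypothetical.

```python
def encode_dummies(season, weathersit):
    """Turn categorical codes into 0/1 dummy variables.

    Assumed coding (not confirmed by the report):
      season:     1 = spring (baseline), 2 = summer, 3 = fall, 4 = winter
      weathersit: 1 and 2 = baseline, 3 or higher = snow/rain
    """
    return {
        "summer": int(season == 2),
        "fall": int(season == 3),
        "winter": int(season == 4),
        "snow_rain": int(weathersit >= 3),
    }


# A spring day with weather situation 2 sits entirely in the baseline.
baseline_day = encode_dummies(season=1, weathersit=2)

# A winter day with weather situation 3 switches on two dummies.
winter_snow_day = encode_dummies(season=4, weathersit=3)
```

Because spring and weather situations 1 and 2 have no dummy of their own, their effect is absorbed into the intercept, and each remaining coefficient is read as a difference from that baseline.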
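\[\begin{aligned}\widehat{\text{Registered}} ={}& 1000.49 + 1716.51\,X_{\text{year}} + 958.53\,X_{\text{workingday}} + 3747.31\,X_{\text{atemp}} - 1565.24\,X_{\text{hum}} \\ &- 1900.12\,X_{\text{windspeed}} + 799.86\,X_{\text{summer}} + 803.03\,X_{\text{fall}} + 1377.17\,X_{\text{winter}} - 1283.27\,X_{\text{snow/rain}}\end{aligned}\]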
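To show how the modified equation is used, here is a minimal Python sketch that plugs one day's values into it. The coefficients are copied from the equation above; everything else is an assumption made for illustration: the function name `predict_registered`, the input values, the coding of year as 0/1, and the scaling of atemp, hum, and windspeed to the 0 to 1 range.

```python
# Coefficients copied from the modified regression equation.
INTERCEPT = 1000.49
COEFFICIENTS = {
    "year": 1716.51,
    "workingday": 958.53,
    "atemp": 3747.31,
    "hum": -1565.24,
    "windspeed": -1900.12,
    "summer": 799.86,
    "fall": 803.03,
    "winter": 1377.17,
    "snow_rain": -1283.27,
}


def predict_registered(features):
    """Predicted count of registered users for one day.

    `features` maps variable names to values; any variable left out
    is treated as 0, which for the dummies means the baseline level.
    """
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical day: second year, working day, summer, baseline weather.
# atemp, hum, and windspeed are assumed to be normalized to 0..1.
example_day = {
    "year": 1,
    "workingday": 1,
    "atemp": 0.5,
    "hum": 0.6,
    "windspeed": 0.2,
    "summer": 1,
}
prediction = predict_registered(example_day)
```

For these made-up inputs the prediction is about 5,030 registered rides. Setting `snow_rain` to 1 and changing nothing else lowers it by 1283.27, which is how the snow/rain coefficient is read.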
Identifying office/residential areas with high demand. To improve our members’ experience, we need to allocate enough bikes in the CBD, or use a heat map to find the most popular office areas with high demand for bike service. In those areas, we want to make sure we have enough bikes, especially at peak hours.