## [1] 116 20 74 31 5 7 186 43 4 97 150 218 229 170 193 156 227 169
## [19] 2 85 102 27 75 181 122 187 135 148 162 164 100 125 160 201 104 188
## [37] 113 110 57 22 1 107 146 98 39 168 6 111 61 207 25 45 155 126
## [55] 175 138 158 86 58 34 159 133 93 163 143 63 50 145 118 119 96 195
## [73] 194 172 157 114 9 11 190 21 182 144 36 88 35 17 47 81 173 41
## [91] 55 30 32 189 132 178 65 79 134 52 153 62 204 89 212 147 49 78
## [109] 161 151 92 205 15 183 117 54 73 13 154 141 109 68 191 84 136 56
## [127] 152 70 28 16 176 67 38 40 105 95 26 53 3 131 87 90 42 33
## [145] 71 196 64 80 112 60
Table 1. Selected IDs from the original dataset
After creating a new variable \(lotsize=lotwidth\times lotlength\) from my sample (the code can be found in the original .Rmd file), I found 89 NAs in the variable maxsqfoot, so the predictor maxsqfoot is removed from further analysis.
## row col
## [1,] 37 6
## [2,] 49 6
## [3,] 88 6
## [4,] 91 6
## [5,] 104 6
## [6,] 121 6
## [7,] 124 6
## [8,] 76 7
## [9,] 71 9
## [10,] 90 9
Table 2. The row number of the cases which contain NA values
Also, Table 2 shows the cases which contain NA values. They are removed for further analysis since NAs will affect the calculation of coefficients of predictors.
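The cleaning steps described above can be sketched as follows. This is only a minimal illustration in Python, not the code from the original .Rmd file; the three toy rows and the helper name `clean` are assumptions made for demonstration, and `None` stands in for NA.

```python
# Hypothetical toy rows; None stands in for NA. Values are illustrative only.
rows = [
    {"ID": 1, "lotwidth": 50.0, "lotlength": 120.0, "taxes": 6000.0, "maxsqfoot": None},
    {"ID": 2, "lotwidth": 40.0, "lotlength": 100.0, "taxes": None, "maxsqfoot": 3000.0},
    {"ID": 3, "lotwidth": 30.0, "lotlength": None, "taxes": 4500.0, "maxsqfoot": None},
]


def clean(rows):
    """Drop maxsqfoot, remove cases with NA, and add lotsize = lotwidth * lotlength."""
    cleaned = []
    for row in rows:
        # Remove the predictor that is mostly missing.
        row = {k: v for k, v in row.items() if k != "maxsqfoot"}
        # Skip any case that still contains a missing value.
        # (A missing lotwidth or lotlength would make lotsize missing as well.)
        if any(v is None for v in row.values()):
            continue
        row["lotsize"] = row["lotwidth"] * row["lotlength"]
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

In this toy input only the first row survives: the second is missing taxes and the third is missing lotlength.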
ID: Categorical variable because each ID represents a detached house and one ID is not larger or smaller than another (IDs are not on a scale).
location: Categorical variable because it takes only two values, M (Mississauga Neighbourhood) or T (Toronto Neighbourhood).
sale: Discrete numerical variable
list: Discrete numerical variable
taxes: Discrete numerical variable
bedroom: Discrete numerical variable
bathroom: Discrete numerical variable
parking: Discrete numerical variable
maxsqfoot: Continuous numerical variable
lotwidth: Continuous numerical variable
lotlength: Continuous numerical variable
lotsize: Continuous numerical variable
## sale list bedroom bathroom parking taxes lotsize
## sale 1.0000 0.9872 0.4706 0.6790 0.2200 0.7519 0.4459
## list 0.9872 1.0000 0.4691 0.6975 0.2628 0.7275 0.4624
## bedroom 0.4706 0.4691 1.0000 0.5467 0.3458 0.4232 0.2234
## bathroom 0.6790 0.6975 0.5467 1.0000 0.3941 0.4859 0.3622
## parking 0.2200 0.2628 0.3458 0.3941 1.0000 0.3824 0.6975
## taxes 0.7519 0.7275 0.4232 0.4859 0.3824 1.0000 0.5650
## lotsize 0.4459 0.4624 0.2234 0.3622 0.6975 0.5650 1.0000
Figure 1. Correlation matrix of all quantitative variables in the sample
Figure 2. Pairwise scatterplot matrix of all quantitative variables in the sample
The variable location is excluded in these two Figures because it is a categorical variable encoded in “T” and “M”.
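For reference, each entry of a correlation matrix like the one in Figure 1 can be computed pairwise. The sketch below assumes the Pearson correlation coefficient (the report does not state which coefficient was used) and uses made-up toy data; the function name `pearson` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (not from the sample): sale price rises linearly with bathrooms.
bathrooms = [1, 2, 3, 4]
sale = [500_000, 700_000, 900_000, 1_100_000]
r = pearson(bathrooms, sale)
print(r)
```

Because the toy relationship is exactly linear and increasing, the coefficient is 1 up to floating-point error; real data such as the sample here give values strictly between -1 and 1.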
## list taxes bathroom bedroom lotsize parking
## 0.9872 0.7519 0.6790 0.4706 0.4459 0.2200
Table 3. Rank of the quantitative predictors
If we rank the quantitative predictors for sale price in terms of their correlation coefficients from highest to lowest, the top 3 predictors are the variable list (list price) with correlation coefficient 0.9872, the variable taxes (taxes in the previous year) with correlation coefficient 0.7519 and the variable bathroom (number of bathrooms) with correlation coefficient 0.679.
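The ranking in Table 3 can be reproduced from the sale row of the correlation matrix. The following is a small Python sketch using the values reported above; the variable names are chosen for illustration only.

```python
# Correlation of each quantitative predictor with sale, copied from Figure 1.
cor_with_sale = {
    "list": 0.9872,
    "bedroom": 0.4706,
    "bathroom": 0.6790,
    "parking": 0.2200,
    "taxes": 0.7519,
    "lotsize": 0.4459,
}

# Sort predictors from highest to lowest correlation.
# All coefficients here are positive, so no absolute value is needed.
ranked = sorted(cor_with_sale.items(), key=lambda kv: kv[1], reverse=True)
top3 = [name for name, _ in ranked[:3]]

for name, r in ranked:
    print(f"{name:10s} {r:.4f}")
```

The first three names in `ranked` are list, taxes and bathroom, and parking comes last, matching Table 3.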
Based on Figure 2, we can see that the predictor parking strongly violates the assumption of constant variance.
To show how the assumption is violated, we can fit a linear model for the variable sale with the single predictor parking.
\[sale=\beta_0+\beta parking\]
where sale is the response variable, \(\beta_0\) is the y-intercept and \(\beta\) is the coefficient of the predictor parking.
The \(\sqrt{|\text{Standardized Residuals}|}\) vs fitted values below shows how the assumption is violated.
Figure 3. The \(\sqrt{|\text{Standardized Residuals}|}\) vs fitted values for the model with the single predictor parking
As Figure 3 shows, \(\sqrt{|\text{Standardized Residuals}|}\) increases with the fitted values over the lower range and then increases again over the higher range. Because of this overall increasing trend, the spread of the residuals is not constant, so the assumption of constant variance is violated.
The regression model is:
\[y=\beta_0+\beta_1list+\beta_2bedroom+\beta_3bathroom+\beta_4parking+\beta_5taxes+\beta_6lotsize+\beta_7location\]
\(y\) is the response variable sale (actual sale price). \(\beta_0\) is the y-intercept of this function, which may have no practical meaning, since it corresponds to a house whose lot size and list price are 0. \(\beta_1\) is the coefficient of list (list price), \(\beta_2\) is the coefficient of bedroom (number of bedrooms), \(\beta_3\) is the coefficient of bathroom (number of bathrooms), \(\beta_4\) is the coefficient of parking (total number of parking spots), \(\beta_5\) is the coefficient of taxes (previous year's property taxes), \(\beta_6\) is the coefficient of lotsize (the frontage times the length) and \(\beta_7\) is the coefficient of location (1 for Toronto and 0 for Mississauga).
## Estimate Pr(>|t|)
## (Intercept) 3.462e+04 5.439e-01
## list 8.083e-01 6.795e-70
## bedroom 1.477e+04 3.227e-01
## bathroom 1.675e+04 2.348e-01
## parking -1.293e+04 1.498e-01
## taxes 2.068e+01 2.251e-06
## lotsize 3.154e+00 2.087e-01
## locationT 1.217e+05 3.138e-03
Table 5. The coefficients of the full model and the related p-values
Since the F-statistic of the full model is 1026 and the p-value of the global F-test on this model is smaller than 0.05, the global F-test is statistically significant, i.e. at least one predictor is linearly related to the sale price. Therefore, we will focus on the p-values of the individual t-tests (partial F-test on each variable).
Only the coefficients for the variables list (\(\beta_1\)), taxes (\(\beta_5\)) and location (\(\beta_7\)) are significant, given that the p-values for the corresponding t-tests are smaller than the 5% significance level.
\[\beta_1 = 0.8083,\] which means that a 1 dollar increase in the list price is associated with a 0.8083 dollar increase in the actual sale price, on average, when all other predictors are held fixed.
\[\beta_5 = 20.6836,\] which means that a 1 dollar increase in the previous year’s property tax is associated with a 20.6836 dollar increase in the actual sale price, on average, when all other predictors are held fixed.
\[\beta_7 = 1.217\times 10^{5},\] which means that, on average, the actual sale price of a detached house in Toronto is \(1.217\times10^{5}\) dollars higher than the actual sale price of a detached house in Mississauga when all other predictors are held fixed.
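To make the "all other predictors held fixed" interpretation concrete, the sketch below plugs the estimates from Table 5 into the full model and compares predictions for houses that differ in a single predictor. The baseline house is entirely hypothetical, and `predict` is an illustrative helper, not code from the original .Rmd file.

```python
# Estimates copied from Table 5 (full additive model).
coef = {
    "intercept": 3.462e4,
    "list": 0.8083,
    "bedroom": 1.477e4,
    "bathroom": 1.675e4,
    "parking": -1.293e4,
    "taxes": 20.68,
    "lotsize": 3.154,
    "locationT": 1.217e5,  # 1 for Toronto, 0 for Mississauga
}


def predict(house):
    """Fitted sale price: intercept plus the sum of coefficient * predictor value."""
    return coef["intercept"] + sum(coef[name] * value for name, value in house.items())


# Hypothetical baseline house in Mississauga (values are made up).
base = {"list": 1_000_000, "bedroom": 3, "bathroom": 2, "parking": 2,
        "taxes": 5000, "lotsize": 5000, "locationT": 0}

# Change exactly one predictor at a time, keeping the rest fixed.
higher_list = {**base, "list": base["list"] + 10_000}
in_toronto = {**base, "locationT": 1}

list_effect = predict(higher_list) - predict(base)
location_effect = predict(in_toronto) - predict(base)
print(list_effect, location_effect)
```

Raising the list price by 10,000 dollars changes the fitted sale price by 0.8083 × 10,000 = 8,083 dollars, and moving the same house to Toronto changes it by \(1.217\times10^{5}\) dollars, regardless of the baseline values chosen.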
The linear model obtained by backward-elimination AIC method is:
\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1list+\hat{\beta}_5taxes+\hat{\beta}_7location\]
where \(\hat{\beta}_0 = 7.621\times 10^{4}\), \(\hat{\beta}_1 = 0.8295\), \(\hat{\beta}_5 = 21.03\), \(\hat{\beta}_7 = 1.32\times 10^{5}\)
The results are consistent with those in part i for three reasons. First, the p-value of the global F-test on this model is smaller than 0.05, so the global F-test is statistically significant.
Second, the three variables incorporated in this AIC model are the same as the statistically significant variables identified in part i. Moreover, the values of the coefficients in AIC model are very close to the values in the full additive linear regression model.
Finally, the three variables in the AIC model are also statistically significant. (The values of the coefficients can be found in the original .Rmd)
The linear model obtained by backward-elimination BIC method is:
\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1list+\hat{\beta}_5taxes+\hat{\beta}_7location\]
where \(\hat{\beta}_0 = 7.621\times 10^{4}\), \(\hat{\beta}_1 = 0.8295\), \(\hat{\beta}_5 = 21.03\), \(\hat{\beta}_7 = 1.32\times 10^{5}\)
The results are consistent with those in parts i and ii for three reasons. First, the p-value of the global F-test on this BIC model is smaller than 0.05, so the global F-test is statistically significant.
Second, the three variables incorporated in this BIC model are the same as the statistically significant variables identified in parts i and ii. Moreover, the values of the coefficients in the BIC model are very close to the values in the full additive linear regression model as well as the AIC model.
Finally, the three variables in the BIC model are also statistically significant. (The values of the coefficients can be found in the original .Rmd)
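The two selection criteria differ only in how heavily they penalize the number of parameters. The sketch below assumes the usual least-squares form \(n\log(RSS/n)+k\,p\), with \(k=2\) for AIC and \(k=\log(n)\) for BIC; the sample size and residual sums of squares are hypothetical numbers chosen for illustration, not results from this analysis.

```python
import math


def information_criterion(n, rss, n_params, k):
    """n * log(RSS / n) + k * p, up to an additive constant shared by all models."""
    return n * math.log(rss / n) + k * n_params


def aic(n, rss, n_params):
    return information_criterion(n, rss, n_params, 2.0)


def bic(n, rss, n_params):
    return information_criterion(n, rss, n_params, math.log(n))


# Hypothetical values: a full model with 8 coefficients and a reduced
# model with 4 coefficients whose fit is only slightly worse.
n = 140
full = {"rss": 1.00e12, "n_params": 8}
reduced = {"rss": 1.03e12, "n_params": 4}

aic_prefers_reduced = aic(n, reduced["rss"], reduced["n_params"]) < aic(n, full["rss"], full["n_params"])
bic_prefers_reduced = bic(n, reduced["rss"], reduced["n_params"]) < bic(n, full["rss"], full["n_params"])
print(aic_prefers_reduced, bic_prefers_reduced)
```

Since \(\log(n) > 2\) whenever \(n \geq 8\), BIC never penalizes an extra parameter less than AIC at realistic sample sizes, so a backward BIC search retains at most as many predictors as a backward AIC search when both follow the same elimination path. In this report both searches end at the same three-predictor model.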
Figure 6. Diagnostic plots for backward BIC model
The first plot in Figure 6, Residuals vs Fitted, shows the association between the residuals and the fitted values of the model. The overall trend is close to a horizontal line, i.e. there is no association between these two quantities. From this graph, we may say the assumption of a linear relationship is mainly satisfied.
The second plot in Figure 6, Normal Q-Q, can be used to examine whether the errors are normally distributed. If the standardized residuals are normally distributed, the points should be close to the theoretical quantile-quantile line (straight dotted line in Figure 6).
Since the points are quite close to the line except for the tail and the first few points, we can claim that the normal error MLR assumption is mainly satisfied.
The third plot in Figure 6, Scale-Location, shows the relationship between \(\sqrt{|\text{Standardized Residuals}|}\) and the fitted values. It can be used to check the assumption of constant variance. The assumption is satisfied if no pattern is found (horizontal line) and the points are equally spread. Since there is a slightly increasing trend in this graph, the assumption of constant variance is violated to some extent.
The fourth plot in Figure 6, Residuals vs Leverage, can be used to identify influential points. There are no outliers in this case.
To arrive at a full ‘final’ model, we can remedy the shortcomings of the current model. Since the assumption of constant variance is violated to some extent, we can transform the dependent variable or fit a generalized linear model.
Moreover, we discarded the cases with NAs in our analysis, but removing them reduces the statistical power of our model, and the NA values may carry information that is useful for our analysis. Hence, we can study what the NAs represent in our sample and how to make fuller use of the data in the original sample.
Furthermore, there are many more cases in the original dataset, so we can draw another sample and/or a larger sample and then study the MLR for the new sample. Finally, we can compare the model from the new sample with our final model in the last part, and find the full model that best explains the data.