Fit a Multiple Linear Regression Model for the Sale Price

I. Data Wrangling

Sample Overview

##   [1] 116  20  74  31   5   7 186  43   4  97 150 218 229 170 193 156 227 169
##  [19]   2  85 102  27  75 181 122 187 135 148 162 164 100 125 160 201 104 188
##  [37] 113 110  57  22   1 107 146  98  39 168   6 111  61 207  25  45 155 126
##  [55] 175 138 158  86  58  34 159 133  93 163 143  63  50 145 118 119  96 195
##  [73] 194 172 157 114   9  11 190  21 182 144  36  88  35  17  47  81 173  41
##  [91]  55  30  32 189 132 178  65  79 134  52 153  62 204  89 212 147  49  78
## [109] 161 151  92 205  15 183 117  54  73  13 154 141 109  68 191  84 136  56
## [127] 152  70  28  16 176  67  38  40 105  95  26  53   3 131  87  90  42  33
## [145]  71 196  64  80 112  60

Table 1. Selected IDs from the original dataset

Data Cleaning

After creating a new variable \(lotsize=lotwidth\times lotlength\) in my sample (the code can be found in the original .Rmd file), I found 89 NAs in the variable maxsqfoot. Because so many of the 150 sampled cases are missing this value, the predictor maxsqfoot is removed from further analysis.

##       row col
##  [1,]  37   6
##  [2,]  49   6
##  [3,]  88   6
##  [4,]  91   6
##  [5,] 104   6
##  [6,] 121   6
##  [7,] 124   6
##  [8,]  76   7
##  [9,]  71   9
## [10,]  90   9

Table 2. Row and column indices of the cases which contain NA values

Table 2 shows the positions of the remaining NA values. These cases are removed from further analysis because a case with a missing value cannot be used when estimating the coefficients of the predictors.
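The original cleaning was done in R; the following is only a minimal Python sketch of the same three steps (create lotsize, locate the NAs, drop incomplete cases). The three rows of data and the helper names `add_lotsize`, `na_positions` and `drop_incomplete` are hypothetical and are not taken from the original .Rmd file.

```python
# Hypothetical three-row sample; None plays the role of NA in R.
rows = [
    {"ID": 1, "lotwidth": 50.0, "lotlength": 120.0, "taxes": 4500.0},
    {"ID": 2, "lotwidth": 40.0, "lotlength": None, "taxes": 5200.0},
    {"ID": 3, "lotwidth": 30.0, "lotlength": 100.0, "taxes": None},
]


def add_lotsize(row):
    """Return a copy of row with lotsize = lotwidth * lotlength (None if either is missing)."""
    out = dict(row)
    width, length = row["lotwidth"], row["lotlength"]
    out["lotsize"] = None if width is None or length is None else width * length
    return out


def na_positions(rows, columns):
    """List 1-indexed (row, col) pairs of missing values, ordered by column and then row."""
    return [
        (i + 1, j + 1)
        for j, col in enumerate(columns)
        for i, row in enumerate(rows)
        if row[col] is None
    ]


def drop_incomplete(rows, columns):
    """Keep only the rows with no missing value in any of the given columns."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


columns = ["ID", "lotwidth", "lotlength", "taxes", "lotsize"]
sample = [add_lotsize(r) for r in rows]
print(na_positions(sample, columns))   # positions in the same (row, col) layout as Table 2
complete = drop_incomplete(sample, columns)
print([r["ID"] for r in complete])     # IDs of the cases kept for modelling
```

Note that a missing lotwidth or lotlength propagates to lotsize, so such a case is dropped even if every other variable is observed.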

II. Exploratory Data Analysis

Variable Classification

ID: Categorical variable because each ID identifies one detached house and the ID numbers are not on a scale (one ID is not larger or smaller than another in any meaningful sense).
location: Categorical variable because it takes one of two values, M (Mississauga Neighbourhood) or T (Toronto Neighbourhood).
sale: Discrete numerical variable
list: Discrete numerical variable
taxes: Discrete numerical variable
bedroom: Discrete numerical variable
bathroom: Discrete numerical variable
parking: Discrete numerical variable
maxsqfoot: Continuous numerical variable
lotwidth: Continuous numerical variable
lotlength: Continuous numerical variable
lotsize: Continuous numerical variable

Pairwise Correlations and Scatterplot Matrix

##            sale   list bedroom bathroom parking  taxes lotsize
## sale     1.0000 0.9872  0.4706   0.6790  0.2200 0.7519  0.4459
## list     0.9872 1.0000  0.4691   0.6975  0.2628 0.7275  0.4624
## bedroom  0.4706 0.4691  1.0000   0.5467  0.3458 0.4232  0.2234
## bathroom 0.6790 0.6975  0.5467   1.0000  0.3941 0.4859  0.3622
## parking  0.2200 0.2628  0.3458   0.3941  1.0000 0.3824  0.6975
## taxes    0.7519 0.7275  0.4232   0.4859  0.3824 1.0000  0.5650
## lotsize  0.4459 0.4624  0.2234   0.3622  0.6975 0.5650  1.0000

Figure 1. Correlation matrix of all quantitative variables in the sample

Figure 2. Pairwise scatterplot matrix of all quantitative variables in the sample

The variable location is excluded from both figures because it is a categorical variable coded as “T” or “M”.

##     list    taxes bathroom  bedroom  lotsize  parking 
##   0.9872   0.7519   0.6790   0.4706   0.4459   0.2200

Table 3. Rank of the quantitative predictors

If we rank the quantitative predictors of sale price by their correlation coefficient with sale, from highest to lowest, the top three are list (list price) with correlation coefficient 0.9872, taxes (taxes in the previous year) with correlation coefficient 0.7519, and bathroom (number of bathrooms) with correlation coefficient 0.6790.
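The ranking can be reproduced directly from the sale row of the correlation matrix. The snippet below is a small Python sketch (the original analysis used R); the dictionary name `cor_with_sale` is an illustrative choice, while the numbers are copied from the matrix above.

```python
# Correlation of each quantitative predictor with sale,
# copied from the "sale" row of the correlation matrix.
cor_with_sale = {
    "list": 0.9872,
    "bedroom": 0.4706,
    "bathroom": 0.6790,
    "parking": 0.2200,
    "taxes": 0.7519,
    "lotsize": 0.4459,
}

# Sort predictors from the highest to the lowest correlation with sale.
ranked = sorted(cor_with_sale.items(), key=lambda item: item[1], reverse=True)

for name, r in ranked:
    print(f"{name:<9}{r:.4f}")

top3 = [name for name, _ in ranked[:3]]
print(top3)  # ['list', 'taxes', 'bathroom']
```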

Violation of the Assumption of Constant Variance

Based on Figure 2, we can see that the predictor parking strongly violates the assumption of constant variance.

To show how the assumption is violated, we can fit a linear model for the variable sale with the single predictor parking.

\[sale=\beta_0+\beta parking\]

where sale is the response variable, \(\beta_0\) is the y-intercept and \(\beta\) is the coefficient of the predictor parking.

The plot of \(\sqrt{|\text{Standardized Residuals}|}\) against the fitted values below shows how the assumption is violated.

Figure 3. The \(\sqrt{|\text{Standardized Residuals}|}\) vs fitted values for the model with the single predictor parking

As Figure 3 shows, \(\sqrt{|\text{Standardized Residuals}|}\) follows one increasing trend over the lower fitted values and a second increasing trend over the higher fitted values. Since the general trend in Figure 3 is increasing rather than flat, the assumption of constant variance is violated.
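For readers who want to see what the vertical axis of Figure 3 measures, here is a minimal Python sketch of the calculation for a model with one predictor (the original figure was produced in R). The function name `scale_location` and the eight data points are hypothetical and are not drawn from the sample.

```python
import math


def scale_location(x, y):
    """Return (fitted values, sqrt(|standardized residuals|)) for the fit y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Least-squares estimates of the slope and intercept.
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]

    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

    # Leverage of each case in simple linear regression.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]
    return fitted, [math.sqrt(abs(r)) for r in std_resid]


# Hypothetical toy data (not from the sample).
parking = [1, 1, 2, 2, 3, 3, 4, 4]
sale = [500, 520, 600, 660, 650, 800, 700, 950]

fitted, root_abs = scale_location(parking, sale)
for f, r in zip(fitted, root_abs):
    print(f"{f:8.1f}  {r:.3f}")
```

Under constant variance the second column should show no systematic trend as the first column grows; an upward drift is the pattern described for Figure 3.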

III. Methods and Model

Part I

The regression model is:

\[y=\beta_0+\beta_1 list+\beta_2 bedroom+\beta_3 bathroom+\beta_4 parking+\beta_5 taxes+\beta_6 lotsize+\beta_7 location\]

\(y\) is the response variable sale (actual sale price). \(\beta_0\) is the y-intercept of this function, which may not have practical meaning because it corresponds to a house whose lotsize and list price are 0. \(\beta_1\) is the coefficient of list (list price), \(\beta_2\) is the coefficient of bedroom (number of bedrooms), \(\beta_3\) is the coefficient of bathroom (number of bathrooms), \(\beta_4\) is the coefficient of parking (total number of parking spots), \(\beta_5\) is the coefficient of taxes (previous year’s property tax), \(\beta_6\) is the coefficient of lotsize (the frontage times the length) and \(\beta_7\) is the coefficient of location (1 for Toronto and 0 for Mississauga).

##               Estimate  Pr(>|t|)
## (Intercept)  3.462e+04 5.439e-01
## list         8.083e-01 6.795e-70
## bedroom      1.477e+04 3.227e-01
## bathroom     1.675e+04 2.348e-01
## parking     -1.293e+04 1.498e-01
## taxes        2.068e+01 2.251e-06
## lotsize      3.154e+00 2.087e-01
## locationT    1.217e+05 3.138e-03

Table 5. The coefficients of the full model and the related p-values

Since the F-statistic of the full model is 1026 and the p-value of the global F-test on this model is smaller than 0.05, the global F-test is statistically significant, i.e. at least one of the coefficients is not zero. Therefore, we will focus on the p-values of the individual t-tests (equivalent to a partial F-test on each variable).

Only the coefficients of the variables list (\(\beta_1\)), taxes (\(\beta_5\)) and location (\(\beta_7\)) are significant, given that the p-values of the corresponding t-tests are smaller than the 5% significance level.

\(\hat{\beta}_1 = 0.8083\), which means that a 1-dollar increase in the list price is associated with an average increase of 0.8083 dollars in the actual sale price when all other predictors are held fixed.

\(\hat{\beta}_5 = 20.6836\), which means that a 1-dollar increase in the previous year’s property tax is associated with an average increase of 20.6836 dollars in the actual sale price when all other predictors are held fixed.

\(\hat{\beta}_7 = 1.217\times 10^{5}\), which means that, on average, the actual sale price of a detached house in Toronto is \(1.217\times10^{5}\) dollars higher than the actual sale price of a detached house in Mississauga when all other predictors are held fixed.
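As a worked illustration of the two slope interpretations, take a hypothetical 10,000-dollar increase in the list price and a hypothetical 100-dollar increase in the property tax, each with all other predictors held fixed. Both amounts are assumptions chosen only to make the arithmetic concrete.

\[0.8083\times 10{,}000 = 8{,}083, \qquad 20.6836\times 100 = 2{,}068.36\]

Under the full model, the fitted sale price therefore rises by about 8,083 dollars in the first case and about 2,068 dollars in the second.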

Part II

The linear model obtained by backward-elimination AIC method is:

\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1 list+\hat{\beta}_5 taxes+\hat{\beta}_7 location\]

where \(\hat{\beta}_0 = 7.621\times 10^{4}\), \(\hat{\beta}_1 = 0.8295\), \(\hat{\beta}_5 = 21.03\) and \(\hat{\beta}_7 = 1.32\times 10^{5}\).
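To make the fitted equation concrete, the sketch below evaluates it in Python (the original model was fitted in R). The coefficients are the estimates reported above; the function name `predict_sale` and the example house (list price 1,000,000 and taxes 5,000) are hypothetical.

```python
# Coefficient estimates of the reduced model, as reported above.
B0 = 7.621e4         # intercept
B_LIST = 0.8295      # coefficient of list
B_TAXES = 21.03      # coefficient of taxes
B_TORONTO = 1.32e5   # coefficient of location (1 for Toronto, 0 for Mississauga)


def predict_sale(list_price, taxes, location):
    """Fitted sale price; location must be "T" (Toronto) or "M" (Mississauga)."""
    if location not in ("T", "M"):
        raise ValueError("location must be 'T' or 'M'")
    toronto = 1 if location == "T" else 0
    return B0 + B_LIST * list_price + B_TAXES * taxes + B_TORONTO * toronto


# Hypothetical house: listed at 1,000,000 with 5,000 in property tax.
in_mississauga = predict_sale(1_000_000, 5_000, "M")
in_toronto = predict_sale(1_000_000, 5_000, "T")

print(round(in_mississauga))               # 1010860
print(round(in_toronto - in_mississauga))  # 132000, the location coefficient
```

The difference between the two fitted values is the location coefficient, whatever list price and taxes are chosen, because the model is additive.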

The results are consistent with those in Part I for three reasons. First, the p-value of the global F-test on this model is smaller than 0.05, so the global F-test is statistically significant, as it was for the full model.

Second, the three variables retained in this AIC model are the same as the statistically significant variables identified in Part I. Moreover, the coefficient estimates in the AIC model are very close to those in the full additive linear regression model.

Finally, the three variables in the AIC model are also individually statistically significant. (The values of the coefficients can be found in the original .Rmd.)

Part III

The linear model obtained by backward-elimination BIC method is:

\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1 list+\hat{\beta}_5 taxes+\hat{\beta}_7 location\]

where \(\hat{\beta}_0 = 7.621\times 10^{4}\), \(\hat{\beta}_1 = 0.8295\), \(\hat{\beta}_5 = 21.03\) and \(\hat{\beta}_7 = 1.32\times 10^{5}\).

The results are consistent with those in Parts I and II for three reasons. First, the p-value of the global F-test on this BIC model is smaller than 0.05, so the global F-test is statistically significant.

Second, the three variables retained in this BIC model are the same as the statistically significant variables identified in Parts I and II. Moreover, the coefficient estimates in the BIC model are very close to those in the full additive linear regression model and identical to those in the AIC model.

Finally, the three variables in the BIC model are also individually statistically significant. (The values of the coefficients can be found in the original .Rmd.)
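For reference, the two selection criteria can be written as follows for a linear model with normal errors, up to an additive constant that does not affect the comparison between models. Here \(n\) is the number of cases, \(p\) is the number of estimated coefficients and \(RSS\) is the residual sum of squares; this notation is introduced for illustration and does not appear in the output shown in this report.

\[AIC = n\ln\left(\frac{RSS}{n}\right)+2p, \qquad BIC = n\ln\left(\frac{RSS}{n}\right)+p\ln n\]

Because \(\ln n > 2\) whenever \(n \ge 8\), BIC penalizes each additional coefficient more heavily than AIC and so tends to keep fewer predictors. If the 10 rows listed in Table 2 are dropped from the 150 sampled cases, then \(n = 140\) and \(\ln 140 \approx 4.94\). That both criteria keep the same three predictors suggests that the removed variables add little under either penalty.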

IV. Discussions and Limitations

Diagnostic Plots and Interpretation

Figure 6. Diagnostic plots for backward BIC model

The first plot, Residuals vs Fitted, in Figure 6 shows the association between the residuals and the fitted values of the model. The overall trend is close to a horizontal line, i.e. there is no apparent association between these two quantities. From this plot, we may say the assumption of a linear relationship is mainly satisfied.

The second plot, Normal Q-Q, in Figure 6 can be used to examine whether the errors are normally distributed. If the standardized residuals are normally distributed, the points should be close to the theoretical quantile-quantile line (straight dotted line in Figure 6).

Since the points are quite close to the line except for the tail and the first few points, we can claim that the normal error MLR assumption is mainly satisfied.

The third plot, Scale-Location, in Figure 6 shows the relationship between \(\sqrt{|\text{Standardized Residuals}|}\) and the fitted values. It can be used to check the assumption of constant variance. The assumption is satisfied if no pattern is found (horizontal line) and the points are equally spread. Since there is a slightly increasing trend in this plot, the assumption of constant variance is violated to some extent.

The fourth plot is Residuals vs Leverage. This plot can be used to identify influential points. There are no outliers in this case.

Next Step

To move toward a full ‘final’ model, we can remedy the weaknesses of the current model. Since the assumption of constant variance is violated to some extent, we can transform the dependent variable or fit a generalized linear model.
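One possible form of the transformation remedy is sketched below. The log transformation, the error term \(\epsilon\) and the choice to keep the three predictors selected in Part III are all assumptions made for illustration; no such model was fitted in this report.

\[\ln(sale)=\beta_0+\beta_1 list+\beta_5 taxes+\beta_7 location+\epsilon\]

After refitting, the Scale-Location plot would show whether the increasing trend has flattened. Note that the coefficients of such a model describe changes in \(\ln(sale)\), so they are not directly comparable with the estimates reported above.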

Moreover, we discarded the cases with NAs in our analysis, but removing them reduces the statistical power of our model, and the NA values may carry information that is useful for our analysis. Hence, we can study what the NAs represent in our sample and how we can use the data in the original sample as fully as possible.

Furthermore, there are many more cases in the original dataset, so we can draw another sample and/or a larger sample and fit the MLR to it. Finally, we can compare the model from the new sample with our final model in the last part, and look for the model that best explains the data.