I. Exploratory Data Analysis

The dataset from the Toronto Real Estate Board (TREB) covers detached houses in two separate neighborhoods: one in the city of Toronto and another in the city of Mississauga.

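The variables in the dataset are ID (identifier of the property), sold (actual sale price in millions of Canadian dollars), list (last list price in millions of Canadian dollars), taxes (previous year’s property tax in Canadian dollars) and location (T for the Toronto neighborhood, M for the Mississauga neighborhood).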

A random sample of 200 properties is drawn from this dataset to build my models.

Table 1 gives an overview of the sample.

## # A tibble: 6 x 5
##      ID  sold  list   taxes location
##   <dbl> <dbl> <dbl>   <dbl> <chr>   
## 1    88 1.4   1.50  4973    T       
## 2   229 1.62  1.64     7.69 M       
## 3    52 1.16  1.15  3927    T       
## 4   161 1.06  1.05  5404    M       
## 5    76 0.875 0.895 3150    T       
## 6    74 1.15  1.14  3683    T

Table 1. Overview of the sample

Figure 1. Scatter plot between actual sale price (sold) and last list price (list) from the original sample (before cleaning)

## # A tibble: 2 x 5
##      ID  sold  list taxes location
##   <dbl> <dbl> <dbl> <dbl> <chr>   
## 1   112 1.08  85.0   4457 T       
## 2    95 0.672  6.80  2577 T

Table 2. Cases of influential points

A scatter plot can show the general relationship between two variables. Figure 1 shows the relationship between actual sale price (sold) and last list price (list) in the original sample (before cleaning). For most cases in the original sample, the actual sale price is very close to the list price. The mean ratio of actual sale price to list price, 1.0103775, supports this.
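The mean ratio quoted above is the average of sold / list over the sample. As a minimal sketch (Python is used here only for illustration, and the function name `mean_sold_to_list_ratio` is hypothetical), the calculation can be reproduced on the six rows shown in Table 1. Because only those six rows are used, the result will not equal the full-sample value of 1.0103775.

```python
# (ID, sold, list) copied from the six rows of Table 1; prices in millions of CAD.
rows = [
    (88, 1.4, 1.50),
    (229, 1.62, 1.64),
    (52, 1.16, 1.15),
    (161, 1.06, 1.05),
    (76, 0.875, 0.895),
    (74, 1.15, 1.14),
]


def mean_sold_to_list_ratio(rows):
    """Average of sold / list over all rows."""
    ratios = [sold / lst for _, sold, lst in rows]
    return sum(ratios) / len(ratios)


ratio = mean_sold_to_list_ratio(rows)
print(round(ratio, 4))
```

A ratio close to 1 means that, on average, houses sell for about their last list price.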

However, the two enlarged points in Figure 1 (listed in Table 2) are exceptions: for these two cases, the last list price of the property (list) is far higher than the actual sale price of the property (sold).

Thus, based on Figure 1, these two points are outliers: they do not follow the pattern set by the bulk of the data under a simple linear regression (SLR) model. In addition, since list is the explanatory variable (on the x-axis) in the SLR models in my analysis, these two points are also leverage points, i.e. outliers with respect to the explanatory variable. In particular, they are very likely to be influential points (“bad” leverage points), since the fitted models may change substantially once they are removed. Therefore, these two points are removed before further analysis.
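The two points were identified visually from Figure 1. A simple numeric screen that would flag the same cases is sketched below, under stated assumptions: the rule (list/sold ratio more than twice the median ratio), the cutoff `factor=2.0` and the function name `flag_extreme_list_prices` are all illustrative choices, not part of the analysis. The data are the six rows of Table 1 plus the two rows of Table 2.

```python
from statistics import median

# (ID, sold, list) in millions of CAD: six rows from Table 1, then the two rows from Table 2.
rows = [
    (88, 1.4, 1.50),
    (229, 1.62, 1.64),
    (52, 1.16, 1.15),
    (161, 1.06, 1.05),
    (76, 0.875, 0.895),
    (74, 1.15, 1.14),
    (112, 1.08, 85.0),
    (95, 0.672, 6.80),
]


def flag_extreme_list_prices(rows, factor=2.0):
    """Return IDs whose list/sold ratio exceeds `factor` times the median ratio.

    `factor` is an arbitrary illustrative cutoff, not a value from the analysis.
    """
    ratios = {pid: lst / sold for pid, sold, lst in rows}
    cutoff = factor * median(ratios.values())
    return [pid for pid, _, _ in rows if ratios[pid] > cutoff]


flagged = flag_extreme_list_prices(rows)
# Keep only the rows that were not flagged.
cleaned = [row for row in rows if row[0] not in flagged]
print(flagged, len(cleaned))
```

A rule like this is only a screen; whether a flagged point is influential still depends on how much the fitted line changes when it is removed.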

Figure 2. Scatter plot between actual sale price (sold) and last list price (list) from the sample after cleaning

In this scatter plot (5830 Actual Sale Price (millions of Canadian dollars) vs Last List Price (millions of Canadian dollars)), there is a clear positive relationship between the dependent variable sold (actual sale price in millions of Canadian dollars) and the independent variable list (last list price in millions of Canadian dollars).

Moreover, the spread of the dependent variable around the linear trend appears roughly constant (except for 2 points), and the strong linear relationship is visible even before fitting an SLR model by the least squares method.

Figure 3. Scatter plot between actual sale price (sold) vs taxes from the sample after cleaning

In this scatter plot (5830 Actual Sale Price (millions of Canadian dollars) vs Last List Price (millions of Canadian dollars)), there is a positive relationship between the dependent variable sold (actual sale price in millions of Canadian dollars) and the independent variable taxes (previous year’s property tax in Canadian dollars).

However, for the same amount of taxes, the sale price of a property tends to be higher in the Toronto neighborhood than in the Mississauga neighborhood. The variance of the dependent variable is not constant: it increases as taxes increases, so the scatter plot is fan-shaped.

II. Methods and Model

##               r_squared intercept  slope error_variance slope_p_value
## SQZ_model_all    0.9650    0.1485 0.9091        0.03193    1.175e-144
## SQZ_model_T      0.9510    0.1901 0.9047        0.02574     6.149e-70
## SQZ_model_M      0.9841    0.1395 0.8903        0.00482     8.826e-83
##               confit_lower_bound confit_upper_bound
## SQZ_model_all             0.8847             0.9335
## SQZ_model_T               0.8648             0.9446
## SQZ_model_M               0.8666             0.9139

Table 3. Summary of the three models (SQZ_model_all, SQZ_model_T and SQZ_model_M)

SQZ_model_all is a linear model fitted with data from both locations. The regression model is: \(y = 0.9091x + 0.1485\)

SQZ_model_T is a linear model fitted with data from the Toronto neighborhood. The regression model is: \(y = 0.9047x + 0.1901\)

SQZ_model_M is a linear model fitted with data from the Mississauga neighborhood. The regression model is: \(y = 0.8903x + 0.1395\)

In all three models, y (the dependent variable) is sold (actual sale price in millions of Canadian dollars) and x (the independent variable) is list (last list price in millions of Canadian dollars). Table 3 shows that all three models have similar slopes and intercepts. The p-value of the slope is far below the 5% significance level for each model, so we have strong evidence to reject the null hypothesis that the slope is 0 in every model.
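The quantities reported for each model (slope, intercept, \(R^2\), error variance) all follow from the least-squares formulas. The sketch below computes them in plain Python on the six rows of Table 1 only, so the numbers will differ from the fitted models above; the function name `fit_slr` and the returned keys are hypothetical.

```python
from math import sqrt


def fit_slr(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns the slope, intercept, R^2, the error variance estimate
    RSS / (n - 2), and the standard error of the slope.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    error_variance = rss / (n - 2)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - rss / syy,
        "error_variance": error_variance,
        "slope_se": sqrt(error_variance / sxx),
    }


# list and sold columns from the six rows of Table 1 (millions of CAD).
list_price = [1.50, 1.64, 1.15, 1.05, 0.895, 1.14]
sold_price = [1.4, 1.62, 1.16, 1.06, 0.875, 1.15]

fit = fit_slr(list_price, sold_price)
# Test statistic for H0: slope = 0.
t_stat = fit["slope"] / fit["slope_se"]
print(fit, t_stat)
```

The p-value of the slope is obtained by comparing `t_stat` with a t distribution with n - 2 degrees of freedom, which is omitted here because the Python standard library does not provide that distribution.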

Moreover, \(R^2\) gives the proportion of variation in the dependent variable sold that is explained by the regression line. The \(R^2\) value for SQZ_model_all is 0.9650095, which is a bit larger than the value for SQZ_model_T (0.9510096) and a bit smaller than the value for SQZ_model_M (0.9841358).

This is probably because the data from the Toronto neighborhood are more scattered (higher variance in the dependent variable), as shown in Figure 2, so these points have a larger impact on the model fitted to Toronto data only and a smaller impact on the model fitted to the whole sample. As a result, SQZ_model_all explains a higher percentage of the variation in the dependent variable than SQZ_model_T.

Furthermore, SQZ_model_M, which uses data from Mississauga only, explains more of the variation in the dependent variable than SQZ_model_all. Nonetheless, all three models explain a similar percentage of the variation in the dependent variable.

A pooled two-sample t-test can be used to determine whether there is a statistically significant difference between the slopes of the simple linear models for the two neighborhoods, but only if the two samples from the different neighborhoods satisfy certain assumptions.

The first assumption is that the two samples used to build the models are independent of each other. This is reasonable to assume because both subsets of the sample are generated randomly from the large dataset. However, the assumption may fail in practice once more is known about the housing markets and housing policies: if sale prices in one location in my sample affect sale prices in the other location through market mechanisms, then the two samples are not independent.

The second assumption is that the two populations, i.e. the sale prices of all properties in Toronto and in Mississauga, have the same variance. It is hard to verify whether this holds in reality. Hence, I believe a pooled two-sample t-test is not appropriate in this case.
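For reference, one common form of the pooled test for equal slopes is sketched below. The notation is an assumption and is not taken from the analysis: \(b_T\) and \(b_M\) are the fitted slopes, \(RSS_T\) and \(RSS_M\) the residual sums of squares, \(n_T\) and \(n_M\) the sample sizes, and \(SXX_T\) and \(SXX_M\) the sums of squared deviations of list from its mean in each neighborhood.

\[
s_p^2 = \frac{RSS_T + RSS_M}{n_T + n_M - 4}, \qquad
t = \frac{b_T - b_M}{\sqrt{s_p^2 \left( \frac{1}{SXX_T} + \frac{1}{SXX_M} \right)}}
\]

Under the null hypothesis of equal slopes, and if the two assumptions above hold together with normally distributed errors, \(t\) follows a t distribution with \(n_T + n_M - 4\) degrees of freedom. The pooled variance \(s_p^2\) is only meaningful when the error variance is the same in both neighborhoods, which is the assumption questioned above.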

III. Discussions and Limitations

I selected SQZ_model_all, which is built with all data from the cleaned sample in part II, for two reasons. First, the model uses data from both locations, which can avoid over-fitting to data from one location; hence it may be more useful in a larger context than the models built with data from only one location. Second, the p-value of its slope is the smallest among the three models, meaning it provides very strong evidence to reject the null hypothesis that the slope is 0.

We discussed four assumptions in class that are needed for valid statistical inference with an SLR model. The first assumption is that a simple linear model is appropriate in this case. The scatter plot (Figure 2) shows a clear linear trend in the data. Therefore, this assumption is not violated.

The second assumption is that the errors are uncorrelated. This would normally be checked against the design of the study, but we do not know how the data in the original dataset were collected. Nonetheless, it is reasonable to treat this assumption as satisfied, since the sample is randomly generated from the dataset.

The third assumption is that the errors have constant variance, which can be checked with the residual plot in Figure 4. There are many data points with fitted values between 0 and 2.5, and in that range the variance is relatively constant apart from a few points, because most residuals are concentrated around zero. However, as the fitted values increase, there are fewer data points and the residuals deviate more at the tail. Therefore, the assumption may be violated, given that the variance changes as the fitted values increase. A remedy, such as a transformation of the independent variable, may be necessary.

Figure 4. The residual plot for SQZ_model_all
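The constant-variance check described above compares the spread of the residuals across the range of fitted values. A rough numeric counterpart is sketched below, assuming a simple split of the fitted values into a lower and an upper half; the split rule and the function names `residuals_vs_fitted` and `spread_by_half` are illustrative and are not part of the analysis. Only the six rows of Table 1 are used, so the output says nothing about the full sample.

```python
from statistics import mean


def residuals_vs_fitted(x, y):
    """Fit y = intercept + slope * x by least squares; return fitted values and residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


def spread_by_half(fitted, residuals):
    """Mean squared residual for the lower and the upper half of the fitted values."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = mean(r ** 2 for _, r in pairs[:half])
    high = mean(r ** 2 for _, r in pairs[half:])
    return low, high


# list and sold columns from the six rows of Table 1 (millions of CAD).
list_price = [1.50, 1.64, 1.15, 1.05, 0.895, 1.14]
sold_price = [1.4, 1.62, 1.16, 1.06, 0.875, 1.15]

fitted, residuals = residuals_vs_fitted(list_price, sold_price)
low_spread, high_spread = spread_by_half(fitted, residuals)
print(low_spread, high_spread)
```

If the two values differ greatly on the full sample, that would be consistent with the changing variance seen in the residual plot.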

The fourth assumption is that the errors are normally distributed, which can be checked with the normal Q-Q plot of SQZ_model_all (Figure 5).

Figure 5. Normal Q-Q plot for SQZ_model_all

A normal Q-Q plot helps to check whether the standardized residuals are normally distributed. If they are, the points should lie close to the straight dotted line in Figure 5. Since the points are quite close to the line except at the tail and for the first few points, we can claim that the fourth assumption is largely satisfied. However, a transformation of the dependent variable or a generalized linear model could also be considered as a remedy for these data.
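The coordinates behind a normal Q-Q plot can be sketched as follows. This is a simplified illustration under stated assumptions: residuals are scaled by their sample standard deviation rather than by the leverage-adjusted formula usually used for standardized residuals, the plotting positions \((i + 0.5)/n\) are one common convention among several, and the function names are hypothetical. The data are the six rows of Table 1 only.

```python
from statistics import NormalDist, mean, stdev


def ols_residuals(x, y):
    """Residuals from a least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def qq_points(residuals):
    """Pairs (theoretical normal quantile, scaled residual), sorted by residual."""
    n = len(residuals)
    scale = stdev(residuals)
    sample = sorted(r / scale for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


# list and sold columns from the six rows of Table 1 (millions of CAD).
list_price = [1.50, 1.64, 1.15, 1.05, 0.895, 1.14]
sold_price = [1.4, 1.62, 1.16, 1.06, 0.875, 1.15]

points = qq_points(ols_residuals(list_price, sold_price))
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), round(sample_q, 3))
```

If the residuals are approximately normal, the printed pairs lie close to the line where the two coordinates are equal.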

Finally, I believe there are two potential numeric predictors that could be used to fit a multiple linear regression for sale price. The first is the area of the property (measured in square feet), because one may expect a larger property to be associated with a higher sale price. The second is the number of floors of the property: similarly, a house with more floors may be associated with a higher sale price.