I. Exploratory Data Analysis

An overview of the dataset

Table 1. Overview of dataset

Figure 1. Scatter plot between actual sale price (sold) and last list price (list) from the original sample (before cleaning)

## # A tibble: 2 x 5
##      ID  sold  list taxes location
##   <dbl> <dbl> <dbl> <dbl> <chr>   
## 1   112 1.08  85.0   4457 T       
## 2    95 0.672  6.80  2577 T

Table 2. Cases of influential points

A scatter plot shows the general relationship between two variables. Figure 1 shows the relationship between actual sale price (sold) and last list price (list) in the original sample (before cleaning). For most cases, the actual sale price is very close to the list price. The mean ratio of actual sale price to list price, 1.0103775, supports this.

However, the two cases shown as enlarged points in Figure 1 do not follow this pattern: for both, the last list price of the property (list) is far higher than its actual sale price (sold). The ratio of sold to list is 0.0127662 for case 112 and 0.0988381 for case 95.
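The ratio check can be reproduced directly from the printed table of influential cases. The sketch below is only illustrative: it uses the rounded values shown in that table, so its results agree with the reported ratios to about three decimal places rather than exactly, and the helper name `sale_to_list_ratio` is hypothetical.

```python
# Values copied from the table of influential cases (rounded as printed there).
cases = {
    112: {"sold": 1.08, "list": 85.0},
    95: {"sold": 0.672, "list": 6.80},
}


def sale_to_list_ratio(sold, list_price):
    """Return the actual sale price divided by the last list price."""
    return sold / list_price


ratios = {
    case_id: sale_to_list_ratio(values["sold"], values["list"])
    for case_id, values in cases.items()
}

for case_id, ratio in ratios.items():
    # A ratio near 1 means the property sold close to its list price.
    print(case_id, round(ratio, 4))
```

Both ratios are far below the sample mean ratio of about 1.01, which is what marks these two cases as unusual.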

Thus, based on Figure 1 and these ratios, the two points are outliers: they do not follow the pattern set by the bulk of the data under an SLR model. In addition, since list is the explanatory variable (on the x-axis) in the SLR models in my analysis, these two points are also leverage points, i.e. outliers with respect to the explanatory variable. In particular, they are very likely to be influential points (“bad” leverage points), because the fitted model may change substantially once they are removed. Therefore, these two points are removed before further analysis.
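To illustrate why a single bad leverage point matters, here is a minimal sketch on hypothetical data (not the report's sample). It fits an SLR by least squares with and without one extreme point and computes the usual SLR leverage \(h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}\). The function names `fit_slr` and `leverages` are made up for this example.

```python
def fit_slr(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverages(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / Sxx of each point in an SLR."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical data: five properties sold exactly at list price, plus one
# listed far above its sale price, loosely mimicking case 112.
list_price = [0.5, 1.0, 1.5, 2.0, 2.5, 85.0]
sold = [0.5, 1.0, 1.5, 2.0, 2.5, 1.08]

intercept_all, slope_all = fit_slr(list_price, sold)
intercept_clean, slope_clean = fit_slr(list_price[:-1], sold[:-1])
h = leverages(list_price)

print("slope with the extreme point:   ", round(slope_all, 3))
print("slope without the extreme point:", round(slope_clean, 3))
print("leverage of the extreme point:  ", round(h[-1], 3))
```

In this toy example the slope collapses towards zero when the extreme point is included and returns to 1 once it is dropped, which is the behaviour that makes such a point influential.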

Figure 2. Scatter plot between actual sale price (sold) and last list price (list) from the sample after cleaning

In this scatter plot, 5830 Actual Sale Price (millions of Canadian dollars) vs Last List Price (millions of Canadian dollars), there is a clear positive relationship between the dependent variable sold (actual sale price in millions of Canadian dollars) and the independent variable list (last list price in millions of Canadian dollars).

Moreover, the spread of the dependent variable around the linear trend is roughly constant (except for 2 points), and the strong linear relationship is visible in the graph even before an SLR model is fitted by the least squares method.

Figure 3. Scatter plot between actual sale price (sold) vs taxes from the sample after cleaning

In this scatter plot of actual sale price against taxes (Figure 3), one can identify a positive relationship between the dependent variable sold (actual sale price in millions of Canadian dollars) and the independent variable taxes (previous year’s property tax in Canadian dollars).

However, for the same amount of taxes, the sale price of a property tends to be higher in Toronto Neighborhood than in Mississauga Neighborhood. The variance of the dependent variable is not constant: it increases as taxes increases. As a result, the scatter plot is fan-shaped.

II. Methods and Model

##               r_squared intercept     slope error_variance slope_p_value
## SQZ_model_all 0.9650095 0.1485496 0.9090940    0.031934783 1.174933e-144
## SQZ_model_T   0.9510096 0.1901410 0.9046939    0.025739883  6.149449e-70
## SQZ_model_M   0.9841358 0.1395219 0.8902733    0.004820206  8.826153e-83
##               confit_lower_bound confit_upper_bound
## SQZ_model_all          0.8847088          0.9334793
## SQZ_model_T            0.8647657          0.9446220
## SQZ_model_M            0.8666027          0.9139439

Table 3. Summary of the three models (SQZ_model_all, SQZ_model_T and SQZ_model_M)

SQZ_model_all is a linear model fitted with data from both locations. The regression model is: \(y = 0.9091x + 0.1485\)

SQZ_model_T is a linear model fitted with data in Toronto Neighborhood. The regression model is: \(y = 0.9047x + 0.1901\)

SQZ_model_M is a linear model fitted with data in Mississauga Neighborhood. The regression model is: \(y = 0.8903x + 0.1395\)

In all three models, y (the dependent variable) is sold (actual sale price in millions of Canadian dollars) and x (the independent variable) is list (last list price in millions of Canadian dollars).

\(R^2\) gives the proportion of variation in the dependent variable sold that is explained by the regression line. The \(R^2\) value for SQZ_model_all is 0.9650095, which is a bit larger than the value for SQZ_model_T (0.9510096) and a bit smaller than the value for SQZ_model_M (0.9841358). This is probably because the data from Toronto Neighborhood are more scattered (higher variance in the dependent variable), as shown in Figure 2, so these data points have a larger impact on the model fitted only to Toronto Neighborhood data and a smaller impact on the model fitted to all data in the sample. As a result, SQZ_model_all explains a higher proportion of the variation in the dependent variable than SQZ_model_T.

Furthermore, since the \(R^2\) value for SQZ_model_all (0.9650095) is smaller than that for SQZ_model_M, SQZ_model_M explains more of the variation in the dependent variable than SQZ_model_all. Nonetheless, all three models explain a similar proportion of the variation in the dependent variable.
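For reference, the quantity being compared can be written out. This is the standard definition of \(R^2\) for a least-squares fit; the symbols \(y_i\), \(\hat{y}_i\) and \(\bar{y}\) (observed, fitted and mean values of sold over the \(n\) cases used to fit a model) are notation introduced here, not taken from the model output:

\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

The numerator is the residual sum of squares and the denominator is the total sum of squares, so a model whose points scatter more widely around its fitted line, relative to the total variation in sold, has a lower \(R^2\).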

A pooled two-sample t-test can be used to determine whether there is a statistically significant difference between the slopes of the simple linear models for the two neighborhoods only if the two samples from the different neighborhoods satisfy certain assumptions.

The first assumption is that the two samples used to build the models are independent of each other. It is reasonable to assume this because both subsets of the sample are drawn randomly from the large dataset. However, the assumption may not hold in practice once more is known about the housing markets and housing policies. In other words, if the sale prices of properties in one location affect the sale prices in the other location through market mechanisms, then the two samples are not independent.

The second assumption is that the two populations, i.e. the sale prices of all properties in Toronto and in Mississauga, have the same variance. It is hard to verify whether this holds in reality. Hence, I believe a pooled two-sample t-test is not appropriate in this case.
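One informal way to look at the equal-variance question is to compare the estimated error variances of the two neighborhood models reported in the model summary table. The sketch below is an assumption-laden illustration, not a formal test: the error variance of a fitted model is not the same quantity as the population variance of sale prices, and no threshold for "too different" is implied.

```python
# Estimated error variances copied from the model summary table.
error_variance = {
    "SQZ_model_T": 0.025739883,
    "SQZ_model_M": 0.004820206,
}

# Ratio of the larger to the smaller estimate; a value near 1 would be
# consistent with equal variances, a large value would not.
variance_ratio = error_variance["SQZ_model_T"] / error_variance["SQZ_model_M"]
print(round(variance_ratio, 2))  # prints 5.34
```

The Toronto estimate is roughly five times the Mississauga estimate, which is at least consistent with being cautious about pooling the two samples.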

III. Discussions and Limitations

I select SQZ_model_all, the model fitted in part II with all data from the cleaned sample, for two reasons. First, the model uses data from both locations, so it may be more useful in a larger context than the models fitted to data from only one location, and it avoids over-fitting to a single location. Second, the p-value of its slope is the smallest among the three models, which is very strong evidence against the null hypothesis that the slope of this model is 0.

We discussed in class four assumptions needed to make valid statistical inference with an SLR model. The first assumption is that a simple linear model is appropriate in this case. The scatter plot (Figure 2) shows a clear linear trend in the data, so this assumption does not appear to be violated.

The second assumption is that the errors are uncorrelated. This would normally be checked against the design of the study, but we do not know how the data in the original dataset were collected. Nonetheless, the assumption is plausibly satisfied since the sample is randomly drawn from the dataset.

The third assumption is that the errors have constant variance, which can be checked with the residual plot in Figure 4. There are many data points with fitted values between 0 and 2.5, and the variance there is relatively constant, apart from a few points, because most residuals are concentrated around the horizontal zero line. However, as the fitted values increase, there are fewer data points and the residuals deviate more. Therefore, the assumption may be violated, given that the variance changes as the fitted values increase. A remedy, such as a variance-stabilizing transformation, may be necessary.

Figure 4. The residual plot for SQZ_model_all
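The visual check of the residual plot can be complemented by a rough numeric one. The sketch below, on hypothetical fitted values and residuals (not those of SQZ_model_all), orders the cases by fitted value and compares the sample variance of the residuals in the lower and upper halves. It is a crude stand-in for reading the plot, not a formal test, and `residual_variance_by_half` is a name made up for this example.

```python
from statistics import variance


def residual_variance_by_half(fitted, residuals):
    """Sample variance of residuals in the lower and upper half of fitted values."""
    pairs = sorted(zip(fitted, residuals))  # order cases by fitted value
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return variance(low), variance(high)


# Hypothetical values: residuals are tight for small fitted values and
# spread out for large ones, as in a fan-shaped residual plot.
fitted = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.02, -0.03, 0.01, -0.02, 0.15, -0.20, 0.25, -0.30]

var_low, var_high = residual_variance_by_half(fitted, residuals)
print("variance, lower half of fitted values:", round(var_low, 4))
print("variance, upper half of fitted values:", round(var_high, 4))
```

A much larger variance in the upper half than in the lower half is the numeric counterpart of residuals fanning out as the fitted values increase.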

The fourth assumption is that the errors are normally distributed, which can be checked with the normal Q-Q plot of SQZ_model_all shown below.

Figure 5. Normal Q-Q plot for SQZ_model_all

A normal Q-Q plot helps to assess whether the standardized residuals are normally distributed. If they are, the points should lie close to the straight dotted line in Figure 5. Since the points are quite close to the line, except for the tail and the first few points, we can claim that the fourth assumption is mostly satisfied. However, given these departures, a remedy such as a transformation of the dependent variable or a generalized linear model could also be considered.
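The construction behind a normal Q-Q plot can be sketched in a few lines. This is a simplified illustration on hypothetical residuals: the residuals are standardized as plain z-scores (without the leverage adjustment that regression software typically applies), and the plotting positions \((i - 0.5)/n\) are one common convention among several. The function name `qq_points` is made up for this example.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair theoretical standard-normal quantiles with sorted z-scored residuals."""
    n = len(residuals)
    centre = mean(residuals)
    scale = stdev(residuals)
    standardized = sorted((r - centre) / scale for r in residuals)
    # Theoretical quantiles at plotting positions (i - 0.5) / n, i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


# Hypothetical residuals with one large value in each tail.
residuals = [-0.30, -0.05, -0.02, 0.00, 0.01, 0.03, 0.33]

points = qq_points(residuals)
for theoretical_q, sample_q in points:
    # Points close to the line sample_q = theoretical_q suggest normality.
    print(round(theoretical_q, 3), round(sample_q, 3))
```

If the residuals were exactly normal, the printed pairs would lie close to a straight line; systematic departures at either end correspond to the tail behaviour described above.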

Finally, I believe two further numeric predictors could be used to fit a multiple linear regression for sale price. The first is the area of the property (measured in square feet), because one may expect a larger property to be associated with a higher sale price. The second is the number of floors of the property. Similarly, a house with more floors may be associated with a higher sale price.