Assignment: 1.4 Finding association II

a. Variability of house prices is complex and likely to be explained by many different factors. Construct a multiple linear regression here and examine whether adding Suburb as a predictor improves the prediction. Notice that Suburb is a categorical variable. Briefly describe how to interpret the regression coefficients returned by lm and create an appropriate visualisation for the extended model.

b. There are many other variables in the data; consider whether adding the number of car spaces as a predictor improves the prediction model.

Solution:

Step 1: Loading the data

melb_data <- read.csv("Melbourne_housing_FULL.csv")
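
A quick sanity check of the loaded data (a minimal sketch, assuming the CSV sits in the working directory; only the columns used in the models below are shown):

dim(melb_data)
head(melb_data[, c("Suburb", "Price", "BuildingArea", "Car")])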

Step 2: Processing the subset (keeping only three suburbs: Brunswick, Craigieburn and Hawthorn)

library(dplyr)
melb_data_sub <- melb_data %>% dplyr::filter(Suburb %in% c("Brunswick", "Craigieburn", "Hawthorn"))
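
As a quick check that the subset keeps only the three suburbs of interest:

table(melb_data_sub$Suburb)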

Step 3: Building a simple linear regression (Price ~ BuildingArea), used as a baseline for comparison

lm.fit <- lm(Price/1000 ~ BuildingArea, data=melb_data_sub)
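
The baseline fit statistics quoted in the comparison below (multiple R-squared 0.1725, residual standard error 730, F-statistic 86.69) come from this model's summary:

summary(lm.fit)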

Step 4: Building a multiple linear regression (Price ~ BuildingArea + Suburb)

multi.fit1 <- lm(Price/1000 ~ factor(Suburb) + BuildingArea, data=melb_data_sub)
summary(multi.fit1)
## 
## Call:
## lm(formula = Price/1000 ~ factor(Suburb) + BuildingArea, data = melb_data_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4074.1  -248.5   -26.7   166.0  5479.9 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                509.7567    62.4510   8.163 3.99e-15 ***
## factor(Suburb)Craigieburn -660.2773    72.9115  -9.056  < 2e-16 ***
## factor(Suburb)Hawthorn     400.7159    72.5435   5.524 5.87e-08 ***
## BuildingArea                 4.4354     0.3469  12.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 609.9 on 414 degrees of freedom
##   (709 observations deleted due to missingness)
## Multiple R-squared:  0.425,  Adjusted R-squared:  0.4208 
## F-statistic:   102 on 3 and 414 DF,  p-value: < 2.2e-16
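
To see how lm handles the categorical Suburb predictor, one can inspect the design matrix: lm applies treatment (dummy) coding, with the alphabetically first level, Brunswick, as the reference. A minimal sketch:

# Dummy (treatment) coding used by lm: Brunswick is the reference level
levels(factor(melb_data_sub$Suburb))
head(model.matrix(multi.fit1))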

Step 5: Creating an appropriate visualisation for the above model.

Visualisation 1: Scatter Plot and Regression Lines

library(ggplot2)
melb_data_sub$predicted_price <- predict(multi.fit1, newdata = melb_data_sub)
ggplot(melb_data_sub, aes(x = BuildingArea, y = Price/1000, color = factor(Suburb))) +
  geom_point() +
  geom_line(aes(y = predicted_price)) +
  labs(x = "Building Area", y = "Price (in 1000s of $AUD)", title = "Price vs Building Area by Suburb") +
  scale_x_continuous(limits = c(0, 600))+
  scale_y_continuous(limits = c(0, 3500))+
  theme_minimal()
## Warning: Removed 718 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 627 rows containing missing values or values outside the scale range
## (`geom_line()`).

Visualisation 2: Estimates of model coefficients and their confidence intervals

library(broom)
coef_df <- tidy(multi.fit1, conf.int = TRUE)  # conf.int = TRUE returns 95% confidence limits
ggplot(coef_df, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  xlab("Terms") + ylab("Estimates") + ggtitle("Coefficient Estimates with 95% Confidence Intervals")

Visualisation 3: Diagnostic Charts

library(ggfortify)
autoplot(multi.fit1, which = 1:6)
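
If ggfortify is not available, the same six diagnostic plots can be drawn with base R (a minimal equivalent sketch):

# Base-R diagnostic plots for the fitted model
par(mfrow = c(2, 3))
plot(multi.fit1, which = 1:6)
par(mfrow = c(1, 1))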

(1) Answer to whether the prediction is improved or not by adding Suburb as a predictor:

Firstly, we can see that the R-squared value increases substantially from 0.1725 to 0.425 after adding the Suburb variable, indicating that the model's ability to explain the variation in prices is enhanced.

Meanwhile, the adjusted R-squared value also improves from 0.1705 to 0.4208, indicating that the gain in explanatory power holds even after accounting for the added model complexity.

In addition, the residual standard error decreases from 730 to 609.9, indicating improved predictive accuracy.

Finally, the F-statistic increases from 86.69 to 102, with a p-value below 2.2e-16, so the extended model remains highly statistically significant overall.

In summary, adding Suburb as a categorical predictor produces a clear improvement in the overall performance of the model, indicating that Suburb is an important factor in predicting house prices. A formal nested-model comparison can check this directly, as sketched below.
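
A minimal sketch of that comparison, assuming both models are fitted to the same rows (Suburb is assumed to have no missing values in this subset, so both fits drop the same rows with missing Price or BuildingArea):

# Nested-model F test: does adding Suburb significantly reduce the residual variance?
anova(lm.fit, multi.fit1)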

(2) Description of coefficients:

  1. Statistical Significance:

All coefficients are statistically significant with p-values less than 0.001, marked by '***', which denotes a strong rejection of the null hypothesis that each coefficient is equal to zero.

  2. Results: \[ \text{Price}/1000 = 509.7567 - 660.2773 \times \text{Suburb}_{\text{Craigieburn}} + 400.7159 \times \text{Suburb}_{\text{Hawthorn}} + 4.4354 \times \text{BuildingArea} \]

A. The intercept is estimated at 509.7567 (in thousands, about A$509,757). It can be interpreted as the expected price of a house in the reference suburb, Brunswick, with a building area of zero; this is an extrapolation rather than a realistic price.

B. Craigieburn has a coefficient of -660.2773, indicating that, holding building area constant, houses in Craigieburn are on average about A$660,277 cheaper than those in Brunswick. Conversely, Hawthorn has a positive coefficient of 400.7159, suggesting that, for the same building area, houses in Hawthorn are on average about A$400,716 more expensive than those in Brunswick.

C. The coefficient for ‘BuildingArea’ is 4.4354, indicating that, within the same suburb, each additional square metre of building area is associated with an expected price increase of about A$4,435 (4.4354 thousand).
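
To illustrate how the fitted equation is used, here is a minimal sketch for a hypothetical 150 m² house in Hawthorn (the house is an invented example; the figures are the coefficients reported above):

# Predicted price (in thousands of AUD) for a hypothetical 150 m^2 house in Hawthorn
new_house <- data.frame(Suburb = "Hawthorn", BuildingArea = 150)
predict(multi.fit1, newdata = new_house)
# By hand: 509.7567 + 400.7159 + 4.4354 * 150 = 1575.78, i.e. roughly A$1.58 million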

Step 6: Building another multiple linear regression (Price ~ BuildingArea + Suburb + Car)

multi.fit2 <- lm(Price/1000 ~ BuildingArea + Suburb + Car, data=melb_data_sub)
summary(multi.fit2)
## 
## Call:
## lm(formula = Price/1000 ~ BuildingArea + Suburb + Car, data = melb_data_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3417.3  -273.4   -59.2   252.5  5001.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        333.5607    67.0114   4.978 9.56e-07 ***
## BuildingArea         3.7617     0.3513  10.708  < 2e-16 ***
## SuburbCraigieburn -781.1077    73.7764 -10.588  < 2e-16 ***
## SuburbHawthorn     363.4368    71.1395   5.109 5.02e-07 ***
## Car                220.7405    34.6168   6.377 4.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 588.5 on 402 degrees of freedom
##   (720 observations deleted due to missingness)
## Multiple R-squared:  0.477,  Adjusted R-squared:  0.4718 
## F-statistic: 91.66 on 4 and 402 DF,  p-value: < 2.2e-16

(1) Visualisation: Estimates of model coefficients and their confidence intervals

coef_df <- tidy(multi.fit2, conf.int = TRUE)  # conf.int = TRUE returns 95% confidence limits
ggplot(coef_df, aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
  xlab("Terms") + ylab("Estimates") + ggtitle("Coefficient Estimates with 95% Confidence Intervals")

(2) Answer to whether the prediction is improved or not by adding the number of car spaces as a predictor:

Firstly, when we added the ‘Car’ variable, the R-squared value increased from 0.425 to 0.477, enhancing the model’s explanatory power. The adjusted R-squared value rose from 0.4208 to 0.4718, indicating an improved ability to explain variation.

Meanwhile, the residual standard error decreases from 609.9 to 588.5, implying improved predictive accuracy. Although the F-statistic decreases slightly, this is expected when additional predictors are included. The F-statistic remains high, and with a p-value below 2.2e-16 the model is statistically significant as a whole. Note that the two models are fitted on slightly different subsets (709 versus 720 rows deleted due to missingness), so a like-for-like comparison would refit both on the common complete cases, as sketched below.
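
A minimal sketch of that like-for-like comparison (assuming complete.cases on Price, BuildingArea, Suburb and Car identifies the common rows):

# Refit both models on rows complete for all predictors, then compare with a nested F test
common_rows <- melb_data_sub[complete.cases(melb_data_sub[, c("Price", "BuildingArea", "Suburb", "Car")]), ]
fit_no_car   <- lm(Price/1000 ~ BuildingArea + Suburb, data = common_rows)
fit_with_car <- lm(Price/1000 ~ BuildingArea + Suburb + Car, data = common_rows)
anova(fit_no_car, fit_with_car)   # tests whether adding Car significantly improves the fit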

After adding the ‘Car’ variable, its estimated coefficient is 220.7405 with a standard error of 34.6168, and its p-value is 4.98e-10, far below the 0.001 significance level, so the ‘Car’ variable is statistically significant. Since the response is Price/1000, each additional car space is associated with an expected price increase of about A$220,740, holding suburb and building area constant.

In summary, the overall performance of the model does improve with the addition of the ‘Car’ variable. This suggests that the number of car spaces is an important factor in predicting house prices.