tutorial2

Question 1. Melbourne House Prices Regression Model

1.1 Load the data

mel_data <- read.csv("Melbourne_housing_FULL.csv", header = TRUE)

1.2 Initial data analysis

We will need to subset the data to only look at 3 suburbs - Brunswick, Craigieburn and Hawthorn. Similar to lab 1, start the data analysis by generating some quantitative and graphical summaries. For example, determine the average price in each of these three suburbs. Explore more summaries of the data.

For the following questions, use the subsetted data for the Suburbs of Brunswick, Craigieburn and Hawthorn.

mel_sub = subset(mel_data, Suburb=="Hawthorn"|Suburb=="Brunswick"|Suburb=="Craigieburn")
mel_sub_na = mel_sub[!is.na(mel_sub$Price), ]

# Average price (in thousands) for the three suburbs combined and for each suburb
local_price = c(mean(mel_sub_na$Price)/1000, 
                mean(mel_sub_na[mel_sub_na$Suburb=="Hawthorn",]$Price)/1000,
                mean(mel_sub_na[mel_sub_na$Suburb=="Brunswick",]$Price)/1000,
                mean(mel_sub_na[mel_sub_na$Suburb=="Craigieburn",]$Price)/1000)
barplot(local_price, names.arg=c("All three","Hawthorn","Brunswick","Craigieburn"), xlab = "Suburb", ylab = "Average Price (K)")
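
The per-suburb averages can also be computed in a single call instead of one subset per suburb. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for `mel_sub_na`; the prices are invented for illustration and are not taken from the data set.

```r
# Hypothetical stand-in for mel_sub_na; the values are invented.
toy <- data.frame(
  Suburb = c("Hawthorn", "Hawthorn", "Brunswick", "Brunswick", "Craigieburn"),
  Price  = c(1500000, 1300000, 900000, 1100000, 500000)
)

# Mean price (in thousands) per suburb, computed in one pass.
avg_by_suburb <- tapply(toy$Price, toy$Suburb, mean) / 1000
print(avg_by_suburb)

# Five-number summary plus mean of the price within each suburb.
print(aggregate(Price ~ Suburb, data = toy, FUN = summary))
```

With the real data, replacing `toy` by `mel_sub_na` should give the three suburb bars of the plot directly.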

1.3 Finding association I

To examine the association between house prices and a single variable, start by constructing a simple linear regression using only BuildingArea as a predictor. Use an appropriate statistic to justify the goodness of fit of the prediction and create a graphical output to enable you to assess your model fit.

Note: you might consider other variables too.

model1 = lm(Price/1e6 ~ BuildingArea, mel_sub)
summary(model1)

## 
## Call:
## lm(formula = Price/1e+06 ~ BuildingArea, data = mel_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4218 -0.4639 -0.1480  0.2591  6.0524 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5181921  0.0665956   7.781 5.74e-14 ***
## BuildingArea 0.0038007  0.0004082   9.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.73 on 416 degrees of freedom
##   (709 observations deleted due to missingness)
## Multiple R-squared:  0.1725, Adjusted R-squared:  0.1705 
## F-statistic: 86.69 on 1 and 416 DF,  p-value: < 2.2e-16

x = mel_sub_na$BuildingArea
y = mel_sub_na$Price/1e6
plot(x, y, xlab = "BuildingArea", ylab = "Price (Million)")
abline(model1, col = "orange", lty = "dotted", lwd = 3)

# check whether the residuals of model1 are approximately normal
qqnorm(resid(model1))
qqline(resid(model1))

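Note: The fitted intercept is 0.5181921 and the BuildingArea slope is 0.0038007, so each extra square metre of building area is associated with an increase of about 0.0038 million in price. The multiple \(R^2\) is 17.25% and the adjusted \(R^2\) is 17.05%, so BuildingArea alone explains only a small part of the variation in price. However, the scatter plot suggests two separate trends in the data rather than a single line.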
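
The \(R^2\) reported by summary is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces that calculation by hand. It uses R's built-in `cars` data set as a stand-in, since the housing file is not assumed to be available here.

```r
# Stand-in example on the built-in cars data, not the housing data.
fit <- lm(dist ~ speed, data = cars)

rss <- sum(resid(fit)^2)                      # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_by_hand <- 1 - rss / tss

# The hand-computed value should agree with the one reported by summary().
print(c(by_hand = r2_by_hand, from_summary = summary(fit)$r.squared))
```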

1.4 Finding association II

(a). Variability in house prices is complex and likely to be explained by many different factors. Construct a multiple linear regression here and examine whether adding Suburb as a predictor improves the prediction. Notice that Suburb is a categorical variable. Briefly describe how to interpret the regression coefficients returned by lm.

model2 = lm(Price/1e6 ~ BuildingArea + Suburb, mel_sub_na)
summary(model2)

## 
## Call:
## lm(formula = Price/1e+06 ~ BuildingArea + Suburb, data = mel_sub_na)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0741 -0.2485 -0.0267  0.1660  5.4799 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.5097567  0.0624510   8.163 3.99e-15 ***
## BuildingArea       0.0044354  0.0003469  12.786  < 2e-16 ***
## SuburbCraigieburn -0.6602773  0.0729115  -9.056  < 2e-16 ***
## SuburbHawthorn     0.4007159  0.0725435   5.524 5.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6099 on 414 degrees of freedom
##   (490 observations deleted due to missingness)
## Multiple R-squared:  0.425,  Adjusted R-squared:  0.4208 
## F-statistic:   102 on 3 and 414 DF,  p-value: < 2.2e-16

# analyse model2 graphically
#plot(model2)

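Note: Suburb is a categorical variable, so lm encodes it with dummy variables and takes the first level, Brunswick, as the baseline. The fitted line for a Brunswick house is therefore Y = 0.5097567 + 0.0044354X, where X is BuildingArea and Y is the price in millions. The SuburbCraigieburn and SuburbHawthorn coefficients shift the intercept relative to Brunswick: for the same building area, a Craigieburn house is predicted to be about 0.66 million cheaper and a Hawthorn house about 0.40 million dearer. The residual standard error estimates the typical size of the residuals, that is, how far the observed prices fall from the predicted prices. The larger it is, the less accurate the prediction. Its size depends on the distribution of the data and on the choice of regression equation.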
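
Because Brunswick has no coefficient of its own in the output, it acts as the baseline level, and the two Suburb coefficients are offsets from it. The sketch below copies the estimates from the summary(model2) output above and works out one intercept per suburb. The 150 sqm house is a hypothetical example value.

```r
# Coefficients copied from the summary(model2) output (price in millions).
intercept   <- 0.5097567    # baseline suburb: Brunswick
slope_area  <- 0.0044354    # change in price per extra sqm of BuildingArea
craigieburn <- -0.6602773   # shift relative to Brunswick
hawthorn    <- 0.4007159    # shift relative to Brunswick

# Each suburb gets its own intercept but shares the same slope.
suburb_intercepts <- c(
  Brunswick   = intercept,
  Craigieburn = intercept + craigieburn,
  Hawthorn    = intercept + hawthorn
)
print(suburb_intercepts)

# Fitted price (in millions) of a hypothetical 150 sqm house in each suburb.
print(suburb_intercepts + slope_area * 150)
```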

R-squared (range 0 to 1) describes the proportion of the variation in the output variable that is explained by the input variables. In univariate linear regression, the larger the R-squared, the better the fit. However, whenever more variables are added, R-squared either stays the same or increases, regardless of whether the added variables are related to the output variable. The adjusted R-squared is therefore needed: it adds a penalty for additional variables that do not improve the model.
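
The penalty can be made concrete with the usual formula, adjusted \(R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)\), where n is the number of observations used and p the number of predictor columns excluding the intercept. The sketch below applies it to the rounded values printed for model1 and model2. The helper name `adjusted_r2` is introduced here for illustration only.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
# (p excludes the intercept). Hypothetical helper, not part of base R.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# model1: 416 residual df + 2 coefficients = 418 observations, 1 predictor.
adj1 <- adjusted_r2(0.1725, n = 418, p = 1)
print(round(adj1, 4))

# model2: 414 residual df + 4 coefficients = 418 observations, 3 predictor columns.
adj2 <- adjusted_r2(0.425, n = 418, p = 3)
print(round(adj2, 4))
```

Both results should match the adjusted values in the corresponding summary outputs up to rounding.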

The residual standard error of model2 is 0.6099, smaller than the 0.73 of model1. The multiple \(R^2\) of model1 is 17.25% (17.05% adjusted), whereas that of model2 is 42.5% (42.08% adjusted), which is much larger. Adding Suburb as a predictor therefore improves the prediction.

(b). There are many other variables in the data. Consider whether adding the number of car spaces as a predictor improves the prediction model.

model3 = lm(Price/1e6 ~ BuildingArea + Suburb + Car, mel_sub_na)
summary(model3)

## 
## Call:
## lm(formula = Price/1e+06 ~ BuildingArea + Suburb + Car, data = mel_sub_na)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4173 -0.2734 -0.0592  0.2525  5.0017 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.3335607  0.0670114   4.978 9.56e-07 ***
## BuildingArea       0.0037617  0.0003513  10.708  < 2e-16 ***
## SuburbCraigieburn -0.7811077  0.0737764 -10.588  < 2e-16 ***
## SuburbHawthorn     0.3634368  0.0711395   5.109 5.02e-07 ***
## Car                0.2207405  0.0346168   6.377 4.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5885 on 402 degrees of freedom
##   (501 observations deleted due to missingness)
## Multiple R-squared:  0.477,  Adjusted R-squared:  0.4718 
## F-statistic: 91.66 on 4 and 402 DF,  p-value: < 2.2e-16

After adding the “Car” factor, the residual standard error of model3 is 0.5885, only slightly below the 0.6099 of model2. The \(R^2\) rises from 42.5% for model2 to 47.7% for model3 (adjusted: from 42.08% to 47.18%). So the “Car” factor has a positive and statistically significant effect on the prediction, but the improvement is modest.
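
One caveat when comparing the two fits: model2 was fitted to 418 observations and model3 to 407 (residual degrees of freedom plus number of coefficients), because rows with a missing Car value are dropped. A more careful comparison fits both models to the same rows and uses a partial F-test. The sketch below shows the idea on the built-in `mtcars` data as a stand-in. The variable choice is an assumption made for illustration only.

```r
# Stand-in example on the built-in mtcars data, not the housing data.
# Keep only rows that are complete for every variable used by the larger
# model, so both models are fitted to exactly the same observations.
vars <- c("mpg", "wt", "cyl", "hp")
dat  <- mtcars[complete.cases(mtcars[, vars]), ]

m_small <- lm(mpg ~ wt + factor(cyl), data = dat)
m_big   <- lm(mpg ~ wt + factor(cyl) + hp, data = dat)

# Partial F-test: does the extra predictor reduce the residual sum of
# squares by more than chance alone would explain?
cmp <- anova(m_small, m_big)
print(cmp)

# Adjusted R-squared of both models, side by side.
print(c(small = summary(m_small)$adj.r.squared,
        big   = summary(m_big)$adj.r.squared))
```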

1.5 Impact of outliers

Model construction can be affected by unwanted variation and noise such as outliers. For example, houses with very small building areas of 5sqm and lower and larger places over 300 sqm look like outliers. How would you assess the impact of outliers?

mel_sub_out_lier = subset(mel_sub_na, BuildingArea > 5 & BuildingArea < 300)
model4 = lm(data = mel_sub_out_lier, Price/1e6 ~ BuildingArea + Suburb + Car)
summary(model4)

## 
## Call:
## lm(formula = Price/1e+06 ~ BuildingArea + Suburb + Car, data = mel_sub_out_lier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6099 -0.2751 -0.0353  0.2069  4.8100 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.0131493  0.0625437   0.210    0.834    
## BuildingArea       0.0082639  0.0004623  17.874  < 2e-16 ***
## SuburbCraigieburn -0.8941636  0.0587542 -15.219  < 2e-16 ***
## SuburbHawthorn     0.2812750  0.0551611   5.099 5.37e-07 ***
## Car                0.0496499  0.0285624   1.738    0.083 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4455 on 384 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.6019, Adjusted R-squared:  0.5978 
## F-statistic: 145.2 on 4 and 384 DF,  p-value: < 2.2e-16

plot(model4)

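The residual standard error of model4 is 0.4455, much smaller than the 0.73 of model1 and also below the 0.5885 of model3, which uses the same predictors on the unfiltered data. The \(R^2\) of model4 is 60.19% (59.78% adjusted), much larger than the 17.25% of model1 and above the 47.7% of model3. Likewise, the F-statistic of model4 is 145.2, compared with 86.69 for model1. Removing the outliers also changes the coefficients: the BuildingArea slope rises from 0.0037617 to 0.0082639, and Car is no longer significant at the 5% level (p = 0.083). In the “Residuals vs Fitted” plot for model4, the red line is flatter than for model1, and the “Normal Q-Q” plot shows that the residuals are closer to a normal distribution.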
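
Besides refitting on a filtered data set, the influence of individual observations can be measured with Cook's distance. The sketch below uses simulated data, not the housing data: 20 points close to a line plus one deliberately extreme point, so that the effect is easy to see.

```r
# Simulated data (not the housing data): 20 points near the line y = 2x
# plus one deliberately extreme point at position 21.
set.seed(1)
x <- c(1:20, 60)
y <- c(2 * (1:20) + rnorm(20), 0)

fit_all   <- lm(y ~ x)
fit_clean <- lm(y ~ x, subset = seq_along(x) != 21)   # refit without point 21

# Cook's distance measures how much each observation moves the fitted values.
cd <- cooks.distance(fit_all)
most_influential <- which.max(cd)
print(most_influential)

# Compare the coefficients with and without the extreme point.
print(rbind(all = coef(fit_all), clean = coef(fit_clean)))
```

A large change in the coefficients between the two fits, as between model3 and model4, is a sign that a few observations dominate the estimate.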

1.6 Prediction

Predict the price of a house in Hawthorn with 2 car spaces and 100 sqm in building area. What is the 95% confidence interval of your prediction value?

Suburb = c("Hawthorn")
Car = c(2)
BuildingArea = c(100)
newdata = data.frame(Suburb, Car, BuildingArea)
predict(model4, newdata, level = 0.95, interval = "confidence")

##        fit      lwr      upr
## 1 1.220112 1.122321 1.317904
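
The fitted value can be checked by hand from the model4 coefficients. Note also that interval = "confidence" gives an interval for the mean price of such houses, whereas interval = "prediction" gives a wider interval for a single house. The sketch below does the hand check with the printed estimates and then illustrates the two interval types on the built-in `cars` data as a stand-in.

```r
# Coefficients copied from the summary(model4) output (price in millions).
b0         <- 0.0131493
b_area     <- 0.0082639
b_hawthorn <- 0.2812750
b_car      <- 0.0496499

# Hawthorn house with 100 sqm building area and 2 car spaces.
fit_by_hand <- b0 + b_area * 100 + b_hawthorn + b_car * 2
print(fit_by_hand)   # agrees with the "fit" column above up to rounding

# Stand-in example on the built-in cars data: a confidence interval covers
# the mean response, a prediction interval covers a single new observation.
fit_cars <- lm(dist ~ speed, data = cars)
new_obs  <- data.frame(speed = 15)
conf_int <- predict(fit_cars, new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit_cars, new_obs, interval = "prediction", level = 0.95)
print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```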

# Non-linear alternatives: smoothing spline and loess fits of Price on BuildingArea
spline_fit = smooth.spline(x = mel_sub_out_lier$BuildingArea, y = mel_sub_out_lier$Price)
lo1_fit = loess(Price ~ BuildingArea, data = mel_sub_out_lier)

plot(mel_sub_out_lier$BuildingArea, mel_sub_out_lier$Price, xlab = "BuildingArea", ylab = "Price")
lines(spline_fit, col = "purple", lwd =2)
lines(sort(mel_sub_out_lier$BuildingArea), predict(lo1_fit, sort(mel_sub_out_lier$BuildingArea)), col = "red", lwd =2)
legend("topleft", c("Spline","Loess"), lty=c(1,1), lwd=c(2,2), col=c("purple", "red"))

Ifan

18/08/2021