Student Names:

* Hongli Li (s3697150)

* Shel Nee Gan (s3746473)


1. Introduction

The pricing of real estate in the Sindian District of New Taipei City is a difficult task, because many factors need to be taken into consideration. In practice there are also restrictions and limitations on collecting data that is useful for accurate price prediction. The aim of this project is to find the linear regression model that best captures the relationship between the house price (per unit area) and its predictive regressors. The variables and dataset are described in more detail in Section 2. Section 3 presents at least three linear regression analyses to evaluate the relationships between the variables; hypothesis tests are conducted to assess the assumptions of the fitted models, and the R code used for the analysis, together with its output, is also displayed in this section. Section 4 discusses the test results for the assumptions and the validity of each model. Finally, Section 5 gives a detailed conclusion on the most appropriate model within the context of real estate pricing in New Taipei City.


2. Data Exploration

2.1 Data Set Description

The real estate valuation dataset is provided by the UCI Machine Learning Repository; its original owner is Prof. I-Cheng Yeh of Tamkang University. It contains historical real estate market data from 2012 to 2013 for the Sindian District, New Taipei City. Within the dataset, each row represents a transaction record of a real estate property, and each column corresponds to a feature describing that record. The dataset contains 414 records of property sales (414 rows).


2.2 Identification of Response Variable & Regressor Variables.

There are six regressor variables (X1 to X6, as labelled below) and one response variable (namely, y):

X1 = transaction date
X2 = house age in years
X3 = distance to the nearest MRT station in meters
X4 = number of convenience stores within walking distance
X5 = latitude coordinate
X6 = longitude coordinate
y = house price per local unit area (10,000 New Taiwan Dollars per 3.3 square meters)

Data-Preprocessing

library(leaps)
library(DAAG)
library(car)
library(MASS)
#setwd("~/R")
library(readxl)
real_estate <- read_excel("Real estate valuation data set.xlsx", range = "B1:H415")

#Rename the columns to x1-x6 and y
colnames(real_estate)[colnames(real_estate) == "X1 transaction date"] <- "x1"
colnames(real_estate)[colnames(real_estate) == "X2 house age"] <- "x2"
colnames(real_estate)[colnames(real_estate) == "X3 distance to the nearest MRT station"] <- "x3"
colnames(real_estate)[colnames(real_estate) == "X4 number of convenience stores"] <- "x4"
colnames(real_estate)[colnames(real_estate) == "X5 latitude"] <- "x5"
colnames(real_estate)[colnames(real_estate) == "X6 longitude"] <- "x6"
colnames(real_estate)[colnames(real_estate) == "Y house price of unit area"] <- "y"

head(real_estate) # present the first 6 observations
colSums(is.na(real_estate)) # summarise the missing values in each column of the dataset
## x1 x2 x3 x4 x5 x6  y 
##  0  0  0  0  0  0  0

There are no missing observations in any of the variables.

dim(real_estate) # correct number of rows and columns as described
## [1] 414   7
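
Basic descriptive statistics, which Section 2.4 refers to informally (for example the range of house ages), can be obtained with a one-line summary. This is a small addition for reference, not part of the original pre-processing:

summary(real_estate) # minimum, quartiles, mean and maximum of each variable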

2.3 Relationship Between Variables



pairs(real_estate[,1:4], col= "blue", pch=18, main= "Plot 6- Relationship between x1, x2, x3, x4")
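Plots 1-5 and 7-10 discussed in Section 2.4 are not reproduced here. Below is a minimal sketch of the kind of code that could generate Plots 1-5, assuming the renamed columns x1-x6 and y (Plots 7-10 follow the same pattern with different colouring variables); the exact plotting options and the median colour cut-off are illustrative, not the original code.

# Sketch only: house price against each single regressor (Plots 1-4 style)
plot(y ~ x1, data = real_estate, col = "blue", pch = 18,
     xlab = "Transaction date", ylab = "House price per unit area", main = "Plot 1")
plot(y ~ x2, data = real_estate, col = "blue", pch = 18,
     xlab = "House age (years)", ylab = "House price per unit area", main = "Plot 2")
plot(y ~ x3, data = real_estate, col = "blue", pch = 18,
     xlab = "Distance to nearest MRT station (m)", ylab = "House price per unit area", main = "Plot 3")
plot(y ~ x4, data = real_estate, col = "blue", pch = 18,
     xlab = "Number of convenience stores", ylab = "House price per unit area", main = "Plot 4")

# Sketch only: location coloured by price (Plot 5 style); the median cut-off is illustrative
price_col <- ifelse(real_estate$y > median(real_estate$y), "red", "blue")
plot(x6 ~ x5, data = real_estate, col = price_col, pch = 18,
     xlab = "Latitude", ylab = "Longitude", main = "Plot 5 (red = price above median)")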

2.4 Relevance of these Relationships within Context of Data

  • Plot 1 shows that house prices above 60 per unit area only appear after early 2013; roughly before the first month of 2013 the prices stay below 60, with a single outlier in August 2012. (The x-axis scale may be confusing, but the 12 columns of data points correspond to the 12 months from August 2012 to July 2013.)

  • Plot 2 shows the relationship between house price and house age. House age ranges from newly built (0 years) to nearly 44 years. However, house price does not show an obvious trend against age across the 12-month timespan. This may suggest that house age is not strongly related to house price, but this will be checked by fitting the regression models in the sections below.

  • Plot 3 shows a clear decreasing trend between house price and the distance to the nearest MRT station (public rail station). The houses with the highest sale prices are within 1000 meters of an MRT station, followed by those within 2000 meters. This suggests that x3 (distance to the MRT station) is an important predictor of house price.

  • Plot 4 shows the relationship between house price and the number of convenience stores. There is a gradually increasing trend between these two variables. House prices fall below 20 per unit area only when there are fewer than 3 convenience stores nearby. When there are more than 7 convenience stores, most house prices exceed 40, and with 10 convenience stores the majority of prices are around 50. This again suggests that x4 (number of convenience stores) could be an important predictor of the response variable (house price).

  • Plot 5 shows the relationship between house price and the location of the house in terms of latitude and longitude. The prices display a roughly circular pattern: houses located in the densely clustered inner circle have higher prices (red dots), while lower-priced houses lie far from the cluster, in the outer region where only a few houses are found (blue dots). Location may therefore be an important predictor of house price.

  • Plot 6 shows the relationships among x1, x2, x3 and x4 using a scatter-plot matrix. Variable x1 (transaction date) has a relationship with x3 (distance to the nearest MRT station in meters): the majority of buyers purchase properties that are a short distance from the nearest MRT station. Variable x2 (house age) also shows a relationship with x3; the scatter plot indicates that buyers tend to purchase properties close to an MRT station regardless of the age of the house. Variables x3 and x4 are also related: the number of convenience stores within walking distance is higher when the distance to the nearest MRT station is short. We can conclude that x3 is a useful predictor.

  • Plot 7 shows the relationship between the transaction date and the location of the house. The number of buyers increased in April 2013 (red) compared with August 2012 (blue). Moreover, houses are clustered in the area with latitude between 24.95 and 24.987 and longitude between 121.5 and 121.55; we assume that this area is a residential area.

  • Plot 8 shows the relationship between house age and location. Houses with ages ranging from 0 to 40 years can be found in the residential area, and the distribution is very similar to that of Plot 7.

  • Plot 9 shows the relationship between the distance to the nearest MRT station and location. It is clear that houses in the residential area are closer to the nearest MRT station (less than 2000 meters), whereas for houses far from the residential area the distance to the nearest MRT station exceeds 4000 meters.

  • Plot 10 shows the relationship between the number of convenience stores and location. The residential area tends to have more convenience stores within walking distance (between 5 and 10), whereas houses outside the residential area tend to have fewer (less than 2). Hence location (variables x5 and x6) is a very important feature when considering a property purchase.


2.5 Collection Method Description

The data was collected from the historical real estate market of the Sindian District of New Taipei City, spanning August 2012 to July 2013. In the dataset, the response variable (house price per unit area) is recorded in a local unit of approximately 10,000 New Taiwan Dollars per 3.3 square meters. For the regressors, the transaction dates are encoded in a decimal format such that 2013.250 = March 2013, 2013.500 = June 2013, and so on. House age is recorded in years and the distance to the MRT station is measured in meters.
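
As an illustration of this encoding (a small sketch using the renamed column x1, not part of the original pre-processing), the calendar month can be recovered from the decimal transaction date:

# decimal date = year + month/12, e.g. 2013 + 3/12 = 2013.250 for March 2013
month_index <- round((real_estate$x1 - floor(real_estate$x1)) * 12)
# an index of 0 (e.g. 2013.000) corresponds to December of the previous year
year  <- ifelse(month_index == 0, floor(real_estate$x1) - 1, floor(real_estate$x1))
month <- ifelse(month_index == 0, 12, month_index)
table(paste(year, month.name[month])) # counts of transactions per calendar month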


2.6 Reference of data source

The original owner of this real estate valuation dataset is Prof. I-Cheng Yeh of Tamkang University (Department of Civil Engineering). Prof. Yeh donated the dataset to the UCI Machine Learning Repository on 18 August 2018. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set# [1].


3. Methods

3.1. Manual Backward Elimination

full=lm(y~., data = real_estate) # Start with the full model containing all variables x1 to x6
drop1(full,test="F") # Manual F-test-based backward selection
## x6 is the least significant variable by the partial F test
drop1(update(full, ~ . -x6), test = "F")
## Now x1 is the least significant, so after dropping x6 we also drop x1
drop1(update(full, ~ . -x6-x1), test = "F")
model1 = lm(y ~ x2 + x3 + x4 + x5, data = real_estate)
summary(model1)
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4 + x5, data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.522  -5.292  -1.579   4.264  76.466 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.916e+03  1.113e+03  -5.317 1.74e-07 ***
## x2          -2.687e-01  3.893e-02  -6.903 1.95e-11 ***
## x3          -4.175e-03  4.928e-04  -8.473 4.37e-16 ***
## x4           1.165e+00  1.897e-01   6.141 1.94e-09 ***
## x5           2.386e+02  4.456e+01   5.355 1.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.954 on 409 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5669 
## F-statistic: 136.2 on 4 and 409 DF,  p-value: < 2.2e-16

The backward elimination method suggests that the best-fitting model is y_hat = -5916 - 0.268 x2 - 0.004 x3 + 1.165 x4 + 238.6 x5; we call it model1.


Residual check for model 1

#Present the Residuals vs. Fitted plot
plot(model1 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is roughly flat, which indicates a linear relationship, and the residuals have roughly equal variance across the range of fitted values. This is a sign of homoscedasticity, so these assumptions are not violated.
#Present the Normal Q-Q plot
plot(model1 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(model1 , which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(model1 , which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.
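
For convenience, the four diagnostic plots above can also be drawn in a single 2 x 2 panel (a small sketch, equivalent to the separate calls above):

par(mfrow = c(2, 2)) # arrange the four diagnostic plots in one panel
plot(model1)         # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1)) # restore the default layout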

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.86768, p-value < 2.2e-16

Since the p-value is less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence the normality assumption is violated.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(model1)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07904265      2.150414   0.132
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(model1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.521196, Df = 1, p = 0.21744

Since the p-value is greater than 0.05, we cannot reject \(H_0\), so the constant-variance assumption is not violated.


3.2 Manual Forward Selection

null=lm(y~1, data = real_estate) # Start with null model with no variables
add1(null, scope =full, test = "F") # Manual F-test-based forward selection
add1(update(null, ~ . +x3), scope = full, test = "F")
add1(update(null, ~ . +x3+x4), scope = full, test = "F")
add1(update(null, ~ . +x3+x4+x2), scope = full, test = "F")
add1(update(null, ~ . +x3+x4+x2+x5), scope = full, test = "F")
add1(update(null, ~ . +x3+x4+x2+x5+x1), scope = full, test = "F")
model2 = lm(y~x3+x4+x2+x5+x1, data =real_estate)
summary(model2)
## 
## Call:
## lm(formula = y ~ x3 + x4 + x2 + x5 + x1, data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.625  -5.373  -1.020   4.243  75.343 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.596e+04  3.233e+03  -4.938 1.15e-06 ***
## x3          -4.353e-03  4.899e-04  -8.887  < 2e-16 ***
## x4           1.136e+00  1.876e-01   6.056 3.17e-09 ***
## x2          -2.694e-01  3.847e-02  -7.003 1.04e-11 ***
## x5           2.269e+02  4.417e+01   5.136 4.35e-07 ***
## x1           5.138e+00  1.554e+00   3.305  0.00103 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.847 on 408 degrees of freedom
## Multiple R-squared:  0.5823, Adjusted R-squared:  0.5772 
## F-statistic: 113.8 on 5 and 408 DF,  p-value: < 2.2e-16

The best-fitting model suggested by manual forward selection is y_hat = -15960 - 0.004 x3 + 1.136 x4 - 0.2694 x2 + 226.9 x5 + 5.138 x1; we call it model2.

Residual check for model 2

#Present the Residuals vs. Fitted plot
plot(model2 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is not flat, which suggests the linearity assumption may not hold. Furthermore, the variability on the y-axis is not constant across the range of values on the x-axis (there is a pattern in the variability), which can be a sign of heteroscedasticity; therefore these assumptions are violated.
#Present the Normal Q-Q plot
plot(model2 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(model2 , which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(model2 , which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.87551, p-value < 2.2e-16

Since the p-value is less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence the normality assumption is violated.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(model2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07997944      2.154145   0.092
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.995745, Df = 1, p = 0.045615

Since the p-value is less than 0.05, we reject \(H_0\), which means the constant-variance assumption is violated.

Compared to model1, model2 violates two assumptions whereas model1 violates only one, so model1 is preferable on this basis. Both models have overall F-test p-values below 2.2e-16, so the regression is highly significant in each. Model1 also has the larger F statistic which, taken informally (the two models have different degrees of freedom), again points towards model1 as the better model. A complementary nested-model comparison is sketched below.
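
As a complementary check (not part of the selection procedures above), model1 is nested within model2, so the contribution of the extra regressor x1 can also be assessed directly with a partial F-test:

# Partial F-test comparing the nested models: does adding x1 improve the fit of model1?
anova(model1, model2)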


3.3 Stepwise selection (Automatic backward/forward/stepwise selection)

#Create full model that contains all the variables 
full=lm(y~., data=real_estate)

#Create null model that contains no variable
null=lm(y~1, data=real_estate)

Automatic backward selection using AIC values

step(full, data=real_estate, direction="backward")
## Start:  AIC=1813.03
## y ~ x1 + x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq   RSS    AIC
## - x6    1       5.1 31937 1811.1
## <none>              31931 1813.0
## - x1    1     858.2 32790 1822.0
## - x5    1    2008.2 33940 1836.3
## - x4    1    2846.3 34778 1846.4
## - x3    1    3064.6 34996 1849.0
## - x2    1    3843.9 35775 1858.1
## 
## Step:  AIC=1811.1
## y ~ x1 + x2 + x3 + x4 + x5
## 
##        Df Sum of Sq   RSS    AIC
## <none>              31937 1811.1
## - x1    1     855.0 32792 1820.0
## - x5    1    2064.9 34001 1835.0
## - x4    1    2870.9 34807 1844.7
## - x2    1    3838.9 35775 1856.1
## - x3    1    6181.9 38118 1882.3
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = real_estate)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4  
##  -1.596e+04    5.138e+00   -2.694e-01   -4.353e-03    1.136e+00  
##          x5  
##   2.269e+02

The best model suggested by the automatic backward selection is the same as model2.


Automatic forward selection using AIC values

step(null, scope=list(lower=null, upper=full), direction="forward")
## Start:  AIC=2162.53
## y ~ 1
## 
##        Df Sum of Sq   RSS    AIC
## + x3    1     34695 41767 1914.2
## + x4    1     24930 51531 2001.2
## + x5    1     22820 53641 2017.8
## + x6    1     20937 55524 2032.1
## + x2    1      3390 73071 2145.8
## + x1    1       586 75876 2161.3
## <none>              76461 2162.5
## 
## Step:  AIC=1914.19
## y ~ x3
## 
##        Df Sum of Sq   RSS    AIC
## + x4    1    3273.6 38493 1882.4
## + x2    1    2859.1 38908 1886.8
## + x5    1    2579.5 39187 1889.8
## + x1    1    1268.0 40499 1903.4
## <none>              41767 1914.2
## + x6    1      86.2 41681 1915.3
## 
## Step:  AIC=1882.4
## y ~ x3 + x4
## 
##        Df Sum of Sq   RSS    AIC
## + x2    1    3402.1 35091 1846.1
## + x5    1    1881.3 36612 1863.7
## + x1    1    1046.1 37447 1873.0
## <none>              38493 1882.4
## + x6    1      23.5 38470 1884.2
## 
## Step:  AIC=1846.09
## y ~ x3 + x4 + x2
## 
##        Df Sum of Sq   RSS    AIC
## + x5    1   2299.34 32792 1820.0
## + x1    1   1089.49 34001 1835.0
## <none>              35091 1846.1
## + x6    1     52.53 35038 1847.5
## 
## Step:  AIC=1820.03
## y ~ x3 + x4 + x2 + x5
## 
##        Df Sum of Sq   RSS    AIC
## + x1    1    855.04 31937 1811.1
## <none>              32792 1820.0
## + x6    1      2.03 32790 1822.0
## 
## Step:  AIC=1811.1
## y ~ x3 + x4 + x2 + x5 + x1
## 
##        Df Sum of Sq   RSS    AIC
## <none>              31937 1811.1
## + x6    1    5.1353 31931 1813.0
## 
## Call:
## lm(formula = y ~ x3 + x4 + x2 + x5 + x1, data = real_estate)
## 
## Coefficients:
## (Intercept)           x3           x4           x2           x5  
##  -1.596e+04   -4.353e-03    1.136e+00   -2.694e-01    2.269e+02  
##          x1  
##   5.138e+00

The best model suggested by the automatic forward selection is also the same as model2.


Automatic Stepwise Selection

#stepwise regression using AIC values
step(null, scope = list(upper=full), data=real_estate, direction="both")
## Start:  AIC=2162.53
## y ~ 1
## 
##        Df Sum of Sq   RSS    AIC
## + x3    1     34695 41767 1914.2
## + x4    1     24930 51531 2001.2
## + x5    1     22820 53641 2017.8
## + x6    1     20937 55524 2032.1
## + x2    1      3390 73071 2145.8
## + x1    1       586 75876 2161.3
## <none>              76461 2162.5
## 
## Step:  AIC=1914.19
## y ~ x3
## 
##        Df Sum of Sq   RSS    AIC
## + x4    1      3274 38493 1882.4
## + x2    1      2859 38908 1886.8
## + x5    1      2580 39187 1889.8
## + x1    1      1268 40499 1903.4
## <none>              41767 1914.2
## + x6    1        86 41681 1915.3
## - x3    1     34695 76461 2162.5
## 
## Step:  AIC=1882.4
## y ~ x3 + x4
## 
##        Df Sum of Sq   RSS    AIC
## + x2    1    3402.1 35091 1846.1
## + x5    1    1881.3 36612 1863.7
## + x1    1    1046.1 37447 1873.0
## <none>              38493 1882.4
## + x6    1      23.5 38470 1884.2
## - x4    1    3273.6 41767 1914.2
## - x3    1   13038.3 51531 2001.2
## 
## Step:  AIC=1846.09
## y ~ x3 + x4 + x2
## 
##        Df Sum of Sq   RSS    AIC
## + x5    1    2299.3 32792 1820.0
## + x1    1    1089.5 34001 1835.0
## <none>              35091 1846.1
## + x6    1      52.5 35038 1847.5
## - x2    1    3402.1 38493 1882.4
## - x4    1    3816.7 38908 1886.8
## - x3    1   12066.4 47157 1966.5
## 
## Step:  AIC=1820.03
## y ~ x3 + x4 + x2 + x5
## 
##        Df Sum of Sq   RSS    AIC
## + x1    1     855.0 31937 1811.1
## <none>              32792 1820.0
## + x6    1       2.0 32790 1822.0
## - x5    1    2299.3 35091 1846.1
## - x4    1    3023.6 35815 1854.5
## - x2    1    3820.2 36612 1863.7
## - x3    1    5755.8 38547 1885.0
## 
## Step:  AIC=1811.1
## y ~ x3 + x4 + x2 + x5 + x1
## 
##        Df Sum of Sq   RSS    AIC
## <none>              31937 1811.1
## + x6    1       5.1 31931 1813.0
## - x1    1     855.0 32792 1820.0
## - x5    1    2064.9 34001 1835.0
## - x4    1    2870.9 34807 1844.7
## - x2    1    3838.9 35775 1856.1
## - x3    1    6181.9 38118 1882.3
## 
## Call:
## lm(formula = y ~ x3 + x4 + x2 + x5 + x1, data = real_estate)
## 
## Coefficients:
## (Intercept)           x3           x4           x2           x5  
##  -1.596e+04   -4.353e-03    1.136e+00   -2.694e-01    2.269e+02  
##          x1  
##   5.138e+00

The best model suggested by automatic stepwise selection is still the same as model2.

3.4 All Possible Subsets Regression

r<-leaps::regsubsets(y~x1+x2+x3+x4+x5+x6, data=real_estate)
plot(r, scale="adjr2") 

The top two models both have adjusted R2 = 0.58, so they are equally good in terms of adjusted R2. The first is the same as model2, and the second is equivalent to the full model, which is fitted below as model3:

Model 3: fitting the full model

model3 <- lm(y~x1+x2+x3+x4+x5+x6, data = real_estate)
summary(model3)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.667  -5.412  -0.967   4.217  75.190 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.444e+04  6.775e+03  -2.132  0.03364 *  
## x1           5.149e+00  1.557e+00   3.307  0.00103 ** 
## x2          -2.697e-01  3.853e-02  -7.000 1.06e-11 ***
## x3          -4.488e-03  7.180e-04  -6.250 1.04e-09 ***
## x4           1.133e+00  1.882e-01   6.023 3.83e-09 ***
## x5           2.255e+02  4.457e+01   5.059 6.38e-07 ***
## x6          -1.243e+01  4.858e+01  -0.256  0.79820    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.858 on 407 degrees of freedom
## Multiple R-squared:  0.5824, Adjusted R-squared:  0.5762 
## F-statistic:  94.6 on 6 and 407 DF,  p-value: < 2.2e-16

The best fitted model (model3): y_hat = -14440 + 5.149x1 - 0.2697x2 -0.0045x3 + 1.133x4 + 225.5x5 - 12.43x6

Residual checks for model 3

#Present the Residuals vs. Fitted plot
plot(model3 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is not flat, which suggests the linearity assumption may not hold. Furthermore, the variability on the y-axis is not constant across the range of values on the x-axis (there is a pattern in the variability), which can be a sign of heteroscedasticity; therefore these assumptions are violated.
#Present the Normal Q-Q plot
plot(model3 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(model3 , which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(model3 , which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(model3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.87622, p-value < 2.2e-16

Since the p-value is less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence the normality assumption is violated.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(model3)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07927517      2.152731   0.116
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(model3)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.1988, Df = 1, p = 0.040453

Since the p-value is less than 0.05, we reject \(H_0\), which means the constant-variance assumption is violated.

Comparing model3 with model1 and model2: both model2 and model3 violate two assumptions whereas model1 violates only one, so model1 remains the better model. Since model2 and model3 violate the normality and constant-variance assumptions, we now apply a transformation to the data to see whether it allows the three models to meet all assumptions.

r<-leaps::regsubsets(y~x1+x2+x3+x4+x5+x6, data=real_estate)
plot(r, scale="Cp") ## Scale can be "Cp", "adjr2", "r2" or "bic"

The Cp criterion ranks the top three models by Cp value; these are model2, model3 and model1 respectively.

r<-leaps::regsubsets(y~x1+x2+x3+x4+x5+x6, data=real_estate)
plot(r, scale="bic") ## Scale can be "Cp", "adjr2", "r2" or "bic"

The BIC criterion ranks the top three models by BIC value; these are model2, model1 and model3 respectively.

3.5 Apply transformation to model 1, model 2, model 3

As noted above, we now normalise (standardise) the dataset, refit the three best models, and assess their performance with a second round of residual checks.

library(magrittr)
normalised_real_estate = scale(real_estate, center = TRUE, scale = TRUE)
N_data = as.data.frame(normalised_real_estate) # put normalised data in a dataframe
head(N_data)

Fit Normalised data to model 1

normalised_model1 = lm(y ~ x2 + x3 + x4 + x5, data = N_data)
summary(normalised_model1)
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4 + x5, data = N_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5371 -0.3890 -0.1161  0.3134  5.6198 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.936e-14  3.234e-02   0.000        1    
## x2          -2.250e-01  3.259e-02  -6.903 1.95e-11 ***
## x3          -3.873e-01  4.571e-02  -8.473 4.37e-16 ***
## x4           2.522e-01  4.106e-02   6.141 1.94e-09 ***
## x5           2.177e-01  4.064e-02   5.355 1.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6581 on 409 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5669 
## F-statistic: 136.2 on 4 and 409 DF,  p-value: < 2.2e-16

All variables are significant.

Residual checks for normalised_model 1

#Present the Residuals vs. Fitted plot
plot(normalised_model1 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is close to flat, which indicates a linear relationship, and the residuals have roughly equal variance across the range of fitted values. This is a sign of homoscedasticity, so these assumptions are not violated.
#Present the Normal Q-Q plot
plot(normalised_model1 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(normalised_model1, which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(normalised_model1, which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(normalised_model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  normalised_model1$residuals
## W = 0.86768, p-value < 2.2e-16

Since the p-value is still less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence normalising the data did not help model1 meet the normality assumption.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(normalised_model1)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07904265      2.150414   0.124
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(normalised_model1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.521196, Df = 1, p = 0.21744

Since the p-value is greater than 0.05, we cannot reject \(H_0\), so the constant-variance assumption is not violated. Overall, the transformation did not noticeably change model1's performance.

Fit Normalised data to model 2

normalised_model2 = lm(y ~ x1 + x2 + x3 + x4 + x5, data = N_data)
summary(normalised_model2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = N_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6183 -0.3949 -0.0749  0.3119  5.5373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.028e-14  3.196e-02   0.000  1.00000    
## x1           1.065e-01  3.222e-02   3.305  0.00103 ** 
## x2          -2.255e-01  3.221e-02  -7.003 1.04e-11 ***
## x3          -4.038e-01  4.544e-02  -8.887  < 2e-16 ***
## x4           2.460e-01  4.061e-02   6.056 3.17e-09 ***
## x5           2.069e-01  4.029e-02   5.136 4.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6502 on 408 degrees of freedom
## Multiple R-squared:  0.5823, Adjusted R-squared:  0.5772 
## F-statistic: 113.8 on 5 and 408 DF,  p-value: < 2.2e-16

All variables except for the intercept are significant.

Residual checks for normalised_model 2

#Present the Residuals vs. Fitted plot
plot(normalised_model2 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is not flat, which suggests the linearity assumption may not hold. Furthermore, the variability on the y-axis is not constant across the range of values on the x-axis, so this assumption is violated.
#Present the Normal Q-Q plot
plot(normalised_model2 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(normalised_model2, which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(normalised_model2, which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(normalised_model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  normalised_model2$residuals
## W = 0.87551, p-value < 2.2e-16

Since the p-value is still less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence normalising the data did not help model2 meet the normality assumption.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(normalised_model2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07997944      2.154145     0.1
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(normalised_model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.995745, Df = 1, p = 0.045615

Since the p-value is still less than 0.05, we reject \(H_0\), so the constant-variance assumption remains violated. The transformation did not resolve this for model2 (the test result is unchanged).


Fit Normalised data to model 3

normalised_model3 = lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = N_data)
summary(normalised_model3)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = N_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6213 -0.3977 -0.0710  0.3099  5.5261 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.682e-14  3.199e-02   0.000  1.00000    
## x1           1.067e-01  3.227e-02   3.307  0.00103 ** 
## x2          -2.258e-01  3.226e-02  -7.000 1.06e-11 ***
## x3          -4.163e-01  6.660e-02  -6.250 1.04e-09 ***
## x4           2.453e-01  4.073e-02   6.023 3.83e-09 ***
## x5           2.056e-01  4.065e-02   5.059 6.38e-07 ***
## x6          -1.402e-02  5.480e-02  -0.256  0.79820    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.651 on 407 degrees of freedom
## Multiple R-squared:  0.5824, Adjusted R-squared:  0.5762 
## F-statistic:  94.6 on 6 and 407 DF,  p-value: < 2.2e-16

All variables are significant except for the intercept and x6.

Residual checks for normalised_model 3

#Present the Residuals vs. Fitted plot
plot(normalised_model3 , which=1)

  • Residuals vs Fitted: The red line relating fitted values to residuals is not flat, which suggests the linearity assumption may not hold. Furthermore, the variability on the y-axis is not constant across the range of values on the x-axis, so this assumption is violated.
#Present the Normal Q-Q plot
plot(normalised_model3 , which=2)

  • Normal Q-Q: The residuals at the right tail do not fall close to the line and there are clear deviations from normality, so the residuals do not appear to be normally distributed and this assumption is violated.
#Present the Scale-location plot
plot(normalised_model3, which=3)

  • Scale-Location: The red line in this plot is flat and the spread of the square root of the standardized residuals is roughly constant across the fitted values. This is a sign of homoscedasticity, so the assumption is not violated.
#Present the Residuals vs. Leverage plot 
plot(normalised_model3, which=5)

  • Residuals vs. Leverage: No points fall in the upper-right or lower-right corners of the plot beyond the red Cook's distance bands, so there is no evidence of influential cases.

\[H_0: Errors\ are\ normally\ distributed\]

\[H_1: Errors\ are\ not\ normally\ distributed\]

shapiro.test(normalised_model3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  normalised_model3$residuals
## W = 0.87622, p-value < 2.2e-16

Since the p-value is still less than 0.05, we reject \(H_0\), which means the errors are not normally distributed. Hence normalising the data did not help model3 meet the normality assumption.


\[H_0: Errors\ are\ uncorrelated. \]

\[H_1: Errors\ are\ autocorrelated. \]

durbinWatsonTest(normalised_model3)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07927517      2.152731   0.096
##  Alternative hypothesis: rho != 0

Since the p-value is greater than 0.05, we cannot reject \(H_0\); there is no evidence of autocorrelation, so the uncorrelated-errors assumption is not violated.


\[H_0: Errors\ have\ a\ constant\ variance. \]

\[H_1: Errors\ have\ a\ non-constant\ variance. \]

ncvTest(normalised_model3)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.1988, Df = 1, p = 0.040453

Since the p-value is still less than 0.05, we reject \(H_0\), so the constant-variance assumption remains violated. The transformation did not resolve this for model3 (the test result is unchanged).

In conclusion, the normalisation did not improve the residual distribution of any of the three models; their performance remains essentially the same after fitting the normalised data. The best model is therefore still model1.
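
Standardising the variables is a linear rescaling, so it cannot change the shape of the residual distribution. A response transformation would be a more natural remedy to explore; the sketch below (not part of the analysis above) uses the Box-Cox procedure from the already-loaded MASS package and, as an example, a log-transformed response:

# Profile the Box-Cox log-likelihood for model1 (valid here since all prices are positive)
bc <- boxcox(model1, lambda = seq(-1, 1, 0.05))
bc$x[which.max(bc$y)] # lambda maximising the profile log-likelihood
# If lambda is near 0, refit model1 with a log-transformed response and re-check normality
log_model1 <- lm(log(y) ~ x2 + x3 + x4 + x5, data = real_estate)
shapiro.test(log_model1$residuals)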


4. Discussion

We first used manual backward elimination to select the best model. Starting from the full model, the least significant variable (largest p-value in the partial F-test, and lowest AIC when dropped) is x6, so it is dropped first. The next least significant variable is x1; after dropping x1, all remaining coefficients are significant, so we stop and obtain model1: y ~ x2 + x3 + x4 + x5. The residual checks show that model1 violates the normality assumption.

Next, we used the manual forward selection method. Starting with the null model, at each step the most significant candidate variable (smallest p-value, and lowest AIC when added) is added to the model, until none of the remaining variables is significant. The best model obtained this time is model2: y ~ x1 + x2 + x3 + x4 + x5. The residual checks show that model2 violates both the normality assumption and the constant-variance assumption.

Next, automatic selection based on AIC was used: the automatic backward, forward and stepwise procedures all return the same model as model2.

Then, we examined all possible subsets of regressors, using the adjusted R2, Cp and BIC criteria; the best model returned is again model2. Furthermore, since the full model has an adjusted R2 as high as the best model (model2), the full model was also accepted as model3. The residual checks on model3 lead to a similar conclusion to model2's: both model2 and model3 violate the normality and constant-variance assumptions, whereas model1 violates only the normality assumption.

Afterwards, in order to improve the fit of the models, we normalised the data and refitted the three linear regression models. A second round of residual checks was performed to verify the assumptions; the normalisation did not improve the test results, so we retain the original three models.

Finally, we compared model1, model2 and model3, taking all of the above into account. All three models have a very small overall p-value, so the regression is significant in each. Considering the number of assumptions violated, together with the F statistics (model1 has the largest, although the models have different degrees of freedom so this comparison is only informal), we conclude that model1 (y_hat = -5916 - 0.268 x2 - 0.004 x3 + 1.165 x4 + 238.6 x5) is the best of the three.


5. Conclusion

In summary, taking into account the descriptive statistics and the residual checks, the most appropriate model is y_hat = -5916 - 0.268 x2 - 0.004 x3 + 1.165 x4 + 238.6 x5. Since the transformation did not help the modelling, this model is fitted on the unscaled data. In the context of the data, the house price per unit area in the Sindian District of New Taipei City decreases by 0.268 for each additional year of house age, decreases by 0.004 for each additional meter of distance to the nearest MRT station, increases by 1.165 for each additional convenience store within walking distance, and increases by 238.6 for each one-degree increase in latitude, holding the other regressors fixed.
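
To illustrate how this model would be used, the sketch below predicts the price per unit area for a hypothetical property (all input values are illustrative only, not taken from the data):

# Hypothetical property: 10-year-old house, 500 m from the nearest MRT station,
# 5 convenience stores within walking distance, latitude 24.97
new_house <- data.frame(x2 = 10, x3 = 500, x4 = 5, x5 = 24.97)
predict(model1, newdata = new_house, interval = "prediction") # point prediction with 95% interval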


References

[1] UCI Machine Learning Repository: Real Estate Valuation Data Set. https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set#