1 Overview

The purpose of this project is to predict house sale prices in King County, Washington, USA using multiple linear regression. The dataset consists of historical data on houses sold between May 2014 and May 2015.

2 Data preparation

2.1 Import Data

First, we need to import the dataset into R. The data used in this project is kc_house_data.csv, which you can download from here

# disable scientific notation in printed output
options(scipen = 999)

house <- read.csv("kc_house_data.csv")

2.2 Variable description

Here are brief explanations of the variables used in this project:

  • price : Price of the house; the prediction target in this project
  • bedrooms : Number of bedrooms
  • bathrooms : Number of bathrooms
  • sqft_living : Square footage of the home
  • sqft_lot : Square footage of the lot
  • floors : Total floors (levels) in the house
  • waterfront : Whether the house has a view to a waterfront (0 = no, 1 = yes)
  • view : An index of how good the view of the property is (0 to 4)
  • condition : How good the overall condition is; 1 indicates a worn-out property and 5 an excellent one
  • grade : Overall grade given to the housing unit, based on the King County grading system; 1 poor, 13 excellent
  • sqft_above : Square footage of the house apart from the basement
  • sqft_basement : Square footage of the basement
  • yr_built : Year the house was built
  • yr_renovated : Year the house was renovated (0 if never renovated)
  • zipcode : ZIP code
  • lat : Latitude coordinate
  • long : Longitude coordinate
  • sqft_living15 : Living area in 2015 (implies some renovations); this might or might not have affected the lot size
  • sqft_lot15 : Lot size area in 2015 (implies some renovations)

Then we can start to inspect the data:

str(house)
## 'data.frame':    21597 obs. of  21 variables:
##  $ id           : num  7129300520 6414100192 5631500400 2487200875 1954400510 ...
##  $ date         : chr  "10/13/2014" "12/9/2014" "2/25/2015" "12/9/2014" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
rmarkdown::paged_table(house)

The data structure above shows the id and date variables. These variables do not provide useful information for the linear regression analysis, so they can be removed. Note that in this project we do not consider the sales year: the data covers May 2014 to May 2015, so changes over time should not significantly affect the selling price of a house.

library(dplyr)

# drop the identifier and date columns
house <- house %>% 
  select(-id, -date)

# truncate bathrooms and floors to whole numbers
house <- house %>% 
  mutate(bathrooms = as.integer(bathrooms)) %>% 
  mutate(floors = as.integer(floors))

Then we can check for any missing values:

colSums(is.na(house))
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##             0             0             0             0             0 
##        floors    waterfront          view     condition         grade 
##             0             0             0             0             0 
##    sqft_above sqft_basement      yr_built  yr_renovated       zipcode 
##             0             0             0             0             0 
##           lat          long sqft_living15    sqft_lot15 
##             0             0             0             0
summary(house)
##      price            bedrooms        bathrooms      sqft_living   
##  Min.   :  78000   Min.   : 1.000   Min.   :0.000   Min.   :  370  
##  1st Qu.: 322000   1st Qu.: 3.000   1st Qu.:1.000   1st Qu.: 1430  
##  Median : 450000   Median : 3.000   Median :2.000   Median : 1910  
##  Mean   : 540297   Mean   : 3.373   Mean   :1.751   Mean   : 2080  
##  3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.000   3rd Qu.: 2550  
##  Max.   :7700000   Max.   :33.000   Max.   :8.000   Max.   :13540  
##     sqft_lot           floors        waterfront            view       
##  Min.   :    520   Min.   :1.000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:   5040   1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :   7618   Median :1.000   Median :0.000000   Median :0.0000  
##  Mean   :  15099   Mean   :1.446   Mean   :0.007547   Mean   :0.2343  
##  3rd Qu.:  10685   3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :1651359   Max.   :3.000   Max.   :1.000000   Max.   :4.0000  
##    condition        grade          sqft_above   sqft_basement       yr_built   
##  Min.   :1.00   Min.   : 3.000   Min.   : 370   Min.   :   0.0   Min.   :1900  
##  1st Qu.:3.00   1st Qu.: 7.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median :3.00   Median : 7.000   Median :1560   Median :   0.0   Median :1975  
##  Mean   :3.41   Mean   : 7.658   Mean   :1789   Mean   : 291.7   Mean   :1971  
##  3rd Qu.:4.00   3rd Qu.: 8.000   3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997  
##  Max.   :5.00   Max.   :13.000   Max.   :9410   Max.   :4820.0   Max.   :2015  
##   yr_renovated        zipcode           lat             long       
##  Min.   :   0.00   Min.   :98001   Min.   :47.16   Min.   :-122.5  
##  1st Qu.:   0.00   1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3  
##  Median :   0.00   Median :98065   Median :47.57   Median :-122.2  
##  Mean   :  84.46   Mean   :98078   Mean   :47.56   Mean   :-122.2  
##  3rd Qu.:   0.00   3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1  
##  Max.   :2015.00   Max.   :98199   Max.   :47.78   Max.   :-121.3  
##  sqft_living15    sqft_lot15    
##  Min.   : 399   Min.   :   651  
##  1st Qu.:1490   1st Qu.:  5100  
##  Median :1840   Median :  7620  
##  Mean   :1987   Mean   : 12758  
##  3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :6210   Max.   :871200

Well, the dataset now looks OK, so we can move on to the next step.

3 Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables and look for any patterns that may indicate a correlation between variables.

Find the correlations between variables using ggcorr:

# calculating the correlation

library(GGally)
ggcorr(data = house, label = T, size = 3, label_size = 3, hjust = 0.95, layout.exp = 2) +
  labs(
    title = "Dataset Correlation Matrix"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 8, face = "bold"),
    axis.text.y = element_blank()
  )

The correlation matrix above shows that every variable has some correlation with price except condition and long, and that the variables most strongly correlated with price are sqft_living and grade.

The distribution of price, the target variable:

house %>% 
  ggplot(aes(x=price)) +
  geom_histogram(aes(y=..density..),color = "black", fill="white")+
  geom_density(alpha=0.2, fill="blue")+
  labs(title = "House Price Distribution") +
  theme_minimal()+
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 9, face ="bold"),
    axis.title.y = element_text(margin = margin(l=5)),
    axis.title.x.bottom = element_text(margin = margin(b=5))
  )

From the histogram above we can see that outliers cause the distribution of house prices to be far from normal, so we have to remove them.

# filtering house price under 2,000,000 USD
house_clean <- house %>% 
  filter(price < 2000000)

house_clean %>% 
  ggplot(aes(x=price)) +
  geom_histogram(aes(y=..density..),color = "black", fill="white")+
  geom_density(alpha=0.2, fill="blue")+
  labs(title = "House Price Distribution") +
  theme_minimal()+
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 9, face ="bold"),
    axis.title.y = element_text(margin = margin(l=5)),
    axis.title.x.bottom = element_text(margin = margin(b=5))
  )

Now the dataset looks good, with a better price distribution.

4 Modeling

4.1 Train-Test Split

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to check whether the model overfits and cannot predict new data it hasn't seen during the training phase. We will use 80% of the data as training data and the rest as testing data.

set.seed(999)
idx_house <- sample(x = nrow(house_clean), size = nrow(house_clean) * 0.8)
house_train <- house_clean[idx_house,]
house_test <- house_clean[-idx_house,]

4.2 Linear Regression

4.2.1 Single Predictor

Based on the correlation matrix above, the sqft_living and grade variables have the strongest correlation with the price variable. Let's give them a try.

Linear regression with a single predictor: sqft_living

model_sqlv <- lm(formula = price ~ sqft_living, data = house_train)

summary(model_sqlv)
## 
## Call:
## lm(formula = price ~ sqft_living, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -939480 -137804  -22111  100180 1329080 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 56013.494   4148.643    13.5 <0.0000000000000002 ***
## sqft_living   225.062      1.863   120.8 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209500 on 17109 degrees of freedom
## Multiple R-squared:  0.4603, Adjusted R-squared:  0.4603 
## F-statistic: 1.459e+04 on 1 and 17109 DF,  p-value: < 0.00000000000000022

Looking at the p-value, sqft_living significantly affects the price. However, as a single predictor it only yields a Multiple R-squared value of 0.4603, which means the model explains only about 46% of the variability in the target. A scatter plot makes this limitation visible, as shown below.
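As an illustration of how much variation a single predictor leaves unexplained, we can plot the fitted line over the training data (a supplementary sketch, not part of the original modelling steps):

library(ggplot2)

# scatter of price vs sqft_living with the fitted regression line
ggplot(house_train, aes(x = sqft_living, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Price vs sqft_living with Fitted Line") +
  theme_minimal()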

Linear regression with a single predictor: grade

model_grade <- lm(formula = price ~ grade, data = house_train)

summary(model_grade)
## 
## Call:
## lm(formula = price ~ grade, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -545919 -136608  -29516   94967 1408350 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -795425      10892  -73.03 <0.0000000000000002 ***
## grade         172134       1412  121.93 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208600 on 17109 degrees of freedom
## Multiple R-squared:  0.4649, Adjusted R-squared:  0.4649 
## F-statistic: 1.487e+04 on 1 and 17109 DF,  p-value: < 0.00000000000000022

Looking at the p-value, grade also significantly affects the price. However, as a single predictor it only yields a Multiple R-squared value of 0.4649, which means the model explains only about 46% of the variability in the target.

Conclusion: a linear regression model using a single predictor is not suitable for determining house prices in this dataset.

4.2.2 Multiple Regression

model_all <- lm(formula = price ~ ., data = house_train)

summary(model_all)
## 
## Call:
## lm(formula = price ~ ., data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790595  -87391  -10739   68277 1016615 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   -1171255.11005  2492726.85776  -0.470               0.6385    
## bedrooms        -15688.35659     1622.92052  -9.667 < 0.0000000000000002 ***
## bathrooms        31720.96720     2495.92754  12.709 < 0.0000000000000002 ***
## sqft_living         98.77384        3.70949  26.627 < 0.0000000000000002 ***
## sqft_lot             0.21299        0.04456   4.780       0.000001766638 ***
## floors           26638.43710     3147.58324   8.463 < 0.0000000000000002 ***
## waterfront      272756.77783    17170.32813  15.885 < 0.0000000000000002 ***
## view             46277.35452     1858.95803  24.894 < 0.0000000000000002 ***
## condition        29116.62571     2011.26444  14.477 < 0.0000000000000002 ***
## grade            92808.34273     1865.49909  49.750 < 0.0000000000000002 ***
## sqft_above           9.04028        3.69336   2.448               0.0144 *  
## sqft_basement             NA             NA      NA                   NA    
## yr_built         -2166.71128       62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated        20.44961        3.16428   6.463       0.000000000106 ***
## zipcode           -408.85122       28.15316 -14.522 < 0.0000000000000002 ***
## lat             586173.00194     9166.42208  63.948 < 0.0000000000000002 ***
## long           -139214.47326    11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15       44.76065        3.01698  14.836 < 0.0000000000000002 ***
## sqft_lot15          -0.27771        0.06485  -4.282       0.000018588763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared:  0.7131, Adjusted R-squared:  0.7129 
## F-statistic:  2500 on 17 and 17093 DF,  p-value: < 0.00000000000000022

The summary of the model_all model shows a lot of information, but for now we should focus on the Pr(>|t|) column. This column shows the significance level of each variable in the model. If the value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning that the estimated coefficient is significantly different from 0), and vice versa. Thus, we can build a simpler model by removing the variables with p-value > 0.05, since they don't have a significant effect on our model. The Estimate column shows the coefficient of each variable. To interpret a coefficient: for example, every increase of 1 square foot in sqft_living contributes an increase of about USD 98.77 in the predicted house price, with the other predictors held constant. Note also that the coefficient of sqft_basement is NA ("1 not defined because of singularities"): since sqft_living = sqft_above + sqft_basement, the variable is perfectly collinear with the other two and R drops it.
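To make the interpretation concrete, here is a small sketch (the object name b_sqlv is hypothetical) that extracts the sqft_living coefficient and uses it to estimate the price difference implied by 100 extra square feet, all other predictors held constant:

# coefficient of sqft_living in the full model (~ USD 98.77 per square foot)
b_sqlv <- coef(model_all)["sqft_living"]

# implied price difference for 100 extra square feet of living space
b_sqlv * 100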

4.2.3 Using Stepwise Methods

For the selection of the predictors (X) in the regression analysis, we can use the stepwise regression method. This method uses the AIC value as the measure for removing or adding predictors to the linear regression model. I choose the backward direction, which removes the predictors that have the least significant effect on Y.
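As a quick reference before running the search, we can compare the AIC of the models fitted so far; a lower AIC indicates a better trade-off between fit and complexity (a supplementary check, not part of the selection itself):

# compare AIC across the models built so far (lower is better)
AIC(model_sqlv, model_grade, model_all)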

# backward elimination: start from the full model and drop predictors by AIC
backward <- step(object = model_all, direction = "backward", trace = 1)
## Start:  AIC=408518
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + 
##     sqft_lot15
## 
## 
## Step:  AIC=408518
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + yr_built + 
##     yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15
## 
##                 Df      Sum of Sq             RSS    AIC
## <none>                            398990083371061 408518
## - sqft_above     1   139850327970 399129933699031 408522
## - sqft_lot15     1   428075681605 399418159052666 408534
## - sqft_lot       1   533357265985 399523440637046 408539
## - yr_renovated   1   974911614958 399964994986019 408558
## - floors         1  1671886697922 400661970068983 408588
## - bedrooms       1  2181243805008 401171327176068 408609
## - long           1  3596900258080 402586983629141 408670
## - bathrooms      1  3770272198291 402760355569352 408677
## - condition      1  4892012391550 403882095762611 408725
## - zipcode        1  4922888712525 403912972083586 408726
## - sqft_living15  1  5137978530474 404128061901535 408735
## - waterfront     1  5890304177169 404880387548229 408767
## - view           1 14465774408034 413455857779094 409125
## - sqft_living    1 16550071533071 415540154904131 409211
## - yr_built       1 27629490242992 426619573614053 409662
## - grade          1 57773377934993 456763461306054 410830
## - lat            1 95454351393692 494444434764753 412186
summary(backward)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + 
##     sqft_lot15, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790595  -87391  -10739   68277 1016615 
## 
## Coefficients:
##                     Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   -1171255.11005  2492726.85776  -0.470               0.6385    
## bedrooms        -15688.35659     1622.92052  -9.667 < 0.0000000000000002 ***
## bathrooms        31720.96720     2495.92754  12.709 < 0.0000000000000002 ***
## sqft_living         98.77384        3.70949  26.627 < 0.0000000000000002 ***
## sqft_lot             0.21299        0.04456   4.780       0.000001766638 ***
## floors           26638.43710     3147.58324   8.463 < 0.0000000000000002 ***
## waterfront      272756.77783    17170.32813  15.885 < 0.0000000000000002 ***
## view             46277.35452     1858.95803  24.894 < 0.0000000000000002 ***
## condition        29116.62571     2011.26444  14.477 < 0.0000000000000002 ***
## grade            92808.34273     1865.49909  49.750 < 0.0000000000000002 ***
## sqft_above           9.04028        3.69336   2.448               0.0144 *  
## yr_built         -2166.71128       62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated        20.44961        3.16428   6.463       0.000000000106 ***
## zipcode           -408.85122       28.15316 -14.522 < 0.0000000000000002 ***
## lat             586173.00194     9166.42208  63.948 < 0.0000000000000002 ***
## long           -139214.47326    11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15       44.76065        3.01698  14.836 < 0.0000000000000002 ***
## sqft_lot15          -0.27771        0.06485  -4.282       0.000018588763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared:  0.7131, Adjusted R-squared:  0.7129 
## F-statistic:  2500 on 17 and 17093 DF,  p-value: < 0.00000000000000022

As its result, the stepwise backward elimination returns the following formula for the multiple linear regression:

Formula:

lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + sqft_above + yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15, data = house_train)

model_back <- lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
    floors + waterfront + view + condition + grade + sqft_above + 
    yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + 
    sqft_lot15, data = house_train)

summary(model_back)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + 
##     sqft_lot15, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790595  -87391  -10739   68277 1016615 
## 
## Coefficients:
##                     Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   -1171255.11005  2492726.85776  -0.470               0.6385    
## bedrooms        -15688.35659     1622.92052  -9.667 < 0.0000000000000002 ***
## bathrooms        31720.96720     2495.92754  12.709 < 0.0000000000000002 ***
## sqft_living         98.77384        3.70949  26.627 < 0.0000000000000002 ***
## sqft_lot             0.21299        0.04456   4.780       0.000001766638 ***
## floors           26638.43710     3147.58324   8.463 < 0.0000000000000002 ***
## waterfront      272756.77783    17170.32813  15.885 < 0.0000000000000002 ***
## view             46277.35452     1858.95803  24.894 < 0.0000000000000002 ***
## condition        29116.62571     2011.26444  14.477 < 0.0000000000000002 ***
## grade            92808.34273     1865.49909  49.750 < 0.0000000000000002 ***
## sqft_above           9.04028        3.69336   2.448               0.0144 *  
## yr_built         -2166.71128       62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated        20.44961        3.16428   6.463       0.000000000106 ***
## zipcode           -408.85122       28.15316 -14.522 < 0.0000000000000002 ***
## lat             586173.00194     9166.42208  63.948 < 0.0000000000000002 ***
## long           -139214.47326    11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15       44.76065        3.01698  14.836 < 0.0000000000000002 ***
## sqft_lot15          -0.27771        0.06485  -4.282       0.000018588763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared:  0.7131, Adjusted R-squared:  0.7129 
## F-statistic:  2500 on 17 and 17093 DF,  p-value: < 0.00000000000000022

5 Evaluation

There are several metrics that can be used to evaluate the regression model above:

5.1 Performance

1. Accuracy

R-squared measures how much of the variability in the dependent variable can be explained by the model, so it is a good measure of how well the model fits the data. From the results above, the model has an Adjusted R-squared value of 0.7129, which indicates that the model_back model explains about 71.3% of the variability in house prices.
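As a sanity check on overfitting, we can also compute R-squared on the test set and compare it with the training value (a sketch; the object name pred_check is hypothetical and the exact number depends on the random split above):

# squared correlation between actual and predicted prices on unseen data
pred_check <- predict(model_back, newdata = house_test)
cor(house_test$price, pred_check)^2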

2. Error

We can check the errors with the following metrics (a hand-rolled version of each is sketched after this list):

  • Mean Squared Error (MSE) is an absolute measure of the goodness of fit. It is calculated as the sum of the squared prediction errors (actual output minus predicted output) divided by the number of data points.
  • Root Mean Squared Error (RMSE) is the square root of MSE. It is used more commonly than MSE, partly because an MSE value can be too big to compare easily.
  • Mean Absolute Error (MAE) is similar to MSE; however, instead of summing the squares of the errors, it sums their absolute values. Compared to MSE or RMSE, MAE is a more direct representation of the error terms: MSE penalises big prediction errors more heavily by squaring them, while MAE treats all errors the same.
  • Mean Absolute Percentage Error (MAPE) is the mean of the absolute percentage errors of the forecasts, where the error is defined as the actual or observed value minus the forecasted value.
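As a quick illustration of these definitions, here is a hand-rolled sketch of the four metrics (the function names are hypothetical; the MLmetrics package used below provides the same calculations):

# hand-rolled error metrics, following the definitions above
mse_manual  <- function(pred, actual) mean((actual - pred)^2)
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))
mae_manual  <- function(pred, actual) mean(abs(actual - pred))
mape_manual <- function(pred, actual) mean(abs((actual - pred) / actual))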
library(MLmetrics)

# predict house prices on the unseen test dataset
pred <- predict(object = model_back, newdata = house_test)

mse <- MSE(y_pred = pred, y_true = house_test$price)
rmse <- RMSE(y_pred = pred, y_true = house_test$price)
mae <- MAE(y_pred = pred, y_true = house_test$price)
mape <- MAPE(y_pred = pred, y_true = house_test$price)

data.frame("MSE" = mse, "RMSE" = rmse, "MAE" = mae, "MAPE" = mape)
##           MSE   RMSE      MAE      MAPE
## 1 23649195971 153783 106878.6 0.2233004

The error testing results using the four metrics above show that the model is not good enough. Based on the MAE value, this model is off by approximately USD 106,878 on average, and the MAPE indicates an average error of about 22% of the actual price.

5.2 Assumptions

These are the results of the assumption tests on the model.

1. Linearity

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, the model can be further improved upon or does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values. Unfortunately, for model_back there is a strong pattern in the residuals, which indicates non-linearity in the data.

resact <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point(color="#EF7129") + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

2. Normality of Errors/Residuals

In order to make valid inferences from regression, the residuals of the regression should follow a normal distribution. The residuals are simply the error terms, or the differences between the observed value of the dependent variable and the predicted value.

hist(model_back$residuals)
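A normal Q-Q plot gives a sharper picture than the histogram: if the residuals are normally distributed, the points should lie close to the reference line (a supplementary check using base R):

# Q-Q plot of the residuals against the theoretical normal quantiles
qqnorm(model_back$residuals)
qqline(model_back$residuals, col = "red")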

3. Homoscedasticity

Homoscedasticity refers to whether the residuals are equally distributed, or whether they tend to bunch together at some values and spread far apart at others.

The Breusch-Pagan test is used to determine whether or not heteroscedasticity is present in a regression model.

Breusch-Pagan hypotheses:

  • H0 : Homoscedasticity
  • H1 : Heteroscedasticity

library(lmtest)
bptest(model_back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_back
## BP = 2198.8, df = 17, p-value < 0.00000000000000022

Conclusion: because the p-value < 0.00000000000000022 is less than alpha = 0.05, we reject H0; model_back does not fulfill the homoscedasticity assumption.
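One common remedy worth trying (a sketch only, not part of this project's final model; the object names house_train_log and model_log are hypothetical) is to model log(price) instead of price, which often stabilises the variance of house-price residuals:

# refit with a log-transformed response and re-run the Breusch-Pagan test
house_train_log <- house_train %>% mutate(price = log(price))
model_log <- lm(price ~ ., data = house_train_log)
bptest(model_log)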

4. No Multicollinearity

No multicollinearity means there is no strong correlation between the predictors (X1, X2, …, Xn). We will calculate the VIF values using the vif() function from the car package. When all VIF values are < 10, the no-multicollinearity assumption is fulfilled.

library(car)
vif(model_back)
##      bedrooms     bathrooms   sqft_living      sqft_lot        floors 
##      1.629430      2.289288      7.450804      2.182089      2.205386 
##    waterfront          view     condition         grade    sqft_above 
##      1.155739      1.356328      1.247682      3.253969      6.272410 
##      yr_built  yr_renovated       zipcode           lat          long 
##      2.492501      1.137028      1.664121      1.190491      1.843140 
## sqft_living15    sqft_lot15 
##      2.951904      2.200961

6 Conclusions

This model has an Adjusted R-squared value: 0.7134, this indicates that the model can predict 71.2%. Then the results of error testing using the MAE method show an error of approximately USD 106,878. In assumptions test, this model only passes the Multicollinearity test, while failing the Linearity, Normality and Heteroscedasticity tests. In conclusion, this model is not good enough if it is to be used to predict house prices in relation to this dataset. I will try tuning or using another method in the next time.