The purpose of this project is to predict house sale prices in King County, Washington, USA, using multiple linear regression. The dataset consists of historical data on houses sold between May 2014 and May 2015.
First, we need to import the dataset into R. The data used in this project is kc_house_data.csv, which you can download from here.
options(scipen=999)
house <- read.csv("kc_house_data.csv")
Here are some brief explanations of the variables used in this project:
price : Price of the house, and the prediction target in this project
bedrooms : Number of bedrooms
bathrooms : Number of bathrooms
sqft_living : Square footage of the home
sqft_lot : Square footage of the lot
floors : Total floors (levels) in the house
waterfront : Whether the house has a view of a waterfront
view : Has been viewed (in this dataset the values range from 0 to 4)
condition : How good the overall condition is. 1 indicates a worn-out property and 5 an excellent one
grade : Overall grade given to the housing unit, based on the King County grading system. 1 poor, 13 excellent
sqft_above : Square footage of the house apart from the basement
sqft_basement : Square footage of the basement
yr_built : Year built
yr_renovated : Year when the house was renovated
zipcode : ZIP code
lat : Latitude coordinate
long : Longitude coordinate
sqft_living15 : Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
sqft_lot15 : Lot size area in 2015 (implies some renovations)
Then we can start to inspect the data.
str(house)
## 'data.frame': 21597 obs. of 21 variables:
## $ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
## $ date : chr "10/13/2014" "12/9/2014" "2/25/2015" "12/9/2014" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
rmarkdown::paged_table(house)
The data structure above includes the variables id and date. These variables do not provide useful information for the linear regression analysis, so they can be removed. Note that this project does not consider the sale date: the data only covers May 2014 to May 2015, so we assume that changes over time do not significantly affect the selling price of a house.
library(dplyr)
house <- house %>%
  select(-id, -date)
# as.integer() truncates the decimal part, e.g. 2.25 bathrooms becomes 2
house <- house %>%
  mutate(bathrooms = as.integer(bathrooms)) %>%
  mutate(floors = as.integer(floors))
Then we can check for missing values.
colSums(is.na(house))
## price bedrooms bathrooms sqft_living sqft_lot
## 0 0 0 0 0
## floors waterfront view condition grade
## 0 0 0 0 0
## sqft_above sqft_basement yr_built yr_renovated zipcode
## 0 0 0 0 0
## lat long sqft_living15 sqft_lot15
## 0 0 0 0
summary(house)
## price bedrooms bathrooms sqft_living
## Min. : 78000 Min. : 1.000 Min. :0.000 Min. : 370
## 1st Qu.: 322000 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.: 1430
## Median : 450000 Median : 3.000 Median :2.000 Median : 1910
## Mean : 540297 Mean : 3.373 Mean :1.751 Mean : 2080
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.000 3rd Qu.: 2550
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :13540
## sqft_lot floors waterfront view
## Min. : 520 Min. :1.000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 7618 Median :1.000 Median :0.000000 Median :0.0000
## Mean : 15099 Mean :1.446 Mean :0.007547 Mean :0.2343
## 3rd Qu.: 10685 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1651359 Max. :3.000 Max. :1.000000 Max. :4.0000
## condition grade sqft_above sqft_basement yr_built
## Min. :1.00 Min. : 3.000 Min. : 370 Min. : 0.0 Min. :1900
## 1st Qu.:3.00 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median :3.00 Median : 7.000 Median :1560 Median : 0.0 Median :1975
## Mean :3.41 Mean : 7.658 Mean :1789 Mean : 291.7 Mean :1971
## 3rd Qu.:4.00 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
## Max. :5.00 Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
## yr_renovated zipcode lat long
## Min. : 0.00 Min. :98001 Min. :47.16 Min. :-122.5
## 1st Qu.: 0.00 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
## Median : 0.00 Median :98065 Median :47.57 Median :-122.2
## Mean : 84.46 Mean :98078 Mean :47.56 Mean :-122.2
## 3rd Qu.: 0.00 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1
## Max. :2015.00 Max. :98199 Max. :47.78 Max. :-121.3
## sqft_living15 sqft_lot15
## Min. : 399 Min. : 651
## 1st Qu.:1490 1st Qu.: 5100
## Median :1840 Median : 7620
## Mean :1987 Mean : 12758
## 3rd Qu.:2360 3rd Qu.: 10083
## Max. :6210 Max. :871200
The dataset now looks OK, so we can move on to the next step.
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.
Find the correlation between variables using ggcorr()
# calculating the correlation
library(GGally)
ggcorr(data = house, label = T, size = 3, label_size= 3,hjust = 0.95, layout.exp = 2) +
labs(
title = "Dataset Correlation Matrix"
)+
theme_minimal()+
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 8, face="bold"),
axis.text.y = element_blank()
)
The correlation matrix above suggests that every variable except condition and long (longitude) has some correlation with price, and that the variables most strongly correlated with price are sqft_living and grade.
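The same information can be read numerically instead of from the plot by ranking the correlations with the target. The sketch below is only an illustration: `toy` is a small made-up data frame standing in for `house`, and its values are invented. On the real data the same `cor()` call would be applied to `house`.

```r
# Toy stand-in for `house`; all values are invented for illustration only.
set.seed(1)
n <- 200
toy <- data.frame(sqft_living = runif(n, 500, 4000),
                  condition   = sample(1:5, n, replace = TRUE))
toy$grade <- round(3 + toy$sqft_living / 500 + rnorm(n))
toy$price <- 50000 + 200 * toy$sqft_living + rnorm(n, sd = 50000)

# Pearson correlation of every predictor with price, strongest first
cors <- cor(toy)[, "price"]
cors <- cors[names(cors) != "price"]
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 2))
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.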
The distribution of price as the target variable
house %>%
ggplot(aes(x=price)) +
geom_histogram(aes(y=..density..),color = "black", fill="white")+
geom_density(alpha=0.2, fill="blue")+
labs(title = "House Price Distribution") +
theme_minimal()+
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 9, face ="bold"),
axis.title.y = element_text(margin = margin(l=5)),
axis.title.x.bottom = element_text(margin = margin(b=5))
)
From the histogram above we can see that a few very expensive houses (outliers) give the price distribution a long right tail, so it is far from normal. We will remove them.
# filtering house price under 2,000,000 USD
house_clean <- house %>%
filter(price < 2000000)
house_clean %>%
ggplot(aes(x=price)) +
geom_histogram(aes(y=..density..),color = "black", fill="white")+
geom_density(alpha=0.2, fill="blue")+
labs(title = "House Price Distribution") +
theme_minimal()+
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 9, face ="bold"),
axis.title.y = element_text(margin = margin(l=5)),
axis.title.x.bottom = element_text(margin = margin(b=5))
)
Now the dataset looks good, with a better price distribution.
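Before dropping rows with a fixed cut-off, it can be worth checking how much data the filter removes. The following is a minimal sketch on invented, log-normally distributed prices (`toy` is not the real dataset); with the real data, `house` and `house_clean` would take the place of `toy` and `toy_clean`.

```r
# Toy prices drawn from a log-normal distribution (invented, not the real data)
set.seed(2)
toy <- data.frame(price = rlnorm(1000, meanlog = 13, sdlog = 0.6))

cutoff <- 2000000
toy_clean <- toy[toy$price < cutoff, , drop = FALSE]

# How many rows the cut-off removes, and what share of the data that is
n_dropped <- nrow(toy) - nrow(toy_clean)
share_dropped <- n_dropped / nrow(toy)
cat("dropped:", n_dropped, "rows (", round(100 * share_dropped, 2), "%)\n")
```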
Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression model. The test dataset serves as a comparison, to see whether the model overfits and cannot predict new data that it has not seen during the training phase. We will use 80% of the data for training and the rest for testing.
set.seed(999)
idx_house <- sample (x=nrow(house_clean),size=nrow(house_clean)*0.8)
house_train <- house_clean[idx_house,]
house_test <- house_clean[-idx_house,]
Based on the correlation matrix above, the sqft_living and grade variables have the strongest correlation with price. Let’s try each of them as a single predictor.
Linear Regression with single predictor: sqft_living
model_sqlv <- lm(formula = price ~ sqft_living, data = house_train)
summary(model_sqlv)
##
## Call:
## lm(formula = price ~ sqft_living, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -939480 -137804 -22111 100180 1329080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56013.494 4148.643 13.5 <0.0000000000000002 ***
## sqft_living 225.062 1.863 120.8 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209500 on 17109 degrees of freedom
## Multiple R-squared: 0.4603, Adjusted R-squared: 0.4603
## F-statistic: 1.459e+04 on 1 and 17109 DF, p-value: < 0.00000000000000022
Judging by the p-value, sqft_living has a significant effect on price. However, as a single predictor, this model only has a Multiple R-squared of 0.4603, which means it explains only about 46% of the variance in the target.
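To make the meaning of R-squared concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented data (`x` and `y` are not taken from the house dataset) and compares the manual value with the one reported by `summary()`.

```r
# Invented data: a noisy linear relationship
set.seed(3)
x <- runif(100, 500, 4000)
y <- 50000 + 200 * x + rnorm(100, sd = 150000)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)     # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)      # total variation around the mean
r2_manual <- 1 - ss_res / ss_tot

r2_lm <- summary(fit)$r.squared
print(c(manual = r2_manual, lm = r2_lm))
```

With a single predictor and an intercept, this value also equals the squared Pearson correlation between `x` and `y`.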
Linear Regression with single predictor: grade
model_grade <- lm(formula = price ~ grade, data = house_train)
summary(model_grade)
##
## Call:
## lm(formula = price ~ grade, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -545919 -136608 -29516 94967 1408350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -795425 10892 -73.03 <0.0000000000000002 ***
## grade 172134 1412 121.93 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208600 on 17109 degrees of freedom
## Multiple R-squared: 0.4649, Adjusted R-squared: 0.4649
## F-statistic: 1.487e+04 on 1 and 17109 DF, p-value: < 0.00000000000000022
Judging by the p-value, grade has a significant effect on price. However, as a single predictor, this model only has a Multiple R-squared of 0.4649, which means it explains only about 46% of the variance in the target.
Conclusion: a linear regression model with a single predictor is not sufficient for predicting house prices in this dataset.
model_all <- lm(formula = price ~ ., data = house_train)
summary(model_all)
##
## Call:
## lm(formula = price ~ ., data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -790595 -87391 -10739 68277 1016615
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1171255.11005 2492726.85776 -0.470 0.6385
## bedrooms -15688.35659 1622.92052 -9.667 < 0.0000000000000002 ***
## bathrooms 31720.96720 2495.92754 12.709 < 0.0000000000000002 ***
## sqft_living 98.77384 3.70949 26.627 < 0.0000000000000002 ***
## sqft_lot 0.21299 0.04456 4.780 0.000001766638 ***
## floors 26638.43710 3147.58324 8.463 < 0.0000000000000002 ***
## waterfront 272756.77783 17170.32813 15.885 < 0.0000000000000002 ***
## view 46277.35452 1858.95803 24.894 < 0.0000000000000002 ***
## condition 29116.62571 2011.26444 14.477 < 0.0000000000000002 ***
## grade 92808.34273 1865.49909 49.750 < 0.0000000000000002 ***
## sqft_above 9.04028 3.69336 2.448 0.0144 *
## sqft_basement NA NA NA NA
## yr_built -2166.71128 62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated 20.44961 3.16428 6.463 0.000000000106 ***
## zipcode -408.85122 28.15316 -14.522 < 0.0000000000000002 ***
## lat 586173.00194 9166.42208 63.948 < 0.0000000000000002 ***
## long -139214.47326 11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15 44.76065 3.01698 14.836 < 0.0000000000000002 ***
## sqft_lot15 -0.27771 0.06485 -4.282 0.000018588763 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared: 0.7131, Adjusted R-squared: 0.7129
## F-statistic: 2500 on 17 and 17093 DF, p-value: < 0.00000000000000022
The summary of model_all shows a lot of information, but for now we will focus on the Pr(>|t|) column. This column shows the p-value of each variable, that is, its significance in the model. If the value is below 0.05, we can reasonably conclude that the variable has a significant effect in the model (meaning that its estimated coefficient is different from 0), and vice versa. Thus, we could build a simpler model by removing variables with a p-value > 0.05, since they do not have a significant effect in our model. Here every predictor is below 0.05. sqft_basement shows NA because the output reports a singularity, meaning it is an exact linear combination of other predictors (consistent with sqft_living being the sum of sqft_above and sqft_basement). The Estimate column shows the coefficient of each variable. For example, with all other variables held constant, every additional square foot of sqft_living is associated with an increase of about 98.77 in the house price.
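The p-values and coefficients can also be pulled out of the summary programmatically. The following sketch uses an invented data frame (`toy`, with a deliberately unrelated `noise_var` column) rather than the house data; it lists the terms with p < 0.05 and shows that raising a predictor by one unit changes the prediction by exactly its coefficient.

```r
# Invented data: price depends on sqft_living, noise_var is unrelated by construction
set.seed(4)
n <- 300
toy <- data.frame(sqft_living = runif(n, 500, 4000),
                  noise_var   = rnorm(n))
toy$price <- 50000 + 100 * toy$sqft_living + rnorm(n, sd = 40000)

fit <- lm(price ~ ., data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)

# One extra square foot, everything else unchanged
ref_house <- data.frame(sqft_living = 2000, noise_var = 0)
plus_one  <- transform(ref_house, sqft_living = sqft_living + 1)
diff_pred <- unname(predict(fit, plus_one) - predict(fit, ref_house))
print(c(difference = diff_pred, coefficient = unname(coef(fit)["sqft_living"])))
```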
To select the predictors (X) in a regression analysis, we can use stepwise regression. This method uses the AIC value as the criterion for removing predictors from, or adding them to, the linear regression model. Here I use backward elimination, which starts from the full model and removes a predictor whenever doing so lowers the AIC.
backward <- step(object = model_all, trace = 1)
## Start: AIC=408518
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated + zipcode + lat + long + sqft_living15 +
## sqft_lot15
##
##
## Step: AIC=408518
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15
##
## Df Sum of Sq RSS AIC
## <none> 398990083371061 408518
## - sqft_above 1 139850327970 399129933699031 408522
## - sqft_lot15 1 428075681605 399418159052666 408534
## - sqft_lot 1 533357265985 399523440637046 408539
## - yr_renovated 1 974911614958 399964994986019 408558
## - floors 1 1671886697922 400661970068983 408588
## - bedrooms 1 2181243805008 401171327176068 408609
## - long 1 3596900258080 402586983629141 408670
## - bathrooms 1 3770272198291 402760355569352 408677
## - condition 1 4892012391550 403882095762611 408725
## - zipcode 1 4922888712525 403912972083586 408726
## - sqft_living15 1 5137978530474 404128061901535 408735
## - waterfront 1 5890304177169 404880387548229 408767
## - view 1 14465774408034 413455857779094 409125
## - sqft_living 1 16550071533071 415540154904131 409211
## - yr_built 1 27629490242992 426619573614053 409662
## - grade 1 57773377934993 456763461306054 410830
## - lat 1 95454351393692 494444434764753 412186
summary(backward)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + sqft_above +
## yr_built + yr_renovated + zipcode + lat + long + sqft_living15 +
## sqft_lot15, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -790595 -87391 -10739 68277 1016615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1171255.11005 2492726.85776 -0.470 0.6385
## bedrooms -15688.35659 1622.92052 -9.667 < 0.0000000000000002 ***
## bathrooms 31720.96720 2495.92754 12.709 < 0.0000000000000002 ***
## sqft_living 98.77384 3.70949 26.627 < 0.0000000000000002 ***
## sqft_lot 0.21299 0.04456 4.780 0.000001766638 ***
## floors 26638.43710 3147.58324 8.463 < 0.0000000000000002 ***
## waterfront 272756.77783 17170.32813 15.885 < 0.0000000000000002 ***
## view 46277.35452 1858.95803 24.894 < 0.0000000000000002 ***
## condition 29116.62571 2011.26444 14.477 < 0.0000000000000002 ***
## grade 92808.34273 1865.49909 49.750 < 0.0000000000000002 ***
## sqft_above 9.04028 3.69336 2.448 0.0144 *
## yr_built -2166.71128 62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated 20.44961 3.16428 6.463 0.000000000106 ***
## zipcode -408.85122 28.15316 -14.522 < 0.0000000000000002 ***
## lat 586173.00194 9166.42208 63.948 < 0.0000000000000002 ***
## long -139214.47326 11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15 44.76065 3.01698 14.836 < 0.0000000000000002 ***
## sqft_lot15 -0.27771 0.06485 -4.282 0.000018588763 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared: 0.7131, Adjusted R-squared: 0.7129
## F-statistic: 2500 on 17 and 17093 DF, p-value: < 0.00000000000000022
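In the step() output above, the `<none>` row has the lowest AIC, which is why no further predictor is removed. As a rough illustration of what step() compares, the sketch below recomputes the AIC that R uses for lm fits, n * log(RSS / n) + 2 * (number of coefficients), on invented data (`toy`, with a deliberately useless `junk` column) and then runs a backward elimination.

```r
# Invented data: y depends on x1 and x2, junk is unrelated by construction
set.seed(5)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), junk = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + rnorm(n)

full <- lm(y ~ ., data = toy)

# AIC as used by step() for lm fits
rss <- sum(residuals(full)^2)
k   <- length(coef(full))
aic_manual <- n * log(rss / n) + 2 * k
aic_step   <- extractAIC(full)[2]
print(c(manual = aic_manual, extractAIC = aic_step))

# Backward elimination: a term is dropped only if that lowers the AIC
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
```

Because the penalty is only 2 per coefficient, a predictor with a small but real effect usually stays in the model, which matches what happens with the house data.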
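The backward stepwise procedure returns the following formula for the multiple linear regression. Only sqft_basement, which could not be estimated, was dropped; removing any other predictor would increase the AIC.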
Formula:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition + grade + sqft_above + yr_built + yr_renovated + zipcode + lat + long + sqft_living15 + sqft_lot15, data = house_train)
model_back <- lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
floors + waterfront + view + condition + grade + sqft_above +
yr_built + yr_renovated + zipcode + lat + long + sqft_living15 +
sqft_lot15, data = house_train)
summary(model_back)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + sqft_above +
## yr_built + yr_renovated + zipcode + lat + long + sqft_living15 +
## sqft_lot15, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -790595 -87391 -10739 68277 1016615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1171255.11005 2492726.85776 -0.470 0.6385
## bedrooms -15688.35659 1622.92052 -9.667 < 0.0000000000000002 ***
## bathrooms 31720.96720 2495.92754 12.709 < 0.0000000000000002 ***
## sqft_living 98.77384 3.70949 26.627 < 0.0000000000000002 ***
## sqft_lot 0.21299 0.04456 4.780 0.000001766638 ***
## floors 26638.43710 3147.58324 8.463 < 0.0000000000000002 ***
## waterfront 272756.77783 17170.32813 15.885 < 0.0000000000000002 ***
## view 46277.35452 1858.95803 24.894 < 0.0000000000000002 ***
## condition 29116.62571 2011.26444 14.477 < 0.0000000000000002 ***
## grade 92808.34273 1865.49909 49.750 < 0.0000000000000002 ***
## sqft_above 9.04028 3.69336 2.448 0.0144 *
## yr_built -2166.71128 62.97766 -34.404 < 0.0000000000000002 ***
## yr_renovated 20.44961 3.16428 6.463 0.000000000106 ***
## zipcode -408.85122 28.15316 -14.522 < 0.0000000000000002 ***
## lat 586173.00194 9166.42208 63.948 < 0.0000000000000002 ***
## long -139214.47326 11214.81453 -12.413 < 0.0000000000000002 ***
## sqft_living15 44.76065 3.01698 14.836 < 0.0000000000000002 ***
## sqft_lot15 -0.27771 0.06485 -4.282 0.000018588763 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152800 on 17093 degrees of freedom
## Multiple R-squared: 0.7131, Adjusted R-squared: 0.7129
## F-statistic: 2500 on 17 and 17093 DF, p-value: < 0.00000000000000022
Several metrics can be used to evaluate the regression model above:
1. Accuracy
R-squared measures how much of the variability in the dependent variable is explained by the model, so it is a good measure of how well the model fits. From the results above, the model has an Adjusted R-squared of 0.7129, which indicates that model_back explains about 71.3% of the variance in price.
2. Error
We can measure the prediction error with the following metrics:
Mean Square Error (MSE) is an absolute measure of the goodness of fit. It is calculated as the sum of the squared prediction errors (actual value minus predicted value) divided by the number of data points.
Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than MSE because it is expressed in the same unit as the target, whereas the MSE value can be too big to compare easily.
Mean Absolute Error (MAE) is similar to MSE, but it sums the absolute values of the errors instead of their squares. Compared to MSE or RMSE, MAE is a more direct representation of the size of the error terms: MSE penalises big prediction errors more heavily by squaring them, while MAE treats all errors the same.
Mean Absolute Percentage Error (MAPE) is the mean of the absolute percentage errors, where each error (actual value minus predicted value) is divided by the actual value.
# MSE(), RMSE(), MAE() and MAPE() come from the MLmetrics package
library(MLmetrics)
pred <- predict(object = model_back, newdata = house_test)
mse <- MSE(y_pred = pred, y_true = house_test$price)
rmse <- RMSE(y_pred = pred, y_true = house_test$price)
mae <- MAE(y_pred = pred, y_true = house_test$price)
mape <- MAPE(y_pred = pred, y_true = house_test$price)
data.frame("MSE"=mse,"RMSE"=rmse,"MAE"=mae,"MAPE"=mape)
## MSE RMSE MAE MAPE
## 1 23649195971 153783 106878.6 0.2233004
The four error metrics above show that the model is not good enough. Based on the MAE, the model's predictions are off by about USD 106,879 on average, and based on the MAPE by about 22.3% of the actual price.
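The four metrics are simple enough to write out in base R, which makes the definitions above explicit. In this sketch the `calc_*` helper names are hypothetical (they are not part of MLmetrics), and the prices and predictions are invented.

```r
# Hypothetical helpers mirroring the definitions of MSE, RMSE, MAE and MAPE
calc_mse  <- function(pred, actual) mean((actual - pred)^2)
calc_rmse <- function(pred, actual) sqrt(calc_mse(pred, actual))
calc_mae  <- function(pred, actual) mean(abs(actual - pred))
calc_mape <- function(pred, actual) mean(abs((actual - pred) / actual))

actual_toy <- c(200000, 350000, 500000, 800000)   # invented prices
pred_toy   <- c(220000, 330000, 450000, 900000)   # invented predictions

print(c(MSE  = calc_mse(pred_toy, actual_toy),
        RMSE = calc_rmse(pred_toy, actual_toy),
        MAE  = calc_mae(pred_toy, actual_toy),
        MAPE = calc_mape(pred_toy, actual_toy)))
```

In this toy example the single error of 100,000 dominates the MSE and RMSE, while the MAE of 47,500 weights all four errors equally.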
These are the results of the assumption tests on the model.
1. Linearity
Residual plots are a useful graphical tool for identifying non-linearity. The plot shows the relationship between the residuals (errors) and the predicted (fitted) values. If there is a pattern in the residual plot, the model does not meet the linearity assumption and can be improved further. Unfortunately, for model_back there is a strong pattern in the residuals, which indicates non-linearity in the data.
resact <- data.frame(residual = model_back$residuals, fitted = model_back$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point(color="#EF7129") + geom_smooth() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
2. Normality of Residuals
In order to make valid inferences from regression, the residuals of the regression should follow a normal distribution. The residuals are simply the error terms, or the differences between the observed value of the dependent variable and the predicted value.
hist(model_back$residuals)
3. Homoscedasticity
Homoscedasticity refers to whether the residuals have a constant spread across the fitted values, or whether they bunch together at some values and spread far apart at others.
The Breusch-Pagan test is used to determine whether heteroscedasticity is present in a regression model.
Breusch-Pagan hypothesis:
- H0: Homoscedasticity
- H1: Heteroscedasticity
library(lmtest)
bptest(model_back)
##
## studentized Breusch-Pagan test
##
## data: model_back
## BP = 2198.8, df = 17, p-value < 0.00000000000000022
Conclusion: because the p-value (< 0.00000000000000022) is smaller than alpha = 0.05, we reject H0, so model_back does not fulfill the homoscedasticity assumption.
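To show what the test measures, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of observations. The sketch below does this in base R on invented data whose error spread grows with `x` by construction; it is an illustration of the idea, not a replacement for bptest().

```r
# Invented data: the error standard deviation grows with x
set.seed(6)
n <- 500
x <- runif(n, 1, 10)
y <- 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared        # studentized (Koenker) form
bp_df   <- length(coef(aux)) - 1             # number of predictors
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = p_value))
```

A small p-value means the squared residuals can be predicted from the regressors, which is exactly what a non-constant error variance looks like.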
4. No multicollinearity
No multicollinearity means that there is no strong correlation among the predictors (X1, X2, … Xn). We calculate the VIF (variance inflation factor) with the vif() function from the car library. When every VIF value is < 10, the assumption of no multicollinearity is fulfilled.
library(car)
vif(model_back)
## bedrooms bathrooms sqft_living sqft_lot floors
## 1.629430 2.289288 7.450804 2.182089 2.205386
## waterfront view condition grade sqft_above
## 1.155739 1.356328 1.247682 3.253969 6.272410
## yr_built yr_renovated zipcode lat long
## 2.492501 1.137028 1.664121 1.190491 1.843140
## sqft_living15 sqft_lot15
## 2.951904 2.200961
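The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following base-R sketch reproduces this on invented data (`toy`, in which `sqft_above` is built to be strongly tied to `sqft_living`); the column names are borrowed from the dataset only for readability.

```r
# Invented predictors: sqft_above is constructed from sqft_living, lat is independent
set.seed(7)
n <- 300
toy <- data.frame(sqft_living = runif(n, 500, 4000),
                  lat         = rnorm(n, 47.5, 0.1))
toy$sqft_above <- 0.8 * toy$sqft_living + rnorm(n, sd = 200)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In the real output above, sqft_living and sqft_above have the highest values (about 7.5 and 6.3), which is the same pattern in a milder form.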
This model has an Adjusted R-squared of 0.7129, which indicates that it explains about 71.3% of the variance in price. The MAE shows an average error of about USD 106,879. Among the assumption tests, the model only passes the multicollinearity test, while failing the linearity, normality and homoscedasticity tests. In conclusion, this model is not good enough to predict house prices for this dataset. I will try tuning it or using another method next time.