Introduction

This analysis models house prices based on several characteristics of each house. The data was provided by Kaggle. We’ll use a linear regression model to predict the house prices.

Import Library

library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(nortest)

Read Files

house <- read.csv("Houseprice.csv")

Exploratory Data Analysis

str(house)
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

The house data contains 500,000 observations and 16 columns. Before we continue to predict the price (the target variable), let’s make sure there are no missing values in the data.

anyNA(house)
## [1] FALSE
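
anyNA() only answers yes or no. If it ever returned TRUE, a per-column count would show where the gaps are. Here is a minimal sketch on a hypothetical toy data frame (toy and its columns are made up for illustration and are not part of the house data):

```r
# Hypothetical toy data frame with one missing value in column b
toy <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30))

# Count the missing values in every column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

The same call on the real data would be colSums(is.na(house)).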

There are no missing values in the data. Let’s look at a sample of the data and the correlation between its variables.

#get the sample of data with head
head(house)
##   Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1  164      2         0     2            0            1             0      0
## 2   84      2         0     4            0            0             1      1
## 3  190      2         4     4            1            0             0      0
## 4   75      2         4     4            0            0             1      1
## 5  148      1         4     2            1            0             0      1
## 6  124      3         3     3            0            1             0      1
##   City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1    3     1        1     1           1            0      0  43800
## 2    2     0        0     0           1            1      1  37550
## 3    2     0        0     1           0            0      0  49500
## 4    1     1        1     1           1            1      1  50075
## 5    2     1        0     0           1            1      1  52400
## 6    1     0        0     1           1            1      1  54300

Check whether Prices contains outliers

sort(boxplot(house$Prices, plot = FALSE)$out)
##  [1] 76750 76775 76800 76825 76950 76975 77000 77075 77175 77225 77225 77250
## [13] 77375 77525 77700 77975

There are some outliers in our data. Let’s filter them out. The smallest outlier is 76750, so I keep only the rows where Prices is smaller than 76750.

house <- house %>% 
  filter(Prices < 76750)
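
For reference, the outliers reported by boxplot() come from boxplot.stats(), which flags values that lie more than 1.5 times the hinge spread beyond the hinges. The sketch below shows the same rule on a hypothetical toy vector (the prices values are invented, not taken from the house data) and removes the flagged values without hard-coding a threshold:

```r
# Hypothetical toy prices: a regular sequence plus one extreme value
prices <- c(seq(20000, 60000, by = 2500), 95000)

stats <- boxplot.stats(prices)   # same rule that boxplot() uses
outliers <- stats$out            # values beyond the whiskers
upper_whisker <- stats$stats[5]  # largest value that is not an outlier

# Keep only the values that were not flagged as outliers
kept <- prices[!prices %in% outliers]
print(outliers)
print(upper_whisker)
```

Filtering on the flagged values directly avoids copying a number such as 76750 by hand.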

Get the correlation between variables with the ggcorr() function from the GGally package.

#get the correlation between variables 
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the correlation plot, we can see that Floors, Fiber, and White.Marble have the strongest correlation with Prices.

Modeling

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and fails to predict new data that it has not seen during the training phase. We will use 70% of the data as the training data and the rest as the testing data.

RNGkind(sample.kind = "Rounding")
set.seed(123)
idx <- sample(nrow(house), nrow(house) *0.7)
data_train <- house[idx, ]
data_test <- house[-idx, ]
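
A quick sanity check after a split like this is to confirm that the two parts are disjoint and together cover the whole data. The sketch below does that on a hypothetical toy data frame (toy, id, and y are illustrative names), using the default random number generator rather than the "Rounding" sampler above:

```r
set.seed(123)

# Hypothetical toy data frame standing in for house
toy <- data.frame(id = 1:10, y = rnorm(10))

# sample() truncates a fractional size, so floor() makes that explicit
n_train <- floor(nrow(toy) * 0.7)
idx <- sample(nrow(toy), n_train)

toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

# No row may appear in both parts, and no row may be lost
shared_rows <- intersect(toy_train$id, toy_test$id)
print(c(train = nrow(toy_train), test = nrow(toy_test), shared = length(shared_rows)))
```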

Now we will fit the linear regression model using:

  1. Prices as the target variable
  2. Floors, Fiber, and White.Marble as predictors, because they have the strongest correlation with Prices

house_lm <- lm(formula = Prices ~ Floors + Fiber + White.Marble, data_train)
summary(house_lm)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17527.4  -3592.4      6.6   3585.3  17506.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24865.79      16.32  1524.0   <2e-16 ***
## Floors       14977.66      17.45   858.4   <2e-16 ***
## Fiber        11733.96      17.45   672.5   <2e-16 ***
## White.Marble 11523.94      18.51   622.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5161 on 349984 degrees of freedom
## Multiple R-squared:  0.8184, Adjusted R-squared:  0.8184 
## F-statistic: 5.257e+05 on 3 and 349984 DF,  p-value: < 2.2e-16

From the summary of the model lm(formula = Prices ~ Floors + Fiber + White.Marble, data = data_train), we get an adjusted R-squared of 0.8184, meaning that the model explains 81.84% of the variance in the target variable (house price).
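
To make the R-squared numbers less of a black box, here is a minimal sketch that recomputes them by hand. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not the house data): R-squared is 1 - RSS/TSS, and the adjusted version penalizes the number of predictors p.

```r
# Stand-in model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
n <- nrow(mtcars)                               # number of observations
p <- length(coef(fit)) - 1                      # predictors, without intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both values should match what summary() reports
print(c(r2 = r2, adj_r2 = adj_r2))
```

With about 350,000 training rows and only 3 predictors, the penalty is tiny, which is why the two values in the summary above are both 0.8184.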

Now, let’s try to select the predictor variables automatically by using the stepwise regression method with the direction = "backward" parameter.

house_step <- step(object = lm(Prices ~ ., data_train), direction = "backward")
## Start:  AIC=-13056840
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden
## 
## 
## Step:  AIC=-13056840
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Garden
## 
##                Df  Sum of Sq        RSS       AIC
## - Garden        1 0.0000e+00 0.0000e+00 -13056841
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -13056840
## <none>                       0.0000e+00 -13056840
## - Solar         1 5.4684e+09 5.4684e+09   3379718
## - Electric      1 1.3671e+11 1.3671e+11   4506295
## - FirePlace     1 3.9389e+11 3.9389e+11   4876649
## - Garage        1 5.2500e+11 5.2500e+11   4977213
## - Baths         1 1.0948e+12 1.0948e+12   5234440
## - Area          1 1.1267e+12 1.1267e+12   5244469
## - Black.Marble  1 1.4596e+12 1.4596e+12   5335072
## - Glass.Doors   1 1.7326e+12 1.7326e+12   5395088
## - City          1 2.8556e+12 2.8556e+12   5569966
## - White.Marble  1 1.1445e+13 1.1445e+13   6055836
## - Fiber         1 1.2080e+13 1.2080e+13   6074727
## - Floors        1 1.9686e+13 1.9686e+13   6245667
## 
## Step:  AIC=-13056841
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool
## 
##                Df  Sum of Sq        RSS       AIC
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -13056842
## <none>                       0.0000e+00 -13056841
## - Solar         1 5.4684e+09 5.4684e+09   3379720
## - Electric      1 1.3671e+11 1.3671e+11   4506293
## - FirePlace     1 3.9389e+11 3.9389e+11   4876647
## - Garage        1 5.2500e+11 5.2500e+11   4977212
## - Baths         1 1.0948e+12 1.0948e+12   5234439
## - Area          1 1.1267e+12 1.1267e+12   5244467
## - Black.Marble  1 1.4596e+12 1.4596e+12   5335070
## - Glass.Doors   1 1.7326e+12 1.7326e+12   5395090
## - City          1 2.8556e+12 2.8556e+12   5569964
## - White.Marble  1 1.1445e+13 1.1445e+13   6055836
## - Fiber         1 1.2080e+13 1.2080e+13   6074726
## - Floors        1 1.9686e+13 1.9686e+13   6245666
## 
## Step:  AIC=-13056842
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors
## 
##                Df  Sum of Sq        RSS       AIC
## <none>                       0.0000e+00 -13056842
## - Solar         1 5.4684e+09 5.4684e+09   3379718
## - Electric      1 1.3671e+11 1.3671e+11   4506292
## - FirePlace     1 3.9389e+11 3.9389e+11   4876647
## - Garage        1 5.2500e+11 5.2500e+11   4977210
## - Baths         1 1.0949e+12 1.0949e+12   5234441
## - Area          1 1.1267e+12 1.1267e+12   5244465
## - Black.Marble  1 1.4596e+12 1.4596e+12   5335069
## - Glass.Doors   1 1.7326e+12 1.7326e+12   5395088
## - City          1 2.8556e+12 2.8556e+12   5569962
## - White.Marble  1 1.1445e+13 1.1445e+13   6055837
## - Fiber         1 1.2080e+13 1.2080e+13   6074727
## - Floors        1 1.9686e+13 1.9686e+13   6245665
summary(house_step)
## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = data_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.684e-06  0.000e+00  0.000e+00  0.000e+00  1.037e-07 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.000e+03  7.285e-11 1.373e+13   <2e-16 ***
## Area         2.500e+01  1.866e-13 1.339e+14   <2e-16 ***
## Garage       1.500e+03  1.641e-11 9.143e+13   <2e-16 ***
## FirePlace    7.500e+02  9.470e-12 7.920e+13   <2e-16 ***
## Baths        1.250e+03  9.467e-12 1.320e+14   <2e-16 ***
## White.Marble 1.400e+04  3.279e-11 4.269e+14   <2e-16 ***
## Black.Marble 5.000e+03  3.280e-11 1.525e+14   <2e-16 ***
## Floors       1.500e+04  2.679e-11 5.599e+14   <2e-16 ***
## City         3.500e+03  1.641e-11 2.132e+14   <2e-16 ***
## Solar        2.500e+02  2.679e-11 9.331e+12   <2e-16 ***
## Electric     1.250e+03  2.679e-11 4.666e+13   <2e-16 ***
## Fiber        1.175e+04  2.679e-11 4.386e+14   <2e-16 ***
## Glass.Doors  4.450e+03  2.679e-11 1.661e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.925e-09 on 349975 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.811e+28 on 12 and 349975 DF,  p-value: < 2.2e-16
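
The AIC values printed by step() come from extractAIC(), which for a linear model is n * log(RSS / n) + 2 * edf. When the RSS is almost zero, as it is here, log(RSS / n) becomes a large negative number, which explains AIC values around -13 million. A minimal sketch that reproduces the formula on the built-in mtcars data (a stand-in, not the house data):

```r
# Stand-in model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- length(residuals(fit))     # number of observations
rss <- sum(residuals(fit)^2)    # residual sum of squares
edf <- n - df.residual(fit)     # number of estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC); the second element is what step() prints
aic_step <- extractAIC(fit)[2]
print(c(manual = aic_manual, extractAIC = aic_step))
```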

From the summary of the stepwise regression model, we get the following information:

  1. The predictor variables that are statistically significant for Prices are: Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Floors, City, Solar, Electric, Fiber, and Glass.Doors (marked by ***). Backward elimination dropped Indian.Marble, Garden, and Swiming.Pool.
  2. The model gives an adjusted R-squared of 1, meaning that the model explains 100% of the variance in the target variable (house price). The residual standard error is 7.925e-09, so the fit is essentially exact.
  3. The previous model, which predicts Prices from Floors + Fiber + White.Marble, has an adjusted R-squared of only 0.8184. This means that house_step is better than house_lm in this case.

Model Prediction and Error

Let’s predict Prices from the predictors using the house_step model. The predictions will be compared with the actual values in data_train and data_test.

#predict prices in data_train using house_step
data_train$pred <- predict(house_step, newdata = data.frame(data_train))
data_train %>% 
  select(Prices, pred) %>% 
  head()
##        Prices  pred
## 143785  52550 52550
## 394140  36800 36800
## 204482  52575 52575
## 441492  54200 54200
## 470215  49500 49500
## 22778   47775 47775
#predict prices in data_test using house_step
data_test$pred <- predict(house_step, newdata = data.frame(data_test))
data_test %>% 
  select(Prices, pred) %>% 
  head()
##    Prices  pred
## 9   29575 29575
## 10  22300 22300
## 15  38500 38500
## 19  53625 53625
## 20  38300 38300
## 21  67850 67850

Let’s check the error of the model with the MAPE() function from the MLmetrics package.

#check the error of Prices in data_train
format(MAPE(y_pred = data_train$pred, y_true = data_train$Prices), scientific = F)
## [1] "0.0000000000002072597"
#check the error of Prices in data_test
format(MAPE(y_pred = data_test$pred, y_true = data_test$Prices), scientific = F)
## [1] "0.0000000000002072442"

Using MAPE (Mean Absolute Percentage Error), we see that the house_step model, which was selected by stepwise regression, has an error of about 2.07e-13 on both the training and the testing data. That is far below 1% and effectively zero.
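
For readers without MLmetrics installed, MAPE is short enough to write by hand. The sketch below defines a hypothetical helper (mape_manual is an illustrative name, not a package function) that returns the error as a fraction, so 0.1 means 10%:

```r
# Hypothetical helper: mean absolute percentage error as a fraction
mape_manual <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true))
}

# Toy values: both predictions are off by 10% of the true value
y_true <- c(100, 200)
y_pred <- c(110, 180)

toy_mape <- mape_manual(y_pred = y_pred, y_true = y_true)
print(toy_mape)
```

Note that the division by y_true means the metric is undefined when a true value is 0, which is not a concern for house prices.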

Model Evaluation

Linearity

cor.test(house$Prices, house$Area)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Area
## t = 105.52, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1448892 0.1503121
## sample estimates:
##       cor 
## 0.1476017
cor.test(house$Prices, house$Garage)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Garage
## t = 71.207, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0974516 0.1029397
## sample estimates:
##       cor 
## 0.1001964
cor.test(house$Prices, house$FirePlace)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$FirePlace
## t = 63.212, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08629182 0.09179158
## sample estimates:
##        cor 
## 0.08904238
cor.test(house$Prices, house$Baths)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Baths
## t = 103.62, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1422766 0.1477037
## sample estimates:
##       cor 
## 0.1449912
cor.test(house$Prices, house$White.Marble)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$White.Marble
## t = 354.42, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4458796 0.4503102
## sample estimates:
##       cor 
## 0.4480976
cor.test(house$Prices, house$Black.Marble)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Black.Marble
## t = -55.318, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08074937 -0.07523939
## sample estimates:
##         cor 
## -0.07799497
cor.test(house$Prices, house$Floors)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Floors
## t = 557.95, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6177400 0.6211565
## sample estimates:
##       cor 
## 0.6194512
cor.test(house$Prices, house$City)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$City
## t = 169.56, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2305597 0.2358020
## sample estimates:
##       cor 
## 0.2331825
cor.test(house$Prices, house$Solar)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Solar
## t = 5.9364, df = 499982, p-value = 2.916e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.00562345 0.01116677
## sample estimates:
##         cor 
## 0.008395172
cor.test(house$Prices, house$Electric)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Electric
## t = 37.081, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04960462 0.05513312
## sample estimates:
##        cor 
## 0.05236927
cor.test(house$Prices, house$Fiber)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Fiber
## t = 391.73, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4824821 0.4867240
## sample estimates:
##       cor 
## 0.4846059
cor.test(house$Prices, house$Glass.Doors)
## 
##  Pearson's product-moment correlation
## 
## data:  house$Prices and house$Glass.Doors
## t = 130.81, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1792264 0.1845867
## sample estimates:
##       cor 
## 0.1819079

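From the cor.test() results, we see that every predictor variable has a p-value < 0.05, which means each predictor has a statistically significant linear correlation with Prices, so we treat the linearity assumption as met.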
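
The twelve cor.test() calls above can be collapsed into one loop that returns a compact table. A minimal sketch on the built-in mtcars data (a stand-in; the predictors vector is an illustrative choice, not the house columns):

```r
# Stand-in predictors from the built-in mtcars data; mpg plays the role of Prices
predictors <- c("wt", "hp", "disp")

# One cor.test() per predictor, keeping only the estimate and the p-value
cor_table <- sapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  c(cor = unname(ct$estimate), p.value = ct$p.value)
})

# Rows are "cor" and "p.value", columns are the predictors
print(round(cor_table, 4))
```

For the house data, predictors would be the column names kept by house_step.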

Normality (Check Residual)

The normality test checks whether the residuals of the model are normally distributed.

hist(house_step$residuals, breaks = 5)

length(house_step$residuals)
## [1] 349988

Because shapiro.test() accepts a maximum sample size of 5,000 and we have 349,988 residuals, we can either test the first 5,000 residuals or use ad.test() from the nortest package, which handles larger samples.

#using shapiro.test for the first 5000 residuals (R indexes from 1)
shapiro.test(house_step$residuals[1:5000])
## 
##  Shapiro-Wilk normality test
## 
## data:  house_step$residuals[1:5000]
## W = 0.0036873, p-value < 2.2e-16
ad.test(house_step$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  house_step$residuals
## A = 133168, p-value < 2.2e-16

With both shapiro.test() and ad.test(), we get a p-value < 2.2e-16, which means our residuals are not normally distributed and the model needs to be tuned.
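
Taking the first 5,000 residuals works, but a random sample is less sensitive to the order of the rows. The sketch below shows that variant on simulated residuals (res is generated with rnorm() purely for illustration and is not the house_step residuals):

```r
set.seed(123)

# Simulated residuals standing in for house_step$residuals
res <- rnorm(20000)

# Draw 5000 residuals at random instead of taking the first 5000
res_sample <- sample(res, size = 5000)

sw <- shapiro.test(res_sample)
print(sw$p.value)
```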

Heteroscedasticity

bptest(house_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  house_step
## BP = 14.228, df = 12, p-value = 0.2864
plot(data_train$Prices, house_step$residuals)
abline(h = 0, col = "red")

From the Breusch-Pagan test, we get a p-value of 0.2864. Since the p-value > 0.05, the residuals are homoscedastic (there is no heteroscedasticity in our model).
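
To see what the test measures, here is a rough sketch of the studentized (Koenker) form of the statistic, computed by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. It uses the built-in mtcars data as a stand-in and is meant as an illustration of the idea, not a replacement for bptest():

```r
# Stand-in model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(residuals(fit)^2 ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df <- length(coef(aux)) - 1          # one degree of freedom per predictor
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

A large p-value means the predictors do not explain the size of the squared residuals, which is what homoscedasticity requires.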

Variance Inflation Factor (Multicollinearity)

vif(house_step)
##         Area       Garage    FirePlace        Baths White.Marble Black.Marble 
##     1.000024     1.000041     1.000021     1.000031     1.330728     1.330724 
##       Floors         City        Solar     Electric        Fiber  Glass.Doors 
##     1.000014     1.000028     1.000022     1.000009     1.000030     1.000028

No VIF value is larger than 10, so we can conclude that there is no multicollinearity.
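
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch of this definition, using a hypothetical helper (vif_manual is an illustrative name) on the built-in mtcars data rather than the house data:

```r
# Hypothetical helper: VIF of each predictor from its own auxiliary regression
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_values <- vif_manual(mtcars, c("wt", "hp", "disp"))
print(round(vif_values, 3))
```

A VIF of 1 means the predictor is uncorrelated with the others, which matches the values close to 1 in the output above.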

Conclusion

The variables that are useful for describing the variance in house prices are Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Floors, City, Solar, Electric, Fiber, and Glass.Doors. Our final model passed the linearity, homoscedasticity, and no-multicollinearity checks, but did not pass the normality test. We removed the outliers of Prices. The R-squared of the model is high: the predictors explain 100% of the variance in the house price. The accuracy of the model in predicting the house price was measured with MAPE (Mean Absolute Percentage Error), and the error on both the training and the testing data is lower than 0.001%.