Introduction
This report analyzes house prices as a function of several property attributes. The data was provided by Kaggle. We’ll use a linear regression model to predict house prices.
Import Library
Exploratory Data Analysis
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
The house data contains 500,000 observations and 16 columns. Before we continue to predict the price (the target variable), let’s make sure there are no missing values in the data.
## [1] FALSE
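The chunk that produced the output above is not shown. A minimal sketch of such a check in base R follows; `house_demo` is a hypothetical miniature stand-in for `house`, with one missing value planted on purpose so that both checks have something to report.

```r
# Hypothetical stand-in for the `house` data frame (not the real data)
house_demo <- data.frame(
  Area   = c(164, 84, 190, NA),
  Garage = c(2, 2, 2, 1),
  Prices = c(43800, 37550, 49500, 50075)
)

# TRUE if at least one cell in the data frame is NA
any_missing <- anyNA(house_demo)

# Count of missing values per column, useful for locating the problem
missing_per_column <- colSums(is.na(house_demo))

print(any_missing)
print(missing_per_column)
```

On the real data the first check would be expected to return FALSE, matching the output shown above.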
There are no missing values in the data. Let’s look at a sample of the data and then at the correlation between its variables.
## Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1 164 2 0 2 0 1 0 0
## 2 84 2 0 4 0 0 1 1
## 3 190 2 4 4 1 0 0 0
## 4 75 2 4 4 0 0 1 1
## 5 148 1 4 2 1 0 0 1
## 6 124 3 3 3 0 1 0 1
## City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1 3 1 1 1 1 0 0 43800
## 2 2 0 0 0 1 1 1 37550
## 3 2 0 0 1 0 0 0 49500
## 4 1 1 1 1 1 1 1 50075
## 5 2 1 0 0 1 1 1 52400
## 6 1 0 0 1 1 1 1 54300
Check whether Prices contains outliers:
## [1] 76750 76775 76800 76825 76950 76975 77000 77075 77175 77225 77225 77250
## [13] 77375 77525 77700 77975
There are some outliers in our data. Let’s filter them out by keeping only the rows where Prices is smaller than 76750, the smallest outlier value.
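The filtering step is described but its code is not shown. One plausible way to do it, assuming the outliers were identified with the boxplot rule (1.5 times the interquartile range), is sketched below. `prices_demo` is a hypothetical vector, and the sketch assumes all outliers lie on the high side, as they do in the output above.

```r
# Hypothetical price vector: 21 regular values plus two extreme ones
prices_demo <- c(seq(20000, 60000, by = 2000), 90000, 95000)

# Values outside the boxplot whiskers (1.5 * IQR rule)
outliers <- boxplot.stats(prices_demo)$out

# Smallest outlier becomes the cutoff; keep everything strictly below it
cutoff <- min(outliers)
prices_clean <- prices_demo[prices_demo < cutoff]

print(outliers)
print(length(prices_clean))
```

For a data frame, the same cutoff could be applied row-wise, for example `house[house$Prices < cutoff, ]`.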
Get the correlation between variables with the ggcorr() function from the GGally package:
#get the correlation between variables
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
From the correlation plot, we can see that Floors, Fiber, and White.Marble have the strongest correlation with Prices.
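The same ranking can be read off numerically instead of from the plot. The sketch below uses base R's `cor()` on a small simulated data frame; `demo` and its coefficients are assumptions chosen for illustration, not the Kaggle data.

```r
set.seed(42)
n <- 1000

# Simulated binary features (hypothetical stand-in for the real columns)
demo <- data.frame(
  Floors = rbinom(n, 1, 0.5),
  Fiber  = rbinom(n, 1, 0.5),
  Solar  = rbinom(n, 1, 0.5)
)
demo$Prices <- 1000 + 15000 * demo$Floors + 11750 * demo$Fiber + 250 * demo$Solar

# Correlation of every predictor with the target
cors <- cor(demo)[, "Prices"]
cors <- cors[names(cors) != "Prices"]

# Rank predictors by absolute correlation, strongest first
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```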
Modeling
Before we build the model, we need to split the data into a training dataset and a test dataset. We will use the training dataset to fit the linear regression model. The test dataset serves as a comparison: it shows whether the model overfits and fails to predict new data that it has not seen during training. We will use 70% of the data for training and the remaining 30% for testing.
RNGkind(sample.kind = "Rounding")
set.seed(123)
idx <- sample(nrow(house), nrow(house) *0.7)
data_train <- house[idx, ]
data_test <- house[-idx, ]
Now we will fit a linear regression model using:
1. Prices as the target variable
2. Floors, Fiber, and White.Marble as predictors, because they have the strongest correlation with price.
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17527.4 -3592.4 6.6 3585.3 17506.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24865.79 16.32 1524.0 <2e-16 ***
## Floors 14977.66 17.45 858.4 <2e-16 ***
## Fiber 11733.96 17.45 672.5 <2e-16 ***
## White.Marble 11523.94 18.51 622.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5161 on 349984 degrees of freedom
## Multiple R-squared: 0.8184, Adjusted R-squared: 0.8184
## F-statistic: 5.257e+05 on 3 and 349984 DF, p-value: < 2.2e-16
From the summary of the model lm(formula = Prices ~ Floors + Fiber + White.Marble, data = data_train), we get an adjusted R-squared of 0.8184, meaning that the model explains 81.84% of the variance in the target variable (house price).
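For reference, the adjusted R-squared can be pulled out of a model summary programmatically and reproduced from the plain R-squared. The sketch below uses simulated data; the names `d`, `fit`, and `adj_manual` are illustrative only.

```r
set.seed(123)
n <- 500

# Simulated data with two informative predictors
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 2 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

# Number of predictors, excluding the intercept
p <- length(coef(fit)) - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(reported = s$adj.r.squared, manual = adj_manual))
```

With hundreds of thousands of rows and only three predictors, the penalty is negligible, which is why the multiple and adjusted values in the summary above are identical to four decimals.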
Now, let’s select the predictor variables automatically using stepwise regression with the direction = "backward" parameter.
## Start: AIC=-13056840
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors + Swiming.Pool + Garden
##
##
## Step: AIC=-13056840
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool + Garden
##
## Df Sum of Sq RSS AIC
## - Garden 1 0.0000e+00 0.0000e+00 -13056841
## - Swiming.Pool 1 0.0000e+00 0.0000e+00 -13056840
## <none> 0.0000e+00 -13056840
## - Solar 1 5.4684e+09 5.4684e+09 3379718
## - Electric 1 1.3671e+11 1.3671e+11 4506295
## - FirePlace 1 3.9389e+11 3.9389e+11 4876649
## - Garage 1 5.2500e+11 5.2500e+11 4977213
## - Baths 1 1.0948e+12 1.0948e+12 5234440
## - Area 1 1.1267e+12 1.1267e+12 5244469
## - Black.Marble 1 1.4596e+12 1.4596e+12 5335072
## - Glass.Doors 1 1.7326e+12 1.7326e+12 5395088
## - City 1 2.8556e+12 2.8556e+12 5569966
## - White.Marble 1 1.1445e+13 1.1445e+13 6055836
## - Fiber 1 1.2080e+13 1.2080e+13 6074727
## - Floors 1 1.9686e+13 1.9686e+13 6245667
##
## Step: AIC=-13056841
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool
##
## Df Sum of Sq RSS AIC
## - Swiming.Pool 1 0.0000e+00 0.0000e+00 -13056842
## <none> 0.0000e+00 -13056841
## - Solar 1 5.4684e+09 5.4684e+09 3379720
## - Electric 1 1.3671e+11 1.3671e+11 4506293
## - FirePlace 1 3.9389e+11 3.9389e+11 4876647
## - Garage 1 5.2500e+11 5.2500e+11 4977212
## - Baths 1 1.0948e+12 1.0948e+12 5234439
## - Area 1 1.1267e+12 1.1267e+12 5244467
## - Black.Marble 1 1.4596e+12 1.4596e+12 5335070
## - Glass.Doors 1 1.7326e+12 1.7326e+12 5395090
## - City 1 2.8556e+12 2.8556e+12 5569964
## - White.Marble 1 1.1445e+13 1.1445e+13 6055836
## - Fiber 1 1.2080e+13 1.2080e+13 6074726
## - Floors 1 1.9686e+13 1.9686e+13 6245666
##
## Step: AIC=-13056842
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors
##
## Df Sum of Sq RSS AIC
## <none> 0.0000e+00 -13056842
## - Solar 1 5.4684e+09 5.4684e+09 3379718
## - Electric 1 1.3671e+11 1.3671e+11 4506292
## - FirePlace 1 3.9389e+11 3.9389e+11 4876647
## - Garage 1 5.2500e+11 5.2500e+11 4977210
## - Baths 1 1.0949e+12 1.0949e+12 5234441
## - Area 1 1.1267e+12 1.1267e+12 5244465
## - Black.Marble 1 1.4596e+12 1.4596e+12 5335069
## - Glass.Doors 1 1.7326e+12 1.7326e+12 5395088
## - City 1 2.8556e+12 2.8556e+12 5569962
## - White.Marble 1 1.1445e+13 1.1445e+13 6055837
## - Fiber 1 1.2080e+13 1.2080e+13 6074727
## - Floors 1 1.9686e+13 1.9686e+13 6245665
##
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
## Black.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.684e-06 0.000e+00 0.000e+00 0.000e+00 1.037e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+03 7.285e-11 1.373e+13 <2e-16 ***
## Area 2.500e+01 1.866e-13 1.339e+14 <2e-16 ***
## Garage 1.500e+03 1.641e-11 9.143e+13 <2e-16 ***
## FirePlace 7.500e+02 9.470e-12 7.920e+13 <2e-16 ***
## Baths 1.250e+03 9.467e-12 1.320e+14 <2e-16 ***
## White.Marble 1.400e+04 3.279e-11 4.269e+14 <2e-16 ***
## Black.Marble 5.000e+03 3.280e-11 1.525e+14 <2e-16 ***
## Floors 1.500e+04 2.679e-11 5.599e+14 <2e-16 ***
## City 3.500e+03 1.641e-11 2.132e+14 <2e-16 ***
## Solar 2.500e+02 2.679e-11 9.331e+12 <2e-16 ***
## Electric 1.250e+03 2.679e-11 4.666e+13 <2e-16 ***
## Fiber 1.175e+04 2.679e-11 4.386e+14 <2e-16 ***
## Glass.Doors 4.450e+03 2.679e-11 1.661e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.925e-09 on 349975 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.811e+28 on 12 and 349975 DF, p-value: < 2.2e-16
From the summary of the stepwise regression model, we learn that:
- The predictors retained for Prices are Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Floors, City, Solar, Electric, Fiber, and Glass.Doors, all of them statistically significant (marked by ***). Indian.Marble, Garden, and Swiming.Pool were dropped during backward elimination.
- The model gives an adjusted R-squared of 1, meaning that it explains 100% of the variance in the target variable (house price). The residuals are on the order of 1e-06 or smaller, so the fit is essentially exact.
- The earlier model, which predicts Prices from Floors + Fiber + White.Marble, has an adjusted R-squared of only 0.8184. This means that house_step is better than house_lm in this case.
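The backward elimination described above can be sketched with `step()` from base R's stats package. The data here is simulated with added noise (the real data fits exactly, which would make this demo degenerate), and the object names are hypothetical rather than the ones used in the original chunk.

```r
set.seed(123)
n <- 300

# Two informative predictors and one pure-noise column
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 5 + 4 * d$x1 + 3 * d$x2 + rnorm(n)

# Start from the full model containing every predictor
full_model <- lm(y ~ ., data = d)

# Drop predictors one at a time while doing so lowers the AIC
step_model <- step(full_model, direction = "backward", trace = 0)

print(formula(step_model))
```

Setting `trace = 0` suppresses the step-by-step AIC tables; leaving it at its default prints output in the same shape as the log shown earlier.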
Model Prediction and Error
Let’s predict Prices from the predictors using the house_step model. The predictions will be compared with the actual values in data_train and data_test.
#predict prices in data_train using house_step
data_train$pred <- predict(house_step, newdata = data.frame(data_train))
data_train %>%
select(Prices, pred) %>%
head()
## Prices pred
## 143785 52550 52550
## 394140 36800 36800
## 204482 52575 52575
## 441492 54200 54200
## 470215 49500 49500
## 22778 47775 47775
#predict prices in data_test using house_step
data_test$pred <- predict(house_step, newdata = data.frame(data_test))
data_test %>%
select(Prices, pred) %>%
head()
## Prices pred
## 9 29575 29575
## 10 22300 22300
## 15 38500 38500
## 19 53625 53625
## 20 38300 38300
## 21 67850 67850
Let’s check the model error with the MAPE() function from the MLmetrics package:
#check the error of Prices in data_train
format(MAPE(y_pred = data_train$pred, y_true = data_train$Prices), scientific = F)
## [1] "0.0000000000002072597"
#check the error of Prices in data_test
format(MAPE(y_pred = data_test$pred, y_true = data_test$Prices), scientific = F)
## [1] "0.0000000000002072442"
Using MAPE (Mean Absolute Percentage Error), we find that house_step, the model built with stepwise regression, has an error of about 2e-13 on both the training and test data, far below 1%.
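To make the metric concrete, MAPE is the mean of the absolute errors divided by the true values, expressed as a proportion. A small base R sketch with made-up numbers follows; `mape()` is a hypothetical helper written for illustration, not the MLmetrics function itself.

```r
# Mean absolute percentage error, returned as a proportion (0.05 means 5%)
mape <- function(y_true, y_pred) {
  mean(abs((y_true - y_pred) / y_true))
}

# Made-up actual and predicted prices
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Relative errors are 0.10, 0.10 and 0, so the mean is 0.2 / 3
error <- mape(actual, predicted)
print(error)
```

Note that the metric is undefined when any true value is zero, which is not a concern for house prices.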
Model Evaluation
Linearity
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Area
## t = 105.52, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1448892 0.1503121
## sample estimates:
## cor
## 0.1476017
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Garage
## t = 71.207, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0974516 0.1029397
## sample estimates:
## cor
## 0.1001964
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$FirePlace
## t = 63.212, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08629182 0.09179158
## sample estimates:
## cor
## 0.08904238
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Baths
## t = 103.62, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1422766 0.1477037
## sample estimates:
## cor
## 0.1449912
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$White.Marble
## t = 354.42, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4458796 0.4503102
## sample estimates:
## cor
## 0.4480976
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Black.Marble
## t = -55.318, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08074937 -0.07523939
## sample estimates:
## cor
## -0.07799497
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Floors
## t = 557.95, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6177400 0.6211565
## sample estimates:
## cor
## 0.6194512
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$City
## t = 169.56, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2305597 0.2358020
## sample estimates:
## cor
## 0.2331825
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Solar
## t = 5.9364, df = 499982, p-value = 2.916e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.00562345 0.01116677
## sample estimates:
## cor
## 0.008395172
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Electric
## t = 37.081, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04960462 0.05513312
## sample estimates:
## cor
## 0.05236927
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Fiber
## t = 391.73, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4824821 0.4867240
## sample estimates:
## cor
## 0.4846059
##
## Pearson's product-moment correlation
##
## data: house$Prices and house$Glass.Doors
## t = 130.81, df = 499982, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1792264 0.1845867
## sample estimates:
## cor
## 0.1819079
From cor.test(), we see that every predictor has a p-value < 0.05, so each one has a statistically significant linear correlation with Prices, which supports the linearity assumption.
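Rather than calling `cor.test()` once per predictor, the tests can be run in a loop and the p-values collected in one vector. This is a sketch on simulated data; the column names mimic the real ones but the values and coefficients are assumptions.

```r
set.seed(7)
n <- 2000

# Simulated stand-in with one continuous and one binary predictor
d <- data.frame(
  Area   = runif(n, 50, 250),
  Floors = rbinom(n, 1, 0.5)
)
d$Prices <- 1000 + 25 * d$Area + 15000 * d$Floors + rnorm(n, sd = 500)

# Every column except the target is treated as a predictor
predictors <- setdiff(names(d), "Prices")

# Pearson correlation test of each predictor against the target
p_values <- sapply(predictors, function(v) {
  cor.test(d[[v]], d$Prices)$p.value
})

print(p_values)
```

With a sample as large as the one in this report, even very weak correlations (such as Solar at about 0.008) come out significant, so the size of the correlation is worth reading alongside the p-value.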
Normality (Check Residual)
The normality test evaluates whether the model’s residuals are normally distributed.
## [1] 349988
Because shapiro.test() accepts at most 5,000 observations and we have 349,988 residuals, we can either test the first 5,000 residuals or use ad.test() from the nortest package, which handles larger samples.
##
## Shapiro-Wilk normality test
##
## data: house_step$residuals[0:5000]
## W = 0.0036873, p-value < 2.2e-16
##
## Anderson-Darling normality test
##
## data: house_step$residuals
## A = 133168, p-value < 2.2e-16
With both shapiro.test() and ad.test() we get a p-value < 2.2e-16, which means the residuals are not normally distributed and the model may need further tuning.
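An alternative to testing the first 5,000 residuals is to test a random subsample, which avoids any dependence on row order. The sketch below is an assumption about how one might do this, not the approach used above, and `res_demo` is a deliberately skewed, simulated stand-in for the model residuals.

```r
set.seed(99)

# Simulated, clearly non-normal "residuals" (centered exponential)
res_demo <- rexp(20000) - 1

# shapiro.test() accepts between 3 and 5000 values, so draw a random subsample
res_sample <- sample(res_demo, 5000)

sw <- shapiro.test(res_sample)
print(sw$p.value)
```

A normal Q-Q plot, for example `qqnorm(res_sample); qqline(res_sample)`, is a useful visual companion to the test.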
Heteroscedasticity
##
## studentized Breusch-Pagan test
##
## data: house_step
## BP = 14.228, df = 12, p-value = 0.2864
From the Breusch-Pagan test we get a p-value of 0.2864. Since the p-value is > 0.05, the residuals are homoscedastic (there is no heteroscedasticity in our model).
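To show what the test measures, the studentized (Koenker) form of the Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and multiply the resulting R-squared by the sample size. This is a base R sketch on simulated data whose noise grows with `x`, so heteroscedasticity is present by construction. It should agree with the studentized statistic reported by a packaged test, but that equivalence is stated here as an assumption rather than verified against the original chunk.

```r
set.seed(2024)
n <- 500

# Noise standard deviation grows with x, so the variance is not constant
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R^2, chi-squared with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

Here the small p-value signals heteroscedasticity, the opposite of the result obtained for house_step.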
Variance Inflation Factor (Multicollinearity)
## Area Garage FirePlace Baths White.Marble Black.Marble
## 1.000024 1.000041 1.000021 1.000031 1.330728 1.330724
## Floors City Solar Electric Fiber Glass.Doors
## 1.000014 1.000028 1.000022 1.000009 1.000030 1.000028
No VIF value is larger than 10, so we can conclude that there is no multicollinearity.
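For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The base R sketch below uses simulated data in which `a_copy` is nearly a duplicate of `a`; all names are hypothetical.

```r
set.seed(11)
n <- 1000

# `a` and `b` are independent; `a_copy` is almost identical to `a`
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$a_copy <- d$a + rnorm(n, sd = 0.1)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
```

In this demo `a` and `a_copy` get large VIF values while `b` stays close to 1, the same pattern as the near-1 values reported for house_step.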
Conclusion
Variables that are useful for describing the variance in house prices are Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Floors, City, Solar, Electric, Fiber, and Glass.Doors. Our final model passed the linearity, homoscedasticity, and multicollinearity checks, but did not pass the normality test. We removed the outliers of Prices. The R-squared of the model is 1, meaning that the predictors explain 100% of the variance in house price. The accuracy of the model in predicting house price is measured with MAPE (Mean Absolute Percentage Error), which is lower than 0.001% on both the training and testing data.