Simple Linear Regression for House Price

M. Fadhlurrohman Faqih

09 July 2023

Linear regression is one of the most widely used machine learning techniques. It works by finding a linear equation that best describes the relationship between the explanatory (independent) variables and the dependent variable. In this analysis I’ll try to predict house prices using a linear regression model.
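
As a quick illustration (generic notation added here for reference, not tied to this dataset’s columns), the multiple linear regression model takes the form

y = b0 + b1*x1 + b2*x2 + ... + bk*xk + e

where y is the dependent variable, x1 ... xk are the independent variables, b0 ... bk are the coefficients estimated from the data, and e is the residual error.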

The data I use was obtained from kaggle.com. It is not real data but was generated by computer so that anyone who wants to practice can get a better understanding of linear regression. The dataset contains 500,000 rows and 16 columns. Of these columns, Prices will be the target (dependent) variable and the rest will be the independent variables.

IMPORT LIBRARIES

Import all the necessary libraries.

library(dplyr)
library(GGally)
library(lmtest)
library(car)
library(ggplot2)
library(MLmetrics)
library(tidyverse)
library(tidymodels)
library(data.table)

READ DATA

house <- read.csv("HousePrices_HalfMil.csv")

rmarkdown::paged_table(house)

DATA PREPROCESSING

Data pre-processing is the step where we prepare our raw data before doing any analysis or machine learning, so that we can be sure of the quality, consistency and compatibility of our data. The tasks done here include handling abnormal values and missing values, and data coercion.

Checking for Missing Value

colSums(is.na(house))
#>          Area        Garage     FirePlace         Baths  White.Marble 
#>             0             0             0             0             0 
#>  Black.Marble Indian.Marble        Floors          City         Solar 
#>             0             0             0             0             0 
#>      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
#>             0             0             0             0             0 
#>        Prices 
#>             0

As we can see above, there are no missing values detected.

Data Coercion

glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

Some columns are not in their correct data type yet: 11 columns should have the factor data type instead of a numeric one. Why factor? When we perform machine learning, linear regression in this case, the system will treat any numeric column as an ordered quantity. For example, say an apartment company sells apartments in a city with 5 total area types: 50, 70, 80, 100 and 130 square meters. If I keep this column as numeric instead of factor, the model will assume that an apartment with a total area of 130 square meters should have a higher price than the other area types. So any column that represents a group or category, rather than a true quantity, must be converted to the factor data type. Columns like FirePlace, Garage and Baths must stay numeric because they are genuine counts.

house <- house %>% 
  mutate(White.Marble = as.factor(White.Marble),
         Black.Marble = as.factor(Black.Marble),
         Indian.Marble = as.factor(Indian.Marble),
         Floors = as.factor(Floors),
         City = as.factor(City),
         Solar = as.factor(Solar),
         Electric = as.factor(Electric),
         Fiber = as.factor(Fiber),
         Glass.Doors = as.factor(Glass.Doors),
         Swiming.Pool = as.factor(Swiming.Pool),
         Garden = as.factor(Garden))
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors        <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

All columns are now in their appropriate data types.

Checking for Outliers

summary(house)
#>       Area           Garage        FirePlace         Baths       White.Marble
#>  Min.   :  1.0   Min.   :1.000   Min.   :0.000   Min.   :1.000   0:333504    
#>  1st Qu.: 63.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1:166496    
#>  Median :125.0   Median :2.000   Median :2.000   Median :3.000               
#>  Mean   :124.9   Mean   :2.001   Mean   :2.003   Mean   :2.998               
#>  3rd Qu.:187.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000               
#>  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000               
#>  Black.Marble Indian.Marble Floors     City       Solar      Electric  
#>  0:333655     0:332841      0:250307   1:166314   0:250653   0:249675  
#>  1:166345     1:167159      1:249693   2:166902   1:249347   1:250325  
#>                                        3:166784                        
#>                                                                        
#>                                                                        
#>                                                                        
#>  Fiber      Glass.Doors Swiming.Pool Garden         Prices     
#>  0:249766   0:250065    0:249782     0:249177   Min.   : 7725  
#>  1:250234   1:249935    1:250218     1:250823   1st Qu.:33500  
#>                                                 Median :41850  
#>                                                 Mean   :42050  
#>                                                 3rd Qu.:50750  
#>                                                 Max.   :77975

Since this data was generated by computer, we don’t need advanced techniques to handle outliers. But, as we can see from the summary above, there are outliers in the Area column: it is implausible for a house to have a total area of only 1 square meter. I’ll use the interquartile range to handle any outliers in this column, but first I’ll remove abnormal values. According to the kompas.com website, the international standard for the minimum total area of a house is 12 square meters per person; if each house is designed to be inhabited by 3 people, the minimum total area of a house will be 36 square meters, so I’ll eliminate any rows with a value under 36 in the Area column.

house <- house %>% 
  filter(Area > 63)

Now that the Area column only contains values above 36, I’ll apply the interquartile range rule. The idea behind this technique is to describe the spread of the middle half of the data (the 50% most concentrated values); any point more than 1.5 IQRs below the first quartile or above the third quartile is assumed to be an outlier. Here is how we use it.

Q1 <- quantile(house$Area, 0.25)
Q3 <- quantile(house$Area, 0.75)
IQR <- Q3 - Q1

inner_lower_fences <- Q1 - 1.5 * IQR #any data must higher than this line
inner_upper_fences <- Q3 + 1.5 * IQR #any data must lower than this line

house_clean <- house %>% 
  filter(Area > inner_lower_fences,
         Area < inner_upper_fences)
summary(house_clean)
#>       Area           Garage        FirePlace         Baths       White.Marble
#>  Min.   : 64.0   Min.   :1.000   Min.   :0.000   Min.   :1.000   0:248510    
#>  1st Qu.:110.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1:124687    
#>  Median :157.0   Median :2.000   Median :2.000   Median :3.000               
#>  Mean   :156.5   Mean   :2.001   Mean   :2.004   Mean   :2.999               
#>  3rd Qu.:203.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000               
#>  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000               
#>  Black.Marble Indian.Marble Floors     City       Solar      Electric  
#>  0:249195     0:248689      0:186941   1:124195   0:187116   0:186306  
#>  1:124002     1:124508      1:186256   2:124766   1:186081   1:186891  
#>                                        3:124236                        
#>                                                                        
#>                                                                        
#>                                                                        
#>  Fiber      Glass.Doors Swiming.Pool Garden         Prices     
#>  0:186417   0:186754    0:186335     0:185665   Min.   : 9100  
#>  1:186780   1:186443    1:186862     1:187532   1st Qu.:34350  
#>                                                 Median :42625  
#>                                                 Mean   :42845  
#>                                                 3rd Qu.:51550  
#>                                                 Max.   :77975

It seems that after removing the abnormal values there are no outliers left in the Area column.

EXPLORATORY DATA ANALYSIS

The exploratory data analysis step allows us to look deeper into our data; from here we can get valuable insights that will be useful during the analysis or machine learning process.

Correlation

When performing linear regression, strong correlation between the dependent and independent variables determines the quality of the model: the stronger that correlation, the better the model. However, the same kind of strong correlation among the independent variables themselves will lead to a bad model. So we will plot our data to see the correlation between the numeric columns (the correlation plot can only use the numeric columns, including Prices).

ggcorr(house_clean, label = T, label_size = 3, hjust = 1)

Since this data was generated by computer, it is not surprising that its quality is poor. From the plot above, there is almost no correlation between the Prices column and the numeric independent variables, so we should pay extra attention when building the model, since weak correlation usually means a weak model.
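
For a numeric view of the same relationships, the correlation matrix of the numeric columns can also be printed directly. This is a minimal sketch, not part of the original analysis; select(where(is.numeric)) requires dplyr >= 1.0.

# Correlation matrix of the numeric columns only (Prices included)
house_clean %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  round(3)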

Data Distribution

When performing linear regression, the model learns to find the best linear equation. The shape of the target variable’s distribution plays an important role in the quality of the model: if the data is skewed, the model will fit well only where the data is concentrated and poorly elsewhere. Another reason the target variable should follow a normal distribution is that it increases the likelihood that the residuals will also be approximately normally distributed. See the plot of the Prices column distribution below.
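
The original plotting code is not shown here, so this is a minimal sketch of one way such a distribution plot could be produced with ggplot2 (the bin count of 50 is an arbitrary choice):

# Histogram of the target variable to inspect its distribution
ggplot(house_clean, aes(x = Prices)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of House Prices", x = "Prices", y = "Count")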

From the plot above, the target variable appears to be approximately normally distributed.

CROSS VALIDATION

Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model is overfit and unable to predict new data it has not seen during the training phase. We will use 80% of the data as training data and the rest as testing data.

set.seed(123)

index <- sample(1:nrow(house_clean), size = floor(0.8 * nrow(house_clean)))

# Split data into train and test sets
data_train <- house_clean[index, ]
data_test <- house_clean[-index, ]

BUILDING MODEL

I’ll try to make three models under different conditions: the first model will only use the numeric columns, the second model will only use the factor columns, and the third model will use all columns as predictors.

First Model

first_model <- lm(Prices ~ Area + Garage + FirePlace + Baths, data_train)
summary(first_model)
#> 
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths, data = data_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -26309.0  -8214.7   -385.3   8683.7  27625.6 
#> 
#> Coefficients:
#>               Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) 30749.3165   100.9507  304.60 <0.0000000000000002 ***
#> Area           24.5199     0.4009   61.16 <0.0000000000000002 ***
#> Garage       1501.4023    26.3045   57.08 <0.0000000000000002 ***
#> FirePlace     775.4690    15.1864   51.06 <0.0000000000000002 ***
#> Baths        1237.6131    15.1964   81.44 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 11740 on 298552 degrees of freedom
#> Multiple R-squared:  0.0515, Adjusted R-squared:  0.05149 
#> F-statistic:  4053 on 4 and 298552 DF,  p-value: < 0.00000000000000022

Our first model has an R-squared of 0.05, the same as its adjusted R-squared. Pay attention to the Pr(>|t|) values in the summary: they show how significant each independent variable is to the target variable. Any variable with a Pr value at or below 0.001 contributes significantly to the model. The reason a variable with almost no correlation to the target can still contribute significantly is confounding: an individual variable with low correlation can become significant when combined with other independent variables, because of the influence of those other variables.

Second Model

In our second model, we will use only the categorical columns as predictors.

second_model <- lm(Prices ~ White.Marble + Black.Marble + Floors + City + Solar + Electric + Fiber + Glass.Doors + Swiming.Pool + Garden + Indian.Marble, data_train)

summary(second_model)
#> 
#> Call:
#> lm(formula = Prices ~ White.Marble + Black.Marble + Floors + 
#>     City + Solar + Electric + Fiber + Glass.Doors + Swiming.Pool + 
#>     Garden + Indian.Marble, data = data_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7859.9 -1933.3     0.3  1939.9  7847.9 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                 Estimate Std. Error  t value            Pr(>|t|)    
#> (Intercept)    16672.714     17.413  957.485 <0.0000000000000002 ***
#> White.Marble1  14011.553     12.295 1139.623 <0.0000000000000002 ***
#> Black.Marble1   4995.572     12.316  405.629 <0.0000000000000002 ***
#> Floors1        15001.761     10.047 1493.189 <0.0000000000000002 ***
#> City2           3494.670     12.307  283.952 <0.0000000000000002 ***
#> City3           6975.402     12.303  566.981 <0.0000000000000002 ***
#> Solar1           253.936     10.047   25.275 <0.0000000000000002 ***
#> Electric1       1257.392     10.047  125.154 <0.0000000000000002 ***
#> Fiber1         11742.315     10.047 1168.748 <0.0000000000000002 ***
#> Glass.Doors1    4436.039     10.047  441.537 <0.0000000000000002 ***
#> Swiming.Pool1     14.355     10.047    1.429               0.153    
#> Garden1           -1.144     10.047   -0.114               0.909    
#> Indian.Marble1        NA         NA       NA                  NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2745 on 298545 degrees of freedom
#> Multiple R-squared:  0.9482, Adjusted R-squared:  0.9482 
#> F-statistic: 4.964e+05 on 11 and 298545 DF,  p-value: < 0.00000000000000022

This model has 0.948 for both its R-squared and adjusted R-squared.

Third Model

In the third model, we will use all columns as predictors.

third_model <- lm(Prices ~ ., data_train)
summary(third_model)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = data_train)
#> 
#> Residuals:
#>           Min            1Q        Median            3Q           Max 
#> -0.0000000200 -0.0000000003 -0.0000000001  0.0000000002  0.0000254665 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                             Estimate            Std. Error           t value
#> (Intercept)     4500.000000047252797     0.000000000490933  9166220076524.77
#> Area              25.000000000003915     0.000000000001592 15707487904511.16
#> Garage          1499.999999999772854     0.000000000104436 14362837919959.43
#> FirePlace        749.999999999939860     0.000000000060294 12439149675712.38
#> Baths           1249.999999999957481     0.000000000060333 20718310394126.78
#> White.Marble1  13999.999999999563443     0.000000000208796 67051225190797.31
#> Black.Marble1   5000.000000000120053     0.000000000209148 23906531929987.04
#> Indian.Marble1                    NA                    NA                NA
#> Floors1        14999.999999999457941     0.000000000170617 87916349472160.97
#> City2           3500.000000000499767     0.000000000209006 16745921372901.12
#> City3           6999.999999999979991     0.000000000208929 33504142109415.35
#> Solar1           249.999999999827480     0.000000000170619  1465252635892.32
#> Electric1       1249.999999999819693     0.000000000170617  7326370184393.33
#> Fiber1         11750.000000000147338     0.000000000170620 68866427844877.61
#> Glass.Doors1    4450.000000000057298     0.000000000170617 26081812489206.13
#> Swiming.Pool1     -0.000000000170625     0.000000000170618             -1.00
#> Garden1           -0.000000000172278     0.000000000170621             -1.01
#>                           Pr(>|t|)    
#> (Intercept)    <0.0000000000000002 ***
#> Area           <0.0000000000000002 ***
#> Garage         <0.0000000000000002 ***
#> FirePlace      <0.0000000000000002 ***
#> Baths          <0.0000000000000002 ***
#> White.Marble1  <0.0000000000000002 ***
#> Black.Marble1  <0.0000000000000002 ***
#> Indian.Marble1                  NA    
#> Floors1        <0.0000000000000002 ***
#> City2          <0.0000000000000002 ***
#> City3          <0.0000000000000002 ***
#> Solar1         <0.0000000000000002 ***
#> Electric1      <0.0000000000000002 ***
#> Fiber1         <0.0000000000000002 ***
#> Glass.Doors1   <0.0000000000000002 ***
#> Swiming.Pool1                0.317    
#> Garden1                      0.313    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000004661 on 298541 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.331e+27 on 15 and 298541 DF,  p-value: < 0.00000000000000022

So far our third model has the largest R-squared and adjusted R-squared, which means that of the three models it is the best one. But from the summary above, there are 3 columns we can remove from the model. The Indian.Marble column seems to be perfectly correlated with the other independent variables, which is why its coefficient is NA. The Swiming.Pool and Garden columns are the least significant in the model, so we can remove them as well.
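
A quick way to confirm that suspicion is to check whether the three marble indicators always sum to one, i.e. every house has exactly one marble type, so the third dummy is redundant. This is a small sketch under that assumption; the marble_sum column is introduced here only for illustration:

# If every house has exactly one marble type, the three dummies are
# perfectly collinear and one of them carries no extra information
house_clean %>% 
  mutate(marble_sum = as.numeric(as.character(White.Marble)) +
                      as.numeric(as.character(Black.Marble)) +
                      as.numeric(as.character(Indian.Marble))) %>% 
  count(marble_sum)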

Fourth Model

The fourth model will use the third model’s predictors, minus the three columns identified above.

fourth_train <- data_train %>% 
  select(-c(Indian.Marble, Swiming.Pool, Garden))
fourth_model <- lm(Prices ~ ., fourth_train)

summary(fourth_model)
#> 
#> Call:
#> lm(formula = Prices ~ ., data = fourth_train)
#> 
#> Residuals:
#>           Min            1Q        Median            3Q           Max 
#> -0.0000000200 -0.0000000003 -0.0000000001  0.0000000002  0.0000254666 
#> 
#> Coefficients:
#>                            Estimate            Std. Error        t value
#> (Intercept)    4500.000000047077265     0.000000000475938  9455020189716
#> Area             25.000000000003915     0.000000000001592 15707499511756
#> Garage         1499.999999999772854     0.000000000104435 14362980922003
#> FirePlace       749.999999999939860     0.000000000060293 12439154172046
#> Baths          1249.999999999957026     0.000000000060333 20718436698176
#> White.Marble1 13999.999999999563443     0.000000000208795 67051396409737
#> Black.Marble1  5000.000000000120053     0.000000000209148 23906543707083
#> Floors1       14999.999999999457941     0.000000000170616 87916489568598
#> City2          3500.000000000498858     0.000000000209005 16745988538105
#> City3          6999.999999999979082     0.000000000208929 33504257931188
#> Solar1          249.999999999828162     0.000000000170618  1465260171780
#> Electric1      1249.999999999819693     0.000000000170617  7326371562395
#> Fiber1        11750.000000000147338     0.000000000170620 68866453965382
#> Glass.Doors1   4450.000000000057298     0.000000000170617 26081876863669
#>                          Pr(>|t|)    
#> (Intercept)   <0.0000000000000002 ***
#> Area          <0.0000000000000002 ***
#> Garage        <0.0000000000000002 ***
#> FirePlace     <0.0000000000000002 ***
#> Baths         <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Floors1       <0.0000000000000002 ***
#> City2         <0.0000000000000002 ***
#> City3         <0.0000000000000002 ***
#> Solar1        <0.0000000000000002 ***
#> Electric1     <0.0000000000000002 ***
#> Fiber1        <0.0000000000000002 ***
#> Glass.Doors1  <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000004661 on 298543 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 1.536e+27 on 13 and 298543 DF,  p-value: < 0.00000000000000022

It seems our fourth model has the same R-squared as our third model.

MODEL EVALUATION

The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error. RMSE can be preferable to MAE (mean absolute error) because RMSE squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package. Note that a bigger RMSE means a larger typical prediction error, so lower values are better.
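
For reference, this is a minimal sketch of what RMSE computes, which should be equivalent to MLmetrics::RMSE() for numeric vectors; rmse_manual is a hypothetical helper added here for illustration only.

# Root mean squared error: square the errors, average them, take the square root
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2))
}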

RMSE of First Model

We will calculate the RMSE on both the training data and the testing data in order to know whether our model is overfit or not.

RMSE of data train

RMSE(y_pred = first_model$fitted.values, y_true = data_train$Prices)
#> [1] 11740.56
first_pred <- predict(first_model, newdata = data_test %>% select(-Prices))

RMSE of data test

RMSE(y_pred = first_pred, y_true = data_test$Prices)
#> [1] 11748.17

From the two RMSE values calculated above, I can say our model is not overfit (I assume a model is overfit when the RMSE on the test data is more than 10% larger than the RMSE on the training data). We will do the same for the other models.
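
That 10% rule of thumb can be expressed as a small helper; rmse_gap is a hypothetical function added here for illustration, applied to the two values reported above.

# Relative gap between test RMSE and train RMSE
rmse_gap <- function(rmse_train, rmse_test) {
  (rmse_test - rmse_train) / rmse_train
}
rmse_gap(11740.56, 11748.17)  # about 0.0006, far below the 0.10 threshold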

RMSE of Second Model

RMSE of train data set

RMSE(y_pred = second_model$fitted.values, y_true = data_train$Prices)
#> [1] 2744.728
second_pred <- predict(second_model, newdata = data_test %>% select(-Prices))

RMSE of test data set

RMSE(y_pred = second_pred, y_true = data_test$Prices)
#> [1] 2740.513

RMSE of Third Model

RMSE of train data set

RMSE(y_pred = third_model$fitted.values, y_true = data_train$Prices)
#> [1] 0.00000004661109
third_pred <- predict(second_model, newdata = data_test %>% select(-Prices))

RMSE of test data set

RMSE(y_pred = third_pred, y_true = data_test$Prices)
#> [1] 2740.513

From the RMSE comparisons above, we will choose the first model as the final model, because it is not overfit: its RMSE is stable between the training and testing data.

LM ASSUMPTION TEST

There are 4 assumptions that must be fulfilled by a linear regression model: normality of residuals, homoscedasticity (no heteroscedasticity), no multicollinearity, and linearity.

Normality Test

This test is performed to know whether the model residuals follow a normal distribution or not; a good model must have normally distributed residuals.
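
The original check is not shown here, so this is a minimal sketch of one way to inspect the residual distribution visually (shapiro.test() is limited to 5,000 observations, so a graphical check is more practical at this data size):

# Histogram of the first model's residuals; a roughly symmetric, bell-shaped
# histogram suggests the normality assumption is reasonable
hist(first_model$residuals,
     breaks = 50,
     main = "Distribution of First Model Residuals",
     xlab = "Residuals")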

Our first model’s residuals already appear to follow a normal distribution.

Heteroscedasticity

Heteroscedasticity is a statistical condition where the variability of the errors or residuals is not constant across all levels of the independent variables. In simple words, the spread of the residuals tends to increase or decrease systematically as the values of the independent variables change. We want to avoid this condition in our model. The check can be done in two ways: using bptest or plotting the residuals.

bptest(first_model)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  first_model
#> BP = 3.4063, df = 4, p-value = 0.4923

Since the p-value is greater than 0.05, there is no evidence of heteroscedasticity.

resact <- data.frame(residual = first_model$residuals, fitted = first_model$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

Since our data is quite big it is hard to interpret the plot, but from the bptest we can say there is no heteroscedasticity.

Multicollinearity

Multicollinearity is a condition where one independent variable has a high correlation with other independent variables. When this condition appears in our model, it becomes difficult to isolate the individual effect of each independent variable on the dependent variable, because the effects are confounded or combined. The presence of multicollinearity does not affect the overall predictive power of the model, but it can lead to unreliable estimates of the individual regression coefficients. We will use the vif function to check this assumption.

vif(first_model)
#>      Area    Garage FirePlace     Baths 
#>  1.000002  1.000008  1.000001  1.000006

As we can see, we get a value of about 1 for every predictor, which means there is no significant multicollinearity problem. For reference: a VIF of 1 means a predictor is not correlated with the other predictors at all, values between 1 and 5 are generally considered acceptable, and values above 5 indicate a significant multicollinearity problem.

Linearity

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

See the horizontal blue line above: it forms a straight line, indicating that there is no pattern left in our residuals; in other words, the residuals satisfy the linearity assumption.

CONCLUSION

Individually, the predictors may have almost no correlation with the target variable at all, but when we combine all of the predictors together they can have a significant impact on the model; this condition is called a confounding effect. Of the four models built above, the first model is the one chosen to predict house prices. This model has an R-squared and adjusted R-squared of 0.051 and satisfies all of the linear regression assumptions. On the training data our model has an RMSE of 11,740, while on the test data we get an RMSE of 11,748, so we can say the model is not overfit and generalizes well to unseen data.