Linear regression is a type of machine learning with many uses. It works by finding the linear equation that best describes the relationship between the explanatory (independent) variables and the dependent variable. In this analysis I’ll try to predict house prices using a linear regression model.
The data I use was obtained from kaggle.com. It is not real data but was generated by computer so that anyone can practice and get a better understanding of linear regression. The data contains 500,000 rows and 16 columns. Of all these columns, the Prices column will be the target (dependent) variable and the rest will be independent variables.
IMPORT LIBRARIES
Import all necessary libraries.
library(dplyr)
library(GGally)
library(lmtest)
library(car)
library(ggplot2)
library(MLmetrics)
library(tidyverse)
library(tidymodels)
library(data.table)
READ DATA
house <- read.csv("HousePrices_HalfMil.csv")
rmarkdown::paged_table(house)
DATA PREPROCESSING
Data pre-processing is the step where we prepare our raw data before doing analysis or machine learning, so we can be sure of the quality, consistency and compatibility of our data. The things that can be done here are handling abnormal values, handling missing values, and data coercion.
Checking for Missing Values
colSums(is.na(house))
#> Area Garage FirePlace Baths White.Marble
#> 0 0 0 0 0
#> Black.Marble Indian.Marble Floors City Solar
#> 0 0 0 0 0
#> Electric Fiber Glass.Doors Swiming.Pool Garden
#> 0 0 0 0 0
#> Prices
#> 0
As we can see above, no missing values were detected.
Data Coercion
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
Some columns are not in their correct data type yet; 11 columns should have the factor data type instead of a numeric data type. Why factor? When we perform machine learning, linear regression in this case, the system treats any numeric column as a magnitude. For example, say an apartment company sells apartments in a city with 5 total area types: 50, 70, 80, 100 and 130 square meters. If I keep this column as numeric instead of factor, the system will assume that an apartment with a total area of 130 square meters has a higher price than the other area types. So any column that represents a group or category, rather than a magnitude, must be converted to the factor data type. Columns like FirePlace, Garage and Baths must stay numeric.
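To make the difference concrete, here is a small toy sketch (my own example, not part of the dataset) showing how lm() would encode a column like City when it is numeric versus when it is a factor:
toy <- data.frame(city_num = c(1, 2, 3), city_fct = factor(c(1, 2, 3)))
model.matrix(~ city_num, toy) # numeric: a single column, so 3 is treated as three times 1
model.matrix(~ city_fct, toy) # factor: dummy columns, one per non-reference level
With that in mind, I convert the categorical columns to factors: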
house <- house %>%
mutate(White.Marble = as.factor(White.Marble),
Black.Marble = as.factor(Black.Marble),
Indian.Marble = as.factor(Indian.Marble),
Floors = as.factor(Floors),
City = as.factor(City),
Solar = as.factor(Solar),
Electric = as.factor(Electric),
Fiber = as.factor(Fiber),
Glass.Doors = as.factor(Glass.Doors),
Swiming.Pool = as.factor(Swiming.Pool),
Garden = as.factor(Garden))
glimpse(house)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
All columns are now in their appropriate data types.
Checking for Outliers
summary(house)
#> Area Garage FirePlace Baths White.Marble
#> Min. : 1.0 Min. :1.000 Min. :0.000 Min. :1.000 0:333504
#> 1st Qu.: 63.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1:166496
#> Median :125.0 Median :2.000 Median :2.000 Median :3.000
#> Mean :124.9 Mean :2.001 Mean :2.003 Mean :2.998
#> 3rd Qu.:187.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
#> Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
#> Black.Marble Indian.Marble Floors City Solar Electric
#> 0:333655 0:332841 0:250307 1:166314 0:250653 0:249675
#> 1:166345 1:167159 1:249693 2:166902 1:249347 1:250325
#> 3:166784
#>
#>
#>
#> Fiber Glass.Doors Swiming.Pool Garden Prices
#> 0:249766 0:250065 0:249782 0:249177 Min. : 7725
#> 1:250234 1:249935 1:250218 1:250823 1st Qu.:33500
#> Median :41850
#> Mean :42050
#> 3rd Qu.:50750
#> Max. :77975
Since this data was generated by computer, we don’t need advanced techniques to handle outliers. But, as we can see from the summary above, there are outliers in the Area column: it’s impossible for a house to have a total area of only 1 square meter. I’ll use the interquartile range to handle any outliers in this column, but first I’ll remove the abnormal values. According to kompas.com, the international standard for the minimum total area of a house is 12 square meters per person; let’s say each house was designed to be inhabited by 3 people, so the minimum total area of a house would be 36 square meters, and I’ll eliminate any rows below this minimum from the Area column.
house <- house %>%
filter(Area > 63)
With the abnormal values removed from the Area column, I’ll now apply the interquartile range method. The idea behind this technique is that we draw imaginary fences describing the spread of the middle half of our data (the 50% most concentrated values). Any value more than 1.5 times the IQR below the first quartile or above the third quartile is then treated as an outlier. Here is how we use it.
Q1 <- quantile(house$Area, 0.25)
Q3 <- quantile(house$Area, 0.75)
IQR <- Q3 - Q1
inner_lower_fences <- Q1 - 1.5 * IQR #any data must higher than this line
inner_upper_fences <- Q3 + 1.5 * IQR #any data must lower than this line
house_clean <- house %>%
filter(Area > inner_lower_fences,
Area < inner_upper_fences)
summary(house_clean)
#> Area Garage FirePlace Baths White.Marble
#> Min. : 64.0 Min. :1.000 Min. :0.000 Min. :1.000 0:248510
#> 1st Qu.:110.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1:124687
#> Median :157.0 Median :2.000 Median :2.000 Median :3.000
#> Mean :156.5 Mean :2.001 Mean :2.004 Mean :2.999
#> 3rd Qu.:203.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
#> Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
#> Black.Marble Indian.Marble Floors City Solar Electric
#> 0:249195 0:248689 0:186941 1:124195 0:187116 0:186306
#> 1:124002 1:124508 1:186256 2:124766 1:186081 1:186891
#> 3:124236
#>
#>
#>
#> Fiber Glass.Doors Swiming.Pool Garden Prices
#> 0:186417 0:186754 0:186335 0:185665 Min. : 9100
#> 1:186780 1:186443 1:186862 1:187532 1st Qu.:34350
#> Median :42625
#> Mean :42845
#> 3rd Qu.:51550
#> Max. :77975
It seems that after removing the abnormal values there are no outliers left in the Area column.
EXPLORATORY DATA ANALYSIS
Exploratory data analysis lets us look deeper into our data; from here we can get valuable insights that will be useful in our analysis or machine learning process.
Correlation
When performing linear regression, strong correlation between the dependent and independent variables determines the quality of the model: the stronger the correlation, the better the model. However, the same kind of strong correlation between independent variables leads to a bad model. So we will plot our data to see the correlation of each column (the correlation plot can only use the numeric columns).
ggcorr(house_clean, label = T, label_size = 3, hjust = 1)
Since this data was generated by computer, it is not surprising that its quality is poor. From the plot above there is almost no correlation between the Prices column and the numeric independent variables, so we should pay extra attention when building the model, since weak correlation usually means a weak model.
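As a complement to the plot, here is a quick sketch of my own to read the same correlations as numbers (it keeps only the numeric columns):
numeric_cor <- house_clean %>%
  select(where(is.numeric)) %>% # Area, Garage, FirePlace, Baths, Prices
  cor()
numeric_cor[, "Prices"] # correlation of each numeric column with Prices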
Data Distribution
When performing linear regression, the model learns to find the best linear equation. The shape of the distribution of the target variable plays an important role in the quality of the model: if the data is skewed, the model will only fit well where the data is concentrated and poorly everywhere else. Another reason the target variable should follow a normal distribution is that it increases the likelihood that the residuals will also be approximately normally distributed. See the plot of the Prices column distribution below.
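The original plotting chunk is not shown here; a minimal sketch of how the distribution could be plotted with ggplot2 (the bin count is my own choice):
ggplot(house_clean, aes(x = Prices)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") + # histogram of the target variable
  labs(title = "Distribution of house prices", x = "Prices", y = "Count")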
From the plot above, the target variable is already approximately normally distributed.
CROSS VALIDATION
Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison to see whether the model is overfit and cannot predict new data it hasn’t seen during the training phase. We will use 80% of the data as training data and the rest as testing data.
set.seed(123)
index <- sample(1:nrow(house_clean), size = floor(0.8 * nrow(house_clean)))
# Split data into train and test sets
data_train <- house_clean[index, ]
data_test <- house_clean[-index, ]
BUILDING MODEL
I’ll try to make three models with different conditions: the first model will only use the numeric columns, the second model will only use the factor columns, and the third model will use all columns as predictors.
First Model
first_model <- lm(Prices ~ Area + Garage + FirePlace + Baths, data_train)
summary(first_model)
#>
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -26309.0 -8214.7 -385.3 8683.7 27625.6
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 30749.3165 100.9507 304.60 <0.0000000000000002 ***
#> Area 24.5199 0.4009 61.16 <0.0000000000000002 ***
#> Garage 1501.4023 26.3045 57.08 <0.0000000000000002 ***
#> FirePlace 775.4690 15.1864 51.06 <0.0000000000000002 ***
#> Baths 1237.6131 15.1964 81.44 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 11740 on 298552 degrees of freedom
#> Multiple R-squared: 0.0515, Adjusted R-squared: 0.05149
#> F-statistic: 4053 on 4 and 298552 DF, p-value: < 0.00000000000000022
Our first model has an R-squared of 0.05, the same as its adjusted R-squared. Pay attention to the Pr(>|t|) column of the summary: it shows how significant each independent variable is to the target variable. Any variable with a Pr value at or below 0.001 contributes significantly to the model. The reason a variable with almost no correlation to the target can still contribute significantly to the model is something called confounding: a condition where an individual variable with low correlation becomes significant to the model when combined with other independent variables, because of the influence of those other variables.
Second Model
In our second model, we will use only the categorical columns as predictors.
second_model <- lm(Prices ~ White.Marble + Black.Marble + Floors + City + Solar + Electric + Fiber + Glass.Doors + Swiming.Pool + Garden + Indian.Marble, data_train)
summary(second_model)
#>
#> Call:
#> lm(formula = Prices ~ White.Marble + Black.Marble + Floors +
#> City + Solar + Electric + Fiber + Glass.Doors + Swiming.Pool +
#> Garden + Indian.Marble, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7859.9 -1933.3 0.3 1939.9 7847.9
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 16672.714 17.413 957.485 <0.0000000000000002 ***
#> White.Marble1 14011.553 12.295 1139.623 <0.0000000000000002 ***
#> Black.Marble1 4995.572 12.316 405.629 <0.0000000000000002 ***
#> Floors1 15001.761 10.047 1493.189 <0.0000000000000002 ***
#> City2 3494.670 12.307 283.952 <0.0000000000000002 ***
#> City3 6975.402 12.303 566.981 <0.0000000000000002 ***
#> Solar1 253.936 10.047 25.275 <0.0000000000000002 ***
#> Electric1 1257.392 10.047 125.154 <0.0000000000000002 ***
#> Fiber1 11742.315 10.047 1168.748 <0.0000000000000002 ***
#> Glass.Doors1 4436.039 10.047 441.537 <0.0000000000000002 ***
#> Swiming.Pool1 14.355 10.047 1.429 0.153
#> Garden1 -1.144 10.047 -0.114 0.909
#> Indian.Marble1 NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2745 on 298545 degrees of freedom
#> Multiple R-squared: 0.9482, Adjusted R-squared: 0.9482
#> F-statistic: 4.964e+05 on 11 and 298545 DF, p-value: < 0.00000000000000022
This model has an R-squared and adjusted R-squared of 0.948.
Third Model
In the third model, we will use all columns as predictors.
third_model <- lm(Prices ~ ., data_train)
summary(third_model)
#>
#> Call:
#> lm(formula = Prices ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.0000000200 -0.0000000003 -0.0000000001 0.0000000002 0.0000254665
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value
#> (Intercept) 4500.000000047252797 0.000000000490933 9166220076524.77
#> Area 25.000000000003915 0.000000000001592 15707487904511.16
#> Garage 1499.999999999772854 0.000000000104436 14362837919959.43
#> FirePlace 749.999999999939860 0.000000000060294 12439149675712.38
#> Baths 1249.999999999957481 0.000000000060333 20718310394126.78
#> White.Marble1 13999.999999999563443 0.000000000208796 67051225190797.31
#> Black.Marble1 5000.000000000120053 0.000000000209148 23906531929987.04
#> Indian.Marble1 NA NA NA
#> Floors1 14999.999999999457941 0.000000000170617 87916349472160.97
#> City2 3500.000000000499767 0.000000000209006 16745921372901.12
#> City3 6999.999999999979991 0.000000000208929 33504142109415.35
#> Solar1 249.999999999827480 0.000000000170619 1465252635892.32
#> Electric1 1249.999999999819693 0.000000000170617 7326370184393.33
#> Fiber1 11750.000000000147338 0.000000000170620 68866427844877.61
#> Glass.Doors1 4450.000000000057298 0.000000000170617 26081812489206.13
#> Swiming.Pool1 -0.000000000170625 0.000000000170618 -1.00
#> Garden1 -0.000000000172278 0.000000000170621 -1.01
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Indian.Marble1 NA
#> Floors1 <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> Swiming.Pool1 0.317
#> Garden1 0.313
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000004661 on 298541 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.331e+27 on 15 and 298541 DF, p-value: < 0.00000000000000022
So far our third model has the highest R-squared and adjusted R-squared, which means that of the three models, the third model is the best. But from the summary above, there are three columns we can remove from our model. The Indian.Marble column appears to be perfectly collinear with other independent variables (the summary reports "1 not defined because of singularities"), which is why its coefficient is NA. The Swiming.Pool and Garden columns are the least significant in the model, so we can remove both of them as well.
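One quick way to check this suspicion (my own check, not part of the original analysis) is to verify that every house has exactly one of the three marble types, which would make Indian.Marble fully determined by White.Marble and Black.Marble:
house_clean %>%
  mutate(marble_sum = as.numeric(as.character(White.Marble)) +
           as.numeric(as.character(Black.Marble)) +
           as.numeric(as.character(Indian.Marble))) %>%
  count(marble_sum) # if every row sums to 1, the three dummies are perfectly collinear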
Fourth Model
The fourth model will use the same predictors as the third model, minus the three columns identified above.
fourth_train <- data_train %>%
select(-c(Indian.Marble, Swiming.Pool, Garden))
fourth_model <- lm(Prices ~ ., fourth_train)
summary(fourth_model)
#>
#> Call:
#> lm(formula = Prices ~ ., data = fourth_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.0000000200 -0.0000000003 -0.0000000001 0.0000000002 0.0000254666
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 4500.000000047077265 0.000000000475938 9455020189716
#> Area 25.000000000003915 0.000000000001592 15707499511756
#> Garage 1499.999999999772854 0.000000000104435 14362980922003
#> FirePlace 749.999999999939860 0.000000000060293 12439154172046
#> Baths 1249.999999999957026 0.000000000060333 20718436698176
#> White.Marble1 13999.999999999563443 0.000000000208795 67051396409737
#> Black.Marble1 5000.000000000120053 0.000000000209148 23906543707083
#> Floors1 14999.999999999457941 0.000000000170616 87916489568598
#> City2 3500.000000000498858 0.000000000209005 16745988538105
#> City3 6999.999999999979082 0.000000000208929 33504257931188
#> Solar1 249.999999999828162 0.000000000170618 1465260171780
#> Electric1 1249.999999999819693 0.000000000170617 7326371562395
#> Fiber1 11750.000000000147338 0.000000000170620 68866453965382
#> Glass.Doors1 4450.000000000057298 0.000000000170617 26081876863669
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Floors1 <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000004661 on 298543 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.536e+27 on 13 and 298543 DF, p-value: < 0.00000000000000022
It seems our fourth model has the same R-squared as our third model.
MODEL EVALUATION
The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error. RMSE is often preferred over MAE (mean absolute error) because RMSE squares the difference between the actual and predicted values, meaning that predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package. Note that the bigger the RMSE, the larger the model’s prediction error.
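For reference, a minimal sketch of what RMSE() computes under the hood (the helper name rmse_manual is my own):
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2)) # square the errors, average them, take the square root
}
# rmse_manual(first_model$fitted.values, data_train$Prices) should match RMSE() below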
RMSE of First Model
We will calculate the RMSE for both the train and the test dataset in order to know whether our model is overfit or not.
RMSE of data train
RMSE(y_pred = first_model$fitted.values, y_true = data_train$Prices)
#> [1] 11740.56
first_pred <- predict(first_model, newdata = data_test %>% select(-Prices))
RMSE of data test
RMSE(y_pred = first_pred, y_true = data_test$Prices)
#> [1] 11748.17
From the two RMSE values we calculated, I can say our model is not overfit (I assume a model is overfit when the RMSE on the test data is more than 10% larger than the RMSE on the training data). We will do the same for the other models.
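This rule of thumb can be written as a small helper (the function name is_overfit is my own, not from any package):
is_overfit <- function(rmse_train, rmse_test, tolerance = 0.10) {
  # flag as overfit when the test RMSE exceeds the train RMSE by more than the tolerance
  (rmse_test - rmse_train) / rmse_train > tolerance
}
is_overfit(11740.56, 11748.17) # FALSE for the first model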
RMSE of Second Model
RMSE of train data set
RMSE(y_pred = second_model$fitted.values, y_true = data_train$Prices)
#> [1] 2744.728
second_pred <- predict(second_model, newdata = data_test %>% select(-Prices))
RMSE of test data set
RMSE(y_pred = second_pred, y_true = data_test$Prices)
#> [1] 2740.513
RMSE of Third Model
RMSE of train data set
RMSE(y_pred = third_model$fitted.values, y_true = data_train$Prices)
#> [1] 0.00000004661109
third_pred <- predict(third_model, newdata = data_test %>% select(-Prices))
RMSE of test data set
RMSE(y_pred = third_pred, y_true = data_test$Prices)
From the three RMSE comparisons above, we will choose the first model as the final model, because its train and test RMSE are nearly identical, which means it is not overfit.
LM ASSUMPTION TEST
There are four assumptions that must be fulfilled by a linear regression model: normality of residuals, homoscedasticity (no heteroscedasticity), no multicollinearity, and linearity.
Normality Test
This test is performed to find out whether the model residuals are normally distributed or not; a good model has normally distributed residuals.
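The original check is not shown here; a minimal sketch of how the residuals could be inspected visually (shapiro.test() is limited to 5,000 observations, so with roughly 300,000 rows a visual check is more practical):
hist(first_model$residuals, breaks = 50,
     main = "Residuals of first_model", xlab = "Residual") # histogram of the residuals
qqnorm(first_model$residuals) # Q-Q plot against a theoretical normal distribution
qqline(first_model$residuals)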
The residuals of our first model already appear to follow a normal distribution.
Heteroscedasticity
Heteroscedasticity is a statistical condition where the variability of the errors or residuals is not constant across all levels of the independent variables. In simple words, the spread of the residuals tends to increase or decrease systematically as the values of the independent variables change. We must avoid this condition in our model. This check can be done in two ways: using bptest() or plotting the residuals.
bptest(first_model)
#>
#> studentized Breusch-Pagan test
#>
#> data: first_model
#> BP = 3.4063, df = 4, p-value = 0.4923
Since the p-value is greater than 0.05, there is no evidence of heteroscedasticity.
resact <- data.frame(residual = first_model$residuals, fitted = first_model$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
Since our data is quite big, the plot is hard to interpret, but from the Breusch-Pagan test we can say there is no heteroscedasticity.
Multicollinearity
Multicollinearity is a condition where one independent variable has a high correlation with other independent variables. When this condition appears in our model, it becomes difficult to isolate the individual effect of each independent variable on the dependent variable because the effects are confounded or combined. The presence of multicollinearity does not affect the overall predictive power of the model, but it can lead to unreliable estimates of the individual regression coefficients. We will use the vif() function to check this assumption.
vif(first_model)
#> Area Garage FirePlace Baths
#> 1.000002 1.000008 1.000001 1.000006
As we can see, we get a value of about 1 for all predictors, which means there is no significant multicollinearity problem. For reference: a VIF of 1 means no multicollinearity at all, a value between 1 and 5 means mild correlation that is still generally acceptable, and a value greater than 5 means there is a significant multicollinearity problem in our model.
Linearity
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
Look at the horizontal blue line above: it forms a straight line, indicating there is no pattern in our residuals, or in other words our residuals already satisfy the linearity assumption.
CONCLUSION
If we look at the predictors as individual variables, they might have no correlation with the target variable at all. But if we combine all the predictors, they can have a significant impact on the model; this condition is called the confounding effect. Of the four models we built above, our first model is the one chosen to predict house prices. This model has an R-squared and adjusted R-squared of 0.051 and already satisfies all linear regression assumptions. With the training dataset our model has an RMSE of 11,740, while with the test dataset we get an RMSE of 11,748, so we can say our model is not overfit.