# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)

1 Introduction

Most of us have experienced having to look for a new house to buy. The journey often involves avoiding scams, negotiating deals, researching local areas, and so on.

Thousands of houses are sold every day. Every buyer asks questions such as: What is this house actually worth? Am I paying a fair price? In this project, a machine learning model is proposed to predict a house's price based on data related to the house (its size, the year it was built, etc.).

House price prediction is a valuable tool for buyers in the real estate market, providing data-driven insights to aid their decision-making process. With accurate predictions, buyers can plan their budgets more effectively, assess property values, and identify potential investment opportunities. They can compare properties in different neighborhoods, leverage market timing, and mitigate risks associated with property purchases. Armed with data-backed predictions, buyers gain confidence in their decisions, avoid buyer’s remorse, and save time during the property search. While predictions offer valuable estimates, consulting with real estate professionals is essential for personalized advice and guidance. By combining data-driven insights and expert assistance, buyers can navigate the real estate market confidently and find their ideal home at the right price.

In this project, we will utilize a Multiple Regression model to predict the House’s Price. We will also conduct correlation analysis between the predictor variables and the target variable. The dataset of choice is the HousePricing dataset obtained from Kaggle, which will serve as the fundamental data source for our data-driven predictions.

1.1 Data Understanding

Before diving into the project, we’ll need to import the required libraries to facilitate our analysis.

library(GGally) # Correlation matrix with ggcorr
library(MLmetrics) # Compare regression results (MAPE, MAE, R2_Score)
library(car) # Calculate VIF
library(caret) # Pre-process and split the data
library(tidyverse) # Data processing and visualization
library(performance) # Model comparison (compare_performance)
library(lmtest) # Breusch-Pagan test
library(glue) # Pretty printing
library(plotly) # Interactive plots
library(heatmaply) # Heatmap plots
library(ggcorrplot) # Correlation heatmap plots
library(echarts4r) # Interactive charts
library(olsrr) # Normality tests
library(nortest) # Normality tests

Then we import our data:

house_raw <- read.csv('datasets/HousePrices_HalfMil.csv')

dim(house_raw)
## [1] 500000     16
house_raw %>% head()
glimpse(house_raw)
## Rows: 500,000
## Columns: 16
## $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
summary(house_raw)
##       Area           Garage        FirePlace         Baths      
##  Min.   :  1.0   Min.   :1.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.: 63.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :125.0   Median :2.000   Median :2.000   Median :3.000  
##  Mean   :124.9   Mean   :2.001   Mean   :2.003   Mean   :2.998  
##  3rd Qu.:187.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000  
##   White.Marble    Black.Marble    Indian.Marble        Floors      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.333   Mean   :0.3327   Mean   :0.3343   Mean   :0.4994  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       City           Solar           Electric          Fiber       
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :2.001   Mean   :0.4987   Mean   :0.5007   Mean   :0.5005  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Glass.Doors      Swiming.Pool        Garden           Prices     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 7725  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:33500  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :41850  
##  Mean   :0.4999   Mean   :0.5004   Mean   :0.5016   Mean   :42050  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:50750  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :77975

The dataset contains 500,000 rows and 16 columns. The target variable is “Prices,” while the remaining variables are considered predictor variables.

The observation data consists of the following variables:

  • Area: Property area size (numeric)
  • Garage: Number of garages in the property (numeric)
  • FirePlace: Number of fireplaces in the property (numeric)
  • Baths: Number of bathrooms in the property (numeric)
  • White.Marble: Whether the property uses white marble flooring (factor)
  • Black.Marble: Whether the property uses black marble flooring (factor)
  • Indian.Marble: Whether the property uses Indian marble flooring (factor)
  • Floors: Number of floors in the property (numeric)
  • City: City where the property is located (factor)
  • Solar: Whether the property uses solar energy (factor)
  • Electric: Whether the property uses electricity (factor)
  • Fiber: Whether the property has access to fiber optic (factor)
  • Glass.Doors: Whether the property uses glass doors (factor)
  • Swiming.Pool: Whether the property has a swimming pool (factor)
  • Garden: Whether the property has a garden (factor)
  • Prices: House Price (numeric)

1.2 Data preprocessing

Preprocessing plays a pivotal role in data analysis, encompassing vital steps such as cleaning and transforming the data to ensure its quality and suitability for subsequent analysis and modeling.

Before delving into our Exploratory Data Analysis, we will thoroughly examine our dataset to identify and address any anomalies present. We understand that the quality of insights is directly tied to the integrity of the underlying data. Hence, ensuring that the data is clean and in a usable form is of paramount importance.

1.2.1 Fix Data Types

From the results above, it is evident that several variables have inappropriate data types. Therefore, we will rectify them as follows:

house <- house_raw %>% 
  mutate_at(vars(White.Marble, Black.Marble, Indian.Marble, 
                  City, Solar, Electric, Fiber, Glass.Doors, 
                  Swiming.Pool, Garden),as.factor)
glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble  <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble  <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City          <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar         <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber         <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors   <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden        <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

1.2.2 Missing values

The first step is to check for and handle any missing values in the dataset.

colSums(is.na(house))
##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

Fortunately, our dataset does not contain any missing values, which is beneficial for our analysis. In case there were missing values, we would need to make a decision on whether to impute the missing data or exclude the observations with missing values from the analysis.
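
Had missing values been present, a minimal sketch of both options might look like the following (median imputation for numeric columns is an illustrative choice, not part of the original workflow):

# Option 1: drop any row that contains a missing value
house_complete <- house %>% drop_na()

# Option 2: impute numeric columns with their median (illustrative choice)
house_imputed <- house %>% 
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))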

1.3 Exploratory Data Analysis

Exploratory data analysis (EDA) entails utilizing graphics and visualizations to thoroughly explore and analyze a dataset. The primary objective is to investigate and gain insights into the data variables present, facilitating a comprehensive understanding of the dataset’s characteristics.

1.3.1 Data Distribution

From the density graph, we can observe the distribution of each variable. Data distribution refers to the pattern or arrangement of data values in a dataset. It is a fundamental aspect of data analysis as it provides essential information about how the data is spread across different values. Understanding the distribution of data is crucial for making informed decisions, choosing appropriate statistical methods, and gaining insights into the underlying phenomena represented by the data.

num_data <- house %>% 
  select_if(is.numeric)

house_long <- num_data %>% 
  pivot_longer(colnames(num_data)) %>% 
  as.data.frame()

ggp1 <- ggplot(house_long, aes(x = value, fill=name)) +
  geom_density() +
  theme(legend.position = "none") +
  facet_wrap(~ name, scales = "free")  +
  labs(
      title = 'Density graph of parameters',
      x = 'Value',
      y = 'Density'
  )
  
ggp1

house %>% 
    select_if(is.numeric) %>% 
    pivot_longer(everything(), cols_vary = "slowest", names_to = 'params') %>% 
    ggplot(aes(x = value, fill=params)) +
    geom_boxplot() +
    theme(legend.position = "none") +
    facet_wrap(~params, scales = 'free_x')  +
    labs(
        title = 'Box Plot of parameters',
        x = 'Parameters',
        y = 'Frequency'
    )

From the box plot visualization above, most variables show no outliers and appear roughly symmetric. A few outliers are, however, observed in the Prices variable.

fac_data <- house %>% 
  select_if(is.factor)

lvls <- unique(unlist(fac_data %>% select(-City)))
t(sapply(fac_data %>% select(-City), function(x) table(factor(x, levels = lvls))))
##                    0      1
## White.Marble  333504 166496
## Black.Marble  333655 166345
## Indian.Marble 332841 167159
## Solar         250653 249347
## Electric      249675 250325
## Fiber         249766 250234
## Glass.Doors   250065 249935
## Swiming.Pool  249782 250218
## Garden        249177 250823
table(fac_data$City)
## 
##      1      2      3 
## 166314 166902 166784
fac_data <- house %>% 
  select_if(is.factor)

house_long <- fac_data %>% 
  pivot_longer(colnames(fac_data)) %>% 
  as.data.frame()

gg_box1 <- house_long %>% 
  filter(name %in% c("White.Marble", "Black.Marble", "Indian.Marble")) %>% 
  ggplot(aes(x = value, fill=name)) +
  geom_bar() +
  theme(legend.position = "none") +
  facet_wrap(~ name, scales = "free")  +
  labs(
      title = 'Bar graph of parameters',
      x = 'Parameters',
      y = 'Frequency'
  )
  
gg_box1

gg_box2 <- house_long %>% 
  filter(name %in% c("Solar", "Electric", "Fiber")) %>% 
  ggplot(aes(x = value, fill=name)) +
  geom_bar() +
  theme(legend.position = "none") +
  facet_wrap(~ name, scales = "free")  +
  labs(
      title = 'Bar graph of parameters',
      x = 'Parameters',
      y = 'Frequency'
  )
  
gg_box2

gg_box3 <- house_long %>% 
  filter(name %in% c("Glass.Doors", "Swiming.Pool", "Garden", "City")) %>% 
  ggplot(aes(x = value, fill=name)) +
  geom_bar() +
  theme(legend.position = "none") +
  facet_wrap(~ name, scales = "free")  +
  labs(
      title = 'Bar graph of parameters',
      x = 'Parameters',
      y = 'Frequency'
  )
  
gg_box3

It is evident that for the variables White.Marble, Black.Marble, and Indian.Marble, the highest frequency is 0, indicating that many properties do not currently utilize these types of marble flooring.

As for the energy sources in houses, the proportions are relatively equal for Solar, Electric, and Fiber. A similar balanced distribution is also observed for Glass.Doors, Swiming.Pool, Garden, and City variables.

1.3.2 Trend Comparison of House Prices

Now, let's look at how Prices trend against each predictor.

names(house_raw)
##  [1] "Area"          "Garage"        "FirePlace"     "Baths"        
##  [5] "White.Marble"  "Black.Marble"  "Indian.Marble" "Floors"       
##  [9] "City"          "Solar"         "Electric"      "Fiber"        
## [13] "Glass.Doors"   "Swiming.Pool"  "Garden"        "Prices"
house_raw %>% 
    head(1000) %>% 
    select(c("Prices", "Area", "Garage", "FirePlace", "Baths", "White.Marble", "Black.Marble", "Indian.Marble")) %>% 
    pivot_longer(cols = -Prices, names_to = 'param') %>% 
    ggplot(aes(x = value, y = Prices, colour = param)) +
    geom_point(size = 0.2) +
    geom_smooth(formula = y ~ x, method = "loess") +
    theme(axis.text.x = element_text(hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none") +
    facet_wrap(~param, scales = 'free_x')  +
    labs(
        title = 'Trends in each parameter',
        x = 'Parameters',
        y = 'Values'
    )

house_raw %>% 
    head(1000) %>% 
    select(c("Floors", "City", "Solar", "Electric", "Fiber", "Glass.Doors", 
             "Swiming.Pool", "Garden", "Prices" )) %>% 
    pivot_longer(cols = -Prices, names_to = 'param') %>% 
    ggplot(aes(x = value, y = Prices, colour = param)) +
    geom_point(size = 0.2) +
    geom_smooth(formula = y ~ x, method = "loess") +
    theme(axis.text.x = element_text(hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none") +
    facet_wrap(~param, scales = 'free_x')  +
    labs(
        title = 'Trends in each parameter',
        x = 'Parameters',
        y = 'Values'
    )

1.3.3 Correlation

Correlation among predictors is a crucial consideration in linear regression. High correlation between predictors can lead to multicollinearity, affecting the accuracy of the regression model. We can use the ggcorrplot() function from the ggcorrplot package to visualize and assess the correlation between predictors.

In this section, we will be analyzing the Pearson Correlation between all of our variables.

num_data <- house_raw %>% select_if(is.numeric)
# Compute a correlation matrix
corr <- round(cor(num_data), 2)

# Visualize the lower triangle of the correlation matrix
# Barring the no significant coefficient
corr.plot <- ggcorrplot(corr, type = "lower", lab = TRUE, p.mat = NULL,
                        lab_size = 2, pch = 1, pch.col = "black", pch.cex =1, tl.cex = 6)
# ggplotly(corr.plot)
corr.plot

The correlation values closer to 1 or -1 indicate a stronger relationship between predictors.

Since we aim to predict the prices of houses, we need to identify variables that exhibit a strong correlation with the Price. Based on the plot above, the following variables show notable correlations with Price: White Marble, Indian Marble, Floors, Fiber, City, and Glass Doors.
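
To make this selection less dependent on reading the heatmap, we can also rank the predictors by the absolute value of their correlation with Prices; a small sketch using the corr matrix computed above:

# Rank predictors by the strength of their (absolute) correlation with Prices
sort(abs(corr[, "Prices"]), decreasing = TRUE)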

1.3.4 Multicollinearity Check

Upon revisiting the plot, we observe multicollinearity among the White.Marble, Black.Marble, and Indian.Marble variables. To address this concern, we first examine their values to confirm that they are binary indicators.

unique(house$White.Marble)
## [1] 0 1
## Levels: 0 1
unique(house$Indian.Marble)
## [1] 0 1
## Levels: 0 1
unique(house$Black.Marble)
## [1] 1 0
## Levels: 0 1

Since the values are either 1 or 0, we will treat these variables as factors (categorical) and test their association with Pearson's Chi-squared test (note: this test applies to categorical data only).

# Change the data types first
white.marble <- as.factor(house$White.Marble)
indian.marble <- as.factor(house$Indian.Marble)
black.marble <- as.factor(house$Black.Marble)

# Pearson's Chi-squared test
chisq.test(white.marble, indian.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  white.marble and indian.marble
## X-squared = 125360, df = 1, p-value < 2.2e-16
chisq.test(white.marble, black.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  white.marble and black.marble
## X-squared = 124445, df = 1, p-value < 2.2e-16
chisq.test(indian.marble, black.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  indian.marble and black.marble
## X-squared = 125189, df = 1, p-value < 2.2e-16

From the results above, it is evident that these three variables are significantly associated with one another, as indicated by p-values below 0.05, confirming the multicollinearity among them. To address this issue, we will remove the one variable that exhibits the weakest correlation with our target variable, which is Black.Marble.
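
As a quick sanity check on that choice, the correlation of each marble indicator with Prices can be computed directly on the raw 0/1 encoding; a sketch:

# Correlation of each marble indicator (raw 0/1 encoding) with Prices
cor(house_raw[, c("White.Marble", "Black.Marble", "Indian.Marble")], house_raw$Prices)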

#remove Black.Marble from our dataset
house <- house %>% 
  select(-Black.Marble)

1.3.5 Checking outliers

We will examine the distribution of our data by generating boxplots, which will provide insights into the central tendency, spread, and potential outliers of the variables.

In data analysis, identifying outliers (observations that fall well outside the overall pattern of the data) is very important. A boxplot, which is based on the five-number summary, is a useful diagnostic tool for spotting potential outliers and provides a graphical display of the center and spread of a dataset.

house %>% 
    select_if(is.numeric) %>% 
    pivot_longer(everything(), cols_vary = "slowest", names_to = 'params') %>% 
    ggplot(aes(x = value, fill=params)) +
    geom_boxplot() +
    theme(legend.position = "none") +
    facet_wrap(~params, scales = 'free_x')  +
    labs(
        title = 'Box plot of parameters',
        x = 'Value'
    )

As shown by the boxplot above, there are a handful of outliers in the Prices variable, and they are not far removed from the rest of the data. Furthermore, the median sits close to the center of the box, suggesting that the target variable is roughly symmetric.

We use the boxplot() function to detect which values are outliers. Here is the list of outlier values for each column:

cat("Area :",ifelse(length(boxplot(house$Area, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))
## Area : No Outlier
cat("\nGarage :",ifelse(length(boxplot(house$Garage, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))
## 
## Garage : No Outlier
cat("\nFirePlace :",ifelse(length(boxplot(house$FirePlace, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))
## 
## FirePlace : No Outlier
cat("\nBaths :",ifelse(length(boxplot(house$Baths, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))
## 
## Baths : No Outlier
cat("\nFloors :",ifelse(length(boxplot(house$Floors, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))
## 
## Floors : No Outlier
cat("\nPrices :", boxplot(house$Prices, plot = F)$out)
## 
## Prices : 76975 77225 77000 77175 77375 77525 76825 77700 76950 77075 77250 77975 76750 77225 76775 76800

We'll use the filter() function to collect the outliers from each numeric column (identified with the IQR rule) so we can examine them further.

columns <- colnames(house[,sapply(house,is.numeric)])
house.outliers <- data.frame()
for (col in columns) {
  # get all outlier data each columns
  cat(col, "\n")
  outlier <- boxplot(house[col], plot = FALSE)$out
  outliers <- house %>% filter(.[[col]] %in% outlier)
  cat(nrow(outliers), "\n\n")
  house.outliers <- rbind(house.outliers, outliers)
}
## Area 
## 0 
## 
## Garage 
## 0 
## 
## FirePlace 
## 0 
## 
## Baths 
## 0 
## 
## Floors 
## 0 
## 
## Prices 
## 16
house.outliers %>% distinct()

To more objectively determine if these outliers are significantly different, we will conduct a T-test.

First, we take the data excluding the outliers.

# create detect outlier function
detect_outlier <- function(x) {
 
    # calculate first quantile
    Quantile1 <- quantile(x, probs=.25)
 
    # calculate third quantile
    Quantile3 <- quantile(x, probs=.75)
 
    # calculate inter quartile range
    IQR = Quantile3-Quantile1
 
    # return true or false
    x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}
 
# create remove outlier function
remove_outlier <- function(dataframe,
                            columns=names(dataframe)) {
 
    # for loop to traverse in columns vector
    for (col in columns) {
 
        # remove observation if it satisfies outlier function
        dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
    }
    
    dataframe
}
 
house.no_outliers <- remove_outlier(house, colnames(house[,sapply(house,is.numeric)]))

house.no_outliers %>% head()
t.test(house.no_outliers$Prices, house.outliers$Prices)
## 
##  Welch Two Sample t-test
## 
## data:  house.no_outliers$Prices and house.outliers$Prices
## t = -395.51, df = 16.182, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35301.52 -34925.45
## sample estimates:
## mean of x mean of y 
##  42049.02  77162.50

The T-test yields a p-value below 0.05, suggesting that the Prices of the outlier observations differ significantly from the rest of the population. A p-value less than 0.05 leads to rejection of the null hypothesis, indicating that the two groups are not statistically similar. However, since the outliers exist only in the Prices column and their number is minimal, they will not be dropped.

1.3.6 Bivariate Analysis

The purpose of bivariate analysis is to understand the relationship between two variables: X (the independent/explanatory variable) and Y (the dependent/outcome variable). In this case, most of our selected predictors are binary indicators (1 for present, 0 for absent) or have few levels, so we will use boxplots to visualize their relationship with the target variable.

1.3.6.1 White Marble

house %>% 
  ggplot(aes(x=White.Marble, y=Prices, fill=White.Marble)) +
  geom_boxplot()

We can conclude that Prices tend to be higher when the house has white marble flooring.

1.3.6.2 Indian Marble

house %>% 
  ggplot(aes(x=Indian.Marble, y=Prices, fill=Indian.Marble)) +
  geom_boxplot()

We can conclude that Prices tend to be lower when the house has Indian marble flooring.

1.3.6.3 Floors

house %>% 
  mutate(Floors = as.factor(Floors)) %>% 
  ggplot(aes(x=Floors, y=Prices, fill=Floors)) +
  geom_boxplot()

We can conclude that Prices tend to be higher when Floors = 1.

1.3.6.4 Fiber

house %>% 
  ggplot(aes(x=Fiber, y=Prices, fill=Fiber)) +
  geom_boxplot()

We can conclude that Prices tend to be higher when the house has a fiber connection.

1.3.6.5 Glass Doors

house %>% 
  ggplot(aes(x=Glass.Doors, y=Prices, fill=Glass.Doors)) +
  geom_boxplot()

We can conclude that Prices tend to be higher when the house has glass doors.

1.3.6.6 City

house %>% 
  ggplot(aes(x=City, y=Prices, fill=City)) +
  geom_boxplot()

2 Regression Models

2.1 Cross Validation

As a preliminary step before the modeling process, we will split our dataset into training and test sets (hold-out validation). This step is crucial because it lets us evaluate how well the model predicts new, unseen data. Specifically, we will allocate 80% of the dataset for training the model and reserve the remaining 20% for testing predictions, allowing us to gauge the model's performance accurately.

# Set seed for reproducible sampling
RNGkind(sample.kind = "Rounding") 
set.seed(100)

# Index Sampling
index <- sample(nrow(house), size = nrow(house)*0.80)

# Splitting
house_train <- house[index,] #take 80%
house_test <- house[-index,] #take 20%
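
As an alternative to manual index sampling, the caret package (already loaded) offers createDataPartition(), which performs a similar 80/20 split; a sketch, assuming the same seed (the *_alt objects are illustrative and not used later):

# Alternative split with caret's createDataPartition()
set.seed(100)
idx <- createDataPartition(house$Prices, p = 0.80, list = FALSE)
house_train_alt <- house[idx, ]
house_test_alt  <- house[-idx, ]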

2.2 Modelling

Considering that we have a numeric target variable and a mix of numeric and categorical predictors, it is appropriate to construct a linear regression model. We will now proceed with building the model and evaluating its fit to our data. After the data has been split, the next step involves training our models using the house_train dataset.

2.2.1 Model Without Predictor

Let's start with a minimal model that uses no predictors at all. In this case, the model simply outputs the mean of the target variable, since this minimizes the squared error.

model.none <- lm(Prices ~ 1, house_train)
summary(model.none)
## 
## Call:
## lm(formula = Prices ~ 1, data = house_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34324  -8549   -199   8701  35926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42048.54      19.14    2196   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12110 on 399999 degrees of freedom
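
As a quick sanity check, the intercept of model.none should equal the mean of Prices in the training data:

# The intercept-only model simply predicts the training mean of Prices
mean(house_train$Prices)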

2.2.2 Model with All Predictors

Now, we will construct the maximal model, utilizing all the parameters in our dataset. This model incorporates all available predictors to explore their potential impact on the target variable.

model.all <- lm(Prices ~ ., house_train)
summary(model.all)
## 
## Call:
## lm(formula = Prices ~ ., data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.642e-05 -1.000e-10  1.000e-10  2.000e-10  2.384e-07 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     9.500e+03  3.468e-10  2.739e+13   <2e-16 ***
## Area            2.500e+01  9.203e-13  2.717e+13   <2e-16 ***
## Garage          1.500e+03  8.086e-11  1.855e+13   <2e-16 ***
## FirePlace       7.500e+02  4.672e-11  1.605e+13   <2e-16 ***
## Baths           1.250e+03  4.671e-11  2.676e+13   <2e-16 ***
## White.Marble1   9.000e+03  1.620e-10  5.555e+13   <2e-16 ***
## Indian.Marble1 -5.000e+03  1.618e-10 -3.091e+13   <2e-16 ***
## Floors          1.500e+04  1.321e-10  1.135e+14   <2e-16 ***
## City2           3.500e+03  1.620e-10  2.161e+13   <2e-16 ***
## City3           7.000e+03  1.619e-10  4.325e+13   <2e-16 ***
## Solar1          2.500e+02  1.321e-10  1.892e+12   <2e-16 ***
## Electric1       1.250e+03  1.321e-10  9.460e+12   <2e-16 ***
## Fiber1          1.175e+04  1.321e-10  8.892e+13   <2e-16 ***
## Glass.Doors1    4.450e+03  1.321e-10  3.368e+13   <2e-16 ***
## Swiming.Pool1  -1.338e-10  1.321e-10 -1.013e+00    0.311    
## Garden1         1.311e-10  1.321e-10  9.920e-01    0.321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.179e-08 on 399984 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.239e+27 on 15 and 399984 DF,  p-value: < 2.2e-16

As the maximal model uses all the predictors, one might logically expect it to be the best model. However, this may not always hold true. To find the optimal model, we will conduct stepwise regression.

2.2.3 Variable Selection

We have created model.none, which uses no predictors, and model.all, which uses all variables. Stepwise regression employs the Akaike Information Criterion (AIC) as the metric to select the optimal model. The method aims to minimize the AIC, representing the least information loss. Because stepwise regression uses a greedy search, it finds a local optimum and does not guarantee the absolute best model. We will proceed with stepwise regression to identify the important variables for our model.
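
For reference, the AIC of a model with k estimated parameters and maximized likelihood is defined as:

\[ AIC = 2k - 2\ln(\hat{L}) \]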

  • Pearson Correlation

For the independent variables (X), we will use White.Marble, Indian.Marble, City, Floors, Fiber, and Glass.Doors, since they exhibit high correlation with Prices.

model.corr <- lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + City + Glass.Doors,
                data = house_train)
summary(model.corr)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + 
##     City + Glass.Doors, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9367.1 -2156.3    -0.5  2147.9  9354.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21634.874     13.718  1577.1   <2e-16 ***
## Floors         14992.472      9.683  1548.4   <2e-16 ***
## Fiber1         11754.130      9.683  1213.9   <2e-16 ***
## White.Marble1   9023.379     11.871   760.1   <2e-16 ***
## Indian.Marble1 -4998.393     11.853  -421.7   <2e-16 ***
## City2           3490.648     11.868   294.1   <2e-16 ***
## City3           6981.662     11.860   588.7   <2e-16 ***
## Glass.Doors1    4429.734      9.683   457.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3062 on 399992 degrees of freedom
## Multiple R-squared:  0.936,  Adjusted R-squared:  0.936 
## F-statistic: 8.364e+05 on 7 and 399992 DF,  p-value: < 2.2e-16

R2 = 0.936, indicating that approximately 93.6% of the observed variation can be explained by the model's inputs. Furthermore, all variables have p-values below 0.05, signifying that they are all significantly associated with our dependent variable (Y).
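
For reference, R-squared is the proportion of the variance in the target explained by the model:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} \]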

  • Backward Elimination
model.step.back <- step(model.all, scope = formula(model.none), direction = 'backward', trace = 0)
summary(model.step.back)
## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.642e-05 -1.000e-10  1.000e-10  2.000e-10  2.384e-07 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     9.500e+03  3.343e-10  2.842e+13   <2e-16 ***
## Area            2.500e+01  9.203e-13  2.717e+13   <2e-16 ***
## Garage          1.500e+03  8.086e-11  1.855e+13   <2e-16 ***
## FirePlace       7.500e+02  4.672e-11  1.605e+13   <2e-16 ***
## Baths           1.250e+03  4.671e-11  2.676e+13   <2e-16 ***
## White.Marble1   9.000e+03  1.620e-10  5.555e+13   <2e-16 ***
## Indian.Marble1 -5.000e+03  1.618e-10 -3.091e+13   <2e-16 ***
## Floors          1.500e+04  1.321e-10  1.135e+14   <2e-16 ***
## City2           3.500e+03  1.620e-10  2.161e+13   <2e-16 ***
## City3           7.000e+03  1.619e-10  4.325e+13   <2e-16 ***
## Solar1          2.500e+02  1.321e-10  1.892e+12   <2e-16 ***
## Electric1       1.250e+03  1.321e-10  9.460e+12   <2e-16 ***
## Fiber1          1.175e+04  1.321e-10  8.892e+13   <2e-16 ***
## Glass.Doors1    4.450e+03  1.321e-10  3.368e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.179e-08 on 399986 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.583e+27 on 13 and 399986 DF,  p-value: < 2.2e-16

The model.step.back model drops Swiming.Pool and Garden, using 12 of the 14 available predictor variables.

  • Forward Selection
model.step.fwd <- step(model.none, scope = list(upper = model.all), direction = 'forward', trace = F)
summary(model.step.fwd)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
##     Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace + 
##     Electric + Solar, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.643e-05 -1.000e-10  1.000e-10  2.000e-10  1.362e-05 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     9.500e+03  4.388e-10  2.165e+13   <2e-16 ***
## Floors          1.500e+04  1.735e-10  8.646e+13   <2e-16 ***
## Fiber1          1.175e+04  1.735e-10  6.773e+13   <2e-16 ***
## White.Marble1   9.000e+03  2.127e-10  4.231e+13   <2e-16 ***
## City2           3.500e+03  2.126e-10  1.646e+13   <2e-16 ***
## City3           7.000e+03  2.125e-10  3.294e+13   <2e-16 ***
## Glass.Doors1    4.450e+03  1.735e-10  2.565e+13   <2e-16 ***
## Indian.Marble1 -5.000e+03  2.124e-10 -2.354e+13   <2e-16 ***
## Area            2.500e+01  1.208e-12  2.069e+13   <2e-16 ***
## Baths           1.250e+03  6.132e-11  2.038e+13   <2e-16 ***
## Garage          1.500e+03  1.062e-10  1.413e+13   <2e-16 ***
## FirePlace       7.500e+02  6.134e-11  1.223e+13   <2e-16 ***
## Electric1       1.250e+03  1.735e-10  7.205e+12   <2e-16 ***
## Solar1          2.500e+02  1.735e-10  1.441e+12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.486e-08 on 399986 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.499e+27 on 13 and 399986 DF,  p-value: < 2.2e-16

The model.step.fwd model also drops Swiming.Pool and Garden, using 12 of the 14 available predictor variables.

  • Both
model.step.both <- step(model.none, scope = list(upper = model.all), direction = 'both', trace = F)
summary(model.step.both)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + City + 
##     Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace + 
##     Electric + Solar, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.643e-05 -1.000e-10  1.000e-10  2.000e-10  1.362e-05 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     9.500e+03  4.388e-10  2.165e+13   <2e-16 ***
## Floors          1.500e+04  1.735e-10  8.646e+13   <2e-16 ***
## Fiber1          1.175e+04  1.735e-10  6.773e+13   <2e-16 ***
## White.Marble1   9.000e+03  2.127e-10  4.231e+13   <2e-16 ***
## City2           3.500e+03  2.126e-10  1.646e+13   <2e-16 ***
## City3           7.000e+03  2.125e-10  3.294e+13   <2e-16 ***
## Glass.Doors1    4.450e+03  1.735e-10  2.565e+13   <2e-16 ***
## Indian.Marble1 -5.000e+03  2.124e-10 -2.354e+13   <2e-16 ***
## Area            2.500e+01  1.208e-12  2.069e+13   <2e-16 ***
## Baths           1.250e+03  6.132e-11  2.038e+13   <2e-16 ***
## Garage          1.500e+03  1.062e-10  1.413e+13   <2e-16 ***
## FirePlace       7.500e+02  6.134e-11  1.223e+13   <2e-16 ***
## Electric1       1.250e+03  1.735e-10  7.205e+12   <2e-16 ***
## Solar1          2.500e+02  1.735e-10  1.441e+12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.486e-08 on 399986 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.499e+27 on 13 and 399986 DF,  p-value: < 2.2e-16

The model.step.both model likewise drops Swiming.Pool and Garden, selecting the same 12 of the 14 available predictor variables.

3 Model Evaluation

3.1 Model Performance

Evaluating our models based solely on the regression summaries above and comparing them would be challenging. In this section, we introduce the RMSE (Root Mean Squared Error), which measures a model's performance as follows:

\[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \]

RMSE is one of the most popular measures used to estimate the accuracy of our forecasting model’s predicted values in comparison to the actual or observed values during the training of regression models.
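
In addition to RMSE, the evaluation function below also reports MAE and MAPE, defined as:

\[ MAE = \frac{1}{n} \sum |\hat y_i - y_i|, \qquad MAPE = \frac{1}{n} \sum \left| \frac{\hat y_i - y_i}{y_i} \right| \]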

Evaluation function

evaluate <- function(model, data_train, data_test, name) {
  # Evaluate training performance
  result.train <- model$fitted.values
  
  result.train.mae <- MAE(result.train, data_train$Prices)
  result.train.mape <- MAPE(result.train, data_train$Prices)
  result.train.rmse <- RMSE(result.train, data_train$Prices)
  result.train.r2 <- R2_Score(result.train, data_train$Prices)
  
  # Evaluate testing performance
  result.test <- predict(model, data_test)
  
  result.test.mae <- MAE(result.test, data_test$Prices)
  result.test.mape <- MAPE(result.test, data_test$Prices)
  result.test.rmse <- RMSE(result.test, data_test$Prices)
  result.test.r2 <- R2_Score(result.test, data_test$Prices)
  
  cat('Result of',name,'\n\n',
      
      'Training', '\n',
      
      'MAE : ', round(result.train.mae, 4), '\n', 
      'MAPE : ', round(result.train.mape, 4), '\n',
      'RMSE : ', round(result.train.rmse, 4), '\n',
      'R-squared : ', round(result.train.r2, 4), '\n\n',
      
      'Testing', '\n',
      
      'MAE : ', round(result.test.mae, 4), '\n', 
      'MAPE : ', round(result.test.mape, 4), '\n',
      'RMSE : ', round(result.test.rmse, 4), '\n',
      'R-squared : ', round(result.test.r2, 4), '\n')
}

Evaluating model.all

evaluate(model.all, house_train, house_test, "Model All")
## Result of Model All 
## 
##  Training 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1 
## 
##  Testing 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1

Next, let's compare the performance of model.all with model.corr, model.step.back, model.step.fwd, and model.step.both. This comparison will help us understand which model performs better at predicting house prices.

evaluate(model.corr, house_train, house_test, "Model Feature Selection (Pearson Corr)")
## Result of Model Feature Selection (Pearson Corr) 
## 
##  Training 
##  MAE :  2482.624 
##  MAPE :  0.0658 
##  RMSE :  3061.851 
##  R-squared :  0.936 
## 
##  Testing 
##  MAE :  2476.665 
##  MAPE :  0.0657 
##  RMSE :  3057.646 
##  R-squared :  0.9364
evaluate(model.step.back, house_train, house_test, "Model Backward Elimination")
## Result of Model Backward Elimination 
## 
##  Training 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1 
## 
##  Testing 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1
evaluate(model.step.fwd, house_train, house_test, "Model Forward Selection")
## Result of Model Forward Selection 
## 
##  Training 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1 
## 
##  Testing 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1
evaluate(model.step.both, house_train, house_test, "Model Both (Back + Forward)")
## Result of Model Both (Back + Forward) 
## 
##  Training 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1 
## 
##  Testing 
##  MAE :  0 
##  MAPE :  0 
##  RMSE :  0 
##  R-squared :  1

To compare the performance of the models more conveniently, we can use the compare_performance function from the performance package. Let’s compare the performance of the four models we have created.

comparison <- compare_performance(model.all, model.corr, model.step.back, model.step.fwd, model.step.both)
cmp <- as.data.frame(comparison)

cmp[order(cmp$AIC),]

After conducting the comparison, model.step.back stands out as the preferred model. It matches the essentially perfect error metrics of model.all (near-zero MAE, MAPE, and RMSE, with an R-squared of 1) while using fewer predictors, which gives it the lowest AIC in the comparison above. Moreover, the RMSE scores on the training and test sets are nearly identical and negligible compared to the range of Prices in both datasets, so the model shows no signs of overfitting or underfitting, and its performance is satisfactory.

As a result, we will proceed with using model.step.back exclusively for further analysis and predictions.

3.2 Assumptions Checking

3.2.1 1. Linearity

The assumption of linearity assumes that a linear relationship exists between the predictors and the target variable, allowing our model to accurately describe it. One way to visually assess this is by plotting the residuals’ values between our observed data and the model.

plot(model.step.back$fitted.values, model.step.back$residuals, ylab = 'Residuals', xlab = 'Fitted values', main = 'Residuals vs. Fitted Values')
abline(h = 0, col = 'red')

Based on the visual inspection of the residuals plot, it is evident that the residuals are not randomly and evenly distributed along the horizontal zero line, indicating potential issues with the assumption of linearity.

3.2.2 2. Normality of Residuals

The normality assumption states that the error term in the relationship between predictor and target variables follows a normal distribution. To assess this assumption, we can conduct the Shapiro-Wilk test, with the following hypotheses:

  • H0: Residuals are normally distributed.
  • H1: Residuals are not normally distributed.
shapiro.test(head(model.step.back$residual, 5000))
## 
##  Shapiro-Wilk normality test
## 
## data:  head(model.step.back$residual, 5000)
## W = 0.0032913, p-value < 2.2e-16
# Perform a Normality Test
ols_test_normality(head(model.step.back$residual, 5000))
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.0033         0.0000 
## Kolmogorov-Smirnov        0.5032         0.0000 
## Cramer-von Mises        1666.6667        0.0000 
## Anderson-Darling        1924.9587        0.0000 
## -----------------------------------------------

The type of test to use depends on the number of observations. In general, if you have less than 50 observations, you should use the Shapiro-Wilk test. Otherwise, the Kolmogorov-Smirnov test is better.

The criteria to reject the null hypothesis, and therefore conclude that the data is not normally distributed, depends on the p-value. Typically, if the p-value is <0.05 we reject the null hypothesis.

Since our dataset contains more than 5,000 observations, we can use the Anderson-Darling normality test, which works for larger sample sizes.

ad.test(model.step.back$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model.step.back$residuals
## A = 152579, p-value < 2.2e-16

Our linear regression model is expected to have normally distributed errors, which means the majority of the errors should be concentrated around 0.

res <- data.frame(residual = model.step.back$residuals)

res %>% 
e_charts() %>% 
  e_histogram(residual, name = "Error", legend = F) %>% 
  e_title(
    text = "Normality of Residuals",
    textStyle = list(fontSize = 25),
    subtext = "Normally Distributed Error",
    subtextStyle = list(fontSize = 15, fontStyle = "italic"),
    left = "center") %>% 
  e_hide_grid_lines() %>% 
  e_tooltip() %>% 
  e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>% 
  e_y_axis(axisLabel = list(color = "#FAF7E6"))

As demonstrated above, our errors are overwhelmingly concentrated around 0. The formal tests above nevertheless reject normality (p-value < 0.05), which is unsurprising given the extremely small residuals and the very large sample size; based on the visual inspection, we treat the normality assumption as acceptable for practical purposes.

You might wonder why we do not run a Shapiro-Wilk test on the full residuals for this assumption. In R, the Shapiro-Wilk test is limited to 5,000 observations, while our dataset contains many more than that. Testing only part of the data would not fully represent the whole, so we mainly rely on the visualization above to assess this assumption.
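
If a formal test is still desired despite this limitation, one option (an illustrative workaround, not part of the original workflow) is to run the Shapiro-Wilk test on a random sample of 5,000 residuals rather than on the first 5,000 rows:

# Shapiro-Wilk on a random sample of residuals (the test accepts at most 5,000 values)
set.seed(100)
shapiro.test(sample(model.step.back$residuals, 5000))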

3.2.3 3. Homoscedasticity of Residuals

The homoscedasticity assumption asserts that the variance of the error term remains constant across all values of the predictors. To verify this assumption, we can conduct the Breusch-Pagan test, with the following hypotheses:

  • H0: the errors spread in a constant manner (homoscedasticity)
  • H1: the errors spread in a non-constant manner (heteroscedasticity)

By performing the Breusch-Pagan test, we can assess whether the model satisfies the homoscedasticity assumption and make necessary adjustments if the assumption is violated.

bptest(model.step.back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.step.back
## BP = 11.473, df = 13, p-value = 0.5713

To fulfill the homoscedasticity assumption, we need to fail to reject H0, which means the p-value should be greater than 0.05. Based on the result above, our model's p-value is 0.5713, which is larger than 0.05. This implies that the errors spread in a constant manner; hence the errors are homoscedastic and the homoscedasticity assumption is fulfilled.

3.2.4 4. No Multicollinearity

Multicollinearity is a condition in which there are strong correlations among our independent variables (X). We do not want highly correlated predictors, since they are redundant, and we would keep only one variable from each correlated group. To test this assumption, we will apply the Variance Inflation Factor (VIF) with the following rule of thumb:

  • VIF > 10: There’s multicollinearity.
  • VIF < 10: There’s no multicollinearity.

To fulfill this assumption, all of our VIF scores should be below 10, i.e. no multicollinearity is present.
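
For reference, the VIF of predictor j is computed from the R-squared obtained by regressing that predictor on all the other predictors; note that for categorical predictors with more than two levels, car::vif() reports a generalized VIF (GVIF) instead:

\[ VIF_j = \frac{1}{1 - R_j^2} \]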

vif(model.step.back)
##                   GVIF Df GVIF^(1/(2*Df))
## Area          1.000037  1        1.000018
## Garage        1.000037  1        1.000019
## FirePlace     1.000014  1        1.000007
## Baths         1.000029  1        1.000015
## White.Marble  1.334862  1        1.155362
## Indian.Marble 1.334858  1        1.155361
## Floors        1.000018  1        1.000009
## City          1.000071  2        1.000018
## Solar         1.000024  1        1.000012
## Electric      1.000021  1        1.000011
## Fiber         1.000038  1        1.000019
## Glass.Doors   1.000032  1        1.000016

Evidently, the VIF scores for all our independent variables (X) are below 10, which indicates that the no-multicollinearity assumption is fulfilled.

3.3 Model Interpretation & Insight

Now, let’s examine the coefficients of the model to understand the relationships between the predictor variables and the target variable in more detail.

summary(model.step.back)
## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.642e-05 -1.000e-10  1.000e-10  2.000e-10  2.384e-07 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     9.500e+03  3.343e-10  2.842e+13   <2e-16 ***
## Area            2.500e+01  9.203e-13  2.717e+13   <2e-16 ***
## Garage          1.500e+03  8.086e-11  1.855e+13   <2e-16 ***
## FirePlace       7.500e+02  4.672e-11  1.605e+13   <2e-16 ***
## Baths           1.250e+03  4.671e-11  2.676e+13   <2e-16 ***
## White.Marble1   9.000e+03  1.620e-10  5.555e+13   <2e-16 ***
## Indian.Marble1 -5.000e+03  1.618e-10 -3.091e+13   <2e-16 ***
## Floors          1.500e+04  1.321e-10  1.135e+14   <2e-16 ***
## City2           3.500e+03  1.620e-10  2.161e+13   <2e-16 ***
## City3           7.000e+03  1.619e-10  4.325e+13   <2e-16 ***
## Solar1          2.500e+02  1.321e-10  1.892e+12   <2e-16 ***
## Electric1       1.250e+03  1.321e-10  9.460e+12   <2e-16 ***
## Fiber1          1.175e+04  1.321e-10  8.892e+13   <2e-16 ***
## Glass.Doors1    4.450e+03  1.321e-10  3.368e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.179e-08 on 399986 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.583e+27 on 13 and 399986 DF,  p-value: < 2.2e-16

Interpretation 💡 Insight:

The model was trained on the original (unscaled) data. Looking at the variables, all have a positive relationship with Prices except Indian.Marble1. This means that, according to our model, every variable except Indian.Marble1 contributes positively to the price.

  • Coefficients: the fitted model is \[Prices = 9500 + 25*Area + 1500*Garage + 750*FirePlace + 1250*Baths + 9000*White.Marble1 - 5000*Indian.Marble1 + 15000*Floors + 3500*City2 + 7000*City3 + 250*Solar1 + 1250*Electric1 + 11750*Fiber1 + 4450*Glass.Doors1\]

    • Intercept: the predicted price is 9,500 when all numeric predictors are zero and all categorical predictors are at their reference levels.

    • Numeric coefficients: Prices increase (positive coefficient) or decrease (negative coefficient) by the coefficient value for every one-unit increase in the variable, holding the other variables constant.

  • Coefficient interpretation for categorical predictors:

    • White.Marble1, Indian.Marble1, Solar1, Electric1, Fiber1, Glass.Doors1

      • Prices increase (positive coefficient) or decrease (negative coefficient) by the coefficient value when the corresponding feature equals 1, i.e. it is present in the house/property, holding the other variables constant.
    • City2, City3

      • Prices increase by the coefficient value when the house is located in City 2 or City 3, respectively, relative to City 1, holding the other variables constant.
  • Adjusted R-squared: 1, indicating that the model explains essentially 100% of the variation in Prices.

Next, let's extract the coefficients from the model and create a plot to visualize them for easier interpretation.

coef.all <- model.all$coefficients[-1] 
# coef.all is a named vector, so barplot() uses its names as bar labels
barplot(coef.all, main = 'Model `model.all` coefficients', ylab = 'Value', col = "deepskyblue", las = 2)

Looking at the chart, it is evident that Floors, Fiber1, and White.Marble1 have the largest coefficients and therefore the strongest influence on Prices in our regression model.

3.4 Predicting

We will predict the unseen data (the test data) using our selected model. To see the comparison between the actual value and the prediction value, you can refer to the table below:

pred_model <- predict(model.step.back, newdata = house_test) 
pred_comp <- data.frame(Actual.Value = house_test$Prices, Prediction.Value = pred_model) 

pred_comp
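
To illustrate how the selected model can be used in practice, the sketch below scores a single hypothetical house; all feature values here are made up for illustration, and the factor levels must match those used during training:

# Score one hypothetical new house with model.step.back
new_house <- data.frame(
  Area = 150, Garage = 2, FirePlace = 1, Baths = 2,
  White.Marble = factor(1, levels = c(0, 1)),
  Indian.Marble = factor(0, levels = c(0, 1)),
  Floors = 1,
  City = factor(2, levels = c(1, 2, 3)),
  Solar = factor(1, levels = c(0, 1)),
  Electric = factor(1, levels = c(0, 1)),
  Fiber = factor(1, levels = c(0, 1)),
  Glass.Doors = factor(1, levels = c(0, 1))
)

predict(model.step.back, newdata = new_house)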

4 Conclusion

From the analysis above, it is evident that our independent variables prove useful for predicting house prices. The final model is:

\[Prices = 9500 + 25*Area + 1500*Garage + 750*FirePlace + 1250*Baths + 9000*White.Marble1 - 5000*Indian.Marble1 + 15000*Floors + 3500*City2 + 7000*City3 + 250*Solar1 + 1250*Electric1 + 11750*Fiber1 + 4450*Glass.Doors1\]

In addition, our model satisfies the classical assumptions of Normality, Homoscedasticity, and No Multicollinearity.

Moreover, the R-squared of our model is high, with approximately 100% of the observed variation explained by the model's inputs, and the model's accuracy (our RMSE scores) in predicting house prices is very good, with no indication of overfitting or underfitting. Therefore, this model is well suited to our prediction task.