Most of us have gone through the search for a new house to buy. The journey quickly becomes complicated: avoiding fraud, negotiating deals, researching local areas, and so on.
Thousands of houses are sold every day, and every buyer asks the same questions: What is this house actually worth? Am I paying a fair price? In this project, a machine learning model is proposed to predict a house's price based on data about the house (its size, the year it was built, and so on).
House price prediction is a valuable tool for buyers in the real estate market, providing data-driven insights to aid their decision-making process. With accurate predictions, buyers can plan their budgets more effectively, assess property values, and identify potential investment opportunities. They can compare properties in different neighborhoods, leverage market timing, and mitigate risks associated with property purchases. Armed with data-backed predictions, buyers gain confidence in their decisions, avoid buyer’s remorse, and save time during the property search. While predictions offer valuable estimates, consulting with real estate professionals is essential for personalized advice and guidance. By combining data-driven insights and expert assistance, buyers can navigate the real estate market confidently and find their ideal home at the right price.
In this project, we will utilize a multiple regression model to predict house prices. We will also conduct correlation analysis between the predictor variables and the target variable. The dataset of choice is the HousePricing dataset obtained from Kaggle, which will serve as the fundamental data source for our data-driven predictions.
Before diving into the project, we’ll need to import the required libraries to facilitate our analysis.
library(GGally) # Correlation matrix with ggcorr
library(MLmetrics) # Compare regression results (mape, mae, R2_score)
library(car) # Calculate VIF
library(caret) # pre-process and scale the data
library(tidyverse) # For data processing and visualization
library(performance) # model compare_performance
library(lmtest) # Breusch-Pagan test
library(glue) # Pretty printing
library(plotly) # Create interactive plots
library(heatmaply) # Create heatmap plots
library(ggcorrplot) # Create correlation heatmap plots
library(echarts4r) # Make charts interactive
library(olsrr) # Normality Test
library(nortest) # Normality Test
Then, import the data.
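The import code itself does not appear in the output below; a minimal sketch follows, assuming the Kaggle file was downloaded as a local CSV (the file name here is illustrative):
house_raw <- read.csv("house_pricing.csv") # hypothetical file name for the Kaggle download
dim(house_raw)       # dimensions of the data
glimpse(house_raw)   # column types and first values
summary(house_raw)   # five-number summaries per column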
## [1] 500000 16
## Rows: 500,000
## Columns: 16
## $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
## Area Garage FirePlace Baths
## Min. : 1.0 Min. :1.000 Min. :0.000 Min. :1.000
## 1st Qu.: 63.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :125.0 Median :2.000 Median :2.000 Median :3.000
## Mean :124.9 Mean :2.001 Mean :2.003 Mean :2.998
## 3rd Qu.:187.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
## White.Marble Black.Marble Indian.Marble Floors
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.333 Mean :0.3327 Mean :0.3343 Mean :0.4994
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## City Solar Electric Fiber
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :2.001 Mean :0.4987 Mean :0.5007 Mean :0.5005
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Glass.Doors Swiming.Pool Garden Prices
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 7725
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:33500
## Median :0.0000 Median :1.0000 Median :1.0000 Median :41850
## Mean :0.4999 Mean :0.5004 Mean :0.5016 Mean :42050
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:50750
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :77975
The dataset contains 500,000 rows and 16 columns. The target variable is “Prices,” while the remaining variables are considered predictor variables.
The observation data consists of the following variables:
Area: Property area size (numeric)
Garage: Number of garages in the property (numeric)
FirePlace: Number of fireplaces in the property (numeric)
Baths: Number of bathrooms in the property (numeric)
White.Marble: Whether the property uses white marble flooring (factor)
Black.Marble: Whether the property uses black marble flooring (factor)
Indian.Marble: Whether the property uses Indian marble flooring (factor)
Floors: Number of floors in the property (numeric)
City: City where the property is located (factor)
Solar: Whether the property uses solar energy (factor)
Electric: Whether the property uses electricity (factor)
Fiber: Whether the property has access to fiber optic (factor)
Glass.Doors: Whether the property uses glass doors (factor)
Swiming.Pool: Whether the property has a swimming pool (factor)
Garden: Whether the property has a garden (factor)
Prices: House price (numeric)
Preprocessing plays a pivotal role in data analysis, encompassing vital steps such as cleaning and transforming the data to ensure its quality and suitability for subsequent analysis and modeling.
Before delving into our Exploratory Data Analysis, we will thoroughly examine our dataset to identify and address any anomalies present. We understand that the quality of insights is directly tied to the integrity of the underlying data. Hence, ensuring that the data is clean and in a usable form is of paramount importance.
From the results above, it is evident that several variables have inappropriate data types. Therefore, we will rectify them as follows:
house <- house_raw %>%
mutate_at(vars(White.Marble, Black.Marble, Indian.Marble,
City, Solar, Electric, Fiber, Glass.Doors,
Swiming.Pool, Garden),as.factor)
glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
The first step is to check for and handle any missing values in the dataset.
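The check shown below was presumably produced with a column-wise count of missing values, for example:
colSums(is.na(house)) # number of NA values per column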
## Area Garage FirePlace Baths White.Marble
## 0 0 0 0 0
## Black.Marble Indian.Marble Floors City Solar
## 0 0 0 0 0
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 0 0 0 0 0
## Prices
## 0
Fortunately, our dataset does not contain any missing values, which is beneficial for our analysis. In case there were missing values, we would need to make a decision on whether to impute the missing data or exclude the observations with missing values from the analysis.
Exploratory data analysis (EDA) entails utilizing graphics and visualizations to thoroughly explore and analyze a dataset. The primary objective is to investigate and gain insights into the data variables present, facilitating a comprehensive understanding of the dataset’s characteristics.
From the density graph, we can observe the distribution of each variable. Data distribution refers to the pattern or arrangement of data values in a dataset. It is a fundamental aspect of data analysis as it provides essential information about how the data is spread across different values. Understanding the distribution of data is crucial for making informed decisions, choosing appropriate statistical methods, and gaining insights into the underlying phenomena represented by the data.
num_data <- house %>%
select_if(is.numeric)
house_long <- num_data %>%
pivot_longer(colnames(num_data)) %>%
as.data.frame()
ggp1 <- ggplot(house_long, aes(x = value, fill=name)) +
geom_density() +
theme(legend.position = "none") +
facet_wrap(~ name, scales = "free") +
labs(
title = 'Density graph of parameters',
x = 'Parameters',
y = 'Frequency'
)
ggp1
house %>%
select_if(is.numeric) %>%
pivot_longer(everything(), cols_vary = "slowest", names_to = 'params') %>%
ggplot(aes(x = value, fill=params)) +
geom_boxplot() +
theme(legend.position = "none") +
facet_wrap(~params, scales = 'free_x') +
labs(
title = 'Box Plot of parameters',
x = 'Parameters',
y = 'Frequency'
)
From the box plot visualization above, it can be concluded that most variables contain no outliers, and the majority appear roughly normally distributed; however, a few outliers are observed in the Prices variable.
fac_data <- house %>%
select_if(is.factor)
lvls <- unique(unlist(fac_data %>% select(-City)))
t(sapply(fac_data %>% select(-City), function(x) table(factor(x, levels = lvls))))
## 0 1
## White.Marble 333504 166496
## Black.Marble 333655 166345
## Indian.Marble 332841 167159
## Solar 250653 249347
## Electric 249675 250325
## Fiber 249766 250234
## Glass.Doors 250065 249935
## Swiming.Pool 249782 250218
## Garden 249177 250823
##
## 1 2 3
## 166314 166902 166784
fac_data <- house %>%
select_if(is.factor)
house_long <- fac_data %>%
pivot_longer(colnames(fac_data)) %>%
as.data.frame()
gg_box1 <- house_long %>%
filter(name %in% c("White.Marble", "Black.Marble", "Indian.Marble")) %>%
ggplot(aes(x = value, fill=name)) +
geom_bar() +
theme(legend.position = "none") +
facet_wrap(~ name, scales = "free") +
labs(
title = 'Bar graph of parameters',
x = 'Parameters',
y = 'Frequency'
)
gg_box1
gg_box2 <- house_long %>%
filter(name %in% c("Solar", "Electric", "Fiber")) %>%
ggplot(aes(x = value, fill=name)) +
geom_bar() +
theme(legend.position = "none") +
facet_wrap(~ name, scales = "free") +
labs(
title = 'Bar graph of parameters',
x = 'Parameters',
y = 'Frequency'
)
gg_box2
gg_box3 <- house_long %>%
filter(name %in% c("Glass.Doors", "Swiming.Pool", "Garden", "City")) %>%
ggplot(aes(x = value, fill=name)) +
geom_bar() +
theme(legend.position = "none") +
facet_wrap(~ name, scales = "free") +
labs(
title = 'Bar graph of parameters',
x = 'Parameters',
y = 'Frequency'
)
gg_box3
It is evident that for the White.Marble, Black.Marble, and Indian.Marble variables, the most frequent value is 0, indicating that most properties do not use these types of marble flooring.
As for the utilities and amenities, the proportions are relatively equal for Solar, Electric, and Fiber. A similarly balanced distribution is also observed for the Glass.Doors, Swiming.Pool, Garden, and City variables.
Next, let's look at the trend in our data for each predictor.
## [1] "Area" "Garage" "FirePlace" "Baths"
## [5] "White.Marble" "Black.Marble" "Indian.Marble" "Floors"
## [9] "City" "Solar" "Electric" "Fiber"
## [13] "Glass.Doors" "Swiming.Pool" "Garden" "Prices"
house_raw %>%
head(1000) %>%
select(c("Prices", "Area", "Garage", "FirePlace", "Baths", "White.Marble", "Black.Marble", "Indian.Marble")) %>%
pivot_longer(cols = -Prices, names_to = 'param') %>%
ggplot(aes(x = value, y = Prices, colour = param)) +
geom_point(size = 0.2) +
geom_smooth(formula = y ~ x, method = "loess") +
theme(axis.text.x = element_text(hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none") +
facet_wrap(~param, scales = 'free_x') +
labs(
title = 'Trends in each parameter',
x = 'Parameters',
y = 'Values'
)
house_raw %>%
head(1000) %>%
select(c("Floors", "City", "Solar", "Electric", "Fiber", "Glass.Doors",
"Swiming.Pool", "Garden", "Prices" )) %>%
pivot_longer(cols = -Prices, names_to = 'param') %>%
ggplot(aes(x = value, y = Prices, colour = param)) +
geom_point(size = 0.2) +
geom_smooth(formula = y ~ x, method = "loess") +
theme(axis.text.x = element_text(hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none") +
facet_wrap(~param, scales = 'free_x') +
labs(
title = 'Trends in each parameter',
x = 'Parameters',
y = 'Values'
)
Correlation among predictors is a crucial consideration in linear regression. High correlation between predictors can lead to multicollinearity, affecting the accuracy of the regression model. We can use the ggcorrplot() function from the ggcorrplot package to visualize and assess the correlation between predictors.
In this section, we will be analyzing the Pearson Correlation between all of our variables.
num_data <- house_raw %>% select_if(is.numeric)
# Compute a correlation matrix
corr <- round(cor(num_data), 2)
# Visualize the lower triangle of the correlation matrix
# Barring the no significant coefficient
corr.plot <- ggcorrplot(corr, type = "lower", lab = TRUE, p.mat = NULL,
lab_size = 2, pch = 1, pch.col = "black", pch.cex =1, tl.cex = 6)
# ggplotly(corr.plot)
corr.plot
Correlation values closer to 1 or -1 indicate a stronger relationship between variables.
Since we aim to predict the prices of houses, we need to identify
variables that exhibit a strong correlation with the Price.
Based on the plot above, the following variables show notable
correlations with Price: White Marble,
Indian Marble, Floors, Fiber,
City, and Glass Doors.
Upon revisiting the plot, we observe instances of
Multicollinearity between the
White.Marble, Black.Marble, and
Indian.Marble variables. To address this concern, we need
to examine their respective values to determine if the types are as
intended.
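The level checks below were presumably produced with one unique() call per marble column (the exact calls and their order are assumed):
unique(house$White.Marble)
unique(house$Indian.Marble)
unique(house$Black.Marble)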
## [1] 0 1
## Levels: 0 1
## [1] 0 1
## Levels: 0 1
## [1] 1 0
## Levels: 0 1
Since the values are either 1 or 0, we will change their data type to factor (categorical) and test their association with Pearson's Chi-squared test (note: this test applies to categorical data only).
# Change the data types first
white.marble <- as.factor(house$White.Marble)
indian.marble <- as.factor(house$Indian.Marble)
black.marble <- as.factor(house$Black.Marble)
# Pearson's Chi-squared test
chisq.test(white.marble, indian.marble)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: white.marble and indian.marble
## X-squared = 125360, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: white.marble and black.marble
## X-squared = 124445, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: indian.marble and black.marble
## X-squared = 125189, df = 1, p-value < 2.2e-16
From the results above, it is evident that these three variables are significantly associated with one another, as indicated by p-values below 0.05, suggesting the presence of multicollinearity among them. To address this issue, we will eliminate only the one variable that exhibits the weakest correlation with our target variable, which is Black.Marble.
We will examine the distribution of our data by generating boxplots, which will provide insights into the central tendency, spread, and potential outliers of the variables.
In data analysis, identifying outliers, that is, observations that fall well outside the overall pattern of the data, is very important. As a diagnostic tool for spotting potential outliers, we can use a boxplot, which is based on the five-number summary and provides a graphical display of the center and variation of a dataset.
house %>%
select_if(is.numeric) %>%
pivot_longer(everything(), cols_vary = "slowest", names_to = 'params') %>%
ggplot(aes(x = value, fill=params)) +
geom_boxplot() +
theme(legend.position = "none") +
facet_wrap(~params, scales = 'free_x') +
labs(
title = 'Box plot of parameters',
x = 'Parameters',
y = 'Frequency'
)
As demonstrated by the boxplot above, there are a handful of outliers in Prices that are not far removed from the other data points. Furthermore, the median is centered within the box, suggesting that the target variable is roughly normally distributed.
We use the boxplot() function to detect which values are outliers. Here is the list of outlier data for each column:
cat("Area :",ifelse(length(boxplot(house$Area, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))## Area : No Outlier
cat("\nGarage :",ifelse(length(boxplot(house$Garage, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))##
## Garage : No Outlier
cat("\nFirePlace :",ifelse(length(boxplot(house$FirePlace, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))##
## FirePlace : No Outlier
cat("\nBaths :",ifelse(length(boxplot(house$Baths, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))##
## Baths : No Outlier
cat("\nFloors :",ifelse(length(boxplot(house$Floors, plot = F)$out) != 0, boxplot(house$Area, plot = F)$out, "No Outlier"))##
## Floors : No Outlier
##
## Prices : 76975 77225 77000 77175 77375 77525 76825 77700 76950 77075 77250 77975 76750 77225 76775 76800
We’ll use the filter() function to identify and later
remove the outliers from the dataset using IQR.
columns <- colnames(house[,sapply(house,is.numeric)])
house.outliers <- data.frame()
for (col in columns) {
# get all outlier data each columns
cat(col, "\n")
outlier <- boxplot(house[col], plot = FALSE)$out
outliers <- house %>% filter(.[[col]] %in% outlier)
cat(nrow(outliers), "\n\n")
house.outliers <- rbind(house.outliers, outliers)
}
## Area
## 0
##
## Garage
## 0
##
## FirePlace
## 0
##
## Baths
## 0
##
## Floors
## 0
##
## Prices
## 16
To more objectively determine if these outliers are significantly different, we will conduct a T-test.
First, we take the data excluding the outliers.
# create detect outlier function
detect_outlier <- function(x) {
# calculate first quantile
Quantile1 <- quantile(x, probs=.25)
# calculate third quantile
Quantile3 <- quantile(x, probs=.75)
# calculate inter quartile range
IQR = Quantile3-Quantile1
# return true or false
x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}
# create remove outlier function
remove_outlier <- function(dataframe,
columns=names(dataframe)) {
# for loop to traverse in columns vector
for (col in columns) {
# remove observation if it satisfies outlier function
dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
}
dataframe
}
house.no_outliers <- remove_outlier(house, colnames(house[,sapply(house,is.numeric)]))
house.no_outliers %>% head()
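The Welch two-sample t-test reported below compares Prices between the two groups; the call is not shown in the original output, but it was presumably along these lines:
t.test(house.no_outliers$Prices, house.outliers$Prices)
##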
## Welch Two Sample t-test
##
## data: house.no_outliers$Prices and house.outliers$Prices
## t = -395.51, df = 16.182, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35301.52 -34925.45
## sample estimates:
## mean of x mean of y
## 42049.02 77162.50
The t-test yields a p-value < 0.05, suggesting that there is a significant difference between the outliers and the rest of the observations. A p-value below 0.05 leads to the rejection of the null hypothesis, indicating that the two groups are not statistically similar to each other. However, since the outliers only exist in the Prices column and their quantity is minimal, they will not be dropped.
The purpose of bivariate analysis is to understand the relationship between two variables: X (independent/explanatory variable) and Y (dependent/outcome variable). In this case, we have selected our predictors, and since the selected predictors are categorical (1 for present and 0 for not present), we will use boxplots to visualize their relationship with the target variable.
We can conclude that Prices tend to be higher if the house has White Marble.
We can conclude that Prices tend to be lower if the house has Indian Marble.
house %>%
mutate(Floors = as.factor(Floors)) %>%
ggplot(aes(x=Floors, y=Prices, fill=Floors)) +
geom_boxplot()
We can conclude that Prices tend to be higher if the house has an additional floor (Floors = 1).
We can conclude that Prices tend to be higher if the house has Fiber.
We can conclude that Prices tend to be higher if the house has Glass Doors.
As a preliminary step before diving into the modeling process, we will conduct Cross Validation on our dataset. This procedure is crucial as it enables us to evaluate the effectiveness of machine learning algorithms when making predictions on new, unseen data. Specifically, we will allocate 80% of our dataset for training the model and reserve the remaining 20% for testing the predictions, allowing us to gauge the model’s performance accurately.
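The split itself is not shown in the output; a minimal sketch is given below (the seed and the exact sampling approach are assumptions). Black.Marble is also dropped here, as decided in the multicollinearity discussion above.
# 80/20 train-test split (sketch; seed value is illustrative)
set.seed(123)
house <- house %>% select(-Black.Marble)          # drop the redundant marble indicator
idx <- sample(nrow(house), size = 0.8 * nrow(house))
house_train <- house[idx, ]                       # 400,000 rows for training
house_test  <- house[-idx, ]                      # 100,000 rows for testing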
Since we have a numeric target variable and a set of informative predictors, it is appropriate to construct a linear regression model. We will now proceed with building the model and evaluating its fit to our data. After the data has been split, the next step involves training our models on the house_train dataset.
Let’s start with a minimal model, where we have no predictors, and the target variable is used as its own predictor. In this case, the model will simply output the mean of the target variable, as this statistically minimizes the error.
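The intercept-only model shown below is fitted with a formula containing only the constant term; presumably:
model.none <- lm(Prices ~ 1, data = house_train)  # intercept-only baseline
summary(model.none)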
##
## Call:
## lm(formula = Prices ~ 1, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34324 -8549 -199 8701 35926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42048.54 19.14 2196 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12110 on 399999 degrees of freedom
Now, we will construct the maximal model, utilizing all the parameters in our dataset. This model incorporates all available predictors to explore their potential impact on the target variable.
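The maximal model uses the dot notation to include every remaining column as a predictor; presumably:
model.all <- lm(Prices ~ ., data = house_train)   # all available predictors
summary(model.all)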
##
## Call:
## lm(formula = Prices ~ ., data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.642e-05 -1.000e-10 1.000e-10 2.000e-10 2.384e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.500e+03 3.468e-10 2.739e+13 <2e-16 ***
## Area 2.500e+01 9.203e-13 2.717e+13 <2e-16 ***
## Garage 1.500e+03 8.086e-11 1.855e+13 <2e-16 ***
## FirePlace 7.500e+02 4.672e-11 1.605e+13 <2e-16 ***
## Baths 1.250e+03 4.671e-11 2.676e+13 <2e-16 ***
## White.Marble1 9.000e+03 1.620e-10 5.555e+13 <2e-16 ***
## Indian.Marble1 -5.000e+03 1.618e-10 -3.091e+13 <2e-16 ***
## Floors 1.500e+04 1.321e-10 1.135e+14 <2e-16 ***
## City2 3.500e+03 1.620e-10 2.161e+13 <2e-16 ***
## City3 7.000e+03 1.619e-10 4.325e+13 <2e-16 ***
## Solar1 2.500e+02 1.321e-10 1.892e+12 <2e-16 ***
## Electric1 1.250e+03 1.321e-10 9.460e+12 <2e-16 ***
## Fiber1 1.175e+04 1.321e-10 8.892e+13 <2e-16 ***
## Glass.Doors1 4.450e+03 1.321e-10 3.368e+13 <2e-16 ***
## Swiming.Pool1 -1.338e-10 1.321e-10 -1.013e+00 0.311
## Garden1 1.311e-10 1.321e-10 9.920e-01 0.321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.179e-08 on 399984 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.239e+27 on 15 and 399984 DF, p-value: < 2.2e-16
As the maximal model uses all the data, one might logically expect it to be the best model. However, this may not always hold true. To verify the optimal model, we will conduct stepwise regression.
We have created model.none, which uses no predictors, and model.all, which uses all variables. Stepwise regression employs the Akaike Information Criterion (AIC) as the metric to select the optimal model. The method aims to minimize the AIC, representing the least information loss. Although stepwise regression employs a greedy algorithm to search for a local minimum, it does not guarantee the absolute best model. We will proceed with stepwise regression to identify the important variables for our model.
For the independent variables (X), we will use White.Marble, Indian.Marble, City, Floors, Fiber, and Glass.Doors, since they exhibit high correlation with Prices.
model.corr <- lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + City + Glass.Doors,
data = house_train)
summary(model.corr)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble +
## City + Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9367.1 -2156.3 -0.5 2147.9 9354.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21634.874 13.718 1577.1 <2e-16 ***
## Floors 14992.472 9.683 1548.4 <2e-16 ***
## Fiber1 11754.130 9.683 1213.9 <2e-16 ***
## White.Marble1 9023.379 11.871 760.1 <2e-16 ***
## Indian.Marble1 -4998.393 11.853 -421.7 <2e-16 ***
## City2 3490.648 11.868 294.1 <2e-16 ***
## City3 6981.662 11.860 588.7 <2e-16 ***
## Glass.Doors1 4429.734 9.683 457.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3062 on 399992 degrees of freedom
## Multiple R-squared: 0.936, Adjusted R-squared: 0.936
## F-statistic: 8.364e+05 on 7 and 399992 DF, p-value: < 2.2e-16
R2 = 0.936, indicating that approximately 93.6% of the observed variation can be explained by the model's inputs. Furthermore, all variables have p-values below 0.05, signifying that they are all significantly associated with our dependent variable (Y).
model.step.back <- step(model.all, scope = formula(model.none), direction = 'backward', trace = 0)
summary(model.step.back)
##
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.642e-05 -1.000e-10 1.000e-10 2.000e-10 2.384e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.500e+03 3.343e-10 2.842e+13 <2e-16 ***
## Area 2.500e+01 9.203e-13 2.717e+13 <2e-16 ***
## Garage 1.500e+03 8.086e-11 1.855e+13 <2e-16 ***
## FirePlace 7.500e+02 4.672e-11 1.605e+13 <2e-16 ***
## Baths 1.250e+03 4.671e-11 2.676e+13 <2e-16 ***
## White.Marble1 9.000e+03 1.620e-10 5.555e+13 <2e-16 ***
## Indian.Marble1 -5.000e+03 1.618e-10 -3.091e+13 <2e-16 ***
## Floors 1.500e+04 1.321e-10 1.135e+14 <2e-16 ***
## City2 3.500e+03 1.620e-10 2.161e+13 <2e-16 ***
## City3 7.000e+03 1.619e-10 4.325e+13 <2e-16 ***
## Solar1 2.500e+02 1.321e-10 1.892e+12 <2e-16 ***
## Electric1 1.250e+03 1.321e-10 9.460e+12 <2e-16 ***
## Fiber1 1.175e+04 1.321e-10 8.892e+13 <2e-16 ***
## Glass.Doors1 4.450e+03 1.321e-10 3.368e+13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.179e-08 on 399986 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.583e+27 on 13 and 399986 DF, p-value: < 2.2e-16
The model.step.back model uses only 13 of the 15 variables.
model.step.fwd <- step(model.none, scope = list(upper = model.all), direction = 'forward', trace = F)
summary(model.step.fwd)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
## Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace +
## Electric + Solar, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.643e-05 -1.000e-10 1.000e-10 2.000e-10 1.362e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.500e+03 4.388e-10 2.165e+13 <2e-16 ***
## Floors 1.500e+04 1.735e-10 8.646e+13 <2e-16 ***
## Fiber1 1.175e+04 1.735e-10 6.773e+13 <2e-16 ***
## White.Marble1 9.000e+03 2.127e-10 4.231e+13 <2e-16 ***
## City2 3.500e+03 2.126e-10 1.646e+13 <2e-16 ***
## City3 7.000e+03 2.125e-10 3.294e+13 <2e-16 ***
## Glass.Doors1 4.450e+03 1.735e-10 2.565e+13 <2e-16 ***
## Indian.Marble1 -5.000e+03 2.124e-10 -2.354e+13 <2e-16 ***
## Area 2.500e+01 1.208e-12 2.069e+13 <2e-16 ***
## Baths 1.250e+03 6.132e-11 2.038e+13 <2e-16 ***
## Garage 1.500e+03 1.062e-10 1.413e+13 <2e-16 ***
## FirePlace 7.500e+02 6.134e-11 1.223e+13 <2e-16 ***
## Electric1 1.250e+03 1.735e-10 7.205e+12 <2e-16 ***
## Solar1 2.500e+02 1.735e-10 1.441e+12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.486e-08 on 399986 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.499e+27 on 13 and 399986 DF, p-value: < 2.2e-16
The model.step.fwd model uses only 13 of the 15 variables.
model.step.both <- step(model.none, scope = list(upper = model.all), direction = 'both', trace = F)
summary(model.step.both)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
## Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace +
## Electric + Solar, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.643e-05 -1.000e-10 1.000e-10 2.000e-10 1.362e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.500e+03 4.388e-10 2.165e+13 <2e-16 ***
## Floors 1.500e+04 1.735e-10 8.646e+13 <2e-16 ***
## Fiber1 1.175e+04 1.735e-10 6.773e+13 <2e-16 ***
## White.Marble1 9.000e+03 2.127e-10 4.231e+13 <2e-16 ***
## City2 3.500e+03 2.126e-10 1.646e+13 <2e-16 ***
## City3 7.000e+03 2.125e-10 3.294e+13 <2e-16 ***
## Glass.Doors1 4.450e+03 1.735e-10 2.565e+13 <2e-16 ***
## Indian.Marble1 -5.000e+03 2.124e-10 -2.354e+13 <2e-16 ***
## Area 2.500e+01 1.208e-12 2.069e+13 <2e-16 ***
## Baths 1.250e+03 6.132e-11 2.038e+13 <2e-16 ***
## Garage 1.500e+03 1.062e-10 1.413e+13 <2e-16 ***
## FirePlace 7.500e+02 6.134e-11 1.223e+13 <2e-16 ***
## Electric1 1.250e+03 1.735e-10 7.205e+12 <2e-16 ***
## Solar1 2.500e+02 1.735e-10 1.441e+12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.486e-08 on 399986 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.499e+27 on 13 and 399986 DF, p-value: < 2.2e-16
The model.step.both model uses only 13 of the 15 variables.
However, evaluating our model’s performance based solely on those values and comparing them would be challenging. In this section, we will introduce you to the RMSE (Root Mean Squared Error) calculation, which measures our model’s performance as follows:
\[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \]
RMSE is one of the most popular measures used to estimate the accuracy of our forecasting model’s predicted values in comparison to the actual or observed values during the training of regression models.
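For reference, the same quantity can be computed by hand in a couple of lines (the evaluate() helper below uses MLmetrics::RMSE instead):
# Minimal hand-rolled RMSE, matching the formula above
rmse_manual <- function(y_pred, y_true) sqrt(mean((y_pred - y_true)^2))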
Model Function
evaluate <- function(model, data_train, data_test, name) {
# Evaluating Training
result.train <- model$fitted.values
result.train.mae <- MAE(result.train, data_train$Prices)
result.train.mape <- MAPE(result.train, data_train$Prices)
result.train.rmse <- RMSE(result.train, data_train$Prices)
result.train.r2 <- R2_Score(result.train, data_train$Prices)
# Evaluating Testing
result.test <- predict(model, data_test)
result.test.mae <- MAE(result.test, data_test$Prices)
result.test.mape <- MAPE(result.test, data_test$Prices)
result.test.rmse <- RMSE(result.test, data_test$Prices)
result.test.r2 <- R2_Score(result.test, data_test$Prices)
cat('Result of',name,'\n\n',
'Training', '\n',
'MAE : ', round(result.train.mae, 4), '\n',
'MAPE : ', round(result.train.mape, 4), '\n',
'RMSE : ', round(result.train.rmse, 4), '\n',
'R-squared : ', round(result.train.r2, 4), '\n\n',
'Testing', '\n',
'MAE : ', round(result.test.mae, 4), '\n',
'MAPE : ', round(result.test.mape, 4), '\n',
'RMSE : ', round(result.test.rmse, 4), '\n',
'R-squared : ', round(result.test.r2, 4), '\n')
}
Evaluating model.all
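The call producing the output below is presumably:
evaluate(model.all, house_train, house_test, "Model All")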
## Result of Model All
##
## Training
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
##
## Testing
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
Next, let's compare the performance of model.all with model.corr, model.step.back, model.step.fwd, and model.step.both. This comparison will help us understand which model performs better in predicting house prices.
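The corresponding evaluation calls are presumably:
evaluate(model.corr, house_train, house_test, "Model Feature Selection (Pearson Corr)")
evaluate(model.step.back, house_train, house_test, "Model Backward Elimination")
evaluate(model.step.fwd, house_train, house_test, "Model Forward Selection")
evaluate(model.step.both, house_train, house_test, "Model Both (Back + Forward)")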
## Result of Model Feature Selection (Pearson Corr)
##
## Training
## MAE : 2482.624
## MAPE : 0.0658
## RMSE : 3061.851
## R-squared : 0.936
##
## Testing
## MAE : 2476.665
## MAPE : 0.0657
## RMSE : 3057.646
## R-squared : 0.9364
## Result of Model Backward Elimination
##
## Training
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
##
## Testing
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
## Result of Model Forward Selection
##
## Training
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
##
## Testing
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
## Result of Model Both (Back + Forward)
##
## Training
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
##
## Testing
## MAE : 0
## MAPE : 0
## RMSE : 0
## R-squared : 1
To compare the performance of the models more conveniently, we can
use the compare_performance function from the
performance package. Let’s compare the performance of the
four models we have created.
comparison <- compare_performance(model.all, model.corr, model.step.back, model.step.fwd, model.step.both)
cmp <- as.data.frame(comparison)
cmp[order(cmp$AIC),]
After conducting a thorough comparison, it becomes evident that
model.step.back stands out as the best performer among all
models. It demonstrates superior predictive capabilities, boasting lower
MAE, MAPE, and RMSE values, along
with a higher R-squared value. Moreover, the similarity
between the RMSE scores of both test and train datasets is noticeable,
and these scores are significantly smaller compared to the range of
values in both datasets. This indicates that our model shows no signs of
Overfitting or
Underfitting, and its performance is already
deemed satisfactory.
As a result, we will proceed with using model.step.back
exclusively for further analysis and predictions.
The linearity assumption states that a linear relationship exists between the predictors and the target variable, allowing our model to describe it accurately. One way to visually assess this is by plotting the model's residuals against its fitted values.
plot(model.step.back$fitted.values, model.step.back$residuals, ylab = 'Residuals', xlab = 'Prediction', main = 'Test')
abline(h = 0, col = 'red')
Based on a visual inspection of the residuals plot, the residuals are not randomly and evenly distributed along the horizontal zero line, indicating potential issues with the linearity assumption.
The normality assumption states that the error term in the relationship between the predictors and the target variable follows a normal distribution. To assess this assumption, we can conduct the Shapiro-Wilk test, with the following hypotheses: H0: the residuals are normally distributed; H1: the residuals are not normally distributed.
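The calls behind the output below are presumably as follows; shapiro.test() accepts at most 5,000 observations, so only the first 5,000 residuals are tested, and the multi-test table is in the style of olsrr::ols_test_normality():
shapiro.test(head(model.step.back$residuals, 5000))
ols_test_normality(head(model.step.back$residuals, 5000)) # assumed to use the same subset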
##
## Shapiro-Wilk normality test
##
## data: head(model.step.back$residual, 5000)
## W = 0.0032913, p-value < 2.2e-16
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.0033 0.0000
## Kolmogorov-Smirnov 0.5032 0.0000
## Cramer-von Mises 1666.6667 0.0000
## Anderson-Darling 1924.9587 0.0000
## -----------------------------------------------
The type of test to use depends on the number of observations. In general, if you have fewer than 50 observations, you should use the Shapiro-Wilk test; otherwise, the Kolmogorov-Smirnov test is preferable.
The criterion for rejecting the null hypothesis, and therefore concluding that the data is not normally distributed, depends on the p-value. Typically, if the p-value is < 0.05, we reject the null hypothesis.
Since our data contains more than 5,000 observations, we can use the Anderson-Darling normality test, which works for larger sample sizes.
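The Anderson-Darling test on the full set of residuals is available through the nortest package; the call was presumably:
ad.test(model.step.back$residuals)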
##
## Anderson-Darling normality test
##
## data: model.step.back$residuals
## A = 152579, p-value < 2.2e-16
Our linear regression model is expected to have normally distributed errors, which means the majority of the errors should be concentrated around 0.
res <- data.frame(residual = model.step.back$residuals)
res %>%
e_charts() %>%
e_histogram(residual, name = "Error", legend = F) %>%
e_title(
text = "Normality of Residuals",
textStyle = list(fontSize = 25),
subtext = "Normally Distributed Error",
subtextStyle = list(fontSize = 15, fontStyle = "italic"),
left = "center") %>%
e_hide_grid_lines() %>%
e_tooltip() %>%
e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>%
e_y_axis(axisLabel = list(color = "#FAF7E6"))
As demonstrated above, our errors are mostly concentrated around 0, which suggests they are normally distributed and that the normality assumption is fulfilled.
Moreover, you might wonder why we do not rely on a Shapiro-Wilk test for this assumption. In R, the Shapiro-Wilk test is limited to 5,000 observations, while our dataset contains far more than that. Testing only a subset of the data would not fully represent the entire dataset, so we rely primarily on the visualization above to assess this particular assumption.
The homoscedasticity assumption asserts that the variance of the error term remains constant across all values of the predictors. To verify this assumption, we can conduct the Breusch-Pagan test, with the following hypotheses: H0: the errors are homoscedastic (constant variance); H1: the errors are heteroscedastic.
By performing the Breusch-Pagan test, we can assess whether the model satisfies the homoscedasticity assumption and make the necessary adjustments if the assumption is violated.
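The test is available as bptest() in the lmtest package; the call was presumably:
bptest(model.step.back)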
##
## studentized Breusch-Pagan test
##
## data: model.step.back
## BP = 11.473, df = 13, p-value = 0.5713
To fulfill the homoscedasticity assumption, we need to fail to reject the null hypothesis, which means the p-value should be greater than 0.05. Based on the result above, our model's p-value is 0.5713, which is larger than 0.05. This implies that the errors spread in a constant manner; hence the errors are homoscedastic and the homoscedasticity assumption is fulfilled.
Multicollinearity is a condition in which there are strong correlations among our independent variables (X). We do not want such variables as predictors, since they are redundant, and we would keep only one of them. To test this assumption, we will apply the Variance Inflation Factor (VIF) test with the following condition:
To fulfill this assumption, all of our VIF scores should be below 10, indicating no multicollinearity.
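The VIF scores below come from car::vif(), which reports generalized VIFs (GVIF) for models containing factor predictors; the call was presumably:
vif(model.step.back)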
## GVIF Df GVIF^(1/(2*Df))
## Area 1.000037 1 1.000018
## Garage 1.000037 1 1.000019
## FirePlace 1.000014 1 1.000007
## Baths 1.000029 1 1.000015
## White.Marble 1.334862 1 1.155362
## Indian.Marble 1.334858 1 1.155361
## Floors 1.000018 1 1.000009
## City 1.000071 2 1.000018
## Solar 1.000024 1 1.000012
## Electric 1.000021 1 1.000011
## Fiber 1.000038 1 1.000019
## Glass.Doors 1.000032 1 1.000016
Evidently, from the results above, the VIF scores for all our independent variables (X) are below 10, which indicates that the no-multicollinearity assumption is fulfilled.
Now, let’s examine the coefficients of the model to understand the relationships between the predictor variables and the target variable in more detail.
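The coefficient table below is the summary of the selected model; presumably:
summary(model.step.back)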
##
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.642e-05 -1.000e-10 1.000e-10 2.000e-10 2.384e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.500e+03 3.343e-10 2.842e+13 <2e-16 ***
## Area 2.500e+01 9.203e-13 2.717e+13 <2e-16 ***
## Garage 1.500e+03 8.086e-11 1.855e+13 <2e-16 ***
## FirePlace 7.500e+02 4.672e-11 1.605e+13 <2e-16 ***
## Baths 1.250e+03 4.671e-11 2.676e+13 <2e-16 ***
## White.Marble1 9.000e+03 1.620e-10 5.555e+13 <2e-16 ***
## Indian.Marble1 -5.000e+03 1.618e-10 -3.091e+13 <2e-16 ***
## Floors 1.500e+04 1.321e-10 1.135e+14 <2e-16 ***
## City2 3.500e+03 1.620e-10 2.161e+13 <2e-16 ***
## City3 7.000e+03 1.619e-10 4.325e+13 <2e-16 ***
## Solar1 2.500e+02 1.321e-10 1.892e+12 <2e-16 ***
## Electric1 1.250e+03 1.321e-10 9.460e+12 <2e-16 ***
## Fiber1 1.175e+04 1.321e-10 8.892e+13 <2e-16 ***
## Glass.Doors1 4.450e+03 1.321e-10 3.368e+13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.179e-08 on 399986 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.583e+27 on 13 and 399986 DF, p-value: < 2.2e-16
Interpretation 💡 Insight:
The model was trained on the original data. Looking at the variables, all have a positive relationship with Prices except Indian.Marble1. This means that, according to our model, all variables except Indian.Marble1 contribute positively to the price.
Coefficients: the fitted model is \[Prices = 9500 + 25*Area + 1500*Garage + 750*FirePlace + 1250*Baths + 9000*White.Marble1 - 5000*Indian.Marble1 + 15000*Floors + 3500*City2 + 7000*City3 + 250*Solar1 + 1250*Electric1 + 11750*Fiber1 + 4450*Glass.Doors1\]
Intercept: the baseline predicted price is 9,500 when all predictors are zero (no additional information available).
All coefficients: the Prices value increases (positive coefficient) or decreases (negative coefficient) by the coefficient value for a one-unit increase in the corresponding variable, holding the other variables constant.
Coefficient interpretation for the categorical predictors: White.Marble1, Indian.Marble1, Solar1, Electric1, Fiber1, and Glass.Doors1 compare level 1 against the reference level 0, while City2 and City3 compare those cities against the reference City1.
Adjusted R-squared: 1, indicating that the model explains essentially 100% of the variation in Prices.
Certainly, let’s extract the coefficients from the model and create a plot to visualize them for easier interpretation.
coef.all <- model.all$coefficients[-1]
barplot(coef.all, xlab = names(coef.all), main = 'Model `model.all` coefficients', ylab = 'Value', col="deepskyblue")
Looking at the chart, it is evident that White.Marble1, Floors, and Fiber1 are the factors that most strongly influence Prices in our regression model.
We will predict the unseen data (the test data) using our selected model. To see the comparison between the actual value and the prediction value, you can refer to the table below:
pred_model <- predict(model.step.back, newdata = house_test)
pred_comp <- data.frame(Actual.Value = house_test$Prices, Prediction.Value = pred_model)
pred_comp
From the comparison above, it is evident that our independent variables prove useful for predicting house prices. The fitted model is:
\[Prices = 9500 + 25*Area + 1500*Garage + 750*FirePlace + 1250*Baths + 9000*White.Marble1 - 5000*Indian.Marble1 + 15000*Floors + 3500*City2 + 7000*City3 + 250*Solar1 + 1250*Electric1 + 11750*Fiber1 + 4450*Glass.Doors1\]
Other than that, our model has successfully satisfied the classical assumptions of normality, homoscedasticity, and no multicollinearity.
Moreover, the R-squared of our model is high: approximately 100% of the observed variation can be explained by the model's inputs, and the accuracy of the model (our RMSE scores) in predicting house prices is very good, with no indication of overfitting or underfitting. Therefore, this model is well suited to our prediction task.