In this project, we will predict house prices using a multiple linear regression model, together with a correlation analysis between the predictor variables and the target variable. The dataset we are using is the HousePricing dataset from Kaggle.
Before doing our Exploratory Data Analysis, we will first examine the dataset to see whether there are any anomalies we need to fix. Insights are only as good as the data that informs them, so it is vital that the data is clean and in a usable form.
# Packages assumed to be loaded in a setup chunk (not shown here): dplyr, tidyr,
# ggplot2, scales, glue, plotly, highcharter, echarts4r, lmtest, car,
# and a package providing RMSE() (e.g. MLmetrics)
house <- read.csv("HousePrices_HalfMil.csv")
rmarkdown::paged_table(house)
glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
Based on the glimpse alone, there are no obvious anomalies in the dataset: all 16 columns are stored as integers and the values look sensible.
colSums(is.na(house))
## Area Garage FirePlace Baths White.Marble
## 0 0 0 0 0
## Black.Marble Indian.Marble Floors City Solar
## 0 0 0 0 0
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 0 0 0 0 0
## Prices
## 0
There are no missing values in our dataset, so we can confirm that the data is already clean. With that, we can move on to the next step: Exploratory Data Analysis.
Exploratory data analysis (EDA) involves using graphics and visualizations to explore and analyze a data set; the goal is to investigate and understand the variables within it. In this section, we will analyze the Pearson correlation between all of our variables.
cormat <- round(cor(house),2)
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
lower_tri <- get_lower_tri(cormat)
house_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
house_cor_melt <- house_cor_melt %>%
mutate(label = glue("{Var1} ~ {Var2}"))
plot1 <- ggplot(house_cor_melt,aes(Var1, Var2, text = label)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3) +
scale_fill_gradientn(colors = c("#CA0034","#FAF7E6","#CA0034"),
values = rescale(c(-1,0,1)),
limits = c(-1,1)) +
labs(x = NULL,
y = NULL,
fill = "Pearson Corr:") +
theme(legend.background = element_rect(fill = "#FAF7E6", color = "#FAF7E6"),
plot.background = element_rect(fill = "#FAF7E6", color = "#FAF7E6"),
panel.background = element_rect(fill = "#FAF7E6"),
panel.grid = element_line(colour = "#FAF7E6"),
panel.grid.major.x = element_line(colour = "#FAF7E6"),
panel.grid.minor.x = element_line(colour = "#FAF7E6"),
legend.title = element_text(colour = "Black", face ="bold", family = "Cardo"),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
axis.text.x = element_text(color = "Black", family = "Cardo",
angle = 45, vjust = 0.5, hjust = 2),
axis.text.y = element_blank(),
axis.ticks = element_blank()) +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
ggplotly(plot1, tooltip = "text") %>%
layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
font = list(
color = "Black",
family = "Cardo",
size = 12
)))
Since we are going to predict house prices, we want to see which variables have a strong correlation with Prices. Based on the plot above, those variables are White.Marble, Indian.Marble, Floors, Fiber, City, and Glass.Doors.
However, if we look at the plot once more, there are signs of multicollinearity between White.Marble, Black.Marble, and Indian.Marble. Before dealing with this issue, we first inspect their values to check whether the data types are what we expect.
unique(house$White.Marble)
## [1] 0 1
unique(house$Indian.Marble)
## [1] 0 1
unique(house$Black.Marble)
## [1] 1 0
Since the values are either 1 or 0, we will convert these columns to factors (categorical variables) and test their association with Pearson's Chi-squared test (note: this test applies to categorical variables only).
# Change the data types first
white.marble <- as.factor(house$White.Marble)
indian.marble <- as.factor(house$Indian.Marble)
black.marble <- as.factor(house$Black.Marble)
# Pearson's Chi-squared test
chisq.test(white.marble, indian.marble)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: white.marble and indian.marble
## X-squared = 125360, df = 1, p-value < 0.00000000000000022
chisq.test(white.marble, black.marble)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: white.marble and black.marble
## X-squared = 124445, df = 1, p-value < 0.00000000000000022
chisq.test(indian.marble, black.marble)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: indian.marble and black.marble
## X-squared = 125189, df = 1, p-value < 0.00000000000000022
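Incidentally, the three marble columns appear to one-hot encode a single marble-type attribute; if that reading is correct, they should sum to exactly 1 in every row. A minimal sanity check of our own, under that assumption:
# If White.Marble, Black.Marble and Indian.Marble one-hot encode one attribute,
# every row sums to 1 and this table should show a single value
table(house$White.Marble + house$Black.Marble + house$Indian.Marble)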
The chi-squared results above show that these three variables are strongly associated with one another (all p-values are below 0.05), which confirms the multicollinearity they would introduce. To deal with this, we remove the one variable with the weakest correlation with our target variable, which is Black.Marble.
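To confirm that Black.Marble is indeed the weakest of the three with respect to Prices, the relevant entries can be read straight from the correlation matrix computed earlier (a minimal check reusing the cormat object; Black.Marble has not been removed yet at this point):
# Correlation of each marble dummy with the target, from the full correlation matrix
cormat["Prices", c("White.Marble", "Black.Marble", "Indian.Marble")]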
#remove Black.Marble from our dataset
house <- house %>%
select(-Black.Marble)
In data analysis, identifying outliers, i.e., observations that fall well outside the overall pattern of the data, is very important. As a diagnostic tool for spotting potential outliers we can use a boxplot, which is based on the five-number summary and provides a graphical display of the center and spread of a data set.
house_price <- house %>%
select(Prices) %>%
pivot_longer(
Prices,
names_to = "name")
myboxplotData <- data_to_boxplot(house_price, value, name,
group_var = name,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotData)%>%
hc_title(text= "Price Distribution") %>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")),
reversed = T) %>%
hc_chart(inverted = TRUE)
As the boxplot above shows, there are 6 outliers, and they are not far from the rest of the data. Moreover, the median sits almost exactly in the middle of the box, which indicates that the target variable is approximately normally distributed.
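The outlier count can also be cross-checked numerically with the 1.5 × IQR rule that boxplots are based on (a sketch of our own; the count may differ slightly from the chart depending on highcharter's whisker rule):
# Five-number summary and IQR-based fences for Prices
fn <- fivenum(house$Prices)   # min, lower hinge, median, upper hinge, max
iqr <- fn[4] - fn[2]
lower_fence <- fn[2] - 1.5 * iqr
upper_fence <- fn[4] + 1.5 * iqr
# Number of observations falling outside the fences
sum(house$Prices < lower_fence | house$Prices > upper_fence)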
The purpose of bivariate analysis is to understand the relationship between two variables: X, the independent/explanatory variable, and Y, the dependent/outcome variable. In this case, we have already selected our predictors and want to see their relationship with the target variable using boxplots.
Why boxplots? Because all of our selected predictors are categorical. If you go back to the Data Wrangling section, you can see that most of these variables take the values 1 and 0, where 1 means the facility is present and 0 means it is not (City is the exception, with three categories).
price_wmarble <- house %>%
select(Prices, White.Marble) %>%
pivot_longer(
White.Marble,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDataw <- data_to_boxplot(price_wmarble, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDataw)%>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
We can conclude that Prices tend to be higher when the house has White Marble.
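The same pattern can be verified numerically by comparing the typical price in each group (a minimal dplyr sketch of our own; the identical approach applies to the other predictors below):
# Median and mean price with (1) and without (0) White Marble
house %>%
  group_by(White.Marble) %>%
  summarise(median_price = median(Prices),
            mean_price = mean(Prices))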
price_imarble <- house %>%
select(Prices, Indian.Marble) %>%
pivot_longer(
Indian.Marble,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDatai <- data_to_boxplot(price_imarble, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDatai)%>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
We can conclude that Prices tend to be lower when the house has Indian Marble.
price_floor <- house %>%
select(Prices, Floors) %>%
pivot_longer(
Floors,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDatafl <- data_to_boxplot(price_floor, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDatafl) %>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
We can conclude that Prices tend to be higher when Floors is 1.
price_fiber <- house %>%
select(Prices, Fiber) %>%
pivot_longer(
Fiber,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDataf <- data_to_boxplot(price_fiber, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDataf)%>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
We can conclude that Prices tend to be higher when the house has Fiber.
price_gdoor <- house %>%
select(Prices, Glass.Doors) %>%
pivot_longer(
Glass.Doors,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDatag <- data_to_boxplot(price_gdoor, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDatag) %>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
We can conclude that Prices tend to be higher when the house has Glass Doors.
price_city <- house %>%
select(Prices, City) %>%
pivot_longer(
City,
names_to = "name") %>%
mutate(value = as.factor(value))
myboxplotDatac <- data_to_boxplot(price_city, Prices, value,
group_var = value,
add_outliers = TRUE,
fillColor = "#CA0034",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotDatac) %>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")))
Before we go into modelling, we will first split our dataset into training and test sets. Why? This hold-out validation step (often referred to as cross validation) lets us estimate how a machine learning model will perform when making predictions on unseen data. In this case, we take 80% of the dataset to train the model and the remaining 20% to test its predictions and evaluate performance.
# Set seed to lock the random
set.seed(417)
# Index Sampling
index <- sample(nrow(house), size = nrow(house)*0.80)
# Splitting
house_train <- house[index,] #take 80%
house_test <- house[-index,] #take 20%
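A quick sanity check on the resulting split sizes (a small sketch of our own; with 500,000 rows this should give exactly 400,000 and 100,000):
# Confirm the 80/20 proportions of the split
nrow(house_train)
nrow(house_test)
nrow(house_train) / nrow(house)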
After splitting the data, we train our model on the house_train data. For the independent variables (X), we use White.Marble, Indian.Marble, City, Floors, Fiber, and Glass.Doors, since these are the ones with a high correlation with Prices.
model_corr <- lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + City + Glass.Doors,
data = house_train)
summary(model_corr)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble +
## City + Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9371.7 -2148.0 1.5 2144.2 9350.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18142.231 16.767 1082.0 <0.0000000000000002 ***
## Floors 14990.743 9.675 1549.5 <0.0000000000000002 ***
## Fiber 11753.456 9.675 1214.9 <0.0000000000000002 ***
## White.Marble 9030.670 11.854 761.8 <0.0000000000000002 ***
## Indian.Marble -4992.513 11.847 -421.4 <0.0000000000000002 ***
## City 3490.363 5.927 588.9 <0.0000000000000002 ***
## Glass.Doors 4433.424 9.675 458.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 399993 degrees of freedom
## Multiple R-squared: 0.9362, Adjusted R-squared: 0.9362
## F-statistic: 9.777e+05 on 6 and 399993 DF, p-value: < 0.00000000000000022
\[ R^2 = 0.9362 \] This means approximately 93.6% of the observed variation in Prices can be explained by the model's inputs. Moreover, all predictors have p-values below 0.05, which means each of them has a statistically significant relationship with the dependent variable (Y).
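If you want the exact values instead of reading them from the printout, R-squared can be extracted directly from the summary object (a minimal sketch):
# Pull R-squared and adjusted R-squared from the fitted model
summary(model_corr)$r.squared
summary(model_corr)$adj.r.squared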
We now predict the unseen data (the test set) using our selected model. To compare the actual values with the predicted values, refer to the table below:
pred_model <- predict(model_corr, newdata = house_test)
pred_comp <- data.frame(Actual.Value = house_test$Prices, Prediction.Value = pred_model)
rmarkdown::paged_table(pred_comp)
However, comparing these values one by one makes it hard to evaluate the model's performance. In this part, we introduce the RMSE, or Root Mean Squared Error: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is one of the most popular measures of how closely a regression model's predicted values match the actual (observed) values. Below is the RMSE comparison for our training and test data.
# RMSE of train dataset
RMSE(y_pred = model_corr$fitted.values, y_true = house_train$Prices)
## [1] 3059.35
# Check the range
range(house_train$Prices)
## [1] 7725 77975
# RMSE of test dataset
RMSE(y_pred = pred_model, y_true = house_test$Prices)
## [1] 3067.634
# Check the range
range(house_test$Prices)
## [1] 8275 77075
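As a cross-check, the RMSE formula above can be applied directly to the predictions; this sketch should reproduce the RMSE() results (assuming RMSE() is the usual root-mean-squared-error implementation, e.g. from MLmetrics), and also expresses the test error relative to the spread of Prices:
# RMSE computed directly from its definition
sqrt(mean((model_corr$fitted.values - house_train$Prices)^2))  # training error
sqrt(mean((pred_model - house_test$Prices)^2))                 # test error
# Test RMSE as a fraction of the observed price range
sqrt(mean((pred_model - house_test$Prices)^2)) / diff(range(house_test$Prices))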
As shown above, the RMSE of the test set is very close to that of the training set. Moreover, both RMSE scores are small compared to the range of Prices in each dataset. This means there is no indication of overfitting or underfitting, and the model's performance is good enough.
Our linear regression model is expected to have normally distributed errors, which means the majority of the errors should be centered around 0.
res <- data.frame(residual = model_corr$residuals)
res %>%
e_charts() %>%
e_histogram(residual, name = "Error", legend = F) %>%
e_title(
text = "Normality of Residuals",
textStyle = list(fontFamily = "Cardo", fontSize = 25),
subtext = "Normally Distributed Error",
subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
left = "center") %>%
e_hide_grid_lines() %>%
e_tooltip() %>%
e_theme_custom("rangga_rm.json") %>%
e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>%
e_y_axis(axisLabel = list(color = "#FAF7E6"))
As shown above, the errors are mostly concentrated around 0, which suggests they are approximately normally distributed, so the normality assumption is fulfilled.
You might wonder why we do not run a Shapiro-Wilk test for this assumption. In R, shapiro.test() is limited to 5,000 observations, while our dataset has far more than that. We are concerned that testing only a subset of the data would not fully represent the whole, so we rely on the visualization above to assess this assumption.
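If a formal check is still desired, two common workarounds are to run the Shapiro-Wilk test on a random subsample of at most 5,000 residuals, or to look at a normal Q-Q plot of all residuals. A sketch of both, purely as optional complements to the histogram above:
# Shapiro-Wilk on a random subsample (shapiro.test() accepts 3 to 5,000 values)
set.seed(417)
shapiro.test(sample(model_corr$residuals, 5000))
# Normal Q-Q plot: points close to the reference line suggest normality
qqnorm(model_corr$residuals)
qqline(model_corr$residuals)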
To test our Error’s homoscedasticity, we will use Breusch-Pagan hypothesis test:
bptest(model_corr)
##
## studentized Breusch-Pagan test
##
## data: model_corr
## BP = 7.6701, df = 6, p-value = 0.2633
To fulfill the homoscedasticity assumption we need to fail to reject H0, i.e., we should have a p-value > 0.05. Based on the result above, our model's p-value is 0.2633, which is greater than 0.05. This implies that the errors have a constant spread; hence the errors are homoscedastic and the homoscedasticity assumption is fulfilled.
Multicollinearity is a condition in which there are strong correlations among our independent variables (X). We do not want such variables as predictors, since they are redundant; if multicollinearity appears, we would keep only one of the correlated variables. To test this assumption, we apply the Variance Inflation Factor (VIF) test, with the following condition:
To fulfill this assumption, all of our VIF scores should be below 10, i.e., no severe multicollinearity.
vif(model_corr)
## Floors Fiber White.Marble Indian.Marble City
## 1.000004 1.000013 1.333986 1.333990 1.000012
## Glass.Doors
## 1.000012
As the results above show, the VIF scores of all our independent variables (X) are below 10, which indicates that the no-multicollinearity assumption is fulfilled.
From the analysis above, it is evident that our independent variables are useful for predicting house prices. Those variables are: White.Marble, Indian.Marble, City, Floors, Fiber, and Glass.Doors. Other than that, our model satisfies the Normality, Homoscedasticity, and No Multicollinearity assumptions.
Moreover, the R-squared of our model is high: approximately 93.6% of the observed variation can be explained by the model's inputs, and the RMSE scores show the model predicts house prices accurately, with no indication of overfitting or underfitting.
A work by Rangga Gemilang
gemilang.rangga94@gmail.com
22-08-2022