Introduction

In this project, we will predict house prices using a multiple regression model, along with a correlation analysis between the predictor variables and the target variable. The dataset we're using is the HousePricing dataset from Kaggle.

Understanding our Data

Before we do our Exploratory Data Analysis, we will first examine our dataset to see if there are any anomalies we can fix. Insights are only as good as the data that informs them, so it's vital that the data is clean and in a usable form.

Load Data

house <- read.csv("HousePrices_HalfMil.csv")
rmarkdown::paged_table(house)
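The rest of the code in this post relies on several packages. As a minimal setup sketch (the package names are inferred from the function calls used below; adjust to your own environment):

# Packages assumed to be attached for the code below (inferred, not part of the original chunk)
library(dplyr)       # glimpse(), select(), mutate(), pipes
library(tidyr)       # pivot_longer()
library(glue)        # glue() labels for tooltips
library(ggplot2)     # correlation heatmap
library(scales)      # rescale()
library(plotly)      # ggplotly()
library(highcharter) # highchart(), data_to_boxplot()
library(echarts4r)   # e_charts(), e_histogram()
library(lmtest)      # bptest()
library(car)         # vif()
library(MLmetrics)   # RMSE()
# reshape2 is called explicitly via reshape2::melt() and needs only to be installed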

Data Wrangling

  • Check for data types:
glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area          <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage        <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace     <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths         <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble  <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble  <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors        <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City          <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar         <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric      <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber         <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors   <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden        <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices        <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …

Based on the glimpse alone, there are no anomalies within the dataset.

  • Check for missing values:
colSums(is.na(house))
##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

We can see that there are no missing values in our dataset, so we can confirm that the data is already clean. With this, we can move on to our next step, Exploratory Data Analysis.

Exploratory Data Analysis

Exploratory data analysis (EDA) uses graphics and visualizations to explore and understand the variables in a dataset. In this section, we will analyze the Pearson correlation between all of our variables.
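For reference, the Pearson correlation coefficient between two variables x and y is \[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \] and cor() in the code below computes it for every pair of columns at once.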

cormat <- round(cor(house),2)

get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
  }

get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

lower_tri <- get_lower_tri(cormat)

house_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
house_cor_melt <- house_cor_melt %>% 
  mutate(label = glue("{Var1} ~ {Var2}"))

plot1 <- ggplot(house_cor_melt,aes(Var1, Var2, text = label)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3) + 
  scale_fill_gradientn(colors = c("#CA0034","#FAF7E6","#CA0034"),
                      values = rescale(c(-1,0,1)),
                      limits = c(-1,1)) +
  labs(x = NULL,
      y = NULL,
      fill = "Pearson Corr:") +
  theme(legend.background = element_rect(fill = "#FAF7E6", color = "#FAF7E6"),
        plot.background = element_rect(fill = "#FAF7E6", color = "#FAF7E6"),
        panel.background = element_rect(fill = "#FAF7E6"),
        panel.grid = element_line(colour = "#FAF7E6"),
        panel.grid.major.x = element_line(colour = "#FAF7E6"),
        panel.grid.minor.x = element_line(colour = "#FAF7E6"),
        legend.title = element_text(colour = "Black", face ="bold", family = "Cardo"),
        legend.justification = c(1, 0),
        legend.position = c(0.6, 0.7),
        axis.text.x = element_text(color = "Black", family = "Cardo",
                                angle = 45, vjust = 0.5, hjust = 2),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
        guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                  title.position = "top", title.hjust = 0.5))

ggplotly(plot1, tooltip = "text") %>% 
  layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
                         font = list(
                           color = "Black",
                           family = "Cardo",
                           size = 12
                         )))

Since we're going to predict house prices, we want to see which variables have a strong correlation with Prices. Based on the plot above, those variables are White Marble, Indian Marble, Floors, Fiber, City, and Glass Doors.
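If you prefer a numeric check of the same thing, a small sketch (not part of the original output) is to rank the predictors by their absolute correlation with Prices:

# Rank variables by the strength of their correlation with Prices (sketch);
# Prices itself appears first with a value of 1
sort(abs(cormat[, "Prices"]), decreasing = TRUE)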

Multicollinearity (Initial)

However, if we look at the plot once more, there are signs of multicollinearity between White.Marble, Black.Marble, and Indian.Marble. To deal with this issue, we first inspect their values to check whether their types are what we expect.

unique(house$White.Marble)
## [1] 0 1
unique(house$Indian.Marble)
## [1] 0 1
unique(house$Black.Marble)
## [1] 1 0

Since the values are either 1 or 0, we will change their data type to factor/categorical and test their association with Pearson's Chi-squared test (note: this test applies to categorical variables only).

# Change the data types first
white.marble <- as.factor(house$White.Marble)
indian.marble <- as.factor(house$Indian.Marble)
black.marble <- as.factor(house$Black.Marble)

# Pearson's Chi-squared test
chisq.test(white.marble, indian.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  white.marble and indian.marble
## X-squared = 125360, df = 1, p-value < 0.00000000000000022
chisq.test(white.marble, black.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  white.marble and black.marble
## X-squared = 124445, df = 1, p-value < 0.00000000000000022
chisq.test(indian.marble, black.marble)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  indian.marble and black.marble
## X-squared = 125189, df = 1, p-value < 0.00000000000000022

From the results above, we can see that these three variables are significantly associated with one another (all p-values below 0.05), which indicates multicollinearity among them. To deal with this problem, we will remove only the one variable that has the weakest correlation with our target variable, which is Black.Marble.
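To verify which marble dummy is the weakest, one could look at their correlations with Prices directly (a small sketch, not part of the original output):

# Correlation of the three marble dummies with Prices (sketch)
cormat[c("White.Marble", "Black.Marble", "Indian.Marble"), "Prices"]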

#remove Black.Marble from our dataset
house <- house %>% 
  select(-Black.Marble)

Outliers

In data analysis, identifying outliers, i.e. observations that fall well outside the overall pattern of the data, is very important. As a diagnostic tool for spotting potential outliers we can use a boxplot, which is based on the five-number summary and provides a graphical display of the center and spread of a dataset.
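Since the boxplot is built on the five-number summary, the same numbers can be checked directly (a quick sketch, not part of the original output):

# Five-number summary (min, lower hinge, median, upper hinge, max) of Prices
fivenum(house$Prices)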

house_price <- house %>% 
  select(Prices) %>% 
  pivot_longer(
  Prices,
  names_to = "name")

myboxplotData <- data_to_boxplot(house_price, value, name,
                                 group_var = name,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotData)%>%
  hc_title(text= "Price Distribution") %>% 
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")),
    reversed = T) %>% 
  hc_chart(inverted = TRUE)  

As the boxplot above shows, there are 6 outliers, none of which lie very far from the rest of the data. Moreover, the median sits almost exactly in the middle of the box, which suggests that the target variable is approximately normally distributed.
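As a cross-check on the outlier count, the usual 1.5 × IQR whisker rule behind most boxplots can be applied directly (a sketch; the exact count depends on the rule the charting library uses):

# Count observations beyond the 1.5 * IQR whiskers (sketch)
q   <- quantile(house$Prices, c(0.25, 0.75))
iqr <- diff(q)
sum(house$Prices < q[1] - 1.5 * iqr | house$Prices > q[2] + 1.5 * iqr)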

Bivariate Analysis

The purpose of bivariate analysis is to understand the relationship between two variables. It involves analyzing two variables, X (the independent/explanatory variable) and Y (the dependent/outcome variable), to determine the relationship between them. In this case, we have already selected our predictors and we want to see their relationship with the target variable using boxplots.

Why a boxplot? Because all of our selected predictors are categorical. If you go back to the Data Wrangling section, you can see that most of these variables take the values 1 and 0, meaning 1 if the facility is present and 0 if it is not, while City takes a small set of integer codes (1, 2, 3).

vs. White Marble

price_wmarble <- house %>% 
  select(Prices, White.Marble) %>% 
  pivot_longer(
  White.Marble,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDataw <- data_to_boxplot(price_wmarble, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDataw)%>%
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

We can conclude that prices tend to be higher when the house has White Marble.

vs Indian Marble

price_imarble <- house %>% 
  select(Prices, Indian.Marble) %>% 
  pivot_longer(
  Indian.Marble,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDatai <- data_to_boxplot(price_imarble, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDatai)%>%
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

We can conclude that prices tend to be lower when the house has Indian Marble.

vs. Floors

price_floor <- house %>% 
  select(Prices, Floors) %>% 
  pivot_longer(
  Floors,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDatafl <- data_to_boxplot(price_floor, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDatafl) %>% 
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

We can conclude that prices tend to be higher when Floors is 1.

vs. Fiber

price_fiber <- house %>% 
  select(Prices, Fiber) %>% 
  pivot_longer(
  Fiber,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDataf <- data_to_boxplot(price_fiber, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDataf)%>%
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

We can conclude that prices tend to be higher when the house has Fiber.

vs. Glass Doors

price_gdoor <- house %>% 
  select(Prices, Glass.Doors) %>% 
  pivot_longer(
  Glass.Doors,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDatag <- data_to_boxplot(price_gdoor, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDatag) %>% 
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

We can conclude that prices tend to be higher when the house has Glass Doors.

vs. City

price_city <- house %>% 
  select(Prices, City) %>% 
  pivot_longer(
  City,
  names_to = "name") %>% 
  mutate(value = as.factor(value))

myboxplotDatac <- data_to_boxplot(price_city, Prices, value,
                                 group_var = value,
                                 add_outliers = TRUE,
                                 fillColor = "#CA0034",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotDatac) %>% 
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")))

Model

Cross Validation

Before we go into modelling, we first perform a hold-out split (our Cross Validation step) on the dataset. Why? This procedure is used to estimate the performance of machine learning models when they make predictions on unseen data. Here, we will use 80% of the dataset to train the model and the remaining 20% to test its predictions and assess the model's performance.

# Set seed to lock the random
set.seed(417)

# Index Sampling
index <- sample(nrow(house), size = nrow(house)*0.80)

# Splitting
house_train <- house[index,] #take 80%
house_test <- house[-index,] #take 20%

Fitting

After splitting the data, we will train our model using the house_train data. For the independent variables (X), we will use White.Marble, Indian.Marble, City, Floors, Fiber, and Glass.Doors, since these are the variables with a high correlation to Prices.

model_corr <- lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + City + Glass.Doors,
                data = house_train)
summary(model_corr)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + 
##     City + Glass.Doors, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9371.7 -2148.0     1.5  2144.2  9350.7 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   18142.231     16.767  1082.0 <0.0000000000000002 ***
## Floors        14990.743      9.675  1549.5 <0.0000000000000002 ***
## Fiber         11753.456      9.675  1214.9 <0.0000000000000002 ***
## White.Marble   9030.670     11.854   761.8 <0.0000000000000002 ***
## Indian.Marble -4992.513     11.847  -421.4 <0.0000000000000002 ***
## City           3490.363      5.927   588.9 <0.0000000000000002 ***
## Glass.Doors    4433.424      9.675   458.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3059 on 399993 degrees of freedom
## Multiple R-squared:  0.9362, Adjusted R-squared:  0.9362 
## F-statistic: 9.777e+05 on 6 and 399993 DF,  p-value: < 0.00000000000000022

\[ R^2 = 0.9362 \] This means approximately 93.6% of the observed variation in Prices can be explained by the model's inputs. Moreover, all predictors have p-values below 0.05, which means they are all significantly related to our dependent variable (Y).
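For reference, R² is defined from the residuals and the total variation as \[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \] i.e. the share of the variance in Prices that the fitted values account for.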

Predicting

We will predict the unseen data (the test set) using our selected model. To compare the actual values with the predicted values, refer to the table below:

pred_model <- predict(model_corr, newdata = house_test)
pred_comp <- data.frame(Actual.Value = house_test$Prices, Prediction.Value = pred_model)
rmarkdown::paged_table(pred_comp)

Evaluation

Model Performance

However, eyeballing and comparing those values one by one makes it hard to evaluate the model's performance. In this part, we introduce the RMSE, or Root Mean Squared Error, to quantify it: \[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \] RMSE is one of the most popular measures for estimating the accuracy of a regression model's predicted values against the actual (observed) values. Below is the RMSE comparison for our training and test data.
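The RMSE() function used below comes from the MLmetrics package; written out directly from the formula above, it is equivalent to this small sketch:

# RMSE written out from the formula (equivalent to MLmetrics::RMSE)
rmse_manual <- function(y_pred, y_true) sqrt(mean((y_pred - y_true)^2))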

# RMSE of train dataset
RMSE(y_pred = model_corr$fitted.values, y_true = house_train$Prices)
## [1] 3059.35
# Check the range
range(house_train$Prices)
## [1]  7725 77975
# RMSE of test dataset
RMSE(y_pred = pred_model, y_true = house_test$Prices)
## [1] 3067.634
# Check the range
range(house_test$Prices)
## [1]  8275 77075

As shown above, the RMSE values for the train and test datasets are quite similar. Moreover, both RMSE scores are small relative to the range of prices in each dataset. This means there is no indication of overfitting or underfitting in our model, and its performance is good enough.

Assumptions

Normality

Our linear regression model is expected to have normally distributed errors, which means the majority of the errors should be concentrated around 0.

res <- data.frame(residual = model_corr$residuals)

res %>% 
e_charts() %>% 
  e_histogram(residual, name = "Error", legend = F) %>% 
  e_title(
    text = "Normality of Residuals",
    textStyle = list(fontFamily = "Cardo", fontSize = 25),
    subtext = "Normally Distributed Error",
    subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
    left = "center") %>% 
  e_hide_grid_lines() %>% 
  e_tooltip() %>% 
  e_theme_custom("rangga_rm.json") %>% 
  e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>% 
  e_y_axis(axisLabel = list(color = "#FAF7E6"))

As demonstrated above, the errors are mostly concentrated around 0, which suggests they are approximately normally distributed and that our Normality assumption is fulfilled.

You might be wondering why we don't run a Shapiro-Wilk test for this assumption. In R, the Shapiro test is limited to 5,000 observations, while our dataset has far more than that. We are concerned that testing only a subsample would not fully represent the residuals, so we rely on the visualization above to assess this particular assumption.
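(For context, shapiro.test() in base R throws an error for sample sizes above 5,000, which is the limit referred to above. Running it on a random subsample is possible, but as discussed it would only partially represent the residuals, so the sketch below is illustrative only.)

# Illustrative only: Shapiro-Wilk on a random subsample of residuals
# (base R caps the test at 5,000 observations)
set.seed(417)
shapiro.test(sample(model_corr$residuals, 5000))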

Homoscedasticity

To test the homoscedasticity of our errors, we will use the Breusch-Pagan hypothesis test:

  • H0: the error spreads in a constant manner, i.e. the errors are homoscedastic
  • H1: the error spreads in a non-constant manner, i.e. the errors are heteroscedastic
bptest(model_corr)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_corr
## BP = 7.6701, df = 6, p-value = 0.2633

To fulfill the Homoscedasticity assumption, we need to fail to reject H0, meaning we should have a p-value > 0.05. Based on the result above, our model's p-value is 0.2633, which is greater than 0.05. This implies that the error spreads in a constant manner, hence the errors are homoscedastic and the Homoscedasticity assumption is fulfilled.

No Multicollinearity

Multicollinearity is a condition where there are strong correlations among our independent variables (X). We don't want such redundant variables among our predictors; when it happens, we keep only one of them. To test this assumption, we apply the Variance Inflation Factor (VIF) with the following rule of thumb:

  • VIF > 10: There’s multicollinearity.
  • VIF < 10: There’s no multicollinearity.

To fulfill this assumption, all of our VIF scores should be below 10, i.e. no multicollinearity should occur.

vif(model_corr)
##        Floors         Fiber  White.Marble Indian.Marble          City 
##      1.000004      1.000013      1.333986      1.333990      1.000012 
##   Glass.Doors 
##      1.000012

Evidently, the VIF scores for all of our independent variables (X) are below 10, which indicates that the No Multicollinearity assumption is fulfilled.

Conclusion

From the analysis above, it is evident that our independent variables are useful for predicting house prices. Those variables are:

  • White.Marble
  • Indian.Marble
  • City
  • Floors
  • Fiber
  • Glass.Doors

Other than that, our model satisfies the Normality, Homoscedasticity, and No Multicollinearity assumptions.

Moreover, the R-squared of our model is high, with approximately 93.6% of the observed variation explained by the model's inputs, and the RMSE scores show that the model predicts house prices accurately, with no indication of overfitting or underfitting.

 

A work by Rangga Gemilang

gemilang.rangga94@gmail.com

22-08-2022