Introduction

In this project, we will predict house’s Price using Multiple Regression model along with the correlation analysis between the predictor variables and the target variable. The dataset that we’re using are housePrice dataset from Kaggle.

Understanding our Data

Before we’re doing our Exploratory Data Analysis, we will examine our dataset first to see if there are any anomalies that we can fix within the dataset. It is important that insights are only as good as the data that informs them. This means it’s vital for the data to be clean and in the shape of usable form.

Load Data

house <- read.csv("kc_house_data.csv")
rmarkdown::paged_table(house)

Data Wrangling

  • Check for data types:
# Mengubah tipe data semua kolom menjadi integer
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date          <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

Based on the glimpse alone, there are no anomalies within the dataset.

  • Check for missing values:
colSums(is.na(house))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0

We can see that there are no missing values within our dataset and we can confirm that the data is already clean. By this, we can move into our next step which is Exploratory Data Analysis.

Column Selection

What makes a column unimportant? In this project, we will determine it by looking at which columns are already represented by other columns, which columns have no relationship with the Price column, and which columns contain data other than numbers just by looking at them. As of now, according to our understanding, there are 5 columns with those characteristics: id, date, zipcode, lat, and long.

Drop those columns from our dataset.

house <- subset(house,select=-c(id,date,zipcode,lat,long))
dim(house)
## [1] 21613    16

Exploratory Data Analysis

Exploratory data analysis (EDA) and Future Selection involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn the data variables within our dataset. In this section, we will be analyzing Outliers and the Stepwise Regresion between all of our variables.

Outliers

In data analysis, the identification of outliers and thus, observations that fall well outside the overall pattern of the data is very important. As a diagnostic tool for spotting observations that may be outliers we may use a boxplot which based on the five-number summary and can be used to provide a graphical display of the center and variation of a data set.

house_price <- house %>% 
  select(price) %>% 
  pivot_longer(
  price,
  names_to = "name")

myboxplotData <- data_to_boxplot(house_price, value, name,
                                 group_var = name,
                                 add_outliers = TRUE,
                                 fillColor = "#176B87",
                                 color = "Black")

highchart()%>%
  hc_xAxis(type ="category")%>%
  hc_add_series_list(myboxplotData)%>%
  hc_title(text= "Price Distribution") %>% 
  hc_legend(enabled= FALSE) %>% 
  hc_add_theme(thm) %>% 
  hc_yAxis(
    labels = list(
      style = list(
        color = "#FAF7E6"))) %>% 
  hc_xAxis(
    labels = list(
      style = list(
        color = "#FAF7E6")),
    reversed = T) %>% 
  hc_chart(inverted = TRUE)  
length(house$price[house$price >= 1130000]) # base on boxplot
## [1] 1146
column <- house$price
Q1 <- quantile(column, 0.25)
Q3 <- quantile(column, 0.75)
IQR <- Q3 - Q1
outliers <- column[column < Q1 - 1.5 * IQR | column > Q3 + 1.5 * IQR]
count <- length(outliers)
count
## [1] 1146

It was found that there were 1146 data outliers in the price column, therefore we decided to remove the data because it is still below 10% of the total data held.

house<- house[!(house$price >= 1130000), ]

Step-wise regresiion

Step-wise regression helps us choose a good predictor, by looking for a combination of predictors that produces the best model based on the AIC value. In Project using Both methods

model_all <- lm(formula = price ~.,
                data = house)
model_both <- step(object = model_all,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = F)
summary(model_both)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + view + condition + grade + sqft_above + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -585451  -91224   -6461   80674  723434 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)   4581689.82565   91288.87538  50.189 < 0.0000000000000002 ***
## bedrooms       -13054.08556    1330.96808  -9.808 < 0.0000000000000002 ***
## bathrooms       26249.97596    2309.94061  11.364 < 0.0000000000000002 ***
## sqft_living        86.40031       3.22944  26.754 < 0.0000000000000002 ***
## sqft_lot            0.13971       0.03360   4.158          0.000032293 ***
## floors          54771.62689    2479.67543  22.088 < 0.0000000000000002 ***
## waterfront      78948.74439   18205.76933   4.336          0.000014548 ***
## view            20230.96921    1620.58379  12.484 < 0.0000000000000002 ***
## condition       18733.80035    1607.46682  11.654 < 0.0000000000000002 ***
## grade           91461.61385    1491.93496  61.304 < 0.0000000000000002 ***
## sqft_above        -33.72312       3.10237 -10.870 < 0.0000000000000002 ***
## yr_built        -2625.98172      47.05355 -55.808 < 0.0000000000000002 ***
## yr_renovated        5.08661       2.64776   1.921               0.0547 .  
## sqft_living15      57.61616       2.51772  22.884 < 0.0000000000000002 ***
## sqft_lot15         -0.25527       0.05119  -4.987          0.000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135100 on 20452 degrees of freedom
## Multiple R-squared:  0.5798, Adjusted R-squared:  0.5795 
## F-statistic:  2016 on 14 and 20452 DF,  p-value: < 0.00000000000000022

Since we are going to be predicting house prices, we want to see which variable has a strong correlation with Price. Based on the results above, it turns out that 14 columns affect the target column.

Model

Cross Validation

Before we go into modelling, we are going to do Cross Validation for our dataset first. Why? The Cross Validation procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on unseen data. In this context, we will take 80% of our dataset to train our model and 20% of our dataset to test our prediction to see the model’s performance.

# Set seed to lock the random
set.seed(100)

# Index Sampling
index <- sample(nrow(house), size = nrow(house)*0.80)

# Splitting
house_train <- house[index,] #take 80%
house_test <- house[-index,] #take 20%

Fitting

After we split our data, now we will train our models using the house_train data. For the independent variables (X), we will use the model_both formula that was created earlier.

model_corr <- lm(formula = model_both,
                data = house_train)
summary(model_corr)
## 
## Call:
## lm(formula = model_both, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -584370  -90969   -6389   80763  723761 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)   4591394.28562  102486.89388  44.800 < 0.0000000000000002 ***
## bedrooms       -12856.07556    1478.20719  -8.697 < 0.0000000000000002 ***
## bathrooms       28527.13920    2589.16942  11.018 < 0.0000000000000002 ***
## sqft_living        81.88942       3.60453  22.719 < 0.0000000000000002 ***
## sqft_lot            0.13374       0.03756   3.561             0.000371 ***
## floors          52822.84947    2771.96503  19.056 < 0.0000000000000002 ***
## waterfront     102014.74490   20349.56615   5.013          0.000000541 ***
## view            20707.34534    1800.16842  11.503 < 0.0000000000000002 ***
## condition       19905.80496    1805.85530  11.023 < 0.0000000000000002 ***
## grade           92806.31765    1672.43494  55.492 < 0.0000000000000002 ***
## sqft_above        -31.41861       3.47155  -9.050 < 0.0000000000000002 ***
## yr_built        -2637.62325      52.81477 -49.941 < 0.0000000000000002 ***
## yr_renovated        6.08958       2.94614   2.067             0.038753 *  
## sqft_living15      58.75218       2.80632  20.936 < 0.0000000000000002 ***
## sqft_lot15         -0.28777       0.05806  -4.957          0.000000725 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135300 on 16358 degrees of freedom
## Multiple R-squared:  0.5802, Adjusted R-squared:  0.5798 
## F-statistic:  1615 on 14 and 16358 DF,  p-value: < 0.00000000000000022

\[ R^2 = 0.5802 \] It means approximately 58,02% of the observed variation can be explained by the model’s inputs. Moreover, all variables p-values are below 0.05 which means they’re all significant with our dependent variable (Y)

Predicting

We will predict the unseen data (the test data) using our selected model. To see the comparison between the actual value and the prediction value, you can refer to the table below:

pred_model <- predict(model_corr, newdata = house_test)
pred_comp <- data.frame(Actual.Value = house_test$price, Prediction.Value = pred_model)
rmarkdown::paged_table(pred_comp)

Evaluation

Model Performance

However, seeing only those values and comparing them will be too hard for us to evaluating our model’s performance. In this part, we will introduce you the RMSE or Root Mean Squared Error calculation to calculates our model’s performance: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is one of the most popular measures to estimate the accuracy of our forecasting model’s predicted values versus the actual or observed values while training the regression models. Below is the RMSE’s comparison for our training and test data.

# RMSE of train dataset
RMSE(y_pred = model_corr$fitted.values, y_true = house_train$price)
## [1] 135248.4
# Check the range
range(house_train$price)
## [1]   75000 1127500
# RMSE of test dataset
RMSE(y_pred = pred_model, y_true = house_test$price)
## [1] 134438
# Check the range
range(house_test$price)
## [1]   85000 1125000

As shown above, there is a significant difference in the RMSE comparison of the test and training datasets. In addition, these two RMSE scores are definitely understated compared to the range of values for the two data sets. This means that the model indicated Underfitting in the model

Assumptions

Normality

Our Linear Regression model is expected to have normally distributed error which means majority of the errors are expected to be around 0.

res <- data.frame(residual = model_corr$residuals)

res %>% 
e_charts() %>% 
  e_histogram(residual, name = "Error", legend = F) %>% 
  e_title(
    text = "Normality of Residuals",
    textStyle = list(fontFamily = "Cardo", fontSize = 25),
    subtext = "Normally Distributed Error",
    subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
    left = "center") %>% 
  e_hide_grid_lines() %>% 
  e_tooltip() %>% 
  e_theme_custom("formater.json") %>% 
  e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>% 
  e_y_axis(axisLabel = list(color = "#FAF7E6"))

As shown above, our Error is mostly concentrated around 0 but skewed slightly to the right

Also, if you’re probably wondering why we don’t do a Shapiro Test for these assumptions? That’s because in the R language, the Shapiro Test is only limited to 5000 data whereas our dataset has much more than that. We are concerned that if we only use a subset of the data, the test itself will not be able to fully represent all of the assumptions. Therefore, we will only use our visualization above to test this particular assumption.

Homoscedasticity

To test our Error’s homoscedasticity, we will use Breusch-Pagan hypothesis test:

  • H0: error spreads in constant manner, which means the error is homoscedasticity
  • H1: error spreads in non-constant manner, which means the error is heteroscedasticity
bptest(model_corr)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_corr
## BP = 725.78, df = 14, p-value < 0.00000000000000022

To satisfy the Homosscedasticity assumption, we must fail to reject Hypothesis H0 we must have a p-value > 0.05. Based on the results above, our p-value model is smaller than 0.05. This implies that Our errors are not constant, so the error is not homoscedasticity and The assumption of homoscedasticity is not met.

No Multicollinearity

Multicollinearity is a condition when there are strong correlations between our Independent Variable (X). We don’t want these variables to be our predictor since they’re redundant and we are going to choose only one variable out of it. To test this assumption, we will apply Variance Inflation Factor (VIF) test with conditions:

  • VIF > 10: There’s multicollinearity.
  • VIF < 10: There’s no multicollinearity.

To fulfill this assumption, all of our VIF Scores should be below 10 or No Multicollinearity happened.

vif(model_corr)
##      bedrooms     bathrooms   sqft_living      sqft_lot        floors 
##      1.640552      3.025210      7.000226      1.966997      1.981423 
##    waterfront          view     condition         grade    sqft_above 
##      1.104949      1.214153      1.213539      2.675233      5.698676 
##      yr_built  yr_renovated sqft_living15    sqft_lot15 
##      2.113134      1.131437      2.662341      1.997037

Evidently from the results above, the VIF scores for all our Independent Variable (X) are all below 10 which indicates that the No Multicollinearity assumption is fulfilled.

Conclusion

From the conclusion above, it is evident that our 14 Independent Variables proof to be useful to predict the House’s Prices.

In addition, our model has failed to meet several classic assumptions Normality, Homoscedasticity, and No Multicollinearity. namely for Normality and Homosscedasticity

In addition, the R-Squared model is not very high at around 58.02% the observed variation can be explained by our model inputs and the accuracy (our RMSE score) of the model in predicting house prices is quite good despite indication Underfitting. Therefore, this particular model has not been very well used for our prediction efforts.

 

A work by Ahmad Fauzi

wrk.ahmadfauzi@gmail.com