In this project, we will predict house’s Price using Multiple Regression model along with the correlation analysis between the predictor variables and the target variable. The dataset that we’re using are housePrice dataset from Kaggle.
Before we’re doing our Exploratory Data Analysis, we will examine our dataset first to see if there are any anomalies that we can fix within the dataset. It is important that insights are only as good as the data that informs them. This means it’s vital for the data to be clean and in the shape of usable form.
house <- read.csv("kc_house_data.csv")
rmarkdown::paged_table(house)
# Mengubah tipe data semua kolom menjadi integer
glimpse(house)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
Based on the glimpse alone, there are no anomalies within the dataset.
colSums(is.na(house))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
We can see that there are no missing values within our dataset and we can confirm that the data is already clean. By this, we can move into our next step which is Exploratory Data Analysis.
What makes a column unimportant? In this project, we will determine
it by looking at which columns are already represented by other columns,
which columns have no relationship with the Price column, and which
columns contain data other than numbers just by looking at them. As of
now, according to our understanding, there are 5 columns with those
characteristics: id, date,
zipcode, lat, and long.
Drop those columns from our dataset.
house <- subset(house,select=-c(id,date,zipcode,lat,long))
dim(house)
## [1] 21613 16
Exploratory data analysis (EDA) and Future Selection involves using graphics and visualizations to explore and analyze a data set. The goal is to explore, investigate and learn the data variables within our dataset. In this section, we will be analyzing Outliers and the Stepwise Regresion between all of our variables.
In data analysis, the identification of outliers and thus, observations that fall well outside the overall pattern of the data is very important. As a diagnostic tool for spotting observations that may be outliers we may use a boxplot which based on the five-number summary and can be used to provide a graphical display of the center and variation of a data set.
house_price <- house %>%
select(price) %>%
pivot_longer(
price,
names_to = "name")
myboxplotData <- data_to_boxplot(house_price, value, name,
group_var = name,
add_outliers = TRUE,
fillColor = "#176B87",
color = "Black")
highchart()%>%
hc_xAxis(type ="category")%>%
hc_add_series_list(myboxplotData)%>%
hc_title(text= "Price Distribution") %>%
hc_legend(enabled= FALSE) %>%
hc_add_theme(thm) %>%
hc_yAxis(
labels = list(
style = list(
color = "#FAF7E6"))) %>%
hc_xAxis(
labels = list(
style = list(
color = "#FAF7E6")),
reversed = T) %>%
hc_chart(inverted = TRUE)
length(house$price[house$price >= 1130000]) # base on boxplot
## [1] 1146
column <- house$price
Q1 <- quantile(column, 0.25)
Q3 <- quantile(column, 0.75)
IQR <- Q3 - Q1
outliers <- column[column < Q1 - 1.5 * IQR | column > Q3 + 1.5 * IQR]
count <- length(outliers)
count
## [1] 1146
It was found that there were 1146 data outliers in the price column, therefore we decided to remove the data because it is still below 10% of the total data held.
house<- house[!(house$price >= 1130000), ]
Step-wise regression helps us choose a good predictor, by looking for a combination of predictors that produces the best model based on the AIC value. In Project using Both methods
model_all <- lm(formula = price ~.,
data = house)
model_both <- step(object = model_all,
direction = "both",
scope = list(upper = model_all),
trace = F)
summary(model_both)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + sqft_above +
## yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -585451 -91224 -6461 80674 723434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4581689.82565 91288.87538 50.189 < 0.0000000000000002 ***
## bedrooms -13054.08556 1330.96808 -9.808 < 0.0000000000000002 ***
## bathrooms 26249.97596 2309.94061 11.364 < 0.0000000000000002 ***
## sqft_living 86.40031 3.22944 26.754 < 0.0000000000000002 ***
## sqft_lot 0.13971 0.03360 4.158 0.000032293 ***
## floors 54771.62689 2479.67543 22.088 < 0.0000000000000002 ***
## waterfront 78948.74439 18205.76933 4.336 0.000014548 ***
## view 20230.96921 1620.58379 12.484 < 0.0000000000000002 ***
## condition 18733.80035 1607.46682 11.654 < 0.0000000000000002 ***
## grade 91461.61385 1491.93496 61.304 < 0.0000000000000002 ***
## sqft_above -33.72312 3.10237 -10.870 < 0.0000000000000002 ***
## yr_built -2625.98172 47.05355 -55.808 < 0.0000000000000002 ***
## yr_renovated 5.08661 2.64776 1.921 0.0547 .
## sqft_living15 57.61616 2.51772 22.884 < 0.0000000000000002 ***
## sqft_lot15 -0.25527 0.05119 -4.987 0.000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135100 on 20452 degrees of freedom
## Multiple R-squared: 0.5798, Adjusted R-squared: 0.5795
## F-statistic: 2016 on 14 and 20452 DF, p-value: < 0.00000000000000022
Since we are going to be predicting house prices, we want to see
which variable has a strong correlation with Price. Based
on the results above, it turns out that 14 columns
affect the target column.
Before we go into modelling, we are going to do Cross Validation for our dataset first. Why? The Cross Validation procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on unseen data. In this context, we will take 80% of our dataset to train our model and 20% of our dataset to test our prediction to see the model’s performance.
# Set seed to lock the random
set.seed(100)
# Index Sampling
index <- sample(nrow(house), size = nrow(house)*0.80)
# Splitting
house_train <- house[index,] #take 80%
house_test <- house[-index,] #take 20%
After we split our data, now we will train our models using the
house_train data. For the independent variables
(X), we will use the model_both formula that was
created earlier.
model_corr <- lm(formula = model_both,
data = house_train)
summary(model_corr)
##
## Call:
## lm(formula = model_both, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -584370 -90969 -6389 80763 723761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4591394.28562 102486.89388 44.800 < 0.0000000000000002 ***
## bedrooms -12856.07556 1478.20719 -8.697 < 0.0000000000000002 ***
## bathrooms 28527.13920 2589.16942 11.018 < 0.0000000000000002 ***
## sqft_living 81.88942 3.60453 22.719 < 0.0000000000000002 ***
## sqft_lot 0.13374 0.03756 3.561 0.000371 ***
## floors 52822.84947 2771.96503 19.056 < 0.0000000000000002 ***
## waterfront 102014.74490 20349.56615 5.013 0.000000541 ***
## view 20707.34534 1800.16842 11.503 < 0.0000000000000002 ***
## condition 19905.80496 1805.85530 11.023 < 0.0000000000000002 ***
## grade 92806.31765 1672.43494 55.492 < 0.0000000000000002 ***
## sqft_above -31.41861 3.47155 -9.050 < 0.0000000000000002 ***
## yr_built -2637.62325 52.81477 -49.941 < 0.0000000000000002 ***
## yr_renovated 6.08958 2.94614 2.067 0.038753 *
## sqft_living15 58.75218 2.80632 20.936 < 0.0000000000000002 ***
## sqft_lot15 -0.28777 0.05806 -4.957 0.000000725 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135300 on 16358 degrees of freedom
## Multiple R-squared: 0.5802, Adjusted R-squared: 0.5798
## F-statistic: 1615 on 14 and 16358 DF, p-value: < 0.00000000000000022
\[ R^2 = 0.5802 \] It means approximately 58,02% of the observed variation can be explained by the model’s inputs. Moreover, all variables p-values are below 0.05 which means they’re all significant with our dependent variable (Y)
We will predict the unseen data (the test data) using our selected model. To see the comparison between the actual value and the prediction value, you can refer to the table below:
pred_model <- predict(model_corr, newdata = house_test)
pred_comp <- data.frame(Actual.Value = house_test$price, Prediction.Value = pred_model)
rmarkdown::paged_table(pred_comp)
However, seeing only those values and comparing them will be too hard for us to evaluating our model’s performance. In this part, we will introduce you the RMSE or Root Mean Squared Error calculation to calculates our model’s performance: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is one of the most popular measures to estimate the accuracy of our forecasting model’s predicted values versus the actual or observed values while training the regression models. Below is the RMSE’s comparison for our training and test data.
# RMSE of train dataset
RMSE(y_pred = model_corr$fitted.values, y_true = house_train$price)
## [1] 135248.4
# Check the range
range(house_train$price)
## [1] 75000 1127500
# RMSE of test dataset
RMSE(y_pred = pred_model, y_true = house_test$price)
## [1] 134438
# Check the range
range(house_test$price)
## [1] 85000 1125000
As shown above, there is a significant difference in the RMSE comparison of the test and training datasets. In addition, these two RMSE scores are definitely understated compared to the range of values for the two data sets. This means that the model indicated Underfitting in the model
Our Linear Regression model is expected to have normally distributed error which means majority of the errors are expected to be around 0.
res <- data.frame(residual = model_corr$residuals)
res %>%
e_charts() %>%
e_histogram(residual, name = "Error", legend = F) %>%
e_title(
text = "Normality of Residuals",
textStyle = list(fontFamily = "Cardo", fontSize = 25),
subtext = "Normally Distributed Error",
subtextStyle = list(fontFamily = "Cardo", fontSize = 15, fontStyle = "italic"),
left = "center") %>%
e_hide_grid_lines() %>%
e_tooltip() %>%
e_theme_custom("formater.json") %>%
e_x_axis(axisLabel = list(fontFamily = "Cardo")) %>%
e_y_axis(axisLabel = list(color = "#FAF7E6"))
As shown above, our Error is mostly concentrated around 0 but skewed slightly to the right
Also, if you’re probably wondering why we don’t do a Shapiro Test for these assumptions? That’s because in the R language, the Shapiro Test is only limited to 5000 data whereas our dataset has much more than that. We are concerned that if we only use a subset of the data, the test itself will not be able to fully represent all of the assumptions. Therefore, we will only use our visualization above to test this particular assumption.
To test our Error’s homoscedasticity, we will use Breusch-Pagan hypothesis test:
bptest(model_corr)
##
## studentized Breusch-Pagan test
##
## data: model_corr
## BP = 725.78, df = 14, p-value < 0.00000000000000022
To satisfy the Homosscedasticity assumption, we must fail to reject Hypothesis H0 we must have a p-value > 0.05. Based on the results above, our p-value model is smaller than 0.05. This implies that Our errors are not constant, so the error is not homoscedasticity and The assumption of homoscedasticity is not met.
Multicollinearity is a condition when there are strong correlations between our Independent Variable (X). We don’t want these variables to be our predictor since they’re redundant and we are going to choose only one variable out of it. To test this assumption, we will apply Variance Inflation Factor (VIF) test with conditions:
To fulfill this assumption, all of our VIF Scores should be below 10 or No Multicollinearity happened.
vif(model_corr)
## bedrooms bathrooms sqft_living sqft_lot floors
## 1.640552 3.025210 7.000226 1.966997 1.981423
## waterfront view condition grade sqft_above
## 1.104949 1.214153 1.213539 2.675233 5.698676
## yr_built yr_renovated sqft_living15 sqft_lot15
## 2.113134 1.131437 2.662341 1.997037
Evidently from the results above, the VIF scores for all our Independent Variable (X) are all below 10 which indicates that the No Multicollinearity assumption is fulfilled.
From the conclusion above, it is evident that our 14 Independent Variables proof to be useful to predict the House’s Prices.
In addition, our model has failed to meet several classic assumptions Normality, Homoscedasticity, and No Multicollinearity. namely for Normality and Homosscedasticity
In addition, the R-Squared model is not very high at around 58.02% the observed variation can be explained by our model inputs and the accuracy (our RMSE score) of the model in predicting house prices is quite good despite indication Underfitting. Therefore, this particular model has not been very well used for our prediction efforts.
A work by Ahmad Fauzi
wrk.ahmadfauzi@gmail.com