Learning to use linear regression model using “housing prediction” dataset from Kaggle. We want to understand the relationship between variables to predict the housing price.
Housing prices are influenced by a variety of factors. Some of the key considerations that can affect housing prices are location (the proximity of housing locations to job centers, schools, shopping centers, and other public areas), economic conditions (economic health of the area, include job opportunities and income levels), housing supply and demand, interest rates, demographic, government policies, condition and age of the property, property size and features, neighborhood trends, market sentiment, infrastructure and development, environmental factors, interest from investors, and housing market cycles.
Based on the description above, we want to understand the variable that influence the housing price. So that housing developer can create house design and business housing strategy based on the demand pricing of houses.
Install the package required.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
##
## The following object is masked from 'package:base':
##
## Recall
Load the dataset.
housing <- read.csv("Housing prediction.csv")
head(housing)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
glimpse(housing)
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <chr> "20141013T000000", "20141209T000000", "20150225T000000",…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
The data has 21,613 rows and 21 columns, with the descriptions below. - id : unique identifier for house - date : date the house is listed - price : housing price - bedrooms : number of bedrooms - bathrooms: number of bathrooms - sqft_living : living room area (sq ft) - sqft_lot : lot area (sq ft) - floors : number of floors - waterfront : presence of waterfront (0 = no, 1 = yes) - view : presence of view (0 = no, 1 = yes) - condition : house condition (1 - 5 (excellent)) - grade : the mark of housing assessment and categorization based on the quality, condition, and overall features of housing property (the bigger the number, the better) - sqft_above : floor area above the ground (sq ft) - sqft_basement : floor area beneath the ground (sq ft) - yr_built : the year the house is built (we assume ) - yr_renovated : the year the house is renovated - zipcode : the house location code area - lat : house location latitude - long : house location longitude - sqft_living15 : living room area (sq ft) - sqft_lot15 : lot area (sq ft)
First, we need to filter the columns that is a unique identifier that we can ignore, and columns that have the same information with other columns. Columns that is filtered is explained below: - id - date - lat - long - sqft_living15 - sqft_lot15
Based on the data above, there are some column data that isn’t the correct data type and needs to be transformed.
housing <- housing %>% select(- id,
- date,
- lat,
- long,
- sqft_living15,
- sqft_lot15) %>%
mutate(bathrooms = as.integer(bathrooms),
floors = as.integer(floors))
Thus, our target variable is the housing price. While other variables are considered as the variables to predict the housing price.
Exploratory data analysis is carried out to determine the condition of our data and gain useful insights for further analysis by using ggcor().
ggcorr(housing, label = T, hjust = T, layout.exp = 3)
Based on the correlation matrix, price has a strong correlation with the living room area (0.7) and housing grade (0.7).
We will build a regression linear model with living room area and hosusing grade as the predictor variable, because of the strong correlation towards the target variable (price).
model_profit_multiple <- lm(formula = price ~ sqft_living + grade , data = housing)
summary(model_profit_multiple)
##
## Call:
## lm(formula = price ~ sqft_living + grade, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1065457 -138304 -25043 100447 4794633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.981e+05 1.330e+04 -44.98 <2e-16 ***
## sqft_living 1.844e+02 2.869e+00 64.29 <2e-16 ***
## grade 9.855e+04 2.241e+03 43.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 250500 on 21610 degrees of freedom
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5345
## F-statistic: 1.241e+04 on 2 and 21610 DF, p-value: < 2.2e-16
Based on the model above, we interpret that: - When the area increases by one unit, the price increases by 184.4 points, provided that the other predictor values remain constant. - When the quality of the house (grade) increases by one unit, the price increases by 98550 points, provided that the values of other predictors remain constant.
Based on the model above, multiple linear regression is used. Thus, 53.45% (Adjusted R-Squared) of the variance of the target (price) can be explained by the selected predictor.
Next, we will try to automatically select predictor variables using step-wise regression with the backward elimination method.
model_all <- lm(formula = price ~ . , data = housing)
model_backward <- step (model_all, direction = "backward")
## Start: AIC=530952.4
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + yr_renovated + zipcode
##
##
## Step: AIC=530952.4
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## yr_renovated + zipcode
##
## Df Sum of Sq RSS AIC
## - sqft_above 1 1.4447e+10 1.0074e+15 530951
## - zipcode 1 8.4591e+10 1.0074e+15 530952
## <none> 1.0073e+15 530952
## - yr_renovated 1 1.9053e+11 1.0075e+15 530954
## - floors 1 1.5505e+12 1.0089e+15 530984
## - sqft_lot 1 2.3794e+12 1.0097e+15 531001
## - condition 1 2.8193e+12 1.0102e+15 531011
## - bathrooms 1 1.3923e+13 1.0213e+15 531247
## - bedrooms 1 1.6820e+13 1.0242e+15 531308
## - view 1 1.8743e+13 1.0261e+15 531349
## - waterfront 1 4.4660e+13 1.0520e+15 531888
## - sqft_living 1 6.9719e+13 1.0771e+15 532397
## - yr_built 1 1.0686e+14 1.1142e+15 533130
## - grade 1 1.5690e+14 1.1643e+15 534079
##
## Step: AIC=530950.7
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + yr_built + yr_renovated +
## zipcode
##
## Df Sum of Sq RSS AIC
## - zipcode 1 7.5762e+10 1.0074e+15 530950
## <none> 1.0074e+15 530951
## - yr_renovated 1 1.9319e+11 1.0076e+15 530953
## - floors 1 1.6739e+12 1.0090e+15 530985
## - sqft_lot 1 2.4430e+12 1.0098e+15 531001
## - condition 1 2.8852e+12 1.0102e+15 531011
## - bathrooms 1 1.4167e+13 1.0215e+15 531251
## - bedrooms 1 1.6807e+13 1.0242e+15 531306
## - view 1 1.9344e+13 1.0267e+15 531360
## - waterfront 1 4.4655e+13 1.0520e+15 531886
## - yr_built 1 1.0687e+14 1.1142e+15 533128
## - sqft_living 1 1.3079e+14 1.1382e+15 533587
## - grade 1 1.6046e+14 1.1678e+15 534143
##
## Step: AIC=530950.3
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## <none> 1.0074e+15 530950
## - yr_renovated 1 2.0634e+11 1.0076e+15 530953
## - floors 1 1.6108e+12 1.0091e+15 530983
## - sqft_lot 1 2.3817e+12 1.0098e+15 530999
## - condition 1 3.0491e+12 1.0105e+15 531014
## - bathrooms 1 1.4103e+13 1.0215e+15 531249
## - bedrooms 1 1.6736e+13 1.0242e+15 531304
## - view 1 1.9305e+13 1.0267e+15 531359
## - waterfront 1 4.4699e+13 1.0521e+15 531887
## - yr_built 1 1.1849e+14 1.1259e+15 533352
## - sqft_living 1 1.3209e+14 1.1395e+15 533611
## - grade 1 1.6045e+14 1.1679e+15 534143
summary(model_backward)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + view + condition + grade + yr_built +
## yr_renovated, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1302488 -109469 -9727 89041 4200488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.152e+06 1.368e+05 44.976 < 2e-16 ***
## bedrooms -3.797e+04 2.004e+03 -18.943 < 2e-16 ***
## bathrooms 5.350e+04 3.077e+03 17.389 < 2e-16 ***
## sqft_living 1.690e+02 3.175e+00 53.218 < 2e-16 ***
## sqft_lot -2.596e-01 3.632e-02 -7.146 9.21e-13 ***
## floors 2.095e+04 3.564e+03 5.877 4.24e-09 ***
## waterfront 5.759e+05 1.860e+04 30.958 < 2e-16 ***
## view 4.517e+04 2.220e+03 20.345 < 2e-16 ***
## condition 2.007e+04 2.482e+03 8.086 6.51e-16 ***
## grade 1.249e+05 2.129e+03 58.655 < 2e-16 ***
## yr_built -3.549e+03 7.041e+01 -50.403 < 2e-16 ***
## yr_renovated 8.200e+00 3.898e+00 2.103 0.0354 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216000 on 21601 degrees of freedom
## Multiple R-squared: 0.6541, Adjusted R-squared: 0.654
## F-statistic: 3714 on 11 and 21601 DF, p-value: < 2.2e-16
This step-wise regression method will produce an optimum formula based on the lowest AIC value, where the lower the AIC value, the smaller the value of observations that are not captured.
When compared with the initial model (predictor variable = sqft_living and grade), using bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, yr_built, and yr_renovated as the predictor variable has an adjusted R-squared of 0.654, higher than the previous model with an adjusted R-squared of 0.5345.
Based on the model using backward elimination method, predictor variabels that are significantly imapcts the variable target are bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, yr_built, and yr_renovated.
Thus, the model candidate are:
Price = -3.797e+04 (bedrooms) + 5.350e+04 (bathrooms) + 1.690e+02 (sqft_living) - 2.596e-01 (sqft_lot) + 2.095e+04 (floors) + 5.759e+05 (waterfront) + 4.517e+04 (view) + 2.007e+04 (condition) + 1.249e+05 (grade) - 3.549e+03 (yr_built) + 8.200e+00 (yr_renovated)
Price = 1.844e+02(sqft_living) + 9.855e+04(grade) - 5.981e+05
We will make predictions on our data for each model. This is done because we will evaluate the prediction results for each model.
housing_pred <- data.frame(price = housing$price) #for dataframe
housing_pred$model_signif <- predict (model_profit_multiple, housing)
housing_pred$model_backward <- predict(model_backward, housing)
head(housing_pred)
## price model_signif model_backward
## 1 221900 309390.4 306045.3
## 2 538000 565734.5 645473.1
## 3 180000 135223.4 226804.5
## 4 604000 453238.2 511679.1
## 5 510000 500155.3 454710.9
## 6 1225000 1485551.3 1456306.2
Based on the table above, the two models have quite different prediction results from the actual data.
Error / Residual is the difference between the predicted and actual values of the target variable. Calculating the error using Root Mean Squared Error (RMSE) to minimize big error sensitivity and to be interpreted.
RMSE (y_pred = housing_pred$model_signif,
y_true = housing_pred$price)
## [1] 250475.5
RMSE (y_pred = housing_pred$model_backward,
y_true = housing_pred$price)
## [1] 215899.7
Based on the calculation, model_backward with bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, yr_built, and yr_renovated as the predictor variable (215899.7) has a smaller RMSE value than model_signif with sqft_living and grade as the predictor variable (250475.5).
Thus, prediction using model_backward will occur an average error of ± 215899.7. While prediction using model_signif will occur an average error of ± 250475.5.
model_backward has an adjusted R-square of 0.654 and RMSE value of 215899.7. Thus, it has a better model than model_signif. Based on this model, the housing price has a strong correlation with bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, yr_built, and yr_renovated. In other words, housing price will be influence with those predictor variables. The increase in house prices will be influenced by the number of bathrooms, the size of the living room, the number of floors, the presence of waterfront and views, the condition of the house, the quality of the house, and the year the house was renovated. Meanwhile, house prices will decrease, influenced by the number of bedrooms, lot size, and year the house was built.