Description

About this file Online property companies offer valuations of houses using machine learning techniques. The aim of this report is to predict the house sales in King County, Washington State, USA using Simple Linear Regression (SLR) Multiple Linear Regression (MLR), and Polynomial Regression. The dataset consisted of historic data of houses sold between May 2014 to May 2015.” The dataset consists of house prices from King County an area in the US State of Washington, this data also covers Seattle. The dataset was obtained from Kaggle. This data was published/released under CC0: Public Domain. Unfortunately, the user has not indicated the source of the data. Please find the citation and database description in the Glossary and Bibliography. The dataset consisted of 21 variables and 21613 observations.

The dataset link: here

The raport is structured as follow:

  1. Data Extraction.

  2. Exploratory Data Analysis.

  3. Data Preparation/ Preprocessing.

  4. Modelling

  5. Evaluations

  6. Recommendations

1. Data Extraction

Read the house dataset and its structure

houseData_df <- read.csv("csv/houseData.csv")
str(houseData_df)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

The dataset contains 21613 observations and 21 variables. The target variable is price

Statistical Summary

##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200

2. Exploratory Data Analysis

2.1 Univariate Data Analysis

ggplot(data=houseData_df,
       aes(y=price))+
  geom_boxplot()+
  scale_y_continuous(limits=c(0,2000000))

2.2 Bivariate Data Analysis

#
ggplot(houseData_df, 
       aes(x=sqft_living,
           y=price,
           color=floors)) +
  geom_point() +
  labs(title="Grade VS Price",
       x = "Grade",
       y = "Price") +
  scale_y_continuous(limits = c(0, 2500000))

ggplot(houseData_df, 
       aes(x=grade,
           y=price,
           color=floors)) +
  geom_point() +
  labs(title="Grade VS Price",
       x = "Grade",
       y = "Price") +
  scale_y_continuous(limits = c(0, 2500000))

ggplot(houseData_df, 
       aes(x=yr_built,
           y=price,
           color=floors)) +
  geom_point() +
  labs(title="Year Built vs Price",
       x="Year Built",
       y="Price") +
  scale_y_continuous(limits = c(0, 2000000))

ggplot(data=houseData_df,
       aes(x=bedrooms2, y=price))+
  geom_boxplot()+
  scale_y_continuous(limits=c(0,2000000))

2.3 Multivariate Data Analysis

houseData_df_num <- houseData_df[, 3:14]
r <- cor(houseData_df_num)
r
##                    price     bedrooms   bathrooms sqft_living     sqft_lot
## price         1.00000000  0.308349598  0.52513751  0.70203505  0.089660861
## bedrooms      0.30834960  1.000000000  0.51588364  0.57667069  0.031703243
## bathrooms     0.52513751  0.515883638  1.00000000  0.75466528  0.087739662
## sqft_living   0.70203505  0.576670693  0.75466528  1.00000000  0.172825661
## sqft_lot      0.08966086  0.031703243  0.08773966  0.17282566  1.000000000
## floors        0.25679389  0.175428935  0.50065317  0.35394929 -0.005200991
## waterfront    0.26636943 -0.006582479  0.06374363  0.10381782  0.021603683
## view          0.39729349  0.079531852  0.18773702  0.28461119  0.074710106
## condition     0.03636179  0.028472104 -0.12498193 -0.05875259 -0.008958250
## grade         0.66743426  0.356966725  0.66498253  0.76270448  0.113621124
## sqft_above    0.60556730  0.477600161  0.68534248  0.87659660  0.183512281
## sqft_basement 0.32381602  0.303093375  0.28377003  0.43504297  0.015286202
##                     floors   waterfront       view   condition       grade
## price          0.256793888  0.266369434 0.39729349  0.03636179  0.66743426
## bedrooms       0.175428935 -0.006582479 0.07953185  0.02847210  0.35696673
## bathrooms      0.500653173  0.063743629 0.18773702 -0.12498193  0.66498253
## sqft_living    0.353949290  0.103817818 0.28461119 -0.05875259  0.76270448
## sqft_lot      -0.005200991  0.021603683 0.07471011 -0.00895825  0.11362112
## floors         1.000000000  0.023698320 0.02944382 -0.26376795  0.45818251
## waterfront     0.023698320  1.000000000 0.40185735  0.01665316  0.08277491
## view           0.029443820  0.401857351 1.00000000  0.04598974  0.25132058
## condition     -0.263767946  0.016653157 0.04598974  1.00000000 -0.14467367
## grade          0.458182514  0.082774914 0.25132058 -0.14467367  1.00000000
## sqft_above     0.523884710  0.072074592 0.16764934 -0.15821362  0.75592294
## sqft_basement -0.245704542  0.080587939 0.27694658  0.17410491  0.16839182
##                sqft_above sqft_basement
## price          0.60556730    0.32381602
## bedrooms       0.47760016    0.30309338
## bathrooms      0.68534248    0.28377003
## sqft_living    0.87659660    0.43504297
## sqft_lot       0.18351228    0.01528620
## floors         0.52388471   -0.24570454
## waterfront     0.07207459    0.08058794
## view           0.16764934    0.27694658
## condition     -0.15821362    0.17410491
## grade          0.75592294    0.16839182
## sqft_above     1.00000000   -0.05194331
## sqft_basement -0.05194331    1.00000000
library(corrgram)
corrgram(houseData_df_num, order=TRUE,
         upper.panel=panel.pie)

3. Data Preparations

3.1 Data Cleaning

Remove rows with outlier prices

out_price <- boxplot.stats(houseData_df_num$price)$out

out_idx <- which(houseData_df_num$price %in% c(out_price))

houseData_df_num <- houseData_df_num[-out_idx,]

summary(houseData_df_num)
##      price            bedrooms       bathrooms      sqft_living  
##  Min.   :  75000   Min.   : 0.00   Min.   :0.000   Min.   : 290  
##  1st Qu.: 315000   1st Qu.: 3.00   1st Qu.:1.500   1st Qu.:1400  
##  Median : 437500   Median : 3.00   Median :2.000   Median :1860  
##  Mean   : 476985   Mean   : 3.33   Mean   :2.052   Mean   :1976  
##  3rd Qu.: 600000   3rd Qu.: 4.00   3rd Qu.:2.500   3rd Qu.:2431  
##  Max.   :1127500   Max.   :33.00   Max.   :7.500   Max.   :7480  
##     sqft_lot           floors        waterfront           view       
##  Min.   :    520   Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:   5000   1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :   7500   Median :1.000   Median :0.00000   Median :0.0000  
##  Mean   :  14610   Mean   :1.476   Mean   :0.00298   Mean   :0.1727  
##  3rd Qu.:  10319   3rd Qu.:2.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1651359   Max.   :3.500   Max.   :1.00000   Max.   :4.0000  
##    condition         grade          sqft_above   sqft_basement   
##  Min.   :1.000   Min.   : 1.000   Min.   : 290   Min.   :   0.0  
##  1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:1170   1st Qu.:   0.0  
##  Median :3.000   Median : 7.000   Median :1520   Median :   0.0  
##  Mean   :3.406   Mean   : 7.531   Mean   :1708   Mean   : 267.2  
##  3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:2100   3rd Qu.: 510.0  
##  Max.   :5.000   Max.   :12.000   Max.   :5710   Max.   :2720.0
nrow(houseData_df_num)
## [1] 20467

Number of observations after the preprocessing is 20467.

3.2 Feature Extraction

houseData_df <- houseData_df[rownames(houseData_df_num),]

#Create a dataframe for statezip
zipcode <- houseData_df$zipcode
zipcode_df <- data.frame(zipcode)
colnames(zipcode_df) <- c("loc.")

#One Hot Encoding on Statezip
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill
df1 <- dummyVars("~.", data=zipcode_df)
df2 <- data.frame(predict(df1, newdata=zipcode_df))

## Combine house_df with satezip_df
houseData_df_num <- cbind(houseData_df_num, df2)
dim(houseData_df_num)
## [1] 20467    13

3.3 Training and Testing Data Division

set.seed(2022)
train_idx <- sample(m, m*0.7)
train_idx[1:5]
## [1] 20708  9651  7886  2871 12107
train_data <- houseData_df_num[train_idx,]
test_data <- houseData_df_num[-train_idx,]

dim(train_data)
## [1] 15129    13
dim(test_data)
## [1] 6121   13

4. Modeling

mymodel <- lm(formula=price~. +
                I(sqft_living^2), data = train_data)
summary(mymodel)
## 
## Call:
## lm(formula = price ~ . + I(sqft_living^2), data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -501336 -103902  -11523   88270  824159 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.073e+07  2.407e+06 -16.922  < 2e-16 ***
## bedrooms         -8.674e+03  1.717e+03  -5.051 4.45e-07 ***
## bathrooms        -1.473e+04  2.795e+03  -5.269 1.39e-07 ***
## sqft_living       1.378e+02  7.763e+00  17.752  < 2e-16 ***
## sqft_lot          8.026e-03  3.243e-02   0.247  0.80454    
## floors            1.602e+04  3.146e+03   5.092 3.58e-07 ***
## waterfront        6.011e+04  2.292e+04   2.622  0.00875 ** 
## view              3.397e+04  2.079e+03  16.344  < 2e-16 ***
## condition         4.204e+04  1.969e+03  21.351  < 2e-16 ***
## grade             8.451e+04  1.818e+03  46.494  < 2e-16 ***
## sqft_above       -2.566e+01  3.957e+00  -6.485 9.17e-11 ***
## sqft_basement            NA         NA      NA       NA    
## loc.              4.103e+02  2.452e+01  16.733  < 2e-16 ***
## I(sqft_living^2) -3.062e-03  1.382e-03  -2.215  0.02675 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145700 on 14333 degrees of freedom
##   (783 observations deleted due to missingness)
## Multiple R-squared:  0.5098, Adjusted R-squared:  0.5094 
## F-statistic:  1242 on 12 and 14333 DF,  p-value: < 2.2e-16

5. Evaluations

actual <- test_data$price

pred.mymodel <- predict(mymodel, test_data)
## Warning in predict.lm(mymodel, test_data): prediction from a rank-deficient fit
## may be misleading

Plot Actual vs Prediction Series

price_df <- data.frame(actual, pred.mymodel)

ggplot(data=price_df,
       aes(x=actual, y=pred.mymodel))+
  geom_point()+
  scale_x_continuous(limits=c(0,2000000))+
  scale_y_continuous(limits=c(0,2000000))
## Warning: Removed 2 rows containing missing values (geom_point).

Regression Metrics Measurement

performance <- function(actual, prediction, method){
  e  <- prediction-actual
  se <- e^2
  sse<- sum(se)
  mse<- mean(se)
  rmse<-sqrt(mse)
  
  result <- paste("Method:", method,
                  "\n SSE:", round(sse,3),
                  "\n MSE:", round(mse,3),
                  "\n RMSE:", round(rmse,3))
  cat(result)
}

performance(actual, pred.mymodel, "Regression Model")
## Method: Regression Model 
##  SSE: 131342375304986 
##  MSE: 21457666280.834 
##  RMSE: 146484.355