Description

This report provides house price prediction using regression algorithms. The dataset used in this report for modeling is house data in Australia. The dataset link : here

The report is structured as follows :
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparations
4. Modeling
5. Evaluations
6. Recommendations

1. Data Extraction

Import necessary libraries

rm(list=ls())
library(ggplot2)

Read house dataset and its structure

house_df <- read.csv("data/house.csv")

# structure of dataframe
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

The dataset contains 4600 observations and 18 variables. The target variable is price and the remaining are features

# data dimension
d <- dim(house_df)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns

# statistical summary
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

2. Exploratory Data Analysis

2.1 Univariate Data Analysis

Plot distribution of price (Boxplot)

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

2.2 Bivariate Data Analysis

## house price based on number of bedrooms
## casting bedrooms from numeric to factor
house_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$city2 <- factor(house_df$city)
house_df$statezip2 <- factor(house_df$statezip)
house_df$street2 <- factor(house_df$street)
house_df$country2 <- factor(house_df$country)
ggplot(data = house_df, aes(x = bedrooms2,
                            y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

2.3 Multivariate Data Analysis

cor(house_df$price, house_df$bedrooms)
## [1] 0.2003363
cor(house_df$price, house_df$bathrooms)
## [1] 0.3271099
cor(house_df$bedrooms, house_df$bathrooms)
## [1] 0.5459199
house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)

library(corrgram)
corrgram(house_df_num, order=TRUE,
         upper.panel = panel.pie)

3. Data Preparations

3.1 Data Cleaning

Remove rows with incorrect value prices

idx <- which(house_df_num$price %in% c(0))

# remove those rows
house_df_num <- house_df_num[-idx, ]
summary(house_df_num$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7800   326264   465000   557906   657500 26590000

Remove rows with outliers in prices

out_price <- boxplot.stats(house_df_num$price)$out
out_price
##   [1]  2384000  1400000  1200000  1395000  2280000  1425000  1225000  1580000
##   [9]  1870000  1800000  3200000  1500000  1340000  1532500  1750000  1500000
##  [17]  1307000  1250000  2400000  1370000  1157200  1702500  2238888  1225000
##  [25]  1200000  1575000  1315000  1300000  1346400  1225000  1264000  1895000
##  [33]  2555000  1550000  1400000  1680000  1400000  1222500  1255000  2100000
##  [41]  2000000  1270000  1216000  1405000  1190000  1300000  1212500  1275000
##  [49]  1800000  1690000  1506000  1580000  1220000  1185000  1550000  2700000
##  [57]  1600000  1170000  1180500  2150000  1325000  1610000  1200000  2000000
##  [65]  1680000  1570000  1795000  1655000  1381000  1600000  1710000  1285000
##  [73]  1175000  1400000  1210000  1970000  1400000  2453500  3100000  2750000
##  [81]  1425000  2400000  1385000  1365000  1886700  1256500  3710000  1505000
##  [89]  1595000  1250000  1965221  1298000  1270000  1901000  2400000  2005000
##  [97]  1300000  1800000  1415000  1300000  1370000  1280000  2700000  1635000
## [105]  1339000  1350000  1900000  1225000  1230000  1240000  1795000  1180000
## [113]  7062500  2888000  1365000  1325000  1485000  2250000  1184000  1228000
## [121]  1688000  1240000  1335000  1735000  1695000  1735000  1712500  1920000
## [129]  4668000  1170000  1200000  1538000  1185001  2027000  2475000  1728000
## [137]  1555000  1325000  1384000  1600000  4489000  1400000  3000000  1160000
## [145]  1399950  1525000  1325000  1309500  2200000  1440000  1240000  1815000
## [153]  1950000  1300000  1462497  2110000  2400000  1320000  1240000  2075000
## [161]  1580000  1220000  1700000  1215000  1157400  2466350  1295648  1250000
## [169]  1200000  2300000  1229000  1355000  1700000  1330000  1280000  1250000
## [177]  1340000  1388000  1990000  1387800  1410000  1619999  1820000  1195000
## [185]  1600000  1200000  1165000  1230000  1925000  1730000  1309500  1510000
## [193]  2065000  1198000  1297000  3800000  1695000  1640000  2300000  1565000
## [201]  2367000  1356925  1300000  2271150  2147500  1411600  1195000  1875000
## [209]  1755000  1625000  2351956  1200000  1738000  1280000  1275000  1636000
## [217]  1450000  1360000  1300000  1195000  2458000  1205000  1675000  2180000
## [225]  1355000  1465000  1250000  1681000  2680000  2321000  1240000  1550000
## [233]  1234582 12899000  2110000  2199900 26590000  2560498  1337044  1288333
# get indices
out_idx <- which(house_df_num$price %in% c(out_price))

# remove those rows
house_df_num <- house_df_num[ -out_idx, ]
summary(house_df_num$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7800  320000  450000  487457  615000 1150000

The max value on price is free from 1150000

nrow(house_df_num)
## [1] 4311

Number of observation: 4311

3.2 Feature Extraction

One Hot Encoding for location features

house_df <- house_df[ rownames(house_df_num), ]

#### 1. Create dataframe for statezip (available on house_df)
statezip <- house_df$statezip
statezip_df <- data.frame(statezip)
colnames(statezip_df) <- c("loc.")

#### 2. OHE on statezip dataframe
library(caret)
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))

#### 3. Combine to house_df_num dataframe
#### 11 + 77 = 88 columns
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4311   88

3.3 Training and Testing Division

d <- dim(house_df_num)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns

## ratio train:test = 70:30
## randomly divided
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]
## [1] 2214  442  743
train_data <- house_df_num[ train_idx , ]
test_data <- house_df_num[ -train_idx , ]

dim(train_data)
## [1] 3017   88
dim(test_data)
## [1] 1294   88

4. Modeling

Create regresion model

mymodel <- lm(formula = price ~ . + 
                I(sqft_living^2) +
                sqft_living:bedrooms,
             data = train_data)
summary(mymodel)
## 
## Call:
## lm(formula = price ~ . + I(sqft_living^2) + sqft_living:bedrooms, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -872020  -52263   -1836   48176  520918 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.000e+05  1.037e+05  -0.964 0.335018    
## bedrooms              1.127e+04  7.435e+03   1.516 0.129518    
## bathrooms             2.191e+04  4.317e+03   5.075 4.11e-07 ***
## sqft_living           1.702e+02  1.137e+01  14.960  < 2e-16 ***
## sqft_lot              2.743e-01  5.575e-02   4.920 9.13e-07 ***
## floors               -3.338e+04  5.251e+03  -6.356 2.39e-10 ***
## waterfront            1.354e+05  3.682e+04   3.678 0.000239 ***
## view                  3.672e+04  3.262e+03  11.257  < 2e-16 ***
## condition             2.308e+04  3.043e+03   7.585 4.44e-14 ***
## sqft_above            8.334e+01  6.353e+00  13.117  < 2e-16 ***
## sqft_basement                NA         NA      NA       NA    
## loc.WA.98001         -1.046e+05  1.026e+05  -1.020 0.307959    
## loc.WA.98002         -1.127e+05  1.037e+05  -1.087 0.277266    
## loc.WA.98003         -9.356e+04  1.030e+05  -0.908 0.363934    
## loc.WA.98004          4.455e+05  1.036e+05   4.299 1.77e-05 ***
## loc.WA.98005          2.412e+05  1.042e+05   2.315 0.020657 *  
## loc.WA.98006          1.915e+05  1.023e+05   1.872 0.061294 .  
## loc.WA.98007          2.161e+05  1.036e+05   2.086 0.037080 *  
## loc.WA.98008          1.237e+05  1.030e+05   1.201 0.229827    
## loc.WA.98010          5.043e+03  1.112e+05   0.045 0.963840    
## loc.WA.98011          4.492e+04  1.037e+05   0.433 0.664739    
## loc.WA.98014          5.046e+03  1.052e+05   0.048 0.961734    
## loc.WA.98019         -8.018e+03  1.033e+05  -0.078 0.938134    
## loc.WA.98022         -1.214e+05  1.040e+05  -1.167 0.243275    
## loc.WA.98023         -1.179e+05  1.022e+05  -1.153 0.248916    
## loc.WA.98024          6.126e+04  1.097e+05   0.559 0.576464    
## loc.WA.98027          9.721e+04  1.025e+05   0.949 0.342795    
## loc.WA.98028          5.189e+04  1.026e+05   0.506 0.612964    
## loc.WA.98029          1.634e+05  1.024e+05   1.595 0.110740    
## loc.WA.98030         -9.482e+04  1.034e+05  -0.917 0.359028    
## loc.WA.98031         -9.087e+04  1.027e+05  -0.885 0.376375    
## loc.WA.98032         -1.182e+05  1.051e+05  -1.125 0.260848    
## loc.WA.98033          2.388e+05  1.024e+05   2.331 0.019809 *  
## loc.WA.98034          1.069e+05  1.024e+05   1.044 0.296346    
## loc.WA.98038         -7.578e+04  1.023e+05  -0.741 0.458861    
## loc.WA.98039         -3.825e+05  1.436e+05  -2.663 0.007786 ** 
## loc.WA.98040          3.101e+05  1.030e+05   3.009 0.002641 ** 
## loc.WA.98042         -9.962e+04  1.022e+05  -0.975 0.329789    
## loc.WA.98045         -6.168e+03  1.029e+05  -0.060 0.952219    
## loc.WA.98047         -9.415e+04  1.112e+05  -0.847 0.397284    
## loc.WA.98050          8.550e+03  1.441e+05   0.059 0.952686    
## loc.WA.98051         -2.062e+04  1.116e+05  -0.185 0.853467    
## loc.WA.98052          1.918e+05  1.020e+05   1.879 0.060306 .  
## loc.WA.98053          1.514e+05  1.023e+05   1.480 0.138858    
## loc.WA.98055         -5.428e+04  1.043e+05  -0.520 0.602784    
## loc.WA.98056          2.489e+04  1.024e+05   0.243 0.808066    
## loc.WA.98057         -7.989e+04  1.060e+05  -0.753 0.451264    
## loc.WA.98058         -5.394e+04  1.023e+05  -0.527 0.597992    
## loc.WA.98059          5.430e+03  1.022e+05   0.053 0.957646    
## loc.WA.98065          4.819e+04  1.027e+05   0.469 0.639102    
## loc.WA.98068          1.495e+05  1.437e+05   1.041 0.298156    
## loc.WA.98070          7.905e+03  1.044e+05   0.076 0.939640    
## loc.WA.98072          1.012e+05  1.027e+05   0.985 0.324596    
## loc.WA.98074          1.267e+05  1.023e+05   1.239 0.215417    
## loc.WA.98075          1.705e+05  1.025e+05   1.663 0.096330 .  
## loc.WA.98077          1.219e+05  1.032e+05   1.181 0.237887    
## loc.WA.98092         -1.075e+05  1.024e+05  -1.050 0.293706    
## loc.WA.98102          3.308e+05  1.050e+05   3.150 0.001648 ** 
## loc.WA.98103          2.502e+05  1.021e+05   2.450 0.014339 *  
## loc.WA.98105          3.583e+05  1.039e+05   3.447 0.000575 ***
## loc.WA.98106          3.489e+04  1.027e+05   0.340 0.734118    
## loc.WA.98107          2.486e+05  1.027e+05   2.421 0.015551 *  
## loc.WA.98108          2.727e+04  1.032e+05   0.264 0.791644    
## loc.WA.98109          4.323e+05  1.066e+05   4.055 5.13e-05 ***
## loc.WA.98112          3.731e+05  1.031e+05   3.619 0.000301 ***
## loc.WA.98115          2.409e+05  1.021e+05   2.360 0.018358 *  
## loc.WA.98116          2.219e+05  1.027e+05   2.160 0.030848 *  
## loc.WA.98117          2.241e+05  1.021e+05   2.195 0.028250 *  
## loc.WA.98118          8.617e+04  1.025e+05   0.841 0.400645    
## loc.WA.98119          3.310e+05  1.034e+05   3.200 0.001389 ** 
## loc.WA.98122          2.650e+05  1.028e+05   2.578 0.009984 ** 
## loc.WA.98125          8.670e+04  1.024e+05   0.847 0.397056    
## loc.WA.98126          1.142e+05  1.024e+05   1.115 0.264930    
## loc.WA.98133          6.011e+04  1.023e+05   0.588 0.556720    
## loc.WA.98136          1.694e+05  1.029e+05   1.646 0.099898 .  
## loc.WA.98144          1.748e+05  1.025e+05   1.706 0.088124 .  
## loc.WA.98146          4.353e+04  1.029e+05   0.423 0.672173    
## loc.WA.98148         -3.307e+04  1.085e+05  -0.305 0.760587    
## loc.WA.98155          3.714e+04  1.024e+05   0.363 0.716785    
## loc.WA.98166          9.244e+02  1.027e+05   0.009 0.992821    
## loc.WA.98168         -5.496e+04  1.026e+05  -0.535 0.592420    
## loc.WA.98177          1.633e+05  1.029e+05   1.587 0.112601    
## loc.WA.98178         -5.612e+04  1.032e+05  -0.544 0.586724    
## loc.WA.98188         -9.531e+04  1.049e+05  -0.909 0.363523    
## loc.WA.98198         -9.769e+04  1.028e+05  -0.950 0.341938    
## loc.WA.98199          2.732e+05  1.030e+05   2.653 0.008022 ** 
## loc.WA.98288          4.083e+04  1.438e+05   0.284 0.776479    
## loc.WA.98354                 NA         NA      NA       NA    
## I(sqft_living^2)     -8.753e-03  2.508e-03  -3.490 0.000490 ***
## bedrooms:sqft_living -1.086e+01  3.057e+00  -3.553 0.000387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101500 on 2929 degrees of freedom
## Multiple R-squared:  0.7877, Adjusted R-squared:  0.7814 
## F-statistic: 124.9 on 87 and 2929 DF,  p-value: < 2.2e-16

Some statezip location have significant influence on the prices, which means location is important to predict the house price

5. Evaluations

Predict Price

actual <- test_data$price

## predicted price
pred.mymodel <- predict(mymodel, test_data)
## Warning in predict.lm(mymodel, test_data): prediction from a rank-deficient fit
## may be misleading

Plot Actual VS Predicted Prices

price_df <- data.frame(actual, pred.mymodel)

ggplot(data = price_df, aes(x = actual, y = pred.mymodel)) +
  geom_point() +
  scale_x_continuous(limits = c(0,1500000)) +
  scale_y_continuous(limits = c(0,1500000))

Measure RMSE and R

6. Recommendations

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.