This report provides house price prediction using regression algorithms. The dataset used in this report for modeling is house data in Australia. The dataset link : here
The report is structured as follows :
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparations
4. Modeling
5. Evaluations
6. Recommendations
Import necessary libraries
rm(list=ls())
library(ggplot2)
Read house dataset and its structure
house_df <- read.csv("data/house.csv")
# structure of dataframe
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset contains 4600 observations and 18 variables. The target variable is price and the remaining are features
# data dimension
d <- dim(house_df)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns
# statistical summary
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Plot distribution of price (Boxplot)
ggplot(data = house_df, aes(y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
## house price based on number of bedrooms
## casting bedrooms from numeric to factor
house_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$city2 <- factor(house_df$city)
house_df$statezip2 <- factor(house_df$statezip)
house_df$street2 <- factor(house_df$street)
house_df$country2 <- factor(house_df$country)
ggplot(data = house_df, aes(x = bedrooms2,
y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
cor(house_df$price, house_df$bedrooms)
## [1] 0.2003363
cor(house_df$price, house_df$bathrooms)
## [1] 0.3271099
cor(house_df$bedrooms, house_df$bathrooms)
## [1] 0.5459199
house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)
library(corrgram)
corrgram(house_df_num, order=TRUE,
upper.panel = panel.pie)
Remove rows with incorrect value prices
idx <- which(house_df_num$price %in% c(0))
# remove those rows
house_df_num <- house_df_num[-idx, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 326264 465000 557906 657500 26590000
Remove rows with outliers in prices
out_price <- boxplot.stats(house_df_num$price)$out
out_price
## [1] 2384000 1400000 1200000 1395000 2280000 1425000 1225000 1580000
## [9] 1870000 1800000 3200000 1500000 1340000 1532500 1750000 1500000
## [17] 1307000 1250000 2400000 1370000 1157200 1702500 2238888 1225000
## [25] 1200000 1575000 1315000 1300000 1346400 1225000 1264000 1895000
## [33] 2555000 1550000 1400000 1680000 1400000 1222500 1255000 2100000
## [41] 2000000 1270000 1216000 1405000 1190000 1300000 1212500 1275000
## [49] 1800000 1690000 1506000 1580000 1220000 1185000 1550000 2700000
## [57] 1600000 1170000 1180500 2150000 1325000 1610000 1200000 2000000
## [65] 1680000 1570000 1795000 1655000 1381000 1600000 1710000 1285000
## [73] 1175000 1400000 1210000 1970000 1400000 2453500 3100000 2750000
## [81] 1425000 2400000 1385000 1365000 1886700 1256500 3710000 1505000
## [89] 1595000 1250000 1965221 1298000 1270000 1901000 2400000 2005000
## [97] 1300000 1800000 1415000 1300000 1370000 1280000 2700000 1635000
## [105] 1339000 1350000 1900000 1225000 1230000 1240000 1795000 1180000
## [113] 7062500 2888000 1365000 1325000 1485000 2250000 1184000 1228000
## [121] 1688000 1240000 1335000 1735000 1695000 1735000 1712500 1920000
## [129] 4668000 1170000 1200000 1538000 1185001 2027000 2475000 1728000
## [137] 1555000 1325000 1384000 1600000 4489000 1400000 3000000 1160000
## [145] 1399950 1525000 1325000 1309500 2200000 1440000 1240000 1815000
## [153] 1950000 1300000 1462497 2110000 2400000 1320000 1240000 2075000
## [161] 1580000 1220000 1700000 1215000 1157400 2466350 1295648 1250000
## [169] 1200000 2300000 1229000 1355000 1700000 1330000 1280000 1250000
## [177] 1340000 1388000 1990000 1387800 1410000 1619999 1820000 1195000
## [185] 1600000 1200000 1165000 1230000 1925000 1730000 1309500 1510000
## [193] 2065000 1198000 1297000 3800000 1695000 1640000 2300000 1565000
## [201] 2367000 1356925 1300000 2271150 2147500 1411600 1195000 1875000
## [209] 1755000 1625000 2351956 1200000 1738000 1280000 1275000 1636000
## [217] 1450000 1360000 1300000 1195000 2458000 1205000 1675000 2180000
## [225] 1355000 1465000 1250000 1681000 2680000 2321000 1240000 1550000
## [233] 1234582 12899000 2110000 2199900 26590000 2560498 1337044 1288333
# get indices
out_idx <- which(house_df_num$price %in% c(out_price))
# remove those rows
house_df_num <- house_df_num[ -out_idx, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 320000 450000 487457 615000 1150000
The max value on price is free from 1150000
nrow(house_df_num)
## [1] 4311
Number of observation: 4311
One Hot Encoding for location features
house_df <- house_df[ rownames(house_df_num), ]
#### 1. Create dataframe for statezip (available on house_df)
statezip <- house_df$statezip
statezip_df <- data.frame(statezip)
colnames(statezip_df) <- c("loc.")
#### 2. OHE on statezip dataframe
library(caret)
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))
#### 3. Combine to house_df_num dataframe
#### 11 + 77 = 88 columns
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4311 88
d <- dim(house_df_num)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns
## ratio train:test = 70:30
## randomly divided
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]
## [1] 2214 442 743
train_data <- house_df_num[ train_idx , ]
test_data <- house_df_num[ -train_idx , ]
dim(train_data)
## [1] 3017 88
dim(test_data)
## [1] 1294 88
Create regresion model
mymodel <- lm(formula = price ~ . +
I(sqft_living^2) +
sqft_living:bedrooms,
data = train_data)
summary(mymodel)
##
## Call:
## lm(formula = price ~ . + I(sqft_living^2) + sqft_living:bedrooms,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -872020 -52263 -1836 48176 520918
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000e+05 1.037e+05 -0.964 0.335018
## bedrooms 1.127e+04 7.435e+03 1.516 0.129518
## bathrooms 2.191e+04 4.317e+03 5.075 4.11e-07 ***
## sqft_living 1.702e+02 1.137e+01 14.960 < 2e-16 ***
## sqft_lot 2.743e-01 5.575e-02 4.920 9.13e-07 ***
## floors -3.338e+04 5.251e+03 -6.356 2.39e-10 ***
## waterfront 1.354e+05 3.682e+04 3.678 0.000239 ***
## view 3.672e+04 3.262e+03 11.257 < 2e-16 ***
## condition 2.308e+04 3.043e+03 7.585 4.44e-14 ***
## sqft_above 8.334e+01 6.353e+00 13.117 < 2e-16 ***
## sqft_basement NA NA NA NA
## loc.WA.98001 -1.046e+05 1.026e+05 -1.020 0.307959
## loc.WA.98002 -1.127e+05 1.037e+05 -1.087 0.277266
## loc.WA.98003 -9.356e+04 1.030e+05 -0.908 0.363934
## loc.WA.98004 4.455e+05 1.036e+05 4.299 1.77e-05 ***
## loc.WA.98005 2.412e+05 1.042e+05 2.315 0.020657 *
## loc.WA.98006 1.915e+05 1.023e+05 1.872 0.061294 .
## loc.WA.98007 2.161e+05 1.036e+05 2.086 0.037080 *
## loc.WA.98008 1.237e+05 1.030e+05 1.201 0.229827
## loc.WA.98010 5.043e+03 1.112e+05 0.045 0.963840
## loc.WA.98011 4.492e+04 1.037e+05 0.433 0.664739
## loc.WA.98014 5.046e+03 1.052e+05 0.048 0.961734
## loc.WA.98019 -8.018e+03 1.033e+05 -0.078 0.938134
## loc.WA.98022 -1.214e+05 1.040e+05 -1.167 0.243275
## loc.WA.98023 -1.179e+05 1.022e+05 -1.153 0.248916
## loc.WA.98024 6.126e+04 1.097e+05 0.559 0.576464
## loc.WA.98027 9.721e+04 1.025e+05 0.949 0.342795
## loc.WA.98028 5.189e+04 1.026e+05 0.506 0.612964
## loc.WA.98029 1.634e+05 1.024e+05 1.595 0.110740
## loc.WA.98030 -9.482e+04 1.034e+05 -0.917 0.359028
## loc.WA.98031 -9.087e+04 1.027e+05 -0.885 0.376375
## loc.WA.98032 -1.182e+05 1.051e+05 -1.125 0.260848
## loc.WA.98033 2.388e+05 1.024e+05 2.331 0.019809 *
## loc.WA.98034 1.069e+05 1.024e+05 1.044 0.296346
## loc.WA.98038 -7.578e+04 1.023e+05 -0.741 0.458861
## loc.WA.98039 -3.825e+05 1.436e+05 -2.663 0.007786 **
## loc.WA.98040 3.101e+05 1.030e+05 3.009 0.002641 **
## loc.WA.98042 -9.962e+04 1.022e+05 -0.975 0.329789
## loc.WA.98045 -6.168e+03 1.029e+05 -0.060 0.952219
## loc.WA.98047 -9.415e+04 1.112e+05 -0.847 0.397284
## loc.WA.98050 8.550e+03 1.441e+05 0.059 0.952686
## loc.WA.98051 -2.062e+04 1.116e+05 -0.185 0.853467
## loc.WA.98052 1.918e+05 1.020e+05 1.879 0.060306 .
## loc.WA.98053 1.514e+05 1.023e+05 1.480 0.138858
## loc.WA.98055 -5.428e+04 1.043e+05 -0.520 0.602784
## loc.WA.98056 2.489e+04 1.024e+05 0.243 0.808066
## loc.WA.98057 -7.989e+04 1.060e+05 -0.753 0.451264
## loc.WA.98058 -5.394e+04 1.023e+05 -0.527 0.597992
## loc.WA.98059 5.430e+03 1.022e+05 0.053 0.957646
## loc.WA.98065 4.819e+04 1.027e+05 0.469 0.639102
## loc.WA.98068 1.495e+05 1.437e+05 1.041 0.298156
## loc.WA.98070 7.905e+03 1.044e+05 0.076 0.939640
## loc.WA.98072 1.012e+05 1.027e+05 0.985 0.324596
## loc.WA.98074 1.267e+05 1.023e+05 1.239 0.215417
## loc.WA.98075 1.705e+05 1.025e+05 1.663 0.096330 .
## loc.WA.98077 1.219e+05 1.032e+05 1.181 0.237887
## loc.WA.98092 -1.075e+05 1.024e+05 -1.050 0.293706
## loc.WA.98102 3.308e+05 1.050e+05 3.150 0.001648 **
## loc.WA.98103 2.502e+05 1.021e+05 2.450 0.014339 *
## loc.WA.98105 3.583e+05 1.039e+05 3.447 0.000575 ***
## loc.WA.98106 3.489e+04 1.027e+05 0.340 0.734118
## loc.WA.98107 2.486e+05 1.027e+05 2.421 0.015551 *
## loc.WA.98108 2.727e+04 1.032e+05 0.264 0.791644
## loc.WA.98109 4.323e+05 1.066e+05 4.055 5.13e-05 ***
## loc.WA.98112 3.731e+05 1.031e+05 3.619 0.000301 ***
## loc.WA.98115 2.409e+05 1.021e+05 2.360 0.018358 *
## loc.WA.98116 2.219e+05 1.027e+05 2.160 0.030848 *
## loc.WA.98117 2.241e+05 1.021e+05 2.195 0.028250 *
## loc.WA.98118 8.617e+04 1.025e+05 0.841 0.400645
## loc.WA.98119 3.310e+05 1.034e+05 3.200 0.001389 **
## loc.WA.98122 2.650e+05 1.028e+05 2.578 0.009984 **
## loc.WA.98125 8.670e+04 1.024e+05 0.847 0.397056
## loc.WA.98126 1.142e+05 1.024e+05 1.115 0.264930
## loc.WA.98133 6.011e+04 1.023e+05 0.588 0.556720
## loc.WA.98136 1.694e+05 1.029e+05 1.646 0.099898 .
## loc.WA.98144 1.748e+05 1.025e+05 1.706 0.088124 .
## loc.WA.98146 4.353e+04 1.029e+05 0.423 0.672173
## loc.WA.98148 -3.307e+04 1.085e+05 -0.305 0.760587
## loc.WA.98155 3.714e+04 1.024e+05 0.363 0.716785
## loc.WA.98166 9.244e+02 1.027e+05 0.009 0.992821
## loc.WA.98168 -5.496e+04 1.026e+05 -0.535 0.592420
## loc.WA.98177 1.633e+05 1.029e+05 1.587 0.112601
## loc.WA.98178 -5.612e+04 1.032e+05 -0.544 0.586724
## loc.WA.98188 -9.531e+04 1.049e+05 -0.909 0.363523
## loc.WA.98198 -9.769e+04 1.028e+05 -0.950 0.341938
## loc.WA.98199 2.732e+05 1.030e+05 2.653 0.008022 **
## loc.WA.98288 4.083e+04 1.438e+05 0.284 0.776479
## loc.WA.98354 NA NA NA NA
## I(sqft_living^2) -8.753e-03 2.508e-03 -3.490 0.000490 ***
## bedrooms:sqft_living -1.086e+01 3.057e+00 -3.553 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101500 on 2929 degrees of freedom
## Multiple R-squared: 0.7877, Adjusted R-squared: 0.7814
## F-statistic: 124.9 on 87 and 2929 DF, p-value: < 2.2e-16
Some statezip location have significant influence on the prices, which means location is important to predict the house price
Predict Price
actual <- test_data$price
## predicted price
pred.mymodel <- predict(mymodel, test_data)
## Warning in predict.lm(mymodel, test_data): prediction from a rank-deficient fit
## may be misleading
Plot Actual VS Predicted Prices
price_df <- data.frame(actual, pred.mymodel)
ggplot(data = price_df, aes(x = actual, y = pred.mymodel)) +
geom_point() +
scale_x_continuous(limits = c(0,1500000)) +
scale_y_continuous(limits = c(0,1500000))
Measure RMSE and R
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.