About this file Online property companies offer valuations of houses using machine learning techniques. The aim of this report is to predict the house sales in King County, Washington State, USA using Simple Linear Regression (SLR) Multiple Linear Regression (MLR), and Polynomial Regression. The dataset consisted of historic data of houses sold between May 2014 to May 2015.” The dataset consists of house prices from King County an area in the US State of Washington, this data also covers Seattle. The dataset was obtained from Kaggle. This data was published/released under CC0: Public Domain. Unfortunately, the user has not indicated the source of the data. Please find the citation and database description in the Glossary and Bibliography. The dataset consisted of 21 variables and 21613 observations.
The dataset link: here
The raport is structured as follow:
Data Extraction.
Exploratory Data Analysis.
Data Preparation/ Preprocessing.
Modelling
Evaluations
Recommendations
Read the house dataset and its structure
houseData_df <- read.csv("csv/houseData.csv")
str(houseData_df)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
The dataset contains 21613 observations and 21 variables. The target variable is price
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
ggplot(data=houseData_df,
aes(y=price))+
geom_boxplot()+
scale_y_continuous(limits=c(0,2000000))
#
ggplot(houseData_df,
aes(x=sqft_living,
y=price,
color=floors)) +
geom_point() +
labs(title="Grade VS Price",
x = "Grade",
y = "Price") +
scale_y_continuous(limits = c(0, 2500000))
ggplot(houseData_df,
aes(x=grade,
y=price,
color=floors)) +
geom_point() +
labs(title="Grade VS Price",
x = "Grade",
y = "Price") +
scale_y_continuous(limits = c(0, 2500000))
ggplot(houseData_df,
aes(x=yr_built,
y=price,
color=floors)) +
geom_point() +
labs(title="Year Built vs Price",
x="Year Built",
y="Price") +
scale_y_continuous(limits = c(0, 2000000))
ggplot(data=houseData_df,
aes(x=bedrooms2, y=price))+
geom_boxplot()+
scale_y_continuous(limits=c(0,2000000))
houseData_df_num <- houseData_df[, 3:14]
r <- cor(houseData_df_num)
r
## price bedrooms bathrooms sqft_living sqft_lot
## price 1.00000000 0.308349598 0.52513751 0.70203505 0.089660861
## bedrooms 0.30834960 1.000000000 0.51588364 0.57667069 0.031703243
## bathrooms 0.52513751 0.515883638 1.00000000 0.75466528 0.087739662
## sqft_living 0.70203505 0.576670693 0.75466528 1.00000000 0.172825661
## sqft_lot 0.08966086 0.031703243 0.08773966 0.17282566 1.000000000
## floors 0.25679389 0.175428935 0.50065317 0.35394929 -0.005200991
## waterfront 0.26636943 -0.006582479 0.06374363 0.10381782 0.021603683
## view 0.39729349 0.079531852 0.18773702 0.28461119 0.074710106
## condition 0.03636179 0.028472104 -0.12498193 -0.05875259 -0.008958250
## grade 0.66743426 0.356966725 0.66498253 0.76270448 0.113621124
## sqft_above 0.60556730 0.477600161 0.68534248 0.87659660 0.183512281
## sqft_basement 0.32381602 0.303093375 0.28377003 0.43504297 0.015286202
## floors waterfront view condition grade
## price 0.256793888 0.266369434 0.39729349 0.03636179 0.66743426
## bedrooms 0.175428935 -0.006582479 0.07953185 0.02847210 0.35696673
## bathrooms 0.500653173 0.063743629 0.18773702 -0.12498193 0.66498253
## sqft_living 0.353949290 0.103817818 0.28461119 -0.05875259 0.76270448
## sqft_lot -0.005200991 0.021603683 0.07471011 -0.00895825 0.11362112
## floors 1.000000000 0.023698320 0.02944382 -0.26376795 0.45818251
## waterfront 0.023698320 1.000000000 0.40185735 0.01665316 0.08277491
## view 0.029443820 0.401857351 1.00000000 0.04598974 0.25132058
## condition -0.263767946 0.016653157 0.04598974 1.00000000 -0.14467367
## grade 0.458182514 0.082774914 0.25132058 -0.14467367 1.00000000
## sqft_above 0.523884710 0.072074592 0.16764934 -0.15821362 0.75592294
## sqft_basement -0.245704542 0.080587939 0.27694658 0.17410491 0.16839182
## sqft_above sqft_basement
## price 0.60556730 0.32381602
## bedrooms 0.47760016 0.30309338
## bathrooms 0.68534248 0.28377003
## sqft_living 0.87659660 0.43504297
## sqft_lot 0.18351228 0.01528620
## floors 0.52388471 -0.24570454
## waterfront 0.07207459 0.08058794
## view 0.16764934 0.27694658
## condition -0.15821362 0.17410491
## grade 0.75592294 0.16839182
## sqft_above 1.00000000 -0.05194331
## sqft_basement -0.05194331 1.00000000
library(corrgram)
corrgram(houseData_df_num, order=TRUE,
upper.panel=panel.pie)
Remove rows with outlier prices
out_price <- boxplot.stats(houseData_df_num$price)$out
out_idx <- which(houseData_df_num$price %in% c(out_price))
houseData_df_num <- houseData_df_num[-out_idx,]
summary(houseData_df_num)
## price bedrooms bathrooms sqft_living
## Min. : 75000 Min. : 0.00 Min. :0.000 Min. : 290
## 1st Qu.: 315000 1st Qu.: 3.00 1st Qu.:1.500 1st Qu.:1400
## Median : 437500 Median : 3.00 Median :2.000 Median :1860
## Mean : 476985 Mean : 3.33 Mean :2.052 Mean :1976
## 3rd Qu.: 600000 3rd Qu.: 4.00 3rd Qu.:2.500 3rd Qu.:2431
## Max. :1127500 Max. :33.00 Max. :7.500 Max. :7480
## sqft_lot floors waterfront view
## Min. : 520 Min. :1.000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 5000 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 7500 Median :1.000 Median :0.00000 Median :0.0000
## Mean : 14610 Mean :1.476 Mean :0.00298 Mean :0.1727
## 3rd Qu.: 10319 3rd Qu.:2.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1651359 Max. :3.500 Max. :1.00000 Max. :4.0000
## condition grade sqft_above sqft_basement
## Min. :1.000 Min. : 1.000 Min. : 290 Min. : 0.0
## 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1170 1st Qu.: 0.0
## Median :3.000 Median : 7.000 Median :1520 Median : 0.0
## Mean :3.406 Mean : 7.531 Mean :1708 Mean : 267.2
## 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2100 3rd Qu.: 510.0
## Max. :5.000 Max. :12.000 Max. :5710 Max. :2720.0
nrow(houseData_df_num)
## [1] 20467
Number of observations after the preprocessing is 20467.
houseData_df <- houseData_df[rownames(houseData_df_num),]
#Create a dataframe for statezip
zipcode <- houseData_df$zipcode
zipcode_df <- data.frame(zipcode)
colnames(zipcode_df) <- c("loc.")
#One Hot Encoding on Statezip
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
df1 <- dummyVars("~.", data=zipcode_df)
df2 <- data.frame(predict(df1, newdata=zipcode_df))
## Combine house_df with satezip_df
houseData_df_num <- cbind(houseData_df_num, df2)
dim(houseData_df_num)
## [1] 20467 13
set.seed(2022)
train_idx <- sample(m, m*0.7)
train_idx[1:5]
## [1] 20708 9651 7886 2871 12107
train_data <- houseData_df_num[train_idx,]
test_data <- houseData_df_num[-train_idx,]
dim(train_data)
## [1] 15129 13
dim(test_data)
## [1] 6121 13
mymodel <- lm(formula=price~. +
I(sqft_living^2), data = train_data)
summary(mymodel)
##
## Call:
## lm(formula = price ~ . + I(sqft_living^2), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -501336 -103902 -11523 88270 824159
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.073e+07 2.407e+06 -16.922 < 2e-16 ***
## bedrooms -8.674e+03 1.717e+03 -5.051 4.45e-07 ***
## bathrooms -1.473e+04 2.795e+03 -5.269 1.39e-07 ***
## sqft_living 1.378e+02 7.763e+00 17.752 < 2e-16 ***
## sqft_lot 8.026e-03 3.243e-02 0.247 0.80454
## floors 1.602e+04 3.146e+03 5.092 3.58e-07 ***
## waterfront 6.011e+04 2.292e+04 2.622 0.00875 **
## view 3.397e+04 2.079e+03 16.344 < 2e-16 ***
## condition 4.204e+04 1.969e+03 21.351 < 2e-16 ***
## grade 8.451e+04 1.818e+03 46.494 < 2e-16 ***
## sqft_above -2.566e+01 3.957e+00 -6.485 9.17e-11 ***
## sqft_basement NA NA NA NA
## loc. 4.103e+02 2.452e+01 16.733 < 2e-16 ***
## I(sqft_living^2) -3.062e-03 1.382e-03 -2.215 0.02675 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145700 on 14333 degrees of freedom
## (783 observations deleted due to missingness)
## Multiple R-squared: 0.5098, Adjusted R-squared: 0.5094
## F-statistic: 1242 on 12 and 14333 DF, p-value: < 2.2e-16
actual <- test_data$price
pred.mymodel <- predict(mymodel, test_data)
## Warning in predict.lm(mymodel, test_data): prediction from a rank-deficient fit
## may be misleading
Plot Actual vs Prediction Series
price_df <- data.frame(actual, pred.mymodel)
ggplot(data=price_df,
aes(x=actual, y=pred.mymodel))+
geom_point()+
scale_x_continuous(limits=c(0,2000000))+
scale_y_continuous(limits=c(0,2000000))
## Warning: Removed 2 rows containing missing values (geom_point).
Regression Metrics Measurement
performance <- function(actual, prediction, method){
e <- prediction-actual
se <- e^2
sse<- sum(se)
mse<- mean(se)
rmse<-sqrt(mse)
result <- paste("Method:", method,
"\n SSE:", round(sse,3),
"\n MSE:", round(mse,3),
"\n RMSE:", round(rmse,3))
cat(result)
}
performance(actual, pred.mymodel, "Regression Model")
## Method: Regression Model
## SSE: 131342375304986
## MSE: 21457666280.834
## RMSE: 146484.355