This dataset, uploaded to Kaggle, contains house sale prices for King County, WA, covering the one-year period from May 2014 to May 2015. I'll do some brief analysis and plotting in R.
https://www.kaggle.com/harlfoxem/housesalesprediction
options(warn=-1)
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(corrplot)
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(gbm)
Loading required package: survival
Attaching package: 'survival'
The following object is masked from 'package:caret':
cluster
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.3
library(e1071)
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
library(xgboost)
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
Read the data from the .csv file into R.
df <- read.csv('kc_house_data.csv')
str(df)
'data.frame': 21613 obs. of 21 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
The dataset is small and manageable, with reasonably well-formatted headers.
The date variable needs to be converted into a proper Date, and a few variables should be changed to factors.
date_string <- substr(df$date, 1, 8)
df$date <- as.Date(date_string, format = '%Y%m%d')
df$waterfront <- as.factor(df$waterfront)
df$zipcode <- as.factor(df$zipcode)
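As a sanity check (an addition of mine, output not shown), the date range confirms the one-year window mentioned at the top:
range(df$date)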
A quick check shows no missing values.
apply(apply(df,2,is.na),2,sum)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
0 0 0 0 0 0 0 0 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 0 0 0 0 0 0 0 0
long sqft_living15 sqft_lot15
0 0 0
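The nested apply calls work, though the same check can be written more compactly; this one-liner is an equivalent alternative:
colSums(is.na(df))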
I will plot all of the numeric variables using facet_wrap. There is a lot of useful information in this output, and all of it could be cleaned up and dug into further.
df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density(fill= 'blue') +
theme_bw()
Similar plotting for factor variables.
df %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_bar(fill = 'blue') +
theme_bw()
attributes are not identical across measure variables; they will be dropped
I will now plot the numeric variables against the price dependent variable using the same technique as before.
df %>%
select(-id) %>%
keep(is.numeric) %>%
gather(key,value,-price) %>%
ggplot(aes(x=value,y=price)) +
geom_jitter(color = 'light blue',alpha = .6) +
geom_smooth(method = 'gam', color= 'dark blue', fill = 'grey', alpha = .2) +
facet_wrap(~key, scales = 'free') +
theme_bw()
Factor variables plotted against price.
df %>%
select(waterfront,price,zipcode) %>%
gather(key,value,-price) %>%
ggplot(aes(x=value,y=price)) +
geom_boxplot(color= 'dark blue', fill = 'grey', alpha = .2) +
facet_wrap(~key, scales = 'free') +
theme_bw() +
coord_flip()
attributes are not identical across measure variables; they will be dropped
The outlier in the bedrooms field appears to be an actual mistake rather than a real value. I will replace it with 3, which seems reasonable given the distributions.
df %>%
ggplot(aes(x=bedrooms, y=bathrooms)) +
geom_count(color = 'dark green')
df[df$bedrooms == 33,'bedrooms'] <- 3
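Re-tabulating the bedroom counts (a quick check I'm adding, output not shown) confirms the 33-bedroom record is gone:
table(df$bedrooms)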
To see how our predictors relate, I will build a correlation matrix. There is not an extreme amount of correlation between the numeric variables.
df %>%
select(-date) %>%
# coerce factor columns to numeric so cor() can handle every column
apply(2, as.character) %>%
apply(2, as.numeric) %>%
cor(use = 'complete', method = 'pearson') %>%
corrplot(type = 'lower', diag = FALSE)
The time of year may have an impact on sale price, so a few dummy variables can be created and tested.
df$year_2015 <- as.factor(ifelse(year(df$date) == '2015',1,0))
df$winter <- as.factor(ifelse(month(df$date) %in% c(11,12,1),1,0))
df$summer <- as.factor(ifelse(month(df$date) %in% c(5,6,7),1,0))
df$fall <- as.factor(ifelse(month(df$date) %in% c(8,9,10),1,0))
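Months February through April act as the implicit spring reference level. A quick cross-tabulation (my own sanity check, not in the original) verifies that no sale lands in more than one seasonal bucket:
table(winter = df$winter, summer = df$summer, fall = df$fall)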
The timing of any renovation is also interesting and can be broken out.
df %>%
ggplot(aes(x=yr_renovated)) +
geom_area(stat='bin', fill = 'light blue') +
xlim(1900,2017)
df$renovated_t_f <- as.factor(ifelse(df$yr_renovated == 0, 0,1))
df$renovated_b_1950 <- as.factor(ifelse(df$yr_renovated > 0 & df$yr_renovated <= 1950 , 1, 0))
df$renovated_b_1975 <- as.factor(ifelse(df$yr_renovated > 1950 & df$yr_renovated <= 1975 , 1, 0))
df$renovated_b_2000 <- as.factor(ifelse(df$yr_renovated > 1975 & df$yr_renovated <= 2000 , 1, 0))
df$renovated_a_2000 <- as.factor(ifelse(df$yr_renovated > 2000 , 1, 0))
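A quick summary (an optional check, not in the original run) shows the bucket counts; the counts of the four period flags should sum to the renovated total:
summary(df[, c('renovated_t_f', 'renovated_b_1950', 'renovated_b_1975', 'renovated_b_2000', 'renovated_a_2000')])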
A dummy variable may be more appropriate than the actual basement measurements due to the low occurrence of basements.
df$basement_t_f <- as.factor(ifelse(df$sqft_basement == 0, 0, 1))
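To quantify how often basements actually occur (a small check I'm adding), the share of homes with any basement square footage can be computed:
mean(df$sqft_basement > 0)
table(df$basement_t_f)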
There is a lot of interesting information in the zip code; for now, a few tiers can be created based on each zip code's deviation from the average price.
df %>%
group_by(zipcode) %>%
summarize(count = n(), zip_ave = mean(price)) %>%
mutate(zip_diff = zip_ave - mean(zip_ave)) %>%
mutate(zip_difference = ifelse(zip_diff > 0, 'Above Average', 'Below Average')) %>%
arrange((zip_diff)) %>%
#filter(count > 20) %>%
ggplot(aes( x = factor(zipcode, levels =zipcode), y = zip_diff)) +
geom_bar(stat = 'identity', aes(fill=zip_difference), width = .6) +
coord_flip() +
labs(x = 'Zip Code', y = 'Difference from Mean Price ($)', title = 'Price Averages per Zip Code') +
theme_bw()
df$tier1_zip <-ifelse(df$zipcode %in% c(98039),1,0)
df$tier2_zip <-ifelse(df$zipcode %in% c(98004,98040,98112,98102,98109,98105,98006,98119),1,0)
df$tier4_zip <-ifelse(df$zipcode %in% c(98002,98168,98032,98001,98148),1,0)
df$tier3_zip <- 1 - df$tier1_zip - df$tier2_zip - df$tier4_zip
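Every home should fall into exactly one tier, so summing the four indicator columns (a quick check of my own) should return only ones:
table(df$tier1_zip + df$tier2_zip + df$tier3_zip + df$tier4_zip)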
The data can be split into training and test sets using the caret package. We will hold out 30% to test our final model on, while doing cross-validation on the other 70% to identify the strongest model type.
set.seed(57)
train_flag <- createDataPartition(df$price, p = .7, list = FALSE, times = 1)
df_train <- df[train_flag,]
df_test <- df[-train_flag,]
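The split proportions can be verified (an optional addition of mine, output not shown):
nrow(df_train) / nrow(df)
nrow(df_test) / nrow(df)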
The datasets can be narrowed down to only include the columns to be used as predictors.
df_train<- subset(df_train, select = c(price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view,
condition, grade, sqft_above, basement_t_f, yr_built, lat, long, sqft_living15, sqft_lot15,
year_2015, winter, summer, fall, renovated_b_1950,
renovated_b_1975, renovated_b_2000, renovated_a_2000, tier1_zip, tier2_zip, tier4_zip ) )
df_test <- subset(df_test, select = c(price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view,
condition, grade, sqft_above, basement_t_f, yr_built, lat, long, sqft_living15, sqft_lot15,
year_2015, winter, summer, fall, renovated_b_1950,
renovated_b_1975, renovated_b_2000, renovated_a_2000, tier1_zip, tier2_zip, tier4_zip ) )
Parameters are tuned for the linear regression and XGBoost linear models using repeated cross-validation.
fitControl <- trainControl( method = "repeatedcv",
number = 5,
repeats = 2)
lr_fit <- train(price ~ ., data = as.data.frame(df_train),
method = 'lm',
trControl = fitControl)
xgb_fit <- train(price ~ ., data = as.data.frame(df_train),
method = 'xgbLinear',
trControl = fitControl)
The grade feature was quite important for XGBoost. I would need to research what this field means; it is likely the construction quality grade of the home, which would make sense.
ggplot(varImp(xgb_fit)) +
geom_col() +
coord_flip()
XGBoost did much better than the linear regression.
summary(resamples(list(LR = lr_fit, XGB = xgb_fit)))
Call:
summary.resamples(object = resamples(list(LR = lr_fit, XGB = xgb_fit)))
Models: LR, XGB
Number of resamples: 10
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LR 169596.1 173920.6 179855.0 181637.9 184371.0 200397.2 0
XGB 109458.8 123821.3 132609.7 132231.7 140281.9 148709.0 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LR 0.7377668 0.7441515 0.7550750 0.7552494 0.7641244 0.7771353 0
XGB 0.8324944 0.8543750 0.8697135 0.8706166 0.8873433 0.9029150 0
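Since 30% of the data was held out, the winning XGBoost model can also be scored on the test set. This is a minimal sketch using caret's postResample helper; it was not part of the original run, so no output is shown.
test_pred <- predict(xgb_fit, newdata = df_test)
postResample(pred = test_pred, obs = df_test$price)  # returns RMSE, R-squared, MAE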
For better predictions, more feature engineering would be needed; the lat/long data in particular could be handled better. The predictive power of other model types could also be investigated, though it would be hard to beat XGBoost. As there is no competition for this dataset, I will choose to end after these basic steps.