In this project, I’ve been looking at house prices in Umeå, a medium-sized city in the north of Sweden. First, I scraped data from https://www.hemnet.se/ and obtained 189 observations with features such as price, size in square meters, number of rooms, and location. Since the goal is to predict the price of a house from its size, location, and number of rooms, this is a regression problem: the dependent variable is continuous.
Load the required packages for importing data and preprocessing. The import step itself is not shown here; df below is the data frame holding the scraped listings.
library(tidyverse)
library(readxl)
dim(df)
## [1] 189 40
summary(df)
## id loc price size_sqrt_m
## Min. : 1 Length:189 Min. : 895000 Min. : 24.00
## 1st Qu.: 64 Class :character 1st Qu.:1395000 1st Qu.: 48.50
## Median :129 Mode :character Median :1675000 Median : 64.00
## Mean :125 Mean :1861593 Mean : 65.68
## 3rd Qu.:185 3rd Qu.:2189000 3rd Qu.: 79.00
## Max. :251 Max. :5495000 Max. :140.00
## rooms area area_mariestrand area_teg
## Min. :1.000 Length:189 Min. :0.00000 Min. :0.00000
## 1st Qu.:2.000 Class :character 1st Qu.:0.00000 1st Qu.:0.00000
## Median :2.000 Mode :character Median :0.00000 Median :0.00000
## Mean :2.466 Mean :0.03704 Mean :0.08466
## 3rd Qu.:3.000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :4.000 Max. :1.00000 Max. :1.00000
## area_haga area_umeå area_ersboda area_umedalen
## Min. :0.0000 Min. :0 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0 Median :0.00000 Median :0.00000
## Mean :0.1323 Mean :0 Mean :0.05291 Mean :0.04233
## 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :0 Max. :1.00000 Max. :1.00000
## area_böleäng area_tomtebo area_gimonäs area_mariehöjd
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01058 Mean :0.08995 Mean :0.02645 Mean :0.02116
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## area_sandåkern area_vasterslatt area_mariehem area_ålidhem
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.03704 Mean :0.03175 Mean :0.06878 Mean :0.02116
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## area_outside area_obbola area_grubbe area_centrum
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.01587 Mean :0.005291 Mean :0.01058 Mean :0.07407
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.00000
## area_stöcksjö area_öst.på.stan area_berghem area_väst.på.stan
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.005291 Mean :0.02116 Mean :0.05291 Mean :0.07937
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## area_ersmark area_sävar area_hörnefors area_sörmjöle
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.005291 Mean :0.01058 Mean :0.005291 Mean :0.01587
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## area_röbäck area_holmsund area_brännland area_söderslätt
## Min. :0.000000 Min. :0.000000 Min. :0 Min. :0
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0 1st Qu.:0
## Median :0.000000 Median :0.000000 Median :0 Median :0
## Mean :0.005291 Mean :0.005291 Mean :0 Mean :0
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0 3rd Qu.:0
## Max. :1.000000 Max. :1.000000 Max. :0 Max. :0
## area_mariedal area_tavleliden area_öbacka.strand date
## Min. :0.00000 Min. :0 Min. :0.00000 Length:189
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.00000 Class :character
## Median :0.00000 Median :0 Median :0.00000 Mode :character
## Mean :0.01587 Mean :0 Mean :0.01587
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.00000
## Max. :1.00000 Max. :0 Max. :1.00000
No missing data seem to be present: the summary reports no NA counts for any of the numeric columns.
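Because summary() reports NA counts for numeric columns but not for character columns, an explicit check can be more reliable. The following is a minimal sketch on a made-up toy data frame (toy and its values are assumptions for illustration, not the scraped data); the same two calls could be applied to df.

```r
# Toy data frame standing in for the scraped listings (values are made up)
toy <- data.frame(
  price = c(1400000, NA, 1600000),
  rooms = c(2, 1, NA),
  area  = c("Teg", NA, "Haga"),
  stringsAsFactors = FALSE
)

# Number of missing values per column, including character columns
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only if the whole data frame is free of NAs
no_missing <- !anyNA(toy)
print(no_missing)
```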
# price
ggplot(df, aes(x = price)) +
  geom_histogram(bins = 20)
# rooms
ggplot(df, aes(x = rooms)) +
  geom_histogram(bins = 4)
# square meters
ggplot(df, aes(x = size_sqrt_m)) +
  geom_histogram(bins = 25)
# area
df %>%
  group_by(area) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = reorder(area, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
Check intercorrelations between the most relevant variables.
library(ggcorrplot)
corr = df %>%
  select(price, size_sqrt_m, rooms, area_centrum, area_haga, area_teg, area_ersboda, area_tomtebo)
corr = cor(corr)
ggcorrplot(corr, lab = TRUE, lab_size = 4) +
  theme(axis.text.x = element_text(size = 16),
        axis.text.y = element_text(size = 16),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 14))
Size in square meters and number of rooms in particular appear to be highly correlated with the price of the house. Houses located in central Umeå (area_centrum) also tend to be more expensive, whereas houses located in Ersboda tend to be cheaper.
Before moving on to prediction, it’s important to preprocess the data. Here, I’ll remove the features that are not relevant for the analyses (id, loc, area, and date). I also standardize each feature to mean 0 and standard deviation 1. This matters mostly for penalized models such as the elastic net, where features measured on larger scales would otherwise be penalized differently from features measured on smaller scales. Finally, I remove the dummy-coded area features that have no or very few observations.
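# Standardize all predictors; scale() returns one-column matrices, hence the [,1] in the printed column names.
# funs() is deprecated in dplyr, so scale is passed directly.
# Then drop the identifier columns and the area dummies with no or very few observations.
df = df %>%
  mutate_at(vars(-price, -id, -area, -loc, -date), scale) %>%
  select(-id, -area, -loc, -date, -area_umeå, -area_brännland, -area_söderslätt, -area_tavleliden, -area_hörnefors, -area_öbacka.strand)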
head(df)
## # A tibble: 6 x 30
## price size_sqrt_m[,1] rooms[,1] area_mariestran~ area_teg[,1]
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.40e6 -0.0749 -0.503 5.09 -0.303
## 2 1.25e6 -1.74 -1.58 -0.196 3.28
## 3 1.60e6 -0.577 -0.503 -0.196 -0.303
## 4 1.50e6 -1.17 -1.58 -0.196 -0.303
## 5 1.80e6 2.10 1.66 -0.196 -0.303
## 6 1.12e6 0.182 -0.503 -0.196 -0.303
## # ... with 25 more variables: area_haga[,1] <dbl>, area_ersboda[,1] <dbl>,
## # area_umedalen[,1] <dbl>, area_böleäng[,1] <dbl>,
## # area_tomtebo[,1] <dbl>, area_gimonäs[,1] <dbl>,
## # area_mariehöjd[,1] <dbl>, area_sandåkern[,1] <dbl>,
## # area_vasterslatt[,1] <dbl>, area_mariehem[,1] <dbl>,
## # area_ålidhem[,1] <dbl>, area_outside[,1] <dbl>, area_obbola[,1] <dbl>,
## # area_grubbe[,1] <dbl>, area_centrum[,1] <dbl>,
## # area_stöcksjö[,1] <dbl>, area_öst.på.stan[,1] <dbl>,
## # area_berghem[,1] <dbl>, area_väst.på.stan[,1] <dbl>,
## # area_ersmark[,1] <dbl>, area_sävar[,1] <dbl>, area_sörmjöle[,1] <dbl>,
## # area_röbäck[,1] <dbl>, area_holmsund[,1] <dbl>,
## # area_mariedal[,1] <dbl>
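To make the standardization step concrete, here is a small sketch of what scale() computes. The vector x is an illustrative assumption (it reuses the quartile values of size_sqrt_m from the summary rather than the actual column). It also shows why the column names in the output carry a [,1] suffix: scale() returns a one-column matrix, not a plain vector.

```r
# Illustrative values only (the five-number summary of size_sqrt_m)
x <- c(24, 48.5, 64, 79, 140)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same but returns a one-column matrix
z_scale <- scale(x)
print(is.matrix(z_scale))

# as.numeric() drops the matrix structure and recovers a plain vector
print(all.equal(z_manual, as.numeric(z_scale)))
```

Wrapping the call as as.numeric(scale(x)) inside the mutate step would be one way to get plain numeric columns, if the [,1] suffix is unwanted.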
Next, I’ll split the data into a training set with 75% of the observations and a test set with the remaining 25% (141 and 48 observations, respectively). I also set a random seed so that the analyses are reproducible.
# SPLIT INTO TRAINING AND TEST SET
smp_size = floor(0.75 * nrow(df))
set.seed(1234)
train_ind = sample(seq_len(nrow(df)), size = smp_size)
# train set
train = df[train_ind,]
# test set
test = df[-train_ind,]
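In the pipeline above, scale() is applied to the full data set before the split, so the test rows contribute to the means and standard deviations used for standardization. A stricter variant, sketched below on made-up data, estimates those statistics on the training rows only and reuses them for the test rows. The names toy, train_toy, test_toy, and features are hypothetical, and this is not what the analysis in this post does.

```r
set.seed(1234)

# Made-up data standing in for the listings
toy <- data.frame(
  price = runif(20, 9e5, 5e6),
  size  = runif(20, 24, 140),
  rooms = rep(1:4, 5)
)

# Same 75/25 split logic as in the post
train_ind_toy <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train_toy <- toy[train_ind_toy, ]
test_toy  <- toy[-train_ind_toy, ]

# Estimate centering and scaling constants on the training rows only
features <- c("size", "rooms")
mu    <- sapply(train_toy[features], mean)
sigma <- sapply(train_toy[features], sd)

# Apply the training constants to both sets
train_toy[features] <- scale(train_toy[features], center = mu, scale = sigma)
test_toy[features]  <- scale(test_toy[features],  center = mu, scale = sigma)

print(colMeans(train_toy[features]))  # approximately 0 by construction
print(colMeans(test_toy[features]))   # generally not exactly 0
```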
Now we are ready for the analyses. In this project, I will use three common machine learning algorithms for regression problems: linear regression, random forest regression, and elastic net regression. To evaluate predictive accuracy, I will use the root mean squared error (RMSE) on the test set.
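For reference, the root mean squared error is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as price. A minimal hand-written version might look as follows; rmse_manual and the three example prices are illustrative assumptions, and the analyses below use rmse() from the Metrics package instead.

```r
# Root mean squared error written out by hand
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up example: errors of -100000, 0, and 200000
actual    <- c(1400000, 1250000, 1600000)
predicted <- c(1500000, 1250000, 1400000)

example_rmse <- rmse_manual(actual, predicted)
print(example_rmse)
```

Because the errors are squared before averaging, a single large miss (here 200000) weighs more heavily than several small ones.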
Linear regression is the modeling approach one typically starts with. Here, it serves as a baseline against which the more complex models are compared. Linear regression performs well when the relationships between the features and the dependent variable are linear, but it tends to underfit the data when those relationships are non-linear.
Fit the model and predict price on the test set.
library(Metrics) # provides rmse(), a built-in function for root mean squared error
# fit the model on the training data
linreg_model = lm(price ~ ., data = train)
# predict price on the test set (kept in its own vector so that test is not modified)
linreg_pred = predict(linreg_model, newdata = test)
# calculate RMSE
linreg_test_rmse = rmse(actual = test$price,
                        predicted = linreg_pred)
# retrieve output as a data frame
linreg_rmse_test = data.frame(linreg_test_rmse)
print(linreg_rmse_test)
## linreg_test_rmse
## 1 452028.5
The root mean squared error on the test set is 452028.5, roughly a quarter of the mean listing price (1861593).
Random forest regression is a common ensemble learning algorithm that is typically less prone to overfitting the training data than a single regression tree. Here, I will fit a random forest model using the caret package with five-fold cross-validation. I’ll also define a tuneGrid in which mtry (i.e., the number of features randomly sampled as candidates at each split) is varied from 1 to 29.
Fit the model.
library(caret)
library(ranger)
# Create a tuning grid that varies mtry from 1 to 29
tuneGrid = data.frame(
  mtry = 1:29,
  splitrule = "variance",
  min.node.size = 5
)
# Fit the tuned model with five-fold cross-validation
model_rf_tuned = train(price ~ .,
                       data = train,
                       method = "ranger",
                       tuneGrid = tuneGrid,
                       trControl = trainControl(method = "cv", number = 5))
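To illustrate what five-fold cross-validation does behind the scenes, here is a minimal base-R sketch on simulated data. It uses a simple linear model instead of a random forest to stay self-contained; toy, fold_id, and the simulated coefficients are assumptions for illustration. As far as I understand, caret repeats this kind of loop for each row of the tuning grid and keeps the setting with the lowest cross-validated RMSE.

```r
set.seed(1234)

# Simulated data: price depends linearly on size plus noise
n <- 100
toy <- data.frame(size = runif(n, 24, 140))
toy$price <- 500000 + 20000 * toy$size + rnorm(n, sd = 200000)

# Randomly assign each row to one of k folds of equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  fit <- lm(price ~ size, data = toy[fold_id != i, ])
  fold_pred <- predict(fit, newdata = toy[fold_id == i, ])
  fold_rmse[i] <- sqrt(mean((toy$price[fold_id == i] - fold_pred)^2))
}

# The cross-validated RMSE is the average over the k held-out folds
cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```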
Predict on test set.
# Compute test set RMSE for the best model
pred = predict(model_rf_tuned, test)
rf_rmse_test = data.frame(rf_rmse_test = rmse(actual = test$price, predicted = pred))
# Print output
rf_rmse_test
## rf_rmse_test
## 1 461906.8
The root mean squared error for the best model is 461906.8, about 2% higher than for the linear regression. On this test split, the linear regression therefore performs slightly better than the random forest regression.
Elastic net regression is a mix of two penalized regression methods: ridge regression (which penalizes the squared magnitude of the coefficients) and lasso regression (which penalizes their absolute magnitude and can thereby shrink some coefficients to exactly zero). Here, I’ll fit an elastic net regression with the alpha hyperparameter being either 0 (ridge) or 1 (lasso) and the lambda hyperparameter ranging from 0.0001 to 1. As with the random forest regression, I use five-fold cross-validation.
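The role of the two hyperparameters can be sketched with the penalty term that glmnet adds to the squared-error loss: lambda sets the overall strength of the penalty, and alpha sets the mix between the ridge part (sum of squared coefficients, halved) and the lasso part (sum of absolute coefficients). The function elnet_penalty and the coefficient vector beta below are illustrative assumptions, not part of any package.

```r
# Elastic net penalty in the glmnet parameterization:
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
elnet_penalty <- function(beta, alpha, lambda) {
  ridge_part <- sum(beta^2) / 2
  lasso_part <- sum(abs(beta))
  lambda * ((1 - alpha) * ridge_part + alpha * lasso_part)
}

# Made-up coefficients
beta <- c(3, -4, 0)

p_ridge <- elnet_penalty(beta, alpha = 0,   lambda = 1)  # pure ridge
p_lasso <- elnet_penalty(beta, alpha = 1,   lambda = 1)  # pure lasso
p_mix   <- elnet_penalty(beta, alpha = 0.5, lambda = 1)  # equal mix
print(c(ridge = p_ridge, lasso = p_lasso, mix = p_mix))
```

With alpha restricted to 0 and 1, the grid compares pure ridge against pure lasso; intermediate values such as 0.5 would be needed to try a genuine mix.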
Fit model.
library(glmnet) # method = "glmnet" in caret relies on the glmnet package
# tuneGrid fully specifies the search, so tuneLength is not needed
model.elnet = train(price ~ .,
                    data = train,
                    tuneGrid = expand.grid(alpha = 0:1,
                                           lambda = seq(0.0001, 1, length.out = 100)),
                    method = "glmnet",
                    trControl = trainControl(method = "cv", number = 5))
Predict test set.
# predict on test set
pred = predict(model.elnet, test)
# Retrieve RMSE value
elnet_rmse_test = rmse(actual = test$price, predicted = pred)
elnet_rmse_test
## [1] 450869.8
The root mean squared error is 450869.8. On this test split, the elastic net regression thus performs marginally better than both the linear regression and the random forest regression when predicting house prices in Umeå (its RMSE is about 0.3% and 2.4% lower, respectively).
Here is an aggregated table of the test RMSE for each model.
linreg_rmse_test %>%
  cbind(rf_rmse_test) %>%
  cbind(elnet_rmse_test) %>%
  gather(test, rmse, linreg_test_rmse:elnet_rmse_test) %>%
  arrange(rmse)
## test rmse
## 1 elnet_rmse_test 450869.8
## 2 linreg_test_rmse 452028.5
## 3 rf_rmse_test 461906.8
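To put the differences in the table into perspective, the gap between each model and the best one can be expressed as a percentage. The sketch below simply reuses the three RMSE values reported above; the names rmse_test and rel_gap are illustrative.

```r
# Test RMSE values copied from the table above
rmse_test <- c(linreg = 452028.5, rf = 461906.8, elnet = 450869.8)

# Percentage by which each RMSE exceeds the lowest one, sorted ascending
rel_gap <- sort(rmse_test / min(rmse_test) - 1) * 100
print(round(rel_gap, 2))
```

All three models end up within a few percent of each other on a test set of only 48 listings, so the ranking should be read with some caution.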