In this project, I look at house prices in Umeå, a medium-sized city in northern Sweden. First, I scraped data from https://www.hemnet.se/ and obtained 189 observations with features such as price, size in square meters, number of rooms, and location. Since the goal is to predict the price of a house from its size, location, and number of rooms, this is a regression problem: the dependent variable is continuous.

Load the required packages for importing data and preprocessing.

library(tidyverse)
library(readxl)



Check the dimensions of the dataset (rows and columns).

dim(df)
## [1] 189  40


Check summary of the dataset

summary(df)
##        id          loc                price          size_sqrt_m    
##  Min.   :  1   Length:189         Min.   : 895000   Min.   : 24.00  
##  1st Qu.: 64   Class :character   1st Qu.:1395000   1st Qu.: 48.50  
##  Median :129   Mode  :character   Median :1675000   Median : 64.00  
##  Mean   :125                      Mean   :1861593   Mean   : 65.68  
##  3rd Qu.:185                      3rd Qu.:2189000   3rd Qu.: 79.00  
##  Max.   :251                      Max.   :5495000   Max.   :140.00  
##      rooms           area           area_mariestrand     area_teg      
##  Min.   :1.000   Length:189         Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:2.000   Class :character   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :2.000   Mode  :character   Median :0.00000   Median :0.00000  
##  Mean   :2.466                      Mean   :0.03704   Mean   :0.08466  
##  3rd Qu.:3.000                      3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :4.000                      Max.   :1.00000   Max.   :1.00000  
##    area_haga        area_umeå  area_ersboda     area_umedalen    
##  Min.   :0.0000   Min.   :0   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0   Median :0.00000   Median :0.00000  
##  Mean   :0.1323   Mean   :0   Mean   :0.05291   Mean   :0.04233  
##  3rd Qu.:0.0000   3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :0   Max.   :1.00000   Max.   :1.00000  
##   area_böleäng      area_tomtebo      area_gimonäs     area_mariehöjd   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.01058   Mean   :0.08995   Mean   :0.02645   Mean   :0.02116  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##  area_sandåkern    area_vasterslatt  area_mariehem      area_ålidhem    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.03704   Mean   :0.03175   Mean   :0.06878   Mean   :0.02116  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   area_outside      area_obbola        area_grubbe       area_centrum    
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.01587   Mean   :0.005291   Mean   :0.01058   Mean   :0.07407  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
##  area_stöcksjö      area_öst.på.stan   area_berghem     area_väst.på.stan
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.005291   Mean   :0.02116   Mean   :0.05291   Mean   :0.07937  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   area_ersmark        area_sävar      area_hörnefors     area_sörmjöle    
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.005291   Mean   :0.01058   Mean   :0.005291   Mean   :0.01587  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##   area_röbäck       area_holmsund      area_brännland area_söderslätt
##  Min.   :0.000000   Min.   :0.000000   Min.   :0      Min.   :0      
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0      1st Qu.:0      
##  Median :0.000000   Median :0.000000   Median :0      Median :0      
##  Mean   :0.005291   Mean   :0.005291   Mean   :0      Mean   :0      
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0      3rd Qu.:0      
##  Max.   :1.000000   Max.   :1.000000   Max.   :0      Max.   :0      
##  area_mariedal     area_tavleliden area_öbacka.strand     date          
##  Min.   :0.00000   Min.   :0       Min.   :0.00000    Length:189        
##  1st Qu.:0.00000   1st Qu.:0       1st Qu.:0.00000    Class :character  
##  Median :0.00000   Median :0       Median :0.00000    Mode  :character  
##  Mean   :0.01587   Mean   :0       Mean   :0.01587                      
##  3rd Qu.:0.00000   3rd Qu.:0       3rd Qu.:0.00000                      
##  Max.   :1.00000   Max.   :0       Max.   :1.00000

No missing data seems to be present: the summary reports no NA counts for any of the numeric columns.
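
Since summary() does not report missing values for character columns, an explicit count per column is a useful cross-check. A minimal sketch on a made-up data frame (the toy values are hypothetical and only stand in for df):

```r
# Hypothetical toy data standing in for df; two values are deliberately missing
toy <- data.frame(
  price = c(1400000, 1250000, NA),
  rooms = c(2, 1, 3),
  area  = c("teg", NA, "haga"),
  stringsAsFactors = FALSE
)

# Number of missing values per column, including character columns
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only if the whole data frame is free of missing values
no_missing <- !any(is.na(toy))
print(no_missing)
```

Running the same two calls on df should give zeros everywhere if the data really is complete.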



Visualization

# price
ggplot(df, aes(x = price)) +
  geom_histogram(bins = 20)

# rooms
ggplot(df, aes(x = rooms)) + 
  geom_histogram(bins = 4)

# square meters
ggplot(df, aes(x = size_sqrt_m)) +
  geom_histogram(bins = 25)

# area: number of listings per area, sorted by count
df %>%
  count(area) %>%
  ggplot(aes(x = reorder(area, n), y = n)) +
  geom_col() + 
  coord_flip()

Check intercorrelations between the most relevant variables.

library(ggcorrplot)

# Correlation matrix of price and the most relevant features
corr = df %>%
  select(price, size_sqrt_m, rooms, area_centrum, area_haga, area_teg, area_ersboda, area_tomtebo) %>%
  cor()

ggcorrplot(corr, lab = TRUE, lab_size = 4) +
  theme(axis.text.x = element_text(size = 16),
        axis.text.y = element_text(size = 16),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 14))

Size in square meters and number of rooms in particular appear to be highly correlated with the price of the house. Also, houses located in downtown Umeå (area_centrum) appear to have higher prices, whereas houses located in Ersboda appear to have lower prices.
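
To read the same information as numbers rather than colors, the correlations with price can be pulled out of the correlation matrix and sorted. A minimal sketch on made-up data (the values below are hypothetical, not the scraped listings):

```r
# Hypothetical toy data, not the scraped listings
toy <- data.frame(
  price        = c(1400000, 1250000, 1600000, 1500000, 2100000, 1100000),
  size_sqrt_m  = c(55, 40, 64, 58, 90, 45),
  rooms        = c(2, 1, 2, 2, 4, 2),
  area_ersboda = c(0, 0, 0, 0, 0, 1)
)

corr_toy <- cor(toy)

# Correlation of each feature with price (price itself dropped),
# sorted from strongest positive to strongest negative
price_corr <- sort(corr_toy[-1, "price"], decreasing = TRUE)
round(price_corr, 2)
```

For a 0/1 dummy such as area_ersboda, a negative value simply means that listings in that area have a lower average price than the rest.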



Data Preprocessing

Before moving on to prediction, the data needs to be preprocessed. Here, I’ll remove the features that are not relevant for the analyses (id, loc, area, and date). I also standardize each feature to mean 0 and standard deviation 1. This matters mainly for penalized models such as the elastic net, where the size of a coefficient, and therefore its penalty, depends on the scale of the feature. Finally, I remove the dummy-coded area features that have few or no observations.

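# Standardize all features except price and the columns dropped below.
# scale() returns a one-column matrix, hence the [,1] suffix in the printed column names.
df = df %>%
  mutate_at(vars(-price, -id, -area, -loc, -date), scale) %>%
  select(-id, -area, -loc, -date, -area_umeå, -area_brännland, -area_söderslätt, -area_tavleliden, -area_hörnefors, -area_öbacka.strand)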

head(df)
## # A tibble: 6 x 30
##    price size_sqrt_m[,1] rooms[,1] area_mariestran~ area_teg[,1]
##    <dbl>           <dbl>     <dbl>            <dbl>        <dbl>
## 1 1.40e6         -0.0749    -0.503            5.09        -0.303
## 2 1.25e6         -1.74      -1.58            -0.196        3.28 
## 3 1.60e6         -0.577     -0.503           -0.196       -0.303
## 4 1.50e6         -1.17      -1.58            -0.196       -0.303
## 5 1.80e6          2.10       1.66            -0.196       -0.303
## 6 1.12e6          0.182     -0.503           -0.196       -0.303
## # ... with 25 more variables: area_haga[,1] <dbl>, area_ersboda[,1] <dbl>,
## #   area_umedalen[,1] <dbl>, area_böleäng[,1] <dbl>,
## #   area_tomtebo[,1] <dbl>, area_gimonäs[,1] <dbl>,
## #   area_mariehöjd[,1] <dbl>, area_sandåkern[,1] <dbl>,
## #   area_vasterslatt[,1] <dbl>, area_mariehem[,1] <dbl>,
## #   area_ålidhem[,1] <dbl>, area_outside[,1] <dbl>, area_obbola[,1] <dbl>,
## #   area_grubbe[,1] <dbl>, area_centrum[,1] <dbl>,
## #   area_stöcksjö[,1] <dbl>, area_öst.på.stan[,1] <dbl>,
## #   area_berghem[,1] <dbl>, area_väst.på.stan[,1] <dbl>,
## #   area_ersmark[,1] <dbl>, area_sävar[,1] <dbl>, area_sörmjöle[,1] <dbl>,
## #   area_röbäck[,1] <dbl>, area_holmsund[,1] <dbl>,
## #   area_mariedal[,1] <dbl>
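
The `[,1]` suffix in the column names above appears because scale() returns a one-column matrix. A sketch of one way to get plain numeric columns instead, assuming dplyr 1.0.0 or later (the toy tibble is hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for the listings
toy <- tibble(
  price       = c(1400000, 1250000, 1600000, 1500000),
  size_sqrt_m = c(64, 30, 55, 42),
  rooms       = c(2, 1, 2, 1)
)

# Standardize every column except price; as.numeric() drops the matrix shape
toy_scaled <- toy %>%
  mutate(across(-price, ~ as.numeric(scale(.x))))

toy_scaled
```

The standardized values are the same as before; only the column type changes, which makes the printed tibble easier to read.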

Next, I’ll split the data into a training set with 75 % of the observations and a test set with the remaining 25 %. I also set a random seed so that the analyses are reproducible.

# SPLIT INTO TRAINING AND TEST SET
smp_size = floor(0.75 * nrow(df))
set.seed(1234)
train_ind = sample(seq_len(nrow(df)), size = smp_size)

# train set
train = df[train_ind,]

# test set
test = df[-train_ind,]
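
One thing to keep in mind: above, the features were standardized on the full dataset before the split, so the test rows contribute to the means and standard deviations. A common alternative, sketched here on simulated data (the toy values and the names train_toy and test_toy are assumptions, not the scraped listings), estimates the scaling parameters on the training rows only:

```r
set.seed(1234)

# Simulated stand-in for the unscaled listings (hypothetical values)
toy <- data.frame(
  price       = round(runif(189, 900000, 5500000)),
  size_sqrt_m = runif(189, 24, 140)
)

# Same 75/25 split logic as above
smp_size  <- floor(0.75 * nrow(toy))
train_ind <- sample(seq_len(nrow(toy)), size = smp_size)
train_toy <- toy[train_ind, ]
test_toy  <- toy[-train_ind, ]

# Estimate the scaling parameters on the training rows only ...
mu    <- mean(train_toy$size_sqrt_m)
sigma <- sd(train_toy$size_sqrt_m)

# ... and apply them to both sets
train_toy$size_sqrt_m <- (train_toy$size_sqrt_m - mu) / sigma
test_toy$size_sqrt_m  <- (test_toy$size_sqrt_m - mu) / sigma

nrow(train_toy)  # 141 rows, i.e. floor(0.75 * 189)
nrow(test_toy)   # 48 rows
```

With 189 observations the difference is likely small, but this ordering keeps the test set fully unseen during preprocessing.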



Machine Learning

Now we are ready for the analyses. In this project, I employ three common machine learning algorithms for regression problems: linear regression, random forest regression, and elastic net regression. To evaluate the accuracy of each model, I use the root mean squared error (RMSE) on the test set.
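
For reference, the root mean squared error is the square root of the mean squared difference between actual and predicted prices. A minimal sketch of the computation in base R (rmse_manual is a hypothetical helper name and the prices are made up; the analyses below use rmse() from the Metrics package instead):

```r
# Root mean squared error: sqrt of the average squared prediction error
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices for illustration
actual    <- c(1400000, 1250000, 1600000)
predicted <- c(1500000, 1250000, 1400000)

toy_rmse <- rmse_manual(actual, predicted)
toy_rmse
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones, and the result is expressed in the same unit as price.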


Linear Regression

Linear regression is the modeling approach one typically starts with. Here, it serves as a baseline against which the more complex models are compared. Linear regression performs well when the relationships between the features and the dependent variable are linear. However, it tends to underfit the data when the relationships are non-linear.

Fit the model and predict price on the test set.

library(Metrics) # provides rmse() for the root mean squared error

# Fit model on training data
linreg_model = lm(price ~ ., data = train)

# predict price on the test set
test$pred <- predict(linreg_model, test)

# Calculate RMSE on the test set
linreg_test_rmse = rmse(actual = test$price, 
                        predicted = test$pred)

# Retrieve output as a data frame
linreg_rmse_test = data.frame(linreg_test_rmse)

print(linreg_rmse_test)
##   linreg_test_rmse
## 1         452028.5

The test-set root mean squared error of the linear regression is 452028.5.



Random Forest Regression

A random forest is a common ensemble learning algorithm that is typically less prone to overfitting the training data than a single regression tree. Here, I fit a random forest using the caret package with five-fold cross-validation. I also define a tuneGrid in which mtry (i.e., the number of features randomly sampled as candidates at each split) is varied from 1 to 29.

Fit model and predict on the test set.

library(caret)
library(ranger) 

# Create tuning grid by varying mtry from 1 to 29
tuneGrid = data.frame(
  mtry = 1:29,
  splitrule = "variance",
  min.node.size = 5
)

# Compute tuned model 
model_rf_tuned = train(price~.,
                 data = train,
                 method = "ranger",
                 tuneGrid = tuneGrid,
                 trControl = trainControl(method = "cv", number = 5))
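
After fitting, caret keeps the cross-validated results for every row of the grid, so the selected mtry can be inspected directly. A small sketch on the built-in mtcars data (a stand-in for the house data; the grid of 1 to 3 is an assumption chosen to keep the example fast):

```r
library(caret)
library(ranger)

set.seed(1234)  # makes the cross-validation folds reproducible

# Small tuning grid for illustration
grid <- data.frame(mtry = 1:3, splitrule = "variance", min.node.size = 5)

fit <- train(mpg ~ .,
             data = mtcars,
             method = "ranger",
             tuneGrid = grid,
             trControl = trainControl(method = "cv", number = 5))

# Hyperparameters with the lowest cross-validated RMSE
fit$bestTune

# Cross-validated RMSE for each mtry value
fit$results[, c("mtry", "RMSE")]
```

The same two accessors, bestTune and results, apply to model_rf_tuned.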

Predict on test set.

# Compute test set RMSE on best_model
pred = predict(model_rf_tuned, test)

rf_rmse_test = data.frame(rf_rmse_test = rmse(actual = test$price, predicted = pred))

# Print output
rf_rmse_test
##   rf_rmse_test
## 1     461906.8

The root mean squared error for the best model is 461906.8, which is higher than for the linear regression. On this test set, the linear regression thus performs better than the random forest.



Elastic Net Regression

Elastic net regression is a mix of two penalized regression methods: ridge regression (which penalizes the sum of squared coefficients) and lasso regression (which penalizes the sum of absolute coefficients and can shrink some coefficients to exactly zero). In glmnet, alpha controls the mix (0 is pure ridge, 1 is pure lasso) and lambda controls the overall strength of the penalty. Here, I’ll fit the model with alpha being either 0 or 1 and lambda ranging from 0.0001 to 1, so cross-validation effectively chooses between pure ridge and pure lasso. As with the random forest regression, I use five-fold cross-validation.
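
To make the roles of alpha and lambda concrete, here is a minimal sketch of the penalty term in the parameterization used by glmnet (elnet_penalty is a hypothetical helper name and the coefficients are made up):

```r
# Elastic net penalty: lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|))
elnet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2, 0)  # made-up coefficients

ridge_pen <- elnet_penalty(beta, alpha = 0,   lambda = 1)  # squared part only
lasso_pen <- elnet_penalty(beta, alpha = 1,   lambda = 1)  # absolute part only
mixed_pen <- elnet_penalty(beta, alpha = 0.5, lambda = 1)  # equal mix of both

c(ridge = ridge_pen, lasso = lasso_pen, mixed = mixed_pen)
```

This penalty is added to the usual squared-error loss, so a larger lambda pulls all coefficients more strongly toward zero.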

Fit model.

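library(glmnet) # caret's "glmnet" method relies on the glmnet package
model.elnet = train(price ~ ., 
                    data = train,
                    method = "glmnet",
                    # alpha = 0 is ridge, alpha = 1 is lasso
                    # (tuneLength is ignored when tuneGrid is supplied, so it is omitted)
                    tuneGrid = expand.grid(alpha = 0:1,
                                           lambda = seq(0.0001, 1, length = 100)),
                    trControl = trainControl(method = "cv", number = 5))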
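
For comparison, a grid that also includes intermediate alpha values lets cross-validation pick a genuine mix of ridge and lasso. This is a variation, not the grid used in this post; a sketch on the built-in mtcars data, where the alpha steps and the log-spaced lambda range are assumptions:

```r
library(caret)
library(glmnet)

set.seed(1234)  # makes the cross-validation folds reproducible

# Five alpha values from ridge (0) to lasso (1), 20 log-spaced lambda values
mix_grid <- expand.grid(alpha  = seq(0, 1, by = 0.25),
                        lambda = 10^seq(-3, 1, length.out = 20))

fit_mix <- train(mpg ~ .,
                 data = mtcars,
                 method = "glmnet",
                 tuneGrid = mix_grid,
                 trControl = trainControl(method = "cv", number = 5))

# Selected combination of alpha and lambda
fit_mix$bestTune
```

A sensible lambda range depends on the scale of the response, so for prices in the millions the range would need to be adapted rather than copied from this example.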

Predict test set.

# predict on test set
pred = predict(model.elnet, test)

# Retrieve RMSE value
elnet_rmse_test = rmse(actual = test$price, predicted = pred)

elnet_rmse_test
## [1] 450869.8

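The root mean squared error is 450869.8, so the elastic net regression performs marginally better than both the linear regression (452028.5) and the random forest regression (461906.8) when predicting house prices in Umeå. The gap to the linear regression is only about 0.3 %, which is small given that the test set contains just 48 observations.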



Here is an aggregated table of the test-set RMSE for the three models, sorted from lowest to highest.

linreg_rmse_test %>%
  cbind(rf_rmse_test) %>%
  cbind(elnet_rmse_test) %>%
  gather(test, rmse, linreg_test_rmse:elnet_rmse_test) %>%
  arrange(rmse)
##               test     rmse
## 1  elnet_rmse_test 450869.8
## 2 linreg_test_rmse 452028.5
## 3     rf_rmse_test 461906.8
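
To put these differences in perspective, the RMSE values can be expressed relative to the linear regression baseline. A small sketch using the three values reported above (the vector names are my own):

```r
# Test-set RMSE values reported above
rmse_values <- c(elnet = 450869.8, linreg = 452028.5, rf = 461906.8)

# Percentage difference relative to the linear regression baseline
pct_vs_linreg <- 100 * (rmse_values / rmse_values[["linreg"]] - 1)
round(pct_vs_linreg, 2)
```

Negative values mean a lower (better) RMSE than the baseline, positive values a higher one.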