The aim of this project is to analyse the factors that affect total rent in apartment rentals in Germany, such as the service charge, location, and condition of the flat. The Apartment rental offers in Germany dataset is analysed using machine learning methods: backward selection, elastic-net regression, linear regression, k-nearest neighbours (KNN), and random forest. For each algorithm, feature selection, feature engineering, and a logarithmic transformation of the dependent variable are applied in order to improve the model output and predictive power. Finally, the models are compared with each other in order to obtain the best prediction results.
The original data has 268,850 observations and 49 variables. For efficiency, the data will be described in detail after the elimination of missing values. The missing-value counts per variable are shown below:
## regio1 newlyConst balcony
## 0 0 0
## picturecount scoutId hasKitchen
## 0 0 0
## geo_bln cellar baseRent
## 0 0 0
## livingSpace geo_krs street
## 0 0 0
## lift baseRentRange geo_plz
## 0 0 0
## noRooms noRoomsRange garden
## 0 0 0
## livingSpaceRange regio2 regio3
## 0 0 0
## date pricetrend serviceCharge
## 0 1832 6909
## description telekomTvOffer telekomUploadSpeed
## 19747 32619 33358
## typeOfFlat totalRent heatingType
## 36614 40517 44856
## floor facilities firingTypes
## 51309 52924 56964
## yearConstructed yearConstructedRange condition
## 57045 57045 68489
## streetPlain houseNumber numberOfFloors
## 71013 71018 97732
## thermalChar interiorQual petsAllowed
## 106506 112665 114573
## noParkSpaces heatingCosts lastRefurbish
## 175798 183332 188139
## energyEfficiencyClass electricityBasePrice electricityKwhPrice
## 191063 222004 222004
## telekomHybridUploadSpeed
## 223830
## Observations: 268,850
## Variables: 49
## $ regio1 <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz…
## $ serviceCharge <dbl> 245.00, 134.00, 255.00, 58.15, 138.00, …
## $ heatingType <chr> "central_heating", "self_contained_cent…
## $ telekomTvOffer <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_…
## $ telekomHybridUploadSpeed <dbl> NA, NA, 10, NA, NA, NA, 10, 10, NA, NA,…
## $ newlyConst <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
## $ balcony <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
## $ picturecount <dbl> 6, 8, 8, 9, 19, 5, 9, 5, 5, 7, 11, 9, 4…
## $ pricetrend <dbl> 4.62, 3.47, 2.72, 1.53, 2.46, 4.48, 1.0…
## $ telekomUploadSpeed <dbl> 10.0, 10.0, 2.4, 40.0, NA, 2.4, 2.4, 2.…
## $ totalRent <dbl> 840.00, NA, 1300.00, NA, 903.00, NA, 38…
## $ yearConstructed <dbl> 1965, 1871, 2019, 1964, 1950, 1999, NA,…
## $ scoutId <dbl> 96107057, 111378734, 113147523, 1088909…
## $ noParkSpaces <dbl> 1, 2, 1, NA, NA, NA, NA, NA, 1, NA, NA,…
## $ firingTypes <chr> "oil", "gas", NA, "district_heating", "…
## $ hasKitchen <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ geo_bln <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz…
## $ cellar <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,…
## $ yearConstructedRange <dbl> 2, 1, 9, 2, 1, 5, NA, 2, 2, 2, 1, 1, 1,…
## $ baseRent <dbl> 595.00, 800.00, 965.00, 343.00, 765.00,…
## $ houseNumber <chr> "244", NA, "4", "35", "10", "1e", "14",…
## $ livingSpace <dbl> 86.00, 89.00, 83.80, 58.15, 84.97, 53.4…
## $ geo_krs <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresd…
## $ condition <chr> "well_kept", "refurbished", "first_time…
## $ interiorQual <chr> "normal", "normal", "sophisticated", NA…
## $ petsAllowed <chr> NA, "no", NA, NA, NA, "no", NA, NA, "no…
## $ street <chr> "Schüruferstraße", "no_infor…
## $ streetPlain <chr> "Schüruferstraße", NA, "Turnerweg", "Gl…
## $ lift <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
## $ baseRentRange <dbl> 4, 5, 6, 2, 5, 2, 2, 3, 4, 1, 1, 2, 5, …
## $ typeOfFlat <chr> "ground_floor", "ground_floor", "apartm…
## $ geo_plz <chr> "44269", "67459", "01097", "09599", "28…
## $ noRooms <dbl> 4.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0,…
## $ thermalChar <dbl> 181.40, NA, NA, 86.00, 188.90, 165.00, …
## $ floor <dbl> 1, NA, 3, 3, 1, NA, 1, NA, 2, 2, 3, 1, …
## $ numberOfFloors <dbl> 3, NA, 4, NA, NA, NA, 4, NA, 2, 5, NA, …
## $ noRoomsRange <dbl> 4, 3, 3, 3, 3, 2, 2, 3, 2, 2, 2, 3, 4, …
## $ garden <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ livingSpaceRange <dbl> 4, 4, 4, 2, 4, 2, 3, 2, 2, 2, 1, 3, 4, …
## $ regio2 <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresd…
## $ regio3 <chr> "Schüren", "Böhl_Iggelheim", "Äußere_Ne…
## $ description <chr> "Die ebenerdig zu erreichende Erdgescho…
## $ facilities <chr> "Die Wohnung ist mit Laminat ausgelegt.…
## $ heatingCosts <dbl> NA, NA, NA, 87.23, NA, NA, NA, 44.00, N…
## $ energyEfficiencyClass <chr> NA, NA, NA, NA, NA, NA, NA, "B", "E", N…
## $ lastRefurbish <dbl> NA, 2019, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricityBasePrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ electricityKwhPrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ date <chr> "May19", "May19", "Oct19", "May19", "Fe…
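The code that produced the two summaries above is not included in this excerpt; a minimal sketch, assuming the raw imported data set is called immo_data (hypothetical name), could look like this.
library(dplyr)
# missing values per column, sorted from least to most affected
colSums(is.na(immo_data)) %>% sort()
# compact overview of all 49 variables and their first values
glimpse(immo_data)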
In this step, the unnecessary variables and the variables with an extremely large number of missing values are eliminated. The resulting shape of the data is shown below.
## regio1 serviceCharge telekomTvOffer
## 0 0 0
## newlyConst balcony picturecount
## 0 0 0
## telekomUploadSpeed totalRent yearConstructed
## 0 0 0
## hasKitchen cellar baseRent
## 0 0 0
## livingSpace condition interiorQual
## 0 0 0
## lift typeOfFlat noRooms
## 0 0 0
## floor garden
## 0 0
## Observations: 71,087
## Variables: 20
## $ regio1 <chr> "Nordrhein_Westfalen", "Sachsen", "Baden_Würt…
## $ serviceCharge <dbl> 245.00, 255.00, 110.00, 200.00, 215.00, 50.00…
## $ telekomTvOffer <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_YEAR_F…
## $ newlyConst <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE…
## $ balcony <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE,…
## $ picturecount <dbl> 6, 8, 5, 3, 12, 12, 35, 15, 9, 7, 16, 14, 11,…
## $ telekomUploadSpeed <dbl> 10.0, 2.4, 40.0, 40.0, 2.4, 40.0, 5.0, 10.0, …
## $ totalRent <dbl> 840.00, 1300.00, 690.00, 1150.00, 1320.65, 32…
## $ yearConstructed <dbl> 1965, 2019, 1970, 1951, 2018, 1897, 2013, 197…
## $ hasKitchen <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS…
## $ cellar <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ baseRent <dbl> 595.00, 965.00, 580.00, 950.00, 972.60, 200.0…
## $ livingSpace <dbl> 86.00, 83.80, 53.00, 123.44, 87.00, 50.00, 12…
## $ condition <chr> "well_kept", "first_time_use", "well_kept", "…
## $ interiorQual <chr> "normal", "sophisticated", "sophisticated", "…
## $ lift <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE,…
## $ typeOfFlat <chr> "ground_floor", "apartment", "roof_storey", "…
## $ noRooms <dbl> 4.0, 3.0, 2.0, 4.0, 3.0, 2.0, 5.0, 4.0, 3.0, …
## $ floor <dbl> 1, 3, 2, 4, 0, 3, 1, 0, 2, 1, 1, 3, 2, 0, 2, …
## $ garden <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
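The cleaning code itself is not shown in the excerpt; a minimal sketch, under the assumptions that the raw data frame is called immo_data and that the 20 columns listed above are the ones kept (the result is stored as immo_data_main, the name used later in the outlier filter):
library(dplyr)
library(tidyr)
kept_vars <- c("regio1", "serviceCharge", "telekomTvOffer", "newlyConst",
               "balcony", "picturecount", "telekomUploadSpeed", "totalRent",
               "yearConstructed", "hasKitchen", "cellar", "baseRent",
               "livingSpace", "condition", "interiorQual", "lift",
               "typeOfFlat", "noRooms", "floor", "garden")
immo_data_main <- immo_data %>%
  select(all_of(kept_vars)) %>%  # drop IDs, free-text fields and high-NA columns
  drop_na()                      # keep complete cases only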
The final version of the data is obtained by removing outliers.
immo_data_ready <- immo_data_main %>%
  filter(baseRent < 10000 & baseRent > 100,
         totalRent < 10000 & totalRent > 100,
         livingSpace < 500 & livingSpace > 10,
         serviceCharge < 1500 & serviceCharge > 1,
         floor < 19,
         noRooms < 15 & noRooms > 0)
After this step, 71,087 observations and 20 variables remain. The variables are described as follows.
regio1 - States.
serviceCharge - Ancillary costs such as electricity or internet, in €.
telekomTvOffer - Is paid TV included? If so, which offer?
newlyConst - Is the building newly constructed?
balcony - Does the object have a balcony?
picturecount - How many pictures were uploaded to the listing?
telekomUploadSpeed - How fast is the internet upload speed?
totalRent - Total rent (usually a sum of base rent, service charge and heating cost).
yearConstructed - Construction year.
hasKitchen - Does it have a kitchen?
cellar - Does it have a cellar?
baseRent - Base rent without electricity and heating.
livingSpace - Living space in sqm.
condition - Condition of the flat.
interiorQual - Interior quality.
lift - Is an elevator available?
typeOfFlat - Type of flat.
noRooms - Number of rooms.
floor - Which floor is the flat on?
garden - Does it have a garden?
Data obtained from Wikipedia is used to apply feature engineering based on the GDPPerC (GDP per capita) values. The data includes current information about the German states.
## States AreaKm2 Population PopPerKm2 GDPPerC
## 1 Nordrhein_Westfalen 34085 17932651 526 38645
## 2 Thuringen 16172 2143145 133 28747
## 3 Bayern 70552 13076721 185 45810
## 4 Brandenburg 29479 2511917 85 27675
## 5 Berlin 892 3644826 4086 38032
## 6 Saarland 2569 990509 386 35460
In this step, the correlations between the numeric variables are analysed.
houses_numeric_vars <-
sapply(immo_data_ready, is.numeric) %>%
which() %>%
names()
houses_correlations <-
cor(immo_data_ready[,houses_numeric_vars],
use = "pairwise.complete.obs")The numeric variables are shown below:
## [1] "serviceCharge" "picturecount" "telekomUploadSpeed"
## [4] "totalRent" "yearConstructed" "baseRent"
## [7] "livingSpace" "noRooms" "floor"
All pairwise correlations between the numeric variables are calculated here. Several variables turn out to be highly correlated with each other. In particular, baseRent has the highest correlation with totalRent (0.99242), and serviceCharge and livingSpace are also strongly correlated with totalRent. For this reason, baseRent, livingSpace, and serviceCharge are excluded.
## serviceCharge picturecount telekomUploadSpeed
## serviceCharge 1.00000000 0.262870078 0.012980466
## picturecount 0.26287008 1.000000000 0.005611472
## telekomUploadSpeed 0.01298047 0.005611472 1.000000000
## totalRent 0.75482456 0.300535601 0.025630705
## yearConstructed 0.13699861 0.006427143 -0.010442569
## baseRent 0.70345989 0.295050836 0.025414612
## livingSpace 0.69195104 0.307289034 -0.004532310
## noRooms 0.48297396 0.219839610 0.002225827
## floor 0.03572704 0.016630275 0.010408192
## totalRent yearConstructed baseRent livingSpace
## serviceCharge 0.75482456 0.136998610 0.70345989 0.6919510395
## picturecount 0.30053560 0.006427143 0.29505084 0.3072890345
## telekomUploadSpeed 0.02563070 -0.010442569 0.02541461 -0.0045323098
## totalRent 1.00000000 0.163817734 0.99242000 0.7601224929
## yearConstructed 0.16381773 1.000000000 0.15925701 0.0502247807
## baseRent 0.99242000 0.159257010 1.00000000 0.7358925869
## livingSpace 0.76012249 0.050224781 0.73589259 1.0000000000
## noRooms 0.49697236 0.013888760 0.47140203 0.7757844632
## floor 0.05062938 0.001643554 0.04489883 0.0005930456
## noRooms floor
## serviceCharge 0.482973959 0.0357270376
## picturecount 0.219839610 0.0166302745
## telekomUploadSpeed 0.002225827 0.0104081920
## totalRent 0.496972356 0.0506293757
## yearConstructed 0.013888760 0.0016435542
## baseRent 0.471402031 0.0448988296
## livingSpace 0.775784463 0.0005930456
## noRooms 1.000000000 0.0035751333
## floor 0.003575133 1.0000000000
Let’s analyze correlation results with corrplot.
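The corrplot call is not shown in the excerpt; a typical call on the correlation matrix computed above might look like this.
library(corrplot)
corrplot(houses_correlations,
         method = "color",  # colour-coded cells
         type = "upper",    # show the upper triangle only
         tl.cex = 0.8)      # shrink the variable labels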
Let’s find the variable that has the highest correlation with the dependent variable (totalRent); according to the correlation results, this is baseRent.
houses_numeric_vars_order <-
houses_correlations[,"totalRent"] %>%
sort(decreasing = TRUE) %>%
names()
houses_numeric_vars_order
## [1] "totalRent" "baseRent" "livingSpace"
## [4] "serviceCharge" "noRooms" "picturecount"
## [7] "yearConstructed" "floor" "telekomUploadSpeed"
As can be seen in the plot, totalRent and baseRent have a strongly positive linear relationship. This is not unexpected: as mentioned in the data description, totalRent includes baseRent, so baseRent is not a useful explanatory variable in this case. Therefore, baseRent, livingSpace, and serviceCharge will be excluded from every model.
ggplot(immo_data_ready,
       aes(x = totalRent,
           y = baseRent)) +
  geom_point(col = "red") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
Below, a logarithmic transformation is applied to the dependent variable (totalRent). Each histogram shows the distribution of totalRent. The first differs from the others because totalRent contains extreme outliers. The second shows the original distribution without outliers, which is clearly right-skewed. The third is obtained after the logarithmic transformation of totalRent and has a shape close to a normal distribution. For this analysis, the logarithmic transformation is important for model performance.
layout(matrix(1:3, ncol = 3))
hist(immo_data_main$totalRent,
     main = "totalRent With Outliers")
hist(immo_data_ready$totalRent,
     main = "totalRent",
     breaks = 100)
hist(log(immo_data_ready$totalRent + 1),
     main = "log(totalRent+1)",
     breaks = 100)
As can be seen on the graph, a GDPPerC value of around 37,000 is an appropriate threshold for dividing the states into two groups. Let’s divide the states into A_class and B_class in order to reduce the number of factors and increase the power of the models: states with GDPPerC larger than 37,000 are assigned to A_class, and all others to B_class. The resulting A_class states are listed below, followed by a sketch of one way to visualise the chosen threshold.
## [1] Nordrhein_Westfalen Bayern Berlin
## [4] Hamburg Hessen Baden_Wurttemberg
## [7] Bremen
## 16 Levels: Baden_Wurttemberg Bayern Berlin Brandenburg Bremen ... Thuringen
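The threshold plot referred to above is not included in the excerpt; a sketch of one way to visualise it, assuming the Wikipedia table is stored in a data frame called germany_states (hypothetical name) with the columns shown earlier:
library(ggplot2)
ggplot(germany_states,
       aes(x = reorder(States, GDPPerC), y = GDPPerC)) +
  geom_col() +
  geom_hline(yintercept = 37000, linetype = "dashed", colour = "red") +  # chosen cut-off
  coord_flip() +
  labs(x = "State", y = "GDP per capita") +
  theme_bw()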
immo_data_ready1 <- immo_data_ready
immo_data_ready1$RegioStatus <- immo_data_ready$regio1
immo_data_ready1$RegioStatus <- "Z"
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Baden_Württemberg"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Bayern"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Bremen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Hamburg"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Hessen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Nordrhein_Westfalen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Berlin"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Saarland"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Niedersachsen"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Rheinland_Pfalz"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Schleswig_Holstein"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Brandenburg"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Mecklenburg_Vorpommern"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Sachsen"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Sachsen_Anhalt"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Thüringen"] <- c("B_class")
## .
## A_class B_class
## 39907 30829
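As a side note, the same grouping could be expressed more compactly; a sketch (not part of the original code) using a vector of the A_class states:
a_class_states <- c("Baden_Württemberg", "Bayern", "Berlin", "Bremen",
                    "Hamburg", "Hessen", "Nordrhein_Westfalen")
# every state with GDPPerC above 37,000 becomes A_class, the rest B_class
immo_data_ready1$RegioStatus <- ifelse(immo_data_ready1$regio1 %in% a_class_states,
                                       "A_class", "B_class")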
In this step, the aim is to reduce the number of levels of the quality and condition factors. The operation on interiorQual is shown below.
## .
## luxury normal simple sophisticated
## 3403 37689 640 29004
immo_data_ready1$interiorQual <- factor(immo_data_ready1$interiorQual,
# levels from lowest to highest
levels = c("luxury",
"normal",
"simple",
"sophisticated"
),
ordered = TRUE) # ordinal
immo_data_ready1$interiorQual[immo_data_ready1$interiorQual == "luxury"] <- "sophisticated"
immo_data_ready1$interiorQual[immo_data_ready1$interiorQual == "simple"] <- "normal"
immo_data_ready1$interiorQual <- droplevels(immo_data_ready1$interiorQual)
## .
## normal sophisticated
## 38329 32407
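The same collapsing of factor levels could also be written with forcats::fct_collapse; a sketch, not part of the original code:
library(forcats)
# map luxury into sophisticated and simple into normal in one call
immo_data_ready1$interiorQual <-
  fct_collapse(immo_data_ready1$interiorQual,
               sophisticated = c("luxury", "sophisticated"),
               normal = c("simple", "normal"))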
The data is split into 70% for training and 30% for testing.
set.seed(987654321)
houses_train <- createDataPartition(immo_data_ready1$totalRent,
                                    p = 0.7,
                                    list = FALSE)
housesTrain <- immo_data_ready1[houses_train,]
housesTest <- immo_data_ready1[-houses_train,]
housesTrain <- housesTrain %>% select(-regio1, -baseRent, -serviceCharge, -livingSpace)
housesTest <- housesTest %>% select(-regio1, -baseRent, -serviceCharge, -livingSpace)
In the modelling part, backward selection and elastic-net regression are mainly used to determine the optimal set of variables, and linear regression, KNN, and random forest are then used for prediction with these variables. To compare the models and assess their predictive power, the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Squared Logarithmic Error (MSLE), Total Sum of Squares (TSS), Residual Sum of Squares (RSS), and R squared are calculated. Additionally, all models use the same trainControl method: cross-validation with 5 folds.
regressionMetrics <- function(real, predicted) {
  # Mean Squared Error
  MSE <- mean((real - predicted)^2)
  # Root Mean Squared Error
  RMSE <- sqrt(MSE)
  # Mean Absolute Error
  MAE <- mean(abs(real - predicted))
  # Median Absolute Error
  MedAE <- median(abs(real - predicted))
  # Mean Squared Logarithmic Error
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
  # Total Sum of Squares
  TSS <- sum((real - mean(real))^2)
  # Residual Sum of Squares
  RSS <- sum((predicted - real)^2)
  # R2
  R2 <- 1 - RSS/TSS
  result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
  return(result)
}
"leapBackward" is used to fit a linear regression with backward selection. The model suggests that adding variables keeps increasing R2; however, in the 20-variable model, condition, typeOfFlat, and telekomTvOffer contain some insignificant factor levels, and beyond about 8 variables there is no dramatic change in R2. For the subsequent models, condition, typeOfFlat, and telekomTvOffer can therefore be excluded for faster computation. summary(step.model$finalModel) cannot be included because the output is too long; a shorter way to inspect the selected variables is shown after the resampling results below.
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
step.model <- train(log(totalRent+1) ~ ., data = housesTrain,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:20),
                    trControl = train.control
                    )
step.model
## Linear Regression with Backwards Selection
##
## 49517 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 0.4498237 0.2955712 0.3610902
## 2 0.3744020 0.5119366 0.2957746
## 3 0.3425989 0.5913290 0.2706071
## 4 0.3238595 0.6348098 0.2555612
## 5 0.3123257 0.6603562 0.2451130
## 6 0.3075998 0.6705558 0.2411725
## 7 0.3050455 0.6760049 0.2389634
## 8 0.3019190 0.6826051 0.2362598
## 9 0.3006757 0.6852170 0.2352668
## 10 0.3001994 0.6862168 0.2347769
## 11 0.2992772 0.6881359 0.2340369
## 12 0.2983962 0.6899714 0.2335103
## 13 0.2979153 0.6909711 0.2331389
## 14 0.2975701 0.6916874 0.2329776
## 15 0.2973530 0.6921371 0.2328449
## 16 0.2971065 0.6926480 0.2326444
## 17 0.2967804 0.6933223 0.2324390
## 18 0.2966173 0.6936586 0.2323626
## 19 0.2965662 0.6937641 0.2322786
## 20 0.2964993 0.6939012 0.2321944
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 20.
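As noted above, the full summary of the final backward-selection model is too long to print; one way to inspect which variables survive at a given subset size (here the 8-variable model mentioned above) is the coef() method of the underlying regsubsets object:
# coefficients of the 8-variable subset selected by backward elimination
coef(step.model$finalModel, id = 8)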
The best tune is captured with alpha = 0.2 and lambda = 0 in this model. The elastic-net regression also suggests that one of the condition contrasts should be excluded (condition.Q is shrunk to zero in the coefficient output below). Altogether, according to the backward selection and elastic-net models, condition, typeOfFlat, and telekomTvOffer can be excluded in order to obtain faster computation without losing accuracy.
parameters_elastic <- expand.grid(alpha = seq(0, 1, 0.2),
                                  lambda = seq(0, 1e3, 0.5))
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
modelelasticlog <- train(log(totalRent+1) ~ .,
                         data = housesTrain,
                         method = "glmnet",
                         tuneGrid = parameters_elastic,
                         trControl = train.control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## 29 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.6930968721
## telekomTvOfferON_DEMAND -0.0075497415
## telekomTvOfferONE_YEAR_FREE 0.0505626085
## newlyConstTRUE 0.1707208032
## balconyTRUE 0.1255185378
## picturecount 0.0058951900
## telekomUploadSpeed 0.0004506272
## yearConstructed -0.0004838201
## hasKitchenTRUE 0.1702787555
## cellarTRUE -0.0278597504
## condition.L 0.0336227346
## condition.Q .
## condition.C -0.0396784749
## condition^4 -0.0140670029
## interiorQual.L 0.1761134205
## liftTRUE 0.2013016776
## typeOfFlatground_floor 0.0185743324
## typeOfFlathalf_basement 0.0150987126
## typeOfFlatloft 0.3304071487
## typeOfFlatmaisonette 0.1207774331
## typeOfFlatother -0.0274778946
## typeOfFlatpenthouse 0.2300205883
## typeOfFlatraised_ground_floor 0.0113819332
## typeOfFlatroof_storey 0.0161211366
## typeOfFlatterraced_flat 0.1315321281
## noRooms 0.2426135332
## floor -0.0021185862
## gardenTRUE -0.0253723946
## RegioStatusB_class -0.2771766492
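The call that produced the sparse coefficient matrix above is not shown in the excerpt; it can be obtained from the fitted glmnet object, for example as follows.
modelelasticlog$bestTune   # alpha and lambda of the selected model
# coefficients at the selected lambda
coef(modelelasticlog$finalModel,
     s = modelelasticlog$bestTune$lambda)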
In the linear regression output, floor has the lowest significance level, while all other predictors are highly significant; the R2 of 0.6853 is a reasonably good score.
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
lmModellog <- train(log(totalRent+1) ~ .,
                    data = housesTrain %>%
                      dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
                    method = "lm",
                    trControl = train.control
                    )
lmModellog %>% summary
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.76318 -0.20121 -0.01602 0.18604 2.11831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.75325290 0.06745243 100.119 < 0.0000000000000002
## newlyConstTRUE 0.19069923 0.00569900 33.462 < 0.0000000000000002
## balconyTRUE 0.12892172 0.00320080 40.278 < 0.0000000000000002
## picturecount 0.00663126 0.00020977 31.612 < 0.0000000000000002
## telekomUploadSpeed 0.00060562 0.00008239 7.351 0.000000000000200
## yearConstructed -0.00048582 0.00003436 -14.140 < 0.0000000000000002
## hasKitchenTRUE 0.17599073 0.00289525 60.786 < 0.0000000000000002
## cellarTRUE -0.03154045 0.00321367 -9.814 < 0.0000000000000002
## interiorQual.L 0.19251924 0.00223964 85.960 < 0.0000000000000002
## liftTRUE 0.20736813 0.00355729 58.294 < 0.0000000000000002
## noRooms 0.24862362 0.00147610 168.432 < 0.0000000000000002
## floor -0.00086381 0.00090097 -0.959 0.338
## gardenTRUE -0.02425665 0.00324323 -7.479 0.000000000000076
## RegioStatusB_class -0.27764904 0.00286221 -97.005 < 0.0000000000000002
##
## (Intercept) ***
## newlyConstTRUE ***
## balconyTRUE ***
## picturecount ***
## telekomUploadSpeed ***
## yearConstructed ***
## hasKitchenTRUE ***
## cellarTRUE ***
## interiorQual.L ***
## liftTRUE ***
## noRooms ***
## floor
## gardenTRUE ***
## RegioStatusB_class ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3006 on 49503 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6853
## F-statistic: 8297 on 13 and 49503 DF, p-value: < 0.00000000000000022
Additionally, a common rule of thumb is to set the number of neighbours k to the square root of the number of training rows; in our case this would be 222.5242, and computation with such a large k takes an extremely long time, so a grid of smaller values of k is used instead.
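For reference, the rule-of-thumb value quoted above is simply the square root of the number of training rows:
sqrt(nrow(housesTrain))
## [1] 222.5242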
train.control <- trainControl(method = "cv", number = 5)
different_k <- data.frame(k = seq(1, 99, 4))
set.seed(987654321)
knnmodellog <-
  train(log(totalRent+1) ~ .,
        data = housesTrain %>%
          dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
        method = "knn",
        trControl = train.control,
        tuneGrid = different_k)
knnmodellog
## k-Nearest Neighbors
##
## 49517 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.4202167 0.4570260 0.3161156
## 5 0.3606577 0.5482199 0.2760081
## 9 0.3606470 0.5486679 0.2758570
## 13 0.3649385 0.5390542 0.2792270
## 17 0.3688070 0.5302107 0.2823475
## 21 0.3724330 0.5216251 0.2851579
## 25 0.3754236 0.5144301 0.2876286
## 29 0.3783269 0.5072322 0.2899145
## 33 0.3807989 0.5010854 0.2920180
## 37 0.3833722 0.4944422 0.2941481
## 41 0.3853953 0.4893165 0.2958058
## 45 0.3872953 0.4844194 0.2973732
## 49 0.3892430 0.4792739 0.2989850
## 53 0.3909832 0.4746576 0.3004906
## 57 0.3925410 0.4705551 0.3018170
## 61 0.3940871 0.4664228 0.3030810
## 65 0.3954860 0.4627003 0.3042510
## 69 0.3967906 0.4591675 0.3053380
## 73 0.3980483 0.4557407 0.3064149
## 77 0.3992858 0.4523558 0.3074669
## 81 0.4004103 0.4492895 0.3084372
## 85 0.4015482 0.4461112 0.3094390
## 89 0.4026006 0.4431746 0.3103134
## 93 0.4035783 0.4404749 0.3111674
## 97 0.4045661 0.4376749 0.3119781
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
The “ranger” method is used to fit the random forest, and tuneLength is left at its default in order to obtain faster computation.
set.seed(987654321)
RFmodel <- train(
  log(totalRent+1) ~ .,
  # tuneLength = tuneGrid,
  data = housesTrain %>%
    dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
  method = "ranger",
  trControl = train.control
)
## Growing trees.. Progress: 96%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 91%. Estimated remaining time: 3 seconds.
## Growing trees.. Progress: 96%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 82%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 74%. Estimated remaining time: 11 seconds.
## Growing trees.. Progress: 79%. Estimated remaining time: 8 seconds.
## Growing trees.. Progress: 80%. Estimated remaining time: 7 seconds.
## Growing trees.. Progress: 99%. Estimated remaining time: 0 seconds.
## Random Forest
##
## 49517 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.2977673 0.7092157 0.2320206
## 2 extratrees 0.3307934 0.6454508 0.2586837
## 7 variance 0.2831053 0.7212512 0.2184012
## 7 extratrees 0.2831115 0.7209453 0.2191308
## 13 variance 0.2882801 0.7118575 0.2221255
## 13 extratrees 0.2871006 0.7136641 0.2217555
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule =
## variance and min.node.size = 5.
The model comparisons are done on the housesTest data and the forecasts are analysed. As mentioned above, a logarithmic transformation was used for all models. For better interpretation, two types of comparisons are included: the first compares the logarithmically transformed predictions without retransforming the data, and the second compares the predictions after retransformation to the original rent scale.
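The prediction objects used in the comparisons below are not created in this excerpt; a minimal sketch of how they could be obtained, assuming the retransformation is exp(x) - 1 (the inverse of log(totalRent + 1)):
# predictions on the log scale
lmModellog_forecastLog <- predict(lmModellog, newdata = housesTest)
knnForecastlog         <- predict(knnmodellog, newdata = housesTest)
model_predictlog       <- predict(RFmodel, newdata = housesTest)
# retransformed predictions on the original rent scale
lmModellog_forecast2 <- exp(lmModellog_forecastLog) - 1
knnForecast          <- exp(knnForecastlog) - 1
model_predict        <- exp(model_predictlog) - 1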
Models_forecastslog <-
  data.frame(Linear_Regression_Forecast = lmModellog_forecastLog,
             Knn_Forecast = knnForecastlog,
             RF_Forecast = model_predictlog
             )
sapply(Models_forecastslog,
       function(x) regressionMetrics(predicted = x,
                                     real = log(housesTest$totalRent+1))) %>%
  t()
## MSE RMSE MAE MedAE
## Linear_Regression_Forecast 0.09027711 0.3004615 0.2343608 0.1927142
## Knn_Forecast 0.1251214 0.353725 0.2707002 0.2151143
## RF_Forecast 0.07964181 0.2822088 0.2156262 0.1703432
## MSLE R2
## Linear_Regression_Forecast 0.001519177 0.683947
## Knn_Forecast 0.002106946 0.5619599
## RF_Forecast 0.001341847 0.7211803
Models_forecasts <-
  data.frame(Linear_Regression_Forecast = lmModellog_forecast2,
             Knn_Forecast = knnForecast,
             RF_Forecast = model_predict
             )
sapply(Models_forecasts,
       function(x) regressionMetrics(predicted = x,
                                     real = housesTest$totalRent)) %>%
  t()
## MSE RMSE MAE MedAE MSLE
## Linear_Regression_Forecast 133337.3 365.1538 215.6405 134.4642 0.09027711
## Knn_Forecast 173548.3 416.5913 245.0436 148.6661 0.1251214
## RF_Forecast 111948.6 334.5871 198.5949 119.3081 0.07964181
## R2
## Linear_Regression_Forecast 0.580809
## Knn_Forecast 0.4543922
## RF_Forecast 0.6480518
modelList <- list(RFmodel = RFmodel, knnmodellog = knnmodellog, modelelasticlog = modelelasticlog, lmModellog = lmModellog)
resamples <- resamples(modelList)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: RFmodel, knnmodellog, modelelasticlog, lmModellog
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.2171264 0.2177587 0.2177814 0.2184012 0.2186260
## knnmodellog 0.2736814 0.2753159 0.2757223 0.2758570 0.2762944
## modelelasticlog 0.2307210 0.2309609 0.2312986 0.2319982 0.2322046
## lmModellog 0.2333638 0.2346559 0.2349406 0.2353845 0.2355658
## Max. NA's
## RFmodel 0.2207135 0
## knnmodellog 0.2782707 0
## modelelasticlog 0.2348061 0
## lmModellog 0.2383964 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.2819933 0.2825673 0.2825761 0.2831053 0.2834517
## knnmodellog 0.3587795 0.3594318 0.3609987 0.3606470 0.3611233
## modelelasticlog 0.2945782 0.2948382 0.2966834 0.2963830 0.2972489
## lmModellog 0.2984608 0.3002063 0.3003106 0.3006663 0.3013778
## Max. NA's
## RFmodel 0.2849379 0
## knnmodellog 0.3629016 0
## modelelasticlog 0.2985662 0
## lmModellog 0.3029760 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.7126525 0.7207377 0.7225302 0.7212512 0.7248489
## knnmodellog 0.5348358 0.5440158 0.5522512 0.5486679 0.5548388
## modelelasticlog 0.6840272 0.6915645 0.6964239 0.6941522 0.6972115
## lmModellog 0.6747117 0.6841962 0.6850071 0.6852338 0.6887348
## Max. NA's
## RFmodel 0.7254868 0
## knnmodellog 0.5573978 0
## modelelasticlog 0.7015340 0
## lmModellog 0.6935192 0
In both comparisons, the best scores are obtained with the random forest and the worst with the KNN model. For all models, R2 changes noticeably after retransformation; linear regression ranks second on MAE, but its error metrics also change considerably once the predictions are retransformed. To sum up, under these conditions, the best model is the random forest.
To conclude, the models could be improved with further feature engineering. Better results might also be achieved for KNN and random forest by searching over wider hyperparameter ranges, and additional models could be tried, such as a gradient boosting regressor (a sketch of which is given below).
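A possible starting point for the gradient boosting idea mentioned above, kept as a sketch with caret's default tuning grid (gbmModellog is a hypothetical name, not part of the original analysis):
set.seed(987654321)
gbmModellog <- train(log(totalRent + 1) ~ .,
                     data = housesTrain %>%
                       dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
                     method = "gbm",
                     trControl = train.control,
                     verbose = FALSE)
gbmModellog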