Introduction

The aim of this project is to analyse the factors that affect total rent for apartments in Germany, such as service charge, location, and the condition of the flat, and to build predictive models from them. The "Apartment rental offers in Germany" dataset is analysed with machine learning algorithms: backward selection, elastic-net regression, linear regression, k-nearest neighbors (KNN), and random forest. For each algorithm, feature selection, feature engineering, and a logarithmic transformation of the dependent variable are applied to improve model fit and predictive power. Finally, the models are compared with each other to find the one with the best predictions.

Data

The original data has 268,850 observations and 49 variables. For efficiency, the variables are described in detail only after missing values have been eliminated. The number of missing values per variable is shown below:

##                   regio1               newlyConst                  balcony 
##                        0                        0                        0 
##             picturecount                  scoutId               hasKitchen 
##                        0                        0                        0 
##                  geo_bln                   cellar                 baseRent 
##                        0                        0                        0 
##              livingSpace                  geo_krs                   street 
##                        0                        0                        0 
##                     lift            baseRentRange                  geo_plz 
##                        0                        0                        0 
##                  noRooms             noRoomsRange                   garden 
##                        0                        0                        0 
##         livingSpaceRange                   regio2                   regio3 
##                        0                        0                        0 
##                     date               pricetrend            serviceCharge 
##                        0                     1832                     6909 
##              description           telekomTvOffer       telekomUploadSpeed 
##                    19747                    32619                    33358 
##               typeOfFlat                totalRent              heatingType 
##                    36614                    40517                    44856 
##                    floor               facilities              firingTypes 
##                    51309                    52924                    56964 
##          yearConstructed     yearConstructedRange                condition 
##                    57045                    57045                    68489 
##              streetPlain              houseNumber           numberOfFloors 
##                    71013                    71018                    97732 
##              thermalChar             interiorQual              petsAllowed 
##                   106506                   112665                   114573 
##             noParkSpaces             heatingCosts            lastRefurbish 
##                   175798                   183332                   188139 
##    energyEfficiencyClass     electricityBasePrice      electricityKwhPrice 
##                   191063                   222004                   222004 
## telekomHybridUploadSpeed 
##                   223830
## Observations: 268,850
## Variables: 49
## $ regio1                   <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz", ...
## $ serviceCharge            <dbl> 245.00, 134.00, 255.00, 58.15, 138.00, 142...
## $ heatingType              <chr> "central_heating", "self_contained_central...
## $ telekomTvOffer           <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_YEA...
## $ telekomHybridUploadSpeed <dbl> NA, NA, 10, NA, NA, NA, 10, 10, NA, NA, NA...
## $ newlyConst               <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F...
## $ balcony                  <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE...
## $ picturecount             <dbl> 6, 8, 8, 9, 19, 5, 9, 5, 5, 7, 11, 9, 4, 3...
## $ pricetrend               <dbl> 4.62, 3.47, 2.72, 1.53, 2.46, 4.48, 1.01, ...
## $ telekomUploadSpeed       <dbl> 10.0, 10.0, 2.4, 40.0, NA, 2.4, 2.4, 2.4, ...
## $ totalRent                <dbl> 840.00, NA, 1300.00, NA, 903.00, NA, 380.0...
## $ yearConstructed          <dbl> 1965, 1871, 2019, 1964, 1950, 1999, NA, 19...
## $ scoutId                  <dbl> 96107057, 111378734, 113147523, 108890903,...
## $ noParkSpaces             <dbl> 1, 2, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA...
## $ firingTypes              <chr> "oil", "gas", NA, "district_heating", "gas...
## $ hasKitchen               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, F...
## $ geo_bln                  <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz", ...
## $ cellar                   <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TR...
## $ yearConstructedRange     <dbl> 2, 1, 9, 2, 1, 5, NA, 2, 2, 2, 1, 1, 1, 2,...
## $ baseRent                 <dbl> 595.00, 800.00, 965.00, 343.00, 765.00, 31...
## $ houseNumber              <chr> "244", NA, "4", "35", "10", "1e", "14", "3...
## $ livingSpace              <dbl> 86.00, 89.00, 83.80, 58.15, 84.97, 53.43, ...
## $ geo_krs                  <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresden"...
## $ condition                <chr> "well_kept", "refurbished", "first_time_us...
## $ interiorQual             <chr> "normal", "normal", "sophisticated", NA, N...
## $ petsAllowed              <chr> NA, "no", NA, NA, NA, "no", NA, NA, "no", ...
## $ street                   <chr> "Sch&uuml;ruferstra&szlig;e", "no_informat...
## $ streetPlain              <chr> "Schüruferstraße", NA, "Turnerweg", "Glück...
## $ lift                     <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F...
## $ baseRentRange            <dbl> 4, 5, 6, 2, 5, 2, 2, 3, 4, 1, 1, 2, 5, 6, ...
## $ typeOfFlat               <chr> "ground_floor", "ground_floor", "apartment...
## $ geo_plz                  <chr> "44269", "67459", "01097", "09599", "28213...
## $ noRooms                  <dbl> 4.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0, 2....
## $ thermalChar              <dbl> 181.40, NA, NA, 86.00, 188.90, 165.00, NA,...
## $ floor                    <dbl> 1, NA, 3, 3, 1, NA, 1, NA, 2, 2, 3, 1, NA,...
## $ numberOfFloors           <dbl> 3, NA, 4, NA, NA, NA, 4, NA, 2, 5, NA, NA,...
## $ noRoomsRange             <dbl> 4, 3, 3, 3, 3, 2, 2, 3, 2, 2, 2, 3, 4, 4, ...
## $ garden                   <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ livingSpaceRange         <dbl> 4, 4, 4, 2, 4, 2, 3, 2, 2, 2, 1, 3, 4, 6, ...
## $ regio2                   <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresden"...
## $ regio3                   <chr> "Schüren", "Böhl_Iggelheim", "Äußere_Neust...
## $ description              <chr> "Die ebenerdig zu erreichende Erdgeschossw...
## $ facilities               <chr> "Die Wohnung ist mit Laminat ausgelegt. Da...
## $ heatingCosts             <dbl> NA, NA, NA, 87.23, NA, NA, NA, 44.00, NA, ...
## $ energyEfficiencyClass    <chr> NA, NA, NA, NA, NA, NA, NA, "B", "E", NA, ...
## $ lastRefurbish            <dbl> NA, 2019, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ electricityBasePrice     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ electricityKwhPrice      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ date                     <chr> "May19", "May19", "Oct19", "May19", "Feb20...
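The missing-value counts listed above can be produced with a one-liner in base R. The following is a minimal sketch on a toy data frame; the name `toy` and its values are illustrative assumptions, not the original data.

```r
# Toy data frame standing in for the raw listings (illustrative values only)
toy <- data.frame(
  regio1        = c("Bayern", "Berlin", "Sachsen", "Hessen"),
  serviceCharge = c(245, NA, 255, 110),
  totalRent     = c(840, NA, 1300, NA),
  stringsAsFactors = FALSE
)

# Count the NAs in every column and sort the counts in ascending order
na_counts <- sort(colSums(is.na(toy)))
print(na_counts)
```

Sorting the counts makes it easy to see which variables are complete and which are mostly empty.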

In this step, unnecessary variables and variables with an extremely large number of missing values are eliminated. The resulting shape of the data is shown below.

##             regio1      serviceCharge     telekomTvOffer         newlyConst 
##                  0                  0                  0                  0 
##            balcony       picturecount telekomUploadSpeed          totalRent 
##                  0                  0                  0                  0 
##    yearConstructed         hasKitchen             cellar           baseRent 
##                  0                  0                  0                  0 
##        livingSpace          condition       interiorQual               lift 
##                  0                  0                  0                  0 
##         typeOfFlat            noRooms              floor             garden 
##                  0                  0                  0                  0
## Observations: 71,087
## Variables: 20
## $ regio1             <chr> "Nordrhein_Westfalen", "Sachsen", "Baden_Württem...
## $ serviceCharge      <dbl> 245.00, 255.00, 110.00, 200.00, 215.00, 50.00, 2...
## $ telekomTvOffer     <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_YEAR_FREE...
## $ newlyConst         <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ balcony            <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TR...
## $ picturecount       <dbl> 6, 8, 5, 3, 12, 12, 35, 15, 9, 7, 16, 14, 11, 4,...
## $ telekomUploadSpeed <dbl> 10.0, 2.4, 40.0, 40.0, 2.4, 40.0, 5.0, 10.0, 40....
## $ totalRent          <dbl> 840.00, 1300.00, 690.00, 1150.00, 1320.65, 325.0...
## $ yearConstructed    <dbl> 1965, 2019, 1970, 1951, 2018, 1897, 2013, 1978, ...
## $ hasKitchen         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, ...
## $ cellar             <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE...
## $ baseRent           <dbl> 595.00, 965.00, 580.00, 950.00, 972.60, 200.00, ...
## $ livingSpace        <dbl> 86.00, 83.80, 53.00, 123.44, 87.00, 50.00, 127.9...
## $ condition          <chr> "well_kept", "first_time_use", "well_kept", "fir...
## $ interiorQual       <chr> "normal", "sophisticated", "sophisticated", "sop...
## $ lift               <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FA...
## $ typeOfFlat         <chr> "ground_floor", "apartment", "roof_storey", "apa...
## $ noRooms            <dbl> 4.0, 3.0, 2.0, 4.0, 3.0, 2.0, 5.0, 4.0, 3.0, 1.0...
## $ floor              <dbl> 1, 3, 2, 4, 0, 3, 1, 0, 2, 1, 1, 3, 2, 0, 2, 1, ...
## $ garden             <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, T...
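A cleaning step of this kind might look like the sketch below. The cutoff `max_na_share`, the hand-picked `drop_by_hand` list, and the toy data are assumptions for illustration; the source does not state which rule was actually applied.

```r
# Toy data frame (first four values of each column as printed in the raw data above)
toy <- data.frame(
  baseRent     = c(595, 800, 965, 343),
  totalRent    = c(840, NA, 1300, NA),
  heatingCosts = c(NA, NA, NA, 87.23),
  scoutId      = c(96107057, 111378734, 113147523, 108890903)
)

max_na_share <- 0.5          # hypothetical cutoff: drop columns with more than 50% NAs
drop_by_hand <- c("scoutId") # assumption: identifiers carry no predictive information

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep columns under the cutoff that are not dropped by hand
keep <- setdiff(names(toy)[na_share <= max_na_share], drop_by_hand)

# Remove every row that still contains a missing value
cleaned <- na.omit(toy[, keep, drop = FALSE])
print(cleaned)
```

Dropping sparse columns first keeps more rows than calling `na.omit()` on the full table, because rows are only discarded for missing values in the variables that are retained.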

The last version of the data is obtained by the elimination of outliers.
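The source does not say which outlier rule was used. As one possible approach, the sketch below applies Tukey's 1.5 × IQR fences to a single column; the function name `remove_outliers` and the extreme value 99999 are hypothetical.

```r
# Drop rows whose value lies outside Tukey's fences:
# [Q1 - k * IQR, Q3 + k * IQR], with k = 1.5 by default
remove_outliers <- function(data, column, k = 1.5) {
  x <- data[[column]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  # Rows with a missing value in the column are dropped as well
  inside <- !is.na(x) & x >= q[1] - fence & x <= q[2] + fence
  data[inside, , drop = FALSE]
}

# First six totalRent values from the cleaned data plus one artificial extreme value
toy <- data.frame(totalRent = c(840, 1300, 690, 1150, 1320.65, 325, 99999))
trimmed <- remove_outliers(toy, "totalRent")
print(trimmed$totalRent)
```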

Data Description

After the last step, 71,087 observations and 20 variables remain. The variables are described as follows.

regio1 - States.

serviceCharge - Ancillary costs such as electricity or internet in €.

telekomTvOffer - Is pay TV included, and if so, which offer?

newlyConst - Is the building newly constructed?

balcony - Does the object have a balcony?

picturecount - How many pictures were uploaded to the listing?

telekomUploadSpeed - How fast is the internet upload speed?

totalRent - Total rent (usually a sum of base rent, service charge and heating cost).

yearConstructed - Construction year.

hasKitchen - Has it a kitchen?

cellar - Has it a cellar?

baseRent - Base rent without electricity and heating.

livingSpace - Living space in sqm.

condition - Condition of the flat.

interiorQual - Interior quality.

lift - Is an elevator available?

typeOfFlat - Type of flat.

noRooms - Number of rooms.

floor - Which floor is the flat on?

garden - Has it a garden?

Detail About Regions

The data were obtained from Wikipedia in order to apply feature engineering based on the GDPPerC (GDP per capita) values. They contain current information about the German states.

##                States AreaKm2 Population PopPerKm2 GDPPerC
## 1 Nordrhein_Westfalen   34085   17932651       526   38645
## 2           Thuringen   16172    2143145       133   28747
## 3              Bayern   70552   13076721       185   45810
## 4         Brandenburg   29479    2511917        85   27675
## 5              Berlin     892    3644826      4086   38032
## 6            Saarland    2569     990509       386   35460

Correlation Analysis

At this step, correlations are analysed between numeric variables.

The numeric variables are shown below:

## [1] "serviceCharge"      "picturecount"       "telekomUploadSpeed"
## [4] "totalRent"          "yearConstructed"    "baseRent"          
## [7] "livingSpace"        "noRooms"            "floor"

Here the correlations between all pairs of numeric variables are calculated. Several pairs turn out to be highly correlated. In particular, baseRent has the highest correlation with totalRent (0.99242000), and noRooms and livingSpace are highly correlated with each other (0.775784463). For this reason, baseRent and livingSpace are excluded.

##                    serviceCharge picturecount telekomUploadSpeed  totalRent
## serviceCharge         1.00000000  0.262870078        0.012980466 0.75482456
## picturecount          0.26287008  1.000000000        0.005611472 0.30053560
## telekomUploadSpeed    0.01298047  0.005611472        1.000000000 0.02563070
## totalRent             0.75482456  0.300535601        0.025630705 1.00000000
## yearConstructed       0.13699861  0.006427143       -0.010442569 0.16381773
## baseRent              0.70345989  0.295050836        0.025414612 0.99242000
## livingSpace           0.69195104  0.307289034       -0.004532310 0.76012249
## noRooms               0.48297396  0.219839610        0.002225827 0.49697236
## floor                 0.03572704  0.016630275        0.010408192 0.05062938
##                    yearConstructed   baseRent   livingSpace     noRooms
## serviceCharge          0.136998610 0.70345989  0.6919510395 0.482973959
## picturecount           0.006427143 0.29505084  0.3072890345 0.219839610
## telekomUploadSpeed    -0.010442569 0.02541461 -0.0045323098 0.002225827
## totalRent              0.163817734 0.99242000  0.7601224929 0.496972356
## yearConstructed        1.000000000 0.15925701  0.0502247807 0.013888760
## baseRent               0.159257010 1.00000000  0.7358925869 0.471402031
## livingSpace            0.050224781 0.73589259  1.0000000000 0.775784463
## noRooms                0.013888760 0.47140203  0.7757844632 1.000000000
## floor                  0.001643554 0.04489883  0.0005930456 0.003575133
##                           floor
## serviceCharge      0.0357270376
## picturecount       0.0166302745
## telekomUploadSpeed 0.0104081920
## totalRent          0.0506293757
## yearConstructed    0.0016435542
## baseRent           0.0448988296
## livingSpace        0.0005930456
## noRooms            0.0035751333
## floor              1.0000000000
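A correlation matrix like the one above, together with a list of highly correlated pairs, can be sketched in base R as follows. The simulated data and the `cutoff` of 0.7 are assumptions chosen for illustration.

```r
set.seed(1)
n <- 200

# Simulated listings: noRooms is built to depend on livingSpace, floor is independent
livingSpace <- runif(n, 30, 150)
toy <- data.frame(
  livingSpace = livingSpace,
  noRooms     = round(livingSpace / 30 + rnorm(n, sd = 0.5)),
  floor       = sample(0:5, n, replace = TRUE),
  regio1      = sample(c("Bayern", "Berlin"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns and compute their pairwise correlations
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
cor_matrix <- cor(toy[, numeric_cols], use = "pairwise.complete.obs")

# Report each pair once (upper triangle) when |r| exceeds the cutoff
cutoff <- 0.7  # hypothetical threshold
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
print(high_pairs)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal of ones.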

Let’s analyze correlation results with corrplot.

Next, the variables are ordered by their correlation with the dependent variable (totalRent). According to the correlation results, baseRent has the highest correlation with totalRent.

## [1] "totalRent"          "baseRent"           "livingSpace"       
## [4] "serviceCharge"      "noRooms"            "picturecount"      
## [7] "yearConstructed"    "floor"              "telekomUploadSpeed"
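The ordering above follows directly from the totalRent column of the correlation matrix. As a small sketch, the values below are copied from that column and sorted by absolute correlation; the vector name `cor_with_target` is an assumption.

```r
# Correlations with totalRent, copied from the correlation matrix above
cor_with_target <- c(
  serviceCharge      = 0.75482456,
  picturecount       = 0.30053560,
  telekomUploadSpeed = 0.02563070,
  totalRent          = 1.00000000,
  yearConstructed    = 0.16381773,
  baseRent           = 0.99242000,
  livingSpace        = 0.76012249,
  noRooms            = 0.49697236,
  floor              = 0.05062938
)

# Order variable names from strongest to weakest absolute correlation
ranked <- names(sort(abs(cor_with_target), decreasing = TRUE))
print(ranked)
```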

As can be seen on the plot, totalRent and baseRent have a strong positive linear relationship. This is not unexpected: as mentioned in the data description, totalRent includes baseRent, so baseRent is not a genuine explanatory variable in this case. Therefore, baseRent and livingSpace are excluded from every model.

## `geom_smooth()` using formula 'y ~ x'

Feature Engineering

totalRent

Below, a logarithmic transformation is applied to the dependent variable (totalRent). Each graph shows the distribution of totalRent. The first differs from the others because totalRent has extreme outliers. The second shows the original distribution without outliers, and it is clearly right-skewed. The last is obtained after the logarithmic transformation of totalRent, and its shape is close to a normal distribution. For this analysis, the logarithmic transformation is important for model performance.
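The effect of the transformation can be checked numerically as well as visually. The sketch below uses simulated, right-skewed rents (an assumption, not the original data) and a hand-written `skewness` helper to compare the skewness before and after taking the natural logarithm.

```r
set.seed(42)

# Simulated right-skewed rents (log-normal); parameters are illustrative
total_rent <- rlnorm(5000, meanlog = 6.5, sdlog = 0.5)

# Moment-based sample skewness: about 0 for a symmetric distribution
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

log_rent <- log(total_rent)

cat("skewness before:", skewness(total_rent), "\n")
cat("skewness after: ", skewness(log_rent), "\n")

# Predictions made on the log scale are mapped back to euros with exp()
back_transformed <- exp(log_rent)
```

Because the models are fitted on the log scale, error measures such as RMSE are also on the log scale unless predictions are transformed back with `exp()`.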

regio1

As can be seen on the graph, a GDPPerC value of around 37000 is an appropriate point for dividing the states into two groups, A_class and B_class. Grouping the states this way reduces the number of factor levels and is intended to increase the power of the models. States with GDPPerC higher than 37000 are assigned to A_class, and all other states are assigned to B_class. The A_class states are listed below.

## [1] Nordrhein_Westfalen Bayern              Berlin             
## [4] Hamburg             Hessen              Baden_Wurttemberg  
## [7] Bremen             
## 16 Levels: Baden_Wurttemberg Bayern Berlin Brandenburg Bremen ... Thuringen
# Copy the cleaned data and initialise the new column with the placeholder "Z"
immo_data_ready1 <- immo_data_ready
immo_data_ready1$RegioStatus <- "Z"

# States with GDPPerC above 37000
a_class_states <- c("Baden_Württemberg", "Bayern", "Bremen", "Hamburg",
                    "Hessen", "Nordrhein_Westfalen", "Berlin")

# States with GDPPerC below 37000
b_class_states <- c("Saarland", "Niedersachsen", "Rheinland_Pfalz",
                    "Schleswig_Holstein", "Brandenburg", "Mecklenburg_Vorpommern",
                    "Sachsen", "Sachsen_Anhalt", "Thüringen")

# Assign the class labels by state membership
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 %in% a_class_states] <- "A_class"
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 %in% b_class_states] <- "B_class"
## .
## A_class B_class 
##   39907   30829
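As an alternative to listing the states by hand, the class could be derived from the GDPPerC table itself. The sketch below uses only the six rows of that table printed earlier; the data frame names `states` and `listings` are assumptions.

```r
# The six rows of the Wikipedia table shown in "Detail About Regions"
states <- data.frame(
  States  = c("Nordrhein_Westfalen", "Thuringen", "Bayern",
              "Brandenburg", "Berlin", "Saarland"),
  GDPPerC = c(38645, 28747, 45810, 27675, 38032, 35460),
  stringsAsFactors = FALSE
)

# Apply the 37000 cutoff described in the text
threshold <- 37000
states$RegioStatus <- ifelse(states$GDPPerC > threshold, "A_class", "B_class")

# Look up the class of each listing by its state name
listings <- data.frame(regio1 = c("Bayern", "Saarland", "Berlin"),
                       stringsAsFactors = FALSE)
listings$RegioStatus <- states$RegioStatus[match(listings$regio1, states$States)]
print(listings)
```

Note that the two sources spell some states differently (for example "Thuringen" in the GDP table and "Thüringen" in the listings), so the names would have to be harmonised before a lookup like this works on the full data.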

Models

In the modelling part, backward selection and elastic-net regression are mainly used to determine the variables that should be used in the models, and linear regression, KNN, and random forest are then used for prediction with these variables. To compare the models and assess their predictive power, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Squared Logarithmic Error (MSLE), Total Sum of Squares (TSS), Residual Sum of Squares (RSS), and R squared are calculated. Additionally, all models use the same trainControl method, which is 5-fold cross-validation.
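To make the resampling scheme concrete, the sketch below carries out 5-fold cross-validation by hand in base R. It uses the built-in `mtcars` data and a small linear model as stand-ins, which are assumptions for illustration; in caret the same scheme is requested with `trainControl(method = "cv", number = 5)`.

```r
set.seed(123)
k <- 5
n <- nrow(mtcars)

# Randomly assign every row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[fold_id != i, ]
  test  <- mtcars[fold_id == i, ]
  fit   <- lm(log(mpg) ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((log(test$mpg) - pred)^2))
})

# The cross-validated RMSE is the mean over the folds
cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 4))
print(cv_rmse)
```

Using the same folds for every model, as the identical "Summary of sample sizes" lines in the outputs suggest, makes the resampled scores directly comparable.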

Backward Selection

“leapBackward” is used to fit a linear regression with backward selection. The results suggest that every additional variable increases R2. However, at the 20th step of the model, condition, typeOfFlat, and telekomTvOffer have some insignificant factor levels, and beyond 9 variables there is no dramatic change in R2. For the further models, condition, typeOfFlat, and telekomTvOffer can be excluded for faster computation. summary(step.model$finalModel) is not included because the output is too long.

## Linear Regression with Backwards Selection 
## 
## 49517 samples
##    17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE      
##    1     0.3666132  0.5332217  0.2856310
##    2     0.3281069  0.6257239  0.2527690
##    3     0.3048535  0.6765721  0.2374358
##    4     0.2825133  0.7221618  0.2206846
##    5     0.2741978  0.7382829  0.2134142
##    6     0.2679085  0.7501258  0.2082397
##    7     0.2639207  0.7575151  0.2048003
##    8     0.2610415  0.7627519  0.2025859
##    9     0.2596422  0.7652794  0.2016295
##   10     0.2591423  0.7661832  0.2014140
##   11     0.2585643  0.7672266  0.2011318
##   12     0.2583972  0.7675336  0.2010090
##   13     0.2579930  0.7682593  0.2006695
##   14     0.2575135  0.7691219  0.2003228
##   15     0.2572477  0.7696004  0.2001702
##   16     0.2570926  0.7698791  0.2000211
##   17     0.2569082  0.7702084  0.1998442
##   18     0.2566602  0.7706546  0.1996277
##   19     0.2563648  0.7711802  0.1994338
##   20     0.2562696  0.7713499  0.1993362
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 20.
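The idea of backward elimination can be illustrated in base R with `step()`. This is a swapped-in technique, not the method used above: `step()` removes terms by AIC, whereas "leapBackward" compares subsets of each size and the subset size is chosen by cross-validated RMSE. The `mtcars` data and the formula are assumptions for illustration.

```r
# Start from a full model and let step() drop terms while AIC improves
full <- lm(log(mpg) ~ wt + hp + disp + qsec + drat, data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)

# Predictors that survived the elimination
kept <- attr(terms(reduced), "term.labels")
print(kept)

# Compare the fit of the two models
cat("Adjusted R2, full model:   ", summary(full)$adj.r.squared, "\n")
cat("Adjusted R2, reduced model:", summary(reduced)$adj.r.squared, "\n")
```

As in the results above, a smaller model often gives up very little explained variance while being cheaper to fit and easier to interpret.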

Elastic-net Regression

The best tune is obtained with alpha = 0 and lambda = 10 in this model. Under this tuning, the elastic-net regression model suggests that some variables, such as some of the condition factor levels, are insignificant. Lastly, according to both the backward selection model and the elastic-net regression model, condition, typeOfFlat, and telekomTvOffer can be excluded in order to speed up computation while preserving accuracy.

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## 30 x 1 sparse Matrix of class "dgCMatrix"
##                                            1
## (Intercept)                    6.47299962465
## serviceCharge                  0.00225595542
## telekomTvOfferON_DEMAND       -0.01418369616
## telekomTvOfferONE_YEAR_FREE    0.04028685535
## newlyConstTRUE                 0.17887535505
## balconyTRUE                    0.10030316869
## picturecount                   0.00382445436
## telekomUploadSpeed             0.00042338889
## yearConstructed               -0.00040388622
## hasKitchenTRUE                 0.13864592491
## cellarTRUE                    -0.02489297610
## condition.L                    0.02629799733
## condition.Q                    .            
## condition.C                   -0.04213150598
## condition^4                   -0.01277007819
## interiorQual.L                 0.14073213318
## liftTRUE                       0.10191610369
## typeOfFlatground_floor         0.01390509971
## typeOfFlathalf_basement        .            
## typeOfFlatloft                 0.21441716690
## typeOfFlatmaisonette           0.07081569087
## typeOfFlatother               -0.02111985206
## typeOfFlatpenthouse            0.09288833823
## typeOfFlatraised_ground_floor  .            
## typeOfFlatroof_storey          0.00401983349
## typeOfFlatterraced_flat        0.08034635608
## noRooms                        0.16031341788
## floor                         -0.00003790065
## gardenTRUE                    -0.02973929373
## RegioStatusB_class            -0.23007660106
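The roles of the two tuning parameters can be illustrated with the penalty term itself. The sketch below assumes the parametrisation used by the glmnet package, where alpha mixes the ridge and lasso penalties and lambda scales the total; the function name `elastic_net_penalty` and the coefficient vector are hypothetical.

```r
# Elastic-net penalty in the glmnet parametrisation:
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|))
# beta excludes the intercept, which is not penalised
elastic_net_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -0.25, 0)  # illustrative coefficients

ridge_only <- elastic_net_penalty(beta, alpha = 0,   lambda = 10)  # pure ridge
lasso_only <- elastic_net_penalty(beta, alpha = 1,   lambda = 10)  # pure lasso
mixed      <- elastic_net_penalty(beta, alpha = 0.5, lambda = 10)  # equal mix

print(c(ridge = ridge_only, lasso = lasso_only, mixed = mixed))
```

The absolute-value part of the penalty is what can shrink coefficients exactly to zero, while the squared part only shrinks them towards zero.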

Linear Regression

floor is the only predictor that is not significant in the linear regression output (p = 0.669); all other coefficients are highly significant. The adjusted R2 is 0.7679, which is a reasonable score.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9752 -0.1714 -0.0146  0.1589  2.2223 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         6.58750771  0.05794611 113.683 < 0.0000000000000002 ***
## serviceCharge       0.00230823  0.00001740 132.691 < 0.0000000000000002 ***
## newlyConstTRUE      0.19621588  0.00489486  40.086 < 0.0000000000000002 ***
## balconyTRUE         0.10261449  0.00275620  37.230 < 0.0000000000000002 ***
## picturecount        0.00422950  0.00018107  23.358 < 0.0000000000000002 ***
## telekomUploadSpeed  0.00057939  0.00007076   8.188 0.000000000000000273 ***
## yearConstructed    -0.00043818  0.00002951 -14.848 < 0.0000000000000002 ***
## hasKitchenTRUE      0.14075953  0.00250077  56.287 < 0.0000000000000002 ***
## cellarTRUE         -0.02787262  0.00276026 -10.098 < 0.0000000000000002 ***
## interiorQual.L      0.15350876  0.00194589  78.889 < 0.0000000000000002 ***
## liftTRUE            0.10535806  0.00315048  33.442 < 0.0000000000000002 ***
## noRooms             0.16176797  0.00142679 113.379 < 0.0000000000000002 ***
## floor              -0.00033056  0.00077382  -0.427                0.669    
## gardenTRUE         -0.02938173  0.00278577 -10.547 < 0.0000000000000002 ***
## RegioStatusB_class -0.22782398  0.00248677 -91.614 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2582 on 49502 degrees of freedom
## Multiple R-squared:  0.768,  Adjusted R-squared:  0.7679 
## F-statistic: 1.17e+04 on 14 and 49502 DF,  p-value: < 0.00000000000000022
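Because the dependent variable is log-transformed, each coefficient can be read as an approximate relative change in totalRent. The sketch below converts three coefficients from the output above into percentages, assuming the natural logarithm was used for the transformation.

```r
# Coefficients copied from the linear regression output above
coefs <- c(
  balconyTRUE        = 0.10261449,
  hasKitchenTRUE     = 0.14075953,
  RegioStatusB_class = -0.22782398
)

# With a natural-log outcome, exp(coef) - 1 is the relative change in totalRent
# when the predictor switches from FALSE to TRUE (or from A_class to B_class)
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))
```

Read this way, and holding the other predictors fixed, a balcony is associated with a rent that is roughly 11% higher, and a listing in a B_class state with a rent that is roughly 20% lower.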

KNN

A common rule of thumb sets k to the square root of the number of rows, which in our case is 222.5242 (the square root of the 49,517 training rows). Computation with such a large k takes an extremely long time, so smaller values of k are searched instead.

## k-Nearest Neighbors 
## 
## 49517 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.3971441  0.5223325  0.2960954
##    5  0.3256234  0.6350436  0.2491881
##    9  0.3195560  0.6458806  0.2452863
##   13  0.3178202  0.6490082  0.2442260
##   17  0.3169691  0.6506304  0.2438506
##   21  0.3168517  0.6507782  0.2441101
##   25  0.3168653  0.6506881  0.2442527
##   29  0.3170062  0.6503507  0.2444524
##   33  0.3171084  0.6501007  0.2445626
##   37  0.3172019  0.6498836  0.2447000
##   41  0.3174653  0.6492953  0.2449990
##   45  0.3176763  0.6488255  0.2452032
##   49  0.3179287  0.6482639  0.2454526
##   53  0.3181144  0.6478577  0.2456101
##   57  0.3183581  0.6473214  0.2458147
##   61  0.3185431  0.6469192  0.2459348
##   65  0.3187294  0.6465108  0.2460781
##   69  0.3190085  0.6458989  0.2462612
##   73  0.3191663  0.6455536  0.2463657
##   77  0.3194053  0.6450264  0.2465397
##   81  0.3196808  0.6444177  0.2467576
##   85  0.3198484  0.6440560  0.2469137
##   89  0.3199957  0.6437343  0.2470334
##   93  0.3200917  0.6435287  0.2471316
##   97  0.3202263  0.6432412  0.2472319
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 21.
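The rule-of-thumb value and the grid of k values that appears in the output above can be reproduced as follows. The variable names are assumptions; the numbers are taken from the output.

```r
# Number of training rows reported in the model output
n_train <- 49517

# Rule of thumb: k is about the square root of the number of rows
k_rule_of_thumb <- sqrt(n_train)
print(k_rule_of_thumb)

# Grid of odd k values as it appears in the results table: 1, 5, 9, ..., 97
k_grid <- data.frame(k = seq(1, 97, by = 4))
print(nrow(k_grid))
```

A data frame with a single column named `k` has the shape caret expects for the `tuneGrid` argument of `train()` with `method = "knn"`.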

Random Forest

The “ranger” method is used to fit the random forest, and tuneLength is left at its default in order to obtain faster computation.

## Random Forest 
## 
## 49517 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE      
##    2    variance    0.2505852  0.7951319  0.1944783
##    2    extratrees  0.2953312  0.7282210  0.2314892
##    8    variance    0.2367370  0.8048482  0.1802997
##    8    extratrees  0.2377750  0.8032069  0.1822854
##   14    variance    0.2399272  0.7996466  0.1823716
##   14    extratrees  0.2391664  0.8008173  0.1828311
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 8, splitrule = variance
##  and min.node.size = 5.
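The six parameter combinations in the output above form a small grid, and the final model is simply the row with the lowest resampled RMSE. The sketch below rebuilds that grid with `expand.grid()` and attaches the RMSE values copied from the output; the object names are assumptions.

```r
# Rebuild the tuning grid shown in the output (mtry varies fastest)
rf_grid <- expand.grid(
  mtry          = c(2, 8, 14),
  splitrule     = c("variance", "extratrees"),
  min.node.size = 5,
  stringsAsFactors = FALSE
)

# Cross-validated RMSE values copied from the output, in the same row order
rf_grid$RMSE <- c(0.2505852, 0.2367370, 0.2399272,   # variance:   mtry 2, 8, 14
                  0.2953312, 0.2377750, 0.2391664)   # extratrees: mtry 2, 8, 14

# The selected model is the row with the smallest RMSE
best <- rf_grid[which.min(rf_grid$RMSE), ]
print(best)
```

A wider grid, for example more values of `mtry` or `min.node.size`, could be passed to caret through `tuneGrid` at the cost of longer computation.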

Model Comparison

The models are compared on the housesTest data, and their forecasts are analysed. For easier interpretation, the error measures of linear regression, KNN, and random forest are reported together.

##                            MSE        RMSE      MAE       MedAE    
## Linear_Regression_Forecast 0.06661238 0.2580937 0.200053  0.1636902
## Knn_Forecast               0.09899574 0.3146359 0.2423979 0.1963692
## RF_Forecast                0.05514931 0.2348389 0.1776705 0.1392194
##                            MSLE         R2       
## Linear_Regression_Forecast 0.001129262  0.7667953
## Knn_Forecast               0.001701282  0.6534237
## RF_Forecast                0.0009384712 0.8069266
## 
## Call:
## summary.resamples(object = resamples)
## 
## Models: RFmodel, knnmodellog, lmModellog 
## Number of resamples: 5 
## 
## MAE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RFmodel     0.1775401 0.1797476 0.1805965 0.1802997 0.1806250 0.1829892    0
## knnmodellog 0.2396396 0.2443568 0.2450975 0.2441101 0.2452817 0.2461751    0
## lmModellog  0.1978575 0.2006997 0.2008644 0.2007520 0.2014597 0.2028788    0
## 
## RMSE 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RFmodel     0.2326124 0.2359595 0.2376769 0.2367370 0.2385354 0.2389009    0
## knnmodellog 0.3115479 0.3169089 0.3169173 0.3168517 0.3184953 0.3203889    0
## lmModellog  0.2522970 0.2569578 0.2593411 0.2583090 0.2604931 0.2624560    0
## 
## Rsquared 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RFmodel     0.7977081 0.8027031 0.8049617 0.8048482 0.8050128 0.8138555    0
## knnmodellog 0.6444708 0.6461888 0.6483256 0.6507782 0.6484670 0.6664387    0
## lmModellog  0.7594739 0.7641198 0.7651228 0.7676896 0.7686516 0.7810800    0
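The test-set error measures reported above can be computed with a few lines of base R. The helper `regression_metrics` and the toy vectors below are assumptions for illustration; MSLE is written here with `log1p()`, which is one common definition.

```r
# Compute the error measures used in the comparison for one set of predictions
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  rss <- sum(err^2)                        # residual sum of squares
  tss <- sum((actual - mean(actual))^2)    # total sum of squares
  c(
    MSE   = mean(err^2),
    RMSE  = sqrt(mean(err^2)),
    MAE   = mean(abs(err)),
    MedAE = median(abs(err)),
    MSLE  = mean((log1p(actual) - log1p(predicted))^2),
    R2    = 1 - rss / tss
  )
}

# Toy values on the log-rent scale (illustrative only)
actual    <- c(6.5, 6.8, 7.0, 7.2)
predicted <- c(6.6, 6.7, 7.1, 7.0)

m <- regression_metrics(actual, predicted)
print(round(m, 4))
```

Applying the same function to the forecasts of each model yields one row per model, as in the comparison table.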

The best scores are obtained with random forest, followed by linear regression; the worst scores are obtained with the KNN model. This order is the same for all error measures. Under these conditions, the best option is the random forest.

Conclusion

To conclude, the models can be improved with more feature engineering. Moreover, better KNN and random forest models may be achieved by searching a wider range of hyperparameters, and more algorithms, such as a gradient boosting regressor, can be tried.