The aim of this project is to analyse the factors that affect total rent in apartment rentals in Germany, such as the service charge, location, and condition of the flat. The Apartment rental offers in Germany dataset is analysed using machine learning methods: backward selection, elastic-net regression, linear regression, k-nearest neighbours (KNN), and random forest. For each algorithm, feature selection, feature engineering, and a logarithmic transformation of the dependent variable are applied in order to improve the model output and predictive power. Finally, the models are compared with each other in order to obtain the best prediction results.
The original data has 268,850 observations and 49 variables. For efficiency, the data will be described in detail after the elimination of missing values. The missing-value counts per variable are shown below:
## regio1 newlyConst balcony
## 0 0 0
## picturecount scoutId hasKitchen
## 0 0 0
## geo_bln cellar baseRent
## 0 0 0
## livingSpace geo_krs street
## 0 0 0
## lift baseRentRange geo_plz
## 0 0 0
## noRooms noRoomsRange garden
## 0 0 0
## livingSpaceRange regio2 regio3
## 0 0 0
## date pricetrend serviceCharge
## 0 1832 6909
## description telekomTvOffer telekomUploadSpeed
## 19747 32619 33358
## typeOfFlat totalRent heatingType
## 36614 40517 44856
## floor facilities firingTypes
## 51309 52924 56964
## yearConstructed yearConstructedRange condition
## 57045 57045 68489
## streetPlain houseNumber numberOfFloors
## 71013 71018 97732
## thermalChar interiorQual petsAllowed
## 106506 112665 114573
## noParkSpaces heatingCosts lastRefurbish
## 175798 183332 188139
## energyEfficiencyClass electricityBasePrice electricityKwhPrice
## 191063 222004 222004
## telekomHybridUploadSpeed
## 223830
## Observations: 268,850
## Variables: 49
## $ regio1 <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz…
## $ serviceCharge <dbl> 245.00, 134.00, 255.00, 58.15, 138.00, …
## $ heatingType <chr> "central_heating", "self_contained_cent…
## $ telekomTvOffer <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_…
## $ telekomHybridUploadSpeed <dbl> NA, NA, 10, NA, NA, NA, 10, 10, NA, NA,…
## $ newlyConst <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
## $ balcony <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FA…
## $ picturecount <dbl> 6, 8, 8, 9, 19, 5, 9, 5, 5, 7, 11, 9, 4…
## $ pricetrend <dbl> 4.62, 3.47, 2.72, 1.53, 2.46, 4.48, 1.0…
## $ telekomUploadSpeed <dbl> 10.0, 10.0, 2.4, 40.0, NA, 2.4, 2.4, 2.…
## $ totalRent <dbl> 840.00, NA, 1300.00, NA, 903.00, NA, 38…
## $ yearConstructed <dbl> 1965, 1871, 2019, 1964, 1950, 1999, NA,…
## $ scoutId <dbl> 96107057, 111378734, 113147523, 1088909…
## $ noParkSpaces <dbl> 1, 2, 1, NA, NA, NA, NA, NA, 1, NA, NA,…
## $ firingTypes <chr> "oil", "gas", NA, "district_heating", "…
## $ hasKitchen <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ geo_bln <chr> "Nordrhein_Westfalen", "Rheinland_Pfalz…
## $ cellar <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,…
## $ yearConstructedRange <dbl> 2, 1, 9, 2, 1, 5, NA, 2, 2, 2, 1, 1, 1,…
## $ baseRent <dbl> 595.00, 800.00, 965.00, 343.00, 765.00,…
## $ houseNumber <chr> "244", NA, "4", "35", "10", "1e", "14",…
## $ livingSpace <dbl> 86.00, 89.00, 83.80, 58.15, 84.97, 53.4…
## $ geo_krs <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresd…
## $ condition <chr> "well_kept", "refurbished", "first_time…
## $ interiorQual <chr> "normal", "normal", "sophisticated", NA…
## $ petsAllowed <chr> NA, "no", NA, NA, NA, "no", NA, NA, "no…
## $ street <chr> "Schüruferstraße", "no_infor…
## $ streetPlain <chr> "Schüruferstraße", NA, "Turnerweg", "Gl…
## $ lift <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
## $ baseRentRange <dbl> 4, 5, 6, 2, 5, 2, 2, 3, 4, 1, 1, 2, 5, …
## $ typeOfFlat <chr> "ground_floor", "ground_floor", "apartm…
## $ geo_plz <chr> "44269", "67459", "01097", "09599", "28…
## $ noRooms <dbl> 4.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0,…
## $ thermalChar <dbl> 181.40, NA, NA, 86.00, 188.90, 165.00, …
## $ floor <dbl> 1, NA, 3, 3, 1, NA, 1, NA, 2, 2, 3, 1, …
## $ numberOfFloors <dbl> 3, NA, 4, NA, NA, NA, 4, NA, 2, 5, NA, …
## $ noRoomsRange <dbl> 4, 3, 3, 3, 3, 2, 2, 3, 2, 2, 2, 3, 4, …
## $ garden <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ livingSpaceRange <dbl> 4, 4, 4, 2, 4, 2, 3, 2, 2, 2, 1, 3, 4, …
## $ regio2 <chr> "Dortmund", "Rhein_Pfalz_Kreis", "Dresd…
## $ regio3 <chr> "Schüren", "Böhl_Iggelheim", "Äußere_Ne…
## $ description <chr> "Die ebenerdig zu erreichende Erdgescho…
## $ facilities <chr> "Die Wohnung ist mit Laminat ausgelegt.…
## $ heatingCosts <dbl> NA, NA, NA, 87.23, NA, NA, NA, 44.00, N…
## $ energyEfficiencyClass <chr> NA, NA, NA, NA, NA, NA, NA, "B", "E", N…
## $ lastRefurbish <dbl> NA, 2019, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricityBasePrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ electricityKwhPrice <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ date <chr> "May19", "May19", "Oct19", "May19", "Fe…
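The code that produced the two summaries above is not included in this excerpt; a minimal sketch, assuming the raw imported data set is called immo_data (hypothetical name), could look like this.
library(dplyr)
# missing values per column, sorted from least to most affected
colSums(is.na(immo_data)) %>% sort()
# compact overview of all 49 variables and their first values
glimpse(immo_data)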
In this step, the unnecessary variables and the variables with an extremely large number of missing values are eliminated. The resulting shape of the data is shown below.
## regio1 serviceCharge telekomTvOffer
## 0 0 0
## newlyConst balcony picturecount
## 0 0 0
## telekomUploadSpeed totalRent yearConstructed
## 0 0 0
## hasKitchen cellar baseRent
## 0 0 0
## livingSpace condition interiorQual
## 0 0 0
## lift typeOfFlat noRooms
## 0 0 0
## floor garden
## 0 0
## Observations: 71,087
## Variables: 20
## $ regio1 <chr> "Nordrhein_Westfalen", "Sachsen", "Baden_Würt…
## $ serviceCharge <dbl> 245.00, 255.00, 110.00, 200.00, 215.00, 50.00…
## $ telekomTvOffer <chr> "ONE_YEAR_FREE", "ONE_YEAR_FREE", "ONE_YEAR_F…
## $ newlyConst <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE…
## $ balcony <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE,…
## $ picturecount <dbl> 6, 8, 5, 3, 12, 12, 35, 15, 9, 7, 16, 14, 11,…
## $ telekomUploadSpeed <dbl> 10.0, 2.4, 40.0, 40.0, 2.4, 40.0, 5.0, 10.0, …
## $ totalRent <dbl> 840.00, 1300.00, 690.00, 1150.00, 1320.65, 32…
## $ yearConstructed <dbl> 1965, 2019, 1970, 1951, 2018, 1897, 2013, 197…
## $ hasKitchen <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS…
## $ cellar <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ baseRent <dbl> 595.00, 965.00, 580.00, 950.00, 972.60, 200.0…
## $ livingSpace <dbl> 86.00, 83.80, 53.00, 123.44, 87.00, 50.00, 12…
## $ condition <chr> "well_kept", "first_time_use", "well_kept", "…
## $ interiorQual <chr> "normal", "sophisticated", "sophisticated", "…
## $ lift <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE,…
## $ typeOfFlat <chr> "ground_floor", "apartment", "roof_storey", "…
## $ noRooms <dbl> 4.0, 3.0, 2.0, 4.0, 3.0, 2.0, 5.0, 4.0, 3.0, …
## $ floor <dbl> 1, 3, 2, 4, 0, 3, 1, 0, 2, 1, 1, 3, 2, 0, 2, …
## $ garden <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
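The cleaning code itself is not shown in the excerpt; a minimal sketch, under the assumptions that the raw data frame is called immo_data and that the 20 columns listed above are the ones kept (the result is stored as immo_data_main, the name used later in the outlier filter):
library(dplyr)
library(tidyr)
kept_vars <- c("regio1", "serviceCharge", "telekomTvOffer", "newlyConst",
               "balcony", "picturecount", "telekomUploadSpeed", "totalRent",
               "yearConstructed", "hasKitchen", "cellar", "baseRent",
               "livingSpace", "condition", "interiorQual", "lift",
               "typeOfFlat", "noRooms", "floor", "garden")
immo_data_main <- immo_data %>%
  select(all_of(kept_vars)) %>%  # drop IDs, free-text fields and high-NA columns
  drop_na()                      # keep complete cases only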
The final version of the data is obtained by removing outliers.
immo_data_ready <- immo_data_main %>%
  filter(baseRent < 10000 & baseRent > 100,
         totalRent < 10000 & totalRent > 100,
         livingSpace < 500 & livingSpace > 10,
         serviceCharge < 1500 & serviceCharge > 1,
         floor < 19,
         noRooms < 15 & noRooms > 0)
After this step, 71,087 observations and 20 variables remain. The variables are described as follows.
regio1 - States.
serviceCharge - Ancillary costs such as electricity or internet, in €.
telekomTvOffer - Is paid TV included? If so, which offer?
newlyConst - Is the building newly constructed?
balcony - Does the object have a balcony?
picturecount - How many pictures were uploaded to the listing?
telekomUploadSpeed - How fast is the internet upload speed?
totalRent - Total rent (usually a sum of base rent, service charge and heating cost).
yearConstructed - Construction year.
hasKitchen - Does it have a kitchen?
cellar - Does it have a cellar?
baseRent - Base rent without electricity and heating.
livingSpace - Living space in sqm.
condition - Condition of the flat.
interiorQual - Interior quality.
lift - Is an elevator available?
typeOfFlat - Type of flat.
noRooms - Number of rooms.
floor - Which floor is the flat on?
garden - Does it have a garden?
Data obtained from Wikipedia is used to apply feature engineering based on the GDPPerC (GDP per capita) values. The data includes current information about the German states.
## States AreaKm2 Population PopPerKm2 GDPPerC
## 1 Nordrhein_Westfalen 34085 17932651 526 38645
## 2 Thuringen 16172 2143145 133 28747
## 3 Bayern 70552 13076721 185 45810
## 4 Brandenburg 29479 2511917 85 27675
## 5 Berlin 892 3644826 4086 38032
## 6 Saarland 2569 990509 386 35460
In this step, the correlations between the numeric variables are analysed.
houses_numeric_vars <-
sapply(immo_data_ready, is.numeric) %>%
which() %>%
names()
houses_correlations <-
cor(immo_data_ready[,houses_numeric_vars],
use = "pairwise.complete.obs")The numeric variables are shown below:
## [1] "serviceCharge" "picturecount" "telekomUploadSpeed"
## [4] "totalRent" "yearConstructed" "baseRent"
## [7] "livingSpace" "noRooms" "floor"
All pairwise correlations between the numeric variables are calculated here. Several variables turn out to be highly correlated with each other. In particular, baseRent has the highest correlation with totalRent (0.99242), and serviceCharge and livingSpace are also strongly correlated with totalRent. For this reason, baseRent, livingSpace, and serviceCharge are excluded.
## serviceCharge picturecount telekomUploadSpeed
## serviceCharge 1.00000000 0.262870078 0.012980466
## picturecount 0.26287008 1.000000000 0.005611472
## telekomUploadSpeed 0.01298047 0.005611472 1.000000000
## totalRent 0.75482456 0.300535601 0.025630705
## yearConstructed 0.13699861 0.006427143 -0.010442569
## baseRent 0.70345989 0.295050836 0.025414612
## livingSpace 0.69195104 0.307289034 -0.004532310
## noRooms 0.48297396 0.219839610 0.002225827
## floor 0.03572704 0.016630275 0.010408192
## totalRent yearConstructed baseRent livingSpace
## serviceCharge 0.75482456 0.136998610 0.70345989 0.6919510395
## picturecount 0.30053560 0.006427143 0.29505084 0.3072890345
## telekomUploadSpeed 0.02563070 -0.010442569 0.02541461 -0.0045323098
## totalRent 1.00000000 0.163817734 0.99242000 0.7601224929
## yearConstructed 0.16381773 1.000000000 0.15925701 0.0502247807
## baseRent 0.99242000 0.159257010 1.00000000 0.7358925869
## livingSpace 0.76012249 0.050224781 0.73589259 1.0000000000
## noRooms 0.49697236 0.013888760 0.47140203 0.7757844632
## floor 0.05062938 0.001643554 0.04489883 0.0005930456
## noRooms floor
## serviceCharge 0.482973959 0.0357270376
## picturecount 0.219839610 0.0166302745
## telekomUploadSpeed 0.002225827 0.0104081920
## totalRent 0.496972356 0.0506293757
## yearConstructed 0.013888760 0.0016435542
## baseRent 0.471402031 0.0448988296
## livingSpace 0.775784463 0.0005930456
## noRooms 1.000000000 0.0035751333
## floor 0.003575133 1.0000000000
Let’s analyze correlation results with corrplot.
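The corrplot call is not shown in the excerpt; a typical call on the correlation matrix computed above might look like this.
library(corrplot)
corrplot(houses_correlations,
         method = "color",  # colour-coded cells
         type = "upper",    # show the upper triangle only
         tl.cex = 0.8)      # shrink the variable labels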
Let’s find the variable that has the highest correlation with the dependent variable (totalRent); according to the correlation results, this is baseRent.
houses_numeric_vars_order <-
houses_correlations[,"totalRent"] %>%
sort(decreasing = TRUE) %>%
names()
houses_numeric_vars_order
## [1] "totalRent" "baseRent" "livingSpace"
## [4] "serviceCharge" "noRooms" "picturecount"
## [7] "yearConstructed" "floor" "telekomUploadSpeed"
As can be seen in the plot, totalRent and baseRent have a strongly positive linear relationship. This is not unexpected: as mentioned in the data description, totalRent includes baseRent, so baseRent is not a useful explanatory variable in this case. Therefore, baseRent, livingSpace, and serviceCharge will be excluded from every model.
ggplot(immo_data_ready,
       aes(x = totalRent,
           y = baseRent)) +
  geom_point(col = "red") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
Below, a logarithmic transformation is applied to the dependent variable (totalRent). Each histogram shows the distribution of totalRent. The first differs from the others because totalRent contains extreme outliers. The second shows the original distribution without outliers, which is clearly right-skewed. The third is obtained after the logarithmic transformation of totalRent and has a shape close to a normal distribution. For this analysis, the logarithmic transformation is important for model performance.
layout(matrix(1:3, ncol = 3))
hist(immo_data_main$totalRent,
     main = "totalRent With Outliers")
hist(immo_data_ready$totalRent,
     main = "totalRent",
     breaks = 100)
hist(log(immo_data_ready$totalRent + 1),
     main = "log(totalRent+1)",
     breaks = 100)
As can be seen on the graph, a GDPPerC value of around 37,000 is an appropriate threshold for dividing the states into two groups. Let’s divide the states into A_class and B_class in order to reduce the number of factors and increase the power of the models: states with GDPPerC larger than 37,000 are assigned to A_class, and all others to B_class. The resulting A_class states are listed below, followed by a sketch of one way to visualise the chosen threshold.
## [1] Nordrhein_Westfalen Bayern Berlin
## [4] Hamburg Hessen Baden_Wurttemberg
## [7] Bremen
## 16 Levels: Baden_Wurttemberg Bayern Berlin Brandenburg Bremen ... Thuringen
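The threshold plot referred to above is not included in the excerpt; a sketch of one way to visualise it, assuming the Wikipedia table is stored in a data frame called germany_states (hypothetical name) with the columns shown earlier:
library(ggplot2)
ggplot(germany_states,
       aes(x = reorder(States, GDPPerC), y = GDPPerC)) +
  geom_col() +
  geom_hline(yintercept = 37000, linetype = "dashed", colour = "red") +  # chosen cut-off
  coord_flip() +
  labs(x = "State", y = "GDP per capita") +
  theme_bw()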
immo_data_ready1 <- immo_data_ready
immo_data_ready1$RegioStatus <- immo_data_ready$regio1
immo_data_ready1$RegioStatus <- "Z"
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Baden_Württemberg"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Bayern"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Bremen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Hamburg"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Hessen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Nordrhein_Westfalen"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Berlin"] <- c("A_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Saarland"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Niedersachsen"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Rheinland_Pfalz"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Schleswig_Holstein"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Brandenburg"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Mecklenburg_Vorpommern"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Sachsen"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Sachsen_Anhalt"] <- c("B_class")
immo_data_ready1$RegioStatus[immo_data_ready1$regio1 == "Thüringen"] <- c("B_class")
## .
## A_class B_class
## 39907 30829
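As a side note, the same grouping could be expressed more compactly; a sketch (not part of the original code) using a vector of the A_class states:
a_class_states <- c("Baden_Württemberg", "Bayern", "Berlin", "Bremen",
                    "Hamburg", "Hessen", "Nordrhein_Westfalen")
# every state with GDPPerC above 37,000 becomes A_class, the rest B_class
immo_data_ready1$RegioStatus <- ifelse(immo_data_ready1$regio1 %in% a_class_states,
                                       "A_class", "B_class")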
In this step, the aim is to reduce the number of levels of the quality and condition factors. The operation on interiorQual is shown below.
## .
## luxury normal simple sophisticated
## 3403 37689 640 29004
immo_data_ready1$interiorQual <- factor(immo_data_ready1$interiorQual,
# levels from lowest to highest
levels = c("luxury",
"normal",
"simple",
"sophisticated"
),
ordered = TRUE) # ordinal
immo_data_ready1$interiorQual[immo_data_ready1$interiorQual == "luxury"] <- "sophisticated"
immo_data_ready1$interiorQual[immo_data_ready1$interiorQual == "simple"] <- "normal"
immo_data_ready1$interiorQual <- droplevels(immo_data_ready1$interiorQual)
## .
## normal sophisticated
## 38329 32407
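The same collapsing of factor levels could also be written with forcats::fct_collapse; a sketch, not part of the original code:
library(forcats)
# map luxury into sophisticated and simple into normal in one call
immo_data_ready1$interiorQual <-
  fct_collapse(immo_data_ready1$interiorQual,
               sophisticated = c("luxury", "sophisticated"),
               normal = c("simple", "normal"))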
The data is split into 70% for training and 30% for testing.
set.seed(987654321)
houses_train <- createDataPartition(immo_data_ready1$totalRent,
                                    p = 0.7,
                                    list = FALSE)
housesTrain <- immo_data_ready1[houses_train,]
housesTest <- immo_data_ready1[-houses_train,]
housesTrain <- housesTrain %>% select(-regio1, -baseRent, -serviceCharge, -livingSpace)
housesTest <- housesTest %>% select(-regio1, -baseRent, -serviceCharge, -livingSpace)
In the modelling part, backward selection and elastic-net regression are mainly used to determine the optimal set of variables, and linear regression, KNN, and random forest are then used for prediction with these variables. To compare the models and assess their predictive power, the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Squared Logarithmic Error (MSLE), Total Sum of Squares (TSS), Residual Sum of Squares (RSS), and R squared are calculated. Additionally, all models use the same trainControl method: cross-validation with 5 folds.
regressionMetrics <- function(real, predicted) {
  # Mean Squared Error
  MSE <- mean((real - predicted)^2)
  # Root Mean Squared Error
  RMSE <- sqrt(MSE)
  # Mean Absolute Error
  MAE <- mean(abs(real - predicted))
  # Median Absolute Error
  MedAE <- median(abs(real - predicted))
  # Mean Squared Logarithmic Error
  MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
  # Total Sum of Squares
  TSS <- sum((real - mean(real))^2)
  # Residual Sum of Squares
  RSS <- sum((predicted - real)^2)
  # R2
  R2 <- 1 - RSS/TSS
  result <- data.frame(MSE, RMSE, MAE, MedAE, MSLE, R2)
  return(result)
}
"leapBackward" is used to fit a linear regression with backward selection. The model suggests that adding variables keeps increasing R2; however, in the 20-variable model, condition, typeOfFlat, and telekomTvOffer contain some insignificant factor levels, and beyond about 8 variables there is no dramatic change in R2. For the subsequent models, condition, typeOfFlat, and telekomTvOffer can therefore be excluded for faster computation. summary(step.model$finalModel) cannot be included because the output is too long; a shorter way to inspect the selected variables is shown after the resampling results below.
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
step.model <- train(log(totalRent+1) ~ ., data = housesTrain,
                    method = "leapBackward",
                    tuneGrid = data.frame(nvmax = 1:20),
                    trControl = train.control
                    )
step.model
## Linear Regression with Backwards Selection
##
## 49517 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 0.4498237 0.2955712 0.3610902
## 2 0.3744020 0.5119366 0.2957746
## 3 0.3425989 0.5913290 0.2706071
## 4 0.3238595 0.6348098 0.2555612
## 5 0.3123257 0.6603562 0.2451130
## 6 0.3075998 0.6705558 0.2411725
## 7 0.3050455 0.6760049 0.2389634
## 8 0.3019190 0.6826051 0.2362598
## 9 0.3006757 0.6852170 0.2352668
## 10 0.3001994 0.6862168 0.2347769
## 11 0.2992772 0.6881359 0.2340369
## 12 0.2983962 0.6899714 0.2335103
## 13 0.2979153 0.6909711 0.2331389
## 14 0.2975701 0.6916874 0.2329776
## 15 0.2973530 0.6921371 0.2328449
## 16 0.2971065 0.6926480 0.2326444
## 17 0.2967804 0.6933223 0.2324390
## 18 0.2966173 0.6936586 0.2323626
## 19 0.2965662 0.6937641 0.2322786
## 20 0.2964993 0.6939012 0.2321944
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 20.
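As noted above, the full summary of the final backward-selection model is too long to print; one way to inspect which variables survive at a given subset size (here the 8-variable model mentioned above) is the coef() method of the underlying regsubsets object:
# coefficients of the 8-variable subset selected by backward elimination
coef(step.model$finalModel, id = 8)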
The best tune is captured with alpha = 0.2 and lambda = 0 in this model. The elastic-net regression also suggests that one of the condition contrasts should be excluded (condition.Q is shrunk to zero in the coefficient output below). Altogether, according to the backward selection and elastic-net models, condition, typeOfFlat, and telekomTvOffer can be excluded in order to obtain faster computation without losing accuracy.
parameters_elastic <- expand.grid(alpha = seq(0, 1, 0.2),
                                  lambda = seq(0, 1e3, 0.5))
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
modelelasticlog <- train(log(totalRent+1) ~ .,
                         data = housesTrain,
                         method = "glmnet",
                         tuneGrid = parameters_elastic,
                         trControl = train.control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## 29 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.6930968721
## telekomTvOfferON_DEMAND -0.0075497415
## telekomTvOfferONE_YEAR_FREE 0.0505626085
## newlyConstTRUE 0.1707208032
## balconyTRUE 0.1255185378
## picturecount 0.0058951900
## telekomUploadSpeed 0.0004506272
## yearConstructed -0.0004838201
## hasKitchenTRUE 0.1702787555
## cellarTRUE -0.0278597504
## condition.L 0.0336227346
## condition.Q .
## condition.C -0.0396784749
## condition^4 -0.0140670029
## interiorQual.L 0.1761134205
## liftTRUE 0.2013016776
## typeOfFlatground_floor 0.0185743324
## typeOfFlathalf_basement 0.0150987126
## typeOfFlatloft 0.3304071487
## typeOfFlatmaisonette 0.1207774331
## typeOfFlatother -0.0274778946
## typeOfFlatpenthouse 0.2300205883
## typeOfFlatraised_ground_floor 0.0113819332
## typeOfFlatroof_storey 0.0161211366
## typeOfFlatterraced_flat 0.1315321281
## noRooms 0.2426135332
## floor -0.0021185862
## gardenTRUE -0.0253723946
## RegioStatusB_class -0.2771766492
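The call that produced the sparse coefficient matrix above is not shown in the excerpt; it can be obtained from the fitted glmnet object, for example as follows.
modelelasticlog$bestTune   # alpha and lambda of the selected model
# coefficients at the selected lambda
coef(modelelasticlog$finalModel,
     s = modelelasticlog$bestTune$lambda)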
In the linear regression output, floor has the lowest significance level, while all other predictors are highly significant; the R2 of 0.6853 is a reasonably good score.
train.control <- trainControl(method = "cv", number = 5)
set.seed(987654321)
lmModellog <- train(log(totalRent+1) ~ .,
                    data = housesTrain %>%
                      dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
                    method = "lm",
                    trControl = train.control
                    )
lmModellog %>% summary
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.76318 -0.20121 -0.01602 0.18604 2.11831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.75325290 0.06745243 100.119 < 0.0000000000000002
## newlyConstTRUE 0.19069923 0.00569900 33.462 < 0.0000000000000002
## balconyTRUE 0.12892172 0.00320080 40.278 < 0.0000000000000002
## picturecount 0.00663126 0.00020977 31.612 < 0.0000000000000002
## telekomUploadSpeed 0.00060562 0.00008239 7.351 0.000000000000200
## yearConstructed -0.00048582 0.00003436 -14.140 < 0.0000000000000002
## hasKitchenTRUE 0.17599073 0.00289525 60.786 < 0.0000000000000002
## cellarTRUE -0.03154045 0.00321367 -9.814 < 0.0000000000000002
## interiorQual.L 0.19251924 0.00223964 85.960 < 0.0000000000000002
## liftTRUE 0.20736813 0.00355729 58.294 < 0.0000000000000002
## noRooms 0.24862362 0.00147610 168.432 < 0.0000000000000002
## floor -0.00086381 0.00090097 -0.959 0.338
## gardenTRUE -0.02425665 0.00324323 -7.479 0.000000000000076
## RegioStatusB_class -0.27764904 0.00286221 -97.005 < 0.0000000000000002
##
## (Intercept) ***
## newlyConstTRUE ***
## balconyTRUE ***
## picturecount ***
## telekomUploadSpeed ***
## yearConstructed ***
## hasKitchenTRUE ***
## cellarTRUE ***
## interiorQual.L ***
## liftTRUE ***
## noRooms ***
## floor
## gardenTRUE ***
## RegioStatusB_class ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3006 on 49503 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6853
## F-statistic: 8297 on 13 and 49503 DF, p-value: < 0.00000000000000022
Additionally, a common rule of thumb is to set the number of neighbours k to the square root of the number of training rows; in our case this would be 222.5242, and computation with such a large k takes an extremely long time, so a grid of smaller values of k is used instead.
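For reference, the rule-of-thumb value quoted above is simply the square root of the number of training rows:
sqrt(nrow(housesTrain))
## [1] 222.5242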
train.control <- trainControl(method = "cv", number = 5)
different_k <- data.frame(k = seq(1, 99, 4))
set.seed(987654321)
knnmodellog <-
  train(log(totalRent+1) ~ .,
        data = housesTrain %>%
          dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
        method = "knn",
        trControl = train.control,
        tuneGrid = different_k)
knnmodellog
## k-Nearest Neighbors
##
## 49517 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.4202167 0.4570260 0.3161156
## 5 0.3606577 0.5482199 0.2760081
## 9 0.3606470 0.5486679 0.2758570
## 13 0.3649385 0.5390542 0.2792270
## 17 0.3688070 0.5302107 0.2823475
## 21 0.3724330 0.5216251 0.2851579
## 25 0.3754236 0.5144301 0.2876286
## 29 0.3783269 0.5072322 0.2899145
## 33 0.3807989 0.5010854 0.2920180
## 37 0.3833722 0.4944422 0.2941481
## 41 0.3853953 0.4893165 0.2958058
## 45 0.3872953 0.4844194 0.2973732
## 49 0.3892430 0.4792739 0.2989850
## 53 0.3909832 0.4746576 0.3004906
## 57 0.3925410 0.4705551 0.3018170
## 61 0.3940871 0.4664228 0.3030810
## 65 0.3954860 0.4627003 0.3042510
## 69 0.3967906 0.4591675 0.3053380
## 73 0.3980483 0.4557407 0.3064149
## 77 0.3992858 0.4523558 0.3074669
## 81 0.4004103 0.4492895 0.3084372
## 85 0.4015482 0.4461112 0.3094390
## 89 0.4026006 0.4431746 0.3103134
## 93 0.4035783 0.4404749 0.3111674
## 97 0.4045661 0.4376749 0.3119781
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
The “ranger” method is used to fit the random forest, and tuneLength is left at its default in order to obtain faster computation.
set.seed(987654321)
RFmodel <- train(
  log(totalRent+1) ~ .,
  # tuneLength = tuneGrid,
  data = housesTrain %>%
    dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
  method = "ranger",
  trControl = train.control
)
## Growing trees.. Progress: 96%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 91%. Estimated remaining time: 3 seconds.
## Growing trees.. Progress: 96%. Estimated remaining time: 1 seconds.
## Growing trees.. Progress: 82%. Estimated remaining time: 6 seconds.
## Growing trees.. Progress: 74%. Estimated remaining time: 11 seconds.
## Growing trees.. Progress: 79%. Estimated remaining time: 8 seconds.
## Growing trees.. Progress: 80%. Estimated remaining time: 7 seconds.
## Growing trees.. Progress: 99%. Estimated remaining time: 0 seconds.
## Random Forest
##
## 49517 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39614, 39612, 39614, 39614, 39614
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.2977673 0.7092157 0.2320206
## 2 extratrees 0.3307934 0.6454508 0.2586837
## 7 variance 0.2831053 0.7212512 0.2184012
## 7 extratrees 0.2831115 0.7209453 0.2191308
## 13 variance 0.2882801 0.7118575 0.2221255
## 13 extratrees 0.2871006 0.7136641 0.2217555
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 7, splitrule =
## variance and min.node.size = 5.
The model comparisons are done on the housesTest data and the forecasts are analysed. As mentioned above, a logarithmic transformation was used for all models. For better interpretation, two types of comparisons are included: the first compares the logarithmically transformed predictions without retransforming the data, and the second compares the predictions after retransformation to the original rent scale.
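The prediction objects used in the comparisons below are not created in this excerpt; a minimal sketch of how they could be obtained, assuming the retransformation is exp(x) - 1 (the inverse of log(totalRent + 1)):
# predictions on the log scale
lmModellog_forecastLog <- predict(lmModellog, newdata = housesTest)
knnForecastlog         <- predict(knnmodellog, newdata = housesTest)
model_predictlog       <- predict(RFmodel, newdata = housesTest)
# retransformed predictions on the original rent scale
lmModellog_forecast2 <- exp(lmModellog_forecastLog) - 1
knnForecast          <- exp(knnForecastlog) - 1
model_predict        <- exp(model_predictlog) - 1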
Models_forecastslog <-
  data.frame(Linear_Regression_Forecast = lmModellog_forecastLog,
             Knn_Forecast = knnForecastlog,
             RF_Forecast = model_predictlog
             )
sapply(Models_forecastslog,
       function(x) regressionMetrics(predicted = x,
                                     real = log(housesTest$totalRent+1))) %>%
  t()
## MSE RMSE MAE MedAE
## Linear_Regression_Forecast 0.09027711 0.3004615 0.2343608 0.1927142
## Knn_Forecast 0.1251214 0.353725 0.2707002 0.2151143
## RF_Forecast 0.07964181 0.2822088 0.2156262 0.1703432
## MSLE R2
## Linear_Regression_Forecast 0.001519177 0.683947
## Knn_Forecast 0.002106946 0.5619599
## RF_Forecast 0.001341847 0.7211803
Models_forecasts <-
  data.frame(Linear_Regression_Forecast = lmModellog_forecast2,
             Knn_Forecast = knnForecast,
             RF_Forecast = model_predict
             )
sapply(Models_forecasts,
       function(x) regressionMetrics(predicted = x,
                                     real = housesTest$totalRent)) %>%
  t()
## MSE RMSE MAE MedAE MSLE
## Linear_Regression_Forecast 133337.3 365.1538 215.6405 134.4642 0.09027711
## Knn_Forecast 173548.3 416.5913 245.0436 148.6661 0.1251214
## RF_Forecast 111948.6 334.5871 198.5949 119.3081 0.07964181
## R2
## Linear_Regression_Forecast 0.580809
## Knn_Forecast 0.4543922
## RF_Forecast 0.6480518
modelList <- list(RFmodel = RFmodel, knnmodellog = knnmodellog, modelelasticlog = modelelasticlog, lmModellog = lmModellog)
resamples <- resamples(modelList)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: RFmodel, knnmodellog, modelelasticlog, lmModellog
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.2171264 0.2177587 0.2177814 0.2184012 0.2186260
## knnmodellog 0.2736814 0.2753159 0.2757223 0.2758570 0.2762944
## modelelasticlog 0.2307210 0.2309609 0.2312986 0.2319982 0.2322046
## lmModellog 0.2333638 0.2346559 0.2349406 0.2353845 0.2355658
## Max. NA's
## RFmodel 0.2207135 0
## knnmodellog 0.2782707 0
## modelelasticlog 0.2348061 0
## lmModellog 0.2383964 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.2819933 0.2825673 0.2825761 0.2831053 0.2834517
## knnmodellog 0.3587795 0.3594318 0.3609987 0.3606470 0.3611233
## modelelasticlog 0.2945782 0.2948382 0.2966834 0.2963830 0.2972489
## lmModellog 0.2984608 0.3002063 0.3003106 0.3006663 0.3013778
## Max. NA's
## RFmodel 0.2849379 0
## knnmodellog 0.3629016 0
## modelelasticlog 0.2985662 0
## lmModellog 0.3029760 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu.
## RFmodel 0.7126525 0.7207377 0.7225302 0.7212512 0.7248489
## knnmodellog 0.5348358 0.5440158 0.5522512 0.5486679 0.5548388
## modelelasticlog 0.6840272 0.6915645 0.6964239 0.6941522 0.6972115
## lmModellog 0.6747117 0.6841962 0.6850071 0.6852338 0.6887348
## Max. NA's
## RFmodel 0.7254868 0
## knnmodellog 0.5573978 0
## modelelasticlog 0.7015340 0
## lmModellog 0.6935192 0
In both comparisons, the best scores are obtained with the random forest and the worst with the KNN model. For all models, R2 changes noticeably after retransformation; linear regression ranks second on MAE, but its error metrics also change considerably once the predictions are retransformed. To sum up, under these conditions, the best model is the random forest.
To conclude, the models could be improved with further feature engineering. Better results might also be achieved for KNN and random forest by searching over wider hyperparameter ranges, and additional models could be tried, such as a gradient boosting regressor (a sketch of which is given below).
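A possible starting point for the gradient boosting idea mentioned above, kept as a sketch with caret's default tuning grid (gbmModellog is a hypothetical name, not part of the original analysis):
set.seed(987654321)
gbmModellog <- train(log(totalRent + 1) ~ .,
                     data = housesTrain %>%
                       dplyr::select(-typeOfFlat, -telekomTvOffer, -condition),
                     method = "gbm",
                     trControl = train.control,
                     verbose = FALSE)
gbmModellog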