Introduction

The purpose of this project was to help price our newly built small and mid-sized apartments in Crete, Greece, which host 2-6 guests. The project applied several prediction models to Airbnb data for Crete, filtered to apartment and loft property types that accommodate up to 6 guests. Based on these models, the project advises the company to focus on apartments in the Rethymnon municipality, as those might command higher rental prices. However, the model's predictive power is weak for this municipality and comparatively better for the remaining 3 municipalities. The price predictions are conditional on property type, amenities, and a few other characteristics, and they come from a tuned Random Forest model with root mean squared error (RMSE) as the loss function.

The models were built on an Airbnb dataset for Crete, Greece, scraped during the last week of December 2021. The dataset contained information on numerous property types, reviews, municipalities, host-related variables, and amenities. However, in line with the company's vision of marketing small and medium-sized apartments, the project focused on the property types ‘Entire serviced apartment’, ‘Entire loft’, and ‘Entire home/apt’. The data was further filtered to properties that accommodate between 2 and 6 people. A major downside of this filtering was the extreme reduction in sample size, from around 20,000 observations to only around 500. Of these 500 observations, 80% were used to train the models and the remaining 20% were held out to serve as live data for testing the models. Given these limited observations, the prediction exercise should be read as an initial pricing mechanism for feasibility purposes, which can be improved later by repeating it on a larger dataset.
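
The filtering and the 80/20 split described above can be sketched as follows. This is only a minimal illustration on simulated listings, not the project's actual code: the `listings` data frame, its column names, the seed, and the use of base R `sample()` for the split are all assumptions.

```r
# Minimal sketch of the filter-then-split step on simulated listings.
# All object names, column names, and the seed are illustrative assumptions.
set.seed(2021)

n <- 2000
listings <- data.frame(
  property_type = sample(
    c("Entire home/apt", "Entire loft", "Entire serviced apartment",
      "Entire villa", "Private room in apartment"),
    n, replace = TRUE
  ),
  accommodates = sample(1:12, n, replace = TRUE),
  price = round(runif(n, 30, 300))
)

# Keep only the apartment-like property types hosting 2 to 6 guests.
keep_types <- c("Entire home/apt", "Entire loft", "Entire serviced apartment")
filtered <- listings[
  listings$property_type %in% keep_types &
    listings$accommodates >= 2 & listings$accommodates <= 6,
]

# Randomly assign 80% of the rows to training and hold out the rest.
train_idx <- sample(seq_len(nrow(filtered)), size = floor(0.8 * nrow(filtered)))
data_train <- filtered[train_idx, ]
data_holdout <- filtered[-train_idx, ]

c(train = nrow(data_train), holdout = nrow(data_holdout))
```

Fixing the seed keeps the split reproducible, so the hold-out set stays untouched across model runs.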

Initial Data Analysis

The y-variable was the price of a property, and before any predictive analysis we looked at its distribution. The graph below shows the price distribution for properties priced at or below 300 Euros. The distribution looks near normal, so we decided not to take the log of price, which would also have complicated later steps of the analysis. Similarly, we looked at the distributions of some other important predictor variables and ended up using them as they are, without transformation.

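# Packages used by the code in this report
library(ggplot2)
library(gganimate)
library(ggmap)
library(rpart.plot)
library(dplyr)

price_dist <- ggplot(data) +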
  aes(x = price) +
  geom_histogram(bins = 60L, fill = "#B22222", color = 'white') +
  labs(
    x = "Price (Euros)",
    y = "Number Of Observations",
    title = "Distribution Of Price Variable"
  ) +
  ggthemes::theme_economist() +
  theme(
    plot.title = element_text(size = 18L,
    face = "bold",
    hjust = 0.5),
    axis.title.y = element_text(size = 13L,
    face = "bold", margin = margin(r=5)),
    axis.title.x = element_text(size = 13L,
    face = "bold")
  ) +
  scale_y_continuous(limits=c(0,55), breaks=seq(0,55, by=5))+
  scale_x_continuous(limits=c(0,250), breaks=seq(0,250, by=50)) 

price_dist
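
The visual judgement about normality could be backed by a simple number, for example the sample skewness of price in levels and in logs. The sketch below uses simulated prices, and the `skewness` helper is a hypothetical function written for this illustration; in the report the input would be `data$price`.

```r
# Sketch: compare the skewness of price in levels and in logs.
# `prices` is simulated here; in the report it would be data$price.
set.seed(1)
prices <- round(rgamma(500, shape = 6, scale = 12))
prices <- prices[prices > 0 & prices <= 300]

# Sample skewness: the third standardized moment (0 for a symmetric distribution).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

c(level = skewness(prices), log = skewness(log(prices)))
```

A skewness close to zero in levels supports modelling price directly, while a strongly positive value would be an argument for the log transformation.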

As further exploratory analysis before moving on to prediction, we built quick LOWESS curves to see how price is associated with a few important predictors, such as the number of people a property accommodates and the number of beds in a property. The relationship between price and the number of people accommodated seems more or less linear, indicating that we do not need higher-order polynomials of this predictor. The association between price and the number of beds seems linear as well, except for a small kink where the number of beds is between 1 and 3. Since the kink is minute, we can still use the beds variable in linear form for our predictive modelling.

price_acco_lowess <- ggplot(data) +
  aes(x = n_accommodates, y = price) +
  geom_smooth(method = 'loess' , formula = 'y ~ x', color = 'maroon', se = FALSE) +
  geom_point(alpha = 0.5)+
  labs(
    x = "Accommodates (People)",
    y = "Price (Euros)",
    title = "Association between Price and Accommodates"
  ) +
  ggthemes::theme_economist() +
  theme(
    plot.title = element_text(size = 18L,
    face = "bold",
    hjust = 0.5),
    axis.title.y = element_text(size = 13L,
    face = "bold", margin = margin(r=5)),
    axis.title.x = element_text(size = 13L,
    face = "bold")
  ) +
  scale_y_continuous(limits=c(0,250), breaks=seq(0,250, by=50))

price_acco_lowess

price_bed_lowess <- ggplot(data) +
  aes(x = n_beds, y = price) +
  geom_smooth(method = 'loess' , formula = 'y ~ x', color = 'maroon', se = FALSE) +
  geom_point(alpha = 0.5)+
  labs(
    x = "Number of Beds",
    y = "Price (Euros)",
    title = "Association between Price and Number of Beds"
  ) +
  ggthemes::theme_economist() +
  theme(
    plot.title = element_text(size = 18L,
    face = "bold",
    hjust = 0.5),
    axis.title.y = element_text(size = 13L,
    face = "bold", margin = margin(r=5)),
    axis.title.x = element_text(size = 13L,
    face = "bold")
  ) +
  scale_y_continuous(limits=c(0,250), breaks=seq(0,250, by=50))

price_bed_lowess
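
The visual linearity check can be complemented numerically by testing whether a quadratic term improves on a linear fit. The sketch below uses simulated data with a truly linear relationship; the `toy` data frame and its coefficients are assumptions made for illustration. Note that the plots above use `geom_smooth(method = 'loess')`, while the sketch shows base R's related `lowess()` smoother.

```r
# Sketch: test whether a quadratic term adds anything over a linear fit.
# The data are simulated with a truly linear relationship; names are illustrative.
set.seed(42)
toy <- data.frame(n_accommodates = sample(2:6, 400, replace = TRUE))
toy$price <- 30 + 12 * toy$n_accommodates + rnorm(400, sd = 20)

fit_linear <- lm(price ~ n_accommodates, data = toy)
fit_quadratic <- lm(price ~ n_accommodates + I(n_accommodates^2), data = toy)

# F-test of the nested models: a large p-value favours the simpler linear form.
comparison <- anova(fit_linear, fit_quadratic)
comparison

# Base R also offers lowess() for the smoothed curve itself.
smoothed <- lowess(toy$n_accommodates, toy$price)
```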

We then looked at conditional prices across the values of key predictor variables. The tables below outline the conditional mean and median prices for a few important variables. The following table shows conditional prices by the number of people a property can accommodate. Intuitively, prices are higher for properties that accommodate more people.

Accommodates      N   Percent   Mean price   Median price
2               141     28.95        56.50          50.00
3               124     25.46        63.78          60.00
4               172     35.32        78.10          69.50
5                32      6.57        81.53          80.00
6                18      3.70       117.56         105.50

The next table shows conditional mean prices by the number of beds in a property. Mean prices broadly rise with the number of beds, although not monotonically, up to six beds, and then stall: the mean price is the same for 6 and 8 beds. However, this is likely because there is only a single observation each for 6 and 8 beds.

Beds      N   Percent   Mean price   Median price
1       168     34.50        68.10          60.00
2       139     28.54        64.04          52.00
3       118     24.23        76.56          69.50
4        51     10.47        70.63          64.00
5         9      1.85        95.11          99.00
6         1      0.21       100.00         100.00
8         1      0.21       100.00         100.00

The following table shows mean prices for the different municipalities in Crete, Greece. Heraklion properties have the lowest mean price and Lasithi properties the highest.

Municipality      N   Percent   Mean price   Median price
Heraklion       115     23.61        64.14          60.00
Khania          247     50.72        68.59          60.00
Lasithi          48      9.86        79.75          80.00
Rethymnon        77     15.81        76.45          65.00
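
Conditional summaries of this kind (count, share, mean, and median of price per group) can be computed in base R as sketched below. The `toy` data and the `conditional_summary` helper are made up for illustration and are not the code that produced the tables above.

```r
# Sketch: N, percent, mean, and median of price by group, in base R.
# `toy` is made-up data; the report's tables come from the filtered Airbnb sample.
toy <- data.frame(
  n_accommodates = c(2, 2, 2, 3, 3, 4, 4, 4, 4, 6),
  price = c(50, 55, 60, 58, 66, 70, 65, 90, 75, 110)
)

conditional_summary <- function(df, group, value = "price") {
  groups <- split(df[[value]], df[[group]])
  out <- data.frame(
    level = names(groups),
    N = vapply(groups, length, integer(1)),
    Mean = vapply(groups, mean, numeric(1)),
    Median = vapply(groups, median, numeric(1)),
    row.names = NULL
  )
  out$Percent <- round(100 * out$N / sum(out$N), 2)
  out[, c("level", "N", "Percent", "Mean", "Median")]
}

tab <- conditional_summary(toy, "n_accommodates")
tab
```

Reporting N next to each mean makes thinly populated groups, such as the single-observation bed counts, easy to spot.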

The following chart shows conditional prices for the three property types in our dataset. The layered violin plots show the distribution of observations for each property type: the ‘Entire home/apt’ type is not normally distributed, while the remaining two resemble a near-normal distribution. The box plots also show the interquartile range (25th to 75th percentile) and any outliers. For the ‘Entire home/apt’ category there seem to be only two observations, hence its peculiar distribution. For the remaining two types, a major chunk of the observations falls within the interquartile range.

price_ppty_box <- ggplot(data) +
 aes(x = f_property_type, y = price, fill = f_property_type) +
 geom_boxplot(shape = "circle", color = "black", alpha = 0.65) +
  geom_point(alpha = 0.8)+ 
  geom_violin(alpha =0.2, color = "white")+
  stat_boxplot(geom = "errorbar",
               width = 0.15) +
  transition_states(f_property_type, wrap = FALSE) +
  shadow_mark(alpha = 0.5) +
  enter_grow()+
  exit_fade()+
  ease_aes("sine-in")+
  scale_y_continuous(limits=c(0,300), breaks=seq(0,300, by=50)) +
 scale_fill_manual(name ="Property Type",values = c(`Entire home/apt` = "maroon", `Entire loft` = "darkgreen", `Entire serviced apartment` = "darkblue"
 )) +
 labs(x = "Property Type", y = "Price (Euros)", title = "Price Variation By Property Type") +
 ggthemes::theme_economist() +
 theme(legend.position = "bottom", plot.title = element_text(size = 15, face = "bold", hjust = 0.5), 
 axis.title.y = element_text(size = 11, face = "bold", margin = margin(r=5)), axis.title.x = element_text(size = 11, 
 face = "bold"), legend.text = element_text(size = 11))

price_ppty_box

The following map plots the properties in the current dataset. The dataset contained longitude and latitude for each property; the four municipalities (Heraklion, Khania, Lasithi, Rethymnon) were defined afterwards based on the major divisions of the island. The purpose of the map is to show how the properties are distributed over these regions. Their locations also make sense: the island is tourist-focused, and these Airbnb properties are located near the beach to cater to tourists.

bbox <- c(bottom = 34.8, top = 35.7, right = 26.5, left = 23.5)

cretemap <- get_stamenmap(bbox = bbox, maptype = 'terrain-background', color = "color")

property_map <- ggmap(cretemap) +
  geom_point(data=data,aes(x=longitude,y=latitude, color= f_municipality), alpha = 1) +
  transition_states(f_municipality, wrap = FALSE) +
  shadow_mark(alpha = 0.5) +
  enter_grow()+
  exit_fade()+
  ease_aes("bounce-in")+
  theme_void() +
  labs(title = "Property Distribution Across The 4 Municipalities of Crete, Greece") + 
  theme(legend.position='bottom', 
        legend.title = element_blank(), 
        plot.title = element_text(hjust = 0.5, face = "bold", size = 14,family="serif"),
        legend.key = element_rect("white"),  # Key background
        legend.text = element_text(face = "bold", size = 10,family="serif"),   
    # Margins around the full legend area
    legend.box.margin = margin(0, 0, 0, 0, "cm"), 
    # Background of legend area: element_rect()
    legend.box.background = element_rect(color = "black"), 
    # The spacing between the plotting area and the legend box
    legend.box.spacing = unit(0.4, "cm")) +
  labs(x = "Longitude",
       y = "Latitude") +
    scale_colour_manual(values = c("orange", "red", "black","purple"))
#ggsave(paste0("A2/visuals/crete_properties_distribution.png"))

property_map

Predictive Analysis

CART model

As part of the prediction exercise, we first ran a pruned Classification and Regression Tree (CART). Like the other models we ran, CART was run on basic-level variables, host-related variables, review-related variables, amenities, and interaction terms that were filtered using the LASSO model. The resulting tree is shown below, with at least 7 observations in each terminal node. CART gave high importance to the number of people a property can accommodate, followed by the log of reviews, the number of bathrooms, the host acceptance rate, and some amenities.

# Tree graph
rpart.plot(cart_model$finalModel, tweak=1.2, digits=-1, extra=1)
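
For readers unfamiliar with pruning, the idea can be sketched with the `rpart` package directly. This is an assumption-laden illustration rather than the project's training code: the built-in `mtcars` data stands in for the Airbnb sample, and the control values other than the minimum leaf size of 7 are chosen only for the example.

```r
# Sketch of a pruned regression tree; mtcars stands in for the Airbnb data.
library(rpart)

set.seed(7)
# Grow a deliberately large tree, requiring at least 7 observations per leaf.
full_tree <- rpart(
  mpg ~ ., data = mtcars, method = "anova",
  control = rpart.control(minbucket = 7, cp = 0.0001, xval = 5)
)

# Prune back to the complexity parameter with the lowest cross-validated error.
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Leaf sizes: each terminal node holds at least `minbucket` observations.
leaf_sizes <- pruned_tree$frame$n[pruned_tree$frame$var == "<leaf>"]
leaf_sizes
```

Growing a large tree first and then cutting it back by cross-validated error usually gives a more stable tree than stopping early with a strict complexity threshold.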

Random Forest

Next, we ran the Random Forest (RF) models with 500 trees each. The first RF model was tuned over 6, 8, and 10 randomly sampled candidate variables per split and minimum node sizes of 5, 10, and 15. The best combination was 10 candidate variables per split with a minimum node size of 5, which returned an RMSE of 30.04.
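
The tuning grid and the RMSE loss described above can be sketched as follows. The parameter names `mtry` and `min.node.size` follow the `ranger` package, which is an assumption about the implementation; the optional block scores each combination by out-of-bag error on the built-in `mtcars` data, whereas the report's RMSE may have been obtained differently, for example by cross-validation.

```r
# Sketch of the tuning grid and the RMSE loss; names are illustrative.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 3 x 3 grid: variables tried per split and minimum terminal node size.
tune_grid <- expand.grid(mtry = c(6, 8, 10), min.node.size = c(5, 10, 15))

# Optional: score each combination with ranger's out-of-bag error.
# Assumes the ranger package; mtcars (10 predictors) stands in for the real data.
if (requireNamespace("ranger", quietly = TRUE)) {
  tune_grid$oob_rmse <- vapply(seq_len(nrow(tune_grid)), function(i) {
    fit <- ranger::ranger(
      mpg ~ ., data = mtcars, num.trees = 500,
      mtry = tune_grid$mtry[i],
      min.node.size = tune_grid$min.node.size[i],
      seed = 123
    )
    sqrt(fit$prediction.error)  # OOB mean squared error -> RMSE
  }, numeric(1))
  best <- tune_grid[which.min(tune_grid$oob_rmse), ]
  print(best)
}
```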

We further ran diagnostics on the RF model using Variable Importance (VI) plots, Partial Dependence (PD) plots, and subsample performance checks. For the VI plots, we grouped similar variables together and recalculated their importance to gauge the relative importance of these variable groups in predicting prices. This report shows the diagnostics of the tuned RF model. Amenities ranked first with a relative importance of around 34%, followed by property type with 20%, as shown in the graph below. Based on this, we recommend that management focus on the amenities provided in the apartment towers, since these are likely to have a major impact on property prices. The following VI plot shows the top 9 variable groups by importance.

var_imp_rf <- ggplot() +
  geom_point(data = rf_model_var_imp_grouped_df, aes(x=reorder(varname, imp), y=imp_percentage*100), color='red', size=2) +
  geom_segment(data = rf_model_var_imp_grouped_df, aes(x=varname,xend=varname,y=0,yend=imp_percentage*100), color='red', size=1) +
  ylab("Importance (Percent)") +   xlab("Variable Name") +
  coord_flip() +
  # expand=c(0,0),
  transition_reveal(imp_percentage) + 
  enter_grow()+
  exit_fade()+
  ease_aes("back-in")+
  theme_bw()+
  labs(title = "Variable Importance Plot For The Tuned Random Forest Prediction") + 
  theme(plot.title = element_text(face = "bold", size = 10, family="serif"),
        axis.text = element_text(color="black", size=11, face = "bold"),
        axis.title = element_text(color="red", size=11, face = "bold"))
var_imp_rf
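
The grouping step behind this plot can be sketched as below: per-variable importances are summed within groups and rescaled to shares. The importance values and the rule that amenity dummies start with `d_` are assumptions made for the illustration; only the column names `varname`, `imp`, and `imp_percentage` mirror the plotting code above.

```r
# Sketch: aggregate per-variable importance into groups and rescale to shares.
# The importance values and the variable-to-group mapping are made up.
var_imp <- data.frame(
  varname = c("d_wifi", "d_pool", "d_kitchen", "f_property_type",
              "n_accommodates", "n_beds", "f_municipality"),
  imp = c(120, 90, 60, 150, 200, 80, 100)
)

# Map each variable to a group; here every amenity dummy starts with "d_".
var_imp$group <- ifelse(grepl("^d_", var_imp$varname), "amenities", var_imp$varname)

grouped <- aggregate(imp ~ group, data = var_imp, FUN = sum)
grouped$imp_percentage <- grouped$imp / sum(grouped$imp)
grouped <- grouped[order(-grouped$imp_percentage), ]
grouped
```

Grouping matters because many individually weak amenity dummies can add up to the largest share, which a per-variable ranking would hide.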

As mentioned above, we also produced PD plots for a few important variables, one of which is the number of people a property can accommodate. The graph below shows how the expected predicted price changes as the number of people accommodated goes from 2 to 6, holding everything else constant.

# Evaluate the partial dependence at each distinct n_accommodates value
pdp_n_accommodates <- pdp::partial(rf_model, pred.var = "n_accommodates", 
                               pred.grid = distinct(data_test, n_accommodates), 
                               train = data_train)
pdp <- pdp_n_accommodates %>%
  autoplot() +
  geom_point(color='red', size=4, alpha = 0.5) +
  scale_y_continuous(limits=c(60,90), breaks=seq(60,90, by=5)) +
  transition_reveal(n_accommodates) +
   shadow_mark(alpha = 0.5) +
  enter_grow()+
  exit_fade()+
  ease_aes("sine-in")+
  ggthemes::theme_economist() +
  labs(title = "Partial Dependence Plot: Price vs Number of People Accommodated In The Apartment") + 
  labs(x = "Accommodates (Persons)",
       y = "Predicted Price (Euros)") +
  theme(plot.title = element_text( size = 11,family="serif", hjust = 1),
        axis.title.y = element_text(margin = margin(r=5)),
        axis.text = element_text(color="black", size=11, face = "bold"),
        axis.title = element_text(color="black", size=11, face = "bold"))

pdp
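
What a partial dependence value means can be made concrete by computing it by hand: fix the variable of interest at one value for every row, predict, and average. In the sketch below a linear model on simulated data stands in for the tuned Random Forest, and the `partial_dependence` helper is a hypothetical function written for this illustration, not part of the `pdp` package.

```r
# Sketch of partial dependence computed by hand.
# A linear model on simulated data stands in for the tuned Random Forest.
set.seed(3)
toy <- data.frame(
  n_accommodates = sample(2:6, 300, replace = TRUE),
  n_beds = sample(1:4, 300, replace = TRUE)
)
toy$price <- 25 + 10 * toy$n_accommodates + 4 * toy$n_beds + rnorm(300, sd = 15)
model <- lm(price ~ n_accommodates + n_beds, data = toy)

partial_dependence <- function(model, data, var, grid) {
  avg_pred <- vapply(grid, function(value) {
    modified <- data
    modified[[var]] <- value                  # same value for every row
    mean(predict(model, newdata = modified))  # average over the other features
  }, numeric(1))
  out <- data.frame(grid, avg_pred)
  names(out) <- c(var, "yhat")
  out
}

pd <- partial_dependence(model, toy, "n_accommodates", grid = 2:6)
pd
```

For a linear stand-in the curve is a straight line whose slope equals the model coefficient; with a Random Forest the same averaging can reveal non-linear steps.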

Conclusion

Based on these model predictions, it is very difficult to arrive at precise price predictions, perhaps because of the very small size of the analysis dataset. If the management of the newly built properties is willing to invest in another project focused on data collection, so that the resulting dataset is large enough, the project recommends using the Random Forest model with the tuning parameters reported above to predict prices for our apartments in Crete, Greece.

Moreover, the following plot shows actual versus predicted prices for the properties in the hold-out dataset, which was set aside at the start and served as live data for testing. The 45-degree line marks perfect predictions. With actual price on the vertical axis, points above the line are under-predicted (the actual price exceeds the prediction) and points below the line are over-predicted. Since the prediction is an estimate of the expected price given the property's characteristics, points above the line suggest that staying in that Airbnb is not a good deal, as its price is higher than expected, while points below the line suggest a good deal, i.e. the price is lower than expected.

# FIGURES FOR FITTED VS ACTUAL OUTCOME VARIABLES #
##--------------------------------------------------
# Assumption: data_test holds the hold-out observations
Ylev <- data_test[["price"]] 

# Predicted values for the hold-out observations
prediction_holdout_pred <- as.data.frame(predict(rf_model, newdata = data_test, interval="predict"))

predictionlev_holdout <- cbind(data_test[,c("price","n_accommodates")],
                               prediction_holdout_pred)

# Create data frame with the real and predicted values (column 3 holds the predictions)
dfd <- data.frame(ylev=Ylev, predlev=predictionlev_holdout[,3] )
# Check the differences
dfd$elev <- dfd$ylev - dfd$predlev

level_vs_pred <- ggplot() +
  geom_point(data = dfd,aes(y=Ylev, x=predlev , color = "purple"), color = "purple", size = 3.5,
             shape = 16, alpha = 1, show.legend=FALSE, na.rm=TRUE) +
  # geom_point(data = dfe, aes(y=Ylevv, x=predlev, color = "orange"), color = "black", size = 2,
  #            shape = 12, alpha = 0.5, show.legend=FALSE, na.rm=TRUE)  +
  geom_segment(aes(x = 0, y = 0, xend = 100, yend =100), size=0.8, color="black", linetype=2) +
     transition_states(Ylev) +
  shadow_mark(alpha = 0.75) +
  enter_grow()+
  exit_fade()+
    ease_aes("sine-in")+
  labs(y = "Actual Price (Euros)", x = "Predicted price  (Euros)") +
  ggthemes::theme_economist() +
  labs(title = "Actual Prices vs. Predicted Prices for Holdout Dataset")+
  scale_fill_manual(values = c("purple","orange"), labels = c("Training Data", "Hold-Out Data")) +
  theme(plot.title = element_text( size = 13,family="serif"),
        axis.title.y = element_text(margin = margin(r=5)),
        axis.text = element_text(color="black", size=12, face = "bold"),
        axis.title = element_text(color="black", size=12, face = "bold")) +
    scale_y_continuous(limits=c(0,100), breaks=seq(0,100, by=20)) +
    scale_x_continuous(limits=c(0,100), breaks=seq(0,100, by=20))

level_vs_pred
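
The reading of the plot can be summarised in numbers: the hold-out RMSE and the sign of each prediction error. The sketch below uses made-up actual and predicted prices; the residual is defined as in `dfd$elev` above (actual minus predicted).

```r
# Sketch: hold-out RMSE and the sign convention for prediction errors.
# The actual and predicted prices below are made-up numbers.
actual <- c(60, 85, 50, 120, 70)
predicted <- c(65, 70, 50, 95, 80)

residual <- actual - predicted          # same definition as dfd$elev in the report
holdout_rmse <- sqrt(mean(residual^2))

# With actual price on the y-axis, residual > 0 puts a point above the 45-degree line.
position <- ifelse(residual > 0, "above line (under-predicted)",
            ifelse(residual < 0, "below line (over-predicted)", "on line"))

data.frame(actual, predicted, residual, position)
holdout_rmse
```

Comparing this hold-out RMSE with the RMSE obtained during tuning gives a quick check on whether the model generalises to unseen listings.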