With more than 7.7 million listings by more than 5 million hosts in 191 countries (Woodward, 2024), Airbnb is one of the world's biggest platforms for finding short- and mid-term accommodation. The platform is part of the sharing economy (Andreu et al., 2020), which can be described as “the collaborative consumption made by the activities of sharing, exchanging, and rental of resources without owning the goods” (Lessig, 2008, p. 143). The sharing economy is often characterized by the use of online platforms (Andreu et al., 2020), which is exactly Airbnb's business model.
This report zooms in on Airbnb data from Florence, Italy, one of the most beautiful and touristically attractive cities in Europe.
getwd()
## [1] "/Users/sophiabauer/Desktop/DDMB/Final report"
df <- read.csv("Florence.csv")
colnames(df)
## [1] "id" "host_id"
## [3] "host_name" "listing_name"
## [5] "room_type" "bathrooms_text"
## [7] "bedrooms" "beds"
## [9] "host_accept_rate" "host_nlistings"
## [11] "host_is_superhost" "accommodates"
## [13] "price" "amenities"
## [15] "nreviews" "nreviews_lastyear"
## [17] "nreviews_lastmonth" "nreviews_month"
## [19] "neighbourhood" "borough"
## [21] "latitude" "longitude"
## [23] "minstay" "maxstay"
## [25] "availability_30" "availability_60"
## [27] "availability_90" "availability_365"
## [29] "instant_bookable" "review_scores_rating"
## [31] "review_scores_accuracy" "review_scores_cleanliness"
## [33] "review_scores_checkin" "review_scores_communication"
## [35] "review_scores_location" "review_scores_value"
table(df$neighbourhood)
##
##    Campo di Marte    Centro Storico Gavinana Galluzzo  Isolotto Legnaia
##              1288              9284               453               541
##           Rifredi
##              1012
In the data I noticed that certain hosts have a large number of listings. Having lived in Florence and searched for an apartment via Airbnb and similar platforms, I experienced that there are hosts who manage properties for a living. I am therefore very interested in how this influences the price and the rating scores of apartments that appear to be managed professionally compared to private listings. Furthermore, I am interested in how price and reviews differ between the most popular area, the Centro Storico, and the rest of Florence.
This leads me to my research question:
How do the number of listings of a host and neighbourhood influence the price and the rating scores of an apartment in Florence?
To answer this research question, I will use several sub-research questions:
- How does the number of listings per host influence price?
- How does the number of listings per host influence the rating score?
- Is there a moderating effect of the host being a superhost?
- How does the neighbourhood influence price?
- How does the neighbourhood influence the rating score?
- How can the listings be classified hierarchically and non-hierarchically based on the standardized variables “price”, “host_nlistings” and “review_scores_rating”?
As stated previously, the chosen city is Florence. The dataset I will be working with was cleaned to exclude NAs in any of the selected variables. The choice of my variables will be explained in the next section.
In the next step, outliers are excluded. After an initial review of the scatterplot of price against host_nlistings, listings with a price above €500 and/or hosts with more than 500 listings are disregarded.
df_1 <- na.omit(df[, c("price", "neighbourhood", "host_nlistings", "review_scores_rating", "host_is_superhost")])
df_clean <- df_1[df_1$price <= 500 & df_1$host_nlistings <= 500, ]
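A quick sanity check (a sketch; output omitted) confirms the filter bounds and the remaining sample size:
# sanity check after filtering: number of rows and value ranges
nrow(df_clean)
range(df_clean$price)
range(df_clean$host_nlistings)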
As mentioned above, I am very interested in whether the number of listings a host has on Airbnb and the neighbourhood influence the price and the review scores. Thus, I decided to include the variables “price”, “review_scores_rating”, “neighbourhood” and “host_nlistings” in my subset. Furthermore, I decided to keep one additional variable, “host_is_superhost”, as it can have an impact on the relationships of interest.
My dependent variables are “price” and “review_scores_rating”, as I am interested in how these two variables are influenced by my independent variables, “host_nlistings” and “neighbourhood”.
My assumption is that hosts with many listings are more likely to be professional hosts. Professional hosts may rent on Airbnb to make a living and thus aim for higher profits, and they may also have greater expertise, which could influence ratings. A superhost on Airbnb is a host who fulfils certain quality standards (Airbnb Superhost details for guests, n.d.). That is why I test whether “host_is_superhost” is a moderator for my first two sub-research questions.
Initially, I want to explore the data set through descriptive statistics to get a feeling for my dataset.
To make plots of higher visual quality, I use the ggplot2 package. To acquaint myself with it, I watched a YouTube tutorial by R for Ecology (2021). ChatGPT was asked to help debug when errors arose and to give visual suggestions.
Firstly, I would like to see how the price is distributed against the number of listings per host.
library(ggplot2)
ggplot(df_clean, aes(host_nlistings, price))+
geom_point(aes(color = host_nlistings))+
labs(x = "Number of Listings per host",
y = "Price",
title = "Price vs. Number of Listings per host")
This scatterplot shows that most listings come from hosts with between 1 and 60 listings. The plot includes hosts with up to 500 listings, as listings above this mark were excluded as outliers. Overall, prices vary drastically across all numbers of listings per host. Listings from hosts with more than 100 listings appear to rarely be very cheap, i.e. below €50.
To get a general overview of where the reviews lie in Florence, I’m making a boxplot of all of the review scores.
my_colors <- c("steelblue")
ggplot(df_clean, aes(x = "", y = review_scores_rating)) +
geom_boxplot(color = "black", fill = my_colors, alpha = 0.7) +
labs(x = NULL, y = "Review Scores Rating",
title = "Distribution of Review Scores Rating") +
coord_flip()
Overall the ratings are very high, with the median being around 4.8 stars and almost all ratings being above 4 stars.
Next, I would like to see how the review scores are distributed against the listings per host, which is why I am making a scatterplot.
ggplot(df_clean, aes(host_nlistings, review_scores_rating))+
geom_point(aes(color = host_nlistings))+
labs(x = "Number of Listings per host",
y = "Review scores",
title = "Review score vs. Number of Listings per host")
The plot shows very high ratings across all numbers of listings per host. As mentioned above, there are generally few low ratings; all ratings between 1 and 2.5 stars occur for hosts with between 1 and 330 listings. Listings whose host has more than 330 listings have at least 3 stars.
Next, I am making a barplot to get a first impression of the neighbourhoods and the number of listings within them.
ggplot(df_clean, aes(x = neighbourhood)) +
geom_bar(color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Neighbourhoods",
y = "Count",
title = "Overview of Neighbourhoods")
The majority, over 7,000 of the 9,749 observations, are in the Centro Storico. The second-largest neighbourhood is Campo di Marte with around 1,000 listings. Gavinana Galluzzo and Isolotto Legnaia are the smallest neighbourhoods with fewer than 500 listings each.
To get a general overview of how the prices are distributed, I am examining the frequency of prices through a histogram for each neighbourhood separately.
hist_cs <- ggplot(data = subset(df_clean, neighbourhood == "Centro Storico"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Centro Storico")
hist_cdm <- ggplot(data = subset(df_clean, neighbourhood == "Campo di Marte"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Campo di Marte")
hist_gg <- ggplot(data = subset(df_clean, neighbourhood == "Gavinana Galluzzo"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Gaviana Galluzzo")
hist_il <- ggplot(data = subset(df_clean, neighbourhood == "Isolotto Legnaia"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Isolotto Legnaia")
hist_r <- ggplot(data = subset(df_clean, neighbourhood == "Rifredi"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Rifredi")
library(gridExtra)
grid.arrange(hist_cs, hist_cdm, hist_gg, hist_il, hist_r)
The histograms show that in all neighbourhoods the most frequent prices lie around €100. However, the Centro Storico has higher frequencies for prices above €100 than the other neighbourhoods. It has to be noted that this area also has the highest overall number of listings.
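As a side note, the five near-identical chunks above could be produced in a single step with ggplot2's facet_wrap(), which draws one panel per neighbourhood; a minimal sketch (the same idea applies to the review-score histograms below):
# Sketch: one faceted histogram instead of five separate plots
ggplot(df_clean, aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
facet_wrap(~ neighbourhood, scales = "free_y") +
labs(x = "Price", y = "Frequency", title = "Histogram price per neighbourhood")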
Similarly, I am also interested in how the frequency of different ratings differs between neighbourhoods.
hist_cs2 <- ggplot(data = subset(df_clean, neighbourhood == "Centro Storico"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Centro Storico")
hist_cdm2 <- ggplot(data = subset(df_clean, neighbourhood == "Campo di Marte"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Campo di Marte")
hist_gg2 <- ggplot(data = subset(df_clean, neighbourhood == "Gavinana Galluzzo"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Gavinana Galluzzo")
hist_il2 <- ggplot(data = subset(df_clean, neighbourhood == "Isolotto Legnaia"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Isolotto Legnaia")
hist_r2 <- ggplot(data = subset(df_clean, neighbourhood == "Rifredi"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Rifredi")
grid.arrange(hist_cs2, hist_cdm2, hist_gg2, hist_il2, hist_r2)
All neighbourhoods except the Centro Storico have 5 stars as the most frequent rating; for the Centro Storico the most frequent rating is 4.75 stars. Ratings below 4 stars are very rare in all neighbourhoods, which was to be expected after examining the boxplot above.
H1: Listings where the host has many listings are more expensive.
summary(aov(price ~ host_nlistings, data = df_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## host_nlistings 1 82281 82281 11.61 0.00066 ***
## Residuals 9747 69094945 7089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is below 0.05, there appears to be a significant relationship between price and host_nlistings.
I assume that hosts with many listings are professional hosts with greater expertise in renting out properties, which could explain higher prices.
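Note that an ANOVA with a single continuous predictor fits the same model as a simple linear regression, so the equivalent lm() call below (a sketch; output omitted) reproduces the same F-test while additionally reporting the direction of the effect through the slope:
# Equivalent simple regression; the sign of the host_nlistings
# coefficient shows the direction of the price effect
summary(lm(price ~ host_nlistings, data = df_clean))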
Next, I want to test whether there is a significant moderating effect of the variable “host_is_superhost” on the previous relationship. To conduct this analysis I run a linear regression with the moderator included, following Moderation Analysis – Advanced Statistics using R (n.d.).
df_clean$xz <- df_clean$host_nlistings * df_clean$host_is_superhost
summary(lm(price ~ host_nlistings + host_is_superhost + xz, data = df_clean))
##
## Call:
## lm(formula = price ~ host_nlistings + host_is_superhost + xz,
## data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.85 -53.90 -23.90 26.56 371.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.63532 1.21421 109.236 < 2e-16 ***
## host_nlistings 0.03609 0.01166 3.096 0.00196 **
## host_is_superhostTRUE -5.48660 1.98715 -2.761 0.00577 **
## xz 1.71272 0.15050 11.380 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.62 on 9745 degrees of freedom
## Multiple R-squared: 0.01507, Adjusted R-squared: 0.01477
## F-statistic: 49.7 on 3 and 9745 DF, p-value: < 2.2e-16
There seems to be a significant moderating effect of “host_is_superhost” on the relationship between “host_nlistings” and “price”, as the p-value for the xz variable is below 0.05.
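The manually computed product term is valid; for reference, the same moderation model can be written with R's interaction operator, since host_nlistings * host_is_superhost expands to both main effects plus their interaction and yields identical coefficients (a sketch; output omitted):
# Equivalent moderation model using the formula interaction operator
summary(lm(price ~ host_nlistings * host_is_superhost, data = df_clean))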
H2: Listings where the host has many listings have better reviews.
summary(aov(review_scores_rating ~ host_nlistings, data = df_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## host_nlistings 1 97.7 97.71 713.2 <2e-16 ***
## Residuals 9747 1335.3 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is below 0.05, there appears to be a significant relationship between review_scores_rating and host_nlistings.
For the relationship between “review_scores_rating” and “host_nlistings” I am also testing for a moderating effect by “host_is_superhost”.
df_clean$xz <- df_clean$host_nlistings * df_clean$host_is_superhost
summary(lm(review_scores_rating ~ host_nlistings + host_is_superhost + xz, data = df_clean))
##
## Call:
## lm(formula = review_scores_rating ~ host_nlistings + host_is_superhost +
## xz, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6397 -0.0744 0.0456 0.1422 0.8344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.641e+00 5.134e-03 903.902 <2e-16 ***
## host_nlistings -9.755e-04 4.929e-05 -19.792 <2e-16 ***
## host_is_superhostTRUE 2.361e-01 8.402e-03 28.094 <2e-16 ***
## xz -1.373e-03 6.364e-04 -2.158 0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3536 on 9745 degrees of freedom
## Multiple R-squared: 0.1499, Adjusted R-squared: 0.1496
## F-statistic: 572.8 on 3 and 9745 DF, p-value: < 2.2e-16
Similarly to the previous moderator test, there also appears to be a statistically significant moderating effect of “host_is_superhost” on the relationship between “review_scores_rating” and “host_nlistings”, as the p-value is 0.031.
After looking at the histograms of price and ratings per neighbourhood, I am especially interested in how the Centro Storico compares to the other neighbourhoods, which is why I create a dummy variable as in Assignment 1.
df_clean$dummy <- ifelse(df_clean$neighbourhood == "Centro Storico", "CS", "NCS")
H3: The price is higher for listings in the Centro Storico
t.test(df_clean$price, df_clean$dummy == "CS", "greater")
##
## Welch Two Sample t-test
##
## data: df_clean$price and df_clean$dummy == "CS"
## t = 157.95, df = 9748.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 133.3617 Inf
## sample estimates:
## mean of x mean of y
## 135.5147195 0.7495128
As the p-value is virtually 0, the null hypothesis is rejected, which is consistent with H3 (but see the caveat below).
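A caveat on the call above: its second argument, df_clean$dummy == "CS", is a logical vector, so the test compares the mean price (135.51) with the share of Centro Storico listings (0.75), as the sample estimates show, rather than comparing prices between the two areas. A sketch of a test that compares the groups directly (since the factor levels sort as "CS" before "NCS", alternative = "greater" tests whether the Centro Storico mean price is higher):
# Sketch: two-sample comparison of prices between CS and NCS listings
t.test(price ~ dummy, data = df_clean, alternative = "greater")
# analogous form for H4 below (lower ratings in the Centro Storico):
t.test(review_scores_rating ~ dummy, data = df_clean, alternative = "less")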
The histograms above suggested that the ratings for the Centro Storico are lower than for the other neighbourhoods.
H4: The ratings are lower in the Centro Storico.
t.test(df_clean$review_scores_rating, df_clean$dummy == "CS", "less")
##
## Welch Two Sample t-test
##
## data: df_clean$review_scores_rating and df_clean$dummy == "CS"
## t = 676.05, df = 19211, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 3.971246
## sample estimates:
## mean of x mean of y
## 4.7111201 0.7495128
The p-value is 1, which means the null hypothesis cannot be rejected, so H4 is not supported. Note that failing to reject the null is not strong evidence that there is no difference between the Centro Storico and the other neighbourhoods; moreover, the caveat about the test setup described above applies here as well.
To test whether there is also a significant association when price is treated as binary, i.e. high versus low, I create a price dummy where all observations at or below the median get the label “low” and observations above the median get the label “high”.
print("Median:")
## [1] "Median:"
median(df_clean$price)
## [1] 110
df_clean$price_dummy <- ifelse(df_clean$price <= 110, "low", "high")
chisq.test(df_clean$dummy, df_clean$price_dummy)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df_clean$dummy and df_clean$price_dummy
## X-squared = 407.6, df = 1, p-value < 2.2e-16
This p-value is also far below 0.05, which supports an association between location (Centro Storico vs. the rest) and the binary price level, corroborating a price difference with a second test.
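Because the chi-squared test only signals that an association exists, not its direction, the row proportions of the underlying contingency table show which area holds the larger share of high-priced listings (a quick check; output omitted):
# Row proportions: share of "high" vs. "low" priced listings
# inside (CS) and outside (NCS) the Centro Storico
prop.table(table(df_clean$dummy, df_clean$price_dummy), margin = 1)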
There are two types of clustering that were discussed during the third and fifth week of the course DDMB, namely hierarchical clustering and non-hierarchical clustering (GfG, 2022). In agglomerative hierarchical clustering, observations are merged step by step from the bottom up, producing clusters at different levels, hence the name hierarchical (Yusuf, 2023). Non-hierarchical clustering, on the other hand, groups observations into a pre-specified number of clusters based on the distances between observations. Scaling is necessary for both approaches (How do you clean your data before clustering?, 2024), which is why I first scale all necessary variables as learnt in the third week of this course.
df_clean$Z_price <- scale(df_clean$price)
df_clean$Z_review_scores_rating <- scale(df_clean$review_scores_rating)
df_clean$Z_host_nlisting <- scale(df_clean$host_nlistings)
The code used for this analysis is derived from the third weekly assignment of the course DDMB and was debugged with ChatGPT. I use the three scaled numerical main variables as the basis of the clustering; I do not want to include more than three variables, since every additional variable adds a dimension and more than three dimensions are hard to visualize.
plot(hc1<-hclust(dist(na.omit(df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]))),
ylab = "Distance", main = "Dendrogram sample ", xlab = "Observations", hang = -1)
The dendrogram shows that different clusters would be formed at different distances. In my opinion it would not make sense to set the distance at 0, because most clusters would then contain only one observation, due to the agglomerative approach. According to one of the lectures, clusters should contain observations that are similar within the cluster but distinct from other clusters. Thus, I examine the clusters at different distances with the following code, which was provided by Meike Morren during a lecture on 12.03.2024.
table(c8 <- cutree(hc1, h = 8))
##
## 1 2 3 4
## 9294 29 131 295
df_clean$hier8 <- c8
table(c7 <- cutree(hc1, h = 7))
##
## 1 2 3 4 5
## 8922 372 29 131 295
df_clean$hier7 <- c7
table(c6 <- cutree(hc1, h = 6))
##
## 1 2 3 4 5 6 7 8
## 8922 327 20 45 131 247 9 48
df_clean$hier6 <- c6
table(c5 <- cutree(hc1, h = 5))
##
## 1 2 3 4 5 6 7 8 9 10 11
## 7559 1363 327 20 45 88 33 247 9 48 10
df_clean$hier5 <- c5
The hierarchical clustering produces one very large first cluster containing almost all listings. This is not desirable, as we want clusters that are distinct from each other but similar within. If almost all listings end up in the same cluster, the clustering is not very helpful for further work. To examine why this could be the case, I inspect the 3D scatterplot of the three standardized variables; this plot was created using code by Agarwal (n.d.). Furthermore, I make a second dendrogram with a different linkage method, in this case “centroid”, to examine whether the clusters would differ if another method were used.
plot(hc2<-hclust(dist(na.omit(df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")])), method = "centroid"),
ylab = "Distance", main = "Dendrogram sample ", xlab = "Observations", hang = -1)
At first glance the dendrogram based on centroids looks very similar to the dendrogram with the default setting, complete linkage; thus the linkage method does not seem to be the reason why there is one very big cluster.
library(scatterplot3d)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rgl)
plot3d(df_clean$Z_price, df_clean$Z_host_nlisting, df_clean$Z_review_scores_rating, type = "s", col = my_colors)
rglwidget()
The 3D plot of the three standardized variables provides insight into why there is one big main cluster. The majority of the listings lie between -4 and 1 for “Z_price”, between -2 and 1 for “Z_review_scores_rating” and between -1 and 1 for “Z_host_nlisting”. Thus, the general shape of the data on the three standardized variables does not seem suitable for hierarchical clustering.
To fix plotting issues, I used a post from the Stack Overflow forum that was sent to me by Meike Morren.
The next step is to confirm the number of clusters I will use for the kmeans analysis by applying the iterative procedure with the methods “elbow”, “gap” and “silhouette”, as discussed during lecture 7 (Morren, 2024).
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
set.seed(1234)
elbow2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "wss", nboot = 5)
# repeat for gap statistic
set.seed(1234)
gap2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "gap_stat", nboot = 5)
# repeat for silhouette
set.seed(1234)
sil2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "silhouette", nboot = 5)
# arrange them next to one another using grid.arrange
grid.arrange(elbow2, gap2, sil2, ncol = 3)
The elbow method does not clearly identify an ideal number of clusters; however, the curve flattens drastically around the six-cluster mark, which suggests six clusters. The gap statistic indicates that one cluster would be ideal, which does not make much sense, as stated above. The silhouette procedure shows that two clusters would be optimal, which I also consider too few.
Revisiting the results of the hierarchical clustering, the distances h = 7 with 5 clusters and h = 6 with 8 clusters are the closest to the iterative procedure results.
After performing and reviewing the hierarchical clustering and the iterative procedures, I perform the non-hierarchical clustering with 5, 6 and 8 clusters. The non-hierarchical clustering is done by running an iterative kmeans analysis with 10 iterations. This code was taken from the 7th lecture (Morren, 2024) and debugged by ChatGPT, by providing the code that was already adapted to my data set but produced an error.
# Define the data matrix Q and the number of clusters k
Q <- df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 5
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3,3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5
## 248 355 5239 820 3087
table(cluster, c7)
## c7
## cluster 1 2 3 4 5
## 1 75 21 21 131 0
## 2 45 7 8 0 295
## 3 5239 0 0 0 0
## 4 486 334 0 0 0
## 5 3077 10 0 0 0
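For reference, the manual loop above is Lloyd's algorithm; base R's kmeans() implements the same idea with Euclidean distances (it offers no Manhattan option, and its default algorithm is Hartigan-Wong), so the sketch below corresponds to the Euclidean runs further down rather than to the Manhattan run above (output omitted):
# Base R alternative to the manual loop; nstart = 25 restarts from
# several random centroid sets to guard against poor initialization
set.seed(1234)
km5 <- kmeans(Q, centers = 5, iter.max = 10, nstart = 25, algorithm = "Lloyd")
table(km5$cluster)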
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 6
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3,3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6
## 255 342 3767 1118 3800 467
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 8
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6 7 8
## 111 326 2172 766 1884 405 3394 691
table(cluster, c6)
## c6
## cluster 1 2 3 4 5 6 7 8
## 1 0 0 20 0 82 0 9 0
## 2 26 0 0 5 0 247 0 48
## 3 2172 0 0 0 0 0 0 0
## 4 728 23 0 15 0 0 0 0
## 5 1884 0 0 0 0 0 0 0
## 6 105 273 0 25 2 0 0 0
## 7 3394 0 0 0 0 0 0 0
## 8 613 31 0 0 47 0 0 0
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 5
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5
## 437 359 5103 921 2929
table(cluster, c7)
## c7
## cluster 1 2 3 4 5
## 1 243 37 26 131 0
## 2 48 13 3 0 295
## 3 5103 0 0 0 0
## 4 600 321 0 0 0
## 5 2928 1 0 0 0
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 6
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6
## 267 357 2487 1693 4408 537
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 8
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6 7 8
## 129 358 2316 867 1905 435 2707 1032
table(cluster, c6)
## c6
## cluster 1 2 3 4 5 6 7 8
## 1 0 0 20 0 98 0 9 2
## 2 52 0 0 13 0 247 0 46
## 3 2316 0 0 0 0 0 0 0
## 4 849 10 0 8 0 0 0 0
## 5 1905 0 0 0 0 0 0 0
## 6 133 278 0 24 0 0 0 0
## 7 2707 0 0 0 0 0 0 0
## 8 960 39 0 0 33 0 0 0
The kmeans analysis differs slightly between Manhattan and Euclidean distance; overall, however, the results are quite similar. Both tables for k = 8 show some clusters with relatively few observations, so I conclude that 8 clusters are too many. The tables for k = 5 and k = 6 show no such small clusters, which means that, based on the kmeans analysis, either 5 or 6 clusters can be justified.
As the hierarchical and the non-hierarchical clustering differ a great deal, I decided not to conduct any further tests to compare the two approaches.
In conclusion, only the non-hierarchical clustering is appropriate for this data set. However, as discussed in the seventh lecture (Morren, 2024), it is good practice to start with hierarchical clustering to find an appropriate number of clusters and then move on to non-hierarchical clustering. Thus, performing the hierarchical clustering before the kmeans analysis was still a useful step, even though only the kmeans results are kept.
Agarwal, A. (n.d.). RPubs - R: Interactive 3-D (three-dimensional) visualization of data and plot predicted values on the 3-D graph. https://rpubs.com/aagarwal29/179912
Airbnb Superhost details for guests. (n.d.). Airbnb. https://de.airbnb.com/d/superhost-guest?_set_bev_on_new_domain=1710968150_NjAxMjBkMjNiNGI5#:~:text=a%20Superhost%20provide%3F-,What%20does%20a%20Superhost%20provide%3F,and%20often%20exceeding%2C%20guest%20expectations.
Andreu, L., Bigne, E., Amaro, S., & Palomo, J. (2020). Airbnb research: An analysis in tourism and hospitality journals. International Journal of Culture, Tourism and Hospitality Research. https://doi.org/10.1108/IJCTHR-06-2019-0113
ChatGPT. (n.d.). https://chat.openai.com/c/eee4b840-8f1e-446d-a4cb-eaccee139c11
GfG. (2022, December 29). Difference between hierarchical and non hierarchical clustering. GeeksforGeeks. https://www.geeksforgeeks.org/difference-between-hierarchical-and-non-hierarchical-clustering/
How do you clean your data before clustering? (2024, February 17). LinkedIn. https://www.linkedin.com/advice/0/how-do-you-clean-your-data-before-clustering-skills-data-analysis#:~:text=Since%20clustering%20algorithms%20are%20sensitive,a%20standard%20deviation%20of%201.
Lessig, L. (2008). Remix: Making art and commerce thrive in the hybrid economy. Penguin, New York, NY.
Moderation Analysis – Advanced Statistics using R. (n.d.). https://advstats.psychstat.org/book/moderation/index.php
Morren, M. (2024, March 12). DDMB_IBA_Lecture7_Kmeans [Lecture]. Vrije Universiteit Amsterdam.
R for Ecology. (2021, January 27). ggplot2 explained in 5 minutes! [Video]. YouTube. https://www.youtube.com/watch?v=FdVy57oGJuc
Stack Overflow. (n.d.). RMarkdown: How to embed a 3D plot. https://stackoverflow.com/questions/63595786/rmarkdown-how-to-embed-an-3d-plot#63597059
Woodward, M. (2024, March 1). Airbnb statistics [2024]: User & market growth data. SearchLogistics. https://www.searchlogistics.com/learn/statistics/airbnb-statistics/#:~:text=There%20are%20currently%20over%205,booked%20over%201.5%20billion%20stays
Yusuf, F. (2023, October 9). Difference between K means and hierarchical clustering. Medium. https://medium.com/@waziriphareeyda/difference-between-k-means-and-hierarchical-clustering-edfec55a34f8