With more than 7.7 million listings by more than 5 million hosts in 191 countries (Woodward, 2024), Airbnb is one of the world's biggest platforms for finding short- and mid-term accommodation. The platform is part of the sharing economy (Andreu et al., 2020), which can be described as “the collaborative consumption made by the activities of sharing, exchanging, and rental of resources without owning the goods” (Lessig, 2008, p. 143). The sharing economy is often characterized by the use of online platforms (Andreu et al., 2020), which is exactly Airbnb's business model.
This report zooms in on Airbnb data from Florence, Italy, one of the most beautiful and touristically attractive cities in Europe.
getwd()
## [1] "/Users/sophiabauer/Desktop/DDMB/Final report"
df <- read.csv("Florence.csv")
colnames(df)
## [1] "id" "host_id"
## [3] "host_name" "listing_name"
## [5] "room_type" "bathrooms_text"
## [7] "bedrooms" "beds"
## [9] "host_accept_rate" "host_nlistings"
## [11] "host_is_superhost" "accommodates"
## [13] "price" "amenities"
## [15] "nreviews" "nreviews_lastyear"
## [17] "nreviews_lastmonth" "nreviews_month"
## [19] "neighbourhood" "borough"
## [21] "latitude" "longitude"
## [23] "minstay" "maxstay"
## [25] "availability_30" "availability_60"
## [27] "availability_90" "availability_365"
## [29] "instant_bookable" "review_scores_rating"
## [31] "review_scores_accuracy" "review_scores_cleanliness"
## [33] "review_scores_checkin" "review_scores_communication"
## [35] "review_scores_location" "review_scores_value"
table(df$neighbourhood)
##
##    Campo di Marte    Centro Storico Gavinana Galluzzo  Isolotto Legnaia
##              1288              9284               453               541
##           Rifredi
##              1012
In the data I noticed that certain hosts have a large number of listings. Having lived in Florence and searched for an apartment via Airbnb and similar platforms, I experienced that there are hosts who manage properties for a living. I am therefore very interested in how this influences the price and the rating scores of apartments that appear to be managed professionally compared to private listings. Furthermore, I am interested in how price and reviews differ between the most popular area, the Centro Storico, and the rest of Florence.
This leads me to my research question:
How do the number of listings of a host and neighbourhood influence the price and the rating scores of an apartment in Florence?
To answer this research question, I will use several sub-research questions:
- How does the number of listings per host influence price?
- How does the number of listings per host influence the rating score?
- Is there a moderating effect of the host being a superhost?
- How does the neighbourhood influence price?
- How does the neighbourhood influence the rating score?
- How can the listings be classified hierarchically and non-hierarchically based on the standardized variables “price”, “host_nlistings” and “review_scores_rating”?
As stated previously, the chosen city is Florence. The dataset I will be working with was cleaned to exclude NAs in any of the selected variables. The choice of my variables will be explained in the next section.
In the next step, outliers are excluded. After an initial review of the scatterplot of price against host_nlistings, listings with a price above €500 and/or hosts with more than 500 listings are disregarded.
df_1 <- na.omit(df[, c("price", "neighbourhood", "host_nlistings", "review_scores_rating", "host_is_superhost")])
df_clean <- df_1[df_1$price <= 500 & df_1$host_nlistings <= 500, ]
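A quick sanity check (a sketch; output omitted) confirms the filter bounds and the remaining sample size:
# sanity check after filtering: number of rows and value ranges
nrow(df_clean)
range(df_clean$price)
range(df_clean$host_nlistings)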
As mentioned above, I am very interested in whether the number of listings a host has on Airbnb and the neighbourhood influence the price and the review scores. Thus, I decided to include the variables “price”, “review_scores_rating”, “neighbourhood” and “host_nlistings” in my subset. Furthermore, I decided to keep one additional variable, “host_is_superhost”, as it can have an impact on the relationships of interest.
My dependent variables are “price” and “review_scores_rating”, as I am interested in how these two variables are influenced by my independent variables, “host_nlistings” and “neighbourhood”.
My assumption is that hosts with many listings are more likely to be professional hosts. Professional hosts may rent on Airbnb to make a living and thus aim for higher profits, and they may also have greater expertise, which could influence ratings. A superhost on Airbnb is a host who fulfils certain quality standards (Airbnb Superhost details for guests, n.d.). That is why I test whether “host_is_superhost” is a moderator for my first two sub-research questions.
Initially, I want to explore the data set through descriptive statistics to get a feeling for my dataset.
To make plots of higher visual quality, I use the ggplot2 package. To acquaint myself with it, I watched a YouTube tutorial by R for Ecology (2021). ChatGPT was asked to help debug when errors arose and to give visual suggestions.
Firstly, I would like to see how the price is distributed against the number of listings per host.
library(ggplot2)
ggplot(df_clean, aes(host_nlistings, price))+
geom_point(aes(color = host_nlistings))+
labs(x = "Number of Listings per host",
y = "Price",
title = "Price vs. Number of Listings per host")
This scatterplot shows that most listings come from hosts with between 1 and 60 listings. The plot includes hosts with up to 500 listings, as listings above this mark were excluded as outliers. Overall, prices vary drastically across all numbers of listings per host. Listings from hosts with more than 100 listings appear to rarely be very cheap, i.e. below €50.
To get a general overview of where the reviews lie in Florence, I’m making a boxplot of all of the review scores.
my_colors <- c("steelblue")
ggplot(df_clean, aes(x = "", y = review_scores_rating)) +
geom_boxplot(color = "black", fill = my_colors, alpha = 0.7) +
labs(x = NULL, y = "Review Scores Rating",
title = "Distribution of Review Scores Rating") +
coord_flip()
Overall the ratings are very high, with the median being around 4.8 stars and almost all ratings being above 4 stars.
Next, I would like to see how the review scores are distributed against the listings per host, which is why I am making a scatterplot.
ggplot(df_clean, aes(host_nlistings, review_scores_rating))+
geom_point(aes(color = host_nlistings))+
labs(x = "Number of Listings per host",
y = "Review scores",
title = "Review score vs. Number of Listings per host")
The plot shows very high ratings across all numbers of listings per host. As mentioned above, there are generally few low ratings; all ratings between 1 and 2.5 stars occur for hosts with between 1 and 330 listings. Listings whose host has more than 330 listings have at least 3 stars.
Next, I am making a barplot to get a first impression of the neighbourhoods and the number of listings within them.
ggplot(df_clean, aes(x = neighbourhood)) +
geom_bar(color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Neighbourhoods",
y = "Count",
title = "Overview of Neighbourhoods")
The majority, over 7,000 of the 9,749 observations, are in the Centro Storico. The second-largest neighbourhood is Campo di Marte with around 1,000 listings. Gavinana Galluzzo and Isolotto Legnaia are the smallest neighbourhoods with fewer than 500 listings each.
To get a general overview of how the prices are distributed, I am examining the frequency of prices through a histogram for each neighbourhood separately.
hist_cs <- ggplot(data = subset(df_clean, neighbourhood == "Centro Storico"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Centro Storico")
hist_cdm <- ggplot(data = subset(df_clean, neighbourhood == "Campo di Marte"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Campo di Marte")
hist_gg <- ggplot(data = subset(df_clean, neighbourhood == "Gavinana Galluzzo"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Gaviana Galluzzo")
hist_il <- ggplot(data = subset(df_clean, neighbourhood == "Isolotto Legnaia"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Isolotto Legnaia")
hist_r <- ggplot(data = subset(df_clean, neighbourhood == "Rifredi"), aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Price", y = "Frequency", title = "Histogram price Rifredi")
library(gridExtra)
grid.arrange(hist_cs, hist_cdm, hist_gg, hist_il, hist_r)
The histograms show that in all neighbourhoods the most frequent prices lie around €100. However, the Centro Storico has higher frequencies for prices above €100 than the other neighbourhoods. It has to be noted that this area also has the highest overall number of listings.
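As a side note, the five near-identical chunks above could be produced in a single step with ggplot2's facet_wrap(), which draws one panel per neighbourhood; a minimal sketch (the same idea applies to the review-score histograms below):
# Sketch: one faceted histogram instead of five separate plots
ggplot(df_clean, aes(x = price)) +
geom_histogram(binwidth = 50, color = "black", fill = my_colors, alpha = 0.7) +
facet_wrap(~ neighbourhood, scales = "free_y") +
labs(x = "Price", y = "Frequency", title = "Histogram price per neighbourhood")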
Similarly, I am also interested in how the frequency of different ratings differs between neighbourhoods.
hist_cs2 <- ggplot(data = subset(df_clean, neighbourhood == "Centro Storico"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Centro Storico")
hist_cdm2 <- ggplot(data = subset(df_clean, neighbourhood == "Campo di Marte"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Campo di Marte")
hist_gg2 <- ggplot(data = subset(df_clean, neighbourhood == "Gavinana Galluzzo"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Gavinana Galluzzo")
hist_il2 <- ggplot(data = subset(df_clean, neighbourhood == "Isolotto Legnaia"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Isolotto Legnaia")
hist_r2 <- ggplot(data = subset(df_clean, neighbourhood == "Rifredi"), aes(x = review_scores_rating)) +
geom_histogram(binwidth = 0.25, color = "black", fill = my_colors, alpha = 0.7) +
labs(x = "Review score", y = "Frequency", title = "Histogram review score Rifredi")
grid.arrange(hist_cs2, hist_cdm2, hist_gg2, hist_il2, hist_r2)
All neighbourhoods except the Centro Storico have 5 stars as the most frequent rating; for the Centro Storico the most frequent rating is 4.75 stars. Ratings below 4 stars are very rare in all neighbourhoods, which was to be expected after examining the boxplot above.
H1: Listings where the host has many listings are more expensive.
summary(aov(price ~ host_nlistings, data = df_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## host_nlistings 1 82281 82281 11.61 0.00066 ***
## Residuals 9747 69094945 7089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is below 0.05, there appears to be a significant relationship between price and host_nlistings.
I assume that hosts with many listings are professional hosts with greater expertise in renting out properties, which could explain higher prices.
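Note that an ANOVA with a single continuous predictor fits the same model as a simple linear regression, so the equivalent lm() call below (a sketch; output omitted) reproduces the same F-test while additionally reporting the direction of the effect through the slope:
# Equivalent simple regression; the sign of the host_nlistings
# coefficient shows the direction of the price effect
summary(lm(price ~ host_nlistings, data = df_clean))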
Next, I want to test whether there is a significant moderating effect of the variable “host_is_superhost” on the previous relationship. To conduct this analysis I run a linear regression with the moderator included, following Moderation Analysis – Advanced Statistics using R (n.d.).
df_clean$xz <- df_clean$host_nlistings * df_clean$host_is_superhost
summary(lm(price ~ host_nlistings + host_is_superhost + xz, data = df_clean))
##
## Call:
## lm(formula = price ~ host_nlistings + host_is_superhost + xz,
## data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.85 -53.90 -23.90 26.56 371.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.63532 1.21421 109.236 < 2e-16 ***
## host_nlistings 0.03609 0.01166 3.096 0.00196 **
## host_is_superhostTRUE -5.48660 1.98715 -2.761 0.00577 **
## xz 1.71272 0.15050 11.380 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.62 on 9745 degrees of freedom
## Multiple R-squared: 0.01507, Adjusted R-squared: 0.01477
## F-statistic: 49.7 on 3 and 9745 DF, p-value: < 2.2e-16
There seems to be a significant moderating effect of “host_is_superhost” on the relationship between “host_nlistings” and “price”, as the p-value for the xz variable is below 0.05.
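The manually computed product term is valid; for reference, the same moderation model can be written with R's interaction operator, since host_nlistings * host_is_superhost expands to both main effects plus their interaction and yields identical coefficients (a sketch; output omitted):
# Equivalent moderation model using the formula interaction operator
summary(lm(price ~ host_nlistings * host_is_superhost, data = df_clean))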
H2: Listings where the host has many listings have better reviews.
summary(aov(review_scores_rating ~ host_nlistings, data = df_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## host_nlistings 1 97.7 97.71 713.2 <2e-16 ***
## Residuals 9747 1335.3 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is below 0.05, there appears to be a significant relationship between review_scores_rating and host_nlistings.
For the relationship between “review_scores_rating” and “host_nlistings” I am also testing for a moderating effect by “host_is_superhost”.
df_clean$xz <- df_clean$host_nlistings * df_clean$host_is_superhost
summary(lm(review_scores_rating ~ host_nlistings + host_is_superhost + xz, data = df_clean))
##
## Call:
## lm(formula = review_scores_rating ~ host_nlistings + host_is_superhost +
## xz, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6397 -0.0744 0.0456 0.1422 0.8344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.641e+00 5.134e-03 903.902 <2e-16 ***
## host_nlistings -9.755e-04 4.929e-05 -19.792 <2e-16 ***
## host_is_superhostTRUE 2.361e-01 8.402e-03 28.094 <2e-16 ***
## xz -1.373e-03 6.364e-04 -2.158 0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3536 on 9745 degrees of freedom
## Multiple R-squared: 0.1499, Adjusted R-squared: 0.1496
## F-statistic: 572.8 on 3 and 9745 DF, p-value: < 2.2e-16
Similarly to the previous moderator test, there also appears to be a statistically significant moderating effect of “host_is_superhost” on the relationship between “review_scores_rating” and “host_nlistings”, as the p-value is 0.031.
After looking at the histograms of price and ratings per neighbourhood, I am especially interested in how the Centro Storico compares to the other neighbourhoods, which is why I create a dummy variable as in Assignment 1.
df_clean$dummy <- ifelse(df_clean$neighbourhood == "Centro Storico", "CS", "NCS")
H3: The price is higher for listings in the Centro Storico
t.test(df_clean$price, df_clean$dummy == "CS", "greater")
##
## Welch Two Sample t-test
##
## data: df_clean$price and df_clean$dummy == "CS"
## t = 157.95, df = 9748.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 133.3617 Inf
## sample estimates:
## mean of x mean of y
## 135.5147195 0.7495128
As the p-value is virtually 0, the null hypothesis is rejected, which is consistent with H3 (but see the caveat below).
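A caveat on the call above: its second argument, df_clean$dummy == "CS", is a logical vector, so the test compares the mean price (135.51) with the share of Centro Storico listings (0.75), as the sample estimates show, rather than comparing prices between the two areas. A sketch of a test that compares the groups directly (since the factor levels sort as "CS" before "NCS", alternative = "greater" tests whether the Centro Storico mean price is higher):
# Sketch: two-sample comparison of prices between CS and NCS listings
t.test(price ~ dummy, data = df_clean, alternative = "greater")
# analogous form for H4 below (lower ratings in the Centro Storico):
t.test(review_scores_rating ~ dummy, data = df_clean, alternative = "less")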
The histograms above suggested that the ratings for the Centro Storico are lower than for the other neighbourhoods.
H4: The ratings are lower in the Centro Storico.
t.test(df_clean$review_scores_rating, df_clean$dummy == "CS", "less")
##
## Welch Two Sample t-test
##
## data: df_clean$review_scores_rating and df_clean$dummy == "CS"
## t = 676.05, df = 19211, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 3.971246
## sample estimates:
## mean of x mean of y
## 4.7111201 0.7495128
The p-value is 1, which means the null hypothesis cannot be rejected, so H4 is not supported. Note that failing to reject the null is not strong evidence that there is no difference between the Centro Storico and the other neighbourhoods; moreover, the caveat about the test setup described above applies here as well.
To test whether there is also a significant association when price is treated as binary, i.e. high versus low, I create a price dummy where all observations at or below the median get the label “low” and observations above the median get the label “high”.
print("Median:")
## [1] "Median:"
median(df_clean$price)
## [1] 110
df_clean$price_dummy <- ifelse(df_clean$price <= 110, "low", "high")
chisq.test(df_clean$dummy, df_clean$price_dummy)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df_clean$dummy and df_clean$price_dummy
## X-squared = 407.6, df = 1, p-value < 2.2e-16
This p-value is also far below 0.05, which supports an association between location (Centro Storico vs. the rest) and the binary price level, corroborating a price difference with a second test.
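Because the chi-squared test only signals that an association exists, not its direction, the row proportions of the underlying contingency table show which area holds the larger share of high-priced listings (a quick check; output omitted):
# Row proportions: share of "high" vs. "low" priced listings
# inside (CS) and outside (NCS) the Centro Storico
prop.table(table(df_clean$dummy, df_clean$price_dummy), margin = 1)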
There are two types of clustering that were discussed during the third and fifth week of the course DDMB, namely hierarchical clustering and non-hierarchical clustering (GfG, 2022). In agglomerative hierarchical clustering, observations are merged step by step from the bottom up, producing clusters at different levels, hence the name hierarchical (Yusuf, 2023). Non-hierarchical clustering, on the other hand, groups observations into a pre-specified number of clusters based on the distances between observations. Scaling is necessary for both approaches (How do you clean your data before clustering?, 2024), which is why I first scale all necessary variables as learnt in the third week of this course.
df_clean$Z_price <- scale(df_clean$price)
df_clean$Z_review_scores_rating <- scale(df_clean$review_scores_rating)
df_clean$Z_host_nlisting <- scale(df_clean$host_nlistings)
The code used for this analysis is derived from the third weekly assignment of the course DDMB and was debugged with ChatGPT. I use the three scaled numerical main variables as the basis of the clustering; I do not want to include more than three variables, since every additional variable adds a dimension and more than three dimensions are hard to visualize.
plot(hc1<-hclust(dist(na.omit(df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]))),
ylab = "Distance", main = "Dendrogram sample ", xlab = "Observations", hang = -1)
The dendrogram shows that different clusters would be formed at different distances. In my opinion it would not make sense to set the distance at 0, because most clusters would then contain only one observation, due to the agglomerative approach. According to one of the lectures, clusters should contain observations that are similar within the cluster but distinct from other clusters. Thus, I examine the clusters at different distances with the following code, which was provided by Meike Morren during a lecture on 12.03.2024.
table(c8 <- cutree(hc1, h = 8))
##
## 1 2 3 4
## 9294 29 131 295
df_clean$hier8 <- c8
table(c7 <- cutree(hc1, h = 7))
##
## 1 2 3 4 5
## 8922 372 29 131 295
df_clean$hier7 <- c7
table(c6 <- cutree(hc1, h = 6))
##
## 1 2 3 4 5 6 7 8
## 8922 327 20 45 131 247 9 48
df_clean$hier6 <- c6
table(c5 <- cutree(hc1, h = 5))
##
## 1 2 3 4 5 6 7 8 9 10 11
## 7559 1363 327 20 45 88 33 247 9 48 10
df_clean$hier5 <- c5
The hierarchical clustering produces one very large first cluster containing almost all listings. This is not desirable, as we want clusters that are distinct from each other but similar within. If almost all listings end up in the same cluster, the clustering is not very helpful for further work. To examine why this could be the case, I inspect the 3D scatterplot of the three standardized variables; this plot was created using code by Agarwal (n.d.). Furthermore, I make a second dendrogram with a different linkage method, in this case “centroid”, to examine whether the clusters would differ if another method were used.
plot(hc2<-hclust(dist(na.omit(df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")])), method = "centroid"),
ylab = "Distance", main = "Dendrogram sample ", xlab = "Observations", hang = -1)
At first glance the dendrogram based on centroids looks very similar to the dendrogram with the default setting, complete linkage; thus the linkage method does not seem to be the reason why there is one very big cluster.
library(scatterplot3d)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rgl)
plot3d(df_clean$Z_price, df_clean$Z_host_nlisting, df_clean$Z_review_scores_rating, type = "s", col = my_colors)
rglwidget()
The 3D plot of the three standardized variables provides insight into why there is one big main cluster. The majority of the listings lie between -4 and 1 for “Z_price”, between -2 and 1 for “Z_review_scores_rating” and between -1 and 1 for “Z_host_nlisting”. Thus, the general shape of the data on the three standardized variables does not seem suitable for hierarchical clustering.
To fix plotting issues, I used a post from the Stack Overflow forum that was sent to me by Meike Morren.
The next step is to confirm the number of clusters I will use for the kmeans analysis by applying the iterative procedure with the methods “elbow”, “gap” and “silhouette”, as discussed during lecture 7 (Morren, 2024).
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
set.seed(1234)
elbow2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "wss", nboot = 5)
# repeat for gap statistic
set.seed(1234)
gap2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "gap_stat", nboot = 5)
# repeat for silhouette
set.seed(1234)
sil2 <- fviz_nbclust(df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")], kmeans, method = "silhouette", nboot = 5)
# arrange them next to one another using grid.arrange
grid.arrange(elbow2, gap2, sil2, ncol = 3)
The elbow method does not clearly identify an ideal number of clusters; however, the curve flattens drastically around the six-cluster mark, which suggests six clusters. The gap statistic indicates that one cluster would be ideal, which does not make much sense, as stated above. The silhouette procedure shows that two clusters would be optimal, which I also consider too few.
Revisiting the results of the hierarchical clustering, the distances h = 7 with 5 clusters and h = 6 with 8 clusters are the closest to the iterative procedure results.
After performing and reviewing the hierarchical clustering and the iterative procedures, I perform the non-hierarchical clustering with 5, 6 and 8 clusters. The non-hierarchical clustering is done by running an iterative kmeans analysis with 10 iterations. This code was taken from the 7th lecture (Morren, 2024) and debugged by ChatGPT, by providing the code that was already adapted to my data set but produced an error.
# Define the data matrix Q and the number of clusters k
Q <- df_clean[,c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 5
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3,3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5
## 248 355 5239 820 3087
table(cluster, c7)
## c7
## cluster 1 2 3 4 5
## 1 75 21 21 131 0
## 2 45 7 8 0 295
## 3 5239 0 0 0 0
## 4 486 334 0 0 0
## 5 3077 10 0 0 0
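For reference, the manual loop above is Lloyd's algorithm; base R's kmeans() implements the same idea with Euclidean distances (it offers no Manhattan option, and its default algorithm is Hartigan-Wong), so the sketch below corresponds to the Euclidean runs further down rather than to the Manhattan run above (output omitted):
# Base R alternative to the manual loop; nstart = 25 restarts from
# several random centroid sets to guard against poor initialization
set.seed(1234)
km5 <- kmeans(Q, centers = 5, iter.max = 10, nstart = 25, algorithm = "Lloyd")
table(km5$cluster)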
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 6
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3,3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6
## 255 342 3767 1118 3800 467
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 8
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "manhattan"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6 7 8
## 111 326 2172 766 1884 405 3394 691
table(cluster, c6)
## c6
## cluster 1 2 3 4 5 6 7 8
## 1 0 0 20 0 82 0 9 0
## 2 26 0 0 5 0 247 0 48
## 3 2172 0 0 0 0 0 0 0
## 4 728 23 0 15 0 0 0 0
## 5 1884 0 0 0 0 0 0 0
## 6 105 273 0 25 2 0 0 0
## 7 3394 0 0 0 0 0 0 0
## 8 613 31 0 0 47 0 0 0
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 5
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5
## 437 359 5103 921 2929
table(cluster, c7)
## c7
## cluster 1 2 3 4 5
## 1 243 37 26 131 0
## 2 48 13 3 0 295
## 3 5103 0 0 0 0
## 4 600 321 0 0 0
## 5 2928 1 0 0 0
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 6
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6
## 267 357 2487 1693 4408 537
# Define the data matrix Q and the number of clusters k
Q <- df_clean[, c("Z_price", "Z_review_scores_rating", "Z_host_nlisting")]
k <- 8
n <- dim(Q)[1] # Number of observations
iter.max <- 10 # Maximum number of iterations
# Initialize centroids randomly
set.seed(1234)
p <- sample(n, k) # Randomly select k indices as starting centroids
C <- Q[p, ]
# Initialize cluster assignments
cluster <- numeric(n)
# Plotting setup
par(mfrow = c(3, 3))
# Main K-means loop
for (i in 1:iter.max) {
# Assign each point to the nearest centroid
Dist <- as.matrix(dist(rbind(C, Q), method = "euclidean"))[(k + 1):(k + n), 1:k]
for (j in 1:n) cluster[j] <- which.min(Dist[j, ])
# Plot the data points and centroids
plot(Q, col = cluster + 1, main = paste("Iteration", i))
points(C, col = 1:k, pch = 4, cex = 2, lwd = 2)
# Update centroids
C <- NULL
for (l in 1:k) C <- rbind(C, colMeans(Q[cluster == l, ]))
}
table(cluster)
## cluster
## 1 2 3 4 5 6 7 8
## 129 358 2316 867 1905 435 2707 1032
table(cluster, c6)
## c6
## cluster 1 2 3 4 5 6 7 8
## 1 0 0 20 0 98 0 9 2
## 2 52 0 0 13 0 247 0 46
## 3 2316 0 0 0 0 0 0 0
## 4 849 10 0 8 0 0 0 0
## 5 1905 0 0 0 0 0 0 0
## 6 133 278 0 24 0 0 0 0
## 7 2707 0 0 0 0 0 0 0
## 8 960 39 0 0 33 0 0 0
The kmeans analysis differs slightly between Manhattan and Euclidean distance; overall, however, the results are quite similar. Both tables for k = 8 show some clusters with relatively few observations, so I conclude that 8 clusters are too many. The tables for k = 5 and k = 6 show no such small clusters, which means that, based on the kmeans analysis, either 5 or 6 clusters can be justified.
As the hierarchical and the non-hierarchical clustering differ a great deal, I decided not to conduct any further tests to compare the two approaches.
In conclusion, only the non-hierarchical clustering is appropriate for this data set. However, as discussed in the seventh lecture (Morren, 2024), it is good practice to start with hierarchical clustering to find an appropriate number of clusters and then move on to non-hierarchical clustering. Thus, performing the hierarchical clustering before the kmeans analysis was still a useful step, even though only the kmeans results are kept.
Agarwal, A. (n.d.). RPubs - R: Interactive 3-D (three-dimensional) visualization of data and plot predicted values on the 3-D graph. https://rpubs.com/aagarwal29/179912
Airbnb Superhost details for guests. (n.d.). Airbnb. https://de.airbnb.com/d/superhost-guest?_set_bev_on_new_domain=1710968150_NjAxMjBkMjNiNGI5#:~:text=a%20Superhost%20provide%3F-,What%20does%20a%20Superhost%20provide%3F,and%20often%20exceeding%2C%20guest%20expectations.
Andreu, L., Bigne, E., Amaro, S., & Palomo, J. (2020). Airbnb research: An analysis in tourism and hospitality journals. International Journal of Culture, Tourism and Hospitality Research. https://doi.org/10.1108/IJCTHR-06-2019-0113
ChatGPT. (n.d.). https://chat.openai.com/c/eee4b840-8f1e-446d-a4cb-eaccee139c11
GfG. (2022, December 29). Difference between hierarchical and non hierarchical clustering. GeeksforGeeks. https://www.geeksforgeeks.org/difference-between-hierarchical-and-non-hierarchical-clustering/
How do you clean your data before clustering? (2024, February 17). LinkedIn. https://www.linkedin.com/advice/0/how-do-you-clean-your-data-before-clustering-skills-data-analysis#:~:text=Since%20clustering%20algorithms%20are%20sensitive,a%20standard%20deviation%20of%201.
Lessig, L. (2008). Remix: Making art and commerce thrive in the hybrid economy. Penguin, New York, NY.
Moderation Analysis – Advanced Statistics using R. (n.d.). https://advstats.psychstat.org/book/moderation/index.php
Morren, M. (2024, March 12). DDMB_IBA_Lecture7_Kmeans [Lecture]. Vrije Universiteit Amsterdam.
R for Ecology. (2021, January 27). ggplot2 explained in 5 minutes! [Video]. YouTube. https://www.youtube.com/watch?v=FdVy57oGJuc
Stack Overflow. (n.d.). RMarkdown: How to embed a 3D plot. https://stackoverflow.com/questions/63595786/rmarkdown-how-to-embed-an-3d-plot#63597059
Woodward, M. (2024, March 1). Airbnb statistics [2024]: User & market growth data. SearchLogistics. https://www.searchlogistics.com/learn/statistics/airbnb-statistics/#:~:text=There%20are%20currently%20over%205,booked%20over%201.5%20billion%20stays
Yusuf, F. (2023, October 9). Difference between K means and hierarchical clustering. Medium. https://medium.com/@waziriphareeyda/difference-between-k-means-and-hierarchical-clustering-edfec55a34f8