Airbnb is a peer-to-peer online marketplace and homestay network that enables people to list or rent short-term lodging in residential properties, with the cost of such accommodation set by the property owner. The company receives percentage service fees from both guests and hosts in conjunction with every booking. It has over 2,000,000 listings in 34,000 cities and 191 countries.
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.
This data file includes the information needed to learn more about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions.
Column explanations:
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
library(dplyr)  # provides %>%, select() and mutate()

# Drop the columns not used in the analysis and fix the column types
df <- df %>%
  select(-reviews_per_month, -last_review, -latitude, -longitude) %>%
  mutate(name = as.character(name),
         id = as.character(id),
         host_id = as.character(host_id),
         host_name = as.character(host_name),
         price = as.numeric(price))

##                             id                           name
##                              0                              0
##                        host_id                      host_name
##                              0                              0
##            neighbourhood_group                  neighbourhood
##                              0                              0
##                      room_type                          price
##                              0                              0
##                 minimum_nights              number_of_reviews
##                              0                              0
## calculated_host_listings_count               availability_365
##                              0                              0
From this PCA summary, we can reduce the dimensionality of our data to 4 PCs and still retain 86.83% of the information (the cumulative proportion of variance). We do not want to keep fewer than 4 PCs, because 3 PCs retain only 69.4%.
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1695 1.0646 0.9845 0.9334 0.8115
## Proportion of Variance 0.2735 0.2267 0.1938 0.1742 0.1317
## Cumulative Proportion  0.2735 0.5002 0.6940 0.8683 1.0000
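The rows of this table are linked: each proportion is a squared standard deviation divided by the sum of all squared standard deviations, and the cumulative row is a running sum. A minimal sketch that recomputes both rows from the printed standard deviations follows. The 0.85 cut-off and the names `sdev`, `threshold` and `n_keep` are assumptions for illustration, not part of the original analysis.

```r
# Standard deviations copied from the PCA summary above
sdev <- c(PC1 = 1.1695, PC2 = 1.0646, PC3 = 0.9845, PC4 = 0.9334, PC5 = 0.8115)

variance <- sdev^2                    # variance captured by each component
prop_var <- variance / sum(variance)  # "Proportion of Variance" row
cum_var  <- cumsum(prop_var)          # "Cumulative Proportion" row

# Hypothetical cut-off: keep the fewest components that reach 85%
threshold <- 0.85
n_keep <- unname(which(cum_var >= threshold)[1])

print(round(rbind(prop_var, cum_var), 4))
print(n_keep)
```

With these values, the smallest number of components that reaches 85% is 4, which matches the choice made above.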
We can also locate major outliers using PCA.
# PCA() is assumed to come from FactoMineR, fviz_cluster() from factoextra
library(FactoMineR)
library(factoextra)

# Shows how the initial variables contribute to PC1 and PC2
pca2 <- PCA(df.num, graph = TRUE)

df_scale <- as.data.frame(scale(df.num))
df_km <- kmeans(df_scale, 7)

# Plot the clusters
fviz_cluster(df_km,
             data = df_scale) +
  theme_minimal()

## [1]  1785.013  3763.203 39664.822 13932.443 10870.864  5919.191  9739.015
## [1] 85674.55
## [1] 158795.4
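The three printed results appear to be the within-cluster sum of squares per cluster, their total, and the between-cluster sum of squares: the seven values add up to 85674.55, and 85674.55 + 158795.4 is about 244470, which equals (48895 - 1) * 5 for five scaled variables. The sketch below checks the same identity on R's built-in `USArrests` data, which stands in for `df.num` here. The stand-in data, the seed and k = 3 are assumptions for illustration only.

```r
set.seed(123)                                 # arbitrary seed, for reproducibility
toy_scale <- as.data.frame(scale(USArrests))  # stand-in for df_scale
toy_km <- kmeans(toy_scale, centers = 3, nstart = 10)

within_total <- toy_km$tot.withinss  # sum of the per-cluster withinss values
between      <- toy_km$betweenss     # spread of the cluster centres
total        <- toy_km$totss         # total sum of squares

# For scaled data every column has variance 1, so totss = (n - 1) * p
expected_total <- (nrow(toy_scale) - 1) * ncol(toy_scale)

print(c(within = within_total, between = between, total = total))
print(between / total)  # share of the total variation explained by the clusters
```

If that reading of the output is right, the same ratio for the 7-cluster run above would be 158795.4 / 244470, roughly 0.65.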
In the plot produced by wss() below, we want the value of k at the elbow, where the within-groups sum of squares stops dropping steeply and the curve starts to flatten. In this case that appears to be around k = 8.
# Determine K
wss <- function(data, maxCluster = 20) {
  # For k = 1 the within sum of squares equals the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}
wss(df_scale)

# Cluster to 8
df_km <- kmeans(df_scale, 8)

# Plot the new clusters
fviz_cluster(df_km,
             data = df_scale) +
  theme_minimal()

New data frame with cluster labels:
cluster <- df_km$cluster
df_cluster <- cbind(df, cluster)

# Proportion of listings in each cluster
prop.table(table(df_cluster$cluster))

##
##            1            2            3            4            5
## 0.0008794355 0.0025769506 0.0980468351 0.2484303098 0.5987728807
##            6            7            8
## 0.0243174149 0.0155435116 0.0114326618
From our PCA we find that about 50% of the variance in our numerical variables can be summarised by 2 PCs (principal components). The biggest contributions to PC1 come from availability_365 and calculated_host_listings_count, and the biggest contributor to PC2 is number_of_reviews. The other numerical variables show more multicollinearity with each other, as we can see from the variables factor map.
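Contributions like these can also be computed numerically rather than read off the factor map. The sketch below uses base R's `prcomp()` on the built-in `USArrests` data as a stand-in for `df.num` (an assumption; the analysis above uses `PCA()` and its plot). It takes each variable's percentage contribution to a component to be its squared loading times 100.

```r
# Stand-in data: USArrests replaces df.num for illustration only
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

loadings <- pca_toy$rotation  # one unit-length column per component
contrib  <- loadings^2 * 100  # percentage contributions; each column sums to 100

print(round(contrib[, c("PC1", "PC2")], 1))

# Variable with the largest contribution to each of the first two components
top_pc1 <- names(which.max(contrib[, "PC1"]))
top_pc2 <- names(which.max(contrib[, "PC2"]))
print(c(PC1 = top_pc1, PC2 = top_pc2))
```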
When we attempt a clustering, we want to group the observations with the closest characteristics into the same cluster, while keeping the clusters as far apart from each other as possible.
This works because k-means compares the positions of the observations based on their numeric variables and then groups observations that lie near each other into a cluster. As a result, the members of any one cluster should have fairly similar values for their numerical variables.
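To make the idea of grouping nearby observations concrete, here is a minimal sketch of one k-means iteration done by hand: assign each point to its nearest centre, then move each centre to the mean of its points. The four points and the two starting centres are made up for illustration. R's `kmeans()` uses the Hartigan-Wong algorithm by default, so this shows the principle rather than its exact internals.

```r
# Hypothetical 2-D observations and starting centres
points  <- rbind(c(1, 1), c(1.2, 0.8), c(5, 5), c(5.1, 4.9))
centers <- rbind(c(0, 0), c(6, 6))

# Squared Euclidean distance from every point (rows) to every centre (columns)
dist2 <- sapply(seq_len(nrow(centers)), function(k) {
  rowSums(sweep(points, 2, centers[k, ])^2)
})

# Assignment step: each point joins its nearest centre
assignment <- apply(dist2, 1, which.min)

# Update step: each centre moves to the mean of its assigned points
new_centers <- apply(points, 2, function(col) tapply(col, assignment, mean))

print(assignment)
print(new_centers)
```

Repeating these two steps until the assignments stop changing gives the basic k-means procedure.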
One insight from this clustering is that when a targeted Airbnb listing is unavailable, we can look at the other listings in its cluster and should readily find one with similar characteristics.
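That lookup can be sketched as follows. The toy `listings` table, its values and the helper `similar_listings()` are hypothetical and only mirror two of the dataset's column names. The helper returns the listings from the target's cluster, ordered by distance to the target in the scaled feature space.

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Hypothetical toy listings; the values are made up for illustration
listings <- data.frame(
  id = as.character(1:12),
  price = c(60, 65, 70, 150, 155, 160, 300, 310, 320, 62, 152, 305),
  number_of_reviews = c(10, 12, 9, 100, 95, 110, 5, 4, 6, 11, 105, 5),
  stringsAsFactors = FALSE
)

features <- scale(listings[, c("price", "number_of_reviews")])
km <- kmeans(features, centers = 3, nstart = 25)
listings$cluster <- km$cluster

# Hypothetical helper: up to n listings from the same cluster, nearest first
similar_listings <- function(target_id, data, features, n = 3) {
  target <- which(data$id == target_id)
  peers  <- which(data$cluster == data$cluster[target] & data$id != target_id)
  diffs  <- sweep(features[peers, , drop = FALSE], 2, features[target, ])
  dist_to_target <- sqrt(rowSums(diffs^2))
  data[head(peers[order(dist_to_target)], n), ]
}

alternatives <- similar_listings("1", listings, features)
print(alternatives)
```

On the real data the same idea would use `df_cluster` and `df_scale` in place of the toy objects, assuming their rows are still in the same order.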