library(ggplot2)
library(dplyr)
library(readr)
library(statsr)
library(gridExtra)
library(knitr)
library(ggpubr)# Load the datasets for individual cities
ams_wd <- read_csv("amsterdam_weekdays.csv"); ath_wd <- read_csv("athens_weekdays.csv");
bcn_wd <- read_csv("barcelona_weekdays.csv"); ber_wd <- read_csv("berlin_weekdays.csv");
bud_wd <- read_csv("budapest_weekdays.csv"); lis_wd <- read_csv("lisbon_weekdays.csv");
lon_wd <- read_csv("london_weekdays.csv"); par_wd <- read_csv("paris_weekdays.csv");
rom_wd <- read_csv("rome_weekdays.csv"); vie_wd <- read_csv("vienna_weekdays.csv")
# Label the datasets with city name
ams_wd$city <- c(rep("amsterdam", times = nrow(ams_wd))); ath_wd$city <- c(rep("athens", times = nrow(ath_wd)));
bcn_wd$city <- c(rep("barcelona", times = nrow(bcn_wd))); ber_wd$city <- c(rep("berlin", times = nrow(ber_wd)));
bud_wd$city <- c(rep("budapest", times = nrow(bud_wd))); lis_wd$city <- c(rep("lisbon", times = nrow(lis_wd)));
lon_wd$city <- c(rep("london", times = nrow(lon_wd))); par_wd$city <- c(rep("paris", times = nrow(par_wd)));
rom_wd$city <- c(rep("rome", times = nrow(rom_wd))); vie_wd$city <- c(rep("vienna", times = nrow(vie_wd)))
# Combine all datasets and name it eu_cities_wd
eu_cities_wd <- rbind(ams_wd,ath_wd,bcn_wd,ber_wd,bud_wd,lis_wd,lon_wd,par_wd,rom_wd,vie_wd) Data
eu_cities_wd is a dataset that combines information
about Airbnb prices from 10 cities during weekdays. It has 25500 rows
and 21 columns, some of them describing important factors, such as: city
name, distance from city centre, price for a 2 person accomodation for 2
nights, etc.
Method
The dataset is part of the research project titled: Determinants of Airbnb prices in European cities: A spatial econometrics approach (link to research paper, link to the source dataset), which is supported by the National Science Centre, Poland, with project number 2017/27/N/HS4/00951.
From the source page, there are 20 datasets covering 10 cities where
1 city has 2 datasets - one for weekdays and
weekends. To specify our work and ease analysis, we
have combined 10 weekdays datasets from 10 cities and
name it eu_cities_wd. We also add one more variable to
eu_cities_wd, named city, to ensure that the
combined dataset remains distinct. Since the number of observations vary
among cities, this will ensure that we are able to obtain accurate
results when it is group under the city category.
Research question 1: What is the relationship between price and distance from city center across different European cities?
This question seeks to explore how the relationship between price, and distance from city centre differ across differenct European cities. It is also to explore the distribution of airbnb rooms across European cities.
Research question 2: How does the number of bedrooms and cleanliness rating affect overall satisfaction?
This question explores the general customer preference during weekdays as measured by the choice of the number of bedrooms and cleanliness rating. It also seeks to answer which of the factors would determine overall satisfaction.
Research question 3: If the budget is less than $250, which city is the most affordable to go based on the best satisfaction rating?
This question comes to our interest as we want to find the most affordable city in Europe for a vacation to stay in. We restrict the range of hotels price to be at most $250 as that would be the maximum budget. We are also curious if we have any other options for a hotel with affordable price across the Europe cities.
NOTE: Insert code chunks as needed by clicking on the “Insert a new code chunk” button (green button with orange arrow) above. Make sure that your code is visible in the project you submit. Delete this note when before you submit your work.
Explain each finding.
eu_cities_wd_medians <- eu_cities_wd %>% group_by(dist, city) %>%
summarize(median_price = median(realSum))
ggplot(eu_cities_wd_medians, aes(x = dist, y = median_price, color = city)) +
geom_point() + geom_smooth(method = "lm", color = "black") +
labs(x = "Distance from City Center (km)", y = "Median Airbnb Price (EUR)",
title = "Price and Distance from City Center") + theme_bw()model <- lm(realSum ~ dist * city, data = eu_cities_wd)
anova(model)## Analysis of Variance Table
##
## Response: realSum
## Df Sum Sq Mean Sq F value Pr(>F)
## dist 1 4745755 4745755 42.108 8.795e-11 ***
## city 9 301751167 33527907 297.487 < 2.2e-16 ***
## dist:city 9 22292412 2476935 21.977 < 2.2e-16 ***
## Residuals 25480 2871693843 112704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
intervals <- (cut(eu_cities_wd$realSum, breaks = seq(0, 30000, by = 500)))
relevant_intervals <- as.data.frame(table(intervals))
relevant_intervals$percentage <- (relevant_intervals$Freq/sum(relevant_intervals$Freq))*100
head(relevant_intervals)## intervals Freq percentage
## 1 (0,500] 23066 90.45490196
## 2 (500,1e+03] 2080 8.15686275
## 3 (1e+03,1.5e+03] 230 0.90196078
## 4 (1.5e+03,2e+03] 60 0.23529412
## 5 (2e+03,2.5e+03] 25 0.09803922
## 6 (2.5e+03,3e+03] 12 0.04705882
filtered_cities <- eu_cities_wd %>% filter(realSum <= 1000)
ggplot(filtered_cities, aes(x = dist, y = realSum, color = city)) +
geom_point() + geom_smooth(method = "lm", color = "black") +
labs(x = "Distance from City Center (km)", y = "Median Airbnb Price (EUR)",
title = "Price and Distance from City Center") +
theme_bw() + facet_wrap(~city, ncol = 2) We answer this question by summarizing and visualising the general
relationship between the room price and distance from city centre, where
the relationship is visualised by the black-coloured regression line.
Notice the presence of outliers with prices above €2500
limits us from properly describing the relationship using the regression
line. We solve this by creating a new object model which
relates the room price and distance form city centre, denoted
realSum and dist respectively. Then, an ANOVA
test is conducted on model, where
Note that the \(p = 2.2\times10^{-16} <
0.05\), therefore we reject \(H_{0}\). To highlight the relationship
across different cities, we filter eu_cities_wd from
outliers by creating relevant_intervals, which determines
the intervals of realSum are the most representative of the
population. Since the intervals \([0,500] \cup
[500, 1000]\) contains more than 98% of the room prices, we
filter realSum to room prices less or equal than
€1000 and plot the relationship across individual cities. We
see there is a negative correlation between room price and distance from
city centre, and this correlation applies to all cities, with varying
levels of significance.
eu_cities_wd %>% group_by(guest_satisfaction_overall) %>%
summarise(mean_bedroom = mean(bedrooms),
mean_cleanliness = mean(cleanliness_rating)) %>%
arrange(-guest_satisfaction_overall)## # A tibble: 51 × 3
## guest_satisfaction_overall mean_bedroom mean_cleanliness
## <dbl> <dbl> <dbl>
## 1 100 1.21 9.85
## 2 99 1.20 9.97
## 3 98 1.21 9.91
## 4 97 1.17 9.83
## 5 96 1.17 9.74
## 6 95 1.17 9.65
## 7 94 1.15 9.52
## 8 93 1.15 9.43
## 9 92 1.14 9.31
## 10 91 1.14 9.24
## # ℹ 41 more rows
clean_model <- lm(guest_satisfaction_overall ~ cleanliness_rating, data = eu_cities_wd)
bedroom_model <- lm(guest_satisfaction_overall ~ bedrooms, data = eu_cities_wd)
summary(clean_model)$r.squared;summary(bedroom_model)$r.squared## [1] 0.5114414
## [1] 0.001842468
lm_model <- lm(guest_satisfaction_overall ~ cleanliness_rating + bedrooms, data = eu_cities_wd)
summary(lm_model); anova(lm_model)##
## Call:
## lm(formula = guest_satisfaction_overall ~ cleanliness_rating +
## bedrooms, data = eu_cities_wd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.944 -2.217 1.076 3.368 37.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.24429 0.39238 74.531 < 2e-16 ***
## cleanliness_rating 6.70763 0.04109 163.239 < 2e-16 ***
## bedrooms 0.31154 0.06184 5.038 4.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.263 on 25497 degrees of freedom
## Multiple R-squared: 0.5119, Adjusted R-squared: 0.5119
## F-statistic: 1.337e+04 on 2 and 25497 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: guest_satisfaction_overall
## Df Sum Sq Mean Sq F value Pr(>F)
## cleanliness_rating 1 1048039 1048039 26717.791 < 2.2e-16 ***
## bedrooms 1 996 996 25.384 4.73e-07 ***
## Residuals 25497 1000152 39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a <- ggplot(eu_cities_wd, aes(x = cleanliness_rating, y = guest_satisfaction_overall)) +
geom_point(color="red") + geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(x = "Cleanliness Rating", y = "Overall Satisfaction") +
ggtitle("Cleanliness Rating and Overall Satisfaction") +
theme_bw()
b <- ggplot(eu_cities_wd, aes(x = bedrooms, y = guest_satisfaction_overall)) +
geom_point(color="blue") + geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(x = "No. of Bedrooms", y = "Overall Satisfaction") +
ggtitle("No. of Bedrooms and Overall Satisfaction") +
theme_bw()
grid.arrange(a,b, ncol = 2)We begin by finding the summary statistics for
eu_cities_wd, grouped by
guess_satisfaction_overall, taking into consideration the
columns bedrooms and cleanliness_rating as our
variables, and arranging guess_satisfaction_overall in a
descending order. Now, we wanted to answer the question even further:
Between the columns bedrooms and
cleanliness_rating, which is the stronger predictor in
determining guess_satisfaction_overall?
Our summary statistics didn’t tell us much, since the trends are
mostly the same on all categories. Therefore, we begin by employing two
methods to answer this question. Firstly, we construct two linear models
clean_model and bedroom_model, where each uses
cleanliness_rating and bedrooms as predictors
respectively. We then compare the R-squared values of both of these
models, where a higher R-squared value implies a stronger predictor.
Secondly, we approach this question through hypothesis testing. A linear
model lm_model is created which relates guess satisfaction
to both cleanliness rating and bedrooms. Then, an ANOVA
test is conducted to this model with the following null and alternative
hypotheses:
From comparing the our R-Squared values of both models and the p-value of our ANOVA test, we see that \(R_{\text{clean_model}}^{2} = 0.5114 > 0.0018 = R_{\text{bedroom_model}}^{2}\). We also see that \(p_{\text{ cleanliness_rating}} = 2.2\times10^{-16} < 4.7\times10^{-07} = p_{\text{ bedrooms}}\), therefore we reject \(H_{0}\). In conlusion, cleanliness ratings have the stronger influence in determining overall guest satisfaction. We end this question with a visualisation of two plots which individually relates overall guess satisfaction with cleanliness rating and no.of bedrooms side-by-side.
eu_cities_wd %>% group_by(city) %>%
summarise(median_ratings = median(guest_satisfaction_overall),
mean_price = mean(realSum), median_price = median(realSum),
n = n(), percentage = n/nrow(eu_cities_wd)*100) %>%
arrange(desc(median_ratings), median_price)## # A tibble: 10 × 6
## city median_ratings mean_price median_price n percentage
## <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 athens 97 156. 128. 2653 10.4
## 2 budapest 96 168. 147. 2074 8.13
## 3 berlin 96 240. 188. 1284 5.04
## 4 vienna 96 240. 205. 1738 6.82
## 5 amsterdam 96 545. 430. 1103 4.33
## 6 rome 95 202. 180. 4492 17.6
## 7 paris 94 399. 319. 3130 12.3
## 8 barcelona 93 288. 209. 1555 6.10
## 9 lisbon 93 236. 223. 2857 11.2
## 10 london 93 360. 256. 4614 18.1
cheapest_cities <- rbind(ath_wd, bud_wd, ber_wd, vie_wd) %>% filter(realSum < 250)
cheapest_cities$city <- as.factor(cheapest_cities$city)
cheapest_cities %>% group_by(city) %>%
summarise(mean_price = mean(realSum),median_price = median(realSum),
sd = sd(realSum), n=n(), percentage = n/nrow(cheapest_cities)*100)## # A tibble: 4 × 6
## city mean_price median_price sd n percentage
## <fct> <dbl> <dbl> <dbl> <int> <dbl>
## 1 athens 130. 124. 43.5 2421 37.7
## 2 berlin 163. 158. 41.4 892 13.9
## 3 budapest 146. 143. 41.5 1884 29.3
## 4 vienna 177. 180. 42.0 1226 19.1
a <- ggplot(data = cheapest_cities, aes(x = realSum, fill = city)) +
geom_histogram(binwidth = 5) +
labs(title = "Distribution of Airbnb prices against the no.of rooms in 4 highest-rated cities",
y = "No.of rooms", x = "Price")
b <- ggplot(cheapest_cities, aes(x = city, y = realSum, fill = city)) + geom_boxplot() +
labs(title = "Distribution of Airbnb prices 4 highest-rated cities",
y="Room Price for 2 people for 2 nights (EUR)",x="City") +
scale_color_brewer(type = "qual", palette = 2) + theme_minimal() + theme(legend.position = "none") +
annotate("text", x = 4, y = 55, size = 4,
label = "H_0: group means are the same\nH_a: group means are different") +
annotate("text", x = 4, y = 35, size = 4, label = "p = 2.2e-16 < 0.05, reject H_0")
grid.arrange(a, b, nrow = 2)model <- lm(realSum ~ city, data = cheapest_cities)
anova(model)## Analysis of Variance Table
##
## Response: realSum
## Df Sum Sq Mean Sq F value Pr(>F)
## city 3 2031454 677151 377.8 < 2.2e-16 ***
## Residuals 6419 11505201 1792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We begin by finding the summary statistics of
eu_cities_wd based on the satisfaction rate and median
prices. Then a new dataset cheapest_cities is created where
the data is filtered by the conditions we set on our question we found
on our summary. From here, we use the insights gained to visualise our
results. In particular, we visualize the relationship from two angles -
the histogram describes the relationship of price and the number of
rooms available, while the boxplot describes the price distribution in 4
different cities.
To confirm our results, we conducted an ANOVA test to determine whether there are differences in mean prices among different cities, where
Note that the \(p = 2.2\times10^{-16} < 0.05\), therefore we reject \(H_{0}\). From these summary statistics and visualisations, we conclude that Athens, Budapest, Berlin, and Vienna are the best cities to go in terms of price and guest satisfaction.