Exploratory Data Analysis and Statitical inferences

Setup

Load packages

library(ggplot2)   
library(dplyr)
library(readr)     
library(statsr)
library(gridExtra) 
library(knitr)
library(ggpubr)

Load data

# Load the datasets for individual cities
ams_wd <- read_csv("amsterdam_weekdays.csv"); ath_wd <- read_csv("athens_weekdays.csv");      
bcn_wd <- read_csv("barcelona_weekdays.csv"); ber_wd <- read_csv("berlin_weekdays.csv");      
bud_wd <- read_csv("budapest_weekdays.csv");  lis_wd <- read_csv("lisbon_weekdays.csv");      
lon_wd <- read_csv("london_weekdays.csv");    par_wd <- read_csv("paris_weekdays.csv");       
rom_wd <- read_csv("rome_weekdays.csv");      vie_wd <- read_csv("vienna_weekdays.csv")      

# Label the datasets with city name
ams_wd$city <- c(rep("amsterdam", times = nrow(ams_wd))); ath_wd$city <- c(rep("athens", times = nrow(ath_wd)));       
bcn_wd$city <- c(rep("barcelona", times = nrow(bcn_wd))); ber_wd$city <- c(rep("berlin", times = nrow(ber_wd)));
bud_wd$city <- c(rep("budapest", times = nrow(bud_wd)));  lis_wd$city <- c(rep("lisbon", times = nrow(lis_wd)));
lon_wd$city <- c(rep("london", times = nrow(lon_wd)));    par_wd$city <- c(rep("paris", times = nrow(par_wd)));
rom_wd$city <- c(rep("rome", times = nrow(rom_wd)));      vie_wd$city <- c(rep("vienna", times = nrow(vie_wd)))

# Combine all datasets and name it eu_cities_wd
eu_cities_wd <- rbind(ams_wd,ath_wd,bcn_wd,ber_wd,bud_wd,lis_wd,lon_wd,par_wd,rom_wd,vie_wd)

Part 1: Explanation of the Data and Method

Data

eu_cities_wd is a dataset that combines information about Airbnb prices from 10 cities during weekdays. It has 25500 rows and 21 columns, some of them describing important factors, such as: city name, distance from city centre, price for a 2 person accomodation for 2 nights, etc.

Method

The dataset is part of the research project titled: Determinants of Airbnb prices in European cities: A spatial econometrics approach (link to research paper, link to the source dataset), which is supported by the National Science Centre, Poland, with project number 2017/27/N/HS4/00951.

From the source page, there are 20 datasets covering 10 cities where 1 city has 2 datasets - one for weekdays and weekends. To specify our work and ease analysis, we have combined 10 weekdays datasets from 10 cities and name it eu_cities_wd. We also add one more variable to eu_cities_wd, named city, to ensure that the combined dataset remains distinct. Since the number of observations vary among cities, this will ensure that we are able to obtain accurate results when it is group under the city category.

Part 2: Research questions

Research question 1: What is the relationship between price and distance from city center across different European cities?

This question seeks to explore how the relationship between price, and distance from city centre differ across differenct European cities. It is also to explore the distribution of airbnb rooms across European cities.

Research question 2: How does the number of bedrooms and cleanliness rating affect overall satisfaction?

This question explores the general customer preference during weekdays as measured by the choice of the number of bedrooms and cleanliness rating. It also seeks to answer which of the factors would determine overall satisfaction.

Research question 3: If the budget is less than $250, which city is the most affordable to go based on the best satisfaction rating?

This question comes to our interest as we want to find the most affordable city in Europe for a vacation to stay in. We restrict the range of hotels price to be at most $250 as that would be the maximum budget. We are also curious if we have any other options for a hotel with affordable price across the Europe cities.

Part 3: Exploratory data analysis or Statistical inferences

NOTE: Insert code chunks as needed by clicking on the “Insert a new code chunk” button (green button with orange arrow) above. Make sure that your code is visible in the project you submit. Delete this note when before you submit your work.

Explain each finding.

Research question 1: What is the relationship between price and distance from city center across different European cities?

eu_cities_wd_medians <- eu_cities_wd %>% group_by(dist, city) %>%
                        summarize(median_price = median(realSum))

ggplot(eu_cities_wd_medians, aes(x = dist, y = median_price, color = city)) + 
geom_point() + geom_smooth(method = "lm", color = "black") +
  labs(x = "Distance from City Center (km)", y = "Median Airbnb Price (EUR)", 
       title = "Price and Distance from City Center") + theme_bw()

model <- lm(realSum ~ dist * city, data = eu_cities_wd)
anova(model)

## Analysis of Variance Table
## 
## Response: realSum
##              Df     Sum Sq  Mean Sq F value    Pr(>F)    
## dist          1    4745755  4745755  42.108 8.795e-11 ***
## city          9  301751167 33527907 297.487 < 2.2e-16 ***
## dist:city     9   22292412  2476935  21.977 < 2.2e-16 ***
## Residuals 25480 2871693843   112704                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

intervals <- (cut(eu_cities_wd$realSum, breaks = seq(0, 30000, by = 500)))
relevant_intervals <- as.data.frame(table(intervals))
relevant_intervals$percentage <- (relevant_intervals$Freq/sum(relevant_intervals$Freq))*100
head(relevant_intervals)

##         intervals  Freq  percentage
## 1         (0,500] 23066 90.45490196
## 2     (500,1e+03]  2080  8.15686275
## 3 (1e+03,1.5e+03]   230  0.90196078
## 4 (1.5e+03,2e+03]    60  0.23529412
## 5 (2e+03,2.5e+03]    25  0.09803922
## 6 (2.5e+03,3e+03]    12  0.04705882

filtered_cities <- eu_cities_wd %>% filter(realSum <= 1000)
ggplot(filtered_cities, aes(x = dist, y = realSum, color = city)) + 
geom_point() + geom_smooth(method = "lm", color = "black") + 
  labs(x = "Distance from City Center (km)", y = "Median Airbnb Price (EUR)", 
       title = "Price and Distance from City Center")  +
  theme_bw() + facet_wrap(~city, ncol = 2)

We answer this question by summarizing and visualising the general relationship between the room price and distance from city centre, where the relationship is visualised by the black-coloured regression line. Notice the presence of outliers with prices above €2500 limits us from properly describing the relationship using the regression line. We solve this by creating a new object model which relates the room price and distance form city centre, denoted realSum and dist respectively. Then, an ANOVA test is conducted on model, where

$H_{0}$: there is no significant difference of means of price based on distance form city centre across different cities
$H_{\alpha}$: there is a significant difference of means of price based on distance form city centre across different cities

Note that the $p = 2.2\times10^{-16} < 0.05$, therefore we reject $H_{0}$. To highlight the relationship across different cities, we filter eu_cities_wd from outliers by creating relevant_intervals, which determines the intervals of realSum are the most representative of the population. Since the intervals $[0,500] \cup [500, 1000]$ contains more than 98% of the room prices, we filter realSum to room prices less or equal than €1000 and plot the relationship across individual cities. We see there is a negative correlation between room price and distance from city centre, and this correlation applies to all cities, with varying levels of significance.

Research question 2: How does the number of bedrooms and cleanliness rating affect overall satisfaction?

eu_cities_wd %>% group_by(guest_satisfaction_overall) %>% 
  summarise(mean_bedroom = mean(bedrooms),
            mean_cleanliness = mean(cleanliness_rating)) %>% 
  arrange(-guest_satisfaction_overall)

## # A tibble: 51 × 3
##    guest_satisfaction_overall mean_bedroom mean_cleanliness
##                         <dbl>        <dbl>            <dbl>
##  1                        100         1.21             9.85
##  2                         99         1.20             9.97
##  3                         98         1.21             9.91
##  4                         97         1.17             9.83
##  5                         96         1.17             9.74
##  6                         95         1.17             9.65
##  7                         94         1.15             9.52
##  8                         93         1.15             9.43
##  9                         92         1.14             9.31
## 10                         91         1.14             9.24
## # ℹ 41 more rows

clean_model <- lm(guest_satisfaction_overall ~ cleanliness_rating, data = eu_cities_wd)
bedroom_model <- lm(guest_satisfaction_overall ~ bedrooms, data = eu_cities_wd)
summary(clean_model)$r.squared;summary(bedroom_model)$r.squared

## [1] 0.5114414

## [1] 0.001842468

lm_model <- lm(guest_satisfaction_overall ~ cleanliness_rating + bedrooms, data = eu_cities_wd)
summary(lm_model); anova(lm_model)

## 
## Call:
## lm(formula = guest_satisfaction_overall ~ cleanliness_rating + 
##     bedrooms, data = eu_cities_wd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.944  -2.217   1.076   3.368  37.029 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        29.24429    0.39238  74.531  < 2e-16 ***
## cleanliness_rating  6.70763    0.04109 163.239  < 2e-16 ***
## bedrooms            0.31154    0.06184   5.038 4.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.263 on 25497 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5119 
## F-statistic: 1.337e+04 on 2 and 25497 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: guest_satisfaction_overall
##                       Df  Sum Sq Mean Sq   F value    Pr(>F)    
## cleanliness_rating     1 1048039 1048039 26717.791 < 2.2e-16 ***
## bedrooms               1     996     996    25.384  4.73e-07 ***
## Residuals          25497 1000152      39                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a <- ggplot(eu_cities_wd, aes(x = cleanliness_rating, y = guest_satisfaction_overall)) + 
  geom_point(color="red") + geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(x = "Cleanliness Rating", y = "Overall Satisfaction") +
  ggtitle("Cleanliness Rating and Overall Satisfaction") +
  theme_bw()
b <- ggplot(eu_cities_wd, aes(x = bedrooms, y = guest_satisfaction_overall)) + 
  geom_point(color="blue") + geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(x = "No. of Bedrooms", y = "Overall Satisfaction") +
  ggtitle("No. of Bedrooms and Overall Satisfaction") +
  theme_bw()
grid.arrange(a,b, ncol = 2)

We begin by finding the summary statistics for eu_cities_wd, grouped by guess_satisfaction_overall, taking into consideration the columns bedrooms and cleanliness_rating as our variables, and arranging guess_satisfaction_overall in a descending order. Now, we wanted to answer the question even further: Between the columns bedrooms and cleanliness_rating, which is the stronger predictor in determining guess_satisfaction_overall?

Our summary statistics didn’t tell us much, since the trends are mostly the same on all categories. Therefore, we begin by employing two methods to answer this question. Firstly, we construct two linear models clean_model and bedroom_model, where each uses cleanliness_rating and bedrooms as predictors respectively. We then compare the R-squared values of both of these models, where a higher R-squared value implies a stronger predictor. Secondly, we approach this question through hypothesis testing. A linear model lm_model is created which relates guess satisfaction to both cleanliness rating and bedrooms. Then, an ANOVA test is conducted to this model with the following null and alternative hypotheses:

$H_{0}$: there is no significant difference of overall guess satisfaction between different cleanliness ratings and no.of bedrooms. In other words, cleanliness ratings and no.of bedrooms have the same predictive power in determining overall guess satisfaction.
$H_{\alpha}$: there is a significant difference of overall guess satisfaction between different cleanliness ratings and no.of bedrooms. In other words, either cleanliness ratings or no.of bedrooms have the stronger predictive power in determining overall guess satisfaction.

From comparing the our R-Squared values of both models and the p-value of our ANOVA test, we see that $R_{\text{clean_model}}^{2} = 0.5114 > 0.0018 = R_{\text{bedroom_model}}^{2}$. We also see that $p_{\text{ cleanliness_rating}} = 2.2\times10^{-16} < 4.7\times10^{-07} = p_{\text{ bedrooms}}$, therefore we reject $H_{0}$. In conlusion, cleanliness ratings have the stronger influence in determining overall guest satisfaction. We end this question with a visualisation of two plots which individually relates overall guess satisfaction with cleanliness rating and no.of bedrooms side-by-side.

Research question 3: If the budget is less than €250, which city is the most affordable to go based on the best satisfaction rating?

eu_cities_wd %>%  group_by(city) %>% 
  summarise(median_ratings = median(guest_satisfaction_overall),
            mean_price = mean(realSum), median_price = median(realSum),
            n = n(), percentage = n/nrow(eu_cities_wd)*100) %>% 
  arrange(desc(median_ratings), median_price)

## # A tibble: 10 × 6
##    city      median_ratings mean_price median_price     n percentage
##    <chr>              <dbl>      <dbl>        <dbl> <int>      <dbl>
##  1 athens                97       156.         128.  2653      10.4 
##  2 budapest              96       168.         147.  2074       8.13
##  3 berlin                96       240.         188.  1284       5.04
##  4 vienna                96       240.         205.  1738       6.82
##  5 amsterdam             96       545.         430.  1103       4.33
##  6 rome                  95       202.         180.  4492      17.6 
##  7 paris                 94       399.         319.  3130      12.3 
##  8 barcelona             93       288.         209.  1555       6.10
##  9 lisbon                93       236.         223.  2857      11.2 
## 10 london                93       360.         256.  4614      18.1

cheapest_cities <- rbind(ath_wd, bud_wd, ber_wd, vie_wd) %>% filter(realSum < 250) 
cheapest_cities$city <- as.factor(cheapest_cities$city)
cheapest_cities %>% group_by(city) %>% 
  summarise(mean_price = mean(realSum),median_price = median(realSum),
            sd = sd(realSum), n=n(), percentage = n/nrow(cheapest_cities)*100)

## # A tibble: 4 × 6
##   city     mean_price median_price    sd     n percentage
##   <fct>         <dbl>        <dbl> <dbl> <int>      <dbl>
## 1 athens         130.         124.  43.5  2421       37.7
## 2 berlin         163.         158.  41.4   892       13.9
## 3 budapest       146.         143.  41.5  1884       29.3
## 4 vienna         177.         180.  42.0  1226       19.1

a <- ggplot(data = cheapest_cities, aes(x = realSum, fill = city)) +
  geom_histogram(binwidth = 5) + 
  labs(title = "Distribution of Airbnb prices against the no.of rooms in 4 highest-rated cities",
       y = "No.of rooms", x = "Price")

b <- ggplot(cheapest_cities, aes(x = city, y = realSum, fill = city)) + geom_boxplot() + 
     labs(title = "Distribution of Airbnb prices 4 highest-rated cities", 
          y="Room Price for 2 people for 2 nights (EUR)",x="City") + 
  scale_color_brewer(type = "qual", palette = 2) + theme_minimal() + theme(legend.position = "none") + 
  annotate("text", x = 4, y = 55, size = 4,
           label = "H_0: group means are the same\nH_a: group means are different") +
  annotate("text", x = 4, y = 35, size = 4, label = "p = 2.2e-16 < 0.05, reject H_0") 

grid.arrange(a, b, nrow = 2)

model <- lm(realSum ~ city, data = cheapest_cities)
anova(model)

## Analysis of Variance Table
## 
## Response: realSum
##             Df   Sum Sq Mean Sq F value    Pr(>F)    
## city         3  2031454  677151   377.8 < 2.2e-16 ***
## Residuals 6419 11505201    1792                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We begin by finding the summary statistics of eu_cities_wd based on the satisfaction rate and median prices. Then a new dataset cheapest_cities is created where the data is filtered by the conditions we set on our question we found on our summary. From here, we use the insights gained to visualise our results. In particular, we visualize the relationship from two angles - the histogram describes the relationship of price and the number of rooms available, while the boxplot describes the price distribution in 4 different cities.

To confirm our results, we conducted an ANOVA test to determine whether there are differences in mean prices among different cities, where

$H_{0}$: there is no significant difference of mean prices across different cities
$H_{\alpha}$: there is a significant difference of mean prices across different cities

Note that the $p = 2.2\times10^{-16} < 0.05$, therefore we reject $H_{0}$. From these summary statistics and visualisations, we conclude that Athens, Budapest, Berlin, and Vienna are the best cities to go in terms of price and guest satisfaction.