Question of Interest

What are the most important factors that determine the price of an Airbnb?

Load the required library

The only library you need to load is ‘tidyverse’. If you are running this code on your own computer, you will have to install it first; if you are running it in a lab, it is already installed.

Load up the library as follows:

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The data are on Canvas. Download the airbnb.csv file and put it in the same folder as this Assignment 1.Rmd file. The data will then load using

airbnb <- read.csv("airbnb.csv")

[If you have any difficulty loading the data, you can import it manually in RStudio: click File > Import Dataset > From Text (readr) and find the file. RStudio will also show the code required to read the file, which you can paste above so that your file knits.]
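After loading, it can help to confirm that each column was read with the type you expect and to count missing values. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (the three rows are hypothetical, not taken from airbnb.csv) so that it runs anywhere. With the real data you would call read.csv("airbnb.csv") instead.

```r
# Hypothetical stand-in file so the sketch is self-contained.
# These rows are invented for illustration, not real listings.
tmp <- tempfile(fileext = ".csv")
writeLines(c("price,bedrooms,room_type",
             "85,1,Private room",
             "140,2,Entire home/apt",
             ",5+,Entire home/apt"), tmp)

toy <- read.csv(tmp)

# Column types: price is numeric, while bedrooms is read as character
# because the value "5+" cannot be parsed as a number.
str(toy)

# Number of missing values per column (the empty price becomes NA).
colSums(is.na(toy))
```

A value such as "5+" forcing the whole column to character would explain why bedrooms appears as <chr> in the grouped summary further down.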

Univariate summary

The key variable of interest, price (the price per night in euros), is a quantitative variable.

  1. Describe the nightly price using the mean, median, and standard deviation.
airbnb |>
  summarise(mean = mean(price, na.rm=TRUE),
          median = median(price, na.rm=TRUE),
          sd = sd(price, na.rm=TRUE))
##       mean median      sd
## 1 111.1884     85 97.6163
  1. Describe the nightly price using a boxplot
ggplot(airbnb, aes(x=price)) + geom_boxplot()

  1. Interpret the summary statistics and plot

The distribution of price has a severe right skew: there are many outliers well above the median, and the mean price (111.19) is above the median price (85).
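The two signs of right skew used here (mean above median, outliers on the high side) can also be checked numerically. geom_boxplot() draws points more than 1.5 times the interquartile range beyond the box as outliers, so a rough count is possible with quantile(). The sketch below uses a small made-up price vector (hypothetical values, not from airbnb.csv); the quartiles from quantile() may differ slightly from the hinges the boxplot draws.

```r
# Hypothetical prices for illustration only.
prices <- c(40, 55, 65, 75, 85, 95, 120, 150, 400, 900)

# Quartiles and the upper fence at Q3 + 1.5 * IQR.
q1 <- unname(quantile(prices, 0.25))
q3 <- unname(quantile(prices, 0.75))
upper_fence <- q3 + 1.5 * (q3 - q1)

# Values beyond the upper fence are the high-side outliers.
outliers <- prices[prices > upper_fence]

# A mean above the median is consistent with right skew.
skewed_right <- mean(prices) > median(prices)

outliers
skewed_right
```

For the real data, the same steps would use airbnb$price in place of prices, with na.rm = TRUE added to quantile(), mean() and median().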

Bivariate summaries

Time to create some bivariate plots and tables, where we investigate the relationship between several variables and price.

  1. Create a boxplot of price by number of bedrooms
airbnb |>
  ggplot(aes(x=bedrooms, y=price, fill=bedrooms)) + geom_boxplot()

  1. Calculate summary statistics of price (mean, median, standard deviation) separately for each number of bedrooms
airbnb |>
  group_by(bedrooms)|>
  summarise(mean = mean(price),
            median = median(price),
            sd = sd(price))
## # A tibble: 6 × 4
##   bedrooms  mean median    sd
##   <chr>    <dbl>  <dbl> <dbl>
## 1 0         82.4     75  50.5
## 2 1         76.3     65  57.6
## 3 2        157.     140  93.3
## 4 3        207.     180 121. 
## 5 4        248.     200 163. 
## 6 5+       333.     250 233.
  1. Produce a scatterplot of price by number_of_reviews
airbnb |>
  ggplot(aes(x=number_of_reviews, y=price)) + geom_point()

  1. Interpret your numerical and graphical summaries of price by bedroom and number of reviews

In general, the more bedrooms a property has, the higher its price. The exception is properties with 0 bedrooms, which have a higher mean price (82.4) than properties with one bedroom (76.3). The number of reviews does not seem to affect price much. Many of the higher-priced properties have few reviews, but this is likely because the high price makes them less popular: cheaper properties get more bookings, which leads to more reviews.
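One way to put a number on the price and reviews relationship is a rank (Spearman) correlation, which is less affected by the extreme prices than the default Pearson correlation. This is a sketch on made-up values (hypothetical, not from airbnb.csv), offered as one possible check rather than a required part of the analysis.

```r
# Hypothetical listings for illustration only.
toy <- data.frame(
  price             = c(40, 60, 80, 100, 150, 300),
  number_of_reviews = c(120, 30, 90, 45, 20, 5)
)

# Spearman correlation works on ranks, so a few very
# expensive listings do not dominate the result.
rho <- cor(toy$price, toy$number_of_reviews, method = "spearman")
rho

# With the real data, missing values would need handling, e.g.
# cor(airbnb$price, airbnb$number_of_reviews,
#     method = "spearman", use = "complete.obs")
```

A value near 0 would support the reading that reviews have little association with price; a clearly negative value would support the idea that cheaper listings collect more reviews.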

Your investigation

  1. Now it’s up to you! Produce at least two more plots and two more tables to investigate other variables in the dataset, making sure to add your interpretation
airbnb |>
  group_by(room_type) |>
  summarise(mean = mean(price),
            median = median(price),
            sd = sd(price))
## # A tibble: 3 × 4
##   room_type        mean median    sd
##   <chr>           <dbl>  <dbl> <dbl>
## 1 Entire home/apt 157.     127 109. 
## 2 Private room     65.6     55  53.2
## 3 Shared room      44.1     35  41.9
airbnb |>
  group_by(host_identity_verified) |>
  summarise(mean = mean(price),
            median = median(price),
            sd = sd(price))
## # A tibble: 2 × 4
##   host_identity_verified  mean median    sd
##   <chr>                  <dbl>  <dbl> <dbl>
## 1 f                       108.     80  98.3
## 2 t                       116.     90  96.3
# group_by() is not needed here: ggplot groups the boxes by the x variable
airbnb |>
  ggplot(aes(x=factor(review_scores_cleanliness), y=price, fill=review_scores_cleanliness)) + geom_boxplot()

# group_by() is not needed here: ggplot groups the boxes by the x variable
airbnb |>
  ggplot(aes(x=factor(review_scores_location), y=price, fill=review_scores_location)) + geom_boxplot()

Conclusion

  1. Write a short conclusion to say which variable(s) seem to be the most important factors that determine the price of an Airbnb

Number of bedrooms and room type seemed to be the most important variables. The next most important seemed to be the review scores for cleanliness and location. For both of these variables the median prices were similar across scores, but the number of outliers with much higher prices increased with higher scores.
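A raw count of outliers can rise simply because more listings receive the higher scores, so it may be worth comparing the share of outliers in each group as well. The sketch below shows one way to do that with dplyr, using the same 1.5 times IQR rule as the boxplot. The data frame toy and its values are hypothetical, invented for illustration, not taken from airbnb.csv.

```r
library(dplyr)

# Hypothetical scores and prices for illustration only.
toy <- data.frame(
  score = c(8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10),
  price = c(50, 60, 70, 300, 50, 55, 60, 65, 70, 75, 320, 350)
)

outlier_summary <- toy |>
  group_by(score) |>
  summarise(
    n = n(),                                          # listings per score
    upper_fence = unname(quantile(price, 0.75)) + 1.5 * IQR(price),
    n_outliers = sum(price > upper_fence),            # raw outlier count
    share_outliers = n_outliers / n                   # outliers relative to group size
  )

outlier_summary
```

In this toy example the higher score has twice as many outliers but the same share of outliers, because it also has twice as many listings. Running the same summary on airbnb, grouped by review_scores_cleanliness or review_scores_location, would show whether the pattern in the boxplots survives that adjustment.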