What are the most important factors that determine the price of an Airbnb?
The only library you need to load is ‘tidyverse’. If you are running this code on your own computer, you will have to install it first; if you are running it in a lab, it is already installed.
Load up the library as follows:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data are on Canvas. Download the airbnb.csv file and put it in the same folder where this Assignment 1.Rmd file is saved. The data will then load using
airbnb <- read.csv("airbnb.csv")
[If you have any difficulty loading the data, you can import it manually in RStudio: click File > Import Dataset > From Text (readr) and find the file. RStudio will also show the code required to read the file, which you can paste above so that your file knits.]
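When `read.csv()` fails with a "cannot open file" error, the usual cause is that R's working directory is not the folder containing the file. The sketch below wraps the read in a check that reports where R actually looked. `load_airbnb` is a hypothetical helper name introduced for illustration; it is not part of the assignment.

```r
# Hypothetical helper (not part of the assignment): read the CSV, or stop with
# a message showing the directory R searched.
load_airbnb <- function(path = "airbnb.csv") {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' in working directory: ", getwd(),
         call. = FALSE)
  }
  read.csv(path)
}

# Usage, assuming airbnb.csv sits in the same folder as the .Rmd file:
# airbnb <- load_airbnb()
```

When the document is knitted, the working directory defaults to the folder of the .Rmd file, which is why keeping the two files together works.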
The key variable of interest, price (the price per night in euros), is a quantitative variable.
airbnb |>
summarise(mean = mean(price, na.rm=TRUE),
median = median(price, na.rm=TRUE),
sd = sd(price, na.rm=TRUE))
## mean median sd
## 1 111.1884 85 97.6163
ggplot(airbnb, aes(x=price)) + geom_boxplot()
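The mean price (111.19) sits well above the median (85), which is consistent with a right-skewed distribution. A histogram on a log scale can make the shape easier to see than a boxplot alone. The sketch below uses made-up prices generated with `rlnorm()`, not the Airbnb data; the data frame name `toy` and the choice of 30 bins are assumptions for illustration.

```r
library(ggplot2)

# Made-up, right-skewed prices for illustration only (NOT the Airbnb data)
set.seed(1)
toy <- data.frame(price = round(rlnorm(500, meanlog = 4.4, sdlog = 0.6)))

# A log10 x-axis spreads out the bulk of the listings that the long right
# tail compresses on the original scale
p <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
p
```

To try this on the real data, replace `toy` with `airbnb`. Note that a log scale cannot display prices of zero, so any such rows would be dropped with a warning.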
Time to create some bivariate plots and tables, where we investigate the relationship between several variables and price.
airbnb |>
ggplot(aes(x=bedrooms, y=price, fill=bedrooms)) + geom_boxplot()
airbnb |>
group_by(bedrooms)|>
summarise(mean = mean(price),
median = median(price),
sd = sd(price))
## # A tibble: 6 × 4
## bedrooms mean median sd
## <chr> <dbl> <dbl> <dbl>
## 1 0 82.4 75 50.5
## 2 1 76.3 65 57.6
## 3 2 157. 140 93.3
## 4 3 207. 180 121.
## 5 4 248. 200 163.
## 6 5+ 333. 250 233.
price by number_of_reviews
airbnb |>
ggplot(aes(x=price, y=number_of_reviews)) + geom_point()
In general, the more bedrooms a property has, the higher the price charged. The exception is that properties with 0 bedrooms have a higher average price than properties with one bedroom (mean 82.4 vs 76.3 euros). The number of reviews does not seem to affect the price much. Many of the higher-priced properties have few reviews, but this is likely because the high price makes them less popular: cheaper rooms get more bookings, which leads to more reviews.
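One way to put a number on the relationship between price and number of reviews is a rank correlation. The sketch below uses a small made-up data frame, not the Airbnb data, and chooses Spearman's method as an assumption because it is less affected by a few very expensive listings.

```r
# Made-up values for illustration only (NOT the Airbnb data)
toy <- data.frame(
  price             = c(40, 55, 70, 85, 120, 200, 350, NA),
  number_of_reviews = c(150, 90, 60, 80, 20, 10, 2, 30)
)

# Spearman's rank correlation is less sensitive to extreme prices than
# Pearson's; use = "complete.obs" drops rows with a missing value
rho <- cor(toy$price, toy$number_of_reviews,
           method = "spearman", use = "complete.obs")
rho
```

On the real data the equivalent call would be `cor(airbnb$price, airbnb$number_of_reviews, method = "spearman", use = "complete.obs")`. A value near 0 would support the reading that reviews have little association with price, while a clearly negative value would support the reading that cheaper listings collect more reviews.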
airbnb |>
group_by(room_type) |>
summarise(mean = mean(price),
median = median(price),
sd = sd(price))
## # A tibble: 3 × 4
## room_type mean median sd
## <chr> <dbl> <dbl> <dbl>
## 1 Entire home/apt 157. 127 109.
## 2 Private room 65.6 55 53.2
## 3 Shared room 44.1 35 41.9
airbnb |>
group_by(host_identity_verified) |>
summarise(mean = mean(price),
median = median(price),
sd = sd(price))
## # A tibble: 2 × 4
## host_identity_verified mean median sd
## <chr> <dbl> <dbl> <dbl>
## 1 f 108. 80 98.3
## 2 t 116. 90 96.3
airbnb |>
# fill must be discrete to colour each box, so convert the score to a factor
ggplot(aes(x=factor(review_scores_cleanliness), y=price, fill=factor(review_scores_cleanliness))) + geom_boxplot()
airbnb |>
# fill must be discrete to colour each box, so convert the score to a factor
ggplot(aes(x=factor(review_scores_location), y=price, fill=factor(review_scores_location))) + geom_boxplot()
Number of bedrooms and room type seemed to be the most important variables. The next most important seemed to be the review scores for cleanliness and location. For both of these variables the medians were similar across scores, but the number of outliers with much higher prices increased with higher scores.
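The claim about outliers could be checked by counting them per score rather than judging by eye. The sketch below counts prices above the upper fence (Q3 + 1.5 × IQR), which is the default rule `geom_boxplot()` uses to draw points beyond the whiskers. The data are made up, not the Airbnb data, and the column name `score` stands in for a review score variable.

```r
library(dplyr)

# Made-up values for illustration only (NOT the Airbnb data)
toy <- tibble(
  score = c(8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10),
  price = c(50, 55, 60, 65, 70, 50, 55, 60, 65, 70, 400)
)

# Within each score, count prices above Q3 + 1.5 * IQR (the upper fence)
outlier_counts <- toy |>
  group_by(score) |>
  summarise(
    n = n(),
    upper_fence = quantile(price, 0.75, names = FALSE) + 1.5 * IQR(price),
    n_high_outliers = sum(price > upper_fence)
  )
outlier_counts
```

On the real data, grouping by `review_scores_cleanliness` or `review_scores_location` would give the corresponding counts. Comparing `n_high_outliers` with `n` matters here, because a score with more listings will tend to show more outliers even if the proportion is unchanged.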