Once upon a time, people traveled all over the world, and some stayed in hotels and others chose to stay in other people’s houses that they booked through Airbnb. Recent developments in Edinburgh regarding the growth of Airbnb and its impact on the housing market mean that a better understanding of Airbnb listings is needed. Using data provided by Airbnb, we can explore how Airbnb availability and prices vary by neighbourhood.
Before we introduce the data, let’s warm up with some simple exercises.
.md file with the same name.
We’ll use the tidyverse package for much of the data wrangling and visualization, and the data lives in the dsbox package. You probably already have tidyverse installed. You will need to install dsbox from GitHub, as the package is not available for the most recent version of R. Once it is installed, you can comment out this line of code (add # in front of it) and simply load the libraries into your environment, if needed.
devtools::install_github("tidyverse/dsbox")
Load the packages by running the following:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(dsbox)
Instructions for code: There are R chunks where you should write your code. For some exercises, you might save your answer as a particular variable. For example, we might give you a code chunk that looks like this:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
# medx, I typed it but wouldn't allow me to knit it
And you might complete it like so:
set.seed(4291)
# insert code here save the median of your simulated data as
# 'medx'
x <- rnorm(1000)
medx <- median(x)
medx
## [1] 0.01612433
It is a good idea to put the variable name at the bottom so it prints (assuming it's not a huge object); usually this is already part of the provided code. It also helps you check your work.
Of note: Sometimes an exercise will ask for code AND pose a question. If the answer to the question is not an output of the code, you must answer it separately in a non-code text box. For example, the problem might ask you to make a plot and describe its prominent features. You would write the code to make the plot, and also write a sentence or two outside of the code block (plain text) to describe the features of the plot.
Submission: You must submit both the PDF and the .Rmd file to your submission folder on Google Drive by the due date and time.
The data can be found in the dsbox package, and it’s
called edibnb. Since the dataset is distributed with the
package, we don’t need to load it separately; it becomes available to us
when we load the package.
You can view the dataset as a spreadsheet using the
View() function. Note that you should not put this function
in your R Markdown document, but instead type it directly in the
Console, as it pops open a new window (and the concept of popping open a
window in a static document doesn’t really make sense…). When you run
this in the console, you’ll see the following data
viewer window pop up.
View(edibnb)
You can find out more about the dataset by inspecting its
documentation, which you can access by running ?edibnb in
the Console or using the Help menu in RStudio to search for
edibnb. You can also find this information here.
Hint: The Markdown Quick Reference sheet has an example of inline R code that might be helpful. You can access it from the Help menu in RStudio.
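As a minimal sketch of what inline R code does, the example below builds the sentence that inline expressions would produce when the document is knit. It uses a small hypothetical data frame, toy_listings, as a stand-in for edibnb so that it runs without dsbox installed; the names toy_listings, n_rows, n_cols, and sentence are illustrative only.

```r
# Hypothetical stand-in for edibnb (not the real data)
toy_listings <- data.frame(
  id    = c(1, 2, 3, 4),
  price = c(80, 115, 46, NA)
)

# In the text of an R Markdown document you would write, outside any chunk:
#   The data frame has `r nrow(toy_listings)` rows and `r ncol(toy_listings)` columns.
# The expressions inside the backticks are ordinary R calls:
n_rows <- nrow(toy_listings)
n_cols <- ncol(toy_listings)

# This is the sentence a reader would see after knitting
sentence <- paste0("The data frame has ", n_rows, " rows and ", n_cols, " columns.")
print(sentence)
```

Writing the counts this way means the sentence stays correct if the data changes, since the numbers are recomputed on every knit.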
Run View(edibnb) in your Console to view the data in the data viewer.
Your non-coding answer: Each row represents a house that is available on Airbnb.
# Each row represents a house that is available on Airbnb
Each column represents a variable.
b. Get a list of the variables in the data frame and their data types.
str(edibnb)
## spc_tbl_ [13,245 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:13245] 15420 24288 38628 44552 47616 ...
## $ price : num [1:13245] 80 115 46 32 100 71 175 150 139 190 ...
## $ neighbourhood : chr [1:13245] "New Town" "Southside" NA "Leith" ...
## $ accommodates : num [1:13245] 2 4 2 2 2 3 5 5 6 10 ...
## $ bathrooms : num [1:13245] 1 1.5 1 1 1 1 1 1 1 2 ...
## $ bedrooms : num [1:13245] 1 2 0 1 1 1 2 3 4 4 ...
## $ beds : num [1:13245] 1 2 2 1 1 2 3 4 5 7 ...
## $ review_scores_rating: num [1:13245] 99 92 94 93 98 97 100 92 96 99 ...
## $ number_of_reviews : num [1:13245] 283 199 52 184 32 762 7 28 222 142 ...
## $ listing_url : chr [1:13245] "https://www.airbnb.com/rooms/15420" "https://www.airbnb.com/rooms/24288" "https://www.airbnb.com/rooms/38628" "https://www.airbnb.com/rooms/44552" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. price = col_double(),
## .. neighbourhood = col_character(),
## .. accommodates = col_double(),
## .. bathrooms = col_double(),
## .. bedrooms = col_double(),
## .. beds = col_double(),
## .. review_scores_rating = col_double(),
## .. number_of_reviews = col_double(),
## .. listing_url = col_character()
## .. )
You can find descriptions of each of the variables in the help file
for the dataset, which you can access by running ?edibnb in
your Console.
summary function.
nrow(edibnb)
## [1] 13245
# Keep only the numeric columns
numeric_data <- edibnb[sapply(edibnb, is.numeric)]
# Summary statistics for one numeric vector, ignoring missing values
summary_stats <- function(x) {
  return(c(
    Min = min(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    IQR = IQR(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    Mean = mean(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE)
  ))
}
# Apply the summary function to each numeric column (one row per variable)
stats_table <- t(apply(numeric_data, 2, summary_stats))
# Convert to a data frame so it is easier to view
stats_table <- as.data.frame(stats_table)
stats_table
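The na.rm = TRUE arguments matter because some columns contain missing values (the plots in this document warn about removed non-finite rows). As a small sketch of the difference, using a hypothetical vector toy_price rather than the real price column:

```r
# Hypothetical price vector with one missing value (not taken from edibnb)
toy_price <- c(80, 115, 46, NA, 100)

# Without na.rm, the missing value propagates and the result is NA
mean_default <- mean(toy_price)

# With na.rm = TRUE, the NA is dropped before averaging: (80 + 115 + 46 + 100) / 4
mean_no_na <- mean(toy_price, na.rm = TRUE)

# Count how many values were missing, so you know what was dropped
n_missing <- sum(is.na(toy_price))

print(mean_default)
print(mean_no_na)
print(n_missing)
```

Reporting the number of missing values alongside the statistics makes it clear how much data each summary is actually based on.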
REMINDER: A ggplot2 plot is comprised of three fundamental building blocks:
ggplot2 works in layers. We can create a base layer, and then add additional layers to it. New layers can be added using the + operator.
library(ggplot2)
plot <- edibnb |>
  ggplot(aes(x = price)) +
  geom_histogram(fill = "lightblue", color = "blue", binwidth = 20) +
  facet_wrap(~ neighbourhood, scales = "fixed") +
  labs(x = "Price",
       y = "Count",
       title = "Airbnb Price Distribution Across Neighborhoods") +
  theme_minimal()
plot
## Warning: Removed 199 rows containing non-finite outside the scale range
## (`stat_bin()`).
Your non-coding answer about your reasoning for the layers and layout you chose.
# First I called the dataset and defined the aesthetic mapping so that x represents the price of the Airbnb listings. Then I chose the type of plot, a histogram, and set its colors: light blue fill with a blue outline. After that I added a facet layer, which divides the data into one subplot per neighbourhood. Below that is the code for the x-axis and y-axis labels and the title.
library(dplyr)
data("edibnb")
# Create a pipeline: median price per neighbourhood, keeping the five highest
top_5_neighborhoods <- edibnb %>%
  group_by(neighbourhood) %>%
  summarize(median_price = median(price, na.rm = TRUE)) %>%
  arrange(desc(median_price)) %>%
  slice(1:5)
# Display the result
top_5_neighborhoods
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.3
edibnb %>%
  filter(neighbourhood %in% top_5_neighborhoods$neighbourhood) %>%
  ggplot(aes(x = price, y = neighbourhood, fill = neighbourhood)) +
  geom_density_ridges(scale = 0.9, alpha = 0.7) +
  labs(title = "Distribution of Listing Prices in Top 5 Neighborhoods",
       x = "Listing Price",
       y = "Neighborhood") +
  theme_minimal()
## Picking joint bandwidth of 13.8
## Warning: Removed 104 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).
# Summary statistics
summary_stats <- edibnb %>%
  filter(neighbourhood %in% top_5_neighborhoods$neighbourhood) %>%
  group_by(neighbourhood) %>%
  summarize(
    min_price = min(price, na.rm = TRUE),
    mean_price = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    sd_price = sd(price, na.rm = TRUE),
    iqr_price = IQR(price, na.rm = TRUE),
    max_price = max(price, na.rm = TRUE)
  )
summary_stats
Your non-coding narrative here:
# In the graphs above, most of the neighbourhood price distributions are skewed right, but certain places like West End and Stockbridge have higher prices than the others.
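One informal way to back up a "skewed right" reading is to compare the mean and the median within each neighbourhood: a mean well above the median suggests a long right tail. This is only a rule of thumb, not a formal test. The sketch below uses a hypothetical data frame, toy_prices, with made-up neighbourhoods "A" and "B", so it runs without the real data.

```r
# Hypothetical prices: "A" has one very expensive listing, "B" is symmetric
toy_prices <- data.frame(
  neighbourhood = rep(c("A", "B"), each = 5),
  price = c(50, 60, 70, 80, 400,
            90, 100, 110, 120, 130)
)

# Mean and median price per neighbourhood, ignoring missing values
mean_price   <- tapply(toy_prices$price, toy_prices$neighbourhood, mean, na.rm = TRUE)
median_price <- tapply(toy_prices$price, toy_prices$neighbourhood, median, na.rm = TRUE)

# Collect the results; a large positive gap hints at right skew
skew_check <- data.frame(
  neighbourhood = names(mean_price),
  mean_price    = as.vector(mean_price),
  median_price  = as.vector(median_price)
)
skew_check$mean_minus_median <- skew_check$mean_price - skew_check$median_price
skew_check
```

In this toy example "A" has a gap of 62 (mean 132, median 70) while "B" has a gap of 0, which matches what a ridge plot of the same values would show.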
EXTRA CREDIT/OPTIONAL (5 points). Create a visualization that will
help you compare the distribution of review scores
(review_scores_rating) across neighbourhoods. You get to
decide what type of visualisation to create and there is more than one
correct answer! In your answer, include a brief interpretation of how
Airbnb guests rate properties in general and how the neighbourhoods
compare to each other in terms of their ratings.
data(edibnb)
# Boxplot
ggplot(edibnb, aes(x = neighbourhood, y = review_scores_rating)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Distribution of Review Scores Across Neighbourhoods",
    x = "Neighbourhood",
    y = "Review Scores Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
## Warning: Removed 2177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Your non-coding narrative here:
# I've never thought about how people rate Airbnbs, but I would assume they rate them highly because they have the opportunity to research and find ones that fit their wants and needs. Still, everyone has a different experience, which is why there are some outliers in the plot. The boxplot shows differences in rating distributions across the neighbourhoods. Some neighbourhoods have a tighter concentration of high ratings (New Town, Old Town, and Haymarket), while others display a wider mix of opinions (Southside, Tollcross, and Newington). I also noticed that the neighbourhoods with fewer ratings had higher medians than the ones with a large number of ratings.
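The last observation, about the number of ratings and the median rating, can be checked with a small table rather than read off the boxplot. The sketch below is one possible way to do it, using a hypothetical data frame toy_reviews with made-up neighbourhoods; swapping in edibnb and its review_scores_rating column is left as an assumption about how you would apply it.

```r
# Hypothetical review scores with one missing value (not the real data)
toy_reviews <- data.frame(
  neighbourhood = c("A", "A", "A", "A", "B", "B"),
  review_scores_rating = c(90, 95, NA, 100, 98, 99)
)

# Number of listings that actually have a rating, per neighbourhood
n_rated <- tapply(!is.na(toy_reviews$review_scores_rating),
                  toy_reviews$neighbourhood, sum)

# Median rating per neighbourhood, ignoring missing ratings
median_rating <- tapply(toy_reviews$review_scores_rating,
                        toy_reviews$neighbourhood, median, na.rm = TRUE)

# Side-by-side table, sorted from fewest to most rated listings
rating_table <- data.frame(
  neighbourhood = names(n_rated),
  n_rated       = as.vector(n_rated),
  median_rating = as.vector(median_rating)
)
rating_table <- rating_table[order(rating_table$n_rated), ]
rating_table
```

With only a handful of rated listings a median can look high by chance, so the count column helps judge how much weight each neighbourhood's median deserves.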