Once upon a time, people traveled all over the world; some stayed in hotels, and others chose to stay in other people’s houses that they booked through Airbnb. Recent developments in Edinburgh regarding the growth of Airbnb and its impact on the housing market mean a better understanding of the Airbnb listings is needed. Using data provided by Airbnb, we can explore how Airbnb availability and prices vary by neighbourhood.

Getting started

Before we introduce the data, let’s warm up with some simple exercises.

Update the YAML, changing the author name to your name.
Save the file with your last name, using “Save as” and substituting your last name for YOURNAME.
Knit the document. Make sure it compiles without errors. The output will be a Markdown (.md) file with the same name.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualization, and the data lives in the dsbox package. You probably already have tidyverse installed. You will need to install dsbox from GitHub, as the package is not available for the most recent version of R. Once it is installed, you can comment out this line of code (add # in front of it) and simply load the libraries into your environment, if needed.

devtools::install_github("tidyverse/dsbox")
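Instead of commenting the install line in and out by hand, the install can be guarded so it only runs when the package is missing. This is a minimal sketch, not part of the original lab: the helper names `needs_install` and `install_dsbox_if_missing` are made up for illustration, and it assumes devtools is already installed.

```r
# Hypothetical helper: TRUE when a package is not yet installed
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical wrapper: install dsbox from GitHub only when it is missing
# (assumes the devtools package is available)
install_dsbox_if_missing <- function() {
  if (needs_install("dsbox")) {
    devtools::install_github("tidyverse/dsbox")
  }
  # Report whether dsbox is available afterwards
  invisible(!needs_install("dsbox"))
}

# In a setup chunk you would then call:
# install_dsbox_if_missing()
```

With this guard in place, knitting the document again does not reinstall the package.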

Load the packages by running the following:

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.4.1

library(dsbox)
library(ggplot2)
library(ggridges)

## Warning: package 'ggridges' was built under R version 4.4.1

Submitting the lab

Instructions for code: There are R chunks where you should write your code. For some exercises, you might save your answer as a particular variable. For example, we might give you a code chunk that looks like this:

set.seed(4291)
# insert code here save the median of your simulated data as 
# 'medx'

And you might complete it like so:

set.seed(4291)
# insert code here save the median of your simulated data as 
# 'medx'
x <- rnorm(1000)
medx <- median(x)
medx

## [1] 0.01612433

It is a good idea to put the variable name at the bottom so it prints (assuming it’s not a huge object); usually this will already be part of the provided code. It also helps you check your work.

Of note: Sometimes an exercise will ask for code AND pose a question. If the answer to the question is not an output of the code, you must answer it separately in a non-code text box. For example, the problem might ask you to make a plot and describe its prominent features. You would write the code to make the plot, but also write a sentence or two outside of the code block (plain text) to describe the features of the plot.

Submission: You must submit both the PDF and the .Rmd file to your submission folder on Google Drive by the due date and time.

Data

The data can be found in the dsbox package, and it’s called edibnb. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package.

You can view the dataset as a spreadsheet using the View() function. Note that you should not put this function in your R Markdown document, but instead type it directly in the Console, as it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…). When you run this in the console, you’ll see the following data viewer window pop up.

View(edibnb)
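For a preview that also works inside a knitted document, `head()` prints the first rows as static output. The sketch below is an illustration only: `toy_listings` is a small made-up data frame standing in for edibnb, and `peek` is a hypothetical helper that opens the viewer only in an interactive session.

```r
# Made-up stand-in for edibnb (first few ids and prices only)
toy_listings <- data.frame(
  id    = c(15420, 24288, 38628),
  price = c(80, 115, 46)
)

# Hypothetical helper: open the viewer only when running interactively,
# and always return the first n rows so a knitted document shows a table
peek <- function(df, n = 6) {
  if (interactive()) {
    View(df)
  }
  head(df, n)
}

preview <- peek(toy_listings, n = 2)
preview
```

In the lab itself, `head(edibnb)` on its own is enough for a chunk in the R Markdown file.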

You can find out more about the dataset by inspecting its documentation, which you can access by running ?edibnb in the Console or using the Help menu in RStudio to search for edibnb. You can also find this information here.

Exercises

Hint: The Markdown Quick Reference sheet has an example of inline R code that might be helpful. You can access it from the Help menu in RStudio.

Run View(edibnb) in your Console to view the data in the data viewer.

What does each row in the dataset represent?

Your non-coding answer:

print("Each row represents one Airbnb listing in Edinburgh. The id column is the unique id of each listing. The price column is the listing price. The neighbourhood column is the location. The accommodates column is the number of people the listing accommodates. The bathrooms column is the number of bathrooms in the listing. The bedrooms column is the number of bedrooms and the beds column is the number of beds. The review_scores_rating column is the rating the listing has received and the number_of_reviews column is the number of reviews received. The final column, listing_url, is the link to the listing.")

## [1] "Each row represents one Airbnb listing in Edinburgh. The id column is the unique id of each listing. The price column is the listing price. The neighbourhood column is the location. The accommodates column is the number of people the listing accommodates. The bathrooms column is the number of bathrooms in the listing. The bedrooms column is the number of bedrooms and the beds column is the number of beds. The review_scores_rating column is the rating the listing has received and the number_of_reviews column is the number of reviews received. The final column, listing_url, is the link to the listing."

Each column represents a variable. Get a list of the variables in the data frame and their data types.

str(edibnb)

## spc_tbl_ [13,245 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                  : num [1:13245] 15420 24288 38628 44552 47616 ...
##  $ price               : num [1:13245] 80 115 46 32 100 71 175 150 139 190 ...
##  $ neighbourhood       : chr [1:13245] "New Town" "Southside" NA "Leith" ...
##  $ accommodates        : num [1:13245] 2 4 2 2 2 3 5 5 6 10 ...
##  $ bathrooms           : num [1:13245] 1 1.5 1 1 1 1 1 1 1 2 ...
##  $ bedrooms            : num [1:13245] 1 2 0 1 1 1 2 3 4 4 ...
##  $ beds                : num [1:13245] 1 2 2 1 1 2 3 4 5 7 ...
##  $ review_scores_rating: num [1:13245] 99 92 94 93 98 97 100 92 96 99 ...
##  $ number_of_reviews   : num [1:13245] 283 199 52 184 32 762 7 28 222 142 ...
##  $ listing_url         : chr [1:13245] "https://www.airbnb.com/rooms/15420" "https://www.airbnb.com/rooms/24288" "https://www.airbnb.com/rooms/38628" "https://www.airbnb.com/rooms/44552" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   price = col_double(),
##   ..   neighbourhood = col_character(),
##   ..   accommodates = col_double(),
##   ..   bathrooms = col_double(),
##   ..   bedrooms = col_double(),
##   ..   beds = col_double(),
##   ..   review_scores_rating = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   listing_url = col_character()
##   .. )
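When only the variable names and their types are needed, a more compact listing than the full str() output can be built with `sapply()`. This is a sketch on a made-up data frame (`toy_listings` is an assumption, not the real edibnb data); `dplyr::glimpse()` is another option when the tidyverse is loaded.

```r
# Made-up stand-in for edibnb with one numeric and one character column
toy_listings <- data.frame(
  id            = c(15420, 24288, 38628),
  price         = c(80, 115, 46),
  neighbourhood = c("New Town", "Southside", NA),
  stringsAsFactors = FALSE
)

# Named character vector: one entry per variable, giving its class
var_types <- sapply(toy_listings, class)
var_types
```

Replacing `toy_listings` with `edibnb` gives the same kind of listing for the real data.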

You can find descriptions of each of the variables in the help file for the dataset, which you can access by running ?edibnb in your Console.

Computing summary statistics is always the first step in the exploratory analysis. The summaries may include average, median, maximum, minimum, etc. One simple method is to use the summary function.

How many observations (rows) does the dataset have?

summary(edibnb)

##        id               price        neighbourhood       accommodates   
##  Min.   :   15420   Min.   :  0.00   Length:13245       Min.   : 1.000  
##  1st Qu.:13279107   1st Qu.: 49.00   Class :character   1st Qu.: 2.000  
##  Median :20171841   Median : 75.00   Mode  :character   Median : 3.000  
##  Mean   :20077242   Mean   : 97.21                      Mean   : 3.541  
##  3rd Qu.:27397925   3rd Qu.:110.00                      3rd Qu.: 4.000  
##  Max.   :36066014   Max.   :999.00                      Max.   :19.000  
##                     NA's   :199                                         
##    bathrooms        bedrooms           beds        review_scores_rating
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 20.00      
##  1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 93.00      
##  Median :1.000   Median : 1.000   Median : 2.000   Median : 97.00      
##  Mean   :1.226   Mean   : 1.583   Mean   : 2.032   Mean   : 95.02      
##  3rd Qu.:1.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.: 99.00      
##  Max.   :9.000   Max.   :13.000   Max.   :30.000   Max.   :100.00      
##  NA's   :12      NA's   :4        NA's   :15       NA's   :2177        
##  number_of_reviews listing_url       
##  Min.   :  0.00    Length:13245      
##  1st Qu.:  2.00    Class :character  
##  Median : 12.00    Mode  :character  
##  Mean   : 37.73                      
##  3rd Qu.: 45.00                      
##  Max.   :773.00                      
##

print("The dataset has 13245 rows/observations")

## [1] "The dataset has 13245 rows/observations"
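The row count can also be computed directly rather than read off the summary() output. A minimal sketch, again using a made-up `toy_listings` data frame in place of edibnb:

```r
# Made-up stand-in for edibnb
toy_listings <- data.frame(
  id    = c(15420, 24288, 38628),
  price = c(80, 115, 46)
)

n_obs  <- nrow(toy_listings)  # number of observations (rows)
n_vars <- ncol(toy_listings)  # number of variables (columns)
dim(toy_listings)             # both at once: rows, then columns

# In the R Markdown text, inline R code can report the count, e.g.
# The dataset has `r nrow(edibnb)` rows.
```

Using inline code keeps the narrative in sync with the data if the dataset changes.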

Get key summary statistics for all numeric variables, and return them in a neatly organized table. The statistics should include the minimum, median, IQR, maximum, mean and standard deviation.

summary(edibnb)

##        id               price        neighbourhood       accommodates   
##  Min.   :   15420   Min.   :  0.00   Length:13245       Min.   : 1.000  
##  1st Qu.:13279107   1st Qu.: 49.00   Class :character   1st Qu.: 2.000  
##  Median :20171841   Median : 75.00   Mode  :character   Median : 3.000  
##  Mean   :20077242   Mean   : 97.21                      Mean   : 3.541  
##  3rd Qu.:27397925   3rd Qu.:110.00                      3rd Qu.: 4.000  
##  Max.   :36066014   Max.   :999.00                      Max.   :19.000  
##                     NA's   :199                                         
##    bathrooms        bedrooms           beds        review_scores_rating
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 20.00      
##  1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 93.00      
##  Median :1.000   Median : 1.000   Median : 2.000   Median : 97.00      
##  Mean   :1.226   Mean   : 1.583   Mean   : 2.032   Mean   : 95.02      
##  3rd Qu.:1.000   3rd Qu.: 2.000   3rd Qu.: 3.000   3rd Qu.: 99.00      
##  Max.   :9.000   Max.   :13.000   Max.   :30.000   Max.   :100.00      
##  NA's   :12      NA's   :4        NA's   :15       NA's   :2177        
##  number_of_reviews listing_url       
##  Min.   :  0.00    Length:13245      
##  1st Qu.:  2.00    Class :character  
##  Median : 12.00    Mode  :character  
##  Mean   : 37.73                      
##  3rd Qu.: 45.00                      
##  Max.   :773.00                      
##
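Note that summary() does not report the IQR or the standard deviation. One way to collect all six requested statistics in a single table is sketched below in base R. The data frame `toy_listings` and the helper `six_stats` are made up for illustration; a dplyr pipeline with summarise() and across() would be an equally reasonable approach.

```r
# Made-up stand-in for edibnb, including a missing price
toy_listings <- data.frame(
  price         = c(50, 80, NA, 120),
  beds          = c(1, 2, 2, 4),
  neighbourhood = c("Leith", "Leith", "Old Town", "Old Town"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the six requested statistics, ignoring missing values
six_stats <- function(x) {
  c(
    min    = min(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    iqr    = IQR(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE)
  )
}

# Keep only the numeric columns, then build one row per variable
numeric_cols <- toy_listings[sapply(toy_listings, is.numeric)]
stats_table  <- t(sapply(numeric_cols, six_stats))
stats_table
```

Each row of `stats_table` is a variable and each column a statistic; applying the same steps to `edibnb` would also summarise id, which is numeric but not a meaningful quantity.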

Create a faceted histogram where each facet represents a neighborhood and displays the distribution of Airbnb prices in that neighborhood. Be sure to label your visualization. Think critically about whether it makes more sense to stack the facets on top of each other in a column, lay them out in a row, or wrap them around. Along with your visualization, include your reasoning for the layout you chose for your facets.

REMINDER: A ggplot2 plot is composed of three fundamental building blocks:

Data: typically entered as a data frame.
Aesthetics (aes): define which columns from the data are used along which axes, and map values to colors, symbol sizes, etc.
Geom(s): define the type of plot, such as points (geom_point), bars (geom_bar), lines (geom_line), etc.

ggplot2 works in layers. We can create a base layer and then add additional layers to it. New layers are added with the “+” operator.
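The layering idea can be sketched step by step. The example below uses a made-up data frame (`toy_listings`, not the real data) and stores each stage in its own variable purely for illustration; in practice the pieces are usually chained in a single expression.

```r
library(ggplot2)

# Made-up stand-in for edibnb
toy_listings <- data.frame(
  price         = c(40, 55, 60, 80, 95, 120, 150, 200),
  neighbourhood = rep(c("Leith", "Old Town"), each = 4)
)

# Base layer: data plus aesthetic mapping, nothing drawn yet
base_layer <- ggplot(toy_listings, aes(x = price))

# Add a geom to decide how the data is drawn
with_geom <- base_layer + geom_histogram(binwidth = 50)

# Add facets: ncol = 1 stacks the panels in one column so they share an x-axis
with_facets <- with_geom + facet_wrap(~ neighbourhood, ncol = 1)

# Add labels last
final_plot <- with_facets +
  labs(title = "Toy price distributions", x = "Price", y = "Count")

final_plot
```

Swapping `ncol = 1` for `nrow = 1` lays the panels out in a row instead, which is one way to try out the layout options the exercise asks about.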

# Faceted histogram of listing prices, one panel per neighbourhood
ggplot(edibnb, aes(x = price)) +
  geom_histogram(binwidth = 50, fill = "blue", color = "black") +
  facet_wrap(~ neighbourhood, scales = "free") +
  labs(title = "Distribution of Airbnb Prices by Neighborhood",
       x = "Price",
       y = "Count") +
  theme_minimal()

## Warning: Removed 199 rows containing non-finite outside the scale range
## (`stat_bin()`).

Your non-coding answer about the reasoning for the layers and layout you chose:

print("The choice to use facet wrapping organizes the plots into a grid, making it easier to visually compare the price distributions across multiple neighborhoods without overwhelming the viewer. Independent scales for each neighborhood allow for accurate representation of varied price ranges, ensuring that each facet reflects its local context rather than being skewed by outliers in other neighborhoods. Minimal styling with a simple color palette keeps the focus on the data itself, avoiding unnecessary distractions. This layered approach balances clarity, readability, and data integrity, allowing for an intuitive understanding of how Airbnb prices vary by neighborhood.")

## [1] "The choice to use facet wrapping organizes the plots into a grid, making it easier to visually compare the price distributions across multiple neighborhoods without overwhelming the viewer. Independent scales for each neighborhood allow for accurate representation of varied price ranges, ensuring that each facet reflects its local context rather than being skewed by outliers in other neighborhoods. Minimal styling with a simple color palette keeps the focus on the data itself, avoiding unnecessary distractions. This layered approach balances clarity, readability, and data integrity, allowing for an intuitive understanding of how Airbnb prices vary by neighborhood."

Your answer to this exercise will include three pipelines, and a narrative.

Use a single pipeline to identify the neighbourhoods with the top five median listing prices.

# Median price per neighbourhood, sorted high to low, keeping the top five
top5 <- edibnb %>%
  group_by(neighbourhood) %>%
  summarise(median_price = median(price, na.rm = TRUE)) %>%
  arrange(desc(median_price)) %>%
  slice_head(n = 5)

Then, in another pipeline filter the data for these five neighbourhoods and make ridge plots of the distributions of listing prices in these five neighbourhoods.

# Filter the listings to the top five neighbourhoods, then plot their price distributions
edibnb %>%
  filter(neighbourhood %in% top5$neighbourhood, !is.na(price)) %>%
  ggplot(aes(x = price, y = neighbourhood, fill = neighbourhood)) +
  geom_density_ridges(scale = 1.5) +
  labs(title = "Distribution of Airbnb Prices in the Top 5 Neighborhoods",
       x = "Price",
       y = "Neighborhood") +
  theme_minimal()

In a third pipeline calculate the minimum, mean, median, standard deviation, IQR, and maximum listing price in each of these neighbourhoods.

summary_stats <- edibnb %>%
  filter(neighbourhood %in% top5$neighbourhood) %>%
  group_by(neighbourhood) %>%
  summarise(
    min_price = min(price, na.rm = TRUE),
    mean_price = mean(price, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    sd_price = sd(price, na.rm = TRUE),
    iqr_price = IQR(price, na.rm = TRUE),
    max_price = max(price, na.rm = TRUE)
  )

print(summary_stats)

## # A tibble: 5 × 7
##   neighbourhood min_price mean_price median_price sd_price iqr_price max_price
##   <chr>             <dbl>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl>
## 1 Bruntsfield          10       99.4           80     90.2      72.5       900
## 2 New Town             12      136.           100    109.       86.5       999
## 3 Old Town             15      128.            90    110.       76         999
## 4 Stockbridge          21      104.            85     77.6      66         750
## 5 West End             19      116.            90     93.3      80         999

Use the visualization in Exercise 3 and the summary statistics to describe the distribution of listing prices in the neighborhoods.

Your non-coding narrative here:

print("The distribution of listing prices across the top five neighborhoods reveals significant differences in price variability. New Town and Old Town have the widest range of prices, with maximums near $999 and minimums as low as $12, offering diverse options from budget to luxury. High standard deviations and interquartile ranges in these neighborhoods indicate substantial variability in pricing. In contrast, Bruntsfield and Stockbridge show more consistent pricing, with narrower ranges and lower standard deviations, catering to a more stable market. The higher mean compared to the median in Bruntsfield suggests the presence of a few high-priced listings, while Stockbridge displays similar stability. Overall, New Town and Old Town are characterized by diverse pricing, whereas Bruntsfield and Stockbridge offer more uniform, mid-range options.")

## [1] "The distribution of listing prices across the top five neighborhoods reveals significant differences in price variability. New Town and Old Town have the widest range of prices, with maximums near $999 and minimums as low as $12, offering diverse options from budget to luxury. High standard deviations and interquartile ranges in these neighborhoods indicate substantial variability in pricing. In contrast, Bruntsfield and Stockbridge show more consistent pricing, with narrower ranges and lower standard deviations, catering to a more stable market. The higher mean compared to the median in Bruntsfield suggests the presence of a few high-priced listings, while Stockbridge displays similar stability. Overall, New Town and Old Town are characterized by diverse pricing, whereas Bruntsfield and Stockbridge offer more uniform, mid-range options."

EXTRA CREDIT/OPTIONAL (5 points). Create a visualization that will help you compare the distribution of review scores (review_scores_rating) across neighbourhoods. You get to decide what type of visualisation to create and there is more than one correct answer! In your answer, include a brief interpretation of how Airbnb guests rate properties in general and how the neighbourhoods compare to each other in terms of their ratings.

#Insert code here

Your non-coding narrative here:

*Add text here*

Week 2 Homework - Airbnb listings in Edinburgh

DATA 201 - Johnson

September 12, 2024
