This homework will guide you through the basics of RMarkdown, loading data, visualizing, and summarizing Airbnb data. You will learn by doing, with lots of opportunities to try things out, make changes, and see what happens. The goal is to help you become comfortable with data analysis in R, and to have fun exploring real business data, with relevant to current discussion around social issuses in CDMX.
RMarkdown is a tool for combining text, code, and results in a single document. You write your analysis in plain text, add code chunks (in R), and when you βknitβ the document, it runs the code and shows the output (tables, plots, etc.) right in the document. This makes it easy to create reproducible reports and share your work.
Ctrl + Enter to run a line in
the R console.In this homework, we will use the tidyverse package
collection, which includes tools for data manipulation
(dplyr), visualization (ggplot2), and more.
These are basically ready codes for functions that other users wrote to
make analysis easier. The main package for graphics is
ggplot2, which is part of the tidyverse.
How to load packages:
install.packages("tidyverse") # Run this once if you don't have tidyverse installed
library(tidyverse) # Loads dplyr, ggplot2, and other useful packages
Note: If you only want to use ggplot2, you can load
it with library(ggplot2). But loading tidyverse gives you
access to many helpful tools for data analysis.
Letβs start with loading the data from Rda file. Usually you can also load it from Excel or CSV, which we will explore later. Read through the chunk below, fill the path, delete the revelant # and run the line to load the data. Also, read or install library tidyverse.
# Option 2: Load from Rda
# If you placed the data in the same directory as this RMarkdown file, you can use (don't forget to eliminate #):
load("listings.Rda") ##if this doesn't work, use the full path
# or use the full path
#Otherwise, provide the full path to the file, like:
# load("C:/path/to/your/listings.Rda")
# head(listings) # this will show the first 10 rows of the data
#install.packages("tidyverse") # if you need to instlal it, just eliminate the initial # and run this line. After that, load it using the command below
library(tidyverse) #make sure you have this library. If not, install it using command
## ββ Attaching core tidyverse packages ββββββββββββββββββββββββ tidyverse 2.0.0 ββ
## β dplyr 1.1.4 β readr 2.1.6
## β forcats 1.0.1 β stringr 1.6.0
## β ggplot2 4.0.1 β tibble 3.3.1
## β lubridate 1.9.4 β tidyr 1.3.2
## β purrr 1.2.1
## ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
## β dplyr::filter() masks stats::filter()
## β dplyr::lag() masks stats::lag()
## βΉ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Letβs see whatβs inside! This helps you understand what variables you have and what they mean. What are variables? Variables are columns in your dataset, each representing a different type of information (e.g., price, number_of_reviews, neighbourhood). Each row is an observation (a listing).
Where to look at variables in RStudio: - After
loading your data (by selecting a line in the chunk βlistings <-
read.csv(βC:/path/to/your/airbnb_listings.csvβ)β and running it - or the
one with RData), look in the Environment pane (top right). You should
see object called listings and click on the listings object
to open a spreadsheet-like viewer. - You can also use the command
View(listings) to open the data viewer in RStudio.
How to list variables with commands: - Use
names(listings) to see all variable names.
names(listings) # Lists all variable names
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "source"
## [6] "name"
## [7] "description"
## [8] "neighborhood_overview"
## [9] "picture_url"
## [10] "host_id"
## [11] "host_url"
## [12] "host_name"
## [13] "host_since"
## [14] "host_location"
## [15] "host_about"
## [16] "host_response_time"
## [17] "host_response_rate"
## [18] "host_acceptance_rate"
## [19] "host_is_superhost"
## [20] "host_thumbnail_url"
## [21] "host_picture_url"
## [22] "host_neighbourhood"
## [23] "host_listings_count"
## [24] "host_total_listings_count"
## [25] "host_verifications"
## [26] "host_has_profile_pic"
## [27] "host_identity_verified"
## [28] "neighbourhood"
## [29] "neighbourhood_cleansed"
## [30] "neighbourhood_group_cleansed"
## [31] "latitude"
## [32] "longitude"
## [33] "property_type"
## [34] "room_type"
## [35] "accommodates"
## [36] "bathrooms"
## [37] "bathrooms_text"
## [38] "bedrooms"
## [39] "beds"
## [40] "amenities"
## [41] "price"
## [42] "minimum_nights"
## [43] "maximum_nights"
## [44] "minimum_minimum_nights"
## [45] "maximum_minimum_nights"
## [46] "minimum_maximum_nights"
## [47] "maximum_maximum_nights"
## [48] "minimum_nights_avg_ntm"
## [49] "maximum_nights_avg_ntm"
## [50] "calendar_updated"
## [51] "has_availability"
## [52] "availability_30"
## [53] "availability_60"
## [54] "availability_90"
## [55] "availability_365"
## [56] "calendar_last_scraped"
## [57] "number_of_reviews"
## [58] "number_of_reviews_ltm"
## [59] "number_of_reviews_l30d"
## [60] "first_review"
## [61] "last_review"
## [62] "review_scores_rating"
## [63] "review_scores_accuracy"
## [64] "review_scores_cleanliness"
## [65] "review_scores_checkin"
## [66] "review_scores_communication"
## [67] "review_scores_location"
## [68] "review_scores_value"
## [69] "license"
## [70] "instant_bookable"
## [71] "calculated_host_listings_count"
## [72] "calculated_host_listings_count_entire_homes"
## [73] "calculated_host_listings_count_private_rooms"
## [74] "calculated_host_listings_count_shared_rooms"
## [75] "reviews_per_month"
Try it: Look at the output. What variables seem interesting for business analysis? (e.g., price, number_of_reviews, neighbourhood)
Explanation: When you open the data viewer in
RStudio (by clicking the listings object or using
View(listings)), you will see: - All the variables
(columns) in your dataset, with their names at the top. - The type of
each variable (e.g., numeric, character) is shown in the summary and
structure outputs. - Example values for each variable are visible in the
first few rows. - You can click on the column headers to sort the data
by that variable (e.g., sort by price to see the most or least expensive
listings). - This helps you quickly spot patterns, outliers, and
interesting variables for analysis.
gsub("[$,]", "", listings$price) removes all dollar
signs and commas (it replaces them with empty field), leaving just the
numbers as text.as.numeric(...) converts the cleaned text to a numeric
variable, so you can do calculations.summary(listings$price_num) gives you a quick overview:
min, max, mean, median, and how many missing values (NAs) there
are.# Remove $ and commas from the text, convert price to numeric variable
listings$price_num <- as.numeric(gsub("[$,]", "", listings$price))
summary(listings$price_num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 529 900 1619 1445 888888
What are missing values (NA)? Missing values (NAs) are entries where data is not available or not recorded. They are very common in real-world datasets. In our example they can happen, for instance, when it was not possible to convert a price to numerical value.
Why do missing values matter? - Many R functions (like mean, median, quantile) will return NA if there are any missing values, unless you tell R to ignore them. - This can make your results look strange or incomplete if you donβt handle NAs properly.
How to handle missing values: - Use the argument
na.rm = TRUE in most R functions to remove missing values
before calculation.
Try it: Run these commands and compare the results:
# Try calculating the mean with and without removing NAs
mean(listings$price_num) # Returns NA if there are missing values
## [1] 1619.112
mean(listings$price_num, na.rm = TRUE) # Ignores missing values
## [1] 1619.112
# Try with number_of_reviews
mean(listings$number_of_reviews_num) # Returns NA if there are missing values
## Warning in mean.default(listings$number_of_reviews_num): argument is not
## numeric or logical: returning NA
## [1] NA
mean(listings$number_of_reviews_num, na.rm = TRUE) # Ignores missing values
## Warning in mean.default(listings$number_of_reviews_num, na.rm = TRUE): argument
## is not numeric or logical: returning NA
## [1] NA
Now try calculating mean number of reviews in the chunk below. It is represented with number_of_reviews. You might need to clean it first!
Try it: - Use
sum(is.na(listings$price_num)) to count how many are
missing. - Try the same for other columns, like
number_of_reviews_num.
Explanation: - If you donβt handle missing values,
your summary statistics may be NA or misleading. - Always check for NAs
and use na.rm = TRUE when needed!
listings$number_of_reviews_num <- as.numeric(listings$number_of_reviews)
mean(listings$number_of_reviews_num, na.rm = TRUE)
## [1] 49.51017
Letβs explore where most of Airbnbs are in Mexico City. The neighborhood (cleaned name) is represented with the variable: neighbourhood_cleansed
Use this variable together with function table() to find number of listings per neighborhood.
table(listings$neighbourhood_cleansed)
##
## Γlvaro ObregΓ³n Azcapotzalco Benito JuΓ‘rez
## 825 172 2408
## CoyoacΓ‘n Cuajimalpa de Morelos CuauhtΓ©moc
## 1356 411 7557
## Gustavo A. Madero Iztacalco Iztapalapa
## 284 196 206
## La Magdalena Contreras Miguel Hidalgo Milpa Alta
## 115 3014 14
## TlΓ‘huac Tlalpan Venustiano Carranza
## 34 629 453
## Xochimilco
## 116
What graph would you use to visualize the number of listings per neighborhood? (no need to do maps, we just want visualize frequencies per neighborhood)
Try out some graphs and choose the best one in the chunk below.
listings %>%
count(neighbourhood_cleansed) %>%
ggplot(aes(x = reorder(neighbourhood_cleansed, n), y = n)) +
geom_col() +
coord_flip() +
labs(x="Neighbourhood", y="Number of listings", title="Listings per neighbourhood") +
theme_minimal()
Letβs look at the distribution of prices in Mexico City
Try it: - Change the number of breaks in the histogram. What happens? - Change the color. Try βorangeβ or βlightgreenβ. - If there are outliers, the x axis will be extened and the graph wonβt look good. We might want to trim the outliers. To do that, we subest the vector, to only keep observations lower than 10000. We do it by adding instruction:[listings$price_num<10000] after we put the vector. It tells R which rows of the vector/column of prices to keep. Change this limit to 20000 to see what happens.
hist(listings$price_num, main = "Prices", xlab = "Reviews", col = "lightgreen", breaks = 30)
### It seems outliers are making the x axis very large. We can only use observations with price lower than 10000 to see how this changes the graph
hist(listings$price_num[listings$price_num<10000], main = "Prices trimmed", xlab = "Reviews", col = "lightgreen", breaks = 30)
What is the average price of an Airbnb in Mexico City?
mean(listings$price_num, na.rm = TRUE)
## [1] 1619.112
What is the 90th percentile of the price?
What about 10th Percentile? hint: use quantile() function
quantile(listings$price_num, probs = 0.9, na.rm = TRUE)
## 90%
## 2399
quantile(listings$price_num, probs = 0.1, na.rm = TRUE)
## 10%
## 350
In many statistical procedures β like hypothesis testing or constructing confidence intervals β we need to calculate percentiles (quantiles) from theoretical probability distributions such as:
R provides built-in functions to calculate quantiles from these distributions:
qnorm() for the normal distributionqt() for the Studentβs t-distributionqf() for the F-distributionThese functions give you the cutoff value corresponding to a given probability in the lower tail of the distribution.
# 95th percentile of the standard normal distribution
qnorm(0.95)
## [1] 1.644854
# 10th percentile of a t-distribution with 25 degrees of freedom
qt(0.10, df = 25)
## [1] -1.316345
# 80th percentile of an F-distribution with 5 and 10 degrees of freedom
qf(0.80, df1 = 5, df2 = 10)
## [1] 1.802724
# Note: these are lower-tail probabilities (i.e., P(X β€ x))
Use the appropriate quantile functions in R to answer the following:
# Find the 99% percentile of the standard normal distribution
# Find the 5% percentile of the Student's t-distribution with 20 degrees of freedom
# Find the 10% percentile of the F-distribution with 100 and 100 degrees of freedom
qnorm(0.99)
## [1] 2.326348
qt(0.05, df = 20)
## [1] -1.724718
qf(0.10, df1 = 100, df2 = 100)
## [1] 0.7731327
Understanding how variables relate to one another is key in data analysis. Depending on the types of variables, we use different tools:
Letβs start by examining the relationship between
host_response_time and host_is_superhost.
We are interested in whether superhosts respond faster than
non-superhosts. Values for host_is_superhost include: - βtβ
(true) - βfβ (false) - NA (no information)
# Contingency table: host_response_time vs. host_is_superhost
table(listings$host_response_time, listings$host_is_superhost)
##
## f t
## a few days or more 280 308 8
## N/A 1302 1772 187
## within a day 359 397 77
## within a few hours 561 617 305
## within an hour 2692 4110 4815
We now compute the conditional probabilities of different response times, given superhost status. Since we are conditioning on superhost status, the probabilities in columns should add up to 1. Se we are doing it across columns, we use margin=2. If we were to do it by rows (if host_is_superhost was the first argument in the function and table was transposed), we would use margin=1.
# Conditional probabilities: response time | superhost
prop.table(table(listings$host_response_time, listings$host_is_superhost), margin = 2)
##
## f t
## a few days or more 0.05390836 0.04275403 0.00148368
## N/A 0.25067385 0.24597446 0.03468101
## within a day 0.06911821 0.05510827 0.01428042
## within a few hours 0.10800924 0.08564686 0.05656528
## within an hour 0.51829034 0.57051638 0.89298961
βοΈ Exercise: Reverse the order of the variables in
the table so that superhost status is the row variable and response time
is the column variable. Compute the conditional probabilities
again.
π What margin do you need to change?
π What changes in interpretation?
# Your code here
table(listings$host_is_superhost, listings$host_response_time)
##
## a few days or more N/A within a day within a few hours within an hour
## 280 1302 359 561 2692
## f 308 1772 397 617 4110
## t 8 187 77 305 4815
prop.table(table(listings$host_is_superhost, listings$host_response_time), margin = 1)
##
## a few days or more N/A within a day within a few hours
## 0.05390836 0.25067385 0.06911821 0.10800924
## f 0.04275403 0.24597446 0.05510827 0.08564686
## t 0.00148368 0.03468101 0.01428042 0.05656528
##
## within an hour
## 0.51829034
## f 0.57051638
## t 0.89298961
βοΈ Question: What is the difference in the probability of responding within an hour for superhosts vs.Β non-superhosts?
We now explore how prices vary across categories. For the categorical variable, letβs take the neighborhood.
average_prices <- listings %>% #this code pick the group variable (neighbourhood_cleansed) and summarizes values in each of these categories.
group_by(neighbourhood_cleansed) %>%
summarise(avg_price = mean(price_num, na.rm = TRUE)) # to summarize we used mean in this case
# Barplot of average price
average_prices %>%
ggplot(aes(x = neighbourhood_cleansed, y = avg_price)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Average Price by Neighbourhood",
x = "Neighbourhood", y = "Average Price (MXN)") +
theme_minimal()
βοΈ Exercise 11: Modify the code above to compute and plot median prices instead of means. (Replace βmeanβ function with βmedianβ in summarize). Why there might be a difference?
# Your code here
median_prices <- listings %>%
group_by(neighbourhood_cleansed) %>%
summarise(med_price = median(price_num, na.rm = TRUE))
median_prices %>%
ggplot(aes(x = neighbourhood_cleansed, y = med_price)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Median Price by Neighbourhood",
x = "Neighbourhood", y = "Median Price (MXN)") +
theme_minimal()
βοΈ Question: How did the results change? Why might median be more appropriate?
Letβs explore the relationship between price and review score rating. Play with the ylim(0,NA) in the function below. Replace it with a reasonable price per night in Mexico City (for example 10000). This should help to skip the outliers
# Ensure numeric review scores
listings$review_scores_rating_num <- as.numeric(listings$review_scores_rating)
# Scatterplot
ggplot(listings, aes(x = review_scores_rating_num, y = price_num)) +
geom_point(alpha = 0.5, color = "blue") +
labs(title = "Price vs. Review Score",
x = "Review Score", y = "Price (MXN)") +
ylim(0, NA)
## Warning: Removed 1848 rows containing missing values or values outside the scale range
## (`geom_point()`).
This scatterplot looks noisy β common when plotting many points.
To reduce scatterplot noise, we group data into bins (e.g., quantiles) and then plot average price per bin. In the code below, we divided listings into 10 equal groups (deciles) of review rating. 1st group has lowest review, last group has highest review. We will calculate in each group average price and average rating and plot against each other.
# Bin review scores into 20 quantiles
listings$score_bin <- ntile(listings$review_scores_rating_num, 10)
# Average price by bin
listings %>%
group_by(score_bin) %>%
summarise(
mean_price = mean(price_num, na.rm = TRUE),
avg_score = mean(review_scores_rating_num, na.rm = TRUE)
) %>%
ggplot(aes(x = avg_score, y = mean_price)) +
geom_line() +
geom_point() +
labs(title = "Average Price by Review Score Bin",
x = "Average Review Score (bin average)", y = "Average Price (MXN)") +
theme_minimal() +
ylim(0, NA)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
βοΈ Exercise: Try different bin counts (e.g., 5, 50). What happens to the trend?
Letβs also eliminate the outliers. The function filter(price_num<20000) will only keep the rows that have price higher than 20000 per night. Replace it with 10000. Why the graph looks different once we filter the data in that way?
# Bin review scores into 20 quantiles
listings$score_bin <- ntile(listings$review_scores_rating_num, 10)
# Average price by bin
listings %>%
filter(price_num<20000)%>%
group_by(score_bin) %>%
summarise(
mean_price = mean(price_num, na.rm = TRUE),
avg_score = mean(review_scores_rating_num, na.rm = TRUE)
) %>%
ggplot(aes(x = avg_score, y = mean_price)) +
geom_line() +
geom_point() +
labs(title = "Average Price by Review Score Bin",
x = "Average Review Score (bin average)", y = "Average Price (MXN)") +
theme_minimal() +
ylim(0, NA)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
We now compute summary statistics to describe the linear relationship between price and review score.
# Covariance
cov(listings$price_num, listings$review_scores_rating_num, use = "complete.obs")
## [1] 35.34035
# Correlation
cor(listings$price_num, listings$review_scores_rating_num, use = "complete.obs")
## [1] 0.007779218
What do we see? Why do you think we might see that? - Answer here: We observe that the covariance between price and review score is positive (about 35.34), but the correlation is extremely small (about 0.0078), which is very close to zero, indicating that there is practically no linear relationship between price and review score. This occurs because covariance depends on the scale of the variables and price varies much more than review scores, while correlation standardizes both variables and reflects the true strength of the association. The near-zero correlation suggests that more expensive listings do not systematically receive better or worse ratings. A likely reason is that review scores are bounded within a narrow range (usually close to 4β5 stars), while prices vary widely, and that ratings depend more on factors such as cleanliness, service quality, location, and host behavior than on price itself.