Welcome to Data Summarizing and Visualization in R!

This homework will guide you through the basics of RMarkdown, loading data, visualizing, and summarizing Airbnb data. You will learn by doing, with lots of opportunities to try things out, make changes, and see what happens. The goal is to help you become comfortable with data analysis in R, and to have fun exploring real business data, with relevant to current discussion around social issuses in CDMX.

What is RMarkdown and How Does It Work?

RMarkdown is a tool for combining text, code, and results in a single document. You write your analysis in plain text, add code chunks (in R), and when you “knit” the document, it runs the code and shows the output (tables, plots, etc.) right in the document. This makes it easy to create reproducible reports and share your work.

You can run code chunks by clicking the green play button next to each chunk, or by pressing Ctrl + Enter to run a line in the R console.
You do the above if you want to try running a line, or try experimenting.
When you click “Knit”, RStudio will run all code chunks and generate a formatted report (HTML, PDF, or Word). You do this usually when you are ready to complete the document.
If you try to knit and you get an error, review it, and if you can’t figure out - put it into chatgpt for a hint what to do. This document should knit before you modify anything (test it!). So if it can’t knit, likely something is wrong with your modifications.
You can mix explanations, instructions, and code so your analysis is clear and easy to follow.
So first, go one by one executing codes and changing things as requested. Then, once you have done everything knit it to get the html file.

Loading Packages: The tidyverse and ggplot2

In this homework, we will use the tidyverse package collection, which includes tools for data manipulation (dplyr), visualization (ggplot2), and more. These are basically ready codes for functions that other users wrote to make analysis easier. The main package for graphics is ggplot2, which is part of the tidyverse.

How to load packages:

install.packages("tidyverse") # Run this once if you don't have tidyverse installed
library(tidyverse) # Loads dplyr, ggplot2, and other useful packages

Note: If you only want to use ggplot2, you can load it with library(ggplot2). But loading tidyverse gives you access to many helpful tools for data analysis.

Step 1: Loading the Airbnb Data

Let’s start with loading the data from Rda file. Usually you can also load it from Excel or CSV, which we will explore later. Read through the chunk below, fill the path, delete the revelant # and run the line to load the data. Also, read or install library tidyverse.

# Option 2: Load from Rda
# If you placed the data in the same directory as this RMarkdown file, you can use (don't forget to eliminate #):

 load("listings.Rda") ##if this doesn't work, use the full path
# or use the full path
#Otherwise, provide the full path to the file, like:
# load("C:/path/to/your/listings.Rda")


# head(listings) # this will show the first 10 rows of the data



#install.packages("tidyverse") # if you need to instlal it, just eliminate the initial # and run this line. After that, load it using the command below
library(tidyverse) #make sure you have this library. If not, install it using command

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step 2: Explore the Data Structure

Let’s see what’s inside! This helps you understand what variables you have and what they mean. What are variables? Variables are columns in your dataset, each representing a different type of information (e.g., price, number_of_reviews, neighbourhood). Each row is an observation (a listing).

Where to look at variables in RStudio: - After loading your data (by selecting a line in the chunk “listings <- read.csv(”C:/path/to/your/airbnb_listings.csv”)” and running it - or the one with RData), look in the Environment pane (top right). You should see object called listings and click on the listings object to open a spreadsheet-like viewer. - You can also use the command View(listings) to open the data viewer in RStudio.

How to list variables with commands: - Use names(listings) to see all variable names.

names(listings) # Lists all variable names

##  [1] "id"                                          
##  [2] "listing_url"                                 
##  [3] "scrape_id"                                   
##  [4] "last_scraped"                                
##  [5] "source"                                      
##  [6] "name"                                        
##  [7] "description"                                 
##  [8] "neighborhood_overview"                       
##  [9] "picture_url"                                 
## [10] "host_id"                                     
## [11] "host_url"                                    
## [12] "host_name"                                   
## [13] "host_since"                                  
## [14] "host_location"                               
## [15] "host_about"                                  
## [16] "host_response_time"                          
## [17] "host_response_rate"                          
## [18] "host_acceptance_rate"                        
## [19] "host_is_superhost"                           
## [20] "host_thumbnail_url"                          
## [21] "host_picture_url"                            
## [22] "host_neighbourhood"                          
## [23] "host_listings_count"                         
## [24] "host_total_listings_count"                   
## [25] "host_verifications"                          
## [26] "host_has_profile_pic"                        
## [27] "host_identity_verified"                      
## [28] "neighbourhood"                               
## [29] "neighbourhood_cleansed"                      
## [30] "neighbourhood_group_cleansed"                
## [31] "latitude"                                    
## [32] "longitude"                                   
## [33] "property_type"                               
## [34] "room_type"                                   
## [35] "accommodates"                                
## [36] "bathrooms"                                   
## [37] "bathrooms_text"                              
## [38] "bedrooms"                                    
## [39] "beds"                                        
## [40] "amenities"                                   
## [41] "price"                                       
## [42] "minimum_nights"                              
## [43] "maximum_nights"                              
## [44] "minimum_minimum_nights"                      
## [45] "maximum_minimum_nights"                      
## [46] "minimum_maximum_nights"                      
## [47] "maximum_maximum_nights"                      
## [48] "minimum_nights_avg_ntm"                      
## [49] "maximum_nights_avg_ntm"                      
## [50] "calendar_updated"                            
## [51] "has_availability"                            
## [52] "availability_30"                             
## [53] "availability_60"                             
## [54] "availability_90"                             
## [55] "availability_365"                            
## [56] "calendar_last_scraped"                       
## [57] "number_of_reviews"                           
## [58] "number_of_reviews_ltm"                       
## [59] "number_of_reviews_l30d"                      
## [60] "first_review"                                
## [61] "last_review"                                 
## [62] "review_scores_rating"                        
## [63] "review_scores_accuracy"                      
## [64] "review_scores_cleanliness"                   
## [65] "review_scores_checkin"                       
## [66] "review_scores_communication"                 
## [67] "review_scores_location"                      
## [68] "review_scores_value"                         
## [69] "license"                                     
## [70] "instant_bookable"                            
## [71] "calculated_host_listings_count"              
## [72] "calculated_host_listings_count_entire_homes" 
## [73] "calculated_host_listings_count_private_rooms"
## [74] "calculated_host_listings_count_shared_rooms" 
## [75] "reviews_per_month"

Try it: Look at the output. What variables seem interesting for business analysis? (e.g., price, number_of_reviews, neighbourhood)

Explanation: When you open the data viewer in RStudio (by clicking the listings object or using View(listings)), you will see: - All the variables (columns) in your dataset, with their names at the top. - The type of each variable (e.g., numeric, character) is shown in the summary and structure outputs. - Example values for each variable are visible in the first few rows. - You can click on the column headers to sort the data by that variable (e.g., sort by price to see the most or least expensive listings). - This helps you quickly spot patterns, outliers, and interesting variables for analysis.

The price column in the raw data is stored as text, with dollar signs and commas (e.g., “$1,200.00”).
We need to clean it first, before we can use it in numerical analysis.
Most of data will require cleaning before use.
gsub("[$,]", "", listings$price) removes all dollar signs and commas (it replaces them with empty field), leaving just the numbers as text.
as.numeric(...) converts the cleaned text to a numeric variable, so you can do calculations.
summary(listings$price_num) gives you a quick overview: min, max, mean, median, and how many missing values (NAs) there are.

# Remove $ and commas from the text, convert price to numeric variable
listings$price_num <- as.numeric(gsub("[$,]", "", listings$price)) 
summary(listings$price_num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     529     900    1619    1445  888888

What are missing values (NA)? Missing values (NAs) are entries where data is not available or not recorded. They are very common in real-world datasets. In our example they can happen, for instance, when it was not possible to convert a price to numerical value.

Why do missing values matter? - Many R functions (like mean, median, quantile) will return NA if there are any missing values, unless you tell R to ignore them. - This can make your results look strange or incomplete if you don’t handle NAs properly.

How to handle missing values: - Use the argument na.rm = TRUE in most R functions to remove missing values before calculation.

Try it: Run these commands and compare the results:

# Try calculating the mean with and without removing NAs
mean(listings$price_num) # Returns NA if there are missing values

## [1] 1619.112

mean(listings$price_num, na.rm = TRUE) # Ignores missing values

## [1] 1619.112

# Try with number_of_reviews
mean(listings$number_of_reviews_num) # Returns NA if there are missing values

## Warning in mean.default(listings$number_of_reviews_num): argument is not
## numeric or logical: returning NA

## [1] NA

mean(listings$number_of_reviews_num, na.rm = TRUE) # Ignores missing values

## Warning in mean.default(listings$number_of_reviews_num, na.rm = TRUE): argument
## is not numeric or logical: returning NA

## [1] NA

Now try calculating mean number of reviews in the chunk below. It is represented with number_of_reviews. You might need to clean it first!

Try it: - Use sum(is.na(listings$price_num)) to count how many are missing. - Try the same for other columns, like number_of_reviews_num.

Explanation: - If you don’t handle missing values, your summary statistics may be NA or misleading. - Always check for NAs and use na.rm = TRUE when needed!

listings$number_of_reviews_num <- as.numeric(listings$number_of_reviews)
mean(listings$number_of_reviews_num, na.rm = TRUE)

## [1] 49.51017

Locations of Airbnbs

Let’s explore where most of Airbnbs are in Mexico City. The neighborhood (cleaned name) is represented with the variable: neighbourhood_cleansed

Use this variable together with function table() to find number of listings per neighborhood.

table(listings$neighbourhood_cleansed)

## 
##         Álvaro Obregón           Azcapotzalco          Benito Juárez 
##                    825                    172                   2408 
##               Coyoacán  Cuajimalpa de Morelos             Cuauhtémoc 
##                   1356                    411                   7557 
##      Gustavo A. Madero              Iztacalco             Iztapalapa 
##                    284                    196                    206 
## La Magdalena Contreras         Miguel Hidalgo             Milpa Alta 
##                    115                   3014                     14 
##                Tláhuac                Tlalpan    Venustiano Carranza 
##                     34                    629                    453 
##             Xochimilco 
##                    116

Step 3: visualizing the data

What graph would you use to visualize the number of listings per neighborhood? (no need to do maps, we just want visualize frequencies per neighborhood)

Try out some graphs and choose the best one in the chunk below.

listings %>%
  count(neighbourhood_cleansed) %>%
  ggplot(aes(x = reorder(neighbourhood_cleansed, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x="Neighbourhood", y="Number of listings", title="Listings per neighbourhood") +
  theme_minimal()

Step 4: visualizing numerical data

Let’s look at the distribution of prices in Mexico City

Try it: - Change the number of breaks in the histogram. What happens? - Change the color. Try “orange” or “lightgreen”. - If there are outliers, the x axis will be extened and the graph won’t look good. We might want to trim the outliers. To do that, we subest the vector, to only keep observations lower than 10000. We do it by adding instruction:[listings$price_num<10000] after we put the vector. It tells R which rows of the vector/column of prices to keep. Change this limit to 20000 to see what happens.

hist(listings$price_num, main = "Prices", xlab = "Reviews", col = "lightgreen", breaks = 30)

### It seems outliers are making the x axis very large. We can only use observations with price lower than 10000 to see how this changes the graph

hist(listings$price_num[listings$price_num<10000], main = "Prices trimmed", xlab = "Reviews", col = "lightgreen", breaks = 30)

What is the average price of an Airbnb in Mexico City?

mean(listings$price_num, na.rm = TRUE)

## [1] 1619.112

What is the 90th percentile of the price?

What about 10th Percentile? hint: use quantile() function

quantile(listings$price_num, probs = 0.9, na.rm = TRUE)

##  90% 
## 2399

quantile(listings$price_num, probs = 0.1, na.rm = TRUE)

## 10% 
## 350

Sidenote: 🎯 Percentiles from Theoretical Distributions

In many statistical procedures — like hypothesis testing or constructing confidence intervals — we need to calculate percentiles (quantiles) from theoretical probability distributions such as:

The normal distribution (e.g. z-scores)
The Student’s t-distribution (e.g. when sample size is small)
The F-distribution (e.g. in ANOVA or testing model fit)

R provides built-in functions to calculate quantiles from these distributions:

qnorm() for the normal distribution
qt() for the Student’s t-distribution
qf() for the F-distribution

These functions give you the cutoff value corresponding to a given probability in the lower tail of the distribution.

📌 Examples

# 95th percentile of the standard normal distribution
qnorm(0.95)

## [1] 1.644854

# 10th percentile of a t-distribution with 25 degrees of freedom
qt(0.10, df = 25)

## [1] -1.316345

# 80th percentile of an F-distribution with 5 and 10 degrees of freedom
qf(0.80, df1 = 5, df2 = 10)

## [1] 1.802724

# Note: these are lower-tail probabilities (i.e., P(X ≤ x))

🧪 Your Turn: Quantile Practice

Use the appropriate quantile functions in R to answer the following:

# Find the 99% percentile of the standard normal distribution

# Find the 5% percentile of the Student's t-distribution with 20 degrees of freedom

# Find the 10% percentile of the F-distribution with 100 and 100 degrees of freedom
qnorm(0.99)

## [1] 2.326348

qt(0.05, df = 20)

## [1] -1.724718

qf(0.10, df1 = 100, df2 = 100)

## [1] 0.7731327

🔗 Step 6: Exploring Associations Between Variables

Understanding how variables relate to one another is key in data analysis. Depending on the types of variables, we use different tools:

Two categorical variables → contingency tables
One categorical and one numerical → group summaries and boxplots
Two numerical variables → scatterplots and correlation

1️⃣ Two Categorical Variables

Let’s start by examining the relationship between host_response_time and host_is_superhost.

We are interested in whether superhosts respond faster than non-superhosts. Values for host_is_superhost include: - “t” (true) - “f” (false) - NA (no information)

# Contingency table: host_response_time vs. host_is_superhost
table(listings$host_response_time, listings$host_is_superhost)

##                     
##                              f    t
##   a few days or more  280  308    8
##   N/A                1302 1772  187
##   within a day        359  397   77
##   within a few hours  561  617  305
##   within an hour     2692 4110 4815

We now compute the conditional probabilities of different response times, given superhost status. Since we are conditioning on superhost status, the probabilities in columns should add up to 1. Se we are doing it across columns, we use margin=2. If we were to do it by rows (if host_is_superhost was the first argument in the function and table was transposed), we would use margin=1.

# Conditional probabilities: response time | superhost
prop.table(table(listings$host_response_time, listings$host_is_superhost), margin = 2)

##                     
##                                          f          t
##   a few days or more 0.05390836 0.04275403 0.00148368
##   N/A                0.25067385 0.24597446 0.03468101
##   within a day       0.06911821 0.05510827 0.01428042
##   within a few hours 0.10800924 0.08564686 0.05656528
##   within an hour     0.51829034 0.57051638 0.89298961

✏️ Exercise: Reverse the order of the variables in the table so that superhost status is the row variable and response time is the column variable. Compute the conditional probabilities again.
📌 What margin do you need to change?
📌 What changes in interpretation?

# Your code here
table(listings$host_is_superhost, listings$host_response_time)

##    
##     a few days or more  N/A within a day within a few hours within an hour
##                    280 1302          359                561           2692
##   f                308 1772          397                617           4110
##   t                  8  187           77                305           4815

prop.table(table(listings$host_is_superhost, listings$host_response_time), margin = 1)

##    
##     a few days or more        N/A within a day within a few hours
##             0.05390836 0.25067385   0.06911821         0.10800924
##   f         0.04275403 0.24597446   0.05510827         0.08564686
##   t         0.00148368 0.03468101   0.01428042         0.05656528
##    
##     within an hour
##         0.51829034
##   f     0.57051638
##   t     0.89298961

✍️ Question: What is the difference in the probability of responding within an hour for superhosts vs. non-superhosts?

2️⃣ One Categorical, One Numerical Variable

We now explore how prices vary across categories. For the categorical variable, let’s take the neighborhood.

💡 Average Price by Neighbourhood

average_prices <- listings %>% #this code pick the group variable (neighbourhood_cleansed) and summarizes values in each of these categories.
  group_by(neighbourhood_cleansed) %>%
  summarise(avg_price = mean(price_num, na.rm = TRUE)) # to summarize we used mean in this case

# Barplot of average price
average_prices %>%
  ggplot(aes(x = neighbourhood_cleansed, y = avg_price)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Average Price by Neighbourhood",
       x = "Neighbourhood", y = "Average Price (MXN)") +
  theme_minimal()

✏️ Exercise 11: Modify the code above to compute and plot median prices instead of means. (Replace “mean” function with “median” in summarize). Why there might be a difference?

# Your code here
median_prices <- listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(med_price = median(price_num, na.rm = TRUE))

median_prices %>%
  ggplot(aes(x = neighbourhood_cleansed, y = med_price)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Median Price by Neighbourhood",
       x = "Neighbourhood", y = "Median Price (MXN)") +
  theme_minimal()

✍️ Question: How did the results change? Why might median be more appropriate?

3️⃣ Two Numerical Variables

Let’s explore the relationship between price and review score rating. Play with the ylim(0,NA) in the function below. Replace it with a reasonable price per night in Mexico City (for example 10000). This should help to skip the outliers

# Ensure numeric review scores
listings$review_scores_rating_num <- as.numeric(listings$review_scores_rating)

# Scatterplot
ggplot(listings, aes(x = review_scores_rating_num, y = price_num)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Price vs. Review Score",
       x = "Review Score", y = "Price (MXN)")  +
  ylim(0, NA)

## Warning: Removed 1848 rows containing missing values or values outside the scale range
## (`geom_point()`).

This scatterplot looks noisy — common when plotting many points.

📊 Bonus: Binned Averages

To reduce scatterplot noise, we group data into bins (e.g., quantiles) and then plot average price per bin. In the code below, we divided listings into 10 equal groups (deciles) of review rating. 1st group has lowest review, last group has highest review. We will calculate in each group average price and average rating and plot against each other.

# Bin review scores into 20 quantiles
listings$score_bin <- ntile(listings$review_scores_rating_num, 10)

# Average price by bin
listings %>%
  group_by(score_bin) %>%
  summarise(
    mean_price = mean(price_num, na.rm = TRUE),
    avg_score = mean(review_scores_rating_num, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = avg_score, y = mean_price)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Price by Review Score Bin",
       x = "Average Review Score (bin average)", y = "Average Price (MXN)") +
  theme_minimal() +
  ylim(0, NA)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

✏️ Exercise: Try different bin counts (e.g., 5, 50). What happens to the trend?

Let’s also eliminate the outliers. The function filter(price_num<20000) will only keep the rows that have price higher than 20000 per night. Replace it with 10000. Why the graph looks different once we filter the data in that way?

# Bin review scores into 20 quantiles
listings$score_bin <- ntile(listings$review_scores_rating_num, 10)

# Average price by bin
listings %>%
  filter(price_num<20000)%>%
  group_by(score_bin) %>%
  summarise(
    mean_price = mean(price_num, na.rm = TRUE),
    avg_score = mean(review_scores_rating_num, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = avg_score, y = mean_price)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Price by Review Score Bin",
       x = "Average Review Score (bin average)", y = "Average Price (MXN)") +
  theme_minimal() +
  ylim(0, NA)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

📏 Covariance and Correlation

We now compute summary statistics to describe the linear relationship between price and review score.

# Covariance
cov(listings$price_num, listings$review_scores_rating_num, use = "complete.obs")

## [1] 35.34035

# Correlation
cor(listings$price_num, listings$review_scores_rating_num, use = "complete.obs")

## [1] 0.007779218

What do we see? Why do you think we might see that? - Answer here: We observe that the covariance between price and review score is positive (about 35.34), but the correlation is extremely small (about 0.0078), which is very close to zero, indicating that there is practically no linear relationship between price and review score. This occurs because covariance depends on the scale of the variables and price varies much more than review scores, while correlation standardizes both variables and reflects the true strength of the association. The near-zero correlation suggests that more expensive listings do not systematically receive better or worse ratings. A likely reason is that review scores are bounded within a narrow range (usually close to 4–5 stars), while prices vary widely, and that ratings depend more on factors such as cleanliness, service quality, location, and host behavior than on price itself.

Homework_summarizing_data

2025-07-19