1.- Introduction

Airbnb is an online marketplace that connects people who want to rent out their homes with those who are looking for accommodations. This data analysis report aims to explore the Airbnb dataset to uncover insights and trends that can help inform business decisions and recommend possible prices for new listings.

Currently, Airbnb does not offer a way to export data directly from its website. However, some websites provide Airbnb data for free. One of the most popular is Inside Airbnb, which provides data for a variety of cities around the world, including information on listings, reviews, and host profiles. Even so, the data provided by Inside Airbnb is not real-time, may be out of date, and did not cover the city I was interested in. Therefore, I decided to scrape the data from the Airbnb website using Python and BeautifulSoup.

The main challenge in the web scraping process was to understand how Airbnb manages pagination and how to navigate to specific pages. After analyzing multiple pages, I identified that the URL structure follows a specific pattern using cursor-based pagination. This means that it uses a Base64-encoded string that contains information about section_offset, items_offset and version. By changing the items_offset, we can navigate to different pages. A Python function was created to automate this process.

Once the data was scraped, it was saved into text files with the page number in the file name. The extraction covered 1000 pages because some pages returned duplicate listings. The files were then loaded into R and merged into a single dataframe, which was later loaded into MongoDB; I chose this database because the data was already in JSON format.

The data was then cleaned and prepared for analysis. The data cleansing process involved unnesting the nested fields and generating wide columns. The column names were updated to make them more readable and easier to work with. The Google Maps API was used to populate the zip code and city for each listing based on the latitude and longitude coordinates, and the address information was merged back into the main data frame.

Finally, the data was analyzed using ggmap to create a map of the listings in Atlantic City; the map shows the location of each listing. A new data frame called listings was created with only the variables needed for the price analysis. I decided to use randomForest to predict the price of the listings based on the features in the data set. Using randomForest I was able to create a model to recommend a price for new listings in Atlantic City. If more data were available, I could have used a more complex model to predict the price of the listings; since we have the ids for each listing, a future project could use web scraping again to pull details about each listing (e.g., number of bathrooms, number of rooms).


2.- Data Collection

Web scraping was used to collect data from the Airbnb website for Atlantic City. The data included information about listings, review ratings, and prices. The data was extracted using Python and BeautifulSoup and saved into text files, to be later loaded into MongoDB for further analysis. The Google Maps API was used to populate the zip code and city for each listing based on the latitude and longitude coordinates.

2.1.- Web Scraping Process

The web scraping process involved the following steps:

2.1.1.- Airbnb URL pattern

I started by analyzing the structure of the Airbnb website URL to identify the pattern for pagination:

https://www.airbnb.com/s/Atlantic-City--New-Jersey--United-States/homes?tab_id=home_tab &refinement_paths%5B%5D=%2Fhomes &query=Atlantic%20City%2C%20New%20Jersey%2C%20United%20States &place_id=ChIJIcdcblfdwIkRYlJn6UPLb0o &flexible_trip_lengths%5B%5D=one_week &monthly_start_date=2024-12-01 &monthly_length=3 &monthly_end_date=2025-03-01 &search_mode=regular_search &price_filter_input_type=0 &channel=EXPLORE &search_type=unknown &price_filter_num_nights=5 &federated_search_session_id=d64e8a69-3c91-4d03-a688-f50df0d65d06 &pagination_search=true &cursor=eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjowLCJ2ZXJzaW9uIjoxfQ%3D%3D

https://www.airbnb.com/s/Atlantic-City--New-Jersey--United-States/homes?tab_id=home_tab &refinement_paths%5B%5D=%2Fhomes &query=Atlantic%20City%2C%20New%20Jersey%2C%20United%20States &place_id=ChIJIcdcblfdwIkRYlJn6UPLb0o &flexible_trip_lengths%5B%5D=one_week &monthly_start_date=2024-12-01 &monthly_length=3 &monthly_end_date=2025-03-01 &search_mode=regular_search &price_filter_input_type=0 &channel=EXPLORE &search_type=unknown &price_filter_num_nights=5 &federated_search_session_id=d64e8a69-3c91-4d03-a688-f50df0d65d06 &pagination_search=true &cursor=eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0%3D

In the previous two URLs we can determine that the cursor is the part that changes when we navigate to different pages. The cursor is a Base64-encoded string that contains information about section_offset, items_offset and version. By changing the items_offset, we can navigate to different pages.

  • section_offset: Indicates the section offset (often 0 for standard searches).
  • items_offset: Indicates the item offset for the results.
  • version: Indicates the version of the pagination system.

From the previous example we can define the following two cursors that were used to navigate to different pages:

  • Cursor1: eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjowLCJ2ZXJzaW9uIjoxfQ==

    Decoded: {"section_offset":0,"items_offset":0,"version":1}

  • Cursor2: eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0=

    Decoded: {"section_offset":0,"items_offset":18,"version":1}
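
As a quick sanity check, a cursor can be decoded in R with jsonlite (a small sketch, used later in this report; base64_dec decodes the Base64 string and rawToChar converts the raw bytes to text):

library(jsonlite)

# Decode the second cursor to inspect its contents
cursor <- "eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0="
fromJSON(rawToChar(base64_dec(cursor)))
# Returns a list: section_offset = 0, items_offset = 18, version = 1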

The items_offset increases in increments of 18 (e.g., 0, 18, 36, 54, 72), based on the number of results per page. A function was created in Python to automate the process of changing the cursor and navigating to the different pages. This function is later used to generate the URLs.

The following formula was used to calculate the offset for a specific page: items_offset = (page_number − 1) × 18. For example, page 3 gives (3 − 1) × 18 = 36.

import base64
import json

def generate_cursor(page_number, results_per_page=18):
    # Offset for the requested page: (page_number - 1) * results_per_page
    items_offset = (page_number - 1) * results_per_page
    cursor_data = {
        "section_offset": 0,
        "items_offset": items_offset,
        "version": 1
    }
    # Serialize to compact JSON (same format as Airbnb's cursors) and Base64-encode it
    cursor_json = json.dumps(cursor_data, separators=(",", ":"))
    return base64.b64encode(cursor_json.encode()).decode()

2.1.2.- Airbnb Data Extraction Process

A Python script was created to extract data from the Airbnb website using BeautifulSoup. The script iterates over the pages and extracts the data for each page. The extracted data is saved into text files for each page. The script uses the requests library to send HTTP requests to the website and the BeautifulSoup library to parse the HTML content.

2.1.2.1.- Install Python Libraries

Using the terminal, the requests and beautifulsoup4 libraries were installed (base64 and json are part of the Python standard library, so they do not need to be installed).

Note: You need to have Python3 and pip3.

pip3 install requests
pip3 install beautifulsoup4

2.1.2.2.- Import packages

  • base64: To encode and decode data
  • requests: To send HTTP requests
  • BeautifulSoup: To parse HTML content
import base64
import requests
from bs4 import BeautifulSoup

2.1.2.3.- Define variables for location

To avoid hard-coding the location in the base URL, the city, state, and country variables were defined separately from the base_url variable.

city = "Atlantic-City"
state = "New-Jersey"
country = "United-States"

base_url = f"https://www.airbnb.com/s/{city}--{state}--{country}/homes"

2.1.2.4.- Mimic browser behavior

Before starting the calls to the website, it's important to mimic the behavior of a browser by setting the user-agent in the headers. This helps prevent the website from blocking the requests.

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

2.1.2.5.- Define the total number of pages to scrape

The total_pages variable was defined to determine the number of pages to scrape. In this case, 1000 pages were scraped.

total_pages = 1000

2.1.2.6.- Web scraping loop

A loop was created to iterate over the pages and extract the data. The cursor was generated for each page, and the URL was constructed with that cursor (using the function defined previously). The request was sent to the website, and the response was parsed with BeautifulSoup to extract the data. The extracted data was saved to a text file for each page in the airbnb-scraper folder. To avoid overwriting files, the page number was included in the file name "airbnb_pagination_search{page}.txt".

for page in range(1, total_pages + 1):
    # Generate the cursor for the current page
    cursor = generate_cursor(page)
    
    # Construct the URL with the cursor
    url = f"{base_url}?pagination_search=true&cursor={cursor}"
    print(f"Scraping Page {page}: {url}")
    
    # Send request
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the JSON <script> block that contains the search results
        scripts = soup.find_all('script', type='application/json')
        data = None
        for script in scripts:
            if 'searchResults' in script.text:
                data = script.text
        
        # Save the page only if the expected data was found
        if data is not None:
            with open(f"airbnb-scraper/airbnb_pagination_search{page}.txt", 'w', encoding='utf-8') as file:
                file.write(data)
        else:
            print(f"No search results found on page {page}")
    else:
        print(f"Failed to load page {page}: {response.status_code}")

The final Python script looks like this:

import base64
import json
import requests
from bs4 import BeautifulSoup

city = "Atlantic-City"
state = "New-Jersey"
country = "United-States"

base_url = f"https://www.airbnb.com/s/{city}--{state}--{country}/homes"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

def generate_cursor(page_number, results_per_page=18):
    items_offset = (page_number - 1) * results_per_page
    cursor_data = {
        "section_offset": 0,
        "items_offset": items_offset,
        "version": 1
    }
    cursor_json = json.dumps(cursor_data, separators=(",", ":"))
    return base64.b64encode(cursor_json.encode()).decode()

total_pages = 1000

for page in range(1, total_pages + 1):
    cursor = generate_cursor(page)
    
    url = f"{base_url}?pagination_search=true&cursor={cursor}"
    print(f"Scraping Page {page}: {url}")
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        scripts = soup.find_all('script', type='application/json')
        data = None
        for script in scripts:
            if 'searchResults' in script.text:
                data = script.text
        
        if data is not None:
            with open(f"airbnb-scraper/airbnb_pagination_search{page}.txt", 'w', encoding='utf-8') as file:
                file.write(data)
        else:
            print(f"No search results found on page {page}")
    else:
        print(f"Failed to load page {page}: {response.status_code}")

2.2.- Load data into MongoDB

The data extracted from the Airbnb website was saved into text files for each page; the next step was to load it into MongoDB. The data was loaded into a collection named Listing_Information in the Airbnb database.

For this purpose, I used R to load the data from the text files, convert it to JSON, and then load it into MongoDB using the mongolite package.

2.2.1.- Install packages in R

The mandatory packages for this process are stringr, jsonlite, and mongolite. They can be installed using the following code:

if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
if (!requireNamespace("mongolite", quietly = TRUE)) {
  install.packages("mongolite")
}

2.2.2.- Load packages into R

Once the packages are installed, they can be loaded into R using the following code:

  • stringr: For string manipulation
  • jsonlite: For converting data to json format
  • mongolite: For connecting to MongoDB
library(stringr)
library(jsonlite)
library(mongolite)

2.2.3.- Open connection to MongoDB

The next step is to open a connection to MongoDB using the mongo function from the mongolite package. The connection details, such as the collection name, database name, and URL, need to be specified. For this analysis I used localhost; however, it can be updated to a different hostname or IP address.

mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")

2.2.4.- Add a Unique Index to the Collection - MongoDB

In order to avoid duplicate listings, a unique index was created on the listing.id field in the Listing_Information collection.

mongo_conn$run('{
  "createIndexes": "Listing_Information",
  "indexes": [
    {
      "key": { "listing.id": 1 },
      "name": "unique_listing_id_index",
      "unique": true
    }
  ]
}')
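
To confirm the index was created, the listIndexes command can be run through the same connection (a quick check; the exact shape of the returned object may vary by MongoDB version):

mongo_conn$run('{"listIndexes": "Listing_Information"}')
# The result should include an index named "unique_listing_id_index"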

2.2.5.- Identify the number of files in the directory

The next step is to identify the number of files in the airbnb-scraper directory that contains the extracted data.

num_files <- length(list.files("airbnb-scraper", pattern = "*.txt", full.names = TRUE))

2.2.6.- Load data from text files into MongoDB

A loop was created to load the data from each text file into R, extract the listing objects, and insert them into MongoDB. The fromJSON function from the jsonlite package parses each extracted match into an R object, toJSON serializes it back to a JSON string, and the insert function from the mongolite package inserts it into MongoDB; this is executed for each match found in the file.

An exception handler was added for duplicate-entry errors. Because of the unique index created in the previous step, a duplicate insert raises an error and is simply skipped.

The pattern used to extract the data from the text files is shown below:

for (num_file in 1:num_files) {
    
  txt_file <- paste0("airbnb-scraper/airbnb_pagination_search", num_file, ".txt")
  
  # Read the file
  text <- readLines(txt_file, warn = FALSE)
  text <- paste(text, collapse = " ")
  
  pattern <- "\\{\"__typename\":\"StaySearchResult\",.*?\"demandStayListing\":\\{\"__typename\".*?\\}\\}"
  
  # Use str_extract_all to find all matches
  matches <- str_extract_all(text, pattern)[[1]]
  
  for (i in seq_along(matches)) {
    json_data <- fromJSON(matches[i])
    json_char <- toJSON(json_data, auto_unbox = TRUE)
    
    tryCatch({
      mongo_conn$insert(json_char)
      cat("Data inserted successfully.\n")
    }, error = function(e) {
      if (grepl("E11000 duplicate key error", e$message)) {
        cat("Duplicate entry detected. Skipping insertion.\n")
      } else {
        cat("An error occurred: ", e$message, "\n")
      }
    })
  }
  
}
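
To sanity-check the extraction pattern, it can be applied to a minimal synthetic snippet shaped like the scraped JSON (the field values here are made up):

sample_text <- '{"__typename":"StaySearchResult","listing":{"id":"123"},"demandStayListing":{"__typename":"DemandStayListing"}}'
pattern <- "\\{\"__typename\":\"StaySearchResult\",.*?\"demandStayListing\":\\{\"__typename\".*?\\}\\}"
str_extract_all(sample_text, pattern)[[1]]
# Returns the full snippet, confirming the pattern captures one complete listing object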

2.2.7.- Validate the number of records in the collection

The final step is to validate the number of records in the Listing_Information collection in MongoDB.

print(mongo_conn$count())
## [1] 484

All the R code together looks like this:

library(stringr)
library(jsonlite)
library(mongolite)

mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")
mongo_conn$run('{
  "createIndexes": "Listing_Information",
  "indexes": [
    {
      "key": { "listing.id": 1 },
      "name": "unique_listing_id_index",
      "unique": true
    }
  ]
}')

num_files <- length(list.files("airbnb-scraper", pattern = "*.txt", full.names = TRUE))

for (num_file in 1:num_files) {
    
  txt_file <- paste0("airbnb-scraper/airbnb_pagination_search", num_file, ".txt")
  text <- readLines(txt_file, warn = FALSE)
  text <- paste(text, collapse = " ")
  
  pattern <- "\\{\"__typename\":\"StaySearchResult\",.*?\"demandStayListing\":\\{\"__typename\".*?\\}\\}"
  
  matches <- str_extract_all(text, pattern)[[1]]

  for (i in seq_along(matches)) {
    json_data <- fromJSON(matches[i])
    json_char <- toJSON(json_data, auto_unbox = TRUE)
    
    tryCatch({
      mongo_conn$insert(json_char)
      cat("Data inserted successfully.\n")
    }, error = function(e) {
      if (grepl("E11000 duplicate key error", e$message)) {
        cat("Duplicate entry detected. Skipping insertion.\n")
      } else {
        cat("An error occurred: ", e$message, "\n")
      }
    })
  }
  
}

print(mongo_conn$count())

3.- Data Cleansing

Now that the data is loaded in MongoDB, I can perform some data cleansing to prepare it for analysis.

The data cleansing process will involve the following steps:

3.1.- Load necessary libraries

The necessary libraries for this process are:

  • stringr: For string manipulation
  • jsonlite: For converting data to json format
  • mongolite: For connecting to MongoDB
  • tidyverse: For data manipulation

They can be loaded using the following code:

library(stringr)
library(jsonlite)
library(mongolite)
library(tidyverse)

3.2.- Open connection to MongoDB

I will connect to the database and collection that I previously created in MongoDB.

mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")

3.3.- Define JSON schema

The data in MongoDB is stored in JSON format. I will define the JSON schema to extract the data from MongoDB. In order to do that I have to validate how the data got loaded into MongoDB.
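
One way to do that is to pull a single raw document and inspect its structure before committing to a schema (a small sketch using mongolite's iterator):

# Fetch one document and print its top-level structure
one_doc <- mongo_conn$iterate(query = '{}', limit = 1)$one()
str(one_doc, max.level = 2)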

fields <- '{"_id": 1,
"listing.id":1,
"listing.coordinate.latitude":1,
"listing.coordinate.longitude":1,
"listing.roomTypeCategory":1,
"listing.listingObjType":1,
"listing.name":1,
"listing.pdpUrlType":1,
"listing.structuredContent.primaryLine": 1,
"listing.structuredContent.mapSecondaryLine": 1,
"listing.title":1,
"listing.titleLocale":1,
"avgRatingA11yLabel":1,
"avgRatingLocalized":1,
"listingParamOverrides.checkin":1,
"listingParamOverrides.checkout":1,
"listingParamOverrides.relaxedAmenityIds":1,
"listingParamOverrides.amenities":1,
"pricingQuote.amenities":1,
"pricingQuote.structuredStayDisplayPrice":1,
"badges.loggingContext.badgeType":1,
"badges.style":1,
"badges.text":1,
"badges.textAccessibilityLabel":1,
"badges.textColor":1,
"contextualPictures":1}'

3.4.- Retrieve data from MongoDB

In this step I used the JSON schema defined in the previous step to extract the data from MongoDB. query <- '{}' is used to retrieve all the documents in the collection.

query <- '{}'
airbnb_data <- mongo_conn$find(query, fields = fields)

3.5.- Unnest columns

The data is in a nested format, so I will generate wide columns for the nested fields to make it easier to analyze.

  • unnest: To unnest the nested fields
  • unnest_wider: To unnest the nested fields and generate wide columns
airbnb_data <- airbnb_data |> 
  unnest(listing) |> 
  unnest(structuredContent)  |>
  unnest_wider(primaryLine) |>       
  unnest_wider(mapSecondaryLine, names_sep = "_msl")

airbnb_data <- airbnb_data |> 
  unnest_wider(badges)      

airbnb_data <- airbnb_data|>
  unnest_wider(contextualPictures, names_sep = "_pic")    

airbnb_data <- airbnb_data|>
  unnest_wider(coordinate)   

airbnb_data <- airbnb_data|>
  unnest_wider(listingParamOverrides)    
  
airbnb_data <- airbnb_data|>
  unnest_wider(pricingQuote) 


airbnb_data <- airbnb_data|>
  unnest_wider(structuredStayDisplayPrice, names_sep = "_")

airbnb_data <- airbnb_data|>
  unnest_wider(structuredStayDisplayPrice_primaryLine, names_sep = "_")

airbnb_data <- airbnb_data|>
  unnest_wider(avgRatingLocalized , names_sep = "_")

airbnb_data <- airbnb_data|>
  unnest_wider(avgRatingA11yLabel, names_sep = "_")

airbnb_data <- airbnb_data|>
  unnest_wider(loggingContext, names_sep = "_") 

airbnb_data <- airbnb_data|>
  unnest_wider(contextualPictures_picpicture, names_sep = "_") 
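
As a small illustration of what unnest_wider does, here is a toy example with a coordinate-style list-column (the names mirror the listing data but the values are invented):

# Each named element of the list-column becomes its own column
toy <- tibble(id = "1", coordinate = list(list(latitude = 39.36, longitude = -74.42)))
toy |> unnest_wider(coordinate)
# Result has columns: id, latitude, longitude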

3.6.- Fix the column names

The column names contain special characters and spaces, so I will update them to make them more readable and easier to work with in later steps.

  • make.names: Makes the column names syntactically valid and unique
colnames(airbnb_data) <- make.names(colnames(airbnb_data), unique = TRUE)

print(colnames(airbnb_data))
##  [1] "X_id"                                                             
##  [2] "latitude"                                                         
##  [3] "longitude"                                                        
##  [4] "id"                                                               
##  [5] "listingObjType"                                                   
##  [6] "name"                                                             
##  [7] "pdpUrlType"                                                       
##  [8] "roomTypeCategory"                                                 
##  [9] "mapSecondaryLine_msl__typename"                                   
## [10] "mapSecondaryLine_mslbody"                                         
## [11] "X__typename"                                                      
## [12] "body"                                                             
## [13] "type"                                                             
## [14] "title"                                                            
## [15] "titleLocale"                                                      
## [16] "avgRatingA11yLabel_1"                                             
## [17] "avgRatingLocalized_1"                                             
## [18] "checkin"                                                          
## [19] "checkout"                                                         
## [20] "relaxedAmenityIds"                                                
## [21] "amenities"                                                        
## [22] "structuredStayDisplayPrice___typename"                            
## [23] "structuredStayDisplayPrice_primaryLine___typename"                
## [24] "structuredStayDisplayPrice_primaryLine_displayComponentType"      
## [25] "structuredStayDisplayPrice_primaryLine_accessibilityLabel"        
## [26] "structuredStayDisplayPrice_primaryLine_discountedPrice"           
## [27] "structuredStayDisplayPrice_primaryLine_originalPrice"             
## [28] "structuredStayDisplayPrice_primaryLine_qualifier"                 
## [29] "structuredStayDisplayPrice_primaryLine_shortQualifier"            
## [30] "structuredStayDisplayPrice_primaryLine_concatQualifierLeft"       
## [31] "structuredStayDisplayPrice_primaryLine_trailingContent"           
## [32] "structuredStayDisplayPrice_primaryLine_trailing"                  
## [33] "structuredStayDisplayPrice_primaryLine_price"                     
## [34] "structuredStayDisplayPrice_secondaryLine"                         
## [35] "structuredStayDisplayPrice_explanationData"                       
## [36] "structuredStayDisplayPrice_explanationDataDisplayPosition"        
## [37] "structuredStayDisplayPrice_explanationDataDisplayPriceTriggerType"
## [38] "structuredStayDisplayPrice_layout"                                
## [39] "loggingContext_badgeType"                                         
## [40] "style"                                                            
## [41] "text"                                                             
## [42] "textAccessibilityLabel"                                           
## [43] "textColor"                                                        
## [44] "contextualPictures_pic__typename"                                 
## [45] "contextualPictures_picid"                                         
## [46] "contextualPictures_picpicture_1"                                  
## [47] "contextualPictures_picpicture_2"                                  
## [48] "contextualPictures_picpicture_3"                                  
## [49] "contextualPictures_picpicture_4"                                  
## [50] "contextualPictures_picpicture_5"                                  
## [51] "contextualPictures_picpicture_6"

Create a new dataframe with the columns that I will use for the analysis.

 airbnb_df <- airbnb_data |>
  select(id, title, name, latitude, longitude, roomTypeCategory, type,body,avgRatingLocalized_1,structuredStayDisplayPrice_primaryLine___typename,
         structuredStayDisplayPrice_primaryLine_accessibilityLabel,
         structuredStayDisplayPrice_primaryLine_shortQualifier,
         text,contextualPictures_picpicture_1,
         contextualPictures_picpicture_2,contextualPictures_picpicture_3,
         contextualPictures_picpicture_4,contextualPictures_picpicture_5,
         contextualPictures_picpicture_6) |>
  rename(room_type_category = roomTypeCategory,
         stay_price = structuredStayDisplayPrice_primaryLine_shortQualifier,
         display_price = structuredStayDisplayPrice_primaryLine___typename,
         price_label = structuredStayDisplayPrice_primaryLine_accessibilityLabel,
         avg_rating = avgRatingLocalized_1,
         badge_type = text,
         pic1 = contextualPictures_picpicture_1,
         pic2 = contextualPictures_picpicture_2,
         pic3 = contextualPictures_picpicture_3,
         pic4 = contextualPictures_picpicture_4,
         pic5 = contextualPictures_picpicture_5,
         pic6 = contextualPictures_picpicture_6) |>
  mutate(bed_info = ifelse(grepl("BEDINFO", type, ignore.case = TRUE), body, '0 beds'),
         free_cancellation = ifelse(grepl("Free cancellation", body, ignore.case = TRUE), 'Yes', 'No'),
         rating = as.numeric(str_extract(avg_rating, "[0-9]+\\.[0-9]+")),
         num_ppl_rated = as.numeric(str_remove_all(str_extract(avg_rating, "\\((\\d+)\\)"), "[()]")),
         DiscountedDisplayPriceLine = ifelse(grepl("DiscountedDisplayPriceLine", 
                                                   display_price, 
                                                   ignore.case = TRUE), 'Yes', 'No'),
         discounted_price = str_remove(str_extract(price_label,
                                                 "\\$\\d+(?= per)"), "\\$"),
         original_price = ifelse(is.na(str_remove(str_extract(price_label,
                                                 "\\$\\d+$"), "\\$")) | str_remove(str_extract(price_label,
                                                 "\\$\\d+$"), "\\$") == "", discounted_price, str_remove(str_extract(price_label,
                                                 "\\$\\d+$"), "\\$"))) 

The dataframe will look like this:

print(colnames(airbnb_df))
##  [1] "id"                         "title"                     
##  [3] "name"                       "latitude"                  
##  [5] "longitude"                  "room_type_category"        
##  [7] "type"                       "body"                      
##  [9] "avg_rating"                 "display_price"             
## [11] "price_label"                "stay_price"                
## [13] "badge_type"                 "pic1"                      
## [15] "pic2"                       "pic3"                      
## [17] "pic4"                       "pic5"                      
## [19] "pic6"                       "bed_info"                  
## [21] "free_cancellation"          "rating"                    
## [23] "num_ppl_rated"              "DiscountedDisplayPriceLine"
## [25] "discounted_price"           "original_price"
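
As a quick check of the price regexes used in the mutate above, here is how they behave on a made-up accessibility label of the same shape as the scraped ones:

label <- "$120 per night, originally $150"
str_remove(str_extract(label, "\\$\\d+(?= per)"), "\\$")  # "120" (discounted price)
str_remove(str_extract(label, "\\$\\d+$"), "\\$")         # "150" (original price)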

In previous assignments I used the DT package, but in this case I was getting an error, so I decided to use the reactable package to create an interactive table.

library(reactable)

# Create an interactive table
reactable(
  airbnb_df[1:12], 
  searchable = TRUE, 
  sortable = TRUE, 
  pagination = TRUE,
  highlight = TRUE,
  defaultColDef = colDef(headerStyle = list(whiteSpace = "normal")),
  style = list(overflowX = "auto"),
  defaultPageSize = 5  
)

4.- Data Enrichment - Google Maps API

In this step I will use the Google Maps API to populate the zip code and city for each listing based on the latitude and longitude coordinates. This will be my second source of data.

4.1.- Load main package

The main packages for this process are ggmap, httr, and jsonlite.

  • ggmap: For mapping
  • httr: For sending HTTP requests
  • jsonlite: For converting data to json format

They can be installed using the following code:

if (!requireNamespace("ggmap", quietly = TRUE)) {
  install.packages("ggmap")
}
if (!requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
library(httr)
library(jsonlite)
library(ggmap)

4.2.- Setup API Key

For security reasons, my API key is stored in a JSON file (of the form {"key": "<your key>"}) in a folder that is not committed to GitHub.

ggm_kc <- fromJSON("config/ggm_k.json")
register_google(ggm_kc$key)

4.3.- Add Zip Code and City Columns

In this step I will use the Google Maps API to populate the zip code and city for each listing based on the latitude and longitude coordinates.

  • GET: To send a GET request to the Google Maps API
  • content: To extract the content from the response
  • fromJSON: To convert the content to json format
  • unnest_wider: To unnest the nested fields and generate wide columns
  • str_extract: To extract the zip code from the formatted address
  • str_remove_all: To remove all occurrences of a pattern from a string
  • str_remove: To remove a pattern from a string
  • slice_head: To select the first row of the data frame

The latitude and longitude coordinates are passed as parameters in the URL, along with the API key.

https://maps.googleapis.com/maps/api/geocode/json?latlng=39.3643,-74.4229&key=key_here

In order to get the information for each observation in the dataframe, I will use a loop to iterate over the rows and populate the zip code and city columns.

The address will look like this: 123 S New Jersey Ave, Atlantic City, NJ 08401, USA, so I used regex to extract the zip code and city.

  • zip_code: \\d{5}, (five digits followed by a comma)

  • city: (?<=, )[A-Za-z ]+(?=, [A-Z]{2}) (a run of letters and spaces between a comma and a comma followed by two uppercase letters, i.e. the state abbreviation)
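
Applying both patterns to the sample address confirms the expected captures:

addr <- "123 S New Jersey Ave, Atlantic City, NJ 08401, USA"
str_remove_all(str_extract(addr, "\\d{5},"), ",")     # "08401"
str_extract(addr, "(?<=, )[A-Za-z ]+(?=, [A-Z]{2})")  # "Atlantic City"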

Note: To avoid calling the service multiple times, I call it only once per observation, save the results in a data frame, and later write them to a CSV file.

temp_address_info <- data.frame(id = rep(NA, nrow(airbnb_df)),
                                zip_code = rep(NA, nrow(airbnb_df)),
                                city = rep(NA, nrow(airbnb_df)))

for (i in 1:nrow(airbnb_df)) {
  temp_address_info$id[i] <- airbnb_df$id[i]
  
  url <- paste0("https://maps.googleapis.com/maps/api/geocode/json?latlng=",
                airbnb_df$latitude[i], ",", airbnb_df$longitude[i],
                "&key=", ggm_kc$key)
  
  response <- GET(url)
  
  content <- content(response, as = "text", encoding = "UTF-8")
  json_data <- fromJSON(content)
  
  json_data_df <- data.frame(json_data)

  json_data_df <- json_data_df |> 
    unnest_wider(results.geometry)   

  # Keep only precise (ROOFTOP) results, then extract the zip code and city
  zip_result <- json_data_df |>
    filter(location_type == "ROOFTOP") |>
    select(results.formatted_address) |>
    mutate(zip_code = str_remove_all(str_extract(results.formatted_address, "\\d{5},"), ","),
           city = str_extract(results.formatted_address, "(?<=, )[A-Za-z ]+(?=, [A-Z]{2})")) |>
    slice_head(n = 1)

  if (json_data$status == "OK") {
    temp_address_info$zip_code[i] <- zip_result[2]
    temp_address_info$city[i] <- zip_result[3]
  } else {
    temp_address_info$zip_code[i] <- NA
    temp_address_info$city[i] <- NA
  }
  
}

temp_address_info <- temp_address_info|>
  unnest_wider(zip_code, names_sep = "_") 

temp_address_info <- temp_address_info|>
  unnest_wider(city, names_sep = "_") 

colnames(temp_address_info) <- c("id", "zip_code", "city")

write.table(temp_address_info, "data/temp_address_info.csv",sep = ",", row.names = FALSE, col.names = TRUE)

Now I will use the previous file to merge the address information into the main dataframe.

temp_address_info_reload <- read.csv("data/temp_address_info.csv", 
  colClasses = c("character", "character", "character"),stringsAsFactors = FALSE, quote = "\"")

airbnb_df <- merge(airbnb_df, temp_address_info_reload, by = "id")



library(reactable)

reactable(
  airbnb_df[c(2,22,23,8)], 
  searchable = TRUE, 
  sortable = TRUE, 
  pagination = TRUE,
  highlight = TRUE,
  defaultColDef = colDef(headerStyle = list(whiteSpace = "normal")),
  style = list(overflowX = "auto"),
  defaultPageSize = 5  
)

5.- Exploratory Data Analysis

In this section, I will use the ggmap package to create a map of the listings in Atlantic City. The map will show the location of each listing.

5.1.- Create a Map of the Listings

In this step I will define the variables for the city, state and country. Then I will use the ggmap function to get the map location and plot the map with the listings.

5.1.1.- Setup API Key

In order to use the Google Maps API, you need to have an API key. The API key is used to authenticate your requests to the API and to track usage. You can get an API key by creating a project in the Google Cloud Platform and enabling the Google Maps API.

city <- "Atlantic-City"
state <- "New-Jersey"
country <- "United-States"

if (!requireNamespace("ggmap", quietly = TRUE)) {
  install.packages("ggmap")
}
library(ggmap)

ggm_kc <- fromJSON("config/ggm_k.json")
register_google(ggm_kc$key)

5.1.2.- Generate the Map Location

Using the variables defined for the location, I will use the get_map function to get the map location.

gg_map_location <- get_map(paste(gsub("-", " ", city), gsub("-", " ", state)),
                        maptype='roadmap', #hybrid
                        source="google",
                        zoom=12)

5.1.3.- Plot the Map

I will use the geom_point function to add the points for the listings to the map, pulling the longitude and latitude from the data frame generated earlier.

ggmap(gg_map_location) +
  geom_point(data = airbnb_df, aes(x = longitude, y = latitude), color = "#ff385c", size = 0.5)+
  labs(
    title = paste(gsub("-", " ", city),'-',gsub("-", " ", state),',',country),
  )+
  theme(
    plot.title = element_text(color = "#ff385c", size = 16, face="bold")
  )

In this map, we can confirm that the listings are in the Atlantic City area.

5.2.- Identify Dependent and Independent Variables

5.2.1.- Dependent Variables

price (\(y\)): The dependent variable represents the outcome (target) variable, the listing price in USD; it is the variable of interest being investigated and is a continuous variable.

airbnb_df$discounted_price <- as.numeric(airbnb_df$discounted_price)

ggplot(airbnb_df, aes(x=discounted_price)) + 
  geom_histogram(binwidth=39,fill = "white", colour = "#ff385c")+
  scale_x_continuous(
    breaks = seq(0, max(airbnb_df$discounted_price, na.rm = TRUE) + 50, by = 100),
    expand = c(0, 0)
  ) +
  geom_vline(xintercept = mean(airbnb_df$discounted_price, na.rm = TRUE), linetype = "dashed", color = "darkgray", size = 1)+
  geom_vline(xintercept = median(airbnb_df$discounted_price, na.rm = TRUE), linetype = "dotted", color = "darkgray", size = 1)+
  labs(
    x = "Price",
    y = "Number of Listings",
    title = paste("Airbnb -", gsub("-", " ", city), ",", gsub("-", " ", state), " - Prices Histogram"),
    subtitle = paste("Dependent Variable - Skewed right: the tail is pulled toward the higher prices",
                     "\nThe mean (|)", round(mean(airbnb_df$discounted_price, na.rm = TRUE), 2), "is greater than the median (.)",
                     round(median(airbnb_df$discounted_price, na.rm = TRUE), 2))
  )+
  theme_minimal()

5.2.2.- Independent Variables

The independent variables are the predictors or factors that are believed to have an influence on the dependent variable. Possible Independent variables: bed_info(\(x_1\)), rating(\(x_2\)), badge_type(\(x_3\)), city(\(x_4\)), room_type_category(\(x_5\)), title(\(x_6\))

library(DT)

airbnb_df_ind <- airbnb_df|>
select(pic1,title,bed_info,rating,badge_type,city,room_type_category)

airbnb_df_ind$pic1 <- paste0('<img src="', airbnb_df_ind$pic1, '" style="width:50px;height:50px;">')

datatable(
  airbnb_df_ind,
  escape = FALSE, 
  options = list(pageLength = 5),
  caption = "Independent Variables with Images"
)

Note: After some investigation I realized that the DT library was not working properly because of a syntax error in the styles.css file; once that was fixed I was able to use the package.

5.3.- Correlation Analysis

Ideally, all the independent variables should be correlated with the dependent variable but NOT with each other.

5.3.1.- Correlation Matrix

In this step I will create a correlation matrix to identify the relationship between the dependent and independent variables.

# Encode categorical variables
library(dplyr)

airbnb_df_enc <- airbnb_df |>
  select(discounted_price, rating, title, badge_type, bed_info, city, room_type_category) |>
  rename(price = discounted_price)|>
  mutate(
    title = as.numeric(factor(title)),
    bed_info = as.numeric(factor(bed_info)),
    badge = as.numeric(factor(badge_type)),
    city = as.numeric(factor(city)),     
    room_type = as.numeric(factor(room_type_category))  
  )

correlation_data <- airbnb_df_enc |>
  select(price, rating,title, bed_info,badge,city, room_type)

cor_matrix <- cor(correlation_data, use = "complete.obs")

print(cor_matrix)
##                 price      rating      title    bed_info      badge        city
## price      1.00000000  0.35786477  0.3004598  0.33587171 -0.2641221 -0.03998451
## rating     0.35786477  1.00000000  0.2669158  0.17459086 -0.6190593  0.10900602
## title      0.30045984  0.26691584  1.0000000  0.21458047 -0.3574230  0.12464895
## bed_info   0.33587171  0.17459086  0.2145805  1.00000000 -0.1323934 -0.04549565
## badge     -0.26412207 -0.61905929 -0.3574230 -0.13239336  1.0000000 -0.17735171
## city      -0.03998451  0.10900602  0.1246490 -0.04549565 -0.1773517  1.00000000
## room_type -0.09976651 -0.02443898  0.1699986 -0.16233900 -0.0316835  0.02773805
##             room_type
## price     -0.09976651
## rating    -0.02443898
## title      0.16999859
## bed_info  -0.16233900
## badge     -0.03168350
## city       0.02773805
## room_type  1.00000000

5.3.2.- Correlation Plot

The correlation matrix was visualized using a correlation plot to identify the relationship between the variables.

library(corrplot)
corrplot(cor_matrix, method="shade")

5.3.3.- Conclusion

The correlation matrix and Plot show the relationship between the variables. The correlation values range from -1 to 1, where -1 indicates a strong negative correlation, 0 indicates no correlation, and 1 indicates a strong positive correlation.

Based on the plot, most of the independent variables are not highly correlated with each other; the strongest relationship is between badge and rating (about -0.62), which is worth keeping in mind but still acceptable for the analysis.


6.- Model Selection

I decided to use a Random Forest model to predict the price of new listings because it is an ensemble learning method that builds multiple decision trees and merges them to obtain a more accurate and stable prediction, and it is less prone to overfitting; it can handle both continuous and categorical variables. The Random Forest model will be trained using the independent variables to predict the price of the listings.

6.1.- Loading the necessary libraries

The necessary libraries for this process are:

  • randomForest: For building the Random Forest model
  • dplyr: For data manipulation
  • stringr: For string manipulation
  • caret: For data preprocessing
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}

library(randomForest)
library(dplyr)
library(stringr)
library(caret)

6.2.- Prepare the Data

First, I will select only the columns that I need.

listings <- airbnb_df |>
  select(discounted_price, rating, title, badge_type, bed_info, city, room_type_category) |>
  rename(price = discounted_price) |>
  mutate(title_length = nchar(title))

6.3.- Identify and Convert Categorical Variables

Then I will identify any variable of character type and convert it to factor type (categorical data).

Note: The Random Forest algorithm in R’s randomForest package expects categorical predictors to be encoded as factors, as it uses the factor levels to split decision trees efficiently.

listings <- listings |>
  mutate(across(where(is.character), as.factor))

6.4.- Split the Data into Training and Testing Sets

Now, I will split the data into training and testing sets:

  • set.seed(200): Ensures that the random processes produce the same results each time you run the code

  • createDataPartition: Creates a random partition of the data into training and testing sets. The main reason to use this function is to ensure that the distribution of the dependent variable (price(\(y\))) is similar in both the training and testing sets.

    • listings$price : Specifies the target variable used for stratification.
    • p = 0.8 : Indicates that 80% of the data should be used for training.
    • list = FALSE : Returns the indices as a matrix instead of a list.

createDataPartition is better for cases where you need to maintain the distribution of the target variable in both the training and testing sets.

set.seed(200)
train_index <- createDataPartition(listings$price, p = 0.8, list = FALSE)
train_data <- listings[train_index, ]
test_data <- listings[-train_index, ]
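
Because createDataPartition stratifies on price, a quick comparison of the two splits should show similar distributions (a simple sanity check):

# The two summaries should be close if the stratified split worked as intended
summary(train_data$price)
summary(test_data$price)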

6.5.- Evaluate the Model

In order to determine the correct value for ntree, I will use the Out-of-Bag (OOB) error rate to evaluate the model's performance for different values of ntree.

set.seed(123)
ntree_values <- seq(100, 1000, by = 100)
oob_errors <- numeric(length(ntree_values))

for (i in seq_along(ntree_values)) {
  temp_model <- randomForest(
    price ~ rating + title_length + bed_info + badge_type + city + room_type_category,
    data = train_data,
    ntree = ntree_values[i],
    importance = TRUE,
    na.action = na.roughfix
  )
  oob_errors[i] <- mean(temp_model$mse) # Average OOB mean squared error across the tree sequence (regression)
}

# Plot OOB error vs. ntree
plot(ntree_values, oob_errors, type = "b", pch = 19, col = "blue",
     xlab = "Number of Trees", ylab = "OOB Error",
     main = "Optimal Number of Trees")

From the plot, I can see that the OOB error levels off at around 600 trees.

6.6.- Define the Random Forest Model

I will define the Random Forest model using the randomForest function.

  • price ~ . : The dependent variable is price, and all the other variables in the data frame are used as predictors.

    • A period (.) indicates that all the other variables in the data frame should be used as predictors.
    • If I want to specify the predictors manually, I can use the formula price ~ rating + title_length + bed_info + badge_type + city + room_type_category.
  • data = train_data : Specifies the training data set.

  • ntree = 500 : Specifies the number of trees to grow in the forest.

  • importance = TRUE : Specifies that the importance of the predictors should be calculated.

  • na.action = na.roughfix : it performs a simple imputation to fill in the missing values before fitting the Random Forest model.

    • For numeric variables, missing values are replaced with the median of the non-missing values in that variable.
    • For categorical variables, missing values are replaced with the most frequent (mode) value of the non-missing values in that variable.
model <- randomForest(
  #price ~ .,
  price ~ rating + title_length + bed_info + badge_type + city + room_type_category,
  data = train_data,
  ntree = 500,
  importance = TRUE,
  mtry = 5,
  na.action = na.roughfix
)
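
As a toy illustration of the na.roughfix imputation described above (the values are invented for the example):

# Numeric NA -> median of the non-missing values; factor NA -> most frequent level
toy_na <- data.frame(x = c(1, NA, 3), f = factor(c("a", "a", NA)))
na.roughfix(toy_na)
# x: NA becomes 2 (median of 1 and 3); f: NA becomes "a" (the mode)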

6.7.- Predict the Price of New Listings

I will use the model to predict the price of new listings. I will create a new listing with sample information.

new_listing <- data.frame(
  rating = 5, 
  title = "Home in Atlantic City",         
  badge_type = 'Guest favorite',                  
  bed_info = '2 beds',                        
  city = 'Atlantic City',                 
  room_type_category = 'entire_home'
)

new_listing <- new_listing|>
  mutate(title_length = nchar(title))

# Ensure categorical variables match levels in the training data
new_listing <- new_listing |>
  mutate(
    badge_type = factor(badge_type, levels = levels(train_data$badge_type)),
    bed_info = factor(bed_info, levels = levels(train_data$bed_info)),
    city = factor(city, levels = levels(train_data$city)),
    room_type_category = factor(room_type_category, levels = levels(train_data$room_type_category))
  )

6.8.- New Listing Prediction

Now I will use the model to predict the price of the new listing and see the output.

predictions <- predict(model, new_listing)

print(predictions)
##        1 
## 87.30441
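
The held-out test set from section 6.4 can also be used to estimate the model's out-of-sample error; a minimal sketch computing RMSE on complete cases:

# Predict on test rows with no missing values and compare against actual prices
test_complete <- na.omit(test_data)
test_pred <- predict(model, newdata = test_complete)
sqrt(mean((test_pred - test_complete$price)^2))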

7.- Results

  • Model Importance:

The Random Forest model was used to predict the price of new listings. The model was trained using the independent variables, and the importance function helps us identify the contribution of each predictor.

importance(model)
##                      %IncMSE IncNodePurity
## rating             13.264689     608395.40
## title_length        9.855577     172907.18
## bed_info           24.840166    1312384.23
## badge_type          5.118474      16417.92
## city               16.190250     165667.94
## room_type_category  4.760633      10719.42

In this result we can see that the most important variables are bed_info, city, rating and title_length, so we can retrain the model with only these variables.

final_model <- randomForest(
  price ~ rating + title_length + bed_info + city ,
  data = train_data,
  ntree = 1000,  
  mtry = 5,
  importance = TRUE,
  na.action = na.roughfix
)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range

Now if we predict the price of the new listing we will get a more accurate result, and we can validate that by running importance(final_model).

importance(final_model)
##               %IncMSE IncNodePurity
## rating       19.30605      650726.8
## title_length 21.01131      144927.8
## bed_info     35.50465     1321798.3
## city         23.23656      173197.6

bed_info and city are the most important features in the model. These features significantly influence the price predictions; rating and title_length are also important, but to a lesser extent.

To improve the performance of the model we can also include more data (e.g., number of bathrooms) and use more advanced techniques.

  • Similar Listings:

Using the information from the new_listing dataframe, we can identify similar listings in the dataset based on the independent variables. This will help us to understand the price range of similar listings in the area.

1.- Convert character columns to factor type

new_listing <- new_listing %>%
  mutate(across(where(is.character), as.factor))

2.- Define the allowed difference for each variable

  • threshold_rating <- 0.1 : Allowable difference for rating
  • threshold_title_length <- 5 : Allowable difference for title length
  • threshold_beds <- 1 : Allowable difference for bed count
  • threshold_city <- 0 : Exact match for city
  • threshold_room_type_category <- 0 : Exact match for room type
threshold_rating <- 0.1
threshold_title_length <- 5
threshold_beds <- 1
threshold_city <- 0
threshold_room_type_category <- 0

3.- Filter the listings based on the allowed difference for each variable

matched_records <- airbnb_df |>
  filter(
    abs(rating - new_listing$rating) <= threshold_rating &
      abs(nchar(title) - new_listing$title_length) <= threshold_title_length &
      abs(as.numeric(str_extract(bed_info, "\\d+")) - as.numeric(str_extract(new_listing$bed_info, "\\d+"))) <= threshold_beds &
      city == new_listing$city &
      room_type_category == new_listing$room_type_category
  )|>
  mutate(discounted_price = as.numeric(discounted_price))|>
  filter(!is.na(discounted_price))

4.- Display existing listings that match the criteria

New Listing:

datatable(new_listing,
  escape = FALSE, 
  options = list(
    scrollX = TRUE,
    dom = 't',  # This option disables the table controls
    paging = FALSE  # This option disables pagination
  ),
  caption = "New Listing"
)

Price for new listing:

print(predictions)
##        1 
## 87.30441
library(DT)

matched_records$pic1 <- paste0('<p><img src="', matched_records$pic1, '" style="width:50px;height:50px;"></p>')
matched_records$pic2 <- paste0('<p><img src="', matched_records$pic2, '" style="width:50px;height:50px;"></p>')
matched_records$pic3 <- paste0('<p><img src="', matched_records$pic3, '" style="width:50px;height:50px;"></p>')
matched_records$pic4 <- paste0('<p><img src="', matched_records$pic4, '" style="width:50px;height:50px;"></p>')
matched_records$pic5 <- paste0('<p><img src="', matched_records$pic5, '" style="width:50px;height:50px;"></p>')
matched_records$pic6 <- paste0('<p><img src="', matched_records$pic6, '" style="width:50px;height:50px;"></p>')

matched_records <- matched_records|>
    mutate(other_pic = paste0(pic2,' ',pic3,' ',pic4,' ',pic5,' ',pic6))

If we compare the matched records with the new_listing we can see that the price is in the same range.

library(ggplot2)

price_summary <- matched_records |>
  mutate(discounted_price = as.numeric(discounted_price)) |>
  filter(!is.na(discounted_price))|>
  group_by(room_type_category) |>
  summarize(
    min_price = round(min(discounted_price),2),
    mean_price = round(mean(discounted_price),2),
    max_price = round(max(discounted_price),2)
  )

datatable(price_summary,
  escape = FALSE, 
  options = list(
    scrollX = TRUE,
    dom = 't',  # This option disables the table controls
    paging = FALSE  # This option disables pagination
  ),
  caption = "Summary of Price Range - Similar Listings"
)

Similar Listings:

datatable(
  matched_records|>
  select(id,pic1,other_pic,title,bed_info,rating,badge_type,city,room_type_category,price_label)|>
    rename(price = price_label),
  escape = FALSE, 
  options = list(pageLength = 3,
          scrollX = TRUE),
  caption = "Similar Listings"
)

In conclusion, if we compare the new listing with the similar listings, we can see that the price is in the same range. This indicates that the model is performing well in predicting the price of new listings based on the independent variables. However, if we want to improve the model, we can consider including more data and using more advanced techniques.