Airbnb is an online marketplace that connects people who want to rent out their homes with those who are looking for accommodations. This data analysis report aims to explore the Airbnb dataset to uncover insights and trends that can help inform business decisions and recommend possible prices for new listings.
Currently, the Airbnb website does not offer a way to export its data directly. However, there are some websites that provide Airbnb data for free. One of the most popular is Inside Airbnb, which provides data for a variety of cities around the world, including information on listings, reviews, and host profiles. Even so, the data provided by Inside Airbnb is not real-time, may not be up to date, and did not include the city I was interested in. Therefore, I decided to scrape the data from the Airbnb website using Python and BeautifulSoup.
The main challenge in the web scraping process was understanding how Airbnb manages pagination and how to navigate to specific pages. After analyzing multiple pages, I identified that the URL structure follows a specific pattern using cursor-based pagination. This means it uses a Base64-encoded string that contains information about section_offset, items_offset and version. By changing the items_offset, we can navigate to different pages. To automate this process, a function was created in Python.
Once the data was scraped, it was saved into text files with the page number in the file name. The extraction was run for 1000 pages because in some cases I was getting duplicate listings. The files were then loaded into R, merged into a single dataframe, and later loaded into MongoDB; I chose this database because the data was already in JSON format.
The data was then cleaned and prepared for analysis. The data
cleansing process involved unnesting the nested fields and generating
wide columns for the data. The column names were updated to make them
more readable and easier to work with. Google Maps API
was
used to populate the zip code and city for each listing based on the
latitude and longitude coordinates. The data was then merged with the
original data frame to create a new data frame with the address
information.
Finally, the data was analyzed using ggmap to create a map of the listings in Atlantic City; the map shows the location of each listing. A new data frame called listings was created with only the variables that I needed for the price analysis. I decided to use randomForest to predict the price of the listings based on the features in the data set. Using Random Forest, I was able to create a model to recommend a price for new listings in Atlantic City. If more data were available, I could have used a more complex model to predict the price of the listings; since we already have the id for each listing, a future project could use web scraping again to pull more details about each listing (e.g., number of bathrooms, number of rooms, etc.).
Web scraping was used to collect data from the Airbnb website for Atlantic City. The data included information about listings, review ratings, and prices. The data was extracted using Python and BeautifulSoup and saved into text files, which were later loaded into MongoDB for further analysis. The Google Maps API was used to populate the zip code and city for each listing based on the latitude and longitude coordinates.
The web scraping process involved the following steps:
I started by analyzing the structure of the Airbnb website URL to identify the pattern used for pagination; two consecutive page URLs are shown below:
https://www.airbnb.com/s/Atlantic-City--New-Jersey--United-States/homes?tab_id=home_tab &refinement_paths%5B%5D=%2Fhomes &query=Atlantic%20City%2C%20New%20Jersey%2C%20United%20States &place_id=ChIJIcdcblfdwIkRYlJn6UPLb0o &flexible_trip_lengths%5B%5D=one_week &monthly_start_date=2024-12-01 &monthly_length=3 &monthly_end_date=2025-03-01 &search_mode=regular_search &price_filter_input_type=0 &channel=EXPLORE &search_type=unknown &price_filter_num_nights=5 &federated_search_session_id=d64e8a69-3c91-4d03-a688-f50df0d65d06 &pagination_search=true &cursor=eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjowLCJ2ZXJzaW9uIjoxfQ%3D%3D
https://www.airbnb.com/s/Atlantic-City--New-Jersey--United-States/homes?tab_id=home_tab &refinement_paths%5B%5D=%2Fhomes &query=Atlantic%20City%2C%20New%20Jersey%2C%20United%20States &place_id=ChIJIcdcblfdwIkRYlJn6UPLb0o &flexible_trip_lengths%5B%5D=one_week &monthly_start_date=2024-12-01 &monthly_length=3 &monthly_end_date=2025-03-01 &search_mode=regular_search &price_filter_input_type=0 &channel=EXPLORE &search_type=unknown &price_filter_num_nights=5 &federated_search_session_id=d64e8a69-3c91-4d03-a688-f50df0d65d06 &pagination_search=true &cursor=eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0%3D
In the previous two URLs we can determine that the cursor is the parameter that changes when we navigate to different pages. The cursor is a Base64-encoded string that contains information about section_offset, items_offset and version. By changing the items_offset, we can navigate to different pages.
section_offset: Indicates the section offset (often 0 for standard searches).
items_offset: Indicates the item offset for the results.
version: Indicates the version of the pagination system.
From the previous example we can define the following two cursors that were used to navigate to different pages:
Cursor1: eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjowLCJ2ZXJzaW9uIjoxfQ==
Decoded: {"section_offset":0,"items_offset":0,"version":1}
Cursor2: eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0=
Decoded: {"section_offset":0,"items_offset":18,"version":1}
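The decoded values above can be verified quickly; the short sketch below (written in R with the jsonlite package, which is also used later in this report) decodes Cursor2 and inspects its fields.
library(jsonlite)
# Decode the Base64 cursor back into its JSON payload
cursor2 <- "eyJzZWN0aW9uX29mZnNldCI6MCwiaXRlbXNfb2Zmc2V0IjoxOCwidmVyc2lvbiI6MX0="
decoded <- rawToChar(base64_dec(cursor2))
print(decoded)                   # {"section_offset":0,"items_offset":18,"version":1}
fromJSON(decoded)$items_offset   # 18, i.e. the second page of results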
The items_offset increases in increments of 18 (e.g., 0, 18, 36, 54, 72), based on the number of results per page. A function was created in Python to automate the process of changing the cursor and navigating to the different pages. This function is later used to generate the URLs.
The following formula was used to calculate the items offset for a specific page: items_offset = (page_number − 1) × 18. For example, page 3 corresponds to (3 − 1) × 18 = 36.
import base64

def generate_cursor(page_number, results_per_page=18):
    # The offset grows by 18 results per page: (page_number - 1) * 18
    items_offset = (page_number - 1) * results_per_page
    cursor_data = {
        "section_offset": 0,
        "items_offset": items_offset,
        "version": 1
    }
    # Turn the dict into a JSON string, then Base64-encode it
    cursor_json = str(cursor_data).replace("'", '"')
    return base64.b64encode(cursor_json.encode()).decode()
A Python script was created to extract data from the Airbnb website
using BeautifulSoup. The script iterates over the pages and extracts the
data for each page. The extracted data is saved into text files for each
page. The script uses the requests
library to send HTTP
requests to the website and the BeautifulSoup
library to
parse the HTML content.
Using the terminal, the requests and beautifulsoup4 libraries were installed (base64 is part of the Python standard library, so it does not need to be installed separately).
Note: You need to have Python3 and pip3.
pip3 install requests
pip3 install beautifulsoup4
base64: To encode and decode data
requests: To send HTTP requests
BeautifulSoup: To parse HTML content
import base64
import requests
from bs4 import BeautifulSoup
To avoid hard-coding the location in the base URL, the city, state and country variables were defined separately from the base_url variable.
city = "Atlantic-City"
state = "New-Jersey"
country = "United-States"
base_url = f"https://www.airbnb.com/s/{city}--{state}--{country}/homes"
Before starting the calls to the website, it's important to mimic the behavior of a browser by setting the user-agent in the headers. This helps prevent the website from blocking the requests.
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
The total_pages variable was defined to determine the number of pages to scrape. In this case, 1000 pages were scraped.
total_pages = 1000
A loop was created to iterate over the pages and extract the data. The cursor was generated for each page, and the URL was constructed with it (using the function defined previously). The request was sent to the website, and the response was parsed using BeautifulSoup to extract the data. The extracted data was saved to a text file for each page in the airbnb-scraper folder. To avoid overwriting files, the page number was included in the file name: airbnb_pagination_search{page}.txt.
for page in range(1, total_pages + 1):
    # Generate the cursor for the current page
    cursor = generate_cursor(page)
    # Construct the URL with the cursor
    url = f"{base_url}?pagination_search=true&cursor={cursor}"
    print(f"Scraping Page {page}: {url}")
    # Send request
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        data = soup.find_all('script', type='application/json')
        for i in data:
            if 'searchResults' in i.text:
                data = i.text
        with open(f"/airbnb-scraper/airbnb_pagination_search{page}.txt", 'w', encoding='utf-8') as file:
            file.write(data)
    else:
        print(f"Failed to load page {page}: {response.status_code}")
The final Python script looks like this:
import base64
import requests
from bs4 import BeautifulSoup

city = "Atlantic-City"
state = "New-Jersey"
country = "United-States"
base_url = f"https://www.airbnb.com/s/{city}--{state}--{country}/homes"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

def generate_cursor(page_number, results_per_page=18):
    items_offset = (page_number - 1) * results_per_page
    cursor_data = {
        "section_offset": 0,
        "items_offset": items_offset,
        "version": 1
    }
    cursor_json = str(cursor_data).replace("'", '"')
    return base64.b64encode(cursor_json.encode()).decode()

total_pages = 1000

for page in range(1, total_pages + 1):
    cursor = generate_cursor(page)
    url = f"{base_url}?pagination_search=true&cursor={cursor}"
    print(f"Scraping Page {page}: {url}")
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        data = soup.find_all('script', type='application/json')
        for i in data:
            if 'searchResults' in i.text:
                data = i.text
        with open(f"/airbnb-scraper/airbnb_pagination_search{page}.txt", 'w', encoding='utf-8') as file:
            file.write(data)
    else:
        print(f"Failed to load page {page}: {response.status_code}")
The data extracted from the Airbnb website was saved into text files for each page; the next step was to load the data into MongoDB. The data was loaded into a collection named Listing_Information in the Airbnb database.
For this purpose, I used R to load the data from the text files,
convert it to json, and then load it into MongoDB using the
mongolite
package.
The mandatory packages for this process are stringr, jsonlite, and mongolite. They can be installed using the following code:
if (!requireNamespace("stringr", quietly = TRUE)) {
install.packages("stringr")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
install.packages("jsonlite")
}
if (!requireNamespace("mongolite", quietly = TRUE)) {
install.packages("mongolite")
}
Once the packages are installed, they can be loaded into R using the following code:
stringr: For string manipulation
jsonlite: For converting data to JSON format
mongolite: For connecting to MongoDB
library(stringr)
library(jsonlite)
library(mongolite)
The next step is to open a connection to MongoDB using the
mongo
function from the mongolite
package. The
connection details, such as the collection name, database name, and URL,
need to be specified. For this analysis I used localhost; however, it can be updated to a different DNS name or IP address.
mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")
In order to avoid duplicate listings, a unique index was created on
the listing.id
field in the
Listing_Information
collection.
mongo_conn$run('{
"createIndexes": "Listing_Information",
"indexes": [
{
"key": { "listing.id": 1 },
"name": "unique_listing_id_index",
"unique": true
}
]
}')
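As a quick sanity check (a sketch using mongolite's index() method, not part of the original workflow), the collection's indexes can be listed to confirm the unique index exists:
# List the indexes currently defined on the collection
mongo_conn$index()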
The next step is to identify the number of files in the directory
/airbnb-scraper/
that contains the extracted data.
num_files <- length(list.files("airbnb-scraper", pattern = "*.txt", full.names = TRUE))
A loop was created to load the data from each text file into R, convert it to JSON, and insert it into MongoDB. The fromJSON function from the jsonlite package was used to parse each match, and the insert function from the mongolite package was used to insert it into MongoDB; this is executed for every listing matched in the text file.
An exception handler was added to deal with duplicate entry errors. Thanks to the unique index created in the previous step, any duplicate listing is simply skipped.
The pattern used to extract the data from the text files is shown below:
for (num_file in 1:num_files) {
txt_file <- gsub(" ", "",paste("airbnb-scraper/airbnb_pagination_search",num_file,".txt"))
# Read the file
text <- readLines(txt_file, warn = FALSE)
text <- paste(text, collapse = " ")
pattern <- "\\{\"__typename\":\"StaySearchResult\",.*?\"demandStayListing\":\\{\"__typename\".*?\\}\\}"
# Use str_extract_all to find all matches
matches <- str_extract_all(text, pattern)[[1]]
for (i in 1:length(matches)) {
json_data <- fromJSON(matches[i])
json_char <- toJSON(json_data, auto_unbox = TRUE)
tryCatch({
mongo_conn$insert(json_char)
cat("Data inserted successfully.\n")
}, error = function(e) {
if (grepl("E11000 duplicate key error", e$message)) {
cat("Duplicate entry detected. Skipping insertion.\n")
} else {
cat("An error occurred: ", e$message, "\n")
}
})
}
}
The final step is to validate the number of records in the
Listing_Information
collection in MongoDB.
print(mongo_conn$count())
## [1] 484
All the R code together will look like this:
library(stringr)
library(jsonlite)
library(mongolite)
mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")
mongo_conn$run('{
"createIndexes": "Listing_Information",
"indexes": [
{
"key": { "listing.id": 1 },
"name": "unique_listing_id_index",
"unique": true
}
]
}')
num_files <- length(list.files("airbnb-scraper", pattern = "*.txt", full.names = TRUE))
for (num_file in 1:num_files) {
txt_file <- gsub(" ", "",paste("airbnb-scraper/airbnb_pagination_search",num_file,".txt"))
text <- readLines(txt_file, warn = FALSE)
text <- paste(text, collapse = " ")
pattern <- "\\{\"__typename\":\"StaySearchResult\",.*?\"demandStayListing\":\\{\"__typename\".*?\\}\\}"
matches <- str_extract_all(text, pattern)[[1]]
for (i in 1:length(matches)) {
json_data <- fromJSON(matches[i])
json_char <- toJSON(json_data, auto_unbox = TRUE)
tryCatch({
mongo_conn$insert(json_char)
cat("Data inserted successfully.\n")
}, error = function(e) {
if (grepl("E11000 duplicate key error", e$message)) {
cat("Duplicate entry detected. Skipping insertion.\n")
} else {
cat("An error occurred: ", e$message, "\n")
}
})
}
}
print(mongo_conn$count())
Now that the data is loaded in MongoDB I will be able to perform some data cleansing in order to prepare the data for analysis.
The data cleansing process will involve extracting the documents from MongoDB, unnesting the nested fields into wide columns, and renaming the columns so they are easier to work with.
The necessary libraries for this process are:
stringr: For string manipulation
jsonlite: For converting data to JSON format
mongolite: For connecting to MongoDB
tidyverse: For data manipulation
They can be loaded using the following code:
library(stringr)
library(jsonlite)
library(mongolite)
library(tidyverse)
I will connect to the database and collection that I previously created in MongoDB.
mongo_conn <- mongo(collection = "Listing_Information", db = "Airbnb", url = "mongodb://localhost:27017")
The data in MongoDB is stored in JSON format. I will define the JSON field specification to extract the data from MongoDB. In order to do that, I first have to check how the data was loaded into MongoDB.
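One quick way to inspect the structure before writing the field specification is to pull a single document (a sketch using the limit argument of mongolite's find()):
# Fetch one document and show its top-level structure
str(mongo_conn$find(limit = 1), max.level = 2)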
fields <- '{"_id": 1,
"listing.id":1,
"listing.coordinate.latitude":1,
"listing.coordinate.longitude":1,
"listing.roomTypeCategory":1,
"listing.listingObjType":1,
"listing.name":1,
"listing.pdpUrlType":1,
"listing.structuredContent.primaryLine": 1,
"listing.structuredContent.mapSecondaryLine": 1,
"listing.title":1,
"listing.titleLocale":1,
"avgRatingA11yLabel":1,
"avgRatingLocalized":1,
"listingParamOverrides.checkin":1,
"listingParamOverrides.checkout":1,
"listingParamOverrides.relaxedAmenityIds":1,
"listingParamOverrides.amenities":1,
"pricingQuote.amenities":1,
"pricingQuote.structuredStayDisplayPrice":1,
"badges.loggingContext.badgeType":1,
"badges.style":1,
"badges.text":1,
"badges.textAccessibilityLabel":1,
"badges.textColor":1,
"contextualPictures":1}'
In this step I used the JSON field specification defined in the previous step to extract the data from MongoDB. query <- '{}' is used to extract all the documents from the collection.
query <- '{}'
airbnb_data <- mongo_conn$find(query, fields = fields)
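An empty query returns every document, but the same connection also accepts server-side filters. For example (a sketch; the field name comes from the specification above and the value is one of the room categories seen later in the analysis), we could count only the entire-home listings:
# Count documents matching a filter instead of pulling everything
mongo_conn$count('{"listing.roomTypeCategory": "entire_home"}')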
The data is in a nested format, so I will generate wide columns for the nested fields to make it easier to analyze.
unnest: To unnest the nested fields
unnest_wider: To unnest the nested fields and generate wide columns
airbnb_data <- airbnb_data |>
unnest(listing) |>
unnest(structuredContent) |>
unnest_wider(primaryLine) |>
unnest_wider(mapSecondaryLine, names_sep = "_msl")
airbnb_data <- airbnb_data |>
unnest_wider(badges)
airbnb_data <- airbnb_data|>
unnest_wider(contextualPictures, names_sep = "_pic")
airbnb_data <- airbnb_data|>
unnest_wider(coordinate)
airbnb_data <- airbnb_data|>
unnest_wider(listingParamOverrides)
airbnb_data <- airbnb_data|>
unnest_wider(pricingQuote)
airbnb_data <- airbnb_data|>
unnest_wider(structuredStayDisplayPrice, names_sep = "_")
airbnb_data <- airbnb_data|>
unnest_wider(structuredStayDisplayPrice_primaryLine, names_sep = "_")
airbnb_data <- airbnb_data|>
unnest_wider(avgRatingLocalized , names_sep = "_")
airbnb_data <- airbnb_data|>
unnest_wider(avgRatingA11yLabel, names_sep = "_")
airbnb_data <- airbnb_data|>
unnest_wider(loggingContext, names_sep = "_")
airbnb_data <- airbnb_data|>
unnest_wider(contextualPictures_picpicture, names_sep = "_")
The column names contain special characters and spaces, so I will update them to make them more readable and easier to work with in later steps.
colnames(airbnb_data) <- make.names(colnames(airbnb_data), unique = TRUE)
print(colnames(airbnb_data))
## [1] "X_id"
## [2] "latitude"
## [3] "longitude"
## [4] "id"
## [5] "listingObjType"
## [6] "name"
## [7] "pdpUrlType"
## [8] "roomTypeCategory"
## [9] "mapSecondaryLine_msl__typename"
## [10] "mapSecondaryLine_mslbody"
## [11] "X__typename"
## [12] "body"
## [13] "type"
## [14] "title"
## [15] "titleLocale"
## [16] "avgRatingA11yLabel_1"
## [17] "avgRatingLocalized_1"
## [18] "checkin"
## [19] "checkout"
## [20] "relaxedAmenityIds"
## [21] "amenities"
## [22] "structuredStayDisplayPrice___typename"
## [23] "structuredStayDisplayPrice_primaryLine___typename"
## [24] "structuredStayDisplayPrice_primaryLine_displayComponentType"
## [25] "structuredStayDisplayPrice_primaryLine_accessibilityLabel"
## [26] "structuredStayDisplayPrice_primaryLine_discountedPrice"
## [27] "structuredStayDisplayPrice_primaryLine_originalPrice"
## [28] "structuredStayDisplayPrice_primaryLine_qualifier"
## [29] "structuredStayDisplayPrice_primaryLine_shortQualifier"
## [30] "structuredStayDisplayPrice_primaryLine_concatQualifierLeft"
## [31] "structuredStayDisplayPrice_primaryLine_trailingContent"
## [32] "structuredStayDisplayPrice_primaryLine_trailing"
## [33] "structuredStayDisplayPrice_primaryLine_price"
## [34] "structuredStayDisplayPrice_secondaryLine"
## [35] "structuredStayDisplayPrice_explanationData"
## [36] "structuredStayDisplayPrice_explanationDataDisplayPosition"
## [37] "structuredStayDisplayPrice_explanationDataDisplayPriceTriggerType"
## [38] "structuredStayDisplayPrice_layout"
## [39] "loggingContext_badgeType"
## [40] "style"
## [41] "text"
## [42] "textAccessibilityLabel"
## [43] "textColor"
## [44] "contextualPictures_pic__typename"
## [45] "contextualPictures_picid"
## [46] "contextualPictures_picpicture_1"
## [47] "contextualPictures_picpicture_2"
## [48] "contextualPictures_picpicture_3"
## [49] "contextualPictures_picpicture_4"
## [50] "contextualPictures_picpicture_5"
## [51] "contextualPictures_picpicture_6"
Create a new dataframe with the columns that I will use for the analysis.
airbnb_df <- airbnb_data |>
select(id, title, name, latitude, longitude, roomTypeCategory, type,body,avgRatingLocalized_1,structuredStayDisplayPrice_primaryLine___typename,
structuredStayDisplayPrice_primaryLine_accessibilityLabel,
structuredStayDisplayPrice_primaryLine_shortQualifier,
text,contextualPictures_picpicture_1,
contextualPictures_picpicture_2,contextualPictures_picpicture_3,
contextualPictures_picpicture_4,contextualPictures_picpicture_5,
contextualPictures_picpicture_6) |>
rename(room_type_category = roomTypeCategory,
stay_price = structuredStayDisplayPrice_primaryLine_shortQualifier,
display_price = structuredStayDisplayPrice_primaryLine___typename,
price_label = structuredStayDisplayPrice_primaryLine_accessibilityLabel,
avg_rating = avgRatingLocalized_1,
badge_type = text,
pic1 = contextualPictures_picpicture_1,
pic2 = contextualPictures_picpicture_2,
pic3 = contextualPictures_picpicture_3,
pic4 = contextualPictures_picpicture_4,
pic5 = contextualPictures_picpicture_5,
pic6 = contextualPictures_picpicture_6) |>
mutate(bed_info = ifelse(grepl("BEDINFO", type, ignore.case = TRUE), body, '0 beds'),
free_cancellation = ifelse(grepl("Free cancellation", body, ignore.case = TRUE), 'Yes', 'No'),
rating = as.numeric(str_extract(avg_rating, "[0-9]+.[0-9]+")),
num_ppl_rated = as.numeric(str_remove_all(str_extract(avg_rating, "\\((\\d+)\\)"), "[()]")),
DiscountedDisplayPriceLine = ifelse(grepl("DiscountedDisplayPriceLine",
display_price,
ignore.case = TRUE), 'Yes', 'No'),
discounted_price = str_remove(str_extract(price_label,
"\\$\\d+(?= per)"), "\\$"),
original_price = ifelse(is.na(str_remove(str_extract(price_label,
"\\$\\d+$"), "\\$")) | str_remove(str_extract(price_label,
"\\$\\d+$"), "\\$") == "", discounted_price, str_remove(str_extract(price_label,
"\\$\\d+$"), "\\$")))
The dataframe will look like this:
print(colnames(airbnb_df))
## [1] "id" "title"
## [3] "name" "latitude"
## [5] "longitude" "room_type_category"
## [7] "type" "body"
## [9] "avg_rating" "display_price"
## [11] "price_label" "stay_price"
## [13] "badge_type" "pic1"
## [15] "pic2" "pic3"
## [17] "pic4" "pic5"
## [19] "pic6" "bed_info"
## [21] "free_cancellation" "rating"
## [23] "num_ppl_rated" "DiscountedDisplayPriceLine"
## [25] "discounted_price" "original_price"
In previous assignments I used the DT package, but in this case I was getting an error, so I decided to use the reactable package to create an interactive table.
library(reactable)
# Create an interactive table
reactable(
airbnb_df[1:12],
searchable = TRUE,
sortable = TRUE,
pagination = TRUE,
highlight = TRUE,
defaultColDef = colDef(headerStyle = list(whiteSpace = "normal")),
style = list(overflowX = "auto"),
defaultPageSize = 5
)
In this step I will use the Google Maps API to populate the
zip code
and city
for each listing based on
the latitude and longitude coordinates. This will be my second source of
data.
The main packages for this process are ggmap, httr, and jsonlite.
ggmap: For mapping
httr: For sending HTTP requests
jsonlite: For converting data to JSON format
They can be installed using the following code:
if (!requireNamespace("ggmap", quietly = TRUE)) {
install.packages("ggmap")
}
if (!requireNamespace("httr", quietly = TRUE)) {
install.packages("httr")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
install.packages("jsonlite")
}
library(httr)
library(jsonlite)
library(ggmap)
For security reasons, my API key is stored in a JSON file inside a folder that is not committed to GitHub.
ggm_kc <- fromJSON("config/ggm_k.json")
register_google(ggm_kc$key)
In this step I will use the Google Maps API to populate the zip code and city for each listing based on the latitude and longitude coordinates.
GET: To send a GET request to the Google Maps API
content: To extract the content from the response
fromJSON: To parse the JSON content
unnest_wider: To unnest the nested fields and generate wide columns
str_extract: To extract the zip code from the formatted address
str_remove_all: To remove all occurrences of a pattern from a string
str_remove: To remove a pattern from a string
slice_head: To select the first row of the data frame
The latitude and longitude coordinates are passed as parameters in the URL, along with the API key.
https://maps.googleapis.com/maps/api/geocode/json?latlng=39.3643,-74.4229&key=key_here
In order to get the information for each observation in the dataframe, I will use a loop to iterate over the rows and populate the zip code and city columns.
The address will look like this: 123 S New Jersey Ave, Atlantic City, NJ 08401, USA, so I used regex to extract the zip code and city (both patterns are checked on this example address below).
zip_code: \\d{5}, — five digits followed by a comma
city: (?<=, )[A-Za-z ]+(?=, [A-Z]{2}) — the words between a comma and a comma followed by two uppercase letters
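A minimal check of these patterns on the example address, using the stringr functions listed above:
library(stringr)
sample_address <- "123 S New Jersey Ave, Atlantic City, NJ 08401, USA"
str_remove_all(str_extract(sample_address, "\\d{5},"), ",")        # "08401"
str_extract(sample_address, "(?<=, )[A-Za-z ]+(?=, [A-Z]{2})")     # "Atlantic City"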
Note: To avoid calling the service repeatedly, I only call it once for each observation and save the results in a data frame, which is later written to a CSV file.
temp_address_info <- data.frame(id = rep(NA, nrow(airbnb_df)),
                                zip_code = rep(NA, nrow(airbnb_df)),
                                city = rep(NA, nrow(airbnb_df)))
for (i in 1:nrow(airbnb_df)) {
  temp_address_info$id[i] <- airbnb_df$id[i]
  # Build the reverse-geocoding request for this listing's coordinates
  url <- paste0("https://maps.googleapis.com/maps/api/geocode/json?latlng=",
                airbnb_df$latitude[i], ",", airbnb_df$longitude[i],
                "&key=", ggm_kc$key)
  response <- GET(url)
  content <- content(response, as = "text", encoding = "UTF-8")
  json_data <- fromJSON(content)
  json_data_df <- data.frame(json_data)
  json_data_df <- json_data_df |>
    unnest_wider(results.geometry)
  # Keep only rooftop-level matches and pull the zip code and city out of the formatted address
  zip_result <- json_data_df |>
    select(results.formatted_address) |>
    filter(json_data_df$location_type == "ROOFTOP") |>
    mutate(zip_code = str_remove_all(str_extract(results.formatted_address, "\\d{5},"), ","),
           city = str_extract(results.formatted_address, "(?<=, )[A-Za-z ]+(?=, [A-Z]{2})")) |>
    slice_head(n = 1)
  if (json_data$status == "OK") {
    temp_address_info$zip_code[i] <- zip_result[2]
    temp_address_info$city[i] <- zip_result[3]
  } else {
    temp_address_info$zip_code[i] <- NA
    temp_address_info$city[i] <- NA
  }
}
temp_address_info <- temp_address_info|>
unnest_wider(zip_code, names_sep = "_")
temp_address_info <- temp_address_info|>
unnest_wider(city, names_sep = "_")
colnames(temp_address_info) <- c("id", "zip_code", "city")
write.table(temp_address_info, "data/temp_address_info.csv",sep = ",", row.names = FALSE, col.names = TRUE)
Now I will use the previous file to merge the data with the main dataframe.
temp_address_info_reload <- read.csv("data/temp_address_info.csv",
colClasses = c("character", "character", "character"),stringsAsFactors = FALSE, quote = "\"")
airbnb_df <- merge(airbnb_df, temp_address_info_reload, by = "id")
library(reactable)
reactable(
airbnb_df[c(2,22,23,8)],
searchable = TRUE,
sortable = TRUE,
pagination = TRUE,
highlight = TRUE,
defaultColDef = colDef(headerStyle = list(whiteSpace = "normal")),
style = list(overflowX = "auto"),
defaultPageSize = 5
)
In this section, I will use the ggmap
package to create
a map of the listings in Atlantic City. The map will show the location
of each listing.
In this step I will define the variables for the city, state and
country. Then I will use the ggmap
function to get the map
location and plot the map with the listings.
In order to use the Google Maps API, you need to have an API key. The API key is used to authenticate your requests to the API and to track usage. You can get an API key by creating a project in the Google Cloud Platform and enabling the Google Maps API.
city <- "Atlantic-City"
state <- "New-Jersey"
country <- "United-States"
if (!requireNamespace("ggmap", quietly = TRUE)) {
install.packages("ggmap")
}
library(ggmap)
ggm_kc <- fromJSON("config/ggm_k.json")
register_google(ggm_kc$key)
Using the variables defined for the location, I will use the
get_map
function to get the map location.
gg_map_location <- get_map(paste(gsub("-", " ", city),paste(' ',gsub("-", " ", state))),
maptype='roadmap', #hybrid
source="google",
api_key = ggm_kc,
zoom=12)
I will use the geom_point function to add the points for the listings to the map, pulling the longitude and latitude from the data frame generated earlier.
ggmap(gg_map_location) +
geom_point(data = airbnb_df, aes(x = longitude, y = latitude), color = "#ff385c", size = 0.5)+
labs(
title = paste(gsub("-", " ", city),'-',gsub("-", " ", state),',',country),
)+
theme(
plot.title = element_text(color = "#ff385c", size = 16, face="bold")
)
In this map, we can confirm that the listings are in the Atlantic City area.
price (\(y\)): The dependent variable represents the outcome (target) variable of the analysis, measured in USD; it is the variable of interest being investigated and is continuous.
airbnb_df$discounted_price <- as.numeric(airbnb_df$discounted_price)
ggplot(airbnb_df, aes(x=discounted_price)) +
geom_histogram(binwidth=39,fill = "white", colour = "#ff385c")+
scale_x_continuous(
breaks = seq(0, max(airbnb_df$discounted_price, na.rm = TRUE) + 50, by = 100),
expand = c(0, 0)
) +
geom_vline(xintercept=mean(airbnb_df$discounted_price), linetype="dashed", color = "darkgray",size=1)+
geom_vline(xintercept=median(airbnb_df$discounted_price), linetype="dotted", color = "darkgray",size=1)+
labs(
x = "Price",
y = "Number of Listings",
title = paste("Airbnb -",gsub("-", " ", city),",",gsub("-", " ", state)," - Prices Histogram"),
subtitle = paste("Dependent Variable - Skewed Right,tail of the graph is pulled toward the higher numbers
\nThe mean(|)",round(mean(airbnb_df$discounted_price),2)," is greater than the median(.)",
round(median(airbnb_df$discounted_price),2),"")
)+
theme_minimal()
The independent variables are the predictors or factors that are believed to have an influence on the dependent variable. Possible independent variables: bed_info(\(x_1\)), rating(\(x_2\)), badge_type(\(x_3\)), city(\(x_4\)), room_type_category(\(x_5\)), title(\(x_6\)).
library(DT)
airbnb_df_ind <- airbnb_df|>
select(pic1,title,bed_info,rating,badge_type,city,room_type_category)
airbnb_df_ind$pic1 <- paste0('<img src="', airbnb_df_ind$pic1, '" style="width:50px;height:50px;">')
datatable(
airbnb_df_ind,
escape = FALSE,
options = list(pageLength = 5),
caption = "Independent Variables with Images"
)
Note: After some investigation I realized that the DT library was not working properly because of a syntax error in the styles.css file; once that was fixed I was able to use the package.
Ideally, all the independent variables should be correlated with the dependent variable but NOT with each other.
In this step I will create a correlation matrix to identify the relationship between the dependent and independent variables.
# Encode categorical variables
library(dplyr)
airbnb_df_enc <- airbnb_df |>
select(discounted_price, rating, title,badge_type, bed_info, badge_type, city, room_type_category) |>
rename(price = discounted_price)|>
mutate(
title = as.numeric(factor(title)),
bed_info = as.numeric(factor(bed_info)),
badge_type = as.numeric(factor(badge_type)),
city = as.numeric(factor(city)),
badge = as.numeric(factor(badge_type)),
room_type = as.numeric(factor(room_type_category))
)
correlation_data <- airbnb_df_enc |>
select(price, rating,title, bed_info,badge,city, room_type)
cor_matrix <- cor(correlation_data, use = "complete.obs")
print(cor_matrix)
## price rating title bed_info badge city
## price 1.00000000 0.35786477 0.3004598 0.33587171 -0.2641221 -0.03998451
## rating 0.35786477 1.00000000 0.2669158 0.17459086 -0.6190593 0.10900602
## title 0.30045984 0.26691584 1.0000000 0.21458047 -0.3574230 0.12464895
## bed_info 0.33587171 0.17459086 0.2145805 1.00000000 -0.1323934 -0.04549565
## badge -0.26412207 -0.61905929 -0.3574230 -0.13239336 1.0000000 -0.17735171
## city -0.03998451 0.10900602 0.1246490 -0.04549565 -0.1773517 1.00000000
## room_type -0.09976651 -0.02443898 0.1699986 -0.16233900 -0.0316835 0.02773805
## room_type
## price -0.09976651
## rating -0.02443898
## title 0.16999859
## bed_info -0.16233900
## badge -0.03168350
## city 0.02773805
## room_type 1.00000000
The correlation matrix was visualized using a correlation plot to identify the relationship between the variables.
library(corrplot)
corrplot(cor_matrix, method="shade")
The correlation matrix and Plot show the relationship between the variables. The correlation values range from -1 to 1, where -1 indicates a strong negative correlation, 0 indicates no correlation, and 1 indicates a strong positive correlation.
Based on the plot, the independent variables are not highly correlated with each other, which is ideal for the analysis.
I decided to use a Random Forest model to predict the price of new listings because it is an ensemble learning method that builds multiple decision trees and merges them to obtain a more accurate and stable prediction, and because it is less prone to overfitting; it can handle both continuous and categorical variables.
The Random Forest model will be trained using the independent variables to predict the price of the listings.
The necessary libraries for this process are randomForest, dplyr, stringr, and caret.
randomForest: For building the Random Forest model
dplyr: For data manipulation
stringr: For string manipulation
caret: For data preprocessing
if (!requireNamespace("randomForest", quietly = TRUE)) {
install.packages("randomForest")
}
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
library(randomForest)
library(dplyr)
library(stringr)
library(caret)
I will first select only the columns that I need.
listings <- airbnb_df |>
select(discounted_price, rating, title, badge_type, bed_info, city, room_type_category) |>
rename(price = discounted_price) |>
mutate(title_length = nchar(title))
Then I will identify any variable of character type and convert it to factor type (categorical data).
Note: The Random Forest algorithm in R’s randomForest package expects categorical predictors to be encoded as factors, as it uses the factor levels to split decision trees efficiently.
listings <- listings |>
mutate(across(where(is.character), as.factor))
Now, I will split the data into training and testing sets.
set.seed(200): Ensures that the random processes produce the same results each time you run the code.
createDataPartition: Creates a random partition of the data into training and testing sets. The main reason to use this function is to ensure that the distribution of the dependent variable (price (\(y\))) is similar in both the training and testing sets.
listings$price: Specifies the target variable used for stratification.
p = 0.8: Indicates that 80% of the data should be used for training.
list = FALSE: Returns the indices as a matrix instead of a list.
createDataPartition is better for cases where you need to maintain the distribution of the target variable in both the training and testing sets.
set.seed(200)
train_index <- createDataPartition(listings$price, p = 0.8, list = FALSE)
train_data <- listings[train_index, ]
test_data <- listings[-train_index, ]
In order to determine the correct value for ntree, I will use the Out-of-Bag (OOB) error rate to evaluate the model performance for different values of ntree.
set.seed(123)
ntree_values <- seq(100, 1000, by = 100)
oob_errors <- numeric(length(ntree_values))
for (i in seq_along(ntree_values)) {
temp_model <- randomForest(
price ~ rating + title_length + bed_info + badge_type + city + room_type_category,
data = train_data,
ntree = ntree_values[i],
importance = TRUE,
na.action = na.roughfix
)
oob_errors[i] <- mean(temp_model$mse) # Mean squared error for regression
}
# Plot OOB error vs. ntree
plot(ntree_values, oob_errors, type = "b", pch = 19, col = "blue",
xlab = "Number of Trees", ylab = "OOB Error",
main = "Optimal Number of Trees")
From the resulting plot, I can see that the OOB error rate decreases until around 600 trees.
I will define the Random Forest model using the
randomForest
function.
price ~ rating + title_length + bed_info + badge_type + city + room_type_category: The dependent variable is price, and the listed variables are used as predictors (price ~ . would instead use every other column in the data frame).
data = train_data: Specifies the training data set.
ntree = 500: Specifies the number of trees to grow in the forest.
mtry = 5: Specifies the number of predictors randomly sampled as candidates at each split.
importance = TRUE: Specifies that the importance of the predictors should be calculated.
na.action = na.roughfix: Performs a simple imputation to fill in the missing values before fitting the Random Forest model.
model <- randomForest(
#price ~ .,
price ~ rating + title_length + bed_info + badge_type + city + room_type_category,
data = train_data,
ntree = 500,
importance = TRUE,
mtry = 5,
na.action = na.roughfix
)
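The held-out test set created earlier can give a rough idea of how well the model fits; the sketch below is an assumption on my part (not part of the original workflow) that drops rows with missing predictors and computes the error on the remaining listings.
# Evaluate the fitted model on the held-out test set (rows with NAs are dropped,
# since predict() does not impute missing values)
test_complete <- na.omit(test_data)
test_pred <- predict(model, newdata = test_complete)
rmse <- sqrt(mean((test_pred - test_complete$price)^2))
mae <- mean(abs(test_pred - test_complete$price))
print(c(RMSE = rmse, MAE = mae))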
I will use the model to predict the price of new listings. I will create a new listing with some sample information.
new_listing <- data.frame(
rating = 5,
title = "Home in Atlantic City",
badge_type = 'Guest favorite',
bed_info = '2 beds',
city = 'Atlantic City',
room_type_category = 'entire_home'
)
new_listing <- new_listing|>
mutate(title_length = nchar(title))
# Ensure categorical variables match levels in the training data
new_listing <- new_listing |>
mutate(
badge_type = factor(badge_type, levels = levels(train_data$badge_type)),
bed_info = factor(bed_info, levels = levels(train_data$bed_info)),
city = factor(city, levels = levels(train_data$city)),
room_type_category = factor(room_type_category, levels = levels(train_data$room_type_category))
)
Now I will use the model to predict the price of the new listing and see the output.
predictions <- predict(model, new_listing)
print(predictions)
## 1
## 87.30441
The Random Forest model was used to predict the price of new listings. The model was trained using the independent variables, and the importance function helps us identify the contribution of each predictor.
importance(model)
## %IncMSE IncNodePurity
## rating 13.264689 608395.40
## title_length 9.855577 172907.18
## bed_info 24.840166 1312384.23
## badge_type 5.118474 16417.92
## city 16.190250 165667.94
## room_type_category 4.760633 10719.42
From this result we can see that the most important variables are bed_info, city, rating and title_length, so we can retrain the model using only these variables.
final_model <- randomForest(
price ~ rating + title_length + bed_info + city ,
data = train_data,
ntree = 1000,
mtry = 5,
importance = TRUE,
na.action = na.roughfix
)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
Now, if we predict the price of the new listing with this refit model, we should get a more accurate result, and we can validate the predictor importance by running importance(final_model).
importance(final_model)
## %IncMSE IncNodePurity
## rating 19.30605 650726.8
## title_length 21.01131 144927.8
## bed_info 35.50465 1321798.3
## city 23.23656 173197.6
bed_info and city are the most important features in the model. These features significantly influence the price predictions; rating and title_length are also important features, but to a lesser extent.
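For completeness, the refit model can also be used to score the same hypothetical listing (a sketch; the value will generally differ somewhat from the 500-tree model used above):
# Predict the price of the hypothetical new listing with the refit model
predict(final_model, new_listing)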
In order to improve the performance of the model, we could also include more data (e.g., number of bathrooms) and use more advanced techniques.
Using the information from the new_listing
dataframe, we
can identify similar listings in the dataset based on the independent
variables. This will help us to understand the price range of similar
listings in the area.
1.- Convert data frame to factor type
new_listing <- new_listing %>%
mutate(across(where(is.character), as.factor))
2.- Define the allowed difference for each variable
- threshold_rating <- 0.1 : Allowable difference for rating
- threshold_title_length <- 5 : Allowable difference for title length
- threshold_beds <- 1 : Allowable difference for bed count
- threshold_city <- 0 : Exact match for city
- threshold_room_type_category <- 0 : Exact match for room type
threshold_rating <- 0.1
threshold_title_length <- 5
threshold_beds <- 1
threshold_city <- 0
threshold_room_type_category <- 0
3.- Filter the listings based on the allowed difference for each variable
matched_records <- airbnb_df |>
filter(
abs(rating - new_listing$rating) <= threshold_rating &
abs(nchar(title) - new_listing$title_length) <= threshold_title_length &
abs(as.numeric(str_extract(bed_info, "\\d+")) - as.numeric(str_extract(new_listing$bed_info, "\\d+"))) <= threshold_beds &
city == new_listing$city &
room_type_category == new_listing$room_type_category
)|>
mutate(discounted_price = as.numeric(discounted_price))|>
filter(!is.na(discounted_price))
4.- Display existing listings that match the criteria
New Listing:
datatable(new_listing,
escape = FALSE,
options = list(
scrollX = TRUE,
dom = 't', # This option disables the table controls
paging = FALSE # This option disables pagination
),
caption = "New Listing"
)
Price for new listing:
print(predictions)
## 1
## 87.30441
library(DT)
matched_records$pic1 <- paste0('<p><img src="', matched_records$pic1, '" style="width:50px;height:50px;"></p>')
matched_records$pic2 <- paste0('<p><img src="', matched_records$pic2, '" style="width:50px;height:50px;"></p>')
matched_records$pic3 <- paste0('<p><img src="', matched_records$pic3, '" style="width:50px;height:50px;"></p>')
matched_records$pic4 <- paste0('<p><img src="', matched_records$pic4, '" style="width:50px;height:50px;"></p>')
matched_records$pic5 <- paste0('<p><img src="', matched_records$pic5, '" style="width:50px;height:50px;"></p>')
matched_records$pic6 <- paste0('<p><img src="', matched_records$pic6, '" style="width:50px;height:50px;"></p>')
matched_records <- matched_records|>
mutate(other_pic = paste0(pic2,' ',pic3,' ',pic4,' ',pic5,' ',pic6))
If we compare the matching records with the new_listing
we can see that the price is in the same range.
library(ggplot2)
price_summary <- matched_records |>
mutate(discounted_price = as.numeric(discounted_price)) |>
filter(!is.na(discounted_price))|>
group_by(room_type_category) |>
summarize(
min_price = round(min(discounted_price),2),
mean_price = round(mean(discounted_price),2),
max_price = round(max(discounted_price),2)
)
datatable(price_summary,
escape = FALSE,
options = list(
scrollX = TRUE,
dom = 't', # This option disables the table controls
paging = FALSE # This option disables pagination
),
caption = "Summary of Price Range - Similar Listings"
)
Similar Listings:
datatable(
matched_records|>
select(id,pic1,other_pic,title,bed_info,rating,badge_type,city,room_type_category,price_label)|>
rename(price = price_label),
escape = FALSE,
options = list(pageLength = 3,
scrollX = TRUE),
caption = "Similar Listings"
)
In conclusion, if we compare the new listing with the similar listings, we can see that the price is in the same range. This indicates that the model is performing well in predicting the price of new listings based on the independent variables. However, if we want to improve the model, we can consider including more data and using more advanced techniques.