# install and packages and libraries needed
#install.packages("tidyverse")
#install.packages("ggplot")
#install.packages("dplyr")
#install.packages("gtExtras")
#install.packages("leaflet")
Airbnb Data Analysis and Visualization Project
New York City Airbnb Data
Airbnb, Inc is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.
Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in New York City Kaggle:Airbnb
Dataset
The following Airbnb activity is included in this New York dataset:
Listings, including full descriptions and average review score Reviews, including unique id for each reviewer and detailed comments Calendar, including listing id and the price and availability for that day
Column | Description | Data type |
---|---|---|
id | Unique id of each listing | numeric |
name | Name of the airbnb listing | text |
Host id | Unique id for the host | numeric |
host_identity_verified | Whether the identify of the host is verified or not | Categorical |
host name | Name of the host | text |
neighbourhood group | District where the property is | Categorical |
neighbourhood | Area or locality of the property | Categorical |
lat | Latitude | numeric |
long | Longitude | numeric |
country | Country where the property is | Categorical |
country code | ISO country code | Categorical |
instant_bookable | If the property can be instantly booked or not | Categorical |
cancellation_policy | Cancellation policy for the booking | Categorical |
room type | Type of room | Categorical |
Construction year | Year when the property was constructed | numeric |
price | Price per night | numeric |
service fee | Additional service fee | numeric |
minimum nights | Minimum number of nights required for booking | numeric |
number of reviews | Total number of reviews | numeric |
last review | Last review date | date |
reviews per month | Average number of reviews per month | numeric |
review rate number | Rating score based on reviews | numeric |
calculated host listings count | Total number of listing managed by the host | numeric |
availability 365 | Number of days the property is available for booking throughout the year | numeric |
house_rules | Rules defined by the host for their guests | text |
license | License number for legal compliance of the listing | text |
Steps
Step 1 - Explore the dataset
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the dataset
<- read.csv("Airbnb_Open_Data.csv")
data
#View(data) # gives the view of entire dataset
head(data) # first 6 rows
id NAME host.id
1 1001254 Clean & quiet apt home by the park 80014485718
2 1002102 Skylit Midtown Castle 52335172823
3 1002403 THE VILLAGE OF HARLEM....NEW YORK ! 78829239556
4 1002755 85098326012
5 1003689 Entire Apt: Spacious Studio/Loft by central park 92037596077
6 1004098 Large Cozy 1 BR Apartment In Midtown East 45498551794
host_identity_verified host.name neighbourhood.group neighbourhood lat
1 unconfirmed Madaline Brooklyn Kensington 40.64749
2 verified Jenna Manhattan Midtown 40.75362
3 Elise Manhattan Harlem 40.80902
4 unconfirmed Garry Brooklyn Clinton Hill 40.68514
5 verified Lyndon Manhattan East Harlem 40.79851
6 verified Michelle Manhattan Murray Hill 40.74767
long country country.code instant_bookable cancellation_policy
1 -73.97237 United States US FALSE strict
2 -73.98377 United States US FALSE moderate
3 -73.94190 United States US TRUE flexible
4 -73.95976 United States US TRUE moderate
5 -73.94399 United States US FALSE moderate
6 -73.97500 United States US TRUE flexible
room.type Construction.year price service.fee minimum.nights
1 Private room 2020 $966 $193 10
2 Entire home/apt 2007 $142 $28 30
3 Private room 2005 $620 $124 3
4 Entire home/apt 2005 $368 $74 30
5 Entire home/apt 2009 $204 $41 10
6 Entire home/apt 2013 $577 $115 3
number.of.reviews last.review reviews.per.month review.rate.number
1 9 10/19/2021 0.21 4
2 45 5/21/2022 0.38 4
3 0 NA 5
4 270 7/5/2019 4.64 4
5 9 11/19/2018 0.10 3
6 74 6/22/2019 0.59 3
calculated.host.listings.count availability.365
1 6 286
2 2 228
3 1 352
4 1 322
5 1 289
6 1 374
house_rules
1 Clean up and treat the home the way you'd like your home to be treated. No smoking.
2 Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. I have a cute and quiet mixed chihuahua. I could accept more guests (for an extra fee) but this also needs to be confirmed beforehand. Also friends traveling together could sleep in separate beds for an extra fee (the second bed is either a sofa bed or inflatable bed). Smoking is only allowed on the porch.
3 I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge to use the washer/dryer in the basement. No smoking, inside or outside. Come home as late as you want. If you come home stumbling drunk, it's OK the first time. If you do it again, and you wake up me or the neighbors downstairs, we will be annoyed. (Just so you know . . . )
4
5 Please no smoking in the house, porch or on the property (you can go to the nearby corner). Reasonable quiet after 10:30 pm. Please remove shoes in the house.
6 No smoking, please, and no drugs.
license
1
2
3
4
5
6
#tail(data) # last 6 rows
dim(data) # shape - 102599, 26
[1] 102599 26
names(data) # names of all the columns
[1] "id" "NAME"
[3] "host.id" "host_identity_verified"
[5] "host.name" "neighbourhood.group"
[7] "neighbourhood" "lat"
[9] "long" "country"
[11] "country.code" "instant_bookable"
[13] "cancellation_policy" "room.type"
[15] "Construction.year" "price"
[17] "service.fee" "minimum.nights"
[19] "number.of.reviews" "last.review"
[21] "reviews.per.month" "review.rate.number"
[23] "calculated.host.listings.count" "availability.365"
[25] "house_rules" "license"
#str(data) # to check the datatype
glimpse(data) # better than str to see the data
Rows: 102,599
Columns: 26
$ id <int> 1001254, 1002102, 1002403, 1002755, 100…
$ NAME <chr> "Clean & quiet apt home by the park", "…
$ host.id <dbl> 80014485718, 52335172823, 78829239556, …
$ host_identity_verified <chr> "unconfirmed", "verified", "", "unconfi…
$ host.name <chr> "Madaline", "Jenna", "Elise", "Garry", …
$ neighbourhood.group <chr> "Brooklyn", "Manhattan", "Manhattan", "…
$ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "Cli…
$ lat <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
$ long <dbl> -73.97237, -73.98377, -73.94190, -73.95…
$ country <chr> "United States", "United States", "Unit…
$ country.code <chr> "US", "US", "US", "US", "US", "US", "US…
$ instant_bookable <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, …
$ cancellation_policy <chr> "strict", "moderate", "flexible", "mode…
$ room.type <chr> "Private room", "Entire home/apt", "Pri…
$ Construction.year <int> 2020, 2007, 2005, 2005, 2009, 2013, 201…
$ price <chr> "$966 ", "$142 ", "$620 ", "$368 ", "$2…
$ service.fee <chr> "$193 ", "$28 ", "$124 ", "$74 ", "$41 …
$ minimum.nights <int> 10, 30, 3, 30, 10, 3, 45, 45, 2, 2, 1, …
$ number.of.reviews <int> 9, 45, 0, 270, 9, 74, 49, 49, 430, 118,…
$ last.review <chr> "10/19/2021", "5/21/2022", "", "7/5/201…
$ reviews.per.month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
$ review.rate.number <int> 4, 4, 5, 4, 3, 3, 5, 5, 3, 5, 3, 4, 4, …
$ calculated.host.listings.count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, …
$ availability.365 <int> 286, 228, 352, 322, 289, 374, 224, 219,…
$ house_rules <chr> "Clean up and treat the home the way yo…
$ license <chr> "", "", "", "", "", "", "", "", "", "",…
sum(duplicated(data)) # there are 541 duplicate values
[1] 541
Insights -
With the glimpse we can see all the datatypes for the columns.
We can see that price, service fee have character dataype rather than int and last review is also character instead of date
STEP 2 - DATA CLEANING
# next lets drop our dupliacted rows
<- data %>%
data distinct()
dim(data) # 102058 , 26
[1] 102058 26
# next, lets deal with our null / misisng values
names(data)
[1] "id" "NAME"
[3] "host.id" "host_identity_verified"
[5] "host.name" "neighbourhood.group"
[7] "neighbourhood" "lat"
[9] "long" "country"
[11] "country.code" "instant_bookable"
[13] "cancellation_policy" "room.type"
[15] "Construction.year" "price"
[17] "service.fee" "minimum.nights"
[19] "number.of.reviews" "last.review"
[21] "reviews.per.month" "review.rate.number"
[23] "calculated.host.listings.count" "availability.365"
[25] "house_rules" "license"
# lets look at the missing values -
# lets convert all blank rows to NA
<- data %>%
data mutate(across(where(is.character), ~na_if(., "")))
#View(data)
%>%
data summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Column", values_to = "Missing Values") %>%
gt() %>% # neaten the table
tab_header(title = "Missing Values Across the Dataset") %>%
cols_align(align = 'left') %>%
gt_theme_dark()
Missing Values Across the Dataset | |
---|---|
Column | Missing Values |
id | 0 |
NAME | 249 |
host.id | 0 |
host_identity_verified | 289 |
host.name | 404 |
neighbourhood.group | 29 |
neighbourhood | 16 |
lat | 8 |
long | 8 |
country | 532 |
country.code | 131 |
instant_bookable | 105 |
cancellation_policy | 76 |
room.type | 0 |
Construction.year | 214 |
price | 247 |
service.fee | 273 |
minimum.nights | 400 |
number.of.reviews | 183 |
last.review | 15832 |
reviews.per.month | 15818 |
review.rate.number | 319 |
calculated.host.listings.count | 319 |
availability.365 | 448 |
house_rules | 51842 |
license | 102056 |
Insights -
The number of missing data differs across the dataset with missing values in 23 columns
We cannot directly remove these, we will perform data manipulation on this
The license column is almost empty. There are 102597 rows are missing, hence this isnt a column that would be useful for us.
There are just 2 rows. We can see what they are, but they are of no use, so lets drop
%>%
data filter(!is.na(license)) %>%
select(license)
license
1 41662/AL
2 41662/AL
# drop column license-
<- data %>%
data select(-license)
#View(data)
# next lets look at the house rules column, there are 51842, almost half missing values
# these rules dont account to any specific need. We can check an individual rule, but its not important for our purpose, so we will drop
<- data %>%
data select(-house_rules)
Insights -
For remaining columns, let do the cleaning based on the questions we are planning to answer
Data cleaning is based on your problem statement. Whether to drop or do imputation differs according to the question.
For this dataset we will will tackle rest missing values based on the following problems
STEP 3 - DATA MANIPULATION
# first lets start with dealing with our datatype conversion
<-data %>%
data mutate(price = as.numeric(str_remove(price, "\\$"))) %>%
mutate(service.fee = as.numeric(str_remove(service.fee, "\\$"))) %>%
mutate(last.review = as.Date(last.review, format = "%m/%d/%Y"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `price = as.numeric(str_remove(price, "\\$"))`.
Caused by warning:
! NAs introduced by coercion
#View(data)
STEP 4 - DATA DESCRIBE AND SUMMARY
# lets check the summary for our numeric data columns
%>%
data select(price, service.fee, minimum.nights, number.of.reviews,
review.rate.number, reviews.per.month,.365) %>%
calculated.host.listings.count, availabilitysummary()
price service.fee minimum.nights number.of.reviews
Min. : 50.0 Min. : 10 Min. :-1223.000 Min. : 0.00
1st Qu.:288.0 1st Qu.: 68 1st Qu.: 2.000 1st Qu.: 1.00
Median :524.0 Median :125 Median : 3.000 Median : 7.00
Mean :524.8 Mean :125 Mean : 8.127 Mean : 27.52
3rd Qu.:759.0 3rd Qu.:183 3rd Qu.: 5.000 3rd Qu.: 31.00
Max. :999.0 Max. :240 Max. : 5645.000 Max. :1024.00
NA's :18059 NA's :273 NA's :400 NA's :183
review.rate.number reviews.per.month calculated.host.listings.count
Min. :1.000 Min. : 0.010 Min. : 1.000
1st Qu.:2.000 1st Qu.: 0.220 1st Qu.: 1.000
Median :3.000 Median : 0.740 Median : 1.000
Mean :3.279 Mean : 1.375 Mean : 7.937
3rd Qu.:4.000 3rd Qu.: 2.010 3rd Qu.: 2.000
Max. :5.000 Max. :90.000 Max. :332.000
NA's :319 NA's :15818 NA's :319
availability.365
Min. : -10
1st Qu.: 3
Median : 96
Mean : 141
3rd Qu.: 268
Max. :3677
NA's :448
## this summary is useful for us to look at the distribution of the dataset
# minimum nights, number of reviews, review rate number and review rate month and calulated.host.listing.count, doesnt have any spread.
STEP 5 - DATA VISUALIZATION - QUESTIONS
I have solved around 12 questions on this dataset, you can find this project on my Github