Airbnb Data Analysis and Visualization Project

New York City Airbnb Data

Airbnb, Inc. is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving a commission from each booking. The company was founded in 2008. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.

Context

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in New York City. Source: Kaggle: Airbnb

Dataset

The following Airbnb activity is included in this New York dataset:

- Listings, including full descriptions and average review score
- Reviews, including unique id for each reviewer and detailed comments
- Calendar, including listing id and the price and availability for that day

Column | Description | Data type
id | Unique id of each listing | numeric
name | Name of the Airbnb listing | text
Host id | Unique id for the host | numeric
host_identity_verified | Whether the identity of the host is verified or not | categorical
host name | Name of the host | text
neighbourhood group | District where the property is | categorical
neighbourhood | Area or locality of the property | categorical
lat | Latitude | numeric
long | Longitude | numeric
country | Country where the property is | categorical
country code | ISO country code | categorical
instant_bookable | Whether the property can be instantly booked or not | categorical
cancellation_policy | Cancellation policy for the booking | categorical
room type | Type of room | categorical
Construction year | Year when the property was constructed | numeric
price | Price per night | numeric
service fee | Additional service fee | numeric
minimum nights | Minimum number of nights required for booking | numeric
number of reviews | Total number of reviews | numeric
last review | Last review date | date
reviews per month | Average number of reviews per month | numeric
review rate number | Rating score based on reviews | numeric
calculated host listings count | Total number of listings managed by the host | numeric
availability 365 | Number of days the property is available for booking throughout the year | numeric
house_rules | Rules defined by the host for their guests | text
license | License number for legal compliance of the listing | text

Steps

Step 1 - Explore the dataset

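# install the packages needed (uncomment on first run), then load them

#install.packages("tidyverse")
#install.packages("ggplot2") # already included in tidyverse
#install.packages("dplyr") # already included in tidyverse
#install.packages("gtExtras")
#install.packages("leaflet")

library(gt)        # gt() tables
library(gtExtras)  # gt_theme_dark()
library(leaflet)   # interactive maps
library(tidyverse) # dplyr, ggplot2, tidyr, stringr, readr, ...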
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the dataset
data <- read.csv("Airbnb_Open_Data.csv")

#View(data) # gives the view of entire dataset
head(data) # first 6 rows 
       id                                             NAME     host.id
1 1001254               Clean & quiet apt home by the park 80014485718
2 1002102                            Skylit Midtown Castle 52335172823
3 1002403              THE VILLAGE OF HARLEM....NEW YORK ! 78829239556
4 1002755                                                  85098326012
5 1003689 Entire Apt: Spacious Studio/Loft by central park 92037596077
6 1004098        Large Cozy 1 BR Apartment In Midtown East 45498551794
  host_identity_verified host.name neighbourhood.group neighbourhood      lat
1            unconfirmed  Madaline            Brooklyn    Kensington 40.64749
2               verified     Jenna           Manhattan       Midtown 40.75362
3                            Elise           Manhattan        Harlem 40.80902
4            unconfirmed     Garry            Brooklyn  Clinton Hill 40.68514
5               verified    Lyndon           Manhattan   East Harlem 40.79851
6               verified  Michelle           Manhattan   Murray Hill 40.74767
       long       country country.code instant_bookable cancellation_policy
1 -73.97237 United States           US            FALSE              strict
2 -73.98377 United States           US            FALSE            moderate
3 -73.94190 United States           US             TRUE            flexible
4 -73.95976 United States           US             TRUE            moderate
5 -73.94399 United States           US            FALSE            moderate
6 -73.97500 United States           US             TRUE            flexible
        room.type Construction.year price service.fee minimum.nights
1    Private room              2020 $966        $193              10
2 Entire home/apt              2007 $142         $28              30
3    Private room              2005 $620        $124               3
4 Entire home/apt              2005 $368         $74              30
5 Entire home/apt              2009 $204         $41              10
6 Entire home/apt              2013 $577        $115               3
  number.of.reviews last.review reviews.per.month review.rate.number
1                 9  10/19/2021              0.21                  4
2                45   5/21/2022              0.38                  4
3                 0                            NA                  5
4               270    7/5/2019              4.64                  4
5                 9  11/19/2018              0.10                  3
6                74   6/22/2019              0.59                  3
  calculated.host.listings.count availability.365
1                              6              286
2                              2              228
3                              1              352
4                              1              322
5                              1              289
6                              1              374
                                                                                                                                                                                                                                                                                                                                                                                                          house_rules
1                                                                                                                                                                                                                                                                                                                                Clean up and treat the home the way you'd like your home to be treated.  No smoking.
2 Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. I have a cute and quiet mixed chihuahua. I could accept more guests (for an extra fee) but this also needs to be confirmed beforehand. Also friends traveling together could sleep in separate beds for an extra fee (the second bed is either a sofa bed or inflatable bed). Smoking is only allowed on the porch.
3                              I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge to use the washer/dryer in the basement.  No smoking, inside or outside. Come home as late as you want.  If you come home stumbling drunk, it's OK the first time. If you do it again, and you wake up me or the neighbors downstairs, we will be annoyed.  (Just so you know . . . )
4                                                                                                                                                                                                                                                                                                                                                                                                                    
5                                                                                                                                                                                                                                                    Please no smoking in the house, porch or on the property (you can go to the nearby corner).  Reasonable quiet after 10:30 pm.  Please remove shoes in the house.
6                                                                                                                                                                                                                                                                                                                                                                                   No smoking, please, and no drugs.
  license
1        
2        
3        
4        
5        
6        
#tail(data) # last 6 rows
dim(data) # shape - 102599, 26
[1] 102599     26
names(data) # names of all the columns
 [1] "id"                             "NAME"                          
 [3] "host.id"                        "host_identity_verified"        
 [5] "host.name"                      "neighbourhood.group"           
 [7] "neighbourhood"                  "lat"                           
 [9] "long"                           "country"                       
[11] "country.code"                   "instant_bookable"              
[13] "cancellation_policy"            "room.type"                     
[15] "Construction.year"              "price"                         
[17] "service.fee"                    "minimum.nights"                
[19] "number.of.reviews"              "last.review"                   
[21] "reviews.per.month"              "review.rate.number"            
[23] "calculated.host.listings.count" "availability.365"              
[25] "house_rules"                    "license"                       
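
The dotted names above differ from the data dictionary (for example "host id" versus "host.id"). A likely explanation, assuming the CSV headers contain spaces as the dictionary suggests, is that read.csv() cleans headers with make.names() because its check.names argument defaults to TRUE. A minimal sketch with illustrative header strings (not read from the actual file):

```r
# Hypothetical raw headers, assumed to contain spaces as in the data dictionary
raw_headers <- c("host id", "neighbourhood group", "availability 365", "house_rules")

# make.names() turns arbitrary strings into syntactically valid R names;
# read.csv() applies it to headers when check.names = TRUE (the default)
syntactic <- make.names(raw_headers)
print(syntactic)
```

Passing check.names = FALSE to read.csv() would keep the original headers, at the cost of needing backticks around any name that contains a space.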
#str(data) # to check the datatype

glimpse(data) # better than str to see the data
Rows: 102,599
Columns: 26
$ id                             <int> 1001254, 1002102, 1002403, 1002755, 100…
$ NAME                           <chr> "Clean & quiet apt home by the park", "…
$ host.id                        <dbl> 80014485718, 52335172823, 78829239556, …
$ host_identity_verified         <chr> "unconfirmed", "verified", "", "unconfi…
$ host.name                      <chr> "Madaline", "Jenna", "Elise", "Garry", …
$ neighbourhood.group            <chr> "Brooklyn", "Manhattan", "Manhattan", "…
$ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "Cli…
$ lat                            <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
$ long                           <dbl> -73.97237, -73.98377, -73.94190, -73.95…
$ country                        <chr> "United States", "United States", "Unit…
$ country.code                   <chr> "US", "US", "US", "US", "US", "US", "US…
$ instant_bookable               <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, …
$ cancellation_policy            <chr> "strict", "moderate", "flexible", "mode…
$ room.type                      <chr> "Private room", "Entire home/apt", "Pri…
$ Construction.year              <int> 2020, 2007, 2005, 2005, 2009, 2013, 201…
$ price                          <chr> "$966 ", "$142 ", "$620 ", "$368 ", "$2…
$ service.fee                    <chr> "$193 ", "$28 ", "$124 ", "$74 ", "$41 …
$ minimum.nights                 <int> 10, 30, 3, 30, 10, 3, 45, 45, 2, 2, 1, …
$ number.of.reviews              <int> 9, 45, 0, 270, 9, 74, 49, 49, 430, 118,…
$ last.review                    <chr> "10/19/2021", "5/21/2022", "", "7/5/201…
$ reviews.per.month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
$ review.rate.number             <int> 4, 4, 5, 4, 3, 3, 5, 5, 3, 5, 3, 4, 4, …
$ calculated.host.listings.count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, …
$ availability.365               <int> 286, 228, 352, 322, 289, 374, 224, 219,…
$ house_rules                    <chr> "Clean up and treat the home the way yo…
$ license                        <chr> "", "", "", "", "", "", "", "", "", "",…
sum(duplicated(data)) # there are 541 duplicate rows
[1] 541

Insights -

glimpse() shows the data type of every column.

price and service fee are stored as character rather than numeric, and last review is stored as character rather than date.
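
The same check can be done programmatically instead of by eye. The sketch below uses a small toy data frame (the values are illustrative, not taken from the file) and a hypothetical heuristic: a character column is treated as currency if every non-empty value starts with "$".

```r
# Toy data frame mimicking the raw column types (illustrative values only)
toy <- data.frame(
  price = c("$966 ", "$142 ", "$620 "),
  minimum.nights = c(10L, 30L, 3L),
  last.review = c("10/19/2021", "5/21/2022", ""),
  stringsAsFactors = FALSE
)

# Class of every column, as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# Hypothetical heuristic: character columns whose non-empty values all start with "$"
chr_cols <- names(col_types)[col_types == "character"]
looks_like_currency <- sapply(toy[chr_cols], function(x) {
  x <- x[!is.na(x) & x != ""]
  length(x) > 0 && all(grepl("^\\$", x))
})
print(looks_like_currency)
```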

STEP 2 - DATA CLEANING

# next, let's drop the duplicated rows

data <- data %>% 
  distinct() 

dim(data) # 102058 , 26
[1] 102058     26
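
The row counts are consistent: 102599 rows minus 541 duplicates leaves 102058. This works because duplicated() flags only the second and later copies of a row, which are exactly the rows distinct() removes. A minimal sketch on toy data (illustrative values):

```r
library(dplyr)

# Toy data: row 2 appears twice and row 3 appears three times
toy <- data.frame(
  id    = c(1, 2, 2, 3, 3, 3),
  price = c(10, 20, 20, 30, 30, 30)
)

# duplicated() marks every repeat after the first occurrence
n_dupes <- sum(duplicated(toy))

# distinct() keeps the first occurrence of each unique row
deduped <- distinct(toy)

# Rows removed by distinct() should equal the number of flagged duplicates
rows_removed <- nrow(toy) - nrow(deduped)
print(c(duplicates = n_dupes, removed = rows_removed))
```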
# next, let's deal with the null / missing values
names(data)
 [1] "id"                             "NAME"                          
 [3] "host.id"                        "host_identity_verified"        
 [5] "host.name"                      "neighbourhood.group"           
 [7] "neighbourhood"                  "lat"                           
 [9] "long"                           "country"                       
[11] "country.code"                   "instant_bookable"              
[13] "cancellation_policy"            "room.type"                     
[15] "Construction.year"              "price"                         
[17] "service.fee"                    "minimum.nights"                
[19] "number.of.reviews"              "last.review"                   
[21] "reviews.per.month"              "review.rate.number"            
[23] "calculated.host.listings.count" "availability.365"              
[25] "house_rules"                    "license"                       
# let's look at the missing values
# first, convert all blank strings to NA

data <- data %>% 
  mutate(across(where(is.character), ~na_if(., ""))) 
#View(data)

data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing Values") %>%
  gt() %>% # neaten the table
  tab_header(title = "Missing Values Across the Dataset") %>%
  cols_align(align = 'left') %>%
  gt_theme_dark()
Missing Values Across the Dataset
Column Missing Values
id 0
NAME 249
host.id 0
host_identity_verified 289
host.name 404
neighbourhood.group 29
neighbourhood 16
lat 8
long 8
country 532
country.code 131
instant_bookable 105
cancellation_policy 76
room.type 0
Construction.year 214
price 247
service.fee 273
minimum.nights 400
number.of.reviews 183
last.review 15832
reviews.per.month 15818
review.rate.number 319
calculated.host.listings.count 319
availability.365 448
house_rules 51842
license 102056
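
Raw counts are easier to judge as a share of all rows. The sketch below repeats the blank-to-NA conversion on a toy data frame (illustrative values) and reports the percentage missing per column; the rounding to one decimal is an arbitrary choice.

```r
library(dplyr)

# Toy data with blank strings and a true NA (illustrative values only)
toy <- data.frame(
  name = c("a", "", "c", ""),
  lat  = c(40.1, NA, 40.3, 40.4),
  room = c("x", "y", "z", "w"),
  stringsAsFactors = FALSE
)

# Convert blank strings in character columns to NA, as in the step above
cleaned <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# Percentage of missing values per column
pct_missing <- round(colMeans(is.na(cleaned)) * 100, 1)
print(pct_missing)
```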

Insights -

The number of missing values differs across the dataset, with missing values in 23 of the 26 columns.

We cannot simply remove all of these rows; we will handle them through data manipulation.

The license column is almost empty: 102056 of 102058 rows are missing, so this column is not useful for us.

Only 2 rows have a license value. We can see what they are, but they are of no use, so let's drop the column.

data %>%
  filter(!is.na(license)) %>%
  select(license) 
   license
1 41662/AL
2 41662/AL
# drop the license column
data <- data %>% 
  select(-license)

#View(data)
# next, let's look at the house_rules column: 51842 values, about half, are missing
# these rules don't serve any specific need here. We could inspect individual rules, but that is not important for our purpose, so we will drop the column
data <- data %>% 
  select(-house_rules)

Insights -

For the remaining columns, let's do the cleaning based on the questions we are planning to answer.

Data cleaning depends on your problem statement. Whether to drop or impute missing values differs according to the question.

For this dataset we will tackle the remaining missing values based on the questions that follow.
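
To make the drop-versus-impute choice concrete, here is a minimal sketch on toy data (illustrative values). Median imputation is only one possible strategy and is an assumption here, not necessarily the approach used later in the project.

```r
library(dplyr)

# Toy listings with one missing price (illustrative values only)
toy <- data.frame(
  room.type = c("Private room", "Entire home/apt", "Private room", "Entire home/apt"),
  price     = c(100, NA, 300, 500)
)

# Option 1: drop rows with a missing price (suits questions about price itself)
dropped <- toy %>%
  filter(!is.na(price))

# Option 2: keep every row and fill the gap with the median price
# (suits questions where losing the row would bias counts, e.g. listings per room type)
imputed <- toy %>%
  mutate(price = if_else(is.na(price), median(price, na.rm = TRUE), price))

print(nrow(dropped))
print(imputed$price)
```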

STEP 3 - DATA MANIPULATION

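# first, let's convert the data types
# note: str_remove() strips only the first "$", so any value that still contains
# non-numeric characters afterwards becomes NA (hence the coercion warning below)

data <- data %>% 
  mutate(price = as.numeric(str_remove(price, "\\$"))) %>% 
  mutate(service.fee = as.numeric(str_remove(service.fee, "\\$"))) %>% 
  mutate(last.review = as.Date(last.review, format = "%m/%d/%Y"))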
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `price = as.numeric(str_remove(price, "\\$"))`.
Caused by warning:
! NAs introduced by coercion
#View(data)
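
The coercion warning deserves a closer look: price had 247 missing values before the conversion, so any increase after it comes from values that failed to parse. One plausible cause, assumed here rather than confirmed from the file, is a thousands separator in prices of 1,000 or more (for example "$1,060 "). The sketch below compares the original conversion with an alternative that also strips commas and whitespace.

```r
library(stringr)

# Illustrative raw values; "$1,060 " is an assumed example of a price above 999
raw_price <- c("$966 ", "$142 ", "$1,060 ", NA)

# Original approach: only the "$" is removed, so "1,060 " cannot be parsed
naive <- suppressWarnings(as.numeric(str_remove(raw_price, "\\$")))

# Alternative: remove "$", "," and whitespace before converting
fixed <- as.numeric(str_remove_all(raw_price, "[$,\\s]"))

print(naive)
print(fixed)
```

If this assumption holds for the real data, applying the second form to price would preserve the higher-priced listings instead of turning them into NA.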

STEP 4 - DATA DESCRIBE AND SUMMARY

# let's check the summary for our numeric data columns
data %>% 
  select(price, service.fee, minimum.nights, number.of.reviews,
         review.rate.number, reviews.per.month,
         calculated.host.listings.count, availability.365) %>% 
  summary()
     price        service.fee  minimum.nights      number.of.reviews
 Min.   : 50.0   Min.   : 10   Min.   :-1223.000   Min.   :   0.00  
 1st Qu.:288.0   1st Qu.: 68   1st Qu.:    2.000   1st Qu.:   1.00  
 Median :524.0   Median :125   Median :    3.000   Median :   7.00  
 Mean   :524.8   Mean   :125   Mean   :    8.127   Mean   :  27.52  
 3rd Qu.:759.0   3rd Qu.:183   3rd Qu.:    5.000   3rd Qu.:  31.00  
 Max.   :999.0   Max.   :240   Max.   : 5645.000   Max.   :1024.00  
 NA's   :18059   NA's   :273   NA's   :400         NA's   :183      
 review.rate.number reviews.per.month calculated.host.listings.count
 Min.   :1.000      Min.   : 0.010    Min.   :  1.000               
 1st Qu.:2.000      1st Qu.: 0.220    1st Qu.:  1.000               
 Median :3.000      Median : 0.740    Median :  1.000               
 Mean   :3.279      Mean   : 1.375    Mean   :  7.937               
 3rd Qu.:4.000      3rd Qu.: 2.010    3rd Qu.:  2.000               
 Max.   :5.000      Max.   :90.000    Max.   :332.000               
 NA's   :319        NA's   :15818     NA's   :319                   
 availability.365
 Min.   : -10    
 1st Qu.:   3    
 Median :  96    
 Mean   : 141    
 3rd Qu.: 268    
 Max.   :3677    
 NA's   :448     
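## this summary is useful for looking at the distribution of each numeric column
# minimum.nights, number.of.reviews, reviews.per.month and calculated.host.listings.count are tightly concentrated (narrow interquartile ranges) but have a few extreme values
# review.rate.number is bounded between 1 and 5
# minimum.nights and availability.365 also contain impossible values (negative numbers, availability above 365)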
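
The summary shows a minimum of -1223 for minimum.nights and a range of -10 to 3677 for availability.365, which cannot be valid. A minimal sketch for flagging such rows on toy data follows. The rules are assumptions: minimum nights must be at least 1, and availability must lie between 0 and 365 as the column description implies. The flag column names are hypothetical.

```r
library(dplyr)

# Toy rows reusing the extreme values seen in the summary
toy <- data.frame(
  minimum.nights   = c(10, -1223, 3, 5645),
  availability.365 = c(286, -10, 352, 3677)
)

# Flag values that break the assumed validity rules
flagged <- toy %>%
  mutate(
    bad_min_nights   = minimum.nights < 1,
    bad_availability = availability.365 < 0 | availability.365 > 365
  )

print(flagged)

# Count of rows with at least one invalid value
n_invalid <- sum(flagged$bad_min_nights | flagged$bad_availability)
print(n_invalid)
```

Whether flagged rows are dropped, capped, or set to NA depends on the question being answered.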

STEP 5 - DATA VISUALIZATION - QUESTIONS

I have answered around 12 questions on this dataset; you can find the full project in my GitHub Repository.