Airbnb Data Analysis and Visualization Project

New York City Airbnb Data

Airbnb, Inc. is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving a commission from each booking. The company was founded in 2008. Airbnb is a shortened version of its original name, AirBedandBreakfast.com.

Context

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in New York City (source: Kaggle: Airbnb).

Dataset

The following Airbnb activity is included in this New York dataset:

Listings, including full descriptions and average review score
Reviews, including unique id for each reviewer and detailed comments
Calendar, including listing id and the price and availability for that day

Column	Description	Data type
id	Unique id of each listing	numeric
name	Name of the Airbnb listing	text
Host id	Unique id for the host	numeric
host_identity_verified	Whether the identity of the host is verified or not	Categorical
host name	Name of the host	text
neighbourhood group	District where the property is	Categorical
neighbourhood	Area or locality of the property	Categorical
lat	Latitude	numeric
long	Longitude	numeric
country	Country where the property is	Categorical
country code	ISO country code	Categorical
instant_bookable	If the property can be instantly booked or not	Categorical
cancellation_policy	Cancellation policy for the booking	Categorical
room type	Type of room	Categorical
Construction year	Year when the property was constructed	numeric
price	Price per night	numeric
service fee	Additional service fee	numeric
minimum nights	Minimum number of nights required for booking	numeric
number of reviews	Total number of reviews	numeric
last review	Last review date	date
reviews per month	Average number of reviews per month	numeric
review rate number	Rating score based on reviews	numeric
calculated host listings count	Total number of listings managed by the host	numeric
availability 365	Number of days the property is available for booking throughout the year	numeric
house_rules	Rules defined by the host for their guests	text
license	License number for legal compliance of the listing	text

Steps

Step 1 - Explore the dataset

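# Install the packages needed (uncomment on the first run)

#install.packages("tidyverse")
#install.packages("ggplot2")  # part of tidyverse; the package is ggplot2, not ggplot
#install.packages("dplyr")    # part of tidyverse
#install.packages("gtExtras")
#install.packages("leaflet")

# Load the libraries used in the steps below
library(tidyverse)
library(gt)        # gt() tables
library(gtExtras)  # gt_theme_dark()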

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read the dataset
data <- read.csv("Airbnb_Open_Data.csv")

#View(data) # gives the view of entire dataset

head(data) # first 6 rows

       id                                             NAME     host.id
1 1001254               Clean & quiet apt home by the park 80014485718
2 1002102                            Skylit Midtown Castle 52335172823
3 1002403              THE VILLAGE OF HARLEM....NEW YORK ! 78829239556
4 1002755                                                  85098326012
5 1003689 Entire Apt: Spacious Studio/Loft by central park 92037596077
6 1004098        Large Cozy 1 BR Apartment In Midtown East 45498551794
  host_identity_verified host.name neighbourhood.group neighbourhood      lat
1            unconfirmed  Madaline            Brooklyn    Kensington 40.64749
2               verified     Jenna           Manhattan       Midtown 40.75362
3                            Elise           Manhattan        Harlem 40.80902
4            unconfirmed     Garry            Brooklyn  Clinton Hill 40.68514
5               verified    Lyndon           Manhattan   East Harlem 40.79851
6               verified  Michelle           Manhattan   Murray Hill 40.74767
       long       country country.code instant_bookable cancellation_policy
1 -73.97237 United States           US            FALSE              strict
2 -73.98377 United States           US            FALSE            moderate
3 -73.94190 United States           US             TRUE            flexible
4 -73.95976 United States           US             TRUE            moderate
5 -73.94399 United States           US            FALSE            moderate
6 -73.97500 United States           US             TRUE            flexible
        room.type Construction.year price service.fee minimum.nights
1    Private room              2020 $966        $193              10
2 Entire home/apt              2007 $142         $28              30
3    Private room              2005 $620        $124               3
4 Entire home/apt              2005 $368         $74              30
5 Entire home/apt              2009 $204         $41              10
6 Entire home/apt              2013 $577        $115               3
  number.of.reviews last.review reviews.per.month review.rate.number
1                 9  10/19/2021              0.21                  4
2                45   5/21/2022              0.38                  4
3                 0                            NA                  5
4               270    7/5/2019              4.64                  4
5                 9  11/19/2018              0.10                  3
6                74   6/22/2019              0.59                  3
  calculated.host.listings.count availability.365
1                              6              286
2                              2              228
3                              1              352
4                              1              322
5                              1              289
6                              1              374
                                                                                                                                                                                                                                                                                                                                                                                                          house_rules
1                                                                                                                                                                                                                                                                                                                                Clean up and treat the home the way you'd like your home to be treated.  No smoking.
2 Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. I have a cute and quiet mixed chihuahua. I could accept more guests (for an extra fee) but this also needs to be confirmed beforehand. Also friends traveling together could sleep in separate beds for an extra fee (the second bed is either a sofa bed or inflatable bed). Smoking is only allowed on the porch.
3                              I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge to use the washer/dryer in the basement.  No smoking, inside or outside. Come home as late as you want.  If you come home stumbling drunk, it's OK the first time. If you do it again, and you wake up me or the neighbors downstairs, we will be annoyed.  (Just so you know . . . )
4                                                                                                                                                                                                                                                                                                                                                                                                                    
5                                                                                                                                                                                                                                                    Please no smoking in the house, porch or on the property (you can go to the nearby corner).  Reasonable quiet after 10:30 pm.  Please remove shoes in the house.
6                                                                                                                                                                                                                                                                                                                                                                                   No smoking, please, and no drugs.
  license
1        
2        
3        
4        
5        
6

#tail(data) # last 6 rows

dim(data) # shape - 102599, 26

[1] 102599     26

names(data) # names of all the columns

 [1] "id"                             "NAME"                          
 [3] "host.id"                        "host_identity_verified"        
 [5] "host.name"                      "neighbourhood.group"           
 [7] "neighbourhood"                  "lat"                           
 [9] "long"                           "country"                       
[11] "country.code"                   "instant_bookable"              
[13] "cancellation_policy"            "room.type"                     
[15] "Construction.year"              "price"                         
[17] "service.fee"                    "minimum.nights"                
[19] "number.of.reviews"              "last.review"                   
[21] "reviews.per.month"              "review.rate.number"            
[23] "calculated.host.listings.count" "availability.365"              
[25] "house_rules"                    "license"

#str(data) # to check the datatype

glimpse(data) # better than str to see the data

Rows: 102,599
Columns: 26
$ id                             <int> 1001254, 1002102, 1002403, 1002755, 100…
$ NAME                           <chr> "Clean & quiet apt home by the park", "…
$ host.id                        <dbl> 80014485718, 52335172823, 78829239556, …
$ host_identity_verified         <chr> "unconfirmed", "verified", "", "unconfi…
$ host.name                      <chr> "Madaline", "Jenna", "Elise", "Garry", …
$ neighbourhood.group            <chr> "Brooklyn", "Manhattan", "Manhattan", "…
$ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "Cli…
$ lat                            <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
$ long                           <dbl> -73.97237, -73.98377, -73.94190, -73.95…
$ country                        <chr> "United States", "United States", "Unit…
$ country.code                   <chr> "US", "US", "US", "US", "US", "US", "US…
$ instant_bookable               <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, …
$ cancellation_policy            <chr> "strict", "moderate", "flexible", "mode…
$ room.type                      <chr> "Private room", "Entire home/apt", "Pri…
$ Construction.year              <int> 2020, 2007, 2005, 2005, 2009, 2013, 201…
$ price                          <chr> "$966 ", "$142 ", "$620 ", "$368 ", "$2…
$ service.fee                    <chr> "$193 ", "$28 ", "$124 ", "$74 ", "$41 …
$ minimum.nights                 <int> 10, 30, 3, 30, 10, 3, 45, 45, 2, 2, 1, …
$ number.of.reviews              <int> 9, 45, 0, 270, 9, 74, 49, 49, 430, 118,…
$ last.review                    <chr> "10/19/2021", "5/21/2022", "", "7/5/201…
$ reviews.per.month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
$ review.rate.number             <int> 4, 4, 5, 4, 3, 3, 5, 5, 3, 5, 3, 4, 4, …
$ calculated.host.listings.count <int> 6, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, …
$ availability.365               <int> 286, 228, 352, 322, 289, 374, 224, 219,…
$ house_rules                    <chr> "Clean up and treat the home the way yo…
$ license                        <chr> "", "", "", "", "", "", "", "", "", "",…

sum(duplicated(data)) # there are 541 duplicate rows

[1] 541
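To make the duplicate count easier to read, here is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical, not taken from the dataset). It shows that `duplicated()` counts only the second and later copies of a row, which is why dropping duplicates later reduces the row count by exactly this number.

```r
library(dplyr)

# Hypothetical toy data: rows 2 and 3 are exact copies of each other
toy <- data.frame(
  id = c(1, 2, 2, 3),
  price = c("$10", "$20", "$20", "$30")
)

# duplicated() marks only the repeated copies, not the first occurrence
n_dup <- sum(duplicated(toy))

# distinct() keeps the first copy of every row
deduped <- distinct(toy)

n_dup
nrow(toy) - nrow(deduped)
```

The same relationship holds for the full dataset: 102599 rows minus 541 duplicates leaves 102058 rows.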

Insights -

With glimpse we can see the datatypes of all the columns.

We can see that price and service fee have the character datatype rather than numeric, and last review is also character instead of date.
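Before converting a whole column, it can help to try the conversion on a few sample values. The sketch below uses the first `last.review` values shown in the `head()` output and assumes every date in the file follows the same month/day/year pattern, which the output alone does not guarantee.

```r
# Sample values copied from the head() output; "" is a blank entry
sample_dates <- c("10/19/2021", "5/21/2022", "", "7/5/2019")

# %m and %d accept one- or two-digit input; blank strings become NA
parsed_dates <- as.Date(sample_dates, format = "%m/%d/%Y")

class(parsed_dates)
format(parsed_dates)
```

If a sample value came back as NA unexpectedly, that would be a sign the format string does not match the data.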

STEP 2 - DATA CLEANING

# Next, let's drop the duplicated rows

data <- data %>% 
  distinct() 

dim(data) # 102058 , 26

[1] 102058     26

# Next, let's deal with the null / missing values
names(data)

 [1] "id"                             "NAME"                          
 [3] "host.id"                        "host_identity_verified"        
 [5] "host.name"                      "neighbourhood.group"           
 [7] "neighbourhood"                  "lat"                           
 [9] "long"                           "country"                       
[11] "country.code"                   "instant_bookable"              
[13] "cancellation_policy"            "room.type"                     
[15] "Construction.year"              "price"                         
[17] "service.fee"                    "minimum.nights"                
[19] "number.of.reviews"              "last.review"                   
[21] "reviews.per.month"              "review.rate.number"            
[23] "calculated.host.listings.count" "availability.365"              
[25] "house_rules"                    "license"

# Let's look at the missing values -
# first, convert all blank strings to NA

data <- data %>% 
  mutate(across(where(is.character), ~na_if(., ""))) 
#View(data)

data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing Values") %>%
  gt() %>% # neaten the table
  tab_header(title = "Missing Values Across the Dataset") %>%
  cols_align(align = 'left') %>%
  gt_theme_dark()

Missing Values Across the Dataset
Column	Missing Values
id	0
NAME	249
host.id	0
host_identity_verified	289
host.name	404
neighbourhood.group	29
neighbourhood	16
lat	8
long	8
country	532
country.code	131
instant_bookable	105
cancellation_policy	76
room.type	0
Construction.year	214
price	247
service.fee	273
minimum.nights	400
number.of.reviews	183
last.review	15832
reviews.per.month	15818
review.rate.number	319
calculated.host.listings.count	319
availability.365	448
house_rules	51842
license	102056
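Raw counts are easier to compare as a share of rows. The following is a rough sketch on a made-up data frame (the `toy` columns and values are hypothetical); it uses `mean(is.na(.))` in place of `sum(is.na(.))` to get the fraction of missing values per column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with different amounts of missingness per column
toy <- data.frame(
  id = 1:4,
  license = c(NA, NA, NA, "41662/AL"),
  price = c("$10", NA, "$30", "$40")
)

# mean() of a logical vector is the fraction of TRUE values
missing_share <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Share") %>%
  arrange(desc(Share))

missing_share
```

Sorting by share puts the emptiest columns at the top, which makes candidates for dropping easy to spot.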

Insights -

The number of missing values differs across the dataset, with missing values in 23 of the 26 columns.

We cannot simply remove all of these rows; we will handle them through data manipulation.

The license column is almost empty: 102056 of the 102058 rows are missing, so this isn't a column that would be useful for us.

Only 2 rows have a license value. We can see what they are, but they are of no use, so let's drop the column.

data %>%
  filter(!is.na(license)) %>%
  select(license)

   license
1 41662/AL
2 41662/AL

# Drop the license column
data <- data %>% 
  select(-license)

#View(data)

# Next, let's look at the house_rules column: 51842 values, about half, are missing.
# These rules don't serve any specific need here. We could check an individual rule, but it's not important for our purpose, so we will drop the column.
data <- data %>% 
  select(-house_rules)

Insights -

For the remaining columns, let's do the cleaning based on the questions we are planning to answer.

Data cleaning depends on your problem statement. Whether to drop or impute missing values differs according to the question.

For this dataset, we will tackle the remaining missing values based on the questions that follow.
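As an illustration of the drop-versus-impute choice, here is a minimal sketch on a made-up data frame. The values, and the choice of the median as the fill value, are assumptions for the example and not the approach used later in this project.

```r
library(dplyr)

# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  neighbourhood.group = c("Brooklyn", "Manhattan", NA, "Brooklyn"),
  minimum.nights = c(2, NA, 3, 30)
)

# Option 1: drop rows only where the column the question needs is missing
by_group <- toy %>%
  filter(!is.na(neighbourhood.group))

# Option 2: keep every row and fill the numeric gap with the median
imputed <- toy %>%
  mutate(minimum.nights = if_else(is.na(minimum.nights),
                                  median(minimum.nights, na.rm = TRUE),
                                  minimum.nights))

nrow(by_group)
imputed$minimum.nights
```

A question about listings per district only needs option 1 on `neighbourhood.group`, while a question about typical stay length might prefer option 2 so that no rows are lost.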

STEP 3 - DATA MANIPULATION

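# First, let's deal with the datatype conversion.
# Note: str_remove() strips only the "$", so any value that still contains
# another non-numeric character (for example a thousands separator) becomes NA.

data <- data %>% 
  mutate(price = as.numeric(str_remove(price, "\\$"))) %>% 
  mutate(service.fee = as.numeric(str_remove(service.fee, "\\$"))) %>% 
  mutate(last.review = as.Date(last.review, format = "%m/%d/%Y"))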

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `price = as.numeric(str_remove(price, "\\$"))`.
Caused by warning:
! NAs introduced by coercion
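The coercion warning is worth a closer look: price had 247 missing values before the conversion, but the summary in the next step reports 18059. One likely cause, which is an assumption here and not something the output confirms, is that prices of 1,000 or more are written with a comma. The sketch below uses made-up strings (`"$1,060 "` is hypothetical) to show the difference between stripping only the dollar sign and stripping the comma as well.

```r
# Hypothetical raw values; "$1,060 " assumes the file uses a thousands separator
raw_price <- c("$966 ", "$142 ", "$1,060 ", NA)

# Removing only "$" leaves the comma, so as.numeric() returns NA for that value
dollar_only <- suppressWarnings(as.numeric(sub("\\$", "", raw_price)))

# Removing both "$" and "," lets every non-missing value convert
dollar_and_comma <- as.numeric(gsub("[$,]", "", raw_price))

dollar_only
dollar_and_comma
```

`readr::parse_number()` is another option for this kind of cleanup. Either way, comparing the NA count before and after the conversion is a quick check that no valid prices were lost.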

#View(data)

STEP 4 - DATA DESCRIBE AND SUMMARY

# Let's check the summary for our numeric data columns
data %>% 
  select(price, service.fee, minimum.nights, number.of.reviews,
         review.rate.number, reviews.per.month,
         calculated.host.listings.count, availability.365) %>% 
  summary()

     price        service.fee  minimum.nights      number.of.reviews
 Min.   : 50.0   Min.   : 10   Min.   :-1223.000   Min.   :   0.00  
 1st Qu.:288.0   1st Qu.: 68   1st Qu.:    2.000   1st Qu.:   1.00  
 Median :524.0   Median :125   Median :    3.000   Median :   7.00  
 Mean   :524.8   Mean   :125   Mean   :    8.127   Mean   :  27.52  
 3rd Qu.:759.0   3rd Qu.:183   3rd Qu.:    5.000   3rd Qu.:  31.00  
 Max.   :999.0   Max.   :240   Max.   : 5645.000   Max.   :1024.00  
 NA's   :18059   NA's   :273   NA's   :400         NA's   :183      
 review.rate.number reviews.per.month calculated.host.listings.count
 Min.   :1.000      Min.   : 0.010    Min.   :  1.000               
 1st Qu.:2.000      1st Qu.: 0.220    1st Qu.:  1.000               
 Median :3.000      Median : 0.740    Median :  1.000               
 Mean   :3.279      Mean   : 1.375    Mean   :  7.937               
 3rd Qu.:4.000      3rd Qu.: 2.010    3rd Qu.:  2.000               
 Max.   :5.000      Max.   :90.000    Max.   :332.000               
 NA's   :319        NA's   :15818     NA's   :319                   
 availability.365
 Min.   : -10    
 1st Qu.:   3    
 Median :  96    
 Mean   : 141    
 3rd Qu.: 268    
 Max.   :3677    
 NA's   :448

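## This summary is useful for looking at the distribution of each numeric column.
# minimum.nights, number.of.reviews, reviews.per.month and calculated.host.listings.count
# are concentrated in a narrow range (small gap between the 1st and 3rd quartiles) but have
# extreme maximum values, so they are heavily right-skewed. review.rate.number is bounded between 1 and 5.
# minimum.nights and availability.365 also contain invalid values (negative numbers, availability above 365).
# price now has 18059 NA's, up from 247 before the conversion, so the coercion warning deserves a closer look.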
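The summary shows values that cannot be valid, such as a minimum stay of -1223 nights or an availability of 3677 days in a year. A rough sketch of how such rows could be flagged is below. The toy rows reuse the extreme values from the summary, and the thresholds (at least 1 night, at most 365 nights, availability between 0 and 365) are assumptions for the example, not rules taken from the dataset documentation.

```r
library(dplyr)

# Toy rows built from the minimum and maximum values seen in the summary
toy <- data.frame(
  minimum.nights = c(10, -1223, 3, 5645),
  availability.365 = c(286, -10, 3677, 96)
)

# Assumed validity rules: 1 to 365 nights, 0 to 365 available days
flagged <- toy %>%
  mutate(
    bad_min_nights = minimum.nights < 1 | minimum.nights > 365,
    bad_availability = availability.365 < 0 | availability.365 > 365
  )

# Keep only rows that pass both checks
cleaned <- flagged %>%
  filter(!bad_min_nights, !bad_availability)

flagged
nrow(cleaned)
```

Flagging first and filtering second keeps the decision visible, so the invalid rows can be counted or inspected before anything is removed.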

STEP 5 - DATA VISUALIZATION - QUESTIONS

I have solved around 12 questions on this dataset. You can find this project on my GitHub.