AirBnB Data Cleaning
Introduction
The dataset has been taken from here on Kaggle.
It contains 102,599 rows and 26 variables describing Airbnb listings in New York.
The variables include:
- Airbnb name and host name,
- location details like coordinates, neighbourhood etc.,
- whether the host’s identity is verified or not,
- room details like availability and type of room,
- price per night, service fees etc.,
- reviews per month, last review date, review ratings,
- and house rules.
This notebook contains the steps taken to clean this dataset.
The analysis of this dataset is covered in another notebook, here.
Summary of cleaning steps
- Duplicates: 541 duplicate rows found and removed.
- Missing Values: The license column had only 2 non-blank values; all other rows were blank. This column was removed.
- Clean column-names: Symbols were removed from column names; all names were converted to lowercase and whitespaces replaced with underscores.
- Price columns: The “$” and “,” symbols were removed and the columns converted to numeric.
- Date column: The last_review column contains dates; the pattern was checked and then the column was converted from character to date-type.
- Categorical columns: 3 columns have categorical values; they were checked and converted to factors.
- Clean long texts: The house_rules column contains long free-text data. All HTML tags, non-readable symbols etc. are removed.
Notes on reading the dataset
Why has the col_types argument been used in read_csv()?
The col_character() in read_csv() sets the data-type of the license column to character when the file is read.
- When not used, the column was being read as logical.
- By default, read_csv() takes the first 1000 rows of the dataset and tries to detect the data-type of each column.
- In the case of license, all rows are blank except 2, which is why this column gets read as logical and a warning is thrown, as shown in the cell below.
- The two license values get converted to NA in the process.
- To avoid such data loss, it is good practice to explicitly specify the data-type when such warnings occur.
data2 <- readr::read_csv(paste0(data_dir, "Airbnb_Open_Data.csv"),
                         col_types = cols(license = col_character()))
#check the warning message thrown by the initial read (without col_types)
problems(data)
#1 11116 26 1/0/T/F/TRUE/FALSE (expected); 41662/AL (actual value)
#2 72949 26 1/0/T/F/TRUE/FALSE (expected); 41662/AL (actual value)
#check dtype of all cols; now license is 'char' and the 2 license values are not lost
glimpse(data2)
## Rows: 102,599
## Columns: 26
## $ id <dbl> 1001254, 1002102, 1002403, 1002755, 1…
## $ NAME <chr> "Clean & quiet apt home by the park",…
## $ `host id` <dbl> 80014485718, 52335172823, 78829239556…
## $ host_identity_verified <chr> "unconfirmed", "verified", NA, "uncon…
## $ `host name` <chr> "Madaline", "Jenna", "Elise", "Garry"…
## $ `neighbourhood group` <chr> "Brooklyn", "Manhattan", "Manhattan",…
## $ neighbourhood <chr> "Kensington", "Midtown", "Harlem", "C…
## $ lat <dbl> 40.64749, 40.75362, 40.80902, 40.6851…
## $ long <dbl> -73.97237, -73.98377, -73.94190, -73.…
## $ country <chr> "United States", "United States", "Un…
## $ `country code` <chr> "US", "US", "US", "US", "US", "US", "…
## $ instant_bookable <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE…
## $ cancellation_policy <chr> "strict", "moderate", "flexible", "mo…
## $ `room type` <chr> "Private room", "Entire home/apt", "P…
## $ `Construction year` <dbl> 2020, 2007, 2005, 2005, 2009, 2013, 2…
## $ price <chr> "$966", "$142", "$620", "$368", "$204…
## $ `service fee` <chr> "$193", "$28", "$124", "$74", "$41", …
## $ `minimum nights` <dbl> 10, 30, 3, 30, 10, 3, 45, 45, 2, 2, 1…
## $ `number of reviews` <dbl> 9, 45, 0, 270, 9, 74, 49, 49, 430, 11…
## $ `last review` <chr> "10/19/2021", "5/21/2022", NA, "7/5/2…
## $ `reviews per month` <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.4…
## $ `review rate number` <dbl> 4, 4, 5, 4, 3, 3, 5, 5, 3, 5, 3, 4, 4…
## $ `calculated host listings count` <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1…
## $ `availability 365` <dbl> 286, 228, 352, 322, 289, 374, 224, 21…
## $ house_rules <chr> "Clean up and treat the home the way …
## $ license <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Each cleaning step in detail
1. Remove duplicate rows
There are 541 duplicate rows in the dataset.
## [1] 541
The following code can be used to create a subset of all the duplicate rows.
Keep only unique values.
Either distinct() from the dplyr package OR unique() from base R can be used for this (see the sketch below).
- Using distinct() does not preserve the original order of rows in the dataframe.
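A minimal sketch of this step, assuming the dataframe is data2 and dplyr is loaded; duplicate_rows is an illustrative name:
#count duplicate rows
sum(duplicated(data2))

#optional: subset containing only the duplicated rows, for inspection
duplicate_rows <- data2[duplicated(data2), ]

#keep only unique rows
data2 <- dplyr::distinct(data2)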
2. Check missing values
The following code creates a dataframe with 3 columns:
- the first contains the names of the columns in the original dataframe,
- the second contains the total number of NA values in that column,
- the third contains the percentage of blank rows in each column.
##find missing values in each column
missing_values <- data2 %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to = "column_name",
               values_to = "total_NA") %>%
  #add a column indicating % of NA rows
  mutate(percent_blank_rows = (total_NA/nrow(data2))*100) %>%
  #round off to 2 decimal places
  mutate(percent_blank_rows = round(percent_blank_rows,2))

Here are all the columns and the total missing values they contain.
#view the missing count dataframe
knitr::kable(missing_values) %>%
  kable_styling(bootstrap_options = "condensed",
                #full_width = F,
                position = "center",
                font_size = 11)

| column_name | total_NA | percent_blank_rows |
|---|---|---|
| id | 0 | 0.00 |
| NAME | 249 | 0.24 |
| host id | 0 | 0.00 |
| host_identity_verified | 289 | 0.28 |
| host name | 404 | 0.40 |
| neighbourhood group | 29 | 0.03 |
| neighbourhood | 16 | 0.02 |
| lat | 8 | 0.01 |
| long | 8 | 0.01 |
| country | 532 | 0.52 |
| country code | 131 | 0.13 |
| instant_bookable | 105 | 0.10 |
| cancellation_policy | 76 | 0.07 |
| room type | 0 | 0.00 |
| Construction year | 214 | 0.21 |
| price | 247 | 0.24 |
| service fee | 273 | 0.27 |
| minimum nights | 400 | 0.39 |
| number of reviews | 183 | 0.18 |
| last review | 15832 | 15.51 |
| reviews per month | 15818 | 15.50 |
| review rate number | 319 | 0.31 |
| calculated host listings count | 319 | 0.31 |
| availability 365 | 448 | 0.44 |
| house_rules | 51842 | 50.80 |
| license | 102056 | 100.00 |
3. Clean column-names
- Remove the license column.
- Remove spaces and hyphens, reduce to lowercase etc.
clean_names() from the janitor package handles all these manipulations.
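A minimal sketch of this step, assuming dplyr and janitor are loaded and the result overwrites data2 (later sections refer to the cleaned dataframe as data3):
#drop the license column and standardise the column names
data2 <- data2 %>%
  dplyr::select(-license) %>%
  janitor::clean_names()

#all names are now lowercase with underscores, e.g. `host id` becomes host_id
names(data2)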
4. Convert columns with price/rates to numeric
- Right now, the data-type of the price columns (price & service_fee) is character.
- Values in these columns are of the format “$100”, “$1,200”.
- The dollar sign and comma get removed.
- The columns are converted to numeric.
4.1 Column- price
i) Check all unique symbols present in the column.
- Only 2 found: ‘$’ and ‘,’.
- gsub() below replaces all digits (“\\d+”) with an empty string (“”).
- After replacement, price only contains values of the form “$,” or “$”.
- Then the unique values of this column are extracted, which here include only the dollar-sign and comma symbols.
##1. column- 'price'; currently in char-type
#check for all characters in it except digits
## prices are either like '$100' or '$1,500'
unique(gsub(pattern = "\\d+", replacement = "", data2$price))
## [1] "$" "$," NA
ii) Remove the ‘$’ and ‘,’ symbols, leaving only digits behind.
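A minimal sketch of this removal and conversion, assuming it is applied to both price columns of data2 (column names as produced by clean_names()):
#strip the '$' and ',' symbols, then convert to numeric
data2$price <- as.numeric(gsub(pattern = "[$,]", replacement = "", data2$price))
data2$service_fee <- as.numeric(gsub(pattern = "[$,]", replacement = "", data2$service_fee))

#sanity check: the columns are now numeric
summary(data2$price)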
5. Clean column with date values & convert to date-type
- Column to work on: last_review
- This column contains the date of the last review written for the given listing.
- It contains dates, but its data-type is character.
- In order to convert it to date-type, it is crucial to know whether the dates are in dd/mm/yyyy format, in which case dmy() is used, OR in mm/dd/yyyy format, in which case mdy() is used.
- This pattern check has been performed in a few steps below.
5.1 Check the overall pattern of digits.
The pattern below checks whether every date contains digits in the format: 1-2 digits / 1-2 digits / 4 digits.
all() returns FALSE if even a single value fails to match this pattern. “\\d{1,2}” means the number of digits before ‘/’ should be at least 1 and at most 2, since there are dates like 6/12/2015 or 1/4/2020.
The following code returns FALSE, indicating there are some values which do not match this pattern.
#check the format of digits- dd/dd/dddd--returns false
all(grepl(pattern = "\\d{1,2}/\\d{1,2}/\\d{4}", data2$last_review))
## [1] FALSE
The following code finds all rows where the above pattern (date-format) is present.
#find row indices where this pattern is present in last-review column
correct_date_format_row_index <- grepl(pattern = "\\d{1,2}/\\d{1,2}/\\d{4}",
                                       data2$last_review)

- Create a subset that excludes the rows found above in correct_date_format_row_index.
- This data_incorrect_dates contains all rows where the date-format was not detected.
- Now we check this subset to find the values which failed to parse with the date-format regex above.

#filter out rows where this pattern isn't present
data_incorrect_dates <- data2[!correct_date_format_row_index,
                              c('id', 'name', 'last_review')]

- All the dates which failed to parse with the regex above might be NA values.
- The following code counts the total number of non-NA values in the last_review column.
- The result is 0, meaning all last_review rows are blank in this subset.
- So, all the dates which failed to match the pattern above are actually NA.
##check whether all last_review values are NA in data_incorrect_dates -- count of non-NA values is 0
sum(!is.na(data_incorrect_dates$last_review))
## [1] 0
5.2 Check range of day/month components
Here’s a summary of the following few steps:
- Check whether the first 2 digits lie between 1 and 12. If yes, these indicate the month.
- Check whether the middle 2 digits range from 1 to 31. If yes, these indicate the day.
- Check the last 4 digits for any wrong entries, e.g. year values like “2058” need correction.
a) Check first two digits
The first two digits in the dates range from 1 to 12, hence they clearly represent the “month”.
##Now check range of numbers in each component of date.
# If range(first 2 digits in all rows) is 1-12 => first 2 digits indicate month
# If range(middle 2 digits) is 1-31 => days.
##Check first 2 digits
# range is 1-12
unique(str_extract(data2$last_review, pattern = "^\\d{1,2}(?=/)"))
## [1] "10" "5" NA "7" "11" "6" "12" "1" "8" "4" "9" "3" "2"
b) Check middle two digits
The range of the middle two digits is 1 to 31, indicating the day-component.
#extract middle 2 digits
#range 1-31
summary(unique(as.integer(str_extract(data2$last_review, pattern = "(?<=^\\d{1,2}/)\\d{1,2}"))))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 8.5 16.0 16.0 23.5 31.0 1
5.3 Convert column to date type
Right now, the date column is of character-type.
Using the checks above, it is confirmed that all dates are in mm/dd/yyyy format.
Now, this column can be converted to date type using the mdy() function of the lubridate package.
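A minimal sketch of the conversion, assuming the column still lives in data2 at this point (as in the code that follows):
library(lubridate)

#parse mm/dd/yyyy strings into Date values; unparseable entries stay NA
data2$last_review <- mdy(data2$last_review)

#the column class is now "Date"
class(data2$last_review)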
5.4 Check the range of dates in the last_review column
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2012-07-11" "2018-10-27" "2019-06-13" "2019-06-11" "2019-07-05" "2058-06-16"
## NA's
## "15832"
- This dataset was last updated in 2022.
- The subset below contains all rows where the last_review column contains dates beyond 2022.
- These are likely due to data-entry errors.
## # A tibble: 5 × 3
## id name last_review
## <dbl> <chr> <date>
## 1 1071478 Garden studio in the Upper East Sid 2024-08-15
## 2 1106825 LUX APT IN TIMES SQUARE NEW BUILDING 2025-06-26
## 3 1142173 Beautiful Landmarked Duplex 2058-06-16
## 4 1176967 <NA> 2026-03-28
## 5 1268097 Modern Space in Charming Pre-war 2040-06-16
a) Correct dates with year > 2022.
For dates where the year component is greater than 2022, replace only the year component with 2022.
In the code below:
- `year<-`(last_review, 2022) is a replacement function.
- It changes ONLY the year component to 2022 in place, without creating a new object.
#convert year values greater than 2022 to 2022
data2 <- data2 %>%
  mutate(last_review = case_when(
    year(last_review) > 2022 ~ `year<-`(last_review, 2022),
    TRUE ~ last_review
  ))

6. Find unique values in all categorical columns
6.1 Categories in the host_identity_verified column
- Two categories found: unconfirmed and verified.
#Column- Host-identity-verified (unique values check)
#2 values- "unconfirmed", "verified"
unique(data3$host_identity_verified)
## [1] "unconfirmed" "verified" NA
The following is the proportion of verified & unconfirmed host identities in the dataset.
#find total percentage of verified/unconfirmed
#Proportion of confirmed identity/ unconfirmed identity is almost same at ~50%
verification_data <- data3 %>% group_by(host_identity_verified) %>%
summarise(total = n()) %>%
mutate(percent_share = round((total/nrow(data3)*100),2)) %>%
arrange(-percent_share) %>%
rename(`Verification Status` = host_identity_verified,
`Total data points` = total,
`% in dataset` = percent_share
)
formattable(verification_data,
align = c('l', 'c', 'c'),
list(
`Verification Status` = formatter("span",
style = ~ style(color = "black", font.weight = "bold")
),#formatter
`Total data points` = color_tile('#FFD1DC', '#FFB6C1'),
`% in dataset` = color_tile('#FFD1DC', '#FFB6C1')
)#list
) #formattable

| Verification Status | Total data points | % in dataset |
|---|---|---|
| unconfirmed | 50944 | 49.92 |
| verified | 50825 | 49.80 |
| NA | 289 | 0.28 |
In the main dataframe data3, convert the host_identity_verified column to a factor with 2 categories.
6.2 Categories in the cancellation_policy column
There are 3 categories in cancellation policy: strict, moderate & flexible.
Each of these categories has roughly a 33% share of the data.
#Column- Cancellation policy
#3 categories: strict, moderate, flexible
unique(data3$cancellation_policy)
## [1] "strict" "moderate" "flexible" NA
#check for proportion for each
#roughly same for all 3, ~33%
cancel_policy <- data3 %>% group_by(cancellation_policy) %>%
summarise(total = n()) %>%
mutate(percent_share = round((total/nrow(data3))*100,2)) %>%
arrange(-percent_share) %>%
rename(`Cancellation Policy` = cancellation_policy,
`Total data points` = total,
`% in dataset` = percent_share
)
formattable(cancel_policy,
align = c('l', 'c', 'c'),
list(
`Cancellation Policy` = formatter("span",
style = ~ style(color = "black", font.weight = "bold")
),#formatter
`Total data points` = color_tile('#FFD1DC', '#FFB6C1'),
`% in dataset` = color_tile('#FFD1DC', '#FFB6C1')
)#list
) #formattable

| Cancellation Policy | Total data points | % in dataset |
|---|---|---|
| moderate | 34162 | 33.47 |
| strict | 33929 | 33.24 |
| flexible | 33891 | 33.21 |
| NA | 76 | 0.07 |
In the main dataframe data3, convert the cancellation_policy column to a factor with 3 categories.
6.3 Categories in the room_type column
There are 4 types of rooms across all the listings: “Private room”, “Entire home/apt”, “Shared room”, “Hotel room”.
#Column- Room type
#4 cats- "Private room", "Entire home/apt", "Shared room", "Hotel room"
unique(data3$room_type)
## [1] "Private room" "Entire home/apt" "Shared room" "Hotel room"
#Most listings are either Entire home/apt OR private room
type_of_room <- data3 %>% group_by(room_type) %>%
summarize(total = n()) %>%
mutate(percent_share = round((total/nrow(data3))*100,2)) %>%
arrange(-percent_share) %>%
rename(`Type of Room` = room_type,
`Total Rows` = total,
`% in dataset` = percent_share
)
formattable(type_of_room,
align = c('l', 'c', 'c'),
list(
`Type of Room` = formatter("span",
style = ~ style(color = "black", font.weight = "bold")
),#formatter
`Total Rows` = color_tile('#FFD1DC', '#FFB6C1'),
`% in dataset` = color_tile('#FFD1DC', '#FFB6C1')
)#list
) #formattable

| Type of Room | Total Rows | % in dataset |
|---|---|---|
| Entire home/apt | 53429 | 52.35 |
| Private room | 46306 | 45.37 |
| Shared room | 2208 | 2.16 |
| Hotel room | 115 | 0.11 |
In the main dataframe data3, convert the room_type column to a factor with 4 categories.
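A minimal sketch of the three factor conversions described in sections 6.1–6.3, assuming dplyr is loaded; the level sets match the categories found above:
#convert the three categorical columns to factors
data3 <- data3 %>%
  mutate(host_identity_verified = factor(host_identity_verified,
                                         levels = c("unconfirmed", "verified")),
         cancellation_policy = factor(cancellation_policy,
                                      levels = c("flexible", "moderate", "strict")),
         room_type = factor(room_type,
                            levels = c("Private room", "Entire home/apt",
                                       "Shared room", "Hotel room")))

#verify the conversion
str(data3[, c("host_identity_verified", "cancellation_policy", "room_type")])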
7. Clean house_rules column
- First, we check for the presence of HTML tags and remove them if present (a sketch of this check follows below).
- Then, we remove unnecessary symbols which do not contribute to understanding the most common house rules.
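7.1 Check for and remove HTML tags
A minimal sketch of this check, assuming HTML tags follow the usual <tag> pattern; the exact pattern used is an assumption:
#count rows in house_rules that appear to contain HTML tags like <br/> or <b>
sum(grepl(pattern = "<[^>]+>", data3$house_rules), na.rm = TRUE)

#strip any such tags, leaving the surrounding text intact
data3$house_rules <- gsub(pattern = "<[^>]+>", replacement = " ", data3$house_rules)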
7.2 Remove “#NAME?” from texts
#some rows contain this value in house-rules column
data3$house_rules <- gsub(pattern = "#NAME?", replacement = "", data3$house_rules)
#some rows contain '?' in house-rules column
data3$house_rules <- gsub(pattern = "^\\?$", replacement = "", data3$house_rules)If there are rows where house-rule only contains “#NAME?”, it was
replaced with empty string (““) by the line above.
Now, such cells which only contain empty string get converted to NA.
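A minimal sketch of that conversion (using which() so that existing NA values are left untouched):
#convert cells that contain only an empty string to NA
data3$house_rules[which(data3$house_rules == "")] <- NA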
7.3 Remove bullet symbols (-, :, * etc.)
Many sentences in the house_rules column contain symbols used to add bullet points.
This step removes all such symbols.
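A minimal sketch of this step; the exact symbol set is an assumption based on the ones listed in the heading:
#remove bullet-style symbols such as '-', ':', '*' and collapse the extra spaces left behind
data3$house_rules <- gsub(pattern = "[-:*•]", replacement = " ", data3$house_rules)
data3$house_rules <- gsub(pattern = "\\s+", replacement = " ", trimws(data3$house_rules))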