LND Airbnb
LND Airbnb

Before we begin the Process for this Project, there are few key points that are wrapped below as these are the steps I’ll be following to ensure its completion:

Data Analysis Process
Data Analysis Process

PHASE 1 : Ask

As you know this is not a case study that means we don’t have any problems yet, only Dataset we have. I’ll doing EDA step by step and will try to find every type of possible insights from the dataset :

Airbnb is a platform that allows house and apartment owners to rent their properties to guests for short-term stays. Due to its convenience and flexibility, it has become one of the top choices for travelers.
We will see how Data-Driven Decisions propel Airbnb to stand Different.

Key taks
  • Identified the business task
  • Considered key stakeholders

Deliverable

  • To gain insights from Data to solve business problem.

PHASE 2 : Prepare

I’ll be using London Airbnb data to analyze and identify trends, Downloaded the data and stored it on my Google Drive

The data is located on Inside Airbnb and data has been made available by Inside Airbnb with No Copyright CC0: Public Domain.

I will use the ROCCC system to determine the credibility and integrity of the data.
  • Reliability: This data is reliable. This public dataset is a subset of Airbnb data and is available for public use.

  • Originality: This is Original subset dataset.

  • Comprehensiveness: This data is comprehensive. It provides comprehensive information about Airbnb listings, hosts, and various metrics for analysis and research purposes.

  • Current: A recent dataset, which is current.

  • Cited: Inside Airbnb created the dataset, made it Public Dataset this is Credible

Now the Datasets clearly meet the ROCCC System. Therefore, insights from the Analysis provide some directions ofcourse :

Therefore, the data is not Biased And have full credibility for the same reason. It meets ROCCC System since it’s reliable, original,comprehensive,current and cited.

Key taks
  • Downloaded data and stored it appropriately.
  • Identified how it’s organized.
  • Sorted and filtered the data.
  • Determined the credibility of the data.

Deliverable

  • The description of data source used since 2008 - 2023.

PHASE 3 : Process

Analysis could be done on Excel too since excel can easily handle the dataset. But still I’m using R because it is primarily used for statistical analysis, data visualization and I have to make Documentation too. So, I chose RStudio for further Analysis.

Setting up the Environment

Dependencies
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("janitor")
# install.packages("lubridate")
# install.packages("skimr")
#install.packages("stringr")
  • Libraries
library(tidyverse)
## Warning: package 'stringr' was built under R version 4.3.1
library(dplyr)
library(janitor)
library(lubridate)
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.1
library(stringr)
Working Directory
setwd("D:/Case_Study/Data/London_Airbnb")

Data Collection

airbnb <- read.csv("London_Airbnb_2023.csv")

Data Wrangling

  • Ensured Data’s integrity.
  • Ensured column(s) name consistent.
str(airbnb) 
## 'data.frame':    75241 obs. of  16 variables:
##  $ name                : chr  "Lovely private bedroom in Muswell Hill." "Studio Flat  Franklin London" "PropertyPlug - 2Bed Flat in Edgware SmartTV WiFi" "Wimbledon Double Bedroom Ensuite" ...
##  $ host_name           : chr  "Forough" "Hussein" "Paul" "Nikki" ...
##  $ latitude            : num  51.6 51.6 51.6 51.4 51.6 ...
##  $ longitude           : num  -0.139 -0.177 -0.277 -0.188 -0.194 ...
##  $ host_since          : chr  "2015-07-20" "2023-01-14" "2020-01-08" "2010-08-01" ...
##  $ host_location       : chr  "London, United Kingdom" "London, United Kingdom" "" "London, United Kingdom" ...
##  $ host_acceptance_rate: chr  "" "100%" "91%" "100%" ...
##  $ property_type       : chr  "Private room in condo" "Entire rental unit" "Entire rental unit" "Private room in rental unit" ...
##  $ room_type           : chr  "Private room" "Entire home/apt" "Entire home/apt" "Private room" ...
##  $ price               : chr  "$100.00" "$65.00" "$132.00" "$100.00" ...
##  $ minimum_nights      : int  1 180 2 5 5 1 2 1 1 1 ...
##  $ maximum_nights      : int  365 365 28 1125 90 365 365 365 365 6 ...
##  $ availability_30     : int  30 30 12 30 23 20 9 30 25 0 ...
##  $ availability_365    : int  365 365 35 365 337 278 252 365 278 11 ...
##  $ number_of_reviews   : int  0 0 0 4 0 3 6 0 0 2 ...
##  $ review_scores_rating: num  NA NA NA 3.67 NA 4.67 4.5 NA NA 5 ...
  • Everything looks great
summary(airbnb) # A quick look before proceeding
##      name            host_name            latitude       longitude       
##  Length:75241       Length:75241       Min.   :51.30   Min.   :-0.49780  
##  Class :character   Class :character   1st Qu.:51.48   1st Qu.:-0.18939  
##  Mode  :character   Mode  :character   Median :51.51   Median :-0.12628  
##                                        Mean   :51.51   Mean   :-0.12811  
##                                        3rd Qu.:51.54   3rd Qu.:-0.06846  
##                                        Max.   :51.68   Max.   : 0.28857  
##                                                                          
##   host_since        host_location      host_acceptance_rate property_type     
##  Length:75241       Length:75241       Length:75241         Length:75241      
##  Class :character   Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character   Mode  :character     Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##   room_type            price           minimum_nights     maximum_nights     
##  Length:75241       Length:75241       Min.   :   1.000   Min.   :        1  
##  Class :character   Class :character   1st Qu.:   1.000   1st Qu.:       42  
##  Mode  :character   Mode  :character   Median :   2.000   Median :      365  
##                                        Mean   :   5.751   Mean   :     7790  
##                                        3rd Qu.:   4.000   3rd Qu.:     1125  
##                                        Max.   :1125.000   Max.   :524855552  
##                                                                              
##  availability_30  availability_365 number_of_reviews review_scores_rating
##  Min.   : 0.000   Min.   :  0.0    Min.   :   0.00   Min.   :0.000       
##  1st Qu.: 0.000   1st Qu.:  0.0    1st Qu.:   1.00   1st Qu.:4.500       
##  Median : 0.000   Median : 47.0    Median :   4.00   Median :4.820       
##  Mean   : 7.526   Mean   :113.1    Mean   :  17.97   Mean   :4.588       
##  3rd Qu.:13.000   3rd Qu.:228.0    3rd Qu.:  17.00   3rd Qu.:5.000       
##  Max.   :30.000   Max.   :365.0    Max.   :1328.00   Max.   :5.000       
##                                                      NA's   :18693
airbnb <- airbnb %>% na.omit() %>%   # Dropping NA's value from dataset
  distinct()   # unique rows by checking values on all columns
  • Before that total columns were 75241 now 56548 columns only, good to go.
airbnb <- airbnb %>% 
  filter(price > "$0.00")  # Removed price that contained "0.00"
  • Just Removed Airbnb price which was set to 0 that does not make sense to me, now 56538 columns
mean(airbnb$price) # Average price
## Warning in mean.default(airbnb$price): argument is not numeric or logical:
## returning NA
## [1] NA
  • Removing “,” , “$” and converting chr to num for data types in :
airbnb$price <- gsub(",", "", airbnb$price) # removed ','
airbnb$price <- gsub("\\$", "", airbnb$price) # removed '$' sign
airbnb$price <- as.numeric(airbnb$price)  # converted to num.
  • Since the data is based on the ‘UK’ region we assume the currency column to be of ‘£’ sign.
mean(airbnb$price) # Average price till now 
## [1] 154.579
summary(airbnb) # A quick summary
##      name            host_name            latitude       longitude       
##  Length:56538       Length:56538       Min.   :51.30   Min.   :-0.49780  
##  Class :character   Class :character   1st Qu.:51.48   1st Qu.:-0.18885  
##  Mode  :character   Mode  :character   Median :51.51   Median :-0.12492  
##                                        Mean   :51.51   Mean   :-0.12786  
##                                        3rd Qu.:51.54   3rd Qu.:-0.06876  
##                                        Max.   :51.68   Max.   : 0.28857  
##   host_since        host_location      host_acceptance_rate property_type     
##  Length:56538       Length:56538       Length:56538         Length:56538      
##  Class :character   Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character   Mode  :character     Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##   room_type             price         minimum_nights     maximum_nights     
##  Length:56538       Min.   :    1.0   Min.   :   1.000   Min.   :        1  
##  Class :character   1st Qu.:   59.0   1st Qu.:   1.000   1st Qu.:       40  
##  Mode  :character   Median :  100.0   Median :   2.000   Median :      365  
##                     Mean   :  154.6   Mean   :   5.083   Mean   :    10193  
##                     3rd Qu.:  170.0   3rd Qu.:   4.000   3rd Qu.:     1125  
##                     Max.   :35243.0   Max.   :1124.000   Max.   :524855552  
##  availability_30  availability_365 number_of_reviews review_scores_rating
##  Min.   : 0.000   Min.   :  0.0    Min.   :   1.00   Min.   :0.000       
##  1st Qu.: 0.000   1st Qu.:  0.0    1st Qu.:   3.00   1st Qu.:4.500       
##  Median : 1.000   Median : 47.0    Median :   8.00   Median :4.820       
##  Mean   : 6.957   Mean   :108.3    Mean   :  23.91   Mean   :4.588       
##  3rd Qu.:12.000   3rd Qu.:210.0    3rd Qu.:  24.00   3rd Qu.:5.000       
##  Max.   :30.000   Max.   :365.0    Max.   :1328.00   Max.   :5.000
  • Here, some data may be entered incorrectly in the ‘maxium_nights’ column.
  • Since, it is not possible while for someone to book/stay for more than ~ 24657 Years
  • And even someone booked for ~ 1437960 Years, which is not possible.
  • Unless, they are vampire or something, jokes apart. Manipulating data now.
airbnb <- airbnb %>% 
  filter(maximum_nights < 9000)  # ~ 24 years of booking/stay is fine.
airbnb <- airbnb %>% 
  mutate(revenue = (maximum_nights+minimum_nights)*price)  # A new column
  • creating a fresh copy of ‘airbnb’
test <- airbnb  # just a fresh new dataframe
str(test)
## 'data.frame':    56534 obs. of  17 variables:
##  $ name                : chr  "Wimbledon Double Bedroom Ensuite" "Cosy room with own bathroom" "Waddon Coach House 2" "Specious private single bed room" ...
##  $ host_name           : chr  "Nikki" "Ekaterina" "Jahmal" "Marta" ...
##  $ latitude            : num  51.4 51.4 51.4 51.6 51.6 ...
##  $ longitude           : num  -0.188 -0.174 -0.11 -0.126 -0.188 ...
##  $ host_since          : chr  "2010-08-01" "2023-02-12" "2022-12-09" "2021-06-17" ...
##  $ host_location       : chr  "London, United Kingdom" "" "" "England, United Kingdom" ...
##  $ host_acceptance_rate: chr  "100%" "100%" "87%" "100%" ...
##  $ property_type       : chr  "Private room in rental unit" "Private room in rental unit" "Entire home" "Private room in rental unit" ...
##  $ room_type           : chr  "Private room" "Private room" "Entire home/apt" "Private room" ...
##  $ price               : num  100 43 100 29 59 55 68 379 93 45 ...
##  $ minimum_nights      : int  5 1 2 1 1 1 1 3 2 1 ...
##  $ maximum_nights      : int  1125 365 365 6 365 14 365 1125 365 365 ...
##  $ availability_30     : int  30 20 9 0 25 15 30 23 8 24 ...
##  $ availability_365    : int  365 278 252 11 359 205 365 358 273 358 ...
##  $ number_of_reviews   : int  4 3 6 2 10 19 1 2 3 11 ...
##  $ review_scores_rating: num  3.67 4.67 4.5 5 4.8 4.95 5 5 4.33 4.73 ...
##  $ revenue             : num  113000 15738 36700 203 21594 ...
##  - attr(*, "na.action")= 'omit' Named int [1:18693] 1 2 3 5 8 9 11 12 14 15 ...
##   ..- attr(*, "names")= chr [1:18693] "1" "2" "3" "5" ...
  • formatting date and mutating another columns
test$host_since <- as.Date(test$host_since)           # datatype as 'date'
test$day <- format(as.Date(test$host_since), "%A")    # a column for 'day'
test$date <- format(as.Date(test$host_since), "%d")   # another column for 'date'
test$month <- format(as.Date(test$host_since), "%B")  # another column for 'month' 
test$year <- format(as.Date(test$host_since), "%Y")   # another column for 'year'
summary(test) # a quick statistical summary
##      name            host_name            latitude       longitude       
##  Length:56534       Length:56534       Min.   :51.30   Min.   :-0.49780  
##  Class :character   Class :character   1st Qu.:51.48   1st Qu.:-0.18885  
##  Mode  :character   Mode  :character   Median :51.51   Median :-0.12492  
##                                        Mean   :51.51   Mean   :-0.12786  
##                                        3rd Qu.:51.54   3rd Qu.:-0.06876  
##                                        Max.   :51.68   Max.   : 0.28857  
##    host_since         host_location      host_acceptance_rate
##  Min.   :2008-12-03   Length:56534       Length:56534        
##  1st Qu.:2014-07-10   Class :character   Class :character    
##  Median :2016-02-26   Mode  :character   Mode  :character    
##  Mean   :2016-08-07                                          
##  3rd Qu.:2018-07-25                                          
##  Max.   :2023-03-10                                          
##  property_type       room_type             price         minimum_nights    
##  Length:56534       Length:56534       Min.   :    1.0   Min.   :   1.000  
##  Class :character   Class :character   1st Qu.:   59.0   1st Qu.:   1.000  
##  Mode  :character   Mode  :character   Median :  100.0   Median :   2.000  
##                                        Mean   :  154.5   Mean   :   5.083  
##                                        3rd Qu.:  170.0   3rd Qu.:   4.000  
##                                        Max.   :35243.0   Max.   :1124.000  
##  maximum_nights   availability_30  availability_365 number_of_reviews
##  Min.   :   1.0   Min.   : 0.000   Min.   :  0.0    Min.   :   1.00  
##  1st Qu.:  40.0   1st Qu.: 0.000   1st Qu.:  0.0    1st Qu.:   3.00  
##  Median : 365.0   Median : 1.000   Median : 47.0    Median :   8.00  
##  Mean   : 573.4   Mean   : 6.957   Mean   :108.3    Mean   :  23.91  
##  3rd Qu.:1125.0   3rd Qu.:12.000   3rd Qu.:210.0    3rd Qu.:  24.00  
##  Max.   :2000.0   Max.   :30.000   Max.   :365.0    Max.   :1328.00  
##  review_scores_rating    revenue             day                date          
##  Min.   :0.000        Min.   :      40   Length:56534       Length:56534      
##  1st Qu.:4.500        1st Qu.:    5409   Class :character   Class :character  
##  Median :4.820        Median :   44400   Mode  :character   Mode  :character  
##  Mean   :4.588        Mean   :   94842                                        
##  3rd Qu.:5.000        3rd Qu.:  107350                                        
##  Max.   :5.000        Max.   :17992000                                        
##     month               year          
##  Length:56534       Length:56534      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

PHASE 4 : Analysis

test1 <- test %>% 
  select(-c(availability_30, availability_365))   
  
glimpse(test1)          # quick look into data
## Rows: 56,534
## Columns: 19
## $ name                 <chr> "Wimbledon Double Bedroom Ensuite", "Cosy room wi…
## $ host_name            <chr> "Nikki", "Ekaterina", "Jahmal", "Marta", "Simon",…
## $ latitude             <dbl> 51.42231, 51.41605, 51.37069, 51.61334, 51.64041,…
## $ longitude            <dbl> -0.18841000, -0.17429873, -0.10995560, -0.1259100…
## $ host_since           <date> 2010-08-01, 2023-02-12, 2022-12-09, 2021-06-17, …
## $ host_location        <chr> "London, United Kingdom", "", "", "England, Unite…
## $ host_acceptance_rate <chr> "100%", "100%", "87%", "100%", "100%", "93%", "90…
## $ property_type        <chr> "Private room in rental unit", "Private room in r…
## $ room_type            <chr> "Private room", "Private room", "Entire home/apt"…
## $ price                <dbl> 100, 43, 100, 29, 59, 55, 68, 379, 93, 45, 49, 10…
## $ minimum_nights       <int> 5, 1, 2, 1, 1, 1, 1, 3, 2, 1, 3, 2, 20, 1, 2, 1, …
## $ maximum_nights       <int> 1125, 365, 365, 6, 365, 14, 365, 1125, 365, 365, …
## $ number_of_reviews    <int> 4, 3, 6, 2, 10, 19, 1, 2, 3, 11, 11, 1, 1, 9, 32,…
## $ review_scores_rating <dbl> 3.67, 4.67, 4.50, 5.00, 4.80, 4.95, 5.00, 5.00, 4…
## $ revenue              <dbl> 113000, 15738, 36700, 203, 21594, 825, 24888, 427…
## $ day                  <chr> "Sunday", "Sunday", "Friday", "Thursday", "Monday…
## $ date                 <chr> "01", "12", "09", "17", "09", "13", "22", "28", "…
## $ month                <chr> "August", "February", "December", "June", "Novemb…
## $ year                 <chr> "2010", "2023", "2022", "2021", "2015", "2015", "…
df_final <- test1 %>% 
  select(c(name, host_name, host_since, host_location, latitude, longitude,
           host_acceptance_rate, property_type, room_type, price, 
           minimum_nights, maximum_nights, number_of_reviews,
           review_scores_rating, revenue,
           day, date, month, year)) %>% 
  arrange(date, month) %>% 
  arrange(desc(year))
df_final$month <- factor(df_final$month,
                          levels = c("January", "February", "March", "April",
                                     "May", "June", "July", "August",
                                     "September", "October", "November", "December"))
df_final$day <- ordered(df_final$day,
                        levels = c("Sunday","Monday","Tuesday","Wednesday",
                                   "Thursday","Friday","Saturday"))


df_final$year <- ordered(df_final$year,
                         levels = c("2008", "2009","2010", "2011","2012",
                                    "2013","2014","2015","2016","2017",
                                    "2018","2019","2020","2021","2022","2023"))
df_final <- df_final %>%  
  mutate(host_location = ifelse(host_location== "", 'unknown', host_location))

glimpse(df_final)
## Rows: 56,534
## Columns: 19
## $ name                 <chr> "Large En-suite in Victorian Home", "Stylish En-S…
## $ host_name            <chr> "Dev", "Dev", "Dev", "Dev", "Seonyoung", "Azan", …
## $ host_since           <date> 2023-02-01, 2023-02-01, 2023-02-01, 2023-02-01, …
## $ host_location        <chr> "London, United Kingdom", "London, United Kingdom…
## $ latitude             <dbl> 51.48815, 51.48756, 51.48837, 51.48900, 51.50100,…
## $ longitude            <dbl> 0.005667280, 0.004235341, 0.003485379, 0.00553123…
## $ host_acceptance_rate <chr> "87%", "87%", "87%", "87%", "100%", "46%", "83%",…
## $ property_type        <chr> "Private room in home", "Private room in home", "…
## $ room_type            <chr> "Private room", "Private room", "Private room", "…
## $ price                <dbl> 100, 90, 75, 90, 128, 132, 175, 200, 110, 160, 80…
## $ minimum_nights       <int> 4, 3, 2, 2, 2, 1, 1, 1, 1, 3, 2, 1, 2, 2, 1, 1, 1…
## $ maximum_nights       <int> 365, 365, 365, 365, 1125, 365, 365, 1125, 1125, 1…
## $ number_of_reviews    <int> 1, 1, 1, 1, 7, 1, 1, 2, 1, 2, 1, 3, 1, 3, 3, 2, 1…
## $ review_scores_rating <dbl> 5.00, 5.00, 5.00, 5.00, 4.14, 5.00, 5.00, 3.00, 5…
## $ revenue              <dbl> 36900, 33120, 27525, 33030, 144256, 48312, 64050,…
## $ day                  <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wedne…
## $ date                 <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ month                <fct> February, February, February, February, February,…
## $ year                 <ord> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
summary(df_final)        
##      name            host_name           host_since         host_location     
##  Length:56534       Length:56534       Min.   :2008-12-03   Length:56534      
##  Class :character   Class :character   1st Qu.:2014-07-10   Class :character  
##  Mode  :character   Mode  :character   Median :2016-02-26   Mode  :character  
##                                        Mean   :2016-08-07                     
##                                        3rd Qu.:2018-07-25                     
##                                        Max.   :2023-03-10                     
##                                                                               
##     latitude       longitude        host_acceptance_rate property_type     
##  Min.   :51.30   Min.   :-0.49780   Length:56534         Length:56534      
##  1st Qu.:51.48   1st Qu.:-0.18885   Class :character     Class :character  
##  Median :51.51   Median :-0.12492   Mode  :character     Mode  :character  
##  Mean   :51.51   Mean   :-0.12786                                          
##  3rd Qu.:51.54   3rd Qu.:-0.06876                                          
##  Max.   :51.68   Max.   : 0.28857                                          
##                                                                            
##   room_type             price         minimum_nights     maximum_nights  
##  Length:56534       Min.   :    1.0   Min.   :   1.000   Min.   :   1.0  
##  Class :character   1st Qu.:   59.0   1st Qu.:   1.000   1st Qu.:  40.0  
##  Mode  :character   Median :  100.0   Median :   2.000   Median : 365.0  
##                     Mean   :  154.5   Mean   :   5.083   Mean   : 573.4  
##                     3rd Qu.:  170.0   3rd Qu.:   4.000   3rd Qu.:1125.0  
##                     Max.   :35243.0   Max.   :1124.000   Max.   :2000.0  
##                                                                          
##  number_of_reviews review_scores_rating    revenue                day      
##  Min.   :   1.00   Min.   :0.000        Min.   :      40   Sunday   :7529  
##  1st Qu.:   3.00   1st Qu.:4.500        1st Qu.:    5409   Monday   :9347  
##  Median :   8.00   Median :4.820        Median :   44400   Tuesday  :9104  
##  Mean   :  23.91   Mean   :4.588        Mean   :   94842   Wednesday:8916  
##  3rd Qu.:  24.00   3rd Qu.:5.000        3rd Qu.:  107350   Thursday :8627  
##  Max.   :1328.00   Max.   :5.000        Max.   :17992000   Friday   :7180  
##                                                            Saturday :5831  
##      date               month            year      
##  Length:56534       July   : 5714   2015   : 9342  
##  Class :character   May    : 5471   2016   : 7719  
##  Mode  :character   June   : 5141   2014   : 7144  
##                     August : 5132   2017   : 5071  
##                     March  : 5110   2013   : 4734  
##                     April  : 4710   2018   : 4576  
##                     (Other):25256   (Other):17948

As for now statistical summary provides interesting insights.

table(df_final$month) 
## 
##   January  February     March     April       May      June      July    August 
##      4465      4029      5110      4710      5471      5141      5714      5132 
## September   October  November  December 
##      4531      4121      4545      3565
df_final %>% 
  skim_without_charts()
Data summary
Name Piped data
Number of rows 56534
Number of columns 19
_______________________
Column type frequency:
character 7
Date 1
factor 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 0 254 17 55300 0
host_name 0 1 1 35 0 11203 0
host_location 0 1 5 56 0 1190 0
host_acceptance_rate 0 1 0 4 15460 100 0
property_type 0 1 3 34 0 92 0
room_type 0 1 10 15 0 4 0
date 0 1 2 2 0 31 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-12-03 2023-03-10 2016-02-26 4463

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
day 0 1 TRUE 7 Mon: 9347, Tue: 9104, Wed: 8916, Thu: 8627
month 0 1 FALSE 12 Jul: 5714, May: 5471, Jun: 5141, Aug: 5132
year 0 1 TRUE 16 201: 9342, 201: 7719, 201: 7144, 201: 5071

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
latitude 0 1 51.51 0.05 51.3 51.48 51.51 51.54 5.1680e+01
longitude 0 1 -0.13 0.10 -0.5 -0.19 -0.12 -0.07 2.9000e-01
price 0 1 154.53 319.43 1.0 59.00 100.00 170.00 3.5243e+04
minimum_nights 0 1 5.08 21.93 1.0 1.00 2.00 4.00 1.1240e+03
maximum_nights 0 1 573.41 491.07 1.0 40.00 365.00 1125.00 2.0000e+03
number_of_reviews 0 1 23.91 46.93 1.0 3.00 8.00 24.00 1.3280e+03
review_scores_rating 0 1 4.59 0.78 0.0 4.50 4.82 5.00 5.0000e+00
revenue 0 1 94842.41 260294.38 40.0 5409.00 44400.00 107350.00 1.7992e+07
airbnb_outside <- df_final   # a new dataframe for further analysis

head(airbnb_outside$host_location) 
## [1] "London, United Kingdom" "London, United Kingdom" "London, United Kingdom"
## [4] "London, United Kingdom" "unknown"                "unknown"
airbnb_outside <- airbnb_outside %>%  
  filter(minimum_nights > 90) %>%  
  mutate(host_location = ifelse(host_location== "", 'unknown', host_location)) 

view(airbnb_outside)  # Replaced blanks with 'unknown' value
airbnb_outside <- airbnb_outside %>% 
  mutate(really = if_else(str_ends(host_location, "Kingdom"), "in", "out"))
airbnb_outside <- airbnb_outside %>% 
  select(host_location, minimum_nights, revenue, really) %>% 
  filter(really == 'out') %>%            # filtered out of uk hosts location         
  reframe(host_location, minimum_nights, revenue, really)

There were 40 hosts outside of UK.

df_final <- anti_join(df_final, airbnb_outside, by = "host_location")
view(df_final)
df_final <- df_final %>% 
  mutate(host_acceptance_rate = ifelse(host_acceptance_rate== "", '0%',
                                       host_acceptance_rate)) 

PHASE 5 : Visualization

world_data <- df_final    # copy of final dataframe

world_data <- world_data %>% 
  separate(host_location, c("city","country"), ",")  # separated by city and country
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 670 rows [983, 1014,
## 1124, 1249, 1250, 1251, 1639, 1831, 1834, 1836, 1837, 1838, 1868, 1869, 2200,
## 2622, 3360, 3374, 3377, 3436, ...].
world_data <- world_data %>% 
  mutate(country = coalesce(country,city))   # Done with missing values
n_distinct(world_data$country)     # Now 157, before 117 countries
## [1] 157
n_distinct(world_data$city)        # Now 1153, before 1167 cities
## [1] 1153

A

df_final %>% 
  ggplot(aes(year, review_scores_rating, color = room_type)) +
  geom_jitter(size = 1.5, alpha = 0.3) +
  geom_smooth() +
  labs(x = "Year",
       y = "Review Scores Rated",
       color = "Room Type",
       title = "Overall Years with Review Scores Rated",
       caption = "Data Analyst : JP")

B

df_final %>% 
  ggplot(aes(year, review_scores_rating, fill= room_type)) +
  geom_bar(position = "stack", stat = "identity") +
  labs(x = "Year",
       y = "Review Scores Rated",
       fill = "Room Type",
       title = "Overall Years with Review Scores Rated",
       caption = "Data Analyst : JP")

  • Room_type and Price viz
df_final %>% 
  filter(price <= 10000) %>% 
  ggplot(aes(room_type, price, color = revenue)) +
  geom_jitter(size = 1.5, alpha = 0.3) +
  labs(x = "Room Type",
       y = "Price",
       color = "Revenue",
       title = "Price Affects on Room Types",
       caption = "Data Analyst : JP")

  • Number of reviews and year viz
df_final %>%
  ggplot(aes(year, number_of_reviews, fill= room_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Year",
       y = "Total Reviews",
       fill = "Room Type",
       title = "Total Reviews Over Years",
       caption = "Data Analyst : JP")

  • country and price viz
world_data %>% 
  ggplot(aes(country, price, color = room_type)) +
  geom_jitter(size = 1.5, alpha = 0.3)  # Added Tableau Link as "click here"

London Airbnb Host Locations
London Airbnb Host Locations

For World Map Viz

  • maximum nights and date viz
df_final %>%
  select(c(date, maximum_nights, price)) %>%
  ggplot(aes(date, maximum_nights, color = price)) +
  geom_point(size = 2, alpha = 0.4) +
  geom_smooth(method = 'lm' , color = 'purple') +
  labs(x = "Date",
       y = "Maximun Night Stay",
       color = "Price ",
       title = "Maximun Night Stay : Date-Wise",
       subtitle = "Linear Regression has 'Strong' fit for the model",
       caption = "Data Analyst : JP") +
  annotate("text", x=5, y= 1600, label = "R^2 =  1.1", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 )

  • Overall viz by room_type
df_final %>%
  select(c(maximum_nights, price, room_type)) %>%
  filter(maximum_nights <1300, price < 1300) %>% 
  ggplot(aes(maximum_nights, price, color = room_type)) +
  geom_point(size = 2, alpha = 0.3)  +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Maximun Night Stay",
       y = "Price",
       color = "Room type",
       title = "Maximun Night Stay : Price-Wise",
       subtitle = "Linear Regression has 'None' fit for the model",
       caption = "Data Analyst : JP") +
  annotate("text", x= 700, y= 1000, label = "R^2 =  0.00", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 )

  • price and revenue viz
df_final %>%
  select(c(price, revenue, room_type)) %>% 
  filter(price <1300, revenue < 1200000) %>% 
  ggplot(aes(price, revenue, color = room_type)) +
  geom_point(size = 2, alpha = 0.3)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Airbnb Price",
       y = "Revenue Generated",
       color = "Room type",
       title = "Airbnb's Price affect on Revenue",
       subtitle = "Linear Regression has 'Medium' fit for the model",
       caption = "Data Analyst : JP") +
  annotate("text", x=200, y= 1000000, label = "R^2 =  0.5", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 )

  • year and revenue viz
df_final %>%
  select(c(host_location, year, revenue, room_type)) %>% 
  filter(revenue < 1200000) %>% 
  ggplot(aes(year, revenue, color = room_type)) +
  geom_point(size = 2, alpha = 0.3)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Year",
       y = "Revenue Generated",
       color = "Room type",
       title = "Airbnb's Revenue Generated by Year(s)",
       subtitle = "Linear Regression has 'Very Strog' fit for the model",
       caption = "Data Analyst : JP") +
  annotate("text", x= 1.9, y= 900000, label = "R^2 =  4.8", color = "darkgreen",
           fontface = "bold", size = 5, angle = 25 )

  • price and host acceptance rate
df_final %>%
  select(c(price, host_acceptance_rate, revenue, room_type)) %>%
  filter(price <1300) %>% 
  ggplot(aes(host_acceptance_rate, price, color = room_type)) +
  geom_point(size = 2, alpha = 0.3)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Host Acceptance Rate in %",
       y = "Price Paid",
       color = "Room type",
       title = "Host Acceptance Rate by Price",
       caption = "Data Analyst : JP")

  • number of hosts based on yearly viz
df_final %>%
  mutate(
    yearly = format(host_since, "%Y"),
    month_year = format(host_since, "%m-%Y"),
    date = format(host_since, "%d")) %>%
  group_by(yearly, room_type) %>%
  summarise(
    yearly_host = sum(n())) %>% 
  select(c(yearly_host, yearly, room_type)) %>% 
  ggplot(aes(yearly, yearly_host, color = room_type)) +
  geom_point(size = 2, alpha = 1)   +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Year",
       y = "Total Hosts",
       color = "Room Type",
       title = "Total Hosts Growth by Years",
       caption = "Data Analyst : JP")

  • For Top 15 renvenue generated by hosts
top_15 <- df_final %>%
  arrange(desc(revenue)) %>% 
  distinct(host_name, .keep_all = TRUE) %>% #removed all duplicates 
  slice(1:15) %>% 
  ungroup()
  • Viz for top 15 hosts by revenue generated
top_15 %>% 
  ggplot(aes(host_name, revenue, fill= room_type)) +
  geom_bar(position = "dodge", stat = "identity")  +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Host Name",
       y = "Revenue Generated",
       color = "Room Type",
       title = "Top Hosts with Revenue",
       caption = "Data Analyst : JP")

  • For Top 15 Airbnb price
price_15 <- df_final %>%
  arrange(desc(price)) %>% 
  distinct(host_name, .keep_all = TRUE) %>% #removed all duplicates
  slice(1:15) %>% 
  ungroup()
  • Viz for top 15 Airbnb Price by room_types
price_15 %>% 
  ggplot(aes(room_type , price, fill= property_type)) +
  geom_bar(position = "dodge", stat = "identity")  +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = 'lm' , se = F, color = 'purple') +
  labs(x = "Room Type",
       y = "Price",
       color = "Property Type",
       title = "Affects of Price on Room Types",
       caption = "Data Analyst : JP")

My Tableau

PHASE 6 : Act

The act phase would be done by the Executive team of the company. So,Passing the Documented Report to The Director and the Team.

Data-driven decision-making :

  • With All These Years Airbnb doing great and having Continuous Growth Over Time

  • Airbnb’s Most Rated Room Type is Entire Home/Apt Then Private Room

  • Customers Rated Private Room more than Entire Home/Apt. Hence, Customers Liked Entire Home/Apt.

  • Hotel Rooms and Shared Rooms are Cheaper also Attract Less Customers and even Booking are lower

  • Since 2013 - 2106, Most Bookings had been done for London Airbnb

Suggestion(s) :

  • Lower Price Airbnb has Very Good Impact on Revenue

  • Improving/Fixing those Property Types that got lower reviews.

  • If we provide travelers to have their discounts/offers for providing reviews about their stay at checkout would be beneficial for the company.

  • If more info about customers are added like age, sex, relationship etc would be useful

  • London Airbnb dataset has 56534 columns, Finalized columns are 46126. Total of 10408 had been manipulated.

  • Saved

  • Exported data

  • write.csv(airbnb, “airbnb.csv”)

  • write.csv(df_final, “df_final.csv”)

  • write.csv(world_data, “world_data.csv”)

  • write.csv(airbnb_outside, “airbnb_outside.csv”)