Before we begin the Process for this Project, there are few key points that are wrapped below as these are the steps I’ll be following to ensure its completion:
As you know this is not a case study that means we don’t have any problems yet, only Dataset we have. I’ll doing EDA step by step and will try to find every type of possible insights from the dataset :
Airbnb is a platform that allows house and apartment owners to rent
their properties to guests for short-term stays. Due to its convenience
and flexibility, it has become one of the top choices for
travelers.
We will see how Data-Driven Decisions propel
Airbnb to stand Different.
About the company:
Airbnb, Inc is an American San Francisco-based company operating an online marketplace for short- and long-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast. The company is credited with revolutionizing the tourism industry, while also having been the subject of intense criticism by residents of tourism hotspot cities like Barcelona and Venice for enabling an unaffordable increase in home rents, and for a lack of regulation.
Regulation of short-term rentals can include requirements for hosts to have business licenses, payment of hotel taxes and compliance with building, city and zoning standards. The hotel industry has lobbied for stricter regulations on short-term home rental.In addition to government- imposed restrictions, many homeowner associations also limit short term rentals.
Since its founding in 2008, Airbnb has become one of the most successful and valuable start-ups in the world, and has had a significant impact on the industry of renting homes and the hospitality industry more generally.
Airbnb has also had a significant impact on the hospitality industry, particularly in the area of hotels and other traditional accommodation providers, which are often referred to as the HORECA (hotel, restaurant and catering) industry. Some industry experts believe that Airbnb’s rapid growth has disrupted the traditional HORECA model, and has led to a decline in revenue and occupancy rates for traditional hotels in certain markets. Some experts believe Airbnb can overtake Booking.com by revenue. Notably, Airbnb is one of the few major travel companies without a loyalty program.
I’ll be using London Airbnb data to analyze and identify trends, Downloaded the data and stored it on my Google Drive
The data is located on Inside Airbnb and data has been made available by Inside Airbnb with No Copyright CC0: Public Domain.
Reliability: This data is reliable. This public dataset is a subset of Airbnb data and is available for public use.
Originality: This is Original subset dataset.
Comprehensiveness: This data is comprehensive. It provides comprehensive information about Airbnb listings, hosts, and various metrics for analysis and research purposes.
Current: A recent dataset, which is current.
Cited: Inside Airbnb created the dataset, made it Public Dataset this is Credible
Now the Datasets clearly meet the ROCCC System. Therefore, insights from the Analysis provide some directions ofcourse :
Therefore, the data is not Biased And have full credibility for the same reason. It meets ROCCC System since it’s reliable, original,comprehensive,current and cited.
Analysis could be done on Excel too since excel can easily handle the dataset. But still I’m using R because it is primarily used for statistical analysis, data visualization and I have to make Documentation too. So, I chose RStudio for further Analysis.
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("janitor")
# install.packages("lubridate")
# install.packages("skimr")
#install.packages("stringr")
library(tidyverse)
## Warning: package 'stringr' was built under R version 4.3.1
library(dplyr)
library(janitor)
library(lubridate)
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.1
library(stringr)
setwd("D:/Case_Study/Data/London_Airbnb")
airbnb <- read.csv("London_Airbnb_2023.csv")
str(airbnb)
## 'data.frame': 75241 obs. of 16 variables:
## $ name : chr "Lovely private bedroom in Muswell Hill." "Studio Flat Franklin London" "PropertyPlug - 2Bed Flat in Edgware SmartTV WiFi" "Wimbledon Double Bedroom Ensuite" ...
## $ host_name : chr "Forough" "Hussein" "Paul" "Nikki" ...
## $ latitude : num 51.6 51.6 51.6 51.4 51.6 ...
## $ longitude : num -0.139 -0.177 -0.277 -0.188 -0.194 ...
## $ host_since : chr "2015-07-20" "2023-01-14" "2020-01-08" "2010-08-01" ...
## $ host_location : chr "London, United Kingdom" "London, United Kingdom" "" "London, United Kingdom" ...
## $ host_acceptance_rate: chr "" "100%" "91%" "100%" ...
## $ property_type : chr "Private room in condo" "Entire rental unit" "Entire rental unit" "Private room in rental unit" ...
## $ room_type : chr "Private room" "Entire home/apt" "Entire home/apt" "Private room" ...
## $ price : chr "$100.00" "$65.00" "$132.00" "$100.00" ...
## $ minimum_nights : int 1 180 2 5 5 1 2 1 1 1 ...
## $ maximum_nights : int 365 365 28 1125 90 365 365 365 365 6 ...
## $ availability_30 : int 30 30 12 30 23 20 9 30 25 0 ...
## $ availability_365 : int 365 365 35 365 337 278 252 365 278 11 ...
## $ number_of_reviews : int 0 0 0 4 0 3 6 0 0 2 ...
## $ review_scores_rating: num NA NA NA 3.67 NA 4.67 4.5 NA NA 5 ...
summary(airbnb) # A quick look before proceeding
## name host_name latitude longitude
## Length:75241 Length:75241 Min. :51.30 Min. :-0.49780
## Class :character Class :character 1st Qu.:51.48 1st Qu.:-0.18939
## Mode :character Mode :character Median :51.51 Median :-0.12628
## Mean :51.51 Mean :-0.12811
## 3rd Qu.:51.54 3rd Qu.:-0.06846
## Max. :51.68 Max. : 0.28857
##
## host_since host_location host_acceptance_rate property_type
## Length:75241 Length:75241 Length:75241 Length:75241
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## room_type price minimum_nights maximum_nights
## Length:75241 Length:75241 Min. : 1.000 Min. : 1
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 42
## Mode :character Mode :character Median : 2.000 Median : 365
## Mean : 5.751 Mean : 7790
## 3rd Qu.: 4.000 3rd Qu.: 1125
## Max. :1125.000 Max. :524855552
##
## availability_30 availability_365 number_of_reviews review_scores_rating
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. :0.000
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 1.00 1st Qu.:4.500
## Median : 0.000 Median : 47.0 Median : 4.00 Median :4.820
## Mean : 7.526 Mean :113.1 Mean : 17.97 Mean :4.588
## 3rd Qu.:13.000 3rd Qu.:228.0 3rd Qu.: 17.00 3rd Qu.:5.000
## Max. :30.000 Max. :365.0 Max. :1328.00 Max. :5.000
## NA's :18693
airbnb <- airbnb %>% na.omit() %>% # Dropping NA's value from dataset
distinct() # unique rows by checking values on all columns
airbnb <- airbnb %>%
filter(price > "$0.00") # Removed price that contained "0.00"
mean(airbnb$price) # Average price
## Warning in mean.default(airbnb$price): argument is not numeric or logical:
## returning NA
## [1] NA
airbnb$price <- gsub(",", "", airbnb$price) # removed ','
airbnb$price <- gsub("\\$", "", airbnb$price) # removed '$' sign
airbnb$price <- as.numeric(airbnb$price) # converted to num.
mean(airbnb$price) # Average price till now
## [1] 154.579
summary(airbnb) # A quick summary
## name host_name latitude longitude
## Length:56538 Length:56538 Min. :51.30 Min. :-0.49780
## Class :character Class :character 1st Qu.:51.48 1st Qu.:-0.18885
## Mode :character Mode :character Median :51.51 Median :-0.12492
## Mean :51.51 Mean :-0.12786
## 3rd Qu.:51.54 3rd Qu.:-0.06876
## Max. :51.68 Max. : 0.28857
## host_since host_location host_acceptance_rate property_type
## Length:56538 Length:56538 Length:56538 Length:56538
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## room_type price minimum_nights maximum_nights
## Length:56538 Min. : 1.0 Min. : 1.000 Min. : 1
## Class :character 1st Qu.: 59.0 1st Qu.: 1.000 1st Qu.: 40
## Mode :character Median : 100.0 Median : 2.000 Median : 365
## Mean : 154.6 Mean : 5.083 Mean : 10193
## 3rd Qu.: 170.0 3rd Qu.: 4.000 3rd Qu.: 1125
## Max. :35243.0 Max. :1124.000 Max. :524855552
## availability_30 availability_365 number_of_reviews review_scores_rating
## Min. : 0.000 Min. : 0.0 Min. : 1.00 Min. :0.000
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 3.00 1st Qu.:4.500
## Median : 1.000 Median : 47.0 Median : 8.00 Median :4.820
## Mean : 6.957 Mean :108.3 Mean : 23.91 Mean :4.588
## 3rd Qu.:12.000 3rd Qu.:210.0 3rd Qu.: 24.00 3rd Qu.:5.000
## Max. :30.000 Max. :365.0 Max. :1328.00 Max. :5.000
airbnb <- airbnb %>%
filter(maximum_nights < 9000) # ~ 24 years of booking/stay is fine.
airbnb <- airbnb %>%
mutate(revenue = (maximum_nights+minimum_nights)*price) # A new column
test <- airbnb # just a fresh new dataframe
str(test)
## 'data.frame': 56534 obs. of 17 variables:
## $ name : chr "Wimbledon Double Bedroom Ensuite" "Cosy room with own bathroom" "Waddon Coach House 2" "Specious private single bed room" ...
## $ host_name : chr "Nikki" "Ekaterina" "Jahmal" "Marta" ...
## $ latitude : num 51.4 51.4 51.4 51.6 51.6 ...
## $ longitude : num -0.188 -0.174 -0.11 -0.126 -0.188 ...
## $ host_since : chr "2010-08-01" "2023-02-12" "2022-12-09" "2021-06-17" ...
## $ host_location : chr "London, United Kingdom" "" "" "England, United Kingdom" ...
## $ host_acceptance_rate: chr "100%" "100%" "87%" "100%" ...
## $ property_type : chr "Private room in rental unit" "Private room in rental unit" "Entire home" "Private room in rental unit" ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Private room" ...
## $ price : num 100 43 100 29 59 55 68 379 93 45 ...
## $ minimum_nights : int 5 1 2 1 1 1 1 3 2 1 ...
## $ maximum_nights : int 1125 365 365 6 365 14 365 1125 365 365 ...
## $ availability_30 : int 30 20 9 0 25 15 30 23 8 24 ...
## $ availability_365 : int 365 278 252 11 359 205 365 358 273 358 ...
## $ number_of_reviews : int 4 3 6 2 10 19 1 2 3 11 ...
## $ review_scores_rating: num 3.67 4.67 4.5 5 4.8 4.95 5 5 4.33 4.73 ...
## $ revenue : num 113000 15738 36700 203 21594 ...
## - attr(*, "na.action")= 'omit' Named int [1:18693] 1 2 3 5 8 9 11 12 14 15 ...
## ..- attr(*, "names")= chr [1:18693] "1" "2" "3" "5" ...
test$host_since <- as.Date(test$host_since) # datatype as 'date'
test$day <- format(as.Date(test$host_since), "%A") # a column for 'day'
test$date <- format(as.Date(test$host_since), "%d") # another column for 'date'
test$month <- format(as.Date(test$host_since), "%B") # another column for 'month'
test$year <- format(as.Date(test$host_since), "%Y") # another column for 'year'
summary(test) # a quick statistical summary
## name host_name latitude longitude
## Length:56534 Length:56534 Min. :51.30 Min. :-0.49780
## Class :character Class :character 1st Qu.:51.48 1st Qu.:-0.18885
## Mode :character Mode :character Median :51.51 Median :-0.12492
## Mean :51.51 Mean :-0.12786
## 3rd Qu.:51.54 3rd Qu.:-0.06876
## Max. :51.68 Max. : 0.28857
## host_since host_location host_acceptance_rate
## Min. :2008-12-03 Length:56534 Length:56534
## 1st Qu.:2014-07-10 Class :character Class :character
## Median :2016-02-26 Mode :character Mode :character
## Mean :2016-08-07
## 3rd Qu.:2018-07-25
## Max. :2023-03-10
## property_type room_type price minimum_nights
## Length:56534 Length:56534 Min. : 1.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 59.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 100.0 Median : 2.000
## Mean : 154.5 Mean : 5.083
## 3rd Qu.: 170.0 3rd Qu.: 4.000
## Max. :35243.0 Max. :1124.000
## maximum_nights availability_30 availability_365 number_of_reviews
## Min. : 1.0 Min. : 0.000 Min. : 0.0 Min. : 1.00
## 1st Qu.: 40.0 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 3.00
## Median : 365.0 Median : 1.000 Median : 47.0 Median : 8.00
## Mean : 573.4 Mean : 6.957 Mean :108.3 Mean : 23.91
## 3rd Qu.:1125.0 3rd Qu.:12.000 3rd Qu.:210.0 3rd Qu.: 24.00
## Max. :2000.0 Max. :30.000 Max. :365.0 Max. :1328.00
## review_scores_rating revenue day date
## Min. :0.000 Min. : 40 Length:56534 Length:56534
## 1st Qu.:4.500 1st Qu.: 5409 Class :character Class :character
## Median :4.820 Median : 44400 Mode :character Mode :character
## Mean :4.588 Mean : 94842
## 3rd Qu.:5.000 3rd Qu.: 107350
## Max. :5.000 Max. :17992000
## month year
## Length:56534 Length:56534
## Class :character Class :character
## Mode :character Mode :character
##
##
##
test1 <- test %>%
select(-c(availability_30, availability_365))
glimpse(test1) # quick look into data
## Rows: 56,534
## Columns: 19
## $ name <chr> "Wimbledon Double Bedroom Ensuite", "Cosy room wi…
## $ host_name <chr> "Nikki", "Ekaterina", "Jahmal", "Marta", "Simon",…
## $ latitude <dbl> 51.42231, 51.41605, 51.37069, 51.61334, 51.64041,…
## $ longitude <dbl> -0.18841000, -0.17429873, -0.10995560, -0.1259100…
## $ host_since <date> 2010-08-01, 2023-02-12, 2022-12-09, 2021-06-17, …
## $ host_location <chr> "London, United Kingdom", "", "", "England, Unite…
## $ host_acceptance_rate <chr> "100%", "100%", "87%", "100%", "100%", "93%", "90…
## $ property_type <chr> "Private room in rental unit", "Private room in r…
## $ room_type <chr> "Private room", "Private room", "Entire home/apt"…
## $ price <dbl> 100, 43, 100, 29, 59, 55, 68, 379, 93, 45, 49, 10…
## $ minimum_nights <int> 5, 1, 2, 1, 1, 1, 1, 3, 2, 1, 3, 2, 20, 1, 2, 1, …
## $ maximum_nights <int> 1125, 365, 365, 6, 365, 14, 365, 1125, 365, 365, …
## $ number_of_reviews <int> 4, 3, 6, 2, 10, 19, 1, 2, 3, 11, 11, 1, 1, 9, 32,…
## $ review_scores_rating <dbl> 3.67, 4.67, 4.50, 5.00, 4.80, 4.95, 5.00, 5.00, 4…
## $ revenue <dbl> 113000, 15738, 36700, 203, 21594, 825, 24888, 427…
## $ day <chr> "Sunday", "Sunday", "Friday", "Thursday", "Monday…
## $ date <chr> "01", "12", "09", "17", "09", "13", "22", "28", "…
## $ month <chr> "August", "February", "December", "June", "Novemb…
## $ year <chr> "2010", "2023", "2022", "2021", "2015", "2015", "…
df_final <- test1 %>%
select(c(name, host_name, host_since, host_location, latitude, longitude,
host_acceptance_rate, property_type, room_type, price,
minimum_nights, maximum_nights, number_of_reviews,
review_scores_rating, revenue,
day, date, month, year)) %>%
arrange(date, month) %>%
arrange(desc(year))
df_final$month <- factor(df_final$month,
levels = c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December"))
df_final$day <- ordered(df_final$day,
levels = c("Sunday","Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday"))
df_final$year <- ordered(df_final$year,
levels = c("2008", "2009","2010", "2011","2012",
"2013","2014","2015","2016","2017",
"2018","2019","2020","2021","2022","2023"))
df_final <- df_final %>%
mutate(host_location = ifelse(host_location== "", 'unknown', host_location))
glimpse(df_final)
## Rows: 56,534
## Columns: 19
## $ name <chr> "Large En-suite in Victorian Home", "Stylish En-S…
## $ host_name <chr> "Dev", "Dev", "Dev", "Dev", "Seonyoung", "Azan", …
## $ host_since <date> 2023-02-01, 2023-02-01, 2023-02-01, 2023-02-01, …
## $ host_location <chr> "London, United Kingdom", "London, United Kingdom…
## $ latitude <dbl> 51.48815, 51.48756, 51.48837, 51.48900, 51.50100,…
## $ longitude <dbl> 0.005667280, 0.004235341, 0.003485379, 0.00553123…
## $ host_acceptance_rate <chr> "87%", "87%", "87%", "87%", "100%", "46%", "83%",…
## $ property_type <chr> "Private room in home", "Private room in home", "…
## $ room_type <chr> "Private room", "Private room", "Private room", "…
## $ price <dbl> 100, 90, 75, 90, 128, 132, 175, 200, 110, 160, 80…
## $ minimum_nights <int> 4, 3, 2, 2, 2, 1, 1, 1, 1, 3, 2, 1, 2, 2, 1, 1, 1…
## $ maximum_nights <int> 365, 365, 365, 365, 1125, 365, 365, 1125, 1125, 1…
## $ number_of_reviews <int> 1, 1, 1, 1, 7, 1, 1, 2, 1, 2, 1, 3, 1, 3, 3, 2, 1…
## $ review_scores_rating <dbl> 5.00, 5.00, 5.00, 5.00, 4.14, 5.00, 5.00, 3.00, 5…
## $ revenue <dbl> 36900, 33120, 27525, 33030, 144256, 48312, 64050,…
## $ day <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wedne…
## $ date <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ month <fct> February, February, February, February, February,…
## $ year <ord> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
summary(df_final)
## name host_name host_since host_location
## Length:56534 Length:56534 Min. :2008-12-03 Length:56534
## Class :character Class :character 1st Qu.:2014-07-10 Class :character
## Mode :character Mode :character Median :2016-02-26 Mode :character
## Mean :2016-08-07
## 3rd Qu.:2018-07-25
## Max. :2023-03-10
##
## latitude longitude host_acceptance_rate property_type
## Min. :51.30 Min. :-0.49780 Length:56534 Length:56534
## 1st Qu.:51.48 1st Qu.:-0.18885 Class :character Class :character
## Median :51.51 Median :-0.12492 Mode :character Mode :character
## Mean :51.51 Mean :-0.12786
## 3rd Qu.:51.54 3rd Qu.:-0.06876
## Max. :51.68 Max. : 0.28857
##
## room_type price minimum_nights maximum_nights
## Length:56534 Min. : 1.0 Min. : 1.000 Min. : 1.0
## Class :character 1st Qu.: 59.0 1st Qu.: 1.000 1st Qu.: 40.0
## Mode :character Median : 100.0 Median : 2.000 Median : 365.0
## Mean : 154.5 Mean : 5.083 Mean : 573.4
## 3rd Qu.: 170.0 3rd Qu.: 4.000 3rd Qu.:1125.0
## Max. :35243.0 Max. :1124.000 Max. :2000.0
##
## number_of_reviews review_scores_rating revenue day
## Min. : 1.00 Min. :0.000 Min. : 40 Sunday :7529
## 1st Qu.: 3.00 1st Qu.:4.500 1st Qu.: 5409 Monday :9347
## Median : 8.00 Median :4.820 Median : 44400 Tuesday :9104
## Mean : 23.91 Mean :4.588 Mean : 94842 Wednesday:8916
## 3rd Qu.: 24.00 3rd Qu.:5.000 3rd Qu.: 107350 Thursday :8627
## Max. :1328.00 Max. :5.000 Max. :17992000 Friday :7180
## Saturday :5831
## date month year
## Length:56534 July : 5714 2015 : 9342
## Class :character May : 5471 2016 : 7719
## Mode :character June : 5141 2014 : 7144
## August : 5132 2017 : 5071
## March : 5110 2013 : 4734
## April : 4710 2018 : 4576
## (Other):25256 (Other):17948
As for now statistical summary provides interesting insights.
table(df_final$month)
##
## January February March April May June July August
## 4465 4029 5110 4710 5471 5141 5714 5132
## September October November December
## 4531 4121 4545 3565
df_final %>%
skim_without_charts()
| Name | Piped data |
| Number of rows | 56534 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| Date | 1 |
| factor | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 0 | 254 | 17 | 55300 | 0 |
| host_name | 0 | 1 | 1 | 35 | 0 | 11203 | 0 |
| host_location | 0 | 1 | 5 | 56 | 0 | 1190 | 0 |
| host_acceptance_rate | 0 | 1 | 0 | 4 | 15460 | 100 | 0 |
| property_type | 0 | 1 | 3 | 34 | 0 | 92 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| date | 0 | 1 | 2 | 2 | 0 | 31 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 0 | 1 | 2008-12-03 | 2023-03-10 | 2016-02-26 | 4463 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| day | 0 | 1 | TRUE | 7 | Mon: 9347, Tue: 9104, Wed: 8916, Thu: 8627 |
| month | 0 | 1 | FALSE | 12 | Jul: 5714, May: 5471, Jun: 5141, Aug: 5132 |
| year | 0 | 1 | TRUE | 16 | 201: 9342, 201: 7719, 201: 7144, 201: 5071 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| latitude | 0 | 1 | 51.51 | 0.05 | 51.3 | 51.48 | 51.51 | 51.54 | 5.1680e+01 |
| longitude | 0 | 1 | -0.13 | 0.10 | -0.5 | -0.19 | -0.12 | -0.07 | 2.9000e-01 |
| price | 0 | 1 | 154.53 | 319.43 | 1.0 | 59.00 | 100.00 | 170.00 | 3.5243e+04 |
| minimum_nights | 0 | 1 | 5.08 | 21.93 | 1.0 | 1.00 | 2.00 | 4.00 | 1.1240e+03 |
| maximum_nights | 0 | 1 | 573.41 | 491.07 | 1.0 | 40.00 | 365.00 | 1125.00 | 2.0000e+03 |
| number_of_reviews | 0 | 1 | 23.91 | 46.93 | 1.0 | 3.00 | 8.00 | 24.00 | 1.3280e+03 |
| review_scores_rating | 0 | 1 | 4.59 | 0.78 | 0.0 | 4.50 | 4.82 | 5.00 | 5.0000e+00 |
| revenue | 0 | 1 | 94842.41 | 260294.38 | 40.0 | 5409.00 | 44400.00 | 107350.00 | 1.7992e+07 |
airbnb_outside <- df_final # a new dataframe for further analysis
head(airbnb_outside$host_location)
## [1] "London, United Kingdom" "London, United Kingdom" "London, United Kingdom"
## [4] "London, United Kingdom" "unknown" "unknown"
airbnb_outside <- airbnb_outside %>%
filter(minimum_nights > 90) %>%
mutate(host_location = ifelse(host_location== "", 'unknown', host_location))
view(airbnb_outside) # Replaced blanks with 'unknown' value
airbnb_outside <- airbnb_outside %>%
mutate(really = if_else(str_ends(host_location, "Kingdom"), "in", "out"))
airbnb_outside <- airbnb_outside %>%
select(host_location, minimum_nights, revenue, really) %>%
filter(really == 'out') %>% # filtered out of uk hosts location
reframe(host_location, minimum_nights, revenue, really)
There were 40 hosts outside of UK.
df_final <- anti_join(df_final, airbnb_outside, by = "host_location")
view(df_final)
df_final <- df_final %>%
mutate(host_acceptance_rate = ifelse(host_acceptance_rate== "", '0%',
host_acceptance_rate))
world_data <- df_final # copy of final dataframe
world_data <- world_data %>%
separate(host_location, c("city","country"), ",") # separated by city and country
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 670 rows [983, 1014,
## 1124, 1249, 1250, 1251, 1639, 1831, 1834, 1836, 1837, 1838, 1868, 1869, 2200,
## 2622, 3360, 3374, 3377, 3436, ...].
world_data <- world_data %>%
mutate(country = coalesce(country,city)) # Done with missing values
n_distinct(world_data$country) # Now 157, before 117 countries
## [1] 157
n_distinct(world_data$city) # Now 1153, before 1167 cities
## [1] 1153
df_final %>%
ggplot(aes(year, review_scores_rating, color = room_type)) +
geom_jitter(size = 1.5, alpha = 0.3) +
geom_smooth() +
labs(x = "Year",
y = "Review Scores Rated",
color = "Room Type",
title = "Overall Years with Review Scores Rated",
caption = "Data Analyst : JP")
df_final %>%
ggplot(aes(year, review_scores_rating, fill= room_type)) +
geom_bar(position = "stack", stat = "identity") +
labs(x = "Year",
y = "Review Scores Rated",
fill = "Room Type",
title = "Overall Years with Review Scores Rated",
caption = "Data Analyst : JP")
df_final %>%
filter(price <= 10000) %>%
ggplot(aes(room_type, price, color = revenue)) +
geom_jitter(size = 1.5, alpha = 0.3) +
labs(x = "Room Type",
y = "Price",
color = "Revenue",
title = "Price Affects on Room Types",
caption = "Data Analyst : JP")
df_final %>%
ggplot(aes(year, number_of_reviews, fill= room_type)) +
geom_bar(position = "dodge", stat = "identity") +
labs(x = "Year",
y = "Total Reviews",
fill = "Room Type",
title = "Total Reviews Over Years",
caption = "Data Analyst : JP")
world_data %>%
ggplot(aes(country, price, color = room_type)) +
geom_jitter(size = 1.5, alpha = 0.3) # Added Tableau Link as "click here"
df_final %>%
select(c(date, maximum_nights, price)) %>%
ggplot(aes(date, maximum_nights, color = price)) +
geom_point(size = 2, alpha = 0.4) +
geom_smooth(method = 'lm' , color = 'purple') +
labs(x = "Date",
y = "Maximun Night Stay",
color = "Price ",
title = "Maximun Night Stay : Date-Wise",
subtitle = "Linear Regression has 'Strong' fit for the model",
caption = "Data Analyst : JP") +
annotate("text", x=5, y= 1600, label = "R^2 = 1.1", color = "darkgreen",
fontface = "bold", size = 5, angle = 25 )
df_final %>%
select(c(maximum_nights, price, room_type)) %>%
filter(maximum_nights <1300, price < 1300) %>%
ggplot(aes(maximum_nights, price, color = room_type)) +
geom_point(size = 2, alpha = 0.3) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Maximun Night Stay",
y = "Price",
color = "Room type",
title = "Maximun Night Stay : Price-Wise",
subtitle = "Linear Regression has 'None' fit for the model",
caption = "Data Analyst : JP") +
annotate("text", x= 700, y= 1000, label = "R^2 = 0.00", color = "darkgreen",
fontface = "bold", size = 5, angle = 25 )
df_final %>%
select(c(price, revenue, room_type)) %>%
filter(price <1300, revenue < 1200000) %>%
ggplot(aes(price, revenue, color = room_type)) +
geom_point(size = 2, alpha = 0.3) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Airbnb Price",
y = "Revenue Generated",
color = "Room type",
title = "Airbnb's Price affect on Revenue",
subtitle = "Linear Regression has 'Medium' fit for the model",
caption = "Data Analyst : JP") +
annotate("text", x=200, y= 1000000, label = "R^2 = 0.5", color = "darkgreen",
fontface = "bold", size = 5, angle = 25 )
df_final %>%
select(c(host_location, year, revenue, room_type)) %>%
filter(revenue < 1200000) %>%
ggplot(aes(year, revenue, color = room_type)) +
geom_point(size = 2, alpha = 0.3) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Year",
y = "Revenue Generated",
color = "Room type",
title = "Airbnb's Revenue Generated by Year(s)",
subtitle = "Linear Regression has 'Very Strog' fit for the model",
caption = "Data Analyst : JP") +
annotate("text", x= 1.9, y= 900000, label = "R^2 = 4.8", color = "darkgreen",
fontface = "bold", size = 5, angle = 25 )
df_final %>%
select(c(price, host_acceptance_rate, revenue, room_type)) %>%
filter(price <1300) %>%
ggplot(aes(host_acceptance_rate, price, color = room_type)) +
geom_point(size = 2, alpha = 0.3) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Host Acceptance Rate in %",
y = "Price Paid",
color = "Room type",
title = "Host Acceptance Rate by Price",
caption = "Data Analyst : JP")
df_final %>%
mutate(
yearly = format(host_since, "%Y"),
month_year = format(host_since, "%m-%Y"),
date = format(host_since, "%d")) %>%
group_by(yearly, room_type) %>%
summarise(
yearly_host = sum(n())) %>%
select(c(yearly_host, yearly, room_type)) %>%
ggplot(aes(yearly, yearly_host, color = room_type)) +
geom_point(size = 2, alpha = 1) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Year",
y = "Total Hosts",
color = "Room Type",
title = "Total Hosts Growth by Years",
caption = "Data Analyst : JP")
top_15 <- df_final %>%
arrange(desc(revenue)) %>%
distinct(host_name, .keep_all = TRUE) %>% #removed all duplicates
slice(1:15) %>%
ungroup()
top_15 %>%
ggplot(aes(host_name, revenue, fill= room_type)) +
geom_bar(position = "dodge", stat = "identity") +
scale_y_continuous(labels = scales::comma) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Host Name",
y = "Revenue Generated",
color = "Room Type",
title = "Top Hosts with Revenue",
caption = "Data Analyst : JP")
price_15 <- df_final %>%
arrange(desc(price)) %>%
distinct(host_name, .keep_all = TRUE) %>% #removed all duplicates
slice(1:15) %>%
ungroup()
price_15 %>%
ggplot(aes(room_type , price, fill= property_type)) +
geom_bar(position = "dodge", stat = "identity") +
scale_y_continuous(labels = scales::comma) +
geom_smooth(method = 'lm' , se = F, color = 'purple') +
labs(x = "Room Type",
y = "Price",
color = "Property Type",
title = "Affects of Price on Room Types",
caption = "Data Analyst : JP")
With All These Years Airbnb doing great and having Continuous Growth Over Time
Airbnb’s Most Rated Room Type is Entire Home/Apt Then Private Room
Customers Rated Private Room more than Entire Home/Apt. Hence, Customers Liked Entire Home/Apt.
Hotel Rooms and Shared Rooms are Cheaper also Attract Less Customers and even Booking are lower
Since 2013 - 2106, Most Bookings had been done for London Airbnb
Lower Price Airbnb has Very Good Impact on Revenue
Improving/Fixing those Property Types that got lower reviews.
If we provide travelers to have their discounts/offers for providing reviews about their stay at checkout would be beneficial for the company.
If more info about customers are added like age, sex, relationship etc would be useful
London Airbnb dataset has 56534 columns, Finalized columns are 46126. Total of 10408 had been manipulated.
Saved
Exported data
write.csv(airbnb, “airbnb.csv”)
write.csv(df_final, “df_final.csv”)
write.csv(world_data, “world_data.csv”)
write.csv(airbnb_outside, “airbnb_outside.csv”)