Hi! Welcome to my Rmd.
This time I am looking at a dataset from an external source (Kaggle). I hope you'll enjoy it.
This data describes Airbnb metrics for listings in New York City, USA. The first thing I need to do is load all the packages that might be needed for this dataset.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
library(leaflet)
library(scales)
library(tidyr)
library(colorspace)
library(ggridges)
##
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
We read our data into R and store it in the ‘airbnb’ object
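The loading step itself is not shown in the rendered output. A minimal sketch of what it could look like, assuming the Kaggle download is a CSV file with a header row. The file name in the comment is an assumption, and the inline text is a made-up three-row stand-in so that the example runs on its own:

```r
# Hypothetical stand-in for the Kaggle CSV. For the real file one would call
# read.csv("AB_NYC_2019.csv", stringsAsFactors = TRUE)  # file name is an assumption
csv_text <- "id,name,room_type,price,last_review
1,Cozy room,Private room,149,2018-10-19
2,Sunny loft,Entire home/apt,225,2019-05-21
3,Quiet studio,Entire home/apt,150,"

# stringsAsFactors = TRUE gives Factor columns like those in the structure output
airbnb_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

dim(airbnb_demo)  # number of rows, then number of columns
str(airbnb_demo)  # type of each column
```

Note that the third row has no review date; it is read as an empty string rather than as NA.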
Then we inspect the data
## [1] 48895 16
From the inspection above, we get a short description of the data: airbnb consists of 48,895 rows and 16 columns. Next, we check the data structure
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
As we can see here, we only have to convert the ‘last_review’ column to the Date type
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Date, format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
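The conversion call is not shown above. A minimal base R sketch of one way to do it, using made-up values. The empty string mimics the "" level visible in the first structure output, which would be consistent with listings that have no review yet:

```r
# Made-up values; the empty string stands for a listing without any review
last_review <- factor(c("2018-10-19", "2019-05-21", ""))

# as.character() makes the conversion from the factor labels explicit;
# with an explicit format, the empty string becomes NA
last_review_date <- as.Date(as.character(last_review), format = "%Y-%m-%d")

class(last_review_date)
sum(is.na(last_review_date))  # number of dates that could not be parsed
```

This is only a sketch; lubridate's ymd() would be another option, since the package is already loaded.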
Find out the missing data in the imported dataset
## [1] TRUE
## id name
## 0 0
## host_id host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 10052 10052
## calculated_host_listings_count availability_365
## 0 0
Oops! The airbnb data has NA values in ‘last_review’ and ‘reviews_per_month’ (10,052 each). We will delete all rows with missing values
## [1] 38843 16
## id name
## 0 0
## host_id host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 0 0
## calculated_host_listings_count availability_365
## 0 0
## [1] FALSE
After the deletion, we are left with 38,843 rows and 16 columns. Don't forget to save the result back into the ‘airbnb’ object
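The check and the deletion can be sketched in base R as follows. The data frame is made up, and na.omit() is one common way to drop incomplete rows, not necessarily the call used in the original chunk:

```r
# Made-up data with two incomplete rows
demo <- data.frame(
  id = 1:4,
  price = c(149L, 225L, 150L, 89L),
  reviews_per_month = c(0.21, NA, 4.64, NA)
)

anyNA(demo)           # is there any missing value at all?
colSums(is.na(demo))  # number of missing values per column

# na.omit() drops every row that has at least one NA
demo_clean <- na.omit(demo)

dim(demo_clean)
anyNA(demo_clean)
```

Assigning the result to a name (here demo_clean) is what "saving" means; without the assignment the original object stays unchanged.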
In this step, I will delete the ‘latitude’ and ‘longitude’ columns
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "room_type" "price"
## [9] "minimum_nights" "number_of_reviews"
## [11] "last_review" "reviews_per_month"
## [13] "calculated_host_listings_count" "availability_365"
## [1] 38843 14
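A minimal sketch of dropping columns by name in base R, with a made-up data frame. The helper name drop_cols is hypothetical:

```r
# Made-up data with the two coordinate columns
demo <- data.frame(
  id = 1:2,
  latitude = c(40.6, 40.8),
  longitude = c(-74.0, -73.9),
  price = c(149L, 225L)
)

# Keep every column whose name is not in drop_cols
drop_cols <- c("latitude", "longitude")
demo <- demo[, !(names(demo) %in% drop_cols), drop = FALSE]

names(demo)
dim(demo)
```

Selecting by name rather than by position keeps the code valid even if the column order changes.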
Cool! Looking good so far. Let's check the statistical summary
## id name
## Min. : 2539 Home away from home : 12
## 1st Qu.: 8720027 Loft Suite @ The Box House Hotel: 11
## Median :18871455 Private Room : 10
## Mean :18096462 Brooklyn Apartment : 9
## 3rd Qu.:27554820 Cozy Brooklyn Apartment : 8
## Max. :36455809 New york Multi-unit building : 8
## (Other) :38785
## host_id host_name neighbourhood_group
## Min. : 2438 Michael : 335 Bronx : 876
## 1st Qu.: 7033824 David : 309 Brooklyn :16447
## Median : 28371926 John : 250 Manhattan :16632
## Mean : 64239145 Alex : 229 Queens : 4574
## 3rd Qu.:101846466 Sonder (NYC): 207 Staten Island: 314
## Max. :273841667 Sarah : 179
## (Other) :37334
## neighbourhood room_type price
## Williamsburg : 3163 Entire home/apt:20332 Min. : 0.0
## Bedford-Stuyvesant: 3141 Private room :17665 1st Qu.: 69.0
## Harlem : 2206 Shared room : 846 Median : 101.0
## Bushwick : 1944 Mean : 142.3
## Hell's Kitchen : 1532 3rd Qu.: 170.0
## East Village : 1490 Max. :10000.0
## (Other) :25367
## minimum_nights number_of_reviews last_review
## Min. : 1.000 Min. : 1.0 Min. :2011-03-28
## 1st Qu.: 1.000 1st Qu.: 3.0 1st Qu.:2018-07-08
## Median : 2.000 Median : 9.0 Median :2019-05-19
## Mean : 5.868 Mean : 29.3 Mean :2018-10-04
## 3rd Qu.: 4.000 3rd Qu.: 33.0 3rd Qu.:2019-06-23
## Max. :1250.000 Max. :629.0 Max. :2019-07-08
##
## reviews_per_month calculated_host_listings_count availability_365
## Min. : 0.010 Min. : 1.000 Min. : 0.0
## 1st Qu.: 0.190 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 0.720 Median : 1.000 Median : 55.0
## Mean : 1.373 Mean : 5.165 Mean :114.9
## 3rd Qu.: 2.020 3rd Qu.: 2.000 3rd Qu.:229.0
## Max. :58.500 Max. :327.000 Max. :365.0
##
## 'data.frame': 38843 obs. of 14 variables:
## $ id : int 2539 2595 3831 5022 5099 5121 5178 5203 5238 5295 ...
## $ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 15702 19366 25001 8337 25048 15597 17682 5654 ...
## $ host_id : int 2787 2845 4869 7192 7322 7356 8967 7490 7549 7702 ...
## $ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 6264 5982 1970 3601 9699 6935 1264 6084 ...
## $ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 2 3 3 2 3 3 3 3 ...
## $ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 42 62 138 14 96 203 36 203 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 1 1 1 2 2 2 1 1 ...
## $ price : int 149 225 89 80 200 60 79 79 150 135 ...
## $ minimum_nights : int 1 1 1 10 3 45 2 2 1 5 ...
## $ number_of_reviews : int 9 45 270 9 74 49 430 118 160 53 ...
## $ last_review : Date, format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num 0.21 0.38 4.64 0.1 0.59 0.4 3.47 0.99 1.33 0.43 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 4 1 ...
## $ availability_365 : int 365 355 194 0 129 0 220 0 188 6 ...
From the summary above, we may conclude a few things
1. There are 3 types of listing: Entire home/apt, Private room and Shared room. Entire home/apt is the most common type among the Airbnb listings
2. The price ranges from 0 to 10,000 USD, with an average value of 142.3 USD
3. Manhattan is the neighbourhood group with the most listings (16,632), closely followed by Brooklyn (16,447)
4. The minimum stay ranges from 1 night to 1,250 nights (around 3 years-ish), but the average minimum stay is around 6 nights
5. Michael is the most frequent host name among the listings in New York City (335 listings); this counts the name, which may be shared by several hosts
6. “Home away from home” is the most frequent listing name in New York City (12 listings)
1. We will check the relationship between price and room type, overlaid with the average price
I will use the function ‘scale_y_log10’ to make the IQR easier to read
ggplot(airbnb, aes(room_type, price)) +
geom_boxplot(aes(fill = room_type)) +
scale_y_log10() +
labs(title = "Price by Room Type", x= "Room Type", y= "Price", fill = "Room Type",
subtitle = "red line indicate average price") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = mean(airbnb$price), color = "red", linetype = 5)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing non-finite values (stat_boxplot).
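The two warnings come from the log scale: the summary shows a minimum price of 0, and log10(0) is -Inf, so those listings cannot be drawn. A small base R sketch with made-up prices shows the effect, and also why the mean line sits high compared with the bulk of the data:

```r
# Made-up prices, including one listing with a price of 0
price <- c(0, 69, 101, 170, 10000)

log_price <- log10(price)
log_price  # the zero price becomes -Inf

# Values like this are what ggplot2 reports as "non-finite" and removes
sum(!is.finite(log_price))

# The mean is pulled up by the expensive listing, the median is not
mean(price)
median(price)
```

If the zero prices should stay visible, filtering them out beforehand or plotting price + 1 are two possible workarounds; which one fits is a judgment call.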
Interpretations:
a. As we can see from the boxplot above, the highest-priced type of listing is ‘Entire home/apt’
b. Second place is Private room, and the lowest is Shared room, as expected
c. The average price line only crosses the Entire home/apt box: more than half of the Entire home/apt prices are above the average price
d. The price distributions of the Private room and Shared room types are far below the average price
2. We want to find out the correlation between price and availability in a year: does a cheap price make a listing the most in-demand property in New York City?
Plot the relationship between price and availability in a year using geom_point
ggplot(airbnb, aes(availability_365, price)) +
geom_point(alpha = 0.5, color = "green") +
geom_segment(aes(x=availability_365, xend=availability_365, y=0, yend=price)) +
labs(title = " Availability Vs Price", x ="Availability During Year", y= "Price")+
theme(plot.title = element_text(hjust = 0.5))
Interpretation:
A lower price does not guarantee that a listing is rented more often than a higher-priced one
The graph above shows listings whose price is high but whose availability in a year is low. This suggests that some customers do not treat price as the most important factor when choosing a listing, although other customers do take the price into account.
In the graph above, at a price of around 6,000+ USD, the availability is around 50 days in a year. This suggests that the property is popular to rent although its price is higher than the others.
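The plot gives a visual impression only. A correlation coefficient would put a number on it; the sketch below uses made-up values standing in for the two columns, so the results say nothing about the real data:

```r
# Made-up values standing in for airbnb$price and airbnb$availability_365
price <- c(60, 79, 89, 150, 200, 225, 6000)
availability_365 <- c(0, 220, 194, 365, 129, 355, 50)

# Pearson measures linear association and is sensitive to the 6000 outlier
r_pearson <- cor(price, availability_365, method = "pearson")

# Spearman works on ranks, so extreme prices have less influence
r_spearman <- cor(price, availability_365, method = "spearman")

r_pearson
r_spearman
```

Values near 0 would support the reading that price and availability are only weakly related; values near -1 or 1 would argue against it.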
3. How does the availability of listings vary by room type and across neighbourhood groups?
I will use a violin plot, with room type on x and availability in a year on y
ggplot(airbnb,aes(room_type,availability_365))+
geom_violin(aes(fill = neighbourhood_group)) +
labs(title = "Availability by Neighbourhood Group", x = "Room Type", y = "Availability", fill ="Group")+
theme(plot.title = element_text(hjust = 0.5))
Interpretation:
a. We might say that lower availability means a listing is probably more popular
b. ‘Shared room’ is the most popular on ‘Staten Island’ compared to the other neighbourhood groups; the others are at roughly the same level
c. In ‘Brooklyn’ and ‘Manhattan’, the ‘Private room’ type is almost equally popular as an option to stay
d. Similar to Private room, the ‘Entire home/apt’ type is almost equally popular in ‘Brooklyn’ and ‘Manhattan’, and more popular there than in other areas.
4. Show me the price distribution of each neighbourhood group for the Entire home/apt type only!
Let's focus on “Entire home/apt”
Create a new object that only consists of “Entire home/apt” listings, named ‘era’
## [1] 20332 14
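The subsetting call is not shown above. A minimal base R sketch with a made-up data frame:

```r
# Made-up listings with all three room types
demo <- data.frame(
  id = 1:4,
  room_type = factor(c("Private room", "Entire home/apt",
                       "Shared room", "Entire home/apt")),
  price = c(149L, 225L, 60L, 89L)
)

# Keep only the rows whose room type is "Entire home/apt"
era_demo <- demo[demo$room_type == "Entire home/apt", ]

dim(era_demo)
```

The comparison string has to match the factor level exactly, including the slash and the lower-case "home".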
After that, group the price into the ranges x<300, 300<=x<=500, 500<x<1000 and x>=1000, and store the result in a new column of the era object named ‘priceC’
# classify a single price into one of four segments
P <- function(x) {
if (x < 300) "Below 300" else if (x <= 500) "300 to 500" else if (x < 1000) "between 500-1000" else "above 1000"
}
era$priceC <- as.factor(sapply(era$price, P))
head(era)
ggplot(era,aes(neighbourhood_group,price)) +
geom_jitter(aes(col = priceC)) +
geom_boxplot(alpha=0.5) +
scale_y_log10()+
labs(title = "Entire Home/apt Price by Neighbourhood", x= "Neighbourhood Group", y= "Price", col = "Price Segment") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
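The price segmentation done with the function P and sapply can also be sketched with the vectorised cut(). This is an alternative, not the code used above, and it assumes integer prices (as in the structure output), so that "below 300" equals "at most 299":

```r
# Made-up integer prices covering every boundary
price <- c(0L, 299L, 300L, 500L, 501L, 999L, 1000L)

# Intervals are (-Inf, 299], (299, 500], (500, 999], (999, Inf]
priceC <- cut(
  price,
  breaks = c(-Inf, 299, 500, 999, Inf),
  labels = c("Below 300", "300 to 500", "between 500-1000", "above 1000")
)

table(priceC)
```

A side effect of cut() is that the factor levels follow the order of the breaks, so the legend is sorted from cheapest to most expensive instead of alphabetically.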
Interpretations:
a. Manhattan has the widest spread of prices for the “Entire home/apt” type: most prices are below 300 USD, but some are above 1,000 USD
b. Second is Brooklyn, similar to Manhattan but with fewer listings in the price segment above 1,000 USD
c. Third place is Queens
d. In the Bronx and Staten Island, we do not find any price above 1,000 USD, and only a few listings are priced between 500 and 1,000 USD; in both areas, most prices are below 300 USD
5. Show the average minimum nights and number of reviews for each neighbourhood group within the Entire home/apt listings!
Make a new data frame with the average number of reviews and minimum nights for each neighbourhood group,
named ‘eraRN’
eraRN <- aggregate.data.frame(list(Review = era$number_of_reviews, Nights = era$minimum_nights), by = list(Neighbourhood = era$neighbourhood_group), mean)
dim(eraRN)
## [1] 5 3
Gather both values (review and minimum night) into long format using the function ‘gather’: the measure names go into a column named ‘variable’ and the numbers into a column named ‘average’
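The gather call is not shown in the rendered output. A sketch of what it might look like, assuming the column names ‘variable’ and ‘average’ that the plot code relies on; the data frame and its numbers are made up:

```r
library(tidyr)

# Made-up averages in the same wide shape as eraRN
eraRN_demo <- data.frame(
  Neighbourhood = c("Brooklyn", "Manhattan"),
  Review = c(30.5, 20.1),
  Nights = c(6.2, 10.4)
)

# key = name of the new label column, value = name of the new number column
eraRN_long <- gather(eraRN_demo, key = "variable", value = "average",
                     Review, Nights)

eraRN_long
```

Each neighbourhood now appears once per measure, which is the shape geom_col needs to draw dodged bars filled by ‘variable’.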
ggplot(eraRN,aes(Neighbourhood, average))+
geom_col(aes(fill = variable), position = "dodge") +
coord_flip()+
labs(title = "Average Night and Review for each group of Neighbourhood", x= "Group of Neighb", y = "Value", fill = "Variable")+
geom_text(aes(label=comma(average)), hjust = -0.2, size = 3)+
theme(plot.title = element_text(hjust = 0.5))
Interpretation:
a. Staten Island has the lowest average minimum nights for Entire home/apt listings but the highest average number of reviews. This may happen when people stay for a short time and write a review afterwards. In other words, guests who rent an Entire home/apt on Staten Island change often, because they come for a vacation or a short stay rather than for many nights.
b. Manhattan has the highest average minimum nights for the Entire home/apt type but the lowest average number of reviews. This is understandable if people who choose an Entire home/apt in Manhattan mostly stay for months or years in the same listing. In other words, guests change less often, which would explain why the average number of reviews is lower than elsewhere.
6. We want to know the correlation between number of reviews and price in the Brooklyn and Manhattan areas only
First, we need to keep only the Manhattan and Brooklyn neighbourhood groups and store them in a new object named ‘airbnbM_B’
A <- airbnb[order(airbnb$price, decreasing =T),]
airbnbM_B <- A [A$neighbourhood_group== "Manhattan" | A$neighbourhood_group== "Brooklyn", ]
head(airbnbM_B)
Then continue by creating the graph
ggplot(airbnbM_B, aes(price, number_of_reviews)) +
geom_jitter(aes(col = number_of_reviews))+
scale_x_log10()+
facet_wrap(~neighbourhood_group, scales = "free")+
labs(title = "Number of reviews Vs Price", x="Price", y= "Number of Reviews" )+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Transformation introduced infinite values in continuous x-axis
Interpretations:
Both areas (Manhattan and Brooklyn) show almost the same pattern of price and number of reviews.
Manhattan has slightly more reviews than Brooklyn
7. We want to see how long people stay each year in the Manhattan and Brooklyn areas. Does it change significantly from year to year?
We already have an object that includes only the Manhattan and Brooklyn areas from the previous graph, named ‘airbnbM_B’, so all we need to do next is extract the year from the column ‘last_review’ for each row
airbnbM_B$year <- year(airbnbM_B$last_review)
airbnbM_B$year <- as.factor(airbnbM_B$year)
head(airbnbM_B$year)
## [1] 2017 2015 2016 2018 2016 2019
## Levels: 2011 2012 2013 2014 2015 2016 2017 2018 2019
Continue by creating the graph
ggplot(airbnbM_B,aes(minimum_nights, year)) +
geom_density_ridges(fill="yellow")+
scale_x_continuous(limits = c(0,100),breaks = seq(0,100,10))
## Picking joint bandwidth of 1.24
## Warning: Removed 86 rows containing non-finite values
## (stat_density_ridges).
Interpretations:
In the Brooklyn and Manhattan areas, the distribution of minimum nights looks almost the same throughout 2011-2019, except in 2012-2013, which show a slightly lower density. We also find another peak at 30 nights.
8. We want to know which host has the highest number of reviews
We keep one row per host_id and order the rows from highest to lowest by number of reviews
We only take the top 30 by number of reviews
We name the result the ‘host’ object
host <- airbnb[match(unique(airbnb$host_id), airbnb$host_id), ]
host <- host[order(host$number_of_reviews, decreasing = T), ]
host <- host[1:30, ]
host
ggplot(host, aes(reorder(host_name, number_of_reviews), number_of_reviews))+
geom_col(fill ="magenta")+
facet_grid(rows = vars(neighbourhood_group), scales = "free_y")+
geom_point(aes(col=price))+
geom_text(aes(label= comma(number_of_reviews)), hjust=-0.4, size = 3)+
labs( x="Host Name", y= "Reviews")+
coord_flip()
Interpretations:
a. Staten Island and the Bronx do not appear in the top 30 by number of reviews
b. ‘Dona’ from Queens ranks highest with more than 600 reviews, and the price of her listing is below 100 USD
c. Second place is occupied by ‘Jj’ in Manhattan with 594 reviews and a price below 100 USD
d. Third place is occupied by ‘Carol’, also from Manhattan, with 540 reviews and a price below 100 USD
e. The listing priced around 500 USD is located in Manhattan, with host name ‘John’ and 447 reviews
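One caveat about the ‘host’ subset: match(unique(host_id), host_id) keeps the first listing of each host, so the ranking reflects the reviews of a single listing per host rather than a host's total. If the per-host total is what is wanted, a sketch with made-up data could look like this (host names and numbers are hypothetical):

```r
# Made-up listings: host 1 owns two listings, host 2 owns one
listings <- data.frame(
  host_id = c(1L, 1L, 2L),
  host_name = c("Ann", "Ann", "Bob"),
  number_of_reviews = c(300L, 250L, 400L)
)

# Sum the reviews over all listings of each host
host_total <- aggregate(number_of_reviews ~ host_id + host_name,
                        data = listings, FUN = sum)

# Order from highest to lowest and keep the top n (30 in the analysis above)
host_total <- host_total[order(host_total$number_of_reviews, decreasing = TRUE), ]
head(host_total, 30)
```

With the first-listing approach Bob (400) would rank above Ann (300); with totals Ann (550) comes first, so the choice can change the ranking.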
From all the graphs above, we may draw some tentative conclusions, such as:
1. The Entire home/apt type of listing has relatively the highest price compared to the other listing types
2. A lower price does not guarantee that a listing is rented more often than a higher-priced one; people choosing a property to rent mostly consider their needs, and price is not the one and only consideration
3. Different places show different needs. On Staten Island, Shared room is more popular than the other types. On the other hand, in Brooklyn and Manhattan the Entire home/apt and Private room types are more desirable than the others
4. Some people spend more than 1,000 USD to rent an Entire home/apt in the Brooklyn and Manhattan areas