Hi! Welcome to my Rmd.
I'm using the same data as in my previous LBB DV, which I took from an external source (Kaggle).
Since we are focusing on Shinydashboard, I don't include many plots or much analysis in this Rmd (I already did that in the previous LBB DV, so please check it if needed :) )
I hope you'll enjoy it.
This data covers Airbnb metrics for listings in New York City, USA. The first thing I need to do is load all the packages that might be needed for this dataset.
We can read our data into R and store it in the ‘airbnb’ object.
Then we inspect the data.
## Observations: 48,895
## Variables: 16
## $ id <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name <chr> "Clean & quiet apt home by the ...
## $ host_id <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name <chr> "John", "Jennifer", "Elisabeth"...
## $ neighbourhood_group <chr> "Brooklyn", "Manhattan", "Manha...
## $ neighbourhood <chr> "Kensington", "Midtown", "Harle...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type <chr> "Private room", "Entire home/ap...
## $ price <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review <date> 2018-10-19, 2019-05-21, NA, 20...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365 <dbl> 365, 355, 365, 194, 0, 129, 0, ...
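The loading and inspection code is not shown in the rendered output above, so here is a minimal sketch of what those steps might look like. The two-row inline CSV and the name `airbnb_demo` are assumptions made only so the example runs on its own; with the real Kaggle file you would pass its path instead of `text =`. The `<dbl>` and `<date>` types in the output suggest the original used `readr::read_csv()`, but that is a guess.

```r
library(dplyr)  # provides glimpse() and the %>% pipe used throughout

# Assumption: a tiny inline CSV stands in for the downloaded Kaggle file.
csv_text <- "id,name,neighbourhood_group,room_type,price
2539,Clean and quiet apt,Brooklyn,Private room,149
2595,Skylit Midtown Castle,Manhattan,Entire home/apt,225"

# Read the data into an object, as the text does for 'airbnb'.
airbnb_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One line per column: name, type and the first few values.
glimpse(airbnb_demo)
```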
From the inspection above, we get a short description of the data: airbnb consists of 48,895 rows and 16 columns. Next we need to check the data structure.
We found that some columns need to be converted to factors. Let's change them, then check the types.
airbnb <- airbnb %>%
mutate(name = as.factor(name),
host_name = as.factor (host_name),
neighbourhood_group = as.factor(neighbourhood_group),
neighbourhood = as.factor(neighbourhood),
room_type = as.factor(room_type),
last_review = as.factor(last_review))
## Observations: 48,895
## Variables: 16
## $ id <dbl> 2539, 2595, 3647, 3831, 5022, 5...
## $ name <fct> "Clean & quiet apt home by the ...
## $ host_id <dbl> 2787, 2845, 4632, 4869, 7192, 7...
## $ host_name <fct> John, Jennifer, Elisabeth, Lisa...
## $ neighbourhood_group <fct> Brooklyn, Manhattan, Manhattan,...
## $ neighbourhood <fct> Kensington, Midtown, Harlem, Cl...
## $ latitude <dbl> 40.64749, 40.75362, 40.80902, 4...
## $ longitude <dbl> -73.97237, -73.98377, -73.94190...
## $ room_type <fct> Private room, Entire home/apt, ...
## $ price <dbl> 149, 225, 150, 89, 80, 200, 60,...
## $ minimum_nights <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1,...
## $ number_of_reviews <dbl> 9, 45, 0, 270, 9, 74, 49, 430, ...
## $ last_review <fct> 2018-10-19, 2019-05-21, NA, 201...
## $ reviews_per_month <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.5...
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1...
## $ availability_365 <dbl> 365, 355, 365, 194, 0, 129, 0, ...
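When several columns get the same conversion, `across()` can shorten the `mutate()` call. This is only an alternative sketch on toy data, not the code used above; `demo` and `factor_cols` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the airbnb data (assumption, not the real dataset).
demo <- data.frame(
  id = c(2539, 2595),
  name = c("Clean and quiet apt", "Skylit Midtown Castle"),
  room_type = c("Private room", "Entire home/apt"),
  stringsAsFactors = FALSE
)

# Columns that should become factors.
factor_cols <- c("name", "room_type")

# Apply as.factor() to every listed column in one step.
demo <- demo %>%
  mutate(across(all_of(factor_cols), as.factor))

# Check the resulting column classes.
sapply(demo, class)
```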
As we can see, all data types are now correct.
Find the missing values in the dataset.
## id name
## 0 16
## host_id host_name
## 0 21
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 10052 10052
## calculated_host_listings_count availability_365
## 0 0
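A per-column count like the one above is commonly produced with `colSums(is.na(...))`. A minimal sketch on toy data (the `demo` data frame is an assumption):

```r
# Toy data with a few missing values (assumption, not the real dataset).
demo <- data.frame(
  name = c("Loft", NA, "Studio"),
  price = c(149, 225, 80),
  reviews_per_month = c(0.21, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() marks every missing cell; colSums() counts them per column.
missing_counts <- colSums(is.na(demo))
missing_counts
```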
In this case, almost all of the missing data is in last_review (a date) and the related reviews_per_month, so let's ignore it for now since we won't use these columns for further analysis.
We continue by checking the data summary.
## id name
## Min. : 2539 Hillside Hotel : 18
## 1st Qu.: 9471945 Home away from home : 17
## Median :19677284 New york Multi-unit building : 16
## Mean :19017143 Brooklyn Apartment : 12
## 3rd Qu.:29152178 Loft Suite @ The Box House Hotel: 11
## Max. :36487245 (Other) :48805
## NA's : 16
## host_id host_name neighbourhood_group
## Min. : 2438 Michael : 417 Bronx : 1091
## 1st Qu.: 7822033 David : 403 Brooklyn :20104
## Median : 30793816 Sonder (NYC): 327 Manhattan :21661
## Mean : 67620011 John : 294 Queens : 5666
## 3rd Qu.:107434423 Alex : 279 Staten Island: 373
## Max. :274321313 (Other) :47154
## NA's : 21
## neighbourhood latitude longitude
## Williamsburg : 3920 Min. :40.50 Min. :-74.24
## Bedford-Stuyvesant: 3714 1st Qu.:40.69 1st Qu.:-73.98
## Harlem : 2658 Median :40.72 Median :-73.96
## Bushwick : 2465 Mean :40.73 Mean :-73.95
## Upper West Side : 1971 3rd Qu.:40.76 3rd Qu.:-73.94
## Hell's Kitchen : 1958 Max. :40.91 Max. :-73.71
## (Other) :32209
## room_type price minimum_nights
## Entire home/apt:25409 Min. : 0.0 Min. : 1.00
## Private room :22326 1st Qu.: 69.0 1st Qu.: 1.00
## Shared room : 1160 Median : 106.0 Median : 3.00
## Mean : 152.7 Mean : 7.03
## 3rd Qu.: 175.0 3rd Qu.: 5.00
## Max. :10000.0 Max. :1250.00
##
## number_of_reviews last_review reviews_per_month
## Min. : 0.00 2019-06-23: 1413 Min. : 0.010
## 1st Qu.: 1.00 2019-07-01: 1359 1st Qu.: 0.190
## Median : 5.00 2019-06-30: 1341 Median : 0.720
## Mean : 23.27 2019-06-24: 875 Mean : 1.373
## 3rd Qu.: 24.00 2019-07-07: 718 3rd Qu.: 2.020
## Max. :629.00 (Other) :33137 Max. :58.500
## NA's :10052 NA's :10052
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 45.0
## Mean : 7.144 Mean :112.8
## 3rd Qu.: 2.000 3rd Qu.:227.0
## Max. :327.000 Max. :365.0
##
From the summary above, we can conclude a few things:
1. There are 3 types of listing: Entire home/apt, Private room and Shared room. Entire home/apt is the most common type on Airbnb here.
2. Prices range from 0 to 10,000 USD, with a mean of 152.7 USD.
3. Manhattan has the highest number of listings (21,661) of all the neighbourhood groups.
4. The minimum stay ranges from 1 night to 1,250 nights (roughly 3.4 years), with a mean of about 7 nights.
5. Michael is the most common host name among the listings (417), although several different hosts may share that name.
6. “Hillside Hotel” is the most frequent listing name (18 listings) in New York City in 2019.
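The host name counts in the summary are counts of listings, not of hosts. A small sketch of how one might separate the two, using toy data and hypothetical object names:

```r
# Toy data (assumption): three different hosts named Michael,
# and one host named Sonder with two listings.
demo <- data.frame(
  host_id = c(1, 2, 3, 9, 9),
  host_name = c("Michael", "Michael", "Michael", "Sonder", "Sonder"),
  stringsAsFactors = FALSE
)

# Listings per host name: what summary() reports for a factor column.
listings_per_name <- sort(table(demo$host_name), decreasing = TRUE)

# Distinct host_id values behind each host name.
hosts_per_name <- tapply(demo$host_id, demo$host_name,
                         function(ids) length(unique(ids)))

listings_per_name
hosts_per_name
```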
Find out the price distribution based on the number of available days in a year.
For the graph below, we only pick “Manhattan” as the neighbourhood_group and “Private room” as the room type.
Make a new object named ‘A’ which contains the Manhattan neighbourhood group and the Private room type.
A <- airbnb %>%
filter(neighbourhood_group == "Manhattan" & room_type == "Private room") %>%
select(neighbourhood_group,room_type, availability_365,price)
dim(A)
## [1] 7982 4
Then we create the graph using geom_point and save it into the ‘plotA’ object.
plotA <- ggplot(A,aes( price, availability_365))+
geom_point(color="orange",
fill="#fd90c9",
shape=23,
alpha=0.7,
size=3,
stroke = 1, aes(text = paste("Price:", price, "<br>",
"Availability:", availability_365)))+
geom_smooth()+
scale_y_continuous(limits = c(0,400))+
scale_x_continuous()+
labs(title = "Price and Availability in Year", x= "Price", y = "Availability")+
theme(plot.title = element_text(hjust = 0.5))
plotA
Continue using plotly
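The plotly step itself is not shown in the rendered output. A minimal sketch of how a ggplot object with a `text` aesthetic is usually converted, using toy data (`demo`, `p` and `interactive_p` are hypothetical names):

```r
library(ggplot2)
library(plotly)

# Toy data (assumption) standing in for object 'A'.
demo <- data.frame(
  price = c(50, 120, 300, 650),
  availability_365 = c(10, 180, 365, 90)
)

# ggplot2 warns about the unknown 'text' aesthetic; ggplotly() reads it anyway.
p <- ggplot(demo, aes(price, availability_365)) +
  geom_point(aes(text = paste("Price:", price, "<br>",
                              "Availability:", availability_365)))

# tooltip = "text" shows only the custom label on hover.
interactive_p <- ggplotly(p, tooltip = "text")
interactive_p
```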
Interpretations:
This graph shows a positive correlation between availability and price below about 600 USD, and a negative correlation above that.
This means that for Private rooms in the Manhattan neighbourhood group priced below around 600 USD, we may say that the higher the price, the lower the demand (more days remain available).
Above that price the behaviour is reversed, although this might be influenced by the number of nights stayed or other factors.
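One way to check this reading numerically is to compute the correlation separately on each side of the threshold. The sketch below uses made-up numbers chosen to mimic the described pattern; the 600 USD `threshold` comes from the interpretation above, and everything else is an assumption.

```r
# Toy data (assumption): availability rises with price up to ~600, then falls.
demo <- data.frame(
  price = c(50, 100, 200, 400, 700, 900, 1200),
  availability_365 = c(10, 50, 120, 300, 250, 150, 40)
)

threshold <- 600

# Split the listings at the threshold.
low <- subset(demo, price < threshold)
high <- subset(demo, price >= threshold)

# Pearson correlation within each price band.
cor_low <- cor(low$price, low$availability_365)
cor_high <- cor(high$price, high$availability_365)

c(below = cor_low, above = cor_high)
```

On the real data, `A` would take the place of `demo`.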
Group the prices into the ranges x < 300, 300 <= x <= 500, 500 < x < 1000 and x >= 1000, then store the result in a new column named ‘price_seg’ in the ‘airbnb’ dataset.
airbnb <- airbnb %>%
mutate(price_seg = case_when(
price < 300 ~ "Below 300",
price >= 300 & price <=500 ~ "300 to 500",
price > 500 & price < 1000 ~ "Between 500 - 1000",
TRUE ~ "Above 1000"
))
After that, create the graph using a boxplot and geom_jitter to see the distribution.
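Before plotting, the segment boundaries from the `case_when()` call above can be sanity-checked on a few edge prices. This is a sketch on a toy data frame (`edge_prices` is a hypothetical name) using the same conditions:

```r
library(dplyr)

# Edge values around each boundary, plus a missing price (toy data).
edge_prices <- data.frame(price = c(299, 300, 500, 501, 999, 1000, NA))

edge_prices <- edge_prices %>%
  mutate(price_seg = case_when(
    price < 300 ~ "Below 300",
    price >= 300 & price <= 500 ~ "300 to 500",
    price > 500 & price < 1000 ~ "Between 500 - 1000",
    TRUE ~ "Above 1000"  # catch-all: also matches exactly 1000 and NA
  ))

edge_prices
```

Note that the catch-all branch labels a price of exactly 1000, and any missing price, as "Above 1000". The price column has no missing values in this dataset, so only the first point matters here.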
plotB <- ggplot(airbnb, aes(neighbourhood_group,price)) +
geom_jitter(aes(col= price_seg, text = paste("Price Range:", price_seg, "<br>",
"Area:", neighbourhood_group, "<br>",
"Price:", price)), alpha = 0.7) +
geom_boxplot(alpha=1) +
scale_y_continuous(limits = c(0,1500), breaks = seq(0,1500, 100))+
labs(title = "Entire Home/apt Price by Neighbourhood", x= "Neighbourhood Group", y= "Price", col = "Price Segment") +
theme(plot.title = element_text(hjust = 0.5))
plotB
Continue to plotly
Interpretations:
a. All areas are mostly populated by listings priced below 300 USD.
b. Manhattan and Brooklyn have the widest spread in price: most listings are below 300 USD, but some are above 1000 USD.
c. Queens comes third.
d. In the Bronx and Staten Island, we don't find many listings above 1000 USD and only a few between 500 and 1000 USD, which means that in both areas most listings are priced below 300 USD.
We want to know which host has the highest number of reviews.
We keep one row per host_id and order the rows from highest to lowest by number of reviews.
We only take the top 30 by number of reviews.
We name the result the ‘C’ object.
C <- airbnb %>%
distinct(host_id, .keep_all = TRUE) %>%
arrange(desc(number_of_reviews)) %>%
top_n(30, number_of_reviews)
glimpse(C)
## Observations: 30
## Variables: 17
## $ id <dbl> 9145202, 891117, 834190, 347432...
## $ name <fct> "Room near JFK Queen Bed", "Pri...
## $ host_id <dbl> 47621202, 4734398, 2369681, 129...
## $ host_name <fct> Dona, Jj, Carol, Asa, Wanda, Li...
## $ neighbourhood_group <fct> Queens, Manhattan, Manhattan, B...
## $ neighbourhood <fct> Jamaica, Harlem, Lower East Sid...
## $ latitude <dbl> 40.66730, 40.82264, 40.71921, 4...
## $ longitude <dbl> -73.76831, -73.94041, -73.99116...
## $ room_type <fct> Private room, Private room, Pri...
## $ price <dbl> 47, 49, 99, 160, 60, 55, 120, 6...
## $ minimum_nights <dbl> 1, 1, 2, 1, 3, 1, 30, 1, 1, 5, ...
## $ number_of_reviews <dbl> 629, 594, 540, 488, 480, 474, 4...
## $ last_review <fct> 2019-07-05, 2019-06-15, 2019-07...
## $ reviews_per_month <dbl> 14.58, 7.57, 6.95, 8.14, 6.70, ...
## $ calculated_host_listings_count <dbl> 2, 3, 1, 1, 1, 3, 2, 2, 1, 2, 5...
## $ availability_365 <dbl> 333, 339, 179, 269, 0, 332, 192...
## $ price_seg <chr> "Below 300", "Below 300", "Belo...
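One thing to keep in mind: `distinct(host_id, .keep_all = TRUE)` keeps the first row it finds for each host, so `number_of_reviews` in ‘C’ is the count for that single listing. If the goal were total reviews per host, one possible alternative (an assumption about intent, not the method used above) is to aggregate first. A sketch on toy data with hypothetical names:

```r
library(dplyr)

# Toy data (assumption): host 1 has two listings, the second far more reviewed.
demo <- data.frame(
  host_id = c(1, 1, 2, 3),
  host_name = c("Ann", "Ann", "Bob", "Cy"),
  number_of_reviews = c(5, 100, 60, 10),
  stringsAsFactors = FALSE
)

# Approach used above: first listing per host, then sort.
first_listing <- demo %>%
  distinct(host_id, .keep_all = TRUE) %>%
  arrange(desc(number_of_reviews))

# Alternative: sum the reviews over all listings of each host.
totals <- demo %>%
  group_by(host_id, host_name) %>%
  summarise(total_reviews = sum(number_of_reviews), .groups = "drop") %>%
  arrange(desc(total_reviews)) %>%
  slice_head(n = 2)

first_listing
totals
```

In this toy example the two approaches rank different hosts first, which is why the choice matters.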
Next, create the plot.
plotC <-
ggplot(C,aes(reorder(host_name,number_of_reviews), number_of_reviews))+
geom_col(fill ="#f0b81e", aes(text = paste("Reviews:", number_of_reviews,"<br>","Host Name:", host_name)))+
facet_grid(rows = vars(neighbourhood_group), scales = "free_y")+
geom_point(aes(col = price, size = price))+
labs( x= NULL, y= "Reviews")+
coord_flip()
plotC
Interpretations: