Hi! Welcome to my Rmd.
This time I'm looking at a dataset from an external source (Kaggle). I hope you'll enjoy it.
The data contains Airbnb metrics for listings in Singapore. The first thing I need to do is load all the packages that might be needed for this dataset.
Then we read the data into R and store it in the ‘master’ object.
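The loading chunk itself is not shown in this post. A minimal sketch of what it might look like follows, assuming the Kaggle file was saved as `listings.csv` in the working directory (a hypothetical name, since the real one is not given). The functions used later come from ggplot2, viridis (`scale_fill_viridis`), treemapify (`geom_treemap`), hrbrthemes (`theme_ipsum`) and leaflet, so those packages would be loaded here as well.

```r
library(readr)   # read_csv()
library(dplyr)   # %>%, mutate(), glimpse(), ...
# Also needed further down: ggplot2, viridis, treemapify, hrbrthemes, leaflet

# Hypothetical path: the original file name is not given in the post
csv_path <- "listings.csv"

if (file.exists(csv_path)) {
  master <- read_csv(csv_path)
  glimpse(master)  # column names, types and first values
  dim(master)      # number of rows and columns
} else {
  message("File not found: ", csv_path)
}
```

The `glimpse()` and `dim()` calls are a guess at what produced the output printed next.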
## Observations: 7,907
## Variables: 16
## $ id <dbl> 49091, 50646, 56334, 71609, 718...
## $ name <chr> "COZICOMFORT LONG TERM STAY ROO...
## $ host_id <dbl> 266763, 227796, 266763, 367042,...
## $ host_name <chr> "Francesca", "Sujatha", "France...
## $ neighbourhood_group <chr> "North Region", "Central Region...
## $ neighbourhood <chr> "Woodlands", "Bukit Timah", "Wo...
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34...
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 1...
## $ room_type <chr> "Private room", "Private room",...
## $ price <dbl> 83, 81, 69, 206, 94, 104, 208, ...
## $ minimum_nights <dbl> 180, 90, 6, 1, 1, 1, 1, 90, 90,...
## $ number_of_reviews <dbl> 1, 18, 20, 14, 22, 39, 25, 174,...
## $ last_review <date> 2013-10-21, 2014-12-26, 2015-1...
## $ reviews_per_month <dbl> 0.01, 0.28, 0.20, 0.15, 0.22, 0...
## $ calculated_host_listings_count <dbl> 2, 1, 2, 9, 9, 9, 9, 4, 4, 4, 3...
## $ availability_365 <dbl> 365, 365, 365, 353, 355, 346, 1...
## [1] 7907 16
As we can see, the dataset has 7,907 rows and 16 columns.
Some columns have the wrong data type, so we convert them to the correct type and save the result in a new object, ‘airbnb’.
airbnb <- master %>%
  # convert the character columns to factors
  mutate(name = as.factor(name),
         host_name = as.factor(host_name),
         neighbourhood_group = as.factor(neighbourhood_group),
         neighbourhood = as.factor(neighbourhood),
         room_type = as.factor(room_type))
airbnb
Seems good! Now I want to check for missing data in this dataset.
## id name
## 0 2
## host_id host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 2758 2758
## calculated_host_listings_count availability_365
## 0 0
As we can see above, about 35% of the rows (2,758 of 7,907) have missing values in the ‘last_review’ and ‘reviews_per_month’ columns, and ‘name’ is missing for 2 listings.
Both columns hold review information (the date of the last review and the monthly review rate), so let's ignore them for now, since we won't use them in further analysis.
Next, we check the data summary.
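The per-column counts above look like the result of `colSums(is.na(...))`, although the chunk is not shown. A small sketch on made-up toy data (not the real listings) shows how the counts and the percentage of missing rows can be computed:

```r
# Toy data standing in for 'airbnb' (values are made up for illustration)
toy <- data.frame(
  name = c("Room A", NA, "Room C", "Room D"),
  last_review = as.Date(c("2019-01-05", NA, NA, "2019-08-01")),
  reviews_per_month = c(0.5, NA, NA, 1.2)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of rows with a missing value in each column, in percent
na_share <- round(100 * colMeans(is.na(toy)), 1)
na_share
```

On the real data the same calculation gives 2758 / 7907, or about 34.9%, for the two review columns.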
## id
## Min. : 49091
## 1st Qu.:15821800
## Median :24706270
## Mean :23388625
## 3rd Qu.:32348500
## Max. :38112762
##
## name
## Luxury hostel with in-cabin locker - Single mixed: 13
## Inviting & Cozy 1BR APT 3 mins from Tg Pagar MRT : 9
## Studio Apartment - Oakwood Premier : 9
## City-located 1BR loft apartment *BRAND NEW* : 8
## Stylish 1BR Located 7 mins from Tg Pagar MRT : 8
## (Other) :7858
## NA's : 2
## host_id host_name neighbourhood_group
## Min. : 23666 Jay : 290 Central Region :6309
## 1st Qu.: 23058075 Alvin : 249 East Region : 508
## Median : 63448912 Richards: 157 North-East Region: 346
## Mean : 91144807 Aaron : 145 North Region : 204
## 3rd Qu.:155381142 Rain : 115 West Region : 540
## Max. :288567551 Darcy : 114
## (Other) :6837
## neighbourhood latitude longitude room_type
## Kallang :1043 Min. :1.244 Min. :103.6 Entire home/apt:4132
## Geylang : 994 1st Qu.:1.296 1st Qu.:103.8 Private room :3381
## Novena : 537 Median :1.311 Median :103.8 Shared room : 394
## Rochor : 536 Mean :1.314 Mean :103.8
## Outram : 477 3rd Qu.:1.322 3rd Qu.:103.9
## Bukit Merah: 470 Max. :1.455 Max. :104.0
## (Other) :3850
## price minimum_nights number_of_reviews
## Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.: 65.0 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 124.0 Median : 3.00 Median : 2.00
## Mean : 169.3 Mean : 17.51 Mean : 12.81
## 3rd Qu.: 199.0 3rd Qu.: 10.00 3rd Qu.: 10.00
## Max. :10000.0 Max. :1000.00 Max. :323.00
##
## last_review reviews_per_month calculated_host_listings_count
## Min. :2013-10-21 Min. : 0.010 Min. : 1.00
## 1st Qu.:2018-11-21 1st Qu.: 0.180 1st Qu.: 2.00
## Median :2019-06-27 Median : 0.550 Median : 9.00
## Mean :2019-01-11 Mean : 1.044 Mean : 40.61
## 3rd Qu.:2019-08-07 3rd Qu.: 1.370 3rd Qu.: 48.00
## Max. :2019-08-27 Max. :13.000 Max. :274.00
## NA's :2758 NA's :2758
## availability_365
## Min. : 0.0
## 1st Qu.: 54.0
## Median :260.0
## Mean :208.7
## 3rd Qu.:355.0
## Max. :365.0
##
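We may conclude:
1. ‘Jay’ is the most common host name in the Singapore listings, appearing on 290 of them. Since the largest calculated_host_listings_count is 274, these listings probably belong to more than one host named Jay.
2. ‘Central Region’ has the most Airbnb listings in Singapore (6,309) and North Region has the fewest (204)
3. Among neighbourhoods, ‘Kallang’ has the most listings (1,043)
4. There are 3 types of listing: ‘Entire home/apt’, ‘Private room’ and ‘Shared room’
5. The average price is around SGD 169.3, while the median is SGD 124
6. The minimum stay required is about 17.5 nights on average, but the median is only 3 nights, so a few listings with very long minimum stays pull the average up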
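Host names are not unique identifiers, so a count per name can merge several hosts who share a name. A short sketch on toy data (the ids and names below are made up) compares counting by `host_name` with counting by `host_id`:

```r
library(dplyr)

# Toy data: two different hosts are both called "Jay"
toy <- tibble(
  host_id   = c(1, 1, 2, 3, 3, 3),
  host_name = c("Jay", "Jay", "Jay", "Alvin", "Alvin", "Alvin")
)

# Counting by name merges different hosts who share a name
by_name <- toy %>% count(host_name, sort = TRUE)
by_name

# Counting by id (keeping the name for readability) separates them
by_host <- toy %>% count(host_id, host_name, sort = TRUE)
by_host
```

Counting by `host_id` is the safer way to find the host with the most listings.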
Question: Which region has the most high-priced listings?
I will use a density plot to describe the price distributions.
options(scipen = 88)
Plot_density <- airbnb %>%
  select(price, neighbourhood_group) %>%
  ggplot(aes(price, fill = neighbourhood_group, col = neighbourhood_group)) +
  # geom_density() has no 'round' argument, so it is dropped here
  geom_density(alpha = 0.5) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  # dashed red line marks the overall mean price
  geom_vline(aes(xintercept = mean(price, na.rm = TRUE)),
             color = "red", linetype = "dashed", size = 1) +
  theme(text = element_text(size = 12)) +
  scale_x_continuous(limits = c(0, 800), breaks = seq(0, 800, 100)) +
  labs(x = "Price", y = NULL)
Plot_density
Answer: From the graph above, Central Region has more of its listings at higher prices than the other regions. This is understandable because much of the Central Region is a tourist area, for example Singapore River, Marine Parade, Newton, Rochor, Novena and Kallang. In contrast, West Region prices are concentrated around SGD 50-100, and the same holds for North Region and North-East Region.
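The reading of the density plot can be backed with numbers. The sketch below uses toy prices (made up for illustration) and summarises each region by its median price and by the share of listings above SGD 200, a cut-off chosen arbitrarily here:

```r
library(dplyr)

# Toy data standing in for 'airbnb'
toy <- tibble(
  neighbourhood_group = c("Central Region", "Central Region", "Central Region",
                          "West Region", "West Region", "North Region"),
  price = c(120, 250, 400, 60, 90, 80)
)

price_by_region <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    listings       = n(),
    median_price   = median(price),
    share_over_200 = mean(price > 200)  # 200 is an arbitrary illustrative cut-off
  ) %>%
  arrange(desc(median_price))
price_by_region
```

Running the same pipeline on `airbnb` would show whether Central Region really has the larger share of expensive listings.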
Another question: which area (neighbourhood) in the Central Region has the highest typical rental price?
We filter by region (neighbourhood_group), group by area (neighbourhood) and take the median price of each area.
airbnb %>%
  filter(neighbourhood_group == "Central Region") %>%
  group_by(neighbourhood) %>%
  # median price and number of listings per neighbourhood
  summarise(Price_median = median(price),
            Freq = n()) %>%
  ggplot(aes(area = Price_median, fill = Freq, label = neighbourhood)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "topleft", reflow = TRUE) +
  labs(fill = "Frequency Airbnb")
Answer: Southern Islands has the highest median rental price in the Central Region. However, Southern Islands has relatively few listings (fewer than about 500).
Question: Which room type is most in demand in Singapore?
We will analyse this using availability_365, treated here as the number of unoccupied days in a year for a single listing on Airbnb in Singapore. First we add a new column, ‘priceSeg’, to the airbnb dataset, which divides price into 3 ranges: <=300, 300<x<500 and >=500.
This time I will use a box plot to illustrate. The code is shown below:
airbnb <- airbnb %>%
  mutate(priceSeg = case_when(
    price <= 300 ~ "Below SGD300",
    price > 300 & price < 500 ~ "SGD300-500",
    TRUE ~ "Above SGD500"  # everything else, i.e. price >= 500
  ))
Then we filter the dataset to Central Region listings with more than 50 reviews.
Plotbx <- airbnb %>%
  filter(neighbourhood_group == "Central Region", number_of_reviews > 50) %>%
  ggplot(aes(x = room_type, y = availability_365)) +
  # the 'text' aesthetic is presumably meant for interactive tooltips
  geom_jitter(aes(col = priceSeg, text = paste("Priceseg:", priceSeg, "<br>",
                                               "Room Type:", room_type, "<br>",
                                               "Unoccupied Days:", availability_365)), alpha = 0.5) +
  geom_boxplot(alpha = 0.3, fill = "yellow") +
  # red line: median availability over the whole dataset
  geom_hline(yintercept = median(airbnb$availability_365), color = "red", linetype = 5) +
  scale_y_continuous(breaks = seq(0, 360, 20)) +
  labs(x = NULL, y = "Days of Unoccupied", col = "Price Segment") +
  guides(fill = "none") +
  # apply the complete theme first so it does not reset the tweaks below
  theme_ipsum() +
  theme(legend.position = "top", plot.title = element_text(size = 12))
Plotbx
Answer: Based on the boxplot above, ‘Entire home/apt’ is the most in-demand listing type on Airbnb in Singapore, with the lowest typical number of unoccupied days, about 120 days a year. ‘Shared room’ listings in the Central Region are the least in demand, with the highest typical number of unoccupied days, around 350 days a year.
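The centre line of each box is the median, so the values read off the plot can also be computed directly. A sketch on toy data (the numbers are invented, chosen only to mirror the pattern described above):

```r
library(dplyr)

# Toy data standing in for the filtered Central Region listings
toy <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room",
                "Shared room", "Shared room"),
  availability_365 = c(90, 120, 200, 150, 300, 340, 360)
)

availability_by_type <- toy %>%
  group_by(room_type) %>%
  summarise(
    listings          = n(),
    median_unoccupied = median(availability_365),
    mean_unoccupied   = mean(availability_365)
  ) %>%
  arrange(median_unoccupied)  # most in demand (fewest unoccupied days) first
availability_by_type
```

Reporting the median next to the mean makes it clear which of the two the conclusion relies on.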
Question: Which area in the Central Region is most in demand?
As in the previous graph, we will use ‘availability_365’ to gauge how popular an area is on Airbnb in Singapore.
We create a new object named ‘ava’ that keeps one Central Region listing per distinct availability_365 value and then takes the 50 lowest values, which roughly corresponds to listings with fewer than 50 unoccupied days, i.e. the more popular ones.
ava <- airbnb %>%
  filter(neighbourhood_group == "Central Region") %>%
  distinct(availability_365, .keep_all = TRUE) %>%
  arrange(availability_365) %>%
  top_n(-50, availability_365)
Next, the graph:
plotAva <- ggplot(ava, aes(neighbourhood, availability_365)) +
  geom_point(alpha = 0.5, aes(size = price, col = room_type, text = paste("Area:", neighbourhood, "<br>",
                                                                          "Unoccupied:", availability_365, "<br>",
                                                                          "Price (SGD):", price, "<br>",
                                                                          "Room Type:", room_type))) +
  coord_flip() +
  labs(title = "Most Popular Area at Central Region in Singapore", subtitle = "less Unoccupied, more popular", x = NULL, y = "Unoccupied(days)", col = "Room Type") +
  guides(size = "none")
plotAva
Answer: After keeping only listings with fewer than 50 unoccupied days, we get several areas: Bishan, Bukit Merah, Bukit Timah, Downtown Core, Geylang, Kallang, Marine Parade, Newton, Novena, Outram, Queenstown, River Valley, Rochor and Tanglin.
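The chunk above keeps a single listing per distinct availability value. An alternative sketch (a swapped-in approach, not the method used in this post) filters every Central Region listing with fewer than 50 unoccupied days and counts them per neighbourhood, so areas with many such listings stand out. The toy rows are made up:

```r
library(dplyr)

# Toy data standing in for 'airbnb'
toy <- tibble(
  neighbourhood_group = c("Central Region", "Central Region", "Central Region",
                          "Central Region", "Central Region", "East Region"),
  neighbourhood    = c("Kallang", "Kallang", "Geylang", "Novena", "Outram", "Bedok"),
  availability_365 = c(0, 10, 49, 50, 300, 5)
)

popular_areas <- toy %>%
  filter(neighbourhood_group == "Central Region",
         availability_365 < 50) %>%   # strictly fewer than 50 unoccupied days
  count(neighbourhood, sort = TRUE, name = "listings")
popular_areas
```

This ranks areas by how many low-availability listings they have, instead of only listing which areas appear.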
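Question: Who are the top 10 hosts on Airbnb Singapore?
To answer this question, we can use the number_of_reviews column in the airbnb dataset.
We create a new object named ‘Host’, sort it in descending order and take only the top 10 based on number of reviews. Note that distinct() keeps only one listing per host, so the ranking reflects that listing's review count and not the host's total.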
Host <- airbnb %>%
  distinct(host_id, .keep_all = TRUE) %>%
  arrange(desc(number_of_reviews)) %>%
  top_n(10, number_of_reviews)
Next, create the plot:
plotHost <-
  ggplot(Host, aes(reorder(host_name, number_of_reviews), number_of_reviews)) +
  geom_col(aes(text = paste("Reviews:", number_of_reviews, "<br>",
                            "Host Name:", host_name, "<br>",
                            "Price (SGD):", price, "<br>",
                            "Type:", room_type), fill = room_type)) +
  facet_grid(rows = vars(neighbourhood_group), scales = "free_y") +
  geom_point(aes(col = price)) +
  labs(x = NULL, y = "Reviews") +
  coord_flip()
plotHost
Answer: The top 10 hosts by number of reviews are: Val, Yuan, Felix, Callie&Kel, Your home away from home, Cheryl, Anita&David, Shu, Shirley and Su. Most of them are in the Central Region and the rest are in the East Region. Only these 2 regions make it into the top 10; the other regions do not have as many reviews.
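Because `distinct(host_id, .keep_all = TRUE)` keeps just one listing per host, hosts with several listings are ranked by a single listing. A possible alternative, sketched here on toy data with invented review counts, sums the reviews over all of a host's listings before ranking:

```r
library(dplyr)

# Toy data: host 1 has two listings
toy <- tibble(
  host_id           = c(1, 1, 2, 3),
  host_name         = c("Val", "Val", "Yuan", "Felix"),
  number_of_reviews = c(100, 150, 200, 50)
)

top_hosts <- toy %>%
  group_by(host_id, host_name) %>%
  summarise(total_reviews = sum(number_of_reviews),
            listings      = n(),
            .groups       = "drop") %>%
  slice_max(total_reviews, n = 2)  # n = 10 on the real data
top_hosts
```

In this toy example Val ranks first with 250 reviews in total, even though Yuan owns the single most-reviewed listing.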
Question: Plot the spread of the listings in the Central Region using the leaflet function!
Create a new object named ‘peta’ that contains only Central Region listings.
After that, continue to leaflet.
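# 'peta' is described in the text but its chunk is not shown; reconstructed from that description
peta <- airbnb %>%
  filter(neighbourhood_group == "Central Region")
# custom marker icon, scaled to 35 x 35 pixels
Pic <- makeIcon(iconUrl = "images (1).png",
                iconWidth = 100*0.35,
                iconHeight = 100*0.35)
map <- leaflet()
map <- addTiles(map)
map <- addMarkers(map,
                  lng = peta$longitude,
                  lat = peta$latitude,
                  popup = peta$name,
                  clusterOptions = markerClusterOptions(),
                  icon = Pic)
map
Answer: The leaflet map above shows how the Airbnb listings are spread across the Central Region of Singapore.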