1 Intro

1.1 Greetings

Hi! Welcome to my Rmd.
This time I'm looking at a dataset from an external source (Kaggle). I hope you'll enjoy it.

1.2 Brief

This data describes Airbnb listing metrics for New York City, USA. The first thing I need to do is load all the packages that might be needed for this dataset.

## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## 
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
## 
##     scale_discrete_manual

2 Data Exploration

2.1 Data Input & Structure

First, we read the data into R and store it in an object named ‘airbnb’.
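The original chunk isn't shown, so here is a minimal sketch of the loading step. The file path in the comment is a placeholder, and a three-row inline CSV with a few of the real column names stands in for the Kaggle file so the snippet runs on its own. `stringsAsFactors = TRUE` is an assumption, based on the factor columns that appear in the structure output.

```r
# Placeholder path; point this at wherever the Kaggle CSV was saved:
# airbnb <- read.csv("path/to/airbnb.csv", stringsAsFactors = TRUE)

# Inline stand-in (illustrative rows only) so the sketch is runnable
csv_text <- "id,neighbourhood_group,room_type,price,last_review
2539,Brooklyn,Private room,149,2018-10-19
2595,Manhattan,Entire home/apt,225,2019-05-21
3647,Manhattan,Private room,150,"

airbnb <- read.csv(text = csv_text, stringsAsFactors = TRUE)

dim(airbnb)  # number of rows, then number of columns
```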

Then we inspect the data.

## [1] 48895    16

From the inspection above, we get a short description of the data: airbnb consists of 48,895 rows and 16 columns. Next, we need to check the data structure.

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

As we can see here, we only need to change the ‘last_review’ column to the Date type.
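One way this conversion could be done in base R is sketched below on a toy data frame. The column arrives as a factor whose first level is the empty string, so the sketch goes through `as.character()` and passes an explicit format. The empty strings then come back as `NA`, which is presumably why ‘last_review’ later reports as many missing values as ‘reviews_per_month’.

```r
# Toy stand-in: last_review is a factor, with "" for listings without reviews
airbnb <- data.frame(
  id          = c(2539L, 2595L, 3647L),
  last_review = factor(c("2018-10-19", "2019-05-21", ""))
)

# Convert via character; with an explicit format, entries that cannot be
# parsed (such as "") become NA
airbnb$last_review <- as.Date(as.character(airbnb$last_review),
                              format = "%Y-%m-%d")

str(airbnb$last_review)
```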

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date, format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

2.2 Missing Data

Find out whether the imported dataset has missing data.
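A check like this can be done with two base R calls, sketched here on a toy data frame: `anyNA()` answers whether anything is missing at all, and `colSums(is.na(...))` counts the missing values per column.

```r
# Toy stand-in with the same kind of gaps as the real data
airbnb <- data.frame(
  id                = c(2539L, 2595L, 3647L),
  last_review       = as.Date(c("2018-10-19", "2019-05-21", NA)),
  reviews_per_month = c(0.21, 0.38, NA)
)

anyNA(airbnb)           # is there at least one NA anywhere?
colSums(is.na(airbnb))  # number of NAs in each column
```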

## [1] TRUE
##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0

Oops! The airbnb data has NA values in “last_review” and “reviews_per_month” (10,052 each). We will delete all rows with missing values.
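A minimal sketch of the deletion, assuming `na.omit()` was the tool used (the original chunk isn't shown). The object name `airbnb_clean` is hypothetical and the data frame is a toy.

```r
# Toy stand-in: the third listing has no reviews, so reviews_per_month is NA
airbnb <- data.frame(
  id                = c(2539L, 2595L, 3647L, 3831L),
  number_of_reviews = c(9L, 45L, 0L, 270L),
  reviews_per_month = c(0.21, 0.38, NA, 4.64)
)

# Hypothetical object name; na.omit() drops every row with an NA in any column
airbnb_clean <- na.omit(airbnb)

dim(airbnb_clean)
anyNA(airbnb_clean)
```

Dropping rows this way also removes every listing that has never been reviewed, which is why the minimum of ‘number_of_reviews’ in the later summary is 1 rather than 0.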

## [1] 38843    16
##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                              0                              0 
## calculated_host_listings_count               availability_365 
##                              0                              0
## [1] FALSE

After the deletion, we are left with 38,843 rows and 16 columns. Don't forget to save the result.

2.3 Subsetting and Practical Statistics

In this step, I will delete the ‘latitude’ and ‘longitude’ columns.
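Dropping columns by name could look like the sketch below (toy data, hypothetical object name `airbnb_clean`). Selecting by name rather than by position keeps the code valid even if the column order changes.

```r
# Toy stand-in with the two coordinate columns
airbnb_clean <- data.frame(
  id        = c(2539L, 2595L),
  latitude  = c(40.6, 40.8),
  longitude = c(-74.0, -74.0),
  price     = c(149L, 225L)
)

drop_cols <- c("latitude", "longitude")

# Keep every column whose name is not in drop_cols
airbnb_clean <- airbnb_clean[, !(names(airbnb_clean) %in% drop_cols)]

names(airbnb_clean)
dim(airbnb_clean)
```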

##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "room_type"                      "price"                         
##  [9] "minimum_nights"                 "number_of_reviews"             
## [11] "last_review"                    "reviews_per_month"             
## [13] "calculated_host_listings_count" "availability_365"
## [1] 38843    14

Cool! Looking good so far. Let's check the statistical summary.

##        id                                         name      
##  Min.   :    2539   Home away from home             :   12  
##  1st Qu.: 8720027   Loft Suite @ The Box House Hotel:   11  
##  Median :18871455   Private Room                    :   10  
##  Mean   :18096462   Brooklyn Apartment              :    9  
##  3rd Qu.:27554820   Cozy Brooklyn Apartment         :    8  
##  Max.   :36455809   New york Multi-unit building    :    8  
##                     (Other)                         :38785  
##     host_id                 host_name        neighbourhood_group
##  Min.   :     2438   Michael     :  335   Bronx        :  876   
##  1st Qu.:  7033824   David       :  309   Brooklyn     :16447   
##  Median : 28371926   John        :  250   Manhattan    :16632   
##  Mean   : 64239145   Alex        :  229   Queens       : 4574   
##  3rd Qu.:101846466   Sonder (NYC):  207   Staten Island:  314   
##  Max.   :273841667   Sarah       :  179                         
##                      (Other)     :37334                         
##             neighbourhood             room_type         price        
##  Williamsburg      : 3163   Entire home/apt:20332   Min.   :    0.0  
##  Bedford-Stuyvesant: 3141   Private room   :17665   1st Qu.:   69.0  
##  Harlem            : 2206   Shared room    :  846   Median :  101.0  
##  Bushwick          : 1944                           Mean   :  142.3  
##  Hell's Kitchen    : 1532                           3rd Qu.:  170.0  
##  East Village      : 1490                           Max.   :10000.0  
##  (Other)           :25367                                            
##  minimum_nights     number_of_reviews  last_review        
##  Min.   :   1.000   Min.   :  1.0     Min.   :2011-03-28  
##  1st Qu.:   1.000   1st Qu.:  3.0     1st Qu.:2018-07-08  
##  Median :   2.000   Median :  9.0     Median :2019-05-19  
##  Mean   :   5.868   Mean   : 29.3     Mean   :2018-10-04  
##  3rd Qu.:   4.000   3rd Qu.: 33.0     3rd Qu.:2019-06-23  
##  Max.   :1250.000   Max.   :629.0     Max.   :2019-07-08  
##                                                           
##  reviews_per_month calculated_host_listings_count availability_365
##  Min.   : 0.010    Min.   :  1.000                Min.   :  0.0   
##  1st Qu.: 0.190    1st Qu.:  1.000                1st Qu.:  0.0   
##  Median : 0.720    Median :  1.000                Median : 55.0   
##  Mean   : 1.373    Mean   :  5.165                Mean   :114.9   
##  3rd Qu.: 2.020    3rd Qu.:  2.000                3rd Qu.:229.0   
##  Max.   :58.500    Max.   :327.000                Max.   :365.0   
## 
## 'data.frame':    38843 obs. of  14 variables:
##  $ id                            : int  2539 2595 3831 5022 5099 5121 5178 5203 5238 5295 ...
##  $ name                          : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 15702 19366 25001 8337 25048 15597 17682 5654 ...
##  $ host_id                       : int  2787 2845 4869 7192 7322 7356 8967 7490 7549 7702 ...
##  $ host_name                     : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 6264 5982 1970 3601 9699 6935 1264 6084 ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 2 3 3 2 3 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 42 62 138 14 96 203 36 203 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 1 1 1 2 2 2 1 1 ...
##  $ price                         : int  149 225 89 80 200 60 79 79 150 135 ...
##  $ minimum_nights                : int  1 1 1 10 3 45 2 2 1 5 ...
##  $ number_of_reviews             : int  9 45 270 9 74 49 430 118 160 53 ...
##  $ last_review                   : Date, format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num  0.21 0.38 4.64 0.1 0.59 0.4 3.47 0.99 1.33 0.43 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 4 1 ...
##  $ availability_365              : int  365 355 194 0 129 0 220 0 188 6 ...

From the summary above, we can conclude a few things:
1. There are 3 types of listing: Entire home/apt, Private room, and Shared room. Entire home/apt is the most common type (20,332 listings).
2. Prices range from 0 to 10,000 USD, with a mean of 142.3 USD and a median of 101 USD.
3. Manhattan is the neighbourhood group with the most listings (16,632), just ahead of Brooklyn (16,447).
4. The minimum stay ranges from 1 night to 1,250 nights (about 3.4 years), but the mean minimum stay is around 6 nights.
5. Michael is the most common host name (335 listings) in New York City.
6. “Home away from home” is the most common listing name (12 listings) in New York City in 2019.

3 Case Study

1. We will check the interaction between price and room type, overlaid with the average price.

I will use the function ‘scale_y_log10’ to make the IQR easier to interpret.
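The plotting chunk isn't shown, so this is only a sketch of how such a plot could be built with ggplot2, using a toy data frame (`listings`, with made-up prices). `scale_y_log10()` puts the y-axis on a log scale. Because log10(0) is -Inf, any listing priced at 0 produces the "infinite values" warning seen in the output.

```r
library(ggplot2)

# Toy stand-in for the cleaned listings (prices are illustrative only)
listings <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 4),
  price     = c(150, 225, 200, 1000, 60, 79, 89, 120, 30, 45, 50, 70)
)

# Overall average price, drawn as a horizontal reference line
avg_price <- mean(listings$price)

p_box <- ggplot(listings, aes(x = room_type, y = price)) +
  geom_boxplot(aes(fill = room_type), show.legend = FALSE) +
  geom_hline(yintercept = avg_price, linetype = "dashed", colour = "red") +
  scale_y_log10() +
  labs(x = "Room type", y = "Price (USD, log10 scale)")

p_box
```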

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing non-finite values (stat_boxplot).

Interpretations:
a. As we can see from the boxplot above, the ‘Entire home/apt’ listing type has the highest prices of all.
b. Private room comes second, and Shared room is the lowest, as expected.
c. The average price line only crosses the Entire home/apt box. More than half of the Entire home/apt prices are above the average price.
d. The price distributions of the Private room and Shared room types are far below the average price.

2. We want to find out the correlation between price and availability in a year. Does a cheap price make a listing the most in-demand property in New York City?

Plot the relationship between price and availability in a year using geom_point.
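A possible version of this scatter plot is sketched below on toy data. The `cor()` call is an addition that is not in the original analysis: it puts a number on the relationship, and the rank-based Spearman method is chosen here as an assumption because a handful of very expensive listings would otherwise dominate.

```r
library(ggplot2)

# Toy stand-in (values are illustrative only)
listings <- data.frame(
  price            = c(60, 80, 150, 225, 500, 1000),
  availability_365 = c(0, 120, 365, 200, 30, 50)
)

p_scatter <- ggplot(listings, aes(x = price, y = availability_365)) +
  geom_point(alpha = 0.4) +
  labs(x = "Price (USD)", y = "Days available per year")

# Spearman correlation works on ranks, so extreme prices carry less weight
rho <- cor(listings$price, listings$availability_365, method = "spearman")

p_scatter
rho
```

A value of `rho` near 0 would support the reading that price alone says little about availability.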

Interpretation:
A lower price does not guarantee that a listing will be rented more than a higher-priced one.
The graph above shows that some listings have a high price but low availability in a year. This suggests that some customers do not treat price as the most important factor when choosing a listing, although other customers do take price into account.
In the graph above, at a price of around 6000+ USD, the availability is around 50 days in a year. This suggests the property is popular to rent even though its price is higher than the others.

3. How does availability vary by room type and across the different neighbourhood groups?

I will use a violin plot, with room type on the x-axis and availability in a year on the y-axis.
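A sketch of such a violin plot on randomly generated toy data follows. Splitting the neighbourhood groups into panels with `facet_wrap()` is an assumption; the original chunk may have separated them differently.

```r
library(ggplot2)

set.seed(1)  # reproducible toy data

# 10 listings per room type in each of two neighbourhood groups
listings <- data.frame(
  neighbourhood_group = rep(c("Brooklyn", "Manhattan"), each = 30),
  room_type           = rep(c("Entire home/apt", "Private room", "Shared room"),
                            times = 20),
  availability_365    = sample(0:365, 60, replace = TRUE)
)

p_violin <- ggplot(listings,
                   aes(x = room_type, y = availability_365, fill = room_type)) +
  geom_violin(show.legend = FALSE) +
  facet_wrap(~ neighbourhood_group) +  # one panel per neighbourhood group
  labs(x = "Room type", y = "Days available per year")

p_violin
```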

Interpretation:
a. We might say that lower availability suggests a listing is more popular.
b. ‘Shared room’ is most popular in the ‘Staten Island’ area compared to the other neighbourhood groups; the others are at roughly the same level.
c. In ‘Brooklyn’ and ‘Manhattan’, the ‘Private room’ type is almost equally popular as an option to stay.
d. Similarly, in ‘Brooklyn’ and ‘Manhattan’, the ‘Entire home/apt’ type is almost equally popular, and more popular than in other areas.

4. Show the price distribution of each neighbourhood group for the Entire home/apt type only!

Let's focus on “Entire home/apt”.
Create a new object that only contains “Entire home/apt” listings, named ‘era’.

## [1] 20332    14

After that, group the prices into the ranges <=300, 300<x<=500, 500<x<1000, and x>=1000, and store the result in a new column of the era object named ‘priceC’.
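The filtering and binning could be sketched with base R as follows. The data are a toy and the bin labels are hypothetical. Assigning a price of exactly 300 to the lowest bin is an assumption, as is relying on prices being whole numbers so that a break at 999 sends 1000 into the top bin.

```r
# Toy stand-in (hypothetical object name airbnb_clean)
airbnb_clean <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Entire home/apt",
                "Entire home/apt", "Entire home/apt"),
  price     = c(225L, 60L, 300L, 750L, 1000L)
)

# Keep only the Entire home/apt listings
era <- airbnb_clean[airbnb_clean$room_type == "Entire home/apt", ]

# cut() uses right-closed intervals (a, b] by default. Prices are integers,
# so breaking at 999 yields the bins <=300, 301-500, 501-999, >=1000.
era$priceC <- cut(
  era$price,
  breaks = c(-Inf, 300, 500, 999, Inf),
  labels = c("<=300", "300-500", "500-1000", ">=1000")
)

dim(era)
table(era$priceC)
```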

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Interpretations:
a. Manhattan has the widest price distribution for the “Entire home/apt” type. Most prices are below 300 USD, but some are above 1000 USD.
b. Brooklyn is second, similar to Manhattan but with fewer listings in the above-1000 USD segment.
c. Queens is in third place.
d. In the Bronx and Staten Island, we don't find any price above 1000 USD, and only a few listings are priced between 500 and 1000 USD. This means that in both areas most prices are below 300 USD.

5. Show the average minimum nights and number of reviews for each neighbourhood group within Entire home/apt listings!
Make a new data frame with the average number of reviews and average minimum nights for each neighbourhood group, named ‘eraRN’.

## [1] 5 3

Gather both values (reviews and minimum nights) into one column, named ‘variable’, using the function ‘gather’.
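The two steps, averaging per neighbourhood group and then reshaping to long format, might look like the sketch below on toy data. Using `aggregate()` for the averages is an assumption. In the `gather()` call, ‘variable’ is taken to be the column holding the measure names, and the column name `value` and the object name `eraRN_long` are hypothetical.

```r
library(tidyr)

# Toy stand-in for the Entire home/apt subset
era <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Manhattan", "Manhattan"),
  minimum_nights      = c(1, 3, 10, 30),
  number_of_reviews   = c(40, 20, 5, 15)
)

# One row per neighbourhood group, one column per averaged measure
eraRN <- aggregate(
  cbind(minimum_nights, number_of_reviews) ~ neighbourhood_group,
  data = era, FUN = mean
)

# Long format: measure names go to `variable`, their averages to `value`
eraRN_long <- gather(eraRN, key = "variable", value = "value",
                     minimum_nights, number_of_reviews)

dim(eraRN)
eraRN_long
```

The long format puts both measures in one column, so a single plot can map `variable` to colour or to facets. In current tidyr, `pivot_longer()` is the recommended successor to `gather()`.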

Interpretation:
a. Staten Island has the lowest average minimum nights for Entire home/apt listings but the highest average number of reviews. This may happen when people make short stays and write a review afterwards. In other words, guests in Staten Island change often because they come for a vacation or a short stay rather than staying many nights.
b. Manhattan has the highest average minimum nights for the Entire home/apt type but the lowest average number of reviews. This is understandable if people who choose Entire home/apt listings in Manhattan mostly stay for months or years in the same listing. With less guest turnover, the average number of reviews is lower than elsewhere.

6. We want to know the correlation between number of reviews and price in the Brooklyn and Manhattan areas only.

First, we need to keep only the Manhattan and Brooklyn neighbourhood groups and store them in a new object named ‘airbnbM_B’.
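A base R sketch of this filter on toy data follows. The `droplevels()` call is an addition: it removes the factor levels of the boroughs that were filtered out, so they do not linger in tables or plot legends.

```r
# Toy stand-in (hypothetical object name airbnb_clean)
airbnb_clean <- data.frame(
  neighbourhood_group = factor(c("Brooklyn", "Manhattan", "Queens",
                                 "Bronx", "Manhattan")),
  price               = c(149, 225, 80, 60, 150),
  number_of_reviews   = c(9, 45, 74, 12, 160)
)

# TRUE for rows in either of the two boroughs
keep <- airbnb_clean$neighbourhood_group %in% c("Manhattan", "Brooklyn")

# Subset the rows, then drop the now-empty factor levels
airbnbM_B <- droplevels(airbnb_clean[keep, ])

table(airbnbM_B$neighbourhood_group)
```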

Then we create the graph.

## Warning: Transformation introduced infinite values in continuous x-axis

Interpretations:
Both areas (Manhattan and Brooklyn) show almost the same pattern of price and number of reviews.
Manhattan has slightly more reviews than Brooklyn.

7. We want to see how long people stay each year in the Manhattan and Brooklyn areas. Does it change significantly from year to year?

We already have an object that includes only the Manhattan and Brooklyn areas from the previous graph, named ‘airbnbM_B’, so all we need to do next is extract the year from the ‘last_review’ column.
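Extracting the year could be done as in this sketch (toy dates; the new column name `year` is hypothetical). The result is stored as a factor, matching the `Levels:` line in the output, so each year can act as its own group in a plot.

```r
# Toy stand-in for the Manhattan and Brooklyn subset
airbnbM_B <- data.frame(
  last_review = as.Date(c("2017-10-05", "2015-06-01", "2019-07-08"))
)

# format() returns the four-digit year as text; factor() turns it into groups
airbnbM_B$year <- factor(format(airbnbM_B$last_review, "%Y"))

head(airbnbM_B$year)
```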

## [1] 2017 2015 2016 2018 2016 2019
## Levels: 2011 2012 2013 2014 2015 2016 2017 2018 2019

Then we create the graph.

## Picking joint bandwidth of 1.24
## Warning: Removed 86 rows containing non-finite values
## (stat_density_ridges).

Interpretations:
In the Brooklyn and Manhattan areas, the number of nights stayed is almost the same throughout 2011-2019, except in 2012-2013, which show slightly fewer nights. We also found another peak at 30 nights.

8. We want to know which host has the highest number of reviews.

We subset the data by host_id and order it from highest to lowest number of reviews.
We only take the top 30 by number of reviews.
We name the result the ‘host’ object.
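The three steps above can be sketched in base R as follows, on a toy data frame with made-up hosts. Which columns the original kept alongside ‘host_id’ is not shown, so the selection here is an assumption.

```r
# Toy stand-in (hypothetical object name and host values)
airbnb_clean <- data.frame(
  host_id             = c(11L, 22L, 33L, 44L),
  host_name           = c("Host A", "Host B", "Host C", "Host D"),
  neighbourhood_group = c("Queens", "Manhattan", "Manhattan", "Brooklyn"),
  price               = c(47L, 49L, 99L, 120L),
  number_of_reviews   = c(9L, 430L, 270L, 45L)
)

# Row order from most to fewest reviews
ord <- order(airbnb_clean$number_of_reviews, decreasing = TRUE)

# head() returns at most 30 rows, or fewer if the data are smaller
host <- head(airbnb_clean[ord, ], 30)

host[, c("host_id", "host_name", "number_of_reviews")]
```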

Interpretations:
a. Staten Island and the Bronx are not included in the top 30 by number of reviews.
b. ‘Dona’ from Queens is the highest, with more than 600 reviews and a listing price below 100 USD.
c. Second place is occupied by ‘Jj’ in Manhattan, with 594 reviews and a price below 100 USD.
d. Third place is occupied by ‘Carol’, also from Manhattan, with 540 reviews and a price below 100 USD.
e. The listing priced at around 500 USD is located in Manhattan, with host name ‘John’ and 447 reviews.

4 Final Conclusion

From all the graphs above, we can make some tentative conclusions, such as:
1. The Entire home/apt listing type has relatively the highest prices compared to other listing types.
2. A lower price does not guarantee that a listing will be rented more than a higher-priced one. People mostly choose a property based on their needs, and price is not the only consideration.
3. Different places show different needs. In Staten Island, shared rooms are more popular than the other types. On the other hand, in Brooklyn and Manhattan, the Entire home/apt and Private room types are more in demand than Shared room.
4. Some people spend more than 1000 USD on Entire home/apt rent in the Brooklyn and Manhattan areas.