Introduction

This challenge focuses on visualizing the attributes of the data, mainly through univariate and bivariate visualizations.

Dataset

We first load the necessary libraries.

library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 5
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

We then read the data.

data <- read_csv("AB_NYC_2019.csv", show_col_types = FALSE)

We can view the first few rows of the data to see the various attributes.

head(data)
## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

This data describes Airbnb listings across the boroughs of New York City.

The description of each column is as follows:

  1. id - Unique identifier for each listing.

  2. name - Name of the listing.

  3. host_id - Identifier of the host of the property.

  4. host_name - Name of the host.

  5. neighbourhood_group - Borough where the property is located.

  6. neighbourhood - Specific neighborhood of the borough.

  7. latitude - Latitude of the listing's location.

  8. longitude - Longitude of the listing's location.

  9. room_type - Type of room offered.

  10. price - Price of the listing per night.

  11. minimum_nights - Minimum number of nights required for booking.

  12. number_of_reviews - Total number of reviews received by the listing.

  13. last_review - Date of the last review.

  14. reviews_per_month - Average number of reviews received per month.

  15. calculated_host_listings_count - Count of properties listed by the host.

  16. availability_365 - Number of days in the year the listing is available for booking.

Reading the Data

We can get the dimensions of the data.

dim(data)
## [1] 48895    16

This shows that there are 48,895 listings (one per row) with 16 different attributes.

The str() function gives an overview of the structure of the data.
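Whether each row really is a distinct listing can be checked by comparing the row count with the number of distinct id values. The following is a minimal sketch that uses a small stand-in data frame named `listings` (a hypothetical name, holding only the first few rows of the full data):

```r
library(dplyr)

# Small stand-in for the full data set; only the first few rows are used here.
listings <- data.frame(
  id = c(2539, 2595, 3647, 3831),
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan", "Brooklyn")
)

n_rows <- nrow(listings)            # number of listings (rows)
n_cols <- ncol(listings)            # number of attributes (columns)
n_ids  <- n_distinct(listings$id)   # number of distinct listing ids

# TRUE when no id is repeated, i.e. each row is a unique listing
ids_are_unique <- n_ids == n_rows
ids_are_unique
```

The same comparison on the full data would use `data` in place of `listings`.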

str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr [1:48895] "19-10-2018" "21-05-2019" NA "05-07-2019" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_character(),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
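The structure shows that last_review is stored as text in day-month-year form rather than as a date. If date arithmetic or time-based plots are needed, one possible approach (a sketch, not a step taken in this analysis) is to convert it with base R's as.Date(). The vector below reuses the first few values printed by str(); the name `last_review_date` is an assumption made for illustration.

```r
# First few last_review values as printed by str() above
last_review <- c("19-10-2018", "21-05-2019", NA, "05-07-2019")

# Convert day-month-year text to Date; missing entries stay NA
last_review_date <- as.Date(last_review, format = "%d-%m-%Y")

class(last_review_date)
range(last_review_date, na.rm = TRUE)
```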

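We can also get summary statistics for the data. For numeric columns this gives the quartiles, mean, and count of missing values; for character columns it reports only the length and class.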

summary(data)
##        id               name              host_id           host_name        
##  Min.   :    2539   Length:48895       Min.   :     2438   Length:48895      
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character  
##  Median :19677284   Mode  :character   Median : 30793816   Mode  :character  
##  Mean   :19017143                      Mean   : 67620011                     
##  3rd Qu.:29152178                      3rd Qu.:107434423                     
##  Max.   :36487245                      Max.   :274321313                     
##                                                                              
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:48895        Length:48895       Min.   :40.50   Min.   :-74.24  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.96  
##                                         Mean   :40.73   Mean   :-73.95  
##                                         3rd Qu.:40.76   3rd Qu.:-73.94  
##                                         Max.   :40.91   Max.   :-73.71  
##                                                                         
##   room_type             price         minimum_nights    number_of_reviews
##  Length:48895       Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Mode  :character   Median :  106.0   Median :   3.00   Median :  5.00   
##                     Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##                     3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##                     Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##                                                                          
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:48895       Min.   : 0.010    Min.   :  1.000               
##  Class :character   1st Qu.: 0.190    1st Qu.:  1.000               
##  Mode  :character   Median : 0.720    Median :  1.000               
##                     Mean   : 1.373    Mean   :  7.144               
##                     3rd Qu.: 2.020    3rd Qu.:  2.000               
##                     Max.   :58.500    Max.   :327.000               
##                     NA's   :10052                                   
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median : 45.0   
##  Mean   :112.8   
##  3rd Qu.:227.0   
##  Max.   :365.0   
## 
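Since summary() is only informative for numeric columns, it can help to restrict it to those columns and to count missing values separately. The following is a minimal sketch on a small stand-in data frame (`listings` is a hypothetical name, and its three rows are illustrative only), assuming dplyr 1.0.0 or later for where():

```r
library(dplyr)

# Small stand-in for the full data set; values are illustrative only.
listings <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price = c(149, 225, 150),
  reviews_per_month = c(0.21, 0.38, NA)
)

# Keep only the numeric columns before summarising
numeric_cols <- select(listings, where(is.numeric))
summary(numeric_cols)

# Count missing values in every column
na_counts <- colSums(is.na(listings))
na_counts
```

On the full data, the same missing-value count would make the 10,052 NA entries in reviews_per_month visible without scanning the summary output.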

We can select one or more columns to inspect their values. For instance, we can select the neighbourhood column.

select(data, "neighbourhood")
## # A tibble: 48,895 × 1
##    neighbourhood     
##    <chr>             
##  1 Kensington        
##  2 Midtown           
##  3 Harlem            
##  4 Clinton Hill      
##  5 East Harlem       
##  6 Murray Hill       
##  7 Bedford-Stuyvesant
##  8 Hell's Kitchen    
##  9 Upper West Side   
## 10 Chinatown         
## # ℹ 48,885 more rows

We can count the number of listings in each neighborhood.

neighborhood <- select(data, "neighbourhood")
table(neighborhood)
## neighbourhood
##                   Allerton              Arden Heights 
##                         42                          4 
##                   Arrochar                    Arverne 
##                         21                         77 
##                    Astoria                 Bath Beach 
##                        900                         17 
##          Battery Park City                  Bay Ridge 
##                         70                        141 
##                Bay Terrace Bay Terrace, Staten Island 
##                          6                          2 
##                 Baychester                    Bayside 
##                          7                         39 
##                  Bayswater         Bedford-Stuyvesant 
##                         17                       3714 
##               Belle Harbor                  Bellerose 
##                          8                         14 
##                    Belmont                Bensonhurst 
##                         24                         75 
##               Bergen Beach                Boerum Hill 
##                         10                        177 
##               Borough Park               Breezy Point 
##                        136                          3 
##                  Briarwood             Brighton Beach 
##                         56                         75 
##                  Bronxdale           Brooklyn Heights 
##                         19                        154 
##                Brownsville                Bull's Head 
##                         61                          6 
##                   Bushwick            Cambria Heights 
##                       2465                         26 
##                   Canarsie            Carroll Gardens 
##                        147                        233 
##                Castle Hill          Castleton Corners 
##                          9                          4 
##                    Chelsea                  Chinatown 
##                       1113                        368 
##                City Island               Civic Center 
##                         18                         52 
##          Claremont Village               Clason Point 
##                         28                         21 
##                    Clifton               Clinton Hill 
##                         15                        572 
##                 Co-op City                Cobble Hill 
##                          2                         99 
##              College Point                Columbia St 
##                         19                         42 
##                    Concord                  Concourse 
##                         26                         50 
##          Concourse Village               Coney Island 
##                         32                         17 
##                     Corona              Crown Heights 
##                         64                       1564 
##              Cypress Hills           Ditmars Steinway 
##                        135                        309 
##               Dongan Hills                 Douglaston 
##                          7                          8 
##          Downtown Brooklyn                      DUMBO 
##                         83                         36 
##              Dyker Heights              East Elmhurst 
##                         12                        185 
##              East Flatbush                East Harlem 
##                        500                       1117 
##            East Morrisania              East New York 
##                         10                        218 
##               East Village                Eastchester 
##                       1853                         13 
##                   Edenwald                   Edgemere 
##                         13                         11 
##                   Elmhurst                Eltingville 
##                        237                          3 
##               Emerson Hill               Far Rockaway 
##                          5                         29 
##                  Fieldston         Financial District 
##                         12                        744 
##                   Flatbush          Flatiron District 
##                        621                         80 
##                  Flatlands                   Flushing 
##                         83                        426 
##                    Fordham               Forest Hills 
##                         63                        144 
##                Fort Greene              Fort Hamilton 
##                        489                         55 
##             Fort Wadsworth              Fresh Meadows 
##                          1                         32 
##                   Glendale                    Gowanus 
##                         54                        247 
##                   Gramercy               Graniteville 
##                        338                          3 
##                 Grant City                  Gravesend 
##                          6                         68 
##                Great Kills                 Greenpoint 
##                         10                       1115 
##          Greenwich Village                Grymes Hill 
##                        392                          7 
##                     Harlem             Hell's Kitchen 
##                       2658                       1958 
##                 Highbridge                     Hollis 
##                         27                         14 
##                 Holliswood               Howard Beach 
##                          4                         20 
##               Howland Hook                   Huguenot 
##                          2                          3 
##                Hunts Point                     Inwood 
##                         18                        252 
##            Jackson Heights                    Jamaica 
##                        186                        231 
##            Jamaica Estates              Jamaica Hills 
##                         19                          8 
##                 Kensington                Kew Gardens 
##                        175                         32 
##          Kew Gardens Hills                Kingsbridge 
##                         26                         70 
##                   Kips Bay                  Laurelton 
##                        470                         18 
##            Lighthouse Hill               Little Italy 
##                          2                        121 
##                Little Neck           Long Island City 
##                          5                        537 
##                   Longwood            Lower East Side 
##                         62                        911 
##            Manhattan Beach                Marble Hill 
##                          8                         12 
##            Mariners Harbor                    Maspeth 
##                          8                        110 
##                    Melrose             Middle Village 
##                         10                         31 
##              Midland Beach                    Midtown 
##                          6                       1545 
##                    Midwood                 Mill Basin 
##                        109                          4 
##        Morningside Heights             Morris Heights 
##                        346                         17 
##                Morris Park                 Morrisania 
##                         15                         18 
##                 Mott Haven                 Mount Eden 
##                         60                          6 
##                 Mount Hope                Murray Hill 
##                         20                        485 
##                  Navy Yard                   Neponsit 
##                         14                          3 
##               New Brighton                   New Dorp 
##                          5                          1 
##             New Dorp Beach            New Springville 
##                          5                          8 
##                       NoHo                     Nolita 
##                         78                        253 
##            North Riverdale                    Norwood 
##                         10                         31 
##                    Oakwood                  Olinville 
##                          5                          4 
##                 Ozone Park                 Park Slope 
##                         62                        506 
##                Parkchester                 Pelham Bay 
##                         39                         17 
##             Pelham Gardens                Port Morris 
##                         28                         46 
##              Port Richmond               Prince's Bay 
##                          9                          4 
##  Prospect-Lefferts Gardens           Prospect Heights 
##                        535                        357 
##             Queens Village              Randall Manor 
##                         60                         19 
##                   Red Hook                  Rego Park 
##                         79                        106 
##              Richmond Hill               Richmondtown 
##                         94                          1 
##                  Ridgewood                  Riverdale 
##                        423                         11 
##             Rockaway Beach           Roosevelt Island 
##                         56                         77 
##                   Rosebank                   Rosedale 
##                          7                         59 
##                  Rossville              Schuylerville 
##                          1                         13 
##                   Sea Gate             Sheepshead Bay 
##                          7                        164 
##                Shore Acres                Silver Lake 
##                          7                          2 
##                       SoHo                  Soundview 
##                        358                         15 
##                South Beach           South Ozone Park 
##                          8                         40 
##                South Slope        Springfield Gardens 
##                        284                         85 
##             Spuyten Duyvil                 St. Albans 
##                          4                         76 
##                 St. George                  Stapleton 
##                         48                         27 
##            Stuyvesant Town                  Sunnyside 
##                         37                        363 
##                Sunset Park           Theater District 
##                        390                        288 
##                Throgs Neck                  Todt Hill 
##                         24                          4 
##              Tompkinsville                Tottenville 
##                         42                          7 
##                    Tremont                    Tribeca 
##                         11                        177 
##                Two Bridges                  Unionport 
##                         72                          7 
##         University Heights            Upper East Side 
##                         21                       1798 
##            Upper West Side                   Van Nest 
##                       1971                         11 
##               Vinegar Hill                  Wakefield 
##                         34                         50 
##         Washington Heights              West Brighton 
##                        899                         18 
##                 West Farms               West Village 
##                          2                        768 
##         Westchester Square                Westerleigh 
##                         10                          2 
##                 Whitestone             Williamsbridge 
##                         11                         40 
##               Williamsburg                Willowbrook 
##                       3920                          1 
##            Windsor Terrace                  Woodhaven 
##                        157                         88 
##                   Woodlawn                    Woodrow 
##                         11                          1 
##                   Woodside 
##                        235

The counts show that some neighborhoods have far more listings than others. Possible reasons include the neighborhood's size, population, proximity to tourist attractions, and wealth.

Proportions give a clearer picture of each neighborhood's share of the listings.
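An alphabetical table makes it hard to spot the largest neighborhoods. One alternative, sketched below with dplyr, is to sort the counts in descending order and keep only the top few. The data frame `listings` and its rows are hypothetical and used only for illustration; slice_head() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Small stand-in for the full data set; values are illustrative only.
listings <- data.frame(
  neighbourhood = c("Harlem", "Midtown", "Harlem",
                    "Kensington", "Harlem", "Midtown")
)

# Count listings per neighbourhood, most frequent first
neighbourhood_counts <- count(listings, neighbourhood, sort = TRUE)

# Keep the top two rows; on the full data a larger n (such as 10) may be more useful
top_neighbourhoods <- slice_head(neighbourhood_counts, n = 2)
top_neighbourhoods
```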

prop.table(table(neighborhood))
## neighbourhood
##                   Allerton              Arden Heights 
##               8.589835e-04               8.180796e-05 
##                   Arrochar                    Arverne 
##               4.294918e-04               1.574803e-03 
##                    Astoria                 Bath Beach 
##               1.840679e-02               3.476838e-04 
##          Battery Park City                  Bay Ridge 
##               1.431639e-03               2.883730e-03 
##                Bay Terrace Bay Terrace, Staten Island 
##               1.227119e-04               4.090398e-05 
##                 Baychester                    Bayside 
##               1.431639e-04               7.976276e-04 
##                  Bayswater         Bedford-Stuyvesant 
##               3.476838e-04               7.595869e-02 
##               Belle Harbor                  Bellerose 
##               1.636159e-04               2.863278e-04 
##                    Belmont                Bensonhurst 
##               4.908477e-04               1.533899e-03 
##               Bergen Beach                Boerum Hill 
##               2.045199e-04               3.620002e-03 
##               Borough Park               Breezy Point 
##               2.781470e-03               6.135597e-05 
##                  Briarwood             Brighton Beach 
##               1.145311e-03               1.533899e-03 
##                  Bronxdale           Brooklyn Heights 
##               3.885878e-04               3.149606e-03 
##                Brownsville                Bull's Head 
##               1.247571e-03               1.227119e-04 
##                   Bushwick            Cambria Heights 
##               5.041415e-02               5.317517e-04 
##                   Canarsie            Carroll Gardens 
##               3.006442e-03               4.765313e-03 
##                Castle Hill          Castleton Corners 
##               1.840679e-04               8.180796e-05 
##                    Chelsea                  Chinatown 
##               2.276306e-02               7.526332e-03 
##                City Island               Civic Center 
##               3.681358e-04               1.063503e-03 
##          Claremont Village               Clason Point 
##               5.726557e-04               4.294918e-04 
##                    Clifton               Clinton Hill 
##               3.067798e-04               1.169854e-02 
##                 Co-op City                Cobble Hill 
##               4.090398e-05               2.024747e-03 
##              College Point                Columbia St 
##               3.885878e-04               8.589835e-04 
##                    Concord                  Concourse 
##               5.317517e-04               1.022599e-03 
##          Concourse Village               Coney Island 
##               6.544636e-04               3.476838e-04 
##                     Corona              Crown Heights 
##               1.308927e-03               3.198691e-02 
##              Cypress Hills           Ditmars Steinway 
##               2.761019e-03               6.319665e-03 
##               Dongan Hills                 Douglaston 
##               1.431639e-04               1.636159e-04 
##          Downtown Brooklyn                      DUMBO 
##               1.697515e-03               7.362716e-04 
##              Dyker Heights              East Elmhurst 
##               2.454239e-04               3.783618e-03 
##              East Flatbush                East Harlem 
##               1.022599e-02               2.284487e-02 
##            East Morrisania              East New York 
##               2.045199e-04               4.458534e-03 
##               East Village                Eastchester 
##               3.789754e-02               2.658759e-04 
##                   Edenwald                   Edgemere 
##               2.658759e-04               2.249719e-04 
##                   Elmhurst                Eltingville 
##               4.847121e-03               6.135597e-05 
##               Emerson Hill               Far Rockaway 
##               1.022599e-04               5.931077e-04 
##                  Fieldston         Financial District 
##               2.454239e-04               1.521628e-02 
##                   Flatbush          Flatiron District 
##               1.270069e-02               1.636159e-03 
##                  Flatlands                   Flushing 
##               1.697515e-03               8.712547e-03 
##                    Fordham               Forest Hills 
##               1.288475e-03               2.945086e-03 
##                Fort Greene              Fort Hamilton 
##               1.000102e-02               1.124859e-03 
##             Fort Wadsworth              Fresh Meadows 
##               2.045199e-05               6.544636e-04 
##                   Glendale                    Gowanus 
##               1.104407e-03               5.051641e-03 
##                   Gramercy               Graniteville 
##               6.912772e-03               6.135597e-05 
##                 Grant City                  Gravesend 
##               1.227119e-04               1.390735e-03 
##                Great Kills                 Greenpoint 
##               2.045199e-04               2.280397e-02 
##          Greenwich Village                Grymes Hill 
##               8.017180e-03               1.431639e-04 
##                     Harlem             Hell's Kitchen 
##               5.436139e-02               4.004499e-02 
##                 Highbridge                     Hollis 
##               5.522037e-04               2.863278e-04 
##                 Holliswood               Howard Beach 
##               8.180796e-05               4.090398e-04 
##               Howland Hook                   Huguenot 
##               4.090398e-05               6.135597e-05 
##                Hunts Point                     Inwood 
##               3.681358e-04               5.153901e-03 
##            Jackson Heights                    Jamaica 
##               3.804070e-03               4.724409e-03 
##            Jamaica Estates              Jamaica Hills 
##               3.885878e-04               1.636159e-04 
##                 Kensington                Kew Gardens 
##               3.579098e-03               6.544636e-04 
##          Kew Gardens Hills                Kingsbridge 
##               5.317517e-04               1.431639e-03 
##                   Kips Bay                  Laurelton 
##               9.612435e-03               3.681358e-04 
##            Lighthouse Hill               Little Italy 
##               4.090398e-05               2.474691e-03 
##                Little Neck           Long Island City 
##               1.022599e-04               1.098272e-02 
##                   Longwood            Lower East Side 
##               1.268023e-03               1.863176e-02 
##            Manhattan Beach                Marble Hill 
##               1.636159e-04               2.454239e-04 
##            Mariners Harbor                    Maspeth 
##               1.636159e-04               2.249719e-03 
##                    Melrose             Middle Village 
##               2.045199e-04               6.340117e-04 
##              Midland Beach                    Midtown 
##               1.227119e-04               3.159832e-02 
##                    Midwood                 Mill Basin 
##               2.229267e-03               8.180796e-05 
##        Morningside Heights             Morris Heights 
##               7.076388e-03               3.476838e-04 
##                Morris Park                 Morrisania 
##               3.067798e-04               3.681358e-04 
##                 Mott Haven                 Mount Eden 
##               1.227119e-03               1.227119e-04 
##                 Mount Hope                Murray Hill 
##               4.090398e-04               9.919215e-03 
##                  Navy Yard                   Neponsit 
##               2.863278e-04               6.135597e-05 
##               New Brighton                   New Dorp 
##               1.022599e-04               2.045199e-05 
##             New Dorp Beach            New Springville 
##               1.022599e-04               1.636159e-04 
##                       NoHo                     Nolita 
##               1.595255e-03               5.174353e-03 
##            North Riverdale                    Norwood 
##               2.045199e-04               6.340117e-04 
##                    Oakwood                  Olinville 
##               1.022599e-04               8.180796e-05 
##                 Ozone Park                 Park Slope 
##               1.268023e-03               1.034871e-02 
##                Parkchester                 Pelham Bay 
##               7.976276e-04               3.476838e-04 
##             Pelham Gardens                Port Morris 
##               5.726557e-04               9.407915e-04 
##              Port Richmond               Prince's Bay 
##               1.840679e-04               8.180796e-05 
##  Prospect-Lefferts Gardens           Prospect Heights 
##               1.094181e-02               7.301360e-03 
##             Queens Village              Randall Manor 
##               1.227119e-03               3.885878e-04 
##                   Red Hook                  Rego Park 
##               1.615707e-03               2.167911e-03 
##              Richmond Hill               Richmondtown 
##               1.922487e-03               2.045199e-05 
##                  Ridgewood                  Riverdale 
##               8.651191e-03               2.249719e-04 
##             Rockaway Beach           Roosevelt Island 
##               1.145311e-03               1.574803e-03 
##                   Rosebank                   Rosedale 
##               1.431639e-04               1.206667e-03 
##                  Rossville              Schuylerville 
##               2.045199e-05               2.658759e-04 
##                   Sea Gate             Sheepshead Bay 
##               1.431639e-04               3.354126e-03 
##                Shore Acres                Silver Lake 
##               1.431639e-04               4.090398e-05 
##                       SoHo                  Soundview 
##               7.321812e-03               3.067798e-04 
##                South Beach           South Ozone Park 
##               1.636159e-04               8.180796e-04 
##                South Slope        Springfield Gardens 
##               5.808365e-03               1.738419e-03 
##             Spuyten Duyvil                 St. Albans 
##               8.180796e-05               1.554351e-03 
##                 St. George                  Stapleton 
##               9.816955e-04               5.522037e-04 
##            Stuyvesant Town                  Sunnyside 
##               7.567236e-04               7.424072e-03 
##                Sunset Park           Theater District 
##               7.976276e-03               5.890173e-03 
##                Throgs Neck                  Todt Hill 
##               4.908477e-04               8.180796e-05 
##              Tompkinsville                Tottenville 
##               8.589835e-04               1.431639e-04 
##                    Tremont                    Tribeca 
##               2.249719e-04               3.620002e-03 
##                Two Bridges                  Unionport 
##               1.472543e-03               1.431639e-04 
##         University Heights            Upper East Side 
##               4.294918e-04               3.677268e-02 
##            Upper West Side                   Van Nest 
##               4.031087e-02               2.249719e-04 
##               Vinegar Hill                  Wakefield 
##               6.953676e-04               1.022599e-03 
##         Washington Heights              West Brighton 
##               1.838634e-02               3.681358e-04 
##                 West Farms               West Village 
##               4.090398e-05               1.570713e-02 
##         Westchester Square                Westerleigh 
##               2.045199e-04               4.090398e-05 
##                 Whitestone             Williamsbridge 
##               2.249719e-04               8.180796e-04 
##               Williamsburg                Willowbrook 
##               8.017180e-02               2.045199e-05 
##            Windsor Terrace                  Woodhaven 
##               3.210962e-03               1.799775e-03 
##                   Woodlawn                    Woodrow 
##               2.249719e-04               2.045199e-05 
##                   Woodside 
##               4.806217e-03

These proportions make it easier to compare the share of listings across neighbourhoods.

Tidying the Data

To tidy the data, we first check each column for missing values.

colSums(is.na(data))
##                             id                           name 
##                              0                             16 
##                        host_id                      host_name 
##                              0                             21 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                          10052                          10052 
## calculated_host_listings_count               availability_365 
##                              0                              0

We see that some listing names and host names are missing. A sizeable number of values (10,052 each) are also missing in last_review and reviews_per_month. These two columns are usually empty when a listing has not received a single review.

We can replace the missing values with sensible placeholders.
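One way to confirm this pattern is to compare the missing-value mask with the zero-review mask. The sketch below uses a small made-up tibble (`toy`, with hypothetical values) so that it runs on its own; on the real data the same comparison would be applied to `data`.

```r
library(dplyr)

# Toy stand-in for the listings data (values are made up for illustration).
toy <- tibble(
  number_of_reviews = c(9, 0, 45, 0),
  last_review       = c("2018-10-19", NA, "2019-05-21", NA),
  reviews_per_month = c(0.21, NA, 0.38, NA)
)

# TRUE when reviews_per_month is missing exactly for the zero-review listings.
na_matches_zero <- all(is.na(toy$reviews_per_month) == (toy$number_of_reviews == 0))

# Same comparison for last_review.
last_review_matches_zero <- all(is.na(toy$last_review) == (toy$number_of_reviews == 0))

na_matches_zero
last_review_matches_zero
```

If both values are TRUE, the missing entries can safely be treated as "no reviews yet" rather than as lost data.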

data$last_review[data$number_of_reviews == 0] <- "No Reviews"

data$reviews_per_month[data$number_of_reviews == 0] <- 0

data$name[is.na(data$name)] <- "Unknown"

data$host_name[is.na(data$host_name)] <- "Unknown"

We have now filled the missing cells with appropriate values, so we can run a sanity check on the tidied data.

We check once again for missing values.

colSums(is.na(data))
##                             id                           name 
##                              0                              0 
##                        host_id                      host_name 
##                              0                              0 
##            neighbourhood_group                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                              0                              0 
## calculated_host_listings_count               availability_365 
##                              0                              0

We see that no missing values remain.

We can now check for rows where the number of reviews is 0.

zero_reviews <- data %>%
  filter(number_of_reviews == 0)

zero_reviews
## # A tibble: 10,052 × 16
##       id name       host_id host_name neighbourhood_group neighbourhood latitude
##    <dbl> <chr>        <dbl> <chr>     <chr>               <chr>            <dbl>
##  1  3647 THE VILLA…    4632 Elisabeth Manhattan           Harlem            40.8
##  2  7750 Huge 2 BR…   17985 Sing      Manhattan           East Harlem       40.8
##  3  8700 Magnifiqu…   26394 Claude &… Manhattan           Inwood            40.9
##  4 11452 Clean and…    7355 Vt        Brooklyn            Bedford-Stuy…     40.7
##  5 11943 Country s…   45445 Harriet   Brooklyn            Flatbush          40.6
##  6 51438 1 Bedroom…  236421 Jessica   Manhattan           Upper East S…     40.8
##  7 54466 Beautiful…  253385 Douglas   Manhattan           Harlem            40.8
##  8 63588 LL3         295128 Carol Gl… Bronx               Clason Point      40.8
##  9 63913 HOSTING Y…  312288 Paula     Manhattan           Inwood            40.9
## 10 64015 Prime Eas…  146944 David     Manhattan           East Village      40.7
## # ℹ 10,042 more rows
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

We can now get the summary of the data after tidying.

summary(data)
##        id               name              host_id           host_name        
##  Min.   :    2539   Length:48895       Min.   :     2438   Length:48895      
##  1st Qu.: 9471945   Class :character   1st Qu.:  7822033   Class :character  
##  Median :19677284   Mode  :character   Median : 30793816   Mode  :character  
##  Mean   :19017143                      Mean   : 67620011                     
##  3rd Qu.:29152178                      3rd Qu.:107434423                     
##  Max.   :36487245                      Max.   :274321313                     
##  neighbourhood_group neighbourhood         latitude       longitude     
##  Length:48895        Length:48895       Min.   :40.50   Min.   :-74.24  
##  Class :character    Class :character   1st Qu.:40.69   1st Qu.:-73.98  
##  Mode  :character    Mode  :character   Median :40.72   Median :-73.96  
##                                         Mean   :40.73   Mean   :-73.95  
##                                         3rd Qu.:40.76   3rd Qu.:-73.94  
##                                         Max.   :40.91   Max.   :-73.71  
##   room_type             price         minimum_nights    number_of_reviews
##  Length:48895       Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
##  Class :character   1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
##  Mode  :character   Median :  106.0   Median :   3.00   Median :  5.00   
##                     Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
##                     3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
##                     Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
##  last_review        reviews_per_month calculated_host_listings_count
##  Length:48895       Min.   : 0.000    Min.   :  1.000               
##  Class :character   1st Qu.: 0.040    1st Qu.:  1.000               
##  Mode  :character   Median : 0.370    Median :  1.000               
##                     Mean   : 1.091    Mean   :  7.144               
##                     3rd Qu.: 1.580    3rd Qu.:  2.000               
##                     Max.   :58.500    Max.   :327.000               
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median : 45.0   
##  Mean   :112.8   
##  3rd Qu.:227.0   
##  Max.   :365.0

Mutating Data

The data is already in good shape, with no redundant columns or rows. We can still add a new column that states whether a listing has any reviews.

data <- data %>%
  # Compare explicitly against 0 instead of relying on numeric-to-logical coercion
  mutate(has_reviews = ifelse(number_of_reviews > 0, "Yes", "No"))

We can get the count of listings that have at least one review and those that have none. This also works as a sanity check.

table(data$has_reviews)
## 
##    No   Yes 
## 10052 38843
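The sanity check can be made explicit by cross-tabulating the new flag against the condition it is derived from. This is a minimal sketch on a made-up tibble (`toy` is hypothetical); on the real data the same `table()` call would use `data`.

```r
library(dplyr)

# Toy review counts (made up); the flag is built the same way as in the report.
toy <- tibble(number_of_reviews = c(9, 0, 45, 0, 1)) %>%
  mutate(has_reviews = ifelse(number_of_reviews > 0, "Yes", "No"))

# Rows: the flag. Columns: whether the listing has zero reviews.
check <- table(flag = toy$has_reviews,
               zero_reviews = toy$number_of_reviews == 0)
check
```

A consistent flag leaves the "No"/FALSE and "Yes"/TRUE cells at zero.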

Univariate Visualizations

Some variables are well suited to histograms. One of them is the number of days a listing is available in a year.

ggplot(data, aes(x = availability_365)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Listings Availability",
       x = "Number of Days available",
       y = "Frequency") +
  theme_minimal()

The number of days a listing is available matters for generating reviews: a listing that is available for most of the year has more chances of being booked, which can lead to more reviews. Because availability is a numeric variable with a bounded range of 0 to 365, a histogram is a suitable choice. The tallest bar is at the low end, so the largest single group of listings is available for fewer than 10 days. The summary above agrees: the first quartile is 0 days. The median is 45 days, however, so this group makes up at most half of all listings.

Another variable that can benefit from a univariate visualization is the type of room.
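The height of the first bar can be read more precisely by computing the proportion directly. The sketch below assumes a made-up vector `availability` in place of `data$availability_365`, so the printed share is illustrative only.

```r
# Toy availability values (made up); in the report this would be data$availability_365.
availability <- c(0, 0, 3, 45, 120, 227, 365, 0)

# Proportion of listings available for fewer than 10 days:
# the mean of a logical vector is the share of TRUE values.
share_under_10 <- mean(availability < 10)
share_under_10
```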

ggplot(data, aes(x = room_type)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Room Types",
       x = "Room Type",
       y = "Frequency") +
  theme_minimal()

The type of room is an important factor when guests choose a listing. A large group will likely prefer an entire home, whereas a solo traveler may be fine with a private or shared room. We see that most listings offer either an entire home or a private room, and only a few offer shared room accommodation.
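The bar heights can also be expressed as shares, which makes the comparison between room types easier to state. This is a sketch on a made-up tibble; the `room_type` labels used here are assumed values, and `room_share` is a hypothetical name.

```r
library(dplyr)

# Toy room types (made up); in the report this would be the room_type column of data.
toy <- tibble(room_type = c("Entire home/apt", "Private room", "Private room",
                            "Entire home/apt", "Shared room"))

# Count listings per room type, then convert the counts to shares of the total.
room_share <- toy %>%
  count(room_type) %>%
  mutate(share = n / sum(n))
room_share
```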

Bivariate Distribution

To analyze the relationship between two variables, we use a bivariate plot. Here we plot the price against the minimum number of nights.

ggplot(data, aes(x = minimum_nights, y = price)) +
  geom_point(color = "skyblue") +
  labs(title = "Price distribution by number of nights",
       x = "Minimum Number of Nights",
       y = "Price") +
  theme_minimal()

We choose these two variables because, in practice, listings with a higher minimum number of nights tend to charge a lower price per night so that they remain attractive to guests. The plot shows that listings with the lowest minimum-night requirements have the widest range of prices. Beyond a minimum stay of about 200 nights, the price per night usually stays low, presumably because a high nightly price combined with a long minimum stay is unreasonable for guests.
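A scatter plot with many overlapping points can be backed up by a grouped summary. The sketch below buckets the minimum stay and compares median and maximum price per bucket. The tibble `toy`, the bucket boundaries, and the name `price_by_stay` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy listings (values made up); in the report these columns come from data.
toy <- tibble(
  minimum_nights = c(1, 2, 3, 30, 45, 250, 365),
  price          = c(80, 400, 150, 120, 90, 60, 75)
)

# Bucket the minimum stay, then summarise price within each bucket.
# cut() uses right-closed intervals by default: (0,7], (7,200], (200,Inf].
price_by_stay <- toy %>%
  mutate(stay_bucket = cut(minimum_nights,
                           breaks = c(0, 7, 200, Inf),
                           labels = c("1-7", "8-200", "over 200"))) %>%
  group_by(stay_bucket) %>%
  summarise(listings     = n(),
            median_price = median(price),
            max_price    = max(price))
price_by_stay
```

On the real data, a narrower price range in the "over 200" bucket would support the reading of the scatter plot.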

Conclusion

We went through a basic explanation and cleaning of the data, followed by visualizations of selected variables.