This challenge focuses on visualizing attributes of the data and the relationships between them, using univariate and bivariate plots.
We first load the necessary libraries.
library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 5
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We then read the data.
data <- read_csv("AB_NYC_2019.csv", show_col_types = FALSE)
We can view the first few rows of the data to see the various attributes.
head(data)
## # A tibble: 6 × 16
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 2 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## 3 3647 THE VILLAG… 4632 Elisabeth Manhattan Harlem 40.8
## 4 3831 Cozy Entir… 4869 LisaRoxa… Brooklyn Clinton Hill 40.7
## 5 5022 Entire Apt… 7192 Laura Manhattan East Harlem 40.8
## 6 5099 Large Cozy… 7322 Chris Manhattan Murray Hill 40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>
This data describes Airbnb listings across the boroughs of New York City.
The description of each column is as follows:
id - Unique identifier for each listing.
name - Name of the listing.
host_id - Identifier of the host of property.
host_name - Name of the host.
neighbourhood_group - Borough or area where property is located.
neighbourhood - Specific neighborhood of the borough.
latitude - Latitude of the listing's geographic coordinates.
longitude - Longitude of the listing's geographic coordinates.
room_type - Type of room offered.
price - Price of the listing per night.
minimum_nights - Minimum number of nights required for booking.
number_of_reviews - Total number of reviews received by the listing.
last_review - Date of the last review.
reviews_per_month - Average number of reviews received per month.
calculated_host_listings_count - Count of properties listed by host.
availability_365 - Number of days in the year (0 to 365) that the listing is available.
We can get the dimensions of the data.
dim(data)
## [1] 48895 16
This shows that there are 48,895 listings (one per row) with 16 different attributes.
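Whether every row really is a distinct listing can be checked on the id column. The sketch below uses a small made-up data frame (the `listings` object and its values are assumptions for illustration, not the real file) and base R only.

```r
# Toy stand-in for the listings table; ids and names are invented.
listings <- data.frame(
  id   = c(2539, 2595, 3647, 3831),
  name = c("A", "B", "C", "D")
)

n_rows   <- nrow(listings)               # number of rows
n_unique <- length(unique(listings$id))  # number of distinct ids

# anyDuplicated() returns 0 when no value is repeated
ids_are_unique <- anyDuplicated(listings$id) == 0

print(c(rows = n_rows, unique_ids = n_unique))
print(ids_are_unique)
```

If the two numbers agree on the real data, the row count can safely be read as the number of distinct listings.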
An overview of the structure of the data can be observed.
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:48895] 2539 2595 3647 3831 5022 ...
## $ name : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : num [1:48895] 2787 2845 4632 4869 7192 ...
## $ host_name : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr [1:48895] "19-10-2018" "21-05-2019" NA "05-07-2019" ...
## $ reviews_per_month : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. host_id = col_double(),
## .. host_name = col_character(),
## .. neighbourhood_group = col_character(),
## .. neighbourhood = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. room_type = col_character(),
## .. price = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. last_review = col_character(),
## .. reviews_per_month = col_double(),
## .. calculated_host_listings_count = col_double(),
## .. availability_365 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We can also get summary statistics for each column; for the numerical columns these include the quartiles and the mean.
summary(data)
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
##
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
##
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
##
## last_review reviews_per_month calculated_host_listings_count
## Length:48895 Min. : 0.010 Min. : 1.000
## Class :character 1st Qu.: 0.190 1st Qu.: 1.000
## Mode :character Median : 0.720 Median : 1.000
## Mean : 1.373 Mean : 7.144
## 3rd Qu.: 2.020 3rd Qu.: 2.000
## Max. :58.500 Max. :327.000
## NA's :10052
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
##
We can select one or more particular columns to see all the values in that column. For instance, we can use it for neighborhood.
select(data, "neighbourhood")
## # A tibble: 48,895 × 1
## neighbourhood
## <chr>
## 1 Kensington
## 2 Midtown
## 3 Harlem
## 4 Clinton Hill
## 5 East Harlem
## 6 Murray Hill
## 7 Bedford-Stuyvesant
## 8 Hell's Kitchen
## 9 Upper West Side
## 10 Chinatown
## # ℹ 48,885 more rows
We can count the number of listings in each neighborhood.
neighborhood <- select(data, "neighbourhood")
table(neighborhood)
## neighbourhood
## Allerton Arden Heights
## 42 4
## Arrochar Arverne
## 21 77
## Astoria Bath Beach
## 900 17
## Battery Park City Bay Ridge
## 70 141
## Bay Terrace Bay Terrace, Staten Island
## 6 2
## Baychester Bayside
## 7 39
## Bayswater Bedford-Stuyvesant
## 17 3714
## Belle Harbor Bellerose
## 8 14
## Belmont Bensonhurst
## 24 75
## Bergen Beach Boerum Hill
## 10 177
## Borough Park Breezy Point
## 136 3
## Briarwood Brighton Beach
## 56 75
## Bronxdale Brooklyn Heights
## 19 154
## Brownsville Bull's Head
## 61 6
## Bushwick Cambria Heights
## 2465 26
## Canarsie Carroll Gardens
## 147 233
## Castle Hill Castleton Corners
## 9 4
## Chelsea Chinatown
## 1113 368
## City Island Civic Center
## 18 52
## Claremont Village Clason Point
## 28 21
## Clifton Clinton Hill
## 15 572
## Co-op City Cobble Hill
## 2 99
## College Point Columbia St
## 19 42
## Concord Concourse
## 26 50
## Concourse Village Coney Island
## 32 17
## Corona Crown Heights
## 64 1564
## Cypress Hills Ditmars Steinway
## 135 309
## Dongan Hills Douglaston
## 7 8
## Downtown Brooklyn DUMBO
## 83 36
## Dyker Heights East Elmhurst
## 12 185
## East Flatbush East Harlem
## 500 1117
## East Morrisania East New York
## 10 218
## East Village Eastchester
## 1853 13
## Edenwald Edgemere
## 13 11
## Elmhurst Eltingville
## 237 3
## Emerson Hill Far Rockaway
## 5 29
## Fieldston Financial District
## 12 744
## Flatbush Flatiron District
## 621 80
## Flatlands Flushing
## 83 426
## Fordham Forest Hills
## 63 144
## Fort Greene Fort Hamilton
## 489 55
## Fort Wadsworth Fresh Meadows
## 1 32
## Glendale Gowanus
## 54 247
## Gramercy Graniteville
## 338 3
## Grant City Gravesend
## 6 68
## Great Kills Greenpoint
## 10 1115
## Greenwich Village Grymes Hill
## 392 7
## Harlem Hell's Kitchen
## 2658 1958
## Highbridge Hollis
## 27 14
## Holliswood Howard Beach
## 4 20
## Howland Hook Huguenot
## 2 3
## Hunts Point Inwood
## 18 252
## Jackson Heights Jamaica
## 186 231
## Jamaica Estates Jamaica Hills
## 19 8
## Kensington Kew Gardens
## 175 32
## Kew Gardens Hills Kingsbridge
## 26 70
## Kips Bay Laurelton
## 470 18
## Lighthouse Hill Little Italy
## 2 121
## Little Neck Long Island City
## 5 537
## Longwood Lower East Side
## 62 911
## Manhattan Beach Marble Hill
## 8 12
## Mariners Harbor Maspeth
## 8 110
## Melrose Middle Village
## 10 31
## Midland Beach Midtown
## 6 1545
## Midwood Mill Basin
## 109 4
## Morningside Heights Morris Heights
## 346 17
## Morris Park Morrisania
## 15 18
## Mott Haven Mount Eden
## 60 6
## Mount Hope Murray Hill
## 20 485
## Navy Yard Neponsit
## 14 3
## New Brighton New Dorp
## 5 1
## New Dorp Beach New Springville
## 5 8
## NoHo Nolita
## 78 253
## North Riverdale Norwood
## 10 31
## Oakwood Olinville
## 5 4
## Ozone Park Park Slope
## 62 506
## Parkchester Pelham Bay
## 39 17
## Pelham Gardens Port Morris
## 28 46
## Port Richmond Prince's Bay
## 9 4
## Prospect-Lefferts Gardens Prospect Heights
## 535 357
## Queens Village Randall Manor
## 60 19
## Red Hook Rego Park
## 79 106
## Richmond Hill Richmondtown
## 94 1
## Ridgewood Riverdale
## 423 11
## Rockaway Beach Roosevelt Island
## 56 77
## Rosebank Rosedale
## 7 59
## Rossville Schuylerville
## 1 13
## Sea Gate Sheepshead Bay
## 7 164
## Shore Acres Silver Lake
## 7 2
## SoHo Soundview
## 358 15
## South Beach South Ozone Park
## 8 40
## South Slope Springfield Gardens
## 284 85
## Spuyten Duyvil St. Albans
## 4 76
## St. George Stapleton
## 48 27
## Stuyvesant Town Sunnyside
## 37 363
## Sunset Park Theater District
## 390 288
## Throgs Neck Todt Hill
## 24 4
## Tompkinsville Tottenville
## 42 7
## Tremont Tribeca
## 11 177
## Two Bridges Unionport
## 72 7
## University Heights Upper East Side
## 21 1798
## Upper West Side Van Nest
## 1971 11
## Vinegar Hill Wakefield
## 34 50
## Washington Heights West Brighton
## 899 18
## West Farms West Village
## 2 768
## Westchester Square Westerleigh
## 10 2
## Whitestone Williamsbridge
## 11 40
## Williamsburg Willowbrook
## 3920 1
## Windsor Terrace Woodhaven
## 157 88
## Woodlawn Woodrow
## 11 1
## Woodside
## 235
From the counts, we see that certain neighborhoods appear far more often than others. Possible reasons include their size, population, proximity to tourist attractions, and wealth.
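With this many neighborhoods, an alphabetical table makes the largest ones hard to spot. A minimal sketch of sorting the counts, using an invented vector `hoods` rather than the real column:

```r
# Hypothetical neighbourhood labels, not taken from the real file.
hoods <- c("Harlem", "Midtown", "Harlem", "Bushwick", "Harlem", "Midtown")

counts     <- table(hoods)
top_counts <- sort(counts, decreasing = TRUE)  # largest count first

# Show only the two most frequent neighbourhoods
print(head(top_counts, 2))
```

On the real data, the same pattern could be applied to the neighbourhood column to list the ten or twenty most common neighborhoods.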
We can have a better idea by getting the proportions.
prop.table(table(neighborhood))
## neighbourhood
## Allerton Arden Heights
## 8.589835e-04 8.180796e-05
## Arrochar Arverne
## 4.294918e-04 1.574803e-03
## Astoria Bath Beach
## 1.840679e-02 3.476838e-04
## Battery Park City Bay Ridge
## 1.431639e-03 2.883730e-03
## Bay Terrace Bay Terrace, Staten Island
## 1.227119e-04 4.090398e-05
## Baychester Bayside
## 1.431639e-04 7.976276e-04
## Bayswater Bedford-Stuyvesant
## 3.476838e-04 7.595869e-02
## Belle Harbor Bellerose
## 1.636159e-04 2.863278e-04
## Belmont Bensonhurst
## 4.908477e-04 1.533899e-03
## Bergen Beach Boerum Hill
## 2.045199e-04 3.620002e-03
## Borough Park Breezy Point
## 2.781470e-03 6.135597e-05
## Briarwood Brighton Beach
## 1.145311e-03 1.533899e-03
## Bronxdale Brooklyn Heights
## 3.885878e-04 3.149606e-03
## Brownsville Bull's Head
## 1.247571e-03 1.227119e-04
## Bushwick Cambria Heights
## 5.041415e-02 5.317517e-04
## Canarsie Carroll Gardens
## 3.006442e-03 4.765313e-03
## Castle Hill Castleton Corners
## 1.840679e-04 8.180796e-05
## Chelsea Chinatown
## 2.276306e-02 7.526332e-03
## City Island Civic Center
## 3.681358e-04 1.063503e-03
## Claremont Village Clason Point
## 5.726557e-04 4.294918e-04
## Clifton Clinton Hill
## 3.067798e-04 1.169854e-02
## Co-op City Cobble Hill
## 4.090398e-05 2.024747e-03
## College Point Columbia St
## 3.885878e-04 8.589835e-04
## Concord Concourse
## 5.317517e-04 1.022599e-03
## Concourse Village Coney Island
## 6.544636e-04 3.476838e-04
## Corona Crown Heights
## 1.308927e-03 3.198691e-02
## Cypress Hills Ditmars Steinway
## 2.761019e-03 6.319665e-03
## Dongan Hills Douglaston
## 1.431639e-04 1.636159e-04
## Downtown Brooklyn DUMBO
## 1.697515e-03 7.362716e-04
## Dyker Heights East Elmhurst
## 2.454239e-04 3.783618e-03
## East Flatbush East Harlem
## 1.022599e-02 2.284487e-02
## East Morrisania East New York
## 2.045199e-04 4.458534e-03
## East Village Eastchester
## 3.789754e-02 2.658759e-04
## Edenwald Edgemere
## 2.658759e-04 2.249719e-04
## Elmhurst Eltingville
## 4.847121e-03 6.135597e-05
## Emerson Hill Far Rockaway
## 1.022599e-04 5.931077e-04
## Fieldston Financial District
## 2.454239e-04 1.521628e-02
## Flatbush Flatiron District
## 1.270069e-02 1.636159e-03
## Flatlands Flushing
## 1.697515e-03 8.712547e-03
## Fordham Forest Hills
## 1.288475e-03 2.945086e-03
## Fort Greene Fort Hamilton
## 1.000102e-02 1.124859e-03
## Fort Wadsworth Fresh Meadows
## 2.045199e-05 6.544636e-04
## Glendale Gowanus
## 1.104407e-03 5.051641e-03
## Gramercy Graniteville
## 6.912772e-03 6.135597e-05
## Grant City Gravesend
## 1.227119e-04 1.390735e-03
## Great Kills Greenpoint
## 2.045199e-04 2.280397e-02
## Greenwich Village Grymes Hill
## 8.017180e-03 1.431639e-04
## Harlem Hell's Kitchen
## 5.436139e-02 4.004499e-02
## Highbridge Hollis
## 5.522037e-04 2.863278e-04
## Holliswood Howard Beach
## 8.180796e-05 4.090398e-04
## Howland Hook Huguenot
## 4.090398e-05 6.135597e-05
## Hunts Point Inwood
## 3.681358e-04 5.153901e-03
## Jackson Heights Jamaica
## 3.804070e-03 4.724409e-03
## Jamaica Estates Jamaica Hills
## 3.885878e-04 1.636159e-04
## Kensington Kew Gardens
## 3.579098e-03 6.544636e-04
## Kew Gardens Hills Kingsbridge
## 5.317517e-04 1.431639e-03
## Kips Bay Laurelton
## 9.612435e-03 3.681358e-04
## Lighthouse Hill Little Italy
## 4.090398e-05 2.474691e-03
## Little Neck Long Island City
## 1.022599e-04 1.098272e-02
## Longwood Lower East Side
## 1.268023e-03 1.863176e-02
## Manhattan Beach Marble Hill
## 1.636159e-04 2.454239e-04
## Mariners Harbor Maspeth
## 1.636159e-04 2.249719e-03
## Melrose Middle Village
## 2.045199e-04 6.340117e-04
## Midland Beach Midtown
## 1.227119e-04 3.159832e-02
## Midwood Mill Basin
## 2.229267e-03 8.180796e-05
## Morningside Heights Morris Heights
## 7.076388e-03 3.476838e-04
## Morris Park Morrisania
## 3.067798e-04 3.681358e-04
## Mott Haven Mount Eden
## 1.227119e-03 1.227119e-04
## Mount Hope Murray Hill
## 4.090398e-04 9.919215e-03
## Navy Yard Neponsit
## 2.863278e-04 6.135597e-05
## New Brighton New Dorp
## 1.022599e-04 2.045199e-05
## New Dorp Beach New Springville
## 1.022599e-04 1.636159e-04
## NoHo Nolita
## 1.595255e-03 5.174353e-03
## North Riverdale Norwood
## 2.045199e-04 6.340117e-04
## Oakwood Olinville
## 1.022599e-04 8.180796e-05
## Ozone Park Park Slope
## 1.268023e-03 1.034871e-02
## Parkchester Pelham Bay
## 7.976276e-04 3.476838e-04
## Pelham Gardens Port Morris
## 5.726557e-04 9.407915e-04
## Port Richmond Prince's Bay
## 1.840679e-04 8.180796e-05
## Prospect-Lefferts Gardens Prospect Heights
## 1.094181e-02 7.301360e-03
## Queens Village Randall Manor
## 1.227119e-03 3.885878e-04
## Red Hook Rego Park
## 1.615707e-03 2.167911e-03
## Richmond Hill Richmondtown
## 1.922487e-03 2.045199e-05
## Ridgewood Riverdale
## 8.651191e-03 2.249719e-04
## Rockaway Beach Roosevelt Island
## 1.145311e-03 1.574803e-03
## Rosebank Rosedale
## 1.431639e-04 1.206667e-03
## Rossville Schuylerville
## 2.045199e-05 2.658759e-04
## Sea Gate Sheepshead Bay
## 1.431639e-04 3.354126e-03
## Shore Acres Silver Lake
## 1.431639e-04 4.090398e-05
## SoHo Soundview
## 7.321812e-03 3.067798e-04
## South Beach South Ozone Park
## 1.636159e-04 8.180796e-04
## South Slope Springfield Gardens
## 5.808365e-03 1.738419e-03
## Spuyten Duyvil St. Albans
## 8.180796e-05 1.554351e-03
## St. George Stapleton
## 9.816955e-04 5.522037e-04
## Stuyvesant Town Sunnyside
## 7.567236e-04 7.424072e-03
## Sunset Park Theater District
## 7.976276e-03 5.890173e-03
## Throgs Neck Todt Hill
## 4.908477e-04 8.180796e-05
## Tompkinsville Tottenville
## 8.589835e-04 1.431639e-04
## Tremont Tribeca
## 2.249719e-04 3.620002e-03
## Two Bridges Unionport
## 1.472543e-03 1.431639e-04
## University Heights Upper East Side
## 4.294918e-04 3.677268e-02
## Upper West Side Van Nest
## 4.031087e-02 2.249719e-04
## Vinegar Hill Wakefield
## 6.953676e-04 1.022599e-03
## Washington Heights West Brighton
## 1.838634e-02 3.681358e-04
## West Farms West Village
## 4.090398e-05 1.570713e-02
## Westchester Square Westerleigh
## 2.045199e-04 4.090398e-05
## Whitestone Williamsbridge
## 2.249719e-04 8.180796e-04
## Williamsburg Willowbrook
## 8.017180e-02 2.045199e-05
## Windsor Terrace Woodhaven
## 3.210962e-03 1.799775e-03
## Woodlawn Woodrow
## 2.249719e-04 2.045199e-05
## Woodside
## 4.806217e-03
The proportions make it easier to compare the share of listings across neighborhoods.
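Proportions printed in scientific notation are hard to read at a glance. One option, sketched here on an invented vector `hoods`, is to convert them to rounded percentages:

```r
# Hypothetical neighbourhood labels for illustration only.
hoods <- c("Harlem", "Midtown", "Harlem", "Bushwick")

shares <- prop.table(table(hoods))  # proportions that sum to 1
pct    <- round(100 * shares, 1)    # the same values as percentages

print(sort(pct, decreasing = TRUE))
```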
To tidy the data, first we will check if there are missing values in the data.
colSums(is.na(data))
## id name
## 0 16
## host_id host_name
## 0 21
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 10052 10052
## calculated_host_listings_count availability_365
## 0 0
We see that some property names and host names are missing. A sizeable share of the last review and reviews per month values is also missing (10,052 each). These two columns are usually empty when the listing has not received a single review.
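The link between missing review fields and a review count of zero can be verified before filling anything in. The sketch below uses invented toy vectors that mimic the pattern; on the real data the same comparison would be run on the corresponding columns.

```r
# Toy columns mimicking the pattern described above (values are invented).
number_of_reviews <- c(9, 0, 45, 0)
reviews_per_month <- c(0.21, NA, 0.38, NA)
last_review       <- c("19-10-2018", NA, "21-05-2019", NA)

no_reviews <- number_of_reviews == 0

# TRUE only if NA appears exactly where the review count is zero
na_matches_zero <- all(is.na(reviews_per_month) == no_reviews) &&
  all(is.na(last_review) == no_reviews)

print(na_matches_zero)
```

If this returns TRUE, indexing by a zero review count and indexing by missing values select the same rows.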
We can replace the missing values with suitable placeholders.
data$last_review[data$number_of_reviews == 0] <- "No Reviews"
data$reviews_per_month[data$number_of_reviews == 0] <- 0
data$name[is.na(data$name)] <- "Unknown"
data$host_name[is.na(data$host_name)] <- "Unknown"
From above, we see that we filled the missing cells with appropriate values. We can now conduct a sanity check after tidying the data.
We check once again for missing values.
colSums(is.na(data))
## id name
## 0 0
## host_id host_name
## 0 0
## neighbourhood_group neighbourhood
## 0 0
## latitude longitude
## 0 0
## room_type price
## 0 0
## minimum_nights number_of_reviews
## 0 0
## last_review reviews_per_month
## 0 0
## calculated_host_listings_count availability_365
## 0 0
We see that no missing values remain.
We can now check for rows where the number of reviews is 0.
zero_reviews <- data %>%
  filter(number_of_reviews == 0)
zero_reviews
## # A tibble: 10,052 × 16
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 3647 THE VILLA… 4632 Elisabeth Manhattan Harlem 40.8
## 2 7750 Huge 2 BR… 17985 Sing Manhattan East Harlem 40.8
## 3 8700 Magnifiqu… 26394 Claude &… Manhattan Inwood 40.9
## 4 11452 Clean and… 7355 Vt Brooklyn Bedford-Stuy… 40.7
## 5 11943 Country s… 45445 Harriet Brooklyn Flatbush 40.6
## 6 51438 1 Bedroom… 236421 Jessica Manhattan Upper East S… 40.8
## 7 54466 Beautiful… 253385 Douglas Manhattan Harlem 40.8
## 8 63588 LL3 295128 Carol Gl… Bronx Clason Point 40.8
## 9 63913 HOSTING Y… 312288 Paula Manhattan Inwood 40.9
## 10 64015 Prime Eas… 146944 David Manhattan East Village 40.7
## # ℹ 10,042 more rows
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>
We can now get the summary of the data after tidying.
summary(data)
## id name host_id host_name
## Min. : 2539 Length:48895 Min. : 2438 Length:48895
## 1st Qu.: 9471945 Class :character 1st Qu.: 7822033 Class :character
## Median :19677284 Mode :character Median : 30793816 Mode :character
## Mean :19017143 Mean : 67620011
## 3rd Qu.:29152178 3rd Qu.:107434423
## Max. :36487245 Max. :274321313
## neighbourhood_group neighbourhood latitude longitude
## Length:48895 Length:48895 Min. :40.50 Min. :-74.24
## Class :character Class :character 1st Qu.:40.69 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.96
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.94
## Max. :40.91 Max. :-73.71
## room_type price minimum_nights number_of_reviews
## Length:48895 Min. : 0.0 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.: 69.0 1st Qu.: 1.00 1st Qu.: 1.00
## Mode :character Median : 106.0 Median : 3.00 Median : 5.00
## Mean : 152.7 Mean : 7.03 Mean : 23.27
## 3rd Qu.: 175.0 3rd Qu.: 5.00 3rd Qu.: 24.00
## Max. :10000.0 Max. :1250.00 Max. :629.00
## last_review reviews_per_month calculated_host_listings_count
## Length:48895 Min. : 0.000 Min. : 1.000
## Class :character 1st Qu.: 0.040 1st Qu.: 1.000
## Mode :character Median : 0.370 Median : 1.000
## Mean : 1.091 Mean : 7.144
## 3rd Qu.: 1.580 3rd Qu.: 2.000
## Max. :58.500 Max. :327.000
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 45.0
## Mean :112.8
## 3rd Qu.:227.0
## Max. :365.0
The data in general seems to be in good shape, without any redundant columns or rows. We can still add a new column that states whether a listing has any reviews.
data <- data %>%
  mutate(has_reviews = ifelse(number_of_reviews > 0, "Yes", "No"))
We can get the count of listings that have at least one review and those that have none. This also works as a sanity check.
table(data$has_reviews)
##
## No Yes
## 10052 38843
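The same sanity check can be written as an explicit assertion. This is a minimal sketch on an invented vector, assuming the flag is built with an explicit `> 0` comparison:

```r
# Invented review counts for illustration.
number_of_reviews <- c(9, 0, 45, 0, 270)

# Flag listings with at least one review
has_reviews <- ifelse(number_of_reviews > 0, "Yes", "No")
counts <- table(has_reviews)
print(counts)

# The "No" count must equal the number of zero-review listings
stopifnot(counts[["No"]] == sum(number_of_reviews == 0))
```

On the real data, the "No" count of 10052 matches the number of rows returned by the zero-review filter earlier.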
There are certain variables that are suitable for histograms. One of them is number of days available.
ggplot(data, aes(x = availability_365)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Listings Availability",
       x = "Number of Days available",
       y = "Frequency") +
  theme_minimal()
The number of days that a listing is available matters for generating reviews: a listing that is available for most of the year has more chances of being booked, which can lead to more reviews. Because availability lies in a bounded range from 0 to 365 days, a histogram is a suitable choice. The tallest bar is at the low end: the summary above shows a first quartile of 0 and a median of 45 days, so at least a quarter of the listings have no available days and half are available for 45 days or fewer.
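What a histogram with a bin width of 10 counts can be reproduced by hand. The sketch below bins an invented `availability` vector with base R; the bin edges are chosen for illustration and need not match the ones `geom_histogram` picks.

```r
# Invented availability values (days per year), for illustration only.
availability <- c(0, 0, 0, 3, 45, 120, 227, 365)

# Share of listings with no available days
share_zero <- mean(availability == 0)

# Quartiles, as reported by summary()
q <- quantile(availability, c(0.25, 0.5, 0.75))

# Left-closed bins of width 10: [0,10), [10,20), ..., [360,370)
bins <- cut(availability, breaks = seq(0, 370, by = 10), right = FALSE)
bin_counts <- table(bins)

print(share_zero)
print(q)
print(bin_counts[bin_counts > 0])  # show only the non-empty bins
```

Comparing the share of zeros with the height of the first bin shows how much of that bar comes from listings that are never available.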
Another variable that can benefit from univariate visualisation is the type of room.
ggplot(data, aes(x = room_type)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Room Types",
       x = "Room Type",
       y = "Frequency") +
  theme_minimal()
The type of room is an important factor when guests book a listing. A large group would likely prefer an entire home, whereas a solo traveler may be fine with a private or shared room. We see that most listings offer either the whole home or a private room, and only a few offer shared room accommodation.
To analyze the relationship between two variables, we use a bivariate visualization. In this case, we plot the price against the minimum number of nights in a scatter plot.
ggplot(data, aes(x = minimum_nights, y = price)) +
  geom_point(color = "skyblue") +
  labs(title = "Price distribution by number of nights",
       x = "Minimum Number of Nights",
       y = "Price") +
  theme_minimal()
We choose these two variables because listings with a higher minimum number of nights often have a lower price per night, which keeps them competitive for longer stays. From the plot, we see that listings with the smallest minimum-night requirements have the widest range of prices. Beyond a minimum of about 200 nights, the price per night usually stays low. A plausible reason is that a high nightly price combined with a long minimum stay would be unattractive to guests.
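The visual impression from a scatter plot can be backed by a few numbers. The sketch below groups invented prices into minimum-night buckets and computes a rank correlation; the bucket boundaries (7 and 30 nights) and all values are assumptions for illustration.

```r
# Invented values for illustration; not taken from the real file.
minimum_nights <- c(1, 1, 2, 3, 30, 45, 200, 365)
price          <- c(149, 900, 80, 225, 60, 79, 50, 40)

# Bucket the minimum stay: (0,7], (7,30], (30,Inf]
bucket <- cut(minimum_nights,
              breaks = c(0, 7, 30, Inf),
              labels = c("1-7", "8-30", "31+"))

# Typical price and spread of prices within each bucket
median_by_bucket <- tapply(price, bucket, median)
range_by_bucket  <- tapply(price, bucket, function(p) diff(range(p)))

# Spearman rank correlation is less sensitive to extreme prices
rho <- cor(minimum_nights, price, method = "spearman")

print(median_by_bucket)
print(range_by_bucket)
print(rho)
```

A wider price range in the shortest-stay bucket and a negative rank correlation would be consistent with the pattern described above, although neither establishes the cause.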
We went through a basic explanation and cleaning of the data, followed by visualization of selected variables.