HW4: Visualizations

This is the R Markdown file for HW4 of DACS-601, Summer 2022. I’m using the New York City Airbnb CSV file from the sample datasets.

Loading libraries

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tibble  3.1.7     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)

Loading data

ab_nyc_data <- read_csv("C:/Users/apoor/Desktop/UMass/Summer 2022/DACS 601 - R Programming/datasets/AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ab_nyc_data)
## # A tibble: 6 × 16
##      id name           host_id host_name neighbourhood_g… neighbourhood latitude
##   <dbl> <chr>            <dbl> <chr>     <chr>            <chr>            <dbl>
## 1  2539 Clean & quiet…    2787 John      Brooklyn         Kensington        40.6
## 2  2595 Skylit Midtow…    2845 Jennifer  Manhattan        Midtown           40.8
## 3  3647 THE VILLAGE O…    4632 Elisabeth Manhattan        Harlem            40.8
## 4  3831 Cozy Entire F…    4869 LisaRoxa… Brooklyn         Clinton Hill      40.7
## 5  5022 Entire Apt: S…    7192 Laura     Manhattan        East Harlem       40.8
## 6  5099 Large Cozy 1 …    7322 Chris     Manhattan        Murray Hill       40.7
## # … with 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

Data analysis

1. Number of Airbnbs in each neighbourhood group

head(ab_nyc_data %>% count(neighbourhood_group))
## # A tibble: 5 × 2
##   neighbourhood_group     n
##   <chr>               <int>
## 1 Bronx                1091
## 2 Brooklyn            20104
## 3 Manhattan           21661
## 4 Queens               5666
## 5 Staten Island         373

2. Average Airbnb price in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(avg_price= mean(price)) %>% arrange(desc(avg_price)))
## # A tibble: 5 × 2
##   neighbourhood_group avg_price
##   <chr>                   <dbl>
## 1 Manhattan               197. 
## 2 Brooklyn                124. 
## 3 Staten Island           115. 
## 4 Queens                   99.5
## 5 Bronx                    87.5

3. Median Airbnb price in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(median_price= median(price)) %>% arrange(desc(median_price)))
## # A tibble: 5 × 2
##   neighbourhood_group median_price
##   <chr>                      <dbl>
## 1 Manhattan                    150
## 2 Brooklyn                      90
## 3 Queens                        75
## 4 Staten Island                 75
## 5 Bronx                         65

4. Standard deviation of Airbnb price in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(sd_price= sd(price)) %>% arrange(desc(sd_price)))
## # A tibble: 5 × 2
##   neighbourhood_group sd_price
##   <chr>                  <dbl>
## 1 Manhattan               291.
## 2 Staten Island           278.
## 3 Brooklyn                187.
## 4 Queens                  167.
## 5 Bronx                   107.
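The mean, median, and standard deviation above can also be computed in a single `summarise()` call, together with the group size. The following is a minimal sketch on a hypothetical toy data frame (`toy`, with made-up groups and prices), not on the real Airbnb file:

```r
library(dplyr)

# Hypothetical toy data standing in for ab_nyc_data (all values are made up)
toy <- data.frame(
  neighbourhood_group = c("A", "A", "A", "B", "B"),
  price = c(100, 200, 300, 50, 150)
)

# One pass over the groups: count, mean, median and standard deviation
price_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    n = n(),
    avg_price = mean(price),
    median_price = median(price),
    sd_price = sd(price)
  ) %>%
  arrange(desc(avg_price))

print(price_summary)
```

Keeping `n` next to the statistics makes it easier to see when an average rests on only a few listings.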

5. Hosts with the most listings

head(ab_nyc_data %>% count(host_id) %>% arrange(desc(n)))
## # A tibble: 6 × 2
##     host_id     n
##       <dbl> <int>
## 1 219517861   327
## 2 107434423   232
## 3  30283594   121
## 4 137358866   103
## 5  12243051    96
## 6  16098958    96

6. Airbnbs with the most reviews

head(ab_nyc_data %>% group_by(id, name) %>% summarise(reviews= sum(number_of_reviews)) %>% arrange(desc(reviews)))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   id [6]
##         id name                              reviews
##      <dbl> <chr>                               <dbl>
## 1  9145202 Room near JFK Queen Bed               629
## 2   903972 Great Bedroom in Manhattan            607
## 3   903947 Beautiful Bedroom in Manhattan        597
## 4   891117 Private Bedroom in Manhattan          594
## 5 10101135 Room Near JFK Twin Beds               576
## 6  8168619 Steps away from Laguardia airport     543
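If each row is a single listing (an assumption; the grouping by `id` above suggests it, but I have not checked for duplicate ids), the same ranking can be obtained without `group_by()` and `summarise()`. A minimal sketch with `slice_max()` on a hypothetical toy data frame with made-up values:

```r
library(dplyr)

# Hypothetical toy listings (made-up values); assumes one row per listing id
toy <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Room W", "Room X", "Room Y", "Room Z"),
  number_of_reviews = c(12, 629, 0, 87)
)

# Keep the two rows with the largest review counts
top_reviewed <- toy %>%
  select(id, name, number_of_reviews) %>%
  slice_max(number_of_reviews, n = 2)

print(top_reviewed)
```

This also avoids the grouped-output message, since no grouping is involved.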

7. Most available Airbnbs

head(ab_nyc_data %>% group_by(id, name) %>% summarise(availability= mean(availability_365)) %>% arrange(desc(availability)))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   id [6]
##      id name                                availability
##   <dbl> <chr>                                      <dbl>
## 1  2539 Clean & quiet apt home by the park           365
## 2  3647 THE VILLAGE OF HARLEM....NEW YORK !          365
## 3 11452 Clean and Quiet in Brooklyn                  365
## 4 11943 Country space in the city                    365
## 5 21644 Upper Manhattan, New York                    365
## 6 32037 Huge Private  Floor at The Waverly           365

8. Least available Airbnbs

head(ab_nyc_data %>% group_by(id, name) %>% summarise(availability= mean(availability_365)) %>% arrange(availability))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   id [6]
##      id name                                              availability
##   <dbl> <chr>                                                    <dbl>
## 1  5022 Entire Apt: Spacious Studio/Loft by central park             0
## 2  5121 BlissArtsSpace!                                              0
## 3  5203 Cozy Clean Guest Room - Family Apt                           0
## 4  6090 West Village Nest - Superhost                                0
## 5  7801 Sweet and Spacious Brooklyn Loft                             0
## 6  8700 Magnifique Suite au N de Manhattan - vue Cloitres            0
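Both availability tables show six listings with the same value (365 or 0), so `head()` is only returning the first few of a larger set of ties. A minimal sketch for counting the ties instead, on a hypothetical toy data frame with made-up values:

```r
library(dplyr)

# Hypothetical toy data (made-up values) with ties at both extremes
toy <- data.frame(
  id = 1:6,
  availability_365 = c(0, 0, 365, 120, 365, 365)
)

# Count how many listings share the minimum and maximum availability
tie_counts <- toy %>%
  summarise(
    n_never_available = sum(availability_365 == 0),
    n_always_available = sum(availability_365 == 365)
  )

print(tie_counts)
```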

Plots

Univariate

Plot 1 - Price Distribution

ggplot(ab_nyc_data) + geom_histogram(aes(price), binwidth=15) + xlab("Price") + ylab("Frequency") + ggtitle("Price Distribution")

The full-range histogram is hard to read because the long right tail stretches the x-axis, so the next plot restricts x to the range 0 to 500, where the majority of the distribution lies.

ggplot(ab_nyc_data) + geom_histogram(aes(price), binwidth=15) + xlab("Price") + ylab("Frequency") + ggtitle("Price Distribution") + xlim(0,500)
## Warning: Removed 1044 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
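The warnings appear because `xlim()` sets the scale limits, so prices outside 0 to 500 are dropped before the bins are counted. `coord_cartesian()` only changes the visible window and keeps every row. A minimal sketch comparing the two on a hypothetical toy data frame (made-up prices):

```r
library(ggplot2)

# Hypothetical toy prices (made up), two of them above 500
toy <- data.frame(price = c(40, 60, 65, 70, 90, 120, 150, 480, 900, 10000))

# xlim() sets scale limits: values outside 0-500 are dropped before binning
p_xlim <- ggplot(toy) +
  geom_histogram(aes(price), binwidth = 15) +
  xlim(0, 500)

# coord_cartesian() zooms the view only; all rows are still binned
p_zoom <- ggplot(toy) +
  geom_histogram(aes(price), binwidth = 15) +
  coord_cartesian(xlim = c(0, 500))

# Total number of observations that ended up in a bin for each plot
n_binned_xlim <- sum(layer_data(p_xlim)$count)
n_binned_zoom <- sum(layer_data(p_zoom)$count)
```

For a histogram the visible bars look the same either way; the difference matters more for layers whose statistics depend on all the data, such as smoothers or boxplots.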

  • What variables are you visualizing: price
  • What questions are you attempting to answer: What is the price distribution of all the Airbnbs?
  • What conclusions can you make from the visualization: The most common prices are in the 55-70 range, and the distribution is right-skewed, with a long tail of expensive listings.

Limitations

  • What questions are left unanswered with your visualizations: How the prices above 500, which the zoomed plot leaves out, are distributed.

  • What about the visualizations may be unclear to a naive viewer: The price distribution is skewed, which makes it difficult to interpret.

  • How could you improve the visualizations for the final project: I zoomed in on the range that holds the majority of the distribution. Another option is to put the price axis on a logarithmic scale so the long tail stays visible.
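A minimal sketch of the logarithmic option, using `scale_x_log10()` on a hypothetical toy data frame (made-up prices). It assumes that any zero prices are filtered out first, because log10(0) is not finite and such rows would otherwise be dropped with a warning:

```r
library(ggplot2)

# Hypothetical toy prices (made up), including one zero
toy <- data.frame(price = c(0, 40, 60, 65, 70, 90, 120, 150, 480, 900, 10000))

# A log axis cannot show zero, so keep strictly positive prices only
toy_pos <- subset(toy, price > 0)

p_log <- ggplot(toy_pos) +
  geom_histogram(aes(price), bins = 20) +
  scale_x_log10() +
  xlab("Price (log scale)") +
  ylab("Frequency") +
  ggtitle("Price Distribution")

# Every remaining row lands in a bin, including the expensive ones
n_binned_log <- sum(layer_data(p_log)$count)
```

On a log axis the bins have equal width in log units, so `bins` is used here instead of the `binwidth = 15` from the plots above.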

Bivariate

Plot 2 - Average price of airbnb for each neighbourhood group

# stat="summary" with fun="mean" computes the mean price within each neighbourhood group
ggplot(ab_nyc_data, aes(x=neighbourhood_group, y=price, fill=neighbourhood_group)) + geom_bar(stat="summary", fun="mean") + xlab("Neighbourhood group") + ylab("Average Price") + ggtitle("Avg Price by neighbourhood group")

  • What variables are you visualizing: price, neighbourhood_group
  • What questions are you attempting to answer: What is the average price of Airbnbs in each neighbourhood group?
  • What conclusions can you make from the visualization: Manhattan has by far the highest average price (about 197), followed by Brooklyn (about 124) and Staten Island (about 115); Queens (99.5) and the Bronx (87.5) have the lowest.

Limitations

  • What questions are left unanswered with your visualizations: Whether the averages reflect a real trend. Staten Island has only 373 listings, against more than 20,000 each in Manhattan and Brooklyn, so its average rests on far less data. How do we know this is the real trend and not a lack of data?

  • What about the visualizations may be unclear to a naive viewer: The bars show only the mean, which hides the spread of prices within each group; the standard deviations computed earlier are larger than the means.

  • How could you improve the visualizations for the final project: I would make the graph more readable by labelling the y-axis with its units and showing the median or the spread alongside the mean.
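One way to address the sample-size question raised above is to compute the group means explicitly and print each group's listing count on its bar. A minimal sketch with `geom_col()` on a hypothetical toy data frame (made-up groups and prices):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for ab_nyc_data (all values are made up)
toy <- data.frame(
  neighbourhood_group = c("A", "A", "A", "B", "B"),
  price = c(100, 200, 300, 50, 150)
)

# Summarise first, so each bar height is exactly one group mean
group_means <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(n = n(), avg_price = mean(price))

# geom_col() draws the precomputed means; geom_text() adds the group sizes
p_means <- ggplot(group_means,
                  aes(x = neighbourhood_group, y = avg_price,
                      fill = neighbourhood_group)) +
  geom_col() +
  geom_text(aes(label = paste0("n = ", n)), vjust = -0.5) +
  xlab("Neighbourhood group") +
  ylab("Average Price") +
  ggtitle("Avg Price by neighbourhood group")
```

Summarising before plotting also makes the plotted numbers easy to check against a printed table.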

Plot 3 - Average price of airbnb for each room type

# stat="summary" with fun="mean" computes the mean price within each room type
ggplot(ab_nyc_data, aes(x=room_type, y=price, fill=room_type)) + geom_bar(stat="summary", fun="mean") + xlab("Room Type") + ylab("Average Price") + ggtitle("Avg Price by room type")

  • What variables are you visualizing: price, room_type
  • What questions are you attempting to answer: What is the average price of Airbnbs for each room type?
  • What conclusions can you make from the visualization: Airbnbs of room type ‘entire home/apt’ have a much higher average price than those of the ‘private room’ and ‘shared room’ types.

Limitations

  • What questions are left unanswered with your visualizations: Whether the averages reflect a real trend. There could be only a few ‘shared room’ listings and many more of the other room types, and the plot does not show the counts. How do we know this is the real trend and not a lack of data?

  • What about the visualizations may be unclear to a naive viewer: The bars show only the mean, so the spread of prices within each room type is hidden.

  • How could you improve the visualizations for the final project: I would make the graph more readable by labelling the y-axis with its units and showing the median or the spread alongside the mean.
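A boxplot is one possible way to show the spread that a bar of means hides. The following is a minimal sketch on a hypothetical toy data frame: the room type labels follow the ones named above, but the prices are made up, and the log y-axis is my own choice for handling the skew.

```r
library(ggplot2)

# Hypothetical toy data: room type labels as named above, prices made up
toy <- data.frame(
  room_type = rep(c("entire home/apt", "private room", "shared room"), each = 3),
  price = c(120, 180, 400,  60, 80, 150,  30, 40, 90)
)

# One box per room type; the log scale keeps expensive outliers readable
p_box <- ggplot(toy, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  scale_y_log10() +
  xlab("Room Type") +
  ylab("Price (log scale)") +
  ggtitle("Price by room type")

# The box midlines are medians on the log10 scale; convert back to prices
box_medians <- 10^layer_data(p_box)$middle
```

The midline of each box is the median, so this view complements the mean-based bar chart instead of replacing it.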