knitr::opts_chunk$set(echo = TRUE)

HW4: Visualizations

This is the R Markdown file for HW5 of DACS-601, Summer 2022. I’m using the New York City Airbnb CSV file from the Sample Datasets.

# tidyverse attaches readr, dplyr, and ggplot2, so they need no separate library() calls
library(tidyverse)

Loading data

ab_nyc_data <- read_csv("C:/Users/apoor/Desktop/UMass/Summer 2022/DACS 601 - R Programming/datasets/AB_NYC_2019.csv")
head(ab_nyc_data)

# A tibble: 6 × 16
     id name          host_id host_name neighbourhood_g… neighbourhood
  <dbl> <chr>           <dbl> <chr>     <chr>            <chr>        
1  2539 Clean & quie…    2787 John      Brooklyn         Kensington   
2  2595 Skylit Midto…    2845 Jennifer  Manhattan        Midtown      
3  3647 THE VILLAGE …    4632 Elisabeth Manhattan        Harlem       
4  3831 Cozy Entire …    4869 LisaRoxa… Brooklyn         Clinton Hill 
5  5022 Entire Apt: …    7192 Laura     Manhattan        East Harlem  
6  5099 Large Cozy 1…    7322 Chris     Manhattan        Murray Hill  
# … with 10 more variables: latitude <dbl>, longitude <dbl>,
#   room_type <chr>, price <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, last_review <date>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>

Data analysis

1. Number of Airbnb listings in each neighbourhood group

head(ab_nyc_data %>% count(neighbourhood_group))

# A tibble: 5 × 2
  neighbourhood_group     n
  <chr>               <int>
1 Bronx                1091
2 Brooklyn            20104
3 Manhattan           21661
4 Queens               5666
5 Staten Island         373

2. Average Airbnb price in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(avg_price= mean(price)) %>% arrange(desc(avg_price)))

# A tibble: 5 × 2
  neighbourhood_group avg_price
  <chr>                   <dbl>
1 Manhattan               197. 
2 Brooklyn                124. 
3 Staten Island           115. 
4 Queens                   99.5
5 Bronx                    87.5

3. Median Airbnb price in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(median_price= median(price)) %>% arrange(desc(median_price)))

# A tibble: 5 × 2
  neighbourhood_group median_price
  <chr>                      <dbl>
1 Manhattan                    150
2 Brooklyn                      90
3 Queens                        75
4 Staten Island                 75
5 Bronx                         65

4. Standard deviation of Airbnb price (divided by 10) in each neighbourhood group

head(ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(sd_price= sd(price)/10) %>% arrange(desc(sd_price)))

# A tibble: 5 × 2
  neighbourhood_group sd_price
  <chr>                  <dbl>
1 Manhattan               29.1
2 Staten Island           27.8
3 Brooklyn                18.7
4 Queens                  16.7
5 Bronx                   10.7
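Items 1 to 4 could also be computed in a single pass. The following is only a minimal sketch, assuming dplyr is installed; `toy` is a made-up stand-in for `ab_nyc_data` with invented prices, and the standard deviation is left unscaled here (no division by 10).

```r
library(dplyr)

# Toy stand-in for ab_nyc_data (invented values, not the real listings)
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"),
  price = c(100, 200, 300, 50, 70)
)

# One row per group: count, mean, median, and unscaled standard deviation
price_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    n = n(),
    avg_price = mean(price),
    median_price = median(price),
    sd_price = sd(price),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_price))

print(price_summary)
```

Keeping `n` next to the price statistics makes it easier to see how many listings each figure is based on.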

5. Host with the most listings

head(ab_nyc_data %>% count(host_id) %>% arrange(desc(n)))

# A tibble: 6 × 2
    host_id     n
      <dbl> <int>
1 219517861   327
2 107434423   232
3  30283594   121
4 137358866   103
5  12243051    96
6  16098958    96

6. Airbnb with the most reviews

head(ab_nyc_data %>% group_by(id, name) %>% summarise(reviews= sum(number_of_reviews)) %>% arrange(desc(reviews)))

# A tibble: 6 × 3
# Groups:   id [6]
        id name                              reviews
     <dbl> <chr>                               <dbl>
1  9145202 Room near JFK Queen Bed               629
2   903972 Great Bedroom in Manhattan            607
3   903947 Beautiful Bedroom in Manhattan        597
4   891117 Private Bedroom in Manhattan          594
5 10101135 Room Near JFK Twin Beds               576
6  8168619 Steps away from Laguardia airport     543
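If each `id` appears only once in the data (an assumption that the sketch checks explicitly), the same ranking can be obtained without grouping and summing. A minimal sketch using `slice_max()` from dplyr 1.0.0 or later; `toy` is an invented table, not the real listings.

```r
library(dplyr)

# Toy stand-in for ab_nyc_data (invented values)
toy <- tibble(
  id = c(1, 2, 3, 4),
  name = c("A", "B", "C", "D"),
  number_of_reviews = c(10, 629, 45, 300)
)

# Check the assumption that every listing id is unique
ids_unique <- anyDuplicated(toy$id) == 0

# Keep the two most-reviewed listings, sorted from most to fewest reviews
top_reviewed <- toy %>%
  select(id, name, number_of_reviews) %>%
  slice_max(number_of_reviews, n = 2)

print(top_reviewed)
```

On the real data, `n = 6` would correspond to the six rows that `head()` shows above.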

7. Most available Airbnb

head(ab_nyc_data %>% group_by(id, name) %>% summarise(availability= mean(availability_365)) %>% arrange(desc(availability)))

# A tibble: 6 × 3
# Groups:   id [6]
     id name                                availability
  <dbl> <chr>                                      <dbl>
1  2539 Clean & quiet apt home by the park           365
2  3647 THE VILLAGE OF HARLEM....NEW YORK !          365
3 11452 Clean and Quiet in Brooklyn                  365
4 11943 Country space in the city                    365
5 21644 Upper Manhattan, New York                    365
6 32037 Huge Private  Floor at The Waverly           365

8. Least available Airbnb

head(ab_nyc_data %>% group_by(id, name) %>% summarise(availability= mean(availability_365)) %>% arrange(availability))

# A tibble: 6 × 3
# Groups:   id [6]
     id name                                              availability
  <dbl> <chr>                                                    <dbl>
1  5022 Entire Apt: Spacious Studio/Loft by central park             0
2  5121 BlissArtsSpace!                                              0
3  5203 Cozy Clean Guest Room - Family Apt                           0
4  6090 West Village Nest - Superhost                                0
5  7801 Sweet and Spacious Brooklyn Loft                             0
6  8700 Magnifique Suite au N de Manhattan - vue Cloitres            0
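In both availability tables the six rows shown are tied (all 365 or all 0), so which listings appear first is arbitrary among the ties. Counting how many listings sit at each extreme may be more informative. A minimal sketch on an invented `toy` table:

```r
library(dplyr)

# Toy stand-in for ab_nyc_data (invented values)
toy <- tibble(
  id = 1:6,
  availability_365 = c(365, 0, 0, 120, 365, 0)
)

# Number of listings that are never available and always available
availability_extremes <- toy %>%
  summarise(
    never_available = sum(availability_365 == 0),
    always_available = sum(availability_365 == 365)
  )

print(availability_extremes)
```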

Plots

Univariate

Plot 1 - Price Distribution

ggplot(ab_nyc_data) + geom_histogram(aes(price), binwidth=15) + xlab("Price (USD)") + ylab("Frequency") + ggtitle("Price Distribution")

Since the full-range plot is hard to read, the next plot zooms in on prices from 0 to 500 USD, where most of the distribution lies.

ggplot(ab_nyc_data) + geom_histogram(aes(price), binwidth=15) + xlab("Price (USD)") + ylab("Frequency") + ggtitle("Price Distribution") + xlim(0,500)
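One caveat with `xlim()`: it sets the scale limits, so listings priced outside 0 to 500 are dropped before the bins are counted, and ggplot2 warns about the removed rows. A possible alternative is `coord_cartesian()`, which zooms the view without discarding data. This is a minimal sketch only; `toy` holds invented prices standing in for `ab_nyc_data`.

```r
library(ggplot2)

# Toy stand-in for ab_nyc_data (invented prices, including two extreme values)
toy <- data.frame(price = c(40, 60, 65, 80, 120, 150, 300, 2500, 10000))

# coord_cartesian() changes only the visible window; all rows are still binned
p_zoom <- ggplot(toy) +
  geom_histogram(aes(price), binwidth = 15) +
  coord_cartesian(xlim = c(0, 500)) +
  xlab("Price (USD)") +
  ylab("Frequency") +
  ggtitle("Price Distribution (zoomed to 0-500 USD)")

print(p_zoom)
```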

What variables are you visualizing: price
What questions are you attempting to answer: What is the price distribution of all the Airbnbs?
What conclusions can you make from the visualization: The distribution peaks at roughly 55-70 USD, which is the most common price range.

Limitations

What questions are left unanswered with your visualizations: How the prices above 500 USD are distributed, since the zoomed plot leaves them out.
What about the visualizations may be unclear to a naive viewer: The price distribution looks skewed and is difficult to interpret.
How could you improve the visualizations for the final project: I zoomed in on the range that holds most of the distribution. Another way to deal with the skew is to put the price axis on a logarithmic scale.
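The logarithmic option mentioned above could look roughly like this. It is a sketch under assumptions: `toy` is an invented stand-in for `ab_nyc_data`, and prices of 0 are filtered out in case any exist, because the logarithm of 0 is not finite.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for ab_nyc_data (invented prices, including one zero)
toy <- tibble(price = c(0, 40, 60, 65, 80, 120, 150, 300, 2500, 10000))

# log10(0) is -Inf, so keep only strictly positive prices
positive_prices <- toy %>% filter(price > 0)

# scale_x_log10() spaces the axis by orders of magnitude, which compresses the long right tail
p_log <- ggplot(positive_prices) +
  geom_histogram(aes(price), bins = 30) +
  scale_x_log10() +
  xlab("Price (USD, log scale)") +
  ylab("Frequency") +
  ggtitle("Price Distribution (log scale)")

print(p_log)
```

Unlike zooming, this keeps every positive price visible in a single plot.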

Bivariate

Plot 2 - Median Airbnb price for each neighbourhood group

ab_nyc_data2 <- ab_nyc_data %>% group_by(neighbourhood_group) %>% summarise(sd=sd(price)/10, median=median(price))

ggplot(ab_nyc_data2, aes(x=neighbourhood_group, y=median, fill=neighbourhood_group)) + geom_bar(stat="identity") + xlab("Neighbourhood group") + ylab("Median Price (USD)") + ggtitle("Median Price by neighbourhood group") + geom_errorbar(aes(x=neighbourhood_group, ymin=median-sd, ymax=median+sd), width=0.6, colour="black")

What variables are you visualizing: price, neighbourhood_group
What questions are you attempting to answer: What is the median price of Airbnbs for each neighbourhood group?
What conclusions can you make from the visualization: Manhattan has by far the highest median price (150 USD), followed by Brooklyn (90 USD). Queens and Staten Island (75 USD each) and the Bronx (65 USD) are lower.

Limitations

What questions are left unanswered with your visualizations: Whether the differences reflect a real trend. Staten Island has only 373 listings, far fewer than the other groups, so how do we know its median is a real trend and not an effect of limited data?
What about the visualizations may be unclear to a naive viewer: The y-axis might be unclear to a naive viewer
How could you improve the visualizations for the final project: I would make the graph more readable by scaling the y-axis to human-readable values
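One possible way to address the sample-size question, sketched under assumptions: draw the interquartile range as the error bar instead of the scaled standard deviation, and print each group's size above its bar so that a small group stands out. `toy` is invented data standing in for `ab_nyc_data`.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for ab_nyc_data (invented values)
toy <- tibble(
  neighbourhood_group = rep(c("Manhattan", "Staten Island"), times = c(5, 3)),
  price = c(100, 120, 150, 200, 400, 60, 75, 90)
)

# Median, quartiles, and group size per neighbourhood group
group_summary <- toy %>%
  group_by(neighbourhood_group) %>%
  summarise(
    n = n(),
    median = median(price),
    q25 = quantile(price, 0.25, names = FALSE),
    q75 = quantile(price, 0.75, names = FALSE),
    .groups = "drop"
  )

# Bars show the median, error bars span the interquartile range, labels show n
p_iqr <- ggplot(group_summary, aes(x = neighbourhood_group, y = median, fill = neighbourhood_group)) +
  geom_col() +
  geom_errorbar(aes(ymin = q25, ymax = q75), width = 0.3, colour = "black") +
  geom_text(aes(y = q75, label = paste0("n = ", n)), vjust = -0.5) +
  xlab("Neighbourhood group") +
  ylab("Median Price (USD)") +
  ggtitle("Median Price by neighbourhood group (error bars: IQR)")

print(p_iqr)
```

The interquartile range pairs naturally with the median, whereas a standard deviation is centred on the mean.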

Plot 3 - Median Airbnb price for each room type

ab_nyc_data3 <- ab_nyc_data %>% group_by(room_type) %>% summarise(sd=sd(price)/10, median=median(price))

ggplot(ab_nyc_data3, aes(x=room_type, y=median, fill=room_type)) + geom_bar(stat="identity") + xlab("Room type") + ylab("Median Price") + ggtitle("Median Price by room type") + geom_errorbar(aes(x=room_type, ymin=median-sd, ymax=median+sd), width=0.6, colour="black")

What variables are you visualizing: price, room_type
What questions are you attempting to answer: What is the median price of Airbnbs for each room type?
What conclusions can you make from the visualization: Airbnbs of room types ‘entire home/apt’ and ‘private room’ have a much higher median price than the ‘shared room’ type

Limitations

What questions are left unanswered with your visualizations: Whether the differences reflect a real trend. There could be only a few ‘shared room’ listings and many more of the other room types, so how do we know this is a real trend and not an effect of limited data?
What about the visualizations may be unclear to a naive viewer: The y-axis might be unclear to a naive viewer
How could you improve the visualizations for the final project: I would make the graph more readable by scaling the y-axis to human-readable values
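For more readable y-axis values, one option is to format the tick labels as currency. The following is a minimal sketch, assuming the scales package is installed alongside ggplot2; the medians in `toy_summary` are invented and are not the values from this dataset.

```r
library(ggplot2)

# Toy summary table (invented medians, standing in for ab_nyc_data3)
toy_summary <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Shared room"),
  median = c(160, 70, 45)
)

# label_dollar() turns numeric axis breaks into labels with a dollar sign
p_dollar <- ggplot(toy_summary, aes(x = room_type, y = median, fill = room_type)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_dollar()) +
  xlab("Room type") +
  ylab("Median Price (USD)") +
  ggtitle("Median Price by room type")

print(p_dollar)
```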

Facetwrap

Plot 4 - Median Airbnb price for each room type by neighbourhood group

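# Summarise first so that each bar is the median price of one room type within one neighbourhood group
ab_nyc_data4 <- ab_nyc_data %>% group_by(neighbourhood_group, room_type) %>% summarise(median=median(price), .groups="drop")

ggplot(ab_nyc_data4, aes(x=room_type, y=median, fill=room_type)) + geom_bar(stat="identity") + xlab("Room Type") + ylab("Median Price (USD)") + ggtitle("Median Price for each room type by neighbourhood group") + facet_wrap(vars(neighbourhood_group)) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))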

What variables are you visualizing: price, room_type, neighbourhood_group
What questions are you attempting to answer: What is the median price of Airbnbs for each room type by neighbourhood group?
What conclusions can you make from the visualization: Airbnbs of room types ‘entire home/apt’ and ‘private room’ have a much higher median price in the neighbourhood groups Brooklyn and Manhattan.

Limitations

What questions are left unanswered with your visualizations: The in-depth trend of the data: what more can we learn about the price differences between neighbourhood groups?
What about the visualizations may be unclear to a naive viewer: The y-axis might be unclear to a naive viewer
How could you improve the visualizations for the final project: I would make the graph more readable by scaling the y-axis to human-readable values

Final Questions

What is missing (if anything) in your analysis process so far?: Some further analyses are still missing, such as examining the relationship between the price of Airbnbs and their number of reviews, or even predicting Airbnb prices from the data (I am not sure whether this is in the scope of this course).
What conclusions can you make about your research questions at this point?: I have answered the initial questions I had in mind for the data; I might add more analysis later on.
What conclusions can you make from the visualization: I drew several conclusions from my analysis:

Manhattan is the most expensive neighbourhood group.
Brooklyn can be the second best choice considering the price and number of reviews.
Entire home/apts are much more expensive than private or shared rooms.

What do you think a naive reader would need to fully understand your graphs?: They would need basic knowledge of how to interpret graphs, such as bar graphs and error bars. Without that prior knowledge, the plots might not make sense to a naive reader right away.
Is there anything you want to answer with your dataset, but can’t?: To predict Airbnb prices as accurately as possible, I would need more data, such as the number of rooms, the area in square feet, and the proximity to prime locations. I feel the data I currently have might not be enough.
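The price versus number of reviews comparison mentioned under the first question could start from a scatter plot and a rank correlation. This is only a sketch under assumptions: `toy` is invented data standing in for `ab_nyc_data`, and Spearman's rank correlation is one possible choice here, picked because it is less sensitive to skewed prices than Pearson's.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for ab_nyc_data (invented values)
toy <- tibble(
  price = c(50, 80, 120, 200, 500),
  number_of_reviews = c(300, 150, 90, 20, 2)
)

# Spearman correlation uses ranks, so extreme prices do not dominate the result
rho <- cor(toy$price, toy$number_of_reviews, method = "spearman")

# Scatter plot with price on a log scale to spread out the lower price range
p_reviews <- ggplot(toy, aes(x = number_of_reviews, y = price)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  xlab("Number of reviews") +
  ylab("Price (USD, log scale)") +
  ggtitle("Price vs. number of reviews")

print(p_reviews)
```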
