Challenge 5

Pull in data

library(tidyverse)  # provides read_csv(), %>% and ggplot()
NYCHousing2019 <- read_csv("challenge_datasets/AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NYCHousing2019)
## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

The table looks clean. The column names are tidy, and each column has the type I would expect (e.g. last_review is parsed as a date). Some columns carry overlapping information: host_id and host_name both identify the host, and I could drop host_name, since names are more likely to repeat than IDs, and keep host_id. Because the focus here is on visuals, I will skip removing columns for this challenge.
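For reference, here is a minimal sketch of how the checks described above could look. The `toy` tibble and its values are made up for illustration and only stand in for NYCHousing2019; the column names come from the real table.

```r
library(dplyr)

# Hypothetical stand-in for NYCHousing2019 (values are invented for illustration)
toy <- tibble(
  host_id     = c(1, 2, 3),
  host_name   = c("John", "Jennifer", "John"),
  last_review = as.Date(c("2019-01-15", "2019-03-02", NA))
)

# Confirm the review date column really is a Date
class(toy$last_review)

# Compare distinct ids with distinct names: fewer names than ids
# means host_name repeats across different hosts
n_distinct(toy$host_id)
n_distinct(toy$host_name)

# Drop the redundant column and keep host_id as the identifier
toy_trimmed <- toy %>% select(-host_name)
names(toy_trimmed)
```

On the real data the same calls would be applied to NYCHousing2019 directly; I skip that step in this challenge.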

Univariate visualization

Frequency of different neighborhood groups in NYC

First, I would like to see the frequency of the different neighborhood groups in New York. I pipe the data into ggplot and draw the neighbourhood_group column as a bar graph with a minimal theme, then add a title and labels for the x and y axes.

Manhattan has the most rentals, followed closely by Brooklyn.

NYCHousing2019 %>%
  ggplot(aes(neighbourhood_group)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Different Neighborhood groups in NYC rentals", x = "Neighborhood groups", y = "Number of rentals")
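A bar graph shows the ranking, but a count table makes "followed closely" easier to check. The sketch below assumes a small made-up `toy` tibble in place of NYCHousing2019; the counts are not the real ones.

```r
library(dplyr)

# Hypothetical stand-in for NYCHousing2019 (invented rows, not real counts)
toy <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn", "Queens")
)

# Tally listings per neighborhood group, largest group first
group_counts <- toy %>%
  count(neighbourhood_group, sort = TRUE)

group_counts
```

Replacing `toy` with NYCHousing2019 would give the exact number of rentals behind each bar.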

Frequency of different rental types in NYC

Next, I want a visual sense of how frequent the different rental types are in NYC. I pipe the data into ggplot and draw the room_type column as a bar graph with a minimal theme, then add a title and labels for the x and y axes.

Entire homes are most common, followed closely by private rooms.

NYCHousing2019 %>%
  ggplot(aes(room_type)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Different housing types in NYC rentals", x = "Housing type", y = "Number of rentals")
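To put a number on how close the rental types are, the counts can be turned into shares. This is only a sketch: the `toy` tibble is made up, and the room_type labels in it are assumed rather than taken from the output above.

```r
library(dplyr)

# Hypothetical stand-in for NYCHousing2019 (labels and counts are assumptions)
toy <- tibble(
  room_type = c(rep("Entire home/apt", 5),
                rep("Private room", 4),
                rep("Shared room", 1))
)

# Count each room type, then express each count as a share of all listings
room_shares <- toy %>%
  count(room_type) %>%
  mutate(share = n / sum(n))

room_shares
```

The share column sums to 1, so it reads directly as the fraction of rentals in each type.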

Bivariate visualization

Price compared to rental type

Next, I want to compare the spread of prices across rental types. I pipe the data into ggplot and draw price against room_type as a boxplot with a minimal theme, then add a title and labels for the x and y axes.

Although entire homes are more often priced high, it is surprising that private rooms also have some listings with quite high prices.

NYCHousing2019 %>% 
  ggplot(aes(room_type, price)) + 
  geom_boxplot() + 
  theme_minimal() +
  labs(title = "Price for different room types", x = "Type of rentals", y = "Dollar price")
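Because a few very expensive listings can squash the boxes, a grouped summary and a log-scaled y axis are one possible follow-up. The sketch below uses a made-up `toy` tibble with assumed room_type labels and invented prices. A log scale also assumes every price is above zero, which I have not checked for the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for NYCHousing2019 (labels and prices are invented)
toy <- tibble(
  room_type = c(rep("Entire home/apt", 4), rep("Private room", 4)),
  price     = c(100, 150, 200, 1000, 50, 60, 70, 900)
)

# Median, interquartile range and maximum price per room type
price_summary <- toy %>%
  group_by(room_type) %>%
  summarise(
    median_price = median(price),
    iqr_price    = IQR(price),
    max_price    = max(price)
  )

price_summary

# Same boxplot as above, but with a log10 y axis so outliers compress less
p <- toy %>%
  ggplot(aes(room_type, price)) +
  geom_boxplot() +
  scale_y_log10() +
  theme_minimal() +
  labs(title = "Price for different room types (log scale)",
       x = "Type of rentals", y = "Dollar price")
```

The median and IQR describe the typical listing, while the maximum shows how far the high-priced outliers reach in each room type.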