Assignment6

Assignment 6

Code Below:

#install all the packages and libraries that are helpful for data wrangling and visualization
install.packages("readxl")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("tidyverse")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("ggplot2")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("plotly")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
library(readxl)
library(tidyr)
library(ggplot2)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
#set the working directory and load the dataset
setwd("/cloud/project")
airbnb <- read_excel("Airbnb_DC_25.csv")
#preview the dataset and get an idea of the kind of information available
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
tibble(airbnb)
# A tibble: 6,257 × 18
      id name       host_id host_name neighbourhood_group neighbourhood latitude
   <dbl> <chr>        <dbl> <chr>     <lgl>               <chr>            <dbl>
 1  3686 Vita's Hi…    4645 Vita      NA                  Historic Ana…     38.9
 2  3943 Historic …    5059 Vasa      NA                  Edgewood, Bl…     38.9
 3  4197 Capitol H…    5061 Sandra    NA                  Capitol Hill…     38.9
 4  4529 Bertina's…    5803 Bertina   NA                  Eastland Gar…     38.9
 5  5589 Cozy apt …    6527 Ami       NA                  Kalorama Hei…     38.9
 6  7103 Lovely gu…   17633 Charlotte NA                  Spring Valle…     38.9
 7 11785 Sanctuary…   32015 Teresa    NA                  Cathedral He…     38.9
 8 12442 Peaches &…   32015 Teresa    NA                  Cathedral He…     38.9
 9 13744 Heart of …   53927 Victoria  NA                  Columbia Hei…     38.9
10 14218 Quiet Com…   32015 Teresa    NA                  Cathedral He…     38.9
# ℹ 6,247 more rows
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <dttm>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>
head(airbnb)
# A tibble: 6 × 18
     id name        host_id host_name neighbourhood_group neighbourhood latitude
  <dbl> <chr>         <dbl> <chr>     <lgl>               <chr>            <dbl>
1  3686 Vita's Hid…    4645 Vita      NA                  Historic Ana…     38.9
2  3943 Historic R…    5059 Vasa      NA                  Edgewood, Bl…     38.9
3  4197 Capitol Hi…    5061 Sandra    NA                  Capitol Hill…     38.9
4  4529 Bertina's …    5803 Bertina   NA                  Eastland Gar…     38.9
5  5589 Cozy apt i…    6527 Ami       NA                  Kalorama Hei…     38.9
6  7103 Lovely gue…   17633 Charlotte NA                  Spring Valle…     38.9
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <dttm>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>
colnames(airbnb)
 [1] "id"                             "name"                          
 [3] "host_id"                        "host_name"                     
 [5] "neighbourhood_group"            "neighbourhood"                 
 [7] "latitude"                       "longitude"                     
 [9] "room_type"                      "price"                         
[11] "minimum_nights"                 "number_of_reviews"             
[13] "last_review"                    "reviews_per_month"             
[15] "calculated_host_listings_count" "availability_365"              
[17] "number_of_reviews_ltm"          "license"                       

Brainstorming–what questions do I have about the data? The categories neighborhood, price, minimum stay and room type stand out to me. I’m interested to see which neighborhoods have the highest price for their Airbnb listings. I think I’ll put myself in the shoes of someone who wants to book a stay in DC, but isn’t picky about neighborhood. They have a minimum stay they need (for example, at least 1 night or at least 31 days) and they want to see which neighborhoods will be the cheapest for what they need.

#Look at some sample statistics for this dataset before I get deeper into wrangling.
summary(airbnb)
       id                   name         host_id              host_name   
 Min.   :3.686e+03   Length   :6257   Min.   :     4617   Length   :6257  
 1st Qu.:3.792e+07   N.unique :5996   1st Qu.: 22024017   N.unique :1860  
 Median :7.501e+17   N.blank  :   0   Median : 81005284   N.blank  :   0  
 Mean   :6.159e+17   Min.nchar:   4   Mean   :176451046   Min.nchar:   1  
 3rd Qu.:1.143e+18   Max.nchar: 107   3rd Qu.:304261532   Max.nchar:  34  
 Max.   :1.375e+18                    Max.   :681391481   NAs      :   2  
                                                                          
 neighbourhood_group   neighbourhood     latitude       longitude     
 Mode:logical        Length   :6257   Min.   :38.82   Min.   :-77.11  
 NAs :6257           N.unique :  39   1st Qu.:38.90   1st Qu.:-77.03  
                     N.blank  :   0   Median :38.91   Median :-77.01  
                     Min.nchar:  18   Mean   :38.91   Mean   :-77.01  
                     Max.nchar:  97   3rd Qu.:38.92   3rd Qu.:-76.99  
                                      Max.   :38.99   Max.   :-76.91  
                                                                      
     room_type        price        minimum_nights   number_of_reviews
 Length   :6257   Min.   :  10.0   Min.   :  1.00   Min.   :   0.00  
 N.unique :   4   1st Qu.:  88.0   1st Qu.:  1.00   1st Qu.:   1.00  
 N.blank  :   0   Median : 131.0   Median :  2.00   Median :  19.00  
 Min.nchar:  10   Mean   : 168.7   Mean   : 13.23   Mean   :  66.38  
 Max.nchar:  15   3rd Qu.: 193.0   3rd Qu.: 31.00   3rd Qu.:  86.00  
                  Max.   :7000.0   Max.   :701.00   Max.   :1205.00  
                  NAs    :1488                                       
  last_review                  reviews_per_month calculated_host_listings_count
 Min.   :2013-06-15 00:00:00   Min.   : 0.010    Min.   :  1.00                
 1st Qu.:2024-10-17 00:00:00   1st Qu.: 0.470    1st Qu.:  1.00                
 Median :2025-01-23 00:00:00   Median : 1.460    Median :  3.00                
 Mean   :2024-09-12 12:48:19   Mean   : 1.974    Mean   : 33.15                
 3rd Qu.:2025-02-27 00:00:00   3rd Qu.: 2.940    3rd Qu.: 14.00                
 Max.   :2025-03-14 00:00:00   Max.   :28.200    Max.   :289.00                
 NAs    :1236                  NAs    :1236                                    
 availability_365 number_of_reviews_ltm      license    
 Min.   :  0.0    Min.   :  0.0         Length   :6257  
 1st Qu.: 43.0    1st Qu.:  0.0         N.unique :2459  
 Median :175.0    Median :  5.0         N.blank  :   0  
 Mean   :175.8    Mean   : 15.8         Min.nchar:   6  
 3rd Qu.:303.0    3rd Qu.: 25.0         Max.nchar:  72  
 Max.   :365.0    Max.   :290.0         NAs      :1560  
                                                        
#I notice there are some NAs for the different observations, and neighbourhood_group is entirely NA
#I create a new datag.frame with the columns I care about.
airbnbnew <- airbnb %>% select(neighbourhood, room_type, minimum_nights, price)
airbnbnew
# A tibble: 6,257 × 4
   neighbourhood                                  room_type minimum_nights price
   <chr>                                          <chr>              <dbl> <dbl>
 1 Historic Anacostia                             Private …             31    60
 2 Edgewood, Bloomingdale, Truxton Circle, Eckin… Private …              1    63
 3 Capitol Hill, Lincoln Park                     Private …              4   128
 4 Eastland Gardens, Kenilworth                   Private …             30    64
 5 Kalorama Heights, Adams Morgan, Lanier Heights Entire h…             50    NA
 6 Spring Valley, Palisades, Wesley Heights, Fox… Entire h…             31    74
 7 Cathedral Heights, McLean Gardens, Glover Park Entire h…             31    85
 8 Cathedral Heights, McLean Gardens, Glover Park Private …             31    52
 9 Columbia Heights, Mt. Pleasant, Pleasant Plai… Entire h…             30   125
10 Cathedral Heights, McLean Gardens, Glover Park Private …             31    52
# ℹ 6,247 more rows
summary(airbnbnew)
   neighbourhood      room_type    minimum_nights       price       
 Length   :6257   Length   :6257   Min.   :  1.00   Min.   :  10.0  
 N.unique :  39   N.unique :   4   1st Qu.:  1.00   1st Qu.:  88.0  
 N.blank  :   0   N.blank  :   0   Median :  2.00   Median : 131.0  
 Min.nchar:  18   Min.nchar:  10   Mean   : 13.23   Mean   : 168.7  
 Max.nchar:  97   Max.nchar:  15   3rd Qu.: 31.00   3rd Qu.: 193.0  
                                   Max.   :701.00   Max.   :7000.0  
                                                    NAs    :1488    
#There are 3 outliers that are skewing my data--a few airbnbs are listed at $7000 per night, which is ridiculous! there are also over 1000 rows without any price data.
airbnbnew <- airbnbnew %>% filter(!is.na(price))
airbnbnew <- airbnbnew %>% filter(price < 7000)

#I also am going to remove any data that includes airbnbs with a minimum stay of more than one month, aka 31 days.
airbnbnew <- airbnbnew %>% filter(minimum_nights<= 31)
airbnbnew <- airbnbnew %>% filter_out(grepl("Hotel room",room_type))
airbnbnew <- airbnbnew %>% filter_out(grepl("Shared room",room_type))
summary(airbnbnew)
   neighbourhood      room_type    minimum_nights       price     
 Length   :4598   Length   :4598   Min.   : 1.000   Min.   :  10  
 N.unique :  39   N.unique :   2   1st Qu.: 1.000   1st Qu.:  88  
 N.blank  :   0   N.blank  :   0   Median : 2.000   Median : 131  
 Min.nchar:  18   Min.nchar:  12   Mean   : 9.974   Mean   : 163  
 Max.nchar:  97   Max.nchar:  15   3rd Qu.:31.000   3rd Qu.: 190  
                                   Max.   :31.000   Max.   :2800  
#Next I want to add a column that finds the average price of a night at the airbnb, sorted by the minimum nights allowed and the neighborhood. I'm going to use dplyr functions and create a new data frame with this information.
meanprice <- airbnbnew %>% group_by(neighbourhood,minimum_nights) %>% summarise(AveragePrice = mean(price))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by neighbourhood and minimum_nights.
ℹ Output is grouped by neighbourhood.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(neighbourhood, minimum_nights))` for per-operation
  grouping (`?dplyr::dplyr_by`) instead.
meanprice
# A tibble: 295 × 3
# Groups:   neighbourhood [39]
   neighbourhood                        minimum_nights AveragePrice
   <chr>                                         <dbl>        <dbl>
 1 Brightwood Park, Crestwood, Petworth              1        111. 
 2 Brightwood Park, Crestwood, Petworth              2        130. 
 3 Brightwood Park, Crestwood, Petworth              3        132. 
 4 Brightwood Park, Crestwood, Petworth              4        245. 
 5 Brightwood Park, Crestwood, Petworth              5        102  
 6 Brightwood Park, Crestwood, Petworth              6        282. 
 7 Brightwood Park, Crestwood, Petworth              7        139. 
 8 Brightwood Park, Crestwood, Petworth             13         65.5
 9 Brightwood Park, Crestwood, Petworth             14         35  
10 Brightwood Park, Crestwood, Petworth             28         66  
# ℹ 285 more rows
#Finally, I'm going to create a heatmap, where one axis has all of the neighborhoods, and the other axis has the minimum night stay. 

heatmapairbnb <- meanprice %>% ggplot(aes(x=minimum_nights,
                                      y=neighbourhood, fill = AveragePrice))+
  geom_tile(color ="grey")+
  scale_fill_gradientn(colours = c("darkgreen","lightyellow", "yellow","orange","red"))+
#The neighborhoods are very long, so in order to display nicely we're going to truncate the labels using stringr so that they're not longer than 15 characters.
  scale_y_discrete(label = function(y) stringr::str_trunc(y, 15))+
#The minimum nights axis needs more breaks to be more visually appealing, and we're going to change the panel background so it's less distracting, choosing a white background.
  scale_x_continuous(breaks=seq(1,32, by=2))+
  theme(panel.background = element_rect(fill = "white", colour="white"),
          aspect.ratio = 2/1,
        text = element_text(size=10),
        axis.text.y = element_text(size=8),
        axis.text.x = element_text(size=8))+
  
#Finally, adding the titles, labels and captions. I'm adding a caption with instructions on how to use the graph.
  labs(title="Which Neighborhoods Offer the Cheapest Prices?",
       subtitle="Based on your desired minimum night stay, look for your neighborhood.",
       x = "Minimum Night Stay", 
       y = "Neighborhood")
heatmapairbnb

#This is really dense, so I'm going to split my data up into a few different tables and then aggregate the tables.
#I also felt like this heat map was a little dense, so I created objects containing approximately 12-15 of the neighborhood options, to use as a list to filter out of the data frame. 

group1 <- c("Brightwood Park","Brookland","Capitol Hill","Capitol View","   
Cathedral Heights", "Woodley Park","Colonial Village", "Columbia Heights","Bellevue","Deanwood","Douglas","Penn Quarter","Dupont Circle") 
group2 <- c("Eastland Gardens","Edgewood","Fairfax Village","Friendship Heights","Georgetown","Hawthorne","Historic Anacostia","Howard University","Ivy City","Kalorama Heights","Lamont Riggs","Mayfair","Navy Yard","North Cleveland Park")
group3 <- c("North Michigan Park","River Terrace","Logan Circle","Sheridan","Southwest Employment Area","Spring Valley","Takoma","Twining", "Union Station","West End", "Garfield Heights","Woodridge")

#I then filter the data frame to create 3 new data frames to make heatmaps from. This will allow me to create 3 new heatmaps that aren't as spread out along the y axis as the original visualization.
meanprice1 <- meanprice %>% filter(grepl(paste(group1, collapse = '|'),neighbourhood))
meanprice2 <- meanprice %>% filter(grepl(paste(group2, collapse = '|'),neighbourhood))
meanprice3<- meanprice %>% filter(grepl(paste(group3, collapse = '|'),neighbourhood))

meanprice1
# A tibble: 101 × 3
# Groups:   neighbourhood [12]
   neighbourhood                        minimum_nights AveragePrice
   <chr>                                         <dbl>        <dbl>
 1 Brightwood Park, Crestwood, Petworth              1        111. 
 2 Brightwood Park, Crestwood, Petworth              2        130. 
 3 Brightwood Park, Crestwood, Petworth              3        132. 
 4 Brightwood Park, Crestwood, Petworth              4        245. 
 5 Brightwood Park, Crestwood, Petworth              5        102  
 6 Brightwood Park, Crestwood, Petworth              6        282. 
 7 Brightwood Park, Crestwood, Petworth              7        139. 
 8 Brightwood Park, Crestwood, Petworth             13         65.5
 9 Brightwood Park, Crestwood, Petworth             14         35  
10 Brightwood Park, Crestwood, Petworth             28         66  
# ℹ 91 more rows
meanprice2
# A tibble: 103 × 3
# Groups:   neighbourhood [15]
   neighbourhood                                     minimum_nights AveragePrice
   <chr>                                                      <dbl>        <dbl>
 1 Eastland Gardens, Kenilworth                                   2        164. 
 2 Eastland Gardens, Kenilworth                                   3         78  
 3 Eastland Gardens, Kenilworth                                  30         64  
 4 Eastland Gardens, Kenilworth                                  31         47.3
 5 Edgewood, Bloomingdale, Truxton Circle, Eckington              1        110. 
 6 Edgewood, Bloomingdale, Truxton Circle, Eckington              2        128. 
 7 Edgewood, Bloomingdale, Truxton Circle, Eckington              3        158. 
 8 Edgewood, Bloomingdale, Truxton Circle, Eckington              4        150  
 9 Edgewood, Bloomingdale, Truxton Circle, Eckington             14        138. 
10 Edgewood, Bloomingdale, Truxton Circle, Eckington             28        110. 
# ℹ 93 more rows
meanprice3
# A tibble: 90 × 3
# Groups:   neighbourhood [12]
   neighbourhood                                     minimum_nights AveragePrice
   <chr>                                                      <dbl>        <dbl>
 1 North Michigan Park, Michigan Park, University H…              1        113. 
 2 North Michigan Park, Michigan Park, University H…              2        127. 
 3 North Michigan Park, Michigan Park, University H…              3         79.2
 4 North Michigan Park, Michigan Park, University H…              4        140  
 5 North Michigan Park, Michigan Park, University H…              5         52.2
 6 North Michigan Park, Michigan Park, University H…              7         47.5
 7 North Michigan Park, Michigan Park, University H…              8         83  
 8 North Michigan Park, Michigan Park, University H…             10         57  
 9 North Michigan Park, Michigan Park, University H…             26         56  
10 North Michigan Park, Michigan Park, University H…             30         41.5
# ℹ 80 more rows
#Now I use the 3 new objects (meanprice1,2,3) to make new heatmaps.
heatmap1 <- meanprice1 %>% ggplot(aes(x=minimum_nights,
                                                y=neighbourhood, fill = AveragePrice))+
  geom_tile(color ="grey")+
  scale_fill_gradientn(colours = c("darkgreen","lightyellow", "yellow","orange","red"))+
  scale_y_discrete(label = function(y) stringr::str_trunc(y, 15))+
  scale_x_continuous(breaks=seq(1,32, by=2))+
  theme(panel.background = element_rect(fill = "white", colour="white"),
        aspect.ratio = 1/2,
        text = element_text(size=10),
        axis.text.y = element_text(size=8),
        axis.text.x = element_text(size=8))+
  labs(title="Which Neighborhoods Offer the Cheapest Prices? (1/3)",
       subtitle="Based on your desired minimum night stay, look for your neighborhood.",
       x = "Minimum Night Stay", 
       y = "Neighborhood")

heatmap2 <- meanprice2 %>% ggplot(aes(x=minimum_nights,
                                                y=neighbourhood, fill = AveragePrice))+
  geom_tile(color ="grey")+
  scale_fill_gradientn(colours = c("darkgreen","lightyellow", "yellow","orange","red"))+
  scale_y_discrete(label = function(y) stringr::str_trunc(y, 15))+
  scale_x_continuous(breaks=seq(1,32, by=2))+
  theme(panel.background = element_rect(fill = "white", colour="white"),
        aspect.ratio = 1/2,
        text = element_text(size=10),
        axis.text.y = element_text(size=8),
        axis.text.x = element_text(size=8))+
  labs(title="Which Neighborhoods Offer the Cheapest Prices?(2/3)",
       subtitle="Based on your desired minimum night stay, look for your neighborhood.",
       x = "Minimum Night Stay", 
       y = "Neighborhood")

heatmap3 <- meanprice3 %>% ggplot(aes(x=minimum_nights,
                                                y=neighbourhood, fill = AveragePrice))+
  geom_tile(color ="grey")+
  scale_fill_gradientn(colours = c("darkgreen","lightyellow", "yellow","orange","red"))+
  scale_y_discrete(label = function(y) stringr::str_trunc(y, 15))+
  scale_x_continuous(breaks=seq(1,32, by=2))+
  theme(panel.background = element_rect(fill = "white", colour="white"),
        aspect.ratio = 1/2,
        text = element_text(size=10),
        axis.text.y = element_text(size=8),
        axis.text.x = element_text(size=8))+
  labs(title="Which Neighborhoods Offer the Cheapest Prices? (3/3)",
       subtitle="Based on your desired minimum night stay, look for your neighborhood.",
       x = "Minimum Night Stay", 
       y = "Neighborhood")

heatmap1

heatmap2

heatmap3

#A viewer can now skim through these three heatmaps.

Write a short paragraph desribing the visualization you created.

The visualization I created was a heatmap, showing the different neighborhoods you can find an airbnb in, vs. the minimum night stays you can find in that neighborhood. The viewer can then see different colors that demonstrate the average price-per-night for an airbnb in that neighborhood.

Looking at the visualization above, you can see that some of the cheapest neighborhoods to stay in for a one-night stay would be Takoma, which is highlighted in a green color indicating an about $100 or less nightly price. However, if you wanted to stay in Georgetown, you can see by the yellow color you would expect to pay about $300 for a one-night stay. But lets say you’re looking for at minimum, one MONTH (31 days) in DC, you should avoid Woodridge because on average a 1 month stay would cost about $400 per night! That would be a total of $12,000! Better to look at almost any other neighborhood, since the light yellow and green indicate your nightly price for a month stay would average at $100-$200, which would save you almost half the price of the Woodridge monthly stay.