Assignment 5 Data 110

Author

Wesley Samimi

Load Libraries and Data

library(readxl)
Warning: package 'readxl' was built under R version 4.5.2
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)

setwd("C:/Users/wesle/Downloads/Data 110")
df <- read_excel("Airbnb_DC_25.csv")
df
# A tibble: 6,257 × 18
      id name       host_id host_name neighbourhood_group neighbourhood latitude
   <dbl> <chr>        <dbl> <chr>     <lgl>               <chr>            <dbl>
 1  3686 Vita's Hi…    4645 Vita      NA                  Historic Ana…     38.9
 2  3943 Historic …    5059 Vasa      NA                  Edgewood, Bl…     38.9
 3  4197 Capitol H…    5061 Sandra    NA                  Capitol Hill…     38.9
 4  4529 Bertina's…    5803 Bertina   NA                  Eastland Gar…     38.9
 5  5589 Cozy apt …    6527 Ami       NA                  Kalorama Hei…     38.9
 6  7103 Lovely gu…   17633 Charlotte NA                  Spring Valle…     38.9
 7 11785 Sanctuary…   32015 Teresa    NA                  Cathedral He…     38.9
 8 12442 Peaches &…   32015 Teresa    NA                  Cathedral He…     38.9
 9 13744 Heart of …   53927 Victoria  NA                  Columbia Hei…     38.9
10 14218 Quiet Com…   32015 Teresa    NA                  Cathedral He…     38.9
# ℹ 6,247 more rows
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <dttm>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>

Create a Data Visualization

df2 <- df |>
  filter(!is.na(reviews_per_month)) |> # drops 1,236 NA values (5,021 of 6,257 rows remain)
  filter(reviews_per_month < 5)

dfbox <- df2 |>
  ggplot(aes(x = room_type, y = reviews_per_month)) +
  geom_boxplot(aes(col = room_type)) +
  labs(x = "Room Type", y = "Reviews Per Month", 
       title = "Boxplot of the Relationship between Airbnb Room Types and Reviews per Month",
       caption = "Airbnb_DC_25.csv")

dfbox

The visualization I used was a box plot, which lets the median reviews per month for each Airbnb room type be compared directly. Before filtering, many extreme outliers heavily distorted how the graph looked, so I limited the plot to listings with fewer than 5 reviews per month. This removed about 1,598 observations, most of which (1,236) were NA values. As the box plot shows, the median for an entire home/apartment was around 1.5 reviews per month, while the other room types (private room, shared room, and hotel room) were all around 0.5, with private room the highest of the three and hotel room the lowest, closest to 0.5. This suggests that entire homes and apartments can typically be expected to receive more reviews per month than single rooms.
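The removal counts quoted above could be checked with a short tally like the one below. This is only a sketch: `toy` is a made-up tibble standing in for the listings (its values are invented for illustration), and on the real data it would be replaced by `df`.

```r
library(dplyr)

# Hypothetical toy data, NOT the real Airbnb file: 5 listings, 2 with NA, 1 above the cutoff
toy <- tibble(
  room_type = c("Entire home/apt", "Private room", "Entire home/apt",
                "Hotel room", "Shared room"),
  reviews_per_month = c(1.5, NA, 7.2, 0.4, NA)
)

# Count how many rows each filtering step would drop
removed <- toy |>
  summarise(
    n_total   = n(),
    n_na      = sum(is.na(reviews_per_month)),              # dropped by !is.na()
    n_over    = sum(reviews_per_month >= 5, na.rm = TRUE),  # dropped by < 5
    n_removed = n_na + n_over
  )

removed
```

Reporting the NA count and the over-the-cutoff count separately makes it clear how much of the reduction comes from missing values rather than from outliers.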

df2 <- df |>
  mutate(reviews_per_month1 = ifelse(is.na(reviews_per_month), 0, reviews_per_month)) |>
  filter(reviews_per_month1 < 5)

dfbox <- df2 |>
  ggplot(aes(x = room_type, y = reviews_per_month1)) +
  geom_boxplot(aes(col = room_type)) +
  labs(x = "Room Type", y = "Reviews Per Month", 
       title = "Boxplot of the Relationship between Airbnb Room Types and Reviews per Month",
       caption = "Airbnb_DC_25.csv")

dfbox

After examining the dataset further, I noticed that there were no values of 0 for reviews per month, which led me to realize that NA could mean 0 reviews per month. With NA treated as 0, only 362 observations are removed, and the medians shift so that the differences between room types are much more apparent. The median for entire home/apartment went from around 1.5 to around 1, shared room stayed at around 0.5, private room dropped from around 0.5 to close to 0.25, and hotel room dropped from 0.5 to what appears to be 0.
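The medians read off the two box plots could also be computed directly, which avoids estimating them by eye. The sketch below compares the two treatments of NA on a made-up `toy` tibble (values invented for illustration, all below the cutoff of 5, so the `< 5` filter is omitted); the room type labels are assumed, and `coalesce()` is used here as an alternative to the `ifelse()` call above.

```r
library(dplyr)

# Hypothetical toy data, NOT the real Airbnb file
toy <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Hotel room", "Hotel room", "Hotel room"),
  reviews_per_month = c(1, 2, NA, 0.5, NA, NA)
)

# Median per room type under both treatments of missing values
medians <- toy |>
  group_by(room_type) |>
  summarise(
    median_drop_na = median(reviews_per_month, na.rm = TRUE),  # NA rows ignored
    median_na_as_0 = median(coalesce(reviews_per_month, 0))    # NA counted as 0
  )

medians
```

In this toy example, counting NA as 0 pulls both medians down, and the group with more missing values falls further, which is the same pattern described for hotel rooms above.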