March 8 Assignment

Author

Marie Adele Grosso

Identify working directory

getwd()
[1] "/Users/marieadelegrosso/Desktop/Data"

Load in Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load in Data

setwd("/Users/marieadelegrosso/Desktop/Data")
airbnb <- read_csv("Airbnb_DC_25.csv")
Rows: 6257 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): name, host_name, neighbourhood, room_type, last_review, license
dbl (11): id, host_id, latitude, longitude, price, minimum_nights, number_of...
lgl  (1): neighbourhood_group

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Preview Data

head(airbnb)
# A tibble: 6 × 18
     id name        host_id host_name neighbourhood_group neighbourhood latitude
  <dbl> <chr>         <dbl> <chr>     <lgl>               <chr>            <dbl>
1  3686 Vita's Hid…    4645 Vita      NA                  Historic Ana…     38.9
2  3943 Historic R…    5059 Vasa      NA                  Edgewood, Bl…     38.9
3  4197 Capitol Hi…    5061 Sandra    NA                  Capitol Hill…     38.9
4  4529 Bertina's …    5803 Bertina   NA                  Eastland Gar…     38.9
5  5589 Cozy apt i…    6527 Ami       NA                  Kalorama Hei…     38.9
6  7103 Lovely gue…   17633 Charlotte NA                  Spring Valle…     38.9
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>

Clean Data & Use a Dplyr Command

airbnb_nona <- airbnb |>
  filter(!is.na(price))

Play with Data

room_options <- airbnb |>
  select(room_type) |>
  group_by(room_type) |>
  count() |>
  arrange(desc(n))
head(room_options)
# A tibble: 4 × 2
# Groups:   room_type [4]
  room_type           n
  <chr>           <int>
1 Entire home/apt  4863
2 Private room     1305
3 Hotel room         74
4 Shared room        15

Produce the Graph

scatter1 <- airbnb_nona |>
  ggplot(aes(x = number_of_reviews,
             y = price,
             color = room_type)) +
  geom_point(aes(shape = room_type), alpha = 0.4) +
  scale_color_manual(values = c("#FFD6EB", "#87cefa", "#d1ffbd", "#FA057F")) +
  labs(title = "DC Airbnb Number of Reviews by Price",
       x = "Number of Reviews",
       y = "Price Per Night (USD)",
       caption = "Source: DC Airbnb Statistics") +
  theme_bw()
  theme_bw()
scatter1

# Note: I kept getting errors when I tried to filter inside this plotting pipeline.
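
The filtering could probably be done as its own step before the plot rather than inside the plotting pipeline. This is only a sketch: the `toy` tibble is made-up stand-in data (not rows from Airbnb_DC_25.csv), and the cutoffs of 3000 for price and 1000 for reviews are assumed from the outliers described in the essay, not tested values.

```r
library(dplyr)

# Hypothetical stand-in rows, included only so the sketch runs on its own.
toy <- tibble(
  number_of_reviews = c(12, 250, 1400, 40, 3),
  price             = c(95, 180, 120, 4500, NA),
  room_type         = c("Entire home/apt", "Private room", "Entire home/apt",
                        "Hotel room", "Shared room")
)

# Drop missing prices, then keep rows at or under both assumed cutoffs.
# Conditions separated by commas inside filter() are combined with AND.
toy_trimmed <- toy |>
  filter(!is.na(price),
         price <= 3000,
         number_of_reviews <= 1000)

toy_trimmed
```

With the real data, the same filter() call would presumably go on airbnb_nona, and the trimmed result would be passed to ggplot() in its place.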

Essay

For this assignment, I made a scatter plot of the number of Airbnb reviews against the price per night in USD. In theory, the number of reviews is a reasonable proxy for how popular a listing is, so I thought the plot would be an interesting way to see which price range is most popular. It gives some clarity on the question, but I struggled to filter out a few points that make the whole visualization hard to read. Only three Airbnbs cost more than $3,000 per night, and only three have more than 1,000 reviews, yet those outliers stretch both axes so much that the rest of the graph is far less informative. I tried filtering in a variety of ways and kept getting the same error, which I was not able to resolve in time, but I would love to fix this and produce a more zoomed-in, more informative graph.

Even so, the graph still shows some interesting patterns. The most popular options are under $500 per night, and the majority of listings are entire homes or apartments (4,863 of 6,257). Lowering the opacity of the points (alpha = 0.4) also made the graph more readable where they overlap.
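
Another option for a zoomed-in view is to leave every row in the data and only narrow the visible window with coord_cartesian(). This is sketched here with made-up stand-in data rather than the real file. The axis limits of 1000 reviews and 3000 USD are assumptions based on the outlier cutoffs mentioned above.

```r
library(ggplot2)

# Hypothetical stand-in rows, included only so the sketch runs on its own.
toy <- data.frame(
  number_of_reviews = c(12, 250, 1400, 40),
  price             = c(95, 180, 120, 4500),
  room_type         = c("Entire home/apt", "Private room",
                        "Entire home/apt", "Hotel room")
)

zoomed <- ggplot(toy, aes(x = number_of_reviews,
                          y = price,
                          color = room_type)) +
  geom_point(alpha = 0.4) +
  # coord_cartesian() changes only the visible range; no rows are removed.
  coord_cartesian(xlim = c(0, 1000), ylim = c(0, 3000)) +
  theme_bw()

zoomed
```

Unlike filtering, this keeps the outlying listings in the data, so any later summary of the data frame would still include them. They are simply drawn outside the visible panel.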