Loading libraries and dataset
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.2
df <- read_excel("Airbnb_DC_25.csv")
df
## # A tibble: 6,257 × 18
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
## 1 3686 Vita's Hi… 4645 Vita NA Historic Ana… 38.9
## 2 3943 Historic … 5059 Vasa NA Edgewood, Bl… 38.9
## 3 4197 Capitol H… 5061 Sandra NA Capitol Hill… 38.9
## 4 4529 Bertina's… 5803 Bertina NA Eastland Gar… 38.9
## 5 5589 Cozy apt … 6527 Ami NA Kalorama Hei… 38.9
## 6 7103 Lovely gu… 17633 Charlotte NA Spring Valle… 38.9
## 7 11785 Sanctuary… 32015 Teresa NA Cathedral He… 38.9
## 8 12442 Peaches &… 32015 Teresa NA Cathedral He… 38.9
## 9 13744 Heart of … 53927 Victoria NA Columbia Hei… 38.9
## 10 14218 Quiet Com… 32015 Teresa NA Cathedral He… 38.9
## # ℹ 6,247 more rows
## # ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <dttm>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
df <- df |>
  filter(!is.na(price) & price > 0)
I removed rows where the price is missing or not greater than zero.
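Before applying a filter like this, it can help to count how many rows each condition would drop. The sketch below is only an illustration: `toy` is made-up data standing in for `df` before filtering, and the column names `n_missing`, `n_non_positive`, and `n_kept` are hypothetical.

```r
library(dplyr)

# Made-up toy data standing in for df before filtering (not the real listings)
toy <- tibble(price = c(120, NA, 0, 85))

# Count rows with a missing price, a non-positive price, and rows that survive the filter
checks <- toy |>
  summarise(
    n_missing = sum(is.na(price)),
    n_non_positive = sum(price <= 0, na.rm = TRUE),
    n_kept = sum(!is.na(price) & price > 0)
  )
checks
```

Running the same `summarise()` call on the unfiltered `df` would report the counts for the actual data.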
df1 <- df |>
  filter(room_type %in% c("Private room", "Entire home/apt", "Shared room"))
head(df1)
## # A tibble: 6 × 18
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
## 1 3686 Vita's Hid… 4645 Vita NA Historic Ana… 38.9
## 2 3943 Historic R… 5059 Vasa NA Edgewood, Bl… 38.9
## 3 4197 Capitol Hi… 5061 Sandra NA Capitol Hill… 38.9
## 4 4529 Bertina's … 5803 Bertina NA Eastland Gar… 38.9
## 5 7103 Lovely gue… 17633 Charlotte NA Spring Valle… 38.9
## 6 11785 Sanctuary … 32015 Teresa NA Cathedral He… 38.9
## # ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <dttm>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>
ggplot(df1, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Airbnb Prices by Room Type in DC",
    x = "Room Type",
    y = "Price",
    fill = "Room Type",
    caption = "Data Source: Airbnb_DC_25.xlsx"
  ) +
  scale_fill_manual(values = c(
    "Private room" = "skyblue",
    "Entire home/apt" = "orange",
    "Shared room" = "lightgreen"
  )) +
  theme_minimal()
A few extreme outliers stretch the y-axis, so the boxes look compressed.
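One way to see how far the extreme prices sit from the bulk of the data is to count the points above the usual 1.5 × IQR upper fence for each room type. This is a minimal sketch, not part of the original analysis: the helper name `outlier_counts` is hypothetical, and `toy` is made-up data used only to show the call.

```r
library(dplyr)

# Hypothetical helper: per room type, compute the upper fence (Q3 + 1.5 * IQR),
# the number of prices above it, and the maximum price
outlier_counts <- function(data) {
  data |>
    group_by(room_type) |>
    summarise(
      upper_fence = quantile(price, 0.75, names = FALSE) + 1.5 * IQR(price),
      n_high = sum(price > upper_fence),
      max_price = max(price),
      .groups = "drop"
    )
}

# Made-up toy data: four ordinary prices and one extreme value
toy <- tibble(
  room_type = rep("Entire home/apt", 5),
  price = c(90, 100, 110, 120, 5000)
)
outliers <- outlier_counts(toy)
outliers
```

With the report's data the call would be `outlier_counts(df1)`. A large gap between `upper_fence` and `max_price` is what squeezes the boxes on a linear axis.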
ggplot(df1, aes(x = room_type, y = price, fill = room_type)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Airbnb Prices by Room Type in DC",
    x = "Room Type",
    y = "Price",
    fill = "Room Type",
    caption = "Data Source: Airbnb_DC_25.xlsx"
  ) +
  scale_fill_manual(values = c(
    "Private room" = "skyblue",
    "Entire home/apt" = "orange",
    "Shared room" = "lightgreen"
  )) +
  scale_y_log10() +
  theme_minimal()
This plot shows the prices of Airbnb listings in Washington, DC by room
type: entire home/apt, private room, or shared room. I used a boxplot
to show the price distribution for each type. To make the prices easier
to compare, I put the y-axis on a log10 scale, so equal distances on the
axis correspond to multiplying the price by the same factor (for example
10, 100, 1000) rather than adding a fixed amount. The log scale shows
both low and high prices clearly, even though some listings are very
expensive. The median price is around $150 for entire homes or
apartments, about $75 for private rooms, and about $80 for shared rooms
(the median line is slightly higher than for private rooms). However,
the boxplot still shows outliers, especially for entire homes/apartments
and private rooms.
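The medians read off the plot can be checked numerically. The following is a sketch under assumptions rather than part of the original analysis: the helper name `price_summary` is hypothetical, and `toy` is made-up data used only to demonstrate the call.

```r
library(dplyr)

# Hypothetical helper: count, quartiles, and median of price per room type
price_summary <- function(data) {
  data |>
    group_by(room_type) |>
    summarise(
      n = n(),
      q1 = quantile(price, 0.25, names = FALSE),
      median_price = median(price),
      q3 = quantile(price, 0.75, names = FALSE),
      .groups = "drop"
    )
}

# Made-up toy data, only to show how the helper is called
toy <- tibble(
  room_type = c("Private room", "Private room",
                "Entire home/apt", "Entire home/apt", "Entire home/apt"),
  price = c(50, 100, 100, 150, 1000)
)
res <- price_summary(toy)
res
```

Calling `price_summary(df1)` on the filtered listings would give the exact medians and quartiles behind each box, which are easier to read from a table than from a log-scaled axis.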