| Full Name | Student ID |
|---|---|
| Lê Xuân Tùng | 202413804 |
| Bùi Hạnh Ngân | 202413760 |
| Nguyễn Hoàng Khánh Huyền | 202413700 |
| Nguyễn Mạnh Hà | 202413684 |
In this project, we are going to look at Airbnb listings in New York City and try to understand what makes some listings more expensive than others. We were given a dataset that contains information about thousands of Airbnb listings in NYC, including things like what type of room it is, which borough it’s in, how many reviews it has, and how available it is throughout the year.
We thought this was an interesting topic because a lot of us have used Airbnb before and it can be quite confusing why two listings that seem similar can have very different prices. The goal is to use regression analysis to figure out which variables have the biggest influence on the price.
The variable we are trying to explain (the dependent variable) is price, which is the nightly rental price in US dollars.
airbnb <- read.csv("C:/Users/admin/OneDrive/Documents/02 _ Uni/Applied Statistic/Assignment/Airbnb_Sheets.csv",
stringsAsFactors = FALSE)
# Convert categorical variables to factors
airbnb$room_type <- as.factor(airbnb$room_type)
airbnb$neighbourhood_group <- as.factor(airbnb$neighbourhood_group)
cat("Rows:", nrow(airbnb), "| Columns:", ncol(airbnb), "\n")
## Rows: 47888 | Columns: 8
## neighbourhood_group room_type price minimum_nights number_of_reviews
## 1 Brooklyn Private room 149 1 9
## 2 Manhattan Entire home/apt 225 1 45
## 3 Manhattan Private room 150 3 0
## 4 Brooklyn Entire home/apt 89 1 270
## 5 Manhattan Entire home/apt 80 10 9
## 6 Manhattan Entire home/apt 200 3 74
## reviews_per_month calculated_host_listings_count availability_365
## 1 0.21 6 365
## 2 0.38 2 355
## 3 0.00 1 365
## 4 4.64 1 194
## 5 0.10 1 0
## 6 0.59 1 129
The dataset has 47,888 rows and 8 columns. We have 2 categorical
variables (room_type and neighbourhood_group)
and 6 numeric variables. We converted the categorical ones to factors so
R can handle them properly in regression.
## Observations kept: 47089
## Observations removed: 799
We removed 799 listings (about 1.7% of the data) priced above $500 since they are extreme outliers that would distort our results. We still have 47,089 observations.
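The filtering step described above could be written along the following lines. This is only a sketch on made-up data: the `listings` data frame, its values, and the `price_cap` name are assumptions for illustration, not the code used to produce the counts reported here.

```r
# Toy data standing in for the real listings (values are made up)
listings <- data.frame(
  price     = c(60, 120, 95, 750, 210, 1200, 480, 500),
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Entire home/apt", "Shared room")
)

price_cap <- 500  # threshold named in the text

n_before  <- nrow(listings)
kept      <- listings[listings$price <= price_cap, ]  # keep prices up to the cap
n_removed <- n_before - nrow(kept)

cat("Observations kept:", nrow(kept), "\n")
cat("Observations removed:", n_removed, "\n")
cat("Share removed:", round(100 * n_removed / n_before, 1), "%\n")
```

With the figures reported above, the same share calculation is `round(100 * 799 / 47888, 1)`, which gives 1.7.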
## neighbourhood_group room_type price
## Bronx : 1068 Entire home/apt:24014 Min. : 10.0
## Brooklyn :19608 Private room :21943 1st Qu.: 68.0
## Manhattan :20467 Shared room : 1132 Median :100.0
## Queens : 5583 Mean :131.5
## Staten Island: 363 3rd Qu.:172.0
## Max. :500.0
## minimum_nights number_of_reviews reviews_per_month
## Min. : 1.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.: 0.050
## Median : 2.000 Median : 5.00 Median : 0.390
## Mean : 5.569 Mean : 23.73 Mean : 1.112
## 3rd Qu.: 5.000 3rd Qu.: 24.00 3rd Qu.: 1.630
## Max. :30.000 Max. :629.00 Max. :58.500
## calculated_host_listings_count availability_365
## Min. : 1.000 Min. : 0.0
## 1st Qu.: 1.000 1st Qu.: 0.0
## Median : 1.000 Median : 41.0
## Mean : 7.082 Mean :110.2
## 3rd Qu.: 2.000 3rd Qu.:219.0
## Max. :327.000 Max. :365.0
Looking at the summary, the average price is around $132 but the
median is only $100, which suggests the distribution is right-skewed.
The minimum_nights goes up to 30, meaning some listings
target longer stays. The number_of_reviews has a very low
median (5) which means most listings don’t have many reviews. For
availability_365, a lot of listings seem to have 0 days
available, possibly because they are already fully booked or
inactive.
For each numeric variable, we provide a histogram and a boxplot; for each categorical variable, we provide a bar plot. We comment on the results in each case.
barplot(table(airbnb$room_type),
col = c("#3498db", "#e74c3c", "#2ecc71"),
main = "Distribution of Room Type",
xlab = "Room Type", ylab = "Count")
The most common room type is “Entire home/apt” with around 24,000 listings, followed closely by “Private room” with about 22,000. Shared rooms are much rarer, with only around 1,100 listings.
barplot(table(airbnb$neighbourhood_group),
col = c("#9b59b6", "#3498db", "#e67e22", "#e74c3c", "#1abc9c"),
main = "Distribution of Borough",
xlab = "Borough", ylab = "Count")
Manhattan has the most listings (about 20,500), followed by Brooklyn (about 19,600). Queens has around 5,600 listings. The Bronx (about 1,100) and Staten Island (about 360) have far fewer listings.
hist(airbnb$price,
breaks = 60, col = "#2980b9", border = "white",
main = "Distribution of Nightly Price (USD)",
xlab = "Price per night (USD)", ylab = "Frequency")
abline(v = mean(airbnb$price), col = "red", lwd = 2, lty = 2)
abline(v = median(airbnb$price), col = "darkorange", lwd = 2, lty = 2)
legend("topright",
legend = c(paste("Mean =", round(mean(airbnb$price), 1)),
paste("Median =", round(median(airbnb$price), 1))),
col = c("red", "darkorange"), lty = 2, lwd = 2, bty = "n")
boxplot(airbnb$price, col = "#2980b9", horizontal = TRUE,
main = "Boxplot of Nightly Price (USD)",
xlab = "Price per night (USD)")
The histogram shows that price is strongly right-skewed: most listings are priced between $50 and $200, but there is a long tail towards $500. The mean ($132) is much higher than the median ($100), which confirms the skewness. The boxplot makes this even clearer with many outliers to the right.
boxplot(price ~ room_type, data = airbnb,
col = c("#3498db", "#e74c3c", "#2ecc71"),
main = "Price by Room Type",
xlab = "Room Type", ylab = "Price per night (USD)")
boxplot(price ~ neighbourhood_group, data = airbnb,
col = c("#9b59b6", "#3498db", "#e67e22", "#e74c3c", "#1abc9c"),
main = "Price by Borough",
xlab = "Borough", ylab = "Price per night (USD)")
When we split price by room type, entire homes are clearly more expensive than private rooms, which are more expensive than shared rooms. When split by borough, Manhattan is clearly the most expensive, followed by Brooklyn.
hist(airbnb$minimum_nights,
breaks = 30, col = "#e67e22", border = "white",
main = "Distribution of Minimum Nights",
xlab = "Minimum nights", ylab = "Frequency")
boxplot(airbnb$minimum_nights, col = "#e67e22", horizontal = TRUE,
main = "Boxplot of Minimum Nights", xlab = "Minimum nights")
Most listings require only 1 to 3 nights minimum. There is also a concentration at 30 nights; these listings probably target monthly accommodation. The boxplot shows the median is very low with many high-value outliers.
hist(airbnb$number_of_reviews,
breaks = 60, col = "#1abc9c", border = "white",
main = "Distribution of Number of Reviews",
xlab = "Total reviews", ylab = "Frequency")
boxplot(airbnb$number_of_reviews, col = "#1abc9c", horizontal = TRUE,
main = "Boxplot of Number of Reviews", xlab = "Total reviews")
The number of reviews is very right-skewed. Most listings have very few reviews (median = 5), and a few have accumulated hundreds. The boxplot shows a compact box close to 0 with many outliers to the right.
hist(airbnb$reviews_per_month,
breaks = 60, col = "#8e44ad", border = "white",
main = "Distribution of Reviews per Month",
xlab = "Reviews per month", ylab = "Frequency")
boxplot(airbnb$reviews_per_month, col = "#8e44ad", horizontal = TRUE,
main = "Boxplot of Reviews per Month", xlab = "Reviews per month")
Similar to the total number of reviews, the monthly review rate is heavily right-skewed. Most listings get fewer than 2 reviews per month.
hist(airbnb$calculated_host_listings_count,
breaks = 50, col = "#e74c3c", border = "white",
main = "Distribution of Host Listings Count",
xlab = "Listings per host", ylab = "Frequency")
boxplot(airbnb$calculated_host_listings_count, col = "#e74c3c", horizontal = TRUE,
main = "Boxplot of Host Listings Count", xlab = "Listings per host")
Most listings belong to hosts with a single listing (the median is 1 and the third quartile is 2). A small number of hosts manage hundreds of listings; these are probably professional property managers. The boxplot shows the median is 1 with extreme outliers to the right.
hist(airbnb$availability_365,
breaks = 40, col = "#34495e", border = "white",
main = "Distribution of Availability (days/year)",
xlab = "Available days per year", ylab = "Frequency")
boxplot(airbnb$availability_365, col = "#34495e", horizontal = TRUE,
main = "Boxplot of Availability (days/year)",
xlab = "Available days per year")
This variable shows a bimodal pattern: many listings have 0 days available (inactive or fully booked) and many have close to 365 (always open). The mean (110 days) is much higher than the median (41 days).
num_vars <- airbnb[, c("price", "minimum_nights", "number_of_reviews",
"reviews_per_month",
"calculated_host_listings_count",
"availability_365")]
# Use a sample of 2,000 rows — plotting all 47k points would be unreadable
set.seed(42)
idx <- sample(nrow(num_vars), 2000)
pairs(num_vars[idx, ],
main = "Pairwise scatterplots (sample of 2,000 listings)",
col = rgb(0.15, 0.45, 0.75, 0.25),
pch = 16,
cex = 0.5,
upper.panel = panel.smooth)
Looking at the pairs plot, none of the numeric predictors seem to
have a strong linear relationship with price. The clearest relationship
is between number_of_reviews and
reviews_per_month (positive), which makes sense because a
listing that gets many reviews per month will naturally accumulate more
total reviews over time.
## price minimum_nights number_of_reviews
## price 1.000 0.055 -0.049
## minimum_nights 0.055 1.000 -0.149
## number_of_reviews -0.049 -0.149 1.000
## reviews_per_month -0.050 -0.222 0.588
## calculated_host_listings_count 0.163 0.333 -0.072
## availability_365 0.093 0.242 0.182
## reviews_per_month calculated_host_listings_count
## price -0.050 0.163
## minimum_nights -0.222 0.333
## number_of_reviews 0.588 -0.072
## reviews_per_month 1.000 -0.048
## calculated_host_listings_count -0.048 1.000
## availability_365 0.175 0.230
## availability_365
## price 0.093
## minimum_nights 0.242
## number_of_reviews 0.182
## reviews_per_month 0.175
## calculated_host_listings_count 0.230
## availability_365 1.000
The strongest correlation with price is for
calculated_host_listings_count (r ≈ 0.16). It is positive,
meaning hosts with more listings tend to charge slightly more.
number_of_reviews and reviews_per_month are
slightly negatively correlated with price (r ≈ -0.05), possibly because
cheaper listings are booked, and therefore reviewed, more often; the
correlation alone cannot confirm this. All correlations with price are
quite weak, so we expect the categorical variables to do most of the
work in the regression model.
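A correlation matrix of this kind can be produced with `cor()`. The sketch below runs on simulated data: the `toy` data frame and the way its columns are generated are assumptions made for illustration, so the numbers it prints will not match the table above.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three of the numeric columns (not the real data)
reviews_per_month <- rexp(n, rate = 1)
number_of_reviews <- round(reviews_per_month * 20 + rexp(n, rate = 0.2))
price             <- 100 + rnorm(n, sd = 40)

toy <- data.frame(price, number_of_reviews, reviews_per_month)

# Pearson correlations, rounded to three decimals like the output above
cor_mat <- round(cor(toy), 3)
print(cor_mat)

# Correlations of each predictor with price, sorted by absolute size
price_cor <- cor_mat["price", colnames(cor_mat) != "price"]
print(price_cor[order(abs(price_cor), decreasing = TRUE)])
```

Sorting by absolute value makes it easy to see which numeric predictor is most strongly associated with price, regardless of sign.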
##
## Call:
## lm(formula = price ~ room_type, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.06 -39.87 -15.06 19.94 436.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180.0610 0.4695 383.51 <2e-16 ***
## room_typePrivate room -98.1886 0.6795 -144.51 <2e-16 ***
## room_typeShared room -116.3436 2.2129 -52.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.76 on 47086 degrees of freedom
## Multiple R-squared: 0.3173, Adjusted R-squared: 0.3173
## F-statistic: 1.094e+04 on 2 and 47086 DF, p-value: < 2.2e-16
The reference level is “Entire home/apt”. The intercept (≈ $180) is the estimated average price for an entire home. Private rooms are about $98 cheaper and shared rooms about $116 cheaper. Both are highly significant. R² = 0.32, meaning room type alone explains 32% of the variation in price.
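The interpretation of the intercept and the dummy coefficients can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with invented prices) to show that, in a regression on a single factor, the intercept equals the mean of the reference group and each coefficient equals the difference between that group's mean and the reference mean.

```r
# Made-up prices for three room types (illustration only)
toy <- data.frame(
  price = c(180, 200, 160, 90, 70, 80, 60, 50),
  room_type = factor(c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                       "Private room", "Private room", "Private room",
                       "Shared room", "Shared room"))
)

fit <- lm(price ~ room_type, data = toy)

# Group means computed directly from the data
group_means <- tapply(toy$price, toy$room_type, mean)

print(coef(fit))
print(group_means)

# Differences from the reference level (the first factor level)
print(group_means - group_means[levels(toy$room_type)[1]])
```

R picks the first factor level alphabetically as the reference, which is why "Entire home/apt" plays that role both here and in the model above.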
##
## Call:
## lm(formula = price ~ neighbourhood_group, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.30 -56.04 -20.72 36.70 418.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.938 2.546 32.179 < 2e-16 ***
## neighbourhood_groupBrooklyn 30.779 2.615 11.771 < 2e-16 ***
## neighbourhood_groupManhattan 81.365 2.612 31.151 < 2e-16 ***
## neighbourhood_groupQueens 11.104 2.779 3.995 6.47e-05 ***
## neighbourhood_groupStaten Island 9.494 5.056 1.878 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.22 on 47084 degrees of freedom
## Multiple R-squared: 0.107, Adjusted R-squared: 0.107
## F-statistic: 1411 on 4 and 47084 DF, p-value: < 2.2e-16
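The reference borough is the Bronx. Manhattan listings are about $81 more expensive per night, and Brooklyn about $31 more. Queens is about $11 more expensive, while the Staten Island difference (about $9) is not significant at the 5% level (p = 0.06). R² = 0.11, so borough alone explains about 11% of the variation in price.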
##
## Call:
## lm(formula = price ~ calculated_host_listings_count, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.88 -62.86 -28.86 37.51 371.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.42385 0.40959 313.55 <2e-16 ***
## calculated_host_listings_count 0.43566 0.01218 35.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.89 on 47087 degrees of freedom
## Multiple R-squared: 0.02645, Adjusted R-squared: 0.02643
## F-statistic: 1279 on 1 and 47087 DF, p-value: < 2.2e-16
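The relationship is positive and significant: each additional listing managed by the host is associated with a nightly price about $0.44 higher. However, R² = 0.03, so this variable alone is a weak predictor.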
##
## Call:
## lm(formula = price ~ availability_365, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.46 -64.57 -24.58 39.13 375.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.246e+02 5.287e-01 235.64 <2e-16 ***
## availability_365 6.289e-02 3.094e-03 20.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.68 on 47087 degrees of freedom
## Multiple R-squared: 0.008698, Adjusted R-squared: 0.008677
## F-statistic: 413.2 on 1 and 47087 DF, p-value: < 2.2e-16
The coefficient is positive and significant but very small (about $0.06 per additional available day), and R² = 0.01.
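For a regression with a single numeric predictor, R² is the square of the Pearson correlation between the predictor and the response, which links these two simple models back to the correlation matrix. The sketch below checks the identity on simulated data (the `x` and `y` vectors are invented) and then squares the rounded correlations reported earlier.

```r
set.seed(2)

# Simulated predictor and response (illustration only)
x <- runif(100, 0, 365)
y <- 120 + 0.06 * x + rnorm(100, sd = 80)

fit <- lm(y ~ x)

r2_from_lm  <- summary(fit)$r.squared  # R-squared reported by lm
r2_from_cor <- cor(x, y)^2             # squared Pearson correlation

print(c(lm = r2_from_lm, cor_squared = r2_from_cor))

# Same check with the rounded correlations from the matrix above
r2_host_listings <- round(0.163^2, 2)  # calculated_host_listings_count
r2_availability  <- round(0.093^2, 2)  # availability_365
print(c(r2_host_listings, r2_availability))
```

The two squared correlations round to 0.03 and 0.01, in line with the R² values of the two simple regressions.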
model_full <- lm(price ~ room_type + neighbourhood_group +
minimum_nights + number_of_reviews +
reviews_per_month + calculated_host_listings_count +
availability_365,
data = airbnb)
summary(model_full)
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + minimum_nights +
## number_of_reviews + reviews_per_month + calculated_host_listings_count +
## availability_365, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -208.79 -39.34 -10.74 20.78 458.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.372e+02 2.174e+00 63.114 < 2e-16 ***
## room_typePrivate room -9.273e+01 6.496e-01 -142.757 < 2e-16 ***
## room_typeShared room -1.185e+02 2.078e+00 -57.015 < 2e-16 ***
## neighbourhood_groupBrooklyn 2.602e+01 2.145e+00 12.129 < 2e-16 ***
## neighbourhood_groupManhattan 6.420e+01 2.150e+00 29.856 < 2e-16 ***
## neighbourhood_groupQueens 1.177e+01 2.270e+00 5.186 2.15e-07 ***
## neighbourhood_groupStaten Island -4.535e+00 4.127e+00 -1.099 0.272
## minimum_nights -1.360e+00 4.320e-02 -31.489 < 2e-16 ***
## number_of_reviews -1.103e-01 8.703e-03 -12.676 < 2e-16 ***
## reviews_per_month -1.824e+00 2.472e-01 -7.376 1.65e-13 ***
## calculated_host_listings_count 1.829e-01 1.037e-02 17.637 < 2e-16 ***
## availability_365 1.016e-01 2.637e-03 38.515 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.9 on 47077 degrees of freedom
## Multiple R-squared: 0.4056, Adjusted R-squared: 0.4055
## F-statistic: 2920 on 11 and 47077 DF, p-value: < 2.2e-16
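The R² for the full model is 0.4056, so the model explains about 40.6% of the variation in price. The F-statistic is very large (2920 on 11 and 47,077 degrees of freedom) and its p-value is below 2.2e-16, so the predictors are jointly significant.
Looking at the individual coefficients, every predictor is significant at the 5% level except the Staten Island indicator (p = 0.272), meaning Staten Island prices cannot be distinguished from Bronx prices once the other variables are controlled for.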
model_selected <- lm(price ~ room_type + neighbourhood_group +
reviews_per_month + calculated_host_listings_count +
availability_365,
data = airbnb)
summary(model_selected)
##
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + reviews_per_month +
## calculated_host_listings_count + availability_365, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -202.78 -39.74 -11.85 20.01 444.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.326e+02 2.194e+00 60.432 < 2e-16 ***
## room_typePrivate room -9.047e+01 6.535e-01 -138.439 < 2e-16 ***
## room_typeShared room -1.144e+02 2.098e+00 -54.523 < 2e-16 ***
## neighbourhood_groupBrooklyn 2.271e+01 2.168e+00 10.475 < 2e-16 ***
## neighbourhood_groupManhattan 5.997e+01 2.172e+00 27.610 < 2e-16 ***
## neighbourhood_groupQueens 9.802e+00 2.296e+00 4.269 1.96e-05 ***
## neighbourhood_groupStaten Island -4.476e+00 4.176e+00 -1.072 0.284
## reviews_per_month -1.907e+00 2.013e-01 -9.473 < 2e-16 ***
## calculated_host_listings_count 1.100e-01 1.012e-02 10.871 < 2e-16 ***
## availability_365 7.735e-02 2.565e-03 30.155 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.7 on 47079 degrees of freedom
## Multiple R-squared: 0.3914, Adjusted R-squared: 0.3913
## F-statistic: 3364 on 9 and 47079 DF, p-value: < 2.2e-16
The R² of the selected model is 0.3914, slightly lower than that of
the full model (0.4056). Removing number_of_reviews and
minimum_nights costs about 1.4 percentage points of explained
variance, even though both were individually significant in the full
model. We prefer this simpler model, since the loss in explanatory power is small.
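One way to compare a full model with a reduced one is to put their R² values side by side and run a partial F-test with `anova()`. The sketch below does this on simulated data; the `toy` data frame, its coefficients, and the variable subset are assumptions for illustration and do not reproduce the models above.

```r
set.seed(3)
n <- 300

# Simulated listings (illustration only)
room_type      <- factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))
availability   <- runif(n, 0, 365)
minimum_nights <- sample(1:30, n, replace = TRUE)
price <- 150 - 80 * (room_type == "Private room") + 0.1 * availability -
  1.5 * minimum_nights + rnorm(n, sd = 50)
toy <- data.frame(price, room_type, availability, minimum_nights)

full    <- lm(price ~ room_type + availability + minimum_nights, data = toy)
reduced <- lm(price ~ room_type + availability, data = toy)

# R-squared of both models; the full model can never have the lower value
r2 <- c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
print(round(r2, 4))

# Partial F-test: does the dropped variable add explanatory power?
cmp <- anova(reduced, full)
print(cmp)
```

For the two summaries shown above, the corresponding difference in R² is 0.4056 - 0.3914 = 0.0142.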
## 2.5 % 97.5 %
## (Intercept) 128.29 136.89
## room_typePrivate room -91.75 -89.19
## room_typeShared room -118.52 -110.29
## neighbourhood_groupBrooklyn 18.46 26.96
## neighbourhood_groupManhattan 55.71 64.23
## neighbourhood_groupQueens 5.30 14.30
## neighbourhood_groupStaten Island -12.66 3.71
## reviews_per_month -2.30 -1.51
## calculated_host_listings_count 0.09 0.13
## availability_365 0.07 0.08
All confidence intervals exclude 0 except the one for Staten Island (-12.66 to 3.71), so every predictor other than the Staten Island indicator is significant at the 5% level. Intervals are narrow because of the large sample size. For example, we are 95% confident that private rooms are between about $89 and $92 cheaper than entire homes per night, controlling for other variables.
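The intervals in the table can be reproduced by hand as estimate ± t × standard error. The sketch below does this for two coefficients, using estimates and standard errors copied from the selected-model summary above; the names `private_room` and `staten_island` are labels chosen here for readability.

```r
# Estimates and standard errors copied from the selected-model summary above
est <- c(private_room = -90.47, staten_island = -4.476)
se  <- c(private_room = 0.6535, staten_island = 4.176)
df  <- 47079  # residual degrees of freedom reported for that model

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 2))

# An interval that straddles 0 means the coefficient is not significant at 5%
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
print(contains_zero)
```

The private-room interval lies entirely below 0, while the Staten Island interval straddles 0, which agrees with its p-value of 0.284.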
In this project we analyzed 47,089 Airbnb listings in New York City to understand what drives the nightly price. We used descriptive statistics, visualizations, simple regressions, and a multiple regression model.
The main finding is that room type is by far the most important predictor of price. Entire homes are on average much more expensive than private rooms or shared rooms. The borough also matters a lot: Manhattan listings are significantly more expensive than those in the other boroughs. Our final model explains about 39.1% of the variation in price.
Strengths: Very large dataset giving reliable estimates. Both categorical and numeric predictors were included. Histograms and boxplots were provided for every numeric variable, and bar plots for the categorical ones.
Limitations: The price variable is skewed, leading to non-normal residuals and mild heteroscedasticity. Only 39.1% of price variation is explained; important factors like listing quality, exact location, or amenities are missing. The model also shows associations, not causation.
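The residual skewness mentioned in the limitations can be measured directly. The sketch below uses simulated right-skewed prices and a hand-written `skewness` helper (both are assumptions for illustration, not part of the analysis above) to compare residuals from a model on price with residuals from a model on log(price).

```r
set.seed(4)
n <- 500

# Simulated right-skewed prices (illustration only)
entire_home <- rbinom(n, 1, 0.5)
price <- exp(4.3 + 0.7 * entire_home + rnorm(n, sd = 0.6))

# Simple moment-based skewness; 0 indicates a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

fit_raw <- lm(price ~ entire_home)
fit_log <- lm(log(price) ~ entire_home)

skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))

print(round(c(raw = skew_raw, log = skew_log), 2))
```

On data like this, the log-scale residuals are much closer to symmetric. Whether the same holds for the Airbnb prices would need to be checked on the real data.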
Applied Statistics — Group 4 | March 25, 2026
Dataset: Airbnb