Data Analysis and Application of Regression Techniques

Group Members

Full Name	Student ID
Lê Xuân Tùng	202413804
Bùi Hạnh Ngân	202413760
Nguyễn Hoàng Khánh Huyền	202413700
Nguyễn Mạnh Hà	202413684

1. Introduction

In this project, we are going to look at Airbnb listings in New York City and try to understand what makes some listings more expensive than others. We were given a dataset that contains information about thousands of Airbnb listings in NYC, including things like what type of room it is, which borough it’s in, how many reviews it has, and how available it is throughout the year.

We chose this topic because several of us have used Airbnb and found it confusing that two seemingly similar listings can have very different prices. The goal is to use regression analysis to identify which variables are most strongly associated with price.

The variable we are trying to explain (the dependent variable) is price, which is the nightly rental price in US dollars.

2. Data Description

2.1 Loading the data

airbnb <- read.csv("C:/Users/admin/OneDrive/Documents/02 _ Uni/Applied Statistic/Assignment/Airbnb_Sheets.csv",
                   stringsAsFactors = FALSE)

# Convert categorical variables to factors
airbnb$room_type           <- as.factor(airbnb$room_type)
airbnb$neighbourhood_group <- as.factor(airbnb$neighbourhood_group)

cat("Rows:", nrow(airbnb), "| Columns:", ncol(airbnb), "\n")

## Rows: 47888 | Columns: 8

head(airbnb)

##   neighbourhood_group       room_type price minimum_nights number_of_reviews
## 1            Brooklyn    Private room   149              1                 9
## 2           Manhattan Entire home/apt   225              1                45
## 3           Manhattan    Private room   150              3                 0
## 4            Brooklyn Entire home/apt    89              1               270
## 5           Manhattan Entire home/apt    80             10                 9
## 6           Manhattan Entire home/apt   200              3                74
##   reviews_per_month calculated_host_listings_count availability_365
## 1              0.21                              6              365
## 2              0.38                              2              355
## 3              0.00                              1              365
## 4              4.64                              1              194
## 5              0.10                              1                0
## 6              0.59                              1              129

The dataset has 47,888 rows and 8 columns. We have 2 categorical variables (room_type and neighbourhood_group) and 6 numeric variables. We converted the categorical ones to factors so R can handle them properly in regression.
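To illustrate what the factor conversion does for regression, here is a minimal sketch on a toy data frame (the four rows are hypothetical, not taken from the dataset). It shows how R expands a factor into indicator (dummy) columns, with the first level in alphabetical order acting as the reference.

```r
# Toy data frame with hypothetical rows (not the real Airbnb data)
toy <- data.frame(
  room_type = factor(c("Entire home/apt", "Private room",
                       "Shared room", "Private room")),
  price     = c(225, 149, 60, 150)
)

# Levels are sorted alphabetically; the first one is the reference level
levels(toy$room_type)

# model.matrix() builds the design matrix that lm() uses internally:
# an intercept column plus one 0/1 column per non-reference level
X <- model.matrix(price ~ room_type, data = toy)
print(colnames(X))
print(X)
```

Each non-reference level gets its own coefficient in `lm()`, which is why the regression output later lists "Private room" and "Shared room" but not "Entire home/apt".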

2.2 Removing outliers in price

airbnb <- airbnb[airbnb$price <= 500, ]

cat("Observations kept:", nrow(airbnb), "\n")

## Observations kept: 47089

cat("Observations removed:", 47888 - nrow(airbnb), "\n")

## Observations removed: 799

We removed 799 listings (about 1.7% of the data) priced above $500 since they are extreme outliers that would distort our results. We still have 47,089 observations.
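The filtering step can be sketched so that the row count before filtering is stored rather than typed in by hand. The price vector below is a hypothetical stand-in for airbnb$price; the last line simply re-checks the 1.7% figure from the counts reported above.

```r
# Hypothetical prices standing in for airbnb$price
prices <- c(80, 120, 150, 225, 499, 500, 650, 1200, 95, 10000)

n_before  <- length(prices)          # remember the size before filtering
kept      <- prices[prices <= 500]   # same rule as in the report: keep price <= 500
n_removed <- n_before - length(kept)
pct_removed <- 100 * n_removed / n_before

cat("Removed", n_removed, "of", n_before, "values (", pct_removed, "% )\n")

# Re-check of the share reported for the real data (799 of 47,888 rows)
reported_pct <- round(100 * 799 / 47888, 1)
print(reported_pct)
```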

2.3 Summary statistics

summary(airbnb)

##     neighbourhood_group           room_type         price      
##  Bronx        : 1068    Entire home/apt:24014   Min.   : 10.0  
##  Brooklyn     :19608    Private room   :21943   1st Qu.: 68.0  
##  Manhattan    :20467    Shared room    : 1132   Median :100.0  
##  Queens       : 5583                            Mean   :131.5  
##  Staten Island:  363                            3rd Qu.:172.0  
##                                                 Max.   :500.0  
##  minimum_nights   number_of_reviews reviews_per_month
##  Min.   : 1.000   Min.   :  0.00    Min.   : 0.000   
##  1st Qu.: 1.000   1st Qu.:  1.00    1st Qu.: 0.050   
##  Median : 2.000   Median :  5.00    Median : 0.390   
##  Mean   : 5.569   Mean   : 23.73    Mean   : 1.112   
##  3rd Qu.: 5.000   3rd Qu.: 24.00    3rd Qu.: 1.630   
##  Max.   :30.000   Max.   :629.00    Max.   :58.500   
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 41.0   
##  Mean   :  7.082                Mean   :110.2   
##  3rd Qu.:  2.000                3rd Qu.:219.0   
##  Max.   :327.000                Max.   :365.0

Looking at the summary, the average price is around $132 but the median is only $100, which suggests the distribution is right-skewed. The minimum_nights goes up to 30, meaning some listings target longer stays. The number_of_reviews has a very low median (5) which means most listings don’t have many reviews. For availability_365, a lot of listings seem to have 0 days available, possibly because they are already fully booked or inactive.

2.4 Variable-by-variable analysis

For each numeric variable, we provide a histogram and a boxplot; for the two categorical variables, we provide a bar chart. We comment on the results in each case.

room_type (categorical)

barplot(table(airbnb$room_type),
        col  = c("#3498db", "#e74c3c", "#2ecc71"),
        main = "Distribution of Room Type",
        xlab = "Room Type", ylab = "Count")

The most common room type is “Entire home/apt” with around 24,000 listings, followed closely by “Private room” with about 22,000. Shared rooms are much rarer, with only around 1,100 listings.

neighbourhood_group (categorical)

barplot(table(airbnb$neighbourhood_group),
        col  = c("#9b59b6", "#3498db", "#e67e22", "#e74c3c", "#1abc9c"),
        main = "Distribution of Borough",
        xlab = "Borough", ylab = "Count")

Manhattan has the most listings (about 20,500), followed by Brooklyn (about 19,600). Queens has around 5,600 listings. The Bronx (about 1,100) and Staten Island (fewer than 400) have far fewer listings.

price (dependent variable)

hist(airbnb$price,
     breaks = 60, col = "#2980b9", border = "white",
     main = "Distribution of Nightly Price (USD)",
     xlab = "Price per night (USD)", ylab = "Frequency")
abline(v = mean(airbnb$price),   col = "red",        lwd = 2, lty = 2)
abline(v = median(airbnb$price), col = "darkorange", lwd = 2, lty = 2)
legend("topright",
       legend = c(paste("Mean =",   round(mean(airbnb$price),   1)),
                  paste("Median =", round(median(airbnb$price), 1))),
       col = c("red", "darkorange"), lty = 2, lwd = 2, bty = "n")

boxplot(airbnb$price, col = "#2980b9", horizontal = TRUE,
        main = "Boxplot of Nightly Price (USD)",
        xlab = "Price per night (USD)")

The histogram shows that price is strongly right-skewed - most listings are priced between $50 and $200, but there is a long tail towards $500. The mean ($132) is much higher than the median ($100), which confirms the skewness. The boxplot makes this even clearer with many outliers to the right.
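The mean-versus-median comparison can be complemented by a numeric skewness measure. The sketch below uses simulated right-skewed data (a lognormal sample chosen only for illustration, not the real prices) and a small helper, `sample_skewness`, which is our own hypothetical function rather than part of base R.

```r
# Simulated right-skewed "prices" (illustrative only, not the Airbnb data)
set.seed(1)
x <- rlnorm(5000, meanlog = 4.6, sdlog = 0.6)

# Moment-based sample skewness: third central moment divided by
# the 3/2 power of the second central moment
sample_skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^(3 / 2)
}

skew_x <- sample_skewness(x)
cat("Mean:", round(mean(x), 1), "| Median:", round(median(x), 1),
    "| Skewness:", round(skew_x, 2), "\n")
```

A positive value indicates a longer right tail. On the real data, the same helper could be applied as `sample_skewness(airbnb$price)`.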

boxplot(price ~ room_type, data = airbnb,
        col  = c("#3498db", "#e74c3c", "#2ecc71"),
        main = "Price by Room Type",
        xlab = "Room Type", ylab = "Price per night (USD)")

boxplot(price ~ neighbourhood_group, data = airbnb,
        col  = c("#9b59b6", "#3498db", "#e67e22", "#e74c3c", "#1abc9c"),
        main = "Price by Borough",
        xlab = "Borough", ylab = "Price per night (USD)")

When we split price by room type, entire homes are clearly more expensive than private rooms, which are more expensive than shared rooms. When split by borough, Manhattan is clearly the most expensive, followed by Brooklyn.

minimum_nights

hist(airbnb$minimum_nights,
     breaks = 30, col = "#e67e22", border = "white",
     main = "Distribution of Minimum Nights",
     xlab = "Minimum nights", ylab = "Frequency")

boxplot(airbnb$minimum_nights, col = "#e67e22", horizontal = TRUE,
        main = "Boxplot of Minimum Nights", xlab = "Minimum nights")

Most listings require only 1 to 3 nights minimum. There is also a concentration at 30 nights — these listings probably target monthly accommodation. The boxplot shows the median is very low with many high-value outliers.

number_of_reviews

hist(airbnb$number_of_reviews,
     breaks = 60, col = "#1abc9c", border = "white",
     main = "Distribution of Number of Reviews",
     xlab = "Total reviews", ylab = "Frequency")

boxplot(airbnb$number_of_reviews, col = "#1abc9c", horizontal = TRUE,
        main = "Boxplot of Number of Reviews", xlab = "Total reviews")

The number of reviews is very right-skewed. Most listings have very few reviews (median = 5), and a few have accumulated hundreds. The boxplot shows a compact box close to 0 with many outliers to the right.

reviews_per_month

hist(airbnb$reviews_per_month,
     breaks = 60, col = "#8e44ad", border = "white",
     main = "Distribution of Reviews per Month",
     xlab = "Reviews per month", ylab = "Frequency")

boxplot(airbnb$reviews_per_month, col = "#8e44ad", horizontal = TRUE,
        main = "Boxplot of Reviews per Month", xlab = "Reviews per month")

Similar to the total number of reviews, the monthly review rate is heavily right-skewed. Most listings get fewer than 2 reviews per month.

calculated_host_listings_count

hist(airbnb$calculated_host_listings_count,
     breaks = 50, col = "#e74c3c", border = "white",
     main = "Distribution of Host Listings Count",
     xlab = "Listings per host", ylab = "Frequency")

boxplot(airbnb$calculated_host_listings_count, col = "#e74c3c", horizontal = TRUE,
        main = "Boxplot of Host Listings Count", xlab = "Listings per host")

The vast majority of hosts only have 1 listing. A small number of hosts manage hundreds of listings — probably professional property managers. The boxplot shows the median is 1 with extreme outliers to the right.

availability_365

hist(airbnb$availability_365,
     breaks = 40, col = "#34495e", border = "white",
     main = "Distribution of Availability (days/year)",
     xlab = "Available days per year", ylab = "Frequency")

boxplot(airbnb$availability_365, col = "#34495e", horizontal = TRUE,
        main = "Boxplot of Availability (days/year)",
        xlab = "Available days per year")

This variable shows a bimodal pattern — many listings have 0 days available (inactive/fully booked) and many have close to 365 (always open). The mean (110 days) is much higher than the median (41 days).

3. Analysis of Relationships Between Variables

3.1 Pairwise scatterplot matrix

num_vars <- airbnb[, c("price", "minimum_nights", "number_of_reviews",
                        "reviews_per_month",
                        "calculated_host_listings_count",
                        "availability_365")]

# Use a sample of 2,000 rows — plotting all 47k points would be unreadable
set.seed(42)
idx <- sample(nrow(num_vars), 2000)

pairs(num_vars[idx, ],
      main        = "Pairwise scatterplots (sample of 2,000 listings)",
      col         = rgb(0.15, 0.45, 0.75, 0.25),
      pch         = 16,
      cex         = 0.5,
      upper.panel = panel.smooth)

Looking at the pairs plot, none of the numeric predictors seem to have a strong linear relationship with price. The clearest relationship is between number_of_reviews and reviews_per_month (positive), which makes sense because a listing that gets many reviews per month will naturally accumulate more total reviews over time.

3.2 Correlation matrix

cor_table <- round(cor(num_vars), 3)
cor_table

##                                 price minimum_nights number_of_reviews
## price                           1.000          0.055            -0.049
## minimum_nights                  0.055          1.000            -0.149
## number_of_reviews              -0.049         -0.149             1.000
## reviews_per_month              -0.050         -0.222             0.588
## calculated_host_listings_count  0.163          0.333            -0.072
## availability_365                0.093          0.242             0.182
##                                reviews_per_month calculated_host_listings_count
## price                                     -0.050                          0.163
## minimum_nights                            -0.222                          0.333
## number_of_reviews                          0.588                         -0.072
## reviews_per_month                          1.000                         -0.048
## calculated_host_listings_count            -0.048                          1.000
## availability_365                           0.175                          0.230
##                                availability_365
## price                                     0.093
## minimum_nights                            0.242
## number_of_reviews                         0.182
## reviews_per_month                         0.175
## calculated_host_listings_count            0.230
## availability_365                          1.000

The strongest correlation with price is for calculated_host_listings_count (r ≈ 0.16). It is positive, meaning hosts with more listings tend to charge slightly more. number_of_reviews and reviews_per_month are slightly negatively correlated with price (r ≈ -0.05), possibly because cheaper listings tend to get more bookings. All correlations with price are quite weak, so we expect the categorical variables to do most of the work in the regression model.

3.3 Simple linear regressions

Price ~ room_type

slr_room <- lm(price ~ room_type, data = airbnb)
summary(slr_room)

## 
## Call:
## lm(formula = price ~ room_type, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.06  -39.87  -15.06   19.94  436.28 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            180.0610     0.4695  383.51   <2e-16 ***
## room_typePrivate room  -98.1886     0.6795 -144.51   <2e-16 ***
## room_typeShared room  -116.3436     2.2129  -52.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.76 on 47086 degrees of freedom
## Multiple R-squared:  0.3173, Adjusted R-squared:  0.3173 
## F-statistic: 1.094e+04 on 2 and 47086 DF,  p-value: < 2.2e-16

The reference level is “Entire home/apt”. The intercept (≈ $180) is the estimated average price for an entire home. Private rooms are about $98 cheaper and shared rooms about $116 cheaper. Both are highly significant. R² = 0.32, meaning room type alone explains 32% of the variation in price.
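As a worked check of how the dummy coefficients are read, the sketch below rebuilds the three estimated group means from the coefficients printed in the summary output. The numbers are copied from that output; nothing is refitted.

```r
# Coefficients copied from the printed summary(slr_room) output
b0        <- 180.0610    # intercept = mean price of the reference level
b_private <- -98.1886    # Private room minus Entire home/apt
b_shared  <- -116.3436   # Shared room minus Entire home/apt

# In a regression on a single factor, fitted values are the group means
group_means <- c(
  "Entire home/apt" = b0,
  "Private room"    = b0 + b_private,
  "Shared room"     = b0 + b_shared
)
print(round(group_means, 2))
```

On the real data, these values should agree with `tapply(airbnb$price, airbnb$room_type, mean)`.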

Price ~ neighbourhood_group

slr_neigh <- lm(price ~ neighbourhood_group, data = airbnb)
summary(slr_neigh)

## 
## Call:
## lm(formula = price ~ neighbourhood_group, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.30  -56.04  -20.72   36.70  418.06 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        81.938      2.546  32.179  < 2e-16 ***
## neighbourhood_groupBrooklyn        30.779      2.615  11.771  < 2e-16 ***
## neighbourhood_groupManhattan       81.365      2.612  31.151  < 2e-16 ***
## neighbourhood_groupQueens          11.104      2.779   3.995 6.47e-05 ***
## neighbourhood_groupStaten Island    9.494      5.056   1.878   0.0604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.22 on 47084 degrees of freedom
## Multiple R-squared:  0.107,  Adjusted R-squared:  0.107 
## F-statistic:  1411 on 4 and 47084 DF,  p-value: < 2.2e-16

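The reference borough is the Bronx. Manhattan listings are about $81 more expensive per night, Brooklyn about $31 more, and Queens about $11 more. The Staten Island difference (about $9) is not significant at the 5% level (p = 0.06). R² = 0.11.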
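The Bronx is the reference only because it comes first alphabetically. The sketch below, on a hypothetical six-row toy data set, shows how `relevel()` changes the reference level and how the coefficients are then read relative to the new baseline.

```r
# Hypothetical toy data (two listings per borough)
toy <- data.frame(
  borough = factor(c("Bronx", "Bronx", "Brooklyn", "Brooklyn",
                     "Manhattan", "Manhattan")),
  price   = c(70, 90, 100, 120, 150, 170)
)

# Default: the first level alphabetically (Bronx) is the reference
fit_bronx <- lm(price ~ borough, data = toy)

# Make Manhattan the reference instead
toy$borough_m <- relevel(toy$borough, ref = "Manhattan")
fit_manhattan <- lm(price ~ borough_m, data = toy)

# Same comparison, opposite sign
print(coef(fit_bronx)["boroughManhattan"])    # Manhattan minus Bronx
print(coef(fit_manhattan)["borough_mBronx"])  # Bronx minus Manhattan
```

The fitted values and R² are identical under either coding; only the baseline for the comparisons changes.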

Price ~ calculated_host_listings_count

slr_host <- lm(price ~ calculated_host_listings_count, data = airbnb)
summary(slr_host)

## 
## Call:
## lm(formula = price ~ calculated_host_listings_count, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.88  -62.86  -28.86   37.51  371.14 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    128.42385    0.40959  313.55   <2e-16 ***
## calculated_host_listings_count   0.43566    0.01218   35.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.89 on 47087 degrees of freedom
## Multiple R-squared:  0.02645,    Adjusted R-squared:  0.02643 
## F-statistic:  1279 on 1 and 47087 DF,  p-value: < 2.2e-16

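The relationship is positive and significant: each additional listing managed by the host is associated with about $0.44 more per night. However, R² = 0.03, so this variable alone is a weak predictor.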

Price ~ availability_365

slr_avail <- lm(price ~ availability_365, data = airbnb)
summary(slr_avail)

## 
## Call:
## lm(formula = price ~ availability_365, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135.46  -64.57  -24.58   39.13  375.42 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.246e+02  5.287e-01  235.64   <2e-16 ***
## availability_365 6.289e-02  3.094e-03   20.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.68 on 47087 degrees of freedom
## Multiple R-squared:  0.008698,   Adjusted R-squared:  0.008677 
## F-statistic: 413.2 on 1 and 47087 DF,  p-value: < 2.2e-16

The coefficient is positive and significant but very small: about $0.06 per additional available day, or roughly $6.30 per 100 days. R² = 0.01.
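In a simple linear regression with one numeric predictor, R² equals the squared Pearson correlation. The sketch below checks this against the rounded values printed in the correlation matrix and the regression summaries; small differences are due to rounding of r to three decimals.

```r
# Correlations with price, copied from the printed correlation matrix
r_host  <- 0.163   # calculated_host_listings_count
r_avail <- 0.093   # availability_365

# R-squared values copied from the printed regression summaries
r2_host_reported  <- 0.02645
r2_avail_reported <- 0.008698

r2_host  <- r_host^2
r2_avail <- r_avail^2

cat("Host listings: r^2 =", round(r2_host, 4),
    "vs reported R^2 =", r2_host_reported, "\n")
cat("Availability:  r^2 =", round(r2_avail, 4),
    "vs reported R^2 =", r2_avail_reported, "\n")
```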

4. Multiple Regression

4.1 Full model

model_full <- lm(price ~ room_type + neighbourhood_group +
                   minimum_nights + number_of_reviews +
                   reviews_per_month + calculated_host_listings_count +
                   availability_365,
                 data = airbnb)
summary(model_full)

## 
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + minimum_nights + 
##     number_of_reviews + reviews_per_month + calculated_host_listings_count + 
##     availability_365, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -208.79  -39.34  -10.74   20.78  458.78 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                       1.372e+02  2.174e+00   63.114  < 2e-16 ***
## room_typePrivate room            -9.273e+01  6.496e-01 -142.757  < 2e-16 ***
## room_typeShared room             -1.185e+02  2.078e+00  -57.015  < 2e-16 ***
## neighbourhood_groupBrooklyn       2.602e+01  2.145e+00   12.129  < 2e-16 ***
## neighbourhood_groupManhattan      6.420e+01  2.150e+00   29.856  < 2e-16 ***
## neighbourhood_groupQueens         1.177e+01  2.270e+00    5.186 2.15e-07 ***
## neighbourhood_groupStaten Island -4.535e+00  4.127e+00   -1.099    0.272    
## minimum_nights                   -1.360e+00  4.320e-02  -31.489  < 2e-16 ***
## number_of_reviews                -1.103e-01  8.703e-03  -12.676  < 2e-16 ***
## reviews_per_month                -1.824e+00  2.472e-01   -7.376 1.65e-13 ***
## calculated_host_listings_count    1.829e-01  1.037e-02   17.637  < 2e-16 ***
## availability_365                  1.016e-01  2.637e-03   38.515  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.9 on 47077 degrees of freedom
## Multiple R-squared:  0.4056, Adjusted R-squared:  0.4055 
## F-statistic:  2920 on 11 and 47077 DF,  p-value: < 2.2e-16

The R² for the full model is 0.4056, so the model explains about 40.6% of the variation in price. The F-statistic is very large (2920 on 11 and 47077 degrees of freedom) and its p-value is below 2.2e-16, so the predictors are jointly significant.
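The overall F-statistic and the adjusted R² can both be recovered from R², the sample size, and the number of estimated slopes. The sketch below takes R² = 0.4056, n = 47,089, and k = 11 from the printed summary output; because R² is rounded, the results match the printed values only approximately.

```r
# Values taken from the printed summary(model_full) output
r2 <- 0.4056   # Multiple R-squared
n  <- 47089    # observations after removing prices above $500
k  <- 11       # slope coefficients (dummies count individually)

df_resid <- n - k - 1                          # residual degrees of freedom

# Overall F test: explained variation per slope vs unexplained per residual df
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Adjusted R-squared penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

cat("Residual df:", df_resid, "\n")
cat("F statistic:", round(f_stat, 1), "\n")
cat("Adjusted R^2:", round(adj_r2, 4), "\n")
```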

Looking at the individual coefficients:

Room type still has the biggest effect. Private rooms are about $93 cheaper than entire homes and shared rooms are about $119 cheaper, holding the other variables fixed.
Borough: Manhattan is about $64 more expensive than the Bronx, Brooklyn about $26 more, and Queens about $12 more. The Staten Island effect (≈ -$4.5) is not significant (p = 0.27).
minimum_nights has a small negative coefficient (≈ -$1.36 per additional required night).
number_of_reviews has a tiny negative coefficient (≈ -$0.11 per review), possibly because cheaper listings get more bookings and therefore more reviews.
reviews_per_month has a small negative effect (≈ -$1.82 per additional monthly review), while calculated_host_listings_count (≈ +$0.18) and availability_365 (≈ +$0.10) have small positive effects. All three are significant.
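To show how the coefficients combine into a prediction, the sketch below computes the fitted price for one hypothetical listing: a private room in Manhattan with numeric predictors set near the sample medians. The coefficients are copied (rounded) from the printed summary(model_full) output, so the result is approximate.

```r
# Rounded coefficients copied from the printed summary(model_full) output
coefs <- c(
  intercept         = 137.2,
  private_room      = -92.73,
  manhattan         = 64.20,
  minimum_nights    = -1.360,
  number_of_reviews = -0.1103,
  reviews_per_month = -1.824,
  host_listings     = 0.1829,
  availability_365  = 0.1016
)

# Hypothetical listing: private room in Manhattan, numeric values near the medians
listing <- c(
  intercept         = 1,
  private_room      = 1,
  manhattan         = 1,
  minimum_nights    = 2,
  number_of_reviews = 5,
  reviews_per_month = 0.39,
  host_listings     = 1,
  availability_365  = 41
)

# Linear predictor: sum of coefficient times value
predicted_price <- sum(coefs * listing[names(coefs)])
cat("Predicted nightly price: about $", round(predicted_price), "\n")
```

With the fitted model object available, `predict(model_full, newdata = ...)` performs the same calculation using the unrounded coefficients.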

4.2 Residual diagnostics

par(mfrow = c(2, 2))
plot(model_full)

par(mfrow = c(1, 1))

Residuals vs Fitted: Not completely flat, more spread for higher fitted values — partly because of the right-skewed price distribution.
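Normal Q-Q: Residuals don’t follow the normal line perfectly in the upper tail. With about 47,000 observations, the coefficient tests and confidence intervals are still approximately valid as far as normality is concerned (central limit theorem), although the non-normality would matter for prediction intervals.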
Scale-Location: Mild heteroscedasticity — common with price data.
Residuals vs Leverage: No obvious extreme leverage points.
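The visual impression of heteroscedasticity could be backed by a formal test. The sketch below is not part of our analysis: it uses simulated data whose noise grows with the predictor, and hand-computes the studentized (Koenker) form of the Breusch-Pagan statistic as n times the R² of a regression of squared residuals on the predictor.

```r
# Simulated data with noise that grows with x (illustrative only)
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic:", round(lm_stat, 2), "| p-value:", signif(p_value, 3), "\n")
```

A small p-value indicates non-constant variance. If the lmtest package is installed, `lmtest::bptest(model_full)` would be the more convenient route for the real model.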

4.3 Selected model

model_selected <- lm(price ~ room_type + neighbourhood_group +
                       reviews_per_month + calculated_host_listings_count +
                       availability_365,
                     data = airbnb)
summary(model_selected)

## 
## Call:
## lm(formula = price ~ room_type + neighbourhood_group + reviews_per_month + 
##     calculated_host_listings_count + availability_365, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202.78  -39.74  -11.85   20.01  444.31 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                       1.326e+02  2.194e+00   60.432  < 2e-16 ***
## room_typePrivate room            -9.047e+01  6.535e-01 -138.439  < 2e-16 ***
## room_typeShared room             -1.144e+02  2.098e+00  -54.523  < 2e-16 ***
## neighbourhood_groupBrooklyn       2.271e+01  2.168e+00   10.475  < 2e-16 ***
## neighbourhood_groupManhattan      5.997e+01  2.172e+00   27.610  < 2e-16 ***
## neighbourhood_groupQueens         9.802e+00  2.296e+00    4.269 1.96e-05 ***
## neighbourhood_groupStaten Island -4.476e+00  4.176e+00   -1.072    0.284    
## reviews_per_month                -1.907e+00  2.013e-01   -9.473  < 2e-16 ***
## calculated_host_listings_count    1.100e-01  1.012e-02   10.871  < 2e-16 ***
## availability_365                  7.735e-02  2.565e-03   30.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.7 on 47079 degrees of freedom
## Multiple R-squared:  0.3914, Adjusted R-squared:  0.3913 
## F-statistic:  3364 on 9 and 47079 DF,  p-value: < 2.2e-16

The R² of the selected model is 0.3914, compared with 0.4056 for the full model, a drop of about 1.4 percentage points. Removing number_of_reviews and minimum_nights therefore costs little explanatory power, even though both were statistically significant in the full model. We prefer this simpler model.
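The comparison between the two nested models can be expressed as a partial F-test. The sketch below is a rough calculation from the rounded R² values printed in the two summaries (0.4056 and 0.3914), so the statistic is approximate.

```r
# Values taken from the two printed model summaries
r2_full       <- 0.4056   # full model
r2_sel        <- 0.3914   # selected model (two predictors dropped)
q             <- 2        # number of dropped predictors
df_resid_full <- 47077    # residual df of the full model

# Partial F: gain in R^2 per dropped predictor,
# relative to unexplained variation per residual df in the full model
f_partial <- ((r2_full - r2_sel) / q) / ((1 - r2_full) / df_resid_full)
p_value   <- pf(f_partial, q, df_resid_full, lower.tail = FALSE)

cat("Partial F:", round(f_partial, 1), "| p-value:", signif(p_value, 3), "\n")
```

With a sample this large, even a small gain in R² is statistically significant, so the choice of the simpler model rests on parsimony rather than on the test. With the fitted objects available, `anova(model_selected, model_full)` gives the exact version of this test.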

4.4 Confidence intervals

round(confint(model_selected, level = 0.95), 2)

##                                    2.5 %  97.5 %
## (Intercept)                       128.29  136.89
## room_typePrivate room             -91.75  -89.19
## room_typeShared room             -118.52 -110.29
## neighbourhood_groupBrooklyn        18.46   26.96
## neighbourhood_groupManhattan       55.71   64.23
## neighbourhood_groupQueens           5.30   14.30
## neighbourhood_groupStaten Island  -12.66    3.71
## reviews_per_month                  -2.30   -1.51
## calculated_host_listings_count      0.09    0.13
## availability_365                    0.07    0.08

All confidence intervals exclude 0 except the one for Staten Island (-12.66 to 3.71), so every predictor other than the Staten Island indicator is significant at the 5% level. Intervals are narrow because of the large sample size. For example, we are 95% confident that private rooms are between about $89 and $92 cheaper than entire homes per night, controlling for other variables.
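Each interval is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces two of the intervals from the estimates and standard errors printed in summary(model_selected); the helper `ci` is our own illustrative function.

```r
# Residual degrees of freedom from the printed summary(model_selected) output
df_resid <- 47079
t_crit   <- qt(0.975, df = df_resid)   # close to 1.96 for such a large df

# Hypothetical helper: 95% interval from an estimate and its standard error
ci <- function(est, se) est + c(-1, 1) * t_crit * se

ci_private <- ci(-90.47, 0.6535)   # room_typePrivate room
ci_staten  <- ci(-4.476, 4.176)    # neighbourhood_groupStaten Island

print(round(ci_private, 2))
print(round(ci_staten, 2))
```

The Staten Island interval spans 0, which matches its non-significant p-value of 0.284 in the model summary.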

5. Conclusion

In this project we analyzed 47,089 Airbnb listings in New York City to understand what drives the nightly price. We used descriptive statistics, visualizations, simple regressions, and a multiple regression model.

The main finding is that room type is by far the most important predictor of price. Entire homes are on average much more expensive than private rooms or shared rooms. The borough also matters a lot: Manhattan listings are significantly more expensive than those in the other boroughs. Our final model explains about 39.1% of the variation in price.

Strengths: Very large dataset giving reliable estimates. Both categorical and numeric predictors were included. Histograms and boxplots were provided for every numeric variable, and bar charts for the categorical ones.

Limitations: The price variable is skewed, leading to non-normal residuals and mild heteroscedasticity. Only 39.1% of price variation is explained, so important factors like listing quality, exact location, or amenities are missing. The model also shows associations, not causation.

Applied Statistics — Group 4 | March 25, 2026
Dataset: Airbnb