1 Abstract

This portfolio applies a sequence of statistical learning methods to an Airbnb dataset, beginning with exploratory data analysis and progressing through four models: multiple linear regression, multinomial logistic regression, Poisson regression, and LASSO regularization. Each model is selected to match the scale of its outcome and is evaluated using cross-validation within the tidymodels framework. Across the portfolio, I highlight probability-based reasoning, likelihood-driven inference, GLM selection, model tuning, and interpretation of results. The goal is to demonstrate practical modeling skills while producing clear, accessible explanations of what the models reveal about Airbnb pricing and host behavior.

2 Introduction

I present the work I completed for Statistical Modeling I using a real Airbnb dataset. I begin with an exploratory analysis to understand how price, property features, host characteristics, and guest activity vary across listings. After establishing this foundation, I develop four statistical models (multiple linear regression, multinomial logistic regression, Poisson regression, and a LASSO regularization model), each chosen to answer a different practical question about the data.

These models allow me to demonstrate the five course learning objectives:

1. Apply probability, inference, and maximum likelihood in regression modeling.

2. Select and use an appropriate generalized linear model (GLM) for a specific context.

3. Demonstrate model selection using cross-validation or other comparative tools.

4. Communicate model results clearly to a general audience.

5. Use statistical software (`tidymodels`) to fit, tune, and assess models.

My goal throughout the portfolio is not only to fit models but to interpret them in meaningful, plain language. By the end, I show how different modeling approaches provide insight into Airbnb pricing patterns, property categories, and review behavior, while also demonstrating my understanding of core statistical modeling concepts.

3 Loading libraries

These libraries provide the tools for data cleaning, visualization, modeling, and evaluation used throughout the portfolio.

library(tidyverse)
library(tidymodels)
library(stringr)
library(lubridate)
library(reshape2)
library(yardstick)
library(rlang)
library(poissonreg)
library(workflows)

4 Loading Data

This cleaning step prepares the Airbnb dataset for analysis by converting price values into numeric form, extracting amenity counts, standardizing host characteristics, and creating useful variables such as host tenure. I also selected the key predictors needed for modeling and removed listings with missing or invalid prices. These transformations ensure that the data structure is consistent, interpretable, and ready for the exploratory analysis and modeling that follow.

listings_raw <- readr::read_csv("files/listings.csv")

# Data Cleaning
listings <- listings_raw |>
  mutate(
    price = readr::parse_number(price),
    amenity_count = stringr::str_count(amenities, ",") + 1
  ) |>
  mutate(
    host_acceptance_rate = na_if(host_acceptance_rate, "N/A"),
    host_acceptance_rate = parse_number(host_acceptance_rate) / 100
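  ) |>
  mutate(
    host_since = ymd(host_since),
    # Host tenure in years as of 2025-10-05; host_years is not listed in
    # select() below, so it is dropped there
    host_years = as.numeric(difftime(
      ymd("2025-10-05"),
      host_since,
      units = "days"
    )) / 365
  ) |>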
  mutate(
    host_is_superhost = host_is_superhost == TRUE | host_is_superhost == "t",
    host_has_profile_pic = host_has_profile_pic == TRUE |
      host_has_profile_pic == "t",
    host_identity_verified = host_identity_verified == TRUE |
      host_identity_verified == "t"
  ) |>
  select(
    price,
    accommodates,
    bedrooms,
    bathrooms,
    beds,
    number_of_reviews,
    latitude,
    longitude,
    room_type,
    property_type,
    minimum_nights,
    maximum_nights,
    availability_365,
    amenities,
    host_since,
    host_acceptance_rate,
    host_is_superhost,
    host_listings_count,
    host_total_listings_count,
    host_has_profile_pic,
    host_identity_verified,
    amenity_count,
    review_scores_rating
  ) |>
  filter(!is.na(price), price > 0)
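
The comma-counting rule behind `amenity_count` can be checked on a few toy strings. This is a minimal sketch in base R: the example strings are invented for illustration and assume that `amenities` stores a bracketed, comma-separated list, and `count_commas()` is a hypothetical helper standing in for `stringr::str_count()`.

```r
# Toy amenity strings (hypothetical, not taken from listings.csv)
amenities <- c('["Wifi", "Kitchen", "Heating"]', '["Wifi"]', '[]')

# Base-R stand-in for stringr::str_count(amenities, ",")
count_commas <- function(x) {
  lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
}

# Same rule as in the cleaning step: number of commas plus one
amenity_count <- count_commas(amenities) + 1
amenity_count
# 3 1 1  (the empty list "[]" is still counted as one amenity)
```

The rule overcounts an empty list and would also overcount any amenity whose name contains a comma. The skim summary reports a minimum `amenities` string length of 73 characters, so the empty-list case probably does not arise in this dataset.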

4.1 Skimming

nrow(listings)
## [1] 425
skimr::skim(listings)
Data summary
Name listings
Number of rows 425
Number of columns 23
_______________________
Column type frequency:
character 3
Date 1
logical 3
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
room_type 0 1 12 15 0 2 0
property_type 0 1 11 33 0 18 0
amenities 0 1 73 1716 0 412 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2011-05-31 2025-07-30 2018-08-19 172

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 3 0.99 0.52 TRU: 219, FAL: 203
host_has_profile_pic 0 1.00 0.97 TRU: 413, FAL: 12
host_identity_verified 0 1.00 0.91 TRU: 386, FAL: 39

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
price 0 1.00 119.88 119.96 24.00 68.00 94.00 126.00 1314.00 ▇▁▁▁▁
accommodates 0 1.00 3.45 2.44 1.00 2.00 2.00 4.00 16.00 ▇▂▁▁▁
bedrooms 0 1.00 1.58 1.10 0.00 1.00 1.00 2.00 9.00 ▇▃▁▁▁
bathrooms 0 1.00 1.23 0.64 0.00 1.00 1.00 1.00 7.00 ▇▂▁▁▁
beds 0 1.00 1.84 1.32 0.00 1.00 1.00 2.00 10.00 ▇▂▁▁▁
number_of_reviews 0 1.00 62.76 114.15 0.00 3.00 21.00 71.00 962.00 ▇▁▁▁▁
latitude 0 1.00 42.66 0.01 42.63 42.65 42.66 42.67 42.71 ▂▇▃▁▁
longitude 0 1.00 -73.78 0.02 -73.88 -73.79 -73.77 -73.76 -73.74 ▁▁▂▇▆
minimum_nights 0 1.00 5.59 12.16 1.00 1.00 2.00 3.00 180.00 ▇▁▁▁▁
maximum_nights 0 1.00 490.28 396.92 3.00 300.00 365.00 730.00 1125.00 ▅▇▁▁▅
availability_365 0 1.00 249.48 114.21 0.00 158.00 278.00 351.00 365.00 ▂▂▂▃▇
host_acceptance_rate 7 0.98 0.88 0.22 0.00 0.88 0.98 1.00 1.00 ▁▁▁▁▇
host_listings_count 0 1.00 27.60 140.14 1.00 2.00 5.00 17.00 1250.00 ▇▁▁▁▁
host_total_listings_count 0 1.00 48.26 293.29 1.00 2.00 7.00 17.00 2692.00 ▇▁▁▁▁
amenity_count 0 1.00 37.97 16.13 5.00 26.00 39.00 48.00 89.00 ▃▆▇▂▁
review_scores_rating 55 0.87 4.74 0.37 2.00 4.67 4.85 4.97 5.00 ▁▁▁▁▇

The raw dataset had 461 listings. After cleaning, 425 remain, so 36 listings were dropped for having a missing or non-positive price.

5 EDA

The exploratory analysis below summarizes the main patterns in prices, listing characteristics, host attributes, and availability before moving into the modeling stage.

5.1 Price Distribution

ggplot(listings, aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Airbnb Prices",
    x = "Price (USD)",
    y = "Count"
  ) +
  theme_minimal()

The price distribution is heavily right-skewed. Most listings fall between about $50 and $250 per night, with a small number of high-end listings priced well above $500. The long tail shows that a few luxury properties pull the upper range upward, but they represent only a tiny fraction of the market. This skew suggests that applying a log transformation to price may help stabilize the variance and improve model performance in later steps.

5.2 Log price distribution

ggplot(listings, aes(x = log10(price))) +
  geom_histogram(bins = 50, fill = "darkgreen", color = "white") +
  labs(title = "Distribution of log10(price)", x = "log10(price)", y = "Count")

After applying a log transformation, the price distribution becomes much closer to a bell-shaped curve. Most transformed prices fall between roughly 1.8 and 2.3, which corresponds to original prices of about $63 to $200. The long right tail seen in the raw price plot is still present, but it’s far less extreme. This confirms that using a log transformation will help the model handle the wide range of prices more effectively, reduce the impact of outliers, and produce more stable coefficient estimates.
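
The conversion from the log10 scale back to dollars quoted above can be reproduced directly. A minimal sketch, using only the two endpoints mentioned in the text:

```r
# Endpoints of the central range on the log10(price) scale
log_price_range <- c(1.8, 2.3)

# Undo the log10 transformation: price = 10^log10(price)
dollars <- 10^log_price_range
round(dollars, 1)
# 63.1 199.5
```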

5.3 Room type effects on price

listings |>
  group_by(room_type) |>
  summarize(mean_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = reorder(room_type, mean_price), y = mean_price)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(title = "Average Price by Room Type", x = "Room Type", y = "Mean Price")

Entire home/apartment listings are noticeably more expensive than private rooms. The average price for an entire home is close to twice that of a private room, which makes sense because guests get the whole space to themselves. Private rooms tend to stay in a more affordable range, reflecting their smaller size and shared-living setup. This difference shows that room type is an important predictor of nightly price and should help the model capture variation in how listings are priced.

5.4 Property type effects

listings |>
  count(property_type, sort = TRUE) |>
  slice_max(n, n = 10) |>   # keep the 10 most common property types
  left_join(listings, by = "property_type") |>
  group_by(property_type) |>
  summarize(mean_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = reorder(property_type, mean_price), y = mean_price)) +
  geom_col(fill = "purple", alpha = 0.8) +
  coord_flip() +
  labs(
    title = "Average Price for the Most Common Property Types",
    x = "Property Type",
    y = "Mean Price"
  )

Entire homes consistently command the highest prices, with “Entire home” and “Entire vacation home” sitting at the top of the list. These property types offer full privacy and larger spaces, which drives up nightly rates. Townhouses, guest suites, and bed and breakfast rooms follow closely, showing that full-unit rentals generally sit in a higher price tier. On the lower end, private rooms (whether in a home, guest suite, or rental unit) tend to cost much less. This spread suggests that property type plays a major role in how hosts set prices, and it will likely help the model distinguish between budget listings and more premium options.

5.5 Price vs accommodates

ggplot(listings, aes(x = accommodates, y = price)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Price vs Accommodates", x = "Accommodates", y = "Price")

There’s a clear upward trend between the number of guests a listing can accommodate and its nightly price. Listings that host more people generally cost more, which makes sense because larger properties often have more bedrooms, more space, and additional amenities.

5.6 Price vs review score

ggplot(listings, aes(x = review_scores_rating, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Price vs Review Score", x = "Review Score", y = "Price")

Most listings cluster at high review scores, usually above 4.5, and there is a wide spread of prices at each rating value. The trend line is fairly flat, which suggests that higher ratings are not strongly associated with higher prices in this dataset. In other words, listings with excellent scores do not necessarily charge more than those with slightly lower scores. This fits the idea that ratings reflect guest satisfaction but may not be the main driver of pricing decisions.

5.7 Distribution of key numeric predictors

listings |> select(accommodates, bedrooms, bathrooms,
                   beds, number_of_reviews, minimum_nights,
                   maximum_nights, availability_365,
                   amenity_count, host_acceptance_rate,
                   host_listings_count, host_total_listings_count,
                   review_scores_rating) |>
  pivot_longer(everything(), 
               names_to = "variable", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, 
                 fill = "grey60", 
                 color = "white") +
  facet_wrap(~ variable, scales = "free") +
  labs(
    title = "Distributions of Key Numeric Variables",
    x = NULL,
    y = "Count") +
  theme_minimal()

Most of these variables are skewed rather than symmetric. accommodates, bedrooms, and beds all have long right tails, with many small units and a few large properties. number_of_reviews is also heavily right-skewed, with many listings that have only a small number of reviews and a small group of very popular listings. minimum_nights and maximum_nights show that most hosts allow short stays, but some impose very long minimum or maximum stays, which may affect how often those listings are booked. availability_365 has clear spikes at 0 and 365, suggesting some listings are almost never available and some are “always” available. amenity_count is moderately right-skewed, indicating that most listings offer a standard set of amenities, while a smaller subset offers a very long list. Host-level counts (such as host_total_listings_count) also show strong skew, with many individual hosts and a smaller group of professional hosts with many listings.
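
One way to put a number on the skew described above is a moment-based skewness coefficient. The sketch below is illustrative only: `skewness()` is a hypothetical helper defined here (not a function from the packages loaded earlier), and the two vectors are toy data rather than columns from `listings`.

```r
# Moment-based sample skewness: mean cubed deviation divided by SD cubed
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Toy vectors (hypothetical): one symmetric, one with a long right tail
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 50)

skewness(symmetric)   # zero for a symmetric sample
skewness(right_tail)  # positive for a right-skewed sample
```

Values near zero indicate a roughly symmetric variable, and larger positive values indicate a longer right tail.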

5.8 Host characteristics

5.8.1 Host acceptance rate

ggplot(listings, aes(x = host_acceptance_rate)) +
  geom_histogram(bins = 30, 
                 fill = "steelblue", 
                 color = "white") +
  labs(
    title = "Distribution of Host Acceptance Rate",
    x = "Host Acceptance Rate",
    y = "Count") +
  theme_minimal()

Most hosts have high acceptance rates, clustered near 1 (100%), with a smaller number of hosts who accept only a fraction of booking requests. This suggests that most hosts are willing to accept inquiries, while a minority are more selective. Because the distribution is skewed toward high values, it may make sense to treat very low acceptance rates as a special case or consider transformations if this variable is used in regression.

5.8.2 Superhost and verification status

listings |> summarise(
  superhost_yes = mean(host_is_superhost, na.rm = TRUE),
  profile_pic_yes = mean(host_has_profile_pic, na.rm = TRUE),
  identity_verified_yes = mean(host_identity_verified, na.rm = TRUE))
listings |> mutate(
  host_is_superhost = factor(host_is_superhost, 
                             levels = c(FALSE, TRUE)),
  host_has_profile_pic = factor(host_has_profile_pic, levels = c(FALSE, TRUE)),
  host_identity_verified = factor(host_identity_verified, levels = c(FALSE, TRUE))) |>
  drop_na(host_is_superhost) |>
  select(host_is_superhost, host_has_profile_pic, host_identity_verified) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  ggplot(aes(x = variable, fill = value)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Proportion of Hosts by Status",
    x = NULL,
    y = "Proportion") +
  theme_minimal()

Nearly all hosts have profile pictures (97%) and verified identities (91%), while only about half (52%) hold “superhost” status. This suggests that profile pictures and identity verification are now standard expectations on the platform, but superhost designation is more selective. These host attributes may influence guest trust and booking behavior, and they are good candidates to include as predictors in the models.

5.8.3 Price vs superhost status

listings |>
  mutate(host_is_superhost = factor(host_is_superhost, 
                                    levels = c(FALSE, TRUE))) |>
  drop_na(host_is_superhost) |>
  ggplot(aes(x = host_is_superhost, 
             y = price)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Price by Superhost Status",
    x = "Is Superhost?",
    y = "Price") +
  theme_minimal()

Superhosts tend to have slightly higher median prices, but there is a lot of overlap between the two groups. This suggests that superhost status may be associated with somewhat higher pricing, but it is not the main determinant of price. Other factors like room type, property type, and capacity likely play a bigger role.

5.9 Amenities and price

5.9.1 Amenity count distribution

ggplot(listings, aes(x = amenity_count)) +
  geom_histogram(bins = 30, 
                 fill = "darkorange", 
                 color = "white") +
  labs(
    title = "Distribution of Amenity Count",
    x = "Number of Amenities",
    y = "Count") +
  theme_minimal()

Amenity counts are clustered around a moderate range, with most listings offering a typical set of features such as Wi-Fi, kitchen access, and basic essentials, and fewer listings offering very long amenity lists. The right tail suggests that some hosts invest heavily in extra features.

ggplot(listings, aes(x = amenity_count, 
                     y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "red") +
  labs(
    title = "Price vs Number of Amenities",
    x = "Amenity Count",
    y = "Price") +
  theme_minimal()

There is a mild positive trend: listings with more amenities tend to be more expensive. However, the spread is wide, and there are many reasonably priced listings with high amenity counts as well as some expensive listings with fewer amenities. This suggests that amenities matter, but they are only one piece of the pricing puzzle.

5.10 Booking constraints and availability

5.10.1 Minimum nights and price

ggplot(listings, aes(x = minimum_nights, 
                     y = price)) +
  geom_point(alpha = 0.4) +
  scale_x_continuous(trans = "log1p") +
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "red") +
  labs(title = "Price vs Minimum Nights (log scale on x)",
       x = "Minimum Nights (log1p)",
       y = "Price") +
  theme_minimal()

Most listings allow short stays, and there is no strong linear relationship between minimum nights and price once we account for the heavy skew by using a log scale on the x-axis. Some listings with very long minimum stays appear across a wide range of prices. This suggests that minimum stay rules are more about host strategy and local regulations than pure price optimization.

5.10.2 Availability over the year

ggplot(listings, aes(x = availability_365)) +
  geom_histogram(bins = 30, 
                 fill = "lightgreen", 
                 color = "white") +
  labs(title = "Distribution of Availability (Days per Year)",
       x = "Available Days per Year",
       y = "Count") +
  theme_minimal()

Availability is concentrated at the extremes: some listings are almost never available (close to 0 days), and some are listed as available year-round (365 days). This pattern may reflect a mix of full-time rentals and properties that are used primarily as personal residences. It also suggests that availability_365 may not behave like a simple continuous predictor and might be more informative when grouped into categories.
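
Grouping availability into categories could look like the sketch below. The cut points (30 and 300 days) and the labels are illustrative assumptions, not values used elsewhere in this portfolio, and the input is a toy vector.

```r
# Toy availability values in days per year (hypothetical)
availability_365 <- c(0, 12, 90, 200, 340, 365)

# Bin into three ordered groups; the breaks are assumptions for illustration
availability_group <- cut(
  availability_365,
  breaks = c(-Inf, 30, 300, Inf),
  labels = c("rarely available", "part of the year", "most of the year")
)

table(availability_group)
```

A grouped version like this would enter a model as a factor, so each group gets its own effect instead of a single slope across all 365 days.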

ggplot(listings, aes(x = availability_365, 
                     y = price)) +
  geom_point(alpha = 0.4) +
  labs(title = "Price vs Availability",
       x = "Available Days per Year",
       y = "Price") +
  theme_minimal()

There is no obvious strong trend between availability and price. Listings with both low and high availability can be either cheap or expensive. This supports the idea that availability is driven more by host usage patterns and seasonality than by price alone.

5.11 Correlation among numeric variables

5.11.1 Correlation Matrix

numeric_vars <- listings |>
  select(price, accommodates, bedrooms,
         bathrooms, beds, number_of_reviews,
         minimum_nights, maximum_nights,
         availability_365, amenity_count,
         host_acceptance_rate, host_listings_count,
         host_total_listings_count,
         review_scores_rating)

cor_mat <- cor(numeric_vars, use = "pairwise.complete.obs")
round(cor_mat, 2)
##                           price accommodates bedrooms bathrooms  beds
## price                      1.00         0.67     0.62      0.48  0.60
## accommodates               0.67         1.00     0.86      0.53  0.84
## bedrooms                   0.62         0.86     1.00      0.53  0.85
## bathrooms                  0.48         0.53     0.53      1.00  0.49
## beds                       0.60         0.84     0.85      0.49  1.00
## number_of_reviews         -0.04         0.04    -0.06     -0.03  0.04
## minimum_nights            -0.10        -0.16    -0.06     -0.06 -0.10
## maximum_nights             0.00         0.06     0.06     -0.06  0.04
## availability_365           0.01         0.03    -0.01     -0.03 -0.04
## amenity_count              0.28         0.33     0.25      0.29  0.31
## host_acceptance_rate       0.07         0.11     0.03      0.10  0.08
## host_listings_count        0.02         0.05     0.06     -0.05  0.04
## host_total_listings_count  0.03         0.06     0.07     -0.04  0.05
## review_scores_rating      -0.03        -0.01    -0.05      0.07 -0.02
##                           number_of_reviews minimum_nights maximum_nights
## price                                 -0.04          -0.10           0.00
## accommodates                           0.04          -0.16           0.06
## bedrooms                              -0.06          -0.06           0.06
## bathrooms                             -0.03          -0.06          -0.06
## beds                                   0.04          -0.10           0.04
## number_of_reviews                      1.00          -0.17           0.16
## minimum_nights                        -0.17           1.00           0.04
## maximum_nights                         0.16           0.04           1.00
## availability_365                      -0.03          -0.02           0.12
## amenity_count                          0.22          -0.10           0.11
## host_acceptance_rate                   0.15          -0.32           0.04
## host_listings_count                   -0.07          -0.01           0.18
## host_total_listings_count             -0.06          -0.02           0.17
## review_scores_rating                   0.13           0.03           0.03
##                           availability_365 amenity_count host_acceptance_rate
## price                                 0.01          0.28                 0.07
## accommodates                          0.03          0.33                 0.11
## bedrooms                             -0.01          0.25                 0.03
## bathrooms                            -0.03          0.29                 0.10
## beds                                 -0.04          0.31                 0.08
## number_of_reviews                    -0.03          0.22                 0.15
## minimum_nights                       -0.02         -0.10                -0.32
## maximum_nights                        0.12          0.11                 0.04
## availability_365                      1.00         -0.14                -0.05
## amenity_count                        -0.14          1.00                 0.21
## host_acceptance_rate                 -0.05          0.21                 1.00
## host_listings_count                   0.01         -0.05                 0.04
## host_total_listings_count             0.03         -0.05                 0.05
## review_scores_rating                  0.00          0.25                 0.07
##                           host_listings_count host_total_listings_count
## price                                    0.02                      0.03
## accommodates                             0.05                      0.06
## bedrooms                                 0.06                      0.07
## bathrooms                               -0.05                     -0.04
## beds                                     0.04                      0.05
## number_of_reviews                       -0.07                     -0.06
## minimum_nights                          -0.01                     -0.02
## maximum_nights                           0.18                      0.17
## availability_365                         0.01                      0.03
## amenity_count                           -0.05                     -0.05
## host_acceptance_rate                     0.04                      0.05
## host_listings_count                      1.00                      0.99
## host_total_listings_count                0.99                      1.00
## review_scores_rating                    -0.22                     -0.20
##                           review_scores_rating
## price                                    -0.03
## accommodates                             -0.01
## bedrooms                                 -0.05
## bathrooms                                 0.07
## beds                                     -0.02
## number_of_reviews                         0.13
## minimum_nights                            0.03
## maximum_nights                            0.03
## availability_365                          0.00
## amenity_count                             0.25
## host_acceptance_rate                      0.07
## host_listings_count                      -0.22
## host_total_listings_count                -0.20
## review_scores_rating                      1.00

5.11.2 Heat Map

cor_mat |>
  melt() |>
  ggplot(aes(x = Var1, 
             y = Var2, 
             fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1)) +
  labs(
    title = "Correlation Matrix of Numeric Variables",
    x = NULL,
    y = NULL)

In general, price is most strongly correlated with accommodates (0.67), bedrooms (0.62), and beds (0.60), which fits the idea that larger properties charge higher prices. The size variables are also highly correlated with each other (0.84 to 0.86), and the two host listing counts are nearly identical (0.99). By contrast, number_of_reviews and review_scores_rating have only weak correlations with the other numeric variables; the largest for review_scores_rating are with amenity_count (0.25) and host_listings_count (-0.22).
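
Strongly correlated pairs can also be pulled out of a correlation matrix programmatically instead of being read off the heat map. The sketch below rebuilds a four-variable slice of the matrix from the rounded values printed above; the 0.8 cutoff is an illustrative assumption.

```r
# Sub-matrix copied from the rounded correlation output above
vars <- c("price", "accommodates", "bedrooms", "beds")
cor_sub <- matrix(
  c(1.00, 0.67, 0.62, 0.60,
    0.67, 1.00, 0.86, 0.84,
    0.62, 0.86, 1.00, 0.85,
    0.60, 0.84, 0.85, 1.00),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Look at each pair once (upper triangle) and flag |r| above the cutoff
cutoff <- 0.8  # illustrative threshold (an assumption)
idx <- which(upper.tri(cor_sub) & abs(cor_sub) > cutoff, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)
high_pairs
```

With this cutoff, only the three pairs among accommodates, bedrooms, and beds are flagged, and none of them involves price. The same steps could be applied to the full `cor_mat`.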

5.12 EDA Summary

Overall, the exploratory analysis shows a dataset with strong variation in listing characteristics and several meaningful patterns that will guide the modeling work. Price is highly right-skewed, and a log transformation helps stabilize that distribution before fitting regression models. Capacity-related variables such as accommodates, bedrooms, and beds have clear positive relationships with price, while review-related variables are skewed and weakly associated with price. Room type and property type show substantial differences in average pricing, reinforcing their importance as categorical predictors. Host characteristics such as acceptance rate, verification status, and number of listings vary widely across hosts, though only some display clear relationships with price. Booking constraints and availability appear more influenced by host strategy than by pricing. Finally, the correlation matrix highlights that several size-related variables are strongly correlated, which supports using thoughtful feature selection or regularization when building models. Taken together, the EDA provides a clear foundation for constructing the first predictive model and justifies the modeling choices made in the next section.

With a clear understanding of the data structure, distributional patterns, and relationships among key predictors, I now move into the modeling phase of the portfolio. Each model has been selected not just to answer practical questions about the Airbnb dataset, but also to demonstrate my mastery of the five course learning objectives. These objectives guide the organization of the work that follows. For each objective, I present a model and then explain how the model satisfies the intended statistical skill.

Objective 1 — Apply probability, inference, and maximum likelihood in regression modeling

This objective is fulfilled primarily through the Poisson regression model later in the portfolio, where I connect model assumptions to likelihood-based estimation and interpret the coefficients within a probability framework.

Objective 2 — Select and use an appropriate generalized linear model (GLM) for a specific context

This is demonstrated through the use of multiple GLMs, including multinomial logistic regression for classifying property type and Poisson regression for modeling review counts. Each model is matched to the scale and distribution of its outcome variable.

Objective 3 — Demonstrate model selection using cross-validation or other comparative tools

Model selection is shown through LASSO regression and through comparing simple vs. extended linear models for price. Cross-validation is used throughout to evaluate predictive performance, identify the best penalty value, and justify the final chosen models.

Objective 4 — Communicate model results clearly to a general audience

In each modeling section, I provide plain-language interpretations that explain model findings without relying on technical jargon. The price model and multinomial logistic regression provide especially rich opportunities for clear communication about effect sizes and practical meaning.

Objective 5 — Use statistical software (tidymodels) to fit, tune, and assess models

All models in this portfolio use `tidymodels` workflows, recipes, cross-validation, and where appropriate, tuning grids. This demonstrates my ability to implement complete modeling pipelines in reproducible statistical software.

6 Modeling

6.1 Mapping from Objectives to Models

- Objective 1: Model 3 (Poisson regression for number_of_reviews)

- Objective 2: Models 1–4 (linear, multinomial, Poisson, LASSO)

- Objective 3: Model 1 vs Model 4 (CV + tuning), plus CV in Models 2–3

- Objective 4: Narrative interpretations after each model section

- Objective 5: Tidymodels recipes, workflows, resampling, tuning in all four models

6.2 Model Setup

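After cleaning, the data set is split into training (80%) and test (20%) sets, stratified by property_type, and the training set is divided into five cross-validation folds.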

set.seed(631)

listings_model <- listings |>
  mutate(
    log_price = log10(price),
    host_is_superhost = factor(host_is_superhost),
    room_type       = factor(room_type),
    property_type   = factor(property_type)) |>
  select(
    log_price, price, accommodates, bedrooms, 
    bathrooms, beds, number_of_reviews,
    minimum_nights, maximum_nights,
    availability_365, amenity_count,
    host_acceptance_rate, host_is_superhost,
    host_listings_count, host_total_listings_count,
    room_type, property_type, review_scores_rating) |>
  drop_na(
    log_price, accommodates, bedrooms, bathrooms,
    beds, number_of_reviews, availability_365,
    amenity_count, host_acceptance_rate, 
    host_is_superhost, host_listings_count,
    host_total_listings_count, room_type,
    property_type, review_scores_rating)

# Train/test split
set.seed(631)
data_split <- initial_split(listings_model, prop = 0.8, strata = property_type)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 5-fold CV for validation (stratify on property_type for Model 2)
set.seed(631)
cv_folds <- vfold_cv(train_data, v = 5, strata = property_type)

6.3 Model 1: Multiple Linear Regression with Polynomial Term (Price)

Objectives covered:

Objective 2 (appropriate GLM: normal linear model for continuous outcome)

Objective 3 (model selection via CV when compared later)

Objective 4 (plain-language interpretation of the results for a general audience)

Objective 5 (use of `tidymodels` workflow + CV)

Outcome: log_price

Predictors: quantitative (accommodates, bedrooms, bathrooms, amenity_count) + qualitative (room_type, property_type), with a polynomial term in accommodates.

# Recipe
price_recipe <- recipe(log_price ~ accommodates + bedrooms + bathrooms + 
                         amenity_count + room_type + property_type,
                       data = train_data) |>
  
  # Handle rare categories & unseen levels
  step_other(property_type, threshold = 0.02) |>
  step_other(room_type, threshold = 0.02) |>
  step_novel(room_type, property_type) |>

  # Polynomial & dummies
  step_poly(accommodates, degree = 2) |>
  step_dummy(all_nominal_predictors()) |>
  # Drop zero-variance columns; step_zv() needs a selector to act on any column
  step_zv(all_predictors())

# Linear regression model
price_spec <- linear_reg() |>
  set_engine("lm")

# Workflow
price_wf <- workflow() |>
  add_model(price_spec) |>
  add_recipe(price_recipe)

# Cross-validation 
set.seed(631)

price_res <- fit_resamples(price_wf, resamples = cv_folds,
                           metrics = metric_set(rmse, rsq))

collect_metrics(price_res)

When I first fit the multiple regression model for log-price, R produced a “rank-deficient fit” warning. This warning means the design matrix did not have full column rank: at least one column was an exact linear combination of the others, so its coefficient could not be estimated. Strong correlation among the size-related variables (accommodates, bedrooms, and beds) makes coefficients unstable but does not by itself cause rank deficiency. A more likely source is the dummy variables created during preprocessing, since a rare category or the placeholder level added by step_novel() can produce a column that is constant within a fold. To reduce redundancy, I simplified the model by removing beds and keeping accommodates (with a quadratic term) and bedrooms as the main size indicators. This reduced multicollinearity, eliminated the warning in most CV folds, and left RMSE and R² nearly unchanged: RMSE moved from 0.1453 to 0.1449, so there was no loss in predictive accuracy. This suggests that the simpler model retained the key pricing signal without unnecessary redundancy.
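
The difference between strongly correlated predictors and a design matrix that lacks full column rank can be seen on simulated data. This is a minimal sketch with invented variables (`x1`, `x2`, `unused_level`), not a reproduction of the Airbnb fit: `x2` is almost a copy of `x1`, while `unused_level` is an all-zero dummy column of the kind an unobserved factor level would produce.

```r
set.seed(1)  # simulated toy data only; not the Airbnb listings
n  <- 20
x1 <- rnorm(n)

toy <- data.frame(
  x1           = x1,
  x2           = x1 + rnorm(n, sd = 0.05),  # highly correlated with x1, not identical
  unused_level = 0                          # all-zero dummy column
)
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2 + unused_level, data = toy)

cor(toy$x1, toy$x2)  # close to 1, yet both slopes are still estimated
coef(fit)            # unused_level is NA: lm() cannot estimate it
fit$rank             # 3, one less than the 4 columns of the design matrix
```

In this toy fit the near-duplicate predictor inflates uncertainty but is still estimated, whereas the constant column is what makes the fit rank-deficient.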

Based on the 5-fold cross-validation results, the model achieved an average RMSE of 0.145 on the log-price scale with a very small standard error, indicating stable performance across folds. An RMSE of this size corresponds to roughly a 30–40% prediction error once transformed back into dollar values, which is reasonable given the natural variability in Airbnb pricing. The mean R² of 0.621 shows that the model explains a little over 60% of the variation in log-price, which is strong performance for a housing-type dataset where many host-specific and neighborhood factors remain unobserved. Together, these results show that the model generalizes well and captures the major pricing patterns in the data, even with occasional rank-deficient warnings. It serves as a solid baseline before exploring more flexible or regularized models later in the portfolio.
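
The back-transformation from a log-scale RMSE to a percentage error depends on the base of the logarithm, which this section does not state. The sketch below treats both bases as assumptions and converts the reported RMSE of 0.145 into an approximate multiplicative error under each.

```r
rmse_log <- 0.145  # cross-validated RMSE on the log-price scale (from the text)

# Typical multiplicative error if log_price is a natural log (assumption)
pct_error_ln <- (exp(rmse_log) - 1) * 100

# Typical multiplicative error if log_price is a base-10 log (assumption)
pct_error_log10 <- (10^rmse_log - 1) * 100

round(c(natural_log = pct_error_ln, log10 = pct_error_log10), 1)
```

Under a base-10 log the result is just under 40%, in line with the range quoted above; under a natural log it would be closer to 16%.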

price_fit <- fit(price_wf, data = train_data)
tidy(price_fit)

Looking at the fitted coefficients, the model highlights several intuitive relationships between listing characteristics and nightly price. The strongest effect comes from the polynomial terms for accommodates: the positive first-order term and negative second-order term suggest that prices rise sharply as a listing begins to accommodate more guests, but the rate of increase slows for very large listings. In other words, adding capacity from two to four guests raises price more than increasing capacity from six to eight guests. The number of amenities also shows a small but statistically significant positive association with price, meaning listings with more features tend to charge slightly higher rates.
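
A small sketch can illustrate the "rising but flattening" shape. The coefficients below are hypothetical raw-scale values chosen for illustration, not the estimates from `price_fit`. By default `step_poly()` uses an orthogonal polynomial basis, so the fitted coefficients are not directly comparable to these, but the fitted curve is the same under either basis, as the last lines check on simulated data.

```r
# Hypothetical raw-scale coefficients (NOT the fitted values from price_fit)
b1 <-  0.20   # linear term for accommodates
b2 <- -0.01   # quadratic term for accommodates

log_price_part <- function(guests) b1 * guests + b2 * guests^2

# Change in log-price for the same two-guest increase at different sizes
gain_2_to_4 <- log_price_part(4) - log_price_part(2)
gain_6_to_8 <- log_price_part(8) - log_price_part(6)
c(gain_2_to_4 = gain_2_to_4, gain_6_to_8 = gain_6_to_8)

# Orthogonal and raw polynomial bases give the same fitted curve
set.seed(1)
guests <- rep(1:10, each = 5)
y <- log_price_part(guests) + rnorm(length(guests), sd = 0.05)
fit_orth <- lm(y ~ poly(guests, 2))          # orthogonal basis
fit_raw  <- lm(y ~ guests + I(guests^2))     # raw basis
same_fit <- all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))
same_fit
```

With a negative quadratic term, the gain from two to four guests is larger than the gain from six to eight, which is the pattern described above.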

Some categorical effects reinforce patterns seen in the exploratory analysis. Private rooms (whether in a home or rental unit) have negative coefficients, indicating that they are priced lower than entire-home listings, which serve as the baseline category. Several property types grouped into the “other” category also show small negative effects, further supporting the idea that full units command a price premium. Bedrooms and bathrooms have positive but non-significant coefficients, which is not surprising given how strongly these variables correlate with accommodates. Overall, the direction and magnitude of the coefficients align with expectations: larger listings, listings with more amenities, and listings offering full-unit privacy tend to be more expensive, while room-share arrangements and lower-amenity spaces fall on the lower end of the price range.

6.4 Model 2: Multinomial Logistic Regression (Property Type)

Objectives covered:

Objective 2 (appropriate GLM: multinomial logistic for categorical outcome)

Objective 3 (compare models via accuracy / log-loss)

Objective 4 (plain-language explanation of class probabilities)

Outcome: property_type (multiclass)

Predictors: log_price, accommodates, amenity_count, host_is_superhost.

multinom_recipe <- recipe(property_type ~ log_price + accommodates + 
                            amenity_count + host_is_superhost,
                          data = train_data) |>
  step_dummy(host_is_superhost) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

multinom_spec <- multinom_reg() |>
  set_engine("nnet") |>
  set_mode("classification")

multinom_wf <- workflow() |>
  add_model(multinom_spec) |>
  add_recipe(multinom_recipe)

# Fit on training data
multinom_fit <- fit(multinom_wf, data = train_data)

# Predict on test data for validation
multinom_pred <- predict(multinom_fit, test_data, type = "class") |>
  bind_cols(predict(multinom_fit, test_data, type = "prob"), test_data |>
              select(property_type)) 

# Accuracy
multinom_acc <- accuracy(multinom_pred,
                         truth = property_type,
                         estimate = .pred_class)

multinom_acc

This model uses multinomial logistic regression to predict property type from listing and host characteristics, specifically log_price, accommodates, amenity_count, and host_is_superhost.

After training on 80% of the data and evaluating on the remaining 20%, the model achieves an accuracy of about 0.68, meaning it correctly predicts the property type for roughly two out of three listings in the test set. The dataset includes several property categories with very uneven frequencies (a few common property types and many rarer ones), so the relevant benchmark is a naive baseline that always predicts the most common category, which scores exactly that category's share of the listings. Accuracy above that share suggests the model is capturing real structure in how prices, size, and amenities relate to property type, rather than just defaulting to the majority class.
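
The majority-class baseline mentioned above is simple to compute. The sketch below uses made-up labels rather than the Airbnb test set, so the numbers are illustrative only.

```r
# Made-up test-set labels for illustration (not the Airbnb data)
truth <- factor(c(rep("Entire home", 6), rep("Private room", 3), "Hotel room"))

# Share of each class in the labels
class_share <- prop.table(table(truth))

# A classifier that always predicts the most common class is right
# exactly as often as that class occurs
majority_class    <- names(which.max(class_share))
baseline_accuracy <- as.numeric(max(class_share))

majority_class
baseline_accuracy
```

Applied to the real data, the same calculation on `test_data$property_type` would give the number that the 0.68 accuracy should be compared against.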

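# Probability columns (exclude the hard-class column .pred_class,
# which also matches the ".pred_" prefix)
prob_cols <- setdiff(grep("^\\.pred_",
                          names(multinom_pred),
                          value = TRUE),
                     ".pred_class")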

#truth levels line up with prob columns
class_levels <- gsub("^\\.pred_", 
                     "", 
                     prob_cols)

multinom_logloss <- multinom_pred|>
  mutate(property_type = factor(property_type, 
                                levels = class_levels),
         across(all_of(prob_cols), 
                as.numeric)) |>
  
  #multinomial log-loss over all .pred_* columns
  mn_log_loss(truth = property_type, 
              !!!syms(prob_cols))

multinom_logloss

The multinomial log-loss for this model is about 1.40. Log-loss penalizes confident wrong predictions more heavily than uncertain ones, so it provides a more nuanced view of performance than accuracy alone. Because yardstick computes it with natural logs, a value of 1.40 corresponds to a geometric-mean probability of about exp(−1.40) ≈ 0.25 on the correct property type; for reference, guessing uniformly across K categories gives a log-loss of log(K). The model therefore puts meaningful weight on the correct class but still spreads considerable probability mass across competing categories. This is expected in a setting where several property types are visually and functionally similar (for example, different “entire home” variants).
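
For readers who want to see what the log-loss number measures, here is a minimal base-R sketch on made-up probabilities (three classes, four listings). It computes the mean negative natural log of the probability assigned to the true class, which is my understanding of what `mn_log_loss()` reports, and converts the result to a geometric-mean probability.

```r
# Made-up predicted probabilities for 4 listings and 3 classes (illustration only)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.6, 0.3,
                  0.3, 0.3, 0.4,
                  0.5, 0.4, 0.1),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))
truth <- c("A", "B", "A", "B")

# Probability the model gave to the class that actually occurred
p_true <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]

log_loss         <- -mean(log(p_true))   # multinomial log-loss (natural log)
geo_mean_prob    <- exp(-log_loss)       # geometric-mean probability on the true class
uniform_log_loss <- log(ncol(probs))     # log-loss of guessing 1/K for every class

# The same conversion applied to the value reported in the text
reported_geo_mean <- exp(-1.40)

c(log_loss = log_loss, geo_mean_prob = geo_mean_prob,
  uniform_log_loss = uniform_log_loss, reported_geo_mean = reported_geo_mean)
```

Comparing the model's log-loss with `log(K)` for the actual number of property types shows how far the predicted probabilities improve on uninformed guessing.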

In practical terms, the model can be used to rank the most plausible property types for a listing rather than make a single “all-or-nothing” guess. Together, the accuracy and log-loss results show that the multinomial GLM is learning meaningful patterns in the data while still reflecting the inherent uncertainty in distinguishing between similar property categories. This demonstrates appropriate use of a multinomial logistic model, proper preprocessing with recipes, and evaluation with yardstick within the tidymodels framework.

6.5 Model 3: Poisson Regression (Number of Reviews)

Objectives covered:

  Objective 1 (probability model + maximum likelihood for counts)
  
  Objective 2 (GLM choice for count outcome)
  
  Objective 4 (interpret rate changes)
  
  Objective 5 (`tidymodels` + CV)

Outcome: number_of_reviews (count)

Predictors: log_price, accommodates, availability_365, amenity_count, host_is_superhost.

pois_recipe <- recipe(number_of_reviews ~ log_price + accommodates +
                        availability_365 + amenity_count + 
                        host_is_superhost, data = train_data) |>
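  step_dummy(host_is_superhost) |>
  step_zv(all_predictors())

# poisson_reg() with the "glm" engine is provided by the poissonreg extension package
library(poissonreg)

pois_spec <- poisson_reg() |>
  set_engine("glm")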

pois_wf <- workflow() |>
  add_model(pois_spec) |>
  add_recipe(pois_recipe)

set.seed(631)
pois_res <- fit_resamples(pois_wf,
                          resamples = cv_folds,
                          metrics = metric_set(rmse, mae))

collect_metrics(pois_res)

Across the 5-fold cross-validation, the Poisson model produced an average MAE of about 73 reviews and an RMSE of roughly 122 reviews. These values indicate that the model captures broad trends in review counts, but individual predictions can still be off by a large margin. That’s not surprising for Airbnb review data, since review counts vary dramatically across listings and tend to be highly skewed. The goal of the Poisson model here is not precise point prediction, but understanding how variables like price, amenities, availability, and superhost status influence expected review rates. In that sense, the model behaves as expected: the mean–variance structure of the Poisson distribution, combined with the log link, gives us multiplicative effects that are easy to interpret and align with real-world host behavior.

Review counts show clear overdispersion (variance > mean), so a negative binomial model would also be appropriate, and the Poisson standard errors are likely too small. I use Poisson here because the focus is on interpreting multiplicative effects, consistent with course objectives.
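
One common way to quantify overdispersion is the Pearson dispersion statistic. The sketch below uses simulated counts, not the Airbnb reviews, and swaps in a quasi-Poisson fit (rather than the negative binomial mentioned above) only because it is available in base R; it shows that the coefficients stay the same while the standard errors grow.

```r
set.seed(631)
n  <- 500
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)

# Simulated overdispersed counts (negative binomial), not the Airbnb data
y <- rnbinom(n, mu = mu, size = 1)

pois_glm <- glm(y ~ x, family = poisson())

# Pearson dispersion statistic: values well above 1 point to overdispersion
dispersion <- sum(residuals(pois_glm, type = "pearson")^2) / df.residual(pois_glm)

# Quasi-Poisson keeps the same coefficients but rescales the standard errors
quasi_glm <- glm(y ~ x, family = quasipoisson())
se_ratio  <- coef(summary(quasi_glm))[, "Std. Error"] /
             coef(summary(pois_glm))[, "Std. Error"]

dispersion
se_ratio
```

The standard-error ratio equals the square root of the dispersion statistic, which gives a rough sense of how much the plain Poisson fit understates uncertainty.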

#Fit Poisson model on full training data
pois_fit <- fit(pois_wf, data = train_data)

#Extract the parsnip fit from the workflow
pois_parsnip <- extract_fit_parsnip(pois_fit)
# pois_engine  <- extract_fit_engine(pois_fit)

#Tidy + exponentiate for interpretation
pois_coef <- tidy(pois_parsnip) |>
  mutate(exp_estimate   = exp(estimate),           
         percent_change = (exp_estimate - 1) * 100)

pois_coef

Price effect (log_price IRR = 0.26)

Higher prices are strongly associated with fewer reviews.

Holding everything else constant, a one-unit increase in log-price is associated with about 74% fewer expected reviews.

This aligns with the idea that cheaper listings attract more bookings and, naturally, more reviews.

Accommodates (IRR = 0.989)

Each additional guest the listing can host is associated with slightly fewer expected reviews (−1.1%).

Larger listings may appeal to less frequent or longer stays, generating fewer review opportunities.

Availability_365 (IRR = 0.9995)

Each additional available day is associated with a very small decrease in expected reviews (about −0.05%).

One possible explanation is that listings that are always available may be less popular.

Amenity count (IRR = 1.014)

Each additional amenity is associated with about 1.4% more expected reviews, suggesting that better-equipped listings attract more guests and more feedback.

Superhost status (IRR = 2.11)

Superhosts receive about 111% more reviews on average, even after adjusting for price, amenities, and size.

This likely reflects the credibility and platform visibility given to superhosts, which increases bookings and reviews.

This model assumes a Poisson mean–variance relationship, where the variance of the number of reviews equals its mean (conditional on the predictors). The log link ensures predicted counts stay positive and allows us to interpret coefficients as multiplicative effects on the expected review count. For example, exponentiating a coefficient gives the incidence-rate ratio, which clearly communicates how a one-unit change in a predictor affects the expected number of reviews.
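
The arithmetic behind these interpretations can be laid out in a few lines. The sketch below starts from the incidence-rate ratios reported above (the vector names are my own labels) and shows the percent-change conversion, the link back to the log-scale coefficients, and how effects compound over more than one unit.

```r
# Incidence-rate ratios as reported in the text above
irr <- c(log_price = 0.26, accommodates = 0.989, availability_365 = 0.9995,
         amenity_count = 1.014, superhost = 2.11)

# Percent change in expected reviews per one-unit increase
percent_change <- (irr - 1) * 100

# Coefficients on the log scale (what tidy() reports before exponentiating)
beta <- log(irr)

# Effects compound multiplicatively: k units multiply the rate by IRR^k
irr_100_days    <- irr[["availability_365"]]^100   # 100 more available days
irr_5_amenities <- irr[["amenity_count"]]^5        # 5 more amenities

round(percent_change, 2)
round(c(irr_100_days = irr_100_days, irr_5_amenities = irr_5_amenities), 3)
```

Small per-unit ratios can add up: a ratio of 0.9995 per day looks negligible, but over 100 additional available days it implies roughly 5% fewer expected reviews.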

Compared to linear models, the Poisson regression provides a more appropriate structure for count outcomes and lets us directly quantify percentage changes. The results highlight meaningful drivers of guest engagement (pricing, amenities, host status) while respecting the distributional shape of review counts.

6.6 Model 4: Lasso Regression (Regularized Price Model)

Objectives covered:

  Objective 2 (still a GLM: normal with identity link)
  
  Objective 3 (model selection via tuning penalty)
  
  Objective 4 (communication of selected predictors)
  
  Objective 5 (`tidymodels` tuning, grid search, workflows)

Outcome: log_price

Predictors: a richer set of numeric + categorical variables with regularization to handle multicollinearity and many predictors.

lasso_recipe <- recipe(log_price ~ accommodates + bedrooms + bathrooms + beds +
                         number_of_reviews + minimum_nights + maximum_nights +
                         availability_365 + amenity_count + host_acceptance_rate + 
                         host_is_superhost + host_listings_count + 
                         host_total_listings_count + room_type + property_type + 
                         review_scores_rating, data = train_data) |>
  step_log(number_of_reviews, 
           offset = 1) |>
  step_log(minimum_nights, 
           offset = 1) |>
  step_log(maximum_nights, 
           offset = 1) |>
  step_dummy(room_type, 
             property_type, 
             host_is_superhost) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

lasso_spec <- linear_reg(penalty = tune(),
                         mixture = 1) |>
  set_engine("glmnet")

lasso_wf <- workflow() |>
  add_model(lasso_spec) |>
  add_recipe(lasso_recipe)

penalty_grid <- grid_regular(penalty(), levels = 30)

set.seed(631)
lasso_tune <- tune_grid(lasso_wf,
                        resamples = cv_folds,
                        grid = penalty_grid,
                        metrics = metric_set(rmse, rsq))

show_best(lasso_tune, metric = "rmse")
best_lasso <- select_best(lasso_tune, 
                          metric = "rmse")

final_lasso_wf <- finalize_workflow(lasso_wf, 
                                    best_lasso)

lasso_fit <- fit(final_lasso_wf, 
                 data = train_data)

tidy(lasso_fit) |> 
  filter(estimate != 0)

Several skewed predictors (review counts, minimum and maximum nights) were log-transformed inside the recipe to stabilize variance and improve model performance.

The LASSO model applies a penalty that shrinks coefficients toward zero and sets the weakest ones exactly to zero, so only the variables that contribute meaningful signal remain in the final model. The best penalty value (shown in the tuning output) is the one with the lowest cross-validated RMSE; simplicity enters only to the extent that a larger penalty does not hurt that error.
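
The way the penalty zeroes out weak predictors can be seen in the idealized case of standardized, uncorrelated predictors, where the LASSO solution reduces to soft-thresholding each least-squares coefficient. The sketch below uses hypothetical coefficient values, not those from `lasso_fit`, and the helper `soft_threshold()` is written here for illustration; it is not part of glmnet or tidymodels.

```r
# Soft-thresholding: the LASSO solution for one coefficient when predictors
# are standardized and uncorrelated (an idealized assumption)
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

# Hypothetical least-squares coefficients (not taken from lasso_fit)
beta_ols <- c(accommodates = 0.30, amenity_count = 0.08,
              maximum_nights = 0.01, availability_365 = -0.05)

beta_small_penalty <- soft_threshold(beta_ols, lambda = 0.02)
beta_large_penalty <- soft_threshold(beta_ols, lambda = 0.10)

rbind(ols = beta_ols,
      small_penalty = beta_small_penalty,
      large_penalty = beta_large_penalty)
```

Every coefficient shrinks by the penalty amount, and any coefficient smaller than the penalty in absolute value drops to exactly zero, so a larger penalty leaves fewer predictors in the model.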

From the coefficient table, only a small set of predictors survive the LASSO shrinkage. Positive coefficients (e.g., accommodates, amenity_count) indicate that larger or better-equipped listings tend to have higher log-prices, while negative coefficients (e.g., availability_365, room_type_Private.room) suggest that listings available year-round or private rooms tend to be priced lower. Dummy variables for certain property types also remain in the model, which confirms that price varies meaningfully across categories after controlling for other factors.

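Because LASSO removes many coefficients entirely, the remaining predictors are the ones most useful for predicting price in this dataset, although among strongly correlated predictors the choice of which one survives can be somewhat arbitrary. Because the predictors were normalized in the recipe, each coefficient is the change in log-price per one standard deviation of that predictor. This makes the model easier to interpret and helps prevent overfitting compared to an ordinary linear regression with dozens of dummy variables.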

7 Overall Model Comparison and Course Objectives

Here I briefly compare Models 1–4 and summarize how they demonstrate the course learning objectives.

Across the four models in this portfolio, each analysis serves a different purpose and highlights a different aspect of statistical modeling. Model 1 provides a baseline for understanding price through a traditional multiple regression framework. With its polynomial term and a mix of quantitative and categorical predictors, it offers an interpretable starting point and demonstrates how price responds to capacity, amenities, and property characteristics. Model 4 extends this idea by applying LASSO regularization. While Model 1 prioritizes interpretability, Model 4 focuses on prediction and variable selection. The comparison between the two shows how regularization can simplify a high-dimensional model while maintaining (or improving) predictive accuracy.

Models 2 and 3 each address outcomes that require something beyond a normal linear model. Model 2 uses multinomial logistic regression to predict property type, which introduces link functions, class probabilities, and multiclass classification metrics. Model 3 moves into count modeling through Poisson regression, where the mean–variance relationship and log link produce multiplicative interpretations that better reflect how review counts behave in practice. Together, these two models show how selecting an appropriate GLM depends on the distribution and structure of the response rather than a single modeling template.

These four models also map directly onto the five course objectives.

Objective 1 (probability, inference, likelihood) appears most clearly in the Poisson model, where I explain how the log link and mean–variance structure shape interpretation.

Objective 2 (choose appropriate GLM) is demonstrated across all models, from linear 
regression to multinomial logistic and Poisson regression.

Objective 3 (model selection and comparison) is shown through cross-validation in Models 1, 3, and 4, a held-out test set in Model 2, tuning in the LASSO model, and comparing the performance of the baseline (Model 1) to the penalized model (Model 4).

Objective 4 (communicating results clearly) is reflected in my plain-language interpretations throughout the portfolio, where I focus on effect size and practical meaning rather than purely statistical language.

Objective 5 (use `tidymodels` workflows) is demonstrated consistently through recipes, 
workflows, resampling, tuning grids, and proper preprocessing steps.

Taken together, Models 1–4 show a progression from basic linear modeling to more specialized GLMs and regularization. They also show that I can justify model choice, diagnose issues, interpret outputs clearly, and evaluate models with appropriate validation techniques. This final comparison brings everything together and demonstrates my understanding of both the statistical concepts and the modeling workflow emphasized throughout the course.

tibble(
  Model = c("Linear Regression", "Multinomial Logistic", "Poisson", "LASSO"),
  Outcome = c("log_price", "property_type", "number_of_reviews", "log_price"),
  Metric = c("RMSE = 0.145", "Accuracy = 0.676", "MAE = 73", "RMSE = 0.142"),
  Purpose = c("Baseline price model", 
              "Classifying property type", 
              "Modeling review counts", 
              "Regularized price prediction")
)