1 Load and Glimpse Data

2 Visualize Data

The vast majority of Airbnb listings in Austin are classified as “Entire home/apt,” a category that dominates the market. Private rooms make up a much smaller portion, while hotel and shared room listings are nearly negligible. This suggests that Airbnb’s inventory in Austin is heavily geared toward travelers seeking a full, private space, likely catering to families, groups, or longer-term stays.

Most listings accommodate between 2 and 6 guests, with a sharp drop-off beyond that range. Listings for solo travelers or couples (1–2 guests) are especially common, while very large listings (10+ guests) are rare. This reflects a market built to serve small groups, which aligns with common travel patterns for weekend getaways and short group trips.

The distribution of nightly prices is heavily right-skewed, with most listings priced under $250. A small number of luxury or high-end listings extend beyond that range, but they are a minority. This pattern indicates that the bulk of Airbnb offerings in Austin are competitively priced and accessible to a wide range of travelers, though premium options do exist.

Listings with fewer than 50 reviews dominate the dataset, and many listings have very few reviews or none at all. This suggests a high volume of newer or less frequently booked properties. A smaller subset of listings has earned over 100 reviews, indicating a group of consistently active, long-standing properties that may be more established or highly sought after.

3 Cluster Analysis

## Rows: 6,402
## Columns: 10
## $ price                    <dbl[,1]> <matrix[26 x 1]>
## $ accommodates             <dbl[,1]> <matrix[26 x 1]>
## $ bathrooms                <dbl[,1]> <matrix[26 x 1]>
## $ bedrooms                 <dbl[,1]> <matrix[26 x 1]>
## $ beds                     <dbl[,1]> <matrix[26 x 1]>
## $ number_of_reviews        <dbl[,1]> <matrix[26 x 1]>
## $ overall_rating           <dbl[,1]> <matrix[26 x 1]>
## $ `room_type_Hotel room`   <dbl[,1]> <matrix[26 x 1]>
## $ `room_type_Private room` <dbl[,1]> <matrix[26 x 1]>
## $ `room_type_Shared room`  <dbl[,1]> <matrix[26 x 1]>

To determine the optimal number of clusters for this analysis, we applied both the Elbow Method and the Silhouette Method. The Elbow Method evaluates total within-cluster sum of squares (WSS) across different values of k and identifies a point where adding more clusters results in diminishing returns. In our case, the WSS dropped sharply between k = 1 and k = 3, then began to level off more gradually, suggesting a potential “elbow” at k = 3 or 4. This indicates that using three or four clusters would capture the majority of natural grouping in the data while maintaining simplicity and interpretability.
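The WSS comparison described above can be sketched in a few lines. This is a minimal, standard-library Python illustration, not the code behind this report: the nine toy points, the helper names, and the deterministic farthest-point initialization are all assumptions made so the example is reproducible (the report does not state how its k-means runs were initialized).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

def total_wss(points, centers, labels):
    """Total within-cluster sum of squares: the quantity plotted in an elbow chart."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 9), (5, 10), (6, 9)]
wss_by_k = {k: total_wss(points, *kmeans(points, k)) for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data the WSS falls steeply up to k = 3 and barely changes afterward, which is the "elbow" shape the method looks for. Real listing data rarely produces a bend this clean, which is why the elbow above is read as "3 or 4".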

In contrast, the Silhouette Method measures the average silhouette width, a metric that evaluates how well each observation fits within its own cluster relative to the nearest other cluster. The silhouette plot showed a clear peak at k = 2, meaning this configuration yielded the most distinct and well-separated clusters. Silhouette scores declined as k increased beyond two, indicating less clean separation.
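For reference, the silhouette width of one observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes it in standard-library Python. The six one-dimensional toy points and both labelings are hypothetical, chosen so that a two-cluster split scores higher than a three-cluster split, the same qualitative pattern reported above.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_widths(points, labels):
    """Silhouette width per observation; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is in the sum, hence len - 1).
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclid(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b))
    return widths

def average_silhouette(points, labels):
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)

# Hypothetical toy data: one tight pair, plus a looser group that can be split in two.
points = [(0,), (1,), (10,), (11,), (13,), (14,)]
avg_k2 = average_silhouette(points, [0, 0, 1, 1, 1, 1])
avg_k3 = average_silhouette(points, [0, 0, 1, 1, 2, 2])
print(round(avg_k2, 3), round(avg_k3, 3))
```

Splitting the looser group lowers the average width because the two new clusters sit close to each other, so b shrinks for their members.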

While k = 2 provides the clearest partitioning according to silhouette width, it risks oversimplifying the rich variability in Airbnb listings. On the other hand, k = 3 is supported by the Elbow Method and offers a more nuanced segmentation of listings while still maintaining strong cohesion and interpretability. Therefore, for the purposes of this analysis, we proceed with k = 3 clusters, which strikes an effective balance between statistical robustness and actionable insight.

The first cluster represents premium listings intended for large groups, with an average price of approximately $352 per night. These properties typically offer ample space, accommodating nearly 10 guests with around four bedrooms and nearly three bathrooms. With high review scores and solid customer feedback, this cluster likely reflects entire homes that cater to families or travel groups seeking luxury and comfort in a shared space.

The second cluster contains budget-conscious listings, consisting exclusively of private rooms. These listings are significantly more affordable, averaging about $106 per night, and generally serve solo travelers or couples. Despite their modest accommodations (around one bedroom and one bathroom), they maintain strong overall ratings, suggesting they deliver a solid guest experience relative to cost. This segment reflects Airbnb’s accessibility for price-sensitive customers.

The third cluster offers mid-range entire home listings priced around $145 per night. These properties typically sleep four guests and include 1–2 bedrooms, making them attractive to small groups or families. This cluster had the highest number of reviews, indicating strong demand or frequent booking turnover. With excellent ratings and good value for space and amenities, this group likely represents Airbnb’s core user base and most competitive listings.

This cluster analysis provides actionable insights for both Airbnb and its users. For Airbnb, these distinct listing groups can support targeted marketing strategies, such as promoting budget private rooms to solo travelers or showcasing high-end homes to larger groups planning extended stays. Additionally, Airbnb could use this segmentation to refine pricing suggestions or personalize search filters. For customers, understanding these clusters can simplify the decision-making process by helping them quickly identify listings that match their budget, group size, and accommodation expectations.

4 Predictive Model

4.1 Linear Regression Model

We began with a linear regression model as a baseline. This model assumes straight-line relationships between features and price. While it’s easy to interpret, it couldn’t capture complex patterns in the data. It produced the highest error of the three models, with a root mean squared error (RMSE) of about $125 between predicted and actual prices.

4.2 Random Forest Model

Our most accurate model was the random forest, which builds an ensemble of decision trees and averages their predictions. After tuning parameters like “mtry” (the number of features considered at each split) and the number of trees, the model predicted prices with an RMSE of about $73. This level of accuracy makes it a strong candidate for a price suggestion tool, helping new hosts estimate what they could charge for their listing.
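To make the "ensemble of trees, averaged" idea concrete, here is a deliberately tiny standard-library Python sketch. It is not the model used in this report: each "tree" is a single-split stump, the eight toy listings and all function names are hypothetical, and `mtry` is sampled once per stump (which coincides with per-split sampling only because a stump has one split).

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Best single-split regression tree over the candidate features (lowest squared error)."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:
        # Every candidate feature was constant in this sample: predict the mean.
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    f, thr, left_mean, right_mean = stump
    return left_mean if row[f] <= thr else right_mean

def fit_forest(rows, targets, n_trees=50, mtry=1, seed=0):
    """Bagging plus random feature subsets: the two ingredients of a random forest."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        feats = rng.sample(range(p), mtry)          # random subset of mtry features
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Hypothetical toy listings: (bedrooms, accommodates) -> nightly price in dollars.
rows = [(1, 2), (1, 2), (2, 4), (2, 4), (3, 6), (4, 8), (4, 10), (5, 12)]
prices = [90, 110, 140, 160, 220, 300, 340, 420]
forest = fit_forest(rows, prices, n_trees=50, mtry=1, seed=0)
small = predict_forest(forest, (1, 2))
large = predict_forest(forest, (5, 12))
print(round(small), round(large))
```

Tuning, as described above, amounts to repeating this fit for several values of `mtry` and `n_trees` and keeping the combination with the lowest cross-validated error.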

4.3 XGBoost Model

We also tested an XGBoost model, a flexible, tree-based algorithm that handles non-linear relationships and interactions well. After tuning, it performed substantially better than linear regression, with an RMSE of around $96. While solid, it still didn’t match the accuracy of our top-performing model, the random forest.

model              .estimator  rmse       rsq
Linear Regression  standard    124.98777  0.3683332
Random Forest      standard     72.72284  0.8095683
XGBoost            standard     96.20873  0.6273579
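The two metrics in the table can be computed as follows. This standard-library Python sketch rests on two assumptions: that `rmse` is the usual root mean squared error, and that `rsq` is the squared correlation between actual and predicted prices (the `.estimator` column suggests the table came from R's yardstick package, which defines `rsq` that way). The four toy price pairs are hypothetical.

```python
import math

def rmse(truth, estimate):
    """Root mean squared error, in the same units as the outcome (dollars here)."""
    n = len(truth)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / n)

def rsq(truth, estimate):
    """R-squared as the squared Pearson correlation between truth and estimate."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_e = sum(estimate) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(truth, estimate))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov ** 2 / (var_t * var_e)

# Hypothetical actual and predicted nightly prices.
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]
toy_rmse = rmse(actual, predicted)
toy_rsq = rsq(actual, predicted)
print(toy_rmse, round(toy_rsq, 4))
```

Because RMSE squares each error before averaging, a few badly mispriced luxury listings can raise it well above the typical miss, which matters for a right-skewed price distribution like this one.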

Among listing features, the number of bedrooms and whether a cleaning fee is charged were both strongly associated with price. More bedrooms and the presence of a cleaning fee typically meant higher nightly rates. On the host side, longer hosting experience appeared to slightly increase prices, while superhost status did not show a strong direct impact on pricing. To improve the model going forward, Airbnb could consider gathering data on things like the quality of listing photos, guest reviews, and seasonal demand, all of which likely influence price but weren’t available in this dataset.

5 Technical Report

Data Cleaning: Before clustering, we selected numeric features that would help differentiate listing types and removed missing or extreme values. The price variable was cleaned to remove outliers, and categorical features like room type were converted to dummy variables.

For the predictive models, we split our dataset into training and validation data and used 3-fold cross-validation to train and tune each model. After identifying the best-performing model (random forest), we generated final predictions on a separate holdout dataset.
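The preprocessing steps named above (dummy variables, and the standardization implied by the scaled columns in the Section 3 output) together with a 3-fold split can be sketched as below. This is an illustrative standard-library Python version under stated assumptions: "Entire home/apt" is taken as the dropped reference level because it has no dummy column in that output, the six toy listings are hypothetical, and folds are assigned round-robin for reproducibility, whereas a real pipeline would normally shuffle rows first.

```python
import math

def dummy_encode(values, reference, prefix="room_type_"):
    """One 0/1 column per level, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {prefix + lvl: [1.0 if v == lvl else 0.0 for v in values] for lvl in levels}

def standardize(column):
    """Z-scores using the sample standard deviation (n - 1 denominator)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def k_fold_indices(n_rows, k=3):
    """Yield (train, validation) row-index lists, one pair per fold."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in folds:
        train = [i for i in range(n_rows) if i not in held_out]
        yield train, held_out

# Hypothetical toy listings.
room_types = ["Entire home/apt", "Private room", "Entire home/apt",
              "Hotel room", "Shared room", "Private room"]
prices = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]

dummies = dummy_encode(room_types, reference="Entire home/apt")
scaled_price = standardize(prices)
splits = list(k_fold_indices(len(prices), k=3))
print(sorted(dummies), [round(z, 2) for z in scaled_price], splits[0])
```

Standardizing before K-Means keeps a wide-ranging variable such as price from dominating the distance calculation over small-range variables such as bedrooms.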

Features that were excluded: We excluded inconsistent features and those not useful for segmentation and clustering, such as the number of reviews and text descriptions. Features like room type, guest capacity, and availability were key to our success.

Clustering: We used K-Means clustering and chose the number of clusters with the elbow plot and the Silhouette Method. The elbow plot pointed to 3 or 4 clusters, while silhouette width peaked at k = 2; we chose 3 clusters as the best trade-off between interpretability and separation. See Section 3 (Cluster Analysis) for details.

Price Prediction: The model we used for our final predictions was the random forest. Its predictions had an RMSE of about $73 against the actual listing prices, which is accurate enough to support pricing guidance for new hosts.

Model Alternatives: The linear regression model, a simple and interpretable baseline, was too rigid to make accurate predictions; its RMSE was about $124.99. The XGBoost model, a tree-based model that builds decision trees sequentially so that each new tree reduces the remaining error, had an RMSE of about $96.21.

We would recommend using this model to provide pricing guidance to new hosts, although it would need further refinement before full deployment. The number of bedrooms was a major driver of price: listings with more bedrooms tended to charge more. Charging a cleaning fee was also associated with higher prices, and hosts with a longer hosting history tended to list at higher prices. Superhost status did not influence price as much as we had expected.

To improve this model, we would include data on listing image quality, the day and month of booking, guest reviews and ratings, and a desirability score for the area of each listing.