Project 1: Airbnb Listing Prediction

2024-02-07

Background and Problem Definition:

The Airbnb platform has revolutionized the hospitality industry by enabling individuals to rent out their properties to travelers, providing an alternative to traditional hotels.

As Airbnb continues to grow in popularity, understanding the characteristics and trends of Airbnb listings becomes increasingly important for hosts, guests, and stakeholders in the tourism industry.

Can we create a model to predict the success of an Airbnb listing based on its listing attributes?

Insights

In this project, we aim to analyze a dataset of Airbnb listings in a specific city to gain insights into the following aspects:

Listing Attributes
Pricing
Customer Reviews
Availability

Dataset: Santa Clara

The dataset covers 6949 listings in the Santa Clara region and was downloaded from Inside Airbnb.

Inside Airbnb collects and analyzes data from the Airbnb platform to provide insights into its impact on various cities around the world.

The website hosts datasets containing information about Airbnb listings, hosts, and reviews for numerous cities globally.
These datasets are typically obtained by scraping data from the Airbnb website and are made available for download in various formats, such as CSV files.

Loading Dataset and Cleaning

# Set the path to the Airbnb listings dataset file
airbnb_path <- "/Users/jtruongjones/Downloads/Airbnb EDA/listings.csv"

# Load the dataset into R
airbnb_data <- read.csv(airbnb_path, stringsAsFactors = FALSE)

# Check for missing values
missing_values <- colSums(is.na(airbnb_data))

# Remove duplicate rows 
airbnb_data <- unique(airbnb_data)
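
The missing-value counts are stored in `missing_values` but never displayed. A minimal sketch of how they could be inspected, using a small made-up data frame (`toy`) in place of `airbnb_data`. Because `read.csv()` keeps empty character fields as `""` rather than `NA` by default, blank strings are counted separately here:

```r
# Toy stand-in for airbnb_data (values are made up for illustration)
toy <- data.frame(
  price             = c(65, NA, 67, 83),
  reviews_per_month = c(0.48, NA, NA, 1.35),
  license           = c("", "", "Exempt", ""),
  stringsAsFactors  = FALSE
)

# Count NA values per column and show only the columns that have any
missing_values <- colSums(is.na(toy))
print(missing_values[missing_values > 0])

# Count blank strings per column, which is.na() does not catch
blank_values <- colSums(toy == "", na.rm = TRUE)
print(blank_values[blank_values > 0])
```

On the real data the same two calls would be run on `airbnb_data` instead of `toy`.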

Derived Category: Number of Beds

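# Extract the number of beds from the "name" column using a regular expression.
# "beds?\\b" matches "bed"/"beds" but not "bedroom"; names without a bed count become NA
# instead of a number built from whatever other digits the name contains.
bed_pattern <- ".*\\b(\\d+)\\s*beds?\\b.*"
has_beds <- grepl(bed_pattern, airbnb_data$name)
airbnb_data$beds <- NA_real_
airbnb_data$beds[has_beds] <- as.numeric(sub(bed_pattern, "\\1", airbnb_data$name[has_beds]))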

# Display the first few rows of the updated dataset
head(airbnb_data[, c("beds")])

## [1] 1 1 1 3 1 1

Derived Category: Star Rating

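# Extract the star rating from the "name" column.
# Names without a numeric "★d.dd" token become NA instead of leftover digits.
rating_pattern <- ".*★(\\d+\\.\\d+).*"
has_rating <- grepl(rating_pattern, airbnb_data$name)
airbnb_data$star_rating <- NA_real_
airbnb_data$star_rating[has_rating] <- as.numeric(sub(rating_pattern, "\\1", airbnb_data$name[has_rating]))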

# Display the first few rows of the updated dataset
head(airbnb_data[, c("star_rating")])

## [1] 4.81 4.50 4.87 4.89 4.87 4.93
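
A quick way to sanity-check this kind of regex extraction is to run it on a few names, including one that lacks the expected tokens. The sketch below uses three names taken from this dataset's output plus one made-up name with no rating and no bed count. The helper `extract_number()` and its two patterns are illustrative assumptions; the helper returns `NA` when a name does not match.

```r
# Hypothetical helper: return the first capture group as a number, or NA if no match
extract_number <- function(x, pattern) {
  out <- rep(NA_real_, length(x))
  hit <- grepl(pattern, x)
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit]))
  out
}

listing_names <- c(
  "Place to stay in Palo Alto · ★4.81 · 1 bedroom · 1 bed · 2 shared baths",
  "Rental unit in Santa Clara · ★4.50 · Studio · 1 bed · 1 bath",
  "Place to stay in Palo Alto · ★4.89 · 1 bedroom · 3 beds · 2 shared baths",
  "Home in San Jose · ★New · 2 baths"  # made-up name without rating or beds
)

# "beds?\\b" matches "bed" or "beds" but not "bedroom"
beds_pattern   <- ".*\\b(\\d+)\\s*beds?\\b.*"
rating_pattern <- ".*★(\\d+\\.\\d+).*"

beds        <- extract_number(listing_names, beds_pattern)
star_rating <- extract_number(listing_names, rating_pattern)

print(data.frame(beds, star_rating))
```

Counting the `NA` values afterwards (for example with `sum(is.na(beds))`) shows how many listings the pattern failed to parse.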

Data Preprocessing:

The two primary preprocessing techniques were:

One-Hot Encoding
- Converts categorical variables into a numerical format that machine learning algorithms can use.
  - Each category of a categorical variable is represented by its own binary (0 or 1) indicator variable.
Feature Scaling
- Standardizes or normalizes the range of the independent variables (features) so that they share a common scale. This keeps any single feature from dominating model training because of its larger magnitude.

One-Hot Encoding: Code

# Load caret (the encoding below itself only needs base R's model.matrix())
library(caret)

# Identify categorical variables
categorical_variables <- c("room_type", "host_name")

# One-hot encode room_type
encoded_room_type <- model.matrix(~ room_type - 1, data = airbnb_data)

# Label encode host_name
encoded_host_name <- as.integer(factor(airbnb_data$host_name))

# Drop the original room_type column and overwrite host_name with its integer codes
airbnb_data$room_type <- NULL
airbnb_data$host_name <- encoded_host_name

# Concatenate encoded columns
encoded_data <- cbind(airbnb_data, encoded_room_type)

# Display the first few rows of the encoded dataset
head(encoded_data)

##       id
## 1   4952
## 2  11464
## 3  21373
## 4  62799
## 5  75284
## 6 106365
##                                                                       name
## 1  Place to stay in Palo Alto · ★4.81 · 1 bedroom · 1 bed · 2 shared baths
## 2             Rental unit in Santa Clara · ★4.50 · Studio · 1 bed · 1 bath
## 3  Place to stay in Palo Alto · ★4.87 · 1 bedroom · 1 bed · 2 shared baths
## 4 Place to stay in Palo Alto · ★4.89 · 1 bedroom · 3 beds · 2 shared baths
## 5  Place to stay in Palo Alto · ★4.87 · 1 bedroom · 1 bed · 2 shared baths
## 6             Guesthouse in Cupertino · ★4.93 · 1 bedroom · 1 bed · 1 bath
##   host_id host_name neighbourhood_group neighbourhood latitude longitude price
## 1    7054      1127                  NA     Palo Alto 37.43932 -122.1574    65
## 2   42458       493                  NA   Santa Clara 37.34415 -121.9870    94
## 3    7054      1127                  NA     Palo Alto 37.43972 -122.1553    67
## 4    7054      1127                  NA     Palo Alto 37.43934 -122.1572    83
## 5    7054      1127                  NA     Palo Alto 37.43923 -122.1574    70
## 6  551319       812                  NA     Cupertino 37.32194 -122.0051   289
##   minimum_nights number_of_reviews last_review reviews_per_month
## 1              7                84  2023-10-28              0.48
## 2              3                20  2023-08-05              0.17
## 3              7               266  2023-09-16              1.61
## 4              7               157  2023-12-16              1.35
## 5              7               214  2023-12-11              1.39
## 6             30               187  2023-08-12              1.24
##   calculated_host_listings_count availability_365 number_of_reviews_ltm license
## 1                              5              253                     7        
## 2                             13              152                     4        
## 3                              5              300                     5        
## 4                              5              309                     6        
## 5                              5              302                     6        
## 6                              1              365                     2  Exempt
##   beds star_rating room_typeEntire home/apt room_typePrivate room
## 1    1        4.81                        0                     1
## 2    1        4.50                        1                     0
## 3    1        4.87                        0                     1
## 4    3        4.89                        0                     1
## 5    1        4.87                        0                     1
## 6    1        4.93                        1                     0
##   room_typeShared room
## 1                    0
## 2                    0
## 3                    0
## 4                    0
## 5                    0
## 6                    0
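
The indicator columns follow from how `model.matrix()` expands a factor when the intercept is dropped. A minimal sketch on a four-row toy data frame (the rows are made up; the `room_type` levels are the three that appear in the output):

```r
# Toy data frame with the three room types seen in the encoded output
toy <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Shared room", "Private room"),
  stringsAsFactors = FALSE
)

# "- 1" removes the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ room_type - 1, data = toy)
print(colnames(one_hot))

# Each row has exactly one 1, in the column of its room type
print(rowSums(one_hot))
```

If `room_type` contained missing values, `model.matrix()` would drop those rows under the default `na.action`, so it is worth comparing `nrow(one_hot)` with the number of rows in the data before calling `cbind()`.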

Feature Scaling: Code

library(scales)

# Identify numeric variables for scaling
numeric_variables <- c("price", "minimum_nights", "number_of_reviews", 
                       "reviews_per_month", 
                       "calculated_host_listings_count",
                       "availability_365", "number_of_reviews_ltm")

# Perform Min-Max scaling for each numeric variable
# (starts from airbnb_data, so the one-hot columns in encoded_data are not carried over)
scaled_data <- airbnb_data
scaled_data[, numeric_variables] <- lapply(airbnb_data[, numeric_variables], 
                                           rescale)

# Display the first few rows of the scaled dataset
head(scaled_data)

##       id
## 1   4952
## 2  11464
## 3  21373
## 4  62799
## 5  75284
## 6 106365
##                                                                       name
## 1  Place to stay in Palo Alto · ★4.81 · 1 bedroom · 1 bed · 2 shared baths
## 2             Rental unit in Santa Clara · ★4.50 · Studio · 1 bed · 1 bath
## 3  Place to stay in Palo Alto · ★4.87 · 1 bedroom · 1 bed · 2 shared baths
## 4 Place to stay in Palo Alto · ★4.89 · 1 bedroom · 3 beds · 2 shared baths
## 5  Place to stay in Palo Alto · ★4.87 · 1 bedroom · 1 bed · 2 shared baths
## 6             Guesthouse in Cupertino · ★4.93 · 1 bedroom · 1 bed · 1 bath
##   host_id host_name neighbourhood_group neighbourhood latitude longitude
## 1    7054      1127                  NA     Palo Alto 37.43932 -122.1574
## 2   42458       493                  NA   Santa Clara 37.34415 -121.9870
## 3    7054      1127                  NA     Palo Alto 37.43972 -122.1553
## 4    7054      1127                  NA     Palo Alto 37.43934 -122.1572
## 5    7054      1127                  NA     Palo Alto 37.43923 -122.1574
## 6  551319       812                  NA     Cupertino 37.32194 -122.0051
##         price minimum_nights number_of_reviews last_review reviews_per_month
## 1 0.001375378    0.016483516        0.09847597  2023-10-28        0.03408267
## 2 0.002100578    0.005494505        0.02344666  2023-08-05        0.01160261
## 3 0.001425392    0.016483516        0.31184056  2023-09-16        0.11602611
## 4 0.001825502    0.016483516        0.18405627  2023-12-16        0.09717186
## 5 0.001500413    0.016483516        0.25087925  2023-12-11        0.10007252
## 6 0.006976919    0.079670330        0.21922626  2023-08-12        0.08919507
##   calculated_host_listings_count availability_365 number_of_reviews_ltm license
## 1                    0.008097166        0.6931507            0.04268293        
## 2                    0.024291498        0.4164384            0.02439024        
## 3                    0.008097166        0.8219178            0.03048780        
## 4                    0.008097166        0.8465753            0.03658537        
## 5                    0.008097166        0.8273973            0.03658537        
## 6                    0.000000000        1.0000000            0.01219512  Exempt
##   beds star_rating
## 1    1        4.81
## 2    1        4.50
## 3    1        4.87
## 4    3        4.89
## 5    1        4.87
## 6    1        4.93
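
With its default arguments, `scales::rescale()` applies min-max scaling, (x - min) / (max - min), ignoring missing values when finding the range. As a check, the sketch below reproduces the scaled price of the first listing by hand from the minimum (10) and maximum (39999) price reported in the summary statistics. `min_max()` is an illustrative helper, not part of the original code.

```r
# Illustrative helper: min-max scaling to [0, 1], ignoring NA when finding the range
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# First-listing price (65) plus the reported minimum and maximum price, and one NA
prices <- c(65, 10, 39999, NA)
scaled <- min_max(prices)

# (65 - 10) / (39999 - 10)
print(round(scaled[1], 9))
```

The result agrees with the 0.001375378 shown for the first listing. Because one listing priced at 39999 sets the top of the range, typical prices end up very close to zero after scaling.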

Initial EDA: Stats Summary

##      price         minimum_nights   number_of_reviews reviews_per_month
##  Min.   :   10.0   Min.   :  1.00   Min.   :  0.0     Min.   : 0.010   
##  1st Qu.:   72.0   1st Qu.:  1.00   1st Qu.:  1.0     1st Qu.: 0.260   
##  Median :  125.0   Median :  2.00   Median :  7.0     Median : 0.710   
##  Mean   :  209.7   Mean   : 11.07   Mean   : 32.2     Mean   : 1.195   
##  3rd Qu.:  199.0   3rd Qu.: 28.00   3rd Qu.: 33.0     3rd Qu.: 1.630   
##  Max.   :39999.0   Max.   :365.00   Max.   :853.0     Max.   :13.800   
##  NA's   :607                                          NA's   :1717     
##  calculated_host_listings_count availability_365 number_of_reviews_ltm
##  Min.   :  1.0                  Min.   :  0.0    Min.   :  0.000      
##  1st Qu.:  1.0                  1st Qu.: 53.0    1st Qu.:  0.000      
##  Median :  3.0                  Median :179.0    Median :  2.000      
##  Mean   : 47.4                  Mean   :184.9    Mean   :  7.363      
##  3rd Qu.: 14.0                  3rd Qu.:330.0    3rd Qu.:  8.000      
##  Max.   :495.0                  Max.   :365.0    Max.   :164.000      
##
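
The code behind this table is not shown. A table in this layout is what `summary()` prints for a data frame of numeric columns, so a minimal sketch, using a made-up `toy` data frame and a shortened `numeric_variables` vector as stand-ins, might look like this:

```r
# Toy stand-in for airbnb_data (made-up values)
toy <- data.frame(
  price             = c(65, 94, NA, 83, 289),
  minimum_nights    = c(7, 3, 7, 7, 30),
  reviews_per_month = c(0.48, 0.17, 1.61, NA, NA)
)

numeric_variables <- c("price", "minimum_nights", "reviews_per_month")

# Per-column quartiles, mean, and NA counts
stats_summary <- summary(toy[, numeric_variables])
print(stats_summary)

# Share of rows with a missing value in each column
na_share <- colMeans(is.na(toy[, numeric_variables]))
print(na_share)
```

Applied to the real table, and assuming all 6949 listings remained after deduplication, the 607 missing prices are about 9% of rows and the 1717 missing `reviews_per_month` values about 25%.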

Initial EDA: General Distribution

Beds vs Price

Beds vs Star Rating

Average Availability

Price vs. Rating vs. Beds

Regression Prediction

Goal: Use linear regression with two listing attributes as predictors:

Price
Beds

to predict availability (days available out of 365), which serves here as the measure of an Airbnb listing's success.

Regression Output: Availability vs. Beds and Price

## Root Mean Squared Error (RMSE): 130.7727

## 
## Call:
## lm(formula = availability_365 ~ beds + price, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.894 -117.258   -0.065  137.448  166.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 196.19689    3.17123  61.868   <2e-16 ***
## beds          1.59782    1.52304   1.049    0.294    
## price         0.01228    0.01660   0.740    0.459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 130.5 on 4993 degrees of freedom
## Multiple R-squared:  0.0007628,  Adjusted R-squared:  0.0003625 
## F-statistic: 1.906 on 2 and 4993 DF,  p-value: 0.1488
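
The code that produced this output is not shown. The sketch below is one way such a fit and RMSE could be obtained. It runs on simulated data, and the 70/30 split, the seed, and the choice to compute RMSE on held-out rows are all assumptions rather than the original procedure; only the model formula is taken from the output.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the listings (not the real data)
n <- 500
listings <- data.frame(
  beds             = sample(1:6, n, replace = TRUE),
  price            = round(runif(n, 30, 500)),
  availability_365 = sample(0:365, n, replace = TRUE)
)

# Assumed 70/30 train/test split; the original split is not stated
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- listings[train_idx, ]
test_data  <- listings[-train_idx, ]

# Same formula as in the regression output
model <- lm(availability_365 ~ beds + price, data = train_data)

# RMSE on the held-out rows
predictions <- predict(model, newdata = test_data)
rmse <- sqrt(mean((test_data$availability_365 - predictions)^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")

print(summary(model))
```

Note that `lm()` silently drops rows with missing values in any model variable, so the degrees of freedom in the summary reflect only the complete rows of the training set.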

Discussion

Coefficients: A predictor is taken to affect availability only if its coefficient is statistically significant, i.e., its p-value is sufficiently low (conventionally below 0.05). Here neither beds (p = 0.294) nor price (p = 0.459) meets that threshold.

R-squared: The proportion of variance in the response variable (availability_365) explained by the predictor variables (beds and price); higher values indicate a better fit. The model's R-squared of 0.0007628 means it explains less than 0.1% of the variance.

Residuals: The differences between the observed and predicted values. They range from -223.9 to 166.9 days, with a residual standard error of 130.5 days and an RMSE of 130.77, which is large relative to the 0 to 365 range of availability_365.

F-statistic: Tests the overall significance of the model; a low p-value suggests that the model as a whole is statistically significant. With F = 1.906 and p = 0.1488, the model is not significant at the 0.05 level.
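
The quantities discussed here can also be read out of a fitted model programmatically rather than from the printed summary. The sketch below fits the same formula to simulated data (a stand-in for `train_data`, with an arbitrary seed) and extracts each quantity. The last line recomputes the overall p-value from the F-statistic and degrees of freedom reported in the output.

```r
set.seed(1)  # arbitrary seed

# Simulated data with no real relationship, standing in for train_data
train_data <- data.frame(
  beds             = sample(1:6, 200, replace = TRUE),
  price            = round(runif(200, 30, 500)),
  availability_365 = sample(0:365, 200, replace = TRUE)
)
model <- lm(availability_365 ~ beds + price, data = train_data)
model_summary <- summary(model)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(model_summary)
p_values   <- coef_table[, "Pr(>|t|)"]

# Proportion of variance explained
r_squared <- model_summary$r.squared

# Overall F-test p-value from the stored F statistic and its degrees of freedom
f <- model_summary$fstatistic
f_p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Same calculation with the values reported in the output: F = 1.906 on 2 and 4993 DF
reported_p <- pf(1.906, 2, 4993, lower.tail = FALSE)
print(round(reported_p, 4))
```

The recomputed value rounds to the 0.1488 printed in the summary.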

Conclusion

Overall, based on our results, the model does not appear to be a good fit for predicting availability based on number of beds and price.

Ultimately, the dataset lacked some information that is integral to building a prediction model. To improve the dataset and create a better prediction model for investors, we recommend gathering more data on:

Listing ratings
Bedrooms (not just beds available)
Geographic location
Profit margins (cost for host)