Title: Airbnb Rating Analysis Using Machine-Learning

Introduction

Group Members

Name	Matric Number
LIM POH SZE	17082915
WONG KAI THUNG	17101556
LIANG HUIHAO	22099647
ZHENG XIN	22104340
LAI XIAOXUAN	22106168

Project Background

The hospitality industry has undergone a transformative shift with the advent of online platforms like Airbnb, offering travelers a diverse array of accommodation options. In the dynamic landscape of short-term rentals, the significance of guest satisfaction cannot be overstated. Users rely heavily on peer reviews and overall ratings to make informed decisions about their lodging choices. In this context, understanding the factors that contribute to a positive or negative Airbnb experience becomes paramount.

This project focuses on Singapore, a global hub for tourism and business. As demand for Airbnb accommodation in the city-state grows, it becomes more useful to understand what drives overall guest satisfaction, which matters to both hosts and guests.

Objectives

To predict the overall review score of an Airbnb unit from its cleanliness review score using a regression algorithm.
To classify Airbnb units into a high-rating category and a low-rating category based on various factors (such as cleanliness and communication) using a classification algorithm.

Research Question

What is the predictive relationship between cleanliness review scores and overall review scores of Airbnb units in Singapore?
Which specific factors, beyond cleanliness, significantly contribute to the categorization of Airbnb units into high and low-rating classes in Singapore?
How do different machine learning algorithms compare in their ability to predict overall review scores based on cleanliness ratings?

Data Understanding

Load Libraries

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.3.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

library(ggcorrplot)

## Warning: package 'ggcorrplot' was built under R version 4.3.2

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.2

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(rpart)

## Warning: package 'rpart' was built under R version 4.3.2

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.3.2

library(Metrics)

## Warning: package 'Metrics' was built under R version 4.3.2

library(caret)

## Warning: package 'caret' was built under R version 4.3.2

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
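The packages loaded above can be checked before the analysis starts, so that a missing package is reported up front rather than halfway through the script. The following is a minimal sketch only; the helper name `missing_packages` is hypothetical and is not part of the original analysis.

```r
# Packages attached in this report (taken from the library() calls above).
required_packages <- c("dplyr", "ggplot2", "reshape2", "corrplot", "ggcorrplot",
                       "randomForest", "rpart", "rpart.plot", "Metrics", "caret")

# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
}
```

Because caret is attached after Metrics, the unqualified names precision and recall refer to the caret versions. Writing Metrics::precision(...) selects the Metrics version explicitly.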

Dataset Details

Our original dataset, "listings.csv", was obtained from http://insideairbnb.com/get-the-data/. It lists Airbnb units in Singapore and was scraped on 23 September 2023. We renamed the file to "Dataset_original.csv".

Import Data

data <- read.csv("Dataset_original.csv")

General Information

str(data)

## 'data.frame':    3483 obs. of  75 variables:
##  $ id                                          : num  71609 71896 71903 275343 275344 ...
##  $ listing_url                                 : chr  "https://www.airbnb.com/rooms/71609" "https://www.airbnb.com/rooms/71896" "https://www.airbnb.com/rooms/71903" "https://www.airbnb.com/rooms/275343" ...
##  $ scrape_id                                   : num  2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
##  $ last_scraped                                : chr  "2023-09-23" "2023-09-23" "2023-09-23" "2023-09-23" ...
##  $ source                                      : chr  "previous scrape" "previous scrape" "previous scrape" "city scrape" ...
##  $ name                                        : chr  "Villa in Singapore · ★4.44 · 2 bedrooms · 3 beds · 1 private bath" "Home in Singapore · ★4.16 · 1 bedroom · 1 bed · Shared half-bath" "Home in Singapore · ★4.41 · 1 bedroom · 2 beds · Shared half-bath" "Rental unit in Singapore · ★4.40 · 1 bedroom · 1 bed · 2 shared baths" ...
##  $ description                                 : chr  "For 3 rooms.Book room 1&2 and room 4<br /><br /><b>The space</b><br />Landed Homestay Room for Rental. Between "| __truncated__ "<b>The space</b><br />Vocational Stay Deluxe Bedroom in Singapore.(Near Airport) <br /> <br />Located Between  "| __truncated__ "Like your own home, 24hrs access.<br /><br /><b>The space</b><br />Vocational Stay Deluxe Bedroom in Singapore."| __truncated__ "**IMPORTANT NOTES:  READ BEFORE YOU BOOK! <br />==Since this is an HDB Flat tourists are NOT ALLOWED unless hav"| __truncated__ ...
##  $ neighborhood_overview                       : chr  "" "" "Quiet and view of the playground with exercise tracks with access to neighbourhood Simwi Estate." "" ...
##  $ picture_url                                 : chr  "https://a0.muscache.com/pictures/24453191/35803acb_original.jpg" "https://a0.muscache.com/pictures/2440674/ac4f4442_original.jpg" "https://a0.muscache.com/pictures/568743/7bc623e9_original.jpg" "https://a0.muscache.com/pictures/miso/Hosting-275343/original/abbb6837-808c-437e-9835-0e7bb621a4e7.png" ...
##  $ host_id                                     : int  367042 367042 367042 1439258 1439258 367042 1521514 1439258 1439258 1521514 ...
##  $ host_url                                    : chr  "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/1439258" ...
##  $ host_name                                   : chr  "Belinda" "Belinda" "Belinda" "Kay" ...
##  $ host_since                                  : chr  "2011-01-29" "2011-01-29" "2011-01-29" "2011-11-24" ...
##  $ host_location                               : chr  "Singapore" "Singapore" "Singapore" "Singapore" ...
##  $ host_about                                  : chr  "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "K2 Guesthouse is designed for guests who want a truly local experience with local people. Experience eating loc"| __truncated__ ...
##  $ host_response_time                          : chr  "within a few hours" "within a few hours" "within a few hours" "within an hour" ...
##  $ host_response_rate                          : chr  "100%" "100%" "100%" "100%" ...
##  $ host_acceptance_rate                        : chr  "100%" "100%" "100%" "95%" ...
##  $ host_is_superhost                           : chr  "f" "f" "f" "f" ...
##  $ host_thumbnail_url                          : chr  "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/user/7245b0a9-27fa-4759-9fb3-59ae8299e8a3.jpg?aki_policy=profile_small" ...
##  $ host_picture_url                            : chr  "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/user/7245b0a9-27fa-4759-9fb3-59ae8299e8a3.jpg?aki_policy=profile_x_medium" ...
##  $ host_neighbourhood                          : chr  "Tampines" "Tampines" "Tampines" "Bukit Merah" ...
##  $ host_listings_count                         : int  5 5 5 52 52 5 7 52 52 7 ...
##  $ host_total_listings_count                   : int  15 15 15 65 65 15 8 65 65 8 ...
##  $ host_verifications                          : chr  "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" ...
##  $ host_has_profile_pic                        : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified                      : chr  "t" "t" "t" "t" ...
##  $ neighbourhood                               : chr  "" "" "Singapore, Singapore" "" ...
##  $ neighbourhood_cleansed                      : chr  "Tampines" "Tampines" "Tampines" "Bukit Merah" ...
##  $ neighbourhood_group_cleansed                : chr  "East Region" "East Region" "East Region" "Central Region" ...
##  $ latitude                                    : num  1.35 1.35 1.35 1.29 1.29 ...
##  $ longitude                                   : num  104 104 104 104 104 ...
##  $ property_type                               : chr  "Private room in villa" "Private room in home" "Private room in home" "Private room in rental unit" ...
##  $ room_type                                   : chr  "Private room" "Private room" "Private room" "Private room" ...
##  $ accommodates                                : int  3 1 2 1 1 4 2 1 1 2 ...
##  $ bathrooms                                   : logi  NA NA NA NA NA NA ...
##  $ bathrooms_text                              : chr  "1 private bath" "Shared half-bath" "Shared half-bath" "2 shared baths" ...
##  $ bedrooms                                    : int  NA NA NA NA NA 3 NA NA NA NA ...
##  $ beds                                        : int  3 1 2 1 1 5 1 1 1 1 ...
##  $ amenities                                   : chr  "[\"Private backyard \\u2013 Fully fenced\", \"Shampoo\", \"Fire extinguisher\", \"Self check-in\", \"Outdoor fu"| __truncated__ "[\"Private backyard \\u2013 Fully fenced\", \"Shampoo\", \"Drying rack for clothing\", \"Self check-in\", \"Cof"| __truncated__ "[\"Shampoo\", \"Self check-in\", \"Coffee maker\", \"AC - split type ductless system\", \"Outdoor furniture\", "| __truncated__ "[\"Fire extinguisher\", \"Self check-in\", \"Bed linens\", \"Hot water kettle\", \"Wifi\", \"Carbon monoxide al"| __truncated__ ...
##  $ price                                       : chr  "$150.00" "$80.00" "$80.00" "$55.00" ...
##  $ minimum_nights                              : int  92 92 92 60 60 92 92 60 60 92 ...
##  $ maximum_nights                              : int  365 365 365 999 999 365 1125 999 365 180 ...
##  $ minimum_minimum_nights                      : int  92 92 92 60 60 92 92 60 60 92 ...
##  $ maximum_minimum_nights                      : int  92 92 92 60 60 92 92 60 60 92 ...
##  $ minimum_maximum_nights                      : int  1125 1125 1125 1125 1125 1125 1125 1125 1125 180 ...
##  $ maximum_maximum_nights                      : int  1125 1125 1125 1125 1125 1125 1125 1125 1125 180 ...
##  $ minimum_nights_avg_ntm                      : num  92 92 92 60 60 92 92 60 60 92 ...
##  $ maximum_nights_avg_ntm                      : num  1125 1125 1125 1125 1125 ...
##  $ calendar_updated                            : logi  NA NA NA NA NA NA ...
##  $ has_availability                            : chr  "t" "t" "t" "t" ...
##  $ availability_30                             : int  28 28 28 1 30 28 30 30 30 30 ...
##  $ availability_60                             : int  58 58 58 1 60 58 60 60 60 60 ...
##  $ availability_90                             : int  88 88 88 1 90 88 90 90 90 90 ...
##  $ availability_365                            : int  89 89 89 275 274 89 365 365 365 365 ...
##  $ calendar_last_scraped                       : chr  "2023-09-23" "2023-09-23" "2023-09-23" "2023-09-23" ...
##  $ number_of_reviews                           : int  20 24 47 22 17 12 133 18 6 81 ...
##  $ number_of_reviews_ltm                       : int  0 0 0 0 3 0 0 1 3 0 ...
##  $ number_of_reviews_l30d                      : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ first_review                                : chr  "2011-12-19" "2011-07-30" "2011-05-04" "2013-04-20" ...
##  $ last_review                                 : chr  "2020-01-17" "2019-10-13" "2020-01-09" "2022-08-13" ...
##  $ review_scores_rating                        : num  4.44 4.16 4.41 4.4 4.27 4.83 4.43 3.5 3.8 4.43 ...
##  $ review_scores_accuracy                      : num  4.37 4.22 4.39 4.16 4.44 4.67 4.33 3.47 3.8 4.45 ...
##  $ review_scores_cleanliness                   : num  4 4.09 4.52 4.26 4.06 4.75 4.16 3.94 4 4.41 ...
##  $ review_scores_checkin                       : num  4.63 4.43 4.63 4.47 4.5 4.58 4.5 4.53 4.2 4.71 ...
##  $ review_scores_communication                 : num  4.78 4.43 4.64 4.42 4.5 4.67 4.66 4.06 4.8 4.76 ...
##  $ review_scores_location                      : num  4.26 4.17 4.5 4.53 4.63 4.33 4.52 3.82 3.8 4.64 ...
##  $ review_scores_value                         : num  4.32 4.04 4.36 4.63 4.13 4.45 4.39 3.76 4 4.55 ...
##  $ license                                     : chr  "" "" "" "S0399" ...
##  $ instant_bookable                            : chr  "f" "f" "f" "t" ...
##  $ calculated_host_listings_count              : int  5 5 5 52 52 5 7 52 52 7 ...
##  $ calculated_host_listings_count_entire_homes : int  0 0 0 1 1 0 1 1 1 1 ...
##  $ calculated_host_listings_count_private_rooms: int  5 5 5 51 51 5 6 51 51 6 ...
##  $ calculated_host_listings_count_shared_rooms : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reviews_per_month                           : num  0.14 0.16 0.31 0.17 0.12 0.09 0.94 0.13 0.05 0.67 ...

From the output above, the dataset contains 3,483 rows (observations), each representing one Airbnb listing, and 75 variables (columns). The variables range from basic identifiers such as the ID and listing URL to detailed information about hosts, property characteristics, and guest reviews.
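The str() output also shows columns that are entirely NA (bathrooms, calendar_updated) and character columns that hold empty strings rather than NA (for example neighbourhood). A quick structural check could look like the sketch below. It runs on a small made-up data frame named `toy`, which is an assumption for illustration and not the real listings file.

```r
# Toy stand-in for the listings data (values are made up for illustration).
toy <- data.frame(
  id = c(1, 2, 3),
  price = c("$150.00", "$80.00", "$55.00"),
  bathrooms = c(NA, NA, NA),
  neighbourhood = c("", "", "Singapore, Singapore"),
  stringsAsFactors = FALSE
)

# Number of columns per storage class (numeric, character, logical, ...).
class_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))

# Columns in which every value is NA carry no information.
all_na_cols <- names(toy)[vapply(toy, function(col) all(is.na(col)), logical(1))]

# Empty strings are not NA in R, so count them separately for character columns.
empty_string_counts <- vapply(
  toy,
  function(col) if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L,
  integer(1)
)
```

Applied to the real data frame, the same three expressions would work with `data` in place of `toy`.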

Data Description

The variables in the dataset are described below:

Variables	Description
id	Unique identifier for each Airbnb unit
listing_url	URL link to the Airbnb listing page
scrape_id	Identifier of the scrape run that collected the listing
last_scraped	Date when the listing was last scraped
source	Source of the listing data
name	Title or name of the Airbnb listing
description	Detailed text describing the Airbnb unit
neighborhood_overview	Overview of the neighborhood where the unit is located
picture_url	URL link to pictures showcasing the Airbnb unit
host_id	Unique identifier for the host of the Airbnb unit
host_url	URL link to the host’s profile page
host_name	Name of the Airbnb host
host_since	Date when the host joined Airbnb
host_location	Location of the host
host_about	Information provided by the host about themselves
host_response_time	Time taken by the host to respond to inquiries
host_response_rate	Percentage of inquiries to which the host responds
host_acceptance_rate	Percentage of booking requests accepted by the host
host_is_superhost	Indicator of whether the host has Superhost status
host_thumbnail_url	URL link to the host’s profile picture
host_picture_url	URL link to a larger picture of the host
host_neighbourhood	Neighborhood of the host
host_listings_count	Number of listings by the host
host_total_listings_count	Total number of listings by the host
host_verifications	Methods used by the host to verify their identity
host_has_profile_pic	Indicator of whether the host has a profile picture
host_identity_verified	Indicator of whether the host’s identity is verified
neighbourhood	General neighborhood information
neighbourhood_cleansed	Specific cleansed neighborhood information
neighbourhood_group_cleansed	Cleansed neighborhood group information
latitude	Latitude coordinates of the Airbnb unit
longitude	Longitude coordinates of the Airbnb unit
property_type	Type of property (private room/entire unit)
room_type	Type of room available for booking
accommodates	Number of guests the unit can accommodate
bathrooms	Number of bathrooms (empty column in this dataset; see bathrooms_text)
bathrooms_text	Text description of the bathroom(s)
bedrooms	Number of bedrooms
beds	Number of beds
amenities	List of amenities provided in the Airbnb unit
price	Nightly price for renting the Airbnb unit
minimum_nights	Minimum number of nights required for booking
maximum_nights	Maximum number of nights allowed for booking
minimum_minimum_nights	Smallest minimum-nights value found in the listing's calendar
maximum_minimum_nights	Largest minimum-nights value found in the listing's calendar
minimum_maximum_nights	Smallest maximum-nights value found in the listing's calendar
maximum_maximum_nights	Largest maximum-nights value found in the listing's calendar
minimum_nights_avg_ntm	Average minimum-nights value over the next twelve months
maximum_nights_avg_ntm	Average maximum-nights value over the next twelve months
calendar_updated	Calendar update field (empty column in this dataset)
has_availability	Indicator of whether the unit is available
availability_30	Number of available nights in the next 30 days
availability_60	Number of available nights in the next 60 days
availability_90	Number of available nights in the next 90 days
availability_365	Number of available nights in the next 365 days
calendar_last_scraped	Date when the calendar was last scraped
number_of_reviews	Total number of reviews for the unit
number_of_reviews_ltm	Number of reviews in the last twelve months
number_of_reviews_l30d	Number of reviews in the last 30 days
first_review	Date of the first review for the unit
last_review	Date of the most recent review for the unit
review_scores_rating	Overall rating score given by guests
review_scores_accuracy	Rating score for accuracy
review_scores_cleanliness	Rating score for cleanliness
review_scores_checkin	Rating score for the check-in process
review_scores_communication	Rating score for communication
review_scores_location	Rating score for the location
review_scores_value	Rating score for the overall value
license	License information for the property
instant_bookable	Indicator of whether instant booking is available
calculated_host_listings_count	Count of listings by the host
calculated_host_listings_count_entire_homes	Count of entire homes listed by the host
calculated_host_listings_count_private_rooms	Count of private rooms listed by the host
calculated_host_listings_count_shared_rooms	Count of shared rooms listed by the host
reviews_per_month	Average number of reviews received per month

Statistical Information

Below is the statistical summary of the dataset:

summary(data)

##        id            listing_url          scrape_id         last_scraped      
##  Min.   :7.161e+04   Length:3483        Min.   :2.023e+13   Length:3483       
##  1st Qu.:2.477e+07   Class :character   1st Qu.:2.023e+13   Class :character  
##  Median :4.230e+07   Mode  :character   Median :2.023e+13   Mode  :character  
##  Mean   :2.607e+17                      Mean   :2.023e+13                     
##  3rd Qu.:6.927e+17                      3rd Qu.:2.023e+13                     
##  Max.   :9.859e+17                      Max.   :2.023e+13                     
##                                                                               
##     source              name           description        neighborhood_overview
##  Length:3483        Length:3483        Length:3483        Length:3483          
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  picture_url           host_id            host_url          host_name        
##  Length:3483        Min.   :    23666   Length:3483        Length:3483       
##  Class :character   1st Qu.: 29032695   Class :character   Class :character  
##  Mode  :character   Median :107599478   Mode  :character   Mode  :character  
##                     Mean   :154421232                                        
##                     3rd Qu.:238891646                                        
##                     Max.   :536857130                                        
##                                                                              
##   host_since        host_location       host_about        host_response_time
##  Length:3483        Length:3483        Length:3483        Length:3483       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  host_response_rate host_acceptance_rate host_is_superhost  host_thumbnail_url
##  Length:3483        Length:3483          Length:3483        Length:3483       
##  Class :character   Class :character     Class :character   Class :character  
##  Mode  :character   Mode  :character     Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  host_picture_url   host_neighbourhood host_listings_count
##  Length:3483        Length:3483        Min.   :  1.0      
##  Class :character   Class :character   1st Qu.:  3.0      
##  Mode  :character   Mode  :character   Median : 14.0      
##                                        Mean   : 87.2      
##                                        3rd Qu.: 79.0      
##                                        Max.   :571.0      
##                                                           
##  host_total_listings_count host_verifications host_has_profile_pic
##  Min.   :  1.0             Length:3483        Length:3483         
##  1st Qu.:  5.0             Class :character   Class :character    
##  Median : 20.0             Mode  :character   Mode  :character    
##  Mean   :145.7                                                    
##  3rd Qu.:126.0                                                    
##  Max.   :847.0                                                    
##                                                                   
##  host_identity_verified neighbourhood      neighbourhood_cleansed
##  Length:3483            Length:3483        Length:3483           
##  Class :character       Class :character   Class :character      
##  Mode  :character       Mode  :character   Mode  :character      
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##  neighbourhood_group_cleansed    latitude       longitude    
##  Length:3483                  Min.   :1.222   Min.   :103.6  
##  Class :character             1st Qu.:1.291   1st Qu.:103.8  
##  Mode  :character             Median :1.305   Median :103.8  
##                               Mean   :1.311   Mean   :103.8  
##                               3rd Qu.:1.318   3rd Qu.:103.9  
##                               Max.   :1.458   Max.   :104.0  
##                                                              
##  property_type       room_type          accommodates    bathrooms     
##  Length:3483        Length:3483        Min.   : 1.000   Mode:logical  
##  Class :character   Class :character   1st Qu.: 2.000   NA's:3483     
##  Mode  :character   Mode  :character   Median : 2.000                 
##                                        Mean   : 2.817                 
##                                        3rd Qu.: 4.000                 
##                                        Max.   :16.000                 
##                                                                       
##  bathrooms_text        bedrooms          beds       amenities        
##  Length:3483        Min.   :1.000   Min.   : 1.0   Length:3483       
##  Class :character   1st Qu.:1.000   1st Qu.: 1.0   Class :character  
##  Mode  :character   Median :1.000   Median : 1.0   Mode  :character  
##                     Mean   :1.447   Mean   : 1.8                     
##                     3rd Qu.:2.000   3rd Qu.: 2.0                     
##                     Max.   :5.000   Max.   :46.0                     
##                     NA's   :1488    NA's   :97                       
##     price           minimum_nights    maximum_nights     minimum_minimum_nights
##  Length:3483        Min.   :   1.00   Min.   :     2.0   Min.   :   1.00       
##  Class :character   1st Qu.:   6.00   1st Qu.:   365.0   1st Qu.:   6.00       
##  Mode  :character   Median :  92.00   Median :  1125.0   Median :  92.00       
##                     Mean   :  67.28   Mean   :   811.2   Mean   :  67.27       
##                     3rd Qu.:  92.00   3rd Qu.:  1125.0   3rd Qu.:  92.00       
##                     Max.   :1000.00   Max.   :100000.0   Max.   :1000.00       
##                                                                                
##  maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights
##  Min.   :   1.00        Min.   :     1.0       Min.   :     1.0      
##  1st Qu.:   6.00        1st Qu.:   365.0       1st Qu.:   365.0      
##  Median :  92.00        Median :  1125.0       Median :  1125.0      
##  Mean   :  74.04        Mean   :   891.4       Mean   :   904.6      
##  3rd Qu.:  92.00        3rd Qu.:  1125.0       3rd Qu.:  1125.0      
##  Max.   :1000.00        Max.   :100000.0       Max.   :100000.0      
##                                                                      
##  minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated
##  Min.   :   1.00        Min.   :     1.0       Mode:logical    
##  1st Qu.:   6.00        1st Qu.:   365.0       NA's:3483       
##  Median :  92.00        Median :  1125.0                       
##  Mean   :  73.41        Mean   :   893.1                       
##  3rd Qu.:  92.00        3rd Qu.:  1125.0                       
##  Max.   :1000.00        Max.   :100000.0                       
##                                                                
##  has_availability   availability_30 availability_60 availability_90
##  Length:3483        Min.   : 0.00   Min.   : 0      Min.   : 0.00  
##  Class :character   1st Qu.: 0.00   1st Qu.: 0      1st Qu.: 1.00  
##  Mode  :character   Median :22.00   Median :51      Median :79.00  
##                     Mean   :15.97   Mean   :35      Mean   :55.29  
##                     3rd Qu.:29.00   3rd Qu.:59      3rd Qu.:89.00  
##                     Max.   :30.00   Max.   :60      Max.   :90.00  
##                                                                    
##  availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm
##  Min.   :  0.0    Length:3483           Min.   :  0.00    Min.   :  0.000      
##  1st Qu.: 90.0    Class :character      1st Qu.:  0.00    1st Qu.:  0.000      
##  Median :312.0    Mode  :character      Median :  1.00    Median :  0.000      
##  Mean   :235.7                          Mean   : 10.25    Mean   :  2.253      
##  3rd Qu.:364.0                          3rd Qu.:  5.00    3rd Qu.:  0.000      
##  Max.   :365.0                          Max.   :665.00    Max.   :404.000      
##                                                                                
##  number_of_reviews_l30d first_review       last_review       
##  Min.   : 0.0000        Length:3483        Length:3483       
##  1st Qu.: 0.0000        Class :character   Class :character  
##  Median : 0.0000        Mode  :character   Mode  :character  
##  Mean   : 0.1904                                             
##  3rd Qu.: 0.0000                                             
##  Max.   :22.0000                                             
##                                                              
##  review_scores_rating review_scores_accuracy review_scores_cleanliness
##  Min.   :0.00         Min.   :1.00           Min.   :1.000            
##  1st Qu.:4.33         1st Qu.:4.44           1st Qu.:4.270            
##  Median :4.69         Median :4.78           Median :4.670            
##  Mean   :4.46         Mean   :4.58           Mean   :4.497            
##  3rd Qu.:5.00         3rd Qu.:5.00           3rd Qu.:5.000            
##  Max.   :5.00         Max.   :5.00           Max.   :5.000            
##  NA's   :1565         NA's   :1599           NA's   :1599             
##  review_scores_checkin review_scores_communication review_scores_location
##  Min.   :1.000         Min.   :1.000               Min.   :1.000         
##  1st Qu.:4.670         1st Qu.:4.670               1st Qu.:4.570         
##  Median :4.915         Median :4.920               Median :4.850         
##  Mean   :4.726         Mean   :4.708               Mean   :4.691         
##  3rd Qu.:5.000         3rd Qu.:5.000               3rd Qu.:5.000         
##  Max.   :5.000         Max.   :5.000               Max.   :5.000         
##  NA's   :1599          NA's   :1598                NA's   :1599          
##  review_scores_value   license          instant_bookable  
##  Min.   :1.000       Length:3483        Length:3483       
##  1st Qu.:4.207       Class :character   Class :character  
##  Median :4.570       Mode  :character   Mode  :character  
##  Mean   :4.441                                            
##  3rd Qu.:5.000                                            
##  Max.   :5.000                                            
##  NA's   :1599                                             
##  calculated_host_listings_count calculated_host_listings_count_entire_homes
##  Min.   :  1.00                 Min.   :  0.00                             
##  1st Qu.:  3.00                 1st Qu.:  0.00                             
##  Median : 13.00                 Median :  1.00                             
##  Mean   : 50.81                 Mean   : 39.91                             
##  3rd Qu.: 70.00                 3rd Qu.: 27.00                             
##  Max.   :253.00                 Max.   :238.00                             
##                                                                            
##  calculated_host_listings_count_private_rooms
##  Min.   : 0.00                               
##  1st Qu.: 0.00                               
##  Median : 2.00                               
##  Mean   :10.17                               
##  3rd Qu.: 9.00                               
##  Max.   :91.00                               
##                                              
##  calculated_host_listings_count_shared_rooms reviews_per_month
##  Min.   : 0.0000                             Min.   : 0.0100  
##  1st Qu.: 0.0000                             1st Qu.: 0.0500  
##  Median : 0.0000                             Median : 0.1700  
##  Mean   : 0.3574                             Mean   : 0.5582  
##  3rd Qu.: 0.0000                             3rd Qu.: 0.6275  
##  Max.   :18.0000                             Max.   :20.9300  
##                                              NA's   :1565
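The summary shows 1,565 missing values out of 3,483 for review_scores_rating (about 45%), and a similar count for the other review score columns. The share of missing values per column, and the number of rows that would survive a complete-case filter, can be computed as in the sketch below. The data frame `toy_scores` is a made-up stand-in, so the numbers it produces are illustrative only.

```r
# Toy stand-in for the review score columns (values are illustrative only).
toy_scores <- data.frame(
  review_scores_rating      = c(4.44, NA, 4.41, NA),
  review_scores_cleanliness = c(4.00, NA, 4.52, 4.26)
)

# Share of missing values per column, as a percentage.
na_share <- round(100 * colMeans(is.na(toy_scores)), 1)

# Rows with no missing value in any column (what na.omit() would keep).
complete_rows <- sum(complete.cases(toy_scores))
```

Checking these figures before cleaning makes it clear how many listings are lost when rows with missing scores are dropped.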

Data Cleaning

# Dropping unnecessary columns
data_cleaned <- data[c("id", "price", "review_scores_rating", 
                       "review_scores_cleanliness","review_scores_accuracy",
                       "review_scores_checkin","review_scores_communication",
                       "review_scores_location","review_scores_value")]

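# Converting columns
# Note: only "$" is stripped here, so a price with a thousands separator
# (e.g. "$1,200.00") likely becomes NA in the conversion below and is then dropped by na.omit()
data_cleaned$price <- gsub("\\$", "", data_cleaned$price)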
columns_to_convert <- c("price", "review_scores_rating", 
                        "review_scores_cleanliness","review_scores_accuracy",
                        "review_scores_checkin","review_scores_communication",
                        "review_scores_location","review_scores_value")

data_cleaned <- data_cleaned %>%
  mutate(across(all_of(columns_to_convert), ~as.numeric(as.character(.))))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(all_of(columns_to_convert),
##   ~as.numeric(as.character(.)))`.
## Caused by warning:
## ! NAs introduced by coercion
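
The coercion warning above is consistent with price strings that still contain a thousands separator after the dollar sign is removed. The following is a minimal sketch of that behaviour, using made-up price strings (not values taken from the dataset) and the hypothetical variable names `only_dollar` and `dollar_and_comma`:

```r
# Illustrative price strings (assumed format, not real listings)
raw_price <- c("$85.00", "$137.00", "$1,200.00")

# Stripping only the dollar sign leaves the comma, so the last value becomes NA
only_dollar <- suppressWarnings(as.numeric(gsub("\\$", "", raw_price)))

# Stripping both the dollar sign and the comma keeps every value
dollar_and_comma <- as.numeric(gsub("[$,]", "", raw_price))

print(only_dollar)
print(dollar_and_comma)
```

If the comma were stripped in the cleaning step, listings priced at 1,000 or more would stay in the data, so the row count and the summaries that follow would change.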

# Handling missing values
data_cleaned <- na.omit(data_cleaned)
data_cleaned <- data_cleaned[!duplicated(data_cleaned), ]
sapply(data_cleaned, function(x) sum(is.na(x)))

##                          id                       price 
##                           0                           0 
##        review_scores_rating   review_scores_cleanliness 
##                           0                           0 
##      review_scores_accuracy       review_scores_checkin 
##                           0                           0 
## review_scores_communication      review_scores_location 
##                           0                           0 
##         review_scores_value 
##                           0

# Rename column names
data_cleaned <- data_cleaned %>%
  rename(
    Price = price,
    Rating_Score = review_scores_rating,
    Cleanliness_Score = review_scores_cleanliness,
    Accuracy_Score = review_scores_accuracy,
    Checkin_Score = review_scores_checkin,
    Communication_Score = review_scores_communication,
    Location_Score = review_scores_location,
    Value_Score = review_scores_value
  )

# Output
write.csv(data_cleaned, "data_cleaned.csv", row.names = FALSE)

EDA Analysis

summary(data_cleaned)

##        id                Price        Rating_Score   Cleanliness_Score
##  Min.   :7.161e+04   Min.   : 22.0   Min.   :1.000   Min.   :1.000    
##  1st Qu.:1.890e+07   1st Qu.: 68.0   1st Qu.:4.330   1st Qu.:4.270    
##  Median :3.519e+07   Median :137.0   Median :4.705   Median :4.670    
##  Mean   :1.685e+17   Mean   :174.1   Mean   :4.538   Mean   :4.495    
##  3rd Qu.:5.258e+07   3rd Qu.:223.0   3rd Qu.:5.000   3rd Qu.:5.000    
##  Max.   :9.699e+17   Max.   :999.0   Max.   :5.000   Max.   :5.000    
##  Accuracy_Score  Checkin_Score   Communication_Score Location_Score 
##  Min.   :1.000   Min.   :1.000   Min.   :1.000       Min.   :1.000  
##  1st Qu.:4.440   1st Qu.:4.670   1st Qu.:4.670       1st Qu.:4.570  
##  Median :4.780   Median :4.910   Median :4.910       Median :4.850  
##  Mean   :4.578   Mean   :4.726   Mean   :4.707       Mean   :4.692  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000       3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000       Max.   :5.000  
##   Value_Score   
##  Min.   :1.000  
##  1st Qu.:4.210  
##  Median :4.570  
##  Mean   :4.442  
##  3rd Qu.:5.000  
##  Max.   :5.000

# 1. Univariate Analysis
# Histogram for Price
ggplot(data_cleaned, aes(x = Price)) + geom_histogram(binwidth = 10, fill = "blue", color = "black")

# Boxplot
ggplot(data_cleaned, aes(y = Rating_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Cleanliness_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Accuracy_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Checkin_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Communication_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Location_Score)) + geom_boxplot(fill = "orange")

ggplot(data_cleaned, aes(y = Value_Score)) + geom_boxplot(fill = "orange")

# 2. Bivariate Analysis
# Scatter plot for Rating_Score vs. Cleanliness_Score
ggplot(data_cleaned, aes(x = Rating_Score, y = Cleanliness_Score)) + geom_point() + geom_smooth(method = lm) +
  ggtitle("Correlation between Rating_Score and Cleanliness_Score")

## `geom_smooth()` using formula = 'y ~ x'

# Correlation matrix
cor_matrix <- cor(data_cleaned[,2:9])
corrplot(cor_matrix, method = "circle",
         tl.cex = 0.6, mar=c(0,0,2,0))
title("Correlation Matrix", cex.main = 1)

# 3. Multivariate Analysis (Pairs Plot & Heatmap)
pairs(data_cleaned[,2:9], main="Pairs Plot")

ggplot(melt(cor_matrix), aes(Var1, Var2, fill = value)) + 
  geom_tile() +
  scale_fill_gradient2(
    low = "blue",
    mid = "white", 
    high = "red",
    midpoint = 0, 
    limit = c(-1,1),
    space = "Lab",
    name="Correlation"
  ) +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)
  ) +
  ggtitle("Heatmap")

Machine Learning Modelling

Train and Test Splitting

Splitting the data into training and testing sets lets each model be evaluated on data it has not seen, which helps to detect over-fitting and gives a more realistic estimate of performance.

This project uses a 70:30 split, which leaves enough data for model training while keeping a reasonably sized hold-out set for an unbiased evaluation.

Rows are randomly selected so that 70% form the training data (train_data), used to fit the models, and the remaining 30% form the testing data (test_data), used to evaluate their predictions.

df<-read.csv("data_cleaned.csv")

set.seed(1000)

#Split the data into X (features) and Y (target-Rating Score)
X <- df[, !(colnames(df) %in% c("Rating_Score"))]
Y <- df[,"Rating_Score"]

# Generate random indices for the training data
train_indices <- sample(nrow(df), nrow(df) * 0.7)

# Generate the training data
train_data <- df[train_indices, ]

# Generate the testing data
test_data <- df[-train_indices, ]

Dimension Reduction

corr_matrix <- cor(df)
ggcorrplot(corr_matrix)

From the correlation matrix above, Rating_Score and Cleanliness_Score have the highest correlation coefficient, which indicates a strong positive linear relationship between the two variables. Accuracy_Score and Value_Score are also strongly correlated with Rating_Score.
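
Reading the strongest relationships off a plot can be error-prone, so the correlations with the target can also be ranked numerically. The sketch below does this on a small synthetic data frame (`toy`, with illustrative values only); the column names mirror the cleaned data, but the numbers do not come from it.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the cleaned data (values are illustrative only)
Cleanliness_Score <- runif(n, 3, 5)
Accuracy_Score <- pmin(5, Cleanliness_Score + rnorm(n, 0, 0.3))
Price <- runif(n, 20, 500)
Rating_Score <- pmin(5, 0.6 * Cleanliness_Score + 0.4 * Accuracy_Score + rnorm(n, 0, 0.1))
toy <- data.frame(Price, Rating_Score, Cleanliness_Score, Accuracy_Score)

# Correlation of every feature with the target, strongest first
target_cor <- cor(toy)[, "Rating_Score"]
target_cor <- sort(target_cor[names(target_cor) != "Rating_Score"], decreasing = TRUE)
print(round(target_cor, 3))
```

Applied to the real data, the same two lines on `cor(df)` would list the features in order of their correlation with Rating_Score.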

Regression Modelling

For regression, Rating_Score is the target variable.

Random Forest Regression Modelling

# Use the bare column name as the response so that "." expands to the other columns only
RFR <- randomForest(Rating_Score ~ ., data = train_data, ntree = 80, importance = TRUE)

# Predict the target variable for the test set
yPred_RF <- predict(RFR, newdata = test_data)

# Store predicted values to data frame
RF <- data.frame(y_test = test_data$Rating_Score, y_pred = yPred_RF)

# Show first 50 actual and predicted data
subset_RF <- RF[1:50, ]

# Visualize prediction using plot
plot(subset_RF$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_RF$y_pred, col = "skyblue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "skyblue"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")
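
The forest above is fitted with importance = TRUE, but the importance scores themselves are not inspected. The sketch below shows one way to read them, using a synthetic data frame (`toy`, illustrative values only) in place of train_data; the object names `rf_toy` and `imp` are assumptions made for this example.

```r
library(randomForest)

set.seed(1)
n <- 300

# Synthetic stand-in for train_data (illustrative values only)
toy <- data.frame(
  Cleanliness_Score = runif(n, 3, 5),
  Value_Score = runif(n, 3, 5),
  Price = runif(n, 20, 500)
)
toy$Rating_Score <- 0.7 * toy$Cleanliness_Score + 0.3 * toy$Value_Score + rnorm(n, 0, 0.1)

rf_toy <- randomForest(Rating_Score ~ ., data = toy, ntree = 80, importance = TRUE)

# type = 1 returns the permutation importance (%IncMSE) for each predictor
imp <- importance(rf_toy, type = 1)
print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
```

On the real model, `importance(RFR, type = 1)` or `varImpPlot(RFR)` would give the corresponding ranking of the review sub-scores.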

Decision Tree Modelling

The next regressor evaluated is a Decision Tree, which is also used to predict Rating_Score.

# Fit a decision tree model
dt_model <- rpart(Rating_Score ~ ., data = train_data)

# Predict the target variable for the test set
yPred_dt <- predict(dt_model, newdata = test_data)

# Store predicted values to data frame
DT <- data.frame(y_test = test_data$Rating_Score, y_pred = yPred_dt)

# Show first 50 actual and predicted data
subset_DT <- DT[1:50, ]

# Visualize decision tree
rpart.plot(dt_model)

# Visualize prediction using plot
plot(subset_DT$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_DT$y_pred, col = "purple", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "purple"), lwd = 2)
title(main = "Actual vs Predicted for Decision Tree Regressor Model")

Metrics Evaluation on Regressors

This part compares the Random Forest and Decision Tree models based on R-squared value, Mean Absolute Error and Root Mean Squared Error.

# Calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and R-squared
# for both Random Forest and Decision Tree
mae_rf <- round(mean(abs(test_data$Rating_Score - yPred_RF)), 3)
mae_dt <- round(mean(abs(test_data$Rating_Score - yPred_dt)), 3)
rmse_rf <- round(sqrt(mean((test_data$Rating_Score - yPred_RF)^2)), 3)
rmse_dt <- round(sqrt(mean((test_data$Rating_Score - yPred_dt)^2)), 3)
# R-squared is computed here as the squared correlation between predicted and actual values
rsquared_rf <- round(cor(yPred_RF, test_data$Rating_Score)^2, 3)
rsquared_dt <- round(cor(yPred_dt, test_data$Rating_Score)^2, 3)

# Create summary table and display
summary_table <- data.frame(
  Model = c("Random Forest", "Decision Tree"),
  R_squared = c(rsquared_rf, rsquared_dt),
  Mean_Absolute_Error = c(mae_rf, mae_dt),
  Root_Mean_Squared_Error = c(rmse_rf, rmse_dt)
)

print(summary_table)

##           Model R_squared Mean_Absolute_Error Root_Mean_Squared_Error
## 1 Random Forest     0.777               0.165                   0.284
## 2 Decision Tree     0.652               0.235                   0.354

From the metrics table, Random Forest has an R-squared value of 0.777, compared with 0.652 for Decision Tree. Random Forest also has a lower Mean Absolute Error (0.165 vs 0.235) and a lower Root Mean Squared Error (0.284 vs 0.354). This indicates that Random Forest is the better option for predicting Rating_Score.
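
The three metrics can be collected in one helper. The sketch below is illustrative only: `regression_metrics` is a hypothetical function name and the vectors are toy values. It also reports R-squared in two ways, because the squared correlation used in this report ignores systematic bias in the predictions, whereas 1 - SSE/SST penalizes it.

```r
# Hypothetical helper; the name and return layout are assumptions for illustration
regression_metrics <- function(actual, predicted) {
  residual <- actual - predicted
  c(
    MAE = mean(abs(residual)),
    RMSE = sqrt(mean(residual^2)),
    R2_cor = cor(actual, predicted)^2,
    R2_sse = 1 - sum(residual^2) / sum((actual - mean(actual))^2)
  )
}

# Toy values: predictions are perfectly correlated with the truth but always 0.5 too low
actual <- c(4.0, 4.5, 5.0, 3.5, 4.8)
predicted <- actual - 0.5
m <- regression_metrics(actual, predicted)
print(round(m, 3))
```

In this toy case the squared correlation is 1 even though every prediction is off by 0.5, which is why MAE and RMSE are worth reading alongside R-squared.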

Classification Modelling

Train and Test Splitting

This step prepares the dataset for binary classification: the response variable indicates whether a unit's rating score is high, using the 70th percentile of Rating_Score as the threshold. A decision tree model is then used to classify Airbnb units into high-rating and low-rating categories.

# Determine the threshold for high rating (70th percentile)
rating_threshold <- quantile(df$Rating_Score, 0.7)

# Create a new binary column for high rating
df$High_Rating <- ifelse(df$Rating_Score >= rating_threshold, 1, 0)

# Splitting the data into training and testing sets
set.seed(42) # For reproducibility
trainIndex <- createDataPartition(df$High_Rating, p = 0.7, list = FALSE)
train_data <- df[trainIndex, ]
test_data <- df[-trainIndex, ]

# Ensure that the 'High_Rating' column in both training and testing datasets are factors
train_data$High_Rating <- as.factor(train_data$High_Rating)
test_data$High_Rating <- as.factor(test_data$High_Rating)
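
A percentile threshold does not always yield an exact 70:30 class split, because many listings can share the same score. The sketch below illustrates this with ten made-up ratings (not taken from the dataset); the names `ratings`, `threshold` and `high` are assumptions for this example.

```r
# Illustrative ratings only; four of the ten listings share the maximum score of 5
ratings <- c(3.9, 4.2, 4.5, 4.6, 4.7, 4.8, 5, 5, 5, 5)

# 70th percentile threshold, as in the report
threshold <- quantile(ratings, 0.7)
high <- factor(ifelse(ratings >= threshold, 1, 0), levels = c(0, 1))

print(threshold)
# Share of listings in each class
print(prop.table(table(high)))
```

Here the tied scores place 40% of the listings in the high class rather than 30%, so it is worth checking `prop.table(table(df$High_Rating))` before modelling.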

Build a decision tree classifier using the rpart function from the rpart package, then use it to predict the test set.

# Build the Decision Tree model
dt_model <- rpart(High_Rating ~ . -Rating_Score -id, data = train_data, method = "class")
dt_model

## n= 1305 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1305 416 0 (0.68122605 0.31877395)  
##    2) Accuracy_Score< 4.975 858  62 0 (0.92773893 0.07226107)  
##      4) Value_Score< 4.935 820  40 0 (0.95121951 0.04878049) *
##      5) Value_Score>=4.935 38  16 1 (0.42105263 0.57894737)  
##       10) Cleanliness_Score< 4.415 15   4 0 (0.73333333 0.26666667) *
##       11) Cleanliness_Score>=4.415 23   5 1 (0.21739130 0.78260870) *
##    3) Accuracy_Score>=4.975 447  93 1 (0.20805369 0.79194631)  
##      6) Value_Score< 4.98 163  71 1 (0.43558282 0.56441718)  
##       12) Checkin_Score< 4.83 27   4 0 (0.85185185 0.14814815) *
##       13) Checkin_Score>=4.83 136  48 1 (0.35294118 0.64705882)  
##         26) Cleanliness_Score< 4.98 66  33 0 (0.50000000 0.50000000)  
##           52) Communication_Score< 4.97 13   2 0 (0.84615385 0.15384615) *
##           53) Communication_Score>=4.97 53  22 1 (0.41509434 0.58490566) *
##         27) Cleanliness_Score>=4.98 70  15 1 (0.21428571 0.78571429) *
##      7) Value_Score>=4.98 284  22 1 (0.07746479 0.92253521) *

# Predicting the Test set results
predictions <- predict(dt_model, test_data, type = "class")

Calculate confusion matrix, including True Positives, True Negatives, False Positives and False Negatives

Model evaluation

confusionMatrix(predictions, test_data$High_Rating)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 347  20
##          1  35 157
##                                          
##                Accuracy : 0.9016         
##                  95% CI : (0.8739, 0.925)
##     No Information Rate : 0.6834         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.7777         
##                                          
##  Mcnemar's Test P-Value : 0.05906        
##                                          
##             Sensitivity : 0.9084         
##             Specificity : 0.8870         
##          Pos Pred Value : 0.9455         
##          Neg Pred Value : 0.8177         
##              Prevalence : 0.6834         
##          Detection Rate : 0.6208         
##    Detection Prevalence : 0.6565         
##       Balanced Accuracy : 0.8977         
##                                          
##        'Positive' Class : 0              
##
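
The headline statistics can be reproduced by hand from the four counts in the confusion matrix. The sketch below copies those counts from the output above; the object name `cm` is an assumption for this example. Note that caret reports class 0 (low rating) as the positive class, so sensitivity here refers to low-rating units.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(347, 35, 20, 157), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Class "0" is the positive class in the caret output
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # correctly predicted 0 / all actual 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # correctly predicted 1 / all actual 1
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced_accuracy = balanced), 4))
```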

Visualize the results of a decision tree model

# Visualize the Decision Tree
rpart.plot(dt_model)

Conclusion

The overall review score of Airbnb units in Singapore is strongly correlated with the cleanliness of the unit, as shown in the correlation plot between Rating_Score and Cleanliness_Score. Guests often prioritize hygiene and cleanliness when evaluating their stay. A clean and well-maintained unit contributes to a positive guest experience and consequently to higher review scores.

Besides cleanliness, other factors that contribute significantly to the rating of Airbnb units include communication with the host, value for money and the accuracy of the listing details. As shown in the correlation matrix, these three factors also play an important role in how guests evaluate Airbnb units.

In predicting the overall review score with machine learning, the Random Forest regressor is the better option for predicting Rating_Score: it achieved a higher R-squared value and lower Mean Absolute Error and Root Mean Squared Error than the Decision Tree regressor. For classifying Airbnb units into high-rating and low-rating categories, a Decision Tree model was used and achieved an accuracy of about 90%.

WQD7004 Group Assignment

Group 3