| Name | Matric Number |
|---|---|
| LIM POH SZE | 17082915 |
| WONG KAI THUNG | 17101556 |
| LIANG HUIHAO | 22099647 |
| ZHENG XIN | 22104340 |
| LAI XIAOXUAN | 22106168 |
The hospitality industry has undergone a transformative shift with the advent of online platforms like Airbnb, offering travelers a diverse array of accommodation options. In the dynamic landscape of short-term rentals, the significance of guest satisfaction cannot be overstated. Users rely heavily on peer reviews and overall ratings to make informed decisions about their lodging choices. In this context, understanding the factors that contribute to a positive or negative Airbnb experience becomes paramount.
This project centers around the vibrant city-state of Singapore, a global hub for tourism and business. As the demand for Airbnb accommodations continues to grow, there is a pressing need to delve deeper into the determinants of overall satisfaction for both hosts and guests.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.2
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.2
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
##
## precision, recall
Our original dataset, ‘listings.csv’, was obtained from http://insideairbnb.com/get-the-data/ . It lists Airbnb units in Singapore scraped on 23 September 2023. We renamed the file to “Dataset_original.csv”.
data <- read.csv("Dataset_original.csv")
str(data)
## 'data.frame': 3483 obs. of 75 variables:
## $ id : num 71609 71896 71903 275343 275344 ...
## $ listing_url : chr "https://www.airbnb.com/rooms/71609" "https://www.airbnb.com/rooms/71896" "https://www.airbnb.com/rooms/71903" "https://www.airbnb.com/rooms/275343" ...
## $ scrape_id : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
## $ last_scraped : chr "2023-09-23" "2023-09-23" "2023-09-23" "2023-09-23" ...
## $ source : chr "previous scrape" "previous scrape" "previous scrape" "city scrape" ...
## $ name : chr "Villa in Singapore · ★4.44 · 2 bedrooms · 3 beds · 1 private bath" "Home in Singapore · ★4.16 · 1 bedroom · 1 bed · Shared half-bath" "Home in Singapore · ★4.41 · 1 bedroom · 2 beds · Shared half-bath" "Rental unit in Singapore · ★4.40 · 1 bedroom · 1 bed · 2 shared baths" ...
## $ description : chr "For 3 rooms.Book room 1&2 and room 4<br /><br /><b>The space</b><br />Landed Homestay Room for Rental. Between "| __truncated__ "<b>The space</b><br />Vocational Stay Deluxe Bedroom in Singapore.(Near Airport) <br /> <br />Located Between "| __truncated__ "Like your own home, 24hrs access.<br /><br /><b>The space</b><br />Vocational Stay Deluxe Bedroom in Singapore."| __truncated__ "**IMPORTANT NOTES: READ BEFORE YOU BOOK! <br />==Since this is an HDB Flat tourists are NOT ALLOWED unless hav"| __truncated__ ...
## $ neighborhood_overview : chr "" "" "Quiet and view of the playground with exercise tracks with access to neighbourhood Simwi Estate." "" ...
## $ picture_url : chr "https://a0.muscache.com/pictures/24453191/35803acb_original.jpg" "https://a0.muscache.com/pictures/2440674/ac4f4442_original.jpg" "https://a0.muscache.com/pictures/568743/7bc623e9_original.jpg" "https://a0.muscache.com/pictures/miso/Hosting-275343/original/abbb6837-808c-437e-9835-0e7bb621a4e7.png" ...
## $ host_id : int 367042 367042 367042 1439258 1439258 367042 1521514 1439258 1439258 1521514 ...
## $ host_url : chr "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/367042" "https://www.airbnb.com/users/show/1439258" ...
## $ host_name : chr "Belinda" "Belinda" "Belinda" "Kay" ...
## $ host_since : chr "2011-01-29" "2011-01-29" "2011-01-29" "2011-11-24" ...
## $ host_location : chr "Singapore" "Singapore" "Singapore" "Singapore" ...
## $ host_about : chr "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "Hi My name is Belinda -Housekeeper \n\nI would like to welcome you to my \"Homestay Website\" \n\n\nAccomodatio"| __truncated__ "K2 Guesthouse is designed for guests who want a truly local experience with local people. Experience eating loc"| __truncated__ ...
## $ host_response_time : chr "within a few hours" "within a few hours" "within a few hours" "within an hour" ...
## $ host_response_rate : chr "100%" "100%" "100%" "100%" ...
## $ host_acceptance_rate : chr "100%" "100%" "100%" "95%" ...
## $ host_is_superhost : chr "f" "f" "f" "f" ...
## $ host_thumbnail_url : chr "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/user/7245b0a9-27fa-4759-9fb3-59ae8299e8a3.jpg?aki_policy=profile_small" ...
## $ host_picture_url : chr "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/367042/profile_pic/1382521511/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/user/7245b0a9-27fa-4759-9fb3-59ae8299e8a3.jpg?aki_policy=profile_x_medium" ...
## $ host_neighbourhood : chr "Tampines" "Tampines" "Tampines" "Bukit Merah" ...
## $ host_listings_count : int 5 5 5 52 52 5 7 52 52 7 ...
## $ host_total_listings_count : int 15 15 15 65 65 15 8 65 65 8 ...
## $ host_verifications : chr "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" "['email', 'phone']" ...
## $ host_has_profile_pic : chr "t" "t" "t" "t" ...
## $ host_identity_verified : chr "t" "t" "t" "t" ...
## $ neighbourhood : chr "" "" "Singapore, Singapore" "" ...
## $ neighbourhood_cleansed : chr "Tampines" "Tampines" "Tampines" "Bukit Merah" ...
## $ neighbourhood_group_cleansed : chr "East Region" "East Region" "East Region" "Central Region" ...
## $ latitude : num 1.35 1.35 1.35 1.29 1.29 ...
## $ longitude : num 104 104 104 104 104 ...
## $ property_type : chr "Private room in villa" "Private room in home" "Private room in home" "Private room in rental unit" ...
## $ room_type : chr "Private room" "Private room" "Private room" "Private room" ...
## $ accommodates : int 3 1 2 1 1 4 2 1 1 2 ...
## $ bathrooms : logi NA NA NA NA NA NA ...
## $ bathrooms_text : chr "1 private bath" "Shared half-bath" "Shared half-bath" "2 shared baths" ...
## $ bedrooms : int NA NA NA NA NA 3 NA NA NA NA ...
## $ beds : int 3 1 2 1 1 5 1 1 1 1 ...
## $ amenities : chr "[\"Private backyard \\u2013 Fully fenced\", \"Shampoo\", \"Fire extinguisher\", \"Self check-in\", \"Outdoor fu"| __truncated__ "[\"Private backyard \\u2013 Fully fenced\", \"Shampoo\", \"Drying rack for clothing\", \"Self check-in\", \"Cof"| __truncated__ "[\"Shampoo\", \"Self check-in\", \"Coffee maker\", \"AC - split type ductless system\", \"Outdoor furniture\", "| __truncated__ "[\"Fire extinguisher\", \"Self check-in\", \"Bed linens\", \"Hot water kettle\", \"Wifi\", \"Carbon monoxide al"| __truncated__ ...
## $ price : chr "$150.00" "$80.00" "$80.00" "$55.00" ...
## $ minimum_nights : int 92 92 92 60 60 92 92 60 60 92 ...
## $ maximum_nights : int 365 365 365 999 999 365 1125 999 365 180 ...
## $ minimum_minimum_nights : int 92 92 92 60 60 92 92 60 60 92 ...
## $ maximum_minimum_nights : int 92 92 92 60 60 92 92 60 60 92 ...
## $ minimum_maximum_nights : int 1125 1125 1125 1125 1125 1125 1125 1125 1125 180 ...
## $ maximum_maximum_nights : int 1125 1125 1125 1125 1125 1125 1125 1125 1125 180 ...
## $ minimum_nights_avg_ntm : num 92 92 92 60 60 92 92 60 60 92 ...
## $ maximum_nights_avg_ntm : num 1125 1125 1125 1125 1125 ...
## $ calendar_updated : logi NA NA NA NA NA NA ...
## $ has_availability : chr "t" "t" "t" "t" ...
## $ availability_30 : int 28 28 28 1 30 28 30 30 30 30 ...
## $ availability_60 : int 58 58 58 1 60 58 60 60 60 60 ...
## $ availability_90 : int 88 88 88 1 90 88 90 90 90 90 ...
## $ availability_365 : int 89 89 89 275 274 89 365 365 365 365 ...
## $ calendar_last_scraped : chr "2023-09-23" "2023-09-23" "2023-09-23" "2023-09-23" ...
## $ number_of_reviews : int 20 24 47 22 17 12 133 18 6 81 ...
## $ number_of_reviews_ltm : int 0 0 0 0 3 0 0 1 3 0 ...
## $ number_of_reviews_l30d : int 0 0 0 0 0 0 0 1 1 0 ...
## $ first_review : chr "2011-12-19" "2011-07-30" "2011-05-04" "2013-04-20" ...
## $ last_review : chr "2020-01-17" "2019-10-13" "2020-01-09" "2022-08-13" ...
## $ review_scores_rating : num 4.44 4.16 4.41 4.4 4.27 4.83 4.43 3.5 3.8 4.43 ...
## $ review_scores_accuracy : num 4.37 4.22 4.39 4.16 4.44 4.67 4.33 3.47 3.8 4.45 ...
## $ review_scores_cleanliness : num 4 4.09 4.52 4.26 4.06 4.75 4.16 3.94 4 4.41 ...
## $ review_scores_checkin : num 4.63 4.43 4.63 4.47 4.5 4.58 4.5 4.53 4.2 4.71 ...
## $ review_scores_communication : num 4.78 4.43 4.64 4.42 4.5 4.67 4.66 4.06 4.8 4.76 ...
## $ review_scores_location : num 4.26 4.17 4.5 4.53 4.63 4.33 4.52 3.82 3.8 4.64 ...
## $ review_scores_value : num 4.32 4.04 4.36 4.63 4.13 4.45 4.39 3.76 4 4.55 ...
## $ license : chr "" "" "" "S0399" ...
## $ instant_bookable : chr "f" "f" "f" "t" ...
## $ calculated_host_listings_count : int 5 5 5 52 52 5 7 52 52 7 ...
## $ calculated_host_listings_count_entire_homes : int 0 0 0 1 1 0 1 1 1 1 ...
## $ calculated_host_listings_count_private_rooms: int 5 5 5 51 51 5 6 51 51 6 ...
## $ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reviews_per_month : num 0.14 0.16 0.31 0.17 0.12 0.09 0.94 0.13 0.05 0.67 ...
From the information above, we can see that the dataset contains 3,483 rows (observations), each representing a unique Airbnb unit, and 75 variables (columns). These variables cover the essential aspects of each listing, ranging from basic identifiers such as the ID and listing URL to detailed information about hosts, property characteristics, and guest reviews.
The variables in the dataset are described below:
| Variables | Description |
|---|---|
| id | Unique identifier for each Airbnb unit |
| listing_url | URL link to the Airbnb listing page |
| scrape_id | Identifier for the specific scrape date |
| last_scraped | Date when the listing was last scraped |
| source | Source of the listing data |
| name | Title or name of the Airbnb listing |
| description | Detailed text describing the Airbnb unit |
| neighborhood_overview | Overview of the neighborhood where the unit is located |
| picture_url | URL link to pictures showcasing the Airbnb unit |
| host_id | Unique identifier for the host of the Airbnb unit |
| host_url | URL link to the host’s profile page |
| host_name | Name of the Airbnb host |
| host_since | Date when the host joined Airbnb |
| host_location | Location of the host |
| host_about | Information provided by the host about themselves |
| host_response_time | Time taken by the host to respond to inquiries |
| host_response_rate | Percentage of inquiries to which the host responds |
| host_acceptance_rate | Percentage of booking requests accepted by the host |
| host_is_superhost | Indicator of whether the host has Superhost status |
| host_thumbnail_url | URL link to the host’s profile picture |
| host_picture_url | URL link to a larger picture of the host |
| host_neighbourhood | Neighborhood of the host |
| host_listings_count | Number of listings by the host |
| host_total_listings_count | Total number of listings by the host |
| host_verifications | Methods used by the host to verify their identity |
| host_has_profile_pic | Indicator of whether the host has a profile picture |
| host_identity_verified | Indicator of whether the host’s identity is verified |
| neighbourhood | General neighborhood information |
| neighbourhood_cleansed | Specific cleansed neighborhood information |
| neighbourhood_group_cleansed | Cleansed neighborhood group information |
| latitude | Latitude coordinates of the Airbnb unit |
| longitude | Longitude coordinates of the Airbnb unit |
| property_type | Type of property (private room/entire unit) |
| room_type | Type of room available for booking |
| accommodates | Number of guests the unit can accommodate |
| bathrooms | Number of bathrooms (empty in this dataset; see bathrooms_text) |
| bathrooms_text | Text description of the bathroom(s) |
| bedrooms | Number of bedrooms |
| beds | Number of beds |
| amenities | List of amenities provided in the Airbnb unit |
| price | Nightly price for renting the Airbnb unit |
| minimum_nights | Minimum number of nights required for booking |
| maximum_nights | Maximum number of nights allowed for booking |
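| minimum_minimum_nights | Smallest minimum_nights value in the calendar (looking 365 nights ahead) |
| maximum_minimum_nights | Largest minimum_nights value in the calendar (looking 365 nights ahead) |
| minimum_maximum_nights | Smallest maximum_nights value in the calendar (looking 365 nights ahead) |
| maximum_maximum_nights | Largest maximum_nights value in the calendar (looking 365 nights ahead) |
| minimum_nights_avg_ntm | Average minimum_nights value in the calendar (looking 365 nights ahead) |
| maximum_nights_avg_ntm | Average maximum_nights value in the calendar (looking 365 nights ahead) |
| calendar_updated | Calendar update information (empty in this dataset) |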
| has_availability | Indicator of whether the unit is available |
| availability_30 | Number of available nights in the next 30 days |
| availability_60 | Number of available nights in the next 60 days |
| availability_90 | Number of available nights in the next 90 days |
| availability_365 | Number of available nights in the next 365 days |
| calendar_last_scraped | Date when the calendar was last scraped |
| number_of_reviews | Total number of reviews for the unit |
| number_of_reviews_ltm | Number of reviews in the last twelve months |
| number_of_reviews_l30d | Number of reviews in the last 30 days |
| first_review | Date of the first review for the unit |
| last_review | Date of the most recent review for the unit |
| review_scores_rating | Overall rating score given by guests |
| review_scores_accuracy | Rating score for accuracy |
| review_scores_cleanliness | Rating score for cleanliness |
| review_scores_checkin | Rating score for the check-in process |
| review_scores_communication | Rating score for communication |
| review_scores_location | Rating score for the location |
| review_scores_value | Rating score for the overall value |
| license | License information for the property |
| instant_bookable | Indicator of whether instant booking is available |
| calculated_host_listings_count | Count of listings by the host |
| calculated_host_listings_count_entire_homes | Count of entire homes listed by the host |
| calculated_host_listings_count_private_rooms | Count of private rooms listed by the host |
| calculated_host_listings_count_shared_rooms | Count of shared rooms listed by the host |
| reviews_per_month | Average number of reviews received per month |
Below is the statistical summary of the dataset:
summary(data)
## id listing_url scrape_id last_scraped
## Min. :7.161e+04 Length:3483 Min. :2.023e+13 Length:3483
## 1st Qu.:2.477e+07 Class :character 1st Qu.:2.023e+13 Class :character
## Median :4.230e+07 Mode :character Median :2.023e+13 Mode :character
## Mean :2.607e+17 Mean :2.023e+13
## 3rd Qu.:6.927e+17 3rd Qu.:2.023e+13
## Max. :9.859e+17 Max. :2.023e+13
##
## source name description neighborhood_overview
## Length:3483 Length:3483 Length:3483 Length:3483
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## picture_url host_id host_url host_name
## Length:3483 Min. : 23666 Length:3483 Length:3483
## Class :character 1st Qu.: 29032695 Class :character Class :character
## Mode :character Median :107599478 Mode :character Mode :character
## Mean :154421232
## 3rd Qu.:238891646
## Max. :536857130
##
## host_since host_location host_about host_response_time
## Length:3483 Length:3483 Length:3483 Length:3483
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url
## Length:3483 Length:3483 Length:3483 Length:3483
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## host_picture_url host_neighbourhood host_listings_count
## Length:3483 Length:3483 Min. : 1.0
## Class :character Class :character 1st Qu.: 3.0
## Mode :character Mode :character Median : 14.0
## Mean : 87.2
## 3rd Qu.: 79.0
## Max. :571.0
##
## host_total_listings_count host_verifications host_has_profile_pic
## Min. : 1.0 Length:3483 Length:3483
## 1st Qu.: 5.0 Class :character Class :character
## Median : 20.0 Mode :character Mode :character
## Mean :145.7
## 3rd Qu.:126.0
## Max. :847.0
##
## host_identity_verified neighbourhood neighbourhood_cleansed
## Length:3483 Length:3483 Length:3483
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## neighbourhood_group_cleansed latitude longitude
## Length:3483 Min. :1.222 Min. :103.6
## Class :character 1st Qu.:1.291 1st Qu.:103.8
## Mode :character Median :1.305 Median :103.8
## Mean :1.311 Mean :103.8
## 3rd Qu.:1.318 3rd Qu.:103.9
## Max. :1.458 Max. :104.0
##
## property_type room_type accommodates bathrooms
## Length:3483 Length:3483 Min. : 1.000 Mode:logical
## Class :character Class :character 1st Qu.: 2.000 NA's:3483
## Mode :character Mode :character Median : 2.000
## Mean : 2.817
## 3rd Qu.: 4.000
## Max. :16.000
##
## bathrooms_text bedrooms beds amenities
## Length:3483 Min. :1.000 Min. : 1.0 Length:3483
## Class :character 1st Qu.:1.000 1st Qu.: 1.0 Class :character
## Mode :character Median :1.000 Median : 1.0 Mode :character
## Mean :1.447 Mean : 1.8
## 3rd Qu.:2.000 3rd Qu.: 2.0
## Max. :5.000 Max. :46.0
## NA's :1488 NA's :97
## price minimum_nights maximum_nights minimum_minimum_nights
## Length:3483 Min. : 1.00 Min. : 2.0 Min. : 1.00
## Class :character 1st Qu.: 6.00 1st Qu.: 365.0 1st Qu.: 6.00
## Mode :character Median : 92.00 Median : 1125.0 Median : 92.00
## Mean : 67.28 Mean : 811.2 Mean : 67.27
## 3rd Qu.: 92.00 3rd Qu.: 1125.0 3rd Qu.: 92.00
## Max. :1000.00 Max. :100000.0 Max. :1000.00
##
## maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights
## Min. : 1.00 Min. : 1.0 Min. : 1.0
## 1st Qu.: 6.00 1st Qu.: 365.0 1st Qu.: 365.0
## Median : 92.00 Median : 1125.0 Median : 1125.0
## Mean : 74.04 Mean : 891.4 Mean : 904.6
## 3rd Qu.: 92.00 3rd Qu.: 1125.0 3rd Qu.: 1125.0
## Max. :1000.00 Max. :100000.0 Max. :100000.0
##
## minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated
## Min. : 1.00 Min. : 1.0 Mode:logical
## 1st Qu.: 6.00 1st Qu.: 365.0 NA's:3483
## Median : 92.00 Median : 1125.0
## Mean : 73.41 Mean : 893.1
## 3rd Qu.: 92.00 3rd Qu.: 1125.0
## Max. :1000.00 Max. :100000.0
##
## has_availability availability_30 availability_60 availability_90
## Length:3483 Min. : 0.00 Min. : 0 Min. : 0.00
## Class :character 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 1.00
## Mode :character Median :22.00 Median :51 Median :79.00
## Mean :15.97 Mean :35 Mean :55.29
## 3rd Qu.:29.00 3rd Qu.:59 3rd Qu.:89.00
## Max. :30.00 Max. :60 Max. :90.00
##
## availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm
## Min. : 0.0 Length:3483 Min. : 0.00 Min. : 0.000
## 1st Qu.: 90.0 Class :character 1st Qu.: 0.00 1st Qu.: 0.000
## Median :312.0 Mode :character Median : 1.00 Median : 0.000
## Mean :235.7 Mean : 10.25 Mean : 2.253
## 3rd Qu.:364.0 3rd Qu.: 5.00 3rd Qu.: 0.000
## Max. :365.0 Max. :665.00 Max. :404.000
##
## number_of_reviews_l30d first_review last_review
## Min. : 0.0000 Length:3483 Length:3483
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.1904
## 3rd Qu.: 0.0000
## Max. :22.0000
##
## review_scores_rating review_scores_accuracy review_scores_cleanliness
## Min. :0.00 Min. :1.00 Min. :1.000
## 1st Qu.:4.33 1st Qu.:4.44 1st Qu.:4.270
## Median :4.69 Median :4.78 Median :4.670
## Mean :4.46 Mean :4.58 Mean :4.497
## 3rd Qu.:5.00 3rd Qu.:5.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.00 Max. :5.000
## NA's :1565 NA's :1599 NA's :1599
## review_scores_checkin review_scores_communication review_scores_location
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.670 1st Qu.:4.670 1st Qu.:4.570
## Median :4.915 Median :4.920 Median :4.850
## Mean :4.726 Mean :4.708 Mean :4.691
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1599 NA's :1598 NA's :1599
## review_scores_value license instant_bookable
## Min. :1.000 Length:3483 Length:3483
## 1st Qu.:4.207 Class :character Class :character
## Median :4.570 Mode :character Mode :character
## Mean :4.441
## 3rd Qu.:5.000
## Max. :5.000
## NA's :1599
## calculated_host_listings_count calculated_host_listings_count_entire_homes
## Min. : 1.00 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 0.00
## Median : 13.00 Median : 1.00
## Mean : 50.81 Mean : 39.91
## 3rd Qu.: 70.00 3rd Qu.: 27.00
## Max. :253.00 Max. :238.00
##
## calculated_host_listings_count_private_rooms
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 2.00
## Mean :10.17
## 3rd Qu.: 9.00
## Max. :91.00
##
## calculated_host_listings_count_shared_rooms reviews_per_month
## Min. : 0.0000 Min. : 0.0100
## 1st Qu.: 0.0000 1st Qu.: 0.0500
## Median : 0.0000 Median : 0.1700
## Mean : 0.3574 Mean : 0.5582
## 3rd Qu.: 0.0000 3rd Qu.: 0.6275
## Max. :18.0000 Max. :20.9300
## NA's :1565
# Dropping unnecessary columns
data_cleaned <- data[c("id", "price", "review_scores_rating",
"review_scores_cleanliness","review_scores_accuracy",
"review_scores_checkin","review_scores_communication",
"review_scores_location","review_scores_value")]
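# Converting columns
# Note: only the "$" is stripped here, so any price that still contains a
# thousands separator (e.g. "$1,000.00") becomes NA in the conversion below
# and is then removed by na.omit()
data_cleaned$price <- gsub("\\$", "", data_cleaned$price)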
columns_to_convert <- c("price", "review_scores_rating",
"review_scores_cleanliness","review_scores_accuracy",
"review_scores_checkin","review_scores_communication",
"review_scores_location","review_scores_value")
data_cleaned <- data_cleaned %>%
mutate(across(all_of(columns_to_convert), ~as.numeric(as.character(.))))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(all_of(columns_to_convert),
## ~as.numeric(as.character(.)))`.
## Caused by warning:
## ! NAs introduced by coercion
# Handling missing values
data_cleaned <- na.omit(data_cleaned)
data_cleaned <- data_cleaned[!duplicated(data_cleaned), ]
sapply(data_cleaned, function(x) sum(is.na(x)))
## id price
## 0 0
## review_scores_rating review_scores_cleanliness
## 0 0
## review_scores_accuracy review_scores_checkin
## 0 0
## review_scores_communication review_scores_location
## 0 0
## review_scores_value
## 0
# Rename column names
data_cleaned <- data_cleaned %>%
rename(
Price = price,
Rating_Score = review_scores_rating,
Cleanliness_Score = review_scores_cleanliness,
Accuracy_Score = review_scores_accuracy,
Checkin_Score = review_scores_checkin,
Communication_Score = review_scores_communication,
Location_Score = review_scores_location,
Value_Score = review_scores_value
)
# Output
write.csv(data_cleaned, "data_cleaned.csv", row.names = FALSE)
summary(data_cleaned)
## id Price Rating_Score Cleanliness_Score
## Min. :7.161e+04 Min. : 22.0 Min. :1.000 Min. :1.000
## 1st Qu.:1.890e+07 1st Qu.: 68.0 1st Qu.:4.330 1st Qu.:4.270
## Median :3.519e+07 Median :137.0 Median :4.705 Median :4.670
## Mean :1.685e+17 Mean :174.1 Mean :4.538 Mean :4.495
## 3rd Qu.:5.258e+07 3rd Qu.:223.0 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :9.699e+17 Max. :999.0 Max. :5.000 Max. :5.000
## Accuracy_Score Checkin_Score Communication_Score Location_Score
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.440 1st Qu.:4.670 1st Qu.:4.670 1st Qu.:4.570
## Median :4.780 Median :4.910 Median :4.910 Median :4.850
## Mean :4.578 Mean :4.726 Mean :4.707 Mean :4.692
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Value_Score
## Min. :1.000
## 1st Qu.:4.210
## Median :4.570
## Mean :4.442
## 3rd Qu.:5.000
## Max. :5.000
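Price strings in a listings export may carry a thousands separator as well as the currency sign. The sketch below shows one way to strip both before converting to numeric. `parse_price` is a hypothetical helper name, it assumes US-style formatting, and the sample strings are made up for illustration rather than taken from the dataset.

```r
# Hypothetical helper: convert price strings such as "$1,250.00" to numeric.
# Assumes a "$" prefix, "," as thousands separator and "." as decimal mark.
parse_price <- function(x) {
  cleaned <- gsub("[$,]", "", x)  # drop dollar signs and commas
  as.numeric(cleaned)
}

# Illustrative values only (not rows from the dataset)
sample_prices <- c("$150.00", "$80.00", "$1,250.00")
parse_price(sample_prices)
```

With this kind of parsing, listings priced at 1,000 or more would stay in the cleaned data instead of being dropped as missing values, which would change the price summary.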
# 1. Univariate Analysis
# Histogram for Price
ggplot(data_cleaned, aes(x = Price)) + geom_histogram(binwidth = 10, fill = "blue", color = "black")
# Boxplot
ggplot(data_cleaned, aes(y = Rating_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Cleanliness_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Accuracy_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Checkin_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Communication_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Location_Score)) + geom_boxplot(fill = "orange")
ggplot(data_cleaned, aes(y = Value_Score)) + geom_boxplot(fill = "orange")
# 2. Bivariate Analysis
# Scatter plot for Rating_Score vs. Cleanliness_Score
ggplot(data_cleaned, aes(x = Rating_Score, y = Cleanliness_Score)) + geom_point() + geom_smooth(method = lm) +
ggtitle("Correlation between Rating_Score and Cleanliness_Score")
## `geom_smooth()` using formula = 'y ~ x'
# Correlation matrix
cor_matrix <- cor(data_cleaned[,2:9])
corrplot(cor_matrix, method = "circle",
tl.cex = 0.6, mar=c(0,0,2,0))
title("Correlation Matrix", cex.main = 1)
# 3. Multivariate Analysis (Pairs Plot & Heatmap)
pairs(data_cleaned[,2:9], main="Pairs Plot")
ggplot(melt(cor_matrix), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
midpoint = 0,
limit = c(-1,1),
space = "Lab",
name="Correlation"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)
) +
ggtitle("Heatmap")
Data splitting keeps the evaluation honest: the model is trained on one part of the data and assessed on a held-out part, which helps to detect over-fitting and gives a fairer estimate of performance on unseen listings.
This project uses a 70:30 split, which leaves enough data for model training while keeping a reasonably large sample for an unbiased model evaluation.
70% of the rows are randomly selected as training data (train_data) and the remaining 30% form the testing data (test_data). The training rows are used to fit the models, and the testing rows are used to check their predictions.
df<-read.csv("data_cleaned.csv")
set.seed(1000)
#Split the data into X (features) and Y (target-Rating Score)
X <- df[, !(colnames(df) %in% c("Rating_Score"))]
Y <- df[,"Rating_Score"]
# Generate random indices for the training data
train_indices <- sample(nrow(df), nrow(df) * 0.7)
# Generate the training data
train_data <- df[train_indices, ]
# Generate the testing data
test_data <- df[-train_indices, ]
corr_matrix <- cor(df)
ggcorrplot(corr_matrix)
From the correlation matrix above, the highest correlation coefficient is between Rating_Score and Cleanliness_Score, which indicates a strong positive linear correlation between the two variables. Accuracy_Score and Value_Score are also strongly correlated with Rating_Score.
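Reading the strongest relationships off a correlation plot can be error-prone, so it may help to rank the features numerically. The sketch below is a minimal illustration on a made-up data frame: `rank_by_correlation` is a hypothetical helper and the values in `toy` are invented, not taken from the Airbnb data.

```r
# Hypothetical helper: Pearson correlation of every other numeric column
# with the target column, sorted from highest to lowest.
rank_by_correlation <- function(df, target) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  cors <- sapply(predictors, function(col) cor(df[[col]], df[[target]]))
  sort(cors, decreasing = TRUE)
}

# Invented toy values for illustration only
toy <- data.frame(
  Rating_Score      = c(4.2, 4.8, 3.9, 5.0, 4.5),
  Cleanliness_Score = c(4.0, 4.9, 3.8, 5.0, 4.4),
  Location_Score    = c(4.9, 4.1, 4.6, 4.3, 4.8)
)

ranking <- rank_by_correlation(toy, "Rating_Score")
ranking
```

Applied to the cleaned data, the same call with `target = "Rating_Score"` would return a named vector that can be compared directly with the plot.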
For regression, Rating_Score is the target variable.
RFR <- randomForest(Rating_Score ~ ., data = train_data, ntree = 80, importance = TRUE)
# Predict the target variable for the test set
yPred_RF <- predict(RFR, newdata = test_data)
# Store predicted values to data frame
RF <- data.frame(y_test = test_data$Rating_Score, y_pred = yPred_RF)
# Show first 50 actual and predicted data
subset_RF <- RF[1:50, ]
# Visualize prediction using plot
plot(subset_RF$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_RF$y_pred, col = "skyblue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "skyblue"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")
The next regressor evaluated is a Decision Tree regressor, which is used to predict Rating_Score.
# Fit a decision tree model
dt_model <- rpart(Rating_Score ~ ., data = train_data)
# Predict the target variable for the test set
yPred_dt <- predict(dt_model, newdata = test_data)
# Store predicted values to data frame
DT <- data.frame(y_test = test_data$Rating_Score, y_pred = yPred_dt)
# Show first 50 actual and predicted data
subset_DT <- DT[1:50, ]
# Visualize decision tree
rpart.plot(dt_model)
# Visualize prediction using plot
plot(subset_DT$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_DT$y_pred, col = "purple", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "purple"), lwd = 2)
title(main = "Actual vs Predicted for Decision Tree Regressor Model")
This part compares the Random Forest and Decision Tree models using the R-squared value, Mean Absolute Error and Root Mean Squared Error.
# Calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and R-squared for both Random Forest and Decision Tree
mae_rf <- round(mean(abs(test_data$Rating_Score - yPred_RF)),3)
mae_dt <- round(mean(abs(test_data$Rating_Score - yPred_dt)),3)
rmse_rf <- round(sqrt(mean((test_data$Rating_Score - yPred_RF)^2)),3)
rmse_dt <- round(sqrt(mean((test_data$Rating_Score - yPred_dt)^2)),3)
# R-squared is computed here as the squared correlation between predicted and actual values
rsquared_rf <- round(cor(yPred_RF, test_data$Rating_Score)^2, 3)
rsquared_dt <- round(cor(yPred_dt, test_data$Rating_Score)^2, 3)
# Create summary table and display
summary_table <- data.frame(
Model = c("Random Forest", "Decision Tree"),
R_squared = c(rsquared_rf, rsquared_dt),
Mean_Absolute_Error = c(mae_rf, mae_dt),
Root_Mean_Squared_Error = c(rmse_rf, rmse_dt)
)
print(summary_table)
## Model R_squared Mean_Absolute_Error Root_Mean_Squared_Error
## 1 Random Forest 0.777 0.165 0.284
## 2 Decision Tree 0.652 0.235 0.354
From the metrics table, Random Forest has an R-squared value of 0.777, compared with 0.652 for Decision Tree. Random Forest also has a lower Mean Absolute Error (0.165 vs 0.235) and a lower Root Mean Squared Error (0.284 vs 0.354). This suggests that Random Forest is the better option for predicting Rating_Score.
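The metrics in this comparison can be written as small helper functions. The sketch below also contrasts two common definitions of R-squared: the squared correlation between predictions and actual values, which is what this report computes, and 1 - SSE/SST, which additionally penalizes systematic bias. The function names and the toy vectors are assumptions made for illustration.

```r
# Hypothetical helper functions (names chosen to avoid clashing with Metrics)
calc_mae  <- function(actual, pred) mean(abs(actual - pred))
calc_rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# R-squared as squared Pearson correlation
calc_r2_cor <- function(actual, pred) cor(actual, pred)^2

# R-squared as 1 - SSE/SST (coefficient of determination)
calc_r2_sse <- function(actual, pred) {
  sse <- sum((actual - pred)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Toy example: every prediction is off by exactly +1
actual <- c(1, 2, 3, 4)
pred   <- actual + 1

metrics <- c(
  MAE    = calc_mae(actual, pred),
  RMSE   = calc_rmse(actual, pred),
  R2_cor = calc_r2_cor(actual, pred),
  R2_sse = calc_r2_sse(actual, pred)
)
metrics
```

In this toy case the squared correlation is 1 because the predictions move perfectly with the actual values, while 1 - SSE/SST is only 0.2 because of the constant offset. The two definitions can therefore diverge when a model is biased, which is worth keeping in mind when reading the R_squared column.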
This section prepares the dataset for binary classification, where the response variable indicates whether a rating score is high, using the 70th percentile as the threshold. A decision tree model is then used to classify Airbnb units into high-scoring and low-scoring categories.
# Determine the threshold for high rating (70th percentile)
rating_threshold <- quantile(df$Rating_Score, 0.7)
# Create a new binary column for high rating
df$High_Rating <- ifelse(df$Rating_Score >= rating_threshold, 1, 0)
# Splitting the data into training and testing sets
set.seed(42) # For reproducibility
trainIndex <- createDataPartition(df$High_Rating, p = 0.7, list = FALSE)
train_data <- df[trainIndex, ]
test_data <- df[-trainIndex, ]
# Ensure that the 'High_Rating' column in both training and testing datasets are factors
train_data$High_Rating <- as.factor(train_data$High_Rating)
test_data$High_Rating <- as.factor(test_data$High_Rating)
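Because many listings share the same top score, a percentile threshold combined with `>=` does not necessarily label exactly 30% of rows as high. The sketch below illustrates the effect of ties on a small invented vector; the values in `scores` are assumptions for illustration only.

```r
# Invented toy scores with several ties at the maximum value
scores <- c(3.5, 4.0, 4.2, 4.5, 4.7, 4.8, 5, 5, 5, 5)

# 70th percentile threshold, as in the labelling step
threshold <- quantile(scores, 0.7)

# 1 = high rating, 0 = otherwise
high <- ifelse(scores >= threshold, 1, 0)

# Share of each class
prop.table(table(high))
```

Here the threshold equals the tied top score, so four of the ten toy values (40%) fall in the high class. Checking the class shares in this way after labelling shows how balanced the two classes actually are.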
The decision tree classifier is built with the “rpart” function from the “rpart” package.
# Build the Decision Tree model
dt_model <- rpart(High_Rating ~ . -Rating_Score -id, data = train_data, method = "class")
dt_model
## n= 1305
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1305 416 0 (0.68122605 0.31877395)
## 2) Accuracy_Score< 4.975 858 62 0 (0.92773893 0.07226107)
## 4) Value_Score< 4.935 820 40 0 (0.95121951 0.04878049) *
## 5) Value_Score>=4.935 38 16 1 (0.42105263 0.57894737)
## 10) Cleanliness_Score< 4.415 15 4 0 (0.73333333 0.26666667) *
## 11) Cleanliness_Score>=4.415 23 5 1 (0.21739130 0.78260870) *
## 3) Accuracy_Score>=4.975 447 93 1 (0.20805369 0.79194631)
## 6) Value_Score< 4.98 163 71 1 (0.43558282 0.56441718)
## 12) Checkin_Score< 4.83 27 4 0 (0.85185185 0.14814815) *
## 13) Checkin_Score>=4.83 136 48 1 (0.35294118 0.64705882)
## 26) Cleanliness_Score< 4.98 66 33 0 (0.50000000 0.50000000)
## 52) Communication_Score< 4.97 13 2 0 (0.84615385 0.15384615) *
## 53) Communication_Score>=4.97 53 22 1 (0.41509434 0.58490566) *
## 27) Cleanliness_Score>=4.98 70 15 1 (0.21428571 0.78571429) *
## 7) Value_Score>=4.98 284 22 1 (0.07746479 0.92253521) *
# Predicting the Test set results
predictions <- predict(dt_model, test_data, type = "class")
The confusion matrix below reports the True Positives, True Negatives, False Positives and False Negatives on the test set. Note that caret treats class 0 (not high-rated) as the 'Positive' class here.
confusionMatrix(predictions, test_data$High_Rating)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 347 20
## 1 35 157
##
## Accuracy : 0.9016
## 95% CI : (0.8739, 0.925)
## No Information Rate : 0.6834
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7777
##
## Mcnemar's Test P-Value : 0.05906
##
## Sensitivity : 0.9084
## Specificity : 0.8870
## Pos Pred Value : 0.9455
## Neg Pred Value : 0.8177
## Prevalence : 0.6834
## Detection Rate : 0.6208
## Detection Prevalence : 0.6565
## Balanced Accuracy : 0.8977
##
## 'Positive' Class : 0
##
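The headline statistics can be reproduced by hand from the four counts in the confusion matrix above, which makes it clearer what each one measures. This is a minimal sketch in base R; the counts are copied from the output, class 0 is taken as the positive class as reported by caret, and the object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference); matrix() fills column by column
cm <- matrix(c(347, 35, 20, 157), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Share of all test rows that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# With class 0 as the positive class:
# sensitivity = correctly predicted 0 / all actual 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
# specificity = correctly predicted 1 / all actual 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(Accuracy = accuracy,
        Sensitivity = sensitivity,
        Specificity = specificity), 4)
```

The rounded values agree with the Accuracy, Sensitivity and Specificity lines in the caret output.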
# Visualize the Decision Tree
rpart.plot(dt_model)
The overall review score of Airbnb units in Singapore is strongly correlated with the cleanliness of the unit, as shown in the scatter plot of Rating_Score against Cleanliness_Score. Guests often prioritize hygiene and cleanliness when evaluating their stay. A clean and well-maintained unit contributes to a positive guest experience and, consequently, to higher review scores. Besides cleanliness, other factors that are strongly associated with the overall rating include communication with the host, value for money and the accuracy of the listing details. As shown in the correlation matrix, these three factors also play important roles in how guests evaluate Airbnb units.
For predicting the overall review score with machine learning, the Random Forest regressor is the better option for Rating_Score: it achieved a higher R-squared value and lower Mean Absolute Error and Root Mean Squared Error than the Decision Tree regressor. For classifying Airbnb units into high-score and low-score categories, a Decision Tree model was used and achieved an accuracy of about 90%.