Business Problem
AirBnB, the online booking company, has seen steady growth in both popularity and bookings. As a result, many data analysis projects have examined how different factors affect the bookings of a given host or listing. For this project, we analyze AirBnB data to see which independent variables can predict the number of reviews a given listing gets, and from that infer which factors affect a listing being booked. We will be predicting Number.of.Reviews using the following variables:
- Host.Listings.Count
- Zipcode
- City
- Accommodates
- Price
- Bedrooms
- Review.Scores.Value
- Cancellation.Policy
- House.Rules
- Des_length
Data Background
We took our data from OpenDataSoft (https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?flg=en-us&disjunctive.host_verifications&disjunctive.amenities&disjunctive.features), focusing only on listings in the United States. This dataset has 134,545 rows, but we will only be using 100 of them for our analysis.
STEP 0 - Import libraries
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
STEP 1 - Read in the data
When reading in the data for our project, we first checked how many rows our original csv contained. We then read the first 100,000 rows into an initial dataframe and, from it, created a new dataframe containing a random selection of 100 rows. Ideally, we would have wanted to import more rows, but because this data set contains so many variables, working with more rows made the computation extremely slow. Thus, in order to make our models easier to manage, we limited our selection. Additionally, it is important to note that because we use a random sample of the rows, each run of our code draws a different sample and produces slightly different outputs. Thus, our explanations are given in general terms rather than specifics, since the outputs can change each time the code is run.
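The read-and-sample step described above could be sketched as follows. This is only an illustration: the helper `sample_rows()`, the file path in the commented lines, and the stand-in data frame are assumptions, not the code used for this report.

```r
# Hypothetical helper: draw n random rows from a data frame.
sample_rows <- function(df, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional: fix the seed for a repeatable sample
  n <- min(n, nrow(df))
  df[sample(nrow(df), n), , drop = FALSE]
}

# With the real file this might look like (the path is an assumption):
# df_full <- read.csv("airbnb-listings.csv", nrows = 100000)
# df_abnb <- sample_rows(df_full, 100)

# Stand-in data so the sketch runs on its own
df_full <- data.frame(ID = 1:1000, Price = rep(c(50, 100, 150, 200), 250))
df_abnb <- sample_rows(df_full, 100, seed = 123)
nrow(df_abnb)
```

Passing a seed makes the same 100 rows come back on every run, which would keep the outputs stable between runs; leaving it out reproduces the run-to-run variation described above.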
STEP 2 - Observe the dataframe
Data Structure
After importing our data, we first looked at how the data was structured using str(). From this, we saw that more of the variables had a character data type than a numerical one. Additionally, there were several variables whose values we didn’t understand, such as the multiple different Review.Scores variables. Likewise, some variables carried overlapping information, such as the location fields City, Market, State, Smart.Location, Longitude, Latitude, Country, etc., which we recognized would need to be cut so that we would not have multiple variables representing the same thing.
## 'data.frame': 100 obs. of 89 variables:
## $ ID : int 1089293 12652887 4005022 18547885 9393007 1492286 18007965 5046189 16944085 14801382 ...
## $ Listing.Url : chr "https://www.airbnb.com/rooms/1089293" "https://www.airbnb.com/rooms/12652887" "https://www.airbnb.com/rooms/4005022" "https://www.airbnb.com/rooms/18547885" ...
## $ Scrape.ID : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
## $ Last.Scraped : chr "2017-05-03" "2017-03-07" "2017-06-02" "2017-06-02" ...
## $ Name : chr "Entire home/apt in Los Angeles" "South Austin Digs" "Cozy Gentilly Home" "Historic Home: Close to French Quarter & Esplanade Ave" ...
## $ Summary : chr "The place is a monthly sublet with option to rent furnished or unfurnished. There is a deposit requirement and"| __truncated__ "My place is close to Downtown and South Congress in 20 minutes, WholeFoods, tons a great dining, and the Greenb"| __truncated__ "This home is located in the Milneburg subdivision of Gentilly. It has 2 bedrooms, 1 full bath, full kitchen noo"| __truncated__ "Come experience local living in my traditional New Orleans shotgun home conveniently located in the Historic 7t"| __truncated__ ...
## $ Space : chr "The place is a monthly sublet with option to rent furnished or unfurnished. There is a deposit requirement and"| __truncated__ "" "" "The house is shotgun style (meaning you must walk through one room to get into the next) with tons of original "| __truncated__ ...
## $ Description : chr "The place is a monthly sublet with option to rent furnished or unfurnished. There is a deposit requirement and"| __truncated__ "My place is close to Downtown and South Congress in 20 minutes, WholeFoods, tons a great dining, and the Greenb"| __truncated__ "This home is located in the Milneburg subdivision of Gentilly. It has 2 bedrooms, 1 full bath, full kitchen noo"| __truncated__ "Come experience local living in my traditional New Orleans shotgun home conveniently located in the Historic 7t"| __truncated__ ...
## $ Experiences.Offered : chr "none" "none" "none" "none" ...
## $ Neighborhood.Overview : chr "" "" "" "My home is located in a great historic neighborhood that allows access to all of the best parts of the city. Gu"| __truncated__ ...
## $ Notes : chr "" "" "" "Please be mindful of the tenant next door. No loud music. No smoking. No pets. Check-in is from 3:00PM -till an"| __truncated__ ...
## $ Transit : chr "" "" "Public transportation is steps away." "Uber, cab, bike, or the bus stop on our corner that goes to the French Quarter, CBD, and Uptown." ...
## $ Access : chr "" "" "Guest has access to the entire house." "My house is a duplex with a tenant next door. You will have access to one side of the house, which includes off"| __truncated__ ...
## $ Interaction : chr "" "" "Limited or no interaction with guest unless its to address questions or concerns. Host may or may not be presen"| __truncated__ "You will have total privacy here. That said, I am happy to answer any questions and offer recommendations to ma"| __truncated__ ...
## $ House.Rules : chr "no smoking inside, please use balcony. no pets. Party is allowed but be responsible, 10pm noise curfew. please "| __truncated__ "- There is a pool right off the master bedroom that does not have a fence around it. Very important to be awar"| __truncated__ "Honesty about the number of guest staying is required. No Pets or house parties. Please!!! No Smoking Guest a"| __truncated__ "We want guests to relax completely and enjoy themselves, but also to be respectful of our home, which we have p"| __truncated__ ...
## $ Thumbnail.Url : chr "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=small" ...
## $ Medium.Url : chr "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=medium" ...
## $ Picture.Url : chr "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/4daa7b6e743e3d58eedd6618be7b656c" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/472cde58733384e779e5f1d42f6f9a39" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/fea78caf3227a247ca7344dcb0ae0a58" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/4c8a34e72bac321ea5e3f3cad928ba62" ...
## $ XL.Picture.Url : chr "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=x_large" ...
## $ Host.ID : int 1833132 68263829 19107533 128753533 48066372 7977178 20870710 23732730 1739801 16507910 ...
## $ Host.URL : chr "https://www.airbnb.com/users/show/1833132" "https://www.airbnb.com/users/show/68263829" "https://www.airbnb.com/users/show/19107533" "https://www.airbnb.com/users/show/128753533" ...
## $ Host.Name : chr "Amelie" "Katy" "Rhett & Sheila" "Alex" ...
## $ Host.Since : chr "2012-02-29" "2016-04-21" "2014-07-29" "2017-05-03" ...
## $ Host.Location : chr "Los Angeles, California, United States" "" "New Orleans, Louisiana, United States" "New Orleans, Louisiana, United States" ...
## $ Host.About : chr "Loves the outdoors, hiking, beach, TENNIS!, rock climbing, and surf soon. I'm a food and wine fanatic. I love "| __truncated__ "" "We are native New Orleanian. There's no place we would rather live. Our city is an exciting place to visit and "| __truncated__ "I am a local to New Orleans, LA. I graduated from Loyola University New Orleans. Also, I completed 3 semesters "| __truncated__ ...
## $ Host.Response.Time : chr "" "" "within an hour" "within an hour" ...
## $ Host.Response.Rate : int NA NA 100 100 100 100 100 100 99 100 ...
## $ Host.Acceptance.Rate : chr "" "" "" "" ...
## $ Host.Thumbnail.Url : chr "https://a0.muscache.com/im/users/1833132/profile_pic/1366656419/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/ccf1ec7f-a96f-4a05-b3a5-ca9e245fe226.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/19107533/profile_pic/1409505580/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/4078ae48-f302-4edd-82f8-c1c54764a2a2.jpg?aki_policy=profile_small" ...
## $ Host.Picture.Url : chr "https://a0.muscache.com/im/users/1833132/profile_pic/1366656419/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/ccf1ec7f-a96f-4a05-b3a5-ca9e245fe226.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/19107533/profile_pic/1409505580/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/4078ae48-f302-4edd-82f8-c1c54764a2a2.jpg?aki_policy=profile_x_medium" ...
## $ Host.Neighbourhood : chr "West Los Angeles" "" "Milneburg" "" ...
## $ Host.Listings.Count : int 1 1 1 2 1 7 1 2 24 1 ...
## $ Host.Total.Listings.Count : int 1 1 1 2 1 7 1 2 24 1 ...
## $ Host.Verifications : chr "email,phone,facebook,reviews" "phone" "email,phone,reviews,kba" "email,phone,work_email" ...
## $ Street : chr "West Los Angeles, Los Angeles, CA 90025, United States" "Austin, TX, United States" "Milneburg, New Orleans, LA 70122, United States" "New Orleans, LA 70119, United States" ...
## $ Neighbourhood : chr "West Los Angeles" "" "Milneburg" "" ...
## $ Neighbourhood.Cleansed : chr "Sawtelle" "78745" "Milneburg" "Seventh Ward" ...
## $ Neighbourhood.Group.Cleansed : chr "" "" "" "" ...
## $ City : chr "Los Angeles" "Austin" "New Orleans" "New Orleans" ...
## $ State : chr "CA" "TX" "LA" "LA" ...
## $ Zipcode : chr "90025" "" "70122" "70119" ...
## $ Market : chr "Los Angeles" "Austin" "New Orleans" "New Orleans" ...
## $ Smart.Location : chr "Los Angeles, CA" "Austin, TX" "New Orleans, LA" "New Orleans, LA" ...
## $ Country.Code : chr "US" "US" "US" "US" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ Latitude : num 34 30.2 30 30 34.1 ...
## $ Longitude : num -118.4 -97.8 -90.1 -90.1 -118.4 ...
## $ Property.Type : chr "Apartment" "House" "House" "House" ...
## $ Room.Type : chr "Entire home/apt" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
## $ Accommodates : int 2 6 4 6 2 1 4 16 10 1 ...
## $ Bathrooms : num 1 2 1 1 1 1 1 1 2 1 ...
## $ Bedrooms : int 1 3 2 2 1 1 0 2 2 1 ...
## $ Beds : int 1 3 2 2 1 1 2 4 6 1 ...
## $ Bed.Type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ Amenities : chr "TV,Internet,Wireless Internet,Air conditioning,Wheelchair accessible,Kitchen,Free parking on premises,Pets allo"| __truncated__ "TV,Cable TV,Internet,Wireless Internet,Air conditioning,Pool,Kitchen,Free parking on premises,Breakfast,Indoor "| __truncated__ "TV,Cable TV,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Dryer,Smoke detector,Ca"| __truncated__ "TV,Internet,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Smoke detector,Carbon m"| __truncated__ ...
## $ Square.Feet : int NA NA NA NA NA NA NA NA NA NA ...
## $ Price : int 200 500 150 105 71 70 110 150 499 65 ...
## $ Weekly.Price : int 600 NA NA NA NA 525 NA NA NA NA ...
## $ Monthly.Price : int 1500 NA NA NA NA 1675 NA NA NA NA ...
## $ Security.Deposit : int NA NA 100 250 NA 150 100 NA NA NA ...
## $ Cleaning.Fee : int 50 100 75 110 20 60 50 75 NA 30 ...
## $ Guests.Included : int 1 1 4 6 1 1 1 2 1 1 ...
## $ Extra.People : int 0 0 25 50 20 20 0 25 0 0 ...
## $ Minimum.Nights : int 15 1 2 2 1 1 2 2 2 1 ...
## $ Maximum.Nights : int 31 1125 1125 1125 100 1125 13 28 1125 1125 ...
## $ Calendar.Updated : chr "yesterday" "yesterday" "today" "today" ...
## $ Has.Availability : chr "" "" "" "" ...
## $ Availability.30 : int 0 0 29 16 6 7 10 22 30 12 ...
## $ Availability.60 : int 0 0 59 33 17 7 37 50 60 30 ...
## $ Availability.90 : int 0 1 89 58 20 7 67 75 90 52 ...
## $ Availability.365 : int 0 1 364 289 89 273 283 346 365 314 ...
## $ Calendar.last.Scraped : chr "2017-05-02" "2017-03-06" "2017-06-02" "2017-06-02" ...
## $ Number.of.Reviews : int 0 0 12 4 33 33 4 31 0 36 ...
## $ First.Review : chr "" "" "2016-04-18" "2017-04-24" ...
## $ Last.Review : chr "" "" "2017-06-01" "" ...
## $ Review.Scores.Rating : int NA NA 94 100 99 94 100 95 NA 100 ...
## $ Review.Scores.Accuracy : int NA NA 10 10 10 9 10 9 NA 10 ...
## $ Review.Scores.Cleanliness : int NA NA 10 10 10 9 10 9 NA 10 ...
## $ Review.Scores.Checkin : int NA NA 9 10 10 10 10 10 NA 10 ...
## $ Review.Scores.Communication : int NA NA 10 10 10 10 10 10 NA 10 ...
## $ Review.Scores.Location : int NA NA 9 10 10 10 10 9 NA 10 ...
## $ Review.Scores.Value : int NA NA 9 10 10 9 10 9 NA 10 ...
## $ License : chr "" "" "City registration pending" "17STR-10146" ...
## $ Jurisdiction.Names : chr "City of Los Angeles, CA" "" "Louisiana State, New Orleans, LA" "Louisiana State, New Orleans, LA" ...
## $ Cancellation.Policy : chr "strict" "flexible" "strict" "strict" ...
## $ Calculated.host.listings.count: int 1 1 1 2 1 4 1 2 18 1 ...
## $ Reviews.per.Month : num NA NA 0.88 3 4.5 0.86 4 1.15 NA 5.51 ...
## $ Geolocation : chr "34.0432994388578, -118.44666625217695" "30.1946181529524, -97.81865977419787" "30.01895348868219, -90.05268642441278" "29.975441566796114, -90.07089887416147" ...
## $ Features : chr "Host Has Profile Pic,Is Location Exact" "Host Has Profile Pic,Instant Bookable" "Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License" "Host Has Profile Pic,Requires License,Instant Bookable" ...
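To back up the observation that character columns outnumber numeric ones, the column classes can be tallied directly. The sketch below uses a small stand-in data frame (`df_demo`, with illustrative values) rather than the sampled listings.

```r
# Stand-in data frame with mixed column types (values are illustrative)
df_demo <- data.frame(
  City = c("Austin", "New Orleans"),
  Zipcode = c("78745", "70122"),
  Price = c(500L, 150L),
  Latitude = c(30.2, 30.0),
  stringsAsFactors = FALSE
)

str(df_demo)                                  # compact per-column overview
type_counts <- table(sapply(df_demo, class))  # number of columns per class
type_counts
```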
Missing Values
After observing our variables, we then looked at which variables contained missing data. From the counts below, we saw that some variables, like Square.Feet, Weekly.Price, and Monthly.Price, had far more missing values than actual values, each missing more than 75% of the 100 rows we sampled. We also decided to take out Host.Response.Rate, for example: even though we thought it could have an impact on our target variable, it had a substantial number of missing values (14 in the sample shown here), so we removed it.
## Square.Feet Weekly.Price
## 99 83
## Monthly.Price Security.Deposit
## 76 54
## Cleaning.Fee Review.Scores.Value
## 30 21
## Review.Scores.Rating Review.Scores.Accuracy
## 20 20
## Review.Scores.Cleanliness Review.Scores.Checkin
## 20 20
## Review.Scores.Communication Review.Scores.Location
## 20 20
## Reviews.per.Month Host.Response.Rate
## 18 14
## Price Bedrooms
## 2 1
## Beds ID
## 1 0
## Listing.Url Scrape.ID
## 0 0
## Last.Scraped Name
## 0 0
## Summary Space
## 0 0
## Description Experiences.Offered
## 0 0
## Neighborhood.Overview Notes
## 0 0
## Transit Access
## 0 0
## Interaction House.Rules
## 0 0
## Thumbnail.Url Medium.Url
## 0 0
## Picture.Url XL.Picture.Url
## 0 0
## Host.ID Host.URL
## 0 0
## Host.Name Host.Since
## 0 0
## Host.Location Host.About
## 0 0
## Host.Response.Time Host.Acceptance.Rate
## 0 0
## Host.Thumbnail.Url Host.Picture.Url
## 0 0
## Host.Neighbourhood Host.Listings.Count
## 0 0
## Host.Total.Listings.Count Host.Verifications
## 0 0
## Street Neighbourhood
## 0 0
## Neighbourhood.Cleansed Neighbourhood.Group.Cleansed
## 0 0
## City State
## 0 0
## Zipcode Market
## 0 0
## Smart.Location Country.Code
## 0 0
## Country Latitude
## 0 0
## Longitude Property.Type
## 0 0
## Room.Type Accommodates
## 0 0
## Bathrooms Bed.Type
## 0 0
## Amenities Guests.Included
## 0 0
## Extra.People Minimum.Nights
## 0 0
## Maximum.Nights Calendar.Updated
## 0 0
## Has.Availability Availability.30
## 0 0
## Availability.60 Availability.90
## 0 0
## Availability.365 Calendar.last.Scraped
## 0 0
## Number.of.Reviews First.Review
## 0 0
## Last.Review License
## 0 0
## Jurisdiction.Names Cancellation.Policy
## 0 0
## Calculated.host.listings.count Geolocation
## 0 0
## Features
## 0
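The counts above are consistent with a per-column NA tally sorted from largest to smallest. The exact call used is not shown, so the following is a sketch of one way to get such a table, on stand-in data. It also counts empty strings separately: in the output above Zipcode reports 0 missing values even though the str() output shows a blank Zipcode, because "" is not NA.

```r
# Stand-in data (illustrative values only)
df_demo <- data.frame(
  Square.Feet = c(NA, NA, 1200, NA),
  Price       = c(200, NA, 150, 105),
  Zipcode     = c("90025", "", "70122", "70119"),
  stringsAsFactors = FALSE
)

# NA count per column, largest first
na_counts <- sort(colSums(is.na(df_demo)), decreasing = TRUE)
na_counts

# Empty strings are not NA, so count them separately
blank_counts <- sapply(df_demo, function(col) sum(!is.na(col) & col == ""))
blank_counts
```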
Summary Statistics
Lastly, we also generated summary statistics of our data. These summary statistics provided us with quantitative information about each of our numerical variables, such as Host.Listings.Count, Accommodates, Price, etc. One variable in particular that we found interesting was Host.Listings.Count: in the sample shown below, it had a mean of around 13 listings, a minimum of 1, a 3rd quartile of 3, and a maximum of 472. This shows that although the majority of hosts had between 1 and 3 listings, there was still a group of outlier hosts with far more, in some cases hundreds. Another variable whose summary statistics are important to note is the target variable, Number.of.Reviews, which in this sample has a minimum of 0, a mean of around 25, and a maximum of 239. This gave us an idea of how large the range of our target variable was, which could affect some of our models’ statistical results.
## ID Listing.Url Scrape.ID Last.Scraped
## Min. : 9531 Length:100 Min. :2.016e+13 Length:100
## 1st Qu.: 4780186 Class :character 1st Qu.:2.017e+13 Class :character
## Median :10323310 Mode :character Median :2.017e+13 Mode :character
## Mean : 9774289 Mean :2.017e+13
## 3rd Qu.:14916988 3rd Qu.:2.017e+13
## Max. :18547885 Max. :2.017e+13
##
## Name Summary Space Description
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Experiences.Offered Neighborhood.Overview Notes
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Transit Access Interaction House.Rules
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Thumbnail.Url Medium.Url Picture.Url XL.Picture.Url
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Host.ID Host.URL Host.Name Host.Since
## Min. : 31481 Length:100 Length:100 Length:100
## 1st Qu.: 6985930 Class :character Class :character Class :character
## Median : 22020728 Mode :character Mode :character Mode :character
## Mean : 34613599
## 3rd Qu.: 49338518
## Max. :128753533
##
## Host.Location Host.About Host.Response.Time Host.Response.Rate
## Length:100 Length:100 Length:100 Min. : 41.00
## Class :character Class :character Class :character 1st Qu.:100.00
## Mode :character Mode :character Mode :character Median :100.00
## Mean : 97.31
## 3rd Qu.:100.00
## Max. :100.00
## NA's :14
## Host.Acceptance.Rate Host.Thumbnail.Url Host.Picture.Url Host.Neighbourhood
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Host.Listings.Count Host.Total.Listings.Count Host.Verifications
## Min. : 1.00 Min. : 1.00 Length:100
## 1st Qu.: 1.00 1st Qu.: 1.00 Class :character
## Median : 2.00 Median : 2.00 Mode :character
## Mean : 12.62 Mean : 12.62
## 3rd Qu.: 3.00 3rd Qu.: 3.00
## Max. :472.00 Max. :472.00
##
## Street Neighbourhood Neighbourhood.Cleansed
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Neighbourhood.Group.Cleansed City State
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Zipcode Market Smart.Location Country.Code
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Country Latitude Longitude Property.Type
## Length:100 Min. :29.93 Min. :-122.70 Length:100
## Class :character 1st Qu.:33.77 1st Qu.:-118.33 Class :character
## Mode :character Median :36.12 Median : -97.72 Mode :character
## Mean :36.62 Mean : -97.37
## 3rd Qu.:40.70 3rd Qu.: -74.00
## Max. :47.57 Max. : -71.09
##
## Room.Type Accommodates Bathrooms Bedrooms
## Length:100 Min. : 1.00 Min. :1.00 Min. :0.000
## Class :character 1st Qu.: 2.00 1st Qu.:1.00 1st Qu.:1.000
## Mode :character Median : 3.00 Median :1.00 Median :1.000
## Mean : 4.12 Mean :1.38 Mean :1.566
## 3rd Qu.: 5.00 3rd Qu.:1.50 3rd Qu.:2.000
## Max. :16.00 Max. :5.00 Max. :8.000
## NA's :1
## Beds Bed.Type Amenities Square.Feet
## Min. :1.000 Length:100 Length:100 Min. :1200
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:1200
## Median :2.000 Mode :character Mode :character Median :1200
## Mean :2.091 Mean :1200
## 3rd Qu.:3.000 3rd Qu.:1200
## Max. :8.000 Max. :1200
## NA's :1 NA's :99
## Price Weekly.Price Monthly.Price Security.Deposit
## Min. : 30.0 Min. :150.0 Min. : 500 Min. :100.0
## 1st Qu.: 71.0 1st Qu.:290.0 1st Qu.:1288 1st Qu.:100.0
## Median :115.0 Median :500.0 Median :2042 Median :250.0
## Mean :160.4 Mean :492.9 Mean :2393 Mean :250.5
## 3rd Qu.:198.0 3rd Qu.:620.0 3rd Qu.:3025 3rd Qu.:375.0
## Max. :611.0 Max. :879.0 Max. :6500 Max. :500.0
## NA's :2 NA's :83 NA's :76 NA's :54
## Cleaning.Fee Guests.Included Extra.People Minimum.Nights
## Min. : 5.00 Min. : 0.0 Min. : 0.00 Min. : 1.00
## 1st Qu.: 39.25 1st Qu.: 1.0 1st Qu.: 0.00 1st Qu.: 1.00
## Median : 57.50 Median : 1.0 Median : 5.50 Median : 2.00
## Mean : 74.84 Mean : 1.9 Mean : 15.81 Mean : 2.42
## 3rd Qu.:100.00 3rd Qu.: 2.0 3rd Qu.: 25.00 3rd Qu.: 2.00
## Max. :300.00 Max. :12.0 Max. :100.00 Max. :30.00
## NA's :30
## Maximum.Nights Calendar.Updated Has.Availability Availability.30
## Min. : 3.0 Length:100 Length:100 Min. : 0.00
## 1st Qu.: 60.0 Class :character Class :character 1st Qu.: 0.75
## Median :1125.0 Mode :character Mode :character Median :10.00
## Mean : 753.1 Mean :12.40
## 3rd Qu.:1125.0 3rd Qu.:22.00
## Max. :1125.0 Max. :30.00
##
## Availability.60 Availability.90 Availability.365 Calendar.last.Scraped
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Length:100
## 1st Qu.: 4.0 1st Qu.:10.25 1st Qu.: 61.0 Class :character
## Median :26.0 Median :45.50 Median :188.0 Mode :character
## Mean :27.2 Mean :45.20 Mean :192.6
## 3rd Qu.:50.0 3rd Qu.:75.75 3rd Qu.:327.5
## Max. :60.0 Max. :90.00 Max. :365.0
##
## Number.of.Reviews First.Review Last.Review Review.Scores.Rating
## Min. : 0.00 Length:100 Length:100 Min. : 67.0
## 1st Qu.: 2.00 Class :character Class :character 1st Qu.: 93.0
## Median : 10.00 Mode :character Mode :character Median : 96.0
## Mean : 24.78 Mean : 94.8
## 3rd Qu.: 28.75 3rd Qu.:100.0
## Max. :239.00 Max. :100.0
## NA's :20
## Review.Scores.Accuracy Review.Scores.Cleanliness Review.Scores.Checkin
## Min. : 7.000 Min. : 7.000 Min. : 8.000
## 1st Qu.: 9.000 1st Qu.: 9.000 1st Qu.:10.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.625 Mean : 9.488 Mean : 9.775
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :20 NA's :20 NA's :20
## Review.Scores.Communication Review.Scores.Location Review.Scores.Value
## Min. : 8.000 Min. : 7.000 Min. : 8.000
## 1st Qu.:10.000 1st Qu.: 9.000 1st Qu.: 9.000
## Median :10.000 Median : 9.000 Median :10.000
## Mean : 9.863 Mean : 9.375 Mean : 9.443
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :20 NA's :20 NA's :21
## License Jurisdiction.Names Cancellation.Policy
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Calculated.host.listings.count Reviews.per.Month Geolocation
## Min. : 1.00 Min. :0.0400 Length:100
## 1st Qu.: 1.00 1st Qu.:0.5625 Class :character
## Median : 1.00 Median :1.4000 Mode :character
## Mean : 4.82 Mean :2.0096
## 3rd Qu.: 3.00 3rd Qu.:2.9850
## Max. :61.00 Max. :9.1600
## NA's :18
## Features
## Length:100
## Class :character
## Mode :character
##
##
##
##
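The skew described for Host.Listings.Count can be checked numerically. The sketch below uses made-up listing counts and the common Q3 + 1.5 * IQR rule of thumb for flagging outliers; that rule is an assumption here, not a method taken from this report.

```r
# Illustrative listing counts: mostly small hosts plus one very large one
listings <- c(1, 1, 1, 2, 2, 3, 3, 7, 24, 472)

summary(listings)

# A mean far above the median signals a right-skewed variable
skew_ratio <- mean(listings) / median(listings)

# Flag values above Q3 + 1.5 * IQR (rule-of-thumb upper fence)
q <- quantile(listings, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- listings[listings > upper_fence]
outliers
```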
STEP 3 - Clean the Data
Since our dataframe has so many variables, as seen in the previous step, we do not want to build our model on all of them, since it would become too complicated and difficult to manage. So, first, we will create a new dataframe containing only the variables that we believe are significant or could have some influence over the target variable in our model. This also includes creating new columns that we believe could be helpful and editing columns that need to be changed. Once we select our variables of interest, we will then clean the data by omitting missing values.
Selecting Columns of Interest
We thought that it would be interesting to use the length of the “Description” column as one of our input variables, so we first created that column. Then, we saw that “Zipcode” was imported as a character rather than an integer, so we converted the values in that column. Finally, we selected our variables of interest for our new dataframe.
# Create column for length of description
df_abnb <- df_abnb %>%
  mutate(Des_length = nchar(as.character(Description)))
# Change Zipcode from chr to int
df_abnb$Zipcode <- as.integer(df_abnb$Zipcode)
# Select only columns of interest
df_abnb_new <- df_abnb[,c("Host.Listings.Count","Zipcode","City", "Accommodates", "Price", "Bedrooms", "Review.Scores.Value", "Cancellation.Policy", "Amenities", "House.Rules", "Des_length", "Number.of.Reviews")]
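The step that omits missing values is described in the text but not shown. One way to do it, assuming base R's na.omit() and a stand-in data frame with illustrative values, is sketched below. It also shows a side effect of the Zipcode conversion: blank or non-numeric ZIP strings become NA, so those rows are dropped too.

```r
# Stand-in for the selected columns (illustrative values)
df_demo <- data.frame(
  Zipcode = c("90025", "", "70122", "10003-1234"),
  Price = c(200L, 500L, NA, 105L),
  Number.of.Reviews = c(0L, 0L, 12L, 4L),
  stringsAsFactors = FALSE
)

# as.integer() turns blank or non-numeric ZIP strings into NA
# (non-numeric strings also raise a warning, silenced here)
df_demo$Zipcode <- suppressWarnings(as.integer(df_demo$Zipcode))

# Drop every row that has at least one NA in any column
df_clean <- na.omit(df_demo)
nrow(df_clean)
```

Converting to integer also drops leading zeros (for example "02134" becomes 2134), so keeping Zipcode as a character or factor column may be worth considering if ZIP codes are meant as labels rather than quantities.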
STEP 4 - Text Mining
Since many of the variables in our dataset are free-text character variables, we first have to perform text mining on any of them that we want to work with. Of the variables of interest that we picked for our model, the only ones that require text mining are “Amenities” and “Cancellation.Policy”.
Combining Columns
Since the Amenities and Cancellation.Policy columns were similar and would both need to be text mined, we decided to combine them into one column which we would then perform the text mining on.
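The combining step is not shown. The column name `combined` comes from the text-mining code in this step; the use of paste() and the comma separator below are assumptions, shown on stand-in data.

```r
# Stand-in data (illustrative values)
df_demo <- data.frame(
  Amenities = c("TV,Internet,Kitchen", "TV,Cable TV,Pool"),
  Cancellation.Policy = c("strict", "flexible"),
  stringsAsFactors = FALSE
)

# Join the two text columns with a comma so that a later step which
# replaces commas with spaces keeps the policy as its own term
df_demo$combined <- paste(df_demo$Amenities, df_demo$Cancellation.Policy, sep = ",")
df_demo$combined
```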
Text Mining for Amenities
# Preprocess text data
corpus <- VCorpus(VectorSource(df_abnb_new$combined))
corpus <- tm_map(corpus, content_transformer(tolower))
# Text is separated by commas, -, /
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "-")   # Replace hyphens with spaces
corpus <- tm_map(corpus, toSpace, "\\.") # Replace periods with spaces
corpus <- tm_map(corpus, toSpace, ",")   # Replace commas with spaces
corpus <- tm_map(corpus, toSpace, "/")   # Replace slashes with spaces
corpus <- tm_map(corpus, toSpace, "_")   # Replace underscores with spaces
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
words <- colnames(matrix)
# Convert to a dataframe for modeling
text_data <- as.data.frame(matrix, stringsAsFactors = FALSE)
# Combine with original data (make sure row order is the same)
final_data_1 <- cbind(text_data,
                      Host.Listings.Count = df_abnb_new$Host.Listings.Count,
                      Zipcode = df_abnb_new$Zipcode,
                      Accommodates = df_abnb_new$Accommodates,
                      Price = df_abnb_new$Price,
                      Bedrooms = df_abnb_new$Bedrooms,
                      Review.Scores.Value = df_abnb_new$Review.Scores.Value,
                      Des_length = df_abnb_new$Des_length,
                      Number.of.Reviews = df_abnb_new$Number.of.Reviews)
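To make the shape of the resulting data concrete, here is a base-R sketch of what a document-term matrix contains: one row per listing, one column per term, and counts in the cells. It is a simplified stand-in, not the tm implementation; tm applies its own tokenizer and defaults (such as a minimum term length), so the real matrix may differ.

```r
# Two already-cleaned "documents" (illustrative)
docs <- c("tv internet kitchen strict", "tv cable tv pool flexible")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document, then put documents in rows
dtm_demo <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm_demo) <- paste0("doc", seq_along(docs))
dtm_demo
```

Each term column then sits alongside the numeric predictors, which is why the random forest output in the next step reports many more predictors than the handful of variables selected earlier.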
STEP 5 - Creating Models
Given our data and chosen business problem, we decided that the three models best suited to our data would be Random Forest, Multiple Linear Regression, and KNN.
Random Forest Model
To create our random forest model, we first used the caret package to train the model on the training data, then made predictions on the validation set to see how the model performed. As seen in the output below, caret's train() function selected the optimal model by choosing the smallest cross-validated RMSE, which was around 30, at an mtry of 2. When evaluating the model on our validation data, we found a mean absolute error of around 33 in the run shown here, which is relatively high but could make sense considering the large range of our target variable, as mentioned previously. The validation R-squared was slightly negative, meaning the model did no better on these rows than simply predicting the mean number of reviews.
# Split data into training and test sets
library(caret)
library(randomForest)
# Set seed for reproducibility
set.seed(123)
# Create a data partition for train/test split
index <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.6, list = FALSE)
training_data <- final_data_1[index, ]
valid_data <- final_data_1[-index, ]
# Define the train control using cross-validation
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Create a Random Forest model using caret's train function
rf_model_caret <- train(
  Number.of.Reviews ~ .,
  data = training_data,
  method = "rf",
  trControl = train_control
)
# View model details
rf_model_caret
## Random Forest
##
## 48 samples
## 112 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 38, 37, 38, 39, 40
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 31.54544 0.1014463 24.28920
## 57 34.22157 0.0389973 26.25609
## 112 34.46557 0.0508975 26.06395
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)
# Evaluate the model
MAE_RF <- mean(abs(predictions - valid_data$Number.of.Reviews))
MAE_RF
## [1] 32.53239
# Make predictions on the validation set
predictions_rf <- predict(rf_model_caret, newdata = valid_data)
# Calculate Mean Absolute Error (MAE)
MAE_rf <- mean(abs(predictions_rf - valid_data$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE) for Random Forest:", MAE_rf))
## [1] "Mean Absolute Error (MAE) for Random Forest: 32.5323876740319"
# Calculate Root Mean Squared Error (RMSE)
RMSE_rf <- sqrt(mean((predictions_rf - valid_data$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE) for Random Forest:", RMSE_rf))
## [1] "Root Mean Squared Error (RMSE) for Random Forest: 58.5325745655408"
# Calculate R-squared (coefficient of determination)
SSE_rf <- sum((predictions_rf - valid_data$Number.of.Reviews)^2) # Sum of Squared Errors
SST_rf <- sum((valid_data$Number.of.Reviews - mean(valid_data$Number.of.Reviews))^2) # Total Sum of Squares
rsquared_rf <- 1 - SSE_rf/SST_rf
print(paste("R-squared (Coefficient of Determination) for Random Forest:", rsquared_rf))
## [1] "R-squared (Coefficient of Determination) for Random Forest: -0.0637636126401615"
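The three metrics above are computed by hand, and the same arithmetic is repeated for each model. As a minimal sketch, the calculation could be wrapped in one helper. The function name regression_metrics and the toy vectors are hypothetical and not part of the original analysis. The toy data also shows how R-squared on held-out data can be zero or negative, as in the output above: a model that always predicts the mean scores exactly 0, and a model that does worse than the mean scores below 0.

```r
# Hypothetical helper (not part of the original analysis): computes the three
# metrics used in this report from vectors of actual and predicted values.
regression_metrics <- function(actual, predicted) {
  errors <- predicted - actual
  sse <- sum(errors^2)                     # sum of squared errors
  sst <- sum((actual - mean(actual))^2)    # total sum of squares
  c(MAE = mean(abs(errors)),
    RMSE = sqrt(mean(errors^2)),
    Rsquared = 1 - sse / sst)
}

# Toy data, made up for illustration only
actual <- c(0, 5, 10, 40, 120)
mean_only <- rep(mean(actual), length(actual))   # always predict the mean
worse <- c(60, 60, 60, 10, 10)                   # high guesses for low values

regression_metrics(actual, mean_only)   # Rsquared is 0
regression_metrics(actual, worse)       # Rsquared is negative
```

With a helper like this, each model would need only its vector of predictions and the matching vector of actual values.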
MLR Model
To create our MLR model, we used the lm() function to fit a model on a subset of the numerical variables: Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length. From this model, we then created several diagnostic plots of the residuals.
cust_value_model = lm(formula = Number.of.Reviews ~ Accommodates + Price +
Bedrooms + Review.Scores.Value + Des_length,
data = df_abnb_new)
# Get the model residuals
model_residuals = cust_value_model$residuals
predictions <- predict(cust_value_model)
# Calculate Mean Absolute Error (MAE)
MAE_LM <- mean(abs(predictions - df_abnb_new$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE):", MAE_LM))
## [1] "Mean Absolute Error (MAE): 28.6142727985584"
# Calculate Root Mean Squared Error (RMSE)
RMSE_LM <- sqrt(mean((predictions - df_abnb_new$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE):", RMSE_LM))
## [1] "Root Mean Squared Error (RMSE): 42.2220474353286"
# Calculate R-squared (coefficient of determination)
rsquared_LM <- summary(cust_value_model)$r.squared
print(paste("R-squared (Coefficient of Determination):", rsquared_LM))
## [1] "R-squared (Coefficient of Determination): 0.0804997871545754"
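The MLR metrics above come from predict() called without newdata, so they describe the same rows the model was fitted on. The sketch below shows, on simulated data, how a held-out error for lm() could be computed alongside the in-sample error. The data frame toy, the 60/40 split, and every value in it are assumptions made for illustration, not the report's data.

```r
set.seed(123)

# Simulated stand-in for the listing data (all values are made up)
n <- 100
toy <- data.frame(
  Accommodates = sample(1:8, n, replace = TRUE),
  Price = round(runif(n, 40, 400))
)
toy$Number.of.Reviews <- pmax(0, round(5 + 3 * toy$Accommodates + rnorm(n, sd = 20)))

# 60/40 split, mirroring the proportion used for the random forest
train_rows <- sample(seq_len(n), size = round(0.6 * n))
train_toy <- toy[train_rows, ]
valid_toy <- toy[-train_rows, ]

# Fit on the training rows only
toy_lm <- lm(Number.of.Reviews ~ Accommodates + Price, data = train_toy)

# In-sample error: what predict() without newdata measures
mae_in <- mean(abs(predict(toy_lm) - train_toy$Number.of.Reviews))
# Held-out error: measured on rows the model has not seen
mae_out <- mean(abs(predict(toy_lm, newdata = valid_toy) - valid_toy$Number.of.Reviews))

c(in_sample = mae_in, held_out = mae_out)
```

The held-out figure is the one that is directly comparable to metrics computed on a validation or test set.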
Histogram
For our MLR model, we first created a histogram of the residuals. As seen in the histogram below, our model’s residuals were right-skewed, indicating that the normality assumption is most likely violated. Further, most of the residuals fell between around -50 and 50.
# Plot a histogram of the residuals
hist(model_residuals, col = "skyblue", main = 'Histogram of MLR Model Residuals')
Residuals Plot
For our MLR model, we next created a Q-Q plot for the residuals of our model. As seen in the Q-Q plot below, our model’s residuals again showed a right skew, with the points straying further from the reference line at the higher quantiles.
# Q-Q plot of the residuals
qqnorm(model_residuals, main = "Q-Q Plot of MLR Model Residuals")
# Plot the Q-Q line
qqline(model_residuals, col = "darkorchid3")
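The histogram and Q-Q plot are visual checks of normality. A formal test can complement them. The sketch below applies shapiro.test() from base R's stats package to simulated residuals; both vectors are made up for illustration and are not the model's residuals. A small p-value is evidence against normality.

```r
set.seed(123)

# Simulated residuals (made up; not the MLR model's residuals)
skewed_resid <- rexp(100, rate = 1 / 30) - 30    # right-skewed, centered near 0
normal_resid <- rnorm(100, mean = 0, sd = 30)    # roughly normal, for contrast

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw_skewed <- shapiro.test(skewed_resid)
sw_normal <- shapiro.test(normal_resid)

c(skewed = sw_skewed$p.value, normal = sw_normal$p.value)
```

On the real model the same call would take model_residuals as its argument.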
Correlation Matrix
Lastly for our MLR model, we created a correlation matrix for the numerical predictors Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length; the target variable, Number.of.Reviews, is dropped before the correlations are computed. As seen in the correlation matrix below, the highest correlation appeared to be between Bedrooms and Accommodates. This was followed closely by the correlation between Price and Accommodates and between Price and Bedrooms. Because these are correlations among predictors, they indicate that Bedrooms, Accommodates, and Price carry overlapping information, rather than showing directly which variables have the most impact on Number.of.Reviews.
df_cont <- df_abnb_new[ , c("Number.of.Reviews", "Accommodates", "Price",
"Bedrooms", "Review.Scores.Value", "Des_length")]
reduced_data <- subset(df_cont, select = -Number.of.Reviews)
# Compute correlation at 2 decimal places
corr_matrix = round(cor(reduced_data), 2)
# Compute and show the result
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower", lab = TRUE) +
ggtitle("Correlation Matrix of MLR Model Selected Variables")
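The matrix above is computed after Number.of.Reviews is removed, so it shows only how the predictors relate to one another. As a minimal sketch, the correlation of each predictor with the target could be read from a matrix that keeps the target column. The data frame toy is simulated and its values are made up; only the column names are borrowed from the report.

```r
set.seed(123)

# Simulated stand-in columns (made up), named after the report's variables
toy <- data.frame(Accommodates = sample(1:8, 100, replace = TRUE))
toy$Bedrooms <- pmax(1, round(toy$Accommodates / 2 + rnorm(100, sd = 0.5)))
toy$Price <- 40 + 25 * toy$Accommodates + rnorm(100, sd = 30)
toy$Number.of.Reviews <- rpois(100, lambda = 30)

# Full correlation matrix, target included
corr_all <- round(cor(toy), 2)

# Correlation of each predictor with the target only
predictors <- setdiff(rownames(corr_all), "Number.of.Reviews")
target_corr <- corr_all[predictors, "Number.of.Reviews"]
target_corr
```

In this simulated data the predictors are strongly correlated with each other by construction, while the target is generated independently of them, which illustrates that the two kinds of correlation answer different questions.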
KNN Model
For our last model, we created a KNN model to see how our independent variables impacted our target variable predictions. As seen in the results below, caret selected the optimal k by the smallest cross-validated RMSE. Across our runs this was around k = 5 (k = 9 in the run shown below), with an RMSE of around 35, an R-squared of around 0.2, and an MAE of around 25. RMSE measures the typical size of the difference between predicted and actual values, giving extra weight to large errors. Although this is a relatively high RMSE in absolute terms, our target variable ranged from 0 to around 200, so it is reasonable relative to that range. Our R-squared implies that only around 20% of the variance in our target variable is explained by our independent variables, which is lower than what we would ideally want. Most likely, this is because many factors besides the ones we picked affect the target variable, and we could not include them without making the model too complicated. Likewise, our mean absolute error of around 25 is moderate, since it indicates relatively small errors considering the range of our target variable.
# Create training and testing datasets
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]
# Create a KNN model using caret
knn_model <- train(
Number.of.Reviews ~ .,
data = trainData,
method = "knn",
trControl = trainControl(method = "cv"),
preProcess = c("center", "scale")
)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards
## k-Nearest Neighbors
##
## 64 samples
## 112 predictors
##
## Pre-processing: centered (112), scaled (112)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 58, 58, 59, 57, 56, 57, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 43.26290 0.1385732 31.96195
## 7 43.28874 0.1232710 31.65823
## 9 42.95763 0.1815657 31.25944
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
# Make predictions on the test dataset
predictions <- predict(knn_model, newdata = testData)
# Evaluate the model
RMSE_KNN <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error:", RMSE_KNN))
## [1] "Root Mean Squared Error: 44.4401684450967"
# Calculate Mean Absolute Error (MAE)
MAE_KNN <- mean(abs(predictions - testData$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE):", MAE_KNN))
## [1] "Mean Absolute Error (MAE): 28.5"
# Calculate Root Mean Squared Error (RMSE)
RMSE_KNN <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE):", RMSE_KNN))
## [1] "Root Mean Squared Error (RMSE): 44.4401684450967"
# Calculate R-squared (coefficient of determination)
SSE_KNN <- sum((predictions - testData$Number.of.Reviews)^2) # Sum of Squared Errors
SST_KNN <- sum((testData$Number.of.Reviews - mean(testData$Number.of.Reviews))^2) # Total Sum of Squares
rsquared_KNN <- 1 - SSE_KNN/SST_KNN
print(paste("R-squared (Coefficient of Determination):", rsquared_KNN))
## [1] "R-squared (Coefficient of Determination): -0.185140960638793"
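The zero-variance warnings above arise because some predictors take a single value within a training sample, so they cannot be scaled. A minimal base R sketch of dropping such columns before modelling is shown below. The data frame toy_train and its values are made up for illustration. caret also provides nearZeroVar() and the "zv" option of preProcess for this purpose; the latter is applied within each resample, which matters because a column can be constant inside one fold even when it varies over the full training set.

```r
# Toy training data (made up): two constant columns mimic amenity flags
# that never vary within a small sample
toy_train <- data.frame(
  Number.of.Reviews = c(3, 10, 0, 25, 7),
  Accommodates = c(2, 4, 2, 6, 3),
  pool = c(0, 0, 0, 0, 0),
  guards = c(1, 1, 1, 1, 1)
)

# TRUE for columns that take more than one distinct value
varies <- vapply(toy_train, function(col) length(unique(col)) > 1, logical(1))

# Keep only the columns that vary
toy_reduced <- toy_train[, varies, drop = FALSE]
names(toy_reduced)
```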
STEP 6 - Evaluating Models
The last step of our data analysis will now be to compare the three models that we built to one another. We will do this by first comparing lift charts for each of the models, and then comparing each model’s MAE, RMSE and r-squared values to evaluate their performance.
Lift Chart for Random Forest Model
library(caret)
library(randomForest)
# Uses rf_model_caret and valid_data from the Random Forest section above
# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)
# Combine predictions with actual outcomes
results <- data.frame(Actual = valid_data$Number.of.Reviews, Predicted = predictions)
results <- results[order(-results$Predicted), ] # Sort predictions in descending order
# Calculate cumulative sum of outcomes
results$CumulativeActual <- cumsum(results$Actual)
# Calculate lift values
expected_lift <- sum(results$Actual) / nrow(results)
results$Lift <- results$CumulativeActual / (expected_lift * (1:nrow(results)))
# Divide data into deciles or percentiles
deciles <- quantile(results$Predicted, probs = seq(0, 1, by = 0.1)) # Deciles
# Calculate average lift for each decile
avg_lift <- tapply(results$Lift, cut(results$Predicted, breaks = deciles, include.lowest = TRUE), mean)
# Plot the lift chart with baseline
baseline <- seq(0, max(results$Predicted), length.out = length(avg_lift))
plot(1:length(avg_lift), avg_lift, type = "b", xlab = "Deciles", ylab = "Lift",
main = "Lift Chart - Random Forest Model", col = "red")
lines(1:length(avg_lift), baseline, type = "b", col = "blue")
legend("topright", legend = c("Model Lift", "Baseline"), col = c("red", "blue"), lty = 1)
Lift Chart for MLR Model
# Requires valid_data and the results data frame from the Random Forest lift chart above
# Multiple Linear Regression Model
cust_value_model <- lm(formula = Number.of.Reviews ~ Accommodates + Price +
Bedrooms + Review.Scores.Value + Des_length, data = df_abnb_new)
# Get predictions from the linear regression model
predictions_lm <- predict(cust_value_model, newdata = valid_data)
# Combine predictions with actual outcomes
results_lm <- data.frame(Actual = valid_data$Number.of.Reviews, Predicted = predictions_lm)
results_lm <- results_lm[order(-results_lm$Predicted), ] # Sort predictions in descending order
# Calculate cumulative sum of outcomes for linear regression
results_lm$CumulativeActual <- cumsum(results_lm$Actual)
# Calculate lift values for linear regression
expected_lift_lm <- sum(results_lm$Actual) / nrow(results_lm)
results_lm$Lift <- results_lm$CumulativeActual / (expected_lift_lm * (1:nrow(results_lm)))
# Divide data into deciles or percentiles for linear regression
deciles_lm <- quantile(results_lm$Predicted, probs = seq(0, 1, by = 0.1)) # Deciles
# Calculate average lift for each decile for linear regression
avg_lift_lm <- tapply(results_lm$Lift, cut(results_lm$Predicted, breaks = deciles_lm, include.lowest = TRUE), mean)
# Plot the lift chart with baseline
baseline <- seq(0, max(results$Predicted), length.out = length(avg_lift_lm))
plot(1:length(avg_lift_lm), avg_lift_lm, type = "b", xlab = "Deciles", ylab = "Lift",
main = "Lift Chart - Multiple Linear Regression", col = "green")
lines(1:length(avg_lift_lm), baseline, type = "b", col = "blue")
legend("topright", legend = c("Model Lift", "Baseline"), col = c("green", "blue"), lty = 1)
Lift Chart for KNN Model
# KNN Model
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]
library(caret)
knn_model <- train(
Number.of.Reviews ~ .,
data = trainData,
method = "knn",
trControl = trainControl(method = "cv"),
preProcess = c("center", "scale")
)
# Make predictions on the test dataset
predictions_knn <- predict(knn_model, newdata = testData)
# Combine predictions with actual outcomes for KNN
results_knn <- data.frame(Actual = testData$Number.of.Reviews, Predicted = predictions_knn)
results_knn <- results_knn[order(-results_knn$Predicted), ] # Sort predictions in descending order
# Calculate cumulative sum of outcomes for KNN
results_knn$CumulativeActual <- cumsum(results_knn$Actual)
# Calculate lift values for KNN
expected_lift_knn <- sum(results_knn$Actual) / nrow(results_knn)
results_knn$Lift <- results_knn$CumulativeActual / (expected_lift_knn * (1:nrow(results_knn)))
# Divide data into deciles or percentiles for KNN
deciles_knn <- quantile(results_knn$Predicted, probs = seq(0, 1, by = 0.1)) # Deciles
# Calculate average lift for each decile for KNN
avg_lift_knn <- tapply(results_knn$Lift, cut(results_knn$Predicted, breaks = deciles_knn, include.lowest = TRUE), mean)
# Trim 'avg_lift_knn' and 'baseline' to a common length so they can be plotted together
shorter_length <- min(length(avg_lift_knn), length(baseline))
avg_lift_knn <- avg_lift_knn[1:shorter_length]
baseline <- baseline[1:shorter_length]
# Plot the lift chart with KNN and baseline
plot(1:shorter_length, avg_lift_knn, type = "b", xlab = "Deciles", ylab = "Lift",
main = "Lift Chart - KNN vs. Baseline", col = "purple")
lines(1:shorter_length, baseline, type = "b", col = "red")
legend("topright", legend = c("KNN Lift", "Baseline"), col = c("purple", "red"), lty = 1)
Lift Chart for All Three Models
# Uses avg_lift and avg_lift_lm from the Random Forest and MLR lift charts above
# KNN Model
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]
library(caret)
knn_model <- train(
Number.of.Reviews ~ .,
data = trainData,
method = "knn",
trControl = trainControl(method = "cv"),
preProcess = c("center", "scale")
)
# Make predictions on the test dataset
predictions_knn <- predict(knn_model, newdata = testData)
# Combine predictions with actual outcomes for KNN
results_knn <- data.frame(Actual = testData$Number.of.Reviews, Predicted = predictions_knn)
results_knn <- results_knn[order(-results_knn$Predicted), ] # Sort predictions in descending order
# Calculate cumulative sum of outcomes for KNN
results_knn$CumulativeActual <- cumsum(results_knn$Actual)
# Calculate lift values for KNN
expected_lift_knn <- sum(results_knn$Actual) / nrow(results_knn)
results_knn$Lift <- results_knn$CumulativeActual / (expected_lift_knn * (1:nrow(results_knn)))
# Divide data into deciles or percentiles for KNN
deciles_knn <- quantile(results_knn$Predicted, probs = seq(0, 1, by = 0.1)) # Deciles
# Calculate average lift for each decile for KNN
avg_lift_knn <- tapply(results_knn$Lift, cut(results_knn$Predicted, breaks = deciles_knn, include.lowest = TRUE), mean)
# Plot the lift chart with all three models and baseline
plot(1:length(avg_lift), avg_lift, type = "b", xlab = "Deciles", ylab = "Lift",
main = "Lift Chart - Random Forest vs. Linear Regression vs. KNN", col = "red")
lines(1:length(avg_lift_lm), avg_lift_lm, type = "b", col = "green")
lines(1:length(avg_lift_knn), avg_lift_knn, type = "b", col = "purple")
lines(1:length(avg_lift), baseline, type = "b", col = "blue")
legend("topright", legend = c("Random Forest Lift", "Multiple Linear Regression Lift", "KNN Lift", "Baseline"),
col = c("red", "green", "purple", "blue"), lty = 1)
The lift chart measures the effectiveness of each model by calculating the ratio of results obtained with the model to results obtained without it. As seen in the lift chart above comparing all three models with the baseline, all three models are significantly below the baseline, which suggests that they all have poor predictive performance and likely need improvement. One caution is that the baseline here is drawn as an evenly spaced sequence from 0 to the largest random forest prediction, rather than as the constant lift of 1 that marks a model no better than the overall average, so the size of the gap should not be over-interpreted. The weak performance makes sense, however, because, as we have mentioned, our models all have data quality and feature issues stemming from how large our initial dataset was. These issues could explain why none of the models perform close to the baseline, since they are most likely missing many significant features that affect the target variable. However, when comparing the three models to one another in the lift chart above, the multiple linear regression model seems to have performed the best. While both the KNN and random forest curves drop off significantly across the deciles, the linear regression curve flattens off and then increases. This suggests that the random forest and KNN models lose predictive accuracy across the deciles, while the MLR model holds up better.
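For comparison, a conventional decile lift chart can be sketched as follows. This is not the calculation used for the charts above: it uses simulated data, assigns deciles by rank so that decile 1 holds the highest predictions, and draws the baseline as a constant lift of 1. The vectors actual and predicted and all their values are made up for illustration.

```r
set.seed(123)

# Simulated actual values and model predictions (made up for illustration)
n <- 200
actual <- rpois(n, lambda = 30)
predicted <- actual + rnorm(n, sd = 5)   # an informative but noisy model

# Sort listings from highest to lowest predicted value
ord <- order(predicted, decreasing = TRUE)
actual_sorted <- actual[ord]

# Assign each listing to a decile by rank (decile 1 = highest predictions)
decile <- ceiling(seq_len(n) / (n / 10))

# Decile lift: mean actual value in the decile divided by the overall mean
decile_lift <- tapply(actual_sorted, decile, mean) / mean(actual)

plot(1:10, decile_lift, type = "b", col = "red",
     xlab = "Decile (1 = highest predicted)", ylab = "Lift",
     main = "Decile Lift Chart (simulated data)")
abline(h = 1, col = "blue", lty = 2)   # baseline: no better than the average
```

Under this convention, a useful model sits above 1 in the first deciles and below 1 in the last ones, and a model with no ranking ability stays close to 1 throughout.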
Comparison Table of Statistical Metrics for All Three Models
# Collect the evaluation metrics computed earlier for each model
rf_metrics <- c(MAE_rf, RMSE_rf, rsquared_rf) # Random Forest (validation set)
lm_metrics <- c(MAE_LM, RMSE_LM, rsquared_LM) # MLR (fitted values)
knn_metrics <- c(MAE_KNN, RMSE_KNN, rsquared_KNN) # KNN (test set)
# Create a matrix or data frame to hold the metrics
comparison_table <- matrix(NA, nrow = 3, ncol = 3) # Create an empty matrix
rownames(comparison_table) <- c("MAE", "RMSE", "R-squared") # Row names for metrics
colnames(comparison_table) <- c("Random Forest", "MLR", "KNN") # Column names for models
# Fill in the matrix with the metrics
comparison_table[, "Random Forest"] <- rf_metrics
comparison_table[, "MLR"] <- lm_metrics
comparison_table[, "KNN"] <- knn_metrics
# Display the comparison table
#print("Comparison of Evaluation Metrics for Different Models:")
#print(comparison_table)
library(flextable)
comparison_table <- data.frame(
Row = c("MAE", "RMSE", "R-squared"),
`Random Forest` = rf_metrics,
`MLR` = lm_metrics,
KNN = knn_metrics
)
# Create a flextable object
ft <- flextable(comparison_table)
# Apply some style to the table
ft <- ft %>%
flextable::bg(j = 1:4, bg = "lightblue") %>%
flextable::border_outer()
set_caption(ft, "Comparison of Evaluation Metrics for Different Models")
Row | Random.Forest | MLR | KNN |
---|---|---|---|
MAE | 32.53238767 | 28.61427280 | 28.500000 |
RMSE | 58.53257457 | 42.22204744 | 44.440168 |
R-squared | -0.06376361 | 0.08049979 | -0.185141 |
The table above compares the MAE, RMSE, and R-squared of the three models to see how each model is performing and whether the factors we chose for each model show any significance. For the MAE, a lower value is better because it indicates that, on average, the model’s predictions are closer to the actual values. As seen in the table, MLR has relatively good accuracy in terms of MAE: in the run shown, its MAE (28.61) is essentially tied with KNN’s (28.50) and below the random forest’s (32.53). For the RMSE, lower values are also better. Unlike MAE, RMSE gives higher weight to larger errors, so it also reflects how much a few large misses affect each model. MLR’s RMSE is not low in absolute terms (42.22 in the run shown), likely because a few large errors carry more weight, though it is still the lowest of the three. Lastly, a higher R-squared is better because it indicates the share of the variance in the target variable that is explained by the model. None of the R-squared values are high, but MLR has the highest (0.08) and the only positive one; the negative values for the random forest and KNN models mean that, on their held-out data, they did worse than simply predicting the mean. Note that the MLR metrics were computed on the same data used to fit the model, while the random forest and KNN metrics were computed on held-out data, so the comparison favors MLR somewhat.
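The difference between MAE and RMSE can be seen in a small worked example. The two error vectors below are made up: both have the same MAE, but the one with a single large miss has twice the RMSE.

```r
# Two made-up sets of prediction errors with the same mean absolute error
even_errors <- c(10, 10, 10, 10)    # four moderate misses
one_big_miss <- c(0, 0, 0, 40)      # three exact hits and one large miss

mae <- function(e) mean(abs(e))
rmse <- function(e) sqrt(mean(e^2))

c(MAE = mae(even_errors), RMSE = rmse(even_errors))     # MAE 10, RMSE 10
c(MAE = mae(one_big_miss), RMSE = rmse(one_big_miss))   # MAE 10, RMSE 20
```

A large gap between RMSE and MAE, as in the table above, is therefore a sign that a few predictions are far off rather than all predictions being moderately off.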
Conclusion
In summary, we built a random forest, a multiple linear regression, and a KNN model to try to answer our business problem of what factors impact the number of reviews, and thus the number of bookings, a given AirBnB listing will receive. After comparing the performance of the three models using several evaluation techniques, we found that the multiple linear regression model appeared to have the highest prediction accuracy and thus yielded the best model results. It is important to note that since our analysis imported a random sample of the data, we got different results every time we re-ran it. However, we concluded that the MLR model was the best because it consistently had the best results across all of our runs. The numerical predictors of the MLR model (Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length) thus seem to have at least some impact on the number of reviews a given listing will receive. In particular, Accommodates, Bedrooms, and Price stood out in our correlation matrix, where they were the most strongly correlated with one another. However, though we concluded that the MLR model yielded the best results, a multitude of factors could be changed to create a model that would be more useful for further analysis. For example, including more rows and more variables, finding a better method of text mining, and using data with fewer quality issues would all address problems we encountered during this project and, if further analysis is to be done, should be accounted for to produce better, more accurate results.