Business Problem
AirBnB, the online booking company, has been growing in both popularity and bookings. Because of this growth, a common topic for data analysis projects is how different factors affect the bookings of a given host or listing. For this project, we analyze AirBnB data to see which independent variables can predict the number of reviews a listing receives, and from that infer which factors affect whether a listing is booked. We will be predicting Number.of.Reviews using the following variables:
- Host.Listings.Count
- Zipcode
- City
- Accommodates
- Price
- Bedrooms
- Review.Scores.Value
- Cancellation.Policy
- House.Rules
- Des_length
Data Background
We took our data from OpenDataSoft (https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?flg=en-us&disjunctive.host_verifications&disjunctive.amenities&disjunctive.features), but we focus only on listings in the United States. This dataset has 134,545 rows, but we will only be using 1,000 for our analysis.
STEP 0 - Import libraries
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
STEP 1 - Read in the data
When reading in the data for our project, we first checked how many rows the original CSV contained. Then we read most of those rows into a first dataframe, and from it created a second dataframe containing a random selection of 1,000 rows. Ideally we would have imported more rows, but because this dataset contains so many variables, working with more rows made computation extremely slow, so we limited our selection to keep the models manageable. Note that because we use a random sample of the rows, each run of the code draws a different sample and produces slightly different outputs. For that reason, our explanations are given in general terms rather than exact figures.
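The sampling step described above might look roughly like the following. This is a minimal sketch, not the code actually used: the helper `sample_rows()` and the toy `listings` table are hypothetical stand-ins, and the commented-out `read.csv()` line assumes a file name and separator that do not come from this report. Passing a seed makes the sample repeatable, which would remove the run-to-run variation mentioned above.

```r
# Hypothetical helper: draw n random rows from a data frame.
# Supplying a seed makes the same rows come back on every run.
sample_rows <- function(df, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- min(n, nrow(df))                  # never ask for more rows than exist
  df[sample(nrow(df), n), , drop = FALSE]
}

# With the real export this would be something like
# (file name and separator are assumptions):
# listings <- read.csv("airbnb-listings.csv", sep = ";")

# Toy stand-in so the sketch runs on its own
listings <- data.frame(ID = 1:5000,
                       Price = rep(c(50, 100, 150, 200), length.out = 5000))

nrow(listings)                                   # row count of the full table
df_abnb <- sample_rows(listings, 1000, seed = 123)
nrow(df_abnb)                                    # size of the sample
```

Fixing the seed trades a little generality for reproducibility: the reported figures would then match the text on every run.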
STEP 2 - Observe the dataframe
Data Structure
After importing our data, we first looked at how it was structured using str(). We saw that more of the variables had a character data type than a numerical one. We also found several variables whose values we could not interpret, such as the multiple Review.Scores variables. Likewise, some variables carried overlapping information, such as City, Market, State, Smart.Location, Longitude, Latitude, and Country, all of which describe location. We recognized that most of these would need to be cut so that we would not have multiple variables representing the same thing.
## 'data.frame': 100 obs. of 89 variables:
## $ ID : int 17601403 11227489 15162822 17269606 11376136 18324856 17697886 13375877 13320171 15595774 ...
## $ Listing.Url : chr "https://www.airbnb.com/rooms/17601403" "https://www.airbnb.com/rooms/11227489" "https://www.airbnb.com/rooms/15162822" "https://www.airbnb.com/rooms/17269606" ...
## $ Scrape.ID : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
## $ Last.Scraped : chr "2017-06-02" "2017-05-04" "2017-05-03" "2017-03-07" ...
## $ Name : chr "Lofted bed studio - work in progress" "Room rent for close to Manhattan" "Perfect location near everything, yet serene" "Greenbelt Private Rooms" ...
## $ Summary : chr "My place is a work in progress - there may be different furniture or updates when you arrive. I am close to Pag"| __truncated__ "My posh 2 bedroom apartment (1 room share) in private house close to buses and subways to Manhattan(from subway"| __truncated__ "My place is right above Downtown Burbank, great views, theatres, restaurants, and dining, family-friendly activ"| __truncated__ "My place is close to Close to the Domain shopping center, Walnut Creek Greenbelt, Food, Shopping, and 15 minute"| __truncated__ ...
## $ Space : chr "The is a historic single shotgun home in the Esplanade Ridge/ 7th Ward neighborhood. Walking distance to the Fa"| __truncated__ "" "" "" ...
## $ Description : chr "My place is a work in progress - there may be different furniture or updates when you arrive. I am close to Pag"| __truncated__ "My posh 2 bedroom apartment (1 room share) in private house close to buses and subways to Manhattan(from subway"| __truncated__ "My place is right above Downtown Burbank, great views, theatres, restaurants, and dining, family-friendly activ"| __truncated__ "My place is close to Close to the Domain shopping center, Walnut Creek Greenbelt, Food, Shopping, and 15 minute"| __truncated__ ...
## $ Experiences.Offered : chr "none" "none" "none" "none" ...
## $ Neighborhood.Overview : chr "This is a really pretty historic neighborhood a few blocks from Esplanade Ave. You can walk a couple miles to t"| __truncated__ "" "It is a nice hillside neighborhood is very safe. There is plenty of free street parking with no time restrictions." "" ...
## $ Notes : chr "We have dogs. They will be locked out of the room, but you may hear them!" "" "Please do not reserve if you smoke, I am super allergic and the meds I have to take to deal even a little hint "| __truncated__ "" ...
## $ Transit : chr "There's lots to walk around and see in New Orleans. We have one extra bike we can loan out to a short person. "| __truncated__ "" "There is a Metro 6 blocks down, and a city bus stop one block over. The Interstate 5 Freeway is 7 blocks down the hill." "" ...
## $ Access : chr "You will just have access to the first room of the house. The thermostat is not accessible to you, but we are "| __truncated__ "" "You are free to use the fridge, make coffee. There is a front porch with nice sunsets if you want to chill." "" ...
## $ Interaction : chr "Harvey or I will meet you on arrival and show you around the room. You can reach me by text throughout the day."| __truncated__ "" "If I am not working, I'm up for a chat!" "" ...
## $ House.Rules : chr "- Respect neighbors - quiet after 11pm" "" "- I have a small dog Sophie, and a house cat that shares the house - I am near many of the Studios if you are h"| __truncated__ "" ...
## $ Thumbnail.Url : chr "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=small" ...
## $ Medium.Url : chr "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=medium" ...
## $ Picture.Url : chr "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/90e39ca76f57ad4c3d55ad42787fc533" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/12efa7eb6e6b7fb176469302b18014b7" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/889dce574e52778d3a6aaa89f5ad279d" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/b42360e2c1621dbe3a13e0448e98e74f" ...
## $ XL.Picture.Url : chr "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=x_large" ...
## $ Host.ID : int 9773371 36526046 798773 99770161 57018721 75100240 120577640 4307894 75327963 78919700 ...
## $ Host.URL : chr "https://www.airbnb.com/users/show/9773371" "https://www.airbnb.com/users/show/36526046" "https://www.airbnb.com/users/show/798773" "https://www.airbnb.com/users/show/99770161" ...
## $ Host.Name : chr "Nicole" "Shafiqul" "Jennifer" "Samantha" ...
## $ Host.Since : chr "2013-11-02" "2015-06-23" "2011-07-09" "2016-10-15" ...
## $ Host.Location : chr "New Orleans, Louisiana, United States" "New York, New York, United States" "Los Angeles, California, United States" "College Station, Texas, United States" ...
## $ Host.About : chr "I am a transplant to New Orleans often looking for an escape. I love spending time outdoors and going on advent"| __truncated__ "Self Employe" "" "" ...
## $ Host.Response.Time : chr "within an hour" "within an hour" "within an hour" "within an hour" ...
## $ Host.Response.Rate : int 100 100 98 100 90 100 NA 100 NA 100 ...
## $ Host.Acceptance.Rate : chr "" "" "" "" ...
## $ Host.Thumbnail.Url : chr "https://a0.muscache.com/im/users/9773371/profile_pic/1431439109/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/15245b78-9885-4808-9281-37519247e490.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/0c03b70d-1537-4677-acd6-3121079325fc.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/1e27bb13-b3e9-403c-80c3-db432eea94ab.jpg?aki_policy=profile_small" ...
## $ Host.Picture.Url : chr "https://a0.muscache.com/im/users/9773371/profile_pic/1431439109/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/15245b78-9885-4808-9281-37519247e490.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/0c03b70d-1537-4677-acd6-3121079325fc.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/1e27bb13-b3e9-403c-80c3-db432eea94ab.jpg?aki_policy=profile_x_medium" ...
## $ Host.Neighbourhood : chr "Seventh Ward" "Astoria" "Burbank" "" ...
## $ Host.Listings.Count : int 1 1 2 1 1 1 1 1 1 4 ...
## $ Host.Total.Listings.Count : int 1 1 2 1 1 1 1 1 1 4 ...
## $ Host.Verifications : chr "email,phone,reviews,kba" "email,phone,reviews" "email,phone,reviews,jumio,offline_government_id,government_id" "email,phone,facebook" ...
## $ Street : chr "Seventh Ward, New Orleans, LA 70119, United States" "Astoria, Queens, NY 11102, United States" "Burbank, Burbank, CA 91501, United States" "Bruneau Trail, Austin, TX 78754, United States" ...
## $ Neighbourhood : chr "Seventh Ward" "Astoria" "Burbank" "" ...
## $ Neighbourhood.Cleansed : chr "Seventh Ward" "Astoria" "Burbank" "78754" ...
## $ Neighbourhood.Group.Cleansed : chr "" "Queens" "" "" ...
## $ City : chr "New Orleans" "Queens" "Burbank" "Austin" ...
## $ State : chr "LA" "NY" "CA" "TX" ...
## $ Zipcode : chr "70119" "11102" "91501" "78754" ...
## $ Market : chr "New Orleans" "New York" "Los Angeles" "Austin" ...
## $ Smart.Location : chr "New Orleans, LA" "Queens, NY" "Burbank, CA" "Austin, TX" ...
## $ Country.Code : chr "US" "US" "US" "US" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ Latitude : num 30 40.8 34.2 30.4 34 ...
## $ Longitude : num -90.1 -73.9 -118.3 -97.7 -118.5 ...
## $ Property.Type : chr "Apartment" "House" "House" "House" ...
## $ Room.Type : chr "Entire home/apt" "Private room" "Private room" "Private room" ...
## $ Accommodates : int 2 2 2 4 4 3 1 8 1 3 ...
## $ Bathrooms : num 1 1 1 1 1.5 1 1 1.5 1 1 ...
## $ Bedrooms : int 1 1 1 2 2 1 1 2 1 0 ...
## $ Beds : int 1 1 1 2 3 1 1 2 1 1 ...
## $ Bed.Type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ Amenities : chr "Internet,Wireless Internet,Air conditioning,Breakfast,Pets live on this property,Dog(s),Heating,Smoke detector,"| __truncated__ "Internet,Wireless Internet,Kitchen,Heating,Family/kid friendly,Shampoo,24-hour check-in,Hangers,Iron,Laptop friendly workspace" "Wireless Internet,Air conditioning,Pool,Kitchen,Free parking on premises,Pets live on this property,Dog(s),Cat("| __truncated__ "TV,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Family/kid friendly,Washer,Dryer"| __truncated__ ...
## $ Square.Feet : int NA NA NA NA NA NA NA NA NA NA ...
## $ Price : int 80 45 78 30 204 300 250 75 79 199 ...
## $ Weekly.Price : int NA NA NA NA NA NA NA NA NA NA ...
## $ Monthly.Price : int NA NA NA NA NA NA NA NA NA NA ...
## $ Security.Deposit : int 100 NA NA NA 500 500 NA 100 NA NA ...
## $ Cleaning.Fee : int 20 NA NA NA 80 100 NA 60 NA 50 ...
## $ Guests.Included : int 2 1 1 1 1 1 1 2 1 1 ...
## $ Extra.People : int 0 0 0 0 0 0 0 25 0 0 ...
## $ Minimum.Nights : int 2 3 1 1 3 5 1 1 1 3 ...
## $ Maximum.Nights : int 12 1125 14 4 1125 1125 3 1125 14 1125 ...
## $ Calendar.Updated : chr "2 weeks ago" "3 months ago" "today" "5 days ago" ...
## $ Has.Availability : chr "" "" "" "" ...
## $ Availability.30 : int 0 0 17 0 12 1 29 28 0 26 ...
## $ Availability.60 : int 0 3 47 0 24 2 59 58 0 56 ...
## $ Availability.90 : int 0 4 77 0 29 2 89 58 0 86 ...
## $ Availability.365 : int 0 146 77 34 84 2 89 58 0 361 ...
## $ Calendar.last.Scraped : chr "2017-06-02" "2017-05-04" "2017-05-02" "2017-03-06" ...
## $ Number.of.Reviews : int 4 32 51 0 15 0 0 2 41 1 ...
## $ First.Review : chr "2017-03-19" "2016-03-16" "2016-09-30" "" ...
## $ Last.Review : chr "2017-05-08" "2017-05-03" "2017-04-30" "" ...
## $ Review.Scores.Rating : int 100 91 94 NA 93 NA NA 100 98 60 ...
## $ Review.Scores.Accuracy : int 10 9 9 NA 10 NA NA 8 10 2 ...
## $ Review.Scores.Cleanliness : int 10 9 9 NA 10 NA NA 10 10 10 ...
## $ Review.Scores.Checkin : int 10 9 10 NA 10 NA NA 10 10 10 ...
## $ Review.Scores.Communication : int 10 10 10 NA 10 NA NA 10 10 10 ...
## $ Review.Scores.Location : int 10 9 10 NA 10 NA NA 10 10 6 ...
## $ Review.Scores.Value : int 10 9 10 NA 9 NA NA 10 10 8 ...
## $ License : chr "City registration pending" "" "" "" ...
## $ Jurisdiction.Names : chr "Louisiana State, New Orleans, LA" "" "" "" ...
## $ Cancellation.Policy : chr "flexible" "flexible" "flexible" "flexible" ...
## $ Calculated.host.listings.count: int 1 1 2 1 1 1 1 1 1 2 ...
## $ Reviews.per.Month : num 1.58 2.31 7.08 NA 1.04 NA NA 0.17 3.92 0.17 ...
## $ Geolocation : chr "29.977031012628125, -90.07298514292046" "40.76890506956011, -73.92859788917319" "34.19014920848236, -118.29887567566182" "30.364437053063515, -97.65307062568792" ...
## $ Features : chr "Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License" "Host Has Profile Pic,Is Location Exact,Instant Bookable" "Host Has Profile Pic,Host Identity Verified,Is Location Exact" "Host Has Profile Pic,Instant Bookable" ...
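To put a number on the observation that character columns outnumber numeric ones, the column classes can be tallied. The sketch below uses a small hypothetical data frame `toy` (with column names borrowed from the str() output) in place of the real `df_abnb`.

```r
# Toy data frame standing in for df_abnb
toy <- data.frame(
  Name      = c("Loft", "Room", "House"),
  City      = c("Austin", "Queens", "Burbank"),
  Price     = c(80L, 45L, 78L),
  Bathrooms = c(1, 1.5, 1),
  stringsAsFactors = FALSE
)

# Class of each column, then a count per class
col_types <- sapply(toy, class)
table(col_types)

# Names of the character columns, the candidates for text mining or removal
names(col_types)[col_types == "character"]
```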
Missing Values
After observing our variables, we then looked at which variables contained missing data. From these counts, we saw that some variables, such as Square.Feet, Weekly.Price, and Monthly.Price, had far more missing values than actual values; in the output shown below, each is missing in at least 77 of the 100 rows. We also decided to take out Host.Response.Rate: even though we thought it could have an impact on our target variable, it was missing in a sizable share of rows (20 of 100 in the output shown below), so we dropped it.
## Square.Feet Weekly.Price
## 100 84
## Monthly.Price Security.Deposit
## 77 59
## Cleaning.Fee Host.Response.Rate
## 26 20
## Review.Scores.Rating Review.Scores.Accuracy
## 20 20
## Review.Scores.Cleanliness Review.Scores.Checkin
## 20 20
## Review.Scores.Communication Review.Scores.Location
## 20 20
## Review.Scores.Value Reviews.per.Month
## 20 18
## ID Listing.Url
## 0 0
## Scrape.ID Last.Scraped
## 0 0
## Name Summary
## 0 0
## Space Description
## 0 0
## Experiences.Offered Neighborhood.Overview
## 0 0
## Notes Transit
## 0 0
## Access Interaction
## 0 0
## House.Rules Thumbnail.Url
## 0 0
## Medium.Url Picture.Url
## 0 0
## XL.Picture.Url Host.ID
## 0 0
## Host.URL Host.Name
## 0 0
## Host.Since Host.Location
## 0 0
## Host.About Host.Response.Time
## 0 0
## Host.Acceptance.Rate Host.Thumbnail.Url
## 0 0
## Host.Picture.Url Host.Neighbourhood
## 0 0
## Host.Listings.Count Host.Total.Listings.Count
## 0 0
## Host.Verifications Street
## 0 0
## Neighbourhood Neighbourhood.Cleansed
## 0 0
## Neighbourhood.Group.Cleansed City
## 0 0
## State Zipcode
## 0 0
## Market Smart.Location
## 0 0
## Country.Code Country
## 0 0
## Latitude Longitude
## 0 0
## Property.Type Room.Type
## 0 0
## Accommodates Bathrooms
## 0 0
## Bedrooms Beds
## 0 0
## Bed.Type Amenities
## 0 0
## Price Guests.Included
## 0 0
## Extra.People Minimum.Nights
## 0 0
## Maximum.Nights Calendar.Updated
## 0 0
## Has.Availability Availability.30
## 0 0
## Availability.60 Availability.90
## 0 0
## Availability.365 Calendar.last.Scraped
## 0 0
## Number.of.Reviews First.Review
## 0 0
## Last.Review License
## 0 0
## Jurisdiction.Names Cancellation.Policy
## 0 0
## Calculated.host.listings.count Geolocation
## 0 0
## Features
## 0
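Counts like those above can be produced with `colSums(is.na(...))`. One caveat worth illustrating: in the str() output, columns such as Host.Acceptance.Rate show empty strings (`""`), and empty strings are not counted as NA, so a character column can report 0 missing values while holding little information. The sketch below uses a hypothetical `toy` data frame rather than the real data.

```r
toy <- data.frame(
  Square.Feet          = c(NA, NA, NA, 500),
  Price                = c(80, 45, 78, 30),
  Host.Acceptance.Rate = c("", "", "", ""),
  House.Rules          = c("- quiet after 11pm", "", "No smoking", ""),
  stringsAsFactors = FALSE
)

# NA count per column, largest first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts

# Empty strings are not NA, so count them separately for character columns
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})
empty_counts
```

Converting `""` to NA before counting would make both kinds of gap show up in a single table; whether that is appropriate depends on how each column is used later.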
Summary Statistics
Lastly, we generated summary statistics of our data. These provided quantitative information about each of our numerical variables, such as Host.Listings.Count, Accommodates, and Price. One variable that we found particularly interesting was Host.Listings.Count: although it had a mean of around 8 listings, its minimum was 0 listings, its 3rd quartile was around 3 listings, and its maximum was around 800 listings (in the sample shown below, the mean is 15.43 and the maximum is 628). This shows that although the majority of hosts had between 1 and 3 listings, a small group of outlier hosts had hundreds. Another variable whose summary statistics are important to note is the target variable, Number.of.Reviews, which has a minimum of 0, a mean of around 20, and a maximum of around 400 (24.92 and 292 in the sample shown below). This gave us an idea of how large the range of our target variable was, which could affect some of our models’ statistical results.
## ID Listing.Url Scrape.ID Last.Scraped
## Min. : 13059 Length:100 Min. :2.016e+13 Length:100
## 1st Qu.: 5032163 Class :character 1st Qu.:2.017e+13 Class :character
## Median :10400087 Mode :character Median :2.017e+13 Mode :character
## Mean : 9669259 Mean :2.017e+13
## 3rd Qu.:14520198 3rd Qu.:2.017e+13
## Max. :18324856 Max. :2.017e+13
##
## Name Summary Space Description
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Experiences.Offered Neighborhood.Overview Notes
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Transit Access Interaction House.Rules
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Thumbnail.Url Medium.Url Picture.Url XL.Picture.Url
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Host.ID Host.URL Host.Name Host.Since
## Min. : 50866 Length:100 Length:100 Length:100
## 1st Qu.: 4297625 Class :character Class :character Class :character
## Median : 17248958 Mode :character Mode :character Mode :character
## Mean : 30217686
## 3rd Qu.: 48761176
## Max. :120577640
##
## Host.Location Host.About Host.Response.Time Host.Response.Rate
## Length:100 Length:100 Length:100 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.:100.00
## Mode :character Mode :character Mode :character Median :100.00
## Mean : 95.53
## 3rd Qu.:100.00
## Max. :100.00
## NA's :20
## Host.Acceptance.Rate Host.Thumbnail.Url Host.Picture.Url Host.Neighbourhood
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Host.Listings.Count Host.Total.Listings.Count Host.Verifications
## Min. : 1.00 Min. : 1.00 Length:100
## 1st Qu.: 1.00 1st Qu.: 1.00 Class :character
## Median : 1.00 Median : 1.00 Mode :character
## Mean : 15.43 Mean : 15.43
## 3rd Qu.: 3.00 3rd Qu.: 3.00
## Max. :628.00 Max. :628.00
##
## Street Neighbourhood Neighbourhood.Cleansed
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Neighbourhood.Group.Cleansed City State
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Zipcode Market Smart.Location Country.Code
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Country Latitude Longitude Property.Type
## Length:100 Min. :29.92 Min. :-122.70 Length:100
## Class :character 1st Qu.:32.80 1st Qu.:-118.39 Class :character
## Mode :character Median :34.17 Median : -97.85 Mode :character
## Mean :36.25 Mean :-100.77
## 3rd Qu.:40.72 3rd Qu.: -84.32
## Max. :47.52 Max. : -71.05
##
## Room.Type Accommodates Bathrooms Bedrooms
## Length:100 Min. : 1.00 Min. :1.000 Min. :0.00
## Class :character 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.:1.00
## Mode :character Median : 2.00 Median :1.000 Median :1.00
## Mean : 3.61 Mean :1.375 Mean :1.44
## 3rd Qu.: 4.25 3rd Qu.:1.500 3rd Qu.:2.00
## Max. :14.00 Max. :8.000 Max. :4.00
##
## Beds Bed.Type Amenities Square.Feet
## Min. :1.00 Length:100 Length:100 Min. : NA
## 1st Qu.:1.00 Class :character Class :character 1st Qu.: NA
## Median :1.00 Mode :character Mode :character Median : NA
## Mean :1.88 Mean :NaN
## 3rd Qu.:2.00 3rd Qu.: NA
## Max. :7.00 Max. : NA
## NA's :100
## Price Weekly.Price Monthly.Price Security.Deposit
## Min. : 25.00 Min. :325.0 Min. : 900 Min. :100.0
## 1st Qu.: 78.75 1st Qu.:403.8 1st Qu.: 1600 1st Qu.:150.0
## Median :115.00 Median :570.0 Median : 2150 Median :250.0
## Mean :174.09 Mean :557.5 Mean : 3218 Mean :299.4
## 3rd Qu.:200.00 3rd Qu.:607.8 3rd Qu.: 3998 3rd Qu.:500.0
## Max. :947.00 Max. :960.0 Max. :11000 Max. :500.0
## NA's :84 NA's :77 NA's :59
## Cleaning.Fee Guests.Included Extra.People Minimum.Nights
## Min. : 5.00 Min. : 1.00 Min. : 0.00 Min. : 1.00
## 1st Qu.: 30.50 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 1.00
## Median : 60.00 Median : 1.00 Median : 0.00 Median : 2.00
## Mean : 82.41 Mean : 1.83 Mean : 14.52 Mean : 3.27
## 3rd Qu.:100.00 3rd Qu.: 2.00 3rd Qu.: 25.00 3rd Qu.: 3.00
## Max. :400.00 Max. :10.00 Max. :150.00 Max. :30.00
## NA's :26
## Maximum.Nights Calendar.Updated Has.Availability Availability.30
## Min. : 2.00 Length:100 Length:100 Min. : 0.00
## 1st Qu.: 29.75 Class :character Class :character 1st Qu.: 0.00
## Median :1125.00 Mode :character Mode :character Median : 8.50
## Mean : 697.34 Mean :11.36
## 3rd Qu.:1125.00 3rd Qu.:19.25
## Max. :1125.00 Max. :30.00
##
## Availability.60 Availability.90 Availability.365 Calendar.last.Scraped
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Length:100
## 1st Qu.: 2.00 1st Qu.: 3.00 1st Qu.: 17.75 Class :character
## Median :21.00 Median :47.00 Median :163.00 Mode :character
## Mean :25.52 Mean :42.43 Mean :181.24
## 3rd Qu.:47.50 3rd Qu.:76.00 3rd Qu.:329.25
## Max. :60.00 Max. :90.00 Max. :365.00
##
## Number.of.Reviews First.Review Last.Review Review.Scores.Rating
## Min. : 0.00 Length:100 Length:100 Min. : 60.00
## 1st Qu.: 1.00 Class :character Class :character 1st Qu.: 93.00
## Median : 8.00 Mode :character Mode :character Median : 97.00
## Mean : 24.92 Mean : 95.08
## 3rd Qu.: 27.50 3rd Qu.:100.00
## Max. :292.00 Max. :100.00
## NA's :20
## Review.Scores.Accuracy Review.Scores.Cleanliness Review.Scores.Checkin
## Min. : 2.000 Min. : 7.000 Min. : 9.000
## 1st Qu.: 9.750 1st Qu.: 9.000 1st Qu.:10.000
## Median :10.000 Median :10.000 Median :10.000
## Mean : 9.625 Mean : 9.575 Mean : 9.863
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :20 NA's :20 NA's :20
## Review.Scores.Communication Review.Scores.Location Review.Scores.Value
## Min. : 9.00 Min. : 6.0 Min. : 7.000
## 1st Qu.:10.00 1st Qu.: 9.0 1st Qu.: 9.000
## Median :10.00 Median :10.0 Median :10.000
## Mean : 9.85 Mean : 9.6 Mean : 9.588
## 3rd Qu.:10.00 3rd Qu.:10.0 3rd Qu.:10.000
## Max. :10.00 Max. :10.0 Max. :10.000
## NA's :20 NA's :20 NA's :20
## License Jurisdiction.Names Cancellation.Policy
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Calculated.host.listings.count Reviews.per.Month Geolocation
## Min. : 1.00 Min. :0.0400 Length:100
## 1st Qu.: 1.00 1st Qu.:0.3975 Class :character
## Median : 1.00 Median :1.0350 Mode :character
## Mean : 4.52 Mean :1.7778
## 3rd Qu.: 2.00 3rd Qu.:2.6375
## Max. :69.00 Max. :7.0800
## NA's :18
## Features
## Length:100
## Class :character
## Mode :character
##
##
##
##
STEP 3 - Clean the Data
Since our dataframe has a very large number of variables, as seen in the previous step, we do not want to build our model on all of them, since it would become too complicated and difficult to manage. So we first create a new dataframe containing only the variables that we believe are significant or could have some influence over the target variable. This includes creating new columns that we believe could be helpful and editing columns that need to be changed. Once we have selected our variables of interest, we clean the data by omitting rows with missing values.
Selecting Columns of Interest
We thought that it would be interesting to use the length of the “Description” column as one of our input variables, so we first created that column. Then, we saw that “Zipcode” was imported as a character rather than an integer, so we converted the values in that column. Finally, we selected our variables of interest for our new dataframe.
# Create column for length of description
df_abnb <- df_abnb %>%
  mutate(Des_length = nchar(as.character(Description)))
# Change Zipcode from chr to int
# (non-numeric ZIP strings become NA and leading zeros are dropped)
df_abnb$Zipcode <- as.integer(df_abnb$Zipcode)
# Select only columns of interest
df_abnb_new <- df_abnb[, c("Host.Listings.Count", "Zipcode", "City", "Accommodates",
                           "Price", "Bedrooms", "Review.Scores.Value",
                           "Cancellation.Policy", "Amenities", "House.Rules",
                           "Des_length", "Number.of.Reviews")]
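Two details of this cleaning step can be illustrated on toy data. First, the text says rows with missing values are omitted, but that code is not shown; `na.omit()` is one common way to do it and is used here as an assumption. Second, `as.integer()` on ZIP code strings silently changes some values. The `toy` data frame and `zip_chr` vector are hypothetical.

```r
toy <- data.frame(
  Price               = c(80, 45, 78, 30),
  Review.Scores.Value = c(10, 9, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only rows with no NA in any column
toy_clean <- na.omit(toy)
nrow(toy)        # rows before
nrow(toy_clean)  # rows after

# ZIP codes as integers: the leading zero is lost and
# a ZIP+4 string cannot be parsed, so it becomes NA
zip_chr <- c("70119", "02134", "94103-1234")
zip_int <- suppressWarnings(as.integer(zip_chr))
zip_int
```

Because a ZIP code is a label rather than a quantity, keeping it as a character or factor column may be a reasonable alternative, at the cost of many levels.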
STEP 4 - Text Mining
Many of the variables in our dataset are character (text) variables, and any of these that we want to use must first be text mined. Of the variables of interest that we picked for our model, the only ones that require text mining are “Amenities” and “Cancellation.Policy”.
Combining Columns
Since the Amenities and Cancellation.Policy columns were similar and would both need to be text mined, we decided to combine them into one column which we would then perform the text mining on.
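The code that builds the combined column is not shown in this report. A minimal sketch of one way to do it follows; the column name `combined` matches the one used in the text mining code, while the comma separator and the `toy` data frame are assumptions made for illustration.

```r
toy <- data.frame(
  Amenities           = c("Internet,Wireless Internet,Heating", "TV,Kitchen"),
  Cancellation.Policy = c("flexible", "strict"),
  stringsAsFactors = FALSE
)

# Join the two text columns with a comma so the policy becomes one more term
toy$combined <- paste(toy$Amenities, toy$Cancellation.Policy, sep = ",")
toy$combined
```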
Text Mining for Amenities
# Preprocess text data
corpus <- VCorpus(VectorSource(df_abnb_new$combined))
corpus <- tm_map(corpus, content_transformer(tolower))
# Text is separated by commas, hyphens, and slashes
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "-")   # Replace hyphens with spaces
corpus <- tm_map(corpus, toSpace, "\\.") # Replace periods with spaces
corpus <- tm_map(corpus, toSpace, ",")   # Replace commas with spaces
corpus <- tm_map(corpus, toSpace, "/")   # Replace slashes with spaces
corpus <- tm_map(corpus, toSpace, "_")   # Replace underscores with spaces
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
words <- colnames(matrix)
# Convert to a dataframe for modeling
text_data <- as.data.frame(matrix, stringsAsFactors = FALSE)
# Combine with original data (make sure row order is the same)
final_data_1 <- cbind(text_data, Host.Listings.Count = df_abnb_new$Host.Listings.Count, Zipcode = df_abnb_new$Zipcode,
                      Accommodates = df_abnb_new$Accommodates, Price = df_abnb_new$Price, Bedrooms = df_abnb_new$Bedrooms,
                      Review.Scores.Value = df_abnb_new$Review.Scores.Value, Des_length = df_abnb_new$Des_length,
                      Number.of.Reviews = df_abnb_new$Number.of.Reviews)
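To make the idea of a document-term matrix concrete, the sketch below builds one by hand in base R for two made-up listings. It is a simplified illustration of the technique, not a reimplementation of the tm pipeline above: tm applies further processing (stop word removal and its own tokenizer defaults) that this sketch does not reproduce.

```r
docs <- c("Internet,Wireless Internet,Heating,flexible",
          "TV,Kitchen,Heating,strict")

# Lower-case, turn separators into spaces, then split on whitespace
clean  <- gsub("[-.,/_]", " ", tolower(docs))
tokens <- strsplit(trimws(clean), "\\s+")

# One row per document, one column per distinct term, cells hold term counts
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab
dtm
```

Each term column then acts as a numeric predictor, which is why the real matrix can be bound to the other model inputs with cbind().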
STEP 5 - Creating Models
Given our data and chosen business problem, we decided that the three models best suited to our data would be Random Forest, Multiple Linear Regression (MLR), and K-Nearest Neighbors (KNN).
Random Forest Model
To create our random forest model, we first used the caret package to train the model on the training data, then made predictions on the validation set to see how it performed. As seen in the output below, caret's train() function selected the optimal model by choosing the smallest cross-validated RMSE, which was roughly 30 to 40 depending on the run (37.87 in the output below), and this gave an optimal mtry of 2. When evaluating the model on our validation data, we found a mean absolute error of roughly 20 to 30 (30.91 in the output below), which is relatively high, but could make sense considering the large range of our target variable, as mentioned previously.
# Split data into training and test sets
library(caret)
library(randomForest)
# Set seed for reproducibility
set.seed(123)
# Create a data partition for train/test split
index <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.6, list = FALSE)
training_data <- final_data_1[index, ]
valid_data <- final_data_1[-index, ]
# Define the train control using cross-validation
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Create a Random Forest model using caret's train function
rf_model_caret <- train(
  Number.of.Reviews ~ .,
  data = training_data,
  method = "rf",
  trControl = train_control
)
# View model details
rf_model_caret
## Random Forest
##
## 49 samples
## 82 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 38, 40, 40, 39, 39
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 37.87340 0.1910652 25.13559
## 42 44.85014 0.1241209 28.16363
## 82 51.32394 0.1107678 31.76138
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)
# Evaluate the model
MAE <- mean(abs(predictions - valid_data$Number.of.Reviews))
MAE
## [1] 30.91061
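For reference, the three metrics that appear in the caret output can be computed by hand as follows. The helper `regression_metrics()` is hypothetical, and the `predicted` values are made up for illustration; the R-squared here is the squared correlation between predictions and actual values, which is the definition caret uses by default.

```r
# Hypothetical helper: common regression error metrics
regression_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(MAE  = mean(abs(err)),             # average absolute error
    RMSE = sqrt(mean(err^2)),          # penalizes large errors more heavily
    R2   = cor(actual, predicted)^2)   # squared correlation
}

actual    <- c(4, 32, 51, 0, 15)       # example review counts
predicted <- c(10, 25, 40, 5, 20)      # made-up predictions
m <- regression_metrics(actual, predicted)
m
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and a large gap between the two suggests a few listings with very large errors.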
MLR Model
To create our MLR model, we first used the lm() function to fit a model using only selected numerical variables: Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length. From this model, we then created several plots to examine the residuals.
cust_value_model <- lm(formula = Number.of.Reviews ~ Accommodates + Price +
                         Bedrooms + Review.Scores.Value + Des_length,
                       data = df_abnb_new)
# Get the model residuals
model_residuals <- cust_value_model$residuals
Histogram
For our MLR model, we first created a histogram of the residuals. As seen in the histogram below, our model’s residuals were right-skewed, indicating that the normality assumption most likely does not hold. Further, most of the residuals fell between around -50 and 100.
# Plot a histogram of the residuals
hist(model_residuals, col = "skyblue", main = 'Histogram of MLR Model Residuals')
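A histogram gives a visual impression of skew, and it can be backed up with a number. The sketch below computes the sample skewness and runs a Shapiro-Wilk test in base R. It uses simulated right-skewed values (sim_resid) as a hypothetical stand-in for model_residuals, and the skewness() helper is our own illustration, not a function from any package we loaded.

```r
set.seed(1)
# Simulated right-skewed "residuals"; a hypothetical stand-in for model_residuals.
sim_resid <- rexp(200, rate = 1 / 40) - 40

# Sample skewness (third standardized moment); values above 0 suggest a right skew.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}
skew_value <- skewness(sim_resid)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(sim_resid)

skew_value
sw$p.value
```

Applied to model_residuals, a clearly positive skewness together with a small p-value would support what the histogram suggests.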
Q-Q Plot
For our MLR model, we next created a Q-Q plot of the residuals. As seen in the Q-Q plot below, the points depart from the reference line, which again indicates a right skew in our model’s residuals.
# Q-Q plot of the residuals
qqnorm(model_residuals, main = "Q-Q Plot of MLR Model Residuals")
# Plot the Q-Q line
qqline(model_residuals, col = "red")
Correlation Matrix
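Lastly for our MLR model, we created a correlation matrix for the numerical predictors Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length. Our target variable, Number.of.Reviews, is dropped before the correlations are computed. As seen in the correlation matrix below, the highest correlation appeared to be between Bedrooms and Accommodates. This was followed closely by the correlation between Price and Accommodates and between Price and Bedrooms. Because these are correlations among predictors rather than with the target, they indicate that Bedrooms, Accommodates, and Price carry overlapping information (multicollinearity), which can make the individual MLR coefficients harder to interpret. They do not by themselves show which variables are most impactful for Number.of.Reviews.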
df_cont <- df_abnb_new[ , c("Number.of.Reviews", "Accommodates", "Price",
"Bedrooms", "Review.Scores.Value", "Des_length")]
# Drop the target so that only the predictors are correlated
reduced_data <- subset(df_cont, select = -Number.of.Reviews)
# Compute the correlations, rounded to 2 decimal places
corr_matrix <- round(cor(reduced_data), 2)
# Plot the correlation matrix
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower", lab = TRUE) +
ggtitle("Correlation Matrix of MLR Model Selected Variables")
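When predictors are strongly correlated with each other, as Bedrooms and Accommodates appear to be, a variance inflation factor (VIF) is a common way to quantify the overlap. The following is a minimal base R sketch under stated assumptions: the data frame toy is simulated, its columns are hypothetical stand-ins for our variables, and vif_by_hand() is an illustrative helper rather than a function from a package.

```r
set.seed(42)
n <- 200
# Hypothetical predictors: accommodates is built to depend on bedrooms.
bedrooms <- rpois(n, 2) + 1
accommodates <- 2 * bedrooms + rpois(n, 1)
price <- rnorm(n, mean = 100, sd = 30)  # independent of the other two
toy <- data.frame(bedrooms, accommodates, price)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2).
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(toy)
round(vifs, 2)
```

A VIF near 1 means a predictor shares little information with the others, while larger values point to the kind of overlap the correlation matrix shows. Rules of thumb for what counts as too large vary, so we treat the numbers as a guide rather than a cutoff.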
KNN Model
For our last model, we created a KNN model to see how well our independent variables predict our target. In the run shown below, cross-validation selected k = 7, which produced an RMSE of around 41, an R-squared of around 0.26, and an MAE of around 28. Because we draw a new random sample each time the code is run, these values shift from run to run. RMSE measures the typical size of the difference between predicted and actual values, with larger errors weighted more heavily. Our target variable ranged from 0 to around 550, so an RMSE of around 41 is roughly 7% of that range, which we consider moderate. Our R-squared implies that only around 26% of the variance in our target variable is explained by our independent variables, which is lower than what we would ideally want. Most likely, this is because many factors beyond the ones we picked affect our target variable, and we could not include them all without making the model too complicated. On the held-out test set, the RMSE was around 35, which is in line with the cross-validated estimate.
# Create training and testing datasets
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]
# Create a KNN model using caret
knn_model <- train(
  Number.of.Reviews ~ .,
  data = trainData,
  method = "knn",
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking
# View model details
knn_model
## k-Nearest Neighbors
##
## 65 samples
## 82 predictors
##
## Pre-processing: centered (82), scaled (82)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 59, 58, 58, 59, 60, 58, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 42.77032 0.2430612 29.13657
## 7 41.04926 0.2597987 28.29463
## 9 42.74899 0.1344118 29.61090
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
# Make predictions on the test dataset
predictions <- predict(knn_model, newdata = testData)
# Evaluate the model
RMSE <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error:", RMSE))
## [1] "Root Mean Squared Error: 35.0677603135854"
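The metrics discussed in the KNN section can also be computed by hand, which makes it easier to see what each one measures. The sketch below uses hypothetical actual and predicted values, not our Airbnb data. It assumes that R-squared is taken as the squared correlation between observed and predicted values, which to our understanding is what caret reports by default, and it adds RMSE as a fraction of the target range (nrmse) as an illustrative way to express the comparison against the range of the target.

```r
# Hypothetical actual and predicted review counts (not from the Airbnb sample).
actual <- c(0, 5, 12, 40, 100)
predicted <- c(10, 10, 20, 30, 60)

errors <- predicted - actual

# Root mean squared error: penalizes large errors more heavily.
rmse <- sqrt(mean(errors^2))

# Mean absolute error: average size of the errors.
mae <- mean(abs(errors))

# R-squared as squared correlation between observed and predicted values
# (assumed here to match caret's default definition).
r_squared <- cor(actual, predicted)^2

# RMSE relative to the observed range of the target.
nrmse <- rmse / diff(range(actual))

c(RMSE = rmse, MAE = mae, Rsquared = r_squared, NRMSE = nrmse)
```

With our own objects, predictions and testData$Number.of.Reviews would take the place of predicted and actual.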