Business Problem

AirBnB, the online booking company, has seen steady growth in both popularity and bookings. As a result, many data analysis projects have asked how different factors affect the bookings of a given host or listing. In this project, we analyze AirBnB data to see which independent variables can predict the number of reviews a listing receives, and from that infer which factors affect whether a listing gets booked. We will predict Number.of.Reviews using the following variables:

Host.Listings.Count
Zipcode
City
Accommodates
Price
Bedrooms
Review.Scores.Value
Cancellation.Policy
House.Rules
Des_length

Data Background

We took our data from OpenDataSoft (https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?flg=en-us&disjunctive.host_verifications&disjunctive.amenities&disjunctive.features), focusing only on listings in the United States. This dataset has 134,545 rows, but we will use only 1,000 of them for our analysis.

STEP 0 - Import libraries

## Loading required package: ggplot2

## Loading required package: lattice

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
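
The library calls themselves are not shown above, only their startup messages. A minimal sketch of how the packages could be loaded quietly is given below. The package list is an assumption: dplyr and randomForest are named in the messages, while caret and tm are only inferred from the "Loading required package" lines for ggplot2, lattice, and NLP.

```r
# Assumed package list; caret and tm are inferred, not confirmed by the source.
pkgs <- c("dplyr", "randomForest", "caret", "tm")

# Check which packages are installed without attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach each package while hiding the masking messages shown above.
  for (p in pkgs) {
    suppressPackageStartupMessages(library(p, character.only = TRUE))
  }
}
```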

STEP 1 - Read in the data

When reading in the data for our project, we first checked how many rows our original CSV contained. Then we read the first 100,000 rows into our first dataframe. From this dataframe, we created a second dataframe containing a random selection of 1,000 rows. Ideally, we would have imported more rows, but because this data set contains so many variables, computation became extremely slow. Thus, to keep our models manageable, we limited our selection. It is also important to note that we use a random sample of the rows, so each time we run our code we get slightly different outputs. For that reason, all of our explanations are in general terms rather than specifics.

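# Count the lines in the CSV after the header (skip = 1 skips the header line)
total_rows <- length(count.fields("airbnb-listings (1).csv", skip = 1))
# Read the first 100,000 rows of the semicolon-delimited file
df_rdin <- read.csv("airbnb-listings (1).csv", sep = ";", nrows = 100000)
  # This includes only United States data
# Draw a random sample of rows without replacement (no seed is set, so it changes on every run)
df_abnb <- df_rdin[sample(nrow(df_rdin), 100), ]
  # change to 10000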
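
Our analysis deliberately draws a new sample on each run. If a repeatable sample were wanted instead, one option is to fix the random seed before sampling. The sketch below illustrates this on a small made-up data frame; `df_toy`, `sample_rows`, and the seed value 42 are hypothetical names and values, not part of our project code.

```r
# Toy stand-in for df_rdin; the values are made up for illustration.
df_toy <- data.frame(ID = 1:50, Price = seq(20, 510, by = 10))

# Hypothetical helper: sample n rows, optionally fixing the seed first.
sample_rows <- function(df, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the RNG state only when asked
  df[sample(nrow(df), n), , drop = FALSE]
}

s1 <- sample_rows(df_toy, 10, seed = 42)
s2 <- sample_rows(df_toy, 10, seed = 42)

# With the same seed, both draws select the same rows in the same order.
print(identical(s1, s2))
```

Calling `sample_rows(df_toy, 10)` without a seed keeps the behaviour described above, where each run produces a different sample.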

STEP 2 - Observe the dataframe

Data Structure

After importing our data, we first looked at how it was structured using str(). We saw that more of the variables had a character data type than a numerical one. We also found several variables whose values we did not understand, such as the multiple Review.Scores variables. Likewise, some variables carried overlapping information, such as City, Market, State, Smart.Location, Longitude, Latitude, and County. We recognized that these would need to be cut so that we would not have multiple variables representing the same thing.

str(df_abnb)

## 'data.frame':    100 obs. of  89 variables:
##  $ ID                            : int  17601403 11227489 15162822 17269606 11376136 18324856 17697886 13375877 13320171 15595774 ...
##  $ Listing.Url                   : chr  "https://www.airbnb.com/rooms/17601403" "https://www.airbnb.com/rooms/11227489" "https://www.airbnb.com/rooms/15162822" "https://www.airbnb.com/rooms/17269606" ...
##  $ Scrape.ID                     : num  2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
##  $ Last.Scraped                  : chr  "2017-06-02" "2017-05-04" "2017-05-03" "2017-03-07" ...
##  $ Name                          : chr  "Lofted bed studio - work in progress" "Room rent for close to Manhattan" "Perfect location near everything, yet serene" "Greenbelt Private Rooms" ...
##  $ Summary                       : chr  "My place is a work in progress - there may be different furniture or updates when you arrive. I am close to Pag"| __truncated__ "My posh 2 bedroom apartment (1 room share) in private house close to buses and subways to Manhattan(from subway"| __truncated__ "My place is right above Downtown Burbank, great views, theatres, restaurants, and dining, family-friendly activ"| __truncated__ "My place is close to Close to the Domain shopping center, Walnut Creek Greenbelt, Food, Shopping, and 15 minute"| __truncated__ ...
##  $ Space                         : chr  "The is a historic single shotgun home in the Esplanade Ridge/ 7th Ward neighborhood. Walking distance to the Fa"| __truncated__ "" "" "" ...
##  $ Description                   : chr  "My place is a work in progress - there may be different furniture or updates when you arrive. I am close to Pag"| __truncated__ "My posh 2 bedroom apartment (1 room share) in private house close to buses and subways to Manhattan(from subway"| __truncated__ "My place is right above Downtown Burbank, great views, theatres, restaurants, and dining, family-friendly activ"| __truncated__ "My place is close to Close to the Domain shopping center, Walnut Creek Greenbelt, Food, Shopping, and 15 minute"| __truncated__ ...
##  $ Experiences.Offered           : chr  "none" "none" "none" "none" ...
##  $ Neighborhood.Overview         : chr  "This is a really pretty historic neighborhood a few blocks from Esplanade Ave. You can walk a couple miles to t"| __truncated__ "" "It is a nice hillside neighborhood is very safe. There is plenty of free street parking with no time restrictions." "" ...
##  $ Notes                         : chr  "We have dogs. They will be locked out of the room, but you may hear them!" "" "Please do not reserve if you smoke, I am super allergic and the meds I have to take to deal even a little hint "| __truncated__ "" ...
##  $ Transit                       : chr  "There's lots to walk around and see in New Orleans. We have one extra bike we can loan out to a short person.  "| __truncated__ "" "There is a Metro 6 blocks down, and a city bus stop one block over. The Interstate 5 Freeway is 7 blocks down the hill." "" ...
##  $ Access                        : chr  "You will just have access to the first room of the house.  The thermostat is not accessible to you, but we are "| __truncated__ "" "You are free to use the fridge, make coffee. There is a front porch with nice sunsets if you want to chill." "" ...
##  $ Interaction                   : chr  "Harvey or I will meet you on arrival and show you around the room. You can reach me by text throughout the day."| __truncated__ "" "If I am not working, I'm up for a chat!" "" ...
##  $ House.Rules                   : chr  "- Respect neighbors - quiet after 11pm" "" "- I have a small dog Sophie, and a house cat that shares the house - I am near many of the Studios if you are h"| __truncated__ "" ...
##  $ Thumbnail.Url                 : chr  "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=small" ...
##  $ Medium.Url                    : chr  "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=medium" ...
##  $ Picture.Url                   : chr  "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/90e39ca76f57ad4c3d55ad42787fc533" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/12efa7eb6e6b7fb176469302b18014b7" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/889dce574e52778d3a6aaa89f5ad279d" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/b42360e2c1621dbe3a13e0448e98e74f" ...
##  $ XL.Picture.Url                : chr  "" "https://a0.muscache.com/im/pictures/ccece7ab-6a0b-4034-af30-ad08ed83ed3b.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/d36889ba-52b4-428b-8b9b-12f69122e17a.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/cc46053c-0c53-4782-af7d-b0e17b4a78dc.jpg?aki_policy=x_large" ...
##  $ Host.ID                       : int  9773371 36526046 798773 99770161 57018721 75100240 120577640 4307894 75327963 78919700 ...
##  $ Host.URL                      : chr  "https://www.airbnb.com/users/show/9773371" "https://www.airbnb.com/users/show/36526046" "https://www.airbnb.com/users/show/798773" "https://www.airbnb.com/users/show/99770161" ...
##  $ Host.Name                     : chr  "Nicole" "Shafiqul" "Jennifer" "Samantha" ...
##  $ Host.Since                    : chr  "2013-11-02" "2015-06-23" "2011-07-09" "2016-10-15" ...
##  $ Host.Location                 : chr  "New Orleans, Louisiana, United States" "New York, New York, United States" "Los Angeles, California, United States" "College Station, Texas, United States" ...
##  $ Host.About                    : chr  "I am a transplant to New Orleans often looking for an escape. I love spending time outdoors and going on advent"| __truncated__ "Self Employe" "" "" ...
##  $ Host.Response.Time            : chr  "within an hour" "within an hour" "within an hour" "within an hour" ...
##  $ Host.Response.Rate            : int  100 100 98 100 90 100 NA 100 NA 100 ...
##  $ Host.Acceptance.Rate          : chr  "" "" "" "" ...
##  $ Host.Thumbnail.Url            : chr  "https://a0.muscache.com/im/users/9773371/profile_pic/1431439109/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/15245b78-9885-4808-9281-37519247e490.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/0c03b70d-1537-4677-acd6-3121079325fc.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/1e27bb13-b3e9-403c-80c3-db432eea94ab.jpg?aki_policy=profile_small" ...
##  $ Host.Picture.Url              : chr  "https://a0.muscache.com/im/users/9773371/profile_pic/1431439109/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/15245b78-9885-4808-9281-37519247e490.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/0c03b70d-1537-4677-acd6-3121079325fc.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/1e27bb13-b3e9-403c-80c3-db432eea94ab.jpg?aki_policy=profile_x_medium" ...
##  $ Host.Neighbourhood            : chr  "Seventh Ward" "Astoria" "Burbank" "" ...
##  $ Host.Listings.Count           : int  1 1 2 1 1 1 1 1 1 4 ...
##  $ Host.Total.Listings.Count     : int  1 1 2 1 1 1 1 1 1 4 ...
##  $ Host.Verifications            : chr  "email,phone,reviews,kba" "email,phone,reviews" "email,phone,reviews,jumio,offline_government_id,government_id" "email,phone,facebook" ...
##  $ Street                        : chr  "Seventh Ward, New Orleans, LA 70119, United States" "Astoria, Queens, NY 11102, United States" "Burbank, Burbank, CA 91501, United States" "Bruneau Trail, Austin, TX 78754, United States" ...
##  $ Neighbourhood                 : chr  "Seventh Ward" "Astoria" "Burbank" "" ...
##  $ Neighbourhood.Cleansed        : chr  "Seventh Ward" "Astoria" "Burbank" "78754" ...
##  $ Neighbourhood.Group.Cleansed  : chr  "" "Queens" "" "" ...
##  $ City                          : chr  "New Orleans" "Queens" "Burbank" "Austin" ...
##  $ State                         : chr  "LA" "NY" "CA" "TX" ...
##  $ Zipcode                       : chr  "70119" "11102" "91501" "78754" ...
##  $ Market                        : chr  "New Orleans" "New York" "Los Angeles" "Austin" ...
##  $ Smart.Location                : chr  "New Orleans, LA" "Queens, NY" "Burbank, CA" "Austin, TX" ...
##  $ Country.Code                  : chr  "US" "US" "US" "US" ...
##  $ Country                       : chr  "United States" "United States" "United States" "United States" ...
##  $ Latitude                      : num  30 40.8 34.2 30.4 34 ...
##  $ Longitude                     : num  -90.1 -73.9 -118.3 -97.7 -118.5 ...
##  $ Property.Type                 : chr  "Apartment" "House" "House" "House" ...
##  $ Room.Type                     : chr  "Entire home/apt" "Private room" "Private room" "Private room" ...
##  $ Accommodates                  : int  2 2 2 4 4 3 1 8 1 3 ...
##  $ Bathrooms                     : num  1 1 1 1 1.5 1 1 1.5 1 1 ...
##  $ Bedrooms                      : int  1 1 1 2 2 1 1 2 1 0 ...
##  $ Beds                          : int  1 1 1 2 3 1 1 2 1 1 ...
##  $ Bed.Type                      : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ Amenities                     : chr  "Internet,Wireless Internet,Air conditioning,Breakfast,Pets live on this property,Dog(s),Heating,Smoke detector,"| __truncated__ "Internet,Wireless Internet,Kitchen,Heating,Family/kid friendly,Shampoo,24-hour check-in,Hangers,Iron,Laptop friendly workspace" "Wireless Internet,Air conditioning,Pool,Kitchen,Free parking on premises,Pets live on this property,Dog(s),Cat("| __truncated__ "TV,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Family/kid friendly,Washer,Dryer"| __truncated__ ...
##  $ Square.Feet                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Price                         : int  80 45 78 30 204 300 250 75 79 199 ...
##  $ Weekly.Price                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Monthly.Price                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Security.Deposit              : int  100 NA NA NA 500 500 NA 100 NA NA ...
##  $ Cleaning.Fee                  : int  20 NA NA NA 80 100 NA 60 NA 50 ...
##  $ Guests.Included               : int  2 1 1 1 1 1 1 2 1 1 ...
##  $ Extra.People                  : int  0 0 0 0 0 0 0 25 0 0 ...
##  $ Minimum.Nights                : int  2 3 1 1 3 5 1 1 1 3 ...
##  $ Maximum.Nights                : int  12 1125 14 4 1125 1125 3 1125 14 1125 ...
##  $ Calendar.Updated              : chr  "2 weeks ago" "3 months ago" "today" "5 days ago" ...
##  $ Has.Availability              : chr  "" "" "" "" ...
##  $ Availability.30               : int  0 0 17 0 12 1 29 28 0 26 ...
##  $ Availability.60               : int  0 3 47 0 24 2 59 58 0 56 ...
##  $ Availability.90               : int  0 4 77 0 29 2 89 58 0 86 ...
##  $ Availability.365              : int  0 146 77 34 84 2 89 58 0 361 ...
##  $ Calendar.last.Scraped         : chr  "2017-06-02" "2017-05-04" "2017-05-02" "2017-03-06" ...
##  $ Number.of.Reviews             : int  4 32 51 0 15 0 0 2 41 1 ...
##  $ First.Review                  : chr  "2017-03-19" "2016-03-16" "2016-09-30" "" ...
##  $ Last.Review                   : chr  "2017-05-08" "2017-05-03" "2017-04-30" "" ...
##  $ Review.Scores.Rating          : int  100 91 94 NA 93 NA NA 100 98 60 ...
##  $ Review.Scores.Accuracy        : int  10 9 9 NA 10 NA NA 8 10 2 ...
##  $ Review.Scores.Cleanliness     : int  10 9 9 NA 10 NA NA 10 10 10 ...
##  $ Review.Scores.Checkin         : int  10 9 10 NA 10 NA NA 10 10 10 ...
##  $ Review.Scores.Communication   : int  10 10 10 NA 10 NA NA 10 10 10 ...
##  $ Review.Scores.Location        : int  10 9 10 NA 10 NA NA 10 10 6 ...
##  $ Review.Scores.Value           : int  10 9 10 NA 9 NA NA 10 10 8 ...
##  $ License                       : chr  "City registration pending" "" "" "" ...
##  $ Jurisdiction.Names            : chr  "Louisiana State, New Orleans, LA" "" "" "" ...
##  $ Cancellation.Policy           : chr  "flexible" "flexible" "flexible" "flexible" ...
##  $ Calculated.host.listings.count: int  1 1 2 1 1 1 1 1 1 2 ...
##  $ Reviews.per.Month             : num  1.58 2.31 7.08 NA 1.04 NA NA 0.17 3.92 0.17 ...
##  $ Geolocation                   : chr  "29.977031012628125, -90.07298514292046" "40.76890506956011, -73.92859788917319" "34.19014920848236, -118.29887567566182" "30.364437053063515, -97.65307062568792" ...
##  $ Features                      : chr  "Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License" "Host Has Profile Pic,Is Location Exact,Instant Bookable" "Host Has Profile Pic,Host Identity Verified,Is Location Exact" "Host Has Profile Pic,Instant Bookable" ...
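
To make the two observations above concrete, the following sketch counts columns by data type and keeps a single location variable while dropping an overlapping one. It runs on a made-up two-row data frame (`df_toy`), and the `keep` vector is only an illustrative subset of our variables.

```r
# Toy stand-in for df_abnb; the rows are made up for illustration.
df_toy <- data.frame(
  City = c("Austin", "Queens"),
  Smart.Location = c("Austin, TX", "Queens, NY"),
  Price = c(30L, 45L),
  Bedrooms = c(2L, 1L),
  Number.of.Reviews = c(0L, 32L),
  stringsAsFactors = FALSE
)

# Count how many columns fall under each data type.
col_types <- vapply(df_toy, function(col) class(col)[1], character(1))
type_counts <- table(col_types)
print(type_counts)

# Keep one location column (City) and drop the overlapping Smart.Location.
keep <- c("City", "Price", "Bedrooms", "Number.of.Reviews")
df_small <- df_toy[, intersect(keep, names(df_toy)), drop = FALSE]
str(df_small)
```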

Missing Values

After observing our variables, we looked at which of them contained missing data. The results showed that some variables, such as Square.Feet, Weekly.Price, and Monthly.Price, had far more missing values than actual values, each missing over 80% of the 1,000 rows we sampled. We also decided to remove Host.Response.Rate: even though we thought it could have an impact on our target variable, it had over 2,000 missing values.

sort(colSums(is.na(df_abnb)), decreasing = TRUE)

##                    Square.Feet                   Weekly.Price 
##                            100                             84 
##                  Monthly.Price               Security.Deposit 
##                             77                             59 
##                   Cleaning.Fee             Host.Response.Rate 
##                             26                             20 
##           Review.Scores.Rating         Review.Scores.Accuracy 
##                             20                             20 
##      Review.Scores.Cleanliness          Review.Scores.Checkin 
##                             20                             20 
##    Review.Scores.Communication         Review.Scores.Location 
##                             20                             20 
##            Review.Scores.Value              Reviews.per.Month 
##                             20                             18 
##                             ID                    Listing.Url 
##                              0                              0 
##                      Scrape.ID                   Last.Scraped 
##                              0                              0 
##                           Name                        Summary 
##                              0                              0 
##                          Space                    Description 
##                              0                              0 
##            Experiences.Offered          Neighborhood.Overview 
##                              0                              0 
##                          Notes                        Transit 
##                              0                              0 
##                         Access                    Interaction 
##                              0                              0 
##                    House.Rules                  Thumbnail.Url 
##                              0                              0 
##                     Medium.Url                    Picture.Url 
##                              0                              0 
##                 XL.Picture.Url                        Host.ID 
##                              0                              0 
##                       Host.URL                      Host.Name 
##                              0                              0 
##                     Host.Since                  Host.Location 
##                              0                              0 
##                     Host.About             Host.Response.Time 
##                              0                              0 
##           Host.Acceptance.Rate             Host.Thumbnail.Url 
##                              0                              0 
##               Host.Picture.Url             Host.Neighbourhood 
##                              0                              0 
##            Host.Listings.Count      Host.Total.Listings.Count 
##                              0                              0 
##             Host.Verifications                         Street 
##                              0                              0 
##                  Neighbourhood         Neighbourhood.Cleansed 
##                              0                              0 
##   Neighbourhood.Group.Cleansed                           City 
##                              0                              0 
##                          State                        Zipcode 
##                              0                              0 
##                         Market                 Smart.Location 
##                              0                              0 
##                   Country.Code                        Country 
##                              0                              0 
##                       Latitude                      Longitude 
##                              0                              0 
##                  Property.Type                      Room.Type 
##                              0                              0 
##                   Accommodates                      Bathrooms 
##                              0                              0 
##                       Bedrooms                           Beds 
##                              0                              0 
##                       Bed.Type                      Amenities 
##                              0                              0 
##                          Price                Guests.Included 
##                              0                              0 
##                   Extra.People                 Minimum.Nights 
##                              0                              0 
##                 Maximum.Nights               Calendar.Updated 
##                              0                              0 
##               Has.Availability                Availability.30 
##                              0                              0 
##                Availability.60                Availability.90 
##                              0                              0 
##               Availability.365          Calendar.last.Scraped 
##                              0                              0 
##              Number.of.Reviews                   First.Review 
##                              0                              0 
##                    Last.Review                        License 
##                              0                              0 
##             Jurisdiction.Names            Cancellation.Policy 
##                              0                              0 
## Calculated.host.listings.count                    Geolocation 
##                              0                              0 
##                       Features 
##                              0
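
Raw counts like those above are easier to compare as shares of the sample. The sketch below computes the fraction of missing values per column and drops columns above a cutoff. The data frame is made up, and the 0.5 cutoff is an assumed value chosen for illustration, not the rule we applied.

```r
# Toy stand-in for df_abnb; the values are made up for illustration.
df_toy <- data.frame(
  Square.Feet = c(NA, NA, NA, NA, 800),
  Cleaning.Fee = c(20, NA, 80, 100, 60),
  Price = c(80, 45, 78, 30, 204)
)

# Share of missing values per column, largest first.
na_share <- sort(colMeans(is.na(df_toy)), decreasing = TRUE)
print(round(na_share, 2))

# Assumed cutoff: drop any column with more than half of its values missing.
threshold <- 0.5
drop_cols <- names(na_share)[na_share > threshold]
df_clean <- df_toy[, setdiff(names(df_toy), drop_cols), drop = FALSE]
print(names(df_clean))
```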

Summary Statistics

Lastly, we generated summary statistics for our data. These gave us quantitative information about each of our numerical variables, such as Host.Listings.Count, Accommodates, and Price. One variable we found particularly interesting was Host.Listings.Count: although it had a mean of around 8 listings, its minimum was 0 listings, its 3rd quartile was around 3 listings, and its maximum was around 800 listings. This shows that although the majority of hosts had between 1 and 3 listings, a small group of outliers had hundreds of listings, which pulls the mean well above the median. Another variable whose summary statistics are important to note is the target variable, Number.of.Reviews, which has a minimum of 0, a mean of around 20, and a maximum of around 400. This gave us an idea of how wide the range of our target variable was, which could affect some of our models' statistical results.
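
The effect of a few very large hosts on the mean can be seen in a small example. The sketch below uses made-up listing counts and the common 1.5 × IQR rule to flag outliers; this rule is an assumption for illustration, not a step in our analysis.

```r
# Made-up listing counts: most hosts have a few listings, one has hundreds.
listings <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 628)

# The single large value pulls the mean far above the median.
print(mean(listings))
print(median(listings))

# Flag values above Q3 + 1.5 * IQR as outliers (assumed rule of thumb).
q <- quantile(listings, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
outliers <- listings[listings > upper_fence]
print(outliers)
```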

summary(df_abnb)

##        ID           Listing.Url          Scrape.ID         Last.Scraped      
##  Min.   :   13059   Length:100         Min.   :2.016e+13   Length:100        
##  1st Qu.: 5032163   Class :character   1st Qu.:2.017e+13   Class :character  
##  Median :10400087   Mode  :character   Median :2.017e+13   Mode  :character  
##  Mean   : 9669259                      Mean   :2.017e+13                     
##  3rd Qu.:14520198                      3rd Qu.:2.017e+13                     
##  Max.   :18324856                      Max.   :2.017e+13                     
##                                                                              
##      Name             Summary             Space           Description       
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Experiences.Offered Neighborhood.Overview    Notes          
##  Length:100          Length:100            Length:100        
##  Class :character    Class :character      Class :character  
##  Mode  :character    Mode  :character      Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##    Transit             Access          Interaction        House.Rules       
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Thumbnail.Url       Medium.Url        Picture.Url        XL.Picture.Url    
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Host.ID            Host.URL          Host.Name          Host.Since       
##  Min.   :    50866   Length:100         Length:100         Length:100        
##  1st Qu.:  4297625   Class :character   Class :character   Class :character  
##  Median : 17248958   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 30217686                                                           
##  3rd Qu.: 48761176                                                           
##  Max.   :120577640                                                           
##                                                                              
##  Host.Location       Host.About        Host.Response.Time Host.Response.Rate
##  Length:100         Length:100         Length:100         Min.   :  0.00    
##  Class :character   Class :character   Class :character   1st Qu.:100.00    
##  Mode  :character   Mode  :character   Mode  :character   Median :100.00    
##                                                           Mean   : 95.53    
##                                                           3rd Qu.:100.00    
##                                                           Max.   :100.00    
##                                                           NA's   :20        
##  Host.Acceptance.Rate Host.Thumbnail.Url Host.Picture.Url   Host.Neighbourhood
##  Length:100           Length:100         Length:100         Length:100        
##  Class :character     Class :character   Class :character   Class :character  
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  Host.Listings.Count Host.Total.Listings.Count Host.Verifications
##  Min.   :  1.00      Min.   :  1.00            Length:100        
##  1st Qu.:  1.00      1st Qu.:  1.00            Class :character  
##  Median :  1.00      Median :  1.00            Mode  :character  
##  Mean   : 15.43      Mean   : 15.43                              
##  3rd Qu.:  3.00      3rd Qu.:  3.00                              
##  Max.   :628.00      Max.   :628.00                              
##                                                                  
##     Street          Neighbourhood      Neighbourhood.Cleansed
##  Length:100         Length:100         Length:100            
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##                                                              
##  Neighbourhood.Group.Cleansed     City              State          
##  Length:100                   Length:100         Length:100        
##  Class :character             Class :character   Class :character  
##  Mode  :character             Mode  :character   Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##    Zipcode             Market          Smart.Location     Country.Code      
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    Country             Latitude       Longitude       Property.Type     
##  Length:100         Min.   :29.92   Min.   :-122.70   Length:100        
##  Class :character   1st Qu.:32.80   1st Qu.:-118.39   Class :character  
##  Mode  :character   Median :34.17   Median : -97.85   Mode  :character  
##                     Mean   :36.25   Mean   :-100.77                     
##                     3rd Qu.:40.72   3rd Qu.: -84.32                     
##                     Max.   :47.52   Max.   : -71.05                     
##                                                                         
##   Room.Type          Accommodates     Bathrooms        Bedrooms   
##  Length:100         Min.   : 1.00   Min.   :1.000   Min.   :0.00  
##  Class :character   1st Qu.: 2.00   1st Qu.:1.000   1st Qu.:1.00  
##  Mode  :character   Median : 2.00   Median :1.000   Median :1.00  
##                     Mean   : 3.61   Mean   :1.375   Mean   :1.44  
##                     3rd Qu.: 4.25   3rd Qu.:1.500   3rd Qu.:2.00  
##                     Max.   :14.00   Max.   :8.000   Max.   :4.00  
##                                                                   
##       Beds        Bed.Type          Amenities          Square.Feet 
##  Min.   :1.00   Length:100         Length:100         Min.   : NA  
##  1st Qu.:1.00   Class :character   Class :character   1st Qu.: NA  
##  Median :1.00   Mode  :character   Mode  :character   Median : NA  
##  Mean   :1.88                                         Mean   :NaN  
##  3rd Qu.:2.00                                         3rd Qu.: NA  
##  Max.   :7.00                                         Max.   : NA  
##                                                       NA's   :100  
##      Price         Weekly.Price   Monthly.Price   Security.Deposit
##  Min.   : 25.00   Min.   :325.0   Min.   :  900   Min.   :100.0   
##  1st Qu.: 78.75   1st Qu.:403.8   1st Qu.: 1600   1st Qu.:150.0   
##  Median :115.00   Median :570.0   Median : 2150   Median :250.0   
##  Mean   :174.09   Mean   :557.5   Mean   : 3218   Mean   :299.4   
##  3rd Qu.:200.00   3rd Qu.:607.8   3rd Qu.: 3998   3rd Qu.:500.0   
##  Max.   :947.00   Max.   :960.0   Max.   :11000   Max.   :500.0   
##                   NA's   :84      NA's   :77      NA's   :59      
##   Cleaning.Fee    Guests.Included  Extra.People    Minimum.Nights 
##  Min.   :  5.00   Min.   : 1.00   Min.   :  0.00   Min.   : 1.00  
##  1st Qu.: 30.50   1st Qu.: 1.00   1st Qu.:  0.00   1st Qu.: 1.00  
##  Median : 60.00   Median : 1.00   Median :  0.00   Median : 2.00  
##  Mean   : 82.41   Mean   : 1.83   Mean   : 14.52   Mean   : 3.27  
##  3rd Qu.:100.00   3rd Qu.: 2.00   3rd Qu.: 25.00   3rd Qu.: 3.00  
##  Max.   :400.00   Max.   :10.00   Max.   :150.00   Max.   :30.00  
##  NA's   :26                                                       
##  Maximum.Nights    Calendar.Updated   Has.Availability   Availability.30
##  Min.   :   2.00   Length:100         Length:100         Min.   : 0.00  
##  1st Qu.:  29.75   Class :character   Class :character   1st Qu.: 0.00  
##  Median :1125.00   Mode  :character   Mode  :character   Median : 8.50  
##  Mean   : 697.34                                         Mean   :11.36  
##  3rd Qu.:1125.00                                         3rd Qu.:19.25  
##  Max.   :1125.00                                         Max.   :30.00  
##                                                                         
##  Availability.60 Availability.90 Availability.365 Calendar.last.Scraped
##  Min.   : 0.00   Min.   : 0.00   Min.   :  0.00   Length:100           
##  1st Qu.: 2.00   1st Qu.: 3.00   1st Qu.: 17.75   Class :character     
##  Median :21.00   Median :47.00   Median :163.00   Mode  :character     
##  Mean   :25.52   Mean   :42.43   Mean   :181.24                        
##  3rd Qu.:47.50   3rd Qu.:76.00   3rd Qu.:329.25                        
##  Max.   :60.00   Max.   :90.00   Max.   :365.00                        
##                                                                        
##  Number.of.Reviews First.Review       Last.Review        Review.Scores.Rating
##  Min.   :  0.00    Length:100         Length:100         Min.   : 60.00      
##  1st Qu.:  1.00    Class :character   Class :character   1st Qu.: 93.00      
##  Median :  8.00    Mode  :character   Mode  :character   Median : 97.00      
##  Mean   : 24.92                                          Mean   : 95.08      
##  3rd Qu.: 27.50                                          3rd Qu.:100.00      
##  Max.   :292.00                                          Max.   :100.00      
##                                                          NA's   :20          
##  Review.Scores.Accuracy Review.Scores.Cleanliness Review.Scores.Checkin
##  Min.   : 2.000         Min.   : 7.000            Min.   : 9.000       
##  1st Qu.: 9.750         1st Qu.: 9.000            1st Qu.:10.000       
##  Median :10.000         Median :10.000            Median :10.000       
##  Mean   : 9.625         Mean   : 9.575            Mean   : 9.863       
##  3rd Qu.:10.000         3rd Qu.:10.000            3rd Qu.:10.000       
##  Max.   :10.000         Max.   :10.000            Max.   :10.000       
##  NA's   :20             NA's   :20                NA's   :20           
##  Review.Scores.Communication Review.Scores.Location Review.Scores.Value
##  Min.   : 9.00               Min.   : 6.0           Min.   : 7.000     
##  1st Qu.:10.00               1st Qu.: 9.0           1st Qu.: 9.000     
##  Median :10.00               Median :10.0           Median :10.000     
##  Mean   : 9.85               Mean   : 9.6           Mean   : 9.588     
##  3rd Qu.:10.00               3rd Qu.:10.0           3rd Qu.:10.000     
##  Max.   :10.00               Max.   :10.0           Max.   :10.000     
##  NA's   :20                  NA's   :20             NA's   :20         
##    License          Jurisdiction.Names Cancellation.Policy
##  Length:100         Length:100         Length:100         
##  Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   
##                                                           
##                                                           
##                                                           
##                                                           
##  Calculated.host.listings.count Reviews.per.Month Geolocation       
##  Min.   : 1.00                  Min.   :0.0400    Length:100        
##  1st Qu.: 1.00                  1st Qu.:0.3975    Class :character  
##  Median : 1.00                  Median :1.0350    Mode  :character  
##  Mean   : 4.52                  Mean   :1.7778                      
##  3rd Qu.: 2.00                  3rd Qu.:2.6375                      
##  Max.   :69.00                  Max.   :7.0800                      
##                                 NA's   :18                          
##    Features        
##  Length:100        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

STEP 3 - Clean the Data

Since our dataframe contains a large number of variables, as seen in the previous step, we do not want to build our model on all of them, since it would become too complicated and difficult to manage. So, first, we will create a new dataframe containing only the variables that we believe are significant or could have some influence on the target variable in our model. This step also includes creating new columns that we believe could be helpful and editing columns that need to be changed. Once we select our variables of interest, we will clean the data by omitting rows with missing values.

Selecting Columns of Interest

We thought that it would be interesting to use the length of the “Description” column as one of our input variables, so we first created that column. Then, we saw that “Zipcode” was read in as a character rather than an integer, so we converted all of the values in that column. Finally, we selected our variables of interest for our new dataframe.

# Create column for length of description
df_abnb <- df_abnb %>%
  mutate(Des_length = nchar(as.character(Description)))
# Change Zipcode from chr to int
df_abnb$Zipcode <- as.integer(df_abnb$Zipcode)

# Select only columns of interest
df_abnb_new <- df_abnb[,c("Host.Listings.Count","Zipcode","City", "Accommodates", "Price", "Bedrooms", "Review.Scores.Value", "Cancellation.Policy", "Amenities", "House.Rules", "Des_length", "Number.of.Reviews")]
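
One caveat with the integer conversion above: as.integer() drops leading zeros and returns NA for any string that is not purely numeric, such as a ZIP+4 code. The sketch below illustrates this on made-up values; the vector zip_chr and the alternative of keeping a 5-character factor are illustrative assumptions, not part of our pipeline.

```r
# Hypothetical ZIP code strings, not taken from the AirBnB data
zip_chr <- c("02134", "90210", "10003-1234", "")

# Integer conversion: leading zeros are lost and non-numeric strings become NA
zip_int <- suppressWarnings(as.integer(zip_chr))
print(zip_int)

# Alternative: keep the first five characters and treat the code as a category
zip5 <- substr(zip_chr, 1, 5)
zip5[zip5 == ""] <- NA
zip_fac <- factor(zip5)
print(levels(zip_fac))
```

Because na.omit() is applied later, any row whose Zipcode becomes NA in the conversion would be dropped from the modeling data.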

Omitting Missing Values

Of the variables that we picked for our model, Host.Listings.Count, Accommodates, Price, Bedrooms, and Review.Scores.Value all contained missing values, so we removed every row with a missing value in any selected column.

df_abnb_new <- na.omit(df_abnb_new)
View(head(df_abnb_new, 10))
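
Before calling na.omit(), it can help to check how many values are missing per column and how many rows the call will drop. A minimal sketch on a toy dataframe (the dataframe toy and its values are made up for illustration):

```r
# Toy data frame standing in for df_abnb_new (values are made up)
toy <- data.frame(
  Price    = c(100, NA, 80, 120),
  Bedrooms = c(1, 2, NA, 3),
  City     = c("Austin", "Boston", "Denver", "Seattle")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows kept and dropped by na.omit()
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

In this toy example two of the four rows are dropped, even though each column is missing only one value, because na.omit() removes a row if any of its columns is missing.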

STEP 4 - Text Mining

Since many of the variables in our dataset are character (text) variables, we first have to perform text mining on any of them that we want to work with. Of the variables of interest that we picked for our model, the only ones that require text mining are “Amenities” and “Cancellation.Policy”.

Combining Columns

Since the Amenities and Cancellation.Policy columns were similar and would both need to be text mined, we decided to combine them into one column which we would then perform the text mining on.

df_abnb_new$combined <- paste(df_abnb_new$Amenities, df_abnb_new$Cancellation.Policy)
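# Drop the two original text columns (positions 8 and 9) by name
df_abnb_new <- df_abnb_new[, !(names(df_abnb_new) %in% c("Amenities", "Cancellation.Policy"))]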

Text Mining the Combined Column

# Preprocess text data
corpus <- VCorpus(VectorSource(df_abnb_new$combined))
corpus <- tm_map(corpus, content_transformer(tolower))

# Terms are separated by punctuation (commas, hyphens, slashes, etc.)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "-")   # Replace hyphens with spaces
corpus <- tm_map(corpus, toSpace, "\\.") # Replace periods with spaces
corpus <- tm_map(corpus, toSpace, ",")   # Replace commas with spaces
corpus <- tm_map(corpus, toSpace, "/")   # Replace slashes with spaces
corpus <- tm_map(corpus, toSpace, "_")   # Replace underscores with spaces

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
words <- colnames(matrix)

# Convert to a dataframe for modeling
text_data <- as.data.frame(matrix, stringsAsFactors = FALSE)

# Combine with original data (make sure row order is the same)
final_data_1 <- cbind(text_data, Host.Listings.Count = df_abnb_new$Host.Listings.Count, Zipcode = df_abnb_new$Zipcode, 
                      Accommodates = df_abnb_new$Accommodates , Price = df_abnb_new$Price, Bedrooms = df_abnb_new$Bedrooms, 
                      Review.Scores.Value = df_abnb_new$Review.Scores.Value,Des_length =  df_abnb_new$Des_length,
                      Number.of.Reviews = df_abnb_new$Number.of.Reviews)
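
To make the idea of a document-term matrix concrete, the sketch below builds one by hand in base R for two made-up listings. It is only an illustration of the structure (one row per listing, one column per word, cells holding counts); it does not use the tm package, and the names docs, tokens, vocab, and toy_dtm are hypothetical.

```r
# Two made-up listings; the strings imitate the combined Amenities + policy text
docs <- c("wifi,kitchen,free parking strict", "wifi,heating flexible")

# Lowercase, turn separators into spaces, and split into words
tokens <- strsplit(gsub("[,/_.-]", " ", tolower(docs)), "\\s+")

# Build the vocabulary and count each term per document
vocab <- sort(unique(unlist(tokens)))
count_terms <- function(tk) as.integer(table(factor(tk, levels = vocab)))
toy_dtm <- t(vapply(tokens, count_terms, integer(length(vocab))))
colnames(toy_dtm) <- vocab
print(toy_dtm)

# Before cbind(), confirm there is one row of term counts per listing
stopifnot(nrow(toy_dtm) == length(docs))
```

The final check mirrors the note in our code about row order: cbind() pairs rows purely by position, so the term counts and the original columns must have the same number of rows in the same order.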

STEP 5 - Creating Models

Given our data and chosen business problem, we decided that the three models best suited to our data were Random Forest, Multiple Linear Regression (MLR), and K-Nearest Neighbors (KNN).

Random Forest Model

To create our random forest model, we first used the caret package to train the model with 5-fold cross-validation, then made predictions on the validation set to see how the model performed. As seen in the output below, caret's train function selected the optimal model by choosing the smallest cross-validated RMSE, which was about 38 in the run shown, at an mtry of 2. When evaluating the model on our validation data, we found a mean absolute error of about 31, which is relatively high, but could make sense considering the large range of our target variable, as mentioned previously.

# Load the modeling packages
library(caret)
library(randomForest)

# Set seed for reproducibility
set.seed(123)

# Create a data partition for train/test split
index <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.6, list = FALSE)
training_data <- final_data_1[index, ]
valid_data <- final_data_1[-index, ]

# Define the train control using cross-validation
train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Create a Random Forest model using caret's train function
rf_model_caret <- train(
  Number.of.Reviews ~ .,
  data = training_data,
  method = "rf",
  trControl = train_control
)

# View model details
rf_model_caret

## Random Forest 
## 
## 49 samples
## 82 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 38, 40, 40, 39, 39 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    37.87340  0.1910652  25.13559
##   42    44.85014  0.1241209  28.16363
##   82    51.32394  0.1107678  31.76138
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.

# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)

# Evaluate the model
MAE <- mean(abs(predictions - valid_data$Number.of.Reviews))
MAE

## [1] 30.91061
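
One way to judge whether an error of this size is large is to compare it with a naive baseline that ignores all predictors. The sketch below shows the calculation on made-up review counts; the vectors train_y and valid_y and the helper functions mae and rmse are hypothetical and are not part of our pipeline.

```r
# Made-up review counts for a training and a validation split
train_y <- c(0, 2, 5, 8, 12, 30, 75)
valid_y <- c(1, 4, 10, 40)

# Helper functions for the two error metrics used in this report
mae  <- function(pred, actual) mean(abs(pred - actual))
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Naive baselines: always predict the training mean or the training median
baseline_mean   <- rep(mean(train_y), length(valid_y))
baseline_median <- rep(median(train_y), length(valid_y))

print(mae(baseline_mean, valid_y))
print(mae(baseline_median, valid_y))
print(rmse(baseline_mean, valid_y))
```

If a fitted model's validation MAE is not clearly below the baseline MAE computed on the same split, the predictors are adding little beyond the typical review count.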

MLR Model

To create our MLR model, we first used the lm() function to fit a model using only the numerical variables Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length. From this model, we then created a variety of plots to examine the residuals.

cust_value_model = lm(formula =  Number.of.Reviews ~ Accommodates + Price + 
                        Bedrooms + Review.Scores.Value + Des_length, 
                      data = df_abnb_new)
# Get the model residuals
model_residuals = cust_value_model$residuals

Histogram

For our MLR model, we first created a histogram of the residuals. As seen in the histogram below, our model’s residuals were right-skewed, indicating that the normality assumption is most likely violated. Further, most of the residuals fell between about -50 and 100.

# Plot a histogram of the residuals
hist(model_residuals, col = "skyblue", main = 'Histogram of MLR Model Residuals')

Residuals Plot

For our MLR model, we next created a Q-Q plot for the residuals of our model. As seen in the Q-Q plot below, our model’s residuals again showed a right-skew with a bit of randomness.

# Q-Q plot of the residuals
qqnorm(model_residuals, main = "Q-Q Plot of MLR Model Residuals")
# Plot the Q-Q line
qqline(model_residuals, col = "red")
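
A common response to right-skewed residuals with a count target is to model a transformed target such as log1p(y). The sketch below compares residual skewness for the two fits on simulated data; the data, the variable names, and the skewness helper are assumptions for illustration, and we did not apply this transformation in our analysis.

```r
set.seed(1)

# Simulated, right-skewed count target (not the AirBnB data)
n <- 200
x <- runif(n)
y <- rnbinom(n, size = 1, mu = exp(1 + 2 * x))

# Fit the same linear model on the raw and on the log1p-transformed target
fit_raw <- lm(y ~ x)
fit_log <- lm(log1p(y) ~ x)

# Sample skewness of the residuals (0 for a symmetric distribution)
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3

print(skewness(residuals(fit_raw)))
print(skewness(residuals(fit_log)))
```

A skewness closer to zero for the transformed fit would suggest that the normality assumption is less strained on the log scale, at the cost of predictions that must be back-transformed with expm1().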

Correlation Matrix

Lastly for our MLR model, we created a correlation matrix for the numerical predictors Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length; the target variable, Number.of.Reviews, is removed before the matrix is computed. As seen in the correlation matrix below, the highest correlation appeared to be between Bedrooms and Accommodates. This was followed closely by the correlation between Price and Accommodates and between Price and Bedrooms. Because these are correlations among predictors, they indicate that Bedrooms, Accommodates, and Price carry overlapping information (multicollinearity); they do not by themselves show which variables have the most impact on Number.of.Reviews.

df_cont <- df_abnb_new[ , c("Number.of.Reviews", "Accommodates", "Price",
                            "Bedrooms", "Review.Scores.Value", "Des_length")]
reduced_data <- subset(df_cont, select = -Number.of.Reviews)
# Compute the correlation matrix, rounded to 2 decimal places
corr_matrix = round(cor(reduced_data), 2)
# Plot the result
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower", lab = TRUE) + 
  ggtitle("Correlation Matrix of MLR Model Selected Variables")
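
To see how strongly each predictor relates to the target itself, the target can be kept in the matrix and its row inspected. The sketch below uses the built-in mtcars dataset as a stand-in, with mpg playing the role of the target; the choice of dataset and columns is purely illustrative.

```r
# mtcars ships with R; mpg plays the role of the target variable here
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Full correlation matrix, rounded to 2 decimal places
corr_all <- round(cor(vars), 2)

# Correlation of each predictor with the target, strongest first
target_corr <- corr_all["mpg", colnames(corr_all) != "mpg"]
target_corr <- target_corr[order(abs(target_corr), decreasing = TRUE)]
print(target_corr)
```

The same pattern would apply to our data by computing cor() on df_cont rather than on reduced_data and reading the Number.of.Reviews row.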

KNN Model

For our last model, we created a KNN model to see how well our independent variables predict our target. In the run shown below, the optimal k was 7, which produced a cross-validated RMSE of about 41, an R-squared of about 0.26, and an MAE of about 28. RMSE measures the typical size of the difference between predicted and actual values, with larger errors weighted more heavily. Although this is a relatively high RMSE, our target variable ranged from 0 to 292 in the sample summarized earlier, so we consider the error acceptable relative to that range. Our R-squared implies that only about 26% of the variance in our target variable is explained by our independent variables, which is lower than what we would ideally want. Most likely, this is because many more factors could impact our target variable besides the ones that we picked, and we could not include them all without making the model too complicated. Likewise, the root mean squared error on the held-out test set was about 35, which is moderate considering the range of our target variable.

# Create training and testing datasets
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]

# Create a KNN model using caret
knn_model <- train(
  Number.of.Reviews ~ .,
  data = trainData,
  method = "knn",
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: street

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smartlock

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: smoking

print(knn_model)

## k-Nearest Neighbors 
## 
## 65 samples
## 82 predictors
## 
## Pre-processing: centered (82), scaled (82) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 59, 58, 58, 59, 60, 58, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  42.77032  0.2430612  29.13657
##   7  41.04926  0.2597987  28.29463
##   9  42.74899  0.1344118  29.61090
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.

# Make predictions on the test dataset
predictions <- predict(knn_model, newdata = testData)

# Evaluate the model
RMSE <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error:", RMSE))

## [1] "Root Mean Squared Error: 35.0677603135854"
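
The zero-variance warnings printed during KNN training arise because a constant column cannot be scaled: its standard deviation is zero. The sketch below shows one way to find and drop such columns before centering and scaling, using base R on made-up data; the dataframe toy and its columns are hypothetical.

```r
# Made-up term-count columns; "smoking" is constant across all rows
toy <- data.frame(
  wifi    = c(1, 0, 1, 1),
  smoking = c(0, 0, 0, 0),
  Price   = c(100, 80, 120, 95)
)

# A column has zero variance when it holds a single distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
print(names(toy)[is_constant])

# Drop constant columns, then center and scale what remains
toy_scaled <- scale(toy[, !is_constant, drop = FALSE])
print(round(colMeans(toy_scaled), 10))
```

In cross-validation a column can be constant within a single fold even when it varies in the full data, so this check is a sketch of the idea rather than a guaranteed way to silence every warning.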

AirBnB Final Project - Group 7

Sofi Vietri, Alex Sonnier, Lulu Lemken, Kaela Bernardr

2023-12-04