Business Problem

AirBnB, the online booking company, has been growing steadily in popularity and bookings. Because of this growth, many data analysis projects have examined how different factors affect the bookings of a given host or listing. For this project, we analyze AirBnB data to see which independent variables can predict the number of reviews a given listing gets, and from that infer which factors affect a listing being booked. We will be predicting Number.of.Reviews using the following variables:

Host.Listings.Count
Zipcode
City
Accommodates
Price
Bedrooms
Review.Scores.Value
Cancellation.Policy
House.Rules
Des_length

Data Background

We took our data from OpenDataSoft (https://public.opendatasoft.com/explore/dataset/airbnb-listings/table/?flg=en-us&disjunctive.host_verifications&disjunctive.amenities&disjunctive.features), focusing only on listings in the United States. This dataset has 134,545 rows, but we will use a random sample of only 100 of them for our analysis.

STEP 0 - Import libraries

## Loading required package: ggplot2

## Loading required package: lattice

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

STEP 1 - Read in the data

When reading in the data for our project, we first checked how many rows our original csv contained. We then read the first 100,000 rows into our first dataframe, and from it created a second dataframe containing a random selection of 100 rows. Ideally, we would have imported more rows, but because this data set contains so many variables, doing so made the computation extremely slow. To keep our models manageable, we limited our selection. Note that because we use a random sample of the rows, each run of our code draws a different sample and produces slightly different outputs. For that reason, our explanations are given in general terms rather than specifics.

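# Count the lines in the CSV after the header
total_rows <- length(count.fields("airbnb-listings (1).csv", skip = 1))
# Read the first 100,000 rows; this file includes only United States data
df_rdin <- read.csv("airbnb-listings (1).csv", sep = ";", nrows = 100000)
# Draw a random sample of 100 rows for the analysis
df_abnb <- df_rdin[sample(nrow(df_rdin), 100), ]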
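
Because the sample is redrawn on every run, the outputs in this report change between runs. If a repeatable sample were ever wanted, one common option is to fix the random seed before calling sample(). The following is only a minimal sketch of that idea on a small stand-in data frame; toy_df, draw_sample, and the seed value 42 are illustrative assumptions and are not part of our analysis.

```r
# Stand-in for df_rdin; toy_df and the seed value are illustrative assumptions
toy_df <- data.frame(id = 1:1000, price = rep(c(50, 100, 150, 200), 250))

# Hypothetical helper: fix the RNG state, then sample n rows without replacement
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), n), , drop = FALSE]
}

sample_a <- draw_sample(toy_df, 100, seed = 42)
sample_b <- draw_sample(toy_df, 100, seed = 42)

nrow(sample_a)                 # number of sampled rows
identical(sample_a, sample_b)  # same seed, so the same rows are drawn
```

With a fixed seed, the specific numbers quoted in a write-up would stay in step with the printed output.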

STEP 2 - Observe the dataframe

Data Structure

After importing our data, we first looked at how it was structured using str(). We saw that more of the variables had a character data type than a numerical one. We also found several variables whose values we did not understand, such as the multiple Review.Scores variables. Likewise, some variables carried overlapping location information, such as City, Market, State, Smart.Location, Longitude, Latitude, and Country, so we recognized that some of these would need to be cut to avoid having multiple variables represent the same thing.

str(df_abnb)

## 'data.frame':    100 obs. of  89 variables:
##  $ ID                            : int  1089293 12652887 4005022 18547885 9393007 1492286 18007965 5046189 16944085 14801382 ...
##  $ Listing.Url                   : chr  "https://www.airbnb.com/rooms/1089293" "https://www.airbnb.com/rooms/12652887" "https://www.airbnb.com/rooms/4005022" "https://www.airbnb.com/rooms/18547885" ...
##  $ Scrape.ID                     : num  2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
##  $ Last.Scraped                  : chr  "2017-05-03" "2017-03-07" "2017-06-02" "2017-06-02" ...
##  $ Name                          : chr  "Entire home/apt in Los Angeles" "South Austin Digs" "Cozy Gentilly Home" "Historic Home: Close to French Quarter & Esplanade Ave" ...
##  $ Summary                       : chr  "The place is a monthly sublet with option to rent furnished or unfurnished.  There is a deposit requirement and"| __truncated__ "My place is close to Downtown and South Congress in 20 minutes, WholeFoods, tons a great dining, and the Greenb"| __truncated__ "This home is located in the Milneburg subdivision of Gentilly. It has 2 bedrooms, 1 full bath, full kitchen noo"| __truncated__ "Come experience local living in my traditional New Orleans shotgun home conveniently located in the Historic 7t"| __truncated__ ...
##  $ Space                         : chr  "The place is a monthly sublet with option to rent furnished or unfurnished.  There is a deposit requirement and"| __truncated__ "" "" "The house is shotgun style (meaning you must walk through one room to get into the next) with tons of original "| __truncated__ ...
##  $ Description                   : chr  "The place is a monthly sublet with option to rent furnished or unfurnished.  There is a deposit requirement and"| __truncated__ "My place is close to Downtown and South Congress in 20 minutes, WholeFoods, tons a great dining, and the Greenb"| __truncated__ "This home is located in the Milneburg subdivision of Gentilly. It has 2 bedrooms, 1 full bath, full kitchen noo"| __truncated__ "Come experience local living in my traditional New Orleans shotgun home conveniently located in the Historic 7t"| __truncated__ ...
##  $ Experiences.Offered           : chr  "none" "none" "none" "none" ...
##  $ Neighborhood.Overview         : chr  "" "" "" "My home is located in a great historic neighborhood that allows access to all of the best parts of the city. Gu"| __truncated__ ...
##  $ Notes                         : chr  "" "" "" "Please be mindful of the tenant next door. No loud music. No smoking. No pets. Check-in is from 3:00PM -till an"| __truncated__ ...
##  $ Transit                       : chr  "" "" "Public transportation is steps away." "Uber, cab, bike, or the bus stop on our corner that goes to the French Quarter, CBD, and Uptown." ...
##  $ Access                        : chr  "" "" "Guest has access to the entire house." "My house is a duplex with a tenant next door. You will have access to one side of the house, which includes off"| __truncated__ ...
##  $ Interaction                   : chr  "" "" "Limited or no interaction with guest unless its to address questions or concerns. Host may or may not be presen"| __truncated__ "You will have total privacy here. That said, I am happy to answer any questions and offer recommendations to ma"| __truncated__ ...
##  $ House.Rules                   : chr  "no smoking inside, please use balcony. no pets. Party is allowed but be responsible, 10pm noise curfew. please "| __truncated__ "- There is a pool right off the master bedroom that does not have a fence around it.  Very important to be awar"| __truncated__ "Honesty about the number of guest staying is required.   No Pets or house parties. Please!!! No Smoking Guest a"| __truncated__ "We want guests to relax completely and enjoy themselves, but also to be respectful of our home, which we have p"| __truncated__ ...
##  $ Thumbnail.Url                 : chr  "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=small" ...
##  $ Medium.Url                    : chr  "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=medium" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=medium" ...
##  $ Picture.Url                   : chr  "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/4daa7b6e743e3d58eedd6618be7b656c" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/472cde58733384e779e5f1d42f6f9a39" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/fea78caf3227a247ca7344dcb0ae0a58" "https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/files/4c8a34e72bac321ea5e3f3cad928ba62" ...
##  $ XL.Picture.Url                : chr  "https://a0.muscache.com/im/pictures/16377209/9f4c3da0_original.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/421e5d5f-631d-4194-8af1-7a123b47d9c8.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/fd562fab-4b91-46ec-88dd-734bf13ade3a.jpg?aki_policy=x_large" "https://a0.muscache.com/im/pictures/25032142-4eb4-442c-abd2-93b107c3d948.jpg?aki_policy=x_large" ...
##  $ Host.ID                       : int  1833132 68263829 19107533 128753533 48066372 7977178 20870710 23732730 1739801 16507910 ...
##  $ Host.URL                      : chr  "https://www.airbnb.com/users/show/1833132" "https://www.airbnb.com/users/show/68263829" "https://www.airbnb.com/users/show/19107533" "https://www.airbnb.com/users/show/128753533" ...
##  $ Host.Name                     : chr  "Amelie" "Katy" "Rhett & Sheila" "Alex" ...
##  $ Host.Since                    : chr  "2012-02-29" "2016-04-21" "2014-07-29" "2017-05-03" ...
##  $ Host.Location                 : chr  "Los Angeles, California, United States" "" "New Orleans, Louisiana, United States" "New Orleans, Louisiana, United States" ...
##  $ Host.About                    : chr  "Loves the outdoors, hiking, beach, TENNIS!, rock climbing, and surf soon. I'm a food and wine fanatic.  I love "| __truncated__ "" "We are native New Orleanian. There's no place we would rather live. Our city is an exciting place to visit and "| __truncated__ "I am a local to New Orleans, LA. I graduated from Loyola University New Orleans. Also, I completed 3 semesters "| __truncated__ ...
##  $ Host.Response.Time            : chr  "" "" "within an hour" "within an hour" ...
##  $ Host.Response.Rate            : int  NA NA 100 100 100 100 100 100 99 100 ...
##  $ Host.Acceptance.Rate          : chr  "" "" "" "" ...
##  $ Host.Thumbnail.Url            : chr  "https://a0.muscache.com/im/users/1833132/profile_pic/1366656419/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/ccf1ec7f-a96f-4a05-b3a5-ca9e245fe226.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/users/19107533/profile_pic/1409505580/original.jpg?aki_policy=profile_small" "https://a0.muscache.com/im/pictures/4078ae48-f302-4edd-82f8-c1c54764a2a2.jpg?aki_policy=profile_small" ...
##  $ Host.Picture.Url              : chr  "https://a0.muscache.com/im/users/1833132/profile_pic/1366656419/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/ccf1ec7f-a96f-4a05-b3a5-ca9e245fe226.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/users/19107533/profile_pic/1409505580/original.jpg?aki_policy=profile_x_medium" "https://a0.muscache.com/im/pictures/4078ae48-f302-4edd-82f8-c1c54764a2a2.jpg?aki_policy=profile_x_medium" ...
##  $ Host.Neighbourhood            : chr  "West Los Angeles" "" "Milneburg" "" ...
##  $ Host.Listings.Count           : int  1 1 1 2 1 7 1 2 24 1 ...
##  $ Host.Total.Listings.Count     : int  1 1 1 2 1 7 1 2 24 1 ...
##  $ Host.Verifications            : chr  "email,phone,facebook,reviews" "phone" "email,phone,reviews,kba" "email,phone,work_email" ...
##  $ Street                        : chr  "West Los Angeles, Los Angeles, CA 90025, United States" "Austin, TX, United States" "Milneburg, New Orleans, LA 70122, United States" "New Orleans, LA 70119, United States" ...
##  $ Neighbourhood                 : chr  "West Los Angeles" "" "Milneburg" "" ...
##  $ Neighbourhood.Cleansed        : chr  "Sawtelle" "78745" "Milneburg" "Seventh Ward" ...
##  $ Neighbourhood.Group.Cleansed  : chr  "" "" "" "" ...
##  $ City                          : chr  "Los Angeles" "Austin" "New Orleans" "New Orleans" ...
##  $ State                         : chr  "CA" "TX" "LA" "LA" ...
##  $ Zipcode                       : chr  "90025" "" "70122" "70119" ...
##  $ Market                        : chr  "Los Angeles" "Austin" "New Orleans" "New Orleans" ...
##  $ Smart.Location                : chr  "Los Angeles, CA" "Austin, TX" "New Orleans, LA" "New Orleans, LA" ...
##  $ Country.Code                  : chr  "US" "US" "US" "US" ...
##  $ Country                       : chr  "United States" "United States" "United States" "United States" ...
##  $ Latitude                      : num  34 30.2 30 30 34.1 ...
##  $ Longitude                     : num  -118.4 -97.8 -90.1 -90.1 -118.4 ...
##  $ Property.Type                 : chr  "Apartment" "House" "House" "House" ...
##  $ Room.Type                     : chr  "Entire home/apt" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
##  $ Accommodates                  : int  2 6 4 6 2 1 4 16 10 1 ...
##  $ Bathrooms                     : num  1 2 1 1 1 1 1 1 2 1 ...
##  $ Bedrooms                      : int  1 3 2 2 1 1 0 2 2 1 ...
##  $ Beds                          : int  1 3 2 2 1 1 2 4 6 1 ...
##  $ Bed.Type                      : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ Amenities                     : chr  "TV,Internet,Wireless Internet,Air conditioning,Wheelchair accessible,Kitchen,Free parking on premises,Pets allo"| __truncated__ "TV,Cable TV,Internet,Wireless Internet,Air conditioning,Pool,Kitchen,Free parking on premises,Breakfast,Indoor "| __truncated__ "TV,Cable TV,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Dryer,Smoke detector,Ca"| __truncated__ "TV,Internet,Wireless Internet,Air conditioning,Kitchen,Free parking on premises,Heating,Smoke detector,Carbon m"| __truncated__ ...
##  $ Square.Feet                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Price                         : int  200 500 150 105 71 70 110 150 499 65 ...
##  $ Weekly.Price                  : int  600 NA NA NA NA 525 NA NA NA NA ...
##  $ Monthly.Price                 : int  1500 NA NA NA NA 1675 NA NA NA NA ...
##  $ Security.Deposit              : int  NA NA 100 250 NA 150 100 NA NA NA ...
##  $ Cleaning.Fee                  : int  50 100 75 110 20 60 50 75 NA 30 ...
##  $ Guests.Included               : int  1 1 4 6 1 1 1 2 1 1 ...
##  $ Extra.People                  : int  0 0 25 50 20 20 0 25 0 0 ...
##  $ Minimum.Nights                : int  15 1 2 2 1 1 2 2 2 1 ...
##  $ Maximum.Nights                : int  31 1125 1125 1125 100 1125 13 28 1125 1125 ...
##  $ Calendar.Updated              : chr  "yesterday" "yesterday" "today" "today" ...
##  $ Has.Availability              : chr  "" "" "" "" ...
##  $ Availability.30               : int  0 0 29 16 6 7 10 22 30 12 ...
##  $ Availability.60               : int  0 0 59 33 17 7 37 50 60 30 ...
##  $ Availability.90               : int  0 1 89 58 20 7 67 75 90 52 ...
##  $ Availability.365              : int  0 1 364 289 89 273 283 346 365 314 ...
##  $ Calendar.last.Scraped         : chr  "2017-05-02" "2017-03-06" "2017-06-02" "2017-06-02" ...
##  $ Number.of.Reviews             : int  0 0 12 4 33 33 4 31 0 36 ...
##  $ First.Review                  : chr  "" "" "2016-04-18" "2017-04-24" ...
##  $ Last.Review                   : chr  "" "" "2017-06-01" "" ...
##  $ Review.Scores.Rating          : int  NA NA 94 100 99 94 100 95 NA 100 ...
##  $ Review.Scores.Accuracy        : int  NA NA 10 10 10 9 10 9 NA 10 ...
##  $ Review.Scores.Cleanliness     : int  NA NA 10 10 10 9 10 9 NA 10 ...
##  $ Review.Scores.Checkin         : int  NA NA 9 10 10 10 10 10 NA 10 ...
##  $ Review.Scores.Communication   : int  NA NA 10 10 10 10 10 10 NA 10 ...
##  $ Review.Scores.Location        : int  NA NA 9 10 10 10 10 9 NA 10 ...
##  $ Review.Scores.Value           : int  NA NA 9 10 10 9 10 9 NA 10 ...
##  $ License                       : chr  "" "" "City registration pending" "17STR-10146" ...
##  $ Jurisdiction.Names            : chr  "City of Los Angeles, CA" "" "Louisiana State, New Orleans, LA" "Louisiana State, New Orleans, LA" ...
##  $ Cancellation.Policy           : chr  "strict" "flexible" "strict" "strict" ...
##  $ Calculated.host.listings.count: int  1 1 1 2 1 4 1 2 18 1 ...
##  $ Reviews.per.Month             : num  NA NA 0.88 3 4.5 0.86 4 1.15 NA 5.51 ...
##  $ Geolocation                   : chr  "34.0432994388578, -118.44666625217695" "30.1946181529524, -97.81865977419787" "30.01895348868219, -90.05268642441278" "29.975441566796114, -90.07089887416147" ...
##  $ Features                      : chr  "Host Has Profile Pic,Is Location Exact" "Host Has Profile Pic,Instant Bookable" "Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License" "Host Has Profile Pic,Requires License,Instant Bookable" ...
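
As a rough illustration of cutting overlapping location columns, the sketch below drops a set of columns by name from a small stand-in data frame. The toy values are copied from the str() output above; toy_listings and the choice of redundant_cols are assumptions made for illustration, not our final column selection.

```r
# Toy stand-in for df_abnb with a few of the overlapping location columns
toy_listings <- data.frame(
  City           = c("Los Angeles", "Austin", "New Orleans"),
  Market         = c("Los Angeles", "Austin", "New Orleans"),
  State          = c("CA", "TX", "LA"),
  Smart.Location = c("Los Angeles, CA", "Austin, TX", "New Orleans, LA"),
  Zipcode        = c("90025", "", "70122"),
  Price          = c(200, 500, 150),
  stringsAsFactors = FALSE
)

# Columns treated as redundant here (an illustrative choice only)
redundant_cols <- c("Market", "State", "Smart.Location")

# Keep every column whose name is not in the redundant set
toy_trimmed <- toy_listings[, !(names(toy_listings) %in% redundant_cols), drop = FALSE]

names(toy_trimmed)
```

Dropping by name rather than by position keeps the step valid even if the column order of the source file changes.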

Missing Values

After observing our variables, we looked at which variables contained missing data. The counts show that some variables, such as Square.Feet, Weekly.Price, and Monthly.Price, had far more missing values than actual values: in the run shown below they were missing in 99, 83, and 76 of the 100 sampled rows. We also decided to drop Host.Response.Rate: although we thought it could affect our target variable, it was missing for too many listings (14 of 100 in the run shown below) to be reliable.

sort(colSums(is.na(df_abnb)), decreasing = TRUE)

##                    Square.Feet                   Weekly.Price 
##                             99                             83 
##                  Monthly.Price               Security.Deposit 
##                             76                             54 
##                   Cleaning.Fee            Review.Scores.Value 
##                             30                             21 
##           Review.Scores.Rating         Review.Scores.Accuracy 
##                             20                             20 
##      Review.Scores.Cleanliness          Review.Scores.Checkin 
##                             20                             20 
##    Review.Scores.Communication         Review.Scores.Location 
##                             20                             20 
##              Reviews.per.Month             Host.Response.Rate 
##                             18                             14 
##                          Price                       Bedrooms 
##                              2                              1 
##                           Beds                             ID 
##                              1                              0 
##                    Listing.Url                      Scrape.ID 
##                              0                              0 
##                   Last.Scraped                           Name 
##                              0                              0 
##                        Summary                          Space 
##                              0                              0 
##                    Description            Experiences.Offered 
##                              0                              0 
##          Neighborhood.Overview                          Notes 
##                              0                              0 
##                        Transit                         Access 
##                              0                              0 
##                    Interaction                    House.Rules 
##                              0                              0 
##                  Thumbnail.Url                     Medium.Url 
##                              0                              0 
##                    Picture.Url                 XL.Picture.Url 
##                              0                              0 
##                        Host.ID                       Host.URL 
##                              0                              0 
##                      Host.Name                     Host.Since 
##                              0                              0 
##                  Host.Location                     Host.About 
##                              0                              0 
##             Host.Response.Time           Host.Acceptance.Rate 
##                              0                              0 
##             Host.Thumbnail.Url               Host.Picture.Url 
##                              0                              0 
##             Host.Neighbourhood            Host.Listings.Count 
##                              0                              0 
##      Host.Total.Listings.Count             Host.Verifications 
##                              0                              0 
##                         Street                  Neighbourhood 
##                              0                              0 
##         Neighbourhood.Cleansed   Neighbourhood.Group.Cleansed 
##                              0                              0 
##                           City                          State 
##                              0                              0 
##                        Zipcode                         Market 
##                              0                              0 
##                 Smart.Location                   Country.Code 
##                              0                              0 
##                        Country                       Latitude 
##                              0                              0 
##                      Longitude                  Property.Type 
##                              0                              0 
##                      Room.Type                   Accommodates 
##                              0                              0 
##                      Bathrooms                       Bed.Type 
##                              0                              0 
##                      Amenities                Guests.Included 
##                              0                              0 
##                   Extra.People                 Minimum.Nights 
##                              0                              0 
##                 Maximum.Nights               Calendar.Updated 
##                              0                              0 
##               Has.Availability                Availability.30 
##                              0                              0 
##                Availability.60                Availability.90 
##                              0                              0 
##               Availability.365          Calendar.last.Scraped 
##                              0                              0 
##              Number.of.Reviews                   First.Review 
##                              0                              0 
##                    Last.Review                        License 
##                              0                              0 
##             Jurisdiction.Names            Cancellation.Policy 
##                              0                              0 
## Calculated.host.listings.count                    Geolocation 
##                              0                              0 
##                       Features 
##                              0
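
One caveat when reading these counts: is.na() only detects NA, so character columns that hold empty strings (for example Host.Acceptance.Rate, which shows "" in the str() output) are reported as having 0 missing values. The sketch below shows one possible way to count blanks as missing too. It uses a toy data frame, and count_missing is a hypothetical helper written for this illustration, not a function from any package.

```r
# Toy stand-in for df_abnb; values mirror the kinds of entries seen in str()
toy_listings <- data.frame(
  Host.Response.Time = c("", "", "within an hour", "within an hour"),
  Zipcode            = c("90025", "", "70122", "70119"),
  Square.Feet        = c(NA, NA, NA, 1200),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: count NA plus empty or whitespace-only strings per column
count_missing <- function(df) {
  vapply(df, function(col) {
    blank <- is.character(col) & !is.na(col) & trimws(col) == ""
    sum(is.na(col) | blank)
  }, integer(1))
}

sort(colSums(is.na(toy_listings)), decreasing = TRUE)  # counts NA only
sort(count_missing(toy_listings), decreasing = TRUE)   # counts NA and blank strings
```

On this toy data, the NA-only count reports no missing values for the two character columns, while the combined count picks up their blank entries.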

Summary Statistics

Lastly, we generated summary statistics of our data. These provide quantitative information about each numerical variable, such as Host.Listings.Count, Accommodates, and Price. One variable we found particularly interesting was Host.Listings.Count: in the run shown below its median was 2 and its 3rd quartile was 3, yet its mean was about 12.6 and its maximum was 472. This shows that although the majority of hosts had only a handful of listings, a few outlier hosts had far more, up to several hundred. Another variable whose summary statistics are important to note is the target variable, Number.of.Reviews, which has a minimum of 0, a mean of around 20, and a maximum of around 200. This gave us an idea of how wide the range of our target variable was, which could affect some of our models’ statistical results.

summary(df_abnb)

##        ID           Listing.Url          Scrape.ID         Last.Scraped      
##  Min.   :    9531   Length:100         Min.   :2.016e+13   Length:100        
##  1st Qu.: 4780186   Class :character   1st Qu.:2.017e+13   Class :character  
##  Median :10323310   Mode  :character   Median :2.017e+13   Mode  :character  
##  Mean   : 9774289                      Mean   :2.017e+13                     
##  3rd Qu.:14916988                      3rd Qu.:2.017e+13                     
##  Max.   :18547885                      Max.   :2.017e+13                     
##                                                                              
##      Name             Summary             Space           Description       
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Experiences.Offered Neighborhood.Overview    Notes          
##  Length:100          Length:100            Length:100        
##  Class :character    Class :character      Class :character  
##  Mode  :character    Mode  :character      Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##    Transit             Access          Interaction        House.Rules       
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Thumbnail.Url       Medium.Url        Picture.Url        XL.Picture.Url    
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Host.ID            Host.URL          Host.Name          Host.Since       
##  Min.   :    31481   Length:100         Length:100         Length:100        
##  1st Qu.:  6985930   Class :character   Class :character   Class :character  
##  Median : 22020728   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 34613599                                                           
##  3rd Qu.: 49338518                                                           
##  Max.   :128753533                                                           
##                                                                              
##  Host.Location       Host.About        Host.Response.Time Host.Response.Rate
##  Length:100         Length:100         Length:100         Min.   : 41.00    
##  Class :character   Class :character   Class :character   1st Qu.:100.00    
##  Mode  :character   Mode  :character   Mode  :character   Median :100.00    
##                                                           Mean   : 97.31    
##                                                           3rd Qu.:100.00    
##                                                           Max.   :100.00    
##                                                           NA's   :14        
##  Host.Acceptance.Rate Host.Thumbnail.Url Host.Picture.Url   Host.Neighbourhood
##  Length:100           Length:100         Length:100         Length:100        
##  Class :character     Class :character   Class :character   Class :character  
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  Host.Listings.Count Host.Total.Listings.Count Host.Verifications
##  Min.   :  1.00      Min.   :  1.00            Length:100        
##  1st Qu.:  1.00      1st Qu.:  1.00            Class :character  
##  Median :  2.00      Median :  2.00            Mode  :character  
##  Mean   : 12.62      Mean   : 12.62                              
##  3rd Qu.:  3.00      3rd Qu.:  3.00                              
##  Max.   :472.00      Max.   :472.00                              
##                                                                  
##     Street          Neighbourhood      Neighbourhood.Cleansed
##  Length:100         Length:100         Length:100            
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##                                                              
##  Neighbourhood.Group.Cleansed     City              State          
##  Length:100                   Length:100         Length:100        
##  Class :character             Class :character   Class :character  
##  Mode  :character             Mode  :character   Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##    Zipcode             Market          Smart.Location     Country.Code      
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    Country             Latitude       Longitude       Property.Type     
##  Length:100         Min.   :29.93   Min.   :-122.70   Length:100        
##  Class :character   1st Qu.:33.77   1st Qu.:-118.33   Class :character  
##  Mode  :character   Median :36.12   Median : -97.72   Mode  :character  
##                     Mean   :36.62   Mean   : -97.37                     
##                     3rd Qu.:40.70   3rd Qu.: -74.00                     
##                     Max.   :47.57   Max.   : -71.09                     
##                                                                         
##   Room.Type          Accommodates     Bathrooms       Bedrooms    
##  Length:100         Min.   : 1.00   Min.   :1.00   Min.   :0.000  
##  Class :character   1st Qu.: 2.00   1st Qu.:1.00   1st Qu.:1.000  
##  Mode  :character   Median : 3.00   Median :1.00   Median :1.000  
##                     Mean   : 4.12   Mean   :1.38   Mean   :1.566  
##                     3rd Qu.: 5.00   3rd Qu.:1.50   3rd Qu.:2.000  
##                     Max.   :16.00   Max.   :5.00   Max.   :8.000  
##                                                    NA's   :1      
##       Beds         Bed.Type          Amenities          Square.Feet  
##  Min.   :1.000   Length:100         Length:100         Min.   :1200  
##  1st Qu.:1.000   Class :character   Class :character   1st Qu.:1200  
##  Median :2.000   Mode  :character   Mode  :character   Median :1200  
##  Mean   :2.091                                         Mean   :1200  
##  3rd Qu.:3.000                                         3rd Qu.:1200  
##  Max.   :8.000                                         Max.   :1200  
##  NA's   :1                                             NA's   :99    
##      Price        Weekly.Price   Monthly.Price  Security.Deposit
##  Min.   : 30.0   Min.   :150.0   Min.   : 500   Min.   :100.0   
##  1st Qu.: 71.0   1st Qu.:290.0   1st Qu.:1288   1st Qu.:100.0   
##  Median :115.0   Median :500.0   Median :2042   Median :250.0   
##  Mean   :160.4   Mean   :492.9   Mean   :2393   Mean   :250.5   
##  3rd Qu.:198.0   3rd Qu.:620.0   3rd Qu.:3025   3rd Qu.:375.0   
##  Max.   :611.0   Max.   :879.0   Max.   :6500   Max.   :500.0   
##  NA's   :2       NA's   :83      NA's   :76     NA's   :54      
##   Cleaning.Fee    Guests.Included  Extra.People    Minimum.Nights 
##  Min.   :  5.00   Min.   : 0.0    Min.   :  0.00   Min.   : 1.00  
##  1st Qu.: 39.25   1st Qu.: 1.0    1st Qu.:  0.00   1st Qu.: 1.00  
##  Median : 57.50   Median : 1.0    Median :  5.50   Median : 2.00  
##  Mean   : 74.84   Mean   : 1.9    Mean   : 15.81   Mean   : 2.42  
##  3rd Qu.:100.00   3rd Qu.: 2.0    3rd Qu.: 25.00   3rd Qu.: 2.00  
##  Max.   :300.00   Max.   :12.0    Max.   :100.00   Max.   :30.00  
##  NA's   :30                                                       
##  Maximum.Nights   Calendar.Updated   Has.Availability   Availability.30
##  Min.   :   3.0   Length:100         Length:100         Min.   : 0.00  
##  1st Qu.:  60.0   Class :character   Class :character   1st Qu.: 0.75  
##  Median :1125.0   Mode  :character   Mode  :character   Median :10.00  
##  Mean   : 753.1                                         Mean   :12.40  
##  3rd Qu.:1125.0                                         3rd Qu.:22.00  
##  Max.   :1125.0                                         Max.   :30.00  
##                                                                        
##  Availability.60 Availability.90 Availability.365 Calendar.last.Scraped
##  Min.   : 0.0    Min.   : 0.00   Min.   :  0.0    Length:100           
##  1st Qu.: 4.0    1st Qu.:10.25   1st Qu.: 61.0    Class :character     
##  Median :26.0    Median :45.50   Median :188.0    Mode  :character     
##  Mean   :27.2    Mean   :45.20   Mean   :192.6                         
##  3rd Qu.:50.0    3rd Qu.:75.75   3rd Qu.:327.5                         
##  Max.   :60.0    Max.   :90.00   Max.   :365.0                         
##                                                                        
##  Number.of.Reviews First.Review       Last.Review        Review.Scores.Rating
##  Min.   :  0.00    Length:100         Length:100         Min.   : 67.0       
##  1st Qu.:  2.00    Class :character   Class :character   1st Qu.: 93.0       
##  Median : 10.00    Mode  :character   Mode  :character   Median : 96.0       
##  Mean   : 24.78                                          Mean   : 94.8       
##  3rd Qu.: 28.75                                          3rd Qu.:100.0       
##  Max.   :239.00                                          Max.   :100.0       
##                                                          NA's   :20          
##  Review.Scores.Accuracy Review.Scores.Cleanliness Review.Scores.Checkin
##  Min.   : 7.000         Min.   : 7.000            Min.   : 8.000       
##  1st Qu.: 9.000         1st Qu.: 9.000            1st Qu.:10.000       
##  Median :10.000         Median :10.000            Median :10.000       
##  Mean   : 9.625         Mean   : 9.488            Mean   : 9.775       
##  3rd Qu.:10.000         3rd Qu.:10.000            3rd Qu.:10.000       
##  Max.   :10.000         Max.   :10.000            Max.   :10.000       
##  NA's   :20             NA's   :20                NA's   :20           
##  Review.Scores.Communication Review.Scores.Location Review.Scores.Value
##  Min.   : 8.000              Min.   : 7.000         Min.   : 8.000     
##  1st Qu.:10.000              1st Qu.: 9.000         1st Qu.: 9.000     
##  Median :10.000              Median : 9.000         Median :10.000     
##  Mean   : 9.863              Mean   : 9.375         Mean   : 9.443     
##  3rd Qu.:10.000              3rd Qu.:10.000         3rd Qu.:10.000     
##  Max.   :10.000              Max.   :10.000         Max.   :10.000     
##  NA's   :20                  NA's   :20             NA's   :21         
##    License          Jurisdiction.Names Cancellation.Policy
##  Length:100         Length:100         Length:100         
##  Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   
##                                                           
##                                                           
##                                                           
##                                                           
##  Calculated.host.listings.count Reviews.per.Month Geolocation       
##  Min.   : 1.00                  Min.   :0.0400    Length:100        
##  1st Qu.: 1.00                  1st Qu.:0.5625    Class :character  
##  Median : 1.00                  Median :1.4000    Mode  :character  
##  Mean   : 4.82                  Mean   :2.0096                      
##  3rd Qu.: 3.00                  3rd Qu.:2.9850                      
##  Max.   :61.00                  Max.   :9.1600                      
##                                 NA's   :18                          
##    Features        
##  Length:100        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

STEP 3 - Clean the Data

As the previous step shows, our dataframe contains a large number of variables. Building a model on all of them would make it too complicated and difficult to manage, so we first create a new dataframe containing only the variables that we believe are significant or could influence the target variable in our model. This step also involves creating new columns that we believe could be helpful and editing columns that need to be changed. Once we have selected our variables of interest, we clean the data by omitting rows with missing values.

Selecting Columns of Interest

We thought that it would be interesting to use the length of the “Description” column as one of our input variables, so we first created that column. Then, we saw that “Zipcode” was read in as a character rather than an integer, so we converted all of the values in that column. Finally, we selected our variables of interest for our new dataframe.

# Create column for length of description
df_abnb <- df_abnb %>%
  mutate(Des_length = nchar(as.character(Description)))
# Change Zipcode from chr to int
df_abnb$Zipcode <- as.integer(df_abnb$Zipcode)

# Select only columns of interest
df_abnb_new <- df_abnb[,c("Host.Listings.Count","Zipcode","City", "Accommodates", "Price", "Bedrooms", "Review.Scores.Value", "Cancellation.Policy", "Amenities", "House.Rules", "Des_length", "Number.of.Reviews")]
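The two column edits above can be tried on a small example. This is a minimal sketch on a made-up `toy` dataframe (its values are invented, not taken from the AirBnB data). It also shows a side effect worth knowing: `as.integer()` drops leading zeros and returns NA for ZIP+4 strings, so keeping the first five characters as text is one possible alternative, not what the original analysis does.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  Description = c("Cozy loft near downtown", "Quiet room", "Big house"),
  Zipcode = c("02134", "94110-1234", "78701"),
  stringsAsFactors = FALSE
)

# Length of each description in characters
toy$Des_length <- nchar(as.character(toy$Description))

# Integer conversion: "02134" becomes 2134 and "94110-1234" becomes NA
zip_int <- suppressWarnings(as.integer(toy$Zipcode))

# Alternative (an assumption, not the original method): keep 5-digit text codes
zip5 <- substr(toy$Zipcode, 1, 5)

print(data.frame(Zipcode = toy$Zipcode, zip_int = zip_int, zip5 = zip5))
```

Any NA produced by the integer conversion is later removed along with its whole row when missing values are omitted.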

Omitting Missing Values

Of the variables that we picked for our model, Host.Listings.Count, Accommodates, Price, Bedrooms, and Review.Scores.Value all contained missing values, so we removed every row that had a missing value.

df_abnb_new <- na.omit(df_abnb_new)
View(head(df_abnb_new, 10))
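Before dropping rows, it can help to see how many values are missing per column and how many rows survive. The sketch below uses a made-up `toy` dataframe (values invented for illustration); the same two calls could be applied to `df_abnb_new`.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  Price = c(100, NA, 250, 80),
  Bedrooms = c(1, 2, NA, 1),
  Review.Scores.Value = c(10, 9, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Listwise deletion: a row is dropped if any of its columns is NA
toy_complete <- na.omit(toy)
cat("Rows before:", nrow(toy), "after:", nrow(toy_complete), "\n")
```

Listings without reviews typically have no review scores, so omitting rows with a missing Review.Scores.Value may also remove most listings whose Number.of.Reviews is zero.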

STEP 4 - Text Mining

Many of the variables in our dataset are character (text) variables, so we must first perform text mining on any of them that we want to work with. Of the variables of interest that we picked for our model, the only ones that require text mining are “Amenities” and “Cancellation.Policy”.

Combining Columns

Since the Amenities and Cancellation.Policy columns were similar and would both need to be text mined, we decided to combine them into one column which we would then perform the text mining on.

df_abnb_new$combined <- paste(df_abnb_new$Amenities, df_abnb_new$Cancellation.Policy)
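# Drop the two source columns by name (positions 8 and 9) now that they are combined
df_abnb_new <- df_abnb_new[, !(names(df_abnb_new) %in% c("Cancellation.Policy", "Amenities"))]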

Text Mining for Amenities

# Preprocess text data
corpus <- VCorpus(VectorSource(df_abnb_new$combined))
corpus <- tm_map(corpus, content_transformer(tolower))

# Terms are separated by hyphens, periods, commas, slashes, and underscores
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "-")   # Replace hyphens with spaces
corpus <- tm_map(corpus, toSpace, "\\.") # Replace periods with spaces
corpus <- tm_map(corpus, toSpace, ",")   # Replace commas with spaces
corpus <- tm_map(corpus, toSpace, "/")   # Replace slashes with spaces
corpus <- tm_map(corpus, toSpace, "_")   # Replace underscores with spaces

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
words <- colnames(matrix)

# Convert to a dataframe for modeling
text_data <- as.data.frame(matrix, stringsAsFactors = FALSE)

# Combine with original data (make sure row order is the same)
final_data_1 <- cbind(text_data, Host.Listings.Count = df_abnb_new$Host.Listings.Count, Zipcode = df_abnb_new$Zipcode, 
                      Accommodates = df_abnb_new$Accommodates , Price = df_abnb_new$Price, Bedrooms = df_abnb_new$Bedrooms, 
                      Review.Scores.Value = df_abnb_new$Review.Scores.Value,Des_length =  df_abnb_new$Des_length,
                      Number.of.Reviews = df_abnb_new$Number.of.Reviews)
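To make the shape of the document-term matrix concrete, here is a minimal base-R sketch that builds the same kind of table without the tm package. The strings in `docs` are invented, and this is an illustration of the idea rather than a re-implementation of `DocumentTermMatrix()`.

```r
# Hypothetical amenity/policy strings; invented for illustration only
docs <- c("TV,Wireless Internet,Kitchen strict",
          "Kitchen,Heating flexible",
          "TV,Heating,Washer strict")

# Lowercase, replace punctuation with spaces, then split on whitespace
clean <- gsub("[[:punct:]]", " ", tolower(docs))
tokens <- strsplit(trimws(clean), "\\s+")

# One row per document, one column per distinct term, cells are term counts
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm <- as.data.frame(dtm)
print(dtm)
```

Each distinct word becomes its own predictor column, which is why the models below report 112 predictors for fewer than 100 rows. One difference from this sketch: with its default settings, tm's `DocumentTermMatrix()` drops terms shorter than three characters, so a term such as "tv" would not get a column there.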

STEP 5 - Creating Models

Given our data and chosen business problem, we decided that the three models most likely to fit our data were Random Forest, Multiple Linear Regression (MLR), and k-nearest neighbors (KNN).

Random Forest Model

To create our random forest model, we first used the caret package to train the model on the training data, then made predictions on the validation set to see how the model performed. As seen in the output below, caret's train function selected the optimal model by choosing the smallest cross-validated RMSE; in the run shown here, that RMSE was about 31.5 at mtry = 2. When evaluating the model on our validation data in the same run, we found a mean absolute error of about 32.5, an RMSE of about 58.5, and a slightly negative R-squared, which means the model predicted the validation set worse than simply predicting the validation mean would have. These errors are relatively high, but they could make sense considering the large range of our target variable, as mentioned previously.

# Split data into training and test sets
library(caret)
library(randomForest)

# Set seed for reproducibility
set.seed(123)

# Create a data partition for train/test split
index <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.6, list = FALSE)
training_data <- final_data_1[index, ]
valid_data <- final_data_1[-index, ]

# Define the train control using cross-validation
train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Create a Random Forest model using caret's train function
rf_model_caret <- train(
  Number.of.Reviews ~ .,
  data = training_data,
  method = "rf",
  trControl = train_control
)

# View model details
rf_model_caret

## Random Forest 
## 
##  48 samples
## 112 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 38, 37, 38, 39, 40 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##     2   31.54544  0.1014463  24.28920
##    57   34.22157  0.0389973  26.25609
##   112   34.46557  0.0508975  26.06395
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.

# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)

# Evaluate the model
MAE_RF <- mean(abs(predictions - valid_data$Number.of.Reviews))
MAE_RF

## [1] 32.53239

# Make predictions on the validation set
predictions_rf <- predict(rf_model_caret, newdata = valid_data)

# Calculate Mean Absolute Error (MAE)
MAE_rf <- mean(abs(predictions_rf - valid_data$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE) for Random Forest:", MAE_rf))

## [1] "Mean Absolute Error (MAE) for Random Forest: 32.5323876740319"

# Calculate Root Mean Squared Error (RMSE)
RMSE_rf <- sqrt(mean((predictions_rf - valid_data$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE) for Random Forest:", RMSE_rf))

## [1] "Root Mean Squared Error (RMSE) for Random Forest: 58.5325745655408"

# Calculate R-squared (coefficient of determination)
SSE_rf <- sum((predictions_rf - valid_data$Number.of.Reviews)^2) # Sum of Squared Errors
SST_rf <- sum((valid_data$Number.of.Reviews - mean(valid_data$Number.of.Reviews))^2) # Total Sum of Squares
rsquared_rf <- 1 - SSE_rf/SST_rf
print(paste("R-squared (Coefficient of Determination) for Random Forest:", rsquared_rf))

## [1] "R-squared (Coefficient of Determination) for Random Forest: -0.0637636126401615"
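The three metrics above are recomputed for each model, so they can be wrapped in one small function. The sketch below is an illustration on invented numbers; the helper name `regression_metrics` is hypothetical and does not appear in the original code. It also shows why a hold-out R-squared can be negative: the formula compares the model against a predictor that always returns the mean of the actual values.

```r
# Hypothetical helper; the name regression_metrics is not from the original code
regression_metrics <- function(actual, predicted) {
  err <- predicted - actual
  sse <- sum(err^2)                      # Sum of squared errors
  sst <- sum((actual - mean(actual))^2)  # Total sum of squares
  c(MAE = mean(abs(err)), RMSE = sqrt(mean(err^2)), R2 = 1 - sse / sst)
}

# Invented values for illustration only
actual <- c(0, 2, 10, 30, 120)

# Always predicting the mean gives an R-squared of exactly 0
m_mean <- regression_metrics(actual, rep(mean(actual), length(actual)))
print(m_mean)

# Any predictor with a larger squared error than the mean gives R-squared below 0
m_const <- regression_metrics(actual, rep(40, length(actual)))
print(m_const)
```

With the objects defined earlier, a call such as `regression_metrics(valid_data$Number.of.Reviews, predictions_rf)` would return the same three values printed above.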

MLR Model

To create our MLR model, we used the lm() function to fit a model using only numerical variables: Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length. Unlike the other two models, this model is fit on all rows of df_abnb_new, so the error metrics below are computed in-sample rather than on held-out data. From this model, we then created several plots to examine the residuals.

cust_value_model = lm(formula =  Number.of.Reviews ~ Accommodates + Price + 
                        Bedrooms + Review.Scores.Value + Des_length, 
                      data = df_abnb_new)
# Get the model residuals
model_residuals = cust_value_model$residuals

predictions <- predict(cust_value_model)

# Calculate Mean Absolute Error (MAE)
MAE_LM <- mean(abs(predictions - df_abnb_new$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE):", MAE_LM))

## [1] "Mean Absolute Error (MAE): 28.6142727985584"

# Calculate Root Mean Squared Error (RMSE)
RMSE_LM <- sqrt(mean((predictions - df_abnb_new$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE):", RMSE_LM))

## [1] "Root Mean Squared Error (RMSE): 42.2220474353286"

# Calculate R-squared (coefficient of determination)
rsquared_LM <- summary(cust_value_model)$r.squared
print(paste("R-squared (Coefficient of Determination):", rsquared_LM))

## [1] "R-squared (Coefficient of Determination): 0.0804997871545754"
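Because the metrics above are computed on the same rows used to fit the model, they are likely to look better than they would on unseen listings. A minimal sketch of a hold-out evaluation for a linear model follows. It uses the built-in `mtcars` dataset as a stand-in (an assumption for illustration, not the AirBnB sample), with a 60/40 split like the one used for the random forest.

```r
# Stand-in data: the built-in mtcars dataset, not the AirBnB sample
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.6 * n))
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Fit on the training rows only
fit <- lm(mpg ~ wt + hp, data = train)

# Compare in-sample error with error on rows the model has not seen
mae_train <- mean(abs(predict(fit) - train$mpg))
mae_valid <- mean(abs(predict(fit, newdata = valid) - valid$mpg))
cat("Train MAE:", round(mae_train, 2), " Validation MAE:", round(mae_valid, 2), "\n")
```

Applied to this project, the equivalent would be fitting `lm()` on `training_data` and predicting on `valid_data`, which would make the MLR metrics directly comparable to the random forest metrics.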

Histogram

For our MLR model, we first created a histogram of the residuals. As seen in the histogram below, our model’s residuals were right-skewed, indicating that the normality assumption is most likely violated. Further, most of the residuals fell between around -50 and 50.

# Plot a histogram of the residuals
hist(model_residuals, col = "skyblue", main = 'Histogram of MLR Model Residuals')

Residuals Plot

For our MLR model, we next created a Q-Q plot of the residuals. As seen in the Q-Q plot below, our model’s residuals again showed a right skew, with the points departing further from the reference line at the higher quantiles.

# Residuals Plot
qqnorm(model_residuals, main = "Q-Q Plot of MLR Model Residuals")
# Plot the Q-Q line
qqline(model_residuals, col = "darkorchid3")
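Right-skewed residuals often go together with a right-skewed count target. One common option, which the original analysis does not use, is to model `log1p()` of the target instead of the raw count. The sketch below illustrates the effect on an invented vector of review counts; the `skewness` helper is a simple moment-based estimate written for this example.

```r
# Hypothetical review counts, invented to mimic a long right tail
reviews <- c(0, 1, 2, 2, 3, 5, 8, 10, 28, 239)

# Simple moment-based sample skewness (helper written for this illustration)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# log1p(x) = log(1 + x), so zero counts are handled without error
cat("Skewness of raw counts:   ", round(skewness(reviews), 2), "\n")
cat("Skewness of log1p(counts):", round(skewness(log1p(reviews)), 2), "\n")
```

If a model is fit on the transformed target, predictions can be mapped back to counts with `expm1()`.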

Correlation Matrix

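Lastly for our MLR model, we created a correlation matrix for the numerical predictors Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length; the target variable, Number.of.Reviews, is dropped before the correlations are computed. As seen in the correlation matrix below, the highest correlation appeared to be between Bedrooms and Accommodates. This was followed closely by the correlation between Price and Accommodates and between Price and Bedrooms. Because the target is not part of the matrix, these values describe how the predictors relate to one another rather than how strongly they relate to Number.of.Reviews. They indicated to us that Accommodates, Bedrooms, and Price carry overlapping information, which can make their individual coefficients in the MLR model harder to interpret.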

df_cont <- df_abnb_new[ , c("Number.of.Reviews", "Accommodates", "Price",
                            "Bedrooms", "Review.Scores.Value", "Des_length")]
reduced_data <- subset(df_cont, select = -Number.of.Reviews)
# Compute correlation at 2 decimal places
corr_matrix = round(cor(reduced_data), 2)
# Plot the correlation matrix
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower", lab = TRUE) + 
  ggtitle("Correlation Matrix of MLR Model Selected Variables")
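Since the plotted matrix leaves out the target, a separate look at how each predictor correlates with the target can be useful. The sketch below uses the built-in `mtcars` dataset as a stand-in, with `mpg` playing the role of the target (an assumption for illustration only).

```r
# Stand-in data: the built-in mtcars dataset, with mpg playing the role of the target
df_num <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]

# Full correlation matrix, rounded to 2 decimal places
cor_full <- round(cor(df_num), 2)

# Keep only the row for the target and drop its correlation with itself
target_cor <- cor_full["mpg", setdiff(colnames(cor_full), "mpg")]
print(sort(target_cor))
```

For this project, the same pattern would apply to `df_cont`, which still contains Number.of.Reviews, by selecting the "Number.of.Reviews" row of `cor(df_cont)`.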

KNN Model

For our last model, we created a KNN model to see how our independent variables impacted our target variable predictions. In the run shown below, caret found the optimal k to be k = 9, which produced a cross-validated RMSE of about 43, an R-squared of about 0.18, and an MAE of about 31. RMSE measures how far the predicted values fall from the actual values. Although this is a relatively high RMSE, our target variable ranged from 0 to around 200, so the error is moderate relative to that range. Our cross-validated R-squared implies that only around 18% of the variance in our target variable is explained by our independent variables, which is lower than what we would ideally want; on the held-out test set, the R-squared was negative (about -0.19), meaning the model predicted worse than the test-set mean would have. Most likely, this is because many more factors could impact our target variable than the ones we picked, and we could not include them all without making the model too complicated. Likewise, our test-set mean absolute error of about 28.5 is relatively moderate considering the range of our target variable.

# Create training and testing datasets
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]

# Create a KNN model using caret
knn_model <- train(
  Number.of.Reviews ~ .,
  data = trainData,
  method = "knn",
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards

print(knn_model)

## k-Nearest Neighbors 
## 
##  64 samples
## 112 predictors
## 
## Pre-processing: centered (112), scaled (112) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 58, 58, 59, 57, 56, 57, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  43.26290  0.1385732  31.96195
##   7  43.28874  0.1232710  31.65823
##   9  42.95763  0.1815657  31.25944
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

# Make predictions on the test dataset
predictions <- predict(knn_model, newdata = testData)

# Evaluate the model
RMSE_KNN <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error:", RMSE_KNN))

## [1] "Root Mean Squared Error: 44.4401684450967"

# Calculate Mean Absolute Error (MAE)
MAE_KNN <- mean(abs(predictions - testData$Number.of.Reviews))
print(paste("Mean Absolute Error (MAE):", MAE_KNN))

## [1] "Mean Absolute Error (MAE): 28.5"

# Calculate Root Mean Squared Error (RMSE)
RMSE_KNN <- sqrt(mean((predictions - testData$Number.of.Reviews)^2))
print(paste("Root Mean Squared Error (RMSE):", RMSE_KNN))

## [1] "Root Mean Squared Error (RMSE): 44.4401684450967"

# Calculate R-squared (coefficient of determination)
SSE_KNN <- sum((predictions - testData$Number.of.Reviews)^2) # Sum of Squared Errors
SST_KNN <- sum((testData$Number.of.Reviews - mean(testData$Number.of.Reviews))^2) # Total Sum of Squares
rsquared_KNN <- 1 - SSE_KNN/SST_KNN
print(paste("R-squared (Coefficient of Determination):", rsquared_KNN))

## [1] "R-squared (Coefficient of Determination): -0.185140960638793"
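The zero-variance warnings in the KNN output arise because some term columns are constant within a resampling fold, so they cannot be scaled. A minimal base-R sketch of filtering constant columns before scaling is shown below on an invented `toy` dataframe. In caret, adding "zv" to the `preProcess` methods or using `nearZeroVar()` are related options; neither is used in the original code.

```r
# Hypothetical toy predictors; 'guards' is constant, like the terms in the warnings
toy <- data.frame(
  pool = c(0, 1, 0, 1),
  guards = c(0, 0, 0, 0),
  Price = c(80, 120, 95, 300)
)

# Keep only columns whose variance is strictly positive
keep <- vapply(toy, function(col) var(col) > 0, logical(1))
toy_filtered <- toy[, keep, drop = FALSE]
print(names(toy_filtered))

# Centering and scaling no longer divides by a zero standard deviation
scaled <- scale(toy_filtered)
print(round(scaled, 2))
```

Filtering on the full training set reduces these warnings but may not remove them entirely, because a rare term can still be constant inside an individual cross-validation fold.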

STEP 6 - Evaluating Models

The last step of our data analysis is to compare the three models that we built to one another. We will do this by first comparing lift charts for each of the models, and then comparing each model’s MAE, RMSE, and R-squared values to evaluate their performance.
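As a reference for the comparison, the metrics printed in STEP 5 can be collected in one table. The sketch below copies the rounded values from the run shown above; they change with each random sample. The `Evaluated_on` column is added here to flag that the three models were not scored on the same rows.

```r
# Metrics copied (rounded) from the run shown in STEP 5; they vary between runs
model_metrics <- data.frame(
  Model = c("Random Forest", "MLR", "KNN"),
  MAE   = c(32.53, 28.61, 28.50),
  RMSE  = c(58.53, 42.22, 44.44),
  R2    = c(-0.064, 0.080, -0.185),
  Evaluated_on = c("validation (40%)", "all rows (in-sample)", "test (20%)"),
  stringsAsFactors = FALSE
)

# Sort by MAE, smallest error first
print(model_metrics[order(model_metrics$MAE), ])
```

Because the MLR values are in-sample and the other two use different hold-out splits, the ranking in this table should be read with caution.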

Lift Chart for Random Forest Model

library(caret)
library(randomForest)

# Uses rf_model_caret and valid_data from the Random Forest section above

# Make predictions on the validation set
predictions <- predict(rf_model_caret, newdata = valid_data)

# Combine predictions with actual outcomes
results <- data.frame(Actual = valid_data$Number.of.Reviews, Predicted = predictions)
results <- results[order(-results$Predicted), ]  # Sort predictions in descending order

# Calculate cumulative sum of outcomes
results$CumulativeActual <- cumsum(results$Actual)

# Calculate lift values
expected_lift <- sum(results$Actual) / nrow(results)
results$Lift <- results$CumulativeActual / (expected_lift * (1:nrow(results)))

# Divide data into deciles or percentiles
deciles <- quantile(results$Predicted, probs = seq(0, 1, by = 0.1))  # Deciles

# Calculate average lift for each decile
avg_lift <- tapply(results$Lift, cut(results$Predicted, breaks = deciles, include.lowest = TRUE), mean)

# Plot the lift chart with a baseline of 1 (the lift of a model with no predictive power)
baseline <- rep(1, length(avg_lift))

plot(1:length(avg_lift), avg_lift, type = "b", xlab = "Deciles", ylab = "Lift", 
     main = "Lift Chart - Random Forest Model", col = "red",
     ylim = range(c(avg_lift, baseline), na.rm = TRUE))
lines(1:length(avg_lift), baseline, type = "b", col = "blue")
legend("topright", legend = c("Model Lift", "Baseline"), col = c("red", "blue"), lty = 1)
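The lift calculation above is repeated for each model, so it can be expressed as one function. The sketch below is a simplified variant on invented numbers: it assigns the sorted rows to equally sized bins by rank and reports the cumulative lift at the end of each bin, instead of cutting on quantiles of the predictions. The helper name `decile_lift` is hypothetical.

```r
# Hypothetical helper; decile_lift is not part of the original code
decile_lift <- function(actual, predicted, n_bins = 10) {
  # Sort actual outcomes from highest to lowest prediction
  actual <- actual[order(predicted, decreasing = TRUE)]
  # Assign sorted rows to equally sized bins (bin 1 = highest predictions)
  bin <- ceiling(seq_along(actual) * n_bins / length(actual))
  # Running mean of actual outcomes among the top-ranked rows
  cum_mean <- cumsum(actual) / seq_along(actual)
  # Cumulative lift at the end of each bin, relative to the overall mean
  tapply(cum_mean / mean(actual), bin, function(x) x[length(x)])
}

# Invented values for illustration only
actual    <- c(50, 40, 30, 20, 10, 9, 8, 7, 6, 5)
predicted <- c(45, 42, 25, 22, 12, 10, 9, 6, 7, 4)
lift <- decile_lift(actual, predicted, n_bins = 5)
print(round(lift, 2))
```

Cumulative lift always ends at 1 in the last bin, because by then every row is included; a model with no predictive power would stay near 1 throughout.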

Lift Chart for MLR Model

# Uses valid_data from the Random Forest section above

# Multiple Linear Regression Model
cust_value_model <- lm(formula = Number.of.Reviews ~ Accommodates + Price +
                        Bedrooms + Review.Scores.Value + Des_length, data = df_abnb_new)

# Get predictions from the linear regression model
predictions_lm <- predict(cust_value_model, newdata = valid_data)

# Combine predictions with actual outcomes
results_lm <- data.frame(Actual = valid_data$Number.of.Reviews, Predicted = predictions_lm)
results_lm <- results_lm[order(-results_lm$Predicted), ]  # Sort predictions in descending order

# Calculate cumulative sum of outcomes for linear regression
results_lm$CumulativeActual <- cumsum(results_lm$Actual)

# Calculate lift values for linear regression
expected_lift_lm <- sum(results_lm$Actual) / nrow(results_lm)
results_lm$Lift <- results_lm$CumulativeActual / (expected_lift_lm * (1:nrow(results_lm)))

# Divide data into deciles or percentiles for linear regression
deciles_lm <- quantile(results_lm$Predicted, probs = seq(0, 1, by = 0.1))  # Deciles

# Calculate average lift for each decile for linear regression
avg_lift_lm <- tapply(results_lm$Lift, cut(results_lm$Predicted, breaks = deciles_lm, include.lowest = TRUE), mean)

# Plot the lift chart with a baseline of 1 (the lift of a model with no predictive power)
baseline <- rep(1, length(avg_lift_lm))

plot(1:length(avg_lift_lm), avg_lift_lm, type = "b", xlab = "Deciles", ylab = "Lift", 
     main = "Lift Chart - Multiple Linear Regression", col = "green",
     ylim = range(c(avg_lift_lm, baseline), na.rm = TRUE))
lines(1:length(avg_lift_lm), baseline, type = "b", col = "blue")
legend("topright", legend = c("Model Lift", "Baseline"), col = c("green", "blue"), lty = 1)

Lift Chart for KNN Model

# KNN Model
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]

library(caret)

knn_model <- train(
  Number.of.Reviews ~ .,
  data = trainData,
  method = "knn",
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)

# Make predictions on the test dataset
predictions_knn <- predict(knn_model, newdata = testData)

# Combine predictions with actual outcomes for KNN
results_knn <- data.frame(Actual = testData$Number.of.Reviews, Predicted = predictions_knn)
results_knn <- results_knn[order(-results_knn$Predicted), ]  # Sort predictions in descending order

# Calculate cumulative sum of outcomes for KNN
results_knn$CumulativeActual <- cumsum(results_knn$Actual)

# Calculate lift values for KNN
expected_lift_knn <- sum(results_knn$Actual) / nrow(results_knn)
results_knn$Lift <- results_knn$CumulativeActual / (expected_lift_knn * (1:nrow(results_knn)))

# Divide data into deciles or percentiles for KNN
deciles_knn <- quantile(results_knn$Predicted, probs = seq(0, 1, by = 0.1))  # Deciles

# Calculate average lift for each decile for KNN
avg_lift_knn <- tapply(results_knn$Lift, cut(results_knn$Predicted, breaks = deciles_knn, include.lowest = TRUE), mean)

# Baseline of 1 (the lift of a model with no predictive power), one point per decile
shorter_length <- length(avg_lift_knn)
baseline <- rep(1, shorter_length)

# Plot the lift chart with KNN and baseline
plot(1:shorter_length, avg_lift_knn, type = "b", xlab = "Deciles", ylab = "Lift",
     main = "Lift Chart - KNN vs. Baseline", col = "purple",
     ylim = range(c(avg_lift_knn, baseline), na.rm = TRUE))
lines(1:shorter_length, baseline, type = "b", col = "red")
legend("topright", legend = c("KNN Lift", "Baseline"), col = c("purple", "red"), lty = 1)

Lift Chart for All Three Models

# Uses the Random Forest and MLR predictions computed in the sections above

# KNN Model
set.seed(123)
trainIndex <- createDataPartition(final_data_1$Number.of.Reviews, p = 0.8, list = FALSE)
trainData <- final_data_1[trainIndex, ]
testData <- final_data_1[-trainIndex, ]

library(caret)

knn_model <- train(
  Number.of.Reviews ~ .,
  data = trainData,
  method = "knn",
  trControl = trainControl(method = "cv"),
  preProcess = c("center", "scale")
)

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: baby, bath

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: bed, blankets, connection, ethernet,
## extra, linens, pillows, water

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: babysitter, console, covers,
## doorman, game, outlet, pool, recommendations

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: darkening, gates, shades, stair

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: guards

# Make predictions on the test dataset
predictions_knn <- predict(knn_model, newdata = testData)

# Combine predictions with actual outcomes for KNN
results_knn <- data.frame(Actual = testData$Number.of.Reviews, Predicted = predictions_knn)
results_knn <- results_knn[order(-results_knn$Predicted), ]  # Sort predictions in descending order

# Calculate cumulative sum of outcomes for KNN
results_knn$CumulativeActual <- cumsum(results_knn$Actual)

# Calculate lift values for KNN
expected_lift_knn <- sum(results_knn$Actual) / nrow(results_knn)
results_knn$Lift <- results_knn$CumulativeActual / (expected_lift_knn * (1:nrow(results_knn)))

# Divide data into deciles or percentiles for KNN
deciles_knn <- quantile(results_knn$Predicted, probs = seq(0, 1, by = 0.1))  # Deciles

# Calculate average lift for each decile for KNN
avg_lift_knn <- tapply(results_knn$Lift, cut(results_knn$Predicted, breaks = deciles_knn, include.lowest = TRUE), mean)

# Plot the lift chart with all three models and baseline
# Use a shared y-range so no series is clipped, and index each series by its own length
y_range <- range(c(avg_lift, avg_lift_lm, avg_lift_knn, baseline), na.rm = TRUE)
plot(seq_along(avg_lift), avg_lift, type = "b", xlab = "Deciles", ylab = "Lift", ylim = y_range,
     main = "Lift Chart - Random Forest vs. Linear Regression vs. KNN", col = "red")
lines(seq_along(avg_lift_lm), avg_lift_lm, type = "b", col = "green")
lines(seq_along(avg_lift_knn), avg_lift_knn, type = "b", col = "purple")
lines(seq_along(baseline), baseline, type = "b", col = "blue")
legend("topright", legend = c("Random Forest Lift", "Multiple Linear Regression Lift", "KNN Lift", "Baseline"),
       col = c("red", "green", "purple", "blue"), lty = 1)

The lift chart measures the effectiveness of each model as the ratio of the results obtained with the model to the results obtained without it. In the lift chart above, which compares all three models with the baseline, every model falls well below the baseline, indicating poor predictive performance and that all of the models likely need improvement. This is consistent with the data quality and feature issues noted throughout this report, which stem from how large our initial dataset was: the models are most likely missing significant features that affect the target variable, which could explain why none of them performs close to the baseline. Comparing the three models to one another, however, the multiple linear regression model seems to have performed best. While the KNN and random forest curves both drop off significantly across the deciles, the linear regression curve flattens and then increases. This suggests that the random forest and KNN models lose predictive accuracy as more of the sample is included, whereas the MLR model improves.
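To make the idea of lift concrete, here is a minimal sketch on made-up numbers. The vectors `actual` and `predicted` and the helper `decile_lift` are hypothetical and are not part of our analysis. The sketch ranks observations by predicted value, splits them into ten equal-sized groups, and divides each group's mean actual outcome by the overall mean, so a value above 1 means the group received more reviews than average. It is a simplified, non-cumulative variant and not the exact calculation used for the charts above.

```r
# Hypothetical helper: lift per decile for a numeric target.
# Decile 1 holds the observations with the highest predictions.
decile_lift <- function(actual, predicted, groups = 10) {
  ord <- order(predicted, decreasing = TRUE)   # rank by prediction, highest first
  actual_sorted <- actual[ord]
  n <- length(actual_sorted)
  # Assign equal-sized groups by rank so tied predictions cannot break the binning
  grp <- ceiling(seq_len(n) * groups / n)
  group_mean <- tapply(actual_sorted, grp, mean)
  as.numeric(group_mean / mean(actual))        # lift = group mean / overall mean
}

# Made-up example: 20 listings whose predictions rank them perfectly
actual    <- c(50, 45, 40, 38, 30, 28, 25, 22, 20, 18,
               15, 12, 10, 9, 8, 6, 5, 4, 3, 2)
predicted <- rev(seq_along(actual))
lift <- decile_lift(actual, predicted)
round(lift, 2)
```

With predictions that rank listings well, the lift starts above 1 in the first decile and falls below 1 in the last. A model whose lift stays near or below 1 in the top deciles is not separating high-review listings from the rest.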

Comparison Table of Statistical Metrics for All Three Models

# Collect the evaluation metrics computed earlier for each model

# Store the metrics in vectors, in the order MAE, RMSE, R-squared
rf_metrics <- c(MAE_rf, RMSE_rf, rsquared_rf)  # Random Forest
lm_metrics <- c(MAE_LM, RMSE_LM, rsquared_LM)  # Multiple Linear Regression
knn_metrics <- c(MAE_KNN, RMSE_KNN, rsquared_KNN)  # KNN

# Create a matrix or data frame to hold the metrics
comparison_table <- matrix(NA, nrow = 3, ncol = 3)  # Create an empty matrix
rownames(comparison_table) <- c("MAE", "RMSE", "R-squared")  # Row names for metrics
colnames(comparison_table) <- c("Random Forest", "MLR", "KNN")  # Column names for models

# Fill in the matrix with the metrics
comparison_table[, "Random Forest"] <- rf_metrics
comparison_table[, "MLR"] <- lm_metrics
comparison_table[, "KNN"] <- knn_metrics

# Display the comparison table
#print("Comparison of Evaluation Metrics for Different Models:")
#print(comparison_table)

library(flextable)
comparison_table <- data.frame(
  Row = c("MAE", "RMSE", "R-squared"),
  `Random Forest` = rf_metrics,
  `MLR` = lm_metrics,
  KNN = knn_metrics
)

# Create a flextable object
ft <- flextable(comparison_table)


# Apply some style to the table
ft <- ft %>%
  flextable::bg(j = 1:4, bg = "lightblue") %>%
  flextable::border_outer()
set_caption(ft, "Comparison of Evaluation Metrics for Different Models")

Comparison of Evaluation Metrics for Different Models
Row	Random.Forest	MLR	KNN
MAE	32.53238767	28.61427280	28.500000
RMSE	58.53257457	42.22204744	44.440168
R-squared	-0.06376361	0.08049979	-0.185141

The table above compares statistical metrics for the three models to show how each model is performing and whether the factors we chose for each model show any significance. For MAE, a lower value is better because it indicates that, on average, the model’s predictions are closer to the actual values. In the table above, MLR has relatively good accuracy in terms of MAE: its value (28.61) is nearly identical to KNN’s (28.50) and lower than the random forest’s (32.53). For RMSE, lower values are also better. Unlike MAE, however, RMSE gives more weight to larger errors, so it also reflects how strongly large misses affect each model. In the run shown, MLR has the lowest RMSE of the three (42.22), but that value is still well above its MAE, which suggests that a few large errors carry considerable weight. Lastly, a higher R-squared is better because it indicates the proportion of the variance in the target variable that the model explains. None of the R-squared values is high, but in the run shown MLR has the highest one (0.08) and is the only model with a positive value, indicating that it explains the most variance in the target variable. A negative R-squared, as for the random forest and KNN here, typically indicates that the model does no better than simply predicting the mean of the target.
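As a minimal sketch of how these three metrics relate, the example below computes them from made-up actual and predicted values. The helper `regression_metrics` and both vectors are hypothetical, and R-squared is assumed to be defined as 1 - SSE/SST, which is consistent with the negative values in the table but may differ from the exact calculation used earlier in the report.

```r
# Hypothetical helper: MAE, RMSE, and R-squared from actual and predicted values
regression_metrics <- function(actual, predicted) {
  err  <- actual - predicted
  mae  <- mean(abs(err))          # average size of the errors
  rmse <- sqrt(mean(err^2))       # squaring gives large errors more weight
  # Assumed definition: 1 - SSE/SST. This is negative when the model does
  # worse than always predicting the mean of the actual values.
  r2 <- 1 - sum(err^2) / sum((actual - mean(actual))^2)
  c(MAE = mae, RMSE = rmse, R2 = r2)
}

# Made-up numbers: four close predictions and one large miss
actual    <- c(10, 20, 30, 40, 100)
predicted <- c(12, 18, 33, 41, 40)
metrics <- regression_metrics(actual, predicted)
round(metrics, 2)
```

Because the single large miss is squared before averaging, RMSE in this example is roughly twice the MAE. A similar gap between the two metrics in the table points to a few listings with large prediction errors.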

Conclusion

In summary, we built a random forest, a multiple linear regression, and a KNN model to try to answer our business problem of what factors impact the number of reviews, and thus the number of bookings, a given AirBnB listing will receive. After comparing the performance of the three models using several evaluation techniques, we found that the multiple linear regression model appeared to have the highest prediction accuracy and thus yielded the best results. It is important to note that, because our analysis imports a random sample of the data, we got different results every time we re-ran it. However, we concluded that the MLR model was the best because it consistently had the best results across all of the runs. The numerical predictors of the MLR model (Accommodates, Price, Bedrooms, Review.Scores.Value, and Des_length) thus seem to have at least some impact on the number of reviews a given listing will receive. In particular, Accommodates, Bedrooms, and Price appeared to have the highest correlation with the target variable. Although we concluded that the MLR model yielded the best results, a number of changes could produce a model that is more useful for further analysis. For example, including more rows of data and more variables, finding a better method of text mining, and using data with fewer quality issues would all address problems we encountered during this project and, if further analysis is done, should produce better, more accurate results.

AirBnB Final Project - Group 7

Sofi Vietri, Alex Sonnier, Lulu Lemken, Kaela Bernard

2023-12-04