---
title: "Airbnb Analysis"
author: "Alice, Ben, Amy, Calvin"
date: "19 April 2022"
output:
  html_document:
    code_folding: hide
---

Airbnb, Opportunities and Challenges

Learning Objectives

  • Create maps to represent spatial data
  • Build increasingly complex multiple regression models
  • Visualize, test, and understand your model residuals
  • Build a model with strong predictive power and use it to make predictions
  • Use your results to decide on the best place to run an Airbnb room, what features it should have, and what it should be priced at.

This is a team assignment. As usual, you will turn in one knitted HTML report for your team on Github. However, for the first time this week, I haven’t made an RMarkdown template for you. You should make your .Rmd file from scratch. Make sure that it has all the sections that we’ve been using all semester long: Introduction, Ethical Considerations, Data Explanation and Exploration, Statistical Analysis and Interpretation, and Conclusion.

Fully answer the questions below, but incorporate the answers into the report so that it will make sense to someone who has never seen the questions.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(parsnip)
library(ggthemes)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(sp)
library(rworldmap)
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
library(gapminder)

Overview

Airbnb is a company founded in 2008 that operates an online marketplace for hospitality services. Users looking for places to reside can browse available listings by location, dates available, characteristics, amenities, and price. Home owners can list their properties for short or long term rent.

There are a wide variety of customers who use Airbnb: some renters are tourists, wanting short term stays near popular attractions at cheaper prices and more homely environments than traditional hotels. Others are business travellers, looking for similar features. Similarly, some home owners act as permanent landlords, owning multiple properties and renting them out on a permanent basis. Others have spare bedrooms or space and want to earn a little extra money on the side. Some even rent their home when they are out of town.

In this lab we will explore factors that affect the price of rental units in Columbus, Ohio. We will examine characteristics of the rentals, geographic locations, and time considerations. This will serve both a descriptive and a predictive analytics purpose: descriptive in the sense that interpreting regression models can help us understand what factors are the most critical to the price and how they affect it; and predictive in estimating what the price of a rental is, given certain factors that could be measured. Predictions can be used from both the renter and owner perspective: renters can understand whether a property is fairly priced and identify good deals, and owners can receive suggestions about how to price their properties.

Data

Data was downloaded from the website http://insideairbnb.com/get-the-data.html. This site is not affiliated with or endorsed by Airbnb. You can find some useful bits of metadata if you scroll down to “disclaimers” here - http://insideairbnb.com/about.html . Please read this section and think about if/how the disclaimers about the data could affect your analysis. There is not a developed codebook for this data, but many of the variables should be fairly self-explanatory.

We will be working with a somewhat cleaned version of the data for Columbus, Ohio, compiled on 28 July 2020. Bring the data into R, and start to explore it by looking at the column names, the unique values for categorical variables, etc.
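
A minimal sketch of that first look, assuming a small made-up data frame in place of the real file. The `toy` object and all of its values are hypothetical; only the column names mirror ones in the listings data.

```r
# Hypothetical stand-in for the listings data; every value is invented.
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Private room", "Shared room"),
  bedrooms  = c(2, 1, 1, NA),
  price     = c("$120.00", "$45.00", "$1,050.00", "$30.00"),
  stringsAsFactors = FALSE
)

names(toy)                            # column names only
str(toy)                              # type of each column (price is character here)
unique(toy$room_type)                 # distinct values of a categorical variable
room_counts <- table(toy$room_type)   # number of listings per value
room_counts
n_missing <- colSums(is.na(toy))      # missing values per column
n_missing
```

The same calls work unchanged on the full dataset once it has been read in.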

New packages

In this lab, you’ll need to install and use functions from several new packages (GGally, ggfortify, and stringr), as well as from packages that you’ve already been introduced to in previous labs.

Report

Your job is to provide a brief report on the Columbus Airbnb dataset that is both descriptive and predictive in its analysis. You will need to provide evidence of model validation, as well as polished graphics that help summarize the data and make your key points. Use code folding to hide your code, and use headers and text to clearly walk your reader through your exploration, analysis, and findings. There should be enough text between the output (visuals and models) that a reader could understand from the report what your question was, how you made decisions, your visual and model interpretations, and what your final conclusions were.

The Task

  1. Create a new folder, R Project and R Markdown file. Download the dataset.

  2. Bring in the Columbus dataset. Use commands that you have learned in the class to explore the dataset, including learning what your variables represent. There are many variables here, so you’ll need to make some decisions on what you’ll want to focus on in your analysis. Most of them are self explanatory. You could use names(dataset) to see a quick listing of all the column names, without all the summary information.

cbus <- read.csv("Columbus_2020_listings.csv")
names(cbus)
##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"
  1. What ethical considerations might you have when working with this kind of data? Who are potential stakeholders in your analysis? What repercussions could your analysis have (e.g. social, statistical, and communication ethics)?

  2. OK, before we get too far, think about the main task at hand to predict housing price (response variable). What variable(s) do you think should be your predictor variables? Choose 5-7 predictors that you think make some logical sense, based on which of the variables seem intuitive to you and are easily understood.

  • beds

  • amenities

  • square_feet

  • bedrooms

  • neighbourhood_cleansed

  1. Look at the variables that represent price (price, weekly_price, etc…). What do you notice? It looks like R is reading these as character strings, and there are special characters like $ and , in the values, which will cause problems for our analysis. Let’s get rid of them now.

    The stringr function str_remove can match a pattern of 1 or more characters, and remove all matching characters from that string. For example:
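
A short illustration of the difference between the two functions, using made-up price strings (the `prices` vector is hypothetical, not taken from the dataset):

```r
library(stringr)

prices <- c("$85.00", "$1,250.00")   # invented example values

# str_remove() drops only the first match in each string,
# so the comma in the second value survives
first_only <- str_remove(prices, "[$,]")
first_only

# str_remove_all() drops every match, so both "$" and "," are gone
all_removed <- str_remove_all(prices, "[$,]")
all_removed

# Only the fully cleaned strings convert to numbers without producing NA
as.numeric(all_removed)
```

The pattern `"[$,]"` is a character class that matches either a dollar sign or a comma.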

Using the above code as an example, look up the help for the functions str_remove and str_remove_all, to get more information about how these work. These functions come from the stringr package, which allows us to manipulate text data. You can use the code below to “clean up” the dollar signs and the commas in the prices and make sure they are all numeric instead of character. I’ll give it to you here to copy and paste, but try to make sure you understand how it’s working. I called my data “cbus”, but you’ll need to replace that with whatever you named your dataset when you brought it into R. The column names should all be the same.

     cbus <- cbus %>%
        mutate(price = as.numeric(str_remove_all(price, "[$,]")),
          weekly_price = as.numeric(str_remove_all(weekly_price, "[$,]")),
          monthly_price = as.numeric(str_remove_all(monthly_price, "[$,]")),
          security_deposit = as.numeric(str_remove_all(security_deposit, "[$,]")),
          cleaning_fee = as.numeric(str_remove_all(cleaning_fee, "[$,]")),
          extra_people = as.numeric(str_remove_all(extra_people, "[$,]")))
  1. Since the point of this analysis is to explore factors that affect the price of rental units in Columbus, make 2 polished exploratory graphics related to price (and possibly some of the other predictor variables you chose). The exact graphics you use can be your choice, but should help describe the variable or lead to next level questions or connections to other variables in your analysis. They should be two different kinds of graphics (e.g., not two histograms). Please keep in mind that polished graphics always have a caption or annotation that helps interpret them as well as nice labels and conscious choices for use of aesthetics like color, size, shape, or theme.
cbus <- cbus %>% rename(neighborhood = neighbourhood_cleansed)
cbus <- cbus %>% group_by(neighborhood) %>% mutate(meanprice = mean(price))
# Keep one row per neighborhood before plotting; with one row per listing, identity bars would stack the repeated means
cbus %>% distinct(neighborhood, meanprice) %>% ggplot(aes(x = neighborhood, y = meanprice, fill = neighborhood)) + geom_col() + labs(x="Columbus neighborhood", y="average rental price (USD)", title="Average Columbus rental prices by neighborhood",caption="Data taken from Columbus AirBNB records") + coord_flip() + theme_classic()

cbus <- cbus %>% group_by(beds) %>% mutate(meanprice2 = mean(price))
ggplot(cbus, aes(x=beds, y=meanprice2)) + labs(x="Number of beds", y="Mean rental price (USD)",title="Mean Rental Price By Number of Beds In The Home",caption="Data taken from Columbus AirBNB records") + geom_point() + theme_economist()
## Warning: Removed 2 rows containing missing values (geom_point).
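
One thing to watch when charting group means: `geom_bar(stat="identity")` stacks one segment per row, so a mean that is repeated on every listing gets added up. The sketch below uses an invented two-neighborhood data frame (`toy` and its values are hypothetical) to compare a one-row-per-group summary with what the stacked bar height would be.

```r
# Hypothetical listings; values are invented for illustration.
toy <- data.frame(
  neighborhood = c("A", "A", "A", "B"),
  price        = c(100, 200, 300, 50)
)

# One row per neighborhood, holding that neighborhood's mean price
mean_by_hood <- aggregate(price ~ neighborhood, data = toy, FUN = mean)
mean_by_hood

# ave() repeats the group mean on every row (like group_by + mutate);
# summing those repeats gives the height a stacked identity bar would show
repeated_mean  <- ave(toy$price, toy$neighborhood)
stacked_height <- tapply(repeated_mean, toy$neighborhood, sum)
stacked_height
```

Summarizing to one row per group before plotting keeps the bar height equal to the mean itself.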

  1. Make a map. Since this is the first dataset we’ve used with spatial information beyond large regions like “state”, let’s put our points on a map so we can see where our data comes from. To do this, we’ll use the ggmap package that we explored a couple weeks ago. Be sure to refer to the mapping slideshow and your notes from those classes to remind yourself of how ggmap works.

Remember the exercise we did on crime in Houston? Use code similar to that to plot a map of Columbus, and then add your Airbnb location (x=longitude, y=latitude) points to it. To get started, the coordinates for drawing a box around Columbus are:

bbox <- c(left=-83.2, bottom=39.8, right=-82.75, top=40.16)
map <- get_stamenmap(bbox, maptype="terrain", zoom=10)
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
ggmap(map, darken=0.1) + geom_point(data=cbus, aes(x = longitude, y = latitude, col = neighborhood)) + labs(title = "AirBNBs in Columbus", col = "Neighborhood Legend")

As with any other ggplot, you can still add arguments for color, transparency, size, shape, etc., in your aesthetic. Your plot should add one or more possibly explanatory variables using these additional arguments (for example, use color, shape, and/or size to help visualize other variables, perhaps neighborhood, price, or some other characteristic of the housing).
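
Before plotting, it can be worth checking that the listing coordinates fall inside the bounding box, since points outside the map extent generally will not appear on it. This is a sketch only: the `bbox` values are the ones given above, while the `pts` coordinates are invented.

```r
bbox <- c(left = -83.2, bottom = 39.8, right = -82.75, top = 40.16)

# Hypothetical coordinates; the third point is deliberately outside the box.
pts <- data.frame(
  longitude = c(-83.00, -82.90, -84.50),
  latitude  = c(39.96, 40.05, 39.10)
)

# TRUE when a point lies within all four edges of the box
inside <- pts$longitude >= bbox[["left"]]   & pts$longitude <= bbox[["right"]] &
          pts$latitude  >= bbox[["bottom"]] & pts$latitude  <= bbox[["top"]]
inside

sum(!inside)   # number of points that fall outside the map extent
```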

  1. Regression 1 (bivariate linear regression). Plot the price by one predictor variable of your choice, with a trend line, run a simple linear regression, and completely and accurately interpret the output.

    Note 1: This means reporting the intercept, slope, and R², and all associated p-values for each of these. It also means talking about them in terms of the data and discussing statistical vs. practical significance.

    Note 2: As you work through the subsequent analyses, if it feels useful to filter the dataset to a subset of rentals (optional, not required) that is OK. But make sure to state in the text and in your explanation what you did and why that makes sense (i.e., does not bias your analysis, or helps to answer a specific question).

bivariate_reg  <- lm(price~review_scores_rating, data = cbus)
summary(bivariate_reg)
## 
## Call:
## lm(formula = price ~ review_scores_rating, data = cbus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3219.6  -281.8  -166.1   -83.4  9827.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4106.360    583.285   7.040 3.17e-12 ***
## review_scores_rating  -39.340      6.085  -6.465 1.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1371 on 1245 degrees of freedom
##   (162 observations deleted due to missingness)
## Multiple R-squared:  0.03248,    Adjusted R-squared:  0.03171 
## F-statistic:  41.8 on 1 and 1245 DF,  p-value: 1.448e-10
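
The quantities Note 1 asks for can also be pulled out of a fitted model programmatically rather than read off the printout. The sketch below assumes the built-in `mtcars` data as a stand-in, so the numbers it produces say nothing about the Airbnb listings.

```r
# mtcars ships with R; it stands in for the listings data here.
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

coefs     <- coef(s)   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["wt", "Estimate"]
slope_p   <- coefs["wt", "Pr(>|t|)"]
r2        <- s$r.squared

# p-value of the overall F test (printed on the last line of summary())
f       <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

c(intercept = intercept, slope = slope, slope_p = slope_p,
  r2 = r2, model_p = model_p)
```

With a single predictor, the overall F test and the slope's t test give the same p-value, which is why the two p-values in the output above agree.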
  1. Before you build up to a more complex model to predict price, we should check if our other predictors have multicollinearity. This means - are the predictor variables correlated with each other? If they are strongly correlated (let’s say more than 0.40 or 0.50), it can impact the accuracy of your model negatively.

Use the ggpairs function (from the GGally package) to do a quick visual and quantitative check, and to decide which variables you would like to put into your model. This function will make a visual “table” of your variables, where half of the graphic shows the scatterplot (or boxplots if categorical data) of how two variables are related, the diagonal shows the distribution of each variable, and the other half of the plot reports Pearson’s correlation coefficient (the same thing we have calculated before using cor() and cor.test()). Using the code below as a model, choose 5-7 potential predictor variables (not your response variable), and make a pairs plot.

cbus %>% group_by(neighborhood) %>% summarize(mean(price))
## # A tibble: 27 × 2
##    neighborhood          `mean(price)`
##    <fct>                         <dbl>
##  1 Clintonville                  114. 
##  2 Downtown                      380. 
##  3 Eastland/Brice                 62.7
##  4 Eastmoor/Walnut Ridge          81.6
##  5 Far East                       48  
##  6 Far North                      91  
##  7 Far Northwest                  78.4
##  8 Far South                     113. 
##  9 Far West                       46.4
## 10 Franklinton                   112. 
## # … with 17 more rows
cbus %>% group_by(neighborhood) %>% mutate(meanprice = mean(price))
## # A tibble: 1,409 × 108
## # Groups:   neighborhood [27]
##         id listing_url    scrape_id last_scraped name  summary space description
##      <int> <fct>              <dbl> <fct>        <fct> <fct>   <fct> <fct>      
##  1   90676 https://www.a…   2.02e13 2020-07-28   Shor… "Just … "Gre… "Just step…
##  2  543140 https://www.a…   2.02e13 2020-07-28   Priv… "Priva… "Min… "Private, …
##  3  591101 https://www.a…   2.02e13 2020-07-28   The … "Famou… "Thi… "Famous Am…
##  4  681145 https://www.a…   2.02e13 2020-07-28   2 be… "INSTA… "WEL… "INSTANT B…
##  5  923248 https://www.a…   2.02e13 2020-07-28   1 Si… "This … "Inc… "This is a…
##  6  927867 https://www.a…   2.02e13 2020-07-28   Full… "The W… "Inc… "The Wayfa…
##  7 1217678 https://www.a…   2.02e13 2020-07-28   Comf… "A coz… "We … "A cozy, w…
##  8 1286887 https://www.a…   2.02e13 2020-07-28   Bett… "2 Bed… "Bet… "2 Bedroom…
##  9 1321192 https://www.a…   2.02e13 2020-07-28   Down… "**IMP… "**N… "**IMPORTA…
## 10 1336145 https://www.a…   2.02e13 2020-07-28   Comf… "A coz… "We … "A cozy, w…
## # … with 1,399 more rows, and 100 more variables: experiences_offered <fct>,
## #   neighborhood_overview <fct>, notes <fct>, transit <fct>, access <fct>,
## #   interaction <fct>, house_rules <fct>, thumbnail_url <lgl>,
## #   medium_url <lgl>, picture_url <fct>, xl_picture_url <lgl>, host_id <int>,
## #   host_url <fct>, host_name <fct>, host_since <fct>, host_location <fct>,
## #   host_about <fct>, host_response_time <fct>, host_response_rate <fct>,
## #   host_acceptance_rate <fct>, host_is_superhost <fct>, …
ggpairs(cbus, columns=c("bedrooms", "beds", "review_scores_rating", "number_of_reviews", "meanprice"), cardinality_threshold = 27)
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 3 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 163 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 163 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning: Removed 163 rows containing missing values (geom_point).
## Removed 163 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 162 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 162 rows containing missing values
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing missing values (geom_point).

Show the graphic and discuss it. Does this change your mind for what you’d like to use for your model? Why or why not? Choose the best 3 or 4 variables to take to the next step.

Unsurprisingly there’s a strong correlation between the number of bedrooms and number of beds. That means we should only use one of these variables, not both.
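
The same check can be done numerically by listing every pair of predictors whose correlation exceeds the cutoff. This sketch uses the built-in `mtcars` data as a stand-in and the 0.50 cutoff mentioned in the prompt; the column choice is arbitrary.

```r
# mtcars stands in for the listings data; columns chosen only for illustration.
predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations, skipping missing values pair by pair
r <- cor(predictors, use = "pairwise.complete.obs")
round(r, 2)

# List each pair once (upper triangle) where |r| exceeds the cutoff
cutoff <- 0.5
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
flagged
```

Any pair that shows up in `flagged` is a candidate for keeping only one of the two variables in the model.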

  1. Regression 2 (multiple linear regression). Build a complex regression model that has 3 or more predictors for price, chosen from the variables that seemed the least correlated with each other in the ggpairs plot. Interpret the results of the model completely and accurately.

Question 10:

multiReg <- lm(price~bedrooms+review_scores_rating+number_of_reviews, data = cbus)

glance(multiReg)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0534        0.0511 1358.      23.4 1.04e-14     3 -10754. 21518. 21544.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(multiReg)
## # A tibble: 4 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          3650.       585.       6.24   5.89e-10
## 2 bedrooms              185.        35.4      5.23   1.94e- 7
## 3 review_scores_rating  -38.1        6.06    -6.28   4.76e-10
## 4 number_of_reviews       0.0429     0.586    0.0731 9.42e- 1
summary(multiReg)
## 
## Call:
## lm(formula = price ~ bedrooms + review_scores_rating + number_of_reviews, 
##     data = cbus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3287.6  -320.3  -138.6    -9.6  9785.5 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3650.39801  584.74895   6.243 5.89e-10 ***
## bedrooms              185.12977   35.36780   5.234 1.94e-07 ***
## review_scores_rating  -38.06332    6.06381  -6.277 4.76e-10 ***
## number_of_reviews       0.04286    0.58594   0.073    0.942    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1358 on 1242 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.0534, Adjusted R-squared:  0.05111 
## F-statistic: 23.35 on 3 and 1242 DF,  p-value: 1.043e-14
summary(fitted(multiReg))  # fitted values; multiReg$fit only works through partial matching of $fitted.values
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -155.9   111.1   258.8   343.9   478.7  3444.6
  1. Validate your model by using the autoplot() function and interpreting the diagnostic plots (for guidance on how to do this, refer back to our class notes). Show the resulting visuals.
library(ggfortify)
## Registered S3 method overwritten by 'ggfortify':
##   method          from   
##   autoplot.glmnet parsnip
autoplot(multiReg) + theme_minimal() +theme(plot.title = element_text(hjust = 0.5))
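
For reading the diagnostic panels, it can help to look at the numbers they are drawn from. The sketch below assumes the built-in `mtcars` data and a hypothetical two-predictor model, and collects the quantities behind the Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage panels.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model, not the Airbnb one

diag_df <- data.frame(
  fitted    = fitted(fit),           # x-axis of Residuals vs Fitted
  residual  = resid(fit),            # y-axis of Residuals vs Fitted
  std_resid = rstandard(fit),        # y-axis of the Normal Q-Q panel
  leverage  = hatvalues(fit),        # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(fit)    # influence of each observation
)

# Observations worth a first look: large standardized residuals
diag_df[abs(diag_df$std_resid) > 2, ]
```

Calling `plot(fit)` draws base-R versions of the same diagnostics.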

  1. Write a short paragraph that interprets the output and validation of your model and answers the following questions: What is the effect (statistically speaking) of each predictor on the price? How much (in $) does each predictor influence price? How did you know if your model was any good?

  2. Make some predictions of rental prices using the predict() function. Using the code below as a guide, fill in the x predictors with each of the ones included in your final model, and values that you would like to predict the price from. You may want to choose a few sets of values to check your intuition, to see if it does indeed predict a low, medium, and high price. For example, the test dataset you make should represent values that you think would lead to a low price, the next to a middle price, and the next to a high price.

So, if my predictors were “NumberBedrooms”, “HousingType”, and “YearHouseBuilt”, I might fill in the code as follows:

lowprice <- data.frame(NumberBedrooms=1, HousingType="studio", 
 YearHouseBuilt=1960)
highprice <- data.frame(NumberBedrooms=6, HousingType="house", 
  YearHouseBuilt=2003)
predict(model, lowprice)
predict(model, highprice)

… assuming that an older studio apartment with fewer bedrooms rents for less than a house with many bedrooms that was built more recently.

You’ll do the same thing, but with the variables that you chose, so make sure to rename x1, etc., in the example code below to fit your predictors, and to fill in the blanks for a low, medium, and high price. Your output will contain a predicted price for each of the three. Make sense?

See what the final prediction is to test your understanding of the model output. Each of the values you get from the predict() function is the price (your response variable), given the model coefficients and the x values you put into the testdata dataframe.

lowprice <- tibble(bedrooms=1, review_scores_rating = 70, number_of_reviews = 390)
midprice <- tibble(bedrooms=3, review_scores_rating = 50, number_of_reviews = 137)
highprice <- tibble(bedrooms=5, review_scores_rating= 45, number_of_reviews = 0) 

predict(multiReg, lowprice)
##        1 
## 1187.809
predict(multiReg, midprice)
##        1 
## 2308.493
predict(multiReg, highprice) 
##        1 
## 2863.197
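
These predictions can be reproduced by hand, which is a useful check on what predict() is doing: each one is the intercept plus each coefficient times its predictor value. The sketch below copies the coefficients from the summary(multiReg) printout above, so results can differ from predict() in the last digits because the printed coefficients are rounded. The object names `beta`, `scenarios`, `X`, and `by_hand` are illustrative.

```r
# Coefficients as printed by summary(multiReg)
beta <- c(intercept            = 3650.39801,
          bedrooms             = 185.12977,
          review_scores_rating = -38.06332,
          number_of_reviews    = 0.04286)

# The three scenarios used above
scenarios <- data.frame(
  bedrooms             = c(1, 3, 5),
  review_scores_rating = c(70, 50, 45),
  number_of_reviews    = c(390, 137, 0),
  row.names            = c("low", "mid", "high")
)

# Prepend a column of 1s for the intercept, then multiply through
X <- cbind(1, as.matrix(scenarios))
by_hand <- drop(X %*% beta)
round(by_hand, 2)
```

Note that the negative coefficient on review_scores_rating is why the "high price" scenario uses the lowest rating.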

Here is a second model using neighborhood mean price instead.

multiReg2 <- lm(price~bedrooms+meanprice+review_scores_rating, data = cbus)

glance(multiReg2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0701        0.0678 1346.      31.2 1.92e-19     3 -10743. 21496. 21521.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(multiReg2)
## # A tibble: 4 × 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           3263.     585.         5.58 2.97e- 8
## 2 bedrooms               184.      35.0        5.28 1.56e- 7
## 3 meanprice                1.06     0.224      4.72 2.63e- 6
## 4 review_scores_rating   -37.6      5.98      -6.28 4.59e-10
summary(multiReg2)
## 
## Call:
## lm(formula = price ~ bedrooms + meanprice + review_scores_rating, 
##     data = cbus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3244.6  -375.2  -138.4    22.7  9624.2 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3262.6323   584.8258   5.579 2.97e-08 ***
## bedrooms              184.4704    34.9691   5.275 1.56e-07 ***
## meanprice               1.0575     0.2241   4.720 2.63e-06 ***
## review_scores_rating  -37.5679     5.9794  -6.283 4.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1346 on 1242 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.07007,    Adjusted R-squared:  0.06783 
## F-statistic:  31.2 on 3 and 1242 DF,  p-value: < 2.2e-16
summary(fitted(multiReg2))  # fitted values; multiReg2$fit only works through partial matching of $fitted.values
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -373.9   102.4   266.5   343.9   546.5  3401.6
lowprice2 <- tibble(bedrooms=1,  meanprice = 100, review_scores_rating = 70)
midprice2 <- tibble(bedrooms=3, meanprice = 120, review_scores_rating = 50)
highprice2 <- tibble(bedrooms=5, meanprice = 300, review_scores_rating = 45)

predict(multiReg2, lowprice2)
##        1 
## 923.0964
predict(multiReg2, midprice2)
##        1 
## 2064.544
predict(multiReg2, highprice2)
##        1 
## 2811.668

Choose values for each of your predictors that you think would lead to a low price, then repeat for a middle and a high price.

lowpricetest <- data.frame(x1=___, x2=_____, x3=_____)
predict(modelname, lowpricetest)
  1. Based on your analysis, what are the best characteristics for opening an Airbnb that you could rent out at a high price in Columbus? Write a final paragraph explaining your final big model and take-home message, including any potential pitfalls of your analysis.

It looks like the old saying “location, location, location” in real estate holds for Airbnb rentals too. Replacing number of reviews with the mean listing price of the rental’s neighborhood gave a statistically significant coefficient (about $1.06 per $1 of neighborhood mean price, p < 0.001) and raised R² from 0.053 to 0.070. Number of bedrooms and average review score were also significant predictors of rental price, while number of reviews was not (p = 0.94). The usefulness of this analysis is limited because the R² we achieved is still very low at just 0.07. This means that about 93% of the variation in price is not explained by the model, and more predictive factors need to be explored.
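
To put the R² figure in context, the adjusted R² values can be recomputed from numbers printed in the two summary() outputs above. This is a sketch only; `adj_r2` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Adjusted R-squared from R-squared, residual degrees of freedom,
# and the number of predictors (hypothetical helper)
adj_r2 <- function(r2, df_resid, n_pred) {
  n <- df_resid + n_pred + 1            # observations used in the fit
  1 - (1 - r2) * (n - 1) / df_resid
}

# Inputs copied from the summary() printouts of the two models
model1 <- adj_r2(r2 = 0.0534,  df_resid = 1242, n_pred = 3)
model2 <- adj_r2(r2 = 0.07007, df_resid = 1242, n_pred = 3)
round(c(model1 = model1, model2 = model2), 4)

# Share of the variation in price left unexplained by the second model
1 - 0.07007
```

Both models have three predictors and were fit to the same number of rows, so the adjustment shrinks both R² values by a similar small amount and the second model still comes out ahead.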

  1. Read (or listen to) the articles below and respond in a short paragraph (approx. 4-7 sentences), given your analysis…
    1. https://www.wgbh.org/news/local-news/2021/07/22/airbnb-impacts-neighborhood-crime-but-not-in-the-way-you-think
    2. https://slate.com/business/2021/10/airbnb-housing-shortage-luxury-vacation-rental-galveston-texas.html
    It seems that there are significant problems with how Airbnb listings integrate into their host communities. As Airbnbs enter neighborhoods, they tend to cluster together and crowd out or alienate native residents. In addition, since these are very transient living spaces and Airbnb guests are less attached to the communities, there appears to be an increase in crime as the number of Airbnb listings in a neighborhood increases. An increase in crime, of course, tends to drive down average housing prices in the neighborhood. The relationship between an Airbnb and the neighborhood it’s located in is two-sided and may lead to fluctuating or changing rental prices.