---
title: "Airbnb Analysis"
author: "Alice, Ben, Amy, Calvin"
date: "19 April 2022"
output:
  html_document:
    code_folding: hide
---
This is a team assignment. As usual, you will turn in one knitted HTML report for your team on GitHub. However, for the first time this week, I haven’t made an RMarkdown template for you. You should make your .Rmd file from scratch. Make sure that it has all the sections that we’ve been using all semester long: Introduction, Ethical Considerations, Data Explanation and Exploration, Statistical Analysis and Interpretation, and Conclusion.
Fully answer the questions below, but incorporate the answers into the report so that it will make sense to someone who has never seen the questions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(parsnip)
library(ggthemes)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(sp)
library(rworldmap)
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(gapminder)
Airbnb is a company founded in 2008 that operates an online marketplace for hospitality services. Users looking for places to stay can browse available listings by location, dates available, characteristics, amenities, and price. Home owners can list their properties for short- or long-term rental.
There are a wide variety of customers who use Airbnb: some renters are tourists, wanting short term stays near popular attractions at cheaper prices and more homely environments than traditional hotels. Others are business travellers, looking for similar features. Similarly, some home owners act as permanent landlords, owning multiple properties and renting them out on a permanent basis. Others have spare bedrooms or space and want to earn a little extra money on the side. Some even rent their home when they are out of town.
In this lab we will explore factors that affect the price of rental units in Columbus, Ohio. We will examine characteristics of the rentals, geographic locations, and time considerations. This will serve both a descriptive and a predictive analytics purpose: descriptive in the sense that interpreting regression models can help us understand what factors are the most critical to the price and how they affect it; and predictive in estimating what the price of a rental is, given certain factors that could be measured. Predictions can be used from both the renter and owner perspective: renters can understand whether a property is fairly priced and identify good deals, and owners can receive suggestions about how to price their properties.
Data was downloaded from the website http://insideairbnb.com/get-the-data.html. This site is not affiliated with or endorsed by Airbnb. You can find some useful bits of metadata if you scroll down to “disclaimers” here - http://insideairbnb.com/about.html . Please read this section and think about if/how the disclaimers about the data could affect your analysis. There is not a developed codebook for this data, but many of the variables should be fairly self-explanatory.
We will be working with a somewhat cleaned version of the data for Columbus, Ohio, compiled on 28 July 2020. Bring the data into R, and start to explore it by looking at the column names, the unique values for categorical variables, etc.
In this lab, you’ll need to install and use functions from several new packages (GGally, ggfortify, and stringr), as well as from packages that you’ve already been introduced to in previous labs.
Your job is to provide a brief report on the Columbus Airbnb dataset that is both descriptive and predictive in its analysis. You will need to provide evidence of model validation, as well as polished graphics that help summarize the data and make your key points. Use code folding to hide your code, and use headers and text to clearly walk your reader through your exploration, analysis, and findings. There should be enough text between the output (visuals and models) that a reader could understand from the report what your question was, how you made decisions, your visual and model interpretations, and what your final conclusions were.
Create a new folder, R Project and R Markdown file. Download the dataset.
Bring in the Columbus dataset. Use commands that you have learned in the class to explore the dataset, including learning what your variables represent. There are many variables here, so you’ll need to make some decisions on what you’ll want to focus on in your analysis. Most of them are self explanatory. You could use names(dataset) to see a quick listing of all the column names, without all the summary information.
cbus <- read.csv("Columbus_2020_listings.csv")
names(cbus)
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "summary"
## [7] "space"
## [8] "description"
## [9] "experiences_offered"
## [10] "neighborhood_overview"
## [11] "notes"
## [12] "transit"
## [13] "access"
## [14] "interaction"
## [15] "house_rules"
## [16] "thumbnail_url"
## [17] "medium_url"
## [18] "picture_url"
## [19] "xl_picture_url"
## [20] "host_id"
## [21] "host_url"
## [22] "host_name"
## [23] "host_since"
## [24] "host_location"
## [25] "host_about"
## [26] "host_response_time"
## [27] "host_response_rate"
## [28] "host_acceptance_rate"
## [29] "host_is_superhost"
## [30] "host_thumbnail_url"
## [31] "host_picture_url"
## [32] "host_neighbourhood"
## [33] "host_listings_count"
## [34] "host_total_listings_count"
## [35] "host_verifications"
## [36] "host_has_profile_pic"
## [37] "host_identity_verified"
## [38] "street"
## [39] "neighbourhood"
## [40] "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed"
## [42] "city"
## [43] "state"
## [44] "zipcode"
## [45] "market"
## [46] "smart_location"
## [47] "country_code"
## [48] "country"
## [49] "latitude"
## [50] "longitude"
## [51] "is_location_exact"
## [52] "property_type"
## [53] "room_type"
## [54] "accommodates"
## [55] "bathrooms"
## [56] "bedrooms"
## [57] "beds"
## [58] "bed_type"
## [59] "amenities"
## [60] "square_feet"
## [61] "price"
## [62] "weekly_price"
## [63] "monthly_price"
## [64] "security_deposit"
## [65] "cleaning_fee"
## [66] "guests_included"
## [67] "extra_people"
## [68] "minimum_nights"
## [69] "maximum_nights"
## [70] "minimum_minimum_nights"
## [71] "maximum_minimum_nights"
## [72] "minimum_maximum_nights"
## [73] "maximum_maximum_nights"
## [74] "minimum_nights_avg_ntm"
## [75] "maximum_nights_avg_ntm"
## [76] "calendar_updated"
## [77] "has_availability"
## [78] "availability_30"
## [79] "availability_60"
## [80] "availability_90"
## [81] "availability_365"
## [82] "calendar_last_scraped"
## [83] "number_of_reviews"
## [84] "number_of_reviews_ltm"
## [85] "first_review"
## [86] "last_review"
## [87] "review_scores_rating"
## [88] "review_scores_accuracy"
## [89] "review_scores_cleanliness"
## [90] "review_scores_checkin"
## [91] "review_scores_communication"
## [92] "review_scores_location"
## [93] "review_scores_value"
## [94] "requires_license"
## [95] "license"
## [96] "jurisdiction_names"
## [97] "instant_bookable"
## [98] "is_business_travel_ready"
## [99] "cancellation_policy"
## [100] "require_guest_profile_picture"
## [101] "require_guest_phone_verification"
## [102] "calculated_host_listings_count"
## [103] "calculated_host_listings_count_entire_homes"
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"
## [106] "reviews_per_month"
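Beyond names(), a few base R functions give a quick first look at column types, categories, and missing values. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) as a stand-in for the listings data, so the numbers are for illustration only.

```r
# Hypothetical stand-in for a few columns of the listings data
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Entire home/apt", "Private room"),
  bedrooms  = c(2, 1, NA, 1),
  price     = c("$120.00", "$45.00", "$1,050.00", "$60.00"),
  stringsAsFactors = FALSE
)

str(toy)                                  # column types at a glance
unique(toy$room_type)                     # distinct categories
room_counts <- table(toy$room_type)       # how many listings per category
missing_per_column <- colSums(is.na(toy)) # NA count in each column

room_counts
missing_per_column
```

Note that `price` shows up as text rather than a number here, which is the same kind of problem the real price columns have.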
What ethical considerations might you have when working with this kind of data? Who are potential stakeholders in your analysis? What repercussions could your analysis have (e.g. social, statistical, and communication ethics)?
OK, before we get too far, think about the main task at hand to predict housing price (response variable). What variable(s) do you think should be your predictor variables? Choose 5-7 predictors that you think make some logical sense, based on which of the variables seem intuitive to you and are easily understood.
beds
amenities
square_feet
bedrooms
neighbourhood_cleansed
Look at the variables that represent price (price, weekly_price, etc…). What do you notice? It looks like R is reading these as character strings, and there are special characters like $ and , in the values, which will cause problems for our analysis. Let’s get rid of them now.
The stringr function str_remove can match a pattern of 1 or more characters, and remove all matching characters from that string. For example:
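A minimal sketch of the difference between the two functions, using a made-up vector of price strings (`prices` is hypothetical, not taken from the dataset):

```r
library(stringr)

# Hypothetical price strings in the same style as the listings data
prices <- c("$1,250.00", "$85.00", "$2,300,000.00")

# str_remove() drops only the FIRST match of the pattern in each string
first_only <- str_remove(prices, "[$,]")

# str_remove_all() drops EVERY match, which is what we need before as.numeric()
cleaned <- as.numeric(str_remove_all(prices, "[$,]"))

first_only
cleaned
```

Inside square brackets the `$` is treated as a literal dollar sign, so the pattern `[$,]` matches either a dollar sign or a comma.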
Using the above code as an example, look up the help for the functions str_remove and str_remove_all, to get more information about how these work. These functions come from the stringr package, which allows us to manipulate text data. You can use the code below to “clean up” the dollar signs and the commas in the prices and make sure they are all numeric instead of character. I’ll give it to you here to copy and paste, but try to make sure you understand how it’s working. I called my data “cbus”, but you’ll need to replace that with whatever you named your dataset when you brought it into R. The column names should all be the same.
cbus <- cbus %>%
  mutate(price = as.numeric(str_remove_all(price, "[$,]")),
         weekly_price = as.numeric(str_remove_all(weekly_price, "[$,]")),
         monthly_price = as.numeric(str_remove_all(monthly_price, "[$,]")),
         security_deposit = as.numeric(str_remove_all(security_deposit, "[$,]")),
         cleaning_fee = as.numeric(str_remove_all(cleaning_fee, "[$,]")),
         extra_people = as.numeric(str_remove_all(extra_people, "[$,]")))
cbus <- cbus %>% rename(neighborhood = neighbourhood_cleansed)
cbus <- cbus %>% group_by(neighborhood) %>% mutate(meanprice = mean(price)) %>% ungroup()
# Plot one bar per neighborhood: with one row per listing, stacked bars would show n * mean rather than the mean
ggplot(distinct(cbus, neighborhood, meanprice), aes(x = neighborhood, y = meanprice, fill = neighborhood)) + geom_col() + labs(x="Columbus neighborhood", y="Average rental price (USD)", title="Average Columbus rental prices by neighborhood", caption="Data from Inside Airbnb (Columbus, 28 July 2020)") + coord_flip() + theme_classic()
cbus <- cbus %>% group_by(beds) %>% mutate(meanprice2 = mean(price)) %>% ungroup()
ggplot(cbus, aes(x=beds, y=meanprice2)) + labs(x="Number of beds", y="Mean rental price (USD)", title="Mean Rental Price by Number of Beds in the Home", caption="Data from Inside Airbnb (Columbus, 28 July 2020)") + geom_point() + theme_economist()
## Warning: Removed 2 rows containing missing values (geom_point).
Next, use the ggmap package that we explored a couple of weeks ago. Be sure to refer to the mapping slideshow and your notes from those classes to remind yourself of how ggmap works. Remember the exercise we did on crime in Houston? Use code similar to that to plot a map of Columbus, and then add your Airbnb location (x=longitude, y=latitude) points to it. To get started, the coordinates for drawing a box around Columbus are:
bbox <- c(left=-83.2, bottom=39.8, right=-82.75, top=40.16)
map <- get_stamenmap(bbox, maptype="terrain", zoom=10)
## Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
ggmap(map, darken=0.1) + geom_point(data=cbus, aes(x = longitude, y = latitude, col = neighborhood)) + labs(title = "Airbnb listings in Columbus", col = "Neighborhood")
As with any other ggplot, you can still add arguments for color, transparency, size, shape, etc., in your aesthetic. Your plot should add one or more possibly explanatory variables using these additional arguments (for example, use color, shape, and/or size to help visualize other variables, perhaps neighborhood, price, or some other characteristic of the housing).
Regression 1 (bivariate linear regression). Plot the price by one predictor variable of your choice, with a trend line, run a simple linear regression, and completely and accurately interpret the output.
Note 1: This means reporting intercept, slope, and R², and all associated p-values for each of these. It also means talking about them in terms of the data and discussing statistical vs. practical significance.
Note 2: As you work through the subsequent analyses, if it feels useful to filter the dataset to a subset of rentals (optional, not required) that is OK. But make sure to state in the text and in your explanation what you did and why that makes sense (i.e., does not bias your analysis, or helps to answer a specific question).
bivariate_reg <- lm(price~review_scores_rating, data = cbus)
summary(bivariate_reg)
##
## Call:
## lm(formula = price ~ review_scores_rating, data = cbus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3219.6 -281.8 -166.1 -83.4 9827.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4106.360 583.285 7.040 3.17e-12 ***
## review_scores_rating -39.340 6.085 -6.465 1.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1371 on 1245 degrees of freedom
## (162 observations deleted due to missingness)
## Multiple R-squared: 0.03248, Adjusted R-squared: 0.03171
## F-statistic: 41.8 on 1 and 1245 DF, p-value: 1.448e-10
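The quantities to report (intercept, slope, R², and their p-values) can be pulled out of a fitted model directly instead of being read off the printed summary. The sketch below assumes R's built-in mtcars data as a stand-in, so the values are not Airbnb results; only the extraction pattern carries over.

```r
# Stand-in model on built-in data (NOT the Airbnb listings)
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)
intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["wt", "Estimate"]
slope_p   <- coefs["wt", "Pr(>|t|)"]

# R-squared, plus the overall F-test p-value rebuilt from the F statistic
r2 <- s$r.squared
f  <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

c(intercept = intercept, slope = slope, slope_p = slope_p, r2 = r2, model_p = model_p)
```

With a single predictor, the overall F-test and the slope's t-test are testing the same thing, which is why the two p-values in the output above agree (1.45e-10 and 1.448e-10).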
Use the ggpairs function (from the GGally package) to do a quick visual and quantitative check, and to decide which variables you would like to put into your model. This function will make a visual “table” of your variables, where half of the graphic shows the scatterplot (or boxplots if categorical data) of how two variables are related, the diagonal line will show the histogram of each variable, and the other half of the plot will report Pearson’s correlation coefficient (the same thing we have calculated before using cor() and cor.test().) Using the code below as a model, choose 5-7 potential predictor variables (not your response variable), and make a pairs plot.
cbus %>% group_by(neighborhood) %>% summarize(mean(price))
## # A tibble: 27 × 2
## neighborhood `mean(price)`
## <fct> <dbl>
## 1 Clintonville 114.
## 2 Downtown 380.
## 3 Eastland/Brice 62.7
## 4 Eastmoor/Walnut Ridge 81.6
## 5 Far East 48
## 6 Far North 91
## 7 Far Northwest 78.4
## 8 Far South 113.
## 9 Far West 46.4
## 10 Franklinton 112.
## # … with 17 more rows
cbus %>% group_by(neighborhood) %>% mutate(meanprice = mean(price))
## # A tibble: 1,409 × 108
## # Groups: neighborhood [27]
## id listing_url scrape_id last_scraped name summary space description
## <int> <fct> <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 90676 https://www.a… 2.02e13 2020-07-28 Shor… "Just … "Gre… "Just step…
## 2 543140 https://www.a… 2.02e13 2020-07-28 Priv… "Priva… "Min… "Private, …
## 3 591101 https://www.a… 2.02e13 2020-07-28 The … "Famou… "Thi… "Famous Am…
## 4 681145 https://www.a… 2.02e13 2020-07-28 2 be… "INSTA… "WEL… "INSTANT B…
## 5 923248 https://www.a… 2.02e13 2020-07-28 1 Si… "This … "Inc… "This is a…
## 6 927867 https://www.a… 2.02e13 2020-07-28 Full… "The W… "Inc… "The Wayfa…
## 7 1217678 https://www.a… 2.02e13 2020-07-28 Comf… "A coz… "We … "A cozy, w…
## 8 1286887 https://www.a… 2.02e13 2020-07-28 Bett… "2 Bed… "Bet… "2 Bedroom…
## 9 1321192 https://www.a… 2.02e13 2020-07-28 Down… "**IMP… "**N… "**IMPORTA…
## 10 1336145 https://www.a… 2.02e13 2020-07-28 Comf… "A coz… "We … "A cozy, w…
## # … with 1,399 more rows, and 100 more variables: experiences_offered <fct>,
## # neighborhood_overview <fct>, notes <fct>, transit <fct>, access <fct>,
## # interaction <fct>, house_rules <fct>, thumbnail_url <lgl>,
## # medium_url <lgl>, picture_url <fct>, xl_picture_url <lgl>, host_id <int>,
## # host_url <fct>, host_name <fct>, host_since <fct>, host_location <fct>,
## # host_about <fct>, host_response_time <fct>, host_response_rate <fct>,
## # host_acceptance_rate <fct>, host_is_superhost <fct>, …
ggpairs(cbus, columns=c("bedrooms", "beds", "review_scores_rating", "number_of_reviews", "meanprice"), cardinality_threshold = 27)
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 3 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 163 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 163 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 2 rows containing missing values
## Warning: Removed 163 rows containing missing values (geom_point).
## Removed 163 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 162 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 162 rows containing missing values
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 162 rows containing missing values (geom_point).
Show the graphic and discuss it. Does this change your mind for what you’d like to use for your model? Why or why not? Choose the best 3 or 4 variables to take to the next step.
Unsurprisingly, there’s a strong correlation between the number of bedrooms and the number of beds. That means we should use only one of these variables, not both.
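A correlation like this can also be checked numerically with cor() and cor.test(). The sketch below uses simulated values (the vectors `bedrooms` and `beds` here are made up, including two artificial missing values), so it shows only the mechanics, not the actual Columbus correlation.

```r
set.seed(42)

# Simulated stand-ins: beds is usually equal to bedrooms, sometimes one more
bedrooms <- sample(1:5, 200, replace = TRUE)
beds <- bedrooms + rbinom(200, size = 1, prob = 0.3)
beds[c(3, 17)] <- NA   # mimic a few missing values

# Pearson correlation on complete pairs only
r <- cor(bedrooms, beds, use = "complete.obs")

# cor.test() drops incomplete pairs itself and adds a p-value and interval
test <- cor.test(bedrooms, beds)

r
test$conf.int
```

When two predictors are this closely related, including both in a regression mostly adds redundancy, which is the reason for keeping only one.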
Question 10:
multiReg <- lm(price~bedrooms+review_scores_rating+number_of_reviews, data = cbus)
glance(multiReg)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0534 0.0511 1358. 23.4 1.04e-14 3 -10754. 21518. 21544.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(multiReg)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3650. 585. 6.24 5.89e-10
## 2 bedrooms 185. 35.4 5.23 1.94e- 7
## 3 review_scores_rating -38.1 6.06 -6.28 4.76e-10
## 4 number_of_reviews 0.0429 0.586 0.0731 9.42e- 1
summary(multiReg)
##
## Call:
## lm(formula = price ~ bedrooms + review_scores_rating + number_of_reviews,
## data = cbus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3287.6 -320.3 -138.6 -9.6 9785.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3650.39801 584.74895 6.243 5.89e-10 ***
## bedrooms 185.12977 35.36780 5.234 1.94e-07 ***
## review_scores_rating -38.06332 6.06381 -6.277 4.76e-10 ***
## number_of_reviews 0.04286 0.58594 0.073 0.942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1358 on 1242 degrees of freedom
## (163 observations deleted due to missingness)
## Multiple R-squared: 0.0534, Adjusted R-squared: 0.05111
## F-statistic: 23.35 on 3 and 1242 DF, p-value: 1.043e-14
summary(fitted(multiReg))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -155.9 111.1 258.8 343.9 478.7 3444.6
Validate your model using the autoplot() function and interpreting the diagnostic plots (for guidance on how to do this, refer back to our class notes). Show the resulting visuals.
library(ggfortify)
## Registered S3 method overwritten by 'ggfortify':
## method from
## autoplot.glmnet parsnip
autoplot(multiReg) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
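If ggfortify is not available, base R can draw what should be the same four diagnostic panels (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). This is a sketch on a stand-in model fitted to the built-in mtcars data, not on the Airbnb model.

```r
# Stand-in model on built-in data (NOT the Airbnb listings)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange the four default lm diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)   # restore the previous plotting layout

# A numeric companion to the plots: the spread and skew of the residuals
res_summary <- summary(resid(fit))
res_summary
```

A residual summary whose median sits well below zero while the maximum is very large, as in the Airbnb models above, suggests a long right tail that the Q-Q plot should also reveal.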
Write a short paragraph that interprets the output and validation of your model and answers the following questions: What is the effect (statistically speaking) of each predictor on the price? How much (in $) does each predictor influence price? How did you know if your model was any good?
Make some predictions of rental prices using the predict() function. Using the code below as a guide, fill in the x predictors with each of the ones included in your final model, and values that you would like to predict the price from. You may want to choose a few sets of values to check your intuition, to see if it does indeed predict a low, medium, and high price. For example, the test dataset you make should represent values that you think would lead to a low price, the next to a middle price, and the next to a high price.
So, if my predictors were “NumberBedrooms”, “HousingType”, and “YearHouseBuilt”, I might fill in the code as follows:
lowprice <- data.frame(NumberBedrooms=1, HousingType="studio",
                       YearHouseBuilt=1960)
highprice <- data.frame(NumberBedrooms=6, HousingType="house",
                        YearHouseBuilt=2003)
predict(model, lowprice)
predict(model, highprice)
… assuming that an older studio apartment with fewer bedrooms rents for less than a more recently built house with many bedrooms.
You’ll do the same thing, but with the variables that you chose, so make sure to rename x1, etc. in the example code below to fit your predictors, and to fill in the blanks for a low, medium, and high price. You’ll get a prediction for each in your output, for what the price would be for a low, middle, and high price. Make sense?
See what the final prediction is to test your understanding of the model output. Each of the values you get from the predict() function is the price (your response variable), given the model coefficients and the x values you put into the testdata dataframe.
lowprice <- tibble(bedrooms=1, review_scores_rating = 70, number_of_reviews = 390)
midprice <- tibble(bedrooms=3, review_scores_rating = 50, number_of_reviews = 137)
highprice <- tibble(bedrooms=5, review_scores_rating= 45, number_of_reviews = 0)
predict(multiReg, lowprice)
## 1
## 1187.809
predict(multiReg, midprice)
## 1
## 2308.493
predict(multiReg, highprice)
## 1
## 2863.197
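As a check on what predict() is doing, the same three numbers can be rebuilt by hand from the regression equation. The coefficients below are copied from the summary(multiReg) output above as printed, so the results may differ from predict() in the last decimal place because of rounding.

```r
# Coefficients copied from the printed summary(multiReg) output (rounded)
b0         <- 3650.39801
b_bedrooms <- 185.12977
b_rating   <- -38.06332
b_reviews  <- 0.04286

# The same three scenarios passed to predict() above
scenarios <- data.frame(
  bedrooms             = c(1, 3, 5),
  review_scores_rating = c(70, 50, 45),
  number_of_reviews    = c(390, 137, 0),
  row.names            = c("low", "mid", "high")
)

# price = intercept + sum of (coefficient * predictor value)
by_hand <- b0 +
  b_bedrooms * scenarios$bedrooms +
  b_rating   * scenarios$review_scores_rating +
  b_reviews  * scenarios$number_of_reviews
names(by_hand) <- rownames(scenarios)

round(by_hand, 2)
```

Because the rating coefficient is negative, giving the "high" scenario a lower review score pushes its predicted price up, not down.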
Here is a second model using neighborhood mean price instead.
multiReg2 <- lm(price~bedrooms+meanprice+review_scores_rating, data = cbus)
glance(multiReg2)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0701 0.0678 1346. 31.2 1.92e-19 3 -10743. 21496. 21521.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(multiReg2)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3263. 585. 5.58 2.97e- 8
## 2 bedrooms 184. 35.0 5.28 1.56e- 7
## 3 meanprice 1.06 0.224 4.72 2.63e- 6
## 4 review_scores_rating -37.6 5.98 -6.28 4.59e-10
summary(multiReg2)
##
## Call:
## lm(formula = price ~ bedrooms + meanprice + review_scores_rating,
## data = cbus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3244.6 -375.2 -138.4 22.7 9624.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3262.6323 584.8258 5.579 2.97e-08 ***
## bedrooms 184.4704 34.9691 5.275 1.56e-07 ***
## meanprice 1.0575 0.2241 4.720 2.63e-06 ***
## review_scores_rating -37.5679 5.9794 -6.283 4.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1346 on 1242 degrees of freedom
## (163 observations deleted due to missingness)
## Multiple R-squared: 0.07007, Adjusted R-squared: 0.06783
## F-statistic: 31.2 on 3 and 1242 DF, p-value: < 2.2e-16
summary(fitted(multiReg2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -373.9 102.4 266.5 343.9 546.5 3401.6
lowprice2 <- tibble(bedrooms=1, meanprice = 100, review_scores_rating = 70)
midprice2 <- tibble(bedrooms=3, meanprice = 120, review_scores_rating = 50)
highprice2 <- tibble(bedrooms=5, meanprice = 300, review_scores_rating = 45)
predict(multiReg2, lowprice2)
## 1
## 923.0964
predict(multiReg2, midprice2)
## 1
## 2064.544
predict(multiReg2, highprice2)
## 1
## 2811.668
Choose values for each of your predictors that you think would lead to a low price, then repeat for a middle and a high price.
lowpricetest <- data.frame(x1=___, x2=_____, x3=_____)
predict(modelname, lowpricetest)
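One way to fill in a template like this is to put all scenarios in a single data frame, one row each, and call predict() once. The sketch below assumes a stand-in model on the built-in mtcars data; the predictor names `wt` and `hp` and the scenario values are illustrative only and would be replaced by the predictors of the actual model.

```r
# Stand-in model on built-in data (NOT the Airbnb listings)
model <- lm(mpg ~ wt + hp, data = mtcars)

# One row per scenario; column names must match the model's predictors
scenarios <- data.frame(
  wt = c(5.0, 3.2, 2.0),
  hp = c(250, 150, 70),
  row.names = c("low", "mid", "high")
)

# predict() returns one value per row, labelled by the row names
preds <- predict(model, newdata = scenarios)
preds
```

If the predictions do not come out in the expected low, mid, high order, that is a prompt to revisit either the scenario values or the sign of the coefficients.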
The old real-estate saying “location, location, location” finds some support here. Replacing the number of reviews with the mean listing price of the neighborhood raised R² from 0.053 to 0.070, and the neighborhood mean price was a statistically significant predictor (p < 0.001), whereas the number of reviews was not (p = 0.94). The number of bedrooms and the average review score were also statistically significant in both models: each additional bedroom was associated with roughly $185 more in price, and each additional review-score point with roughly $38 less. Even so, all of these relationships are weak. The usefulness of this analysis is limited because the best R² we achieved is only 0.07. This means that about 93% of the variation in price is not explained by the model, and more predictive factors need to be explored.