Jennifer Ganeles
4/7/19
This week’s homework explores an Airbnb dataset for Lisbon found on Kaggle. Airbnb is an online marketplace that allows people to rent out their properties or spare rooms to guests. Its popularity has increased over the years as a budget-friendly way to stay overnight in prime locations while also getting an authentically local travel experience.
Airbnb is known for being a cheaper alternative to normal hotel accommodations, but what’s considered a good price can vary by location. For this reason, I will be using multilevel modelling to explore how price affects overall Airbnb satisfaction across individual listings as well as across neighborhoods. The dataset contains listing and rating information for 13,578 Airbnb accommodations spanning 24 neighborhoods in Lisbon, Portugal as of 2017.
My dependent variable (overall satisfaction) was rated on a scale from 0 to 5. My independent variable (price) was simplified into two levels using the median as a cutoff point: Low ($70 or lower) and High (greater than $70).
Low Price=0
High Price=1
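library(readr)
library(dplyr) #needed for select(), group_by(), summarise(), do() and %>%
library(ggplot2) #needed for the plots further down
airbnb<-read_csv("/Users/jenniferganeles/Downloads/airbnb_lisbon_1480_2017-07-27.csv")
#Dummy-code price: 1 if above the median price, 0 otherwise
airbnb$pricelev <- ifelse(airbnb$price > median(airbnb$price), 1, 0)
airbnb<-select(airbnb, room_id, neighborhood, overall_satisfaction, price, pricelev, accommodates, bedrooms, reviews)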
airbnbs<-airbnb[order(-airbnb$overall_satisfaction),]
head(airbnbs)
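To make the median split concrete, here is a minimal sketch on made-up prices. The `price` vector is hypothetical and not taken from the Lisbon data; it only shows that listings priced exactly at the median fall into the low group because the comparison is strictly greater than.

```r
# Hypothetical prices, for illustration only
price <- c(40, 55, 70, 70, 90, 120)

# Median of six values is the mean of the 3rd and 4th sorted values (70)
cutoff <- median(price)

# 1 = High (strictly above the median), 0 = Low (at or below the median)
pricelev <- ifelse(price > cutoff, 1, 0)

print(cutoff)
print(pricelev)
```

Because of the strict inequality, the two groups need not be the same size when many listings share the median price.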
First, I will conduct an ecological regression, which is a neighborhood-level analysis. The analysis below shows that, at the neighborhood level, there is a positive but statistically non-significant relationship between price level and satisfaction rating (p = 0.58). We may be tempted to say that high-priced accommodations have higher ratings, but this would be an ecological fallacy, since I am not using individual-level data here.
#First, grouping by neighborhood:
hood<-airbnbs %>%
group_by(neighborhood) %>%
summarise(mean_s = mean(overall_satisfaction, na.rm = TRUE), mean_p = mean(pricelev, na.rm = TRUE))
#Ecological model:
ecoreg <- lm(mean_s ~ mean_p, data = hood)
summary(ecoreg)
Call:
lm(formula = mean_s ~ mean_p, data = hood)
Residuals:
Min 1Q Median 3Q Max
-1.5061 -0.4240 0.1378 0.4185 0.8816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.5283 0.4159 6.079 4.06e-06 ***
mean_p 0.5188 0.9259 0.560 0.581
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6483 on 22 degrees of freedom
Multiple R-squared: 0.01407, Adjusted R-squared: -0.03075
F-statistic: 0.3139 on 1 and 22 DF, p-value: 0.5809
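The ecological fallacy can be illustrated with a small sketch on invented data. The data frame `toy` and its `baseline` column are assumptions made up for this example, not part of the Airbnb dataset. Within every toy neighborhood, high-price listings are rated 0.5 lower, yet the regression on neighborhood means has a positive slope because pricier neighborhoods start from a higher baseline.

```r
# Hypothetical toy data: three neighborhoods, four listings each.
toy <- data.frame(
  neighborhood = rep(c("A", "B", "C"), each = 4),
  pricelev     = c(0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1),
  baseline     = rep(c(2, 3, 4), each = 4)
)
# Within each neighborhood, a high price lowers the rating by 0.5
toy$overall_satisfaction <- toy$baseline - 0.5 * toy$pricelev

# Neighborhood-level (ecological) regression on the group means
hood_toy <- aggregate(cbind(overall_satisfaction, pricelev) ~ neighborhood,
                      data = toy, FUN = mean)
eco_slope <- coef(lm(overall_satisfaction ~ pricelev, data = hood_toy))[["pricelev"]]

# Listing-level regression that holds neighborhood fixed
within_slope <- coef(lm(overall_satisfaction ~ pricelev + neighborhood,
                        data = toy))[["pricelev"]]

print(eco_slope)     # positive: 3.5
print(within_slope)  # negative: -0.5
```

The two slopes answer different questions, which is why the neighborhood-level result should not be read as a statement about individual listings.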
Using the complete pooling method, I can see that there is a significant negative relationship between price level and satisfaction rating. In other words, high-priced listings are rated about 0.19 points lower, on average, than low-priced listings. However, this model explores the relationship under the assumption that no differences exist between neighborhoods. Like my ecological regression, it is flawed, since the type of neighborhood can be an important determinant of what is satisfactory and what is not.
cpooling <- lm(overall_satisfaction ~ pricelev, data = airbnbs)
summary(cpooling)
Call:
lm(formula = overall_satisfaction ~ pricelev, data = airbnbs)
Residuals:
Min 1Q Median 3Q Max
-3.338 -3.144 1.162 1.662 1.856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.33770 0.02594 128.680 < 2e-16 ***
pricelev -0.19345 0.03689 -5.245 1.59e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.149 on 13576 degrees of freedom
Multiple R-squared: 0.002022, Adjusted R-squared: 0.001948
F-statistic: 27.51 on 1 and 13576 DF, p-value: 1.59e-07
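Because pricelev is a 0/1 dummy, the slope in a model like this is simply the difference between the two group means, and the intercept is the mean of the low-price group. The sketch below checks this on a hypothetical six-row data frame (the values are invented for illustration).

```r
# Hypothetical ratings for three low-price and three high-price listings
toy <- data.frame(
  pricelev             = c(0, 0, 0, 1, 1, 1),
  overall_satisfaction = c(4, 5, 3, 3, 4, 2)
)

fit <- lm(overall_satisfaction ~ pricelev, data = toy)

# Group means computed directly
mean_low  <- mean(toy$overall_satisfaction[toy$pricelev == 0])
mean_high <- mean(toy$overall_satisfaction[toy$pricelev == 1])

# Intercept equals the low-price mean; slope equals the difference in means
print(coef(fit))
print(c(mean_low = mean_low, difference = mean_high - mean_low))
```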
Here, I am running 24 separate regression models (one per neighborhood) using the no-pooling approach. Below are plots of the intercepts of these regressions. Each intercept represents the average satisfaction rating for low-price listings (pricelev=0) in one neighborhood. The most common intercept is about 3, but there is variation, with values as high as 4 and as low as about 1.3.
dcoefi <- airbnbs %>%
group_by(neighborhood) %>%
do(mod = lm(overall_satisfaction ~ pricelev, data = .))
coefi <- dcoefi %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coefi, aes(x = intc)) + geom_histogram()+xlab("Satisfaction Rating for Low Price Across 24 Neighborhoods")
ggplot(coefi, aes(x = intc)) + geom_density(fill="steelblue")+xlab("Satisfaction Rating for Low Price Level Across 24 Neighborhoods")
Below are plots of the slopes of the per-neighborhood regressions (in other words, the difference in satisfaction between price levels in each neighborhood). The most common difference is about -0.5, which suggests that satisfaction ratings are typically about 0.5 lower for high-priced Airbnb listings than for low-priced listings. However, there is variation in this distribution as well. Satisfaction ratings for high-price listings range from about 1.77 lower than low-price listings to about 0.4 higher. The issue with no-pooling is that it does not impose a structure on this between-neighborhood variation.
dcoefs <- airbnbs %>%
group_by(neighborhood) %>%
do(mod = lm(overall_satisfaction ~ pricelev, data = .))
coefs <- dcoefs %>% do(data.frame(pricec = coef(.$mod)[2]))
ggplot(coefs, aes(x = pricec)) + geom_histogram()+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")
ggplot(coefs, aes(x = pricec)) + geom_density(fill="steelblue")+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")
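The no-pooling idea can also be sketched in base R with split() and lapply(), which avoids the dplyr do() call. This is an alternative illustration on a hypothetical two-neighborhood data frame, not the code used for the plots above.

```r
# Hypothetical toy data: two neighborhoods, four listings each
toy <- data.frame(
  neighborhood         = rep(c("A", "B"), each = 4),
  pricelev             = rep(c(0, 0, 1, 1), times = 2),
  overall_satisfaction = c(4, 5, 3, 4,  2, 3, 3, 4)
)

# Fit one regression per neighborhood and keep only the coefficients
fits <- lapply(split(toy, toy$neighborhood), function(d) {
  coef(lm(overall_satisfaction ~ pricelev, data = d))
})

# One row per neighborhood: intercept (low-price mean) and slope (price gap)
coef_table <- do.call(rbind, fits)
print(coef_table)
```

Each row is estimated from that neighborhood's listings alone, so small neighborhoods can produce extreme intercepts and slopes.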
With a random effects model, I am now combining the complexity of variation from the no-pooling model with the simplicity of the complete-pooling model (using the parameters of a normal distribution). Here, I am using a random intercept model, which allows the intercept of my regression model to vary freely between neighborhoods, but does not allow the price difference in satisfaction to differ between neighborhoods.
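Written out, the random intercept model can be sketched as follows. The notation is my own assumption rather than something taken from the nlme output: i indexes listings, j indexes neighborhoods, and the subscripted sigma terms are the between-neighborhood and residual standard deviations.

```latex
\begin{aligned}
\text{satisfaction}_{ij} &= \beta_0 + \beta_1\,\text{pricelev}_{ij} + u_{0j} + e_{ij},\\
u_{0j} &\sim N(0,\ \sigma_{u0}^2), \qquad e_{ij} \sim N(0,\ \sigma_e^2)
\end{aligned}
```

In this notation, the StdDev line of the lme summary reports the estimates of sigma_u0 (under "(Intercept)") and sigma_e (under "Residual").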
Below we can see that the standard deviation between neighborhoods for low-price listings is about 0.59. Ratings for low-price listings are, on average, 2.93 while ratings for high-price listings are, on average, 0.35 lower.
library(nlme)
m1_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~1|neighborhood, method = "ML")
summary(m1_lme)
Linear mixed-effects model fit by maximum likelihood
Data: airbnbs
Random effects:
Formula: ~1 | neighborhood
(Intercept) Residual
StdDev: 0.5936869 2.088129
Fixed effects: overall_satisfaction ~ pricelev
Correlation:
(Intr)
pricelev -0.124
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.8772828 -1.2117618 0.4744175 0.7191426 1.7468961
Number of Observations: 13578
Number of Groups: 24
Using a random slope model, I am now allowing the price difference in satisfaction to vary between neighborhoods.
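As a sketch, the random slope model adds a neighborhood-specific deviation to the price coefficient. The notation is again an assumption of mine (i for listings, j for neighborhoods, rho for the correlation between the two random effects).

```latex
\begin{aligned}
\text{satisfaction}_{ij} &= \beta_0 + \beta_1\,\text{pricelev}_{ij} + u_{0j} + u_{1j}\,\text{pricelev}_{ij} + e_{ij},\\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} &\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma_{u0}^2 & \rho\,\sigma_{u0}\sigma_{u1} \\ \rho\,\sigma_{u0}\sigma_{u1} & \sigma_{u1}^2 \end{pmatrix}
\right), \qquad e_{ij} \sim N(0,\ \sigma_e^2)
\end{aligned}
```

The "Random effects" block of the lme summary reports the estimates of sigma_u0, sigma_u1, and sigma_e in its StdDev column and the estimate of rho in its Corr column.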
Below we can see that the average satisfaction rating of low-price listings across neighborhoods is 3.02, with a between-neighborhood standard deviation of 0.62. Satisfaction ratings for high-price listings are, on average, 0.54 lower, with a between-neighborhood standard deviation of 0.45. The random intercepts and slopes have a correlation of about -0.29 across neighborhoods (the -0.33 in the output is the correlation between the two fixed-effect estimates, not a between-neighborhood correlation). This suggests that in neighborhoods where low-price listings have higher satisfaction ratings, the slope tends to be more negative, so the rating penalty for high-price listings is larger; in neighborhoods where low-price listings have lower ratings, the penalty tends to be smaller.
m2_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~ pricelev|neighborhood, method = "ML")
summary(m2_lme)
Linear mixed-effects model fit by maximum likelihood
Data: airbnbs
Random effects:
Formula: ~pricelev | neighborhood
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.6159586 (Intr)
pricelev 0.4476372 -0.286
Residual 2.0816310
Fixed effects: overall_satisfaction ~ pricelev
Correlation:
(Intr)
pricelev -0.33
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.9273662 -1.1607159 0.4745965 0.7204646 1.7657156
Number of Observations: 13578
Number of Groups: 24
Based on the AIC results below, the random slope model seems to fit the data best. As previously shown, this model suggests that, on average, low-price listings have a satisfaction rating of 3.02 with a standard deviation of 0.62 between neighborhoods, while high-price listings have a satisfaction rating that is 0.54 lower, with a standard deviation of 0.45 between neighborhoods.
AIC(cpooling, m1_lme, m2_lme)
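For readers unfamiliar with AIC, the sketch below shows what the AIC() call computes: minus twice the log-likelihood plus twice the number of estimated parameters, with lower values indicating a better trade-off between fit and complexity. It uses R's built-in mtcars data and two ordinary lm() fits purely as stand-ins; `manual_aic` is a hypothetical helper written for this example.

```r
# Stand-in models on a built-in dataset (not the Airbnb data)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: AIC = -2 * logLik + 2 * (number of parameters)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

# The built-in AIC() and the manual formula should agree
print(AIC(fit_small, fit_large))
print(c(small = manual_aic(fit_small), large = manual_aic(fit_large)))
```

The same penalty logic applies when comparing the lm and lme fits, provided all models are fitted by maximum likelihood to the same observations.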
#Null (intercept-only) model, used to compute the intra-class correlation:
m0_lme <- lme(overall_satisfaction ~ 1, random = ~ 1|neighborhood, data = airbnbs, method = "ML")
summary(m0_lme)
Linear mixed-effects model fit by maximum likelihood
Data: airbnbs
Random effects:
Formula: ~1 | neighborhood
(Intercept) Residual
StdDev: 0.5817633 2.095359
Fixed effects: overall_satisfaction ~ 1
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.7710411 -1.1962397 0.4412431 0.6839926 1.6267148
Number of Observations: 13578
Number of Groups: 24
Intra-class correlation (ICC), computed from the variances (the squared standard deviations):
0.5817633^2/(0.5817633^2+2.095359^2)=0.3384/(0.3384+4.3905)=0.0716
About 7.2% of the total variation in overall_satisfaction can be attributed to neighborhood-level influences. The other 92.8% can be attributed to the individual level.
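As a check on the arithmetic, the ICC can be recomputed from the two standard deviations printed by summary(m0_lme). This sketch assumes the usual definition of the ICC as the between-group variance divided by the total variance, so the standard deviations are squared first; the variable names are my own.

```r
# Standard deviations copied from the summary(m0_lme) output
sd_between <- 0.5817633  # neighborhood-level (Intercept) StdDev
sd_within  <- 2.095359   # Residual StdDev

# Convert to variances before forming the ratio
var_between <- sd_between^2
var_within  <- sd_within^2

icc <- var_between / (var_between + var_within)
print(round(icc, 4))  # 0.0716
```

Taking the ratio of the standard deviations instead of the variances would overstate the neighborhood share.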
Confidence Intervals:
intervals(m0_lme)
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) 2.539996 2.782792 3.025588
attr(,"label")
[1] "Fixed effects:"
Random Effects:
Level: neighborhood
lower est. upper
sd((Intercept)) 0.4242685 0.5817633 0.7977225
Within-group standard error:
lower est. upper
2.070563 2.095359 2.120453
After exploring the satisfaction ratings for Airbnb listings across 24 neighborhoods, I can conclude that price level does, in fact, matter when it comes to overall satisfaction. Though my ecological analysis showed a positive (but non-significant) relationship between price and rating, all other analyses showed a negative relationship, where satisfaction ratings are lower when the price level is high. This held true for both complete and partial pooling models; however, partial pooling was shown to be the best model fit, suggesting that what is considered satisfactory does indeed vary by neighborhood. According to the intra-class correlation, about 7.2% of the variation in satisfaction can be attributed to neighborhood influences, while the rest can be attributed to individual-level influences.