Jennifer Ganeles
4/7/19

Airbnb Satisfaction: How Much Does Price Matter? (A Multilevel Analysis)

This week’s homework explores an AirBNB analysis Lisbon dataset found on Kaggle. Airbnb is an online marketplace that allows people to rent out their properties or spare rooms to guests. Its popularity has increased over the years as a budget-friendly way to stay overnight in prime locations while also getting an authentically local travel experience.


Airbnb is known for being a cheaper alternative to normal hotel accommodations, but what’s considered a good price can vary by location. For this reason, I will be using multilevel modelling to explore how price affects overall Airbnb satisfaction across individual listings as well as across neighborhoods. The dataset contains listing and rating information for 13,578 Airbnb accommodations spanning 24 neighborhoods in Lisbon, Portugal as of 2017.

Importing the Data and Examining Key Variables:

My dependent variable (overall satisfaction) was rated on a scale from 0 to 5. My independent variable (price) was simplified into two levels using the median as a cutoff point: Low ($70 or lower) and High (greater than $70).
Low Price=0
High Price=1

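library(readr)
library(dplyr)    # select(), group_by(), summarise(), do(), and the %>% pipe
library(ggplot2)  # histograms and density plots used below
airbnb<-read_csv("/Users/jenniferganeles/Downloads/airbnb_lisbon_1480_2017-07-27.csv")
# Median split: 1 = High (price above the median), 0 = Low (at or below the median)
airbnb$pricelev <- ifelse(airbnb$price > median(airbnb$price), 1, 0)
airbnb<-select(airbnb, room_id, neighborhood, overall_satisfaction, price, pricelev, accommodates, bedrooms, reviews)
# Sort listings from highest to lowest overall satisfaction
airbnbs<-airbnb[order(-airbnb$overall_satisfaction),]
head(airbnbs)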

Ecological Analysis (Neighborhood-Level Analysis)

First, I will conduct an ecological regression, which is a neighborhood-level analysis. The analysis below shows that, at the neighborhood level, there is a positive but statistically non-significant relationship between price level and satisfaction rating (coefficient 0.52, p = 0.58). We may be tempted to say that high-priced accommodations have higher ratings, but this would be an ecological fallacy, since I am not using individual-level data here.

#First, grouping by neighborhood:
hood<-airbnbs %>% 
  group_by(neighborhood) %>% 
  summarise(mean_s = mean(overall_satisfaction, na.rm = TRUE), mean_p = mean(pricelev, na.rm = TRUE))
#Ecological model:
ecoreg <- lm(mean_s ~ mean_p, data = hood)
summary(ecoreg)

Call:
lm(formula = mean_s ~ mean_p, data = hood)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5061 -0.4240  0.1378  0.4185  0.8816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.5283     0.4159   6.079 4.06e-06 ***
mean_p        0.5188     0.9259   0.560    0.581    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6483 on 22 degrees of freedom
Multiple R-squared:  0.01407,   Adjusted R-squared:  -0.03075 
F-statistic: 0.3139 on 1 and 22 DF,  p-value: 0.5809

Complete-Pooling Model (Individual-Level Analysis)

Using the complete-pooling method, I can see that there is a significant negative relationship between price level and satisfaction rating. In other words, high-price listings are rated about 0.19 points lower, on average, than low-price listings. However, this model examines the relationship under the assumption that no differences exist between neighborhoods. Like my ecological regression, this model is also flawed, since the type of neighborhood can be an important determinant of what is satisfactory and what is not.

cpooling <- lm(overall_satisfaction ~ pricelev, data = airbnbs)
summary(cpooling)

Call:
lm(formula = overall_satisfaction ~ pricelev, data = airbnbs)

Residuals:
   Min     1Q Median     3Q    Max 
-3.338 -3.144  1.162  1.662  1.856 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.33770    0.02594 128.680  < 2e-16 ***
pricelev    -0.19345    0.03689  -5.245 1.59e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.149 on 13576 degrees of freedom
Multiple R-squared:  0.002022,  Adjusted R-squared:  0.001948 
F-statistic: 27.51 on 1 and 13576 DF,  p-value: 1.59e-07

No-Pooling Model (Intercept)

Here, I am running 24 different regression models (one per neighborhood) using the no-pooling approach. Below are plots of the intercept of my regression for each neighborhood. The intercept represents the average satisfaction rating for low-price listings (pricelev=0). Here, the modal satisfaction rating for low-price listings is about 3, but there is variation, with neighborhood averages as high as 4 and as low as about 1.3.

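# Fit one regression per neighborhood (no pooling)
dcoefi <- airbnbs %>% 
    group_by(neighborhood) %>% 
    do(mod = lm(overall_satisfaction ~ pricelev, data = .))
# Extract each neighborhood's intercept (dcoefi, not the undefined dcoef)
coefi <- dcoefi %>% do(data.frame(intc = coef(.$mod)[1]))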
ggplot(coefi, aes(x = intc)) + geom_histogram()+xlab("Satisfaction Rating for Low Price Across 24 Neighborhoods")

ggplot(coefi, aes(x = intc)) + geom_density(fill="steelblue")+xlab("Satisfaction Rating for Low Price Level Across 24 Neighborhoods")

No Pooling Model (Slope)

Below are plots of the slope of my regression for each neighborhood (in other words, the difference in satisfaction between price levels within each neighborhood). The modal difference is about -0.5, which suggests that satisfaction ratings are typically about 0.5 lower for high-priced Airbnb listings than for low-priced listings. However, there is variation in this distribution as well. Satisfaction ratings for high-price listings range from about 1.77 lower than low-price listings to about 0.4 higher. The issue with no-pooling is that it does not impose a structure on this between-neighborhood variation.

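# Fit one regression per neighborhood (no pooling)
dcoefs <- airbnbs %>% 
    group_by(neighborhood) %>% 
    do(mod = lm(overall_satisfaction ~ pricelev, data = .))
# Extract each neighborhood's price slope (dcoefs, not the undefined dcoef)
coefs <- dcoefs %>% do(data.frame(pricec = coef(.$mod)[2]))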
ggplot(coefs, aes(x = pricec)) + geom_histogram()+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")

ggplot(coefs, aes(x = pricec)) + geom_density(fill="steelblue")+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")

Random Intercept Model

With a random effects (partial-pooling) model, I am now combining the complexity of variation from the no-pooling model with the simplicity of the complete-pooling model by summarizing the between-neighborhood variation with the parameters of a normal distribution. Here, I am using a random intercept model, which allows the intercept of my regression model to vary between neighborhoods but holds the price difference in satisfaction constant across neighborhoods.


Below we can see that the standard deviation between neighborhoods for low-price listings is about 0.59. Ratings for low-price listings are, on average, 2.93 while ratings for high-price listings are, on average, 0.35 lower.

library(nlme)
m1_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~1|neighborhood, method = "ML")
summary(m1_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: airbnbs 

Random effects:
 Formula: ~1 | neighborhood
        (Intercept) Residual
StdDev:   0.5936869 2.088129

Fixed effects: overall_satisfaction ~ pricelev 
 Correlation: 
         (Intr)
pricelev -0.124

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.8772828 -1.2117618  0.4744175  0.7191426  1.7468961 

Number of Observations: 13578
Number of Groups: 24 

Random Slope Model

Using a random slope model, I am now allowing the price difference in satisfaction to vary between neighborhoods.


Below we can see that the average satisfaction rating of low-price listings is 3.02, with a standard deviation of 0.62 between neighborhoods. Satisfaction ratings for high-price listings are, on average, 0.54 lower, and this difference has a standard deviation of 0.45 between neighborhoods. The neighborhood-level intercepts and slopes have a correlation of about -0.29 (the -0.33 shown under "Fixed effects" is the correlation between the fixed-effect estimates, not between the random effects). This negative correlation suggests that in neighborhoods where low-price listings have higher satisfaction ratings, the slope tends to be more negative, meaning high-price listings lose more relative to low-price ones; in neighborhoods where low-price listings have lower ratings, the gap between price levels tends to be smaller.

m2_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~ pricelev|neighborhood, method = "ML")
summary(m2_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: airbnbs 

Random effects:
 Formula: ~pricelev | neighborhood
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 0.6159586 (Intr)
pricelev    0.4476372 -0.286
Residual    2.0816310       

Fixed effects: overall_satisfaction ~ pricelev 
 Correlation: 
         (Intr)
pricelev -0.33 

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.9273662 -1.1607159  0.4745965  0.7204646  1.7657156 

Number of Observations: 13578
Number of Groups: 24 

Comparing Models

Based on the below AIC results, the random slope model seems to fit the data best. As previously shown, this model suggests that, on average, low-price listings have a satisfaction rating of 3.02 with a standard deviation of .62 between neighborhoods, while high-price listings have a satisfaction rating that is .54 lower across neighborhoods with a standard deviation of .45.

AIC(cpooling, m1_lme, m2_lme)

Intra-Class Correlation:

Is overall satisfaction mainly an individual-level or neighborhood-level thing?

m0_lme <- lme(overall_satisfaction ~ 1, random = ~ 1|neighborhood, data = airbnbs, method = "ML")
summary(m0_lme)
Linear mixed-effects model fit by maximum likelihood
 Data: airbnbs 

Random effects:
 Formula: ~1 | neighborhood
        (Intercept) Residual
StdDev:   0.5817633 2.095359

Fixed effects: overall_satisfaction ~ 1 

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.7710411 -1.1962397  0.4412431  0.6839926  1.6267148 

Number of Observations: 13578
Number of Groups: 24 

Intra-class correlation (ICC):
0.5817633^2/(0.5817633^2+2.095359^2)=0.0716
Because the ICC is a ratio of variances, the standard deviations reported above are squared first. About 7.2% of the total variation in overall_satisfaction can be attributed to neighborhood-level influences. The other 92.8% can be attributed to the individual level.


Confidence Intervals:

intervals(m0_lme)
Approximate 95% confidence intervals

 Fixed effects:
               lower     est.    upper
(Intercept) 2.539996 2.782792 3.025588
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: neighborhood 
                    lower      est.     upper
sd((Intercept)) 0.4242685 0.5817633 0.7977225

 Within-group standard error:
   lower     est.    upper 
2.070563 2.095359 2.120453 

Conclusion

After exploring the satisfaction ratings for Airbnb listings across 24 neighborhoods, I can conclude that price level does, in fact, matter when it comes to overall satisfaction. Though my ecological analysis showed a positive (but non-significant) relationship between price and rating, all other analyses showed a negative relationship, where satisfaction ratings are lower when price level is high. This held true for both complete and partial pooling models; however, partial pooling was shown to be the best model fit, suggesting that what is considered satisfactory does indeed vary by neighborhood. According to the intra-class correlation, computed from the variance components, about 7.2% of variation in satisfaction can be attributed to neighborhood influences, while the rest can be attributed to individual-level influences.

---
title: "Soc 712: Homework 8"
output: html_notebook
---
*Jennifer Ganeles*
<br/>*4/7/19*

#Airbnb Satisfaction: How Much Does Price Matter? (A Multilevel Analysis)

This week's homework explores an [AirBNB analysis Lisbon](https://www.kaggle.com/vfoufikos/airbnb-analysis-lisbon) dataset found on Kaggle. Airbnb is an online marketplace that allows people to rent out their properties or spare rooms to guests. Its popularity has increased over the years as a budget-friendly way to stay overnight in prime locations while also getting an authentically local travel experience. 

<br/>Airbnb is known for being a cheaper alternative to normal hotel accommodations, but what's considered a good price can vary by location. For this reason, I will be using multilevel modelling to explore how price affects overall Airbnb satisfaction across individual listings as well as across neighborhoods. The dataset contains listing and rating information for 13,578 Airbnb accommodations spanning 24 neighborhoods in Lisbon, Portugal as of 2017. 

###Importing the Data and Examining Key Variables:
My dependent variable (overall satisfaction) was rated on a scale from 0 to 5. My independent variable (price) was simplified into two levels using the median as a cutoff point: Low ($70 or lower) and High (greater than $70).
<br/>**Low Price**=0
<br/>**High Price**=1

```{r message=FALSE}
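library(readr)
library(dplyr)    # select(), group_by(), summarise(), do(), and the %>% pipe
library(ggplot2)  # histograms and density plots used below
airbnb<-read_csv("/Users/jenniferganeles/Downloads/airbnb_lisbon_1480_2017-07-27.csv")
# Median split: 1 = High (price above the median), 0 = Low (at or below the median)
airbnb$pricelev <- ifelse(airbnb$price > median(airbnb$price), 1, 0)
airbnb<-select(airbnb, room_id, neighborhood, overall_satisfaction, price, pricelev, accommodates, bedrooms, reviews)
# Sort listings from highest to lowest overall satisfaction
airbnbs<-airbnb[order(-airbnb$overall_satisfaction),]
head(airbnbs)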
```
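
To make the median split concrete, here is a minimal sketch on a made-up price vector (the `price` values and the name `cutoff` are hypothetical, not taken from the Lisbon data). It illustrates that listings priced exactly at the median fall into the Low group, because the comparison is strictly greater-than.

```r
# Hypothetical nightly prices, not from the Lisbon dataset
price <- c(40, 55, 70, 70, 90, 120)

# Median of the six values (average of the two middle values)
cutoff <- median(price)

# 1 = High (strictly above the median), 0 = Low (at or below the median)
pricelev <- ifelse(price > cutoff, 1, 0)

data.frame(price, pricelev)
```

With real prices, many listings can sit exactly at the median, so the Low group may end up larger than the High group.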

##Ecological Analysis (Neighborhood-Level Analysis)
First, I will conduct an ecological regression, which is a neighborhood-level analysis. The analysis below shows that, at the neighborhood level, there is a positive but statistically non-significant relationship between price level and satisfaction rating (coefficient 0.52, p = 0.58). We may be tempted to say that high-priced accommodations have higher ratings, but this would be an ecological fallacy, since I am not using individual-level data here. 
```{r}
#First, grouping by neighborhood:
hood<-airbnbs %>% 
  group_by(neighborhood) %>% 
  summarise(mean_s = mean(overall_satisfaction, na.rm = TRUE), mean_p = mean(pricelev, na.rm = TRUE))

#Ecological model:
ecoreg <- lm(mean_s ~ mean_p, data = hood)
summary(ecoreg)
```
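
The ecological fallacy can be illustrated with a small, fully made-up example. The sketch below builds toy data (three hypothetical neighborhoods `A`, `B`, `C`; none of these numbers come from the Lisbon data) in which high price lowers the rating by 0.5 within every neighborhood, yet the regression on neighborhood means has a positive slope, because the better-rated neighborhoods also happen to contain more high-price listings.

```r
# Toy data (hypothetical): 3 neighborhoods with 10 listings each;
# the share of high-price listings rises from A (20%) to C (80%)
toy <- data.frame(
  hood     = rep(c("A", "B", "C"), each = 10),
  pricelev = c(rep(c(1, 0), times = c(2, 8)),
               rep(c(1, 0), times = c(5, 5)),
               rep(c(1, 0), times = c(8, 2)))
)

# Baseline rating rises from A to C; high price always costs 0.5 points
base <- c(A = 2, B = 3, C = 4)
toy$sat <- unname(base[toy$hood]) - 0.5 * toy$pricelev

# Ecological regression: one row of means per neighborhood
hood_means <- aggregate(cbind(sat, pricelev) ~ hood, data = toy, FUN = mean)
eco_slope <- coef(lm(sat ~ pricelev, data = hood_means))[["pricelev"]]

# Within-neighborhood slope: control for neighborhood with dummy variables
within_slope <- coef(lm(sat ~ pricelev + factor(hood), data = toy))[["pricelev"]]

c(ecological = eco_slope, within = within_slope)
```

The two slopes have opposite signs, which is why a neighborhood-level coefficient should not be read as a statement about individual listings.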
##Complete-Pooling Model (Individual-Level Analysis) 
Using the complete-pooling method, I can see that there is a significant *negative* relationship between price level and satisfaction rating. In other words, high-price listings are rated about 0.19 points lower, on average, than low-price listings. However, this model examines the relationship under the assumption that no differences exist between neighborhoods. Like my ecological regression, this model is also flawed, since the type of neighborhood can be an important determinant of what is satisfactory and what is not. 
```{r}
cpooling <- lm(overall_satisfaction ~ pricelev, data = airbnbs)
summary(cpooling)
```

##No-Pooling Model (Intercept)
Here, I am running 24 different regression models (one per neighborhood) using the no-pooling approach. Below are plots of the intercept of my regression for each neighborhood. The intercept represents the average satisfaction rating for low-price listings (pricelev=0). Here, the modal satisfaction rating for low-price listings is about 3, but there is variation, with neighborhood averages as high as 4 and as low as about 1.3.
```{r}
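# Fit one regression per neighborhood (no pooling)
dcoefi <- airbnbs %>% 
    group_by(neighborhood) %>% 
    do(mod = lm(overall_satisfaction ~ pricelev, data = .))
# Extract each neighborhood's intercept (dcoefi, not the undefined dcoef)
coefi <- dcoefi %>% do(data.frame(intc = coef(.$mod)[1]))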
ggplot(coefi, aes(x = intc)) + geom_histogram()+xlab("Satisfaction Rating for Low Price Across 24 Neighborhoods")

```

```{r}
ggplot(coefi, aes(x = intc)) + geom_density(fill="steelblue")+xlab("Satisfaction Rating for Low Price Level Across 24 Neighborhoods")
```
##No Pooling Model (Slope)
Below are plots of the slope of my regression for each neighborhood (in other words, the difference in satisfaction between price levels within each neighborhood). The modal difference is about -0.5, which suggests that satisfaction ratings are typically about 0.5 lower for high-priced Airbnb listings than for low-priced listings. However, there is variation in this distribution as well. Satisfaction ratings for high-price listings range from about 1.77 lower than low-price listings to about 0.4 higher. The issue with no-pooling is that it does not impose a structure on this between-neighborhood variation.
```{r}
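# Fit one regression per neighborhood (no pooling)
dcoefs <- airbnbs %>% 
    group_by(neighborhood) %>% 
    do(mod = lm(overall_satisfaction ~ pricelev, data = .))
# Extract each neighborhood's price slope (dcoefs, not the undefined dcoef)
coefs <- dcoefs %>% do(data.frame(pricec = coef(.$mod)[2]))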
ggplot(coefs, aes(x = pricec)) + geom_histogram()+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")
```

```{r warning=FALSE}
ggplot(coefs, aes(x = pricec)) + geom_density(fill="steelblue")+xlab("Difference in Satisfaction Rating Between High and Low Price Levels Across 24 Neighborhoods")
```
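
As an alternative illustration of the no-pooling idea, the sketch below fits one regression per group in base R and collects the intercept and slope together in a single pass. It runs on simulated data (the data frame `toy`, its four neighborhoods, and the simulated effect sizes are all hypothetical), so the numbers it prints say nothing about the Lisbon listings.

```r
set.seed(1)

# Simulated data (hypothetical): 4 neighborhoods, 20 listings each,
# alternating low-price (0) and high-price (1) listings
toy <- data.frame(
  hood     = rep(c("A", "B", "C", "D"), each = 20),
  pricelev = rep(c(0, 1), times = 40)
)
toy$sat <- 3 - 0.5 * toy$pricelev + rnorm(nrow(toy), sd = 0.5)

# No pooling: a separate lm() for each neighborhood
fits <- lapply(split(toy, toy$hood),
               function(g) lm(sat ~ pricelev, data = g))

# One row per neighborhood: intercept (low-price mean) and price slope
coefs <- as.data.frame(t(sapply(fits, coef)))
names(coefs) <- c("intc", "pricec")
coefs
```

Keeping both coefficients in one table avoids fitting the same 24 models twice, once for the intercepts and once for the slopes.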
##Random Intercept Model
With a random effects (partial-pooling) model, I am now combining the complexity of variation from the no-pooling model with the simplicity of the complete-pooling model by summarizing the between-neighborhood variation with the parameters of a normal distribution. Here, I am using a random intercept model, which allows the intercept of my regression model to vary between neighborhoods but holds the price difference in satisfaction constant across neighborhoods.

<br/> Below we can see that the standard deviation between neighborhoods for low-price listings is about 0.59. Ratings for low-price listings are, on average, 2.93 while ratings for high-price listings are, on average, 0.35 lower. 
```{r}
library(nlme)
m1_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~1|neighborhood, method = "ML")
summary(m1_lme)
```
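
For reference, the random intercept model fitted above can be written out as follows. The notation is my own and is not part of the original analysis: $i$ indexes listings, $j$ indexes neighborhoods, $u_{0j}$ is the neighborhood-specific shift in the intercept, and $\varepsilon_{ij}$ is the listing-level residual.

$$
\text{satisfaction}_{ij} = \beta_0 + \beta_1\,\text{pricelev}_{ij} + u_{0j} + \varepsilon_{ij},
\qquad u_{0j} \sim N(0, \sigma_{u0}^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

In the `summary(m1_lme)` output, the two `StdDev` entries correspond to $\sigma_{u0} \approx 0.59$ and $\sigma \approx 2.09$; the price effect $\beta_1$ is the same for every neighborhood.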
##Random Slope Model
Using a random slope model, I am now allowing the price difference in satisfaction to vary between neighborhoods. 

<br/>Below we can see that the average satisfaction rating of low-price listings is 3.02, with a standard deviation of 0.62 between neighborhoods. Satisfaction ratings for high-price listings are, on average, 0.54 lower, and this difference has a standard deviation of 0.45 between neighborhoods. The neighborhood-level intercepts and slopes have a correlation of about -0.29 (the -0.33 shown under "Fixed effects" in the summary output is the correlation between the fixed-effect estimates, not between the random effects). This negative correlation suggests that in neighborhoods where low-price listings have higher satisfaction ratings, the slope tends to be more negative, meaning high-price listings lose more relative to low-price ones; in neighborhoods where low-price listings have lower ratings, the gap between price levels tends to be smaller.
```{r}
m2_lme <- lme(overall_satisfaction ~ pricelev, data = airbnbs, random = ~ pricelev|neighborhood, method = "ML")
summary(m2_lme)
```
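
Written out, the random slope model adds a neighborhood-specific deviation to the price effect. As before, the notation is my own ($i$ for listings, $j$ for neighborhoods) rather than something taken from the original analysis.

$$
\text{satisfaction}_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,\text{pricelev}_{ij} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

$$
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_{u0}^2 & \rho\,\sigma_{u0}\sigma_{u1} \\
\rho\,\sigma_{u0}\sigma_{u1} & \sigma_{u1}^2
\end{pmatrix}
\right)
$$

The "Random effects" block of `summary(m2_lme)` reports $\sigma_{u0} \approx 0.62$, $\sigma_{u1} \approx 0.45$, $\rho \approx -0.29$, and $\sigma \approx 2.08$.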
##Comparing Models
Based on the below AIC results, the random slope model seems to fit the data best. As previously shown, this model suggests that, on average, low-price listings have a satisfaction rating of 3.02 with a standard deviation of .62 between neighborhoods, while high-price listings have a satisfaction rating that is .54 lower across neighborhoods with a standard deviation of .45.
```{r}
AIC(cpooling, m1_lme, m2_lme)
```

##Intra-Class Correlation: 
###Is overall satisfaction mainly an individual-level or neighborhood-level thing?
```{r}
m0_lme <- lme(overall_satisfaction ~ 1, random = ~ 1|neighborhood, data = airbnbs, method = "ML")
summary(m0_lme)
```
**Intra-class correlation (ICC):**
<br/> $0.5817633^2/(0.5817633^2+2.095359^2) \approx 0.0716$
<br/>Because the ICC is a ratio of variances, the standard deviations reported above are squared first. About 7.2% of the total variation in overall_satisfaction can be attributed to neighborhood-level influences. The other 92.8% can be attributed to the individual level. 
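
The ICC is conventionally defined as the share of total variance that lies between groups, so it is computed from variances rather than standard deviations. A minimal sketch, using the two standard deviations printed by `summary(m0_lme)` (the variable names here are my own):

```r
# Standard deviations as reported by summary(m0_lme)
sd_between <- 0.5817633  # neighborhood-level (random intercept) SD
sd_within  <- 2.095359   # residual (listing-level) SD

# The ICC is a ratio of variances, so square the SDs first
var_between <- sd_between^2
var_within  <- sd_within^2

icc <- var_between / (var_between + var_within)
round(icc, 4)
```

Computed this way, the neighborhood share of the variation is roughly 7%, noticeably smaller than the figure obtained by dividing the unsquared standard deviations.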

<br/>**Confidence Intervals:**
```{r}
intervals(m0_lme)
```
#Conclusion
After exploring the satisfaction ratings for Airbnb listings across 24 neighborhoods, I can conclude that price level does, in fact, matter when it comes to overall satisfaction. Though my ecological analysis showed a positive (but non-significant) relationship between price and rating, all other analyses showed a negative relationship, where satisfaction ratings are lower when price level is high. This held true for both complete and partial pooling models; however, partial pooling was shown to be the best model fit, suggesting that what is considered satisfactory does indeed vary by neighborhood. According to the intra-class correlation, computed from the variance components, about 7.2% of variation in satisfaction can be attributed to neighborhood influences, while the rest can be attributed to individual-level influences. 