HW1

mydata <- read.table("./amsterdam_weekdays.csv",header = TRUE,sep = ",",dec = ".")
head(mydata)

##   ID  realSum    room_type room_shared room_private person_capacity
## 1  0 194.0337 Private room       FALSE         TRUE               2
## 2  1 344.2458 Private room       FALSE         TRUE               4
## 3  2 264.1014 Private room       FALSE         TRUE               2
## 4  3 433.5294 Private room       FALSE         TRUE               4
## 5  4 485.5529 Private room       FALSE         TRUE               2
## 6  5 552.8086 Private room       FALSE         TRUE               3
##   host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1             FALSE     1   0                 10                         93
## 2             FALSE     0   0                  8                         85
## 3             FALSE     0   1                  9                         87
## 4             FALSE     0   1                  9                         90
## 5              TRUE     0   0                 10                         98
## 6             FALSE     0   0                  8                        100

mydata$room_type <- factor(mydata$room_type)
mydata$room_shared <- factor(mydata$room_shared)
mydata$room_private <- factor(mydata$room_private)
mydata$host_is_superhost <- factor(mydata$host_is_superhost)
mydata$multi <- factor(mydata$multi,
                       levels = c("0", "1"),
                       labels = c("No","Yes"))
mydata$biz <- factor(mydata$biz,
                     levels = c("0","1"),
                     labels = c("No","Yes"))
head(mydata)

##   ID  realSum    room_type room_shared room_private person_capacity
## 1  0 194.0337 Private room       FALSE         TRUE               2
## 2  1 344.2458 Private room       FALSE         TRUE               4
## 3  2 264.1014 Private room       FALSE         TRUE               2
## 4  3 433.5294 Private room       FALSE         TRUE               4
## 5  4 485.5529 Private room       FALSE         TRUE               2
## 6  5 552.8086 Private room       FALSE         TRUE               3
##   host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1             FALSE   Yes  No                 10                         93
## 2             FALSE    No  No                  8                         85
## 3             FALSE    No Yes                  9                         87
## 4             FALSE    No Yes                  9                         90
## 5              TRUE    No  No                 10                         98
## 6             FALSE    No  No                  8                        100

summary(mydata[-1])

##     realSum              room_type   room_shared  room_private person_capacity
##  Min.   : 128.9   Entire home :538   FALSE:1097   FALSE:544    Min.   :2.000  
##  1st Qu.: 309.8   Private room:559   TRUE :   6   TRUE :559    1st Qu.:2.000  
##  Median : 430.2   Shared room :  6                             Median :2.000  
##  Mean   : 545.0                                                Mean   :2.792  
##  3rd Qu.: 657.3                                                3rd Qu.:4.000  
##  Max.   :7782.9                                                Max.   :6.000  
##  host_is_superhost multi      biz      cleanliness_rating
##  FALSE:780         No :763   No :976   Min.   : 4.000    
##  TRUE :323         Yes:340   Yes:127   1st Qu.: 9.000    
##                                        Median :10.000    
##                                        Mean   : 9.461    
##                                        3rd Qu.:10.000    
##                                        Max.   :10.000    
##  guest_satisfaction_overall
##  Min.   : 20.00            
##  1st Qu.: 92.00            
##  Median : 96.00            
##  Mean   : 94.36            
##  3rd Qu.: 98.00            
##  Max.   :100.00

Descriptive Statistics: The dataset used for this analysis contains Airbnb listings in Amsterdam. Below is an overview of the key variables:

Explanation:

1. realSum:

Definition: The total price of the Airbnb listing.

Type: Numeric.

Unit: Currency (€).

2. room_type:

Definition: The type of room being offered (e.g., “Entire home”, “Private room”, “Shared room”).

Type: Categorical.

3. room_shared:

Definition: Whether the room is shared or not. “TRUE” for shared, “FALSE” otherwise.

Type: Logical.

4. room_private:

Definition: Whether the room is private or not. “TRUE” for private, “FALSE” otherwise.

Type: Logical.

5. person_capacity:

Definition: The maximum number of people that can stay in the room.

Type: Numeric.

Unit: Number of persons.

6. host_is_superhost:

Definition: Whether the host is a superhost or not. “TRUE” for superhost, “FALSE” otherwise.

Type: Logical.

7. multi:

Definition: Whether the listing is for multiple rooms or not. “Yes” for multiple rooms, “No” otherwise.

Type: Categorical.

8. biz:

Definition: Whether the listing is for business purposes or not. “Yes” for business, “No” otherwise.

Type: Categorical.

9. cleanliness_rating:

Definition: The cleanliness rating of the listing.

Type: Numeric.

Unit: Rating scale (e.g., 1-10).

10. guest_satisfaction_overall:

Definition: The overall guest satisfaction rating of the listing.

Type: Numeric.

Unit: Rating scale (e.g., 1-100).
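
As a quick check on the Type entries above, the following minimal sketch rebuilds the first three rows of the head() output in a toy data frame and applies the same factor() conversions used earlier. The name `toy` is hypothetical and not part of the original analysis. It suggests that, once converted, the TRUE/FALSE and 0/1 columns are stored as factors, while realSum stays numeric.

```r
# Toy data frame copied from the first three rows of head(mydata) above.
# `toy` is a hypothetical name used only for this illustration.
toy <- data.frame(
  realSum           = c(194.0337, 344.2458, 264.1014),
  room_shared       = c(FALSE, FALSE, FALSE),
  host_is_superhost = c(FALSE, FALSE, FALSE),
  multi             = c(1L, 0L, 0L),
  biz               = c(0L, 0L, 1L)
)

# Same conversions as in the analysis: logical and 0/1 columns become factors.
toy$room_shared       <- factor(toy$room_shared)
toy$host_is_superhost <- factor(toy$host_is_superhost)
toy$multi <- factor(toy$multi, levels = c("0", "1"), labels = c("No", "Yes"))
toy$biz   <- factor(toy$biz,   levels = c("0", "1"), labels = c("No", "Yes"))

# Storage class of every column after the conversion.
classes <- sapply(toy, class)
print(classes)
```

On the full data, `str(mydata)` would give the same kind of overview in one call.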

The dataset amsterdam_weekdays.csv was obtained from Kaggle.com. It is part of the dataset titled “Airbnb Prices in European Cities,” available at the following link: Airbnb Prices in European Cities Dataset on Kaggle.

RQ1: Is guest satisfaction the same for entire homes and private rooms? The null hypothesis is that guest satisfaction does not differ between the two room types.

room_data <- mydata[mydata$room_type %in% c("Entire home", "Private room"), ]


room_data$room_type <- droplevels(room_data$room_type)

In this chunk I created a new data set that includes only “Entire home” and “Private room” listings and dropped the now-unused “Shared room” factor level.

t_test <- t.test(guest_satisfaction_overall ~ room_type, data = room_data, var.equal = FALSE)
print(t_test)

## 
##  Welch Two Sample t-test
## 
## data:  guest_satisfaction_overall by room_type
## t = 4.5442, df = 1044.7, p-value = 6.157e-06
## alternative hypothesis: true difference in means between group Entire home and group Private room is not equal to 0
## 95 percent confidence interval:
##  0.9376154 2.3627417
## sample estimates:
##  mean in group Entire home mean in group Private room 
##                   95.21190                   93.56172

Parametric test: Welch two-sample t-test. Since the p-value (6.157e-06) is smaller than the significance level (α = 0.05), we reject the null hypothesis. The result indicates a statistically significant difference in mean guest satisfaction between “Entire home” (95.21) and “Private room” (93.56) listings, with a 95% confidence interval for the difference of [0.94, 2.36] points.
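
To show how the reported numbers fit together, here is a small sketch that rebuilds the 95% confidence interval from the printed group means, t statistic, and degrees of freedom. All inputs are copied from the output above. The standard error is back-calculated from t rather than computed from the raw data, so this is a consistency check and not a replacement for t.test().

```r
# Values copied from the Welch t-test output above.
mean_entire  <- 95.21190
mean_private <- 93.56172
t_stat       <- 4.5442
df           <- 1044.7

# Difference in means and the standard error implied by t = difference / SE.
diff_means <- mean_entire - mean_private
se_diff    <- diff_means / t_stat

# Two-sided 95% interval: difference +/- critical t value * SE.
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff
print(round(ci, 3))
```

The result should agree with the printed interval up to rounding of the inputs.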

wilcox_test <- wilcox.test(guest_satisfaction_overall ~ room_type, data = room_data)
print(wilcox_test)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  guest_satisfaction_overall by room_type
## W = 177334, p-value = 2.401e-07
## alternative hypothesis: true location shift is not equal to 0

Non-parametric test: Wilcoxon rank sum test. Since the p-value (2.401e-07) is much smaller than the significance level (α = 0.05), we reject the null hypothesis. This indicates a statistically significant location shift in guest satisfaction scores between “Entire home” and “Private room” listings, meaning that scores in one group tend to be higher than in the other.

library(ggplot2)
library(ggpubr)


EntireHome <- ggplot(room_data[room_data$room_type == "Entire home", ], aes(x = guest_satisfaction_overall)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "black", fill = "blue", alpha = 0.7) +
  ylab("Count") +
  xlab("Guest Satisfaction Overall") +
  ggtitle("Entire home")

PrivateRoom <- ggplot(room_data[room_data$room_type == "Private room", ], aes(x = guest_satisfaction_overall)) +
  theme_linedraw() +
  geom_histogram(binwidth = 2, col = "black", fill = "green", alpha = 0.7) +
  ylab("Count") +
  xlab("Guest Satisfaction Overall") +
  ggtitle("Private room")

ggarrange(EntireHome, PrivateRoom,
          ncol = 2, nrow = 1)

Neither histogram appears to follow a normal distribution: in both room types the scores pile up near the maximum of 100, with a tail toward lower values. For “Entire home,” the majority of ratings are concentrated between 90 and 100, with fewer low outliers, suggesting that guests consistently rate this room type highly. Conversely, “Private room” shows a wider distribution with more variability in ratings, including sporadic outliers below 80.

This implies that “Entire home” listings tend to provide a more consistent and higher level of guest satisfaction compared to “Private room” listings, which may be influenced by factors such as shared amenities or differences in guest expectations.

#install.packages("rstatix")
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

room_data %>%
  group_by(room_type) %>%
  shapiro_test(guest_satisfaction_overall)

## # A tibble: 2 × 4
##   room_type    variable                   statistic        p
##   <fct>        <chr>                          <dbl>    <dbl>
## 1 Entire home  guest_satisfaction_overall     0.809 9.53e-25
## 2 Private room guest_satisfaction_overall     0.734 6.32e-29

Both Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both groups and we rely on the non-parametric test.

#install.packages("effectsize")
library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

effect_size <- rank_biserial(guest_satisfaction_overall ~ room_type, data = room_data)

print(effect_size)

## r (rank biserial) |       95% CI
## --------------------------------
## 0.18              | [0.11, 0.24]

interpret_rank_biserial(.18)

## [1] "small"
## (Rules: funder2019)
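
The rank-biserial value printed above can be cross-checked by hand. The sketch below assumes the group sizes from the summary table (538 “Entire home” and 559 “Private room” listings, with no missing scores) and uses the fact that R's W is the Mann-Whitney U statistic of the first group, so that r = 2W / (n1 · n2) - 1.

```r
# W is copied from the wilcox.test() output; group sizes come from summary(mydata[-1]).
# Assumption: no missing satisfaction scores, so all listings enter the test.
W  <- 177334
n1 <- 538   # Entire home
n2 <- 559   # Private room

# Rank-biserial correlation: share of favourable pairs minus share of unfavourable pairs.
r_rb <- 2 * W / (n1 * n2) - 1
print(round(r_rb, 2))
```

A positive value means that a randomly chosen “Entire home” listing tends to score higher than a randomly chosen “Private room” listing.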

Conclusion: Both the parametric (Welch t-test) and the non-parametric (Wilcoxon rank sum test) approaches rejected the null hypothesis that guest satisfaction ratings are equal for “Entire home” and “Private room” listings. Because the normality assumption is violated, the non-parametric Wilcoxon test is the more suitable of the two. The mean guest satisfaction score is 95.21 for “Entire home” listings and 93.56 for “Private room” listings. However, the effect size, a rank-biserial correlation of 0.18 (95% CI [0.11, 0.24]), is classified as “small” under Funder’s 2019 guidelines. So although the difference in guest satisfaction is statistically significant, its practical significance is limited. Both “Entire home” and “Private room” listings can achieve high guest satisfaction. Hosts may therefore weigh other factors such as pricing, location, or amenities when deciding on their listing type, as guest satisfaction is high for both.

RQ2: Is there a correlation between realSum (price) and cleanliness_rating?

library(car)

## Loading required package: carData

scatterplot(mydata$realSum, mydata$cleanliness_rating)

The scatterplot shows extreme values (outliers), especially at high prices. Most listings have high cleanliness ratings (9 or 10), and these ratings occur across a wide range of prices. Prices also vary substantially among listings with lower cleanliness ratings (below 7), which suggests that other variables such as property type, location, or amenities play a role. At higher prices, cleanliness ratings stay roughly constant, so price appears to have little relation to cleanliness beyond a certain level. Overall, although more expensive listings typically have high cleanliness ratings, the relationship between the two variables does not look linear.

#install.packages("Hmisc")
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(mydata$realSum, mydata$cleanliness_rating, type = "pearson")

##      x    y
## x 1.00 0.02
## y 0.02 1.00
## 
## n= 1103 
## 
## 
## P
##   x      y     
## x        0.4466
## y 0.4466

The Pearson correlation coefficient between price (realSum) and cleanliness rating is 0.02, with a p-value of 0.4466. This is a very weak, statistically non-significant linear association between the two variables.
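
For readers who want to see where such a p-value comes from, the sketch below applies the usual t approximation for a Pearson correlation, t = r · sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom. The helper `p_from_r` is a hypothetical name introduced here. Because the printed r is rounded to two decimals, plugging in 0.02 will not reproduce 0.4466 exactly; the second call uses 0.023 purely as an assumed example of an unrounded value.

```r
# Hypothetical helper: two-sided p-value for a Pearson correlation r from n pairs.
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 1103  # sample size from the rcorr() output

p_rounded <- p_from_r(0.02, n)   # r as printed (rounded to two decimals)
p_assumed <- p_from_r(0.023, n)  # assumed unrounded r, for illustration only

print(c(rounded = p_rounded, assumed = p_assumed))
```

Either way the p-value is far above 0.05, so the conclusion does not depend on the rounding.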

This finding indicates that the cleanliness rating of Airbnb listings is essentially uncorrelated with price. High cleanliness ratings appear throughout the price range (as seen in the scatterplot), and the lack of statistical significance means we cannot conclude that a linear relationship exists. Cleanliness ratings are therefore likely driven by factors other than price.

Conclusion: The analysis investigated the relationship between the price of Airbnb listings (realSum) and cleanliness rating (cleanliness_rating). The Pearson correlation coefficient (r = 0.02, p = 0.4466) indicates an extremely weak positive relationship between the two variables. However, this relationship is not statistically significant, as evidenced by the p-value exceeding the significance threshold (α = 0.05).

The scatterplot of realSum and cleanliness_rating further supports this conclusion, showing no clear linear trend. While there is some clustering at higher cleanliness ratings (particularly around 9-10), price variations are widespread and not directly influenced by cleanliness ratings. This suggests that other factors (e.g., location, room type, or amenities) likely contribute more significantly to price variation.

There is no meaningful or significant relationship between Airbnb price and cleanliness rating in this dataset.

RQ3: Is there an association between an Airbnb being listed for business purposes and its host being a superhost?

mytable <- table(mydata$host_is_superhost, mydata$biz)
print(mytable)

##        
##          No Yes
##   FALSE 668 112
##   TRUE  308  15

chi <- chisq.test(mytable,
                  correct = TRUE)
print(chi)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable
## X-squared = 20.217, df = 1, p-value = 6.915e-06

Since the p-value (6.915e-06) is below 0.05, we reject the null hypothesis of independence. This indicates a statistically significant association between a listing being for business purposes and its host being a superhost.

addmargins(chi$observed)

##        
##           No  Yes  Sum
##   FALSE  668  112  780
##   TRUE   308   15  323
##   Sum    976  127 1103

This observed distribution shows that a smaller proportion of superhosts (15 out of 323) list their properties for business purposes compared to non-superhosts (112 out of 780).

addmargins(round(chi$expected, 2))

##        
##             No    Yes  Sum
##   FALSE 690.19  89.81  780
##   TRUE  285.81  37.19  323
##   Sum   976.00 127.00 1103

Superhosts are less likely than expected to list their properties for business purposes (15 observed vs. 37.19 expected). Non-superhosts are more likely than expected to list their properties for business purposes (112 observed vs. 89.81 expected). The Chi-squared test’s statistically significant result is influenced by these discrepancies. As a result, we draw the conclusion that listing an Airbnb for business purposes and being a superhost are significantly related.
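
The expected counts and the test statistic can be reproduced from the observed table alone. The following sketch re-enters the four observed counts by hand (the object names are illustrative) and computes the expected counts as row total × column total / grand total, followed by the Yates-corrected chi-squared statistic.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(668, 308, 112, 15), nrow = 2,
                   dimnames = list(superhost = c("FALSE", "TRUE"),
                                   biz = c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each absolute deviation.
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

print(round(expected, 2))
print(round(x2_yates, 3))
```

The statistic should match the X-squared value reported by chisq.test() with correct = TRUE.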

round(chi$residuals, 2)

##        
##            No   Yes
##   FALSE -0.84  2.34
##   TRUE   1.31 -3.64

These results indicate that non-superhosts are overrepresented among business listings, while superhosts are underrepresented. The Pearson residuals in the “Yes” column (2.34 and -3.64) exceed 2 in absolute value, indicating a significant departure from the expected counts in those cells. This adds to the evidence of a statistically significant association between being a superhost and listing for business purposes.
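
As an illustration of how these residuals arise, the sketch below recomputes them as (observed - expected) / sqrt(expected) from the observed counts and flags cells whose absolute residual exceeds 2. The counts are copied from the table above; the object names are illustrative.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(668, 308, 112, 15), nrow = 2,
                   dimnames = list(superhost = c("FALSE", "TRUE"),
                                   biz = c("No", "Yes")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: deviation from the expected count, scaled by its square root.
pearson_resid <- (observed - expected) / sqrt(expected)

# Cells that deviate by more than 2 in absolute value.
flagged <- abs(pearson_resid) > 2

print(round(pearson_resid, 2))
print(flagged)
```

Only the two cells in the “Yes” column are flagged, which is where the association shows up.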

addmargins(round(prop.table(mytable), 3))

##        
##            No   Yes   Sum
##   FALSE 0.606 0.102 0.708
##   TRUE  0.279 0.014 0.293
##   Sum   0.885 0.116 1.001

Business listings by non-superhosts make up 10.2% of all listings, compared with only 1.4% for business listings by superhosts, so superhosts are underrepresented in this category. The total of 1.001 is a rounding artifact.

addmargins(round(prop.table(mytable, 1), 3), 2)

##        
##            No   Yes   Sum
##   FALSE 0.856 0.144 1.000
##   TRUE  0.954 0.046 1.000

The data indicates that non-superhosts are more likely to list their properties for business purposes, with 14.4% of their listings catering to business travelers, compared to only 4.6% for superhosts. Non-superhosts are more associated with business listings, while superhosts predominantly list for non-business purposes.

addmargins(round(prop.table(mytable, 2), 3), 1)

##        
##            No   Yes
##   FALSE 0.684 0.882
##   TRUE  0.316 0.118
##   Sum   1.000 1.000

The findings show that, while non-superhosts make up the majority of both the non-business and business categories, their share is especially large in the business category, where they account for 88.2% of listings. Non-superhosts are more commonly associated with business listings, whereas superhosts hold a smaller share of both categories, particularly the business category.

library(effectsize)
effectsize::cramers_v(mytable)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.14              | [0.08, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Cramer’s V is 0.14, with a one-sided 95% confidence interval of [0.08, 1.00]. According to Funder’s 2019 guidelines, this is a small effect size. The association between whether the listing is for business purposes (“biz”) and whether the host is a superhost (“host_is_superhost”) is therefore weak but present. The lower bound of the interval (0.08) lies above 0, supporting the existence of some association, but the small effect size suggests that its practical significance is limited. These results are consistent with the chi-squared test, which found a statistically significant association between the two variables without implying a strong relationship.
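
For reference, a rough hand calculation of Cramer's V is sketched below using the textbook formula V = sqrt(χ² / (n · (min(rows, cols) - 1))) and the uncorrected chi-squared statistic. The output above is labelled “(adj.)”, meaning effectsize reports a bias-adjusted variant, so small differences in the third decimal are to be expected; this sketch only checks the order of magnitude.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(c(668, 308, 112, 15), nrow = 2,
                   dimnames = list(superhost = c("FALSE", "TRUE"),
                                   biz = c("No", "Yes")))
n <- sum(observed)

# Uncorrected Pearson chi-squared statistic (an assumption of this sketch).
x2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Unadjusted Cramer's V; for a 2 x 2 table the denominator reduces to n.
v_unadj <- sqrt(x2 / (n * (min(dim(observed)) - 1)))
print(round(v_unadj, 2))
```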

Yucheng Zhu

2025-01-16