MATH1324 Introduction to Statistics Assignment 2

Airbnb prices in Amsterdam

Manali Kuinthodu Amrithraj S3963300, Sakule Ankuli Nanda S3961237

Last updated: 18 May, 2023

Introduction

Problem Statement

Data

  1. realSum (the overall cost of the listing on Airbnb)
  2. room_type (the type of accommodation provided)
  3. room_shared (whether or not the room is shared)
  4. room_private (whether or not the room is private)
  5. person_capacity (the number of guests who can stay in the room together)
  6. host_is_superhost (whether or not the host is a superhost)
  7. multi (whether or not there are multiple rooms listed)
  8. biz (whether or not the listing is for commercial purposes)
  9. cleanliness_rating (the listing’s cleanliness score)
  10. guest_satisfaction_overall (the listing’s overall guest satisfaction score)

Data Cont.

  1. bedrooms (the listing’s number of bedrooms)
  2. dist (the distance from the city centre)
  3. metro_dist (the distance from the closest metro stop)
  4. attr_index
  5. attr_index_norm
  6. rest_index
  7. rest_index_norm
  8. lng (the listing’s longitude)
  9. lat (the listing’s latitude)

Data Pre-processing

* The read_csv() function from the readr package was used to import the dataset, and the head() function was used to display the first five rows.

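library(tidyverse)  # provides readr (read_csv), dplyr (%>%, filter) and ggplot2
library(car)        # provides qqPlot() and leveneTest()
amsterdam_weekdays <- read_csv("/Users/manalika/Desktop/Applied Analytics/amsterdam_weekdays.csv")
knitr::kable(head(amsterdam_weekdays, 5))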
| ...1 | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 194.0337 | Private room | FALSE | TRUE | 2 | FALSE | 1 | 0 | 10 | 93 | 1 | 5.0229638 | 2.5393800 | 78.69038 | 4.166708 | 98.25390 | 6.846473 | 4.90569 | 52.41772 |
| 1 | 344.2458 | Private room | FALSE | TRUE | 4 | FALSE | 0 | 0 | 8 | 85 | 1 | 0.4883893 | 0.2394039 | 631.17638 | 33.421209 | 837.28076 | 58.342928 | 4.90005 | 52.37432 |
| 2 | 264.1014 | Private room | FALSE | TRUE | 2 | FALSE | 0 | 1 | 9 | 87 | 1 | 5.7483119 | 3.6516213 | 75.27588 | 3.985908 | 95.38695 | 6.646700 | 4.97512 | 52.36103 |
| 3 | 433.5294 | Private room | FALSE | TRUE | 4 | FALSE | 0 | 1 | 9 | 90 | 2 | 0.3848620 | 0.4398761 | 493.27253 | 26.119108 | 875.03310 | 60.973565 | 4.89417 | 52.37663 |
| 4 | 485.5529 | Private room | FALSE | TRUE | 2 | TRUE | 0 | 0 | 10 | 98 | 1 | 0.5447382 | 0.3186926 | 552.83032 | 29.272733 | 815.30574 | 56.811677 | 4.90051 | 52.37508 |

Data Pre-processing Cont.

* The str() function was used to check the data type of each variable.

str(amsterdam_weekdays)
## spc_tbl_ [1,103 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1                      : num [1:1103] 0 1 2 3 4 5 6 7 8 9 ...
##  $ realSum                   : num [1:1103] 194 344 264 434 486 ...
##  $ room_type                 : chr [1:1103] "Private room" "Private room" "Private room" "Private room" ...
##  $ room_shared               : logi [1:1103] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ room_private              : logi [1:1103] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ person_capacity           : num [1:1103] 2 4 2 4 2 3 2 4 4 2 ...
##  $ host_is_superhost         : logi [1:1103] FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ multi                     : num [1:1103] 1 0 0 0 0 0 0 0 0 1 ...
##  $ biz                       : num [1:1103] 0 0 1 1 0 0 0 0 0 0 ...
##  $ cleanliness_rating        : num [1:1103] 10 8 9 9 10 8 10 10 9 10 ...
##  $ guest_satisfaction_overall: num [1:1103] 93 85 87 90 98 100 94 100 96 88 ...
##  $ bedrooms                  : num [1:1103] 1 1 1 2 1 2 1 3 2 1 ...
##  $ dist                      : num [1:1103] 5.023 0.488 5.748 0.385 0.545 ...
##  $ metro_dist                : num [1:1103] 2.539 0.239 3.652 0.44 0.319 ...
##  $ attr_index                : num [1:1103] 78.7 631.2 75.3 493.3 552.8 ...
##  $ attr_index_norm           : num [1:1103] 4.17 33.42 3.99 26.12 29.27 ...
##  $ rest_index                : num [1:1103] 98.3 837.3 95.4 875 815.3 ...
##  $ rest_index_norm           : num [1:1103] 6.85 58.34 6.65 60.97 56.81 ...
##  $ lng                       : num [1:1103] 4.91 4.9 4.98 4.89 4.9 ...
##  $ lat                       : num [1:1103] 52.4 52.4 52.4 52.4 52.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   realSum = col_double(),
##   ..   room_type = col_character(),
##   ..   room_shared = col_logical(),
##   ..   room_private = col_logical(),
##   ..   person_capacity = col_double(),
##   ..   host_is_superhost = col_logical(),
##   ..   multi = col_double(),
##   ..   biz = col_double(),
##   ..   cleanliness_rating = col_double(),
##   ..   guest_satisfaction_overall = col_double(),
##   ..   bedrooms = col_double(),
##   ..   dist = col_double(),
##   ..   metro_dist = col_double(),
##   ..   attr_index = col_double(),
##   ..   attr_index_norm = col_double(),
##   ..   rest_index = col_double(),
##   ..   rest_index_norm = col_double(),
##   ..   lng = col_double(),
##   ..   lat = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Pre-processing Cont.

# Convert the categorical columns (logical and character) to factors
columns_to_convert <- c('room_shared', 'room_private', 'host_is_superhost', 'room_type')
amsterdam_weekdays[, columns_to_convert] <- lapply(amsterdam_weekdays[, columns_to_convert], factor)
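
As a minimal sketch of what the factor conversion does, the same `lapply(..., factor)` pattern can be tried on a small made-up data frame. The data frame `toy` and its values are hypothetical and only stand in for `amsterdam_weekdays`.

```r
# Hypothetical toy data standing in for amsterdam_weekdays (values are made up)
toy <- data.frame(
  room_type   = c("Private room", "Entire home/apt", "Private room"),
  room_shared = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Same pattern as above: apply factor() to each selected column
cols <- c("room_type", "room_shared")
toy[cols] <- lapply(toy[cols], factor)

str(toy)                # both columns are now factors
levels(toy$room_type)   # levels are the sorted unique values
```

Once the columns are factors, `summary()` reports counts per level instead of a generic character or logical summary, which is what the descriptive output relies on.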

Descriptive Statistics and Visualisation

summary(amsterdam_weekdays)
##       ...1           realSum                 room_type   room_shared 
##  Min.   :   0.0   Min.   : 128.9   Entire home/apt:538   FALSE:1097  
##  1st Qu.: 275.5   1st Qu.: 309.8   Private room   :559   TRUE :   6  
##  Median : 551.0   Median : 430.2   Shared room    :  6               
##  Mean   : 551.0   Mean   : 545.0                                     
##  3rd Qu.: 826.5   3rd Qu.: 657.3                                     
##  Max.   :1102.0   Max.   :7782.9                                     
##  room_private person_capacity host_is_superhost     multi       
##  FALSE:544    Min.   :2.000   FALSE:780         Min.   :0.0000  
##  TRUE :559    1st Qu.:2.000   TRUE :323         1st Qu.:0.0000  
##               Median :2.000                     Median :0.0000  
##               Mean   :2.792                     Mean   :0.3083  
##               3rd Qu.:4.000                     3rd Qu.:1.0000  
##               Max.   :6.000                     Max.   :1.0000  
##       biz         cleanliness_rating guest_satisfaction_overall    bedrooms    
##  Min.   :0.0000   Min.   : 4.000     Min.   : 20.00             Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.: 9.000     1st Qu.: 92.00             1st Qu.:1.000  
##  Median :0.0000   Median :10.000     Median : 96.00             Median :1.000  
##  Mean   :0.1151   Mean   : 9.461     Mean   : 94.36             Mean   :1.283  
##  3rd Qu.:0.0000   3rd Qu.:10.000     3rd Qu.: 98.00             3rd Qu.:2.000  
##  Max.   :1.0000   Max.   :10.000     Max.   :100.00             Max.   :5.000  
##       dist            metro_dist        attr_index      attr_index_norm  
##  Min.   : 0.01506   Min.   :0.03653   Min.   :  40.93   Min.   :  2.167  
##  1st Qu.: 1.30206   1st Qu.:0.46298   1st Qu.: 127.91   1st Qu.:  6.773  
##  Median : 2.34137   Median :0.85601   Median : 208.18   Median : 11.023  
##  Mean   : 2.84162   Mean   :1.08944   Mean   : 271.01   Mean   : 14.350  
##  3rd Qu.: 3.64814   3rd Qu.:1.51063   3rd Qu.: 386.44   3rd Qu.: 20.462  
##  Max.   :11.18710   Max.   :4.41191   Max.   :1888.55   Max.   :100.000  
##    rest_index      rest_index_norm        lng             lat       
##  Min.   :  50.88   Min.   :  3.545   Min.   :4.776   Min.   :52.29  
##  1st Qu.: 163.47   1st Qu.: 11.391   1st Qu.:4.871   1st Qu.:52.35  
##  Median : 260.26   Median : 18.135   Median :4.890   Median :52.37  
##  Mean   : 341.54   Mean   : 23.799   Mean   :4.891   Mean   :52.36  
##  3rd Qu.: 469.29   3rd Qu.: 32.701   3rd Qu.:4.907   3rd Qu.:52.38  
##  Max.   :1435.10   Max.   :100.000   Max.   :5.011   Max.   :52.42

Descriptive Statistics Cont.

sum(is.na(amsterdam_weekdays))
## [1] 0
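# Flag values that fall outside the 1.5 * IQR fences
is.outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  lower_fence <- q[1] - 1.5 * IQR(x)
  upper_fence <- q[2] + 1.5 * IQR(x)
  x < lower_fence | x > upper_fence
}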
outliers <- is.outlier(amsterdam_weekdays$guest_satisfaction_overall)
sum(outliers)
## [1] 48
amsterdam_weekdays <- amsterdam_weekdays %>% filter(!outliers)
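
The fence rule can be checked by hand. The summary output reports Q1 = 92 and Q3 = 98 for guest_satisfaction_overall, so the IQR is 6 and the fences are 92 - 1.5 × 6 = 83 and 98 + 1.5 × 6 = 107. The sketch below applies the same rule to a made-up vector, `ratings`, chosen to have the same quartiles. The vector and the helper name `iqr_fences` are illustrative assumptions, not part of the analysis.

```r
# Hypothetical helper: returns the lower and upper 1.5 * IQR fences
iqr_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  spread <- q[2] - q[1]                                   # interquartile range
  c(lower = q[1] - 1.5 * spread, upper = q[2] + 1.5 * spread)
}

# Made-up ratings with Q1 = 92 and Q3 = 98, as in the real variable
ratings <- c(20, 88, 92, 94, 96, 97, 98, 99, 100)

f <- iqr_fences(ratings)
f                                                     # lower = 83, upper = 107
ratings[ratings < f["lower"] | ratings > f["upper"]]  # only 20 is flagged
```

Because ratings cannot exceed 100, only the lower fence can flag anything here, so the filter removes listings rated below 83.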

Descriptive Statistics Cont.

ggplot(amsterdam_weekdays, aes(x = room_type, y = guest_satisfaction_overall)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Room Type", y = "Guest Rating") +
  theme_minimal()

Descriptive Statistics Cont.

amsterdam_weekdays %>%
  group_by(room_type) %>%
  summarise(
    Min = min(guest_satisfaction_overall, na.rm = TRUE),
    Q1 = quantile(guest_satisfaction_overall, probs = 0.25, na.rm = TRUE),
    Median = median(guest_satisfaction_overall, na.rm = TRUE),
    Q3 = quantile(guest_satisfaction_overall, probs = 0.75, na.rm = TRUE),
    Max = max(guest_satisfaction_overall, na.rm = TRUE),
    Mean = mean(guest_satisfaction_overall, na.rm = TRUE),
    SD = sd(guest_satisfaction_overall, na.rm = TRUE),
    n = n(),
    missing = sum(is.na(guest_satisfaction_overall))
  ) -> table1

knitr::kable(table1)
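| room_type | Min | Q1 | Median | Q3 | Max | Mean | SD | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 83 | 94 | 97.0 | 99.00 | 100 | 95.83462 | 3.901429 | 520 | 0 |
| Private room | 83 | 92 | 96.0 | 98.00 | 100 | 94.61626 | 4.259370 | 529 | 0 |
| Shared room | 84 | 91 | 94.5 | 95.75 | 98 | 92.83333 | 5.076088 | 6 | 0 |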

Hypothesis Testing

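* Null hypothesis (H0): There is no significant difference in mean guest satisfaction ratings between different room types.

\[H_0: \mu_1 = \mu_2\]

* Alternative hypothesis (HA): There is a significant difference in mean guest satisfaction ratings between different room types.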

\[H_A: \mu_1 \ne \mu_2\]

Hypothesis Testing - QQ Plot

qqPlot(amsterdam_weekdays$guest_satisfaction_overall, dist = "norm")

## [1] 389 661

Levene’s test

leveneTest(guest_satisfaction_overall ~ room_type, data = amsterdam_weekdays) %>% as.data.frame()
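
To make the test less of a black box, the sketch below reproduces the computation behind Levene's test in base R: take each observation's absolute deviation from its group median (the default centre in `car::leveneTest()`), then run a one-way ANOVA on those deviations. The simulated data frame `toy` and its parameters are assumptions for illustration only.

```r
set.seed(1)

# Simulated scores for three hypothetical groups (not the Airbnb data)
toy <- data.frame(
  score = c(rnorm(30, mean = 95, sd = 4),
            rnorm(30, mean = 94, sd = 4),
            rnorm(10, mean = 93, sd = 5)),
  group = factor(rep(c("A", "B", "C"), times = c(30, 30, 10)))
)

# Absolute deviation of each score from its own group's median
toy$abs_dev <- abs(toy$score - ave(toy$score, toy$group, FUN = median))

# One-way ANOVA on the deviations: the F test for `group` is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
levene_by_hand
```

A small p-value in the `group` row would suggest unequal variances across groups, which is the situation where `var.equal = FALSE` (the Welch test) is the safer choice.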

Two-sample t-tests

We performed two-sample t-tests to compare guest satisfaction scores between room types. Because the room_type column has more than two categories, we split the data into three groups by room type (“Private room”, “Entire home/apt” and “Shared room”) and ran a Welch t-test on each pair of groups.

private_room <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Private room"]
entire_home <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Entire home/apt"]
shared_room <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Shared room"]
t.test(private_room, entire_home,var.equal = FALSE, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  private_room and entire_home
## t = -4.8324, df = 1041.8, p-value = 1.552e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7130881 -0.7236285
## sample estimates:
## mean of x mean of y 
##  94.61626  95.83462
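
The Welch statistic and its degrees of freedom can be recomputed from the group means, standard deviations and sample sizes in the summary table, as a check on the output above. The variable names in this sketch are illustrative, and small differences from the reported values come from rounding in the table.

```r
# Summary statistics copied from the grouped summary table
m1 <- 94.61626; s1 <- 4.259370; n1 <- 529   # Private room
m2 <- 95.83462; s2 <- 3.901429; n2 <- 520   # Entire home/apt

# Squared standard error of each group mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m1 - m2) / sqrt(v1 + v2)
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df), 4)
p_value
```

The results should agree with the reported t = -4.8324 and df = 1041.8 up to rounding.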
t.test(private_room, shared_room,var.equal = FALSE, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  private_room and shared_room
## t = 0.85694, df = 5.0802, p-value = 0.43
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.540044  7.105891
## sample estimates:
## mean of x mean of y 
##  94.61626  92.83333
t.test(entire_home, shared_room,var.equal = FALSE, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  entire_home and shared_room
## t = 1.4434, df = 5.0684, p-value = 0.2077
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.322248  8.324812
## sample estimates:
## mean of x mean of y 
##  95.83462  92.83333
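
The three comparisons above can also be run in a single call. The sketch below uses `pairwise.t.test()` from base R with `pool.sd = FALSE`, which runs a separate Welch test for each pair, and applies a Holm adjustment for multiple comparisons. The adjustment is an assumption added here and is not part of the analysis above; the simulated data frame `toy` is likewise hypothetical.

```r
set.seed(42)

# Simulated scores for three hypothetical room types (not the Airbnb data)
toy <- data.frame(
  score = c(rnorm(40, mean = 96,   sd = 4),
            rnorm(40, mean = 94.5, sd = 4),
            rnorm(6,  mean = 93,   sd = 5)),
  room_type = factor(rep(c("Entire home/apt", "Private room", "Shared room"),
                         times = c(40, 40, 6)))
)

# pool.sd = FALSE gives unpooled (Welch) tests for each pair of groups;
# p.adjust.method = "holm" corrects the p-values for the three comparisons
res <- pairwise.t.test(toy$score, toy$room_type,
                       pool.sd = FALSE, p.adjust.method = "holm")

res$p.value   # lower-triangular matrix of adjusted p-values
```

With the real data, the equivalent call would pass `amsterdam_weekdays$guest_satisfaction_overall` and `amsterdam_weekdays$room_type` as the first two arguments.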

Discussion

Discussion Cont.

References