Manali Kuinthodu Amrithraj S3963300, Sakule Ankuli Nanda S3961237
Last updated: 18 May, 2023
* The read_csv() function from the readr package was used to import the dataset, and the head() function was used to display the first five rows.
amsterdam_weekdays <- read_csv("/Users/manalika/Desktop/Applied Analytics/amsterdam_weekdays.csv")
knitr::kable(head(amsterdam_weekdays, 5))

| ...1 | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 194.0337 | Private room | FALSE | TRUE | 2 | FALSE | 1 | 0 | 10 | 93 | 1 | 5.0229638 | 2.5393800 | 78.69038 | 4.166708 | 98.25390 | 6.846473 | 4.90569 | 52.41772 |
| 1 | 344.2458 | Private room | FALSE | TRUE | 4 | FALSE | 0 | 0 | 8 | 85 | 1 | 0.4883893 | 0.2394039 | 631.17638 | 33.421209 | 837.28076 | 58.342928 | 4.90005 | 52.37432 |
| 2 | 264.1014 | Private room | FALSE | TRUE | 2 | FALSE | 0 | 1 | 9 | 87 | 1 | 5.7483119 | 3.6516213 | 75.27588 | 3.985908 | 95.38695 | 6.646700 | 4.97512 | 52.36103 |
| 3 | 433.5294 | Private room | FALSE | TRUE | 4 | FALSE | 0 | 1 | 9 | 90 | 2 | 0.3848620 | 0.4398761 | 493.27253 | 26.119108 | 875.03310 | 60.973565 | 4.89417 | 52.37663 |
| 4 | 485.5529 | Private room | FALSE | TRUE | 2 | TRUE | 0 | 0 | 10 | 98 | 1 | 0.5447382 | 0.3186926 | 552.83032 | 29.272733 | 815.30574 | 56.811677 | 4.90051 | 52.37508 |
* The str() function was used to check the data type of each variable.
## spc_tbl_ [1,103 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:1103] 0 1 2 3 4 5 6 7 8 9 ...
## $ realSum : num [1:1103] 194 344 264 434 486 ...
## $ room_type : chr [1:1103] "Private room" "Private room" "Private room" "Private room" ...
## $ room_shared : logi [1:1103] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ room_private : logi [1:1103] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ person_capacity : num [1:1103] 2 4 2 4 2 3 2 4 4 2 ...
## $ host_is_superhost : logi [1:1103] FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ multi : num [1:1103] 1 0 0 0 0 0 0 0 0 1 ...
## $ biz : num [1:1103] 0 0 1 1 0 0 0 0 0 0 ...
## $ cleanliness_rating : num [1:1103] 10 8 9 9 10 8 10 10 9 10 ...
## $ guest_satisfaction_overall: num [1:1103] 93 85 87 90 98 100 94 100 96 88 ...
## $ bedrooms : num [1:1103] 1 1 1 2 1 2 1 3 2 1 ...
## $ dist : num [1:1103] 5.023 0.488 5.748 0.385 0.545 ...
## $ metro_dist : num [1:1103] 2.539 0.239 3.652 0.44 0.319 ...
## $ attr_index : num [1:1103] 78.7 631.2 75.3 493.3 552.8 ...
## $ attr_index_norm : num [1:1103] 4.17 33.42 3.99 26.12 29.27 ...
## $ rest_index : num [1:1103] 98.3 837.3 95.4 875 815.3 ...
## $ rest_index_norm : num [1:1103] 6.85 58.34 6.65 60.97 56.81 ...
## $ lng : num [1:1103] 4.91 4.9 4.98 4.89 4.9 ...
## $ lat : num [1:1103] 52.4 52.4 52.4 52.4 52.4 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. realSum = col_double(),
## .. room_type = col_character(),
## .. room_shared = col_logical(),
## .. room_private = col_logical(),
## .. person_capacity = col_double(),
## .. host_is_superhost = col_logical(),
## .. multi = col_double(),
## .. biz = col_double(),
## .. cleanliness_rating = col_double(),
## .. guest_satisfaction_overall = col_double(),
## .. bedrooms = col_double(),
## .. dist = col_double(),
## .. metro_dist = col_double(),
## .. attr_index = col_double(),
## .. attr_index_norm = col_double(),
## .. rest_index = col_double(),
## .. rest_index_norm = col_double(),
## .. lng = col_double(),
## .. lat = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
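The summary() output that follows reports counts per level for room_type, which is how R summarises a factor rather than a character column, so the column was presumably converted before summary() was called. A minimal sketch of such a conversion, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Private room", "Shared room"),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor so summary() counts each level
toy$room_type <- factor(toy$room_type)

summary(toy$room_type)
```

On a character column, summary() would report only the length, class and mode, so the conversion is what makes the per-category counts appear.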
## ...1 realSum room_type room_shared
## Min. : 0.0 Min. : 128.9 Entire home/apt:538 FALSE:1097
## 1st Qu.: 275.5 1st Qu.: 309.8 Private room :559 TRUE : 6
## Median : 551.0 Median : 430.2 Shared room : 6
## Mean : 551.0 Mean : 545.0
## 3rd Qu.: 826.5 3rd Qu.: 657.3
## Max. :1102.0 Max. :7782.9
## room_private person_capacity host_is_superhost multi
## FALSE:544 Min. :2.000 FALSE:780 Min. :0.0000
## TRUE :559 1st Qu.:2.000 TRUE :323 1st Qu.:0.0000
## Median :2.000 Median :0.0000
## Mean :2.792 Mean :0.3083
## 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000
## biz cleanliness_rating guest_satisfaction_overall bedrooms
## Min. :0.0000 Min. : 4.000 Min. : 20.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 9.000 1st Qu.: 92.00 1st Qu.:1.000
## Median :0.0000 Median :10.000 Median : 96.00 Median :1.000
## Mean :0.1151 Mean : 9.461 Mean : 94.36 Mean :1.283
## 3rd Qu.:0.0000 3rd Qu.:10.000 3rd Qu.: 98.00 3rd Qu.:2.000
## Max. :1.0000 Max. :10.000 Max. :100.00 Max. :5.000
## dist metro_dist attr_index attr_index_norm
## Min. : 0.01506 Min. :0.03653 Min. : 40.93 Min. : 2.167
## 1st Qu.: 1.30206 1st Qu.:0.46298 1st Qu.: 127.91 1st Qu.: 6.773
## Median : 2.34137 Median :0.85601 Median : 208.18 Median : 11.023
## Mean : 2.84162 Mean :1.08944 Mean : 271.01 Mean : 14.350
## 3rd Qu.: 3.64814 3rd Qu.:1.51063 3rd Qu.: 386.44 3rd Qu.: 20.462
## Max. :11.18710 Max. :4.41191 Max. :1888.55 Max. :100.000
## rest_index rest_index_norm lng lat
## Min. : 50.88 Min. : 3.545 Min. :4.776 Min. :52.29
## 1st Qu.: 163.47 1st Qu.: 11.391 1st Qu.:4.871 1st Qu.:52.35
## Median : 260.26 Median : 18.135 Median :4.890 Median :52.37
## Mean : 341.54 Mean : 23.799 Mean :4.891 Mean :52.36
## 3rd Qu.: 469.29 3rd Qu.: 32.701 3rd Qu.:4.907 3rd Qu.:52.38
## Max. :1435.10 Max. :100.000 Max. :5.011 Max. :52.42
## [1] 0
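# Flag values that fall outside the 1.5 * IQR fences
is.outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  lower_fence <- q[1] - 1.5 * IQR(x)
  upper_fence <- q[2] + 1.5 * IQR(x)
  x < lower_fence | x > upper_fence
}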
outliers <- is.outlier(amsterdam_weekdays$guest_satisfaction_overall)
sum(outliers)
## [1] 48
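To make the fences concrete, the arithmetic can be worked through with the quartiles that summary() reported for guest_satisfaction_overall (1st Qu. 92, 3rd Qu. 98). This is only an illustrative sketch; the variable names are not from the original analysis:

```r
# Quartiles of guest_satisfaction_overall, copied from the summary() output
q1 <- 92
q3 <- 98

iqr <- q3 - q1                  # interquartile range: 6
lower_fence <- q1 - 1.5 * iqr   # 92 - 9 = 83
upper_fence <- q3 + 1.5 * iqr   # 98 + 9 = 107

c(lower = lower_fence, upper = upper_fence)
```

Because ratings cannot exceed 100, only the lower fence can flag anything here: the 48 flagged listings are those rated below 83. The grouped summary table reports a minimum of 83 and group sizes that total 1,055 (1,103 minus 48), which suggests the flagged rows were dropped before that table was built.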
ggplot(amsterdam_weekdays, aes(x = room_type, y = guest_satisfaction_overall)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Room Type", y = "Guest Rating") +
  theme_minimal()

amsterdam_weekdays %>%
  group_by(room_type) %>%
  summarise(
    Min = min(guest_satisfaction_overall, na.rm = TRUE),
    Q1 = quantile(guest_satisfaction_overall, probs = 0.25, na.rm = TRUE),
    Median = median(guest_satisfaction_overall, na.rm = TRUE),
    Q3 = quantile(guest_satisfaction_overall, probs = 0.75, na.rm = TRUE),
    Max = max(guest_satisfaction_overall, na.rm = TRUE),
    Mean = mean(guest_satisfaction_overall, na.rm = TRUE),
    SD = sd(guest_satisfaction_overall, na.rm = TRUE),
    n = n(),
    missing = sum(is.na(guest_satisfaction_overall))
  ) -> table1
knitr::kable(table1)

| room_type | Min | Q1 | Median | Q3 | Max | Mean | SD | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Entire home/apt | 83 | 94 | 97.0 | 99.00 | 100 | 95.83462 | 3.901429 | 520 | 0 |
| Private room | 83 | 92 | 96.0 | 98.00 | 100 | 94.61626 | 4.259370 | 529 | 0 |
| Shared room | 84 | 91 | 94.5 | 95.75 | 98 | 92.83333 | 5.076088 | 6 | 0 |
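* Null hypothesis (H0): The mean guest satisfaction rating is the same for the two room types being compared.
\[H_0: \mu_1 = \mu_2\]
* Alternative hypothesis (HA): The mean guest satisfaction rating differs between the two room types being compared.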
\[H_A: \mu_1 \ne \mu_2\]
## [1] 389 661
We performed two-sample t-tests to compare guest satisfaction scores between room types. Because the room_type column has more than two categories, we split the data into three groups ("Private room", "Entire home/apt", and "Shared room") and ran pairwise Welch t-tests between them.
private_room <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Private room"]
entire_home <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Entire home/apt"]
shared_room <- amsterdam_weekdays$guest_satisfaction_overall[amsterdam_weekdays$room_type == "Shared room"]
t.test(private_room, entire_home, var.equal = FALSE, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: private_room and entire_home
## t = -4.8324, df = 1041.8, p-value = 1.552e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.7130881 -0.7236285
## sample estimates:
## mean of x mean of y
## 94.61626 95.83462
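As a sanity check, the statistic of this first test can be reproduced by hand from the group summaries in the table above. The sketch assumes that the table's means, standard deviations and counts describe the same vectors passed to t.test(); the variable names are illustrative only:

```r
# Group summaries copied from the summary table (Private room, Entire home/apt)
mean_private <- 94.61626; sd_private <- 4.259370; n_private <- 529
mean_entire  <- 95.83462; sd_entire  <- 3.901429; n_entire  <- 520

# Per-group variance of the sample mean
v_private <- sd_private^2 / n_private
v_entire  <- sd_entire^2 / n_entire

# Welch t statistic: difference in means over its standard error
t_stat <- (mean_private - mean_entire) / sqrt(v_private + v_entire)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_private + v_entire)^2 /
  (v_private^2 / (n_private - 1) + v_entire^2 / (n_entire - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

The results should come out close to the reported t = -4.8324 and df = 1041.8, with any small differences due to rounding in the tabulated inputs.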
##
## Welch Two Sample t-test
##
## data: private_room and shared_room
## t = 0.85694, df = 5.0802, p-value = 0.43
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.540044 7.105891
## sample estimates:
## mean of x mean of y
## 94.61626 92.83333
##
## Welch Two Sample t-test
##
## data: entire_home and shared_room
## t = 1.4434, df = 5.0684, p-value = 0.2077
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.322248 8.324812
## sample estimates:
## mean of x mean of y
## 95.83462 92.83333
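The three pairwise comparisons above can also be run in a single call with pairwise.t.test() from the stats package. The sketch below uses simulated data (the `toy` data frame and its values are hypothetical, not the Airbnb dataset). Setting pool.sd = FALSE makes each comparison a separate Welch test, as in the tests above, and p.adjust.method = "none" leaves the p-values unadjusted so that they match individual t.test() calls; a correction such as "holm" could be substituted.

```r
set.seed(1)

# Hypothetical ratings for three room types with unequal group sizes
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"),
                  times = c(30, 30, 6)),
  rating = c(rnorm(30, mean = 96, sd = 4),
             rnorm(30, mean = 95, sd = 4),
             rnorm(6,  mean = 93, sd = 5))
)

# All pairwise Welch t-tests, two-sided, without multiplicity adjustment
res <- pairwise.t.test(toy$rating, toy$room_type,
                       p.adjust.method = "none",
                       pool.sd = FALSE)

res$p.value  # lower-triangular matrix of pairwise p-values
```

Each cell of res$p.value corresponds to one pair of room types, so the matrix gives the same information as three separate t.test() calls in a more compact form.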