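Presenters: 何怡姿, 陳翰新, 姚冠豪, 陳妮葳
The travel dataset provides detailed information about travelers' trips, including destination, travel dates, trip duration, traveler demographics (name, age, gender, and nationality), and the type and cost of accommodation and transportation. It can be used to study the travel patterns, preferences, and behavior of different kinds of travelers. It can also help travel-related businesses such as travel agencies design tailored marketing strategies and travel packages that meet the needs and preferences of different travelers.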
★References
●https://www.kaggle.com/datasets?search=image
●file:///C:/Users/Howard/Downloads/2.ggplot%20(2).html
●https://joe11051105.gitbooks.io/r_basic/content/arrange_data/merge_and_subsetting.html
●https://yijutseng.github.io/DataScienceRBook/manipulation.html
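★Traveler Trip Data
※This report analyzes the traveler trip dataset.
●Variables in the traveler trip dataset
Trip.ID: trip identifier
Destination: destination
Start.date: start date
End.date: end date
Duration (days): trip length in days
Traveler.name: traveler name
Traveler.age: traveler age
Traveler.gender: traveler gender
Traveler.nationality: traveler nationality
Accommodation.type: accommodation type
Accommodation.cost: accommodation cost
Transportation.type: transportation type
Transportation.cost: transportation cost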
○View the first five rows
library(readr)
## Warning: package 'readr' was built under R version 4.2.2
Travel_details_dataset <- read_csv("C:/Users/Howard/OneDrive/桌面/Travel details dataset.csv")
## Rows: 139 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Destination, Start date, End date, Traveler name, Traveler gender,...
## dbl (3): Trip ID, Duration (days), Traveler age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(Travel_details_dataset, n = 5,width = Inf)
## # A tibble: 139 × 13
## `Trip ID` Destination `Start date` `End date` `Duration (days)`
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 London, UK 5/1/2023 5/8/2023 7
## 2 2 Phuket, Thailand 6/15/2023 6/20/2023 5
## 3 3 Bali, Indonesia 7/1/2023 7/8/2023 7
## 4 4 New York, USA 8/15/2023 8/29/2023 14
## 5 5 Tokyo, Japan 9/10/2023 9/17/2023 7
## `Traveler name` `Traveler age` `Traveler gender` `Traveler nationality`
## <chr> <dbl> <chr> <chr>
## 1 John Smith 35 Male American
## 2 Jane Doe 28 Female Canadian
## 3 David Lee 45 Male Korean
## 4 Sarah Johnson 29 Female British
## 5 Kim Nguyen 26 Female Vietnamese
## `Accommodation type` `Accommodation cost` `Transportation type`
## <chr> <chr> <chr>
## 1 Hotel 1200 Flight
## 2 Resort 800 Flight
## 3 Villa 1000 Flight
## 4 Hotel 2000 Flight
## 5 Airbnb 700 Train
## `Transportation cost`
## <chr>
## 1 600
## 2 500
## 3 700
## 4 1000
## 5 200
## # … with 134 more rows
#Summary statistics
summary(Travel_details_dataset)
## Trip ID Destination Start date End date
## Min. : 1.0 Length:139 Length:139 Length:139
## 1st Qu.: 35.5 Class :character Class :character Class :character
## Median : 70.0 Mode :character Mode :character Mode :character
## Mean : 70.0
## 3rd Qu.:104.5
## Max. :139.0
##
## Duration (days) Traveler name Traveler age Traveler gender
## Min. : 5.000 Length:139 Min. :20.00 Length:139
## 1st Qu.: 7.000 Class :character 1st Qu.:28.00 Class :character
## Median : 7.000 Mode :character Median :31.00 Mode :character
## Mean : 7.606 Mean :33.18
## 3rd Qu.: 8.000 3rd Qu.:38.00
## Max. :14.000 Max. :60.00
## NA's :2 NA's :2
## Traveler nationality Accommodation type Accommodation cost Transportation type
## Length:139 Length:139 Length:139 Length:139
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Transportation cost
## Length:139
## Class :character
## Mode :character
##
##
##
##
#Data structure
str(Travel_details_dataset)
## spc_tbl_ [139 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Trip ID : num [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## $ Destination : chr [1:139] "London, UK" "Phuket, Thailand" "Bali, Indonesia" "New York, USA" ...
## $ Start date : chr [1:139] "5/1/2023" "6/15/2023" "7/1/2023" "8/15/2023" ...
## $ End date : chr [1:139] "5/8/2023" "6/20/2023" "7/8/2023" "8/29/2023" ...
## $ Duration (days) : num [1:139] 7 5 7 14 7 5 10 7 7 7 ...
## $ Traveler name : chr [1:139] "John Smith" "Jane Doe" "David Lee" "Sarah Johnson" ...
## $ Traveler age : num [1:139] 35 28 45 29 26 42 33 25 31 39 ...
## $ Traveler gender : chr [1:139] "Male" "Female" "Male" "Female" ...
## $ Traveler nationality: chr [1:139] "American" "Canadian" "Korean" "British" ...
## $ Accommodation type : chr [1:139] "Hotel" "Resort" "Villa" "Hotel" ...
## $ Accommodation cost : chr [1:139] "1200" "800" "1000" "2000" ...
## $ Transportation type : chr [1:139] "Flight" "Flight" "Flight" "Flight" ...
## $ Transportation cost : chr [1:139] "600" "500" "700" "1000" ...
## - attr(*, "spec")=
## .. cols(
## .. `Trip ID` = col_double(),
## .. Destination = col_character(),
## .. `Start date` = col_character(),
## .. `End date` = col_character(),
## .. `Duration (days)` = col_double(),
## .. `Traveler name` = col_character(),
## .. `Traveler age` = col_double(),
## .. `Traveler gender` = col_character(),
## .. `Traveler nationality` = col_character(),
## .. `Accommodation type` = col_character(),
## .. `Accommodation cost` = col_character(),
## .. `Transportation type` = col_character(),
## .. `Transportation cost` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
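The `str()` output shows that both cost columns and both date columns are stored as `chr`. A minimal sketch of converting them to numeric and `Date` values is given here; the toy rows, and the assumption that some cost strings carry a currency symbol, are hypothetical and not taken from the CSV file.

```r
library(readr)
library(lubridate)

# Toy rows (hypothetical), shaped like two of the columns listed by str()
toy <- data.frame(
  `Start date` = c("5/1/2023", "6/15/2023"),
  `Accommodation cost` = c("1200", "$900 "),
  check.names = FALSE
)

# parse_number() keeps the first number it finds and drops the characters around it
toy$`Accommodation cost` <- parse_number(toy$`Accommodation cost`)

# mdy() reads month/day/year strings as Date values
toy$`Start date` <- mdy(toy$`Start date`)

str(toy)
```

With the columns typed this way, `summary()` would report quartiles for the costs instead of only their length and class.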
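○Count male and female travelers, their total, and plot the proportions
# Gender column; sum() over a logical vector counts the TRUE values
x = Travel_details_dataset$`Traveler gender`
男性 = sum(x == "Male", na.rm = TRUE)    # number of male travelers
女性 = sum(x == "Female", na.rm = TRUE)  # number of female travelers
總和 = 男性 + 女性                        # total with a recorded gender
cbind(男性, 女性, 總和)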
## 男性 女性 總和
## [1,] 67 70 137
#Proportion chart
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.2.2
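data = data.frame(性别 = c("Male", "Female"), 人数 = c(男性, 女性))
# Compute each group's percentage of the total
data$比例 = paste0(round(data$人数 / 總和 * 100, 1), "%")
# Draw the proportion chart (a stacked bar in polar coordinates, i.e. a pie chart)
ggplot(data, aes(x = "", y = 人数, fill = 性别)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, title = "Proportion of male and female travelers") +
  # Named colors, so each color is tied to its label rather than to sort order
  scale_fill_manual(values = c("Male" = "blue", "Female" = "pink")) +
  theme_minimal() +
  geom_text(aes(label = 比例),
            position = position_stack(vjust = 0.5),
            color = "white", size = 5)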
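The counts and percentages behind the chart can also be obtained in one step with base R's `table()` and `prop.table()`. This is only a sketch on a hypothetical toy vector, not the dataset's gender column.

```r
# Toy gender values (hypothetical), including one missing entry
gender <- c("Male", "Female", "Female", NA, "Male", "Female")

counts <- table(gender)                        # table() leaves NA out by default
shares <- round(prop.table(counts) * 100, 1)   # percentage of the non-missing total

counts
shares
```

Because `table()` skips missing values, the percentages are relative to travelers with a recorded gender, which matches how the total is computed in the report.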
○Filter rows whose destination is London, UK
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Travel_details_dataset%>%
filter(Destination=="London, UK")
## # A tibble: 3 × 13
## `Trip ID` Destination Start …¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 London, UK 5/1/2023 5/8/20… 7 John S… 35 Male Americ…
## 2 44 London, UK 3/5/2023 3/12/2… 7 Peter … 55 Male British
## 3 56 London, UK 3/15/20… 3/23/2… 8 Ben Sm… 35 Male British
## # … with 4 more variables: `Accommodation type` <chr>,
## # `Accommodation cost` <chr>, `Transportation type` <chr>,
## # `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## # ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## # ⁶`Traveler gender`, ⁷`Traveler nationality`
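An exact match on `"London, UK"` returns only rows spelled that way, while the listing of trips of 10 days or more in this report contains a row whose Destination is simply `London`. A hedged sketch of a looser match with `stringr::str_detect()` follows; the toy rows are hypothetical.

```r
library(dplyr)
library(stringr)

# Toy rows (hypothetical) with two spellings of the same city
toy <- tibble(
  `Trip ID` = c(1, 44, 89, 5),
  Destination = c("London, UK", "London, UK", "London", "Tokyo, Japan")
)

# fixed() matches the literal text "London" anywhere in the string
london <- toy %>%
  filter(str_detect(Destination, fixed("London")))

london
```

A substring match like this would also pick up any other destination containing "London", so the result is worth checking by eye.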
○Find travelers whose trip lasted 10 days or more
#Method 1: call filter() directly
filter(Travel_details_dataset,`Duration (days)`>=10)
## # A tibble: 17 × 13
## `Trip ID` Destination Start…¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 4 New York, … 8/15/2… 8/29/2… 14 Sarah … 29 Female British
## 2 7 Sydney, Au… 11/20/… 11/30/… 10 Emily … 33 Female Austra…
## 3 18 Bali 8/15/2… 8/25/2… 10 Michae… 28 Male Chinese
## 4 20 Tokyo 10/5/2… 10/15/… 10 Kenji … 45 Male Japane…
## 5 31 Australia 8/20/2… 9/2/20… 13 Emma D… 28 Female British
## 6 35 Mexico 1/5/20… 1/15/2… 10 James … 42 Male British
## 7 51 Tokyo, Jap… 10/10/… 10/20/… 10 David … 25 Male Americ…
## 8 85 Tokyo 7/1/20… 7/10/2… 10 Sarah … 28 Female Korean
## 9 86 Bali 8/10/2… 8/20/2… 11 Maria … 42 Female Spanish
## 10 89 London 11/20/… 11/30/… 11 James … 29 Male British
## 11 92 Rome 3/10/2… 3/20/2… 11 Giulia… 30 Female Italian
## 12 93 Bali 4/15/2… 4/25/2… 11 Putra … 33 Male Indone…
## 13 94 Seoul 5/1/20… 5/10/2… 10 Kim Mi… 27 Female Korean
## 14 119 Sydney, Aus 5/1/20… 5/12/2… 11 Cindy … 26 Female Chinese
## 15 121 Bali, Indo… 7/20/2… 7/30/2… 10 Emily … 29 Female Korean
## 16 123 Athens, Gr… 9/20/2… 9/30/2… 10 Gina L… 35 Female Korean
## 17 125 Sydney, Aus 11/11/… 11/21/… 10 Isabel… 30 Female Chinese
## # … with 4 more variables: `Accommodation type` <chr>,
## # `Accommodation cost` <chr>, `Transportation type` <chr>,
## # `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## # ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## # ⁶`Traveler gender`, ⁷`Traveler nationality`
#Method 2: the same filter written with the pipe operator
Travel_details_dataset%>%
filter(`Duration (days)`>=10)
## # A tibble: 17 × 13
## `Trip ID` Destination Start…¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 4 New York, … 8/15/2… 8/29/2… 14 Sarah … 29 Female British
## 2 7 Sydney, Au… 11/20/… 11/30/… 10 Emily … 33 Female Austra…
## 3 18 Bali 8/15/2… 8/25/2… 10 Michae… 28 Male Chinese
## 4 20 Tokyo 10/5/2… 10/15/… 10 Kenji … 45 Male Japane…
## 5 31 Australia 8/20/2… 9/2/20… 13 Emma D… 28 Female British
## 6 35 Mexico 1/5/20… 1/15/2… 10 James … 42 Male British
## 7 51 Tokyo, Jap… 10/10/… 10/20/… 10 David … 25 Male Americ…
## 8 85 Tokyo 7/1/20… 7/10/2… 10 Sarah … 28 Female Korean
## 9 86 Bali 8/10/2… 8/20/2… 11 Maria … 42 Female Spanish
## 10 89 London 11/20/… 11/30/… 11 James … 29 Male British
## 11 92 Rome 3/10/2… 3/20/2… 11 Giulia… 30 Female Italian
## 12 93 Bali 4/15/2… 4/25/2… 11 Putra … 33 Male Indone…
## 13 94 Seoul 5/1/20… 5/10/2… 10 Kim Mi… 27 Female Korean
## 14 119 Sydney, Aus 5/1/20… 5/12/2… 11 Cindy … 26 Female Chinese
## 15 121 Bali, Indo… 7/20/2… 7/30/2… 10 Emily … 29 Female Korean
## 16 123 Athens, Gr… 9/20/2… 9/30/2… 10 Gina L… 35 Female Korean
## 17 125 Sydney, Aus 11/11/… 11/21/… 10 Isabel… 30 Female Chinese
## # … with 4 more variables: `Accommodation type` <chr>,
## # `Accommodation cost` <chr>, `Transportation type` <chr>,
## # `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## # ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## # ⁶`Traveler gender`, ⁷`Traveler nationality`
○Compute the total of accommodation and transportation costs for trips abroad
library(dplyr)
# Drop rows that contain any missing value
Travel_details_dataset2=na.omit(Travel_details_dataset)
a=Travel_details_dataset2$`Accommodation cost`
b=Travel_details_dataset2$`Transportation cost`
# Remove punctuation characters and the "USD" label so the costs can be converted to numbers
clean_a=gsub("[[:punct:]]","",a)
clean_a1=gsub("USD","",clean_a)
clean_b=gsub("[[:punct:]]","",b)
clean_b1=gsub("USD","",clean_b)
住宿費=na.omit(as.numeric(clean_a1))  # accommodation cost
交通費=na.omit(as.numeric(clean_b1))  # transportation cost
Travel_details_dataset2%>%
  select(Destination ,`Duration (days)`,`Accommodation cost`,`Transportation cost`)%>%
  mutate(
    "Allfees"=住宿費+交通費,
    # Total converted to New Taiwan dollars at a rate of 30.72
    "Convert Taiwan Dollar"=(住宿費+交通費)*30.72
  )
## # A tibble: 136 × 6
## Destination `Duration (days)` Accom…¹ Trans…² Allfees Conve…³
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 London, UK 7 1200 600 1800 55296
## 2 Phuket, Thailand 5 800 500 1300 39936
## 3 Bali, Indonesia 7 1000 700 1700 52224
## 4 New York, USA 14 2000 1000 3000 92160
## 5 Tokyo, Japan 7 700 200 900 27648
## 6 Paris, France 5 1500 800 2300 70656
## 7 Sydney, Australia 10 500 1200 1700 52224
## 8 Rio de Janeiro, Brazil 7 900 600 1500 46080
## 9 Amsterdam, Netherlands 7 1200 200 1400 43008
## 10 Dubai, United Arab Emirates 7 2500 800 3300 101376
## # … with 126 more rows, and abbreviated variable names ¹`Accommodation cost`,
## # ²`Transportation cost`, ³`Convert Taiwan Dollar`
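The same totals can be computed inside `mutate()` itself, which keeps every cost aligned with its own row. The sketch below uses `readr::parse_number()` on hypothetical toy rows; the cost string formats and the name `usd_to_twd` are assumptions for illustration, and only the rate 30.72 comes from the report.

```r
library(dplyr)
library(readr)

# Toy rows (hypothetical); the cost formats are assumed, not taken from the file
toy <- tibble(
  Destination = c("London, UK", "Paris, France"),
  `Accommodation cost` = c("1200", "$1,500 USD"),
  `Transportation cost` = c("600", "800 USD")
)

usd_to_twd <- 30.72  # conversion rate used in the report

totals <- toy %>%
  mutate(
    # parse_number() drops currency symbols, grouping commas and trailing text
    Allfees = parse_number(`Accommodation cost`) + parse_number(`Transportation cost`),
    `Convert Taiwan Dollar` = Allfees * usd_to_twd
  )

totals
```

Working row by row avoids the length mismatch that would occur if a standalone vector were shortened by `na.omit()` before being added back to the table.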
○Find the transportation types and plot the share of each
#Find the transportation types
x=table(Travel_details_dataset$`Transportation type`)
x1=na.omit(x)
#Plot (vertical bar chart)
ggplot(Travel_details_dataset, aes(x = `Transportation type`)) +
  stat_count()
#Plot (horizontal bar chart)
ggplot(Travel_details_dataset, aes(x = `Transportation type`)) +
  geom_bar() + coord_flip()
#Plot (pie chart)
Travel_details_dataset_Transportation=table(Travel_details_dataset$`Transportation type`)
Travel_details_dataset_Transportation1=as.data.frame(Travel_details_dataset_Transportation)
labels2 = paste(names(Travel_details_dataset_Transportation), "\n(", round((prop.table(Travel_details_dataset_Transportation))*100,2), "%)", sep = "")
ggplot(Travel_details_dataset_Transportation1,aes(x="",y=Freq,fill=Var1))+
geom_bar(width = 1, stat = "identity") +
coord_polar("y") +
geom_text(aes(x=1.7,label=labels2),
position = position_stack(vjust = 0.5))+
theme_void()
#Pie chart 2
labels=Travel_details_dataset$`Transportation type`
piepercent<- paste(round(100*x/sum(x1), 2), "%")
pie(x1, labels2, main = "Means of transportation", radius=1.05,lty=1,col = rainbow(length(x1)))
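The share of each transportation type can also be tabulated directly, which makes the numbers behind the charts easy to read. A minimal sketch with `dplyr::count()` on a hypothetical toy column:

```r
library(dplyr)

# Toy column (hypothetical), including one missing entry
toy <- tibble(
  `Transportation type` = c("Flight", "Train", "Flight", "Bus", "Flight", NA)
)

shares <- toy %>%
  filter(!is.na(`Transportation type`)) %>%    # leave out missing types
  count(`Transportation type`) %>%             # one row per type, count in column n
  mutate(percent = round(n / sum(n) * 100, 2))

shares
```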
○Compute the mean traveler age
mean(Travel_details_dataset2$`Traveler age`)
## [1] 33.11765
○Plot the traveler age curve
summary(Travel_details_dataset2$`Traveler age`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 28.00 31.00 33.12 37.25 60.00
# Build the example data
age <- c(20, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 50, 55, 60)
count <- c(1, 1, 4, 7, 6, 12, 11, 7, 11, 4, 9, 2, 10, 1, 4, 4, 3, 1, 5, 7, 1, 8, 1, 1, 1, 1, 1, 1)
# Convert the data to a data frame
df <- data.frame(age, count)
# Set the interval boundaries
intervals <- c(20, 24, 29, 34, 39, 44, 49, 54, 59, 64)
# Use cut() to assign each age to a group
df$group <- cut(df$age, breaks = intervals, labels = c("20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64"), include.lowest = TRUE)
# Count the travelers in each group
group_counts <- aggregate(count ~ group, data = df, FUN = sum)
# Draw the bar chart and line chart, and add count labels
ggplot() +
  geom_bar(data = group_counts, aes(x = group, y = count), stat = "identity", fill = "blue", width = 0.5) +
  geom_line(data = group_counts, aes(x = group, y = count, group = 1), color = "red") +
  geom_text(data = group_counts, aes(x = group, y = count, label = count), vjust = -0.5, size = 3) +
  xlab("Age group") +
  ylab("Number of travelers") +
  ggtitle("Number of travelers by age group") +
  theme_minimal()
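The age and count vectors above are typed in by hand. As a hedged alternative, the group counts could be derived from an age column with `cut()` and `table()`; the sketch below uses a hypothetical toy vector in place of the dataset's age column.

```r
# Toy ages (hypothetical), including one missing value
ages <- c(20, 24, 25, 29, 31, 35, 35, 42, 60, NA)

breaks <- c(20, 24, 29, 34, 39, 44, 49, 54, 59, 64)
labels <- c("20-24", "25-29", "30-34", "35-39", "40-44",
            "45-49", "50-54", "55-59", "60-64")

# Intervals are closed on the right; include.lowest = TRUE keeps age 20 in the first group
groups <- cut(ages, breaks = breaks, labels = labels, include.lowest = TRUE)

# table() skips NA and keeps empty groups; the result has columns group and count
group_counts <- as.data.frame(table(group = groups), responseName = "count")

group_counts
```

The resulting data frame has the same two columns as the hand-built one, so it could be passed to the same plotting code.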
○Correlation analysis of the numeric variables: Duration (days), Traveler age, Accommodation cost, Transportation cost
# as.numeric() returns NA for cost strings that are not plain numbers, which triggers the warnings below
x<- data.frame(
  Duration = as.numeric(Travel_details_dataset$`Duration (days)`),
  Age = as.numeric(Travel_details_dataset$`Traveler age`),
  AccommodationCost = as.numeric(Travel_details_dataset$`Accommodation cost`),
  TransportationCost = as.numeric(Travel_details_dataset$`Transportation cost`)
)
## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion
## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion
x<- na.omit(x)
# Compute the correlation coefficients
cor_matrix <- cor(x)
cor_matrix
## Duration Age AccommodationCost TransportationCost
## Duration 1.0000000 -0.14540030 -0.11345104 0.09141530
## Age -0.1454003 1.00000000 0.02323159 -0.04162607
## AccommodationCost -0.1134510 0.02323159 1.00000000 0.82672834
## TransportationCost 0.0914153 -0.04162607 0.82672834 1.00000000
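The coercion warnings mean that cost strings containing symbols or text became `NA` and their rows were dropped by `na.omit()`. One possible way to keep those rows is to strip the non-numeric characters before converting. This is a sketch on hypothetical toy values, and the string formats are assumptions.

```r
# Toy data (hypothetical): costs stored as text in mixed formats
toy <- data.frame(
  Duration = c(7, 5, 7, 14, 7),
  AccommodationCost = c("1200", "$800", "1,000 USD", "2000", NA),
  stringsAsFactors = FALSE
)

# Keep only digits and the decimal point, then convert; a missing value stays NA
toy$AccommodationCost <- as.numeric(gsub("[^0-9.]", "", toy$AccommodationCost))

# use = "complete.obs" leaves out rows with a missing value when computing correlations
cor_toy <- cor(toy, use = "complete.obs")
cor_toy
```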
★Test whether trip duration (Duration (days)) and accommodation cost (Accommodation cost) differ significantly.
data1<- data.frame(
Duration = as.numeric(Travel_details_dataset$`Duration (days)`),
AccommodationCost = as.numeric(Travel_details_dataset$`Accommodation cost`)
)
## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion
missing_values <- is.na(data1$Duration) | is.na(data1$AccommodationCost)
data1 <- data1[!missing_values, ]
result <- t.test(data1$Duration, data1$AccommodationCost)
result
##
## Welch Two Sample t-test
##
## data: data1$Duration and data1$AccommodationCost
## t = -8.2956, df = 71, p-value = 4.732e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2065.064 -1264.713
## sample estimates:
## mean of x mean of y
## 7.333333 1672.222222
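The Welch test compares the mean of a day count with the mean of a cost, two quantities measured in different units. If the question is whether longer trips tend to go with higher accommodation costs, a test of association is one possible alternative. The sketch below uses base R's `cor.test()` on hypothetical toy values, not the report's data.

```r
# Toy values (hypothetical): trip length in days and accommodation cost
duration <- c(7, 5, 7, 14, 7, 10, 8, 6)
cost <- c(1200, 800, 1000, 2000, 700, 1500, 900, 600)

# Pearson correlation test: the null hypothesis is zero correlation
assoc <- cor.test(duration, cost)

assoc$estimate  # sample correlation coefficient
assoc$p.value   # p-value of the test
```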
★○Linear regression analysis
model <- lm(data1$Duration~data1$AccommodationCost,data1)
summary(model)
##
## Call:
## lm(formula = data1$Duration ~ data1$AccommodationCost, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4794 -0.4596 -0.3802 0.6124 6.6992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.499e+00 2.552e-01 29.385 <2e-16 ***
## data1$AccommodationCost -9.924e-05 1.073e-04 -0.925 0.358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.54 on 70 degrees of freedom
## Multiple R-squared: 0.01207, Adjusted R-squared: -0.002043
## F-statistic: 0.8552 on 1 and 70 DF, p-value: 0.3583
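When the formula names the columns and the data frame is passed through `data =`, the coefficient labels are shorter and `predict()` can be applied to new rows. A minimal sketch on hypothetical toy data; `model_toy` is an illustrative name, not an object from the report.

```r
# Toy data (hypothetical)
toy <- data.frame(
  Duration = c(7, 5, 7, 14, 7, 10),
  AccommodationCost = c(1200, 800, 1000, 2000, 700, 1500)
)

# Column names go in the formula; the data frame is supplied through data =
model_toy <- lm(Duration ~ AccommodationCost, data = toy)

coef(model_toy)

# Predicted duration for a new accommodation cost of 1000
pred <- predict(model_toy, newdata = data.frame(AccommodationCost = 1000))
pred
```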
End of report
Thank you for reading