Final Report: Travel Data Analysis

Presenters: 何怡姿, 陳翰新, 姚冠豪, 陳妮葳

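The travel dataset records the details of individual trips: the destination, the start and end dates, the trip duration in days, traveler demographics (name, age, gender, and nationality), and the type and cost of both accommodation and transportation. It can be used to study the travel patterns, preferences, and behavior of different kinds of travelers. It can also help travel-related businesses such as travel agencies design tailored marketing strategies and travel packages that meet the needs and preferences of different travelers.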

★References
●https://www.kaggle.com/datasets?search=image
●file:///C:/Users/Howard/Downloads/2.ggplot%20(2).html
●https://joe11051105.gitbooks.io/r_basic/content/arrange_data/merge_and_subsetting.html
●https://yijutseng.github.io/DataScienceRBook/manipulation.html
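★Traveler trip data
※This report analyzes data on travelers' trips.
●Variables in the traveler trip data
Trip.ID (trip ID), Destination (destination), Start.date (start date), End.date (end date), Duration (days) (trip length in days), Traveler.name (traveler name),
Traveler.age (traveler age), Traveler.gender (traveler gender), Traveler.nationality (traveler nationality), Accommodation.type (accommodation type), Accommodation.cost (accommodation cost),
Transportation.type (transportation type), Transportation.cost (transportation cost)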
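The variable names in this list are written with dots (`Trip.ID`), while the tibble printed further down shows them with spaces (`Trip ID`). One likely explanation, offered here as an assumption rather than something the report states, is that the dotted names come from base R's `read.csv()`, which by default rewrites headers into syntactically valid names, whereas `readr::read_csv()` keeps the original headers. A minimal sketch with a made-up three-column CSV:

```r
# Illustrative CSV text; only the header names mirror the dataset.
csv_text <- "Trip ID,Start date,Duration (days)\n1,5/1/2023,7\n2,6/15/2023,5"

# Default behaviour: check.names = TRUE converts headers with make.names().
dotted <- read.csv(text = csv_text)
names(dotted)

# check.names = FALSE keeps the headers exactly as they appear in the file.
original <- read.csv(text = csv_text, check.names = FALSE)
names(original)

# Columns whose names contain spaces need backticks (or [[ ]]) to be accessed.
original$`Duration (days)`
```

This is why the code in this report wraps column names such as `Traveler gender` in backticks.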

○View the first five rows

library(readr)

## Warning: package 'readr' was built under R version 4.2.2

Travel_details_dataset <- read_csv("C:/Users/Howard/OneDrive/桌面/Travel details dataset.csv")

## Rows: 139 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Destination, Start date, End date, Traveler name, Traveler gender,...
## dbl  (3): Trip ID, Duration (days), Traveler age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(Travel_details_dataset, n = 5,width = Inf)

## # A tibble: 139 × 13
##   `Trip ID` Destination      `Start date` `End date` `Duration (days)`
##       <dbl> <chr>            <chr>        <chr>                  <dbl>
## 1         1 London, UK       5/1/2023     5/8/2023                   7
## 2         2 Phuket, Thailand 6/15/2023    6/20/2023                  5
## 3         3 Bali, Indonesia  7/1/2023     7/8/2023                   7
## 4         4 New York, USA    8/15/2023    8/29/2023                 14
## 5         5 Tokyo, Japan     9/10/2023    9/17/2023                  7
##   `Traveler name` `Traveler age` `Traveler gender` `Traveler nationality`
##   <chr>                    <dbl> <chr>             <chr>                 
## 1 John Smith                  35 Male              American              
## 2 Jane Doe                    28 Female            Canadian              
## 3 David Lee                   45 Male              Korean                
## 4 Sarah Johnson               29 Female            British               
## 5 Kim Nguyen                  26 Female            Vietnamese            
##   `Accommodation type` `Accommodation cost` `Transportation type`
##   <chr>                <chr>                <chr>                
## 1 Hotel                1200                 Flight               
## 2 Resort               800                  Flight               
## 3 Villa                1000                 Flight               
## 4 Hotel                2000                 Flight               
## 5 Airbnb               700                  Train                
##   `Transportation cost`
##   <chr>                
## 1 600                  
## 2 500                  
## 3 700                  
## 4 1000                 
## 5 200                  
## # … with 134 more rows

# Summary statistics
summary(Travel_details_dataset)

##     Trip ID      Destination         Start date          End date        
##  Min.   :  1.0   Length:139         Length:139         Length:139        
##  1st Qu.: 35.5   Class :character   Class :character   Class :character  
##  Median : 70.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 70.0                                                           
##  3rd Qu.:104.5                                                           
##  Max.   :139.0                                                           
##                                                                          
##  Duration (days)  Traveler name       Traveler age   Traveler gender   
##  Min.   : 5.000   Length:139         Min.   :20.00   Length:139        
##  1st Qu.: 7.000   Class :character   1st Qu.:28.00   Class :character  
##  Median : 7.000   Mode  :character   Median :31.00   Mode  :character  
##  Mean   : 7.606                      Mean   :33.18                     
##  3rd Qu.: 8.000                      3rd Qu.:38.00                     
##  Max.   :14.000                      Max.   :60.00                     
##  NA's   :2                           NA's   :2                         
##  Traveler nationality Accommodation type Accommodation cost Transportation type
##  Length:139           Length:139         Length:139         Length:139         
##  Class :character     Class :character   Class :character   Class :character   
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character   
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  Transportation cost
##  Length:139         
##  Class :character   
##  Mode  :character   
##                     
##                     
##                     
##

# Data structure
str(Travel_details_dataset)

## spc_tbl_ [139 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Trip ID             : num [1:139] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Destination         : chr [1:139] "London, UK" "Phuket, Thailand" "Bali, Indonesia" "New York, USA" ...
##  $ Start date          : chr [1:139] "5/1/2023" "6/15/2023" "7/1/2023" "8/15/2023" ...
##  $ End date            : chr [1:139] "5/8/2023" "6/20/2023" "7/8/2023" "8/29/2023" ...
##  $ Duration (days)     : num [1:139] 7 5 7 14 7 5 10 7 7 7 ...
##  $ Traveler name       : chr [1:139] "John Smith" "Jane Doe" "David Lee" "Sarah Johnson" ...
##  $ Traveler age        : num [1:139] 35 28 45 29 26 42 33 25 31 39 ...
##  $ Traveler gender     : chr [1:139] "Male" "Female" "Male" "Female" ...
##  $ Traveler nationality: chr [1:139] "American" "Canadian" "Korean" "British" ...
##  $ Accommodation type  : chr [1:139] "Hotel" "Resort" "Villa" "Hotel" ...
##  $ Accommodation cost  : chr [1:139] "1200" "800" "1000" "2000" ...
##  $ Transportation type : chr [1:139] "Flight" "Flight" "Flight" "Flight" ...
##  $ Transportation cost : chr [1:139] "600" "500" "700" "1000" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Trip ID` = col_double(),
##   ..   Destination = col_character(),
##   ..   `Start date` = col_character(),
##   ..   `End date` = col_character(),
##   ..   `Duration (days)` = col_double(),
##   ..   `Traveler name` = col_character(),
##   ..   `Traveler age` = col_double(),
##   ..   `Traveler gender` = col_character(),
##   ..   `Traveler nationality` = col_character(),
##   ..   `Accommodation type` = col_character(),
##   ..   `Accommodation cost` = col_character(),
##   ..   `Transportation type` = col_character(),
##   ..   `Transportation cost` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
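The structure above shows that `Start date` and `End date` are stored as character strings. A sketch of converting them to `Date` values and recomputing the trip length, using the first five rows as printed (the month/day/year format is inferred from values such as `6/15/2023`):

```r
# Start and end dates of the first five trips, as printed above.
start_chr <- c("5/1/2023", "6/15/2023", "7/1/2023", "8/15/2023", "9/10/2023")
end_chr   <- c("5/8/2023", "6/20/2023", "7/8/2023", "8/29/2023", "9/17/2023")

# Parse month/day/year strings into Date objects.
start_date <- as.Date(start_chr, format = "%m/%d/%Y")
end_date   <- as.Date(end_chr, format = "%m/%d/%Y")

# Difference in days; this can be compared with the `Duration (days)` column.
duration_days <- as.numeric(end_date - start_date)
duration_days
```

Recomputing the duration this way is one possible check for rows where the recorded duration and the dates disagree.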

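○Count the male and female travelers and their total, then draw a proportion chart

x=Travel_details_dataset$`Traveler gender`
# 男性 = male, 女性 = female, 總和 = total; na.rm = TRUE skips rows with no recorded gender
男性=sum(x=="Male",na.rm = TRUE)
女性=sum(x=="Female",na.rm = TRUE)
總和=男性+女性
cbind(男性,女性,總和)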

##      男性 女性 總和
## [1,]   67   70  137
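The dataset has 139 rows, while the two counts add up to 137, so two rows are neither `Male` nor `Female` (most likely missing values). A sketch of getting the counts, the missing values, and the percentages with `table()`, shown on a small made-up vector:

```r
# Made-up gender values with one missing entry.
gender <- c("Male", "Female", "Female", NA, "Male", "Female")

# useNA = "ifany" adds an <NA> entry when missing values are present.
counts_all <- table(gender, useNA = "ifany")
counts_all

# Counts and percentages among the non-missing values only.
counts <- table(gender)
percent <- round(100 * prop.table(counts), 1)
percent
```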

# Proportion chart
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(knitr)

## Warning: package 'knitr' was built under R version 4.2.2

data=data.frame(性别 = c("男性", "女性"), 人数 = c(男性, 女性))

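# Compute each gender's share of the total as a percentage label
data$比例=paste0(round(data$人数 / 總和 * 100, 1), "%")

# Draw the proportion (pie) chart
ggplot(data, aes(x = "", y = 人数, fill = 性别)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, title = "男性与女性比例") +
  # Name the colors so they do not depend on the sort order of the labels
  scale_fill_manual(values = c("男性" = "blue", "女性" = "pink")) +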
  theme_minimal() +
  geom_text(aes(label = 比例), 
            position = position_stack(vjust = 0.5), 
            color = "white", size = 5)

○Filter the rows whose destination is London, UK

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## Warning: package 'tibble' was built under R version 4.2.2

## Warning: package 'tidyr' was built under R version 4.2.2

## Warning: package 'purrr' was built under R version 4.2.2

## Warning: package 'dplyr' was built under R version 4.2.2

## Warning: package 'stringr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.2

## Warning: package 'lubridate' was built under R version 4.2.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ stringr   1.5.0
## ✔ forcats   1.0.0     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Travel_details_dataset%>%
  filter(Destination=="London, UK")

## # A tibble: 3 × 13
##   `Trip ID` Destination Start …¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
##       <dbl> <chr>       <chr>    <chr>     <dbl> <chr>     <dbl> <chr>   <chr>  
## 1         1 London, UK  5/1/2023 5/8/20…       7 John S…      35 Male    Americ…
## 2        44 London, UK  3/5/2023 3/12/2…       7 Peter …      55 Male    British
## 3        56 London, UK  3/15/20… 3/23/2…       8 Ben Sm…      35 Male    British
## # … with 4 more variables: `Accommodation type` <chr>,
## #   `Accommodation cost` <chr>, `Transportation type` <chr>,
## #   `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## #   ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## #   ⁶`Traveler gender`, ⁷`Traveler nationality`
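The exact match above returns only rows spelled `London, UK`. The output in the next section lists at least one trip (Trip ID 89) whose destination is recorded simply as `London`, so an exact comparison may miss some London trips. A hedged sketch of a prefix match in base R, using a small hand-made data frame (the rows and the `trip_id` column name are illustrative, not the full dataset):

```r
# Small illustrative sample; Destination spellings follow those seen in the output.
trips <- data.frame(
  trip_id = c(1, 44, 89, 4),
  Destination = c("London, UK", "London, UK", "London", "New York, USA"),
  stringsAsFactors = FALSE
)

# Exact match: misses the row recorded as plain "London".
exact <- trips[trips$Destination == "London, UK", ]

# Prefix match: keeps every destination that starts with "London".
by_prefix <- trips[startsWith(trips$Destination, "London"), ]

nrow(exact)      # rows found by the exact match
nrow(by_prefix)  # rows found by the prefix match
```

A prefix match would also catch any other place whose name begins with "London", so the matched rows are worth inspecting before using them.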

○Find the travelers whose trips lasted 10 days or longer

# Method 1: call filter() directly
filter(Travel_details_dataset,`Duration (days)`>=10)

## # A tibble: 17 × 13
##    `Trip ID` Destination Start…¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
##        <dbl> <chr>       <chr>   <chr>     <dbl> <chr>     <dbl> <chr>   <chr>  
##  1         4 New York, … 8/15/2… 8/29/2…      14 Sarah …      29 Female  British
##  2         7 Sydney, Au… 11/20/… 11/30/…      10 Emily …      33 Female  Austra…
##  3        18 Bali        8/15/2… 8/25/2…      10 Michae…      28 Male    Chinese
##  4        20 Tokyo       10/5/2… 10/15/…      10 Kenji …      45 Male    Japane…
##  5        31 Australia   8/20/2… 9/2/20…      13 Emma D…      28 Female  British
##  6        35 Mexico      1/5/20… 1/15/2…      10 James …      42 Male    British
##  7        51 Tokyo, Jap… 10/10/… 10/20/…      10 David …      25 Male    Americ…
##  8        85 Tokyo       7/1/20… 7/10/2…      10 Sarah …      28 Female  Korean 
##  9        86 Bali        8/10/2… 8/20/2…      11 Maria …      42 Female  Spanish
## 10        89 London      11/20/… 11/30/…      11 James …      29 Male    British
## 11        92 Rome        3/10/2… 3/20/2…      11 Giulia…      30 Female  Italian
## 12        93 Bali        4/15/2… 4/25/2…      11 Putra …      33 Male    Indone…
## 13        94 Seoul       5/1/20… 5/10/2…      10 Kim Mi…      27 Female  Korean 
## 14       119 Sydney, Aus 5/1/20… 5/12/2…      11 Cindy …      26 Female  Chinese
## 15       121 Bali, Indo… 7/20/2… 7/30/2…      10 Emily …      29 Female  Korean 
## 16       123 Athens, Gr… 9/20/2… 9/30/2…      10 Gina L…      35 Female  Korean 
## 17       125 Sydney, Aus 11/11/… 11/21/…      10 Isabel…      30 Female  Chinese
## # … with 4 more variables: `Accommodation type` <chr>,
## #   `Accommodation cost` <chr>, `Transportation type` <chr>,
## #   `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## #   ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## #   ⁶`Traveler gender`, ⁷`Traveler nationality`

# Method 2: use the pipe operator
Travel_details_dataset%>%
  filter(`Duration (days)`>=10)

## # A tibble: 17 × 13
##    `Trip ID` Destination Start…¹ End d…² Durat…³ Trave…⁴ Trave…⁵ Trave…⁶ Trave…⁷
##        <dbl> <chr>       <chr>   <chr>     <dbl> <chr>     <dbl> <chr>   <chr>  
##  1         4 New York, … 8/15/2… 8/29/2…      14 Sarah …      29 Female  British
##  2         7 Sydney, Au… 11/20/… 11/30/…      10 Emily …      33 Female  Austra…
##  3        18 Bali        8/15/2… 8/25/2…      10 Michae…      28 Male    Chinese
##  4        20 Tokyo       10/5/2… 10/15/…      10 Kenji …      45 Male    Japane…
##  5        31 Australia   8/20/2… 9/2/20…      13 Emma D…      28 Female  British
##  6        35 Mexico      1/5/20… 1/15/2…      10 James …      42 Male    British
##  7        51 Tokyo, Jap… 10/10/… 10/20/…      10 David …      25 Male    Americ…
##  8        85 Tokyo       7/1/20… 7/10/2…      10 Sarah …      28 Female  Korean 
##  9        86 Bali        8/10/2… 8/20/2…      11 Maria …      42 Female  Spanish
## 10        89 London      11/20/… 11/30/…      11 James …      29 Male    British
## 11        92 Rome        3/10/2… 3/20/2…      11 Giulia…      30 Female  Italian
## 12        93 Bali        4/15/2… 4/25/2…      11 Putra …      33 Male    Indone…
## 13        94 Seoul       5/1/20… 5/10/2…      10 Kim Mi…      27 Female  Korean 
## 14       119 Sydney, Aus 5/1/20… 5/12/2…      11 Cindy …      26 Female  Chinese
## 15       121 Bali, Indo… 7/20/2… 7/30/2…      10 Emily …      29 Female  Korean 
## 16       123 Athens, Gr… 9/20/2… 9/30/2…      10 Gina L…      35 Female  Korean 
## 17       125 Sydney, Aus 11/11/… 11/21/…      10 Isabel…      30 Female  Chinese
## # … with 4 more variables: `Accommodation type` <chr>,
## #   `Accommodation cost` <chr>, `Transportation type` <chr>,
## #   `Transportation cost` <chr>, and abbreviated variable names ¹`Start date`,
## #   ²`End date`, ³`Duration (days)`, ⁴`Traveler name`, ⁵`Traveler age`,
## #   ⁶`Traveler gender`, ⁷`Traveler nationality`

○Compute the combined accommodation and transportation cost of each trip abroad

library(dplyr)
# Drop every row that has a missing value
Travel_details_dataset2=na.omit(Travel_details_dataset)
# Strip punctuation (currency symbols, separators) and the "USD" suffix, then convert to numbers
clean_cost=function(v) as.numeric(gsub("USD","",gsub("[[:punct:]]","",v)))
# 住宿費 = accommodation cost, 交通費 = transportation cost; both stay aligned with the rows of Travel_details_dataset2
住宿費=clean_cost(Travel_details_dataset2$`Accommodation cost`)
交通費=clean_cost(Travel_details_dataset2$`Transportation cost`)
Travel_details_dataset2%>%
  select(Destination ,`Duration (days)`,`Accommodation cost`,`Transportation cost`)%>%
  mutate(
    "Allfees"=住宿費+交通費,
    # Convert to New Taiwan dollars at the exchange rate used in this report (30.72)
    "Convert Taiwan Dollar"=(住宿費+交通費)*30.72
  )

## # A tibble: 136 × 6
##    Destination                 `Duration (days)` Accom…¹ Trans…² Allfees Conve…³
##    <chr>                                   <dbl> <chr>   <chr>     <dbl>   <dbl>
##  1 London, UK                                  7 1200    600        1800   55296
##  2 Phuket, Thailand                            5 800     500        1300   39936
##  3 Bali, Indonesia                             7 1000    700        1700   52224
##  4 New York, USA                              14 2000    1000       3000   92160
##  5 Tokyo, Japan                                7 700     200         900   27648
##  6 Paris, France                               5 1500    800        2300   70656
##  7 Sydney, Australia                          10 500     1200       1700   52224
##  8 Rio de Janeiro, Brazil                      7 900     600        1500   46080
##  9 Amsterdam, Netherlands                      7 1200    200        1400   43008
## 10 Dubai, United Arab Emirates                 7 2500    800        3300  101376
## # … with 126 more rows, and abbreviated variable names ¹`Accommodation cost`,
## #   ²`Transportation cost`, ³`Convert Taiwan Dollar`
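The cleaning step above implies that the cost columns mix plain numbers with values that carry currency symbols, separators, or a `USD` suffix. As an alternative sketch (not the method used in this report), a single regular expression that keeps only digits and the decimal point handles these cases and also preserves decimals. The sample strings and the names `parse_cost` and `usd_to_twd` are assumptions made for illustration.

```r
# Hypothetical cost strings in the formats the cleaning step seems to target.
raw_cost <- c("1200", "$1,500", "800 USD", "$900 ")

# Drop everything except digits and the decimal point, then convert.
parse_cost <- function(v) as.numeric(gsub("[^0-9.]", "", v))

cost_usd <- parse_cost(raw_cost)
cost_usd

# Convert to New Taiwan dollars with the same rate used in the report.
usd_to_twd <- 30.72
cost_usd * usd_to_twd
```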

○Find how many transportation types there are and plot the share of each

# Find the transportation types
x=table(Travel_details_dataset$`Transportation type`)
x1=na.omit(x)
# Plot (vertical bar chart)
ggplot(Travel_details_dataset, aes(x = `Transportation type`)) +
  stat_count()

# Plot (horizontal bar chart)
ggplot(Travel_details_dataset, aes(x = `Transportation type`)) +
  geom_bar() + coord_flip()

# Plot (pie chart)
Travel_details_dataset_Transportation=table(Travel_details_dataset$`Transportation type`)
Travel_details_dataset_Transportation1=as.data.frame(Travel_details_dataset_Transportation)
# Label each slice with its transportation type and percentage
labels2 = paste(names(Travel_details_dataset_Transportation), "\n(", round((prop.table(Travel_details_dataset_Transportation))*100,2), "%)", sep = "")
ggplot(Travel_details_dataset_Transportation1,aes(x="",y=Freq,fill=Var1))+
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  geom_text(aes(x=1.7,label=labels2),
            position = position_stack(vjust = 0.5))+
  theme_void()

# Pie chart 2 (base R), reusing the labels built above
pie(x1, labels2, main = "交通工具", radius=1.05,lty=1,col = rainbow(length(x1)))
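Before plotting, it can help to look at the shares as numbers. The sketch below tabulates a small made-up vector of transportation types (the values are illustrative, not the dataset's actual counts), sorts the categories by frequency, and builds percentage labels in the same style as `labels2`.

```r
# Made-up sample of transportation types, including one missing value.
transport <- c("Flight", "Train", "Flight", "Car rental", "Flight", "Train", NA)

# table() ignores NA by default; sort so the most common type comes first.
counts <- sort(table(transport), decreasing = TRUE)

# Percentage share of each type among the non-missing values.
share <- round(100 * prop.table(counts), 2)

# One label per type: the type name, a line break, then its share in parentheses.
slice_labels <- paste0(names(counts), "\n(", share, "%)")
slice_labels
```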

○Compute the travelers' mean age

mean(Travel_details_dataset2$`Traveler age`)

## [1] 33.11765

○Plot the travelers' age distribution

summary(Travel_details_dataset2$`Traveler age`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   28.00   31.00   33.12   37.25   60.00

# Build the example data: each age and the number of travelers of that age
age <- c(20, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 50, 55, 60)
count <- c(1, 1, 4, 7, 6, 12, 11, 7, 11, 4, 9, 2, 10, 1, 4, 4, 3, 1, 5, 7, 1, 8, 1, 1, 1, 1, 1, 1)

# Convert the data to a data frame
df <- data.frame(age, count)

# Set the interval boundaries
intervals <- c(20, 24, 29, 34, 39, 44, 49, 54, 59, 64)

# Group the ages with cut()
df$group <- cut(df$age, breaks = intervals, labels = c("20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64"), include.lowest = TRUE)

# Count the travelers in each group
group_counts <- aggregate(count ~ group, data = df, FUN = sum)

# Draw a bar chart with a line overlay and add the count labels
ggplot() +
  geom_bar(data = group_counts, aes(x = group, y = count), stat = "identity", fill = "blue", width = 0.5) +
  geom_line(data = group_counts, aes(x = group, y = count, group = 1), color = "red") +
  geom_text(data = group_counts, aes(x = group, y = count, label = count), vjust = -0.5, size = 3) +
  xlab("年齡組") +
  ylab("人數") +
  ggtitle("年齡分組人數統計") +
  theme_minimal()
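The ages and counts above are typed in by hand. As a sketch of how the same grouping could be derived directly from an age column, the example below bins the first ten ages shown in the `str()` output into five-year groups with `cut()` and counts them with `table()`. With `right = FALSE` each interval includes its lower bound, so the breaks can be written as 20, 25, 30, and so on.

```r
# First ten traveler ages as shown in the str() output.
age <- c(35, 28, 45, 29, 26, 42, 33, 25, 31, 39)

# Five-year bins: [20,25), [25,30), ..., [60,65).
breaks <- seq(20, 65, by = 5)
labels <- paste(head(breaks, -1), head(breaks, -1) + 4, sep = "-")

age_group <- cut(age, breaks = breaks, labels = labels, right = FALSE)

# Number of travelers per group; empty groups are kept with a count of 0.
group_counts <- as.data.frame(table(group = age_group))
group_counts
```

In the report's workflow, `age` would be replaced by the age column of `Travel_details_dataset2`, and `group_counts$Freq` would take the place of the hand-entered counts.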

○Correlation analysis of the numeric variables: Duration (days) (trip length in days), Traveler age (traveler's age), Accommodation cost, and Transportation cost

# as.numeric() returns NA for any cost that is not a plain number (see the
# warnings below), so those rows are later removed by na.omit()
x<- data.frame(
  Duration = as.numeric(Travel_details_dataset$`Duration (days)`),
  Age = as.numeric(Travel_details_dataset$`Traveler age`),
  AccommodationCost = as.numeric(Travel_details_dataset$`Accommodation cost`),
  TransportationCost = as.numeric(Travel_details_dataset$`Transportation cost`)
)

## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion

## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion

x<- na.omit(x)
# Compute the correlation coefficients
cor_matrix <- cor(x)
cor_matrix

##                      Duration         Age AccommodationCost TransportationCost
## Duration            1.0000000 -0.14540030       -0.11345104         0.09141530
## Age                -0.1454003  1.00000000        0.02323159        -0.04162607
## AccommodationCost  -0.1134510  0.02323159        1.00000000         0.82672834
## TransportationCost  0.0914153 -0.04162607        0.82672834         1.00000000
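As a sketch of an alternative to calling `na.omit()` first, `cor()` accepts a `use` argument that controls how missing values are handled. The example uses the first five rows of the dataset as printed earlier, with one duration set to `NA` purely for illustration.

```r
# First five rows from the printed tibble; the NA is inserted for illustration.
x <- data.frame(
  Duration = c(7, 5, NA, 14, 7),
  AccommodationCost = c(1200, 800, 1000, 2000, 700),
  TransportationCost = c(600, 500, 700, 1000, 200)
)

# "complete.obs" drops every row that has an NA in any column (like na.omit()).
cor_complete <- cor(x, use = "complete.obs")

# "pairwise.complete.obs" drops rows per pair of columns, so the two cost
# columns are still compared on all five rows.
cor_pairwise <- cor(x, use = "pairwise.complete.obs")

round(cor_complete, 3)
round(cor_pairwise, 3)
```

With pairwise deletion each coefficient may be based on a different number of rows, which is worth keeping in mind when comparing entries of the matrix.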

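★Test whether trip duration (Duration (days)) differs significantly from accommodation cost (Accommodation cost).
Note that this t-test compares the mean of a duration in days with the mean of a cost in US dollars, so a significant result mainly reflects the different scales of the two variables.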

data1<- data.frame(
  Duration = as.numeric(Travel_details_dataset$`Duration (days)`),
  AccommodationCost = as.numeric(Travel_details_dataset$`Accommodation cost`)
)

## Warning in data.frame(Duration = as.numeric(Travel_details_dataset$`Duration
## (days)`), : NAs introduced by coercion

missing_values <- is.na(data1$Duration) | is.na(data1$AccommodationCost)
data1 <- data1[!missing_values, ]
result <- t.test(data1$Duration, data1$AccommodationCost)
result

## 
##  Welch Two Sample t-test
## 
## data:  data1$Duration and data1$AccommodationCost
## t = -8.2956, df = 71, p-value = 4.732e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2065.064 -1264.713
## sample estimates:
##   mean of x   mean of y 
##    7.333333 1672.222222
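If the aim is to check whether trip duration and accommodation cost move together, a correlation test is one possible alternative to the t-test above. The sketch below swaps in `cor.test()` (Pearson by default) on the first ten rows printed in the cost table earlier; it is an illustration on that small sample, not a re-analysis of the full dataset.

```r
# First ten rows of the cost table (duration in days, accommodation cost in USD).
duration <- c(7, 5, 7, 14, 7, 5, 10, 7, 7, 7)
accommodation_cost <- c(1200, 800, 1000, 2000, 700, 1500, 500, 900, 1200, 2500)

# Pearson correlation test: the null hypothesis is that the true correlation is 0.
ct <- cor.test(duration, accommodation_cost)

ct$estimate  # sample correlation coefficient
ct$p.value   # p-value for the null hypothesis of zero correlation
ct$conf.int  # 95% confidence interval for the correlation
```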

★Linear regression analysis

model <- lm(data1$Duration~data1$AccommodationCost,data1)
summary(model)

## 
## Call:
## lm(formula = data1$Duration ~ data1$AccommodationCost, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4794 -0.4596 -0.3802  0.6124  6.6992 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.499e+00  2.552e-01  29.385   <2e-16 ***
## data1$AccommodationCost -9.924e-05  1.073e-04  -0.925    0.358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.54 on 70 degrees of freedom
## Multiple R-squared:  0.01207,    Adjusted R-squared:  -0.002043 
## F-statistic: 0.8552 on 1 and 70 DF,  p-value: 0.3583
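Writing the formula with bare column names and passing the data frame through `data =` gives shorter coefficient names and lets `predict()` work with new data. A minimal sketch on the first ten rows of the cost table printed earlier (so the coefficients will not match the output above); the data frames `d` and `new_trips` are made up for illustration:

```r
# First ten rows of the cost table; a stand-in for data1.
d <- data.frame(
  Duration = c(7, 5, 7, 14, 7, 5, 10, 7, 7, 7),
  AccommodationCost = c(1200, 800, 1000, 2000, 700, 1500, 500, 900, 1200, 2500)
)

# Bare column names in the formula are looked up in `data`.
model <- lm(Duration ~ AccommodationCost, data = d)
coef(model)

# Predicted duration for two hypothetical accommodation costs.
new_trips <- data.frame(AccommodationCost = c(1000, 2000))
predict(model, newdata = new_trips)
```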

End of report
Thank you for reading

2023-05-02