1. Introduction

本例使用 Titanic 乘客数据,分析哪些因素会影响乘客的生还概率。
这类数据分析非常适合课堂演示,因为它同时包含类别变量和连续变量,也有比较明确的研究问题。

本例希望回答以下几个问题:

  1. 不同性别的乘客,生还率是否存在明显差异?
  2. 不同舱位等级的乘客,生还率是否不同?
  3. 年龄、票价、是否独自出行等变量,是否会影响生还概率?
  4. 能否通过一个简单的回归模型,对生还概率进行解释?

总体分析思路如下:

library(tidyverse)
library(janitor)
library(broom)
library(knitr)

2. Data Overview

首先导入数据,并查看数据维度、变量名称、变量类型和描述统计。

df <- read_csv("~/Library/Containers/com.tencent.xinWeChat/Data/Documents/xwechat_files/wxid_wztmk5rxpnp322_e799/msg/file/2026-04/Titanic-Dataset(1).csv") %>%
  clean_names()
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(df)
## [1] 891  12
names(df)
##  [1] "passenger_id" "survived"     "pclass"       "name"         "sex"         
##  [6] "age"          "sib_sp"       "parch"        "ticket"       "fare"        
## [11] "cabin"        "embarked"
str(df)
## spc_tbl_ [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ passenger_id: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ survived    : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
##  $ pclass      : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
##  $ name        : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ sex         : chr [1:891] "male" "female" "female" "female" ...
##  $ age         : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
##  $ sib_sp      : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
##  $ parch       : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
##  $ ticket      : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ fare        : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
##  $ cabin       : chr [1:891] NA "C85" NA "C123" ...
##  $ embarked    : chr [1:891] "S" "C" "S" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PassengerId = col_double(),
##   ..   Survived = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   SibSp = col_double(),
##   ..   Parch = col_double(),
##   ..   Ticket = col_character(),
##   ..   Fare = col_double(),
##   ..   Cabin = col_character(),
##   ..   Embarked = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(df)
##   passenger_id      survived          pclass          name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      sex                 age            sib_sp          parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     ticket               fare           cabin             embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

从初步结果可以看到,这份数据记录了 Titanic 乘客的个人信息和是否生还情况。
核心变量包括:

接下来检查缺失值情况。

missing_table <- data.frame(
  variable = names(df),
  missing_count = colSums(is.na(df)),
  missing_rate = colSums(is.na(df)) / nrow(df)
)

kable(missing_table, digits = 3)
variable missing_count missing_rate
passenger_id passenger_id 0 0.000
survived survived 0 0.000
pclass pclass 0 0.000
name name 0 0.000
sex sex 0 0.000
age age 177 0.199
sib_sp sib_sp 0 0.000
parch parch 0 0.000
ticket ticket 0 0.000
fare fare 0 0.000
cabin cabin 687 0.771
embarked embarked 2 0.002

从缺失值情况可以看出:

因此后续分析中,需要对这些变量进行处理。

3. Strengths and Limitations of the Data

3.1 数据优点

这份数据有几个明显优点:

  1. 样本量适中,适合作为课堂分析案例;
  2. 变量类型丰富,既有类别变量,也有连续变量;
  3. 研究问题明确,容易构建可解释的统计模型;
  4. 可以进行变量衍生,例如家庭规模、是否独自出行等;
  5. 适合进行可视化展示和回归分析。

3.2 数据局限

同时,这份数据也存在一些局限:

  1. cabin 变量缺失过多,难以直接纳入主分析;
  2. 数据是观测性数据,因此只能做相关分析,不能直接推出因果关系;
  3. 某些潜在重要变量并未被记录,例如乘客所在位置、逃生时机等;
  4. 一些变量可能存在偏态分布,例如 fare

因此,在解释结果时应保持谨慎。

4. Data Cleaning

这一部分对数据进行清洗,包括:

首先处理 embarked 的众数,并删除 cabin

mode_embarked <- df %>%
  filter(!is.na(embarked)) %>%
  count(embarked, sort = TRUE) %>%
  slice(1) %>%
  pull(embarked)

df_clean <- df %>%
  select(-cabin) %>%
  mutate(
    survived = factor(survived, levels = c(0, 1), labels = c("No", "Yes")),
    pclass = factor(pclass),
    sex = factor(sex),
    embarked = if_else(is.na(embarked), mode_embarked, embarked),
    embarked = factor(embarked),
    age = if_else(is.na(age), median(age, na.rm = TRUE), age)
  )

summary(df_clean)
##   passenger_id   survived  pclass      name               sex     
##  Min.   :  1.0   No :549   1:216   Length:891         female:314  
##  1st Qu.:223.5   Yes:342   2:184   Class :character   male  :577  
##  Median :446.0             3:491   Mode  :character               
##  Mean   :446.0                                                    
##  3rd Qu.:668.5                                                    
##  Max.   :891.0                                                    
##       age            sib_sp          parch           ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.36   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##       fare        embarked
##  Min.   :  0.00   C:168   
##  1st Qu.:  7.91   Q: 77   
##  Median : 14.45   S:646   
##  Mean   : 32.20           
##  3rd Qu.: 31.00           
##  Max.   :512.33

上述处理方式的逻辑如下:

再检查一次缺失值,确认清洗结果。

clean_missing_table <- data.frame(
  variable = names(df_clean),
  missing_count = colSums(is.na(df_clean)),
  missing_rate = colSums(is.na(df_clean)) / nrow(df_clean)
)

kable(clean_missing_table, digits = 3)
variable missing_count missing_rate
passenger_id passenger_id 0 0
survived survived 0 0
pclass pclass 0 0
name name 0 0
sex sex 0 0
age age 0 0
sib_sp sib_sp 0 0
parch parch 0 0
ticket ticket 0 0
fare fare 0 0
embarked embarked 0 0

5. Feature Engineering

为了让后续图形和模型更有解释力,这里构造几个新变量:

  1. family_size:家庭规模;
  2. is_alone:是否独自出行;
  3. age_group:年龄分组;
  4. log_fare:票价对数变换。
df_clean <- df_clean %>%
  mutate(
    family_size = sib_sp + parch + 1,
    is_alone = if_else(family_size == 1, "Alone", "With Family"),
    is_alone = factor(is_alone),
    age_group = case_when(
      age < 12 ~ "Child",
      age >= 12 & age < 18 ~ "Teenager",
      age >= 18 & age < 60 ~ "Adult",
      age >= 60 ~ "Senior"
    ),
    age_group = factor(age_group, levels = c("Child", "Teenager", "Adult", "Senior")),
    log_fare = log(fare + 1)
  )

summary(df_clean[, c("age", "fare", "family_size", "log_fare")])
##       age             fare         family_size        log_fare    
##  Min.   : 0.42   Min.   :  0.00   Min.   : 1.000   Min.   :0.000  
##  1st Qu.:22.00   1st Qu.:  7.91   1st Qu.: 1.000   1st Qu.:2.187  
##  Median :28.00   Median : 14.45   Median : 1.000   Median :2.738  
##  Mean   :29.36   Mean   : 32.20   Mean   : 1.905   Mean   :2.962  
##  3rd Qu.:35.00   3rd Qu.: 31.00   3rd Qu.: 2.000   3rd Qu.:3.466  
##  Max.   :80.00   Max.   :512.33   Max.   :11.000   Max.   :6.241

这些变量的作用是:

6. Descriptive Statistics

下面给出若干核心变量的描述统计。

df_clean %>%
  summarise(
    sample_size = n(),
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE),
    mean_fare = mean(fare, na.rm = TRUE),
    median_fare = median(fare, na.rm = TRUE),
    mean_family_size = mean(family_size, na.rm = TRUE)
  ) %>%
  kable(digits = 2)
sample_size mean_age median_age mean_fare median_fare mean_family_size
891 29.36 28 32.2 14.45 1.9

再看一下生还比例。

df_clean %>%
  count(survived) %>%
  mutate(proportion = n / sum(n)) %>%
  kable(digits = 3)
survived n proportion
No 549 0.616
Yes 342 0.384

从描述统计可以初步了解样本结构,但更重要的信息还需要通过可视化来展示。

7. Visualization

7.1 性别与生还率

先观察性别与生还之间的关系。

ggplot(df_clean, aes(x = sex, fill = survived)) +
  geom_bar(position = "fill") +
  labs(
    title = "Survival Rate by Sex",
    x = "Sex",
    y = "Proportion",
    fill = "Survived"
  ) +
  theme_minimal()

这张图展示了不同性别乘客的生还比例。
如果女性的生还比例明显高于男性,就说明性别可能是一个重要解释变量。

7.2 舱位等级与生还率

再观察乘客等级与生还率之间的关系。

ggplot(df_clean, aes(x = pclass, fill = survived)) +
  geom_bar(position = "fill") +
  labs(
    title = "Survival Rate by Passenger Class",
    x = "Passenger Class",
    y = "Proportion",
    fill = "Survived"
  ) +
  theme_minimal()

这张图用于比较不同舱位等级的生还情况。
如果高等级舱位的生还率更高,说明社会经济地位可能与生还概率有关。

7.3 年龄分布与生还情况

继续观察年龄与生还之间的关系。

ggplot(df_clean, aes(x = age, fill = survived)) +
  geom_histogram(binwidth = 5, alpha = 0.6, position = "identity") +
  labs(
    title = "Age Distribution by Survival Status",
    x = "Age",
    y = "Count",
    fill = "Survived"
  ) +
  theme_minimal()

这张图可以帮助我们判断不同年龄区间的人群是否表现出不同的生还模式。

7.4 票价与生还情况

下面用箱线图比较生还者和未生还者的票价分布。

ggplot(df_clean, aes(x = survived, y = fare, fill = survived)) +
  geom_boxplot() +
  labs(
    title = "Fare by Survival Status",
    x = "Survived",
    y = "Fare"
  ) +
  theme_minimal()

如果生还者整体票价更高,那么 fare 可能与生还概率存在正相关关系。

7.5 年龄组与生还率

进一步比较不同年龄组的生还情况。

ggplot(df_clean, aes(x = age_group, fill = survived)) +
  geom_bar(position = "fill") +
  labs(
    title = "Survival Rate by Age Group",
    x = "Age Group",
    y = "Proportion",
    fill = "Survived"
  ) +
  theme_minimal()

这张图可以让我们更直观地比较儿童、青少年、成年人和老年人的生还比例。

7.6 性别与舱位等级的联合比较

最后做一张稍微更有分析性的图,同时看性别和舱位等级。

ggplot(df_clean, aes(x = pclass, fill = survived)) +
  geom_bar(position = "fill") +
  facet_wrap(~ sex) +
  labs(
    title = "Survival Rate by Class and Sex",
    x = "Passenger Class",
    y = "Proportion",
    fill = "Survived"
  ) +
  theme_minimal()

这张图的意义在于:
我们不仅可以看性别差异,还可以看这种差异是否在不同舱位等级中依然存在。

8. Regression Analysis

由于因变量 survived 是一个二元变量,因此这里使用 logistic regression,而不是普通最小二乘回归。

首先把因变量重新转为 0/1 数值形式,便于建模。

df_model <- df_clean %>%
  mutate(
    survived_num = if_else(survived == "Yes", 1, 0)
  )

8.1 基础模型

第一个模型包含最基础的几个变量:性别、年龄、舱位等级和票价。

model_1 <- glm(
  survived_num ~ sex + age + pclass + fare,
  data = df_model,
  family = binomial
)

summary(model_1)
## 
## Call:
## glm(formula = survived_num ~ sex + age + pclass + fare, family = binomial, 
##     data = df_model)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.4577090  0.4159040   8.314  < 2e-16 ***
## sexmale     -2.6047131  0.1875578 -13.888  < 2e-16 ***
## age         -0.0328894  0.0074292  -4.427 9.55e-06 ***
## pclass2     -1.0706011  0.2854780  -3.750 0.000177 ***
## pclass3     -2.2825075  0.2804322  -8.139 3.98e-16 ***
## fare         0.0007582  0.0020994   0.361 0.717995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  805.4  on 885  degrees of freedom
## AIC: 817.4
## 
## Number of Fisher Scoring iterations: 5

下面整理一下回归结果表。

tidy(model_1) %>%
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) 3.4577 0.4159 8.3137 0.0000
sexmale -2.6047 0.1876 -13.8875 0.0000
age -0.0329 0.0074 -4.4271 0.0000
pclass2 -1.0706 0.2855 -3.7502 0.0002
pclass3 -2.2825 0.2804 -8.1392 0.0000
fare 0.0008 0.0021 0.3611 0.7180

这个模型的解释思路是:

  • sex 的系数可以帮助我们判断性别与生还概率的关系;
  • age 的系数说明年龄变化对生还概率的影响方向;
  • pclass 反映舱位等级的差异;
  • fare 反映票价与生还之间的关系。

8.2 扩展模型

第二个模型加入 embarkedis_alonelog_fare,看看结果是否更稳定。

model_2 <- glm(
  survived_num ~ sex + age + pclass + log_fare + embarked + is_alone,
  data = df_model,
  family = binomial
)

summary(model_2)
## 
## Call:
## glm(formula = survived_num ~ sex + age + pclass + log_fare + 
##     embarked + is_alone, family = binomial, data = df_model)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.60575    0.70046   5.148 2.64e-07 ***
## sexmale             -2.58951    0.19661 -13.171  < 2e-16 ***
## age                 -0.03323    0.00756  -4.395 1.11e-05 ***
## pclass2             -0.88877    0.30490  -2.915  0.00356 ** 
## pclass3             -2.19776    0.32289  -6.807 9.99e-12 ***
## log_fare             0.07596    0.14483   0.524  0.59997    
## embarkedQ           -0.01591    0.37299  -0.043  0.96597    
## embarkedS           -0.53445    0.23518  -2.272  0.02306 *  
## is_aloneWith Family -0.12489    0.22337  -0.559  0.57610    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  797.95  on 882  degrees of freedom
## AIC: 815.95
## 
## Number of Fisher Scoring iterations: 5

再展示扩展模型的回归表。

tidy(model_2) %>%
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) 3.6058 0.7005 5.1477 0.0000
sexmale -2.5895 0.1966 -13.1708 0.0000
age -0.0332 0.0076 -4.3951 0.0000
pclass2 -0.8888 0.3049 -2.9149 0.0036
pclass3 -2.1978 0.3229 -6.8066 0.0000
log_fare 0.0760 0.1448 0.5244 0.6000
embarkedQ -0.0159 0.3730 -0.0427 0.9660
embarkedS -0.5344 0.2352 -2.2725 0.0231
is_aloneWith Family -0.1249 0.2234 -0.5591 0.5761

8.3 模型比较

我们还可以简单比较两个模型的 AIC。

model_compare <- data.frame(
  Model = c("Model 1", "Model 2"),
  AIC = c(AIC(model_1), AIC(model_2)),
  BIC = c(BIC(model_1), BIC(model_2))
)

kable(model_compare, digits = 2)
Model AIC BIC
Model 1 817.40 846.15
Model 2 815.95 859.08

如果第二个模型的 AIC 更低,通常说明在拟合和复杂度之间取得了更好的平衡。

9. Interpretation of Results

根据前面的图形和模型结果,我们可以做出如下解释:

  1. 性别通常是最重要的解释变量之一。若模型中女性对应的系数显著,说明女性乘客的生还概率更高;
  2. 舱位等级也通常具有较强解释力。较高等级舱位的乘客更可能生还;
  3. 年龄的影响可能存在,但方向和显著性需要结合模型结果判断;
  4. 票价通常反映一定的社会经济差异,因此也可能与生还概率正相关;
  5. 是否独自出行和登船港口等变量,可以作为补充解释因素。

需要强调的是,这里的回归结果展示的是相关关系,而不是严格的因果关系。

10. Extra Analysis

为了进一步展示模型效果,这里可以计算预测概率,并查看部分结果。

df_model$predicted_prob <- predict(model_2, type = "response")

df_model %>%
  select(name, survived, sex, age, pclass, fare, predicted_prob) %>%
  head(10) %>%
  kable(digits = 3)
name survived sex age pclass fare predicted_prob
Braund, Mr. Owen Harris No male 22 3 7.250 0.082
Cumings, Mrs. John Bradley (Florence Briggs Thayer) Yes female 38 1 71.283 0.927
Heikkinen, Miss. Laina Yes female 26 3 7.925 0.544
Futrelle, Mrs. Jacques Heath (Lily May Peel) Yes female 35 1 53.100 0.890
Allen, Mr. William Henry No male 35 3 8.050 0.062
Moran, Mr. James No male 28 3 8.458 0.124
McCarthy, Mr. Timothy J No male 54 1 51.862 0.267
Palsson, Master. Gosta Leonard No male 2 3 21.075 0.158
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Yes female 27 3 11.133 0.510
Nasser, Mrs. Nicholas (Adele Achem) Yes female 14 2 30.071 0.916

这一步可以帮助我们理解 logistic regression 的输出:
模型不是直接给出“生还/未生还”的绝对判断,而是给出每个个体生还的预测概率。

11. Conclusion

本例基于 Titanic 数据,对乘客生还情况进行了一个完整的小型数据分析。

主要结论包括:

  1. 性别与生还率之间存在明显关系;
  2. 舱位等级与生还率也存在显著差异;
  3. 年龄、票价和是否独自出行等变量,也可能对生还概率产生影响;
  4. logistic regression 能够较好地帮助我们解释这些变量与生还之间的关系。

这份数据适合作为课堂分析案例,因为它结构清晰、变量丰富、图形效果明显,也便于建立可解释的模型。
但同时也要注意,这份数据仍然存在缺失值、变量遗漏和因果识别不足等局限,因此结论应理解为统计关联而非因果证明。