本例使用 Titanic 乘客数据,分析哪些因素会影响乘客的生还概率。
这类数据分析非常适合课堂演示,因为它同时包含类别变量和连续变量,也有比较明确的研究问题。
本例希望回答以下几个问题:
总体分析思路如下:
library(tidyverse)
library(janitor)
library(broom)
library(knitr)
首先导入数据,并查看数据维度、变量名称、变量类型和描述统计。
df <- read_csv("~/Library/Containers/com.tencent.xinWeChat/Data/Documents/xwechat_files/wxid_wztmk5rxpnp322_e799/msg/file/2026-04/Titanic-Dataset(1).csv") %>%
clean_names()
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(df)
## [1] 891 12
names(df)
## [1] "passenger_id" "survived" "pclass" "name" "sex"
## [6] "age" "sib_sp" "parch" "ticket" "fare"
## [11] "cabin" "embarked"
str(df)
## spc_tbl_ [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ passenger_id: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
## $ survived : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
## $ pclass : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## $ name : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ sex : chr [1:891] "male" "female" "female" "female" ...
## $ age : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
## $ sib_sp : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
## $ parch : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
## $ ticket : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ fare : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
## $ cabin : chr [1:891] NA "C85" NA "C123" ...
## $ embarked : chr [1:891] "S" "C" "S" "S" ...
## - attr(*, "spec")=
## .. cols(
## .. PassengerId = col_double(),
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. SibSp = col_double(),
## .. Parch = col_double(),
## .. Ticket = col_character(),
## .. Fare = col_double(),
## .. Cabin = col_character(),
## .. Embarked = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(df)
## passenger_id survived pclass name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## sex age sib_sp parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## ticket fare cabin embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
从初步结果可以看到,这份数据记录了 Titanic
乘客的个人信息和是否生还情况。
核心变量包括:
survived:是否生还(0 = No, 1 = Yes)pclass:船舱等级sex:性别age:年龄sib_sp:兄弟姐妹或配偶数量parch:父母或子女数量fare:票价embarked:登船港口接下来检查缺失值情况。
missing_table <- data.frame(
variable = names(df),
missing_count = colSums(is.na(df)),
missing_rate = colSums(is.na(df)) / nrow(df)
)
kable(missing_table, digits = 3)
| variable | missing_count | missing_rate | |
|---|---|---|---|
| passenger_id | passenger_id | 0 | 0.000 |
| survived | survived | 0 | 0.000 |
| pclass | pclass | 0 | 0.000 |
| name | name | 0 | 0.000 |
| sex | sex | 0 | 0.000 |
| age | age | 177 | 0.199 |
| sib_sp | sib_sp | 0 | 0.000 |
| parch | parch | 0 | 0.000 |
| ticket | ticket | 0 | 0.000 |
| fare | fare | 0 | 0.000 |
| cabin | cabin | 687 | 0.771 |
| embarked | embarked | 2 | 0.002 |
从缺失值情况可以看出:
age 存在一定数量的缺失;cabin 缺失非常严重;embarked 也有少量缺失。因此后续分析中,需要对这些变量进行处理。
这份数据有几个明显优点:
同时,这份数据也存在一些局限:
cabin 变量缺失过多,难以直接纳入主分析;fare。因此,在解释结果时应保持谨慎。
这一部分对数据进行清洗,包括:
首先处理 embarked 的众数,并删除
cabin。
mode_embarked <- df %>%
filter(!is.na(embarked)) %>%
count(embarked, sort = TRUE) %>%
slice(1) %>%
pull(embarked)
df_clean <- df %>%
select(-cabin) %>%
mutate(
survived = factor(survived, levels = c(0, 1), labels = c("No", "Yes")),
pclass = factor(pclass),
sex = factor(sex),
embarked = if_else(is.na(embarked), mode_embarked, embarked),
embarked = factor(embarked),
age = if_else(is.na(age), median(age, na.rm = TRUE), age)
)
summary(df_clean)
## passenger_id survived pclass name sex
## Min. : 1.0 No :549 1:216 Length:891 female:314
## 1st Qu.:223.5 Yes:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
## age sib_sp parch ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.36 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## fare embarked
## Min. : 0.00 C:168
## 1st Qu.: 7.91 Q: 77
## Median : 14.45 S:646
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
上述处理方式的逻辑如下:
cabin 缺失过多,因此直接删除;age
用中位数填补,因为年龄属于连续变量,且中位数对极端值更稳健;embarked
用众数填补,因为它是分类变量且缺失量较少;survived、pclass、sex、embarked
都转为 factor,便于后续分析。再检查一次缺失值,确认清洗结果。
clean_missing_table <- data.frame(
variable = names(df_clean),
missing_count = colSums(is.na(df_clean)),
missing_rate = colSums(is.na(df_clean)) / nrow(df_clean)
)
kable(clean_missing_table, digits = 3)
| variable | missing_count | missing_rate | |
|---|---|---|---|
| passenger_id | passenger_id | 0 | 0 |
| survived | survived | 0 | 0 |
| pclass | pclass | 0 | 0 |
| name | name | 0 | 0 |
| sex | sex | 0 | 0 |
| age | age | 0 | 0 |
| sib_sp | sib_sp | 0 | 0 |
| parch | parch | 0 | 0 |
| ticket | ticket | 0 | 0 |
| fare | fare | 0 | 0 |
| embarked | embarked | 0 | 0 |
为了让后续图形和模型更有解释力,这里构造几个新变量:
family_size:家庭规模;is_alone:是否独自出行;age_group:年龄分组;log_fare:票价对数变换。df_clean <- df_clean %>%
mutate(
family_size = sib_sp + parch + 1,
is_alone = if_else(family_size == 1, "Alone", "With Family"),
is_alone = factor(is_alone),
age_group = case_when(
age < 12 ~ "Child",
age >= 12 & age < 18 ~ "Teenager",
age >= 18 & age < 60 ~ "Adult",
age >= 60 ~ "Senior"
),
age_group = factor(age_group, levels = c("Child", "Teenager", "Adult", "Senior")),
log_fare = log(fare + 1)
)
summary(df_clean[, c("age", "fare", "family_size", "log_fare")])
## age fare family_size log_fare
## Min. : 0.42 Min. : 0.00 Min. : 1.000 Min. :0.000
## 1st Qu.:22.00 1st Qu.: 7.91 1st Qu.: 1.000 1st Qu.:2.187
## Median :28.00 Median : 14.45 Median : 1.000 Median :2.738
## Mean :29.36 Mean : 32.20 Mean : 1.905 Mean :2.962
## 3rd Qu.:35.00 3rd Qu.: 31.00 3rd Qu.: 2.000 3rd Qu.:3.466
## Max. :80.00 Max. :512.33 Max. :11.000 Max. :6.241
这些变量的作用是:
family_size
可以帮助我们观察家庭同行是否与生还有关;is_alone 可以让解释更直观;age_group 有助于做分组比较图;log_fare 用于减弱票价分布的偏态问题。下面给出若干核心变量的描述统计。
df_clean %>%
summarise(
sample_size = n(),
mean_age = mean(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
mean_fare = mean(fare, na.rm = TRUE),
median_fare = median(fare, na.rm = TRUE),
mean_family_size = mean(family_size, na.rm = TRUE)
) %>%
kable(digits = 2)
| sample_size | mean_age | median_age | mean_fare | median_fare | mean_family_size |
|---|---|---|---|---|---|
| 891 | 29.36 | 28 | 32.2 | 14.45 | 1.9 |
再看一下生还比例。
df_clean %>%
count(survived) %>%
mutate(proportion = n / sum(n)) %>%
kable(digits = 3)
| survived | n | proportion |
|---|---|---|
| No | 549 | 0.616 |
| Yes | 342 | 0.384 |
从描述统计可以初步了解样本结构,但更重要的信息还需要通过可视化来展示。
先观察性别与生还之间的关系。
ggplot(df_clean, aes(x = sex, fill = survived)) +
geom_bar(position = "fill") +
labs(
title = "Survival Rate by Sex",
x = "Sex",
y = "Proportion",
fill = "Survived"
) +
theme_minimal()
这张图展示了不同性别乘客的生还比例。
如果女性的生还比例明显高于男性,就说明性别可能是一个重要解释变量。
再观察乘客等级与生还率之间的关系。
ggplot(df_clean, aes(x = pclass, fill = survived)) +
geom_bar(position = "fill") +
labs(
title = "Survival Rate by Passenger Class",
x = "Passenger Class",
y = "Proportion",
fill = "Survived"
) +
theme_minimal()
这张图用于比较不同舱位等级的生还情况。
如果高等级舱位的生还率更高,说明社会经济地位可能与生还概率有关。
继续观察年龄与生还之间的关系。
ggplot(df_clean, aes(x = age, fill = survived)) +
geom_histogram(binwidth = 5, alpha = 0.6, position = "identity") +
labs(
title = "Age Distribution by Survival Status",
x = "Age",
y = "Count",
fill = "Survived"
) +
theme_minimal()
这张图可以帮助我们判断不同年龄区间的人群是否表现出不同的生还模式。
下面用箱线图比较生还者和未生还者的票价分布。
ggplot(df_clean, aes(x = survived, y = fare, fill = survived)) +
geom_boxplot() +
labs(
title = "Fare by Survival Status",
x = "Survived",
y = "Fare"
) +
theme_minimal()
如果生还者整体票价更高,那么 fare
可能与生还概率存在正相关关系。
进一步比较不同年龄组的生还情况。
ggplot(df_clean, aes(x = age_group, fill = survived)) +
geom_bar(position = "fill") +
labs(
title = "Survival Rate by Age Group",
x = "Age Group",
y = "Proportion",
fill = "Survived"
) +
theme_minimal()
这张图可以让我们更直观地比较儿童、青少年、成年人和老年人的生还比例。
最后做一张稍微更有分析性的图,同时看性别和舱位等级。
ggplot(df_clean, aes(x = pclass, fill = survived)) +
geom_bar(position = "fill") +
facet_wrap(~ sex) +
labs(
title = "Survival Rate by Class and Sex",
x = "Passenger Class",
y = "Proportion",
fill = "Survived"
) +
theme_minimal()
这张图的意义在于:
我们不仅可以看性别差异,还可以看这种差异是否在不同舱位等级中依然存在。
由于因变量 survived 是一个二元变量,因此这里使用
logistic regression,而不是普通最小二乘回归。
首先把因变量重新转为 0/1 数值形式,便于建模。
df_model <- df_clean %>%
mutate(
survived_num = if_else(survived == "Yes", 1, 0)
)
第一个模型包含最基础的几个变量:性别、年龄、舱位等级和票价。
model_1 <- glm(
survived_num ~ sex + age + pclass + fare,
data = df_model,
family = binomial
)
summary(model_1)
##
## Call:
## glm(formula = survived_num ~ sex + age + pclass + fare, family = binomial,
## data = df_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4577090 0.4159040 8.314 < 2e-16 ***
## sexmale -2.6047131 0.1875578 -13.888 < 2e-16 ***
## age -0.0328894 0.0074292 -4.427 9.55e-06 ***
## pclass2 -1.0706011 0.2854780 -3.750 0.000177 ***
## pclass3 -2.2825075 0.2804322 -8.139 3.98e-16 ***
## fare 0.0007582 0.0020994 0.361 0.717995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.7 on 890 degrees of freedom
## Residual deviance: 805.4 on 885 degrees of freedom
## AIC: 817.4
##
## Number of Fisher Scoring iterations: 5
下面整理一下回归结果表。
tidy(model_1) %>%
kable(digits = 4)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.4577 | 0.4159 | 8.3137 | 0.0000 |
| sexmale | -2.6047 | 0.1876 | -13.8875 | 0.0000 |
| age | -0.0329 | 0.0074 | -4.4271 | 0.0000 |
| pclass2 | -1.0706 | 0.2855 | -3.7502 | 0.0002 |
| pclass3 | -2.2825 | 0.2804 | -8.1392 | 0.0000 |
| fare | 0.0008 | 0.0021 | 0.3611 | 0.7180 |
这个模型的解释思路是:
sex 的系数可以帮助我们判断性别与生还概率的关系;age 的系数说明年龄变化对生还概率的影响方向;pclass 反映舱位等级的差异;fare 反映票价与生还之间的关系。第二个模型加入 embarked、is_alone 和
log_fare,看看结果是否更稳定。
model_2 <- glm(
survived_num ~ sex + age + pclass + log_fare + embarked + is_alone,
data = df_model,
family = binomial
)
summary(model_2)
##
## Call:
## glm(formula = survived_num ~ sex + age + pclass + log_fare +
## embarked + is_alone, family = binomial, data = df_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.60575 0.70046 5.148 2.64e-07 ***
## sexmale -2.58951 0.19661 -13.171 < 2e-16 ***
## age -0.03323 0.00756 -4.395 1.11e-05 ***
## pclass2 -0.88877 0.30490 -2.915 0.00356 **
## pclass3 -2.19776 0.32289 -6.807 9.99e-12 ***
## log_fare 0.07596 0.14483 0.524 0.59997
## embarkedQ -0.01591 0.37299 -0.043 0.96597
## embarkedS -0.53445 0.23518 -2.272 0.02306 *
## is_aloneWith Family -0.12489 0.22337 -0.559 0.57610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 797.95 on 882 degrees of freedom
## AIC: 815.95
##
## Number of Fisher Scoring iterations: 5
再展示扩展模型的回归表。
tidy(model_2) %>%
kable(digits = 4)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.6058 | 0.7005 | 5.1477 | 0.0000 |
| sexmale | -2.5895 | 0.1966 | -13.1708 | 0.0000 |
| age | -0.0332 | 0.0076 | -4.3951 | 0.0000 |
| pclass2 | -0.8888 | 0.3049 | -2.9149 | 0.0036 |
| pclass3 | -2.1978 | 0.3229 | -6.8066 | 0.0000 |
| log_fare | 0.0760 | 0.1448 | 0.5244 | 0.6000 |
| embarkedQ | -0.0159 | 0.3730 | -0.0427 | 0.9660 |
| embarkedS | -0.5344 | 0.2352 | -2.2725 | 0.0231 |
| is_aloneWith Family | -0.1249 | 0.2234 | -0.5591 | 0.5761 |
我们还可以简单比较两个模型的 AIC。
model_compare <- data.frame(
Model = c("Model 1", "Model 2"),
AIC = c(AIC(model_1), AIC(model_2)),
BIC = c(BIC(model_1), BIC(model_2))
)
kable(model_compare, digits = 2)
| Model | AIC | BIC |
|---|---|---|
| Model 1 | 817.40 | 846.15 |
| Model 2 | 815.95 | 859.08 |
如果第二个模型的 AIC 更低,通常说明在拟合和复杂度之间取得了更好的平衡。
根据前面的图形和模型结果,我们可以做出如下解释:
需要强调的是,这里的回归结果展示的是相关关系,而不是严格的因果关系。
为了进一步展示模型效果,这里可以计算预测概率,并查看部分结果。
df_model$predicted_prob <- predict(model_2, type = "response")
df_model %>%
select(name, survived, sex, age, pclass, fare, predicted_prob) %>%
head(10) %>%
kable(digits = 3)
| name | survived | sex | age | pclass | fare | predicted_prob |
|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | No | male | 22 | 3 | 7.250 | 0.082 |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | Yes | female | 38 | 1 | 71.283 | 0.927 |
| Heikkinen, Miss. Laina | Yes | female | 26 | 3 | 7.925 | 0.544 |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | Yes | female | 35 | 1 | 53.100 | 0.890 |
| Allen, Mr. William Henry | No | male | 35 | 3 | 8.050 | 0.062 |
| Moran, Mr. James | No | male | 28 | 3 | 8.458 | 0.124 |
| McCarthy, Mr. Timothy J | No | male | 54 | 1 | 51.862 | 0.267 |
| Palsson, Master. Gosta Leonard | No | male | 2 | 3 | 21.075 | 0.158 |
| Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | Yes | female | 27 | 3 | 11.133 | 0.510 |
| Nasser, Mrs. Nicholas (Adele Achem) | Yes | female | 14 | 2 | 30.071 | 0.916 |
这一步可以帮助我们理解 logistic regression 的输出:
模型不是直接给出“生还/未生还”的绝对判断,而是给出每个个体生还的预测概率。
本例基于 Titanic 数据,对乘客生还情况进行了一个完整的小型数据分析。
主要结论包括:
这份数据适合作为课堂分析案例,因为它结构清晰、变量丰富、图形效果明显,也便于建立可解释的模型。
但同时也要注意,这份数据仍然存在缺失值、变量遗漏和因果识别不足等局限,因此结论应理解为统计关联而非因果证明。