在当今的招聘过程中,雇主非常谨慎地确保找到适合其空缺职位的合适人选。他们审查无数简历,进行电话筛选和面试,甚至可能让候选人参加技能评估。
评估新员工最重要的方面之一是他们的性格是否适合团队和职位。招聘经理可以使用这些信息来了解候选人是外向还是内向,他们如何沟通,什么激励他们,或者他们面对挑战时的韧性如何——这些都是特定角色的关键方面。
在需要高度人际互动的角色中(如销售),了解候选人如何沟通、如何与他人互动,或者他们是否关注细节或看到大局,可能是有益的。
迈尔斯-布里格斯类型指标(MBTI)测试可以说是当今最知名和最常用的人格评估之一。89%的财富100强公司在招聘过程中或专业发展能力中使用迈尔斯-布里格斯评估。
迈尔斯-布里格斯类型学基于荣格的心理类型理论。它由凯瑟琳·库克·布里格斯和伊莎贝尔·布里格斯·迈尔斯母女团队构建。凯瑟琳·布里格斯曾使用荣格的人格概念来分析文学中的人物。
该理论的本质是,行为中看似随机的许多变化实际上是相当有序和一致的,这是由于个人偏好使用感知和判断方式的基本差异造成的。
MBTI沿着4个不同的特征评估人格,代表一个人根据他们的偏好(而非能力)和自然倾向处理周围信息的方式。四个特征是:
心理工具或人格评估通常用于帮助众多组织和机构中的人员进行领导力、影响力、变革、职业发展、团队合作、冲突管理、管理他人、发展关系等方面的工作。
然而,进行完整的评估可能会遇到以下问题:
通过开发一个能够通过某人在社交媒体帖子中的短文本消息来预测人格类型的模型,职业顾问、招聘人员和招聘经理可以:
数据集
为了构建预测模型,我们使用了来自Kaggle的数据集,该数据集通过PersonalityCafe论坛收集,因为它提供了大量人员及其MBTI人格类型,以及他们所写的内容。
该数据集包含超过8600行数据,每行包含一个人的:
方法
对数据进行清洗和EDA
使用词频-逆文档频率(TF-IDF)向量化器创建约70k+特征的词向量
采用两种方法进行分类:
方法1:多分类器(16种类型)
方法2:二分类器(每个MBTI特征)
# 数据处理
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
# 文本处理
library(tm)
library(SnowballC)
library(textstem)
library(tokenizers)
library(Matrix) # 用于稀疏矩阵
# 机器学习
library(caret)
library(e1071)
library(nnet) # 用于multinom
library(randomForest)
library(xgboost)
library(ROCR)
library(pROC)
library(glmnet) # 用于正则化逻辑回归
# 不平衡数据处理
library(ROSE)
# 注意:DMwR包已不再维护,我们使用ROSE包替代
# 可视化
library(ggplot2)
library(ggthemes)
library(wordcloud)
library(RColorBrewer)
library(gridExtra)
library(corrplot)
# 其他工具
library(knitr)
library(kableExtra)# 读取原始数据
df_raw <- read_csv("/Users/luoyuqing/Desktop/回归分析/Capstone-MBTI-Prediction/mbti_1.csv")
# 检查数据维度
cat("数据维度:", dim(df_raw), "\n")## 数据维度: 8675 2
## 列名: type posts
## # A tibble: 3 × 2
## type posts
## <chr> <chr>
## 1 INFJ 'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.co…
## 2 ENTP 'I'm finding the lack of me in these posts very alarming.|||Sex can be …
## 3 INTP 'Good one _____ https://www.youtube.com/watch?v=fHiGbolFFGw|||Of cou…
## 缺失值统计:
## type posts
## 0 0
# 计算每种类型的分布
type_dist <- df_raw %>%
count(type, sort = TRUE) %>%
mutate(percentage = round(n / sum(n) * 100, 2))
# 显示分布
kable(type_dist, caption = "MBTI类型分布") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| type | n | percentage |
|---|---|---|
| INFP | 1832 | 21.12 |
| INFJ | 1470 | 16.95 |
| INTP | 1304 | 15.03 |
| INTJ | 1091 | 12.58 |
| ENTP | 685 | 7.90 |
| ENFP | 675 | 7.78 |
| ISTP | 337 | 3.88 |
| ISFP | 271 | 3.12 |
| ENTJ | 231 | 2.66 |
| ISTJ | 205 | 2.36 |
| ENFJ | 190 | 2.19 |
| ISFJ | 166 | 1.91 |
| ESTP | 89 | 1.03 |
| ESFP | 48 | 0.55 |
| ESFJ | 42 | 0.48 |
| ESTJ | 39 | 0.45 |
# 可视化分布
ggplot(type_dist, aes(x = reorder(type, n), y = n, fill = n)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "MBTI类型分布",
x = "MBTI类型",
y = "数量") +
theme_minimal() +
theme(legend.position = "none")数据集不太平衡!与ESFJ相比,我们有更多的INFP发帖。初步想法可能是E型(外向者)更喜欢与人们在一起并在忙碌、活跃的环境中度过时光,因此不像内向者那样经常使用在线社交媒体。
# 计算每个用户的帖子数量
df_raw$num_posts <- str_count(df_raw$posts, "\\|\\|\\|") + 1
# 检查帖子数量分布
cat("帖子数量统计:\n")## 帖子数量统计:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 50.00 50.00 48.74 50.00 89.00
##
## 没有50条帖子的记录数: 1088
# 从类型中提取四个维度
df_raw$ie <- substr(df_raw$type, 1, 1) # I or E
df_raw$ns <- substr(df_raw$type, 2, 2) # N or S
df_raw$tf <- substr(df_raw$type, 3, 3) # T or F
df_raw$jp <- substr(df_raw$type, 4, 4) # J or P
# 查看维度分布
cat("I vs E 分布:\n")## I vs E 分布:
##
## E I
## 0.2304323 0.7695677
##
## N vs S 分布:
##
## N S
## 0.8620173 0.1379827
##
## T vs F 分布:
##
## F T
## 0.5410951 0.4589049
##
## J vs P 分布:
##
## J P
## 0.3958501 0.6041499
# 创建文本清洗函数
clean_text <- function(text, add_stopwords = NULL) {
# 转换为小写
text <- tolower(text)
# 移除URL
text <- str_replace_all(text, "https?://[\\S]+", "urlstr")
text <- str_replace_all(text, "www\\.[\\S]+", "urlstr")
# 移除非字母字符(保留空格)
text <- str_replace_all(text, "[^a-zA-Z\\s]", " ")
# 分词
words <- unlist(str_split(text, "\\s+"))
# 移除空字符串
words <- words[words != ""]
# 移除停用词
stopwords_en <- stopwords("en")
if (!is.null(add_stopwords)) {
stopwords_en <- c(stopwords_en, add_stopwords)
}
words <- words[!words %in% stopwords_en]
# 返回清洗后的文本(用空格连接)
return(paste(words, collapse = " "))
}
# 创建MBTI类型相关的停用词
type_stopwords <- tolower(unique(df_raw$type))
type_stopwords <- c(type_stopwords, paste0(type_stopwords, "s"))
# 应用清洗函数
cat("开始清洗文本...\n")## 开始清洗文本...
df_raw$cleaned_str <- sapply(df_raw$posts, function(x) {
clean_text(x, add_stopwords = type_stopwords)
}, USE.NAMES = FALSE)
cat("文本清洗完成!\n")## 文本清洗完成!
# 计算每种类型的平均帖子数
posts_by_type <- df_raw %>%
group_by(type) %>%
summarise(avg_posts = mean(num_posts), .groups = 'drop') %>%
arrange(desc(avg_posts))
kable(posts_by_type, caption = "每种类型的平均帖子数") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| type | avg_posts |
|---|---|
| ENTP | 49.28613 |
| ESTJ | 49.25641 |
| INFJ | 49.05102 |
| INFP | 49.01528 |
| ISTP | 48.95549 |
| ISFJ | 48.92169 |
| ENFJ | 48.88421 |
| ENTJ | 48.80087 |
| ESTP | 48.73034 |
| INTP | 48.58819 |
| ENFP | 48.54667 |
| ISTJ | 48.35610 |
| INTJ | 48.09441 |
| ESFJ | 48.04762 |
| ISFP | 47.97048 |
| ESFP | 46.14583 |
# 计算每种类型的平均HTTP链接数
http_by_type <- df_raw %>%
group_by(type) %>%
summarise(avg_http = mean(num_http), .groups = 'drop') %>%
arrange(desc(avg_http))
kable(http_by_type, caption = "每种类型的平均HTTP链接数") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| type | avg_http |
|---|---|
| ISFP | 4.416974 |
| ISTP | 4.050445 |
| INFP | 3.771288 |
| INTP | 3.538344 |
| ISFJ | 3.530121 |
| INFJ | 3.293878 |
| ESTP | 3.235955 |
| INTJ | 3.179652 |
| ISTJ | 3.058537 |
| ESFP | 2.770833 |
| ENFJ | 2.663158 |
| ENTJ | 2.658009 |
| ESTJ | 2.641026 |
| ENFP | 2.522963 |
| ENTP | 2.413139 |
| ESFJ | 1.357143 |
# 计算每种类型的平均每条评论词数
words_by_type <- df_raw %>%
group_by(type) %>%
summarise(avg_words = mean(words_per_comment), .groups = 'drop') %>%
arrange(desc(avg_words))
kable(words_by_type, caption = "每种类型的平均每条评论词数") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| type | avg_words |
|---|---|
| ESFJ | 28.94192 |
| ENFJ | 28.61430 |
| INFJ | 28.49387 |
| ENFP | 28.30968 |
| INFP | 27.85494 |
| ISFJ | 27.83624 |
| ISTJ | 27.48220 |
| ENTJ | 27.26260 |
| ESTJ | 27.23119 |
| INTJ | 27.18911 |
| INTP | 27.12660 |
| ENTP | 26.84489 |
| ISTP | 26.36661 |
| ISFP | 26.32428 |
| ESTP | 26.09014 |
| ESFP | 24.50116 |
# 可视化
ggplot(df_raw, aes(x = type, y = words_per_comment)) +
geom_violin(fill = "lightgray") +
geom_boxplot(width = 0.1, fill = "white") +
labs(title = "各人格类型每条评论的词数分布",
x = "MBTI类型",
y = "每条评论的词数") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# 为每种类型创建词云
create_wordcloud <- function(text_data, title) {
# 合并所有文本
text_combined <- paste(text_data, collapse = " ")
# 创建词云
wordcloud(words = unlist(str_split(text_combined, "\\s+")),
min.freq = 5,
max.words = 100,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
title(main = title, cex.main = 1.5)
}
# 选择几种类型进行词云展示
selected_types <- c("INFP", "INTJ", "ENTP", "ESTP")
par(mfrow = c(2, 2))
for (type in selected_types) {
type_texts <- df_raw$cleaned_str[df_raw$type == type]
if (length(type_texts) > 0) {
create_wordcloud(type_texts, type)
}
}从词云和最常用词的图表来看,不同类型之间存在很多相似性。实际上所有类型都大量出现’like’、‘think’、‘people’、‘would’、’really’等词。
注意:TF-IDF向量化已在Python中完成。如果还没有运行预处理脚本,请先运行:
# 读取Python预处理好的TF-IDF数据
data_dir <- "/Users/luoyuqing/Desktop/回归分析/Capstone-MBTI-Prediction/processed_data"
cat("读取预处理好的数据...\n")## 读取预处理好的数据...
# 检查数据文件是否存在
if (!dir.exists(data_dir)) {
stop("预处理数据目录不存在!请先运行 preprocess_tfidf.py 脚本。")
}
# 读取多分类数据
cat("读取多分类数据...\n")## 读取多分类数据...
X_train_tfidf_multi <- read_csv(file.path(data_dir, "X_train_multi.csv"),
show_col_types = FALSE) %>% as.matrix()
X_test_tfidf_multi <- read_csv(file.path(data_dir, "X_test_multi.csv"),
show_col_types = FALSE) %>% as.matrix()
y_train_multi <- read_csv(file.path(data_dir, "y_train_multi.csv"),
show_col_types = FALSE)$type
y_test_multi <- read_csv(file.path(data_dir, "y_test_multi.csv"),
show_col_types = FALSE)$type
cat("多分类数据维度:\n")## 多分类数据维度:
## 训练集特征: 3350 2000
## 测试集特征: 1650 2000
## 训练集标签: 3350
## 测试集标签: 1650
##
## 读取二分类数据...
X_train_tfidf_binary <- read_csv(file.path(data_dir, "X_train_binary.csv"),
show_col_types = FALSE) %>% as.matrix()
X_test_tfidf_binary <- read_csv(file.path(data_dir, "X_test_binary.csv"),
show_col_types = FALSE) %>% as.matrix()
y_train_binary <- read_csv(file.path(data_dir, "y_train_binary.csv"), show_col_types = FALSE)
y_test_binary <- read_csv(file.path(data_dir, "y_test_binary.csv"), show_col_types = FALSE)
y_train_ie <- y_train_binary$y_train_ie
y_test_ie <- y_test_binary$y_test_ie
y_train_ns <- y_train_binary$y_train_ns
y_test_ns <- y_test_binary$y_test_ns
y_train_tf <- y_train_binary$y_train_tf
y_test_tf <- y_test_binary$y_test_tf
y_train_jp <- y_train_binary$y_train_jp
y_test_jp <- y_test_binary$y_test_jp
cat("二分类数据维度:\n")## 二分类数据维度:
## 训练集特征: 3350 2000
## 测试集特征: 1650 2000
##
## 数据加载完成!
# 计算基线准确率(多数类)
baseline_acc <- max(table(y_test_multi)) / length(y_test_multi)
cat("基线准确率(多数类):", round(baseline_acc, 3), "\n")## 基线准确率(多数类): 0.216
基线模型准确率为 0.211(基于多数类INFP)。
## 训练逻辑回归模型(这可能需要几分钟)...
## 步骤1: 选择最重要的特征...
if (ncol(X_train_tfidf_multi) > 500) {
# 使用方差选择top特征
var_importance <- apply(X_train_tfidf_multi, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_lr_reduced <- X_train_tfidf_multi[, top_features_idx]
X_test_lr_reduced <- X_test_tfidf_multi[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_lr_reduced <- X_train_tfidf_multi
X_test_lr_reduced <- X_test_tfidf_multi
}## 选择了500个最重要的特征
# 限制训练样本数量(不过采样,直接使用原始数据以加快速度)
if (nrow(X_train_lr_reduced) > 1500) {
cat("步骤2: 采样1500个训练样本以加快速度...\n")
sample_idx <- sample(nrow(X_train_lr_reduced), 1500)
X_train_lr_final <- X_train_lr_reduced[sample_idx, ]
y_train_lr_final <- factor(y_train_multi[sample_idx])
} else {
X_train_lr_final <- X_train_lr_reduced
y_train_lr_final <- factor(y_train_multi)
}## 步骤2: 采样1500个训练样本以加快速度...
# 准备数据框
train_data_lr_final <- data.frame(X_train_lr_final, type = y_train_lr_final)
cat("步骤3: 训练模型(样本数:", nrow(train_data_lr_final), ", 特征数:", ncol(X_train_lr_final), ")...\n")## 步骤3: 训练模型(样本数: 1500 , 特征数: 500 )...
## 这可能需要1-3分钟,请耐心等待...
# 训练模型(使用更少的权重以加快速度)
lr_model <- multinom(type ~ ., data = train_data_lr_final,
MaxNWts = 50000, # 减少最大权重数
trace = TRUE, # 显示进度
maxit = 100) # 限制迭代次数## # weights: 8032 (7515 variable)
## initial value 4158.883083
## iter 10 value 2693.994334
## iter 20 value 1322.653182
## iter 30 value 142.740204
## iter 40 value 2.849288
## iter 50 value 0.134418
## iter 60 value 0.011648
## iter 70 value 0.000138
## iter 70 value 0.000069
## iter 70 value 0.000069
## final value 0.000069
## converged
## 步骤4: 进行预测...
X_test_lr_final <- data.frame(X_test_lr_reduced)
# 确保测试集特征与训练集匹配
X_test_lr_final <- X_test_lr_final[, colnames(X_train_lr_final)[-ncol(train_data_lr_final)]]
lr_pred <- predict(lr_model, newdata = X_test_lr_final)
lr_pred_proba <- predict(lr_model, newdata = X_test_lr_final, type = "probs")
# 评估
lr_cm <- confusionMatrix(factor(lr_pred), factor(y_test_multi))
lr_accuracy <- lr_cm$overall["Accuracy"]
# 计算F1分数
calculate_f1 <- function(y_true, y_pred) {
cm <- confusionMatrix(factor(y_pred), factor(y_true))
precision <- cm$byClass[,"Precision"]
recall <- cm$byClass[,"Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
f1[is.nan(f1)] <- 0
return(mean(f1, na.rm = TRUE))
}
lr_f1 <- calculate_f1(y_test_multi, lr_pred)
cat("逻辑回归结果:\n")## 逻辑回归结果:
## 准确率: 0.25
## F1分数: 0.133
## Reference
## Prediction ENFJ ENFP ENTJ ENTP ESFJ ESFP ESTJ ESTP INFJ INFP INTJ INTP ISFJ
## ENFJ 3 1 1 2 0 0 0 0 6 2 2 3 1
## ENFP 2 32 4 14 1 2 0 3 18 37 9 5 4
## ENTJ 0 0 2 1 0 1 1 0 4 5 0 4 0
## ENTP 1 12 8 18 0 1 1 0 14 13 13 30 2
## ESFJ 0 0 0 0 0 0 0 1 1 1 0 1 1
## ESFP 0 2 1 1 0 0 0 0 1 0 0 0 0
## ESTJ 1 2 1 1 0 0 1 1 2 2 0 1 1
## ESTP 1 0 0 0 0 0 0 0 0 0 1 0 0
## INFJ 11 22 6 13 1 1 1 4 88 91 33 39 6
## INFP 8 32 4 24 1 1 2 1 76 124 28 29 7
## INTJ 3 8 9 20 2 2 2 3 35 26 55 40 3
## INTP 2 9 4 22 1 0 1 1 23 40 41 77 1
## ISFJ 1 0 0 1 2 0 0 0 3 2 1 4 2
## ISFP 2 4 1 5 0 0 0 1 6 13 8 7 0
## ISTJ 0 2 1 1 0 0 0 0 1 1 4 5 1
## ISTP 0 1 1 7 0 0 0 1 5 0 3 6 1
## Reference
## Prediction ISFP ISTJ ISTP
## ENFJ 2 0 1
## ENFP 3 3 2
## ENTJ 0 0 2
## ENTP 2 3 7
## ESFJ 0 0 1
## ESFP 0 0 0
## ESTJ 2 1 0
## ESTP 0 1 0
## INFJ 7 4 9
## INFP 22 11 3
## INTJ 3 5 13
## INTP 6 9 15
## ISFJ 0 0 0
## ISFP 2 0 3
## ISTJ 1 2 0
## ISTP 1 3 6
## 训练朴素贝叶斯模型...
# 准备数据(朴素贝叶斯需要非负值)
X_train_nb <- X_train_tfidf_multi
X_train_nb[X_train_nb < 0] <- 0
X_test_nb <- X_test_tfidf_multi
X_test_nb[X_test_nb < 0] <- 0
# 限制特征和样本数量(不过采样以加快速度)
cat("步骤1: 选择特征和样本...\n")## 步骤1: 选择特征和样本...
if (ncol(X_train_nb) > 500) {
var_importance <- apply(X_train_nb, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_nb_reduced <- X_train_nb[, top_features_idx]
X_test_nb_reduced <- X_test_nb[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_nb_reduced <- X_train_nb
X_test_nb_reduced <- X_test_nb
}## 选择了500个最重要的特征
if (nrow(X_train_nb_reduced) > 1500) {
cat("步骤2: 采样1500个训练样本...\n")
sample_idx <- sample(nrow(X_train_nb_reduced), 1500)
X_train_nb_smote <- X_train_nb_reduced[sample_idx, ]
y_train_nb_smote <- factor(y_train_multi[sample_idx])
X_test_nb <- X_test_nb_reduced
} else {
X_train_nb_smote <- X_train_nb_reduced
y_train_nb_smote <- factor(y_train_multi)
X_test_nb <- X_test_nb_reduced
}## 步骤2: 采样1500个训练样本...
## 步骤3: 训练模型...
nb_model <- naiveBayes(x = X_train_nb_smote, y = y_train_nb_smote)
# 预测
nb_pred <- predict(nb_model, newdata = X_test_nb)
nb_pred_proba <- predict(nb_model, newdata = X_test_nb, type = "raw")
# 评估
nb_cm <- confusionMatrix(factor(nb_pred), factor(y_test_multi))
nb_accuracy <- nb_cm$overall["Accuracy"]
nb_f1 <- calculate_f1(y_test_multi, nb_pred)
cat("朴素贝叶斯结果:\n")## 朴素贝叶斯结果:
## 准确率: 0.105
## F1分数: 0.088
## 训练随机森林模型(这可能需要一些时间)...
## 步骤1: 选择特征和样本...
if (ncol(X_train_tfidf_multi) > 500) {
var_importance <- apply(X_train_tfidf_multi, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_rf_reduced <- X_train_tfidf_multi[, top_features_idx]
X_test_rf_reduced <- X_test_tfidf_multi[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_rf_reduced <- X_train_tfidf_multi
X_test_rf_reduced <- X_test_tfidf_multi
}## 选择了500个最重要的特征
if (nrow(X_train_rf_reduced) > 1000) {
cat("步骤2: 采样1000个训练样本(随机森林较慢)...\n")
sample_idx <- sample(nrow(X_train_rf_reduced), 1000)
X_train_rf <- X_train_rf_reduced[sample_idx, ]
y_train_rf <- factor(y_train_multi[sample_idx])
X_test_rf <- X_test_rf_reduced
} else {
X_train_rf <- X_train_rf_reduced
y_train_rf <- factor(y_train_multi)
X_test_rf <- X_test_rf_reduced
}## 步骤2: 采样1000个训练样本(随机森林较慢)...
# 准备测试数据
X_test_rf <- data.frame(X_test_rf)
X_test_rf <- X_test_rf[, colnames(X_train_rf)]
# 训练模型(限制树的数量和深度以加快速度)
cat("步骤3: 训练随机森林模型(这可能需要几分钟)...\n")## 步骤3: 训练随机森林模型(这可能需要几分钟)...
set.seed(42)
rf_model <- randomForest(x = X_train_rf,
y = y_train_rf,
ntree = 30, # 进一步减少树的数量
mtry = sqrt(ncol(X_train_rf)),
maxnodes = 20, # 减少节点数
importance = TRUE)
# 预测
rf_pred <- predict(rf_model, newdata = X_test_rf)
rf_pred_proba <- predict(rf_model, newdata = X_test_rf, type = "prob")
# 评估
rf_cm <- confusionMatrix(factor(rf_pred), factor(y_test_multi))
rf_accuracy <- rf_cm$overall["Accuracy"]
rf_f1 <- calculate_f1(y_test_multi, rf_pred)
cat("随机森林结果:\n")## 随机森林结果:
## 准确率: 0.232
## F1分数: 0.252
# 创建性能总结表
multiclass_results <- data.frame(
Model = c("基线", "逻辑回归", "朴素贝叶斯", "随机森林"),
Accuracy = c(baseline_acc,
as.numeric(lr_accuracy),
as.numeric(nb_accuracy),
as.numeric(rf_accuracy)),
F1_Score = c(NA, lr_f1, nb_f1, rf_f1)
)
kable(multiclass_results, caption = "多分类模型性能总结", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Model | Accuracy | F1_Score |
|---|---|---|
| 基线 | | 0.216| | NA| |
| 逻辑回归 | | 0.250| | 0.133| |
| 朴素贝叶斯 | | .105| | .088| |
| 随机森林 | | 0.232| | 0.252| |
# 可视化
ggplot(multiclass_results, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
labs(title = "多分类模型准确率比较",
x = "模型",
y = "准确率") +
theme_minimal() +
theme(legend.position = "none") +
geom_text(aes(label = round(Accuracy, 3)), vjust = -0.5)多分类模型的性能不佳,最佳模型逻辑回归的准确率仅为约35-36%,F1分数约为0.23-0.36。虽然这仍然比我们的基线模型稍好一些,但这不可部署,因为它大多数时候只能预测35%的正确率,并且主要来自多数类(INFP类)。
因此,我们将使用另一种方法:创建4个二分类模型,基于4个维度对类型进行分类。
# 创建评估函数
evaluate_binary_model <- function(y_true, y_pred, y_proba, title) {
# 确保y_proba是向量
if (is.matrix(y_proba)) {
y_proba <- as.vector(y_proba)
}
# 混淆矩阵
cm <- confusionMatrix(factor(y_pred), factor(y_true))
# ROC曲线
roc_obj <- tryCatch({
roc(y_true, y_proba, quiet = TRUE)
}, error = function(e) {
# 如果ROC失败,返回NULL
return(NULL)
})
if (!is.null(roc_obj)) {
auc_score <- auc(roc_obj)
} else {
auc_score <- 0.5 # 默认值
}
# 计算指标
results <- data.frame(
Model = title,
Accuracy = as.numeric(cm$overall["Accuracy"]),
Sensitivity = ifelse(is.null(cm$byClass["Sensitivity"]), NA, as.numeric(cm$byClass["Sensitivity"])),
Specificity = ifelse(is.null(cm$byClass["Specificity"]), NA, as.numeric(cm$byClass["Specificity"])),
Precision = ifelse(is.null(cm$byClass["Precision"]), NA, as.numeric(cm$byClass["Precision"])),
F1_Score = ifelse(is.null(cm$byClass["F1"]), NA, as.numeric(cm$byClass["F1"])),
AUC = as.numeric(auc_score)
)
return(list(results = results, cm = cm, roc = roc_obj))
}## === I vs E 建模 ===
## 步骤1: 选择最重要的特征...
if (ncol(X_train_tfidf_binary) > 500) {
var_importance <- apply(X_train_tfidf_binary, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_ie_reduced <- X_train_tfidf_binary[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_ie_reduced <- X_train_tfidf_binary
}## 选择了500个最重要的特征
# 限制训练样本数量以加快过采样速度
if (nrow(X_train_ie_reduced) > 2000) {
cat("步骤2: 采样2000个训练样本以加快过采样...\n")
sample_idx <- sample(nrow(X_train_ie_reduced), 2000)
X_train_ie_sample <- X_train_ie_reduced[sample_idx, ]
y_train_ie_sample <- y_train_ie[sample_idx]
} else {
X_train_ie_sample <- X_train_ie_reduced
y_train_ie_sample <- y_train_ie
}## 步骤2: 采样2000个训练样本以加快过采样...
# 准备数据框用于过采样
train_data_ie <- data.frame(X_train_ie_sample)
train_data_ie$target <- factor(y_train_ie_sample)
cat("步骤3: 应用ROSE过采样(这可能需要1-2分钟)...\n")## 步骤3: 应用ROSE过采样(这可能需要1-2分钟)...
# 限制过采样后的样本数量
train_data_ie_smote <- ovun.sample(target ~ ., data = train_data_ie,
method = "both",
p = 0.5,
N = min(nrow(train_data_ie) * 1.5, 3000), # 限制最大样本数
seed = 42)$data
cat("步骤4: 训练模型...\n")## 步骤4: 训练模型...
X_train_ie <- as.matrix(train_data_ie_smote[, -ncol(train_data_ie_smote)])
y_train_ie_smote <- as.numeric(as.character(train_data_ie_smote$target))
# 逻辑回归(使用正则化以避免过拟合)
library(glmnet)
cv_lasso_ie <- cv.glmnet(X_train_ie, y_train_ie_smote, family = "binomial", alpha = 1, nfolds = 3) # 减少交叉验证折数
lr_ie <- glmnet(X_train_ie, y_train_ie_smote, family = "binomial", alpha = 1, lambda = cv_lasso_ie$lambda.min)
# 准备测试数据
X_test_ie <- X_test_tfidf_binary[, top_features_idx] # 使用相同的特征索引
lr_ie_pred_proba <- predict(lr_ie, newx = as.matrix(X_test_ie), type = "response", s = cv_lasso_ie$lambda.min)
lr_ie_pred <- ifelse(lr_ie_pred_proba > 0.5, 1, 0)
ie_lr_results <- evaluate_binary_model(y_test_ie, lr_ie_pred, lr_ie_pred_proba, "I-E Logistic Regression")
print(ie_lr_results$results)## Model Accuracy Sensitivity Specificity Precision F1_Score
## 1 I-E Logistic Regression 0.6606061 0.6868132 0.5718085 0.8445946 0.7575758
## AUC
## 1 0.6808365
# 绘制ROC曲线
if (!is.null(ie_lr_results$roc)) {
plot(ie_lr_results$roc, main = "I vs E - Logistic Regression ROC Curve",
col = "blue", lwd = 2)
legend("bottomright", legend = paste("AUC =", round(ie_lr_results$results$AUC, 3)),
col = "blue", lwd = 2)
} else {
cat("无法绘制ROC曲线\n")
}## === N vs S 建模 ===
## 步骤1: 选择最重要的特征...
if (ncol(X_train_tfidf_binary) > 500) {
var_importance <- apply(X_train_tfidf_binary, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_ns_reduced <- X_train_tfidf_binary[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_ns_reduced <- X_train_tfidf_binary
}## 选择了500个最重要的特征
# 限制训练样本数量
if (nrow(X_train_ns_reduced) > 2000) {
cat("步骤2: 采样2000个训练样本...\n")
sample_idx <- sample(nrow(X_train_ns_reduced), 2000)
X_train_ns_sample <- X_train_ns_reduced[sample_idx, ]
y_train_ns_sample <- y_train_ns[sample_idx]
} else {
X_train_ns_sample <- X_train_ns_reduced
y_train_ns_sample <- y_train_ns
}## 步骤2: 采样2000个训练样本...
train_data_ns <- data.frame(X_train_ns_sample)
train_data_ns$target <- factor(y_train_ns_sample)
cat("步骤3: 应用ROSE过采样(这可能需要1-2分钟)...\n")## 步骤3: 应用ROSE过采样(这可能需要1-2分钟)...
train_data_ns_smote <- ovun.sample(target ~ ., data = train_data_ns,
method = "both",
p = 0.5,
N = min(nrow(train_data_ns) * 1.5, 3000),
seed = 42)$data
cat("步骤4: 训练模型...\n")## 步骤4: 训练模型...
X_train_ns <- as.matrix(train_data_ns_smote[, -ncol(train_data_ns_smote)])
y_train_ns_smote <- as.numeric(as.character(train_data_ns_smote$target))
# 逻辑回归(使用正则化)
cv_lasso_ns <- cv.glmnet(X_train_ns, y_train_ns_smote, family = "binomial", alpha = 1, nfolds = 3)
lr_ns <- glmnet(X_train_ns, y_train_ns_smote, family = "binomial", alpha = 1, lambda = cv_lasso_ns$lambda.min)
# 准备测试数据
X_test_ns <- X_test_tfidf_binary[, top_features_idx]
lr_ns_pred_proba <- predict(lr_ns, newx = as.matrix(X_test_ns), type = "response", s = cv_lasso_ns$lambda.min)
lr_ns_pred <- ifelse(lr_ns_pred_proba > 0.5, 1, 0)
ns_lr_results <- evaluate_binary_model(y_test_ns, lr_ns_pred, lr_ns_pred_proba, "N-S Logistic Regression")
print(ns_lr_results$results)## Model Accuracy Sensitivity Specificity Precision F1_Score
## 1 N-S Logistic Regression 0.710303 0.7521067 0.4469027 0.8954849 0.8175573
## AUC
## 1 0.6542023
# 绘制ROC曲线
if (!is.null(ns_lr_results$roc)) {
plot(ns_lr_results$roc, main = "N vs S - Logistic Regression ROC Curve",
col = "orange", lwd = 2)
legend("bottomright", legend = paste("AUC =", round(ns_lr_results$results$AUC, 3)),
col = "orange", lwd = 2)
} else {
cat("无法绘制ROC曲线\n")
}## === T vs F 建模 ===
## 步骤1: 选择最重要的特征...
if (ncol(X_train_tfidf_binary) > 500) {
var_importance <- apply(X_train_tfidf_binary, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_tf_reduced <- X_train_tfidf_binary[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_tf_reduced <- X_train_tfidf_binary
}## 选择了500个最重要的特征
# 限制训练样本数量
if (nrow(X_train_tf_reduced) > 2000) {
cat("步骤2: 采样2000个训练样本...\n")
sample_idx <- sample(nrow(X_train_tf_reduced), 2000)
X_train_tf_sample <- X_train_tf_reduced[sample_idx, ]
y_train_tf_sample <- y_train_tf[sample_idx]
} else {
X_train_tf_sample <- X_train_tf_reduced
y_train_tf_sample <- y_train_tf
}## 步骤2: 采样2000个训练样本...
train_data_tf <- data.frame(X_train_tf_sample)
train_data_tf$target <- factor(y_train_tf_sample)
cat("步骤3: 应用ROSE过采样(这可能需要1-2分钟)...\n")## 步骤3: 应用ROSE过采样(这可能需要1-2分钟)...
train_data_tf_smote <- ovun.sample(target ~ ., data = train_data_tf,
method = "both",
p = 0.5,
N = min(nrow(train_data_tf) * 1.5, 3000),
seed = 42)$data
cat("步骤4: 训练模型...\n")## 步骤4: 训练模型...
X_train_tf <- as.matrix(train_data_tf_smote[, -ncol(train_data_tf_smote)])
y_train_tf_smote <- as.numeric(as.character(train_data_tf_smote$target))
# 逻辑回归(使用正则化)
cv_lasso_tf <- cv.glmnet(X_train_tf, y_train_tf_smote, family = "binomial", alpha = 1, nfolds = 3)
lr_tf <- glmnet(X_train_tf, y_train_tf_smote, family = "binomial", alpha = 1, lambda = cv_lasso_tf$lambda.min)
# 准备测试数据
X_test_tf <- X_test_tfidf_binary[, top_features_idx]
lr_tf_pred_proba <- predict(lr_tf, newx = as.matrix(X_test_tf), type = "response", s = cv_lasso_tf$lambda.min)
lr_tf_pred <- ifelse(lr_tf_pred_proba > 0.5, 1, 0)
tf_lr_results <- evaluate_binary_model(y_test_tf, lr_tf_pred, lr_tf_pred_proba, "T-F Logistic Regression")
print(tf_lr_results$results)## Model Accuracy Sensitivity Specificity Precision F1_Score
## 1 T-F Logistic Regression 0.7169697 0.6977364 0.7330367 0.6858639 0.6917492
## AUC
## 1 0.7855762
# 绘制ROC曲线
if (!is.null(tf_lr_results$roc)) {
plot(tf_lr_results$roc, main = "T vs F - Logistic Regression ROC Curve",
col = "deepskyblue", lwd = 2)
legend("bottomright", legend = paste("AUC =", round(tf_lr_results$results$AUC, 3)),
col = "deepskyblue", lwd = 2)
} else {
cat("无法绘制ROC曲线\n")
}## === J vs P 建模 ===
## 步骤1: 选择最重要的特征...
if (ncol(X_train_tfidf_binary) > 500) {
var_importance <- apply(X_train_tfidf_binary, 2, var)
top_features_idx <- order(var_importance, decreasing = TRUE)[1:500]
X_train_jp_reduced <- X_train_tfidf_binary[, top_features_idx]
cat(" 选择了500个最重要的特征\n")
} else {
X_train_jp_reduced <- X_train_tfidf_binary
}## 选择了500个最重要的特征
# 限制训练样本数量
if (nrow(X_train_jp_reduced) > 2000) {
cat("步骤2: 采样2000个训练样本...\n")
sample_idx <- sample(nrow(X_train_jp_reduced), 2000)
X_train_jp_sample <- X_train_jp_reduced[sample_idx, ]
y_train_jp_sample <- y_train_jp[sample_idx]
} else {
X_train_jp_sample <- X_train_jp_reduced
y_train_jp_sample <- y_train_jp
}## 步骤2: 采样2000个训练样本...
train_data_jp <- data.frame(X_train_jp_sample)
train_data_jp$target <- factor(y_train_jp_sample)
cat("步骤3: 应用ROSE过采样(这可能需要1-2分钟)...\n")## 步骤3: 应用ROSE过采样(这可能需要1-2分钟)...
train_data_jp_smote <- ovun.sample(target ~ ., data = train_data_jp,
method = "both",
p = 0.5,
N = min(nrow(train_data_jp) * 1.5, 3000),
seed = 42)$data
cat("步骤4: 训练模型...\n")## 步骤4: 训练模型...
X_train_jp <- as.matrix(train_data_jp_smote[, -ncol(train_data_jp_smote)])
y_train_jp_smote <- as.numeric(as.character(train_data_jp_smote$target))
# 逻辑回归(使用正则化)
cv_lasso_jp <- cv.glmnet(X_train_jp, y_train_jp_smote, family = "binomial", alpha = 1, nfolds = 3)
lr_jp <- glmnet(X_train_jp, y_train_jp_smote, family = "binomial", alpha = 1, lambda = cv_lasso_jp$lambda.min)
# 准备测试数据
X_test_jp <- X_test_tfidf_binary[, top_features_idx]
lr_jp_pred_proba <- predict(lr_jp, newx = as.matrix(X_test_jp), type = "response", s = cv_lasso_jp$lambda.min)
lr_jp_pred <- ifelse(lr_jp_pred_proba > 0.5, 1, 0)
jp_lr_results <- evaluate_binary_model(y_test_jp, lr_jp_pred, lr_jp_pred_proba, "J-P Logistic Regression")
print(jp_lr_results$results)## Model Accuracy Sensitivity Specificity Precision F1_Score
## 1 J-P Logistic Regression 0.6133333 0.5493827 0.6546906 0.5071225 0.5274074
## AUC
## 1 0.6470346
# 绘制ROC曲线
if (!is.null(jp_lr_results$roc)) {
plot(jp_lr_results$roc, main = "J vs P - Logistic Regression ROC Curve",
col = "limegreen", lwd = 2)
legend("bottomright", legend = paste("AUC =", round(jp_lr_results$results$AUC, 3)),
col = "limegreen", lwd = 2)
} else {
cat("无法绘制ROC曲线\n")
}# 汇总所有二分类模型的结果
binary_results <- rbind(
ie_lr_results$results,
ns_lr_results$results,
tf_lr_results$results,
jp_lr_results$results
)
kable(binary_results, caption = "二分类模型性能总结", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Model | Accuracy | Sensitivity | Specificity | Precision | F1_Score | AUC |
|---|---|---|---|---|---|---|
| I-E Logistic Regression | 0.661 | 0.687 | 0.572 | 0.845 | 0.758 | 0.681 |
| N-S Logistic Regression | 0.710 | 0.752 | 0.447 | 0.895 | 0.818 | 0.654 |
| T-F Logistic Regression | 0.717 | 0.698 | 0.733 | 0.686 | 0.692 | 0.786 |
| J-P Logistic Regression | 0.613 | 0.549 | 0.655 | 0.507 | 0.527 | 0.647 |
# 可视化AUC分数
ggplot(binary_results, aes(x = Model, y = AUC, fill = Model)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
labs(title = "二分类模型AUC ROC分数比较",
x = "模型",
y = "AUC ROC分数") +
theme_minimal() +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(aes(label = round(AUC, 3)), vjust = -0.5) +
ylim(0, 1)所有模型的表现都不理想,AUC ROC分数都接近0.5,这意味着它们几乎没有分类能力。这可能是因为:
多分类模型: - 最佳模型(逻辑回归)准确率仅为35-36% - F1分数约为0.23-0.36 - 虽然比基线模型(21%)好,但远未达到可部署水平
二分类模型: - 所有维度的AUC ROC分数都接近0.5 - 这表明模型几乎没有区分能力 - 模型在多数类上表现良好,但在少数类上表现很差
虽然许多公司发现人格评估有用,但最突出的挑战是测试准确性。即使是最深入的评估,也很难真正衡量人类行为的细微差别并将其提炼为某个类别或类型。
申请人可能会根据他们认为管理者想听到的内容来回答评估问题,而不是他们的真实感受或倾向,导致结果有偏差。同样,评估问题可能更多地反映候选人在那个时刻的快照,而不是他们在多种情况下长期思考和行动的方式。
即使测试结果确实准确反映了一个候选人的人格,它们也不一定能预测成功。仅仅因为大多数销售人员倾向于外向,并不意味着在评估中得分内向的人不会在销售中取得成功。招聘经理应该谨慎,不要将搜索限制在严格的人格类型上。
基于研究结果,我们得出以下结论:
数据集不足:可用数据集不足以准确确定各种人格类型。需要更大、更平衡的数据集来改善模型性能。
语言表达限制:某些人格特征可能无法很好地用词语或语言定义。例如,内向性和外向性可能无法仅通过语言区分,特别是考虑到E类型的样本数量远少于I类型。
人格复杂性:人格特征是一个远比词语和语言表达更复杂的领域。即使是最深入的评估,也很难真正衡量人类行为的细微差别并将其提炼为某个类别或类型。
未来研究方向:需要进行更多研究来确定未来是否能够仅通过文本来预测人格。可能需要:
数据收集:收集更大、更平衡的数据集,特别是对于少数类型(如E类型和S类型)。
特征工程:探索更复杂的特征,如:
模型改进:
多模态方法:考虑结合多种数据源,不仅仅是文本,还包括:
实际应用:在将此类模型用于实际招聘决策之前,应该:
Myers-Briggs Type Indicator Basics: https://www.myersbriggs.org/my-mbti-personality-type/mbti-basics/
16 Personalities: https://www.16personalities.com/personality-types
How to Spot Each Myers-Briggs Personality Type in Conversation: https://www.psychologyjunkie.com/2018/06/19/how-to-spot-each-myers-briggs-personality-type-in-conversation/
The Ultimate Guide to Myers-Briggs: https://medium.com/@makingbusinessmatter/the-ultimate-guide-to-myers-briggs-29253737a966
Personality Tests in Hiring: https://www.greeneresources.com/blog/culture/personality-tests-in-hiring/
Kaggle MBTI Dataset: https://www.kaggle.com/datasnaek/mbti-type
报告生成时间: 2025-11-30 18:15:19.001303