CHAPTER 1: ANALYSIS OF U.S. LABOR FORCE DATA (2023)

This chapter performs an exploratory data analysis (EDA) of the 2023 U.S. Labor Force dataset. The steps run in sequence: loading and preparing the data, checking its structure and quality, then transforming and encoding variables. The goal is a solid understanding of the data before moving on to in-depth statistical analysis and visualization.

1. Overview of the dataset

1.1. Loading and preparing the data

library(readxl)  # provides read_excel()
df_raw <- read_excel("data_2023_hoaKy.xlsx")

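1.2. Standardizing column names

library(dplyr)    # %>% and the data verbs used throughout the chapter
library(janitor)  # provides clean_names()
df <- df_raw %>%
  clean_names()  # convert column names to snake_case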

1.3. Viewing a random sample of the data

library(flextable)
df %>%
  sample_n(6) %>%
  flextable() %>%
  set_caption("6 random rows from the dataset") %>%
  autofit() # Automatically adjust column widths

6 random rows from the dataset
age	sex	race	marst	empstat	occ	ind	educ	fullpart	incwage
71	Female	White	Single	Not in labor force	Not in universe	N/A (not applicable)	Associate degree	Not working	0
61	Male	White	Divorced	Employed	Cleaners of vehicles and equipment	Architectural, engineering, and related services	Some college	Part-time	30,000
42	Female	White	Married, spouse present	Employed	Accountants and auditors	Community food and housing, and emergency services	Graduate degree	Full-time	160,000
36	Male	White	Single	Employed	Actors	Independent artists, writers, and performers	Graduate degree	Full-time	150,000
17	Female	Black	Single	Employed	Preschool and kindergarten teachers	Child day care services	Some high school	Part-time	5,000
11	Male	Black	Single	NIU	Not in universe	N/A (not applicable)	Elementary school	Not working	0

1.4. Dataset dimensions

dim(df)

[1] 146133     10

1.5. Data structure

str(df)

tibble [146,133 × 10] (S3: tbl_df/tbl/data.frame)
 $ age     : num [1:146133] 66 68 52 51 78 65 68 74 74 76 ...
 $ sex     : chr [1:146133] "Female" "Female" "Female" "Male" ...
 $ race    : chr [1:146133] "White" "White" "White" "White" ...
 $ marst   : chr [1:146133] "Separated" "Separated" "Married, spouse present" "Married, spouse present" ...
 $ empstat : chr [1:146133] "Not in labor force" "Not in labor force" "Not in labor force" "Employed" ...
 $ occ     : chr [1:146133] "Not in universe" "Not in universe" "Not in universe" "Retail salespersons" ...
 $ ind     : chr [1:146133] "N/A (not applicable)" "N/A (not applicable)" "N/A (not applicable)" "Auto parts, accessories, and tire stores" ...
 $ educ    : chr [1:146133] "Some college" "Some college" "Some college" "Some college" ...
 $ fullpart: chr [1:146133] "Not working" "Not working" "Not working" "Full-time" ...
 $ incwage : num [1:146133] 0 0 0 42000 0 55000 52000 0 0 0 ...

1.6. Key descriptive statistics for the quantitative variables

library(tidyr)       # pivot_longer()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
df %>%
  select(where(is.numeric)) %>%
  summarise(
    mean_age = mean(age),
    median_age = median(age),
    sd_age = sd(age),
    min_age = min(age),
    max_age = max(age),
    mean_income = mean(incwage),
    median_income = median(incwage)
  ) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  kable(caption = "Key descriptive statistics", digits = 2) %>%
  kable_styling(full_width = FALSE)

Key descriptive statistics
Statistic	Value
mean_age	38.73
median_age	38.00
sd_age	23.32
min_age	0.00
max_age	85.00
mean_income	31456.44
median_income	0.00
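The median income of 0 next to a mean above 31,000 follows from how many rows have no wage income. A minimal base R sketch, using only the first ten incwage values printed by str() above (not the full dataset), shows the same pattern and how it changes once zero incomes are excluded:

```r
# First ten incwage values shown in the str() output above
incwage <- c(0, 0, 0, 42000, 0, 55000, 52000, 0, 0, 0)

mean_all   <- mean(incwage)      # pulled up by the few positive values
median_all <- median(incwage)    # 0 whenever more than half the values are 0

earners <- incwage[incwage > 0]  # same idea as filtering on incwage > 0
mean_earners   <- mean(earners)
median_earners <- median(earners)

print(c(mean_all = mean_all, median_all = median_all))
print(c(mean_earners = mean_earners, median_earners = median_earners))
```

This is why the income questions later in the chapter are asked on the subset with positive income.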

1.7. Checking the total number of missing values (NA)

sum(is.na(df))

[1] 0
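A total of 0 only says that no cell is a literal NA. As a hedged sketch on a small made-up data frame (the toy object and its values are illustrative, not taken from the dataset), the per-column breakdown and a count of a placeholder string such as "Not in universe" can be obtained like this:

```r
# Hypothetical toy data; only the column names mirror the dataset
toy <- data.frame(
  age     = c(66, NA, 52),
  occ     = c("Not in universe", "Retail salespersons", "Not in universe"),
  incwage = c(NA, 42000, NA),
  stringsAsFactors = FALSE
)

na_total      <- sum(is.na(toy))       # all NA cells in the table
na_per_column <- colSums(is.na(toy))   # NA cells per column
# Placeholder strings are ordinary values, so is.na() does not count them
placeholder_n <- sum(toy$occ == "Not in universe")

print(na_total)
print(na_per_column)
print(placeholder_n)
```

In the real data, values such as "Not in universe" or "N/A (not applicable)" play the role of missing information even though the NA count is 0.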

1.8. Counting duplicated rows

# duplicated() flags every repeat of a row after its first occurrence
num_duplicated_rows <- sum(duplicated(df))
cat(paste("Total number of fully duplicated rows:", num_duplicated_rows, "\n"))

Total number of fully duplicated rows: 65820

duplicated_summary <- df %>%
  group_by(across(everything())) %>%
  summarise(Frequency = n(), .groups = "drop") %>%
  arrange(desc(Frequency))

duplicated_summary %>%
  sample_n(10) %>%
  kable(caption = "10 random rows from the frequency table of rows duplicated across all variables") %>%
  kable_styling(bootstrap_options = "striped", font_size = 9, full_width = FALSE)

10 random rows from the frequency table of rows duplicated across all variables
age	sex	race	marst	empstat	occ	ind	educ	fullpart	incwage	Frequency
34	Female	White	Single	Employed	Administrative services managers	Executive offices and legislative bodies	Some college	Full-time	47352	1
36	Male	White	Widowed	Employed	Drywall installers, ceiling tile installers, and tapers	Construction	Some college	Full-time	21600	1
37	Male	White	Married, spouse present	Employed	Stockers and order fillers	Electronics stores	Some high school	Full-time	48000	1
80	Female	Asian	Married, spouse present	Employed	Hairdressers, hairstylists, and cosmetologists	Beauty salons	Some high school	Not working	0	1
61	Male	White	Married, spouse present	Employed	Lawyers	Legal services	Graduate degree	Full-time	250000	1
43	Male	White	Married, spouse present	Employed	First-Line supervisors of non-retail sales workers	Groceries and related products, merchant wholesalers	Some high school	Full-time	42000	1
30	Male	Asian	Married, spouse present	Employed	Chief executives	Computer systems design and related services	Graduate degree	Full-time	215000	1
34	Male	White	Married, spouse present	Employed	Other production workers	Petroleum refining	Associate degree	Full-time	90000	1
44	Female	Asian	Married, spouse present	Employed	Computer occupations, all other	Broadcasting (except internet)	Graduate degree	Full-time	40000	1
54	Male	White	Married, spouse present	Employed	Electrical and electronics engineers	Telecommunications, except wired telecommunications carriers	Graduate degree	Full-time	201000	1

# Highest and lowest duplication frequency
max_freq <- max(duplicated_summary$Frequency)
min_freq <- min(duplicated_summary$Frequency)

cat(paste("Highest duplication frequency:", max_freq, "\n"))

Highest duplication frequency: 906

cat(paste("Lowest duplication frequency:", min_freq, "\n"))

Lowest duplication frequency: 1
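The duplicate count and the frequency table describe the same thing from two angles: each distinct row with frequency k contributes k - 1 duplicates. A small base R sketch on made-up rows (the toy data and the helper column n are assumptions for illustration) checks that relationship:

```r
# Hypothetical toy data: one row appears three times, one twice, one once
toy <- data.frame(
  age     = c(5, 5, 5, 40, 70, 70),
  empstat = c("NIU", "NIU", "NIU", "Employed",
              "Not in labor force", "Not in labor force"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(toy))

# Frequency of each distinct row (base R counterpart of group_by + n())
freq <- aggregate(n ~ age + empstat, data = cbind(toy, n = 1), FUN = sum)

print(n_dup)
print(freq)
print(sum(freq$n - 1))  # equals n_dup
```

A frequency of 1 therefore means the row is unique, and the highest frequency identifies the most repeated profile.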

1.9. Listing the variable names

colnames(df)

 [1] "age"      "sex"      "race"     "marst"    "empstat"  "occ"     
 [7] "ind"      "educ"     "fullpart" "incwage"

1.10. Counting unique values in each column

sapply(df, function(x) length(unique(x)))

     age      sex     race    marst  empstat      occ      ind     educ 
      82        2        6        6        4      527      264        5 
fullpart  incwage 
       3     3537

2. Processing and encoding the dataset

This section transforms the original data and derives new variables to support deeper analytical questions. After each step, part of the result is displayed to illustrate the change.

2.1. Converting character columns to factors

df <- df %>% mutate(across(where(is.character), as.factor))
# Display the types of a few columns as a check
glimpse(df %>% select(sex, race, marst))

Rows: 146,133
Columns: 3
$ sex   <fct> Female, Female, Female, Male, Female, Male, Female, Female, Male…
$ race  <fct> White, White, White, White, White, White, White, White, White, W…
$ marst <fct> "Separated", "Separated", "Married, spouse present", "Married, s…

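2.2. Encoding the education variable

# Group labels are kept in Vietnamese because section 2.3 refers to them:
# "Thấp" = low, "Trung bình" = medium, "Cao" = high, "Không xác định" = undetermined.
# The level "Some high school", which appears in the samples above, matches no
# branch below and therefore falls through to "Không xác định".
df <- df %>%
  mutate(educ_group = case_when(
    educ %in% c("Less than high school", "Elementary school") ~ "Thấp",
    educ %in% c("High school graduate", "Some college") ~ "Trung bình",
    educ %in% c("Associate degree", "Bachelor's degree", "Graduate degree") ~ "Cao",
    TRUE ~ "Không xác định"
  ))
# Display the new column next to the original for comparison
df %>% select(educ, educ_group) %>% head() %>% kable() %>% kable_styling(full_width = F)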

educ	educ_group
Some college	Trung bình
Some college	Trung bình
Some college	Trung bình
Some college	Trung bình
Some college	Trung bình
Some college	Trung bình

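2.3. Ordering the education variable

# "Không xác định" is not among the levels below, so factor() turns those rows into NA;
# they show up as the NA group in later summaries by educ_group.
df$educ_group <- factor(df$educ_group, levels = c("Thấp", "Trung bình", "Cao"))
# Display the levels of the new factor
levels(df$educ_group)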

[1] "Thấp"       "Trung bình" "Cao"
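Any value that is not listed in levels is silently converted to NA by factor(). A minimal sketch of that behavior, using the four group labels from section 2.2 (the vector itself is made up for illustration):

```r
# One value per group label produced by the case_when() in section 2.2
educ_group <- c("Thấp", "Trung bình", "Cao", "Không xác định")

# Levels as used in section 2.3: the fourth label is not listed
three_levels <- factor(educ_group, levels = c("Thấp", "Trung bình", "Cao"))

# Alternative (an assumption, not the chapter's choice): keep it as its own level
four_levels <- factor(educ_group,
                      levels = c("Thấp", "Trung bình", "Cao", "Không xác định"))

print(sum(is.na(three_levels)))  # rows that became NA
print(table(four_levels, useNA = "ifany"))
```

Listing the fourth label as a level would keep those rows visible under their own name instead of under NA.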

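2.4. Encoding the age variable

# With right = FALSE the bins are [0,18), [18,30), [30,45), [45,60), [60,Inf),
# so the labels are approximate: age 30 falls in "31-45", 45 in "46-60", 60 in "Trên 60".
# "Dưới 18" = under 18, "Trên 60" = over 60.
df <- df %>%
  mutate(age_group = cut(age, 
                         breaks = c(0, 18, 30, 45, 60, Inf),
                         labels = c("Dưới 18", "18-30", "31-45", "46-60", "Trên 60"),
                         right = FALSE))
# Display the frequency of each new age group
table(df$age_group) %>% kable(col.names = c("Age group", "Count")) %>% kable_styling(full_width = F)

Age group	Count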
Dưới 18	36125
18-30	19768
31-45	30005
46-60	26390
Trên 60	33845
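The exact bin boundaries depend on the right argument of cut(). The following sketch applies the same breaks to a few boundary ages (the ages vector is chosen for illustration) and prints both the default interval notation and the custom labels used above:

```r
ages   <- c(17, 18, 30, 45, 60, 85)
breaks <- c(0, 18, 30, 45, 60, Inf)

# Default labels show the intervals: closed on the left, open on the right
bins <- cut(ages, breaks = breaks, right = FALSE)

# Same bins with the custom labels from section 2.4
labelled <- cut(ages, breaks = breaks,
                labels = c("Dưới 18", "18-30", "31-45", "46-60", "Trên 60"),
                right = FALSE)

print(data.frame(age = ages, interval = bins, label = labelled))
```

Each boundary age (18, 30, 45, 60) belongs to the bin that starts at it, which is worth keeping in mind when reading the group labels.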

2.5. Creating a subset (people with income)

df_with_income <- df %>% filter(incwage > 0)
# Display the size of the new data frame
cat(paste("Number of people with income > 0:", nrow(df_with_income)))

Number of people with income > 0: 69148

2.6. Creating a subset (employed people)

df_employed <- df %>% filter(empstat == "Employed")
# Display the size of the new data frame
cat(paste("Number of people currently employed:", nrow(df_employed)))

Number of people currently employed: 69160

2.7. Creating a binary 'has income' variable

# "Có thu nhập" = has income, "Không có thu nhập" = no income
df <- df %>%
  mutate(has_income = ifelse(incwage > 0, "Có thu nhập", "Không có thu nhập"))
# Display the frequency of the new variable
table(df$has_income) %>% kable(col.names = c("Status", "Count")) %>% kable_styling(full_width = F)

Status	Count
Có thu nhập	69148
Không có thu nhập	76985

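2.8. Getting the top 5 most common occupations

# Note: "Not in universe" is a placeholder rather than an actual occupation,
# but it is counted here like any other value of occ.
top_5_occ <- df %>%
  count(occ, sort = TRUE) %>%
  head(5) %>%
  pull(occ)
# Print the top 5 occupations
top_5_occ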

[1] Not in universe                       
[2] Managers, all other                   
[3] Elementary and middle school teachers 
[4] Driver/sales workers and truck drivers
[5] Registered nurses                     
527 Levels: Accountants and auditors Actors Actuaries ... Writers and authors

2.9. Selecting the key variables

df_subset <- df %>%
  select(age, sex, educ_group, empstat, incwage)
# Display the first 6 rows of the new subset
head(df_subset) %>% kable() %>% kable_styling(full_width = F)

age	sex	educ_group	empstat	incwage
66	Female	Trung bình	Not in labor force	0
68	Female	Trung bình	Not in labor force	0
52	Female	Trung bình	Not in labor force	0
51	Male	Trung bình	Employed	42000
78	Female	Trung bình	Not in labor force	0
65	Male	Trung bình	Not in labor force	55000

2.10. Log-transforming income

df_with_income <- df_with_income %>%
  mutate(log_income = log1p(incwage))  # log1p(x) = log(1 + x)
# Display the original and log-transformed income side by side
df_with_income %>% select(incwage, log_income) %>% head() %>% kable(digits = 2) %>% kable_styling(full_width = F)

incwage	log_income
42000	10.65
55000	10.92
52000	10.86
22000	10.00
45000	10.71
70000	11.16
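log1p(x) computes log(1 + x), so it stays finite at 0 and can be inverted with expm1(). A short base R sketch on three income values that appear in this chapter (1, 42000 and 1549999) illustrates the round trip:

```r
# Minimum, a typical value, and maximum of incwage among people with income
incwage <- c(1, 42000, 1549999)

log_income <- log1p(incwage)     # natural log of (1 + incwage)
recovered  <- expm1(log_income)  # inverse transform: exp(x) - 1

print(round(log_income, 2))
print(all.equal(recovered, incwage))

# Unlike log(), log1p() is defined at zero
print(c(log1p_zero = log1p(0), log_zero = log(0)))
```

Because the subset only contains positive incomes, plain log() would also work here; log1p() mainly matters if zero incomes are kept.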

3. Basic statistics

This section turns to quantitative analysis by posing and answering 20 specific questions about the data. Each step is a calculation intended to draw out a meaningful insight.

3.1. Univariate analysis

# What is the mean age of respondents in the labor force dataset?
mean(df$age)

[1] 38.73129

# What is the median age?
median(df$age)

[1] 38

# What is the standard deviation of age?
sd(df$age)

[1] 23.31602

# What is the mean income of people with wage income?
mean(df_with_income$incwage)

[1] 66478.05

# What is their median income?
median(df_with_income$incwage)

[1] 49000

# What is the income range (lowest and highest)?
range(df_with_income$incwage)

[1]       1 1549999

# What are the shares of women and men in the dataset?
prop.table(table(df$sex))


   Female      Male 
0.5120951 0.4879049

# How many people are in each education group?
table(df$educ_group)


      Thấp Trung bình        Cao 
     29813      51903      49183

# How many people are in each age group?
table(df$age_group)


Dưới 18   18-30   31-45   46-60 Trên 60 
  36125   19768   30005   26390   33845

# How many people are in each marital status?
table(df$marst)


               Divorced  Married, spouse absent Married, spouse present 
                  11085                    1642                   58636 
              Separated                  Single                 Widowed 
                   6564                   66223                    1983

3.2. Multivariate analysis

# Compare mean income between men and women
df_with_income %>% group_by(sex) %>% summarise(mean_income = mean(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

sex	mean_income
Female	55284
Male	76951

# Compare median income between men and women
df_with_income %>% group_by(sex) %>% summarise(median_income = median(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

sex	median_income
Female	40000
Male	55000

# Compare mean income by education group (the NA row is the group left unmatched in section 2.2)
df_with_income %>% group_by(educ_group) %>% summarise(mean_income = mean(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

educ_group	mean_income
Thấp	43144
Trung bình	46755
Cao	88930
NA	28538

# Compare mean income by age group
df_with_income %>% group_by(age_group) %>% summarise(mean_income = mean(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

age_group	mean_income
Dưới 18	8429
18-30	39246
31-45	73603
46-60	81350
Trên 60	66235

# Compare the share of people who are employed ('Employed') within each sex
df %>% group_by(sex) %>% summarise(employment_rate = mean(empstat == "Employed")) %>% kable(digits=2) %>% kable_styling(full_width=F)

sex	employment_rate
Female	0.44
Male	0.51
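The rate above relies on the fact that the mean of a logical vector is the share of TRUE values, and its denominator is everyone in the group, including "NIU" rows. A hedged sketch on a made-up status vector shows the calculation and how the figure moves if "NIU" is left out of the denominator (that second variant is an illustration, not something the chapter computes):

```r
# Hypothetical toy statuses using the four empstat levels from the dataset
empstat <- c("Employed", "NIU", "Not in labor force", "Employed", "Unemployed")

# Share employed among all rows: TRUE counts as 1, FALSE as 0
rate_all <- mean(empstat == "Employed")

# Same share after dropping rows outside the survey universe
in_universe   <- empstat[empstat != "NIU"]
rate_excl_niu <- mean(in_universe == "Employed")

print(c(rate_all = rate_all, rate_excl_niu = rate_excl_niu))
```

Which denominator is appropriate depends on the question being asked, so it helps to state it alongside the rate.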

# Compare mean income by race ('race')
df_with_income %>% group_by(race) %>% summarise(mean_income = mean(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

race	mean_income
American Indian	45613
Asian	85021
Black	55147
Multiracial	56257
Pacific Islander	54990
White	66941

# Compare mean age by marital status
df %>% group_by(marst) %>% summarise(mean_age = mean(age)) %>% kable(digits=1) %>% kable_styling(full_width=F)

marst	mean_age
Divorced	56.6
Married, spouse absent	48.6
Married, spouse present	51.8
Separated	72.4
Single	20.3
Widowed	48.8

# Cross-tabulate education group and employment status
table(df$educ_group, df$empstat) %>% kable() %>% kable_styling(full_width=F)

	Employed	NIU	Not in labor force	Unemployed
Thấp	91	29483	232	7
Trung bình	29064	0	21488	1351
Cao	35102	0	13346	735
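Raw counts are hard to compare across education groups of different sizes. As a sketch, the counts from the table above can be re-entered by hand and converted to row shares with prop.table(); the matrix below is a manual copy of those counts, not a new computation on the data:

```r
# Counts copied from the cross-tabulation above (rows: educ_group, columns: empstat)
tab <- matrix(
  c(   91, 29483,   232,    7,
    29064,     0, 21488, 1351,
    35102,     0, 13346,  735),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    educ_group = c("Thấp", "Trung bình", "Cao"),
    empstat    = c("Employed", "NIU", "Not in labor force", "Unemployed")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(tab, margin = 1)

print(round(row_share, 3))
print(rowSums(tab))  # group sizes, matching table(df$educ_group)
```

Row shares make the composition of each education group directly comparable, which is also what the stacked bar chart of employment status by education group displays.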

# Compute the mean income of the 5 most common occupations
df_with_income %>% filter(occ %in% top_5_occ) %>% group_by(occ) %>% summarise(mean_income = mean(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

occ	mean_income
Driver/sales workers and truck drivers	56723
Elementary and middle school teachers	59508
Managers, all other	105603
Not in universe	30813
Registered nurses	81147

# Compute the standard deviation of income within each education group
df_with_income %>% group_by(educ_group) %>% summarise(sd_income = sd(incwage)) %>% kable(digits=0) %>% kable_styling(full_width=F)

educ_group	sd_income
Thấp	122273
Trung bình	58397
Cao	99531
NA	65184

4. Data visualization

This section uses 20 charts to illustrate the statistical results and make trends and relationships in the data easier to grasp. Each chart is built from at least 5 ggplot2 layers so that it is both readable and complete.

4.1. Visualizing univariate distributions

4.1.1. Age distribution of the labor force

library(ggplot2)
# linewidth replaces size for line geoms as of ggplot2 3.4.0
ggplot(df, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  geom_vline(aes(xintercept = mean(age)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Age distribution of the labor force", 
       subtitle = "The dashed red line marks the mean age",
       x = "Age", y = "Count") +
  theme_minimal()

Figure 1: Age distribution.

4.1.2. Density of income (log scale)

library(scales)  # dollar_format(), dollar, percent, comma
ggplot(df_with_income, aes(x = incwage)) +
  geom_density(fill = "lightgreen", alpha = 0.7, kernel = "gaussian") +
  scale_x_log10(labels = dollar_format()) +
  labs(title = "Density of income (log scale)", 
       x = "Income (USD/year)", y = "Density") +
  theme_classic()

Figure 2: Income distribution.

4.1.3. Share by sex

df %>% count(sex) %>% mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = "", y = pct, fill = sex)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  geom_text(aes(label = percent(pct, accuracy = 0.1)), position = position_stack(vjust = 0.5)) +
  labs(title = "Share by sex", fill = "Sex")

Figure 3: Sex ratio.

4.1.4. Frequency of education groups

library(forcats)  # fct_infreq()
ggplot(df, aes(y = fct_infreq(educ_group))) +
  geom_bar(aes(fill = educ_group)) +
  labs(title = "Distribution by education group", y = "Education group", x = "Count") +
  theme_light() +
  theme(legend.position = "none")

Figure 4: Frequency of education groups.

4.1.5. Frequency of employment status

ggplot(df, aes(x = fct_infreq(empstat))) +
  geom_bar(fill = "coral") +
  labs(title = "Distribution by employment status", x = "Employment status", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 5: Frequency of employment status.

4.2. Visualizing relationships between variables

4.2.1. Income distribution by sex

ggplot(df_with_income, aes(x = sex, y = incwage, fill = sex)) +
  geom_violin(trim = FALSE, alpha = 0.8) +
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.5) +
  scale_y_continuous(labels = dollar_format()) +
  # coord_cartesian() zooms the display without dropping incomes above the limit
  coord_cartesian(ylim = c(0, 200000)) +
  labs(title = "Income distribution by sex", 
       subtitle = "(Display limited to 200,000 USD/year)",
       x = "Sex", y = "Income (USD/year)") +
  theme_light()

Figure 6: Income by sex.
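In ggplot2, an axis cap set through scale limits discards out-of-range values before statistics such as medians and densities are computed, whereas coord_cartesian() only zooms the view. The effect on a median can be mimicked in base R; the income values below are made up for illustration:

```r
# Hypothetical incomes, two of them above the 200,000 display cap
incwage <- c(20000, 40000, 60000, 150000, 400000, 900000)
cap     <- 200000

# Zooming keeps every observation in the statistic
median_zoomed <- median(incwage)

# Scale limits behave like dropping out-of-range values first
median_censored <- median(incwage[incwage <= cap])

print(c(median_zoomed = median_zoomed, median_censored = median_censored))
```

When the intent is only to limit what is displayed, zooming keeps the box plot and violin summaries tied to the full income distribution.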

4.2.2. Relationship between age and income

ggplot(df_with_income, aes(x = age, y = incwage)) +
  geom_point(alpha = 0.05, color = "gray50", shape = 16) +
  geom_smooth(method = "loess", color = "firebrick", se = FALSE) +
  scale_y_continuous(labels = dollar) +
  # Zoom to 250,000 USD without removing higher incomes from the smoother
  coord_cartesian(ylim = c(0, 250000)) +
  labs(title = "Relationship between age and income", x = "Age", y = "Income") +
  theme_bw()

Figure 7: Relationship between age and income.

4.2.3. Mean income by education group

df_with_income %>% group_by(educ_group) %>% summarise(mean_income = mean(incwage)) %>%
  ggplot(aes(x = educ_group, y = mean_income, fill = educ_group)) +
  geom_col() +
  geom_text(aes(label = dollar(round(mean_income))), vjust = -0.5, size = 3.5) +
  scale_y_continuous(labels = dollar) +
  labs(title = "Mean income by education group", x = "Education group", y = "Mean income") +
  theme_classic() +
  theme(legend.position = "none")

Figure 8: Mean income by education.

4.2.4. Income by education, split by sex

ggplot(df_with_income, aes(x = educ_group, y = incwage, fill = sex)) +
  geom_boxplot() +
  facet_wrap(~sex) +
  scale_y_continuous(labels = dollar) +
  # Zoom to 250,000 USD without removing higher incomes from the box plot statistics
  coord_cartesian(ylim = c(0, 250000)) +
  labs(title = "Income distribution by education and sex", 
       subtitle = "(Display limited to 250,000 USD)",
       x = "Education group", y = "Income") +
  theme_bw()

Figure 9: Income by education and sex.

4.2.5. Number of people by age group and sex

ggplot(df, aes(x = age_group, fill = sex)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of people by age group and sex", x = "Age group", y = "Count", fill = "Sex") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1")

Figure 10: Demographic breakdown by age and sex.

4.2.6. Share of employment status by education group

ggplot(df, aes(x = educ_group, fill = empstat)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent) +
  labs(title = "Share of employment status by education group", 
       y = "Share", x = "Education group", fill = "Employment status") +
  coord_flip() +
  theme_minimal()

Figure 11: Share of employment status by education.

4.2.7. Income distribution by race

ggplot(df_with_income, aes(x = reorder(race, incwage, FUN = median), y = incwage, fill = race)) +
  geom_boxplot() +
  scale_y_log10(labels = dollar) +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Income distribution by race (log scale)", x = "Race", y = "Income")

Figure 12: Income by race.

4.2.8. Heatmap of marital status by age group

df %>%
  count(marst, age_group) %>%
  ggplot(aes(x = age_group, y = marst, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(labels = comma) +
  labs(title = "Heatmap of marital status by age group", x = "Age group", y = "Marital status", fill = "Count") +
  theme_minimal()

Figure 13: Relationship between age and marital status.

4.2.9. Mean income by marital status and sex

df_with_income %>%
  group_by(marst, sex) %>%
  summarise(mean_income = mean(incwage), .groups = 'drop') %>%
  ggplot(aes(x = marst, y = mean_income, fill = sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = dollar) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Mean income by marital status and sex", x = "Marital status", y = "Mean income", fill = "Sex")

Figure 14: Income by marital status and sex.

4.2.10. Top 10 occupations by mean income

df_with_income %>%
  group_by(occ) %>%
  summarise(mean_income = mean(incwage), count = n(), .groups = 'drop') %>%
  filter(count > 50) %>% # Keep only occupations with enough data
  arrange(desc(mean_income)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(occ, mean_income), y = mean_income)) +
  geom_col(fill = "darkgoldenrod1") +
  geom_text(aes(label = dollar(round(mean_income))), hjust = 1.1, color = "white", size=3) +
  coord_flip() +
  scale_y_continuous(labels = dollar) +
  labs(title = "Top 10 occupations by mean income", 
       subtitle = "(Only occupations with more than 50 observations)",
       x = "Occupation", y = "Mean income")

Figure 15: Highest-paying occupations.

4.2.11. Age distribution within each education group

ggplot(df, aes(x = educ_group, y = age, fill = educ_group)) +
  geom_violin(trim = FALSE) +
  stat_summary(fun = "median", geom = "point", size = 3, color = "white") +
  theme(legend.position = "none") +
  labs(title = "Age distribution within each education group", x = "Education group", y = "Age")

Figure 16: Age by education.

4.2.12. Mean income trend by age and education

# linewidth replaces size for line geoms as of ggplot2 3.4.0
df_with_income %>%
  group_by(age_group, educ_group) %>%
  summarise(mean_income = mean(incwage), .groups = 'drop') %>%
  ggplot(aes(x = age_group, y = mean_income, color = educ_group, group = educ_group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  scale_y_continuous(labels = dollar) +
  labs(title = "Mean income by age group and education", x = "Age group", y = "Mean income", color = "Education group") +
  theme_light()

Figure 17: Income trend.

4.2.13. Share with income by employment status

# The keys in values must match the has_income values created in section 2.7
ggplot(df, aes(x = empstat, fill = has_income)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent) +
  scale_fill_manual(values = c("Có thu nhập" = "darkgreen", "Không có thu nhập" = "gray")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Share with income by employment status", x = "Employment status", y = "Share", fill = "Has income?")

Figure 18: Share with income.

4.2.14. Median income by race and sex

df_with_income %>%
  group_by(race, sex) %>%
  summarise(median_income = median(incwage), .groups = 'drop') %>%
  ggplot(aes(x = race, y = median_income, fill = sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = dollar) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Median income by race and sex", x = "Race", y = "Median income", fill = "Sex")

Figure 19: Median income comparison.

4.2.15. Income distribution of the 5 most common occupations

df_with_income %>%
  filter(occ %in% top_5_occ) %>%
  ggplot(aes(x = reorder(occ, incwage, FUN=median), y = incwage, fill = occ)) +
  geom_boxplot() +
  scale_y_log10(labels = dollar) +
  theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Income distribution of the 5 most common occupations", x = "Occupation", y = "Income (log scale)")

Figure 20: Income distribution of the most common occupations.

Programming Languages Essay

Lý Ngọc Hân & Nguyễn Gia Hân - Group 35

30/10/2025