Day 2: Data visualization

Task 1. Read the data into R

df = read.csv("C:\\Thach\\VN trips\\2025_2May\\Phuong Nam Institute\\Datasets\\CHNS data full.csv")
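Before plotting, it helps to confirm that the file was read as expected. The sketch below is a minimal, self-contained illustration: it writes a tiny made-up CSV to a temporary file (the column names mirror the ones used in this lesson, but the values are hypothetical) and then inspects it the same way you would inspect `df`.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
# Column names mirror those used in this lesson; the values are made up.
tmp = tempfile(fileext = ".csv")
writeLines(c("income,gender,edu,age",
             "1200,1,1,34",
             ",2,2,51",
             "5400,2,3,45"), tmp)

toy = read.csv(tmp)

str(toy)              # column names and types
summary(toy$income)   # summary statistics, including the number of NA's
```

An empty field in a numeric column is read as `NA`, which is the kind of value the later warnings about removed rows refer to.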

Task 2. Histograms

2.1 Income distribution

library(ggplot2)
library(gridExtra) 

p = ggplot(data = df, aes(x = income))
p1 = p + geom_histogram()
p2 = p + geom_histogram(fill = "blue", col = "white") + labs(x = "Income", y = "Number of people", title = "Income distribution")

grid.arrange(p1, p2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 224 rows containing non-finite values (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 224 rows containing non-finite values (`stat_bin()`).

Comment: the distribution is strongly skewed, so the histogram on the raw scale is not informative.
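The `stat_bin()` message in the output above asks for an explicit bin choice. A minimal sketch of the two ways to set it, using simulated right-skewed incomes (the simulated values and the particular choices `bins = 50` and `binwidth = 2000` are assumptions for illustration, not recommendations for the CHNS data):

```r
library(ggplot2)

# Simulated right-skewed incomes; illustrative only, not values from the CHNS data.
set.seed(1)
toy = data.frame(income = rlnorm(1000, meanlog = 8, sdlog = 1.2))

# Option 1: fix the number of bins.
p_bins = ggplot(toy, aes(x = income)) +
  geom_histogram(bins = 50, fill = "blue", col = "white")

# Option 2: fix the width of each bin, in the units of the x variable.
p_width = ggplot(toy, aes(x = income)) +
  geom_histogram(binwidth = 2000, fill = "blue", col = "white")

p_bins
p_width
```

Setting either argument silences the message. It does not remove the skew, which is why the next section switches to the log scale.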

2.2 A clearer view of the income distribution: using log-transformed values

p = ggplot(data = df, aes(x = log(income)))
p1 = p + geom_histogram(fill = "blue", col = "white") + labs(x = "Income (logarithm scale)", y = "Number of people", title = "Income distribution")
p1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 241 rows containing non-finite values (`stat_bin()`).
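The number of removed rows rises from 224 in 2.1 to 241 here. A plausible reading, not verified against the data, is that 224 rows have a missing income and a further 17 have a zero or negative income, which `log()` maps to `-Inf` or `NaN`. The toy vector below (hypothetical values) shows the mechanism:

```r
# Toy vector (hypothetical values): two missing, one zero, one negative.
income = c(1200, NA, 0, 5400, -50, NA, 800)

# Missing values only: these are dropped even on the raw scale.
sum(is.na(income))            # 2

# log(0) is -Inf and log(-50) is NaN (with a warning), so more values
# become non-finite after the transformation.
log_income = suppressWarnings(log(income))
sum(!is.finite(log_income))   # 4
```

Running the same two counts on `df$income` would show whether this explanation holds for the real data.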

[ChatGPT]

PROMPT: “Use the ggplot2 package to show the distribution of income (income) as a histogram with log-transformed income values, i.e. log(income):”

ChatGPT:

# Check for and remove income values less than or equal to 0 (the log is undefined there)
df <- df[df$income > 0, ]

# Draw the histogram of log(income)
ggplot(df, aes(x = log(income))) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Histogram of log(income)", x = "log(income)", y = "Count") +
  theme_minimal()

2.3 Income distribution by gender

p = ggplot(data = df, aes(x = log(income), fill = factor(gender)))
p1 = p + geom_histogram(col="white") + labs(x = "Income (logarithm scale)", y = "Number of people", title = "Income distribution")
p2 = p + geom_density(alpha = 0.5) + labs(x = "Income (logarithm scale)", y = "Density", title = "Income distribution")

grid.arrange(p1, p2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 241 rows containing non-finite values (`stat_bin()`).

## Warning: Removed 241 rows containing non-finite values (`stat_density()`).

[ChatGPT]

PROMPT: “Plot the distribution of income (log(income)) by gender (gender)”

ChatGPT:

# Filter out invalid income values (<= 0)
df <- df[df$income > 0, ]

# Relabel gender with clear labels
df$gender <- factor(df$gender, levels = c(1, 2), labels = c("Male", "Female"))

# Draw the histogram of log(income) by gender
ggplot(df, aes(x = log(income), fill = gender)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(title = "Distribution of log(income) by Gender", x = "log(income)", y = "Count", fill = "Gender") +
  theme_minimal()

Task 3. Bar charts

3.1 Income by education level

Compute the median income

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

temp = df %>% group_by(edu) %>% summarise(income = median(income, na.rm = T))
temp

## # A tibble: 3 × 2
##     edu income
##   <int>  <dbl>
## 1     1  1750.
## 2     2  6420 
## 3     3 31200

Draw the bar chart

# Label the education codes in the summary table, since temp is the data that gets plotted
temp$edu = factor(temp$edu, levels = c(1, 2, 3), labels = c("Primary", "Secondary", "Tertiary"))
p = ggplot(data = temp, aes(x = edu, y = income, fill = edu))
p1 = p + geom_bar(stat = "identity") + labs(x = "Education", y = "Income") + theme(legend.position = "none")
p1
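For reference, the same chart can be sketched with `geom_col()`, which is shorthand for `geom_bar(stat = "identity")`. The data frame `temp2` below is a hypothetical stand-in that copies the medians as printed in the tibble above (they are rounded in the printout), so the block runs without the CHNS file.

```r
library(ggplot2)

# Medians as printed in the tibble above (rounded in the printout).
temp2 = data.frame(edu = c(1, 2, 3), income = c(1750, 6420, 31200))

# A factor with explicit levels keeps the bars in Primary -> Tertiary order
# and gives a discrete x axis and fill.
temp2$edu = factor(temp2$edu, levels = c(1, 2, 3),
                   labels = c("Primary", "Secondary", "Tertiary"))

# geom_col() uses the y values as bar heights directly.
p_col = ggplot(temp2, aes(x = edu, y = income, fill = edu)) +
  geom_col() +
  labs(x = "Education", y = "Income") +
  theme(legend.position = "none")
p_col
```

Keeping `edu` as a factor rather than a character vector means the bar order follows the level order instead of alphabetical order.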

[ChatGPT]

PROMPT: “Draw a bar chart showing the median income (income) by education level (edu), where education 1 = Primary, 2 = High, 3 = Tertiary”

ChatGPT:

# Keep rows with income > 0
df <- df[df$income > 0, ]

# Label the education levels
df$edu <- factor(df$edu, levels = c(1, 2, 3), labels = c("Primary", "High", "Tertiary"))

# Compute the median income by education level
library(dplyr)

df_summary <- df %>% group_by(edu) %>% summarise(median_income = median(income, na.rm = TRUE))

# Draw the bar chart
ggplot(df_summary, aes(x = edu, y = median_income, fill = edu)) +
  geom_bar(stat = "identity") +
  labs(title = "Median Income by Education Level", x = "Education Level", y = "Median Income") +
  theme_minimal() +
  theme(legend.position = "none")

3.2 Add the income values

p + geom_bar(stat = "identity") + geom_text(aes(label = income), vjust = -0.5) + labs(x = "Education", y = "Income") + theme(legend.position = "none")

[ChatGPT]

PROMPT: “Add the median income value to the chart for each education group”

ChatGPT:

# Draw the chart with numeric labels
ggplot(df_summary, aes(x = edu, y = median_income)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = round(median_income, 0)), vjust = -0.5, size = 4) +
  labs(title = "Median Income by Education Level", x = "Education Level", y = "Median Income") +
  theme_minimal()

Task 4. Draw a box plot comparing the income distribution by gender

df$gender = as.character(factor(df$gender, levels = c(1, 2), labels = c("Men", "Women")))
p = ggplot(data = df, aes(x = factor(gender), y = log(income), col = factor(gender)))
p + geom_boxplot() + geom_jitter(alpha = 0.05) + labs(x = "Gender", y = "Income (logarithm scale)") + ggtitle("Income by gender")

## Warning: Removed 241 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 224 rows containing missing values (`geom_point()`).
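The two warnings report different counts: the box plot drops non-finite values (241), while the point layer drops only missing values (224). One way to avoid both warnings is to filter the data before plotting. The sketch below uses a hypothetical helper, `keep_positive_income()`, and made-up rows; it is one possible approach under those assumptions, not the author's method.

```r
# Hypothetical helper: keep rows whose income is present and strictly positive.
# The !is.na() guard matters: with a bare data[data$income > 0, ], rows where
# income is NA come back as rows filled entirely with NA.
keep_positive_income = function(data) {
  data[!is.na(data$income) & data$income > 0, , drop = FALSE]
}

# Made-up rows: one missing income and one zero income.
toy = data.frame(gender = c("Men", "Women", "Men", "Women"),
                 income = c(1200, NA, 0, 5400))

clean = keep_positive_income(toy)
nrow(clean)                         # 2
all(is.finite(log(clean$income)))   # TRUE
```

Plotting from a cleaned copy such as `clean` keeps the original `df` intact for analyses that need the zero or missing incomes.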

[ChatGPT]

PROMPT 1: “Draw a box plot to compare the distribution of income (income) by gender (gender)”

ChatGPT:

ggplot(df, aes(x = gender, y = income, fill = gender)) +
  geom_boxplot(outlier.shape = NA) + # hide the outliers for clarity
  scale_y_continuous(trans = "log10", # use a log scale when there are many extreme values
                     breaks = c(100, 1000, 10000, 100000, 1000000),
                     labels = scales::comma) +
  labs(title = "Income Distribution by Gender", x = "Gender", y = "Income (log scale)") +
  theme_minimal() +
  theme(legend.position = "none")

PROMPT 2: “Add each individual's data points to the box plot”

ChatGPT:

“To add each individual's data points (jitter points) to the box plot, you only need to add a geom_jitter() layer or geom_point(position = position_jitter(…)).”

ggplot(df, aes(x = gender, y = income, fill = gender)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) + # box plot
  geom_jitter(aes(color = gender), width = 0.2, alpha = 0.3, size = 1) + # individual data points
  scale_y_continuous(trans = "log10", # y axis on a log scale
                     breaks = c(100, 1000, 10000, 100000, 1000000),
                     labels = scales::comma) +
  labs(title = "Income Distribution by Gender with Individual Data Points", x = "Gender", y = "Income (log scale)") +
  theme_minimal() +
  theme(legend.position = "none")
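The author's code puts `log(income)` inside `aes()`, whereas the ChatGPT answers keep `income` and transform the axis with `trans = "log10"`. A minimal sketch of the difference, on simulated data (the values are made up for illustration): the first option labels the axis in natural-log units, the second keeps tick labels in the original income units. `scale_y_log10()` is the ggplot2 shorthand for a continuous y scale with a log10 transformation.

```r
library(ggplot2)

# Simulated incomes for two groups; illustrative only.
set.seed(2)
toy = data.frame(gender = rep(c("Men", "Women"), each = 100),
                 income = rlnorm(200, meanlog = 8, sdlog = 1))

# Option 1: transform the variable. The axis shows natural-log units.
p_log = ggplot(toy, aes(x = gender, y = log(income))) +
  geom_boxplot()

# Option 2: transform the axis. Tick labels stay in original income units.
p_scale = ggplot(toy, aes(x = gender, y = income)) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma)

p_log
p_scale
```

The shapes of the two plots match up to the change of log base; the choice mainly affects how readable the axis is for the audience.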

Task 5. Draw scatter plots

5.1 Relationship between income and age

p = ggplot(data = df, aes(x = age, y = log(income)))
p + geom_point() + labs(x = "Age", y = "Income (logarithm scale)") + ggtitle("Income by age")

## Warning: Removed 226 rows containing missing values (`geom_point()`).

5.2 Relationship between income and age by gender

p = ggplot(data = df, aes(x = age, y = log(income), col = factor(gender)))
p1 = p + geom_point() + labs(x = "Age", y = "Income (logarithm scale)") + ggtitle("Relationship between income and age by gender")
p1

## Warning: Removed 226 rows containing missing values (`geom_point()`).

[ChatGPT]

PROMPT: “Draw a scatter plot showing the relationship between income (income) and age (age) by gender (gender)”

ChatGPT:

ggplot(df, aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.3, size = 1) + # data points
  geom_smooth(method = "loess", se = TRUE) + # trend line
  scale_y_continuous(trans = "log10", # y axis on a log scale for readability
                     labels = scales::comma) +
  labs(title = "Relationship Between Age and Income by Gender", x = "Age", y = "Income (log scale)", color = "Gender") +
  theme_minimal()

5.3 Add a line describing the relationship

p2 = p1 + geom_smooth()
p2

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 243 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 226 rows containing missing values (`geom_point()`).

p3 = p1 + geom_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3))
p3

## Warning: Removed 243 rows containing non-finite values (`stat_smooth()`).
## Removed 226 rows containing missing values (`geom_point()`).

grid.arrange(p2, p3, ncol = 2)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 243 rows containing non-finite values (`stat_smooth()`).
## Removed 226 rows containing missing values (`geom_point()`).

## Warning: Removed 243 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 226 rows containing missing values (`geom_point()`).

[ChatGPT]

PROMPT 1: “Add a line describing the true relationship in the data”

ChatGPT:

“Perfect! If you want to add a line describing the true relationship in the data (that is, one that follows the scatter of points as closely as possible), then geom_smooth(method = "loess") is an ideal choice, because it does not assume a linear relationship but models the actual trend from the data.”

ggplot(df, aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.3, size = 1) + # data points
  geom_smooth(method = "loess", se = TRUE, linetype = "solid", size = 1.2) + # line following the data closely
  scale_y_continuous(trans = "log10", labels = scales::comma) +
  labs(title = "Age vs. Income with Smoothed Trend by Gender", x = "Age", y = "Income (log scale)", color = "Gender") +
  theme_minimal()

PROMPT 2: “A third-degree relationship seems to fit this data best. Draw a line describing the data with a third-degree relationship”

ChatGPT:

“Great! If you find that a third-degree relationship (cubic polynomial regression) best fits the data for age (age) and income (income), we can add a trend line from a cubic regression model by using geom_smooth(method = "lm", formula = y ~ poly(x, 3)).”

ggplot(df, aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = TRUE, size = 1.2) + # cubic regression line
  scale_y_continuous(trans = "log10", labels = scales::comma) +
  labs(title = "Polynomial (Degree 3) Fit: Age vs Income by Gender", x = "Age", y = "Income (log scale)", color = "Gender") +
  theme_minimal()
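The author's code in 5.3 writes the cubic as `y ~ x + I(x^2) + I(x^3)`, while the ChatGPT answer uses `y ~ poly(x, 3)`. The two formulas parameterize the same cubic model (raw powers versus orthogonal polynomials), so the fitted curve should be the same even though the coefficients differ. A small check on simulated data (the variables and coefficients below are made up for illustration):

```r
# Simulated age and outcome; illustrative only.
set.seed(3)
age = runif(200, 18, 80)
y = 5 + 0.2 * age - 0.002 * age^2 + rnorm(200, sd = 0.5)

# Raw powers, as in the author's geom_smooth() formula.
fit_raw = lm(y ~ age + I(age^2) + I(age^3))

# Orthogonal polynomials, as in the ChatGPT formula.
fit_poly = lm(y ~ poly(age, 3))

# Compare fitted values; agreement is expected up to numerical tolerance.
all.equal(unname(fitted(fit_raw)), unname(fitted(fit_poly)))

# The coefficients themselves are not comparable across the two fits.
coef(fit_raw)
coef(fit_poly)
```

Orthogonal polynomials tend to be numerically better behaved, while raw powers give coefficients that are easier to read off as an equation in age.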

Data Analysis with R - Phuong Nam Institute (10-15 May 2025)

Thach Tran

2025-04-20


Task 8. Record all of the functions/commands above and share them on your RPubs account