Ngày 2: Hiển thị dữ liệu

Việc 1. Đọc dữ liệu vào R

df = read.csv("C:\\Thach\\VN trips\\2025_3Jun\\VNUHCM\\Datasets\\CHNS data full.csv")

Việc 2. Biểu đồ histogram

2.1 Phân bố thu nhập

PROMT: Sử dụng gói lệnh ‘ggplot2’ vẽ biểu đồ histogram của thu nhập (income), tô màu xanh với viền trắng. Đặt tên trục x là ‘Thu nhập’, trục y là ‘Số người’ và tựa là ‘Phân bố thu nhập’

library(ggplot2)

ggplot(df, aes(x = income)) +
  geom_histogram(fill = "blue", color = "white", bins = 30) +
  labs(
    x = "Thu nhập",
    y = "Số người",
    title = "Phân bố thu nhập"
  ) +
  theme_minimal()

## Warning: Removed 224 rows containing non-finite values (`stat_bin()`).

Nhận xét: phân bố lệch không rõ ràng (non-informative)

2.2 Phân bố thu nhập rõ ràng hơn- Sử dụng giá trị logarithm

PROMPT: Vẽ biểu đồ histogram phân bố thu nhập với giá trị logarithm của thu nhập (log(income))

df_log <- df[df$income > 0, ]

ggplot(df_log, aes(x = log(income))) +
  geom_histogram(fill = "blue", color = "white", bins = 30) +
  labs(
    x = "Log(Thu nhập)",
    y = "Số người",
    title = "Phân bố log(thu nhập)"
  ) +
  theme_minimal()

## Warning: Removed 224 rows containing non-finite values (`stat_bin()`).

2.3 Phân bố thu nhập theo giới tính

PROMPT: Vẽ biểu đồ histogram phân bố của log(income) theo giới tính (biến gender là biến factor) trong đó biểu đồ histogram của nam màu đỏ và nữ màu xanh

# Lọc dữ liệu để loại bỏ income <= 0
df_log <- df[df$income > 0, ]

# Đảm bảo gender là factor và đặt đúng cấp độ (levels) nếu cần
df_log$gender <- factor(df_log$gender)

# Kiểm tra levels hiện tại
levels(df_log$gender)

## [1] "1" "2"

# Vẽ biểu đồ
ggplot(df_log, aes(x = log(income), fill = gender)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30, color = "white") +
  scale_fill_manual(values = c("1" = "red", "2" = "blue")) +
  labs(
    x = "Log(Thu nhập)",
    y = "Số người",
    title = "Phân bố log(thu nhập) theo giới tính",
    fill = "Giới tính"
  ) +
  theme_minimal()

## Warning: Removed 224 rows containing non-finite values (`stat_bin()`).

Việc 3. Biểu đồ thanh

3.1 Thu nhập theo trình độ học vấn

PROMPT 1: Vẽ biểu đồ thanh thể hiện trung vị của thu nhập theo trình độ học vấn (biến factor, ‘edu’). Tô màu xanh cho “Primary”, đỏ cho “Secondary” và cam cho “Tertiary”

# Lọc dữ liệu hợp lệ: income không NA và > 0
df_med <- df[!is.na(df$income) & df$income > 0, ]

# Đảm bảo edu là factor với các mức đúng
df_med$edu <- factor(df_med$edu, levels = c(1, 2, 3), labels = c("Primary", "Secondary", "Tertiary"))

# Vẽ biểu đồ thanh trung vị thu nhập theo trình độ học vấn
ggplot(df_med, aes(x = edu, y = income, fill = edu)) +
  stat_summary(fun = median, geom = "bar", color = "black") +
  scale_fill_manual(values = c(
    "Primary" = "blue",
    "Secondary" = "red",
    "Tertiary" = "orange"
  )) +
  labs(
    x = "Trình độ học vấn",
    y = "Trung vị thu nhập",
    title = "Trung vị thu nhập theo trình độ học vấn",
    fill = "Học vấn"
  ) +
  theme_minimal()

PROMT 2: Dùng lệnh geom_bar vẽ biểu đồ thanh thể hiện trung vị của thu nhập theo trình độ học vấn (biến factor, ‘edu’). Tô màu xanh cho “Primary”, đỏ cho “Secondary” và cam cho “Tertiary”

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Bước 1: Lọc dữ liệu hợp lệ
df_med <- df[!is.na(df$income) & df$income > 0, ]

# Bước 2: Đảm bảo edu là factor với nhãn
df_med$edu <- factor(df_med$edu, levels = c(1, 2, 3),
                     labels = c("Primary", "Secondary", "Tertiary"))

# Bước 3: Tính trung vị thu nhập theo học vấn
med_data <- df_med %>%
  group_by(edu) %>%
  summarise(median_income = median(income))

# Bước 4: Vẽ biểu đồ với geom_bar
ggplot(med_data, aes(x = edu, y = median_income, fill = edu)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c(
    "Primary" = "blue",
    "Secondary" = "red",
    "Tertiary" = "orange"
  )) +
  labs(
    x = "Trình độ học vấn",
    y = "Trung vị thu nhập",
    title = "Trung vị thu nhập theo trình độ học vấn",
    fill = "Học vấn"
  ) +
  theme_minimal()

3.2 Thêm giá trị thu nhập

PROMPT: Thêm giá trị thu nhập vào biểu đồ

ggplot(med_data, aes(x = edu, y = median_income, fill = edu)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = round(median_income, 0)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c(
    "Primary" = "blue",
    "Secondary" = "red",
    "Tertiary" = "orange"
  )) +
  labs(
    x = "Trình độ học vấn",
    y = "Trung vị thu nhập",
    title = "Trung vị thu nhập theo trình độ học vấn",
    fill = "Học vấn"
  ) +
  theme_minimal()

Việc 4. Soạn biểu đồ hộp so sánh phân bố của thu nhập theo giới tính

PROMPT: Soạn biểu đồ hộp để so sánh phân bố của thu nhập (log(income)) theo giới tính (gender: 1 = Male, 2 = Female) và trình bày dữ liệu mỗi cá nhân. Hiển thị cả giá trị outliers

# Bước 1: Lọc dữ liệu hợp lệ
df_plot <- df %>%
  filter(!is.na(income), income > 0, !is.na(gender)) %>%
  mutate(
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female")),
    log_income = log(income)
  )

# Bước 2: Vẽ biểu đồ hộp với điểm dữ liệu cá nhân
ggplot(df_plot, aes(x = gender, y = log_income, fill = gender)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2, outlier.alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 1, color = "black") +
  labs(
    x = "Giới tính",
    y = "Log(Thu nhập)",
    title = "Phân bố log(thu nhập) theo giới tính",
    fill = "Giới tính"
  ) +
  scale_fill_manual(values = c("Male" = "red", "Female" = "blue")) +
  theme_minimal()

Việc 5. Soạn biểu đồ tương quan

5.1 Mối liên quan giữa thu nhập và tuổi

PROMPT: Soạn biểu đồ tương quan thể hiện mối liên quan giữa tuổi (age) và thu nhập (log(income))

# Lọc dữ liệu hợp lệ
df_corr <- df[!is.na(df$age) & !is.na(df$income) & df$income > 0, ]

# Vẽ biểu đồ tương quan giữa tuổi và log(thu nhập)
ggplot(df_corr, aes(x = age, y = log(income))) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    x = "Tuổi",
    y = "Log(Thu nhập)",
    title = "Biểu đồ tương quan giữa tuổi và log(thu nhập)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

5.2 Mối liên quan giữa thu nhập và tuổi theo giới tính

PROMPT: Hãy thể hiện mối liên quan giữa thu nhập (log(income)) và tuổi (age) theo giới tính (gender: 1 = Male, 2 = Female)

# Bước 1: Chuẩn bị dữ liệu
df_plot <- df %>%
  filter(!is.na(age), !is.na(income), income > 0, !is.na(gender)) %>%
  mutate(
    log_income = log(income),
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Bước 2: Vẽ biểu đồ tương quan log(income) ~ age, phân theo gender
ggplot(df_plot, aes(x = age, y = log_income, color = gender)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    x = "Tuổi",
    y = "Log(Thu nhập)",
    color = "Giới tính",
    title = "Mối liên hệ giữa tuổi và log(thu nhập) theo giới tính"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

5.3 Thêm đường mô tả mối liên quan

PROMPT 1 : Thêm đường mô tả mối liên quan thật sự của dữ liệu

# Chuẩn bị dữ liệu
df_plot <- df %>%
  filter(!is.na(age), !is.na(income), income > 0, !is.na(gender)) %>%
  mutate(
    log_income = log(income),
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Vẽ scatter plot + đường mô tả thực tế bằng loess
ggplot(df_plot, aes(x = age, y = log_income, color = gender)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = TRUE, span = 0.8) +
  labs(
    x = "Tuổi",
    y = "Log(Thu nhập)",
    color = "Giới tính",
    title = "Mối liên hệ giữa tuổi và log(thu nhập) theo giới tính (LOESS)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

PROMPT 2: Mối liên quan bậc 3 có vẻ phù hợp nhất

# Chuẩn bị dữ liệu
df_plot <- df %>%
  filter(!is.na(age), !is.na(income), income > 0, !is.na(gender)) %>%
  mutate(
    log_income = log(income),
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Vẽ biểu đồ với hồi quy bậc 3
ggplot(df_plot, aes(x = age, y = log_income, color = gender)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = TRUE) +
  labs(
    x = "Tuổi",
    y = "Log(Thu nhập)",
    color = "Giới tính",
    title = "Mối liên hệ bậc 3 giữa tuổi và log(thu nhập) theo giới tính"
  ) +
  theme_minimal()

So sánh 2 nhóm - biến liên tục

Việc 6. So sánh mật độ xương giữa nam và nữ

6.1 Đọc dữ liệu vào R

df = read.csv("C:\\Thach\\VN trips\\2025_3Jun\\VNUHCM\\Datasets\\Bone data.csv")
dim(df)

## [1] 2162    9

head(df)

##   id    sex age weight height prior.fx fnbmd smoking fx
## 1  1   Male  73     98    175        0  1.08       1  0
## 2  2 Female  68     72    166        0  0.97       0  0
## 3  3   Male  68     87    184        0  1.01       0  0
## 4  4 Female  62     72    173        0  0.84       1  0
## 5  5   Male  61     72    173        0  0.81       1  0
## 6  6 Female  76     57    156        0  0.74       0  0

6.2 Vẽ histogram đánh giá phân bố mật độ xương

PROMPT: Dùng gói lệnh ‘lessR’ vẽ biểu đồ histogram mô tả phân bố của mật độ xương (fnbmd), tô màu xanh và đặt tựa trục x là ‘Mật độ xương ở cổ xương đùi (g/cm2)’, trục y là ‘Số người’, tựa là ‘Phân bố mật độ xương’

library(lessR)

## Warning: package 'lessR' was built under R version 4.3.3

## 
## lessR 4.3.9                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, 
## graphics, testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including time series forecasting
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()

## 
## Attaching package: 'lessR'

## The following objects are masked from 'package:dplyr':
## 
##     recode, rename

Histogram(fnbmd,
          data = df,
          fill = "blue",
          xlab = "Mật độ xương ở cổ xương đùi (g/cm2)",
          ylab = "Số người",
          main = "Phân bố mật độ xương")

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## Histogram(fnbmd, density=TRUE)  # smoothed curve + histogram 
## Plot(fnbmd)  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- fnbmd --- 
##  
##        n    miss      mean        sd       min       mdn       max 
##      2122      40     0.829     0.155     0.280     0.820     1.510 
## 
##   
## --- Outliers ---     from the box plot: 33 
##  
## Small      Large 
## -----      ----- 
##  0.3      1.5 
##  0.3      1.5 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.3 
##  0.4      1.3 
##  0.4      1.3 
##           1.3 
##           1.3 
##           1.3 
##           1.3 
##           1.2 
##           1.2 
##           1.2 
## 
## + 15 more outliers 
## 
## 
## Bin Width: 0.1 
## Number of Bins: 14 
##  
##        Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------- 
##  0.2 > 0.3    0.25      1    0.00        1     0.00 
##  0.3 > 0.4    0.35      9    0.00       10     0.00 
##  0.4 > 0.5    0.45     15    0.01       25     0.01 
##  0.5 > 0.6    0.55    103    0.05      128     0.06 
##  0.6 > 0.7    0.65    306    0.14      434     0.20 
##  0.7 > 0.8    0.75    522    0.24      956     0.44 
##  0.8 > 0.9    0.85    534    0.25     1490     0.69 
##  0.9 > 1.0    0.95    371    0.17     1861     0.86 
##  1.0 > 1.1    1.05    183    0.08     2044     0.95 
##  1.1 > 1.2    1.15     48    0.02     2092     0.97 
##  1.2 > 1.3    1.25     21    0.01     2113     0.98 
##  1.3 > 1.4    1.35      6    0.00     2119     0.98 
##  1.4 > 1.5    1.45      2    0.00     2121     0.98 
##  1.5 > 1.6    1.55      1    0.00     2122     0.98

6.3 So sánh mật độ xương cổ xương đùi giữa nam và nữ

PROMPT: Dùng gói lệnh ‘lessR’ so sánh mật độ xương giữa nam và nữ (sex)

ttest(fnbmd ~ sex, data = df)

## 
## Compare fnbmd across sex with levels Male and Female 
## Grouping Variable:  sex
## Response Variable:  fnbmd
## 
## 
## ------ Describe ------
## 
## fnbmd for sex Male:  n.miss = 23,  n = 822,  mean = 0.910,  sd = 0.153
## fnbmd for sex Female:  n.miss = 17,  n = 1300,  mean = 0.778,  sd = 0.132
## 
## Mean Difference of fnbmd:  0.132
## 
## Weighted Average Standard Deviation:   0.141 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of fnbmd.
## Group Male: Sample mean assumed normal because n > 30, so no test needed.
## Group Female: Sample mean assumed normal because n > 30, so no test needed.
## 
## Null hypothesis is equal variances of fnbmd, homogeneous.
## Variance Ratio test:  F = 0.023/0.018 = 1.336,  df = 821;1299,  p-value = 0.000
## Levene's test, Brown-Forsythe:  t = 3.449,  df = 2120,  p-value = 0.001
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of fnbmd for each sex 
## 
## t-cutoff for 95% range of variation: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.006 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 21.080,  df = 2120,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.012
## 95% Confidence Interval for Mean Difference:  0.120 to 0.144
## 
## 
## --- Do not assume equal population variances of fnbmd for each sex 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.006 
## 
## Hypothesis Test of 0 Mean Diff:  t = 20.407,  df = 1560.981, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.013
## 95% Confidence Interval for Mean Difference:  0.119 to 0.145
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of fnbmd for each sex 
## 
## Standardized Mean Difference of fnbmd, Cohen's d:  0.939
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for sex Male: 0.044
## Density bandwidth for sex Female: 0.034

Việc 7. Đánh giá ảnh hưởng của cafe lên RER bằng phương pháp bootstrap

PROMPT 1: Tôi có dữ liệu của 18 người trong 2 nhóm như sau: placebo = 105, 119, 100, 97, 96, 101, 94, 95, 98 và coffee = 96, 99, 94, 89, 96, 93, 88, 105, 88. Bạn giúp viết lệnh R để phân tích sự khác biệt giữa 2 nhóm bằng phương pháp bootstrap

# Bước 1: nhập dữ liệu

placebo <- c(105, 119, 100, 97, 96, 101, 94, 95, 98)
coffee <- c(96, 99, 94, 89, 96, 93, 88, 105, 88)

# Bước 2: định nghĩa hàm bootstrap

set.seed(123)  # Đặt seed để kết quả lặp lại được

n_boot <- 10000
boot_diff <- numeric(n_boot)

for (i in 1:n_boot) {
  sample_placebo <- sample(placebo, replace = TRUE)
  sample_coffee  <- sample(coffee, replace = TRUE)
  boot_diff[i]   <- mean(sample_placebo) - mean(sample_coffee)
}

# Bước 3: tóm tắt và hiển thị kết quả

## Trung bình chênh lệch bootstrap
mean(boot_diff)

## [1] 6.333167

## Khoảng tin cậy 95%
quantile(boot_diff, c(0.025, 0.975))

##       2.5%      97.5% 
##  0.8888889 12.4444444

## Vẽ histogram của chênh lệch trung bình
hist(boot_diff,
     main = "Phân bố bootstrap của chênh lệch trung bình (Placebo - Coffee)",
     xlab = "Chênh lệch trung bình",
     col = "lightblue", border = "white")
abline(v = quantile(boot_diff, c(0.025, 0.975)), col = "red", lty = 2)
abline(v = mean(boot_diff), col = "blue", lwd = 2)

PROMPT 2: Quá phức tạp. Bạn có thể sử dụng lệnh ‘two.boot’ trong gói lệnh ‘simpleboot’ hoặc ‘boot’ để thực hiện bootstrap và trình bày nhiều giá trị 95% CI (percentile, BCA, normal, basic) không?

library(simpleboot)

## Simple Bootstrap Routines (1.1-7)

library(boot)

set.seed(123)  # Để kết quả tái lập được

boot_result <- two.boot(placebo, coffee, FUN = mean, R = 10000)

boot.ci(boot_result, type = c("perc", "bca", "norm", "basic"))

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_result, type = c("perc", "bca", "norm", 
##     "basic"))
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.383, 12.188 )   ( 0.111, 11.778 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.889, 12.556 )   ( 0.778, 12.444 )  
## Calculations and Intervals on Original Scale

So sánh 2 nhóm - biến phân loại

Việc 8. So sánh tỉ lệ gãy xương giữa nam và nữ

PROMPT: So sánh tỉ lệ gãy xương (fx) giữa nam và nữ (sex)

# Tạo bảng tần số

table_fx_sex <- table(df$sex, df$fx)
table_fx_sex

##         
##            0   1
##   Female 916 401
##   Male   701 144

# Thực hiện chi-square test

chisq.test(table_fx_sex)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_fx_sex
## X-squared = 48.363, df = 1, p-value = 0.000000000003542

Ứng dụng Phân tích Dữ liệu với R - ĐHQG TPHCM (16- 19/6/2025)

Thach Tran

2025-06-02