Ngày 3: Phân tích so sánh

Việc 1. Đọc dữ liệu vào R

ob = read.csv("C:\\Thach\\VN trips\\2024_2Aug\\Data Analysis workshop (Hospital 108)\\Datasets\\obesity data.csv")

Việc 2. So sánh biến liên tục giữa 2 nhóm:

2.1 So sánh chỉ số khối cơ thể giữa nam và nữ

2.1.1 Biểu đồ phân bố chỉ số khối cơ thể

library(ggplot2)

p = ggplot(data = ob, aes(x = bmi))
p + geom_histogram(fill = "blue", col = "white") + labs(x = "Chỉ số khối cơ thể (kg/m2)", y = "Số người", title = "Phân bố chỉ số khối cơ thể")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.2.2 Giả thuyết

Giả thuyết vô hiệu: Không có khác biệt về chỉ số khối cơ thể giữa nam và nữ Giả tuyết thay thế: Có sự khác biệt về chỉ số khối cơ thể giữa nam và nữ

2.2.3 So sánh chỉ số khối cơ thể giữa nam và nữ

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~bmi | gender, data = ob)

	F (N=862)	M (N=355)	Overall (N=1217)
bmi
Mean (SD)	22.3 (3.05)	22.7 (3.04)	22.4 (3.06)
Median [Min, Max]	22.1 [15.2, 37.1]	22.5 [14.5, 34.7]	22.2 [14.5, 37.1]

t.test(bmi~ gender, data = ob)

## 
##  Welch Two Sample t-test
## 
## data:  bmi by gender
## t = -2.4841, df = 662.09, p-value = 0.01324
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.85400278 -0.09994709
## sample estimates:
## mean in group F mean in group M 
##        22.25626        22.73324

2.2 So sánh mật độ xương toàn cơ thể giữa nam và nữ

p = ggplot(data = ob, aes(x = wbbmd))
p + geom_histogram(fill = "blue", col = "white") + labs(x = "Mật độ xương toàn cơ thể (g/cm2)", y = "Số người", title = "Phân bố mật độ xương toàn cơ thể")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

table1(~wbbmd | gender, data = ob)

	F (N=862)	M (N=355)	Overall (N=1217)
wbbmd
Mean (SD)	0.988 (0.111)	1.06 (0.101)	1.01 (0.113)
Median [Min, Max]	0.990 [0.650, 1.35]	1.06 [0.780, 1.34]	1.01 [0.650, 1.35]

t.test(wbbmd ~ gender, data = ob)

## 
##  Welch Two Sample t-test
## 
## data:  wbbmd by gender
## t = -11.073, df = 724.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.08538427 -0.05966707
## sample estimates:
## mean in group F mean in group M 
##        0.987587        1.060113

Việc 3. So sánh tuổi giữa nam và nữ béo phì

3.1 Tạo dữ liệu chỉ gồm người béo phì

obese = subset(ob, bmi>= 30)
table(obese$gender)

## 
##  F  M 
## 11  4

3.2 Phân bố tuổi

p = ggplot(data = ob, aes(x = age))
p + geom_histogram(fill = "blue", col = "white") + labs(x = "Tuổi (năm)", y = "Số người", title = "Phân bố tuổi")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3 So sánh khác biệt tuổi trung bình giữa nam và nữ béo phì

table1(~ age | gender, data = obese)

	F (N=11)	M (N=4)	Overall (N=15)
age
Mean (SD)	54.4 (13.0)	37.0 (13.3)	49.7 (14.9)
Median [Min, Max]	56.0 [21.0, 70.0]	40.5 [18.0, 49.0]	54.0 [18.0, 70.0]

t.test(age ~ gender, data = obese)

## 
##  Welch Two Sample t-test
## 
## data:  age by gender
## t = 2.25, df = 5.2628, p-value = 0.07162
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -2.179531 36.906803
## sample estimates:
## mean in group F mean in group M 
##        54.36364        37.00000

3.4 Thực hiện phân tích Wilcoxon đánh giá khác biệt tuổi giữa nam và nữ béo phì

table1(~ age | gender, data = obese, render.continuous = c(. = "Mean (SD)", . = "Median [Q1, Q3]"))

	F (N=11)	M (N=4)	Overall (N=15)
age
Mean (SD)	54.4 (13.0)	37.0 (13.3)	49.7 (14.9)
Median [Q1, Q3]	56.0 [53.0, 59.0]	40.5 [34.5, 43.0]	54.0 [44.0, 57.0]

wilcox.test(age ~ gender, data = obese)

## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by gender
## W = 40, p-value = 0.02209
## alternative hypothesis: true location shift is not equal to 0

3.5 Thực hiện phân tích bootstrap để đánh giá khác biệt về tuổi giữa nam và nữ béo phì

library(simpleboot)

## Warning: package 'simpleboot' was built under R version 4.3.2

## Simple Bootstrap Routines (1.1-7)

men = subset(ob, gender == "M" & bmi>= 30)
women = subset(ob, gender == "F" & bmi >= 30)

# men = ob %>% filter(gender == "M")
# women = ob %>% filter(gender == "F")

set.seed(1234)
b.means = two.boot(men$age, women$age, mean, R = 1000)
hist (b.means$t, breaks = 20)

quantile(b.means$t, probs=c(0.025, 0.50, 0.975))

##       2.5%        50%      97.5% 
## -31.662500 -17.261364  -5.153409

Việc 4. So sánh biến phân loại giữa 2 nhóm

4.1 So sánh tình trạng béo phì giữa nam và nữ

4.1.1 Tạo biến obese từ chỉ số khối cơ thể

ob$obese[ob$bmi< 18.5] = "Underweight"
ob$obese[ob$bmi>= 18.5 & ob$bmi< 25] = "Normal"
ob$obese[ob$bmi>= 25 & ob$bmi< 30] = "Overweight"
ob$obese[ob$bmi>= 30] = "Obese"

table(ob$obese, ob$gender)

##              
##                 F   M
##   Normal      626 239
##   Obese        11   4
##   Overweight  149  81
##   Underweight  76  31

4.1.2 Giả thuyết

Giả thuyết vô hiệu: Không có sự khác biệt về tình trạng béo phì giữa nam và nữ Giả thuyết thay thế: Có sự khác biệt về tình trạng béo phì giữa nam và nữ

4.1.3 So sánh tình trang béo phì giữa nam và nữ

ob$obese.n = factor(ob$obese, levels = c("Underweight", "Normal", "Overweight", "Obese"))
table1(~obese.n | gender, data = ob)

	F (N=862)	M (N=355)	Overall (N=1217)
obese.n
Underweight	76 (8.8%)	31 (8.7%)	107 (8.8%)
Normal	626 (72.6%)	239 (67.3%)	865 (71.1%)
Overweight	149 (17.3%)	81 (22.8%)	230 (18.9%)
Obese	11 (1.3%)	4 (1.1%)	15 (1.2%)

chisq.test(ob$obese.n, ob$gender, correct = FALSE)

## Warning in chisq.test(ob$obese.n, ob$gender, correct = FALSE): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  ob$obese.n and ob$gender
## X-squared = 5.1114, df = 3, p-value = 0.1638

4.2 So sánh tiền căn cao HA giữa nam và nữ

ob$hypert = as.factor(ob$hypertension)
table1(~ hypert | gender, data = ob)

	F (N=862)	M (N=355)	Overall (N=1217)
hypert
0	430 (49.9%)	170 (47.9%)	600 (49.3%)
1	432 (50.1%)	185 (52.1%)	617 (50.7%)

chisq.test(ob$hypert, ob$gender, correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  ob$hypert and ob$gender
## X-squared = 0.40105, df = 1, p-value = 0.5265

Việc 5. So sánh biến liên tục cho hơn 2 nhóm: So sánh mật độ xương toàn thân theo tình trạng béo phì

5.1 Giả thuyết

Giả thuyết vô hiệu: Không có sự khác biệt về mật độ xương toàn thân giữa các tình trạng béo phì. Giả thuyết thay thế: Có sự khác biệt về mật độ xương toàn thân giữa các tình trạng béo phì.

5.2 Phân bố của mật độ xương toàn thân

p = ggplot(data = ob, aes(x = wbbmd))
p + geom_histogram(fill = "blue", col = "white") + labs(x = "Mật độ xương toàn thân (g/cm2)", y = "Số người", title = "Phân bố mật độ xương toàn thân")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

5.3 Biểu đồ hộp so sánh mật độ xương toàn thân giữa các tình trạng béo phì

p = ggplot(data = ob, aes(x = obese.n, y = wbbmd, fill = obese.n, col = obese.n))
p1 = p + geom_boxplot(col = "black") + geom_jitter(alpha = 0.05) + labs(x = "Tình trạng béo phì", y = "Mật độ xương toàn thân (g/cm2)") + ggtitle("Mật độ xương toàn thân theo tình trạng béo phì")
p1

5.4 So sánh mật độ xương toàn thân theo tình trạng béo phì

table1(~ wbbmd | obese.n, data = ob)

	Underweight (N=107)	Normal (N=865)	Overweight (N=230)	Obese (N=15)	Overall (N=1217)
wbbmd
Mean (SD)	0.971 (0.0998)	1.01 (0.112)	1.03 (0.120)	1.03 (0.114)	1.01 (0.113)
Median [Min, Max]	0.980 [0.740, 1.17]	1.02 [0.650, 1.35]	1.03 [0.700, 1.29]	1.06 [0.790, 1.19]	1.01 [0.650, 1.35]

anova = aov(wbbmd ~ obese.n, data = ob)
summary(anova)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## obese.n        3   0.24 0.07991   6.326 0.000295 ***
## Residuals   1213  15.32 0.01263                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Phân tích hậu định là cần thiết vì giá trị global P-value< 0.05

Phân tích hậu định

tukey.anova= TukeyHSD(anova)
tukey.anova

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = wbbmd ~ obese.n, data = ob)
## 
## $obese.n
##                               diff          lwr        upr     p adj
## Normal-Underweight     0.037682027  0.008052906 0.06731115 0.0060418
## Overweight-Underweight 0.056395774  0.022562445 0.09022910 0.0001143
## Obese-Underweight      0.060105919 -0.019606866 0.13981870 0.2119646
## Overweight-Normal      0.018713747 -0.002735926 0.04016342 0.1120163
## Obese-Normal           0.022423892 -0.052872339 0.09772012 0.8696949
## Obese-Overweight       0.003710145 -0.073337448 0.08075774 0.9993207

Basic Data Analysis Workshop - Hanoi

Thach Tran

2024-08-17

Chương trình tập huấn phân tích dữ liệu bằng ngôn ngữ R - BV 108