Ngày 2: So sánh 2 nhóm

Việc 1. Đọc dữ liệu

df = read.csv("/Users/trangan/Desktop/AI in Data Analysis Course/Data Practice/Bone data.csv")
dim(df)

## [1] 2162    9

head(df)

##   id    sex age weight height prior.fx fnbmd smoking fx
## 1  1   Male  73     98    175        0  1.08       1  0
## 2  2 Female  68     72    166        0  0.97       0  0
## 3  3   Male  68     87    184        0  1.01       0  0
## 4  4 Female  62     72    173        0  0.84       1  0
## 5  5   Male  61     72    173        0  0.81       1  0
## 6  6 Female  76     57    156        0  0.74       0  0

summary(df)

##        id             sex                 age            weight      
##  Min.   :   1.0   Length:2162        Min.   :57.00   Min.   : 34.00  
##  1st Qu.: 545.2   Class :character   1st Qu.:65.00   1st Qu.: 60.00  
##  Median :1093.5   Mode  :character   Median :69.00   Median : 69.00  
##  Mean   :1094.7                      Mean   :70.65   Mean   : 70.14  
##  3rd Qu.:1639.8                      3rd Qu.:75.00   3rd Qu.: 79.00  
##  Max.   :2216.0                      Max.   :94.00   Max.   :133.00  
##                                                      NA's   :1       
##      height         prior.fx          fnbmd           smoking      
##  Min.   :136.0   Min.   :0.0000   Min.   :0.2800   Min.   :0.0000  
##  1st Qu.:158.0   1st Qu.:0.0000   1st Qu.:0.7300   1st Qu.:0.0000  
##  Median :164.0   Median :0.0000   Median :0.8200   Median :0.0000  
##  Mean   :164.9   Mean   :0.1554   Mean   :0.8287   Mean   :0.4251  
##  3rd Qu.:171.0   3rd Qu.:0.0000   3rd Qu.:0.9300   3rd Qu.:1.0000  
##  Max.   :196.0   Max.   :1.0000   Max.   :1.5100   Max.   :1.0000  
##                                   NA's   :40                       
##        fx        
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2521  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

Việc 2. So sánh MĐX giữa nam và nữ

2.1. Histogram đánh giá phân bố

#install.packages("lessR")   # chỉ cần chạy 1 lần
library(lessR)

## Warning: package 'lessR' was built under R version 4.4.1

## 
## lessR 4.5                            feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")  Read data file, many formats available, e.g., Excel
##   d is the default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, forecasting, and aggregation to pivot tables.
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including modern time series forecasting
##   and many, new Plotly interactive visualizations output. Most
##   visualization functions are now reorganized to three functions:
##      Chart(): type="bar", "pie", "radar", "bubble", "treemap", "icicle"
##      X(): type="histogram", "density", "vbs" and more
##      XY(): type="scatter" for a scatterplot, or "contour", "smooth"
##    Most previous function calls still work, such as:
##      BarChart(), Histogram, and Plot().
##   Enter: news(package="lessR"), or ?Chart, ?X, or ?XY
## There is also Flows() for Sankey flow diagrams, see ?Flows
## 
## Interactive data analysis for constructing visualizations.
##   Enter: interact()

# Vẽ histogram
Histogram(fnbmd,
          data = df,
          fill = "blue")

## lessR visualizations are now unified over just three core functions:
##   - Chart() for pivot tables, such as bar charts. More info: ?Chart
##   - X() for a single variable x, such as histograms. More info: ?X
##   - XY() for scatterplots of two variables, x and y. More info: ?XY
## 
## Histogram() is deprecated, though still working for now.
## Please use X(..., type = "histogram") going forward.

## [Interactive plot from the Plotly R package (Sievert, 2020)]

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## X(fnbmd, type="density")  # smoothed curve + histogram 
## X(fnbmd, type="vbs")  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- fnbmd --- 
##  
##        n    miss      mean        sd       min       mdn       max 
##      2122      40     0.829     0.155     0.280     0.820     1.510 
##  
## 
##   
## --- Outliers ---     from the box plot: 33 
##  
## Small      Large 
## -----      ----- 
##  0.3      1.5 
##  0.3      1.5 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.4 
##  0.4      1.3 
##  0.4      1.3 
##  0.4      1.3 
##           1.3 
##           1.3 
##           1.3 
##           1.3 
##           1.2 
##           1.2 
##           1.2 
## 
## + 15 more outliers 
## 
## 
## Bin Width: 0.1 
## Number of Bins: 14 
##  
##        Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------- 
##  0.2 > 0.3    0.25      1    0.00        1     0.00 
##  0.3 > 0.4    0.35      9    0.00       10     0.00 
##  0.4 > 0.5    0.45     15    0.01       25     0.01 
##  0.5 > 0.6    0.55    103    0.05      128     0.06 
##  0.6 > 0.7    0.65    306    0.14      434     0.20 
##  0.7 > 0.8    0.75    522    0.25      956     0.45 
##  0.8 > 0.9    0.85    534    0.25     1490     0.70 
##  0.9 > 1.0    0.95    371    0.17     1861     0.88 
##  1.0 > 1.1    1.05    183    0.09     2044     0.96 
##  1.1 > 1.2    1.15     48    0.02     2092     0.99 
##  1.2 > 1.3    1.25     21    0.01     2113     1.00 
##  1.3 > 1.4    1.35      6    0.00     2119     1.00 
##  1.4 > 1.5    1.45      2    0.00     2121     1.00 
##  1.5 > 1.6    1.55      1    0.00     2122     1.00 
##

Đánh giá phân bố mật độ xơng tương đối chuẩn

2.2. So sánh mật độ xương giữa nam và nữ

library(table1)

## Warning: package 'table1' was built under R version 4.4.1

## 
## Attaching package: 'table1'

## The following object is masked from 'package:lessR':
## 
##     label

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ fnbmd | sex, data = df)

	Female (N=1317)	Male (N=845)	Overall (N=2162)
fnbmd
Mean (SD)	0.778 (0.132)	0.910 (0.153)	0.829 (0.155)
Median [Min, Max]	0.770 [0.280, 1.31]	0.900 [0.340, 1.51]	0.820 [0.280, 1.51]
Missing	17 (1.3%)	23 (2.7%)	40 (1.9%)

2.2. T-test

library(lessR)
ttest(fnbmd ~ sex, data = df)

## 
## Compare fnbmd across sex with levels Male and Female 
## Grouping Variable:  sex
## Response Variable:  fnbmd
## 
## 
## ------ Describe ------
## 
## fnbmd for sex Male:  n.miss = 23,  n = 822,  mean = 0.910,  sd = 0.153
## fnbmd for sex Female:  n.miss = 17,  n = 1300,  mean = 0.778,  sd = 0.132
## 
## Mean Difference of fnbmd:  0.132
## 
## Weighted Average Standard Deviation:   0.141 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of fnbmd.
## Group Male: Sample mean assumed normal because n > 30, so no test needed.
## Group Female: Sample mean assumed normal because n > 30, so no test needed.
## 
## Null hypothesis is equal variances of fnbmd, homogeneous.
## Variance Ratio test:  F = 0.023/0.018 = 1.336,  df = 821;1299,  p-value = 0.000
## Levene's test, Brown-Forsythe:  t = 3.449,  df = 2120,  p-value = 0.001
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of fnbmd for each sex 
## 
## t-cutoff for 95% range of variation: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.006 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 21.080,  df = 2120,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.012
## 95% Confidence Interval for Mean Difference:  0.120 to 0.144
## 
## 
## --- Do not assume equal population variances of fnbmd for each sex 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.006 
## 
## Hypothesis Test of 0 Mean Diff:  t = 20.407,  df = 1560.981, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.013
## 95% Confidence Interval for Mean Difference:  0.119 to 0.145
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of fnbmd for each sex 
## 
## Standardized Mean Difference of fnbmd, Cohen's d:  0.939
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for sex Male: 0.044
## Density bandwidth for sex Female: 0.034

### 2.3 Sử dụng chatGPT 2.3.1 Biểu đồ histogram

PROMPT: Dùng gói lệnh ‘lessR’ vẽ biểu đồ histogram mô tả phân bố của mật độ xương (fnbmd), tô màu xanh và đặt tựa trục x là ‘Mật độ xương ở cổ xương đùi (g/cm2)’, trục y là ‘Số người’, tựa là ‘Phân bố mật độ xương’

Việc 3. So sánh khác biệt về RER giữa placebo và cafe

# Nhóm chứng (placebo)
placebo <- c(105, 119, 100, 97, 96, 101, 94, 95, 98)

# Nhóm café
cafe <- c(96, 99, 94, 89, 96, 93, 88, 105, 88)

3.2 Thực hiện phép kiểm t để đánh giá khác biệt về RER giữa 2 nhóm. Nhận xét kết quả.

ttest(placebo, cafe, data = df)

## 
## Compare Y across X with levels Group1 and Group2 
## Grouping Variable:  X
## Response Variable:  Y
## 
## 
## ------ Describe ------
## 
## Y for X Group1:  n.miss = 0,  n = 9,  mean = 100.556,  sd = 7.699
## Y for X Group2:  n.miss = 0,  n = 9,  mean = 94.222,  sd = 5.608
## 
## Mean Difference of Y:  6.333
## 
## Weighted Average Standard Deviation:   6.735 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Y.
## Group Group1  Shapiro-Wilk normality test:  W = 0.780,  p-value = 0.012
## Group Group2  Shapiro-Wilk normality test:  W = 0.924,  p-value = 0.427
## 
## Null hypothesis is equal variances of Y, homogeneous.
## Variance Ratio test:  F = 59.278/31.444 = 1.885,  df = 8;8,  p-value = 0.389
## Levene's test, Brown-Forsythe:  t = 0.230,  df = 16,  p-value = 0.821
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of Y for each X 
## 
## t-cutoff for 95% range of variation: tcut =  2.120 
## Standard Error of Mean Difference: SE =  3.175 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 1.995,  df = 16,  p-value = 0.063
## 
## Margin of Error for 95% Confidence Level:  6.731
## 95% Confidence Interval for Mean Difference:  -0.397 to 13.064
## 
## 
## --- Do not assume equal population variances of Y for each X 
## 
## t-cutoff: tcut =  2.136 
## Standard Error of Mean Difference: SE =  3.175 
## 
## Hypothesis Test of 0 Mean Diff:  t = 1.995,  df = 14.624, p-value = 0.065
## 
## Margin of Error for 95% Confidence Level:  6.782
## 95% Confidence Interval for Mean Difference:  -0.449 to 13.116
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Y for each X 
## 
## Standardized Mean Difference of Y, Cohen's d:  0.940
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for X Group1: 5.642
## Density bandwidth for X Group2: 4.112

Việc 3. So sánh về chỉ số RER giữa 2 nhóm

3.1. Nhập dữ liệu vào R

placebo <- c(105, 119, 100, 97, 96, 101, 94, 95, 98)
coffee <- c(96, 99, 94, 89, 96, 93, 88, 105, 88)

3.2. Thực hiện phép kiểm t để đánh giá khác biệt về RER giữa 2 nhóm.

ttest (placebo, cafe, data = df)

## 
## Compare Y across X with levels Group1 and Group2 
## Grouping Variable:  X
## Response Variable:  Y
## 
## 
## ------ Describe ------
## 
## Y for X Group1:  n.miss = 0,  n = 9,  mean = 100.556,  sd = 7.699
## Y for X Group2:  n.miss = 0,  n = 9,  mean = 94.222,  sd = 5.608
## 
## Mean Difference of Y:  6.333
## 
## Weighted Average Standard Deviation:   6.735 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Y.
## Group Group1  Shapiro-Wilk normality test:  W = 0.780,  p-value = 0.012
## Group Group2  Shapiro-Wilk normality test:  W = 0.924,  p-value = 0.427
## 
## Null hypothesis is equal variances of Y, homogeneous.
## Variance Ratio test:  F = 59.278/31.444 = 1.885,  df = 8;8,  p-value = 0.389
## Levene's test, Brown-Forsythe:  t = 0.230,  df = 16,  p-value = 0.821
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of Y for each X 
## 
## t-cutoff for 95% range of variation: tcut =  2.120 
## Standard Error of Mean Difference: SE =  3.175 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 1.995,  df = 16,  p-value = 0.063
## 
## Margin of Error for 95% Confidence Level:  6.731
## 95% Confidence Interval for Mean Difference:  -0.397 to 13.064
## 
## 
## --- Do not assume equal population variances of Y for each X 
## 
## t-cutoff: tcut =  2.136 
## Standard Error of Mean Difference: SE =  3.175 
## 
## Hypothesis Test of 0 Mean Diff:  t = 1.995,  df = 14.624, p-value = 0.065
## 
## Margin of Error for 95% Confidence Level:  6.782
## 95% Confidence Interval for Mean Difference:  -0.449 to 13.116
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Y for each X 
## 
## Standardized Mean Difference of Y, Cohen's d:  0.940
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for X Group1: 5.642
## Density bandwidth for X Group2: 4.112

3.3 Thực hiện bootstrap để đánh giá khác biệt về RER giữa 2 nhóm. Nhận xét kết quả.

library(simpleboot)

## Simple Bootstrap Routines (1.1-8)

library(boot)

## Warning: package 'boot' was built under R version 4.4.1

set.seed(123) # để kết quả tái lập được

boot_result <- two.boot (placebo, cafe, FUN = mean, R = 10000)
boot.ci(boot_result, type = c("perc", "bca", "norm", "basic"))

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_result, type = c("perc", "bca", "norm", 
##     "basic"))
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.383, 12.188 )   ( 0.111, 11.778 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.889, 12.556 )   ( 0.778, 12.444 )  
## Calculations and Intervals on Original Scale

BCa không chứa 0 với CI 95% nên kết quả là có sự khác biệt trong chuyển hoá RER giữa nhóm dùng cafe và nhóm không dùng cafe là có ý nghĩa thống kê

3.4 Sử dụng ChatGPT để thực hiện việc 3.3

PROMPT 1: Tôi có dữ liệu của 18 người trong 2 nhóm như sau: placebo = 105, 119, 100, 97, 96, 101, 94, 95, 98 và coffee = 96, 99, 94, 89, 96, 93, 88, 105, 88. Bạn giúp viết lệnh R từ gói lệnh ‘simpleboot’ và ‘boot’ để phân tích sự khác biệt giữa 2 nhóm bằng phương pháp bootstrap và trình bày nhiều giá trị 95% CI (percentile, BCA, normal, basic) không?

Việc 4. So sánh tỷ lệ gãy xương giữa 2 nhóm nam và nữ

Trả lời câu hỏi: Có mối tương quan giữa tỷ lệ gãy xương và sex không Vì là biến định tính nên phải dùng Chisquare

library(table1)
table1(~ fx +  as.factor(fx)| sex, data = df)

	Female (N=1317)	Male (N=845)	Overall (N=2162)
fx
Mean (SD)	0.304 (0.460)	0.170 (0.376)	0.252 (0.434)
Median [Min, Max]	0 [0, 1.00]	0 [0, 1.00]	0 [0, 1.00]
as.factor(fx)
0	916 (69.6%)	701 (83.0%)	1617 (74.8%)
1	401 (30.4%)	144 (17.0%)	545 (25.2%)

chisq.test(df$fx, df$sex)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df$fx and df$sex
## X-squared = 48.363, df = 1, p-value = 3.542e-12

p-value = 0.000000000003542, p rất nhỏ nên có giá trị thống kê hay nói cách khác sự khác biệt về tỷ lệ gãy xương của nam và nữ là có giá trị thống kê

AI in Data Analysis_Day2

Trang Lu

2026-01-07