Ngày 5: Phân tích theo thời gian

Việc 1. Phân tích trước-sau cho 1 nhóm

1.1 Đọc dữ liệu vào R

df = read.csv("C:\\Thach\\VN trips\\2025_2May\\Phuong Nam Institute\\Datasets\\Pre-Post study one group.csv")
head(df)

##   EmpID Before After
## 1    26     43    66
## 2    27     58    74
## 3    28     52    62
## 4    29     47    84
## 5    30     43    78
## 6    31     50    73

1.2 Đánh giá hiệu quả can thiệp bằng paired t-test

library(lessR)

## Warning: package 'lessR' was built under R version 4.3.3

## 
## lessR 4.3.9                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, 
## graphics, testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including time series forecasting
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()

ttest(Before, After, paired = TRUE, data = df)

## 
## 
## ------ Describe ------
## 
## Difference:  n.miss = 0,  n = 25,   mean = 19.640,  sd = 11.597
## 
## 
## ------ Normality Assumption ------
## 
## Null hypothesis is a normal distribution of Difference.
## Shapiro-Wilk normality test:  W = 0.9687,  p-value = 0.613
## 
## 
## ------ Infer ------
## 
## t-cutoff for 95% range of variation: tcut =  2.064 
## Standard Error of Mean: SE =  2.319 
## 
## Hypothesized Value H0: mu = 0 
## Hypothesis Test of Mean:  t-value = 8.468,  df = 24,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  4.787
## 95% Confidence Interval for Mean:  14.853 to 24.427
## 
## 
## ------ Effect Size ------
## 
## Distance of sample mean from hypothesized:  19.640
## Standardized Distance, Cohen's d:  1.694
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for 6.941

1.4 [ChatGPT]

PROMPT 1: “Sử dụng kiểm định t bắt cặp (paired t-test) trong gói lệnh ‘lessR’ để đánh giá thay đổi hormone trước (Before) và sau (After) can thiệp”

ChatGPT:

Paired(Before, After, data=df)

“Lỗi”could not find function ‘Paired’” cho thấy gói lessR chưa được nạp đúng cách, hoặc bạn đang dùng phiên bản chưa cập nhật.”

PROMPT 2: “Tôi sử dụng gói lệnh ‘lessR’ version 4.4.2 (16 March 2025) không có lệnh Paired (https://cran.r-project.org/web/packages/lessR/lessR.pdf). Bạn hãy dùng lệnh ttest của gói ‘lessR’”

ChatGPT: “Cảm ơn bạn đã cung cấp thông tin! Đúng như bạn nói, trong phiên bản mới của gói lessR (v4.4.2), hàm Paired() không còn, và thay vào đó có thể sử dụng hàm ttest() để thực hiện kiểm định t bắt cặp.”

ttest(After - Before, data = df)

PROMPT 3: “Lệnh này không phải thực hiện paired t-test, phải cần đối số Paired = TRUE”

ChatGPT: “Đúng rồi! Với gói lessR, nếu bạn muốn thực hiện kiểm định t bắt cặp, bạn phải thêm đối số paired = TRUE trong hàm ttest().”

ttest(After, Before, paired = TRUE, data = df)

Việc 2. Phân tích trước-sau cho 2 nhóm

2.1 Đọc dữ liệu

df2 = read.csv("C:\\Thach\\VN trips\\2025_2May\\Phuong Nam Institute\\Datasets\\Pre-Post test dataset.csv")
head(df2)

##   EmpID Treatment Before After
## 1    26       New     41    66
## 2    27       New     56    74
## 3    28       New     50    62
## 4    29       New     45    84
## 5    30       New     41    78
## 6    31       New     48    73

2.2 So sánh hiệu quả của 2 can thiệp bằng t-test

df2$Diff = df2$After - df2$Before
ttest(Diff ~ Treatment, data = df2)

## 
## Compare Diff across Treatment with levels New and Old 
## Grouping Variable:  Treatment
## Response Variable:  Diff
## 
## 
## ------ Describe ------
## 
## Diff for Treatment New:  n.miss = 0,  n = 25,  mean = 22.040,  sd = 11.238
## Diff for Treatment Old:  n.miss = 0,  n = 25,  mean = 13.280,  sd = 12.785
## 
## Mean Difference of Diff:  8.760
## 
## Weighted Average Standard Deviation:   12.036 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Diff.
## Group New  Shapiro-Wilk normality test:  W = 0.964,  p-value = 0.502
## Group Old  Shapiro-Wilk normality test:  W = 0.954,  p-value = 0.309
## 
## Null hypothesis is equal variances of Diff, homogeneous.
## Variance Ratio test:  F = 163.460/126.290 = 1.294,  df = 24;24,  p-value = 0.532
## Levene's test, Brown-Forsythe:  t = -0.983,  df = 48,  p-value = 0.331
## 
## 
## ------ Infer ------
## 
## --- Assume equal population variances of Diff for each Treatment 
## 
## t-cutoff for 95% range of variation: tcut =  2.011 
## Standard Error of Mean Difference: SE =  3.404 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 2.573,  df = 48,  p-value = 0.013
## 
## Margin of Error for 95% Confidence Level:  6.845
## 95% Confidence Interval for Mean Difference:  1.915 to 15.605
## 
## 
## --- Do not assume equal population variances of Diff for each Treatment 
## 
## t-cutoff: tcut =  2.011 
## Standard Error of Mean Difference: SE =  3.404 
## 
## Hypothesis Test of 0 Mean Diff:  t = 2.573,  df = 47.223, p-value = 0.013
## 
## Margin of Error for 95% Confidence Level:  6.848
## 95% Confidence Interval for Mean Difference:  1.912 to 15.608
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Diff for each Treatment 
## 
## Standardized Mean Difference of Diff, Cohen's d:  0.728
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Treatment New: 6.726
## Density bandwidth for Treatment Old: 7.655

### [ChatGPT] PROMPT 1: “Tập tin ‘Pre-Post test dataset.csv’ được đọc vào R và đặt tên là df2. Sử dụng gói lệnh ‘lessR’ để so sánh hiệu quả của 2 can thiệp New và Old (biến số ‘Treatment’) lên thay đổi trước (Before) và sau (After) bằng kiểm định t bắt cặp. Diễn giải kết quả.”

ChatGPT:

Để so sánh hiệu quả của hai can thiệp New và Old (biến Treatment) lên sự thay đổi từ Before sang After trong dữ liệu df2, bạn sẽ cần:

Tính toán hiệu số After - Before cho từng cá nhân.
Tách dữ liệu theo nhóm can thiệp.
Thực hiện kiểm định t bắt cặp trong từng nhóm bằng ttest() của gói lessR.

Nhóm “New”:

ttest(After, Before, paired=TRUE, data = subset(df2, Treatment == “New”))

Nhóm “Old”:

ttest(After, Before, paired=TRUE, data = subset(df2, Treatment == “Old”))

Diễn giải kết quả:

Nhóm New: Hiệu số trung bình (After - Before): +22.04; t-statistic: 9.81; p-value: < 0.000000001 (rất nhỏ)

Diễn giải: Can thiệp New giúp cải thiện đáng kể giá trị sau can thiệp, với mức tăng trung bình 22.04 đơn vị. Kết quả này có ý nghĩa thống kê rất cao (p < 0.001), cho thấy phương pháp New có hiệu quả rõ rệt.

Nhóm Old: Hiệu số trung bình (After - Before): +13.28; t-statistic: 5.19; p-value: 0.000025

Diễn giải: Can thiệp Old cũng cho thấy sự cải thiện có ý nghĩa thống kê (p < 0.001), với mức tăng trung bình 13.28 đơn vị. Tuy nhiên, mức cải thiện thấp hơn đáng kể so với nhóm New.

Kết luận: Cả hai phương pháp New và Old đều có hiệu quả cải thiện chỉ số sau can thiệp, nhưng phương pháp New tỏ ra vượt trội hơn về mức độ cải thiện trung bình (22.04 so với 13.28) và độ mạnh của bằng chứng thống kê.

PROMPT 2: “Cách làm này không thể so sánh trực tiếp thay đổi (After- Before) giữa 2 nhóm New và Old. Nói cách khác dù thay đổi trong nhóm New nhiều hơn (+22.04) so với nhóm Old (+13.28) nhưng cũng không đảm bảo sự khác biệt trung bình về thay đổi giữa 2 nhóm là có ý nghĩa thực tế và có ý nghĩa về thống kê. Có cách nào so sánh trực tiếp khác biệt về thay đổi giữa 2 nhóm không?”

ChatGPT: “Bạn hoàn toàn đúng! Việc thực hiện kiểm định t bắt cặp riêng rẽ trong từng nhóm (New và Old) chỉ cho thấy mỗi nhóm có thay đổi có ý nghĩa hay không — chứ không cho phép so sánh trực tiếp sự khác biệt về thay đổi giữa 2 nhóm.”

Giải pháp chính xác: So sánh trực tiếp mức thay đổi giữa 2 nhóm bằng kiểm định t độc lập

Cách làm:

Tạo biến mới: Change = After - Before
Dùng kiểm định t độc lập (independent t-test) để so sánh Change giữa 2 nhóm Treatment (New vs Old)

PROMPT 3: “Phương pháp này không giải quyết được hiện tượng hồi qui về trung bình (Regression toward the Mean). Bạn có đề nghị phương pháp phân tích nào tốt hơn không?”

ChatGPT: “Rất tuyệt vời khi bạn nêu ra hiện tượng hồi quy về trung bình (regression toward the mean) — một yếu tố thường bị bỏ qua nhưng có thể làm sai lệch kết luận, đặc biệt trong các thiết kế pre-post với so sánh nhóm song song.”

Giải pháp tốt hơn: ANCOVA (Analysis of Covariance)

ANCOVA giúp: So sánh điểm sau can thiệp (After) giữa hai nhóm sau khi hiệu chỉnh (adjust) cho điểm ban đầu (Before). Như vậy bạn sẽ kiểm soát được sự khác biệt ban đầu và hạn chế hiện tượng hồi quy về trung bình

PROMPT 4: “Có cần để tương tác Treatment*Before vào mô hình để giải quyết hiện tượng hồi qui về trung bình không?”

ChatGPT: “Câu hỏi của bạn rất sâu sắc! Việc đưa tương tác Treatment * Before vào mô hình hồi quy là một bước phân tích nâng cao có thể cung cấp hiểu biết sâu sắc hơn, đặc biệt trong bối cảnh:”

PROMPT 5: “Ý tôi là mô hình có cả Before, Treatment và tương tác Before*Treatment”

ChatGPT: “Cảm ơn bạn! Dưới đây là diễn giải mô hình hồi quy tuyến tính bao gồm: Before, Treatment, và tương tác Before * Treatment để kiểm soát hiện tượng hồi quy về trung bình và kiểm tra sự khác biệt hiệu quả can thiệp giữa hai nhóm.”

2.3 So sánh hiệu quả của 2 can thiệp bằng phân tích hiệp biến

reg(After ~ Before + Treatment + Before*Treatment, data = df2)

## 
## >>>  Treatment is not numeric. Converted to indicator variables.

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(After ~ Before + Treatment + Before * Treatment, data=df2, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  df2 
##  
## Response Variable: After 
## Predictor Variable 1: Before 
## Predictor Variable 2: TreatmentOld 
## Predictor Variable 3: Before.TreatmentOld 
##  
## Number of cases (rows) of data:  50 
## Number of cases retained for analysis:  50 
## 
## 
##   BASIC ANALYSIS 
## 
##                      Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
##         (Intercept)    90.468     12.149    7.446    0.000      66.012     114.923 
##              Before    -0.360      0.239   -1.503    0.140      -0.842       0.122 
##        TreatmentOld    -9.181     16.948   -0.542    0.591     -43.295      24.933 
## Before.TreatmentOld    -0.056      0.342   -0.163    0.871      -0.744       0.632 
## 
## Standard deviation of After: 9.795 
##  
## Standard deviation of residuals:  7.880 for df=46 
## 95% range of residuals:  31.723 = 2 * (2.013 * 7.880) 
##  
## R-squared: 0.392    Adjusted R-squared: 0.353    PRESS R-squared: NA 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 9.901     df: 3 and 46     p-value:  0.000 
## 
## -- Analysis of Variance from Type II Sums of Squares 
##  
##                       df    Sum Sq   Mean Sq   F-value   p-value 
##              Before    1   319.163   319.163     5.249     0.026 
##        TreatmentOld    1  1724.303  1724.303    27.769     0.000 
## Before.TreatmentOld    1     1.655     1.655     0.027     0.871 
## Residuals             46  2856.382    62.095 
## 
## -- Test of Interaction 
##   
## Before:Treatment  df: 1  df resid: 46  SS: 1.655  F: 0.027  p-value: 0.871 
##   
## -- Assume parallel lines, no interaction of Treatment with Before 
##  
## Level New: y^_After = 90.468 + -0.360(x_Before) 
## Level Old: y^_After = 81.287 + -0.360(x_Before) 
##   
## -- Visualize Separately Computed Regression Lines 
## 
## Plot(Before, After, by=Treatment, fit="lm") 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##                       After Before TreatmentOld Before.TreatmentOld 
##                 After  1.00  -0.16        -0.57               -0.60 
##                Before -0.16   1.00        -0.17               -0.04 
##          TreatmentOld -0.57  -0.17         1.00                0.98 
##   Before.TreatmentOld -0.60  -0.04         0.98                1.00 
## 
##                       Tolerance       VIF 
##                Before     0.494     2.024 
##          TreatmentOld     0.017    57.821 
##   Before.TreatmentOld     0.018    56.198 
## 
##  Before TreatmentOld Before.TreatmentOld    R2adj    X's 
##       1            1                   0    0.366      2 
##       1            0                   1    0.362      2 
##       1            1                   1    0.353      3 
##       0            0                   1    0.342      1 
##       0            1                   1    0.335      2 
##       0            1                   0    0.310      1 
##       1            0                   0    0.005      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 50 rows of data, or do n_res_rows="all"] 
## -------------------------------------------------------------------------------------- 
##      Before TreatmentOld Before.TreatmentOld  After fitted   resid rstdnt dffits cooks 
##   40     43            1                  43     42 63.415 -21.415 -3.053 -0.800 0.136 
##   19     64            0                   0     77 67.437   9.563  1.381  0.718 0.126 
##   50     60            1                  60     46 56.349 -10.349 -1.466 -0.679 0.113 
##   34     51            1                  51     79 60.090  18.910  2.611  0.589 0.077 
##    7     64            0                   0     60 67.437  -7.437 -1.065 -0.554 0.076 
##   32     44            1                  44     79 62.999  16.001  2.172  0.527 0.064 
##    1     41            0                   0     66 75.714  -9.714 -1.325 -0.490 0.059 
##    8     46            0                   0     61 73.915 -12.915 -1.724 -0.425 0.043 
##   26     51            1                  51     74 60.090  13.910  1.857  0.419 0.042 
##   46     47            1                  47     49 61.752 -12.752 -1.685 -0.349 0.029 
##    4     45            0                   0     84 74.274   9.726  1.286  0.342 0.029 
##   18     43            0                   0     83 74.994   8.006  1.066  0.334 0.028 
##   17     53            0                   0     61 71.396 -10.396 -1.364 -0.302 0.022 
##   16     43            0                   0     82 74.994   7.006  0.930  0.292 0.021 
##   10     47            0                   0     83 73.555   9.445  1.237  0.284 0.020 
##    3     50            0                   0     62 72.475 -10.475 -1.370 -0.280 0.019 
##   45     51            1                  51     51 60.090  -9.090 -1.188 -0.268 0.018 
##   30     42            1                  42     70 63.830   6.170  0.811  0.231 0.013 
##   36     50            1                  50     53 60.505  -7.505 -0.973 -0.208 0.011 
##   39     50            1                  50     53 60.505  -7.505 -0.973 -0.208 0.011 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##      Before TreatmentOld Before.TreatmentOld  After   pred s_pred pi.lwr pi.upr  width 
##   50     60            1                  60     46 56.349  8.549 39.142 73.557 34.415 
##   35     59            1                  59     59 56.765  8.469 39.718 73.811 34.093 
##   47     58            1                  58     60 57.180  8.395 40.282 74.078 33.796 
## ... 
##   48     49            1                  49     62 60.921  8.040 44.738 77.104 32.365 
##   41     48            1                  48     60 61.337  8.036 45.161 77.512 32.352 
##   27     47            1                  47     65 61.752  8.040 45.568 77.936 32.368 
## ... 
##   44     36            1                  36     67 66.324  8.555 49.103 83.545 34.442 
##    7     64            0                   0     60 67.437  8.678 49.970 84.905 34.935 
##   19     64            0                   0     77 67.437  8.678 49.970 84.905 34.935 
## 
## ------------------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## Plot 3: Scatterplot and Least-Squares Lines 
## -------------------------------------------

Phân tích dữ liệu bằng ngôn ngữ R - Viện Phương Nam (10-15/5/2025)

Thach Tran

2025-04-22