Ngày 4: Hồi qui logistic (Updated)

Việc 1. Đọc dữ liệu vào R

df = read.csv("C:\\Thach\\VN trips\\2026_1Jan\\PN Institute\\Datasets\\Bone data.csv")
dim(df)

## [1] 2162    9

head(df)

##   id    sex age weight height prior.fx fnbmd smoking fx
## 1  1   Male  73     98    175        0  1.08       1  0
## 2  2 Female  68     72    166        0  0.97       0  0
## 3  3   Male  68     87    184        0  1.01       0  0
## 4  4 Female  62     72    173        0  0.84       1  0
## 5  5   Male  61     72    173        0  0.81       1  0
## 6  6 Female  76     57    156        0  0.74       0  0

Việc 2. Mối liên quan giữa hút thuốc và nguy cơ gãy xương

2.1 Mô hình hồi qui logistic đánh giá mối liên quan giữa hút thuốc và gãy xương

Gói lệnh cơ bản

model <- glm(fx ~ smoking, data = df, family = binomial)
summary(model)

## 
## Call:
## glm(formula = fx ~ smoking, family = binomial, data = df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.03006    0.06442 -15.990   <2e-16 ***
## smoking     -0.13796    0.10081  -1.368    0.171    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2441.4  on 2161  degrees of freedom
## Residual deviance: 2439.5  on 2160  degrees of freedom
## AIC: 2443.5
## 
## Number of Fisher Scoring iterations: 4

exp(cbind(Odds_Ratio = coef(model), confint(model)))

## Waiting for profiling to be done...

##             Odds_Ratio     2.5 %    97.5 %
## (Intercept)  0.3569869 0.3142326 0.4045364
## smoking      0.8711365 0.7144024 1.0608018

library(epiDisplay)

## Warning: package 'epiDisplay' was built under R version 4.3.2

## Loading required package: foreign

## Loading required package: survival

## Loading required package: MASS

## Loading required package: nnet

logistic.display(model)

## 
## Logistic regression predicting fx 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## smoking: 1 vs 0  0.87 (0.71,1.06)  0.171          0.17      
##                                                             
## Log-likelihood = -1219.7469
## No. of observations = 2162
## AIC value = 2443.4938

Gói lệnh ‘lessR’

library(lessR)

## Warning: package 'lessR' was built under R version 4.3.3

## 
## lessR 4.3.9                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Many examples of reading, writing, and manipulating data, 
## graphics, testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables
##   Enter: browseVignettes("lessR")
## 
## View lessR updates, now including time series forecasting
##   Enter: news(package="lessR")
## 
## Interactive data analysis
##   Enter: interact()

model.lessR <- Logit(fx ~ smoking, data = df, brief = TRUE)

## 
## Response Variable:   fx
## Predictor Variable 1:  smoking
## 
## Number of cases (rows) of data:  2162 
## Number of cases retained for analysis:  2162 
## 
## 
##    BASIC ANALYSIS 
## 
## -- Estimated Model of fx for the Logit of Reference Group Membership
## 
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)   -1.0301     0.0644  -15.990    0.000     -1.1563     -0.9038 
##     smoking   -0.1380     0.1008   -1.368    0.171     -0.3355      0.0596 
## 
## 
## -- Odds Ratios and Confidence Intervals
## 
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.3570      0.3146      0.4050 
##     smoking      0.8711      0.7149      1.0614 
## 
## 
## -- Model Fit
## 
##     Null deviance: 2441.375 on 2161 degrees of freedom
## Residual deviance: 2439.494 on 2160 degrees of freedom
## 
## AIC: 2443.494 
## 
## Number of iterations to convergence: 4

2.2 ChatGPT

PROMT: tôi có dữ liệu đánh giá nguy cơ gãy xương. Bạn giúp viết lệnh xây dựng mô hình hối qui logistic đánh giá mối liên quan giữa hút thuốc (smoking: 0= No; 1= Yes) và nguy cơ gãy xương (fx: 0= No; 1= Yes)

Việc 3. Đánh giá mối liên quan độc lập giữa hút thuốc và nguy cơ gãy xương

3.1 Mô hình hồi qui logistic

model.2 <- glm(fx ~ smoking + sex, data = df, family = binomial)
summary(model.2)

## 
## Call:
## glm(formula = fx ~ smoking + sex, family = binomial, data = df)
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) -0.85598    0.06835 -12.523 < 0.0000000000000002 ***
## smoking      0.09872    0.10724   0.921                0.357    
## sexMale     -0.78880    0.11493  -6.863     0.00000000000673 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2441.4  on 2161  degrees of freedom
## Residual deviance: 2389.6  on 2159  degrees of freedom
## AIC: 2395.6
## 
## Number of Fisher Scoring iterations: 4

exp(cbind(Odds_Ratio = coef(model.2), confint(model.2)))

## Waiting for profiling to be done...

##             Odds_Ratio     2.5 %    97.5 %
## (Intercept)  0.4248646 0.3711574 0.4852461
## smoking      1.1037584 0.8942327 1.3617209
## sexMale      0.4543898 0.3619039 0.5680105

model2.lessR = Logit(fx ~ smoking + sex, data = df, brief = TRUE)

## 
## >>> Note:  sex is not a numeric variable.
##            Indicator variables are created and analyzed.
## 
## Response Variable:   fx
## Predictor Variable 1:  smoking
## Predictor Variable 2:  sexMale
## 
## Number of cases (rows) of data:  2162 
## Number of cases retained for analysis:  2162 
## 
## 
##    BASIC ANALYSIS 
## 
## -- Estimated Model of fx for the Logit of Reference Group Membership
## 
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)   -0.8560     0.0684  -12.523    0.000     -0.9900     -0.7220 
##     smoking    0.0987     0.1072    0.921    0.357     -0.1115      0.3089 
##     sexMale   -0.7888     0.1149   -6.863    0.000     -1.0141     -0.5635 
## 
## 
## -- Odds Ratios and Confidence Intervals
## 
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.4249      0.3716      0.4858 
##     smoking      1.1038      0.8945      1.3619 
##     sexMale      0.4544      0.3627      0.5692 
## 
## 
## -- Model Fit
## 
##     Null deviance: 2441.375 on 2161 degrees of freedom
## Residual deviance: 2389.599 on 2159 degrees of freedom
## 
## AIC: 2395.599 
## 
## Number of iterations to convergence: 4 
## 
## 
## Collinearity
## 
##         Tolerance       VIF
## smoking     0.899     1.112
## sexMale     0.899     1.112

3.2 So sánh kết quả 3.1 và 2.1

ChatGPT:

PROMPT: Y văn ghi nhận giới tính (sex) là yếu tố gây nhiễu. Bạn thực hiện mô hinh hồi qui logistic đa biến đánh giá mối liên quan độc lập giữa hút thuốc và nguy cơ gãy xương sau khi hiệu chỉnh cho giới tính.

Việc 4. Xây dựng mô hình dự báo nguy cơ gãy xương

4.1 Tìm mô hình tối ưu bằng pp BMA

library(BMA)

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.7-4)

df_bma <- na.omit(df[, c("fx", "age", "sex", "weight", "height", "fnbmd", "smoking", "prior.fx")])
bma_model <- bic.glm(
  x = df_bma[, c("age", "sex", "weight", "height", "fnbmd", "smoking", "prior.fx")],
  y = df_bma$fx,
  glm.family = binomial(), OR = 20,
)
summary(bma_model)

## 
## Call:
## bic.glm.data.frame(x = df_bma[, c("age", "sex", "weight", "height",     "fnbmd", "smoking", "prior.fx")], y = df_bma$fx, glm.family = binomial(),     OR = 20)
## 
## 
##   4  models were selected
##  Best  4  models (cumulative posterior probability =  1 ): 
## 
##            p!=0    EV        SD        model 1       model 2       model 3     
## Intercept  100     2.249962  0.674258       2.50941       2.33985       1.13970
## age         16.2   0.002663  0.006838        .             .            0.01602
## sex         18.2  -0.048048  0.114551        .           -0.25782        .     
## weight       0.0   0.000000  0.000000        .             .             .     
## height       0.0   0.000000  0.000000        .             .             .     
## fnbmd      100.0  -4.487636  0.436674      -4.59534      -4.27923      -4.30244
## smoking      0.0   0.000000  0.000000        .             .             .     
## prior.fx   100.0   0.530729  0.134289       0.53062       0.54454       0.51562
##                                                                                
## nVar                                          2             3             3    
## BIC                                    -14024.61986  -14021.44672  -14021.13922
## post prob                                   0.696         0.142         0.122  
##            model 4     
## Intercept       0.79779
## age             0.01780
## sex            -0.28507
## weight           .     
## height           .     
## fnbmd          -3.91809
## smoking          .     
## prior.fx        0.52962
##                        
## nVar              4    
## BIC        -14018.89680
## post prob       0.040

imageplot.bma(bma_model)

4.2 Mô hình tối ưu

df_bma$fnbmd.sd = df_bma$fnbmd/0.15
bma_best = glm(fx ~ fnbmd.sd + prior.fx, family = binomial, data = df_bma)
exp(cbind(Odds_Ratio = coef(bma_best), confint(bma_best)))

## Waiting for profiling to be done...

##             Odds_Ratio     2.5 %    97.5 %
## (Intercept) 12.2977330 6.6858383 22.888616
## fnbmd.sd     0.5019265 0.4467839  0.562099
## prior.fx     1.6999821 1.3051619  2.207473

bma_best2 = Logit(fx ~ fnbmd.sd + prior.fx, data = df_bma, brief = TRUE)

## 
## Response Variable:   fx
## Predictor Variable 1:  fnbmd.sd
## Predictor Variable 2:  prior.fx
## 
## Number of cases (rows) of data:  2121 
## Number of cases retained for analysis:  2121 
## 
## 
##    BASIC ANALYSIS 
## 
## -- Estimated Model of fx for the Logit of Reference Group Membership
## 
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)    2.5094     0.3138    7.997    0.000      1.8944      3.1245 
##    fnbmd.sd   -0.6893     0.0585  -11.773    0.000     -0.8041     -0.5745 
##    prior.fx    0.5306     0.1340    3.961    0.000      0.2680      0.7932 
## 
## 
## -- Odds Ratios and Confidence Intervals
## 
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)     12.2977      6.6484     22.7474 
##    fnbmd.sd      0.5019      0.4475      0.5630 
##    prior.fx      1.7000      1.3074      2.2105 
## 
## 
## -- Model Fit
## 
##     Null deviance: 2395.806 on 2120 degrees of freedom
## Residual deviance: 2198.504 on 2118 degrees of freedom
## 
## AIC: 2204.504 
## 
## Number of iterations to convergence: 4 
## 
## 
## Collinearity
## 
##          Tolerance       VIF
## fnbmd.sd     0.973     1.027
## prior.fx     0.973     1.027

ROC

m1 = glm(fx ~ fnbmd.sd + prior.fx, family = binomial, data = df_bma)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:epiDisplay':
## 
##     ci

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

pred = predict(m1, type = "response")
roc_curve = roc(df_bma$fx ~ pred)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(roc_curve)

## Area under the curve: 0.7043

plot(roc_curve)

Calibration:

library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:lessR':
## 
##     label, Merge

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Registered S3 method overwritten by 'rms':
##   method       from      
##   print.lrtest epiDisplay

## 
## Attaching package: 'rms'

## The following object is masked from 'package:epiDisplay':
## 
##     lrtest

m2 = lrm(fx ~ fnbmd.sd + prior.fx, data = df_bma)
m2

## Logistic Regression Model
## 
## lrm(formula = fx ~ fnbmd.sd + prior.fx, data = df_bma)
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs          2121    LR chi2     197.30       R2       0.131    C       0.704    
##  0           1586    d.f.             2      R2(2,2121)0.088    Dxy     0.409    
##  1            535    Pr(> chi2) <0.0001    R2(2,1200.2)0.150    gamma   0.414    
## max |deriv| 1e-12                             Brier    0.172    tau-a   0.154    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept  2.5094 0.3138   8.00 <0.0001 
## fnbmd.sd  -0.6893 0.0585 -11.77 <0.0001 
## prior.fx   0.5306 0.1340   3.96 <0.0001

pred2 = predict(m2)
phat = 1/(1 + exp(-pred2))
val.prob(phat, df_bma$fx, m = 20, cex = .5)

##                    Dxy                C (ROC)                     R2 
##   0.408562067624423986   0.704281033812211965   0.131241322905161162 
##                      D               D:Chi-sq                    D:p 
##   0.092551610296560588 197.301965439005016378                     NA 
##                      U               U:Chi-sq                    U:p 
##  -0.000942951438001801  -0.000000000001818989   1.000000000000000000 
##                      Q                  Brier              Intercept 
##   0.093494561734562387   0.171619808519130612  -0.000000000455667804 
##                  Slope                   Emax                    E90 
##   0.999999999173702858   0.251224791450949891   0.033596247400992207 
##                   Eavg                    S:z                    S:p 
##   0.016897020102022799   0.132109850414779295   0.894897404701360477

Optional: đánh giá tầm quan trọng biến số (relative importance)

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

df_bma$fx1 = as.factor(df_bma$fx)
m.rf = randomForest(fx1 ~ fnbmd + prior.fx, data = df_bma, importance = TRUE, ntrees = 1000)

importance <- importance(m.rf, type = 1)
varImpPlot(m.rf, type = 1, main = "Relative Importance")

4.3 ChatGPT

PROMPT: Xây dựng mô hình tối ưu dự báo gãy xương từ những biến số như tuổi (age), giới tính (sex), cân nặng (weight), chiều cao (height), mật độ xương (fnbmd), hút thuốc là (smoking) và tiền căn gãy xương (prior.fx). Dùng phương pháp Bayesian Model Averaging từ gói lệnh BMA

AI trong phân tích dữ liệu với R - Viện Phương Nam (6-11/1/2026)

Thach Tran

2025-12-29