Loading necessary packages

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gtsummary)
library(dplyr)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Q1 Test question: The effectiveness of the new method over the traditional one load the data

opt_raw <- read.csv("../HW 2/Optics.csv")

EDA.
1. Data Cleaning

In this part, we will remove ID from the original dataset because ID serves as an identifier, which is trivial in data validation and visualization.

opt <- opt_raw %>% 
  distinct() %>% # removing the potential duplicates
  select(-c(ID))

1. Data handling

In this step, we will bring the data to the required form for data analysis. Mainly, we will factorize the categorical data.

# converting categorical data to factors 
opt <- opt %>%
  mutate(
    gender = factor(gender, levels = c(0, 1), labels = c("male", "female")), 
    method = factor(method)
  )

1. Data validation

In this part, we will clean the dataset. We will fist check the elements of the combined data by printing out the column name of data. This step can make sure that we are working on the correct cleaned data set. Then, we will count the missing values for each column so that we will have a clear picture of the available size of our data set. Then, we will print out the summary of all categorical data and numerical data to each whether there are some nonsense values (e.g. outlier for numerical variable).

# checking the elements of the data 
opt %>% head(6)

cat("Columns of the data include:", colnames(opt))

## Columns of the data include: OptPost method gender OptPre

# handling missing values
mis <- opt %>% summarize(across(everything(), ~sum(is.na(.))))
print(mis)

##   OptPost method gender OptPre
## 1       0      0      0      0

# checking nonsense value for categorical values
# finding categorical variables 
cat_vars <- opt %>% select(where(~ is.character(.) | is.factor(.))) %>% names()
for (v in cat_vars) {# showing all unique categories for each variable
  print(table(opt[[v]], useNA = "ifany"))
}

## 
##         MBI traditional 
##          21          16 
## 
##   male female 
##     22     15

opt %>% # counting number of unique levels for each
  summarise(across(all_of(cat_vars), ~ n_distinct(.))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_levels")

# finding suspicious values e.g. catogories with very long number of counts
for (v in cat_vars) {
  cat("\n categorical variable", v, ":\n")
  print(table(opt[[v]]), decreasing = FALSE)
}

## 
##  categorical variable method :
## 
##         MBI traditional 
##          21          16 
## 
##  categorical variable gender :
## 
##   male female 
##     22     15

#checking nonsense value for quantitative variables
# finding quantitative variables
num_vars <- opt %>% select(where(is.numeric)) %>% names()
# summarizing the min, max, median, and mean of each variable
opt %>%
  summarise(across(all_of(num_vars),
                   list(min = ~min(., na.rm = TRUE),
                        max = ~max(., na.rm = TRUE), 
                        median = ~median(., na.rm = TRUE),
                        mean = ~mean(., na.rm = TRUE))))

1. Identify

In this step, we will identify the data type, scale of measurement, and the role of each variable. We first find which varable is quantitative data and which variable is categorical data. This will help us to draw the graph of single variable analysis and narrow down the regression model selection(even though we don’t need in this project). Then, we will specify the sacle of measurement. Finally, we will clarify the role of each variable, namely, finding the outcome and covariate. In our study, it is clear that Limited_Law is the outcome because we want to study the influence of average abortion count and party tendency on the status of limited aboriton law. And one thing is worth to note that, we regard state as identifier because it make no sense to include state in our study on the influence.

# finding variable types

var_info <- tibble(
  variable = names(opt_raw),
  type = map_chr(opt_raw, ~ if (is.numeric(.x)) "Quantative" else "Categorical")
)

# finding the scale of measurement
var_info <- var_info %>%
  mutate(
    scale = case_when(
      variable %in% c("ID","method","gender") ~ "Nominal",
      variable %in% c("OptPre","OptPost") ~ "Ratio"
    ),
    
# showing the variable type (outcome or covariate) 
    role = case_when(
      variable %in% c("OptPost") ~ "Outcome",
      variable %in% c("ID") ~ "Identifier",
      TRUE ~ "Covariate"
    )
  )

print(var_info)

## # A tibble: 5 × 4
##   variable type        scale   role      
##   <chr>    <chr>       <chr>   <chr>     
## 1 ID       Quantative  Nominal Identifier
## 2 OptPost  Quantative  Ratio   Outcome   
## 3 method   Categorical Nominal Covariate 
## 4 gender   Quantative  Nominal Covariate 
## 5 OptPre   Quantative  Ratio   Covariate

1. Data summary

In this part, we will summarize statistics of each variable. To be more specific, we will calculate mean, standard deviation, min, 25% quantile, median, 75% quantile, and max of quantitative variable. These statistics will give us a comprehensive picture of the performance of the variable. For categorical data, we will simply calculate the frequency table, including missing value and NA.

#summarizing numerical variables
num_summary <- opt %>%
    select(all_of(num_vars)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(
      mean = mean(value, na.rm = TRUE),
      sd = sd(value, na.rm = TRUE),
      min = min(value, na.rm = TRUE),
      first_quantile = quantile(value, 0.25, na.rm = TRUE),
      median = median(value, na.rm = TRUE),
      third_quantile = quantile(value, 0.75, na.rm = TRUE),
      max = max(value, na.rm = TRUE),
      .groups = "drop")
print(num_summary)

## # A tibble: 2 × 8
##   variable  mean    sd   min first_quantile median third_quantile   max
##   <chr>    <dbl> <dbl> <int>          <dbl>  <int>          <dbl> <int>
## 1 OptPost   58.8  18.2    16             46     59             69    92
## 2 OptPre    36.9  18.6     0             24     42             50    83

#summarizing categorical variables
cat_summary <- map_df(cat_vars, function(v) {
    x <- opt[[v]]
    x_lab <- as.character(x)
    tibble(variable = v, level = x_lab) %>%
      count(variable, level, name = "n")
  })
print(cat_summary)

## # A tibble: 4 × 3
##   variable level           n
##   <chr>    <chr>       <int>
## 1 method   MBI            21
## 2 method   traditional    16
## 3 gender   female         15
## 4 gender   male           22

1. Data visualization

We will draw the graph for univariate distributions of all the variables. For quantitative variable, we will draw histgram. And for categorical data, we will draw par chart.

#drawing histogram for numerical variables
for (v in num_vars) {
  his <- ggplot(opt, aes(x = .data[[v]])) +
    geom_histogram( fill = "pink", color = "white",bins = 30, alpha = 0.8) + #set the color and bins of the historgram
    labs(title = paste("Histogram of", v), x = v, y = "Frequency") + #create title and label x y axis
    theme_minimal() #set the background as white
  print(his)
}

#drawing bar charts for categorical variables
for (v in cat_vars) {
    df_plot <- tibble(level = as.character(opt[[v]])) %>% count(level, name = "n") #orgnize categorical data
    bc <- ggplot(df_plot, aes(x = reorder(level, -n), y = n, fill = level)) + 
    geom_col(show.legend = FALSE, alpha = 0.85) + #creat bar chart
    labs(title = paste("Bar Chart of", v),x = v, y = "Count") + #create title and label x y axis
    theme_minimal() #set the background as white
  print(bc)
}

Then, we will draw the graph for bivariate distributions of all the variables.

association between method and OptPost

dat1 <- opt %>% drop_na()
tab1 <- dat1 %>%
  group_by(method) %>%
  summarise(
    n = n(),
    mean = mean(OptPost),
    sd = sd(OptPost),
    median = median(OptPost),
    min = min(OptPost),
    max = max(OptPost),
    .groups = "drop"
  )
print(tab1)

## # A tibble: 2 × 7
##   method          n  mean    sd median   min   max
##   <fct>       <int> <dbl> <dbl>  <dbl> <int> <int>
## 1 MBI            21  63.4  19.1     61    16    92
## 2 traditional    16  52.7  15.4     50    20    84

# boxplot
boxplot(OptPost ~ method, data = dat1, xlab = "method", ylab = "OptPost", main = "OptPost by method")

association between OptPre and OptPost

dat2 <- opt %>% drop_na()
tab2 <- tibble(
  variable = c("OptPre","OptPost"),
  mean = c(mean(dat2$OptPre), mean(dat2$OptPost)),
  sd = c(sd(dat2$OptPre),   sd(dat2$OptPost)),
  min = c(min(dat2$OptPre),  min(dat2$OptPost)),
  first_quantile = c(quantile(dat2$OptPre, 0.25), quantile(dat2$OptPost, 0.25)),
  median = c(median(dat2$OptPre), median(dat2$OptPost)),
  third_quantile = c(quantile(dat2$OptPre, 0.75), quantile(dat2$OptPost, 0.75)),
  max = c(max(dat2$OptPre),  max(dat2$OptPost))
)
print(tab2)

## # A tibble: 2 × 8
##   variable  mean    sd   min first_quantile median third_quantile   max
##   <chr>    <dbl> <dbl> <int>          <dbl>  <int>          <dbl> <int>
## 1 OptPre    36.9  18.6     0             24     42             50    83
## 2 OptPost   58.8  18.2    16             46     59             69    92

cat("correlation coefficient:", cor(dat2$OptPre, dat2$OptPost))

## correlation coefficient: 0.5598551

# scatter plot
ggplot(dat2, aes(x = OptPre, y = OptPost)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "OptPre vs OptPost",
       x = "OptPre", y = "OptPost")

## `geom_smooth()` using formula = 'y ~ x'

association between gender and OptPost

dat3 <- opt %>% drop_na()
tab3 <- dat3 %>%
  group_by(gender) %>%
  summarise(
    n = n(),
    mean = mean(OptPost),
    sd = sd(OptPost),
    median = median(OptPost),
    min = min(OptPost),
    max = max(OptPost),
    .groups = "drop"
  )
print(tab3)

## # A tibble: 2 × 7
##   gender     n  mean    sd median   min   max
##   <fct>  <int> <dbl> <dbl>  <dbl> <int> <int>
## 1 male      22  60.1  19.4     57    20    92
## 2 female    15  56.9  16.7     59    16    84

# boxplot
boxplot(OptPost ~ gender, data = dat3, xlab = "gender", ylab = "OptPost", main = "OptPost by gender")

According to the table in bivariate distribution of method vs OptPost, we can find that the average post test score for MBI group is higher than that for the traditional group. Meanwhile, the median of the post test score for MBI group is also higher than that for the traditional group. These statistics may show that MBI method would relate to a better performance. According to the boxplot of OptPost by method, the MBI group tends to have a higher score on post test. Though the range of post test of MBI group is large, the body of the box (there the data cluster) is visually higher than the box of traditional group. Thus, this may show that the MBI method would relate to a higher post score, which means a better performance for graduate student in math education.
First, we start with a model that includs all possible interactions.

full_gm <- lm(OptPost ~ (OptPre + method + gender)^3, data = opt)
summary(full_gm)

## 
## Call:
## lm(formula = OptPost ~ (OptPre + method + gender)^3, data = opt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.7906  -6.6161  -0.9723   9.0781  24.8978 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            51.6253     7.1873   7.183  6.6e-08 ***
## OptPre                                  0.4398     0.1643   2.676  0.01211 *  
## methodtraditional                     -42.5676    14.9409  -2.849  0.00798 ** 
## genderfemale                          -13.3555    13.4720  -0.991  0.32971    
## OptPre:methodtraditional                0.5559     0.3499   1.589  0.12297    
## OptPre:genderfemale                     0.1253     0.3680   0.340  0.73596    
## methodtraditional:genderfemale         38.1092    23.3951   1.629  0.11414    
## OptPre:methodtraditional:genderfemale  -0.4999     0.5855  -0.854  0.40020    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.6 on 29 degrees of freedom
## Multiple R-squared:  0.5487, Adjusted R-squared:  0.4398 
## F-statistic: 5.037 on 7 and 29 DF,  p-value: 0.0007952

From the summary, we find that the coefficient of all interactive term are not statisically significant. Thus, we will do backward procedure based on p value of each coefficient to find the model that best fits the data. Suppose alpha = 0.05:

gm_1 <- lm(OptPost ~ OptPre + method + gender + method*gender +OptPre*method + OptPre*gender , data = opt)
summary(gm_1)

## 
## Call:
## lm(formula = OptPost ~ OptPre + method + gender + method * gender + 
##     OptPre * method + OptPre * gender, data = opt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.486  -6.159  -1.405   7.996  25.695 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.15922    6.94762   7.220 4.91e-08 ***
## OptPre                           0.47919    0.15701   3.052  0.00473 ** 
## methodtraditional              -35.56522   12.43227  -2.861  0.00763 ** 
## genderfemale                    -6.92954   11.12309  -0.623  0.53800    
## methodtraditional:genderfemale  19.82508    9.37783   2.114  0.04293 *  
## OptPre:methodtraditional         0.37738    0.27930   1.351  0.18674    
## OptPre:genderfemale             -0.07217    0.28494  -0.253  0.80177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.54 on 30 degrees of freedom
## Multiple R-squared:  0.5374, Adjusted R-squared:  0.4448 
## F-statistic: 5.808 on 6 and 30 DF,  p-value: 0.000416

We can see that OptPre*gender term has the largest p value. Thus, we drop this term and fit the model.

gm_2 <- lm(OptPost ~ OptPre + method + gender + method*gender + OptPre*method , data = opt)
summary(gm_2)

## 
## Call:
## lm(formula = OptPost ~ OptPre + method + gender + method * gender + 
##     OptPre * method, data = opt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.135  -6.695  -1.938   7.737  25.403 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.6951     6.5170   7.779  8.9e-09 ***
## OptPre                           0.4648     0.1441   3.225  0.00297 ** 
## methodtraditional              -34.7801    11.8565  -2.933  0.00626 ** 
## genderfemale                    -9.2782     6.0499  -1.534  0.13527    
## methodtraditional:genderfemale  19.3443     9.0441   2.139  0.04043 *  
## OptPre:methodtraditional         0.3586     0.2651   1.352  0.18604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.33 on 31 degrees of freedom
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.4616 
## F-statistic: 7.173 on 5 and 31 DF,  p-value: 0.0001464

Now, OptPre*method term has the largest p value, we then drop this one.

gm_3 <- lm(OptPost ~ OptPre + method + gender + method*gender , data = opt)
summary(gm_3)

## 
## Call:
## lm(formula = OptPost ~ OptPre + method + gender + method * gender, 
##     data = opt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.658  -7.077  -2.749   5.773  27.547 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     46.7495     5.9025   7.920 4.87e-09 ***
## OptPre                           0.5708     0.1225   4.658 5.36e-05 ***
## methodtraditional              -20.7871     5.8637  -3.545  0.00123 ** 
## genderfemale                    -8.6577     6.1101  -1.417  0.16616    
## methodtraditional:genderfemale  18.4552     9.1362   2.020  0.05182 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.5 on 32 degrees of freedom
## Multiple R-squared:  0.509,  Adjusted R-squared:  0.4477 
## F-statistic: 8.294 on 4 and 32 DF,  p-value: 0.0001042

While gender alone was not significant, the method*gender interaction was significant under alpha = 0.1. Thus, gender effects the increase in score by teaching method.

I will keep the gender term, as well as method*gender interactive term. Then, the form of the model becomes :

E[OptPost|X] = ß_0 + ß_1 x OptPre + ß_2 x method + ß_3 x gender + ß_4 x method x gender

where the reference group of method is BMI, and the reference group of gender is male.

Model assumtions are:

The error_i has normal distribution with mean 0 variance sigma^2. Y_i has normal distribution with mean E[Y_i|X_i] and variance sigma^2.
Constant variance(Homoscedasitcity): Var(error_i) = sigma^2.
Functional forms fo model covariates: linearity in parameters–the outcome OptPost is linear in ß.
No unusual residuals/outlier errors.
Influential points
Adequate fit of the covariates in the model
No multicollinearity in covariates.
X_i1, X_i2… are known constants. (true by default in our dataset)

check assumptions for overall fitted model

Normality

hist(rstandard(gm_3, main = "histgram"))

qqnorm(rstandard(gm_3), main = "Q-Q Plot")
qqline(rstandard(gm_3), col = "pink", lwd = 2)

ks.test(rstandard(gm_3), "pnorm")

## Warning in ks.test.default(rstandard(gm_3), "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  rstandard(gm_3)
## D = 0.13884, p-value = 0.4736
## alternative hypothesis: two-sided

The residual Q–Q plot and histgram indicate that residuals are approximately normally distributed. And the Kolmogorov–Smirnov test for normality yielded p = 0.3144, showing no significant difference from normality. Hence, the normality assumption is satisfied.

Constant Variance, 3. Functional Forms of Model Covariates, 4.Outliers

plot(fitted(gm_3), rstandard(gm_3), pch=19, ylim=c(-4,4))
abline(h=0, lty=2, lwd=2)
abline(h=-3, lty=3, lwd=2)
abline(h= 3, lty=3, lwd=2)

The poins are randomly scattered around the horizontal line at 0. The spread of residuals is roughly constant across the fitted value. There is not clear shape. And no point is beyond 3 and -3, so there is not outliers. Thus, the assumptions are verified.

Influential points

plot(cooks.distance(gm_3), pch = 19, ylim=c(0,1.5))
abline(h = 1, ltv = 2)

## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): "ltv" is
## not a graphical parameter

There is no point with cook’s influence > 1. Thus, there is not influencial point.

Adequate fit of the covariates in the model

plot(opt$OptPre, rstandard(gm_3), pch = 19)

plot(opt$gender, rstandard(gm_3), pch = 19)

plot(opt$method, rstandard(gm_3), pch = 19)

Tje plot of standardized residuals against OptPre shows no pattern. And the residuals are evenly distributed around 0. Hence, the functional form of OptPre in the model is adequate. From the box plot of gender and method against rstandard(gm_3), we can see that boxes roughly centered around 0. Thus, the model fits group equally well.

Multicollinearity

vif(gm_3)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

##        OptPre        method        gender method:gender 
##      1.028259      1.712405      1.826121      2.598226

The value of VIF of each term is less than 5. Thus, there is no multicollinearity.

To conclude, there is not questionable observations (outliers, influential points) or other problems indicateing model inadequacy.

Interpret the regression coefficients

ß_0: The predicted post test score is 46.7495 for male student under MBI method when pre test score is 0. But this is meaningless.

ß_1: Fixing teaching method and gender, on average, one unit increase in pre test score is associate with 0.5867 unit increase in post test score.

ß_2: Fixing pre test score, on average, the male student under traditional method would have 20.7871 unit lower than the male student under MBI method in post test score.

ß_3: Fixing pre test score, on average, the female student under MBI method would have 8.6577 unit lower than male student under MBI method in post test score.

ß_4: Fixing pre test score, on average, the gender difference in post test scores is 18.4552 units smaller under the traditional method than under the MBI method.

conclusion

The MBI method produces significantly higher post test scores than the traditional method among male students (ß_2 = -13.2472). For female students, the estimated difference is small (ß_2 + ß_4 = -2.3319). Meanwhile, according to the interactive term gender x method (which is statistically significant) and this result, we can find that the effectiveness of method differs by gender. Therefore, we can conclude that MBI is more effective than the traditional method for male student. For females, the effectiveness of MBI method is smaller and less significant. This may because ß_4 is not statistically significant under 0.05 and add some noise to the model interpretation.

I asked ChatGPT-5 for this answer of this question. I directly copy paste the dataset and the question a, b, c, and d to ChatGPT. Finally, I asked it to export our conversation as r markdown so that I can record the transcript here. Below is the exported answer:

User’s Question

find the best model that fits this dataset with respect to the primary research question.

evaluate the model performance, including regression diagnostics to check model assumptions,
fit to individual points, and predictive accuracy.

interpret the regression coefficients.

provide an overall conclusion regarding the primary research question of this study.

Quoted Sentence

Yes. export out conversation as a r markdown file, including my question and this sentence.

Reproducible Analysis

This document reproduces the model search, selection, diagnostics, and interpretation using the uploaded dataset Optics.csv.

Setup

library(tidyverse)
library(broom)
library(car)
library(lmtest)
library(sandwich)
library(modelr)
library(ggplot2)
library(kableExtra)

# Adjust path if needed
df <- read.csv("Optics.csv")
str(df)

## 'data.frame':    37 obs. of  5 variables:
##  $ ID     : int  4 5 6 8 12 13 15 16 17 18 ...
##  $ OptPost: int  50 67 61 92 59 16 42 69 42 59 ...
##  $ method : chr  "MBI" "MBI" "MBI" "MBI" ...
##  $ gender : int  0 0 0 0 1 1 0 1 1 1 ...
##  $ OptPre : int  50 50 30 67 42 8 8 12 42 50 ...

# Coerce types
df$method <- as.factor(df$method)
df$gender <- as.factor(df$gender)  # treat gender as a factor (levels 0, 1 assumed)

summary(df)

##        ID           OptPost              method   gender     OptPre     
##  Min.   : 4.00   Min.   :16.00   MBI        :21   0:22   Min.   : 0.00  
##  1st Qu.:18.00   1st Qu.:46.00   traditional:16   1:15   1st Qu.:24.00  
##  Median :33.00   Median :59.00                           Median :42.00  
##  Mean   :32.51   Mean   :58.78                           Mean   :36.86  
##  3rd Qu.:48.00   3rd Qu.:69.00                           3rd Qu.:50.00  
##  Max.   :60.00   Max.   :92.00                           Max.   :83.00

Candidate Models

We consider additive and interaction models aligned with the primary question: - Baseline prior ability: OptPre - Group differences: method, gender - Effect moderation: OptPre:method, method:gender

fm_list <- list(
  m1 = "OptPost ~ OptPre",
  m2 = "OptPost ~ OptPre + method",
  m3 = "OptPost ~ OptPre + gender",
  m4 = "OptPost ~ OptPre + method + gender",
  m5 = "OptPost ~ OptPre + method + gender + OptPre:method",
  m6 = "OptPost ~ OptPre + method + gender + method:gender",
  m7 = "OptPost ~ OptPre + method + gender + OptPre:method + method:gender"
)

fit_list <- lapply(fm_list, function(f) lm(as.formula(f), data = df))

cmp <- map_df(names(fit_list), function(nm) {
  m <- fit_list[[nm]]
  data.frame(
    model = nm,
    formula = deparse(formula(m)),
    AIC = AIC(m),
    BIC = BIC(m),
    R2 = summary(m)$r.squared,
    Adj_R2 = summary(m)$adj.r.squared,
    stringsAsFactors = FALSE
  )
}) %>% arrange(AIC, BIC)

knitr::kable(cmp, digits = 3) %>%
  kable_styling(full_width = FALSE)

model	formula	AIC	BIC	R2	Adj_R2
m7	OptPost ~ OptPre + method + gender + OptPre:method + method:gender	304.125	315.402	0.536	0.462
m6	OptPost ~ OptPre + method + gender + method:gender	304.246	313.912	0.509	0.448
m2	OptPost ~ OptPre + method	304.696	311.139	0.446	0.414
m4	OptPost ~ OptPre + method + gender	306.687	314.741	0.446	0.396
m5	OptPost ~ OptPre + method + gender + OptPre:method	307.219	316.884	0.468	0.401
m1	OptPost ~ OptPre	310.653	315.486	0.313	0.294
m3	OptPost ~ OptPre + gender	312.581	319.025	0.315	0.274

Choose Best Model

We select the model with lowest AIC and competitive BIC: - Chosen: OptPost ~ OptPre + method + gender + OptPre:method + method:gender

best <- fit_list[["m7"]]
summary(best)

## 
## Call:
## lm(formula = as.formula(f), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.135  -6.695  -1.938   7.737  25.403 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                50.6951     6.5170   7.779  8.9e-09 ***
## OptPre                      0.4648     0.1441   3.225  0.00297 ** 
## methodtraditional         -34.7801    11.8565  -2.933  0.00626 ** 
## gender1                    -9.2782     6.0499  -1.534  0.13527    
## OptPre:methodtraditional    0.3586     0.2651   1.352  0.18604    
## methodtraditional:gender1  19.3443     9.0441   2.139  0.04043 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.33 on 31 degrees of freedom
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.4616 
## F-statistic: 7.173 on 5 and 31 DF,  p-value: 0.0001464

Diagnostics

par(mfrow = c(2,2))
plot(best)

par(mfrow = c(1,1))

Heteroskedasticity and normality tests

# Breusch-Pagan
bp <- bptest(best)
bp

## 
##  studentized Breusch-Pagan test
## 
## data:  best
## BP = 10.281, df = 5, p-value = 0.06765

# Shapiro-Wilk on residuals (n is small; AD available in nortest, but keep base)
shapiro.test(residuals(best))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(best)
## W = 0.96729, p-value = 0.3399

Influence

infl <- influence.measures(best)
summary(infl)

## Potentially influential observations of
##   lm(formula = as.formula(f), data = df) :
## 
##    dfb.1_ dfb.OptP dfb.mthd dfb.gnd1 dfb.OpP: dfb.mt:1 dffit cov.r   cook.d
## 14 -0.05   0.09     0.03    -0.02    -0.05     0.01     0.11  1.79_*  0.00 
## 23  0.00   0.00     0.09     0.00    -0.10    -0.08    -0.18  1.58_*  0.01 
## 25  0.00   0.00    -0.31     0.00     0.51    -0.18     0.72  1.67_*  0.09 
## 29  0.00   0.00     0.06     0.00    -0.06     0.04     0.11  1.67_*  0.00 
## 31  0.00   0.00     0.20     0.00    -0.23     0.14     0.38  1.59_*  0.02 
##    hat  
## 14  0.32
## 23  0.25
## 25  0.38
## 29  0.27
## 31  0.29

# Cook's distance
plot(cooks.distance(best), type = "h", main = "Cook's Distance", ylab = "Cook's D")
abline(h = 4/length(residuals(best)), lty = 2, col = "red")

Largest |studentized residuals|

stud <- rstudent(best)
df %>% 
  mutate(fitted = fitted(best), stud_resid = stud) %>%
  arrange(desc(abs(stud_resid))) %>%
  select(ID, OptPre, OptPost, method, gender, fitted, stud_resid) %>%
  head(10) %>%
  knitr::kable(digits = 3) %>%
  kable_styling(full_width = FALSE)

ID	OptPre	OptPost	method	gender	fitted	stud_resid
13	8	16	MBI	1	45.135	-2.652
28	17	84	MBI	0	58.597	2.153
4	50	50	MBI	0	73.935	-1.975
16	12	69	MBI	1	46.994	1.884
31	44	84	MBI	1	61.868	1.864
30	24	84	MBI	0	61.850	1.812
17	42	42	MBI	1	60.939	-1.566
55	42	75	traditional	1	60.562	1.179
15	8	42	MBI	0	54.413	-1.027
35	28	67	MBI	1	54.431	1.009

Cross-Validation (5-fold)

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out=nrow(df)))

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
rmses <- r2s <- c()

for (i in 1:k) {
  tr <- df[folds != i, ]
  te <- df[folds == i, ]
  m <- lm(OptPost ~ OptPre + method + gender + OptPre:method + method:gender, data = tr)
  pred <- predict(m, newdata = te)
  rmses[i] <- rmse(te$OptPost, pred)
  r2s[i] <- 1 - sum((te$OptPost - pred)^2) / sum( (te$OptPost - mean(tr$OptPost))^2 )
}

data.frame(RMSE_mean = mean(rmses), R2_mean = mean(r2s))

Coefficient Interpretation (baseline: method = MBI, gender = 0)

co <- broom::tidy(best) %>% select(term, estimate, std.error, statistic, p.value)
knitr::kable(co, digits = 3) %>%
  kable_styling(full_width = FALSE)

term	estimate	std.error	statistic	p.value
(Intercept)	50.695	6.517	7.779	0.000
OptPre	0.465	0.144	3.225	0.003
methodtraditional	-34.780	11.857	-2.933	0.006
gender1	-9.278	6.050	-1.534	0.135
OptPre:methodtraditional	0.359	0.265	1.352	0.186
methodtraditional:gender1	19.344	9.044	2.139	0.040

# Extract key slopes and cross-over points
b <- best$coefficients
b0 <- unname(b["(Intercept)"])
b_pre <- unname(b["OptPre"])
b_m <- unname(b["methodtraditional"])
b_g <- ifelse("gender1" %in% names(b), unname(b["gender1"]), NA_real_)
b_mg <- ifelse("methodtraditional:gender1" %in% names(b), unname(b["methodtraditional:gender1"]), NA_real_)
b_pre_m <- ifelse("OptPre:methodtraditional" %in% names(b), unname(b["OptPre:methodtraditional"]), NA_real_)

slope_mbi <- b_pre
slope_trad <- b_pre + b_pre_m

x_cross_g0 <- -b_m / b_pre_m
x_cross_g1 <- -(b_m + b_mg) / b_pre_m

data.frame(
  slope_MBI = slope_mbi,
  slope_traditional = slope_trad,
  crossover_gender0 = x_cross_g0,
  crossover_gender1 = x_cross_g1
) %>%
  knitr::kable(digits = 3) %>%
  kable_styling(full_width = FALSE)

slope_MBI	slope_traditional	crossover_gender0	crossover_gender1
0.465	0.823	96.998	43.049

Conclusion

Teaching method affects post-test scores, and the effect depends on prior ability (OptPre) and gender.
MBI tends to benefit lower-pre students more (especially gender=0); for gender=1 with higher pre-scores, traditional can match or exceed MBI due to a steeper slope.
Small sample and borderline heteroskedasticity mean results should be reported with appropriate caution; cross-validated predictive power is modest.

Summary:

We selected different “best model”. In ChatGPT’s work, it further includes OptPre:method because the model with Opt:method has lowest AIC and competitive BIC. I think this is because we used different method to filter the “best model”. I used backward model selection procedure to find the model with all coefficient being statistically significant. On the other hand, ChatGPT used a model selection process based on multiple regression candidate specifications, comparing each model by information criteria (AIC/BIC) and cross-validated predictive performance.

Nevertheless, in regression diagnostics to check model assumptions, we tested similar indicators, including qq-plot for normality assumption and cook’s distance for influential points. But it only checked linearity / equal variance, normality of residuals, and influence / outliers, which is less comprehensive than what we learned in the lecture. Besides part of what we learned in the lecture, it also provided 5-fold cross-validation predictive accuracy.

In part c, we share the same explanation for the coefficients, except it had one more coefficient to explain than my model. And for the conclusion, it also said “teaching method matters, and its effect depends on prior ability (OptPre) and gender”. With MBI, students with lower pre-scores tend to achieve higher post-scores than comparable students in the traditional method; this advantage weakens (and for gender=1 can reverse) at higher pre-scores due to traditional’s steeper slope.

I think ChatGPT answered this question very well. It provided a different method than I used. And more importantly, it only used several seconds to generate the result, comparing I used more than 4 hours to finish Q1. Thus, using ChatGPT would increase our productivity if we can correctly utilize its ability. Nonetheless, ChatGPT is not 100% trustworthy. For example, the evaluation on the assumption is not comprehensive, if I don’t give it more command to verify other assumptions. The AI tool cannot be as rigorous as human power in doing comprehensive analysis so far. So we should not completely reply on Ai tools. We need to have our own ideas and judgement, which heavily connect with the lecture we are taking now a days. Thus, my conclusion is: AI’s are our perfect working pals. They can provide easier ways to solve our problems, but they still need our supervision to get the job 100% aligning with our goals.

Q3 I asked the same question to Gemini 2.5 Flash. But it cannot convert its analysis to r markdown form. So I directly copy paste my quesition and its answer to this file.

My question:

find the best model that fits this dataset with respect to the primary research question.
evaluate the model performance, including regression diagnostics to check model assumptions,

fit to individual points, and predictive accuracy.

interpret the regression coefficients.
provide an overall conclusion regarding the primary research question of this study.

Answer all questions and export our conversation to a r markdown file

Its answer:

Analysis of Covariance (ANCOVA) of Post-Intervention Optical Scores: Evaluating Method Effectiveness While Controlling for Baseline

Executive Summary

This report presents a rigorous quantitative analysis utilizing the Analysis of Covariance (ANCOVA) framework to evaluate the relative effectiveness of two intervention approaches, the Mind-Body Integration (MBI) method and the traditional method, on post-intervention optical scores ($\text{OptPost}$). The analysis statistically controls for baseline performance ($\text{OptPre}$) and participant gender. The primary research objective is to determine if a significant difference exists between the MBI and traditional methods regarding $\text{OptPost}$ after adjusting for the influence of the covariate $\text{OptPre}$.1 The preliminary assessment indicated that the relationship between the baseline score ($\text{OptPre}$) and the outcome ($\text{OptPost}$) was exceptionally weak ($r \approx 0.048$).3 This finding critically limits the ANCOVA’s ability to improve statistical precision, a core goal of covariate inclusion. Based on standard methodological requirements, the most appropriate model selected was the additive ANCOVA model ($\text{OptPost} \sim \text{Method} + \text{OptPre} + \text{Gender}$), which assumes the homogeneity of regression slopes (HORS) is met. The core findings, contingent on the successful validation of regression assumptions, indicate: The covariate ($\text{OptPre}$) was statistically non-significant in predicting $\text{OptPost}$ in the full model ([Calculated] $p > 0.05$), confirming the preliminary assessment of poor predictive utility. The primary factor, $\text{Method}$, showed a statistically non-significant adjusted effect on $\text{OptPost}$ ([Calculated] $p > 0.05$). The adjusted mean score for the MBI group was $\text{[Calculated]}$ (95% CI [Calculated], [Calculated]), compared to $\text{[Calculated]}$ for the traditional group. The overall conclusion regarding the primary research question is that, after controlling for baseline scores and gender, there is no detectable statistical difference in post-intervention optical scores between the MBI and traditional intervention methods.

Introduction and Methodology

2.1 Background and Research Question

This study investigates the influence of a categorical treatment variable ($\text{Method}$) on a continuous outcome variable ($\text{OptPost}$). The dataset consists of 37 observations ($N=37$), each characterized by an ID, the post-intervention optical score ($\text{OptPost}$, the dependent variable), the intervention method ($\text{method}$, primary factor: MBI or traditional), the participant gender ($\text{gender}$, secondary factor: 0 or 1), and the pre-intervention optical score ($\text{OptPre}$, the continuous covariate).3 The methodology employs ANCOVA, a specific application of the General Linear Model, which is explicitly designed to analyze differences in the means of the outcome variable ($\text{OptPost}$) across levels of the categorical factor ($\text{Method}$) while simultaneously controlling for the variance explained by a continuous predictor ($\text{OptPre}$).1 By statistically modifying the influence of the covariate, ANCOVA aims to decrease the error term, thereby enhancing statistical power and yielding adjusted group means that eliminate bias associated with differences in baseline scores ($\text{OptPre}$).2

2.2 Data Profile and Preliminary Descriptive Analysis

The dataset structure clearly aligns with the requirements for ANCOVA: one continuous dependent variable ($\text{OptPost}$), one continuous covariate ($\text{OptPre}$), and two categorical factors ($\text{Method}$ and $\text{Gender}$). Descriptive statistics detailing the raw means and standard deviations for the outcome ($\text{OptPost}$) and the covariate ($\text{OptPre}$) are presented below, segmented by the two factors ($\text{Method}$ and $\text{Gender}$).3 Raw Descriptive Statistics by Group Method Gender N OptPost Mean (SD) OptPre Mean (SD) MBI 0 13 63.69 (17.58) 38.00 (23.36) MBI 1 8 56.00 (25.86) 31.38 (16.96) Traditional 0 9 49.78 (18.06) 40.89 (16.57) Traditional 1 7 57.86 (14.96) 38.71 (20.03)

A comparison of the raw means indicates that the MBI group generally exhibits higher mean $\text{OptPost}$ scores (MBI 0: 63.69; MBI 1: 56.00) compared to the traditional group (Traditional 0: 49.78; Traditional 1: 57.86). However, the standard deviations reveal considerable variance within groups, particularly for the MBI group. Furthermore, the baseline scores ($\text{OptPre}$) also vary across groups, necessitating the ANCOVA adjustment to ensure that any observed differences in $\text{OptPost}$ are attributable to the $\text{Method}$ rather than pre-existing differences in participant ability or status.

Bivariate Relationships and Initial Covariate Assessment

3.1 Covariate Effectiveness ($\text{OptPre} \rightarrow \text{OptPost}$)

A cornerstone of effective ANCOVA design is the strength of the linear relationship between the covariate and the dependent variable.2 The Pearson correlation coefficient ($r$) was calculated for all available data pairs for $\text{OptPost}$ and $\text{OptPre}$ ($N=36$).3 The calculated Pearson correlation coefficient between $\text{OptPost}$ and $\text{OptPre}$ is approximately $r \approx 0.048$.3 This finding suggests a negligible linear relationship between the baseline score ($\text{OptPre}$) and the post-intervention score ($\text{OptPost}$). A strong covariate typically explains a substantial portion of the variance in the outcome, thereby reducing the error variance and increasing the statistical power to detect the primary factor’s effect. With $r \approx 0.048$, the proportion of variance explained ($r^2$) is approximately $0.0023$, meaning $\text{OptPre}$ accounts for less than $0.3\%$ of the total variance in $\text{OptPost}$. The weak correlational link between $\text{OptPre}$ and $\text{OptPost}$ carries a significant implication for the interpretation of the ANCOVA results. If the covariate explains almost no variance, the ANCOVA model cannot successfully fulfill its primary function of enhancing precision and statistical power by reducing the error terms.2 Consequently, the adjusted results for the primary factor ($\text{Method}$) are expected to be numerically similar to what a simple ANOVA (without the covariate) would produce. The study’s design, while formally employing ANCOVA, may be fundamentally constrained by the poor predictive utility of its selected baseline measure. This necessitates cautious interpretation of non-significant findings, as the lack of a strong covariate relationship may preclude the detection of a true effect size due to inadequate statistical precision.

Model Selection and Assumption Testing (Task a: Find the Best Model)

Selecting the best-fit model for ANCOVA involves testing critical assumptions derived from the General Linear Model structure. The primary models considered are the full interaction model, the interaction model for the factors, and the simple additive model.

4.1 Homogeneity of Regression Slopes (HORS) Test

The most critical assumption for interpreting the main effects in ANCOVA is the Homogeneity of Regression Slopes (HORS).2 This assumption mandates that the relationship (slope) between the covariate ($\text{OptPre}$) and the outcome ($\text{OptPost}$) must be constant across all levels of the primary factor ($\text{Method}$). Violation of HORS signifies that the treatment effect varies depending on the baseline score, invalidating the interpretation of a single main effect for $\text{Method}$. To test HORS, the interaction term between Method and OptPre is included in the model (Model M2):

\[\text{M2: } \text{OptPost} \sim \text{Method} \times \text{OptPre} + \text{Gender}\] A formal ANOVA comparison between Model M2 and the additive model (M1) is performed. Assuming HORS is satisfied, the $\text{Method} \times \text{OptPre}$ interaction term is statistically non-significant ($p > 0.05$). The F-statistic for the interaction was found to be $\text{[Calculated]}$, with a $p$-value of $\text{[Calculated]}$.

4.2 Evaluation of Method $\times$ Gender Interaction

A secondary consideration in model selection is whether the effect of the primary intervention (Method) is dependent on the participant’s demographic factor (Gender).3 This is tested using Model M3, which incorporates the interaction between the two factors:

\[\text{M3: } \text{OptPost} \sim \text{Method} \times \text{Gender} + \text{OptPre}\] Testing the $\text{Method} \times \text{Gender}$ interaction term in Model M3 determined its significance. The interaction F-statistic was $\text{[Calculated]}$, resulting in a $p$-value of $\text{[Calculated]}$.

4.3 Final Model Selection Rationale

For the analysis to proceed using a standard, easily interpretable ANCOVA, both interaction terms (HORS in M2 and the $\text{Method} \times \text{Gender}$ interaction in M3) must be non-significant, allowing for the decomposition of effects into independent main components. Assuming non-significant interaction terms, the most appropriate and parsimonious model that addresses the primary research question while maintaining the control function of the covariate is the additive ANCOVA model (M1): \[\text{M1 (Final Model): } \text{OptPost} \sim \text{Method} + \text{OptPre} + \text{Gender}\] This model provides the most direct estimate of the $\text{Method}$ effect adjusted for both $\text{OptPre}$ and $\text{Gender}$.

4.4 Model Comparison Summary

The following table summarizes the key metrics for the candidate models. Model M1 is selected due to its parsimony and assumption that non-significant interactions validate the main effects structure. Comparison of Candidate Regression Models Model Structure R2 Adj. R2 Method × Covariate F(p) Method × Gender F(p) AIC M1: Additive ANCOVA OptPost ~ Method + OptPre + Gender [Calculated] [Calculated] N/A N/A [Calculated] M2: HORS Test OptPost ~ Method $\times$ OptPre + Gender [Calculated] [Calculated] [Calculated] N/A [Calculated] M3: Interaction Test OptPost ~ Method $\times$ Gender + OptPre [Calculated] [Calculated] N/A [Calculated] [Calculated]

The overall explanatory power of the final model (M1), quantified by $R^2$, is $\text{[Calculated]}$, indicating the percentage of variance in $\text{OptPost}$ accounted for by the combined influence of $\text{Method}$, $\text{OptPre}$, and $\text{Gender}$.

Detailed Regression Diagnostics (Task b: Evaluate Model Performance)

A critical requirement for valid inference using the Ordinary Least Squares (OLS) estimation method employed in ANCOVA is that the model residuals adhere to several classical assumptions: normality, homoscedasticity, and independence. This section details the diagnostic assessment of the selected additive model (M1). The calculation of the diagnostic statistics is predicated on the model being fit to $N=37$ observations with $k=3$ predictors (Method, $\text{OptPre}$, Gender).

5.1 Normality of Residuals

The assumption of normality dictates that the errors (residuals) should be normally distributed around zero. This is primarily checked using graphical methods (Q-Q plots) and formal tests. The Shapiro-Wilk test was performed on the residuals of Model M1 to formally assess normality.3 Shapiro-Wilk Test Result: $W = \text{[Calculated]}$, $p = \text{[Calculated]}$ A low $p$-value ($p < 0.05$) would indicate a violation of normality. Given the limited sample size ($N=37$), the interpretation of this formal test must be tempered by visual inspection of the Q-Q plot. For small samples, the Shapiro-Wilk test may lack power to detect minor deviations, while conversely, the conclusions drawn from ANCOVA coefficients are somewhat robust to moderate non-normality, provided that the sample is not severely skewed or plagued by outliers. Should the $p$-value be below the critical threshold, relying on the central limit theorem’s robustness for large samples is not an option here, meaning any detected non-normality suggests the standard errors and confidence intervals might be inaccurate.

5.2 Homoscedasticity (Equal Variance of Residuals)

Homoscedasticity requires that the variance of the residuals remains constant across all predicted values of $\text{OptPost}$. Violation of this assumption (heteroscedasticity) means the standard errors are incorrect, biasing significance tests. The Breusch-Pagan test was used to formally evaluate the homoscedasticity assumption.3 Breusch-Pagan Test Result: $\text{Chisq} = \text{[Calculated]}$, $p = \text{[Calculated]}$ If the $p$-value of the Breusch-Pagan test is large ($p > 0.05$), the assumption holds, and the model’s standard errors are generally reliable. Conversely, a statistically significant result suggests that the error variance is unequal, often requiring the use of robust standard errors or a weighted least squares approach to correct for the variance bias.

5.3 Outlier and Influence Diagnostics

An analysis of potential influential observations is necessary to determine if the model parameters are unduly affected by a few data points. Cook’s Distance is a metric quantifying how much the model’s coefficients would change if a specific observation were excluded. The maximum value observed for Cook’s Distance in Model M1 was identified 3: Maximum Cook’s Distance: $\text{[Calculated]}$ A commonly cited threshold for undue influence is $4/(N-k-1)$. With $N=37$ and $k=3$, the critical threshold is approximately $4/(37-3-1) \approx 0.121$. If the maximum Cook’s Distance exceeds this value, or the conventional threshold of 1.0, the presence of an influential observation requires careful inspection. If an influential point is identified, the reliability of the derived coefficients, particularly the main effect for $\text{Method}$, becomes highly questionable, potentially warranting the investigation of data integrity or model re-specification.

5.4 Model Fit and Predictive Accuracy

The practical performance of the model is assessed by measuring the typical magnitude of prediction error. The Root Mean Square Error (RMSE) quantifies the average magnitude of the difference between the observed values of $\text{OptPost}$ and the values predicted by Model M1, reported in the original units of $\text{OptPost}$.3 Root Mean Square Error (RMSE): $\text{[Calculated]}$ The RMSE value of $\text{[Calculated]}$ signifies that, on average, the model’s predicted $\text{OptPost}$ scores deviate from the actual observed scores by this magnitude. This metric is essential for judging the practical utility and precision of the model in a real-world context.

Interpretation of Adjusted Regression Coefficients (Task c)

The interpretation focuses on the parameter estimates derived from the final additive ANCOVA model (M1), using the ‘traditional’ method and ‘Gender 0’ ([Calculated] reference groups) as the baseline for comparison.

6.1 Fixed Effects Parameter Estimates

The following table presents the unstandardized regression coefficients, standard errors, and significance tests for Model M1. Parameter Estimates for the Final Additive ANCOVA Model (M1) Predictor Unstandardized Coefficient (B) Standard Error (SE) t-value p-value 95% CI Lower 95% CI Upper (Intercept) [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] OptPre (Covariate) [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] Method [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] Gender [Calculated] [Calculated] [Calculated] [Calculated] [Calculated] [Calculated]

6.2 Interpretation of the Covariate ($\text{OptPre}$) Coefficient

The coefficient for $\text{OptPre}$ ($B_{\text{OptPre}} = \text{[Calculated]}$) represents the estimated change in $\text{OptPost}$ associated with a one-unit increase in the baseline score, holding both $\text{Method}$ and $\text{Gender}$ constant. Based on the preliminary analysis that showed a near-zero bivariate correlation ($r \approx 0.048$) 3, it is highly probable that the covariate $\text{OptPre}$ is statistically non-significant in this multiple regression framework (p-value $\text{[Calculated]}$). A non-significant and small coefficient for the covariate ($B_{\text{OptPre}} \approx \text{[Calculated]}$) indicates that baseline scores contribute negligible unique variance to the prediction of post-intervention scores, even after accounting for the primary factors. This outcome validates the concern raised in Section 3 regarding the weak association between the pre- and post-measures. The inclusion of the covariate fails to meet the methodological expectation of ANCOVA to increase precision, reducing the efficacy of the study design to detect small or moderate treatment effects.

6.3 Interpretation of the Primary Treatment Effect ($\text{Method}$)

The coefficient $B_{\text{MBI}}$ ($\text{[Calculated]}$) is the central finding related to the research question. This value represents the estimated difference in $\text{OptPost}$ between the MBI group and the traditional group, after statistically adjusting for both the continuous covariate ($\text{OptPre}$) and the categorical factor ($\text{Gender}$). If $B_{\text{MBI}}$ is positive, the MBI method results in higher adjusted $\text{OptPost}$ scores compared to the traditional method. If $B_{\text{MBI}}$ is negative, the reverse is true. The statistical test for this coefficient (t-value $\text{[Calculated]}$, p-value $\text{[Calculated]}$) determines if this adjusted difference is large enough to be considered non-random. Assuming the $p$-value is greater than the $\alpha=0.05$ threshold, the conclusion is that the MBI method does not yield a statistically significant different outcome compared to the traditional method, once baseline differences and gender are controlled.

6.4 Calculation and Interpretation of Adjusted Marginal Means (LS Means)

While the coefficient $B_{\text{MBI}}$ represents the mean difference, the effect is often most clearly communicated via Least Squares Means (LS Means), which are the predicted mean outcomes for each group calculated at the overall average value of the covariate ($\bar{X}_{\text{OptPre}}$). The overall mean of $\text{OptPre}$ is $1380 / 36 \approx 38.33$.3 Adjusted Marginal Means for OptPost (Controlling for OptPre) Method Adjusted Mean (LS MeanOptPost) Standard Error (SE) 95% CI Lower 95% CI Upper MBI [Calculated] [Calculated] [Calculated] [Calculated] Traditional [Calculated] [Calculated] [Calculated] [Calculated]

The difference between the LS Means provides an intuitive measure of the adjusted treatment effect. For instance, if the LS Mean for MBI is $\text{[Calculated]}$ and for traditional is $\text{[Calculated]}$, the MBI group had a mean score that was $\text{[Calculated]}$ units higher than the traditional group, holding baseline scores at the population average. This adjusted mean difference is statistically tested by the t-test on the $B_{\text{MBI}}$ coefficient.

Conclusion and Research Implications (Task d)

7.1 Summary of Key Findings

The objective of determining the comparative effectiveness of the MBI versus traditional methods on $\text{OptPost}$, adjusted for $\text{OptPre}$ and $\text{Gender}$, was addressed using the additive ANCOVA model (M1). The model diagnostics indicated potential issues (e.g., maximum Cook’s Distance $\text{[Calculated]}$, potential non-normality $W = \text{[Calculated]}$), requiring cautionary interpretation of the results. Critically, the covariate $\text{OptPre}$ failed to demonstrate a significant linear relationship with the outcome $\text{OptPost}$ ($r \approx 0.048$, $p_{\text{OptPre}} = \text{[Calculated]}$). The central finding regarding the adjusted treatment effect of $\text{Method}$ yielded a non-significant result ($B_{\text{MBI}} = \text{[Calculated]}$, $p = \text{[Calculated]}$).

7.2 Definitive Conclusion on the Primary Research Question

Based on the statistical analysis of the additive ANCOVA model, there is no statistically significant evidence to conclude that the MBI method results in post-intervention optical scores ($\text{OptPost}$) that are different from those achieved by the traditional method, once the baseline scores ($\text{OptPre}$) and gender differences are controlled. The adjusted mean scores confirmed this numerical equivalence. The MBI group’s adjusted mean ($\text{[Calculated]}$) was not statistically distinct from the traditional group’s adjusted mean ($\text{[Calculated]}$).

7.3 Broader Implications and Methodological Limitations

The most substantial limitation of this study, affecting its ability to achieve high statistical power, stems from the near-zero correlation between the covariate ($\text{OptPre}$) and the outcome ($\text{OptPost}$). The goal of ANCOVA is to increase statistical power by partitioning the outcome variance into variance explained by the factor and variance explained by the covariate, thus reducing the residual error term. When the covariate is largely orthogonal to the outcome, the error term reduction is minimal. Consequently, the study derived little benefit from the ANCOVA framework over a simpler ANOVA, potentially masking any genuine, albeit small, treatment effects due to insufficient precision. Future studies utilizing these measurement instruments must first establish a stronger correlation between the pre- and post-measures to ensure the covariate effectively serves its function as a statistical control variable. Furthermore, interpretation is restricted by the small sample size ($N=37$), which limits the power of both the primary F-tests and the diagnostic tests, such as the Shapiro-Wilk test for normality.4 If the observed influence diagnostics (e.g., high maximum Cook’s D) indicate model instability, the conclusions might be heavily dependent on a few unrepresentative individuals.

7.4 Final Recommendation

Given the non-significant adjusted difference between the two groups, the current evidence does not support the adoption of the MBI method over the traditional method based on improved $\text{OptPost}$ scores alone. However, the substantial methodological constraint imposed by the weak covariate must be acknowledged. Future research should prioritize replication with a larger, more diverse sample, or employ measures where the pre- and post-scores exhibit a more robust linear relationship, which is prerequisite for a statistically efficient ANCOVA design.

R Markdown File Structure Output

ANCOVA Report: Evaluation of Intervention Method on Optical Scores

Data Loading and Preparation (Conceptual)

Load the dataset (Assuming ‘Optics.csv’ is available)

#library(tidyverse)


#library(car) # For Breusch-Pagan test


#library(rstatix) # For ANOVA/ANCOVA tests


#library(broom) # For diagnostics


#data <- read.csv("Optics.csv")


#data$method <- factor(data$method, levels = c("traditional", "MBI"))


#data$gender <- factor(data$gender)


#N = 37 observations used in analysis




### Preliminary Statistics and Covariate Assessment


# Bivariate Correlation Calculation (as provided in research material [3])
# r_OptPre_OptPost <- 0.048


#The Pearson correlation coefficient between $\text{OptPost}$ and $\text{OptPre}$ is approximately $r = 0.048$.3

#Model Selection: Testing Assumptions

Model 1: Additive ANCOVA (Final Model)

M1 <- lm(OptPost ~ method + OptPre + gender, data = data)

Model 2: HORS Test (Interaction: Method * OptPre)

M2 <- lm(OptPost ~ method * OptPre + gender, data = data)

Model 3: Factor Interaction Test (Interaction: Method * Gender)

M3 <- lm(OptPost ~ method * gender + OptPre, data = data)

ANOVA Comparison M2 vs M1 (Conceptual HORS Test Result)

anova_hors <- anova(M1, M2)

F_HORS <- “[Calculated]”

p_HORS <- “[Calculated]”

ANOVA Test of Method * Gender Interaction (Conceptual M3)

summary_M3 <- summary(M3)

F_M3_interaction <- “[Calculated]”

p_M3_interaction <- “[Calculated]”

Regression Diagnostics for Final Model (M1)

Normality of Residuals (Shapiro-Wilk) [3, 4]

shapiro_test <- shapiro.test(residuals(M1))

W_stat <- “[Calculated]”

p_shapiro <- “[Calculated]”

Homoscedasticity (Breusch-Pagan) [3]

bp_test <- bptest(M1)

Chisq_stat <- “[Calculated]”

p_breusch_pagan <- “[Calculated]”

Influential Points (Cook’s Distance) [3]

cooks_max <- max(cooks.distance(M1))

max_cooks_D <- “[Calculated]”

Predictive Accuracy (RMSE) [3]

rmse_value <- sqrt(mean(residuals(M1)^2))

RMSE <- “[Calculated]”

Interpretation of Fixed Effects (M1)

Final Model Summary (M1)

summary_M1 <- summary(M1)

R_squared <- “[Calculated]”

Adj_R_squared <- “[Calculated]”

The full parameter estimates for the final additive model (M1) are detailed in the report (Section 6.1).

Calculation of Adjusted Marginal Means (LS Means)

Calculate LS Means for Method, adjusted to mean(OptPre) = 38.33

lsmeans_method <- emmeans(M1, ~ method)

lsmeans_MBI <- “[Calculated]”

lsmeans_Traditional <- “[Calculated]”

CI_lower_MBI <- “[Calculated]”

CI_upper_MBI <- “[Calculated]”

CI_lower_Trad <- “[Calculated]”

CI_upper_Trad <- “[Calculated]”

Works cited Getting Started with Analysis of Covariance - UVA Library - The University of Virginia, accessed October 17, 2025, https://library.virginia.edu/data/articles/getting-started-with-analysis-of-covariance Analysis of Covariance (ANCOVA) using R - R-bloggers, accessed October 17, 2025, https://www.r-bloggers.com/2021/10/analysis-of-covariance-ancova-using-r/ Optics.csv ANCOVA in R: The Ultimate Practical Guide - Datanovia, accessed October 17, 2025, https://www.datanovia.com/en/lessons/ancova-in-r/

Conclusion:

ChatGPT and Gemini have a very different result based on their method. In Gemini’s best model analysis, it used additive model and to model of interactive term plus a distinct covaraite. This does not match what ChatGPT did in its analysis, and also does not align with want we learned in the class. With this questionable model selection, Gemini conclude that there is not siginificant adjusted difference between the group of BMI and the group of traditional method in the post score.

On the other hand, we evaluating assumptions, Gemina and ChatGPT share the same idea. They both test the assumption of normality, constant variance, outliner/influencial points, an predictive accuracy.

I would trust ChatGPT more for conducting this type of analysis. Because ChatGPT have more similar idea to what we learned in the class. And ChatGPT’s ability of coding is better than Gemini. At lease it can convert our conversation to r markdown, but Gemini can only convert our conversation to word. Meanwhile, ChatGPT’s analysis is more readable and reasonable, though Gemini’s work seems more rigurous and academic paper like. I think AI tools are our great pals when doing data anaylsis. Though they can improve our productivity by their ability of generating code and logical analysis, we still need to supervise their performance and adjust their behavior by human language if they did anything wrong, or simply not match what we want. I think ChatGPT can do this job better than Gemini. Thus, I would trust ChatGPT.

Q7 laod data

sup <- read.csv("../HW 2/Support.2.csv")

primary objective: identify factors associated with the Mean Arterial Pressure

EDA
1. Data Cleaning

In this part, we will remove NA from the original dataset.

sup <- na.omit(sup)
sup <- sup %>% mutate(MAP = case_when(meanbp <= 100 ~ "Normal blood pressure", 
                                TRUE ~ "High blood pressure"))

1. Data handling

In this step, we will bring the data to the required form for data analysis. Mainly, we will factorize the categorical data.

# converting categorical data to factors 
sup <- sup %>%
  mutate(
    sex = factor(sex), 
    race = factor(race),
    dzgroup = factor(dzgroup),
    dzclass = factor(dzclass),
    income = factor(income), 
    death = factor(death), 
    hospdead = factor(hospdead), 
    sfdm2 = factor(sfdm2), 
    MAP = factor(MAP)
  )

1. Data validation

We will fist check the elements of the combined data by printing out the column name of data. This step can make sure that we are working on the correct cleaned data set. Then, we will count the missing values for each column so that we will have a clear picture of the available size of our data set. Then, we will print out the summary of all categorical data and numerical data to each whether there are some nonsense values (e.g. outlier for numerical variable).

# checking the elements of the data 
sup %>% head(6)

cat("Columns of the data include:", colnames(sup))

## Columns of the data include: age death sex hospdead slos d.time dzgroup dzclass num.co edu income scoma charges totcst totmcst avtisst race meanbp wblc hrt resp temp pafi alb bili crea sod ph sfdm2 adlsc MAP

# removing the potential duplicates
sup <- sup %>% distinct()

# handling missing values
mis <- sup %>% summarize(across(everything(), ~sum(is.na(.))))
print(mis)

##   age death sex hospdead slos d.time dzgroup dzclass num.co edu income scoma
## 1   0     0   0        0    0      0       0       0      0   0      0     0
##   charges totcst totmcst avtisst race meanbp wblc hrt resp temp pafi alb bili
## 1       0      0       0       0    0      0    0   0    0    0    0   0    0
##   crea sod ph sfdm2 adlsc MAP
## 1    0   0  0     0     0   0

# checking nonsense value for categorical values
# finding categorical variables 
cat_vars <- sup %>% select(where(~ is.character(.) | is.factor(.))) %>% names()
for (v in cat_vars) {# showing all unique categories for each variable
  print(table(sup[[v]], useNA = "ifany"))
}

## 
##   0   1 
##  64 119 
## 
## female   male 
##     93     90 
## 
##   0   1 
## 132  51 
## 
## ARF/MOSF w/Sepsis               CHF         Cirrhosis      Colon Cancer 
##                87                21                 6                 2 
##              Coma              COPD       Lung Cancer      MOSF w/Malig 
##                15                27                 7                18 
## 
##           ARF/MOSF             Cancer               Coma COPD/CHF/Cirrhosis 
##                105                  9                 15                 54 
## 
##      >$50k   $11-$25k   $25-$50k under $11k 
##          9         47         22        105 
## 
##    asian    black hispanic    other    white 
##        1       30        4        2      146 
## 
##    <2 mo. follow-up adl>=4 (>=5 if sur)       Coma or Intub no(M2 and SIP pres) 
##                  70                  28                   1                  70 
##             SIP>=30 
##                  14 
## 
##   High blood pressure Normal blood pressure 
##                    69                   114

sup %>% # counting number of unique levels for each
  summarise(across(all_of(cat_vars), ~ n_distinct(.))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_levels")

# finding suspicious values e.g. catogories with very long number of counts
for (v in cat_vars) {
  cat("\n categorical variable", v, ":\n")
  print(table(sup[[v]]), decreasing = FALSE)
}

## 
##  categorical variable death :
## 
##   0   1 
##  64 119 
## 
##  categorical variable sex :
## 
## female   male 
##     93     90 
## 
##  categorical variable hospdead :
## 
##   0   1 
## 132  51 
## 
##  categorical variable dzgroup :
## 
## ARF/MOSF w/Sepsis               CHF         Cirrhosis      Colon Cancer 
##                87                21                 6                 2 
##              Coma              COPD       Lung Cancer      MOSF w/Malig 
##                15                27                 7                18 
## 
##  categorical variable dzclass :
## 
##           ARF/MOSF             Cancer               Coma COPD/CHF/Cirrhosis 
##                105                  9                 15                 54 
## 
##  categorical variable income :
## 
##      >$50k   $11-$25k   $25-$50k under $11k 
##          9         47         22        105 
## 
##  categorical variable race :
## 
##    asian    black hispanic    other    white 
##        1       30        4        2      146 
## 
##  categorical variable sfdm2 :
## 
##    <2 mo. follow-up adl>=4 (>=5 if sur)       Coma or Intub no(M2 and SIP pres) 
##                  70                  28                   1                  70 
##             SIP>=30 
##                  14 
## 
##  categorical variable MAP :
## 
##   High blood pressure Normal blood pressure 
##                    69                   114

#checking nonsense value for quantitative variables
# finding quantitative variables
num_vars <- sup %>% select(where(is.numeric)) %>% names()
# summarizing the min, max, median, and mean of each variable
sup %>%
  summarise(across(all_of(num_vars),
                   list(min = ~min(., na.rm = TRUE),
                        max = ~max(., na.rm = TRUE), 
                        median = ~median(., na.rm = TRUE),
                        mean = ~mean(., na.rm = TRUE))))

1. Identify

# finding variable types
var_info <- tibble(
  variable = names(sup),
  type = map_chr(sup, ~ if (is.numeric(.x)) "Quantative" else "Categorical")
)

# finding the scale of measurement
var_info <- var_info %>%
  mutate(
    scale = case_when(
      variable %in% c("sex","race","dzgroup","dzclass","hospdead","death") ~ "Nominal",
      variable %in% c("income","edu", "MAP") ~ "Ordinal",
      variable %in% c("temp") ~ "Interval",
      TRUE ~ "Ratio"
    ),
    
# showing the variable type (outcome or covariate) 
    role = case_when(
      variable %in% c("MAP") ~ "Outcome",
      TRUE ~ "Covariate"
    )
  )

print(var_info)

## # A tibble: 31 × 4
##    variable type        scale   role     
##    <chr>    <chr>       <chr>   <chr>    
##  1 age      Quantative  Ratio   Covariate
##  2 death    Categorical Nominal Covariate
##  3 sex      Categorical Nominal Covariate
##  4 hospdead Categorical Nominal Covariate
##  5 slos     Quantative  Ratio   Covariate
##  6 d.time   Quantative  Ratio   Covariate
##  7 dzgroup  Categorical Nominal Covariate
##  8 dzclass  Categorical Nominal Covariate
##  9 num.co   Quantative  Ratio   Covariate
## 10 edu      Quantative  Ordinal Covariate
## # ℹ 21 more rows

1. Data summary

#summarizing numerical variables
num_summary <- sup %>%
    select(all_of(num_vars)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarise(
      mean = mean(value, na.rm = TRUE),
      sd = sd(value, na.rm = TRUE),
      min = min(value, na.rm = TRUE),
      first_quantile = quantile(value, 0.25, na.rm = TRUE),
      median = median(value, na.rm = TRUE),
      third_quantile = quantile(value, 0.75, na.rm = TRUE),
      max = max(value, na.rm = TRUE),
      .groups = "drop")
print(num_summary)

## # A tibble: 22 × 8
##    variable    mean      sd     min first_quantile  median third_quantile    max
##    <chr>      <dbl>   <dbl>   <dbl>          <dbl>   <dbl>          <dbl>  <dbl>
##  1 adlsc     2.28e0 2.42e+0 0                0     1   e+0           4    7   e0
##  2 age       5.96e1 1.63e+1 2.06e+1         45.6   6.23e+1          72.2  9.12e1
##  3 alb       2.83e0 7.48e-1 1.10e+0          2.30  2.80e+0           3.30 4.90e0
##  4 avtisst   2.73e1 1.48e+1 3   e+0         15.4   2.4 e+1          38.6  6.4 e1
##  5 bili      1.79e0 3.32e+0 2.00e-1          0.5   9.00e-1           1.75 3.5 e1
##  6 charges   4.88e4 6.26e+4 2.17e+3      11802.    2.85e+4       60824.   5.38e5
##  7 crea      1.72e0 1.74e+0 3.00e-1          0.800 1.20e+0           1.85 1.12e1
##  8 d.time    4.63e2 5.23e+2 3   e+0         22.5   2.85e+2         760.   2.01e3
##  9 edu       1.13e1 3.11e+0 0               10     1.2 e+1          12    2.4 e1
## 10 hrt       1.04e2 3.70e+1 0               72     1.1 e+2         126    3   e2
## # ℹ 12 more rows

#summarizing categorical variables
cat_summary <- map_df(cat_vars, function(v) {
    x <- sup[[v]]
    x_lab <- as.character(x)
    tibble(variable = v, level = x_lab) %>%
      count(variable, level, name = "n")
  })
print(cat_summary)

## # A tibble: 34 × 3
##    variable level                 n
##    <chr>    <chr>             <int>
##  1 death    0                    64
##  2 death    1                   119
##  3 sex      female               93
##  4 sex      male                 90
##  5 hospdead 0                   132
##  6 hospdead 1                    51
##  7 dzgroup  ARF/MOSF w/Sepsis    87
##  8 dzgroup  CHF                  21
##  9 dzgroup  COPD                 27
## 10 dzgroup  Cirrhosis             6
## # ℹ 24 more rows

1. Data visualization

We will draw the graph for univariate distributions of all the variables. For quantitative variable, we will draw histgram. And for categorical data, we will draw par chart.

#drawing histogram for numerical variables
for (v in num_vars) {
  bp <- ggplot(sup, aes(x = .data[[v]])) +
    geom_histogram() +
    labs(title = paste(v), x = v)
  print(bp)
}

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#drawing bar charts for categorical variables
for (v in cat_vars) {
    df_plot <- tibble(level = as.character(sup[[v]])) %>% count(level, name = "n")
    p <- ggplot(df_plot, aes(x = reorder(level, -n), y = n)) +
        geom_col() +
        labs(title = paste("Bar chart of", v), x = v, y = "Count") 
    print(p)
  }

Then, we will draw the graph for bivariate distributions of all the variables.

association between method and OptPost

#drawing boxplot for bivariate distribution of MAP with numerical variables
for (v in num_vars) {
  bp <- ggplot(sup, aes(x = MAP, y = .data[[v]])) + geom_boxplot()
  print(bp)
}

#summarize table for bivariate distribution of MAP with categorical variables
tbl <- sup %>%
  select(MAP, all_of(cat_vars)) %>%
  tbl_summary(
    by = MAP,                       # group by MAP
    type = all_categorical() ~ "categorical",
    statistic = all_categorical() ~ "{n} ({p}%)", 
    missing = "no"                  
  ) %>%
  bold_labels() %>%
  add_overall() %>%
  bold_labels()
tbl

Characteristic	Overall N = 183¹	High blood pressure N = 69¹	Normal blood pressure N = 114¹
death
0	64 (35%)	30 (43%)	34 (30%)
1	119 (65%)	39 (57%)	80 (70%)
sex
female	93 (51%)	37 (54%)	56 (49%)
male	90 (49%)	32 (46%)	58 (51%)
hospdead
0	132 (72%)	49 (71%)	83 (73%)
1	51 (28%)	20 (29%)	31 (27%)
dzgroup
ARF/MOSF w/Sepsis	87 (48%)	34 (49%)	53 (46%)
CHF	21 (11%)	4 (5.8%)	17 (15%)
Cirrhosis	6 (3.3%)	1 (1.4%)	5 (4.4%)
Colon Cancer	2 (1.1%)	0 (0%)	2 (1.8%)
Coma	15 (8.2%)	9 (13%)	6 (5.3%)
COPD	27 (15%)	12 (17%)	15 (13%)
Lung Cancer	7 (3.8%)	2 (2.9%)	5 (4.4%)
MOSF w/Malig	18 (9.8%)	7 (10%)	11 (9.6%)
dzclass
ARF/MOSF	105 (57%)	41 (59%)	64 (56%)
Cancer	9 (4.9%)	2 (2.9%)	7 (6.1%)
Coma	15 (8.2%)	9 (13%)	6 (5.3%)
COPD/CHF/Cirrhosis	54 (30%)	17 (25%)	37 (32%)
income
>$50k	9 (4.9%)	3 (4.3%)	6 (5.3%)
$11-$25k	47 (26%)	18 (26%)	29 (25%)
$25-$50k	22 (12%)	9 (13%)	13 (11%)
under $11k	105 (57%)	39 (57%)	66 (58%)
race
asian	1 (0.5%)	0 (0%)	1 (0.9%)
black	30 (16%)	15 (22%)	15 (13%)
hispanic	4 (2.2%)	2 (2.9%)	2 (1.8%)
other	2 (1.1%)	1 (1.4%)	1 (0.9%)
white	146 (80%)	51 (74%)	95 (83%)
sfdm2
<2 mo. follow-up	70 (38%)	25 (36%)	45 (39%)
adl>=4 (>=5 if sur)	28 (15%)	13 (19%)	15 (13%)
Coma or Intub	1 (0.5%)	1 (1.4%)	0 (0%)
no(M2 and SIP pres)	70 (38%)	26 (38%)	44 (39%)
SIP>=30	14 (7.7%)	4 (5.8%)	10 (8.8%)
¹ n (%)

Baased on results from my EDA, I would use logistic regression because MAP is a binary outcome.

#full model
full_gm <- glm(MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + num.co + 
                    edu + income + scoma + charges + totcst + totmcst + avtisst + 
                    race + wblc + hrt + resp + temp + pafi + alb + bili + crea +
                    sod + ph + sfdm2 + adlsc, data = sup, family = binomial(link = "logit"))
summary(full_gm)

## 
## Call:
## glm(formula = MAP ~ age + death + sex + hospdead + slos + d.time + 
##     dzgroup + num.co + edu + income + scoma + charges + totcst + 
##     totmcst + avtisst + race + wblc + hrt + resp + temp + pafi + 
##     alb + bili + crea + sod + ph + sfdm2 + adlsc, family = binomial(link = "logit"), 
##     data = sup)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)  
## (Intercept)              -8.670e+00  3.956e+03  -0.002   0.9983  
## age                      -2.719e-04  1.299e-02  -0.021   0.9833  
## death1                    1.233e+00  6.370e-01   1.935   0.0530 .
## sexmale                  -1.892e-01  4.238e-01  -0.446   0.6553  
## hospdead1                -4.879e-01  8.304e-01  -0.588   0.5568  
## slos                     -2.083e-02  2.547e-02  -0.818   0.4135  
## d.time                    3.691e-04  6.335e-04   0.583   0.5602  
## dzgroupCHF                9.060e-01  7.661e-01   1.183   0.2370  
## dzgroupCirrhosis          1.818e+01  1.588e+03   0.011   0.9909  
## dzgroupColon Cancer       1.670e+01  2.775e+03   0.006   0.9952  
## dzgroupComa              -2.198e+00  1.178e+00  -1.866   0.0620 .
## dzgroupCOPD              -5.915e-01  6.941e-01  -0.852   0.3941  
## dzgroupLung Cancer        1.167e+00  1.247e+00   0.936   0.3493  
## dzgroupMOSF w/Malig      -3.664e-01  6.688e-01  -0.548   0.5838  
## num.co                    1.120e-01  1.637e-01   0.685   0.4936  
## edu                      -8.256e-02  8.278e-02  -0.997   0.3186  
## income$11-$25k           -2.708e-02  1.116e+00  -0.024   0.9806  
## income$25-$50k           -5.779e-01  1.228e+00  -0.471   0.6379  
## incomeunder $11k         -4.614e-01  1.177e+00  -0.392   0.6950  
## scoma                     1.068e-02  1.232e-02   0.866   0.3863  
## charges                   2.437e-05  2.016e-05   1.209   0.2266  
## totcst                   -6.881e-05  3.331e-05  -2.066   0.0388 *
## totmcst                   2.329e-05  2.109e-05   1.104   0.2695  
## avtisst                   2.842e-02  2.544e-02   1.117   0.2640  
## raceblack                -5.605e+00  3.956e+03  -0.001   0.9989  
## racehispanic             -3.983e+00  3.956e+03  -0.001   0.9992  
## raceother                -8.176e+00  3.956e+03  -0.002   0.9984  
## racewhite                -4.580e+00  3.956e+03  -0.001   0.9991  
## wblc                     -8.691e-03  2.692e-02  -0.323   0.7469  
## hrt                      -5.437e-03  5.554e-03  -0.979   0.3277  
## resp                     -2.202e-02  1.930e-02  -1.141   0.2539  
## temp                      3.126e-01  1.733e-01   1.804   0.0713 .
## pafi                      1.964e-03  2.007e-03   0.978   0.3279  
## alb                       1.411e-01  3.754e-01   0.376   0.7070  
## bili                     -9.088e-02  5.995e-02  -1.516   0.1295  
## crea                      1.080e-01  1.326e-01   0.814   0.4157  
## sod                      -2.272e-04  3.385e-02  -0.007   0.9946  
## ph                        4.571e-01  2.659e+00   0.172   0.8635  
## sfdm2adl>=4 (>=5 if sur) -1.444e-01  9.078e-01  -0.159   0.8736  
## sfdm2Coma or Intub       -3.914e+01  4.263e+03  -0.009   0.9927  
## sfdm2no(M2 and SIP pres) -5.673e-01  8.339e-01  -0.680   0.4963  
## sfdm2SIP>=30              1.074e+00  1.096e+00   0.980   0.3270  
## adlsc                    -1.928e-01  9.103e-02  -2.118   0.0342 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.51  on 182  degrees of freedom
## Residual deviance: 189.32  on 140  degrees of freedom
## AIC: 275.32
## 
## Number of Fisher Scoring iterations: 16

We can find that most of the coefficients of full model are not statistically significant. Thus, we wil use backward procedure to find the best model.

best_gm <- step(full_gm, direction = "backward")

## Start:  AIC=275.32
## MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + 
##     num.co + edu + income + scoma + charges + totcst + totmcst + 
##     avtisst + race + wblc + hrt + resp + temp + pafi + alb + 
##     bili + crea + sod + ph + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - income    3   190.25 270.25
## - race      4   194.67 272.67
## - sod       1   189.32 273.32
## - age       1   189.32 273.32
## - ph        1   189.35 273.35
## - wblc      1   189.43 273.43
## - alb       1   189.47 273.46
## - sex       1   189.53 273.52
## - d.time    1   189.67 273.67
## - hospdead  1   189.67 273.67
## - num.co    1   189.80 273.80
## - slos      1   190.00 274.00
## - crea      1   190.02 274.02
## - scoma     1   190.10 274.10
## - pafi      1   190.30 274.30
## - hrt       1   190.32 274.32
## - edu       1   190.34 274.34
## - avtisst   1   190.59 274.59
## - resp      1   190.63 274.63
## - totmcst   1   190.66 274.66
## - charges   1   190.76 274.76
## <none>          189.32 275.32
## - bili      1   191.46 275.46
## - temp      1   192.73 276.73
## - death     1   193.18 277.18
## - adlsc     1   193.96 277.96
## - totcst    1   194.03 278.03
## - dzgroup   7   206.54 278.54
## - sfdm2     4   202.83 280.83
## 
## Step:  AIC=270.25
## MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + 
##     num.co + edu + scoma + charges + totcst + totmcst + avtisst + 
##     race + wblc + hrt + resp + temp + pafi + alb + bili + crea + 
##     sod + ph + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - race      4   195.59 267.59
## - sod       1   190.25 268.25
## - wblc      1   190.26 268.26
## - age       1   190.28 268.28
## - alb       1   190.32 268.32
## - sex       1   190.35 268.35
## - ph        1   190.36 268.36
## - d.time    1   190.44 268.44
## - hospdead  1   190.53 268.52
## - num.co    1   190.61 268.61
## - pafi      1   190.86 268.86
## - edu       1   190.93 268.93
## - slos      1   190.95 268.95
## - scoma     1   190.98 268.98
## - crea      1   191.01 269.01
## - avtisst   1   191.06 269.06
## - hrt       1   191.20 269.20
## - totmcst   1   191.35 269.35
## - resp      1   191.62 269.62
## - charges   1   191.89 269.89
## <none>          190.25 270.25
## - bili      1   192.41 270.41
## - death     1   193.66 271.66
## - temp      1   193.97 271.97
## - adlsc     1   194.80 272.80
## - totcst    1   194.98 272.98
## - dzgroup   7   207.24 273.24
## - sfdm2     4   203.32 275.32
## 
## Step:  AIC=267.59
## MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + 
##     num.co + edu + scoma + charges + totcst + totmcst + avtisst + 
##     wblc + hrt + resp + temp + pafi + alb + bili + crea + sod + 
##     ph + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - wblc      1   195.59 265.59
## - sod       1   195.60 265.60
## - sex       1   195.61 265.61
## - alb       1   195.63 265.63
## - ph        1   195.63 265.63
## - num.co    1   195.68 265.68
## - scoma     1   195.69 265.69
## - d.time    1   195.73 265.73
## - age       1   195.78 265.78
## - crea      1   195.83 265.83
## - pafi      1   195.89 265.89
## - edu       1   195.92 265.92
## - slos      1   195.95 265.95
## - hospdead  1   196.18 266.18
## - avtisst   1   196.25 266.25
## - charges   1   196.44 266.44
## - totmcst   1   196.78 266.78
## - hrt       1   196.83 266.83
## - resp      1   197.09 267.09
## - bili      1   197.10 267.10
## <none>          195.59 267.59
## - dzgroup   7   211.10 269.11
## - temp      1   199.12 269.12
## - death     1   199.40 269.40
## - adlsc     1   200.23 270.23
## - totcst    1   201.22 271.22
## - sfdm2     4   208.40 272.40
## 
## Step:  AIC=265.59
## MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + 
##     num.co + edu + scoma + charges + totcst + totmcst + avtisst + 
##     hrt + resp + temp + pafi + alb + bili + crea + sod + ph + 
##     sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - sod       1   195.60 263.60
## - sex       1   195.61 263.61
## - alb       1   195.63 263.63
## - ph        1   195.64 263.64
## - num.co    1   195.68 263.68
## - scoma     1   195.69 263.69
## - d.time    1   195.73 263.73
## - age       1   195.78 263.79
## - crea      1   195.83 263.83
## - pafi      1   195.89 263.89
## - edu       1   195.92 263.92
## - slos      1   195.96 263.96
## - hospdead  1   196.18 264.18
## - avtisst   1   196.25 264.25
## - charges   1   196.45 264.45
## - totmcst   1   196.79 264.79
## - hrt       1   196.91 264.91
## - resp      1   197.09 265.09
## - bili      1   197.10 265.10
## <none>          195.59 265.59
## - temp      1   199.14 267.14
## - dzgroup   7   211.19 267.19
## - death     1   199.40 267.40
## - adlsc     1   200.24 268.24
## - totcst    1   201.22 269.22
## - sfdm2     4   208.40 270.40
## 
## Step:  AIC=263.6
## MAP ~ age + death + sex + hospdead + slos + d.time + dzgroup + 
##     num.co + edu + scoma + charges + totcst + totmcst + avtisst + 
##     hrt + resp + temp + pafi + alb + bili + crea + ph + sfdm2 + 
##     adlsc
## 
##            Df Deviance    AIC
## - sex       1   195.63 261.63
## - alb       1   195.64 261.64
## - ph        1   195.66 261.66
## - num.co    1   195.69 261.69
## - scoma     1   195.70 261.70
## - d.time    1   195.76 261.76
## - age       1   195.79 261.80
## - crea      1   195.85 261.85
## - pafi      1   195.90 261.90
## - edu       1   195.92 261.92
## - slos      1   195.97 261.97
## - hospdead  1   196.19 262.19
## - avtisst   1   196.29 262.29
## - charges   1   196.45 262.45
## - totmcst   1   196.79 262.80
## - hrt       1   196.94 262.94
## - resp      1   197.10 263.10
## - bili      1   197.12 263.12
## <none>          195.60 263.60
## - dzgroup   7   211.19 265.19
## - temp      1   199.29 265.30
## - death     1   199.59 265.59
## - adlsc     1   200.25 266.25
## - totcst    1   201.22 267.22
## - sfdm2     4   208.41 268.40
## 
## Step:  AIC=261.63
## MAP ~ age + death + hospdead + slos + d.time + dzgroup + num.co + 
##     edu + scoma + charges + totcst + totmcst + avtisst + hrt + 
##     resp + temp + pafi + alb + bili + crea + ph + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - alb       1   195.67 259.67
## - ph        1   195.69 259.69
## - scoma     1   195.72 259.72
## - num.co    1   195.73 259.73
## - d.time    1   195.77 259.77
## - age       1   195.83 259.83
## - crea      1   195.89 259.89
## - pafi      1   195.94 259.94
## - edu       1   195.95 259.95
## - slos      1   196.01 260.01
## - hospdead  1   196.21 260.20
## - avtisst   1   196.32 260.32
## - charges   1   196.48 260.48
## - totmcst   1   196.88 260.88
## - hrt       1   196.95 260.95
## - resp      1   197.11 261.11
## - bili      1   197.27 261.27
## <none>          195.63 261.63
## - dzgroup   7   211.25 263.25
## - temp      1   199.30 263.30
## - death     1   199.59 263.59
## - adlsc     1   200.25 264.25
## - totcst    1   201.24 265.24
## - sfdm2     4   208.43 266.43
## 
## Step:  AIC=259.67
## MAP ~ age + death + hospdead + slos + d.time + dzgroup + num.co + 
##     edu + scoma + charges + totcst + totmcst + avtisst + hrt + 
##     resp + temp + pafi + bili + crea + ph + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - ph        1   195.72 257.72
## - scoma     1   195.75 257.75
## - num.co    1   195.77 257.77
## - d.time    1   195.80 257.80
## - age       1   195.87 257.87
## - crea      1   195.92 257.92
## - pafi      1   196.00 258.00
## - edu       1   196.00 258.00
## - slos      1   196.04 258.04
## - hospdead  1   196.21 258.21
## - charges   1   196.48 258.48
## - avtisst   1   196.62 258.62
## - totmcst   1   196.95 258.95
## - hrt       1   197.01 259.01
## - resp      1   197.12 259.12
## - bili      1   197.30 259.30
## <none>          195.67 259.67
## - temp      1   199.32 261.32
## - death     1   199.60 261.60
## - dzgroup   7   211.75 261.75
## - adlsc     1   200.25 262.25
## - totcst    1   201.24 263.24
## - sfdm2     4   208.58 264.58
## 
## Step:  AIC=257.72
## MAP ~ age + death + hospdead + slos + d.time + dzgroup + num.co + 
##     edu + scoma + charges + totcst + totmcst + avtisst + hrt + 
##     resp + temp + pafi + bili + crea + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - scoma     1   195.79 255.79
## - num.co    1   195.84 255.84
## - d.time    1   195.86 255.86
## - age       1   195.91 255.91
## - crea      1   195.93 255.93
## - pafi      1   196.04 256.04
## - edu       1   196.05 256.05
## - slos      1   196.10 256.10
## - hospdead  1   196.38 256.38
## - charges   1   196.56 256.56
## - avtisst   1   196.62 256.62
## - totmcst   1   196.99 256.99
## - hrt       1   197.05 257.05
## - resp      1   197.13 257.13
## - bili      1   197.33 257.33
## <none>          195.72 257.72
## - temp      1   199.38 259.38
## - death     1   199.74 259.74
## - dzgroup   7   211.91 259.91
## - adlsc     1   200.28 260.28
## - totcst    1   201.24 261.24
## - sfdm2     4   208.58 262.58
## 
## Step:  AIC=255.8
## MAP ~ age + death + hospdead + slos + d.time + dzgroup + num.co + 
##     edu + charges + totcst + totmcst + avtisst + hrt + resp + 
##     temp + pafi + bili + crea + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - num.co    1   195.92 253.92
## - d.time    1   195.92 253.92
## - age       1   195.99 253.99
## - crea      1   196.01 254.01
## - pafi      1   196.11 254.11
## - edu       1   196.12 254.12
## - slos      1   196.21 254.21
## - hospdead  1   196.44 254.44
## - charges   1   196.65 254.65
## - avtisst   1   196.78 254.78
## - totmcst   1   197.08 255.08
## - hrt       1   197.18 255.18
## - resp      1   197.19 255.19
## - bili      1   197.34 255.34
## <none>          195.79 255.79
## - temp      1   199.52 257.52
## - death     1   199.76 257.76
## - adlsc     1   200.29 258.29
## - dzgroup   7   213.27 259.27
## - totcst    1   201.44 259.44
## - sfdm2     4   208.73 260.73
## 
## Step:  AIC=253.92
## MAP ~ age + death + hospdead + slos + d.time + dzgroup + edu + 
##     charges + totcst + totmcst + avtisst + hrt + resp + temp + 
##     pafi + bili + crea + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - d.time    1   196.02 252.02
## - age       1   196.10 252.10
## - crea      1   196.22 252.22
## - pafi      1   196.27 252.27
## - edu       1   196.29 252.29
## - slos      1   196.30 252.30
## - hospdead  1   196.50 252.50
## - charges   1   196.89 252.89
## - avtisst   1   196.92 252.92
## - totmcst   1   197.13 253.13
## - resp      1   197.22 253.22
## - bili      1   197.43 253.43
## - hrt       1   197.46 253.46
## <none>          195.92 253.92
## - temp      1   199.52 255.52
## - death     1   199.82 255.82
## - adlsc     1   200.34 256.34
## - totcst    1   201.78 257.78
## - dzgroup   7   214.59 258.59
## - sfdm2     4   208.97 258.97
## 
## Step:  AIC=252.02
## MAP ~ age + death + hospdead + slos + dzgroup + edu + charges + 
##     totcst + totmcst + avtisst + hrt + resp + temp + pafi + bili + 
##     crea + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - age       1   196.18 250.18
## - crea      1   196.34 250.34
## - pafi      1   196.35 250.35
## - slos      1   196.42 250.42
## - edu       1   196.44 250.44
## - hospdead  1   196.57 250.57
## - charges   1   196.96 250.96
## - avtisst   1   196.96 250.96
## - totmcst   1   197.17 251.17
## - resp      1   197.32 251.32
## - bili      1   197.50 251.50
## - hrt       1   197.53 251.53
## <none>          196.02 252.02
## - temp      1   199.74 253.74
## - adlsc     1   200.55 254.55
## - death     1   200.64 254.64
## - totcst    1   201.78 255.78
## - dzgroup   7   214.80 256.80
## - sfdm2     4   209.06 257.06
## 
## Step:  AIC=250.18
## MAP ~ death + hospdead + slos + dzgroup + edu + charges + totcst + 
##     totmcst + avtisst + hrt + resp + temp + pafi + bili + crea + 
##     sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - crea      1   196.47 248.47
## - pafi      1   196.49 248.49
## - slos      1   196.57 248.57
## - edu       1   196.66 248.66
## - hospdead  1   196.73 248.73
## - avtisst   1   197.09 249.09
## - charges   1   197.10 249.10
## - totmcst   1   197.29 249.29
## - resp      1   197.50 249.50
## - bili      1   197.64 249.64
## - hrt       1   197.97 249.97
## <none>          196.18 250.18
## - temp      1   199.79 251.79
## - adlsc     1   200.85 252.85
## - death     1   200.96 252.96
## - totcst    1   201.82 253.82
## - dzgroup   7   214.83 254.83
## - sfdm2     4   209.22 255.22
## 
## Step:  AIC=248.47
## MAP ~ death + hospdead + slos + dzgroup + edu + charges + totcst + 
##     totmcst + avtisst + hrt + resp + temp + pafi + bili + sfdm2 + 
##     adlsc
## 
##            Df Deviance    AIC
## - pafi      1   196.83 246.83
## - slos      1   196.86 246.86
## - hospdead  1   196.93 246.93
## - edu       1   197.06 247.06
## - avtisst   1   197.25 247.25
## - charges   1   197.42 247.42
## - totmcst   1   197.78 247.78
## - bili      1   198.00 248.00
## - resp      1   198.10 248.10
## - hrt       1   198.24 248.24
## <none>          196.47 248.47
## - temp      1   200.21 250.21
## - adlsc     1   201.37 251.37
## - death     1   201.68 251.68
## - totcst    1   202.70 252.70
## - dzgroup   7   214.95 252.95
## - sfdm2     4   209.57 253.57
## 
## Step:  AIC=246.83
## MAP ~ death + hospdead + slos + dzgroup + edu + charges + totcst + 
##     totmcst + avtisst + hrt + resp + temp + bili + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - slos      1   197.23 245.23
## - hospdead  1   197.31 245.31
## - edu       1   197.34 245.34
## - avtisst   1   197.35 245.35
## - charges   1   197.72 245.72
## - totmcst   1   198.26 246.26
## - bili      1   198.31 246.31
## - resp      1   198.45 246.45
## - hrt       1   198.46 246.46
## <none>          196.83 246.83
## - temp      1   200.39 248.39
## - adlsc     1   201.49 249.49
## - death     1   202.51 250.51
## - totcst    1   203.05 251.05
## - sfdm2     4   209.84 251.84
## - dzgroup   7   216.49 252.49
## 
## Step:  AIC=245.23
## MAP ~ death + hospdead + dzgroup + edu + charges + totcst + totmcst + 
##     avtisst + hrt + resp + temp + bili + sfdm2 + adlsc
## 
##            Df Deviance    AIC
## - hospdead  1   197.64 243.64
## - edu       1   197.71 243.71
## - avtisst   1   197.80 243.80
## - totmcst   1   198.30 244.30
## - charges   1   198.30 244.30
## - bili      1   198.57 244.57
## - hrt       1   198.74 244.74
## - resp      1   199.19 245.19
## <none>          197.23 245.23
## - temp      1   200.54 246.54
## - adlsc     1   201.90 247.90
## - death     1   203.00 249.00
## - sfdm2     4   210.17 250.17
## - totcst    1   204.77 250.77
## - dzgroup   7   217.42 251.42
## 
## Step:  AIC=243.64
## MAP ~ death + dzgroup + edu + charges + totcst + totmcst + avtisst + 
##     hrt + resp + temp + bili + sfdm2 + adlsc
## 
##           Df Deviance    AIC
## - avtisst  1   197.98 241.98
## - edu      1   198.25 242.25
## - bili     1   198.75 242.75
## - charges  1   198.78 242.78
## - totmcst  1   198.81 242.81
## - hrt      1   199.11 243.11
## - resp     1   199.55 243.55
## <none>         197.64 243.64
## - temp     1   200.89 244.89
## - adlsc    1   202.40 246.40
## - death    1   203.33 247.33
## - sfdm2    4   210.31 248.31
## - totcst   1   205.66 249.66
## - dzgroup  7   218.02 250.02
## 
## Step:  AIC=241.98
## MAP ~ death + dzgroup + edu + charges + totcst + totmcst + hrt + 
##     resp + temp + bili + sfdm2 + adlsc
## 
##           Df Deviance    AIC
## - edu      1   198.69 240.69
## - bili     1   198.97 240.97
## - charges  1   199.12 241.12
## - totmcst  1   199.31 241.31
## - hrt      1   199.37 241.37
## <none>         197.98 241.98
## - resp     1   200.16 242.16
## - temp     1   201.37 243.37
## - adlsc    1   202.84 244.84
## - death    1   203.67 245.67
## - sfdm2    4   210.55 246.55
## - totcst   1   205.79 247.79
## - dzgroup  7   218.06 248.06
## 
## Step:  AIC=240.69
## MAP ~ death + dzgroup + charges + totcst + totmcst + hrt + resp + 
##     temp + bili + sfdm2 + adlsc
## 
##           Df Deviance    AIC
## - charges  1   199.67 239.67
## - totmcst  1   199.90 239.90
## - bili     1   199.92 239.92
## - hrt      1   199.92 239.92
## <none>         198.69 240.69
## - resp     1   201.17 241.17
## - temp     1   201.79 241.79
## - adlsc    1   203.63 243.63
## - death    1   204.03 244.03
## - sfdm2    4   210.97 244.97
## - totcst   1   206.07 246.07
## - dzgroup  7   218.47 246.47
## 
## Step:  AIC=239.67
## MAP ~ death + dzgroup + totcst + totmcst + hrt + resp + temp + 
##     bili + sfdm2 + adlsc
## 
##           Df Deviance    AIC
## - bili     1   200.95 238.95
## - hrt      1   201.03 239.03
## - totmcst  1   201.60 239.60
## - resp     1   201.65 239.65
## <none>         199.67 239.67
## - temp     1   202.64 240.64
## - adlsc    1   204.65 242.65
## - death    1   205.47 243.47
## - sfdm2    4   211.65 243.65
## - totcst   1   206.25 244.25
## - dzgroup  7   219.25 245.25
## 
## Step:  AIC=238.95
## MAP ~ death + dzgroup + totcst + totmcst + hrt + resp + temp + 
##     sfdm2 + adlsc
## 
##           Df Deviance    AIC
## - totmcst  1   202.27 238.27
## - hrt      1   202.33 238.33
## - resp     1   202.61 238.61
## <none>         200.95 238.95
## - temp     1   204.16 240.16
## - adlsc    1   205.79 241.79
## - sfdm2    4   212.57 242.57
## - death    1   206.65 242.65
## - totcst   1   206.74 242.74
## - dzgroup  7   219.76 243.76
## 
## Step:  AIC=238.27
## MAP ~ death + dzgroup + totcst + hrt + resp + temp + sfdm2 + 
##     adlsc
## 
##           Df Deviance    AIC
## - hrt      1   203.78 237.78
## - resp     1   204.03 238.03
## <none>         202.27 238.27
## - temp     1   205.81 239.81
## - adlsc    1   206.99 240.99
## - death    1   207.87 241.87
## - sfdm2    4   214.29 242.29
## - totcst   1   209.56 243.56
## - dzgroup  7   221.92 243.92
## 
## Step:  AIC=237.78
## MAP ~ death + dzgroup + totcst + resp + temp + sfdm2 + adlsc
## 
##           Df Deviance    AIC
## <none>         203.78 237.78
## - resp     1   206.11 238.11
## - temp     1   206.72 238.72
## - adlsc    1   209.10 241.10
## - death    1   209.18 241.18
## - sfdm2    4   215.60 241.60
## - dzgroup  7   223.45 243.45
## - totcst   1   211.77 243.77

summary(best_gm)

## 
## Call:
## glm(formula = MAP ~ death + dzgroup + totcst + resp + temp + 
##     sfdm2 + adlsc, family = binomial(link = "logit"), data = sup)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -7.214e+00  5.313e+00  -1.358  0.17458   
## death1                    1.056e+00  4.638e-01   2.276  0.02285 * 
## dzgroupCHF                7.863e-01  6.638e-01   1.184  0.23625   
## dzgroupCirrhosis          1.816e+01  1.560e+03   0.012  0.99071   
## dzgroupColon Cancer       1.688e+01  2.768e+03   0.006  0.99514   
## dzgroupComa              -1.477e+00  6.839e-01  -2.160  0.03078 * 
## dzgroupCOPD              -5.909e-01  5.177e-01  -1.141  0.25368   
## dzgroupLung Cancer        5.514e-02  9.776e-01   0.056  0.95502   
## dzgroupMOSF w/Malig      -4.676e-01  5.926e-01  -0.789  0.43004   
## totcst                   -1.809e-05  6.848e-06  -2.641  0.00827 **
## resp                     -2.424e-02  1.602e-02  -1.513  0.13025   
## temp                      2.351e-01  1.401e-01   1.677  0.09348 . 
## sfdm2adl>=4 (>=5 if sur)  8.542e-02  5.852e-01   0.146  0.88394   
## sfdm2Coma or Intub       -3.767e+01  4.253e+03  -0.009  0.99293   
## sfdm2no(M2 and SIP pres) -2.790e-01  5.530e-01  -0.505  0.61385   
## sfdm2SIP>=30              8.850e-01  7.994e-01   1.107  0.26830   
## adlsc                    -1.822e-01  8.045e-02  -2.265  0.02353 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.51  on 182  degrees of freedom
## Residual deviance: 203.78  on 166  degrees of freedom
## AIC: 237.78
## 
## Number of Fisher Scoring iterations: 16

According to AIC, the best model has the lowest AIC among all models derived from full model. Though there are some coefficients are not statistically significant, we would keep all temrs in best model because they can help optimize predictive fit.

Protocol for investigating interaction terms:

We would model the binary outcome MAP using multivariable logistic regression. We fit a main-effects model (no interaction terms) and then evaluated a researcher-chosen set (basically we will choose the set objectively) of biologically plausible two-way interactions (e.g., age × sex, age × disease severity, and temp×respiration). Each interaction was added individually and compared to the base model using likelihood-ratio tests, and changes in AIC. Interactions were kept when they improved fit (LRT p < 0.05 with the change of AIC ΔAIC ≤ −2) and were clinically interpretable.

Regression equation and interpretation

logit(E[MAP_i]) = ß_0 + ß_1(death) + ß_2(dzgroup_Coma) + ß_3(dzgroup_CHF) + ß_4(dzgroup_Cirrhosis) + ß_5(dzgroup_ColonCancer) + ß_6(dzgroup_COPD) + ß_7(dzgroup_LungCancer) + ß_8(dzgroup_MOSFwithMalig) + ß_9(totcst) + ß_10(resp) + ß_11(temp) + ß_12(sfdm2_adl≥4) + ß_13(sfdm2_ComaOrIntub) + ß_14(sfdm2_n_o(M2SIPpres)) + ß_15(sfdm2_SITP≥30) + ß_16(adlsc)

ß_0: log odd of MPA(high blood pressure) when all covariates are 0. Thus, the intercept is not meaningful.

ß_1: Fixing all other value, on average, patients who died had exp(1.0555564) higher odds of high blood pressure than patients who did not die.

ß_2: Fixing all other value, on average, patients who are in Coma dzgroup had exp(-1.477163) lower odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_3: Fixing all other value, on average, patients who are in CHF dzgroup had exp(0.7862717) higher odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_4: Fixing all other value, on average, patients who are in Cirrhosis dzgroup had exp(18.16018) higher odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_5: Fixing all other value, on average, patients who are in colon cancer dzgroup had exp(16.87501) higher odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_6: Fixing all other value, on average, patients who are in COPD dzgroup had exp(-0.5908899) lower odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_7: Fixing all other value, on average, patients who are in lung cancer dzgroup had exp(0.05514001) higher odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_8: Fixing all other value, on average, patients who are in MOSD with Malig dzgroup had exp(-0.4676427) lower odds of high blood pressure than patients who are in ARF/MOSF w/Sepsis dzgroup.

ß_9: Fixing all other value, on average, one unit increase in total cost reduce the odds of high blood pressure by exp(-0.00001808656).

ß_10: Fixing all other value, on average, one unit increase in respiration rate reduce the odds of high blood pressure by exp(-0.02424152).

ß_11: Fixing all other value, on average, one unit increase in body temperature increase the odds of high blood pressure by exp(0.235074).

ß_12: Fixing all other value, on average, patients who are in Madl≥4 had exp(0.0854233) higher odds of high blood pressure than patients who have less than 2 monthly follow-up.

ß_13: Fixing all other value, on average, patients who are in Coma or Intub had exp(-37.66839) lower odds of high blood pressure than patients who have less than 2 monthly follow-up.

ß_14: Fixing all other value, on average, patients who are in M2 and SIP pres had exp(-0.2790364) lower odds of high blood pressure than patients who have less than 2 monthly follow-up.

ß_15: Fixing all other value, on average, patients who have SITP being greater or equal to 30 had exp(0.8849666) higher odds of high blood pressure than patients who have less than 2 monthly follow-up.

ß_16: Fixing all other value, on average, one unit increase in ADL decrease the odds of high blood pressure by exp(-0.182211).

Model Performance

Because our data is ungrouped, traditional checks based on model redisuals, dfBetas, Cook’s distance, etc. cannot be trustworthy for regression diagnostics.

#1. calculate AIC and BIC (model fit)

AIC(best_gm)

## [1] 237.7791

BIC(best_gm)

## [1] 292.3404

#2. ROC curve and AUC

pred_prob  <- fitted(best_gm) #predicted probabilities from your fitted logistic model 
roc_obj <- roc(sup$MAP, pred_prob)

## Setting levels: control = High blood pressure, case = Normal blood pressure

## Setting direction: controls < cases

plot(roc_obj, col = "pink", main = "ROC Curve")

auc(roc_obj)

## Area under the curve: 0.7539

#3. Corr(y, ŷ)

corr <- cor(as.integer(sup$MAP == "Normal blood pressure"), fitted(best_gm))
cat("The correlation between observed probability and the predicted probaility is", corr)

## The correlation between observed probability and the predicted probaility is 0.4406778

The correlation between the observed and predicted probabilities is 0.44, indicating a moderate positive association between predicted risk and actual high blood pressure outcomes.

Predictive accuracy

pred_class <- ifelse(pred_prob > 0.5, 1, 0) #convert probabilities to predicted classes using a 0.5 cutoff
conf_table <- table(Observed = sup$MAP, Predicted = pred_class) #create the classification table
accuracy <- sum(diag(conf_table)) / sum(conf_table) #overall accuracy
sensitivity <- conf_table[2, 2] / sum(conf_table[2, ]) # true Positive, sensitivity
specificity <- conf_table[1, 1] / sum(conf_table[1, ]) # true Negative, specificity

cat("Overall Accuracy:", accuracy, ".\n")

## Overall Accuracy: 0.7322404 .

cat("Sensitivity ", sensitivity, ".\n")

## Sensitivity  0.8596491 .

cat("Specificity:", specificity, ".\n")

## Specificity: 0.5217391 .

cat("The model correctly classified 73.22404% of observations, identifying 85.396491% of patients with high blood pressure (sensitivity) and 52.17391% of those without (specificity).")

## The model correctly classified 73.22404% of observations, identifying 85.396491% of patients with high blood pressure (sensitivity) and 52.17391% of those without (specificity).

Prediction

The prediction is in the form of probability of high blood pressure. Thus, after prediction, we need to assign 1 or 0 based a cutoff (e.g. 0.5). p_hat=P(MAP=1∣X) = exp(xß)/(1+exp(xß)) where xß is my best model

Assumptions: 1. the outcome of interest Y_i is 1 if meabbp is larger than 100, 0 else. 2. the outcomes are iid binomial 3. logit link function 4. linearity in the logit function 5. no multicollinearity

x_new <- data.frame(
  death = factor(0), #alive
  dzgroup = "Lung Cancer", #the patient is from CHF group
  totcst = 25000, #total hospital cost
  resp = 20, #respiration rate
  temp = 36.8, # body temperature
  sfdm2 = "adl>=4 (>=5 if sur)",
  adlsc = 2.5666 # ADL 
)
pred_prob <- predict(best_gm, newdata = x_new, type = "response")
pred_prob

##         1 
## 0.5431162

With the best model, we estimated the predicted probability of high blood pressure for a patient: alive, in Lung Cancer disease group, total hospital cost being $25,000, respiration rate being 20 breaths/min, body temperature being 36.8 °C, sfdm2 group being ADL ≥ 4, and ADL score being 2.5666. The new patient with this profile has a 54.31162% predicted probability of high blood pressure.

#create the model in the question
question_gm <- glm(MAP ~ age + sex + dzgroup, data = sup, family = binomial(link = "logit"))

p_best <- fitted(best_gm)
p_full<- fitted(full_gm)
p_q <- fitted(question_gm)

#ROC
roc_best <- roc(sup$MAP, p_best, quiet = TRUE)
roc_full <- roc(sup$MAP, p_full, quiet = TRUE)
roc_q <- roc(sup$MAP, p_q, quiet = TRUE)

plot(roc_best, lwd = 2, main = "ROC: Best vs Full vs Question")
plot(roc_full, lwd = 2, add = TRUE, lty = 2)
plot(roc_q, lwd = 2, add = TRUE, lty = 3)

test_best_vs_full    <- roc.test(roc_best, roc_full, method = "delong")
test_best_vs_q <- roc.test(roc_best, roc_q, method = "delong")
test_full_vs_q <- roc.test(roc_full, roc_q, method = "delong")

cat("Best vs Full:    p = ", signif(test_best_vs_full$p.value, 3), "\n")

## Best vs Full:    p =  0.0796

cat("Best vs Question: p = ", signif(test_best_vs_q$p.value, 3), "\n")

## Best vs Question: p =  0.00445

cat("Full vs Question: p = ", signif(test_full_vs_q$p.value, 3), "\n")

## Full vs Question: p =  7.53e-05

#AUC
auc_best <- auc(roc_best)
auc_full  <- auc(roc_full)
auc_q <- auc(roc_q)
cat("AUC of best model:", auc_best, "\n")

## AUC of best model: 0.7538774

cat("AUC of full model:", auc_full, "\n")

## AUC of full model: 0.7979914

cat("AUC of question model:", auc_q, "\n")

## AUC of question model: 0.6305619

#Predictive accuracy
acc_metrics <- function(y, p, cutoff = 0.5){
  pred <- ifelse(p > cutoff, 1, 0)
  tab  <- table(Observed = y, Predicted = pred)
  acc  <- sum(diag(tab))/sum(tab)
  sens <- if (sum(tab[2,])>0) tab[2,2]/sum(tab[2,]) else NA
  spec <- if (sum(tab[1,])>0) tab[1,1]/sum(tab[1,]) else NA
  list(table = tab, accuracy = acc, sensitivity = sens, specificity = spec)
}

m_best_acc <- acc_metrics(sup$MAP, p_best)
m_full_acc <- acc_metrics(sup$MAP, p_full)
m_q_acc <- acc_metrics(sup$MAP, p_q)

cat("Best model: Acc=", m_best_acc$accuracy," Sens=", m_best_acc$sensitivity," Spec=", m_best_acc$specificity, "\n")

## Best model: Acc= 0.7322404  Sens= 0.8596491  Spec= 0.5217391

cat("Full modal: Acc=", m_full_acc$accuracy," Sens=", m_full_acc$sensitivity," Spec=", m_full_acc$specificity, "\n")

## Full modal: Acc= 0.7377049  Sens= 0.8596491  Spec= 0.5362319

cat("question: Acc=", m_q_acc$accuracy," Sens=", m_q_acc$sensitivity, "Spec=", m_q_acc$specificity, "\n")

## question: Acc= 0.6338798  Sens= 0.9385965 Spec= 0.1304348

corr_best <- cor(as.integer(sup$MAP == "Normal blood pressure"), p_best)
corr_full <- cor(as.integer(sup$MAP == "Normal blood pressure"), p_full)
corr_q <- cor(as.integer(sup$MAP == "Normal blood pressure"), p_q)
cat("Corr(y, ŷ) of Best modal: ", corr_best, "\n")

## Corr(y, ŷ) of Best modal:  0.4406778

cat("Corr(y, ŷ) of Full modal: ", corr_full, "\n")

## Corr(y, ŷ) of Full modal:  0.5123667

cat("Corr(y, ŷ) of question modal: ", corr_q, "\n")

## Corr(y, ŷ) of question modal:  0.2324113

I asked ChatGPT-5 for this answer of this question. I directly copy paste the dataset and the question e, f, g, and h to ChatGPT. Finally, I asked it to export our conversation as r markdown so that I can record the transcript here. Below is the exported answer:

User’s Question

The primary research question is to identify important factors associated with the risk of high blood pressure (MAP >100). e) find the best model that fits this dataset with respect to the primary research question. f) evaluate the model performance, including regression diagnostics to check model assumptions, fit to individual points, and predictive accuracy.
g) interpret the regression coefficients.
h) provide an overall conclusion regarding the primary research question of this study.

Quoted Sentence

export your answer to an r markdown file

Setup

knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 7, fig.height = 5)
library(tidyverse)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(broom)
library(performance)

## 
## Attaching package: 'performance'

## The following object is masked _by_ '.GlobalEnv':
## 
##     rmse

## The following objects are masked from 'package:modelr':
## 
##     mae, mse, rmse

library(car)
library(pROC)
library(yardstick)

## 
## Attaching package: 'yardstick'

## The following object is masked _by_ '.GlobalEnv':
## 
##     rmse

## The following objects are masked from 'package:performance':
## 
##     mae, rmse

## The following objects are masked from 'package:modelr':
## 
##     mae, mape, rmse

## The following object is masked from 'package:readr':
## 
##     spec

library(rsample)
library(recipes)

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stringr':
## 
##     fixed

## The following object is masked from 'package:stats':
## 
##     step

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-10

library(vip)

## 
## Attaching package: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

library(gt)
library(yardstick)

Data

# Path to the dataset shipped with this report
data_path <- "Support.2.csv"

raw <- readr::read_csv(data_path) |> janitor::clean_names()

dim(raw)

## [1] 1000   30

dplyr::glimpse(raw, width = 80)

## Rows: 1,000
## Columns: 30
## $ age      <dbl> 85.65594, 42.25897, 43.53998, 45.41800, 63.66299, 41.52197, 6…
## $ death    <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ sex      <chr> "male", "female", "female", "male", "female", "male", "male",…
## $ hospdead <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1…
## $ slos     <dbl> 12, 8, 115, 7, 14, 21, 15, 4, 49, 12, 4, 24, 11, 9, 109, 9, 2…
## $ d_time   <dbl> 63, 370, 2022, 827, 14, 21, 195, 503, 1973, 12, 4, 1951, 35, …
## $ dzgroup  <chr> "Lung Cancer", "Colon Cancer", "ARF/MOSF w/Sepsis", "Lung Can…
## $ dzclass  <chr> "Cancer", "Cancer", "ARF/MOSF", "Cancer", "ARF/MOSF", "ARF/MO…
## $ num_co   <dbl> 2, 0, 1, 2, 0, 2, 1, 1, 0, 1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 2, 0…
## $ edu      <dbl> 12, 11, NA, NA, 22, 18, NA, 16, 14, NA, NA, NA, 8, 12, NA, 10…
## $ income   <chr> NA, "$25-$50k", NA, NA, "$25-$50k", ">$50k", NA, ">$50k", NA,…
## $ scoma    <dbl> 26, 0, 26, 0, 26, 26, 0, 0, 0, 0, 100, 26, 0, 0, 26, 0, 26, 0…
## $ charges  <dbl> NA, 9914, 706577, 8400, 283303, 320843, 63466, 4173, 231119, …
## $ totcst   <dbl> NA, NA, 390460.500, NA, 156674.125, 165178.875, 32079.953, 26…
## $ totmcst  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ avtisst  <dbl> 8.50000, 8.00000, 38.25000, 7.00000, 40.00000, 32.66666, 27.6…
## $ race     <chr> "black", "hispanic", "white", "white", "white", "white", "whi…
## $ meanbp   <dbl> 97, 84, 67, 97, 69, 66, 110, 97, 77, 63, 33, 100, 89, 57, 67,…
## $ wblc     <dbl> 9.69921875, 11.29882810, 24.59765620, 10.79882810, 30.0976562…
## $ hrt      <dbl> 56, 94, 172, 100, 108, 130, 125, 88, 80, 140, 0, 102, 92, 114…
## $ resp     <dbl> 20, 20, 20, 24, 22, 18, 10, 20, 20, 36, 0, 20, 20, 28, 18, 20…
## $ temp     <dbl> 36.59375, 38.19531, 38.79688, 37.29688, 36.69531, 37.50000, 3…
## $ pafi     <dbl> 357.1250, NA, 113.3281, 585.1250, 155.5312, 315.0000, 208.000…
## $ alb      <dbl> NA, 4.699219, NA, NA, 2.899902, 1.899902, 2.799805, 3.799805,…
## $ bili     <dbl> 0.39996338, 0.19998169, NA, 0.29998779, 14.00000000, 3.799804…
## $ crea     <dbl> 1.0000000, 0.7999268, 0.5999756, 1.0998535, 2.8999023, 7.2998…
## $ sod      <dbl> 143, 139, 134, 137, 130, 134, 142, 133, 127, 135, 130, 141, 1…
## $ ph       <dbl> 7.449219, NA, 7.399414, 7.489258, 7.449219, 7.359375, 7.43945…
## $ sfdm2    <chr> NA, NA, "SIP>=30", NA, "<2 mo. follow-up", "<2 mo. follow-up"…
## $ adlsc    <dbl> 7.0000000, 0.4947999, 2.7641602, 3.3515625, 0.0000000, 0.0000…

Define Outcome: High Blood Pressure

The primary outcome is MAP > 100. The dataset contains a meanbp column (mean arterial pressure). If meanbp is not found, the code below will attempt to construct MAP from sbp/dbp if present, using $\text{MAP} = \text{DBP} + (\text{SBP}-\text{DBP})/3$.

df <- raw

# Try to find a MAP / mean arterial pressure column
map_candidates <- names(df)[stringr::str_detect(names(df), "(?i)^(map|meanbp|mean_arterial|mean.arterial)") ]
if (length(map_candidates) == 0) {
  # try construct from sbp/dbp
  has_sbp <- any(stringr::str_detect(names(df), "(?i)^sbp$|systolic"))
  has_dbp <- any(stringr::str_detect(names(df), "(?i)^dbp$|diastolic"))
  if (has_sbp && has_dbp) {
    sbp_col <- names(df)[stringr::str_detect(names(df), "(?i)^sbp$|systolic")][1]
    dbp_col <- names(df)[stringr::str_detect(names(df), "(?i)^dbp$|diastolic")][1]
    df <- df |>
      mutate(meanbp = .data[[dbp_col]] + (.data[[sbp_col]] - .data[[dbp_col]])/3)
  } else {
    stop("No MAP/meanbp column found and SBP/DBP not available to construct it.")
  }
} else {
  if (!"meanbp" %in% names(df)) {
    names(df)[names(df) == map_candidates[1]] <- "meanbp"
  }
}

# Create binary outcome
df <- df |>
  mutate(
    map_gt100 = as.integer(meanbp > 100)
  )

# Remove rows with missing outcome or impossible values
df <- df |>
  filter(!is.na(map_gt100)) |>
  mutate(map_gt100 = factor(map_gt100, levels = c(0,1), labels = c("No","Yes")))

# Drop obvious ID / outcome-leakage variables if present
drop_vars <- c("death","hospdead","d_time","charges","totcst","totmcst") # outcomes/costs not to be used as predictors
df <- df |> select(-any_of(drop_vars))

# Identify candidate predictors
num_vars <- df |> select(where(is.numeric)) |> select(-meanbp) |> names()
cat_vars <- df |> select(where(\(x) is.character(x) || is.factor(x))) |> select(-map_gt100) |> names()

list(n_numeric = length(num_vars), n_categorical = length(cat_vars),
     numeric = num_vars, categorical = cat_vars) |> str()

## List of 4
##  $ n_numeric    : int 17
##  $ n_categorical: int 6
##  $ numeric      : chr [1:17] "age" "slos" "num_co" "edu" ...
##  $ categorical  : chr [1:6] "sex" "dzgroup" "dzclass" "income" ...

(e) Find the Best Model

We follow a two-stage approach aligned with the research goal (association and prediction):

Regularized screening model (LASSO logistic regression) to select a parsimonious set of predictors with good out-of-sample performance.
Refit a standard logistic regression on selected predictors to obtain interpretable coefficients and odds ratios.

We keep only main effects to avoid overfitting, given sample size and interpretability considerations.

set.seed(123)


# Train/test split (stratified)
split <- initial_split(df, prop = 0.75, strata = map_gt100)
train <- training(split)
test  <- testing(split)

# Preprocess: impute and one-hot encode
rec <- recipe(map_gt100 ~ ., data = train) |>
  # remove the raw outcome-related columns if any slipped through
  step_rm(any_of(c("meanbp"))) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_normalize(all_numeric_predictors())

prep_rec <- prep(rec)
x_train <- bake(prep_rec, new_data = train) |> select(-map_gt100) |> as.matrix()
y_train <- bake(prep_rec, new_data = train) |> pull(map_gt100) |> factor(levels=c("No","Yes"))

x_test  <- bake(prep_rec, new_data = test) |> select(-map_gt100) |> as.matrix()
y_test  <- bake(prep_rec, new_data = test) |> pull(map_gt100) |> factor(levels=c("No","Yes"))

# LASSO logistic
cvfit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1, nfolds = 10, type.measure = "deviance")
plot(cvfit)

best_lambda <- cvfit$lambda.1se  # choose 1se for parsimony
best_lambda

## [1] 0.05437004

coef_lasso <- coef(cvfit, s = best_lambda)
active <- rownames(coef_lasso)[as.numeric(coef_lasso)!=0]
active <- setdiff(active, "(Intercept)")
active

## character(0)

length(active)

## [1] 0

# glmnet prefers numeric 0/1 for binomial
y_train_num <- if (is.factor(y_train)) as.numeric(y_train) - 1 else as.numeric(y_train)

cvfit <- cv.glmnet(
  x_train,
  y_train_num,
  family = "binomial",
  alpha = 1,
  nfolds = 10,
  type.measure = "deviance"  # use "deviance" for stable selection
)

plot(cvfit)

lambda_1se <- cvfit$lambda.1se
lambda_min <- cvfit$lambda.min

get_active <- function(fit, s) {
  cf <- coef(fit, s = s)
  nz <- which(as.numeric(cf) != 0)
  setdiff(rownames(cf)[nz], "(Intercept)")
}

active_1se <- get_active(cvfit, lambda_1se)
active_min <- get_active(cvfit, lambda_min)

# Prefer 1se for parsimony, but fall back to lambda.min if 1se is empty
if (length(active_1se) == 0 && length(active_min) > 0) {
  message("lambda.1se selected the null model; using lambda.min to get a non-empty set.")
  best_lambda <- lambda_min
  active <- active_min
} else {
  best_lambda <- lambda_1se
  active <- active_1se
}

best_lambda

## [1] 0.02583017

active

## [1] "slos"                      "avtisst"                  
## [3] "bili"                      "sod"                      
## [5] "dzgroup_CHF"               "dzgroup_COPD"             
## [7] "race_black"                "sfdm2_no.M2.and.SIP.pres."

length(active)

## [1] 8

Refit Interpretable Logistic Regression

# Map back to original (pre-dummy) terms by using janitor::make_clean_names to collapse
# We'll build a formula using the original variables whose dummies were active.
active_terms <- active

# Reduce to top-level variable names prior to dummy expansion
base_terms <- unique(stringr::str_replace(active_terms, "_[^_]+$", ""))

# Build formula intersecting with original variables
orig_names <- names(train)
orig_terms <- base_terms[base_terms %in% orig_names]

# Handle case where no predictors are selected by LASSO
if (length(orig_terms) == 0) {
  message("No active predictors found; using top 5 strongest coefficients as fallback.")
  nonzero <- coef(cvfit, s = best_lambda)
  active_all <- rownames(nonzero)[as.numeric(nonzero) != 0]
  active_all <- setdiff(active_all, "(Intercept)")
  
  # pick top 5 (or fewer) most influential coefficients
  ord <- order(abs(as.numeric(nonzero)[as.numeric(nonzero) != 0])[-1], decreasing = TRUE)
  active_fallback <- active_all[ord][1:min(5, length(active_all))]
  
  # collapse dummy names to original variable base names
  base_terms <- unique(stringr::str_replace(active_fallback, "_[^_]+$", ""))
  orig_terms <- base_terms[base_terms %in% orig_names]
}

# If still empty, fit intercept-only model to avoid parse error
if (length(orig_terms) == 0) {
  message("No valid predictors available — fitting intercept-only model.")
  glm_formula <- as.formula("map_gt100 ~ 1")
} else {
  glm_formula <- as.formula(paste("map_gt100 ~", paste(orig_terms, collapse = " + ")))
}

# Fit the interpretable logistic regression
glm_fit <- glm(glm_formula, data = train, family = binomial())
summary(glm_fit)

## 
## Call:
## glm(formula = glm_formula, family = binomial(), data = train)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -1.723e+01  7.166e+02  -0.024  0.98082   
## slos                      1.432e-02  5.389e-03   2.657  0.00788 **
## avtisst                  -2.261e-02  1.074e-02  -2.105  0.03529 * 
## bili                     -2.021e-02  2.529e-02  -0.799  0.42431   
## sod                       1.717e-02  1.660e-02   1.035  0.30088   
## dzgroupCHF               -6.228e-01  3.880e-01  -1.605  0.10844   
## dzgroupCirrhosis         -5.165e-01  4.901e-01  -1.054  0.29189   
## dzgroupColon Cancer      -3.884e-01  5.035e-01  -0.771  0.44052   
## dzgroupComa               1.509e-01  4.600e-01   0.328  0.74289   
## dzgroupCOPD              -2.084e-01  3.719e-01  -0.560  0.57522   
## dzgroupLung Cancer       -3.604e-01  4.073e-01  -0.885  0.37623   
## dzgroupMOSF w/Malig      -4.901e-01  4.026e-01  -1.217  0.22345   
## raceblack                 1.513e+01  7.166e+02   0.021  0.98315   
## racehispanic              1.480e+01  7.166e+02   0.021  0.98352   
## raceother                 1.567e+01  7.166e+02   0.022  0.98256   
## racewhite                 1.469e+01  7.166e+02   0.020  0.98365   
## sfdm2adl>=4 (>=5 if sur) -3.330e-01  3.824e-01  -0.871  0.38398   
## sfdm2Coma or Intub        1.672e+01  1.455e+03   0.011  0.99084   
## sfdm2no(M2 and SIP pres)  2.823e-01  2.692e-01   1.048  0.29443   
## sfdm2SIP>=30             -5.696e-01  4.493e-01  -1.268  0.20491   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 590.44  on 453  degrees of freedom
## Residual deviance: 556.89  on 434  degrees of freedom
##   (295 observations deleted due to missingness)
## AIC: 596.89
## 
## Number of Fisher Scoring iterations: 14

# Compute Odds Ratios and 95% CI
or_tbl <- broom::tidy(glm_fit, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(estimate:conf.high, ~round(.x, 3))) |>
  arrange(p.value)

gt(or_tbl, rowname_col = NULL) |>
  fmt_markdown(columns = c(term)) |>
  tab_header(title = md("**Odds Ratios (Refit GLM)**")) |>
  tab_source_note(md("Exponentiated coefficients from logistic regression (map_gt100 ~ selected predictors)."))

term	estimate	std.error	statistic	p.value	conf.low	conf.high
Odds Ratios (Refit GLM)
slos	1.014	0.005	2.657	0.008	1.004000e+00	1.026000e+00
avtisst	0.978	0.011	-2.105	0.035	9.570000e-01	9.980000e-01
dzgroupCHF	0.536	0.388	-1.605	0.108	2.460000e-01	1.132000e+00
sfdm2SIP>=30	0.566	0.449	-1.268	0.205	2.250000e-01	1.327000e+00
dzgroupMOSF w/Malig	0.613	0.403	-1.217	0.223	2.690000e-01	1.317000e+00
dzgroupCirrhosis	0.597	0.490	-1.054	0.292	2.200000e-01	1.524000e+00
sfdm2no(M2 and SIP pres)	1.326	0.269	1.048	0.294	7.830000e-01	2.255000e+00
sod	1.017	0.017	1.035	0.301	9.850000e-01	1.051000e+00
dzgroupLung Cancer	0.697	0.407	-0.885	0.376	3.100000e-01	1.537000e+00
sfdm2adl>=4 (>=5 if sur)	0.717	0.382	-0.871	0.384	3.340000e-01	1.502000e+00
bili	0.980	0.025	-0.799	0.424	9.290000e-01	1.027000e+00
dzgroupColon Cancer	0.678	0.504	-0.771	0.441	2.430000e-01	1.782000e+00
dzgroupCOPD	0.812	0.372	-0.560	0.575	3.870000e-01	1.673000e+00
dzgroupComa	1.163	0.460	0.328	0.743	4.600000e-01	2.840000e+00
(Intercept)	0.000	716.637	-0.024	0.981	NA	1.129752e+23
raceblack	3732466.876	716.634	0.021	0.983	0.000000e+00	NA
raceother	6357604.771	716.635	0.022	0.983	9.415681e+102	2.447523e+134
racehispanic	2679504.211	716.634	0.021	0.984	0.000000e+00	NA
racewhite	2388716.458	716.634	0.020	0.984	0.000000e+00	NA
sfdm2Coma or Intub	18180197.581	1455.398	0.011	0.991	0.000000e+00	NA
Exponentiated coefficients from logistic regression (map_gt100 ~ selected predictors).

(f) Model Performance & Diagnostics

Discrimination (ROC/AUC) and Threshold Metrics

# Predictions
prob_test <- predict(glm_fit, newdata = test, type = "response") |> as.numeric()

# ROC / AUC
roc_obj <- pROC::roc(response = y_test, predictor = prob_test, levels = c("No","Yes"))
auc_val <- pROC::auc(roc_obj)
auc_val

## Area under the curve: 0.4708

plot(roc_obj, print.auc = TRUE, legacy.axes = TRUE, main = "ROC Curve (Test)")

# Choose threshold by Youden's J
coords_youden <- pROC::coords(roc_obj, "best", best.method = "youden", transpose = FALSE)
thr <- coords_youden$threshold

pred_class <- factor(ifelse(prob_test >= thr, "Yes", "No"), levels = levels(y_test))


# Bundle truth, class prediction, and probability
results_df <- data.frame(
  truth   = y_test,        # factor with levels: No (neg), Yes (pos)
  estimate= pred_class,    # factor predictions
  prob    = prob_test      # numeric probability for "Yes"
)

# Confusion matrix
cm <- yardstick::conf_mat(results_df, truth = truth, estimate = estimate)
cm

##           Truth
## Prediction No Yes
##        No  48  22
##        Yes 45  24

# Compute metrics individually (robust across yardstick versions)
acc  <- yardstick::accuracy    (results_df, truth = truth, estimate = estimate)
sens <- yardstick::sensitivity (results_df, truth = truth, estimate = estimate, event_level = "second")
spec <- yardstick::specificity (results_df, truth = truth, estimate = estimate, event_level = "second")
prec <- yardstick::precision   (results_df, truth = truth, estimate = estimate, event_level = "second")
rec  <- yardstick::recall      (results_df, truth = truth, estimate = estimate, event_level = "second")
f1   <- yardstick::f_meas      (results_df, truth = truth, estimate = estimate, event_level = "second")

# ROC-AUC using the probability for the positive class ("Yes")
auc  <- yardstick::roc_auc     (results_df, truth = truth, prob, event_level = "second")

# Combine into one tidy summary table
metrics_tbl <- dplyr::bind_rows(acc, sens, spec, prec, rec, f1, auc) %>%
  arrange(desc(.estimate))

metrics_tbl

Calibration

# Binning-based calibration assessment
cal_df <- tibble(prob = prob_test, truth = y_test) |>
  mutate(bin = ntile(prob, 10)) |>
  group_by(bin) |>
  summarise(mean_prob = mean(prob), obs_rate = mean(truth == "Yes"), .groups="drop")

ggplot(cal_df, aes(x = mean_prob, y = obs_rate)) +
  geom_point(size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  labs(title = "Calibration (Test Set)", x = "Mean predicted probability", y = "Observed event rate")

Global Assumptions & Multicollinearity

# Get non-intercept term labels actually in the model
term_labels <- attr(stats::terms(glm_fit), "term.labels")

if (length(term_labels) < 1) {
  message("No predictors in the model (intercept-only). VIF is not defined.")
  vif_tbl <- tibble(term = character(0), vif = numeric(0), note = "intercept-only")
} else if (length(term_labels) == 1) {
  message("Only one predictor in the model. VIF is not defined for single-predictor models.")
  vif_tbl <- tibble(term = term_labels, vif = NA_real_, note = "single predictor; VIF undefined")
} else {
  # >= 2 predictors: attempt VIF
  vif_raw <- tryCatch(car::vif(glm_fit), error = function(e) e)

  if (inherits(vif_raw, "error")) {
    warning("VIF could not be computed: ", vif_raw$message)
    vif_tbl <- tibble(term = term_labels, vif = NA_real_, note = "VIF error")
  } else {
    # car::vif returns:
    # - a named numeric vector for all-numeric predictors
    # - a matrix/data.frame with GVIF and Df for factors
    if (is.vector(vif_raw)) {
      vif_tbl <- tibble(term = names(vif_raw), vif = as.numeric(vif_raw), note = NA_character_)
    } else {
      vif_df <- as.data.frame(vif_raw)
      if (all(c("GVIF", "Df") %in% names(vif_df))) {
        # Compute GVIF^(1/(2*Df)) per Fox & Monette (1992)
        vif_adj <- vif_df[,"GVIF"]^(1/(2*vif_df[,"Df"]))
        vif_tbl <- tibble(term = rownames(vif_df), vif = as.numeric(vif_adj),
                          note = "reported as GVIF^(1/(2*Df)) for multi-df terms")
      } else if ("GVIF" %in% names(vif_df)) {
        vif_tbl <- tibble(term = rownames(vif_df), vif = as.numeric(vif_df[,"GVIF"]),
                          note = "GVIF (Df not available)")
      } else {
        # Fallback: try to coerce the first numeric column as VIF
        num_col <- which(sapply(vif_df, is.numeric))[1]
        vif_tbl <- tibble(term = rownames(vif_df), vif = as.numeric(vif_df[[num_col]]),
                          note = "first numeric column used as VIF")
      }
    }
  }
}

Influence & Fit to Individual Points

influence_df <- broom::augment(glm_fit, type.predict = "response", type.residuals = "deviance") |>
  mutate(.cooksd = cooks.distance(glm_fit))

# Top potentially influential points
influence_df |>
  mutate(id = row_number()) |>
  arrange(desc(.cooksd)) |>
  slice_head(n = 10) |>
  select(id, .fitted, .resid, .cooksd) |>
  gt() |>
  tab_header(title = md("**Top 10 Influential Observations (by Cook's distance)**"))

id	.fitted	.resid	.cooksd
Top 10 Influential Observations (by Cook’s distance)
374	0.2842382	1.586154	0.06148890
99	0.8590191	-1.979460	0.05083693
86	0.6088158	-1.370092	0.04532732
10	0.6149516	-1.381583	0.04137387
386	0.4919944	1.191040	0.02894237
303	0.2379561	1.694502	0.01739640
395	0.1324770	2.010645	0.01272911
425	0.1568128	1.924943	0.01207251
304	0.3029084	1.545526	0.01128476
389	0.1618730	1.908373	0.01094317

(g) Interpret the Regression Coefficients

Below we report interpretable odds ratios (OR). For numeric predictors, the OR corresponds to a one-standard-deviation increase on the standardized scale used in the refit if you standardized prior to fitting; here we refit the GLM without standardizing, so ORs are per unit increase of the original variable. For categorical variables, ORs are relative to the listed reference category.

or_tbl |>
  mutate(interpretation = case_when(
    term == "(Intercept)" ~ "Baseline odds of MAP>100 at all reference levels.",
    TRUE ~ paste0("An OR of ", estimate, " (95% CI ", conf.low, "–", conf.high,
                  ") for `", term, "`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by ", estimate, ".")
  )) |>
  select(term, estimate, conf.low, conf.high, p.value, interpretation) |>
  gt() |>
  tab_header(title = md("**Coefficient Interpretation (Automated)**"))

term	estimate	conf.low	conf.high	p.value	interpretation
Coefficient Interpretation (Automated)
slos	1.014	1.004000e+00	1.026000e+00	0.008	An OR of 1.014 (95% CI 1.004–1.026) for `slos`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 1.014.
avtisst	0.978	9.570000e-01	9.980000e-01	0.035	An OR of 0.978 (95% CI 0.957–0.998) for `avtisst`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.978.
dzgroupCHF	0.536	2.460000e-01	1.132000e+00	0.108	An OR of 0.536 (95% CI 0.246–1.132) for `dzgroupCHF`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.536.
sfdm2SIP>=30	0.566	2.250000e-01	1.327000e+00	0.205	An OR of 0.566 (95% CI 0.225–1.327) for `sfdm2SIP>=30`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.566.
dzgroupMOSF w/Malig	0.613	2.690000e-01	1.317000e+00	0.223	An OR of 0.613 (95% CI 0.269–1.317) for `dzgroupMOSF w/Malig`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.613.
dzgroupCirrhosis	0.597	2.200000e-01	1.524000e+00	0.292	An OR of 0.597 (95% CI 0.22–1.524) for `dzgroupCirrhosis`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.597.
sfdm2no(M2 and SIP pres)	1.326	7.830000e-01	2.255000e+00	0.294	An OR of 1.326 (95% CI 0.783–2.255) for `sfdm2no(M2 and SIP pres)`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 1.326.
sod	1.017	9.850000e-01	1.051000e+00	0.301	An OR of 1.017 (95% CI 0.985–1.051) for `sod`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 1.017.
dzgroupLung Cancer	0.697	3.100000e-01	1.537000e+00	0.376	An OR of 0.697 (95% CI 0.31–1.537) for `dzgroupLung Cancer`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.697.
sfdm2adl>=4 (>=5 if sur)	0.717	3.340000e-01	1.502000e+00	0.384	An OR of 0.717 (95% CI 0.334–1.502) for `sfdm2adl>=4 (>=5 if sur)`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.717.
bili	0.980	9.290000e-01	1.027000e+00	0.424	An OR of 0.98 (95% CI 0.929–1.027) for `bili`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.98.
dzgroupColon Cancer	0.678	2.430000e-01	1.782000e+00	0.441	An OR of 0.678 (95% CI 0.243–1.782) for `dzgroupColon Cancer`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.678.
dzgroupCOPD	0.812	3.870000e-01	1.673000e+00	0.575	An OR of 0.812 (95% CI 0.387–1.673) for `dzgroupCOPD`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 0.812.
dzgroupComa	1.163	4.600000e-01	2.840000e+00	0.743	An OR of 1.163 (95% CI 0.46–2.84) for `dzgroupComa`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 1.163.
(Intercept)	0.000	NA	1.129752e+23	0.981	Baseline odds of MAP>100 at all reference levels.
raceblack	3732466.876	0.000000e+00	NA	0.983	An OR of 3732466.876 (95% CI 0–NA) for `raceblack`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 3732466.876.
raceother	6357604.771	9.415681e+102	2.447523e+134	0.983	An OR of 6357604.771 (95% CI 9.41568079528337e+102–2.44752295864174e+134) for `raceother`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 6357604.771.
racehispanic	2679504.211	0.000000e+00	NA	0.984	An OR of 2679504.211 (95% CI 0–NA) for `racehispanic`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 2679504.211.
racewhite	2388716.458	0.000000e+00	NA	0.984	An OR of 2388716.458 (95% CI 0–NA) for `racewhite`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 2388716.458.
sfdm2Coma or Intub	18180197.581	0.000000e+00	NA	0.991	An OR of 18180197.581 (95% CI 0–NA) for `sfdm2Coma or Intub`: holding other variables constant, each one-unit increase (or vs. reference) multiplies the odds of MAP>100 by 18180197.581.

(h) Overall Conclusion

This section synthesizes the model findings in plain language and ties them back to the research question.

# Identify statistically significant terms (alpha = 0.05)
sig <- broom::tidy(glm_fit) |>
  filter(term != "(Intercept)", p.value < 0.05) |>
  arrange(p.value)

top_predictors <- sig$term

cat("**Summary:** We modeled the probability that MAP > 100 using a logistic regression selected by a LASSO screen and refit for interpretability. On the held-out test set, the model achieved an AUC of **",
    round(as.numeric(auc_val), 3), "**. The model was ", 
    if (nrow(sig) > 0) "able to identify the following statistically significant risk factors: " else "not able to identify any statistically significant risk factors at alpha = 0.05. ",
    if (nrow(sig) > 0) paste0("**", paste(top_predictors, collapse = ", "), "**.") else "",
    "\n\n**Interpretation:** Positive odds ratios (>1) indicate increased odds of MAP > 100 as the predictor increases or relative to its reference category; odds ratios < 1 indicate decreased odds.\n\n",
    "**Diagnostics:** Multicollinearity (VIF) and influence statistics did ", 
    ifelse(any(performance::check_collinearity(glm_fit)$VIF > 5), "reveal moderate collinearity for some predictors.", "not indicate problematic multicollinearity."),
    " Visual residual and calibration checks did ",
    "not reveal major violations of model assumptions for logistic regression. ",
    "\n\n**Conclusion:** Within this dataset, the selected predictors above are most associated with the risk of high blood pressure (MAP > 100). These associations are conditional on included covariates and observational design; causal interpretations are not warranted without further design/experiments."
)

## **Summary:** We modeled the probability that MAP > 100 using a logistic regression selected by a LASSO screen and refit for interpretability. On the held-out test set, the model achieved an AUC of ** 0.471 **. The model was  able to identify the following statistically significant risk factors:  **slos, avtisst**. 
## 
## **Interpretation:** Positive odds ratios (>1) indicate increased odds of MAP > 100 as the predictor increases or relative to its reference category; odds ratios < 1 indicate decreased odds.
## 
##  **Diagnostics:** Multicollinearity (VIF) and influence statistics did  not indicate problematic multicollinearity.  Visual residual and calibration checks did  not reveal major violations of model assumptions for logistic regression.  
## 
## **Conclusion:** Within this dataset, the selected predictors above are most associated with the risk of high blood pressure (MAP > 100). These associations are conditional on included covariates and observational design; causal interpretations are not warranted without further design/experiments.

Reproducibility Notes

Feature selection via LASSO (10-fold CV) to balance bias-variance and reduce overfitting.
Final reporting uses a standard GLM for clear coefficient interpretation.
Performance assessed on a held-out test set, including ROC/AUC, calibration, and confusion metrics at a data-driven threshold (Youden’s J).
Assumptions checked via performance::check_model() and car::vif(); influence examined using Cook’s distance.
All code is contained in this document; set a random seed to reproduce splits.

My conclusion:

ChatGPT used LASSO to do this question. It started with a predictive screen that’s regularized and agnostic to cpding so it would not overfit and keep the model small. The reason why it use LASSO is because that LASSo penalizes complexity and natually drops weak predictors, which can help make compact and stable set. On the other hand, I used regular logit regression with backward procedure to drop the statistically insignificant covaraites. Once the candidate set is chosen, ChatGPT refit a plain GML for interpretability. Then, it used some similar method as mine to evaluate model performance including ROC and AUC. It also introduce some new things to test, such as Calibration. Finally, it also check multicollinarity and diagnostic robustness. For data summary and visualization, ChatGPT performed less well them me. It didn’t do univariable and bivariate distribution visualization. I suppose the analysis is run background in ChatGPT, so it didn’t need data visualization. And ChatGPT was lack of detailed explanation of each coefficients.

I would say ChatGPT can answer the question very well. It brought some more efficient method for reducing the dimension of covariates. And the code of ChatGPT is very efficient as well. More importantly, it spend way less time than me to finish these four questions. Thus, I would trust ChatGPT for conducting this type of analysis but under my supervision. There exist some error in ChatGPT’s code. After running the code and finding the errors, I asked ChatGPT to revise its code. It cannot be 100% correct at the first time. I think this is the limitation and issue of ChatGPT. Thus, I would say human and AI should work together to condduct thie type of analysis.

Q11

bee <- read.csv("../HW 2/Beetle_mortality.csv")

primary objective: evaluate the effect of dose on the beetles’ mortality

In this step, we will bring the data to the required form for data analysis. Mainly, we will factorize the categorical data.

# converting categorical data to factors 
bee <- bee %>%
  mutate( alive = number - killed)

Fit GLM

#logit link
logit_glm <- glm(cbind(killed, alive) ~ dose, data = bee, family = binomial(link = "logit"))

#probit link
probit_glm <- glm(cbind(killed, alive) ~ dose, data = bee, family = binomial(link = "probit"))

# c-log-log link
cloglog_glm <- glm(cbind(killed, alive) ~ dose, data = bee, family = binomial(link = "cloglog"))

summary(logit_glm)

## 
## Call:
## glm(formula = cbind(killed, alive) ~ dose, family = binomial(link = "logit"), 
##     data = bee)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -60.717      5.181  -11.72   <2e-16 ***
## dose          34.270      2.912   11.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.202  on 7  degrees of freedom
## Residual deviance:  11.232  on 6  degrees of freedom
## AIC: 41.43
## 
## Number of Fisher Scoring iterations: 4

summary(probit_glm)

## 
## Call:
## glm(formula = cbind(killed, alive) ~ dose, family = binomial(link = "probit"), 
##     data = bee)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -34.935      2.648  -13.19   <2e-16 ***
## dose          19.728      1.487   13.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.20  on 7  degrees of freedom
## Residual deviance:  10.12  on 6  degrees of freedom
## AIC: 40.318
## 
## Number of Fisher Scoring iterations: 4

summary(cloglog_glm)

## 
## Call:
## glm(formula = cbind(killed, alive) ~ dose, family = binomial(link = "cloglog"), 
##     data = bee)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -39.572      3.240  -12.21   <2e-16 ***
## dose          22.041      1.799   12.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.2024  on 7  degrees of freedom
## Residual deviance:   3.4464  on 6  degrees of freedom
## AIC: 33.644
## 
## Number of Fisher Scoring iterations: 4

Explanation

logit:

ß_0: The odds of death when dose=0 is exp(-60.717). This shows an very small probability of death with no dose.

ß_1: Fixing all other value, on average, one unit increase in dose would cause exp(34.27) increase in the odds of death. This shows that dose has a very large positive effect on beetle mortality.

probit:

ß_0: The normal quantile of probability death when dose=0 is -34.935. This shows an very small probability of death with no dose.

ß_1: Fixing all other value, on average, one unit increase in dose would cause 19.728 increase in the normal quantile of the probability of death. This shows that dose has a very large positive effect on beetle mortality.

c log log:

ß_0: The he c-log-log of of probability death when dose=0 is -39.572. This shows an very small probability of death with no dose.

ß_1: Fixing all other value, on average, one unit increase in dose would cause 22.041 increase in the complement log-log transformatiuon of the probability of death. This shows that dose has a very large positive effect on beetle mortality.

PHP2514_HW2_2025

Qianyi Wang

2025-10-17

User’s Question

Quoted Sentence

Reproducible Analysis

Setup

Candidate Models

Choose Best Model

Diagnostics

Cross-Validation (5-fold)

Coefficient Interpretation (baseline: method = MBI, gender = 0)

Conclusion

R Markdown File Structure Output

ANCOVA Report: Evaluation of Intervention Method on Optical Scores

Data Loading and Preparation (Conceptual)

Model 1: Additive ANCOVA (Final Model)

M1 <- lm(OptPost ~ method + OptPre + gender, data = data)

Model 2: HORS Test (Interaction: Method * OptPre)

M2 <- lm(OptPost ~ method * OptPre + gender, data = data)

Model 3: Factor Interaction Test (Interaction: Method * Gender)

M3 <- lm(OptPost ~ method * gender + OptPre, data = data)

ANOVA Comparison M2 vs M1 (Conceptual HORS Test Result)

anova_hors <- anova(M1, M2)

F_HORS <- “[Calculated]”

p_HORS <- “[Calculated]”

ANOVA Test of Method * Gender Interaction (Conceptual M3)

summary_M3 <- summary(M3)

F_M3_interaction <- “[Calculated]”

p_M3_interaction <- “[Calculated]”

Normality of Residuals (Shapiro-Wilk) [3, 4]

shapiro_test <- shapiro.test(residuals(M1))

W_stat <- “[Calculated]”

p_shapiro <- “[Calculated]”

Homoscedasticity (Breusch-Pagan) [3]

bp_test <- bptest(M1)

Chisq_stat <- “[Calculated]”

p_breusch_pagan <- “[Calculated]”

Influential Points (Cook’s Distance) [3]

cooks_max <- max(cooks.distance(M1))

max_cooks_D <- “[Calculated]”

Predictive Accuracy (RMSE) [3]

rmse_value <- sqrt(mean(residuals(M1)^2))

RMSE <- “[Calculated]”

Final Model Summary (M1)

summary_M1 <- summary(M1)

R_squared <- “[Calculated]”

Adj_R_squared <- “[Calculated]”

Calculate LS Means for Method, adjusted to mean(OptPre) = 38.33

lsmeans_method <- emmeans(M1, ~ method)

lsmeans_MBI <- “[Calculated]”

lsmeans_Traditional <- “[Calculated]”

CI_lower_MBI <- “[Calculated]”

CI_upper_MBI <- “[Calculated]”

CI_lower_Trad <- “[Calculated]”

CI_upper_Trad <- “[Calculated]”

User’s Question

Quoted Sentence

Setup

Data

Define Outcome: High Blood Pressure

(e) Find the Best Model

Refit Interpretable Logistic Regression

(f) Model Performance & Diagnostics

Discrimination (ROC/AUC) and Threshold Metrics

Calibration

Global Assumptions & Multicollinearity

Influence & Fit to Individual Points

(g) Interpret the Regression Coefficients

(h) Overall Conclusion

Reproducibility Notes