Homework LDA

The dataset “renal” contains information on 1160 patients who received a renal graft (kidney transplant). The patients have been followed for at most 10 years. Of interest is to study the evolution in Hematocrit and how the evolution depends on patient characteristics such as gender and age. Also, it is of interest to investigate whether the evolution in Haematocrit is affected by cardio-vascular problems in the past or by the occurrence of rejection symptoms during the first three months after the transplantation took place.

Variables:

ID: identification number
MALE: 0=female, 1=male
AGE: Age at transplantation
CARDIO: has the patient experienced a cardio-vascular problem during the years preceding the transplantation? 0=no, 1=yes
REJECT: has the patient shown symptoms of graft rejection during the first three months after the transplantation? 0=no, 1=yes
Hc0: Haematocrit level at moment of transplantation
Hc06: Haematocrit level 6 monts after transplantation
Hc1, Hc2, . . . , HC10: Haematocrit level 1 year, 2 years, . . . , 10 years after transplantation

1. Read the dataset using read_sas() of library haven.

dataset <- read_sas("renal.sas7bdat")
head(dataset)

2. The data is in wide format. Transform it to a long format where each row is an observation of patient i in time j (in years). Add all needed extra columns.

# Transformación del conjunto de datos de formato ancho a largo con etiquetas
dataset_long <- dataset %>%
  pivot_longer(cols = starts_with("Hc"), 
               names_to = "measurement", values_to = "Haematocrit") %>%
  mutate(year = ifelse(measurement == "Hc06", 0.5, parse_number(measurement)),
         measurement = paste("Measure at", ifelse(year == 0.5, "6 months", paste(year, "years")))) %>%
  set_variable_labels(year = "Year after transplantation", 
                      measurement = "Time of measurement", 
                      Haematocrit = "Measurement") %>%
  na.omit()

# Primeras filas del conjunto de datos transformado
head(dataset_long)

Given the transformation from wide to long format, our dataset is now structured so that each row represents a patient observation (i) at a specific point in time (j), facilitating longitudinal analysis.

3. Explore the data set graphically. Explore profile plots and mean evolution overtime line plots per each combination of sex, cardio and reject variables.

# Nuevas columnas a dataset_long
dataset_long <- dataset_long %>%
  mutate(
    Gender = ifelse(male == 1, "Male", "Female"),
    Cardiov = ifelse(cardio == 1, "Yes", "No"),
    Rejection = ifelse(reject == 1, "Yes", "No"),
    Group = interaction(Gender, Cardiov, Rejection)  
  )

# Gráficos para cada combinación de sexo, problemas cardiovasculares y rechazo
p1 <- ggplot(dataset_long %>% filter(Group == "Female.No.No"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Female • No Cardio • No Rejection", x = "Time", y = "Haematocrit")

p2 <- ggplot(dataset_long %>% filter(Group == "Female.No.Yes"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Female • No Cardio • Yes Rejection", x = "Time", y = "Haematocrit")

p3 <- ggplot(dataset_long %>% filter(Group == "Female.Yes.No"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Female • Yes Cardio • No Rejection", x = "Time", y = "Haematocrit")

p4 <- ggplot(dataset_long %>% filter(Group == "Female.Yes.Yes"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Female • Yes Cardio • Yes Rejection", x = "Time", y = "Haematocrit")

p5 <- ggplot(dataset_long %>% filter(Group == "Male.No.No"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Male • No Cardio • No Rejection", x = "Time", y = "Haematocrit")

p6 <- ggplot(dataset_long %>% filter(Group == "Male.No.Yes"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Male • No Cardio • Yes Rejection", x = "Time", y = "Haematocrit")

p7 <- ggplot(dataset_long %>% filter(Group == "Male.Yes.No"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Male • Yes Cardio • No Rejection", x = "Time", y = "Haematocrit")

p8 <- ggplot(dataset_long %>% filter(Group == "Male.Yes.Yes"), aes(x = year, y = Haematocrit, group = id)) + 
  geom_line() + 
  theme_classic() + 
  labs(title = "Male • Yes Cardio • Yes Rejection", x = "Time", y = "Haematocrit") +
  theme(legend.position = "none")

# Organizar y mostrar todos los gráficos
ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2, nrow = 4)

In the graphs of individual trajectories by group, we can note that in all cases there is an initial increase in hematocrit levels during the first months after transplantation, generally followed by stabilization or minor fluctuations in subsequent years. The groups without rejection show more homogeneous trajectories, while those with rejection show greater dispersion, evidenced by more separated lines and more notable fluctuations in individual hematocrit levels. The trajectories of patients with cardiovascular problems also tend to show wider variability, especially when combined with rejection. In general, the groups without cardiovascular problems or rejection present more consistent trajectories, with less variability in hematocrit evolution over time.

# Gráfico de la evolución media a lo largo del tiempo para cada combinación de sexo, cardio y rechazo
ggplot(dataset_long, aes(x = year, y = Haematocrit, color = Group)) +
  stat_summary(fun = mean, geom = "line", size = 1) +
  labs(title = "Mean Evolution of Haematocrit Levels Over Time", x = "Year after Transplantation", y = "Mean Haematocrit Level", color = "Group") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_text(size = 8), legend.text = element_text(size = 6))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

In the graph of average hematocrit evolution by group, we can notice that all groups show a significant initial increase in hematocrit in the first months after transplantation, reaching a maximum value close to the first year. Subsequently, hematocrit levels stabilize or decrease slightly, depending on the group. Men maintain higher average hematocrit levels compared to women over time. In addition, patients with rejection tend to have lower hematocrit levels than those without rejection, especially after the first year. The differences between patients with and without cardiovascular problems are not as marked as those associated with rejection or gender. The curves also show that the differences between the groups are more evident after the first year, while in the first months the trajectories are more similar.

# Gráfico de distribución de hematocrito por tiempo y grupo con diagramas de cajas
p_box <- ggplot(dataset_long, aes(x = factor(year), y = Haematocrit, fill = Group)) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Distribución de Hematocrito por Tiempo y Grupo",
    x = "Tiempo (Años)",
    y = "Hematocrito",
    fill = "Grupo"
  ) +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8)
  )

# Mostrar el gráfico
print(p_box)

In this graph of hematocrit distribution by time and group using box plots, we can observe several key aspects related to the evolution of hematocrit and variability between groups. First, the distributions in the first months (close to time 0) show a greater dispersion, with the presence of outliers towards lower hematocrit levels, which could reflect a more variable initial state before stabilizing. In the diagrams corresponding to the first year, the pattern identified in the previous graphs is confirmed: an initial increase in hematocrit levels, with generally higher medians, which then tend to stabilize or fluctuate slightly in subsequent years.

Variability within groups is evident in the box plots, especially for the groups with rejection and/or cardiovascular problems, which show wider distributions with larger interquartile ranges, supporting the previous findings of greater dispersion in these trajectories. On the other hand, the groups without rejection and without cardiovascular problems present more compact distributions, suggesting a more homogeneous evolution of hematocrit in these cases. In general, differences between men and women persist, with diagrams tending to show higher values for men in most periods. Finally, the differences between groups are more marked after the first year, while in the first months the distributions are more similar, which is consistent with the patterns observed in the previous graphs.

4. Fit two linear mixed models to answer the research interests:

4.1 Model the mean evolution over time using a quadratic effect in time. You decide how to best model the rest of the fixed and random parts of the model based on the graphical exploration and research interests.

# Variables necesarias
dataset_long <- dataset_long %>%
  mutate(
    time_centered = year - mean(year),          # Centrar la variable "year"
    time_cuadrado = time_centered^2            # Crear el término cuadrático
  )

# Modelo 4.1 
mod1 <- lmer(
  Haematocrit ~ time_centered * (age + male + cardio + reject) +
    time_cuadrado * (age + male + cardio + reject) +
    (time_centered || id),                      # Intercepto y pendiente aleatoria por paciente
  data = dataset_long
)

# Resumen del modelo 4.1
summary(mod1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: Haematocrit ~ time_centered * (age + male + cardio + reject) +  
##     time_cuadrado * (age + male + cardio + reject) + ((1 | id) +  
##     (0 + time_centered | id))
##    Data: dataset_long
## 
## REML criterion at convergence: 57676.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.9114 -0.4920  0.0370  0.5605  6.7087 
## 
## Random effects:
##  Groups   Name          Variance Std.Dev.
##  id       (Intercept)   12.1222  3.4817  
##  id.1     time_centered  0.1687  0.4108  
##  Residual               18.6778  4.3218  
## Number of obs: 9551, groups:  id, 1159
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)           3.573e+01  4.989e-01  71.618
## time_centered         4.637e-01  8.752e-02   5.299
## age                   5.554e-02  9.781e-03   5.678
## male                  2.753e+00  2.464e-01  11.175
## cardio               -1.748e-01  3.342e-01  -0.523
## reject               -5.902e-01  2.654e-01  -2.223
## time_cuadrado        -1.545e-01  2.390e-02  -6.465
## time_centered:age    -1.639e-03  1.744e-03  -0.940
## time_centered:male    4.726e-02  4.327e-02   1.092
## time_centered:cardio -7.276e-03  5.956e-02  -0.122
## time_centered:reject  7.420e-04  4.617e-02   0.016
## age:time_cuadrado    -7.212e-05  4.870e-04  -0.148
## male:time_cuadrado   -5.336e-02  1.174e-02  -4.545
## cardio:time_cuadrado -5.282e-03  1.652e-02  -0.320
## reject:time_cuadrado  3.141e-02  1.237e-02   2.539

## 
## Correlation matrix not shown by default, as p = 15 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

The chosen model adequately reflects the patterns observed in the exploratory analysis of the average evolution graphs and individual trajectories. First, the inclusion of linear (time_centered) and quadratic (time_squared) terms for time captures the general behavior observed: an initial rapid increase in hematocrit until about the first year, followed by a stabilization or slight decrease, which corresponds to the curvature visible in the average trajectories. Furthermore, interactions between time (linear and quadratic) and patient characteristics (age, male, cardio, reject) allow investigating whether these variables modify both the initial rate of change and the subsequent stabilization, which is supported by the between-group differences observed in the graphs.

The use of a random intercept and slope per patient reflects the high individual variability observed in the individual trajectories, where each patient shows unique behavior in terms of initial level and rate of change of hematocrit. This allows the model to capture differences not explained by fixed effects. Finally, the centering of the time variable improves the interpretation of the coefficients and ensures that interactions between time and patient variables are correctly interpreted around the average time value.

4.2 Model the mean evolution over time using a piecewise linear regression. You decide how to best model the rest of the fixed and random parts of the model based on the graphical exploration and research interests. You can find an explanation of piecewise linear regression here: [Piecewise Linear Regression Models][https://online.stat.psu.edu/stat501/lesson/8/8.8]

# Variable indicadora para el punto de cambio (k = 0.5)
dataset_long <- dataset_long %>%
  mutate(x2 = ifelse(year <= 0.5, 0, 1))  # Variable indicadora: 1 si year > 0.5, 0 si year <= 0.5

# Modelo 4.2 
mod2 <- lmer(
  Haematocrit ~ year * (age + male + cardio + reject) +  # Interacciones con tiempo (year)
    I(year - 0.5) * x2 +                                 # Cambio en la pendiente después de k = 0.5
    (year || id),                                         # Intercepto y pendiente aleatoria por paciente
  data = dataset_long
)

## fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients

summary (mod2)

## Linear mixed model fit by REML ['lmerMod']
## Formula: Haematocrit ~ year * (age + male + cardio + reject) + I(year -  
##     0.5) * x2 + ((1 | id) + (0 + year | id))
##    Data: dataset_long
## 
## REML criterion at convergence: 55964.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.7940 -0.5245  0.0159  0.5347  7.0406 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 11.3836  3.3740  
##  id.1     year         0.1559  0.3948  
##  Residual             15.2189  3.9011  
## Number of obs: 9551, groups:  id, 1159
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept) 28.1084073  0.5108369  55.024
## year        -0.1724597  0.0816457  -2.112
## age          0.0571252  0.0097957   5.832
## male         2.1831393  0.2474398   8.823
## cardio      -0.1890465  0.3344295  -0.565
## reject      -0.3596414  0.2676211  -1.344
## x2           8.0563302  0.1470475  54.787
## year:age    -0.0006073  0.0016230  -0.374
## year:male    0.0440920  0.0403412   1.093
## year:cardio  0.0049758  0.0556427   0.089
## year:reject -0.0008858  0.0429577  -0.021
## 
## Correlation of Fixed Effects:
##             (Intr) year   age    male   cardio reject x2     year:g yer:ml
## year        -0.345                                                        
## age         -0.890  0.337                                                 
## male        -0.287  0.108  0.003                                          
## cardio       0.146 -0.059 -0.285 -0.018                                   
## reject      -0.346  0.132  0.191  0.050 -0.029                            
## x2          -0.193 -0.087  0.019  0.008  0.000 -0.007                     
## year:age     0.335 -0.895 -0.376 -0.005  0.108 -0.068 -0.041              
## year:male    0.110 -0.294 -0.005 -0.371  0.008 -0.020 -0.014  0.015       
## year:cardio -0.057  0.152  0.108  0.008 -0.377  0.014  0.002 -0.283 -0.023
## year:reject  0.128 -0.356 -0.071 -0.021  0.014 -0.368  0.015  0.183  0.052
##             yr:crd
## year              
## age               
## male              
## cardio            
## reject            
## x2                
## year:age          
## year:male         
## year:cardio       
## year:reject -0.040
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients

The model chosen uses a segmented regression because it fits the patterns observed in the exploratory analysis, where the graphs show a change in the slope of hematocrit evolution around 6 months (\(k = 0.5\) years). The inclusion of an indicator function ((x2)) allows this change in slope to be modeled, assigning a value of 0 for observations with (year ) and 1 for those with (year > 0.5). This enables the model to estimate different slopes before and after the (k) point, more accurately capturing the transition from an initial rapid increase to a more gradual stabilization or change.

In addition, the model includes interactions between time (year) and individual characteristics (age, male, cardio, reject), which is supported by the between-group differences observed in the exploratory analysis, such as the impact of gender and rejection on hematocrit levels. The use of random effects for intercept and slope allows to reflect individual variability in the trajectories, observed as significant scatter in the individual plots.

5. Compare the two models, select the best one and answer the research interests based on appropriate inference.

anova(mod1,mod2)

## refitting model(s) with ML (instead of REML)

The segmented model (mod2) is the most suitable to describe the evolution of hematocrit, as it offers a better balance between fit and simplicity compared to the quadratic model (mod1). The AIC and BIC metrics, as well as the adaptive structure of the segmented model, support this conclusion.

confint(mod2)

## Computing profile confidence intervals ...

##                    2.5 %       97.5 %
## .sig01       3.189212184  3.550013731
## .sig02       0.353881034  0.432415706
## .sigma       3.839001147  3.964265526
## (Intercept) 27.108409849 29.108422965
## year        -0.332251800 -0.012630022
## age          0.037945204  0.076296595
## male         1.698978132  2.667759112
## cardio      -0.843313970  0.466107150
## reject      -0.883625305  0.164137102
## x2           7.767181798  8.344131920
## year:age    -0.003783023  0.002571559
## year:male   -0.034888517  0.123048317
## year:cardio -0.104092772  0.113875903
## year:reject -0.084948075  0.083209945

The study found that the evolution of hematocrit over time does not depend significantly on patient characteristics such as sex, age, cardiovascular problems or rejection symptoms. However, men show higher average hematocrit levels, and age has a slight positive effect on these levels. In addition, both the initial hematocrit level and its rate of change vary between individuals, reflecting significant individual variability in trajectories.