| age | sex | bmi0 | bmi10 | cigarettes | id | levels | cholesterol | year |
|---|---|---|---|---|---|---|---|---|
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 0 | 220 | 0 |
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 2 | 217 | 2 |
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 4 | 217 | 4 |
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 6 | 200 | 6 |
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 8 | 219 | 8 |
| 45 | Female | 22 | 22 | 0 | 1 | Chol Year 10 | 240 | 10 |
Longitudinal Analysis of Sex Differences in 10-Year Serum Cholesterol Trajectories
1 Introduction
Heart disease has been the leading cause of death in the United States since the early 1920s. It encompasses a range of conditions that affect the heart, including coronary artery disease, arrhythmias, congenital heart defects, diseases of the heart muscle, and heart valve disease. The heavy toll of the disease between the 1920s and 1945 created a pressing need to understand it, which led to the launch of the Framingham Heart Study in 1948. This long-term cohort study has played a crucial role in identifying the causes of cardiovascular disease and stroke and the risk factors associated with them. A milestone came in 1961 with the finding that high blood pressure and high cholesterol levels increase the risk of heart disease, which popularized the term “risk factor.” This concept paved the way for preventive approaches, a significant shift from the earlier practice of treating heart disease only after it had already caused harm, for example after a heart attack.
The Framingham Heart Study made it possible to attribute heart disease to specific risk factors, such as high cholesterol levels. Cholesterol is a type of lipid found in the body. The serum cholesterol level represents the total amount of cholesterol in the blood and is derived from high-density lipoprotein (HDL), low-density lipoprotein (LDL), and triglycerides. It is usually calculated by summing the HDL level, the LDL level, and 20% of the triglyceride level in a blood sample. The total is then compared with reference ranges: optimal, 125–200 mg/dL; borderline high, 200–239 mg/dL; and high, 240 mg/dL or more.
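The calculation and reference ranges described above can be sketched in a few lines of Python, used here purely for illustration. The function names are hypothetical. Because the quoted ranges overlap at 200 mg/dL, the sketch assumes that a value of exactly 200 counts as borderline high, and it labels values under 125 mg/dL separately because the text does not classify them.

```python
def total_cholesterol(hdl, ldl, triglycerides):
    """Total serum cholesterol (mg/dL): HDL + LDL + 20% of triglycerides."""
    return hdl + ldl + 0.2 * triglycerides


def classify_total_cholesterol(total):
    """Map a total cholesterol value (mg/dL) to the ranges quoted in the text."""
    if total >= 240:
        return "High"
    if total >= 200:  # assumption: exactly 200 is treated as borderline high
        return "Borderline high"
    if total >= 125:
        return "Optimal"
    return "Below quoted range"  # assumption: the text gives no label here


# Hypothetical lipid panel, not study data.
total = total_cholesterol(hdl=50, ldl=130, triglycerides=150)
print(round(total, 1), classify_total_cholesterol(total))
```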
Numerous factors, both controllable and uncontrollable, can contribute to high cholesterol levels. These include health conditions such as type 2 diabetes and obesity; lifestyle factors such as smoking, unhealthy eating patterns, and lack of physical activity; and uncontrollable factors such as family history, age, and sex. Women tend to have lower LDL levels (considered bad cholesterol) until approximately age 55, while men usually have lower HDL levels (known as good cholesterol) at any age.
Project Goal
The goal of the project is to investigate sex differences in serum cholesterol level trajectories over a 10-year follow-up period using Framingham Heart Study data. We aim to determine if mean serum cholesterol levels over time differ between males and females, adjusting for baseline age, BMI, and smoking habits.
2 Methods
Data Description
The dataset is a subset of the data collected in the Framingham study. It contains 11 columns and 2,634 rows, with each row representing an individual. The 11 columns are:
- Baseline age of participants (age)
- Sex at birth (sex; coded 1 for male, 2 for female)
- Baseline and 10-year follow-up BMI (bmi0, bmi10)
- Number of cigarettes smoked per day at baseline (cigarettes)
- Serum cholesterol levels measured biennially from baseline to the 10-year follow-up (chol0, chol2, chol4, chol6, chol8, chol10)
Data Preparation
The dataset was tidied and cleaned before analysis to ensure accuracy and completeness. This process involved:
- Converting cholesterol values (chol0 to chol10) of -9 to missing (NA), as per study guidelines.
- Removing all rows with missing values, so that only participants with complete data were retained.
- Reshaping the data into a long format, with each row representing one repeated measurement for an individual.
The cleaned dataset consists of 10,362 observations (1,727 participants with six measurements each) and nine variables.
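The three preparation steps can be sketched as follows. This is a minimal Python illustration, not the code used for the report (whose output comes from R). The function name is hypothetical, and assigning `id` from the original row position is an assumption. The first example record mirrors the long-format rows shown for participant 1; the second is invented to show a dropped record.

```python
MISSING_CODE = -9  # study code for a missing cholesterol value
CHOL_COLUMNS = ["chol0", "chol2", "chol4", "chol6", "chol8", "chol10"]
SEX_LABELS = {1: "Male", 2: "Female"}


def tidy_to_long(wide_rows):
    """Recode -9 to missing, drop incomplete participants, reshape to long."""
    long_rows = []
    for subject_id, row in enumerate(wide_rows, start=1):
        # Step 1: recode the missing-value code in the cholesterol columns.
        cleaned = {
            key: None if key in CHOL_COLUMNS and value == MISSING_CODE else value
            for key, value in row.items()
        }
        # Step 2: keep complete cases only.
        if any(value is None for value in cleaned.values()):
            continue
        # Step 3: one output row per participant and measurement year.
        for column in CHOL_COLUMNS:
            year = int(column[len("chol"):])
            long_rows.append({
                "age": cleaned["age"],
                "sex": SEX_LABELS[cleaned["sex"]],
                "bmi0": cleaned["bmi0"],
                "bmi10": cleaned["bmi10"],
                "cigarettes": cleaned["cigarettes"],
                "id": subject_id,  # assumption: id is the original row position
                "levels": f"Chol Year {year}",
                "cholesterol": cleaned[column],
                "year": year,
            })
    return long_rows


example = [
    {"age": 45, "sex": 2, "bmi0": 22, "bmi10": 22, "cigarettes": 0,
     "chol0": 220, "chol2": 217, "chol4": 217, "chol6": 200,
     "chol8": 219, "chol10": 240},
    # Hypothetical participant with one missing measurement (dropped).
    {"age": 50, "sex": 1, "bmi0": 27, "bmi10": 28, "cigarettes": 20,
     "chol0": 231, "chol2": -9, "chol4": 240, "chol6": 238,
     "chol8": 245, "chol10": 250},
]
long_rows = tidy_to_long(example)
print(len(long_rows), long_rows[0])
```

Each retained participant contributes six rows, which is why 1,727 complete participants yield 10,362 observations.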
Exploratory Data Analysis
An exploratory analysis of the longitudinal data was performed to examine changes in serum cholesterol levels over time in males and females. It aimed to identify unusual patterns or observations and to explore the correlation between measurements over time. The analysis comprised a summary table of cholesterol values at each time point, a summary of baseline characteristics, a spaghetti plot of individual trajectories stratified by sex, a plot of means showing the average response trajectory by sex, and a correlation plot of the repeated measurements stratified by sex.
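Two of these summaries, the means by year and sex and the correlation between two measurement occasions, reduce to simple computations on the long-format data. The Python sketch below is illustrative only: the function names are hypothetical and the records are toy values, not study data.

```python
from collections import defaultdict
from math import sqrt


def mean_by_year_and_sex(long_rows):
    """Average cholesterol for each (year, sex) cell, as in a plot of means."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in long_rows:
        cell = totals[(row["year"], row["sex"])]
        cell[0] += row["cholesterol"]
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def pearson(x, y):
    """Pearson correlation between two equally long lists of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy records, not study data.
toy = [
    {"id": 1, "sex": "Female", "year": 0, "cholesterol": 220},
    {"id": 1, "sex": "Female", "year": 2, "cholesterol": 217},
    {"id": 2, "sex": "Female", "year": 0, "cholesterol": 200},
    {"id": 2, "sex": "Female", "year": 2, "cholesterol": 205},
    {"id": 3, "sex": "Male", "year": 0, "cholesterol": 240},
    {"id": 3, "sex": "Male", "year": 2, "cholesterol": 250},
]
means = mean_by_year_and_sex(toy)
year0 = [r["cholesterol"] for r in toy if r["year"] == 0]
year2 = [r["cholesterol"] for r in toy if r["year"] == 2]
r_0_2 = pearson(year0, year2)
print(means[(0, "Female")], round(r_0_2, 3))
```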
Statistical Analysis
To address the project goal, we posed the following research question:
Does the rate of change in serum cholesterol level, measured biennially, differ between male and female participants after adjusting for baseline age, baseline BMI, and the number of cigarettes smoked at baseline?
To explore this research question, we utilized two approaches for longitudinal data analysis: the linear mixed effects model and generalized estimating equations.
Statistical Methods:
- Linear Mixed Effects Model: This model represents the mean as a combination of fixed effects (population characteristics) and random effects (subject-specific effects). It accounts for between- and within-subject sources of variability, which induces a specific correlation structure for observations in the same cluster. For this project, we fitted three variants of the linear mixed effects model:
  - Random intercept model: This model allows each subject to have a unique level of response (intercept) that deviates from the population average.
  - Random intercept and slope model: This model allows each subject to have a unique response level (intercept) and a unique rate of change over time (slope). It can be fitted with the random effects assumed either uncorrelated or correlated, giving two variants.
- Generalized Estimating Equations (GEE): This method fits a marginal model that separates the relationship between the response and the covariates from the correlation between observations in the same cluster. The correlation between repeated measurements is handled by specifying a working correlation matrix.
Model Specifications:
- Marginal Mean Model for GEE and Linear Mixed Effects Models
\(E[Cholesterol_{ij} \mid year_{j}, X] = \beta_0 + \beta_1 year_{j} + \beta_2 I_{sex=Female} + \beta_3 \, (year_{j} \times I_{sex=Female}) + \beta_4 age_{i0} + \beta_5 bmi_{i0} + \beta_6 cigarettes_{i0}\)
where:
- \(Cholesterol_{ij}\) is the serum cholesterol level for participant i at year j.
- \(\beta_1\) represents the rate of change per year in the expected serum cholesterol level for males, holding all other covariates fixed.
- \(\beta_2\) represents the difference in the expected serum cholesterol level between females and males at baseline, holding all other covariates fixed.
- \(\beta_3\) represents the difference in the rate of change per year in the expected serum cholesterol level between females and males, holding all other covariates fixed.
- \(age_{i0}\), \(bmi_{i0}\), and \(cigarettes_{i0}\) represent the baseline age, BMI, and number of cigarettes smoked for participant i, respectively.
- \(I_{sex=Female}\) is an indicator variable that equals 1 for females and 0 for males.
- Although cholesterol was measured biennially, year is recorded in years (0, 2, 4, 6, 8, 10), so \(\beta_1\) and \(\beta_3\) are expressed per year.
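The random intercept and slope model adds subject-specific terms to this mean model. The following is a sketch in the same notation. The symbols \(b_{0i}\), \(b_{1i}\), \(\varepsilon_{ij}\), \(\sigma_{b0}^2\), \(\sigma_{b1}^2\), and \(\sigma^2\) are introduced here for illustration, and the normality assumptions are the standard ones for this model class rather than something stated in the report:

\[
Cholesterol_{ij} = \beta_0 + \beta_1 year_{j} + \beta_2 I_{sex=Female} + \beta_3 \, (year_{j} \times I_{sex=Female}) + \beta_4 age_{i0} + \beta_5 bmi_{i0} + \beta_6 cigarettes_{i0} + b_{0i} + b_{1i} \, year_{j} + \varepsilon_{ij}
\]

\[
b_{0i} \sim N(0, \sigma_{b0}^2), \qquad b_{1i} \sim N(0, \sigma_{b1}^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\]

Setting \(b_{1i} = 0\) gives the random intercept model. The uncorrelated variant fixes \(Cov(b_{0i}, b_{1i}) = 0\), while the correlated variant estimates that covariance as one additional parameter.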
Working Correlation Matrix for GEE: GEE requires an assumed (working) correlation structure for the repeated measurements within a subject in order to estimate the model parameters. We considered four specifications:
- Independence: This structure assumes no correlation between repeated measurements on the same subject.
- Exchangeable: This structure assumes that the correlation between repeated measurements is constant, so all pairs of observations within a cluster are equally correlated.
- AR(1): This structure assumes that the correlation between repeated measurements decays as the interval between the observations increases.
- Unstructured: This structure imposes no pattern on the correlations, so each pair of measurement occasions has its own correlation parameter.
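The four structures can be written out explicitly for the six measurement occasions. The Python sketch below is illustrative only. The function names are hypothetical, and the AR(1) version indexes occasions by position, which assumes equally spaced measurements (as in the biennial design).

```python
def independence(n):
    """Identity matrix: no within-subject correlation."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def exchangeable(n, alpha):
    """Every pair of occasions shares the same correlation alpha."""
    return [[1.0 if i == j else alpha for j in range(n)] for i in range(n)]


def ar1(n, alpha):
    """Correlation alpha**lag, decaying with the gap between occasions."""
    return [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]


def unstructured(n, alphas):
    """alphas maps (i, j) with i < j to that pair's own correlation."""
    matrix = independence(n)
    for (i, j), value in alphas.items():
        matrix[i][j] = matrix[j][i] = value
    return matrix


def n_unstructured_parameters(n):
    """Number of distinct correlations in an unstructured n x n matrix."""
    return n * (n - 1) // 2


# alpha = 0.5 is an arbitrary illustrative value, not an estimate.
for row in ar1(6, 0.5):
    print([round(value, 3) for value in row])
print(n_unstructured_parameters(6))
```

With six occasions the unstructured matrix has 15 free correlations, which matches the 15 alpha parameters listed in the appendix output.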
Statistical Analysis Phases:
Linear Mixed-Effects Models:
- We fitted the linear mixed-effects models and chose the best-fitting model using the Akaike Information Criterion (AIC) and the Schwarz Bayesian Information Criterion (BIC).
- Using the selected model, we performed a likelihood ratio test of the null hypothesis \(H_0: \beta_3 = 0\), which states that the rate of change in the expected serum cholesterol level does not differ between females and males after adjusting for baseline age, BMI, and number of cigarettes.
Generalized Estimating Equations:
- We fitted marginal models with different specifications of the working correlation matrix to account for the correlation among observations in the same cluster, and selected the working correlation structure using the Correlation Information Criterion (CIC).
- Using the model selected by the CIC, we tested the null hypothesis \(H_0: \beta_3 = 0\), which states that the rate of change in the expected serum cholesterol level does not differ between females and males after adjusting for baseline age, BMI, and number of cigarettes.
- Because GEE is a semi-parametric approach with no likelihood function, we used Wald statistics for hypothesis testing.
3 Results
Exploratory Data Analysis
Summary Statistics of Cholesterol by Year
| Year | Count | Mean (mg/dL) | SD (mg/dL) |
|---|---|---|---|
| 0 | 1727 | 219.9 | 44.7 |
| 2 | 1727 | 225.2 | 43.4 |
| 4 | 1727 | 228.4 | 44.2 |
| 6 | 1727 | 238.9 | 44.0 |
| 8 | 1727 | 240.4 | 46.0 |
| 10 | 1727 | 249.4 | 44.3 |
Baseline Characteristics Summary
| Variable | N = 1,727 |
|---|---|
| sex, n (%) | |
| Male | 850 (49%) |
| Female | 877 (51%) |
| age | |
| Mean (IQR) | 43 (36, 49) |
| SD | 8 |
| bmi0 | |
| Mean (IQR) | 24.9 (22.0, 27.0) |
| SD | 4.3 |
| cigarettes | |
| Mean (IQR) | 10 (0, 20) |
| SD | 12 |

Sex is reported as n (%); the number of missing values was 0.
Data Visualizations
[Figure: Boxplot of cholesterol levels over time by sex]
[Figure: Correlation of responses]
[Figure: Plot of means]
[Figure: Spaghetti plot of individual trajectories over time]
Statistical Analysis Results
Linear Mixed Effect Models
Model Comparison Using Anova
| | Model | df | AIC | BIC | logLik | Test | L.Ratio | p-value |
|---|---|---|---|---|---|---|---|---|
| model1 | 1 | 9 | 98707.66 | 98772.87 | -49344.83 | | NA | NA |
| model2 | 2 | 10 | 98513.58 | 98586.04 | -49246.79 | 1 vs 2 | 196.080 | <0.001 |
| model3 | 3 | 11 | 98513.26 | 98592.96 | -49245.63 | 2 vs 3 | 2.327 | 0.127 |
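The AIC and BIC columns follow directly from the log-likelihood and the parameter count (df). The Python sketch below, for illustration only, recomputes them from the reported values. Taking n as the 10,362 observations is an assumption that happens to reproduce the reported BIC values.

```python
from math import log

N_OBS = 10362  # assumption: BIC penalty uses the number of observations


def aic(loglik, n_params):
    """Akaike Information Criterion: -2 logLik + 2 k."""
    return -2 * loglik + 2 * n_params


def bic(loglik, n_params, n_obs=N_OBS):
    """Schwarz Bayesian Information Criterion: -2 logLik + k ln(n)."""
    return -2 * loglik + n_params * log(n_obs)


# (logLik, df) copied from the model comparison table.
models = {
    "model1": (-49344.83, 9),
    "model2": (-49246.79, 10),
    "model3": (-49245.63, 11),
}
for name, (loglik, df) in models.items():
    print(name, round(aic(loglik, df), 2), round(bic(loglik, df), 2))
```

The comparison shows why the two criteria can disagree: model3 gains very little log-likelihood for its extra parameter, and the BIC penalty of ln(10362) per parameter (about 9.25) is larger than the AIC penalty of 2.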
Fixed Effects Summary
| | Estimate | Std. Error | DF | t-value | p-value |
|---|---|---|---|---|---|
| (Intercept) | 129.84 | 7.06 | 8633 | 18.40 | <0.001 |
| year | 2.34 | 0.11 | 8633 | 21.52 | <0.001 |
| sexFemale | -0.49 | 2.03 | 1722 | -0.24 | 0.81 |
| age | 1.27 | 0.11 | 1722 | 11.25 | <0.001 |
| bmi0 | 1.29 | 0.21 | 1722 | 6.01 | <0.001 |
| cigarettes | 0.29 | 0.08 | 1722 | 3.48 | <0.001 |
| year:sexFemale | 1.13 | 0.15 | 8633 | 7.36 | <0.001 |
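To make the fixed effects concrete, the Python sketch below (illustrative only) turns the estimates into sex-specific slopes and expected values. The coefficients are copied from the appendix output. The function name and the covariate profile (age 45, BMI 22, non-smoker) are chosen here for illustration. Because year is recorded in years, the slopes are per year.

```python
# Fixed-effect estimates copied from the appendix output of the selected model.
BETA = {
    "intercept": 129.84166, "year": 2.34429, "female": -0.49381,
    "age": 1.27445, "bmi0": 1.29256, "cigarettes": 0.29259,
    "year_female": 1.12519,
}


def expected_cholesterol(year, female, age, bmi0, cigarettes):
    """Population-average cholesterol (mg/dL) implied by the fixed effects."""
    is_female = 1 if female else 0
    return (BETA["intercept"]
            + BETA["year"] * year
            + BETA["female"] * is_female
            + BETA["year_female"] * year * is_female
            + BETA["age"] * age
            + BETA["bmi0"] * bmi0
            + BETA["cigarettes"] * cigarettes)


male_slope = BETA["year"]                          # mg/dL per year
female_slope = BETA["year"] + BETA["year_female"]  # mg/dL per year

# Illustrative profile: 45 years old, BMI 22, non-smoker at baseline.
profile = {"age": 45, "bmi0": 22, "cigarettes": 0}
male_change_10y = (expected_cholesterol(10, False, **profile)
                   - expected_cholesterol(0, False, **profile))
female_change_10y = (expected_cholesterol(10, True, **profile)
                     - expected_cholesterol(0, True, **profile))
print(round(male_slope, 2), round(female_slope, 2))
print(round(male_change_10y, 1), round(female_change_10y, 1))
```

The implied 10-year increases (roughly 23 mg/dL for males and 35 mg/dL for females) bracket the 29.5 mg/dL rise in the overall mean between year 0 and year 10 in the summary table.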
Random Effects Summary
Random-effects structure: id = pdDiag(1 + year)

| | Variance | StdDev |
|---|---|---|
| (Intercept) | 1265.183343 | 35.569416 |
| year | 3.689791 | 1.920883 |
| Residual | 447.080140 | 21.144270 |
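These variance components determine the within-subject correlation that the selected model implies. The Python sketch below is illustrative and uses hypothetical function names. It applies the standard covariance formula for a random intercept and slope model with uncorrelated random effects: Cov(Y_s, Y_t) = var_intercept + s * t * var_slope, plus the residual variance when s = t.

```python
from math import sqrt

# Variance components copied from the random effects summary.
VAR_INTERCEPT = 1265.183343
VAR_SLOPE = 3.689791
VAR_RESIDUAL = 447.080140


def implied_covariance(year_s, year_t):
    """Model-implied covariance between measurements at two years."""
    covariance = VAR_INTERCEPT + year_s * year_t * VAR_SLOPE
    if year_s == year_t:
        covariance += VAR_RESIDUAL  # measurement-level noise on the diagonal
    return covariance


def implied_correlation(year_s, year_t):
    """Model-implied correlation between measurements at two years."""
    return implied_covariance(year_s, year_t) / sqrt(
        implied_covariance(year_s, year_s) * implied_covariance(year_t, year_t)
    )


years = [0, 2, 4, 6, 8, 10]
print(round(sqrt(implied_covariance(0, 0)), 2))  # marginal SD at baseline
for year in years[1:]:
    print(year, round(implied_correlation(0, year), 3))
```

Under this model the correlation with the baseline measurement falls as the gap widens, because the random slope adds variance at later years but contributes nothing to the covariance with year 0.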
Hypothesis Testing Result
| | Model | df | AIC | BIC | logLik | Test | L.Ratio | p-value |
|---|---|---|---|---|---|---|---|---|
| model2_reduced | 1 | 9 | 98565.08 | 98630.29 | -49273.54 | | NA | NA |
| model2 | 2 | 10 | 98513.58 | 98586.04 | -49246.79 | 1 vs 2 | 53.499 | <0.001 |
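The likelihood ratio statistic is twice the difference in log-likelihoods and is referred to a chi-square distribution. Here the models differ by one parameter, so there is one degree of freedom. The Python sketch below is illustrative and uses a hypothetical function name. It relies on the identity P(chi-square with 1 df > x) = erfc(sqrt(x / 2)) to avoid external libraries.

```python
from math import erfc, sqrt


def lrt_one_df(loglik_reduced, loglik_full):
    """Likelihood ratio statistic and p-value for models differing by 1 df."""
    statistic = 2 * (loglik_full - loglik_reduced)
    p_value = erfc(sqrt(statistic / 2))  # chi-square(1) upper tail
    return statistic, p_value


# Test of H0: beta_3 = 0 (log-likelihoods from the table above).
interaction_stat, interaction_p = lrt_one_df(-49273.54, -49246.79)

# Uncorrelated vs. correlated random effects (model2 vs. model3).
covariance_stat, covariance_p = lrt_one_df(-49246.79, -49245.63)

print(round(interaction_stat, 2), interaction_p)
print(round(covariance_stat, 2), round(covariance_p, 3))
```

Small differences from the tabulated statistics arise because the log-likelihoods are rounded to two decimals.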
Generalized Estimating Equations
Correlation Structure Selection
| | QIC | QICu | Quasi Lik | CIC | params | QICC |
|---|---|---|---|---|---|---|
| mod1 | 19004372 | 19004338 | -9502162 | 23.69671 | 7 | 19004372 |
| mod2 | 18998434 | 18998400 | -9499193 | 23.92618 | 7 | 18998434 |
| mod3 | 18998434 | 18998400 | -9499193 | 23.92618 | 7 | 18998434 |
| mod4 | 19009923 | 19009887 | -9504937 | 24.79376 | 7 | 19009923 |
Marginal Model Summary
Outcome: cholesterol

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 132.65 | 118.81 – 146.50 | <0.001 |
| year | 2.29 | 2.09 – 2.50 | <0.001 |
| sex [Female] | -0.09 | -4.08 – 3.89 | 0.963 |
| age | 1.23 | 1.00 – 1.46 | <0.001 |
| bmi0 | 1.28 | 0.86 – 1.70 | <0.001 |
| cigarettes | 0.30 | 0.14 – 0.46 | <0.001 |
| year × sex [Female] | 1.13 | 0.83 – 1.43 | <0.001 |
| N id | 1727 | | |
| Observations | 10362 | | |
Hypothesis Testing Result
| | Df | X2 | P(>\|Chi\|) |
|---|---|---|---|
| year | 1 | 1341.480510 | <0.001 |
| sex | 1 | 1.254787 | 0.2626401 |
| age | 1 | 130.406641 | <0.001 |
| bmi0 | 1 | 31.011530 | <0.001 |
| cigarettes | 1 | 13.205569 | <0.001 |
| year:sex | 1 | 54.187656 | <0.001 |
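For a single coefficient, the Wald statistic is the squared ratio of the estimate to its standard error, referred to a chi-square distribution with one degree of freedom. The Python sketch below is illustrative and uses a hypothetical function name. It recomputes the statistic, p-value, and 95% confidence interval for the year-by-sex interaction from the estimate and standard error in the appendix GEE output, using a normal-approximation interval.

```python
from math import erfc, sqrt

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_test(estimate, std_err):
    """Wald chi-square (1 df), its p-value, and a 95% confidence interval."""
    statistic = (estimate / std_err) ** 2
    p_value = erfc(sqrt(statistic / 2))  # chi-square(1) upper tail
    interval = (estimate - Z_975 * std_err, estimate + Z_975 * std_err)
    return statistic, p_value, interval


# year:sexFemale estimate and standard error from the appendix GEE output.
statistic, p_value, interval = wald_test(1.12667, 0.15305)
print(round(statistic, 2), p_value)
print(round(interval[0], 2), round(interval[1], 2))
```

The recomputed values agree with the reported statistic (54.19), p-value (1.82e-13), and interval (0.83 to 1.43) up to rounding of the inputs.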
4 Discussion
This project used data from the Framingham Heart Study to examine sex differences in serum cholesterol level trajectories over a 10-year follow-up period. Our objective was to determine whether mean serum cholesterol levels over time differed between males and females after accounting for baseline age, BMI, and number of cigarettes smoked. The exploratory analysis, which comprised summary statistics, plots of individual and average trajectories, and correlation assessments, revealed several notable patterns in cholesterol trajectories for males and females. The main findings are as follows:
- There were 1,727 participants in the study, of whom 51% were female.
- At baseline, the participants’ mean age was 43, their mean BMI was 24.9, and they smoked an average of 10 cigarettes per day.
- The mean serum cholesterol level of the participants increased at each measurement time, with an overall increase of 29.5 mg/dL from baseline to the 10-year follow-up.
- Cholesterol levels measured at different time points were positively correlated, and the correlation generally declined as the time separation increased. One exception was the correlation between baseline and year 8, which rose after a consistent decrease; this pattern appeared in the overall sample and in both males and females.
- The plot of means showed approximately equal mean serum cholesterol levels for males and females at baseline. The male mean then rose steadily until year 4, when the two means were again equal, after which the female mean increased steadily.
To investigate the differences, we proposed a research question: Does the rate of change in mean serum cholesterol level measured biennially differ for male and female participants, adjusting for baseline age, baseline BMI, and the number of cigarettes smoked at baseline? To explore this question, we utilized two statistical approaches for analyzing longitudinal data: linear mixed effects models and generalized estimating equations.
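We compared the fit of three variants of the linear mixed effects model using AIC and BIC: random intercept (model1), random intercept and slope with uncorrelated random effects (model2), and random intercept and slope with correlated random effects (model3). Adding a random slope substantially improved the fit over the random intercept model (likelihood ratio 196.08, p < 0.001). Allowing the random effects to be correlated gave a nearly identical AIC (98513.26 vs. 98513.58), a larger BIC (98592.96 vs. 98586.04), and no significant improvement in fit (likelihood ratio 2.33, p = 0.127). The random intercept and slope model with uncorrelated random effects was therefore selected to address the research question.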
For the generalized estimating equations approach, we fitted marginal models with four specifications of the working correlation matrix and compared them using the Correlation Information Criterion. The CIC values of the four candidate models were close, ranging from 23.70 to 24.79. The unstructured working correlation matrix was selected to account for within-subject correlation.
Both the GEE and LMM approaches produced similar estimates for the fixed effects, supporting the reliability of our findings. The interaction between sex and time was consistent across both models: females exhibited a greater increase in mean serum cholesterol levels over time than males, with both models estimating the difference in slope to be 1.13 mg/dL per year. The effects of the covariates (baseline age, BMI, and number of cigarettes smoked) were also comparable between the two models, with small differences in the magnitude of the estimates. These differences reflect the different assumptions of the two approaches: GEE estimates population-averaged effects under a working correlation matrix, while the LMM models within-subject correlation through subject-specific random effects.
The random effects summary from the LMM approach revealed the following findings:
- The standard deviation of the subject-specific deviations in the baseline serum cholesterol level (random intercept) is 35.569 mg/dL.
- The standard deviation of the subject-specific deviations in the rate of change of serum cholesterol level (random slope) is 1.921 mg/dL per year.
- The residual standard deviation is 21.144 mg/dL, representing the unexplained variation in cholesterol levels after accounting for fixed and random effects.
In conclusion, our analysis revealed significant sex differences in the trajectories of serum cholesterol levels over time. Females showed a greater increase in mean serum cholesterol levels than males. Both the GEE and LMM approaches estimated this difference in the rate of change to be 1.13 mg/dL per year, which corresponds to about 2.3 mg/dL between successive biennial measurements. This finding addresses our research question and underscores the importance of accounting for sex-specific patterns in cardiovascular health research and interventions. Future studies should prioritize thorough model validation, including comprehensive diagnostic assessments, to verify that the underlying assumptions are met and that the resulting interpretations and conclusions are reliable.
5 Appendix
Linear Mixed Models Result
Linear mixed-effects model fit by maximum likelihood
Data: framingham_longer
AIC BIC logLik
98513.58 98586.04 -49246.79
Random effects:
Formula: ~1 + year | id
Structure: Diagonal
(Intercept) year Residual
StdDev: 35.56942 1.920883 21.14427
Fixed effects: cholesterol ~ year * sex + age + bmi0 + cigarettes
Value Std.Error DF t-value p-value
(Intercept) 129.84166 7.057768 8633 18.396986 0.0000
year 2.34429 0.108917 8633 21.523612 0.0000
sexFemale -0.49381 2.026063 1722 -0.243729 0.8075
age 1.27445 0.113237 1722 11.254756 0.0000
bmi0 1.29256 0.214937 1722 6.013648 0.0000
cigarettes 0.29259 0.084023 1722 3.482225 0.0005
year:sexFemale 1.12519 0.152842 8633 7.361789 0.0000
Correlation:
(Intr) year sexFml age bmi0 cgrtts
year -0.049
sexFemale -0.284 0.170
age -0.573 0.000 0.004
bmi0 -0.667 0.000 0.120 -0.176
cigarettes -0.370 0.000 0.380 0.172 0.106
year:sexFemale 0.035 -0.713 -0.239 0.000 0.000 0.000
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-6.53103762 -0.53069852 -0.01978626 0.50323530 10.51714011
Number of Observations: 10362
Number of Groups: 1727
Generalized Estimating Equations Result
Call:
geeglm(formula = cholesterol ~ year * sex + age + bmi0 + cigarettes,
data = framingham_longer, id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 132.65251 7.06335 352.704 < 2e-16 ***
year 2.29421 0.10549 472.954 < 2e-16 ***
sexFemale -0.09333 2.03279 0.002 0.963380
age 1.22908 0.11632 111.642 < 2e-16 ***
bmi0 1.27728 0.21429 35.528 2.51e-09 ***
cigarettes 0.30029 0.08264 13.205 0.000279 ***
year:sexFemale 1.12667 0.15305 54.188 1.82e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1834 80.76
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.6839 0.02336
alpha.1:3 0.6838 0.02091
alpha.1:4 0.6577 0.02149
alpha.1:5 0.7206 0.03426
alpha.1:6 0.6705 0.02237
alpha.2:3 0.7046 0.02355
alpha.2:4 0.6997 0.02453
alpha.2:5 0.7265 0.02146
alpha.2:6 0.6738 0.02064
alpha.3:4 0.7376 0.02296
alpha.3:5 0.7742 0.01726
alpha.3:6 0.7287 0.02201
alpha.4:5 0.8187 0.02150
alpha.4:6 0.7838 0.02122
alpha.5:6 0.8490 0.02917
Number of clusters: 1727 Maximum cluster size: 6