DAG part1 Create a Directed Acyclic Graph (DAG) to represent the causal relationships of smoking intensity and smoking duration with systolic blood pressure, in the presence of other variables.
Create a Directed Acyclic Graph (DAG) to determine the associations of smoking intensity and smoking duration with systolic and diastolic blood pressure, as well as the diagnosis of high blood pressure.
Read in Assignment1_data.csv and appropriately manage missing data. Among the baseline variables, describe the missing data pattern and identify factors associated with missingness. One approach is to create a table comparing variables between participants with and without missing data. Which missingness assumption do you consider? Please justify your conclusion.
## Rows: 1,629
## Columns: 27
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ id <int> 233, 235, 244, 245, 252, 257, 262, 266, 419, 420, 428, …
## $ sex <int> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0…
## $ age <int> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, 51,…
## $ race <int> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ education <fct> 8th or less, High school dropout, High school dropout, …
## $ income <int> 19, 18, 15, 15, 18, 11, 19, 22, 18, 16, 19, 18, 16, NA,…
## $ marital <fct> Married, Married, Widowed, Widowed, Married, Never marr…
## $ active <fct> Very active, Very active, Very active, Moderately activ…
## $ alcoholfreq <fct> 2-3 times/week, Almost every day, < 12 times/year, 1-4 …
## $ tobacco_price <dbl> 2.183594, 2.346680, 1.569580, 1.506592, 2.346680, 2.209…
## $ tobacco_tax <dbl> 1.1022949, 1.3649902, 0.5512695, 0.5249023, 1.3649902, …
## $ smokeintensity <int> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10, 4…
## $ smokeyrs <int> 29, 24, 26, 53, 19, 21, 39, 9, 37, 25, 24, 20, 19, 38, …
## $ height <dbl> 174.1875, 159.3750, 168.5000, 170.1875, 181.8750, 162.1…
## $ weight <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.74,…
## $ asthma <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cholesterol <int> 197, 301, 157, 174, 216, 212, 205, 166, 337, 279, 173, …
## $ diabetes <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ dbp <int> 96, 80, 75, 78, 77, 83, 69, 53, 79, 106, 89, 69, 80, 10…
## $ sbp <int> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 135, …
## $ hbp <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ death <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1…
## $ dadth <int> NA, NA, NA, 14, NA, NA, NA, NA, 13, 17, NA, 28, NA, NA,…
## $ modth <int> NA, NA, NA, 2, NA, NA, NA, NA, 10, 10, NA, 11, NA, NA, …
## $ yrdth <int> NA, NA, NA, 85, NA, NA, NA, NA, 84, 86, NA, 92, NA, NA,…
## $ income_group <fct> $5000 - $14999, $5000 - $14999, $3000 - $4999, $3000 - …
## X id sex age race education active smokeintensity smokeyrs height weight
## 704 1 1 1 1 1 1 1 1 1 1 1
## 697 1 1 1 1 1 1 1 1 1 1 1
## 44 1 1 1 1 1 1 1 1 1 1 1
## 38 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1
## 47 1 1 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 1 1 1 1
## 20 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0
## asthma cholesterol death marital alcoholfreq income income_group sbp dbp
## 704 1 1 1 1 1 1 1 1 1
## 697 1 1 1 1 1 1 1 1 1
## 44 1 1 1 1 1 1 1 1 1
## 38 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 0
## 47 1 1 1 1 1 1 1 0 0
## 23 1 1 1 1 1 1 1 0 0
## 3 1 1 1 1 1 1 1 0 0
## 2 1 1 1 1 1 1 1 0 0
## 34 1 1 1 1 1 0 0 1 1
## 20 1 1 1 1 1 0 0 1 1
## 3 1 1 1 1 1 0 0 1 1
## 2 1 1 1 1 1 0 0 1 1
## 1 1 1 1 1 1 0 0 1 0
## 2 1 1 1 1 1 0 0 0 0
## 5 1 1 1 1 0 1 1 1 1
## 1 1 1 1 0 1 1 1 1 1
## 0 0 0 1 5 62 62 77 81
## tobacco_price tobacco_tax diabetes hbp
## 704 1 1 1 1 0
## 697 1 1 0 0 2
## 44 0 0 1 1 2
## 38 0 0 0 0 4
## 2 1 1 1 1 1
## 1 1 1 0 0 3
## 47 1 1 1 1 2
## 23 1 1 0 0 4
## 3 0 0 1 1 4
## 2 0 0 0 0 6
## 34 1 1 1 1 2
## 20 1 1 0 0 4
## 3 0 0 1 1 4
## 2 0 0 0 0 6
## 1 1 1 1 1 3
## 2 1 1 0 0 6
## 5 1 1 0 0 3
## 1 1 1 0 0 3
## 92 92 791 791 2054
##
## Variables sorted by number of missings:
## Variable Count
## diabetes 0.4855739718
## hbp 0.4855739718
## tobacco_price 0.0564763659
## tobacco_tax 0.0564763659
## dbp 0.0497237569
## sbp 0.0472682627
## income 0.0380601596
## income_group 0.0380601596
## alcoholfreq 0.0030693677
## marital 0.0006138735
## X 0.0000000000
## id 0.0000000000
## sex 0.0000000000
## age 0.0000000000
## race 0.0000000000
## education 0.0000000000
## active 0.0000000000
## smokeintensity 0.0000000000
## smokeyrs 0.0000000000
## height 0.0000000000
## weight 0.0000000000
## asthma 0.0000000000
## cholesterol 0.0000000000
## death 0.0000000000
##
## FALSE TRUE
## 1472 157
## X id sex age race
## 0 0 0 0 0
## education income marital active alcoholfreq
## 0 62 1 0 5
## tobacco_price tobacco_tax smokeintensity smokeyrs height
## 92 92 0 0 0
## weight asthma cholesterol diabetes dbp
## 0 0 0 791 81
## sbp hbp death dadth modth
## 77 791 0 1307 1307
## yrdth income_group CompleteCase
## 1311 62 0
| Characteristic | N | Complete case: TRUE, N = 157¹ | Complete case: FALSE, N = 1,472¹ | p-value² |
|---|---|---|---|---|
| Sex (female) | 1,629 | 65 (41%) | 765 (52%) | 0.012 |
| Age | 1,629 | 56 (49, 62) | 42 (33, 51) | <0.001 |
| Race (Non-White) | 1,629 | 31 (20%) | 184 (13%) | 0.011 |
| Education level | 1,629 | | | <0.001 |
| 8th or less | | 66 (42%) | 245 (17%) | |
| High school dropout | | 32 (20%) | 319 (22%) | |
| High school | | 39 (25%) | 620 (42%) | |
| College dropout | | 9 (5.7%) | 117 (7.9%) | |
| College or more | | 11 (7.0%) | 171 (12%) | |
| Marital status | 1,628 | | | 0.051 |
| Under 17 | | 0 (0%) | 0 (0%) | |
| Married | | 116 (74%) | 1,163 (79%) | |
| Widowed | | 17 (11%) | 80 (5.4%) | |
| Never married | | 6 (3.8%) | 90 (6.1%) | |
| Divorced | | 10 (6.4%) | 89 (6.1%) | |
| Separated | | 8 (5.1%) | 49 (3.3%) | |
| Physical activity | 1,629 | | | 0.4 |
| Very active | | 67 (43%) | 662 (45%) | |
| Moderately active | | 78 (50%) | 660 (45%) | |
| Inactive | | 12 (7.6%) | 150 (10%) | |
| Alcohol consumption | 1,624 | | | 0.002 |
| Almost every day | | 39 (25%) | 297 (20%) | |
| 2-3 times/week | | 15 (9.6%) | 216 (15%) | |
| 1-4 times/month | | 34 (22%) | 472 (32%) | |
| < 12 times/year | | 37 (24%) | 307 (21%) | |
| No alcohol last year | | 32 (20%) | 175 (12%) | |
| Tobacco prices | 1,537 | 2.18 (2.09, 2.31) | 2.17 (2.04, 2.24) | 0.8 |
| In-State tobacco tax | 1,537 | 1.05 (1.00, 1.15) | 1.05 (0.94, 1.15) | 0.6 |
| Smoke intensity | 1,629 | 20 (10, 25) | 20 (10, 30) | 0.5 |
| Smoke years | 1,629 | 36 (27, 42) | 23 (14, 32) | <0.001 |
| Height | 1,629 | 169 (163, 175) | 168 (162, 175) | 0.7 |
| Weight | 1,629 | 71 (61, 81) | 69 (59, 80) | 0.084 |
| Asthma | 1,629 | 5 (3.2%) | 74 (5.0%) | 0.3 |
| Cholesterol | 1,629 | 231 (193, 261) | 216 (188, 245) | 0.003 |
| Diabetes | 838 | 2 (1.3%) | 12 (1.8%) | >0.9 |
| Diastolic Blood Pressure | 1,548 | 78 (70, 86) | 77 (70, 84) | 0.4 |
| Systolic Blood Pressure | 1,552 | 138 (126, 149) | 125 (115, 138) | <0.001 |
| High Blood Pressure | 838 | 37 (24%) | 93 (14%) | 0.002 |
| Death | 1,629 | 157 (100%) | 161 (11%) | <0.001 |
| Family income | 1,567 | | | <0.001 |
| <$3000 | | 33 (21%) | 124 (8.8%) | |
| $3000 - $4999 | | 31 (20%) | 115 (8.2%) | |
| $5000 - $14999 | | 72 (46%) | 772 (55%) | |
| >=$15000 | | 21 (13%) | 399 (28%) | |
| ¹ n (%); Median (Q1, Q3) | | | | |
| ² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test | | | | |
Conclusion: The missing data pattern appears to be non-monotone, meaning the variables cannot be ordered so that missingness in one implies missingness in all later ones. We removed the date-of-death variables because their missingness is expected when death = 0, and because among participants with death = 1 we found no missing dates of death. Several variables are missing in pairs: SBP and DBP, tobacco price and tobacco tax, income and income_group, and diabetes and HBP. In decreasing order of missingness, the affected variables are diabetes, HBP, tobacco_price, tobacco_tax, DBP, SBP, income, income_group, alcoholfreq, and marital status, as the histogram of missing data shows. Compared with complete cases, participants with missing data include a higher proportion of women, are younger, include fewer non-White individuals, have higher education and income, differ in alcohol consumption frequency, have smoked for fewer years, have lower cholesterol, and are less often hypertensive. We assume the data are missing completely at random (MCAR), although the differences above suggest this assumption may not hold, in which case the results can be biased. This approach gives more power than a complete-case analysis but is probably not the best option, so we will perform multiple imputation.
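The ordering of variables by missingness can be reproduced directly from the per-variable missing counts printed above. The following is a minimal sketch in plain Python (the report's own output comes from R); the function name `missing_proportions` is an illustrative assumption, and the counts and sample size are copied from the output.

```python
# Missing-value counts copied from the per-variable summary printed above.
N_PARTICIPANTS = 1629
missing_counts = {
    "diabetes": 791, "hbp": 791,
    "tobacco_price": 92, "tobacco_tax": 92,
    "dbp": 81, "sbp": 77,
    "income": 62, "income_group": 62,
    "alcoholfreq": 5, "marital": 1,
}


def missing_proportions(counts, n):
    """Return {variable: proportion missing}, ordered from most to least missing."""
    props = {name: count / n for name, count in counts.items()}
    # sorted() is stable, so tied variables keep their original order.
    return dict(sorted(props.items(), key=lambda item: item[1], reverse=True))


proportions = missing_proportions(missing_counts, N_PARTICIPANTS)
for name, prop in proportions.items():
    print(f"{name:15s} {prop:.4f}")
```

The resulting proportions should agree with the "Variables sorted by number of missings" listing, which reports proportions rather than raw counts despite its "Count" column heading.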
Assuming that data are missing at random (MAR), implement Multiple Imputation by Chained Equations (MICE) to impute the missing data. It is your decision what variable(s) to impute (e.g., only covariates, covariates plus exposure, or all variables in the analysis). Provide a detailed description of your approach, including the imputation methods used, variables included in the imputation model, the number of iterations, the number of imputed datasets, and diagnostic checks.
##
## iter imp variable
## 1 1 tobacco_price income_group alcoholfreq
## 1 2 tobacco_price income_group alcoholfreq
## 1 3 tobacco_price income_group alcoholfreq
## 1 4 tobacco_price income_group alcoholfreq
## 1 5 tobacco_price income_group alcoholfreq
## 1 6 tobacco_price income_group alcoholfreq
## 1 7 tobacco_price income_group alcoholfreq
## 1 8 tobacco_price income_group alcoholfreq
## 1 9 tobacco_price income_group alcoholfreq
## 1 10 tobacco_price income_group alcoholfreq
## 2 1 tobacco_price income_group alcoholfreq
## 2 2 tobacco_price income_group alcoholfreq
## 2 3 tobacco_price income_group alcoholfreq
## 2 4 tobacco_price income_group alcoholfreq
## 2 5 tobacco_price income_group alcoholfreq
## 2 6 tobacco_price income_group alcoholfreq
## 2 7 tobacco_price income_group alcoholfreq
## 2 8 tobacco_price income_group alcoholfreq
## 2 9 tobacco_price income_group alcoholfreq
## 2 10 tobacco_price income_group alcoholfreq
## 3 1 tobacco_price income_group alcoholfreq
## 3 2 tobacco_price income_group alcoholfreq
## 3 3 tobacco_price income_group alcoholfreq
## 3 4 tobacco_price income_group alcoholfreq
## 3 5 tobacco_price income_group alcoholfreq
## 3 6 tobacco_price income_group alcoholfreq
## 3 7 tobacco_price income_group alcoholfreq
## 3 8 tobacco_price income_group alcoholfreq
## 3 9 tobacco_price income_group alcoholfreq
## 3 10 tobacco_price income_group alcoholfreq
## 4 1 tobacco_price income_group alcoholfreq
## 4 2 tobacco_price income_group alcoholfreq
## 4 3 tobacco_price income_group alcoholfreq
## 4 4 tobacco_price income_group alcoholfreq
## 4 5 tobacco_price income_group alcoholfreq
## 4 6 tobacco_price income_group alcoholfreq
## 4 7 tobacco_price income_group alcoholfreq
## 4 8 tobacco_price income_group alcoholfreq
## 4 9 tobacco_price income_group alcoholfreq
## 4 10 tobacco_price income_group alcoholfreq
## 5 1 tobacco_price income_group alcoholfreq
## 5 2 tobacco_price income_group alcoholfreq
## 5 3 tobacco_price income_group alcoholfreq
## 5 4 tobacco_price income_group alcoholfreq
## 5 5 tobacco_price income_group alcoholfreq
## 5 6 tobacco_price income_group alcoholfreq
## 5 7 tobacco_price income_group alcoholfreq
## 5 8 tobacco_price income_group alcoholfreq
## 5 9 tobacco_price income_group alcoholfreq
## 5 10 tobacco_price income_group alcoholfreq
## 6 1 tobacco_price income_group alcoholfreq
## 6 2 tobacco_price income_group alcoholfreq
## 6 3 tobacco_price income_group alcoholfreq
## 6 4 tobacco_price income_group alcoholfreq
## 6 5 tobacco_price income_group alcoholfreq
## 6 6 tobacco_price income_group alcoholfreq
## 6 7 tobacco_price income_group alcoholfreq
## 6 8 tobacco_price income_group alcoholfreq
## 6 9 tobacco_price income_group alcoholfreq
## 6 10 tobacco_price income_group alcoholfreq
## 7 1 tobacco_price income_group alcoholfreq
## 7 2 tobacco_price income_group alcoholfreq
## 7 3 tobacco_price income_group alcoholfreq
## 7 4 tobacco_price income_group alcoholfreq
## 7 5 tobacco_price income_group alcoholfreq
## 7 6 tobacco_price income_group alcoholfreq
## 7 7 tobacco_price income_group alcoholfreq
## 7 8 tobacco_price income_group alcoholfreq
## 7 9 tobacco_price income_group alcoholfreq
## 7 10 tobacco_price income_group alcoholfreq
## 8 1 tobacco_price income_group alcoholfreq
## 8 2 tobacco_price income_group alcoholfreq
## 8 3 tobacco_price income_group alcoholfreq
## 8 4 tobacco_price income_group alcoholfreq
## 8 5 tobacco_price income_group alcoholfreq
## 8 6 tobacco_price income_group alcoholfreq
## 8 7 tobacco_price income_group alcoholfreq
## 8 8 tobacco_price income_group alcoholfreq
## 8 9 tobacco_price income_group alcoholfreq
## 8 10 tobacco_price income_group alcoholfreq
## 9 1 tobacco_price income_group alcoholfreq
## 9 2 tobacco_price income_group alcoholfreq
## 9 3 tobacco_price income_group alcoholfreq
## 9 4 tobacco_price income_group alcoholfreq
## 9 5 tobacco_price income_group alcoholfreq
## 9 6 tobacco_price income_group alcoholfreq
## 9 7 tobacco_price income_group alcoholfreq
## 9 8 tobacco_price income_group alcoholfreq
## 9 9 tobacco_price income_group alcoholfreq
## 9 10 tobacco_price income_group alcoholfreq
## 10 1 tobacco_price income_group alcoholfreq
## 10 2 tobacco_price income_group alcoholfreq
## 10 3 tobacco_price income_group alcoholfreq
## 10 4 tobacco_price income_group alcoholfreq
## 10 5 tobacco_price income_group alcoholfreq
## 10 6 tobacco_price income_group alcoholfreq
## 10 7 tobacco_price income_group alcoholfreq
## 10 8 tobacco_price income_group alcoholfreq
## 10 9 tobacco_price income_group alcoholfreq
## 10 10 tobacco_price income_group alcoholfreq
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## id sbp smokeintensity smokeyrs tobacco_price
## "" "" "" "" "pmm"
## age sex race income_group alcoholfreq
## "" "" "" "polyreg" "polyreg"
## education active
## "" ""
## PredictorMatrix:
## id sbp smokeintensity smokeyrs tobacco_price age sex race
## id 0 1 1 1 1 1 1 1
## sbp 1 0 1 1 1 1 1 1
## smokeintensity 1 1 0 1 1 1 1 1
## smokeyrs 1 1 1 0 1 1 1 1
## tobacco_price 1 1 1 1 0 1 1 1
## age 1 1 1 1 1 0 1 1
## income_group alcoholfreq education active
## id 1 1 1 1
## sbp 1 1 1 1
## smokeintensity 1 1 1 1
## smokeyrs 1 1 1 1
## tobacco_price 1 1 1 1
## age 1 1 1 1
We chose to impute only the covariates identified as confounders that had missing data and left all other variables as observed, so each imputed dataset contains both imputed and non-imputed variables. Three variables were imputed: income group (categorical with more than two levels) by polytomous regression, tobacco price (continuous) by predictive mean matching, and frequency of alcohol consumption (categorical with more than two levels) by polytomous regression. The imputation model also contained SBP, smoking intensity, smoking duration, age, sex, race, education, and physical activity as predictors. We ran 10 iterations to generate 10 imputed datasets and visually inspected the imputed variables. Overall, the imputed values aligned well with the observed data, as the density plots show.
Perform appropriate analyses to estimate the associations between smoking intensity and smoking duration with systolic blood pressure in the complete dataset (i.e., participants without any missing data) and the imputed datasets. Please report your findings and indicate whether you observe any discrepancies between these two methods.
| Characteristic | Complete Data: Beta | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: Beta | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| smokeintensity | -0.16 | -0.53, 0.20 | 0.4 | -0.06 | -0.13, 0.02 | 0.2 |
| tobacco_price | -14 | -29, 1.7 | 0.081 | 0.54 | -3.4, 4.5 | 0.8 |
| age | 0.15 | -0.21, 0.52 | 0.4 | 0.56 | 0.48, 0.64 | <0.001 |
| as.factor(sex) | ||||||
| 0 | — | — | — | — | ||
| 1 | 1.9 | -6.2, 9.9 | 0.6 | -4.2 | -6.1, -2.3 | <0.001 |
| as.factor(race) | ||||||
| 0 | — | — | — | — | ||
| 1 | -1.5 | -12, 9.3 | 0.8 | 5.9 | 3.2, 8.6 | <0.001 |
| as.factor(income_group) | ||||||
| <$3000 | — | — | — | — | ||
| $3000 - $4999 | -3.7 | -14, 7.1 | 0.5 | 1.9 | -2.2, 6.1 | 0.4 |
| $5000 - $14999 | 1.9 | -7.9, 12 | 0.7 | 1.7 | -1.6, 5.0 | 0.3 |
| >=$15000 | -8.9 | -22, 4.7 | 0.2 | 1.3 | -2.5, 5.0 | 0.5 |
| as.factor(alcoholfreq) | ||||||
| Almost every day | — | — | — | — | ||
| 2-3 times/week | -4.4 | -17, 8.6 | 0.5 | -0.60 | -3.6, 2.4 | 0.7 |
| 1-4 times/month | -0.20 | -11, 10 | >0.9 | -4.0 | -6.5, -1.5 | 0.002 |
| < 12 times/year | -0.05 | -10, 10 | >0.9 | -2.2 | -4.9, 0.63 | 0.13 |
| No alcohol last year | -2.7 | -13, 8.0 | 0.6 | -2.3 | -5.6, 0.92 | 0.2 |
| as.factor(education) | ||||||
| 8th or less | — | — | — | — | ||
| High school dropout | -2.8 | -13, 7.1 | 0.6 | -2.1 | -5.0, 0.76 | 0.2 |
| High school | -8.4 | -18, 1.4 | 0.093 | -3.7 | -6.4, -0.99 | 0.008 |
| College dropout | -1.1 | -18, 15 | 0.9 | -1.7 | -5.7, 2.2 | 0.4 |
| College or more | -8.5 | -24, 7.4 | 0.3 | -5.1 | -8.7, -1.4 | 0.007 |
| as.factor(active) | ||||||
| Very active | — | — | — | — | ||
| Moderately active | 6.5 | -1.3, 14 | 0.10 | 0.28 | -1.6, 2.1 | 0.8 |
| Inactive | 15 | 1.2, 28 | 0.033 | 0.30 | -2.8, 3.4 | 0.9 |
| ¹ CI = Confidence Interval | | | | | | |
| Characteristic | Complete Data: Beta | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: Beta | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| smokeyrs | 0.22 | -0.24, 0.68 | 0.3 | -0.06 | -0.21, 0.10 | 0.5 |
| tobacco_price | -14 | -29, 1.7 | 0.082 | 0.52 | -3.4, 4.4 | 0.8 |
| age | 0.02 | -0.49, 0.52 | >0.9 | 0.61 | 0.46, 0.76 | <0.001 |
| as.factor(sex) | ||||||
| 0 | — | — | — | — | ||
| 1 | 4.5 | -3.8, 13 | 0.3 | -4.1 | -6.0, -2.2 | <0.001 |
| as.factor(race) | ||||||
| 0 | — | — | — | — | ||
| 1 | 0.27 | -9.8, 10 | >0.9 | 6.2 | 3.5, 8.9 | <0.001 |
| as.factor(income_group) | ||||||
| <$3000 | — | — | — | — | ||
| $3000 - $4999 | -3.8 | -15, 6.9 | 0.5 | 1.9 | -2.3, 6.1 | 0.4 |
| $5000 - $14999 | 2.6 | -7.1, 12 | 0.6 | 1.6 | -1.7, 5.0 | 0.3 |
| >=$15000 | -8.5 | -22, 5.1 | 0.2 | 1.1 | -2.6, 4.8 | 0.6 |
| as.factor(alcoholfreq) | ||||||
| Almost every day | — | — | — | — | ||
| 2-3 times/week | -3.9 | -17, 9.2 | 0.6 | -0.59 | -3.6, 2.4 | 0.7 |
| 1-4 times/month | 1.4 | -9.3, 12 | 0.8 | -3.8 | -6.3, -1.3 | 0.003 |
| < 12 times/year | 1.4 | -9.2, 12 | 0.8 | -2.1 | -4.8, 0.72 | 0.15 |
| No alcohol last year | -3.0 | -14, 7.5 | 0.6 | -2.2 | -5.4, 1.1 | 0.2 |
| as.factor(education) | ||||||
| 8th or less | — | — | — | — | ||
| High school dropout | -3.0 | -13, 6.9 | 0.6 | -2.2 | -5.1, 0.62 | 0.13 |
| High school | -7.6 | -17, 2.2 | 0.13 | -3.8 | -6.5, -1.1 | 0.007 |
| College dropout | -1.6 | -18, 15 | 0.9 | -1.9 | -5.9, 2.0 | 0.3 |
| College or more | -6.7 | -22, 8.8 | 0.4 | -5.1 | -8.8, -1.4 | 0.007 |
| as.factor(active) | ||||||
| Very active | — | — | — | — | ||
| Moderately active | 6.6 | -1.2, 14 | 0.10 | 0.25 | -1.6, 2.1 | 0.8 |
| Inactive | 15 | 1.7, 29 | 0.027 | 0.21 | -2.9, 3.3 | 0.9 |
| ¹ CI = Confidence Interval | | | | | | |
Interpretation:
Smoking intensity and SBP: The association between smoking intensity and SBP weakens after imputation, with the coefficient shifting from -0.16 (95% CI: -0.53, 0.20, p = 0.4) in the complete-case analysis to -0.06 (95% CI: -0.13, 0.02, p = 0.2) in the imputed model. The narrower confidence interval in the imputed model suggests improved precision, while the smaller coefficient suggests that the complete-case analysis may have overestimated the association. In both models, the association between smoking intensity and SBP is not statistically significant.
Smoking duration and SBP: The association between smoking duration and SBP weakens and changes sign after imputation, with the coefficient shifting from 0.22 (95% CI: -0.24, 0.68, p = 0.3) in the complete-case analysis to -0.06 (95% CI: -0.21, 0.10, p = 0.5) in the imputed model. The imputed model's narrower confidence interval suggests improved precision, but the reversal in direction may indicate either that missing data biased the complete-case estimate or that the imputation model, or the set of imputed variables, needs to be revised. In both models, the association between smoking duration and SBP is not statistically significant.
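The report does not state how the estimates from the 10 imputed datasets were combined into the single "Imputed Covars" column; the standard approach is Rubin's rules. The following is a minimal sketch of those rules in plain Python. The function name `pool_rubin` and the per-dataset numbers in the usage example are hypothetical and are not taken from the analysis.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates  -- the coefficient estimated in each imputed dataset
    std_errors -- the corresponding standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = mean(estimates)                       # average of the m estimates
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between         # total variance
    return pooled, math.sqrt(total)


# Hypothetical per-dataset results, for illustration only.
example_estimates = [-0.05, -0.07, -0.06, -0.06, -0.05]
example_std_errors = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled_beta, pooled_se = pool_rubin(example_estimates, example_std_errors)
print(f"pooled beta = {pooled_beta:.3f}, pooled SE = {pooled_se:.3f}")
```

Because the total variance adds a between-imputation component, the pooled standard error is never smaller than the average within-dataset standard error, which is how the pooled intervals account for uncertainty about the imputed values.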
Perform a proper analysis to examine whether the associations between smoking intensity and duration (in years) with systolic blood pressure differ by biological sex in the complete data set and the imputed data sets. Please report your findings.
| Characteristic | Complete Data: Beta | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: Beta | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| smokeintensity | -0.12 | -0.53, 0.28 | 0.6 | 0.03 | -0.08, 0.13 | 0.6 |
| as.factor(sex) | ||||||
| 0 | — | — | — | — | ||
| 1 | 4.5 | -10, 19 | 0.5 | -0.40 | -4.0, 3.2 | 0.8 |
| tobacco_price | -14 | -29, 1.8 | 0.082 | 0.59 | -3.3, 4.5 | 0.8 |
| age | 0.14 | -0.22, 0.51 | 0.4 | 0.56 | 0.48, 0.63 | <0.001 |
| as.factor(race) | ||||||
| 0 | — | — | — | — | ||
| 1 | -1.5 | -12, 9.3 | 0.8 | 5.9 | 3.2, 8.6 | <0.001 |
| as.factor(income_group) | ||||||
| <$3000 | — | — | — | — | ||
| $3000 - $4999 | -3.6 | -14, 7.2 | 0.5 | 2.1 | -2.1, 6.3 | 0.3 |
| $5000 - $14999 | 1.9 | -7.9, 12 | 0.7 | 1.7 | -1.6, 5.0 | 0.3 |
| >=$15000 | -8.7 | -22, 5.0 | 0.2 | 1.2 | -2.5, 5.0 | 0.5 |
| as.factor(alcoholfreq) | ||||||
| Almost every day | — | — | — | — | ||
| 2-3 times/week | -4.6 | -18, 8.5 | 0.5 | -0.61 | -3.6, 2.4 | 0.7 |
| 1-4 times/month | -0.23 | -11, 10 | >0.9 | -4.0 | -6.5, -1.5 | 0.002 |
| < 12 times/year | -0.24 | -11, 10 | >0.9 | -2.2 | -5.0, 0.57 | 0.12 |
| No alcohol last year | -2.5 | -13, 8.2 | 0.6 | -2.4 | -5.6, 0.86 | 0.2 |
| as.factor(education) | ||||||
| 8th or less | — | — | — | — | ||
| High school dropout | -2.7 | -13, 7.3 | 0.6 | -2.2 | -5.0, 0.66 | 0.13 |
| High school | -8.3 | -18, 1.5 | 0.094 | -3.8 | -6.5, -1.1 | 0.006 |
| College dropout | -1.4 | -18, 15 | 0.9 | -1.9 | -5.9, 2.0 | 0.3 |
| College or more | -8.9 | -25, 7.2 | 0.3 | -5.2 | -8.8, -1.5 | 0.006 |
| as.factor(active) | ||||||
| Very active | — | — | — | — | ||
| Moderately active | 6.5 | -1.3, 14 | 0.10 | 0.24 | -1.6, 2.1 | 0.8 |
| Inactive | 15 | 1.2, 29 | 0.033 | 0.40 | -2.7, 3.5 | 0.8 |
| smokeintensity * as.factor(sex) | ||||||
| smokeintensity * 1 | -0.15 | -0.85, 0.55 | 0.7 | -0.19 | -0.34, -0.03 | 0.017 |
| ¹ CI = Confidence Interval | | | | | | |
| Characteristic | Complete Data: Beta | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: Beta | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| smokeyrs | 0.15 | -0.42, 0.72 | 0.6 | -0.18 | -0.35, -0.01 | 0.036 |
| as.factor(sex) | ||||||
| 0 | — | — | — | — | ||
| 1 | -0.21 | -24, 24 | >0.9 | -10 | -14, -6.2 | <0.001 |
| tobacco_price | -14 | -30, 1.6 | 0.078 | 0.55 | -3.4, 4.5 | 0.8 |
| age | 0.03 | -0.48, 0.55 | 0.9 | 0.62 | 0.47, 0.77 | <0.001 |
| as.factor(race) | ||||||
| 0 | — | — | — | — | ||
| 1 | 0.14 | -10, 10 | >0.9 | 6.2 | 3.5, 8.8 | <0.001 |
| as.factor(income_group) | ||||||
| <$3000 | — | — | — | — | ||
| $3000 - $4999 | -3.7 | -15, 7.1 | 0.5 | 2.0 | -2.2, 6.2 | 0.3 |
| $5000 - $14999 | 2.5 | -7.3, 12 | 0.6 | 1.6 | -1.7, 4.9 | 0.3 |
| >=$15000 | -8.7 | -22, 5.0 | 0.2 | 1.3 | -2.4, 5.0 | 0.5 |
| as.factor(alcoholfreq) | ||||||
| Almost every day | — | — | — | — | ||
| 2-3 times/week | -3.7 | -17, 9.4 | 0.6 | -0.55 | -3.5, 2.4 | 0.7 |
| 1-4 times/month | 1.4 | -9.3, 12 | 0.8 | -3.6 | -6.1, -1.1 | 0.004 |
| < 12 times/year | 1.7 | -9.1, 13 | 0.8 | -2.0 | -4.7, 0.80 | 0.2 |
| No alcohol last year | -2.8 | -13, 7.9 | 0.6 | -2.3 | -5.5, 0.97 | 0.2 |
| as.factor(education) | ||||||
| 8th or less | — | — | — | — | ||
| High school dropout | -3.2 | -13, 6.8 | 0.5 | -2.5 | -5.3, 0.36 | 0.087 |
| High school | -7.4 | -17, 2.4 | 0.14 | -4.0 | -6.8, -1.3 | 0.003 |
| College dropout | -1.3 | -18, 15 | 0.9 | -2.1 | -6.0, 1.9 | 0.3 |
| College or more | -6.9 | -22, 8.7 | 0.4 | -5.8 | -9.5, -2.1 | 0.002 |
| as.factor(active) | ||||||
| Very active | — | — | — | — | ||
| Moderately active | 6.7 | -1.1, 15 | 0.093 | 0.36 | -1.5, 2.2 | 0.7 |
| Inactive | 15 | 1.6, 29 | 0.028 | 0.20 | -2.9, 3.3 | 0.9 |
| smokeyrs * as.factor(sex) | ||||||
| smokeyrs * 1 | 0.13 | -0.51, 0.77 | 0.7 | 0.25 | 0.10, 0.40 | <0.001 |
| ¹ CI = Confidence Interval | | | | | | |
Interpretation: A) Smoking intensity and SBP with a sex interaction: The estimated association between smoking intensity and SBP changes once the interaction with sex is included. The main-effect coefficient, which is the slope in the reference category (sex = 0), is -0.12 (95% CI: -0.53, 0.28, p = 0.6) in the complete-case model and 0.03 (95% CI: -0.08, 0.13, p = 0.6) in the imputed model. The interaction term between smoking intensity and sex also differs: -0.15 (95% CI: -0.85, 0.55, p = 0.7) in the complete-case model versus -0.19 (95% CI: -0.34, -0.03, p = 0.017) in the imputed model, which suggests that sex may modify the relationship between smoking intensity and SBP. The discrepancies between the two models may reflect systematic differences in missing data, where individuals with missing values differ from those with complete observations. However, imputation relies on model assumptions, and bias may have been introduced if key predictors were omitted or an inappropriate method was used. Additionally, imputed values are not truly observed, which can artificially reduce variance and alter effect estimates. If the data were not MAR but instead depended on unmeasured factors, the imputed results could be of limited validity, or invalid, despite their narrower confidence intervals.
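In a model with an exposure-by-sex interaction, the slope for sex = 0 is the main-effect coefficient and the slope for sex = 1 is the main effect plus the interaction term. The sketch below applies that arithmetic to the imputed-model coefficients reported in the two tables above; the helper name `slope_by_sex` is an illustrative assumption. Confidence intervals for the sex = 1 slopes would also need the covariance between the two coefficients, which the tables do not report.

```python
def slope_by_sex(main_effect, interaction):
    """Return the exposure slope within each sex stratum.

    main_effect -- coefficient of the exposure (slope when sex = 0)
    interaction -- coefficient of the exposure * sex term
    """
    return {"sex = 0": main_effect, "sex = 1": main_effect + interaction}


# Imputed-model coefficients copied from the interaction tables above.
intensity_slopes = slope_by_sex(main_effect=0.03, interaction=-0.19)
duration_slopes = slope_by_sex(main_effect=-0.18, interaction=0.25)

for label, slopes in [("smokeintensity", intensity_slopes),
                      ("smokeyrs", duration_slopes)]:
    for stratum, slope in slopes.items():
        print(f"{label:15s} {stratum}: {slope:+.2f}")
```

Read this way, the imputed models give slopes of opposite sign in the two strata for both exposures, which is what the significant interaction terms reflect.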
To impose the assumption that SBP increases linearly per multiplicative change in smoking intensity, we decide to apply a log base 2 transformation to smoking intensity when estimating the association. Perform this analysis in the complete data set and the imputed data sets. Please report your findings.
| Characteristic | Complete Data: Beta | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: Beta | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| log2(smokeintensity) | -0.92 | -4.7, 2.9 | 0.6 | -0.67 | -1.5, 0.15 | 0.11 |
| tobacco_price | -14 | -29, 1.9 | 0.084 | 0.51 | -3.4, 4.4 | 0.8 |
| age | 0.17 | -0.20, 0.53 | 0.4 | 0.56 | 0.48, 0.64 | <0.001 |
| as.factor(sex) | ||||||
| 0 | — | — | — | — | ||
| 1 | 2.5 | -5.5, 10 | 0.5 | -4.2 | -6.1, -2.3 | <0.001 |
| as.factor(race) | ||||||
| 0 | — | — | — | — | ||
| 1 | -0.76 | -12, 10 | 0.9 | 5.9 | 3.2, 8.6 | <0.001 |
| as.factor(income_group) | ||||||
| <$3000 | — | — | — | — | ||
| $3000 - $4999 | -3.7 | -14, 7.1 | 0.5 | 1.9 | -2.3, 6.1 | 0.4 |
| $5000 - $14999 | 1.9 | -8.0, 12 | 0.7 | 1.7 | -1.6, 5.0 | 0.3 |
| >=$15000 | -9.2 | -23, 4.4 | 0.2 | 1.3 | -2.5, 5.0 | 0.5 |
| as.factor(alcoholfreq) | ||||||
| Almost every day | — | — | — | — | ||
| 2-3 times/week | -4.5 | -18, 8.6 | 0.5 | -0.56 | -3.6, 2.4 | 0.7 |
| 1-4 times/month | 0.16 | -10, 11 | >0.9 | -3.9 | -6.4, -1.4 | 0.002 |
| < 12 times/year | 0.05 | -10, 11 | >0.9 | -2.1 | -4.9, 0.66 | 0.14 |
| No alcohol last year | -2.9 | -14, 7.8 | 0.6 | -2.3 | -5.5, 0.95 | 0.2 |
| as.factor(education) | ||||||
| 8th or less | — | — | — | — | ||
| High school dropout | -2.7 | -13, 7.2 | 0.6 | -2.1 | -4.9, 0.79 | 0.2 |
| High school | -8.1 | -18, 1.6 | 0.10 | -3.7 | -6.4, -0.97 | 0.008 |
| College dropout | -1.7 | -18, 15 | 0.8 | -1.8 | -5.8, 2.2 | 0.4 |
| College or more | -7.9 | -24, 8.2 | 0.3 | -5.1 | -8.8, -1.4 | 0.007 |
| as.factor(active) | ||||||
| Very active | — | — | — | — | ||
| Moderately active | 6.7 | -1.1, 15 | 0.091 | 0.25 | -1.6, 2.1 | 0.8 |
| Inactive | 15 | 1.5, 29 | 0.030 | 0.31 | -2.8, 3.4 | 0.8 |
| ¹ CI = Confidence Interval | | | | | | |
Interpretation: The association between smoking intensity and SBP remained non-significant in both the complete-case and imputed models after applying a log2 transformation to the exposure. In the complete-case analysis the estimate was beta = -0.92 (95% CI: -4.7, 2.9, p = 0.6), while in the imputed model it was slightly attenuated to beta = -0.67 (95% CI: -1.5, 0.15, p = 0.11); each coefficient is the change in SBP per doubling of smoking intensity. The imputed model was more precise, as its narrower confidence interval shows, but the effect size remained small and non-significant.
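With a log2-transformed exposure, the coefficient is the expected change in SBP per doubling of smoking intensity, and a k-fold change corresponds to beta × log2(k). The sketch below illustrates this with the imputed-model coefficient reported above; the function name `sbp_change` and the intensity values 10, 20, and 40 are illustrative assumptions.

```python
import math


def sbp_change(beta_log2, intensity_from, intensity_to):
    """Expected change in SBP when smoking intensity moves between two values,
    given the coefficient of log2(smokeintensity)."""
    return beta_log2 * math.log2(intensity_to / intensity_from)


BETA_IMPUTED = -0.67  # coefficient of log2(smokeintensity), imputed model

doubling = sbp_change(BETA_IMPUTED, 10, 20)      # one doubling
quadrupling = sbp_change(BETA_IMPUTED, 10, 40)   # two doublings
print(f"10 -> 20: {doubling:+.2f}")
print(f"10 -> 40: {quadrupling:+.2f}")
```

The change depends only on the ratio of the two intensities, so moving from 5 to 10 gives the same expected difference as moving from 20 to 40.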
Create a DAG to represent causal relationships between CD4 counts at baseline and opportunistic infection after one year, in the presence of other variables. Please provide your DAG.
Read in CD4_infection.csv. Describe the missing pattern and identify factors associated with missingness. Which missingness assumption do you consider? Please justify your conclusion.
## Rows: 80
## Columns: 11
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ septrin <int> 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, …
## $ site <int> 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ sex <int> 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, …
## $ age <int> 20, 32, 27, 40, 39, 35, 44, 23, 19, 38, 27, 35, 46, 39, 28, …
## $ activity <int> 1, 4, 12, 8, 2, 5, 7, 4, 6, 8, 6, 2, 4, 3, 9, 10, 2, 12, 10,…
## $ income <dbl> 22.1, 18.0, 32.0, 41.0, 12.0, NA, 25.0, NA, 10.0, 29.0, 25.0…
## $ residence <int> 1, 0, 2, 2, 0, NA, 1, 2, NA, 1, 0, NA, 2, 1, 1, NA, 2, 1, NA…
## $ art_base <int> 0, 1, NA, 0, 1, 0, 1, 0, 0, 0, 1, NA, 0, 1, NA, 1, 1, 0, 1, …
## $ cd4cnt <dbl> 4.75, 5.40, 3.10, 2.90, 4.05, 4.10, 5.05, 1.80, NA, 5.10, 5.…
## $ oi <int> NA, 0, 1, 1, 0, 0, NA, 1, 0, 0, 0, 1, 0, NA, 1, 0, 1, 0, 0, …
## id septrin site sex age activity income residence
## 0 0 0 0 0 0 4 5
## art_base cd4cnt oi
## 3 3 7
## id septrin site sex age activity art_base cd4cnt income residence oi
## 63 1 1 1 1 1 1 1 1 1 1 1 0
## 7 1 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 1 1 0 1 1
## 2 1 1 1 1 1 1 1 1 0 1 1 1
## 2 1 1 1 1 1 1 1 1 0 0 1 2
## 1 1 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 0 1 2
## 2 1 1 1 1 1 1 0 1 1 1 1 1
## 1 1 1 1 1 1 1 0 0 1 0 1 3
## 0 0 0 0 0 0 3 3 4 5 7 22
##
## Variables sorted by number of missings:
## Variable Count
## oi 0.0875
## residence 0.0625
## income 0.0500
## art_base 0.0375
## cd4cnt 0.0375
## id 0.0000
## septrin 0.0000
## site 0.0000
## sex 0.0000
## age 0.0000
## activity 0.0000
##
## FALSE TRUE
## 17 63
##
## Yes No
## 63 17
| Characteristic | N | Complete case: Yes, N = 63¹ | Complete case: No, N = 17¹ | p-value² |
|---|---|---|---|---|
| Age | 80 | 32 (27, 40) | 29 (27, 39) | 0.5 |
| Sex (Female) | 80 | 28 (44%) | 6 (35%) | 0.5 |
| Household Income | 76 | 25 (18, 32) | 22 (19, 25) | 0.2 |
| Num. Sex Acts | 80 | 7 (4, 9) | 5 (3, 7) | 0.2 |
| Taking Septrin | 80 | 47 (75%) | 12 (71%) | 0.8 |
| Study site | 80 | 27 (43%) | 5 (29%) | 0.3 |
| Distance to clinic | 75 | | | 0.082 |
| 0 | | 25 (40%) | 1 (8.3%) | |
| 1 | | 27 (43%) | 7 (58%) | |
| 2 | | 11 (17%) | 4 (33%) | |
| ART at baseline | 77 | 21 (33%) | 5 (36%) | >0.9 |
| CD4 count (per 100 cells/μL) | 77 | 4.20 (2.15, 5.20) | 4.30 (2.50, 5.05) | 0.8 |
| Opportunistic infection | 73 | 25 (40%) | 4 (40%) | >0.9 |
| ¹ Median (Q1, Q3); n (%) | | | | |
| ² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test | | | | |
Interpretation: The missingness pattern appears to be non-monotone. The variables with the most missing values are the outcome (opportunistic infection), distance to clinic (residence), income, ART at baseline, and CD4 T-cell count, as shown in the histogram. In the table comparing complete and incomplete cases, the largest differences are in number of sex acts, study site, and distance to clinic, although none is statistically significant (all p > 0.05). We assume MCAR or at least MAR.
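The non-monotone claim can be checked against the missing-data pattern table above. A pattern is monotone when the sets of missing variables are nested, so that for any two patterns one set contains the other. The following is a minimal sketch in plain Python; the function name `is_monotone` is an illustrative assumption, and the nine patterns are transcribed from the rows of the pattern output.

```python
from itertools import combinations

# Sets of missing variables, one per row of the pattern table above
# (the first row is the 63 complete cases).
patterns = [
    set(),
    {"oi"},
    {"residence"},
    {"income"},
    {"income", "residence"},
    {"cd4cnt"},
    {"cd4cnt", "residence"},
    {"art_base"},
    {"art_base", "cd4cnt", "residence"},
]


def is_monotone(missing_sets):
    """Return True if every pair of missing-variable sets is nested."""
    return all(a <= b or b <= a for a, b in combinations(missing_sets, 2))


print(is_monotone(patterns))
```

Here the check fails as soon as two patterns such as {oi} and {residence} are compared, since neither contains the other.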
Assuming that data are missing at random, implement MICE to impute the missing data. It is your decision what variable(s) to impute (e.g., only covariates, covariates plus exposure, or all variables in the analysis). Provide a detailed description of your approach, including the imputation methods used, variables included in the imputation model, the number of iterations, the number of imputed datasets, and diagnostic checks.
##
## iter imp variable
## 1 1 income residence art_base cd4cnt
## 1 2 income residence art_base cd4cnt
## 1 3 income residence art_base cd4cnt
## 1 4 income residence art_base cd4cnt
## 1 5 income residence art_base cd4cnt
## 1 6 income residence art_base cd4cnt
## 1 7 income residence art_base cd4cnt
## 1 8 income residence art_base cd4cnt
## 1 9 income residence art_base cd4cnt
## 1 10 income residence art_base cd4cnt
## 2 1 income residence art_base cd4cnt
## 2 2 income residence art_base cd4cnt
## 2 3 income residence art_base cd4cnt
## 2 4 income residence art_base cd4cnt
## 2 5 income residence art_base cd4cnt
## 2 6 income residence art_base cd4cnt
## 2 7 income residence art_base cd4cnt
## 2 8 income residence art_base cd4cnt
## 2 9 income residence art_base cd4cnt
## 2 10 income residence art_base cd4cnt
## 3 1 income residence art_base cd4cnt
## 3 2 income residence art_base cd4cnt
## 3 3 income residence art_base cd4cnt
## 3 4 income residence art_base cd4cnt
## 3 5 income residence art_base cd4cnt
## 3 6 income residence art_base cd4cnt
## 3 7 income residence art_base cd4cnt
## 3 8 income residence art_base cd4cnt
## 3 9 income residence art_base cd4cnt
## 3 10 income residence art_base cd4cnt
## 4 1 income residence art_base cd4cnt
## 4 2 income residence art_base cd4cnt
## 4 3 income residence art_base cd4cnt
## 4 4 income residence art_base cd4cnt
## 4 5 income residence art_base cd4cnt
## 4 6 income residence art_base cd4cnt
## 4 7 income residence art_base cd4cnt
## 4 8 income residence art_base cd4cnt
## 4 9 income residence art_base cd4cnt
## 4 10 income residence art_base cd4cnt
## 5 1 income residence art_base cd4cnt
## 5 2 income residence art_base cd4cnt
## 5 3 income residence art_base cd4cnt
## 5 4 income residence art_base cd4cnt
## 5 5 income residence art_base cd4cnt
## 5 6 income residence art_base cd4cnt
## 5 7 income residence art_base cd4cnt
## 5 8 income residence art_base cd4cnt
## 5 9 income residence art_base cd4cnt
## 5 10 income residence art_base cd4cnt
## 6 1 income residence art_base cd4cnt
## 6 2 income residence art_base cd4cnt
## 6 3 income residence art_base cd4cnt
## 6 4 income residence art_base cd4cnt
## 6 5 income residence art_base cd4cnt
## 6 6 income residence art_base cd4cnt
## 6 7 income residence art_base cd4cnt
## 6 8 income residence art_base cd4cnt
## 6 9 income residence art_base cd4cnt
## 6 10 income residence art_base cd4cnt
## 7 1 income residence art_base cd4cnt
## 7 2 income residence art_base cd4cnt
## 7 3 income residence art_base cd4cnt
## 7 4 income residence art_base cd4cnt
## 7 5 income residence art_base cd4cnt
## 7 6 income residence art_base cd4cnt
## 7 7 income residence art_base cd4cnt
## 7 8 income residence art_base cd4cnt
## 7 9 income residence art_base cd4cnt
## 7 10 income residence art_base cd4cnt
## 8 1 income residence art_base cd4cnt
## 8 2 income residence art_base cd4cnt
## 8 3 income residence art_base cd4cnt
## 8 4 income residence art_base cd4cnt
## 8 5 income residence art_base cd4cnt
## 8 6 income residence art_base cd4cnt
## 8 7 income residence art_base cd4cnt
## 8 8 income residence art_base cd4cnt
## 8 9 income residence art_base cd4cnt
## 8 10 income residence art_base cd4cnt
## 9 1 income residence art_base cd4cnt
## 9 2 income residence art_base cd4cnt
## 9 3 income residence art_base cd4cnt
## 9 4 income residence art_base cd4cnt
## 9 5 income residence art_base cd4cnt
## 9 6 income residence art_base cd4cnt
## 9 7 income residence art_base cd4cnt
## 9 8 income residence art_base cd4cnt
## 9 9 income residence art_base cd4cnt
## 9 10 income residence art_base cd4cnt
## 10 1 income residence art_base cd4cnt
## 10 2 income residence art_base cd4cnt
## 10 3 income residence art_base cd4cnt
## 10 4 income residence art_base cd4cnt
## 10 5 income residence art_base cd4cnt
## 10 6 income residence art_base cd4cnt
## 10 7 income residence art_base cd4cnt
## 10 8 income residence art_base cd4cnt
## 10 9 income residence art_base cd4cnt
## 10 10 income residence art_base cd4cnt
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## id age sex income activity septrin site residence
## "" "" "" "pmm" "" "" "" "polyreg"
## art_base cd4cnt oi
## "logreg" "pmm" ""
## PredictorMatrix:
## id age sex income activity septrin site residence art_base cd4cnt oi
## id 0 1 1 1 1 1 1 1 1 1 1
## age 1 0 1 1 1 1 1 1 1 1 1
## sex 1 1 0 1 1 1 1 1 1 1 1
## income 1 1 1 0 1 1 1 1 1 1 1
## activity 1 1 1 1 0 1 1 1 1 1 1
## septrin 1 1 1 1 1 0 1 1 1 1 1
## $id
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $age
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $sex
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $income
## 1 2 3 4 5 6 7 8 9 10
## 6 57 47 57 32 57 57 35 32 47 57
## 8 29 14 28 14 28 41 32 30 19 23
## 13 47 57 57 57 35 32 41 47 57 41
## 19 47 41 57 42 47 30 19 41 47 57
##
## $activity
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $septrin
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $site
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $residence
## 1 2 3 4 5 6 7 8 9 10
## 6 2 2 2 1 2 2 2 1 2 2
## 9 0 0 1 0 0 0 0 0 1 0
## 12 1 0 1 0 0 0 1 0 1 0
## 16 0 0 0 0 0 0 0 0 0 0
## 19 2 1 2 2 2 0 0 1 2 2
##
## $art_base
## 1 2 3 4 5 6 7 8 9 10
## 3 0 1 0 1 0 0 0 0 0 0
## 12 0 1 0 1 1 1 1 0 0 1
## 15 0 0 0 0 0 1 0 0 0 0
##
## $cd4cnt
## 1 2 3 4 5 6 7 8 9 10
## 9 4.9 2.5 5.4 5.2 6.05 5.15 5.1 5.50 5.1 5.20
## 12 1.6 1.5 1.9 2.0 1.65 2.15 3.1 4.20 2.9 2.15
## 23 3.4 4.9 5.4 5.1 6.80 5.40 3.2 5.05 5.2 2.50
##
## $oi
## 1 2 3 4 5 6 7 8 9 10
## 1 NA NA NA NA NA NA NA NA NA NA
## 7 NA NA NA NA NA NA NA NA NA NA
## 14 NA NA NA NA NA NA NA NA NA NA
## 20 NA NA NA NA NA NA NA NA NA NA
## 25 NA NA NA NA NA NA NA NA NA NA
## 36 NA NA NA NA NA NA NA NA NA NA
## 39 NA NA NA NA NA NA NA NA NA NA
We chose to impute the main exposure and the confounders that had
missing data, and left the fully observed variables and the outcome
(oi) as they were. The imputed datasets therefore contain both imputed
and non-imputed variables. The imputation methods were: predictive mean
matching (pmm) for cd4cnt and income, treated as continuous; logistic
regression (logreg) for art_base (dichotomous); and polytomous
regression (polyreg) for residence (categorical with more than two
levels). We imputed the exposure as well because the dataset is small
(80 observations), CD4 count had only a few missing values, and the
variable can be modeled relatively easily since there are well-known
predictors of low CD4 cell counts. We generated 10 imputed datasets
with 10 iterations each and visually inspected the imputed variables.
Overall, the imputed values aligned well with the observed data, as
evidenced by the density plots.
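A minimal sketch of such a MICE setup with the `mice` package is shown below. The data are simulated toy values and the seed is arbitrary. Unlike the predictor matrix printed above, this sketch also removes `id` as a predictor, which is an assumption about good practice rather than what the report did.

```r
library(mice)

# Hypothetical toy data reusing the study's variable names (values are made up)
set.seed(2024)
n <- 80
dat <- data.frame(
  id        = seq_len(n),
  age       = round(rnorm(n, 33, 8)),
  sex       = factor(rbinom(n, 1, 0.4)),
  income    = round(rnorm(n, 25, 8)),
  residence = factor(sample(0:2, n, replace = TRUE)),
  art_base  = factor(rbinom(n, 1, 0.35)),
  cd4cnt    = round(runif(n, 1, 7), 2)
)
dat$income[sample(n, 4)]    <- NA
dat$residence[sample(n, 5)] <- NA
dat$art_base[sample(n, 3)]  <- NA
dat$cd4cnt[sample(n, 3)]    <- NA

# One imputation method per incomplete variable; complete variables keep ""
meth <- make.method(dat)
meth[c("income", "cd4cnt")] <- "pmm"      # predictive mean matching
meth["residence"]           <- "polyreg"  # polytomous regression
meth["art_base"]            <- "logreg"   # logistic regression

# Use every variable as a predictor except the identifier (assumption)
pred <- make.predictorMatrix(dat)
pred[, "id"] <- 0

imp <- mice(dat, m = 10, maxit = 10, method = meth,
            predictorMatrix = pred, seed = 123, printFlag = FALSE)

# Diagnostics: trace plots for convergence, densities of observed vs imputed
plot(imp)
densityplot(imp)
```

Setting `printFlag = FALSE` suppresses the per-iteration log, which keeps the rendered report shorter.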
Perform appropriate analyses to estimate the causal relationship between CD4 counts at baseline and opportunistic infection in both the complete dataset and the imputed datasets. Please report your findings and indicate whether you observe any disparity in the results between these two methods.
| Characteristic | Complete Data: OR¹ | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: OR¹ | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| cd4cnt | 0.03 | 0.00, 0.18 | 0.009 | 0.03 | 0.00, 0.33 | 0.005 |
| activity | 1.09 | 0.59, 1.99 | 0.8 | 1.14 | 0.70, 1.84 | 0.6 |
| age | 0.90 | 0.47, 1.41 | 0.6 | 0.87 | 0.59, 1.27 | 0.5 |
| income | 1.23 | 0.74, 4.03 | 0.6 | 1.04 | 0.66, 1.64 | 0.9 |
| as.factor(sex) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 0.00 | 0.00, 2.34 | 0.3 | 0.01 | 0.00, 15.5 | 0.2 |
| as.factor(art_base) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 4.78 | 0.01, 51,694 | 0.6 | 6.19 | 0.01, 2,821 | 0.6 |
| as.factor(residence) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 11.5 | 0.01, 94,215 | 0.5 | 46.5 | 0.09, 23,991 | 0.2 |
| 2 | 0.62 | 0.00, 10,501,915 | >0.9 | 85.5 | 0.01, 648,966 | 0.3 |
| ¹ OR = Odds Ratio, CI = Confidence Interval | | | | | | |
The association between baseline CD4 count and the development of an opportunistic infection at one-year follow-up was consistent across the complete-case and imputed-data models. In both models, higher CD4 counts were strongly associated with reduced odds of developing an opportunistic infection: the odds ratio (OR) for CD4 count is 0.03, with confidence intervals that do not include 1 (p = 0.009 in the complete data, p = 0.005 in the imputed data), indicating a strong protective association. However, the CI for CD4 count is slightly wider in the imputed-data model (0.00, 0.33 vs. 0.00, 0.18). This might reflect the between-imputation variance added when pooling, an imputation model that does not fully capture the true relationships between variables, or a misspecified analysis model.
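The two analyses can be run side by side as in the sketch below, which fits the same logistic model to the complete cases and to each imputed dataset and then pools with Rubin's rules. The data are simulated toy values with a larger n than the study (to avoid separation), and the tables above may have been produced with different code.

```r
library(mice)

# Hypothetical toy data (values are made up; n is larger than in the study)
set.seed(11)
n <- 200
cd4cnt <- round(runif(n, 1, 7), 2)
age    <- round(rnorm(n, 33, 8))
oi     <- rbinom(n, 1, plogis(1.5 - 0.6 * cd4cnt))
dat    <- data.frame(oi, cd4cnt, age)
dat$cd4cnt[sample(n, 10)] <- NA   # make the exposure partly missing

# Complete-case analysis: glm() drops rows with missing values by default
cc_fit <- glm(oi ~ cd4cnt + age, family = binomial, data = dat)
cc_or  <- exp(cbind(OR = coef(cc_fit), confint.default(cc_fit)))  # Wald CIs
cc_or

# Multiple imputation (numeric variables default to pmm), then pooled analysis
imp    <- mice(dat, m = 10, maxit = 10, seed = 123, printFlag = FALSE)
mi_fit <- with(imp, glm(oi ~ cd4cnt + age, family = binomial))
pooled <- summary(pool(mi_fit), conf.int = TRUE, exponentiate = TRUE)
pooled
```

The pooled standard errors combine within- and between-imputation variance, which is one reason a pooled interval can be wider than its complete-case counterpart.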
On the multiplicative scale, perform an appropriate analysis to examine whether the association between CD4 counts at baseline and opportunistic infection differs by sexual activity in both the complete dataset and the imputed datasets. It is your decision whether to treat sexual activity as a numeric variable or a categorical (factor) variable, or even to aggregate certain levels based on your scientific rationale. Please ensure that the results are interpreted appropriately.
| Characteristic | Complete Data: OR¹ | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: OR¹ | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| cd4cnt | 0.18 | 0.00, 5.79 | 0.3 | 0.18 | 0.01, 4.77 | 0.3 |
| activity | 5.16 | 0.26, 747 | 0.3 | 5.43 | 0.24, 125 | 0.3 |
| age | 0.94 | 0.47, 1.70 | 0.8 | 0.89 | 0.59, 1.37 | 0.6 |
| income | 1.45 | 0.80, 5.32 | 0.4 | 1.18 | 0.68, 2.06 | 0.5 |
| as.factor(sex) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 0.00 | 0.00, 1.04 | 0.2 | 0.00 | 0.00, 89.3 | 0.2 |
| as.factor(art_base) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 9.20 | 0.01, 1,502,669 | 0.5 | 14.8 | 0.02, 13,562 | 0.4 |
| as.factor(residence) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 3.54 | 0.01, 26,892 | 0.7 | 13.9 | 0.03, 7,361 | 0.4 |
| 2 | 0.05 | 0.00, 29,761,604 | 0.8 | 20.2 | 0.00, 310,845 | 0.5 |
| cd4cnt * activity | 0.69 | 0.16, 1.38 | 0.3 | 0.69 | 0.33, 1.45 | 0.3 |
| ¹ OR = Odds Ratio, CI = Confidence Interval | | | | | | |
CD4 Count (Exposure): Because the model includes an interaction term, the OR for CD4 count (0.18 in both the complete-case and imputed-data models, p = 0.3) applies to participants whose sexual activity value is 0. It suggests a protective but statistically non-significant association: higher CD4 counts may be linked to lower odds of opportunistic infection, but the wide confidence intervals indicate substantial uncertainty.
Sexual Activity: Sexual activity, treated as a numeric variable, has an OR of 5.16 (complete data) and 5.43 (imputed data), both with p = 0.3. In this model it is the OR per one-unit increase in sexual activity when CD4 count is 0. Although the point estimate suggests a strong positive association (higher odds of opportunistic infection with more sexual activity), the confidence intervals are very wide, indicating low precision.
Interaction Between CD4 Count and Sexual Activity: The interaction term (CD4 count × sexual activity) has an OR of 0.69 (complete and imputed data, p = 0.3). On the multiplicative scale, an OR below 1 means that the OR for CD4 count is multiplied by 0.69 for each one-unit increase in sexual activity, so the protective association of CD4 count would be somewhat stronger, not weaker, at higher levels of sexual activity. This result is not statistically significant, however, and the confidence intervals include 1, so there is no clear evidence of effect modification. Using the imputed datasets narrowed some confidence intervals (e.g., 0.33, 1.45 vs. 0.16, 1.38 for the interaction term) but did not change the conclusions.
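To make the multiplicative reading of the interaction explicit, the implied odds ratio for CD4 count at a sexual activity level $a$ can be written out. This assumes that activity enters the model as a numeric variable (as the single `activity` row in the table suggests) and uses the rounded complete-case estimates, so the numbers are approximate:

$$
\operatorname{logit} P(\text{oi} = 1) = \beta_0 + \beta_1\,\text{cd4cnt} + \beta_2\,\text{activity} + \beta_3\,(\text{cd4cnt} \times \text{activity}) + \dots
$$

$$
\text{OR}_{\text{cd4cnt}}(a) = \exp(\beta_1 + \beta_3 a) = e^{\beta_1}\,\bigl(e^{\beta_3}\bigr)^{a} \approx 0.18 \times 0.69^{a}
$$

For example, $\text{OR}_{\text{cd4cnt}}(0) \approx 0.18$ and $\text{OR}_{\text{cd4cnt}}(5) \approx 0.18 \times 0.69^{5} \approx 0.18 \times 0.156 \approx 0.03$. The point estimate for CD4 count therefore moves further below 1 as sexual activity increases.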
Assume that age is a confounder in the association between baseline CD4 counts and opportunistic infections. You are concerned about the potential non-linear confounding effect of age, so you decide to include a non-linear term for age in the analyses of both the complete dataset and the imputed datasets. What non-linear term would you use? Please report your results and compare them to the methods that use a linear term for age.
| Characteristic | Complete Data: OR¹ | Complete Data: 95% CI¹ | Complete Data: p-value | Imputed Covars: OR¹ | Imputed Covars: 95% CI¹ | Imputed Covars: p-value |
|---|---|---|---|---|---|---|
| cd4cnt | 0.04 | 0.00, 0.22 | 0.008 | 0.03 | 0.00, 0.36 | 0.006 |
| activity | 1.00 | 0.51, 1.85 | >0.9 | 1.04 | 0.61, 1.76 | 0.9 |
| age | 4.48 | 0.15, 1,652 | 0.4 | 4.67 | 0.08, 270 | 0.4 |
| I(age^2) | 0.98 | 0.89, 1.03 | 0.4 | 0.97 | 0.92, 1.03 | 0.4 |
| income | 1.13 | 0.72, 3.69 | 0.7 | 1.04 | 0.65, 1.68 | 0.9 |
| as.factor(sex) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 0.02 | 0.00, 10.5 | 0.4 | 0.03 | 0.00, 58.3 | 0.4 |
| as.factor(art_base) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 7.48 | 0.00, 3,365,480 | 0.6 | 6.85 | 0.00, 45,919 | 0.7 |
| as.factor(residence) | | | | | | |
| 0 | — | — | | — | — | |
| 1 | 19.5 | 0.01, 142,797 | 0.4 | 47.1 | 0.05, 41,410 | 0.3 |
| 2 | 36.6 | 0.00, 5,334,801,942 | 0.7 | 135 | 0.01, 2,226,870 | 0.3 |
| ¹ OR = Odds Ratio, CI = Confidence Interval | | | | | | |
We used a quadratic (squared) term for age. Including age² helps account for a nonlinear relationship between age and the risk of opportunistic infections. Biological processes and immune function often do not change linearly with age; for example, the risk of infection might be higher in both very young and very old individuals but lower in middle age. Adding age² allows a curvilinear adjustment, so that age is properly controlled as a confounder if its effect is nonlinear.
However, adding the squared term did not meaningfully change the association between CD4 count and opportunistic infections. In the complete-case analysis, the OR for CD4 count was 0.03 (95% CI: 0.00, 0.18; p = 0.009) in the linear age model and 0.04 (95% CI: 0.00, 0.22; p = 0.008) after adding age². Similarly, in the imputed-data model the OR remained 0.03, and the 95% CI changed only slightly, from (0.00, 0.33; p = 0.005) to (0.00, 0.36; p = 0.006). The age² term itself was not statistically significant in either analysis (OR 0.98 and 0.97, p = 0.4). These results suggest that adjusting for a potential nonlinear effect of age did not materially affect the estimated relationship between CD4 count and opportunistic infections.
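The linear and quadratic age adjustments can be compared with a sketch like the one below, using base R only. The data are simulated toy values (again with a larger n than the study). The likelihood-ratio test is an optional extra that the report does not use.

```r
# Hypothetical toy data (values are made up)
set.seed(5)
n <- 300
age    <- round(runif(n, 18, 60))
cd4cnt <- round(runif(n, 1, 7), 2)
oi     <- rbinom(n, 1, plogis(1 - 0.5 * cd4cnt + 0.02 * (age - 40)))
dat    <- data.frame(oi, cd4cnt, age)

# Same model with a linear and with a quadratic adjustment for age
fit_lin  <- glm(oi ~ cd4cnt + age,            family = binomial, data = dat)
fit_quad <- glm(oi ~ cd4cnt + age + I(age^2), family = binomial, data = dat)

# Does the exposure OR move when age^2 is added?
or_cd4 <- c(linear    = exp(coef(fit_lin)[["cd4cnt"]]),
            quadratic = exp(coef(fit_quad)[["cd4cnt"]]))
or_cd4

# Likelihood-ratio test for the added age^2 term (nested models)
lrt <- anova(fit_lin, fit_quad, test = "LRT")
lrt
```

Because age and age² are highly correlated, their individual ORs and intervals (such as the very wide interval for age in the table above) are hard to interpret on their own. Centering age before squaring is a common way to reduce that collinearity without changing the fitted curve.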