Load R packages

Part 1

Q1.1

Create a Directed Acyclic Graph (DAG) to represent the causal relationships of smoking intensity and smoking duration with systolic blood pressure, in the presence of other variables.

[Figure: DAG part1]

Create a Directed Acyclic Graph (DAG) to determine the associations of smoking intensity and smoking duration with systolic and diastolic blood pressure, as well as the diagnosis of high blood pressure.
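The DAG itself is not reproduced in this text, so the following is only a minimal sketch of how such a graph could be encoded and queried for adjustment sets. It assumes the `dagitty` package is available; the edges shown are hypothetical and chosen for illustration from variables that appear in the dataset, not taken from the figure.

```r
# Hypothetical DAG specification (illustrative edges only, not the report's DAG).
dag_spec <- "dag {
  smokeintensity [exposure]
  sbp [outcome]
  age -> smokeintensity
  age -> smokeyrs
  age -> sbp
  sex -> smokeintensity
  sex -> sbp
  education -> income_group
  education -> smokeintensity
  education -> sbp
  income_group -> smokeintensity
  income_group -> sbp
  smokeyrs -> sbp
  smokeintensity -> sbp
}"

# Only run the query if dagitty is installed.
if (requireNamespace("dagitty", quietly = TRUE)) {
  dag <- dagitty::dagitty(dag_spec)
  # Minimal sufficient adjustment sets for the total effect of the
  # declared exposure on the declared outcome.
  print(dagitty::adjustmentSets(dag))
}
```

Changing the `[exposure]` tag to `smokeyrs` would give the corresponding adjustment sets for smoking duration under the same assumed graph.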

Q1.2

Read in Assignment1_data.csv and appropriately manage missing data. Among the baseline variables, describe the missing data pattern and identify factors associated with missingness. One approach is to create a table comparing variables between participants with and without missing data. Which missingness assumption do you consider? Please justify your conclusion.

## Rows: 1,629
## Columns: 27
## $ X              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ id             <int> 233, 235, 244, 245, 252, 257, 262, 266, 419, 420, 428, …
## $ sex            <int> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0…
## $ age            <int> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, 51,…
## $ race           <int> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ education      <fct> 8th or less, High school dropout, High school dropout, …
## $ income         <int> 19, 18, 15, 15, 18, 11, 19, 22, 18, 16, 19, 18, 16, NA,…
## $ marital        <fct> Married, Married, Widowed, Widowed, Married, Never marr…
## $ active         <fct> Very active, Very active, Very active, Moderately activ…
## $ alcoholfreq    <fct> 2-3 times/week, Almost every day, < 12 times/year, 1-4 …
## $ tobacco_price  <dbl> 2.183594, 2.346680, 1.569580, 1.506592, 2.346680, 2.209…
## $ tobacco_tax    <dbl> 1.1022949, 1.3649902, 0.5512695, 0.5249023, 1.3649902, …
## $ smokeintensity <int> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10, 4…
## $ smokeyrs       <int> 29, 24, 26, 53, 19, 21, 39, 9, 37, 25, 24, 20, 19, 38, …
## $ height         <dbl> 174.1875, 159.3750, 168.5000, 170.1875, 181.8750, 162.1…
## $ weight         <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.74,…
## $ asthma         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cholesterol    <int> 197, 301, 157, 174, 216, 212, 205, 166, 337, 279, 173, …
## $ diabetes       <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ dbp            <int> 96, 80, 75, 78, 77, 83, 69, 53, 79, 106, 89, 69, 80, 10…
## $ sbp            <int> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 135, …
## $ hbp            <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ death          <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1…
## $ dadth          <int> NA, NA, NA, 14, NA, NA, NA, NA, 13, 17, NA, 28, NA, NA,…
## $ modth          <int> NA, NA, NA, 2, NA, NA, NA, NA, 10, 10, NA, 11, NA, NA, …
## $ yrdth          <int> NA, NA, NA, 85, NA, NA, NA, NA, 84, 86, NA, 92, NA, NA,…
## $ income_group   <fct> $5000 - $14999, $5000 - $14999, $3000 - $4999, $3000 - …

##     X id sex age race education active smokeintensity smokeyrs height weight
## 704 1  1   1   1    1         1      1              1        1      1      1
## 697 1  1   1   1    1         1      1              1        1      1      1
## 44  1  1   1   1    1         1      1              1        1      1      1
## 38  1  1   1   1    1         1      1              1        1      1      1
## 2   1  1   1   1    1         1      1              1        1      1      1
## 1   1  1   1   1    1         1      1              1        1      1      1
## 47  1  1   1   1    1         1      1              1        1      1      1
## 23  1  1   1   1    1         1      1              1        1      1      1
## 3   1  1   1   1    1         1      1              1        1      1      1
## 2   1  1   1   1    1         1      1              1        1      1      1
## 34  1  1   1   1    1         1      1              1        1      1      1
## 20  1  1   1   1    1         1      1              1        1      1      1
## 3   1  1   1   1    1         1      1              1        1      1      1
## 2   1  1   1   1    1         1      1              1        1      1      1
## 1   1  1   1   1    1         1      1              1        1      1      1
## 2   1  1   1   1    1         1      1              1        1      1      1
## 5   1  1   1   1    1         1      1              1        1      1      1
## 1   1  1   1   1    1         1      1              1        1      1      1
##     0  0   0   0    0         0      0              0        0      0      0
##     asthma cholesterol death marital alcoholfreq income income_group sbp dbp
## 704      1           1     1       1           1      1            1   1   1
## 697      1           1     1       1           1      1            1   1   1
## 44       1           1     1       1           1      1            1   1   1
## 38       1           1     1       1           1      1            1   1   1
## 2        1           1     1       1           1      1            1   1   0
## 1        1           1     1       1           1      1            1   1   0
## 47       1           1     1       1           1      1            1   0   0
## 23       1           1     1       1           1      1            1   0   0
## 3        1           1     1       1           1      1            1   0   0
## 2        1           1     1       1           1      1            1   0   0
## 34       1           1     1       1           1      0            0   1   1
## 20       1           1     1       1           1      0            0   1   1
## 3        1           1     1       1           1      0            0   1   1
## 2        1           1     1       1           1      0            0   1   1
## 1        1           1     1       1           1      0            0   1   0
## 2        1           1     1       1           1      0            0   0   0
## 5        1           1     1       1           0      1            1   1   1
## 1        1           1     1       0           1      1            1   1   1
##          0           0     0       1           5     62           62  77  81
##     tobacco_price tobacco_tax diabetes hbp     
## 704             1           1        1   1    0
## 697             1           1        0   0    2
## 44              0           0        1   1    2
## 38              0           0        0   0    4
## 2               1           1        1   1    1
## 1               1           1        0   0    3
## 47              1           1        1   1    2
## 23              1           1        0   0    4
## 3               0           0        1   1    4
## 2               0           0        0   0    6
## 34              1           1        1   1    2
## 20              1           1        0   0    4
## 3               0           0        1   1    4
## 2               0           0        0   0    6
## 1               1           1        1   1    3
## 2               1           1        0   0    6
## 5               1           1        0   0    3
## 1               1           1        0   0    3
##                92          92      791 791 2054

## 
##  Variables sorted by number of missings: 
##        Variable        Count
##        diabetes 0.4855739718
##             hbp 0.4855739718
##   tobacco_price 0.0564763659
##     tobacco_tax 0.0564763659
##             dbp 0.0497237569
##             sbp 0.0472682627
##          income 0.0380601596
##    income_group 0.0380601596
##     alcoholfreq 0.0030693677
##         marital 0.0006138735
##               X 0.0000000000
##              id 0.0000000000
##             sex 0.0000000000
##             age 0.0000000000
##            race 0.0000000000
##       education 0.0000000000
##          active 0.0000000000
##  smokeintensity 0.0000000000
##        smokeyrs 0.0000000000
##          height 0.0000000000
##          weight 0.0000000000
##          asthma 0.0000000000
##     cholesterol 0.0000000000
##           death 0.0000000000

## 
## FALSE  TRUE 
##  1472   157

##              X             id            sex            age           race 
##              0              0              0              0              0 
##      education         income        marital         active    alcoholfreq 
##              0             62              1              0              5 
##  tobacco_price    tobacco_tax smokeintensity       smokeyrs         height 
##             92             92              0              0              0 
##         weight         asthma    cholesterol       diabetes            dbp 
##              0              0              0            791             81 
##            sbp            hbp          death          dadth          modth 
##             77            791              0           1307           1307 
##          yrdth   income_group   CompleteCase 
##           1311             62              0

**Table 1. Patient Characteristics by complete case**
Characteristic	N	Complete case = TRUE, N = 157¹	Complete case = FALSE, N = 1,472¹	p-value²
Sex(female)	1,629	65 (41%)	765 (52%)	0.012
Age	1,629	56 (49, 62)	42 (33, 51)	<0.001
Race(Non-White)	1,629	31 (20%)	184 (13%)	0.011
Education level	1,629			<0.001
8th or less		66 (42%)	245 (17%)
High school dropout		32 (20%)	319 (22%)
High school		39 (25%)	620 (42%)
College dropout		9 (5.7%)	117 (7.9%)
College or more		11 (7.0%)	171 (12%)
Marital status	1,628			0.051
Under 17		0 (0%)	0 (0%)
Married		116 (74%)	1,163 (79%)
Widowed		17 (11%)	80 (5.4%)
Never married		6 (3.8%)	90 (6.1%)
Divorced		10 (6.4%)	89 (6.1%)
Separated		8 (5.1%)	49 (3.3%)
Physical activity	1,629			0.4
Very active		67 (43%)	662 (45%)
Moderately active		78 (50%)	660 (45%)
Inactive		12 (7.6%)	150 (10%)
Alcohol consumption	1,624			0.002
Almost every day		39 (25%)	297 (20%)
2-3 times/week		15 (9.6%)	216 (15%)
1-4 times/month		34 (22%)	472 (32%)
< 12 times/year		37 (24%)	307 (21%)
No alcohol last year		32 (20%)	175 (12%)
Tobacco prices	1,537	2.18 (2.09, 2.31)	2.17 (2.04, 2.24)	0.8
In-State tobacco tax	1,537	1.05 (1.00, 1.15)	1.05 (0.94, 1.15)	0.6
Smoke intensity	1,629	20 (10, 25)	20 (10, 30)	0.5
Smoke years	1,629	36 (27, 42)	23 (14, 32)	<0.001
Height	1,629	169 (163, 175)	168 (162, 175)	0.7
Weight	1,629	71 (61, 81)	69 (59, 80)	0.084
Asthma	1,629	5 (3.2%)	74 (5.0%)	0.3
Cholesterol	1,629	231 (193, 261)	216 (188, 245)	0.003
Diabetes	838	2 (1.3%)	12 (1.8%)	>0.9
Diastolic Blood Pressure	1,548	78 (70, 86)	77 (70, 84)	0.4
Systolic Blood Pressure	1,552	138 (126, 149)	125 (115, 138)	<0.001
High Blood Pressure	838	37 (24%)	93 (14%)	0.002
Death	1,629	157 (100%)	161 (11%)	<0.001
Family income	1,567			<0.001
<$3000		33 (21%)	124 (8.8%)
$3000 - $4999		31 (20%)	115 (8.2%)
$5000 - $14999		72 (46%)	772 (55%)
>=$15000		21 (13%)	399 (28%)
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test

Conclusion: The missing data pattern is non-monotone: the variables cannot be ordered so that missingness in one variable implies missingness in all the following ones. We removed the date-of-death variables (dadth, modth, yrdth) because their missingness is expected when death = 0; among participants with death = 1, we found no missing dates of death. Several variables tend to be missing together: SBP and DBP, tobacco price and tobacco tax, income and income_group, and diabetes and HBP.

In decreasing order of missingness, the affected variables are diabetes and HBP (48.6% each), tobacco_price and tobacco_tax (5.6% each), DBP (5.0%), SBP (4.7%), income and income_group (3.8% each), alcoholfreq (0.3%) and marital status (0.06%), as shown in the histogram of missing data.

Comparing participant characteristics by data completeness (Table 1), participants with missing data include a higher proportion of women, are younger, include a lower proportion of non-White individuals, have higher education and income, differ in alcohol consumption frequency, have fewer years of smoking, and have lower cholesterol, lower systolic blood pressure and a lower proportion of hypertension.

Because missingness is associated with these observed characteristics, the data are unlikely to be missing completely at random (MCAR), and a complete-case analysis may be biased as well as less powerful. We therefore assume that the data are missing at random (MAR) conditional on the observed variables and perform multiple imputation. Missing not at random (MNAR) cannot be ruled out from the observed data alone.
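The steps described above (dropping the structurally missing date-of-death variable, ranking variables by missingness, and flagging complete cases) can be sketched in base R. The data frame `toy` below is made up for illustration and only borrows a few column names from the dataset; it is not the assignment data.

```r
# Toy data (hypothetical values) with a structurally missing death year.
toy <- data.frame(
  death  = c(0, 1, 0, 1, 0, 0),
  yrdth  = c(NA, 85, NA, 92, NA, NA),
  income = c(19, 18, NA, 15, 18, 11),
  sbp    = c(175, 123, 115, NA, 118, 141),
  age    = c(42, 36, 56, 68, 40, 43)
)

# Check that every recorded death has a year of death, so the NAs in
# yrdth are structural (death = 0) rather than truly missing.
stopifnot(!any(is.na(toy$yrdth[toy$death == 1])))

# Drop the structurally missing column before assessing missingness.
analysis <- toy[, setdiff(names(toy), "yrdth")]

# Proportion missing per variable, largest first.
miss_prop <- sort(colMeans(is.na(analysis)), decreasing = TRUE)
print(miss_prop)

# Flag complete cases and compare a characteristic between the groups.
analysis$complete_case <- complete.cases(analysis)
print(table(analysis$complete_case))
print(aggregate(age ~ complete_case, data = analysis, FUN = median))
```

The order of operations matters: computing the complete-case flag before dropping the date-of-death columns would mark every participant with death = 0 as incomplete.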

Q1.3

Assuming that data are missing at random (MAR), implement Multiple Imputation by Chained Equations (MICE) to impute the missing data. It is your decision what variable(s) to impute (e.g., only covariates, covariates plus exposure, or all variables in the analysis). Provide a detailed description of your approach, including the imputation methods used, variables included in the imputation model, the number of iterations, the number of imputed datasets, and diagnostic checks.

## 
##  iter imp variable
##   1   1  tobacco_price  income_group  alcoholfreq
##   1   2  tobacco_price  income_group  alcoholfreq
##   1   3  tobacco_price  income_group  alcoholfreq
##   1   4  tobacco_price  income_group  alcoholfreq
##   1   5  tobacco_price  income_group  alcoholfreq
##   1   6  tobacco_price  income_group  alcoholfreq
##   1   7  tobacco_price  income_group  alcoholfreq
##   1   8  tobacco_price  income_group  alcoholfreq
##   1   9  tobacco_price  income_group  alcoholfreq
##   1   10  tobacco_price  income_group  alcoholfreq
##   2   1  tobacco_price  income_group  alcoholfreq
##   2   2  tobacco_price  income_group  alcoholfreq
##   2   3  tobacco_price  income_group  alcoholfreq
##   2   4  tobacco_price  income_group  alcoholfreq
##   2   5  tobacco_price  income_group  alcoholfreq
##   2   6  tobacco_price  income_group  alcoholfreq
##   2   7  tobacco_price  income_group  alcoholfreq
##   2   8  tobacco_price  income_group  alcoholfreq
##   2   9  tobacco_price  income_group  alcoholfreq
##   2   10  tobacco_price  income_group  alcoholfreq
##   3   1  tobacco_price  income_group  alcoholfreq
##   3   2  tobacco_price  income_group  alcoholfreq
##   3   3  tobacco_price  income_group  alcoholfreq
##   3   4  tobacco_price  income_group  alcoholfreq
##   3   5  tobacco_price  income_group  alcoholfreq
##   3   6  tobacco_price  income_group  alcoholfreq
##   3   7  tobacco_price  income_group  alcoholfreq
##   3   8  tobacco_price  income_group  alcoholfreq
##   3   9  tobacco_price  income_group  alcoholfreq
##   3   10  tobacco_price  income_group  alcoholfreq
##   4   1  tobacco_price  income_group  alcoholfreq
##   4   2  tobacco_price  income_group  alcoholfreq
##   4   3  tobacco_price  income_group  alcoholfreq
##   4   4  tobacco_price  income_group  alcoholfreq
##   4   5  tobacco_price  income_group  alcoholfreq
##   4   6  tobacco_price  income_group  alcoholfreq
##   4   7  tobacco_price  income_group  alcoholfreq
##   4   8  tobacco_price  income_group  alcoholfreq
##   4   9  tobacco_price  income_group  alcoholfreq
##   4   10  tobacco_price  income_group  alcoholfreq
##   5   1  tobacco_price  income_group  alcoholfreq
##   5   2  tobacco_price  income_group  alcoholfreq
##   5   3  tobacco_price  income_group  alcoholfreq
##   5   4  tobacco_price  income_group  alcoholfreq
##   5   5  tobacco_price  income_group  alcoholfreq
##   5   6  tobacco_price  income_group  alcoholfreq
##   5   7  tobacco_price  income_group  alcoholfreq
##   5   8  tobacco_price  income_group  alcoholfreq
##   5   9  tobacco_price  income_group  alcoholfreq
##   5   10  tobacco_price  income_group  alcoholfreq
##   6   1  tobacco_price  income_group  alcoholfreq
##   6   2  tobacco_price  income_group  alcoholfreq
##   6   3  tobacco_price  income_group  alcoholfreq
##   6   4  tobacco_price  income_group  alcoholfreq
##   6   5  tobacco_price  income_group  alcoholfreq
##   6   6  tobacco_price  income_group  alcoholfreq
##   6   7  tobacco_price  income_group  alcoholfreq
##   6   8  tobacco_price  income_group  alcoholfreq
##   6   9  tobacco_price  income_group  alcoholfreq
##   6   10  tobacco_price  income_group  alcoholfreq
##   7   1  tobacco_price  income_group  alcoholfreq
##   7   2  tobacco_price  income_group  alcoholfreq
##   7   3  tobacco_price  income_group  alcoholfreq
##   7   4  tobacco_price  income_group  alcoholfreq
##   7   5  tobacco_price  income_group  alcoholfreq
##   7   6  tobacco_price  income_group  alcoholfreq
##   7   7  tobacco_price  income_group  alcoholfreq
##   7   8  tobacco_price  income_group  alcoholfreq
##   7   9  tobacco_price  income_group  alcoholfreq
##   7   10  tobacco_price  income_group  alcoholfreq
##   8   1  tobacco_price  income_group  alcoholfreq
##   8   2  tobacco_price  income_group  alcoholfreq
##   8   3  tobacco_price  income_group  alcoholfreq
##   8   4  tobacco_price  income_group  alcoholfreq
##   8   5  tobacco_price  income_group  alcoholfreq
##   8   6  tobacco_price  income_group  alcoholfreq
##   8   7  tobacco_price  income_group  alcoholfreq
##   8   8  tobacco_price  income_group  alcoholfreq
##   8   9  tobacco_price  income_group  alcoholfreq
##   8   10  tobacco_price  income_group  alcoholfreq
##   9   1  tobacco_price  income_group  alcoholfreq
##   9   2  tobacco_price  income_group  alcoholfreq
##   9   3  tobacco_price  income_group  alcoholfreq
##   9   4  tobacco_price  income_group  alcoholfreq
##   9   5  tobacco_price  income_group  alcoholfreq
##   9   6  tobacco_price  income_group  alcoholfreq
##   9   7  tobacco_price  income_group  alcoholfreq
##   9   8  tobacco_price  income_group  alcoholfreq
##   9   9  tobacco_price  income_group  alcoholfreq
##   9   10  tobacco_price  income_group  alcoholfreq
##   10   1  tobacco_price  income_group  alcoholfreq
##   10   2  tobacco_price  income_group  alcoholfreq
##   10   3  tobacco_price  income_group  alcoholfreq
##   10   4  tobacco_price  income_group  alcoholfreq
##   10   5  tobacco_price  income_group  alcoholfreq
##   10   6  tobacco_price  income_group  alcoholfreq
##   10   7  tobacco_price  income_group  alcoholfreq
##   10   8  tobacco_price  income_group  alcoholfreq
##   10   9  tobacco_price  income_group  alcoholfreq
##   10   10  tobacco_price  income_group  alcoholfreq

## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##             id            sbp smokeintensity       smokeyrs  tobacco_price 
##             ""             ""             ""             ""          "pmm" 
##            age            sex           race   income_group    alcoholfreq 
##             ""             ""             ""      "polyreg"      "polyreg" 
##      education         active 
##             ""             "" 
## PredictorMatrix:
##                id sbp smokeintensity smokeyrs tobacco_price age sex race
## id              0   1              1        1             1   1   1    1
## sbp             1   0              1        1             1   1   1    1
## smokeintensity  1   1              0        1             1   1   1    1
## smokeyrs        1   1              1        0             1   1   1    1
## tobacco_price   1   1              1        1             0   1   1    1
## age             1   1              1        1             1   0   1    1
##                income_group alcoholfreq education active
## id                        1           1         1      1
## sbp                       1           1         1      1
## smokeintensity            1           1         1      1
## smokeyrs                  1           1         1      1
## tobacco_price             1           1         1      1
## age                       1           1         1      1

We chose to impute only the covariates identified as confounders that had missing data; all other variables were left as observed, so each imputed dataset contains both imputed and non-imputed variables. The imputed variables and their methods were: income group (categorical with more than two levels), using polytomous regression (polyreg); tobacco price (continuous), using predictive mean matching (pmm); and alcohol consumption frequency (categorical with more than two levels), using polytomous regression (polyreg). The remaining variables in the imputation dataset, including SBP, smoking intensity and smoking duration, were used as predictors but were not themselves imputed. We generated 10 imputed datasets with 10 iterations each and visually inspected the imputed variables. Overall, the density plots showed that the imputed values aligned well with the observed data.
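A minimal sketch of an imputation set-up of this kind is shown below, assuming the `mice` package is installed. The data are simulated, the factor levels and the seeds are arbitrary, and the call is not the report's actual code; it only mirrors the stated choices (10 imputations, 10 iterations, pmm for the continuous variable and polyreg for the multi-level factors, which are also the `mice` defaults for these variable types).

```r
set.seed(2024)
n <- 200

# Simulated stand-in data (hypothetical values and levels).
toy <- data.frame(
  sbp            = rnorm(n, 128, 18),
  smokeintensity = sample(1:40, n, replace = TRUE),
  age            = sample(25:74, n, replace = TRUE),
  tobacco_price  = rnorm(n, 2.1, 0.2),
  income_group   = factor(sample(c("low", "middle", "high"), n, replace = TRUE)),
  alcoholfreq    = factor(sample(c("daily", "weekly", "rarely"), n, replace = TRUE))
)

# Introduce missing values in the three covariates to be imputed.
toy$tobacco_price[sample(n, 12)] <- NA
toy$income_group[sample(n, 8)]   <- NA
toy$alcoholfreq[sample(n, 3)]    <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  # Default methods: "pmm" for numeric, "polyreg" for unordered factors
  # with more than two levels, "" for variables without missing values.
  meth <- mice::make.method(toy)

  imp <- mice::mice(toy, m = 10, maxit = 10, method = meth,
                    seed = 123, printFlag = FALSE)
  print(imp$method)

  # Fit the analysis model in each imputed dataset and pool the results.
  fit <- with(imp, lm(sbp ~ smokeintensity + age + tobacco_price +
                        income_group + alcoholfreq))
  print(summary(mice::pool(fit), conf.int = TRUE))

  # Diagnostics such as plot(imp) (trace plots) could be inspected here.
}
```

Setting `printFlag = FALSE` suppresses the per-iteration log, which keeps the rendered report shorter.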

Q1.4.

Perform appropriate analyses to estimate the associations between smoking intensity and smoking duration with systolic blood pressure in the complete dataset (i.e., participants without any missing data) and the imputed datasets. Please report your findings and indicate whether you observe any discrepancies between these two methods.

**Table 2. Association between smoking intensity and SBP in complete-case and imputed data**
Characteristic	Complete data: Beta	Complete data: 95% CI¹	Complete data: p-value	Imputed covariates: Beta	Imputed covariates: 95% CI¹	Imputed covariates: p-value
smokeintensity	-0.16	-0.53, 0.20	0.4	-0.06	-0.13, 0.02	0.2
tobacco_price	-14	-29, 1.7	0.081	0.54	-3.4, 4.5	0.8
age	0.15	-0.21, 0.52	0.4	0.56	0.48, 0.64	<0.001
as.factor(sex)
0	—	—		—	—
1	1.9	-6.2, 9.9	0.6	-4.2	-6.1, -2.3	<0.001
as.factor(race)
0	—	—		—	—
1	-1.5	-12, 9.3	0.8	5.9	3.2, 8.6	<0.001
as.factor(income_group)
<$3000	—	—		—	—
$3000 - $4999	-3.7	-14, 7.1	0.5	1.9	-2.2, 6.1	0.4
$5000 - $14999	1.9	-7.9, 12	0.7	1.7	-1.6, 5.0	0.3
>=$15000	-8.9	-22, 4.7	0.2	1.3	-2.5, 5.0	0.5
as.factor(alcoholfreq)
Almost every day	—	—		—	—
2-3 times/week	-4.4	-17, 8.6	0.5	-0.60	-3.6, 2.4	0.7
1-4 times/month	-0.20	-11, 10	>0.9	-4.0	-6.5, -1.5	0.002
< 12 times/year	-0.05	-10, 10	>0.9	-2.2	-4.9, 0.63	0.13
No alcohol last year	-2.7	-13, 8.0	0.6	-2.3	-5.6, 0.92	0.2
as.factor(education)
8th or less	—	—		—	—
High school dropout	-2.8	-13, 7.1	0.6	-2.1	-5.0, 0.76	0.2
High school	-8.4	-18, 1.4	0.093	-3.7	-6.4, -0.99	0.008
College dropout	-1.1	-18, 15	0.9	-1.7	-5.7, 2.2	0.4
College or more	-8.5	-24, 7.4	0.3	-5.1	-8.7, -1.4	0.007
as.factor(active)
Very active	—	—		—	—
Moderately active	6.5	-1.3, 14	0.10	0.28	-1.6, 2.1	0.8
Inactive	15	1.2, 28	0.033	0.30	-2.8, 3.4	0.9
¹ CI = Confidence Interval

**Table 3. Association between smoking duration and SBP in complete-case and imputed data**
Characteristic	Complete data: Beta	Complete data: 95% CI¹	Complete data: p-value	Imputed covariates: Beta	Imputed covariates: 95% CI¹	Imputed covariates: p-value
smokeyrs	0.22	-0.24, 0.68	0.3	-0.06	-0.21, 0.10	0.5
tobacco_price	-14	-29, 1.7	0.082	0.52	-3.4, 4.4	0.8
age	0.02	-0.49, 0.52	>0.9	0.61	0.46, 0.76	<0.001
as.factor(sex)
0	—	—		—	—
1	4.5	-3.8, 13	0.3	-4.1	-6.0, -2.2	<0.001
as.factor(race)
0	—	—		—	—
1	0.27	-9.8, 10	>0.9	6.2	3.5, 8.9	<0.001
as.factor(income_group)
<$3000	—	—		—	—
$3000 - $4999	-3.8	-15, 6.9	0.5	1.9	-2.3, 6.1	0.4
$5000 - $14999	2.6	-7.1, 12	0.6	1.6	-1.7, 5.0	0.3
>=$15000	-8.5	-22, 5.1	0.2	1.1	-2.6, 4.8	0.6
as.factor(alcoholfreq)
Almost every day	—	—		—	—
2-3 times/week	-3.9	-17, 9.2	0.6	-0.59	-3.6, 2.4	0.7
1-4 times/month	1.4	-9.3, 12	0.8	-3.8	-6.3, -1.3	0.003
< 12 times/year	1.4	-9.2, 12	0.8	-2.1	-4.8, 0.72	0.15
No alcohol last year	-3.0	-14, 7.5	0.6	-2.2	-5.4, 1.1	0.2
as.factor(education)
8th or less	—	—		—	—
High school dropout	-3.0	-13, 6.9	0.6	-2.2	-5.1, 0.62	0.13
High school	-7.6	-17, 2.2	0.13	-3.8	-6.5, -1.1	0.007
College dropout	-1.6	-18, 15	0.9	-1.9	-5.9, 2.0	0.3
College or more	-6.7	-22, 8.8	0.4	-5.1	-8.8, -1.4	0.007
as.factor(active)
Very active	—	—		—	—
Moderately active	6.6	-1.2, 14	0.10	0.25	-1.6, 2.1	0.8
Inactive	15	1.7, 29	0.027	0.21	-2.9, 3.3	0.9
¹ CI = Confidence Interval

Interpretation:

Smoking intensity and SBP: The estimated association moves toward the null after imputation, from -0.16 (95% CI: -0.53, 0.20; p = 0.4) in the complete-case analysis to -0.06 (95% CI: -0.13, 0.02; p = 0.2) in the imputed-data analysis. The narrower confidence interval reflects the improved precision of the larger analysed sample, and the smaller estimate suggests that the complete-case analysis may have overestimated the magnitude of the association. In both analyses, there is no statistically significant association between smoking intensity and SBP.
Smoking duration and SBP: The estimate changes sign after imputation, from 0.22 (95% CI: -0.24, 0.68; p = 0.3) in the complete-case analysis to -0.06 (95% CI: -0.21, 0.10; p = 0.5) in the imputed-data analysis. The narrower confidence interval again reflects improved precision. The reversal in direction may indicate that missing data biased the complete-case estimate, or that the imputation model or the set of imputed variables needs to be revised; because both intervals include zero, it is also compatible with sampling variability. In both analyses, there is no statistically significant association between smoking duration and SBP.
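The imputed-data estimates in Tables 2 and 3 combine results across the imputed datasets. As a hedged illustration of how such pooling works, the sketch below implements Rubin's rules in base R for a single coefficient. The function name `pool_rubin` and the input values are hypothetical; in practice `mice::pool()` performs this step, including the degrees-of-freedom calculation omitted here.

```r
# Rubin's rules for one coefficient estimated in m imputed datasets.
# est: vector of m point estimates; se: vector of m standard errors.
pool_rubin <- function(est, se) {
  m      <- length(est)
  q_bar  <- mean(est)               # pooled point estimate
  within <- mean(se^2)              # average within-imputation variance
  betw   <- var(est)                # between-imputation variance
  total  <- within + (1 + 1 / m) * betw
  c(estimate = q_bar, se = sqrt(total), within = within, between = betw)
}

# Hypothetical estimates from three imputed datasets.
est <- c(-0.05, -0.07, -0.06)
se  <- c(0.04, 0.04, 0.04)

res <- pool_rubin(est, se)
print(res)
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it carries the extra uncertainty due to the missing values into the pooled standard error.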

Q1.5

Perform a proper analysis to examine whether the associations between smoking intensity and duration (in years) with systolic blood pressure differ by biological sex in the complete data set and the imputed data sets. Please report your findings.

**Table 4. Association between smoking intensity and SBP, allowing variation by sex, in complete-case and imputed data**
Characteristic	Complete data: Beta	Complete data: 95% CI¹	Complete data: p-value	Imputed covariates: Beta	Imputed covariates: 95% CI¹	Imputed covariates: p-value
smokeintensity	-0.12	-0.53, 0.28	0.6	0.03	-0.08, 0.13	0.6
as.factor(sex)
0	—	—		—	—
1	4.5	-10, 19	0.5	-0.40	-4.0, 3.2	0.8
tobacco_price	-14	-29, 1.8	0.082	0.59	-3.3, 4.5	0.8
age	0.14	-0.22, 0.51	0.4	0.56	0.48, 0.63	<0.001
as.factor(race)
0	—	—		—	—
1	-1.5	-12, 9.3	0.8	5.9	3.2, 8.6	<0.001
as.factor(income_group)
<$3000	—	—		—	—
$3000 - $4999	-3.6	-14, 7.2	0.5	2.1	-2.1, 6.3	0.3
$5000 - $14999	1.9	-7.9, 12	0.7	1.7	-1.6, 5.0	0.3
>=$15000	-8.7	-22, 5.0	0.2	1.2	-2.5, 5.0	0.5
as.factor(alcoholfreq)
Almost every day	—	—		—	—
2-3 times/week	-4.6	-18, 8.5	0.5	-0.61	-3.6, 2.4	0.7
1-4 times/month	-0.23	-11, 10	>0.9	-4.0	-6.5, -1.5	0.002
< 12 times/year	-0.24	-11, 10	>0.9	-2.2	-5.0, 0.57	0.12
No alcohol last year	-2.5	-13, 8.2	0.6	-2.4	-5.6, 0.86	0.2
as.factor(education)
8th or less	—	—		—	—
High school dropout	-2.7	-13, 7.3	0.6	-2.2	-5.0, 0.66	0.13
High school	-8.3	-18, 1.5	0.094	-3.8	-6.5, -1.1	0.006
College dropout	-1.4	-18, 15	0.9	-1.9	-5.9, 2.0	0.3
College or more	-8.9	-25, 7.2	0.3	-5.2	-8.8, -1.5	0.006
as.factor(active)
Very active	—	—		—	—
Moderately active	6.5	-1.3, 14	0.10	0.24	-1.6, 2.1	0.8
Inactive	15	1.2, 29	0.033	0.40	-2.7, 3.5	0.8
smokeintensity * as.factor(sex)
smokeintensity * 1	-0.15	-0.85, 0.55	0.7	-0.19	-0.34, -0.03	0.017
¹ CI = Confidence Interval

**Table 5. Association between smoking duration and SBP, allowing variation by sex, in complete-case and imputed data**
Characteristic	Complete data: Beta	Complete data: 95% CI¹	Complete data: p-value	Imputed covariates: Beta	Imputed covariates: 95% CI¹	Imputed covariates: p-value
smokeyrs	0.15	-0.42, 0.72	0.6	-0.18	-0.35, -0.01	0.036
as.factor(sex)
0	—	—		—	—
1	-0.21	-24, 24	>0.9	-10	-14, -6.2	<0.001
tobacco_price	-14	-30, 1.6	0.078	0.55	-3.4, 4.5	0.8
age	0.03	-0.48, 0.55	0.9	0.62	0.47, 0.77	<0.001
as.factor(race)
0	—	—		—	—
1	0.14	-10, 10	>0.9	6.2	3.5, 8.8	<0.001
as.factor(income_group)
<$3000	—	—		—	—
$3000 - $4999	-3.7	-15, 7.1	0.5	2.0	-2.2, 6.2	0.3
$5000 - $14999	2.5	-7.3, 12	0.6	1.6	-1.7, 4.9	0.3
>=$15000	-8.7	-22, 5.0	0.2	1.3	-2.4, 5.0	0.5
as.factor(alcoholfreq)
Almost every day	—	—		—	—
2-3 times/week	-3.7	-17, 9.4	0.6	-0.55	-3.5, 2.4	0.7
1-4 times/month	1.4	-9.3, 12	0.8	-3.6	-6.1, -1.1	0.004
< 12 times/year	1.7	-9.1, 13	0.8	-2.0	-4.7, 0.80	0.2
No alcohol last year	-2.8	-13, 7.9	0.6	-2.3	-5.5, 0.97	0.2
as.factor(education)
8th or less	—	—		—	—
High school dropout	-3.2	-13, 6.8	0.5	-2.5	-5.3, 0.36	0.087
High school	-7.4	-17, 2.4	0.14	-4.0	-6.8, -1.3	0.003
College dropout	-1.3	-18, 15	0.9	-2.1	-6.0, 1.9	0.3
College or more	-6.9	-22, 8.7	0.4	-5.8	-9.5, -2.1	0.002
as.factor(active)
Very active	—	—		—	—
Moderately active	6.7	-1.1, 15	0.093	0.36	-1.5, 2.2	0.7
Inactive	15	1.6, 29	0.028	0.20	-2.9, 3.3	0.9
smokeyrs * as.factor(sex)
smokeyrs * 1	0.13	-0.51, 0.77	0.7	0.25	0.10, 0.40	<0.001
¹ CI = Confidence Interval

Interpretation: A) Smoking intensity and SBP with sex interaction: In these models the smoking intensity coefficient is the slope for the reference group (sex = 0), and the interaction term is the difference in slope for sex = 1. The reference-group slope is -0.12 (95% CI: -0.53, 0.28; p = 0.6) in the complete-case model and 0.03 (95% CI: -0.08, 0.13; p = 0.6) in the imputed model. The interaction term is -0.15 (95% CI: -0.85, 0.55; p = 0.7) in the complete-case model and -0.19 (95% CI: -0.34, -0.03; p = 0.017) in the imputed model, where it is statistically significant, suggesting that sex may modify the relationship between smoking intensity and SBP. The discrepancies between the two models may reflect systematic differences in missing data, where individuals with missing values differ from those with complete observations. However, imputation relies on model assumptions, and if key predictors were omitted or an inappropriate method was used, bias may have been introduced. In addition, imputed values are not observed values; if the imputation model is too simple, uncertainty can be understated and effect estimates altered. If the data were not MAR but instead depended on unmeasured factors, the imputed results could be of limited validity despite their narrower confidence intervals.

B) Smoking duration and SBP with sex interaction: The association between smoking duration and SBP differs between the complete-case and imputed analyses. In the complete-case analysis, the reference-group (sex = 0) slope for smoking duration is not significant, 0.15 (95% CI: -0.42, 0.72; p = 0.6), whereas in the imputed data it changes direction and magnitude, with smoking duration significantly associated with lower SBP, -0.18 (95% CI: -0.35, -0.01; p = 0.036). The interaction between smoking duration and sex is significant only in the imputed model, 0.25 (95% CI: 0.10, 0.40; p < 0.001), compared with 0.13 (95% CI: -0.51, 0.77; p = 0.7) in the complete-case model, suggesting that sex modifies the association between smoking duration and SBP. These differences may indicate that the data were not missing completely at random and that imputation restored important information, particularly regarding sex differences. Since the incomplete data had a higher proportion of females, the imputed results might better reflect the true relationships, whereas the complete-case analysis may underestimate key associations. However, the observed changes may also be due to model misspecification, whether from incorrect variable selection or an inappropriate imputation method. Furthermore, if the missing data were not MAR, unmeasured factors could still influence the results despite imputation.
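To read the interaction models, it can help to convert the reported coefficients into sex-specific slopes: the slope for sex = 0 is the main-effect coefficient, and the slope for sex = 1 is the main effect plus the interaction term. The short sketch below does this arithmetic for the imputed-data estimates in Tables 4 and 5. The helper name `sex_slopes` is hypothetical, and confidence intervals for the sex = 1 slope are not computed because they would need the covariance between the two coefficients, which the tables do not report.

```r
# Sex-specific slopes from a model with an exposure-by-sex interaction.
sex_slopes <- function(main, interaction) {
  c(sex0 = main, sex1 = main + interaction)
}

# Imputed-data estimates from Table 4 (smoking intensity).
intensity_imputed <- sex_slopes(main = 0.03, interaction = -0.19)

# Imputed-data estimates from Table 5 (smoking duration).
duration_imputed <- sex_slopes(main = -0.18, interaction = 0.25)

print(round(intensity_imputed, 2))
print(round(duration_imputed, 2))
```

Under these estimates, the intensity slope is about -0.16 for sex = 1, and the duration slope is about 0.07 for sex = 1, compared with 0.03 and -0.18 respectively for sex = 0.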

Q1.6

To impose the assumption that SBP increases linearly per multiplicative change in smoking intensity, we decide to apply a log base 2 transformation to smoking intensity when estimating the association. Perform this analysis in the complete data set and the imputed data sets. Please report your findings.

**Table 6. Association between log2-transformed smoking intensity and SBP in complete-case and imputed data**
Characteristic	Complete data: Beta	Complete data: 95% CI¹	Complete data: p-value	Imputed covariates: Beta	Imputed covariates: 95% CI¹	Imputed covariates: p-value
log2(smokeintensity)	-0.92	-4.7, 2.9	0.6	-0.67	-1.5, 0.15	0.11
tobacco_price	-14	-29, 1.9	0.084	0.51	-3.4, 4.4	0.8
age	0.17	-0.20, 0.53	0.4	0.56	0.48, 0.64	<0.001
as.factor(sex)
0	—	—		—	—
1	2.5	-5.5, 10	0.5	-4.2	-6.1, -2.3	<0.001
as.factor(race)
0	—	—		—	—
1	-0.76	-12, 10	0.9	5.9	3.2, 8.6	<0.001
as.factor(income_group)
<$3000	—	—		—	—
$3000 - $4999	-3.7	-14, 7.1	0.5	1.9	-2.3, 6.1	0.4
$5000 - $14999	1.9	-8.0, 12	0.7	1.7	-1.6, 5.0	0.3
>=$15000	-9.2	-23, 4.4	0.2	1.3	-2.5, 5.0	0.5
as.factor(alcoholfreq)
Almost every day	—	—		—	—
2-3 times/week	-4.5	-18, 8.6	0.5	-0.56	-3.6, 2.4	0.7
1-4 times/month	0.16	-10, 11	>0.9	-3.9	-6.4, -1.4	0.002
< 12 times/year	0.05	-10, 11	>0.9	-2.1	-4.9, 0.66	0.14
No alcohol last year	-2.9	-14, 7.8	0.6	-2.3	-5.5, 0.95	0.2
as.factor(education)
8th or less	—	—		—	—
High school dropout	-2.7	-13, 7.2	0.6	-2.1	-4.9, 0.79	0.2
High school	-8.1	-18, 1.6	0.10	-3.7	-6.4, -0.97	0.008
College dropout	-1.7	-18, 15	0.8	-1.8	-5.8, 2.2	0.4
College or more	-7.9	-24, 8.2	0.3	-5.1	-8.8, -1.4	0.007
as.factor(active)
Very active	—	—		—	—
Moderately active	6.7	-1.1, 15	0.091	0.25	-1.6, 2.1	0.8
Inactive	15	1.5, 29	0.030	0.31	-2.8, 3.4	0.8
¹ CI = Confidence Interval

Interpretation: With the log2 transformation, the coefficient is the expected difference in SBP per doubling of smoking intensity. The association between smoking intensity and SBP remained non-significant in both the complete-case and imputed models. In the complete-case analysis the estimate was beta = -0.92 (95% CI: -4.7, 2.9; p = 0.6), and in the imputed model it was slightly attenuated to beta = -0.67 (95% CI: -1.5, 0.15; p = 0.11). The imputed model was more precise, as indicated by its narrower confidence interval; this reflects the larger sample available after imputation rather than the transformation itself. The effect size remained small and non-significant in both analyses.
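The "per doubling" reading of a log2-transformed exposure can be checked numerically. The sketch below uses simulated data (the data frame `toy` and its coefficients are assumptions made for illustration, not the assignment data) and shows that the fitted difference in SBP between 10 and 20 cigarettes per day equals the log2 coefficient.

```r
# Simulated (hypothetical) data; not the assignment data.
set.seed(1)
n <- 200
smokeintensity <- sample(1:60, n, replace = TRUE)  # must be > 0 for log2()
sbp <- 130 - 0.7 * log2(smokeintensity) + rnorm(n, sd = 15)
toy <- data.frame(sbp, smokeintensity)

# Linear model with a log2-transformed exposure
fit  <- lm(sbp ~ log2(smokeintensity), data = toy)
beta <- unname(coef(fit)["log2(smokeintensity)"])

# Predicted SBP at 10 and 20 cigarettes per day (one doubling apart)
pred <- predict(fit, newdata = data.frame(smokeintensity = c(10, 20)))
diff_doubling <- unname(pred[2] - pred[1])

# The difference per doubling equals the log2 coefficient
all.equal(diff_doubling, beta)
```

The same identity holds for any pair of values one doubling apart (for example 5 and 10), which is what the multiplicative-change assumption in Q1.6 implies.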

Part 2

Q2.1

Create a DAG to represent causal relationships between CD4 counts at baseline and opportunistic infection after one year, in the presence of other variables. Please provide your DAG.

Q2.2

Read in CD4_infection.csv. Describe the missing pattern and identify factors associated with missingness. Which missingness assumption do you consider? Please justify your conclusion.

## Rows: 80
## Columns: 11
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ septrin   <int> 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, …
## $ site      <int> 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ sex       <int> 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, …
## $ age       <int> 20, 32, 27, 40, 39, 35, 44, 23, 19, 38, 27, 35, 46, 39, 28, …
## $ activity  <int> 1, 4, 12, 8, 2, 5, 7, 4, 6, 8, 6, 2, 4, 3, 9, 10, 2, 12, 10,…
## $ income    <dbl> 22.1, 18.0, 32.0, 41.0, 12.0, NA, 25.0, NA, 10.0, 29.0, 25.0…
## $ residence <int> 1, 0, 2, 2, 0, NA, 1, 2, NA, 1, 0, NA, 2, 1, 1, NA, 2, 1, NA…
## $ art_base  <int> 0, 1, NA, 0, 1, 0, 1, 0, 0, 0, 1, NA, 0, 1, NA, 1, 1, 0, 1, …
## $ cd4cnt    <dbl> 4.75, 5.40, 3.10, 2.90, 4.05, 4.10, 5.05, 1.80, NA, 5.10, 5.…
## $ oi        <int> NA, 0, 1, 1, 0, 0, NA, 1, 0, 0, 0, 1, 0, NA, 1, 0, 1, 0, 0, …

##        id   septrin      site       sex       age  activity    income residence 
##         0         0         0         0         0         0         4         5 
##  art_base    cd4cnt        oi 
##         3         3         7

##    id septrin site sex age activity art_base cd4cnt income residence oi   
## 63  1       1    1   1   1        1        1      1      1         1  1  0
## 7   1       1    1   1   1        1        1      1      1         1  0  1
## 1   1       1    1   1   1        1        1      1      1         0  1  1
## 2   1       1    1   1   1        1        1      1      0         1  1  1
## 2   1       1    1   1   1        1        1      1      0         0  1  2
## 1   1       1    1   1   1        1        1      0      1         1  1  1
## 1   1       1    1   1   1        1        1      0      1         0  1  2
## 2   1       1    1   1   1        1        0      1      1         1  1  1
## 1   1       1    1   1   1        1        0      0      1         0  1  3
##     0       0    0   0   0        0        3      3      4         5  7 22

## 
##  Variables sorted by number of missings: 
##   Variable  Count
##         oi 0.0875
##  residence 0.0625
##     income 0.0500
##   art_base 0.0375
##     cd4cnt 0.0375
##         id 0.0000
##    septrin 0.0000
##       site 0.0000
##        sex 0.0000
##        age 0.0000
##   activity 0.0000

## 
## FALSE  TRUE 
##    17    63

## 
## Yes  No 
##  63  17

**Table 7. Patient Characteristics by Complete Case**
Characteristic	N	Complete Case		p-value²
Characteristic	N	Yes N = 63¹	No N = 17¹	p-value²
Age	80	32 (27, 40)	29 (27, 39)	0.5
Sex (Female)	80	28 (44%)	6 (35%)	0.5
Household Income	76	25 (18, 32)	22 (19, 25)	0.2
Num. Sex Acts	80	7 (4, 9)	5 (3, 7)	0.2
Taking Septrin	80	47 (75%)	12 (71%)	0.8
Study site	80	27 (43%)	5 (29%)	0.3
Distance to clinic	75			0.082
0		25 (40%)	1 (8.3%)
1		27 (43%)	7 (58%)
2		11 (17%)	4 (33%)
ART at baseline	77	21 (33%)	5 (36%)	>0.9
CD4 count (per 100 cells/𝜇L)	77	4.20 (2.15, 5.20)	4.30 (2.50, 5.05)	0.8
Opportunistic infection	73	25 (40%)	4 (40%)	>0.9
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test

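Interpretation: The missingness pattern is non-monotone. Missing values occur in the outcome (opportunistic infection: 7 of 80, 8.8%), residence (5, 6.3%), income (4, 5.0%), ART at baseline (3, 3.8%) and CD4 count (3, 3.8%), as shown in the histogram; 63 participants (79%) are complete cases. In the table comparing complete and incomplete cases, the largest differences are in number of sex acts, study site and distance to clinic, although none reaches statistical significance (smallest p = 0.082, for distance to clinic). With only 17 incomplete cases these tests have little power, so the absence of significant differences is compatible with MCAR but does not establish it. We therefore assume that the data are at least MAR.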
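A minimal base R sketch of the kind of missingness summary and complete-case comparison described above. The data frame `toy` is simulated for illustration (an assumption, not CD4_infection.csv), and only two baseline variables are compared.

```r
# Hypothetical toy data with a few missing values; not the assignment data.
set.seed(2)
n <- 80
toy <- data.frame(
  age    = round(runif(n, 18, 50)),
  site   = rbinom(n, 1, 0.4),
  income = round(rnorm(n, 25, 8), 1),
  oi     = rbinom(n, 1, 0.4)
)
toy$income[sample(n, 4)] <- NA
toy$oi[sample(n, 7)]     <- NA

# Number and proportion of missing values per variable
n_miss <- colSums(is.na(toy))
p_miss <- colMeans(is.na(toy))

# Complete-case indicator, then compare baseline variables across groups
toy$complete <- complete.cases(toy)
table(toy$complete)
wilcox.test(age ~ complete, data = toy, exact = FALSE)  # continuous variable
fisher.test(table(toy$site, toy$complete))              # binary variable
```

Differences that predict `complete` would argue against MCAR; they cannot distinguish MAR from MNAR, which remains an assumption.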

Q2.3

Assuming that data are missing at random, implement MICE to impute the missing data. It is your decision what variable(s) to impute (e.g., only covariates, covariates plus exposure, or all variables in the analysis). Provide a detailed description of your approach, including the imputation methods used, variables included in the imputation model, the number of iterations, the number of imputed datasets, and diagnostic checks.

## 
##  iter imp variable
##   1   1  income  residence  art_base  cd4cnt
##   ... (the same line repeats for every iteration 1-10 and imputation 1-10)
##   10   10  income  residence  art_base  cd4cnt

## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##        id       age       sex    income  activity   septrin      site residence 
##        ""        ""        ""     "pmm"        ""        ""        "" "polyreg" 
##  art_base    cd4cnt        oi 
##  "logreg"     "pmm"        "" 
## PredictorMatrix:
##          id age sex income activity septrin site residence art_base cd4cnt oi
## id        0   1   1      1        1       1    1         1        1      1  1
## age       1   0   1      1        1       1    1         1        1      1  1
## sex       1   1   0      1        1       1    1         1        1      1  1
## income    1   1   1      0        1       1    1         1        1      1  1
## activity  1   1   1      1        0       1    1         1        1      1  1
## septrin   1   1   1      1        1       0    1         1        1      1  1

## $id
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $age
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $sex
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $income
##     1  2  3  4  5  6  7  8  9 10
## 6  57 47 57 32 57 57 35 32 47 57
## 8  29 14 28 14 28 41 32 30 19 23
## 13 47 57 57 57 35 32 41 47 57 41
## 19 47 41 57 42 47 30 19 41 47 57
## 
## $activity
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $septrin
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $site
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $residence
##    1 2 3 4 5 6 7 8 9 10
## 6  2 2 2 1 2 2 2 1 2  2
## 9  0 0 1 0 0 0 0 0 1  0
## 12 1 0 1 0 0 0 1 0 1  0
## 16 0 0 0 0 0 0 0 0 0  0
## 19 2 1 2 2 2 0 0 1 2  2
## 
## $art_base
##    1 2 3 4 5 6 7 8 9 10
## 3  0 1 0 1 0 0 0 0 0  0
## 12 0 1 0 1 1 1 1 0 0  1
## 15 0 0 0 0 0 1 0 0 0  0
## 
## $cd4cnt
##      1   2   3   4    5    6   7    8   9   10
## 9  4.9 2.5 5.4 5.2 6.05 5.15 5.1 5.50 5.1 5.20
## 12 1.6 1.5 1.9 2.0 1.65 2.15 3.1 4.20 2.9 2.15
## 23 3.4 4.9 5.4 5.1 6.80 5.40 3.2 5.05 5.2 2.50
## 
## $oi
##     1  2  3  4  5  6  7  8  9 10
## 1  NA NA NA NA NA NA NA NA NA NA
## 7  NA NA NA NA NA NA NA NA NA NA
## 14 NA NA NA NA NA NA NA NA NA NA
## 20 NA NA NA NA NA NA NA NA NA NA
## 25 NA NA NA NA NA NA NA NA NA NA
## 36 NA NA NA NA NA NA NA NA NA NA
## 39 NA NA NA NA NA NA NA NA NA NA

We chose to impute the main exposure and the confounders with missing data, and left the fully observed variables unchanged. The outcome (oi) was not imputed but was kept as a predictor in the imputation models. The imputed variables and their methods were: cd4cnt and income (continuous), predictive mean matching; art_base (dichotomous), logistic regression; and residence (categorical with more than two levels), polytomous regression. We imputed the exposure as well because the dataset is small (80 observations), CD4 count had only a few missing values (3), and CD4 count can be modeled relatively easily since there are well-known predictors of low CD4 counts. We generated 10 imputed datasets with 10 iterations each and visually inspected the imputed variables. Overall, the imputed values aligned well with the observed data, as shown by the density plots.
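The setup described above can be sketched with the mice package as follows. This is an illustration under stated assumptions, not the code used for the report: `toy` is simulated data with the same variable names, the seed is arbitrary, and the default predictor matrix from `make.predictorMatrix()` is used unchanged.

```r
library(mice)

# Hypothetical toy data mimicking the variable types; not the assignment data.
set.seed(724)
n <- 80
toy <- data.frame(
  age       = round(runif(n, 18, 50)),
  sex       = rbinom(n, 1, 0.45),
  income    = round(rnorm(n, 25, 8), 1),
  residence = factor(sample(0:2, n, replace = TRUE)),
  art_base  = factor(rbinom(n, 1, 0.35)),
  cd4cnt    = round(runif(n, 1, 6), 2),
  oi        = rbinom(n, 1, 0.4)
)
# Non-overlapping missing rows, so every covariate gap can be imputed
miss <- sample(n)
toy$income[miss[1:4]]    <- NA
toy$residence[miss[5:9]] <- NA
toy$art_base[miss[10:12]] <- NA
toy$cd4cnt[miss[13:15]]  <- NA
toy$oi[miss[16:22]]      <- NA

# One method per variable; "" means the variable is not imputed
meth <- make.method(toy)
meth[c("income", "cd4cnt")] <- "pmm"      # predictive mean matching
meth["residence"]           <- "polyreg"  # polytomous regression
meth["art_base"]            <- "logreg"   # logistic regression
meth["oi"]                  <- ""         # outcome left unimputed

pred <- make.predictorMatrix(toy)         # default: all other variables

imp <- mice(toy, m = 10, maxit = 10, method = meth,
            predictorMatrix = pred, seed = 724, printFlag = FALSE)

# Diagnostics: convergence traces and observed vs imputed densities
# plot(imp); densityplot(imp)
```

Passing `printFlag = FALSE` suppresses the iteration log. An identifier column such as `id` would normally be excluded from the predictor matrix, since it carries no information about the missing values.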

Q2.4

Perform appropriate analyses to estimate the causal relationship between CD4 counts at baseline and opportunistic infection in both the complete dataset and the imputed datasets. Please report your findings and indicate whether you observe any disparity in the results between these two methods.

**Table 8. Association of baseline CD4 count with the development of an opportunistic infection at the one-year follow-up**
Characteristic	Complete Data			Imputed Covars
Characteristic	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
cd4cnt	0.03	0.00, 0.18	0.009	0.03	0.00, 0.33	0.005
activity	1.09	0.59, 1.99	0.8	1.14	0.70, 1.84	0.6
age	0.90	0.47, 1.41	0.6	0.87	0.59, 1.27	0.5
income	1.23	0.74, 4.03	0.6	1.04	0.66, 1.64	0.9
as.factor(sex)
0	—	—		—	—
1	0.00	0.00, 2.34	0.3	0.01	0.00, 15.5	0.2
as.factor(art_base)
0	—	—		—	—
1	4.78	0.01, 51,694	0.6	6.19	0.01, 2,821	0.6
as.factor(residence)
0	—	—		—	—
1	11.5	0.01, 94,215	0.5	46.5	0.09, 23,991	0.2
2	0.62	0.00, 10,501,915	>0.9	85.5	0.01, 648,966	0.3
¹ OR = Odds Ratio, CI = Confidence Interval

The association between baseline CD4 count and the development of an opportunistic infection at one-year follow-up was consistent across the complete-case and imputed models. In both models, higher CD4 counts were strongly associated with lower odds of developing an opportunistic infection: the odds ratio (OR) per unit of CD4 count (100 cells/μL) was 0.03 in both analyses, with confidence intervals that exclude 1 (95% CI: 0.00, 0.18, p = 0.009 in the complete data; 95% CI: 0.00, 0.33, p = 0.005 in the imputed data). The imputed-data model has a slightly wider CI, which may reflect the between-imputation variance added when estimates are pooled, particularly if the imputation model does not fully capture the true relationships between variables or is otherwise misspecified.
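To see how pooling across imputed datasets can widen a confidence interval, Rubin's rules can be applied by hand. The estimates and standard errors below are hypothetical log odds ratios (assumptions for illustration, not values from this analysis), and the degrees of freedom use Rubin's original formula rather than the small-sample adjustment that `mice::pool()` applies.

```r
# Hypothetical log-OR estimates and standard errors from m = 5 imputed datasets
est <- c(-3.4, -3.6, -3.1, -3.8, -3.5)
se  <- c(1.20, 1.25, 1.15, 1.30, 1.22)
m   <- length(est)

q_bar <- mean(est)                  # pooled estimate
u_bar <- mean(se^2)                 # within-imputation variance
b     <- var(est)                   # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b    # total variance (Rubin's rules)

# Rubin's (1987) degrees of freedom for the t reference distribution
r  <- (1 + 1 / m) * b / u_bar
df <- (m - 1) * (1 + 1 / r)^2

# Pooled OR and 95% CI on the odds-ratio scale
half  <- qt(0.975, df) * sqrt(t_var)
or    <- exp(q_bar)
or_ci <- exp(c(lower = q_bar - half, upper = q_bar + half))
```

Because `t_var` adds a between-imputation component to the average within-imputation variance, the pooled interval can be wider than the interval from any single completed dataset.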

Q2.5

On the multiplicative scale, perform an appropriate analysis to examine whether the association between CD4 counts at baseline and opportunistic infection differs by sexual activity in both the complete dataset and the imputed datasets. It is your decision whether to treat sexual activity as a numeric variable or a categorical (factor) variable, or even to aggregate certain levels based on your scientific rationale. Please ensure that the results are interpreted appropriately.

**Table 9. Association of baseline CD4 count with the development of an opportunistic infection at the one-year follow-up, with a CD4 count × sexual activity interaction**
Characteristic	Complete Data			Imputed Covars
Characteristic	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
cd4cnt	0.18	0.00, 5.79	0.3	0.18	0.01, 4.77	0.3
activity	5.16	0.26, 747	0.3	5.43	0.24, 125	0.3
age	0.94	0.47, 1.70	0.8	0.89	0.59, 1.37	0.6
income	1.45	0.80, 5.32	0.4	1.18	0.68, 2.06	0.5
as.factor(sex)
0	—	—		—	—
1	0.00	0.00, 1.04	0.2	0.00	0.00, 89.3	0.2
as.factor(art_base)
0	—	—		—	—
1	9.20	0.01, 1,502,669	0.5	14.8	0.02, 13,562	0.4
as.factor(residence)
0	—	—		—	—
1	3.54	0.01, 26,892	0.7	13.9	0.03, 7,361	0.4
2	0.05	0.00, 29,761,604	0.8	20.2	0.00, 310,845	0.5
cd4cnt * activity	0.69	0.16, 1.38	0.3	0.69	0.33, 1.45	0.3
¹ OR = Odds Ratio, CI = Confidence Interval

CD4 Count (Exposure): Because the model includes a CD4 count × sexual activity interaction, the OR for CD4 count applies to participants with zero sex acts. This OR is 0.18 in both the complete and imputed data models (p = 0.3), which suggests a protective but statistically non-significant association: higher CD4 counts may be linked to lower odds of opportunistic infection, but the wide confidence intervals indicate substantial uncertainty.

Sexual Activity: Sexual activity was treated as a numeric variable (number of sex acts). Its OR is 5.16 (complete data) and 5.43 (imputed data) per additional sex act, both with p = 0.3, and it applies at a CD4 count of zero, far below the observed median of about 4.2. Although the OR suggests a strong positive association, the confidence intervals are very wide, indicating low precision.

Interaction Between CD4 Count and Sexual Activity: The interaction term (CD4 count × sexual activity) has an OR of 0.69 (complete and imputed data, p = 0.3). On the multiplicative scale, this means the OR for CD4 count is multiplied by 0.69 for each additional sex act, so the protective association of CD4 count would become stronger with more sexual activity. However, this result is not statistically significant, and the confidence intervals include 1 (0.16, 1.38 in the complete data; 0.33, 1.45 in the imputed data). Imputation improved precision only modestly, and neither analysis provides clear evidence that the association differs by sexual activity.
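With a product term in a logistic model, the OR for CD4 count at a given activity level a is exp(b1 + b3 * a) = exp(b1) * exp(b3)^a. The sketch below applies this to the rounded ORs from the interaction table (0.18 and 0.69); the helper `or_cd4_at()` is a hypothetical name introduced for illustration, and confidence intervals are omitted because they would need the coefficient covariance matrix, which the table does not report.

```r
# Rounded odds ratios taken from the interaction table
or_cd4 <- 0.18   # OR for CD4 count when activity = 0
or_int <- 0.69   # multiplicative change in that OR per additional sex act

# OR for CD4 count at a given number of sex acts (hypothetical helper)
or_cd4_at <- function(activity) or_cd4 * or_int^activity

# Evaluate at 0 and at the complete-case Q1, median and Q3 of activity
acts <- c(0, 4, 7, 9)
data.frame(activity = acts, or_cd4 = or_cd4_at(acts))
```

The conditional OR moves further below 1 as activity increases, which is the direction implied by an interaction OR below 1.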

Q2.6

Assume that age is a confounder in the association between baseline CD4 counts and opportunistic infections. You are concerned about the potential non-linear confounding effect of age, so you decide to include a non-linear term for age in the analyses of both the complete dataset and the imputed datasets. What non-linear term would you use? Please report your results and compare them to the methods that use a linear term for age.

**Table 10. Association of baseline CD4 count with the development of an opportunistic infection at the one-year follow-up, adjusting for age with a quadratic term**
Characteristic	Complete Data			Imputed Covars
Characteristic	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
cd4cnt	0.04	0.00, 0.22	0.008	0.03	0.00, 0.36	0.006
activity	1.00	0.51, 1.85	>0.9	1.04	0.61, 1.76	0.9
age	4.48	0.15, 1,652	0.4	4.67	0.08, 270	0.4
I(age^2)	0.98	0.89, 1.03	0.4	0.97	0.92, 1.03	0.4
income	1.13	0.72, 3.69	0.7	1.04	0.65, 1.68	0.9
as.factor(sex)
0	—	—		—	—
1	0.02	0.00, 10.5	0.4	0.03	0.00, 58.3	0.4
as.factor(art_base)
0	—	—		—	—
1	7.48	0.00, 3,365,480	0.6	6.85	0.00, 45,919	0.7
as.factor(residence)
0	—	—		—	—
1	19.5	0.01, 142,797	0.4	47.1	0.05, 41,410	0.3
2	36.6	0.00, 5,334,801,942	0.7	135	0.01, 2,226,870	0.3
¹ OR = Odds Ratio, CI = Confidence Interval

Including a squared term for age (age², entered as I(age^2)) allows for a nonlinear relationship between age and the risk of opportunistic infection. Biological processes and immune function often do not change linearly with age; for example, the risk of infection might be higher in both very young and very old individuals but lower in middle age. Adding age² allows a curvilinear adjustment, so that age is adequately controlled as a confounder even if its effect is nonlinear.

However, adding a squared term for age did not meaningfully change the association between CD4 count and opportunistic infections, as the ORs for CD4 count remained similar across models. In the complete-case analysis, the OR for CD4 count was 0.03 (95% CI: 0.00, 0.18, p = 0.009) in the linear age model and 0.04 (95% CI: 0.00, 0.22, p = 0.008) after adding age². Similarly, in the imputed model, the OR was essentially unchanged, from 0.03 (95% CI: 0.00, 0.33, p = 0.005) to 0.03 (95% CI: 0.00, 0.36, p = 0.006). The age² term itself was not significant (OR 0.98, p = 0.4 in the complete data; OR 0.97, p = 0.4 in the imputed data). These results suggest that adjusting for potential nonlinear effects of age did not materially change the estimated relationship between CD4 count and opportunistic infections.
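A minimal sketch of the linear versus quadratic age comparison, using simulated data (the data frame `toy` and the true coefficients are assumptions, not the assignment data). It fits both logistic models, extracts the OR for CD4 count from each, and tests the added age² term with a likelihood-ratio test.

```r
# Simulated (hypothetical) data with a U-shaped age effect; not the assignment data.
set.seed(3)
n      <- 300
age    <- round(runif(n, 18, 50))
cd4cnt <- round(runif(n, 1, 6), 2)
lp     <- 1 - 0.8 * cd4cnt + 0.002 * (age - 34)^2
oi     <- rbinom(n, 1, plogis(lp))
toy    <- data.frame(oi, cd4cnt, age)

# Logistic models with linear age and with an added quadratic term
fit_lin  <- glm(oi ~ cd4cnt + age, family = binomial, data = toy)
fit_quad <- glm(oi ~ cd4cnt + age + I(age^2), family = binomial, data = toy)

# OR for CD4 count under each age specification
or_cd4 <- c(linear    = exp(coef(fit_lin)[["cd4cnt"]]),
            quadratic = exp(coef(fit_quad)[["cd4cnt"]]))

# Likelihood-ratio test for the added age^2 term
lrt <- anova(fit_lin, fit_quad, test = "Chisq")
```

Age and age² are highly correlated on the raw scale, which can inflate their individual confidence intervals; centering age before squaring is one common way to reduce this, and it does not change the fitted values or the OR for CD4 count.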

PH724_Assignment2_VDC

Vanessa Davila

2025-03-11