### Q1. DAG

Create a Directed Acyclic Graph (DAG) to determine the associations of smoking intensity and smoking duration with systolic and diastolic blood pressure, as well as the diagnosis of high blood pressure.

Justify how you decided the role of each variable included in the DAG. If you have concerns about residual confounding, please present it in the DAG.

Age, sex, race: These are common causes of both smoking behavior and blood pressure, creating potential confounding. They are fundamental determinants of both lifestyle patterns and health outcomes.

Income and education: These socioeconomic factors influence smoking behavior and other lifestyle habits, such as alcohol consumption and physical activity. However, they are downstream variables, shaped by factors like age, sex, and race.

Physical activity and alcohol consumption: These lifestyle factors influence both smoking habits and cardiovascular outcomes. They are, in turn, shaped by structural determinants such as socioeconomic status and access to healthcare.

Diabetes, asthma, and cholesterol: These are pathophysiological intermediates, lying on the causal pathway from smoking to high blood pressure.

Tobacco tax: Treated as an instrumental variable (IV). It is assumed to affect smoking intensity and duration only through its effect on tobacco prices, and to have no causal path to blood pressure other than through smoking behavior (the exclusion restriction).

Stress and diet: unmeasured variables that could influence both smoking behavior and blood pressure, leading to residual confounding.

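Colliders: If a variable such as physical activity is itself affected by smoking as well as by other structural factors, it acts as a collider on that path. Adjusting for it can open a non-causal path between smoking and blood pressure and introduce collider-stratification bias instead of removing confounding.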

### Load R packages

### Q2. Read in, browse and clean the dataset. Glimpse the dataset

## Rows: 1,629
## Columns: 27
## $ X              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ id             <int> 233, 235, 244, 245, 252, 257, 262, 266, 419, 420, 428, …
## $ sex            <int> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0…
## $ age            <int> 42, 36, 56, 68, 40, 43, 56, 29, 51, 43, 43, 34, 54, 51,…
## $ race           <int> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ education      <int> 1, 2, 2, 1, 2, 2, 3, 3, 2, 2, 3, 3, 1, 3, 1, 3, 3, 3, 3…
## $ income         <int> 19, 18, 15, 15, 18, 11, 19, 22, 18, 16, 19, 18, 16, NA,…
## $ marital        <int> 2, 2, 3, 3, 2, 4, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2…
## $ active         <fct> Very active, Very active, Very active, Moderately activ…
## $ alcoholfreq    <fct> 2-3 times/week, Almost every day, < 12 times/year, 1-4 …
## $ tobacco_price  <dbl> 2.183594, 2.346680, 1.569580, 1.506592, 2.346680, 2.209…
## $ tobacco_tax    <dbl> 1.1022949, 1.3649902, 0.5512695, 0.5249023, 1.3649902, …
## $ smokeintensity <int> 30, 20, 20, 3, 20, 10, 20, 2, 25, 20, 30, 40, 20, 10, 4…
## $ smokeyrs       <int> 29, 24, 26, 53, 19, 21, 39, 9, 37, 25, 24, 20, 19, 38, …
## $ height         <dbl> 174.1875, 159.3750, 168.5000, 170.1875, 181.8750, 162.1…
## $ weight         <dbl> 79.04, 58.63, 56.81, 59.42, 87.09, 99.00, 63.05, 58.74,…
## $ asthma         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cholesterol    <int> 197, 301, 157, 174, 216, 212, 205, 166, 337, 279, 173, …
## $ diabetes       <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ dbp            <int> 96, 80, 75, 78, 77, 83, 69, 53, 79, 106, 89, 69, 80, 10…
## $ sbp            <int> 175, 123, 115, 148, 118, 141, 132, 100, 163, 184, 135, …
## $ hbp            <int> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ death          <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1…
## $ dadth          <int> NA, NA, NA, 14, NA, NA, NA, NA, 13, 17, NA, 28, NA, NA,…
## $ modth          <int> NA, NA, NA, 2, NA, NA, NA, NA, 10, 10, NA, 11, NA, NA, …
## $ yrdth          <int> NA, NA, NA, 85, NA, NA, NA, NA, 84, 86, NA, 92, NA, NA,…
## $ income_group   <fct> $5000 - $14999, $5000 - $14999, $3000 - $4999, $3000 - …

### Q3

Group subsets of females and males and report the distributions of smoking intensity, smoking duration, systolic blood pressure, and diastolic blood pressure by sex, using minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

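Smoking and BP by sex

| Characteristic | Sex | Q1 | Median | Mean | Q3 | Min | Max |
|---|---|---|---|---|---|---|---|
| Smoke Intensity | Male (N = 799) | 15 | 20 | 23 | 30 | 1 | 80 |
| | Female (N = 830) | 10 | 20 | 18 | 20 | 1 | 60 |
| Smoke Duration | Male (N = 799) | 16 | 27 | 27 | 36 | 1 | 60 |
| | Female (N = 830) | 14 | 22 | 23 | 31 | 1 | 64 |
| Systolic Blood Pressure | Male (N = 799) | 118 | 129 | 131 | 141 | 90 | 229 |
| | Female (N = 830) | 113 | 124 | 126 | 137 | 87 | 212 |
| Diastolic Blood Pressure | Male (N = 799) | 71 | 79 | 79 | 86 | 48 | 130 |
| | Female (N = 830) | 69 | 76 | 76 | 82 | 47 | 112 |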
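
A by-sex summary like the one above could be produced with a few lines of base R. The sketch below is only illustrative: the data frame `toy` and its values are made up, and the coding of `sex` as 0 = male and 1 = female is an assumption inferred from the group sizes reported in this document.

```r
# Illustrative toy data only (not the assignment data).
toy <- data.frame(
  sex = c(0, 0, 0, 1, 1, 1),
  sbp = c(120, 130, 140, 110, 115, 135)
)

# Assumed coding: 0 = male, 1 = female.
toy$sex_label <- factor(toy$sex, levels = c(0, 1), labels = c("Male", "Female"))

# summary() on a numeric vector reports Min, 1st Qu., Median, Mean, 3rd Qu., Max.
by_sex <- lapply(split(toy$sbp, toy$sex_label), summary)
print(by_sex)
```

The same call can be repeated for each variable of interest (smoking intensity, smoking duration, SBP, DBP).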

### Q4. Body Mass Index distribution

Create a new variable, body mass index (BMI), using height and weight (hint: please pay attention to the units of height and weight) and plot the distribution of BMI using a histogram. Attach the histogram and describe the distribution.

Interpretation: The histogram shows a right-skewed (positively skewed) distribution: most patients have BMI values in the lower part of the range, and a long right tail contains the relatively few patients with morbid obesity. The mode is around 22 kg/m².
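
The unit conversion behind the BMI variable can be sketched as follows. This assumes that `height` is recorded in centimetres and `weight` in kilograms, which is what the values in the `glimpse()` output suggest; the helper name `bmi_from_cm_kg` is hypothetical.

```r
# Assumption: height is in centimetres and weight in kilograms.
bmi_from_cm_kg <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # convert cm to m
  weight_kg / height_m^2        # BMI in kg/m^2
}

# First three height/weight pairs shown in the glimpse() output
height <- c(174.1875, 159.3750, 168.5000)
weight <- c(79.04, 58.63, 56.81)

bmi <- bmi_from_cm_kg(weight, height)
print(round(bmi, 1))

# With the full data, a histogram could then be drawn with, e.g.,
# hist(bmi, breaks = 30, xlab = "BMI (kg/m^2)")
```

Forgetting the division by 100 would give BMI values around 0.002 to 0.003, so checking that the results fall in a plausible range is a quick way to catch a unit error.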

### Q5. Date variable

Create a date variable named dth_date in ymd format (e.g., “1985-02-14”) combining three columns – day, month, and year of death. Copy and paste the first six nonmissing rows of this variable here.

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `dth_date = ymd(dth_date)`.
## Caused by warning:
## !  1311 failed to parse.
##     dth_date
## 1 1985-02-14
## 2 1984-10-13
## 3 1986-10-17
## 4 1992-11-28
## 5 1988-01-03
## 6 1989-09-01
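
The parse warning above arises because most participants have no death date. One way to avoid it is to build the date only for rows where day, month, and year are all present. The base-R sketch below is not necessarily how the variable was created here; the helper `make_dth_date` is hypothetical, and it assumes that the two-digit `yrdth` values all refer to the 1900s, as the printed dates suggest.

```r
# Hypothetical helper: combine day, month, and two-digit year into a Date.
# Assumption: all two-digit years belong to the 1900s.
make_dth_date <- function(day, month, year2) {
  out <- rep(as.Date(NA), length(day))
  ok <- !is.na(day) & !is.na(month) & !is.na(year2)
  iso <- sprintf("%04d-%02d-%02d",
                 1900L + as.integer(year2[ok]),
                 as.integer(month[ok]),
                 as.integer(day[ok]))
  out[ok] <- as.Date(iso)   # rows with missing parts stay NA, no warning
  out
}

# Values taken from the dadth, modth, and yrdth columns shown by glimpse()
dadth <- c(NA, 14, 13, 17)
modth <- c(NA, 2, 10, 10)
yrdth <- c(NA, 85, 84, 86)

dth_date <- make_dth_date(dadth, modth, yrdth)
print(head(dth_date[!is.na(dth_date)]))
```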

### Q6. Table with confounders and precision variables, overall and by smoke intensity tertiles

Create a table to show participant characteristics in the overall sample and separated by smoking intensity tertiles. You should at least include confounders and precision variables defined in your DAG above. Please present two versions of this table – one based on complete data.

##   smokeintensity smoke_tertile
## 1             30          High
## 2             20        Medium
## 3             20        Medium
## 4              3           Low
## 5             20        Medium
## 6             10           Low
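
The three groups reported here are exactly equal in size (543 each), which suggests a rank-based split (for example with `dplyr::ntile()`), although that is an inference and not something the output confirms. A rank-based split can place identical intensity values in different tertiles when there are many ties. As a hedged alternative, the sketch below derives tertiles from quantile cut points, so tied values always share a group; the helper `tertile_by_cutpoints` is hypothetical and the input values are made up.

```r
# Hypothetical helper: tertiles from quantile cut points. Tied values always
# land in the same group, so group sizes may be unequal.
# Note: cut() stops with an error if the two cut points coincide.
tertile_by_cutpoints <- function(x) {
  cuts <- quantile(x, probs = c(1/3, 2/3), na.rm = TRUE, names = FALSE)
  cut(x,
      breaks = c(-Inf, cuts, Inf),
      labels = c("Low", "Medium", "High"),
      right = TRUE)
}

# Made-up intensities for illustration (not the assignment data)
x <- c(1, 2, 3, 10, 11, 12, 30, 31, 32)
smoke_tertile <- tertile_by_cutpoints(x)
print(table(smoke_tertile))
```

Which definition is preferable depends on whether equal group sizes or consistent cut points matter more for the comparison.
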
Characteristics by smoke intensity, including missing values

| Characteristic | Overall, N = 1,629 | Low, N = 543 | Medium, N = 543 | High, N = 543 |
|---|---|---|---|---|
| Age | 44 (33, 53) | 45 (33, 54) | 44 (33, 54) | 43 (34, 51) |
| Sex (female) | 830 (51%) | 349 (64%) | 275 (51%) | 206 (38%) |
| Family income | | | | |
|     Less than $3000 | 157 (10%) | 72 (14%) | 54 (10%) | 31 (5.9%) |
|     $3000 - $4999 | 146 (9.3%) | 58 (11%) | 45 (8.6%) | 43 (8.2%) |
|     $5000 - $14999 | 844 (54%) | 266 (51%) | 310 (59%) | 268 (51%) |
|     $15000 and above | 420 (27%) | 123 (24%) | 116 (22%) | 181 (35%) |
|     Unknown | 62 | 24 | 18 | 20 |
| Physical activity | | | | |
|     Very active | 729 (45%) | 245 (45%) | 251 (46%) | 233 (43%) |
|     Moderately active | 738 (45%) | 251 (46%) | 238 (44%) | 249 (46%) |
|     Inactive | 162 (9.9%) | 47 (8.7%) | 54 (9.9%) | 61 (11%) |
| Alcohol consumption | | | | |
|     Almost every day | 336 (21%) | 91 (17%) | 103 (19%) | 142 (26%) |
|     2-3 times/week | 231 (14%) | 64 (12%) | 80 (15%) | 87 (16%) |
|     1-4 times/month | 506 (31%) | 183 (34%) | 174 (32%) | 149 (28%) |
|     < 12 times/year | 344 (21%) | 131 (24%) | 114 (21%) | 99 (18%) |
|     No alcohol last year | 207 (13%) | 73 (13%) | 70 (13%) | 64 (12%) |
|     Unknown | 5 | 1 | 2 | 2 |
| In-State tobacco tax | 1.05 (0.94, 1.15) | 1.05 (0.94, 1.26) | 1.05 (0.94, 1.15) | 1.05 (0.94, 1.15) |
|     Unknown | 92 | 37 | 28 | 27 |

Median (Q1, Q3); n (%)

Characteristics by smoke intensity, without missing values

| Characteristic | Overall, N = 1,629 | Low, N = 543 | Medium, N = 543 | High, N = 543 |
|---|---|---|---|---|
| Age | 44 (33, 53) | 45 (33, 54) | 44 (33, 54) | 43 (34, 51) |
| Sex (female) | 830 (51%) | 349 (64%) | 275 (51%) | 206 (38%) |
| Family income | | | | |
|     Less than $3000 | 157 (10%) | 72 (14%) | 54 (10%) | 31 (5.9%) |
|     $3000 - $4999 | 146 (9.3%) | 58 (11%) | 45 (8.6%) | 43 (8.2%) |
|     $5000 - $14999 | 844 (54%) | 266 (51%) | 310 (59%) | 268 (51%) |
|     $15000 and above | 420 (27%) | 123 (24%) | 116 (22%) | 181 (35%) |
| Physical activity | | | | |
|     Very active | 729 (45%) | 245 (45%) | 251 (46%) | 233 (43%) |
|     Moderately active | 738 (45%) | 251 (46%) | 238 (44%) | 249 (46%) |
|     Inactive | 162 (9.9%) | 47 (8.7%) | 54 (9.9%) | 61 (11%) |
| Alcohol consumption | | | | |
|     Almost every day | 336 (21%) | 91 (17%) | 103 (19%) | 142 (26%) |
|     2-3 times/week | 231 (14%) | 64 (12%) | 80 (15%) | 87 (16%) |
|     1-4 times/month | 506 (31%) | 183 (34%) | 174 (32%) | 149 (28%) |
|     < 12 times/year | 344 (21%) | 131 (24%) | 114 (21%) | 99 (18%) |
|     No alcohol last year | 207 (13%) | 73 (13%) | 70 (13%) | 64 (12%) |
| In-State tobacco tax | 1.05 (0.94, 1.15) | 1.05 (0.94, 1.26) | 1.05 (0.94, 1.15) | 1.05 (0.94, 1.15) |

Median (Q1, Q3); n (%)

### Q7. Correlation sbp/dbp

Calculate both Pearson and Spearman correlations between systolic and diastolic blood pressure and report your results.

## [1] 0.5561762
## [1] 0.5651468

Interpretation: The correlation coefficients indicate a moderate positive correlation between sbp and dbp: as sbp increases, dbp also tends to increase. The two coefficients are very similar (about 0.56 and 0.57), which suggests that outliers do not heavily influence the result, since the rank-based Spearman coefficient is less sensitive to outliers than the Pearson coefficient.
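
For reference, both coefficients can be obtained from base R's `cor()` by changing the `method` argument. The sketch below uses made-up values (not the assignment data), with one missing reading to show why `use = "complete.obs"` is needed when either variable contains `NA`.

```r
# Made-up blood pressure values for illustration; NA mimics a missing reading
sbp <- c(110, 120, 130, 140, NA, 200)
dbp <- c(70, 72, 80, 85, 90, 95)

# Pearson measures linear association; Spearman is computed on ranks
pearson  <- cor(sbp, dbp, use = "complete.obs", method = "pearson")
spearman <- cor(sbp, dbp, use = "complete.obs", method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

In this toy example the complete pairs are perfectly monotone but not perfectly linear, so the Spearman coefficient equals 1 while the Pearson coefficient is slightly lower.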

### Q8. HBP definition according to different criteria

Use the World Health Organization criteria (i.e., systolic blood pressure ≥ 140 mmHg and/or diastolic blood pressure ≥ 90 mmHg) to define high blood pressure and compare the distribution to the diagnosis of high blood pressure (i.e., variable hbp). Do you find these two distributions consistent with each other? How about using the American Heart Association criteria (i.e., systolic blood pressure ≥ 130 mmHg and/or diastolic blood pressure ≥ 80 mmHg)?

##   sbp dbp hbp hbp_who hbp_aha
## 1 175  96 Yes     Yes     Yes
## 2 123  80  No      No     Yes
## 3 115  75  No      No      No
## 4 148  78 Yes     Yes     Yes
## 5 118  77  No      No      No
## 6 141  83  No     Yes     Yes
## 791 missing rows in the "hbp" column have been removed.
Comparison of HBP Diagnoses by WHO and AHA Criteria

| Characteristic | Overall, N = 838 | HBP diagnosis: No, N = 708 | HBP diagnosis: Yes, N = 130 |
|---|---|---|---|
| WHO-defined High BP | 247 (31%) | 187 (28%) | 60 (50%) |
| AHA-defined High BP | 468 (59%) | 377 (56%) | 91 (76%) |

n (%)

Interpretation: Agreement between the recorded HBP diagnosis and the WHO- and AHA-based definitions is low. Among the 838 participants with a non-missing hbp value, the prevalence of diagnosed HBP is approximately 15% (130/838), compared with 31% using the WHO criteria and 59% using the AHA criteria. Among individuals diagnosed with HBP, 50% meet the WHO criteria and 76% meet the AHA criteria. Conversely, 28% of those without an HBP diagnosis meet the WHO criteria, and 56% meet the AHA criteria. The AHA criteria are the more sensitive of the two, because their lower thresholds result in a higher estimated prevalence.
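
The two threshold definitions can be written as a single helper with different cut-offs. The sketch below is a minimal base-R illustration, not necessarily the code used for this report; `classify_bp` is a hypothetical name, and the six rows are the ones printed in the output above.

```r
# Hypothetical helper: "Yes" if either pressure reaches its threshold.
# With missing values the result is NA only when the answer is undetermined.
classify_bp <- function(sbp, dbp, sbp_cut, dbp_cut) {
  high <- sbp >= sbp_cut | dbp >= dbp_cut
  factor(ifelse(high, "Yes", "No"), levels = c("No", "Yes"))
}

# The six rows printed in the output above
sbp <- c(175, 123, 115, 148, 118, 141)
dbp <- c(96, 80, 75, 78, 77, 83)
hbp <- factor(c("Yes", "No", "No", "Yes", "No", "No"), levels = c("No", "Yes"))

hbp_who <- classify_bp(sbp, dbp, sbp_cut = 140, dbp_cut = 90)  # WHO thresholds
hbp_aha <- classify_bp(sbp, dbp, sbp_cut = 130, dbp_cut = 80)  # AHA thresholds

# Cross-tabulate each definition against the recorded diagnosis
print(table(diagnosis = hbp, WHO = hbp_who))
print(table(diagnosis = hbp, AHA = hbp_aha))
```

The second row shows how the definitions diverge: a diastolic reading of exactly 80 mmHg meets the AHA threshold but not the WHO threshold.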

### Q9. Scatterplots

Plot a scatterplot for smoking intensity and systolic blood pressure, as well as a scatterplot for smoking duration and systolic blood pressure. It may help result interpretation if you fit a regression line. Show the figures and describe your findings.

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 77 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 77 rows containing missing values or values outside the scale range
## (`geom_point()`).

## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 77 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 77 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation: There is no clear linear relationship between smoking intensity and SBP: the fitted regression line has a near-zero slope, which suggests that smoking intensity is not a strong predictor of SBP in these data. In contrast, SBP increases with smoking duration. These are crude (unadjusted) associations, however. Smoking duration is closely tied to age, which the DAG identifies as a common cause of smoking behavior and blood pressure, so the steeper slope for duration may partly reflect confounding by age and does not by itself show that duration has a greater effect than intensity.
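
Comparing a crude and an age-adjusted regression slope is one way to probe how much of the duration association is attributable to age. The sketch below uses entirely made-up data, deliberately constructed so that SBP depends only on age; it illustrates the mechanism and says nothing about the actual size of the effect in the assignment data.

```r
# Made-up data: sbp is constructed to depend on age only,
# while smoking duration is correlated with age.
toy <- data.frame(
  age      = c(30, 35, 40, 50, 60, 70),
  smokeyrs = c(10, 12, 25, 28, 35, 52)
)
toy$sbp <- 100 + 0.5 * toy$age

# Crude slope: picks up the age effect through its link with duration
crude_slope <- coef(lm(sbp ~ smokeyrs, data = toy))[["smokeyrs"]]

# Age-adjusted slope: close to zero once age is in the model
adjusted_slope <- coef(lm(sbp ~ smokeyrs + age, data = toy))[["smokeyrs"]]

print(c(crude = crude_slope, adjusted = adjusted_slope))
```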