Final project

Author

TLT

Part 1. Introduction

1 Background of research topic

Most of now-common knowledge about the risk factors related to strokes and coronary heart diseases (CHD) came from one of the biggest epidemiology studies of all time, the “Framingham Heart Study” (FHS, 1948-present) [1]. Before President Franklin Roosevelt’s death in 1945, it was considered normal for blood pressure to rise with age. On the day the President died of stroke, his blood pressure was measured 300/190 mmHg [2].

2. Source of data and how it was collected

3. Detais about the statistics you will use. What type of infomation does it contain?

4. Define the variables

5. Background research and facts, what research has been done on this topic

6. Define the overarching question to ask about the dataset

Part 2. Work with the data

Load libraries

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.2

Warning: package 'ggplot2' was built under R version 4.3.2

library(openintro)

Warning: package 'openintro' was built under R version 4.3.2

Warning: package 'airports' was built under R version 4.3.2

Warning: package 'cherryblossom' was built under R version 4.3.2

Warning: package 'usdata' was built under R version 4.3.2

library(infer)

Warning: package 'infer' was built under R version 4.3.2

library(skimr)

Warning: package 'skimr' was built under R version 4.3.3

library(tidymodels)

Warning: package 'tidymodels' was built under R version 4.3.2

Warning: package 'dials' was built under R version 4.3.2

Warning: package 'modeldata' was built under R version 4.3.2

Warning: package 'parsnip' was built under R version 4.3.2

Warning: package 'recipes' was built under R version 4.3.2

Warning: package 'rsample' was built under R version 4.3.2

Warning: package 'tune' was built under R version 4.3.2

Warning: package 'workflows' was built under R version 4.3.2

Warning: package 'workflowsets' was built under R version 4.3.2

Warning: package 'yardstick' was built under R version 4.3.2

library(knitr)
library(kableExtra)

Warning: package 'kableExtra' was built under R version 4.3.3

library(ggfortify)

Warning: package 'ggfortify' was built under R version 4.3.2

Load data

setwd("C:/Users/Thu Le Trinh/Dropbox/Montgomery College/Classes/Math217/Final Project/Final Project_Framingham/Data Analysis")
Framingham <- read_csv("frmgham2.csv", show_col_types = FALSE)
set.seed(77777)

ORIGINAL DATA EXPLORATION

1. Framingham_mutate: Mutation of numerical data into binary

Because the all variables in the orignial Framingham dataframe was created as numerical variables, to visualize the distribution of some binary variable, a new dataframe is created to mutate some numeric variables into binary.

Framingham_mutate:

Framingham_mutate <- Framingham %>% 
  mutate(sex = if_else(SEX >1 , "Female", "Male")) %>%
  mutate(bpmeds = if_else(BPMEDS >0 , "BPMEDS_Y", "BPMEDS_N")) |>
  mutate(smoke = if_else(CURSMOKE >0 , "SMOKE_Y", "SMOKE_N"))

2. Framingham1: Period 1 Analysis

Because each participant has 1 to 3 observations during 3 examination periods (1, 2, 3), the observations resulted in more data points than the number of participants. A new dataframe is created to include original participants in period 1:

Framingham1: original participants from PERIOD = 1

Framingham1 <- Framingham_mutate |>
  filter(PERIOD == "1")

2a. Distribution of AGE ~ sex on original participants at the time of first exam in period 1.

From the data we can see that the age average of both female and male participants were around 50, there were 2490 female and 1944 male participants.

ggplot(data = Framingham1, aes(sex, AGE, fill = sex)) + geom_violin() + labs(title = "AGE distribution among sex", x="sex", y="AGE")

Framingham1 %>%
  group_by(sex) %>%
    summarise(mean_AGE = mean(AGE), count = n()) # count

# A tibble: 2 × 3
  sex    mean_AGE count
  <chr>     <dbl> <int>
1 Female     50.0  2490
2 Male       49.8  1944

2b. Distribution of Smoking Status based on SEX

ggplot(data = Framingham1, aes(sex, CIGPDAY, fill = sex)) + geom_boxplot() + labs(title = "Smoking status vs sex", x="sex", y="Cigars per day")

Warning: Removed 32 rows containing non-finite values (`stat_boxplot()`).

Framingham1 %>%
  filter(!is.na(CIGPDAY)) |>
  group_by(sex) %>%
    summarise(mean_CIGPDAY = mean(CIGPDAY)) # average age

# A tibble: 2 × 2
  sex    mean_CIGPDAY
  <chr>         <dbl>
1 Female         5.65
2 Male          13.2

2c. Distribution of Blood pressure based on AGE. Blood pressure: I’m looking at the systolic blood pressure (SYSBP) because it was shown to directly link with stroke and CVD.

ggplot(Framingham1, aes(x=AGE, y=SYSBP)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Scatterplot of systolic blood pressure vs AGE", x="AGE", y="SYSBP")+ theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

2d. Distribution of total cholesterol

ggplot(Framingham1, aes(TOTCHOL)) + geom_histogram(bins=100)

Warning: Removed 52 rows containing non-finite values (`stat_bin()`).

Framingham1 %>%
  filter(!is.na(TOTCHOL)) |>
  group_by(sex) %>%
    summarise(mean_TOTCHOL = mean(TOTCHOL)) # average age

# A tibble: 2 × 2
  sex    mean_TOTCHOL
  <chr>         <dbl>
1 Female         240.
2 Male           234.

Counting subjects with SYSBP higher than 130

Framingham1_SYSBP: mutate SYSBP > or <= 130

Framingham1_SYSBP <- Framingham1 %>% 
  mutate(SYSBP_130plus = if_else(SYSBP >130 , "Yes", "No"))

Framingham1_SYSBP %>% 
  group_by(sex) |>
  count(SYSBP_130plus)

# A tibble: 4 × 3
# Groups:   sex [2]
  sex    SYSBP_130plus     n
  <chr>  <chr>         <int>
1 Female No             1328
2 Female Yes            1162
3 Male   No             1067
4 Male   Yes             877

Average SYSBP by sex and BP

Framingham1_SYSBP %>%
  filter(sex == "Female") |>
  group_by(SYSBP_130plus) |>
  summarize(mean_SYSBP = mean(SYSBP), count = n())

# A tibble: 2 × 3
  SYSBP_130plus mean_SYSBP count
  <chr>              <dbl> <int>
1 No                  116.  1328
2 Yes                 154.  1162

Framingham1_SYSBP %>%
  filter(sex == "Male") |>
  group_by(SYSBP_130plus) |>
  summarize(mean_SYSBP = mean(SYSBP), count = n())

# A tibble: 2 × 3
  SYSBP_130plus mean_SYSBP count
  <chr>              <dbl> <int>
1 No                  118.  1067
2 Yes                 148.   877

LINEAR REGRESSION MODEL

Lm model: STROKE ~ SEX + SYSBP + AGE + CURSMOKE + DIABETES

Fit1: STROKE ~ SEX + SYSBP + AGE + CURSMOKE + DIABETES

fit1 <- lm(data = Framingham1, STROKE ~ SEX + SYSBP + AGE + CURSMOKE + DIABETES) 
summary(fit1)


Call:
lm(formula = STROKE ~ SEX + SYSBP + AGE + CURSMOKE + DIABETES, 
    data = Framingham1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43888 -0.12513 -0.07042 -0.01956  1.02876 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.4088499  0.0349743 -11.690  < 2e-16 ***
SEX         -0.0098742  0.0087186  -1.133  0.25746    
SYSBP        0.0018708  0.0002070   9.037  < 2e-16 ***
AGE          0.0051021  0.0005416   9.420  < 2e-16 ***
CURSMOKE     0.0246726  0.0088561   2.786  0.00536 ** 
DIABETES     0.0863249  0.0262352   3.290  0.00101 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2819 on 4428 degrees of freedom
Multiple R-squared:  0.06432,   Adjusted R-squared:  0.06326 
F-statistic: 60.87 on 5 and 4428 DF,  p-value: < 2.2e-16

Because p-value is high for SEX, remove SEX

Fit2: STROKE ~ SYSBP + AGE + CURSMOKE + DIABETES

fit2 <- lm(data = Framingham1, STROKE ~ SYSBP + AGE + CURSMOKE + DIABETES) 
summary(fit2)


Call:
lm(formula = STROKE ~ SYSBP + AGE + CURSMOKE + DIABETES, data = Framingham1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43262 -0.12524 -0.06979 -0.01994  1.02570 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.4253208  0.0318082 -13.371  < 2e-16 ***
SYSBP        0.0018625  0.0002069   9.003  < 2e-16 ***
AGE          0.0051253  0.0005412   9.470  < 2e-16 ***
CURSMOKE     0.0266690  0.0086791   3.073 0.002134 ** 
DIABETES     0.0871223  0.0262266   3.322 0.000901 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2819 on 4429 degrees of freedom
Multiple R-squared:  0.06404,   Adjusted R-squared:  0.0632 
F-statistic: 75.77 on 4 and 4429 DF,  p-value: < 2.2e-16

Remove SEX did not improve the Adjusted R-squared of the model

Fit3: STROKE ~ SYSBP

fit3 <- lm(data = Framingham1_SYSBP, STROKE ~ SYSBP_130plus) 
summary(fit3)


Call:
lm(formula = STROKE ~ SYSBP_130plus, data = Framingham1_SYSBP)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1393 -0.1393 -0.0547 -0.0547  0.9453 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.054697   0.005890   9.286   <2e-16 ***
SYSBP_130plusYes 0.084587   0.008686   9.738   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2883 on 4432 degrees of freedom
Multiple R-squared:  0.02095,   Adjusted R-squared:  0.02073 
F-statistic: 94.83 on 1 and 4432 DF,  p-value: < 2.2e-16

Frequently, epidemiologists need to define the population at risk for some disease or event outcome, and individuals who have previously had an event need to be excluded from the analysis so that only new or first events are counted. Incidence or first event rates can be calculated using any of the three examinations as a baseline exam. The variables PREVAP, PREVMI, PREVCHD, PREVSTRK, and PREVHYP will define the population at risk for the outcome of interest.

Filter PREVSTRK = 0

Framingham1_SYSBP |>
  filter(PREVSTRK == "0")

# A tibble: 4,402 × 43
   RANDID   SEX TOTCHOL   AGE SYSBP DIABP CURSMOKE CIGPDAY   BMI DIABETES BPMEDS
    <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>   <dbl> <dbl>    <dbl>  <dbl>
 1   2448     1     195    39  106     70        0       0  27.0        0      0
 2   6238     2     250    46  121     81        0       0  28.7        0      0
 3   9428     1     245    48  128.    80        1      20  25.3        0      0
 4  10552     2     225    61  150     95        1      30  28.6        0      0
 5  11252     2     285    46  130     84        1      23  23.1        0      0
 6  11263     2     228    43  180    110        0       0  30.3        0      0
 7  12629     2     205    63  138     71        0       0  33.1        0      0
 8  12806     2     313    45  100     71        1      20  21.7        0      0
 9  14367     1     260    52  142.    89        0       0  26.4        0      0
10  16365     1     225    43  162    107        1      30  23.6        0      0
# ℹ 4,392 more rows
# ℹ 32 more variables: HEARTRTE <dbl>, GLUCOSE <dbl>, educ <dbl>,
#   PREVCHD <dbl>, PREVAP <dbl>, PREVMI <dbl>, PREVSTRK <dbl>, PREVHYP <dbl>,
#   TIME <dbl>, PERIOD <dbl>, HDLC <dbl>, LDLC <dbl>, DEATH <dbl>,
#   ANGINA <dbl>, HOSPMI <dbl>, MI_FCHD <dbl>, ANYCHD <dbl>, STROKE <dbl>,
#   CVD <dbl>, HYPERTEN <dbl>, TIMEAP <dbl>, TIMEMI <dbl>, TIMEMIFC <dbl>,
#   TIMECHD <dbl>, TIMESTRK <dbl>, TIMECVD <dbl>, TIMEDTH <dbl>, …

library(ggfortify) 
autoplot(fit1, nrow=2, ncol=2)