Dataset: “monica” Found at: http://vincentarelbundock.github.io/Rdatasets/doc/DAAG/monica.html
We will investigate how both the event of myocardial infarction (commonly known as a ‘heart attack’) and the resulting death are affected by age, gender, and smoking.
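The code in this analysis uses `monica` together with dplyr and ggplot2 functions, but the setup step is not shown. A minimal sketch, assuming the data comes from the DAAG package (as the URL above suggests) and that all three packages are installed:

```r
# Assumed setup: the original does not show how the data and packages were loaded.
library(DAAG)     # provides the monica data frame
library(dplyr)    # group_by(), summarize(), n()
library(ggplot2)  # ggplot(), geom_bar(), geom_histogram(), geom_boxplot()

data(monica)      # load the dataset into the workspace
str(monica)       # quick structural check before summarizing
```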
First let’s take a look at some high level summary statistics of the monica dataset:
summary(monica)
##  outcome     sex           age           yronset      premi      smstat
##  live:3525   m:4605   Min.   :35.00   Min.   :85.00   y :1511   c :2051
##  dead:2842   f:1762   1st Qu.:55.00   1st Qu.:87.00   n :4122   x :1938
##                       Median :61.00   Median :89.00   nk: 734   n :1460
##                       Mean   :59.42   Mean   :88.75             nk: 918
##                       3rd Qu.:66.00   3rd Qu.:91.00
##                       Max.   :69.00   Max.   :93.00
##  diabetes    highbp     hichol     angina     stroke     hosp
##  y : 818     y :2877    y :1840    y :1919    y : 560    y:4442
##  n :4664     n :2542    n :3294    n :3473    n :4881    n:1925
##  nk: 885     nk: 948    nk:1233    nk: 975    nk: 926
##
##
##
Some of these column names are a bit unintuitive, so let’s rename them:
colnames(monica)
## [1] "outcome" "sex" "age" "yronset" "premi" "smstat"
## [7] "diabetes" "highbp" "hichol" "angina" "stroke" "hosp"
newnames <- c("outcome", "gender", "age", "year_of_event", "previous_event", "smoker", "diabetes", "high_blood_pressure", "high_cholesterol", "angina", "stroke", "hospitalized")
monica_clean <- monica
colnames(monica_clean) <- newnames
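The renaming above is positional, so it relies on the columns staying in the same order. As a hedged alternative, the sketch below renames by name using a lookup vector. The `toy` data frame and `name_map` are illustrative stand-ins, not part of the original analysis:

```r
# Toy one-row data frame with a subset of the original column names (illustrative only)
toy <- data.frame(outcome = "live", sex = "m", age = 50,
                  yronset = 85, premi = "n", smstat = "c")

# Map old names to new names; columns not listed keep their names
name_map <- c(sex = "gender", yronset = "year_of_event",
              premi = "previous_event", smstat = "smoker")

# Rename only the columns that appear in the lookup vector
to_rename <- names(toy) %in% names(name_map)
names(toy)[to_rename] <- unname(name_map[names(toy)[to_rename]])

names(toy)
```

Because each new name is tied to an old name, a reordered or extended data frame would still be renamed correctly.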
Now, let’s take another look at our data, this time grouping by some of the variables we are trying to investigate. Let’s start small with a grouping by gender only:
monica_groups <- group_by(monica_clean, gender)
summarize(monica_groups, mean_age = mean(age), median_age = median(age), count = n())
## # A tibble: 2 x 4
## gender mean_age median_age count
## <fct> <dbl> <dbl> <int>
## 1 m 58.8 61 4605
## 2 f 61.0 63 1762
Mean and median ages show that men have heart attacks at lower ages than women.
We can also see from the “count” column that men are vastly more likely to have a heart attack than women, to the tune of 4605/1762, or 2.6 times as likely.
#1 Men have heart attacks on average 2.2 years younger than women.
#2 Men are 2.6 times more likely to have a heart attack than women.
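The two figures behind these findings can be reproduced from the summary table. A small sketch, with the counts and rounded mean ages copied from the output above (the variable names are illustrative):

```r
# Counts and mean ages copied from the gender summary above
event_counts <- c(m = 4605, f = 1762)
mean_age <- c(m = 58.8, f = 61.0)

# Ratio of male to female events, and the gap in mean age
count_ratio <- unname(event_counts["m"] / event_counts["f"])
age_gap <- unname(mean_age["f"] - mean_age["m"])

round(count_ratio, 1)  # ratio of male to female events
round(age_gap, 1)      # difference in mean age, in years
```

Note that the ratio compares counts of recorded events. Reading it as a relative likelihood assumes similar numbers of men and women in the underlying population.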
Now let’s take a look at “Outcome”, i.e. whether the heart attack resulted in life or death.
monica_groups <- group_by(monica_clean, gender, outcome)
summarize(monica_groups, mean_age = mean(age), median_age = median(age), count = n())
## # A tibble: 4 x 5
## # Groups: gender [?]
## gender outcome mean_age median_age count
## <fct> <fct> <dbl> <dbl> <int>
## 1 m live 57.5 59 2550
## 2 m dead 60.5 62 2055
## 3 f live 60.4 62 975
## 4 f dead 61.7 64 787
Are women more or less susceptible to death than men? We can peek at the counts in the table above, or we can use the following R code to calculate the proportions:
male_dead_proportion <- nrow(subset(monica_clean, gender == 'm' & outcome == 'dead')) / nrow(subset(monica_clean, gender == 'm'))
male_dead_proportion
## [1] 0.4462541
female_dead_proportion <- nrow(subset(monica_clean, gender == 'f' & outcome == 'dead')) / nrow(subset(monica_clean, gender == 'f'))
female_dead_proportion
## [1] 0.4466515
The death rates are remarkably similar: 44.63 percent for men and 44.67 percent for women.
#3 While women are less likely to have a heart attack, if they do in fact have a heart attack, they have essentially the same risk of dying as men.
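The same proportions can be read off a two-way table in one step. A minimal sketch, using the counts from the gender/outcome summary above rather than the full data frame (the object names are illustrative):

```r
# Counts copied from the gender/outcome summary above
counts <- matrix(c(2550, 2055,
                   975,  787),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("m", "f"),
                                 outcome = c("live", "dead")))

# Row-wise proportions: each gender's row sums to 1
row_props <- prop.table(counts, margin = 1)

# Share of events that ended in death, per gender
death_rate <- row_props[, "dead"]
round(100 * death_rate, 2)
```

With the full data, `prop.table(table(monica_clean$gender, monica_clean$outcome), margin = 1)` should give the same table, assuming `monica_clean` is defined as above.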
How does smoking status affect the occurrence and outcome of heart attacks?
First, let’s look at the overall proportions of the data set. What proportion of the entire dataset are either ex smokers or smokers?
nrow(subset(monica_clean, smoker == 'c' | smoker == 'x'))/nrow(monica_clean)
## [1] 0.6265117
So nearly 63% of all heart attack sufferers were smokers or ex-smokers.
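This share includes the records with unknown smoking status (`nk`) in the denominator. A short sketch, using the `smstat` counts from the summary at the top, reproduces the figure and also shows the share among records where the status is known (the variable names are illustrative):

```r
# Counts copied from the smstat column of the summary output
smoker_counts <- c(c = 2051, x = 1938, n = 1460, nk = 918)

# Current or ex-smokers as a share of all records (matches the figure above)
ever_smoked <- sum(smoker_counts[c("c", "x")]) / sum(smoker_counts)

# Same share, restricted to records with a known smoking status
known <- smoker_counts[c("c", "x", "n")]
ever_smoked_known <- sum(known[c("c", "x")]) / sum(known)

ever_smoked
ever_smoked_known
```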
monica_groups <- group_by(monica_clean, smoker, outcome)
summarize(monica_groups, mean_age = mean(age), median_age = median(age), count = n())
## # A tibble: 8 x 5
## # Groups: smoker [?]
## smoker outcome mean_age median_age count
## <fct> <fct> <dbl> <dbl> <int>
## 1 c live 55.3 57 1337
## 2 c dead 59.4 61 714
## 3 x live 60.0 62 1186
## 4 x dead 61.8 64 752
## 5 n live 60.2 62 929
## 6 n dead 62.1 64 531
## 7 nk live 61.6 63 73
## 8 nk dead 60.2 63 845
Taking a closer look at the counts across these groups, we can see whether smoking status leads to a greater chance of dying given a heart attack event.
Manually computing the ratio of deaths to survivors (the odds of death) in each group:
#Current Smokers
714/1337
## [1] 0.5340314
#Ex-Smokers
752/1186
## [1] 0.6340641
#Non-Smokers
531/929
## [1] 0.5715823
The data is certainly not intuitive. Current smokers are actually the least likely to die if they have a heart attack, compared with ex-smokers and non-smokers.
Following the same logic as the gender analysis above, we might hypothesize that, despite smokers being more likely to have a heart attack, the chance of dying after one isn’t largely affected by smoking status or gender. Of course, this is just a hypothesis.
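The manual figures above divide deaths by survivors, which gives the odds of death rather than the share of patients who died. A brief sketch, with counts copied from the smoker/outcome table and illustrative variable names, computes both for the three known smoking groups:

```r
# Counts copied from the smoker/outcome summary above (nk group left out)
live <- c(c = 1337, x = 1186, n = 929)
dead <- c(c = 714,  x = 752,  n = 531)

# Odds of death: deaths per survivor (the figures computed manually above)
odds_of_death <- dead / live

# Proportion dying: deaths as a share of all events in the group
prop_dead <- dead / (live + dead)

round(odds_of_death, 3)
round(prop_dead, 3)

# Group with the lowest proportion dying
names(which.min(prop_dead))
```

The ordering of the groups is the same under either measure, so the observation about current smokers does not depend on which one is used.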
Let’s look at some of the original analyses above, and some additional analyses, but this time from a visual perspective.
This simple bar chart shows us the skew of gender in the data:
ggplot(monica_clean, aes(x = gender)) + geom_bar() + labs(x = "Gender", y = "# of Heart Attacks")
The same type of chart can show the skew across smoking status:
ggplot(monica_clean, aes(x = smoker)) + geom_bar() + labs(x = "Smoking Status (Current, Ex, Non, or Not Known)", y = "# of Heart Attacks")
How are heart attacks distributed across ages? Let's use a scatterplot to take a look:
monica_groups <- group_by(monica_clean, age)
ggplot(summarise(monica_groups, n = n()), aes(x = age, y = n)) + geom_point() + labs(x = "Age", y = "Count of Heart Attacks", title = "Heart Attacks by Age - Scatterplot")
So we can see that the number of heart attacks increases as age increases. This is pretty intuitive. Let’s look at how this data compares across genders, using a histogram:
ggplot(monica_clean, aes(age, fill = gender)) + geom_histogram(alpha = 0.5, position = 'identity', binwidth = 1) + labs(x = "Age", y = "Count of Heart Attacks", title = "Heart Attacks by Age - Male and Female - Histogram by Count")
We saw the skew in magnitude in the earlier data (i.e. more men have heart attacks than women). But how would this look if we normalized the data? We can switch the histogram to relative density, which shows the proportion of events at each age within each gender. This lets us see whether the age distributions differ across genders:
ggplot(monica_clean, aes(age, fill = gender)) + geom_histogram(aes( y = ..density..), alpha = 0.5, position = 'identity', binwidth = 1) + labs(x = "Age", y = "Density", title = "Heart Attacks by Age - Male and Female - Histogram by Density")
As we can see, women tend to have a higher proportion of their heart attacks later in life, while men have a higher proportion earlier in life.
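To make the normalization concrete: a density histogram divides each bin's count by the group's total count times the bin width, so the bars of each group cover a total area of 1. The sketch below applies that formula to a small hypothetical vector of ages (not taken from `monica`) with a bin width of 1:

```r
# Hypothetical ages, for illustration only (not taken from monica)
ages <- c(60, 60, 61, 63, 63, 63, 65, 65)
binwidth <- 1

# Count per one-year bin
bin_counts <- table(ages)

# Density = count / (total count * bin width)
bin_density <- as.numeric(bin_counts) / (length(ages) * binwidth)
names(bin_density) <- names(bin_counts)

bin_density

# The bar areas (density * width) add up to 1
sum(bin_density * binwidth)
```

Because each gender is scaled by its own total, the two histograms become comparable in shape even though the raw counts differ greatly.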
While we are at it, let’s look at the same type of histogram, only now analyzing the different types of smoker:
ggplot(monica_clean, aes(age, fill = smoker)) + geom_histogram(aes( y = ..density..), alpha = 0.4, position = 'identity', binwidth = 1) + labs(x = "Age", y = "Density", title = "Heart Attacks by Age - Smoking Status - Density Histogram")
While three of the four smoking-status groups show the same pattern, “c”, i.e. current smokers, reach their peak heart attack proportion around age 60, while the others continue to rise and peak in their late 60s.
Finally, we can take a look at the age variable through its quartiles. We saw this earlier with a text summary, but a boxplot can present it visually.
ggplot(monica_clean, aes(y = age, x = gender)) + geom_boxplot() + labs(title = "Age by Gender - Boxplot")
This confirms the slightly higher age of women that we saw in the text summaries above.
While we are at it, let’s look at boxplots across smoker status.
ggplot(monica_clean, aes(y = age, x = smoker)) + geom_boxplot() + labs(title = "Age by Smoker Status - Boxplot", x = "Smoker Status (Current, Ex, Non, or Not Known)")
Here we see a rather strong difference in age across smoking statuses: current smokers have a median age below 60, while the other groups are above 60. We can add that to our findings.
#4 Current smokers suffer from heart attacks at an earlier age than ex-smokers or non-smokers.
Our initial question for this analysis was “How are heart attacks affected by factors such as age, gender, and smoking?” Over the course of the analysis we made the following findings:
#1 Men have heart attacks on average 2.2 years younger than women.
#2 Men are 2.6 times more likely to have a heart attack than women.
#3 While women are less likely to have a heart attack, if they do in fact have a heart attack, they have essentially the same risk of dying as men.
#4 Current smokers suffer from heart attacks at an earlier age than ex-smokers or non-smokers.