Heart Disease UCI

Find an interesting dataset and prepare a short report (in R Markdown) which consists of:

a short description of the dataset,
4 barplots which present interesting relationships between variables,
brief comments which describe the obtained results.

Then, edit the theme and all scales of the graphs to prepare publication-ready plots.

Introduction

The dataset is publicly available on the Kaggle website and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information: it includes 4,238 records and 16 variables (15 attributes plus the TenYearCHD outcome).

Source: https://www.kaggle.com/ronitf/heart-disease-uci

Data Variables

1. male | whether the patient is male (1) or female (0) (Nominal)
2. age | age of the patient (Continuous)
3. education | education level of the patient, coded 1 to 4 (Ordinal)
4. currentSmoker | whether or not the patient is a current smoker (Nominal)
5. cigsPerDay | the number of cigarettes that the person smoked on average in one day (Continuous)
6. BPMeds | whether or not the patient was on blood pressure medication (Nominal)
7. prevalentStroke | whether or not the patient had previously had a stroke (Nominal)
8. prevalentHyp | whether or not the patient was hypertensive (Nominal)
9. diabetes | whether or not the patient had diabetes (Nominal)
10. totChol | total cholesterol level (Continuous)
11. sysBP | systolic blood pressure (Continuous)
12. diaBP | diastolic blood pressure (Continuous)
13. BMI | Body Mass Index (Continuous)
14. heartRate | heart rate (Continuous)
15. glucose | glucose level (Continuous)
16. TenYearCHD | 10-year risk of coronary heart disease (CHD) (Binary: “1” means “Yes”, “0” means “No”)

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(RColorBrewer)

hdiease <- read.csv('/Users/jeank4723/Desktop/Advance VR/1/Data/framingham.csv')

head(hdiease)

##   male age education currentSmoker cigsPerDay BPMeds prevalentStroke
## 1    1  39         4             0          0      0               0
## 2    0  46         2             0          0      0               0
## 3    1  48         1             1         20      0               0
## 4    0  61         3             1         30      0               0
## 5    0  46         3             1         23      0               0
## 6    0  43         2             0          0      0               0
##   prevalentHyp diabetes totChol sysBP diaBP   BMI heartRate glucose TenYearCHD
## 1            0        0     195 106.0    70 26.97        80      77          0
## 2            0        0     250 121.0    81 28.73        95      76          0
## 3            0        0     245 127.5    80 25.34        75      70          0
## 4            1        0     225 150.0    95 28.58        65     103          1
## 5            0        0     285 130.0    84 23.10        85      85          0
## 6            1        0     228 180.0   110 30.30        77      99          0

# Convert the categorical columns to factors
# (BPMeds is left as an integer column here)
hdiease <- hdiease %>% 
  mutate(across(c(male, education, currentSmoker,
                  prevalentStroke, prevalentHyp, diabetes,
                  TenYearCHD), as.factor))


str(hdiease)

## 'data.frame':    4238 obs. of  16 variables:
##  $ male           : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 2 ...
##  $ age            : int  39 46 48 61 46 43 63 45 52 43 ...
##  $ education      : Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 2 1 2 1 1 ...
##  $ currentSmoker  : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 2 1 2 ...
##  $ cigsPerDay     : int  0 0 20 30 23 0 0 20 0 30 ...
##  $ BPMeds         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ prevalentStroke: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentHyp   : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 2 2 ...
##  $ diabetes       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ totChol        : int  195 250 245 225 285 228 205 313 260 225 ...
##  $ sysBP          : num  106 121 128 150 130 ...
##  $ diaBP          : num  70 81 80 95 84 110 71 71 89 107 ...
##  $ BMI            : num  27 28.7 25.3 28.6 23.1 ...
##  $ heartRate      : int  80 95 75 65 85 77 60 79 76 93 ...
##  $ glucose        : int  77 76 70 103 85 99 85 78 79 88 ...
##  $ TenYearCHD     : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
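
The str() output lists column types but not missing values, and the education counts in the pie-chart section include an NA group. A minimal sketch of counting missing values per column is given here on a small made-up data frame (toy is a hypothetical stand-in for hdiease, and its values are illustrative), using base R only:

```r
# Hypothetical toy data; the values are illustrative, not read from the file
toy <- data.frame(
  education = c(4, 2, NA, 3, NA),
  BMI       = c(26.97, 28.73, 25.34, NA, 23.10),
  age       = c(39, 46, 48, 61, 46)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

On the real data the same check would be colSums(is.na(hdiease)).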

4 barplots

1. Age and Count

This plot shows the age distribution of the patients; most subjects are middle-aged or elderly. The most common age is 40, with nearly 200 patients. The youngest and oldest patients are 38 years apart.

p1 <- ggplot(data = hdiease, aes(x = age))  

p1 + 
  geom_bar(stat = 'count', aes(y = after_stat(count)), fill = 'skyblue4') +
  labs(x = 'Age', y = 'Population', title = 'Age of the patient') +
  theme_classic()
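
The figures quoted for this plot (the most common age and the age span) can also be computed directly. The sketch below uses only the ten ages visible in the str() output, so its results differ from those of the full dataset; ages_sample, age_span, and most_common are names introduced here for illustration.

```r
# First ten ages as printed by str(hdiease); not the full column
ages_sample <- c(39, 46, 48, 61, 46, 43, 63, 45, 52, 43)

# Span between the youngest and the oldest patient in the sample
age_span <- diff(range(ages_sample))

# Frequency of each age, and the age(s) with the highest count
# (all tied ages are kept)
age_counts  <- table(ages_sample)
most_common <- names(age_counts)[as.vector(age_counts) == max(age_counts)]

print(age_span)
print(most_common)
```

With the full data, replacing ages_sample by hdiease$age would give the values behind the plot.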

2. CurrentSmoker and Gender

According to the plot, there are more female patients than male patients overall. Among current smokers, men outnumber women, whereas among non-smokers women outnumber men. However, the dodged bars make it hard to compare the total number of current smokers with the total number of non-smokers. To see that comparison, change position = "dodge" to position = "stack".

p2 <- ggplot(data = hdiease, aes(x = currentSmoker, fill = male)) 
   
p2 + geom_bar(position = "dodge") +
  xlab('Current Smoker') +
  ylab('Population') +
  labs(title = 'Current Smokers',
       subtitle = '0 means not a current smoker; 1 means a current smoker',
       fill = 'Gender')
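
The stacked variant mentioned in the text could look like the sketch below. It runs on a small made-up data frame (toy, a hypothetical stand-in for hdiease) so that it is self-contained; p2_stacked and stacked_counts are names introduced here.

```r
library(ggplot2)

# Hypothetical toy data standing in for hdiease (7 made-up patients)
toy <- data.frame(
  currentSmoker = factor(c(0, 0, 0, 1, 1, 1, 1)),
  male          = factor(c(0, 0, 1, 0, 1, 1, 1))
)

# Same mapping as p2, but bars are stacked so that the total height
# of each bar is the number of patients in that smoking group
p2_stacked <- ggplot(toy, aes(x = currentSmoker, fill = male)) +
  geom_bar(position = "stack") +
  labs(x = "Current Smoker", y = "Population", fill = "Gender")

# Bar segments as computed by ggplot2 (one row per smoker/gender cell)
stacked_counts <- layer_data(p2_stacked)
print(stacked_counts[, c("x", "count", "ymin", "ymax")])
```

Stacking trades the easy within-group gender comparison of the dodged plot for an easy comparison of group totals.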

3. Education Level

In this pie plot, we can see that the largest group of patients (1,720 of 4,238, about 41%) has Level 1 education. In addition, the second largest group has Level 2 education. The scale around the edge of the pie shows the cumulative number of patients across the education levels.

Note: add + scale_fill_brewer() to use a different color palette.

hdiease_edunum <- hdiease %>% 
  count(education)
str(hdiease_edunum)

## 'data.frame':    5 obs. of  2 variables:
##  $ education: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 NA
##  $ n        : int  1720 1253 687 473 105

p3 <- ggplot(data = hdiease_edunum, aes(x = "", y = n, fill = education)) 

p3 + geom_bar(width = 1, stat = "identity", color = "white") + 
  coord_polar("y", start=0) +
  scale_fill_hue(c=45, l=80) +
  labs(title = "Patients' Education level")
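
Because slice sizes in a pie chart are hard to compare by eye, the share of each education level can be computed from the counts shown in the str(hdiease_edunum) output. A minimal base-R sketch; edu_counts and share are names introduced here:

```r
# Counts copied from the str(hdiease_edunum) output; NA = education missing
edu_counts <- data.frame(
  education = c("1", "2", "3", "4", NA),
  n         = c(1720, 1253, 687, 473, 105)
)

# Percentage share of each group, rounded to one decimal place
edu_counts$share <- round(100 * edu_counts$n / sum(edu_counts$n), 1)
print(edu_counts)
```

Level 1 accounts for about 40.6% of the patients, so it is the largest group without being an absolute majority.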

4. Median BMI and Education

Underweight = <18.5
Normal weight = 18.5–24.9
Overweight = 25–29.9
Obesity = BMI of 30 or greater

We can see that the median BMI of the education groups is close to or above 25, which means that roughly half or more of the patients are overweight or close to it.

hdiease_edu <- hdiease %>% 
  group_by(education) %>% 
  summarise(BMI = median(BMI,na.rm=TRUE))

p4 <- ggplot(data = hdiease_edu, aes(x = education , y = BMI, fill = education))

p4 + 
  geom_bar(stat = 'identity', show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = 'Education', y = 'Median BMI', title = 'Median BMI by Education')
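
The BMI categories listed in this section can be applied to individual values with cut(). The sketch below is an illustration under assumptions: the last three BMI values come from the head(hdiease) output, the first two are made up, and bmi and bmi_class are names introduced here.

```r
# 26.97, 28.73 and 30.30 appear in head(hdiease); 17.9 and 22.4 are made up
bmi <- c(17.9, 22.4, 26.97, 28.73, 30.30)

# right = FALSE makes each interval closed on the left, e.g. [25, 30),
# so a BMI of exactly 25 counts as overweight and exactly 30 as obesity
bmi_class <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right  = FALSE
)

print(table(bmi_class))
```

Applied to hdiease$BMI, the same call would give the number of patients per category, which is a more direct check of the overweight claim than the group medians.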

Min-Jhen Wu

2021/10/28
