Gehad Gad

March 27th, 2020

Assignment Instruction

My task here is to create an example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

I chose a heart disease dataset from Kaggle.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ----------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)

Column meanings:

age: Age in years.

sex: 1 = male; 0 = female.

cp: Chest pain type. Type of chest pain experienced by the individual: 1 = typical angina; 2 = atypical angina; 3 = non-angina pain; 4 = asymptomatic angina.

trestbps: Resting blood pressure (in mm Hg on admission to the hospital).

chol: Serum cholesterol in mg/dl.

fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false).

restecg: Resting electrocardiographic results: 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy.

thalach: Maximum heart rate achieved.

exang: Exercise-induced angina (1 = yes; 0 = no).

oldpeak: ST depression induced by exercise relative to rest.

slope: The slope of the peak exercise ST segment.

ca: Number of major vessels (0-3) colored by fluoroscopy.

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect.

target: 1 or 0.

#Source: https://www.kaggle.com/ronitf/heart-disease-uci/data#heart.csv

Heart <- read.csv("https://github.com/GehadGad/Heart-Dataset/raw/master/HeartDisease.csv")
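Since this vignette is about the tidyverse, the same file could also be read with readr's read_csv(), which returns a tibble instead of a plain data frame. The sketch below is only an illustration: it writes three rows copied from the dataset to a temporary file (the `tmp` and `heart_tbl` names are made up here) so that it runs without network access. With the real data, the file path or URL would be passed to read_csv() instead.

```r
library(readr)

# Write three rows of the dataset to a temporary CSV file.
# This stands in for the real file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,cp,trestbps,chol",
  "63,1,3,145,233",
  "37,1,2,130,250",
  "41,0,1,130,204"
), tmp)

# col_types = cols() lets readr guess the column types without printing them.
heart_tbl <- read_csv(tmp, col_types = cols())

# The result is a tibble rather than a plain data.frame.
class(heart_tbl)
```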

#Display the first few rows in the data.

head(Heart)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1
#Rename columns:

names (Heart) <- c("Age","Gender", "Chest_Pain_Type","Resting_Blood_Pressure","Serum_Cholesterol","Fasting_Blood_Sugar","Resting_ECG","Max_Heart_Rate_Achieved","Exercise_Induced_Angina","ST_Depression_Exercise","Peak_Exercise_ST_Segment","Num_Major_Vessels_Flouro", "Thalassemia","Diagnosis_Heart_Disease")
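Assigning to names() works, but it depends on the column order. As a possible alternative, dplyr's rename() maps each new name to an old one explicitly (new = old). The sketch below uses a small made-up data frame, `heart_small`, with three of the original column names.

```r
library(dplyr)

# Small stand-in data frame with three of the original column names.
heart_small <- data.frame(
  age = c(63, 37, 41),
  sex = c(1, 1, 0),
  cp  = c(3, 2, 1)
)

# rename() uses new_name = old_name, so column order does not matter.
renamed <- heart_small %>%
  rename(Age = age, Gender = sex, Chest_Pain_Type = cp)

names(renamed)
```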

#Display the first few rows in the data.

head(Heart)
##   Age Gender Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1  63      1               3                    145               233
## 2  37      1               2                    130               250
## 3  41      0               1                    130               204
## 4  56      1               1                    120               236
## 5  57      0               0                    120               354
## 6  57      1               0                    140               192
##   Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1                   1           0                     150
## 2                   0           1                     187
## 3                   0           0                     172
## 4                   0           1                     178
## 5                   0           1                     163
## 6                   0           1                     148
##   Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1                       0                    2.3                        0
## 2                       0                    3.5                        0
## 3                       0                    1.4                        2
## 4                       0                    0.8                        2
## 5                       1                    0.6                        2
## 6                       0                    0.4                        1
##   Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 1                        0           1                       1
## 2                        0           2                       1
## 3                        0           2                       1
## 4                        0           2                       1
## 5                        0           2                       1
## 6                        0           1                       1
#Display the numbers of rows and columns in the dataset.

dim(Heart)
## [1] 303  14

The data contain 303 observations and 14 variables.

#Another way to check the number of rows, columns, and types.
str(Heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ Age                     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ Gender                  : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ Chest_Pain_Type         : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ Resting_Blood_Pressure  : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ Serum_Cholesterol       : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ Fasting_Blood_Sugar     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ Resting_ECG             : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Max_Heart_Rate_Achieved : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ Exercise_Induced_Angina : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ ST_Depression_Exercise  : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ Peak_Exercise_ST_Segment: int  0 0 2 2 2 1 1 2 2 2 ...
##  $ Num_Major_Vessels_Flouro: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Thalassemia             : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ Diagnosis_Heart_Disease : int  1 1 1 1 1 1 1 1 1 1 ...
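A tidyverse counterpart to str() is glimpse(), which is available after loading dplyr. It prints the number of rows and columns, followed by one line per column with its type and first values. A minimal sketch on a made-up three-row data frame (`heart_small` is an illustrative name):

```r
library(dplyr)

# Small stand-in data frame using values from the first rows of the dataset.
heart_small <- data.frame(
  Age = c(63L, 37L, 41L),
  Gender = c(1L, 1L, 0L),
  ST_Depression_Exercise = c(2.3, 3.5, 1.4)
)

# Prints the dimensions and one line per column (name, type, first values).
glimpse(heart_small)
```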
#Check if there are any NA values.

sum(is.na(Heart))
## [1] 0
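sum(is.na(Heart)) gives a single total for the whole data frame. When a dataset does contain missing values, a count per column is usually more informative. One way to get it is purrr's map_int(), sketched below. The data frame and its NA values are hypothetical, since the heart data has no missing values.

```r
library(purrr)

# Hypothetical data with missing values, for illustration only.
heart_small <- data.frame(
  Age = c(63, 37, NA),
  Serum_Cholesterol = c(233, NA, NA)
)

# Apply the NA count to every column; the result is a named integer vector.
na_per_column <- map_int(heart_small, ~ sum(is.na(.x)))
na_per_column
```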
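#Coerce the data frame to a tibble. as_tibble() is a function from the tibble package, which is loaded with the tidyverse.

Heart_tbl <- as_tibble(Heart)

as_tibble() coerces an existing object, such as a data frame or a matrix, to a tibble. According to the tibble documentation, it can also “enable preserving row names when coercing matrix and time-series-like objects with row names”.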
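To illustrate the row-name handling mentioned above, here is a minimal sketch using the rownames argument of as_tibble(). Tibbles do not keep row names, so the argument moves them into a regular column. The data frame, its row names, and the `patient_id` column name are made up for this example.

```r
library(tibble)

# Hypothetical data frame with row names, for illustration only.
df <- data.frame(
  Age = c(63, 37),
  Gender = c(1, 1),
  row.names = c("patient_a", "patient_b")
)

# rownames = "patient_id" stores the row names in a new first column.
patients_tbl <- as_tibble(df, rownames = "patient_id")
patients_tbl
```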

#Set features as factors.
Heart$Age <- as.factor(Heart$Age)
Heart$Gender <- as.factor(Heart$Gender)
Heart$Diagnosis_Heart_Disease <- as.factor(Heart$Diagnosis_Heart_Disease)
summary(Heart)
##       Age      Gender  Chest_Pain_Type Resting_Blood_Pressure
##  58     : 19   0: 96   Min.   :0.000   Min.   : 94.0         
##  57     : 17   1:207   1st Qu.:0.000   1st Qu.:120.0         
##  54     : 16           Median :1.000   Median :130.0         
##  59     : 14           Mean   :0.967   Mean   :131.6         
##  52     : 13           3rd Qu.:2.000   3rd Qu.:140.0         
##  51     : 12           Max.   :3.000   Max.   :200.0         
##  (Other):212                                                 
##  Serum_Cholesterol Fasting_Blood_Sugar  Resting_ECG    
##  Min.   :126.0     Min.   :0.0000      Min.   :0.0000  
##  1st Qu.:211.0     1st Qu.:0.0000      1st Qu.:0.0000  
##  Median :240.0     Median :0.0000      Median :1.0000  
##  Mean   :246.3     Mean   :0.1485      Mean   :0.5281  
##  3rd Qu.:274.5     3rd Qu.:0.0000      3rd Qu.:1.0000  
##  Max.   :564.0     Max.   :1.0000      Max.   :2.0000  
##                                                        
##  Max_Heart_Rate_Achieved Exercise_Induced_Angina ST_Depression_Exercise
##  Min.   : 71.0           Min.   :0.0000          Min.   :0.00          
##  1st Qu.:133.5           1st Qu.:0.0000          1st Qu.:0.00          
##  Median :153.0           Median :0.0000          Median :0.80          
##  Mean   :149.6           Mean   :0.3267          Mean   :1.04          
##  3rd Qu.:166.0           3rd Qu.:1.0000          3rd Qu.:1.60          
##  Max.   :202.0           Max.   :1.0000          Max.   :6.20          
##                                                                        
##  Peak_Exercise_ST_Segment Num_Major_Vessels_Flouro  Thalassemia   
##  Min.   :0.000            Min.   :0.0000           Min.   :0.000  
##  1st Qu.:1.000            1st Qu.:0.0000           1st Qu.:2.000  
##  Median :1.000            Median :0.0000           Median :2.000  
##  Mean   :1.399            Mean   :0.7294           Mean   :2.314  
##  3rd Qu.:2.000            3rd Qu.:1.0000           3rd Qu.:3.000  
##  Max.   :2.000            Max.   :4.0000           Max.   :3.000  
##                                                                   
##  Diagnosis_Heart_Disease
##  0:138                  
##  1:165                  
##                         
##                         
##                         
##                         
## 

The summary function tells us a lot about the data. For each numeric column it reports the minimum, maximum, mean, median, and quartiles, and for each factor column it reports the number of observations per level.
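In the summary, the factor levels of Gender appear as 0 and 1. A possible refinement is to attach readable labels when converting to a factor, using dplyr's mutate() together with factor(). The sketch below applies the coding given in the column meanings (1 = male; 0 = female) to the Gender values of the first six rows; `heart_small` and `labelled` are illustrative names.

```r
library(dplyr)

# Gender values of the first six rows of the dataset.
heart_small <- data.frame(Gender = c(1, 1, 0, 1, 0, 1))

# Convert to a factor and label the codes (0 = female, 1 = male).
labelled <- heart_small %>%
  mutate(Gender = factor(Gender, levels = c(0, 1), labels = c("female", "male")))

# summary() now reports counts per labelled level.
summary(labelled$Gender)
```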

#In addition to the summary function, we can use summarise (or summarize) to compute several statistics for a column in one step.

#Resting_Blood_Pressure column

Heart %>%
  summarise(Mean = mean(Resting_Blood_Pressure),
            Max = max(Resting_Blood_Pressure),
            Variance = var(Resting_Blood_Pressure),
            SD = sd(Resting_Blood_Pressure))
##       Mean Max Variance       SD
## 1 131.6238 200 307.5865 17.53814
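summarise() becomes more useful when combined with group_by(), which computes the same statistics once per group. The sketch below groups by Gender and uses only the Gender and Resting_Blood_Pressure values of the first six rows, so the numbers it produces describe those six rows and not the full dataset.

```r
library(dplyr)

# Gender and resting blood pressure of the first six rows of the dataset.
heart_small <- data.frame(
  Gender = c(1, 1, 0, 1, 0, 1),
  Resting_Blood_Pressure = c(145, 130, 130, 120, 120, 140)
)

# One output row per Gender value, with the group size and two statistics.
by_gender <- heart_small %>%
  group_by(Gender) %>%
  summarise(
    N = n(),
    Mean = mean(Resting_Blood_Pressure),
    Max = max(Resting_Blood_Pressure)
  )

by_gender
```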
ggplot(Heart,aes(x= Num_Major_Vessels_Flouro,fill= Diagnosis_Heart_Disease)) +
  geom_bar()+
  labs(y ="count",
       title = "Heart disease diagnosis based on number of major vessels")

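This graph shows that patients with 0 major vessels colored by fluoroscopy form the largest group and account for most of the heart disease diagnoses. This is an association in the data, not evidence that the number of vessels causes heart disease.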

ggplot(Heart,aes(x= Max_Heart_Rate_Achieved,fill= Diagnosis_Heart_Disease)) +
  geom_bar()+
  labs(y ="count",
       title = "Heart disease diagnosis by Max Heart Rate Achieved")

The maximum heart rate achieved also appears to be relevant to the diagnosis of heart disease.
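Because Max_Heart_Rate_Achieved is a numeric variable, geom_bar() draws one bar per distinct value. A histogram, which groups values into bins, may be easier to read. The sketch below is one possible version; the bin width of 10 is an arbitrary choice for illustration, and the data are the first six rows of the dataset only.

```r
library(ggplot2)

# Maximum heart rate and diagnosis of the first six rows of the dataset.
heart_small <- data.frame(
  Max_Heart_Rate_Achieved = c(150, 187, 172, 178, 163, 148),
  Diagnosis_Heart_Disease = factor(c(1, 1, 1, 1, 1, 1))
)

# geom_histogram() bins the numeric x values; binwidth = 10 is an assumption.
p <- ggplot(heart_small, aes(x = Max_Heart_Rate_Achieved, fill = Diagnosis_Heart_Disease)) +
  geom_histogram(binwidth = 10) +
  labs(y = "count",
       title = "Heart disease diagnosis by Max Heart Rate Achieved (binned)")

p
```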

ggplot(Heart,aes(x= Gender,fill= Diagnosis_Heart_Disease)) +
  geom_bar()+
  labs(y = "count",
       title = "Heart disease diagnosis distribution by Gender")

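This graph shows that, in absolute numbers, more males than females are diagnosed with heart disease. Note that the sample contains more than twice as many males (207) as females (96), so the counts alone do not show that males are diagnosed at a higher rate.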
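Raw counts can be hard to compare when the groups differ in size. One way to compare groups is to compute the share of each diagnosis within each gender with count(), group_by(), and mutate(). The rows in the sketch below are made up purely to show the mechanics and are not taken from the Kaggle file. In this toy data, gender 1 has more diagnoses in absolute terms but a lower share.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the Kaggle file.
toy <- data.frame(
  Gender = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
  Diagnosis_Heart_Disease = c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0)
)

# Count each Gender/Diagnosis combination, then divide by the gender total.
shares <- toy %>%
  count(Gender, Diagnosis_Heart_Disease) %>%
  group_by(Gender) %>%
  mutate(Share = n / sum(n)) %>%
  ungroup()

shares
```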

ggplot(Heart,aes(x= Chest_Pain_Type,fill= Diagnosis_Heart_Disease)) +
  theme_bw() +
  geom_bar() +
  facet_wrap(~Gender) +
  labs(y ="count",
       title = "Heart Disease distribution by Gender based on Chest_Pain_Type")

This graph shows that people of both genders who experience chest pain type 2 (non-angina pain when the types are coded 0 to 3, as in this file) have a higher chance of having heart disease. It also shows that males are more likely than females to have heart disease when they suffer from typical angina, atypical angina, or asymptomatic angina.

ggplot(Heart,aes(x= Resting_Blood_Pressure,fill= Diagnosis_Heart_Disease)) +
  geom_bar()+
  labs(y = "count",
       title = "Heart Disease diagnostic Rates based on Resting_Blood_Pressure")

ggplot(Heart,aes(x= Serum_Cholesterol,fill= Diagnosis_Heart_Disease)) +
  geom_bar()+
  labs(y = "count",
       title = "Heart Disease diagnostic Rates based on Serum_Cholesterol")

ggplot(Heart, aes(ST_Depression_Exercise, Resting_Blood_Pressure, colour = Thalassemia)) + 
    geom_point()
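In the scatter plot above, Thalassemia is stored as an integer, so ggplot2 maps it to a continuous colour gradient. Wrapping it in factor() gives one distinct colour per category instead. The sketch below shows this on the first six rows of the dataset; the axis and legend labels are my own additions.

```r
library(ggplot2)

# ST depression, resting blood pressure, and thalassemia of the first six rows.
heart_small <- data.frame(
  ST_Depression_Exercise = c(2.3, 3.5, 1.4, 0.8, 0.6, 0.4),
  Resting_Blood_Pressure = c(145, 130, 130, 120, 120, 140),
  Thalassemia = c(1, 2, 2, 2, 2, 1)
)

# factor() makes ggplot2 use a discrete colour scale with one colour per level.
p <- ggplot(heart_small,
            aes(ST_Depression_Exercise, Resting_Blood_Pressure,
                colour = factor(Thalassemia))) +
  geom_point() +
  labs(x = "ST depression induced by exercise",
       y = "Resting blood pressure (mm Hg)",
       colour = "Thalassemia")

p
```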

Conclusion

Heart disease is a major health concern, and there are factors that people should be aware of. In addition to age, chest pain, cholesterol, and blood pressure are important factors to consider.