Gehad Gad
March 27th, 2020
Assignment Instruction
My task here is to create an example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
I chose a heart disease dataset from Kaggle.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ----------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library (dplyr)
Column meanings:
age: Age in years.
sex: (1 = male; 0 = female).
cp: Chest pain type. Type of chest pain experienced by the individual: 1 = typical angina, 2 = atypical angina, 3 = non-angina pain, 4 = asymptomatic angina.
trestbps: Resting blood pressure (in mm Hg on admission to the hospital).
chol: Serum cholesterol in mg/dl.
fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false).
restecg: Resting electrocardiographic results: 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy.
thalach: Maximum heart rate achieved.
exang: Exercise induced angina (1 = yes; 0 = no).
oldpeak: ST depression induced by exercise relative to rest.
slope: The slope of the peak exercise ST segment.
ca: Number of major vessels (0-3) colored by fluoroscopy.
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect.
target: 1 or 0.
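The 0/1 codes in the list above can be turned into labeled factors so that tables and plot legends read more naturally. The following is a minimal sketch, assuming a small made-up data frame (`heart_demo`) that reuses the original column names; the labels come from the list above, and `target` is left as-is because the list only describes it as 1 or 0.

```r
library(dplyr)

# Toy rows with the original column names; the values are made up for illustration.
heart_demo <- data.frame(
  sex    = c(1, 0, 1, 0),
  fbs    = c(1, 0, 0, 0),
  exang  = c(0, 0, 1, 0),
  target = c(1, 1, 0, 0)
)

# Attach readable labels to the 0/1 codes described in the column list.
heart_labeled <- heart_demo %>%
  mutate(
    sex   = factor(sex,   levels = c(0, 1), labels = c("female", "male")),
    fbs   = factor(fbs,   levels = c(0, 1), labels = c("false", "true")),
    exang = factor(exang, levels = c(0, 1), labels = c("no", "yes"))
  )

# Count the rows per label.
table(heart_labeled$sex)
```

Giving `levels` explicitly keeps the label order fixed even if one of the codes happens to be absent from the data.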
#Source: https://www.kaggle.com/ronitf/heart-disease-uci/data#heart.csv
Heart <- read.csv("https://github.com/GehadGad/Heart-Dataset/raw/master/HeartDisease.csv")
#Display the first few rows in the data.
head(Heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
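The file above is loaded with base R's read.csv(). Since this vignette is about the tidyverse, readr's read_csv() could be used instead; it returns a tibble directly. The sketch below is only illustrative: it writes a tiny temporary CSV (a subset of the columns and rows shown above) so that it runs without a network connection, and the names `tmp` and `heart_small` are placeholders.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example runs offline.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,cp,trestbps,chol",
  "63,1,3,145,233",
  "37,1,2,130,250",
  "41,0,1,130,204"
), tmp)

# read_csv() guesses the column types and returns a tibble.
heart_small <- read_csv(tmp)
print(heart_small)
```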
#Rename columns:
names (Heart) <- c("Age","Gender", "Chest_Pain_Type","Resting_Blood_Pressure","Serum_Cholesterol","Fasting_Blood_Sugar","Resting_ECG","Max_Heart_Rate_Achieved","Exercise_Induced_Angina","ST_Depression_Exercise","Peak_Exercise_ST_Segment","Num_Major_Vessels_Flouro", "Thalassemia","Diagnosis_Heart_Disease")
#Display the first few rows in the data.
head(Heart)
## Age Gender Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1 63 1 3 145 233
## 2 37 1 2 130 250
## 3 41 0 1 130 204
## 4 56 1 1 120 236
## 5 57 0 0 120 354
## 6 57 1 0 140 192
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1 1 0 150
## 2 0 1 187
## 3 0 0 172
## 4 0 1 178
## 5 0 1 163
## 6 0 1 148
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1 0 2.3 0
## 2 0 3.5 0
## 3 0 1.4 2
## 4 0 0.8 2
## 5 1 0.6 2
## 6 0 0.4 1
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 1 0 1 1
## 2 0 2 1
## 3 0 2 1
## 4 0 2 1
## 5 0 2 1
## 6 0 1 1
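Assigning to names() replaces every column name by position, so the new names have to be listed in exactly the right order. A possible tidyverse alternative is dplyr's rename(), which maps each new name to an old one explicitly. This is a small sketch on a made-up three-column data frame (`heart_demo`), not the full dataset.

```r
library(dplyr)

# Toy data with three of the original column names.
heart_demo <- data.frame(age = c(63, 37), sex = c(1, 1), cp = c(3, 2))

# rename() takes new_name = old_name pairs and leaves other columns untouched.
heart_renamed <- heart_demo %>%
  rename(
    Age = age,
    Gender = sex,
    Chest_Pain_Type = cp
  )

names(heart_renamed)
```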
#Display the numbers of rows and columns in the dataset.
dim(Heart)
## [1] 303 14
The data contain 303 observations and 14 variables.
#Another way to check the number of rows, columns, and types.
str(Heart)
## 'data.frame': 303 obs. of 14 variables:
## $ Age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ Gender : int 1 1 0 1 0 1 0 1 1 1 ...
## $ Chest_Pain_Type : int 3 2 1 1 0 0 1 1 2 2 ...
## $ Resting_Blood_Pressure : int 145 130 130 120 120 140 140 120 172 150 ...
## $ Serum_Cholesterol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ Fasting_Blood_Sugar : int 1 0 0 0 0 0 0 0 1 0 ...
## $ Resting_ECG : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Max_Heart_Rate_Achieved : int 150 187 172 178 163 148 153 173 162 174 ...
## $ Exercise_Induced_Angina : int 0 0 0 0 1 0 0 0 0 0 ...
## $ ST_Depression_Exercise : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ Peak_Exercise_ST_Segment: int 0 0 2 2 2 1 1 2 2 2 ...
## $ Num_Major_Vessels_Flouro: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Thalassemia : int 1 2 2 2 2 1 2 3 3 2 ...
## $ Diagnosis_Heart_Disease : int 1 1 1 1 1 1 1 1 1 1 ...
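The tidyverse counterpart of str() is glimpse(), which dplyr exports. It prints one line per column with its type and the first few values. A minimal sketch on a made-up two-column data frame:

```r
library(dplyr)

# Toy data; the values are only for illustration.
heart_demo <- data.frame(
  Age = c(63L, 37L, 41L),
  ST_Depression_Exercise = c(2.3, 3.5, 1.4)
)

# One line per column: name, type, and leading values.
glimpse(heart_demo)
```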
#Check whether there are any NA values.
sum(is.na(Heart))
## [1] 0
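sum(is.na(Heart)) gives a single total for the whole data frame. If the total were not zero, it would help to know which columns hold the missing values. One way to get a per-column count, sketched here with purrr's map_int() on a made-up data frame that contains some NA values:

```r
library(purrr)

# Toy data with two missing values, for illustration only.
heart_demo <- data.frame(
  Age = c(63, NA, 41),
  Serum_Cholesterol = c(233, 250, NA),
  Gender = c(1, 1, 0)
)

# Count missing values in each column; the result is a named integer vector.
na_per_column <- map_int(heart_demo, ~ sum(is.na(.x)))
na_per_column
```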
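#Coerce the data frame to a tibble. as_tibble() comes from the tibble package, which library(tidyverse) loads.
Heart_tbl <- as_tibble(Heart)
as_tibble() coerces an existing object, such as a data frame or a matrix, to a tibble. “Enable preserving row names when coercing matrix and time-series-like objects with row names”.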
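To make the difference between a data frame and a tibble more concrete, here is a short sketch on a made-up two-row data frame. The names `heart_df` and `heart_tbl` are placeholders for this example only.

```r
library(tibble)

heart_df  <- data.frame(Age = c(63, 37), Gender = c(1, 1))
heart_tbl <- as_tibble(heart_df)

# A tibble is still a data frame, with extra classes on top.
class(heart_df)
class(heart_tbl)

# Selecting one column with [ , ] keeps a tibble a tibble,
# while a plain data frame drops to a vector.
is.data.frame(heart_tbl[, "Age"])
is.data.frame(heart_df[, "Age"])
```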
#Set features as factors.
Heart$Age <- as.factor(Heart$Age)
Heart$Gender <- as.factor (Heart$Gender)
Heart$Diagnosis_Heart_Disease <- as.factor (Heart$Diagnosis_Heart_Disease)
summary(Heart)
## Age Gender Chest_Pain_Type Resting_Blood_Pressure
## 58 : 19 0: 96 Min. :0.000 Min. : 94.0
## 57 : 17 1:207 1st Qu.:0.000 1st Qu.:120.0
## 54 : 16 Median :1.000 Median :130.0
## 59 : 14 Mean :0.967 Mean :131.6
## 52 : 13 3rd Qu.:2.000 3rd Qu.:140.0
## 51 : 12 Max. :3.000 Max. :200.0
## (Other):212
## Serum_Cholesterol Fasting_Blood_Sugar Resting_ECG
## Min. :126.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :240.0 Median :0.0000 Median :1.0000
## Mean :246.3 Mean :0.1485 Mean :0.5281
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :564.0 Max. :1.0000 Max. :2.0000
##
## Max_Heart_Rate_Achieved Exercise_Induced_Angina ST_Depression_Exercise
## Min. : 71.0 Min. :0.0000 Min. :0.00
## 1st Qu.:133.5 1st Qu.:0.0000 1st Qu.:0.00
## Median :153.0 Median :0.0000 Median :0.80
## Mean :149.6 Mean :0.3267 Mean :1.04
## 3rd Qu.:166.0 3rd Qu.:1.0000 3rd Qu.:1.60
## Max. :202.0 Max. :1.0000 Max. :6.20
##
## Peak_Exercise_ST_Segment Num_Major_Vessels_Flouro Thalassemia
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :1.000 Median :0.0000 Median :2.000
## Mean :1.399 Mean :0.7294 Mean :2.314
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.000 Max. :4.0000 Max. :3.000
##
## Diagnosis_Heart_Disease
## 0:138
## 1:165
##
##
##
##
##
The summary function tells us a lot about the data. It gives an overview of each feature: for example, the minimum, maximum, mean, median, and quartiles of the numeric columns, and the number of observations per level for the factor columns.
#In addition to summary(), summarise() (or summarize()) computes several statistics for a column in one step.
#Resting_Blood_Pressure column
Heart %>%
summarise(Mean = mean(Resting_Blood_Pressure), Max = max(Resting_Blood_Pressure), Variance = var(Resting_Blood_Pressure), SD = sd(Resting_Blood_Pressure))
## Mean Max Variance SD
## 1 131.6238 200 307.5865 17.53814
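summarise() becomes more useful when combined with group_by(), because the same statistics are then computed once per group. The sketch below uses a made-up four-row data frame (`heart_demo`) with two of the renamed columns; it only illustrates the pattern and does not reproduce results from the real data.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
heart_demo <- data.frame(
  Resting_Blood_Pressure  = c(120, 130, 140, 150),
  Diagnosis_Heart_Disease = factor(c(0, 0, 1, 1))
)

# One row of statistics per diagnosis group.
bp_by_diagnosis <- heart_demo %>%
  group_by(Diagnosis_Heart_Disease) %>%
  summarise(
    Mean = mean(Resting_Blood_Pressure),
    Max  = max(Resting_Blood_Pressure),
    SD   = sd(Resting_Blood_Pressure)
  )

bp_by_diagnosis
```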
ggplot(Heart,aes(x= Num_Major_Vessels_Flouro,fill= Diagnosis_Heart_Disease)) +
geom_bar()+
labs(y ="count",
title = "Heart disease diagnosis based on number of major vessels")
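This graph shows that heart disease diagnoses are most frequent among patients with 0 major vessels colored by fluoroscopy. The chart shows an association, not a cause.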
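The bar chart shows raw counts, so a group with many patients will also tend to have many diagnoses. To compare groups of different sizes, the share of each diagnosis within a group can be computed with count() and mutate(). This is a sketch on made-up data; `heart_demo`, `vessel_rates`, and the `Share` column are names chosen for this example.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
heart_demo <- data.frame(
  Num_Major_Vessels_Flouro = c(0, 0, 0, 1, 1, 2),
  Diagnosis_Heart_Disease  = factor(c(1, 1, 0, 1, 0, 0))
)

# Count each combination, then divide by the total within each vessel group.
vessel_rates <- heart_demo %>%
  count(Num_Major_Vessels_Flouro, Diagnosis_Heart_Disease) %>%
  group_by(Num_Major_Vessels_Flouro) %>%
  mutate(Share = n / sum(n)) %>%
  ungroup()

vessel_rates
```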
ggplot(Heart,aes(x= Max_Heart_Rate_Achieved,fill= Diagnosis_Heart_Disease)) +
geom_bar()+
labs(y ="count",
title = "Heart disease diagnosis by Max Heart Rate Achieved")
The maximum heart rate achieved also appears to be related to the diagnosis of heart disease.
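geom_bar() draws one bar per distinct heart rate value, which can be hard to read for a continuous measurement. One possible alternative is geom_histogram(), which groups the values into bins. The sketch below uses made-up data, and the bin width of 10 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Toy data; the values are made up for illustration.
heart_demo <- data.frame(
  Max_Heart_Rate_Achieved = c(150, 187, 172, 178, 163, 148, 120, 132),
  Diagnosis_Heart_Disease = factor(c(1, 1, 1, 1, 1, 1, 0, 0))
)

# Bin the heart rates in steps of 10 and stack the bars by diagnosis.
p_hist <- ggplot(heart_demo, aes(x = Max_Heart_Rate_Achieved, fill = Diagnosis_Heart_Disease)) +
  geom_histogram(binwidth = 10) +
  labs(y = "count",
       title = "Max heart rate achieved, binned in steps of 10")

print(p_hist)
```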
ggplot(Heart,aes(x= Gender,fill= Diagnosis_Heart_Disease)) +
geom_bar()+
labs(y ="count",
title = "Heart disease diagnosis distribution by Gender")
This graph shows that, in absolute counts, more males than females are diagnosed with heart disease.
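Because the two gender groups differ in size, it can also help to look at shares rather than counts. With position = "fill", geom_bar() scales every bar to the same height so the diagnosis split can be compared directly. This is a sketch on made-up data and does not reproduce the result for the real dataset.

```r
library(ggplot2)

# Toy data; the values are made up for illustration.
heart_demo <- data.frame(
  Gender                  = factor(c(0, 0, 1, 1, 1, 1)),
  Diagnosis_Heart_Disease = factor(c(1, 1, 1, 0, 0, 0))
)

# position = "fill" rescales each bar to 1, showing shares instead of counts.
p_share <- ggplot(heart_demo, aes(x = Gender, fill = Diagnosis_Heart_Disease)) +
  geom_bar(position = "fill") +
  labs(y = "share of patients",
       title = "Share of heart disease diagnoses within each gender")

print(p_share)
```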
ggplot(Heart,aes(x= Chest_Pain_Type,fill= Diagnosis_Heart_Disease)) +
theme_bw() +
geom_bar() +
facet_wrap(~Gender) +
labs(y ="count",
title = "Heart Disease distribution by Gender based on Chest_Pain_Type")
This graph shows that both males and females who experience chest pain type 2 (non-angina pain) have a higher chance of having heart disease. It also suggests that males have a higher risk of heart disease if they suffer from typical angina, atypical angina, or asymptomatic angina.
ggplot(Heart,aes(x= Resting_Blood_Pressure,fill= Diagnosis_Heart_Disease)) +
geom_bar()+
labs(y ="count",
title = "Heart Disease diagnostic Rates based on Resting_Blood_Pressure")
ggplot(Heart,aes(x= Serum_Cholesterol,fill= Diagnosis_Heart_Disease)) +
geom_bar()+
labs(y ="Freq",
title = "Heart Disease diagnostic Rates based on Serum_Cholesterol")
ggplot(Heart, aes(ST_Depression_Exercise, Resting_Blood_Pressure, colour = Thalassemia)) +
geom_point()
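In the scatter plot above, Thalassemia is stored as an integer, so ggplot2 maps it to a continuous colour gradient. Since the column holds category codes, wrapping it in factor() gives one distinct colour per code instead. A minimal sketch, using the first four rows shown in the head() output:

```r
library(ggplot2)

# First four rows of the three columns used in the plot.
heart_demo <- data.frame(
  ST_Depression_Exercise = c(2.3, 3.5, 1.4, 0.8),
  Resting_Blood_Pressure = c(145, 130, 130, 120),
  Thalassemia            = c(1, 2, 2, 2)
)

# factor() makes the colour scale discrete: one colour per Thalassemia code.
p_scatter <- ggplot(heart_demo,
                    aes(ST_Depression_Exercise, Resting_Blood_Pressure,
                        colour = factor(Thalassemia))) +
  geom_point() +
  labs(colour = "Thalassemia")

print(p_scatter)
```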
Conclusion
Heart disease is a major health concern, and there are factors that people should be aware of. In addition to age, chest pain, cholesterol, and blood pressure are important factors to consider.