Load required libraries:
library(tidyverse)
Load the data from the GitHub repository:
url <- "https://raw.githubusercontent.com/chinedu2301/DATA606-Statistics-and-Probability-for-Data-Analytics/main/heart.csv"
heart <- read_csv(url)
Look at the head of the data
head(heart, n = 10)
Get a glimpse of the variables in the dataset:
# get a glimpse of the variables
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
There are 918 observations (rows) and 12 variables (columns) in this dataset. Eleven (11) of the 12 variables are potential predictors of the twelfth (12th) variable, HeartDisease.
Each observation represents the characteristics of one individual, such as Age, Sex, RestingBP, and Cholesterol level, together with whether that individual has heart disease.
This dataset was downloaded from Kaggle, where it is publicly available, and then uploaded to my GitHub repository.
This is an observational study: the characteristics were recorded as they occurred, with no treatment assigned and no control group.
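There are eleven (11) explanatory variables. Six are stored as numeric (FastingBS takes only the values 0 and 1) and five as character. The explanatory variables are: Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, and ST_Slope.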
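The split between numerical and categorical predictors can be checked from the column types rather than counted by eye. The following is a minimal sketch on a stand-in data frame (`heart_sample`, a hypothetical name) holding the first three rows of five columns copied from the `glimpse()` output; the helper names `numeric_vars` and `categorical_vars` are also illustrative.

```r
# Stand-in for `heart`: first three rows of five columns, copied from glimpse().
heart_sample <- data.frame(
  Age = c(40, 49, 37),
  Sex = c("M", "F", "M"),
  ChestPainType = c("ATA", "NAP", "ATA"),
  RestingBP = c(140, 160, 130),
  HeartDisease = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Keep every column except the response variable.
predictors <- heart_sample[setdiff(names(heart_sample), "HeartDisease")]

# Flag which predictors are stored as numbers; the rest are treated as categorical.
is_num <- sapply(predictors, is.numeric)
numeric_vars <- names(predictors)[is_num]
categorical_vars <- names(predictors)[!is_num]

cat("Numeric:", numeric_vars, "\n")
cat("Categorical:", categorical_vars, "\n")
```

Run against the full `heart` data, the same steps should list six numeric and five character predictors, in line with the `<dbl>` and `<chr>` tags in the `glimpse()` output.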
Relevant statistics are:
Summary statistics of all variables
summary(heart)
##       Age            Sex            ChestPainType        RestingBP
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0
##  Mean   :53.51                                         Mean   :132.4
##  3rd Qu.:60.00                                         3rd Qu.:140.0
##  Max.   :77.00                                         Max.   :200.0
##   Cholesterol      FastingBS        RestingECG            MaxHR
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000
##                     Mean   : 0.8874                      Mean   :0.5534
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000
##                     Max.   : 6.2000                      Max.   :1.0000
From the summary statistics, the mean age of individuals in the dataset is about 53.5 years, while the median age is 54. The mean RestingBP is 132.4, the mean Cholesterol level is 198.8, and the mean MaxHR is 136.8.
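The summary also reports a minimum of 0 for both RestingBP and Cholesterol. Assuming (the source does not say so) that such zeros stand for missing measurements rather than real readings, they would pull the reported means down. The sketch below shows one way to count zeros and compare the mean with and without them. It uses the first ten Cholesterol values from the `glimpse()` output plus one appended 0, which is a hypothetical value added only to make the effect visible.

```r
# First ten Cholesterol values from the glimpse() output, plus one appended 0
# (hypothetical) to mimic the minimum of 0 reported by summary().
cholesterol <- c(289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 0)

n_zero <- sum(cholesterol == 0)                      # readings that are exactly zero
mean_all <- mean(cholesterol)                        # mean with zeros included
mean_nonzero <- mean(cholesterol[cholesterol > 0])   # mean with zeros excluded

cat("Zero readings:", n_zero, "\n")
cat("Mean (all values):", round(mean_all, 1), "\n")
cat("Mean (zeros excluded):", round(mean_nonzero, 1), "\n")
```

On the full data, the same check would be `sum(heart$Cholesterol == 0)` and `sum(heart$RestingBP == 0)`. Whether to exclude those rows depends on what the zeros actually represent.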
Visualizations
# Bar chart by gender
ggplot(heart, aes(x = Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph of total count by Gender") + ylab(NULL)
The bar chart shows that there are far more males than females in the dataset.
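The imbalance seen in the chart can be put into numbers with a count and a proportion per category. A minimal sketch, assuming `dplyr` is available (it is loaded as part of the tidyverse) and using a stand-in data frame (`heart_sample`, a hypothetical name) built from the first ten Sex values in the `glimpse()` output:

```r
library(dplyr)

# Stand-in for `heart`: the first ten Sex values from the glimpse() output.
heart_sample <- data.frame(
  Sex = c("M", "F", "M", "F", "M", "M", "F", "M", "M", "F")
)

# Count rows per category, then convert the counts to shares of the total.
sex_counts <- heart_sample %>%
  count(Sex) %>%
  mutate(prop = n / sum(n))

print(sex_counts)
```

Ten rows are far too few to represent the dataset; replacing `heart_sample` with `heart` gives the counts behind the bar chart.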
# Bar chart of heart disease status, filled by gender
heart %>%
  mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
  ggplot(aes(x = heart_prob, fill = Sex)) +
  geom_bar() + theme_bw() +
  labs(title = "Bar Graph by Individuals who have HeartDisease") +
  xlab("HeartDisease") + ylab(NULL)
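Because the two sexes are present in very different numbers, a stacked bar of raw counts can be hard to compare across groups. One option is to tabulate the share of individuals with heart disease within each sex. The sketch below assumes `dplyr` is available and uses a stand-in data frame (`heart_sample`, a hypothetical name) built from the first ten Sex and HeartDisease values in the `glimpse()` output; the column names `total`, `cases`, and `rate` are illustrative.

```r
library(dplyr)

# Stand-in for `heart`: first ten rows of two columns, copied from glimpse().
heart_sample <- data.frame(
  Sex = c("M", "F", "M", "F", "M", "M", "F", "M", "M", "F"),
  HeartDisease = c(0, 1, 0, 1, 0, 0, 0, 0, 1, 0)
)

# HeartDisease is coded 0/1, so its sum is the number of cases
# and its mean is the proportion of cases within each group.
disease_by_sex <- heart_sample %>%
  group_by(Sex) %>%
  summarise(
    total = n(),
    cases = sum(HeartDisease),
    rate = mean(HeartDisease)
  )

print(disease_by_sex)
```

The rates from ten rows say nothing about the population; the same pipeline applied to `heart` gives the within-group proportions that correspond to the chart.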
# Histogram of RestingBP
ggplot(heart, aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
  labs(title = "Distribution of RestingBP") + ylab(NULL)
# Histogram of Age
ggplot(heart, aes(x = Age)) + geom_histogram(binwidth = 2, fill = "brown") +
  labs(title = "Distribution of Age") + ylab(NULL)
# Histogram of Cholesterol level
ggplot(heart, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
  labs(title = "Distribution of Cholesterol level") + ylab(NULL)