The objective of this study is to perform an exploratory data analysis of cardiovascular patients using data provided by a Kaggle user. The dataset is sourced from Kaggle at https://www.kaggle.com/sulianova/cardiovascular-disease-dataset. This report is organised as follows. Chapter 2 describes the dataset and its attributes. Chapter 3 describes the data preprocessing. In Chapter 4 we explore each attribute and the correlations between them. Finally, in the last chapter we briefly summarize the analysis.
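# Packages used by the code in this report
library(dplyr); library(knitr); library(ggplot2)
library(gridExtra); library(corrplot); library(scales)
# Loading the data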
dOriginal<-read.csv("cardio_train.csv")
head(dOriginal)
## id.age.gender.height.weight.ap_hi.ap_lo.cholesterol.gluc.smoke.alco.active.cardio
## 1 0;18393;2;168;62.0;110;80;1;1;0;0;1;0
## 2 1;20228;1;156;85.0;140;90;3;1;0;0;1;1
## 3 2;18857;1;165;64.0;130;70;3;1;0;0;0;1
## 4 3;17623;2;169;82.0;150;100;1;1;0;0;1;1
## 5 4;17474;1;156;56.0;100;60;1;1;0;0;0;0
## 6 8;21914;1;151;67.0;120;80;2;2;0;0;0;0
# Formatting
dOriginal<-read.csv("cardio_train.csv",sep=";",header = TRUE,stringsAsFactors = FALSE)
head(dOriginal)
## id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco
## 1 0 18393 2 168 62 110 80 1 1 0 0
## 2 1 20228 1 156 85 140 90 3 1 0 0
## 3 2 18857 1 165 64 130 70 3 1 0 0
## 4 3 17623 2 169 82 150 100 1 1 0 0
## 5 4 17474 1 156 56 100 60 1 1 0 0
## 6 8 21914 1 151 67 120 80 2 2 0 0
## active cardio
## 1 1 0
## 2 1 1
## 3 0 1
## 4 1 1
## 5 0 0
## 6 0 0
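The first call above returned a single column because read.csv() splits on commas by default, while this file is semicolon-delimited. A minimal sketch of that behaviour on a small stand-in file (the temporary file and its two rows are made up for illustration and are not the real data):

```r
# Write a tiny semicolon-delimited file (illustrative rows only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;age;weight", "0;18393;62.0", "1;20228;85.0"), tmp)

# Default separator is ",": the whole line lands in one column
wrong <- read.csv(tmp)
ncol(wrong)

# An explicit sep = ";" recovers the three columns
right <- read.csv(tmp, sep = ";", header = TRUE, stringsAsFactors = FALSE)
ncol(right)
sapply(right, class)
```

Checking ncol() or names() right after loading is a cheap way to catch a wrong separator before any further processing.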
# Checking the details of the variables
data.frame(variable = names(dOriginal),
classe = sapply(dOriginal, typeof),
first_values = sapply(dOriginal, function(x) paste0(head(x), collapse = ", ")),
row.names = NULL) %>%
kable()
| variable | classe | first_values |
|---|---|---|
| id | integer | 0, 1, 2, 3, 4, 8 |
| age | integer | 18393, 20228, 18857, 17623, 17474, 21914 |
| gender | integer | 2, 1, 1, 2, 1, 1 |
| height | integer | 168, 156, 165, 169, 156, 151 |
| weight | double | 62, 85, 64, 82, 56, 67 |
| ap_hi | integer | 110, 140, 130, 150, 100, 120 |
| ap_lo | integer | 80, 90, 70, 100, 60, 80 |
| cholesterol | integer | 1, 3, 3, 1, 1, 2 |
| gluc | integer | 1, 1, 1, 1, 1, 2 |
| smoke | integer | 0, 0, 0, 0, 0, 0 |
| alco | integer | 0, 0, 0, 0, 0, 0 |
| active | integer | 1, 1, 0, 1, 0, 0 |
| cardio | integer | 0, 1, 1, 1, 0, 0 |
The dataset used in this project was obtained from Kaggle, as mentioned earlier. It has 70,000 observations with 11 descriptive features and 1 target, excluding the ID column.
The target feature has two classes, so this is a binary classification problem. More precisely, it indicates whether a person has cardiovascular disease.
The variable descriptions are as follows:

| Variable | Description | Values |
|---|---|---|
| age | Age of the person | in days |
| height | Height of the person | |
| weight | Weight of the person | |
| gender | Gender of the person | |
| ap_hi | Systolic blood pressure | |
| ap_lo | Diastolic blood pressure | |
| cholesterol | Cholesterol level | 1: normal, 2: above normal, 3: well above normal |
| gluc | Glucose level | 1: normal, 2: above normal, 3: well above normal |
| smoke | Smoking | 0: No, 1: True |
| alco | Alcohol intake | 0: No, 1: True |
| active | Physical activity | 0: No, 1: True |
The dataset contains categorical variables such as cholesterol, gluc, smoke, alco, active and gender. These variables are converted to factors as shown below.
#To drop the id column
d1<-select (dOriginal,-c(1))
#Changing the variables to factors
d1$cholesterol<-as.factor(d1$cholesterol)
d1$gluc<-as.factor(d1$gluc)
d1$smoke<-as.factor(d1$smoke)
d1$alco<-as.factor(d1$alco)
d1$active<-as.factor(d1$active)
str(d1)
## 'data.frame': 70000 obs. of 12 variables:
## $ age : int 18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
## $ gender : int 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : num 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: Factor w/ 3 levels "1","2","3": 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ alco : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ active : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 2 1 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
d1$cardio<-as.factor(d1$cardio)
d1$gender<-factor(d1$gender, levels=c(1,2), labels=c(0,1)) #changing the male and female values into 0's and 1's
head(d1$gender)
## [1] 1 0 0 1 0 0
## Levels: 0 1
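The column-by-column conversion above can also be written in one step. The following is only a sketch on a toy data frame; `toy` and `cat_cols` are hypothetical names introduced for illustration, not objects from this report:

```r
# Toy data frame standing in for d1 (values are illustrative)
toy <- data.frame(cholesterol = c(1, 3, 2),
                  smoke       = c(0, 1, 0),
                  weight      = c(62, 85, 64))

# Columns to treat as categorical (hypothetical helper vector)
cat_cols <- c("cholesterol", "smoke")

# Convert all listed columns to factors in a single assignment
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```

Keeping the categorical column names in one vector makes it harder to forget a variable when the list changes.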
Data cleaning is the most crucial part of the preprocessing. It deals with missing values, impossible values, obvious errors (in simple terms, typos) and the treatment of outliers.
Missing values can be detected in R by combining the any() and is.na() functions. In most cases ‘NA’ or ‘?’ are the only representations of missing values in a dataset. According to the chunk below, there are no missing values in the data set.
# Cleaning data
#Checking for missing values
#Na's
any(is.na(d1))
## [1] FALSE
d1[d1 == "?"] <- NA
any(is.na(d1))
## [1] FALSE
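A single TRUE or FALSE does not say where missing values sit. As a rough sketch, per-column counts can be obtained as follows; the toy data frame is made up for illustration:

```r
# Toy data with one NA and one "?" placeholder (illustrative values)
toy <- data.frame(age  = c(50, NA, 61),
                  gluc = c("1", "?", "2"),
                  stringsAsFactors = FALSE)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Number of "?" placeholders in each column
# (comparing NA with "?" yields NA, hence na.rm = TRUE)
q_counts <- colSums(toy == "?", na.rm = TRUE)

# Recode "?" as NA, skipping cells that are already NA
toy[!is.na(toy) & toy == "?"] <- NA

na_counts
q_counts
colSums(is.na(toy))
```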
Impossible values can be viewed from two different perspectives. First, typos may lead to impossible values such as a systolic pressure with a negative sign, which is an obvious typo. To deal with this kind of error we can use the abs() function to convert negative values to their absolute values.
Second, values such as a systolic blood pressure of 0, which would mean the person is almost dead, are also considered typos. In these cases it is best to delete the rows containing these values. Comparatively few rows in the dataset have this error. Hence, we deleted the rows in which both systolic and diastolic pressure are recorded as 0. There are also some outliers in weight, such as values below 20 kg. So far, the lowest recorded weight of an adult (the age range starts from 28 years) is 20 kg. Hence we decided to drop these values before dealing with the outliers.
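The code for these steps is not shown in the report. A minimal sketch of how they might look, applied to a made-up data frame (the values and the exact filter conditions are assumptions based on the description above):

```r
# Toy readings containing the kinds of errors described (illustrative)
toy <- data.frame(ap_hi  = c(-120, 0, 130, 110),
                  ap_lo  = c(80, 0, -90, 70),
                  weight = c(62, 70, 15, 85))

# Negative readings are treated as sign typos and made positive
toy$ap_hi <- abs(toy$ap_hi)
toy$ap_lo <- abs(toy$ap_lo)

# Drop rows where both pressures are recorded as 0
toy <- toy[!(toy$ap_hi == 0 & toy$ap_lo == 0), ]

# Drop rows with a weight below 20 kg
toy <- toy[toy$weight >= 20, ]

toy
```

Only the first and last rows survive here: the second row has both pressures at 0 and the third has a weight of 15 kg.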
Outliers have a crucial effect on data analysis, so it is beneficial to deal with them before analysing the data. In practice there are many ways to handle outliers, such as deleting the row, imputing the value with the mean, capping, or transforming the data. In this project, we chose to delete systolic and diastolic pressure values higher than 370 and 360 respectively, which are the highest values ever recorded in a study. The remaining outliers are possible but extreme values. These are handled with a capping function: values more than 1.5 times the IQR beyond the quartiles are replaced with the 5th or 95th percentile.
d1<-d1[!(d1$ap_hi>370),]
d1<-d1[!(d1$ap_lo>360),]
# Visualising the outliers
boxplot(d1$ap_hi ~ d1$cardio, main="Systolic blood pressure by cardio", ylab = "Systolic blood pressure", xlab = "cardio")
boxplot(d1$ap_lo ~ d1$cardio, main="Diastolic blood pressure by cardio", ylab = "Diastolic blood pressure", xlab = "cardio")
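# Capping the outliers: values more than 1.5 * IQR beyond the quartiles
# are replaced with the 5th (low side) or 95th (high side) percentile
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}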
d2<-d1
d2$ap_hi <- d2$ap_hi %>% cap()
d2$ap_lo <- d2$ap_lo %>% cap()
boxplot(d2$ap_hi ~ d2$cardio, main="Systolic blood pressure by cardio", ylab = "Systolic blood pressure", xlab = "cardio")
boxplot(d2$ap_lo ~ d2$cardio, main="Diastolic blood pressure by cardio", ylab = "Diastolic blood pressure", xlab = "cardio")
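To see what the capping function does, it can be run on a small vector with one extreme value at each end. The function body is copied from the code above; the input vector is made up for illustration:

```r
# cap() as defined in this report
cap <- function(x) {
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# 100 illustrative readings: mostly 110 to 140, plus one very low (20)
# and one very high (300) value
x <- c(20, rep(110, 20), rep(120, 50), rep(130, 20), rep(140, 8), 300)

capped <- cap(x)

range(x)       # 20 300
range(capped)  # 110 140
```

Here the quartiles are 120 and 130, so the fences sit at 105 and 145. The value 20 is pulled up to the 5th percentile (110) and 300 is pulled down to the 95th percentile (140), while all other values are left unchanged.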
All five numerical variables lie in a similar range, i.e. between 1 and 300, except age, which is recorded in days. Hence, age in days is converted to years.
#changing the age in days to years
d2$age<-d2$age/365
d2$age<-round(d2$age,digits = 0)
summary(d2)
## age gender height weight
## Min. :30.00 0:44936 Min. : 55.0 Min. : 21.00
## 1st Qu.:48.00 1:24056 1st Qu.:159.0 1st Qu.: 65.00
## Median :54.00 Median :165.0 Median : 72.00
## Mean :53.32 Mean :164.4 Mean : 74.12
## 3rd Qu.:58.00 3rd Qu.:170.0 3rd Qu.: 82.00
## Max. :65.00 Max. :250.0 Max. :200.00
## ap_hi ap_lo cholesterol gluc smoke
## Min. : 90.0 Min. : 65.00 1:51752 1:58657 0:62931
## 1st Qu.:120.0 1st Qu.: 80.00 2: 9341 2: 5088 1: 6061
## Median :120.0 Median : 80.00 3: 7899 3: 5247
## Mean :126.2 Mean : 81.62
## 3rd Qu.:140.0 3rd Qu.: 90.00
## Max. :170.0 Max. :105.00
## alco active cardio
## 0:65295 0:13573 0:34848
## 1: 3697 1:55419 1:34144
##
##
##
##
The most effective way to get an idea of the data and its variables is to visualize it. In this project we look at the data through Univariate, Bi-variate and Multi-variate graphs.
2. Distribution of age: from the generated histogram we can hardly expect the age variable to follow a normal distribution. At a glance, the 40 to 60 age group is heavily represented in the data set.
3. Blood pressure: the blood pressure range does show variance. The density of systolic blood pressure is quite similar in the first and third quartiles, but it rises very high at 110 and 130. Likewise, the density of diastolic blood pressure is quite similar in the first and third quartiles, but it rises very high at 70 and 90.
4. Height and weight: the height and weight histograms are approximately normal. We expect most of the persons in this study to be within a height range of 150 cm to 170 cm and a weight range of 70 kg to 90 kg.
5. Univariate graphs of categorical variables: in these graphs, we try to understand the number of persons with respect to their cholesterol and glucose levels, their physical activity and their smoking habits. Most observations have normal cholesterol and glucose levels, and surprisingly few are smokers. It is also evident that most persons engage in physical activities such as exercise, which is really good.
##Univariate Graphs
d3<-d2
# How many males and females do we have
perc <- d3$gender %>% table() %>% prop.table()*100
perc %>% barplot(main = "male vs female",ylab="Percent", ylim=c(0,100))
# Age distribution
ggplot(d3,aes(x=age))+geom_histogram(aes(y=after_stat(density)),color="blue")+stat_function(fun = dnorm, args = list(mean = mean(d3$age), sd = sd(d3$age)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# distribution of Systolic pressure and diastolic pressure
p1<-ggplot(d3,aes(x=ap_hi))+geom_density(fill="grey")
p2<-ggplot(d3,aes(x=ap_lo))+geom_density(fill="grey")
grid.arrange(p1, p2, nrow = 1)
# height and weight distribution
p3<-ggplot(d3,aes(x=height))+geom_histogram(color="blue")
p4<-ggplot(d3,aes(x=weight))+geom_histogram(color="blue")
grid.arrange(p3, p4, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# univariate of all categorical values
p5<-ggplot(d3, aes(x=cholesterol, fill=cholesterol)) +
geom_bar(stat="count")+theme_bw()
p6<-ggplot(d3, aes(x=active, fill=active)) +
geom_bar(stat="count")+theme_bw()
p7<-ggplot(d3, aes(x=smoke, fill=smoke)) +
geom_bar(stat="count")+theme_bw()
p8<-ggplot(d3, aes(x=gluc, fill=gluc)) +
geom_bar(stat="count")+theme_bw()
grid.arrange(p5,p6,p7,p8, nrow= 2)
Bi-variate data visualisation is one of the most effective ways to learn about the data and the inter-relationships of its variables. Since cardio is the target variable in this project, we decided to visualize cardio against the other variables.
Cardiovascular patients are almost equally distributed across all levels, but surprisingly we observe more cardiovascular patients who are active, have normal cholesterol and glucose levels, do not drink alcohol and do not smoke. This is a result we did not expect. It largely reflects the fact that these groups are much larger in the data, as the summary counts above show. To deal with this we turn to the correlation coefficient, which is really helpful in further analysis.
# Bi-Variate plots with cardio and categorical variables
p9<-ggplot(d3, aes(x=cholesterol, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
p10<-ggplot(d3, aes(x=gluc, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
p11<-ggplot(d3, aes(x=smoke, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
p12<-ggplot(d3, aes(x=alco, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
p13<-ggplot(d3, aes(x=active, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
p14<-ggplot(d3, aes(x=gender, fill=cardio)) +
geom_bar(stat="count",position="dodge")+theme_bw()
grid.arrange(p9,p10,p11,p12,p13,p14,ncol=3, nrow= 2)
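Dodged count bars reflect group size as well as risk, so a larger group can show more patients even when the disease rate is the same. A minimal sketch of comparing rates instead of counts, on made-up data:

```r
# Toy data: six active and two inactive persons (illustrative values)
toy <- data.frame(
  active = factor(c(1, 1, 1, 1, 1, 1, 0, 0)),
  cardio = factor(c(1, 1, 1, 0, 0, 0, 1, 0))
)

# Raw counts: more cardio patients among the active group
counts <- table(active = toy$active, cardio = toy$cardio)

# Row-wise proportions: share of cardio = 0/1 within each activity level
rates <- prop.table(counts, margin = 1)

counts
rates
```

In this toy example the active group contains three patients against one in the inactive group, yet the rate is 50% in both. In ggplot2, geom_bar(position = "fill") gives the same within-group view graphically.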
To find the correlation between each pair of variables, we generated a correlation matrix plot, as shown in the code chunk below.
par(mfrow=c(1,1))
dExplor<-d3
dExplor[]<-lapply(dExplor,as.integer)
correlation = cor(dExplor[,1:12])
cols<- colorRampPalette(c("red", "blue"))(20)
corrplot(correlation, method ="number",col=cols,type="upper")
Five variables show a considerable correlation with the cardio variable: ap_hi, ap_lo, age, cholesterol and weight.
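The correlation matrix relies on converting every column, factors included, to integers. A small sketch of what that conversion does, using made-up values:

```r
# Toy data with one numeric column and two factors (illustrative values)
toy <- data.frame(
  age         = c(40, 45, 50, 55, 60, 65),
  cholesterol = factor(c(1, 1, 2, 2, 3, 3)),
  cardio      = factor(c(0, 0, 0, 1, 1, 1))
)

# Same conversion as in the report: every column becomes integer
num <- toy
num[] <- lapply(num, as.integer)

# as.integer() on a factor returns level codes, so cardio is now 1/2, not 0/1
unique(num$cardio)

# Pearson correlation is unaffected by this constant shift
corr <- cor(num)
round(corr["cardio", ], 2)
```

Because cholesterol and gluc are ordered levels rather than measurements, a rank-based coefficient such as cor(num, method = "spearman") may be worth comparing against the Pearson values.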
So let’s jump back to the bi-variate graphs. Age has a considerable impact on several categorical variables. As the density plots show, cholesterol and glucose levels increase with age, and so does the risk of cardiovascular disease.
# Density plots for age and categorical variables
ggplot(d3,aes(x=age,fill=cholesterol))+geom_density(col=NA,alpha=0.35)
ggplot(d3,aes(x=age,fill=cardio))+geom_density(col=NA,alpha=0.35)
ggplot(d3,aes(x=age,fill=active))+geom_density(col=NA,alpha=0.35)
ggplot(d3,aes(x=age,fill=gluc))+geom_density(col=NA,alpha=0.35)
Looking at the density plots of age against ap_hi and ap_lo, we can observe a considerable change in ap_hi and ap_lo between 45 and 60 years.
Another interesting point is that the relationship between weight and height is mostly linear, and the distribution is more concentrated in the normal height and weight ranges.
# Bi-variate density charts for two numerical variables
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
ggplot(d3,aes(x=ap_hi,y=age))+stat_density_2d(geom = "point",aes(size=..density..),n=20,contour = FALSE)+scale_size(range=c(0,9))
ggplot(d3,aes(x=ap_lo,y=age))+stat_density_2d(geom = "point",aes(size=..density..),n=20,contour = FALSE)+scale_size(range=c(0,9))
ggplot(d3,aes(x=ap_hi,y=ap_lo))+geom_point()
ggplot(d3,aes(x=height,y=weight))+geom_point()
We can also see that the cholesterol level has some positive relationship with cardio, which implicitly shows that the risk of cardiovascular disease rises with increasing cholesterol levels.
# The correlation between cardio and cholesterol is comparatively significant
# cardio and cholesterol
ggplot(d3,
aes(x = factor(cholesterol,
levels = c("1", "2",
"3")),
fill = factor(cardio,
levels = c("0", "1"),
labels = c("0",
"1"
)))) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
labels = percent) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "cardio",
x = "cholesterol",
title = "cholesterol by cardio") +
theme_bw()
Cholesterol and age are positively related. From this violin plot, we can infer that the chances of higher cholesterol levels increase with age.
# Categorical and quantitative
ggplot(d3,
aes(x = cholesterol,
y = age)) +
geom_violin() +
labs(title = "cholesterol by age")
From the correlation matrix, we can see that cholesterol and glucose levels are positively correlated. This is evident from the chart below. At higher glucose levels there is a greater chance of higher cholesterol, which in turn implies a greater chance of being a cardiovascular patient.
# Considerable correlation between gluc and cholesterol levels
ggplot(d3,
aes(x = factor(gluc,
levels = c("1", "2",
"3")),
fill = factor(cholesterol,
levels = c("1", "2", "3"),
labels = c("Normal",
"Moderate",
"High")))) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
labels = percent) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "cholesterol",
x = "gluc",
title = "glucose by cholesterol") +
theme_bw()
Finally, multivariate graphs let us look at several variables at once. They reveal the joint behaviour of multiple variables with respect to a single variable.
In the first graph, we can see that the share of cardio patients grows as age increases, and that the chances of cardiovascular disease are very high at higher cholesterol levels.
In the second plot, we can observe that most cardiovascular patients have a systolic pressure above 150 mmHg and a diastolic pressure above 90 mmHg.
# Cardio given by age and cholesterol
ggplot(d3,
aes(y = factor(cholesterol,
labels = c("1",
"2",
"3")),
x = age,
color = cardio)) +
geom_jitter(alpha = 0.7,
size = 1.5) +
labs(title = "Cardio by cholesterol with respect to age",
x = "",
y = "") +
theme_minimal()
#Cardio given by ap_hi and ap_lo
ggplot(d3,
aes(x = ap_hi,
y = ap_lo,
color= cardio)) +
geom_jitter(alpha = 0.7,
size = 1.5) +
labs(x = "Systolic Blood Pressure",
y = "Diastolic Blood Pressure",
title = "Blood pressure relationship by Cardio"
)
In this study we explored the cardiovascular disease dataset and gained insights into the key factors that determine the target value. In the initial stages of this study, based on the uni-variate graphs, we noted that the gender variable has more females than males, so the study may be biased with respect to gender. According to the multi-variate graphs and the correlation matrix, systolic blood pressure, diastolic blood pressure, age and cholesterol are the most influential on the target value. Further analysis and modelling will be carried out in order to predict cardiovascular disease.