Chapter 1

Introduction

1.1 Objective

The objective of this study is to perform an exploratory data analysis of cardiovascular patients using data provided by a Kaggle user. The dataset is sourced from Kaggle at https://www.kaggle.com/sulianova/cardiovascular-disease-dataset. This report is organised as follows. Chapter 2 describes the dataset and its attributes. Chapter 3 describes the data preprocessing. In Chapter 4 we explore each attribute and the correlations between them. Finally, in the last chapter we briefly summarize the analysis.

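# Packages used throughout the report (assumed to be loaded in a hidden setup chunk)
library(dplyr); library(knitr); library(ggplot2)
library(gridExtra); library(corrplot); library(scales)
# Load the data with read.csv() defaults (comma separator)
dOriginal<-read.csv("cardio_train.csv")
head(dOriginal)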

##   id.age.gender.height.weight.ap_hi.ap_lo.cholesterol.gluc.smoke.alco.active.cardio
## 1                                             0;18393;2;168;62.0;110;80;1;1;0;0;1;0
## 2                                             1;20228;1;156;85.0;140;90;3;1;0;0;1;1
## 3                                             2;18857;1;165;64.0;130;70;3;1;0;0;0;1
## 4                                            3;17623;2;169;82.0;150;100;1;1;0;0;1;1
## 5                                             4;17474;1;156;56.0;100;60;1;1;0;0;0;0
## 6                                             8;21914;1;151;67.0;120;80;2;2;0;0;0;0

# The file is semicolon-delimited, so re-read it with sep = ";"
dOriginal<-read.csv("cardio_train.csv",sep=";",header = TRUE,stringsAsFactors = FALSE)
head(dOriginal)

##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco
## 1  0 18393      2    168     62   110    80           1    1     0    0
## 2  1 20228      1    156     85   140    90           3    1     0    0
## 3  2 18857      1    165     64   130    70           3    1     0    0
## 4  3 17623      2    169     82   150   100           1    1     0    0
## 5  4 17474      1    156     56   100    60           1    1     0    0
## 6  8 21914      1    151     67   120    80           2    2     0    0
##   active cardio
## 1      1      0
## 2      1      1
## 3      0      1
## 4      1      1
## 5      0      0
## 6      0      0

# Checking the details of the variables
data.frame(variable = names(dOriginal),
           classe = sapply(dOriginal, typeof),
           first_values = sapply(dOriginal, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
  kable()

variable	classe	first_values
id	integer	0, 1, 2, 3, 4, 8
age	integer	18393, 20228, 18857, 17623, 17474, 21914
gender	integer	2, 1, 1, 2, 1, 1
height	integer	168, 156, 165, 169, 156, 151
weight	double	62, 85, 64, 82, 56, 67
ap_hi	integer	110, 140, 130, 150, 100, 120
ap_lo	integer	80, 90, 70, 100, 60, 80
cholesterol	integer	1, 3, 3, 1, 1, 2
gluc	integer	1, 1, 1, 1, 1, 2
smoke	integer	0, 0, 0, 0, 0, 0
alco	integer	0, 0, 0, 0, 0, 0
active	integer	1, 1, 0, 1, 0, 0
cardio	integer	0, 1, 1, 1, 0, 0

Chapter 2

About the Data and its Features

2.1 About the Data and its attributes

The dataset used in this project is obtained from Kaggle, as mentioned earlier. It has 70,000 observations with 11 descriptive features and 1 target, excluding the ID column.

2.2 Target Feature

The target feature, cardio, has two classes, so this is a binary classification problem. More precisely, it indicates whether a person has cardiovascular disease.

2.3 Descriptive Feature

The variable descriptions are as follows:
age : age of the person in days
height : height of the person (cm)
weight : weight of the person (kg)
gender : gender of the person
ap_hi : systolic blood pressure
ap_lo : diastolic blood pressure
cholesterol : cholesterol level | 1: normal, 2: above normal, 3: well above normal |
gluc : glucose level | 1: normal, 2: above normal, 3: well above normal |
smoke : smoking | 0: No, 1: Yes |
alco : alcohol intake | 0: No, 1: Yes |
active : physical activity | 0: No, 1: Yes |
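As a small illustration of the coding scheme above, the ordinal codes can be given readable labels when they are turned into factors. This is only a sketch: the vector `chol_codes` is made-up toy data, not rows from the dataset, and the report itself keeps the numeric codes.

```r
# Toy codes standing in for the cholesterol column (hypothetical values)
chol_codes <- c(1, 3, 3, 1, 2)

# Map the documented codes 1/2/3 to descriptive labels
chol_labelled <- factor(chol_codes,
                        levels = c(1, 2, 3),
                        labels = c("normal", "above normal", "well above normal"))

# Count how many toy records fall in each labelled level
table(chol_labelled)
```

Labelled levels make plot legends self-explanatory, at the cost of diverging from the raw codes used in the rest of the report.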

Chapter 3

Data Preprocessing

The dataset contains categorical variables such as cholesterol, glucose, smoke, alco, active and gender. These variables are converted to factors as shown below.

# Drop the id column
d1<-select(dOriginal,-id)


#Changing the variables to factors

d1$cholesterol<-as.factor(d1$cholesterol)
d1$gluc<-as.factor(d1$gluc)
d1$smoke<-as.factor(d1$smoke)
d1$alco<-as.factor(d1$alco)
d1$active<-as.factor(d1$active)
str(d1)

## 'data.frame':    70000 obs. of  12 variables:
##  $ age        : int  18393 20228 18857 17623 17474 21914 22113 22584 17668 19834 ...
##  $ gender     : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : num  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: Factor w/ 3 levels "1","2","3": 1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 2 1 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...

d1$cardio<-as.factor(d1$cardio)
d1$gender<-factor(d1$gender, levels=c(1,2), labels=c(0,1)) # recode the gender codes 1 and 2 as 0 and 1
head(d1$gender)

## [1] 1 0 0 1 0 0
## Levels: 0 1

Cleaning is the most crucial stage of the data preprocessing. It deals with missing values, impossible values, obvious errors (in simple terms, typos) and outliers.

3.1 Missing Values:

Missing values can be detected by combining the any() and is.na() functions in R. In most cases ‘NA’ or ‘?’ are the only markers of missing values in a dataset. The chunk below shows that there are no missing values in the data set.

# Cleaning data 

#Checking for missing values
#Na's

any(is.na(d1))

## [1] FALSE

d1[d1 == "?"] <- NA
any(is.na(d1))

## [1] FALSE
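The any() check only says whether something is missing somewhere. A per-column count is often more informative when the answer is TRUE. The following is a minimal sketch on a made-up data frame `toy` (hypothetical values, not the cardio data):

```r
# Toy data frame with one missing weight and one "?" placeholder (made-up values)
toy <- data.frame(height = c(168, 156, 165),
                  weight = c(62, NA, 64),
                  smoke  = c("0", "?", "1"),
                  stringsAsFactors = FALSE)

# Treat "?" as missing in every column (%in% never returns NA, so existing NAs are safe)
toy[] <- lapply(toy, function(col) replace(col, col %in% "?", NA))

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

On the real data the same colSums(is.na(d1)) call would return zeros for every column, in line with the FALSE results above.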

3.2 Impossible values:

Impossible values can arise in two ways. First, typos may produce values such as a systolic pressure with a negative sign. To deal with this kind of error we can use the abs() function to convert negative values to their absolute values.

Second, some values are impossible in themselves, such as a systolic blood pressure of 0, which would mean the person is almost dead. These are also treated as typos, and it is best to delete the rows that contain them. Comparatively few rows in the dataset have this error, so we deleted the rows in which both systolic and diastolic pressure are recorded as 0. There are also some outliers in weight, such as values below 20 kg. We take 20 kg as the lowest weight recorded for an adult (the age range here starts from 28 years), so we dropped these rows before dealing with the remaining outliers.
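The report does not show the code for these steps, so the following is only a sketch of how they could be carried out, assuming the column names ap_hi, ap_lo and weight and the 20 kg floor described above. The data frame `toy` holds made-up values.

```r
# Toy blood pressure and weight records (made-up values, not from the dataset)
toy <- data.frame(ap_hi  = c(-120, 0, 130, 140),
                  ap_lo  = c(80, 0, -70, 90),
                  weight = c(62, 70, 64, 15))

# 1. Treat a negative sign as a typo and keep the magnitude
toy$ap_hi <- abs(toy$ap_hi)
toy$ap_lo <- abs(toy$ap_lo)

# 2. Drop rows where both pressures are recorded as 0
toy <- toy[!(toy$ap_hi == 0 & toy$ap_lo == 0), ]

# 3. Drop rows with a weight below the assumed 20 kg floor
toy <- toy[toy$weight >= 20, ]

toy
```

Applying abs() before the zero check keeps the steps independent: sign typos are repaired first, and only values that are still impossible afterwards are removed.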

3.3 Outliers:

Outliers have a crucial effect on data analysis, so it is beneficial to deal with them before analysing the data. In practice there are many ways to handle outliers, such as deleting the row, imputing the value with the mean, capping, or transforming the data. In this project we chose to delete rows with systolic pressure above 370 or diastolic pressure above 360, which we take to be the highest values ever recorded in a study. The remaining outliers are possible but extreme values. These are handled with a capping function: values more than 1.5 times the interquartile range below the first quartile or above the third quartile are replaced with the 5th and 95th percentiles respectively.

d1<-d1[!(d1$ap_hi>370),]
d1<-d1[!(d1$ap_lo>360),]

# Visualising the outliers
boxplot(d1$ap_hi ~ d1$cardio, main="Systolic blood pressure  by cardio", ylab = "Systolic blood pressure", xlab = "cardio")

boxplot(d1$ap_lo ~ d1$cardio, main="Diastolic blood pressure by cardio", ylab = "Diastolic blood pressure", xlab = "cardio")

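# Capping the outliers
cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  # values below Q1 - 1.5*IQR are raised to the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  # values above Q3 + 1.5*IQR are lowered to the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}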

d2<-d1
d2$ap_hi <- d2$ap_hi %>% cap()
d2$ap_lo <- d2$ap_lo %>% cap()

boxplot(d2$ap_hi ~ d2$cardio, main="Systolic blood pressure  by cardio", ylab = "Systolic blood pressure", xlab = "cardio")

boxplot(d2$ap_lo ~ d2$cardio, main="Diastolic blood pressure by cardio", ylab = "Diastolic blood pressure", xlab = "cardio")
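To make the effect of the capping function concrete, the sketch below applies the same cap() logic to a made-up vector of 100 systolic readings with a single extreme value. The vector `x` is hypothetical toy data. Only the extreme reading should change.

```r
# Same capping logic as in the report, repeated here so the example is self-contained
cap <- function(x) {
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Toy systolic readings (made-up): mostly 120 to 140, a few higher, one extreme value
x <- c(rep(120, 40), rep(130, 30), rep(140, 25), 150, 160, 165, 170, 900)

capped <- cap(x)

# Number of readings that were changed, and the new maximum
sum(capped != x)
max(capped)
```

Here the quartiles are 120 and 140, so the upper fence is 140 + 1.5 * 20 = 170. The reading of 900 lies beyond it and is replaced by the 95th percentile of the original vector, while 170 itself is left alone because it is not strictly above the fence.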

3.4 Scaling:

Of the five numerical variables, all except age lie in a similar range, roughly between 1 and 300; age is recorded in days. Hence age is converted from days to years.

#changing the age in days to years
d2$age<-d2$age/365
d2$age<-round(d2$age,digits = 0)

summary(d2)

##       age        gender        height          weight      
##  Min.   :30.00   0:44936   Min.   : 55.0   Min.   : 21.00  
##  1st Qu.:48.00   1:24056   1st Qu.:159.0   1st Qu.: 65.00  
##  Median :54.00             Median :165.0   Median : 72.00  
##  Mean   :53.32             Mean   :164.4   Mean   : 74.12  
##  3rd Qu.:58.00             3rd Qu.:170.0   3rd Qu.: 82.00  
##  Max.   :65.00             Max.   :250.0   Max.   :200.00  
##      ap_hi           ap_lo        cholesterol gluc      smoke    
##  Min.   : 90.0   Min.   : 65.00   1:51752     1:58657   0:62931  
##  1st Qu.:120.0   1st Qu.: 80.00   2: 9341     2: 5088   1: 6061  
##  Median :120.0   Median : 80.00   3: 7899     3: 5247            
##  Mean   :126.2   Mean   : 81.62                                  
##  3rd Qu.:140.0   3rd Qu.: 90.00                                  
##  Max.   :170.0   Max.   :105.00                                  
##  alco      active    cardio   
##  0:65295   0:13573   0:34848  
##  1: 3697   1:55419   1:34144  
##                               
##                               
##                               
##

Chapter 4

Data Exploration

The most effective way to get an idea of the data and its variables is to visualize it. In this project we look at the data through univariate, bivariate and multivariate plots.

4.1 Univariate data exploration

1. Distribution of gender in the data: Roughly 65% of the observations (44,936 of 68,992 in the summary above) are females and the rest are males. Hence we can expect this project to say more about cardiovascular disease in females. This inference may change depending on the correlation between cardio and gender, which is discussed further in the multivariate analysis.

2. Distribution of age: The generated histogram suggests that the age variable is not normally distributed. At a glance, the age group from 40 to 60 years makes up a considerable share of the data set.

3. Blood pressure: Both blood pressure variables show variation. The density of systolic blood pressure is similar around the first and third quartiles, but it rises sharply at 110 and 130. Likewise, the density of diastolic blood pressure is similar around the first and third quartiles, but it rises sharply at 70 and 90.

4. Height and weight: The height and weight histograms look approximately normal. Most persons in this study fall within a height range of 150 cm to 170 cm and a weight range of 70 kg to 90 kg.

5. Univariate graphs of categorical variables: In these graphs we look at the number of persons by cholesterol level, glucose level, physical activity and smoking habits. Most observations have normal cholesterol and glucose levels, and surprisingly few persons smoke. It is also evident that most of the persons take part in physical activities such as exercise.

##Univariate Graphs

d3<-d2
# How many males and females do we have
perc <- d3$gender %>% table() %>% prop.table()*100
perc %>% barplot(main = "male vs female",ylab="Percent", ylim=c(0,100))

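# Age distribution on the density scale, overlaid with a normal curve fitted to the data
ggplot(d3,aes(x=age))+geom_histogram(aes(y=after_stat(density)),color="blue")+stat_function(fun = dnorm,args = list(mean = mean(d3$age), sd = sd(d3$age)))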

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# distribution of Systolic pressure and diastolic pressure
p1<-ggplot(d3,aes(x=ap_hi))+geom_density(fill="grey")
p2<-ggplot(d3,aes(x=ap_lo))+geom_density(fill="grey")
grid.arrange(p1, p2, nrow = 1)

# height and weight distribution
p3<-ggplot(d3,aes(x=height))+geom_histogram(color="blue")
p4<-ggplot(d3,aes(x=weight))+geom_histogram(color="blue")
grid.arrange(p3, p4, nrow = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# univariate of all categorical values
p5<-ggplot(d3, aes(x=cholesterol,  fill=cholesterol)) +
  geom_bar(stat="count")+theme_bw()

p6<-ggplot(d3, aes(x=active,  fill=active)) +
  geom_bar(stat="count")+theme_bw()


p7<-ggplot(d3, aes(x=smoke,  fill=smoke)) +
  geom_bar(stat="count")+theme_bw()

p8<-ggplot(d3, aes(x=gluc,  fill=gluc)) +
  geom_bar(stat="count")+theme_bw()

grid.arrange(p5,p6,p7,p8, nrow= 2)

4.2 Bi-Variate Data Exploration:

Bivariate data visualisation is one of the most effective ways to learn about the relationships between variables. Since cardio is the target in this project, we visualize cardio against each categorical variable.

Cardiovascular patients are distributed almost equally across all levels. Surprisingly, the counts show more cardiovascular patients among people who are active, have normal cholesterol and glucose levels, do not drink alcohol and do not smoke. We did not expect this result, but these groups also make up most of the observations, so raw counts alone can mislead. To deal with this we turn to the correlation matrix, which is helpful in the further analysis.

# Bi-Variate plots with cardio and categorical variables
p9<-ggplot(d3, aes(x=cholesterol,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
p10<-ggplot(d3, aes(x=gluc,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
p11<-ggplot(d3, aes(x=smoke,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
p12<-ggplot(d3, aes(x=alco,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
p13<-ggplot(d3, aes(x=active,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
p14<-ggplot(d3, aes(x=gender,  fill=cardio)) +
  geom_bar(stat="count",position="dodge")+theme_bw()
grid.arrange(p9,p10,p11,p12,p13,p14,ncol=3, nrow= 2)
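Because the dodge bar charts show raw counts, the larger groups dominate the picture. Comparing proportions within each level can be a useful complement. The following is a minimal sketch on made-up data: the data frame `toy` and its values are hypothetical.

```r
# Toy data: smoking status and cardio outcome (made-up values)
toy <- data.frame(smoke  = factor(c(0, 0, 0, 0, 0, 0, 1, 1)),
                  cardio = factor(c(1, 0, 1, 0, 1, 0, 1, 0)))

# Raw counts: the larger non-smoker group has more cardio patients
counts <- table(smoke = toy$smoke, cardio = toy$cardio)

# Row-wise proportions: share of cardio = 0/1 within each smoking level
rates <- prop.table(counts, margin = 1)

counts
rates
```

In this toy example non-smokers contribute three cardio patients and smokers only one, yet the share of cardio patients is one half in both groups. The same table() and prop.table() calls could be applied to d3 for any of the categorical variables.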

4.3 Correlation Matrix:

To see how the variables correlate with each other, we generated a correlation matrix plot, as shown in the code chunk below.

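par(mfrow=c(1,1))
dExplor<-d3
# Convert every column to integer (factors become their level codes) so cor() can be used
dExplor[]<-lapply(dExplor,as.integer)
correlation <- cor(dExplor[,1:12])
# Red-to-blue palette for the correlation values
cols<- colorRampPalette(c("red", "blue"))(20)
corrplot(correlation,  method ="number",col=cols,type="upper")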

Five variables show considerable correlation with the cardio variable: ap_hi, ap_lo, age, cholesterol and weight.
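The ranking of variables by their correlation with the target can also be read off numerically rather than from the plot. The sketch below uses a made-up data frame `toy` (hypothetical values) and mirrors the integer conversion used above. Note that as.integer() on a factor returns the level codes, so a factor with levels "0" and "1" becomes 1 and 2.

```r
# Toy data with factor columns, mimicking the mixed types in d3 (made-up values)
toy <- data.frame(age         = c(40, 45, 50, 55, 60, 65),
                  cholesterol = factor(c(1, 1, 2, 1, 3, 3)),
                  cardio      = factor(c(0, 0, 0, 1, 1, 1)))

# Convert every column to integer; factors are replaced by their level codes
toy_int <- toy
toy_int[] <- lapply(toy_int, as.integer)

# Absolute correlation of each descriptive column with the target, largest first
target_cor <- cor(toy_int)[, "cardio"]
target_cor <- sort(abs(target_cor[names(target_cor) != "cardio"]), decreasing = TRUE)
target_cor
```

The shift from 0/1 to 1/2 does not affect the Pearson correlation, which is unchanged when a variable is shifted by a constant.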

So let us return to the bivariate graphs. Age has a considerable relationship with several categorical variables. As the density plots show, cholesterol and glucose levels increase with age, and so does the risk of cardiovascular disease.

# Density plots for age and categorical variables
ggplot(d3,aes(x=age,fill=cholesterol))+geom_density(col=NA,alpha=0.35)

ggplot(d3,aes(x=age,fill=cardio))+geom_density(col=NA,alpha=0.35)

ggplot(d3,aes(x=age,fill=active))+geom_density(col=NA,alpha=0.35)

ggplot(d3,aes(x=age,fill=gluc))+geom_density(col=NA,alpha=0.35)

Looking at the density plots of age against ap_hi and ap_lo, we can observe a considerable change in ap_hi and ap_lo between 45 and 60 years of age.

Another interesting point is that the relationship between weight and height is mostly linear, and the distribution is concentrated in the normal height and weight ranges.

# Bivariate density charts for two numerical variables
library(viridis)

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:scales':
## 
##     viridis_pal

ggplot(d3,aes(x=ap_hi,y=age))+stat_density_2d(geom = "point",aes(size=..density..),n=20,contour = FALSE)+scale_size(range=c(0,9))

ggplot(d3,aes(x=ap_lo,y=age))+stat_density_2d(geom = "point",aes(size=..density..),n=20,contour = FALSE)+scale_size(range=c(0,9))

ggplot(d3,aes(x=ap_hi,y=ap_lo))+geom_point()

ggplot(d3,aes(x=height,y=weight))+geom_point()

We can also see that the cholesterol level has a positive relationship with cardio, which implies a higher risk of cardiovascular disease as cholesterol levels increase.

# The correlation between cardio and cholesterol is comparatively strong
# cardio and cholesterol
ggplot(d3, 
       aes(x = factor(cholesterol,
                      levels = c("1", "2", 
                                 "3")),
           fill = factor(cardio, 
                         levels = c("0", "1"),
                         labels = c("0", 
                                    "1"
                                    )))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2), 
                     labels = percent) +
  scale_fill_brewer(palette = "Set2") +
  labs(y = "Percent", 
       fill = "cardio",
       x = "cholesterol",
       title = "cholesterol by cardio") +
  theme_bw()

Cholesterol and age are positively related. From the violin plot we can infer that cholesterol levels tend to increase with age.

# Categorical and quantitative
ggplot(d3, 
       aes(x = cholesterol,
           y = age)) +
  geom_violin() +
  labs(title = "cholesterol by age")

From the correlation matrix we can see that cholesterol and glucose levels are positively correlated. This is evident from the chart below. At higher glucose levels there is a greater chance of higher cholesterol, which in turn implies a greater chance of being a cardiovascular patient.

# Considerable correlation between gluc and cholesterol levels
ggplot(d3, 
       aes(x = factor(gluc,
                      levels = c("1", "2", 
                                 "3")),
           fill = factor(cholesterol, 
                         levels = c("1", "2", "3"),
                         labels = c("Normal", 
                                    "Moderate", 
                                    "High")))) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, .2), 
                     labels = percent) +
  scale_fill_brewer(palette = "Set2") +
  labs(y = "Percent", 
       fill = "cholesterol",
       x = "gluc",
       title = "glucose by cholesterol") +
  theme_bw()

4.4 Multivariate Exploration

Finally, multivariate graphs reveal the behaviour of multiple variables with respect to a single variable.

In the first graph, the share of cardiovascular patients increases with age, and the chance of cardiovascular disease is very high at higher cholesterol levels.

In the second plot, most cardiovascular patients have a systolic pressure above 150 mmHg and a diastolic pressure above 90 mmHg.

# Cardio given by age and cholesterol
ggplot(d3, 
       aes(y = factor(cholesterol,
                      labels = c("1",
                                 "2",
                                 "3")), 
           x = age, 
           color = cardio)) +
  geom_jitter(alpha = 0.7,
              size = 1.5) +
  labs(title = "Cardio by cholesterol with respect to age", 
       x = "",
       y = "") +
  theme_minimal()

#Cardio given by ap_hi and ap_lo
ggplot(d3, 
       aes(x = ap_hi, 
           y = ap_lo,
           color= cardio)) +
  geom_jitter(alpha = 0.7,
              size = 1.5) +

  labs(x = "Systolic Blood Pressure",
       y = "Diastolic Blood Pressure",
       title = "Blood pressure relationship by Cardio"
       )
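The reading of the blood pressure plot can be checked numerically by comparing the share of cardio patients inside and outside the high-pressure region. The sketch below does this on a made-up data frame `toy` (hypothetical values), using the 150 mmHg and 90 mmHg thresholds mentioned in the text.

```r
# Toy records (made-up values): blood pressure in mmHg and cardio outcome
toy <- data.frame(ap_hi  = c(120, 160, 155, 110, 170, 130),
                  ap_lo  = c(80, 95, 100, 70, 100, 85),
                  cardio = c(0, 1, 1, 0, 1, 0))

# Flag records above both thresholds mentioned in the text
high_bp <- toy$ap_hi > 150 & toy$ap_lo > 90

# Share of cardio patients outside (FALSE) and inside (TRUE) the high-pressure group
cardio_share <- tapply(toy$cardio, high_bp, mean)
cardio_share
```

On the real data cardio is a factor, so it would first need to be converted to 0/1, for example with as.integer(as.character(d3$cardio)).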

Summary

In this study we explored the cardiovascular disease dataset and gained insights into the key factors that determine the target value. In the initial stages, based on the univariate graphs, we noted that the gender variable contains more females than males, so the study may be biased with respect to gender. According to the multivariate plots and the correlation matrix, systolic blood pressure, diastolic blood pressure, age and cholesterol are the most influential variables for the target value. Further analysis and modelling will be developed in order to predict cardiovascular disease.

Exploratory Data Analysis of Cardiovascular Patients

Manikanta Naishadu Devabhakthuni