Group member for this project

  1. Firdaus 17220599
  2. Deepiga Loganathamoorthy S2114748
  3. Kaveta Reveendran S2127193
  4. Muhammad Ezlan S2116731
  5. Jinrong Zhang S2111224

Title: cardiovascular disease Prediction Dataset

Year: 2022 Purpose: Predict cardiovascular disease based on various variables Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset?resource=download

Data description

There are 3 types of input features:

Objective: factual information; Examination: results of medical examination; Subjective: information given by the patient.

Features:

Age | Objective Feature | age | int (days) Height | Objective Feature | height | int (cm) | Weight | Objective Feature | weight | float (kg) | Gender | Objective Feature | gender | categorical code | Systolic blood pressure | Examination Feature | ap_hi | int | Diastolic blood pressure | Examination Feature | ap_lo | int | Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal | Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal | Smoking | Subjective Feature | smoke | binary | Alcoh ol intake | Subjective Feature | alco | binary | Physical activity | Subjective Feature | active | binary | Presence or absence of cardiovascular disease | Target Variable | cardio | binary | All of the dataset values were collected at the moment of medical examination.

1.0 Prepocessing

1.1. Import the packages before starting with data preprocessing

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#install.packages("dplyr")
#install.packages("ISLR")
#install.packages("reshape2")
#install.packages("rpart")
#install.packages("rlang")
library(rpart)
library(rpart.plot)
library(dplyr)
library(ISLR)
library(reshape2)
library(e1071)
library(gmodels)
library(caret)
library(cvms)
library(tibble)
library(ggplot2)

1.1.2 Read the original dataset

df<-read.csv2("/Users/Acer/Downloads/cardio_train.csv",head=TRUE,sep = ";")
head(df)
##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1  0 18393      2    168   62.0   110    80           1    1     0    0      1
## 2  1 20228      1    156   85.0   140    90           3    1     0    0      1
## 3  2 18857      1    165   64.0   130    70           3    1     0    0      0
## 4  3 17623      2    169   82.0   150   100           1    1     0    0      1
## 5  4 17474      1    156   56.0   100    60           1    1     0    0      0
## 6  8 21914      1    151   67.0   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0

1.1.3 View the structure of the data We observe a part of the data using the head() commands.

glimpse(df)
## Rows: 70,000
## Columns: 13
## $ id          <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, 2~
## $ age         <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 17~
## $ gender      <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2~
## $ height      <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 173~
## $ weight      <chr> "62.0", "85.0", "64.0", "82.0", "56.0", "67.0", "93.0", "9~
## $ ap_hi       <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 120~
## $ ap_lo       <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, 9~
## $ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ gluc        <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1~
## $ smoke       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1~
## $ alco        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
## $ active      <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1~
## $ cardio      <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~

There are total of 70,000 rows and 13 columns. From the above data showing in the table, we noticed that the column age is computed in days which is not a familiar conception we heard in life. So we tansfered this column that is described in year.That is, the age described in days will be divided by 365 and given a new name newage with a integar data type.

summary(df)
##        id             age            gender         height     
##  Min.   :    0   Min.   :10798   Min.   :1.00   Min.   : 55.0  
##  1st Qu.:25007   1st Qu.:17664   1st Qu.:1.00   1st Qu.:159.0  
##  Median :50002   Median :19703   Median :1.00   Median :165.0  
##  Mean   :49972   Mean   :19469   Mean   :1.35   Mean   :164.4  
##  3rd Qu.:74889   3rd Qu.:21327   3rd Qu.:2.00   3rd Qu.:170.0  
##  Max.   :99999   Max.   :23713   Max.   :2.00   Max.   :250.0  
##     weight              ap_hi             ap_lo           cholesterol   
##  Length:70000       Min.   : -150.0   Min.   :  -70.00   Min.   :1.000  
##  Class :character   1st Qu.:  120.0   1st Qu.:   80.00   1st Qu.:1.000  
##  Mode  :character   Median :  120.0   Median :   80.00   Median :1.000  
##                     Mean   :  128.8   Mean   :   96.63   Mean   :1.367  
##                     3rd Qu.:  140.0   3rd Qu.:   90.00   3rd Qu.:2.000  
##                     Max.   :16020.0   Max.   :11000.00   Max.   :3.000  
##       gluc           smoke              alco             active      
##  Min.   :1.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000  
##  Median :1.000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :1.226   Mean   :0.08813   Mean   :0.05377   Mean   :0.8037  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##      cardio      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4997  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
#assign new data frame for experiment
df_test<-df
class(df_test)
## [1] "data.frame"
#The name of each column
names(df_test)
##  [1] "id"          "age"         "gender"      "height"      "weight"     
##  [6] "ap_hi"       "ap_lo"       "cholesterol" "gluc"        "smoke"      
## [11] "alco"        "active"      "cardio"

1.1.4 View the dimension of the data

dim(df_test)
## [1] 70000    13

1.1.5 Summary of data Check for missing value in each column

#Calculate how many missing values are in each column
colSums(is.na(df_test))
##          id         age      gender      height      weight       ap_hi 
##           0           0           0           0           0           0 
##       ap_lo cholesterol        gluc       smoke        alco      active 
##           0           0           0           0           0           0 
##      cardio 
##           0

We can also visualize missing data using Amelia Package, using function of “missmap”

library(Amelia)
missmap(df)

Currently showing no missing data Check if there is a duplicate id

#Check for duplicate data
table(duplicated(df_test$id))
## 
## FALSE 
## 70000

Id has no duplicate values Check other columns of data

unique(df_test$gender)
## [1] 2 1
unique(df_test$height)
##   [1] 168 156 165 169 151 157 178 158 164 173 181 172 170 154 162 163 153 159
##  [19] 166 155 160 175 171 152 187 148 179 180 188 185 167 183 174 176 161 184
##  [37] 177 182  76 149 142 150 144 147 186 146 141 195 140 198 145 143 196 138
##  [55] 194 190 134 136 100 120 189 137 192 122 250 191 117  70  97 119 130 110
##  [73] 193  75 132  71 135  67 125 139 133  74  98 112 207  68  55  81  80  64
##  [91]  91  60 109  72 197  65 128 105 108 200 104 111 113  96 131  59  66  99
## [109]  57
unique(df_test$weight)
##   [1] "62.0"  "85.0"  "64.0"  "82.0"  "56.0"  "67.0"  "93.0"  "95.0"  "71.0" 
##  [10] "68.0"  "80.0"  "60.0"  "78.0"  "112.0" "75.0"  "52.0"  "83.0"  "69.0" 
##  [19] "90.0"  "45.0"  "65.0"  "59.0"  "66.0"  "74.0"  "105.0" "73.0"  "55.0" 
##  [28] "70.0"  "72.0"  "63.0"  "50.0"  "107.0" "84.0"  "77.0"  "79.0"  "76.0" 
##  [37] "58.0"  "115.0" "97.0"  "53.0"  "57.0"  "49.0"  "110.0" "94.0"  "92.0" 
##  [46] "87.0"  "103.0" "88.0"  "99.0"  "100.0" "61.0"  "48.0"  "54.0"  "51.0" 
##  [55] "47.0"  "91.0"  "104.0" "81.0"  "98.0"  "108.0" "89.0"  "101.0" "86.0" 
##  [64] "65.5"  "142.0" "96.0"  "44.0"  "41.0"  "169.0" "116.0" "200.0" "64.8" 
##  [73] "120.0" "117.0" "125.0" "106.0" "114.0" "113.0" "129.0" "124.0" "42.0" 
##  [82] "123.0" "111.0" "102.0" "109.0" "46.0"  "139.0" "150.0" "138.0" "131.0"
##  [91] "43.0"  "83.6"  "134.0" "106.5" "121.0" "40.0"  "141.0" "119.0" "126.0"
## [100] "177.0" "165.0" "122.0" "62.3"  "118.0" "89.5"  "132.0" "155.0" "130.0"
## [109] "159.0" "84.6"  "140.0" "82.3"  "30.0"  "67.9"  "127.0" "67.5"  "74.3" 
## [118] "149.0" "180.0" "128.0" "60.2"  "64.5"  "58.7"  "90.5"  "37.0"  "80.8" 
## [127] "166.0" "161.0" "92.2"  "75.6"  "136.0" "82.9"  "154.0" "68.9"  "170.0"
## [136] "84.5"  "135.0" "72.5"  "146.0" "70.5"  "74.2"  "84.3"  "38.0"  "54.5" 
## [145] "145.0" "68.5"  "79.5"  "72.8"  "73.2"  "96.5"  "53.2"  "39.0"  "61.2" 
## [154] "168.0" "133.0" "54.35" "50.7"  "79.94" "45.8"  "153.0" "34.0"  "62.4" 
## [163] "167.0" "156.0" "35.0"  "70.3"  "31.0"  "60.5"  "69.8"  "64.7"  "64.1" 
## [172] "152.0" "59.5"  "163.0" "75.5"  "78.2"  "162.0" "121.3" "178.0" "82.5" 
## [181] "90.7"  "80.6"  "55.6"  "59.8"  "74.77" "32.0"  "70.8"  "57.4"  "73.8" 
## [190] "56.2"  "69.5"  "71.5"  "89.1"  "158.0" "76.5"  "35.45" "86.5"  "63.8" 
## [199] "23.0"  "144.0" "70.2"  "68.4"  "148.0" "164.0" "22.0"  "66.5"  "54.9" 
## [208] "59.2"  "65.3"  "137.0" "62.2"  "50.5"  "36.0"  "61.5"  "80.5"  "81.1" 
## [217] "53.6"  "11.0"  "66.4"  "76.7"  "28.0"  "68.3"  "60.6"  "76.8"  "58.8" 
## [226] "181.0" "75.2"  "83.5"  "64.3"  "175.0" "84.9"  "80.7"  "84.7"  "60.1" 
## [235] "105.5" "85.5"  "121.8" "51.5"  "53.9"  "42.2"  "74.5"  "114.6" "73.5" 
## [244] "71.3"  "33.0"  "66.3"  "53.3"  "67.8"  "70.7"  "52.3"  "82.4"  "63.4" 
## [253] "71.2"  "62.5"  "63.82" "88.5"  "89.9"  "72.1"  "106.4" "160.0" "68.2" 
## [262] "53.67" "171.0" "55.2"  "94.5"  "143.0" "62.7"  "53.5"  "109.5" "10.0" 
## [271] "82.1"  "57.8"  "84.8"  "21.0"  "29.0"  "183.0" "58.5"  "57.6"  "78.5" 
## [280] "172.0" "109.7" "61.3"  "67.3"  "94.7"  "147.0" "55.4"  "99.9"
unique(df_test$ap_hi)
##   [1]   110   140   130   150   100   120   145   170   135   125    90   180
##  [13]   160   133   190    80   122   169   126   158   200    14   123    70
##  [25]   161   147   115   137   153    11   148   105   220   119   141   165
##  [37]   164    12   124   172   902   162   906   117   134   166   210   176
##  [49]   116    10   121    16   112   159   113   118   155   142   131   157
##  [61]   136   146   138  -100   909   109    85   106   129    93     7    95
##  [73]   179   156   168   132   104   103   178   175   128   151    15   139
##  [85] 11500   127    17   108   144   102     1  1420    13   143   701   107
##  [97]   184   149   167   114   101    60  1500   181   171   202  -115   111
## [109]   907    20   188   185   163   173   154   177  -140   174  -120 14020
## [121]  1400   240   191   197  1620   152    96   199  -150  1130   193    99
## [133]   196   309   401 16020  1202   806  1300   230   207   215    97  1409
## [145] 11020    24   960 13010  1110   195  1205   187  2000
unique(df_test$ap_lo)
##   [1]    80    90    70   100    60    85    89   110    65    63    79  1100
##  [13]  1000   800   120    50    30   109    84  1033   150    91    40    73
##  [25]    78    75    86    87  1001    82    95    69    74    97    81  1200
##  [37]    83   119     0    93   105 10000    99    77    59  8044   140    92
##  [49]  1044   108   125   115    68    61   106   102    94    66    52   170
##  [61]    76   160    62    96   130   113    67  9100    10    88   902     8
##  [73]   112   104    71    72  1008    98  2088    20   802  8000  1022   850
##  [85]   708    57   101  9011  1011    64  1007  1177  7100    45   709  8500
##  [97]    58  1110  8099  1088   126  1077  1120     7   103  1125   180   121
## [109]  8100   710  5700  8079  1111  1003     6  1900   809   114   801  1002
## [121]    53   111     1   118    56   182   810     9  7099 11000  9800  8200
## [133]  1139   107   820    55  1400   190   900   122  6800   135   700    15
## [145]  1101   910  1140  1211   -70    54  8077   901   880   870   585    49
## [157]   602
unique(df_test$cholesterol)
## [1] 1 3 2
unique(df_test$gluc)
## [1] 1 2 3
unique(df_test$smoke)
## [1] 0 1
unique(df_test$alco)
## [1] 0 1
unique(df_test$active)
## [1] 1 0
unique(df_test$cardio)
## [1] 0 1
unique(df_test$newage)
## NULL

1.2 Data Cleaning - Which section of the data do you need to tidy?

1.2.1 we normalize the age value from number of days into age value, we add a new columns named newage

df_test<-df
#df$newage<-floor(df$age/365)
df_test$newage<-floor(df$age/365)
#drop original age column and id column
df_test<-df_test[-c(1,2)]
head(df_test)
##   gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
## 1      2    168   62.0   110    80           1    1     0    0      1      0
## 2      1    156   85.0   140    90           3    1     0    0      1      1
## 3      1    165   64.0   130    70           3    1     0    0      0      1
## 4      2    169   82.0   150   100           1    1     0    0      1      1
## 5      1    156   56.0   100    60           1    1     0    0      0      0
## 6      1    151   67.0   120    80           2    2     0    0      0      0
##   newage
## 1     50
## 2     55
## 3     51
## 4     48
## 5     47
## 6     60

1.2.2 Change the Data type of weight from character to integer, so we can make a plot of it

#View original data type
class(df_test$weight)
## [1] "character"
#Change the data type
df_test$weight <- as.integer(df_test$weight)
#Check the modification result
class(df_test$weight)
## [1] "integer"

1.2.3 created a new column called BMI from height and weight BMI formula = weight(Kg) / height(M)^2

  • Below 18.5 Underweight
  • 18.5—24.9 Healthy
  • 25.0—29.9 Overweight
  • 30.0 and Above Obese
df_test$BMI<-round(df_test$weight/(df_test$height/100)^2, 2)
head(df_test)
##   gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
## 1      2    168     62   110    80           1    1     0    0      1      0
## 2      1    156     85   140    90           3    1     0    0      1      1
## 3      1    165     64   130    70           3    1     0    0      0      1
## 4      2    169     82   150   100           1    1     0    0      1      1
## 5      1    156     56   100    60           1    1     0    0      0      0
## 6      1    151     67   120    80           2    2     0    0      0      0
##   newage   BMI
## 1     50 21.97
## 2     55 34.93
## 3     51 23.51
## 4     48 28.71
## 5     47 23.01
## 6     60 29.38
boxplot(df_test$BMI, col="pink")

summary(df_test$BMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.47   23.88   26.37   27.56   30.22  298.67

1.2.4 remove outliers in BMI. BMI range is from 10 to 60, the data reduced by 72, from 70000 now is 69928

df_test=subset(df_test, df_test$BMI>=10 & df_test$BMI<=60)
summary(df_test)
##      gender         height          weight           ap_hi        
##  Min.   :1.00   Min.   : 98.0   Min.   : 28.00   Min.   : -150.0  
##  1st Qu.:1.00   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:  120.0  
##  Median :1.00   Median :165.0   Median : 72.00   Median :  120.0  
##  Mean   :1.35   Mean   :164.4   Mean   : 74.17   Mean   :  128.8  
##  3rd Qu.:2.00   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:  140.0  
##  Max.   :2.00   Max.   :250.0   Max.   :200.00   Max.   :16020.0  
##      ap_lo           cholesterol         gluc           smoke        
##  Min.   :  -70.00   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
##  1st Qu.:   80.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000  
##  Median :   80.00   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :   96.59   Mean   :1.367   Mean   :1.227   Mean   :0.08813  
##  3rd Qu.:   90.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.00000  
##  Max.   :11000.00   Max.   :3.000   Max.   :3.000   Max.   :1.00000  
##       alco             active           cardio           newage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :29.00  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:48.00  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :53.00  
##  Mean   :0.05376   Mean   :0.8037   Mean   :0.4997   Mean   :52.84  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:58.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
##       BMI       
##  Min.   :10.73  
##  1st Qu.:23.88  
##  Median :26.37  
##  Mean   :27.48  
##  3rd Qu.:30.12  
##  Max.   :60.00
str(df_test)
## 'data.frame':    69928 obs. of  13 variables:
##  $ gender     : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : int  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: int  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : int  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...
##  $ newage     : num  50 55 51 48 47 60 60 61 48 54 ...
##  $ BMI        : num  22 34.9 23.5 28.7 23 ...

1.2.5 remove outliers in newage column, we remove age <= 39 the data reduced by 4, from 69928 now is 69924

df_test=subset(df_test, df_test$newage >=39 )
summary(df_test)
##      gender         height          weight           ap_hi        
##  Min.   :1.00   Min.   : 98.0   Min.   : 28.00   Min.   : -150.0  
##  1st Qu.:1.00   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:  120.0  
##  Median :1.00   Median :165.0   Median : 72.00   Median :  120.0  
##  Mean   :1.35   Mean   :164.4   Mean   : 74.17   Mean   :  128.8  
##  3rd Qu.:2.00   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:  140.0  
##  Max.   :2.00   Max.   :250.0   Max.   :200.00   Max.   :16020.0  
##      ap_lo           cholesterol         gluc           smoke        
##  Min.   :  -70.00   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
##  1st Qu.:   80.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000  
##  Median :   80.00   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :   96.59   Mean   :1.367   Mean   :1.227   Mean   :0.08814  
##  3rd Qu.:   90.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.00000  
##  Max.   :11000.00   Max.   :3.000   Max.   :3.000   Max.   :1.00000  
##       alco             active           cardio           newage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :39.00  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:48.00  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :53.00  
##  Mean   :0.05376   Mean   :0.8037   Mean   :0.4997   Mean   :52.84  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:58.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
##       BMI       
##  Min.   :10.73  
##  1st Qu.:23.88  
##  Median :26.37  
##  Mean   :27.48  
##  3rd Qu.:30.12  
##  Max.   :60.00
str(df_test)
## 'data.frame':    69924 obs. of  13 variables:
##  $ gender     : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : int  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: int  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : int  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ alco       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ active     : int  1 1 0 1 0 0 1 1 1 0 ...
##  $ cardio     : int  0 1 1 1 0 0 0 1 0 0 ...
##  $ newage     : num  50 55 51 48 47 60 60 61 48 54 ...
##  $ BMI        : num  22 34.9 23.5 28.7 23 ...

From the above data, there are some redundant data. we attribute this result that some patients has the same weight or height. We need further check if there are any outliers by plotting boxplot.

#par(mfrow=c(4,3))
boxplot(df_test$newage, col="brown2")

boxplot(df_test$height, col="cyan2")

boxplot(df_test$weight, col="cyan3")

boxplot(df_test$ap_hi, col="cyan2")

boxplot(df_test$ap_lo, col="cyan3")

par(mfrow=c(1,2))
hist(df_test$height, freq = TRUE)
#title("Histogram of height")
lines(density.default(df_test$height), col = "blue")
x <- 0:280
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$height), sd(df_test$height)), col = "black")

hist(df_test$weight, freq = FALSE)
#title("Histogram of weight")
lines(density.default(df_test$weight), col = "blue")
x <- 0:250
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$weight), sd(df_test$weight)), col = "red")

hist(df_test$BMI, freq = FALSE)
#title("Histogram of BMI")
lines(density.default(df_test$BMI), col = "blue")
x <- 0:250
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$BMI), sd(df_test$BMI)), col = "red")

#sp<-boxplot(df$height,boxwex=0.7) 
#xi=1.1  
#sd.s=sd(df[complete.cases(df$height),]$height)  
#mn.s=mean(df[complete.cases(df),]$height)  
#points(xi,mn.s,col="red",pch=18)  
#arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", angle = 75, length = .1)  
#text(rep(c(1.05,1.05,0.95,0.95),length=length(sp$out)),labels=sp$out[order(sp$out)],  
#sp$out[order(sp$out)]+rep(c(150,-150,150,-150),length=length(sp$out)),col="red")  

1.2.6 change column name of ap_hi to systolic_bp and ap_lo to diastolic_bp for better understanding of the data

names(df_test)[4:5] <-c("systolic_bp", "diastolic_bp")
summary(df_test)
##      gender         height          weight        systolic_bp     
##  Min.   :1.00   Min.   : 98.0   Min.   : 28.00   Min.   : -150.0  
##  1st Qu.:1.00   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:  120.0  
##  Median :1.00   Median :165.0   Median : 72.00   Median :  120.0  
##  Mean   :1.35   Mean   :164.4   Mean   : 74.17   Mean   :  128.8  
##  3rd Qu.:2.00   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:  140.0  
##  Max.   :2.00   Max.   :250.0   Max.   :200.00   Max.   :16020.0  
##   diastolic_bp       cholesterol         gluc           smoke        
##  Min.   :  -70.00   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
##  1st Qu.:   80.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000  
##  Median :   80.00   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :   96.59   Mean   :1.367   Mean   :1.227   Mean   :0.08814  
##  3rd Qu.:   90.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.00000  
##  Max.   :11000.00   Max.   :3.000   Max.   :3.000   Max.   :1.00000  
##       alco             active           cardio           newage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :39.00  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:48.00  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :53.00  
##  Mean   :0.05376   Mean   :0.8037   Mean   :0.4997   Mean   :52.84  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:58.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
##       BMI       
##  Min.   :10.73  
##  1st Qu.:23.88  
##  Median :26.37  
##  Mean   :27.48  
##  3rd Qu.:30.12  
##  Max.   :60.00

1.2.7 remove outliers from systolic_bp, range only 80<=systolic_bp<=220

#before removal
boxplot(df_test$systolic_bp)

#remove outliers
df_test=subset(df_test, df_test$systolic_bp>=80 & df_test$systolic_bp<=220)
summary(df_test)
##      gender         height          weight        systolic_bp 
##  Min.   :1.00   Min.   :100.0   Min.   : 28.00   Min.   : 80  
##  1st Qu.:1.00   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:120  
##  Median :1.00   Median :165.0   Median : 72.00   Median :120  
##  Mean   :1.35   Mean   :164.4   Mean   : 74.18   Mean   :127  
##  3rd Qu.:2.00   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:140  
##  Max.   :2.00   Max.   :250.0   Max.   :200.00   Max.   :220  
##   diastolic_bp       cholesterol         gluc           smoke        
##  Min.   :    0.00   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
##  1st Qu.:   80.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000  
##  Median :   80.00   Median :1.000   Median :1.000   Median :0.00000  
##  Mean   :   96.59   Mean   :1.367   Mean   :1.227   Mean   :0.08827  
##  3rd Qu.:   90.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.00000  
##  Max.   :11000.00   Max.   :3.000   Max.   :3.000   Max.   :1.00000  
##       alco             active           cardio           newage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :39.00  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:48.00  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :53.00  
##  Mean   :0.05374   Mean   :0.8038   Mean   :0.4997   Mean   :52.84  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:58.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
##       BMI       
##  Min.   :10.73  
##  1st Qu.:23.88  
##  Median :26.37  
##  Mean   :27.48  
##  3rd Qu.:30.18  
##  Max.   :60.00
#boxplot(df_test$systolic_bp)

1.2.8 remove outliers in diastolic_bp, range from 40<=diastolic_bp<=160

#before remove outliers
boxplot(df_test$diastolic_bp)

#remove outliers
df_test=subset(df_test, df_test$diastolic_bp>=40 & df_test$diastolic_bp<=160)
boxplot(df_test$diastolic_bp, col="aquamarine")

summary(df_test$diastolic_bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   80.00   80.00   81.36   90.00  160.00

1.2.9 change value of gender for male from 2 to 0,

df_test$gender[df_test$gender == 2] <- 0
summary(df_test)
##      gender           height          weight        systolic_bp   
##  Min.   :0.0000   Min.   :100.0   Min.   : 28.00   Min.   : 80.0  
##  1st Qu.:0.0000   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:120.0  
##  Median :1.0000   Median :165.0   Median : 72.00   Median :120.0  
##  Mean   :0.6512   Mean   :164.4   Mean   : 74.09   Mean   :126.6  
##  3rd Qu.:1.0000   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:140.0  
##  Max.   :1.0000   Max.   :250.0   Max.   :200.00   Max.   :220.0  
##   diastolic_bp     cholesterol         gluc           smoke      
##  Min.   : 40.00   Min.   :1.000   Min.   :1.000   Min.   :0.000  
##  1st Qu.: 80.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 80.00   Median :1.000   Median :1.000   Median :0.000  
##  Mean   : 81.36   Mean   :1.365   Mean   :1.226   Mean   :0.088  
##  3rd Qu.: 90.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :160.00   Max.   :3.000   Max.   :3.000   Max.   :1.000  
##       alco             active           cardio           newage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :39.00  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:48.00  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median :53.00  
##  Mean   :0.05353   Mean   :0.8034   Mean   :0.4949   Mean   :52.83  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:58.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :64.00  
##       BMI       
##  Min.   :10.73  
##  1st Qu.:23.88  
##  Median :26.35  
##  Mean   :27.45  
##  3rd Qu.:30.12  
##  Max.   :60.00

summary of data cleaning: 1. normalize the age value 2. Change the Data type of weight from character to integer 3. created a new column called BMI 4. remove outliers in BMI 5. change column name ap_hi to systolic_bp, ap_lo to diastolic_bp 5. remove outliers in systolic_bp 6. remove outliers in diastolic_bp

1.3 create correlation matrix

library(reshape2)
df_cormat <- round(cor(df_test),2)
#head(df_cormat)
df_melted_cormat <- melt(df_cormat)
#head(df_melted_cormat)
#function to split upper and lower triangle part of the correlation matrix
get_lower_tri<-function(df_cormat){
    df_cormat[upper.tri(df_cormat)] <- NA
    return(df_cormat)
  }
get_upper_tri <- function(df_cormat){
    df_cormat[lower.tri(df_cormat)]<- NA
    return(df_cormat)
}

upper_tri<-get_upper_tri(df_cormat)
lower_tri<-get_lower_tri(df_cormat)

df_melted_cormat <- melt(upper_tri, na.rm = TRUE)

library(ggplot2)
ggheatmap<-ggplot(data = df_melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "grey", 
    midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pairwise Correlation Matrix") +
  theme_minimal()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
  coord_fixed()+
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 2)
ggheatmap

#par(mfrow=c(1,2))  

#dotchart(df_test$height)

#pc=boxplot(df_test$height,horizontal=T)

Now creating prediction part

1.4 Classification using Decision Tree

1.4.1 Before any prediction process begin, we need to split data into train and test, we could use various types of splitting , for sake of simplicity, we create split ratio of 80/20 and its function named “create_train_test”

df_prediction<-df_test 

create_train_test <- function(data, size = 0.8, train = TRUE) {
    n_row = nrow(data)
    total_row = size * n_row
    train_sample <- 1: total_row
    if (train == TRUE) {
        return (data[train_sample, ])
    } else {
        return (data[-train_sample, ])
    }
}

1.4.2 Create the train and test split using the function that we created

data_train <- create_train_test(df_prediction, 0.8, train = TRUE)
data_test <- create_train_test(df_prediction, 0.8, train = FALSE)
dim(data_train)
## [1] 54934    13
dim(data_test)
## [1] 13734    13

1.4.3 verify if the randomization process is correct.

prop.table(table(df_prediction$cardio))
## 
##         0         1 
## 0.5050533 0.4949467
prop.table(table(data_train$cardio)) 
## 
##         0         1 
## 0.5051698 0.4948302
prop.table(table(data_test$cardio)) 
## 
##         0         1 
## 0.5045872 0.4954128

Defining the argument in Rpart function rpart(formula, data= , method=’’) arguments:
formula: The function to predict data: Specifies the data frame- method:
“class” for a classification tree and “anova” for a regression tree

1.4.4 Run the model using rpart function (with systolic bp)

library(rpart)
library(rpart.plot)
ctrl=rpart.control(cp=0.001)

fit <- rpart(cardio~., data = data_train, method = 'class', control=ctrl)
rpart.plot(fit, extra = 106)

1.4.5 Make a prediction model with systolic bp and its confusion matrix predict(fitted_model, df, type = ‘class’) arguments: fitted_model: This is the object stored after model estimation. df: Data frame used to make the prediction type: Type of prediction
‘class’: for classification
‘prob’: to compute the probability of each class
‘vector’: Predict the mean response at the node level

predict_cardio <-predict(fit, data_test, type = 'class')
table_mat <- table(data_test$cardio, predict_cardio)
table_mat
##    predict_cardio
##        0    1
##   0 5450 1480
##   1 2253 4551

1.4.6 Accuracy for the confusion matrix

accuracy_Test <- (sum(diag(table_mat)) / sum(table_mat))*100
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 72.8192806174458"

1.5 Classification using Naive Bayes (will use in-built package e1071)##

1.5.1 Train the model using in-built function naiveBayes

df_NB<-df_prediction
#Building a NB model
#maintain use same data split from classification decision tree


NBclassfier=naiveBayes(cardio~., data=data_train)
print(NBclassfier)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.5051698 0.4948302 
## 
## Conditional probabilities:
##    gender
## Y        [,1]      [,2]
##   0 0.6563727 0.4749270
##   1 0.6511055 0.4766293
## 
##    height
## Y       [,1]     [,2]
##   0 164.5223 7.818532
##   1 164.2804 8.012345
## 
##    weight
## Y       [,1]     [,2]
##   0 71.61897 13.19151
##   1 76.63860 14.67659
## 
##    systolic_bp
## Y       [,1]     [,2]
##   0 119.5563 12.59890
##   1 133.8482 17.29502
## 
##    diastolic_bp
## Y       [,1]     [,2]
##   0 78.15740 8.221568
##   1 84.64831 9.749330
## 
##    cholesterol
## Y       [,1]      [,2]
##   0 1.213650 0.5227188
##   1 1.515432 0.7755952
## 
##    gluc
## Y       [,1]      [,2]
##   0 1.176354 0.5110993
##   1 1.276791 0.6242184
## 
##    smoke
## Y         [,1]      [,2]
##   0 0.09322187 0.2907483
##   1 0.08218372 0.2746495
## 
##    alco
## Y         [,1]      [,2]
##   0 0.05578177 0.2295040
##   1 0.05061987 0.2192242
## 
##    active
## Y        [,1]      [,2]
##   0 0.8182047 0.3856826
##   1 0.7883604 0.4084782
## 
##    newage
## Y       [,1]     [,2]
##   0 51.22507 6.782410
##   1 54.45867 6.362268
## 
##    BMI
## Y       [,1]     [,2]
##   0 26.48614 4.781995
##   1 28.45079 5.437499

1.5.2 Prediction the model

y_pred <- predict(NBclassfier, newdata = data_test)  

print(head(y_pred))
## [1] 0 0 0 0 1 1
## Levels: 0 1

1.5.3 Confusion matrix

cm <- table(data_test$cardio, y_pred)
cm
##    y_pred
##        0    1
##   0 5557 1373
##   1 2672 4132

1.5.4 Accuracy for the confusion matrix

accuracy_Test <- (sum(diag(cm)) / sum(cm))*100

print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 70.5475462356196"
In conclusion, it is safe to say, decision tree does produce slight edge in accuracy performance of 72.82% as opposed to Naive Bayes with 70.54% thus reflecting back to our main objectives, accuracy capabilities to detect cardiovscular desease based on its significant attributes (pairwise correlation)