Title: cardiovascular disease Prediction Dataset
Year: 2022 Purpose: Predict cardiovascular disease based on various variables Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset?resource=download
Objective: factual information; Examination: results of medical examination; Subjective: information given by the patient.
Features:
Age | Objective Feature | age | int (days) Height | Objective Feature | height | int (cm) | Weight | Objective Feature | weight | float (kg) | Gender | Objective Feature | gender | categorical code | Systolic blood pressure | Examination Feature | ap_hi | int | Diastolic blood pressure | Examination Feature | ap_lo | int | Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal | Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal | Smoking | Subjective Feature | smoke | binary | Alcoh ol intake | Subjective Feature | alco | binary | Physical activity | Subjective Feature | active | binary | Presence or absence of cardiovascular disease | Target Variable | cardio | binary | All of the dataset values were collected at the moment of medical examination.
1.0 Prepocessing
1.1. Import the packages before starting with data preprocessing
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
#install.packages("dplyr")
#install.packages("ISLR")
#install.packages("reshape2")
#install.packages("rpart")
#install.packages("rlang")
library(rpart)
library(rpart.plot)
library(dplyr)
library(ISLR)
library(reshape2)
library(e1071)
library(gmodels)
library(caret)
library(cvms)
library(tibble)
library(ggplot2)
1.1.2 Read the original dataset
df<-read.csv2("/Users/Acer/Downloads/cardio_train.csv",head=TRUE,sep = ";")
head(df)
## id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1 0 18393 2 168 62.0 110 80 1 1 0 0 1
## 2 1 20228 1 156 85.0 140 90 3 1 0 0 1
## 3 2 18857 1 165 64.0 130 70 3 1 0 0 0
## 4 3 17623 2 169 82.0 150 100 1 1 0 0 1
## 5 4 17474 1 156 56.0 100 60 1 1 0 0 0
## 6 8 21914 1 151 67.0 120 80 2 2 0 0 0
## cardio
## 1 0
## 2 1
## 3 1
## 4 1
## 5 0
## 6 0
1.1.3 View the structure of the data We observe a part of the data using the head() commands.
glimpse(df)
## Rows: 70,000
## Columns: 13
## $ id <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, 2~
## $ age <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 17~
## $ gender <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2~
## $ height <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 173~
## $ weight <chr> "62.0", "85.0", "64.0", "82.0", "56.0", "67.0", "93.0", "9~
## $ ap_hi <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 120~
## $ ap_lo <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, 9~
## $ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ gluc <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1~
## $ smoke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1~
## $ alco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
## $ active <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1~
## $ cardio <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
There are total of 70,000 rows and 13 columns. From the above data showing in the table, we noticed that the column age is computed in days which is not a familiar conception we heard in life. So we tansfered this column that is described in year.That is, the age described in days will be divided by 365 and given a new name newage with a integar data type.
summary(df)
## id age gender height
## Min. : 0 Min. :10798 Min. :1.00 Min. : 55.0
## 1st Qu.:25007 1st Qu.:17664 1st Qu.:1.00 1st Qu.:159.0
## Median :50002 Median :19703 Median :1.00 Median :165.0
## Mean :49972 Mean :19469 Mean :1.35 Mean :164.4
## 3rd Qu.:74889 3rd Qu.:21327 3rd Qu.:2.00 3rd Qu.:170.0
## Max. :99999 Max. :23713 Max. :2.00 Max. :250.0
## weight ap_hi ap_lo cholesterol
## Length:70000 Min. : -150.0 Min. : -70.00 Min. :1.000
## Class :character 1st Qu.: 120.0 1st Qu.: 80.00 1st Qu.:1.000
## Mode :character Median : 120.0 Median : 80.00 Median :1.000
## Mean : 128.8 Mean : 96.63 Mean :1.367
## 3rd Qu.: 140.0 3rd Qu.: 90.00 3rd Qu.:2.000
## Max. :16020.0 Max. :11000.00 Max. :3.000
## gluc smoke alco active
## Min. :1.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :1.000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :1.226 Mean :0.08813 Mean :0.05377 Mean :0.8037
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## cardio
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4997
## 3rd Qu.:1.0000
## Max. :1.0000
#assign new data frame for experiment
df_test<-df
class(df_test)
## [1] "data.frame"
#The name of each column
names(df_test)
## [1] "id" "age" "gender" "height" "weight"
## [6] "ap_hi" "ap_lo" "cholesterol" "gluc" "smoke"
## [11] "alco" "active" "cardio"
1.1.4 View the dimension of the data
dim(df_test)
## [1] 70000 13
1.1.5 Summary of data Check for missing value in each column
#Calculate how many missing values are in each column
colSums(is.na(df_test))
## id age gender height weight ap_hi
## 0 0 0 0 0 0
## ap_lo cholesterol gluc smoke alco active
## 0 0 0 0 0 0
## cardio
## 0
We can also visualize missing data using Amelia Package, using function of “missmap”
library(Amelia)
missmap(df)
Currently showing no missing data Check if there is a duplicate id
#Check for duplicate data
table(duplicated(df_test$id))
##
## FALSE
## 70000
Id has no duplicate values Check other columns of data
unique(df_test$gender)
## [1] 2 1
unique(df_test$height)
## [1] 168 156 165 169 151 157 178 158 164 173 181 172 170 154 162 163 153 159
## [19] 166 155 160 175 171 152 187 148 179 180 188 185 167 183 174 176 161 184
## [37] 177 182 76 149 142 150 144 147 186 146 141 195 140 198 145 143 196 138
## [55] 194 190 134 136 100 120 189 137 192 122 250 191 117 70 97 119 130 110
## [73] 193 75 132 71 135 67 125 139 133 74 98 112 207 68 55 81 80 64
## [91] 91 60 109 72 197 65 128 105 108 200 104 111 113 96 131 59 66 99
## [109] 57
unique(df_test$weight)
## [1] "62.0" "85.0" "64.0" "82.0" "56.0" "67.0" "93.0" "95.0" "71.0"
## [10] "68.0" "80.0" "60.0" "78.0" "112.0" "75.0" "52.0" "83.0" "69.0"
## [19] "90.0" "45.0" "65.0" "59.0" "66.0" "74.0" "105.0" "73.0" "55.0"
## [28] "70.0" "72.0" "63.0" "50.0" "107.0" "84.0" "77.0" "79.0" "76.0"
## [37] "58.0" "115.0" "97.0" "53.0" "57.0" "49.0" "110.0" "94.0" "92.0"
## [46] "87.0" "103.0" "88.0" "99.0" "100.0" "61.0" "48.0" "54.0" "51.0"
## [55] "47.0" "91.0" "104.0" "81.0" "98.0" "108.0" "89.0" "101.0" "86.0"
## [64] "65.5" "142.0" "96.0" "44.0" "41.0" "169.0" "116.0" "200.0" "64.8"
## [73] "120.0" "117.0" "125.0" "106.0" "114.0" "113.0" "129.0" "124.0" "42.0"
## [82] "123.0" "111.0" "102.0" "109.0" "46.0" "139.0" "150.0" "138.0" "131.0"
## [91] "43.0" "83.6" "134.0" "106.5" "121.0" "40.0" "141.0" "119.0" "126.0"
## [100] "177.0" "165.0" "122.0" "62.3" "118.0" "89.5" "132.0" "155.0" "130.0"
## [109] "159.0" "84.6" "140.0" "82.3" "30.0" "67.9" "127.0" "67.5" "74.3"
## [118] "149.0" "180.0" "128.0" "60.2" "64.5" "58.7" "90.5" "37.0" "80.8"
## [127] "166.0" "161.0" "92.2" "75.6" "136.0" "82.9" "154.0" "68.9" "170.0"
## [136] "84.5" "135.0" "72.5" "146.0" "70.5" "74.2" "84.3" "38.0" "54.5"
## [145] "145.0" "68.5" "79.5" "72.8" "73.2" "96.5" "53.2" "39.0" "61.2"
## [154] "168.0" "133.0" "54.35" "50.7" "79.94" "45.8" "153.0" "34.0" "62.4"
## [163] "167.0" "156.0" "35.0" "70.3" "31.0" "60.5" "69.8" "64.7" "64.1"
## [172] "152.0" "59.5" "163.0" "75.5" "78.2" "162.0" "121.3" "178.0" "82.5"
## [181] "90.7" "80.6" "55.6" "59.8" "74.77" "32.0" "70.8" "57.4" "73.8"
## [190] "56.2" "69.5" "71.5" "89.1" "158.0" "76.5" "35.45" "86.5" "63.8"
## [199] "23.0" "144.0" "70.2" "68.4" "148.0" "164.0" "22.0" "66.5" "54.9"
## [208] "59.2" "65.3" "137.0" "62.2" "50.5" "36.0" "61.5" "80.5" "81.1"
## [217] "53.6" "11.0" "66.4" "76.7" "28.0" "68.3" "60.6" "76.8" "58.8"
## [226] "181.0" "75.2" "83.5" "64.3" "175.0" "84.9" "80.7" "84.7" "60.1"
## [235] "105.5" "85.5" "121.8" "51.5" "53.9" "42.2" "74.5" "114.6" "73.5"
## [244] "71.3" "33.0" "66.3" "53.3" "67.8" "70.7" "52.3" "82.4" "63.4"
## [253] "71.2" "62.5" "63.82" "88.5" "89.9" "72.1" "106.4" "160.0" "68.2"
## [262] "53.67" "171.0" "55.2" "94.5" "143.0" "62.7" "53.5" "109.5" "10.0"
## [271] "82.1" "57.8" "84.8" "21.0" "29.0" "183.0" "58.5" "57.6" "78.5"
## [280] "172.0" "109.7" "61.3" "67.3" "94.7" "147.0" "55.4" "99.9"
unique(df_test$ap_hi)
## [1] 110 140 130 150 100 120 145 170 135 125 90 180
## [13] 160 133 190 80 122 169 126 158 200 14 123 70
## [25] 161 147 115 137 153 11 148 105 220 119 141 165
## [37] 164 12 124 172 902 162 906 117 134 166 210 176
## [49] 116 10 121 16 112 159 113 118 155 142 131 157
## [61] 136 146 138 -100 909 109 85 106 129 93 7 95
## [73] 179 156 168 132 104 103 178 175 128 151 15 139
## [85] 11500 127 17 108 144 102 1 1420 13 143 701 107
## [97] 184 149 167 114 101 60 1500 181 171 202 -115 111
## [109] 907 20 188 185 163 173 154 177 -140 174 -120 14020
## [121] 1400 240 191 197 1620 152 96 199 -150 1130 193 99
## [133] 196 309 401 16020 1202 806 1300 230 207 215 97 1409
## [145] 11020 24 960 13010 1110 195 1205 187 2000
unique(df_test$ap_lo)
## [1] 80 90 70 100 60 85 89 110 65 63 79 1100
## [13] 1000 800 120 50 30 109 84 1033 150 91 40 73
## [25] 78 75 86 87 1001 82 95 69 74 97 81 1200
## [37] 83 119 0 93 105 10000 99 77 59 8044 140 92
## [49] 1044 108 125 115 68 61 106 102 94 66 52 170
## [61] 76 160 62 96 130 113 67 9100 10 88 902 8
## [73] 112 104 71 72 1008 98 2088 20 802 8000 1022 850
## [85] 708 57 101 9011 1011 64 1007 1177 7100 45 709 8500
## [97] 58 1110 8099 1088 126 1077 1120 7 103 1125 180 121
## [109] 8100 710 5700 8079 1111 1003 6 1900 809 114 801 1002
## [121] 53 111 1 118 56 182 810 9 7099 11000 9800 8200
## [133] 1139 107 820 55 1400 190 900 122 6800 135 700 15
## [145] 1101 910 1140 1211 -70 54 8077 901 880 870 585 49
## [157] 602
unique(df_test$cholesterol)
## [1] 1 3 2
unique(df_test$gluc)
## [1] 1 2 3
unique(df_test$smoke)
## [1] 0 1
unique(df_test$alco)
## [1] 0 1
unique(df_test$active)
## [1] 1 0
unique(df_test$cardio)
## [1] 0 1
unique(df_test$newage)
## NULL
1.2 Data Cleaning - Which section of the data do you need to tidy?
1.2.1 we normalize the age value from number of days into age value, we add a new columns named newage
df_test<-df
#df$newage<-floor(df$age/365)
df_test$newage<-floor(df$age/365)
#drop original age column and id column
df_test<-df_test[-c(1,2)]
head(df_test)
## gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
## 1 2 168 62.0 110 80 1 1 0 0 1 0
## 2 1 156 85.0 140 90 3 1 0 0 1 1
## 3 1 165 64.0 130 70 3 1 0 0 0 1
## 4 2 169 82.0 150 100 1 1 0 0 1 1
## 5 1 156 56.0 100 60 1 1 0 0 0 0
## 6 1 151 67.0 120 80 2 2 0 0 0 0
## newage
## 1 50
## 2 55
## 3 51
## 4 48
## 5 47
## 6 60
1.2.2 Change the Data type of weight from character to integer, so we can make a plot of it
#View original data type
class(df_test$weight)
## [1] "character"
#Change the data type
df_test$weight <- as.integer(df_test$weight)
#Check the modification result
class(df_test$weight)
## [1] "integer"
1.2.3 created a new column called BMI from height and weight BMI formula = weight(Kg) / height(M)^2
df_test$BMI<-round(df_test$weight/(df_test$height/100)^2, 2)
head(df_test)
## gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
## 1 2 168 62 110 80 1 1 0 0 1 0
## 2 1 156 85 140 90 3 1 0 0 1 1
## 3 1 165 64 130 70 3 1 0 0 0 1
## 4 2 169 82 150 100 1 1 0 0 1 1
## 5 1 156 56 100 60 1 1 0 0 0 0
## 6 1 151 67 120 80 2 2 0 0 0 0
## newage BMI
## 1 50 21.97
## 2 55 34.93
## 3 51 23.51
## 4 48 28.71
## 5 47 23.01
## 6 60 29.38
boxplot(df_test$BMI, col="pink")
summary(df_test$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.47 23.88 26.37 27.56 30.22 298.67
1.2.4 remove outliers in BMI. BMI range is from 10 to 60, the data reduced by 72, from 70000 now is 69928
df_test=subset(df_test, df_test$BMI>=10 & df_test$BMI<=60)
summary(df_test)
## gender height weight ap_hi
## Min. :1.00 Min. : 98.0 Min. : 28.00 Min. : -150.0
## 1st Qu.:1.00 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.: 120.0
## Median :1.00 Median :165.0 Median : 72.00 Median : 120.0
## Mean :1.35 Mean :164.4 Mean : 74.17 Mean : 128.8
## 3rd Qu.:2.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.: 140.0
## Max. :2.00 Max. :250.0 Max. :200.00 Max. :16020.0
## ap_lo cholesterol gluc smoke
## Min. : -70.00 Min. :1.000 Min. :1.000 Min. :0.00000
## 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000
## Median : 80.00 Median :1.000 Median :1.000 Median :0.00000
## Mean : 96.59 Mean :1.367 Mean :1.227 Mean :0.08813
## 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :11000.00 Max. :3.000 Max. :3.000 Max. :1.00000
## alco active cardio newage
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :29.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:48.00
## Median :0.00000 Median :1.0000 Median :0.0000 Median :53.00
## Mean :0.05376 Mean :0.8037 Mean :0.4997 Mean :52.84
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:58.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :64.00
## BMI
## Min. :10.73
## 1st Qu.:23.88
## Median :26.37
## Mean :27.48
## 3rd Qu.:30.12
## Max. :60.00
str(df_test)
## 'data.frame': 69928 obs. of 13 variables:
## $ gender : int 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : int 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: int 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : int 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
## $ newage : num 50 55 51 48 47 60 60 61 48 54 ...
## $ BMI : num 22 34.9 23.5 28.7 23 ...
1.2.5 remove outliers in newage column, we remove age <= 39 the data reduced by 4, from 69928 now is 69924
df_test=subset(df_test, df_test$newage >=39 )
summary(df_test)
## gender height weight ap_hi
## Min. :1.00 Min. : 98.0 Min. : 28.00 Min. : -150.0
## 1st Qu.:1.00 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.: 120.0
## Median :1.00 Median :165.0 Median : 72.00 Median : 120.0
## Mean :1.35 Mean :164.4 Mean : 74.17 Mean : 128.8
## 3rd Qu.:2.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.: 140.0
## Max. :2.00 Max. :250.0 Max. :200.00 Max. :16020.0
## ap_lo cholesterol gluc smoke
## Min. : -70.00 Min. :1.000 Min. :1.000 Min. :0.00000
## 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000
## Median : 80.00 Median :1.000 Median :1.000 Median :0.00000
## Mean : 96.59 Mean :1.367 Mean :1.227 Mean :0.08814
## 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :11000.00 Max. :3.000 Max. :3.000 Max. :1.00000
## alco active cardio newage
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :39.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:48.00
## Median :0.00000 Median :1.0000 Median :0.0000 Median :53.00
## Mean :0.05376 Mean :0.8037 Mean :0.4997 Mean :52.84
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:58.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :64.00
## BMI
## Min. :10.73
## 1st Qu.:23.88
## Median :26.37
## Mean :27.48
## 3rd Qu.:30.12
## Max. :60.00
str(df_test)
## 'data.frame': 69924 obs. of 13 variables:
## $ gender : int 2 1 1 2 1 1 1 2 1 1 ...
## $ height : int 168 156 165 169 156 151 157 178 158 164 ...
## $ weight : int 62 85 64 82 56 67 93 95 71 68 ...
## $ ap_hi : int 110 140 130 150 100 120 130 130 110 110 ...
## $ ap_lo : int 80 90 70 100 60 80 80 90 70 60 ...
## $ cholesterol: int 1 3 3 1 1 2 3 3 1 1 ...
## $ gluc : int 1 1 1 1 1 2 1 3 1 1 ...
## $ smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ alco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ active : int 1 1 0 1 0 0 1 1 1 0 ...
## $ cardio : int 0 1 1 1 0 0 0 1 0 0 ...
## $ newage : num 50 55 51 48 47 60 60 61 48 54 ...
## $ BMI : num 22 34.9 23.5 28.7 23 ...
From the above data, there are some redundant data. we attribute this result that some patients has the same weight or height. We need further check if there are any outliers by plotting boxplot.
#par(mfrow=c(4,3))
boxplot(df_test$newage, col="brown2")
boxplot(df_test$height, col="cyan2")
boxplot(df_test$weight, col="cyan3")
boxplot(df_test$ap_hi, col="cyan2")
boxplot(df_test$ap_lo, col="cyan3")
par(mfrow=c(1,2))
hist(df_test$height, freq = TRUE)
#title("Histogram of height")
lines(density.default(df_test$height), col = "blue")
x <- 0:280
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$height), sd(df_test$height)), col = "black")
hist(df_test$weight, freq = FALSE)
#title("Histogram of weight")
lines(density.default(df_test$weight), col = "blue")
x <- 0:250
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$weight), sd(df_test$weight)), col = "red")
hist(df_test$BMI, freq = FALSE)
#title("Histogram of BMI")
lines(density.default(df_test$BMI), col = "blue")
x <- 0:250
#The return value of the function dnorm() is a normal distribution density function
lines(x, dnorm(x, mean(df_test$BMI), sd(df_test$BMI)), col = "red")
#sp<-boxplot(df$height,boxwex=0.7)
#xi=1.1
#sd.s=sd(df[complete.cases(df$height),]$height)
#mn.s=mean(df[complete.cases(df),]$height)
#points(xi,mn.s,col="red",pch=18)
#arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", angle = 75, length = .1)
#text(rep(c(1.05,1.05,0.95,0.95),length=length(sp$out)),labels=sp$out[order(sp$out)],
#sp$out[order(sp$out)]+rep(c(150,-150,150,-150),length=length(sp$out)),col="red")
1.2.6 change column name of ap_hi to systolic_bp and ap_lo to diastolic_bp for better understanding of the data
names(df_test)[4:5] <-c("systolic_bp", "diastolic_bp")
summary(df_test)
## gender height weight systolic_bp
## Min. :1.00 Min. : 98.0 Min. : 28.00 Min. : -150.0
## 1st Qu.:1.00 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.: 120.0
## Median :1.00 Median :165.0 Median : 72.00 Median : 120.0
## Mean :1.35 Mean :164.4 Mean : 74.17 Mean : 128.8
## 3rd Qu.:2.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.: 140.0
## Max. :2.00 Max. :250.0 Max. :200.00 Max. :16020.0
## diastolic_bp cholesterol gluc smoke
## Min. : -70.00 Min. :1.000 Min. :1.000 Min. :0.00000
## 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000
## Median : 80.00 Median :1.000 Median :1.000 Median :0.00000
## Mean : 96.59 Mean :1.367 Mean :1.227 Mean :0.08814
## 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :11000.00 Max. :3.000 Max. :3.000 Max. :1.00000
## alco active cardio newage
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :39.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:48.00
## Median :0.00000 Median :1.0000 Median :0.0000 Median :53.00
## Mean :0.05376 Mean :0.8037 Mean :0.4997 Mean :52.84
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:58.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :64.00
## BMI
## Min. :10.73
## 1st Qu.:23.88
## Median :26.37
## Mean :27.48
## 3rd Qu.:30.12
## Max. :60.00
1.2.7 remove outliers from systolic_bp, range only 80<=systolic_bp<=220
#before removal
boxplot(df_test$systolic_bp)
#remove outliers
df_test=subset(df_test, df_test$systolic_bp>=80 & df_test$systolic_bp<=220)
summary(df_test)
## gender height weight systolic_bp
## Min. :1.00 Min. :100.0 Min. : 28.00 Min. : 80
## 1st Qu.:1.00 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.:120
## Median :1.00 Median :165.0 Median : 72.00 Median :120
## Mean :1.35 Mean :164.4 Mean : 74.18 Mean :127
## 3rd Qu.:2.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.:140
## Max. :2.00 Max. :250.0 Max. :200.00 Max. :220
## diastolic_bp cholesterol gluc smoke
## Min. : 0.00 Min. :1.000 Min. :1.000 Min. :0.00000
## 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000
## Median : 80.00 Median :1.000 Median :1.000 Median :0.00000
## Mean : 96.59 Mean :1.367 Mean :1.227 Mean :0.08827
## 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :11000.00 Max. :3.000 Max. :3.000 Max. :1.00000
## alco active cardio newage
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :39.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:48.00
## Median :0.00000 Median :1.0000 Median :0.0000 Median :53.00
## Mean :0.05374 Mean :0.8038 Mean :0.4997 Mean :52.84
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:58.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :64.00
## BMI
## Min. :10.73
## 1st Qu.:23.88
## Median :26.37
## Mean :27.48
## 3rd Qu.:30.18
## Max. :60.00
#boxplot(df_test$systolic_bp)
1.2.8 remove outliers in diastolic_bp, range from 40<=diastolic_bp<=160
#before remove outliers
boxplot(df_test$diastolic_bp)
#remove outliers
df_test=subset(df_test, df_test$diastolic_bp>=40 & df_test$diastolic_bp<=160)
boxplot(df_test$diastolic_bp, col="aquamarine")
summary(df_test$diastolic_bp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.00 80.00 80.00 81.36 90.00 160.00
1.2.9 change value of gender for male from 2 to 0,
df_test$gender[df_test$gender == 2] <- 0
summary(df_test)
## gender height weight systolic_bp
## Min. :0.0000 Min. :100.0 Min. : 28.00 Min. : 80.0
## 1st Qu.:0.0000 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.:120.0
## Median :1.0000 Median :165.0 Median : 72.00 Median :120.0
## Mean :0.6512 Mean :164.4 Mean : 74.09 Mean :126.6
## 3rd Qu.:1.0000 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.:140.0
## Max. :1.0000 Max. :250.0 Max. :200.00 Max. :220.0
## diastolic_bp cholesterol gluc smoke
## Min. : 40.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.: 80.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median : 80.00 Median :1.000 Median :1.000 Median :0.000
## Mean : 81.36 Mean :1.365 Mean :1.226 Mean :0.088
## 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :160.00 Max. :3.000 Max. :3.000 Max. :1.000
## alco active cardio newage
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :39.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:48.00
## Median :0.00000 Median :1.0000 Median :0.0000 Median :53.00
## Mean :0.05353 Mean :0.8034 Mean :0.4949 Mean :52.83
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:58.00
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :64.00
## BMI
## Min. :10.73
## 1st Qu.:23.88
## Median :26.35
## Mean :27.45
## 3rd Qu.:30.12
## Max. :60.00
summary of data cleaning: 1. normalize the age value 2. Change the Data type of weight from character to integer 3. created a new column called BMI 4. remove outliers in BMI 5. change column name ap_hi to systolic_bp, ap_lo to diastolic_bp 5. remove outliers in systolic_bp 6. remove outliers in diastolic_bp
1.3 create correlation matrix
library(reshape2)
df_cormat <- round(cor(df_test),2)
#head(df_cormat)
df_melted_cormat <- melt(df_cormat)
#head(df_melted_cormat)
#function to split upper and lower triangle part of the correlation matrix
get_lower_tri<-function(df_cormat){
df_cormat[upper.tri(df_cormat)] <- NA
return(df_cormat)
}
get_upper_tri <- function(df_cormat){
df_cormat[lower.tri(df_cormat)]<- NA
return(df_cormat)
}
upper_tri<-get_upper_tri(df_cormat)
lower_tri<-get_lower_tri(df_cormat)
df_melted_cormat <- melt(upper_tri, na.rm = TRUE)
library(ggplot2)
ggheatmap<-ggplot(data = df_melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "grey",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pairwise Correlation Matrix") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()+
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2)
ggheatmap
#par(mfrow=c(1,2))
#dotchart(df_test$height)
#pc=boxplot(df_test$height,horizontal=T)
Now creating prediction part
1.4 Classification using Decision Tree
1.4.1 Before any prediction process begin, we need to split data into train and test, we could use various types of splitting , for sake of simplicity, we create split ratio of 80/20 and its function named “create_train_test”
df_prediction<-df_test
create_train_test <- function(data, size = 0.8, train = TRUE) {
n_row = nrow(data)
total_row = size * n_row
train_sample <- 1: total_row
if (train == TRUE) {
return (data[train_sample, ])
} else {
return (data[-train_sample, ])
}
}
1.4.2 Create the train and test split using the function that we created
data_train <- create_train_test(df_prediction, 0.8, train = TRUE)
data_test <- create_train_test(df_prediction, 0.8, train = FALSE)
dim(data_train)
## [1] 54934 13
dim(data_test)
## [1] 13734 13
1.4.3 verify if the randomization process is correct.
prop.table(table(df_prediction$cardio))
##
## 0 1
## 0.5050533 0.4949467
prop.table(table(data_train$cardio))
##
## 0 1
## 0.5051698 0.4948302
prop.table(table(data_test$cardio))
##
## 0 1
## 0.5045872 0.4954128
Defining the argument in Rpart function rpart(formula, data= , method=’’) arguments:
formula: The function to predict data: Specifies the data frame- method:
“class” for a classification tree and “anova” for a regression tree
1.4.4 Run the model using rpart function (with systolic bp)
library(rpart)
library(rpart.plot)
ctrl=rpart.control(cp=0.001)
fit <- rpart(cardio~., data = data_train, method = 'class', control=ctrl)
rpart.plot(fit, extra = 106)
1.4.5 Make a prediction model with systolic bp and its confusion matrix predict(fitted_model, df, type = ‘class’) arguments: fitted_model: This is the object stored after model estimation. df: Data frame used to make the prediction type: Type of prediction
‘class’: for classification
‘prob’: to compute the probability of each class
‘vector’: Predict the mean response at the node level
predict_cardio <-predict(fit, data_test, type = 'class')
table_mat <- table(data_test$cardio, predict_cardio)
table_mat
## predict_cardio
## 0 1
## 0 5450 1480
## 1 2253 4551
1.4.6 Accuracy for the confusion matrix
accuracy_Test <- (sum(diag(table_mat)) / sum(table_mat))*100
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 72.8192806174458"
1.5 Classification using Naive Bayes (will use in-built package e1071)##
1.5.1 Train the model using in-built function naiveBayes
df_NB<-df_prediction
#Building a NB model
#maintain use same data split from classification decision tree
NBclassfier=naiveBayes(cardio~., data=data_train)
print(NBclassfier)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5051698 0.4948302
##
## Conditional probabilities:
## gender
## Y [,1] [,2]
## 0 0.6563727 0.4749270
## 1 0.6511055 0.4766293
##
## height
## Y [,1] [,2]
## 0 164.5223 7.818532
## 1 164.2804 8.012345
##
## weight
## Y [,1] [,2]
## 0 71.61897 13.19151
## 1 76.63860 14.67659
##
## systolic_bp
## Y [,1] [,2]
## 0 119.5563 12.59890
## 1 133.8482 17.29502
##
## diastolic_bp
## Y [,1] [,2]
## 0 78.15740 8.221568
## 1 84.64831 9.749330
##
## cholesterol
## Y [,1] [,2]
## 0 1.213650 0.5227188
## 1 1.515432 0.7755952
##
## gluc
## Y [,1] [,2]
## 0 1.176354 0.5110993
## 1 1.276791 0.6242184
##
## smoke
## Y [,1] [,2]
## 0 0.09322187 0.2907483
## 1 0.08218372 0.2746495
##
## alco
## Y [,1] [,2]
## 0 0.05578177 0.2295040
## 1 0.05061987 0.2192242
##
## active
## Y [,1] [,2]
## 0 0.8182047 0.3856826
## 1 0.7883604 0.4084782
##
## newage
## Y [,1] [,2]
## 0 51.22507 6.782410
## 1 54.45867 6.362268
##
## BMI
## Y [,1] [,2]
## 0 26.48614 4.781995
## 1 28.45079 5.437499
1.5.2 Prediction the model
y_pred <- predict(NBclassfier, newdata = data_test)
print(head(y_pred))
## [1] 0 0 0 0 1 1
## Levels: 0 1
1.5.3 Confusion matrix
cm <- table(data_test$cardio, y_pred)
cm
## y_pred
## 0 1
## 0 5557 1373
## 1 2672 4132
1.5.4 Accuracy for the confusion matrix
accuracy_Test <- (sum(diag(cm)) / sum(cm))*100
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 70.5475462356196"