We have two datasets for this group assignment, shown below:
Dataset 1 is the Cleveland heart disease dataset with 14 attributes: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and num.
Dataset 2 records the share of deaths from cardiovascular disease by country between 1990 and 2017 and has 4 attributes: Entity, Code, Year, and the death percentage.
# Import the dataset about death from heart disease between 1990 and 2017
df_country <- read.csv("C:/Users/AndrewSzl/Desktop/UM/WQD7001 PRINCIPLES OF DATA SCIENCE/DataSet_Group_Assignment/Descriptive modelling/share-deaths-heart-disease.csv")
# Import the Cleveland dataset
colNames <- c("age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","num")
df_clev <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col.names = colNames)
# Have a look at the first six rows of df_country
head(df_country)
## Entity Code Year
## 1 Afghanistan AFG 1990
## 2 Afghanistan AFG 1991
## 3 Afghanistan AFG 1992
## 4 Afghanistan AFG 1993
## 5 Afghanistan AFG 1994
## 6 Afghanistan AFG 1995
## Deaths...Cardiovascular.diseases...Sex..Both...Age..All.Ages..Percent.
## 1 23.70775
## 2 23.49031
## 3 23.14692
## 4 21.15421
## 5 19.75614
## 6 19.58233
From the first six rows, the data in df_country looks clean. Since the last column name is quite long, we will shorten it.
# Rename the last column
names(df_country)[4] <- "death_percent"
Then we move on to its summary information.
# Have a look at the summary information of df_country
summary(df_country)
## Entity Code Year death_percent
## Length:6468 Length:6468 Min. :1990 Min. : 1.366
## Class :character Class :character 1st Qu.:1997 1st Qu.:18.877
## Mode :character Mode :character Median :2004 Median :30.294
## Mean :2004 Mean :29.598
## 3rd Qu.:2010 3rd Qu.:38.185
## Max. :2017 Max. :66.692
From the summary information, we can see there are 6468 records covering the years 1990 to 2017, and the death rate ranges from 1.366% to 66.692% with a mean of 29.598%. Next we move on to its structure.
# Have a look at the structure
str(df_country)
## 'data.frame': 6468 obs. of 4 variables:
## $ Entity : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ death_percent: num 23.7 23.5 23.1 21.2 19.8 ...
To confirm there are no missing values in the categorical variables, we will take a look at their frequency tables; every subcategory should have 28 records (one per year from 1990 to 2017).
# Frequency tables for column Entity and Code
freq_table_Entity <- table(df_country$Entity)
freq_table_Code <- table(df_country$Code)
# To see whether there is any subcategory whose count doesn't equal 28
freq_table_Entity[freq_table_Entity != 28]
## named integer(0)
freq_table_Code[freq_table_Code != 28]
##
## 980
As you can see, every Entity appears exactly 28 times, but 980 records (35 entities) have an empty string as their Code, most likely regions or aggregates without an ISO code. Since they are not useful to our analysis and we already have the rates of all individual countries, we will drop them.
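Before dropping them, one could optionally check which entities these are; a quick sketch using base R (output not shown here):
# Optional check: which entities have no country code
unique(df_country$Entity[df_country$Code == ""])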
# Load the library dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Keep only the rows whose Entity and Code are not empty strings
df_country <- df_country %>%
filter(Entity != "" & Code != "")
Since the variables Entity and Code are categorical, we will convert them to factors.
df_country$Entity <- as.factor(df_country$Entity)
df_country$Code <- as.factor(df_country$Code)
Finally, we will check the missing values in each column.
colSums(is.na(df_country))
## Entity Code Year death_percent
## 0 0 0 0
As you can see, there is no missing value.
Similarly, we will first take a look at its first six rows.
# Have a look at the first six rows
head(df_clev)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
## 1 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
## 2 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
## 3 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
## 4 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
## 5 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
## 6 62 0 4 140 268 0 2 160 0 3.6 3 2.0 3.0 3
From the first six rows, nothing seems to be wrong, so we go ahead and look at its summary information.
summary(df_clev)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.41 Mean :0.6788 Mean :3.166 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.2
## Median :241.5 Median :0.0000 Median :0.5000 Median :153.0
## Mean :246.7 Mean :0.1457 Mean :0.9868 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :1.000 Length:302
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Median :0.0000 Median :0.800 Median :2.000 Mode :character
## Mean :0.3278 Mean :1.035 Mean :1.596
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
## Max. :1.0000 Max. :6.200 Max. :3.000
## thal num
## Length:302 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.9404
## 3rd Qu.:2.0000
## Max. :4.0000
It seems everything is normal, so we will move forward and look at the structure information.
# Have a look at its structure
str(df_clev)
## 'data.frame': 302 obs. of 14 variables:
## $ age : num 67 67 37 41 56 62 57 63 53 57 ...
## $ sex : num 1 1 1 0 1 0 0 1 1 1 ...
## $ cp : num 4 4 3 2 2 4 4 4 4 4 ...
## $ trestbps: num 160 120 130 130 120 140 120 130 140 140 ...
## $ chol : num 286 229 250 204 236 268 354 254 203 192 ...
## $ fbs : num 0 0 0 0 0 0 0 0 1 0 ...
## $ restecg : num 2 2 0 2 0 2 0 2 2 0 ...
## $ thalach : num 108 129 187 172 178 160 163 147 155 148 ...
## $ exang : num 1 1 0 0 0 0 1 0 1 0 ...
## $ oldpeak : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ slope : num 2 2 3 1 1 3 1 2 3 2 ...
## $ ca : chr "3.0" "2.0" "0.0" "0.0" ...
## $ thal : chr "3.0" "7.0" "3.0" "3.0" ...
## $ num : int 2 1 0 0 0 3 0 2 1 0 ...
It is evident that some variables have the wrong class. Before we correct them, we will look at the frequency tables of the categorical variables, since missing values could be denoted as "?", which would force their classes to character.
# Have a look at the frequency table for each categorical variable
table(df_clev$ca)
##
## ? 0.0 1.0 2.0 3.0
## 4 175 65 38 20
table(df_clev$thal)
##
## ? 3.0 6.0 7.0
## 2 166 17 117
You can see there are 4 and 2 "?" values for the variables ca and thal respectively. These denote missing values, so we will reassign "?" to NA.
# Assign "?" to NA
df_clev[df_clev == "?"] <- NA
Since there are 6 missing values in total, we will take a look at their percentage.
6 / nrow(df_clev)
## [1] 0.01986755
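Rather than hard-coding the count of 6, the same proportion can be computed directly; a one-line sketch that should give the same value:
# Proportion of rows that contain at least one NA
mean(!complete.cases(df_clev))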
As you can observe, the missing values account for only about 2% of the rows, a small percentage, so we will keep only the rows without missing values.
# Keep only the complete rows
df_clev <- df_clev[complete.cases(df_clev),]
Then we will convert the variables with wrong classes to the right ones.
# Convert variables to the correct classes
df_clev$sex <- factor(df_clev$sex, labels = c('Female','Male'))
df_clev$cp <- factor(df_clev$cp, labels = c('typical angina','atypical angina','non-anginal pain','asymptomatic'))
df_clev$trestbps <- as.integer(df_clev$trestbps)
df_clev$chol <- as.integer(df_clev$chol)
df_clev$fbs <- factor(df_clev$fbs, labels = c('<= 120','> 120'))
df_clev$restecg <- factor(df_clev$restecg, labels = c('normal','ST-T wave abnormality','LVH'))
df_clev$thalach <- as.integer(df_clev$thalach)
df_clev$exang <- factor(df_clev$exang, labels = c('No','Yes'))
df_clev$oldpeak <- as.numeric(df_clev$oldpeak)
df_clev$slope <- factor(df_clev$slope, labels = c('upsloping','flat','downsloping'))
df_clev$ca <- factor(df_clev$ca, labels = c("0 major vessels","1 major vessels","2 major vessels","3 major vessels"))
df_clev$thal <- factor(df_clev$thal, labels = c('normal','fixed defect','reversable defect'))
# Collapse num (0-4) into a binary outcome: 0 = healthy, 1-4 = heart disease
df_clev$num <- ifelse(df_clev$num == 0, 0, 1)
df_clev$num <- factor(df_clev$num, labels = c('healthy','heart disease'))
Finally, we will take a look at the missing values in each column.
colSums(is.na(df_clev))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal num
## 0 0 0 0 0 0
As you can observe, there are no missing values, so we will move on to EDA.
# Have a look at the summary information of df_country
summary(df_country)
## Entity Code Year death_percent
## Afghanistan : 28 AFG : 28 Min. :1990 Min. : 1.366
## Albania : 28 AGO : 28 1st Qu.:1997 1st Qu.:18.216
## Algeria : 28 ALB : 28 Median :2004 Median :30.254
## American Samoa: 28 AND : 28 Mean :2004 Mean :29.506
## Andorra : 28 ARE : 28 3rd Qu.:2010 3rd Qu.:38.118
## Angola : 28 ARG : 28 Max. :2017 Max. :66.692
## (Other) :5320 (Other):5320
From the summary information, we can see there are 5488 records in total (after dropping the rows without a country code) covering the years 1990 to 2017. The death percentage ranges from 1.366% to 66.692% with a mean of 29.506%. You might also want to know which country and which year have the highest and lowest mean death rates from heart disease.
# Load the library dplyr
library(dplyr)
# Mean death percent of each country
df_country_mean_death_rate <- df_country %>%
group_by(Entity) %>%
summarise(mean_death_rate = mean(death_percent)) %>%
arrange(desc(mean_death_rate)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
# Have a look at the highest and lowest average death rate of heart disease by country
df_country_mean_death_rate[c(1, nrow(df_country_mean_death_rate)), ]
## # A tibble: 2 x 2
## Entity mean_death_rate
## <fct> <dbl>
## 1 Georgia 62.5
## 2 Niger 5.75
# Mean death rate of all countries for each year
mean_death_percent_every_year <- df_country %>%
group_by(Year) %>%
summarise(mean_death_rate_per_year = mean(death_percent)) %>%
arrange(desc(mean_death_rate_per_year)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
# Have a look at with the highest and lowest mean death rate of heart disease by year
mean_death_percent_every_year[c(1, nrow(mean_death_percent_every_year)),]
## # A tibble: 2 x 2
## Year mean_death_rate_per_year
## <int> <dbl>
## 1 2017 30.6
## 2 1990 29.1
From the above information, we can see that Georgia and Niger have the highest and lowest average death rates from heart disease respectively, while the highest and lowest yearly mean death rates occur in 2017 and 1990 respectively.
# Have a look at the summary information of df_clev
summary(df_clev)
## age sex cp trestbps
## Min. :29.00 Female: 96 typical angina : 22 Min. : 94.0
## 1st Qu.:48.00 Male :200 atypical angina : 49 1st Qu.:120.0
## Median :56.00 non-anginal pain: 83 Median :130.0
## Mean :54.51 asymptomatic :142 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 <= 120:254 normal :147 Min. : 71.0
## 1st Qu.:211.0 > 120 : 42 ST-T wave abnormality: 4 1st Qu.:133.0
## Median :243.0 LVH :145 Median :153.0
## Mean :247.4 Mean :149.6
## 3rd Qu.:276.2 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
## exang oldpeak slope ca
## No :199 Min. :0.000 upsloping :139 0 major vessels:173
## Yes: 97 1st Qu.:0.000 flat :137 1 major vessels: 65
## Median :0.800 downsloping: 20 2 major vessels: 38
## Mean :1.051 3 major vessels: 20
## 3rd Qu.:1.600
## Max. :6.200
## thal num
## normal :164 healthy :159
## fixed defect : 17 heart disease:137
## reversable defect:115
##
##
##
There is a lot of information we can extract here. For instance, across the 296 remaining records, age ranges from 29 to 77 with a mean of 54.51.
Since there are hundreds of countries, it is impractical to plot them all. Instead, it is better to build an interactive app so that users can select the countries they are interested in and gain insights (see the sketch after the line plot below). Here we give some examples with countries of interest such as China, Malaysia, and India.
# Load the library ggplot2 and dplyr
library(ggplot2)
library(dplyr)
df_country %>%
filter(Entity %in% c("China", "Malaysia", "India")) %>%
ggplot(aes(Year, death_percent, color = Entity)) + geom_line()
From the line plot, you can see the heart disease death rate of Malaysia is decreasing, while China and India show the opposite trend.
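As a minimal sketch of the interactive app mentioned above (assuming the shiny package is available and df_country has already been cleaned; the input and output names here are illustrative, not part of our final app):
library(shiny)
library(dplyr)
library(ggplot2)

ui <- fluidPage(
  # Let the user pick one or more countries of interest
  selectInput("countries", "Select countries:",
              choices = levels(df_country$Entity),
              selected = c("China", "Malaysia", "India"),
              multiple = TRUE),
  plotOutput("trend")
)

server <- function(input, output) {
  output$trend <- renderPlot({
    # Redraw the line plot for whichever countries are selected
    df_country %>%
      filter(Entity %in% input$countries) %>%
      ggplot(aes(Year, death_percent, color = Entity)) +
      geom_line()
  })
}

# shinyApp(ui, server)  # uncomment to launch the app interactively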
par(mfrow = c(2,3))
with(df_clev, {
boxplot(age, ylab = "Age", main = "Boxplot of age")
boxplot(trestbps, ylab = "trestbps", main = "Boxplot of trestbps")
boxplot(chol, ylab = "chol", main = "Boxplot of chol")
boxplot(thalach, ylab = "chol", main = "Boxplot of chol")
boxplot(oldpeak, ylab = "oldpeak", main = "Boxplot of oldpeak")
}
)
Although a small proportion of points are flagged as outliers, we will keep them because, after checking various sources, they are plausible values.
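For reference, the number of points flagged by the conventional 1.5 × IQR boxplot rule can be counted directly; a quick sketch:
# Count the boxplot-style outliers for each continuous variable
sapply(df_clev[, c("age", "trestbps", "chol", "thalach", "oldpeak")],
       function(x) length(boxplot.stats(x)$out))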
Since the factors have level labels of varying length, we will group the categorical variables with short level labels together for better presentation.
par(mfrow = c(2,3))
with(df_clev, {
plot(sex, xlab = "sex", ylab = "frequency", main = "Barplot of sex")
plot(fbs, xlab = "fbs", ylab = "frequency", main = "Barplot of fbs")
plot(exang, xlab = "exangs", ylab = "frequency", main = "Barplot of exang")
plot(slope, xlab = "slope", ylab = "frequency",main = "Barplot of slope")
plot(num, xlab = "num", ylab = "frequency", main = "Barplot of num")
}
)
From the 5 barplots, we can draw some conclusions below:
# 6. Barplot of cp
barplot(table(df_clev$cp), xlab = "cp", ylab = "frequency", main = "Barplot of cp")
# 7. Barplot of restecg
barplot(table(df_clev$restecg), xlab = "restecg", ylab = "frequency", main = "Barplot of restecg")
# 8. Barplot of ca
barplot(table(df_clev$ca), xlab = "ca", ylab = "frequency", main = "Barplot of ca")
# 9. Barplot of thal
barplot(table(df_clev$thal), xlab = "thal", ylab = "frequency", main = "Barplot of thal")
Based on the 4 barplots, we can see:
# 1. age by num
# Make a boxplot
ggplot(df_clev, aes(num, age)) + geom_boxplot() + ggtitle("age by num") + theme(plot.title = element_text(hjust = 0.5))
As you can see, the average age of heart disease patients is higher than that of healthy people, while the age of healthy people has a larger variance.
# 2. trestbps by num
ggplot(df_clev, aes(num, trestbps)) + geom_boxplot() + ggtitle("trestbps by num") + theme(plot.title = element_text(hjust = 0.5))
From the boxplot, we can see that trestbps for heart disease patients has a larger variance.
# 3. chol by num
ggplot(df_clev, aes(num, chol)) + geom_boxplot() + ggtitle("chol by num") + theme(plot.title = element_text(hjust = 0.5))
Based on this boxplot, both the average and the variance of chol for heart disease patients are larger compared with healthy people.
# 4. thalach by num
ggplot(df_clev, aes(num, thalach)) + geom_boxplot() + ggtitle("thalach by num") + theme(plot.title = element_text(hjust = 0.5))
It is clear that the average and variance of thalach for normal people are higher compared with heart disease patients.
# 5. oldpeak by num
ggplot(df_clev, aes(num, oldpeak)) + geom_boxplot() + ggtitle("oldpeak by num") + theme(plot.title = element_text(hjust = 0.5))
According to the boxplot, heart disease patients have a higher average oldpeak and a larger variance compared with healthy people.
# 1. sex by num
ggplot(df_clev, aes(sex, fill = num)) + geom_bar(position = "fill") + ggtitle("sex by num") + theme(plot.title = element_text(hjust = 0.5))
It is noticeable that males have a higher proportion of heart disease patients compared with females.
# 2. cp by num
ggplot(df_clev, aes(cp, fill = num)) + geom_bar(position = "dodge") + ggtitle("cp by num") + theme(plot.title = element_text(hjust = 0.5))
Clearly, the typical angina, atypical angina, and non-anginal pain groups all have a smaller proportion of heart disease patients, while the asymptomatic group has a higher proportion of heart disease patients.
# 3. fbs vs num
ggplot(df_clev, aes(fbs, fill = num)) + geom_bar(position = "fill") + ggtitle("fbs by num") + theme(plot.title = element_text(hjust = 0.5))
From the bar chart, we can see that the percentages of healthy and heart disease people are almost the same in each fbs group.
# 4. restecg vs num
ggplot(df_clev, aes(restecg, fill = num)) + geom_bar(position = "dodge") + ggtitle("restecg by num") + theme(plot.title = element_text(hjust = 0.5))
Evidently, all categories apart from the normal group have a higher percentage of heart disease patients than healthy people.
# 5. exang vs num
ggplot(df_clev, aes(exang, fill = num)) + geom_bar(position = "fill") + ggtitle("exang by num") + theme(plot.title = element_text(hjust = 0.5))
From the plot, we can observe far more heart disease patients in the Yes group (exercise-induced angina). In contrast, healthy people have a higher percentage in the No group (no exercise-induced angina).
# 6. slope vs num
ggplot(df_clev, aes(slope, fill = num)) + geom_bar(position = "fill") + ggtitle("slope by num") + theme(plot.title = element_text(hjust = 0.5))
The flat and downsloping groups (slope of the peak exercise ST segment) show higher percentages of heart disease patients, while the upsloping group shows the opposite pattern.
# 7. ca vs num
ggplot(df_clev, aes(ca, fill = num)) + geom_bar(position = "dodge") + ggtitle("ca by num") + theme(plot.title = element_text(hjust = 0.5))
The groups with 1, 2, and 3 major vessels colored by fluoroscopy have a higher proportion of heart disease patients than healthy people, whereas the group with 0 major vessels colored by fluoroscopy shows the opposite.
# 8. thal vs num
ggplot(df_clev, aes(thal, fill = num)) + geom_bar(position = "fill") + ggtitle("thal by num") + theme(plot.title = element_text(hjust = 0.5))
Clearly, the fixed defect and reversable defect groups contain far more heart disease patients, while the normal group shows the opposite pattern.
# 1.age vs trestbps grouped by num
ggplot(df_clev, aes(age,trestbps, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 2.age vs chol grouped by num
ggplot(df_clev, aes(age,chol, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 3.age vs thalach grouped by num
ggplot(df_clev, aes(age,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 4.age vs oldpeak grouped by num
ggplot(df_clev, aes(age,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 5.trestbps vs chol grouped by num
ggplot(df_clev, aes(trestbps,chol, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 6.trestbps vs thalach grouped by num
ggplot(df_clev, aes(trestbps,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 7.trestbps vs oldpeak grouped by num
ggplot(df_clev, aes(trestbps,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 8.chol vs thalach grouped by num
ggplot(df_clev, aes(chol,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 9.chol vs oldpeak grouped by num
ggplot(df_clev, aes(chol,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# 10.thalach, oldpeak grouped by num
ggplot(df_clev, aes(thalach,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the 10 plots above, you can see that plots 6 and 8 show opposite trends between the two facets, while the other plots present almost the same trends.
After gaining an intuitive insight into the data, you may want to see the correlations between variables.
# Load the library PerformanceAnalytics
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
# Get the correlation matrix chart
df_clev %>%
select(age, trestbps, chol, thalach, oldpeak) %>%
chart.Correlation(histogram=TRUE, pch=19)
It is clear that there are correlations among the continuous variables, but none of them are strong. Thus there is no reason to delete any variables, and we will use all of them for our predictive model.
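The same pairwise correlations can also be read off numerically; a short sketch using base cor():
# Pearson correlation matrix of the continuous variables, rounded to 2 decimals
round(cor(df_clev[, c("age", "trestbps", "chol", "thalach", "oldpeak")]), 2)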
In this part, we will compare several models using 10-fold cross validation and select the best one for df_clev. Finally, we will test its performance on the testing dataset to check whether it is overfitting.
# Load the library caret
library(caret)
## Loading required package: lattice
# Split the dataset into 80% training and 20% testing
# Set the random seed to ensure we get the same training and testing datasets every time we run the code
split <- 0.8
set.seed(998)
train_index <- createDataPartition(df_clev$num, p=split, list = F)
data_train <- df_clev[train_index, ]
data_test <- df_clev[-train_index, ]
# 10 fold cross validation
train_control <- trainControl(method="cv", number=10)
# 1. knn
# Set the same random seed for each algorithm so that the same 10-fold cross-validation splits are used
set.seed(2)
model_knn <- train(num~., data=data_train, trControl=train_control, method="knn", metric="Accuracy")
print(model_knn)
## k-Nearest Neighbors
##
## 238 samples
## 13 predictor
## 2 classes: 'healthy', 'heart disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6760870 0.3458851
## 7 0.6762681 0.3451615
## 9 0.6932971 0.3812592
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# 2. logistics regression
set.seed(2)
model_glm <- train(num~., data=data_train, trControl=train_control, method="glm", family = 'binomial', metric="Accuracy")
print(model_glm)
## Generalized Linear Model
##
## 238 samples
## 13 predictor
## 2 classes: 'healthy', 'heart disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8322464 0.6611789
# 3. Random forest
set.seed(2)
model_rf <- train(num~., data=data_train, trControl=train_control, method="rf", metric="Accuracy")
print(model_rf)
## Random Forest
##
## 238 samples
## 13 predictor
## 2 classes: 'healthy', 'heart disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8192029 0.6339159
## 11 0.7985507 0.5947672
## 20 0.7896739 0.5766230
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
As you can see, the logistic regression model performs best with about 83% mean accuracy, so we choose it as our final model. To confirm whether it is overfitting, we will validate it on the testing dataset.
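As an aside, since all three models were trained with the same seed and the same folds, caret's resamples() can also summarise their cross-validation results side by side; a brief sketch:
# Collect and compare the cross-validation results of the three candidate models
results <- resamples(list(knn = model_knn, glm = model_glm, rf = model_rf))
summary(results)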
# Make predictions
x_test <- data_test[, 1:(ncol(data_test) - 1)]
y_test <- data_test[, ncol(data_test)]
prediction <- predict(model_glm, x_test)
# Have a look at the confusion metrics
confusionMatrix(prediction, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy heart disease
## healthy 25 6
## heart disease 6 21
##
## Accuracy : 0.7931
## 95% CI : (0.6665, 0.8883)
## No Information Rate : 0.5345
## P-Value [Acc > NIR] : 3.93e-05
##
## Kappa : 0.5842
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8065
## Specificity : 0.7778
## Pos Pred Value : 0.8065
## Neg Pred Value : 0.7778
## Prevalence : 0.5345
## Detection Rate : 0.4310
## Detection Prevalence : 0.5345
## Balanced Accuracy : 0.7921
##
## 'Positive' Class : healthy
##
The accuracy on the testing dataset is about 79%, close to the 83% obtained in cross validation, so we can conclude that the model generalises well and is not overfitting.
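If desired, the fitted logistic regression can also be inspected to see which predictors carry the most weight; a sketch using the final model object stored by caret:
# Coefficients of the final logistic regression model selected by caret
summary(model_glm$finalModel)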