1. Student information

2. Asking good questions

3. Data collection

We have two datsets for this group assignment, shown below:

For dataset 1, it has 14 attributes:

There are 4 attributes for dataset 2:

4. Data preparation

4.1 import the datasets

# Import the dataset about death from heart disease between 1990 and 2017
df_country <- read.csv("C:/Users/AndrewSzl/Desktop/UM/WQD7001 PRINCIPLES OF DATA SCIENCE/DataSet_Group_Assignment/Descriptive modelling/share-deaths-heart-disease.csv")
# Import the Cleveland dataset
colNames <- c("age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","num")
df_clev <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col.names = colNames)

4.2 Data preparation for df_country

# Have a look at the first six rows of df_country
head(df_country)
##        Entity Code Year
## 1 Afghanistan  AFG 1990
## 2 Afghanistan  AFG 1991
## 3 Afghanistan  AFG 1992
## 4 Afghanistan  AFG 1993
## 5 Afghanistan  AFG 1994
## 6 Afghanistan  AFG 1995
##   Deaths...Cardiovascular.diseases...Sex..Both...Age..All.Ages..Percent.
## 1                                                               23.70775
## 2                                                               23.49031
## 3                                                               23.14692
## 4                                                               21.15421
## 5                                                               19.75614
## 6                                                               19.58233

From the first six rows, we could see the data of df_country is clean. Also since the last column name is pretty long and thus we will shorten it.

# Rename the last column
names(df_country)[4] <- "death_percent"

Then we go ahead to observe its summary information.

# Have a look at the summary information of df_country
summary(df_country)
##     Entity              Code                Year      death_percent   
##  Length:6468        Length:6468        Min.   :1990   Min.   : 1.366  
##  Class :character   Class :character   1st Qu.:1997   1st Qu.:18.877  
##  Mode  :character   Mode  :character   Median :2004   Median :30.294  
##                                        Mean   :2004   Mean   :29.598  
##                                        3rd Qu.:2010   3rd Qu.:38.185  
##                                        Max.   :2017   Max.   :66.692

From the summary information, we could see there are 6468 records from year 1990 to 2017 while the death rate ranges from 1.366% to 66.692% with 29.598% as its mean. Then we move forwards to have a look at its structure.

# Have a look at the structure
str(df_country)
## 'data.frame':    6468 obs. of  4 variables:
##  $ Entity       : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Code         : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ Year         : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ death_percent: num  23.7 23.5 23.1 21.2 19.8 ...

To confirm there no missing values in the categorical variables, we will have a look at at their frequency table and every subcategory should have 28 records

# Frequency tables for column Entity and Code
freq_table_Entity <- table(df_country $Entity)
freq_table_Code <- table(df_country $Code)
# To see whetehr there is any subcategory whose counts don't equal 28
freq_table_Entity[freq_table_Entity != 28]
## named integer(0)
freq_table_Code[freq_table_Code != 28]
##     
## 980

As you can see there are 980 records for both columns whose values is "". Since there are useless to our analysis and we have the percent rates of all countries, we will drop them.

# Load the library dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Filter the rows whose Entity and Code don't equal to  " "
df_country <- df_country %>%
  filter(Entity != "" & Code != "")

Since the variables Entity and Code are factors, we will convert them into the right classes.

df_country $Entity <- as.factor(df_country $Entity)
df_country $Code <- as.factor(df_country $Code)

Finally we will check its missing values for each column

colSums(is.na(df_country))
##        Entity          Code          Year death_percent 
##             0             0             0             0

As you can see, there is no missing value.

4.3 Data preparation for df_clev

Similarly, we will firstly have a look at its first six rows

# Have a look at the first six row
head(df_clev)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal num
## 1  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0   2
## 2  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0   1
## 3  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0   0
## 4  41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0   0
## 5  56   1  2      120  236   0       0     178     0     0.8     1 0.0  3.0   0
## 6  62   0  4      140  268   0       2     160     0     3.6     3 2.0  3.0   3

From the first six rows, it seems that there is nothing wrong. Thus we go ahead to have a look at its summary information

summary(df_clev)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.41   Mean   :0.6788   Mean   :3.166   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :241.5   Median :0.0000   Median :0.5000   Median :153.0  
##  Mean   :246.7   Mean   :0.1457   Mean   :0.9868   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope            ca           
##  Min.   :0.0000   Min.   :0.000   Min.   :1.000   Length:302        
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   Class :character  
##  Median :0.0000   Median :0.800   Median :2.000   Mode  :character  
##  Mean   :0.3278   Mean   :1.035   Mean   :1.596                     
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000                     
##  Max.   :1.0000   Max.   :6.200   Max.   :3.000                     
##      thal                num        
##  Length:302         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.9404  
##                     3rd Qu.:2.0000  
##                     Max.   :4.0000

It seems everything is normal and thus we will move forwards to have a look at the structure information.

# Have a look at the its structure
str(df_clev)
## 'data.frame':    302 obs. of  14 variables:
##  $ age     : num  67 67 37 41 56 62 57 63 53 57 ...
##  $ sex     : num  1 1 1 0 1 0 0 1 1 1 ...
##  $ cp      : num  4 4 3 2 2 4 4 4 4 4 ...
##  $ trestbps: num  160 120 130 130 120 140 120 130 140 140 ...
##  $ chol    : num  286 229 250 204 236 268 354 254 203 192 ...
##  $ fbs     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : num  2 2 0 2 0 2 0 2 2 0 ...
##  $ thalach : num  108 129 187 172 178 160 163 147 155 148 ...
##  $ exang   : num  1 1 0 0 0 0 1 0 1 0 ...
##  $ oldpeak : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ slope   : num  2 2 3 1 1 3 1 2 3 2 ...
##  $ ca      : chr  "3.0" "2.0" "0.0" "0.0" ...
##  $ thal    : chr  "3.0" "7.0" "3.0" "3.0" ...
##  $ num     : int  2 1 0 0 0 3 0 2 1 0 ...

It is evident that there are errors with the class of variables. Before we correct them, we’d have a look at the frequency table of categorical variables since missing values could be denoted as “?” and thus their classes will be character.

# Have a look at the frequency table for each categorical variable
table(df_clev $ ca)
## 
##   ? 0.0 1.0 2.0 3.0 
##   4 175  65  38  20
table(df_clev $ thal)
## 
##   ? 3.0 6.0 7.0 
##   2 166  17 117

You could see there are 4 and 2 “?” for the variable ca and thal respectively. It means missing values and we will reassign “?” to NA

# Assign "?" to NA
df_clev[df_clev == "?"] <- NA

Since there are totaly 6 missing values, we will have at its percentage

6 / nrow(df_clev)
## [1] 0.01986755

As you could observe, the missing values only account for about 2%, pretty small percentage and thus we will select the rows without missing values.

# Filter the missing rows
df_clev <- df_clev[complete.cases(df_clev),]

Then we will convert wrong classes of variables of the right ones.

# Convert variables classes
df_clev $ sex <- factor(df_clev$sex, labels=c('Female','Male'))
df_clev $ cp <- factor(df_clev $cp, labels=c('typical angina','atypical angina','non-anginal pain','asymptomatic'))
df_clev $ trestbps <- as.integer(df_clev $trestbps)
df_clev $ chol <- as.integer(df_clev $chol)
df_clev $ fbs <-factor(df_clev$fbs,labels=c('<= 120','> 120'))
df_clev $ restecg <- factor(df_clev$restecg, labels=c('normal','ST-T wave abnormality','LVH'))
df_clev $ thalach <- as.integer(df_clev $thalach)
df_clev $ exang <- factor(df_clev $exang, labels=c('No','Yes'))
df_clev $ oldpeak <- as.numeric(df_clev $oldpeak)
df_clev $ slope <- factor(df_clev $slope, labels=c('upsloping','flat','downsloping'))
df_clev $ ca <- factor(df_clev$ca, labels = c("0 major vessels","1 major vessels", "2 major vessels", "3 major vessels"))
df_clev $ thal <- factor(df_clev$thal, labels=c('normal','fixed defect','reversable defect'))
df_clev $ num <- ifelse(df_clev $ num == 0, 0, 1 )
df_clev $ num <- factor(df_clev$num, labels=c('healthy','heart disease'))

Finally we will have a look at the missing values for each column

colSums(is.na(df_clev))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal      num 
##        0        0        0        0        0        0

As you could observe, there is no missing value and we will go ahead for EDA.

5.EDA

5.1 Descriptive analysis

5.1.1 Summary information of df_country

# Have a look at the summary information of df_country
summary(df_country)
##             Entity          Code           Year      death_percent   
##  Afghanistan   :  28   AFG    :  28   Min.   :1990   Min.   : 1.366  
##  Albania       :  28   AGO    :  28   1st Qu.:1997   1st Qu.:18.216  
##  Algeria       :  28   ALB    :  28   Median :2004   Median :30.254  
##  American Samoa:  28   AND    :  28   Mean   :2004   Mean   :29.506  
##  Andorra       :  28   ARE    :  28   3rd Qu.:2010   3rd Qu.:38.118  
##  Angola        :  28   ARG    :  28   Max.   :2017   Max.   :66.692  
##  (Other)       :5320   (Other):5320

From the summary information, we could know there are totally 6468 records between year 1990 and 2017. Also, the death percent ranges from 1.366% to 66.692% with 29.598% as the mean. Also, you might want to know which country, which year has the highest and lowest mean death rate of heart disease.

# Load the library dplyr
library(dplyr)
# Mean death percent of each country
df_country_mean_death_rate <- df_country %>%
  group_by(Entity) %>%
  summarise(mean_death_rate = mean(death_percent)) %>%
  arrange(desc(mean_death_rate)) %>%
  ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
# Have a look at the highest and lowest average death rate of heart disease by country
df_country_mean_death_rate[c(1, nrow(df_country_mean_death_rate)), ]
## # A tibble: 2 x 2
##   Entity  mean_death_rate
##   <fct>             <dbl>
## 1 Georgia           62.5 
## 2 Niger              5.75
# Mean death rate of all countries for each year
mean_death_percent_every_year <- df_country %>%
  group_by(Year) %>%
  summarise(mean_death_rate_per_year = mean(death_percent)) %>%
  arrange(desc(mean_death_rate_per_year)) %>%
  ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
# Have a look at  with the highest and lowest mean death rate of heart disease by year
mean_death_percent_every_year[c(1, nrow(mean_death_percent_every_year)),]
## # A tibble: 2 x 2
##    Year mean_death_rate_per_year
##   <int>                    <dbl>
## 1  2017                     30.6
## 2  1990                     29.1

From the above information, we could sense Georgia and Niger has the highest and lowest average death rate of heart disease respectively while we have the highest mean death rate in 2017 and 1990 separately.

5.1.2 Summary information of df_clev

# Have a look at the summary information of df_clev
summary(df_clev)
##       age            sex                     cp         trestbps    
##  Min.   :29.00   Female: 96   typical angina  : 22   Min.   : 94.0  
##  1st Qu.:48.00   Male  :200   atypical angina : 49   1st Qu.:120.0  
##  Median :56.00                non-anginal pain: 83   Median :130.0  
##  Mean   :54.51                asymptomatic    :142   Mean   :131.6  
##  3rd Qu.:61.00                                       3rd Qu.:140.0  
##  Max.   :77.00                                       Max.   :200.0  
##       chol           fbs                       restecg       thalach     
##  Min.   :126.0   <= 120:254   normal               :147   Min.   : 71.0  
##  1st Qu.:211.0   > 120 : 42   ST-T wave abnormality:  4   1st Qu.:133.0  
##  Median :243.0                LVH                  :145   Median :153.0  
##  Mean   :247.4                                            Mean   :149.6  
##  3rd Qu.:276.2                                            3rd Qu.:166.0  
##  Max.   :564.0                                            Max.   :202.0  
##  exang        oldpeak              slope                   ca     
##  No :199   Min.   :0.000   upsloping  :139   0 major vessels:173  
##  Yes: 97   1st Qu.:0.000   flat       :137   1 major vessels: 65  
##            Median :0.800   downsloping: 20   2 major vessels: 38  
##            Mean   :1.051                     3 major vessels: 20  
##            3rd Qu.:1.600                                          
##            Max.   :6.200                                          
##                 thal                num     
##  normal           :164   healthy      :159  
##  fixed defect     : 17   heart disease:137  
##  reversable defect:115                      
##                                             
##                                             
## 

There are a lot of information we could get. For instance, the age is from 29 to 77 with 54.51 as the mean value in 296 records.

5.2 Plotting

5.2.1 Plot df_country

Since there are hundred of countries, it might be impossible to plot all countries. Instead, it is better build an interactive app so that the users could select their interested countries and gain some insights. Here we will give some examples of our interested countries like China, Malaysia, and India.

# Load the library ggplot2 and dplyr
library(ggplot2)
library(dplyr)
df_country %>%
  filter(Entity %in% c("China", "Malaysia", "India")) %>%
  ggplot(aes(Year, death_percent, color = Entity)) + geom_line()

From the line plot, you could see the death rate of heart of Malaysia is decreasing while China and India shows an opposite trend.

5.2.2 Plot df_clev

5.2.2.1 One numerical variable
par(mfrow = c(2,3))
with(df_clev, {
boxplot(age, ylab = "Age", main = "Boxplot of age")
boxplot(trestbps, ylab = "trestbps", main = "Boxplot of trestbps")
boxplot(chol, ylab = "chol", main = "Boxplot of chol")
boxplot(thalach, ylab = "chol", main = "Boxplot of chol")
boxplot(oldpeak, ylab = "oldpeak", main = "Boxplot of oldpeak")
}
)

Although there are small proportion of outlier, we will keep them because they are normal values after confirming from various source information.

5.2.2.2 One categorical variable

Considering various length of levels of each factor, we will put short length of levels of each categorical variable together for better presentation

par(mfrow = c(2,3))
with(df_clev, {
plot(sex, xlab = "sex", ylab = "frequency", main = "Barplot of sex")
plot(fbs, xlab = "fbs", ylab = "frequency", main = "Barplot of fbs", )
plot(exang, xlab = "exangs", ylab = "frequency", main = "Barplot of exang")
plot(slope, xlab = "slope", ylab = "frequency",main = "Barplot of slope")
plot(num, xlab = "num", ylab = "frequency", main = "Barplot of num")
}
)

From the 5 barplot, we could draw some conclusions below:

  • More male, fbs, exangs(No) records
  • Far less downsloping records while upsloping and flat are almost the same
  • Almost balanced groups between heart disease and normal people
# 6. Barplot of cp
barplot(table(df_clev $cp),  xlab = "cp", ylab = "frequency", main = "Barplot of cp")

# 7. Barplot of restecg
barplot(table(df_clev $restecg),  xlab = "restecg", ylab = "frequency", main = "Barplot of restecg")

# 8. Barplot of ca
barplot(table(df_clev $ca),  xlab = "ca", ylab = "frequency", main = "Barplot of ca")

# 9. Barplot of thal
barplot(table(df_clev $thal),  xlab = "thal", ylab = "frequency", main = "Barplot of thal")

Based on the 4 barplot, we could see:

  • There are increasing records from typical angina to asymptomatic for the variable cp while it presents opposite patterns for the variable ca, ranging from 0 major vessels to 3 major vessels.
  • From normal to ST-T wave abnormality and LVH, we have far less ST-T wave abnormality but almost the same records between normal and LVH for the variable restecg. Obviously it presents the same patterns for variable thal from normal to fixed defect and finally reversable defect.
5.2.2.3 One numerical variable vs response variable
# 1. age by num
# Make a boxplot
ggplot(df_clev, aes(num, age)) + geom_boxplot() + ggtitle("age by num") + theme(plot.title = element_text(hjust = 0.5))

As you could sense, the average age for heart disease patient is higher than than normal people while age of normal people have a larger variance.

# 2. trestbps by num
ggplot(df_clev, aes(num, trestbps)) + geom_boxplot() +  ggtitle("trestbps by num") + theme(plot.title = element_text(hjust = 0.5))

From the boxplot, it is concluded that the trestbps for heart disease patients has a larger variance.

# 3. chol by num
ggplot(df_clev, aes(num, chol)) + geom_boxplot() + ggtitle("chol by num") + theme(plot.title = element_text(hjust = 0.5))

Base on this boxplot, it is shown bothe the average and the variance of chol for heart disease patients are larger compared with normal people.

# 4. thalach by num
ggplot(df_clev, aes(num, thalach)) + geom_boxplot() + ggtitle("thalach by num") + theme(plot.title = element_text(hjust = 0.5))

It is clear that the average and variance of thalach for normal people are higher compared with heart disease patients.

# 5. oldpeak by num
ggplot(df_clev, aes(num, oldpeak)) + geom_boxplot() +  ggtitle("oldpeak by num") + theme(plot.title = element_text(hjust = 0.5))

According to the boxplt, we could see heart disease patients have a higher average oldpeak and larger variance compared with normal people.

5.2.2.4 One categorical independent variable vs response variable
# 1. sex by num
#Load the library ggplot2
ggplot(df_clev, aes(sex, fill = num)) + geom_bar(position = "fill") + ggtitle("sex by num") + theme(plot.title = element_text(hjust = 0.5))

It is noticeable that male has a higher of hearts disease patients compared with the female

# 2. cp by num
ggplot(df_clev, aes(cp, fill = num)) + geom_bar(position = "dodge") + ggtitle("cp by num") + theme(plot.title = element_text(hjust = 0.5))

Obviously, typical angina, atypical angina, non-anginal pain all have a smaller proportion of normal people while there are higher proportion of heart disease patients for asymptomatic

# 3. fbs vs num 
ggplot(df_clev, aes(fbs, fill = num)) + geom_bar(position = "fill") +  ggtitle("fbs by num") + theme(plot.title = element_text(hjust = 0.5))

From the bar chart, we could notice there are almost the same percentages of health and heart disease people for each group in variable fbs.

# 4. restecg vs num
ggplot(df_clev, aes(restecg, fill = num)) + geom_bar(position = "dodge") + ggtitle("restecg by num") + theme(plot.title = element_text(hjust = 0.5))

Evidently, all categories apart from normal group have higher percentage of heart disease patients compared with normal people.

# 5. exang vs num
ggplot(df_clev, aes(exang, fill = num)) + geom_bar(position = "fill") + ggtitle("exang by num") + theme(plot.title = element_text(hjust = 0.5))

From the plot, we could observe that there are far more heart disease patients for the group Yes(exercise induced angina). In contrast, normal people has a higher percentage in the the group No (exercise not induced angina).

# 6. slope vs num
ggplot(df_clev, aes(slope, fill = num)) + geom_bar(position = "fill") + ggtitle("slope by num") + theme(plot.title = element_text(hjust = 0.5))

It is presented that there are higher percentages of heart disease patients in the group flat (the flat slope of the peak exercise ST segment),downsloping(the downsloping slope of the peak exercise ST segment) compared with the opposite condition in group upsloping(the upsloping slope of the peak exercise ST segment)

# 7. ca vs num
ggplot(df_clev, aes(ca, fill = num)) + geom_bar(position = "dodge") + ggtitle("ca by num") + theme(plot.title = element_text(hjust = 0.5))

1, 2 , and 3 major vessels colored by flourosopy have a higher proportion of heart disease patients compared with normal people. Instead, 0 major vessels colored by flourosopy present the opposite condition.

# 8. thal vs num
ggplot(df_clev, aes(thal, fill = num)) + geom_bar(position = "fill") + ggtitle("thal by num") + theme(plot.title = element_text(hjust = 0.5))

Clearly, fixed defect and reversable defect have far more heart disease people while the normal group shows an opposite circumstance.

5.2.2.5 Multiple variables
# 1.age vs trestbps grouped by num
ggplot(df_clev, aes(age,trestbps, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 2.age vs chol grouped by num
ggplot(df_clev, aes(age,chol, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 3.age vs thalach grouped by num
ggplot(df_clev, aes(age,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 4.age vs oldpeak grouped by num
ggplot(df_clev, aes(age,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 5.trestbps vs chol grouped by num
ggplot(df_clev, aes(trestbps,chol, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 6.trestbps vs thalach grouped by num
ggplot(df_clev, aes(trestbps,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 7.trestbps vs oldpeak grouped by num
ggplot(df_clev, aes(trestbps,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 8.chol vs thalach grouped by num
ggplot(df_clev, aes(chol,thalach, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 9.chol vs oldpeak grouped by num
ggplot(df_clev, aes(chol,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# 10.thalach, oldpeak grouped by num
ggplot(df_clev, aes(thalach,oldpeak, color = num)) + geom_point() + geom_smooth() + facet_wrap(df_clev$num)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the 10 plots above, you could see that plot 6, 8 show opposite trends between 2 facets while other plots present almost the same trends.

After gaining an intuitive insight into the data , you may want to see the correlations between variables.

# Load the library PerformanceAnalytics
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
# Get the correlation matrix chart
df_clev %>%
  select(age, trestbps, chol, thalach, oldpeak) %>%
  chart.Correlation(histogram=TRUE, pch=19)

It is obvious that there are correlations but not strong between other variables and our target variable num. Thus there is no reason for us to delete any variables and we will use all variables for our predictive model.

6.Modelling

In this part, we will compare several models using 10-fold cross validation and select the best one fro df_clev. Finally we will test its performance on the testing dataset to see whether it is overfitting.

# Load the library caret
library(caret)
## Loading required package: lattice
# Split the dataset into 80% and 20% as training and testing dataset respectively
# Set the random seed to ensure we get the same traning dataset and testing dataset everytime we run the code.
split <- 0.8
set.seed(998)
train_index <- createDataPartition(df_clev$num, p=split, list = F)
data_train <- df_clev[train_index, ]
data_test <- df_clev[-train_index, ]
# 10 fold cross validation
train_control <- trainControl(method="cv", number=10)
# 1. knn
# Set the same random seed for each algorithm to ensure the data split for 10-fold cross validation
set.seed(2)
model_knn <- train(num~., data=data_train, trControl=train_control, method="knn", metric="Accuracy")
print(model_knn)
## k-Nearest Neighbors 
## 
## 238 samples
##  13 predictor
##   2 classes: 'healthy', 'heart disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6760870  0.3458851
##   7  0.6762681  0.3451615
##   9  0.6932971  0.3812592
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# 2. logistics regression
set.seed(2)
model_glm <- train(num~., data=data_train, trControl=train_control, method="glm", family = 'binomial', metric="Accuracy")
print(model_glm)
## Generalized Linear Model 
## 
## 238 samples
##  13 predictor
##   2 classes: 'healthy', 'heart disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8322464  0.6611789
# 3. Random forest
set.seed(2)
model_rf <- train(num~., data=data_train, trControl=train_control, method="rf", metric="Accuracy")
print(model_rf)
## Random Forest 
## 
## 238 samples
##  13 predictor
##   2 classes: 'healthy', 'heart disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8192029  0.6339159
##   11    0.7985507  0.5947672
##   20    0.7896739  0.5766230
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

As you could see, logistics regression model performs best with about 83% mean accuracy. Thus we tend to choose it as our final model. In order to confirm whether there is overfitting, we will validate it in our testing dataset.

# Make predictions
x_test <- data_test[, 1:length(data_train) - 1]
y_test <- data_test[, length(data_train)]
prediction <- predict(model_glm, x_test)
# Have a look at the confusion metrics
confusionMatrix(prediction, y_test)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      healthy heart disease
##   healthy            25             6
##   heart disease       6            21
##                                           
##                Accuracy : 0.7931          
##                  95% CI : (0.6665, 0.8883)
##     No Information Rate : 0.5345          
##     P-Value [Acc > NIR] : 3.93e-05        
##                                           
##                   Kappa : 0.5842          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8065          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.8065          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.5345          
##          Detection Rate : 0.4310          
##    Detection Prevalence : 0.5345          
##       Balanced Accuracy : 0.7921          
##                                           
##        'Positive' Class : healthy         
## 

Obviously, the accuracy in the testing dataset is about 79%, very close to 83%. Thus we could think our model fits well.