Name : Mst Nigar Sultana

Part 1 :

2) Loading dataset :

heart = read.csv('heart.csv')
heart

3) Few sentences to explain the dataset: a) Identify numerical columns. b) Identify categorical columns. c) Identify target variable (normally the last column is the target variable)

print("The dataset contains information on 918 individuals with 12 variables: Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, ST_Slope, and HeartDisease. The individuals' ages range from 28 to 77 years. There are 5 numerical features (Age, RestingBP, Cholesterol, MaxHR, and Oldpeak); the other 7 columns are categorical (Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope, HeartDisease). The last column, HeartDisease, is the target variable; it is stored as 0/1 but is a class label, not a measurement.")
numeric_columns = heart[ ,c(1,4,5,8,10)]
numeric_columns
categorical_columns = heart[ ,c(2,3,6,7,9,11,12)]
categorical_columns
target_variable = heart[ ,12]
target_variable
  [1] 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0
 [49] 0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 0 1
 [97] 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0
[145] 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0
[193] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1
[241] 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
[289] 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1
[337] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[385] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 0
[433] 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 1
[481] 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 0
[529] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1 1 0 1 0 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1
[577] 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1
[625] 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1
[673] 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 0 1 1
[721] 1 0 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0
[769] 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0
[817] 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1
[865] 1 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 1
[913] 1 1 1 1 1 0
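
The column selection above relies on fixed positions such as c(1,4,5,8,10). A minimal sketch of selecting by type and name instead is given here. The data frame `toy` and its values are made up for illustration (only the column names are borrowed from heart.csv), and `coded_categories` is a hypothetical helper listing columns that are numeric in storage but categorical in meaning.

```r
# Hypothetical three-row sample; the values are invented for illustration only.
toy = data.frame(
  Age = c(40, 49, 37),
  Sex = c("M", "F", "M"),
  RestingBP = c(140, 160, 130),
  FastingBS = c(0, 0, 1),
  HeartDisease = c(0, 1, 0)
)

# Columns stored as numbers but used as categories must be listed by hand,
# because is.numeric() cannot tell a 0/1 code from a measurement.
coded_categories = c("FastingBS", "HeartDisease")

is_num = sapply(toy, is.numeric)
numeric_names = setdiff(names(toy)[is_num], coded_categories)
categorical_names = setdiff(names(toy), numeric_names)

numeric_names      # names of the measured (numeric) features
categorical_names  # names of the categorical columns, including the target
```

Selecting by name keeps working if the column order of the file changes, which position-based indexing does not.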

4. Remove all the categorical columns (not the target column) from the dataset. Now we will call all the numerical columns as features and the last column as target or class.

Examine_data = heart[ ,c(-2,-3,-6,-7,-9,-11)]
Examine_data
target = heart$HeartDisease
target

5. Scatter plot of any two features - each class (target) should be represented with different colors.

target_color = as.numeric(factor(heart$HeartDisease))
plot(heart$Age, heart$Cholesterol,
     col = target_color,
     pch = 2,
     cex = 0.5,
     xlim = c(20,80),
     ylim = c(1, 500),
     xlab = "Age",
     ylab = "Cholesterol",
     main = "Visualization of Age and Cholesterol",
     col.main = 'blue',
     col.axis = 'black',
     col.lab = 'red',
     cex.main = 1.5,
     cex.axis = 1,
     cex.lab = 1)

a. Histogram plot of any single feature.

hist(heart$MaxHR, main = "Histogram plot", xlab = "MaxHR", col = "blue")

b. Boxplot of any single feature

boxplot(heart$MaxHR, ylab = 'MaxHR', main = 'Box plot', col = 'red')

6. Use the ggplot library functions to plot the following and explain each figure in (2-3) sentences.

(i) Scatter plot between Age and MaxHR

library(ggplot2)
heart = read.csv('heart.csv')
target_color = as.character(heart$HeartDisease)
ggplot(heart, aes(x = Age, y = MaxHR, color = target_color)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("0" = "green", "1" = "red")) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between Age and MaxHR",
       x = "Age",
       y = "MaxHR",
       caption = "Source: Iskulghar") + 
  
  theme(
  
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))


(ii) Scatter plot between Age and Cholesterol

ggplot(heart, aes(x = Age, y = Cholesterol, color = target_color)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("0" = "pink", "1" = "purple")) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between Age and Cholesterol",
       x = "Age",
       y = "Cholesterol",
       caption = "Source: Iskulghar") + 
  
  theme(
  
    legend.position = "top", 
    text = element_text(colour = "blue", size = 15),
    axis.text.x = element_text(color = "black", size = 10),
    axis.text.y = element_text(color = "black", size = 10))


(iii) Scatter plot between Age and RestingBP

Heartdisease = as.character(heart$HeartDisease)
ggplot(heart, aes(x = Age, y = RestingBP, color = Heartdisease)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("0" = "cyan3", "1" = "darkblue")) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 2)) +
  labs(title = "Scatter plot between Age and RestingBP",
       x = "Age",
       y = "RestingBP",
       caption = "Source: Iskulghar") + 

  theme(
  plot.title = element_text(size = 20, color = "darkblue"), 
    legend.position = "top",
  
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))


(iv) Scatter plot between Age and Oldpeak

ggplot(heart, aes(x = Age, y = Oldpeak, color = Heartdisease)) +
  geom_point(size = 2) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between Age and Oldpeak",
       x = "Age",
       y = "Oldpeak",
       caption = "Source: Iskulghar") + 
  
  theme(
 plot.title = element_text(size = 20, color = "red"), 
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(v) Scatter plot between RestingBP and Cholesterol

ggplot(heart, aes(x = RestingBP, y = Cholesterol, color = Heartdisease)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("0" = "darkolivegreen3", "1" = "darkolivegreen4")) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between RestingBP and Cholesterol",
       x = "RestingBP",
       y = "Cholesterol",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'darkgreen', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'brown', size = 12),
    axis.text.x = element_text(color = "brown4", size = 10),
    axis.text.y = element_text(color = "brown4", size = 10))

(vi) Scatter plot between RestingBP and MaxHR

ggplot(heart, aes(x = RestingBP, y = MaxHR, color = Heartdisease)) +
  geom_point(size = 2) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between RestingBP and MaxHR",
       x = "RestingBP",
       y = "MaxHR",
       caption = "Source: Iskulghar") + 
  
  theme(
   plot.title = element_text(colour = 'blue', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "red", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(vii) Scatter plot between RestingBP and Oldpeak

ggplot(heart, aes(x = RestingBP, y = Oldpeak, color = Heartdisease)) +
  geom_point(size = 2) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between RestingBP and Oldpeak",
       x = "RestingBP",
       y = "Oldpeak",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'blue', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'red', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(viii) Scatter plot between Cholesterol and MaxHR

ggplot(heart, aes(x = Cholesterol, y = MaxHR, color = Heartdisease)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("0" = "cadetblue", "1" = "darkorange")) +
  labs(title = "Scatter plot between Cholesterol and MaxHR",
       x = "Cholesterol",
       y = "MaxHR",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'black', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'darkblue', size = 15),
    axis.text.x = element_text(color = "red", size = 10),
    axis.text.y = element_text(color = "red", size = 10))

(ix) Scatter plot between Cholesterol and Oldpeak

ggplot(heart, aes(x = Cholesterol, y = Oldpeak, color = Heartdisease)) +
  geom_point(size = 2) +
  guides(color = guide_legend(order = 1),
         size = guide_legend(order = 2),
         shape = guide_legend(order = 3)) +
  labs(title = "Scatter plot between Cholesterol and Oldpeak",
       x = "Cholesterol",
       y = "Oldpeak",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'red', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'gray', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(x) Scatter plot between Oldpeak and MaxHR

ggplot(heart, aes(x = Oldpeak, y = MaxHR, color = Heartdisease)) +
  geom_point(size = 2) +
  labs(title = "Scatter plot between Oldpeak and MaxHR",
       x = "Oldpeak",
       y = "MaxHR",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'blue', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

a. Boxplot of all columns

(i) Box plot of Age

ggplot(data = heart, aes(x = Heartdisease, y=Age)) +  
  geom_boxplot(fill = c("cyan2","darkblue"), alpha = 0.5) +
 
  labs(title = "Box plot of Age",
       x = "HeartDisease",
       y = "Age",
       caption = "Source: Iskulghar") + 
  
  theme(
  
    legend.position = "top", 
    plot.title.position = "panel",
    plot.title = element_text(colour = 'darkblue', size = 20),
    text = element_text(colour = 'blue', size = 15),
    axis.text.x = element_text(color = "black", size = 10),
    axis.text.y = element_text(color = "black", size = 10))

(ii) Box plot of RestingBP

ggplot(data = heart, aes(x = Heartdisease, y=RestingBP)) +  
  geom_boxplot(fill = c("brown2","darkcyan"), alpha = 0.5) +
 
  labs(title = "Box plot of RestingBP",
       x = "HeartDisease",
       y = "RestingBP",
       caption = "Source: Iskulghar") + 
  
  theme(
  plot.title = element_text(colour = 'darkblue', size = 20),
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(iii) Box plot of Cholesterol

ggplot(data = heart, aes(x = Heartdisease, y=Cholesterol)) +  
  geom_boxplot(fill = c("green","red"), alpha = 0.5) +
 
  labs(title = "Box plot of Cholesterol",
       x = "HeartDisease",
       y = "Cholesterol",
       caption = "Source: Iskulghar") + 
  
  theme(
  
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(iv) Box plot of MaxHR

ggplot(data = heart, aes(x = Heartdisease, y=MaxHR)) +  
  geom_boxplot(fill = c("red","blue"), alpha = 0.5) +
 
  labs(title = "Box plot of MaxHR",
       x = "HeartDisease",
       y = "MaxHR",
       caption = "Source: Iskulghar") + 
  theme(
  
    legend.position = "top", 
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

(v) Box plot of Oldpeak

ggplot(data = heart, aes(x = Heartdisease, y = Oldpeak)) +
  geom_boxplot(fill = c("darkorange2","blue"), alpha = 0.5) +
 
  labs(title = "Box plot of Oldpeak",
       x = "HeartDisease",
       y = "Oldpeak",
       caption = "Source: Iskulghar") + 
  
  theme(
  
    legend.position = "top", 
    plot.title = element_text(colour = 'darkblue', size = 15),
    text = element_text(colour = 'black', size = 15),
    axis.text.x = element_text(color = "blue", size = 10),
    axis.text.y = element_text(color = "blue", size = 10))

Project Part - 2

a. Interactive violin plot of all features

library(plotly)

plot_ly(heart, x= Heartdisease, y = ~Age, type = 'violin')
plot_ly(heart, x= Heartdisease, y = ~RestingBP, type = 'violin')
plot_ly(heart, x= Heartdisease, y = ~Cholesterol, type = 'violin')
plot_ly(heart, x= Heartdisease, y = ~MaxHR, type = 'violin')
plot_ly(heart, x= Heartdisease, y = ~Oldpeak, type = 'violin')

b. Interactive boxplot of all features

plot_ly(heart, x= Heartdisease, y = ~Age, type = 'box')
plot_ly(heart, x= Heartdisease, y = ~RestingBP, type = 'box')
plot_ly(heart, x= Heartdisease, y = ~Cholesterol, type = 'box')
plot_ly(heart, x= Heartdisease, y = ~MaxHR, type = 'box')
plot_ly(heart, x= Heartdisease, y = ~Oldpeak, type = 'box')

c. Calculate correlation matrix and print the matrix. Explain strong and weak correlation

cor_matrix = cor(numeric_columns[ ,1:5])
cor_matrix
                    Age  RestingBP Cholesterol      MaxHR     Oldpeak
Age          1.00000000  0.2543994 -0.09528177 -0.3820447  0.25861154
RestingBP    0.25439936  1.0000000  0.10089294 -0.1121350  0.16480304
Cholesterol -0.09528177  0.1008929  1.00000000  0.2357924  0.05014811
MaxHR       -0.38204468 -0.1121350  0.23579240  1.0000000 -0.16069055
Oldpeak      0.25861154  0.1648030  0.05014811 -0.1606906  1.00000000
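# The strongest correlation by magnitude is -0.382, between Age and MaxHR: a moderate negative
# relationship, so maximum heart rate tends to fall as age rises.
# The weakest is 0.050, between Cholesterol and Oldpeak, which is close to no linear relationship.
# No pair of features is strongly correlated; every off-diagonal |r| is below 0.4.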
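
Strength of correlation is judged by the absolute value of r, not its sign. As a minimal sketch, the matrix printed above is retyped here by hand so the block runs without heart.csv; the names `cm` and `pair_table` are introduced only for this illustration.

```r
vars = c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")

# Correlation matrix copied from the printed output above.
cm = matrix(c(
   1.00000000,  0.2543994, -0.09528177, -0.3820447,  0.25861154,
   0.25439936,  1.0000000,  0.10089294, -0.1121350,  0.16480304,
  -0.09528177,  0.1008929,  1.00000000,  0.2357924,  0.05014811,
  -0.38204468, -0.1121350,  0.23579240,  1.0000000, -0.16069055,
   0.25861154,  0.1648030,  0.05014811, -0.1606906,  1.00000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Look only at the lower triangle so each pair is counted once
# and the diagonal of 1s is ignored.
idx = which(lower.tri(cm), arr.ind = TRUE)
pair_table = data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)

# Rank pairs by |r|: largest is the strongest, smallest the weakest.
strongest = pair_table[which.max(abs(pair_table$r)), ]
weakest = pair_table[which.min(abs(pair_table$r)), ]
strongest
weakest
```

With the real data, `cm` could simply be replaced by `cor_matrix`.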

d. Plot correlation matrix (lower triangle) with values

library(ggcorrplot)
ggcorrplot(cor_matrix, 
           type = "lower",
           colors = c("red", "white", "blue"),
           lab = TRUE)

e. Pair plot of all features

library(GGally)
ggpairs(numeric_columns, aes(colour = Heartdisease))

f. Apply principal component analysis (PCA) and explain the PCAs

library(stats)

heart_pca = prcomp(numeric_columns, scale = TRUE, center = TRUE)
heart_pca
Standard deviations (1, .., p=5):
[1] 1.3065648 1.0926990 0.9117941 0.8314805 0.7590579

Rotation (n x k) = (5 x 5):
                   PC1         PC2         PC3        PC4        PC5
Age          0.6026536 0.009353273 -0.07505093  0.3172132  0.7283298
RestingBP    0.3742414 0.474166841 -0.64215210 -0.4299469 -0.1946676
Cholesterol -0.1779956 0.743454901  0.06375770  0.6283270 -0.1293543
MaxHR       -0.5391946 0.343937374  0.03735596 -0.4329406  0.6341477
Oldpeak      0.4175389 0.322583657  0.75930727 -0.3637157 -0.1129796
summary(heart_pca)
Importance of components:
                          PC1    PC2    PC3    PC4    PC5
Standard deviation     1.3066 1.0927 0.9118 0.8315 0.7591
Proportion of Variance 0.3414 0.2388 0.1663 0.1383 0.1152
Cumulative Proportion  0.3414 0.5802 0.7465 0.8848 1.0000
pca_12 = as.data.frame(heart_pca$x[ , 1:2])
pca_12_class = cbind(pca_12, Heartdisease = Heartdisease)
pca_12_class
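
To help explain the PCA summary: the "Proportion of Variance" row is each component's variance (its standard deviation squared) divided by the total variance. The sketch below reproduces that row from the standard deviations printed above, which are retyped by hand so the block runs on its own; the variable names are chosen only for this illustration.

```r
# Standard deviations of PC1..PC5, copied from the prcomp output above.
sdev = c(1.3065648, 1.0926990, 0.9117941, 0.8314805, 0.7590579)

variance = sdev^2                    # variance captured by each component
prop = variance / sum(variance)      # proportion of total variance
cum = cumsum(prop)                   # cumulative proportion

round(prop, 4)
round(cum, 4)
```

The rounded values agree with the summary table: PC1 and PC2 together account for about 58% of the variance, and four components are needed to pass 88%, so the five features are not very redundant. In the rotation matrix, PC1 loads most heavily on Age (0.60) and MaxHR (-0.54) with opposite signs, while PC2 is dominated by Cholesterol (0.74).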

i. Bar plot of PCAs (Percentage of explained variance vs PCs)

library(factoextra)
fviz_eig(heart_pca, addlabels = TRUE)

ii. Contribution plot of PCs (Circular plot)

fviz_pca_var(heart_pca,
             col.var = "contrib")

iii. Contribution plot as Heatmap

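# Correlation heatmap of the original features, for reference
ggcorrplot(cor_matrix, 
           type = "lower",
           colors = c("purple", "white", "red"),
           lab = TRUE)

# Contribution (in percent) of each variable to each principal component
library("corrplot")
var = get_pca_var(heart_pca)
corrplot(var$contrib, is.corr = FALSE)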

iv. Cluster plot after PCA

fviz_pca_ind(heart_pca,
             geom.ind = "point",
             col.ind = Heartdisease,
             addEllipses = TRUE)

g. Use SVM to train a model to classify the target variable. Explain the results

i. Plot the confusion matrix

library(lattice)
library(e1071)
library(caret)
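heart = read.csv('heart.csv')
heart_data = heart[ ,c(1,4,5,8,10,12)]
# svm() fits a regression when the response is numeric, so make the target a factor
heart_data$HeartDisease = factor(heart_data$HeartDisease)
set.seed(123)  # arbitrary seed so the train/test split is reproducible
train_ix = createDataPartition(heart_data$HeartDisease, p = 0.8, list = FALSE)
train_data = heart_data[train_ix, ]
test_data = heart_data[-train_ix, ]
svm_model = svm(HeartDisease ~ Age+RestingBP+Cholesterol+MaxHR+Oldpeak, data = train_data, kernel = "linear")
predictions = predict(svm_model, newdata = test_data)
# Confusion matrix of predicted vs actual classes on the test set
conf_mat = confusionMatrix(predictions, test_data$HeartDisease)
conf_mat
fourfoldplot(conf_mat$table)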
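
A confusion matrix is a cross-tabulation of predicted against actual classes. The sketch below builds one with base R only. The `actual` and `predicted` vectors are hypothetical labels invented for illustration; they are not output from the SVM model.

```r
# Hypothetical class labels, made up purely to illustrate the calculation.
actual    = factor(c(0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted = factor(c(0, 1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the true classes.
conf_mat = table(Predicted = predicted, Actual = actual)
conf_mat

# Accuracy: share of cases on the diagonal (correct predictions).
accuracy = sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: share of true class-1 cases predicted as 1.
sensitivity = conf_mat["1", "1"] / sum(conf_mat[, "1"])

# Specificity: share of true class-0 cases predicted as 0.
specificity = conf_mat["0", "0"] / sum(conf_mat[, "0"])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

For a heart-disease classifier, sensitivity is usually the figure to watch, since a missed case (false negative) tends to cost more than a false alarm.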

2. Using regression dataset called “US Admission” (US Admission.csv). Remove the “Serial No” column from the dataset

US_Admission = read.csv('US Admission.csv')
US_Admission
Admission = US_Admission[ ,-1]
Admission
lm_model = lm(Chance.of.Admit ~ GRE.Score, data = Admission)
summary(lm_model)

Call:
lm(formula = Chance.of.Admit ~ GRE.Score, data = Admission)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33613 -0.04604  0.00408  0.05644  0.18339 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.4360842  0.1178141  -20.68   <2e-16 ***
GRE.Score    0.0099759  0.0003716   26.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08517 on 398 degrees of freedom
Multiple R-squared:  0.6442,    Adjusted R-squared:  0.6433 
F-statistic: 720.6 on 1 and 398 DF,  p-value: < 2.2e-16
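
To read the fitted line, the two coefficients can be plugged into the equation Chance.of.Admit = intercept + slope * GRE.Score. The sketch below does this by hand with the estimates printed above; the three GRE scores are hypothetical inputs chosen only for illustration.

```r
# Coefficients copied from the summary output above.
intercept = -2.4360842
slope = 0.0099759

# Hypothetical GRE scores to evaluate the fitted line at.
gre = c(300, 320, 340)

# Predicted chance of admission from the simple linear model.
chance = intercept + slope * gre
round(chance, 4)
```

Each additional 10 GRE points raises the predicted chance by about 0.10, and a score of 320 gives roughly 0.76. The R-squared of 0.6442 means GRE score alone accounts for about 64% of the variation in the chance of admission.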
x = Admission$GRE.Score
y = Admission$Chance.of.Admit

pred = predict(lm_model)
ix = order(x)  # order() returns the index vector needed to draw the line left to right

plot(x, y, xlab = "GRE.Score", ylab = "Chance.of.Admit")
lines(x[ix], pred[ix])

a. Pair plot of all features

library(ggplot2)
library(GGally)
ggpairs(Admission)

b. Plot linear regression with each feature. For example: y = GRE Score, x = Chance of Admit. Do this for all different features (only y will change). Explain the result.

(i) Plot of Linear Regression with “Chance of Admit” and “GRE.Score”.

ggplot(Admission, aes(x = Chance.of.Admit, y = GRE.Score, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.90)

(ii) Plot of Linear Regression with “Chance of Admit” and “TOEFL Score”.

ggplot(Admission, aes(x = Chance.of.Admit, y = TOEFL.Score, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

(iii) Plot of Linear Regression with “Chance of Admit” and “University Rating”.

ggplot(Admission, aes(x = Chance.of.Admit, y = University.Rating, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

(iv) Plot of Linear Regression with “Chance of Admit” and “SOP”.

ggplot(Admission, aes(x = Chance.of.Admit, y = SOP, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

(v) Plot of Linear Regression with “Chance of Admit” and “LOR”.

ggplot(Admission, aes(x = Chance.of.Admit, y = LOR, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.90)

(vi) Plot of Linear Regression with “Chance of Admit” and “CGPA”.

ggplot(Admission, aes(x = Chance.of.Admit, y = CGPA, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

(vii) Plot of Linear Regression with “Chance of Admit” and “Research”.

ggplot(Admission, aes(x = Chance.of.Admit, y = Research, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", level = 0.95)

c. Plot polynomial regression of power 2 with each feature. For example: y = GRE Score, x = Chance of Admit. Do this for all different features (only y will change). Explain the result.

(i) Polynomial Regression Plot with “Chance of Admit” and “GRE.Score”.

ggplot(Admission, aes(x = Chance.of.Admit, y = GRE.Score, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2),level = 0.90)

(ii) Polynomial Regression Plot with “Chance of Admit” and “TOEFL Score”.

ggplot(Admission, aes(x = Chance.of.Admit, y = TOEFL.Score, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm",formula = y~poly(x, 2), level = 0.95)

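(iii) Polynomial Regression Plot with “Chance of Admit” and “University Rating”.

ggplot(Admission, aes(x = Chance.of.Admit, y = University.Rating, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)

(iv) Polynomial Regression Plot with “Chance of Admit” and “SOP”.

ggplot(Admission, aes(x = Chance.of.Admit, y = SOP, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2), level = 0.95)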

(v) Polynomial Regression Plot with “Chance of Admit” and “LOR”.

ggplot(Admission, aes(x = Chance.of.Admit, y = LOR, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2),level = 0.90)

(vi) Polynomial Regression Plot with “Chance of Admit” and “CGPA”.

ggplot(Admission, aes(x = Chance.of.Admit, y = CGPA, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2),level = 0.95)

(vii) Polynomial Regression Plot with “Chance of Admit” and “Research”.

ggplot(Admission, aes(x = Chance.of.Admit, y = Research, color = Chance.of.Admit)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y~poly(x, 2),level = 0.95)
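
The plots above show the quadratic curves, but not whether the squared term actually improves the fit. A minimal sketch of checking this numerically is given here on synthetic data: `x` and `y` are simulated with a known curved relationship and are not the admission data, and the seed and coefficients are arbitrary assumptions.

```r
# Synthetic data with a deliberately curved relationship (illustration only).
set.seed(1)
x = seq(0, 1, length.out = 50)
y = 2 + 3 * x - 4 * x^2 + rnorm(50, sd = 0.1)

# Straight-line fit and degree-2 polynomial fit to the same data.
fit_lin = lm(y ~ x)
fit_quad = lm(y ~ poly(x, 2))

# R-squared of each model; the quadratic can never be lower because
# the linear model is nested inside it.
r2_lin = summary(fit_lin)$r.squared
r2_quad = summary(fit_quad)$r.squared
c(linear = r2_lin, quadratic = r2_quad)

# F-test for whether the squared term adds a significant improvement.
anova(fit_lin, fit_quad)
```

Applied to the admission data, the same comparison could be run per feature, for example with GRE.Score as y and Chance.of.Admit as x as in the plots. A small gain in R-squared and a large p-value in the `anova()` table would suggest the relationship is already close to linear.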

d. Use all features together to create a regression model and explain the result. For example: lm(y ~ x1+x2+x3….+xn, data = US_Admission), here x represents a single feature and y represents the target variable.

lm_model_all = lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + SOP + LOR + CGPA + Research, data = Admission)
summary(lm_model_all)
