Asma Sultana

Dataset load

This project demonstrates R programming and data science skills in a research context.

mortality = read.csv("Mortality.csv")
str(mortality)

## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

4. Explain the dataset.

After loading, str() reports most columns as int, including the binary indicators, so R does not separate numerical from categorical columns on its own. I therefore grouped the columns manually, based on my own reading of the data.

Numerical columns: age, creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium, and time are numerical columns that likely represent continuous measurements.
Categorical columns: anaemia, diabetes, high_blood_pressure, sex, and smoking are coded as 0/1, so they likely represent binary variables.

Target variable: The last column, DEATH_EVENT, is the target variable in the dataset, which likely indicates whether an individual experienced a death event or not.

numerical_columns = c("age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time")
numerical_columns

## [1] "age"                      "creatinine_phosphokinase"
## [3] "ejection_fraction"        "platelets"               
## [5] "serum_creatinine"         "serum_sodium"            
## [7] "time"

categorical_columns = c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")
categorical_columns

## [1] "anaemia"             "diabetes"            "high_blood_pressure"
## [4] "sex"                 "smoking"

target_variable = "DEATH_EVENT"
target_variable

## [1] "DEATH_EVENT"
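
Because the binary indicators are stored as int, R treats them as ordinary numbers in summary() and cor(). One possible way to make the grouping above explicit is to convert those columns to factors. The sketch below is only an illustration: `demo` is a small made-up data frame (hypothetical values, not rows read from Mortality.csv), and only the column names follow the dataset.

```r
# Hypothetical toy data with the same column types as the loaded file
demo <- data.frame(
  age         = c(75, 55, 65, 50),
  anaemia     = c(0L, 0L, 1L, 1L),
  smoking     = c(0L, 1L, 0L, 1L),
  DEATH_EVENT = c(1L, 0L, 1L, 0L)
)

binary_columns <- c("anaemia", "smoking", "DEATH_EVENT")

# Convert each 0/1 indicator to a factor so R treats it as categorical
demo[binary_columns] <- lapply(demo[binary_columns], factor, levels = c(0, 1))

# age stays numeric; the indicators are now reported as factors
str(demo)
```

After the conversion, summary() would report counts per level for the indicator columns instead of a mean and quartiles.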

5. Correlation between numerical features

## str() does not separate numerical from categorical columns, so the numerical features are selected manually and converted with as.numeric().

table = data.frame(

  a1= as.numeric(mortality$age),
  a2= as.numeric(mortality$creatinine_phosphokinase),
  a3= as.numeric(mortality$ejection_fraction),
  a4= as.numeric(mortality$platelets),
  a5= as.numeric(mortality$serum_creatinine),
  a6= as.numeric(mortality$serum_sodium),
  a7= as.numeric(mortality$time)
)  
table

Correlation of age (a1) with each other numerical feature, using a loop

## A named vector to store correlation coefficients
correlations = numeric(length = 6)

for (i in 2:7) {
  correlation = cor(table$a1, table[, i])
  correlations[i-1] = correlation #store
}

names(correlations) = c("a1-a2", "a1-a3", "a1-a4","a1-a5", "a1-a6", "a1-a7")

print(correlations)

##       a1-a2       a1-a3       a1-a4       a1-a5       a1-a6       a1-a7 
## -0.08158390  0.06009836 -0.05235437  0.15918713 -0.04596584 -0.22406842
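
The loop above can also be written without an explicit index. The following is a minimal sketch, assuming a data frame whose first column plays the role of a1; `demo`, `loop_result`, and `vec_result` are hypothetical names, and the values are illustrative rather than the full dataset.

```r
# Hypothetical toy data standing in for the numerical columns
demo <- data.frame(
  a1 = c(75, 55, 65, 50, 90),
  a2 = c(582, 7861, 146, 111, 47),
  a3 = c(20, 38, 20, 20, 40)
)

# Loop version, following the same pattern as the text
loop_result <- numeric(length = 2)
for (i in 2:3) {
  loop_result[i - 1] <- cor(demo$a1, demo[, i])
}

# Vectorized version: correlate a1 with every other column in one call
vec_result <- sapply(demo[-1], function(column) cor(demo$a1, column))
names(vec_result) <- paste("a1", names(demo)[-1], sep = "-")

# Both approaches should give the same coefficients
all.equal(loop_result, unname(vec_result))
```

The sapply() form builds the pair names from the column names, so they cannot drift out of step with the columns being compared.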

## Find the most positive and the most negative correlation.
## max() and min() compare signed values, so min() returns the most negative coefficient, not the one closest to zero.

Max = max(correlations)
cat(" Most positive correlation -> ", names(which.max(correlations)), ": ", Max, "\n")

##  Most positive correlation ->  a1-a5 :  0.1591871

Min = min(correlations)
cat(" Most negative correlation -> ", names(which.min(correlations)), ": ", Min, "\n")

##  Most negative correlation ->  a1-a7 :  -0.2240684
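
max() and min() compare signed values, so the two results above are the most positive and the most negative coefficients. If strength is meant as the magnitude of the linear relationship, the coefficients can be ranked by absolute value instead. The sketch below re-enters the six coefficients from the printed output so that it runs on its own; `strength`, `strongest`, and `weakest` are hypothetical names.

```r
# Coefficients copied from the printed correlation output
correlations <- c(
  "a1-a2" = -0.08158390, "a1-a3" = 0.06009836, "a1-a4" = -0.05235437,
  "a1-a5" = 0.15918713, "a1-a6" = -0.04596584, "a1-a7" = -0.22406842
)

# Strength of a linear relationship is the magnitude, regardless of sign
strength <- abs(correlations)

strongest <- names(which.max(strength))
weakest   <- names(which.min(strength))

cat("Strongest by magnitude ->", strongest, ":", correlations[strongest], "\n")
cat("Weakest by magnitude   ->", weakest, ":", correlations[weakest], "\n")
```

With these values, the pair with the largest magnitude is a1-a7 (age and time) and the pair closest to zero is a1-a6 (age and serum_sodium).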

6. Subset for target feature

subset0 = subset(mortality, DEATH_EVENT == 0)
subset1 = subset(mortality, DEATH_EVENT == 1)
subset0

subset1

Summary of all subsets

summary(subset0)

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  30.0           Min.   :0.0000  
##  1st Qu.:50.00   1st Qu.:0.0000   1st Qu.: 109.0           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 245.0           Median :0.0000  
##  Mean   :58.76   Mean   :0.4089   Mean   : 540.1           Mean   :0.4187  
##  3rd Qu.:65.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :90.00   Max.   :1.0000   Max.   :5209.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :17.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:35.00     1st Qu.:0.0000      1st Qu.:219500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :263000   Median :1.000   
##  Mean   :40.27     Mean   :0.3251      Mean   :266658   Mean   :1.185   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:302000   3rd Qu.:1.200   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :6.100   
##   serum_sodium        sex            smoking            time        DEATH_EVENT
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   : 12.0   Min.   :0   
##  1st Qu.:135.5   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 95.0   1st Qu.:0   
##  Median :137.0   Median :1.0000   Median :0.0000   Median :172.0   Median :0   
##  Mean   :137.2   Mean   :0.6502   Mean   :0.3251   Mean   :158.3   Mean   :0   
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:213.0   3rd Qu.:0   
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0   Max.   :0

summary(subset1)

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :42.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:55.00   1st Qu.:0.0000   1st Qu.: 128.8           1st Qu.:0.0000  
##  Median :65.00   Median :0.0000   Median : 259.0           Median :0.0000  
##  Mean   :65.22   Mean   :0.4792   Mean   : 670.2           Mean   :0.4167  
##  3rd Qu.:75.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 47000   Min.   :0.600   
##  1st Qu.:25.00     1st Qu.:0.0000      1st Qu.:197500   1st Qu.:1.075   
##  Median :30.00     Median :0.0000      Median :258500   Median :1.300   
##  Mean   :33.47     Mean   :0.4062      Mean   :256381   Mean   :1.836   
##  3rd Qu.:38.00     3rd Qu.:1.0000      3rd Qu.:311000   3rd Qu.:1.900   
##  Max.   :70.00     Max.   :1.0000      Max.   :621000   Max.   :9.400   
##   serum_sodium        sex            smoking            time       
##  Min.   :116.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.00  
##  1st Qu.:133.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 25.50  
##  Median :135.5   Median :1.0000   Median :0.0000   Median : 44.50  
##  Mean   :135.4   Mean   :0.6458   Mean   :0.3125   Mean   : 70.89  
##  3rd Qu.:138.2   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:102.25  
##  Max.   :146.0   Max.   :1.0000   Max.   :1.0000   Max.   :241.00  
##   DEATH_EVENT
##  Min.   :1   
##  1st Qu.:1   
##  Median :1   
##  Mean   :1   
##  3rd Qu.:1   
##  Max.   :1

Observations

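Comparing the two summaries shows clear differences between the groups. Patients with DEATH_EVENT == 1 are older on average (mean age 65.22 vs 58.76), have a lower mean ejection_fraction (33.47 vs 40.27), a higher mean serum_creatinine (1.836 vs 1.185), and a much shorter mean time (70.89 vs 158.3). Mean platelets and serum_sodium are slightly lower in that group (256381 vs 266658, and 135.4 vs 137.2), while the proportions for diabetes, sex, and smoking are nearly identical. These differences suggest which variables may be related to the target variable, but they should be explored further with visualizations or additional analyses. The number of rows in each subset should also be compared, since unbalanced classes can affect model performance.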
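
A group-wise comparison of this kind can also be produced in one step rather than by reading two summary() outputs side by side. The following is a minimal sketch using table() for class sizes and aggregate() for per-group means; `demo` is a made-up data frame with hypothetical values, and only the column names follow the dataset.

```r
# Hypothetical toy data; values are illustrative, not taken from Mortality.csv
demo <- data.frame(
  age               = c(60, 50, 65, 75, 70, 80),
  ejection_fraction = c(40, 45, 38, 25, 30, 20),
  DEATH_EVENT       = c(0, 0, 0, 1, 1, 1)
)

# Class sizes: how many rows fall into each outcome group
class_counts <- table(demo$DEATH_EVENT)
print(class_counts)

# Mean of every other column, computed separately per outcome group
group_means <- aggregate(. ~ DEATH_EVENT, data = demo, FUN = mean)
print(group_means)
```

The result has one row per value of DEATH_EVENT, which makes the differences between the groups easier to compare directly.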

8. Determine correlation for all the subsets

Correlation for subset0

table2 = data.frame(

  a1= as.numeric(subset0$age),
  a2= as.numeric(subset0$creatinine_phosphokinase),
  a3= as.numeric(subset0$ejection_fraction),
  a4= as.numeric(subset0$platelets),
  a5= as.numeric(subset0$serum_creatinine),
  a6= as.numeric(subset0$serum_sodium),
  a7= as.numeric(subset0$time)
)  
table2

correlations = numeric(length = 6)

for (i in 2:7) {
  correlation = cor(table2$a1, table2[, i])
  correlations[i-1] = correlation #store
}

names(correlations) = c("a1-a2", "a1-a3", "a1-a4","a1-a5", "a1-a6", "a1-a7")

print(correlations)

##       a1-a2       a1-a3       a1-a4       a1-a5       a1-a6       a1-a7 
## -0.04055170  0.08441109 -0.10869487  0.13758933 -0.01953457 -0.06985139

## max() and min() compare signed values: the results are the most positive and the most negative coefficient.

Max0 = max(correlations)
cat("\n Most positive correlation for subset0 -> ", names(which.max(correlations)), ": ", Max0, "\n")

## 
##  Most positive correlation for subset0 ->  a1-a5 :  0.1375893

Min0 = min(correlations)
cat(" Most negative correlation for subset0 -> ", names(which.min(correlations)), ": ", Min0, "\n")

##  Most negative correlation for subset0 ->  a1-a4 :  -0.1086949

Correlation for subset1

table3 = data.frame(

  a1= as.numeric(subset1$age),
  a2= as.numeric(subset1$creatinine_phosphokinase),
  a3= as.numeric(subset1$ejection_fraction),
  a4= as.numeric(subset1$platelets),
  a5= as.numeric(subset1$serum_creatinine),
  a6= as.numeric(subset1$serum_sodium),
  a7= as.numeric(subset1$time)
)  
table3

correlations = numeric(length = 6)

for (i in 2:7) {
  correlation = cor(table3$a1, table3[, i])
  correlations[i-1] = correlation #store
}

names(correlations) = c("a1-a2", "a1-a3", "a1-a4", "a1-a5", "a1-a6", "a1-a7")

print(correlations)

##       a1-a2       a1-a3       a1-a4       a1-a5       a1-a6       a1-a7 
## -0.16314530  0.21688505  0.07237915  0.06321810  0.03550256 -0.18761577

## max() and min() compare signed values: the results are the most positive and the most negative coefficient.

Max1 = max(correlations)
cat("\n Most positive correlation for subset1 -> ", names(which.max(correlations)), ": ", Max1, "\n")

## 
##  Most positive correlation for subset1 ->  a1-a3 :  0.216885

Min1 = min(correlations)
cat(" Most negative correlation for subset1 -> ", names(which.min(correlations)), ": ", Min1, "\n")

##  Most negative correlation for subset1 ->  a1-a7 :  -0.1876158

Observations

My observations are as follows. The values returned by max() and min() are the most positive and the most negative coefficients, so the minimum is not the correlation closest to zero.

a. In the overall dataset, the most positive correlation is between a1 (age) and a5 (serum_creatinine), with a coefficient of 0.1591871. The most negative correlation is between a1 (age) and a7 (time), with a coefficient of -0.2240684. By magnitude, this negative correlation is the strongest of the six.

b. In subset0, the most positive correlation is still between a1 (age) and a5 (serum_creatinine), with a coefficient of 0.1375893. The most negative correlation is between a1 (age) and a4 (platelets), with a coefficient of -0.1086949.

c. In subset1, the most positive correlation is between a1 (age) and a3 (ejection_fraction), with a coefficient of 0.216885. The most negative correlation is between a1 (age) and a7 (time), with a coefficient of -0.1876158.

All coefficients are below 0.25 in absolute value, so age is only weakly linearly related to the other numerical features, both overall and within each subset.
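
The same correlation steps are repeated for the full data, subset0, and subset1. One way to avoid the repetition is a small helper applied to each group. The sketch below is an assumption-laden illustration: `first_column_correlations` is a hypothetical helper name, and `demo` with its `event` column is made-up data, not the Mortality.csv file.

```r
# Correlate the first column of a data frame with each remaining column.
# `first_column_correlations` is a hypothetical helper, not part of base R.
first_column_correlations <- function(df) {
  result <- sapply(df[-1], function(column) cor(df[[1]], column))
  names(result) <- paste(names(df)[1], names(df)[-1], sep = "-")
  result
}

# Hypothetical toy data with an outcome column to split on
demo <- data.frame(
  a1    = c(60, 50, 65, 55, 75, 70, 80, 62),
  a2    = c(40, 45, 38, 50, 25, 30, 20, 35),
  a3    = c(1.0, 0.9, 1.2, 1.1, 1.9, 1.4, 2.7, 1.3),
  event = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Apply the helper to the full data and to each outcome group
features <- demo[c("a1", "a2", "a3")]
overall  <- first_column_correlations(features)
by_group <- lapply(split(features, demo$event), first_column_correlations)

print(overall)
print(by_group)
```

Keeping the computation in one function means that a change, such as ranking by absolute value, only has to be made in one place.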