This project demonstrates R programming and data science skills applied to a research dataset.
mortality = read.csv("Mortality.csv")
str(mortality)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
Numerical columns: age, creatinine_phosphokinase, ejection_fraction,
platelets, serum_creatinine, serum_sodium, and time hold measured
quantities and are treated here as continuous variables.
Categorical columns: anaemia, diabetes, high_blood_pressure, sex, and
smoking are stored as integers that take only the values 0 and 1, so
they are treated here as binary categorical variables.
Target variable: the last column, DEATH_EVENT, is the target variable. It is coded 0 or 1 and likely indicates whether an individual experienced a death event (1) or not (0).
numerical_columns = c("age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time")
numerical_columns
## [1] "age" "creatinine_phosphokinase"
## [3] "ejection_fraction" "platelets"
## [5] "serum_creatinine" "serum_sodium"
## [7] "time"
categorical_columns = c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")
categorical_columns
## [1] "anaemia" "diabetes" "high_blood_pressure"
## [4] "sex" "smoking"
target_variable = "DEATH_EVENT"
target_variable
## [1] "DEATH_EVENT"
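The grouping above can be cross-checked in code rather than by eye. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of `Mortality.csv` and a hypothetical helper `is_binary`; it flags a column as binary when its only values are 0 and 1.

```r
# Hypothetical stand-in for a few columns of the mortality data (made-up values)
toy = data.frame(
  age = c(75, 55, 65, 50),
  anaemia = c(0L, 0L, 1L, 1L),
  serum_sodium = c(130L, 136L, 129L, 137L),
  DEATH_EVENT = c(1L, 0L, 1L, 0L)
)

# Hypothetical helper: TRUE when a column contains only the values 0 and 1
is_binary = function(x) all(x %in% c(0, 1))

# Apply the check to every column and split the names into two groups
binary_flags = sapply(toy, is_binary)
binary_columns = names(toy)[binary_flags]
continuous_columns = names(toy)[!binary_flags]

print(binary_columns)
print(continuous_columns)
```

On the real data, the target column would also be flagged as binary, so it has to be separated from the categorical predictors by name.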
## str() reports several numerical columns as int rather than num, so convert every numerical column to numeric explicitly before computing correlations.
table = data.frame(
a1= as.numeric(mortality$age),
a2= as.numeric(mortality$creatinine_phosphokinase),
a3= as.numeric(mortality$ejection_fraction),
a4= as.numeric(mortality$platelets),
a5= as.numeric(mortality$serum_creatinine),
a6= as.numeric(mortality$serum_sodium),
a7= as.numeric(mortality$time)
)
table
## A named vector to store correlation coefficients
correlations = numeric(length = 6)
for (i in 2:7) {
correlation = cor(table$a1, table[, i])
correlations[i-1] = correlation #store
}
names(correlations) = c("a1-a2", "a1-a3", "a1-a4","a1-a5", "a1-a6", "a1-a7")
print(correlations)
## a1-a2 a1-a3 a1-a4 a1-a5 a1-a6 a1-a7
## -0.08158390 0.06009836 -0.05235437 0.15918713 -0.04596584 -0.22406842
## Find the most positive and the most negative correlation with age (a1).
## max() and min() act on the signed values; the strength of a relationship is the absolute value of the coefficient.
Max = max(correlations)
cat(" Most positive correlation -> ", names(which.max(correlations)), ": ", Max, "\n")
## Most positive correlation -> a1-a5 : 0.1591871
Min = min(correlations)
cat(" Most negative correlation -> ", names(which.min(correlations)), ": ", Min, "\n")
## Most negative correlation -> a1-a7 : -0.2240684
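Since the sign of a correlation coefficient gives direction and its absolute value gives strength, ranking by `abs()` answers the strongest/weakest question directly. The sketch below reuses the six coefficients printed above; the variable names `strength`, `strongest`, and `weakest` are introduced here only for illustration.

```r
# Correlations of age (a1) with the other numerical features,
# copied from the printed output of the full dataset
correlations = c("a1-a2" = -0.08158390, "a1-a3" = 0.06009836,
                 "a1-a4" = -0.05235437, "a1-a5" = 0.15918713,
                 "a1-a6" = -0.04596584, "a1-a7" = -0.22406842)

# Strength of a linear relationship is the absolute value of the coefficient
strength = abs(correlations)

strongest = names(which.max(strength))
weakest = names(which.min(strength))

cat("Strongest by magnitude ->", strongest, ":", correlations[strongest], "\n")
cat("Weakest by magnitude ->", weakest, ":", correlations[weakest], "\n")
```

By magnitude, the strongest relationship is a1-a7 (age and time) and the weakest is a1-a6 (age and serum_sodium).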
subset0 = subset(mortality, DEATH_EVENT == 0)
subset1 = subset(mortality, DEATH_EVENT == 1)
subset0
subset1
summary(subset0)
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 30.0 Min. :0.0000
## 1st Qu.:50.00 1st Qu.:0.0000 1st Qu.: 109.0 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 245.0 Median :0.0000
## Mean :58.76 Mean :0.4089 Mean : 540.1 Mean :0.4187
## 3rd Qu.:65.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :90.00 Max. :1.0000 Max. :5209.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :17.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:35.00 1st Qu.:0.0000 1st Qu.:219500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :263000 Median :1.000
## Mean :40.27 Mean :0.3251 Mean :266658 Mean :1.185
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:302000 3rd Qu.:1.200
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :6.100
## serum_sodium sex smoking time DEATH_EVENT
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 12.0 Min. :0
## 1st Qu.:135.5 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 95.0 1st Qu.:0
## Median :137.0 Median :1.0000 Median :0.0000 Median :172.0 Median :0
## Mean :137.2 Mean :0.6502 Mean :0.3251 Mean :158.3 Mean :0
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:213.0 3rd Qu.:0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0 Max. :0
summary(subset1)
## age anaemia creatinine_phosphokinase diabetes
## Min. :42.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:55.00 1st Qu.:0.0000 1st Qu.: 128.8 1st Qu.:0.0000
## Median :65.00 Median :0.0000 Median : 259.0 Median :0.0000
## Mean :65.22 Mean :0.4792 Mean : 670.2 Mean :0.4167
## 3rd Qu.:75.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 47000 Min. :0.600
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:197500 1st Qu.:1.075
## Median :30.00 Median :0.0000 Median :258500 Median :1.300
## Mean :33.47 Mean :0.4062 Mean :256381 Mean :1.836
## 3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:311000 3rd Qu.:1.900
## Max. :70.00 Max. :1.0000 Max. :621000 Max. :9.400
## serum_sodium sex smoking time
## Min. :116.0 Min. :0.0000 Min. :0.0000 Min. : 4.00
## 1st Qu.:133.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 25.50
## Median :135.5 Median :1.0000 Median :0.0000 Median : 44.50
## Mean :135.4 Mean :0.6458 Mean :0.3125 Mean : 70.89
## 3rd Qu.:138.2 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:102.25
## Max. :146.0 Max. :1.0000 Max. :1.0000 Max. :241.00
## DEATH_EVENT
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
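Comparing the two summaries shows where the subsets differ. Patients in subset1 (DEATH_EVENT == 1) are older on average (mean age 65.22 vs. 58.76), have a lower mean ejection_fraction (33.47 vs. 40.27), a higher mean serum_creatinine (1.836 vs. 1.185), and a much shorter mean time (70.89 vs. 158.3). The binary features diabetes, sex, and smoking have nearly identical means in both subsets. These differences suggest which variables may be related to the target variable, but they are descriptive only. Before modeling, check the class balance by comparing the number of rows in subset0 and subset1, and explore the subsets further with visualizations or additional analyses.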
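A side-by-side comparison of group means and class counts can also be produced in one step instead of reading two separate summaries. This is a minimal sketch using base R's `aggregate()` and `table()`; the data frame `toy` holds made-up values and stands in for the real data.

```r
# Hypothetical stand-in for the mortality data (made-up values)
toy = data.frame(
  age = c(60, 50, 70, 80),
  ejection_fraction = c(40, 45, 25, 30),
  DEATH_EVENT = c(0, 0, 1, 1)
)

# Mean of each numerical feature within each level of the target variable
group_means = aggregate(cbind(age, ejection_fraction) ~ DEATH_EVENT,
                        data = toy, FUN = mean)
print(group_means)

# Number of rows per class, to check how balanced the two subsets are
class_counts = table(toy$DEATH_EVENT)
print(class_counts)
```

The same two calls applied to `mortality` would give one row per DEATH_EVENT value and the size of each subset.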
table2 = data.frame(
a1= as.numeric(subset0$age),
a2= as.numeric(subset0$creatinine_phosphokinase),
a3= as.numeric(subset0$ejection_fraction),
a4= as.numeric(subset0$platelets),
a5= as.numeric(subset0$serum_creatinine),
a6= as.numeric(subset0$serum_sodium),
a7= as.numeric(subset0$time)
)
table2
correlations = numeric(length = 6)
for (i in 2:7) {
correlation = cor(table2$a1, table2[, i])
correlations[i-1] = correlation #store
}
names(correlations) = c("a1-a2", "a1-a3", "a1-a4","a1-a5", "a1-a6", "a1-a7")
print(correlations)
## a1-a2 a1-a3 a1-a4 a1-a5 a1-a6 a1-a7
## -0.04055170 0.08441109 -0.10869487 0.13758933 -0.01953457 -0.06985139
Max0 = max(correlations)
cat("\n Most positive correlation for subset0 -> ", names(which.max(correlations)), ": ", Max0, "\n")
##
## Most positive correlation for subset0 -> a1-a5 : 0.1375893
Min0 = min(correlations)
cat(" Most negative correlation for subset0 -> ", names(which.min(correlations)), ": ", Min0, "\n")
## Most negative correlation for subset0 -> a1-a4 : -0.1086949
table3 = data.frame(
a1= as.numeric(subset1$age),
a2= as.numeric(subset1$creatinine_phosphokinase),
a3= as.numeric(subset1$ejection_fraction),
a4= as.numeric(subset1$platelets),
a5= as.numeric(subset1$serum_creatinine),
a6= as.numeric(subset1$serum_sodium),
a7= as.numeric(subset1$time)
)
table3
correlations = numeric(length = 6)
for (i in 2:7) {
correlation = cor(table3$a1, table3[, i])
correlations[i-1] = correlation #store
}
names(correlations) = c("a1-a2", "a1-a3", "a1-a4", "a1-a5", "a1-a6", "a1-a7")
print(correlations)
## a1-a2 a1-a3 a1-a4 a1-a5 a1-a6 a1-a7
## -0.16314530 0.21688505 0.07237915 0.06321810 0.03550256 -0.18761577
Max1 = max(correlations)
cat("\n Most positive correlation for subset1 -> ", names(which.max(correlations)), ": ", Max1, "\n")
##
## Most positive correlation for subset1 -> a1-a3 : 0.216885
Min1 = min(correlations)
cat(" Most negative correlation for subset1 -> ", names(which.min(correlations)), ": ", Min1, "\n")
## Most negative correlation for subset1 -> a1-a7 : -0.1876158
My observations are as follows. Because max() and min() were applied to the signed coefficients, they identify the most positive and the most negative correlation with age; the strength of a relationship is the absolute value of the coefficient.
a. In the overall dataset, the most positive correlation is between a1 (age) and a5 (serum_creatinine), with a coefficient of 0.1591871. The most negative is between a1 (age) and a7 (time), with a coefficient of -0.2240684, which is also the largest in magnitude.
b. In subset0, the most positive correlation is still between a1 (age) and a5 (serum_creatinine), with a coefficient of 0.1375893. The most negative is between a1 (age) and a4 (platelets), with a coefficient of -0.1086949.
c. In subset1, the most positive correlation is between a1 (age) and a3 (ejection_fraction), with a coefficient of 0.216885. The most negative is between a1 (age) and a7 (time), with a coefficient of -0.1876158.
In all three cases the coefficients have a magnitude below 0.25, so age is only weakly linearly related to the other numerical features.
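The same correlation loop was written three times, once per data frame. One way to avoid the repetition is a small function applied to the full data and to each subset. The sketch below is an illustration under assumptions: `cor_with` is a hypothetical helper, and `toy` contains made-up values rather than the real measurements.

```r
# Hypothetical helper: correlation of one reference column with each listed column
cor_with = function(df, reference, others) {
  sapply(others, function(col) cor(df[[reference]], df[[col]]))
}

# Hypothetical stand-in for the mortality data (made-up values)
toy = data.frame(
  age = c(40, 50, 60, 70, 45, 55, 65, 75),
  time = c(200, 180, 150, 120, 90, 70, 40, 10),
  serum_sodium = c(135, 137, 136, 140, 130, 134, 133, 138),
  DEATH_EVENT = c(0, 0, 0, 0, 1, 1, 1, 1)
)

others = c("time", "serum_sodium")

# Correlations with age over all rows
overall = cor_with(toy, "age", others)

# The same computation within each level of DEATH_EVENT
by_group = lapply(split(toy, toy$DEATH_EVENT), cor_with,
                  reference = "age", others = others)

print(overall)
print(by_group)
```

Each result is a vector named by column, so the output is labeled with the actual feature names instead of a1 to a7.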