#use statistical models for data analysis
Outline: I. Statistics Basics a)Summaries and Histograms b) The relationship between mean and median c) Boxplots and Suspected Outliers d) Using natural logarithms and Qplots
I. Inferential Statistics a) Hypothesis testing and Confidence Intervals b) Interpreting Confidence Intervals c) Significance levels and P-values. d) Brief Introduction to Chi-Squared Test of Independence (will be continue) e) Contingency Tables for Categorical Variables f) T-Distribution g) Difference Between two Independent Means h) Difference Between more than two Independent Mean (ANOVA) i) Bootstrapping (Simulation base method)
References: a) Statistics Terminology b) (Bio)statistics in R: Part #1 by def me c) Statistics with R Coursera Certification by Duke University d) Performing the Non-parametric Bootstrap for statistical inference using R by Ian Dworkin e) Data visualization with ggplot Part 2 by Rick Scavetta f) Stack Exchange Reply by Penguin_Knight g) Outlier Detection by Zach Loertscher
##(Bio)statistics in R: Part #1
Let’s look at some examples in R with coin flips so it will make more sense.
We assume that coin is fair and the distribution of outcomes falls under binomial distribution with chance of getting heads = 0.5.
A binomial experiment is a statistical experiment that has the following properties: The experiment consists of n repeated trials. Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure. The probability of success, denoted by P, is the same on every trial.
set.seed(11)
# Generate two sample spaces for 10 coins, flip each coin 1 time. Event E = 1 (heads) with probability .5
exp_a <- rbinom(10, 1 , 0.5)
exp_b <- rbinom(10, 1, 0.5)
print(paste0("sample space of expt a: ", list(exp_a)))
## [1] "sample space of expt a: c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)"
print(paste0("sample space of expt b: ", list(exp_b)))
## [1] "sample space of expt b: c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)"
cat("\n")
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_a)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)"
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_b)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)"
cat("\n")
print(paste0("Coin flip is 'heads' (1) for both sample a and b: ", list(exp_a & exp_b)))
## [1] "Coin flip is 'heads' (1) for both sample a and b: c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for either sample a and b: ", list(exp_a | exp_b)))
## [1] "Coin flip is 'heads' (1) for either sample a and b: c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for sample a when b is 'tails' : ", list(exp_a & !exp_b)))
## [1] "Coin flip is 'heads' (1) for sample a when b is 'tails' : c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)"
#Probability Probability is simply likelihood that event will take place.
print(paste0("Probability that is 'heads' for both experiment a and b: ", round(mean(exp_a & exp_b), 3)))
## [1] "Probability that is 'heads' for both experiment a and b: 0.2"
print(paste0("Probability that is 'heads' for either experiment a and b: ", round(mean(exp_a | exp_b), 3)))
## [1] "Probability that is 'heads' for either experiment a and b: 0.5"
print(paste0("Probability that is 'heads' for both experiment a when its is tails of experiment b: ", round(mean(exp_a & !(exp_b)), 3)))
## [1] "Probability that is 'heads' for both experiment a when its is tails of experiment b: 0.1"
he duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods of a random sample of 14 HIV infected individuals is given (in years).
Calculate the sample mean. Calculate the sample median. Calculate the sample standard deviation.
inc_period <- c(12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9)
print(paste0("sample mean:", round(mean(inc_period), 3)))
## [1] "sample mean:9.85"
print(paste0("sample mean:", round(median(inc_period), 3)))
## [1] "sample mean:10"
print(paste0("sample mean:", round(sd(inc_period), 3)))
## [1] "sample mean:3.059"
Serum cholesterol (mmol/L) measured on a sample of 86 stroke patients (data of Markus et al., 1995). Calculate the distribution characteristics and plot the distribution.
cholesterol <- c(3.7,4.8,5.4,5.6,6.1,6.4,7.0,7.6,8.7,3.8,4.9,5.4,5.6,6.1,6.5,7.0,7.6,8.9,3.8,4.9,5.5,5.7,6.1,
6.5,7.1,7.6,9.3,4.4,4.9,5.5,5.7,6.2,6.6,7.1,7.7,9.5,4.5,5.0,5.5,5.7,6.3,6.7,7.2,7.8,10.2,4.5,
5.1,5.6,5.8,6.3,6.7,7.3,7.8,10.4,4.5,5.1,5.6,5.8,6.4,6.8,7.4,7.8,4.7,5.2,5.6,5.9,6.4,6.8,7.4,
8.2,4.7,5.3,5.6,6.0,6.4,7.0,7.5,8.3,4.8,5.3,5.6,6.1,6.4,7.0,7.5,8.6)
print(paste0("sample mean:", round(mean(cholesterol), 3)))
## [1] "sample mean:6.341"
print(paste0("sample mean:", round(median(cholesterol), 3)))
## [1] "sample mean:6.15"
print(paste0("sample mean:", round(var(cholesterol), 3)))
## [1] "sample mean:1.959"
print(paste0("sample mean:", round(sd(cholesterol), 3)))
## [1] "sample mean:1.4"
df <- data.frame(cholesterol)
ggplot(df, aes(x = cholesterol))+geom_histogram(bins = 20, fill = "green", alpha = 0.3)+
geom_vline(aes(xintercept = mean(cholesterol)), color = "red", size = 0.5) +
geom_vline(aes(xintercept = median(cholesterol)), color = "red", linetype = "dashed", size = 0.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(df, aes(x = "", y = cholesterol)) +
geom_boxplot(fill = "green", alpha = 0.3) +
stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2)
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The data below show the consumption of alcohol (X, liters per year per person, 14 years or older) and the death rate from cirrhosis, a liver disease (Y, death per 100,000 population) in 15 countries (each country is an observation unit).
Draw a Scatter Diagram to show the association, if any, between these two variables; can you draw any conclusion/observation without doing any calculation?
Calculate the Coefficient of Covariance and Correlation.
X <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 6.6, 5.8, 5.7, 5.6, 4.2, 3.9, 3.1)
Y <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3.0, 7.2, 10.6, 3.7, 3.4, 4.3, 3.6, 5.4)
df <- data.frame(X, Y)
#to visulize the connection is to draw a scatter plot
ggplot(df, aes(x = X, y = Y))+geom_point()+
geom_smooth(method = lm, se = FALSE) +
labs (x = "Alcohol consumption", y = "Death Rate from Cirrhosis",
title = "Relation between two variables")
## `geom_smooth()` using formula = 'y ~ x'
It looks like we have strong postive correlation - as Alcohol consumption increases Death Rate from Cirrhosis also increases. Let’s calculate coefficients to check our thoughts.
print(paste0("covariance between alcohol consumption adn death rate from cirrhosis: ", round(cov(X, Y),2)))
## [1] "covariance between alcohol consumption adn death rate from cirrhosis: 60.68"
print(paste0("Correlation betweel Alcohol consumption and death rate from cirrhosis ", round(cor(X,Y),2)))
## [1] "Correlation betweel Alcohol consumption and death rate from cirrhosis 0.94"
#Conditional Probability What is the probability of getting a certain illness between 0 and 30 years of age, if the likelihood of being of that age is 0.4, and the overall (marginal) probability (in the whole population) of getting this illness is 0.1?
pa <- 0.4
pb <- 0.1
pa_given_pb <- pa * pb / pb
print(paste0("Probability of getting a certain illness is: ", pa_given_pb))
## [1] "Probability of getting a certain illness is: 0.4"
There are 3 diseases, (A, B, C), which occur with probabilities 0.5, 0.3 and 0.2 respectively in people coming to an outpatient clinic. The probability of chest pain in these diseases is 0.8, 0.4 and 0.2 respectively. What is the probability that people at the outpatient clinic experience chest pain? What is the probability of disease C in a person who has no chest pain?
pa <- 0.5
pb <- 0.3
pc <- 0.2
chest_given_pa <- 0.8
chest_given_pb <- 0.4
chest_given_pc <- 0.2
p_chest_pain <- pa * chest_given_pa + pb * chest_given_pb + pc * chest_given_pc
p_chest_given_no_chest <- pc *(1 - chest_given_pc) / (1 - p_chest_pain)
print(paste0("Probability that people at the outpatient clinic experience chest pain: ", p_chest_pain))
## [1] "Probability that people at the outpatient clinic experience chest pain: 0.56"
print(paste0("Probability of disease C in a person who has no chest pain: ", round(p_chest_given_no_chest,3)))
## [1] "Probability of disease C in a person who has no chest pain: 0.364"
#Bayes Rule¶ Bayes rule describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
A study comparing the efficacy of HIV tests concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. We are interested in calculating the probability that this subject has HIV. Mathematically, we would like to calculate P(D+|T+) , the probability of being HIV positive (D+) given that one test was positive (T+) . What is known is the sensitivity of the test, P(T+|D+) = 0.997, the specificity of the test, P(T−|D−) = 0.985 and the prevalence of the disease in the target population, P(D+) = 0.001. Using Bayes formula we can calculate the quantity that we are interested in.
p_dp_given_tp <- (0.997*0.001)/(0.997*0.001 + (1-0.985)*(1-0.001))
print(paste0("Probability of being HIV positive given that one test was positive: ", round(p_dp_given_tp,3)))
## [1] "Probability of being HIV positive given that one test was positive: 0.062"
Thus, in this population, a positive test result corresponds to only a 6% probability that the subject is actually HIV positive. This is the positive predictive value of the test (PPV). The low positive predictive value is due to low prevalence of the disease and the somewhat modest specificity relative to the prevalence of the disease.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
data <- read.csv("FIC_FUll.csv")
sample_n(data, 5)
## Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1 50 41-50 Male RURAL MARRIED NO NO FREE
## 2 55 51-60 Male URBAN MARRIED YES YES FREE
## 3 56 51-60 Male URBAN MARRIED YES YES FREE
## 4 55 51-60 Male URBAN MARRIED NO NO FREE
## 5 50 41-50 Male RURAL MARRIED YES NO FREE
## Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1 YES YES YES YES 1 0 NO NO
## 2 YES YES YES NO 0 1 YES NO
## 3 YES YES YES NO 0 1 YES NO
## 4 YES YES NO YES 1 0 NO NO
## 5 YES YES YES NO 0 0 YES NO
## BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 80.5 0 125 2.3 0.9 129 4.2 96 65
## 2 100.7 0 323 25.0 0.8 139 3.9 100 3877
## 3 150.9 0 260 31.0 0.8 139 4.4 110 319
## 4 130.8 0 127 46.0 1.2 139 4.3 108 188
## 5 100.6 0 348 47.0 1.1 134 4.4 96 336
## CK.MB ESR WBC RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1 36 11 9800 6.15 16.5 0.51 82.1 26.9 0.33 236000
## 2 299 25 10080 5.70 13.3 0.41 72.0 23.0 0.32 391000
## 3 51 11 9800 6.15 16.5 0.51 82.1 26.9 0.33 236000
## 4 45 11 14400 5.70 16.3 0.48 84.0 33.0 0.32 217000
## 5 22 9 10300 5.41 15.5 0.46 86.0 28.0 0.33 261000
## NEUTROPHIL LYMPHO MONOCYTE EOSINO Others
## 1 0.75 0.20 0.03 2 ICMP WITH EF= 30%,PULMONAR ODEMA
## 2 0.90 0.08 0.01 1 no
## 3 0.75 0.20 0.03 2 no
## 4 0.57 0.32 0.07 2 ORTHOPENIA, PND
## 5 0.80 0.17 0.02 1 HTN, DM
## CO Diagnosis
## 1 SOB, DIZZINESS, CHEST PAIN,NAUSEA,DIAPHORESIS CARDIOGENIC SHOCK
## 2 Chest pain4 HR, SWEATING AC I/W M.I
## 3 Chest pain,vomiting, sweating A/C, A/W M.I
## 4 Central Chest pain,SOB, Sweating NSTEM.I
## 5 Chest pain,COLD SWEATING, A/S M.I
## Hypersensitivity cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 NO 4 142 309 0 2 147 1 0.0 2 3
## 2 NO 4 110 275 0 2 118 1 1.0 2 1
## 3 NO 4 132 184 0 2 105 1 2.1 2 1
## 4 YES 4 130 256 1 2 150 1 0.0 1 2
## 5 NO 3 140 233 0 0 163 0 0.6 2 1
## thal num SK SK.React Reaction Mortality Follow.Up
## 1 7 3 1 LUNGS 1 1 15
## 2 3 1 1 STOMACH.BLEEDING 1 0 15
## 3 6 1 1 COUGH.BLEEDING 1 0 32
## 4 7 3 1 SKIN.BLEEDING 1 0 9
## 5 7 1 1 BODY.PAIN 1 0 15
glimpse(data)
## Rows: 368
## Columns: 60
## $ Age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ Age.Group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ Gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ Locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ Marital.status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ Life.Style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ Sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ Category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ Depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Family.History <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ F.History <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ HTN <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ Allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ BP <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ Thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGR <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ B.Urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ S.Cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ S.Sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ S.Potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ S.Chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ C.P.K <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ CK.MB <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ ESR <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ WBC <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ RBC <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ Hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ P.C.V <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ M.C.V <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ M.C.H <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ M.C.H.C <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ PLATELET_COUNT <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ NEUTROPHIL <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ LYMPHO <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ MONOCYTE <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ EOSINO <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ Others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ CO <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ Diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ Hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ SK <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SK.React <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ Follow.Up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
colSums(is.na(data))
## Age Age.Group Gender Locality
## 0 0 0 0
## Marital.status Life.Style Sleep Category
## 0 0 0 0
## Depression Hyperlipi Smoking Family.History
## 0 0 0 0
## F.History Diabetes HTN Allergies
## 0 0 0 0
## BP Thrombolysis BGR B.Urea
## 0 0 0 0
## S.Cr S.Sodium S.Potassium S.Chloride
## 0 0 0 0
## C.P.K CK.MB ESR WBC
## 0 0 0 0
## RBC Hemoglobin P.C.V M.C.V
## 0 0 0 0
## M.C.H M.C.H.C PLATELET_COUNT NEUTROPHIL
## 0 0 0 0
## LYMPHO MONOCYTE EOSINO Others
## 0 0 0 0
## CO Diagnosis Hypersensitivity cp
## 0 0 0 0
## trestbps chol fbs restecg
## 0 0 0 0
## thalach exang oldpeak slope
## 0 0 0 0
## ca thal num SK
## 0 0 0 0
## SK.React Reaction Mortality Follow.Up
## 0 0 0 0
data <- clean_names(data)
#for data cleaning
#data2 <- data %>% filter(Gender != "Other" & !(is.na(Diabetes))) %>% select(-Age)
colnames(data)
## [1] "age" "age_group" "gender" "locality"
## [5] "marital_status" "life_style" "sleep" "category"
## [9] "depression" "hyperlipi" "smoking" "family_history"
## [13] "f_history" "diabetes" "htn" "allergies"
## [17] "bp" "thrombolysis" "bgr" "b_urea"
## [21] "s_cr" "s_sodium" "s_potassium" "s_chloride"
## [25] "c_p_k" "ck_mb" "esr" "wbc"
## [29] "rbc" "hemoglobin" "p_c_v" "m_c_v"
## [33] "m_c_h" "m_c_h_c" "platelet_count" "neutrophil"
## [37] "lympho" "monocyte" "eosino" "others"
## [41] "co" "diagnosis" "hypersensitivity" "cp"
## [45] "trestbps" "chol" "fbs" "restecg"
## [49] "thalach" "exang" "oldpeak" "slope"
## [53] "ca" "thal" "num" "sk"
## [57] "sk_react" "reaction" "mortality" "follow_up"
glimpse(data)
## Rows: 368
## Columns: 60
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
table(data$marital_status)
##
## MARRIED SINGLE
## 365 3
#convert some variables into factor
data$family.History <- as.factor(data$family_history)
data$diabetes <- as.factor(data$diabetes)
data$mortality <- as.factor(data$mortality)
colnames(data)
## [1] "age" "age_group" "gender" "locality"
## [5] "marital_status" "life_style" "sleep" "category"
## [9] "depression" "hyperlipi" "smoking" "family_history"
## [13] "f_history" "diabetes" "htn" "allergies"
## [17] "bp" "thrombolysis" "bgr" "b_urea"
## [21] "s_cr" "s_sodium" "s_potassium" "s_chloride"
## [25] "c_p_k" "ck_mb" "esr" "wbc"
## [29] "rbc" "hemoglobin" "p_c_v" "m_c_v"
## [33] "m_c_h" "m_c_h_c" "platelet_count" "neutrophil"
## [37] "lympho" "monocyte" "eosino" "others"
## [41] "co" "diagnosis" "hypersensitivity" "cp"
## [45] "trestbps" "chol" "fbs" "restecg"
## [49] "thalach" "exang" "oldpeak" "slope"
## [53] "ca" "thal" "num" "sk"
## [57] "sk_react" "reaction" "mortality" "follow_up"
## [61] "family.History"
#EDA
library(janitor)
data <- clean_names(data)
#relationship of diabetes status with marital status
data %>% select(marital_status, platelet_count) %>%
group_by(marital_status) %>% summarise(
mean = round(mean(platelet_count),2),
median = median(platelet_count),
variance = round(var(platelet_count), 2),
std = round(sd(platelet_count), 2)
)
## # A tibble: 2 × 5
## marital_status mean median variance std
## <chr> <dbl> <int> <dbl> <dbl>
## 1 MARRIED 248507. 237000 5894481409 76776.
## 2 SINGLE 267333. 268000 6400333333. 80002.
ggplot(data, aes(x = platelet_count, fill = marital_status))+
geom_histogram(bins = 40, alpha = 0.5, position = "identity")+
geom_vline(aes(xintercept = mean(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)), color = "blue", size = 0.5)+
geom_vline(aes(xintercept = mean(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)), color = "red", size = 0.5)+
geom_vline(aes(xintercept = median(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)), color = "blue", linetype = "dashed", size = 0.5)+
geom_vline(aes(xintercept = median(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)), color = "red",linetype = "dashed", size = 0.5)+
theme_bw()
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
ggplot(data, aes(x = marital_status, y = platelet_count)) +
geom_boxplot(aes(fill = marital_status)) +
stat_summary(fun.y = mean, color = "darkred", geom = "point", size =3) +
theme_classic()
summary(data)
## age age_group gender locality
## Min. :24.00 Length:368 Length:368 Length:368
## 1st Qu.:50.75 Class :character Class :character Class :character
## Median :55.00 Mode :character Mode :character Mode :character
## Mean :54.29
## 3rd Qu.:60.25
## Max. :77.00
## marital_status life_style sleep category
## Length:368 Length:368 Length:368 Length:368
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## depression hyperlipi smoking family_history
## Length:368 Length:368 Length:368 Length:368
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## f_history diabetes htn allergies
## Min. :0.0000 0:198 Length:368 Length:368
## 1st Qu.:0.0000 1:170 Class :character Class :character
## Median :0.0000 Mode :character Mode :character
## Mean :0.1957
## 3rd Qu.:0.0000
## Max. :1.0000
## bp thrombolysis bgr b_urea
## Min. : 80.5 Min. :0.00000 Min. : 60 Min. : 2.30
## 1st Qu.:100.7 1st Qu.:0.00000 1st Qu.:117 1st Qu.: 28.00
## Median :120.8 Median :0.00000 Median :164 Median : 36.00
## Mean :121.2 Mean :0.03261 Mean :220 Mean : 51.68
## 3rd Qu.:140.7 3rd Qu.:0.00000 3rd Qu.:291 3rd Qu.: 43.00
## Max. :190.1 Max. :1.00000 Max. :563 Max. :394.00
## s_cr s_sodium s_potassium s_chloride
## Min. : 0.600 Min. :129 Min. :3.300 Min. : 90.0
## 1st Qu.: 0.900 1st Qu.:135 1st Qu.:3.900 1st Qu.:100.0
## Median : 0.900 Median :138 Median :4.200 Median :104.0
## Mean : 1.717 Mean :138 Mean :4.211 Mean :103.8
## 3rd Qu.: 1.100 3rd Qu.:141 3rd Qu.:4.400 3rd Qu.:107.0
## Max. :22.900 Max. :146 Max. :5.300 Max. :112.0
## c_p_k ck_mb esr wbc
## Min. : 52.0 Min. : 14.00 Min. : 5.00 Min. : 5800
## 1st Qu.: 135.0 1st Qu.: 21.00 1st Qu.: 11.00 1st Qu.: 7800
## Median : 188.0 Median : 36.00 Median : 16.00 Median :10650
## Mean : 553.9 Mean : 62.49 Mean : 26.57 Mean :11181
## 3rd Qu.: 390.0 3rd Qu.: 52.00 3rd Qu.: 25.00 3rd Qu.:13500
## Max. :4289.0 Max. :505.00 Max. :154.00 Max. :19590
## rbc hemoglobin p_c_v m_c_v
## Min. :3.46 Min. : 9.10 Min. :0.2900 Min. :60.00
## 1st Qu.:4.40 1st Qu.:12.30 1st Qu.:0.3600 1st Qu.:78.00
## Median :5.20 Median :14.20 Median :0.4250 Median :82.10
## Mean :5.09 Mean :13.91 Mean :0.4153 Mean :81.57
## 3rd Qu.:5.65 3rd Qu.:15.50 3rd Qu.:0.4600 3rd Qu.:86.00
## Max. :6.98 Max. :18.00 Max. :0.5400 Max. :96.00
## m_c_h m_c_h_c platelet_count neutrophil
## Min. :18.00 Min. :0.2200 Min. : 20000 Min. : 0.360
## 1st Qu.:25.70 1st Qu.:0.3200 1st Qu.:192000 1st Qu.: 0.600
## Median :27.90 Median :0.3300 Median :237000 Median : 0.720
## Mean :27.27 Mean :0.3294 Mean :248660 Mean : 2.927
## 3rd Qu.:29.00 3rd Qu.:0.3400 3rd Qu.:287000 3rd Qu.: 0.800
## Max. :33.00 Max. :0.3900 Max. :459000 Max. :83.000
## lympho monocyte eosino others
## Min. :0.0500 Min. :0.0100 Min. :1.000 Length:368
## 1st Qu.:0.1700 1st Qu.:0.0200 1st Qu.:2.000 Class :character
## Median :0.2100 Median :0.0300 Median :2.000 Mode :character
## Mean :0.2489 Mean :0.0325 Mean :2.255
## 3rd Qu.:0.3200 3rd Qu.:0.0400 3rd Qu.:3.000
## Max. :0.5400 Max. :0.0800 Max. :5.000
## co diagnosis hypersensitivity cp
## Length:368 Length:368 Length:368 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.671
## 3rd Qu.:4.000
## Max. :4.000
## trestbps chol fbs restecg
## Min. :100.0 Min. :131.0 Min. :0.0000 Min. :0.000
## 1st Qu.:120.0 1st Qu.:212.0 1st Qu.:0.0000 1st Qu.:0.000
## Median :130.0 Median :249.0 Median :0.0000 Median :2.000
## Mean :132.7 Mean :248.9 Mean :0.1413 Mean :1.073
## 3rd Qu.:142.0 3rd Qu.:283.0 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :200.0 Max. :409.0 Max. :1.0000 Max. :2.000
## thalach exang oldpeak slope
## Min. : 71.0 Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:125.0 1st Qu.:0.0000 1st Qu.:0.275 1st Qu.:1.750
## Median :144.0 Median :1.0000 Median :1.200 Median :2.000
## Mean :140.9 Mean :0.5625 Mean :1.542 Mean :1.842
## 3rd Qu.:158.0 3rd Qu.:1.0000 3rd Qu.:2.200 3rd Qu.:2.000
## Max. :195.0 Max. :1.0000 Max. :6.200 Max. :3.000
## ca thal num sk
## Min. :0.000 Min. :3.000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.0000
## Median :1.000 Median :7.000 Median :2.000 Median :1.0000
## Mean :1.003 Mean :5.859 Mean :2.035 Mean :0.9837
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :7.000 Max. :4.000 Max. :1.0000
## sk_react reaction mortality follow_up family_history_2
## Length:368 Min. :0.0000 0:288 Min. : 1.00 NO :296
## Class :character 1st Qu.:0.0000 1: 80 1st Qu.:15.00 YES: 72
## Mode :character Median :1.0000 Median :32.00
## Mean :0.7473 Mean :28.65
## 3rd Qu.:1.0000 3rd Qu.:36.00
## Max. :1.0000 Max. :60.00
distribution_plot <- function(y){
#function return histogram and boxplot of chosen variable distribution
box_plt <- ggplot(data, aes(x = platelet_count, y = eval(parse(text = y))))+
geom_boxplot(aes(fill = platelet_count))+
stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2) +
labs(x = "Platelet Count", y = y)+theme_bw()
hstgm <- ggplot(data, aes(x = eval(parse(text = y)), fill = mortality)) + geom_histogram(bins = 40, alpha = 0.5, position = "identity")+
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "blue", size = 0.5) +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "blue", size = 0.5) +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
labs(x = y) + theme_bw()
return(plot_grid(hstgm, box_plt))
}
distribution_plot("s_sodium")
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Here we could see that histograms were not the best oprions for us since
there are not so many subjects with positive stroke outcome so we were
not able to explore the distribution properly.
distribution_ratio <- function(x){
#fucntion creates a stacked bar plots
plt <- data %>% select(s_sodium,x, gender) %>%
group_by(s_sodium, var = eval(parse(text = x))) %>%
summarise(count = length(gender)) %>%
group_by(s_sodium) %>%
mutate(ratio = round(count * 100 / sum(count), 1)) %>%
ggplot(aes(y = ratio, x = s_sodium, fill = var)) +
geom_bar(stat = "identity") +
labs(title = paste0("Ratio of stroke outcome by ", x), fill = x) +
theme_bw()
return(plt)
}
distribution_ratio("bp")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(x)
##
## # Now:
## data %>% select(all_of(x))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
Let’s stop here for a moment and remember about discrete variables and
probability.
According to our sample we can make following assumptions:
H <- as.numeric(levels(data$diabetes))[data$diabetes]
S <- as.numeric(levels(data$mortality))[data$mortality]
p_sp_and_hp <- mean(H&S)
p_sp_or_hp <- mean(H|S)
p_sp_and_hn <- mean(H&!S)
print(paste0("Probability of suffereing from having a disturbed platelet count:", round(p_sp_and_hp,3)))
## [1] "Probability of suffereing from having a disturbed platelet count:0.166"
print(paste0("Probability of suffereing or having a disturbed platelet count:", round(p_sp_or_hp,3)))
## [1] "Probability of suffereing or having a disturbed platelet count:0.514"
print(paste0("Probability of suffereing without having disturbed platelet count:", round(p_sp_and_hn,3)))
## [1] "Probability of suffereing without having disturbed platelet count:0.296"
distribution_ratio("diabetes")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
distribution_ratio("marital_status")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
glimpse(data)
## Rows: 368
## Columns: 61
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <fct> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
## $ family_history_2 <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, N…
#Test Outcome Analysis¶ Now I will try to predict the stroke outcome with the help of random forest. I will not go into details about what is random forest, how it works and how to adjust the model. My point here is to show how can we analyse model performance (can be applied both for machine learning or real testings). You can check out my project on Dementia Prediction with Tree-based Models to know more about tree-based models.
frml <- (diabetes ~ gender+age+follow_up+thalach+chol+trestbps)
model_rf <- randomForest(formula = frml, data = data)
prediction_rf <- predict(object = model_rf,
newdata = select(data, -diabetes),
type = "class")
confusionMatrix(data = prediction_rf, reference = data$diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 197 1
## 1 1 169
##
## Accuracy : 0.9946
## 95% CI : (0.9805, 0.9993)
## No Information Rate : 0.538
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9891
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9949
## Specificity : 0.9941
## Pos Pred Value : 0.9949
## Neg Pred Value : 0.9941
## Prevalence : 0.5380
## Detection Rate : 0.5353
## Detection Prevalence : 0.5380
## Balanced Accuracy : 0.9945
##
## 'Positive' Class : 0
##
print(paste0("AUC = ", round(Metrics::auc(actual = data$diabetes, predicted = prediction_rf), 4)))
## [1] "AUC = 0.9945"
plot(roc(prediction_rf, data$diabetes), main = "ROC", cex.axis = 0.6, cex.main = 0.8, cex.lab = 0.8)