#use statistical models for data analysis

Outline: I. Statistics Basics a)Summaries and Histograms b) The relationship between mean and median c) Boxplots and Suspected Outliers d) Using natural logarithms and Qplots

I. Inferential Statistics a) Hypothesis testing and Confidence Intervals b) Interpreting Confidence Intervals c) Significance levels and P-values. d) Brief Introduction to Chi-Squared Test of Independence (will be continue) e) Contingency Tables for Categorical Variables f) T-Distribution g) Difference Between two Independent Means h) Difference Between more than two Independent Mean (ANOVA) i) Bootstrapping (Simulation base method)

  1. Other Statistics
  1. Mosaic Plots and Contingency Tables b. Chloropeth Maps with GGPlot2

References: a) Statistics Terminology b) (Bio)statistics in R: Part #1 by def me c) Statistics with R Coursera Certification by Duke University d) Performing the Non-parametric Bootstrap for statistical inference using R by Ian Dworkin e) Data visualization with ggplot Part 2 by Rick Scavetta f) Stack Exchange Reply by Penguin_Knight g) Outlier Detection by Zach Loertscher

##(Bio)statistics in R: Part #1

Let’s look at some examples in R with coin flips so it will make more sense.

We assume that coin is fair and the distribution of outcomes falls under binomial distribution with chance of getting heads = 0.5.

A binomial experiment is a statistical experiment that has the following properties: The experiment consists of n repeated trials. Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure. The probability of success, denoted by P, is the same on every trial.

set.seed(11)
# Generate two sample spaces for 10 coins, flip each coin 1 time. Event E = 1 (heads) with probability .5

exp_a <- rbinom(10, 1 , 0.5)
exp_b <- rbinom(10, 1, 0.5)

print(paste0("sample space of expt a: ", list(exp_a)))
## [1] "sample space of expt a: c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)"
print(paste0("sample space of expt b: ", list(exp_b)))
## [1] "sample space of expt b: c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)"
cat("\n")
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_a)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)"
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_b)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)"
cat("\n")
print(paste0("Coin flip is 'heads' (1) for both sample a and b: ", list(exp_a & exp_b)))
## [1] "Coin flip is 'heads' (1) for both sample a and b: c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for either sample a and b: ", list(exp_a | exp_b)))
## [1] "Coin flip is 'heads' (1) for either sample a and b: c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for sample a when b is 'tails' : ", list(exp_a & !exp_b)))
## [1] "Coin flip is 'heads' (1) for sample a when b is 'tails' : c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)"

#Probability Probability is simply likelihood that event will take place.

print(paste0("Probability that is 'heads' for both experiment a and b: ", round(mean(exp_a & exp_b), 3)))
## [1] "Probability that is 'heads' for both experiment a and b: 0.2"
print(paste0("Probability that is 'heads' for either experiment a and b: ", round(mean(exp_a | exp_b), 3)))
## [1] "Probability that is 'heads' for either experiment a and b: 0.5"
print(paste0("Probability that is 'heads' for both experiment a when its is tails of experiment b: ", round(mean(exp_a & !(exp_b)), 3)))
## [1] "Probability that is 'heads' for both experiment a when its is tails of experiment b: 0.1"

he duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods of a random sample of 14 HIV infected individuals is given (in years).

Calculate the sample mean. Calculate the sample median. Calculate the sample standard deviation.

inc_period <- c(12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9)
print(paste0("sample mean:", round(mean(inc_period), 3)))
## [1] "sample mean:9.85"
print(paste0("sample mean:", round(median(inc_period), 3)))
## [1] "sample mean:10"
print(paste0("sample mean:", round(sd(inc_period), 3)))
## [1] "sample mean:3.059"

Serum cholesterol (mmol/L) measured on a sample of 86 stroke patients (data of Markus et al., 1995). Calculate the distribution characteristics and plot the distribution.

cholesterol <- c(3.7,4.8,5.4,5.6,6.1,6.4,7.0,7.6,8.7,3.8,4.9,5.4,5.6,6.1,6.5,7.0,7.6,8.9,3.8,4.9,5.5,5.7,6.1,
                 6.5,7.1,7.6,9.3,4.4,4.9,5.5,5.7,6.2,6.6,7.1,7.7,9.5,4.5,5.0,5.5,5.7,6.3,6.7,7.2,7.8,10.2,4.5,
                 5.1,5.6,5.8,6.3,6.7,7.3,7.8,10.4,4.5,5.1,5.6,5.8,6.4,6.8,7.4,7.8,4.7,5.2,5.6,5.9,6.4,6.8,7.4,
                 8.2,4.7,5.3,5.6,6.0,6.4,7.0,7.5,8.3,4.8,5.3,5.6,6.1,6.4,7.0,7.5,8.6)

print(paste0("sample mean:", round(mean(cholesterol), 3)))
## [1] "sample mean:6.341"
print(paste0("sample mean:", round(median(cholesterol), 3)))
## [1] "sample mean:6.15"
print(paste0("sample mean:", round(var(cholesterol), 3)))
## [1] "sample mean:1.959"
print(paste0("sample mean:", round(sd(cholesterol), 3)))
## [1] "sample mean:1.4"
df <- data.frame(cholesterol)
ggplot(df, aes(x = cholesterol))+geom_histogram(bins = 20, fill = "green", alpha = 0.3)+
  geom_vline(aes(xintercept = mean(cholesterol)), color = "red", size = 0.5) + 
  geom_vline(aes(xintercept = median(cholesterol)), color = "red", linetype = "dashed", size = 0.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(df, aes(x = "", y = cholesterol)) +
  geom_boxplot(fill = "green", alpha = 0.3) +
  stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2)
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The data below show the consumption of alcohol (X, liters per year per person, 14 years or older) and the death rate from cirrhosis, a liver disease (Y, death per 100,000 population) in 15 countries (each country is an observation unit).

Draw a Scatter Diagram to show the association, if any, between these two variables; can you draw any conclusion/observation without doing any calculation?

Calculate the Coefficient of Covariance and Correlation.

X <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 6.6, 5.8, 5.7, 5.6, 4.2, 3.9, 3.1)
Y <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3.0, 7.2, 10.6, 3.7, 3.4, 4.3, 3.6, 5.4)

df <- data.frame(X, Y)

#to visulize the connection is to draw a scatter plot
ggplot(df, aes(x = X, y = Y))+geom_point()+
  geom_smooth(method = lm, se = FALSE) +
  labs (x = "Alcohol consumption", y = "Death Rate from Cirrhosis",
        title = "Relation between two variables")
## `geom_smooth()` using formula = 'y ~ x'

It looks like we have strong postive correlation - as Alcohol consumption increases Death Rate from Cirrhosis also increases. Let’s calculate coefficients to check our thoughts.

print(paste0("covariance between alcohol consumption adn death rate from cirrhosis: ", round(cov(X, Y),2)))
## [1] "covariance between alcohol consumption adn death rate from cirrhosis: 60.68"
print(paste0("Correlation betweel Alcohol consumption and death rate from cirrhosis ", round(cor(X,Y),2)))
## [1] "Correlation betweel Alcohol consumption and death rate from cirrhosis 0.94"

#Conditional Probability What is the probability of getting a certain illness between 0 and 30 years of age, if the likelihood of being of that age is 0.4, and the overall (marginal) probability (in the whole population) of getting this illness is 0.1?

pa <- 0.4
pb <- 0.1
pa_given_pb <- pa * pb / pb

print(paste0("Probability of getting a certain illness is: ", pa_given_pb))
## [1] "Probability of getting a certain illness is: 0.4"

There are 3 diseases, (A, B, C), which occur with probabilities 0.5, 0.3 and 0.2 respectively in people coming to an outpatient clinic. The probability of chest pain in these diseases is 0.8, 0.4 and 0.2 respectively. What is the probability that people at the outpatient clinic experience chest pain? What is the probability of disease C in a person who has no chest pain?

pa <- 0.5
pb <- 0.3
pc <- 0.2

chest_given_pa <- 0.8
chest_given_pb <- 0.4
chest_given_pc <- 0.2

p_chest_pain <- pa * chest_given_pa + pb * chest_given_pb + pc * chest_given_pc
p_chest_given_no_chest <- pc *(1 - chest_given_pc) / (1 - p_chest_pain)

print(paste0("Probability that people at the outpatient clinic experience chest pain: ", p_chest_pain))
## [1] "Probability that people at the outpatient clinic experience chest pain: 0.56"
print(paste0("Probability of disease C in a person who has no chest pain: ", round(p_chest_given_no_chest,3)))
## [1] "Probability of disease C in a person who has no chest pain: 0.364"

#Bayes Rule¶ Bayes rule describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

A study comparing the efficacy of HIV tests concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. We are interested in calculating the probability that this subject has HIV. Mathematically, we would like to calculate P(D+|T+) , the probability of being HIV positive (D+) given that one test was positive (T+) . What is known is the sensitivity of the test, P(T+|D+) = 0.997, the specificity of the test, P(T−|D−) = 0.985 and the prevalence of the disease in the target population, P(D+) = 0.001. Using Bayes formula we can calculate the quantity that we are interested in.

p_dp_given_tp <- (0.997*0.001)/(0.997*0.001 + (1-0.985)*(1-0.001))
print(paste0("Probability of being HIV positive given that one test was positive: ", round(p_dp_given_tp,3)))
## [1] "Probability of being HIV positive given that one test was positive: 0.062"

Thus, in this population, a positive test result corresponds to only a 6% probability that the subject is actually HIV positive. This is the positive predictive value of the test (PPV). The low positive predictive value is due to low prevalence of the disease and the somewhat modest specificity relative to the prevalence of the disease.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
data <- read.csv("FIC_FUll.csv")
sample_n(data, 5)
##   Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1  50     41-50   Male    RURAL        MARRIED         NO    NO     FREE
## 2  55     51-60   Male    URBAN        MARRIED        YES   YES     FREE
## 3  56     51-60   Male    URBAN        MARRIED        YES   YES     FREE
## 4  55     51-60   Male    URBAN        MARRIED         NO    NO     FREE
## 5  50     41-50   Male    RURAL        MARRIED        YES    NO     FREE
##   Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1        YES       YES     YES            YES         1        0  NO        NO
## 2        YES       YES     YES             NO         0        1 YES        NO
## 3        YES       YES     YES             NO         0        1 YES        NO
## 4        YES       YES      NO            YES         1        0  NO        NO
## 5        YES       YES     YES             NO         0        0 YES        NO
##      BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1  80.5            0 125    2.3  0.9      129         4.2         96    65
## 2 100.7            0 323   25.0  0.8      139         3.9        100  3877
## 3 150.9            0 260   31.0  0.8      139         4.4        110   319
## 4 130.8            0 127   46.0  1.2      139         4.3        108   188
## 5 100.6            0 348   47.0  1.1      134         4.4         96   336
##   CK.MB ESR   WBC  RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1    36  11  9800 6.15       16.5  0.51  82.1  26.9    0.33         236000
## 2   299  25 10080 5.70       13.3  0.41  72.0  23.0    0.32         391000
## 3    51  11  9800 6.15       16.5  0.51  82.1  26.9    0.33         236000
## 4    45  11 14400 5.70       16.3  0.48  84.0  33.0    0.32         217000
## 5    22   9 10300 5.41       15.5  0.46  86.0  28.0    0.33         261000
##   NEUTROPHIL LYMPHO MONOCYTE EOSINO                           Others
## 1       0.75   0.20     0.03      2 ICMP WITH EF= 30%,PULMONAR ODEMA
## 2       0.90   0.08     0.01      1                               no
## 3       0.75   0.20     0.03      2                               no
## 4       0.57   0.32     0.07      2                  ORTHOPENIA, PND
## 5       0.80   0.17     0.02      1                          HTN, DM
##                                              CO         Diagnosis
## 1 SOB, DIZZINESS, CHEST PAIN,NAUSEA,DIAPHORESIS CARDIOGENIC SHOCK
## 2                      Chest pain4 HR, SWEATING        AC I/W M.I
## 3                 Chest pain,vomiting, sweating      A/C, A/W M.I
## 4              Central Chest pain,SOB, Sweating           NSTEM.I
## 5                    Chest pain,COLD SWEATING,            A/S M.I
##   Hypersensitivity cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1               NO  4      142  309   0       2     147     1     0.0     2  3
## 2               NO  4      110  275   0       2     118     1     1.0     2  1
## 3               NO  4      132  184   0       2     105     1     2.1     2  1
## 4              YES  4      130  256   1       2     150     1     0.0     1  2
## 5               NO  3      140  233   0       0     163     0     0.6     2  1
##   thal num SK         SK.React Reaction Mortality Follow.Up
## 1    7   3  1            LUNGS        1         1        15
## 2    3   1  1 STOMACH.BLEEDING        1         0        15
## 3    6   1  1   COUGH.BLEEDING        1         0        32
## 4    7   3  1    SKIN.BLEEDING        1         0         9
## 5    7   1  1        BODY.PAIN        1         0        15
glimpse(data)
## Rows: 368
## Columns: 60
## $ Age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ Age.Group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ Gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ Locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ Marital.status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ Life.Style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ Sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ Category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ Depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Family.History   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ F.History        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ HTN              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ Allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ BP               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ Thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGR              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ B.Urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ S.Cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ S.Sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ S.Potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ S.Chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ C.P.K            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ CK.MB            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ ESR              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ WBC              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ RBC              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ Hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ P.C.V            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ M.C.V            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ M.C.H            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ M.C.H.C          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ PLATELET_COUNT   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ NEUTROPHIL       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ LYMPHO           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ MONOCYTE         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ EOSINO           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ Others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ CO               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ Diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ Hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ SK               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SK.React         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ Follow.Up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
colSums(is.na(data))
##              Age        Age.Group           Gender         Locality 
##                0                0                0                0 
##   Marital.status       Life.Style            Sleep         Category 
##                0                0                0                0 
##       Depression        Hyperlipi          Smoking   Family.History 
##                0                0                0                0 
##        F.History         Diabetes              HTN        Allergies 
##                0                0                0                0 
##               BP     Thrombolysis              BGR           B.Urea 
##                0                0                0                0 
##             S.Cr         S.Sodium      S.Potassium       S.Chloride 
##                0                0                0                0 
##            C.P.K            CK.MB              ESR              WBC 
##                0                0                0                0 
##              RBC       Hemoglobin            P.C.V            M.C.V 
##                0                0                0                0 
##            M.C.H          M.C.H.C   PLATELET_COUNT       NEUTROPHIL 
##                0                0                0                0 
##           LYMPHO         MONOCYTE           EOSINO           Others 
##                0                0                0                0 
##               CO        Diagnosis Hypersensitivity               cp 
##                0                0                0                0 
##         trestbps             chol              fbs          restecg 
##                0                0                0                0 
##          thalach            exang          oldpeak            slope 
##                0                0                0                0 
##               ca             thal              num               SK 
##                0                0                0                0 
##         SK.React         Reaction        Mortality        Follow.Up 
##                0                0                0                0
data <- clean_names(data)
#for data cleaning
#data2 <- data %>% filter(Gender != "Other" & !(is.na(Diabetes))) %>% select(-Age)

colnames(data)
##  [1] "age"              "age_group"        "gender"           "locality"        
##  [5] "marital_status"   "life_style"       "sleep"            "category"        
##  [9] "depression"       "hyperlipi"        "smoking"          "family_history"  
## [13] "f_history"        "diabetes"         "htn"              "allergies"       
## [17] "bp"               "thrombolysis"     "bgr"              "b_urea"          
## [21] "s_cr"             "s_sodium"         "s_potassium"      "s_chloride"      
## [25] "c_p_k"            "ck_mb"            "esr"              "wbc"             
## [29] "rbc"              "hemoglobin"       "p_c_v"            "m_c_v"           
## [33] "m_c_h"            "m_c_h_c"          "platelet_count"   "neutrophil"      
## [37] "lympho"           "monocyte"         "eosino"           "others"          
## [41] "co"               "diagnosis"        "hypersensitivity" "cp"              
## [45] "trestbps"         "chol"             "fbs"              "restecg"         
## [49] "thalach"          "exang"            "oldpeak"          "slope"           
## [53] "ca"               "thal"             "num"              "sk"              
## [57] "sk_react"         "reaction"         "mortality"        "follow_up"
glimpse(data)
## Rows: 368
## Columns: 60
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
table(data$marital_status)
## 
## MARRIED  SINGLE 
##     365       3
#convert some variables into factor
data$family.History <- as.factor(data$family_history)
data$diabetes <- as.factor(data$diabetes)
data$mortality <- as.factor(data$mortality)
colnames(data)
##  [1] "age"              "age_group"        "gender"           "locality"        
##  [5] "marital_status"   "life_style"       "sleep"            "category"        
##  [9] "depression"       "hyperlipi"        "smoking"          "family_history"  
## [13] "f_history"        "diabetes"         "htn"              "allergies"       
## [17] "bp"               "thrombolysis"     "bgr"              "b_urea"          
## [21] "s_cr"             "s_sodium"         "s_potassium"      "s_chloride"      
## [25] "c_p_k"            "ck_mb"            "esr"              "wbc"             
## [29] "rbc"              "hemoglobin"       "p_c_v"            "m_c_v"           
## [33] "m_c_h"            "m_c_h_c"          "platelet_count"   "neutrophil"      
## [37] "lympho"           "monocyte"         "eosino"           "others"          
## [41] "co"               "diagnosis"        "hypersensitivity" "cp"              
## [45] "trestbps"         "chol"             "fbs"              "restecg"         
## [49] "thalach"          "exang"            "oldpeak"          "slope"           
## [53] "ca"               "thal"             "num"              "sk"              
## [57] "sk_react"         "reaction"         "mortality"        "follow_up"       
## [61] "family.History"

#EDA

library(janitor)
data <- clean_names(data)
#relationship of diabetes status with marital status 
data %>% select(marital_status, platelet_count) %>% 
  group_by(marital_status) %>% summarise(
    mean = round(mean(platelet_count),2),
    median = median(platelet_count),
    variance = round(var(platelet_count), 2),
    std = round(sd(platelet_count), 2)
  )
## # A tibble: 2 × 5
##   marital_status    mean median    variance    std
##   <chr>            <dbl>  <int>       <dbl>  <dbl>
## 1 MARRIED        248507. 237000 5894481409  76776.
## 2 SINGLE         267333. 268000 6400333333. 80002.
ggplot(data, aes(x = platelet_count, fill = marital_status))+
  geom_histogram(bins = 40, alpha = 0.5, position = "identity")+
  geom_vline(aes(xintercept = mean(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)), color = "blue", size = 0.5)+
  geom_vline(aes(xintercept = mean(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)), color = "red", size = 0.5)+
  geom_vline(aes(xintercept = median(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)), color = "blue", linetype = "dashed", size = 0.5)+
  geom_vline(aes(xintercept = median(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)), color = "red",linetype = "dashed", size = 0.5)+
  theme_bw()
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.
## Warning: Use of `data$platelet_count` is discouraged.
## ℹ Use `platelet_count` instead.
## Warning: Use of `data$marital_status` is discouraged.
## ℹ Use `marital_status` instead.

ggplot(data, aes(x = marital_status, y = platelet_count)) + 
  geom_boxplot(aes(fill = marital_status)) + 
  stat_summary(fun.y = mean, color = "darkred", geom = "point", size =3) +
  theme_classic()

summary(data)
##       age         age_group            gender            locality        
##  Min.   :24.00   Length:368         Length:368         Length:368        
##  1st Qu.:50.75   Class :character   Class :character   Class :character  
##  Median :55.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :54.29                                                           
##  3rd Qu.:60.25                                                           
##  Max.   :77.00                                                           
##  marital_status      life_style           sleep             category        
##  Length:368         Length:368         Length:368         Length:368        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   depression         hyperlipi           smoking          family_history    
##  Length:368         Length:368         Length:368         Length:368        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    f_history      diabetes     htn             allergies        
##  Min.   :0.0000   0:198    Length:368         Length:368        
##  1st Qu.:0.0000   1:170    Class :character   Class :character  
##  Median :0.0000            Mode  :character   Mode  :character  
##  Mean   :0.1957                                                 
##  3rd Qu.:0.0000                                                 
##  Max.   :1.0000                                                 
##        bp         thrombolysis          bgr          b_urea      
##  Min.   : 80.5   Min.   :0.00000   Min.   : 60   Min.   :  2.30  
##  1st Qu.:100.7   1st Qu.:0.00000   1st Qu.:117   1st Qu.: 28.00  
##  Median :120.8   Median :0.00000   Median :164   Median : 36.00  
##  Mean   :121.2   Mean   :0.03261   Mean   :220   Mean   : 51.68  
##  3rd Qu.:140.7   3rd Qu.:0.00000   3rd Qu.:291   3rd Qu.: 43.00  
##  Max.   :190.1   Max.   :1.00000   Max.   :563   Max.   :394.00  
##       s_cr           s_sodium    s_potassium      s_chloride   
##  Min.   : 0.600   Min.   :129   Min.   :3.300   Min.   : 90.0  
##  1st Qu.: 0.900   1st Qu.:135   1st Qu.:3.900   1st Qu.:100.0  
##  Median : 0.900   Median :138   Median :4.200   Median :104.0  
##  Mean   : 1.717   Mean   :138   Mean   :4.211   Mean   :103.8  
##  3rd Qu.: 1.100   3rd Qu.:141   3rd Qu.:4.400   3rd Qu.:107.0  
##  Max.   :22.900   Max.   :146   Max.   :5.300   Max.   :112.0  
##      c_p_k            ck_mb             esr              wbc       
##  Min.   :  52.0   Min.   : 14.00   Min.   :  5.00   Min.   : 5800  
##  1st Qu.: 135.0   1st Qu.: 21.00   1st Qu.: 11.00   1st Qu.: 7800  
##  Median : 188.0   Median : 36.00   Median : 16.00   Median :10650  
##  Mean   : 553.9   Mean   : 62.49   Mean   : 26.57   Mean   :11181  
##  3rd Qu.: 390.0   3rd Qu.: 52.00   3rd Qu.: 25.00   3rd Qu.:13500  
##  Max.   :4289.0   Max.   :505.00   Max.   :154.00   Max.   :19590  
##       rbc         hemoglobin        p_c_v            m_c_v      
##  Min.   :3.46   Min.   : 9.10   Min.   :0.2900   Min.   :60.00  
##  1st Qu.:4.40   1st Qu.:12.30   1st Qu.:0.3600   1st Qu.:78.00  
##  Median :5.20   Median :14.20   Median :0.4250   Median :82.10  
##  Mean   :5.09   Mean   :13.91   Mean   :0.4153   Mean   :81.57  
##  3rd Qu.:5.65   3rd Qu.:15.50   3rd Qu.:0.4600   3rd Qu.:86.00  
##  Max.   :6.98   Max.   :18.00   Max.   :0.5400   Max.   :96.00  
##      m_c_h          m_c_h_c       platelet_count     neutrophil    
##  Min.   :18.00   Min.   :0.2200   Min.   : 20000   Min.   : 0.360  
##  1st Qu.:25.70   1st Qu.:0.3200   1st Qu.:192000   1st Qu.: 0.600  
##  Median :27.90   Median :0.3300   Median :237000   Median : 0.720  
##  Mean   :27.27   Mean   :0.3294   Mean   :248660   Mean   : 2.927  
##  3rd Qu.:29.00   3rd Qu.:0.3400   3rd Qu.:287000   3rd Qu.: 0.800  
##  Max.   :33.00   Max.   :0.3900   Max.   :459000   Max.   :83.000  
##      lympho          monocyte          eosino         others         
##  Min.   :0.0500   Min.   :0.0100   Min.   :1.000   Length:368        
##  1st Qu.:0.1700   1st Qu.:0.0200   1st Qu.:2.000   Class :character  
##  Median :0.2100   Median :0.0300   Median :2.000   Mode  :character  
##  Mean   :0.2489   Mean   :0.0325   Mean   :2.255                     
##  3rd Qu.:0.3200   3rd Qu.:0.0400   3rd Qu.:3.000                     
##  Max.   :0.5400   Max.   :0.0800   Max.   :5.000                     
##       co             diagnosis         hypersensitivity         cp       
##  Length:368         Length:368         Length:368         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.671  
##                                                           3rd Qu.:4.000  
##                                                           Max.   :4.000  
##     trestbps          chol            fbs            restecg     
##  Min.   :100.0   Min.   :131.0   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:120.0   1st Qu.:212.0   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :130.0   Median :249.0   Median :0.0000   Median :2.000  
##  Mean   :132.7   Mean   :248.9   Mean   :0.1413   Mean   :1.073  
##  3rd Qu.:142.0   3rd Qu.:283.0   3rd Qu.:0.0000   3rd Qu.:2.000  
##  Max.   :200.0   Max.   :409.0   Max.   :1.0000   Max.   :2.000  
##     thalach          exang           oldpeak          slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:125.0   1st Qu.:0.0000   1st Qu.:0.275   1st Qu.:1.750  
##  Median :144.0   Median :1.0000   Median :1.200   Median :2.000  
##  Mean   :140.9   Mean   :0.5625   Mean   :1.542   Mean   :1.842  
##  3rd Qu.:158.0   3rd Qu.:1.0000   3rd Qu.:2.200   3rd Qu.:2.000  
##  Max.   :195.0   Max.   :1.0000   Max.   :6.200   Max.   :3.000  
##        ca             thal            num              sk        
##  Min.   :0.000   Min.   :3.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.0000  
##  Median :1.000   Median :7.000   Median :2.000   Median :1.0000  
##  Mean   :1.003   Mean   :5.859   Mean   :2.035   Mean   :0.9837  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :7.000   Max.   :4.000   Max.   :1.0000  
##    sk_react            reaction      mortality   follow_up     family_history_2
##  Length:368         Min.   :0.0000   0:288     Min.   : 1.00   NO :296         
##  Class :character   1st Qu.:0.0000   1: 80     1st Qu.:15.00   YES: 72         
##  Mode  :character   Median :1.0000             Median :32.00                   
##                     Mean   :0.7473             Mean   :28.65                   
##                     3rd Qu.:1.0000             3rd Qu.:36.00                   
##                     Max.   :1.0000             Max.   :60.00
distribution_plot <- function(y){
  #function return histogram and boxplot of chosen variable distribution 
  
  box_plt <- ggplot(data, aes(x = platelet_count, y = eval(parse(text = y))))+
    geom_boxplot(aes(fill = platelet_count))+
    stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2) +
    labs(x = "Platelet Count", y = y)+theme_bw()
  
  hstgm <- ggplot(data, aes(x = eval(parse(text = y)), fill = mortality)) + geom_histogram(bins = 40, alpha = 0.5, position = "identity")+
    geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "blue", size = 0.5) +
    geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "blue", size = 0.5) +
    geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
    geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
    labs(x = y) + theme_bw()
  
  return(plot_grid(hstgm, box_plt))
}

distribution_plot("s_sodium")
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Here we could see that histograms were not the best oprions for us since there are not so many subjects with positive stroke outcome so we were not able to explore the distribution properly.

distribution_ratio <- function(x){
  #fucntion creates a stacked bar plots
  
  plt <- data %>% select(s_sodium,x, gender) %>% 
    group_by(s_sodium, var = eval(parse(text = x))) %>% 
    summarise(count = length(gender)) %>% 
    group_by(s_sodium) %>% 
    mutate(ratio = round(count * 100 / sum(count), 1)) %>% 
    ggplot(aes(y = ratio, x = s_sodium, fill = var)) +
    geom_bar(stat = "identity") +
    labs(title = paste0("Ratio of stroke outcome by ", x), fill = x) +
    theme_bw()
  return(plt)
  
}

distribution_ratio("bp")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(x)
## 
##   # Now:
##   data %>% select(all_of(x))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

Let’s stop here for a moment and remember about discrete variables and probability.

According to our sample we can make following assumptions:

H <- as.numeric(levels(data$diabetes))[data$diabetes]
S <- as.numeric(levels(data$mortality))[data$mortality]

p_sp_and_hp <- mean(H&S)
p_sp_or_hp <- mean(H|S)
p_sp_and_hn <- mean(H&!S)

print(paste0("Probability of suffereing from having a disturbed platelet count:", round(p_sp_and_hp,3)))
## [1] "Probability of suffereing from having a disturbed platelet count:0.166"
print(paste0("Probability of suffereing or having a disturbed platelet count:", round(p_sp_or_hp,3)))
## [1] "Probability of suffereing or having a disturbed platelet count:0.514"
print(paste0("Probability of suffereing without having disturbed platelet count:", round(p_sp_and_hn,3)))
## [1] "Probability of suffereing without having disturbed platelet count:0.296"
distribution_ratio("diabetes")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

distribution_ratio("marital_status")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

glimpse(data)
## Rows: 368
## Columns: 61
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <fct> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
## $ family_history_2 <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, N…

#Test Outcome Analysis¶ Now I will try to predict the stroke outcome with the help of random forest. I will not go into details about what is random forest, how it works and how to adjust the model. My point here is to show how can we analyse model performance (can be applied both for machine learning or real testings). You can check out my project on Dementia Prediction with Tree-based Models to know more about tree-based models.

frml <- (diabetes ~ gender+age+follow_up+thalach+chol+trestbps)

model_rf <- randomForest(formula = frml, data = data)

prediction_rf <- predict(object = model_rf, 
                         newdata = select(data, -diabetes),
                         type = "class")

confusionMatrix(data = prediction_rf, reference = data$diabetes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 197   1
##          1   1 169
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9805, 0.9993)
##     No Information Rate : 0.538           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9891          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9949          
##             Specificity : 0.9941          
##          Pos Pred Value : 0.9949          
##          Neg Pred Value : 0.9941          
##              Prevalence : 0.5380          
##          Detection Rate : 0.5353          
##    Detection Prevalence : 0.5380          
##       Balanced Accuracy : 0.9945          
##                                           
##        'Positive' Class : 0               
## 
print(paste0("AUC = ", round(Metrics::auc(actual = data$diabetes, predicted = prediction_rf), 4)))
## [1] "AUC = 0.9945"
plot(roc(prediction_rf, data$diabetes), main = "ROC", cex.axis = 0.6, cex.main = 0.8, cex.lab = 0.8)