# Use Statistical Models for Data Analysis

Outline:

I. Statistics Basics
   a) Summaries and Histograms
   b) The relationship between mean and median
   c) Boxplots and Suspected Outliers
   d) Using natural logarithms and Qplots

II. Inferential Statistics
   a) Hypothesis testing and Confidence Intervals
   b) Interpreting Confidence Intervals
   c) Significance levels and P-values
   d) Brief Introduction to the Chi-Squared Test of Independence (to be continued)
   e) Contingency Tables for Categorical Variables
   f) T-Distribution
   g) Difference Between Two Independent Means
   h) Difference Between More than Two Independent Means (ANOVA)
   i) Bootstrapping (Simulation-based method)

III. Other Statistics
   a) Mosaic Plots and Contingency Tables
   b) Choropleth Maps with ggplot2

References:

a) Statistics Terminology
b) (Bio)statistics in R: Part #1 by def me
c) Statistics with R Coursera Certification by Duke University
d) Performing the Non-parametric Bootstrap for statistical inference using R by Ian Dworkin
e) Data visualization with ggplot Part 2 by Rick Scavetta
f) Stack Exchange Reply by Penguin_Knight
g) Outlier Detection by Zach Loertscher

## (Bio)statistics in R: Part #1

Let's look at some examples in R with coin flips to make things more concrete.

We assume that the coin is fair, so the distribution of outcomes follows a binomial distribution with the probability of getting heads = 0.5.

A binomial experiment is a statistical experiment with the following properties: the experiment consists of n repeated trials; each trial can result in just two possible outcomes, one of which we call a success and the other a failure; and the probability of success, denoted by p, is the same on every trial.
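
For instance (a quick illustration, separate from the experiment below), dbinom() gives the probability of a given number of successes, and pbinom() the cumulative probability:

# probability of exactly 7 heads in 10 fair flips
dbinom(7, size = 10, prob = 0.5)   # ~0.117
# probability of at most 7 heads (cumulative)
pbinom(7, size = 10, prob = 0.5)   # ~0.945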

set.seed(11)
# Generate two sample spaces for 10 coins, flip each coin 1 time. Event E = 1 (heads) with probability .5

exp_a <- rbinom(10, 1 , 0.5)
exp_b <- rbinom(10, 1, 0.5)

print(paste0("sample space of expt a: ", list(exp_a)))
## [1] "sample space of expt a: c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)"
print(paste0("sample space of expt b: ", list(exp_b)))
## [1] "sample space of expt b: c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)"
cat("\n")
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_a)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)"
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_b)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)"
cat("\n")
print(paste0("Coin flip is 'heads' (1) for both sample a and b: ", list(exp_a & exp_b)))
## [1] "Coin flip is 'heads' (1) for both sample a and b: c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for either sample a and b: ", list(exp_a | exp_b)))
## [1] "Coin flip is 'heads' (1) for either sample a and b: c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for sample a when b is 'tails' : ", list(exp_a & !exp_b)))
## [1] "Coin flip is 'heads' (1) for sample a when b is 'tails' : c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)"

# Probability

Probability is simply the likelihood that an event will take place.

print(paste0("Probability that is 'heads' for both experiment a and b: ", round(mean(exp_a & exp_b), 3)))
## [1] "Probability that is 'heads' for both experiment a and b: 0.2"
print(paste0("Probability that is 'heads' for either experiment a and b: ", round(mean(exp_a | exp_b), 3)))
## [1] "Probability that is 'heads' for either experiment a and b: 0.5"
print(paste0("Probability that is 'heads' for both experiment a when its is tails of experiment b: ", round(mean(exp_a & !(exp_b)), 3)))
## [1] "Probability that is 'heads' for both experiment a when its is tails of experiment b: 0.1"

The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods of a random sample of 14 HIV-infected individuals are given below (in years).

Calculate the sample mean, the sample median, and the sample standard deviation.

inc_period <- c(12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9)
print(paste0("sample mean:", round(mean(inc_period), 3)))
## [1] "sample mean:9.85"
print(paste0("sample mean:", round(median(inc_period), 3)))
## [1] "sample mean:10"
print(paste0("sample mean:", round(sd(inc_period), 3)))
## [1] "sample mean:3.059"

Serum cholesterol (mmol/L) was measured on a sample of 86 stroke patients (data from Markus et al., 1995). Calculate the distribution's summary statistics and plot the distribution.

cholesterol <- c(3.7,4.8,5.4,5.6,6.1,6.4,7.0,7.6,8.7,3.8,4.9,5.4,5.6,6.1,6.5,7.0,7.6,8.9,3.8,4.9,5.5,5.7,6.1,
                 6.5,7.1,7.6,9.3,4.4,4.9,5.5,5.7,6.2,6.6,7.1,7.7,9.5,4.5,5.0,5.5,5.7,6.3,6.7,7.2,7.8,10.2,4.5,
                 5.1,5.6,5.8,6.3,6.7,7.3,7.8,10.4,4.5,5.1,5.6,5.8,6.4,6.8,7.4,7.8,4.7,5.2,5.6,5.9,6.4,6.8,7.4,
                 8.2,4.7,5.3,5.6,6.0,6.4,7.0,7.5,8.3,4.8,5.3,5.6,6.1,6.4,7.0,7.5,8.6)

print(paste0("sample mean:", round(mean(cholesterol), 3)))
## [1] "sample mean:6.341"
print(paste0("sample mean:", round(median(cholesterol), 3)))
## [1] "sample mean:6.15"
print(paste0("sample mean:", round(var(cholesterol), 3)))
## [1] "sample mean:1.959"
print(paste0("sample mean:", round(sd(cholesterol), 3)))
## [1] "sample mean:1.4"
library(ggplot2)

df <- data.frame(cholesterol)
ggplot(df, aes(x = cholesterol)) +
  geom_histogram(bins = 20, fill = "green", alpha = 0.3) +
  geom_vline(aes(xintercept = mean(cholesterol)), color = "red", linewidth = 0.5) + 
  geom_vline(aes(xintercept = median(cholesterol)), color = "red", linetype = "dashed", linewidth = 0.5)

ggplot(df, aes(x = "", y = cholesterol)) +
  geom_boxplot(fill = "green", alpha = 0.3) +
  stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2)
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The data below show the consumption of alcohol (X, liters per year per person aged 14 or older) and the death rate from cirrhosis, a liver disease (Y, deaths per 100,000 population), in 15 countries (each country is an observation unit).

Draw a scatter diagram to show the association, if any, between these two variables; can you draw any conclusion without doing any calculations?

Calculate the covariance and the correlation coefficient.

X <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 6.6, 5.8, 5.7, 5.6, 4.2, 3.9, 3.1)
Y <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3.0, 7.2, 10.6, 3.7, 3.4, 4.3, 3.6, 5.4)

df <- data.frame(X, Y)

# to visualize the relationship, draw a scatter plot
ggplot(df, aes(x = X, y = Y)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Alcohol consumption", y = "Death rate from cirrhosis",
       title = "Relation between two variables")
## `geom_smooth()` using formula = 'y ~ x'

It looks like we have a strong positive correlation: as alcohol consumption increases, the death rate from cirrhosis also increases. Let's calculate the coefficients to check this impression.

print(paste0("covariance between alcohol consumption adn death rate from cirrhosis: ", round(cov(X, Y),2)))
## [1] "covariance between alcohol consumption adn death rate from cirrhosis: 60.68"
print(paste0("Correlation betweel Alcohol consumption and death rate from cirrhosis ", round(cor(X,Y),2)))
## [1] "Correlation betweel Alcohol consumption and death rate from cirrhosis 0.94"

# Conditional Probability

What is the probability that a person with a certain illness is between 0 and 30 years of age, if the likelihood of being of that age is 0.4 and the overall (marginal) probability of getting this illness in the whole population is 0.1? (Here we treat age and illness as independent.)

pa <- 0.4   # P(aged 0-30)
pb <- 0.1   # P(illness)
# assuming independence, P(aged 0-30 and ill) = pa * pb
pa_given_pb <- pa * pb / pb

print(paste0("Probability of being aged 0-30 given the illness: ", pa_given_pb))
## [1] "Probability of being aged 0-30 given the illness: 0.4"

There are 3 diseases (A, B, C), which occur with probabilities 0.5, 0.3, and 0.2, respectively, in people coming to an outpatient clinic. The probability of chest pain with these diseases is 0.8, 0.4, and 0.2, respectively. What is the probability that people at the outpatient clinic experience chest pain? What is the probability of disease C in a person who has no chest pain?

pa <- 0.5
pb <- 0.3
pc <- 0.2

chest_given_pa <- 0.8
chest_given_pb <- 0.4
chest_given_pc <- 0.2

# law of total probability: P(chest pain)
p_chest_pain <- pa * chest_given_pa + pb * chest_given_pb + pc * chest_given_pc
# Bayes' rule: P(C | no chest pain)
p_c_given_no_chest <- pc * (1 - chest_given_pc) / (1 - p_chest_pain)

print(paste0("Probability that people at the outpatient clinic experience chest pain: ", p_chest_pain))
## [1] "Probability that people at the outpatient clinic experience chest pain: 0.56"
print(paste0("Probability of disease C in a person who has no chest pain: ", round(p_c_given_no_chest, 3)))
## [1] "Probability of disease C in a person who has no chest pain: 0.364"

# Bayes' Rule

Bayes' rule describes the probability of an event based on prior knowledge of conditions that might be related to the event.

A study comparing the efficacy of HIV tests concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. We are interested in calculating the probability that this subject has HIV. Mathematically, we would like to calculate P(D+|T+), the probability of being HIV positive (D+) given that the test was positive (T+). What is known is the sensitivity of the test, P(T+|D+) = 0.997, the specificity of the test, P(T−|D−) = 0.985, and the prevalence of the disease in the target population, P(D+) = 0.001. Using Bayes' formula, P(D+|T+) = P(T+|D+)P(D+) / [P(T+|D+)P(D+) + (1 − P(T−|D−))(1 − P(D+))], we can calculate the quantity we are interested in.

p_dp_given_tp <- (0.997*0.001)/(0.997*0.001 + (1-0.985)*(1-0.001))
print(paste0("Probability of being HIV positive given that one test was positive: ", round(p_dp_given_tp,3)))
## [1] "Probability of being HIV positive given that one test was positive: 0.062"

Thus, in this population, a positive test result corresponds to only about a 6% probability that the subject is actually HIV positive. This is the positive predictive value (PPV) of the test. The low positive predictive value is due to the low prevalence of the disease and the somewhat modest specificity relative to that prevalence.
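
To see how strongly the PPV depends on prevalence, here is a small sketch that recomputes it for a few hypothetical prevalence values, keeping the same sensitivity and specificity:

# PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
ppv <- function(sens, spec, prev) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
round(ppv(0.997, 0.985, c(0.001, 0.01, 0.1)), 3)   # 0.062 0.402 0.881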

library(tidyverse)   # dplyr (sample_n, glimpse) and ggplot2
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
data <- read.csv("FIC_FUll.csv")
sample_n(data, 5)
##   Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1  50     41-50   Male    RURAL        MARRIED         NO    NO     FREE
## 2  55     51-60   Male    URBAN        MARRIED        YES   YES     FREE
## 3  56     51-60   Male    URBAN        MARRIED        YES   YES     FREE
## 4  55     51-60   Male    URBAN        MARRIED         NO    NO     FREE
## 5  50     41-50   Male    RURAL        MARRIED        YES    NO     FREE
##   Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1        YES       YES     YES            YES         1        0  NO        NO
## 2        YES       YES     YES             NO         0        1 YES        NO
## 3        YES       YES     YES             NO         0        1 YES        NO
## 4        YES       YES      NO            YES         1        0  NO        NO
## 5        YES       YES     YES             NO         0        0 YES        NO
##      BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1  80.5            0 125    2.3  0.9      129         4.2         96    65
## 2 100.7            0 323   25.0  0.8      139         3.9        100  3877
## 3 150.9            0 260   31.0  0.8      139         4.4        110   319
## 4 130.8            0 127   46.0  1.2      139         4.3        108   188
## 5 100.6            0 348   47.0  1.1      134         4.4         96   336
##   CK.MB ESR   WBC  RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1    36  11  9800 6.15       16.5  0.51  82.1  26.9    0.33         236000
## 2   299  25 10080 5.70       13.3  0.41  72.0  23.0    0.32         391000
## 3    51  11  9800 6.15       16.5  0.51  82.1  26.9    0.33         236000
## 4    45  11 14400 5.70       16.3  0.48  84.0  33.0    0.32         217000
## 5    22   9 10300 5.41       15.5  0.46  86.0  28.0    0.33         261000
##   NEUTROPHIL LYMPHO MONOCYTE EOSINO                           Others
## 1       0.75   0.20     0.03      2 ICMP WITH EF= 30%,PULMONAR ODEMA
## 2       0.90   0.08     0.01      1                               no
## 3       0.75   0.20     0.03      2                               no
## 4       0.57   0.32     0.07      2                  ORTHOPENIA, PND
## 5       0.80   0.17     0.02      1                          HTN, DM
##                                              CO         Diagnosis
## 1 SOB, DIZZINESS, CHEST PAIN,NAUSEA,DIAPHORESIS CARDIOGENIC SHOCK
## 2                      Chest pain4 HR, SWEATING        AC I/W M.I
## 3                 Chest pain,vomiting, sweating      A/C, A/W M.I
## 4              Central Chest pain,SOB, Sweating           NSTEM.I
## 5                    Chest pain,COLD SWEATING,            A/S M.I
##   Hypersensitivity cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1               NO  4      142  309   0       2     147     1     0.0     2  3
## 2               NO  4      110  275   0       2     118     1     1.0     2  1
## 3               NO  4      132  184   0       2     105     1     2.1     2  1
## 4              YES  4      130  256   1       2     150     1     0.0     1  2
## 5               NO  3      140  233   0       0     163     0     0.6     2  1
##   thal num SK         SK.React Reaction Mortality Follow.Up
## 1    7   3  1            LUNGS        1         1        15
## 2    3   1  1 STOMACH.BLEEDING        1         0        15
## 3    6   1  1   COUGH.BLEEDING        1         0        32
## 4    7   3  1    SKIN.BLEEDING        1         0         9
## 5    7   1  1        BODY.PAIN        1         0        15
glimpse(data)
## Rows: 368
## Columns: 60
## $ Age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ Age.Group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ Gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ Locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ Marital.status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ Life.Style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ Sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ Category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ Depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Family.History   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ F.History        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ HTN              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ Allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ BP               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ Thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGR              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ B.Urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ S.Cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ S.Sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ S.Potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ S.Chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ C.P.K            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ CK.MB            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ ESR              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ WBC              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ RBC              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ Hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ P.C.V            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ M.C.V            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ M.C.H            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ M.C.H.C          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ PLATELET_COUNT   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ NEUTROPHIL       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ LYMPHO           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ MONOCYTE         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ EOSINO           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ Others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ CO               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ Diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ Hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ SK               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SK.React         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ Follow.Up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
colSums(is.na(data))
##              Age        Age.Group           Gender         Locality 
##                0                0                0                0 
##   Marital.status       Life.Style            Sleep         Category 
##                0                0                0                0 
##       Depression        Hyperlipi          Smoking   Family.History 
##                0                0                0                0 
##        F.History         Diabetes              HTN        Allergies 
##                0                0                0                0 
##               BP     Thrombolysis              BGR           B.Urea 
##                0                0                0                0 
##             S.Cr         S.Sodium      S.Potassium       S.Chloride 
##                0                0                0                0 
##            C.P.K            CK.MB              ESR              WBC 
##                0                0                0                0 
##              RBC       Hemoglobin            P.C.V            M.C.V 
##                0                0                0                0 
##            M.C.H          M.C.H.C   PLATELET_COUNT       NEUTROPHIL 
##                0                0                0                0 
##           LYMPHO         MONOCYTE           EOSINO           Others 
##                0                0                0                0 
##               CO        Diagnosis Hypersensitivity               cp 
##                0                0                0                0 
##         trestbps             chol              fbs          restecg 
##                0                0                0                0 
##          thalach            exang          oldpeak            slope 
##                0                0                0                0 
##               ca             thal              num               SK 
##                0                0                0                0 
##         SK.React         Reaction        Mortality        Follow.Up 
##                0                0                0                0
data <- clean_names(data)
# example filter for data cleaning (kept commented out; columns are now snake_case)
#data2 <- data %>% filter(gender != "Other" & !(is.na(diabetes))) %>% select(-age)

colnames(data)
##  [1] "age"              "age_group"        "gender"           "locality"        
##  [5] "marital_status"   "life_style"       "sleep"            "category"        
##  [9] "depression"       "hyperlipi"        "smoking"          "family_history"  
## [13] "f_history"        "diabetes"         "htn"              "allergies"       
## [17] "bp"               "thrombolysis"     "bgr"              "b_urea"          
## [21] "s_cr"             "s_sodium"         "s_potassium"      "s_chloride"      
## [25] "c_p_k"            "ck_mb"            "esr"              "wbc"             
## [29] "rbc"              "hemoglobin"       "p_c_v"            "m_c_v"           
## [33] "m_c_h"            "m_c_h_c"          "platelet_count"   "neutrophil"      
## [37] "lympho"           "monocyte"         "eosino"           "others"          
## [41] "co"               "diagnosis"        "hypersensitivity" "cp"              
## [45] "trestbps"         "chol"             "fbs"              "restecg"         
## [49] "thalach"          "exang"            "oldpeak"          "slope"           
## [53] "ca"               "thal"             "num"              "sk"              
## [57] "sk_react"         "reaction"         "mortality"        "follow_up"
glimpse(data)
## Rows: 368
## Columns: 60
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
table(data$marital_status)
## 
## MARRIED  SINGLE 
##     365       3
# convert some variables into factors
# note: `family.History` creates a new column rather than overwriting
# `family_history`; a later clean_names() call renames it to `family_history_2`
data$family.History <- as.factor(data$family_history)
data$diabetes <- as.factor(data$diabetes)
data$mortality <- as.factor(data$mortality)
colnames(data)
##  [1] "age"              "age_group"        "gender"           "locality"        
##  [5] "marital_status"   "life_style"       "sleep"            "category"        
##  [9] "depression"       "hyperlipi"        "smoking"          "family_history"  
## [13] "f_history"        "diabetes"         "htn"              "allergies"       
## [17] "bp"               "thrombolysis"     "bgr"              "b_urea"          
## [21] "s_cr"             "s_sodium"         "s_potassium"      "s_chloride"      
## [25] "c_p_k"            "ck_mb"            "esr"              "wbc"             
## [29] "rbc"              "hemoglobin"       "p_c_v"            "m_c_v"           
## [33] "m_c_h"            "m_c_h_c"          "platelet_count"   "neutrophil"      
## [37] "lympho"           "monocyte"         "eosino"           "others"          
## [41] "co"               "diagnosis"        "hypersensitivity" "cp"              
## [45] "trestbps"         "chol"             "fbs"              "restecg"         
## [49] "thalach"          "exang"            "oldpeak"          "slope"           
## [53] "ca"               "thal"             "num"              "sk"              
## [57] "sk_react"         "reaction"         "mortality"        "follow_up"       
## [61] "family.History"

# EDA

library(janitor)
data <- clean_names(data)   # renames the family.History column to family_history_2
# summarise platelet count by marital status
data %>% select(marital_status, platelet_count) %>% 
  group_by(marital_status) %>% summarise(
    mean = round(mean(platelet_count),2),
    median = median(platelet_count),
    variance = round(var(platelet_count), 2),
    std = round(sd(platelet_count), 2)
  )
## # A tibble: 2 × 5
##   marital_status    mean median    variance    std
##   <chr>            <dbl>  <int>       <dbl>  <dbl>
## 1 MARRIED        248507. 237000 5894481409  76776.
## 2 SINGLE         267333. 268000 6400333333. 80002.
# pre-compute the group means/medians instead of referencing data$ inside aes()
mean_married   <- mean(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)
mean_single    <- mean(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)
median_married <- median(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE)
median_single  <- median(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE)

ggplot(data, aes(x = platelet_count, fill = marital_status)) +
  geom_histogram(bins = 40, alpha = 0.5, position = "identity") +
  geom_vline(xintercept = mean_married, color = "blue", linewidth = 0.5) +
  geom_vline(xintercept = mean_single, color = "red", linewidth = 0.5) +
  geom_vline(xintercept = median_married, color = "blue", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = median_single, color = "red", linetype = "dashed", linewidth = 0.5) +
  theme_bw()

ggplot(data, aes(x = marital_status, y = platelet_count)) + 
  geom_boxplot(aes(fill = marital_status)) + 
  stat_summary(fun = mean, color = "darkred", geom = "point", size = 3) +
  theme_classic()

summary(data)
##       age         age_group            gender            locality        
##  Min.   :24.00   Length:368         Length:368         Length:368        
##  1st Qu.:50.75   Class :character   Class :character   Class :character  
##  Median :55.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :54.29                                                           
##  3rd Qu.:60.25                                                           
##  Max.   :77.00                                                           
##  marital_status      life_style           sleep             category        
##  Length:368         Length:368         Length:368         Length:368        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   depression         hyperlipi           smoking          family_history    
##  Length:368         Length:368         Length:368         Length:368        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    f_history      diabetes     htn             allergies        
##  Min.   :0.0000   0:198    Length:368         Length:368        
##  1st Qu.:0.0000   1:170    Class :character   Class :character  
##  Median :0.0000            Mode  :character   Mode  :character  
##  Mean   :0.1957                                                 
##  3rd Qu.:0.0000                                                 
##  Max.   :1.0000                                                 
##        bp         thrombolysis          bgr          b_urea      
##  Min.   : 80.5   Min.   :0.00000   Min.   : 60   Min.   :  2.30  
##  1st Qu.:100.7   1st Qu.:0.00000   1st Qu.:117   1st Qu.: 28.00  
##  Median :120.8   Median :0.00000   Median :164   Median : 36.00  
##  Mean   :121.2   Mean   :0.03261   Mean   :220   Mean   : 51.68  
##  3rd Qu.:140.7   3rd Qu.:0.00000   3rd Qu.:291   3rd Qu.: 43.00  
##  Max.   :190.1   Max.   :1.00000   Max.   :563   Max.   :394.00  
##       s_cr           s_sodium    s_potassium      s_chloride   
##  Min.   : 0.600   Min.   :129   Min.   :3.300   Min.   : 90.0  
##  1st Qu.: 0.900   1st Qu.:135   1st Qu.:3.900   1st Qu.:100.0  
##  Median : 0.900   Median :138   Median :4.200   Median :104.0  
##  Mean   : 1.717   Mean   :138   Mean   :4.211   Mean   :103.8  
##  3rd Qu.: 1.100   3rd Qu.:141   3rd Qu.:4.400   3rd Qu.:107.0  
##  Max.   :22.900   Max.   :146   Max.   :5.300   Max.   :112.0  
##      c_p_k            ck_mb             esr              wbc       
##  Min.   :  52.0   Min.   : 14.00   Min.   :  5.00   Min.   : 5800  
##  1st Qu.: 135.0   1st Qu.: 21.00   1st Qu.: 11.00   1st Qu.: 7800  
##  Median : 188.0   Median : 36.00   Median : 16.00   Median :10650  
##  Mean   : 553.9   Mean   : 62.49   Mean   : 26.57   Mean   :11181  
##  3rd Qu.: 390.0   3rd Qu.: 52.00   3rd Qu.: 25.00   3rd Qu.:13500  
##  Max.   :4289.0   Max.   :505.00   Max.   :154.00   Max.   :19590  
##       rbc         hemoglobin        p_c_v            m_c_v      
##  Min.   :3.46   Min.   : 9.10   Min.   :0.2900   Min.   :60.00  
##  1st Qu.:4.40   1st Qu.:12.30   1st Qu.:0.3600   1st Qu.:78.00  
##  Median :5.20   Median :14.20   Median :0.4250   Median :82.10  
##  Mean   :5.09   Mean   :13.91   Mean   :0.4153   Mean   :81.57  
##  3rd Qu.:5.65   3rd Qu.:15.50   3rd Qu.:0.4600   3rd Qu.:86.00  
##  Max.   :6.98   Max.   :18.00   Max.   :0.5400   Max.   :96.00  
##      m_c_h          m_c_h_c       platelet_count     neutrophil    
##  Min.   :18.00   Min.   :0.2200   Min.   : 20000   Min.   : 0.360  
##  1st Qu.:25.70   1st Qu.:0.3200   1st Qu.:192000   1st Qu.: 0.600  
##  Median :27.90   Median :0.3300   Median :237000   Median : 0.720  
##  Mean   :27.27   Mean   :0.3294   Mean   :248660   Mean   : 2.927  
##  3rd Qu.:29.00   3rd Qu.:0.3400   3rd Qu.:287000   3rd Qu.: 0.800  
##  Max.   :33.00   Max.   :0.3900   Max.   :459000   Max.   :83.000  
##      lympho          monocyte          eosino         others         
##  Min.   :0.0500   Min.   :0.0100   Min.   :1.000   Length:368        
##  1st Qu.:0.1700   1st Qu.:0.0200   1st Qu.:2.000   Class :character  
##  Median :0.2100   Median :0.0300   Median :2.000   Mode  :character  
##  Mean   :0.2489   Mean   :0.0325   Mean   :2.255                     
##  3rd Qu.:0.3200   3rd Qu.:0.0400   3rd Qu.:3.000                     
##  Max.   :0.5400   Max.   :0.0800   Max.   :5.000                     
##       co             diagnosis         hypersensitivity         cp       
##  Length:368         Length:368         Length:368         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.671  
##                                                           3rd Qu.:4.000  
##                                                           Max.   :4.000  
##     trestbps          chol            fbs            restecg     
##  Min.   :100.0   Min.   :131.0   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:120.0   1st Qu.:212.0   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :130.0   Median :249.0   Median :0.0000   Median :2.000  
##  Mean   :132.7   Mean   :248.9   Mean   :0.1413   Mean   :1.073  
##  3rd Qu.:142.0   3rd Qu.:283.0   3rd Qu.:0.0000   3rd Qu.:2.000  
##  Max.   :200.0   Max.   :409.0   Max.   :1.0000   Max.   :2.000  
##     thalach          exang           oldpeak          slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:125.0   1st Qu.:0.0000   1st Qu.:0.275   1st Qu.:1.750  
##  Median :144.0   Median :1.0000   Median :1.200   Median :2.000  
##  Mean   :140.9   Mean   :0.5625   Mean   :1.542   Mean   :1.842  
##  3rd Qu.:158.0   3rd Qu.:1.0000   3rd Qu.:2.200   3rd Qu.:2.000  
##  Max.   :195.0   Max.   :1.0000   Max.   :6.200   Max.   :3.000  
##        ca             thal            num              sk        
##  Min.   :0.000   Min.   :3.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.0000  
##  Median :1.000   Median :7.000   Median :2.000   Median :1.0000  
##  Mean   :1.003   Mean   :5.859   Mean   :2.035   Mean   :0.9837  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :7.000   Max.   :4.000   Max.   :1.0000  
##    sk_react            reaction      mortality   follow_up     family_history_2
##  Length:368         Min.   :0.0000   0:288     Min.   : 1.00   NO :296         
##  Class :character   1st Qu.:0.0000   1: 80     1st Qu.:15.00   YES: 72         
##  Mode  :character   Median :1.0000             Median :32.00                   
##                     Mean   :0.7473             Mean   :28.65                   
##                     3rd Qu.:1.0000             3rd Qu.:36.00                   
##                     Max.   :1.0000             Max.   :60.00
library(cowplot)   # plot_grid()

distribution_plot <- function(y){
  # returns a histogram and a boxplot of the chosen variable's distribution,
  # grouped by mortality status
  
  box_plt <- ggplot(data, aes(x = mortality, y = .data[[y]])) +
    geom_boxplot(aes(fill = mortality)) +
    stat_summary(fun = mean, color = "darkred", geom = "point", size = 2) +
    labs(x = "Mortality", y = y) + theme_bw()
  
  # pre-compute group means/medians instead of referencing data[[y]] inside aes()
  mean_1   <- mean(data[[y]][data$mortality == 1], na.rm = TRUE)
  mean_0   <- mean(data[[y]][data$mortality == 0], na.rm = TRUE)
  median_1 <- median(data[[y]][data$mortality == 1], na.rm = TRUE)
  median_0 <- median(data[[y]][data$mortality == 0], na.rm = TRUE)
  
  hstgm <- ggplot(data, aes(x = .data[[y]], fill = mortality)) +
    geom_histogram(bins = 40, alpha = 0.5, position = "identity") +
    geom_vline(xintercept = mean_1, color = "blue", linewidth = 0.5) +
    geom_vline(xintercept = mean_0, color = "blue", linewidth = 0.5) +
    geom_vline(xintercept = median_1, color = "red", linewidth = 0.5, linetype = "dashed") +
    geom_vline(xintercept = median_0, color = "red", linewidth = 0.5, linetype = "dashed") +
    labs(x = y) + theme_bw()
  
  return(plot_grid(hstgm, box_plt))
}

distribution_plot("s_sodium")

Here we can see that histograms are not the best option for us: there are not many subjects with a positive outcome, so we cannot explore the distribution properly.
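
One possible alternative (a sketch, not part of the original analysis) is a density plot, which normalizes each group so the shape of the smaller group remains visible:

# density curves stay comparable even when group sizes differ a lot
ggplot(data, aes(x = s_sodium, fill = mortality)) +
  geom_density(alpha = 0.5) +
  theme_bw()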

distribution_ratio <- function(x){
  # function creates a stacked bar plot of x's categories across serum sodium values
  
  plt <- data %>% select(s_sodium, all_of(x), gender) %>% 
    group_by(s_sodium, var = .data[[x]]) %>% 
    summarise(count = length(gender)) %>% 
    group_by(s_sodium) %>% 
    mutate(ratio = round(count * 100 / sum(count), 1)) %>% 
    ggplot(aes(y = ratio, x = s_sodium, fill = var)) +
    geom_bar(stat = "identity") +
    labs(title = paste0("Ratio of ", x, " by serum sodium"), fill = x) +
    theme_bw()
  return(plt)
}

distribution_ratio("bp")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(x)
## 
##   # Now:
##   data %>% select(all_of(x))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

Let's stop here for a moment and recall discrete variables and probability.

From our sample we can estimate the following probabilities:

H <- as.numeric(levels(data$diabetes))[data$diabetes]     # diabetes indicator (0/1)
S <- as.numeric(levels(data$mortality))[data$mortality]   # mortality indicator (0/1)

p_sp_and_hp <- mean(H & S)
p_sp_or_hp <- mean(H | S)
p_sp_and_hn <- mean(H & !S)

print(paste0("Probability of having diabetes and dying: ", round(p_sp_and_hp, 3)))
## [1] "Probability of having diabetes and dying: 0.166"
print(paste0("Probability of having diabetes or dying: ", round(p_sp_or_hp, 3)))
## [1] "Probability of having diabetes or dying: 0.514"
print(paste0("Probability of having diabetes without dying: ", round(p_sp_and_hn, 3)))
## [1] "Probability of having diabetes without dying: 0.296"
distribution_ratio("diabetes")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

distribution_ratio("marital_status")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.

glimpse(data)
## Rows: 368
## Columns: 61
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <fct> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
## $ family_history_2 <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, N…

# Test Outcome Analysis

Now I will try to predict diabetes status with the help of a random forest. I will not go into detail about what a random forest is, how it works, or how to tune the model. My point here is to show how we can analyse model performance (this applies both to machine-learning models and to real diagnostic tests). You can check out my project on Dementia Prediction with Tree-based Models to learn more about tree-based models.

library(randomForest)   # randomForest()
library(caret)          # confusionMatrix()

frml <- (diabetes ~ gender + age + follow_up + thalach + chol + trestbps)

model_rf <- randomForest(formula = frml, data = data)

# note: we predict on the training data here, so the metrics below are optimistic
prediction_rf <- predict(object = model_rf, 
                         newdata = select(data, -diabetes),
                         type = "class")

confusionMatrix(data = prediction_rf, reference = data$diabetes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 197   1
##          1   1 169
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9805, 0.9993)
##     No Information Rate : 0.538           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9891          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9949          
##             Specificity : 0.9941          
##          Pos Pred Value : 0.9949          
##          Neg Pred Value : 0.9941          
##              Prevalence : 0.5380          
##          Detection Rate : 0.5353          
##    Detection Prevalence : 0.5380          
##       Balanced Accuracy : 0.9945          
##                                           
##        'Positive' Class : 0               
## 
print(paste0("AUC = ", round(Metrics::auc(actual = data$diabetes, predicted = prediction_rf), 4)))
## [1] "AUC = 0.9945"
plot(roc(prediction_rf, data$diabetes), main = "ROC", cex.axis = 0.6, cex.main = 0.8, cex.lab = 0.8)
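
Note again that the predictions above are made on the same data the forest was trained on, so these metrics are optimistic. A minimal sketch of a hold-out evaluation (the 70/30 split is illustrative, not from the original analysis):

# hold out a test set so the performance estimate is not inflated
set.seed(123)
train_idx <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
model_split <- randomForest(formula = frml, data = data[train_idx, ])
pred_split <- predict(model_split, newdata = data[-train_idx, ], type = "class")
confusionMatrix(data = pred_split, reference = data$diabetes[-train_idx])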

In this part we will focus on how to make decisions about a population based on information from a sample.

library(tidyverse)
library(repr)
library(cowplot)
library(bootstrap)
library(binom)
## Warning: package 'binom' was built under R version 4.2.3
library(infer)
## Warning: package 'infer' was built under R version 4.2.3
set.seed(123)
options(repr.plot.width = 7, repr.plot.height = 3)
heart <- read.csv("FIC_FUll.csv")

sample_n(heart,5)
##   Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1  61     61-70 Female    RURAL        MARRIED         NO    NO     FREE
## 2  61     61-70 Female    RURAL        MARRIED        YES   YES     PAID
## 3  40     31-40   Male    URBAN        MARRIED         NO    NO     PAID
## 4  50     41-50   Male    RURAL        MARRIED        YES   YES     PAID
## 5  64     61-70   Male    URBAN        MARRIED         NO    NO     FREE
##   Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1        YES       YES      NO             NO         0        1  NO        NO
## 2        YES       YES      NO             NO         0        1 YES        NO
## 3        YES       YES     YES            YES         1        1 YES       YES
## 4        YES       YES     YES             NO         0        1  NO        NO
## 5        YES        NO      NO             NO         0        0 YES        NO
##      BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 100.6            0  84     28  0.9      138         3.3        107   130
## 2 120.8            0  96     42  1.0      146         3.9        100   146
## 3 140.9            0 152     35  0.9      140         3.9        105    61
## 4 100.6            0 512     42  1.3      139         4.1        105   196
## 5 120.8            1 291     37  0.6      137         3.9         98    72
##   CK.MB ESR   WBC  RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1    30  11  9900 4.26       11.6  0.34  79.7  27.2    0.34         265000
## 2    21  25  7400 4.14       11.7  0.36  87.0  28.0    0.32         395000
## 3    15  12 12800 4.26       12.3  0.36  83.6  28.9    0.35         459000
## 4    50  22 15000 4.41       14.3  0.41  93.9  32.4    0.35         228000
## 5    39  25  7400 4.36       11.2  0.33  76.6  25.7    0.34         136000
##   NEUTROPHIL LYMPHO MONOCYTE EOSINO   Others                                 CO
## 1       0.70   0.25     0.03      2       no                        Chest pain,
## 2       0.63   0.31     0.03      3       no               Chest pain, SWEATING
## 3       0.65   0.26     0.08      1       no Chest pain,heart sinking, vomiting
## 4       0.85   0.10     0.03      2       no                        Chest pain,
## 5       0.78   0.18     0.03      1 HCV, IHD       Chest pain,sweating,vomiting
##             Diagnosis Hypersensitivity cp trestbps chol fbs restecg thalach
## 1 EXT. ACUTE WALL M.I               NO  4      145  307   0       2     146
## 2          AC I/W M.I               NO  4      130  330   0       2     169
## 3       Acute I/W M.I               NO  4      152  223   0       0     181
## 4         AC. A/L M.I               NO  1      110  264   0       0     132
## 5   old I/W M.I, ACS.               NO  3      140  335   0       0     158
##   exang oldpeak slope ca thal num SK       SK.React Reaction Mortality
## 1     1     1.0     2  0    7   1  1             NO        0         1
## 2     0     0.0     1  0    3   1  1             NO        0         1
## 3     0     0.0     1  0    7   1  1 COUGH.BLEEDING        1         0
## 4     0     1.2     2  0    7   1  1 COUGH.BLEEDING        1         1
## 5     0     0.0     1  0    3   1  0      BODY.PAIN        1         0
##   Follow.Up
## 1        15
## 2        12
## 3        36
## 4         3
## 5        60
library(janitor)
heart <- clean_names(heart)

# Likelihood

Likelihood is a tool for summarizing the data's evidence about unknown parameters. The likelihood of a collection of data is the joint density evaluated as a function of the unknown parameters with the data fixed.

Imagine we are running an experiment to check the ratio of people with and without diabetes. We assume that the ratio in this dataset is the true population ratio. Then we take samples and check whether the difference between the sample ratio and the population ratio goes to zero as the sample size increases.

print(paste0("Ratio of people with Diabetes in population:", round(mean(heart$diabetes == 1), 3)))
## [1] "Ratio of people with Diabetes in population:0.462"
glimpse(heart)
## Rows: 368
## Columns: 60
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
n <- 100 #size of the sample

#extract the diabetes column (already coded as 0/1 integers)
diabetes <- as.numeric(heart$diabetes)

avg_value <- mean(diabetes)

#take three random samples of size n each
x1 = sample(diabetes, size = n, replace = F)
x2 = sample(diabetes, size = n, replace = F)
x3 = sample(diabetes, size = n, replace = F)

#create 3 vectors that will hold the running ratio for each sample size
xbar1 = rep(0, length(x1))
xbar2 = rep(0, length(x2))
xbar3 = rep(0, length(x3))

for (i in 1:length(x1)){
  xbar1[i] = mean(x1[1:i])
  xbar2[i] = mean(x2[1:i])
  xbar3[i] = mean(x3[1:i])
}

plot(1:n, xbar1 - avg_value, type = "l",col = 'red', lwd = 1, ylim=c(-0.1,0.2),
     xlab = "Number of Subjects sampled",
     ylab = "Distance to the mean")

lines(1:n, xbar2-avg_value, col="blue", lwd=1)
lines(1:n, xbar3-avg_value, col="orange", lwd=1)
lines(1:n, rep(0,n), lwd=3)

We can see that the larger the number of subjects in our samples, the closer the sample mean gets to the population mean.

The law of large numbers establishes that as n increases the sample averages get close to the target, while the central limit theorem tells us how close, and with what probability, the results of the experiment are to the true target.

#Central Limit Theorem The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution. This holds especially well for sample sizes over 30. All this is saying is that as you take more samples, especially large ones, the graph of the sample means will look more and more like a normal distribution.
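Here is a minimal sketch of the theorem using a synthetic, clearly skewed population (the exponential distribution is chosen purely for illustration): we draw many samples of size 30 and histogram their means, which pile up in a bell shape even though the population is not normal.

set.seed(42)
skewed_pop <- rexp(10000, rate = 1) #a clearly non-normal population
sample_means <- replicate(2000, mean(sample(skewed_pop, size = 30)))
par(mfcol = c(1, 2))
hist(skewed_pop, breaks = 40, col = "lightgreen", main = "Skewed population")
hist(sample_means, breaks = 40, col = "lightblue", main = "Means of 2000 samples (n = 30)")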

chol <- heart$chol[heart$diabetes == 0]
hist(chol, breaks = 50, col = "lightgreen")

print(paste0("Population mean: ", round(mean(chol), 1)))
## [1] "Population mean: 244.9"
n=30
sample = sample(chol, size = n, replace = F)
xbar = rep(0, length(sample))

for(i in 1:length(sample)){
  xbar[i] = mean(sample[1:i])
}

hist(xbar, breaks = 50, col = "orange", main = "Histogram of sample mean (n = 30)")

print(paste0("Sample mean:", round(mean(xbar), 1)))
## [1] "Sample mean:244.6"
length(chol)
## [1] 198
n = 190
sample = sample(chol, size = n, replace = F)
xbar = rep(0, length(sample))

for(i in 1:length(sample)){
  xbar[i] = mean(sample[1:i])
}

hist(xbar, breaks = 50, col = "orange", main = "Histogram of sample mean (n = 190)")

print(paste0("Sample mean: ", round(mean(xbar), 1)))
## [1] "Sample mean: 242.5"

#Confidence Intervals A confidence interval quantifies how much uncertainty there is around a particular statistic. Confidence intervals are often used with a margin of error. They tell you how confident you can be that the results from a poll or survey reflect what you would expect to find if it were possible to survey the entire population.

population <- rnorm(5000, 10,5)
hist(population, breaks = 25, col = "lightgreen")

print(paste0("Population mean: ", round(mean(population), 1)))
## [1] "Population mean: 10"
print(paste0("Population standard error:", round(sd(population), 1)))
## [1] "Population standard error:5"
n = 250
sample <- sample(population, size = n, replace = F)
print(paste0("sample mean: ", round(mean(sample), 1)))
## [1] "sample mean: 10.1"
print(paste0("Sample standard error: ", round(sd(sample), 1)))
## [1] "Sample standard error: 4.6"
#calculate a 95% CI (z-interval, population sd assumed known)
alpha = 0.05
quant = 1 - alpha/2
z_score <- qnorm(quant, lower.tail = TRUE)
lower <- mean(sample) -  z_score * sd(population) / sqrt(n)
upper <- mean(sample) + z_score * sd(population) / sqrt(n)

hist(sample, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3)

print(paste0("With 95% confidence we can say thet population mean is in interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say thet population mean is in interval [9.5; 10.7]"

But, as said before, we usually don’t know the distribution of the population. In this case we must rely on sample standard deviations and t-scores. T-scores come from t-distributions, which help us account for the extra error that occurs when we estimate the population standard deviation from a sample. We use a different t-distribution to calculate cumulative probabilities depending on our degrees of freedom.
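A quick illustration of the difference: the t critical value is larger than the z critical value for small degrees of freedom, and converges to it as the sample grows.

qnorm(0.975)                        #z critical value for a 95% interval: ~1.96
qt(0.975, df = c(5, 30, 100, 1000)) #~2.57, 2.04, 1.98, 1.96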

chol <- heart$chol
X_bar <- mean(chol)
s <- sd(chol)
n <- length(chol)
alpha = 0.05
t_score <-  qt(1 - alpha/2, n-1)

lower <- X_bar - t_score * s / sqrt(n)
upper <- X_bar + t_score * s / sqrt(n)

hist(chol, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3, col = "red")

print(paste0("With 95% confidence we can say that population mean is in interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say that population mean is in interval [243.8; 254.1]"

Note how narrow the CI is. This is due to the sample size (all 368 observations). Look what happens if we use a much smaller sample:

chol <- heart$chol
chol <- sample(chol, size = 100, replace  = F)
X_bar <- mean(chol)
s <- sd(chol)
n <- length(chol)
alpha = 0.05
t_score <- qt( 1 - alpha/2, n-1)

lower <- X_bar - t_score * s / sqrt(n)
upper <- X_bar + t_score * s / sqrt(n)

hist(chol, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3, col = "red")

print(paste0("With 95% confidence we can say that population mean is in interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say that population mean is in interval [245.4; 264]"

#t and F tests In this part we will look at how we can compare independent groups (for example, patients with and without diabetes).

Introduction to t-test and confidence intervals The t-test (also called Student’s t-test) compares two averages (means) and tells you whether they differ from each other. The t-test also tells you how significant the differences are; in other words, it lets you know whether those differences could have happened by chance.
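Before turning to our data, here is a minimal sketch of the statistic itself, computed by hand on two made-up vectors (g1 and g2 are invented purely for illustration):

#Welch's t statistic by hand on two made-up samples
g1 <- c(23, 25, 28, 31, 22)
g2 <- c(30, 33, 29, 35, 31)
se <- sqrt(var(g1)/length(g1) + var(g2)/length(g2))
t_stat <- (mean(g1) - mean(g2)) / se
#Welch-Satterthwaite degrees of freedom
df_w <- se^4 / ((var(g1)/5)^2/4 + (var(g2)/5)^2/4)
p_val <- 2 * pt(-abs(t_stat), df = df_w) #two-sided p-value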

Let’s assume that we want to check whether the difference in age between people with and without diabetes is significant. To do that we can run a t-test. First, let’s visually inspect the distribution of age by diabetes outcome.

heart %>% mutate(group = case_when(
  (diabetes == 1) ~ "Diabetic",
  (diabetes == 0) ~ "Non-Diabetic"
)) %>% 
  group_by(group) %>% 
  summarise(count = length(age),
            mean = round(mean(age)),
            variance = round(var(age), 2))
## # A tibble: 2 × 4
##   group        count  mean variance
##   <chr>        <int> <dbl>    <dbl>
## 1 Diabetic       170    56     53.1
## 2 Non-Diabetic   198    52     88.5

Our hypotheses for the t-test look like this:

H0: there is no difference in means between the two groups, μ0 = μ1.
H1: there is a difference in means between the two groups, μ0 ≠ μ1.
#sample size
n1 <- length(heart$age[heart$diabetes == 0])
n2 <- length(heart$age[heart$diabetes == 1])

#sample mean
x1_bar <- mean(heart$age[heart$diabetes == 0])
x2_bar <- mean(heart$age[heart$diabetes == 1])

#sample variance
s1 <- var(heart$age[heart$diabetes == 0])
s2 <- var(heart$age[heart$diabetes == 1])

diff <- x1_bar - x2_bar
SE <- sqrt(s1/n1 + s2/n2)
alpha = 0.05
t <- qt((1-alpha/2), df = n1 + n2 - 2) #t critical value (pooled df, a simple approximation)

print(paste0("95% of Confidence interval is [", round(diff-t*SE,2), "; ", round(diff+t*SE,2), "]. "))
## [1] "95% of Confidence interval is [-5.75; -2.32]. "

The confidence interval for the difference in means is (−5.75; −2.32). It does not include 0, so we reject the null hypothesis and say that there is a significant difference between the group means at the 95% confidence level. In other words, we are 95% confident that diabetes is more likely to occur in older people. Alternatively, we could just run the t.test function built into R and find out whether our difference in means is significant or just random chance. We cannot use a paired t-test because our groups have different sample sizes, so we leave paired = FALSE (the default). The parameter var.equal should also be set to FALSE, since the variances of the two groups are not equal.

t.test(age ~ diabetes, 
       data = heart, 
       var.equal = FALSE, 
       paired = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  age by diabetes
## t = -4.6313, df = 362.3, p-value = 5.073e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.748912 -2.321914
## sample estimates:
## mean in group 0 mean in group 1 
##        52.42929        56.46471

This may seem unclear: we had two samples and calculated the difference of their averages. This difference was far from 0, so why did we still have to run a t-test to check for significance? The answer is that the width of the confidence interval depends on the sample size (the more observations we have, the more confident we can be). Let’s run another example with fewer observations.

example <- heart %>% select(diabetes, age)

#construct an artificial small example: first 50 rows, group labels assigned by hand
example <- example[1:50,]
example$diabetes[1:25] <- 0
example$diabetes[26:50] <- 1

example %>% group_by(diabetes) %>% 
  summarise(count = length(age),
            mean_age = round(mean(age)),
            variance = round(var(age),2))
## # A tibble: 2 × 4
##   diabetes count mean_age variance
##      <dbl> <int>    <dbl>    <dbl>
## 1        0    25       59     21.1
## 2        1    25       47     81.4
t.test(age ~ diabetes, data = example, 
       var.equal = FALSE,
       paired = FALSE, 
       conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  age by diabetes
## t = 6.0835, df = 35.679, p-value = 5.564e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   8.21154 16.42846
## sample estimates:
## mean in group 0 mean in group 1 
##           59.16           46.84

In this case the difference was about 12, which is also “away from zero”, but the confidence interval [8.21; 16.43] is noticeably wider than in the full-sample test. With an even smaller or noisier sample the interval could easily cross 0, and in that case we could not be sure that age is related to the diabetic outcome.

#F test It can be a difficult task to decide whether the variances of two groups differ. We need to provide evidence against the assumption that the true variances are equal, and to do this we need the distribution of the ratio of the variances in groups 1 and 2: σ1²/σ2².

We always test whether the population variances are equal when running an F-test; in other words, the null hypothesis is always that the ratio of the variances equals 1.

Test statistic: F = s1²/s2², the ratio of the two sample variances.

The hypothesis that the two variances are equal is rejected if the statistic falls in the critical region. For a one-tailed test (> or <), that means F > F(1−α; n1−1, n2−1) or F < F(α; n1−1, n2−1); for a two-tailed test, F < F(α/2; n1−1, n2−1) or F > F(1−α/2; n1−1, n2−1).

Let’s look back at our example with the distribution of Age by diabetes outcome:

heart %>% mutate(group = case_when(
  (diabetes == 1) ~ "Diabetic",
  (diabetes == 0) ~ "Non Diabetic"
)) %>% group_by(group) %>% summarise(count = length(age), 
                                     variance = round(var(age), 2))
## # A tibble: 2 × 3
##   group        count variance
##   <chr>        <int>    <dbl>
## 1 Diabetic       170     53.1
## 2 Non Diabetic   198     88.5
#calculate the ratio of variances
F <- var(heart$age[heart$diabetes == 0]) / var(heart$age[heart$diabetes == 1])
print(paste0("Test statistic (ratio of variances): ", round(F,2)))
## [1] "Test statistic (ratio of variances): 1.67"
#get the critical values for the F distribution (df = n - 1 for each group)
round(qf(c(0.025, 0.975), 198-1, 170-1), digits = 2)
## [1] 0.75 1.34
lower <- F/qf(c(0.025, 0.975), 198-1, 170-1)[2]
upper <- F/qf(c(0.025, 0.975), 198-1, 170-1)[1]

print(paste0("The 95% confidence interval is [", round(lower,2), "; ", round(upper,2), "]."))
## [1] "The 95% confidence interval is [1.24; 2.23]."

And again, the interval does not include 1, so we can reject the null hypothesis.

There is a built-in function in R which can compute the F-statistic and confidence interval: var.test

var.test(age ~ diabetes, data = heart)
## 
##  F test to compare two variances
## 
## data:  age by diabetes
## F = 1.6668, num df = 197, denom df = 169, p-value = 0.0006817
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.243223 2.227283
## sample estimates:
## ratio of variances 
##           1.666776

#Data Resampling Unfortunately, sometimes the size of your sample is not big enough to make proper judgments about the population. In this case resampling methods become useful. Resampling methods (like the jackknife, bootstrapping, and permutation testing) help approximate the population, confidence intervals, and standard errors. The main idea is to repeatedly create new samples from the original sample (with replacement, in the case of the bootstrap).
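The core mechanic behind all of these methods is drawing from the sample itself; a toy sketch with an invented vector x:

x <- c(2, 5, 1, 9, 7)
sample(x, replace = TRUE) #a bootstrap resample: same size, duplicates allowed
x[-3]                     #a jackknife "resample": one observation deleted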

#Jackknife The jackknife deletes each observation in turn and calculates an estimate based on the remaining (n−1) observations. It uses this collection of estimates to do things like estimate the bias and the standard error.

Note that estimating the bias is not needed for statistics like the sample mean, which we know is an unbiased estimate of the population mean.
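For intuition, here is a hand-rolled sketch of the jackknife, applied to the plain variance of hemoglobin (the package function used below automates this kind of leave-one-out loop):

x <- heart$hemoglobin
theta_hat <- var(x)
loo <- sapply(seq_along(x), function(i) var(x[-i])) #leave-one-out estimates
n_j <- length(x)
jack_bias <- (n_j - 1) * (mean(loo) - theta_hat)
jack_se <- sqrt((n_j - 1) * mean((loo - mean(loo))^2))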

Let’s apply it to the hemoglobin distribution from our dataset:

n = length(heart$hemoglobin)

#function for biased version of variance
bias_var = function(x){
  n = length(x)
  (n - 1) * var(x) / n
}

#run the jackknife function from the bootstrap package
library(bootstrap)
jackknife(heart$hemoglobin, bias_var)
## $jack.se
## [1] 0.2770813
## 
## $jack.bias
## [1] -0.01267983
## 
## $jack.values
##   [1] 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859 4.651625
##   [9] 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859 4.651625 4.665220
##  [17] 4.666077 4.666077 4.659114 4.624452 4.652859 4.666077 4.659114 4.624452
##  [25] 4.652859 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425
##  [33] 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.647819 4.647819
##  [41] 4.658354 4.660882 4.661616 4.650336 4.665168 4.666177 4.650543 4.646143
##  [49] 4.662191 4.665919 4.646376 4.659252 4.630613 4.626560 4.662919 4.666077
##  [57] 4.603020 4.666059 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003
##  [65] 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616
##  [73] 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336 4.665168
##  [81] 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252 4.630613
##  [89] 4.626560 4.662919 4.666077 4.603020 4.666059 4.665757 4.666059 4.662191
##  [97] 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252
## [105] 4.659252 4.664463 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882
## [113] 4.661616 4.650336 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919
## [121] 4.646376 4.659252 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [129] 4.665757 4.666059 4.658354 4.660882 4.661616 4.650336 4.665168 4.666177
## [137] 4.650543 4.646143 4.662191 4.641717 4.630613 4.639787 4.662191 4.630613
## [145] 4.639787 4.662191 4.641717 4.630613 4.639787 4.666077 4.630613 4.639787
## [153] 4.662191 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [161] 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114
## [169] 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114
## [177] 4.624452 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452
## [185] 4.652859 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402
## [193] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [201] 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336
## [209] 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252
## [217] 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059 4.665757
## [225] 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425
## [233] 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114 4.624452
## [241] 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859
## [249] 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402 4.664003
## [257] 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616
## [265] 4.647819 4.647819 4.647819 4.658354 4.665757 4.630613 4.626560 4.662919
## [273] 4.666077 4.603020 4.666059 4.665757 4.666059 4.662191 4.657402 4.657402
## [281] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [289] 4.661616 4.647819 4.659114 4.624452 4.652859 4.651625 4.665220 4.666077
## [297] 4.666077 4.659114 4.624452 4.652859 4.666077 4.659114 4.624452 4.652859
## [305] 4.662191 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [313] 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114
## [321] 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114
## [329] 4.624452 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452
## [337] 4.652859 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402
## [345] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [353] 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336
## [361] 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252
## 
## $call
## jackknife(x = heart$hemoglobin, theta = bias_var)

#Bootstrapping In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
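The whole idea fits in two lines: resample with replacement, recompute the statistic. A minimal sketch before the fuller worked example below:

boot_means <- replicate(1000, mean(sample(heart$hemoglobin, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975)) #percentile 95% CI for the mean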

And again, the hemoglobin distribution. We will assume that our distribution of hemoglobin is the true population distribution, which is why we know the population mean and variance.

pop_hemo <- heart$hemoglobin
pop_mean <- mean(heart$hemoglobin)
hist(pop_hemo, col = "lightgreen", main = "")

n <- 300 #sample size
no_sim <- 1000 #number of resamples
sample_hemo <- sample(pop_hemo, n, replace  = F)

#sample no_sim resamples of size n with replacement
#and put them in a matrix with no_sim rows and n columns,
#so that each row is a resampled data set

bsResamples = matrix(sample(sample_hemo, n * no_sim, replace = TRUE), no_sim, n)

par(mfcol = c(1,2))

#plot histogram
hist(sample_hemo, probability = TRUE, col = "orange", breaks = 10, xlab = "Hemoglobin", main = "")

#plot histogram of resampled data
hist(bsResamples, probability = TRUE, col = "orange", breaks = 10, xlab = "Hemoglobin (1000 - resamples)", main = "")

Note that both histograms look similar, as they should, since the right histogram was effectively created by simulating from the left one.

#calculate the mean of each resample data
bsStat = rowMeans(bsResamples)

#plot the histogram of our resamples statistics
hist(bsStat, probability =  TRUE, col = "lightblue", xlab = "Mean of hemoglobin resamples", main = "")

abline(v = pop_mean, col = "red", lwd = 2)
abline(v = quantile(bsStat, c(0.025, 0.975)), lwd = 3) #95% CI

print(paste0("Standard Error: ", round(sd(bsStat), 3)))
## [1] "Standard Error: 0.132"

And there is a shortcut function to compute the bootstrap!

BS <- bootstrap(x = sample_hemo, nboot = 10000, theta = mean)

hist(BS$thetastar, probability = TRUE, col = "lightblue", xlab = "Mean of hemoglobin resamples", main = "")
abline(v = pop_mean, col = "red", lwd = 2)
abline(v = quantile(BS$thetastar, c(0.025, 0.975)), lwd = 4) #95% CI

print(paste0("Standard Error: ", round(sd(BS$thetastar), 3)))
## [1] "Standard Error: 0.132"

There are cases when resampling can give misleading results (not representative of the population), for example when the sample size is small or when the sample contains outliers. If we rerun the code over and over again, we will see that the mean of the resampled statistics can vary significantly from the population mean.
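A quick sketch of that caveat, bootstrapping a tiny sample (n = 10 here is an arbitrary choice): the resampled means jump around noticeably each time this code is rerun.

tiny <- sample(heart$hemoglobin, 10)
replicate(5, mean(sample(tiny, replace = TRUE))) #rerun: results vary widely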

#Intervals for Binomial Probabilities Let’s consider the estimation of the probability of success, p, from data X ∼ Binomial(n, p), where X is the number of successes obtained from n trials, each with probability of success p. As an example, take a sample of 50 patients and estimate the proportion who survived (mortality = 0).

n <- 50
mort <- sample(heart$mortality, n)
de_mort <- sum(mort == 0) #number of survivors in the sample

n_tilde <- n + 4
p_hat <- de_mort/n
p_tilde <- (de_mort + 2) / n_tilde
z = qnorm(c(0.025, 0.975))

print(paste0("P-hat= ", round(p_hat, 2)))
## [1] "P-hat= 5.76"
print(paste0("p-tilde = ", round(p_tilde, 2)))
## [1] "p-tilde = 5.37"
#95% Wald interval
se_Wald = sqrt(p_hat * (1 - p_hat) / n)
ci_Wald = round(p_hat + z * se_Wald, 3)
ci_Wald

#95% Agresti/Coull interval
se_AC = sqrt(p_tilde * (1 - p_tilde) / n_tilde)
ci_AC = round(p_tilde + z * se_AC, 3)
ci_AC

#and as always there is a function from the 'binom' package that can do all the calculations for us
#keep methods = "all" if you want to check intervals for different methods, not only Wald or Agresti/Coull
library(binom)
binom.confint(x = de_mort,
              n = n,
              conf.level = 0.95,
              methods = "all")