#Use Statistical Models for Data Analysis
Outline:
I. Statistics Basics
a) Summaries and Histograms
b) The relationship between mean and median
c) Boxplots and Suspected Outliers
d) Using natural logarithms and Qplots
II. Inferential Statistics
a) Hypothesis Testing and Confidence Intervals
b) Interpreting Confidence Intervals
c) Significance Levels and P-values
d) Brief Introduction to the Chi-Squared Test of Independence (to be continued)
e) Contingency Tables for Categorical Variables
f) T-Distribution
g) Difference Between Two Independent Means
h) Difference Between More Than Two Independent Means (ANOVA)
i) Bootstrapping (Simulation-based Method)
References:
a) Statistics Terminology
b) (Bio)statistics in R: Part #1 by def me
c) Statistics with R Coursera Certification by Duke University
d) Performing the Non-parametric Bootstrap for Statistical Inference Using R by Ian Dworkin
e) Data Visualization with ggplot, Part 2 by Rick Scavetta
f) Stack Exchange reply by Penguin_Knight
g) Outlier Detection by Zach Loertscher
##(Bio)statistics in R: Part #1
Let’s look at some examples in R with coin flips to make this more concrete.
We assume the coin is fair, so the distribution of outcomes follows a binomial distribution with the probability of getting heads = 0.5.
A binomial experiment is a statistical experiment with the following properties: the experiment consists of n repeated trials; each trial can result in just two possible outcomes, one of which we call a success and the other a failure; and the probability of success, denoted by P, is the same on every trial.
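Under these assumptions, the probability of any particular number of heads comes from the binomial probability mass function; a quick illustration with base R (nothing here is dataset-specific):
dbinom(5, size = 10, prob = 0.5) #P(exactly 5 heads in 10 flips) = 252/1024, about 0.246
pbinom(3, size = 10, prob = 0.5) #P(at most 3 heads in 10 flips) = 176/1024, about 0.172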
set.seed(11)
# Generate two sample spaces for 10 coins, flip each coin 1 time. Event E = 1 (heads) with probability .5
exp_a <- rbinom(10, 1 , 0.5)
exp_b <- rbinom(10, 1, 0.5)
print(paste0("sample space of expt a: ", list(exp_a)))
## [1] "sample space of expt a: c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)"
print(paste0("sample space of expt b: ", list(exp_b)))
## [1] "sample space of expt b: c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)"
cat("\n")
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_a)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)"
print(paste0("E compliment (tails flip) for sample a: ", list(!exp_b)))
## [1] "E compliment (tails flip) for sample a: c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)"
cat("\n")
print(paste0("Coin flip is 'heads' (1) for both sample a and b: ", list(exp_a & exp_b)))
## [1] "Coin flip is 'heads' (1) for both sample a and b: c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for either sample a and b: ", list(exp_a | exp_b)))
## [1] "Coin flip is 'heads' (1) for either sample a and b: c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)"
print(paste0("Coin flip is 'heads' (1) for sample a when b is 'tails' : ", list(exp_a & !exp_b)))
## [1] "Coin flip is 'heads' (1) for sample a when b is 'tails' : c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)"
#Probability
Probability is simply the likelihood that an event will take place. Since each flip is an observation, we can estimate a probability empirically as the proportion of flips on which the event occurred, which is what mean() computes on a logical vector.
print(paste0("Probability that is 'heads' for both experiment a and b: ", round(mean(exp_a & exp_b), 3)))
## [1] "Probability that is 'heads' for both experiment a and b: 0.2"
print(paste0("Probability that is 'heads' for either experiment a and b: ", round(mean(exp_a | exp_b), 3)))
## [1] "Probability that is 'heads' for either experiment a and b: 0.5"
print(paste0("Probability that is 'heads' for both experiment a when its is tails of experiment b: ", round(mean(exp_a & !(exp_b)), 3)))
## [1] "Probability that is 'heads' for both experiment a when its is tails of experiment b: 0.1"
The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods of a random sample of 14 HIV-infected individuals are given below (in years).
Calculate the sample mean, the sample median, and the sample standard deviation.
inc_period <- c(12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9)
print(paste0("sample mean:", round(mean(inc_period), 3)))
## [1] "sample mean:9.85"
print(paste0("sample mean:", round(median(inc_period), 3)))
## [1] "sample mean:10"
print(paste0("sample mean:", round(sd(inc_period), 3)))
## [1] "sample mean:3.059"
Serum cholesterol (mmol/L) was measured on a sample of 86 stroke patients (data of Markus et al., 1995). Calculate the distribution characteristics and plot the distribution.
cholesterol <- c(3.7,4.8,5.4,5.6,6.1,6.4,7.0,7.6,8.7,3.8,4.9,5.4,5.6,6.1,6.5,7.0,7.6,8.9,3.8,4.9,5.5,5.7,6.1,
6.5,7.1,7.6,9.3,4.4,4.9,5.5,5.7,6.2,6.6,7.1,7.7,9.5,4.5,5.0,5.5,5.7,6.3,6.7,7.2,7.8,10.2,4.5,
5.1,5.6,5.8,6.3,6.7,7.3,7.8,10.4,4.5,5.1,5.6,5.8,6.4,6.8,7.4,7.8,4.7,5.2,5.6,5.9,6.4,6.8,7.4,
8.2,4.7,5.3,5.6,6.0,6.4,7.0,7.5,8.3,4.8,5.3,5.6,6.1,6.4,7.0,7.5,8.6)
print(paste0("sample mean:", round(mean(cholesterol), 3)))
## [1] "sample mean:6.341"
print(paste0("sample mean:", round(median(cholesterol), 3)))
## [1] "sample mean:6.15"
print(paste0("sample mean:", round(var(cholesterol), 3)))
## [1] "sample mean:1.959"
print(paste0("sample mean:", round(sd(cholesterol), 3)))
## [1] "sample mean:1.4"
df <- data.frame(cholesterol)
ggplot(df, aes(x = cholesterol))+geom_histogram(bins = 20, fill = "green", alpha = 0.3)+
geom_vline(aes(xintercept = mean(cholesterol)), color = "red", size = 0.5) +
geom_vline(aes(xintercept = median(cholesterol)), color = "red", linetype = "dashed", size = 0.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(df, aes(x = "", y = cholesterol)) +
geom_boxplot(fill = "green", alpha = 0.3) +
stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2)
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The data below show the consumption of alcohol (X, liters per year per person, 14 years or older) and the death rate from cirrhosis, a liver disease (Y, deaths per 100,000 population) in 15 countries (each country is an observation unit).
Draw a scatter diagram to show the association, if any, between these two variables; can you draw any conclusion without doing any calculations?
Then calculate the covariance and the correlation coefficient.
X <- c(24.7, 15.2, 12.3, 10.9, 10.8, 9.9, 8.3, 7.2, 6.6, 5.8, 5.7, 5.6, 4.2, 3.9, 3.1)
Y <- c(46.1, 23.6, 23.7, 7, 12.3, 14.2, 7.4, 3.0, 7.2, 10.6, 3.7, 3.4, 4.3, 3.6, 5.4)
df <- data.frame(X, Y)
#to visualize the relationship between the two variables, draw a scatter plot
ggplot(df, aes(x = X, y = Y))+geom_point()+
geom_smooth(method = lm, se = FALSE) +
labs (x = "Alcohol consumption", y = "Death Rate from Cirrhosis",
title = "Relation between two variables")
## `geom_smooth()` using formula = 'y ~ x'
It looks like we have a strong positive correlation: as alcohol consumption increases, the death rate from cirrhosis also increases. Let’s calculate the coefficients to check this impression.
print(paste0("covariance between alcohol consumption adn death rate from cirrhosis: ", round(cov(X, Y),2)))
## [1] "covariance between alcohol consumption adn death rate from cirrhosis: 60.68"
print(paste0("Correlation betweel Alcohol consumption and death rate from cirrhosis ", round(cor(X,Y),2)))
## [1] "Correlation betweel Alcohol consumption and death rate from cirrhosis 0.94"
#Conditional Probability
What is the probability of getting a certain illness between 0 and 30 years of age, if the likelihood of being of that age is 0.4, and the overall (marginal) probability (in the whole population) of getting this illness is 0.1? By definition, P(illness | age) = P(illness and age) / P(age); if we assume age and illness are independent, the joint probability factorizes into P(illness) * P(age), so the conditional probability reduces to the marginal P(illness).
p_age <- 0.4 #P(being between 0 and 30 years old)
p_ill <- 0.1 #P(illness), marginal
#assuming independence: P(illness and age) = P(illness) * P(age)
p_ill_given_age <- p_ill * p_age / p_age
print(paste0("Probability of getting the illness given the age: ", p_ill_given_age))
## [1] "Probability of getting the illness given the age: 0.1"
There are 3 diseases, (A, B, C), which occur with probabilities 0.5, 0.3 and 0.2 respectively in people coming to an outpatient clinic. The probability of chest pain in these diseases is 0.8, 0.4 and 0.2 respectively. What is the probability that people at the outpatient clinic experience chest pain? What is the probability of disease C in a person who has no chest pain?
pa <- 0.5
pb <- 0.3
pc <- 0.2
chest_given_pa <- 0.8
chest_given_pb <- 0.4
chest_given_pc <- 0.2
p_chest_pain <- pa * chest_given_pa + pb * chest_given_pb + pc * chest_given_pc
p_chest_given_no_chest <- pc *(1 - chest_given_pc) / (1 - p_chest_pain)
print(paste0("Probability that people at the outpatient clinic experience chest pain: ", p_chest_pain))
## [1] "Probability that people at the outpatient clinic experience chest pain: 0.56"
print(paste0("Probability of disease C in a person who has no chest pain: ", round(p_chest_given_no_chest,3)))
## [1] "Probability of disease C in a person who has no chest pain: 0.364"
#Bayes Rule
Bayes rule describes the probability of an event based on prior knowledge of conditions that might be related to the event.
A study comparing the efficacy of HIV tests concluded that HIV antibody tests have a sensitivity of 99.7% and a specificity of 98.5%. Suppose that a subject, from a population with a 0.1% prevalence of HIV, receives a positive test result. We are interested in calculating the probability that this subject has HIV. Mathematically, we would like to calculate P(D+|T+), the probability of being HIV positive (D+) given that the test was positive (T+). What is known is the sensitivity of the test, P(T+|D+) = 0.997, the specificity of the test, P(T−|D−) = 0.985, and the prevalence of the disease in the target population, P(D+) = 0.001. Using Bayes formula, P(D+|T+) = P(T+|D+)P(D+) / [P(T+|D+)P(D+) + (1 − P(T−|D−))(1 − P(D+))], we can calculate the quantity that we are interested in.
p_dp_given_tp <- (0.997*0.001)/(0.997*0.001 + (1-0.985)*(1-0.001))
print(paste0("Probability of being HIV positive given that one test was positive: ", round(p_dp_given_tp,3)))
## [1] "Probability of being HIV positive given that one test was positive: 0.062"
Thus, in this population, a positive test result corresponds to only a 6% probability that the subject is actually HIV positive. This is the positive predictive value of the test (PPV). The low positive predictive value is due to low prevalence of the disease and the somewhat modest specificity relative to the prevalence of the disease.
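To see how strongly prevalence drives the PPV, we can wrap the calculation in a small helper (a sketch; the ppv function name and the prevalence grid are ours):
ppv <- function(prev, sens = 0.997, spec = 0.985) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}
round(ppv(c(0.001, 0.01, 0.1)), 3)
#the PPV rises quickly as the disease becomes more common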
library(janitor)
library(dplyr) #for sample_n(), glimpse() and the pipe operator used below
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
data <- read.csv("FIC_FUll.csv")
sample_n(data, 5)
## Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1 50 41-50 Male RURAL MARRIED NO NO FREE
## 2 55 51-60 Male URBAN MARRIED YES YES FREE
## 3 56 51-60 Male URBAN MARRIED YES YES FREE
## 4 55 51-60 Male URBAN MARRIED NO NO FREE
## 5 50 41-50 Male RURAL MARRIED YES NO FREE
## Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1 YES YES YES YES 1 0 NO NO
## 2 YES YES YES NO 0 1 YES NO
## 3 YES YES YES NO 0 1 YES NO
## 4 YES YES NO YES 1 0 NO NO
## 5 YES YES YES NO 0 0 YES NO
## BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 80.5 0 125 2.3 0.9 129 4.2 96 65
## 2 100.7 0 323 25.0 0.8 139 3.9 100 3877
## 3 150.9 0 260 31.0 0.8 139 4.4 110 319
## 4 130.8 0 127 46.0 1.2 139 4.3 108 188
## 5 100.6 0 348 47.0 1.1 134 4.4 96 336
## CK.MB ESR WBC RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1 36 11 9800 6.15 16.5 0.51 82.1 26.9 0.33 236000
## 2 299 25 10080 5.70 13.3 0.41 72.0 23.0 0.32 391000
## 3 51 11 9800 6.15 16.5 0.51 82.1 26.9 0.33 236000
## 4 45 11 14400 5.70 16.3 0.48 84.0 33.0 0.32 217000
## 5 22 9 10300 5.41 15.5 0.46 86.0 28.0 0.33 261000
## NEUTROPHIL LYMPHO MONOCYTE EOSINO Others
## 1 0.75 0.20 0.03 2 ICMP WITH EF= 30%,PULMONAR ODEMA
## 2 0.90 0.08 0.01 1 no
## 3 0.75 0.20 0.03 2 no
## 4 0.57 0.32 0.07 2 ORTHOPENIA, PND
## 5 0.80 0.17 0.02 1 HTN, DM
## CO Diagnosis
## 1 SOB, DIZZINESS, CHEST PAIN,NAUSEA,DIAPHORESIS CARDIOGENIC SHOCK
## 2 Chest pain4 HR, SWEATING AC I/W M.I
## 3 Chest pain,vomiting, sweating A/C, A/W M.I
## 4 Central Chest pain,SOB, Sweating NSTEM.I
## 5 Chest pain,COLD SWEATING, A/S M.I
## Hypersensitivity cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 NO 4 142 309 0 2 147 1 0.0 2 3
## 2 NO 4 110 275 0 2 118 1 1.0 2 1
## 3 NO 4 132 184 0 2 105 1 2.1 2 1
## 4 YES 4 130 256 1 2 150 1 0.0 1 2
## 5 NO 3 140 233 0 0 163 0 0.6 2 1
## thal num SK SK.React Reaction Mortality Follow.Up
## 1 7 3 1 LUNGS 1 1 15
## 2 3 1 1 STOMACH.BLEEDING 1 0 15
## 3 6 1 1 COUGH.BLEEDING 1 0 32
## 4 7 3 1 SKIN.BLEEDING 1 0 9
## 5 7 1 1 BODY.PAIN 1 0 15
glimpse(data)
## Rows: 368
## Columns: 60
## $ Age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ Age.Group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ Gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ Locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ Marital.status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ Life.Style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ Sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ Category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ Depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ Smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Family.History <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ F.History <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ HTN <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ Allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ BP <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ Thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGR <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ B.Urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ S.Cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ S.Sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ S.Potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ S.Chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ C.P.K <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ CK.MB <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ ESR <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ WBC <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ RBC <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ Hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ P.C.V <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ M.C.V <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ M.C.H <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ M.C.H.C <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ PLATELET_COUNT <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ NEUTROPHIL <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ LYMPHO <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ MONOCYTE <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ EOSINO <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ Others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ CO <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ Diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ Hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ SK <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SK.React <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ Reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ Follow.Up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
colSums(is.na(data))
## Age Age.Group Gender Locality
## 0 0 0 0
## Marital.status Life.Style Sleep Category
## 0 0 0 0
## Depression Hyperlipi Smoking Family.History
## 0 0 0 0
## F.History Diabetes HTN Allergies
## 0 0 0 0
## BP Thrombolysis BGR B.Urea
## 0 0 0 0
## S.Cr S.Sodium S.Potassium S.Chloride
## 0 0 0 0
## C.P.K CK.MB ESR WBC
## 0 0 0 0
## RBC Hemoglobin P.C.V M.C.V
## 0 0 0 0
## M.C.H M.C.H.C PLATELET_COUNT NEUTROPHIL
## 0 0 0 0
## LYMPHO MONOCYTE EOSINO Others
## 0 0 0 0
## CO Diagnosis Hypersensitivity cp
## 0 0 0 0
## trestbps chol fbs restecg
## 0 0 0 0
## thalach exang oldpeak slope
## 0 0 0 0
## ca thal num SK
## 0 0 0 0
## SK.React Reaction Mortality Follow.Up
## 0 0 0 0
data <- clean_names(data)
#for data cleaning, e.g.:
#data2 <- data %>% filter(gender != "Other" & !is.na(diabetes)) %>% select(-age)
colnames(data)
## [1] "age" "age_group" "gender" "locality"
## [5] "marital_status" "life_style" "sleep" "category"
## [9] "depression" "hyperlipi" "smoking" "family_history"
## [13] "f_history" "diabetes" "htn" "allergies"
## [17] "bp" "thrombolysis" "bgr" "b_urea"
## [21] "s_cr" "s_sodium" "s_potassium" "s_chloride"
## [25] "c_p_k" "ck_mb" "esr" "wbc"
## [29] "rbc" "hemoglobin" "p_c_v" "m_c_v"
## [33] "m_c_h" "m_c_h_c" "platelet_count" "neutrophil"
## [37] "lympho" "monocyte" "eosino" "others"
## [41] "co" "diagnosis" "hypersensitivity" "cp"
## [45] "trestbps" "chol" "fbs" "restecg"
## [49] "thalach" "exang" "oldpeak" "slope"
## [53] "ca" "thal" "num" "sk"
## [57] "sk_react" "reaction" "mortality" "follow_up"
glimpse(data)
## Rows: 368
## Columns: 60
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
table(data$marital_status)
##
## MARRIED SINGLE
## 365 3
#convert some variables into factors
#note: `data$family.History` creates a new column (family_history already exists);
#after clean_names() below it is renamed to family_history_2
data$family.History <- as.factor(data$family_history)
data$diabetes <- as.factor(data$diabetes)
data$mortality <- as.factor(data$mortality)
colnames(data)
## [1] "age" "age_group" "gender" "locality"
## [5] "marital_status" "life_style" "sleep" "category"
## [9] "depression" "hyperlipi" "smoking" "family_history"
## [13] "f_history" "diabetes" "htn" "allergies"
## [17] "bp" "thrombolysis" "bgr" "b_urea"
## [21] "s_cr" "s_sodium" "s_potassium" "s_chloride"
## [25] "c_p_k" "ck_mb" "esr" "wbc"
## [29] "rbc" "hemoglobin" "p_c_v" "m_c_v"
## [33] "m_c_h" "m_c_h_c" "platelet_count" "neutrophil"
## [37] "lympho" "monocyte" "eosino" "others"
## [41] "co" "diagnosis" "hypersensitivity" "cp"
## [45] "trestbps" "chol" "fbs" "restecg"
## [49] "thalach" "exang" "oldpeak" "slope"
## [53] "ca" "thal" "num" "sk"
## [57] "sk_react" "reaction" "mortality" "follow_up"
## [61] "family.History"
#EDA
library(janitor)
data <- clean_names(data)
#summary statistics of platelet_count by marital_status
data %>% select(marital_status, platelet_count) %>%
group_by(marital_status) %>% summarise(
mean = round(mean(platelet_count),2),
median = median(platelet_count),
variance = round(var(platelet_count), 2),
std = round(sd(platelet_count), 2)
)
## # A tibble: 2 × 5
## marital_status mean median variance std
## <chr> <dbl> <int> <dbl> <dbl>
## 1 MARRIED 248507. 237000 5894481409 76776.
## 2 SINGLE 267333. 268000 6400333333. 80002.
ggplot(data, aes(x = platelet_count, fill = marital_status)) +
geom_histogram(bins = 40, alpha = 0.5, position = "identity") +
#passing xintercept as a fixed parameter (outside aes) avoids referring to data$ inside aes()
geom_vline(xintercept = mean(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE), color = "blue", linewidth = 0.5) +
geom_vline(xintercept = mean(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE), color = "red", linewidth = 0.5) +
geom_vline(xintercept = median(data$platelet_count[data$marital_status == "MARRIED"], na.rm = TRUE), color = "blue", linetype = "dashed", linewidth = 0.5) +
geom_vline(xintercept = median(data$platelet_count[data$marital_status == "SINGLE"], na.rm = TRUE), color = "red", linetype = "dashed", linewidth = 0.5) +
theme_bw()
ggplot(data, aes(x = marital_status, y = platelet_count)) +
geom_boxplot(aes(fill = marital_status)) +
stat_summary(fun = mean, color = "darkred", geom = "point", size = 3) +
theme_classic()
summary(data)
## age age_group gender locality
## Min. :24.00 Length:368 Length:368 Length:368
## 1st Qu.:50.75 Class :character Class :character Class :character
## Median :55.00 Mode :character Mode :character Mode :character
## Mean :54.29
## 3rd Qu.:60.25
## Max. :77.00
## marital_status life_style sleep category
## Length:368 Length:368 Length:368 Length:368
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## depression hyperlipi smoking family_history
## Length:368 Length:368 Length:368 Length:368
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## f_history diabetes htn allergies
## Min. :0.0000 0:198 Length:368 Length:368
## 1st Qu.:0.0000 1:170 Class :character Class :character
## Median :0.0000 Mode :character Mode :character
## Mean :0.1957
## 3rd Qu.:0.0000
## Max. :1.0000
## bp thrombolysis bgr b_urea
## Min. : 80.5 Min. :0.00000 Min. : 60 Min. : 2.30
## 1st Qu.:100.7 1st Qu.:0.00000 1st Qu.:117 1st Qu.: 28.00
## Median :120.8 Median :0.00000 Median :164 Median : 36.00
## Mean :121.2 Mean :0.03261 Mean :220 Mean : 51.68
## 3rd Qu.:140.7 3rd Qu.:0.00000 3rd Qu.:291 3rd Qu.: 43.00
## Max. :190.1 Max. :1.00000 Max. :563 Max. :394.00
## s_cr s_sodium s_potassium s_chloride
## Min. : 0.600 Min. :129 Min. :3.300 Min. : 90.0
## 1st Qu.: 0.900 1st Qu.:135 1st Qu.:3.900 1st Qu.:100.0
## Median : 0.900 Median :138 Median :4.200 Median :104.0
## Mean : 1.717 Mean :138 Mean :4.211 Mean :103.8
## 3rd Qu.: 1.100 3rd Qu.:141 3rd Qu.:4.400 3rd Qu.:107.0
## Max. :22.900 Max. :146 Max. :5.300 Max. :112.0
## c_p_k ck_mb esr wbc
## Min. : 52.0 Min. : 14.00 Min. : 5.00 Min. : 5800
## 1st Qu.: 135.0 1st Qu.: 21.00 1st Qu.: 11.00 1st Qu.: 7800
## Median : 188.0 Median : 36.00 Median : 16.00 Median :10650
## Mean : 553.9 Mean : 62.49 Mean : 26.57 Mean :11181
## 3rd Qu.: 390.0 3rd Qu.: 52.00 3rd Qu.: 25.00 3rd Qu.:13500
## Max. :4289.0 Max. :505.00 Max. :154.00 Max. :19590
## rbc hemoglobin p_c_v m_c_v
## Min. :3.46 Min. : 9.10 Min. :0.2900 Min. :60.00
## 1st Qu.:4.40 1st Qu.:12.30 1st Qu.:0.3600 1st Qu.:78.00
## Median :5.20 Median :14.20 Median :0.4250 Median :82.10
## Mean :5.09 Mean :13.91 Mean :0.4153 Mean :81.57
## 3rd Qu.:5.65 3rd Qu.:15.50 3rd Qu.:0.4600 3rd Qu.:86.00
## Max. :6.98 Max. :18.00 Max. :0.5400 Max. :96.00
## m_c_h m_c_h_c platelet_count neutrophil
## Min. :18.00 Min. :0.2200 Min. : 20000 Min. : 0.360
## 1st Qu.:25.70 1st Qu.:0.3200 1st Qu.:192000 1st Qu.: 0.600
## Median :27.90 Median :0.3300 Median :237000 Median : 0.720
## Mean :27.27 Mean :0.3294 Mean :248660 Mean : 2.927
## 3rd Qu.:29.00 3rd Qu.:0.3400 3rd Qu.:287000 3rd Qu.: 0.800
## Max. :33.00 Max. :0.3900 Max. :459000 Max. :83.000
## lympho monocyte eosino others
## Min. :0.0500 Min. :0.0100 Min. :1.000 Length:368
## 1st Qu.:0.1700 1st Qu.:0.0200 1st Qu.:2.000 Class :character
## Median :0.2100 Median :0.0300 Median :2.000 Mode :character
## Mean :0.2489 Mean :0.0325 Mean :2.255
## 3rd Qu.:0.3200 3rd Qu.:0.0400 3rd Qu.:3.000
## Max. :0.5400 Max. :0.0800 Max. :5.000
## co diagnosis hypersensitivity cp
## Length:368 Length:368 Length:368 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.671
## 3rd Qu.:4.000
## Max. :4.000
## trestbps chol fbs restecg
## Min. :100.0 Min. :131.0 Min. :0.0000 Min. :0.000
## 1st Qu.:120.0 1st Qu.:212.0 1st Qu.:0.0000 1st Qu.:0.000
## Median :130.0 Median :249.0 Median :0.0000 Median :2.000
## Mean :132.7 Mean :248.9 Mean :0.1413 Mean :1.073
## 3rd Qu.:142.0 3rd Qu.:283.0 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :200.0 Max. :409.0 Max. :1.0000 Max. :2.000
## thalach exang oldpeak slope
## Min. : 71.0 Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:125.0 1st Qu.:0.0000 1st Qu.:0.275 1st Qu.:1.750
## Median :144.0 Median :1.0000 Median :1.200 Median :2.000
## Mean :140.9 Mean :0.5625 Mean :1.542 Mean :1.842
## 3rd Qu.:158.0 3rd Qu.:1.0000 3rd Qu.:2.200 3rd Qu.:2.000
## Max. :195.0 Max. :1.0000 Max. :6.200 Max. :3.000
## ca thal num sk
## Min. :0.000 Min. :3.000 Min. :1.000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.0000
## Median :1.000 Median :7.000 Median :2.000 Median :1.0000
## Mean :1.003 Mean :5.859 Mean :2.035 Mean :0.9837
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :7.000 Max. :4.000 Max. :1.0000
## sk_react reaction mortality follow_up family_history_2
## Length:368 Min. :0.0000 0:288 Min. : 1.00 NO :296
## Class :character 1st Qu.:0.0000 1: 80 1st Qu.:15.00 YES: 72
## Mode :character Median :1.0000 Median :32.00
## Mean :0.7473 Mean :28.65
## 3rd Qu.:1.0000 3rd Qu.:36.00
## Max. :1.0000 Max. :60.00
distribution_plot <- function(y){
#function return histogram and boxplot of chosen variable distribution
box_plt <- ggplot(data, aes(x = platelet_count, y = eval(parse(text = y))))+
geom_boxplot(aes(fill = platelet_count))+
stat_summary(fun.y = mean, color = "darkred", geom = "point", size = 2) +
labs(x = "Platelet Count", y = y)+theme_bw()
hstgm <- ggplot(data, aes(x = eval(parse(text = y)), fill = mortality)) + geom_histogram(bins = 40, alpha = 0.5, position = "identity")+
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "blue", size = 0.5) +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "blue", size = 0.5) +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 1], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
geom_vline(aes(xintercept = mean(data[[y]][data$mortality == 0], na.rm = TRUE)), color = "red", size = 0.5, linetype = "dashed") +
labs(x = y) + theme_bw()
return(plot_grid(hstgm, box_plt))
}
distribution_plot("s_sodium")
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Use of `data[[y]]` is discouraged.
## ℹ Use `.data[[y]]` instead.
## Warning: Use of `data$mortality` is discouraged.
## ℹ Use `mortality` instead.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Here we can see that histograms were not the best option for us: there are not many subjects with a positive mortality outcome, so we could not explore that group’s distribution properly.
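One possible workaround (a sketch, not part of the original analysis) is to plot density estimates instead, since densities normalize each group’s counts:
ggplot(data, aes(x = s_sodium, fill = mortality)) +
geom_density(alpha = 0.5) +
theme_bw()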
distribution_ratio <- function(x){
#fucntion creates a stacked bar plots
plt <- data %>% select(s_sodium,x, gender) %>%
group_by(s_sodium, var = eval(parse(text = x))) %>%
summarise(count = length(gender)) %>%
group_by(s_sodium) %>%
mutate(ratio = round(count * 100 / sum(count), 1)) %>%
ggplot(aes(y = ratio, x = s_sodium, fill = var)) +
geom_bar(stat = "identity") +
labs(title = paste0("Ratio of stroke outcome by ", x), fill = x) +
theme_bw()
return(plt)
}
distribution_ratio("bp")
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(x)
##
## # Now:
## data %>% select(all_of(x))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
Let’s stop here for a moment and remember discrete variables and probability. Based on our sample we can compute the following empirical probabilities:
H <- as.numeric(levels(data$diabetes))[data$diabetes]
S <- as.numeric(levels(data$mortality))[data$mortality]
p_sp_and_hp <- mean(H&S)
p_sp_or_hp <- mean(H|S)
p_sp_and_hn <- mean(H&!S)
print(paste0("Probability of suffereing from having a disturbed platelet count:", round(p_sp_and_hp,3)))
## [1] "Probability of suffereing from having a disturbed platelet count:0.166"
print(paste0("Probability of suffereing or having a disturbed platelet count:", round(p_sp_or_hp,3)))
## [1] "Probability of suffereing or having a disturbed platelet count:0.514"
print(paste0("Probability of suffereing without having disturbed platelet count:", round(p_sp_and_hn,3)))
## [1] "Probability of suffereing without having disturbed platelet count:0.296"
distribution_ratio("diabetes")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
distribution_ratio("marital_status")
## `summarise()` has grouped output by 's_sodium'. You can override using the
## `.groups` argument.
glimpse(data)
## Rows: 368
## Columns: 61
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <fct> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
## $ family_history_2 <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, N…
#Test Outcome Analysis
Now I will try to predict the diabetes outcome with the help of a random forest. I will not go into details about what a random forest is, how it works, or how to tune the model. My point here is to show how we can analyse model performance (this applies both to machine learning and to real diagnostic tests). You can check out my project on Dementia Prediction with Tree-based Models to learn more about tree-based models.
frml <- (diabetes ~ gender+age+follow_up+thalach+chol+trestbps)
model_rf <- randomForest(formula = frml, data = data)
prediction_rf <- predict(object = model_rf,
newdata = select(data, -diabetes),
type = "class")
confusionMatrix(data = prediction_rf, reference = data$diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 197 1
## 1 1 169
##
## Accuracy : 0.9946
## 95% CI : (0.9805, 0.9993)
## No Information Rate : 0.538
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9891
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9949
## Specificity : 0.9941
## Pos Pred Value : 0.9949
## Neg Pred Value : 0.9941
## Prevalence : 0.5380
## Detection Rate : 0.5353
## Detection Prevalence : 0.5380
## Balanced Accuracy : 0.9945
##
## 'Positive' Class : 0
##
print(paste0("AUC = ", round(Metrics::auc(actual = data$diabetes, predicted = prediction_rf), 4)))
## [1] "AUC = 0.9945"
#roc() here is used as roc(predictions, labels), which matches the AUC package;
#note that pROC::roc() takes (response, predictor) in the opposite order
plot(roc(prediction_rf, data$diabetes), main = "ROC", cex.axis = 0.6, cex.main = 0.8, cex.lab = 0.8)
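One caveat: the near-perfect metrics above come from predicting on the same data the forest was trained on, which overstates real performance. A fairer check uses a holdout split; a minimal sketch (the 70/30 split and the seed are arbitrary choices of ours):
set.seed(123)
train_idx <- sample(nrow(data), size = round(0.7 * nrow(data)))
model_holdout <- randomForest(formula = frml, data = data[train_idx, ])
pred_test <- predict(model_holdout, newdata = data[-train_idx, ], type = "class")
confusionMatrix(data = pred_test, reference = data$diabetes[-train_idx])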
In this part we will focus on how to make decisions about a population based on information from a sample.
library(tidyverse)
library(repr)
library(cowplot)
library(bootstrap)
library(binom)
## Warning: package 'binom' was built under R version 4.2.3
library(infer)
## Warning: package 'infer' was built under R version 4.2.3
set.seed(123)
options(repr.plot.width = 7, repr.plot.height = 3)
heart <- read.csv("FIC_FUll.csv")
sample_n(heart,5)
## Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1 61 61-70 Female RURAL MARRIED NO NO FREE
## 2 61 61-70 Female RURAL MARRIED YES YES PAID
## 3 40 31-40 Male URBAN MARRIED NO NO PAID
## 4 50 41-50 Male RURAL MARRIED YES YES PAID
## 5 64 61-70 Male URBAN MARRIED NO NO FREE
## Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1 YES YES NO NO 0 1 NO NO
## 2 YES YES NO NO 0 1 YES NO
## 3 YES YES YES YES 1 1 YES YES
## 4 YES YES YES NO 0 1 NO NO
## 5 YES NO NO NO 0 0 YES NO
## BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 100.6 0 84 28 0.9 138 3.3 107 130
## 2 120.8 0 96 42 1.0 146 3.9 100 146
## 3 140.9 0 152 35 0.9 140 3.9 105 61
## 4 100.6 0 512 42 1.3 139 4.1 105 196
## 5 120.8 1 291 37 0.6 137 3.9 98 72
## CK.MB ESR WBC RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1 30 11 9900 4.26 11.6 0.34 79.7 27.2 0.34 265000
## 2 21 25 7400 4.14 11.7 0.36 87.0 28.0 0.32 395000
## 3 15 12 12800 4.26 12.3 0.36 83.6 28.9 0.35 459000
## 4 50 22 15000 4.41 14.3 0.41 93.9 32.4 0.35 228000
## 5 39 25 7400 4.36 11.2 0.33 76.6 25.7 0.34 136000
## NEUTROPHIL LYMPHO MONOCYTE EOSINO Others CO
## 1 0.70 0.25 0.03 2 no Chest pain,
## 2 0.63 0.31 0.03 3 no Chest pain, SWEATING
## 3 0.65 0.26 0.08 1 no Chest pain,heart sinking, vomiting
## 4 0.85 0.10 0.03 2 no Chest pain,
## 5 0.78 0.18 0.03 1 HCV, IHD Chest pain,sweating,vomiting
## Diagnosis Hypersensitivity cp trestbps chol fbs restecg thalach
## 1 EXT. ACUTE WALL M.I NO 4 145 307 0 2 146
## 2 AC I/W M.I NO 4 130 330 0 2 169
## 3 Acute I/W M.I NO 4 152 223 0 0 181
## 4 AC. A/L M.I NO 1 110 264 0 0 132
## 5 old I/W M.I, ACS. NO 3 140 335 0 0 158
## exang oldpeak slope ca thal num SK SK.React Reaction Mortality
## 1 1 1.0 2 0 7 1 1 NO 0 1
## 2 0 0.0 1 0 3 1 1 NO 0 1
## 3 0 0.0 1 0 7 1 1 COUGH.BLEEDING 1 0
## 4 0 1.2 2 0 7 1 1 COUGH.BLEEDING 1 1
## 5 0 0.0 1 0 3 1 0 BODY.PAIN 1 0
## Follow.Up
## 1 15
## 2 12
## 3 36
## 4 3
## 5 60
library(janitor)
heart <- clean_names(heart)
#Likelihood
Likelihood is a tool for summarizing the data’s evidence about unknown parameters. The likelihood of a collection of data is the joint density evaluated as a function of the unknown parameters with the data fixed.
Imagine we are running an experiment to check the ratio of people with and without diabetes. We assume that the ratio from this dataset is the true ratio in the population. Then we take samples of diabetes outcomes and check whether the difference between the sample ratio and the population ratio approaches zero as the sample size grows.
print(paste0("Ratio of people with Diabetes in population:", round(mean(heart$diabetes == 1), 3)))
## [1] "Ratio of people with Diabetes in population:0.462"
glimpse(heart)
## Rows: 368
## Columns: 60
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
n <- 100 #size of the sample
#convert diabetes column to string 0 (0,1 integers)
diabetes <- as.numeric(levels(heart$chol))[heart$chol]
avg_value <- mean(diabetes)
#taking three random sample n_size each
x1 = sample(diabetes, size = n, replace = F)
x2 = sample(diabetes, size = n, replace = F)
x3 = sample(diabetes, size = n, replace = F)
#create 3 vectors that will hold informaiton aabout ratio for each size of sample
xbar1 = rep(0, length(x1))
xbar2 = rep(0, length(x2))
xbar3 = rep(0, length(x3))
for (i in 1:length(x1)){
xbar1[i] = mean(x1[1:i])
xbar2[i] = mean(x2[1:i])
xbar3[i] = mean(x3[1:i])
}
plot(1:n, xbar1 - avg_value, type = "l",col = 'red', lwd = 1, ylim=c(-0.1,0.2),
xlab = "Number of Subjects sampled",
ylab = "Distance to the mean")
lines(1:n, xbar2-avg_value, col="blue", lwd=1)
lines(1:n, xbar3-avg_value, col="orange", lwd=1)
lines(1:n, rep(0,n), lwd=3)
We can see that the larger the number of subjects in our samples, the closer the sample mean tends to be to the population mean.
The law of large numbers establishes that as n increases the averages get close to the target, while the central limit theorem says how close, and with what probability, the results of the experiment are to the true target.
#Central Limit Theorem
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution. This holds especially well for sample sizes over 30. In other words, as you take more samples, especially large ones, the graph of the sample means will look more and more like a normal distribution.
chol <- heart$chol[heart$diabetes == 0]
hist(chol, breaks = 50, col = "lightgreen")
print(paste0("Population mean: ", round(mean(chol), 1)))
## [1] "Population mean: 244.9"
n=30
sample = sample(chol, size = n, replace = F)
xbar = rep(0, length(sample))
for(i in 1:length(sample)){
xbar[i] = mean(sample[1:i])
}
hist(xbar, breaks = 50, col = "orange", main = "Histogram of sample mean(sample size is 30)")
print(paste0("Sample mean:", round(mean(xbar), 1)))
## [1] "Sample mean:244.6"
length(chol)
## [1] 198
n = 190
sample = sample(chol, size = n, replace = F)
xbar = rep(0, length(sample))
for(i in 1:length(sample)){
xbar[i] = mean(sample[1:i])
}
hist(xbar, breaks = 50, col = "orange", main = "Histogram of sample mean(sample size = 190)")
print(paste0("Sample mean: ", round(mean(xbar), 1)))
## [1] "Sample mean: 242.5"
#Confidence Intervals
A confidence interval quantifies how much uncertainty there is around a particular statistic, and is often reported together with a margin of error. It tells you how confident you can be that the results from a poll or survey reflect what you would expect to find if it were possible to survey the entire population.
population <- rnorm(5000, 10,5)
hist(population, breaks = 25, col = "lightgreen")
print(paste0("Population mean: ", round(mean(population), 1)))
## [1] "Population mean: 10"
print(paste0("Population standard error:", round(sd(population), 1)))
## [1] "Population standard error:5"
n = 250
sample <- sample(population, size = n, replace = F)
print(paste0("sample mean: ", round(mean(sample), 1)))
## [1] "sample mean: 10.1"
print(paste0("Sample standard error: ", round(sd(sample), 1)))
## [1] "Sample standard error: 4.6"
#calculate 95% CI
alpha = 0.05
quant = 1 - alpha/2
z_score <- qnorm(quant, lower.tail = TRUE)
lower <- mean(sample) - z_score * sd(population) / sqrt(n)
upper <- mean(sample) + z_score * sd(population) / sqrt(n)
hist(sample, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3, col = "red")
print(paste0("With 95% confidence we can say that the population mean is in the interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say that the population mean is in the interval [9.5; 10.7]"
But, as said before, we usually don’t know the population distribution, and in particular its standard deviation. In this case we must rely on sample standard deviations and t-scores. T-scores come from t-distributions, which help us account for the extra error that occurs when we estimate the spread from a sample. We use a different t-distribution to calculate cumulative probabilities depending on our degrees of freedom.
chol <- heart$chol
X_bar <- mean(chol)
s <- sd(chol)
n <- length(chol)
alpha = 0.05
t_score <- qt(1 - alpha/2, n-1)
lower <- X_bar - t_score * s / sqrt(n)
upper <- X_bar + t_score * s / sqrt(n)
hist(chol, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3, col = "red")
print(paste0("With 95% confidence we can say that population mean is in interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say that population mean is in interval [243.8; 254.1]"
Note how narrow the CI is. This is due to the fairly large sample size (368 observations). Look at what happens if we take a much smaller sample:
chol <- heart$chol
chol <- sample(chol, size = 100, replace = F)
X_bar <- mean(chol)
sigma <- sd(chol)
n <- length(chol)
alpha = 0.05
t_score <- qt( 1 - alpha/2, n-1)
lower <- X_bar - t_score * sigma / sqrt(n)
upper <- X_bar + t_score * sigma / sqrt(n)
hist(chol, breaks = 25, col = "orange")
abline(v = lower, lwd = 3, col = "blue")
abline(v = upper, lwd = 3, col = "red")
print(paste0("With 95% confidence we can say that population mean is in interval [", round(lower, 1), "; ", round(upper, 1), "]"))
## [1] "With 95% confidence we can say that population mean is in interval [245.4; 264]"
#t and F tests
In this part we will look at how we can compare independent groups (for example, patients with and without diabetes).
Introduction to t-test and confidence intervals
The t-test (also called Student’s t-test) compares two averages (means) and tells you if they are different from each other. It also tells you how significant the differences are; in other words, it lets you know whether those differences could have happened by chance.
Let’s assume that we want to check whether the difference in age between people with and without diabetes is significant. To find out, we can run a t-test. First, let’s summarise the distribution of age by diabetes status.
heart %>% mutate(group = case_when(
(diabetes == 1) ~ "Diabeted",
(diabetes == 0) ~ "Non-Diabetic"
)) %>%
group_by(group) %>%
summarise(count = length(age),
mean = round(mean(age)),
variance = round(var(age), 2))
## # A tibble: 2 × 4
## group count mean variance
## <chr> <int> <dbl> <dbl>
## 1 Diabetic 170 56 53.1
## 2 Non-Diabetic 198 52 88.5
Our hypotheses for the t-test look like this: H0: μ1 − μ2 = 0 (the mean ages of the two groups are equal) versus Ha: μ1 − μ2 ≠ 0 (the mean ages differ).
#sample size
n1 <- length(heart$age[heart$diabetes == 0])
n2 <- length(heart$age[heart$diabetes == 1])
#sample mean
x1_bar <- mean(heart$age[heart$diabetes == 0])
x2_bar <- mean(heart$age[heart$diabetes == 1])
#sample variance
s1 <- var(heart$age[heart$diabetes == 0])
s2 <- var(heart$age[heart$diabetes == 1])
diff <- x1_bar - x2_bar
SE <- sqrt(s1/n1 + s2/n2)
alpha = 0.05
t <- qt((1-alpha/2), df = n1 + n2 -2) #t score
print(paste0("95% of Confidence interval is [", round(diff-t*SE,2), "; ", round(diff+t*SE,2), "]. "))
## [1] "95% of Confidence interval is [-5.75; -2.32]. "
The confidence interval of the mean difference is (−5.75; −2.32). This does not include 0, so we reject the null hypothesis and say that there is a significant difference between the group means at the 95% confidence level. In other words, we are 95% confident that diabetes is more likely to occur in older people. Alternatively, we could just run the t.test function built into R to find out whether our difference in means is significant or just random chance. We cannot use a paired t-test because the observations in the two groups are not paired (they even have different sample sizes), so we keep paired = FALSE (the default). The parameter var.equal should also be set to FALSE, since the variances of the two groups are not equal.
t.test(age ~ diabetes,
data = heart,
var.equal = FALSE,
paired = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: age by diabetes
## t = -4.6313, df = 362.3, p-value = 5.073e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.748912 -2.321914
## sample estimates:
## mean in group 0 mean in group 1
## 52.42929 56.46471
This may seem unclear: we had two samples and calculated the difference of their averages. This difference was far from 0, so why did we still have to run a t-test to check for significance? The answer is that the width of the confidence interval depends on the sample size (the more observations we have, the narrower the interval and the more confident we can be). Let’s run another example with fewer observations.
example <- heart %>% select(diabetes, age)
#fill the dataframe with artificial group labels for this example
example <- example[1:50,]
example$diabetes[1:25] <- 0
example$diabetes[26:50] <- 1
example %>% group_by(diabetes) %>%
summarise(count = length(age),
mean_age = round(mean(age)),
variance = round(var(age),2))
## # A tibble: 2 × 4
## diabetes count mean_age variance
## <dbl> <int> <dbl> <dbl>
## 1 0 25 59 21.1
## 2 1 25 47 81.4
t.test(age ~ diabetes, data = example,
var.equal = FALSE,
paired = FALSE,
conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: age by diabetes
## t = 6.0835, df = 35.679, p-value = 5.564e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 8.21154 16.42846
## sample estimates:
## mean in group 0 mean in group 1
## 59.16 46.84
In this case the difference was about 12.3, which can also be considered “away from zero”, but the confidence interval is much wider: [8.21; 16.43]. With only 25 observations per group, a similar interval can easily become wide enough to include 0, in which case we could not be sure, based on such a sample, that age is related to the diabetic outcome.
#F test
It can be a difficult task to decide whether the variances of two groups are different or not. We need to provide evidence against the assumption that the true variances are equal, and to do this we need the distribution of the ratio of the variances in groups 1 and 2: σ1²/σ2².
When running an F test, we always test whether the population variances are equal; in other words, the null hypothesis is always that the ratio of the variances is equal to 1.
Test statistic: F = s1²/s2², the ratio of the two sample variances.
The hypothesis that the two variances are equal is rejected if F > F(1−α/2; n1−1, n2−1) or F < F(α/2; n1−1, n2−1) for a two-tailed test; for a one-tailed test (> or <), the single critical value at level α is used instead.
Let’s look back at our example with the distribution of Age by diabetes outcome:
heart %>% mutate(group = case_when(
(diabetes == 1) ~ "Diabetic",
(diabetes == 0) ~ "Non Diabetic"
)) %>% group_by(group) %>% summarise(count = length(age),
variance = round(var(age), 2))
## # A tibble: 2 × 3
## group count variance
## <chr> <int> <dbl>
## 1 Diabetic 170 53.1
## 2 Non Diabetic 198 88.5
#calculate the ratio of variances
F_stat <- var(heart$age[heart$diabetes == 0]) / var(heart$age[heart$diabetes == 1])
print(paste0("Test statistic (ratio of variances): ", round(F_stat, 2)))
## [1] "Test statistic (ratio of variances): 1.67"
#get the critical values for the F distribution (df = group size - 1 for each group)
round(qf(c(0.025, 0.975), 198 - 1, 170 - 1), digits = 2)
## [1] 0.75 1.34
lower <- F_stat / qf(0.975, 198 - 1, 170 - 1)
upper <- F_stat / qf(0.025, 198 - 1, 170 - 1)
print(paste0("95% Confidence interval is [", round(lower, 2), "; ", round(upper, 2), "]."))
## [1] "95% Confidence interval is [1.24; 2.23]."
And again, the interval does not include 1, so we can reject the null hypothesis.
There is a built-in function in R that computes the F statistic and confidence interval: var.test.
var.test(age ~ diabetes, data = heart)
##
## F test to compare two variances
##
## data: age by diabetes
## F = 1.6668, num df = 197, denom df = 169, p-value = 0.0006817
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.243223 2.227283
## sample estimates:
## ratio of variances
## 1.666776
#Data Resampling
Unfortunately, sometimes the size of your sample is not big enough to make proper judgments about the population. In this case resampling methods become useful. Resampling methods (like the jackknife, bootstrapping, and permutation testing) help us approximate the population, along with confidence intervals and standard errors. The main idea is to repeatedly create new samples from the original sample (with replacement, in the case of the bootstrap).
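For the bootstrap and permutation testing, the core operation is R’s sample() function. Here is a minimal sketch of drawing one resample with replacement from a toy vector (toy_x is made up purely for illustration):
toy_x <- c(2, 4, 6, 8, 10)
#one resample of the same size, drawn with replacement:
#some values may appear more than once, others not at all
sample(toy_x, size = length(toy_x), replace = TRUE)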
#Jackknife
The jackknife deletes each observation in turn and calculates the estimate based on the remaining (n−1) observations. It uses this collection of estimates to do things like estimate the bias and the standard error.
Note that estimating the bias and having a standard error are not needed for things like sample means, which we know are unbiased estimates of population means.
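Before using the packaged function, here is a minimal sketch of what a jackknife computes under the hood, using the standard bias and standard-error formulas (jackknife_manual is a hypothetical helper, not part of any package):
#leave-one-out jackknife for an arbitrary estimator theta
jackknife_manual <- function(x, theta) {
  n <- length(x)
  theta_hat <- theta(x)                            #estimate on the full sample
  theta_i <- sapply(1:n, function(i) theta(x[-i])) #estimate with observation i removed
  bias <- (n - 1) * (mean(theta_i) - theta_hat)    #jackknife bias estimate
  se <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2)) #jackknife standard error
  list(jack.bias = bias, jack.se = se)
}
Calling jackknife_manual(heart$hemoglobin, bias_var) should agree with the jackknife() output below.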
Looking back at the hemoglobin distribution from our dataset:
#function for biased version of variance
bias_var = function(x){
n = length(x)
(n - 1) * var(x) / n
}
#run the jackknife function from the 'bootstrap' package
jackknife(heart$hemoglobin, bias_var)
## $jack.se
## [1] 0.2770813
##
## $jack.bias
## [1] -0.01267983
##
## $jack.values
## [1] 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859 4.651625
## [9] 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859 4.651625 4.665220
## [17] 4.666077 4.666077 4.659114 4.624452 4.652859 4.666077 4.659114 4.624452
## [25] 4.652859 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425
## [33] 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.647819 4.647819
## [41] 4.658354 4.660882 4.661616 4.650336 4.665168 4.666177 4.650543 4.646143
## [49] 4.662191 4.665919 4.646376 4.659252 4.630613 4.626560 4.662919 4.666077
## [57] 4.603020 4.666059 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003
## [65] 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616
## [73] 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336 4.665168
## [81] 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252 4.630613
## [89] 4.626560 4.662919 4.666077 4.603020 4.666059 4.665757 4.666059 4.662191
## [97] 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252
## [105] 4.659252 4.664463 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882
## [113] 4.661616 4.650336 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919
## [121] 4.646376 4.659252 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [129] 4.665757 4.666059 4.658354 4.660882 4.661616 4.650336 4.665168 4.666177
## [137] 4.650543 4.646143 4.662191 4.641717 4.630613 4.639787 4.662191 4.630613
## [145] 4.639787 4.662191 4.641717 4.630613 4.639787 4.666077 4.630613 4.639787
## [153] 4.662191 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [161] 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114
## [169] 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114
## [177] 4.624452 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452
## [185] 4.652859 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402
## [193] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [201] 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336
## [209] 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252
## [217] 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059 4.665757
## [225] 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114 4.620425
## [233] 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114 4.624452
## [241] 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452 4.652859
## [249] 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402 4.664003
## [257] 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616
## [265] 4.647819 4.647819 4.647819 4.658354 4.665757 4.630613 4.626560 4.662919
## [273] 4.666077 4.603020 4.666059 4.665757 4.666059 4.662191 4.657402 4.657402
## [281] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [289] 4.661616 4.647819 4.659114 4.624452 4.652859 4.651625 4.665220 4.666077
## [297] 4.666077 4.659114 4.624452 4.652859 4.666077 4.659114 4.624452 4.652859
## [305] 4.662191 4.665757 4.630613 4.626560 4.662919 4.666077 4.603020 4.666059
## [313] 4.665757 4.666059 4.662191 4.657402 4.657402 4.664003 4.641717 4.659114
## [321] 4.620425 4.660094 4.659252 4.659252 4.664463 4.661616 4.647819 4.659114
## [329] 4.624452 4.652859 4.651625 4.665220 4.666077 4.666077 4.659114 4.624452
## [337] 4.652859 4.666077 4.659114 4.624452 4.652859 4.662191 4.657402 4.657402
## [345] 4.664003 4.641717 4.659114 4.620425 4.660094 4.659252 4.659252 4.664463
## [353] 4.661616 4.647819 4.647819 4.647819 4.658354 4.660882 4.661616 4.650336
## [361] 4.665168 4.666177 4.650543 4.646143 4.662191 4.665919 4.646376 4.659252
##
## $call
## jackknife(x = heart$hemoglobin, theta = bias_var)
#Bootstrapping
In statistics, bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error, or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
And again, the hemoglobin distribution. We will assume that our sample distribution of hemoglobin is the true population distribution, which is why we know the population mean and variance.
pop_hemo <- heart$hemoglobin
pop_mean <- mean(heart$hemoglobin)
hist(pop_hemo, col = "lightgreen", main = "")
n <- 300 #sample size
no_sim <- 1000 #number of resamples
sample_hemo <- sample(pop_hemo, n, replace = F)
# draw no_sim resamples of size n with replacement
## and put them in a matrix with no_sim rows and n columns
## so that each row is a resampled data set
bsResamples = matrix(sample(sample_hemo, n * no_sim, replace = TRUE), no_sim, n)
par(mfcol = c(1,2))
#plot histogram
hist(sample_hemo, probability = TRUE, col = "orange", breaks = 10, xlab = "Hemoglobin", main = "")
#plot histogram of resampled data
hist(bsResamples, probability = TRUE, col = "orange", breaks = 10, xlab = "Hemoglobin (1000 resamples)", main = "")
Note that both histograms look similar, as they should, since the right histogram was effectively created by simulating from the left one.
#calculate the mean of each resample data
bsStat = rowMeans(bsResamples)
#plot the histogram of our resamples statistics
hist(bsStat, probability = TRUE, col = "lightblue", xlab = "Mean of hemoglobin resamples", main = "")
abline(v = pop_mean, col = "red", lwd = 2)
abline(v = quantile(bsStat, c(0.025, 0.975)), lwd = 3) #95% CI
print(paste0("Standard Error: ", round(sd(bsStat), 3)))
## [1] "Standard Error: 0.132"
There is also a shortcut: the bootstrap() function from the same 'bootstrap' package does all of this for us.
BS <- bootstrap(x = sample_hemo, nboot = 10000, theta = mean)
hist(BS$thetastar, probability = TRUE, col = "lightblue", xlab = "Mean of hemoglobin resamples", main = "")
abline(v = pop_mean, col = "red", lwd = 2)
abline(v = quantile(BS$thetastar, c(0.025, 0.975)), lwd = 4) #95% CI
print(paste0("Standard Error: ", round(sd(BS$thetastar), 3)))
## [1] "Standard Error: 0.132"
There are cases when resampling can give misleading results (not representative of the population), for example when the sample size is small or when the sample contains outliers. If we rerun the code over and over, we will see that the mean of the resamples can deviate substantially from the population mean.
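A quick way to see this instability is to bootstrap from a very small original sample a few times (tiny_n = 10 is an arbitrary choice for illustration):
tiny_n <- 10
for (i in 1:3) {
  tiny_sample <- sample(pop_hemo, tiny_n, replace = FALSE)
  #bootstrap distribution of the mean, built from the tiny sample
  boot_means <- replicate(1000, mean(sample(tiny_sample, tiny_n, replace = TRUE)))
  print(paste0("run ", i,
               ": mean of bootstrap means = ", round(mean(boot_means), 2),
               ", population mean = ", round(mean(pop_hemo), 2)))
}
The bootstrap distribution always centers near the mean of the original sample, so when that sample is tiny it can sit far from the population mean.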
#Intervals for Binomial Probabilities
Let’s consider the estimation of the probability of success, p, from data X ~ Binomial(n, p), where X is the number of successes obtained from n trials, each with probability of success p. Consider an example where we have a sample of 50 people and we want to estimate the proportion of the mortality == 0 outcome (survival) in the population.
n <- 50
mort <- sample(heart$mortality, n)
de_mort <- sum(mort == 0) #count of mortality == 0 (survivors) in the sample
n_tilde <- n + 4 #Agresti/Coull adjustment: add 2 successes and 2 failures
p_hat <- de_mort / n
p_tilde <- (de_mort + 2) / n_tilde
z = qnorm(c(0.025, 0.975))
print(paste0("p-hat = ", round(p_hat, 2)))
print(paste0("p-tilde = ", round(p_tilde, 2)))
#95% Wald interval
se_Wald = sqrt(p_hat * (1 - p_hat) / n)
ci_Wald = round(p_hat + z * se_Wald, 3)
ci_Wald

#95% Agresti/Coull interval (note: the adjusted interval uses n_tilde = n + 4, not n)
se_AC = sqrt(p_tilde * (1 - p_tilde) / n_tilde)
ci_AC = round(p_tilde + z * se_AC, 3)
ci_AC

#and as always there is a function from the 'binom' package that can do all the calculations for us
#keep methods = "all" if you want to check intervals for different methods, not only Wald or Agresti/Coull
binom.confint(x = de_mort,
              n = n,
              conf.level = 0.95,
              methods = "all")