# Use statistical models for data analysis
Outline:
I. Statistics Basics
a) Summaries and Histograms
b) The relationship between mean and median
c) Boxplots and Suspected Outliers
d) Using natural logarithms and Qplots
II. Inferential Statistics
a) Hypothesis testing and Confidence Intervals
b) Interpreting Confidence Intervals
c) Significance levels and P-values
d) Brief Introduction to the Chi-Squared Test of Independence (to be continued)
e) Contingency Tables for Categorical Variables
f) T-Distribution
g) Difference Between Two Independent Means
h) Difference Between More than Two Independent Means (ANOVA)
i) Bootstrapping (Simulation-based method)
References:
a) Statistics Terminology
b) (Bio)statistics in R: Part #1 by def me
c) Statistics with R Coursera Certification by Duke University
d) Performing the Non-parametric Bootstrap for statistical inference using R by Ian Dworkin
e) Data visualization with ggplot Part 2 by Rick Scavetta
f) Stack Exchange Reply by Penguin_Knight
g) Outlier Detection by Zach Loertscher
## (Bio)statistics in R: Part #3
# Project Introduction

Alright! Now we know how to make inferences about a population when we have only a sample. We can run t- and F-tests and find confidence intervals. In this part we are going to ask questions about the data and answer them with hypothesis tests.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(repr)
## Warning: package 'repr' was built under R version 4.2.3
library(pwr)
## Warning: package 'pwr' was built under R version 4.2.3
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
set.seed(123)
options(repr.plot.height = 3, repr.plot.width = 7)
# load the dataset and preview 4 random rows
health <- read.csv("FIC_Full.csv")
sample_n(health, 4)
## Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1 61 61-70 Female RURAL MARRIED NO NO FREE
## 2 61 61-70 Female RURAL MARRIED YES YES PAID
## 3 40 31-40 Male URBAN MARRIED NO NO PAID
## 4 50 41-50 Male RURAL MARRIED YES YES PAID
## Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1 YES YES NO NO 0 1 NO NO
## 2 YES YES NO NO 0 1 YES NO
## 3 YES YES YES YES 1 1 YES YES
## 4 YES YES YES NO 0 1 NO NO
## BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 100.6 0 84 28 0.9 138 3.3 107 130
## 2 120.8 0 96 42 1.0 146 3.9 100 146
## 3 140.9 0 152 35 0.9 140 3.9 105 61
## 4 100.6 0 512 42 1.3 139 4.1 105 196
## CK.MB ESR WBC RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1 30 11 9900 4.26 11.6 0.34 79.7 27.2 0.34 265000
## 2 21 25 7400 4.14 11.7 0.36 87.0 28.0 0.32 395000
## 3 15 12 12800 4.26 12.3 0.36 83.6 28.9 0.35 459000
## 4 50 22 15000 4.41 14.3 0.41 93.9 32.4 0.35 228000
## NEUTROPHIL LYMPHO MONOCYTE EOSINO Others CO
## 1 0.70 0.25 0.03 2 no Chest pain,
## 2 0.63 0.31 0.03 3 no Chest pain, SWEATING
## 3 0.65 0.26 0.08 1 no Chest pain,heart sinking, vomiting
## 4 0.85 0.10 0.03 2 no Chest pain,
## Diagnosis Hypersensitivity cp trestbps chol fbs restecg thalach
## 1 EXT. ACUTE WALL M.I NO 4 145 307 0 2 146
## 2 AC I/W M.I NO 4 130 330 0 2 169
## 3 Acute I/W M.I NO 4 152 223 0 0 181
## 4 AC. A/L M.I NO 1 110 264 0 0 132
## exang oldpeak slope ca thal num SK SK.React Reaction Mortality
## 1 1 1.0 2 0 7 1 1 NO 0 1
## 2 0 0.0 1 0 3 1 1 NO 0 1
## 3 0 0.0 1 0 7 1 1 COUGH.BLEEDING 1 0
## 4 0 1.2 2 0 7 1 1 COUGH.BLEEDING 1 1
## Follow.Up
## 1 15
## 2 12
## 3 36
## 4 3
# convert column names to snake_case
health <- clean_names(health)
# Null vs Alternative Hypothesis

Hypothesis testing is one of the most fundamental techniques in statistics. You always want to ask questions about your data: is the prevalence of disease higher among men than among women? Does Drug A reduce pain better than Drug B?
We will explore the most popular approach, called Null Hypothesis Significance Testing (NHST). Under this method we always declare one hypothesis as the status quo, the null hypothesis H0: for example, the prevalence of disease does not differ between genders, or the pain relief rate for Drug A is the same as for Drug B. The alternative hypothesis, usually labeled HA or H1, is the hypothesis we actually want to check.
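For reference, the code below computes the standard two-sample z statistic for the difference in mean cholesterol between men and women (the subscripts M and F are my notation for the two groups; x̄, s, and n denote the group means, standard deviations, and sizes):

$$z = \frac{(\bar{x}_M - \bar{x}_F) - \mu_0}{\sqrt{\frac{s_M^2}{n_M} + \frac{s_F^2}{n_F}}}$$

Under H0 this statistic is approximately standard normal, so for a two-sided test at α = 0.05 we compare |z| with the quantile z_0.975 ≈ 1.96.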
cholestrol_level <- health %>% group_by(gender) %>%
summarise(meanchol = mean(chol), sd = sd(chol), n = n())
cholestrol_level
## # A tibble: 2 × 4
## gender meanchol sd n
## <chr> <dbl> <dbl> <int>
## 1 Female 269. 58.9 83
## 2 Male 243. 45.7 285
# observed difference in means (Male - Female)
estimate = cholestrol_level$meanchol[2] - cholestrol_level$meanchol[1]
# hypothesized difference under the null hypothesis
mu0 = 0
# standard error of the difference in means
se = sqrt(cholestrol_level$sd[2]^2 / cholestrol_level$n[2] + cholestrol_level$sd[1]^2 / cholestrol_level$n[1] )
# z test statistic
ts = (estimate - mu0) / se
print(paste0("Calculated z-score: ", round(ts, 2)))
## [1] "Calculated z-score: -3.71"
print(paste0("Z_0.975: ", round(qnorm(0.975), 2)))
## [1] "Z_0.975: 1.96"
We can see that the absolute value of the test statistic, 3.71, is much larger than the 0.975 quantile of the normal distribution, 1.96, so we reject the null hypothesis: mean cholesterol differs between men and women.
blood <- health %>% select(bp) %>%
summarise(mean = mean(bp), sd = sd(bp), n = n())
blood
## mean sd n
## 1 121.2133 24.5392 368
# hypothesized mean blood pressure under the null hypothesis
mu_0 <- 30
alpha = 0.05
# one-sample z statistic: sqrt(n) * (sample mean - mu_0) / sd
z = sqrt(blood$n) * (blood$mean - mu_0) / blood$sd
print(paste0("Calculated Z-score: ", round(z, 2)))
## [1] "Calculated Z-score: 71.31"
print(paste0("Z_0.95:", round(qnorm(1-alpha), 2)))
## [1] "Z_0.95:1.64"
The calculated z-score is much larger than the 0.95 quantile of the normal distribution, so we have enough evidence to reject the null hypothesis.
# Connection with CI

Consider testing H0: μ = μ0 vs HA: μ ≠ μ0. Take the set of all possible values of μ0 for which you fail to reject H0; this set is a (1 − α)·100% CI for μ.
The same works in reverse: if a (1 − α)·100% CI contains μ0, then we fail to reject H0 (just as we did in the t-tests).
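For reference, the interval constructed below is the usual normal-approximation CI for a difference in means (again, the subscripts M and F are my notation for the Male and Female groups):

$$(\bar{x}_M - \bar{x}_F) \pm z_{0.975}\sqrt{\frac{s_M^2}{n_M} + \frac{s_F^2}{n_F}}$$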
Looking back at the previous example with average cholesterol level:
## 95% CI for difference in mean cholesterol level
## H_0: mu_1 - mu_2 = 0, H_A: mu_1 - mu_2 != 0
round(estimate + c(-1, 1) * qnorm(0.975) * se, 2)
## [1] -39.77 -12.29
The 95% CI does not contain 0, so we reject the null hypothesis.
Also, we could call the t.test function, which we have seen in part #1. We can see that the results are the same:
## average cholesterol level
t.test(health$chol ~ health$gender, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: health$chol by health$gender
## t = 3.7125, df = 112.32, p-value = 0.000321
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## 12.13648 39.91594
## sample estimates:
## mean in group Female mean in group Male
## 269.0964 243.0702
# example 2 with blood pressure
t.test(health$bp, mu = mu_0, alternative = "greater")
##
## One Sample t-test
##
## data: health$bp
## t = 71.305, df = 367, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
## 119.1039 Inf
## sample estimates:
## mean of x
## 121.2133
glimpse(health)
## Rows: 368
## Columns: 60
## $ age <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …
# P-value

The p-value is the smallest value of α at which we still reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
The p-value is the probability, under the null hypothesis, of obtaining a result as extreme as or more extreme than the one actually observed.
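As a quick illustration, here is a sketch using base R's pnorm and the z statistic ts computed earlier for the cholesterol example (the normal approximation gives a value close to, but not identical to, the Welch t-test p-value reported above):

# two-sided p-value: probability, under H0, of a z statistic
# at least as extreme as |ts|
2 * pnorm(-abs(ts))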
Consider an example: is the mortality rate the same for men and women?
α = 0.05
We will use the prop.test function for this:
mortality_by_gender <- health %>% select(gender, mortality) %>%
  group_by(gender) %>%
  summarise(n = n(),             # total number of observations per gender
            x = sum(mortality))  # number of deaths (mortality is coded 0/1)

mortality_by_gender

result <- prop.test(x = c(mortality_by_gender$x[1], mortality_by_gender$x[2]),
                    n = c(mortality_by_gender$n[1], mortality_by_gender$n[2]),
                    alternative = "two.sided")
result

print(paste0("Calculated p-value is: ", round(result$p.value, 3)))
print(paste0("Significance level alpha is: 0.05"))
We see that p.value < α, so we reject the null hypothesis: there is evidence that the mortality rate differs between men and women.
# Statistical Power

Power is the probability of rejecting the null hypothesis when it is false. The more power, the better!
Power = 1 − β. In a power calculation, one fixes the sample size and other aspects of the study and calculates an estimate of the power. In a sample size calculation, one fixes the desired power and then determines the minimum sample size needed to achieve it.
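As a minimal sketch with the pwr package loaded above (the effect size d = 0.5, a medium effect in Cohen's terms, the per-group size n = 80, and the target power 0.8 are illustrative assumptions, not values estimated from our data):

# power calculation: fix the per-group sample size, effect size, and alpha,
# then estimate the power of a two-sample t-test
pwr.t.test(n = 80, d = 0.5, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")

# sample size calculation: fix the desired power and solve for n per group
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
           type = "two.sample", alternative = "two.sided")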