#use statistical models for data analysis

Outline: I. Statistics Basics a)Summaries and Histograms b) The relationship between mean and median c) Boxplots and Suspected Outliers d) Using natural logarithms and Qplots

I. Inferential Statistics a) Hypothesis testing and Confidence Intervals b) Interpreting Confidence Intervals c) Significance levels and P-values. d) Brief Introduction to Chi-Squared Test of Independence (will be continue) e) Contingency Tables for Categorical Variables f) T-Distribution g) Difference Between two Independent Means h) Difference Between more than two Independent Mean (ANOVA) i) Bootstrapping (Simulation base method)

  1. Other Statistics
  1. Mosaic Plots and Contingency Tables b. Chloropeth Maps with GGPlot2

References: a) Statistics Terminology b) (Bio)statistics in R: Part #1 by def me c) Statistics with R Coursera Certification by Duke University d) Performing the Non-parametric Bootstrap for statistical inference using R by Ian Dworkin e) Data visualization with ggplot Part 2 by Rick Scavetta f) Stack Exchange Reply by Penguin_Knight g) Outlier Detection by Zach Loertscher

##(Bio)statistics in R: Part #3

#Project Introduction Alright! Now we know how to make assumptions about population when we have only the sample. We can run t and F tests and find confidence intervals. In this part we are going to ask questions about the sample and answer them with hypothesis testings.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(repr)
## Warning: package 'repr' was built under R version 4.2.3
library(pwr)
## Warning: package 'pwr' was built under R version 4.2.3
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
set.seed(123)
options(repr.plot.height = 3, repr.plot.width = 7)

health <- read.csv("FIC_Full.csv")
sample_n(health, 4)
##   Age Age.Group Gender Locality Marital.status Life.Style Sleep Category
## 1  61     61-70 Female    RURAL        MARRIED         NO    NO     FREE
## 2  61     61-70 Female    RURAL        MARRIED        YES   YES     PAID
## 3  40     31-40   Male    URBAN        MARRIED         NO    NO     PAID
## 4  50     41-50   Male    RURAL        MARRIED        YES   YES     PAID
##   Depression Hyperlipi Smoking Family.History F.History Diabetes HTN Allergies
## 1        YES       YES      NO             NO         0        1  NO        NO
## 2        YES       YES      NO             NO         0        1 YES        NO
## 3        YES       YES     YES            YES         1        1 YES       YES
## 4        YES       YES     YES             NO         0        1  NO        NO
##      BP Thrombolysis BGR B.Urea S.Cr S.Sodium S.Potassium S.Chloride C.P.K
## 1 100.6            0  84     28  0.9      138         3.3        107   130
## 2 120.8            0  96     42  1.0      146         3.9        100   146
## 3 140.9            0 152     35  0.9      140         3.9        105    61
## 4 100.6            0 512     42  1.3      139         4.1        105   196
##   CK.MB ESR   WBC  RBC Hemoglobin P.C.V M.C.V M.C.H M.C.H.C PLATELET_COUNT
## 1    30  11  9900 4.26       11.6  0.34  79.7  27.2    0.34         265000
## 2    21  25  7400 4.14       11.7  0.36  87.0  28.0    0.32         395000
## 3    15  12 12800 4.26       12.3  0.36  83.6  28.9    0.35         459000
## 4    50  22 15000 4.41       14.3  0.41  93.9  32.4    0.35         228000
##   NEUTROPHIL LYMPHO MONOCYTE EOSINO Others                                 CO
## 1       0.70   0.25     0.03      2     no                        Chest pain,
## 2       0.63   0.31     0.03      3     no               Chest pain, SWEATING
## 3       0.65   0.26     0.08      1     no Chest pain,heart sinking, vomiting
## 4       0.85   0.10     0.03      2     no                        Chest pain,
##             Diagnosis Hypersensitivity cp trestbps chol fbs restecg thalach
## 1 EXT. ACUTE WALL M.I               NO  4      145  307   0       2     146
## 2          AC I/W M.I               NO  4      130  330   0       2     169
## 3       Acute I/W M.I               NO  4      152  223   0       0     181
## 4         AC. A/L M.I               NO  1      110  264   0       0     132
##   exang oldpeak slope ca thal num SK       SK.React Reaction Mortality
## 1     1     1.0     2  0    7   1  1             NO        0         1
## 2     0     0.0     1  0    3   1  1             NO        0         1
## 3     0     0.0     1  0    7   1  1 COUGH.BLEEDING        1         0
## 4     0     1.2     2  0    7   1  1 COUGH.BLEEDING        1         1
##   Follow.Up
## 1        15
## 2        12
## 3        36
## 4         3
health <- clean_names(health)

#Null vs Alternative Hypothesis¶ Hypothesis testing is one of the most fundamental techniques in statistics. You always want to ask question about your data - is the prevelance of disease among men higher rather than among women? Does Drug A help to reduce the pain better than Drug B?

We will explore the most popular technique called Null Hypothesis Significance Testing (NHST). According to this method we will always declare one hypothesis as status quo hypothesis, also called null hypothesis, H0 - there is no evidence to assume that prevelance of disease differs among gender or pain relief ratio for Drug A is the same as for Drug B. The alternative hypothesis is usually labeled as HA or H1 and that’s the hypothesis we want to check.

cholestrol_level <- health %>% group_by(gender) %>% 
  summarise(meanchol = mean(chol), sd = sd(chol), n = n())
cholestrol_level
## # A tibble: 2 × 4
##   gender meanchol    sd     n
##   <chr>     <dbl> <dbl> <int>
## 1 Female     269.  58.9    83
## 2 Male       243.  45.7   285
estimate = cholestrol_level$meanchol[2] - cholestrol_level$meanchol[1]
mu0 = 0
se = sqrt(cholestrol_level$sd[2]^2 / cholestrol_level$n[2] + cholestrol_level$sd[1]^2 / cholestrol_level$n[1] )

ts = (estimate - mu0)  / se

print(paste0("Calculated z-score: ", round(ts, 2)))
## [1] "Calculated z-score: -3.71"
print(paste0("Z_0.975: ", round(qnorm(0.975), 2)))
## [1] "Z_0.975: 1.96"

We can see that the resulted test statistic is much smaller than 0.975 quantile of the normal distribution, so we accept null hypothesis.

blood <- health %>% select(bp) %>% 
  summarise(mean = mean(bp), sd = sd(bp), n = n())
blood
##       mean      sd   n
## 1 121.2133 24.5392 368
mu_0 <- 30
alpha = 0.05
z = sqrt(blood$n) * (blood$mean - mu_0) / blood$sd

print(paste0("Caculated Z-score: ", round(z, 2)))
## [1] "Caculated Z-score: 71.31"
print(paste0("Z_0.95:", round(qnorm(1-alpha), 2)))
## [1] "Z_0.95:1.64"

Calculated score is much larger than .95 quantile of the normal distribution, so we have enough evidence to reject the null hypothesis.

#Connection with CI¶ Consider testing H0:μ=μ0 vs HA:μ≠μ0 . Take the set of all possible values for which you fail to reject H0 , this set is a (1−α)100 % CI for μ .

The same works in reverse, if a (1−α)100 % CI contains μ0 then we fail to reject H0 (the same as we did in t-tests).

Looking back on previous example with average glucose level:

## 95% CI for difference in mean of average glucose level
## H_0: mu_1 - mu_2 = 0, H_A: mu_1 - mu_2 != 0

round(estimate + c(-1, 1) * qnorm(0.975) * se, 2)
## [1] -39.77 -12.29

95% CI does contain a 0, so we accept the null hypothesis.

Also, we could call t.test function which we have seen in part #1. We can see that results are the same:

## average cholestrol level
t.test(health$chol ~ health$gender, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  health$chol by health$gender
## t = 3.7125, df = 112.32, p-value = 0.000321
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  12.13648 39.91594
## sample estimates:
## mean in group Female   mean in group Male 
##             269.0964             243.0702
#example 2 with blood
t.test(health$bp, mu = mu_0, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  health$bp
## t = 71.305, df = 367, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  119.1039      Inf
## sample estimates:
## mean of x 
##  121.2133
glimpse(health)
## Rows: 368
## Columns: 60
## $ age              <int> 45, 51, 55, 55, 56, 56, 57, 57, 58, 58, 59, 60, 60, 6…
## $ age_group        <chr> "41-50", "51-60", "51-60", "51-60", "51-60", "51-60",…
## $ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Fe…
## $ locality         <chr> "RURAL", "URBAN", "RURAL", "RURAL", "RURAL", "URBAN",…
## $ marital_status   <chr> "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRIED"…
## $ life_style       <chr> "NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO", "…
## $ sleep            <chr> "NO", "NO", "YES", "YES", "NO", "NO", "YES", "NO", "N…
## $ category         <chr> "FREE", "FREE", "FREE", "FREE", "FREE", "FREE", "PAID…
## $ depression       <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ hyperlipi        <chr> "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES…
## $ smoking          <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ family_history   <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ f_history        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ diabetes         <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ htn              <chr> "NO", "NO", "YES", "YES", "YES", "YES", "YES", "NO", …
## $ allergies        <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ bp               <dbl> 100.6, 90.6, 100.7, 160.1, 90.6, 140.7, 120.8, 100.6,…
## $ thrombolysis     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bgr              <int> 84, 135, 146, 146, 85, 166, 96, 84, 135, 146, 146, 85…
## $ b_urea           <dbl> 28, 17, 37, 37, 78, 104, 42, 28, 17, 37, 37, 78, 104,…
## $ s_cr             <dbl> 0.9, 0.7, 1.0, 1.0, 1.2, 4.0, 1.0, 0.9, 0.7, 1.0, 1.0…
## $ s_sodium         <int> 138, 144, 137, 137, 139, 130, 146, 138, 144, 137, 137…
## $ s_potassium      <dbl> 3.3, 4.7, 4.2, 4.2, 4.5, 5.3, 3.9, 3.3, 4.7, 4.2, 4.2…
## $ s_chloride       <int> 107, 104, 103, 103, 112, 100, 100, 107, 104, 103, 103…
## $ c_p_k            <int> 130, 163, 149, 149, 75, 322, 146, 130, 163, 149, 149,…
## $ ck_mb            <int> 30, 30, 22, 22, 18, 52, 21, 30, 30, 22, 22, 18, 52, 2…
## $ esr              <int> 11, 27, 19, 19, 13, 154, 25, 11, 27, 19, 19, 13, 154,…
## $ wbc              <int> 9900, 15800, 7900, 7900, 6900, 13500, 7400, 9900, 158…
## $ rbc              <dbl> 4.26, 5.74, 4.83, 4.83, 4.41, 3.90, 4.14, 4.26, 5.74,…
## $ hemoglobin       <dbl> 11.6, 14.5, 14.1, 14.1, 12.3, 10.0, 11.7, 11.6, 14.5,…
## $ p_c_v            <dbl> 0.34, 0.44, 0.42, 0.42, 0.36, 0.29, 0.36, 0.34, 0.44,…
## $ m_c_v            <dbl> 79.7, 78.0, 87.0, 87.0, 82.0, 74.4, 87.0, 79.7, 78.0,…
## $ m_c_h            <dbl> 27.2, 25.0, 29.0, 29.0, 27.0, 25.7, 28.0, 27.2, 25.0,…
## $ m_c_h_c          <dbl> 0.34, 0.32, 0.33, 0.33, 0.33, 0.35, 0.32, 0.34, 0.32,…
## $ platelet_count   <int> 265000, 287000, 183000, 183000, 211000, 288000, 39500…
## $ neutrophil       <dbl> 0.70, 0.73, 0.60, 0.60, 0.71, 0.85, 0.63, 0.70, 0.73,…
## $ lympho           <dbl> 0.25, 0.20, 0.33, 0.33, 0.25, 0.10, 0.31, 0.25, 0.20,…
## $ monocyte         <dbl> 0.03, 0.04, 0.04, 0.04, 0.02, 0.03, 0.03, 0.03, 0.04,…
## $ eosino           <int> 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3,…
## $ others           <chr> "no", "no", "LV dysfunction", "HTN", "no", "PND, ORTH…
## $ co               <chr> "Chest pain,", "Central Chest pain,", "Chest pain,SOB…
## $ diagnosis        <chr> "EXT. ACUTE WALL M.I", "A/W M.I", "AC I/W M.I (RV) RE…
## $ hypersensitivity <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ cp               <int> 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3,…
## $ trestbps         <int> 132, 130, 180, 128, 200, 134, 140, 130, 136, 170, 174…
## $ chol             <int> 341, 305, 327, 205, 288, 409, 241, 236, 319, 225, 249…
## $ fbs              <int> 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ restecg          <int> 2, 0, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0,…
## $ thalach          <int> 136, 142, 117, 130, 133, 150, 123, 174, 152, 146, 143…
## $ exang            <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ oldpeak          <dbl> 3.0, 1.2, 3.4, 2.0, 4.0, 1.9, 0.2, 0.0, 0.0, 2.8, 0.0…
## $ slope            <int> 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 3, 3, 2,…
## $ ca               <int> 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 3, 1,…
## $ thal             <int> 7, 7, 3, 7, 7, 7, 7, 3, 3, 6, 3, 7, 3, 3, 7, 3, 7, 7,…
## $ num              <int> 2, 2, 2, 3, 3, 2, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 2,…
## $ sk               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sk_react         <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO",…
## $ reaction         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mortality        <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,…
## $ follow_up        <int> 60, 15, 6, 52, 34, 32, 60, 3, 15, 6, 52, 34, 32, 12, …

#P-value The p value is the smallest value of α where we still reject a null hypothesis. The smaller the p-value, the strong the evidence that you should reject the null hypothesis.

P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone.

Consider an example:

H0
ratio of strokes among men and women is the same
HA
ratio of strokes among men and women is different

α = 0.05

We will use prop.test function for this:

# cholesterol <- health %>% select(gender, mortality) %>% 
#   mutate(b = as.integer(levels(health$mortality))[health$mortality]) %>% 
#   group_by(gender) %>% 
#   summarise(n = n(), #total number of observations
#             x = sum(b))
# 
# cholesterol
# result <- prop.test(x = c(cholesterol$x[1], cholesterol$x[2]),
#                     n = c(cholesterol$n[1], cholesterol$n[2]),
#                     alternative = "two.sided")
# result
# 
# print(paste0("Calculated p-value is: ", round(result$p.value, 3)))
# print(paste0("Significance level alpha is: 0.05"))

We see that p.value<α so we reject the null hypothesis - there are not enough evidence to believe that ratio of strokes among men and women is the same.

#Statistical Power¶ Power is the probability of rejecting the null hypothesis whe it is false. The more power, the better!

Power=1−β In a power calculation, one fixes the sample size and other aspects of the study and calculates an estimate of the power. In a sample size calculation, one fixes the power then determines the minimum sample size to acheive it.