Probability tools

Normal distribution

Probability density functions are named like dDIST_NAME. Given an x axis value, it gives the density of that value. Roughly, you can interpret this as relative probability of values around x. If it is higher at x1 than at x2, the values around x1 has higher probability of appearing than values around x2.

dnorm(0)
## [1] 0.3989423
dnorm(-2)
## [1] 0.05399097

Let’s draw density function of two normal distributions.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
data_frame(x = seq(-5, +5, by = 0.1),
           y = dnorm(x),
           y2 = dnorm(x, mean = 1, sd = 2)) %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_line() +
  geom_line(mapping = aes(y = y2))

A pDIST_NAME function gives a probability (area under the curve to the lower end) given quantile (x axis point).

pnorm(-1)
## [1] 0.1586553
pnorm(q = 0)
## [1] 0.5
pnorm(q = 0, mean = 1, sd = 2)
## [1] 0.3085375

A qDIST_NAME function gives a quantile (x axis value) given probability (area under the curve to the lower end).

qnorm(p = 0.025)
## [1] -1.959964
qnorm(p = 1 - 0.025)
## [1] 1.959964

Binomial distribution

Discrete distributions can only take a countable number of values. Here Binomial(10, 0.5) can take integer values 0,1,…,10. More formally these functions should be called probability mass functions.

dbinom(x = 0, size = 10, prob = 0.5)
## [1] 0.0009765625
dbinom(x = 5, size = 10, prob = 0.5)
## [1] 0.2460938

Visualization should be with a bar plot with each bar at the possible values indicating the probability of that value.

data_frame(x = 0:10,
           y = dbinom(x = x, size = 10, prob = 0.5)) %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_col()

In a discrete distribution, pDIST_NAME functions gives the sum of bar heights below the given threshold on x axis. You can confirm this with sum().

pbinom(q = 2, size = 10, prob = 0.5)
## [1] 0.0546875
dbinom(x = 0:2, size = 10, prob = 0.5) %>% sum
## [1] 0.0546875

Random number generator

Computers can “emulate” random numbers. These are really pseudo-random numbers, so setting a seed (starting point of an algorithm) to a given number assures reproducibility.

Normal distribution

set.seed(123)
rnorm(n = 3, mean = 0, sd = 1)
## [1] -0.5604756 -0.2301775  1.5587083
data_frame(X = rnorm(n = 30000)) %>%
  ggplot(mapping = aes(x = X)) +
  geom_density()

Binomail distribution

rbinom(n = 1, size = 10, prob = 0.5)
## [1] 3
data_frame(X = rbinom(n = 30000, size = 10, prob = 0.1)) %>%
  ggplot(mapping = aes(x = X)) +
  geom_bar()

Hypothesis testing tools

Continuous variable by two groups

framingham <- read_csv("framingham.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   SYSBP = col_double(),
##   DIABP = col_double(),
##   BMI = col_double(),
##   TIMEAP = col_double(),
##   TIMEMI = col_double(),
##   TIMEMIFC = col_double(),
##   TIMECHD = col_double(),
##   TIMESTRK = col_double(),
##   TIMECVD = col_double(),
##   TIMEDTH = col_double(),
##   TIMEHYP = col_double()
## )
## See spec(...) for full column specifications.
framingham2 <- readxl::read_excel("framingham.xlsx")
ggplot(data = framingham, mapping = aes(x = factor(SEX), y = TOTCHOL)) +
  geom_boxplot()
## Warning: Removed 52 rows containing non-finite values (stat_boxplot).

The formula Y ~ X can be roughly interpreted as explain Y by X. Here specifically compare the average of Y by grouping X. To examine the result in detail, capture it in an object and use the dollar operator to access the data elements.

t_test_res1 <- t.test(TOTCHOL ~ SEX, data = framingham)
t_test_res1
## 
##  Welch Two Sample t-test
## 
## data:  TOTCHOL by SEX
## t = 4.548, df = 4288.4, p-value = 0.000005565
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.471381 8.731875
## sample estimates:
## mean in group 0 mean in group 1 
##        239.6814        233.5798
names(t_test_res1)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
t_test_res1$statistic
##        t 
## 4.547994

Binary variable by two groups

Define a two by two table. Read the empty LHS of the formula as an implicit count variable. The row variable comes first.

tab1 <- xtabs( ~ SEX + CURSMOKE, data = framingham)
tab1
##    CURSMOKE
## SEX    0    1
##   0 1484 1006
##   1  769 1175

Both chi-squared test and Fisher’s exact test functions can take a two by two table.

chisq_test_res1 <- chisq.test(tab1)
chisq_test_res1
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab1
## X-squared = 174.63, df = 1, p-value < 2.2e-16
names(chisq_test_res1)
## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"
chisq_test_res1$observed
##    CURSMOKE
## SEX    0    1
##   0 1484 1006
##   1  769 1175
chisq_test_res1$expected
##    CURSMOKE
## SEX         0         1
##   0 1265.2165 1224.7835
##   1  987.7835  956.2165
chisq_test_res1$residuals
##    CURSMOKE
## SEX         0         1
##   0  6.150807 -6.251509
##   1 -6.961193  7.075163
fisher.test(tab1)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab1
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.993100 2.548945
## sample estimates:
## odds ratio 
##   2.253545

More two by two tables

Two types of 2x2 tables are commonly encountered in epidemiology. epiR is a useful package with many classical analysis tools in epidemiology.

library(epiR)
## Loading required package: survival
## Package epiR 0.9-82 is loaded
## Type help(epi.about) for summary information
## 

Disease vs risk factor 2x2 table

tab2 <- xtabs( ~ CURSMOKE + DEATH, data = framingham)[2:1, 2:1]
tab2
##         DEATH
## CURSMOKE    1    0
##        1  788 1393
##        0  762 1491
epiR::epi.2by2(tab2 * 1.0)
##              Outcome +    Outcome -      Total        Inc risk *
## Exposed +          788         1393       2181              36.1
## Exposed -          762         1491       2253              33.8
## Total             1550         2884       4434              35.0
##                  Odds
## Exposed +       0.566
## Exposed -       0.511
## Total           0.537
## 
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio                               1.07 (0.99, 1.16)
## Odds ratio                                   1.11 (0.98, 1.25)
## Attrib risk *                                2.31 (-0.50, 5.12)
## Attrib risk in population *                  1.14 (-1.27, 3.54)
## Attrib fraction in exposed (%)               6.39 (-1.44, 13.61)
## Attrib fraction in population (%)            3.25 (-0.78, 7.12)
## -------------------------------------------------------------------
##  X2 test statistic: 2.598 p-value: 0.107
##  Wald confidence limits
##  * Outcomes per 100 population units

Disease vs diagnostic test 2x2 table

## epiR package's example
## Scott et al. 2008, Table 1:
## A new diagnostic test was trialled on 1586 patients. Of 744 patients
## that were disease positive, 670 tested positive. Of 842 patients that
## were disease negative, 640 tested negative. What is the likeliood
## ratio of a positive test? What is the number needed to diagnose?
c(670, 202, 74, 640)
## [1] 670 202  74 640
dat <- matrix(c(670, 202, 74, 640), nrow = 2, byrow = TRUE)
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
##       Dis+ Dis-
## Test+  670  202
## Test-   74  640
addmargins(dat)
##       Dis+ Dis-  Sum
## Test+  670  202  872
## Test-   74  640  714
## Sum    744  842 1586
epiR::epi.tests(dat)
##           Outcome +    Outcome -      Total
## Test +          670          202        872
## Test -           74          640        714
## Total           744          842       1586
## 
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence                    0.55 (0.52, 0.57)
## True prevalence                        0.47 (0.44, 0.49)
## Sensitivity                            0.90 (0.88, 0.92)
## Specificity                            0.76 (0.73, 0.79)
## Positive predictive value              0.77 (0.74, 0.80)
## Negative predictive value              0.90 (0.87, 0.92)
## Positive likelihood ratio              3.75 (3.32, 4.24)
## Negative likelihood ratio              0.13 (0.11, 0.16)
## ---------------------------------------------------------

Regression modeling

Formulas

Dependent_variable ~ explanatory_variable1 + explanatory_variable2 … is the structure.

A product term (interaction term) is constructed with a colon operator between two variable names.

Simple Linear regression

lm1 <- lm(formula = TOTCHOL ~ AGE, data = framingham)
lm1
## 
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
## 
## Coefficients:
## (Intercept)          AGE  
##     172.912        1.284
summary(lm1)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -155.78  -29.52   -2.98   25.17  457.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 172.91217    3.81696   45.30   <2e-16 ***
## AGE           1.28386    0.07535   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.25 on 4380 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.06215,    Adjusted R-squared:  0.06194 
## F-statistic: 290.3 on 1 and 4380 DF,  p-value: < 2.2e-16

What the model is representing can be observed by overplotting the predicted values (blue) on the raw data points.

framingham <- framingham %>%
  modelr::add_predictions(model = lm1)
ggplot(data = framingham, mapping = aes(x = AGE, y = TOTCHOL)) +
  geom_point() +
  geom_line(mapping = aes(y = pred), color = "blue")
## Warning: Removed 52 rows containing missing values (geom_point).

Multiple linear regression

A model with two variables without an interaction.

lm2 <- lm(TOTCHOL ~ AGE + SEX, data = framingham)
summary(lm2)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158.27  -29.68   -3.12   25.20  460.86 
## 
## Coefficients:
##             Estimate Std. Error t value   Pr(>|t|)    
## (Intercept) 175.6959     3.8604   45.51    < 2e-16 ***
## AGE           1.2796     0.0752   17.02    < 2e-16 ***
## SEX          -5.8152     1.3128   -4.43 0.00000967 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.15 on 4379 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.06634,    Adjusted R-squared:  0.06591 
## F-statistic: 155.6 on 2 and 4379 DF,  p-value: < 2.2e-16

A model with two variables with an interaction. Conceptually, there are only two explanatory variables (AGE and SEX), but mathematically, the product term acts as another separate variable.

lm3 <- lm(TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
summary(lm3)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.46  -28.59   -3.24   24.44  462.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 132.6914     5.0388   26.33   <2e-16 ***
## AGE           2.1396     0.0993   21.55   <2e-16 ***
## SEX          90.0487     7.5157   11.98   <2e-16 ***
## AGE:SEX      -1.9218     0.1484  -12.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.36 on 4378 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.1008, Adjusted R-squared:  0.1002 
## F-statistic: 163.5 on 3 and 4378 DF,  p-value: < 2.2e-16
## This abbreviated formula is equivalent.
lm4 <- lm(TOTCHOL ~ AGE*SEX, data = framingham)
summary(lm4)
## 
## Call:
## lm(formula = TOTCHOL ~ AGE * SEX, data = framingham)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.46  -28.59   -3.24   24.44  462.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 132.6914     5.0388   26.33   <2e-16 ***
## AGE           2.1396     0.0993   21.55   <2e-16 ***
## SEX          90.0487     7.5157   11.98   <2e-16 ***
## AGE:SEX      -1.9218     0.1484  -12.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.36 on 4378 degrees of freedom
##   (52 observations deleted due to missingness)
## Multiple R-squared:  0.1008, Adjusted R-squared:  0.1002 
## F-statistic: 163.5 on 3 and 4378 DF,  p-value: < 2.2e-16

How do they differ in flexibility can again be observed by overplotting.

framingham_models_2v3 <- framingham %>%
  modelr::gather_predictions(lm2, lm3)

ggplot(data = framingham_models_2v3,
       mapping = aes(x = AGE, y = TOTCHOL,
                     color = factor(SEX),
                     group = factor(SEX))) +
  geom_point(alpha = 0.02) +
  geom_line(mapping = aes(y = pred)) +
  facet_grid(SEX ~ model)
## Warning: Removed 104 rows containing missing values (geom_point).

Generalized linear model

If the outcome variable is something more tricky (binary, count, etc), glm() with an appropriate family specification is used. Here I just show an example of binary logistic regression for the DEATH 0,1 variable.

glm1 <- glm(formula = DEATH ~ AGE + SEX,
            data = framingham,
            family = binomial())
summary(glm1)
## 
## Call:
## glm(formula = DEATH ~ AGE + SEX, family = binomial(), data = framingham)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9003  -0.8284  -0.5110   0.9555   2.4911  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.132411   0.246377  -28.95   <2e-16 ***
## AGE          0.119868   0.004533   26.45   <2e-16 ***
## SEX          0.847200   0.072045   11.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5739.2  on 4433  degrees of freedom
## Residual deviance: 4763.9  on 4431  degrees of freedom
## AIC: 4769.9
## 
## Number of Fisher Scoring iterations: 4