Probability density functions are named like dDIST_NAME. Given an x axis value, it gives the density of that value. Roughly, you can interpret this as relative probability of values around x. If it is higher at x1 than at x2, the values around x1 has higher probability of appearing than values around x2.
dnorm(0)
## [1] 0.3989423
dnorm(-2)
## [1] 0.05399097
Let’s draw density function of two normal distributions.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
data_frame(x = seq(-5, +5, by = 0.1),
y = dnorm(x),
y2 = dnorm(x, mean = 1, sd = 2)) %>%
ggplot(mapping = aes(x = x, y = y)) +
geom_line() +
geom_line(mapping = aes(y = y2))
A pDIST_NAME function gives a probability (area under the curve to the lower end) given quantile (x axis point).
pnorm(-1)
## [1] 0.1586553
pnorm(q = 0)
## [1] 0.5
pnorm(q = 0, mean = 1, sd = 2)
## [1] 0.3085375
A qDIST_NAME function gives a quantile (x axis value) given probability (area under the curve to the lower end).
qnorm(p = 0.025)
## [1] -1.959964
qnorm(p = 1 - 0.025)
## [1] 1.959964
Discrete distributions can only take a countable number of values. Here Binomial(10, 0.5) can take integer values 0,1,…,10. More formally these functions should be called probability mass functions.
dbinom(x = 0, size = 10, prob = 0.5)
## [1] 0.0009765625
dbinom(x = 5, size = 10, prob = 0.5)
## [1] 0.2460938
Visualization should be with a bar plot with each bar at the possible values indicating the probability of that value.
data_frame(x = 0:10,
y = dbinom(x = x, size = 10, prob = 0.5)) %>%
ggplot(mapping = aes(x = x, y = y)) +
geom_col()
In a discrete distribution, pDIST_NAME functions gives the sum of bar heights below the given threshold on x axis. You can confirm this with sum().
pbinom(q = 2, size = 10, prob = 0.5)
## [1] 0.0546875
dbinom(x = 0:2, size = 10, prob = 0.5) %>% sum
## [1] 0.0546875
Computers can “emulate” random numbers. These are really pseudo-random numbers, so setting a seed (starting point of an algorithm) to a given number assures reproducibility.
set.seed(123)
rnorm(n = 3, mean = 0, sd = 1)
## [1] -0.5604756 -0.2301775 1.5587083
data_frame(X = rnorm(n = 30000)) %>%
ggplot(mapping = aes(x = X)) +
geom_density()
rbinom(n = 1, size = 10, prob = 0.5)
## [1] 3
data_frame(X = rbinom(n = 30000, size = 10, prob = 0.1)) %>%
ggplot(mapping = aes(x = X)) +
geom_bar()
framingham <- read_csv("framingham.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## SYSBP = col_double(),
## DIABP = col_double(),
## BMI = col_double(),
## TIMEAP = col_double(),
## TIMEMI = col_double(),
## TIMEMIFC = col_double(),
## TIMECHD = col_double(),
## TIMESTRK = col_double(),
## TIMECVD = col_double(),
## TIMEDTH = col_double(),
## TIMEHYP = col_double()
## )
## See spec(...) for full column specifications.
framingham2 <- readxl::read_excel("framingham.xlsx")
ggplot(data = framingham, mapping = aes(x = factor(SEX), y = TOTCHOL)) +
geom_boxplot()
## Warning: Removed 52 rows containing non-finite values (stat_boxplot).
The formula Y ~ X can be roughly interpreted as explain Y by X. Here specifically compare the average of Y by grouping X. To examine the result in detail, capture it in an object and use the dollar operator to access the data elements.
t_test_res1 <- t.test(TOTCHOL ~ SEX, data = framingham)
t_test_res1
##
## Welch Two Sample t-test
##
## data: TOTCHOL by SEX
## t = 4.548, df = 4288.4, p-value = 0.000005565
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.471381 8.731875
## sample estimates:
## mean in group 0 mean in group 1
## 239.6814 233.5798
names(t_test_res1)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
t_test_res1$statistic
## t
## 4.547994
Define a two by two table. Read the empty LHS of the formula as an implicit count variable. The row variable comes first.
tab1 <- xtabs( ~ SEX + CURSMOKE, data = framingham)
tab1
## CURSMOKE
## SEX 0 1
## 0 1484 1006
## 1 769 1175
Both chi-squared test and Fisher’s exact test functions can take a two by two table.
chisq_test_res1 <- chisq.test(tab1)
chisq_test_res1
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab1
## X-squared = 174.63, df = 1, p-value < 2.2e-16
names(chisq_test_res1)
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
chisq_test_res1$observed
## CURSMOKE
## SEX 0 1
## 0 1484 1006
## 1 769 1175
chisq_test_res1$expected
## CURSMOKE
## SEX 0 1
## 0 1265.2165 1224.7835
## 1 987.7835 956.2165
chisq_test_res1$residuals
## CURSMOKE
## SEX 0 1
## 0 6.150807 -6.251509
## 1 -6.961193 7.075163
fisher.test(tab1)
##
## Fisher's Exact Test for Count Data
##
## data: tab1
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.993100 2.548945
## sample estimates:
## odds ratio
## 2.253545
Two types of 2x2 tables are commonly encountered in epidemiology. epiR is a useful package with many classical analysis tools in epidemiology.
library(epiR)
## Loading required package: survival
## Package epiR 0.9-82 is loaded
## Type help(epi.about) for summary information
##
tab2 <- xtabs( ~ CURSMOKE + DEATH, data = framingham)[2:1, 2:1]
tab2
## DEATH
## CURSMOKE 1 0
## 1 788 1393
## 0 762 1491
epiR::epi.2by2(tab2 * 1.0)
## Outcome + Outcome - Total Inc risk *
## Exposed + 788 1393 2181 36.1
## Exposed - 762 1491 2253 33.8
## Total 1550 2884 4434 35.0
## Odds
## Exposed + 0.566
## Exposed - 0.511
## Total 0.537
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.07 (0.99, 1.16)
## Odds ratio 1.11 (0.98, 1.25)
## Attrib risk * 2.31 (-0.50, 5.12)
## Attrib risk in population * 1.14 (-1.27, 3.54)
## Attrib fraction in exposed (%) 6.39 (-1.44, 13.61)
## Attrib fraction in population (%) 3.25 (-0.78, 7.12)
## -------------------------------------------------------------------
## X2 test statistic: 2.598 p-value: 0.107
## Wald confidence limits
## * Outcomes per 100 population units
## epiR package's example
## Scott et al. 2008, Table 1:
## A new diagnostic test was trialled on 1586 patients. Of 744 patients
## that were disease positive, 670 tested positive. Of 842 patients that
## were disease negative, 640 tested negative. What is the likeliood
## ratio of a positive test? What is the number needed to diagnose?
c(670, 202, 74, 640)
## [1] 670 202 74 640
dat <- matrix(c(670, 202, 74, 640), nrow = 2, byrow = TRUE)
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
## Dis+ Dis-
## Test+ 670 202
## Test- 74 640
addmargins(dat)
## Dis+ Dis- Sum
## Test+ 670 202 872
## Test- 74 640 714
## Sum 744 842 1586
epiR::epi.tests(dat)
## Outcome + Outcome - Total
## Test + 670 202 872
## Test - 74 640 714
## Total 744 842 1586
##
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence 0.55 (0.52, 0.57)
## True prevalence 0.47 (0.44, 0.49)
## Sensitivity 0.90 (0.88, 0.92)
## Specificity 0.76 (0.73, 0.79)
## Positive predictive value 0.77 (0.74, 0.80)
## Negative predictive value 0.90 (0.87, 0.92)
## Positive likelihood ratio 3.75 (3.32, 4.24)
## Negative likelihood ratio 0.13 (0.11, 0.16)
## ---------------------------------------------------------
Dependent_variable ~ explanatory_variable1 + explanatory_variable2 … is the structure.
A product term (interaction term) is constructed with a colon operator between two variable names.
lm1 <- lm(formula = TOTCHOL ~ AGE, data = framingham)
lm1
##
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
##
## Coefficients:
## (Intercept) AGE
## 172.912 1.284
summary(lm1)
##
## Call:
## lm(formula = TOTCHOL ~ AGE, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.78 -29.52 -2.98 25.17 457.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.91217 3.81696 45.30 <2e-16 ***
## AGE 1.28386 0.07535 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.25 on 4380 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.06215, Adjusted R-squared: 0.06194
## F-statistic: 290.3 on 1 and 4380 DF, p-value: < 2.2e-16
What the model is representing can be observed by overplotting the predicted values (blue) on the raw data points.
framingham <- framingham %>%
modelr::add_predictions(model = lm1)
ggplot(data = framingham, mapping = aes(x = AGE, y = TOTCHOL)) +
geom_point() +
geom_line(mapping = aes(y = pred), color = "blue")
## Warning: Removed 52 rows containing missing values (geom_point).
A model with two variables without an interaction.
lm2 <- lm(TOTCHOL ~ AGE + SEX, data = framingham)
summary(lm2)
##
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.27 -29.68 -3.12 25.20 460.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 175.6959 3.8604 45.51 < 2e-16 ***
## AGE 1.2796 0.0752 17.02 < 2e-16 ***
## SEX -5.8152 1.3128 -4.43 0.00000967 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.15 on 4379 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.06634, Adjusted R-squared: 0.06591
## F-statistic: 155.6 on 2 and 4379 DF, p-value: < 2.2e-16
A model with two variables with an interaction. Conceptually, there are only two explanatory variables (AGE and SEX), but mathematically, the product term acts as another separate variable.
lm3 <- lm(TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
summary(lm3)
##
## Call:
## lm(formula = TOTCHOL ~ AGE + SEX + AGE:SEX, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.46 -28.59 -3.24 24.44 462.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.6914 5.0388 26.33 <2e-16 ***
## AGE 2.1396 0.0993 21.55 <2e-16 ***
## SEX 90.0487 7.5157 11.98 <2e-16 ***
## AGE:SEX -1.9218 0.1484 -12.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.36 on 4378 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.1008, Adjusted R-squared: 0.1002
## F-statistic: 163.5 on 3 and 4378 DF, p-value: < 2.2e-16
## This abbreviated formula is equivalent.
lm4 <- lm(TOTCHOL ~ AGE*SEX, data = framingham)
summary(lm4)
##
## Call:
## lm(formula = TOTCHOL ~ AGE * SEX, data = framingham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175.46 -28.59 -3.24 24.44 462.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.6914 5.0388 26.33 <2e-16 ***
## AGE 2.1396 0.0993 21.55 <2e-16 ***
## SEX 90.0487 7.5157 11.98 <2e-16 ***
## AGE:SEX -1.9218 0.1484 -12.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.36 on 4378 degrees of freedom
## (52 observations deleted due to missingness)
## Multiple R-squared: 0.1008, Adjusted R-squared: 0.1002
## F-statistic: 163.5 on 3 and 4378 DF, p-value: < 2.2e-16
How do they differ in flexibility can again be observed by overplotting.
framingham_models_2v3 <- framingham %>%
modelr::gather_predictions(lm2, lm3)
ggplot(data = framingham_models_2v3,
mapping = aes(x = AGE, y = TOTCHOL,
color = factor(SEX),
group = factor(SEX))) +
geom_point(alpha = 0.02) +
geom_line(mapping = aes(y = pred)) +
facet_grid(SEX ~ model)
## Warning: Removed 104 rows containing missing values (geom_point).
If the outcome variable is something more tricky (binary, count, etc), glm() with an appropriate family specification is used. Here I just show an example of binary logistic regression for the DEATH 0,1 variable.
glm1 <- glm(formula = DEATH ~ AGE + SEX,
data = framingham,
family = binomial())
summary(glm1)
##
## Call:
## glm(formula = DEATH ~ AGE + SEX, family = binomial(), data = framingham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9003 -0.8284 -0.5110 0.9555 2.4911
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.132411 0.246377 -28.95 <2e-16 ***
## AGE 0.119868 0.004533 26.45 <2e-16 ***
## SEX 0.847200 0.072045 11.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5739.2 on 4433 degrees of freedom
## Residual deviance: 4763.9 on 4431 degrees of freedom
## AIC: 4769.9
##
## Number of Fisher Scoring iterations: 4