Notes 2

Introducing Probability Distributions

Loading packages

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3

We saw in Notes 1 that Generalized Linear Models assume other distributions than normal. Specifically, we mentioned

So this notes is about introducing how to work with distributions in R.

Distribution functions in R

One of the great aspects of R is its systematic family of functions for representing a variety of distributions. These functions all follow the form:

_<dist>

where

  • The _ is a one-letter prefix (either d, p, q, or r)

  • <dist> is an abbreviation representing the distribution

dnorm, pnorm, qnorm, and rnorm

Because you’re familiar with the normal distribution, we’ll start there, where <dist> = norm. So there are the functions dnorm, pnorm, qnorm, and rnorm.

  • dnorm: d is for density, the height of the normal curve at a value \(x\)

  • pnorm: p is for probability, specifically the probability \(P(X\leq x)\) for a value \(x\)

  • qnorm: q is for quantile. This is the inverse of the pnorm function: where for pnorm you give the value \(x\) and get the probability, for qnorm you give the probability and get the value \(x\).

  • rnorm: r is for random generation. This generates a specified number of random normal variables.

help!

Type ?dnorm in the console. This opens a help window for these 4 functions.

Note that each function has the arguments mean and sd, which allow you to specify the mean and standard deviation of the normal distribution.

Examples

dnorm = density

Plotting the standard normal distribution (mean = 0, sd = 1)

tibble(
  x = seq(from=-5, to=5, by=.1), 
  Density = dnorm(x, mean=0, sd=1)
) |> 
  ggplot(
    aes(x = x, y= Density)
  ) + geom_line()

Plotting four different normal distributions

tibble(
  x = seq(from = -5, to = 5, by = .05), 
  `mean=0, sd=1` = dnorm(x, mean = 0, sd = 1), 
  `mean=0, sd=2` = dnorm(x, mean = 0, sd = 2), 
  `mean=1, sd=1` = dnorm(x, mean = 1, sd = 1), 
  `mean=-1, sd=1.5` = dnorm(x, mean = -1, sd = 1.5)
) |>
  pivot_longer(
    cols = c("mean=0, sd=1", "mean=0, sd=2", "mean=1, sd=1", "mean=-1, sd=1.5"), 
    names_to = "Parameters", 
    values_to = "Density"
  ) |> 
  ggplot(
    mapping = aes(x = x, y = Density, color = Parameters)
  ) + 
  geom_line() + 
  labs(title = "Normal Distributions")

The formula for the normal density function for mean = \(\mu\) and sd = \(\sigma\) is

\[f(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]\]

pnorm = probability

This gives you the probability \(P(X\leq x)\).

Graphically, this is like an area under the density curve to the left of a value \(x\). For instance, for the standard normal distribution, pnorm(1) gives you the area

tibble(
  x = seq(from=-4, to=4, by=.05), 
  Density = dnorm(x, mean = 0, sd = 1), 
  x_range = ifelse(x <= 1, x, NA)
) |> 
  ggplot(
    aes(x = x, y= Density)
  ) + geom_line() + 
  geom_area(
    aes(x = x_range, y = Density), 
    fill = "grey50"
  )
## Warning: Removed 60 rows containing non-finite outside the scale range
## (`stat_align()`).

You should know this fact about normal distributions:

  • 68% of the probability is within 1 standard deviation of the mean
pnorm(1, mean=0, sd=1) - pnorm(-1, mean=0, sd=1)
## [1] 0.6826895
  • 95% of the probability is within 2 standard deviation of the mean
pnorm(2, mean=0, sd=1) - pnorm(-2, mean=0, sd=1)
## [1] 0.9544997
  • 99.7% of the probability is within 3 standard deviation of the mean
pnorm(3, mean=0, sd=1) - pnorm(-3, mean=0, sd=1)
## [1] 0.9973002

qnorm = quantile

This gives the quantile \(q\) corresponding with the probability \(p\) in \(p = P(X\leq q)\).

For the standard normal distribution, what is qnorm(.5)?

qnorm(.5, mean=0, sd=1)
## [1] 0

Can you estimate qnorm(1) and qnorm(2) before you find them? For what value of \(x\) will qnorm(x) give you 1?

qnorm(.84)
## [1] 0.9944579

For what value of \(x\) will qnorm(x) give you 2?

qnorm(.975)
## [1] 1.959964

How can you find the standard normal value \(z^\ast\) such that 95% of the probability is between \(-z^\ast\) and \(z^\ast\)?

qnorm(.975)
## [1] 1.959964

rnorm = random normal

Let’s generate 100 random standard normal values and make a histogram of them.

set.seed(123)
sim <- tibble(
  x = rnorm(n=100, mean=0, sd=1)
)
ggplot(
  data = sim, 
  mapping = aes(x = x)
) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What proportion of the random numbers are between -1 and 1?

sim |>
  mutate(between = case_when(
    x < -1 ~ FALSE,
    x < 1 ~ TRUE,
    TRUE ~ FALSE
  )) |>
  summarise(
    sum(between)
  )
## # A tibble: 1 × 1
##   `sum(between)`
##            <int>
## 1             69

What if we increase the sample size?

tibble(
  x=rnorm(n=1000, mean=0, sd=1)
)|>
  mutate(between = case_when(
    x < -1 ~ FALSE,
    x < 1 ~ TRUE,
    TRUE ~ FALSE
  )) |>
  summarise(
    sum(between)
  )
## # A tibble: 1 × 1
##   `sum(between)`
##            <int>
## 1            674