Notes 2

Introducing Probability Distributions

Loading packages

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

We saw in Notes 1 that Generalized Linear Models assume other distributions than normal. Specifically, we mentioned

The Binomial distribution for proportion responses
The Poisson distribution for count responses
The Gamma distribution for positive continuous responses

So this notes is about introducing how to work with distributions in R.

Distribution functions in R

One of the great aspects of R is its systematic family of functions for representing a variety of distributions. These functions all follow the form:

_<dist>

where

The _ is a one-letter prefix (either d, p, q, or r)
<dist> is an abbreviation representing the distribution

`dnorm`, `pnorm`, `qnorm`, and `rnorm`

Because you’re familiar with the normal distribution, we’ll start there, where <dist> = norm. So there are the functions dnorm, pnorm, qnorm, and rnorm.

dnorm: d is for density, the height of the normal curve at a value \(x\)
pnorm: p is for probability, specifically the probability \(P(X\leq x)\) for a value \(x\)
qnorm: q is for quantile. This is the inverse of the pnorm function: where for pnorm you give the value \(x\) and get the probability, for qnorm you give the probability and get the value \(x\).
rnorm: r is for random generation. This generates a specified number of random normal variables.

help!

Type ?dnorm in the console. This opens a help window for these 4 functions.

Note that each function has the arguments mean and sd, which allow you to specify the mean and standard deviation of the normal distribution.

Examples

`dnorm` = density

Plotting the standard normal distribution (mean = 0, sd = 1)

tibble(
  x = seq(from=-5, to=5, by=.1), 
  Density = dnorm(x, mean=0, sd=1)
) |> 
  ggplot(
    aes(x = x, y= Density)
  ) + geom_line()

Plotting four different normal distributions

tibble(
  x = seq(from = -5, to = 5, by = .05), 
  `mean=0, sd=1` = dnorm(x, mean = 0, sd = 1), 
  `mean=0, sd=2` = dnorm(x, mean = 0, sd = 2), 
  `mean=1, sd=1` = dnorm(x, mean = 1, sd = 1), 
  `mean=-1, sd=1.5` = dnorm(x, mean = -1, sd = 1.5)
) |>
  pivot_longer(
    cols = c("mean=0, sd=1", "mean=0, sd=2", "mean=1, sd=1", "mean=-1, sd=1.5"), 
    names_to = "Parameters", 
    values_to = "Density"
  ) |> 
  ggplot(
    mapping = aes(x = x, y = Density, color = Parameters)
  ) + 
  geom_line() + 
  labs(title = "Normal Distributions")

The formula for the normal density function for mean = \(\mu\) and sd = \(\sigma\) is

\[f(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]\]

`pnorm` = probability

This gives you the probability \(P(X\leq x)\).

Graphically, this is like an area under the density curve to the left of a value \(x\). For instance, for the standard normal distribution, pnorm(1) gives you the area

tibble(
  x = seq(from=-4, to=4, by=.05), 
  Density = dnorm(x, mean = 0, sd = 1), 
  x_range = ifelse(x <= 1, x, NA)
) |> 
  ggplot(
    aes(x = x, y= Density)
  ) + geom_line() + 
  geom_area(
    aes(x = x_range, y = Density), 
    fill = "grey50"
  )

## Warning: Removed 60 rows containing non-finite outside the scale range
## (`stat_align()`).

You should know this fact about normal distributions:

68% of the probability is within 1 standard deviation of the mean

pnorm(1, mean=0, sd=1) - pnorm(-1, mean=0, sd=1)

## [1] 0.6826895

95% of the probability is within 2 standard deviation of the mean

pnorm(2, mean=0, sd=1) - pnorm(-2, mean=0, sd=1)

## [1] 0.9544997

99.7% of the probability is within 3 standard deviation of the mean

pnorm(3, mean=0, sd=1) - pnorm(-3, mean=0, sd=1)

## [1] 0.9973002

`qnorm` = quantile

This gives the quantile \(q\) corresponding with the probability \(p\) in \(p = P(X\leq q)\).

For the standard normal distribution, what is qnorm(.5)?

qnorm(.5, mean=0, sd=1)

## [1] 0

Can you estimate qnorm(1) and qnorm(2) before you find them? For what value of \(x\) will qnorm(x) give you 1?

qnorm(.84)

## [1] 0.9944579

For what value of \(x\) will qnorm(x) give you 2?

qnorm(.975)

## [1] 1.959964

How can you find the standard normal value \(z^\ast\) such that 95% of the probability is between \(-z^\ast\) and \(z^\ast\)?

qnorm(.975)

## [1] 1.959964

`rnorm` = random normal

Let’s generate 100 random standard normal values and make a histogram of them.

set.seed(123)
sim <- tibble(
  x = rnorm(n=100, mean=0, sd=1)
)
ggplot(
  data = sim, 
  mapping = aes(x = x)
) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What proportion of the random numbers are between -1 and 1?

sim |>
  mutate(between = case_when(
    x < -1 ~ FALSE,
    x < 1 ~ TRUE,
    TRUE ~ FALSE
  )) |>
  summarise(
    sum(between)
  )

## # A tibble: 1 × 1
##   `sum(between)`
##            <int>
## 1             69

What if we increase the sample size?

tibble(
  x=rnorm(n=1000, mean=0, sd=1)
)|>
  mutate(between = case_when(
    x < -1 ~ FALSE,
    x < 1 ~ TRUE,
    TRUE ~ FALSE
  )) |>
  summarise(
    sum(between)
  )

## # A tibble: 1 × 1
##   `sum(between)`
##            <int>
## 1            674

Notes 2

Introducing Probability Distributions

Notes 2

Distribution functions in R

dnorm, pnorm, qnorm, and rnorm

help!

Examples

dnorm = density

pnorm = probability

qnorm = quantile

rnorm = random normal

`dnorm`, `pnorm`, `qnorm`, and `rnorm`

`dnorm` = density

`pnorm` = probability

`qnorm` = quantile

`rnorm` = random normal