Notes 2
Introducing Probability Distributions
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
We saw in Notes 1 that Generalized Linear Models assume other distributions than normal. Specifically, we mentioned
The Binomial distribution for proportion responses
The Poisson distribution for count responses
The Gamma distribution for positive continuous responses
So this notes is about introducing how to work with distributions in R.
Distribution functions in R
One of the great aspects of R is its systematic family of functions for representing a variety of distributions. These functions all follow the form:
_<dist>
where
The
_is a one-letter prefix (eitherd,p,q, orr)<dist>is an abbreviation representing the distribution
dnorm, pnorm, qnorm, and
rnorm
Because you’re familiar with the normal distribution, we’ll start
there, where <dist> = norm. So there are
the functions dnorm, pnorm,
qnorm, and rnorm.
dnorm:dis for density, the height of the normal curve at a value \(x\)pnorm:pis for probability, specifically the probability \(P(X\leq x)\) for a value \(x\)qnorm:qis for quantile. This is the inverse of thepnormfunction: where forpnormyou give the value \(x\) and get the probability, forqnormyou give the probability and get the value \(x\).rnorm:ris for random generation. This generates a specified number of random normal variables.
help!
Type ?dnorm in the console. This opens a help window for
these 4 functions.
Note that each function has the arguments mean and
sd, which allow you to specify the mean and standard
deviation of the normal distribution.
Examples
dnorm = density
Plotting the standard normal distribution (mean = 0, sd = 1)
tibble(
x = seq(from=-5, to=5, by=.1),
Density = dnorm(x, mean=0, sd=1)
) |>
ggplot(
aes(x = x, y= Density)
) + geom_line()Plotting four different normal distributions
tibble(
x = seq(from = -5, to = 5, by = .05),
`mean=0, sd=1` = dnorm(x, mean = 0, sd = 1),
`mean=0, sd=2` = dnorm(x, mean = 0, sd = 2),
`mean=1, sd=1` = dnorm(x, mean = 1, sd = 1),
`mean=-1, sd=1.5` = dnorm(x, mean = -1, sd = 1.5)
) |>
pivot_longer(
cols = c("mean=0, sd=1", "mean=0, sd=2", "mean=1, sd=1", "mean=-1, sd=1.5"),
names_to = "Parameters",
values_to = "Density"
) |>
ggplot(
mapping = aes(x = x, y = Density, color = Parameters)
) +
geom_line() +
labs(title = "Normal Distributions")The formula for the normal density function for mean = \(\mu\) and sd = \(\sigma\) is
\[f(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]\]
pnorm = probability
This gives you the probability \(P(X\leq x)\).
Graphically, this is like an area under the density curve to the left
of a value \(x\). For instance, for the
standard normal distribution, pnorm(1) gives you the
area
tibble(
x = seq(from=-4, to=4, by=.05),
Density = dnorm(x, mean = 0, sd = 1),
x_range = ifelse(x <= 1, x, NA)
) |>
ggplot(
aes(x = x, y= Density)
) + geom_line() +
geom_area(
aes(x = x_range, y = Density),
fill = "grey50"
)## Warning: Removed 60 rows containing non-finite outside the scale range
## (`stat_align()`).
You should know this fact about normal distributions:
- 68% of the probability is within 1 standard deviation of the mean
## [1] 0.6826895
- 95% of the probability is within 2 standard deviation of the mean
## [1] 0.9544997
- 99.7% of the probability is within 3 standard deviation of the mean
## [1] 0.9973002
qnorm = quantile
This gives the quantile \(q\) corresponding with the probability \(p\) in \(p = P(X\leq q)\).
For the standard normal distribution, what is
qnorm(.5)?
## [1] 0
Can you estimate qnorm(1) and qnorm(2)
before you find them? For what value of \(x\) will qnorm(x) give you
1?
## [1] 0.9944579
For what value of \(x\) will
qnorm(x) give you 2?
## [1] 1.959964
How can you find the standard normal value \(z^\ast\) such that 95% of the probability is between \(-z^\ast\) and \(z^\ast\)?
## [1] 1.959964
rnorm = random normal
Let’s generate 100 random standard normal values and make a histogram of them.
set.seed(123)
sim <- tibble(
x = rnorm(n=100, mean=0, sd=1)
)
ggplot(
data = sim,
mapping = aes(x = x)
) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What proportion of the random numbers are between -1 and 1?
sim |>
mutate(between = case_when(
x < -1 ~ FALSE,
x < 1 ~ TRUE,
TRUE ~ FALSE
)) |>
summarise(
sum(between)
)## # A tibble: 1 × 1
## `sum(between)`
## <int>
## 1 69
What if we increase the sample size?
tibble(
x=rnorm(n=1000, mean=0, sd=1)
)|>
mutate(between = case_when(
x < -1 ~ FALSE,
x < 1 ~ TRUE,
TRUE ~ FALSE
)) |>
summarise(
sum(between)
)## # A tibble: 1 × 1
## `sum(between)`
## <int>
## 1 674