Notes 2
Introducing Probability Distributions
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
We saw in Notes 1 that Generalized Linear Models assume other distributions than normal. Specifically, we mentioned
The Binomial distribution for proportion responses
The Poisson distribution for count responses
The Gamma distribution for positive continuous responses
So this notes is about introducing how to work with distributions in R.
Distribution functions in R
One of the great aspects of R is its systematic family of functions for representing a variety of distributions. These functions all follow the form:
_<dist>
where
The
_
is a one-letter prefix (eitherd
,p
,q
, orr
)<dist>
is an abbreviation representing the distribution
dnorm
, pnorm
, qnorm
, and
rnorm
Because you’re familiar with the normal distribution, we’ll start
there, where <dist>
= norm
. So there are
the functions dnorm
, pnorm
,
qnorm
, and rnorm
.
dnorm
:d
is for density, the height of the normal curve at a value \(x\)pnorm
:p
is for probability, specifically the probability \(P(X\leq x)\) for a value \(x\)qnorm
:q
is for quantile. This is the inverse of thepnorm
function: where forpnorm
you give the value \(x\) and get the probability, forqnorm
you give the probability and get the value \(x\).rnorm
:r
is for random generation. This generates a specified number of random normal variables.
help!
Type ?dnorm
in the console. This opens a help window for
these 4 functions.
Note that each function has the arguments mean
and
sd
, which allow you to specify the mean and standard
deviation of the normal distribution.
Examples
dnorm
= density
Plotting the standard normal distribution (mean = 0, sd = 1)
tibble(
x = seq(from=-5, to=5, by=.1),
Density = dnorm(x, mean=0, sd=1)
) |>
ggplot(
aes(x = x, y= Density)
) + geom_line()
Plotting four different normal distributions
tibble(
x = seq(from = -5, to = 5, by = .05),
`mean=0, sd=1` = dnorm(x, mean = 0, sd = 1),
`mean=0, sd=2` = dnorm(x, mean = 0, sd = 2),
`mean=1, sd=1` = dnorm(x, mean = 1, sd = 1),
`mean=-1, sd=1.5` = dnorm(x, mean = -1, sd = 1.5)
) |>
pivot_longer(
cols = c("mean=0, sd=1", "mean=0, sd=2", "mean=1, sd=1", "mean=-1, sd=1.5"),
names_to = "Parameters",
values_to = "Density"
) |>
ggplot(
mapping = aes(x = x, y = Density, color = Parameters)
) +
geom_line() +
labs(title = "Normal Distributions")
The formula for the normal density function for mean = \(\mu\) and sd = \(\sigma\) is
\[f(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]\]
pnorm
= probability
This gives you the probability \(P(X\leq x)\).
Graphically, this is like an area under the density curve to the left
of a value \(x\). For instance, for the
standard normal distribution, pnorm(1)
gives you the
area
tibble(
x = seq(from=-4, to=4, by=.05),
Density = dnorm(x, mean = 0, sd = 1),
x_range = ifelse(x <= 1, x, NA)
) |>
ggplot(
aes(x = x, y= Density)
) + geom_line() +
geom_area(
aes(x = x_range, y = Density),
fill = "grey50"
)
## Warning: Removed 60 rows containing non-finite outside the scale range
## (`stat_align()`).
You should know this fact about normal distributions:
- 68% of the probability is within 1 standard deviation of the mean
## [1] 0.6826895
- 95% of the probability is within 2 standard deviation of the mean
## [1] 0.9544997
- 99.7% of the probability is within 3 standard deviation of the mean
## [1] 0.9973002
qnorm
= quantile
This gives the quantile \(q\) corresponding with the probability \(p\) in \(p = P(X\leq q)\).
For the standard normal distribution, what is
qnorm(.5)
?
## [1] 0
Can you estimate qnorm(1)
and qnorm(2)
before you find them? For what value of \(x\) will qnorm(x)
give you
1?
## [1] 0.9944579
For what value of \(x\) will
qnorm(x)
give you 2?
## [1] 1.959964
How can you find the standard normal value \(z^\ast\) such that 95% of the probability is between \(-z^\ast\) and \(z^\ast\)?
## [1] 1.959964
rnorm
= random normal
Let’s generate 100 random standard normal values and make a histogram of them.
set.seed(123)
sim <- tibble(
x = rnorm(n=100, mean=0, sd=1)
)
ggplot(
data = sim,
mapping = aes(x = x)
) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What proportion of the random numbers are between -1 and 1?
sim |>
mutate(between = case_when(
x < -1 ~ FALSE,
x < 1 ~ TRUE,
TRUE ~ FALSE
)) |>
summarise(
sum(between)
)
## # A tibble: 1 × 1
## `sum(between)`
## <int>
## 1 69
What if we increase the sample size?
tibble(
x=rnorm(n=1000, mean=0, sd=1)
)|>
mutate(between = case_when(
x < -1 ~ FALSE,
x < 1 ~ TRUE,
TRUE ~ FALSE
)) |>
summarise(
sum(between)
)
## # A tibble: 1 × 1
## `sum(between)`
## <int>
## 1 674