Let X be the count of heads on a toss of two coins.
Input the data.
x <- c( 0, 1, 2)
px <- c(.25, .5, .25)
Now, lets plot it.
?barplot
barplot(height = px,
names.arg = x,
main = "Discrete Distribution",
xlab = "Heads",
ylab = "Probability"
)
Calculate the mean and variance of the probability distribution.
# print
x
## [1] 0 1 2
px
## [1] 0.25 0.50 0.25
# manually calculate the mean
mu <- 0* .25 + 1 *.5 + 2 * .25
mu
## [1] 1
x*px
## [1] 0.0 0.5 0.5
mu2 <-sum(x * px)
mu2 # should be one too if we calculated things correctly
## [1] 1
mu3 <- x%*%px # dot product in R
mu3
## [,1]
## [1,] 1
var(x)
/ sd(x)
will assume each outcome is
equally probable
x
## [1] 0 1 2
px
## [1] 0.25 0.50 0.25
# assumes each outcome is equally probable
var(x)
## [1] 1
sd(x)
## [1] 1
Lets apply the correct weighting instead.
# manually
sigma2 <- (0-1)^2 *.25 + (1-1)^2 * .50 + (2-1)^2 *.25
sigma2
## [1] 0.5
sigma2_alt <- sum( (x-mu)^2 * px ) # PEDMAS
sigma2_alt
## [1] 0.5
# Var (X) = N * p * (1-p)
sigma2_alt2 <- 2 * .5 *(1-.5)
sigma2_alt2
## [1] 0.5
Standard deviation will still be sqrt(variance)
.
# SD
sigma <- sqrt(sigma2)
sigma
## [1] 0.7071068
moments <- c(mu, sigma2)
moments
## [1] 1.0 0.5
Let X be the random variable that tracks the number of heads on n=5 flips of an unfair coin (p = .60). What is the probability of getting exactly 3 heads, P(X=3)?
Let X be the random variable that tracks the number of heads on \(n=5\) flips of an unfair coin (\(\pi = .60\)).
Our solution required us to calculate -
\(P \ (X=3 \ | \ n = 5, \ p = .6)\)
# P (X=3 | n = 5, p = .6)
n <- 5
p <- .6
p^3 * (1-p)^2 # prob of one sequence, say HHHTT
## [1] 0.03456
(1-p)^2 * p^3 # prob of another sequence, say TTHHH
## [1] 0.03456
(1-p)^2 * p^3 # prob of another sequence, say TTHHH...and so on...
## [1] 0.03456
choose(5,3) # 10 different sequences in total
## [1] 10
# using pmf formula
choose(5,3) * p^3 * (1-p)^2
## [1] 0.3456
### expected mean of binomial
# N * p
# size * prob
5 * .6 # 3 successes
## [1] 3
### expected variance of binomial
# N * p * (1-p)
# size * prob * (1-prob)
5 * .6 * .4 # 1.2
## [1] 1.2
dbinom
Use the dbinom
function to calculate this -
?dbinom
dbinom(x = 3,
size = 5,
prob = .6
)
## [1] 0.3456
This might help clarify the situation but requires a bit of coding and understanding the binomial distribution.
# Number of trials
n <- 5
# Probability of success
p <- 0.6
# Generate values for x (number of successes)
x <- 0:n
# Calculate the probabilities for each value of x
probabilities <- dbinom(x = x,
size = n,
prob = p
)
# Plot the binomial distribution
barplot(height = probabilities,
names.arg = x,
col = "skyblue",
main = "Binomial Distribution",
xlab = "Number of Successes",
ylab = "Probability"
)
The national 30-day case fatality rate for intracranial hemorrhage (ICH) surgical intervention is 16%. The neurosurgical team performed 20 of these operations last year and experienced 6 deaths. Does sufficient evidence exist to suggest that the neurosurgical practice pattern requires review? In other words, what is the probability that this team would experience something this or more severe, i.e., 6 or more deaths in 20 trials if they were operating at the national rate of .16? Use an evidentiary standard that investigates if the event is rare (routinely defined as less than .05 probability.)
You should get the same answer through all the methods.
Lets plot the graph at the beginning now.
# Number of trials
n <- 20
# Probability of success
p <- 0.16
# Generate values for x (number of successes)
x <- 0:n
# Calculate the probabilities for each value of x (can be calculated for individual values as well instead of the entire x vector)
dbinom(x = 0:20,
size = 20,
prob = .16)
## [1] 3.059044e-02 1.165350e-01 2.108729e-01 2.409976e-01 1.950933e-01
## [6] 1.189140e-01 5.662571e-02 2.157170e-02 6.676955e-03 1.695735e-03
## [11] 3.552968e-04 6.152325e-05 8.789035e-06 1.030217e-06 9.811587e-08
## [16] 7.475495e-09 4.449699e-10 1.994263e-11 6.330993e-13 1.269372e-14
## [21] 1.208926e-16
# store them
probabilities <- dbinom(x = x,
size = n,
prob = p
)
# Plot the binomial distribution
barplot(height = probabilities,
names.arg = x,
col = "green",
main = "Binomial Distribution",
xlab = "Number of Successes",
ylab = "Probability"
)
Typing equations in R Markdown.
\[\begin{align*} choose(20,6) \* \pi\^6\*(1- \pi)\^14 & \\\\ \\ choose(20,7) \* \pi\^7 \* (1- \pi)\^13 & \\\\ \+ ... + & \\\\ choose(20,20) \* \pi\^20 \* (1- \pi)\^0 \\\\ \end{align*}\]# BRUTE FORCE CALCULATION
## P(X >= 6) = P(X=6) + P(X=7) + ... + P(X=20)
pi <- .16
choose(20,6) * pi^6 * (1-pi)^14 +
choose(20,7) * pi^7 * (1-pi)^13 +
choose(20,8) * pi^8 * (1-pi)^12 +
choose(20,9) * pi^9 * (1-pi)^11 +
choose(20,10) * pi^10 * (1-pi)^10 +
choose(20,11) * pi^11 * (1-pi)^9 +
choose(20,12) * pi^12 * (1-pi)^8 +
choose(20,13) * pi^13 * (1-pi)^7 +
choose(20,14) * pi^14 * (1-pi)^6 +
choose(20,15) * pi^15 * (1-pi)^5 +
choose(20,16) * pi^16 * (1-pi)^4 +
choose(20,17) * pi^17 * (1-pi)^3 +
choose(20,18) * pi^18 * (1-pi)^2 +
choose(20,19) * pi^19 * (1-pi)^1 +
choose(20,20) * pi^20 * (1-pi)^0
## [1] 0.08699685
Easy to type in R though if you combine with some basic functions
like dbinom
and sum
. Else manually tying it
would take too much time.
# PMF addition
sum(dbinom(x = 6:20,
size = 20,
prob = .16)
)
## [1] 0.08699685
\[1 = P(X<6) + P(X \geq 6) \]
due to sum of the area under a curve (of a discrete variable) is equal
to 1/complement rule. This simplifies to \[
\\P(X\geq6) = 1- P(X \leq 5)\] and
\[P(X\geq6) = 1 - P(X \leq 5) \\ = 1 –
P(X=0) - P(X=1) – P(X=2) – P(X=3) - P(X=4) – P(X=5)
\]
## Manually - use the pdf formula n choose x * pi^x * (1-pi)^x, smart way
pi <- .16
1 - (choose(20,0) * pi^0 * (1-pi)^20 +
choose(20,1) * pi^1 * (1-pi)^19 +
choose(20,2) * pi^2 * (1-pi)^18 +
choose(20,3) * pi^3 * (1-pi)^17 +
choose(20,4) * pi^4 * (1-pi)^16 +
choose(20,5) * pi^5 * (1-pi)^15
) # can be added due to mutually exclusivity
## [1] 0.08699685
1 - sum(dbinom(x = 0:5,
size = 20,
prob = .16
)
)
## [1] 0.08699685
We need the area to the right of the curve.
Two ways, either declare the lower tail = FALSE
, or
apply the complement rule.
Compute the regular CDF and take its complement to find the required area.
?pbinom
# CDF - P(X>=6) = 1- P(X<=5)
1 - pbinom(q = 5,
size = 20,
prob = .16
)
## [1] 0.08699685
Or simply compute the same areas above by changing the default
options of lower.tail=TRUE
# CDF - P(X>=6) = 1- P(X<=5)
pbinom(q = 5,
size = 20,
prob = .16,
lower.tail = FALSE
)
## [1] 0.08699685
The normal human temperature is known from old studies to be 98.6 and the standard deviation is 1. What is the probability that a randomly selected person’s temperature will be less than 96? More than 99.5? Between 96 and 98?
Lets begin with plotting the normal distribution centerted at 98.6 degrees Celsius and a standard deviation of 1 degree Celsius.
# Set the mean and standard deviation
mu <- 98.6
sigma <- 1
# Generate a range of values around the mean
?seq
x <- seq(from = mu - 3*sigma,
to = mu + 3*sigma,
length.out = 1000
)
# Calculate the probability density function
pdf <- dnorm(x = x,
mean = mu,
sd = sigma
)
# Plot the normal distribution
plot(x = x,
y = pdf,
type = 'l',
col = 'red',
lwd = 2,
xlab = 'Temperature',
ylab = 'Density',
main = 'Normal Distribution with Mean 98.6 and SD 1'
)
dnorm
(density norm) - density of an observation on
the curve / pdf
pnorm
(probability norm) - probability of area under
the curve / cdf
Lets apply the empirical rule to check the area under 1, 2 and 3 sd of the mean.
?pnorm
pnorm(q = 98.6 + 1, mean = 98.6, sd = 1 ) - pnorm(q = 98.6 - 1, mean = 98.6, sd = 1 )
## [1] 0.6826895
pnorm(q = 98.6 + 2, mean = 98.6, sd = 1 ) - pnorm(q = 98.6 - 2, mean = 98.6, sd = 1 )
## [1] 0.9544997
pnorm(q = 98.6 + 3, mean = 98.6, sd = 1 ) - pnorm(q = 98.6 - 3, mean = 98.6, sd = 1 )
## [1] 0.9973002
Lets plot the area as well so that we can eyeball the probability size as well to double check our answer.
plot(x = x,
y = pdf,
type = 'l',
col = 'blue',
lwd = 2,
xlab = 'Temperature',
ylab = 'Density',
main = 'Normal Distribution with Mean 98.6 and SD 1'
)
# Shade the area under the curve for values below 96
x_shade <- seq(from = mu - 3*sigma,
to = 96,
length.out = 100
)
pdf_shade <- dnorm(x = x_shade,
mean = mu,
sd = sigma
)
?polygon
polygon(x = c(x_shade, rev(x_shade)),
y = c(pdf_shade, rep(x = 0,
times = length(pdf_shade)
)
),
col = 'red',
border = NA
)
# Draw a vertical line at 98.6
abline(v = 98.6, col = 'darkgreen', lty = 2)
abline(v = 97.6, col = 'green', lty = 2)
abline(v = 99.6, col = 'green', lty = 2)
abline(v = 96.6, col = 'cyan', lty = 2)
abline(v = 100.6, col = 'cyan', lty = 2)
abline(v = 95.6, col = 'yellow', lty = 2)
abline(v = 101.6, col = 'yellow', lty = 2)
Lets calculate the probability -
pnorm(q = 96, mean = 98.6, sd = 1)
## [1] 0.004661188
# explicitly state the lower tail argument
pnorm(q = 96, mean = 98.6, sd = 1, lower.tail = TRUE)
## [1] 0.004661188
1 - pnorm(q = 96, mean = 98.6, sd = 1, lower.tail = FALSE)
## [1] 0.004661188
Lets plot the area as well so that we can eyeball the probability size as well to double check our answer.
plot(x = x,
y = pdf,
type = 'l',
col = 'blue',
lwd = 2,
xlab = 'Temperature',
ylab = 'Density',
main = 'Normal Distribution with Mean 98.6 and SD 1'
)
# Shade the area under the curve for values below 96
x_shade <- seq(from = 99.5,
to = mu + 3*sigma,
length.out = 100
)
pdf_shade <- dnorm(x = x_shade,
mean = mu,
sd = sigma
)
polygon(x = c(x_shade, rev(x_shade)),
y = c(pdf_shade, rep(x = 0,
times = length(pdf_shade)
)
),
col = 'red',
border = NA
)
# Draw a vertical line at 98.6
abline(v = 98.6, col = 'darkgreen', lty = 2)
abline(v = 97.6, col = 'green', lty = 2)
abline(v = 99.6, col = 'green', lty = 2)
abline(v = 96.6, col = 'cyan', lty = 2)
abline(v = 100.6, col = 'cyan', lty = 2)
abline(v = 95.6, col = 'yellow', lty = 2)
abline(v = 101.6, col = 'yellow', lty = 2)
Lets calculate the probability -
pnorm(q = 99.5, mean = 98.6, sd = 1, lower.tail = FALSE)
## [1] 0.1840601
1 - pnorm(q = 99.5, mean = 98.6, sd = 1)
## [1] 0.1840601
# explicitly state the lower tail argument
1 - pnorm(q = 99.5, mean = 98.6, sd = 1, lower.tail = TRUE)
## [1] 0.1840601
Lets plot the area as well so that we can eyeball the probability size as well to double check our answer.
plot(x = x,
y = pdf,
type = 'l',
col = 'blue',
lwd = 2,
xlab = 'Temperature',
ylab = 'Density',
main = 'Normal Distribution with Mean 98.6 and SD 1'
)
# Shade the area under the curve for values below 96
x_shade <- seq(from = 96,
to = 98,
length.out = 100
)
pdf_shade <- dnorm(x = x_shade,
mean = mu,
sd = sigma
)
?rev
polygon(x = c(x_shade, rev(x_shade)),
y = c(pdf_shade, rep(x = 0,
times = length(pdf_shade)
)
),
col = 'red',
border = NA
)
# Draw a vertical line at 98.6
abline(v = 98.6, col = 'darkgreen', lty = 2)
abline(v = 97.6, col = 'green', lty = 2)
abline(v = 99.6, col = 'green', lty = 2)
abline(v = 96.6, col = 'cyan', lty = 2)
abline(v = 100.6, col = 'cyan', lty = 2)
abline(v = 95.6, col = 'yellow', lty = 2)
abline(v = 101.6, col = 'yellow', lty = 2)
Lets calculate the probability -
pnorm(q = 98, mean = 98.6, sd = 1) - pnorm(q = 96, mean = 98.6, sd = 1)
## [1] 0.2695919
# explicitly state the lower tail argument
pnorm(q = 98, mean = 98.6, sd = 1, lower.tail = T) - pnorm(q = 96, mean = 98.6, sd = 1, lower.tail = T)
## [1] 0.2695919
1 - pnorm(q = 98, mean = 98.6, sd = 1, lower.tail = F) - (1 - pnorm(q = 96, mean = 98.6, sd = 1, lower.tail = F))
## [1] 0.2695919
In the context of a normal distribution with a mean of 98.6 and a
standard deviation of 1, you can use the quantile function
qnorm
to calculate the temperature
corresponding to the median (50th percentile) or any other quantile.
?qnorm # quantile function
qnorm(p = .5,
mean = 98.6,
sd = 1
)
## [1] 98.6
This code is using the qnorm
function
to find the temperature corresponding to the 50th percentile (median) in
a normal distribution with a mean of 98.6 and a standard deviation of
1.
The result would give you the temperature at which 50% of the population has a lower temperature, and 50% has a higher temperature, assuming a normal distribution. This would be the median temperature in this specific distribution.
Feel free to modify the p
parameter in
qnorm
to calculate the temperature
corresponding to different percentiles (e.g.,
p = 0.95
for the 95th percentile).
qnorm(p = .95,
mean = 98.6,
sd = 1
)
## [1] 100.2449
This seems to be a very high temperature.
qnorm(p = .05,
mean = 98.6,
sd = 1
)
## [1] 96.95515
This seems to be a very low temperature.
The rate of medication errors across the United States is 2 per 1000 orders (volume). A sample from your local hospital finds 5 errors in a sample of 2000. What is the probability that this hospital is within the US standard? In other words, what is the probability of having 5 or more errors.
Let X be count of errors in 1000 orders.
We have 5 errors in sample of 2000.
\(Rate=2\), and \(t=\dfrac{2000}{1000}=2\).
Thus, \(E(X)=Var(X)=\lambda*t=2*2=4\).
t <- 2 # 2000 sample / 1000 orders
lambda_t <- 2 * t # lambda = 2 (2 errors in 1000 orders); but there are 2000 orders in sample so scale lambda by time
lambda_t
## [1] 4
# P (Y >=5 | lambda * t = 4)
################################################
### Method 1: brute force
sum(dpois(x = 5:2000,
lambda = lambda_t)
)
## [1] 0.3711631
################################################
### Method 2: Lower Tail TRUE
### P(Y >= 5 errors | lambda=2, t=2) = 1–P(Y<=4) = 1-ppois(y=4, lambda*t = 4)
1 - ppois( q = 4, # discrete variable adjustment - read lower tail argument specification P[X≤x]
lambda = lambda_t,
lower.tail = TRUE
)
## [1] 0.3711631
################################################
### Method 3: Lower Tail False / Upper Tail
### P(Y >= 5 errors | lambda=2, t=2)
ppois( q = 4, # discrete variable adjustment - read lower tail argument specification P[X>x]
lambda = lambda_t,
lower.tail = FALSE
)
## [1] 0.3711631
?ppois
?rpois # random sample to generate distribution
hist(rpois(n = 10000,
lambda = 4)
) # mean of 10000 samples with lambda 4 plotted as histogram
Poisson is rate limiting distribution of Binomial…
Binomial will approximate well the Poisson distribution if \(n\) is large and \(\pi\) is small…If you reduce the sample size or probability of success, the numbers will be off. You can try it by chnagig the numbers below.
N <- 2000
pi <- 2/1000 # rate of medication error is 2 in 1000 orders, the truth
lambda_approx <- N * pi # alternatively
lambda_approx
## [1] 4
lambda_t
## [1] 4
sum(dbinom(x = 5:2000,
size = 2000,
prob = pi) # .002
)
## [1] 0.371163
sum(dpois(x = 5:2000,
lambda = lambda_approx) # .002
)
## [1] 0.3711631
pbinom(q = 4 ,
size = 2000,
prob = pi,
lower.tail = F
)
## [1] 0.371163
ppois(q = 4 ,
lambda = lambda_approx,
lower.tail = F
)
## [1] 0.3711631
## A. Model as a Poisson (# .371)
### P(Y >= 5 errors | lambda = 2, t=2) = 1 – P(Y<=4) = 1-ppois(4,4)
sum(dpois(x = 5:2000, lambda = 4) ) # brute force
## [1] 0.3711631
1 - ppois(q = 4,lambda = 4,lower.tail=T) # adjust for discrete variable !!!
## [1] 0.3711631
ppois(q = 4,lambda = 4,lower.tail=F) # adjust for discrete variable !!!
## [1] 0.3711631
## B. Model as a Binomial (# .371)
### P(X>=5 errors | n = 2000, p= .002) = 1-P(Y<=4) = 1-pbinom(4,2000,.002)
sum(dbinom(5:2000,size = 2000,prob = .002)) # brute force
## [1] 0.371163
1 - pbinom(4,2000,.002) # adjust for discrete variable !!!
## [1] 0.371163
pbinom(4,2000,.002,lower.tail=F) # adjust for discrete variable !!!
## [1] 0.371163
Recall the Poisson distribution is an approximation of normal distribution when \(N\) is large and \(\pi\) is small.
# Parameters
lambda <- 4 # Mean of the Poisson distribution
# Generate x values (number of events)
x_values <- 0:50
# Calculate the PMF using dpois
pmf_values <- dpois(x = x_values,
lambda = lambda
)
# Plot the PMF
plot(x = x_values,
y = pmf_values,
type = "h",
lwd = 2,
col = "blue",
xlab = "Number of Events",
ylab = "Probability",
main = "Poisson Distribution PMF")
# Parameters
N <- 2000
pi <- 2/1000 # rate of medication error is 2 in 1000 orders, the truth
# Generate x values (number of events)
x_values <- 0:50
# Calculate the cumulative probabilities using ppois
pmf_values <- dbinom(x = x_values,
size = 2000,
prob = pi
)
# Plot the CDF as a "step" plot
plot(x = x_values,
y = pmf_values,
type = "h",
lwd = 2,
col = "red",
xlab = "Number of Events",
ylab = "Probability",
main = "Binomial Distribution PMF"
)
# 6 Choose 6 * 48 Choose 0 / 54 Choose 6
choose(6,6) * choose(48,0)/ choose(54,6)
## [1] 3.871892e-08
?dhyper # (white balls successes, black balls are failure )
dhyper(x = 6,
m = 6,
n = 48,
k = 6
)
## [1] 3.871892e-08
# choose 3 or more correct balls to get the reward - mutually exclusive events so rule of addition applies.
# Prob (X >=3)
# Prob (3, 4, 5 or 6 numbers)
# Brute force -
choose(6,3) * choose(48,3) / choose(54,6) +
choose(6,4) * choose(48,2) / choose(54,6) +
choose(6,5) * choose(48,1) / choose(54,6) +
choose(6,6) * choose(48,0) / choose(54,6) # [1] 0.01405996
## [1] 0.01405996
sum(dhyper(x = 3:6, m = 6, n = 48, k = 6 ))
## [1] 0.01405996
1-phyper(q = 2, m = 6, n = 48, k = 6) # discrete distribution
## [1] 0.01405996
From Discovering Business Statistics.
## Binomial dist
# Prob Dist: P(X = 14 events, n trials = 19, success prob = .6)
dbinom(x = 14, size = 19, prob = .6)
## [1] 0.09330877
# Prob Dist: P(X <= 3 events, n trials=7, success prob = .95)
pbinom(q = 3, size = 7,prob = .95)
## [1] 0.0001935781
1-pbinom(q = 3, size = 7,prob = .95, lower.tail = F) # because sum of area under distribution is one.
## [1] 0.0001935781
# Prob Dist: P(X >= 5 events, n trials=14, success prob = .4)
sum(dbinom(x = 5:14, size = 14,prob = .4))
## [1] 0.720743
1 - pbinom(q = 4, size = 14,prob = .4) # because binomial is a discrete distribution, thus adjust X by one.
## [1] 0.720743
## Poisson dist
# Prob Dist: P(X = 4 events, lambda = 4.3, t = 1 )
dpois(x = 4, lambda = 4.3)
## [1] 0.1932842
# Prob Dist: P(X = 3 events, lambda = 3, t = 2 )
dpois(x = 3, lambda = 6)
## [1] 0.08923508
# Prob Dist: P(X = 4 events, lambda = 4, t = 3 )
dpois(x = 3, lambda = 6)
## [1] 0.08923508
## HyperGeometric
# 10 black balls and 7 red balls, choose 4 balls from 17 without replacement, prob that exactly 1 black ball drawn?
#Prob Dist: P( )
choose(10,1) * choose(7,3) / choose(17,4)
## [1] 0.1470588
# P(X = 1 black ball | S=10 , F= 17-10=7, n=4, N=17)
dhyper(x = 1, m = 10, n = 7 ,k = 4)
## [1] 0.1470588
choose(4,3)
## [1] 4
factorial(3)
## [1] 6
choose(5,2) * choose(4,1) * factorial(3)
## [1] 240