Given a random experiment, a probability measure is a population quantity that summarizes the randomness. “Probability” is a conceptual property of the population, not something observed directly in any one set of data. Specifically, probability operates on the outcomes of an experiment and assigns each possible outcome a number between 0 and 1, with the probabilities of all possible outcomes adding up to 1.
Probability can be illustrated by rolling a die many times. Let $\hat{p}_n$ be the proportion of outcomes that are 1 after the first n rolls. As the number of rolls increases, $\hat{p}_n$ will converge to the probability of rolling a 1, p = 1/6. Figure 2.1 shows this convergence for 100,000 die rolls. The tendency of $\hat{p}_n$ to stabilize around p is described by the Law of Large Numbers.
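#simulate 100 male heights and plot the running average after each trial
n <- 1:100
x <- rnorm(100, mean = 68, sd = 3.5)
#running total of the simulated heights
s <- cumsum(x)
plot(s/n, xlab = "Number of Trials", ylab = "AVG Male Height", ylim = c(60, 70), type = "l", col = "blue", main = "Simulating the Law of Large Numbers", lwd = 6)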
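The die-rolling description above can also be sketched directly. This is a minimal illustration and not the code behind Figure 2.1; the seed and the names `n_rolls`, `rolls` and `p_hat` are assumptions made for the example.

```r
# Minimal sketch: running proportion of 1s in simulated rolls of a fair die
set.seed(1)                                       # arbitrary seed, for reproducibility only
n_rolls <- 100000
rolls <- sample(1:6, size = n_rolls, replace = TRUE)
p_hat <- cumsum(rolls == 1) / seq_len(n_rolls)    # proportion of 1s after each roll
plot(p_hat, type = "l", log = "x", xlab = "Number of Rolls", ylab = "Proportion of 1s", main = "Running Proportion of 1s")
abline(h = 1/6, col = "red", lty = 2)             # true probability p = 1/6
```

The logarithmic x-axis is a design choice: it makes the early, noisy part of the path visible alongside the later, stable part.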
Does this mean that 13% of people will have at least one sleep problem of these sorts?
No: the events can occur simultaneously and are therefore not mutually exclusive.
The following equation gives the true proportion of the population that has at least one of these sleep problems, where A and B denote the two events:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
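As a small numeric illustration of why the probabilities of overlapping events cannot simply be added, the sketch below subtracts the overlap once. All three input probabilities are hypothetical values chosen for the example; in particular, the overlap `p_both` is an assumption and does not come from the text.

```r
# Hypothetical probabilities, chosen only for illustration
p_a    <- 0.03   # P(A): first sleep problem (assumed)
p_b    <- 0.10   # P(B): second sleep problem (assumed)
p_both <- 0.01   # P(A and B): assumed overlap, not from the source
# P(A or B): add the two probabilities, then remove the double-counted overlap
p_either <- p_a + p_b - p_both
p_either
```

Whenever the overlap is positive, the result is strictly smaller than the naive sum `p_a + p_b`.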
Problem: we need ways to model and think about probabilities for numeric outcomes of experiments.
Random Variable: A numeric outcome of an experiment
hist(rnorm(n = 1000, mean = 24.2, sd = 2.2))
Difference between random variables and random variates
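A probability distribution is a list of the possible outcomes with corresponding probabilities that satisfies three rules:
1. The outcomes listed must be disjoint
2. Each probability must be between 0 and 1
3. The probabilities must total 1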
To be a valid PMF, a function p must satisfy:
1. It must always be larger than or equal to zero
2. Its values, summed over all the possible values that the random variable can take on, must add up to one
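The two PMF conditions can be checked numerically. The sketch below does so for a fair six-sided die; the names `outcomes` and `pmf` are assumptions made for the example.

```r
# Minimal sketch: check the two PMF conditions for a fair six-sided die
outcomes <- 1:6
pmf <- rep(1/6, length(outcomes))   # p(x) = 1/6 for each face
all(pmf >= 0)                       # condition 1: never negative
isTRUE(all.equal(sum(pmf), 1))      # condition 2: values sum to one (up to rounding)
```

`all.equal` is used instead of `==` because sums of floating-point numbers are rarely exact.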
To be a valid PDF, a function must satisfy:
1. It must be larger than or equal to zero everywhere
2. The total area under it must be one
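The two PDF conditions can be checked approximately in the same spirit. The sketch below uses the standard normal density `dnorm` and base R's numerical integrator `integrate()`. The grid bounds are arbitrary assumptions, and a grid check is not a proof of nonnegativity.

```r
# Minimal sketch: check the two PDF conditions for the standard normal density
grid <- seq(-10, 10, by = 0.01)                   # arbitrary finite grid
all(dnorm(grid) >= 0)                             # condition 1, checked on the grid only
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)
total_area$value                                  # condition 2: should be very close to 1
```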
The Cumulative Distribution Function of a random variable, X, returns the probability that the random variable is less than or equal to the value “x”. This definition applies whether the variable is discrete or continuous.
The Survival Function is the complement of the CDF and is given by: S(x) = 1 - CDF(x). In other words, the Survival Function of a random variable “X” is defined as the probability that the random variable is greater than the value “x”.
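The relationship between the CDF and the Survival Function can be seen with `pnorm`, whose `lower.tail` argument switches between the two. A minimal sketch using a standard normal variable; the evaluation point `x0` is an arbitrary assumption.

```r
# Minimal sketch: CDF and survival function of a standard normal at x0
x0 <- 1
cdf  <- pnorm(x0)                        # P(X <= x0)
surv <- pnorm(x0, lower.tail = FALSE)    # P(X > x0)
cdf + surv                               # the two pieces add up to 1
```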
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
#head(bdims)
male_dims <- subset(bdims, sex == 1)
female_dims <- subset(bdims, sex == 0)
F_hgt_mean <- mean(female_dims$hgt)
F_hgt_sd <- sd(female_dims$hgt)
hist(female_dims$hgt, probability = TRUE, ylim = c(0, 0.06), main = "Female Height Distribution", xlab = "Height (centimeters)", ylab = "Density")
x <- 140:190
y <- dnorm(x=x, mean=F_hgt_mean, sd=F_hgt_sd)
lines(x = x, y = y, col = "blue")
Are the data normally distributed? Let's check by constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”. A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line.
More on Q-Q Plots: http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html
qqnorm(female_dims$hgt)
qqline(female_dims$hgt)
The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. So, the question becomes: how close is close enough?
A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using “qqnormsim”:
qqnormsim(female_dims$hgt)
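The idea behind this comparison can be sketched by hand with base R. The code below is a stand-in for illustration and is not the implementation of “qqnormsim”. The seed, sample size, mean and standard deviation are all hypothetical values.

```r
# Minimal sketch: Q-Q plots for samples known to come from a normal distribution
set.seed(2)                                     # arbitrary seed
n_obs <- 260                                    # hypothetical sample size
op <- par(mfrow = c(2, 2))                      # 2 x 2 grid of plots
for (k in 1:4) {
  sim <- rnorm(n_obs, mean = 165, sd = 6.5)     # hypothetical mean and sd
  qqnorm(sim, main = paste("Simulated sample", k))
  qqline(sim)
}
par(op)                                         # restore the previous plot layout
```

Even truly normal samples wander from the line in the tails, which gives a baseline for judging the plot of the real data.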
mean <- F_hgt_mean
sd <- F_hgt_sd
upper_bound <- 168
lower_bound <- 160
x <- seq(-4, 4, length = 100)*sd + mean
hx <- dnorm(x, mean, sd)
plot(x, hx, type = "n", xlab = "Female Heights", ylab = "Density", main = "Normal Distribution of Heights", axes = FALSE)
i <- x >= lower_bound & x <= upper_bound
lines(x, hx)
#polygon(x, y, color) x, y = vectors containing the coordinates of the vertices of the polygon
polygon(c(lower_bound, x[i], upper_bound), c(0, hx[i], 0), col = "blue")
#pnorm = CDF
area <- pnorm(upper_bound, mean, sd) - pnorm(lower_bound, mean, sd)
#text
result <- paste("Probability(", lower_bound, "<= Height <=", upper_bound, ") =", signif(area, digits = 3))
#mtext = text under title
mtext(result, 3)
range(female_dims$hgt)
## [1] 147.2 182.9
#axis(side (1=below, 2=left, 3=above, 4=right), at= the point at which tick-marks are to be drawn, pos = coordinate at which the axis line is to be drawn)
axis(1, at=seq(140, 190, 10), pos=0)
NOTE: This type of visualization is good for answering ad-hoc questions regarding the probability of a certain event occurring. Users could change the parameters interactively if the visual were reproduced using Shiny.
Density or probability function: f(x). For a continuous variable, P(a <= X <= b) is the area under f(x) between a and b.
“There's not much need for this function in doing calculations, because you need to do integrals to use any p. d. f., and R doesn't do integrals. In fact, there's not much use for the ‘d’ function for any continuous distribution (discrete distributions are entirely another matter, for them the ‘d’ functions are very useful, see the section about dbinom).”
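For reference, base R does provide numerical (not symbolic) integration through `integrate()`, so the link between a density and a probability can be checked directly. A minimal sketch follows; the parameters `mu` and `sigma` and the bounds are hypothetical values, not the ones estimated from the data.

```r
# Minimal sketch: area under a normal density between two bounds, computed two ways
mu <- 165                                   # hypothetical mean
sigma <- 6.5                                # hypothetical standard deviation
lower <- 160
upper <- 168
# numerical integration of the density; extra arguments are passed on to dnorm
by_integration <- integrate(dnorm, lower, upper, mean = mu, sd = sigma)$value
# difference of the CDF at the two bounds
by_cdf <- pnorm(upper, mu, sigma) - pnorm(lower, mu, sigma)
c(by_integration, by_cdf)                   # the two values should agree closely
```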
Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
Solution: Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5=0.2. We can find the probability of having exactly 4 correct answers by random attempts as follows.
#dbinom(value, size=number of trials (zero or more), prob = probability of success on each trial)
dbinom(4, size=12, prob=0.2)
## [1] 0.1328756
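The problem asks for four or fewer correct answers, so the probability of exactly 4 is only one of five terms. A minimal sketch of the remaining step, computed two ways; the names `p_sum` and `p_cdf` are assumptions made for the example.

```r
# Minimal sketch: P(X <= 4) for X ~ Binomial(12, 0.2), computed two ways
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))   # add up P(X = 0), ..., P(X = 4)
p_cdf <- pbinom(4, size = 12, prob = 0.2)          # binomial CDF evaluated at 4
c(p_sum, p_cdf)
```

Both values come out at about 0.927, so a student guessing at random will almost always get four or fewer questions right.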
Cumulative distribution function:
#P(Height <= 160): the probability that a given female is at most 160 centimeters tall
pnorm(160, mean, sd)
## [1] 0.2282939
#SURVIVAL FUNCTION: the probability that a given female is taller than 160 centimeters
1 - pnorm(160, mean, sd)
## [1] 0.7717061
Quantile function: It's the inverse of the CDF. So given a number p between zero and one, qnorm looks up the p-th quantile of the normal distribution.
Problem: Suppose IQ scores are normally distributed with mean 100 and standard deviation 15. What is the 95th percentile of the distribution of IQ scores?
help(qnorm)
qnorm(0.95, mean=100, sd=15)
## [1] 124.6728
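Because the quantile function is the inverse of the CDF, feeding the result back into `pnorm` should recover the original probability. A minimal sketch of that round trip; the name `q95` is an assumption made for the example.

```r
# Minimal sketch: qnorm and pnorm undo each other
q95 <- qnorm(0.95, mean = 100, sd = 15)   # 95th percentile of the IQ distribution
pnorm(q95, mean = 100, sd = 15)           # plugging it back into the CDF recovers 0.95
```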