Statistical Inference for Data Science: Probability

Johns Hopkins Coursera Summary

Probability

Given a random experiment, a probability measure is a population quantity that summarizes the randomness. “Probability” is a conceptual, population-level idea: it describes the long-run behavior of the experiment rather than any one observed sample. Specifically, probability operates on the outcomes of an experiment and assigns each possible outcome a number between 0 and 1.

A Demonstration: The Law of Large Numbers

Probability can be illustrated by rolling a die many times. Let p̂_n be the proportion of outcomes that are 1 after the first n rolls. As the number of rolls increases, p̂_n will converge to the probability of rolling a 1, p = 1/6. Figure 2.1 shows this convergence for 100,000 die rolls. The tendency of p̂_n to stabilize around p is described by the Law of Large Numbers. The code below illustrates the same idea with a different experiment: the running average of 100 simulated male heights settles near the true mean of 68.

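# Simulate 100 male heights from a normal distribution (mean 68, sd 3.5)
n <- 1:100
x <- rnorm(100, mean = 68, sd = 3.5)
# Running mean after each trial: cumulative sum divided by the trial count
s <- cumsum(x)
plot(s/n, xlab = "Number of Trials", ylab = "AVG Male Height", ylim = c(60, 70), type = "l", col = "blue", main = "Simulating the Law of Large Numbers", lwd = 6)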
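
The die-rolling experiment described in the text can be simulated the same way. The following is a minimal sketch, not the code behind Figure 2.1: the seed, the variable names, and the log-scaled x-axis are choices made here for illustration.

```r
# Simulate die rolls and track the running proportion of 1s (p-hat_n)
set.seed(2024)                      # arbitrary seed, assumed here for reproducibility
n_rolls <- 100000
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Running proportion: cumulative count of 1s divided by the number of rolls so far
p_hat <- cumsum(rolls == 1) / seq_len(n_rolls)

# Log-scaled x-axis so the early, noisy rolls stay visible
plot(seq_len(n_rolls), p_hat, type = "l", log = "x", col = "blue",
     xlab = "Number of Rolls", ylab = "Proportion of 1s",
     main = "Running Proportion of 1s in Die Rolls")
abline(h = 1/6, lty = 2)            # true probability p = 1/6
```

Early values of p_hat swing widely, while later values should stay close to the dashed line at 1/6.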

Rules of Probability

The probability that nothing occurs is 0

The probability that something occurs is 1

The probability of something is 1 minus the probability that the opposite occurs: complementary events

The probability of at least one of two (or more) things that cannot simultaneously occur (DISJOINT or mutually exclusive outcomes) is the sum of their respective probabilities

For any two events, the probability that at least one occurs is the sum of their probabilities minus the intersection. Subtracting the intersection allows us to avoid adding the intersection in twice (venn diagram example).
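
The rules above can be checked by direct enumeration on a small example. The sketch below uses one roll of a fair die; the events A and B and the helper p() are illustrative choices, not part of the course material.

```r
# Sample space for one roll of a fair die; every outcome is equally likely
outcomes <- 1:6
A <- outcomes %% 2 == 0    # event A: the roll is even (2, 4, 6)
B <- outcomes >= 4         # event B: the roll is at least 4 (4, 5, 6)

# With equally likely outcomes, P(event) is the share of outcomes in the event
p <- function(event) mean(event)

p_union   <- p(A | B)                    # at least one of A, B occurs
p_formula <- p(A) + p(B) - p(A & B)      # sum minus the intersection
p_not_A   <- 1 - p(A)                    # complement rule

print(c(union = p_union, formula = p_formula, not_A = p_not_A))
```

Adding p(A) and p(B) without subtracting p(A & B) would count the outcomes 4 and 6 twice.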

Example: Incidence of Sleep Apnea and RLS in the General Population

Incidence of Sleep Apnea: 3%
Incidence of RLS: 10%

Does this mean that 13% of people will have at least one sleep problem of these sorts?
NO: the events can occur simultaneously and are therefore not mutually exclusive

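The true proportion of the population that has at least one of these sleep problems follows from the rule for any two events:

P(Apnea or RLS) = P(Apnea) + P(RLS) - P(Apnea and RLS) = 0.03 + 0.10 - P(Apnea and RLS) = 0.13 - P(both)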
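
The overlap between the two conditions is not given in the text, so the answer can only be bracketed. The sketch below evaluates the sum-minus-intersection rule under three assumed overlaps (none, independence, and complete overlap); these scenarios are illustrative assumptions, not data.

```r
p_apnea <- 0.03   # incidence of sleep apnea (from the text)
p_rls   <- 0.10   # incidence of RLS (from the text)

# P(at least one) = P(apnea) + P(RLS) - P(both)
at_least_one <- function(p_both) p_apnea + p_rls - p_both

# Hypothetical overlaps; the true P(both) is not given in the text
scenarios <- c(
  disjoint     = 0,                    # no one has both: upper bound
  independent  = p_apnea * p_rls,      # assumes the conditions are independent
  fully_nested = min(p_apnea, p_rls)   # everyone with apnea also has RLS: lower bound
)
print(sapply(scenarios, at_least_one))
```

Whatever the overlap, the proportion with at least one of the two problems lies between 10% and 13%.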

Random Variables

Problem: we need ways to model and think about probabilities for numeric outcomes of experiments.

Random Variable: A numeric outcome of an experiment

Discrete Random Variables: Take on a countable number of possibilities. A Probability Mass Function (PMF) gives the probability that a discrete random variable is exactly equal to some value.

Continuous Random Variables: Can take on any value in an interval of the real line, which is an uncountably infinite set of values. As such, the probability that a continuous random variable takes on any one specific value is zero. A Probability Density Function (PDF) instead describes relative likelihood: the probability that the variable falls within a particular range is the area under the PDF over that range.
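
The distinction shows up directly in R's distribution functions. A short sketch, using a binomial and a standard normal distribution chosen only for illustration:

```r
# Discrete: the PMF returns an actual probability.
# P(exactly 3 heads in 10 fair coin flips) = choose(10, 3) / 2^10
p_three_heads <- dbinom(3, size = 10, prob = 0.5)

# Continuous: the PDF returns a density, not a probability
density_at_zero <- dnorm(0)               # height of the standard normal curve at 0

# Probabilities for a continuous variable come from areas, i.e. CDF differences
p_within_one_sd <- pnorm(1) - pnorm(-1)   # P(-1 <= Z <= 1)
p_single_point  <- pnorm(0) - pnorm(0)    # P(Z = 0): an interval of width zero

print(c(p_three_heads, density_at_zero, p_within_one_sd, p_single_point))
```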

Simulating Random Variates with rnorm()

hist(rnorm(n = 1000, mean = 24.2, sd = 2.2))

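Difference between random variables and random variates: a random variable is the unrealized numeric outcome described by a distribution, while a random variate is a particular realized value drawn from it, such as each number returned by rnorm().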

Rules for Probability Distributions:

A probability distribution is a list of the possible outcomes with corresponding probabilities that satisfies three rules:

The outcomes listed must be disjoint.
Each probability must be between 0 and 1.
The probabilities must total 1.

Probability Mass Functions (PMFs)

To be a valid PMF, a function p must satisfy:
1. It must always be larger than or equal to zero
2. Its values, summed over all the possible values that the random variable can take on, must add up to one
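
Both conditions are easy to check numerically. A minimal sketch, where is_valid_pmf() is a hypothetical helper written for this example and the fair-die PMF is an illustrative choice:

```r
# Hypothetical helper: checks the two PMF conditions on a vector of probabilities
is_valid_pmf <- function(probs, tol = 1e-8) {
  all(probs >= 0) && abs(sum(probs) - 1) < tol
}

fair_die  <- rep(1/6, 6)           # P(X = x) for x = 1, ..., 6
not_a_pmf <- c(0.5, 0.4, 0.3)      # sums to 1.2, so it fails condition 2

print(is_valid_pmf(fair_die))
print(is_valid_pmf(not_a_pmf))
```

The tolerance argument allows for floating-point rounding when the probabilities are summed.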

Probability Density Functions (PDFs)

To be a valid PDF, a function must satisfy:
1. It must be larger than or equal to zero everywhere
2. The total area under it must be one
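
For a density, the sum becomes an integral. The sketch below uses base R's integrate() to check the standard normal density; the grid used for the non-negativity check is an arbitrary choice, so this is a spot check rather than a proof.

```r
# Condition 1: spot-check non-negativity on a grid of points
grid <- seq(-10, 10, by = 0.01)
nonnegative <- all(dnorm(grid) >= 0)

# Condition 2: total area under the curve, by numerical integration
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(nonnegative)
print(total_area)
```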

Certain areas of PDFs and PMFs are so useful that they are given names:

Cumulative Distribution Functions (CDFs)

The Cumulative Distribution Function of a random variable, X, returns the probability that the random variable is less than or equal to the value “x”. This definition applies whether the variable is discrete or continuous.

“X” = a random, unrealized version of the random variable
“x” = a specific number that we plug into the function

Survival Functions

This is the complement of the CDF and is given by: S(x) = 1 - CDF(x). In other words, the Survival Function of a random variable “X” is defined as the probability that the random variable is greater than the value “x”.
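
In R the p functions return the CDF, and their lower.tail argument switches to the survival function. A brief sketch with a standard normal variable and an arbitrary cutoff:

```r
x <- 1.5                                        # arbitrary cutoff chosen for illustration

cdf_value      <- pnorm(x)                      # CDF(x) = P(X <= x)
survival_value <- pnorm(x, lower.tail = FALSE)  # S(x)   = P(X > x)

# The two probabilities are complements, so they add up to 1
print(c(cdf = cdf_value, survival = survival_value,
        total = cdf_value + survival_value))
```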

Applying the Functions

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
#head(bdims)


male_dims <- subset(bdims, sex == 1)
female_dims <- subset(bdims, sex == 0)
F_hgt_mean <- mean(female_dims$hgt)
F_hgt_sd <- sd(female_dims$hgt)


hist(female_dims$hgt, probability = TRUE, ylim = c(0, 0.06), main = "Female Height Distribution", xlab = "Height (centimeters)", ylab = "Density")
x <- 140:190
y <- dnorm(x=x, mean=F_hgt_mean, sd=F_hgt_sd)
lines(x = x, y = y, col = "blue")

Checking for Normality

Is the data normally distributed? Let's check by constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”. A data set that is nearly normal will result in a probability plot where the points closely follow the line. Any deviations from normality lead to deviations of these points from the line.

More on Q-Q Plots: http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html

qqnorm(female_dims$hgt)
qqline(female_dims$hgt)

The plot for female heights shows points that tend to follow the line but with some errant points towards the tails. So, the question becomes: how close is close enough?

A useful way to address this question is to rephrase it as: what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using “qqnormsim”:

qqnormsim(female_dims$hgt)
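
qqnormsim() is not part of base R; it is assumed here to be a helper supplied with the OpenIntro lab materials. If it is unavailable, a rough equivalent can be sketched with base functions. Here sim_qq() is a hypothetical name, and the panel layout, the number of simulations, and the made-up example data are arbitrary choices.

```r
# Hypothetical stand-in for qqnormsim(): Q-Q plot of the data next to
# Q-Q plots of normal samples with the same size, mean, and sd
sim_qq <- function(x, n_sims = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))                     # restore plotting settings afterwards

  qqnorm(x, main = "Observed data")
  qqline(x)
  for (i in seq_len(n_sims)) {
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("Simulation", i))
    qqline(sim)
  }
  invisible(n_sims + 1)                     # number of panels drawn
}

# Example call on made-up skewed data; with bdims loaded, pass female_dims$hgt instead
sim_qq(rexp(200))
```

If the observed panel looks no worse than the simulated ones, the departures from the line are within what normal data of that size typically show.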

Now that we know that female heights are roughly normally distributed, we can ask some questions about the dataset:

1. What’s the probability that a given female will stand between 160 and 168 centimeters tall?

# Note: these names shadow the base functions mean() and sd(); calls such as mean(x) still work
mean <- F_hgt_mean
sd <- F_hgt_sd
upper_bound <- 168
lower_bound <- 160

x <- seq(-4, 4, length = 100)*sd + mean
hx <- dnorm(x, mean, sd)
plot(x, hx, type = "n", xlab = "Female Heights", ylab = "Density", main = "Normal Distribution of Heights", axes = FALSE)

i <- x >= lower_bound & x <= upper_bound
lines(x, hx)
# polygon(x, y, col): x and y are vectors containing the coordinates of the polygon's vertices
polygon(c(lower_bound, x[i], upper_bound), c(0, hx[i], 0), col = "blue")

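# pnorm is the normal CDF, so the shaded area is CDF(upper_bound) - CDF(lower_bound)
area <- pnorm(upper_bound, mean, sd) - pnorm(lower_bound, mean, sd)
# Build the label text
result <- paste("Probability(", lower_bound, "<= Height <=", upper_bound, ") =", signif(area, digits = 3))
# mtext(result, 3) writes the label in the top margin (side 3), under the title
mtext(result, 3)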
range(female_dims$hgt)

## [1] 147.2 182.9

#axis(side (1=below, 2=left, 3=above, 4=right), at= the point at which tick-marks are to be drawn, pos = coordinate at which the axis line is to be drawn)
axis(1, at=seq(140, 190, 10), pos=0)

NOTE: This type of visualization is good for answering ad-hoc questions regarding the probability of a certain event occurring. If the plot were rebuilt as a Shiny app, users could change the bounds and parameters interactively.

Examples of Probability Functions

dname()

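Density or probability mass function: returns f(x) for a continuous distribution, or P(X = x) for a discrete one. A probability such as P(a <= X <= b) is the area under f between a and b, not the value of f itself.

For continuous distributions, the d functions are seldom needed for probability calculations; their main use is drawing density curves, as in the height plots above. One source puts it this way:

“There's not much need for this function in doing calculations, because you need to do integrals to use any p. d. f., and R doesn't do integrals. In fact, there's not much use for the “d” function for any continuous distribution (discrete distributions are entirely another matter, for them the “d” functions are very useful, see the section about dbinom).”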
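
Base R does include integrate() for numerical integration, so a density can be turned into a probability directly. The sketch below compares that route with the p function; mu and sigma are illustrative values assumed for this example, not the estimates computed from bdims.

```r
mu    <- 165    # assumed mean height in cm (illustrative only)
sigma <- 6.5    # assumed standard deviation in cm (illustrative only)

# Route 1: integrate the density between the bounds
by_integration <- integrate(dnorm, lower = 160, upper = 168,
                            mean = mu, sd = sigma)$value

# Route 2: difference of the CDF at the bounds
by_cdf <- pnorm(168, mean = mu, sd = sigma) - pnorm(160, mean = mu, sd = sigma)

print(c(integration = by_integration, cdf = by_cdf))
```

The two routes should agree to within the numerical error of integrate(); in practice the p functions are the more direct choice.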

PMFs with Binomial Distributions

Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Solution: Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2. The probability of four or fewer correct answers is the sum of the probabilities of exactly 0, 1, 2, 3, and 4 correct answers. The term for exactly 4 correct answers is computed as follows.

#dbinom(value, size=number of trials (zero or more), prob = probability of success on each trial)
dbinom(4, size=12, prob=0.2)

## [1] 0.1328756
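
To answer the question as posed (four or fewer correct), the exact probabilities for 0 through 4 can be summed, or the binomial CDF can be used. A sketch of both routes:

```r
# Route 1: add up P(X = k) for k = 0, 1, 2, 3, 4
p_by_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

# Route 2: the binomial CDF gives P(X <= 4) in one call
p_by_cdf <- pbinom(4, size = 12, prob = 0.2)

print(c(sum = p_by_sum, cdf = p_by_cdf))   # both are about 0.927
```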

pname()

Cumulative distribution function:

F(x) = P(X <= x)

What's the probability that a given female will be less than or equal to 160 centimeters in height?

#this means 160 inclusive
pnorm(160, mean, sd)

## [1] 0.2282939

# Survival function: the probability that a given female is taller than 160 centimeters
1 - pnorm(160, mean, sd)

## [1] 0.7717061

qname()

Quantile function: It's the inverse of the CDF. So given a number p between zero and one, qnorm looks up the p-th quantile of the normal distribution.

Problem: Suppose IQ scores are normally distributed with mean 100 and standard deviation 15. What is the 95th percentile of the distribution of IQ scores?

help(qnorm)
qnorm(0.95, mean=100, sd=15)

## [1] 124.6728
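
Because the quantile function inverts the CDF, feeding its result back into pnorm() should return the original probability. A short check using the IQ example's parameters (the variable names are chosen here for illustration):

```r
iq_mean <- 100
iq_sd   <- 15

q95 <- qnorm(0.95, mean = iq_mean, sd = iq_sd)   # 95th percentile of IQ scores

# Round trip: the CDF evaluated at the 95th percentile gives back 0.95
p_back <- pnorm(q95, mean = iq_mean, sd = iq_sd)

# Same quantile via the standard normal: mean + z * sd
q95_from_z <- iq_mean + qnorm(0.95) * iq_sd

print(c(q95 = q95, p_back = p_back, q95_from_z = q95_from_z))
```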