Aimee Schwab
June 17, 2013
In everyday life, we have to make decisions – even when we're uncertain of the outcome. They could be simple decisions like whether to bring an umbrella or buy a Powerball ticket.
Each of these decisions has some uncertainty – we don't know for sure if it will rain, but we're prepared for that chance. In statistics, we use probability to quantify that uncertainty.
In this section, we'll cover the need-to-know parts of probability theory without going very in-depth. Several decades ago, we'd need advanced probability rules to solve even the simplest statistical problems. Since we can use a computer now, we'll focus only on the properties and interpretations of probabilities.
What makes a probability valid?
There are four basic rules of probability. For an event, A:
| Rule | Interpretation |
|---|---|
| 1. P(A) \( \ge \) 0 | The probability that A occurs must be at least 0. |
| 2. P(A) \( \le \) 1 | The probability that A occurs can be at most 1. |
| 3. P(A) + P(not A) = 1 | A is guaranteed to either happen or not. |
| 4. P(A) + P(B) + … = 1 | The probabilities of all possible outcomes (when no two can happen at once) must add up to 1. |
For some possible events, probabilities can be calculated using formulas from probability theory. These formulas tend to make assumptions that aren't realistic.
For more interesting events, we can estimate probabilities using sample probabilities.
Sample probability: number of observations in some event A, divided by the total number of observations
Example: Based on records of automobile accidents, the Department of Highway Safety and Motor Vehicles in Florida reported the counts of those who survived (S) and died (D), according to whether they wore a seatbelt (Y = yes, N = no). The data are presented in the table below. Use the data to estimate the following probabilities.
| Wore Seatbelt | Survived (S) | Died (D) | Total |
|---|---|---|---|
| Yes (Y) | 412,368 | 510 | 412,878 |
| No (N) | 162,527 | 1,601 | 164,128 |
| Total | 574,895 | 2,111 | 577,006 |
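As a rough sketch of how sample probabilities come out of this table, we can type the counts into R and divide by the total. The two events picked here, P(D) and P(Y and S), are only illustrative choices, and the object names (`counts`, `p_died`, and so on) are made up for this example.

```r
# Counts copied from the seatbelt table above
counts <- matrix(
  c(412368,  510,
    162527, 1601),
  nrow = 2, byrow = TRUE,
  dimnames = list(seatbelt = c("Y", "N"), outcome = c("S", "D"))
)

n <- sum(counts)                            # total number of observations

# Sample probability = observations in the event / total observations
p_died <- sum(counts[, "D"]) / n            # P(D): died, with or without a seatbelt
p_yes_and_survived <- counts["Y", "S"] / n  # P(Y and S)

# All four cell proportions at once; together they add up to 1 (rule 4)
joint <- counts / n
sum(joint)
```

Dividing the whole matrix by `n` gives the same kind of table of proportions that `tally` prints with `format='proportion'`.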
Example: We've already seen how to estimate probabilities with a data set – we can use the tally function. For the Whickham data set, we might want to find:
library(mosaic)
data(Whickham)
tally(~smoker+outcome, data=Whickham, format='proportion')
outcome
smoker Alive Dead Total
No 0.3820 0.1750 0.5571
Yes 0.3371 0.1058 0.4429
Total 0.7192 0.2808 1.0000
P(smoker and died) = 0.1058
P(nonsmoker and died) = 0.1750
Suppose I'm only interested in the people in Florida who were wearing a seatbelt – what's the probability that someone in that group survived? This is called a conditional probability.
The probability of A, given that we know B has already occurred, is:
Conditional probability, P(A|B):
\[ P(A \mid B)=\frac{P(A \text{ and } B)}{P(B)} \]
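To see the formula in action, here is a small sketch that applies it to the seatbelt question, taking A = survived (S) and B = wore a seatbelt (Y). The counts come from the seatbelt table; the variable names are just for illustration.

```r
n <- 577006                 # total observations in the seatbelt table

p_Y_and_S <- 412368 / n     # P(Y and S): wore a seatbelt and survived
p_Y       <- 412878 / n     # P(Y): wore a seatbelt

# P(S | Y) = P(S and Y) / P(Y)
p_S_given_Y <- p_Y_and_S / p_Y
p_S_given_Y

# The n's cancel, so this matches dividing within the "Yes" row directly
all.equal(p_S_given_Y, 412368 / 412878)
```

Conditioning on Y amounts to shrinking the group we look at from all 577,006 people down to the 412,878 who wore a seatbelt.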
Where have we seen this vertical bar before?
The vertical bar in R has the same meaning as in a conditional probability. In R it splits our data into groups. With a conditional probability, we're splitting the sample group that we want into a smaller portion - we only want to look at category B.
Example: Using the seatbelt data, find the conditional probabilities of:
Can we do this in R? Yes! We'll modify the tally function from how we've already used it.
Example: For the Whickham data set, which recorded smoking status and whether or not the subject was alive at the end of 20 years, use the tally function to find the conditional probability of surviving for smokers and nonsmokers.
tally(~outcome|smoker, data=Whickham)
smoker
outcome No Yes
Alive 0.6858 0.7612
Dead 0.3142 0.2388
Total 1.0000 1.0000
In our notation, P(Alive | nonsmoker) = 0.6858 and P(Alive | smoker) = 0.7612.
The variable that goes to the right of “|” is what we want to condition on!
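The two Whickham tables are linked by the conditional probability formula: dividing a joint proportion by the matching smoker total should give back the conditional proportion. A quick check, using the rounded values printed in the earlier `tally` output (so we can only expect agreement to about three decimal places):

```r
# Joint proportions and smoker totals from the first tally output
p_nonsmoker_and_alive <- 0.3820
p_smoker_and_alive    <- 0.3371
p_nonsmoker           <- 0.5571
p_smoker              <- 0.4429

# P(Alive | group) = P(group and Alive) / P(group)
p_alive_given_nonsmoker <- p_nonsmoker_and_alive / p_nonsmoker
p_alive_given_smoker    <- p_smoker_and_alive / p_smoker

# Compare with 0.6858 and 0.7612 from the conditional tally
round(c(nonsmoker = p_alive_given_nonsmoker, smoker = p_alive_given_smoker), 3)
```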
Variables are sometimes called random variables – this reminds us that their values are uncertain! Random variables have probability distributions.
A probability distribution assigns a probability to each measurable subset of the possible values of a random variable. Some possible values are more likely than others, so they have a higher probability.
We'll consider two types of probability distributions: discrete and continuous.
For a discrete probability distribution, we should be able to list all possible outcomes and their probabilities. Each probability rule must also be satisfied - all probabilities should be between 0 and 1, and all must add up to 1.
Example: The table below shows the probability distribution of the number of home runs in a single game for the San Francisco Giants.
| Number of Home Runs | Probability |
|---|---|
| 0 | 0.3889 |
| 1 | 0.3148 |
| 2 | 0.2222 |
| 3 | 0.0556 |
| 4 | ? |
| 5 or more | 0.000 |
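One way to approach the missing entry is with rule 4: whatever probability is left over after adding up the known values must belong to the missing outcome. The sketch below assumes the table lists every possible outcome; the object names are hypothetical.

```r
# Known probabilities from the table: 0, 1, 2, 3, and "5 or more" home runs
known <- c(0.3889, 0.3148, 0.2222, 0.0556, 0.000)

# Rule 4: all probabilities add up to 1, so the missing one is the remainder
p_four <- 1 - sum(known)
p_four

# Check that the completed distribution is valid
probs <- c(known, p_four)
all(probs >= 0 & probs <= 1)        # rules 1 and 2
isTRUE(all.equal(sum(probs), 1))    # rule 4
```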
In most cases, we don't know the probability distribution for a discrete random variable until after we collect data. There are some notable exceptions.
For a discrete probability distribution, the population mean is easy to find. To find the mean, \( \mu \),
\[ \mu=\sum_i x_i \, P(X=x_i) \]
This is also called the expected value.
Example: Let X represent the response of a randomly selected person to the question, “What is the ideal number of children for a family to have?” The probability distribution based on data collected from the General Social Survey is shown below, sorted by gender.
| x | P(X=x): females | P(X=x): males |
|---|---|---|
| 0 | 0.01 | 0.02 |
| 1 | 0.03 | 0.03 |
| 2 | 0.55 | 0.60 |
| 3 | 0.31 | 0.28 |
| 4 | 0.11 | 0.08 |
Find the expected value of the probability distribution for each gender. How do they compare? (You can use R as a calculator!)
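Using R as a calculator here might look like the sketch below: multiply each value by its probability and add up the products. The vector names are made up for this example. Note that the tabled probabilities are rounded, so each column adds up to 1.01 rather than exactly 1.

```r
x        <- 0:4                                  # ideal number of children
p_female <- c(0.01, 0.03, 0.55, 0.31, 0.11)      # P(X = x), females
p_male   <- c(0.02, 0.03, 0.60, 0.28, 0.08)      # P(X = x), males

# Rounded table values: each column sums to 1.01, not exactly 1
sum(p_female)
sum(p_male)

# Expected value: mu = sum of x * P(X = x)
mu_female <- sum(x * p_female)
mu_male   <- sum(x * p_male)
c(female = mu_female, male = mu_male)
```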
The standard deviation for the females is 0.770 and the standard deviation for the males is 0.758. Compare these standard deviations – what are the practical implications?
Continuous random variables have an infinite number of possible values. All possible values form an interval.
Since we don't have a finite list of possibilities, we can't just add together all of the outcomes. Instead of a histogram, continuous probability densities are represented by a smooth curve.
Important Properties:
For most continuous probability distributions, there is no easy way to find the probabilities (unless you've taken calculus, but it's still not the easiest task).
Because of this, there are some continuous distributions that we rely on much more than others. The big one is the normal distribution.
The normal distribution has several nice properties.
The normal distribution is a natural fit for many applications, and happens to have a very nice statistical property!
It's also incredibly flexible, since changing the mean and standard deviation can change the location and shape of the curve.
Example: A study was conducted by the University of Georgia to measure male and female heights. The normal distribution curves are plotted below.
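Empirical Rule: For any normal distribution (bell-shaped curve), the probability that an observation falls within a given number of standard deviations of the mean \( \mu \) is the same, no matter what the mean and standard deviation are.
Some important distances: about 68% of observations fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.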
Example: The Empirical Rule can be used to find reasonable ranges for a data set. For the University of Georgia heights study, height for men (\( \mu \)=70, s=4) and women (\( \mu \)=65, s=3.5) followed normal distributions.
68% of women fall between 61.5 and 68.5 inches. 95% of men fall between 62 and 78 inches. 99.7% of women fall between 54.5 and 75.5 inches.
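These ranges are just the mean plus or minus 1, 2, or 3 standard deviations, so they are easy to reproduce. The sketch below does this for the women's heights and uses base R's `pnorm` to show where the 68%, 95%, and 99.7% figures come from; the variable names are illustrative.

```r
mu <- 65     # mean height for women (inches)
s  <- 3.5    # standard deviation for women (inches)

k     <- 1:3             # number of standard deviations from the mean
lower <- mu - k * s
upper <- mu + k * s

# Probability of landing within k standard deviations of the mean
# on a standard normal curve
coverage <- pnorm(k) - pnorm(-k)

data.frame(k, lower, upper, coverage = round(coverage, 3))
```

Swapping in `mu <- 70` and `s <- 4` gives the corresponding ranges for men.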
With the Empirical Rule, it's easy to find probabilities within 1, 2, or 3 standard deviations of the mean. However, the formula for the normal distribution is messy, and not one that we want to use. To get more specific probabilities, we can make some adjustments to our notation.
z-score: the number of standard deviations an observation falls from the mean
\[ z=\frac{x-\mu}{\sigma} \]
The distribution of z-scores is normal with a mean of 0 and a standard deviation of 1. This is called the standard normal distribution.
This is a big idea in statistics. We'll come across it again and again: an observation is some number of standard deviations away from the mean.
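As a small illustration of why z-scores are handy, the sketch below compares the same height for a woman and a man using the means and standard deviations from the heights study. The 72-inch height is a made-up value chosen only for the example.

```r
height <- 72    # hypothetical height in inches, for illustration only

# z = (x - mu) / sigma, using the values from the heights study
z_woman <- (height - 65) / 3.5
z_man   <- (height - 70) / 4

c(woman = z_woman, man = z_man)
```

The same 72 inches is 2 standard deviations above the mean for women but only half a standard deviation above the mean for men, so it is a much more unusual height for a woman.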
With R, we have two options for finding probabilities using a normal distribution.
Option 1: Use a z-score.
Example: IQ scores are approximately normal with mean 100 and standard deviation 16. An IQ of 125 is (125 - 100)/16 ≈ 1.56 standard deviations above the mean, so its z-score is about 1.56.
xpnorm(1.56)
If X ~ N(0,1), then
P(X <= 1.56) = P(Z <= 1.56) = 0.9406
P(X > 1.56) = P(Z > 1.56) = 0.0594
[1] 0.9406
Try it!
Option 2: Tell xpnorm which mean and standard deviation to use.
xpnorm can read in the correct mean and standard deviation, and find the z-score on its own.
xpnorm(125, mean=100, sd=16)
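Both options should lead to the same probability. The sketch below checks this with base R's `pnorm`, which computes the same lower-tail probability as `xpnorm` but without the extra printout and plot. It uses the unrounded z-score, so the result differs very slightly from the 0.9406 obtained with z rounded to 1.56.

```r
# Option 1: convert to a z-score first, then use the standard normal
z <- (125 - 100) / 16
p_from_z <- pnorm(z)

# Option 2: give the mean and standard deviation directly
p_direct <- pnorm(125, mean = 100, sd = 16)

all.equal(p_from_z, p_direct)   # the two options agree

# Probability of an IQ above 125 is the rest of the area under the curve
1 - p_direct
```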
Example: Remember that IQs are approximately normal with mean 100 and standard deviation 16. What percent of people have IQs:
Use whichever option you're most comfortable with!
We can also use R to find quantiles (percentiles) with a normal distribution using the xqnorm function. (“q” for quantiles, “p” for probabilities.)
Example: IQs have a normal distribution with mean 100 and s.d. (standard deviation) 16. Find:
xqnorm(0.20, mean=100, sd=16)
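Quantiles and probabilities undo each other, which gives a handy way to check an answer. The sketch below uses base R's `qnorm` and `pnorm` (the functions that `xqnorm` and `xpnorm` build on) for the 20th percentile of IQ; the name `q20` is made up for this example.

```r
# 20th percentile of IQ: the score with 20% of people below it
q20 <- qnorm(0.20, mean = 100, sd = 16)
q20

# Plugging the quantile back into pnorm returns the original probability
pnorm(q20, mean = 100, sd = 16)

# Same quantile via the standard normal: x = mu + z * sigma
100 + qnorm(0.20) * 16
```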
Example: SAT scores on each section follow an approximately normal distribution with mean 500 points and s.d. 100 points. The maximum possible score is 800 points. Use R to answer the following.