Part 4: Probability Basics

Aimee Schwab
June 17, 2013

In everyday life, we have to make decisions – even when we're uncertain of the outcome. They could be simple decisions like whether to bring an umbrella or buy a Powerball ticket.

  • How do you decide when to bring an umbrella to class?
  • How do you decide when to buy a Powerball ticket?

Each of these decisions has some uncertainty – we don't know for sure if it will rain, but we're prepared for that chance. In statistics, we use probability to quantify that uncertainty.

In this section, we'll cover the need-to-know parts of probability theory without going very in-depth. Several decades ago, we'd need advanced probability rules to solve even the simplest statistical problems. Since we can use a computer now, we'll focus only on the properties and interpretations of probabilities.

What makes a probability valid?

  • We could spend an entire semester learning how to find probabilities in theory. If you want to, take Stat 462!
  • For our purposes, we'll only look at what's needed to do statistical inference.

There are four basic rules of probability. For an event, A:

Rule                        Interpretation
1. P(A) \( \ge \) 0         The probability that A occurs must be at least 0.
2. P(A) \( \le \) 1         The probability that A occurs can be at most 1.
3. P(A) + P(not A) = 1      A is guaranteed to either happen or not.
4. P(A) + P(B) + … = 1      The probabilities of all possible outcomes (when no two can happen together) must add up to 1.
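
These rules can be checked mechanically for any proposed list of probabilities. The sketch below uses a hypothetical helper, is_valid_distribution (it is not part of base R or mosaic), and assumes the listed events cannot overlap and together cover every possible outcome. The coin and rain numbers are made up for illustration.

```r
# Hypothetical helper: checks Rules 1, 2, and 4 for a vector of probabilities.
# Assumes the events cannot overlap and together cover every outcome.
is_valid_distribution <- function(p, tol = 1e-8) {
  all(p >= 0) &&               # Rule 1: no negative probabilities
    all(p <= 1) &&             # Rule 2: nothing above 1
    abs(sum(p) - 1) < tol      # Rule 4: probabilities add up to 1
}

fair_coin <- c(heads = 0.5, tails = 0.5)
bad_guess <- c(rain = 0.7, no_rain = 0.5)   # adds up to 1.2

is_valid_distribution(fair_coin)   # TRUE
is_valid_distribution(bad_guess)   # FALSE

# Rule 3: P(not A) = 1 - P(A)
p_rain    <- 0.7
p_no_rain <- 1 - p_rain
```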

For some possible events, probabilities can be calculated using formulas from probability theory. These formulas tend to make assumptions that aren't realistic.

For more interesting events, we can estimate probabilities using sample probabilities.

Sample probability: number of observations in some event A, divided by the total number of observations

Example: Based on records of automobile accidents, the Department of Highway Safety and Motor Vehicles in Florida reported the counts of those who survived (S) and died (D), according to whether they wore a seatbelt (Y = yes, N = no). The data are presented in the table below. Use the data to estimate the following probabilities.

  1. P(Y) =
  2. P(S) =
  3. P(S and Y) =
  4. P(S and D) =

Wore Seatbelt   Survived (S)   Died (D)     Total
Yes (Y)              412,368        510   412,878
No (N)               162,527      1,601   164,128
Total                574,895      2,111   577,006
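
One way to work with these counts is to use R as a calculator. The sketch below types the four cell counts into a matrix (the object names are illustrative, not from the original example) and divides by the total number of observations, following the definition of a sample probability.

```r
# Counts copied from the seatbelt table (rows: seatbelt use, columns: outcome)
counts <- matrix(c(412368, 510,
                   162527, 1601),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(seatbelt = c("Y", "N"),
                                 outcome  = c("S", "D")))

n_total <- sum(counts)                       # 577,006 people in all

p_Y       <- sum(counts["Y", ]) / n_total    # P(Y): wore a seatbelt
p_S       <- sum(counts[, "S"]) / n_total    # P(S): survived
p_S_and_Y <- counts["Y", "S"] / n_total      # P(S and Y)
p_S_and_D <- 0                               # nobody both survives and dies

round(c(p_Y, p_S, p_S_and_Y, p_S_and_D), 4)
```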

Example: We've already seen how to estimate probabilities with a data set – we can use the tally function. For the Whickham data set, we might want to find:

  1. P(smoker and died) =
  2. P(nonsmoker and died) =
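
library(mosaic)   # load the mosaic package, which provides tally()
data(Whickham)
# Joint proportions of smoking status and outcome, with row and column totals
tally(~smoker+outcome, data=Whickham, format='proportion', margins=TRUE)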
       outcome
smoker   Alive   Dead  Total
  No    0.3820 0.1750 0.5571
  Yes   0.3371 0.1058 0.4429
  Total 0.7192 0.2808 1.0000

P(smoker and died) = 0.1058

P(nonsmoker and died) = 0.1750

Suppose I'm only interested in the people in the Florida records who were wearing a seatbelt – what's the probability that someone in that group survived? This is called a conditional probability.

The probability of A, given that we know B has already occurred, is:

Conditional probability, P(A|B):

\[ P(A|B)=\frac{P(A \text{ and } B)}{P(B)} \]

  • Number of observations in both A and B, divided by the number of observations in B
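
As a rough check that the formula and the counting shortcut agree, the sketch below computes P(S|Y) for the seatbelt records both ways. The counts come from the seatbelt table; the variable names are only illustrative.

```r
# Counts from the seatbelt table
n_total   <- 577006   # everyone in the records
n_Y       <- 412878   # wore a seatbelt
n_S_and_Y <- 412368   # wore a seatbelt and survived

# Definition: P(S | Y) = P(S and Y) / P(Y)
p_S_given_Y <- (n_S_and_Y / n_total) / (n_Y / n_total)

# Shortcut: observations in both S and Y, divided by observations in Y
p_S_given_Y_counts <- n_S_and_Y / n_Y

round(p_S_given_Y, 4)
all.equal(p_S_given_Y, p_S_given_Y_counts)   # TRUE: the two routes agree
```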

Where have we seen this vertical bar before?

The vertical bar in R has the same meaning as in a conditional probability. In R it splits our data into groups. With a conditional probability, we're narrowing the sample down to a smaller group – we only look at the observations in category B.

Example: Using the seatbelt data, find the conditional probabilities of:

  1. P(Y|S) =
  2. P(N|S) =
  3. P(N|D) =

Can we do this in R? Yes! We'll modify the tally function from how we've already used it.

Example: For the Whickham data set, which recorded smoking status and whether or not the subject was alive at the end of 20 years, use the tally function to find the conditional probability of surviving for smokers and nonsmokers.

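# Proportions of each outcome within each smoking group (conditioning on smoker)
tally(~outcome|smoker, data=Whickham, format='proportion', margins=TRUE)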
       smoker
outcome     No    Yes
  Alive 0.6858 0.7612
  Dead  0.3142 0.2388
  Total 1.0000 1.0000

In our notation,

  1. 0.761 =
  2. 0.686 =
  3. 0.314 =
  4. 0.239 =

The variable that goes to the right of “|” is what we want to condition on!
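
The conditional table is tied to the joint table printed earlier through the conditional probability formula. The sketch below recovers P(Dead | smoker) from the rounded joint proportions, so it only matches the conditional output to about three decimal places.

```r
# Proportions copied from the joint table printed earlier (rounded to 4 places)
p_smoker_and_dead <- 0.1058
p_smoker          <- 0.4429

# P(Dead | smoker) = P(smoker and Dead) / P(smoker)
p_dead_given_smoker <- p_smoker_and_dead / p_smoker
round(p_dead_given_smoker, 3)   # about 0.239, in line with the conditional table
```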

Variables are sometimes called random variables – this reminds us that their values are uncertain! Random variables have probability distributions.

A probability distribution assigns a probability to each measurable subset of the possible values of a random variable. Some possible values are more likely than others, so they have a higher probability.

We'll consider two types of probability distributions:

  • Discrete: all possible values can be listed, along with their respective probabilities
  • Continuous: the possible values form an interval, so there are infinitely many of them and they can't be listed

For a discrete probability distribution, we should be able to list all possible outcomes and their probabilities. Each probability rule must also be satisfied - all probabilities should be between 0 and 1, and all must add up to 1.

Example: The table below shows the probability distribution of the number of home runs in a single game for the San Francisco Giants.

Number of Home Runs   Probability
0                     0.3889
1                     0.3148
2                     0.2222
3                     0.0556
4                     ?
5 or more             0.000

  1. Find the probability that the Giants hit 4 home runs in a single game.
  2. Find the probability that the Giants hit fewer than 2 home runs in a single game.
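
Both questions follow from the probability rules, and one possible way to work them out in R is sketched below. The vector names are illustrative, and NA stands in for the "?" entry in the table.

```r
home_runs <- c("0", "1", "2", "3", "4", "5+")
prob      <- c(0.3889, 0.3148, 0.2222, 0.0556, NA, 0.000)  # NA marks the "?"
names(prob) <- home_runs

# All probabilities must add up to 1, so the missing one is the remainder
prob["4"] <- 1 - sum(prob, na.rm = TRUE)
prob["4"]

# Fewer than 2 home runs means 0 or 1
p_fewer_than_2 <- unname(prob["0"] + prob["1"])
p_fewer_than_2
```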

In most cases, we don't know the probability distribution for a discrete random variable until after we collect data. There are some notable exceptions.

For a discrete probability distribution, the population mean is easy to find. To find the mean, \( \mu \),

  • Multiply each possible observed value by its probability
  • Add them all up

\[ \mu=\sum_i x_i \, P(X=x_i) \]

This is also called the expected value.

Example: Let X represent the response of a randomly selected person to the question, “What is the ideal number of children for a family to have?” The probability distribution based on data collected from the General Social Survey is shown below, sorted by gender.

x   P(X=x): females   P(X=x): males
0   0.01              0.02
1   0.03              0.03
2   0.55              0.60
3   0.31              0.28
4   0.11              0.08

  1. Find the expected value of the probability distribution for each gender. How do they compare? (You can use R as a calculator!)

  2. The standard deviation for the females is 0.770 and the standard deviation for the males is 0.758. Compare these standard deviations – what are the practical implications?
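
Using R as a calculator here might look like the sketch below. It uses the probabilities exactly as printed in the table (each column adds up to 1.01 because of rounding), and the vector names are illustrative. The second half assumes the standard deviation is the square root of the probability-weighted squared distances from the mean, which reproduces the values quoted in question 2.

```r
x        <- 0:4
p_female <- c(0.01, 0.03, 0.55, 0.31, 0.11)
p_male   <- c(0.02, 0.03, 0.60, 0.28, 0.08)

# Expected value: multiply each value by its probability, then add
mu_female <- sum(x * p_female)
mu_male   <- sum(x * p_male)
c(females = mu_female, males = mu_male)

# Standard deviation: square root of the probability-weighted squared distances
sd_female <- sqrt(sum((x - mu_female)^2 * p_female))
sd_male   <- sqrt(sum((x - mu_male)^2 * p_male))
round(c(females = sd_female, males = sd_male), 3)
```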

Continuous random variables have an infinite number of possible values. All possible values form an interval.

Since we don't have a finite list of possibilities, we can't just add together all of the outcomes. Instead of a histogram, continuous probability densities are represented by a smooth curve.

Important Properties:

  • Probabilities are represented as areas under a density curve (smooth line). The total area under the density curve is exactly equal to 1.
    • This is why histograms in R default to labeling the vertical axis “Density”!
  • A single value has probability 0 of occurring.
    • How many possible values are there?
  • The probability that X falls between two values, a and b, is the area under the curve between points a and b.
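
These properties can be explored numerically. The sketch below uses base R's integrate function to measure areas under a density curve. The standard normal density (dnorm) and the endpoints a and b are assumptions chosen for illustration, not part of the original example.

```r
# Density curve used for illustration: the standard normal (an assumption)
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area                      # very close to 1

# P(a < X < b) is the area under the curve between a and b
a <- -1
b <- 2
p_between <- integrate(dnorm, lower = a, upper = b)$value
p_between

# Shrinking an interval around a single value (here, 1) shrinks its area toward 0
widths <- c(0.1, 0.01, 0.001)
areas  <- sapply(widths, function(w) {
  integrate(dnorm, lower = 1 - w / 2, upper = 1 + w / 2)$value
})
areas
```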

For most continuous probability distributions, there is no easy way to find the probabilities (unless you've taken calculus, but it's still not the easiest task).

Because of this, there are some continuous distributions that we rely on much more than others. The big one is the normal distribution.

The normal distribution has several nice properties.

  • Bell-shaped. Any time you hear someone say “bell curve”, they mean the normal distribution.
  • Symmetric.
  • Centered at the population mean \( \mu \)
  • Spread varies based on the standard deviation \( \sigma \)

The normal distribution is a natural fit for many applications, and happens to have a very nice statistical property!

It's also incredibly flexible, since changing the mean and standard deviation can change the location and shape of the curve.

Example: A study was conducted by the University of Georgia to measure male and female heights. The normal distribution curves are plotted below.

[Figure: normal distribution curves for male and female heights]

Empirical Rule: For any normal distribution (bell-shaped curve), the probability that an observation falls within a given number of standard deviations of the mean \( \mu \) is the same, no matter what \( \mu \) and \( \sigma \) are.

Some important distances:

  • \( \pm \) 1 standard deviation: probability is 0.68
  • \( \pm \) 2 standard deviations: probability is 0.95
  • \( \pm \) 3 standard deviations: probability is 0.997
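
These values can be checked with base R's pnorm function, which gives the area under a normal curve to the left of a value. The sketch below also repeats the calculation for a second normal curve, with a mean and standard deviation chosen arbitrarily for illustration, to suggest that the answers do not depend on the particular curve.

```r
k <- 1:3

# Standard normal: P(-k < Z < k)
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 3)   # 0.683 0.954 0.997

# Same calculation for a different normal curve (illustrative mean and sd)
mu    <- 50
sigma <- 7
within_k_other <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
                  pnorm(mu - k * sigma, mean = mu, sd = sigma)

all.equal(within_k, within_k_other)   # TRUE
```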


Example: The Empirical Rule can be used to find reasonable ranges for a data set. For the University of Georgia heights study, heights in inches for men (\( \mu \)=70, \( \sigma \)=4) and women (\( \mu \)=65, \( \sigma \)=3.5) followed normal distributions.

  1. What range of height do 68% of women fall between?
  2. What range of height do 95% of men fall between?
  3. What range of height do 99.7% (almost all) of women fall into?

68% of women fall between 61.5 and 68.5 inches. 95% of men fall between 62 and 78 inches. 99.7% of women fall between 54.5 and 75.5 inches.
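
The arithmetic behind these ranges is the mean plus or minus k standard deviations. A minimal sketch, using a hypothetical helper named empirical_range:

```r
# Hypothetical helper: the interval from mu - k*sigma to mu + k*sigma
empirical_range <- function(mu, sigma, k) c(lower = mu - k * sigma, upper = mu + k * sigma)

empirical_range(mu = 65, sigma = 3.5, k = 1)   # women, 68%:   61.5 to 68.5
empirical_range(mu = 70, sigma = 4,   k = 2)   # men, 95%:     62 to 78
empirical_range(mu = 65, sigma = 3.5, k = 3)   # women, 99.7%: 54.5 to 75.5
```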

With the Empirical Rule, it's easy to find probabilities within 1, 2, or 3 standard deviations of the mean. However, the formula for the normal distribution is messy, and not one that we want to use. To get more specific probabilities, we can make some adjustments to our notation.

z-score: the number of standard deviations an observation falls from the mean

\[ z=\frac{x-\mu}{\sigma} \]

  • Positive z-scores indicate values above the mean
  • Negative z-scores indicate values below the mean

The distribution of z-scores is normal with a mean of 0 and a standard deviation of 1. This is called the standard normal distribution.

This is a big idea in statistics. We'll come across it again and again: the idea that an observation is some number of standard deviations away from the mean.

With R, we have two options for finding probabilities using a normal distribution.

Option 1: Use a z-score.

Example: The distribution of IQ scores is approximately normal with mean 100 and standard deviation 16.

  1. If Jim has an IQ of 125, what is his z-score?
  2. What's the probability that a worker at Dunder Mifflin will be outsmarted by one of Jim's pranks? (That is, has a lower IQ?)
xpnorm(1.56)

If X ~ N(0,1), then 

    P(X <= 1.56) = P(Z <= 1.56) = 0.9406
    P(X >  1.56) = P(Z >  1.56) = 0.0594

[1] 0.9406

Try it!

  1. If Dwight has an IQ of 87, what is his z-score?
  2. What's the probability that a worker at Dunder Mifflin will be pranked by Dwight? (That is, has a lower IQ?)
  3. What's the probability that a worker at Dunder Mifflin can prank Dwight? (That is, has a higher IQ?)

Option 2: Tell xpnorm which mean and standard deviation to use.

xpnorm can read in the correct mean and standard deviation, and find the z-score on its own.

xpnorm(125, mean=100, sd=16)
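
The two options should give the same answer. The sketch below checks this with base R's pnorm, which returns the same lower-tail probability that xpnorm reports, just without the plot. The variable names are illustrative.

```r
mu    <- 100
sigma <- 16
jim   <- 125

# Option 1: convert to a z-score, then use the standard normal curve
z <- (jim - mu) / sigma        # 1.5625, shown rounded to 1.56 in the example
p_option1 <- pnorm(z)

# Option 2: hand the mean and standard deviation to pnorm directly
p_option2 <- pnorm(jim, mean = mu, sd = sigma)

all.equal(p_option1, p_option2)   # TRUE: both routes give the same area
round(p_option2, 3)

# Upper tail: probability of an IQ above 125
p_upper <- 1 - p_option2
```

The result differs slightly from the 0.9406 shown for Option 1 because that example rounded the z-score to 1.56 before looking up the probability.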

Example: Remember that IQs are approximately normal with mean 100 and standard deviation 16. What percent of people have IQs:

  1. Lower than 115?
  2. Lower than 94?
  3. Higher than 94?
  4. Exactly 94?
  5. Between 94 and 115?

Use whichever option you're most comfortable with!

We can also use R to find quantiles (percentiles) with a normal distribution using the xqnorm function. (“q” for quantiles, “p” for probabilities.)

Example: IQs have a normal distribution with mean 100 and s.d. (standard deviation) 16. Find:

  1. The 20th percentile (the 0.20 quantile).
  2. The 95th percentile (the 0.95 quantile).
  3. Q3.
  4. Q3 for a standard normal distribution.

xqnorm(0.20, mean=100, sd=16)
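
For comparison, base R's qnorm returns the same quantiles as xqnorm without drawing the plot. The sketch below (with illustrative variable names) works through the four quantiles, using the fact that Q3 is the 75th percentile, and then confirms that qnorm and pnorm undo each other.

```r
mu    <- 100
sigma <- 16

q20  <- qnorm(0.20, mean = mu, sd = sigma)   # 20th percentile of IQ
q95  <- qnorm(0.95, mean = mu, sd = sigma)   # 95th percentile of IQ
q3   <- qnorm(0.75, mean = mu, sd = sigma)   # Q3 is the 75th percentile
q3_z <- qnorm(0.75)                          # Q3 for the standard normal

round(c(q20, q95, q3, q3_z), 2)

# qnorm undoes pnorm: the area below the 20th percentile is 0.20
pnorm(q20, mean = mu, sd = sigma)
```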

Example: SAT scores on each section follow an approximately normal distribution with mean 500 points and s.d. 100 points. The maximum possible score is 800 points. Use R to answer the following.

  1. Sketch the distribution of SAT scores.
  2. What proportion of students will score below 260 in one section of the SAT?
  3. What proportion of students will score between 400 and 500 in one section?
  4. What proportion of students will score above 700 in one section of the SAT?
  5. What score corresponds to the top 10% of students?
  6. What score corresponds to the bottom 8% of students?
  7. Between what two values will you find the middle 95% of SAT scores?