Introduction to Statistical Inference

Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to Statistical_Inference/Introduction.)

Basics

In this lesson, we’ll briefly introduce basics of statistical inference, the process of drawing conclusions “about a population using noisy statistical data where uncertainty must be accounted for”.

In other words, statistical inference lets scientists formulate conclusions from data and quantify the uncertainty arising from using incomplete data.

Which of the following is NOT an example of statistical inference?

1: Testing the efficacy of a new drug
2: Polling before an election to predict its outcome
3: Recording the results of a statistics exam
4: Constructing a medical image from fMRI data

Selection: 3
Hint: Which of the choices involves concrete data that doesn't require any
inference or generalization?

So statistical inference involves formulating conclusions using data AND quantifying the uncertainty associated with those conclusions. The uncertainty could arise from incomplete or bad data.

Which of the following would NOT be a source of bad data?

1: A poorly designed study
2: Small sample size
3: Selection bias
4: A randomly selected sample of population

Selection: 4
Hint: Which of the choices would allow one to draw reasonable inferences?

So with statistical inference we use data to draw general conclusions about a
population. Which of the following would a scientist using statistical inference
techniques consider a problem?

1: Our data sample is representative of the population
2: Contaminated data
3: Our study has no bias and is well-designed

Selection: 2
Hint: Which of the choices is obviously bad?

Which of the following is NOT an example of statistical inference in action?

1: Estimating the proportion of people who will vote for a candidate
2: Testing the effectiveness of a medical treatment
3: Counting sheep
4: Determining a causative mechanism underlying a disease

Selection: 3
Hint: Which of the choices is exact?

We want to emphasize a couple of important points here. First, a statistic (singular) is a number computed from a sample of data. We use statistics to infer information about a population. Second, a random variable is an outcome from an experiment. Deterministic processes, such as computing means or variances, applied to random variables, produce additional random variables which have their own distributions. It’s important to keep straight which distributions you’re talking about.

Flavour of Inference

Finally, there are two broad flavors of inference:

Frequency, which uses “long run proportion of times an event occurs in independent, identically distributed repetitions.”
Bayesian in which the probability estimate for a hypothesis is updated as additional evidence is acquired.

Both flavors require an understanding of probability so that’s what the next lessons will cover.

Congrats! You’ve concluded this brief introduction to statistical inference.

Probability 1

Introduction

Probability. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to Statistical_Inference/Probability.)

In this lesson, we’ll review basic ideas of probability, the study of quantifying the likelihood of particular events occurring. Note the similarity between the words probability and probably. Every time you use the latter word you’re implying that an event is more likely than not to occur.

The first step in understanding probability is to see if you understand what
outcomes of an experiment are possible. For instance, if you were rolling a
single, fair die once, how many outcomes are possible?

1: 4
2: Too many
3: 1
4: 6

Selection: 4
Hint: How many sides or faces does a die (or any cube or box) have?

The probability of a particular outcome of an experiment is the ratio of the
number of ways that outcome can occur to all possible outcomes of the
experiment. Since there are 6 possible outcomes to the experiment of rolling a
die, and we assume the die is fair, each outcome is equally likely. So what is
the probability of rolling a 2?

1: 0
2: 1/6
3: 1/3
4: 2/6

Selection: 2
Hint: The '2' is one of six possibilities.

What is the probability of rolling an even number?

1: 0
2: 1/2
3: 1
4: 1/3

Selection: 2
Hint: Half of the outcomes are even and half are odd. There are three even
outcomes (2, 4, and 6) and three odd (1, 3, and 5).

Since the probability of a particular outcome or event $E$ is the ratio of ways that $E$ could occur to the number of all possible outcomes or events, the probability of $E$, denoted $P(E)$, is always between $0$ and $1$. Impossible events have a probability of $0$ (since they can’t occur) and events that are certain to occur have a probability of $1$.

If you’re doing an experiment with $n$ possible outcomes, say $e1, e2, ..., en$, then the sum of the probabilities of all the outcomes is 1. If all the outcomes are equally likely, as in the case of a fair die, then the probability of each is $1/n$.

Probability of Two Independent Events

If $A$ and $B$ are two independent events then the probability of them both occurring is the product of the probabilities. $P(A \cap B) = P(A) \times P(B)$

Suppose you rolled the fair die twice in succession. What is the probability of
rolling two 4's?

1: 2/6
2: 0
3: 1/36
4: 1/6

Selection: 3
Hint: The probability of rolling the first 4 is 1/6. The second roll of the dice
doesn't depend on the outcome of the first, so that also has a probability of
1/6. The probability of both events occurring is 1/6 * 1/6. This makes intuitive
sense since the probability of 2 such events occurring has to be smaller than
the probability of only 1 event.

Suppose you rolled the fair die twice. What is the probability of rolling the
same number two times in a row?

1: 0
2: 1/36
3: 2/6
4: 1/6

Selection: 4
Hint: Since we don't care what the outcome of the first roll is, its probability
is 1. The second roll of the dice has to match the outcome of the first, so that
has a probability of 1/6. The probability of both events occurring is 1 * 1/6.

Now consider the experiment of rolling 2 dice, one red and one green. Assume the
dice are fair and not loaded. How many distinct outcomes are possible?

1: 12
2: 36
3: 1
4: 11

Selection: 2
Hint: Each of the dice has 6 ways to land, and their outcomes are independent of
each other. Each way the red die lands can be paired with each way the green die
lands. For instance, a "1" on the red dice can occur independently of any of the
6 outcomes of the green dice.

Mutually Exclusive (Disjoint) Events

If an event $E$ can occur in more than one way and these ways are disjoint (mutually exclusive) then $P(E)$ is the sum of the probabilities of each of the ways in which it can occur.

Rolling these two dice, what's the probability of rolling a 10?

1: 0
2: 1/6
3: 2/36
4: 3/36

Selection: 4
Hint: Since the highest possible dice roll is a '6', the only combinations which
give '10' are 4+6 and 5+5. The first occurs in two ways - red dice gets '4' and
green gets '6' OR red gets '6' and green gets '4'.

What sum is the most likely when rolling these two dice?

1: 7
2: 12
3: 1
4: 2
5: 9

Selection: 1
Hint: The choice of '1' is impossible rolling two dice, and '2' and '12' each
occur in only one possible way (snake-eyes and double 6's, respectively), so '7'
and '9' are the only reasonable choices. To roll '7' you can use any number from
'1' to '6', while '9' can only use outcomes of '3' and above.

At Least One of Two Events

The probability of at least one of two events, A and B, occurring is the sum of their individual probabilities minus the probability of their intersection. $P(A U B) = P(A) + P(B) - P(A \cap B)$.

It’s easy to see why this is. Calculating $P(A)$ and $P(B)$ counts outcomes that are in both A and B twice, so they’re overcounted. The probability of the intersection of the two events, denoted as $A \cap B$, must be subtracted from the sum.

Back to rolling two dice. Which expression represents the probability of rolling
an even number or a number greater than 8?

1: (18+10-4)/36
2: (18+4-2)/36
3: (18+10-2)/36
4: (18+10)/36

Selection: 1
Hint: The probability of rolling an even number is 1/2 or 18/36. There are 10
ways of rolling a number greater than '8' - 4 ways for rolling '9', 3 for '10',
2 for '11' and 1 for '12'. How big is the intersection between rolling an even
number and those greater than '8'?

Only One of Disjoint Events

It follows that if A and B are disjoint or mutually exclusive, i.e. only one of them can occur, then $P(A U B) = P(A) + P(B)$.

Which of the following expressions represents the probability of rolling a
number greater than 10?

1: (3+1)/36
2: (1+1)/36
3: (3+1-1)/36
4: (2+1)/36

Selection: 4
Hint: The only outcomes greater than 10 are 11 and 12 which are mutually
exclusive. The first, 11, can occur in two ways, and the second, 12, can occur
only with a roll of double 6's.

Use the answer to the previous question and the fact that the sum of all outcomes must sum to 1 to determine the probability of rolling a number less than or equal to 10.
Hint: Subtract the previous answer from 1.

33/36

## [1] 0.9166667

Now we’ll apply the concepts of probability to playing cards.

values <- c("A", as.character(2:10), "J", "Q", "K")
suits <- c("spades", "hearts", "diamonds", "clubs")

# Create a deck as a 13x4 matrix which is easy to verify by eye.
deck <- sapply(suits, function(suit) paste(values, suit, sep=":"))

# Note the value of a card is its row index (if Aces are low.)

# Select n cards from a deck at random without replacement.
hand <- function(n, deck) sample(deck, n, replace=FALSE)

# Deal k hands of n cards each as a kxn matrix.
deal <- function(k, n, deck) {
    # Select kxn cards at random without replacement.
    temp <- hand(k*n, deck)

    # Reshape selections into a kxn matrix. Since R
    # fills by column, this is like dealing the cards
    # out in circular order around the table. (Not that
    # it makes any difference since the selection process
    # is a random permutation of the deck anyway.)
    matrix(temp, k, n)
}

A deck of cards is a set of 52 cards, 4 suits of 13 cards each. There are two red suits, diamonds and hearts, and two black suits, spades and clubs. Each of the 13 cards in a suit has a value - an ace which is sometimes thought of as 1, a number from 2 to 10, and 3 face cards, king, queen, and jack. We’ve created a deck in R for you. Type ‘deck’ to see it now.

deck

##       spades      hearts      diamonds      clubs     
##  [1,] "A:spades"  "A:hearts"  "A:diamonds"  "A:clubs" 
##  [2,] "2:spades"  "2:hearts"  "2:diamonds"  "2:clubs" 
##  [3,] "3:spades"  "3:hearts"  "3:diamonds"  "3:clubs" 
##  [4,] "4:spades"  "4:hearts"  "4:diamonds"  "4:clubs" 
##  [5,] "5:spades"  "5:hearts"  "5:diamonds"  "5:clubs" 
##  [6,] "6:spades"  "6:hearts"  "6:diamonds"  "6:clubs" 
##  [7,] "7:spades"  "7:hearts"  "7:diamonds"  "7:clubs" 
##  [8,] "8:spades"  "8:hearts"  "8:diamonds"  "8:clubs" 
##  [9,] "9:spades"  "9:hearts"  "9:diamonds"  "9:clubs" 
## [10,] "10:spades" "10:hearts" "10:diamonds" "10:clubs"
## [11,] "J:spades"  "J:hearts"  "J:diamonds"  "J:clubs" 
## [12,] "Q:spades"  "Q:hearts"  "Q:diamonds"  "Q:clubs" 
## [13,] "K:spades"  "K:hearts"  "K:diamonds"  "K:clubs"

When drawing a single card, how many outcomes are possible?
Hint: How many cards are in the deck?

## [1] 52

What is the probability of drawing a jack?
Hint: How many jacks are in the deck? Divide this number by the number of cards in the deck.

4/52

## [1] 0.07692308

If you’re dealt a hand of 5 cards, what is the probability of getting all 5 of the same value?
Hint: There are only 4 different suits in a deck.

## [1] 0

What is the probability of drawing a face card?
Hint: There are 3 face cards in each of the 4 suits in the deck.

12/52

## [1] 0.2307692

Suppose you draw a face card and don't replace it in the deck. What is the
probability that when you draw a second card it also will be a face card?

1: 0
2: 11/52
3: 12/51
4: 11/51

Selection: 4
Hint: There are only 51 cards remaining in the deck, and of those, only 11 are
face cards.

Suppose you draw a face card and don’t replace it in the deck. What is the probability that when you draw a second card it also will be a face card of the same suit?
Hint: There are 2 face cards of the same suit left in the deck.

2/51

## [1] 0.03921569

Congrats! With probability 1, you’ve aced this first lesson on basic probability.

Probability 2

Introduction

In this lesson, we’ll continue to discuss probability.

Recall that a random variable is a numerical outcome of an experiment. It can be discrete (take on only a countable number of possibilities), or continuous (take on any value on the real line or subset of it).

If you had a ruler of infinite precision, would measuring the height of adults
around the world be continuous or discrete?

1: discrete
2: continuous

Selection: 2
Hint: The ruler of infinite precision is the hint. Can you list all possible
heights?

Is the drawing of a hand of cards continuous or discrete?

1: discrete
2: continuous

Selection: 1
Hint: Can you enumerate the possible outcomes?

Continuous random variables are usually associated with measurements of time, distance, or some biological process since they can take on any value, often within some specified range. Limitations of precision in taking the measurements may imply that the values are discrete; we in fact consider them continuous.

Probability Mass Function (PMF)

A probability mass function (PMF) gives the probability that a discrete random variable is exactly equal to some value.

For instance, suppose we have a coin which may or may not be fair. Let x=0
represent a 'heads' outcome and x=1 represent a 'tails' outcome of a coin toss.
If p is the probability of 'heads' which of the following represents the PMF of
the coin toss?  The variable x is either 0 (heads) or 1 (tails).

1: (p^x)*(1-p)^(1-x)
2: (p^(1-x))*(1-p)^x

Selection: 2
Hint: The probability p is associated with a 'heads' outcome which occurs when
x = 0. Which of the two expressions has an exponent of 1 for p when x is 0?

Probability Density Function (PDF)

A probability density function is associated with a continuous random variable. To quote from Wikipedia, it “is a function that describes the relative likelihood for this random variable to take on a given value. The probability of the random variable falling within a particular range of values is given by … the area under the density function but above the horizontal axis and between the lowest and greatest values of the range.”

We’ll repeat two requirements of a probability density function. It must be non-negative everywhere, and the area under it must equal one."

x <- c(0, 2, 2, 0, 0); y <- c(0, 0, 1, 1, 0)
plot(x, y, lwd = 3, frame = FALSE, type = "l")
segments(0, 0, 2, 1, lwd = 3)

Consider this figure - a rectangle with height 1 and width 2 with a diagonal line drawn from the lower left corner (0,0) to the upper right (2,1). The area of the entire rectangle is 2 and elementary geometry tells us that the diagonal divides the rectangle into 2 equal areas.

Could the diagonal line represent a probability density function for a random
variable with a range of values between 0 and 2? Assume the lower side of the
rectangle is the x axis.

1: No
2: Yes

Selection: 2
Hint: Is the line nonnegative? Is the area under the diagonal 1?

plot(x, y, lwd = 3, frame = FALSE, type = "l")
segments(0, 0, 2, 1, lwd = 3)

polygon(c(0, 1.6, 1.6, 0), c(0, 0, 0.8, 0), lwd = 3, col = "lightblue")

mypdf <- function(x) { x/2 }

Now consider the shaded portion of the triangle - a smaller triangle with a base of length 1.6 and height determined by the diagonal. We’ll answer the question, “What proportion of the big triangle is shaded?”

This proportion represents the probability that throwing a piece of cat kibble at the bigger triangle (below the diagonal) hits the blue portion.

We have to compute the area of the blue triangle. (Since the area of the big
triangle is 1, the area of the blue triangle is the proportion of the big
triangle that is shaded.) We know the base, but what is its height?

1: I can't tell
2: .25
3: .5
4: .8

Selection: 4
Hint: The  slope of a line is the "rise" (change in height) divided by the "run"
(change in width), so the diagonal's slope is 1/2. At x = 1.6, the y value of
the diagonal is 1/2 * 1.6.

What is the area of the blue triangle?
Hint: Multiply the base by the height and divide by 2.

0.64

## [1] 0.64

So, what is the probability that the kibble we throw at the bigger triangle will hit the blue portion?
Hint: The area of the blue triangle divided by the area of the big triangle gives you the probability.

0.64

## [1] 0.64

This artificial example was meant to illustrate a simple probability density function (PDF). Most PDFs have underlying formulae more complicated than lines. We’ll see more of these in future lessons.

Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable X, either discrete or continuous, is the function F(x) equal to the probability that $X \le x$. In the example above, the area of the blue triangle represents the probability that the random variable was less than or equal to the value 1.6.

In the triangle example from above, which of the following expressions
represents F(x), the CDF?

1: x*x/4
2: x^2
3: x*2x/2
4: x*x/2

Selection: 1
Hint: The term 'x' is the base, x/2 is the height. Plug these into the formula
for the area of a triangle.

If you’re familiar with calculus you might recognize that when you’re computing areas under curves you’re actually integrating functions.

When the random variable is continuous, as in the example, the PDF is the derivative of the CDF. So integrating the PDF (the line represented by the diagonal) yields the CDF. When you evaluate the CDF at the limits of integration the result is an area.

$CDF = \int PDF$

To see this in the example, we’ve defined the function mypdf for you. This is the equation of the line represented by the diagonal of the rectangle. As the PDF, it is the derivative of F(x), the CDF. Look at mypdf now.

mypdf

## function(x) { x/2 }

Now use the R function integrate to integrate mypdf with the parameters lower equal to 0 and upper equal to 1.6. See if you get the same area (probability) you got before.

integrate(mypdf, 0, 1.6)

## 0.64 with absolute error < 7.1e-15

\[CDF = \int_{0}^{1.6} PDF = \int_{x = 0}^{1.6} \frac{x}{2} dx= \left. \frac{x^2}{4} \right|_{0}^{1.6} = 0.64\]

Survivor Function

The survivor function $S(x)$ of a random variable $X$ is defined as the function of $x$ equal to the probability that the random variable $X$ is greater than the value $x$. This is the complement of the CDF $F(x)$, in our example, the portion of the lower triangle that is not shaded.

$S(X) = P(X > x) = 1 - F(X)$, where $F(X) =$ CDF

In our example, which of the following expressions represents the survival
function?

1: 1-x*x/4
2: 1-x*x/2
3: 1-x*2x/2
4: 1-x^2

Selection: 1
Hint: Since areas under PDF's must be 1 and the survival function is the
complement of the CDF, the survival function and the CDF sum to 1.

The quantile $v$ of a CDF is the point $x_v$ at which the CDF has the value $v$. More precisely, $F(x_v) = v$. A percentile is a quantile in which $v$ is expressed as a percentage.

What percentile is the median?

1: I can't tell
2: 25-th
3: 50-th
4: 95-th

Selection: 3
Hint: The median is the point at which half of the outcomes are above and half
are below.

What is the $50^{th}$ percentile of the CDF $F(x) = \frac{x^2}{4}$ from the example above?
Hint: Solve for the x such that $x^2 = 4 \times 0.5 = 2$

2^(1/2)

## [1] 1.414214

What does this mean with respect to the kibble we're tossing at the triangle?

1: All of it falls to the right of 1.41
2: All of it falls on the vertical line at 1.41
3: All of it falls to the left of 1.41
4: Half of it falls to the left of 1.41

Selection: 4
Hint: Recall the meaning of median (half).

We’ll close by repeating some important points.

A probability model connects data to a population using assumptions.
Be careful to distinguish between population medians and sample medians.
A sample median is an estimator of a population median (the estimand).

Congrats! You’ve concluded this lesson on probability.

Conditional Probability

Introduction

Conditional Probability. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to 06_Statistical_Inference/03_Conditional_Probability.)

In this lesson, as the name suggests, we’ll discuss conditional probability.

If you were given a fair die and asked what the probability of rolling a 3 is,
what would you reply?

1: 1/2
2: 1
3: 1/4
4: 1/3
5: 1/6

Selection: 5
Hint: There are 6 possible outcomes and you want to know the probability of 1 of
them.

Suppose the person who gave you the dice rolled it behind your back and told you
the roll was odd. Now what is the probability that the roll was a 3?

1: 1/2
2: 1
3: 1/3
4: 1/4
5: 1/6

Selection: 3
Hint: Given that there are 3 odd numbers on the die your possibilities have been
reduced to 3 and you want to know the probability of 1 of them.

The probability of this second event is conditional on this new information, so the probability of rolling a 3 is now one third.

Definition

We represent the conditional probability of an event A given that B has occurred with the notation P(A|B). More specifically, we define the conditional probability of event A, given that B has occurred with the following.

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

$P(A|B)$ is the probability that BOTH A and B occur divided by the probability that B occurs.

Back to our dice example. Which of the following expressions represents $P(A \cap B)$, where A is the event of rolling a 3 and B is the event of the roll being odd?

1: 1/2
2: 1
3: 1/3
4: 1/4
5: 1/6

Selection: 5
Hint: Here A is a subset of B so the probability of both A AND B happening is
the probability of A happening.

Continuing with the same dice example. Which of the following expressions represents $\frac{P(A \cap B)}{P(B)}$, where A is the event of rolling a 3 and B is the event of the roll being odd?

1: 1/6
2: (1/2)/(1/6)
3: (1/6)/(1/2)
4: (1/3)/(1/2)

Selection: 3
Hint: Here A is a subset of B so the probability of both A AND B happening is
the probability of A happening. The probability of B is the reciprocal of the
number of odd numbers between 1 and 6 (inclusive).

Bayes’ Formula

From the definition of P(A|B), we can write: \[P(A \cap B) = P(A|B) \times P(B)\]

Let’s use this to express P(B|A).

\[P(B|A) = \frac{P(B \cap A)}{P(A)} = P(A|B) \times \frac{P(B)}{P(A)}\]

This is a simple form of Bayes’ Rule which relates the two conditional probabilities.

Suppose we don’t know P(A) itself, but only know its conditional probabilities, that is, the probability that it occurs if B occurs and the probability that it occurs if B doesn’t occur. These are $P(A|B)$ and $P(A|B^c)$, respectively. We use $B^c$ to represent ‘not B’ or ‘B complement’.

We can then express \[P(A) = P(A|B) \times P(B) + P(A|B^c) \times P(B^c)\] and substitute this is into the denominator of Bayes’ Formula.

\[P(B|A) = \frac{P(A|B) \times P(B)}{P(A|B) \times P(B) + P(A|B^c) \times P(B^c)}\]

Bayes’ Rule has applicability to medical diagnostic tests. We’ll now discuss the example of the HIV test from the slides.

Test Sensitivity and Specificity

Suppose we know the accuracy rates of the test for both the positive case (positive result when the patient has HIV) and negative (negative test result when the patient doesn’t have HIV). These are referred to as test sensitivity and specificity, respectively.

Let 'D' be the event that the patient has HIV, and let '+' indicate a positive
test result and '-' a negative. What information do we know? Recall that we know
the accuracy rates of the HIV test.

1: P(+|D) and P(-|D)
2: P(+|D) and P(-|D^c)
3: P(+|D^c) and P(-|D^c)
4: P(+|D^c) and P(-|D)

Selection: 2
Hint: The clue here is accuracy. The test is positive when the patient has the
disease and negative when he doesn't.

Test Sensitivity = $P(+|D)$, and Test Specificity = $P(-|D^c)$

Suppose a person gets a positive test result and comes from a population with a
HIV prevalence rate of .001. We'd like to know the probability that he really
has HIV. Which of the following represents this?

1: P(+|D)
2: P(D|+)
3: P(D^c|+)
4: P(D|-)

Selection: 2
Hint: We've already been given the information that the test was positive '+'.
We want to know whether D is present given the positive test result.

By Bayes’ Formula, \[P(D|+) = \frac{P(+|D) \times P(D)}{P(+|D) \times P(D) + P(+|~D) \times P(~D)}\]

We can use the prevalence of HIV in the patient’s population as the value for $P(D)$. Note that since $P(D^c) = 1 - P(D)$ and $P(+|D^c) = 1 - P(-|D^c)$ we can calculate $P(D|+)$. In other words, we know values for all the terms on the right side of the equation. Let’s do it!

Disease prevalence is .001 ($P(D) = 0.001$). Test sensitivity (+ result with disease) is 99.7% ($P(+|D) = 0.997$) and specificity (- result without disease) is 98.5% ($P(-|D^c) = 0.985$). First compute the numerator, $P(+|D) \times P(D)$. (This is also part of the denominator.)
Hint: Multiply the test sensitivity by the prevalence.

0.997 * 0.001

## [1] 0.000997

Now solve for the remainder of the denominator, $P(+|D^c) \times P(D^c)$.
Hint: Multiply the complement of test specificity by the complement of prevalence.

(1 - 0.985)*(1 - 0.001)

## [1] 0.014985

Now put the pieces together to compute the probability that the patient has the disease given his positive test result, $P(D|+)$. Plug your last two answers into the formula: \[\frac{P(+|D) \times P(D)}{P(+|D) \times P(D) + P(+|D^c) \times P(~D)}\] to compute P(D|+).
Hint: Divide $(0.997 \times 0.001)$ by $(0.997 \times 0.001 + 0.015 \times 0.999)$

(0.997 * 0.001)/((0.997 * 0.001) + ((1 - 0.985) * (1 - 0.001)))

## [1] 0.06238268

Postive/Negative Predictive Values

So the patient has a 6% chance of having HIV given this positive test result. The expression $P(D|+)$ is called the positive predictive value. Similarly, P(D^c|-), is called the negative predictive value, the probability that a patient does not have the disease given a negative test result.

Diagnostic Likelihood Ratio

The diagnostic likelihood ratio of a positive test, $DLR_+$, is the ratio of the two + conditional probabilities, one given the presence of disease and the other given the absence. Specifically, \[DLR_+ = \frac{P(+|D)}{P(+|D^c)}\] Similarly, the $DLR_-$ is defined as a ratio. Which of the following do you think represents the $DLR_-$?

1: $\frac{P(+|D^c)}{P(-|D)}$
2: $\frac{P(-|D)}{P(-|D^c)}$
3: I haven’t a clue.
4: $\frac{P(-|D)}{P(+|D^c)}$

Selection: 2
Hint: The signs of the test in both the numerator and denominator have to agree as they did for the $DLR_+$

Recall that $P(+|D)$ and $P(-|D^c)$, (test sensitivity and specificity respectively) are accuracy rates of a diagnostic test for the two possible results. They should be close to 1 because no one would take an inaccurate test, right? Since $DLR_+ = \frac{P(+|D)}{P(+|D^c)}$ we recognize the numerator as test sensitivity and the denominator as the complement of test specificity.

Since the numerator is close to 1 and the denominator is close to 0 do you expect $DLR_+$ to be large or small?

1: Large
2: Small
3: I haven't a clue.

Selection: 1
Hint: What happens when you divide a large number by a much smaller one?

Now recall that $DLR_- = \frac{P(-|D)}{P(-|D^c)}$. Here the numerator is the complement of sensitivity and the denominator is specificity. From the arithmetic and what you know about accuracy tests, do you expect $DLR_-$ to be large or small?

1: Large
2: Small
3: I haven't a clue.

Selection: 2
Hint: What happens when you divide by small number by a larger one?

Likelihood Ratios

Now a little more about likelihood ratios. Recall Bayes Formula: \[P(D|+) = \frac{P(+|D) \times P(D)}{P(+|D) \times P(D) + P(+|D^c) \times P(D^c)}\]

and notice that if we replace all occurrences of ‘$D$’ with ‘$D^c$’, the denominator doesn’t change. This means that if we formed a ratio of $P(D|+)$ to $P(D^c|+)$ we’d get a much simpler expression (since the complicated denominators would cancel each other out). Like this….

\[\frac{P(D|+)}{P(D^c|+)} = \frac{P(+|D) \times P(D)}{P(+|D^c) \times P(D^c)} = \frac{P(+|D)}{P(+|D^c)} \times \frac{P(D)}{P(D^c)}\]

The left side of the equation represents the post-test odds of disease given a positive test result. The equation says that the post-test odds of disease equals the pre-test odds of disease (that is, $\frac{P(D)}{P(D^c)}$) times

1: the DLR_-
2: the DLR_+
3: I haven't a clue.

Selection: 2

Hint: Do you recognize the expression $\frac{P(+|D)}{P(+|D^c)}$? The ‘+’ signs are a big clue.

In other words, a $DLR_+$ value equal to N indicates that the hypothesis of disease is N times more supported by the data than the hypothesis of no disease.

Taking the formula above and replacing the ‘+’ signs with ‘-’ yields a formula with the DLR_-. Specifically, \[\frac{P(D|-)}{P(D^c|-)} = \frac{P(-|D)}{P(-|D^c)} \times \frac{P(D)}{P(D^c)}\]

As with the positive case, this relates the odds of disease post-test, $\frac{P(D|-)}{P(D^c|-)}$, to those of disease pre-test, $\frac{P(D)}{P(D^c)}$.

The equation $\frac{P(D|-)}{P(D^c|-)} = \frac{P(-|D)}{P(-|D^c)} \times \frac{P(D)}{P(D^c)}$ says what about the post-test odds of disease relative to the pre-test odds of disease given negative test results?

1: I haven't a clue.
2: post-test odds are greater than pre-test odds
3: post-test odds are less than pre-test odds

Selection: 3

Hint: Remember that we argued (hopefully convincingly) that $DLR_-$ is small (less than 1). $\mbox{Post-test odds} = \mbox{Pre-test odds} \times DLR_-$ so post-test odds are a fraction of the pre-test odds.

Let’s cover some basics now.

Independent Events

Two events, A and B, are independent if they have no effect on each other. Formally, $P(A \cap B) = P(A) \times P(B)$. It’s easy to see that if A and B are independent, then $P(A|B) = P(A)$. The definition is similar for random variables X and Y.

We've seen examples of independence in our previous probability lessons. Let's
review a little. What's the probability of rolling a '6' twice in a row using a
fair die?

1: 1/6
2: 1/36
3: 2/6
4: 1/2

Selection: 2
Hint: Square the probability of rolling a single '6' since the two rolls are
independent of one another.

You're given a fair die and asked to roll it twice. What's the probability that
the second roll of the die matches the first?

1: 2/6
2: 1/36
3: 1/6
4: 1/2

Selection: 3
Hint: Now the events aren't independent. You don't care what the first roll is
so that's a probability 1 event. The second roll just has to match the first, so
that's a 1/6 event.

If the chance of developing a disease with a genetic or environmental component
is p, is the chance of both you and your sibling developing that disease p * p?

1: No
2: Yes

Selection: 1
Hint: The events aren't independent since genetic or environmental factors
likely will affect the outcome.

Independent and Identically Distributed

We’ll conclude with iid. Random variables are said to be iid if they are independent and identically distributed. By independent we mean “statistically unrelated from one another”. Identically distributed means that “all have been drawn from the same population distribution”.

Random variables which are iid are the default model for random samples and many of the important theories of statistics assume that variables are iid. We’ll usually assume our samples are random and variables are iid.

Congrats! You’ve concluded this lesson on conditional probability. We hope you liked it unconditionally.

Expectations

Introduction

Expectations. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to 06_Statistical_Inference/04_Expectations.)

In this lesson, as you might expect, we’ll discuss expected values. Expected values of what, exactly?

Expected Value

The expected value of a random variable $X$, $E(X)$, is a measure of its central tendency. For a discrete random variable $X$ with PMF $p(x)$, $E(X)$ is defined as a sum, over all possible values $x$, of the quantity $x \times p(x)$. \[E(X) = \sum_{x} x \times p(x)\]

$E(X)$ represents the center of mass of a collection of locations and weights, ${x, p(x)}$.

Another term for expected value is mean. Recall your high school definition of arithmetic mean (or average) as the sum of a bunch of numbers divided by the number of numbers you added together. This is consistent with the formal definition of $E(X)$ if all the numbers are equally weighted.

Consider the random variable $X$ representing a roll of a fair dice. By ‘fair’ we mean all the sides are equally likely to appear. What is the expected value of X?
Hint: Add the numbers from 1 to 6 and divide by 6.

3.5

## [1] 3.5

expect_dice <- function(pmf) {
    mu <- 0;
    for (i in 1:6) {
        mu <- mu + i * pmf[i];
    }
    mu
}

We’ve defined a function for you, expect_dice, which takes a PMF as an input. For our purposes, the PMF is a 6-long array of fractions. The i-th entry in the array represents the probability of i being the outcome of a dice roll. Look at the function expect_dice now.

expect_dice

## function(pmf) {
##     mu <- 0;
##     for (i in 1:6) {
##         mu <- mu + i * pmf[i];
##     }
##     mu
## }

dice_fair <- c(1/6,  1/6,  1/6,  1/6,  1/6,  1/6 )
dice_high <- c(1/21, 2/21, 3/21, 4/21, 5/21, 6/21)
dice_low  <- c(6/21, 5/21, 4/21, 3/21, 2/21, 1/21)

We’ve also defined PMFs for three dice, dice_fair, dice_high and dice_low. The last two are loaded, that is, not fair. Look at dice_high now.

dice_high

## [1] 0.04761905 0.09523810 0.14285714 0.19047619 0.23809524 0.28571429

Using the function expect_dice with dice_high as its argument, calculate the expected value of a roll of dice_high.

expect_dice(dice_high)

## [1] 4.333333

See how the expected value of dice_high is higher than that of the fair dice. Now calculate the expected value of a roll of dice_low.

expect_dice(dice_low)

## [1] 2.666667

edh <- expect_dice(dice_high)
edl <- expect_dice(dice_low)

You can see the effect of loading the dice on the expectations of the rolls. For high-loaded dice the expected value of a roll (on average) is 4.33 and for low-loaded dice 2.67. We’ve stored these off for you in two variables, edh and edl. We’ll need them later.

One of the nice properties of the expected value operation is that it’s linear. This means that, if $c$ is a constant, then \[E(cX) = c \times E(X)\]

Also, if $X$ and $Y$ are two random variables then \[E(X + Y) = E(X) + E(Y)\]

It follows that: \[E(aX + bY) = aE(X) + bE(Y)\]

Suppose you were rolling our two loaded dice, dice_high and dice_low. You can use this linearity property of expectation to compute the expected value of their average. Let $X_{hi}$ and $X_{lo}$ represent the respective outcomes of the dice roll. The expected value of the average is $E(\frac{(X_{hi} + X_{lo})}{2})$ or $0.5 \times (E(X_{hi}) + E(X_{lo}))$. Compute this now. Remember we stored the expected values in edh and edl.

0.5 * (edh + edl)

## [1] 3.5

Did you expect that?

1: No
2: Yes

Selection: 2
Hint: The dice were loaded in opposite ways so their average should be fair. No?

For a continuous random variable $X$, the expected value is defined analogously as it was for the discrete case. Instead of summing over discrete values, however, the expectation integrates over a continuous function.

It follows that for continuous random variables, E(X) is the area under the function $t \times f(t)$, where $f(t)$ is the PDF (probability density function) of X. This definition borrows from the definition of center of mass of a continuous body.

par(mfrow = c(1, 2))
plot(c(-0.25, 0, 0, 1, 1, 1.25),
     c(0,     0, 1, 1, 0, 0),
     type = "l", lwd = 3, frame = FALSE, xlab="", ylab = "")
title('f(t)')

plot(c(-0.25, 0, 1, 1, 1.25),
     c(0,     0, 1, 0, 0),
     type = "l", lwd = 3, frame = FALSE, xlab="", ylab = "")
title('t f(t)')

Here’s a figure from the slides. It shows the constant (1) PDF on the left and the graph of $t \times f(t)$ on the right.

Knowing that the expected value is the area under the triangle, $t \times f(t)$, what is the expected value of the random variable with this PDF?

1: .5
2: 2.0
3: 1.0
4: .25

Selection: 1
Hint: The area of the triangle is base * height/2.

par(mfrow = c(1, 2))
plot(c(-0.25, 0, 2, 2, 2.25),
     c(0,     0, 1, 0, 0),
     type = "l", lwd = 3, frame = FALSE, xlab = "", ylab = "")
title('f(t)')

my_x <- seq(0, 2, by = 0.1)
my_y <- my_x^2/2
plot(my_y ~ my_x, type = "l", lwd = 3, frame = FALSE, xlab="", ylab = "")
title('t f(t)')
abline(v = 2.0, lwd = 3)

myfunc <- function(x){x^2/2}

For the purposes of illustration, here’s another figure using a PDF from our previous probability lesson. It shows the triangular PDF $f(t)$ on the left and the parabolic $t \times f(t)$ on the right. The area under the parabola between 0 and 2 represents the expected value of the random variable with this PDF.

To find the expected value of this random variable you need to integrate the function $t \times f(t)$. Here $f(t) = t/2$, the diagonal line. (You might recall this from the last probability lesson.) The function you’re integrating over is therefore $t^2/2$. We’ve defined a function myfunc for you representing this. You can use the R function ‘integrate’ with parameters myfunc, 0 (the lower bound), and 2 (the upper bound) to find the expected value. Do this now.

integrate(myfunc, 0, 2)

## 1.333333 with absolute error < 1.5e-14

As all the examples have shown, expected values of distributions are useful in characterizing them. The mean characterizes the central tendency of the distribution. However, often populations are too big to measure, so we have to sample them and then we have to use sample means. That’s okay because sample expected values estimate the population versions. We’ll show this first with a very simple toy and then with some simple equations.

spop <- c(1, 4, 7, 10, 13)

We’ve defined a small population of 5 numbers for you, spop. Look at it now.

spop

## [1]  1  4  7 10 13

The R function mean will give us the mean of spop. Do this now.

mean(spop)

## [1] 7

sam0 <- c(1,  4)
sam1 <- c(1,  7)
sam2 <- c(1,  10)
sam3 <- c(1,  13)
sam4 <- c(4,  7)
sam5 <- c(4,  10)
sam6 <- c(4,  13)
sam7 <- c(7,  10)
sam8 <- c(7,  13)
sam9 <- c(10, 13)
allsam <- matrix(c(sam0, sam1, sam2, sam3, sam4, sam5, sam6, sam7, sam8, sam9),
                 nrow = 10, ncol = 2, byrow = TRUE)

Suppose spop were much bigger and we couldn’t measure its mean directly and instead had to sample it with samples of size 2. There are 10 such samples, right? We’ve stored this for you in a 10 x 2 matrix, allsam. Look at it now.

allsam

##       [,1] [,2]
##  [1,]    1    4
##  [2,]    1    7
##  [3,]    1   10
##  [4,]    1   13
##  [5,]    4    7
##  [6,]    4   10
##  [7,]    4   13
##  [8,]    7   10
##  [9,]    7   13
## [10,]   10   13

Each of these 10 samples will have a mean, right? We can use the R function apply to calculate the mean of each row of the matrix allsam. We simply call apply with the arguments allsam, 1, and mean. The second argument, 1, tells ‘apply’ to apply the third argument ‘mean’ to the rows of the matrix. Try this now.

apply(allsam, 1, mean)

##  [1]  2.5  4.0  5.5  7.0  5.5  7.0  8.5  8.5 10.0 11.5

You can see from the resulting vector that the sample means vary a lot, from 2.5 to 11.5, right? Not unexpectedly, the sample mean depends on the sample.

smeans <- apply(allsam, 1, mean)

However if we take the expected value of these sample means we’ll see something amazing. We’ve stored the sample means in the array smeans for you. Use the R function mean on the array smeans now.

mean(smeans)

## [1] 7

Look familiar? The result is the same as the mean of the original population spop. This is not because the example was specially cooked. It would work on any population. The expected value or mean of the sample mean is the population mean. What this means is that the sample mean is an unbiased estimator of the population mean.

Formally, an estimator $e$ of some parameter $v$ is unbiased if its expected value equals $v$, i.e., \[E(e) = v\]

We can show that the expected value of a sample mean equals the population mean with some simple algebra.

Let $X_1, X_2, ..., X_n$ be a collection of $n$ samples from a population with mean $\mu$. The mean of these is $\frac{(X_1 + X_2 + ... + X_n)}{n}$.

What’s the expected value of the mean? Recall that $E(aX) = aE(X)$, so

\[E(\frac{X_1 + ... + X_n}{n}) = \frac{1}{n} \times (E(X_1) + E(X_2) + ... + E(X_n)) = \frac{1}{n} \times n \times \mu = \mu\]

Each $E(X_i)$ equals $\mu$ since $X_i$ is drawn from the population with mean $\mu$. We expect, on average, a random $X_i$ will equal $\mu$.

Now that was theory. We can also show this empirically with more simulations.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.2

nosim <- 10000; n <- 10
dat <- data.frame(x = c(rnorm(nosim),
                        apply(matrix(rnorm(nosim * n), nosim), 1, mean)),
                  what = factor(rep(c("Obs", "Mean"), c(nosim, nosim))))

np <- ggplot(dat, aes(x = x, fill = what)) +
      geom_density(size = 2, alpha = 0.2) 
print(np)

Here’s another figure from the slides. It shows how a sample mean and the mean of averages spike together. The two shaded distributions come from the same data. The blue portion represents the density function of randomly generated standard normal data, 100000 samples. The pink portion represents the density function of 10000 averages, each of 10 random normals. (The original data was stored in a 10000 x 10 array and the average of each row was taken to generate the pink data.)

dat <- data.frame(
  x = c(sample(1:6, nosim, replace = TRUE),
        apply(matrix(sample(1:6, nosim * 2, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1:6, nosim * 3, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1:6, nosim * 4, replace = TRUE), nosim), 1, mean)),
  size = factor(rep(1:4, rep(nosim, 4))))

g <- ggplot(dat, aes(x = x, fill = size)) +
     geom_histogram(alpha = 0.20, binwidth = 0.25, colour = "black") 
g <- g + facet_grid(. ~ size)
print(g)

Here’s another figure from the slides. Rolling a single die 10000 times yields the first figure. Each of the 6 possible outcomes appears with about the same frequency. The second figure is the histogram of outcomes of the average of rolling two dice. Similarly, the third figure is the histogram of averages of rolling three dice, and the fourth four dice. As we showed previously, the center or mean of the original distribution is 3.5 and that’s exactly where all the panels are centered.

Let’s recap. Expected values are properties of distributions. The average, or mean, of random variables is itself a random variable and its associated distribution itself has an expected value. The center of this distribution is the same as that of the original distribution.

Now let’s review!

Expected values are properties of what?

1: fulcrums
2: demanding parents
3: variances
4: distributions

Selection: 4
Hint: What would you expect to have a center?

A population mean is a center of mass of what?

1: a sample
2: a family
3: a distribution
4: a population

Selection: 4
Hint: What word appears in the question?

A sample mean is a center of mass of what?

1: a distribution
2: a population
3: a family
4: observed data

Selection: 4
Hint: If you're sampling you need to observe data, right?

True or False? A population mean estimates a sample mean.

1: True
2: False

Selection: 2
Hint: We can only sample a population and calculate the sample mean.

True or False? A sample mean is unbiased.

1: False
2: True

Selection: 2
Hint: The sample mean is the population mean, so by definition it's unbiased.

True or False? The more data that goes into the sample mean, the more
concentrated its density/mass function is around the population mean.

1: True
2: False

Selection: 1
Hint: It's better to have more data than less, right?

Congrats! You’ve concluded this lesson on expectations. We hope it met yours.

Variance

Variance. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to 06_Statistical_Inference/05_Variance.)

In this lesson, we’ll discuss variances of distributions which, like means, are useful in characterizing them. While the mean characterizes the center of a distribution, the variance and its square root, the standard deviation, characterize the distribution’s spread around the mean. As the sample mean estimates the population mean, so the sample variance estimates the population variance.

The variance of a random variable, as a measure of spread or dispersion, is, like a mean, defined as an expected value. It is the expected squared distance of the variable from its mean. Squaring the distance makes it positive so values less than and greater than the mean are treated the same. In mathematical terms, if $X$ comes from a population with mean $\mu$, then \[Var(X) = E((X - \mu)^2) = E((X - E(X))^2 ) = E(X^2) - E(X)^2\]

So variance is the difference between two expected values. Recall that $E(X)$, the expected value of a random variable from the population, is $\mu$, the mean of that population.

Higher variance implies more spread around a mean than lower variance.

Finally, it’s easy to show from the definition and the linearity of expectations that, if $a$ is a constant, then $Var(aX) = a^2 \times Var(X)$. This will come in handy later.

Would you like to see the equation proving this? (Video Link)

Here’s the proof: \[\begin{aligned} Var(aX) & = E((aX)^2) - (E(aX))^2\\ & = E(a^2X^2) - (aE(X))^2\\ & = a^2E(X^2) - a^2(E(X))^2\\ & = a^2Var(X) \end{aligned}\]

Variance of a Die Roll

Let’s practice computing the variance of a dice roll now. First we need to compute $E(X^2)$. From the definition of expected values, this means we’ll take a weighted sum over all possible values of $X^2$. The weight is the probability of $X$ occurring.

dice_fair <- c(1/6,  1/6,  1/6,  1/6,  1/6,  1/6)
dice_high <- c(1/21, 2/21, 3/21, 4/21, 5/21, 6/21)
dice_low <-  c(6/21, 5/21, 4/21, 3/21, 2/21, 1/21)
expect_dice <- function(pmf) {
  mu <- 0
  for (i in 1:6) {
    mu <- mu + i * pmf[i]
  }
  mu
}
dice_sqr <- c(1, 4, 9, 16, 25, 36)
edh <- expect_dice(dice_high)
edl <- expect_dice(dice_low)
nsim <- 1000
samsz <- 10

For convenience, we’ve defined a 6-long vector for you, dice_sqr, which holds the squares of the integers 1 through 6. This will give us the $X^2$ values. Look at it now.

dice_sqr

## [1]  1  4  9 16 25 36

Now we need weights. For these we can use any of the three PDF’s, (dice_fair, dice_high, and dice_low) we defined in the previous lesson. Using R’s ability to multiply vectors componentwise and its function ‘sum’ we can easily compute $E(X^2)$ for any of these dice. Simply sum the product dice_sqr * PDF. Try this now with dice_fair and put the result in a variable ex2_fair.

ex2_fair <- sum(dice_sqr * dice_fair)

You’re the best!

Recall that the expected value of a fair dice roll is 3.5. Subtract the square of that from ex2_fair to compute the sample variance.

ex2_fair - 3.5^2

## [1] 2.916667

Now use a similar approach to compute the sample variance of dice_high in one step. Sum the appropriate product and subtract the square of the mean. Recall that edh holds the expected value of dice_high.

sum(dice_sqr * dice_high) - edh^2

## [1] 2.222222

Note that when we talk about variance we’re using square units. Because it is often more useful to use measurements in the same units as X we define the standard deviation of X as the square root of Var(X).

Distributions with Increasing Variance

library(ggplot2)
xvals <- seq(-10, 10, by = 0.01)
dat <- data.frame(y = c(dnorm(xvals, mean = 0, sd = 1),
                        dnorm(xvals, mean = 0, sd = 2),
                        dnorm(xvals, mean = 0, sd = 3),
                        dnorm(xvals, mean = 0, sd = 4)),
                  x = rep(xvals, 4),
                  factor = factor(rep(1:4, rep(length(xvals), 4))))

g <- ggplot(dat, aes(x = x, y = y, color = factor)) +
     geom_line(size = 2)    
print(g)

Here’s a figure from the slides. It shows several normal distributions all centered around a common mean 0, but with different standard deviations. As you can see from the color key on the right, the thinner the bell the smaller the standard deviation and the bigger the standard deviation the fatter the bell.

Just as we distinguished between a population mean and a sample mean we have to distinguish between a population variance $\sigma^2$ and a sample variance $s^2$. They are defined similarly but with a slight difference. The sample variance is defined as the sum of $n$ squared distances from the sample mean divided by $(n - 1)$, where $n$ is the number of samples or observations. We divide by $n - 1$ because this is the number of degrees of freedom in the system. The first $n - 1$ samples or observations are independent given the mean. The last one isn’t independent since it can be calculated from the sample mean used in the formula.

In other words, the sample variance is ALMOST the average squared deviation from the sample mean.

As with the sample mean, the sample variance is also a random variable with an associated population distribution. Its expected value or mean is the population variance and its distribution gets more concentrated around the population variance with more data. The sample standard deviation is the square root of the sample variance.

Variance Simulation Experiment

nosim <- 10000
dat <- data.frame(x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var)),
                  n = factor(rep(c("10"), c(nosim))))

g <- ggplot(dat, aes(x = x, fill = n)) +
     geom_density(size = 2, alpha = 0.6) +
     geom_vline(xintercept = 1, size = 2) 
print(g)

To illustrate this point, consider this figure which plots the distribution of 10000 variances, Each variance was computed on a sample of standard normals of size 10. The vertical line indicates the standard deviation 1.

dat <- data.frame(x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
                        apply(matrix(rnorm(nosim * 20), nosim), 1, var)),
                  n = factor(rep(c("10", "20"), c(nosim, nosim))))

g <- ggplot(dat, aes(x = x, fill = n)) +
     geom_density(size = 2, alpha = 0.4) +
     geom_vline(xintercept = 1, size = 2) 
print(g)

Here we do the same experiment but this time (the taller lump) each of the 10000 variances is over 20 standard normal samples. We’ve plotted over the first plot (the shorter lump) and you can see that the distribution of the variances is getting tighter and shifting closer to the vertical line.

dat <- data.frame(x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
                        apply(matrix(rnorm(nosim * 20), nosim), 1, var),
                        apply(matrix(rnorm(nosim * 30), nosim), 1, var)),
                  n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim))))

g <- ggplot(dat, aes(x = x, fill = n)) +
     geom_density(size = 2, alpha = 0.2) +
     geom_vline(xintercept = 1, size = 2) 
print(g)

Finally, we repeat the experiment using 30 samples for each of the 10000 variances. You can see that with more data, the distribution gets more concentrated around the population variance it is trying to estimate.

Now recall that the means of unbiased estimators equal the values they’re trying to estimate. We can infer from the above that the sample variance is an unbiased estimator of population variance.

Recall that the average of random samples from a population is itself a random variable with a distribution centered around the population mean. Specifically, $E(X') = \mu$, where $X'$ represents a sample mean and $\mu$ is the population mean.

We can show that, if the population is infinite, the variance of the sample mean is the population variance divided by the sample size. Specifically, $Var(X') = \sigma^2/n$. Let’s work through this in four short steps.

Which of the following does $Var(X')$ equal? Here $X'$ represents the sample mean and $\sum(X_i)$ represents the sum of the $n$ samples $X_1, ..., X_n$. Assume these samples are independent.

1: $\mu$
2: $E(\frac{1}{n} \times \sum{X_i})$
3: $\sigma$
4: $Var(\frac{1}{n} \times \sum{X_i})$

Selection: 4
Hint: Which of the choices has both Var and the definition of mean in it?

Which of the following does $Var(\frac{1}{n} \times \sum{X_i})$ equal?

1: $\frac{1}{n^2} \times E(\sum{X_i})$
2: $\frac{1}{n^2} \times Var(\sum{X_i})$
3: $\frac{\mu}{n^2}$
4: $\frac{\sigma}{n}$

Selection: 2
Hint: Remember that fact about Var that we said would be useful before? Now is the time to use it.

Recall that $Var$ is an expected value and expected values are linear. Also recall that our samples $X_1, X_2, ..., X_n$ are independent. What does $Var(\sum{X_i})$ equal?

1: $\sum Var(X_i)$
2: $Var(\sigma)$
3: $E(\mu)$
4: $E(\sum{X_i})$

Selection: 1
Hint: By linearity, the variance of the sum equals the sum of the variance.

Finally, each $X_i$ comes from a population with variance $\sigma^2$. What does $\sum{Var(X_i)}$ equal? As before, $\sum$ is taken over $n$ values.

1: $n \times E(\mu)$
2: $n \times \mu$
3: $n^2 \times Var(\sigma)$
4: $n \times \sigma^2$

Selection: 4
Hint: $Var(X_i)$ is the constant value $\sigma^2$ and we’re summing over $n$ of them.

So we’ve shown that \[\begin{aligned} Var(X') & = Var(\frac{1}{n} \times \sum{X_i})\\ & = \frac{1}{n^2} \times Var(\sum{X_i})\\ & = \frac{1}{n^2} \times \sum{\sigma^2}\\ & = \frac{\sigma^2}{n} \end{aligned}\]

for infinite populations when our samples are independent.

The standard deviation of a statistic is called its standard error, so the standard error of the sample mean is the square root of its variance.

We just showed that the variance of a sample mean is $\frac{\sigma^2}{n}$ and we estimate it with $\frac{s^2}{n}$. It follows that its square root, $\frac{s}{\sqrt n}$, is the standard error of the sample mean.

The sample standard deviation, $s$, tells us how variable the population is, and $\frac{s}{\sqrt n}$, the standard error, tells us how much averages of random samples of size n from the population vary. Let’s see this with some simulations.

The R function rnorm(n, mean, sd) generates $n$ independent (hence uncorrelated) random normal samples with the specified mean and standard deviation. The defaults for the latter are mean 0 and standard deviation 1. Type the expression sd(apply(matrix(rnorm(10000), 1000), 1, mean)) at the prompt.

sd(apply(matrix(rnorm(10000), 1000), 1, mean))

## [1] 0.3198572

This returns the standard deviation of 1000 averages, each of a sample of 10 random normal numbers with mean 0 and standard deviation 1. The theory tells us that the standard error, $\frac{s}{\sqrt n}$, of the sample means indicates how much averages of random samples of size $n$ (in this case 10) vary. Now compute $\frac{1}{\sqrt{10}}$ to see if it matches the standard deviation we just computed with our simulation.

1/sqrt(10)

## [1] 0.3162278

Pretty close, right? Let’s try a few more. Standard uniform distributions have variance 1/12. The theory tells us the standard error of means of independent samples of size n would have which standard error?

1: $\frac{1}{\sqrt{12 \times n}}$
2: $\frac{1}{12 \times \sqrt n}$
3: I haven’t a clue
4: $\frac{12}{\sqrt n}$

Selection: 1
Hint: In this case $s$ is the $\sqrt{\frac{1}{12}}$. Divide this by $\sqrt n$.

Compute $\frac{1}{\sqrt{120}}$. This would be the standard error of the means of uniform samples of size 10.

1/sqrt(120)

## [1] 0.09128709

Now check it as we did before. Use the expression sd(apply(matrix(runif(10000), 1000), 1, mean)).

sd(apply(matrix(runif(10000), 1000), 1, mean))

## [1] 0.0937844

Pretty close again, right? Poisson(4) are distributions with variance 4; what standard error would means of random samples of n Poisson(4) have?

1: $2 \times \sqrt n$
2: I haven’t a clue
3: $\frac{2}{\sqrt n}$
4: $\frac{1}{\sqrt{2 \times n}}$

Selection: 3
Hint: In this case $s$ is 2. Divide this by $\sqrt n$.

We’ll do another simulation to test the theory. First, assume you’re taking averages of 10 Poisson(4) samples and compute the standard error of these means. Use the formula you just chose.

2/sqrt(10)

## [1] 0.6324555

Now check it as we did before. Use the expression sd(apply(matrix(rpois(10000, 4), 1000), 1, mean)).

sd(apply(matrix(rpois(10000, 4), 1000), 1, mean))

## [1] 0.633038

Like magic, right? One final test. Fair coin flips have variance 0.25; means of random samples of $n$ coin flips have what standard error?

1: $2 \times \sqrt n$
2: $\frac{1}{2 \times \sqrt n}$
3: $\frac{1}{\sqrt{2 \times n}}$
4: $\frac{2}{\sqrt n}$
5: I haven’t a clue

Selection: 2
Hint: In this case $s$ is $1/2$ ($\sqrt{1/4}$), the variance. Divide this by $\sqrt n$.

You know the drill. Assume you’re taking averages of 10 coin flips and compute the standard error of these means with the theoretical formula you just picked.

1/(2 * sqrt(10))

## [1] 0.1581139

Now check it as we did before. Use the expression sd(apply(matrix(sample(0:1, 10000, TRUE), 1000), 1, mean)).

sd(apply(matrix(sample(0:1, 10000, TRUE), 1000), 1, mean))

## [1] 0.1560107

Finally, here’s something interesting. Chebyshev’s inequality helps interpret variances. It states that the probability that a random variable $X$ is at least $k$ standard deviations from its mean is less than $\frac{1}{k^2}$. In other words, the probability that $X$ is at least $2$ standard deviations from the mean is less than 1/4, 3 standard deviations 1/9, 4 standard deviations 1/16, etc.

However this estimate is quite conservative for random variables that are normally distributed, that is, with bell-curve distributions. In these cases, the probability of being at least 2 standard deviations from the mean is about 5% (as compared to Chebyshev’s upper bound of 25%) and the probability of being at least 3 standard deviations from the mean is roughly 0.2%.

Suppose you had a measurement that was 4 standard deviations from the
distribution's mean. What would be the upper bound of the probability of this
happening using Chebyshev's inequality?

1: 11%
2: 96%
3: 6%
4: 0%
5: 25%

Selection: 3
Hint: Chebyshev's inequality estimates that probability as 1/16. Convert this to
a probability.

Now to review. The sample variance estimates what?

1: population variance
2: population
3: sample mean
4: sample standard deviation

Selection: 1
Hint: Which choice has the word variance in it?

The distribution of the sample variance is centered at what?

1: sample standard deviation
2: population variance
3: sample mean
4: population

Selection: 2
Hint: What is the sample variance estimating?

True or False - The sample variance gets more concentrated around the population
variance with larger sample sizes

1: True
2: False

Selection: 1
Hint: Is more data better than less data?

The variance of the sample mean is the population variance divided by ?

1: I haven't a clue
2: sqrt(n)
3: n
4: n^2

Selection: 3
Hint: Remember the 4 step proof starting with Var(X') = ...? The last step had
an n divided by an n^2.

The standard error of the sample mean is the sample standard deviation s divided
by ?

1: sqrt(n)
2: n^2
3: I haven't a clue
4: n

Selection: 1
Hint: Remember the many many examples we went through. The sqrt(n) figured
prominently in them.

Congrats! You’ve concluded this vary long lesson on variance. We hope you liked it vary much.

Common Distributions

Common Distributions. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to 06_Statistical_Inference/06_CommonDistros.)

Given the title of this lesson, what do you think it will cover?

1: Common Distributions
2: Rare Distributions
3: Common Bistros
4: I haven't a clue

Selection: 1
Hint: Part of the title is an abbreviation for another word we've seen several
times in earlier lessons.

Binomial/Bernoulli Distribution

The first distribution we’ll examine is the Bernoulli which is associated with experiments which have only 2 possible outcomes. These are also called (by people in the know) binary trials.

It might surprise you to learn that you've probably had experience with
Bernoulli trials. Which of the following would be a Bernoulli trial?

1: Tossing a die
2: Spinning a roulette wheel
3: Drawing a card from a deck
4: Flipping a coin

Selection: 4
Hint: Which of the choices has only two possible outcomes?

For simplicity, we usually say that Bernoulli random variables take only the
values 1 and 0. Suppose we also specify that the probability that the Bernoulli
outcome of 1 is p. Which of the following represents the probability of a 0
outcome?

1: p^2
2: 1-p
3: p(1-p)
4: p

Selection: 2
Hint: Recall that the sum of the probabilities of all the outcomes is 1.

If the probability of a 1 is p and the probability of a 0 is 1-p which of the following represents the PMF of a Bernoulli distribution? Recall that the PMF is the function representing the probability that X=x.

1: $p^x \times (1 - p)^{1 - x}$
2: $p \times (1 - p)$
3: $p^{1 - x} \times (1 - p) \times (1 - x)$
4: $x \times (1 - x)$

Selection: 1
Hint: When x = 1, which of the given expressions yields p?

Recall the definition of the expectation of a random variable. Suppose we have
a Bernoulli random variable and, as before, the probability it equals 1 (a 
success) is p and probability it equals 0 (a failure) is 1 - p. What is its
mean?

1: p^2
2: p
3: 1-p
4: p(1-p)

Selection: 2
Hint: Add the two terms x * p(x) where x equals 0 and 1 respectively.

Given the same Bernoulli random variable above, which of the following
represents E(X^2)

1: p(1-p)
2: p
3: (1-p)^2
4: 1-p
5: p^2

Selection: 2
Hint: Add the two terms x^2 * p(x) where x equals 0 and 1 respectively.

Use the answers of the last two questions to find the variance of the Bernoulli random variable. Recall $Var = E(X^2) - E(X)^2$

1: $p^2 \times (1-p)^2$
2: $p(1 - p)$
3: $p^2 - p$
4: $p(p - 1)$

Selection: 2
Hint: $E(X^2) = p$ and $E(X) = p$, so $Var = p - p^2$. Rewrite this expression by factoring out the p.

Binomial random variables are obtained as the sum of iid Bernoulli trials. Specifically, let $X_1, ..., X_n$ be iid Bernoulli(p) random variables; then $X = X_1 + X_2 + ... + X_n$ is a binomial random variable. Binomial random variables represent the number of successes, $k$, out of $n$ independent Bernoulli trials. Each of the trials has probability $p$.

The PMF of a binomial random variable $X$ is the function representing the probability that $X = x$. In other words, that there are $x$ successes out of $n$ independent trials.

Which of the following represents the PMF of a binomial distribution? Here $x$, the number of successes, goes from $0$ to $n$, the number of trials, and $\binom{n}{x}$ represents the binomial coefficient ‘n choose x’ which is the number of ways $x$ successes out of $n$ trials can occur regardless of order.

1: $\binom{n}{x} p^x (1 - p) ^ {n - x}$
2: $\binom{n}{x} px(1 - p)(1 - x)$
3: $\binom{n}{x} p^{n - x} (1 - p)^x$
4: $p^x$

Selection: 1
Hint: To take the value $x$, the random variable $X$ must have x ‘successes’. Each of these occurs with probability $p$. It also must have $n - x$ ‘failures’, each of which occurs with probability $(1 - p)$. We don’t care about the order in which the successes and failures occur so we have to multiply by $\binom{n}{x}$.

Suppose we were going to flip a biased coin 5 times. The probability of tossing a head is 0.8 and a tail 0.2. What is the probability that you’ll toss at least 3 heads.
Hint: You’ll have to add together 3 terms each of the form, $\binom{5}{x} 0.8^x 0.2^{5 - x}$ for $x = 3, 4, 5$.

choose(5, 3) * 0.8^3 * 0.2^2 +
choose(5, 4) * 0.8^4 * 0.2^1 +
choose(5, 5) * 0.8^5 * 0.2^0

## [1] 0.94208

Now you can verify your answer with the R function pbinom. The quantile is 2, the size is 5, the prob is 0.8 and the lower.tail is FALSE. Try it now.

pbinom(2, 5, 0.8, lower.tail = FALSE)

## [1] 0.94208

Normal/Gaussian Distribution

Another very common distribution is the normal or Gaussian. It has a complicated density function involving its mean $\mu$ and variance $\sigma^2$. The key fact of the density formula is that when plotted, it forms a bell shaped curve, symmetric about its mean $\mu$. The variance $\sigma^2$ corresponds to the width of the bell, the higher the variance, the fatter the bell. We denote a normally distributed random variable $X$ as $X \sim N(\mu, \sigma^2)$.

When $\mu = 0$ and $\sigma = 1$ the resulting distribution is called the standard normal distribution and it is often labeled $Z$.

library(ggplot2)

x <- seq(-4, 4, length = 1000)
dat <- data.frame(x = x, y = dnorm(x))

g <- ggplot(dat, aes(x = x, y = y)) +
     geom_line(size = 1.5) +
     geom_vline(xintercept = -4:4, size = 1)
print(g)

Here’s a picture of the density function of a standard normal distribution. It’s centered at its mean 0 and the vertical lines (at the integer points of the x-axis) indicate the standard deviations.

g <- ggplot(dat, aes(x = x, y = y)) +
     geom_line(size = 1.5) +
     geom_ribbon(aes(x = ifelse(x > -1 & x < 1, x, 0), ymin = 0, ymax = dat$y),
                 fill = "red", alpha = 1) +
     geom_ribbon(aes(x = ifelse(x > -2 & x < 2, x, 0), ymin = 0, ymax = dat$y),
                 fill = "red", alpha = 0.5) +
     geom_ribbon(aes(x = ifelse(x > -3 & x < 3, x, 0), ymin = 0, ymax = dat$y),
                 fill = "red", alpha = 0.35)
print(g)

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

Approximately 68%, 95% and 99% of the normal density lie within 1, 2, and 3 standard deviations from the mean, respectively. These are shown in the three shaded areas of the figure. For example, the darkest portion (between -1 and 1) represents 68% of the area.

The R function qnorm(prob) returns the value of $x$ (quantile) for which the area under the standard normal distribution to the left of $x$ equals the parameter prob. (Recall that the entire area under the curve is 1.) Use qnorm now to find the $10^{th}$ percentile of the standard normal. Remember the argument prob must be between 0 and 1. You don’t have to specify any of the other parameters since the default is the standard normal.

qnorm(0.1)

## [1] -1.281552

g <- g + geom_vline(xintercept = qnorm(0.1), size = 3, colour="blue")
print(g)

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

We’ll see this now by drawing the vertical line at the quantile -1.281552.

Which of the following would you expect to be the 1st percentile?

1: 0
2: 2.33
3: -1.28
4: -2.33
5: -1.0

Selection: 4
Hint: Since 1 is smaller than 10 the quantile for the 1st percentile should be
smaller than the quantile for 10th percentile.

By looking at the picture can you say what the 50th percentile is?
Hint: What point x marks the halfway point of the graph?

## [1] 0

We can use the symmetry of the bell curve to determine other quantiles. Given
that 2.5% of the area under the curve falls to the left of x = -1.96, what is
the 97.5 percentile for the standard normal?

1: -1.28
2: 2
3: 2.33
4: 1.96

Selection: 4
Hint: 2.5% of the area falls to the right of the quantile of the 97.5 percentile.

Here are two useful facts concerning normal distributions. If X is a normal random variable with mean $\mu$ and variance $\sigma^2$, i.e., $X \sim N(\mu, \sigma^2)$, then the random variable $Z$ defined as $Z = \frac{X - \mu}{\sigma}$ is normally distributed with mean 0 and variance 1, i.e., $Z \sim N(0, 1)$. (Z is standard normal.)

The converse is also true. If Z is standard normal, i.e., $Z \sim N(0, 1)$, then the random variable $X$ defined as $X = \mu + \sigma Z$ is normally distributed with mean $\mu$ and variance $\sigma^2$, i.e., $X \sim N(\mu, \sigma^2)$

These formulae allow you to easily compute quantiles (and thus percentiles) for ANY normally distributed variable if you know its mean and variance. We’ll show how to find the $97.5^{th}$ percentile of a normal distribution with mean 3 and variance 4.

Again, we can use R’s qnorm function and simply specify the mean and standard deviation (the square root of the variance). Do this now. Find the 97.5^{th} percentile of a normal distribution with mean 3 and standard deviation 2.

qnorm(0.975, 3, 2)

## [1] 6.919928

Let’s check it using the formula above, $X = \mu + \sigma Z$. Here we’ll use the $97.5^{th}$ percentile for the standard normal as the value $Z$ in the formula. Recall that we previously calculated this to be $1.96$. Let’s multiply this by the standard deviation of the given normal distribution (2) and add in its mean (3) to see if we get a result close to the one qnorm gave us.

(1.96 * 2) + 3

## [1] 6.92

Suppose you have a normal distribution with mean 1020 and standard deviation of 50 and you want to compute the probability that the associated random variable X > 1200. The easiest way to do this is to use R’s pnorm function in which you specify the quantile (1200), the mean (1020) and standard deviation (50). You also must specify that the lower.tail is FALSE since we’re asking for a probability that the random variable is greater than our quantile. Do this now.

pnorm(1200, 1020, 50, lower.tail = FALSE)

## [1] 0.0001591086

Alternatively, we could use the formula above to transform the given distribution to a standard normal. We compute the number of standard deviations the specified number (1200) is from the mean with $Z = \frac{(X - \mu)}{\sigma}$. This is our new quantile. We can then use the standard normal distribution and the default values of pnorm. Remember to specify that lower.tail is FALSE. Do this now.

pnorm((1200 - 1020)/50, lower.tail = FALSE)

## [1] 0.0001591086

For practice, using the same distribution, find the 75% percentile. Use qnorm and specify the probability (0.75), the mean (1020) and standard deviation (50). Since we want to include the left part of the curve we can use the default lower.tail = TRUE.

qnorm(0.75, 1020, 50)

## [1] 1053.724

Note that R functions pnorm and qnorm are inverses. What would you expect pnorm(qnorm(0.53)) to return?

0.53

## [1] 0.53

How about qnorm(pnorm(.53))?

0.53

## [1] 0.53

Poisson Distribution

Now let’s talk about our last common distribution, the Poisson. This is, as Wikipedia tells us, “a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.”

In other words, the Poisson distribution models counts or number of event in some interval of time. From Wikipedia, “Any variable that is Poisson distributed only takes on integer values.”

The PMF of the Poisson distribution has one parameter, $\lambda$. As with the other distributions the PMF calculates the probability that the Poisson distributed random variable $X$ takes the value $x$. Specifically, \[P(X = x) =\frac{\lambda^x e^{-\lambda}}{x!}\]. Here $x$ ranges from 0 to $\infty$.

The mean and variance of the Poisson distribution are both $\lambda$.

Poisson random variables are used to model rates such as the rate of hard drive failures. We write $X \sim Poisson(\lambda t)$ where $\lambda$ is the expected count per unit of time and $t$ is the total monitoring time.

For example, suppose the number of people that show up at a bus stop is Poisson with a mean of 2.5 per hour, and we want to know the probability that at most 3 people show up in a 4 hour period. We use the R function ppois which returns a probability that the random variable is less than or equal to 3. We only need to specify the quantile (3) and the mean (2.5 * 4). We can use the default parameters, lower.tail = TRUE and log.p = FALSE. Try it now.

ppois(3, 2.5 * 4)

## [1] 0.01033605

Finally, the Poisson distribution approximates the binomial distribution in certain cases. Recall that the binomial distribution is the discrete distribution of the number of successes, $k$, out of $n$ independent binary trials, each with probability $p$. If $n$ is large and $p$ is small then the Poisson distribution with $\lambda = n p$ is a good approximation to the binomial distribution.

To see this, use the R function pbinom to estimate the probability that you’ll see at most 5 successes out of 1000 trials each of which has probability 0.01. As before, you can use the default parameter values (lower.tail = TRUE and log.p = FALSE) and just specify the quantile, size, and probability.

pbinom(5, 1000, 0.01)

## [1] 0.06613951

Now use the function ppois with quantile equal to 5 and lambda equal to n * p to see if you get a similar result.

ppois(5, 1000 * 0.01)

## [1] 0.06708596

See how they’re close? Pretty cool, right? This worked because n was large (1000) and p was small (0.01).

Congrats! You’ve concluded this uncommon lesson on common distributions.

Asymptotics

Asymptotics. (Slides for this and other Data Science courses may be found at github. If you care to use them, they must be downloaded as a zip file and viewed locally. This lesson corresponds to 07_Statistical_Inference/07_Asymptopia.)

In this lesson, we’ll discuss asymptotics, a topic which describes how statistics behave as sample sizes get very large and approach infinity. Pretending sample sizes and populations are infinite is useful for making statistical inferences and approximations since it often leads to a nice understanding of procedures.

Asymptotics generally give no assurances about finite sample performance, but they form the basis for frequency interpretation of probabilities (the long run proportion of times an event occurs).

Recall our simulations and discussions of sample means in previous lessons. We
can now talk about the distribution of sample means of a collection of iid
observations. The mean of the sample mean estimates what?

1: population mean
2: sigma^2/n
3: standard error
4: population variance

Selection: 1
Hint: Which choice has the word 'mean' in it?

Law of Large Numbers

The Law of Large Numbers (LLN) says that the average (mean) approaches what it’s estimating. We saw in our simulations that the larger the sample size the better the estimation. As we flip a fair coin over and over, it eventually converges to the true probability of a head (0.5).

The LLN forms the basis of frequency style thinking.

library(ggplot2)

coinPlot <- function(n) {
    means <- cumsum(sample(0:1, n, replace = TRUE))/(1:n)
    g <- ggplot(data.frame(x = 1:n, y = means), aes(x = x, y = y)) + 
         geom_hline(size = 1.5, yintercept = 0.5, alpha = 0.6,
                    linetype = "longdash") +
         geom_line(size = 1)
    
    if(n < 100) {
        g <- g + geom_point(colour = "red", size = 3, alpha = 0.8)
    }

    g <- g + labs(x = "Number of obs", y = "Cumulative mean") +
             scale_x_continuous(breaks = seq(0, n + 1, ceiling(n/10)))
    print(g)
    
    invisible()
}

To see this in action, we’ve copied some code from the slides and created the function coinPlot. It takes an integer $n$ which is the number of coin tosses that will be simulated. As coinPlot does these coin flips it computes the cumulative sum (assuming heads are 1 and tails 0), but after each toss it divides the cumulative sum by the number of flips performed so far. It then plots this value for each of the $k = 1, ..., n$ tosses. Try it now for n = 10.

coinPlot(10)

Your output depends on R’s random number generator, but your plot probably jumps around a bit and, by the $10^{th}$ flip, your $\frac{\mbox{cumulative sum}}{10}$ is probably different from mine. If you did this several times, your plots would vary quite a bit. Now call coinPlot again, this time with 10000 as the argument.

coinPlot(10000)

See the difference? Asymptotics in Action! The line approaches its asymptote of 0.5. This is the probability you expect since what we’re plotting, the $\frac{\mbox{cumulative sum}}{\mbox{number of flips}}$, represents the probability of the coin landing on heads. As we know, this is 0.5 .

We say that an estimator is CONSISTENT if it converges to what it’s trying to estimate. The LLN says that the sample mean of iid samples is consistent for the population mean. This is good, right?

Based on our previous lesson do you think the sample variance (and hence sample
deviation) are consistent as well?

1: Yes
2: No

Selection: 1
Hint: Recall our simulations of sample variances and how, as we increased the
sample size, they converged to the population variance. Sounds like consistency,
right?

Central Limit Theorem

Now for something really important - the CENTRAL LIMIT THEOREM (CLT) - one of the most important theorems in all of statistics. It states that the distribution of averages of iid variables (properly normalized) becomes that of a standard normal as the sample size increases.

Let’s dissect this to see what it means. First, ‘properly normalized’ means that you transformed the sample mean $X'$. You subtracted the population mean $\mu$ from it and divided the difference by $\sigma/\sqrt n$. Here $\sigma$ is the standard deviation of the population and $n$ is the sample size.

Second, the CLT says that for large $n$, this normalized variable, $\frac{X' - \mu}{\sigma/\sqrt n}$ is almost normally distributed with mean 0 and variance 1. Remember that $n$ must be large for the CLT to apply.

Do you recognize $\sigma/\sqrt n$ from our lesson on variance? Since the population std deviation $\sigma$ is unknown, $\sigma/\sqrt n$ is often approximated by what?

1: the standard error of the sample mean
2: the mean of the population
3: I give up
4: the variance of the population

Selection: 1
Hint: Recall our many simulation experiments in the variance lesson where we calculated standard deviations of means using R’s sd function, then we calculated an approximation using a formula involving the population variance and the square root of the sample size.

Let’s rephrase the CLT. Suppose $X_1, X_2, ..., X_n$ are independent, identically distributed random variables from an infinite population with mean $\mu$ and variance $\sigma^2$. Then if $n$ is large, the mean of the X’s, call it $X'$, is approximately normal with mean $\mu$ and variance $\sigma^2/n$. We denote this as $X' \sim N(\mu, \sigma^2/n)$.

nosim <- 1000
cfunc <- function(x, n) (mean(x) - 3.5)/(1.71/sqrt(n))
dat <- data.frame(x = c(apply(matrix(sample(1:6, nosim * 10, replace = TRUE),
                                     nosim), 1, cfunc, 10),
                        apply(matrix(sample(1:6, nosim * 20, replace = TRUE),
                                     nosim), 1, cfunc, 20),
                        apply(matrix(sample(1:6, nosim * 30, replace = TRUE),
                                     nosim), 1, cfunc, 30)),
                  size = factor(rep(c(10, 20, 30), rep(nosim, 3))))

g <- ggplot(dat, aes(x = x, fill = size)) +
     geom_histogram(alpha = 0.20, binwidth = 0.3, colour = "black",
                    aes(y = ..density..)) +
     stat_function(fun = dnorm, size = 2) +
     facet_grid(. ~ size)
print(g)

To show the CLT in action consider this figure from the slides. It presents 3 histograms of 1000 averages of dice rolls with sample sizes of 10, 20 and 30 respectively. Each average of $n$ dice rolls (n = 10,20,30) has been normalized by subtracting off the mean (3.5) then dividing by the standard error, $\sqrt{2.92/n}$. The normalization has made each histogram look like a standard normal, i.e., mean 0 and standard deviation 1.

Notice that the CLT said nothing about the original population being normally distributed. That’s precisely where its usefulness lies! We can assume normality of a sample mean no matter what kind of population we have, as long as our sample size is large enough and our samples are independent. Let’s look at how it works with a binomial experiment like flipping a coin.

Recall that if the probability of a head (call it 1) is $p$, then the probability of a tail (0) is $1 - p$. The expected value then is $p$ and the variance is $p - p^2$ or $p(1 - p)$. Suppose we do $n$ coin flips and let $p'$ represent the average of these $n$ flips. We normalize $p'$ by subtracting the mean $p$ and dividing by the std deviation $\sqrt{\frac{p(1 - p)}{n}}$. Let’s do this for 1000 trials and plot the resulting histogram.

nosim <- 1000
cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5) 
dat <- data.frame(x = c(apply(matrix(sample(0:1, nosim * 10, replace = TRUE),
                                     nosim), 1, cfunc, 10),
                        apply(matrix(sample(0:1, nosim * 20, replace = TRUE), 
                                     nosim), 1, cfunc, 20),
                        apply(matrix(sample(0:1, nosim * 30, replace = TRUE), 
                                     nosim), 1, cfunc, 30)),
                  size = factor(rep(c(10, 20, 30), rep(nosim, 3))))

g <- ggplot(dat, aes(x = x, fill = size)) +
     geom_histogram(binwidth = 0.3, colour = "black", aes(y = ..density..)) +
     stat_function(fun = dnorm, size = 2) +
     facet_grid(. ~ size)
print(g)

Here’s a figure from the slides showing the results of 3 such trials where each trial is for a different value of n (10, 20, and 30) and the coin is fair, so $E(X) = 0.5$ and the standard error is $\frac{1}{2 \sqrt n}$. Notice how with larger n (30) the histogram tightens up around the mean 0.

nosim <- 1000
cfunc <- function(x, n) (mean(x) - 0.9)/sqrt((0.1 * 0.9)/n)
dat <- data.frame(
           x = c(apply(matrix(sample(0:1, prob = c(0.1, 0.9), nosim * 10,
                                     replace = TRUE), nosim), 1, cfunc, 10),
                 apply(matrix(sample(0:1, prob = c(0.1, 0.9), nosim * 20,
                                     replace = TRUE), nosim), 1, cfunc, 20),
                 apply(matrix(sample(0:1, prob = c(0.1, 0.9), nosim * 30,
                                     replace = TRUE), nosim), 1, cfunc, 30)),
           size = factor(rep(c(10, 20, 30), rep(nosim, 3))))

g <- ggplot(dat, aes(x = x, fill = size)) +
     geom_histogram(binwidth = 0.3, colour = "black", aes(y = ..density..)) + 
     stat_function(fun = dnorm, size = 2) +
     facet_grid(. ~ size)
print(g)

Here’s another plot from the slides of the same experiment, this time using a biassed coin. We set the probability of a head to 0.9, so $E(X) = 0.9$ and the standard error is $\sqrt{0.09/n}$ Again, the larger the sample size the more closely the distribution looks normal, although with this biassed coin the normal approximation isn’t as good as it was with the fair coin.

Confidence Intervals

Now let’s talk about confidence intervals.

x <- seq(-4, 4, length = 1000)
dat <- data.frame(x = x, y = dnorm(x))

g <- ggplot(dat, aes(x = x, y = y)) +
     geom_line(size = 1.5) +
     geom_ribbon(aes(x = ifelse(x > -1 & x < 1, x, 0), ymin = 0, ymax = dat$y),
                 fill = "red", alpha = 1) +
     geom_ribbon(aes(x = ifelse(x > -2 & x < 2, x, 0), ymin = 0, ymax = dat$y),
                 fill = "red", alpha = 0.5)
print(g)

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

## Warning: Use of `dat$y` is discouraged. Use `y` instead.

We know from the CLT that for large $n$, the sample mean is normal with mean $\mu$ and standard deviation $\sigma/\sqrt n$. We also know that 95% of the area under a normal curve is within two standard deviations of the mean. This figure, a standard normal with $\mu = 0$ and $\sigma = 1$, illustrates this point; the entire shaded portion depicts the area within 2 standard deviations of the mean and the darker portion shows the 68% of the area within 1 standard deviation.

It follows that 5% of the area under the curve is not shaded. By symmetry of the curve, only 2.5% of the data is greater than the mean + 2 standard deviations $(\mu + \frac{2 \sigma}{\sqrt n})$ and only 2.5% is less than the mean - 2 standard deviations $(\mu - \frac{2 \sigma}{\sqrt n})$.

So the probability that the sample mean $P(X' > \mu + \frac{2 \sigma}{\sqrt n} | X' < \mu - \frac{2 \sigma}{\sqrt n}) = 5$% Equivalently, the probability of being between these limits is 95%. Of course we could have different sizes of intervals. If we wanted something other than 95, then we would use a quantile other than 2.

The quantity $X' \pm \frac{2 \sigma}{\sqrt n}$ is called a 95% interval for $\mu$. The 95% says that if one were to repeatedly get samples of size $n$, about 95% of the intervals obtained would contain $\mu$, the quantity we’re trying to estimate.

Note that for a 95% confidence interval we divide (100 - 95) by 2 (since we have
both left and right tails) and add the result to 95 to compute the quantile we
need. The 97.5 quantile is actually 1.96, but for simplicity it's often just
rounded up to 2. If you wanted to find a 90% confidence interval what quantile
would you want?

1: 100
2: 95
3: 85
4: 90

Selection: 2
Hint: Divide (100 - 90) by 2 and add this result to 90.

Use the R function qnorm to find the $95^{th}$ quantile for a standard normal distribution. Remember that qnorm takes a probability as an input. You can use default values for all the other arguments.

qnorm(0.95)

## [1] 1.644854

As we’ve seen before, in a binomial distribution in which $p$ represents the probability or proportion of success, the variance $\sigma^2 = p(1 - p)$, so the standard error of the sample mean $p' = \sqrt{\frac{p(1 - p)}{n}}$ where $n$ is the sample size. The 95% confidence interval of $p$ is then $p' \pm 2 \sqrt{\frac{p(1 - p)}{n}}$. The 2 in this formula represents what?

1: the mean of p’
2: the standard error of p’
3: the variance of p’
4: the approximate 97.5% normal quantile

Selection: 4
Hint: “Recall the formula for the interval $p' \pm qnorm \times \frac{\sigma}{\sqrt n}$”

A critical point here is that we don’t know the true value of p; that’s what we’re trying to estimate. How can we compute a confidence interval if we don’t know $p(1 - p)$? We could be conservative and try to maximize it so we get the largest possible confidence interval. Calculus tells us that $p(1 - p)$ is maximized when $p = 1/2$, so we get the biggest 95% confidence interval when we set $p = 1/2$ in the formula $p' \pm 2 \sqrt{\frac{p(1 - p)}{n}}$.

Using 1/2 for the value of p in the formula above yields what expression for the 95% confidence interval for p?

1: $p' \pm \frac{1}{\sqrt n}$
2: $p' \pm \frac{1}{2 \sqrt n}$
3: $p' \pm 2 \sqrt n$

Selection: 1
Hint: $p(1 - p) = \frac{1}{4}$ when $p = \frac{1}{2}$ and the $\sqrt{\frac{1}{4n}} = \frac{1}{2 \sqrt n}$. What happens when you multiply this by 2?

Here’s another example of applying this formula from the slides. Suppose you were running for office and your pollster polled 100 people. Of these 60 claimed they were going to vote for you. You’d like to estimate the true proportion of people who will vote for you and you want to be 95% confident of your estimate. We need to find the limits that will contain the true proportion of your supporters with 95% confidence, so we’ll use the formula $p' \pm \frac{1}{\sqrt n}$ to answer this question. First, what value would you use for $p'$, the sampled estimate?

1: 1.00
2: .10
3: .60
4: .56

Selection: 3
Hint: The only sampled number here is the number of people who said they would vote for you. Make it a proportion by dividing it by the sample size.

What would you use for 1/sqrt(n)?

1: 1/sqrt(60)
2: 1/sqrt(56)
3: 1/100
4: 1/10

Selection: 4
Hint: The sample size is n, and in this case n=100. What is 1/sqrt(100)?

The bounds of the interval then are what?

1: .5 and .7
2: .46 and .66
3: I haven't a clue
4: .55 and .65

Selection: 1
Hint: We know p'- 1/sqrt(n) is the lower bound and p'+ 1/sqrt(n) is the upper
bound, so use the answers from the two previous answers to fill in values for
these variables.

How do you feel about the election?

1: unsure
2: I'll pull out
3: Perseverance, that's the answer
4: confident

Selection: 4
Hint: With 95% confidence, between 0.5 and 0.7 of the voters support you. You
look like a winner to me!

Wald Confidence Interval

Another technique for calculating confidence intervals for binomial distributions is to replace $p$ with $p'$. This is called the Wald confidence interval. We can also use the R function qnorm to get a more precise quantile value (closer to 1.96) instead of our ballpark estimate of 2.

With the formula $p' \pm qnorm(0.975) \sqrt{\frac{p'(1 - p')}{100}}$, use the $p'$ and $n$ values from above and the R construct p'+ c(-1,1)... to handle the plus/minus portion of the formula. You should see bounds similar to the ones you just estimated.

0.6 + c(-1, 1) * qnorm(0.975) * sqrt(0.6 * 0.4/100)

## [1] 0.5039818 0.6960182

As an alternative to this Wald interval, we can also use the R function binom.test with the parameters 60 and 100 and let all the others default. This function “performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.” (This means it guarantees the coverages, uses a lot of computation and doesn’t rely on the CLT.) This function returns a lot of information but all we want now are the values of the confidence interval that it returns. Use the R construct x$conf.int to find these now.

binom.test(60, 100)$conf.int

## [1] 0.4972092 0.6967052
## attr(,"conf.level")
## [1] 0.95

Close to what we’ve seen before, right? Now we’re going to see that the Wald interval isn’t very accurate when n is small. We’ll use the example from the slides.

n <- 20
nosim <- 30
mywald <- function(p) {
      phats <- rbinom(nosim, prob = p, size = n)/n
      ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
      ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
      print("Here are the p\' values")
      print(phats)
      print("Here are the lower")
      print(ll)
      print("Here are the upper")
      print(ul)
      mean(ll < p & ul > p)
}

Suppose we flip a coin a small number of times, say 20. Also suppose we have a function mywald which takes a probability $p$, and generates 30 sets of 20 coin flips using that probability $p$. It uses the sampled proportion of success, $p'$, for those 20 coin flips to compute the upper and lower bounds of the 95% Wald interval, that is, it computes the two numbers $p' \pm qnorm(0.975) \sqrt{\frac{p'(1 - p')}{n}}$ for each of the 30 trials. For the given true probability $p$, we count the number of times out of those 30 trials that the true probability $p$ was in the Wald confidence interval. We’ll call this the coverage.

To make sure you understand what’s going on, try running mywald now with the probability 0.2. It will print out 30 $p'$ values (which you don’t really need to see), followed by 30 lower bounds, 30 upper bounds and lastly the percentage of times that the input 0.2 was between the two bounds. See if you agree with the percentage you get. Usually it suffices to just count the number of times (out of the 30) that 0.2 is less than the upper bound.

mywald(0.2)

## [1] "Here are the p' values"
##  [1] 0.15 0.20 0.35 0.15 0.20 0.15 0.30 0.30 0.20 0.20 0.25 0.25 0.15 0.15 0.15
## [16] 0.10 0.45 0.25 0.10 0.30 0.30 0.10 0.05 0.25 0.35 0.25 0.30 0.25 0.10 0.30
## [1] "Here are the lower"
##  [1] -0.006490575  0.024695492  0.140962697 -0.006490575  0.024695492
##  [6] -0.006490575  0.099163455  0.099163455  0.024695492  0.024695492
## [11]  0.060227303  0.060227303 -0.006490575 -0.006490575 -0.006490575
## [16] -0.031478381  0.231967771  0.060227303 -0.031478381  0.099163455
## [21]  0.099163455 -0.031478381 -0.045516829  0.060227303  0.140962697
## [26]  0.060227303  0.099163455  0.060227303 -0.031478381  0.099163455
## [1] "Here are the upper"
##  [1] 0.3064906 0.3753045 0.5590373 0.3064906 0.3753045 0.3064906 0.5008365
##  [8] 0.5008365 0.3753045 0.3753045 0.4397727 0.4397727 0.3064906 0.3064906
## [15] 0.3064906 0.2314784 0.6680322 0.4397727 0.2314784 0.5008365 0.5008365
## [22] 0.2314784 0.1455168 0.4397727 0.5590373 0.4397727 0.5008365 0.4397727
## [29] 0.2314784 0.5008365

## [1] 0.9333333

Now that you understand the underlying principle, suppose instead of 30 trials, we used 1000 trials. Also suppose we did this experiment for a series of probabilities, say from 0.1 to 0.9 taking steps of size 0.05. More specifically, we’ll call our function using 17 different probabilities, namely 0.1, 0.15, 0.2, 0.25, …, 0.9. We can then plot the percentages of coverage for each of the probabilities.

n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000

coverage <- sapply(pvals, function(p) {
    phats <- rbinom(nosim, prob = p, size = n)/n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})

g <- ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) +
     geom_line(size = 2) +
     geom_hline(yintercept = 0.95) +
     ylim(0.75, 1.0)
print(g)

Here’s the plot of our results. Each of the 17 vertices show the percentage of coverage for a particular true probability $p$ passed to the function. Results will vary, but usually the only probability that hits close to or above the 95% line is the $p = 0.5$. So this shows that when $n$, the number of flips, is small (20) the CLT doesn’t hold for many values of $p$, so the Wald interval doesn’t work very well.

n <- 100
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000

coverage <- sapply(pvals, function(p) {
    phats <- rbinom(nosim, prob = p, size = n)/n
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})

g <- ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) +
     geom_line(size = 2) +
     geom_hline(yintercept = 0.95) +
     ylim(0.75, 1.0)
print(g)

Let’s try the same experiment and increase $n$, the number of coin flips in each of our 1000 trials, from 20 to 100 to see if the plot improves. Again, results may vary, but all the probabilities are much closer to the 95% line, so the CLT works better with a bigger value of $n$.

Agresti/Coull Interval

A quick fix to the problem of having a small n is to use the Agresti/Coull interval. This simply means we add 2 successes and 2 failures to the counts when calculating the proportion $p'$. Specifically, if $X$ is the number of successes out of the 20 coin flips, then instead of setting $p' = \frac{X}{20}$, let $p' = \frac{X + 2}{24}$. We use 24 as the number of trials since we’ve added 2 successes and 2 failures to the counts. Note that we still use 20 in the calculation of the upper and lower bounds.

n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000

coverage <- sapply(pvals, function(p){
    phats <- (rbinom(nosim, prob = p, size = n) + 2)/(n + 4)
    ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
    mean(ll < p & ul > p)
})

g <- ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) +
     geom_line(size = 2) +
     geom_hline(yintercept = 0.95) + 
     ylim(0.75, 1.0)
print(g)

Here’s a plot using this Agresti/Coull interval, with 1000 trials of 20 coin flips each. It looks much better than both the original Wald with 20 coin flips and the improved Wald with 100 coin flips. However, this technique might make the confidence interval too wide.

Why does this work? Adding 2 successes and 2 failures pulls $p'$ closer to 0.5 which, as we saw, is the value which maximizes the confidence interval.

ACCompar <- function(n) {
    num <- 1:n 
    den <- n
    new_num <- num + 2
    new_den <- den + 4
    new_frac <- new_num/new_den
    old_frac <- num/den
    scor <- new_frac < old_frac
    print(scor)
    sum(scor)
}

To show this simply, we wrote a function ACCompar, which takes an integer input $n$. For each $k$ from $\mbox{1 to n}$ it computes two fractions, $\frac{k}{n}$ and $\frac{k + 2}{n + 4}$. It then prints out the boolean vector of whether the new $\frac{k + 2}{n + 4}$ fraction is less than the old $\frac{k}{n}$. It also prints out the total number of k’s for which the condition is TRUE.

For all $k < \frac{n}{2}$, you see FALSE indicating that $\frac{k + 2}{n + 4} \ge \frac{k}{n}$.
For all $k > \frac{n}{2}$ you see TRUE indicating that $\frac{k + 2}{n + 4} < \frac{k}{n}$.
If $k = \frac{n}{2}$, $\frac{k + 2}{n + 4} = \frac{k}{n}$.

Try running ACCompar now with an input of 20.

ACCompar(20)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

## [1] 10

Poisson Dist and Confidence Interval

Let’s move on to Poisson distributions and confidence intervals. Recall that Poisson distributions apply to counts or rates. For the latter, we write $X \sim Poisson(\lambda t)$ where $\lambda$ is the expected count per unit of time and $t$ is the total monitoring time.

Here’s another example from the slides. Suppose a nuclear pump failed 5 times out of 94.32 days and we want a 95% confidence interval for the failure rate per day. The number of failures $X$ is Poisson distributed with parameter $(\lambda t)$. We don’t observe the failure rate, but we estimate it as $x/t$. Call our estimate $\hat \lambda$, so $\hat \lambda = x/t$. According to theory, the variance of our estimated failure rate is $\lambda/t$. Again, we don’t observe $\lambda$, so we use our estimate of it instead. We thus approximate the variance of $\hat \lambda$ as $\hat \lambda/t$.

In this example what would you use as the estimated mean x/t?

1: 5/94.32
2: 94.32/5
3: I haven't a clue

Selection: 1
Hint: You need a number of failures divided by some measure of time.

Set a variable lamb now with this value.

lamb <- 5/94.32

So lamb is our estimated mean and lamb/t is our estimated variance. The formula we’ve used to calculate a 95% confidence interval is est mean + c(-1,1) * qnorm(0.975) * sqrt(est var). Use this formula now making the appropriate substitutions.

lamb + c(-1, 1) * qnorm(0.975) * sqrt(lamb/94.32)

## [1] 0.006545667 0.099476386

As a check we can use R’s function poisson.test with the arguments 5 and 94.32 to check this result. This is an exact test so it guarantees coverage. As with the binomial exact test, we only need to look at the conf portion of the result using the x$conf construct. Do this now.

poisson.test(5, 94.32)$conf

## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95

Pretty close, right? Now to check the coverage of our estimate we’ll run the same simulation experiment we ran before with binomial distributions. We’ll vary our $\lambda$ values from 0.005 to 0.1 with steps of 0.01 (so we have 10 of them), and for each one we’ll generate 1000 Poisson samples with mean $\lambda t$. We’ll calculate sample means and use them to compute 95% confidence intervals. We’ll then count how often out of the 1000 simulations the true mean (our $\lambda$) was contained in the computed interval.

lambdavals <- seq(0.005, 0.10, by = 0.01)
nosim <- 1000
t <- 100

coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(nosim, lambda = lambda * t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})

g <- ggplot(data.frame(lambdavals, coverage),
            aes(x = lambdavals, y = coverage)) +
     geom_line(size = 2) +
     geom_hline(yintercept = 0.95) +
     ylim(0, 1.0)
print(g)

Here’s a plot of the results. We see that the coverage improves as $\lambda$ gets larger, and it’s quite off for small $\lambda$ values.

lambdavals <- seq(0.005, 0.10, by = 0.01)
nosim <- 1000
t <- 1000

coverage <- sapply(lambdavals, function(lambda) {
    lhats <- rpois(nosim, lambda = lambda * t)/t
    ll <- lhats - qnorm(0.975) * sqrt(lhats/t)
    ul <- lhats + qnorm(0.975) * sqrt(lhats/t)
    mean(ll < lambda & ul > lambda)
})

g <- ggplot(data.frame(lambdavals, coverage),
            aes(x = lambdavals, y = coverage)) +
     geom_line(size = 2) +
     geom_hline(yintercept = 0.95) +
     ylim(0, 1.0)
print(g)

Now it’s interesting to see how the coverage improves when we increase the unit of time. In the previous plot we used $t = 100$ (rounding the 94.32 up). Here’s a plot of the same experiment setting $t = 1000$. We see that the coverage is much better for almost all the values of lambda, except for the smallest ones.

Now for a quick review!

What tells us that averages of iid samples converge to the population means that
they are estimating?

1: the CLT
2: the law of small numbers
3: the law of large numbers
4: the BLT

Selection: 3
Hint: Think Big!

What tells us that averages are approximately normal for large enough sample
sizes

1: the law of large numbers
2: the CLT
3: the law of small numbers
4: the BLT

Selection: 2
Hint: Keep yourself centered!

The Central Limit Theorem (CLT) tells us that averages have what kind of
distributions?

1: abnormal
2: Poisson
3: binomial
4: normal

Selection: 4
Hint: Remember the previous question?

The Central Limit Theorem (CLT) tells us that averages have normal distributions
centered at what?

1: the population variance
2: the standard error
3: the population mean

Selection: 3
Hint: Remember the old E(X') = mu, where X' is the sample mean and mu is the
population mean. Know what I mean?

The Central Limit Theorem (CLT) tells us that averages have normal distributions
with standard deviations equal to what?

1: the population mean
2: the population variance
3: the standard error

Selection: 3
Hint: Which choice has the word standard in it?

True or False - The Central Limit Theorem (CLT) tells us that averages always
have normal distributions no matter how big the sample size

1: False
2: True

Selection: 1
Hint: Never trust statements with the words ALWAYS or NEVER in them. There are
ALWAYS exceptions to rules.

To calculate a confidence interval for a mean you take the sample mean and add
and subtract the relevant normal quantile times the what?

1: standard error
2: mean
3: variance/n
4: variance

Selection: 1
Hint: You want something like a standard deviation, right? Which choice has the
word standard in it?

For a 95% confidence interval approximately how many standard errors would you
add and subtract from the sample mean?

1: 4
2: 2
3: 6
4: 8

Selection: 2
Hint: Anything above 3 is pretty far from the mean. Also, purists would prefer
1.96 for this.

If you wanted increased coverage what would you do to your confidence interval?

1: decrease it
2: increase it
3: keep it the same

Selection: 2
Hint: The key word here is increase. Bigger interval means bigger coverage.

If you had less variability in your data would your confidence interval get
bigger or smaller?

1: bigger
2: smaller

Selection: 2
Hint: Recall the size of the confidence interval positively depends on standard
error which is sqrt(var/n). If variance is smaller then so is variability and
the interval.

If you had larger sample size would your confidence interval get bigger or
smaller?

1: smaller
2: bigger

Selection: 1
Hint: Recall the size of the confidence interval positively depends on standard
error which is sqrt(var/n). If the sample size, n, gets bigger the standard
error gets smaller and so does the interval.

A quick fix for small sample size binomial calculations is what?

1: add 2 successes and subtract 2 failures
2: changing data seem dishonest
3: add 2 successes and 4 failures
4: add 2 successes and 2 failures

Selection: 4
Hint: Adding 2 successes and 2 failures brings the proportion of successes
closer to 1/2 which maximizes the confidence interval.

Congrats! You’ve concluded this lesson on asymptotics. We hope you feel confident and are asymptomatic after going through it.