September 17, 2018

Inference and Distributions

Last time, distributions were complete sentences: nouns (the distribution) and verbs (the parameters). They remain so, and they can be assumed or derived. Before turning to inference, I will conclude the discussion of distributions with an application that illustrates their primary use: a Monte Carlo simulation.

Monte Carlo Simulation

A problem.

  1. Customers arrive at a car dealership at a rate of 6 per hour.
  2. Each customer has a 15% probability of making a purchase.
  3. Purchases have
     1. uniform profits over the interval $1000-$3000, or
     2. normal profits that average $1500 with standard deviation $500.

Monte Carlo Simulation

Putting it together. The tool is R; let's try 1000 simulated hours.

  1. Customers arrive at a car dealership at a rate of 6 per hour: C = Poisson(6).
  2. Each customer has a 15% probability of making a purchase: P = Binomial(C, p=0.15).
  3. Total profit is the sum of P draws from either
     • uniform profits over the interval $1000-$3000, or
     • normal profits that average $1500 with standard deviation $500.

Parts 1 and 2

Customers <- rpois(1000, 6) # Customers ~ Poisson(6): arrivals in each simulated hour
Purchasers <- rbinom(1000, size=Customers, prob=0.15) # Purchasers ~ Binomial(Customers, 0.15)
# Part 3 needs a coding trick: for each of the 1000 simulated hours, draw
# Purchasers[x] random profits and sum them.
Profits.U <- sapply(1:1000, function(x)
  { sum(runif(Purchasers[x], 1000, 3000)) })
Profits.N <- sapply(1:1000, function(x)
  { sum(rnorm(Purchasers[x], 1500, 500)) })

Customers and Purchasers
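A minimal sketch of what this slide presumably plotted; the side-by-side layout is an assumption.

par(mfrow=c(1, 2))                # two panels, side by side (assumed layout)
hist(Customers, main="Customers per Hour", xlab="customers")
hist(Purchasers, main="Purchasers per Hour", xlab="purchasers")
par(mfrow=c(1, 1))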

Solutions
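A sketch of the solution plots, assuming the slide overlaid the two profit densities.

plot(density(Profits.U), main="Simulated Hourly Profit", xlab="profit ($)")
lines(density(Profits.N), lty=2)  # dashed line for the normal-profit model
legend("topright", legend=c("Uniform profits", "Normal profits"), lty=c(1, 2))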

Summary
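The table below was presumably produced by something like:

summary(data.frame(Profits.U, Profits.N))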

##    Profits.U      Profits.N     
##  Min.   :   0   Min.   :-394.8  
##  1st Qu.:   0   1st Qu.:   0.0  
##  Median :1399   Median :1048.0  
##  Mean   :1677   Mean   :1244.3  
##  3rd Qu.:2721   3rd Qu.:2081.4  
##  Max.   :9779   Max.   :8295.1
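Note the negative minimum under the normal model: unlike the uniform on $1000-$3000, the normal allows individual profits below zero.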

Your First Quiz Module

It will play off of Monte Carlo simulations such as this one.

A Workflow for Distributions

What values can x take? A large random sample reveals the effective min and max.

plot(density(data)) or hist(data) visualize the samples.

Save samples with the assignment operator <- or =.

poss.x <- seq(min.x, max.x, by=units) # grid of possible values
prob.x <- pdist(poss.x, verbs)        # pdist stands in for pnorm, ppois, etc.; the verbs are the parameters

An Example: Normal(0,1)

The meat of the standard normal lies in [-4,4]; the p and q functions do the work.

my.seq <- seq(-4, 4, by=0.01)
plot(x=my.seq, y=pnorm(my.seq), main="The Standard Normal (0,1)",
     xlab="z", ylab="Probability", type="l")

In R, the normal family is norm: rnorm, dnorm, pnorm, and qnorm.

The empirical rule(s).

pnorm(1)-pnorm(-1)
## [1] 0.6826895
pnorm(2)-pnorm(-2)
## [1] 0.9544997
pnorm(3)-pnorm(-3)
## [1] 0.9973002

The Magical QQ-Norm

Explained:

On y: the sorted, scaled data: \[\frac{x - \overline{x}}{s_{x}}\].

On x: the theoretical normal quantiles at \(i/(n+1)\) for \(i = 1, \ldots, n\).

Like So
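A minimal sketch of building the plot by hand on a simulated sample (the data are hypothetical); qqnorm(data) followed by qqline(data) is the built-in equivalent.

data <- rnorm(100, 10, 2)           # any sample to inspect
z <- (data - mean(data)) / sd(data) # y: scaled data
n <- length(data)
theo <- qnorm((1:n) / (n + 1))      # x: theoretical quantiles at i/(n+1)
plot(x=theo, y=sort(z), xlab="Theoretical Quantiles",
     ylab="Scaled Sample Quantiles", main="QQ-Norm by Hand")
abline(a=0, b=1)                    # a normal sample hugs this line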

Toulmin and the Empirical World

In a world of data, Toulmin’s backing, rebuttals, and qualifiers come in two major forms:

  • Assumption
  • Implication

Probability Distributions

In the last lab and the coming one, we posited a distribution for things and asked questions given that distribution. That required two things: a probability distribution and the parameters that complete it. The noun and the verbs….

  • Air Traffic Control incidents
  • Filling grape jelly jars
  • Scottish pounds
  • The median…

The Questions

In each of the aforementioned problems, there are two unknowns: the distribution and its parameters. Because we are asking questions of the entire distribution, these are very hard questions, and so far we have assumed answers to both.

Motivating Problem

Consider an example concerning the space shuttle Challenger and the O-rings. NASA posits that the probability of an accident is 1/60000. The Air Force posits an alternative, 1/35. Who is right?

The problem is that data cannot decide by themselves: we cannot repeatedly launch space shuttles and observe whether or not they explode, first and foremost for reasons of morality, but also of expense measured in a host of ways.

Other Assumptions

In any case, shuttle flights are not independent trials. They are closer to Poisson trials: damage to tiles and the stresses of reentry may make the probability of an accident increase with the number of flights. The core problem is that we must use the data we have to say things about data we do not have: 1 failure in 25 flights (through Challenger), 2 in 113 (through Columbia), or 2 in the program's full 135 flights.

Inference

Inference reverses the problem from last time. With data and a known distribution, I can infer one or more verbs, e.g. the mean of a normal distribution. But the sample must have particular characteristics, and the set of statistics with known distributions is not large. I never get one-number answers.

Sampling

Four principal types of sampling:

  • Random
  • Stratified
  • Clustered
  • Systematic

Given the Sampling …

Things are assumed to be random. Threats to this:

  • Coverage error
  • Nonresponse error
  • Sampling error
  • Measurement error

Inferred Quantities

  • Means
  • Variances
  • \(\pi\) in a binomial
  • Correlation

On the Mean

The mean is special. The process of averaging does many things. We have already seen that it guarantees the deviations above and below the mean cancel out. It also reduces variation.
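A quick check of the cancellation claim, on a hypothetical sample:

x <- rnorm(25, 100, 15) # any sample will do
sum(x - mean(x))        # deviations above and below cancel: zero up to rounding error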

Sample sizes and variation of the mean

The Central Limit Theorem(s?)

Maintains that, under general conditions, the sample mean is approximately normally distributed around the population mean with standard error \(\frac{\sigma}{\sqrt{n}}\).
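A simulation sketch of the claim; the parameter values here are arbitrary.

sigma <- 10; n <- 25
means <- replicate(10000, mean(rnorm(n, mean=0, sd=sigma)))
sd(means)       # simulated standard error of the mean
sigma / sqrt(n) # the CLT's sigma/sqrt(n) = 2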

Averaging Shrinks the Variation

[Figures: sampling distributions of the mean for n = 3, 5, 10, 25, and 50; the spread shrinks as n grows.]

It Even Works for Binaries
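The same thing happens to the sample proportion of binary outcomes; a sketch, borrowing the purchase probability from the dealership example.

p <- 0.15; n <- 100
phat <- replicate(10000, mean(rbinom(n, size=1, prob=p)))
hist(phat, main="Sample Proportions, n = 100", xlab="proportion")
c(sd(phat), sqrt(p * (1 - p) / n)) # both near 0.036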

Intervals and Hypothesis Tests

One specifies an interval given a level of probability.

The other specifies a decision rule for a statement given a level of probability: retain the statement unless the data are sufficiently unlikely under it.

They are dual to one another.
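A sketch of the duality with a normal-based interval on a hypothetical sample; next time's t distribution refines this.

x <- rnorm(100, mean=5, sd=2)
se <- sd(x) / sqrt(length(x))
mean(x) + qnorm(c(0.025, 0.975)) * se # 95% interval
# A hypothesized mean outside this interval is rejected at the 5% level,
# and vice versa: that is the duality.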

With tests, we can still make two errors

Type I (rejecting a true statement) and Type II (failing to reject a false one).

The Bottom Line

In statistics, as in nature, everything has variation. There is not a single answer; there is an interaction between probability and this range of values. One answer is always correct, the interval spanning every possibility, but it is not interesting. Next time, we will discover a distribution designed for data: the t.