September 23, 2019

Inference and Distributions

Last time, distributions were complete sentences: nouns and verbs. They remain so. They can be assumed or derived. Before turning to inference, I will conclude a discussion of distributions with an application that illustrates their primary use – a Monte Carlo simulation.

Monte Carlo Simulation

A problem.

  1. Customers arrive at a car dealership at a rate of 6 per hour.
  2. Each customer has a 15% probability of making a purchase.
  3. Purchases have either
  • uniform profits over the interval $1000-$3000, or
  • normal profits that average $1500 with a standard deviation of $500.

Monte Carlo Simulation

Putting it together. The tool is R; let's try 1000 simulations.

  1. Customers arrive at a car dealership at a rate of 6 per hour. C=Poisson(6)
  2. Each customer has a 15% probability of making a purchase. P=Binomial(C,p=0.15)
  3. Purchases have profits; sum P draws of these:
  • Uniform profits over the interval $1000-$3000.
  • Normal profits that average $1500 with standard deviation $500

Parts 1 and 2

Customers <- rpois(1000, 6) # Customers ~ Poisson(6)
Purchasers <- rbinom(1000, size=Customers, prob=0.15) # P ~ Binomial(Customers,0.15)
# The next part needs a coding trick. For each of the 1000 rows, sum as many random profit draws as that row has Purchasers.
Profits.U <- sapply(seq_len(1000), function(x) 
  { sum(runif(Purchasers[x], 1000, 3000)) } )  # uniform profits per purchase
Profits.N <- sapply(seq_len(1000), function(x) 
  { sum(rnorm(Purchasers[x], 1500, 500)) } )   # normal profits per purchase

Customers and Purchasers

Solutions

Summary

##    Profits.U       Profits.N      
##  Min.   :    0   Min.   : -36.92  
##  1st Qu.:    0   1st Qu.:   0.00  
##  Median : 1530   Median :1233.31  
##  Mean   : 1842   Mean   :1392.00  
##  3rd Qu.: 2901   3rd Qu.:2181.18  
##  Max.   :12372   Max.   :8849.13
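
A summary like the one above can be produced by putting the two profit vectors side by side. The following is a minimal sketch that repeats the simulation so that it runs on its own; the seed and the name n.sims are assumptions added here, so the numbers will not match the table exactly. It also checks the share of hours with no sale against the value implied by the setup, since thinning a Poisson(6) by 15% gives a Poisson(0.9) count of purchasers.

```r
set.seed(2019)   # assumed seed, only for reproducibility of this sketch
n.sims <- 1000   # hypothetical name for the number of simulated hours

Customers  <- rpois(n.sims, 6)                               # arrivals per hour
Purchasers <- rbinom(n.sims, size = Customers, prob = 0.15)  # buyers per hour

# Total profit per hour under each profit assumption
Profits.U <- sapply(seq_len(n.sims), function(x) sum(runif(Purchasers[x], 1000, 3000)))
Profits.N <- sapply(seq_len(n.sims), function(x) sum(rnorm(Purchasers[x], 1500, 500)))

summary(data.frame(Profits.U, Profits.N))

# Share of simulated hours with no sale, next to the Poisson(0.9) value
mean(Purchasers == 0)
exp(-6 * 0.15)
```

The many zeros explain why the first quartile of both profit columns is 0.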

Your First Quiz Module

It will build on a Monte Carlo simulation such as this one.

A Workflow for Distributions

What values can x take? A large random sample gives the approximate min and max.

plot(density(data)) or hist(data) shows the samples.

Save samples with the assignment operator, <- or =.

poss.x <- seq(min.x, max.x, by=units)  # range from sim.
prob.x <- p<noun>(poss.x, verbs)  # cumulative probability, P(X <= x); <noun> and verbs are placeholders
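
As one concrete instance of the template above, here is a sketch that uses the Poisson(6) arrivals from the dealership problem. The choice of distribution, the seed, and the sample size of 10000 are assumptions made for illustration.

```r
set.seed(1)                            # assumed seed
draws <- rpois(10000, lambda = 6)      # a large random sample

min.x <- min(draws)                    # approximate lower bound
max.x <- max(draws)                    # approximate upper bound

poss.x <- seq(min.x, max.x, by = 1)    # counts come in units of 1
prob.x <- ppois(poss.x, lambda = 6)    # P(X <= x) at each possible value

plot(poss.x, prob.x, type = "s", xlab = "x", ylab = "P(X <= x)",
     main = "Poisson(6)")
```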

An Example: Normal(0,1)

The meat is in p and q. Nearly all of the probability lies in [-4,4].

my.seq <- seq(-4,4,by=0.01)
plot(x=my.seq, y=pnorm(my.seq), main="The Standard Normal (0,1)", xlab="z", type="l", ylab="Probability")

In R, the noun for the normal is norm.

The empirical rule(s).

pnorm(1)-pnorm(-1)
## [1] 0.6826895
pnorm(2)-pnorm(-2)
## [1] 0.9544997
pnorm(3)-pnorm(-3)
## [1] 0.9973002
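
The q function runs the same calculation in reverse: given a probability, it returns the z value. A small sketch, with the variable name z chosen here for illustration:

```r
# z values that bound the central 95% of a standard normal
z <- qnorm(c(0.025, 0.975))
z

# p undoes q: this returns the probability we started with
pnorm(qnorm(0.975))
```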

The Magical QQ-Norm

Explained:

On y: scale the data: \[\frac{x - \overline{x}}{s_{x}}\].

On x: generate the sorted normal quantiles at probabilities \(\frac{1}{n+1}, \ldots, \frac{n}{n+1}\).
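
The two steps can be sketched by hand. The data here are simulated normal draws, an assumption made only so the example runs on its own. Note that the built-in qqnorm uses slightly different plotting positions and does not scale the data, so its picture will differ a little from this one.

```r
set.seed(1)                              # assumed seed
x <- rnorm(100, mean = 50, sd = 10)      # hypothetical data
n <- length(x)

z    <- sort((x - mean(x)) / sd(x))      # y axis: scaled, sorted data
theo <- qnorm(seq_len(n) / (n + 1))      # x axis: normal quantiles at i/(n+1)

plot(theo, z, xlab = "Normal quantiles", ylab = "Scaled data")
abline(0, 1)                             # normal data should hug this line
```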

Like So

The qqPlot in car

It draws an interval to give us an idea of how normal the data are. If the data are normal, the observations should fall inside the interval with the given probability. The command requires library(car).

## Loading required package: carData

## [1] 6416 5059

Toulmin and the Empirical World

In a world of data, Toulmin’s backing, rebuttals, and qualifiers come in two major forms:

  • Assumption
  • Implication

Probability Distributions

In the last lab and the coming one, we posited a distribution for things and asked questions given that distribution. That took two things: a probability distribution and the parameters that complete it. The noun and the verbs….

  • Air Traffic Control incidents
  • Filling grape jelly jars
  • Scottish pounds
  • The median…

The Questions

In each of the aforementioned problems, there are two unknowns – the parameters and the distribution. This is because we are asking questions of the entire distribution. These are very hard questions and we have assumed answers to both questions: distribution and parameters.

A Radiant Aside

install.packages("radiant")

installs a graphical user interface that, while limited, is very useful for probability distributions.

The Median is a Binomial with p=0.5

Interestingly, any given observation has a 50-50 chance of being over or under the median. Suppose that I have five data points.
1. What is the probability that all are under?
2. What is the probability that all are over?
3. What is the probability that the median is somewhere in between our smallest and largest sampled values?

This is called the Rule of Five
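
The three answers follow from a Binomial(5, 0.5) count of observations under the median. A short sketch, with variable names chosen here for illustration:

```r
n <- 5

p.all.under <- dbinom(n, size = n, prob = 0.5)   # all five under the median
p.all.over  <- dbinom(0, size = n, prob = 0.5)   # none under, so all five over
p.between   <- 1 - p.all.under - p.all.over      # median between min and max

c(all.under = p.all.under, all.over = p.all.over, between = p.between)
```

Each extreme has probability 0.5^5 = 0.03125, so the smallest and largest of five values bracket the median with probability 0.9375.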

Motivating Problem 2

Consider an example concerning the space shuttle Challenger and the O-rings. NASA posits that the probability of an accident is 1/60000. The Air Force posits an alternative, 1/35. Who is right?

The problem is that data cannot decide by themselves. We cannot keep launching space shuttles to see whether or not they explode, first and foremost for reasons of morality, but also because of the expense, measured in a host of ways.

Other Assumptions

Trials of the shuttle are not independent in any case. Poisson trials: damage to tiles and reentry may make the probability of an accident increase with the number of flights. But the core issue is that we need to use the data that we have to say things about data that we do not have. We only know the record: 1 accident in 25 flights, or 2 in 113 or 135 flights.

Let’s Work This Through

Each claim contains an assumption about what the data should look like. What should they look like?

  • Find the bounds with a large sample using rpois.
  • Plot the distribution within those bounds using p(q).
  • Evaluate the data we have.
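
The last step can be sketched as follows. This swaps in a binomial with independent flights, which the text has already said is not strictly true, so treat it as a rough assumption rather than the method used in class. The names claims, p.by.25, and p.by.135 are hypothetical.

```r
# The two posited accident probabilities per flight
claims <- c(NASA = 1 / 60000, AirForce = 1 / 35)

# P(at least 1 accident in the first 25 flights) under each claim
p.by.25 <- sapply(claims, function(p)
  pbinom(0, size = 25, prob = p, lower.tail = FALSE))

# P(at least 2 accidents in 135 flights) under each claim
p.by.135 <- sapply(claims, function(p)
  pbinom(1, size = 135, prob = p, lower.tail = FALSE))

rbind(p.by.25, p.by.135)
```

Under these assumptions the observed record is far more probable under the Air Force figure than under the NASA figure.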

Inference

Inference reverses the problem from last time. With data and a known distribution, I can infer one or more verbs, e.g. the mean of a normal distribution. But the sample must have particular characteristics, and the set of statistics with known distributions is not large. I never get one-number answers.

Sampling

Four principal types of sampling

  • Random
  • Stratified
  • Clustered
  • Systematic

Given the Sampling …

Things are assumed to be random. Threats to this:

  • Coverage error
  • Nonresponse error
  • Sampling error
  • Measurement error

Inferred Quantities

  • Means
  • Variances
  • \(\pi\) in a binomial
  • Correlation

On the Mean

The mean is special. The process of averaging does many things. We have already seen that it guarantees the deviations above and below the mean cancel out. It also reduces variation.

Sample sizes and variation of the mean

The Central Limit Theorem(s?)

Maintains that the sample mean has an approximately normal distribution, under general conditions, with standard error \(\frac{\sigma}{\sqrt{n}}\).

Averaging Shrinks the Variation

Panels for n=3, n=5, n=10, n=25, and n=50.
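
The shrinkage can be checked by simulation. This sketch assumes an Exponential(1) population, which has \(\sigma = 1\) and is far from normal; the population, the seed, and the 5000 replications are choices made here for illustration.

```r
set.seed(42)                       # assumed seed
sizes <- c(3, 5, 10, 25, 50)       # the sample sizes listed above

# Standard deviation of 5000 simulated sample means at each n
sim.sd <- sapply(sizes, function(n)
  sd(replicate(5000, mean(rexp(n, rate = 1)))))

# Compare with sigma / sqrt(n), where sigma = 1
cbind(n = sizes, simulated = sim.sd, theory = 1 / sqrt(sizes))
```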

It Even Works for Binaries
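
A sketch for a binary outcome: the sample proportion is just the mean of 0s and 1s, so its standard error is \(\sqrt{p(1-p)/n}\). The value p = 0.15 is borrowed from the dealership problem; n = 50, the seed, and the 5000 replications are assumptions made for illustration.

```r
set.seed(7)                                    # assumed seed
p <- 0.15
n <- 50

# 5000 simulated sample proportions, each from n binary outcomes
p.hat <- rbinom(5000, size = n, prob = p) / n

c(simulated = sd(p.hat), theory = sqrt(p * (1 - p) / n))

hist(p.hat, main = "Sample proportions, n = 50", xlab = "Proportion")
```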

Intervals and Hypothesis Tests

One specifies an interval given a level of probability.

The other specifies a decision rule for a statement given a level of probability. If the data are at least \(p\) likely under the statement, we retain it; otherwise we reject it.

They are dual to one another.
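
The duality can be seen with t.test, which reports both an interval and a test. In this sketch the data are simulated and the hypothesized mean mu0 is hypothetical. A 95% interval contains mu0 exactly when the two-sided test at the 5% level fails to reject it.

```r
set.seed(3)                              # assumed seed
x   <- rnorm(30, mean = 10, sd = 2)      # hypothetical sample
mu0 <- 11                                # hypothetical statement about the mean

res <- t.test(x, mu = mu0, conf.level = 0.95)

# Is mu0 inside the 95% confidence interval?
inside <- res$conf.int[1] <= mu0 && mu0 <= res$conf.int[2]

# The two answers always agree
c(inside.interval = inside, fail.to.reject = res$p.value > 0.05)
```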

With tests, we can still make two errors

Type I (rejecting a true statement) and Type II (failing to reject a false one)

Chebyshev

\[ 1 - \frac{1}{k^2} \]

is a lower bound on the probability that the data lie within \(k\) standard deviations of the mean, where \(k\) is any number greater than one. A bonus nugget from your homework.
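
Because Chebyshev holds for any distribution, it can be checked on skewed data. This sketch assumes Exponential(1) draws; the seed, sample size, and values of k are chosen here for illustration.

```r
set.seed(11)                       # assumed seed
x <- rexp(100000, rate = 1)        # skewed, clearly not normal
k <- c(1.5, 2, 3)

# Observed share of the data within k standard deviations of the mean
within.k <- sapply(k, function(kk) mean(abs(x - mean(x)) <= kk * sd(x)))

# Chebyshev's guaranteed minimum share
bound <- 1 - 1 / k^2

cbind(k, observed = within.k, chebyshev = bound)
```

The observed shares sit above the bound, which is why the bound is safe for any distribution but conservative for most.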

The Bottom Line

In statistics, as in nature, everything has variation. There is not a single answer; there is an interaction between probability and this range. If a normal holds, we have an empirical rule. For any distribution, we have Chebyshev’s theorem. There is one that is always correct; it is just not interesting.