Stats 155 Class Notes 2012-10-24

Background

All through the semester we have worked within a certain framework for modeling: models which partition variation in a response variable into an “explained” part (using the variation in explanatory variables) and into a “residual” part, which has been treated as completely random.

Today, we're going to work with a completely different framework for modeling, one that is completely descriptive and doesn't try to partition variation into deterministic and random parts — everything is regarded as random.

Even when something is regarded as random, it doesn't mean that it is completely unpredictable. For instance, the roll of a pair of dice is much more likely to result in a 7 than an 11. The models we'll work with today are ways of describing the probabilities of different outcomes.
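
To make the dice claim concrete, here is a small sketch in base R that lists all 36 equally likely rolls of two dice and counts how often each total occurs. The variable names are just for illustration.

```r
# Enumerate every (die1, die2) combination: 36 equally likely rolls
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
totals <- rolls$die1 + rolls$die2

# Probability of each total = fraction of the 36 rolls giving that total
p7  <- mean(totals == 7)    # 6 of the 36 rolls
p11 <- mean(totals == 11)   # 2 of the 36 rolls
c(p7 = p7, p11 = p11, ratio = p7 / p11)
```

A 7 comes up in 6 of the 36 combinations and an 11 in only 2, so a 7 is three times as likely.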

Example

I was asked last month to help, pro bono, a contractor trying to improve the survivability of US Marine Corps amphibious vehicles. The vehicles in use weren't designed around the threat of IEDs and the seating in them needs to be changed to provide better support in case of an explosion underneath, e.g. benches mounted on shock absorbers, no one sitting on the floor. The hull configuration (that's what you call the bottom of an amphibious vehicle) also needs to be changed. But it doesn't work just to make the vehicle bigger and heavier. Heavier is always a problem, requiring a bigger engine, etc. Bigger means not using an existing platform, which raises development costs and incurs delays.

The people I was working with had a small contract to build a design around modifications to an existing USMC vehicle, the AAV. To make it work, they would have to allocate somewhat less space per marine. They wanted to know how to figure out what space would work.

Here are the data I was given: the widths (in feet) of equipped marines, by role, and the number of marines in each role in a 17-person squad:

Role                 # in squad   5th percentile   Median   95th percentile
Rifleman             6            1.99             2.06     2.15
Grenadier            6            1.99             2.05     2.24
Pistol Carrier       1            2.28             2.36     2.67
Automatic Rifleman   4            2.00             2.06     2.34

The standard calculation involves constructing a squad of 17 marines all at the 95th percentile. The contractor thought this was unnecessarily large and asked whether it wouldn't suffice to make the vehicle big enough to handle a squad of median marines, figuring that the wider ones would share space with narrower marines and things would cancel out.

I proposed to do a simulation of randomly generated squads and find out what vehicle size will accommodate the vast majority of squads. I would generate random sizes that meet the above percentiles in a sensible way and add up the total width. (It's a little bit simplified, because the seating plan has 8 marines on a side, with the pistol carrier at the forward bulkhead, facing backward toward the exit ramp.)

QUESTION How should I generate random marine widths?
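
One possible answer, offered only as a sketch and not necessarily the method used for the contractor: draw a uniform random number and push it through a quantile function that passes through the three known percentiles. The piecewise-linear quantile function, the flat tails outside the 5th-95th range, and the independence of marines from one another are all assumptions made for illustration. The names `roles`, `rwidth`, and `squad_width` are hypothetical.

```r
set.seed(155)

# Percentiles from the table (widths in feet)
roles <- data.frame(
  role = c("Rifleman", "Grenadier", "Pistol Carrier", "Automatic Rifleman"),
  n    = c(6, 6, 1, 4),
  p05  = c(1.99, 1.99, 2.28, 2.00),
  p50  = c(2.06, 2.05, 2.36, 2.06),
  p95  = c(2.15, 2.24, 2.67, 2.34)
)

# ASSUMPTION: quantile function is linear between the known percentiles;
# rule = 2 holds the width constant below the 5th and above the 95th percentile.
rwidth <- function(n, p05, p50, p95) {
  u <- runif(n)
  approx(x = c(0.05, 0.50, 0.95), y = c(p05, p50, p95), xout = u, rule = 2)$y
}

# Total width of one random squad (ASSUMPTION: marines are independent)
squad_width <- function() {
  sum(unlist(Map(rwidth, roles$n, roles$p05, roles$p50, roles$p95)))
}

totals <- replicate(10000, squad_width())

all_median <- sum(roles$n * roles$p50)   # squad of median marines: 35.26 ft
all_95th   <- sum(roles$n * roles$p95)   # squad of 95th-percentile marines: 38.37 ft
quantile(totals, 0.95)                   # width that accommodates 95% of simulated squads
```

Comparing the simulated 95th percentile of the total to `all_median` and `all_95th` shows where a "vast majority of squads" design falls between the contractor's proposal and the standard calculation. Because the tails are clipped here, the result understates the widest squads.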

Definitions

A probability is a number between 0 and 1. 0 means “impossible” and 1 means “certain”. Values between 0 and 1 indicate possibility, with bigger numbers indicating a greater possibility.

A random variable is a quantity (a number) that is random.

A sample space (poorly named) is the set of possible values for that number.

A probability model is an assignment of a probability to each member of the sample space.

It's helpful to distinguish between two kinds of sample spaces that apply to random variables:

For discrete numbers, it's possible to assign a probability to each outcome.

For continuous numbers, it's possible to assign a probability to a range of outcomes. Or, by dividing the probability by the extent of the range, one can assign a probability density to each outcome. We usually treat this probability density as a function of the value of the random variable: \( p(x) \)
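
As a quick illustration of "probability divided by the extent of the range", the sketch below uses the standard normal distribution (chosen only as an example) and compares that ratio over a narrow range to the density function R provides.

```r
a     <- 1.0   # left end of the range
width <- 0.1   # extent of the range

# Probability that a standard normal value falls in [a, a + width]
prob_range <- pnorm(a + width) - pnorm(a)

# Probability per unit of x over that range
density_approx <- prob_range / width

# The density at the middle of the range, for comparison
density_exact <- dnorm(a + width / 2)

c(approx = density_approx, exact = density_exact)
```

The two numbers agree closely, and the agreement improves as `width` shrinks.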

We'll often use probabilities and probability densities in a similar way.

Creating Probability Models

There are several approaches to creating probability models, that is, to making an association between a probability and a member of the sample space:

Your job is to learn how the setting relates to the choice of probability model and the meaning of the parameter(s) for each model.

Some Important Probability Models

Introduce the rxxx() operation for each. For equal probabilities, just use resample(1:k, size=n). Generate random numbers from each. Ask them to find the mean and standard deviation of each.
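
A sketch of what this might look like in base R. The particular models and parameter values below are assumptions picked for illustration, since the list of models is not reproduced in these notes. `sample(..., replace = TRUE)` is used as the base-R counterpart of `resample()`.

```r
set.seed(1)
n <- 10000

samples <- list(
  equal   = sample(1:6, size = n, replace = TRUE),  # equal probabilities on 1..6
  normal  = rnorm(n, mean = 0, sd = 1),
  uniform = runif(n, min = 0, max = 1),
  binom   = rbinom(n, size = 10, prob = 0.5),
  poisson = rpois(n, lambda = 3),
  expo    = rexp(n, rate = 2)
)

# Mean and standard deviation of each set of random numbers
summary_table <- sapply(samples, function(x) c(mean = mean(x), sd = sd(x)))
round(summary_table, 3)
```

Comparing each row to the parameters used to generate it (for example, the Poisson mean should be near `lambda`) is one way to see what the parameters mean.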

Discrete

Continuous

Why the “Normal” Distribution is Normal

Basic Operations: P and Q
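
Assuming "P" and "Q" refer to R's pxxx() and qxxx() families, in parallel with rxxx(), here is a minimal sketch using the standard normal distribution. The P operation takes a value and returns the probability of being at or below it. The Q operation goes the other way, from a probability to a value.

```r
# P: from a value to a cumulative probability
p <- pnorm(1.96)    # about 0.975

# Q: from a cumulative probability back to a value
q <- qnorm(0.975)   # about 1.96

# The two operations undo each other
round_trip <- qnorm(pnorm(1.3))

c(p = p, q = q, round_trip = round_trip)
```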

The D operation
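
Assuming "D" refers to R's dxxx() family, the sketch below contrasts the discrete and continuous cases. For a discrete model, dxxx() returns a probability for each outcome. For a continuous model, it returns a probability density, which can be larger than 1. The models and parameters are chosen only for illustration.

```r
# Discrete: probabilities of 0..10 successes in 10 tries; they add up to 1
d_binom <- dbinom(0:10, size = 10, prob = 0.5)
total_prob <- sum(d_binom)

# Continuous: density of a narrow normal distribution at its center
peak_density <- dnorm(0, mean = 0, sd = 0.1)   # exceeds 1, so not a probability

# The area under the density curve (here over +/- 10 standard deviations) is 1
area <- integrate(dnorm, lower = -1, upper = 1, mean = 0, sd = 0.1)$value

c(total_prob = total_prob, peak_density = peak_density, area = area)
```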

ACTIVITY: Returns on investments

Suppose the stock market gives a return of, say, 5% per year on average, with a standard deviation of about 6%. Simulate the total growth of an investment over 50 years, that is, the factor by which each dollar invested is multiplied.

prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
## [1] 9.968

Have each student do their own, then congratulate the student who got the highest return.

Then show the overall distribution:

trials = do(1000) * prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
densityplot(~trials)

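This distribution has a name: lognormal. The name reflects the fact that the log of the values has a normal distribution. Here that is approximately true: the log of a product is the sum of the logs of the 50 yearly factors, and a sum of many independent random terms is approximately normal.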

densityplot(~log(trials))
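
For anyone following along without the do() and densityplot() operators, here is a base-R sketch of the same simulation. `one_trial` is a hypothetical helper name. The sketch also checks the identity behind the lognormal shape: the log of the 50-year product equals the sum of the yearly logs.

```r
set.seed(2012)

# One simulated 50-year growth factor
one_trial <- function() prod(1 + rnorm(50, mean = 0.05, sd = 0.06))

# 1000 simulated investors
trials <- replicate(1000, one_trial())

# log of a product = sum of logs, checked on one set of yearly returns
r <- rnorm(50, mean = 0.05, sd = 0.06)
same <- isTRUE(all.equal(log(prod(1 + r)), sum(log(1 + r))))

# Because the years are independent, the average growth factor
# should land near 1.05^50 (about 11.47)
c(mean = mean(trials), median = median(trials), theory = 1.05^50)

# Base-graphics versions of the two density plots
plot(density(trials), main = "50-year growth factor")
plot(density(log(trials)), main = "log of 50-year growth factor")
```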

Matching a Model to Data

In fitting linear models, we've used the sum of squared residuals: the squared differences between the observed values \( x_i \) and the model values \( m_i \), added up over all cases. \( E^2 = \sum (x_i - m_i)^2 \)

In fitting a probability model (for discrete outcomes) there is a similar approach.

\( E^2 = \sum (expected_i - observed_i)^2 \)

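This criterion doesn't really work well. For instance, if the expected count for an outcome is zero, then observing that outcome should be impossible under the model. But this formula doesn't reflect that fact: it treats such an observation like any other discrepancy of the same size.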

The standard matching criterion is called \( \chi^2 \) (chi-squared) and is
\( \chi^2 = \sum \frac{(expected_i - observed_i)^2}{expected_i} \)
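
A small worked example of the \( \chi^2 \) calculation, using made-up counts from 60 rolls of a die and the model that all six faces are equally likely. The counts are hypothetical. The result is compared to what R's built-in `chisq.test()` reports.

```r
# HYPOTHETICAL data: counts of each face in 60 rolls
observed <- c(8, 12, 9, 11, 14, 6)

# Model: all six faces equally likely, so 10 expected of each
model_probs <- rep(1/6, 6)
expected <- sum(observed) * model_probs

# Chi-squared: squared discrepancy divided by the expected count, summed
chi2 <- sum((expected - observed)^2 / expected)   # 42 / 10 = 4.2

# The built-in test computes the same statistic
builtin <- unname(chisq.test(observed, p = model_probs)$statistic)

# An outcome with zero expected count that is nonetheless observed
# makes the statistic infinite, unlike the plain sum of squares
impossible_term <- (0 - 3)^2 / 0

c(by_hand = chi2, builtin = builtin, impossible_term = impossible_term)
```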

Another way to measure the match between a set of observations and a probability model is via the likelihood: the probability of the observations if the model were right.
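
To illustrate likelihood with a made-up observation, suppose a coin gives 7 heads in 10 flips. The sketch below computes the probability of that observation under two candidate models, a fair coin and a coin that lands heads 70% of the time. The data and the two candidate values are hypothetical.

```r
# HYPOTHETICAL observation: 7 heads in 10 flips
heads <- 7
flips <- 10

# Likelihood = probability of the observation if the model were right
lik_fair   <- dbinom(heads, size = flips, prob = 0.5)   # 120/1024, about 0.117
lik_biased <- dbinom(heads, size = flips, prob = 0.7)   # about 0.267

c(fair = lik_fair, biased = lik_biased, ratio = lik_biased / lik_fair)
```

The model with the higher likelihood is the better match to these particular observations, though with only 10 flips neither model is ruled out.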