All through the semester we have worked within a certain framework for modeling: models which partition variation in a response variable into an “explained” part (using the variation in explanatory variables) and into a “residual” part, which has been treated as completely random.
Today, we're going to work with a completely different framework for modeling, one that is completely descriptive and doesn't try to partition variation into deterministic and random parts — everything is regarded as random.
Even when something is regarded as random, it doesn't mean that it is completely unpredictable. For instance, the roll of a pair of dice is much more likely to result in a 7 than an 11. The models we'll work with today are ways of describing probabilities of different outcome.
I was asked last semester to help, pro bono, a contractor trying to improve the survivability of US Marine Corp amphibious vehicles. The vehicles in use weren't designed around the threat of IEDs and the seating in them needs to be changed to provide better support in case of an explosion underneath, e.g. benches mounted on shock absorbers, no one sitting on the floor. The hull configuration (that's what you call the bottom of an amphibious vehicle) also needs to be changed. But it doesn't work just to make the vehicle bigger and heavier. Heavier is always a problem, requiring a bigger engine, etc. Bigger means not using an existing platform, which raises development costs and incurs delays.
The people I was working with had a small contract to build a design around modifications to an existing USMC vehicle, the AAV. To make it work, they would have to allocate somewhat less space per marine. They wanted to know how to figure out what space would work.
Here are the data I was given the widths (in feet) of equipped marines and the number in a 17-person squad:
| Role | # in squad | 5th | median | 95th |
|---|---|---|---|---|
| Rifleman | 6 | 1.99 | 2.06 | 2.15 |
| Grenadier | 6 | 1.99 | 2.05 | 2.24 |
| Pistol Carrier | 1 | 2.28 | 2.36 | 2.67 |
| Automatic Rifleman | 4 | 2.00 | 2.06 | 2.34 |
The standard calculation involves constructing a squad of 17 marines all at the 95th percentile. The contractor thought this was unnecessarily large and asked whether it wouldn't suffice to make the vehicle big eough to handle a squad of median marines, figuring that the wider ones would share space with narrower marines and things would cancel out.
I proposed to do a simulation of randomly generated squads and find out what vehicle size will accomodate the vast majority of squads. I would generate random sizes that meet the above percentiles in a sensible way and add up the total width. (It's a little bit simplified, because the seating plan has 8 marines on a side, with the pistol carrier at the forward bulkhead, facing backward toward the exit ramp.)
QUESTION How should I generate random marine widths?
A probability is a number between 0 and 1. 0 means “impossible” and 1 means “certain”. Values between 0 and 1 indicate possibility, with bigger numbers indicating a greater possibility.
A random variable is a quantity (a number) that is random.
A sample space (poorly named) is the set of possible values for that number.
A probability model is an assignment of a probability to each member of the sample space.
It's helpful to distinguish between two kinds of sample spaces that apply to random variables:
For discrete numbers, it's possible to assign a probability to each outcome.
For continuous numbers, it's possible to assign a probability to a range of outcomes. Or, by dividing the probability by the extent of the range, one can assign a probability density to each outcome. We usually treat this probability density as a function of the value of the random value: \( p(x) \)
We'll often use probabilities and probability densities in a similar way.
There are several approaches to creating probability models, that is, to making an association between a probability and a member of the sample space:
Your job is to learn how the setting relates to the choice of probability model and the meaning of the parameter(s) for each model.
Introduce the rxxx() operation for each. For equal probabilities, just use
resample(1:k, size=n). Generate random numbers from each. Ask them to find the mean and standard deviation of each.
rate (the mean time is 1/rate)prob, adding them up just increases the size parameter. Prediction: binomial with large size will be normal. Mean is \( np \). Var is \( n p (1-p) \).lambda will be the sum of the lambdas. Mean is lambda. Variance is lambda. ACTIVITY: Show that this is true.Stock market gives return of, say 5%/year on average with a standard deviation of about 6%. Simulate the total investment return over 50 years.
prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
## [1] 14.68
Have each student do their own, then congratulate the student who got the highest return.
Then show the overall distribution:
trials = do(1000) * prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
densityplot(~trials)
This distribution has a name: lognormal. It reflects the fact that the log of the values has a normal distribution.
densityplot(~log(trials))
In fitting linear models, we've used the sum of square residuals, the difference between the observed and theoretical values squared. \( E^2 = \sum (x_i - m_i)^2 \)
In fitting a probability model (for discrete outcomes) there is a similar approach.
\( E^2 = \sum (expected_i - observed_i)^2 \)
This criterion doesn't really work well. For instance, if the expected is zero, then the observed should be impossible. But this formula doesn't reflect that fact.
The standard matching criterion is called a \( \chi^2 \) (chi-squared) and is \( \chi^2 = \sum \frac{(expected_i - observed_i)^2}{expected^2} \)
Another way to measure the match between a set of observations and a probability model is via the likelihood: the probability of the observations if the model were right.