A probability is a number between 0 and 1. 0 means “impossible” and 1 means “certain”. Values in between indicate degrees of likelihood, with bigger numbers indicating greater likelihood.
A random variable is a quantity (a number) whose value is determined by chance.
A sample space (poorly named) is the set of possible values for that number.
A probability model is an assignment of a probability to each member of the sample space.
It's helpful to distinguish between two kinds of sample spaces that apply to random variables:
For discrete numbers, it's possible to assign a probability to each outcome.
For continuous numbers, it's possible to assign a probability to a range of outcomes. Or, by dividing the probability by the extent of the range, one can assign a probability density to each outcome. We usually treat this probability density as a function of the value of the random variable: \( p(x) \)
We'll often use probabilities and probability densities in a similar way.
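To make the contrast concrete, here's a small base-R illustration (not from the notes): dbinom() assigns a probability to a single discrete outcome, pnorm() assigns a probability to a range of continuous outcomes, and dividing a narrow range's probability by its width approximates the density dnorm().
dbinom(3, size = 10, prob = 0.5)       # P(X = 3) for a binomial count
## [1] 0.1172
pnorm(1) - pnorm(-1)                   # P(-1 < X < 1) for a standard normal
## [1] 0.6827
(pnorm(0.01) - pnorm(-0.01)) / 0.02    # probability of a narrow range / its width ...
## [1] 0.3989
dnorm(0)                               # ... approximates the density p(0)
## [1] 0.3989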
The sampling distribution
For each probability model that we study, we'll name a setting that involves a confidence interval.
Introduce the rxxx() operation for each model. For equal probabilities, just use resample(1:k, size = n). Generate random numbers from each. Ask students to find the mean and standard deviation of each.
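A possible sketch of that activity, assuming the mosaic package is loaded for resample():
library(mosaic)
rolls <- resample(1:6, size = 1000)    # equal probabilities on 1, ..., 6
mean(rolls); sd(rolls)
draws <- rnorm(1000, mean = 0, sd = 1) # the rxxx() generator for the normal model
mean(draws); sd(draws)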
sample() versus resample()
You can use resampling to reconstruct a wide range of sampling distributions, without having to write down a specific probability model.
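A minimal sketch of the distinction and of the resampling idea, using the Galton dataset from mosaicData as a stand-in example:
sample(1:6, size = 6)     # without replacement: a shuffle, each value appears once
resample(1:6, size = 6)   # with replacement: repeats are possible
# Resample the cases and recompute a statistic to reconstruct its sampling distribution
boot <- do(500) * mean(~ height, data = resample(Galton))
sd(~ mean, data = boot)   # the standard error of the mean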
Examples:
* Fraction of the world covered by water.
* Model coefficients.
* R²

Poisson: the number of events that happen in a given interval. Examples: the number of cars passing by a point, the number of shooting stars in a minute's observation of the sky, the number of snowflakes that land on your glove in a minute. Parameter: \( \lambda \), the mean number of events per interval.
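A quick sketch with the corresponding generator, rpois() (the rate of 3 events per minute is made up for illustration):
counts <- rpois(1000, lambda = 3)  # 1000 simulated one-minute counts
mean(counts)                       # should come out near 3
var(counts)                        # for a Poisson model, the variance equals the mean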
Uniform.
Normal (Gaussian).
Exponential. Parameter: rate (the mean waiting time is 1/rate).
We won't see sampling distributions associated with the exponential model, but it's pretty informative for general situations.
Example: There have been 20 earthquakes recorded since Roman times. An earthquake occurred last year. When might the next one occur?
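A sketch of one way to simulate this, assuming "since Roman times" means roughly 2000 years, so the estimated rate is 20/2000 = 0.01 earthquakes per year. The exponential model is memoryless, so last year's earthquake doesn't change the forecast.
waits <- rexp(1000, rate = 20/2000)  # simulated waiting times, in years
mean(waits)                          # should be near 1/rate = 100 years
quantile(waits, c(0.025, 0.975))     # a wide range of plausible waits

t-distribution. A technical distribution discovered by William Gosset (writing as “Student”) and published in 1908. Link to a transcription of the publication and to the original on JSTOR.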
Example: The difference between the estimate of a coefficient and the population value, divided by the standard error, follows a t-distribution. Parameter: degrees of freedom — set to the degrees of freedom of the residual: \( n - m \), where n is the number of cases and m the number of model coefficients. This t-distribution tells you how to turn a standard error into a 95% confidence interval — use the 95% limits from the relevant t-distribution.
* Why 3 replications in biology? You need at least two runs to get a standard error. summary(lm(c(7, 4) ~ 1)) — this will give 1 degree of freedom.
qt(c(0.025, 0.975), df = 1)
## [1] -12.71 12.71
qt(c(0.025, 0.975), df = 2)
## [1] -4.303 4.303
qt(c(0.025, 0.975), df = 10000) # the famous 1.96
## [1] -1.96 1.96
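Putting the pieces together with made-up numbers (an estimate of 2.5 with a standard error of 0.6 and 10 residual degrees of freedom):
est <- 2.5; se <- 0.6                    # hypothetical coefficient and standard error
est + qt(c(0.025, 0.975), df = 10) * se  # the 95% confidence interval
## [1] 1.163 3.837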
Binomial: for counts that share the same prob, adding them up just increases the size parameter. Prediction: a binomial with a large size will look normal. Mean is \( np \). Var is \( n p (1-p) \).

Poisson: the lambda of a sum of independent Poisson counts will be the sum of the lambdas. Mean is lambda. Variance is lambda. ACTIVITY: Show that this is true — see the sketch below.
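A minimal sketch of that check by simulation (the particular lambdas and binomial settings are arbitrary):
a <- rpois(10000, lambda = 2)
b <- rpois(10000, lambda = 3)
mean(a + b); var(a + b)       # both should be near 2 + 3 = 5
x <- rbinom(10000, size = 1000, prob = 0.3)
mean(x); var(x)               # near np = 300 and np(1-p) = 210

The stock market gives a return of, say, 5%/year on average with a standard deviation of about 6%. Simulate the total investment return over 50 years: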
prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
## [1] 7.265
Have each student do their own, then congratulate the student who got the highest return.
Then show the overall distribution:
trials = do(1000) * prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
densityplot(~ prod, data = trials)  # do() stores each result in a column named prod
This distribution has a name: lognormal. It reflects the fact that the log of the values has a normal distribution.
densityplot(~ log(prod), data = trials)
Distribution of random angles.
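A sketch of one simple version, assuming the angles are drawn uniformly between 0 and 360 degrees:
angles <- runif(1000, min = 0, max = 360)  # every direction equally likely
densityplot(~ angles)                      # roughly flat across the whole range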