A probability is a number between 0 and 1. 0 means “impossible” and 1 means “certain”. Values in between indicate degrees of likelihood, with bigger numbers indicating greater likelihood.
A random variable is a quantity (a number) whose value is determined by chance.
A sample space (poorly named) is the set of possible values for that number.
A probability model is an assignment of a probability to each member of the sample space.
It's helpful to distinguish between two kinds of sample spaces that apply to random variables:
For discrete numbers, it's possible to assign a probability to each outcome.
For continuous numbers, it's possible to assign a probability to a range of outcomes. Or, by dividing the probability by the extent of the range, one can assign a probability density to each outcome. We usually treat this probability density as a function of the value of the random variable: \( p(x) \)
We'll often use probabilities and probability densities in a similar way.
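To make the contrast concrete, here's a small base-R illustration (not from the notes): dbinom() assigns a probability to a single discrete outcome, pnorm() assigns a probability to a range of continuous outcomes, and dividing a narrow range's probability by its width approximates the density dnorm().
dbinom(3, size = 10, prob = 0.5)       # P(X = 3) for a binomial count
## [1] 0.1172
pnorm(1) - pnorm(-1)                   # P(-1 < X < 1) for a standard normal
## [1] 0.6827
(pnorm(0.01) - pnorm(-0.01)) / 0.02    # probability of a narrow range / its width ...
## [1] 0.3989
dnorm(0)                               # ... approximates the density p(0)
## [1] 0.3989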
The sampling distribution
For each probability model that we study, we'll name a setting that involves a confidence interval.
Introduce the rxxx() operation for each model. For equal probabilities, just use resample(1:k, size = n). Generate random numbers from each. Ask students to find the mean and standard deviation of each.
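A possible sketch of that activity, assuming the mosaic package is loaded for resample():
library(mosaic)
rolls <- resample(1:6, size = 1000)    # equal probabilities on 1, ..., 6
mean(rolls); sd(rolls)
draws <- rnorm(1000, mean = 0, sd = 1) # the rxxx() generator for the normal model
mean(draws); sd(draws)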
sample() versus resample()
You can use resampling to reconstruct a wide range of sampling distributions, without having to write down a specific probability model.
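A minimal sketch of the distinction and of the resampling idea, using the Galton dataset from mosaicData as a stand-in example:
sample(1:6, size = 6)     # without replacement: a shuffle, each value appears once
resample(1:6, size = 6)   # with replacement: repeats are possible
# Resample the cases and recompute a statistic to reconstruct its sampling distribution
boot <- do(500) * mean(~ height, data = resample(Galton))
sd(~ mean, data = boot)   # the standard error of the mean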
Examples:
* Fraction of the world covered by water.
* Model coefficients.
* R²

Poisson: the number of events that happen in a given interval. Examples: the number of cars passing by a point, the number of shooting stars in a minute's observation of the sky, the number of snowflakes that land on your glove in a minute. Parameter: \( \lambda \), the mean number of events per interval.
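A quick sketch with the corresponding generator, rpois() (the rate of 3 events per minute is made up for illustration):
counts <- rpois(1000, lambda = 3)  # 1000 simulated one-minute counts
mean(counts)                       # should come out near 3
var(counts)                        # for a Poisson model, the variance equals the mean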
Uniform.
Normal (Gaussian).
Exponential. Parameter: rate (the mean waiting time is 1/rate).
We won't see sampling distributions associated with the exponential model, but it's pretty informative for general situations.
Example: There have been 20 earthquakes recorded since Roman times. An earthquake occurred last year. When might the next one occur?
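A sketch of one way to simulate this, assuming "since Roman times" means roughly 2000 years, so the estimated rate is 20/2000 = 0.01 earthquakes per year. The exponential model is memoryless, so last year's earthquake doesn't change the forecast.
waits <- rexp(1000, rate = 20/2000)  # simulated waiting times, in years
mean(waits)                          # should be near 1/rate = 100 years
quantile(waits, c(0.025, 0.975))     # a wide range of plausible waits

t-distribution. A technical distribution discovered by William Gosset (writing as “Student”) and published in 1908. Link to a transcription of the publication and to the original on JSTOR.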
Example: The difference between the estimate of a coefficient and the population value, divided by the standard error, follows a t-distribution. Parameter: degrees of freedom — set to the degrees of freedom of the residual: \( n - m \), where n is the number of cases and m the number of model coefficients. This t-distribution tells you how to turn a standard error into a 95% confidence interval — use the 95% limits from the relevant t-distribution.
* Why 3 replications in biology? You need at least two runs to get a standard error. summary(lm(c(7, 4) ~ 1)) — this will give 1 degree of freedom.
qt(c(0.025, 0.975), df = 1)
## [1] -12.71 12.71
qt(c(0.025, 0.975), df = 2)
## [1] -4.303 4.303
qt(c(0.025, 0.975), df = 10000) # the famous 1.96
## [1] -1.96 1.96
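Putting the pieces together with made-up numbers (an estimate of 2.5 with a standard error of 0.6 and 10 residual degrees of freedom):
est <- 2.5; se <- 0.6                    # hypothetical coefficient and standard error
est + qt(c(0.025, 0.975), df = 10) * se  # the 95% confidence interval
## [1] 1.163 3.837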
Binomial: for counts that share the same prob, adding them up just increases the size parameter. Prediction: a binomial with a large size will look normal. Mean is \( np \). Var is \( n p (1-p) \).

Poisson: the lambda of a sum of independent Poisson counts will be the sum of the lambdas. Mean is lambda. Variance is lambda. ACTIVITY: Show that this is true — see the sketch below.
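A minimal sketch of that check by simulation (the particular lambdas and binomial settings are arbitrary):
a <- rpois(10000, lambda = 2)
b <- rpois(10000, lambda = 3)
mean(a + b); var(a + b)       # both should be near 2 + 3 = 5
x <- rbinom(10000, size = 1000, prob = 0.3)
mean(x); var(x)               # near np = 300 and np(1-p) = 210

The stock market gives a return of, say, 5%/year on average with a standard deviation of about 6%. Simulate the total investment return over 50 years: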
prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
## [1] 7.265
Have each student do their own, then congratulate the student who got the highest return.
Then show the overall distribution:
trials = do(1000) * prod(1 + rnorm(50, mean = 0.05, sd = 0.06))
densityplot(~ prod, data = trials)  # do() stores each result in a column named prod
This distribution has a name: lognormal. It reflects the fact that the log of the values has a normal distribution.
densityplot(~ log(prod), data = trials)
Distribution of random angles.
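A sketch of one simple version, assuming the angles are drawn uniformly between 0 and 360 degrees:
angles <- runif(1000, min = 0, max = 360)  # every direction equally likely
densityplot(~ angles)                      # roughly flat across the whole range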