C. Donovan
Gain Communications sells aircraft communications units to both the military and the civilian markets. Next year's sales depend on market conditions that cannot be predicted exactly. Gain follows the modern practice of using probability estimates of sales. The military division estimates its sales as follows:
Units Sold | 1000 | 3000 | 5000 | 10,000 |
Probability | 0.1 | 0.3 | 0.4 | 0.2 |
These are personal probabilities that express the informed opinion of Gain's executives.
The corresponding sales estimates for the civilian division are
Units Sold | 300 | 500 | 750 |
Probability | 0.4 | 0.5 | 0.1 |
Take \( X \) to be the number of military units sold and \( Y \) the number of civilian units.
\[ \begin{align*} E[X] &= 1000\times 0.1 + 3000\times 0.3 + 5000\times 0.4 + 10,000\times 0.2 \\ &= 5000 \mbox{ units} \\ E[Y] &= 300\times 0.4 + 500\times 0.5 + 750\times 0.1 \\ &= 445 \mbox{ units} \end{align*} \]
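The two expectations above can be reproduced in R as a probability-weighted sum. This is a minimal sketch; the helper `expected_value` and the vector names are illustrative choices, not part of the original notes.

```r
# Sales scenarios and their personal probabilities, copied from the tables above
military_units <- c(1000, 3000, 5000, 10000)
military_prob  <- c(0.1, 0.3, 0.4, 0.2)
civilian_units <- c(300, 500, 750)
civilian_prob  <- c(0.4, 0.5, 0.1)

# Expectation of a discrete RV: sum of (value * probability).
# expected_value is a hypothetical helper written for this illustration.
expected_value <- function(values, probs) {
  stopifnot(isTRUE(all.equal(sum(probs), 1)))  # probabilities must sum to 1
  sum(values * probs)
}

EX <- expected_value(military_units, military_prob)  # expected military units
EY <- expected_value(civilian_units, civilian_prob)  # expected civilian units
print(c(EX = EX, EY = EY))
```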
Gain makes a profit of $2000 on each military unit sold and $3500 on each civilian unit. Next year's profits from military sales will be $2000\( X \). Similarly, the profit from civilian sales will be $3500\( Y \). Thus, the total profit for next year is $2000\( X \) + $3500\( Y \).
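Because expectation is linear, the expected total profit is
\[ \begin{align*} E[2000X + 3500Y] &= 2000 E[X] + 3500 E[Y] \\ &= 2000\times 5000 + 3500\times 445 = \$11,557,500 \end{align*} \]
The same rule gives the expected difference between military and civilian profits:
\[ \begin{align*} E[2000X - 3500Y] &= 2000 E[X] - 3500 E[Y] \\ &= 2000\times 5000 - 3500\times 445 = \$8,442,500 \end{align*} \]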
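As a rough check on the linearity rule, the sketch below computes the expected total profit two ways: directly from \( E[X] \) and \( E[Y] \), and by brute force over every (military, civilian) sales combination. Building the joint table with `outer` assumes the two divisions' sales are independent; that assumption is made here only to construct the table and is not stated in the notes (linearity itself does not need it). All variable names are illustrative.

```r
x  <- c(1000, 3000, 5000, 10000); px <- c(0.1, 0.3, 0.4, 0.2)  # military
y  <- c(300, 500, 750);           py <- c(0.4, 0.5, 0.1)       # civilian

# Route 1: linearity of expectation
expected_total_linear <- 2000 * sum(x * px) + 3500 * sum(y * py)

# Route 2: enumerate all 4 x 3 outcomes.
# ASSUMPTION: independence, so P(X = x_i, Y = y_j) = px_i * py_j
joint_prob   <- outer(px, py)                    # 4 x 3 table of joint probabilities
total_profit <- outer(2000 * x, 3500 * y, "+")   # profit for each combination
expected_total_joint <- sum(total_profit * joint_prob)

print(c(linear = expected_total_linear, joint = expected_total_joint))
```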
We have looked at the general formulae for Expectation (\( E[X] \)) and Variance (\( V[X] \)) for the discrete case.
These formulae require some modification for a continuous RV.
Formally, if \( X \) is a continuous RV, the probability density function (PDF) of \( X \) is a function \( f(x) \) such that, for any two numbers \( a \) and \( b \) with \( a \le b \),
\[ \Pr(a \le X \le b) = \int_a^b f(x) dx \]
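Two conditions that \( f(x) \) satisfies:
\( f(x) \ge 0 \) for all \( x \)
\( \int_{-\infty}^{\infty} f(x) dx = 1 \)
Observe that the value of \( f(x) \) at a point \( x \) is not a probability; only areas under \( f \) are probabilities.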
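A small sketch of why a density value is not a probability: for a narrow enough interval the density exceeds 1, while every probability stays between 0 and 1. The Uniform(0, 0.5) distribution used here is chosen only for illustration and does not appear in the notes.

```r
# Density of Uniform(0, 0.5) at x = 0.25: 1 / (0.5 - 0) = 2, which is > 1
density_value <- dunif(0.25, min = 0, max = 0.5)

# Probability of the single point x = 0.25: area over a zero-width interval
point_prob <- punif(0.25, 0, 0.5) - punif(0.25, 0, 0.5)

# Probability of a genuine interval, [0.2, 0.3]: width 0.1 times height 2
interval_prob <- punif(0.3, 0, 0.5) - punif(0.2, 0, 0.5)

print(c(density = density_value, point = point_prob, interval = interval_prob))
```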
Chemistry Example. A reaction temperature \( X \) (in degrees C) in a certain chemical process has a uniform distribution on the interval \( [-5, 5] \). In this case \( f(x) = \frac{1}{10} \) on \( [-5,5] \) and 0 elsewhere.
# R knows about uniform distributions
x <- seq(-6, 6, by = 0.1)
plot(x, dunif(x, -5, 5), type = 'l',
xlim = c(-6, 6), ylab = 'density', lwd = 2)
A picture:
Verifying that \( f(x) \) is a pdf:
\[ \begin{eqnarray*} \int_{-\infty}^\infty f(x) dx = \int_{-5}^5 \frac{1}{10} dx = \left. \frac{x}{10} \right|_{-5}^{5} = \frac{1}{10}[5 - (-5)] = 1 \end{eqnarray*} \]
Sample calculations.
\[ \begin{align*} \Pr(X < 0) &= \int_{-5}^0 f(x) dx = \int_{-5}^0 \frac{1}{10} dx\\ &= \frac{x}{10}|_{-5}^0 = \frac{0}{10} + \frac{5}{10} = 1/2 \end{align*} \]
\[ \begin{align*} \Pr(-2.5 < X < 2.5) &= \int_{-2.5}^{2.5} \frac{1}{10} dx = \frac{x}{10}|_{-2.5}^{2.5} \\ &= \frac{2.5+2.5}{10} = 1/2 \end{align*} \]
Of course, we don't do this by hand; the wise use computers:
# integral of uniform(-5, 5) from 0 downwards
punif(0, -5, 5)
[1] 0.5
# integral of uniform(-5, 5) between -2.5 and 2.5
# use integral below 2.5, then subtract below -2.5
punif(2.5, -5, 5) - punif(-2.5, -5, 5)
[1] 0.5
Note: I'm using CDFs (punif)! dunif gives the point evaluation of the PDF.
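To connect the CDF shortcut back to the integrals, the sketch below integrates the PDF numerically with base R's `integrate` and compares the result with the `punif` differences. The wrapper `f` and the result names are illustrative.

```r
# PDF of the reaction temperature, Uniform(-5, 5)
f <- function(x) dunif(x, min = -5, max = 5)

# Pr(X < 0): numerical integral of the PDF versus the CDF
p_below_zero_numeric <- integrate(f, lower = -5, upper = 0)$value
p_below_zero_cdf     <- punif(0, -5, 5)

# Pr(-2.5 < X < 2.5): numerical integral versus a difference of CDF values
p_middle_numeric <- integrate(f, lower = -2.5, upper = 2.5)$value
p_middle_cdf     <- punif(2.5, -5, 5) - punif(-2.5, -5, 5)

print(c(p_below_zero_numeric, p_below_zero_cdf, p_middle_numeric, p_middle_cdf))
```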
The Cumulative Distribution Function (CDF) for a continuous RV is defined in the same way as for a discrete RV, but the calculation for a continuous RV involves integration while that for a discrete RV involves summation. Formally,
\[ \begin{eqnarray*} F(x) = \Pr(X \le x) = \int_{-\infty}^x f(u) du \end{eqnarray*} \]
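As a worked instance, integrating the Uniform(\( -5, 5 \)) density from the chemistry example up to \( x \) gives its CDF (a sketch using only the density \( f(u) = \frac{1}{10} \) on \( [-5, 5] \) defined earlier):
\[ F(x) = \int_{-\infty}^{x} f(u) du = \begin{cases} 0 & x < -5 \\ \frac{x+5}{10} & -5 \le x \le 5 \\ 1 & x > 5 \end{cases} \]
So \( F(0) = \frac{5}{10} = 1/2 \) and \( F(2.5) - F(-2.5) = 0.75 - 0.25 = 1/2 \), which is what punif computes.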
Similar to a discrete RV, the expectation for a continuous RV is like a weighted sum, this time with integration required.
\[ \begin{eqnarray*} E(X) = \mu_X = \int_{-\infty}^{\infty} x f(x) dx \end{eqnarray*} \]
Likewise the variance,
\[ \begin{eqnarray*} V(X) = \sigma^2_X = \int_{-\infty}^{\infty} (x-\mu_X)^2 f(x) dx \end{eqnarray*} \]
An alternative calculation,
\[ \begin{eqnarray*} V(X) = E(X^2) - \left [ E(X) \right ]^2 \end{eqnarray*} \]
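A minimal sketch of these formulae applied to the Uniform(-5, 5) example, using numerical integration. The integrals run over \( [-5, 5] \) rather than \( (-\infty, \infty) \) because the density is zero elsewhere; all names are illustrative.

```r
# PDF of Uniform(-5, 5); zero outside [-5, 5], so we integrate over that range only
f <- function(x) dunif(x, min = -5, max = 5)

# E(X): integral of x * f(x)
mu <- integrate(function(x) x * f(x), lower = -5, upper = 5)$value

# V(X) from the definition: integral of (x - mu)^2 * f(x)
var_definition <- integrate(function(x) (x - mu)^2 * f(x), lower = -5, upper = 5)$value

# V(X) from the alternative calculation: E(X^2) - [E(X)]^2
ex2 <- integrate(function(x) x^2 * f(x), lower = -5, upper = 5)$value
var_alternative <- ex2 - mu^2

print(c(mean = mu, var_definition = var_definition, var_alternative = var_alternative))
```

Both routes should agree with the closed form for a uniform distribution, \( (b-a)^2/12 = 100/12 \).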
We've seen the Uniform already.
We'll use (eventually) the \( F \), \( \chi^2 \), Normal and \( t \)-distributions
Some Features
# a few distributions with different parameters
x <- seq(-20, 20, by = 0.1)
y_0_2 <- dnorm(x, 0, 2)
y_0_5 <- dnorm(x, 0, 5)
y_10_5 <- dnorm(x, 10, 5)  # mean 10, sd 5 (name now matches the parameters)
plot(x, y_0_2, lwd = 2, type ='l')
lines(x, y_0_5, lwd = 2, col = 'blue')
lines(x, y_10_5, lwd = 2, col = 'purple')
Some properties of normally distributed RVs.
The function itself is this (a function of \( x \), with two parameters):
\[ \begin{eqnarray*} f(x;\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\left [ \frac{-1}{2\sigma^2} (x-\mu)^2 \right ]} \end{eqnarray*} \]
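The density formula can be coded directly and compared with R's built-in `dnorm`. This is a sketch; `normal_pdf` is a hypothetical helper written for this check, and the parameter values are arbitrary.

```r
# Hand-coded Normal density, term by term from the formula above
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-20, 20, by = 0.1)

# Largest disagreement with dnorm on the grid (mean 10, sd 5)
max_abs_diff <- max(abs(normal_pdf(x, mu = 10, sigma = 5) - dnorm(x, mean = 10, sd = 5)))

# The density should integrate to 1 (checked here for mean 0, sd 2)
area <- integrate(normal_pdf, lower = -Inf, upper = Inf, mu = 0, sigma = 2)$value

print(c(max_abs_diff = max_abs_diff, area = area))
```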
This near normality can be helpful in that the percentage of individuals falling in some range can be estimated given knowledge of \( \mu \) and \( \sigma \) alone.
For example, the distribution of heights of women aged 18 to 24 is approximately normal with mean \( \mu \)=64.5 inches and standard deviation \( \sigma \)=2.5 inches.
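Find the probability that a randomly chosen woman is no taller than 5'0'' (60 inches), where \( X \sim N(64.5, 2.5) \) (mean, standard deviation):
\[ \Pr(X \le 60) \]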
# so integrate the N(64.5, 2.5) below 60
pnorm(q = 60, mean = 64.5, sd = 2.5)
[1] 0.03593032
Find the probability that a randomly chosen woman is taller than 5'8''.
\[ \Pr(X > 68) \]
# integrate the same distribution, but above 68
pnorm(68, 64.5, 2.5, lower.tail = FALSE)
[1] 0.08075666
#or
1-pnorm(68, 64.5, 2.5)
[1] 0.08075666
Find the probability that a randomly chosen woman is between 4'11.5'' and 5'9.5''.
\[ \Pr(59.5 \le X \le 69.5) \]
# integrate the same distribution, but between 59.5 and 69.5
pnorm(69.5, 64.5, 2.5) - pnorm(59.5, 64.5, 2.5)
[1] 0.9544997
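The standard normal distribution has \( \mu = 0 \) and \( \sigma = 1 \), so the density reduces to
\[ \begin{eqnarray*} f(x;0,1) = \frac{1}{\sqrt{2 \pi}} e^{\left [ \frac{-1}{2} x^2 \right ]} \end{eqnarray*} \]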
The standard normal RV is often denoted \( Z \).
The standard normal distribution is ''historically'' important. Any Normal RV can be transformed to a standard normal RV by converting it to standard units. This process, also called standardization, is simply \( Z = \frac{X-\mu}{\sigma} \).
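A short sketch of standardization in R, reusing the women's heights example (mean 64.5, sd 2.5): converting 60 inches to standard units and using the standard normal CDF gives the same probability as passing the mean and sd to `pnorm`. Variable names are illustrative.

```r
mu    <- 64.5  # mean height in inches (from the heights example)
sigma <- 2.5   # standard deviation in inches

# Convert 60 inches to standard units
z <- (60 - mu) / sigma

# Pr(Z <= z) on the standard normal scale
p_standardised <- pnorm(z)

# Pr(X <= 60) on the original scale
p_direct <- pnorm(60, mean = mu, sd = sigma)

print(c(z = z, p_standardised = p_standardised, p_direct = p_direct))
```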
It is commonly assumed that IQ is Normal(100,15). What is the IQ for someone in the 99th percentile?
qnorm(0.99)*15 + 100
[1] 134.8952
Note: leaving out the mean and standard deviation makes qnorm fall back to its defaults, \( N(0,1) \).
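The same percentile can be obtained by passing the mean and standard deviation to `qnorm` directly, which is the reverse of standardization. A minimal sketch with illustrative variable names:

```r
# 99th percentile via the standard normal, then rescaled to the IQ scale
iq_via_z <- qnorm(0.99) * 15 + 100

# 99th percentile asking qnorm for Normal(100, 15) directly
iq_direct <- qnorm(0.99, mean = 100, sd = 15)

# Round trip: the CDF at that IQ should give back 0.99
round_trip <- pnorm(iq_direct, mean = 100, sd = 15)

print(c(iq_via_z = iq_via_z, iq_direct = iq_direct, round_trip = round_trip))
```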
Recall your project
We'll now approach our Cannabis study with more statistical rigour. We'll examine how Calcium levels in cannabis leaves and seed germination rates differ across soil types.
Exploratory Data Analysis
summary(CaData$Ca)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18000 22750 30000 30208 35000 50000
Summary statistics for Calcium values in Cannabis leaves grown in potting mix.
hist(CaData$Ca, col = 'purple')
An estimate is a quantity calculated from the data to approximate an unknown parameter.
Great, but:
Intuitively
Average Ca values obtained by repeatedly sampling from a Normal distribution with \( \mu=30208.33 \) and \( \sigma=9069.678 \).
Notice,
(remember we are pretending we know the population mean).
It turns out that:
The mean of the sample means is equal to the true population mean, i.e. the sample mean is an unbiased estimate of the population mean \( \mu \) – the number we are after.
The standard deviation of the sample means is found by dividing the population standard deviation by the square root of the sample size being averaged.
\[ \begin{align*} \textrm{Expected value}(\textrm{sample mean}) &= \textrm{Population mean}\\ E(\overline{X}) &= \mu\\ sd(\textrm{sample mean}) &= \frac{\textrm{Population SD}}{\sqrt{\textrm{Sample size}}}\\ sd(\overline{X}) &= \frac{\sigma}{\sqrt{n}} \end{align*} \]
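A sketch of where these two results come from, assuming the observations \( X_1, \dots, X_n \) are independent with common mean \( \mu \) and variance \( \sigma^2 \) (an assumption not spelled out in these notes):
\[ \begin{align*} E(\overline{X}) &= E\left( \frac{1}{n}\sum_{i=1}^n X_i \right) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{n\mu}{n} = \mu \\ V(\overline{X}) &= \frac{1}{n^2}\sum_{i=1}^n V(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \\ sd(\overline{X}) &= \sqrt{V(\overline{X})} = \frac{\sigma}{\sqrt{n}} \end{align*} \]
Independence is used only in the variance line, where the variance of a sum splits into a sum of variances.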
We have a sample of 24 cannabis leaves grown in potting mix. Suppose our data look approximately Normally distributed (i.e. bell-shaped; this is a bit of a stretch).
When the data come from a Normal distribution the distribution of the sample mean is also Normal. This is very handy because:
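While we rarely know the true population mean, it is a fairly safe bet (about 95% of the time when \( \overline{X} \) is Normal) that our sample mean will fall within 2 standard deviations of the sample mean, i.e. within \( 2\sigma/\sqrt{n} \), of the population mean.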
Note: this must depend on the shape of the distribution of \( \bar{x} \).
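How safe the "within 2 standard deviations" bet is can be sketched numerically. The code below treats the Ca summary values \( \mu = 30208.33 \) and \( \sigma = 9069.678 \) as if they were the population parameters (the same pretence these notes use) and takes \( n = 24 \); it also assumes the sample mean is exactly Normal.

```r
mu    <- 30208.33   # treated as the population mean (pretence, as in the notes)
sigma <- 9069.678   # treated as the population SD
n     <- 24         # number of leaves in the sample

# Standard deviation of the sample mean
sd_mean <- sigma / sqrt(n)

# Pr(sample mean within 2 of its standard deviations of mu), assuming Normality
p_within_2sd <- pnorm(mu + 2 * sd_mean, mean = mu, sd = sd_mean) -
  pnorm(mu - 2 * sd_mean, mean = mu, sd = sd_mean)

print(c(sd_mean = sd_mean, p_within_2sd = p_within_2sd))
```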
Using the information above, we expect to get:
Sample size | Mean of the means | SD of the means |
---|---|---|
5 | 30208 (30123) | \( \frac{9069.678}{\sqrt{5}}= 4056.083 \) (3943.348) |
20 | 30208 (30162) | \( \frac{9069.678}{\sqrt{20}}=2028.042 \) (1950.543) |
30 | 30208 (30189) | \( \frac{9069.678}{\sqrt{30}}= 1655.889 \) (1737.237) |
40 | 30208 (30292) | \( \frac{9069.678}{\sqrt{40}}=1434.042 \) (1422.778) |
Mean and standard deviation of the 1000 sample means obtained using sample sizes of 5, 20, 30 and 40 from \( N(\mu=30208.33, \sigma=9069.678) \). The numbers in brackets are what we observed when we used the computer to do the simulations in the previous figure.
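A simulation along these lines can be sketched as follows. This is not the code that produced the bracketed numbers; the seed is arbitrary, so the simulated values will differ slightly from those in the table, and the helper `simulate_means` is a hypothetical name.

```r
set.seed(1)          # arbitrary seed, only so the run is repeatable
mu     <- 30208.33   # population mean used in the notes
sigma  <- 9069.678   # population SD used in the notes
n_sims <- 1000       # number of sample means per sample size

# Draw n_sims samples of size n and return their means
simulate_means <- function(n) {
  replicate(n_sims, mean(rnorm(n, mean = mu, sd = sigma)))
}

sample_sizes <- c(5, 20, 30, 40)

# One row per sample size: simulated mean and SD of the means, plus sigma / sqrt(n)
results <- t(sapply(sample_sizes, function(n) {
  means <- simulate_means(n)
  c(n = n,
    mean_of_means = mean(means),
    sd_of_means   = sd(means),
    theory_sd     = sigma / sqrt(n))
}))

print(round(results, 1))
```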
We don't usually know if our data is drawn from an exactly Normal distribution….
Central Limit Theorem (CLT)
No matter what distribution we sample from, the distribution of the sample mean (\( \overline{X} \)) is closely approximated by the Normal distribution in large samples.
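One way to explore this by simulation is to sample from a clearly non-Normal distribution and watch the sample means become more symmetric as \( n \) grows. The sketch below uses an Exponential(rate = 1) population; that choice, the seed, the sample sizes and the `skewness` helper are all assumptions made for illustration and are not taken from the slides.

```r
set.seed(2)      # arbitrary seed for repeatability
n_sims <- 2000   # number of sample means per sample size

# Simple moment-based skewness: 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Sample means from a right-skewed Exponential(1) population (mean 1, sd 1)
means_n2  <- replicate(n_sims, mean(rexp(2,  rate = 1)))
means_n50 <- replicate(n_sims, mean(rexp(50, rate = 1)))

skew_n2  <- skewness(means_n2)
skew_n50 <- skewness(means_n50)

# With n = 50 the means should be near-symmetric, centred on 1, with sd near 1 / sqrt(50)
print(c(skew_n2 = skew_n2, skew_n50 = skew_n50,
        mean_n50 = mean(means_n50), sd_n50 = sd(means_n50)))
```

Plotting `hist(means_n2)` and `hist(means_n50)` side by side shows the same effect visually.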
Exploration by simulation…. (refer PPT slides)
How big does \( n \) have to be for the central limit effect to work?
The standard error
\[ \begin{align*} se(\textrm{sample mean}) &= \frac{\textrm{Sample SD}}{\sqrt{\textrm{Sample size}}}\\ se(\overline{x}) &= \frac{s_x}{\sqrt{n}} \end{align*} \]
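In R the standard error is a one-liner. The sketch below wraps it in a hypothetical helper `std_error` and applies it to a small made-up vector (not the study data), so the arithmetic can be checked by hand.

```r
# Standard error of the mean: sample SD divided by sqrt(sample size).
# std_error is a hypothetical helper; missing values are dropped first.
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Made-up toy data (NOT the Ca measurements): mean 5, sample variance 32/7
toy <- c(2, 4, 4, 4, 5, 5, 7, 9)
se_toy <- std_error(toy)

print(se_toy)
```

On the project data the equivalent call would be `std_error(CaData$Ca)`.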
(simulation PPT)