MT5762 Lecture 7

C. Donovan

Combining random variables

Example

Gain Communications sells aircraft communications units to both the military and the civilian markets. Next year's sales depend on market conditions that cannot be predicted exactly. Gain follows the modern practice of using probability estimates of sales. The military division estimates its sales as follows:


Units Sold	1000	3000	5000	10,000
Probability	0.1	0.3	0.4	0.2

These are personal probabilities that express the informed opinion of Gain's executives.

The corresponding sales estimates for the civilian division are


Units Sold	300	500	750
Probability	0.4	0.5	0.1

Example

Take $ X $ to be the number of military units sold and $ Y $ the number of civilian units.

Find $ E[X] $ and $ E[Y] $

\[ \begin{align*} E[X] &= 1000\times 0.1 + 3000\times 0.3 + 5000\times 0.4 + 10,000\times 0.2 \\ &= 5000 \mbox{ units} \\ E[Y] &= 300\times 0.4 + 500\times 0.5 + 750\times 0.1 \\ &= 445 \mbox{ units} \end{align*} \]
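The same weighted sums can be checked in R. This is a minimal sketch; the vector names (x_units, p_x and so on) are illustrative and do not come from the lecture.

```r
# Military division: units sold and their probabilities
x_units <- c(1000, 3000, 5000, 10000)
p_x <- c(0.1, 0.3, 0.4, 0.2)

# Civilian division: units sold and their probabilities
y_units <- c(300, 500, 750)
p_y <- c(0.4, 0.5, 0.1)

# Expected value = sum of (value x probability)
e_x <- sum(x_units * p_x)
e_y <- sum(y_units * p_y)

c(EX = e_x, EY = e_y)
```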

Example

Gain makes a profit of \$2000 on each military unit sold and \$3500 on each civilian unit. Next year's profit from military sales will be $ 2000X $ dollars. Similarly, the profit from civilian sales will be $ 3500Y $ dollars. Thus, the total profit for next year is $ 2000X + 3500Y $ dollars.

The expected total profit for next year:

\[ \begin{align*} E[2000X + 3500Y] &= 2000 E[X] + 3500 E[Y] \\ &= 2000\times 5000 + 3500\times 445 = 11,557,500 \end{align*} \]

The expected difference in profits between military and civilian sales can be calculated:

\[ \begin{align*} E[2000X - 3500Y] &= 2000 E[X] - 3500 E[Y] \\ &= 2000\times 5000 - 3500\times 445 = 8,442,500 \end{align*} \]

What about variances?

If we assume independence, our life is easy - we can add the variances
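Under independence, $ V[aX + bY] = a^2 V[X] + b^2 V[Y] $: the constants are squared, and the variances add even when the combination is a difference. A minimal sketch for the Gain example follows. Independence of military and civilian sales is an assumption here, and the helper name discrete_var is illustrative.

```r
x_units <- c(1000, 3000, 5000, 10000); p_x <- c(0.1, 0.3, 0.4, 0.2)
y_units <- c(300, 500, 750);           p_y <- c(0.4, 0.5, 0.1)

# Variance of a discrete RV: E[X^2] - (E[X])^2
discrete_var <- function(values, probs) {
  sum(values^2 * probs) - sum(values * probs)^2
}

v_x <- discrete_var(x_units, p_x)   # variance of military units sold
v_y <- discrete_var(y_units, p_y)   # variance of civilian units sold

# Assuming X and Y are independent: square the constants, add the variances.
# The same value applies to 2000X + 3500Y and to 2000X - 3500Y.
v_profit <- 2000^2 * v_x + 3500^2 * v_y
sd_profit <- sqrt(v_profit)

c(VX = v_x, VY = v_y, SD_profit = sd_profit)
```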

The continuous case

We have looked at the general formulae for Expectation ($ E[X] $) and Variance ($ V[X] $) for the discrete case.

This requires some modification for continuous RV.

Modification for Continuous RVs

In the discrete RV case, the number of possible values in the sample space is finite or countably infinite.
A continuous RV is an RV where the number of possible values in the sample space is uncountably infinite.
At least some portion of the sample space will consist of an interval of the real number line.

Modification for Continuous RVs

Formally, if $ X $ is a continuous RV, the probability density function, PDF, of $ X $ is a function $ f(x) $ such that for any two numbers $ a $ and $ b $ such that $ a \le b $,

\[ \Pr(a \le X \le b) = \int_a^b f(x) dx \]

Two conditions that $ f(x) $ satisfies:

$ f(x) \ge 0 \ \forall \ x $.
$ \int_{-\infty}^{\infty} f(x) dx = 1 $

Note that the value of $ f(x) $ at a particular $ x $ is a density, not a probability; probabilities come from areas under $ f $.
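To see that a density value is not a probability, consider a narrow uniform distribution, chosen here purely for illustration: on $ [0, 0.5] $ the density is 2 everywhere, which could not be a probability, yet the total area is still 1.

```r
# Density of a Uniform(0, 0.5) at x = 0.25: the height is 1 / (0.5 - 0) = 2
height <- dunif(0.25, min = 0, max = 0.5)

# The area under the density over its support is still 1
area <- integrate(function(x) dunif(x, 0, 0.5), lower = 0, upper = 0.5)$value

c(height = height, area = area)
```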

A continuous example

Chemistry example: a reaction temperature $ X $ (in degrees C) in a certain chemical process has a uniform distribution on the interval $ [-5, 5] $. In this case $ f(x) = \frac{1}{10} $ on $ [-5,5] $ and 0 elsewhere.

 # R knows about uniform distributions

  x <- seq(-6, 6, by = 0.1)

  plot(x, dunif(x, -5, 5), type = 'l', 
  xlim = c(-6, 6), ylab = 'density', lwd = 2)

A picture:

[Figure: PDF of the Uniform(-5, 5) distribution, density against x]

A continuous example

Verifying that $ f(x) $ is a pdf:

\[ \begin{eqnarray*} \int_{-\infty}^\infty f(x) dx = \int_{-5}^5 \frac{1}{10} dx = \left. \frac{x}{10} \right|_{-5}^{5} = \frac{1}{10}[5 - (-5)] = 1 \end{eqnarray*} \]

A continuous example

To get a probability from a PDF, we must (somehow) integrate parts of it.
The integral at a single value, e.g. $ X = 0 $, is zero, so $ \Pr(X = 0) = 0 $; only intervals have non-zero probability.

Sample calculations.

\[ \begin{align*} \Pr(X < 0) &= \int_{-5}^0 f(x) dx = \int_{-5}^0 \frac{1}{10} dx\\ &= \frac{x}{10}|_{-5}^0 = \frac{0}{10} + \frac{5}{10} = 1/2 \end{align*} \]

\[ \begin{align*} \Pr(-2.5 < X < 2.5) &= \int_{-2.5}^{2.5} \frac{1}{10} dx = \frac{x}{10}|_{-2.5}^{2.5} \\ &= \frac{2.5+2.5}{10} = 1/2 \end{align*} \]

A continuous example

Of course, we don't do this by hand; the wise use computers:

  # integral of uniform(-5, 5) from 0 downwards
  punif(0, -5, 5)

[1] 0.5

  # integral of uniform(-5, 5) between -2.5 and 2.5
  # use integral below 2.5, then subtract below -2.5
  punif(2.5, -5, 5) - punif(-2.5, -5, 5)

[1] 0.5

Note, I'm using CDFs! dunif gives the point evaluation of the PDF

Modification for Continuous RVs

The Cumulative Distribution Function (CDF) for a continuous RV is defined in the same way as for a discrete RV, but the calculation for a continuous RV involves integration while that for a discrete RV involves summation. Formally,

\[ \begin{eqnarray*} F(x) = \Pr(X \le x) = \int_{-\infty}^x f(u) du \end{eqnarray*} \]
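As a numerical check of this definition for the Uniform(-5, 5) example, integrating the PDF up to $ x $ should reproduce R's built-in CDF. This is only a sketch; the helper name cdf_by_integration is illustrative.

```r
# CDF obtained by integrating the Uniform(-5, 5) density. The density is 0
# below -5, so starting at -5 is equivalent to starting at -Inf.
cdf_by_integration <- function(x) {
  integrate(function(u) dunif(u, -5, 5), lower = -5, upper = x)$value
}

xs <- c(-2.5, 0, 2.5)
by_integration <- sapply(xs, cdf_by_integration)
built_in <- punif(xs, -5, 5)

rbind(by_integration, built_in)
```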

Expectation and variance

Similar to a discrete RV, the expectation for a continuous RV is like a weighted sum, this time with integration required.

\[ \begin{eqnarray*} E(X) = \mu_X = \int_{-\infty}^{\infty} x f(x) dx \end{eqnarray*} \]

Likewise the variance,

\[ \begin{eqnarray*} V(X) = \sigma^2_X = \int_{-\infty}^{\infty} (x-\mu_X)^2 f(x) dx \end{eqnarray*} \]

An alternative calculation,

\[ \begin{eqnarray*} V(X) = E(X^2) - \left [ E(X) \right ]^2 \end{eqnarray*} \]
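These integrals can be evaluated numerically for the Uniform(-5, 5) temperature example. The sketch below uses R's integrate function; the expected answers are a mean of 0 and a variance of $ (b-a)^2/12 = 100/12 $, and both variance formulae should agree.

```r
f <- function(x) dunif(x, -5, 5)   # PDF of the reaction temperature

# E(X): integrate x * f(x) over the support [-5, 5]
mu <- integrate(function(x) x * f(x), -5, 5)$value

# V(X) from the definition
v_def <- integrate(function(x) (x - mu)^2 * f(x), -5, 5)$value

# V(X) from the alternative formula E(X^2) - [E(X)]^2
e_x2 <- integrate(function(x) x^2 * f(x), -5, 5)$value
v_alt <- e_x2 - mu^2

c(mean = mu, var_definition = v_def, var_alternative = v_alt)
```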

Specific RV distributions

We've seen the Uniform already.

We'll use (eventually) the $ F $, $ \chi^2 $, Normal and $ t $-distributions

The Normal Distribution

Some Features

The normal pdf for a RV $ X $ has two parameters, $ \mu $ and $ \sigma^2 $, where $ \mu = E[X] $ and $ \sigma^2 = V[X] $.
Changes in $ \mu $ are shifts in the location and changes in $ \sigma $ are changes in the spread.
Examples of normal density curves:

The Normal Distribution

  # a few distributions with different parameters (mean, sd)
  x <- seq(-20, 20, by = 0.1)

  y_0_2 <- dnorm(x, 0, 2)
  y_0_5 <- dnorm(x, 0, 5)
  y_10_5 <- dnorm(x, 10, 5)  # named after its parameters: mean 10, sd 5

  plot(x, y_0_2, lwd = 2, type = 'l')
  lines(x, y_0_5, lwd = 2, col = 'blue')
  lines(x, y_10_5, lwd = 2, col = 'purple')

[Figure: three Normal density curves with (mean, sd) of (0, 2) in black, (0, 5) in blue and (10, 5) in purple]

Some features

Some properties of normally distributed RVs.

symmetric, mean = median
about 68% (roughly two-thirds) of observations lie within $ \pm $ 1 $ \sigma $ of $ \mu $
about 95% lie within $ \pm $ 2 $ \sigma $ of $ \mu $
about 99.7% lie within $ \pm $ 3 $ \sigma $ of $ \mu $
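These percentages can be checked with the Normal CDF. Because standardizing removes $ \mu $ and $ \sigma $, the sketch below uses the standard Normal and applies to any Normal distribution.

```r
# Probability that a Normal RV lies within k standard deviations of its mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)
```

The three values are roughly 0.68, 0.95 and 0.997.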

Normal PDF

The function itself is this (a function of $ x $, with two parameters):

\[ \begin{eqnarray*} f(x;\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\left [ \frac{-1}{2\sigma^2} (x-\mu)^2 \right ]} \end{eqnarray*} \]
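As a check on the formula, it can be typed in directly and compared with R's dnorm. The function name normal_pdf is illustrative; note that its third argument, like that of dnorm, is the standard deviation $ \sigma $ rather than the variance.

```r
# Normal PDF written out from the formula (sigma is the standard deviation)
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-20, 20, by = 0.1)

# Largest discrepancy from the built-in density: numerically zero
max(abs(normal_pdf(x, 10, 5) - dnorm(x, 10, 5)))
```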

Calculating probabilities from a Normal

Many measurements will have relative frequency distributions that are nearly normal.
This near normality can be helpful in that the percentage of individuals falling in some range can be estimated given knowledge of $ \mu $ and $ \sigma $ alone.
For example, the distribution of heights of women aged 18 to 24 is approximately normal with mean $ \mu $=64.5 inches and standard deviation $ \sigma $=2.5 inches.
Find the probability that a randomly chosen woman is no taller than 60 inches:

\[ \Pr(X \le 60), \quad X \sim N(\mu = 64.5, \sigma = 2.5) \]

  # so integrate the N(64.5, 2.5) below 60
  pnorm(q = 60, mean = 64.5, sd = 2.5)

[1] 0.03593032

Example

Find the probability that a randomly chosen woman is taller than 5'8''.

\[ \Pr(X > 68) \]

  # integrate the same distribution, but above 68
  pnorm(68, 64.5, 2.5, lower.tail = FALSE)

[1] 0.08075666

  #or 
  1-pnorm(68, 64.5, 2.5)

[1] 0.08075666

Example

Find the probability that a randomly chosen woman is between 4'11.5'' and 5'9.5''.

\[ \Pr(59.5 \le X \le 69.5) \]

  # integrate the same distribution, but between 59.5 and 69.5
  pnorm(69.5, 64.5, 2.5) - pnorm(59.5, 64.5, 2.5)

[1] 0.9544997

Standard Normal Distribution

Definition: a special case is the standard normal distribution where $ \mu = 0 $ and $ \sigma = 1 $. The PDF is simpler:

\[ \begin{eqnarray*} f(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}} \end{eqnarray*} \]

The standard normal RV is often denoted $ Z $.

The standard normal distribution is ''historically'' important. Any Normal distribution can be transformed to the standard normal by converting the normal RV to standard units. This process, also called standardization, is simply $ Z = \frac{X-\mu}{\sigma} $.
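A short sketch of standardization, reusing the heights example (mean 64.5, standard deviation 2.5): converting 60 inches to standard units and using the standard Normal gives the same probability as working with the original distribution.

```r
mu <- 64.5; sigma <- 2.5     # heights example
z <- (60 - mu) / sigma       # 60 inches in standard units

p_direct <- pnorm(60, mean = mu, sd = sigma)
p_standard <- pnorm(z)       # pnorm defaults to mean 0, sd 1

c(z = z, direct = p_direct, standardized = p_standard)
```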

Example

It is commonly assumed that IQ is Normal(100,15). What is the IQ for someone in the 99th percentile?

The 99th percentile: calculate it for the standard Normal, stretch by a factor of 15, then shift by 100

qnorm(0.99)*15 + 100

[1] 134.8952

Note, leaving out the mean and std dev reverts to the default of $ N(0,1) $

Sampling distributions of Estimates

Cannabis case study

Recall your project

In 2000, a forensic scientist investigated whether chemical levels in cannabis leaves could be used to identify where cannabis was grown.
If chemical levels in cannabis leaves (e.g. Calcium) differ across soil types, which represent geographical regions, then chemical analyses on confiscated plants may help identify where the plants were grown.
Plants were grown and monitored until they reached maturity.
Plants were not raised using any fertilizers or special lights.
Mature plants were harvested, dried and underwent nitric acid tests to produce a liquid for analysis.

Objectives

We'll now approach our Cannabis study with more statistical rigour. We'll examine how Calcium levels in cannabis leaves and seed germination rates differ across soil types.

examine the behaviour of the sample mean and sample proportion
quantify the precision of our sample estimates using standard errors for the sample mean and sample proportion
quantify the precision of any differences between estimates of means and proportions across soil type

The sampling distribution of the sample mean

Exploratory Data Analysis

summary(CaData$Ca)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18000   22750   30000   30208   35000   50000

Summary statistics for Calcium values in Cannabis leaves grown in potting mix.

Exploratory Data Analysis

  hist(CaData$Ca, col = 'purple')

[Figure: histogram of Calcium values in cannabis leaves grown in potting mix]

The sample mean and the sample median are similar, which suggests that the Calcium values in potting mix are roughly symmetrical.
The middle 50% of the Calcium values lie between 22,750 and 35,000 units (the first and third quartiles).
The Calcium values are approximately Normally distributed (?) with an average (or mean) value of about 30,210 units.

Parameter estimation

An estimate is a quantity calculated from the data to estimate an unknown parameter.

We are going to estimate the population mean Calcium level from cannabis leaves grown in potting mix.
If we could measure every leaf in our population of interest we'd have the true population mean (an underlying parameter).
A parameter is an unknown characteristic of a population or distribution that we often want to estimate. We estimate it by taking a sample.
In this case we will use our sample of 24 leaves to give us an estimate of the unknown population mean.

Precision of our sample mean

Great, but:

Is this sample mean of 30208.33 any good?
To find out how good our estimate is, we need to know the precision - is it likely to be near the 'truth'?
The behaviour of the sample mean depends on: (a) sample size and (b) the variability of the data

Precision of our sample mean

Intuitively

If we are calculating a sample mean based on a million observations from a population, it should closely resemble the true population mean.
If we are calculating a sample mean based on just 5 observations from a population, it may not closely resemble the true population mean.

Precision of our sample mean

Average Ca values obtained by repeatedly sampling from a Normal distribution with $ \mu=30208.33 $ and $ \sigma=9069.678 $.

Precision of our sample mean

Let's look at the range of sample means obtained when we take 1000 samples of just 5 from a Normal distribution with $ \mu=30208.33 $ and $ \sigma=9069.678 $. The 1000 means obtained from the 1000 samples are shown in the top-left histogram.

Notice,

the samples differ, and thus each sample mean is different
these sample means are centered about the 'true population mean' of 30208.33

(remember we are pretending we know the population mean).
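The simulation described above can be sketched as follows. The seed is arbitrary and chosen only for repeatability, so the numbers will not match the lecture's figure exactly.

```r
set.seed(5762)                 # arbitrary seed so the run is repeatable
mu <- 30208.33; sigma <- 9069.678

# 1000 samples of size 5; keep the mean of each sample
sample_means <- replicate(1000, mean(rnorm(5, mean = mu, sd = sigma)))

mean(sample_means)             # close to mu
sd(sample_means)               # noticeably smaller than sigma
```

Calling hist(sample_means) then draws a histogram of the 1000 means.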

Precision of our sample mean

It turns out that:

The mean of the sample means is equal to the true population mean, i.e. the sample mean is an unbiased estimate of the population mean $ \mu $ – the number we are after.
The standard deviation of the sample means is found by dividing the population standard deviation by the square root of the sample size being averaged.

Precision of our sample mean

\[ \begin{align*} \textrm{Expected value}(\textrm{sample mean}) &= \textrm{Population mean}\\ E(\overline{X}) &= \mu\\ sd(\textrm{sample mean}) &= \frac{\textrm{Population SD}}{\sqrt{\textrm{Sample size}}}\\ sd(\overline{X}) &= \frac{\sigma}{\sqrt{n}} \end{align*} \]

Estimates in statements about the population mean?

We have a sample size of 24 cannabis leaves grown in potting mix - suppose our data looks approximately Normally distributed (i.e. bell-shaped; this is a bit of a stretch).

When the data come from a Normal distribution the distribution of the sample mean is also Normal. This is very handy because:

we know that a Normally distributed variable falls within 2 standard deviations of its mean about 95% of the time.
therefore the sample mean will fall within 2 of its own standard deviations, $ 2\sigma/\sqrt{n} $, of the true population mean about 95% of the time (or for about 95% of samples taken).
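This 95% claim can be checked by simulation. The sketch below pretends, as the lecture does, that the population is Normal with $ \mu=30208.33 $ and $ \sigma=9069.678 $; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)
mu <- 30208.33; sigma <- 9069.678; n <- 24
sd_mean <- sigma / sqrt(n)     # standard deviation of the sample mean

# 10000 samples of size 24; keep the mean of each sample
sample_means <- replicate(10000, mean(rnorm(n, mu, sigma)))

# Proportion of sample means within 2 * sd_mean of the population mean
coverage <- mean(abs(sample_means - mu) <= 2 * sd_mean)
coverage
```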

A useful result

While we rarely know the true population mean, it is a fairly safe bet that our sample estimate will fall within 2 standard deviations (of the sample mean, i.e. $ 2\sigma/\sqrt{n} $) of the population mean.

Note: this must depend on the shape of the distribution of $ \bar{x} $.

Average Ca values obtained by repeatedly sampling from a Normal distribution with $ \mu=30208.33 $ and $ \sigma=9069.678 $.

Using the information above, we expect to get:

Sample size	Mean of the means	SD of the means
5	30208 (30123)	$ \frac{9069.678}{\sqrt{5}}= 4056.083 $ (3943.348)
20	30208 (30162)	$ \frac{9069.678}{\sqrt{20}}=2028.042 $ (1950.543)
30	30208 (30189)	$ \frac{9069.678}{\sqrt{30}}= 1655.889 $ (1737.237)
40	30208 (30292)	$ \frac{9069.678}{\sqrt{40}}=1434.042 $ (1422.778)

Mean and standard deviation of the 1000 sample means obtained using sample sizes of 5, 20, 30 and 40 from $ N(\mu=30208.33, \sigma=9069.678) $. The numbers in brackets are what we observed when we used the computer to do the simulations in the previous figure.
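A table of this kind can be regenerated along the following lines. This is a sketch with an arbitrary seed, so the simulated columns will differ slightly from the bracketed values above; the column names are illustrative.

```r
set.seed(2024)
mu <- 30208.33; sigma <- 9069.678
sizes <- c(5, 20, 30, 40)

# For each sample size: simulate 1000 sample means, then summarise them
results <- t(sapply(sizes, function(n) {
  means <- replicate(1000, mean(rnorm(n, mu, sigma)))
  c(n = n,
    expected_sd = sigma / sqrt(n),
    simulated_mean = mean(means),
    simulated_sd = sd(means))
}))

round(results, 1)
```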

Problem

We don't usually know if our data is drawn from an exactly Normal distribution…

Central limit theorem (CLT)

CLT

No matter what distribution we sample from (provided it has a finite variance), the distribution of the sample mean ($ \overline{X} $) is closely approximated by the Normal distribution in large samples.

Central limit theorem (CLT)

Exploration by simulation…. (refer PPT slides)

Central limit theorem (CLT)

How big does $ n $ have to be for the central limit effect to work?

The sample size required depends on the data. For data from symmetrical distributions a sample of 5 may be sufficient.
For heavily skewed data a sample of 50 may be required. Samples greater than 30 are often recommended.
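The effect of sample size can be explored with a small simulation. The Exponential(1) population used below is an illustrative choice of a heavily skewed distribution and is not necessarily the one used in the lecture's slides; the seed, the number of repetitions and the helper names are likewise arbitrary.

```r
set.seed(7)

# Means of 2000 samples of size n from a right-skewed Exponential(1)
# population, which has mean 1 and standard deviation 1
sim_means <- function(n) replicate(2000, mean(rexp(n, rate = 1)))

means_5 <- sim_means(5)
means_50 <- sim_means(50)

# Sample skewness: roughly 0 for a symmetric, Normal-like distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

c(skew_n5 = skewness(means_5), skew_n50 = skewness(means_50))
c(sd_n5 = sd(means_5), sd_n50 = sd(means_50))   # compare with 1 / sqrt(n)
```

The skewness of the sample means should shrink as $ n $ grows, which is the central limit effect at work.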

Another problem

If we don't know the population SD (and we hardly ever do), we replace the population SD ($ \sigma $) by the sample standard deviation ($ s_x $).
So, instead of measuring the precision of our sample mean using the standard deviation we use the standard error.

The standard error

\[ \begin{align*} se(\textrm{sample mean}) &= \frac{\textrm{Sample SD}}{\sqrt{\textrm{Sample size}}}\\ se(\overline{x}) &= \frac{s_x}{\sqrt{n}} \end{align*} \]

The standard error, $ se(\bar{x}) $, is a measure of precision of $ \bar{x} $ as an estimate
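For the 24 potting-mix leaves the calculation might look like the sketch below. It assumes that 9069.678, the value used as $ \sigma $ in the earlier simulations, is the sample standard deviation of the Calcium data; the lecture does not state this explicitly.

```r
# Assumption: 9069.678 is the sample SD of the 24 Calcium values
# (the lecture uses this number as sigma in its simulations)
s_x <- 9069.678
n <- 24

se_mean <- s_x / sqrt(n)
se_mean

# With the raw data loaded, the equivalent would be
# sd(CaData$Ca) / sqrt(length(CaData$Ca))
```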

(jump ahead) Confidence intervals

(simulation PPT)

Recap and look-forwards

We've covered:

Continuous RVs - Expected Values and Variances for these
The Uniform, Normal (and standard Normal)
Calculating probabilities from these
The behaviour of sample means - sampling distributions

Looking forward: the $ t $-distribution, confidence intervals for the mean and general inference for parameters
