Probability & Distributions

Mike McCann
22-23 January 2015

Review

When do we use parentheses? Brackets? Commas? Semi-colons?

What is the [sum of (1 through 10)] multiplied by two?

What are the values at the 12th and 45th positions of seq(1,43,0.25)?

Distributions

It is often useful to generate a sample from a specific statistical distribution.

Generating random samples from a normal distribution

To generate a sample of size 100 from a standard normal distribution (with mean 0 and standard deviation 1) we use the rnorm() function.

norm <- rnorm(100, mean=0, sd=1)
head(norm)
[1]  2.0413082 -0.5788982  0.9451796  0.5947909 -0.3119882  2.9333134

Try It!

  1. Look up the rnorm() function in the help screen. Locate the three arguments we used.
  2. Draw 100 random numbers from a normal distribution with a mean of 3 and an sd of 2, and assign them to an object a.
  3. Find the mean of your draw. How close is it to the true mean?
  4. Use the ?? tool to look up the function for standard deviation. Hint: Do not include spaces in the query.
  5. What is the 13th number in your vector a?

Generating random samples from other distributions

  • R has many distributions in the base package, including all those commonly used in biological analysis.
  • Depending on the distribution, each function has its own set of parameter arguments.
  • For example, the rpois() function is the random number generator for the Poisson distribution and it has only the parameter lambda.
  • The rbinom() function is the random number generator for the binomial distribution and it has two parameters: size and prob. The size argument specifies the number of Bernoulli trials and the prob argument specifies the probability of a success for each trial.
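
As a minimal sketch of the two generators described above, the calls below draw ten values from each distribution. The parameter values (lambda = 3, size = 20, prob = 0.3) and the object names are illustrative assumptions, and set.seed() is included only so the sketch gives the same draws each time it is run.

```r
set.seed(1)  # assumption: fixed seed, used here only to make the draws repeatable

# 10 draws from a Poisson distribution with lambda = 3
pois_draws <- rpois(n = 10, lambda = 3)

# 10 draws from a binomial distribution: each draw is the number of
# successes in 20 Bernoulli trials, each with success probability 0.3
binom_draws <- rbinom(n = 10, size = 20, prob = 0.3)

pois_draws
binom_draws
```

Note that every binomial draw lies between 0 and size, while Poisson draws are non-negative counts with no upper limit.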

BEE552 Students: Heather will pick up from here.

Try It!

  1. Draw 100 values from a Poisson distribution with lambda=3, and assign them to an object a.
  2. Draw 1000 values from a Poisson distribution with lambda=3, and assign them to an object b.
  3. Find the means of both draws. What is the difference in means?

Other Properties of Distributions

For each distribution, R provides four functions that return fundamental quantities of that distribution.

Let's consider the normal distribution as an example.

  • rnorm() draws a random sample from a specific normal distribution.
  • dnorm() gives the probability density at a specific value for a normal distribution.
  • pnorm() is the distribution function (the cumulative probability up to a given value).
  • qnorm() is the quantile function (the inverse of pnorm()).

pnorm() and qnorm() will be covered during Biometry. They are less commonly used.
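
As a quick preview of how the d, p, and q functions relate, here is a small sketch for the standard normal distribution. The input values 0, 1.96, and 0.975 are chosen only for illustration.

```r
# Density of the standard normal at x = 0 (the peak of the curve)
dnorm(0, mean = 0, sd = 1)

# Distribution function: probability that a draw is less than or equal to 1.96
p <- pnorm(1.96, mean = 0, sd = 1)
p

# Quantile function: the value below which 97.5% of the distribution lies
q <- qnorm(0.975, mean = 0, sd = 1)
q

# Round trip: feeding the quantile back into pnorm() recovers 0.975
pnorm(q)
```

The round trip at the end works because qnorm() and pnorm() are inverses of each other.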

An introduction to plotting: Univariate

Histograms are a common univariate plot.

Histograms place data into “bins”, and count the number of data points falling into each bin.

Bins are usually plotted as bars, with the bin range on the x axis and the count on the y axis.

# Draw a thousand random normal points
pts <- rnorm(1000)
hist(pts)

Histograms are an effective way of visualizing distributions
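
The binning that hist() performs can also be inspected directly. The sketch below assumes the default bin choice and uses plot = FALSE so that nothing is drawn; the object name h is illustrative.

```r
pts <- rnorm(1000)

# plot = FALSE returns the binning information without drawing the plot
h <- hist(pts, plot = FALSE)

h$breaks        # the bin edges
h$counts        # how many points fall into each bin
sum(h$counts)   # every point lands in exactly one bin
```

There is always one more break than there are bins, because each bin needs a left and a right edge.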

Try It!

  1. Draw 10 random normal points and plot a histogram, then 100, then 1000. What do you notice about the plot?

  2. Explore at least one other distribution by looking up ?distributions. Hint: remember to use the r-nameofdistribution function to take random samples.

  3. Plot your new distribution and share with your neighbor.

  4. Draw 1000 random normals with a mean of 0 and an sd of 1. Look at the hist help screen. How do you specify the size of the bin range? Try making bins from -4 to 4, with intervals of 0.01, 0.1, and 1. Hint: Consider using seq() in the “breaks” argument within hist().

Density plots

x <- seq(0,4,0.01)
dens <- dnorm(x, 2, 0.5)
plot(x, dens, type = "l")

Try It!

  1. Draw 1000 random normals with a mean of 0 and a sd of 1.
  2. Plot the density of the distribution from -4 to 4. Hint: You will need to use dnorm()
  3. Label your axes “This is the x axis” and “This is the y axis” by looking at the plot help screen.

More Plotting

Another option is to plot the distribution in terms of density rather than raw counts, so that the areas of the histogram bars sum to 1.

x <- rnorm(100, mean=0, sd=2)
hist(x, freq=FALSE)
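
To see what the density scaling means, the sketch below checks that the bar areas add up to 1 and then overlays the theoretical curve on the histogram. curve() is a base graphics function that is not used elsewhere in these slides, and the object name h is illustrative.

```r
x <- rnorm(100, mean = 0, sd = 2)

# freq = FALSE puts density, not counts, on the y axis
h <- hist(x, freq = FALSE)

# Bar area = density * bin width; the areas add up to 1
sum(h$density * diff(h$breaks))

# Overlay the density of the distribution the sample was drawn from
curve(dnorm(x, mean = 0, sd = 2), add = TRUE, col = "red")
```

With only 100 points the bars will follow the red curve loosely; larger samples track it more closely.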

Very Brief Intro to Sampling

In R, it's very easy to take a random sample of numbers with the sample() command.

Sample without replacement

Take a random sample of 20 numbers from a vector of the integers 1 to 50.

x <- 1:50
sample(x, 20)
 [1] 30 42  3  2 45 16  8 37 24 10 36 21 33 11 26 39 49 15 13 35

Sample with replacement

x <- 1:50
sample(x, 20, replace=TRUE)
 [1] 37 46  3 46  7 15 40 43  3 27 35 19 42 21 49 30 22 12 11 29
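
One way to see the difference between the two modes is to check the draws for repeated values. In this sketch the object names no_repl and repl are illustrative, and anyDuplicated() and table() are base R helpers not otherwise used in these slides.

```r
x <- 1:50

no_repl <- sample(x, 20)                  # each value can appear at most once
repl    <- sample(x, 20, replace = TRUE)  # values may appear more than once

# anyDuplicated() returns 0 when no value is repeated
anyDuplicated(no_repl)

# table() counts how often each value was drawn; counts above 1 are repeats
table(repl)
```

Sampling without replacement never produces repeats, while sampling with replacement may or may not, depending on the draw.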

Try It!

  1. Sample 14 integers from 100 to 200 without replacement.
  2. Sample 5 letters (using the pre-installed vector 'letters') from the alphabet with replacement.
  3. Sample 0 or 1 twelve times. Do it with and without replacement. What happens?

An introduction to plotting: Bivariate

Scatterplots are useful for showing the relationship between two variables

x <- rnorm(n=100, mean=5, sd=0.05)
y <- x * rnorm(n=100, mean=1, sd=0.01)
plot(y~x)

Instead of writing the relationship as a formula, i.e., y~x, you can write plot(x,y), where x and y are separated by a comma.

plot(x,y)

Adding lines to scatterplots

You can add straight lines with abline()

plot(x,y)
abline(a=0, b=1, col="red")

a specifies the intercept and b the slope.

Lines can also be model fits

plot(x,y)
abline(lm(y~x), col="red")

lm() fits a linear relationship between x and y.
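
To connect the model fit back to the a and b arguments of abline(), the sketch below pulls the fitted intercept and slope out of the model with coef(). The object name fit is illustrative, and x and y are regenerated here only so the sketch runs on its own.

```r
x <- rnorm(n = 100, mean = 5, sd = 0.05)
y <- x * rnorm(n = 100, mean = 1, sd = 0.01)

fit <- lm(y ~ x)

# coef() returns the fitted intercept and slope, in that order
coef(fit)

plot(x, y)
# Passing the two coefficients explicitly draws the same line as abline(fit)
abline(a = coef(fit)[1], b = coef(fit)[2], col = "red")
```

Passing the model object straight to abline(), as in the slide above, is the shorter form of the same thing.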

Lots of options for plotting

Questions?