Lab 1 - density histograms and the CLT

This lab will introduce R Notebooks, R Markdown files, and knit R. You will use these formats to keep lab notes and for submitting online. These instructions don’t cover these topics. Links to help with this are provided on canvas, and we will go over the basics in depth during lab. As with so many things in R, there is no single best way to do anything, and it’s just another language that you will need to pick up along the way. Fortunately, it’s not as weird as R itself…

One funny quirk about using R Markdown files and knit R is that the Markdown environment is not directly aware of the R environment (work space) in your active sessions. That means the Markdown file can’t see the data you’ve imported, functions you’ve written, variables you’ve named etc. You need to write all the code for replicating that into the markdown file. This is intentional. The markdown files are meant to be used for reproducible work. That is, the document is self-contained by design so that someone else can recreate exactly what you’ve done using the exact same data. It’s a pretty neat feature once you get over the frustration of rectifying the disconnect between your active environment and the R Markdown environment. To make this transition in thinking a bit simpler, we’ll write code this week to generate all our data (no importing of csv files yet).

OK, let’s learn how to plot histograms using density curves. Density curves are useful as they represent pdfs. Empirical pdfs can be created and plotted from data, and it’s important to realize that they are the same thing as the standard block based histogram; the y-axis is rescaled, and the stair steps are smoothed with a curve. OK, let’s make up a data set and take a look:

set.seed(1000)
x <- sample(-50:150,75,T)
hist(x,main="random sample with replacement of 75 numbers between -50:150")

#####Question 1 a. What is the scale for the random variable x that we just created?

Interval Scale

What is the distribution for the random variable x? Hint: look at the code that generated the distribution and think about probabilities of sampling across the domain.

I would expect a normal distribution for this frequency distribution, as every number in the sample space has equal opportunity of being selected, each sample is independent and random, and the domain includes both negative and positive numbers. I would be tempted to consider this distribution a binomial or bimodal distribution based on its visual spread, but because of the way we created our sample and because there are more than two possibilities when sampling, I would not consider this distribution to qualify as such.

Why does the observed distribution look different than what you might expect for the distribution you named in b?

The observed distribution looks different than what you might expect for a normal distribution because of sampling variation due to chance and the ability for replacement. A larger sample size may produce a more normal-looking normal distribution.

Notice the y-axis label. It tells us the number of times we recorded a number in the interval defined by the x-extent of the histogram bars. We can rescale the y-axis such that the histogram represents a proper probability distribution. We don’t actually have to rescale manually, as R knows this is something that data analysts commonly do - we just set the freq argument to F and we’ll get density equivalency on the y-axis. Once we rescale the y-axis, we can add the smoothed density curve.

hist(x,main="Random sample with replacement of \n 75 numbers between 1:100",freq=F)
# NOTICE - \n insets a carriage return in the main title
lines(density(x))

#####Question 2

How does changing the y-axis scale from the frequency of observations to “density” give us a probability distribution? Hint: Think about the definition of a probability distribution, and the area of a rectangle.

By changing the y-axis scale to “density,” it gives us a probability distribution because the area under the curve represents the relative probabilities of outcomes for the random variable occurring, with all of their probabilities summing to 1.0. Scaling the distribution allows us to visualize how the areas in each rectangle of the histogram contribute to a relative percentage of the data. We could use this probability distribution to find the relative likelihood of obtaining a particular outcome. We could use the probability distribution to find the probability of obtaining an interval for variable X by finding the area of the rectangle under the curve for the value we are interested in.

What do you think would happen if we tried to add the density curve without first rescaling the y-axis? Hint: write a little code to try it!!

hist(x,main="Random sample with replacement of \n 75 numbers between 1:100", F=NULL)
lines(density(x))

hist(x,main="Random sample with replacement of \n 75 numbers between 1:100")
lines(density(x))

Without rescaling the y-axis, the density curve is unable to be visualized. Since the density values are so small in comparison to frequencies, the curve is visually lost.

OK, what a weird distribution. Now let’s invoke the CLT by generating a sampling distribution for the mean, estimated from n=75 data points. The book gives nice examples for looping to do this kind of thing, but here we’ll demonstrate a slightly quicker way of achieving the same end, and in the process pick up some skills for data management and organization. Our aim is to generate means for each of 1000 replicated samples of 75 random draws from the set -100:100. Let’s make a huge matrix where each column is a replicate, and each row is a sample. This will give us 75x1000 cells, organized such that each set of 75 is a different sample. The trick here comes because we’re taking random samples (so the initial organization is arbitrary). Once organized, we can take the column-wise means for plotting. Let’s have a look:

reps <- 1000
n <- 75
pop <- -50:150
set.seed(200)
xMat <- matrix(sample(pop,reps*n,T),nrow=75)
samplingDist <- apply(xMat,2,mean) # I am applying a function to the 2nd dimension (columns) of the matrix
length(samplingDist) # make sure I have the expected number of means (= to reps above)

## [1] 1000

OK, now let’s add the density curve for the sampling distribution of the mean for n=75 to the histogram of our single sample. Markdown environments will hang on to data frames and named objects and so on, but the code is interpreted sequentially, so if we try to add lines to an existing plot, that plot must have been rendered in the same code chunk. Let’s recall it here.

hist(x,main="Random sample with replacement of \n 75 numbers between -50:150",freq=F,cex=0.5)
lines(density(x))
lines(density(samplingDist))

#####Question 3

What happened to the top of the curve for the sampling distribution of the mean?

The top of the curve for the sampling distribution of the mean is cut off/truncated.

Yes, it was truncated. Bravo. Now, reason from what you know about the properties of probability distributions to explain why it was truncated.

The distribution is truncated because the distribution of the mean is constrained by the plotting parameters. I think perhaps it is truncated because the code may be specifying the discrete mean, while we need an interval for the mean in order to fit within the distribution properties - the probability of a specific, discrete value on a continuous probability distribution is 0. The prior code has not specified to include all values in the interval.

To fix the truncation problem, we just need to specify the plotting parameters in a slightly different order.

plot(density(samplingDist), xlim=c(min(pop),max(pop)),main="sample distribution (red) and \n sampling distribution for the mean")
lines(density(x), col="red")

#####Question 4

Why is the sampling distribution of the mean narrower and taller than the distribution for a single sample of n=75? Will this same relationship always hold between the two kinds of distributions? Explain.

The sampling distribution of the mean is narrower and taller than the distribution for a single sample of n=75 because we took 1000 replicated samples of 75 random draws. Therefore, the sample size for the mean distribution is much larger and results in a more tightly confined curve shape with less variability. I do not think the same relationship will always hold between the two kinds of distributions. It depends depends on the central tendency of them as well. If a sampling distribution of data is skewed, the mean sampling distribution will be affected and will be moved towards the tail. However, generally, as a sample size increases, the sample mean distribution should cluster closer and closer to the true population mean.

Why is the sampling distribution for the mean more bell-shaped than the data distribution?

The data distribution is affected by variance due to chance and will have a wider spread, but the distribution for the mean will be more bell-shaped as it is taking the averages for each sample and should have a tendency to cluster around the mean and follow a normal distribution bell-shaped pattern.

Do you think a sampling distribution for the mean using n=15 observations will be wider or narrower than the sampling distribution plotted here (based on n=75)? Explain. (Hint: it’s OK to add that one to the same plot to help your explanation!!)

I think the sampling distribution for the mean using n=15 observations will be wider than the sampling distribution plotted based on n=75. For a smaller number of observations, the spread is more affected by variation occurring during sampling and will be less clustered around the mean. Below, I have repeated the lab with n=15 (hopefully correctly!) for demonstration of this.

#####repeat with n=15 here:

set.seed(1000)
x2 <- sample(-50:150,15,T)
hist(x2,main="random sample with replacement of 15 numbers between -50:150")

hist(x2,main="Random sample with replacement of \n 15 numbers between 1:100",freq=F)
# NOTICE - \n insets a carriage return in the main title
lines(density(x2))

reps <- 1000
n2 <- 15
pop <- -50:150
set.seed(200)
xMat <- matrix(sample(pop,reps*n2,T),nrow=15)
samplingDist <- apply(xMat,2,mean) # I am applying a function to the 2nd dimension (columns) of the matrix
length(samplingDist)

## [1] 1000

hist(x2,main="Random sample with replacement of \n 15 numbers between -50:150",freq=F,cex=0.5)
lines(density(x2))
lines(density(samplingDist))

plot(density(samplingDist), xlim=c(min(pop),max(pop)),main="sample distribution (red) and \n sampling distribution for the mean")
lines(density(x2), col="red")