download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

# Create two variables with short names that represent above ground living area of the house and the sales price
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

EXERCISE 1

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

sd(area)

## [1] 505.5089

hist(area, breaks=50)  #include more breaks to see shape better

#create density histogram
hist(area, probability=TRUE)
x <- 300:6000
y <- dnorm(x = x, mean = 1500, sd = 505)
lines(x = x, y, col = "blue")

#create Q-Q plot 
qqnorm(area) 
qqline(area)

Describe this population distribution.

The population distribution is right-skewed with a mean of 1500 and standard deviation of 505. There are some upper outliers.

The Unknown Sampling Distribution

# This command collects a simple random sample of size 50 from the vector area and assigns it to samp1

samp1 <- sample(area, 50)

EXERCISE 2

mean(samp1)

## [1] 1525.68

sd(samp1)

## [1] 485.8597

hist(samp1, breaks=15)

# Create a density histogram showing normal curve 
hist(samp1, probability=TRUE)
x <- 800:3000
y <- dnorm(x = x, mean = mean(samp1), sd = sd(samp1))  # use this sample's own mean and sd, not values from an earlier draw
lines(x = x, y, col = "red")

qqnorm(samp1)
qqline(samp1)

Describe the distribution of this sample. How does it compare to the distribution of the population?

The distribution of this sample looks mildly right-skewed, with its center near 1500. It is less visibly skewed than the population distribution, but its center is about the same.

# If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

mean(samp1)

## [1] 1525.68

The sample mean (about 1526) is close to the population mean (1500).
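
Because sample() draws at random, the printed numbers change on every run, and values quoted in the prose can drift out of step with the output. One common remedy is to fix the random seed before sampling. The lines below are a minimal sketch, not part of the original lab: the seed value 2019 is arbitrary, and `pop` is a hypothetical stand-in vector used so the example runs without ames.RData.

```r
# Hypothetical stand-in for `area`, used only so this sketch is self-contained
pop <- seq(300, 6000, by = 5)

set.seed(2019)                 # arbitrary seed (an assumption, not from the lab)
first_draw <- sample(pop, 50)

set.seed(2019)                 # resetting the seed replays the same draw
second_draw <- sample(pop, 50)

identical(first_draw, second_draw)
```

With a seed set once at the top of the script, re-running it should reproduce the same samp1 and samp2, so the reported means stay consistent with the text.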

EXERCISE 3

# Take a second sample, also of size 50, and call it samp2. 
samp2 <- sample(area, 50)

mean(samp2)

## [1] 1404.24

How does the mean of samp2 compare with the mean of samp1?

The mean of samp2 (about 1404) is lower than the population mean (1500), whereas the mean of samp1 (about 1526) is slightly higher. The two sample means differ by about 121.

Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

The sample of size 1000 would provide the more accurate estimate of the population mean, because the means of larger samples vary less from one sample to the next.
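
This intuition can be checked with a small simulation. The sketch below is illustrative only: `pop` is a simulated right-skewed stand-in population (the lognormal parameters and the size of 3000 are assumptions chosen to loosely resemble `area`, not the Ames data), and `mean_abs_error` is a hypothetical helper name.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed stand-in population; swap in `area` once ames.RData is loaded
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# Average absolute distance between a sample mean and the population mean
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(pop, n)) - mean(pop)))
  mean(errs)
}

err100  <- mean_abs_error(100)
err1000 <- mean_abs_error(1000)
c(n100 = err100, n1000 = err1000)
```

The typical error for samples of size 1000 should come out clearly smaller than for samples of size 100.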

# Generate 5000 samples and compute the sample mean of each

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

# Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50.

# Adjust the bin width of your histogram to show a little more detail

hist(sample_means50, breaks = 25)

EXERCISE 4

How many elements are there in sample_means50?

There are 5000 elements in sample_means50.

Describe the sampling distribution, and be sure to specifically note its center.

The sampling distribution looks approximately normal and symmetric, with its center (mean) at about 1500, which matches the population mean.

Would you expect the distribution to change if we instead collected 50,000 sample means?

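No. Collecting a larger number of sample means (50,000) would not change the shape, center, or spread of the distribution in any meaningful way, because each mean is still based on a sample of size 50. The histogram would only look smoother.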
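
A quick way to test this claim is to compare the center and spread of 5000 and 50,000 simulated sample means. This is a sketch under stated assumptions: `pop` is a simulated stand-in for `area` (lognormal parameters chosen only for illustration), and `draw_means` is a hypothetical helper.

```r
set.seed(2)  # arbitrary seed

# Hypothetical stand-in population; replace with `area` for the real data
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# Draw `reps` sample means, each from a sample of size n
draw_means <- function(reps, n = 50) replicate(reps, mean(sample(pop, n)))

means_5k  <- draw_means(5000)
means_50k <- draw_means(50000)

rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

The two rows should be nearly identical, since the number of repetitions affects only how precisely the sampling distribution is drawn, not the distribution itself.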

Interlude: The for loop

# With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   # print(i) - OMITTED TO SAVE SPACE IN OUTPUT
   }
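
The same computation can also be written without an explicit loop by using base R's replicate(). The sketch below is an alternative to the lab's approach, not the lab's own method. It uses a simulated stand-in `pop` (an assumption, so the code runs without ames.RData) and an arbitrary seed to show that both forms give the same result.

```r
set.seed(3)
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)  # hypothetical stand-in for `area`

# Explicit for loop, as in the lab
set.seed(42)
loop_means <- rep(NA, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(pop, 50))
}

# The same work expressed with replicate()
set.seed(42)
rep_means <- replicate(100, mean(sample(pop, 50)))

all.equal(loop_means, rep_means)
```

Both versions take the same random draws in the same order, so the choice between them is mostly a matter of readability.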

EXERCISE 5

To make sure you understand what you’ve done in this loop, try running a smaller version.

# Initialize a vector of length 100 called sample_means_small (NA placeholders are used here instead of zeros).

sample_means_small <- rep(NA, 100)

# Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100.

for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}

# Print the output to your screen (type sample_means_small into the console and press enter)

How many elements are there in this object called sample_means_small?

100

What does each element represent?

Each element is the mean living area of one random sample of 50 homes.

Sample size and the sampling distribution

# To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

# To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
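
Since par() returns the previous settings when it changes them, the layout can be restored without retyping the defaults. The following is a minimal sketch of that pattern. The three vectors of sample means are simulated placeholders (assumptions, not the lab's results), and pdf(NULL) is used only so the example runs without opening a plot window.

```r
pdf(NULL)  # off-screen graphics device; omit this line in an interactive session

set.seed(5)
# Hypothetical sample means standing in for sample_means10/50/100
means_a <- rnorm(500, mean = 1500, sd = 160)
means_b <- rnorm(500, mean = 1500, sd = 70)
means_c <- rnorm(500, mean = 1500, sd = 50)

old_par <- par(mfrow = c(3, 1))  # switch to 3 rows and keep the old settings
xlimits <- range(means_a)        # widest distribution sets the shared x-axis

hist(means_a, breaks = 20, xlim = xlimits)
hist(means_b, breaks = 20, xlim = xlimits)
hist(means_c, breaks = 20, xlim = xlimits)

layout_during <- par("mfrow")
par(old_par)                     # restore the previous layout
layout_after <- par("mfrow")

invisible(dev.off())
```

Saving and restoring the old settings avoids leaving later plots stuck in the three-row layout.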

EXERCISE 6

When the sample size is larger, what happens to the center? What about the spread?

The center remains unchanged, but the spread gets smaller. A smaller standard error (the standard deviation of the sampling distribution) means the sample mean is a more precise estimate of the population mean.
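
The shrinking spread follows the usual standard error formula for a sample mean, the population standard deviation divided by the square root of n. The sketch below compares that formula with the spread of simulated sample means for the three sample sizes used in this lab. It assumes a simulated stand-in population `pop` (lognormal parameters chosen only for illustration), so the numbers will not match the Ames data.

```r
set.seed(4)  # arbitrary seed

# Hypothetical stand-in population; replace with `area` for the real data
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

sizes <- c(10, 50, 100)

# Empirical standard error: sd of 5000 sample means at each sample size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Theoretical standard error: sigma / sqrt(n)
theoretical_se <- sd(pop) / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se)
```

The two columns should agree closely. Small gaps are expected because sample() draws without replacement from a finite population, which the simple formula ignores.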

RLab 5 - Sampling Distributions

Rita Lee

October 4, 2019