Lab 5: Foundations for Statistical Inference

The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Exercise 1: Describe this population distribution

The population distribution is unimodal and skewed to the right. The histogram shows a single peak, and the skew is visible in the summary statistics: the mean (about 1500) is larger than the median (1442), and the maximum (5642) lies much further from the median than the minimum (334) does.
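The two skew checks used above can be written out as a short calculation. For the real data they give roughly 1500 - 1442 = 58 for mean minus median, 5642 - 1442 = 4200 for the upper reach, and 1442 - 334 = 1108 for the lower reach. The sketch below is only an illustration: so that it runs without the download, it assumes a synthetic log-normal vector `pop` as a stand-in for `area`, with parameters chosen arbitrarily.

```r
# Hypothetical stand-in for `area`: a right-skewed (log-normal) population.
# Replace `pop` with ames$Gr.Liv.Area to check the real data.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the mean sits above the median ...
mean_minus_median <- mean(pop) - median(pop)

# ... and the upper tail reaches further from the median than the lower tail.
upper_reach <- max(pop) - median(pop)
lower_reach <- median(pop) - min(pop)

c(mean_minus_median = mean_minus_median,
  upper_reach = upper_reach,
  lower_reach = lower_reach)
```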

The Unknown Sampling Distribution

samp1 <- sample(area, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1155    1552    1592    1720    2944

hist(samp1)

Exercise 2: Describe the distribution of this sample. How does it compare to the distribution of the population?

The distribution of this sample of 50 houses, drawn from the population of 2,930, is broadly similar to the population distribution. The sample mean, 1592, is 92 square feet above the population mean of 1500, and the sample median, 1552, is 110 square feet above the population median of 1442. Like the population, the sample is skewed to the right, but its outliers are less extreme: the maximum is 2944 and the minimum is 816.
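Because sample() draws a different sample on every run, numbers quoted in the prose can drift away from the printed summaries whenever the lab is re-run. One common remedy is to fix the random seed before sampling. The sketch below assumes an arbitrary seed value (2024) and, so that it runs on its own, a synthetic stand-in `pop` for `area`; neither is part of the original lab.

```r
# Hypothetical stand-in for `area`.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Fixing the seed immediately before sampling makes the draw repeatable.
set.seed(2024)            # arbitrary seed, chosen for illustration
samp_a <- sample(pop, 50)

set.seed(2024)            # same seed, so the same 50 values are drawn
samp_b <- sample(pop, 50)

identical(samp_a, samp_b)
```

With a seed set once at the top of the lab, the summaries and the written answers would refer to the same samples on every run.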

Exercise 3: Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
summary(samp2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     498    1130    1389    1406    1499    2461

hist(samp2)

# The mean of samp2, 1406, is 186 square feet below the mean of samp1, 1592. In samp2 the median (1389) is closer to the mean, which suggests that this sample is more symmetric than the previous one. Of two further samples, one of size 100 and one of size 1000, the sample of size 1000 would give the more accurate estimate of the population mean. The reason is that the standard error of the sample mean shrinks as the sample size n grows, so the means of larger samples vary less around the population mean.
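The claim about sample size can be made concrete with the standard error formula for a mean sampled without replacement from a finite population. This is a minimal sketch, assuming a synthetic stand-in `pop` for `area`; the variable names `sigma`, `n`, and `se` are illustrative.

```r
# Hypothetical stand-in for `area`.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

N <- length(pop)
# Population standard deviation (divide by N, not N - 1).
sigma <- sqrt(mean((pop - mean(pop))^2))

n <- c(50, 100, 1000)
# Standard error of the sample mean when sampling without replacement,
# including the finite-population correction.
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))

data.frame(n = n, se = se)
```

The standard error falls at each step, which is why the sample of size 1000 is expected to land closest to the population mean.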

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1262    1453    1500    1501    1548    1758

Exercise 4: How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

There are 5000 elements in the vector sample_means50. The sampling distribution is symmetric and unimodal, centered at about 1500, which matches the population mean. The median and mean are very close to one another (1500 and 1501 respectively), and the minimum (1262) and maximum (1758) are nearly equidistant from the center.

With 50,000 sample means I would expect the histogram to look smoother, because more simulated means fill in the bins more evenly. I would not expect the center, the spread, or the degree of normality to change noticeably: those are determined by the sample size (50), not by the number of samples simulated. Each run of the simulation produces slightly different random sample means, but increasing the number of simulations only reduces that run-to-run noise.
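One way to check this expectation is to simulate both cases and compare center and spread. The sketch below assumes a synthetic stand-in `pop` for `area`, and `sim_means` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical stand-in for `area`.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: simulate `reps` sample means, each from a sample of size n.
sim_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(5000)
means_50k <- sim_means(50000)

# Center and spread should be nearly identical across the two runs;
# only the smoothness of the histogram differs.
rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```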

Exercise 5: To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)   # vector of 100 zeros, as the exercise asks
for(i in 1:100){
  sampn <- sample(area, 50)
  sample_means_small[i] <- mean(sampn)
}
sample_means_small           # print the 100 stored sample means
length(sample_means_small)   # number of elements

# There are 100 elements in the vector sample_means_small. Each element is the mean living area of one random sample of 50 houses, and all 100 samples are drawn from the same population, area.
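The same 100 sample means can also be produced without an explicit loop by using replicate(). The sketch below is an alternative to the loop in the lab, not a replacement for it, and it assumes a synthetic stand-in `pop` for `area` and an arbitrary seed so that the two versions can be compared directly.

```r
# Hypothetical stand-in for `area`.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Loop version, in the style of the lab.
set.seed(42)
loop_means <- rep(0, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(pop, 50))
}

# replicate() version: started from the same seed, it makes the same draws.
set.seed(42)
rep_means <- replicate(100, mean(sample(pop, 50)))

length(rep_means)
all.equal(loop_means, rep_means)
```

Either way the result has one element per iteration, and each element is the mean of one sample of 50.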

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6: When the sample size is larger, what happens to the center? What about the spread?

The center does not shift with the sample size: each sampling distribution is centered near the population mean. What changes is the shape and the spread. The larger the sample size, the closer the sampling distribution is to normal (so its mean and median converge), and the smaller its spread, that is, its standard error.
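The histograms can be backed up with numbers by recording the center and spread of each simulated sampling distribution. This is a minimal sketch, assuming a synthetic stand-in `pop` for `area`; the names `sizes` and `results` are illustrative.

```r
# Hypothetical stand-in for `area`.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and record center and spread.
results <- t(sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop, n)))
  c(n = n, center = mean(means), spread = sd(means))
}))

results
mean(pop)   # the centers should all sit close to this value
```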

On Your Own

1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000

samp <- sample(price, 50)
summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52500  127750  156000  186647  222250  591587

hist(samp)

# Using this sample alone, the best point estimate of the population mean is the sample mean, 186,647. (The true population mean is 180,796.)

2. Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

# Note: this overwrites the earlier sample_means50, which held mean areas, with mean prices.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)

mean(sample_means50)

## [1] 180776.2

# The sampling distribution is unimodal and nearly symmetric about its mean, with a slight right skew. I would guess that the mean home price of the population equals the mean of this sampling distribution of mean home prices, which is 180,776.
mean(price)

## [1] 180796.1

# The mean home price of the population is 180,796.1, just slightly above the sampling distribution mean of 180,776.2.

3. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150)

mean(sample_means150)

## [1] 180767

# This sampling distribution is also unimodal, but it is more symmetric about its mean than the sampling distribution for a sample size of 50. It is not perfectly normal, but it is very close. Based on this sampling distribution, I would guess the mean sale price of homes in Ames to be 180,767.

4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution based on samples of size 150 has the smaller spread, that is, the smaller standard error. To make estimates that are more often close to the true population mean, a distribution with a small spread is preferable. The spread of the sampling distribution shrinks as the sample size grows, so the larger the sample, the more accurate the sample mean tends to be as an estimate of the population mean.
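As a rough worked check of how much smaller the spread should be, the usual standard error formula can be applied to the two sample sizes. This ignores the finite-population correction (the samples are drawn without replacement from 2,930 homes), so the ratio seen in the simulation may come out slightly larger.

```latex
\[
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\mathrm{SE}_{n=50}}{\mathrm{SE}_{n=150}}
  = \frac{\sigma/\sqrt{50}}{\sigma/\sqrt{150}}
  = \sqrt{\frac{150}{50}}
  = \sqrt{3} \approx 1.73
\]
```

In other words, tripling the sample size cuts the spread of the sample means by a factor of about 1.73, not by a factor of 3.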