Lab 5: Foundations for Statistical Inference

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

Exercise 1: Describe this population distribution.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

Answer: Residential homes sold in Ames, Iowa between 2006 and 2010 had a mean above-ground living area of about 1,500 square feet and a median of 1,442 square feet, with areas ranging from 334 to 5,642 square feet. The distribution is roughly bell-shaped but right skewed: the mean exceeds the median, and the maximum lies far above the third quartile.

Exercise 2: Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)

summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     759    1040    1282    1377    1731    3078
hist(samp1)

Answer: The sample mean (1,377) is reasonably close to the population mean (1,500), but the sample covers a narrower range (759 to 3,078) than the population (334 to 5,642). Like the population, the sample distribution is slightly right skewed.

Exercise 3: Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

Answer: The mean of samp2 (1,409) is close to the mean of samp1 (1,377); they differ by only 32 square feet. The sample of size 1000 would provide the more accurate estimate of the population mean, because sample means vary less from sample to sample as the sample size grows.

samp2 <- sample(area, 50)

summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     672    1057    1388    1409    1668    2574
hist(samp2)
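
The Exercise 3 claim that a larger sample tends to give a more accurate estimate can be checked by simulation. The sketch below is an illustration only, not part of the original lab: `pop` is a synthetic right-skewed population standing in for `area`, and `mean_abs_error` is a hypothetical helper name.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for ames$Gr.Liv.Area
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Hypothetical helper: average absolute error of the sample mean
# across `reps` repeated samples of size n (drawn without replacement)
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(pop, n)) - mean(pop)))
  mean(errs)
}

err_100  <- mean_abs_error(100)
err_1000 <- mean_abs_error(1000)

# Typical distance between the sample mean and the population mean
c(n100 = err_100, n1000 = err_1000)
```

Under these assumptions the typical error for n = 1000 should come out well below the error for n = 100.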

Exercise 4: How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

sample_means_50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means_50[i] <- mean(samp)
   }

hist(sample_means_50, breaks = 25)

summary(sample_means_50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1274    1449    1498    1499    1546    1777

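Answer: There are 5,000 elements in sample_means_50, one sample mean per loop iteration. The sampling distribution is approximately normal (bell-shaped) and is centered at about 1,500, essentially the population mean. With 50,000 sample means I would expect the histogram to look smoother, but its center, spread, and overall shape should stay about the same, because those depend on the sample size (50) rather than on the number of samples drawn.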
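
One way to probe the 50,000-means question is to simulate both cases and compare their centers and spreads. This is a minimal sketch under stated assumptions: `pop` is a synthetic stand-in for `area`, and `replicate()` is used in place of the lab's `for` loop purely for brevity.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for ames$Gr.Liv.Area
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Same sample size (50), different numbers of simulated sample means
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread of each simulated sampling distribution
rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

The two rows should be nearly identical; drawing more sample means mainly smooths the histogram.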

Exercise 5: To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   }

# Print the 100 stored sample means (values differ from run to run)
sample_means_small

Answer: There are 100 elements in sample_means_small. Each element is the mean of one random sample of 50 living areas, so the vector holds the means of 100 separate samples.
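
To make the loop's behavior concrete, it can be compared with an equivalent `replicate()` call. The sketch below is illustrative only: `pop` is a synthetic stand-in for `area`, and `loop_means` and `rep_means` are hypothetical names.

```r
set.seed(3)

# Assumption: synthetic right-skewed population standing in for ames$Gr.Liv.Area
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Loop version: one sample mean stored per iteration
set.seed(42)
loop_means <- rep(0, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(pop, 50))
}

# replicate() version: evaluates the same expression 100 times
set.seed(42)
rep_means <- replicate(100, mean(sample(pop, 50)))

length(rep_means)                 # number of stored sample means
all.equal(loop_means, rep_means)  # both versions, started from the same seed
```

Because both versions start from the same seed and draw samples in the same order, they are expected to store the same 100 values.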

Exercise 6: When the sample size is larger, what happens to the center? What about the spread?

Answer: As the sample size gets larger, the center of the sampling distribution stays at about the population mean (roughly 1,500), while the spread of the sample means becomes narrower.

hist(sample_means_50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means_50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

# Reset the plotting layout so later histograms are not drawn in a 3 x 1 grid
par(mfrow = c(1, 1))
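
The histograms show the narrowing visually; the same pattern can be put in numbers. This is a rough sketch rather than part of the lab: `pop` is a synthetic stand-in for `area`, and `approx_se` uses the usual sd / sqrt(n) approximation, which ignores the small correction for sampling without replacement.

```r
set.seed(4)  # fixed seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for ames$Gr.Liv.Area
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

sizes <- c(10, 50, 100)

# 5000 simulated sample means for each sample size
sims <- lapply(sizes, function(n) replicate(5000, mean(sample(pop, n))))

centers   <- sapply(sims, mean)     # center of each sampling distribution
spreads   <- sapply(sims, sd)       # spread of each sampling distribution
approx_se <- sd(pop) / sqrt(sizes)  # approximate standard error of the mean

data.frame(n = sizes, center = centers, spread = spreads, approx_se = approx_se)
```

The `center` column should stay near `mean(pop)` for every sample size, while `spread` should shrink roughly in line with `approx_se`.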

On Your Own 1: Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

Answer: Using the sample, the best point estimate of the population mean is the sample mean, 166,213.

samp3 <- sample(price, 50)

summary(samp3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60000  119475  166300  166213  209500  309000
hist(samp3)
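
Because `sample()` draws a different random sample on each run, a reported point estimate only matches the printed output if the random seed is fixed. The sketch below shows the idea under stated assumptions: `price_pop` is a synthetic stand-in for `price`, and the seed values are arbitrary.

```r
set.seed(5)

# Assumption: synthetic right-skewed population standing in for ames$SalePrice
price_pop <- rlnorm(3000, meanlog = 12, sdlog = 0.5)

# Fixing the seed immediately before sampling makes the draw repeatable
set.seed(2024)
samp_a <- sample(price_pop, 50)

set.seed(2024)
samp_b <- sample(price_pop, 50)

identical(samp_a, samp_b)  # same seed, same sample
mean(samp_a)               # point estimate of the population mean
```

Calling `set.seed()` once near the top of a report is usually enough to keep the written answers in step with the printed summaries.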

On Your Own 2: Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

Answer: The sampling distribution is approximately normal (bell-shaped), perhaps with a slight right skew. Based on the sampling distribution, I would guess the mean home price of the population to be about 180,000. The actual population mean is 180,796.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000

On Your Own 3: Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

Answer: The shape of this sampling distribution is more symmetric than that of the sampling distribution for a sample size of 50, so it is closer to a normal (bell-shaped) distribution. Based on this distribution, I would guess the mean sale price of homes in Ames to be about 181,000.

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150)

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000
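
The claim that the sampling distribution becomes more symmetric at n = 150 can be checked numerically instead of by eye. This is a hedged sketch: `price_pop` is a synthetic stand-in for `price`, `skewness` is a hypothetical helper (base R has no built-in skewness function), and 20,000 replications are used instead of the lab's 5,000 so that the skewness estimates are more stable.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for ames$SalePrice
price_pop <- rlnorm(3000, meanlog = 12, sdlog = 0.5)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric shape)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

means_50  <- replicate(20000, mean(sample(price_pop, 50)))
means_150 <- replicate(20000, mean(sample(price_pop, 150)))

skew_50  <- skewness(means_50)
skew_150 <- skewness(means_150)

c(population = skewness(price_pop), n50 = skew_50, n150 = skew_150)
```

Under these assumptions the skewness should fall from the population value toward zero as the sample size grows, which is the pattern the histograms suggest.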

On Your Own 4: Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Answer: The sampling distribution for a sample size of 150 has a smaller spread than the one for a sample size of 50. We prefer a distribution with a small spread, because its sample means fall close to the population mean more often.
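
"More often close to the true value" can be made concrete by counting how many simulated sample means land within a chosen tolerance of the population mean. The sketch below is illustrative only: `price_pop` is a synthetic stand-in for `price`, `share_close` is a hypothetical helper, and the 5% tolerance is an arbitrary choice.

```r
set.seed(6)  # fixed seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for ames$SalePrice
price_pop <- rlnorm(3000, meanlog = 12, sdlog = 0.5)

true_mean <- mean(price_pop)
tol <- 0.05 * true_mean  # arbitrary tolerance: within 5% of the true mean

# Hypothetical helper: share of sample means within `tol` of the true mean
share_close <- function(n, reps = 5000) {
  means <- replicate(reps, mean(sample(price_pop, n)))
  mean(abs(means - true_mean) <= tol)
}

close_50  <- share_close(50)
close_150 <- share_close(150)

c(n50 = close_50, n150 = close_150)
```

A larger share for n = 150 would reflect the narrower sampling distribution described in the answer.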