Questions to answer:

1. Describe this population distribution.

The distribution is right-skewed, with a mean of about 1500.

2. Describe the distribution of this sample. How does it compare to the distribution of the population?

summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     540    1107    1605    1525    1848    2728
hist(samp1)

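The sample mean (1525) is slightly higher than the population mean of about 1500. The sample is still right-skewed, with a slightly narrower distribution than the population.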

3. Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     767    1115    1445    1463    1702    2683
hist(samp2)

The mean of samp2 (1463) is slightly lower than the mean of samp1 (1525), but similar. Still right-skewed, with a similar overall distribution.

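The sample of size 1000 would give the more accurate estimate of the population mean, because sample means vary less from sample to sample as the sample size grows.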
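A minimal sketch of that comparison follows. It does not use the lab's data: `area_sim` is a hypothetical right-skewed stand-in for `area` (a gamma population with mean near 1500), and `typical_error` is a helper name invented for this illustration.

```r
# Hypothetical stand-in for the lab's `area` vector: a right-skewed
# population with mean near 1500 (NOT the real Ames data).
set.seed(42)
area_sim <- rgamma(3000, shape = 9, rate = 9 / 1500)

# Average distance between a sample mean and the population mean,
# estimated from `reps` repeated samples of size n.
typical_error <- function(n, reps = 2000) {
  means <- replicate(reps, mean(sample(area_sim, n)))
  mean(abs(means - mean(area_sim)))
}

errors <- sapply(c(50, 100, 1000), typical_error)
names(errors) <- c("n=50", "n=100", "n=1000")
errors
```

On this simulated population the typical error shrinks as the sample size increases, which is the pattern the answer above relies on.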

4. How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

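sample_means50 has one element per sample drawn (5000 in the lab's loop). The sampling distribution looks approximately normal and is centered around the population's mean. Collecting 50,000 sample means instead would not noticeably change the center or spread; the histogram would only look smoother.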
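The effect of collecting more sample means can be checked with a small simulation. This is a sketch under assumptions: `area_sim` is a hypothetical stand-in for `area`, not the Ames data, and `draw_means` is a helper name invented here.

```r
# Hypothetical right-skewed stand-in for the lab's `area` vector.
set.seed(7)
area_sim <- rgamma(3000, shape = 9, rate = 9 / 1500)

# Draw `reps` samples of size n and return their means.
draw_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(area_sim, n)))
}

means_5k  <- draw_means(5000)
means_50k <- draw_means(50000)

# Center and spread of the two simulated sampling distributions.
c(center_5k = mean(means_5k), center_50k = mean(means_50k))
c(spread_5k = sd(means_5k),   spread_50k = sd(means_50k))
```

Both runs estimate the same sampling distribution (samples of size 50), so the centers and spreads should agree closely. Only the sample size, not the number of repetitions, changes the spread.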

5. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)

for (i in 1:100)
{
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
sample_means_small
##   [1] 1508.62 1426.60 1549.26 1539.34 1399.52 1397.14 1534.32 1400.94
##   [9] 1477.16 1455.88 1433.96 1568.26 1467.84 1492.12 1467.06 1488.46
##  [17] 1468.44 1501.08 1522.78 1449.48 1548.56 1513.36 1411.66 1592.52
##  [25] 1495.04 1583.64 1541.86 1449.48 1502.44 1577.30 1530.10 1569.82
##  [33] 1368.24 1550.72 1515.86 1458.48 1472.00 1575.44 1460.08 1555.74
##  [41] 1483.52 1427.60 1448.76 1479.96 1540.92 1526.12 1473.00 1607.06
##  [49] 1560.86 1387.02 1463.60 1575.70 1454.30 1443.74 1464.98 1531.34
##  [57] 1408.78 1392.38 1528.82 1353.76 1504.96 1480.76 1432.62 1392.10
##  [65] 1580.70 1594.82 1418.92 1585.12 1522.04 1509.84 1470.42 1553.50
##  [73] 1470.88 1498.92 1402.26 1436.56 1657.90 1435.22 1588.34 1548.56
##  [81] 1517.06 1484.10 1603.78 1426.92 1607.88 1460.40 1442.90 1537.12
##  [89] 1503.90 1545.84 1522.94 1536.84 1542.90 1436.98 1518.56 1391.28
##  [97] 1433.00 1300.96 1480.36 1450.76

sample_means_small has 100 elements. Each element is the mean of one random sample of size 50 from area.
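The same vector can also be built without an explicit loop. The sketch below uses base R's `replicate()`. Because the lab's data is not loaded here, `area_sim` is a hypothetical stand-in for `area`.

```r
# Hypothetical stand-in for the lab's `area` vector (not the real data).
set.seed(1)
area_sim <- rgamma(3000, shape = 9, rate = 9 / 1500)

# replicate() evaluates the expression 100 times and collects the
# results, so each element is the mean of one sample of size 50.
sample_means_small <- replicate(100, mean(sample(area_sim, 50)))

length(sample_means_small)
```

The loop version makes the indexing explicit, which is the point of the exercise. `replicate()` is a more compact way to write the same thing once the idea is clear.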

6. When the sample size is larger, what happens to the center? What about the spread?

The center stays at, or moves even closer to, the population mean. The spread narrows.

  • Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
sampPrice <- sample(price, 50)
summary(sampPrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55993  118719  159250  164153  206409  460000

Best point estimate: the sample mean, $164,153.

  • Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)

summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  142626  173306  180381  181018  188310  232216
mean(price)
## [1] 180796.1

The shape of the sampling distribution is approximately normal. Mean of the sampling distribution: 181018, so a reasonable guess for the population mean is about $181,000. Mean of the population: 180796.1

  • Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)

summary(sample_means150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  160411  176746  180720  180840  184943  206374
mean(price)
## [1] 180796.1

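The sampling distribution is again approximately normal, but narrower than for a sample size of 50 (middle 50% from 176746 to 184943, versus 173306 to 188310). Mean of sampling distribution: 180840 (even closer to the population mean), so a reasonable guess for the mean sale price is about $180,800.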

  • Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution for the larger sample size (150) has the smaller spread. We would prefer a small spread, because the sample means then fall close to the true population mean more often.
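The narrowing can be tied to the usual standard error formula for a sample mean. The block below is a sketch, assuming independent draws from a population with standard deviation \(\sigma\). The lab samples without replacement from a finite population, so the formula is only approximate here.

```latex
\[
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{SE_{\bar{x},\,n=50}}{SE_{\bar{x},\,n=150}}
  = \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73
\]
```

The summaries above are roughly consistent with this: the interquartile range of sample_means50 is 188310 - 173306 = 15004, and that of sample_means150 is 184943 - 176746 = 8197, a ratio of about 1.83.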