Lab 4 - Sampling Distributions

By Brian Weinfeld

February 26, 2018

#1: Describe this population distribution.

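# Assumes the lab's ames data frame is already loaded.
library(purrr)  # map() and the %>% pipe are used later in this lab
set.seed(100)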
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
qqnorm(area)
qqline(area)

max(area) - min(area)
## [1] 5308

According to the normal Q-Q plot, the data are right skewed with a long upper tail. The mean is about 1500 and the range is 5308.
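A Q-Q plot shows skew visually; comparing the mean with the median gives a quick numerical check. The sketch below uses a simulated right-skewed vector (`pop`, a hypothetical stand-in drawn with `rlnorm`), not the Ames data, so its numbers are illustrative only.

```r
# Hypothetical stand-in population: log-normal draws are right skewed.
# This is NOT the Ames data; substitute `area` for `pop` to check the lab's population.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

pop_mean   <- mean(pop)
pop_median <- median(pop)
pop_range  <- diff(range(pop))   # same as max(pop) - min(pop)

# In a right-skewed distribution the long upper tail pulls the mean above the median.
c(mean = pop_mean, median = pop_median, range = pop_range)
```

If the mean sits clearly above the median, that agrees with the right skew seen in the plot.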

#2: Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)
xlimits <- range(area)
par(mfrow=c(2,2))
hist(samp1, xlim=xlimits, breaks=10)
qqnorm(samp1)
qqline(samp1)
hist(area, xlim=xlimits, breaks=10)
qqnorm(area)
qqline(area)

The distribution of this sample appears similar to that of the original population. It is also skewed right. The sample mean of about 1442 is close to the population mean, although the sample range is much smaller. This makes sense because the range is highly sensitive to outliers in the population, and a sample of 50 is unlikely to include the most extreme values.

#3: Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
mean(samp1)
## [1] 1441.52
mean(samp2)
## [1] 1413.52

The means are very similar and both are close to the population mean. As the number of elements in the sample increases, the sample mean tends toward the population mean, so the sample of size 1000 would be expected to provide the more accurate estimate.
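The effect of sample size on accuracy can be checked by simulation. This is a minimal sketch on a simulated population (`pop` and the helper `mean_abs_error` are hypothetical names introduced here, and the data are not the Ames data); it estimates the typical distance between a sample mean and the population mean for samples of size 50, 100, and 1000.

```r
# Hypothetical right-skewed population (simulated; not the Ames data).
set.seed(2)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Average absolute error of the sample mean for a given sample size,
# taken over many repeated samples drawn without replacement.
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(pop, n)) - mean(pop)))
  mean(errs)
}

err <- sapply(c(50, 100, 1000), mean_abs_error)
names(err) <- c("n=50", "n=100", "n=1000")
err
```

The average error should shrink as the sample size grows, which is the pattern described above.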

#4: How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

There are 5000 elements in sample_means50, each one the mean of an SRS of 50 elements drawn from the same population. The sampling distribution is approximately normal and centered about the population mean. If 50,000 sample means were collected instead, the shape, center, and spread would stay essentially the same, because those depend on the sample size of 50 rather than on the number of samples; the histogram would simply be a smoother picture of the same distribution, with its mean slightly closer to the true population mean.
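To see how much the number of sample means matters, the sketch below repeats the simulation with 5000 and 50,000 replicates and compares center and spread. It runs on a simulated stand-in population (`pop` and `sim_means` are hypothetical names, and the data are not the Ames data).

```r
# Hypothetical right-skewed population (simulated; not the Ames data).
set.seed(3)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Draw `reps` samples of size n and return the vector of sample means.
sim_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(5000)
means_50k <- sim_means(50000)

# Compare the center and spread of the two simulated sampling distributions.
rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

Both rows should be nearly identical, since the sample size of 50 is the same in each run.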

#5: To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)
for(i in 1:100){
  sample_means_small[i] <- sample(area, 50) %>% mean()
}
sample_means_small
##   [1] 1479.92 1590.46 1452.58 1434.16 1570.34 1511.26 1564.40 1489.42
##   [9] 1552.68 1399.58 1579.74 1556.20 1502.16 1528.76 1521.06 1452.24
##  [17] 1395.46 1501.20 1486.02 1460.92 1523.34 1460.74 1524.54 1435.92
##  [25] 1507.28 1564.50 1562.10 1388.88 1501.76 1507.24 1525.98 1473.34
##  [33] 1646.28 1431.60 1586.90 1477.58 1480.66 1582.44 1610.28 1469.26
##  [41] 1662.94 1395.76 1608.50 1447.20 1440.68 1498.84 1372.96 1443.90
##  [49] 1490.94 1440.06 1498.30 1430.58 1552.58 1579.98 1490.54 1400.88
##  [57] 1419.28 1513.52 1470.92 1414.58 1485.08 1415.32 1553.18 1489.36
##  [65] 1468.34 1397.02 1406.42 1437.92 1288.64 1508.16 1429.40 1619.16
##  [73] 1458.42 1438.34 1509.10 1400.00 1634.38 1480.16 1467.26 1406.62
##  [81] 1433.42 1487.20 1455.50 1507.18 1581.96 1485.38 1479.38 1557.52
##  [89] 1375.40 1394.90 1491.16 1434.18 1434.60 1526.40 1533.06 1374.48
##  [97] 1523.12 1605.58 1479.82 1491.34

There are 100 elements in sample_means_small; each one represents the mean of an SRS of 50 elements from the population area.

#6: When the sample size is larger, what happens to the center? What about the spread?

The centers are all similar and lie near the true population mean. The spread decreases as the sample size increases.
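The shrinking spread can be compared with the usual standard error formula, sigma / sqrt(n). The following is a sketch under stated assumptions: `pop` is a simulated stand-in population rather than the Ames data, and the sample sizes 10, 50, and 100 are chosen only for illustration.

```r
# Hypothetical right-skewed population (simulated; not the Ames data).
set.seed(4)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
sigma <- sd(pop)

sizes <- c(10, 50, 100)

# Observed spread: standard deviation of 5000 simulated sample means per size.
observed <- sapply(sizes, function(n) sd(replicate(5000, mean(sample(pop, n)))))

# Predicted spread from the standard error formula. Sampling without
# replacement makes the observed values slightly smaller than this.
predicted <- sigma / sqrt(sizes)

data.frame(n = sizes, observed_sd = observed, predicted_sd = predicted)
```

The observed and predicted columns should track each other closely, and both fall as n rises.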

#1: Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

price.sample <- sample(price, 50)
mean(price.sample)
## [1] 184136.6

The mean of the sample is $184,136.60. This is also our estimate of the population mean.

#2: Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

# map_dbl() returns a numeric vector directly, so no unlist() is needed.
sample_means50 <- map_dbl(1:5000, ~ mean(sample(price, 50)))
hist(sample_means50, breaks=40)

mean(sample_means50)
## [1] 180592.2

The sampling distribution is approximately normal and centered about $180,592.20. My single sample gave an estimate of $184,136.60 for the population mean. The sampling distribution gave a mean of $180,592.20, and the true population mean is $180,796.10.

#3: Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

# map_dbl() returns a numeric vector directly, so no unlist() is needed.
sample_means150 <- map_dbl(1:5000, ~ mean(sample(price, 150)))

par(mfrow=c(2,1))
xlimits <- range(sample_means50)
hist(sample_means50, breaks=40, xlim=xlimits)
hist(sample_means150, breaks=40, xlim=xlimits)

mean(sample_means150)
## [1] 180843.1

The sampling distribution is approximately normal and centered about $180,843.10, roughly the same center as the sampling distribution in sample_means50, but with a smaller spread. The mean of this sampling distribution is our estimate of the mean sale price of homes in Ames.
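How the spreads for the two sample sizes should compare follows from the standard error of the mean, $\sigma/\sqrt{n}$. This treats each sample as small relative to the population, which is an assumption; $\sigma$ denotes the population standard deviation of price, which this report does not compute.

```latex
\frac{SD(\bar{x}_{50})}{SD(\bar{x}_{150})}
  \approx \frac{\sigma/\sqrt{50}}{\sigma/\sqrt{150}}
  = \sqrt{\frac{150}{50}}
  = \sqrt{3}
  \approx 1.73
```

Under that assumption, the means from samples of 50 should be spread roughly 1.7 times as widely as the means from samples of 150.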

#4: Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution with the larger samples has the smaller spread. If we are concerned with the accuracy of our estimate, we should prefer the smaller spread and therefore use as large a sample as is feasible. The smaller spread increases the likelihood that an estimate falls close to the true value.
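One way to make "more often close to the true value" concrete is to count how often a sample mean lands within a fixed tolerance of the population mean. This is a minimal sketch on simulated data: `pop` is a hypothetical stand-in for prices (not the Ames data), `within_tol` is a helper introduced here, and the 5% tolerance is an arbitrary choice.

```r
# Hypothetical right-skewed population of "prices" (simulated; not the Ames data).
set.seed(5)
pop <- rlnorm(3000, meanlog = 12, sdlog = 0.4)
mu  <- mean(pop)
tol <- 0.05 * mu   # arbitrary tolerance: within 5% of the population mean

# Proportion of simulated sample means that fall within the tolerance.
within_tol <- function(n, reps = 5000) {
  means <- replicate(reps, mean(sample(pop, n)))
  mean(abs(means - mu) <= tol)
}

p50  <- within_tol(50)
p150 <- within_tol(150)
c(n50 = p50, n150 = p150)
```

The proportion for samples of 150 should be noticeably higher than for samples of 50, which is the practical payoff of the smaller spread.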