Foundations for Statistical Inference - Sampling Distributions

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

Exercise 1

Describe this population distribution.

area = ames$Gr.Liv.Area
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area, breaks = 30, main = "Area")

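The distribution is right-skewed and unimodal: the mean (1500) exceeds the median (1442), and the maximum (5642) lies far above the third quartile (1743).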

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 = sample(area, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     407    1132    1326    1510    1724    3493

hist(samp1, breaks=10, main = "Area Sample")

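The sample distribution is similar to the population distribution: it is also right-skewed and unimodal, with a similar mean (1510 versus 1500). With only 50 observations it covers a narrower range (maximum of 3493 versus 5642).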

Exercise 3

How does the mean of a second sample of 50 (samp2) compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

# Samples
samp2 = sample(area, 50)
samp3 = sample(area, 100)
samp4 = sample(area, 1000)

# Table of sample means
samps = matrix(c(summary(area)[4],
                 summary(samp1)[4],
                 summary(samp2)[4],
                 summary(samp3)[4],
                 summary(samp4)[4]),ncol=5,byrow=TRUE)
colnames(samps) = c("Popul_Mean", "Samp1", "Samp2", "Samp_of_100", "Samp_of_1000")
rownames(samps) = c("Means")
as.table(samps)

##       Popul_Mean Samp1 Samp2 Samp_of_100 Samp_of_1000
## Means       1500  1510  1472        1517         1498

The mean of samp2 will generally differ from the mean of samp1, but because the samples are drawn at random there is no way to predict whether it will be higher or lower (here it was 1472 versus 1510). The larger the sample size, the more accurate the estimate of the population mean tends to be. As such, the sample of size 1,000 should give the closest estimate of the population mean, which matches the table above (1498 versus 1500).
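
The claim that larger samples give more accurate estimates can be checked by simulation. The sketch below is a minimal illustration, not part of the original lab: `pop` is a hypothetical right-skewed population generated with `rlnorm()` as a stand-in for `ames$Gr.Liv.Area`, and the log-normal parameters and the 2000 repetitions are arbitrary choices.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a right-skewed population
# (log-normal parameters chosen only for illustration).
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

sizes <- c(50, 100, 1000)

# For each sample size, draw 2000 samples and record how far the
# sample mean falls from the population mean, on average.
mean_abs_error <- sapply(sizes, function(n) {
  means <- replicate(2000, mean(sample(pop, n)))
  mean(abs(means - mean(pop)))
})
names(mean_abs_error) <- paste0("n=", sizes)

print(round(mean_abs_error, 1))
```

The average error should shrink as the sample size grows, which is why the sample of 1,000 is the safest bet even though any single small sample can land close to the population mean by chance.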

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution (note its center). Would you expect the distribution to change if we instead collected 50,000 sample means?

sample_means50 = rep(NA, 5000)

for(i in 1:5000){
  samp = sample(area, 50)
  sample_means50[i] = mean(samp)
}

hist(sample_means50, breaks = 25, main = "Area Sample_Means_50")

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1262    1450    1498    1500    1547    1783

There are 5,000 elements in sample_means50. The sampling distribution is approximately normal and unimodal, centered at about 1500, the population mean. Collecting 50,000 sample means instead would not change its shape, center, or spread in any meaningful way; the histogram would only look smoother.
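
To see that more repetitions do not change the sampling distribution itself, the two cases can be compared directly. This is a sketch under assumptions: `pop` is a hypothetical log-normal population standing in for `area`, and `replicate()` is used as a compact alternative to the explicit `for` loop.

```r
set.seed(2)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)  # hypothetical population

# replicate() evaluates the expression repeatedly and collects the results.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread should agree closely between the two runs.
print(c(mean_5k = mean(means_5k), mean_50k = mean(means_50k)))
print(c(sd_5k = sd(means_5k), sd_50k = sd(means_50k)))
```

Both runs estimate the same underlying distribution of means of samples of size 50; the longer run just estimates it with less noise.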

Exercise 5

Initialize a vector of 100 zeros called “sample_means_small”. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, (iterate from 1 to 100). How many elements are there in sample_means_small? What does each element represent?

sample_means_small = rep(NA, 100)

for(i in 1:100){
  samp = sample(area, 50)
  sample_means_small[i] = mean(samp)
}

sample_means_small

##   [1] 1500.58 1459.80 1505.88 1494.46 1537.24 1424.48 1484.76 1452.68
##   [9] 1505.66 1472.70 1494.98 1463.24 1298.02 1537.24 1433.78 1406.08
##  [17] 1464.32 1426.86 1583.60 1450.96 1412.10 1557.38 1565.44 1488.98
##  [25] 1594.40 1571.38 1556.54 1406.34 1383.56 1461.54 1444.16 1563.16
##  [33] 1520.62 1450.18 1499.14 1487.28 1399.98 1531.34 1554.80 1496.58
##  [41] 1458.64 1507.12 1521.36 1399.04 1455.14 1457.98 1485.50 1437.72
##  [49] 1469.30 1547.44 1526.56 1447.22 1351.88 1617.62 1364.48 1450.36
##  [57] 1417.76 1612.12 1513.74 1437.78 1402.04 1458.74 1617.84 1540.82
##  [65] 1545.14 1486.62 1634.82 1378.32 1484.70 1470.00 1467.06 1536.24
##  [73] 1628.62 1366.24 1447.12 1469.98 1503.14 1470.44 1477.72 1460.26
##  [81] 1454.76 1604.90 1543.46 1444.48 1392.52 1546.62 1571.36 1604.66
##  [89] 1529.88 1432.38 1382.62 1535.06 1481.36 1367.76 1434.64 1540.06
##  [97] 1412.98 1518.06 1490.94 1546.02

length(sample_means_small)

## [1] 100

There are 100 elements in sample_means_small, and each element is the mean of a sample of size 50 drawn from the population (area). In other words, sample_means_small is a collection of 100 means from samples of size 50.

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

# Creating samples
sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)

for(i in 1:5000){
  samp = sample(area, 10)
  sample_means10[i] = mean(samp)
  samp = sample(area, 100)
  sample_means100[i] = mean(samp)
}

# Plotting samples
par(mfrow= c(3,1))
xlimits = range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits, main = "Area Sample_Means_10")
hist(sample_means50, breaks = 20, xlim = xlimits, main = "Area Sample_Means_50")
hist(sample_means100, breaks = 20, xlim = xlimits, main = "Area Sample_Means_100")

# Table of sample means
samps2 = matrix(c(summary(area)[4],
                  summary(sample_means10)[4],
                  summary(sample_means50)[4],
                  summary(sample_means100)[4]),ncol=4,byrow=TRUE)
colnames(samps2) = c("Popul_Mean", "Samp_of_10", "Samp_of_50", "Samp_of_100")
rownames(samps2) = c("Means")
as.table(samps2)

##       Popul_Mean Samp_of_10 Samp_of_50 Samp_of_100
## Means       1500       1501       1500        1500

When the sample size is larger, the spread of the sampling distribution decreases. The center barely moves: it stays at approximately the population mean (about 1500) for every sample size.
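
The shrinking spread can be compared with the standard error formula. The sketch below assumes a hypothetical log-normal population `pop` in place of `area`. Because `sample()` draws without replacement by default, the usual `sd(pop) / sqrt(n)` is multiplied by the finite population correction `sqrt(1 - n / N)`.

```r
set.seed(3)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)  # hypothetical population
N <- length(pop)

sizes <- c(10, 50, 100)

# One row per sample size: center and spread of 5000 simulated sample means,
# next to the standard error predicted by theory.
result <- t(sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop, n)))
  c(n = n,
    center = mean(means),
    empirical_sd = sd(means),
    theoretical_se = sd(pop) / sqrt(n) * sqrt(1 - n / N))
}))

print(round(result, 1))
```

The `center` column should stay near `mean(pop)` for every sample size, while `empirical_sd` tracks `theoretical_se` and falls as `n` grows.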

ON YOUR OWN

1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

price = ames$SalePrice
summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000

hist(price,  main = "Price")

prsamp = sample(price, 50)
summary(prsamp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   44000  110600  143700  162700  182000  745000

hist(prsamp, breaks = 15, main = "Price Sample")

The sample mean, approximately 162,700, is the best point estimate of the population mean.
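
A point estimate is more useful when reported with a rough measure of its uncertainty. The following sketch assumes a hypothetical right-skewed population `price_pop` in place of `ames$SalePrice`; the names `point_estimate` and `std_error` are introduced only for illustration.

```r
set.seed(4)
# Hypothetical stand-in for ames$SalePrice (illustrative parameters only).
price_pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

prsamp <- sample(price_pop, 50)

# The sample mean is the point estimate of the population mean.
point_estimate <- mean(prsamp)

# Estimated standard error of that mean, using only the sample itself.
std_error <- sd(prsamp) / sqrt(length(prsamp))

print(c(estimate = point_estimate,
        se = std_error,
        population_mean = mean(price_pop)))
```

In practice the population mean is unknown, so the standard error is the only indication of how far off the estimate is likely to be.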

2. Simulate the sampling distribution for x̄_price (take 5000 samples of size 50 from the population and compute the 5000 sample means). Store means as “sample_means50”. Plot the data. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Calculate and report the population mean.

sample_means50 = rep(NA, 5000)

for(i in 1:5000){
  prsamp = sample(price, 50)
  sample_means50[i] = mean(prsamp)
}

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  144300  172700  179900  180500  187800  229300

hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")

summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000

The sampling distribution is approximately normal and unimodal. From it, I would guess the population mean to be about 181,000. The actual population mean is 180,800, so my guess was about 200 too high, while the mean of the 5,000 sample means (180,500) fell about 300 below it.

3. Change your sample size from 50 to 150. Repeat process above but store these means in a vector called “sample_means150”. Describe the shape of this sampling distribution. Compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 = rep(NA, 5000)
for(i in 1:5000){
  prsamp = sample(price, 150)  # samples of size 150 for this exercise
  sample_means150[i] = mean(prsamp)
}

# Comparisons between means50 and means150
par(mfrow= c(2,1))
hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")
hist(sample_means150, breaks = 50, main = "Price Sample_Means_150")

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  144300  172700  179900  180500  187800  229300

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  142500  173000  180200  180700  188000  222600

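The sampling distribution for sample_means150 is approximately normal and unimodal, like that for sample_means50. Means of samples of size 150 should vary less than means of samples of size 50, so this distribution should be the narrower of the two; the quartiles printed above are nearly identical to those of sample_means50, which is what happens when the loop draws samples of size 50 rather than 150. Based on the center of sample_means150, I would guess the mean sale price of homes in Ames is approximately 180,700, close to the actual population mean of 180,800.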

4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

par(mfrow= c(2,1))
hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")
hist(sample_means150, breaks = 50, main = "Price Sample_Means_150")

Of the two sampling distributions, the one built from samples of size 150 has the smaller spread, because the standard error of the mean falls as the sample size grows. A smaller spread is preferable when we want estimates that are often close to the true value, because the sample means cluster more tightly around the population mean.
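
How much narrower the size-150 distribution should be can be estimated by simulation. This is a sketch under assumptions: `price_pop` is a hypothetical log-normal population standing in for `price`, and `spread_ratio` is a name introduced here for illustration.

```r
set.seed(5)
price_pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)  # hypothetical population

means50  <- replicate(5000, mean(sample(price_pop, 50)))
means150 <- replicate(5000, mean(sample(price_pop, 150)))

# Tripling the sample size should shrink the spread by roughly sqrt(3).
spread_ratio <- sd(means50) / sd(means150)
print(spread_ratio)
```

The ratio should land near sqrt(3), about 1.73. It tends to come out slightly higher because sampling without replacement from a finite population tightens the size-150 means a little further.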


Georgia Galanopoulos
