Foundations for statistical inference - Sampling distributions

load("more/ames.RData")

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

The distribution of the living area is slightly skewed to the right with most observations around 1400 to 1500 square feet. There are a few observations with over 3000 square feet.

Describe the distribution of this sample. How does it compare to the distribution of the population?

set.seed(100)
samp1 <- sample(area, 50)

hist(samp1)

The exercise did not specify a seed, so the sample distribution changes every time I run the script. I set the seed to 100 so I could reproduce the distribution.

My sample distribution appears to be right skewed centering around 1440ish square feet. There are a few observations above 2000 square feet, but not so many very small areas.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

set.seed(200)

samp2 <- sample(area,50)

mean(samp1)

## [1] 1441.52

mean(samp2)

## [1] 1486.7

The means are fairly close, but are clearly different from each other. Sample 2 is closer to the population mean than Sample 1.

The 1000 size sample will be a much more accurate estimate of the mean compared to the 100 size sample. Since the variation of an estimate depends on the size of the sample, larger samples will have less variation than smaller samples. As the sample size approaches the population size, the mean of the estimate will approach the true mean of the population.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

length(sample_means50)

## [1] 5000

There are 5000 elements in sample_means50. The sampling distribution looks almost exactly normal. Its center is extremely close to the true mean of 1500 square feet. The distribution would not change much with 50,000 sample means. The distribution is very close to normal, and adding more samples will just reinforce the normality already shown with diminishing returns.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   #print(i)
   }

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   #print(i)
}

sample_means_small

##   [1] 1466.64 1694.96 1540.60 1362.60 1455.66 1386.86 1470.94 1403.76
##   [9] 1534.04 1623.04 1408.18 1596.70 1600.32 1572.30 1589.70 1503.70
##  [17] 1428.10 1496.62 1492.70 1486.22 1502.60 1454.34 1445.08 1528.28
##  [25] 1579.90 1493.92 1392.26 1479.92 1606.12 1676.40 1535.80 1557.52
##  [33] 1472.28 1494.24 1609.62 1547.00 1528.12 1531.90 1416.30 1543.34
##  [41] 1583.44 1547.50 1523.26 1492.78 1466.22 1458.54 1555.62 1417.58
##  [49] 1362.94 1517.30 1518.40 1569.54 1549.72 1626.52 1633.04 1533.00
##  [57] 1554.84 1548.42 1448.54 1369.40 1398.88 1537.02 1444.24 1468.32
##  [65] 1432.48 1503.78 1490.60 1714.50 1452.96 1479.84 1517.16 1473.50
##  [73] 1510.38 1583.44 1508.76 1550.82 1415.48 1432.08 1615.66 1449.26
##  [81] 1594.60 1476.66 1533.64 1520.60 1598.12 1478.06 1547.74 1562.74
##  [89] 1529.40 1487.38 1347.86 1471.64 1313.40 1430.94 1502.60 1455.74
##  [97] 1512.32 1343.04 1605.16 1475.70

length(sample_means_small)

## [1] 100

There are 100 elements in the object. Each element represents a sample mean.

Side note: I’m surprised you didn’t use an apply function here. This seems like the perfect use of sapply (while being a bit faster as well).

When the sample size is larger, what happens to the center? What about the spread?

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

When the sample size is larger, the center moves closer to the true mean and the spread decreases.

On your own

1). Take a random sample of size 50 from `price`. Using this sample, what is your best point estimate of the population mean?

set.seed(100)
price_50_sample = sample(price,50)

mean(price_50_sample)

## [1] 174131.4

The best point estimate of the population mean is the sample mean. In this case, the sample mean is 174131.4.

2). Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called `sample_means50`. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 = sapply(1:5000,function(x){
  return(mean(sample(price,50)))
})

hist(sample_means50)

The sampling distribution looks almost exactly normal. I would guess the mean home price would be around 180,000.

mean(price)

## [1] 180796.1

The mean is 180,796, which is very close to 180,000.

3). Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called `sample_means150`. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 = sapply(1:5000,function(x){
  return(mean(sample(price,150)))
})

hist(sample_means150)

The shape of this sampling distribution looks very close to normal, centered around 180,000. Compared to the sample size 50 distribtuion, the spread of the sample size 150 distributions is much smaller. I would guess the mean sale price of homes in Ames to be about 180,000.

4). Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution from 3 has a smaller spread. If we want to make estimates close to the true value, we would prefer a distribution with a small spread because there is less uncertainty about the range of possible values of the mean when there is less spread. When a distribution has small spread, the range of probable values is small, so we are more sure of what the true value should be.

Foundations for statistical inference - Sampling distributions

Austin Chan

On your own

1). Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

4). Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

1). Take a random sample of size 50 from `price`. Using this sample, what is your best point estimate of the population mean?