Lab4a_StephenJones

load("more/ames.RData")

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

The distribution is unimodal and right-skewed, with mean 1500 and median 1442; the mean exceeds the median, as expected for a right-skewed distribution.

#added set seed to maintain the sample.
set.seed(3162019)
samp1 <- sample(area, 50)

Describe the distribution of this sample. How does it compare to the distribution of the population?

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     774    1105    1341    1424    1632    3086

hist(samp1)

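The distribution is right-skewed and roughly unimodal, with mean 1424 and median 1341. Its shape is similar to that of the population, though its center is somewhat lower (population mean 1500, median 1442) and its range is narrower (774 to 3086 versus 334 to 5642).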

mean(samp1)

## [1] 1423.8

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

#added set seed to maintain the sample.
set.seed(3162020)
samp2 <- sample(area, 50)
cat("The mean of samp2 is",mean(samp2),", which is higher than the mean of samp1,",mean(samp1),".")

## The mean of samp2 is 1493.62 , which is higher than the mean of samp1, 1423.8 .

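The sample of size 1000 would provide the more accurate estimate of the population mean: the variability of the sample mean shrinks as the sample size grows, so means from larger samples tend to land closer to the population mean.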
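The comparison between samples of size 100 and 1000 can be checked by simulation. The following is a minimal sketch, not part of the original lab: `pop` and `se_of_mean` are hypothetical names, and when `area` is not loaded the code falls back to a made-up right-skewed stand-in population rather than the Ames data.

```r
# Use `area` from the lab if it is loaded; otherwise fall back to a
# hypothetical right-skewed stand-in (an assumption, not the Ames data).
set.seed(1)
pop <- if (exists("area", mode = "numeric")) area else
  rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Spread (standard deviation) of the sample mean over repeated samples of size n.
se_of_mean <- function(n, reps = 2000) {
  sd(replicate(reps, mean(sample(pop, n))))
}

se100  <- se_of_mean(100)
se1000 <- se_of_mean(1000)
c(n100 = se100, n1000 = se1000)
```

The smaller of the two values indicates which sample size tends to give means closer to the population mean.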

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

cat("There are",length(sample_means50),"elements in sample_means50.")

## There are 5000 elements in sample_means50.

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1283    1449    1497    1499    1547    1767

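The distribution is nearly normal, centered just under 1500 (mean of the sample means 1499, median 1497). Collecting 50,000 sample means instead of 5,000 would not be expected to change the center or the spread, because both depend on the sample size (50), not on the number of samples drawn; the histogram would simply give a smoother picture of the same approximately normal sampling distribution.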
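One way to check the effect of collecting 50,000 sample means rather than 5,000 is to compare the center and spread of the two sets directly. This is a rough sketch under assumptions: `pop`, `means_5k`, and `means_50k` are hypothetical names, and the fallback population is a made-up stand-in used only when `area` is not loaded.

```r
# Use `area` from the lab if it is loaded; otherwise a hypothetical
# right-skewed stand-in (an assumption, not the Ames data).
set.seed(2)
pop <- if (exists("area", mode = "numeric")) area else
  rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Same sample size (50) in both cases; only the number of sample means differs.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

If the two rows are close, the number of sample means mainly affects how smooth the histogram looks, not where it is centered or how wide it is.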

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   #print(i)
   }

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(NA, 100)

for(i in 1:100){
  #set.seed used to stabilize sample production.
  set.seed(i)
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
print(sample_means_small)

##   [1] 1491.38 1519.08 1527.12 1504.92 1399.32 1389.28 1640.48 1531.78
##   [9] 1460.82 1433.48 1497.26 1503.64 1384.62 1454.00 1423.10 1391.46
##  [17] 1416.36 1323.44 1532.86 1482.02 1510.72 1513.20 1525.12 1435.74
##  [25] 1604.94 1616.28 1526.22 1491.96 1568.34 1500.54 1484.86 1479.12
##  [33] 1497.90 1480.60 1563.52 1419.28 1632.22 1626.34 1593.34 1434.64
##  [41] 1508.00 1420.10 1460.36 1454.10 1613.66 1535.36 1550.96 1582.80
##  [49] 1440.52 1477.32 1419.64 1626.50 1526.38 1472.74 1591.08 1425.30
##  [57] 1461.08 1588.64 1487.40 1554.02 1513.94 1452.84 1662.90 1354.30
##  [65] 1552.16 1521.18 1549.50 1436.74 1422.50 1642.16 1390.00 1440.86
##  [73] 1513.00 1565.62 1465.70 1483.74 1468.04 1537.94 1498.04 1438.92
##  [81] 1418.98 1477.84 1467.74 1395.30 1480.90 1526.70 1431.56 1548.02
##  [89] 1515.44 1548.12 1579.32 1407.90 1526.54 1604.14 1449.02 1512.10
##  [97] 1606.30 1503.66 1471.26 1441.52

There are 100 elements in sample_means_small; each element is the mean of one random sample of 50 values drawn from area.
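The same 100 sample means can also be produced without an explicit loop. The sketch below uses base R's `replicate()` as an alternative to the `for` loop above; it is not the lab's method, `sample_means_alt` and `pop` are hypothetical names, and the fallback population is a made-up stand-in used only when `area` is not loaded. It seeds once rather than per iteration, so its values will differ from the printed output above.

```r
# Use `area` from the lab if it is loaded; otherwise a hypothetical
# right-skewed stand-in (an assumption, not the Ames data).
set.seed(3)
pop <- if (exists("area", mode = "numeric")) area else
  rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# replicate() evaluates the expression 100 times and collects the results
# in a numeric vector, one sample mean per element.
sample_means_alt <- replicate(100, mean(sample(pop, 50)))

length(sample_means_alt)  # 100
```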

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

When the sample size is larger, what happens to the center? What about the spread?

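When the sample size is larger, the center stays essentially the same, close to the population mean of about 1500, while the spread narrows, so the sample means are more tightly concentrated around that center.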
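The effect of sample size on center and spread can also be read off numerically rather than from the histograms. A minimal sketch, with hypothetical names (`pop`, `sizes`, `summ`) and a made-up stand-in population used only when `area` is not loaded:

```r
# Use `area` from the lab if it is loaded; otherwise a hypothetical
# right-skewed stand-in (an assumption, not the Ames data).
set.seed(4)
pop <- if (exists("area", mode = "numeric")) area else
  rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and record their
# mean (center) and standard deviation (spread).
summ <- sapply(sizes, function(n) {
  m <- replicate(5000, mean(sample(pop, n)))
  c(center = mean(m), spread = sd(m))
})
colnames(summ) <- paste0("n=", sizes)

round(summ, 1)
```

Each column summarizes one sampling distribution, so the `center` row can be compared against `mean(pop)` and the `spread` row compared across sample sizes.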

On your own

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

#added set seed to maintain the sample.
set.seed(9231973)
sampprice <- sample(price, 50)
cat("The mean price--and the best point estimate using a random sample of 50--is",mean(sampprice),".")

## The mean price--and the best point estimate using a random sample of 50--is 183666.7 .

Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  #set.seed used to stabilize sample production.
  set.seed(5000+i)
   sprice <- sample(price, 50)
   sample_means50[i] <- mean(sprice)
}
hist(sample_means50, breaks = 20)

The distribution is approximately normal, with center near 180,000; I would guess the mean of the population is also around 180,000. The mean of the sample means:

mean(sample_means50)

## [1] 181128.9

The median of the sample means:

median(sample_means50)

## [1] 180429.9

The mean of the population is:

mean(price)

## [1] 180796.1

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  #set.seed used to stabilize sample production.
  set.seed(11000+i)
   sprice <- sample(price, 150)
   sample_means150[i] <- mean(sprice)
}
hist(sample_means50, breaks = 20)

hist(sample_means150, breaks = 20)

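The distribution is approximately normal, with center at approximately 182,000 on the histogram; this doesn’t quite align with the center observed for samples of size 50, although both sampling distributions should be centered near the population mean, so the difference likely reflects histogram binning and sampling variability. As expected, the larger sample size has narrowed the spread. Based on the histogram for the larger sample size, I would guess the mean sale price of homes in Ames is approximately 182,000.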

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The spread is smaller in the sampling distribution with the larger sample size (150). For estimates that are more often close to the true value, a distribution with a small spread is preferable.
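The difference in spread between sample sizes 50 and 150 can be compared with the textbook standard error of the mean for sampling without replacement from a finite population of size \(N\), \(\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}\). The sketch below is an illustration under assumptions, not part of the original lab: `pop`, `theory_se`, `sim_se`, and `se_tab` are hypothetical names, and the fallback population is a made-up stand-in used only when `price` is not loaded.

```r
# Use `price` from the lab if it is loaded; otherwise a hypothetical
# right-skewed stand-in (an assumption, not the Ames data).
set.seed(5)
pop <- if (exists("price", mode = "numeric")) price else
  rlnorm(3000, meanlog = 12, sdlog = 0.4)

N     <- length(pop)
sigma <- sqrt(mean((pop - mean(pop))^2))  # population SD (divides by N)

# Theoretical standard error with the finite population correction.
theory_se <- function(n) sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Simulated standard error: SD of many sample means of size n.
sim_se <- function(n, reps = 5000) sd(replicate(reps, mean(sample(pop, n))))

se_tab <- data.frame(
  n         = c(50, 150),
  theory    = theory_se(c(50, 150)),
  simulated = sapply(c(50, 150), sim_se)
)
se_tab
```

The `theory` and `simulated` columns should roughly agree, and both should be smaller for the sample size of 150 than for 50.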