Foundations for statistical inference - Sampling distributions

1. Describe this population distribution.

Ans: The distribution is unimodal and right-skewed: the long tail runs toward the larger homes. The mode is between 1000 and 1500, and the mean is about 1500.

2. Describe the distribution of this sample. How does it compare to the distribution of the population?

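Ans: The sample of 50 homes was drawn at random, so its distribution should resemble the population distribution in center and overall shape. With only 50 observations, though, the histogram is rougher, and its details change from one sample to the next.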

3. Take a second sample, also of size 50, and call it `samp2`. How does the mean of `samp2` compare with the mean of `samp1`? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

Ans: Because `samp1` and `samp2` are both random samples of size 50 from the same population, their means should be close to each other, though rarely exactly equal. As the sample size increases, the sample mean tends to fall closer to the population mean, so the sample of size 1000 would provide the more accurate estimate.
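The effect of sample size can be checked by simulation. The sketch below does not use the Ames data: `pop` is a made-up right-skewed stand-in population, and `mean_abs_error` is a hypothetical helper that averages how far the sample mean lands from the population mean over repeated draws.

```r
set.seed(1)

# Hypothetical stand-in population (NOT the Ames data): 3000 right-skewed values.
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Average distance between the sample mean and the population mean
# over `reps` random samples of size `n` (hypothetical helper).
mean_abs_error <- function(n, reps = 2000) {
  mean(abs(replicate(reps, mean(sample(pop, n))) - mean(pop)))
}

errors <- sapply(c(50, 100, 1000), mean_abs_error)
names(errors) <- c("n = 50", "n = 100", "n = 1000")
print(round(errors, 1))
```

Under these assumptions the typical error should shrink as the sample size grows, which is the pattern the answer describes.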

4. How many elements are there in `sample_means50`? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

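Ans: There are 5000 elements in `sample_means50`. The sampling distribution is bell-shaped and roughly symmetric, and its center is close to the population mean of about 1500. Collecting 50,000 sample means would not noticeably change the center or the spread, because those depend on the sample size (50), not on the number of samples; the histogram would only look smoother.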
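Whether the number of sample means matters can also be checked by simulation. This is a sketch on a made-up stand-in population (`pop` is an assumption, not the Ames data), comparing 5,000 and 50,000 sample means, all from samples of size 50.

```r
set.seed(2)

# Hypothetical stand-in population (NOT the Ames data).
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Sample means from samples of size 50, collected 5,000 and 50,000 times.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread are driven by the sample size (50),
# so the two rows are expected to agree closely.
print(rbind(
  "5,000 means"  = c(center = mean(means_5k),  spread = sd(means_5k)),
  "50,000 means" = c(center = mean(means_50k), spread = sd(means_50k))
))
```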

5. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called `sample_means_small`. Run a loop that takes a sample of size 50 from `area` and stores the sample mean in `sample_means_small`, but only iterate from 1 to 100. Print the output to your screen (type `sample_means_small` into the console and press enter). How many elements are there in this object called `sample_means_small`? What does each element represent?

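# Load the Ames housing data and extract the above-ground living area.
load("more/ames.RData")
area <- ames$Gr.Liv.Area

# Pre-allocate a vector to hold 100 sample means.
sample_means_small <- rep(NA, 100)

# Each iteration draws 50 homes at random and stores their mean area.
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small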

##   [1] 1576.54 1530.34 1639.10 1586.96 1452.44 1372.86 1461.60 1501.08
##   [9] 1367.06 1405.66 1436.46 1523.86 1536.76 1518.26 1406.92 1395.58
##  [17] 1552.64 1502.64 1553.06 1586.54 1363.42 1582.14 1553.90 1561.92
##  [25] 1374.02 1453.62 1498.94 1533.52 1492.40 1467.18 1463.08 1422.74
##  [33] 1497.06 1562.58 1575.06 1418.30 1438.26 1371.80 1442.28 1620.48
##  [41] 1460.62 1506.00 1373.44 1527.00 1459.24 1415.92 1424.34 1468.02
##  [49] 1547.00 1439.64 1594.54 1380.02 1415.62 1520.78 1453.10 1548.46
##  [57] 1606.18 1421.88 1470.42 1486.72 1415.16 1535.48 1501.98 1447.98
##  [65] 1518.68 1444.62 1543.98 1369.02 1521.78 1523.74 1381.36 1545.72
##  [73] 1543.84 1571.76 1432.08 1583.34 1468.16 1479.46 1392.12 1416.24
##  [81] 1449.58 1446.92 1432.22 1498.42 1560.46 1550.70 1512.68 1561.76
##  [89] 1417.50 1444.62 1440.84 1489.82 1439.56 1510.54 1505.96 1516.96
##  [97] 1588.94 1445.34 1442.06 1560.48

Ans: There are 100 elements in this vector. Each element is the mean living area of one random sample of 50 homes.
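The loop above can also be written with `replicate()`, which makes it explicit that each element is one sample mean. This is a sketch on a made-up stand-in vector, since the Ames data is not loaded here; `area_demo` is a hypothetical name.

```r
set.seed(3)

# Hypothetical stand-in for `area` (NOT the Ames data).
area_demo <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Repeat 100 times: draw 50 values without replacement and take their mean.
sample_means_small <- replicate(100, mean(sample(area_demo, 50)))

length(sample_means_small)  # one element per sample
head(sample_means_small)
```

With the real data, `area_demo` would simply be replaced by `area`.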

6. When the sample size is larger, what happens to the center? What about the spread?

Ans: The center stays in about the same place, near the population mean. The spread of the sampling distribution gets smaller as the sample size increases.
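The behavior of center and spread matches the textbook result for the sample mean, stated here as a reminder rather than something derived in this lab. For independent draws of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\):

```latex
E(\bar{x}) = \mu,
\qquad
SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}
```

The center does not depend on \(n\), while the spread falls in proportion to \(1/\sqrt{n}\). Because `sample()` draws without replacement from a finite population of size \(N\), the exact standard error is slightly smaller, by a factor of \(\sqrt{(N - n)/(N - 1)}\).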

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

- Take a random sample of size 50 from `price`. Using this sample, what is your best point estimate of the population mean?

price <- ames$SalePrice

# Holds the means of 50 separate samples, each of size 50.
sample50 <- rep(NA, 50)

for (i in 1:50) {
  samp <- sample(price, 50)
  sample50[i] <- mean(samp)
}
summary(sample50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  153756  173638  180819  180004  186624  200692

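Ans: The best point estimate of the population mean is the sample mean. The code above goes a step further than the question asks and draws 50 samples of size 50; according to the summary output, the average of those 50 sample means is 180004.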
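The question itself asks for a single sample of size 50. A minimal sketch of that version follows, on a made-up stand-in for `price`; the name `price_demo` and its values are assumptions, not the Ames data.

```r
set.seed(4)

# Hypothetical stand-in for `price` (NOT the Ames data).
price_demo <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

# One random sample of 50 prices, drawn without replacement.
samp_price <- sample(price_demo, 50)

# The sample mean is the point estimate of the population mean.
point_estimate <- mean(samp_price)
point_estimate
```

With the real data, `sample(price, 50)` followed by `mean()` would give the point estimate directly.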

- Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called `sample_means50`. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

Ans:

# Simulate the sampling distribution: 5000 sample means, each from 50 prices.
sample_means50 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  138061  172936  180046  180759  188259  227624

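Ans: The mean (180759) and the median (180046) are very close, and the histogram is bell-shaped and roughly symmetric. Based on this sampling distribution, the mean home price of the population is about 180759, the mean of the 5000 sample means. The exact population mean can be calculated with `mean(price)`.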

- Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called `sample_means150`. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

# Same simulation with a larger sample size: 5000 sample means, each from 150 prices.
sample_means150 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150)

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  156679  176430  180641  180765  184859  205121

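Ans: Like `sample_means50`, the distribution of `sample_means150` is bell-shaped and centered at about the same value, but it is noticeably narrower. Its values range from 156679 to 205121, compared with 138061 to 227624 for `sample_means50`, and its interquartile range is 8429 (176430 to 184859) compared with 15323 (172936 to 188259). The mean and median also agree more closely: they differ by 124 in `sample_means150` and by 713 in `sample_means50`. Based on this sampling distribution, the mean sale price of homes in Ames is about 180765.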

- Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

par(mfrow = c(2, 1))

xlimits <- range(sample_means50)

hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)

Ans: The sampling distribution from 3 (sample size 150) has the smaller spread. If we want estimates that are more often close to the true value, we prefer the distribution with the small spread.
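Spread can also be compared numerically rather than by eye. The sketch below rebuilds both sampling distributions on a made-up stand-in population (`price_demo`, `means50`, and `means150` are hypothetical names, not the lab's objects) and sets their standard deviations next to \(\sigma/\sqrt{n}\).

```r
set.seed(5)

# Hypothetical stand-in for `price` (NOT the Ames data).
price_demo <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

means50  <- replicate(5000, mean(sample(price_demo, 50)))
means150 <- replicate(5000, mean(sample(price_demo, 150)))

# Population standard deviation (divide by N, not N - 1).
sigma <- sqrt(mean((price_demo - mean(price_demo))^2))

# Observed spread of each sampling distribution next to sigma / sqrt(n).
# sample() draws without replacement, so the observed values are expected
# to be slightly smaller, up to simulation noise.
print(data.frame(
  n           = c(50, 150),
  observed_sd = c(sd(means50), sd(means150)),
  approx_se   = sigma / sqrt(c(50, 150))
))
```

With the lab's own vectors, `sd(sample_means50)` and `sd(sample_means150)` would give the same comparison.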

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.


Chunhui Zhu

October 3, 2017
