download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
Exercise 1
Describe this population distribution.
area = ames$Gr.Liv.Area
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area, breaks = 30, main = "Area")
The distribution is unimodal and right-skewed: the mean (1500) exceeds the median (1442), and a long upper tail extends to a maximum of 5642.
Exercise 2
Describe the distribution of this sample. How does it compare to the distribution of the population?
samp1 = sample(area, 50)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 407 1132 1326 1510 1724 3493
hist(samp1, breaks=10, main = "Area Sample")
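The sample distribution is similar to the population distribution in that it is also right-skewed and unimodal, and its mean (1510) is close to the population mean (1500). Its maximum (3493) falls well short of the population maximum (5642), since a sample of 50 rarely includes the most extreme homes.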
Exercise 3
How does the mean of a second sample of 50 (samp2) compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
# Samples
samp2 = sample(area, 50)
samp3 = sample(area, 100)
samp4 = sample(area, 1000)
# Table of sample means
samps = matrix(c(summary(area)[4],
summary(samp1)[4],
summary(samp2)[4],
summary(samp3)[4],
summary(samp4)[4]),ncol=5,byrow=TRUE)
colnames(samps) = c("Popul_Mean", "Samp1", "Samp2", "Samp_of_100", "Samp_of_1000")
rownames(samps) = c("Means")
as.table(samps)
## Popul_Mean Samp1 Samp2 Samp_of_100 Samp_of_1000
## Means 1500 1510 1472 1517 1498
The mean of samp2 (1472) is not the same as the mean of samp1 (1510). Because the samples are randomly selected, it is not possible to predict whether a given sample's mean will be higher or lower than another's. The larger the sample size, the more accurate the estimate of the population mean tends to be. As such, the sample of size 1,000 should give the closest estimate of the population mean, and here its mean (1498) is indeed the closest to the population mean (1500).
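The link between sample size and accuracy can be made concrete with the standard error of the sample mean. The following is a minimal sketch rather than part of the lab: `pop` is a synthetic right-skewed stand-in for `area` (an assumption, so the block runs without downloading the data), and the formula is the textbook variance of a mean under simple random sampling without replacement.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: 2930 right-skewed values
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
N <- length(pop)

# Standard error of the sample mean when sampling without replacement:
# sqrt((1 - n/N) * S^2 / n), where S^2 is var(pop)
sizes <- c(50, 100, 1000)
se <- sqrt((1 - sizes / N) * var(pop) / sizes)
names(se) <- paste0("n_", sizes)

# Smaller values mean the sample mean typically lands closer to mean(pop)
print(round(se, 1))
```

The standard error shrinks as the sample size grows, which is why the sample of 1,000 is expected to be the most accurate of the three.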
Exercise 4
How many elements are there in sample_means50? Describe the sampling distribution (note its center). Would you expect the distribution to change if we instead collected 50,000 sample means?
sample_means50 = rep(NA, 5000)
for(i in 1:5000){
samp = sample(area, 50)
sample_means50[i] = mean(samp)
}
hist(sample_means50, breaks = 25, main = "Area Sample_Means_50")
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1262 1450 1498 1500 1547 1783
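There are 5,000 elements in sample_means50. The sampling distribution is unimodal and approximately normal, with a center of approximately 1500, which matches the population mean. Collecting 50,000 sample means would give a smoother histogram, but the center, spread, and shape would stay essentially the same, because they are determined by the sample size (50) rather than by the number of samples drawn.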
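The claim that more simulated means do not change the sampling distribution can be checked directly. This is a sketch under stated assumptions: `pop` is a synthetic stand-in for `area`, and `means_5k` and `means_50k` are illustrative names, not objects from the lab.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area
set.seed(2)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# replicate() is a compact alternative to the explicit for loop
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread of the two simulated sampling distributions
print(round(c(mean_5k = mean(means_5k), mean_50k = mean(means_50k)), 1))
print(round(c(sd_5k = sd(means_5k), sd_50k = sd(means_50k)), 1))
```

Both runs use samples of size 50, so their centers and spreads should agree closely; the larger run only reduces simulation noise.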
Exercise 5
Initialize a vector of 100 zeros called “sample_means_small”. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, (iterate from 1 to 100). How many elements are there in sample_means_small? What does each element represent?
sample_means_small = rep(NA, 100)
for(i in 1:100){
samp = sample(area, 50)
sample_means_small[i] = mean(samp)
}
sample_means_small
## [1] 1500.58 1459.80 1505.88 1494.46 1537.24 1424.48 1484.76 1452.68
## [9] 1505.66 1472.70 1494.98 1463.24 1298.02 1537.24 1433.78 1406.08
## [17] 1464.32 1426.86 1583.60 1450.96 1412.10 1557.38 1565.44 1488.98
## [25] 1594.40 1571.38 1556.54 1406.34 1383.56 1461.54 1444.16 1563.16
## [33] 1520.62 1450.18 1499.14 1487.28 1399.98 1531.34 1554.80 1496.58
## [41] 1458.64 1507.12 1521.36 1399.04 1455.14 1457.98 1485.50 1437.72
## [49] 1469.30 1547.44 1526.56 1447.22 1351.88 1617.62 1364.48 1450.36
## [57] 1417.76 1612.12 1513.74 1437.78 1402.04 1458.74 1617.84 1540.82
## [65] 1545.14 1486.62 1634.82 1378.32 1484.70 1470.00 1467.06 1536.24
## [73] 1628.62 1366.24 1447.12 1469.98 1503.14 1470.44 1477.72 1460.26
## [81] 1454.76 1604.90 1543.46 1444.48 1392.52 1546.62 1571.36 1604.66
## [89] 1529.88 1432.38 1382.62 1535.06 1481.36 1367.76 1434.64 1540.06
## [97] 1412.98 1518.06 1490.94 1546.02
length(sample_means_small)
## [1] 100
There are 100 elements in sample_means_small, and each element is the mean of one random sample of 50 drawn from the population (area). Essentially, sample_means_small is a collection of 100 means from samples of size 50.
Exercise 6
When the sample size is larger, what happens to the center? What about the spread?
# Creating samples
sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)
for(i in 1:5000){
samp = sample(area, 10)
sample_means10[i] = mean(samp)
samp = sample(area, 100)
sample_means100[i] = mean(samp)
}
# Plotting samples
par(mfrow= c(3,1))
xlimits = range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits, main = "Area Sample_Means_10")
hist(sample_means50, breaks = 20, xlim = xlimits, main = "Area Sample_Means_50")
hist(sample_means100, breaks = 20, xlim = xlimits, main = "Area Sample_Means_100")
# Table of sample means
samps2 = matrix(c(summary(area)[4],
summary(sample_means10)[4],
summary(sample_means50)[4],
summary(sample_means100)[4]),ncol=4,byrow=TRUE)
colnames(samps2) = c("Popul_Mean", "Samp_of_10", "Samp_of_50", "Samp_of_100")
rownames(samps2) = c("Means")
as.table(samps2)
## Popul_Mean Samp_of_10 Samp_of_50 Samp_of_100
## Means 1500 1501 1500 1500
When the sample size is larger, the spread decreases. The center stays essentially the same: for every sample size, the sampling distribution is centered at approximately the population mean (1500).
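The center and spread can also be compared numerically instead of by eye. The sketch below assumes a synthetic population `pop` in place of `area`; `sim` and `theory` are illustrative names, and the theoretical spread uses the standard formula for sampling without replacement.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area
set.seed(3)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
N <- length(pop)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and summarize them
sim <- sapply(sizes, function(n) {
  m <- replicate(5000, mean(sample(pop, n)))
  c(center = mean(m), spread = sd(m))
})
colnames(sim) <- paste0("n_", sizes)

# Theoretical spread: sqrt((1 - n/N) * var(pop) / n)
theory <- sqrt((1 - sizes / N) * var(pop) / sizes)

print(round(rbind(sim, theory = theory), 1))
```

The `center` row should stay near `mean(pop)` in every column, while the `spread` row falls as the sample size increases and tracks the `theory` row.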
1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
price = ames$SalePrice
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129500 160000 180800 213500 755000
hist(price, main = "Price")
prsamp = sample(price, 50)
summary(prsamp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44000 110600 143700 162700 182000 745000
hist(prsamp, breaks = 15, main = "Price Sample")
The sample mean, 162,700, is the best point estimate of the population mean.
2. Simulate the sampling distribution for $\bar{x}_{price}$ (5000 samples from the population of size 50 and compute 5000 sample means). Store means as “sample_means50”. Plot the data. Describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Calculate and report the population mean.
sample_means50 = rep(NA, 5000)
for(i in 1:5000){
prsamp = sample(price, 50)
sample_means50[i] = mean(prsamp)
}
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144300 172700 179900 180500 187800 229300
hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129500 160000 180800 213500 755000
The sampling distribution is unimodal and approximately normal. Using the sampling distribution, I would guess that the population mean is about 181,000. The actual population mean is 180,800, so my guess was slightly high. The mean of the sampling distribution (180,500) is also within a few hundred dollars of the true value.
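One way to back up "approximately normal" is to compare the skewness of the population with the skewness of the simulated sample means. This is a rough sketch: `pop_price` is a synthetic stand-in for `price`, and `skewness` is a helper defined here because base R does not provide one.

```r
# Hypothetical stand-in for ames$SalePrice: strongly right-skewed values
set.seed(4)
pop_price <- rlnorm(2930, meanlog = 11.98, sdlog = 0.49)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulated sampling distribution of the mean for samples of size 50
means50 <- replicate(5000, mean(sample(pop_price, 50)))

skew_pop <- skewness(pop_price)
skew_means <- skewness(means50)
print(round(c(population = skew_pop, sample_means = skew_means), 2))
```

A skewness near zero indicates a roughly symmetric distribution. The sample means should be far less skewed than the population, though some right skew can remain at a sample size of 50.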
3. Change your sample size from 50 to 150. Repeat process above but store these means in a vector called “sample_means150”. Describe the shape of this sampling distribution. Compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 = rep(NA, 5000)
for(i in 1:5000){
# Sample size is 150 here, matching the name sample_means150
prsamp = sample(price, 150)
sample_means150[i] = mean(prsamp)
}
# Comparisons between means50 and means150
par(mfrow= c(2,1))
hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")
hist(sample_means150, breaks = 50, main = "Price Sample_Means_150")
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 144300 172700 179900 180500 187800 229300
summary(sample_means150)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 142500 173000 180200 180700 188000 222600
The sampling distribution for sample_means150 is unimodal and approximately normal. In that respect it is similar to sample_means50, though sample_means50 is more spread out. Based on the sampling distribution of sample_means150, I would guess that the mean sale price of homes in Ames is approximately 180,800, which matches the actual population mean.
4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
par(mfrow= c(2,1))
hist(sample_means50, breaks = 50, main = "Price Sample_Means_50")
hist(sample_means150, breaks = 50, main = "Price Sample_Means_150")
Of the two sampling distributions (sample_means50 and sample_means150), the second (sample_means150) has the smaller spread. A distribution with a smaller spread is preferable when we want estimates that are more often close to the true value, because each individual estimate is then less likely to fall far from it.
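How often an estimate lands "close to the true value" can be measured as the share of sample means within a fixed distance of the population mean. The sketch below is illustrative only: `pop_price` is a synthetic stand-in for `price`, and both `share_within` and the tolerance of 10,000 are arbitrary choices made for this example.

```r
# Hypothetical stand-in for ames$SalePrice
set.seed(5)
pop_price <- rlnorm(2930, meanlog = 11.98, sdlog = 0.49)
true_mean <- mean(pop_price)

# Fraction of 5000 simulated sample means that fall within `tol` of the true mean
share_within <- function(n, tol = 10000) {
  m <- replicate(5000, mean(sample(pop_price, n)))
  mean(abs(m - true_mean) <= tol)
}

within50 <- share_within(50)
within150 <- share_within(150)
print(c(n_50 = within50, n_150 = within150))
```

The share for samples of size 150 should be noticeably larger than for size 50, which is the practical meaning of a smaller spread.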