load("more/ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
hist(area)
The distribution of living area is slightly right-skewed, with most observations around 1400 to 1500 square feet. A few observations exceed 3000 square feet.
set.seed(100)
samp1 <- sample(area, 50)
hist(samp1)
The exercise did not specify a seed, so the sample distribution changes every time I run the script. I set the seed to 100 so I could reproduce the distribution.
My sample distribution appears right-skewed, centered at roughly 1440 square feet. There are a few observations above 2000 square feet, but few very small areas.
samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
set.seed(200)
samp2 <- sample(area, 50)
mean(samp1)
## [1] 1441.52
mean(samp2)
## [1] 1486.7
The means are fairly close, but are clearly different from each other. Sample 2 is closer to the population mean than Sample 1.
The sample of size 1000 should give a much more accurate estimate of the mean than the sample of size 100. The variability of the sample mean depends on the sample size: its standard error shrinks roughly in proportion to 1/sqrt(n), so means of larger samples vary less than means of smaller samples. As the sample size approaches the population size, the sample mean approaches the true population mean.
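The claim about sample sizes 100 and 1000 can be checked with a small simulation. This is only a sketch: the Ames data is not bundled here, so `population` is a hypothetical right-skewed stand-in generated with `rlnorm`, and `mean_abs_error` is a helper name introduced for illustration.

```r
# Hypothetical stand-in population (NOT the Ames data): 3000 right-skewed values.
set.seed(1)
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
pop_mean <- mean(population)

# Average absolute error of the sample mean over many repeated samples of size n.
mean_abs_error <- function(n, reps = 2000) {
  errors <- replicate(reps, abs(mean(sample(population, n)) - pop_mean))
  mean(errors)
}

err_100 <- mean_abs_error(100)
err_1000 <- mean_abs_error(1000)

# The typical error for n = 1000 should be clearly smaller than for n = 100.
c(n100 = err_100, n1000 = err_1000)
```

With the real data loaded, replacing `population` with `area` should show the same pattern.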
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
hist(sample_means50, breaks = 25)
sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
length(sample_means50)
## [1] 5000
There are 5000 elements in sample_means50. The sampling distribution looks almost exactly normal. Its center is extremely close to the true mean of 1500 square feet. The distribution would not change much with 50,000 sample means: its center and spread are determined by the sample size of 50, not by how many samples are drawn, so the extra means would mostly make the histogram smoother.
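A quick way to test the 5,000 versus 50,000 question is to simulate both and compare center and spread. The sketch below assumes a hypothetical stand-in `population` (simulated, not the Ames data) and an illustrative helper `draw_means`.

```r
# Hypothetical stand-in population (NOT the Ames data).
set.seed(2)
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Draw `reps` sample means, each from a random sample of size n.
draw_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(population, n)))
}

means_5k <- draw_means(5000)
means_50k <- draw_means(50000)

# Center and spread should be nearly the same for both runs.
rbind(
  reps_5000 = c(center = mean(means_5k), spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

Only the number of replications changes between the two runs; the sample size stays at 50, which is what controls the shape of the sampling distribution.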
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
  #print(i)
}
sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(0, 100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
  #print(i)
}
sample_means_small
## [1] 1466.64 1694.96 1540.60 1362.60 1455.66 1386.86 1470.94 1403.76
## [9] 1534.04 1623.04 1408.18 1596.70 1600.32 1572.30 1589.70 1503.70
## [17] 1428.10 1496.62 1492.70 1486.22 1502.60 1454.34 1445.08 1528.28
## [25] 1579.90 1493.92 1392.26 1479.92 1606.12 1676.40 1535.80 1557.52
## [33] 1472.28 1494.24 1609.62 1547.00 1528.12 1531.90 1416.30 1543.34
## [41] 1583.44 1547.50 1523.26 1492.78 1466.22 1458.54 1555.62 1417.58
## [49] 1362.94 1517.30 1518.40 1569.54 1549.72 1626.52 1633.04 1533.00
## [57] 1554.84 1548.42 1448.54 1369.40 1398.88 1537.02 1444.24 1468.32
## [65] 1432.48 1503.78 1490.60 1714.50 1452.96 1479.84 1517.16 1473.50
## [73] 1510.38 1583.44 1508.76 1550.82 1415.48 1432.08 1615.66 1449.26
## [81] 1594.60 1476.66 1533.64 1520.60 1598.12 1478.06 1547.74 1562.74
## [89] 1529.40 1487.38 1347.86 1471.64 1313.40 1430.94 1502.60 1455.74
## [97] 1512.32 1343.04 1605.16 1475.70
length(sample_means_small)
## [1] 100
There are 100 elements in the object. Each element is the mean of one random sample of 50 living areas.
Side note: I’m surprised you didn’t use an apply function here. This seems like the perfect use of sapply, which is more concise (though usually not meaningfully faster than a loop over a preallocated vector).
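The sapply idea from the side note can be sketched as below, next to the loop and a `replicate` variant. `area_standin` is a hypothetical simulated vector standing in for `area`, so the block runs without the Ames data. Resetting the seed before each version makes all three draw the same samples.

```r
# Hypothetical stand-in for `area` (NOT the Ames data).
set.seed(3)
area_standin <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Loop version, mirroring the lab code.
set.seed(100)
loop_means <- rep(0, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(area_standin, 50))
}

# sapply version: the index i is unused, it only drives the repetition.
set.seed(100)
sapply_means <- sapply(1:100, function(i) mean(sample(area_standin, 50)))

# replicate version: a thin wrapper around sapply for repeated random draws.
set.seed(100)
replicate_means <- replicate(100, mean(sample(area_standin, 50)))

# With the same seed, all three approaches give the same 100 sample means.
all.equal(loop_means, sapply_means)
all.equal(loop_means, replicate_means)
```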
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
When the sample size is larger, the spread decreases. The center stays near the true mean for every sample size, so with larger samples the individual sample means cluster more tightly around it.
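To put numbers on how center and spread respond to sample size, one can summarize each sampling distribution instead of only plotting it. This is a minimal sketch on a hypothetical simulated `population` (not the Ames data); `summaries` is an illustrative name.

```r
# Hypothetical stand-in population (NOT the Ames data).
set.seed(4)
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and record center and spread.
summaries <- sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(population, n)))
  c(center = mean(means), spread = sd(means))
})
colnames(summaries) <- paste0("n=", sizes)

# Rows: center and spread; columns: one per sample size.
round(summaries, 1)
mean(population)
```

The `center` row should sit close to `mean(population)` in every column, while the `spread` row should shrink from left to right.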
price. Using this sample, what is your best point estimate of the population mean?
set.seed(100)
price_50_sample <- sample(price, 50)
mean(price_50_sample)
## [1] 174131.4
The best point estimate of the population mean is the sample mean. In this case, the sample mean is 174131.4.
sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- sapply(1:5000, function(x) {
  mean(sample(price, 50))
})
hist(sample_means50)
The sampling distribution looks almost exactly normal. I would guess the mean home price would be around 180,000.
mean(price)
## [1] 180796.1
The mean is 180,796, which is very close to 180,000.
sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- sapply(1:5000, function(x) {
  mean(sample(price, 150))
})
hist(sample_means150)
The shape of this sampling distribution looks very close to normal, centered around 180,000. Compared to the sample size 50 distribution, the spread of the sample size 150 distribution is much smaller. I would guess the mean sale price of homes in Ames to be about 180,000.
The sampling distribution from 3 (sample size 150) has a smaller spread. If we want to make estimates close to the true value, we would prefer a distribution with a small spread: the range of probable values for the sample mean is narrower, so any single estimate is more likely to land near the true mean.
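How much smaller the spread gets can be estimated by comparing the two standard deviations directly. The sketch below uses `price_standin`, a hypothetical simulated vector rather than the Ames prices. If the spread scales like 1/sqrt(n), tripling the sample size from 50 to 150 should cut it by a factor of about sqrt(3). Because sampling is without replacement from a finite population, the simulated ratio is expected to come out slightly above that.

```r
# Hypothetical stand-in for `price` (NOT the Ames data).
set.seed(5)
price_standin <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

# Spread (standard deviation) of 5000 simulated sample means for each sample size.
spread_50 <- sd(replicate(5000, mean(sample(price_standin, 50))))
spread_150 <- sd(replicate(5000, mean(sample(price_standin, 150))))

# Compare the simulated ratio with the sqrt(3) rule of thumb.
spread_ratio <- spread_50 / spread_150
c(n50 = spread_50, n150 = spread_150, ratio = spread_ratio, sqrt3 = sqrt(3))
```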