This lab investigates whether statistics computed from a random sample can serve as point estimates for population parameters. It builds a sampling distribution of the estimate from repeated random samples to learn about the properties of that distribution.
#download data file and load
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
Slicing out two variables from the dataset to focus on: the above-ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice).
#carving out two variables from dataset
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
#viewing summary statistics for the two variables
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area, main = "Population Living Area", xlab = "sqft")
#take a random sample of 50 homes to estimate the mean living area
samp1 <- sample(area, 50)
samp1
## [1] 1494 1868 2030 1478 1078 1840 1142 1602 1471 1025 1484 1572 1742 1680 1522
## [16] 2052 2552 1728 960 2403 1701 1489 1178 1086 958 1040 1094 1951 1040 1614
## [31] 816 1524 1380 1510 1640 1200 1116 2531 1700 1426 1717 1472 1982 960 1950
## [46] 2061 2322 1196 1188 1149
#viewing summary statistics for the area sample, samp1
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 816 1156 1502 1534 1738 2552
mean(samp1)
## [1] 1534.28
hist(samp1, main = "Sample 1 of Size 50 Living Areas", xlab = "sqft")
Population μ = 1500
Sample 1 x̄ = 1534
The distributions of both the population and the sample are clearly right skewed.
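A numeric check can back up the visual impression of right skew. The sketch below is only illustrative: `skewness()` is a hypothetical helper written here in base R, and `pop_demo` is a simulated right-skewed stand-in for `area` (an assumption, so that the block runs on its own). A positive skewness value, or a mean larger than the median, points to a right-skewed distribution.

```r
# Hypothetical helper: third standardized moment.
# Positive values indicate right skew, negative values left skew.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(3)  # arbitrary seed so the draw is reproducible
# Simulated stand-in for `area`; replace with the real vector in the lab.
pop_demo  <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
samp_demo <- sample(pop_demo, 50)

c(population = skewness(pop_demo), sample = skewness(samp_demo))
c(mean = mean(pop_demo), median = median(pop_demo))
```

In the lab itself, `skewness(area)` and `skewness(samp1)` would be the corresponding calls.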
#taking three more samples, two will have larger size
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1498.7
hist(samp2, main = "Sample 2 of Size 50 Living Areas", xlab = "sqft")
samp3 <- sample(area, 100)
mean(samp3)
## [1] 1501.84
hist(samp3, main = "Sample Size 100 Living Areas", xlab = "sqft")
samp4 <- sample(area, 1000)
mean(samp4)
## [1] 1483.298
hist(samp4, main = "Sample Size 1000 Living Areas", xlab = "sqft")
Population μ = 1500
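Sample 1 x̄ (size 50) = 1534
Sample 2 x̄ (size 50) = 1499
Sample 3 x̄ (size 100) = 1502
Sample 4 x̄ (size 1000) = 1483
In this run the closest mean is sample 2 (size 50) and the second closest is sample 3 (size 100); sample 4 (size 1,000) comes third. A single sample, even a large one, is not guaranteed to land closest to μ, because each draw is random.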
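The sample means change every time the document is knit, because `sample()` draws fresh random values on each run. One way to make the comparison reproducible, and to rank the estimates by their distance from μ, is sketched below. It assumes a simulated stand-in population (`pop_demo`, hypothetical) in place of `area`, and the seed value is arbitrary.

```r
set.seed(2024)  # arbitrary seed; fixes the random draws across knits
# Simulated stand-in for `area`; replace with the real vector in the lab.
pop_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# Draw one sample of each size and record its mean.
sizes <- c(50, 50, 100, 1000)
samp_means <- sapply(sizes, function(n) mean(sample(pop_demo, n)))

# Tabulate each estimate with its absolute distance from the population mean.
comparison <- data.frame(
  size      = sizes,
  samp_mean = samp_means,
  abs_error = abs(samp_means - mean(pop_demo))
)
comparison[order(comparison$abs_error), ]  # closest estimate first
```

Calling `set.seed()` once at the top of the report would keep the printed output and the written commentary in agreement.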
#generating 5,000 samples of size 50 living areas
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25, main = "Sampling Distribution of Living Area\n5,000 Samples Size 50", xlab = "sqft")
#counting number of elements
length(sample_means50)
## [1] 5000
There are 5,000 elements in sample_means50. Their distribution looks approximately normal: the sample means are far more symmetric about their center than the observations in the previous histograms. If the number of sample means increased to 50,000, I would expect the histogram to trace that bell shape more smoothly, while its center and spread would stay about the same.
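One way to check how the number of simulated samples affects the result is to compare the spread of the sample means at two replication counts. The sketch below uses `replicate()` in place of the for loop; `sim_means()` is a hypothetical helper and `pop_demo` is a simulated stand-in for `area`, both assumptions made so the block runs on its own.

```r
set.seed(1)  # arbitrary seed
# Simulated stand-in for `area`; replace with the real vector in the lab.
pop_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: simulate `reps` sample means, each from a sample of size `n`.
sim_means <- function(population, n, reps) {
  replicate(reps, mean(sample(population, n)))
}

means_5k  <- sim_means(pop_demo, n = 50, reps = 5000)
means_50k <- sim_means(pop_demo, n = 50, reps = 50000)

# The spread depends on the sample size n, not on the number of replicates,
# so these two standard deviations should be nearly equal.
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

More replicates give a smoother histogram of the same underlying distribution; only changing the sample size changes its width.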
#printing i inside the for loop for the living area sampling distribution
#sample_means50 <- rep(NA, 5000)
#for(i in 1:5000){
# samp <- sample(area, 50)
# sample_means50[i] <- mean(samp)
#print(i)
# }
#REMOVING THIS PORTION OF EXERCISE FOR KNITTING PURPOSES.
#practicing for loop
sample_means_small <- rep(NA, 100)
for(i in 1:100){
samp_small <- sample(area, 50)
sample_means_small[i] <- mean(samp_small)
}
sample_means_small
## [1] 1276.10 1538.04 1421.06 1463.86 1615.48 1480.44 1499.24 1481.62 1637.76
## [10] 1537.12 1629.62 1644.40 1411.36 1491.32 1573.98 1474.60 1446.72 1518.82
## [19] 1318.94 1542.92 1505.74 1510.84 1584.42 1578.90 1443.80 1532.92 1486.48
## [28] 1487.10 1496.54 1467.48 1533.10 1598.98 1652.98 1497.66 1384.64 1427.10
## [37] 1530.06 1486.16 1382.40 1483.42 1429.08 1615.40 1518.22 1534.02 1552.74
## [46] 1598.48 1452.54 1557.26 1596.18 1586.70 1609.40 1397.84 1409.56 1456.08
## [55] 1465.88 1647.12 1463.76 1492.86 1405.10 1559.48 1518.92 1432.56 1487.64
## [64] 1481.26 1355.58 1401.64 1485.84 1393.24 1471.22 1590.14 1539.58 1566.00
## [73] 1626.80 1387.56 1567.76 1514.26 1560.38 1555.48 1500.34 1475.02 1497.48
## [82] 1425.66 1609.38 1496.98 1407.84 1405.16 1503.46 1570.54 1433.68 1413.42
## [91] 1553.50 1451.14 1509.08 1450.52 1439.52 1495.60 1694.04 1519.82 1517.56
## [100] 1512.22
There are 100 elements in sample_means_small, and each one is the mean of a single random sample of 50 living areas.
Building two more sampling distributions to see the effect of the sample size. One sample size will be 10 and the second size 100.
#creating two additional sampling distributions
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
#plotting all three sampling distributions
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
When the sample size is larger, more of the sample means fall close to the population mean, so the histogram has a taller, narrower peak. The center stays in about the same place while the spread contracts toward it.
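The shrinking spread can be compared against the theoretical standard error of the mean. The sketch below is illustrative only: `theory_se()` is a hypothetical helper, and `pop_demo` is a simulated stand-in for `area`. Because `sample()` draws without replacement by default, the helper includes the finite population correction sqrt(1 - n/N) alongside the usual sd/sqrt(n).

```r
set.seed(42)  # arbitrary seed
# Simulated stand-in for `area`; replace with the real vector in the lab.
pop_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: standard error of the sample mean when sampling
# without replacement from a finite population of size N = length(population).
theory_se <- function(population, n) {
  sd(population) / sqrt(n) * sqrt(1 - n / length(population))
}

sizes <- c(10, 50, 100)

# Empirical spread: standard deviation of 5,000 simulated sample means per size.
empirical <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop_demo, n))))
})
theoretical <- sapply(sizes, function(n) theory_se(pop_demo, n))

data.frame(n = sizes, empirical_se = empirical, theoretical_se = theoretical)
```

The two columns should agree closely, and both fall roughly in proportion to 1/sqrt(n) as the sample size grows.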
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
2. Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
3. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
#generating 5,000 samples each of size 50 and size 150 and storing the mean home prices
sample_means_price50 <- rep(NA, 5000)
sample_means_price150 <- rep(NA, 5000)
for(i in 1:5000){
samp_price <- sample(price, 50)
sample_means_price50[i] <- mean(samp_price)
samp_price <- sample(price, 150)
sample_means_price150[i] <- mean(samp_price)
}
#plotting both price sampling distributions
par(mfrow = c(2, 1))
xlimits2 <- range(sample_means_price50)
hist(sample_means_price50, breaks = 20, xlim = xlimits2)
hist(sample_means_price150, breaks = 20, xlim = xlimits2)
mean(sample_means_price50)
## [1] 180927.2
mean(sample_means_price150)
## [1] 180764.4
The sampling distribution for samples of size 50 is approximately normal with a slight right skew. The sampling distribution for samples of size 150 is more tightly concentrated around the mean, approximately normal with no visible skewness, and has a smaller range than the size-50 distribution.
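The best estimate for the population mean home price is the mean of sample_means_price150, about 180,764. Both sampling distributions are centered near the same value, but the size-150 distribution has the smaller spread, so its estimates fall close to the true value more often. For estimation, a small spread is what we prefer.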
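The prompts also ask for a point estimate from a single sample and for the population mean, neither of which the code above computes. A hedged sketch of those steps follows. It assumes a simulated stand-in for `price` (`price_demo`, hypothetical values) so that the block runs on its own; the numbers it prints are not the Ames results.

```r
set.seed(7)  # arbitrary seed
# Simulated stand-in for `price`; replace with the real vector in the lab.
price_demo <- round(rlnorm(2930, meanlog = 12, sdlog = 0.4))

# Point estimate of the population mean from one random sample of size 50.
samp_price50   <- sample(price_demo, 50)
point_estimate <- mean(samp_price50)

# Sampling distributions of the mean for n = 50 and n = 150.
means_50  <- replicate(5000, mean(sample(price_demo, 50)))
means_150 <- replicate(5000, mean(sample(price_demo, 150)))

# Population mean next to the single-sample estimate.
c(population_mean = mean(price_demo), point_estimate = point_estimate)

# Center and spread of each sampling distribution.
data.frame(n      = c(50, 150),
           center = c(mean(means_50), mean(means_150)),
           spread = c(sd(means_50), sd(means_150)))
```

With the real data, `mean(price)` would report the population mean, and `sd(sample_means_price50)` and `sd(sample_means_price150)` would quantify the difference in spread described above.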