download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area # sq. feet
price <- ames$SalePrice # dollars
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area, col = "lightseagreen", # histogram
     main = "Area population histogram",
     xlab = "area (sq. ft.)")
boxplot(area, col = "lightseagreen", # boxplot
        main = "Area population boxplot")
Describe this population distribution.
The distribution is heavily right skewed: many high outliers create a long, thin right tail. The mean is about 1500 sq. ft., above the median of 1442 sq. ft., which is consistent with the right skew.
Create a sample and describe its distribution.
set.seed(21) # set seed
samp1 <- sample(area, 50) # create sample
summary(samp1) # sample's statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 796 1106 1390 1520 1731 2978
hist(samp1, col = "slateblue", # histogram
     main = "Area Samp1 Histogram",
     xlab = "area (sq. ft.)")
boxplot(samp1, col = "slateblue", # boxplot
        main = "Area Samp1 boxplot")
The sample mean is 1520, slightly higher than but close to the population mean. The distribution is a bit right skewed with a high outlier.
set.seed(6) # different seed for a second sample
samp2 <- sample(area, 50) # store separately so samp1 is not overwritten
summary(samp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 733 1106 1352 1497 1827 3493
hist(samp2, col = "turquoise",
     main = "Area Samp2 Histogram",
     xlab = "area (sq. ft.)")
boxplot(samp2, col = "turquoise",
        main = "Area Samp2 boxplot")
How does the mean of samp2 compare with the mean of samp1?
The samp2 mean (1497) is slightly below the population mean (about 1500) and closer to it than the samp1 mean (1520).
Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
The sample of size 1000. Sample means from larger samples vary less around the population mean, so the larger sample will usually give the more accurate estimate, although any single sample can still miss.
summary(sample(area, 100)) # statistics for sample size 100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 630 1091 1438 1453 1755 2500
summary(sample(area, 1000)) # statistics for sample size 1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 407 1143 1456 1510 1764 3820
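The effect of sample size on accuracy can be made concrete with the standard error of the sample mean. The formula below is the textbook result for independent draws, with σ denoting the population standard deviation of area (a symbol not used in the original). It ignores the small finite-population correction that applies when sampling without replacement, so treat it as an approximation.

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{\mathrm{SE}_{n=100}}{\mathrm{SE}_{n=1000}} = \sqrt{\frac{1000}{100}} = \sqrt{10} \approx 3.16
```

Under that approximation, the mean of a sample of 1000 has a standard error roughly 3.16 times smaller than the mean of a sample of 100.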
sample_means50 <- rep(NA, 5000) # create vector for sampling
for(i in 1:5000){ # collect means for sampling
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, # histogram of the sampling
     breaks = 25,
     col = "plum",
     main = "Sampling distribution")
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1281 1452 1498 1500 1547 1800
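The loop used to build the sampling distribution can also be written more compactly with replicate(). The sketch below is one possible alternative, not the lab's prescribed method. To keep it runnable without the download, it uses a simulated right-skewed vector named pop as a stand-in for the area population; pop and its lognormal parameters are assumptions chosen only for illustration.

```r
# Stand-in population (assumption): simulated right-skewed values,
# NOT the Ames data. Replace pop with area to use the real population.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# replicate() evaluates the expression 5000 times and collects the results,
# so no pre-allocated vector or loop index is needed.
sample_means <- replicate(5000, mean(sample(pop, 50)))

length(sample_means)            # one mean per sample
mean(sample_means) - mean(pop)  # gap between the two centers; typically small
```

With the real data, replicate(5000, mean(sample(area, 50))) should play the same role as the for loop that fills sample_means50.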
How many elements are there in sample_means50?
There are 5000 elements in the sample_means50 vector. Each is the mean of a sample of 50 areas taken from the area population.
Describe the sampling distribution, and be sure to specifically note its center.
The sampling distribution is approximately normal and centered very close to the population mean of about 1500 sq. ft. Across repeated runs of the simulation, the center typically lands within 1 or 2 sq. ft. of it.
Would you expect the distribution to change if we instead collected 50,000 sample means?
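Not in any substantial way. Each mean would still come from a sample of size 50, so the distribution would remain approximately normal with about the same center and spread. With 50,000 means the histogram would only look smoother.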
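A quick simulation can check how the number of sample means affects the spread. This is a sketch under stated assumptions: pop is a simulated stand-in for the area population, and sim_means() is a hypothetical helper introduced here, not part of the lab.

```r
# Stand-in population (assumption): simulated right-skewed values,
# NOT the Ames data.
set.seed(2)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Hypothetical helper: draw `reps` samples of size n and return their means.
sim_means <- function(pop, n, reps) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(pop, n = 50, reps = 5000)
means_50k <- sim_means(pop, n = 50, reps = 50000)

# The sample size is 50 in both cases, so the two spreads should be
# close to each other; only the number of simulated means differs.
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

The spread of a sampling distribution is governed by the size of each sample, so changing only the number of repetitions should leave the two standard deviations nearly equal.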
sample_means_small <- rep(0, 100) # initialize small vector
for(i in 1:100){ # collect means for 100 samples
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small # look at the vector of sample means
## [1] 1433.32 1606.66 1572.74 1496.42 1433.88 1453.48 1448.00 1542.00 1554.96
## [10] 1537.92 1472.90 1491.74 1619.22 1446.66 1373.78 1541.90 1445.54 1502.92
## [19] 1479.34 1448.38 1509.46 1518.56 1416.28 1457.22 1573.80 1426.44 1447.70
## [28] 1506.98 1500.52 1417.06 1527.28 1538.06 1427.80 1429.36 1482.12 1466.06
## [37] 1597.94 1435.04 1509.88 1522.00 1495.54 1463.12 1571.70 1396.36 1600.52
## [46] 1439.70 1555.46 1496.98 1454.80 1574.60 1310.30 1505.44 1524.64 1486.72
## [55] 1523.00 1557.02 1386.34 1509.52 1554.54 1480.14 1521.66 1523.42 1509.66
## [64] 1527.04 1463.02 1556.38 1546.52 1396.88 1483.56 1434.46 1590.30 1456.24
## [73] 1465.22 1506.84 1684.14 1392.64 1400.28 1504.24 1434.14 1582.40 1521.50
## [82] 1658.28 1509.68 1584.62 1554.14 1471.40 1552.40 1349.28 1489.50 1481.60
## [91] 1548.30 1537.58 1497.82 1499.66 1560.90 1514.92 1514.36 1395.62 1524.62
## [100] 1393.48
How many elements are there in this object called sample_means_small? What does each element represent?
There are 100 elements in sample_means_small. Each is the mean of a sample of 50 areas taken from the area population.
sample_means10 <- rep(NA, 5000) # initialize two vectors of
sample_means100 <- rep(NA, 5000) # different sizes
for(i in 1:5000){ # create sampling distributions
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1)) # set display rows, columns
xlimits <- range(sample_means10) # set x axes limits to the widest range
hist(sample_means10, breaks = 20, xlim = xlimits, col = "plum1")
hist(sample_means50, breaks = 20, xlim = xlimits, col = "plum2")
hist(sample_means100, breaks = 20, xlim = xlimits, col = "plum3")
When the sample size is larger, what happens to the center? What about the spread?
The center doesn’t change: it still closely approximates the population mean. The spread of the sampling distribution gets narrower as the sample size grows.
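The histograms show this pattern visually. A numeric summary of center and spread for each sample size can make the same point. The sketch below again uses a simulated stand-in population (pop, an assumption) so that it runs without the Ames download. The sample sizes match the ones used in the lab.

```r
# Stand-in population (assumption): simulated right-skewed values,
# NOT the Ames data.
set.seed(3)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and record
# the center (mean) and spread (standard deviation) of those means.
result <- t(sapply(sizes, function(n) {
  m <- replicate(5000, mean(sample(pop, n)))
  c(n = n, center = mean(m), spread = sd(m))
}))

result     # one row per sample size
mean(pop)  # population mean, for comparison with the center column
```

The center column should stay near the population mean for every n, while the spread column should shrink as n increases.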
library(scales) # provides dollar() for currency formatting
price1 <- sample(price, 50) # create sample of price data
paste("Sample mean price", dollar(mean(price1))) # get the sample mean
## [1] "Sample mean price $182,315"
The estimate of the population mean based on this sample is $182,315.
Simulate the sampling distribution for x̄_price by taking 5000 samples of size 50 from the population and computing 5000 sample means. Store these means in a vector called sample_means50.
sample_means50 <- rep(NA, 5000) # initialize vector
for(i in 1:5000){ # create sampling data
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?
paste("Sampling (n=50) mean price", dollar(mean(sample_means50)))
## [1] "Sampling (n=50) mean price $181,032"
Based on the sampling distribution with n = 50, the estimated mean home price of the population is $181,032.
Finally, calculate and report the population mean.
paste("Price population mean", dollar(mean(price)))
## [1] "Price population mean $180,796"
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150.
sample_means150 <- rep(NA, 5000) # initialize vector
for(i in 1:5000){ # get means of samples
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50.
par(mfrow = c(2, 1)) # set display rows, columns
hist(sample_means50, col = "darkseagreen",
     main = "Price Sampling (n=50) Histogram",
     xlab = "price")
hist(sample_means150, col = "darkseagreen",
     main = "Price Sampling (n=150) Histogram",
     xlab = "price")
Both sampling distributions are approximately normal, with centers around $180,000. The spread for n = 150 is smaller than the spread for n = 50.
Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
paste("Sampling (n=150) mean price", dollar(mean(sample_means150)))
## [1] "Sampling (n=150) mean price $180,786"
Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
The sampling distribution with n = 150 has a smaller spread than the one with n = 50. A sampling distribution with a smaller spread gives estimates that are more often close to the true value, so the smaller spread is preferable.
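"More often close to the true value" can be measured directly as the share of sample means that land within a chosen tolerance of the population mean. The sketch below is illustrative only: pop is a simulated, price-like stand-in for the price population, prop_within() is a hypothetical helper, and the $10,000 tolerance is an arbitrary choice.

```r
# Stand-in population (assumption): simulated right-skewed, price-like
# values, NOT the Ames sale prices.
set.seed(4)
pop <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

# Hypothetical helper: share of `reps` sample means (samples of size n)
# that fall within `tol` of the population mean.
prop_within <- function(pop, n, tol, reps = 5000) {
  m <- replicate(reps, mean(sample(pop, n)))
  mean(abs(m - mean(pop)) <= tol)
}

p50  <- prop_within(pop, n = 50,  tol = 10000)
p150 <- prop_within(pop, n = 150, tol = 10000)

c(n50 = p50, n150 = p150)  # share of estimates within $10,000 of the mean
```

The larger sample size should produce the higher share, which is the practical meaning of a narrower sampling distribution.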