load("C:/Users/ZacharyHerold/Documents/DATA606/Lab4a/more/ames.RData")
head(ames[1:6,1:7])
##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## 1     1 526301100          20        RL          141    31770   Pave
## 2     2 526350040          20        RH           80    11622   Pave
## 3     3 526351010          20        RL           81    14267   Pave
## 4     4 526353030          20        RL           93    11160   Pave
## 5     5 527105010          60        RL           74    13830   Pave
## 6     6 527105030          60        RL           78     9978   Pave
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
hist(area)
Describe this population distribution.
The distribution is unimodal and right-skewed (so not symmetric), with a long upper tail; its shape resembles a lognormal distribution.
mean(area)
## [1] 1499.69
The mean of the population is nearly 1500.
Describe the distribution of this sample. How does it compare to the distribution of the population?
samp1 <- sample(area, 50)
hist(samp1, breaks = 10)
From the histogram alone, it is unclear whether the sample retains the right skew of the population.
qqnorm(samp1)
qqline(samp1)
The Q-Q plot makes the right skew of the sample data apparent, with a few extreme positive outliers.
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2 <- sample(area, 50)
print(c(mean(samp1), mean(samp2)))
## [1] 1509.86 1536.24
samp3 <- sample(area, 100)
samp4 <- sample(area, 1000)
print(c(mean(samp3), mean(samp4)))
## [1] 1475.920 1463.368
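On average, the best point estimate comes from the largest sample, here the one of size 1000, because the sample mean varies less around the population mean as the sample size grows. A single draw can still miss: in this run the size-1000 mean (1463.4) landed farther from the population mean of about 1500 than the size-100 mean (1475.9).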
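Because a single draw at each size can mislead, one way to check the claim is to repeat the draw many times and compare the typical error. The sketch below is only an illustration: `pop` is a simulated right-skewed stand-in for `area` (an assumption, not the Ames data), and `typical_error` is a hypothetical helper.

```r
# Simulated stand-in for `area` (an assumption; not the Ames data).
set.seed(42)
pop <- rlnorm(2900, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: mean absolute error of the sample mean
# across `reps` independent samples of size n.
typical_error <- function(n, reps = 2000) {
  means <- replicate(reps, mean(sample(pop, n)))
  mean(abs(means - mean(pop)))
}

errors <- sapply(c(50, 100, 1000), typical_error)
names(errors) <- c("n=50", "n=100", "n=1000")
print(round(errors, 1))
```

The printed errors should fall as n rises, even though any one sample of size 1000 can still land farther from the population mean than a luckier small sample.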
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
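sample_means50 holds one element per sample mean computed, so its length equals the number of iterations of the loop (5000 in the loop that builds it), not the sample size of 50. The sampling distribution is centered close to the population mean of about 1500. Collecting 50,000 sample means instead would not change its center or spread in any systematic way, since both depend on the sample size; the histogram would only look smoother.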
To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
sample_means_small
## [1] 1587.26 1587.02 1602.04 1499.38 1558.90 1453.22 1622.08 1617.46
## [9] 1521.38 1485.38 1423.34 1444.90 1395.40 1518.78 1538.62 1561.46
## [17] 1526.48 1577.38 1546.10 1462.06 1447.32 1537.62 1503.48 1531.20
## [25] 1376.56 1632.46 1471.44 1397.10 1326.22 1544.74 1602.10 1527.80
## [33] 1364.34 1450.98 1484.02 1345.26 1468.68 1422.98 1468.88 1477.78
## [41] 1459.06 1484.28 1250.84 1495.26 1485.56 1644.60 1532.78 1494.52
## [49] 1437.24 1502.88 1617.00 1522.92 1506.36 1429.98 1515.78 1554.42
## [57] 1646.38 1647.08 1551.14 1527.58 1413.10 1526.78 1568.04 1497.28
## [65] 1666.34 1369.52 1493.18 1555.76 1525.96 1563.78 1455.34 1396.34
## [73] 1474.36 1530.74 1469.58 1512.68 1548.46 1467.20 1534.30 1466.70
## [81] 1490.26 1430.02 1447.36 1422.70 1508.36 1533.00 1544.74 1497.08
## [89] 1506.24 1537.96 1518.74 1557.18 1400.66 1448.06 1547.90 1610.06
## [97] 1620.96 1534.26 1571.66 1446.08
length(sample_means_small)
## [1] 100
mean(sample_means_small)
## [1] 1504.24
sd(sample_means_small)
## [1] 75.15417
There are 100 elements in this vector. Each one is the mean of a separate random sample of 50 values from area, that is, a point estimate of the population mean.
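The same vector can be built without preallocating and indexing, using base R's `replicate()`. This is a sketch of an alternative idiom, not the lab's prescribed method; `pop` is a simulated stand-in for `area` (an assumption, not the Ames data).

```r
# Simulated stand-in for `area` (an assumption; not the Ames data).
set.seed(1)
pop <- rlnorm(2900, meanlog = 7.25, sdlog = 0.33)

# Equivalent of the for loop: draw 100 samples of size 50 and keep one mean per sample.
sample_means_small <- replicate(100, mean(sample(pop, 50)))

length(sample_means_small)  # one element per sample drawn
```

`replicate()` evaluates its expression once per repetition, so each element is still the mean of an independent sample of 50.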
When the sample size is larger, what happens to the center? What about the spread?
sample_means_2 <- rep(0, 10000)
for(i in 1:10000){
samp <- sample(area, 50)
sample_means_2[i] <- mean(samp)
}
mean(sample_means_2)
## [1] 1499.416
sd(sample_means_2)
## [1] 69.95256
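The loop above raises the number of samples from 100 to 10,000 but keeps the sample size at 50, so it does not change the sampling distribution itself; it only estimates it more precisely. The center (1499.4) sits closer to the population mean of 1499.69, and the SD (69.95 versus 75.15) is a steadier estimate of the same spread. With a genuinely larger sample size, the center would stay at the population mean and the spread would shrink.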
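The difference between drawing more samples and drawing larger samples can be checked directly. The sketch below is illustrative only: `pop` is a simulated stand-in for `area` (an assumption, not the Ames data), `sd_of_means` is a hypothetical helper, and the sizes 1000, 10000 and 200 are arbitrary choices.

```r
# Simulated stand-in for `area` (an assumption; not the Ames data).
set.seed(7)
pop <- rlnorm(2900, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: SD of `reps` sample means, each from a sample of size n.
sd_of_means <- function(n, reps) sd(replicate(reps, mean(sample(pop, n))))

# Same sample size, more repetitions: the spread should barely move.
sd_reps_few  <- sd_of_means(n = 50, reps = 1000)
sd_reps_many <- sd_of_means(n = 50, reps = 10000)

# Larger sample size, same repetitions: the spread should shrink.
sd_n_large <- sd_of_means(n = 200, reps = 10000)

print(c(reps_1000 = sd_reps_few, reps_10000 = sd_reps_many, n_200 = sd_n_large))
```

Only the third value should be clearly smaller, because the spread of the sampling distribution depends on the size of each sample and not on how many samples are drawn.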
Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
Here is a summary of the actual price data, with mean of 180,796.
summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12789  129500  160000  180796  213500  755000
samp2 <- sample(price, 50)
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   87500  139350  170500  173153  188750  305900
Without access to the population mean, the best point estimate is the mean of the random sample, here 173,153.
Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?
par(mfrow = c(1, 1))
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
}
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  145688  173082  180701  181001  188414  223803
Based on this sampling distribution, the best guess for the population mean is 181,001, very close to the actual population mean of 180,796.
hist(sample_means50)
qqnorm(sample_means50)
qqline(sample_means50)
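The sampling distribution is roughly symmetric and bell-shaped. The upturned ends of the Q-Q plot show tails that depart from the normal line, a remnant of the right skew in the population; they do not indicate a tighter fit around the point estimate, and they should flatten toward the line as the sample size increases.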
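How symmetric a sampling distribution is can also be checked numerically rather than read off a Q-Q plot. The sketch below is an illustration under stated assumptions: `pop` is a simulated lognormal stand-in for `price` (not the Ames data), `skewness` is a hypothetical moment-based helper, and the sample sizes 10 and 150 are chosen far apart so the contrast is visible.

```r
# Simulated stand-in for `price` (an assumption; not the Ames data).
set.seed(11)
pop <- rlnorm(2900, meanlog = 12, sdlog = 0.4)

# Hypothetical helper: moment-based skewness, near 0 for a symmetric distribution
# and positive for a right-skewed one.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

means_10  <- replicate(5000, mean(sample(pop, 10)))
means_150 <- replicate(5000, mean(sample(pop, 150)))

print(c(population = skewness(pop),
        n_10       = skewness(means_10),
        n_150      = skewness(means_150)))
```

The skewness should move toward zero as the sample size grows, which corresponds to Q-Q points lying closer to the reference line.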
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 150)
sample_means150[i] <- mean(samp)
}
summary(sample_means150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  159568  176396  180698  180782  185077  205388
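The mean of this sampling distribution, 180,782, is the best point estimate of the mean sale price, and it lies closer to the population mean of 180,796 than the estimate based on samples of size 50.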
par(mfrow = c(2, 1))
hist(sample_means50, breaks = 25, xlim = c(140000, 230000))
hist(sample_means150, breaks = 25, xlim = c(140000, 230000))
With both histograms on the same x-axis, the sample means for n = 150 cluster visibly more tightly around the mean than those for n = 50.
qqnorm(sample_means150)
qqline(sample_means150)
The Q-Q plot shows the same upturned ends, but they are no more pronounced than with the sample size of 50.
Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
sd(sample_means50)
## [1] 11182.19
sd(sample_means150)
## [1] 6360.148
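The SD is much lower for the sampling distribution based on samples of 150 observations (6,360 versus 11,182). A smaller spread is preferable, because it means individual sample means fall close to the true population mean more often.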
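The size of the drop can be compared with the standard error formula SE = sigma / sqrt(n), which predicts that tripling the sample size divides the spread by sqrt(3). The sketch below uses only the two SD values printed above. The formula assumes independent draws, whereas `sample()` draws without replacement by default, so a small deviation from the predicted ratio is expected.

```r
# Standard deviations of the two sampling distributions, copied from the output above.
sd_50  <- 11182.19
sd_150 <- 6360.148

observed_ratio <- sd_50 / sd_150
expected_ratio <- sqrt(150 / 50)  # from SE = sigma / sqrt(n)

print(c(observed = observed_ratio, expected = expected_ratio))
```

The two ratios should be close, which suggests the simulated spreads behave as the formula predicts.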