load("lab4a//more//ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
set.seed(7)
hist(area)
hist(price)
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
Price and area are both right-skewed: each distribution has a heavy right tail, and both look roughly lognormal. For area, the mean (1500) is greater than the median (1442), which supports that assessment.
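One way to probe the lognormal impression is to compare the mean and median before and after a log transform. The sketch below uses simulated data as a stand-in for `area`; the vector `x` and its lognormal parameters are illustrative assumptions, not values fitted to the Ames data.

```r
# Hypothetical stand-in for a right-skewed variable such as `area`;
# the lognormal parameters are illustrative, not fitted to the Ames data.
set.seed(1)
x <- rlnorm(5000, meanlog = 7.25, sdlog = 0.33)

# For a right-skewed variable the mean sits above the median.
raw_gap <- mean(x) - median(x)

# If the variable is roughly lognormal, log(x) is roughly symmetric,
# so its mean and median nearly coincide.
log_gap <- mean(log(x)) - median(log(x))

print(c(raw_gap = raw_gap, log_gap = log_gap))
```

On the real data, `hist(log(area))` and `hist(log(price))` would give the same check visually: a roughly bell-shaped histogram is consistent with a lognormal shape.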
sample1 <- sample(area, 50)  # avoid naming the object `sample`, which shadows the function
summary(sample1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 864 1217 1490 1640 1986 3395
hist(sample1)
This sample has more weight at the lower end of the distribution than the population does. It has a similar spread but a narrower range, as expected from a sample of only 50 observations. Its mean (1640) and median (1490) are both higher than the population's (1500 and 1442).
sample2 <- sample(area, 50)
The larger the sample, the lower the standard error, so a larger sample gives a more precise estimate of the population mean.
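The relationship between sample size and standard error can be made concrete with the usual formula, standard error = sigma / sqrt(n). The sketch below is a minimal illustration on a simulated population; `population` and the helper `se_mean` are hypothetical names introduced here, and the lognormal parameters are arbitrary.

```r
set.seed(1)
# Hypothetical stand-in for `area`: a simulated right-skewed population.
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Standard error of the mean for samples of size n: sigma / sqrt(n).
# sd() uses the n - 1 denominator, which is a close approximation here.
se_mean <- function(x, n) sd(x) / sqrt(n)

se50 <- se_mean(population, 50)
se200 <- se_mean(population, 200)

# Quadrupling the sample size halves the standard error.
print(c(se50 = se50, se200 = se200, ratio = se50 / se200))
```

Because the standard error shrinks with the square root of n, gains in precision taper off: each halving of the error costs four times as many observations.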
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
length(sample_means50)
## [1] 5000
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1294 1452 1497 1500 1544 1779
qqnorm(sample_means50)
qqline(sample_means50)  # add the reference line that the points are compared against
The sampling distribution has essentially the same mean as the population (about 1500). It is approximately normal, with its median (1497) close to its mean, and the points in the QQ plot fall close to the reference line.
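The loop used to build the sampling distribution can also be written more compactly with `replicate()`. This is a sketch of an alternative, not the lab's prescribed method; `population` is a simulated stand-in for `area` with arbitrary lognormal parameters.

```r
set.seed(2)
# Hypothetical stand-in for `area` (illustrative lognormal parameters).
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# replicate() evaluates the expression 5000 times and collects the results,
# which replaces the preallocated vector and the explicit for loop.
sample_means <- replicate(5000, mean(sample(population, 50)))

print(length(sample_means))

# The mean of the sample means should land close to the population mean.
print(c(sampling_mean = mean(sample_means), population_mean = mean(population)))
```

With the real data, the equivalent call would be `replicate(5000, mean(sample(area, 50)))`.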
sample_means_small <- rep(0, 100)
for (i in 1:100) {
  sample_means_small[i] <- mean(sample(area, 50))
}
sample_means_small
## [1] 1469.80 1428.96 1573.74 1332.64 1495.26 1599.34 1487.60 1483.08
## [9] 1403.80 1472.68 1504.90 1466.28 1464.10 1498.54 1547.94 1574.12
## [17] 1501.64 1477.86 1562.46 1389.88 1563.08 1551.70 1480.44 1406.98
## [25] 1467.16 1430.96 1510.86 1485.22 1490.12 1627.94 1477.12 1584.70
## [33] 1563.84 1527.42 1577.06 1369.46 1392.10 1473.62 1464.98 1517.42
## [41] 1444.56 1606.70 1656.00 1603.92 1430.42 1486.48 1470.30 1479.14
## [49] 1540.70 1490.58 1476.98 1362.90 1479.38 1469.90 1490.16 1446.66
## [57] 1524.52 1465.26 1526.52 1470.54 1581.68 1519.12 1563.00 1410.06
## [65] 1614.14 1566.56 1527.06 1446.54 1457.64 1426.98 1418.98 1478.32
## [73] 1452.96 1535.56 1528.86 1427.88 1494.84 1466.06 1612.72 1502.52
## [81] 1465.50 1531.94 1385.04 1575.68 1356.06 1376.24 1454.04 1545.90
## [89] 1399.34 1678.22 1561.62 1524.52 1452.70 1450.92 1550.86 1579.92
## [97] 1482.84 1550.22 1506.44 1453.52
length(sample_means_small)
## [1] 100
Each element of the vector is the mean of a sample of size 50 drawn from the population.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
par(mfrow = c(1, 1))  # restore the single-panel layout for later plots
As the sample size increases, the spread of the sampling distribution decreases, which is to say the standard error decreases. The center (the mean of the sample means) stays the same.
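The shrinking spread can be checked numerically by comparing the standard deviation of simulated sample means with sigma / sqrt(n) for each sample size. The sketch below assumes a simulated `population` in place of `area`, and uses 2000 repetitions per size to keep it quick; both choices are illustrative.

```r
set.seed(4)
# Hypothetical stand-in for `area` (illustrative lognormal parameters).
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# Empirical spread: standard deviation of 2000 simulated sample means per size.
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(2000, mean(sample(population, n))))
})

# Theoretical value sigma / sqrt(n); this ignores the small finite-population
# correction that applies when sampling without replacement.
theoretical_se <- sd(population) / sqrt(sizes)

comparison <- data.frame(n = sizes, empirical_se, theoretical_se)
print(comparison)
```

The two columns should agree closely, with the empirical values slightly below the theoretical ones at larger n because the samples are drawn without replacement.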
sample_p <- sample(price, 50)
mean(sample_p)
## [1] 195150.6
The point estimate of the population mean is the sample mean computed above, about 195,151.
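# Note: this reuses the name sample_means50, overwriting the area-based means above.
sample_means50 <- rep(0, 5000)
for (i in 1:5000) {
  sample_means50[i] <- mean(sample(price, 50))
}
hist(sample_means50, breaks = 20)
mean(sample_means50)  # mean of the sampling distribution: estimate of the population mean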
## [1] 180731.3
mean(price) #actual population mean
## [1] 180796.1
sample_means150 <- rep(0, 5000)
for (i in 1:5000) {
  sample_means150[i] <- mean(sample(price, 150))
}
hist(sample_means150, breaks = 20)
mean(sample_means150)  # mean of the sampling distribution: estimate of the population mean
## [1] 181013.7
This distribution resembles the normal distribution more closely. It has very little skew, and more of its weight lies between 170,000 and 190,000.
The second sampling distribution has a smaller spread, which is preferable when we want estimates to be closer to the true value. Put another way, a 95% confidence interval built from a sample of 150 is narrower than one built from a sample of 50, but contains the true mean just as frequently.
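The claim about interval width and coverage can be explored by simulation. The sketch below builds normal-approximation 95% intervals (mean plus or minus 1.96 standard errors) from repeated samples of 50 and 150. The simulated `population` stands in for `price`, and `ci_summary` is a hypothetical helper introduced only for this illustration.

```r
set.seed(3)
# Hypothetical stand-in for `price` (illustrative lognormal parameters).
population <- rlnorm(3000, meanlog = 12, sdlog = 0.4)
true_mean <- mean(population)

# For samples of size n, return the average interval width and the share of
# intervals that contain the true population mean.
ci_summary <- function(n, reps = 2000) {
  results <- replicate(reps, {
    samp <- sample(population, n)
    half_width <- qnorm(0.975) * sd(samp) / sqrt(n)
    c(width = 2 * half_width,
      covered = abs(mean(samp) - true_mean) <= half_width)
  })
  c(mean_width = mean(results["width", ]),
    coverage = mean(results["covered", ]))
}

ci50 <- ci_summary(50)
ci150 <- ci_summary(150)
print(rbind(n50 = ci50, n150 = ci150))
```

The intervals from samples of 150 should come out noticeably narrower, while the coverage for both sizes should sit near the nominal 95%. For skewed data the smaller sample may fall a little short of that.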