download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
The population distribution is right-skewed, with most homes between 1000 and 2000 square feet.
samp1 <- sample(area, 50)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 804 1199 1432 1526 1718 3820
hist(samp1)
This distribution looks very similar to the distribution of all the homes, just on a smaller scale. Most of the homes are still within 1000 to 2000 square feet, but the sample has no outliers as extreme as the population maximum of 5642 square feet.
mean(samp1)
## [1] 1525.7
samp2<-sample(area, 50)
mean(samp1)
## [1] 1525.7
mean(samp2)
## [1] 1492.7
The mean of samp2 (1492.7) is about 33 square feet lower than the mean of samp1 (1525.7), but in the grand scheme that isn’t much. We can still make some predictions about the whole population even when the two samples are slightly different. If we took two samples, one of size 100 and one of size 1000, the sample of 1000 would provide a more accurate estimate of the population mean because it contains more of the actual population.
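The claim about sample sizes 100 and 1000 can be checked with a small simulation. The sketch below is only an illustration: `pop` is a hypothetical right-skewed population generated with `rlnorm()` (not the Ames data), and `mean_abs_error()` is a helper name introduced here. To try it on the lab data, `area` could be passed in place of `pop`.

```r
set.seed(1)

# Hypothetical stand-in population: right-skewed, roughly on the scale of `area`.
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Average distance between a sample mean and the population mean,
# taken over `reps` random samples of size n (hypothetical helper).
mean_abs_error <- function(population, n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(population, n)) - mean(population)))
  mean(errs)
}

err100  <- mean_abs_error(pop, 100)
err1000 <- mean_abs_error(pop, 1000)

# The typical error should be noticeably smaller for n = 1000.
c(n100 = err100, n1000 = err1000)
```
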
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)
hist(sample_means50, breaks = 25)
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1259 1453 1499 1501 1547 1833
There are 5000 elements in sample_means50. The sampling distribution looks fairly normal, slightly skewed to the right with a few outliers, and is centered at about 1500. If we collected 50,000 sample means, I believe the histogram would look smoother and its mean would be even closer to the population mean of about 1500. The spread would stay roughly the same, since it depends on the sample size of 50 rather than on the number of samples drawn.
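To see what changes when more sample means are collected, the following sketch compares 5000 and 50,000 simulated means. It assumes a hypothetical stand-in population `pop` drawn with `rlnorm()` rather than the Ames data, and `sim_means()` is a helper name made up for this example.

```r
set.seed(2)

# Hypothetical right-skewed population (stand-in for `area`).
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Draw `reps` samples of size n and return their means (hypothetical helper).
sim_means <- function(population, n, reps) {
  replicate(reps, mean(sample(population, n)))
}

means_5k  <- sim_means(pop, 50, 5000)
means_50k <- sim_means(pop, 50, 50000)

# The spread is driven by the sample size (50), so these should be close.
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))

# Both centers should sit near the population mean.
c(pop = mean(pop), m5k = mean(means_5k), m50k = mean(means_50k))
```

Plotting `hist(means_50k, breaks = 25)` next to `hist(means_5k, breaks = 25)` should show a smoother outline with a similar width.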
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means_small <- rep(NA, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
sample_means_small
## [1] 1445.10 1554.10 1361.10 1449.32 1529.70 1692.92 1554.58 1482.92 1403.62
## [10] 1484.90 1536.42 1564.98 1586.02 1536.34 1483.66 1535.72 1455.24 1453.80
## [19] 1484.38 1505.78 1451.94 1485.66 1504.72 1449.22 1414.60 1461.56 1617.28
## [28] 1421.38 1420.80 1444.68 1417.54 1589.18 1475.16 1588.50 1466.24 1604.74
## [37] 1497.62 1622.08 1416.64 1545.24 1620.70 1492.06 1447.34 1413.38 1624.76
## [46] 1513.86 1388.78 1434.54 1436.38 1447.12 1475.82 1515.68 1401.20 1542.56
## [55] 1549.84 1356.34 1606.52 1510.72 1534.06 1462.80 1594.22 1445.52 1381.42
## [64] 1415.42 1568.54 1424.36 1464.54 1594.34 1480.10 1521.54 1415.90 1481.52
## [73] 1459.92 1559.38 1531.22 1563.26 1495.12 1437.06 1537.60 1486.52 1464.38
## [82] 1427.04 1525.70 1471.12 1611.94 1559.00 1469.86 1541.48 1512.34 1444.38
## [91] 1525.02 1517.30 1574.62 1466.20 1425.76 1525.46 1507.48 1601.12 1533.20
## [100] 1421.86
sample_means_small has 100 elements. On each of the 100 iterations, the loop takes a random sample of size 50 from area, computes its mean, and stores it in sample_means_small.
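The same loop can be written more compactly with base R's `replicate()`. This is an alternative sketch, not the lab's method. It uses a hypothetical stand-in population `pop` and a new name, `sample_means_small_alt`, so nothing from the lab is overwritten.

```r
set.seed(3)

# Hypothetical right-skewed population (stand-in for `area`).
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# replicate() evaluates the expression 100 times and collects the results,
# which replaces the pre-allocated vector and the explicit for loop.
sample_means_small_alt <- replicate(100, mean(sample(pop, 50)))

length(sample_means_small_alt)
```
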
hist(sample_means50)
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
As the sample size increases, the center of the sampling distribution stays more or less the same, but the spread becomes much smaller.
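The shrinking spread can also be put in numbers. The sketch below compares the simulated standard deviation of the sample means with the usual approximation, population standard deviation divided by the square root of n. It assumes a hypothetical stand-in population `pop`, not the Ames data. Because `sample()` draws without replacement from a finite population, the simulated values should come out slightly below the approximation.

```r
set.seed(4)

# Hypothetical right-skewed population (stand-in for `area`).
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# Standard deviation of 5000 simulated sample means for each sample size.
simulated_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Approximation that ignores the finite-population correction.
theoretical_se <- sd(pop) / sqrt(sizes)

data.frame(n = sizes, simulated_se, theoretical_se)
```
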
# On Your Own
Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
samp1 <- sample(price, 50)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63900 126250 167500 191017 235000 466500
hist(samp1)
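My best point estimate of the population mean is the sample mean, about 191000, from one random sample of 50 prices. The sample median, 167500, is lower because the prices are skewed right.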
sample_means50<-rep(0,5000)
for(i in 1:5000){
samp<-sample(price,50)
sample_means50[i]<-mean(samp)
}
hist(sample_means50, breaks = 25)
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
Based on this sampling distribution, I would guess the population mean is about 180000. The actual population mean from summary(price) is 180796.
## Number 3
sample_means150<-rep(0,5000)
for(i in 1:5000){
samp<-sample(price,150)
sample_means150[i]<-mean(samp)
}
hist(sample_means150, breaks = 25)
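The mean sale price is still around 180000.
The sampling distribution from number 3 (sample size 150) has a smaller spread than the one from samples of size 50. We prefer a smaller spread because it means the sample means vary less from sample to sample, so an estimate from any single sample is more likely to be close to the true population mean.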