download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
hist(area)

Exercise 1
Describe this population distribution.
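The distribution is strongly right-skewed: the mean (1500) exceeds the median (1442), and the maximum of 5642 lies far above the third quartile of 1743, so the long tail is on the right.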
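The skew can also be checked numerically instead of being read off the histogram. The following is a minimal sketch, assuming a simulated log-normal vector `pop` as a stand-in for `area`; the stand-in and its parameters are illustrative and not part of the lab. With the real data, `pop` would be replaced by `area`.

```r
# Hypothetical stand-in for `area`: a right-skewed population of 2930 values.
# The log-normal parameters are assumptions chosen only for illustration.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the mean is pulled above the median.
mean_minus_median <- mean(pop) - median(pop)

# The upper tail (third quartile to maximum) is also longer than
# the lower tail (minimum to first quartile).
q <- quantile(pop, c(0.25, 0.5, 0.75))
upper_tail <- max(pop) - q[["75%"]]
lower_tail <- q[["25%"]] - min(pop)

mean_minus_median > 0
upper_tail > lower_tail
```

Both comparisons use only quantities that `summary()` already reports, so the same check can be done by hand from the summary output.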
set.seed(873492)
samp1 <- sample(area, 50)
mean(samp1)
## [1] 1463.92
hist(samp1)

Exercise 2
Describe the distribution of this sample. How does it compare to the distribution of the population?
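The distribution of the sample looks more concentrated than the population, likely because a sample of 50 rarely includes the most extreme houses, but it still shows a right skew. Its mean of 1463.92 is about 36 below the population mean of roughly 1500.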
set.seed(340999)
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1437.42
hist(samp2)

Exercise 3
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
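The mean of samp2 (1437.42) is about 26.5 lower than the mean of samp1 (1463.92). Both fall below the population mean of roughly 1500, and they differ from each other only because of random sampling. The second sample of 50 looks even more condensed and closer to normal, with only a slight right skew. A larger sample tends to land closer to the true mean, so the sample of size 1000 should provide a more accurate estimate than the sample of size 100.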
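The claim about sample sizes 100 and 1000 can be checked by simulation. This is a rough sketch, assuming a simulated log-normal `pop` in place of `area` and a hypothetical helper `mean_abs_error()` that is not part of the lab.

```r
# Hypothetical stand-in for `area` (parameters are assumptions for illustration).
set.seed(2)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Typical absolute error of the sample mean for a given sample size,
# estimated from `reps` repeated samples drawn without replacement.
mean_abs_error <- function(n, reps = 2000) {
  errors <- replicate(reps, abs(mean(sample(pop, n)) - mean(pop)))
  mean(errors)
}

err100  <- mean_abs_error(100)
err1000 <- mean_abs_error(1000)
c(n100 = err100, n1000 = err1000)
```

A single sample of 1000 can still miss by more than a single sample of 100 by chance; the comparison here is about the typical error over many repetitions.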
set.seed(222304)
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)

mean(sample_means50)
## [1] 1501.321
Exercise 4
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
There are 5000 elements in sample_means50, one for each sample of 50. The sampling distribution is approximately normal and centered at 1501.321, within about 1.5 of the population mean of roughly 1500. Collecting 50,000 sample means would not change the center or spread much, because both depend on the sample size of 50 rather than on the number of samples. The histogram would just look smoother, with diminishing returns beyond the 5000 means already collected.
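To see how little the number of sample means matters, the two cases can be run side by side. This is a sketch under stated assumptions: `pop` is a simulated stand-in for `area`, and `replicate()` is used here as an alternative to the preallocate-and-loop pattern in the lab code.

```r
# Hypothetical stand-in for `area` (parameters are assumptions for illustration).
set.seed(3)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Same sample size (50) in both cases; only the number of sample means differs.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread of each simulated sampling distribution.
rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

The two rows should be close to each other, since the extra replicates only reduce simulation noise in the estimates of center and spread.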
set.seed(656342)
sample_means_small <- rep(NA, 100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small
## [1] 1579.40 1590.74 1545.56 1580.48 1498.22 1611.86 1406.18 1496.54 1501.74
## [10] 1522.04 1420.38 1463.20 1392.98 1478.48 1512.14 1422.00 1563.70 1458.70
## [19] 1425.40 1617.56 1410.00 1470.76 1503.62 1624.24 1415.58 1451.92 1501.72
## [28] 1376.64 1606.34 1421.90 1459.96 1511.74 1588.94 1555.14 1518.58 1572.90
## [37] 1529.76 1394.46 1553.90 1544.36 1527.34 1427.34 1470.34 1538.92 1451.66
## [46] 1601.10 1624.84 1613.98 1481.06 1532.06 1478.26 1579.02 1405.72 1597.76
## [55] 1556.74 1374.32 1487.30 1507.18 1563.84 1516.04 1464.18 1534.76 1435.80
## [64] 1432.66 1429.98 1494.34 1708.20 1563.68 1532.32 1571.72 1393.48 1487.78
## [73] 1625.62 1556.60 1393.26 1383.62 1424.12 1528.60 1592.90 1457.02 1458.94
## [82] 1615.48 1537.94 1579.00 1457.90 1493.20 1487.46 1578.40 1573.94 1560.32
## [91] 1540.22 1351.98 1535.20 1462.64 1500.70 1571.98 1509.90 1536.30 1454.34
## [100] 1579.34
Exercise 5
To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small contains 100 elements, each the mean of one simple random sample of 50 living areas from the population.
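The exercise text asks for a vector of zeros, while the code above initializes with `NA`. A minimal sketch of the zero-initialized variant, assuming a simulated `pop` as a stand-in for `area`:

```r
# Hypothetical stand-in for `area` (parameters are assumptions for illustration).
set.seed(4)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# numeric(100) creates a vector of 100 zeros, as the exercise words it.
sample_means_small <- numeric(100)
for (i in seq_along(sample_means_small)) {
  # Each slot holds the mean of one sample of size 50.
  sample_means_small[i] <- mean(sample(pop, 50))
}

length(sample_means_small)   # number of stored sample means
anyNA(sample_means_small)    # no slot should be missing
```

Either placeholder works because every slot is overwritten. `NA` has the small advantage that a slot the loop failed to fill stands out, whereas a leftover zero could be mistaken for a real value.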
set.seed(987549)
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

summary(sample_means10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1059    1390    1495    1503    1604    2261
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1281    1453    1498    1501    1547    1827
summary(sample_means100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1336    1465    1498    1498    1530    1675
Exercise 6
When the sample size is larger, what happens to the center? What about the spread?
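As the sample size increases, the center barely moves: the mean of the sample means stays within about 3 of the population mean of roughly 1500 (1503, 1501, and 1498 for n = 10, 50, and 100). The spread shrinks significantly: the interquartile range falls from 214 to 94 to 65, and the full range from 1202 to 546 to 339.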
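The shrinking spread can be compared with the usual standard error formula. The sketch below assumes a simulated `pop` in place of `area`, and it uses the standard result for sampling without replacement, sigma / sqrt(n) multiplied by the finite population correction sqrt((N - n) / (N - 1)). That formula is brought in here for comparison and is not part of the lab itself.

```r
# Hypothetical stand-in for `area` (parameters are assumptions for illustration).
set.seed(5)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
N <- length(pop)
sigma <- sqrt(mean((pop - mean(pop))^2))  # population SD (divides by N)

sizes <- c(10, 50, 100)

# Empirical spread: SD of 5000 simulated sample means for each sample size.
empirical <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Theoretical standard error with the finite population correction.
theoretical <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(n = sizes,
           empirical = round(empirical, 1),
           theoretical = round(theoretical, 1))
```

The empirical and theoretical columns should track each other closely, and both fall roughly in proportion to 1 / sqrt(n), which matches the narrowing histograms for sample sizes 10, 50, and 100.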