download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
Exercise 1 Describe this population distribution.
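The population distribution is unimodal and right-skewed: the middle half of the homes fall between roughly 1126 and 1743 square feet (the first and third quartiles), with a long tail reaching a maximum of 5642. The mean (1500) sits above the median (1442), which is consistent with the right skew.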
samp1 <- sample(area, 50)
hist(samp1)
Exercise 2 Describe the distribution of this sample. How does it compare to the distribution of the population?
The distribution of this sample is also right-skewed. Its histogram looks quite similar to the population histogram, except for a few outliers in samp1.
mean(samp1)
## [1] 1383.82
Exercise 3 Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1626.04
The means of samp1 and samp2 differ (1383.82 versus 1626.04) because each is computed from a different random sample. Larger samples give more accurate estimates of the population mean, so of the two larger samples, the one of size 1000 would provide the more accurate estimate.
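One way to check this intuition is to repeat the comparison many times instead of relying on a single draw. The sketch below is illustrative only: `pop` is a simulated right-skewed population standing in for `area` (it is not the Ames data), the seed is arbitrary, and `avg_abs_error` is a hypothetical helper that reports how far the sample mean lands from the population mean on average.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for `area`: a right-skewed population (NOT the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)
pop_mean <- mean(pop)

# Average absolute error of the sample mean over repeated samples of size n
avg_abs_error <- function(n, reps = 2000) {
  mean(replicate(reps, abs(mean(sample(pop, n)) - pop_mean)))
}

sizes <- c(50, 100, 1000)
errors <- sapply(sizes, avg_abs_error)
names(errors) <- sizes
errors  # the typical error shrinks as the sample size grows
```

With the real data, replacing `pop` with `area` should show the same ordering.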
# Pre-allocate the result vector, then store one sample mean per iteration
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
hist(sample_means50, breaks = 25)
Exercise 4 How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
There are 5000 elements in sample_means50, one for each iteration of the loop. The sampling distribution is approximately normal (unimodal and symmetric), and its center is around 1500. No, the distribution would look much the same if we collected 50,000 sample means: the center and spread would stay about the same, and the histogram would only look smoother.
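The element count and the stability of the distribution can both be checked directly. This is a minimal sketch under stated assumptions: `pop` is a simulated stand-in for `area` (not the Ames data), and `sampling_dist` is a hypothetical helper that uses `replicate()` in place of the explicit loop.

```r
set.seed(2)  # arbitrary seed

# Hypothetical stand-in for `area` (NOT the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)

# Build a sampling distribution of the mean for samples of size 50
sampling_dist <- function(n_means) {
  replicate(n_means, mean(sample(pop, 50)))
}

means_5k  <- sampling_dist(5000)
means_50k <- sampling_dist(50000)

length(means_5k)   # one element per simulated sample
length(means_50k)

# Center and spread of the two simulations, side by side
c(mean(means_5k), mean(means_50k))
c(sd(means_5k), sd(means_50k))
```

The number of simulated means controls how smooth the histogram looks, not where the distribution is centered or how wide it is; those depend on the population and on the sample size of 50.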
sample_means50a <- rep(NA, 50000)
for (i in 1:50000) {
  samp <- sample(area, 50)
  sample_means50a[i] <- mean(samp)
}
hist(sample_means50a, breaks = 25)
Exercise 5 To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(NA, 100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small
## [1] 1427.28 1473.56 1614.16 1429.52 1489.84 1626.10 1459.14 1455.64
## [9] 1598.74 1559.72 1547.16 1525.30 1504.28 1510.36 1531.94 1414.98
## [17] 1538.54 1469.10 1499.54 1434.86 1438.62 1493.28 1406.42 1632.40
## [25] 1476.96 1529.44 1393.14 1431.98 1468.56 1639.12 1462.16 1534.44
## [33] 1495.64 1680.86 1467.22 1393.02 1512.52 1496.78 1582.24 1399.12
## [41] 1352.86 1509.40 1495.58 1536.66 1559.10 1469.32 1642.94 1444.46
## [49] 1430.84 1443.04 1469.48 1395.32 1421.24 1507.36 1494.74 1414.60
## [57] 1401.60 1530.08 1326.84 1514.36 1502.72 1668.58 1509.72 1465.48
## [65] 1506.46 1580.32 1380.70 1576.46 1522.50 1656.32 1496.72 1402.76
## [73] 1528.42 1542.14 1491.94 1532.00 1385.90 1523.06 1500.00 1468.04
## [81] 1629.34 1427.98 1488.00 1496.00 1403.36 1529.32 1467.28 1398.00
## [89] 1518.58 1505.32 1508.16 1594.92 1475.14 1449.20 1372.26 1424.72
## [97] 1550.46 1539.10 1504.08 1400.54
There are 100 elements, one for each of the 100 iterations of the loop. Each element is the mean of one random sample of 50 values drawn from area.
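To make concrete what each element represents, the samples themselves can be kept alongside their means. The sketch below is an assumption-laden illustration: `pop` is a simulated stand-in for `area` (not the Ames data), and the samples are stored as columns of a matrix, which the lab's loop does not do.

```r
set.seed(3)  # arbitrary seed

# Hypothetical stand-in for `area` (NOT the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)

# Keep every sample as a column so each stored mean can be traced back
samples <- replicate(100, sample(pop, 50))  # 50 x 100 matrix
sample_means_small <- colMeans(samples)

dim(samples)
length(sample_means_small)

# Element 7 is the mean of the 7th sample of 50 values
all.equal(sample_means_small[7], mean(samples[, 7]))
```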
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  # Sample size 10
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  # Sample size 100
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
Exercise 6 When the sample size is larger, what happens to the center? What about the spread?
When the sample size is larger, the center stays about the same, close to the population mean of roughly 1500, while the spread shrinks: the sample means cluster more tightly around the center.
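The histograms can be backed up with numbers by computing the center and spread of each sampling distribution. This is a sketch, not the lab's method: `pop` is a simulated stand-in for `area` (not the Ames data), and the last line is only a rough theoretical benchmark.

```r
set.seed(4)  # arbitrary seed

# Hypothetical stand-in for `area` (NOT the Ames data)
pop <- rgamma(3000, shape = 9, scale = 165)

sizes <- c(10, 50, 100)
sim <- sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop, n)))
  c(center = mean(means), spread = sd(means))
})
colnames(sim) <- paste0("n=", sizes)
round(sim, 1)  # centers stay close to mean(pop); spreads shrink with n

# Rough theoretical spread, sigma / sqrt(n); it ignores the small
# finite-population correction for sampling without replacement
sd(pop) / sqrt(sizes)
```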
price1 <- sample(price, 50)
mean(price1)
## [1] 181171.8
2. Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
Based on the sampling distribution, I would guess the mean home price of the population to be around 180000.
mean(sample_means50)
## [1] 180808.8
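The value above is the mean of the simulated sample means; the question also asks for the population mean itself, which is `mean(price)`. The sketch below computes both and compares them. It assumes the lab's `price` vector is in the workspace; if it is not, it falls back to a simulated stand-in (not the Ames sale prices) so that the code still runs on its own. The names `pop_mean_price` and `sim_means` are introduced here for illustration.

```r
# Fallback so the sketch runs on its own: if the lab's `price` vector is not
# in the workspace, use a simulated stand-in (NOT the Ames sale prices)
if (!exists("price")) {
  set.seed(5)  # arbitrary seed
  price <- rlnorm(3000, meanlog = 12, sdlog = 0.4)
}

pop_mean_price <- mean(price)  # population mean
sim_means <- replicate(5000, mean(sample(price, 50)))

pop_mean_price
mean(sim_means)                   # should land very close to the population mean
mean(sim_means) - pop_mean_price  # gap due to simulation error only
```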
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)
Based on this sampling distribution, I would guess the population mean to be a little more than 180000.
The sampling distribution from part 3 (samples of size 150) has the smaller spread. If we are concerned with making estimates that are more often close to the true value, we should prefer the distribution with the smaller spread.
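The difference in spread follows from the standard error of the sample mean. As a rough worked example, assuming independent draws and ignoring the finite-population correction, with \(\sigma\) denoting the population standard deviation of price:

\[
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \qquad \frac{SE_{n=50}}{SE_{n=150}} = \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73
\]

Under these assumptions, the sampling distribution for samples of size 150 should be about 1.7 times narrower than the one for samples of size 50.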