Describe this population distribution.
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
The histogram of the above-ground living area of houses in Ames is skewed to the right, as shown by its longer right tail and by the mean (1500) sitting above the median (1442). There are also some high outliers: the maximum (5642) is far above the third quartile (1743), and very few points fall near it.
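The right skew described above can also be checked numerically. The sketch below is only an illustration: because the step that loads the Ames data is not shown here, it uses `area_sim`, a hypothetical log-normal stand-in for `area`, and the 1.5 × IQR fence is one common convention for flagging outliers, not necessarily the one intended by the lab.

```r
set.seed(42)

# Hypothetical stand-in for `area`: a right-skewed synthetic population
# (log-normal, mean roughly 1500). Replace with the real vector if available.
area_sim <- rlnorm(2930, meanlog = 7.26, sdlog = 0.33)

# In a right-skewed distribution the mean is pulled above the median.
mean_minus_median <- mean(area_sim) - median(area_sim)

# The upper tail (median to 99th percentile) is longer than the lower tail.
q <- quantile(area_sim, c(0.01, 0.50, 0.99))
upper_tail <- q[[3]] - q[[2]]
lower_tail <- q[[2]] - q[[1]]

# Count points above the conventional upper fence, Q3 + 1.5 * IQR.
upper_fence <- quantile(area_sim, 0.75) + 1.5 * IQR(area_sim)
n_high_outliers <- sum(area_sim > upper_fence)

print(c(mean_minus_median = mean_minus_median,
        upper_tail = upper_tail,
        lower_tail = lower_tail,
        n_high_outliers = n_high_outliers))
```

Running the same three checks on the real `area` vector should point the same way if the histogram is right-skewed.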
samp1 <- sample(area, 50)
Describe the distribution of this sample. How does it compare to the distribution of the population?
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 630 1150 1484 1518 1727 2654
hist(samp1)
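The distribution of this sample is also skewed to the right, and its center is close to the population's: the sample median (1484) and mean (1518) are near the population median (1442) and mean (1500). With only 50 observations, though, the histogram is coarser, and the sample misses the extreme values in the population's tails (its range is 630 to 2654, versus 334 to 5642 for the population).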
mean(samp1)
## [1] 1517.78
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1570.08
The mean of samp1 (1517.78) is close to the mean of samp2 (1570.08), but the two are not equal because each comes from a different random sample. Of the two larger samples, the one of size 1000 should give the more accurate estimate of the population mean: as n increases, sample means vary less from sample to sample and tend to fall closer to the population mean.
samp3 <- sample(area, 100)
mean(samp3)
## [1] 1602.9
samp4 <- sample(area, 1000)
mean(samp4)
## [1] 1472.861
# Draw 5000 samples of size 50 and store each sample mean
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
The code above builds the distribution of sample means, also called the sampling distribution of the mean. It draws 5000 samples of size 50 from area and stores each sample's mean in sample_means50; the histogram displays those 5000 means.
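The same simulation can be written more compactly with `replicate()`. This is a sketch of an alternative, not the method the lab prescribes; `area_sim` is a hypothetical synthetic stand-in for `area`, and the seeds are arbitrary.

```r
set.seed(1)
# Hypothetical stand-in for `area` (right-skewed, mean roughly 1500).
area_sim <- rlnorm(2930, meanlog = 7.26, sdlog = 0.33)

# Loop version, as in the lab: preallocate, then fill one mean per iteration.
set.seed(123)
means_loop <- rep(NA, 5000)
for (i in 1:5000) {
  means_loop[i] <- mean(sample(area_sim, 50))
}

# replicate() evaluates the expression 5000 times and collects the results.
set.seed(123)
means_rep <- replicate(5000, mean(sample(area_sim, 50)))

# With the same seed, both versions draw the same samples in the same order.
print(all.equal(means_loop, means_rep))
```

The loop makes the bookkeeping explicit, which is why the lab uses it; `replicate()` removes the index and the preallocation once that bookkeeping is understood.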
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
hist(sample_means50, breaks = 25)
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1264 1450 1498 1499 1545 1821
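There are 5000 elements in sample_means50. The sampling distribution is approximately normal, with no noticeable skew, and it is centered around 1500 (mean 1499, median 1498), very close to the population mean of 1500. If we collected 50,000 sample means instead, I would expect the histogram to look smoother, but its center and spread should stay essentially the same, because they depend on the sample size (50), not on the number of samples drawn.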
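One way to check how the number of simulated means affects the sampling distribution is to run the simulation twice and compare the results. The sketch below does this on `area_sim`, a hypothetical synthetic stand-in for `area`; `sim_means` is a helper name introduced only for this illustration.

```r
set.seed(2)
# Hypothetical stand-in for `area` (right-skewed, mean roughly 1500).
area_sim <- rlnorm(2930, meanlog = 7.26, sdlog = 0.33)

# Helper (illustrative name): `reps` sample means, each from a sample of size n.
sim_means <- function(pop, n, reps) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(area_sim, n = 50, reps = 5000)
means_50k <- sim_means(area_sim, n = 50, reps = 50000)

# Compare the center and spread of the two simulated sampling distributions.
comparison <- rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
print(round(comparison, 1))
```

Under these assumptions the two rows should be nearly the same: more repetitions give a smoother picture of the same distribution, while changing the sample size n is what changes its spread.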
To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
# Smaller version of the loop: 100 sample means, each from a sample of size 50
sample_means_small <- rep(0, 100)
for (i in 1:100) {
  samp5 <- sample(area, 50)
  sample_means_small[i] <- mean(samp5)
}
sample_means_small
## [1] 1762.06 1462.10 1577.44 1581.38 1410.78 1427.48 1485.46 1544.12 1501.74
## [10] 1535.22 1462.18 1545.12 1414.34 1517.10 1385.70 1514.42 1594.34 1409.78
## [19] 1460.48 1379.74 1566.32 1702.26 1575.92 1484.04 1534.76 1477.88 1503.76
## [28] 1559.66 1562.14 1557.30 1578.08 1523.14 1372.14 1547.56 1533.88 1516.46
## [37] 1453.28 1617.20 1505.04 1450.44 1518.96 1502.60 1427.80 1415.72 1445.22
## [46] 1466.02 1464.38 1533.52 1487.28 1363.44 1559.04 1485.22 1554.58 1404.50
## [55] 1478.92 1448.94 1469.34 1482.80 1464.16 1448.84 1488.16 1468.30 1482.88
## [64] 1544.52 1612.02 1512.18 1535.34 1526.06 1512.20 1487.64 1470.06 1461.84
## [73] 1468.18 1615.14 1401.22 1489.86 1460.64 1483.02 1565.78 1666.10 1427.72
## [82] 1574.90 1606.60 1551.82 1527.02 1514.36 1563.92 1564.02 1486.22 1503.38
## [91] 1496.02 1496.26 1519.08 1408.78 1462.06 1583.74 1509.56 1513.12 1564.30
## [100] 1457.68
There are 100 elements in this object. Each element represents the mean area from a sample of 50 houses.
hist(sample_means50)
Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
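Both claims in the paragraph above can be checked by simulation. The sketch below again uses `area_sim`, a hypothetical synthetic stand-in for `area`, and compares the simulated spread with the standard error formula for sampling without replacement from a finite population, $\sigma/\sqrt{n}\cdot\sqrt{(N-n)/(N-1)}$.

```r
set.seed(3)
# Hypothetical stand-in for `area` (right-skewed, mean roughly 1500).
area_sim <- rlnorm(2930, meanlog = 7.26, sdlog = 0.33)

N <- length(area_sim)  # population size
n <- 50                # sample size

means_50 <- replicate(5000, mean(sample(area_sim, n)))

# Unbiasedness: the average of the sample means should sit at the population mean.
bias <- mean(means_50) - mean(area_sim)

# Population standard deviation (divisor N, not N - 1 as in sd()).
sigma <- sqrt(mean((area_sim - mean(area_sim))^2))

# Standard error with the finite population correction, since sample()
# draws without replacement by default.
se_theory <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))
se_sim    <- sd(means_50)

print(c(bias = bias, se_theory = se_theory, se_sim = se_sim))
```

If the sample mean is unbiased, `bias` should be small relative to `se_sim`, and the simulated and theoretical standard errors should roughly agree.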
# Build sampling distributions for sample sizes 10 and 100 in one loop
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
When the sample size is larger, what happens to the center? What about the spread?
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
As the sample size gets larger, the center stays in about the same place, around 1500, while more of the sample means pile up near it, so the histogram is taller there. The spread decreases: the distribution gets narrower as the sample size increases.
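The visual comparison of the three histograms can be backed up with numbers. This sketch tabulates the center and spread of the simulated sampling distribution for each sample size; it runs on `area_sim`, a hypothetical synthetic stand-in for `area`, so the exact values will differ from those for the Ames data.

```r
set.seed(4)
# Hypothetical stand-in for `area` (right-skewed, mean roughly 1500).
area_sim <- rlnorm(2930, meanlog = 7.26, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and summarize them.
# sapply() returns a matrix with one column per sample size.
summary_by_n <- sapply(sizes, function(n) {
  m <- replicate(5000, mean(sample(area_sim, n)))
  c(n = n, center = mean(m), spread = sd(m))
})

# Transpose so that each row is one sample size.
print(round(t(summary_by_n), 1))
```

The `center` column should stay near the population mean for every sample size, while the `spread` column shrinks as n grows.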
## On your own
sample_price <- sample(price, 50)
mean(sample_price)
## [1] 178290.2
Based on this sample of 50 houses, the best point estimate of the population mean price is the sample mean, 178290.2.
### 2
Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 20)
The sampling distribution is approximately normal with a slight skew to the right. It is centered around 180,000, so that is my guess for the mean home price of the population.
mean(sample_means50)
## [1] 180847.5
The mean of the 5000 sample means is 180847.5, which should be very close to the population mean, mean(price).
### 3
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 20)
The sampling distribution is approximately normal. There is still a slight right skew, but it is less noticeable than for a sample size of 50. In addition, the distribution for a sample size of 150 is a little narrower than the one for a sample size of 50. It is centered slightly above 180,000, so that is my guess for the mean sale price of homes in Ames.
### 4
Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
The sampling distribution for a sample size of 150 has a smaller spread than the one for a sample size of 50. To get estimates that are more often close to the true value, we would prefer a distribution with a small spread, which can be achieved by using a larger sample size.
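"More often close to the true value" can be made concrete by counting how often a sample mean lands within a chosen tolerance of the population mean. The sketch below is illustrative only: `price_sim` is a hypothetical log-normal stand-in for `price`, the 5% tolerance is an arbitrary choice, and `within_tol` is a helper name introduced here.

```r
set.seed(5)
# Hypothetical stand-in for `price`: right-skewed, mean in the high 100,000s.
price_sim <- rlnorm(2930, meanlog = 12.0, sdlog = 0.4)

true_mean <- mean(price_sim)
tolerance <- 0.05 * true_mean  # arbitrary: within 5% of the true mean

# Helper (illustrative name): share of simulated sample means that fall
# within the tolerance of the population mean, for samples of size n.
within_tol <- function(n, reps = 5000) {
  m <- replicate(reps, mean(sample(price_sim, n)))
  mean(abs(m - true_mean) <= tolerance)
}

p50  <- within_tol(50)
p150 <- within_tol(150)

print(c(n50 = p50, n150 = p150))
```

Under these assumptions the share for n = 150 should be clearly higher than for n = 50, which is the practical payoff of the smaller spread.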