In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
load("more/ames.RData")We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
area <- ames$Gr.Liv.Area
price <- ames$SalePriceLet’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.
summary(area)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.
samp1 <- sample(area, 50)This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
hist(samp1,breaks=11,col = "blue")summarys1 <- summary(samp1)
summarys1## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 752 1083 1382 1417 1689 2403
mins1 <- as.numeric(summarys1["Min."])
means1 <- round(as.numeric(summarys1["Mean"]),2)
meds1 <- as.numeric(summarys1["Median"])
maxs1 <- as.numeric(summarys1["Max."])
iqrs1 <- as.numeric(IQR(samp1))
cat(paste("Inter-Quartile Range of the sample: ",iqrs1,"\n"))## Inter-Quartile Range of the sample: 606.25
stdevs1 <- round(as.numeric(sd(samp1)),2)
cat(paste("Standard Deviation of the sample: ",stdevs1,"\n"))## Standard Deviation of the sample: 429.86
summarypop <- summary(area)
summarypop## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
minpop <- as.numeric(summarypop["Min."])
meanpop <- round(as.numeric(summarypop["Mean"]),2)
medpop <- as.numeric(summarypop["Median"])
maxpop <- as.numeric(summarypop["Max."])
iqrpop <- as.numeric(IQR(area))
cat(paste("Inter-Quartile Range of the population: ",iqrpop,"\n"))## Inter-Quartile Range of the population: 616.75
stdevpop <- round(as.numeric(sd(area)),2)
cat(paste("Standard Deviation of the population: ",stdevpop,"\n"))## Standard Deviation of the population: 505.51
s1medZscore = round((meds1 - medpop) / stdevpop, 4)
cat(paste("Z-Score of the sample median vs. the population: ",s1medZscore,"\n"))## Z-Score of the sample median vs. the population: -0.1177
s1meanZscore = round((means1 - meanpop) / stdevpop, 4)
cat(paste("Z-Score of the sample mean vs. the population: ",s1meanZscore,"\n"))## Z-Score of the sample mean vs. the population: -0.1634
If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.
mean(samp1)## [1] 1417
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?samp2 <- sample(area, 50)
hist(samp2,breaks=11,col="green")summarys2 <- summary(samp2)
summarys2## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 816 1119 1458 1660 1877 5642
mins2 <- as.numeric(summarys2["Min."])
means2 <- round(as.numeric(summarys2["Mean"]),2)
meds2 <- as.numeric(summarys2["Median"])
maxs2 <- as.numeric(summarys2["Max."])
iqrs2 <- as.numeric(IQR(samp2))
cat(paste("Inter-Quartile Range of sample 2: ",iqrs2,"\n"))## Inter-Quartile Range of sample 2: 758
stdevs2 <- round(as.numeric(sd(samp2)),2)
cat(paste("Standard Deviation of sample 2: ",stdevs2,"\n"))## Standard Deviation of sample 2: 817.34
s2medZscore = round((meds2 - medpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 median vs. the population: ",s2medZscore,"\n"))## Z-Score of sample 2 median vs. the population: 0.0317
s2meanZscore = round((means2 - meanpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 mean vs. the population: ",s2meanZscore,"\n"))## Z-Score of sample 2 mean vs. the population: 0.3164
samp100 <- sample(area, 100)
hist(samp100)summarys100 <- summary(samp100)
summarys100## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 672 1210 1498 1560 1809 3627
mins100 <- as.numeric(summarys100["Min."])
means100 <- round(as.numeric(summarys100["Mean"]),2)
meds100 <- as.numeric(summarys100["Median"])
maxs100 <- as.numeric(summarys100["Max."])
iqrs100 <- as.numeric(IQR(samp100))
cat(paste("Inter-Quartile Range of sample 100: ",iqrs100,"\n"))## Inter-Quartile Range of sample 100: 598.5
stdevs100 <- round(as.numeric(sd(samp100)),2)
cat(paste("Standard Deviation of sample 100: ",stdevs100,"\n"))## Standard Deviation of sample 100: 545.53
s100medZscore = round((meds100 - medpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 median vs. the population: ",s100medZscore,"\n"))## Z-Score of sample 2 median vs. the population: 0.1108
s100meanZscore = round((means100 - meanpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 mean vs. the population: ",s100meanZscore,"\n"))## Z-Score of sample 2 mean vs. the population: 0.1189
samp1000 <- sample(area, 1000)
hist(samp1000)summarys1000 <- summary(samp1000)
summarys1000## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 480 1132 1438 1508 1778 5642
mins1000 <- as.numeric(summarys1000["Min."])
means1000 <- round(as.numeric(summarys1000["Mean"]),2)
meds1000 <- as.numeric(summarys1000["Median"])
maxs1000 <- as.numeric(summarys1000["Max."])
iqrs1000 <- as.numeric(IQR(samp1000))
cat(paste("Inter-Quartile Range of sample 1000: ",iqrs1000,"\n"))## Inter-Quartile Range of sample 1000: 646.5
stdevs1000 <- round(as.numeric(sd(samp1000)),2)
cat(paste("Standard Deviation of sample 1000: ",stdevs1000,"\n"))## Standard Deviation of sample 1000: 509.92
s1000medZscore = round((meds1000 - medpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 median vs. the population: ",s1000medZscore,"\n"))## Z-Score of sample 2 median vs. the population: -0.0079
s1000meanZscore = round((means1000 - meanpop) / stdevpop, 4)
cat(paste("Z-Score of sample 2 mean vs. the population: ",s1000meanZscore,"\n"))## Z-Score of sample 2 mean vs. the population: 0.0156
SampleSummary <- data.frame(matrix(
c(50,means1,s1meanZscore,
50,means2,s2meanZscore,
100,means100,s100meanZscore,
1000,means1000,s1000meanZscore),
nrow = 4,ncol = 3, byrow = T,
dimnames = list(NULL,c("SampleSize","SampleMean","ZScore"))
)
)
SampleSummary %>%
kable() %>%
kable_styling()| SampleSize | SampleMean | ZScore |
|---|---|---|
| 50 | 1417 | -0.16 |
| 50 | 1660 | 0.32 |
| 100 | 1560 | 0.12 |
| 1000 | 1508 | 0.02 |
\[SE\quad =\quad \frac { \sigma }{ \sqrt { N } } \]
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. On the next page, we’ll review how this set of code works.
sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?summary_sample_means50 <- summary(sample_means50)
summary_sample_means50## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1252 1450 1498 1500 1546 1808
stdev_summary_sample_means50 <- sd(sample_means50)
stdev_summary_sample_means50## [1] 72
min_sample_means50 <- as.numeric(summary_sample_means50["Min."])
mean_sample_means50 <- round(as.numeric(summary_sample_means50["Mean"]),2)
med_sample_means50 <- as.numeric(summary_sample_means50["Median"])
max_sample_means50 <- as.numeric(summary_sample_means50["Max."])
iqr_sample_means50 <- round(as.numeric(IQR(sample_means50)),2)
cat(paste("Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 50): ",iqr_sample_means50,"\n"))## Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 50): 95.92
stdev_sample_means50 <- round(as.numeric(sd(sample_means50)),2)
cat(paste("Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 50): ", stdev_sample_means50,"\n"))## Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 50): 72.01
theoretical_stdev_sample_means50 <- round(sd(area) / sqrt(50),2)
cat(paste("Theoretical Standard Error of Mean of samples of size 50: ", theoretical_stdev_sample_means50 ,"\n"))## Theoretical Standard Error of Mean of samples of size 50: 71.49
sample_means50 .\[SE_{ \bar { x } } = \frac { \sigma _{ pop } }{ \sqrt { N } } = \frac { 505.51 }{ \sqrt { 50 } } = \frac { 505.51 }{ 0.7071 } = 71.49 \]
sample_means50000 <- rep(NA, 50000)
for(i in 1:50000){
samp <- sample(area, 50)
sample_means50000[i] <- mean(samp)
}
hist(sample_means50000,col="yellow")summary(sample_means50000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1244 1451 1498 1500 1546 1874
sd(sample_means50000)## [1] 71
for loopLet’s take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area then save the mean of that sample into the sample_means50 vector. Without the for loop, this would be painful:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)and so on…
With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
## printing suppressed for final knitting
#######(print(i))
}Let’s consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 zeros called sample_means50. This vector will will store values generated within the for loop.
The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i = 1, then once when i = 2, and so on up to i = 5000.
The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)th element of sample_means50.
In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.
The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.
sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
## printing suppressed for final knitting
#######(print(i))
}
sample_means_small## [1] 1488 1557 1575 1490 1463 1446 1508 1476 1554 1636 1366 1574 1527 1563
## [15] 1387 1546 1553 1499 1512 1467 1483 1421 1663 1568 1675 1419 1545 1505
## [29] 1466 1446 1599 1382 1492 1750 1520 1606 1445 1586 1520 1480 1597 1443
## [43] 1491 1444 1528 1488 1404 1495 1550 1608 1536 1552 1482 1504 1530 1514
## [57] 1547 1526 1491 1395 1416 1400 1577 1512 1545 1477 1479 1432 1531 1489
## [71] 1614 1615 1532 1389 1504 1470 1545 1452 1626 1496 1472 1370 1418 1418
## [85] 1487 1669 1604 1432 1530 1619 1610 1518 1529 1492 1594 1410 1510 1409
## [99] 1519 1484
sample_means_small?sample_means_small .Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.
hist(sample_means50)The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one.
To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits, col="green")
hist(sample_means50, breaks = 20, xlim = xlimits, col="yellow")
hist(sample_means100, breaks = 20, xlim = xlimits, col="blue")The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
summary_sample_means10 <- summary(sample_means10)
summary_sample_means10## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1033 1390 1491 1502 1601 2285
stdev_summary_sample_means10 <- sd(sample_means10)
stdev_summary_sample_means10## [1] 160
min_sample_means10 <- as.numeric(summary_sample_means10["Min."])
mean_sample_means10 <- round(as.numeric(summary_sample_means10["Mean"]),2)
med_sample_means10 <- as.numeric(summary_sample_means10["Median"])
max_sample_means10 <- as.numeric(summary_sample_means10["Max."])
iqr_sample_means10 <- round(as.numeric(IQR(sample_means10)),2)
cat(paste("Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 10): ",iqr_sample_means10,"\n"))## Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 10): 210.87
stdev_sample_means10 <- round(as.numeric(sd(sample_means10)),2)
cat(paste("Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 10): ", stdev_sample_means10,"\n"))## Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 10): 160.13
theoretical_stdev_sample_means10 <- round(sd(area) / sqrt(10),2)
cat(paste("Theoretical Standard Error of Mean of samples of size 10: ", theoretical_stdev_sample_means10 ,"\n"))## Theoretical Standard Error of Mean of samples of size 10: 159.86
summary_sample_means100 <- summary(sample_means100)
summary_sample_means100## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1347 1466 1499 1500 1533 1695
stdev_summary_sample_means100 <- sd(sample_means100)
stdev_summary_sample_means100## [1] 50
min_sample_means100 <- as.numeric(summary_sample_means100["Min."])
mean_sample_means100 <- round(as.numeric(summary_sample_means100["Mean"]),2)
med_sample_means100 <- as.numeric(summary_sample_means100["Median"])
max_sample_means100 <- as.numeric(summary_sample_means100["Max."])
iqr_sample_means100 <- round(as.numeric(IQR(sample_means100)),2)
cat(paste("Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 100): ",iqr_sample_means100,"\n"))## Inter-Quartile Range of Mean of Sampling Distribution (5000 draws, each of size 100): 67.47
stdev_sample_means100 <- round(as.numeric(sd(sample_means100)),2)
cat(paste("Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 100): ", stdev_sample_means100,"\n"))## Standard Deviation of Mean of Sampling Distribution (5000 draws, each of size 100): 49.75
theoretical_stdev_sample_means100 <- round(sd(area) / sqrt(100),2)
cat(paste("Theoretical Standard Error of Mean of samples of size 100: ", theoretical_stdev_sample_means100 ,"\n"))## Theoretical Standard Error of Mean of samples of size 100: 50.55
SamplingDistributionsSummary <- data.frame(
matrix(
c(10,mean_sample_means10,100*(mean_sample_means10 - meanpop)/meanpop,
stdev_sample_means10,theoretical_stdev_sample_means10,
50,mean_sample_means50,100*(mean_sample_means50 - meanpop)/meanpop,
stdev_sample_means50,theoretical_stdev_sample_means50,
100,mean_sample_means100,100*(mean_sample_means100 - meanpop)/meanpop,
stdev_sample_means100,theoretical_stdev_sample_means100),
nrow = 3,ncol = 5, byrow = T,
dimnames = list(NULL,c("SampleSize","Mean of 5,000 Samples","Pct.Error","Actual Std Dev of sampling dist","Theoretical Std Error of the Mean")))
)
SamplingDistributionsSummary %>%
kable() %>%
kable_styling()| SampleSize | Mean.of.5.000.Samples | Pct.Error | Actual.Std.Dev.of.sampling.dist | Theoretical.Std.Error.of.the.Mean |
|---|---|---|---|---|
| 10 | 1502 | 0.13 | 160 | 160 |
| 50 | 1500 | 0.05 | 72 | 71 |
| 100 | 1500 | 0.04 | 50 | 51 |
\[SE_{ \bar { x } } = \frac { \sigma _{ pop } }{ \sqrt { N } }\]
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
price. Using this sample, what is your best point estimate of the population mean?samp1px <- sample(price, 50)
px_samp1mean <- round(mean(samp1px),2)
cat(paste("Sample mean of the price (n=50): ", px_samp1mean, "\n"))## Sample mean of the price (n=50): 177132.06
sample_means50.sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
}
#summary(sample_means50)
#sd(sample_means50)hist(sample_means50,breaks=25,col="green")px_summary_sample_means50 <- summary(sample_means50)
px_summary_sample_means50## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146694 172986 180325 180622 187772 218048
px_stdev_summary_sample_means50 <- sd(sample_means50)
px_stdev_summary_sample_means50## [1] 10974
px_min_sample_means50 <- as.numeric(px_summary_sample_means50["Min."])
px_mean_sample_means50 <- round(as.numeric(px_summary_sample_means50["Mean"]),2)
cat(paste("\n\n**Sample mean of the distribution** of 5,000 samples (n=50) of Price: ", px_mean_sample_means50, "\n\n"))##
##
## **Sample mean of the distribution** of 5,000 samples (n=50) of Price: 180621.74
px_med_sample_means50 <- as.numeric(px_summary_sample_means50["Median"])
px_max_sample_means50 <- as.numeric(px_summary_sample_means50["Max."])
px_iqr_sample_means50 <- round(as.numeric(IQR(sample_means50)),2)
cat(paste("Inter-Quartile Range of Mean Price of Sampling Distribution (5000 draws, each of size 50): ",px_iqr_sample_means50,"\n"))## Inter-Quartile Range of Mean Price of Sampling Distribution (5000 draws, each of size 50): 14785.27
px_stdev_sample_means50 <- round(as.numeric(sd(sample_means50)),2)
cat(paste("Standard Deviation of Mean Price of Sampling Distribution (5000 draws, each of size 50): ", px_stdev_sample_means50,"\n"))## Standard Deviation of Mean Price of Sampling Distribution (5000 draws, each of size 50): 10973.87
px_theoretical_stdev_sample_means50 <- round(sd(price) / sqrt(50),2)
cat(paste("Theoretical Standard Error of Mean Price of samples of size 50: ", px_theoretical_stdev_sample_means50 ,"\n"))## Theoretical Standard Error of Mean Price of samples of size 50: 11297.68
px_summarypop <- summary(price)
px_summarypop## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
px_minpop <- as.numeric(px_summarypop["Min."])
px_meanpop <- round(as.numeric(px_summarypop["Mean"]),2)
cat(paste("\n\nActual mean of the price (pop.): ", px_meanpop, "\n\n"))##
##
## Actual mean of the price (pop.): 180796.06
px_errormean <- round(100*(px_samp1mean - px_meanpop)/px_meanpop,2)
cat(paste("\n\nPercent error of the sample mean (single sample) vs. actual mean (price): ", px_errormean, "%.\n\n"))##
##
## Percent error of the sample mean (single sample) vs. actual mean (price): -2.03 %.
px_medpop <- as.numeric(px_summarypop["Median"])
px_maxpop <- as.numeric(px_summarypop["Max."])
px_iqrpop <- as.numeric(IQR(price))
cat(paste("Inter-Quartile Range of the population of Price: ",px_iqrpop,"\n"))## Inter-Quartile Range of the population of Price: 84000
px_stdevpop <- round(as.numeric(sd(price)),2)
cat(paste("Standard Deviation of the population: ",px_stdevpop,"\n"))## Standard Deviation of the population: 79886.69
\[SE_{ \bar { px } } = \frac { \sigma _{ px_{pop} } }{ \sqrt { N } } = \frac { 79886.69 }{ \sqrt { 50 } } = \frac { 79886.69 }{ 7.071 } = 11297.68 \]
sample_means150.sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 150)
sample_means150[i] <- mean(samp)
}
summary(sample_means150)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159129 176415 180741 180890 185269 208960
sd(sample_means150)## [1] 6403
par(mfrow = c(2, 1))
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits, col="green")
hist(sample_means150, breaks = 20, xlim = xlimits, col="yellow")invisible(par(new))px_summary_sample_means150 <- summary(sample_means150)
px_summary_sample_means150## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159129 176415 180741 180890 185269 208960
px_stdev_summary_sample_means150 <- sd(sample_means150)
px_stdev_summary_sample_means150## [1] 6403
px_min_sample_means150 <- as.numeric(px_summary_sample_means150["Min."])
px_mean_sample_means150 <- round(as.numeric(px_summary_sample_means150["Mean"]),2)
cat(paste("\n\n**Sample mean of the distribution** of 5,000 samples (n=150) of Price: ", px_mean_sample_means150, "\n\n"))##
##
## **Sample mean of the distribution** of 5,000 samples (n=150) of Price: 180889.56
px_med_sample_means150 <- as.numeric(px_summary_sample_means150["Median"])
px_max_sample_means150 <- as.numeric(px_summary_sample_means150["Max."])
px_iqr_sample_means150 <- round(as.numeric(IQR(sample_means150)),2)
cat(paste("Inter-Quartile Range of Mean Price of Sampling Distribution (5,000 draws, each of size 150): ",px_iqr_sample_means150,"\n"))## Inter-Quartile Range of Mean Price of Sampling Distribution (5,000 draws, each of size 150): 8854.23
px_stdev_sample_means150 <- round(as.numeric(sd(sample_means150)),2)
cat(paste("Standard Deviation of Mean Price of Sampling Distribution (5,000 draws, each of size 150): ", px_stdev_sample_means150,"\n"))## Standard Deviation of Mean Price of Sampling Distribution (5,000 draws, each of size 150): 6402.75
px_theoretical_stdev_sample_means150 <- round(sd(price) / sqrt(150),2)
cat(paste("Theoretical Standard Error of Mean Price of samples of size 150: ", px_theoretical_stdev_sample_means150 ,"\n"))## Theoretical Standard Error of Mean Price of samples of size 150: 6522.72
SamplingDistributionsSummaryPrice <- data.frame(
matrix(
c(50,px_mean_sample_means50,px_mean_sample_means50-px_meanpop,
px_stdev_sample_means50,px_theoretical_stdev_sample_means50,
150,px_mean_sample_means150,px_mean_sample_means150-px_meanpop,
px_stdev_sample_means150,px_theoretical_stdev_sample_means150),
nrow = 2,ncol = 5, byrow = T,
dimnames = list(NULL,c("SampleSize","Mean Price 5000 Samples","Error","Actual Std Dev of Sampling Dist","Theoretical Std Error of the Mean")))
)
SamplingDistributionsSummaryPrice %>%
kable() %>%
kable_styling()| SampleSize | Mean.Price.5000.Samples | Error | Actual.Std.Dev.of.Sampling.Dist | Theoretical.Std.Error.of.the.Mean |
|---|---|---|---|---|
| 50 | 180622 | -174 | 10974 | 11298 |
| 150 | 180890 | 94 | 6403 | 6523 |