R Lab 5: Introduction to Inference - Sampling Distributions

This lab investigates whether statistics computed from a random sample can serve as point estimates of population parameters. It builds the sampling distribution of an estimate from repeated random samples in order to learn about the properties of that distribution.

#download data file and load
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

Slicing out two variables from the dataset to focus on: the above-ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice).

#carving out two variables from dataset
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

#viewing summary statistics for the two variables
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area, main = "Population Living Area", xlab = "sqft")

Exercise 1

Describing the population distribution

#taking a random sample of 50 homes to estimate the mean living area
samp1 <- sample(area, 50)
samp1

##  [1] 1494 1868 2030 1478 1078 1840 1142 1602 1471 1025 1484 1572 1742 1680 1522
## [16] 2052 2552 1728  960 2403 1701 1489 1178 1086  958 1040 1094 1951 1040 1614
## [31]  816 1524 1380 1510 1640 1200 1116 2531 1700 1426 1717 1472 1982  960 1950
## [46] 2061 2322 1196 1188 1149

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

#viewing summary statistics for the area sample, samp1
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1156    1502    1534    1738    2552

mean(samp1)

## [1] 1534.28

hist(samp1, main = "Sample 1 of Size 50 Living Areas", xlab = "sqft")

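Population μ = 1500
Sample 1 x̄ = 1534

The distributions of both the population and the sample are right skewed. The sample covers a narrower range (816 to 2552) than the population (334 to 5642), and its mean is about 34 sq ft above the population mean.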
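
Because sample() draws at random, the values change every time the document is knit, so numbers quoted in the write-up can drift from the printed output. A minimal sketch of pinning the draw with set.seed() follows. Here area_sim is a hypothetical synthetic stand-in for area (it is not the Ames data), and the seed values are arbitrary.

```r
# Hypothetical stand-in for `area`: 3,000 right-skewed synthetic values.
set.seed(1)
area_sim <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Setting the same seed immediately before each draw repeats the draw.
set.seed(42)
samp_a <- sample(area_sim, 50)

set.seed(42)
samp_b <- sample(area_sim, 50)

identical(samp_a, samp_b)  # TRUE: same seed, same sample
mean(samp_a)
```

Calling set.seed() once near the top of the lab would make every later sample() call repeatable from knit to knit.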

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

#taking three more samples, two will have larger size
samp2 <- sample(area, 50)
mean(samp2)

## [1] 1498.7

hist(samp2, main = "Sample 2 of Size 50 Living Areas", xlab = "sqft")

samp3 <- sample(area, 100)
mean(samp3)

## [1] 1501.84

hist(samp3, main = "Sample Size 100 Living Areas", xlab = "sqft")

samp4 <- sample(area, 1000)
mean(samp4)

## [1] 1483.298

hist(samp4, main = "Sample Size 1000 Living Areas", xlab = "sqft")

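Population μ = 1500
Sample 1 x̄ (size 50) = 1534
Sample 2 x̄ (size 50) = 1499
Sample 3 x̄ (size 100) = 1502
Sample 4 x̄ (size 1000) = 1483

In this run the closest mean is sample 2 (size 50), followed by sample 3 (size 100); sample 4 (size 1,000) is off by about 17 sq ft and sample 1 (size 50) by about 34 sq ft. Any single sample can land near or far from μ by chance, but on average a larger sample gives a more accurate estimate, so the sample of size 1,000 would be expected to be the most reliable.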
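
Which sample size tends to be more accurate can be checked by repeating each draw many times and averaging how far the sample mean lands from the population mean. The sketch below does this on area_sim, a hypothetical synthetic population standing in for area; the helper avg_error and the choice of 2,000 repetitions are assumptions made for illustration.

```r
# Hypothetical stand-in for `area`: 3,000 right-skewed synthetic values.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
mu <- mean(area_sim)

# Average absolute distance between a sample mean and the population mean,
# taken over `reps` independent samples of size n.
avg_error <- function(n, reps = 2000) {
  mean(replicate(reps, abs(mean(sample(area_sim, n)) - mu)))
}

errors <- sapply(c(50, 100, 1000), avg_error)
names(errors) <- c("n=50", "n=100", "n=1000")
errors
```

A single sample of size 50 can still beat a single sample of size 1,000 by luck; the comparison only holds on average.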

#generating 5,000 samples of size 50 living areas
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50, breaks = 25, main = "Sampling Distribution of Living Area\n5,000 Samples Size 50", xlab = "sqft")

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

#counting number of elements
length(sample_means50)

## [1] 5000

There are 5,000 elements in sample_means50. Their distribution looks approximately normal: it is far more symmetric than the earlier histograms and is centered near the population mean of about 1,500 sq ft. If we collected 50,000 sample means instead, I would expect the histogram to look smoother, but its center and spread should stay essentially the same, because each mean would still come from a sample of size 50.
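
The effect of collecting more sample means can be tried directly. The sketch below uses replicate() as a compact alternative to the for loop and compares 5,000 with 50,000 sample means; area_sim is a hypothetical synthetic population standing in for area.

```r
# Hypothetical stand-in for `area`: 3,000 right-skewed synthetic values.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# replicate() evaluates the expression repeatedly and collects the results.
means_5k  <- replicate(5000,  mean(sample(area_sim, 50)))
means_50k <- replicate(50000, mean(sample(area_sim, 50)))

# Center and spread of the two simulated sampling distributions.
c(center_5k = mean(means_5k), center_50k = mean(means_50k))
c(spread_5k = sd(means_5k),   spread_50k = sd(means_50k))
```

More repetitions sharpen the picture of the sampling distribution without changing the distribution itself; only the sample size (50 here) controls its spread.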

#printing the loop index for the living area sampling distribution loop
#sample_means50 <- rep(NA, 5000)

#for(i in 1:5000){
#   samp <- sample(area, 50)
#   sample_means50[i] <- mean(samp)
   #print(i)
#   }
#REMOVING THIS PORTION OF EXERCISE FOR KNITTING PURPOSES.

Exercise 5

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

#practicing for loop
sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp_small <- sample(area, 50)
   sample_means_small[i] <- mean(samp_small)
}
sample_means_small

##   [1] 1276.10 1538.04 1421.06 1463.86 1615.48 1480.44 1499.24 1481.62 1637.76
##  [10] 1537.12 1629.62 1644.40 1411.36 1491.32 1573.98 1474.60 1446.72 1518.82
##  [19] 1318.94 1542.92 1505.74 1510.84 1584.42 1578.90 1443.80 1532.92 1486.48
##  [28] 1487.10 1496.54 1467.48 1533.10 1598.98 1652.98 1497.66 1384.64 1427.10
##  [37] 1530.06 1486.16 1382.40 1483.42 1429.08 1615.40 1518.22 1534.02 1552.74
##  [46] 1598.48 1452.54 1557.26 1596.18 1586.70 1609.40 1397.84 1409.56 1456.08
##  [55] 1465.88 1647.12 1463.76 1492.86 1405.10 1559.48 1518.92 1432.56 1487.64
##  [64] 1481.26 1355.58 1401.64 1485.84 1393.24 1471.22 1590.14 1539.58 1566.00
##  [73] 1626.80 1387.56 1567.76 1514.26 1560.38 1555.48 1500.34 1475.02 1497.48
##  [82] 1425.66 1609.38 1496.98 1407.84 1405.16 1503.46 1570.54 1433.68 1413.42
##  [91] 1553.50 1451.14 1509.08 1450.52 1439.52 1495.60 1694.04 1519.82 1517.56
## [100] 1512.22

There are 100 elements in sample_means_small, and each one is the mean living area of a single random sample of 50 homes.

To see the effect of sample size, build two more sampling distributions: one from samples of size 10 and one from samples of size 100.

#creating two additional sampling distributions
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

#plotting all three sampling distributions
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

When the sample size is larger, the center stays in about the same place, near the population mean. The spread shrinks: the sample means cluster more tightly around that center.
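
The shrinking spread can be compared with the usual standard error approximation, sigma / sqrt(n). The sketch below does this for sample sizes 10, 50, and 100 on area_sim, a hypothetical synthetic population standing in for area. The formula ignores the small correction for sampling without replacement from a finite population, so the match is approximate.

```r
# Hypothetical stand-in for `area`: 3,000 right-skewed synthetic values.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
sigma <- sd(area_sim)

sizes <- c(10, 50, 100)

# Standard deviation of 5,000 simulated sample means at each sample size.
empirical <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_sim, n))))
})

# Standard error approximation for the mean of n observations.
theoretical <- sigma / sqrt(sizes)

comparison <- data.frame(n = sizes,
                         empirical = round(empirical, 1),
                         theoretical = round(theoretical, 1))
comparison
```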

On my own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
2. Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
3. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

#generating 5,000 sample means of home price for sample sizes 50 and 150
sample_means_price50 <- rep(NA, 5000)
sample_means_price150 <- rep(NA, 5000)

for(i in 1:5000){
  samp_price <- sample(price, 50)
  sample_means_price50[i] <- mean(samp_price)
  samp_price <- sample(price, 150)
  sample_means_price150[i] <- mean(samp_price)
}

#plotting both price sampling distributions
par(mfrow = c(2, 1))

xlimits2 <- range(sample_means_price50)

hist(sample_means_price50, breaks = 20, xlim = xlimits2)
hist(sample_means_price150, breaks = 20, xlim = xlimits2)

mean(sample_means_price50)

## [1] 180927.2

mean(sample_means_price150)

## [1] 180764.4

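The sampling distribution for samples of size 50 is approximately normal with a slight right skew. The sampling distribution for samples of size 150 is normally shaped with no visible skew, and it is more tightly concentrated around its mean, with a smaller range than the size-50 distribution.

The best estimate of the population mean home price is the mean of sample_means_price150, about $180,764. Both sampling distributions are centered near the same value, but the size-150 distribution has the smaller spread, so its sample means fall close to the true mean more often. For estimates that are more often close to the true value, the smaller spread is preferred.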