download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Exercise 1

Describe this population distribution.

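The population distribution is strongly right-skewed: the middle half of the values falls between 1126 and 1743 (the first and third quartiles), the mean (1500) sits above the median (1442), and a long right tail extends to a maximum of 5642.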

set.seed(873492)
samp1 <- sample(area, 50)
mean(samp1)

## [1] 1463.92

hist(samp1)

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

The distribution of the sample is much more concentrated than the population but still maintains a right skew.

set.seed(340999)
samp2 <- sample(area, 50)
mean(samp2)

## [1] 1437.42

hist(samp2)

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

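The mean of samp2 (1437.42) is about 26.5 lower than the mean of samp1 (1463.92), and both fall below the population mean of roughly 1500. The histogram of the second sample looks more condensed and closer to normal, with only a slight right skew. A larger sample should get us closer to the true mean, so the sample of size 1000 should provide a more accurate estimate than the sample of size 100.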
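The claim that a larger sample gives a more accurate estimate can be checked with a quick simulation. The sketch below is a minimal illustration rather than part of the lab: to stay self-contained it uses a simulated right-skewed vector `pop` as a hypothetical stand-in for `area`, and `mean_abs_error` is a helper name introduced here only for illustration.

```r
# Hypothetical stand-in for `area`: a simulated right-skewed population.
# (Assumption: swap `pop` for `area` once ames.RData is loaded.)
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.32)
pop_mean <- mean(pop)

# Illustrative helper: average distance between a sample mean and the
# population mean, taken over `reps` independent samples of size `n`.
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(pop, n)) - pop_mean))
  mean(errs)
}

err50   <- mean_abs_error(50)
err100  <- mean_abs_error(100)
err1000 <- mean_abs_error(1000)

# Typical estimation error at each sample size
print(c(n50 = err50, n100 = err100, n1000 = err1000))
```

Any single sample can land close to or far from the true mean by chance, which is why the sketch averages the error over many samples instead of comparing one sample of each size.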

set.seed(222304)
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

mean(sample_means50)

## [1] 1501.321

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

There are 5000 elements in sample_means50. The sampling distribution is approximately normal and centered at 1501.321, within 1.5 of the population mean of roughly 1500. If we collected 50,000 sample means, I would expect the histogram to look smoother and even closer to a normal curve, but with diminishing returns: the center and spread should stay about the same, since 5000 sample means already give a clear picture of the sampling distribution.
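One way to check the expectation about 50,000 sample means is to build both sampling distributions and compare their center and spread. This is a rough sketch under stated assumptions: `pop` is a simulated right-skewed stand-in for `area`, and the names `means_5k`, `means_50k`, and `comparison` are introduced here for illustration.

```r
# Hypothetical stand-in for `area` (assumption, not the Ames data)
set.seed(2)
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.32)

# Sampling distribution of the mean for samples of size 50,
# built first from 5,000 and then from 50,000 sample means.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread of each simulated sampling distribution
comparison <- rbind(
  "5,000 means"  = c(center = mean(means_5k),  spread = sd(means_5k)),
  "50,000 means" = c(center = mean(means_50k), spread = sd(means_50k))
)
print(round(comparison, 1))
```

The number of sample means controls how smooth the histogram looks; the center and spread of the sampling distribution are determined by the population and the sample size of 50, not by how many times we repeat the sampling.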

set.seed(656342)

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
sample_means_small

##   [1] 1579.40 1590.74 1545.56 1580.48 1498.22 1611.86 1406.18 1496.54 1501.74
##  [10] 1522.04 1420.38 1463.20 1392.98 1478.48 1512.14 1422.00 1563.70 1458.70
##  [19] 1425.40 1617.56 1410.00 1470.76 1503.62 1624.24 1415.58 1451.92 1501.72
##  [28] 1376.64 1606.34 1421.90 1459.96 1511.74 1588.94 1555.14 1518.58 1572.90
##  [37] 1529.76 1394.46 1553.90 1544.36 1527.34 1427.34 1470.34 1538.92 1451.66
##  [46] 1601.10 1624.84 1613.98 1481.06 1532.06 1478.26 1579.02 1405.72 1597.76
##  [55] 1556.74 1374.32 1487.30 1507.18 1563.84 1516.04 1464.18 1534.76 1435.80
##  [64] 1432.66 1429.98 1494.34 1708.20 1563.68 1532.32 1571.72 1393.48 1487.78
##  [73] 1625.62 1556.60 1393.26 1383.62 1424.12 1528.60 1592.90 1457.02 1458.94
##  [82] 1615.48 1537.94 1579.00 1457.90 1493.20 1487.46 1578.40 1573.94 1560.32
##  [91] 1540.22 1351.98 1535.20 1462.64 1500.70 1571.98 1509.90 1536.30 1454.34
## [100] 1579.34

Exercise 5

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

There are 100 elements in sample_means_small. Each element is the mean of one simple random sample of size 50 from our population.
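The same loop can also be written with `replicate()`, which makes it easier to see that each element is one sample mean. The sketch below assumes a simulated vector `pop` in place of `area`; `loop_means` and `rep_means` are illustrative names that do not appear in the lab.

```r
# Hypothetical stand-in for `area` (assumption, not the Ames data)
set.seed(3)
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.32)

# Loop version: start from a vector of 100 zeros, then overwrite each slot
set.seed(42)
loop_means <- numeric(100)
for (i in seq_along(loop_means)) {
  loop_means[i] <- mean(sample(pop, 50))
}

# replicate() version: same seed, so the same sequence of random draws
set.seed(42)
rep_means <- replicate(100, mean(sample(pop, 50)))

length(loop_means)                 # number of stored sample means
all.equal(loop_means, rep_means)   # do the two approaches agree?
```

Starting from zeros (as the exercise asks) or from `NA` gives the same result here, because every slot is overwritten; `NA` has the advantage of making a skipped iteration easy to spot.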

set.seed(987549)
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

summary(sample_means10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1059    1390    1495    1503    1604    2261

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1281    1453    1498    1501    1547    1827

summary(sample_means100)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1336    1465    1498    1498    1530    1675

Introduction to Inference

Els

3/1/2021

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

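As the sample size increases, the center of the sampling distribution stays close to the true population mean (the means are 1503, 1501, and 1498 for samples of size 10, 50, and 100, against a population mean of about 1500), while the spread shrinks significantly: the range of the sample means narrows from 1059 to 2261 at size 10, to 1281 to 1827 at size 50, to 1336 to 1675 at size 100.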
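The shrinking spread can be compared with the standard error of the mean. For a simple random sample of size n drawn without replacement from a population of size N (which is what `sample()` does by default), the standard error is (s / sqrt(n)) * sqrt(1 - n / N), where s is the population standard deviation as computed by `sd()`. The sketch below checks this on a simulated stand-in: `pop`, `sizes`, and `results` are assumptions introduced for illustration, not objects from the lab.

```r
# Hypothetical stand-in for `area` (assumption, not the Ames data)
set.seed(4)
pop <- rlnorm(3000, meanlog = 7.26, sdlog = 0.32)
N <- length(pop)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5,000 sample means and compare their
# standard deviation with the theoretical standard error.
results <- t(sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop, n)))
  c(n = n,
    center = mean(means),
    simulated_sd = sd(means),
    theoretical_se = sd(pop) / sqrt(n) * sqrt(1 - n / N))
}))
print(round(results, 1))
```

The center column should stay near `mean(pop)` at every sample size, while the two spread columns should track each other and fall roughly in proportion to 1 / sqrt(n).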