The Data

setwd("C:/Users/Robert/Documents/R/win-library/3.2/IS606/labs/Lab4a")
load("more/ames.RData")

Exercise 1

Describe this population distribution.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

This distribution is unimodal and roughly bell-shaped but right-skewed: the mean (1500) exceeds the median (1442), and a long upper tail extends to the maximum of 5642. The range is 334 to 5642.
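
The direction of a skew can be checked numerically by comparing the mean with the median and by computing a skewness coefficient. The sketch below is illustrative only: `area_standin` is a simulated right-skewed vector (not the Ames data, which requires ames.RData), and `skewness()` is a hypothetical helper, not a base R function.

```r
# Simulated stand-in for `area`; the lognormal parameters are arbitrary assumptions
set.seed(1)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: moment-based skewness coefficient
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# A mean above the median points to a right skew
mean(area_standin) > median(area_standin)

# A positive coefficient also points to a right skew
skewness(area_standin)

# The same comparison using the summary() values reported for area
1500 - 1442
```

With the real data, the same two checks could be run directly on `area`.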

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1161    1485    1497    1694    2687

hist(samp1)

Depending on the random sample drawn, the sample often lacks the extreme values responsible for the right skew; even so, these random samples are usually fairly reflective of the population.

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

mean(samp1)

## [1] 1497.24

samp2 <- sample(area, 50)
mean(samp2)

## [1] 1489.22

The means of samp1 and samp2 are usually fairly close (here 1497.24 and 1489.22), and one might expect each to fall within a confidence interval computed from the other.
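
One way to make the confidence-interval remark concrete is to build a 95% interval from one sample and see whether the other sample's mean lands inside it. This is a minimal sketch on a simulated stand-in population (`area_standin`, `samp_a`, and `samp_b` are illustrative names, not objects from the lab); `t.test()` is base R and reports a 95% interval by default.

```r
# Simulated stand-in for `area`; parameters are arbitrary assumptions
set.seed(2)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)

samp_a <- sample(area_standin, 50)
samp_b <- sample(area_standin, 50)

# 95% t-based confidence interval for the population mean, from samp_a
ci_a <- t.test(samp_a)$conf.int
ci_a

# Does the mean of the second sample fall inside that interval?
mean(samp_b) >= ci_a[1] && mean(samp_b) <= ci_a[2]
```

The last check will not come back TRUE for every pair of samples, since both means vary from draw to draw.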

samp3 <- sample(area, 100)
samp4 <- sample(area, 1000)
mean(samp3)

## [1] 1491.37

mean(samp4)

## [1] 1516.122

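On average, the mean of samp4 (n = 1000) should be closest to the population mean of area, since larger samples give less variable estimates. In this particular draw, however, samp4 (1516.122) landed farther from the population mean of about 1500 than samp3 (1491.37) did.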
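
That expectation holds on average rather than in every single draw. The sketch below estimates how often a sample of 1000 gives a mean closer to the population mean than a sample of 100 does. It uses a simulated stand-in population (an assumption, not the Ames data), and `replicate()` simply repeats the expression, much like the for loop used in this lab.

```r
# Simulated stand-in for `area`; parameters are arbitrary assumptions
set.seed(3)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)
pop_mean <- mean(area_standin)

# For each of 2000 paired draws, record whether n = 1000 beats n = 100
closer_1000 <- replicate(2000, {
  err_100  <- abs(mean(sample(area_standin, 100))  - pop_mean)
  err_1000 <- abs(mean(sample(area_standin, 1000)) - pop_mean)
  err_1000 < err_100
})

# Share of draws in which the larger sample landed closer
mean(closer_1000)
```

A share well above one half but below one would be consistent with the larger sample usually, though not always, giving the better estimate.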

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50, breaks = 25)

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

length(sample_means50)

## [1] 5000

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1234    1447    1496    1496    1542    1786

There are 5000 sample means in sample_means50. The sampling distribution is approximately normal and centered at about 1496 (both mean and median), which closely reflects the population mean of about 1500. Collecting 50,000 sample means instead would not appreciably change the center or spread; the histogram would only look smoother, because the spread depends on the sample size (50), not on the number of samples drawn.
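
The effect of drawing more sample means can be checked by simulation. The following is a sketch under stated assumptions: `area_standin` is a simulated stand-in for `area`, and `means_5k` and `means_50k` are illustrative names. It compares the center and spread of 5000 versus 50,000 sample means, each from samples of size 50.

```r
# Simulated stand-in for `area`; parameters are arbitrary assumptions
set.seed(4)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)

# replicate() repeats the expression, like the for loop in the lab
means_5k  <- replicate(5000,  mean(sample(area_standin, 50)))
means_50k <- replicate(50000, mean(sample(area_standin, 50)))

# Centers of the two simulated sampling distributions
c(mean(means_5k), mean(means_50k))

# Spreads of the two simulated sampling distributions
c(sd(means_5k), sd(means_50k))
```

If the two centers and the two spreads come out nearly equal, that supports the point that extra replications refine the picture without changing the distribution itself.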

Exercise 5

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   }
sample_means_small

##   [1] 1520.14 1469.30 1359.06 1514.00 1576.30 1379.00 1608.50 1336.38
##   [9] 1520.14 1374.98 1501.24 1567.62 1507.38 1484.62 1557.72 1563.14
##  [17] 1426.64 1511.72 1366.88 1518.10 1504.08 1501.76 1560.96 1452.06
##  [25] 1578.04 1685.00 1621.50 1594.34 1606.70 1542.18 1438.10 1474.22
##  [33] 1497.20 1409.04 1390.98 1485.92 1582.86 1506.12 1570.82 1462.44
##  [41] 1539.06 1576.74 1617.06 1668.44 1613.58 1356.18 1508.00 1582.34
##  [49] 1479.62 1452.72 1397.98 1478.02 1525.50 1427.42 1465.66 1450.28
##  [57] 1481.36 1428.94 1429.54 1393.72 1466.78 1477.68 1641.06 1492.20
##  [65] 1394.20 1646.62 1512.84 1528.04 1503.88 1413.56 1513.32 1413.30
##  [73] 1466.20 1523.30 1566.52 1316.30 1488.18 1484.78 1476.56 1442.82
##  [81] 1649.42 1445.20 1555.32 1518.60 1594.70 1480.96 1653.26 1598.68
##  [89] 1471.64 1550.92 1544.72 1606.24 1437.02 1375.84 1502.72 1458.66
##  [97] 1599.72 1501.50 1580.48 1646.06

length(sample_means_small)

## [1] 100

There are 100 elements in sample_means_small. Each element is the mean of one random sample of 50 values drawn from area.
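
The same loop can be written more compactly with base R's `replicate()`. The sketch below, run on a simulated stand-in for `area` (an assumption, not the Ames data), preallocates with zeros as the exercise text suggests and checks that the loop and `replicate()` give the same values when started from the same seed. The names `loop_means` and `rep_means` are illustrative.

```r
# Simulated stand-in for `area`; parameters are arbitrary assumptions
set.seed(5)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)

# Loop version, preallocated with 100 zeros
set.seed(42)
loop_means <- numeric(100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(area_standin, 50))
}

# replicate() version, restarted from the same seed
set.seed(42)
rep_means <- replicate(100, mean(sample(area_standin, 50)))

# Both vectors hold one sample mean per iteration
length(rep_means)
all.equal(loop_means, rep_means)
```

Whether the vector starts as zeros or as NA makes no difference to the result, since every element is overwritten; NA has the advantage that a skipped iteration would be easy to spot.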

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

The center stays essentially the same, near the population mean, as the sample size increases. The spread becomes narrower, so the histogram is taller around the center.
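
The narrowing can be compared against the usual standard error formula. Because `sample()` draws without replacement by default, the sketch below includes the finite population correction: sigma / sqrt(n) * sqrt((N - n) / (N - 1)). It is an illustration on a simulated stand-in population, not the Ames data, and all object names are hypothetical.

```r
# Simulated stand-in for `area`; parameters are arbitrary assumptions
set.seed(6)
area_standin <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)
N <- length(area_standin)

# Population standard deviation (divides by N, unlike sd())
sigma <- sqrt(mean((area_standin - mean(area_standin))^2))

sizes <- c(10, 50, 100)

# Empirical spread of 5000 sample means at each sample size
empirical <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_standin, n))))
})

# Theoretical standard error with finite population correction
theoretical <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

round(rbind(sizes, empirical, theoretical), 1)
```

The empirical and theoretical rows should track each other closely, and both should shrink as the sample size grows.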

On Your Own

1

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

priceSample <- sample(price, 50)
mean(priceSample)

## [1] 169812.5

2

Since you have access to the population, simulate the sampling distribution for x̄_price (the sample mean of price) by taking 5000 samples of size 50 from the population and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

#reset par settings
par(mfrow=c(1,1))
hist(sample_means50, breaks = 20)

#Sample mean estimate
mean(sample_means50)

## [1] 180756.5

#population mean
mean(price)

## [1] 180796.1

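With samples of size 50, the sampling distribution should be approximately normal, as the histogram can confirm. It is centered near 180,756.5 (the mean of sample_means50), which serves as the guess for the population mean home price. The calculated population mean is 180,796.1.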

3

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

#stack the two histograms vertically
par(mfrow=c(2,1))
#sample_means50 has the wider spread, so its range fits both histograms
xlimits <- range(sample_means50)
hist(sample_means50,breaks = 30, xlim = xlimits)
hist(sample_means150,breaks = 20, xlim = xlimits)

mean(sample_means50)

## [1] 180756.5

#Expected mean home price
mean(sample_means150)

## [1] 180714.2

Based on sample_means150, the mean sale price of homes in Ames would be estimated at about 180,714. Both vectors still hold 5000 means, but with each sample containing 150 homes instead of 50, the sampling distribution is narrower and taller around the expected mean.

4

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution from the larger sample size (n = 150) has the smaller spread. A small spread is preferable when we want estimates that are more often close to the true value.

SELLERS-Lab4a

Robert Sellers, robertwsellers@gmail.com

3/13/2016