CUNY DATA606

load("C:/Users/ZacharyHerold/Documents/DATA606/Lab4a/more/ames.RData")

head(ames[1:6,1:7])

##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## 1     1 526301100          20        RL          141    31770   Pave
## 2     2 526350040          20        RH           80    11622   Pave
## 3     3 526351010          20        RL           81    14267   Pave
## 4     4 526353030          20        RL           93    11160   Pave
## 5     5 527105010          60        RL           74    13830   Pave
## 6     6 527105030          60        RL           78     9978   Pave

area <- ames$Gr.Liv.Area
price <- ames$SalePrice
hist(area)

Exercise 1

Describe this population distribution.

The distribution is unimodal and right-skewed (asymmetric), and it appears roughly lognormal.

mean(area)

## [1] 1499.69

The mean of the population is nearly 1500.
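One quick, informal way to check the right skew and the lognormal impression is to compare the mean with the median and to look at the data on a log scale. The sketch below uses a simulated stand-in population (pop_sim, drawn with rlnorm; its size and parameters are assumptions, not estimates from the Ames data) so that it runs without the lab's data file. Substituting area for pop_sim would apply the same check to the real population.

```r
# Hypothetical stand-in for `area`: a lognormal population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the mean sits above the median.
c(mean = mean(pop_sim), median = median(pop_sim))

# If the data are roughly lognormal, log(x) should look roughly normal:
# a symmetric histogram and Q-Q points that follow the reference line.
par(mfrow = c(1, 2))
hist(log(pop_sim), main = "log scale")
qqnorm(log(pop_sim))
qqline(log(pop_sim))
par(mfrow = c(1, 1))
```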

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)
hist(samp1, breaks = 10)

At first glance it is unclear whether the right skew of the population is retained in the sample.

qqnorm(samp1)
qqline(samp1)

The Q-Q plot makes the right skew of the sample data apparent, with a few extreme positive outliers.

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
print(c(mean(samp1), mean(samp2)))

## [1] 1509.86 1536.24

samp3 <- sample(area, 100)
samp4 <- sample(area, 1000)
print(c(mean(samp3), mean(samp4)))

## [1] 1475.920 1463.368

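The sample of size 1000 should give the most accurate estimate on average, because the variability of a sample mean shrinks as the sample size grows. In this particular run, however, its mean (1463.4) happened to fall farther from the population mean (1499.7) than the mean of samp1 (1509.9). A larger sample makes a large miss less likely, not impossible.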
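The reasoning behind preferring the larger sample can be made concrete with the standard error of the mean, sigma / sqrt(n), which describes how far a sample mean typically falls from the population mean. A minimal sketch, using a simulated stand-in population (pop_sim is hypothetical, not the Ames data) and ignoring the small finite-population correction that applies when sampling without replacement:

```r
# Hypothetical stand-in population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

n <- c(50, 100, 1000)

# Population SD (sd() divides by N - 1, a negligible difference at this size).
sigma <- sd(pop_sim)

# Standard error of the mean for each sample size.
se <- sigma / sqrt(n)
data.frame(n = n, standard_error = se)
```

A smaller standard error means the estimate is closer on average, not in every draw, which is why a single sample of 1000 can still land farther from the population mean than a lucky sample of 50.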

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

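There are 5000 elements in sample_means50, one for each sample mean calculated; the 50 in the name refers to the size of each sample. The sampling distribution is unimodal and roughly symmetric, with its center close to 1500. Collecting 50,000 sample means instead would not noticeably change the center or the spread, because those depend on the sample size rather than on the number of samples; the histogram would simply look smoother.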
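The difference between the number of sample means collected and the size of each sample can be checked by simulation. The sketch below uses a simulated stand-in population (pop_sim, hypothetical) and a helper, draw_means, introduced here only for illustration. It compares 5,000 with 50,000 sample means, each computed from a sample of size 50.

```r
# Hypothetical stand-in population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Illustrative helper: `reps` sample means, each from a sample of size `n`.
draw_means <- function(pop, n, reps) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- draw_means(pop_sim, n = 50, reps = 5000)
means_50k <- draw_means(pop_sim, n = 50, reps = 50000)

# Center and spread should be nearly the same for both runs.
rbind(
  reps_5000  = c(mean = mean(means_5k),  sd = sd(means_5k)),
  reps_50000 = c(mean = mean(means_50k), sd = sd(means_50k))
)
```

More replicates give a smoother picture of the same sampling distribution; only a larger n would be expected to narrow it.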

Exercise 5

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(0, 100)

for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}

sample_means_small

##   [1] 1587.26 1587.02 1602.04 1499.38 1558.90 1453.22 1622.08 1617.46
##   [9] 1521.38 1485.38 1423.34 1444.90 1395.40 1518.78 1538.62 1561.46
##  [17] 1526.48 1577.38 1546.10 1462.06 1447.32 1537.62 1503.48 1531.20
##  [25] 1376.56 1632.46 1471.44 1397.10 1326.22 1544.74 1602.10 1527.80
##  [33] 1364.34 1450.98 1484.02 1345.26 1468.68 1422.98 1468.88 1477.78
##  [41] 1459.06 1484.28 1250.84 1495.26 1485.56 1644.60 1532.78 1494.52
##  [49] 1437.24 1502.88 1617.00 1522.92 1506.36 1429.98 1515.78 1554.42
##  [57] 1646.38 1647.08 1551.14 1527.58 1413.10 1526.78 1568.04 1497.28
##  [65] 1666.34 1369.52 1493.18 1555.76 1525.96 1563.78 1455.34 1396.34
##  [73] 1474.36 1530.74 1469.58 1512.68 1548.46 1467.20 1534.30 1466.70
##  [81] 1490.26 1430.02 1447.36 1422.70 1508.36 1533.00 1544.74 1497.08
##  [89] 1506.24 1537.96 1518.74 1557.18 1400.66 1448.06 1547.90 1610.06
##  [97] 1620.96 1534.26 1571.66 1446.08

length(sample_means_small)

## [1] 100

mean(sample_means_small)

## [1] 1504.24

sd(sample_means_small)

## [1] 75.15417

There are 100 elements in this vector. Each element is the mean of one random sample of 50 homes drawn from area, that is, one point estimate of the population mean living area.
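For comparison, the same kind of simulation can be written without an explicit loop using replicate(). This is a sketch on a simulated stand-in population (pop_sim is hypothetical, not the Ames data). With the same seed, both versions draw the same samples in the same order, so they should produce the same means.

```r
# Hypothetical stand-in population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Loop version, mirroring the lab's structure.
set.seed(1)
loop_means <- rep(0, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(pop_sim, 50))
}

# replicate() version: evaluates the expression 100 times and collects the results.
set.seed(1)
repl_means <- replicate(100, mean(sample(pop_sim, 50)))

# Compare the two approaches and confirm the number of stored means.
all.equal(loop_means, repl_means)
length(repl_means)
```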

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

sample_means_2 <- rep(0, 10000)

for(i in 1:10000){
  samp <- sample(area, 50)
  sample_means_2[i] <- mean(samp)
}

mean(sample_means_2)

## [1] 1499.416

sd(sample_means_2)

## [1] 69.95256

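With a larger sample size, the center of the sampling distribution stays at the population mean and the spread decreases. Note that the simulation above increases the number of samples from 100 to 10,000 while the sample size stays at 50. Its mean (1499.4) is closer to the population mean of 1499.7, and its SD (69.95) is slightly smaller than the 75.15 from the 100-sample run, but that difference reflects simulation noise rather than the effect of a larger sample size.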
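To isolate the effect of sample size, the number of samples can be held fixed while only the size of each sample changes. A minimal sketch, assuming a simulated stand-in population (pop_sim, hypothetical) and sample sizes of 10, 50, and 100 chosen purely for illustration:

```r
# Hypothetical stand-in population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and summarise them.
results <- sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop_sim, n)))
  c(n = n, center = mean(means), spread = sd(means))
})

# One row per sample size: the center barely moves, the spread shrinks.
t(results)
```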

On your own

(1)

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

Here is a summary of the actual price data, with mean of 180,796.

summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000

samp2 <- sample(price, 50)
summary(samp2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   87500  139350  170500  173153  188750  305900

Setting aside the known population mean, the best point estimate available from this sample is its mean, 173,153.

(2)

Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?

par(mfrow = c(1, 1))

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  145688  173082  180701  181001  188414  223803

Based on this sampling distribution, the estimated mean home price is 181,001, very close to the population mean of 180,796.

hist(sample_means50)

qqnorm(sample_means50)
qqline(sample_means50)

The histogram is unimodal and approximately symmetric. The upturned ends of the Q-Q plot indicate a mild right skew carried over from the population. By the central limit theorem this departure from normality should shrink, not grow, as the sample size increases.
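A numeric complement to the Q-Q plot is the sample skewness, which is 0 for a symmetric distribution and positive for a right-skewed one. The sketch below defines a small skewness helper (introduced here; it is not part of base R) and applies it to a simulated stand-in population (pop_sim, hypothetical, not the Ames prices) and to a sampling distribution of means built from it.

```r
# Hypothetical right-skewed stand-in population (parameters are assumptions).
set.seed(606)
pop_sim <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: mean cubed deviation over the SD cubed.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Sampling distribution of the mean for samples of size 50.
means_50 <- replicate(5000, mean(sample(pop_sim, 50)))

c(population = skewness(pop_sim), means_n50 = skewness(means_50))
```

Averaging pulls the skewness toward 0, which is why a sampling distribution can look far more symmetric than its population even if a trace of right skew remains in the tails.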

(3)

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  159568  176396  180698  180782  185077  205388

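The mean of this sampling distribution, 180,782, is the better point estimate of the mean sale price. It is closer to the population mean of 180,796 than the 181,001 obtained with samples of size 50.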

par(mfrow = c(2, 1))
hist(sample_means50, breaks = 25, xlim = c(140000, 230000))
hist(sample_means150, breaks = 25, xlim = c(140000, 230000))

Plotted on a common x-axis, the histograms show that the sampling distribution for samples of size 150 is more tightly concentrated around the mean.

qqnorm(sample_means150)
qqline(sample_means150)

The Q-Q plot shows the same upturned ends, but they are no more pronounced than with the sample size of 50.

(4)

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

sd(sample_means50)

## [1] 11182.19

sd(sample_means150)

## [1] 6360.148

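The SD is much lower for the sampling distribution based on samples of size 150 (about 6,360 versus 11,182). If we want estimates that are more often close to the true value, we prefer the distribution with the small spread.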
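The two standard deviations can also be compared with what theory predicts. Since the standard error is proportional to 1 / sqrt(n), tripling the sample size from 50 to 150 should shrink the spread by a factor of about sqrt(3). A quick check using the values reported above:

```r
sd_50  <- 11182.19   # sd(sample_means50), copied from the output above
sd_150 <- 6360.148   # sd(sample_means150), copied from the output above

# Ratio actually observed in the simulation.
observed_ratio <- sd_50 / sd_150

# Ratio predicted by the 1 / sqrt(n) rule.
expected_ratio <- sqrt(150 / 50)

c(observed = observed_ratio, expected = expected_ratio)
```

The observed ratio of roughly 1.76 is close to sqrt(3), about 1.73. The small gap is consistent with simulation noise and with sampling without replacement from a finite population.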

CUNY DATA606_Lab4

Zachary Herold

October 22, 2018