Lab 4a

load("lab4a//more//ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
set.seed(7)

Exercises

Exercise 1

hist(area)

hist(price)

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

Price and area are both right-skewed: each distribution has a heavy tail on the right, and both look roughly lognormal. For area, the mean (1500) is greater than the median (1442), which supports that assessment.
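The lognormal impression can also be checked numerically. The sketch below uses simulated data rather than the Ames file; the vector `sim_area`, its `rlnorm` parameters, and the helper `skewness` are all made up for illustration. The idea is that a right-skewed variable has a mean above its median, and if it is close to lognormal, its logarithm should be roughly symmetric.

```r
# Simulated stand-in for a right-skewed variable; NOT the Ames data.
set.seed(1)
sim_area <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Simple moment-based skewness (hypothetical helper; base R has no
# built-in skewness function).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Right skew: mean above the median, positive skewness.
mean(sim_area) > median(sim_area)
skewness(sim_area)

# If the data are roughly lognormal, the log scale should be near-symmetric.
skewness(log(sim_area))
```

The same three lines could be run on `area` and `price` once the data are loaded; a skewness near zero on the log scale would be consistent with the lognormal reading, though it would not prove it.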

Exercise 2

# Named samp1 so the variable does not share a name with the sample() function.
samp1 <- sample(area, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     864    1217    1490    1640    1986    3395

hist(samp1)

This sample has more weight at the lower end of the distribution than the population. Its spread is similar, but its range is narrower (864 to 3395, against 334 to 5642), as would be expected from a sample of only 50 observations. Its mean (1640) and median (1490) are both higher than the population's (1500 and 1442).

Exercise 3

sample2 <- sample(area, 50)

The larger the sample, the lower the standard error of the sample mean, so a larger sample gives a more accurate estimate of the population mean.
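To make the standard error claim concrete, here is a small sketch on simulated data (the population `pop` and the helper `se_mean` are assumptions for illustration, not part of the lab). It estimates the standard error of the mean as s / sqrt(n) from one small and one large sample.

```r
# Simulated right-skewed "population"; NOT the Ames data.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Estimated standard error of the sample mean: s / sqrt(n).
se_mean <- function(x) sd(x) / sqrt(length(x))

se_50   <- se_mean(sample(pop, 50))
se_1000 <- se_mean(sample(pop, 1000))

# The larger sample should give the smaller standard error.
c(n50 = se_50, n1000 = se_1000)
```

Because the standard error shrinks with the square root of n, going from 50 to 1000 observations should cut it by a factor of roughly sqrt(20), or about 4.5.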

Exercise 4

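# Preallocate a vector to hold 5000 sample means.
sample_means50 <- rep(NA, 5000)

# Draw 5000 samples of size 50 and store the mean of each.
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}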

hist(sample_means50)

length(sample_means50)

## [1] 5000

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1294    1452    1497    1500    1544    1779

qqnorm(sample_means50)
qqline(sample_means50)  # reference line for judging normality

The sampling distribution is centered at the population mean (1500 in both summaries). It is approximately normal, with a median (1497) near its mean. The points on the Q-Q plot fall close to a straight line.

Exercise 5

sample_means_small  <- rep(0,100)
for (i in 1:100){
  sample_means_small[i] <- mean(sample(area, 50))
}
sample_means_small

##   [1] 1469.80 1428.96 1573.74 1332.64 1495.26 1599.34 1487.60 1483.08
##   [9] 1403.80 1472.68 1504.90 1466.28 1464.10 1498.54 1547.94 1574.12
##  [17] 1501.64 1477.86 1562.46 1389.88 1563.08 1551.70 1480.44 1406.98
##  [25] 1467.16 1430.96 1510.86 1485.22 1490.12 1627.94 1477.12 1584.70
##  [33] 1563.84 1527.42 1577.06 1369.46 1392.10 1473.62 1464.98 1517.42
##  [41] 1444.56 1606.70 1656.00 1603.92 1430.42 1486.48 1470.30 1479.14
##  [49] 1540.70 1490.58 1476.98 1362.90 1479.38 1469.90 1490.16 1446.66
##  [57] 1524.52 1465.26 1526.52 1470.54 1581.68 1519.12 1563.00 1410.06
##  [65] 1614.14 1566.56 1527.06 1446.54 1457.64 1426.98 1418.98 1478.32
##  [73] 1452.96 1535.56 1528.86 1427.88 1494.84 1466.06 1612.72 1502.52
##  [81] 1465.50 1531.94 1385.04 1575.68 1356.06 1376.24 1454.04 1545.90
##  [89] 1399.34 1678.22 1561.62 1524.52 1452.70 1450.92 1550.86 1579.92
##  [97] 1482.84 1550.22 1506.44 1453.52

length(sample_means_small)

## [1] 100

The vector has 100 elements. Each element is the mean of one random sample of size 50 drawn from the population.

Exercise 6

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

As the sample size increases, the spread of the sampling distribution decreases, which is to say the standard error decreases. The center of the sampling distribution (the mean of the sample means) stays at the population mean.
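The pattern in the three histograms can be summarized numerically. The sketch below repeats the simulation on a made-up population (`pop`, `sizes`, `sim_means`, and `se_table` are illustrative names, not part of the lab) and compares the observed spread of the sample means with the usual sigma / sqrt(n) approximation.

```r
# Simulated right-skewed "population"; NOT the Ames data.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, draw 2000 samples and record each sample's mean.
sim_means <- lapply(sizes, function(n) replicate(2000, mean(sample(pop, n))))

se_table <- data.frame(
  n           = sizes,
  center      = sapply(sim_means, mean),  # should stay near mean(pop)
  observed_se = sapply(sim_means, sd),    # spread of the sample means
  theory_se   = sd(pop) / sqrt(sizes)     # ignores finite-population correction
)
se_table
```

The `center` column should barely move across rows while `observed_se` drops as n grows, tracking `theory_se` closely. The small gap that remains comes partly from sampling without replacement from a finite population.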

On Your Own

Problem 1

sample_p <- sample(price, 50)
mean(sample_p)

## [1] 195150.6

The point estimate of the population mean is the sample mean computed above, about 195,151.

Problem 2

sample_means50 <- rep(0, 5000)

for (i in 1:5000){
  sample_means50[i] <- mean(sample(price, 50))
}

hist(sample_means50, breaks = 20)

mean(sample_means50) # mean of the sampling distribution; estimates the population mean

## [1] 180731.3

mean(price) #actual population mean

## [1] 180796.1

Problem 3

sample_means150 <- rep(0, 5000)

for (i in 1:5000){
  sample_means150[i] <- mean(sample(price, 150))
}
hist(sample_means150, breaks = 20)

mean(sample_means150) # mean of the sampling distribution; estimates the population mean

## [1] 181013.7

This sampling distribution resembles the normal distribution more closely than the one built from samples of size 50. It has very little skew, and more of its weight lies between 170,000 and 190,000.

Problem 4

The second sampling distribution (from samples of size 150) has a smaller spread, which is preferable when we want estimates to be closer to the true value. Put another way, a 95% confidence interval built from a sample of 150 is narrower than one built from a sample of 50, yet it contains the true mean just as frequently.
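The claim about interval width and coverage can be checked by simulation. This is a rough sketch on a made-up right-skewed population, not the Ames prices; `pop`, `one_interval`, and `ci_summary` are illustrative names, and the intervals are ordinary t-intervals, which is an assumption about how the confidence interval would be built.

```r
# Simulated right-skewed "population"; NOT the Ames data.
set.seed(1)
pop <- rlnorm(3000, meanlog = 12, sdlog = 0.4)
true_mean <- mean(pop)

# Draw one sample of size n, build a 95% t-interval for the mean, and
# report its width and whether it covers the true mean (1 = yes, 0 = no).
one_interval <- function(n) {
  x <- sample(pop, n)
  half <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  c(width = 2 * half, covered = abs(mean(x) - true_mean) <= half)
}

# Average width and coverage rate over 2000 intervals per sample size.
ci_summary <- sapply(c(n50 = 50, n150 = 150),
                     function(n) rowMeans(replicate(2000, one_interval(n))))
ci_summary
```

The `width` row should be clearly smaller for `n150`, while the `covered` row should sit near 0.95 for both sizes. With skewed data and n = 50 the coverage can fall slightly short of the nominal level, so "just as frequently" holds only approximately.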