Lab 5: Foundations for statistical inference - Sampling distributions

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

Exercise 1

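The population distribution is skewed right. Most homes have between 1000 and 2000 square feet of living area, with a median of 1442, and a few very large homes stretch the right tail out to 5642.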

samp1 <- sample(area, 50)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     804    1199    1432    1526    1718    3820
hist(samp1)

Exercise 2

The distribution of this sample is very similar to the distribution of all the homes, just with far fewer observations. Most of the homes are still within 1000-2000 square feet, but the sample has no outliers as extreme as the population maximum of 5642 square feet; the largest home in the sample is 3820 square feet.

mean(samp1)
## [1] 1525.7

Exercise 3

samp2<-sample(area, 50)
mean(samp1)
## [1] 1525.7
mean(samp2)
## [1] 1492.7

The mean of samp2 (1492.7) is about 33 square feet lower than the mean of samp1 (1525.7), which is a small difference in the grand scheme. We can still make some predictions about the whole population even when the two samples are slightly different. If we took two samples, one of size 100 and one of size 1000, the sample of size 1000 would provide a more accurate estimate of the population mean, because the means of larger samples vary less from sample to sample.
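The claim about sample sizes 100 and 1000 can be checked with a small simulation. This is only a sketch: `population` is a synthetic right-skewed stand-in (an assumption, not the Ames data), and `mean_abs_error` is a hypothetical helper that reports how far the sample mean typically lands from the true mean.

```r
# Synthetic, right-skewed stand-in for the population (an assumption, not the Ames data)
set.seed(1)
population <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
true_mean <- mean(population)

# Hypothetical helper: average absolute error of the sample mean
# over many repeated samples of size n
mean_abs_error <- function(n, reps = 2000) {
  estimates <- replicate(reps, mean(sample(population, n)))
  mean(abs(estimates - true_mean))
}

err_100 <- mean_abs_error(100)
err_1000 <- mean_abs_error(1000)

# The typical error should be clearly smaller for the larger sample size
c(n100 = err_100, n1000 = err_1000)
```

With the real data, replacing `population` by `area` should show the same pattern.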

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1259    1453    1499    1501    1547    1833

Exercise 4

There are 5000 elements in sample_means50. The sampling distribution looks fairly normal, slightly skewed to the right with a few outliers, and is centered at about 1500. If we collected 50,000 sample means instead, I would expect the histogram to look smoother, but the shape, center, and spread should stay about the same, since those depend on the sample size (50) and not on how many samples we draw. The center would still sit close to the population mean of about 1500.
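One way to see what changes when more sample means are collected is to run the simulation twice with different numbers of repetitions. The sketch below uses a synthetic stand-in population (an assumption, not the Ames data) and a hypothetical helper `simulate_means`.

```r
# Synthetic, right-skewed stand-in for the population (an assumption, not the Ames data)
set.seed(2)
population <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Hypothetical helper: draw `reps` samples of size n and return their means
simulate_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(population, n)))
}

means_5k <- simulate_means(5000)
means_50k <- simulate_means(50000)

# Compare center and spread of the two simulated sampling distributions
rbind(
  reps_5000 = c(center = mean(means_5k), spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```

The two rows should come out nearly identical, which suggests that extra repetitions mainly smooth the histogram rather than change the distribution being estimated.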

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)

# The for loop below repeats the same two steps for all 5000 elements
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
  }

Exercise 5

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
sample_means_small
##   [1] 1445.10 1554.10 1361.10 1449.32 1529.70 1692.92 1554.58 1482.92 1403.62
##  [10] 1484.90 1536.42 1564.98 1586.02 1536.34 1483.66 1535.72 1455.24 1453.80
##  [19] 1484.38 1505.78 1451.94 1485.66 1504.72 1449.22 1414.60 1461.56 1617.28
##  [28] 1421.38 1420.80 1444.68 1417.54 1589.18 1475.16 1588.50 1466.24 1604.74
##  [37] 1497.62 1622.08 1416.64 1545.24 1620.70 1492.06 1447.34 1413.38 1624.76
##  [46] 1513.86 1388.78 1434.54 1436.38 1447.12 1475.82 1515.68 1401.20 1542.56
##  [55] 1549.84 1356.34 1606.52 1510.72 1534.06 1462.80 1594.22 1445.52 1381.42
##  [64] 1415.42 1568.54 1424.36 1464.54 1594.34 1480.10 1521.54 1415.90 1481.52
##  [73] 1459.92 1559.38 1531.22 1563.26 1495.12 1437.06 1537.60 1486.52 1464.38
##  [82] 1427.04 1525.70 1471.12 1611.94 1559.00 1469.86 1541.48 1512.34 1444.38
##  [91] 1525.02 1517.30 1574.62 1466.20 1425.76 1525.46 1507.48 1601.12 1533.20
## [100] 1421.86

There are 100 elements in sample_means_small. On each pass, the loop takes a random sample of size 50 from area, computes its mean, and stores that mean in the i-th position of sample_means_small.
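The same vector of means can also be built without an explicit loop. The sketch below uses base R's `replicate()`; `area_demo` is a hypothetical synthetic stand-in for `area`, and `sample_means_small_alt` is a new name chosen so nothing above is overwritten.

```r
# Hypothetical stand-in for `area` (synthetic data, not the Ames data)
set.seed(3)
area_demo <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Same idea as the loop: 100 means, each from a fresh random sample of size 50
sample_means_small_alt <- replicate(100, mean(sample(area_demo, 50)))

# Number of stored means
length(sample_means_small_alt)
```

`replicate()` evaluates the expression once per repetition and collects the results, so the preallocation step with `rep(NA, 100)` is not needed in this version.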

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6

When the sample size is larger, the center stays about the same but the spread is much smaller.
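The shrinking spread can be compared against the usual standard error formula, sigma / sqrt(n). The sketch below is an illustration on a synthetic stand-in population (an assumption, not the Ames data). Because `sample()` draws without replacement by default, the expected values include a finite-population correction.

```r
# Synthetic, right-skewed stand-in for the population (an assumption, not the Ames data)
set.seed(4)
population <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
N <- length(population)

# Population standard deviation (divides by N rather than N - 1)
sigma <- sqrt(mean((population - mean(population))^2))

sizes <- c(10, 50, 100)

# Observed spread: SD of 5000 simulated sample means for each sample size
observed <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(population, n))))
})

# Expected spread: sigma / sqrt(n), times the finite-population correction
expected <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(n = sizes, observed = observed, expected = expected)
```

The observed and expected columns should agree closely, and both should fall as n grows.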

On Your Own

Number 1

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

samp1 <- sample(price, 50)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   63900  126250  167500  191017  235000  466500
hist(samp1)

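My best point estimate of the population mean is the sample mean, about 191017, from a random sample of 50 prices. (The sample median, 167500, is noticeably lower because the prices are skewed right.)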
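A point estimate is more useful with a rough measure of its uncertainty. The sketch below computes the sample mean together with its estimated standard error, sd / sqrt(n); `price_demo` is a hypothetical synthetic stand-in for `price`, so the numbers it prints will not match the Ames output above.

```r
# Hypothetical stand-in for `price` (synthetic data, not the Ames data)
set.seed(5)
price_demo <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

samp <- sample(price_demo, 50)

# Point estimate of the population mean: the sample mean
point_estimate <- mean(samp)

# Estimated standard error of that mean, using the sample SD
std_error <- sd(samp) / sqrt(length(samp))

c(estimate = point_estimate, std_error = std_error)
```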

Number 2

sample_means50 <- rep(0, 5000)

for(i in 1:5000){
   samp <- sample(price, 50)
   sample_means50[i] <- mean(samp)
}

hist(sample_means50, breaks = 25)

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000

Based on this sampling distribution, I would guess the population mean is a little less than 180000. The actual population mean from summary(price) is 180796, so the guess is close.

Number 3

sample_means150 <- rep(0, 5000)

for(i in 1:5000){
   samp <- sample(price, 150)
   sample_means150[i] <- mean(samp)
}

hist(sample_means150, breaks = 25)

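The sampling distribution for samples of size 150 is still centered at about 180000, so my estimate of the mean sale price is still around 180000.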

Number 4

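The sampling distribution from number 3 (sample size 150) has a smaller spread. We prefer a smaller spread because it indicates lower variability in the sample means, so any single sample mean is more likely to be close to the true population mean.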
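As a rough check on how much smaller the spread should be, the standard error formula can be applied to both sample sizes. This ignores the finite-population correction and assumes the same population standard deviation $\sigma$ in both cases:

```latex
\frac{SE_{50}}{SE_{150}}
  = \frac{\sigma / \sqrt{50}}{\sigma / \sqrt{150}}
  = \sqrt{\frac{150}{50}}
  = \sqrt{3} \approx 1.73
```

So under these assumptions the sample means for size 50 should be spread out about 1.7 times as widely as those for size 150.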