DATA_606_Lab4a

The distribution of house area data is unimodal and skewed to the right.

load("more/ames.RData")

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

samp1 <- sample(area, 50)

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     720    1099    1440    1459    1698    2656

hist(samp1)

mean(samp1)

## [1] 1458.8

The distribution of the sample is unimodal and more symmetric than that of the entire data set. The sample mean and median are similar to those of the entire data set, but the range of values is narrower: the minimum area is higher and the maximum area is lower.

samp2 <- sample(area, 50)
mean(samp2)

## [1] 1554.3

The mean of samp2 is larger than the mean of samp1 and larger than the mean of the entire data set.
I would expect that the larger the sample size, the closer the sample mean will tend to be to the mean of the actual population.
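
The expectation about sample size can be checked with a small simulation. The following is a minimal sketch, not part of the lab's code: it assumes a synthetic right-skewed population (`pop`, drawn with `rlnorm`) as a stand-in for `area`, and the helper `mean_abs_error` is a hypothetical name introduced here. It averages, over many repeated samples, how far the sample mean falls from the population mean for several sample sizes.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Assumption: a synthetic right-skewed population standing in for `area`
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# Hypothetical helper: average distance between a sample mean and the
# population mean, over `reps` samples of size n drawn without replacement
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(pop, n)) - mean(pop)))
  mean(errs)
}

sizes <- c(10, 50, 200, 1000)
errors <- sapply(sizes, mean_abs_error)

# One row per sample size; the error column should shrink as n grows
data.frame(n = sizes, mean_abs_error = errors)
```

A single sample of any size can still land far from the population mean by chance, which is why the sketch averages over many samples rather than comparing two individual draws.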

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1249    1451    1498    1500    1548    1790

There are 5000 elements in sample_means50. The sampling distribution is nearly normal: it is unimodal and symmetric. The mean of the sampling distribution is 1500, which matches the mean of the entire data set. The more sample means collected, the more I would expect the mean of the sampling distribution to approach the mean of the data set. I would expect 50000 sample means to give a slightly different result, but 5000 is already a large number, so I would not expect the result to be very different.
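
To probe the last point, here is a hedged sketch comparing 5000 and 50000 sample means. It assumes a synthetic right-skewed population (`pop`) in place of `area`, and it uses `replicate()` as a compact equivalent of the `for` loop used in this lab; the variable names `means_5k` and `means_50k` are introduced only for illustration.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Assumption: synthetic right-skewed population standing in for `area`
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# replicate() evaluates the expression repeatedly and returns a vector
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Both centers should sit near the population mean, and both spreads
# should be similar: more repetitions smooth the histogram but do not
# change the underlying sampling distribution.
c(population = mean(pop), m5k = mean(means_5k), m50k = mean(means_50k))
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

The number of repetitions controls how precisely the sampling distribution is drawn, while the sample size (50 here) controls how wide that distribution is.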

sample_means_small <- rep(0, 100)
for (i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small

##   [1] 1412.76 1598.40 1454.84 1435.26 1579.24 1469.72 1426.70 1445.74
##   [9] 1539.02 1349.00 1431.48 1450.40 1582.42 1484.56 1543.88 1548.80
##  [17] 1536.40 1510.60 1461.84 1481.30 1532.80 1520.28 1459.18 1596.72
##  [25] 1364.86 1489.16 1428.54 1542.40 1522.30 1665.02 1356.34 1394.78
##  [33] 1397.44 1511.06 1511.60 1540.66 1593.48 1540.40 1608.34 1413.00
##  [41] 1612.56 1395.72 1504.42 1439.64 1507.40 1564.46 1477.38 1525.56
##  [49] 1656.90 1488.52 1473.72 1428.52 1535.62 1454.92 1422.34 1570.18
##  [57] 1593.02 1561.82 1608.60 1558.30 1642.56 1517.42 1417.44 1422.88
##  [65] 1608.62 1597.18 1457.20 1460.48 1471.26 1507.34 1439.46 1403.52
##  [73] 1515.78 1521.98 1490.90 1525.92 1584.00 1528.50 1501.14 1479.28
##  [81] 1520.84 1472.42 1521.30 1510.28 1481.04 1593.38 1551.02 1403.10
##  [89] 1448.04 1376.44 1448.44 1615.32 1455.92 1449.78 1521.78 1563.28
##  [97] 1486.62 1548.92 1556.52 1437.34

There are 100 elements in sample_means_small. Each element represents the mean area of a random sample of 50 houses that were sold.
When the sample size is larger, the spread of the sampling distribution becomes narrower and the sample means cluster more tightly around the center.
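
The narrowing follows a standard result, stated here as background rather than something derived in this lab. For samples of size n drawn from a population with standard deviation sigma, the standard deviation of the sample mean (the standard error) is given below. The second line adds the finite population correction, which applies when sampling without replacement from a population of size N, as `sample()` does by default.

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
\qquad\text{and, without replacement from } N \text{ units,}\qquad
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}\,\sqrt{\frac{N-n}{N-1}}
```

Because n appears under a square root, quadrupling the sample size only halves the spread of the sampling distribution.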

Price of Homes

samp_price <- sample(price, 50)
mean(samp_price)

## [1] 155866.5

The mean house price in a sample of 50 homes is $155866.50.

Computing 5000 Sample Means from subsets of 50

sample_means50 <- rep(NA, 5000)
for (i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks=25)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  144571  172684  180290  180719  188040  226455

The sampling distribution is unimodal and symmetric. The mean of the sampling distribution is $180719.

mean(price)

## [1] 180796.1

The mean house price in the population is $180796.10.

Computing 5000 Sample Means from subsets of 150

sample_means150 <- rep(NA, 5000)
for (i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks=25)

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  157326  176488  180790  180887  185184  201573

The shape of the sampling distribution is unimodal and symmetric. The sampling distribution with samples of size 150 has a narrower spread than the sampling distribution with samples of size 50. I would guess the mean sale price of a house in Ames is close to $180887. In order to make estimates that are close to the true value, I would prefer a sampling distribution with a small spread.
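
The narrower spread can be quantified from the numbers already reported. The sketch below is an illustration, not part of the original analysis: it takes the quartiles from the two `summary()` outputs for price, uses the interquartile range as the measure of spread (an assumption; the standard deviation would work as well), and compares the observed ratio with the ratio expected if spread scales as 1 / sqrt(n).

```r
# Quartiles copied from the two summary() outputs for price
iqr_50  <- 188040 - 172684   # samples of size 50
iqr_150 <- 185184 - 176488   # samples of size 150

observed_ratio <- iqr_50 / iqr_150

# If spread scales as 1 / sqrt(n), tripling n should shrink it by sqrt(3)
expected_ratio <- sqrt(150 / 50)

round(c(observed = observed_ratio, expected = expected_ratio), 2)
```

The observed ratio is about 1.77 and the expected ratio is about 1.73, so tripling the sample size cut the spread by roughly the factor the square-root rule predicts. The simple rule ignores the effect of sampling without replacement from a finite population, which may account for part of the small gap.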

Sarah Wigodsky

October 1, 2017
