download.file("http://www.openintro.org/stat/data/ames.RData",destfile ="ames.RData")
load("ames.RData")
area<-ames$Gr.Liv.Area
price<-ames$SalePrice
The population distribution is unimodal and right skewed. The skew shows in the summary: the mean (about 1,500) is larger than the median (1,440), pulled toward the long upper tail. The range is about 5,300 (334 to 5,640). With narrower bins (breaks = 20) the histogram also shows a few extreme outliers between 4,000 and 6,000.
summary(area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
334 1130 1440 1500 1740 5640
hist(area, breaks = 20 )
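A quick numerical check of the skew described above is to compare the mean with the median and to compute the range directly. The sketch below uses a simulated right-skewed vector, `pop_demo`, as a stand-in for `area` so that it runs without the downloaded file. The log-normal parameters are assumptions chosen only to give values of a similar size; they are not estimates from the Ames data.

```r
# Simulated stand-in for `area` (assumption: log-normal, parameters illustrative only)
set.seed(1)
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the long upper tail pulls the mean above the median
skew_gap <- mean(pop_demo) - median(pop_demo)

# Range as a single number: maximum minus minimum
pop_range <- diff(range(pop_demo))

skew_gap > 0   # TRUE suggests right skew
pop_range
```

With the lab data the same comparison would be `mean(area) - median(area)`, which the summary output puts at roughly 1500 - 1440 = 60.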
The distribution of the sample of 50 follows the shape of the population distribution, which agrees with the basic properties of point estimates: the sample mean (the point estimate, about 1,480) is close to the population mean (about 1,500). However, the sample range is narrower (848 to 3,500, about 2,650, with the middle half of the data spanning only about 600) and the outliers are not as extreme. Because this is a simple random sample, the result will differ from run to run unless the seed is fixed with set.seed.
set.seed(50)
samp1<-sample(area,50)
summary(samp1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
848 1090 1340 1480 1720 3500
hist(samp1)
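The effect of `set.seed` on reproducibility can be sketched as follows. The vector `pop_demo` is a hypothetical stand-in for `area`, used only so the example runs on its own; the names `first_draw`, `second_draw`, and `third_draw` are likewise illustrative.

```r
# Hypothetical stand-in population so the example runs without ames.RData
pop_demo <- seq(300, 5600, by = 10)

set.seed(50)
first_draw <- sample(pop_demo, 50)

set.seed(50)                          # resetting the seed replays the same random stream
second_draw <- sample(pop_demo, 50)

third_draw <- sample(pop_demo, 50)    # no reset: the generator has moved on

identical(first_draw, second_draw)    # same seed, same sample
identical(first_draw, third_draw)     # almost certainly a different sample
```

This is why a sample drawn without a preceding `set.seed` call gives a different summary each time the code is run.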
The mean of the second sample of 50 (about 1,560) differs from that of the first (about 1,480). As the sample size grows to 100 and then 1,000, the histogram looks more like the one for the whole population. The standard errors also shrink (77.61, 51.3, 15.66), which indicates that larger samples tend to give better estimates of the population mean, though there is no guarantee that every large sample will give a better estimate than a particular small sample.
samp2<-sample(area,50)
summary(samp2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
845 1140 1530 1560 1850 3400
sd(samp2)/(sqrt(50))
[1] 77.61
hist(samp2)
set.seed(100)
samp3<-sample(area,100)
summary(samp3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
605 1130 1390 1510 1770 3110
sd(samp3)/(sqrt(100))
[1] 51.3
hist(samp3)
set.seed(1000)
samp4<-sample(area,1000)
summary(samp4)
Min. 1st Qu. Median Mean 3rd Qu. Max.
407 1130 1440 1510 1780 4320
sd(samp4)/(sqrt(1000))
[1] 15.66
hist(samp4, breaks = 25)
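The standard error calculation repeated above for each sample can be wrapped in a small function. This is only a sketch: `std_error` is a hypothetical helper name, and `pop_demo` is a simulated stand-in for `area` with assumed log-normal parameters.

```r
# Hypothetical helper: estimated standard error of the mean, s / sqrt(n)
std_error <- function(x) sd(x) / sqrt(length(x))

# Simulated stand-in for `area` (assumption: parameters are illustrative only)
set.seed(1)
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Draw one sample at each size and estimate its standard error
sizes <- c(50, 100, 1000)
se_by_size <- sapply(sizes, function(n) std_error(sample(pop_demo, n)))
names(se_by_size) <- sizes

se_by_size
```

Because the sample standard deviation stays roughly constant while the divisor grows as the square root of n, the estimate for n = 1000 should come out far smaller than for n = 50, in line with the 77.61 and 15.66 reported above.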
sample_means50 holds 5,000 elements, each the mean of one random sample of size 50. The center of the sampling distribution is very close to 1,500, the population mean. The sample mean is an unbiased estimator because the mean of its sampling distribution equals the true population mean. In the histogram the spread of the sample means is much smaller than the spread of the population, because each mean averages 50 observations; the 5,000 repetitions only make the histogram smoother and do not change the spread. The larger the sample size, the smaller the spread, that is, the lower the variability of the estimate. The shape of the histogram is close to a normal distribution.
sample_means50<-rep(0,5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
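As a rough check on the center and spread of a sampling distribution, the simulated sample means can be compared with the population mean and with the population standard deviation divided by the square root of the sample size. The sketch below uses `replicate()` as a compact alternative to the `for` loop. It assumes a simulated stand-in population `pop_demo` rather than `area`, and the other variable names are illustrative.

```r
# Simulated stand-in for `area` (assumption: parameters are illustrative only)
set.seed(1)
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

n <- 50
# replicate() evaluates the expression 5000 times and collects the results in a vector
sample_means_demo <- replicate(5000, mean(sample(pop_demo, n)))

# Spread actually observed across the 5000 sample means
observed_spread <- sd(sample_means_demo)

# Population SD (dividing by N, not N - 1) over sqrt(n)
pop_sd <- sqrt(mean((pop_demo - mean(pop_demo))^2))
predicted_spread <- pop_sd / sqrt(n)

c(center = mean(sample_means_demo), population_mean = mean(pop_demo))
c(observed = observed_spread, predicted = predicted_spread)
```

The two spreads should agree closely but not exactly: `sample()` draws without replacement from a finite vector, and 5,000 repetitions still leave some simulation noise.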
There are 100 elements in sample_means_small, each the mean of one of 100 random samples of size 50.
sample_means_small<-rep(0,100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small
[1] 1479 1511 1467 1523 1443 1541 1510 1585 1503 1474 1525 1472 1535 1550
[15] 1588 1470 1458 1528 1591 1462 1565 1333 1462 1469 1605 1516 1526 1478
[29] 1438 1546 1479 1413 1496 1744 1510 1503 1470 1547 1415 1408 1498 1529
[43] 1485 1382 1575 1585 1442 1423 1414 1403 1474 1544 1447 1547 1488 1344
[57] 1517 1574 1624 1538 1522 1452 1407 1584 1360 1568 1516 1571 1597 1465
[71] 1456 1631 1511 1415 1348 1481 1470 1488 1545 1511 1548 1433 1399 1508
[85] 1320 1486 1492 1588 1548 1445 1480 1505 1499 1577 1500 1473 1555 1464
[99] 1499 1410
hist(sample_means_small)
When the sample size is large, the center of the sampling distribution should be close to the population mean. The spread also becomes smaller as the sample size increases, which indicates lower variability of the sample mean.
set.seed(50)
samp5<-sample(price,50)
summary(samp5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
76500 112000 140000 171000 202000 584000
The sampling distribution of the mean is approximately normal because the sample size is reasonably large. Based on 5,000 samples of size 50, the sample means center at about 180,000 and the spread is small. The summary of the population gives a mean home price of 181,000, which is very close to the center of the sampling distribution.
sample_means50<-rep(0,5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
summary(price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12800 130000 160000 181000 214000 755000
My guess would still be around 180,000.
sample_means150<-rep(0,5000)
for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)
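The sampling distribution from 3 (samples of size 150) has the smaller spread. We would prefer a distribution with a small spread because it means the sample mean varies less from sample to sample, so any single estimate is likely to be closer to the population mean.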
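The size of the difference can be worked out from the standard error formula used earlier, the standard deviation divided by the square root of the sample size. As a sketch, write $\sigma$ for the population standard deviation of price (a symbol introduced here, not in the lab code). Both sampling distributions draw from the same population, so $\sigma$ cancels:

```latex
\[
\frac{SE_{50}}{SE_{150}}
  = \frac{\sigma / \sqrt{50}}{\sigma / \sqrt{150}}
  = \sqrt{\frac{150}{50}}
  = \sqrt{3}
  \approx 1.73
\]
```

Under this approximation, the sampling distribution for samples of size 50 is expected to be about 1.7 times as wide as the one for samples of size 150.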