download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
The histogram shows a population distribution that is unimodal and skewed to the right. The summary statistics point the same way: the mean (1500) is larger than the median (1442), and the maximum (5642) lies much farther from the median than the minimum (334) does.
samp1 <- sample(area, 50)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 816 1155 1552 1592 1720 2944
hist(samp1)
The distribution of this sample of 50 houses, drawn from the population of 2,930, is broadly similar to the underlying population distribution. In the output above, the sample mean, 1592, is 92 square feet larger than the population mean of 1500, and the sample median, 1552, is 110 square feet larger than the population median of 1442. Like the population distribution, the sample is skewed to the right, but it has less extreme outliers, with a max value of 2944 and a min value of 816.
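Because sample() draws a new random sample on every run, figures quoted in the text can drift away from the printed output whenever the document is re-run. A minimal sketch of one way to pin a draw, assuming a simulated right-skewed stand-in for area (the vector area_sim and both seed values are illustrative and not part of the original analysis):

```r
# Stand-in population: 2930 right-skewed values (an assumption, not the Ames data)
set.seed(42)
area_sim <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# Setting the same seed immediately before each draw makes the draw repeatable
set.seed(2024)
samp_a <- sample(area_sim, 50)
set.seed(2024)
samp_b <- sample(area_sim, 50)

identical(samp_a, samp_b)  # checks that both draws picked the same 50 values
summary(samp_a)
```

With a seed fixed this way, the summary statistics cited in the write-up stay in step with the output that follows the code.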
samp2 <- sample(area, 50)
summary(samp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 498 1130 1389 1406 1499 2461
hist(samp2)
# The mean of samp2, 1406, is 186 square feet below the mean of samp1, 1592. In this sample the median (1389) is closer to the mean than it was in samp1, which suggests that this sample distribution is more symmetric than the previous one. If we were to take two more samples, one of size 100 and one of size 1000, the sample of size 1000 would be expected to give a more accurate estimate of the population mean. The reason is that a larger sample more closely resembles the original population, and the standard error of its mean shrinks as the sample size n increases.
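The claim about sample size can be illustrated with the usual standard error estimate, s / sqrt(n), computed from a single sample. This is only a sketch: area_sim is a simulated right-skewed stand-in for area, and se_of_mean is a helper name introduced here for illustration.

```r
# Stand-in population (an assumption, not the Ames data)
set.seed(1)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Estimated standard error of the mean from one sample: s / sqrt(n)
se_of_mean <- function(x) sd(x) / sqrt(length(x))

samp100  <- sample(area_sim, 100)
samp1000 <- sample(area_sim, 1000)

# Estimated standard errors for the two sample sizes
c(n100 = se_of_mean(samp100), n1000 = se_of_mean(samp1000))

# Actual miss of each sample mean relative to the population mean
c(error100  = abs(mean(samp100)  - mean(area_sim)),
  error1000 = abs(mean(samp1000) - mean(area_sim)))
```

A single sample of 1000 is not guaranteed to land closer than a single sample of 100; the standard error describes the typical size of the miss, not the miss in any one draw.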
# Build the sampling distribution: 5000 means of samples of size 50
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1262 1453 1500 1501 1548 1758
There are 5000 elements in the vector sample_means50. The distribution is symmetric and unimodal. The median and mean are very close to one another (1500 and 1501 respectively), and the min and max values (1262 and 1758) are nearly equidistant from the center.
With a larger number of simulated sample means, I would expect only the histogram to look smoother, because more simulated means fill in the sampling distribution more densely. I would not expect the distribution to become more normal, because its shape, like its spread and center, depends on the sample size rather than on the number of simulations. Each run of the simulation generates a new set of random sample means, so the estimated standard error and the mean of the sampling distribution fluctuate slightly from run to run, but they do not shift systematically as the number of simulations grows.
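A rough way to check this reasoning is to hold the sample size fixed while varying the number of simulated means, and then do the reverse. The sketch below assumes a simulated stand-in population (area_sim) and a hypothetical helper, sim_means; the repetition counts are arbitrary.

```r
# Stand-in population (an assumption, not the Ames data)
set.seed(1)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Simulate `reps` sample means, each from a sample of size n
sim_means <- function(reps, n) replicate(reps, mean(sample(area_sim, n)))

few  <- sim_means(500, 50)     # few simulations, n = 50
many <- sim_means(20000, 50)   # many simulations, n = 50

# Both spreads are governed by n = 50, so they should be close
c(sd_few = sd(few), sd_many = sd(many))

# Changing n, not the number of simulations, is what shrinks the spread
larger_n <- sim_means(5000, 200)
sd(larger_n)
```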
sample_means_small <- rep(NA, 100)
for (i in 1:100) {
  # Draw a fresh sample of 50 houses and store its mean
  sampn <- sample(area, 50)
  sample_means_small[i] <- mean(sampn)
}
# There are 100 elements in the vector sample_means_small. Each element is the mean of one random sample of 50 houses drawn from the same population of living areas.
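The same vector of 100 sample means can also be built without an explicit loop. The sketch below uses base R's replicate() on a simulated stand-in population (area_sim, an assumption), and resets the seed so that the loop and the one-liner see the same random draws.

```r
# Stand-in population (an assumption, not the Ames data)
set.seed(1)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Loop version, mirroring the code in the text
set.seed(10)
loop_means <- rep(NA, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(area_sim, 50))
}

# replicate() version: evaluates the expression 100 times and collects the results
set.seed(10)
rep_means <- replicate(100, mean(sample(area_sim, 50)))

length(rep_means)                 # one element per simulated sample
all.equal(loop_means, rep_means)  # compares the two approaches
```

The loop makes the mechanics explicit, while replicate() removes the bookkeeping of preallocating and indexing the result vector.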
# Sampling distributions for samples of size 10 and size 100
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
As the sample size increases, the center of the sampling distribution stays near the population mean while its spread, the standard error, shrinks. The shape also becomes more symmetric and closer to normal, so the median and the mean of the sampling distribution converge.
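The visual comparison of the three histograms can be backed up with numbers. The sketch below tabulates the center, spread, and a moment-based skewness for sample sizes 10, 50, and 100, assuming a simulated stand-in population (area_sim); skewness is a helper defined here, not a base R function.

```r
# Stand-in population (an assumption, not the Ames data)
set.seed(1)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Moment-based sample skewness (0 for a perfectly symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(10, 50, 100)
results <- sapply(sizes, function(n) {
  # 5000 simulated sample means for this sample size
  means <- replicate(5000, mean(sample(area_sim, n)))
  c(n = n, center = mean(means), spread = sd(means), skew = skewness(means))
})

# One row per sample size
round(t(results), 2)
```

If the reasoning in the text holds, the center column should stay roughly constant while the spread column falls as n grows, and the skew column should move toward zero.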
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
samp <- sample(price, 50)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52500 127750 156000 186647 222250 591587
hist(samp)
# Using this sample alone, the best point estimate of the population mean is the sample mean, 186,647. (The true population mean is 180,796.)
# Note: this overwrites the earlier sample_means50, which held means of living area
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
mean(sample_means50)
## [1] 180776.2
# The sampling distribution is unimodal and nearly symmetric about the mean, with a slight right skew. I would guess that the mean home price of the population is close to the mean of this sampling distribution of mean home prices, which is 180,776.
mean(price)
## [1] 180796.1
# The mean home price of the population is 180,796.1, just slightly above the sampling distribution mean of 180,776.2.
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150)
mean(sample_means150)
## [1] 180767
# This sampling distribution is also unimodal, but it is more symmetric about the mean than the sampling distribution for a sample size of 50. It does not appear perfectly normal, but it is very close. I would guess the mean sale price of homes in Ames to be 180,767 based on this sampling distribution.
The sampling distribution based on samples of size 150 has the smaller spread, that is, the smaller standard error. To make estimates that are more likely to be close to the true population mean, it is preferable to use the sampling distribution with the smaller spread. Because the standard error decreases as the sample size increases, that means choosing the larger sample size: the larger the sample, the closer the sample mean tends to be to the population mean.
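"More likely to be close" can be made concrete by counting how often a simulated sample mean lands within a fixed tolerance of the true mean. The sketch below assumes a simulated stand-in for price (price_sim), a hypothetical helper share_close, and an arbitrary tolerance of 5 percent.

```r
# Stand-in population of sale prices (an assumption, not the Ames data)
set.seed(1)
price_sim <- rlnorm(2930, meanlog = 12, sdlog = 0.4)
true_mean <- mean(price_sim)

# Share of simulated sample means within `tol` (relative) of the true mean
share_close <- function(n, reps = 5000, tol = 0.05) {
  means <- replicate(reps, mean(sample(price_sim, n)))
  mean(abs(means - true_mean) <= tol * true_mean)
}

close50  <- share_close(50)
close150 <- share_close(150)

# Proportion of sample means within 5% of the population mean
c(n50 = close50, n150 = close150)
```

A higher share for n = 150 than for n = 50 is what the smaller standard error predicts.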