##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
Population: Residential home sales in Ames, Iowa between 2006 and 2010 (2,930 observations)
- area: above ground living area of the house in square feet
- price: sales price
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
Area (square feet)
- Min 334 sqft, Max 5642 sqft
- Mean 1500 sqft
- Median 1442 sqft
- IQR 616.75 sqft
- SD 505.5089 sqft
The histogram of area is unimodal and skewed to the right. I ran qqnormsim at least 5 times, and none of the simulated normal probability plots looked similar enough to the normal probability plot of area. So, I do not think that area has a normal distribution.
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
IQR(area)
## [1] 616.75
sd(area)
## [1] 505.5089
hist(area)
qqnorm(area)
qqline(area)
DATA606::qqnormsim(area)
Price
- Min $12,789, Max $755,000
- Mean $180,796
- Median $160,000
- IQR $84,000
- SD $79,886.69
The histogram of price is unimodal and skewed to the right. I ran qqnormsim at least 5 times, and none of the simulated normal probability plots looked similar enough to the normal probability plot of price. So, I do not think that price has a normal distribution.
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
IQR(price)
## [1] 84000
sd(price)
## [1] 79886.69
hist(price)
qqnorm(price)
qqline(price)
DATA606::qqnormsim(price)
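One quick numeric check of the right skew described above is to compare each variable's mean with its median, since a long right tail pulls the mean upward. The sketch below is only illustrative: it hard-codes the summary values reported above rather than reading the ames data, and rel_gap is a hypothetical helper name, not part of the DATA606 package.

```r
# Summary values copied from the summary() output above (not recomputed here).
area_stats  <- c(mean = 1500,   median = 1442)
price_stats <- c(mean = 180796, median = 160000)

# Hypothetical helper: how far the mean sits above the median,
# as a fraction of the median. Positive values suggest right skew.
rel_gap <- function(s) unname((s["mean"] - s["median"]) / s["median"])

area_gap  <- rel_gap(area_stats)
price_gap <- rel_gap(price_stats)

print(round(c(area = area_gap, price = price_gap), 3))
```

A mean above the median is consistent with right skew but does not prove it; the histogram and Q-Q plot remain the primary evidence.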
For this particular sample of 50 observations, the histogram of the sample is also skewed to the right. It looks unimodal. The mean of the sample is 1491 sqft while the population mean is 1500 sqft.
- Min 864 sqft (population 334 sqft)
- Max 2730 sqft (population 5642 sqft)
- Mean 1491 sqft (population 1500 sqft)
- Median 1463 sqft (population 1442 sqft)
- IQR 461 sqft (population 616.75 sqft)
- SD 401 sqft (population 505.5089 sqft)
#simple random sample
set.seed(1)
samp1 <- sample(area, 50)
hist(samp1)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 864 1218 1463 1491 1679 2730
IQR(samp1)
## [1] 461
sd(samp1)
## [1] 401.2159
hist(samp1)
mean(samp1)
## [1] 1491.38
samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
The mean of samp2 is 1519, and the mean of samp1 is 1491. So in this particular case, the mean of samp2 is larger than the mean of samp1.
I would expect a larger sample to generally have a mean that is closer to the population mean. Below, I drew 2 samples of size 100 and 2 samples of size 1000. The population mean is 1500. The 2 samples of size 100 had means of 1531 and 1480. The 2 samples of size 1000 had means of 1499 and 1508. So, in this particular case, the samples of size 1000 have means that are closer to the population mean.
set.seed(2)
samp2 <- sample(area, 50)
summary(samp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 845 1180 1468 1519 1786 2822
set.seed(10)
samp3_100 <- sample(area, 100)
set.seed(15)
samp4_100 <- sample(area, 100)
set.seed(20)
samp5_1000 <- sample(area, 1000)
set.seed(25)
samp6_1000 <- sample(area, 1000)
summary(samp3_100)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 666 1143 1461 1531 1837 5095
summary(samp4_100)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 796 1056 1420 1480 1741 3086
summary(samp5_1000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 1441 1499 1771 3820
summary(samp6_1000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1127 1440 1508 1769 5642
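The intuition that larger samples give more accurate estimates can be made concrete with the standard error of the mean, roughly sd / sqrt(n). The sketch below uses the population SD reported earlier (505.5089) as a fixed number and ignores the finite-population correction, so treat the results as approximations rather than exact values for the Ames data.

```r
# Population SD of area, copied from the sd(area) output above.
pop_sd <- 505.5089

# Sample sizes discussed in the text.
n <- c(50, 100, 1000)

# Approximate standard error of the sample mean for each size.
# Assumption: simple sd / sqrt(n) formula, no finite-population correction.
se <- pop_sd / sqrt(n)
names(se) <- paste0("n=", n)

print(round(se, 1))
```

The standard error shrinks with the square root of n, so going from 50 to 1000 observations (20 times more data) cuts the typical error by a factor of about sqrt(20), roughly 4.5.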
sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
In sample_means50, there are 5000 elements, since the code takes 5000 samples of size 50. The distribution looks unimodal and symmetric with a mean of 1500. The Q-Q plot for sample_means50 shows that the distribution is approximately normal.
No, I do not expect the distribution to change if we instead collected 50,000 sample means. Below I created sample_means50_2, which has 50,000 sample means. The distribution has the same shape and is still centered around the population mean of 1500.
#NOTE: Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called `sample_means50`.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks=25)
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1272 1453 1498 1500 1546 1836
qqnorm(sample_means50)
qqline(sample_means50)
DATA606::qqnormsim(sample_means50)
sample_means50_2 <- rep(NA, 50000)
for(i in 1:50000){
  samp <- sample(area, 50)
  sample_means50_2[i] <- mean(samp)
}
hist(sample_means50_2, breaks=25)
summary(sample_means50_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1215 1451 1498 1500 1547 1827
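To see why collecting more sample means does not change the sampling distribution itself, the sketch below repeats the simulation with two different numbers of replications. It is a stand-in, not the lab's code: pop is a simulated right-skewed population (the real area vector is not loaded here), sim_means is a hypothetical helper, and replicate() is used in place of the explicit for loop.

```r
set.seed(606)

# Assumption: a synthetic right-skewed population standing in for `area`
# (gamma with mean about 1500 and SD about 500), same size as the Ames data.
pop <- rgamma(2930, shape = 9, scale = 1500 / 9)

# Hypothetical helper: draw `reps` samples of size `n` and return their means.
sim_means <- function(pop, n, reps) {
  replicate(reps, mean(sample(pop, n)))
}

means_2k  <- sim_means(pop, n = 50, reps = 2000)
means_20k <- sim_means(pop, n = 50, reps = 20000)

# Center and spread should be close for both runs; only the smoothness
# of the histogram improves with more replications.
print(c(mean_2k = mean(means_2k), mean_20k = mean(means_20k)))
print(c(sd_2k = sd(means_2k), sd_20k = sd(means_20k)))
```

The number of replications controls how precisely the simulation traces the sampling distribution; the sample size n = 50 is what determines its spread.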
sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
There are 100 elements in the sample_means_small object. Each element represents the mean of one random sample of size 50 drawn from area.
sample_means_small <- rep(NA, 100)
for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
  print(paste(i, ")", sample_means_small[i]))
}
## [1] "1 ) 1562.32"
## [1] "2 ) 1447.24"
## [1] "3 ) 1445.08"
## [1] "4 ) 1440.26"
## [1] "5 ) 1517.84"
## [1] "6 ) 1742.34"
## [1] "7 ) 1477.92"
## [1] "8 ) 1503.96"
## [1] "9 ) 1480.06"
## [1] "10 ) 1500.16"
## [1] "11 ) 1473.56"
## [1] "12 ) 1431.14"
## [1] "13 ) 1453.16"
## [1] "14 ) 1446.14"
## [1] "15 ) 1504.22"
## [1] "16 ) 1432.82"
## [1] "17 ) 1459.76"
## [1] "18 ) 1418.9"
## [1] "19 ) 1598.58"
## [1] "20 ) 1437.3"
## [1] "21 ) 1491.9"
## [1] "22 ) 1451.44"
## [1] "23 ) 1541.92"
## [1] "24 ) 1462.38"
## [1] "25 ) 1493.46"
## [1] "26 ) 1616.82"
## [1] "27 ) 1559.24"
## [1] "28 ) 1408.2"
## [1] "29 ) 1537.84"
## [1] "30 ) 1475.28"
## [1] "31 ) 1463.54"
## [1] "32 ) 1550.84"
## [1] "33 ) 1405.6"
## [1] "34 ) 1417.46"
## [1] "35 ) 1459.1"
## [1] "36 ) 1568.98"
## [1] "37 ) 1424.42"
## [1] "38 ) 1546.1"
## [1] "39 ) 1556.36"
## [1] "40 ) 1502"
## [1] "41 ) 1550.12"
## [1] "42 ) 1497.32"
## [1] "43 ) 1556.98"
## [1] "44 ) 1543.34"
## [1] "45 ) 1373.7"
## [1] "46 ) 1506.6"
## [1] "47 ) 1435.52"
## [1] "48 ) 1477.88"
## [1] "49 ) 1459.3"
## [1] "50 ) 1461.5"
## [1] "51 ) 1555.78"
## [1] "52 ) 1465.48"
## [1] "53 ) 1553.08"
## [1] "54 ) 1589.06"
## [1] "55 ) 1573.92"
## [1] "56 ) 1404.98"
## [1] "57 ) 1528.6"
## [1] "58 ) 1639.58"
## [1] "59 ) 1470.38"
## [1] "60 ) 1561.48"
## [1] "61 ) 1437.38"
## [1] "62 ) 1439.8"
## [1] "63 ) 1553.16"
## [1] "64 ) 1408.84"
## [1] "65 ) 1422.88"
## [1] "66 ) 1452.18"
## [1] "67 ) 1528.34"
## [1] "68 ) 1545.5"
## [1] "69 ) 1458.44"
## [1] "70 ) 1501.54"
## [1] "71 ) 1439.02"
## [1] "72 ) 1526.38"
## [1] "73 ) 1439.96"
## [1] "74 ) 1453.12"
## [1] "75 ) 1660.54"
## [1] "76 ) 1513.1"
## [1] "77 ) 1377.12"
## [1] "78 ) 1457.96"
## [1] "79 ) 1623.08"
## [1] "80 ) 1406.62"
## [1] "81 ) 1478.32"
## [1] "82 ) 1527.52"
## [1] "83 ) 1649.86"
## [1] "84 ) 1446.02"
## [1] "85 ) 1416.5"
## [1] "86 ) 1549.16"
## [1] "87 ) 1546.1"
## [1] "88 ) 1537.54"
## [1] "89 ) 1478.14"
## [1] "90 ) 1464.94"
## [1] "91 ) 1559.12"
## [1] "92 ) 1431.9"
## [1] "93 ) 1458.14"
## [1] "94 ) 1448.2"
## [1] "95 ) 1618.84"
## [1] "96 ) 1525.56"
## [1] "97 ) 1477.38"
## [1] "98 ) 1527.32"
## [1] "99 ) 1381.74"
## [1] "100 ) 1507.22"
sample_means_small
## [1] 1562.32 1447.24 1445.08 1440.26 1517.84 1742.34 1477.92 1503.96
## [9] 1480.06 1500.16 1473.56 1431.14 1453.16 1446.14 1504.22 1432.82
## [17] 1459.76 1418.90 1598.58 1437.30 1491.90 1451.44 1541.92 1462.38
## [25] 1493.46 1616.82 1559.24 1408.20 1537.84 1475.28 1463.54 1550.84
## [33] 1405.60 1417.46 1459.10 1568.98 1424.42 1546.10 1556.36 1502.00
## [41] 1550.12 1497.32 1556.98 1543.34 1373.70 1506.60 1435.52 1477.88
## [49] 1459.30 1461.50 1555.78 1465.48 1553.08 1589.06 1573.92 1404.98
## [57] 1528.60 1639.58 1470.38 1561.48 1437.38 1439.80 1553.16 1408.84
## [65] 1422.88 1452.18 1528.34 1545.50 1458.44 1501.54 1439.02 1526.38
## [73] 1439.96 1453.12 1660.54 1513.10 1377.12 1457.96 1623.08 1406.62
## [81] 1478.32 1527.52 1649.86 1446.02 1416.50 1549.16 1546.10 1537.54
## [89] 1478.14 1464.94 1559.12 1431.90 1458.14 1448.20 1618.84 1525.56
## [97] 1477.38 1527.32 1381.74 1507.22
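Printing inside the loop is not needed to inspect the result. A minimal sketch of an alternative, again using a simulated stand-in for area (an assumption, since the Ames data is not loaded here): fill the vector, then check its size with length() and peek at it with head().

```r
set.seed(606)

# Assumption: synthetic stand-in for the `area` population.
pop <- rgamma(2930, shape = 9, scale = 1500 / 9)

# Preallocate a numeric vector with one slot per sample mean.
sample_means_small <- numeric(100)
for (i in seq_along(sample_means_small)) {
  samp <- sample(pop, 50)                 # one random sample of size 50
  sample_means_small[i] <- mean(samp)     # store that sample's mean
}

length(sample_means_small)   # number of elements, one per sample
head(sample_means_small)     # first few sample means
```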
Based on the findings below, as the sample size increases, the sample means tend to fall closer to the population mean and the spread of the sampling distribution narrows (less variability in the sample mean).
Mean (a single random case):
- mean(area) = 1499.69
- mean(sample_means10) = 1501.595
- mean(sample_means50) = 1498.085
- mean(sample_means100) = 1499.218
Spread (standard deviation, for a single random case):
- sd(sample_means10) = 157.6809
- sd(sample_means50) = 70.85539
- sd(sample_means100) = 49.5489
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
#divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))).
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
mean(area)
## [1] 1499.69
mean(sample_means10)
## [1] 1499.162
mean(sample_means50)
## [1] 1499.545
mean(sample_means100)
## [1] 1499.626
sd(sample_means10)
## [1] 158.3283
sd(sample_means50)
## [1] 70.47117
sd(sample_means100)
## [1] 49.21454
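The simulated spreads can be checked against theory. For samples drawn without replacement from a finite population of size N, the standard deviation of the sample mean is approximately S / sqrt(n) * sqrt(1 - n / N), where S is the population SD as computed by sd(). The sketch below plugs in the numbers printed above; it does not rerun the simulation, and the variable names are illustrative.

```r
# Values copied from the output above.
N      <- 2930        # population size (Ames home sales)
pop_sd <- 505.5089    # sd(area)
n      <- c(10, 50, 100)
observed <- c(158.3283, 70.47117, 49.21454)  # sd of sample_means10/50/100

# Theoretical SD of the sample mean, with finite-population correction.
theory <- pop_sd / sqrt(n) * sqrt(1 - n / N)

comparison <- data.frame(n = n, observed = observed,
                         theory = round(theory, 2),
                         rel_diff = round((observed - theory) / theory, 4))
print(comparison)
```

The simulated and theoretical values agree to within about 1 percent, which supports the reading that the spread shrinks in proportion to 1 / sqrt(n).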
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
Using this sample, my best point estimate of the population mean is $199,771.40.
set.seed(55)
price_50 <- sample(price, 50)
mean(price_50)
## [1] 199771.4
The sampling distribution looks approximately normal. It is unimodal and symmetric. The center (mean) is 180,866.6, and the spread (standard deviation) is 11,239.13. The guess for the mean home price based on this sampling distribution is $180,866.60.
The population mean of home price is $180,796.10.
set.seed(1)
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
par(mfrow = c(1, 1))
hist(sample_means50, breaks = 20)
mean(sample_means50)
## [1] 180866.6
sd(sample_means50)
## [1] 11239.13
mean(price)
## [1] 180796.1
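To put the single-sample estimate of $199,771.40 in context, it helps to express its distance from the population mean in units of the sampling distribution's standard deviation. This is a small arithmetic sketch using the values printed above; the variable names are illustrative.

```r
# Values copied from the output above.
point_estimate <- 199771.4   # mean(price_50), one sample of size 50
pop_mean       <- 180796.1   # mean(price)
sampling_sd    <- 11239.13   # sd(sample_means50) for price

# Distance of the single estimate from the truth, in sampling-SD units.
z <- (point_estimate - pop_mean) / sampling_sd
print(round(z, 2))
```

A value between 1 and 2 suggests this sample's mean was somewhat high but not unusual for samples of size 50.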
The shape of the sampling distribution with a sample size of 150 is bell shaped, symmetric, and unimodal. The center for this particular random sampling is 180865.9. So, based on this sampling distribution, $180,865.90 would be the guess for the mean home price.
Looking at the side by side histogram below, the spread of the sampling distribution of size 50 is much wider than the spread of the sampling distribution of size 150.
- mean(sample_means50): 180866.6
- mean(sample_means150): 180865.9
- sd(sample_means50): 11239.13
- sd(sample_means150): 6319.835
set.seed(2)
sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
mean(sample_means50)
## [1] 180866.6
mean(sample_means150)
## [1] 180865.9
sd(sample_means50)
## [1] 11239.13
sd(sample_means150)
## [1] 6319.835
par(mfrow = c(2, 1))
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)
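The narrowing from size 50 to size 150 is close to what the square-root rule predicts: tripling the sample size should shrink the spread by roughly sqrt(3). A quick check with the values reported above (a sketch; it ignores the finite-population correction):

```r
# Values copied from the output above.
sd_50  <- 11239.13   # sd(sample_means50)
sd_150 <- 6319.835   # sd(sample_means150)

observed_ratio <- sd_50 / sd_150
expected_ratio <- sqrt(150 / 50)   # square-root rule, no correction

print(round(c(observed = observed_ratio, expected = expected_ratio), 3))
```

The observed ratio is slightly larger than sqrt(3), which is the direction one would expect when sampling without replacement from a finite population.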
Of the sampling distributions from 2 and 3, the one with sample size 150 has the smaller spread. If we’re concerned with making estimates that are more often close to the true value, we would prefer a distribution with a small spread.