
Assign variables

Population: Residential home sales in Ames, Iowa between 2006 and 2010 (2,930 observations)

  • area: above ground living area of the house in square feet
  • price: sales price
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

(1) Describe this population distribution.

Area (square feet)

  • Min 334 sqft, Max 5642 sqft
  • Mean 1500 sqft
  • Median 1442 sqft
  • IQR 616.75 sqft
  • SD 505.5089 sqft

The histogram of area is unimodal and skewed to the right, so it is not a symmetric bell curve. I ran qqnormsim at least 5 times, and none of the simulated normal probability plots looked similar to the normal probability plot of area. So, I do not think that area is normally distributed.

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
IQR(area)
## [1] 616.75
sd(area)
## [1] 505.5089
hist(area)

qqnorm(area)
qqline(area)

DATA606::qqnormsim(area)
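
The visual judgment of right skew can also be backed by a number. The following is only a rough sketch: `fake_area` is a hypothetical, synthetic right-skewed population standing in for `area` (the real `ames` data is not loaded here), and `skewness()` is a small helper written for this illustration, not a function from DATA606 or base R.

```r
# Hypothetical stand-in for `area`: a synthetic right-skewed population.
set.seed(606)
fake_area <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Simple moment-based sample skewness (illustrative helper):
# roughly 0 for symmetric data, positive for right-skewed data.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_fake   <- skewness(fake_area)
skew_normal <- skewness(rnorm(2930))

# Right skew also tends to pull the mean above the median.
mean_above_median <- mean(fake_area) > median(fake_area)

print(c(skew_fake = skew_fake, skew_normal = skew_normal))
print(mean_above_median)
```

Running the same helper on the real `area` vector should give a clearly positive value if the histogram is as right-skewed as it looks.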

Price

  • Min $12,789, Max $755,000
  • Mean $180,796
  • Median $160,000
  • IQR $84,000
  • SD $79,886.69

The histogram of price is unimodal and skewed to the right, so it is not a symmetric bell curve. I ran qqnormsim at least 5 times, and none of the simulated normal probability plots looked similar to the normal probability plot of price. So, I do not think that price is normally distributed.

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000
IQR(price)
## [1] 84000
sd(price)
## [1] 79886.69
hist(price)

qqnorm(price)
qqline(price)

DATA606::qqnormsim(price)

(2) Describe the distribution of this sample. How does it compare to the distribution of the population?

For this particular sample of 50 observations, the histogram of the sample is also skewed to the right. It looks unimodal. The mean of the sample is 1491 sqft while the population mean is 1500 sqft.

  • Min 864 sqft (population 334 sqft)
  • Max 2730 sqft (population 5642 sqft)
  • Mean 1491 sqft (population 1500 sqft)
  • Median 1463 sqft (population 1442 sqft)
  • IQR 461 sqft (population 616.75 sqft)
  • SD 401 sqft (population 505.5089 sqft)

#simple random sample
set.seed(1)
samp1 <- sample(area, 50) 
hist(samp1)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     864    1218    1463    1491    1679    2730
IQR(samp1)
## [1] 461
sd(samp1)
## [1] 401.2159
hist(samp1)

mean(samp1)
## [1] 1491.38

(3) Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

The mean of samp2 is 1519, and the mean of samp1 is 1491. So in this particular case, the mean of samp2 is bigger than the mean of samp1.

I would expect a larger sample to generally give a mean that is closer to the population mean, so the sample of size 1000 should provide the more accurate estimate. Below, I drew 2 samples of size 100 and 2 samples of size 1000. The population mean is 1500. The 2 samples of size 100 had means of 1531 and 1480. The 2 samples of size 1000 had means of 1499 and 1508. So, in this particular case, the samples of size 1000 have means that are closer to the population mean.

set.seed(2)
samp2 <- sample(area, 50)
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     845    1180    1468    1519    1786    2822
set.seed(10)
samp3_100 <- sample(area, 100)
set.seed(15)
samp4_100 <- sample(area, 100)

set.seed(20)
samp5_1000 <- sample(area, 1000)
set.seed(25)
samp6_1000 <- sample(area, 1000)

summary(samp3_100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     666    1143    1461    1531    1837    5095
summary(samp4_100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     796    1056    1420    1480    1741    3086
summary(samp5_1000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    1441    1499    1771    3820
summary(samp6_1000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1127    1440    1508    1769    5642
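
Two samples per size is a small basis for a comparison, so here is a hedged sketch that repeats the draw many times and averages the estimation error. It uses `fake_area`, a hypothetical synthetic population (not the real `area` vector), and `mean_abs_error()` is a helper name made up for this illustration.

```r
# Hypothetical right-skewed population standing in for `area`.
set.seed(606)
fake_area <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
pop_mean  <- mean(fake_area)

# Average absolute error of the sample mean over many repeated samples of size n.
mean_abs_error <- function(n, reps = 2000) {
  errs <- replicate(reps, abs(mean(sample(fake_area, n)) - pop_mean))
  mean(errs)
}

sizes  <- c(50, 100, 1000)
errors <- sapply(sizes, mean_abs_error)
names(errors) <- paste0("n=", sizes)

print(round(errors, 1))
```

If the reasoning above holds, the average error should shrink as the sample size grows from 50 to 100 to 1000.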

(4) How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

In sample_means50, there are 5000 elements, since the code takes 5000 samples of size 50. The distribution looks unimodal and symmetric with a mean of about 1500. The Q-Q plot for sample_means50 shows that the distribution is approximately normal.

No, I do not expect the distribution to change noticeably if we instead collected 50,000 sample means. Below I created sample_means50_2, which has 50,000 sample means. The distribution is essentially the same and is still centered around the population mean of 1500.

#NOTE: Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called  `sample_means50`. 

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50, breaks=25)

summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1272    1453    1498    1500    1546    1836
qqnorm(sample_means50)
qqline(sample_means50)

DATA606::qqnormsim(sample_means50)

sample_means50_2 <- rep(NA, 50000)

for(i in 1:50000){
   samp <- sample(area, 50)
   sample_means50_2[i] <- mean(samp)
}

hist(sample_means50_2, breaks=25)

summary(sample_means50_2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1215    1451    1498    1500    1547    1827
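
The same comparison can be written more compactly. This is only a sketch under stated assumptions: `fake_area` is a hypothetical synthetic population rather than the real `area`, and `sim_means()` is an illustrative helper. It compares the center and spread of 5,000 versus 50,000 sample means, each from samples of size 50.

```r
# Hypothetical right-skewed population standing in for `area`.
set.seed(606)
fake_area <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Draw `reps` samples of size n and return the vector of their means.
sim_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(fake_area, n)))
}

few  <- sim_means(5000)
many <- sim_means(50000)

comparison <- rbind(
  few  = c(mean = mean(few),  sd = sd(few)),
  many = c(mean = mean(many), sd = sd(many))
)
print(round(comparison, 1))
```

The number of sample means controls how smooth the histogram looks, while the center and spread are governed by the sample size (50 in both cases), so the two rows should be close.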

(5) To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

There are 100 elements in the sample_means_small object. Each element is the mean of one random sample of size 50 drawn from area.

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   print(paste(i, ")", sample_means_small[i]))
}
## [1] "1 ) 1562.32"
## [1] "2 ) 1447.24"
## [1] "3 ) 1445.08"
## [1] "4 ) 1440.26"
## [1] "5 ) 1517.84"
## [1] "6 ) 1742.34"
## [1] "7 ) 1477.92"
## [1] "8 ) 1503.96"
## [1] "9 ) 1480.06"
## [1] "10 ) 1500.16"
## [1] "11 ) 1473.56"
## [1] "12 ) 1431.14"
## [1] "13 ) 1453.16"
## [1] "14 ) 1446.14"
## [1] "15 ) 1504.22"
## [1] "16 ) 1432.82"
## [1] "17 ) 1459.76"
## [1] "18 ) 1418.9"
## [1] "19 ) 1598.58"
## [1] "20 ) 1437.3"
## [1] "21 ) 1491.9"
## [1] "22 ) 1451.44"
## [1] "23 ) 1541.92"
## [1] "24 ) 1462.38"
## [1] "25 ) 1493.46"
## [1] "26 ) 1616.82"
## [1] "27 ) 1559.24"
## [1] "28 ) 1408.2"
## [1] "29 ) 1537.84"
## [1] "30 ) 1475.28"
## [1] "31 ) 1463.54"
## [1] "32 ) 1550.84"
## [1] "33 ) 1405.6"
## [1] "34 ) 1417.46"
## [1] "35 ) 1459.1"
## [1] "36 ) 1568.98"
## [1] "37 ) 1424.42"
## [1] "38 ) 1546.1"
## [1] "39 ) 1556.36"
## [1] "40 ) 1502"
## [1] "41 ) 1550.12"
## [1] "42 ) 1497.32"
## [1] "43 ) 1556.98"
## [1] "44 ) 1543.34"
## [1] "45 ) 1373.7"
## [1] "46 ) 1506.6"
## [1] "47 ) 1435.52"
## [1] "48 ) 1477.88"
## [1] "49 ) 1459.3"
## [1] "50 ) 1461.5"
## [1] "51 ) 1555.78"
## [1] "52 ) 1465.48"
## [1] "53 ) 1553.08"
## [1] "54 ) 1589.06"
## [1] "55 ) 1573.92"
## [1] "56 ) 1404.98"
## [1] "57 ) 1528.6"
## [1] "58 ) 1639.58"
## [1] "59 ) 1470.38"
## [1] "60 ) 1561.48"
## [1] "61 ) 1437.38"
## [1] "62 ) 1439.8"
## [1] "63 ) 1553.16"
## [1] "64 ) 1408.84"
## [1] "65 ) 1422.88"
## [1] "66 ) 1452.18"
## [1] "67 ) 1528.34"
## [1] "68 ) 1545.5"
## [1] "69 ) 1458.44"
## [1] "70 ) 1501.54"
## [1] "71 ) 1439.02"
## [1] "72 ) 1526.38"
## [1] "73 ) 1439.96"
## [1] "74 ) 1453.12"
## [1] "75 ) 1660.54"
## [1] "76 ) 1513.1"
## [1] "77 ) 1377.12"
## [1] "78 ) 1457.96"
## [1] "79 ) 1623.08"
## [1] "80 ) 1406.62"
## [1] "81 ) 1478.32"
## [1] "82 ) 1527.52"
## [1] "83 ) 1649.86"
## [1] "84 ) 1446.02"
## [1] "85 ) 1416.5"
## [1] "86 ) 1549.16"
## [1] "87 ) 1546.1"
## [1] "88 ) 1537.54"
## [1] "89 ) 1478.14"
## [1] "90 ) 1464.94"
## [1] "91 ) 1559.12"
## [1] "92 ) 1431.9"
## [1] "93 ) 1458.14"
## [1] "94 ) 1448.2"
## [1] "95 ) 1618.84"
## [1] "96 ) 1525.56"
## [1] "97 ) 1477.38"
## [1] "98 ) 1527.32"
## [1] "99 ) 1381.74"
## [1] "100 ) 1507.22"
sample_means_small
##   [1] 1562.32 1447.24 1445.08 1440.26 1517.84 1742.34 1477.92 1503.96
##   [9] 1480.06 1500.16 1473.56 1431.14 1453.16 1446.14 1504.22 1432.82
##  [17] 1459.76 1418.90 1598.58 1437.30 1491.90 1451.44 1541.92 1462.38
##  [25] 1493.46 1616.82 1559.24 1408.20 1537.84 1475.28 1463.54 1550.84
##  [33] 1405.60 1417.46 1459.10 1568.98 1424.42 1546.10 1556.36 1502.00
##  [41] 1550.12 1497.32 1556.98 1543.34 1373.70 1506.60 1435.52 1477.88
##  [49] 1459.30 1461.50 1555.78 1465.48 1553.08 1589.06 1573.92 1404.98
##  [57] 1528.60 1639.58 1470.38 1561.48 1437.38 1439.80 1553.16 1408.84
##  [65] 1422.88 1452.18 1528.34 1545.50 1458.44 1501.54 1439.02 1526.38
##  [73] 1439.96 1453.12 1660.54 1513.10 1377.12 1457.96 1623.08 1406.62
##  [81] 1478.32 1527.52 1649.86 1446.02 1416.50 1549.16 1546.10 1537.54
##  [89] 1478.14 1464.94 1559.12 1431.90 1458.14 1448.20 1618.84 1525.56
##  [97] 1477.38 1527.32 1381.74 1507.22
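
To check my understanding of the loop, here is a sketch that builds the same kind of vector two ways: with an explicit `for` loop and with base R's `replicate()`. The population `fake_area` is a hypothetical synthetic stand-in for `area`, and the seeds are arbitrary choices for the illustration.

```r
# Hypothetical right-skewed population standing in for `area`.
set.seed(606)
fake_area <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Version 1: explicit loop, one sample mean per iteration.
set.seed(1)
loop_means <- rep(NA_real_, 100)
for (i in seq_along(loop_means)) {
  loop_means[i] <- mean(sample(fake_area, 50))
}

# Version 2: replicate() evaluates the expression 100 times and collects the results.
set.seed(1)
rep_means <- replicate(100, mean(sample(fake_area, 50)))

print(length(rep_means))
print(all.equal(loop_means, rep_means))
```

Because both versions reset the seed and call `sample()` the same number of times in the same order, they should produce matching vectors of 100 sample means.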

(6) When the sample size is larger, what happens to the center? What about the spread?

Based on the findings below, when the sample size is larger, the mean of the sample means stays very close to the population mean (slightly closer in this run), and the spread gets narrower (less variability in the sample mean).

Mean (from the single random run shown below; no seed is set here, so the values change slightly between runs):

  • mean(area) = 1499.69
  • mean(sample_means10) = 1499.162
  • mean(sample_means50) = 1499.545
  • mean(sample_means100) = 1499.626

Spread (standard deviation, from the same run):

  • sd(sample_means10) = 158.3283
  • sd(sample_means50) = 70.47117
  • sd(sample_means100) = 49.21454

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

#divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). 
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

mean(area)
## [1] 1499.69
mean(sample_means10)
## [1] 1499.162
mean(sample_means50)
## [1] 1499.545
mean(sample_means100)
## [1] 1499.626
sd(sample_means10)
## [1] 158.3283
sd(sample_means50)
## [1] 70.47117
sd(sample_means100)
## [1] 49.21454
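
The shrinking spread matches what the standard error formula predicts. As a hedged check, the sketch below plugs in the population SD reported by sd(area) above (505.5089) and the population size (2,930). Since sample() draws without replacement, it also applies a finite population correction; the variable names are my own.

```r
S <- 505.5089   # sd(area), from the output earlier in this lab
N <- 2930       # number of homes in the population

n <- c(10, 50, 100)

# Standard error of the mean, ignoring the finite population.
se_simple <- S / sqrt(n)

# sample() draws without replacement, so apply the finite population correction.
se_fpc <- se_simple * sqrt(1 - n / N)

print(data.frame(n,
                 se_simple = round(se_simple, 2),
                 se_fpc    = round(se_fpc, 2)))
```

The corrected values should land close to the simulated sd(sample_means10), sd(sample_means50), and sd(sample_means100) printed above, with small differences due to simulation noise.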

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

(1) Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

Using this sample, my best point estimate of the population mean is $199,771.40.

set.seed(55)
price_50 <- sample(price, 50)
mean(price_50)
## [1] 199771.4

(2) Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

The distribution looks approximately normal. It is unimodal and symmetric. The center (mean) is 180,866.6, and the spread (standard deviation) is 11,239.13. My guess for the mean home price based on this sampling distribution is $180,866.60.

The population mean of home price is $180,796.10.

set.seed(1)
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

par(mfrow = c(1, 1))
hist(sample_means50, breaks = 20)

mean(sample_means50)
## [1] 180866.6
sd(sample_means50)
## [1] 11239.13
mean(price)
## [1] 180796.1

(3) Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

The sampling distribution for a sample size of 150 is bell shaped, symmetric, and unimodal. The center for this particular set of random samples is 180865.9. So, based on this sampling distribution, $180,865.90 would be my guess for the mean home price.

Looking at the stacked histograms below, the spread of the sampling distribution for samples of size 50 is much wider than the spread for samples of size 150.

  • mean(sample_means50): 180866.6
  • mean(sample_means150): 180865.9
  • sd(sample_means50): 11239.13
  • sd(sample_means150): 6319.835
set.seed(2)
sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

mean(sample_means50)
## [1] 180866.6
mean(sample_means150)
## [1] 180865.9
sd(sample_means50)
## [1] 11239.13
sd(sample_means150)
## [1] 6319.835
par(mfrow = c(2, 1))
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)
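
How much wider is the size-50 distribution? As a rough check, the sketch below compares the observed ratio of the two standard deviations printed above with the ratio predicted by the standard error formula, including a finite population correction because sample() draws without replacement. The variable names are my own.

```r
sd_50  <- 11239.13   # sd(sample_means50), from the output above
sd_150 <- 6319.835   # sd(sample_means150), from the output above
N      <- 2930       # number of homes in the population

observed_ratio <- sd_50 / sd_150

# Predicted ratio from SE = S / sqrt(n) * sqrt(1 - n / N); the population SD cancels.
predicted_ratio <- sqrt(150 / 50) * sqrt((1 - 50 / N) / (1 - 150 / N))

print(c(observed = observed_ratio, predicted = predicted_ratio))
```

Tripling the sample size should shrink the spread by a factor of roughly the square root of 3, and the observed ratio should be close to that prediction.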

(4) Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Of the sampling distributions from 2 and 3, the one for samples of size 150 has the smaller spread. If we’re concerned with making estimates that are more often close to the true value, we would prefer a distribution with a small spread.