load("C:\\Users\\jkuruvilla\\Desktop\\Education\\MS Data Analytics - CUNY\\Lab4a\\more\\ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
Exercise 1: Describe this population distribution.
hist(area)
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
The population distribution is unimodal and right skewed. The mean is 1500, the median is 1442, and the values range from 334 to 5642.
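A quick numeric check of right skew is to compare the mean with the median: a long upper tail pulls the mean above the median, as it does here (1500 versus 1442). The sketch below illustrates this on a simulated stand-in population. The name `pop_area`, the population size, and the `rlnorm` parameters are assumptions chosen only to give a roughly similar scale; this is not the Ames data.

```r
# Simulated right-skewed stand-in population (NOT the Ames data).
set.seed(1)
pop_area <- rlnorm(3000, meanlog = log(1450), sdlog = 0.3)

# In a right-skewed distribution the mean exceeds the median.
skew_gap <- mean(pop_area) - median(pop_area)
skew_gap > 0
```

With the real data, the same comparison would be `mean(area) - median(area)`.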
samp1 <- sample(area,50)
Describe the distribution of this sample. How does it compare to the distribution of the population?
hist(samp1)
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 792 1178 1374 1526 1870 2898
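Depending on the random draw, a sample of 50 often misses the large outliers responsible for the right skew in the population. Even so, such samples are usually fairly reflective of the population: here the sample mean (1526) and median (1374) are close to the population mean (1500) and median (1442).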
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2<-sample(area,50)
summary(samp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 784 1180 1640 1638 1972 2855
samp3 <- sample(area,100)
summary(samp3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 498 1051 1419 1426 1714 2589
samp4<- sample(area,1000)
summary(samp4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 498 1141 1438 1500 1740 5642
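The mean of samp2 (1638) is higher than the mean of samp1 (1526); two random samples of the same size generally give different means. Of the two larger samples, one would expect the mean of samp4 (n = 1000) to be closest to the population mean of area, because larger samples give less variable estimates.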
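The reason a larger sample is expected to do better can be made concrete with the standard error of the mean, which is approximately sigma / sqrt(n). The sketch below evaluates it for the three sample sizes discussed. The value of `sigma` is a hypothetical placeholder, not a figure from this lab, and the formula ignores the small correction for sampling without replacement from a finite population.

```r
# Hypothetical population standard deviation (an assumption, not from the lab).
sigma <- 500
n <- c(50, 100, 1000)

# Approximate standard error of the sample mean for each sample size.
se <- sigma / sqrt(n)
data.frame(n = n, se = round(se, 1))
```

The standard error shrinks as n grows, so a mean from n = 1000 typically lands much closer to the population mean than one from n = 50 or n = 100.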
sample_means50 <- rep(NA,5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50,breaks = 25)
Exercise 4 : How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1250 1452 1497 1500 1546 1854
Answer : There are 5000 elements in sample_means50, each the mean of one random sample of 50 homes. The histogram suggests that the sampling distribution is approximately normal. Its center (about 1500) is very close to the true mean living area of the population.
sample2_means50 <- rep(NA,50000)
for (i in 1:50000) {
  samp <- sample(area, 50)
  sample2_means50[i] <- mean(samp)
}
hist(sample2_means50,breaks = 25)
summary(sample2_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1217 1451 1498 1500 1546 1798
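Here the number of sample means increased to 50,000, but the sample size is still 50. The center stays at about 1500 and the spread is essentially unchanged (the quartiles are about 1451 and 1546 in both runs), so the sampling distribution itself does not change. The main effect of more repetitions is a smoother histogram.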
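The distinction between the number of repetitions and the sample size can be checked directly. The sketch below keeps the sample size at 50 and changes only how many sample means are collected. It runs on a simulated stand-in population (`pop`, with assumed `rlnorm` parameters), not the Ames data, and uses `replicate()` as a shorthand for the loops above.

```r
set.seed(42)
# Simulated stand-in population (NOT the Ames data).
pop <- rlnorm(3000, meanlog = log(1450), sdlog = 0.3)

# Sample size stays at 50; only the number of repetitions changes.
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Spread and center of the two simulated sampling distributions.
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
c(center_5k = mean(means_5k), center_50k = mean(means_50k), pop_mean = mean(pop))
```

Both runs should report nearly the same center and spread, which is what the two summaries above also show.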
Exercise 5 : To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(NA,100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
  print(sample_means_small[i])
}
## [1] 1411.2
## [1] 1475.84
## [1] 1528.78
## [1] 1506.62
## [1] 1548.28
## [1] 1493.32
## [1] 1430.52
## [1] 1462.32
## [1] 1567.48
## [1] 1539.5
## [1] 1550.36
## [1] 1509.06
## [1] 1538.56
## [1] 1569.56
## [1] 1418.22
## [1] 1558.78
## [1] 1645.94
## [1] 1452.18
## [1] 1433.28
## [1] 1400.06
## [1] 1425.38
## [1] 1531.14
## [1] 1403.12
## [1] 1394.18
## [1] 1482.56
## [1] 1521.68
## [1] 1496.32
## [1] 1572.24
## [1] 1498.44
## [1] 1535.12
## [1] 1496.22
## [1] 1444.72
## [1] 1560.12
## [1] 1608.96
## [1] 1572.66
## [1] 1640.92
## [1] 1534.56
## [1] 1521.86
## [1] 1522.68
## [1] 1480.28
## [1] 1605.14
## [1] 1542.88
## [1] 1314.82
## [1] 1491.94
## [1] 1590.98
## [1] 1520.06
## [1] 1580.22
## [1] 1516.24
## [1] 1598.86
## [1] 1536.62
## [1] 1513.74
## [1] 1505.8
## [1] 1511.84
## [1] 1529.14
## [1] 1520.48
## [1] 1563.6
## [1] 1501.68
## [1] 1509.6
## [1] 1586.9
## [1] 1485.06
## [1] 1518.94
## [1] 1547.06
## [1] 1537.76
## [1] 1539.8
## [1] 1599.34
## [1] 1429.88
## [1] 1418
## [1] 1431.04
## [1] 1550.7
## [1] 1464.86
## [1] 1424.96
## [1] 1588.46
## [1] 1433.4
## [1] 1472.36
## [1] 1471.8
## [1] 1410.98
## [1] 1352.2
## [1] 1310.54
## [1] 1391.12
## [1] 1439.28
## [1] 1470.52
## [1] 1596.22
## [1] 1478.18
## [1] 1512.42
## [1] 1592.16
## [1] 1415.1
## [1] 1529.86
## [1] 1518.14
## [1] 1410.04
## [1] 1548.94
## [1] 1573.54
## [1] 1549.16
## [1] 1510.12
## [1] 1470.48
## [1] 1599.88
## [1] 1537.04
## [1] 1421.22
## [1] 1511.1
## [1] 1537.66
## [1] 1548.8
There are 100 elements in sample_means_small, and each element represents the mean of one randomly selected sample of 50 living areas.
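The same small experiment can be written without an explicit loop. The sketch below uses base R's `replicate()`, which evaluates an expression a given number of times and collects the results in a vector. The population `pop` is a simulated stand-in with assumed parameters, and `small_means` is a hypothetical name used only for this illustration.

```r
set.seed(7)
# Simulated stand-in population (NOT the Ames data).
pop <- rlnorm(3000, meanlog = log(1450), sdlog = 0.3)

# Draw 100 samples of size 50 and keep the mean of each one.
small_means <- replicate(100, mean(sample(pop, 50)))

length(small_means)  # one element per sample
head(small_means)
```

Each element is still the mean of one sample of 50, so the interpretation is identical to the loop version.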
hist(sample_means50,breaks=25)
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
xlimits
## [1] 1013.8 2200.5
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
Exercise : When the sample size is larger, what happens to the center? What about the spread?
Answer : The center stays in about the same place, near the population mean. The spread shrinks: as the sample size increases, the histogram becomes narrower and its bars near the mean become taller.
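Reading center and spread off three histograms can be backed up with numbers. The sketch below simulates sampling distributions for n = 10, 50, and 100 and tabulates the mean and standard deviation of each. It uses a simulated stand-in population (`pop`, with assumed parameters) rather than the Ames data, so the exact values will differ from the lab's.

```r
set.seed(123)
# Simulated stand-in population (NOT the Ames data).
pop <- rlnorm(3000, meanlog = log(1450), sdlog = 0.3)
sizes <- c(10, 50, 100)

# 5000 sample means for each sample size.
sim <- lapply(sizes, function(n) replicate(5000, mean(sample(pop, n))))

centers <- sapply(sim, mean)  # center of each sampling distribution
spreads <- sapply(sim, sd)    # spread of each sampling distribution
data.frame(n = sizes, center = round(centers), spread = round(spreads, 1))
```

The `center` column should be nearly constant across rows while the `spread` column decreases as n grows.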
Answer :
price_sample <- sample(price,50)
summary(price_sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58500 129800 163500 188200 220800 552000
The point estimate of the population mean from this sample is 188200, the sample mean shown above. The estimate will differ each time the code runs, because a different random sample is selected.
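Because `sample()` draws a new sample on every run, a number quoted in the text can drift away from the printed output. One common way to keep them in agreement is to fix the random seed with `set.seed()` before sampling. The sketch below shows the effect on a simulated stand-in for price; the vector `prices`, its parameters, and the seed values are arbitrary assumptions.

```r
set.seed(1)
# Simulated stand-in for sale prices (NOT the Ames data).
prices <- rlnorm(3000, meanlog = log(165000), sdlog = 0.4)

set.seed(2024)                      # fix the seed
est_a <- mean(sample(prices, 50))

set.seed(2024)                      # same seed gives the same sample
est_b <- mean(sample(prices, 50))

est_c <- mean(sample(prices, 50))   # no reset: a different sample

identical(est_a, est_b)
c(est_a = est_a, est_c = est_c)
```

With a fixed seed, rerunning or re-knitting the document reproduces the same point estimate.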
Answer :
sample_means50 <- rep(NA,5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
The bell-shaped, unimodal histogram suggests that the sampling distribution of the mean price is approximately normal, with its center close to 180000.
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 145100 173300 180700 181000 188200 228900
Based on these statistics of the sampling distribution, the mean home price of the population would be close to 181000.
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129500 160000 180800 213500 755000
From the summary of the population, the actual population mean is 180800, which is very close to the estimate of 181000.
Answer :
sample_means150 <- rep(NA,5000)
for (i in 1:5000){
samp<- sample(price,150)
sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)
The bell-shaped, unimodal histogram suggests that the sampling distribution of the mean price is approximately normal, with its center close to 180000. It is also narrower than the distribution of sample means for samples of size 50.
summary(sample_means150)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 160600 176200 180500 180700 184900 203400
This summary suggests that as the sample size gets bigger, the sampling distribution provides more accurate estimates of the population mean: the sample means now range from 160600 to 203400, compared with 145100 to 228900 for samples of size 50. Based on this sampling distribution, the mean sale price of homes in Ames would be close to 180700.
Answer : The sampling distribution of the means of samples of size 150 (from 3) has the smaller spread. If we are concerned with making estimates that are more often close to the true value, we should prefer a distribution with a small spread.
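"More often close to the true value" can be quantified as the share of sample means that fall within some tolerance of the population mean. The sketch below compares that share for samples of size 50 and 150. The simulated `prices` vector, the 5% tolerance, and the helper `within_5pct` are all assumptions made for illustration, not part of the lab.

```r
set.seed(99)
# Simulated stand-in for sale prices (NOT the Ames data).
prices <- rlnorm(3000, meanlog = log(165000), sdlog = 0.4)
true_mean <- mean(prices)

means_50  <- replicate(5000, mean(sample(prices, 50)))
means_150 <- replicate(5000, mean(sample(prices, 150)))

# Hypothetical helper: share of estimates within 5% of the true mean.
within_5pct <- function(m) mean(abs(m - true_mean) / true_mean < 0.05)

c(n50 = within_5pct(means_50), n150 = within_5pct(means_150))
```

The larger sample size should give a clearly higher share, which is the practical payoff of the smaller spread.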