Lab 4a - Chapter 4 Foundations for statistical inference

load("C:\\Users\\jkuruvilla\\Desktop\\Education\\MS Data Analytics - CUNY\\Lab4a\\more\\ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Exercise 1: Describe this population distribution.

hist(area)

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

The population distribution is right-skewed and unimodal, with a mean of 1500, a median of 1442, and a range of 334 to 5642.

samp1 <- sample(area,50)

Describe the distribution of this sample. How does it compare to the distribution of the population?

hist(samp1)

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     792    1178    1374    1526    1870    2898

Depending on the random draw, a sample often misses the outliers responsible for the right skew; here the sample maximum is 2898, versus 5642 in the population. Even so, random samples of this size are usually fairly reflective of the population.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2<-sample(area,50)
summary(samp2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     784    1180    1640    1638    1972    2855

samp3 <- sample(area,100)
summary(samp3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     498    1051    1419    1426    1714    2589

samp4<- sample(area,1000)
summary(samp4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     498    1141    1438    1500    1740    5642

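The mean of samp2 (1638) is noticeably higher than the mean of samp1 (1526), which shows how much sample means can vary at n = 50. Of the two larger samples, one would expect the mean of samp4 (n = 1000) to be closest to the population mean of area; here it is 1500, versus 1426 for samp3 (n = 100).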
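The reason a larger sample is expected to be more accurate can be written down with the standard error of the sample mean. The symbols below are not part of the lab itself: σ is assumed to be the population standard deviation of area, N the number of homes in the population, and n the sample size. The second factor is the usual finite population correction, which applies because sample() draws without replacement by default.

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
```

Since n sits in the denominator, moving from n = 50 to n = 1000 shrinks the typical error of the sample mean by at least a factor of √20 ≈ 4.5.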

# Draw 5000 samples of size 50 and store each sample mean
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)

Exercise 4 : How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1250    1452    1497    1500    1546    1854

Answer : There are 5000 elements in sample_means50, each one the mean of a random sample of 50 living areas. The histogram suggests that the sampling distribution of the mean is approximately normal (unimodal and symmetric) and is centered at about 1500, the true mean living area of the population.

# Same simulation with 50,000 samples; each sample still has size 50
sample2_means50 <- rep(NA, 50000)
for (i in 1:50000) {
  samp <- sample(area, 50)
  sample2_means50[i] <- mean(samp)
}
hist(sample2_means50, breaks = 25)

summary(sample2_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1217    1451    1498    1500    1546    1798

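With 50,000 sample means instead of 5,000 (each sample still has size 50), the sampling distribution does not change in any meaningful way: the mean is still 1500 and the quartiles are nearly identical. Taking more samples only makes the histogram smoother; it does not shift the center or narrow the spread.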
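The effect of the number of simulated samples can be checked directly. The sketch below is only illustrative: it assumes a hypothetical right-skewed stand-in vector pop (generated with rlnorm(), not the Ames data) and a helper sim_means() that is not part of the lab, so that it runs without ames.RData.

```r
# Stand-in population: NOT the Ames data, just a right-skewed vector for illustration.
set.seed(42)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Hypothetical helper: draw `reps` samples of size n and return their means.
sim_means <- function(reps, n = 50) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(5000)
means_50k <- sim_means(50000)

# Center: both sets of means should sit close to the population mean.
c(population = mean(pop), reps_5k = mean(means_5k), reps_50k = mean(means_50k))

# Spread: governed by the sample size n, not by the number of replicates.
c(reps_5k = sd(means_5k), reps_50k = sd(means_50k))
```

The two standard deviations should come out nearly equal, because both simulations use n = 50; only the smoothness of the histogram depends on the number of replicates.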

Exercise 5 : To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(NA, 100)
for (i in 1:100) {
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
  print(sample_means_small[i])  # print each mean as it is computed
}

## [1] 1411.2
## [1] 1475.84
## [1] 1528.78
## [1] 1506.62
## [1] 1548.28
## [1] 1493.32
## [1] 1430.52
## [1] 1462.32
## [1] 1567.48
## [1] 1539.5
## [1] 1550.36
## [1] 1509.06
## [1] 1538.56
## [1] 1569.56
## [1] 1418.22
## [1] 1558.78
## [1] 1645.94
## [1] 1452.18
## [1] 1433.28
## [1] 1400.06
## [1] 1425.38
## [1] 1531.14
## [1] 1403.12
## [1] 1394.18
## [1] 1482.56
## [1] 1521.68
## [1] 1496.32
## [1] 1572.24
## [1] 1498.44
## [1] 1535.12
## [1] 1496.22
## [1] 1444.72
## [1] 1560.12
## [1] 1608.96
## [1] 1572.66
## [1] 1640.92
## [1] 1534.56
## [1] 1521.86
## [1] 1522.68
## [1] 1480.28
## [1] 1605.14
## [1] 1542.88
## [1] 1314.82
## [1] 1491.94
## [1] 1590.98
## [1] 1520.06
## [1] 1580.22
## [1] 1516.24
## [1] 1598.86
## [1] 1536.62
## [1] 1513.74
## [1] 1505.8
## [1] 1511.84
## [1] 1529.14
## [1] 1520.48
## [1] 1563.6
## [1] 1501.68
## [1] 1509.6
## [1] 1586.9
## [1] 1485.06
## [1] 1518.94
## [1] 1547.06
## [1] 1537.76
## [1] 1539.8
## [1] 1599.34
## [1] 1429.88
## [1] 1418
## [1] 1431.04
## [1] 1550.7
## [1] 1464.86
## [1] 1424.96
## [1] 1588.46
## [1] 1433.4
## [1] 1472.36
## [1] 1471.8
## [1] 1410.98
## [1] 1352.2
## [1] 1310.54
## [1] 1391.12
## [1] 1439.28
## [1] 1470.52
## [1] 1596.22
## [1] 1478.18
## [1] 1512.42
## [1] 1592.16
## [1] 1415.1
## [1] 1529.86
## [1] 1518.14
## [1] 1410.04
## [1] 1548.94
## [1] 1573.54
## [1] 1549.16
## [1] 1510.12
## [1] 1470.48
## [1] 1599.88
## [1] 1537.04
## [1] 1421.22
## [1] 1511.1
## [1] 1537.66
## [1] 1548.8

There are 100 elements in sample_means_small, and each element represents the mean of one randomly selected sample of 50 living areas.
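The exercise wording asks for a vector of 100 zeros that is printed once after the loop finishes. A minimal sketch of that variant follows; it assumes a hypothetical stand-in vector pop in place of area so that it runs without ames.RData.

```r
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)  # hypothetical stand-in for `area`

# Initialize with zeros, as the exercise describes, then fill one slot per iteration.
sample_means_small <- rep(0, 100)
for (i in seq_along(sample_means_small)) {
  sample_means_small[i] <- mean(sample(pop, 50))
}

sample_means_small          # print the whole vector once
length(sample_means_small)  # number of stored sample means
```

Every zero is overwritten by the loop, so the starting value (0 or NA) does not affect the result; NA has the advantage of making an unfilled slot easy to spot.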

hist(sample_means50,breaks=25)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)
xlimits

## [1] 1013.8 2200.5

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

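Exercise : When the sample size is larger, what happens to the center? What about the spread?

Answer : The center stays essentially the same, near the population mean, for every sample size. The spread shrinks: as the sample size increases, the range of the sample means becomes narrower and the bars near the center of the histogram become taller.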
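The shrinking spread can be compared against the theoretical standard error. This is a sketch under stated assumptions: pop is a hypothetical right-skewed stand-in for area (not the Ames data), and the formula used is the standard error of the mean for sampling without replacement, which matches the default behavior of sample().

```r
set.seed(7)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)  # hypothetical stand-in for `area`
N <- length(pop)

# Population standard deviation (divide by N, not N - 1).
sigma <- sqrt(mean((pop - mean(pop))^2))

sizes <- c(10, 50, 100)

# Empirical spread: sd of 5000 simulated sample means for each sample size.
empirical <- sapply(sizes, function(n) sd(replicate(5000, mean(sample(pop, n)))))

# Theoretical standard error with the finite population correction.
theoretical <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(n = sizes,
           empirical = round(empirical, 1),
           theoretical = round(theoretical, 1))
```

The two columns should agree closely and both should decrease as n grows, which is the narrowing seen in the three stacked histograms.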

On Your Own

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

Answer :

price_sample <- sample(price,50)
summary(price_sample)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58500  129800  163500  188200  220800  552000

The best point estimate of the population mean is the sample mean, 188200. The value will differ each time the code is rerun, because a different random sample is selected.

Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

Answer :

# Note: this overwrites the sample_means50 vector computed from area earlier
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)

The bell-shaped, unimodal histogram suggests that the sampling distribution of the mean price is approximately normal, with a mean close to 180000.

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  145100  173300  180700  181000  188200  228900

Based on these statistics of the sampling distribution, the mean home price of the population should be close to 181000.

summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000

From the summary of the population, the actual population mean is 180800, which is very close to the estimate of 181000.

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

Answer :

sample_means150 <- rep(NA,5000)
for (i in 1:5000){
  samp<- sample(price,150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)

The bell-shaped, unimodal histogram suggests that the sampling distribution of the mean price is approximately normal, with a mean close to 180000. It is also narrower than the distribution of sample means for a sample size of 50.

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  160600  176200  180500  180700  184900  203400

This summary suggests that as the sample size gets bigger, the sampling distribution becomes more concentrated around the population mean, so a single sample mean is a more accurate estimate. Based on this sampling distribution, the mean sale price of homes in Ames would be close to 180700.

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Answer : The sampling distribution of the sample means for a sample size of 150 (in 3) has the smaller spread. If we are concerned with making estimates that are more often close to the true value, we should prefer a distribution with a small spread.
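As a rough check on how much smaller the spread should be, the standard error of the mean scales with one over the square root of the sample size. The calculation below ignores the finite population correction and uses the interquartile ranges from the two summary() outputs above as a stand-in for the spread; SE and IQR are notation introduced here, not part of the lab.

```latex
\frac{\mathrm{SE}_{50}}{\mathrm{SE}_{150}} \approx \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73
\qquad
\frac{\mathrm{IQR}_{50}}{\mathrm{IQR}_{150}} = \frac{188200 - 173300}{184900 - 176200} = \frac{14900}{8700} \approx 1.71
```

The observed ratio of about 1.71 is close to the predicted 1.73, so tripling the sample size cut the spread by roughly the expected amount.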

James Kuruvilla

March 8, 2017
