Data Analysis Lab3

Lab 3A: Foundations for inference - Sampling distributions

Complete all Exercises, and submit answers to Questions on the Coursera platform. Note that the order of the choices in multiple choice questions may be different on the Duke Coursera platform than the order in this document. In Part A of this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.

The data We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest (which is rare to have access to), but we will also work with smaller samples from this population. Let’s load the data.

load(url("http://www.openintro.org/stat/data/ames.RData"))

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis.

names(ames)

##  [1] "Order"           "PID"             "MS.SubClass"    
##  [4] "MS.Zoning"       "Lot.Frontage"    "Lot.Area"       
##  [7] "Street"          "Alley"           "Lot.Shape"      
## [10] "Land.Contour"    "Utilities"       "Lot.Config"     
## [13] "Land.Slope"      "Neighborhood"    "Condition.1"    
## [16] "Condition.2"     "Bldg.Type"       "House.Style"    
## [19] "Overall.Qual"    "Overall.Cond"    "Year.Built"     
## [22] "Year.Remod.Add"  "Roof.Style"      "Roof.Matl"      
## [25] "Exterior.1st"    "Exterior.2nd"    "Mas.Vnr.Type"   
## [28] "Mas.Vnr.Area"    "Exter.Qual"      "Exter.Cond"     
## [31] "Foundation"      "Bsmt.Qual"       "Bsmt.Cond"      
## [34] "Bsmt.Exposure"   "BsmtFin.Type.1"  "BsmtFin.SF.1"   
## [37] "BsmtFin.Type.2"  "BsmtFin.SF.2"    "Bsmt.Unf.SF"    
## [40] "Total.Bsmt.SF"   "Heating"         "Heating.QC"     
## [43] "Central.Air"     "Electrical"      "X1st.Flr.SF"    
## [46] "X2nd.Flr.SF"     "Low.Qual.Fin.SF" "Gr.Liv.Area"    
## [49] "Bsmt.Full.Bath"  "Bsmt.Half.Bath"  "Full.Bath"      
## [52] "Half.Bath"       "Bedroom.AbvGr"   "Kitchen.AbvGr"  
## [55] "Kitchen.Qual"    "TotRms.AbvGrd"   "Functional"     
## [58] "Fireplaces"      "Fireplace.Qu"    "Garage.Type"    
## [61] "Garage.Yr.Blt"   "Garage.Finish"   "Garage.Cars"    
## [64] "Garage.Area"     "Garage.Qual"     "Garage.Cond"    
## [67] "Paved.Drive"     "Wood.Deck.SF"    "Open.Porch.SF"  
## [70] "Enclosed.Porch"  "X3Ssn.Porch"     "Screen.Porch"   
## [73] "Pool.Area"       "Pool.QC"         "Fence"          
## [76] "Misc.Feature"    "Misc.Val"        "Mo.Sold"        
## [79] "Yr.Sold"         "Sale.Type"       "Sale.Condition" 
## [82] "SalePrice"

sapply(ames, class)

##           Order             PID     MS.SubClass       MS.Zoning 
##       "integer"       "integer"       "integer"        "factor" 
##    Lot.Frontage        Lot.Area          Street           Alley 
##       "integer"       "integer"        "factor"        "factor" 
##       Lot.Shape    Land.Contour       Utilities      Lot.Config 
##        "factor"        "factor"        "factor"        "factor" 
##      Land.Slope    Neighborhood     Condition.1     Condition.2 
##        "factor"        "factor"        "factor"        "factor" 
##       Bldg.Type     House.Style    Overall.Qual    Overall.Cond 
##        "factor"        "factor"       "integer"       "integer" 
##      Year.Built  Year.Remod.Add      Roof.Style       Roof.Matl 
##       "integer"       "integer"        "factor"        "factor" 
##    Exterior.1st    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area 
##        "factor"        "factor"        "factor"       "integer" 
##      Exter.Qual      Exter.Cond      Foundation       Bsmt.Qual 
##        "factor"        "factor"        "factor"        "factor" 
##       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1    BsmtFin.SF.1 
##        "factor"        "factor"        "factor"       "integer" 
##  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF   Total.Bsmt.SF 
##        "factor"       "integer"       "integer"       "integer" 
##         Heating      Heating.QC     Central.Air      Electrical 
##        "factor"        "factor"        "factor"        "factor" 
##     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF     Gr.Liv.Area 
##       "integer"       "integer"       "integer"       "integer" 
##  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath       Half.Bath 
##       "integer"       "integer"       "integer"       "integer" 
##   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##       "integer"       "integer"        "factor"       "integer" 
##      Functional      Fireplaces    Fireplace.Qu     Garage.Type 
##        "factor"       "integer"        "factor"        "factor" 
##   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##       "integer"        "factor"       "integer"       "integer" 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF 
##        "factor"        "factor"        "factor"       "integer" 
##   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##       "integer"       "integer"       "integer"       "integer" 
##       Pool.Area         Pool.QC           Fence    Misc.Feature 
##       "integer"        "factor"        "factor"        "factor" 
##        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##       "integer"       "integer"       "integer"        "factor" 
##  Sale.Condition       SalePrice 
##        "factor"       "integer"

#summary(ames)

For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet () and the sale price (). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Question 1.Which of the following is false?

The distribution of areas of houses in Ames is unimodal and right-skewed.
50% of houses in Ames are smaller than 1,500 square feet.

median(area, na.rm = TRUE)

## [1] 1442

pnorm(1500, mean(area), sd(area), lower.tail = FALSE)

## [1] 0.4997557

The middle 50% of the houses range between approximately 1,130 square feet and 1,740 square feet.

quantile(area)

##      0%     25%     50%     75%    100% 
##  334.00 1126.00 1442.00 1742.75 5642.00

The IQR is approximately 610 square feet.

IQR(area, na.rm = TRUE)

## [1] 616.75

The smallest house is 334 square feet and the largest is 5,642 square feet.

min(area)

## [1] 334

max(area)

## [1] 5642

The unknown sampling distribution In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If we were interested in estimating the mean living area in Ames based on a sample, we can use the sample function to sample from the population.

samp0 <- sample(area, 50)
samp0

##  [1]  936  864 1336 1098 1053 1513 1056  861 1838  845 1984 1738 1178 1440
## [15] 1707 1114 1680  949 1390  788 1232 1008 1350 1850 1055 1569 1178  988
## [29] 2634 1158 1552  952 1630 1535 2683 1861 1218 1040 1499 1868 1800 1632
## [43] 1319 1734 1248  672 1487 1621 1652 1861

This command collects a simple random sample of size 50 from the vector area, which is assigned to samp0. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. If we didn’t have access to the population data, working with these 50 files would be considerably simpler than having to go through all 2930 home sales.

Now that you’ve taken a sample, take another sample and compare the two. Are the samples the same? Why?

samp1 <- sample(area, 50)
samp1

##  [1]  912 1668 1872 1660  875 1362 1392 1344 1651 1025  925  934 1494 1517
## [15] 1839  864  943 1302  810 1095 1534 1104 2452 1838  864 1208 1072 1162
## [29] 2374 1812 1588 1593  480 1200  713 1750 1746 1078 1536 1374 1872 1040
## [43]  833 2687 1851  914 1080 1325 2121 1627

identical(samp0, samp1)

## [1] FALSE

Exercise: Describe the distribution of this sample? How does it compare to the distribution of the population? If we’re interested in estimating the average living area of homes in Ames using the sample, our best single guess is the sample mean.

mean(samp1)

## [1] 1386.24

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of approximately 1,500 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

Sample size of 50
Sample size of 100
Sample size of 1000

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.

(mean.area <- mean(area) )

## [1] 1499.69

sample_means50 <- rep(NA, 5000)
for(s in c(50, 100, 1000)) {
    for(i in 1:5000){
        samp <- sample(area, s)
        sample_means50[i] <- mean(samp)
    }
    mean.sample <- mean(sample_means50)
    print(paste("sample size: ", s,", mean=", mean.sample, ", error=", abs(mean.sample - mean.area)) )
    print(quantile(sample_means50))
    hist(sample_means50, main=paste("sample size=", s))        
}

## [1] "sample size:  50 , mean= 1500.016376 , error= 0.325932313993235"
##       0%      25%      50%      75%     100% 
## 1278.780 1452.050 1497.560 1545.935 1773.020

## [1] "sample size:  100 , mean= 1499.425588 , error= 0.26485568600674"
##       0%      25%      50%      75%     100% 
## 1307.160 1465.218 1498.080 1533.607 1683.470

## [1] "sample size:  1000 , mean= 1499.7409444 , error= 0.0505007139931877"
##       0%      25%      50%      75%     100% 
## 1453.158 1490.963 1499.605 1508.583 1547.193

If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.

hist(sample_means50, breaks = 25)

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50, using what we call a for loop. If this is completely new to you, do not fear, on the next page we’ll review in detail how this set of code works.

Exercise: Describe the sampling distribution (the distribution of the sample means that you just created), and be sure to specifically note its center. Interlude: The for loop

With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   #print(i)
}

Let’s consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 NAs called sample_means50. This vector will store values generated within the for loop. NA means not available, and in this case they’re used as placeholders until we fill in the values with actual sample means. NA is also often used for missing data in R.

The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i = 1, then once wheni = 2, and so on up to i = 5000.

The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the ith element of sample_means50.

In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.

The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.

Exercise: To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 NAs called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small. Print the output to your screen (type sample_means_small into the console and press enter).

sample_means_small <- rep(NA, 100)

for(i in 1:length(sample_means_small)){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   #print(i)
}

How many elements are there in this object called sample_means_small? A) 0 B) 30 C) 50 D) 100 E) 5,000

length(sample_means_small)

## [1] 100

Which of the following is true about the elements in the sampling distributions you created? A) Each element represents a mean square footage from a simple random sample of 50 houses. B) Each element represents the square footage of a house. C) Each element represents the true population mean of square footage of houses.

?sample

Sample size and the sampling distribution Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.

hist(sample_means50)

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.

To get a sense of the effect that sample size has on our sampling distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means50 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
 
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one, i.e. R will write over the existing object with the new one, which is something you want to be careful about if you don’t intend to do so.

To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.

par(mfrow = c(3, 1))

xlimits = range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots. You may need to stretch your plotting window to accommodate the extra plots. The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.

sampleApply <- function(data, sample, FUN, ...) {
        if(sample > length(data)) {
                sample <- length(data)
        }
        samp <- sample(data, sample)
        sample_means <- FUN(samp, ...)
        return (sample_means)
}
sample.means.10  <- replicate(5000, sampleApply(area, 10, mean, ...))
sample.means.50  <- replicate(5000, sampleApply(area, 50, mean, ...))
sample.means.100 <- replicate(5000, sampleApply(area, 100, mean, ...))

par(mfrow = c(3, 1))

xlimits = range(sample.means.10)

hist(sample.means.10, breaks = 20, xlim = xlimits)
hist(sample.means.50, breaks = 20, xlim = xlimits)
hist(sample.means.100, breaks = 20, xlim = xlimits)

To return to the default setting of plotting one plot at a time, run the following command:

par(mfrow = c(1, 1))

It makes intuitive sense that as the sample size increases, the center of the sampling distribution becomes a more reliable estimate for the true population mean. Also as the sample size increases, the variability of the sampling distribution ________. A) decreases B) increases C) stays the same

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Exercise: Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

Exercise: Since you have access to the population, simulate the sampling distribution for x¯price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?

sample.price.means.50  <- replicate(5000, sampleApply(price, 50, mean, ...))
xlimits <- range(sample.price.means.50)
hist(sample.price.means.50, breaks = 20, xlim = xlimits)

mean(sample.price.means.50)

## [1] 180770.9

Exercise: Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample.price.means.150  <- replicate(5000, sampleApply(price, 150, mean, ...))
xlimits <- range(sample.price.means.150)
hist(sample.price.means.150, breaks = 20, xlim = xlimits)

mean(sample.price.means.150)

## [1] 180790.3

Which of the following is false?

The variability of the sampling distribution with the smaller sample size (sample_means50) is smaller than the variability of the sampling distribution with the larger sample size (sample_means150).
The means for the two sampling distribtuions are roughly similar.
Both sampling distributions are symmetric.

par(mfrow = c(2, 1))
xlimits = range(sample.price.means.50)

hist(sample.price.means.50, breaks = 20, xlim = xlimits)
hist(sample.price.means.150, breaks = 20, xlim = xlimits)

sd(sample.price.means.50) / mean(sample.price.means.50)

## [1] 0.06069741

sd(sample.price.means.150) / mean(sample.price.means.150)

## [1] 0.03481901

t.test(sample.price.means.50, sample.price.means.150)

## 
##  Welch Two Sample t-test
## 
## data:  sample.price.means.50 and sample.price.means.150
## t = -0.10884, df = 7968.1, p-value = 0.9133
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -370.1528  331.2115
## sample estimates:
## mean of x mean of y 
##  180770.9  180790.3

Data Analysis Lab3

AllenTsai

2015年10月10日

Lab 3A: Foundations for inference - Sampling distributions

Question 1.Which of the following is false?