#In this lab, we investigate the ways in which the statistics from a random sample of
#data can serve as point estimates for population parameters.
#We're interested in formulating a sampling distribution of our estimate in order to
#learn about the properties of the estimate, such as its distribution.
#The data
#We consider real estate data from the city of Ames, Iowa.
#The details of every real estate transaction in Ames is recorded by the
#City Assessor's office. Our particular focus for this lab will be all residential home
#sales in Ames between 2006 and 2010. This collection represents our population of
#interest. In this lab we would like to learn about these home sales by taking smaller
#samples from the full population. Let's load the data.
rm(list=ls())
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
head(ames)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1 SBrkr 1656 0 0 1656
## 2 SBrkr 896 0 0 896
## 3 SBrkr 1329 0 0 1329
## 4 SBrkr 2110 0 0 2110
## 5 SBrkr 928 701 0 1629
## 6 SBrkr 926 678 0 1604
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 1 0 1 0 3
## 2 0 0 1 0 2
## 3 0 0 1 1 3
## 4 1 0 2 1 3
## 5 0 0 2 1 3
## 6 0 0 2 1 3
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1 1 TA 7 Typ 2
## 2 1 TA 5 Typ 0
## 3 1 Gd 6 Typ 0
## 4 1 Ex 8 Typ 2
## 5 1 TA 6 Typ 1
## 6 1 Gd 7 Typ 1
## Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1 Gd Attchd 1960 Fin 2
## 2 <NA> Attchd 1961 Unf 1
## 3 <NA> Attchd 1958 Unf 1
## 4 TA Attchd 1968 Fin 2
## 5 TA Attchd 1997 Fin 2
## 6 Gd Attchd 1998 Fin 2
## Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1 528 TA TA P 210
## 2 730 TA TA Y 140
## 3 312 TA TA Y 393
## 4 522 TA TA Y 0
## 5 482 TA TA Y 212
## 6 470 TA TA Y 360
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1 62 0 0 0 0 <NA>
## 2 0 0 0 120 0 <NA>
## 3 36 0 0 0 0 <NA>
## 4 0 0 0 0 0 <NA>
## 5 34 0 0 0 0 <NA>
## 6 36 0 0 0 0 <NA>
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> 0 5 2010 WD Normal
## 2 MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> 0 4 2010 WD Normal
## 5 MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
names(ames)
## [1] "Order" "PID" "MS.SubClass"
## [4] "MS.Zoning" "Lot.Frontage" "Lot.Area"
## [7] "Street" "Alley" "Lot.Shape"
## [10] "Land.Contour" "Utilities" "Lot.Config"
## [13] "Land.Slope" "Neighborhood" "Condition.1"
## [16] "Condition.2" "Bldg.Type" "House.Style"
## [19] "Overall.Qual" "Overall.Cond" "Year.Built"
## [22] "Year.Remod.Add" "Roof.Style" "Roof.Matl"
## [25] "Exterior.1st" "Exterior.2nd" "Mas.Vnr.Type"
## [28] "Mas.Vnr.Area" "Exter.Qual" "Exter.Cond"
## [31] "Foundation" "Bsmt.Qual" "Bsmt.Cond"
## [34] "Bsmt.Exposure" "BsmtFin.Type.1" "BsmtFin.SF.1"
## [37] "BsmtFin.Type.2" "BsmtFin.SF.2" "Bsmt.Unf.SF"
## [40] "Total.Bsmt.SF" "Heating" "Heating.QC"
## [43] "Central.Air" "Electrical" "X1st.Flr.SF"
## [46] "X2nd.Flr.SF" "Low.Qual.Fin.SF" "Gr.Liv.Area"
## [49] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Full.Bath"
## [52] "Half.Bath" "Bedroom.AbvGr" "Kitchen.AbvGr"
## [55] "Kitchen.Qual" "TotRms.AbvGrd" "Functional"
## [58] "Fireplaces" "Fireplace.Qu" "Garage.Type"
## [61] "Garage.Yr.Blt" "Garage.Finish" "Garage.Cars"
## [64] "Garage.Area" "Garage.Qual" "Garage.Cond"
## [67] "Paved.Drive" "Wood.Deck.SF" "Open.Porch.SF"
## [70] "Enclosed.Porch" "X3Ssn.Porch" "Screen.Porch"
## [73] "Pool.Area" "Pool.QC" "Fence"
## [76] "Misc.Feature" "Misc.Val" "Mo.Sold"
## [79] "Yr.Sold" "Sale.Type" "Sale.Condition"
## [82] "SalePrice"
#We see that there are quite a few variables in the data set, enough to do a very in-depth
#analysis. For this lab, we'll restrict our attention to just two of the variables:
#the above ground living area of the house in square feet ( Gr.Liv.Area ) and
#the sale price ( SalePrice ). To save some effort throughout the lab, create two
#variables with short names that represent these two variables.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
#Let's look at the distribution of area in our population of home sales by calculating a
#few summary statistics and making a histogram.
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)

hist(area,breaks=25)

# 1.Describe this population distribution.
#
# The unknown sampling distribution
#
# In this lab we have access to the entire population, but this is rarely the
# case in real life.
# Gathering information on an entire population is often extremely costly or
# impossible. Because of this, we often take a sample of the population and use
# that to understand the properties of the population.
# If we were interested in estimating the mean living area in Ames based on
#a sample, we can use the following command to survey the population.
samp1 <- sample(area, 50)
length(area)
## [1] 2930
length(samp1)
## [1] 50
# This command collects a simple random sample of size 50 from the vector area ,
# which is assigned to samp1 . This is like going into the City Assessor's
# database and pulling up the files on 50 random home sales.
# Working with these 50 files would be considerably simpler than working with all
# 2930 home sales.
# 2.Describe the distribution of this sample. How does it compare to the
# distribution of the population?
# If we're interested in estimating the average living area in homes in Ames
# using the sample, our best single guess is the sample mean.
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 492 1234 1535 1625 1792 4476
mean(samp1)
## [1] 1624.88
# Depending on which 50 homes you selected, your estimate could be a bit above or
# a bit below the true population mean of 1499.69 square feet. In general,
# though, the sample mean turns out to be a pretty good estimate of the average
# living area, and we were able to get it by sampling less than 3% of the
# population. (<10% can ensure independent sobservations)
#3.Take a second sample, also of size 50, and call it samp2 .
# How does the mean of samp2 compare with the mean of samp1 ?
# Suppose we took two more samples, one of size 100 and one of size 1000. Which
# would you think would provide a more accurate estimate of the population mean?
samp1 <- sample(area, 1000)
mean(samp1)
## [1] 1502.948
# Not surprisingly, every time we take another random sample, we get a different
# sample mean. It's useful to get a sense of just how much variability we should
# expect when estimating the population mean this way. The distribution of sample
# means, called the sampling distribution, can help us understand this
# variability. In this lab, because we have access to the population,
# we can build up the sampling distribution for the sample mean by repeating the
# above steps many times. Here we will generate 5000 samples and compute the
# sample mean of each.
sample_means50 <- rep(NA, 5000)
## Notice the for loop construct
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)

summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1276 1452 1497 1500 1545 1788
# If you would like to adjust the bin width of your histogram to show a
#little more detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)

# Here we use R to take 5000 samples of size 50 from the population,
# calculate the mean of each sample, and store each result in a vector called
# sample_means50 . On the next page, we'll review how this set of code works.
# 4.How many elements are there in sample_means50 ?
# Describe the sampling distribution, and be sure to specifically note its center.
# Would you expect the distribution to change if we instead collected 50,000
# sample means?
# #Interlude: The for loop
### Let's take a break from the statistics for a moment to let that last block of
#code sink in. You have just run your first for loop, a cornerstone of computer
# programming. The idea behind the for loop is iteration: it allows you to
# execute code as many times as you want without having to type out every
# iteration. In the case above, we wanted to iterate the two lines of code inside
# the curly braces that take a random sample of size 50 from area then save the
# mean of that sample into the sample_means50 vector. Without the for loop,
# this would be painful:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
# and so on..for a 1000 times
## OR .....we can "for loop" !!!
# With the for loop, these thousands of lines of code are compressed into a handful
# of lines. We've added one extra line to the code below, which prints the variable
# i during each iteration of the for loop. Run this code.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
#print(i)
}
#Let's consider this code line by line to figure out what it does.
#In the first line we initialized a vector. In this case, we created a vector of
# 5000 zeros called sample_means50. This vector will will store values generated within the for loop.
#The second line calls the for loop itself. The syntax can be loosely read as,
# "for every element i from 1 to 5000, run the following lines of code".
# You can think of i as the counter that keeps track of which loop you're on.
# Therefore, more precisely, the loop will run once when i = 1 , then once when
# i = 2 , and so on up to i = 5000 .
# The body of the for loop is the part inside the curly braces, and this set of
# code is run for each value of i . Here, on every loop, we take a random sample
# of size 50 from area , take its mean, and store it as the $i$th element of
# sample_means50 .
# In order to display that this is really happening, we asked R to print i
# at each iteration. This line of code is optional and is only used for
# displaying what's going on while the for loop is running.
# ALSO VERY HELPFUL WHEN YOU ARE TRYING TO DEBUG YOUR CODE!!!!!!
#The for loop allows us to not just run the code 5000 times, but to neatly
# package the results, element by element, into the empty vector that we
# initialized at the outset.
# 5.To make sure you understand what you've done in this loop, try running a
# smaller version. Initialize a vector of 100 zeros called sample_means_small .
# Run a loop that takes a sample of size 50 from area and stores the
# sample mean in sample_means_small , but only iterate from 1 to 100.
# Print the output to your screen (type sample_means_small into the console
# and press enter). How many elements are there in this object called
# sample_means_small ? What does each element represent?
######################### I AM DOING IT HERE BUT USE IT TO CHECK YOUR ANSWER
small_sample_means = rep(NA, 100)
for(i in 1:100){
samp <- sample(area, 50)
small_sample_means[i] <- mean(samp)
#print(i)
#print(small_sample_means[i])
}
########################################################################
# Sample size and the sampling distribution
#Mechanics aside, let's return to the reason we used a for loop: to
#compute a sampling distribution, specifically, this one.
hist(sample_means50)

# The sampling distribution that we computed tells us much about estimating the
# average living area in homes in Ames. Because the sample mean is an
# unbiased estimator, the sampling distribution is centered at the true average
# living area of the the population, and the spread of the distribution indicates
# how much variability is induced by sampling only 50 home sales.
#To get a sense of the effect that SAMPLE SIZE has on our distribution,
#let's build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
# Here we're able to use a single for loop to build two distributions
# by adding additional lines inside the curly braces.
# Don't worry about the fact that samp is used for the name of two different
# objects. In the second command of the for loop, the mean of samp is saved
# to the relevant place in the vector sample_means10 .
# With the mean saved, we're now free to overwrite the object samp with a
# new sample, this time of size 100. In general, anytime you create an object
# using a name that is already in use, the old object will get replaced with
# the new one.
# To see the effect that different sample sizes have on the sampling
# distribution, plot the three distributions on top of one another.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10) # limiting the values of the x axis
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

# The first command specifies that you'd like to divide the plotting area into 3 rows
# and 1 column of plots (to return to the default setting of plotting one at a time,
# use par(mfrow = c(1, 1)) ).
# The breaks argument specifies the number of bins used in constructing the histogram.
# The xlim argument specifies the range of the x-axis of the histogram, and by setting
# it equal to xlimits for each histogram, we ensure that all three histograms will be
# plotted with the same limits on the x-axis.
#####################################################################
# 6.When the sample size is larger, what happens to the center? What about the spread?
# On your own
# So far, we have only focused on estimating the mean living area in homes in Ames.
# Now you'll try to estimate the mean home price.
# .Take a random sample of size 50 from price . Using this sample, what is your
# best point estimate of the population mean?
price_50 <- sample(price,50)
mean(price_50)
## [1] 188437.5
#
# .Since you have access to the population, simulate the sampling distribution
# for $\bar{x}_{price}$ price (sampling mean of price ) by taking 5000 samples from
# the population of size 50 and computing 5000 sample means.
# Store these means in a vector called sample_means50 . Plot the data,
# then describe the shape of this sampling distribution.
# Based on this sampling distribution, what would you guess the mean home price of
# the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(0,5000)
for(i in 1:5000)
{
sam <- sample(price,50)
sample_means50[i] <- mean(sam)
}
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 141473 173015 180563 181041 188321 222192
mean_sample_means <- mean(sample_means50)
pop_mean <- mean(price)
#180796.1
hist(sample_means50, breaks = 20)
# .Change your sample size from 50 to 150, then compute the sampling distribution
# using the same method as above, and store these means in a new vector called
# sample_means150 . Describe the shape of this sampling distribution, and compare
# it to the sampling distribution for a sample size of 50.
# Based on this sampling distribution, what would you guess to be the mean sale
# price of homes in Ames?
sample_means150 <- rep(0,5000)
for(i in 1:5000)
{
sam <- sample(price,150)
sample_means150[i] <- mean(sam)
}
summary(sample_means150)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159048 176370 180545 180770 184983 205750
mean_sample_means <- mean(sample_means150)
#180672.9
hist(sample_means150, breaks = 20)
# .Of the sampling distributions from 2 and 3, which has a smaller spread?
# If we're concerned with making estimates that are more often close to the
# true value, would we prefer a distribution with a large or small spread?
# From the above 2 histogram comparison, we can see that sample_means150 has a smaller spread.
# If we concern more about the true value, we should take the 150 sample results. Because larger datasets means higher
# accuracy, if compared to 1 sample, 1000 dataset is more convincing.
