Lecture-5-Sampling-Distribution.utf8.md

#In this lab, we investigate the ways in which the statistics from a random sample of 
#data can serve as point estimates for population parameters.
#We're interested in formulating a sampling distribution of our estimate in order to 
#learn about the properties of the estimate, such as its distribution.

#The data

#We consider real estate data from the city of Ames, Iowa. 
#The details of every real estate transaction in Ames is recorded by the
#City Assessor's office. Our particular focus for this lab will be all residential home 
#sales in Ames between 2006 and 2010. This collection represents our population of 
#interest. In this lab we would like to learn about these home sales by taking smaller
#samples from the full population. Let's load the data.
rm(list=ls())

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
head(ames)

##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1     1 526301100          20        RL          141    31770   Pave  <NA>
## 2     2 526350040          20        RH           80    11622   Pave  <NA>
## 3     3 526351010          20        RL           81    14267   Pave  <NA>
## 4     4 526353030          20        RL           93    11160   Pave  <NA>
## 5     5 527105010          60        RL           74    13830   Pave  <NA>
## 6     6 527105030          60        RL           78     9978   Pave  <NA>
##   Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1       IR1          Lvl    AllPub     Corner        Gtl        NAmes
## 2       Reg          Lvl    AllPub     Inside        Gtl        NAmes
## 3       IR1          Lvl    AllPub     Corner        Gtl        NAmes
## 4       Reg          Lvl    AllPub     Corner        Gtl        NAmes
## 5       IR1          Lvl    AllPub     Inside        Gtl      Gilbert
## 6       IR1          Lvl    AllPub     Inside        Gtl      Gilbert
##   Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1        Norm        Norm      1Fam      1Story            6            5
## 2       Feedr        Norm      1Fam      1Story            5            6
## 3        Norm        Norm      1Fam      1Story            6            6
## 4        Norm        Norm      1Fam      1Story            7            5
## 5        Norm        Norm      1Fam      2Story            5            5
## 6        Norm        Norm      1Fam      2Story            6            6
##   Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1       1960           1960        Hip   CompShg      BrkFace      Plywood
## 2       1961           1961      Gable   CompShg      VinylSd      VinylSd
## 3       1958           1958        Hip   CompShg      Wd Sdng      Wd Sdng
## 4       1968           1968        Hip   CompShg      BrkFace      BrkFace
## 5       1997           1998      Gable   CompShg      VinylSd      VinylSd
## 6       1998           1998      Gable   CompShg      VinylSd      VinylSd
##   Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1        Stone          112         TA         TA     CBlock        TA
## 2         None            0         TA         TA     CBlock        TA
## 3      BrkFace          108         TA         TA     CBlock        TA
## 4         None            0         Gd         TA     CBlock        TA
## 5         None            0         TA         TA      PConc        Gd
## 6      BrkFace           20         TA         TA      PConc        TA
##   Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1        Gd            Gd            BLQ          639            Unf
## 2        TA            No            Rec          468            LwQ
## 3        TA            No            ALQ          923            Unf
## 4        TA            No            ALQ         1065            Unf
## 5        TA            No            GLQ          791            Unf
## 6        TA            No            GLQ          602            Unf
##   BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1            0         441          1080    GasA         Fa           Y
## 2          144         270           882    GasA         TA           Y
## 3            0         406          1329    GasA         TA           Y
## 4            0        1045          2110    GasA         Ex           Y
## 5            0         137           928    GasA         Gd           Y
## 6            0         324           926    GasA         Ex           Y
##   Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1      SBrkr        1656           0               0        1656
## 2      SBrkr         896           0               0         896
## 3      SBrkr        1329           0               0        1329
## 4      SBrkr        2110           0               0        2110
## 5      SBrkr         928         701               0        1629
## 6      SBrkr         926         678               0        1604
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1              1              0         1         0             3
## 2              0              0         1         0             2
## 3              0              0         1         1             3
## 4              1              0         2         1             3
## 5              0              0         2         1             3
## 6              0              0         2         1             3
##   Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1             1           TA             7        Typ          2
## 2             1           TA             5        Typ          0
## 3             1           Gd             6        Typ          0
## 4             1           Ex             8        Typ          2
## 5             1           TA             6        Typ          1
## 6             1           Gd             7        Typ          1
##   Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1           Gd      Attchd          1960           Fin           2
## 2         <NA>      Attchd          1961           Unf           1
## 3         <NA>      Attchd          1958           Unf           1
## 4           TA      Attchd          1968           Fin           2
## 5           TA      Attchd          1997           Fin           2
## 6           Gd      Attchd          1998           Fin           2
##   Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1         528          TA          TA           P          210
## 2         730          TA          TA           Y          140
## 3         312          TA          TA           Y          393
## 4         522          TA          TA           Y            0
## 5         482          TA          TA           Y          212
## 6         470          TA          TA           Y          360
##   Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1            62              0           0            0         0    <NA>
## 2             0              0           0          120         0    <NA>
## 3            36              0           0            0         0    <NA>
## 4             0              0           0            0         0    <NA>
## 5            34              0           0            0         0    <NA>
## 6            36              0           0            0         0    <NA>
##   Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1  <NA>         <NA>        0       5    2010       WD          Normal
## 2 MnPrv         <NA>        0       6    2010       WD          Normal
## 3  <NA>         Gar2    12500       6    2010       WD          Normal
## 4  <NA>         <NA>        0       4    2010       WD          Normal
## 5 MnPrv         <NA>        0       3    2010       WD          Normal
## 6  <NA>         <NA>        0       6    2010       WD          Normal
##   SalePrice
## 1    215000
## 2    105000
## 3    172000
## 4    244000
## 5    189900
## 6    195500

names(ames)

##  [1] "Order"           "PID"             "MS.SubClass"    
##  [4] "MS.Zoning"       "Lot.Frontage"    "Lot.Area"       
##  [7] "Street"          "Alley"           "Lot.Shape"      
## [10] "Land.Contour"    "Utilities"       "Lot.Config"     
## [13] "Land.Slope"      "Neighborhood"    "Condition.1"    
## [16] "Condition.2"     "Bldg.Type"       "House.Style"    
## [19] "Overall.Qual"    "Overall.Cond"    "Year.Built"     
## [22] "Year.Remod.Add"  "Roof.Style"      "Roof.Matl"      
## [25] "Exterior.1st"    "Exterior.2nd"    "Mas.Vnr.Type"   
## [28] "Mas.Vnr.Area"    "Exter.Qual"      "Exter.Cond"     
## [31] "Foundation"      "Bsmt.Qual"       "Bsmt.Cond"      
## [34] "Bsmt.Exposure"   "BsmtFin.Type.1"  "BsmtFin.SF.1"   
## [37] "BsmtFin.Type.2"  "BsmtFin.SF.2"    "Bsmt.Unf.SF"    
## [40] "Total.Bsmt.SF"   "Heating"         "Heating.QC"     
## [43] "Central.Air"     "Electrical"      "X1st.Flr.SF"    
## [46] "X2nd.Flr.SF"     "Low.Qual.Fin.SF" "Gr.Liv.Area"    
## [49] "Bsmt.Full.Bath"  "Bsmt.Half.Bath"  "Full.Bath"      
## [52] "Half.Bath"       "Bedroom.AbvGr"   "Kitchen.AbvGr"  
## [55] "Kitchen.Qual"    "TotRms.AbvGrd"   "Functional"     
## [58] "Fireplaces"      "Fireplace.Qu"    "Garage.Type"    
## [61] "Garage.Yr.Blt"   "Garage.Finish"   "Garage.Cars"    
## [64] "Garage.Area"     "Garage.Qual"     "Garage.Cond"    
## [67] "Paved.Drive"     "Wood.Deck.SF"    "Open.Porch.SF"  
## [70] "Enclosed.Porch"  "X3Ssn.Porch"     "Screen.Porch"   
## [73] "Pool.Area"       "Pool.QC"         "Fence"          
## [76] "Misc.Feature"    "Misc.Val"        "Mo.Sold"        
## [79] "Yr.Sold"         "Sale.Type"       "Sale.Condition" 
## [82] "SalePrice"

#We see that there are quite a few variables in the data set, enough to do a very in-depth
#analysis. For this lab, we'll restrict our attention to just two of the variables:
#the above ground living area of the house in square feet ( Gr.Liv.Area ) and
#the sale price ( SalePrice ). To save some effort throughout the lab, create two 
#variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

#Let's look at the distribution of area in our population of home sales by calculating a 
#few summary statistics and making a histogram.
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

hist(area,breaks=25)

# 1.Describe this population distribution.
#
# The unknown sampling distribution
#
# In this lab we have access to the entire population, but this is rarely the
# case in real life.
# Gathering information on an entire population is often extremely costly or
# impossible. Because of this, we often take a sample of the population and use
# that to understand the properties of the population.

# If we were interested in estimating the mean living area in Ames based on
#a sample, we can use the following command to survey the population.

samp1 <- sample(area, 50)
length(area)

## [1] 2930

length(samp1)

## [1] 50

# This command collects a simple random sample of size 50 from the vector  area ,
#  which is assigned to  samp1 . This is like going into the City Assessor's 
#  database and pulling up the files on 50 random home sales. 
#  Working with these 50 files would be considerably simpler than working with all
#  2930 home sales.

# 2.Describe the distribution of this sample. How does it compare to the 
#  distribution of the population?

# If we're interested in estimating the average living area in homes in Ames 
# using the sample, our best single guess is the sample mean.
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     492    1234    1535    1625    1792    4476

mean(samp1)

## [1] 1624.88

# Depending on which 50 homes you selected, your estimate could be a bit above or
# a bit below the true population mean of 1499.69 square feet. In general, 
# though, the sample mean turns out to be a pretty good estimate of the average 
# living area, and we were able to get it by sampling less than 3% of the 
# population. (<10% can ensure independent sobservations)

#3.Take a second sample, also of size 50, and call it  samp2 . 
# How does the mean of  samp2  compare with the mean of  samp1 ?
# Suppose we took two more samples, one of size 100 and one of size 1000. Which 
# would you think would provide a more accurate estimate of the population mean?

samp1 <- sample(area, 1000)

mean(samp1)

## [1] 1502.948

# Not surprisingly, every time we take another random sample, we get a different 
# sample mean. It's useful to get a sense of just how much variability we should 
# expect when estimating the population mean this way. The distribution of sample
# means, called the sampling distribution, can help us understand this
# variability. In this lab, because we have access to the population, 
# we can build up the sampling distribution for the sample mean by repeating the
# above steps many times. Here we will generate 5000 samples and compute the
# sample mean of each.

sample_means50 <- rep(NA, 5000)
##  Notice the for loop construct

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1276    1452    1497    1500    1545    1788

# If you would like to adjust the bin width of your histogram to show a 
#little more detail, you can do so by changing the  breaks  argument.

hist(sample_means50, breaks = 25)

# Here we use R to take 5000 samples of size 50 from the population, 
# calculate the mean of each sample, and store each result in a vector called 
# sample_means50 . On the next page, we'll review how this set of code works.

# 4.How many elements are there in  sample_means50 ? 
# Describe the sampling distribution, and be sure to specifically note its center.
# Would you expect the distribution to change if we instead collected 50,000 
# sample means?

#  #Interlude: The  for  loop

### Let's take a break from the statistics for a moment to let that last block of
#code sink in. You have just run your first  for  loop, a cornerstone of computer
# programming. The idea behind the for loop is iteration: it allows you to 
# execute code as many times as you want without having to type out every 
# iteration. In the case above, we wanted to iterate the two lines of code inside
# the curly braces that take a random sample of size 50 from  area  then save the
# mean of that sample into the  sample_means50  vector. Without the  for  loop, 
# this would be painful:

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)


# and so on..for a 1000 times

##  OR .....we can "for loop" !!!

# With the for loop, these thousands of lines of code are compressed into a handful
# of lines. We've added one extra line to the code below, which prints the variable
# i  during each iteration of the  for  loop. Run this code.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
  #print(i)
}

#Let's consider this code line by line to figure out what it does.
#In the first line we initialized a vector. In this case, we created a vector of
# 5000 zeros called  sample_means50. This vector will will store values generated within the  for  loop.

#The second line calls the  for  loop itself. The syntax can be loosely read as,
# "for every element  i  from 1 to 5000, run the following lines of code". 
#  You can think of  i  as the counter that keeps track of which loop you're on.
#  Therefore, more precisely, the loop will run once when  i = 1 , then once when
#  i = 2 , and so on up to  i = 5000 .

# The body of the  for  loop is the part inside the curly braces, and this set of
# code is run for each value of  i . Here, on every loop, we take a random sample
# of size 50 from  area , take its mean, and store it as the $i$th element of  
# sample_means50 .

# In order to display that this is really happening, we asked R to print  i 
# at each iteration. This line of code is optional and is only used for 
# displaying what's going on while the  for  loop is running.
#  ALSO VERY HELPFUL WHEN YOU ARE TRYING TO DEBUG YOUR CODE!!!!!!

#The  for  loop allows us to not just run the code 5000 times, but to neatly
# package the results, element by element, into the empty vector that we 
# initialized at the outset.

# 5.To make sure you understand what you've done in this loop, try running a 
# smaller version. Initialize a vector of 100 zeros called  sample_means_small .
# Run a loop that takes a sample of size 50 from  area  and stores the 
# sample mean in  sample_means_small , but only iterate from 1 to 100.
# Print the output to your screen (type  sample_means_small  into the console 
# and press enter). How many elements are there in this object called  
# sample_means_small ? What does each element represent?
######################### I AM DOING IT HERE BUT USE IT TO CHECK YOUR ANSWER
small_sample_means = rep(NA, 100)
for(i in 1:100){
  samp <- sample(area, 50)
  small_sample_means[i] <- mean(samp)
  #print(i)
  #print(small_sample_means[i])
}

########################################################################

# Sample size and the sampling distribution

#Mechanics aside, let's return to the reason we used a  for  loop: to 
#compute a sampling distribution, specifically, this one.

hist(sample_means50)

# The sampling distribution that we computed tells us much about estimating the
# average living area in homes in Ames. Because the sample mean is an 
# unbiased estimator, the sampling distribution is centered at the true average
# living area of the the population, and the spread of the distribution indicates
# how much variability is induced by sampling only 50 home sales.

#To get a sense of the effect that SAMPLE SIZE has on our distribution, 
#let's build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}


# Here we're able to use a single  for  loop to build two distributions
# by adding additional lines inside the curly braces. 
# Don't worry about the fact that  samp  is used for the name of two different
# objects. In the second command of the  for  loop, the mean of  samp  is saved 
# to the relevant place in the vector  sample_means10 .
# With the mean saved, we're now free to overwrite the object  samp  with a
# new sample, this time of size 100. In general, anytime you create an object
# using a name that is already in use, the old object will get replaced with 
# the new one.

# To see the effect that different sample sizes have on the sampling
# distribution, plot the three distributions on top of one another.

par(mfrow = c(3, 1))

xlimits <- range(sample_means10) # limiting the values of the x axis

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

# The first command specifies that you'd like to divide the plotting area into 3 rows 
# and 1 column of plots (to return to the default setting of plotting one at a time,
# use  par(mfrow = c(1, 1)) ). 
# The  breaks  argument specifies the number of bins used in constructing the histogram. 
# The  xlim  argument specifies the range of the x-axis of the histogram, and by setting
# it equal to  xlimits  for each histogram, we ensure that all three histograms will be
# plotted with the same limits on the x-axis.
#####################################################################
# 6.When the sample size is larger, what happens to the center? What about the spread?



# On your own

# So far, we have only focused on estimating the mean living area in homes in Ames.
# Now you'll try to estimate the mean home price.
# .Take a random sample of size 50 from  price . Using this sample, what is your
#  best point estimate of the population mean?
price_50 <- sample(price,50)
mean(price_50)

## [1] 188437.5

#
# .Since you have access to the population, simulate the sampling distribution 
# for $\bar{x}_{price}$ price (sampling mean of price ) by taking 5000 samples from
# the population of size 50 and computing 5000 sample means.
# Store these means in a vector called  sample_means50 . Plot the data,
# then describe the shape of this sampling distribution. 
# Based on this sampling distribution, what would you guess the mean home price of
# the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(0,5000)
for(i in 1:5000)
{
  sam <- sample(price,50)
  sample_means50[i] <- mean(sam)
}
summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  141473  173015  180563  181041  188321  222192

mean_sample_means <- mean(sample_means50)
pop_mean <- mean(price)
#180796.1
hist(sample_means50, breaks = 20)

# .Change your sample size from 50 to 150, then compute the sampling distribution 
# using the same method as above, and store these means in a new vector called 
# sample_means150 . Describe the shape of this sampling distribution, and compare
# it to the sampling distribution for a sample size of 50.
# Based on this sampling distribution, what would you guess to be the mean sale
# price of homes in Ames?
sample_means150 <- rep(0,5000)
for(i in 1:5000)
{
  sam <- sample(price,150)
  sample_means150[i] <- mean(sam)
}
summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  159048  176370  180545  180770  184983  205750

mean_sample_means <- mean(sample_means150)
#180672.9
hist(sample_means150, breaks = 20)


# .Of the sampling distributions from 2 and 3, which has a smaller spread? 
# If we're concerned with making estimates that are more often close to the 
# true value, would we prefer a distribution with a large or small spread?

# From the above 2 histogram comparison, we can see that sample_means150 has a smaller spread.
# If we concern more about the true value, we should take the 150 sample results. Because larger datasets means higher
# accuracy, if compared to 1 sample, 1000 dataset is more convincing.

Lecture-5-Sampling-Distribution.R

John

2019-12-10