Title

Lab 3a: Sampling Distributions

Start by loading the Data:

load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData"))

Now inspect the data using the standard tools we have learned: head(), dim(), names(), str().

head(ames)
##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1     1 526301100          20        RL          141    31770   Pave  <NA>
## 2     2 526350040          20        RH           80    11622   Pave  <NA>
## 3     3 526351010          20        RL           81    14267   Pave  <NA>
## 4     4 526353030          20        RL           93    11160   Pave  <NA>
## 5     5 527105010          60        RL           74    13830   Pave  <NA>
## 6     6 527105030          60        RL           78     9978   Pave  <NA>
##   Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1       IR1          Lvl    AllPub     Corner        Gtl        NAmes
## 2       Reg          Lvl    AllPub     Inside        Gtl        NAmes
## 3       IR1          Lvl    AllPub     Corner        Gtl        NAmes
## 4       Reg          Lvl    AllPub     Corner        Gtl        NAmes
## 5       IR1          Lvl    AllPub     Inside        Gtl      Gilbert
## 6       IR1          Lvl    AllPub     Inside        Gtl      Gilbert
##   Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1        Norm        Norm      1Fam      1Story            6            5
## 2       Feedr        Norm      1Fam      1Story            5            6
## 3        Norm        Norm      1Fam      1Story            6            6
## 4        Norm        Norm      1Fam      1Story            7            5
## 5        Norm        Norm      1Fam      2Story            5            5
## 6        Norm        Norm      1Fam      2Story            6            6
##   Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1       1960           1960        Hip   CompShg      BrkFace      Plywood
## 2       1961           1961      Gable   CompShg      VinylSd      VinylSd
## 3       1958           1958        Hip   CompShg      Wd Sdng      Wd Sdng
## 4       1968           1968        Hip   CompShg      BrkFace      BrkFace
## 5       1997           1998      Gable   CompShg      VinylSd      VinylSd
## 6       1998           1998      Gable   CompShg      VinylSd      VinylSd
##   Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1        Stone          112         TA         TA     CBlock        TA
## 2         None            0         TA         TA     CBlock        TA
## 3      BrkFace          108         TA         TA     CBlock        TA
## 4         None            0         Gd         TA     CBlock        TA
## 5         None            0         TA         TA      PConc        Gd
## 6      BrkFace           20         TA         TA      PConc        TA
##   Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1        Gd            Gd            BLQ          639            Unf
## 2        TA            No            Rec          468            LwQ
## 3        TA            No            ALQ          923            Unf
## 4        TA            No            ALQ         1065            Unf
## 5        TA            No            GLQ          791            Unf
## 6        TA            No            GLQ          602            Unf
##   BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1            0         441          1080    GasA         Fa           Y
## 2          144         270           882    GasA         TA           Y
## 3            0         406          1329    GasA         TA           Y
## 4            0        1045          2110    GasA         Ex           Y
## 5            0         137           928    GasA         Gd           Y
## 6            0         324           926    GasA         Ex           Y
##   Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1      SBrkr        1656           0               0        1656
## 2      SBrkr         896           0               0         896
## 3      SBrkr        1329           0               0        1329
## 4      SBrkr        2110           0               0        2110
## 5      SBrkr         928         701               0        1629
## 6      SBrkr         926         678               0        1604
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1              1              0         1         0             3
## 2              0              0         1         0             2
## 3              0              0         1         1             3
## 4              1              0         2         1             3
## 5              0              0         2         1             3
## 6              0              0         2         1             3
##   Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1             1           TA             7        Typ          2
## 2             1           TA             5        Typ          0
## 3             1           Gd             6        Typ          0
## 4             1           Ex             8        Typ          2
## 5             1           TA             6        Typ          1
## 6             1           Gd             7        Typ          1
##   Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1           Gd      Attchd          1960           Fin           2
## 2         <NA>      Attchd          1961           Unf           1
## 3         <NA>      Attchd          1958           Unf           1
## 4           TA      Attchd          1968           Fin           2
## 5           TA      Attchd          1997           Fin           2
## 6           Gd      Attchd          1998           Fin           2
##   Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1         528          TA          TA           P          210
## 2         730          TA          TA           Y          140
## 3         312          TA          TA           Y          393
## 4         522          TA          TA           Y            0
## 5         482          TA          TA           Y          212
## 6         470          TA          TA           Y          360
##   Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1            62              0           0            0         0    <NA>
## 2             0              0           0          120         0    <NA>
## 3            36              0           0            0         0    <NA>
## 4             0              0           0            0         0    <NA>
## 5            34              0           0            0         0    <NA>
## 6            36              0           0            0         0    <NA>
##   Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1  <NA>         <NA>        0       5    2010       WD          Normal
## 2 MnPrv         <NA>        0       6    2010       WD          Normal
## 3  <NA>         Gar2    12500       6    2010       WD          Normal
## 4  <NA>         <NA>        0       4    2010       WD          Normal
## 5 MnPrv         <NA>        0       3    2010       WD          Normal
## 6  <NA>         <NA>        0       6    2010       WD          Normal
##   SalePrice
## 1    215000
## 2    105000
## 3    172000
## 4    244000
## 5    189900
## 6    195500
dim(ames)
## [1] 2930   82
names(ames)
##  [1] "Order"           "PID"             "MS.SubClass"    
##  [4] "MS.Zoning"       "Lot.Frontage"    "Lot.Area"       
##  [7] "Street"          "Alley"           "Lot.Shape"      
## [10] "Land.Contour"    "Utilities"       "Lot.Config"     
## [13] "Land.Slope"      "Neighborhood"    "Condition.1"    
## [16] "Condition.2"     "Bldg.Type"       "House.Style"    
## [19] "Overall.Qual"    "Overall.Cond"    "Year.Built"     
## [22] "Year.Remod.Add"  "Roof.Style"      "Roof.Matl"      
## [25] "Exterior.1st"    "Exterior.2nd"    "Mas.Vnr.Type"   
## [28] "Mas.Vnr.Area"    "Exter.Qual"      "Exter.Cond"     
## [31] "Foundation"      "Bsmt.Qual"       "Bsmt.Cond"      
## [34] "Bsmt.Exposure"   "BsmtFin.Type.1"  "BsmtFin.SF.1"   
## [37] "BsmtFin.Type.2"  "BsmtFin.SF.2"    "Bsmt.Unf.SF"    
## [40] "Total.Bsmt.SF"   "Heating"         "Heating.QC"     
## [43] "Central.Air"     "Electrical"      "X1st.Flr.SF"    
## [46] "X2nd.Flr.SF"     "Low.Qual.Fin.SF" "Gr.Liv.Area"    
## [49] "Bsmt.Full.Bath"  "Bsmt.Half.Bath"  "Full.Bath"      
## [52] "Half.Bath"       "Bedroom.AbvGr"   "Kitchen.AbvGr"  
## [55] "Kitchen.Qual"    "TotRms.AbvGrd"   "Functional"     
## [58] "Fireplaces"      "Fireplace.Qu"    "Garage.Type"    
## [61] "Garage.Yr.Blt"   "Garage.Finish"   "Garage.Cars"    
## [64] "Garage.Area"     "Garage.Qual"     "Garage.Cond"    
## [67] "Paved.Drive"     "Wood.Deck.SF"    "Open.Porch.SF"  
## [70] "Enclosed.Porch"  "X3Ssn.Porch"     "Screen.Porch"   
## [73] "Pool.Area"       "Pool.QC"         "Fence"          
## [76] "Misc.Feature"    "Misc.Val"        "Mo.Sold"        
## [79] "Yr.Sold"         "Sale.Type"       "Sale.Condition" 
## [82] "SalePrice"
str(ames)
## 'data.frame':    2930 obs. of  82 variables:
##  $ Order          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PID            : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
##  $ MS.SubClass    : int  20 20 20 20 60 60 120 120 120 60 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
##  $ Lot.Frontage   : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot.Area       : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
##  $ Overall.Qual   : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ Overall.Cond   : int  5 6 6 5 5 6 5 5 5 5 ...
##  $ Year.Built     : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ Year.Remod.Add : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
##  $ Mas.Vnr.Area   : int  112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
##  $ BsmtFin.SF.1   : int  639 468 923 1065 791 602 616 263 1180 0 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  441 270 406 1045 137 324 722 1017 415 994 ...
##  $ Total.Bsmt.SF  : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
##  $ X2nd.Flr.SF    : int  0 0 0 0 701 678 0 0 0 776 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gr.Liv.Area    : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ Bsmt.Full.Bath : int  1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 1 1 2 2 2 2 2 2 2 ...
##  $ Half.Bath      : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom.AbvGr  : int  3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
##  $ TotRms.AbvGrd  : int  7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Fireplaces     : int  2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Garage.Yr.Blt  : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
##  $ Garage.Cars    : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ Open.Porch.SF  : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed.Porch : int  0 0 0 0 0 0 170 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  5 6 6 4 3 6 4 1 3 6 ...
##  $ Yr.Sold        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ SalePrice      : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...

There are a lot of variables in this data set. There are two variables that we will pay attention to initially, Gr.Liv.Area (the above ground living area) and salePrice. We will create two data objects, area and price, that we will use for further analysis. We call them data objects rather than variables because they are not attached to the data set.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Now look at the distribution or area.

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1440    1500    1740    5640
hist(area)

plot of chunk unnamed-chunk-4

Now we know that there are 2930 cases in the data set and that the data set represents the population–not just a sample–of all the homes sold in Ames Iowa. To look at how samples compare to populations we are going to take some samples from this population and see how the characteristics of the sample compare to the population. We will use the R function sample(data, sample.size) to draw two samples from the population and compare the samples to each other.

samp0 <- sample(area, 50)
samp1 <- sample(area, 50)
summary(samp0)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     754    1070    1300    1380    1610    2380
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     640    1150    1480    1470    1820    2260
hist(samp0)

plot of chunk unnamed-chunk-5

hist(samp1)

plot of chunk unnamed-chunk-5

Now the tutorial gives us the code for what statisticians call a 'Monte Carlo' simulation, taking a lot of samples and comparing their mean to the population mean. To do this, they use the important programing tool of a 'for loop'. Here is how to do a Monte Carlo simulation. First, set up an empty vector of 5000 NAs (the value that R uses for missing data) to store sample means:

sample_means50 = rep(NA, 5000)

Take 5000 samples of size 50 from 'area' and store all of them in 'sample_means50'.

for (i in 1:5000) {
    samp = sample(area, 50)
    sample_means50[i] = mean(samp)
}

Now that we have the sample, view the result. If you want, you can increase the bin width to show more detail by changing the 'breaks' argument.

hist(sample_means50)
hist(sample_means50, breaks = 13)

plot of chunk unnamed-chunk-8

hist(sample_means50, breaks = 50)

plot of chunk unnamed-chunk-8

hist(sample_means50, breaks = 100)

plot of chunk unnamed-chunk-8

To get a better grasp of how for loops work they suggest adding a print statement. Again, set up an empty vector or 5000 NAs to store sample means:

sample_means50 = rep(NA, 5000)

Then take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50' as before, but this time with the print statement, using 'i' (the index counter):

for (i in 1:5000) {
    samp = sample(area, 50)
    sample_means50[i] = mean(samp)
    # print(i)
}

Now we write our own for loop. We will take 100 samples rather than 5000 this time to see the effect of having a smaller number of samples. First, we initialize a vector of NAs.

sample_means_small <- rep("NA", 100)

Then we make the for loop:

for (i in 1:100) {
    samp <- sample(area, 50)
    sample_means_small[i] <- mean(samp)
}

Then we inspect the results:

hist(as.integer(sample_means_small))

plot of chunk unnamed-chunk-13

class(sample_means_small)  #for some reason it thought that sample_means_small was not an integer.
## [1] "character"

Now look at several different sample sizes. First, initialize the sample distributions:

sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)

Run the for loop:

for (i in 1:5000) {
    samp = sample(area, 10)
    sample_means10[i] = mean(samp)
    samp = sample(area, 100)
    sample_means100[i] = mean(samp)
}

Take a look at the results:

head(sample_means10)
## [1] 1567 1435 1526 1640 1344 1528
head(sample_means50)  # was already loaded
## [1] 1474 1499 1537 1483 1375 1541
head(sample_means100)
## [1] 1484 1574 1501 1518 1568 1541

To make the comparisons more visually easy to compare we produce multiple plots at the same time. We do this by passign the mfrow argument to the par() function. 'par' stands for parameters, as in the graphics parameters of aplot or other visual display. mfrow means multiple figures by row, i.e., the first element is the number of rows in which the figures are to be displayed and the second is the number of columns.

First, divide the plot in 3 rows:

par(mfrow = c(3, 1))

Define the limits for the x-axis:

xlimits = range(sample_means10)

Then draw the histograms:

hist(sample_means10, breaks = 20, xlim = xlimits)

plot of chunk unnamed-chunk-19

hist(sample_means50, breaks = 20, xlim = xlimits)

plot of chunk unnamed-chunk-19

hist(sample_means100, breaks = 20, xlim = xlimits)

plot of chunk unnamed-chunk-19

par(mfrow = c(1, 1))

Notice how, as the number of samples we take gets larger the variance of their means gets smaller and closer to the true mean of the population.

Now we look at price, taking a sample of 50 and inspecting its mean:

sample_50 <- sample(price, 50)
head(price)
## [1] 215000 105000 172000 244000 189900 195500

Now we have to write our own for loop to get a sample of means from the population as we did for area. The trick to writing a for loop is to write it from the inside out or backwards, starting with the operation that you want repeated to create the end result. What do we want? We want a bunch of sample means of price: “price_means_5000_50[i] <- mean(samp)” Or 5000 means from samples of 50. Of course, that has to be initialized: price_means_5000_50 <- rep(NA, 5000) So where do we get samp? samp <- sample(price, 50) And we want to do this 5000 times, so for i in 1:5000.

price_means_5000_50 <- rep(NA, 5000)
for (i in 1:5000) {
    samp <- sample(price, 50)
    price_means_5000_50[i] <- mean(samp)
}
summary(price_means_5000_50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  146000  173000  181000  181000  188000  228000

Now do the same with sample sizes of 150.

price_means_5000_150 <- rep(NA, 5000)
for (i in 1:5000) {
    samp <- sample(price, 150)
    price_means_5000_150[i] <- mean(samp)
}
summary(price_means_5000_150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  161000  176000  181000  181000  185000  206000

Part II, 3b: Confidence Intervals Now we want to calculate confidence intervals. First we create the vector 'population' from the data set's Gr.Liv.Area and draw from it a sample of size 60, samp, and store its mean in sample_mean. Finally we will draw a histogram.

population <- area
head(population)
## [1] 1656  896 1329 2110 1629 1604
# I think I am just going to use area
samp <- sample(area, 60)
sample_mean <- mean(samp)
sample_mean
## [1] 1534
hist(samp)

plot of chunk unnamed-chunk-23

Now we need a confidence interval. Here is the general form: se = sd(samp)/sqrt(60), (the standard error) lower = sample_mean - 1.96*se upper = sample_mean + 1.96*se c(lower,upper)

se = sd(samp)/sqrt(60)
lower = sample_mean - 1.96 * se
upper = sample_mean + 1.96 * se
c(lower, upper)
## [1] 1397 1671
# print('95% of samples of this size will contain the true population mean.'
# \lower ' and ' upper' square feet.')

Now compare to the true mean of the population:

mean(population)
## [1] 1500

Now we are going to test the claim of statistical theory that 95% of the confidence intervals constructed from random samples of size 60 will contain the true population mean by running a simulation. We will use a for loop, one of the most powerful functions of computer programing. Here is the algorithm: get a random sample take the mean and standard deviation (sd) of the sample calculate the confidence interval store the confidence interval in a new variable indexed i add one to the index create the for loop

# initialize the variables outside the for loop, otherwise you will keep
# replacing your calculations with NA!
sample_mean = rep(NA, 50)
sample_sd = rep(NA, 50)
n = 60
for (i in 1:50) {
    samp = sample(population, n)
    # print(samp) --these are print statements I inserted to debug the program.
    sample_mean[i] = mean(samp)
    # print(sample_mean[i])
    sample_sd[i] = sd(samp)
    # i = i + 1--increasing the index is unnecessary with R, its done
    # automatically.  print(i)
}

Now we use the vectors of means and standard deviations to calculate the confidence intervals outside the for loop.

lower = sample_mean - 1.96 * sample_sd/sqrt(n)
upper = sample_mean + 1.96 * sample_sd/sqrt(n)
# print(lower)
pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)

plot of chunk unnamed-chunk-27

See how the confidence intervals are wider?

I think this is one very cool plot that really gets the idea of confidence intervals across. 
Remember that the sample variance should be multiplied by 1.96 to get the 95% confidence interval and by 2.58 to get the 99% confidence interval. Do you see why you have to multiply by a larger number to obtain a higher degree of confidence in your results? 
To see this we use our prior results to obtain the 99% CI and graph of the Monte Carlo simulation. 

```r
lower = sample_mean - 2.58 * sample_sd/sqrt(n)
upper = sample_mean + 2.58 * sample_sd/sqrt(n)
# print(lower) pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)

plot of chunk unnamed-chunk-28