Lab 3a: Sampling Distributions
Start by loading the Data:
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData"))
Now inspect the data using the standard tools we have learned: head(), dim(), names(), str().
head(ames)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1 SBrkr 1656 0 0 1656
## 2 SBrkr 896 0 0 896
## 3 SBrkr 1329 0 0 1329
## 4 SBrkr 2110 0 0 2110
## 5 SBrkr 928 701 0 1629
## 6 SBrkr 926 678 0 1604
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 1 0 1 0 3
## 2 0 0 1 0 2
## 3 0 0 1 1 3
## 4 1 0 2 1 3
## 5 0 0 2 1 3
## 6 0 0 2 1 3
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1 1 TA 7 Typ 2
## 2 1 TA 5 Typ 0
## 3 1 Gd 6 Typ 0
## 4 1 Ex 8 Typ 2
## 5 1 TA 6 Typ 1
## 6 1 Gd 7 Typ 1
## Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1 Gd Attchd 1960 Fin 2
## 2 <NA> Attchd 1961 Unf 1
## 3 <NA> Attchd 1958 Unf 1
## 4 TA Attchd 1968 Fin 2
## 5 TA Attchd 1997 Fin 2
## 6 Gd Attchd 1998 Fin 2
## Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1 528 TA TA P 210
## 2 730 TA TA Y 140
## 3 312 TA TA Y 393
## 4 522 TA TA Y 0
## 5 482 TA TA Y 212
## 6 470 TA TA Y 360
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1 62 0 0 0 0 <NA>
## 2 0 0 0 120 0 <NA>
## 3 36 0 0 0 0 <NA>
## 4 0 0 0 0 0 <NA>
## 5 34 0 0 0 0 <NA>
## 6 36 0 0 0 0 <NA>
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> 0 5 2010 WD Normal
## 2 MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> 0 4 2010 WD Normal
## 5 MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
dim(ames)
## [1] 2930 82
names(ames)
## [1] "Order" "PID" "MS.SubClass"
## [4] "MS.Zoning" "Lot.Frontage" "Lot.Area"
## [7] "Street" "Alley" "Lot.Shape"
## [10] "Land.Contour" "Utilities" "Lot.Config"
## [13] "Land.Slope" "Neighborhood" "Condition.1"
## [16] "Condition.2" "Bldg.Type" "House.Style"
## [19] "Overall.Qual" "Overall.Cond" "Year.Built"
## [22] "Year.Remod.Add" "Roof.Style" "Roof.Matl"
## [25] "Exterior.1st" "Exterior.2nd" "Mas.Vnr.Type"
## [28] "Mas.Vnr.Area" "Exter.Qual" "Exter.Cond"
## [31] "Foundation" "Bsmt.Qual" "Bsmt.Cond"
## [34] "Bsmt.Exposure" "BsmtFin.Type.1" "BsmtFin.SF.1"
## [37] "BsmtFin.Type.2" "BsmtFin.SF.2" "Bsmt.Unf.SF"
## [40] "Total.Bsmt.SF" "Heating" "Heating.QC"
## [43] "Central.Air" "Electrical" "X1st.Flr.SF"
## [46] "X2nd.Flr.SF" "Low.Qual.Fin.SF" "Gr.Liv.Area"
## [49] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Full.Bath"
## [52] "Half.Bath" "Bedroom.AbvGr" "Kitchen.AbvGr"
## [55] "Kitchen.Qual" "TotRms.AbvGrd" "Functional"
## [58] "Fireplaces" "Fireplace.Qu" "Garage.Type"
## [61] "Garage.Yr.Blt" "Garage.Finish" "Garage.Cars"
## [64] "Garage.Area" "Garage.Qual" "Garage.Cond"
## [67] "Paved.Drive" "Wood.Deck.SF" "Open.Porch.SF"
## [70] "Enclosed.Porch" "X3Ssn.Porch" "Screen.Porch"
## [73] "Pool.Area" "Pool.QC" "Fence"
## [76] "Misc.Feature" "Misc.Val" "Mo.Sold"
## [79] "Yr.Sold" "Sale.Type" "Sale.Condition"
## [82] "SalePrice"
str(ames)
## 'data.frame': 2930 obs. of 82 variables:
## $ Order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PID : int 526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
## $ MS.SubClass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
## $ Lot.Frontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot.Area : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
## $ Overall.Qual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ Overall.Cond : int 5 6 6 5 5 6 5 5 5 5 ...
## $ Year.Built : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ Year.Remod.Add : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
## $ Mas.Vnr.Area : int 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
## $ BsmtFin.SF.1 : int 639 468 923 1065 791 602 616 263 1180 0 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 441 270 406 1045 137 324 722 1017 415 994 ...
## $ Total.Bsmt.SF : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ X2nd.Flr.SF : int 0 0 0 0 701 678 0 0 0 776 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gr.Liv.Area : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ Bsmt.Full.Bath : int 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 1 1 2 2 2 2 2 2 2 ...
## $ Half.Bath : int 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom.AbvGr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
## $ TotRms.AbvGrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Garage.Yr.Blt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
## $ Garage.Cars : int 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage.Area : int 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ Open.Porch.SF : int 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed.Porch : int 0 0 0 0 0 0 170 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 5 6 6 4 3 6 4 1 3 6 ...
## $ Yr.Sold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
There are a lot of variables in this data set. There are two variables that we will pay attention to initially, Gr.Liv.Area (the above ground living area) and salePrice. We will create two data objects, area and price, that we will use for further analysis. We call them data objects rather than variables because they are not attached to the data set.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
Now look at the distribution or area.
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1440 1500 1740 5640
hist(area)
Now we know that there are 2930 cases in the data set and that the data set represents the population–not just a sample–of all the homes sold in Ames Iowa. To look at how samples compare to populations we are going to take some samples from this population and see how the characteristics of the sample compare to the population. We will use the R function sample(data, sample.size) to draw two samples from the population and compare the samples to each other.
samp0 <- sample(area, 50)
samp1 <- sample(area, 50)
summary(samp0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 754 1070 1300 1380 1610 2380
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 640 1150 1480 1470 1820 2260
hist(samp0)
hist(samp1)
Now the tutorial gives us the code for what statisticians call a 'Monte Carlo' simulation, taking a lot of samples and comparing their mean to the population mean. To do this, they use the important programing tool of a 'for loop'. Here is how to do a Monte Carlo simulation. First, set up an empty vector of 5000 NAs (the value that R uses for missing data) to store sample means:
sample_means50 = rep(NA, 5000)
Take 5000 samples of size 50 from 'area' and store all of them in 'sample_means50'.
for (i in 1:5000) {
samp = sample(area, 50)
sample_means50[i] = mean(samp)
}
Now that we have the sample, view the result. If you want, you can increase the bin width to show more detail by changing the 'breaks' argument.
hist(sample_means50)
hist(sample_means50, breaks = 13)
hist(sample_means50, breaks = 50)
hist(sample_means50, breaks = 100)
To get a better grasp of how for loops work they suggest adding a print statement. Again, set up an empty vector or 5000 NAs to store sample means:
sample_means50 = rep(NA, 5000)
Then take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50' as before, but this time with the print statement, using 'i' (the index counter):
for (i in 1:5000) {
samp = sample(area, 50)
sample_means50[i] = mean(samp)
# print(i)
}
Now we write our own for loop. We will take 100 samples rather than 5000 this time to see the effect of having a smaller number of samples. First, we initialize a vector of NAs.
sample_means_small <- rep("NA", 100)
Then we make the for loop:
for (i in 1:100) {
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
Then we inspect the results:
hist(as.integer(sample_means_small))
class(sample_means_small) #for some reason it thought that sample_means_small was not an integer.
## [1] "character"
Now look at several different sample sizes. First, initialize the sample distributions:
sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)
Run the for loop:
for (i in 1:5000) {
samp = sample(area, 10)
sample_means10[i] = mean(samp)
samp = sample(area, 100)
sample_means100[i] = mean(samp)
}
Take a look at the results:
head(sample_means10)
## [1] 1567 1435 1526 1640 1344 1528
head(sample_means50) # was already loaded
## [1] 1474 1499 1537 1483 1375 1541
head(sample_means100)
## [1] 1484 1574 1501 1518 1568 1541
To make the comparisons more visually easy to compare we produce multiple plots at the same time. We do this by passign the mfrow argument to the par() function. 'par' stands for parameters, as in the graphics parameters of aplot or other visual display. mfrow means multiple figures by row, i.e., the first element is the number of rows in which the figures are to be displayed and the second is the number of columns.
First, divide the plot in 3 rows:
par(mfrow = c(3, 1))
Define the limits for the x-axis:
xlimits = range(sample_means10)
Then draw the histograms:
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
par(mfrow = c(1, 1))
Notice how, as the number of samples we take gets larger the variance of their means gets smaller and closer to the true mean of the population.
Now we look at price, taking a sample of 50 and inspecting its mean:
sample_50 <- sample(price, 50)
head(price)
## [1] 215000 105000 172000 244000 189900 195500
Now we have to write our own for loop to get a sample of means from the population as we did for area. The trick to writing a for loop is to write it from the inside out or backwards, starting with the operation that you want repeated to create the end result. What do we want? We want a bunch of sample means of price: “price_means_5000_50[i] <- mean(samp)” Or 5000 means from samples of 50. Of course, that has to be initialized: price_means_5000_50 <- rep(NA, 5000) So where do we get samp? samp <- sample(price, 50) And we want to do this 5000 times, so for i in 1:5000.
price_means_5000_50 <- rep(NA, 5000)
for (i in 1:5000) {
samp <- sample(price, 50)
price_means_5000_50[i] <- mean(samp)
}
summary(price_means_5000_50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146000 173000 181000 181000 188000 228000
Now do the same with sample sizes of 150.
price_means_5000_150 <- rep(NA, 5000)
for (i in 1:5000) {
samp <- sample(price, 150)
price_means_5000_150[i] <- mean(samp)
}
summary(price_means_5000_150)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 161000 176000 181000 181000 185000 206000
Part II, 3b: Confidence Intervals Now we want to calculate confidence intervals. First we create the vector 'population' from the data set's Gr.Liv.Area and draw from it a sample of size 60, samp, and store its mean in sample_mean. Finally we will draw a histogram.
population <- area
head(population)
## [1] 1656 896 1329 2110 1629 1604
# I think I am just going to use area
samp <- sample(area, 60)
sample_mean <- mean(samp)
sample_mean
## [1] 1534
hist(samp)
Now we need a confidence interval. Here is the general form: se = sd(samp)/sqrt(60), (the standard error) lower = sample_mean - 1.96*se upper = sample_mean + 1.96*se c(lower,upper)
se = sd(samp)/sqrt(60)
lower = sample_mean - 1.96 * se
upper = sample_mean + 1.96 * se
c(lower, upper)
## [1] 1397 1671
# print('95% of samples of this size will contain the true population mean.'
# \lower ' and ' upper' square feet.')
Now compare to the true mean of the population:
mean(population)
## [1] 1500
Now we are going to test the claim of statistical theory that 95% of the confidence intervals constructed from random samples of size 60 will contain the true population mean by running a simulation. We will use a for loop, one of the most powerful functions of computer programing. Here is the algorithm: get a random sample take the mean and standard deviation (sd) of the sample calculate the confidence interval store the confidence interval in a new variable indexed i add one to the index create the for loop
# initialize the variables outside the for loop, otherwise you will keep
# replacing your calculations with NA!
sample_mean = rep(NA, 50)
sample_sd = rep(NA, 50)
n = 60
for (i in 1:50) {
samp = sample(population, n)
# print(samp) --these are print statements I inserted to debug the program.
sample_mean[i] = mean(samp)
# print(sample_mean[i])
sample_sd[i] = sd(samp)
# i = i + 1--increasing the index is unnecessary with R, its done
# automatically. print(i)
}
Now we use the vectors of means and standard deviations to calculate the confidence intervals outside the for loop.
lower = sample_mean - 1.96 * sample_sd/sqrt(n)
upper = sample_mean + 1.96 * sample_sd/sqrt(n)
# print(lower)
pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)
See how the confidence intervals are wider?
I think this is one very cool plot that really gets the idea of confidence intervals across.
Remember that the sample variance should be multiplied by 1.96 to get the 95% confidence interval and by 2.58 to get the 99% confidence interval. Do you see why you have to multiply by a larger number to obtain a higher degree of confidence in your results?
To see this we use our prior results to obtain the 99% CI and graph of the Monte Carlo simulation.
```r
lower = sample_mean - 2.58 * sample_sd/sqrt(n)
upper = sample_mean + 2.58 * sample_sd/sqrt(n)
# print(lower) pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)