The Data
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
Exercise 1: Describe this population distribution.
head(ames)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1 SBrkr 1656 0 0 1656
## 2 SBrkr 896 0 0 896
## 3 SBrkr 1329 0 0 1329
## 4 SBrkr 2110 0 0 2110
## 5 SBrkr 928 701 0 1629
## 6 SBrkr 926 678 0 1604
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 1 0 1 0 3
## 2 0 0 1 0 2
## 3 0 0 1 1 3
## 4 1 0 2 1 3
## 5 0 0 2 1 3
## 6 0 0 2 1 3
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1 1 TA 7 Typ 2
## 2 1 TA 5 Typ 0
## 3 1 Gd 6 Typ 0
## 4 1 Ex 8 Typ 2
## 5 1 TA 6 Typ 1
## 6 1 Gd 7 Typ 1
## Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1 Gd Attchd 1960 Fin 2
## 2 <NA> Attchd 1961 Unf 1
## 3 <NA> Attchd 1958 Unf 1
## 4 TA Attchd 1968 Fin 2
## 5 TA Attchd 1997 Fin 2
## 6 Gd Attchd 1998 Fin 2
## Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1 528 TA TA P 210
## 2 730 TA TA Y 140
## 3 312 TA TA Y 393
## 4 522 TA TA Y 0
## 5 482 TA TA Y 212
## 6 470 TA TA Y 360
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1 62 0 0 0 0 <NA>
## 2 0 0 0 120 0 <NA>
## 3 36 0 0 0 0 <NA>
## 4 0 0 0 0 0 <NA>
## 5 34 0 0 0 0 <NA>
## 6 36 0 0 0 0 <NA>
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> 0 5 2010 WD Normal
## 2 MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> 0 4 2010 WD Normal
## 5 MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)

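The population distribution is right skewed: the mean (1500) sits above the median (1442), and a long upper tail stretches to a maximum of 5642. Most homes have a living area between about 1000 and 2000 square feet, with the middle 50% falling between 1126 and 1743.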
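The skew can also be checked numerically rather than by eye. The following is a minimal sketch, assuming `area` is the vector defined above. So that the snippet runs on its own, it falls back to a synthetic log-normal stand-in when `area` is not in the workspace; that stand-in and its parameters are assumptions for illustration, not the Ames data.

```r
# Use the lab's `area` vector if it is already loaded; otherwise fall back to
# a synthetic right-skewed stand-in (an assumption for illustration only).
if (!exists("area", mode = "numeric")) {
  set.seed(1)
  area <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
}

# In a right-skewed distribution the long upper tail pulls the mean above the median.
mean_gt_median <- mean(area) > median(area)

# The middle 50% of the data lies between the first and third quartiles.
middle_half <- quantile(area, c(0.25, 0.75))

# The upper tail (max - median) is longer than the lower tail (median - min).
upper_tail <- max(area) - median(area)
lower_tail <- median(area) - min(area)

mean_gt_median
middle_half
upper_tail > lower_tail
```

Applied to the summary output above, the same comparison gives a mean of 1500 against a median of 1442, and an upper tail of 5642 - 1442 = 4200 against a lower tail of 1442 - 334 = 1108.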
Exercise 2: Describe the distribution of this sample. How does it compare to the distribution of the population?
samp1 <- sample(area,50)
par(mfrow = c(1,2))
hist(area)
hist(samp1, xlim = c(0,6000))

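The distribution of the population is right skewed, whereas the distribution of this sample of 50 looks more symmetric and resembles a normal distribution. With only 50 observations, the sample is unlikely to include many of the very large homes, so the long right tail is much less visible.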
mean(samp1)
## [1] 1559.4
Exercise 3: Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2 <- sample(area,50)
mean(samp2)
## [1] 1534.1
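The mean of samp1 (1559.4) is about 59 above the population mean of 1500, whereas the mean of samp2 (1534.1) is about 34 above it, so the two sample means differ from each other by about 25. A sample of size 1000 would probably provide the most accurate estimate of the population mean, because the means of larger samples vary less from one sample to the next.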
samp3 <- sample(area,100)
mean(samp3)
## [1] 1551.48
samp4 <- sample(area,1000)
mean(samp4)
## [1] 1479.221
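A single sample at each size cannot show which size is more reliable, since any one draw can land close to the population mean by chance. One way to check the claim is to repeat the sampling many times at each size and compare how much the sample means vary. This is a rough sketch, assuming `area` is the vector defined above; it falls back to a synthetic log-normal stand-in (an assumption, not the Ames data) so that it runs on its own. The seed and the 2000 repetitions are arbitrary choices.

```r
# Use the lab's `area` vector if it is already loaded; otherwise fall back to
# a synthetic right-skewed stand-in (an assumption for illustration only).
if (!exists("area", mode = "numeric")) {
  set.seed(1)
  area <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
}

set.seed(42)  # arbitrary seed so the run is reproducible
sizes <- c(50, 100, 1000)

# For each sample size, draw 2000 samples and record the standard deviation
# of their means, i.e. how far a typical sample mean is from the center.
sd_of_means <- sapply(sizes, function(n) {
  sd(replicate(2000, mean(sample(area, n))))
})
names(sd_of_means) <- sizes

sd_of_means
```

The spread should shrink as the sample size grows, roughly in proportion to 1/sqrt(n), so going from 50 to 100 homes should cut it by a factor of about 1.4.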
sample_means50 <- rep(NA, 5000)    # storage for 5000 sample means
for (i in 1:5000) {
  samp <- sample(area, 50)         # draw a random sample of 50 homes
  sample_means50[i] <- mean(samp)  # record that sample's mean
}
hist(sample_means50, breaks = 25)

Exercise 4: How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
There are 5000 elements in sample_means50, each of which is the mean of one random sample of 50 homes. The sampling distribution resembles a normal distribution. Its center is around 1500, which is essentially the population mean (1500 in the summary above). If 50,000 sample means were collected instead of 5000, I would expect the histogram to look smoother and its center to sit slightly closer to the population mean. The overall shape and spread should stay about the same, since the spread depends on the sample size (50) and not on how many samples are drawn.
# Repeat the simulation with 50,000 sample means (overwrites sample_means50)
sample_means50 <- rep(NA, 50000)
for (i in 1:50000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
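To see what actually changes when more sample means are collected, the two runs can be compared by their center and spread instead of by histogram alone. This is a sketch, assuming `area` is the vector defined above, with a synthetic log-normal fallback (an assumption, not the Ames data) so that it runs standalone. `sim_means` is a hypothetical helper introduced here for illustration; it is not part of the lab.

```r
# Use the lab's `area` vector if it is already loaded; otherwise fall back to
# a synthetic right-skewed stand-in (an assumption for illustration only).
if (!exists("area", mode = "numeric")) {
  set.seed(1)
  area <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
}

# Hypothetical helper: draw `reps` samples of size `n` and return their means.
sim_means <- function(x, reps, n = 50) {
  replicate(reps, mean(sample(x, n)))
}

set.seed(42)  # arbitrary seed so the run is reproducible
means_5k  <- sim_means(area, 5000)
means_50k <- sim_means(area, 50000)

# Center and spread of each simulated sampling distribution.
comparison <- rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)

comparison
mean(area)  # population mean, for reference
```

Both rows should show nearly the same spread, because each mean is still based on 50 homes. The extra repetitions mainly smooth the histogram and pin down the center more precisely.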
