# In this lab, we investigate the ways in which the statistics from a random sample of
# data can serve as point estimates for population parameters. We're interested in
# formulating a sampling distribution of our estimate in order to learn about the
# properties of the estimate, such as its distribution.
# The data
# We consider real estate data from the city of Ames, Iowa. The details of every real
# estate transaction in Ames are recorded by the City Assessor's office. Our particular
# focus for this lab will be all residential home sales in Ames between 2006 and 2010.
# This collection represents our population of interest. In this lab we would like to
# learn about these home sales by taking smaller samples from the full population.
# Let's load the data.
rm(list=ls())
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
head(ames)
names(ames)
## [1] "Order" "PID" "MS.SubClass" "MS.Zoning"
## [5] "Lot.Frontage" "Lot.Area" "Street" "Alley"
## [9] "Lot.Shape" "Land.Contour" "Utilities" "Lot.Config"
## [13] "Land.Slope" "Neighborhood" "Condition.1" "Condition.2"
## [17] "Bldg.Type" "House.Style" "Overall.Qual" "Overall.Cond"
## [21] "Year.Built" "Year.Remod.Add" "Roof.Style" "Roof.Matl"
## [25] "Exterior.1st" "Exterior.2nd" "Mas.Vnr.Type" "Mas.Vnr.Area"
## [29] "Exter.Qual" "Exter.Cond" "Foundation" "Bsmt.Qual"
## [33] "Bsmt.Cond" "Bsmt.Exposure" "BsmtFin.Type.1" "BsmtFin.SF.1"
## [37] "BsmtFin.Type.2" "BsmtFin.SF.2" "Bsmt.Unf.SF" "Total.Bsmt.SF"
## [41] "Heating" "Heating.QC" "Central.Air" "Electrical"
## [45] "X1st.Flr.SF" "X2nd.Flr.SF" "Low.Qual.Fin.SF" "Gr.Liv.Area"
## [49] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Full.Bath" "Half.Bath"
## [53] "Bedroom.AbvGr" "Kitchen.AbvGr" "Kitchen.Qual" "TotRms.AbvGrd"
## [57] "Functional" "Fireplaces" "Fireplace.Qu" "Garage.Type"
## [61] "Garage.Yr.Blt" "Garage.Finish" "Garage.Cars" "Garage.Area"
## [65] "Garage.Qual" "Garage.Cond" "Paved.Drive" "Wood.Deck.SF"
## [69] "Open.Porch.SF" "Enclosed.Porch" "X3Ssn.Porch" "Screen.Porch"
## [73] "Pool.Area" "Pool.QC" "Fence" "Misc.Feature"
## [77] "Misc.Val" "Mo.Sold" "Yr.Sold" "Sale.Type"
## [81] "Sale.Condition" "SalePrice"
# We see that there are quite a few variables in the data set, enough to do a very
# in-depth analysis. For this lab, we'll restrict our attention to just two of the
# variables: the above ground living area of the house in square feet (Gr.Liv.Area)
# and the sale price (SalePrice). To save some effort throughout the lab, create two
# variables with short names that represent these two variables.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
# Let's look at the distribution of area in our population of home sales by calculating
# a few summary statistics and making a histogram.
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
hist(area,breaks=25)
#install.packages("e1071")
library(e1071)
skew <- skewness(area)
print(paste("Skewness:", skew))
## [1] "Skewness: 1.27280546412348"
kurt <- kurtosis(area)
print(paste("Kurtosis:", kurt))
## [1] "Kurtosis: 4.12386812688596"
The population distribution of area is heavily right skewed and leptokurtic (more peaked than a normal distribution, with heavier tails).
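# As a quick visual check of the skew described above (a sketch, using only base R),
# we can overlay a normal curve with the same mean and standard deviation on the
# histogram of area; the long right tail stands out against the symmetric curve.
hist(area, breaks = 25, freq = FALSE)
curve(dnorm(x, mean = mean(area), sd = sd(area)), add = TRUE, lwd = 2)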
# To estimate the average living area in Ames based on a sample, we can use the
# following command to survey the population.
samp1 <- sample(area, 50)
length(area)
## [1] 2930
length(samp1)
## [1] 50
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 732 1167 1501 1541 1776 3672
hist(samp1)
hist(samp1,breaks=25)
skew <- skewness(samp1)
print(paste("Skewness:", skew))
## [1] "Skewness: 1.34243702321063"
kurt <- kurtosis(samp1)
print(paste("Kurtosis:", kurt))
## [1] "Kurtosis: 3.25545042025181"
Similar to the population distribution, the sample distribution is also right skewed and leptokurtic.
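# To make the comparison above concrete (a sketch), plot the sample and the population
# on a common x-axis range; the sample histogram is coarser but has a similar shape.
par(mfrow = c(2, 1))
xlimits <- range(area)
hist(area, breaks = 25, xlim = xlimits, main = "Population")
hist(samp1, breaks = 25, xlim = xlimits, main = "Sample of 50")
par(mfrow = c(1, 1))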
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 732 1167 1501 1541 1776 3672
mean(samp1)
## [1] 1540.52
# Exercise 3: Take a second sample, also of size 50, and call it samp2. How does the
# mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one
# of size 100 and one of size 1000. Which would you think would provide a more accurate
# estimate of the population mean?
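# A sketch of this exercise (results vary from run to run because sample() is random):
samp2 <- sample(area, 50)
mean(samp1)      # mean of the first sample of 50
mean(samp2)      # usually differs somewhat from mean(samp1)
samp100 <- sample(area, 100)
samp1000 <- sample(area, 1000)
mean(samp100)
mean(samp1000)   # with 1000 observations the mean is typically closer to mean(area)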
samp1 <- sample(area, 1000)
mean(samp1)
## [1] 1504.741
sample_means50 <- rep(NA, 5000)
# Notice the for loop construct
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)
summary(sample_means50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1258 1450 1496 1498 1544 1786
# If you would like to adjust the bin width of your histogram to show a little more
# detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)
There are 5000 elements in sample_means50.
The sampling distribution of the sample means is expected to be approximately normal by the Central Limit Theorem (CLT) when the sample size is sufficiently large. Here each sample contains 50 observations, which is usually large enough for the CLT to apply. Because the samples are drawn at random, the center of this sampling distribution will be approximately equal to the population mean.
If you were to collect 50,000 sample means instead of 5,000, the histogram would give a smoother, more detailed picture of the same sampling distribution, but its center and spread would stay essentially the same: the spread of the sampling distribution is governed by the sample size (still 50 per sample), not by how many samples we simulate. It is a larger sample size, not a larger number of samples, that produces a narrower, taller distribution of sample means concentrated more tightly around the population mean.
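# A sketch to check the claim above: draw 50,000 sample means (sample size still 50)
# and compare the centre and spread with the 5,000-mean version; both should be very
# similar, the larger simulation just gives a smoother histogram. This takes a moment
# to run.
sample_means50k <- rep(NA, 50000)
for(i in 1:50000){
  sample_means50k[i] <- mean(sample(area, 50))
}
mean(sample_means50); sd(sample_means50)
mean(sample_means50k); sd(sample_means50k)
hist(sample_means50k, breaks = 25)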
# Take a moment to let this code sink in. You have just run your first for loop, a
# cornerstone of computer programming. The idea behind the for loop is iteration: it
# allows you to execute code as many times as you want without having to type out every
# iteration. In the case above, we wanted to iterate the two lines of code inside the
# curly braces that take a random sample of size 50 from area and then save the mean of
# that sample into the sample_means50 vector. Without the for loop, this would be
# painful:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
  #print(i)
}
# Let's consider this code line by line to figure out what it does. In the first line
# we initialized a vector: a vector of 5000 NA values called sample_means50. This
# vector will store the values generated within the for loop.
# The second line calls the for loop itself. The syntax can be loosely read as,
# "for every element i from 1 to 5000, run the following lines of code". You can think
# of i as the counter that keeps track of which loop you're on. Therefore, more
# precisely, the loop will run once when i = 1, then once when i = 2, and so on up to
# i = 5000.
# The for loop allows us to not just run the code 5000 times, but to neatly package
# the results, element by element, into the empty vector that we initialized at the
# outset.
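# A quick check of the bookkeeping described above (a sketch): after the loop has run,
# the vector has its full 5000 entries and no NA placeholders remain.
length(sample_means50)       # 5000
sum(is.na(sample_means50))   # 0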
The exercise is worked out below; use it to check your own answer.
small_sample_means <- rep(NA, 100)
for(i in 1:100){
  samp <- sample(area, 50)
  small_sample_means[i] <- mean(samp)
  print(i)
  print(small_sample_means[i])
}
Answer: There are 100 elements in small_sample_means, and each element represents the mean of one sample of size 50.
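# A small check of this answer (sketch):
length(small_sample_means)   # 100
head(small_sample_means)     # each entry is the mean of one sample of 50 areas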
# Mechanics aside, let's return to the reason we used a for loop: to compute a sampling
# distribution, specifically, this one.
hist(sample_means50)
# To get a sense of the effect that SAMPLE SIZE has on our distribution, let's build up
# two more sampling distributions: one based on a sample size of 10 and another based
# on a sample size of 100.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10) # limiting the values of the x axis
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
A larger sample size makes the sampling distribution narrower and taller: the variability of the sample means decreases, and the distribution becomes more tightly concentrated around the population mean. In general, as the sample size increases, the sampling distribution becomes more nearly normal and its spread shrinks.
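# To quantify the narrowing described above (a sketch), compare the standard deviation
# of each simulated sampling distribution with the theoretical standard error
# sd(area)/sqrt(n); they should agree closely.
sd(sample_means10);  sd(area) / sqrt(10)
sd(sample_means50);  sd(area) / sqrt(50)
sd(sample_means100); sd(area) / sqrt(100)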
sam_price <- sample(price,50)
mean(sam_price)
## [1] 164412.8
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)
mean(sample_means50)
## [1] 180535.6
mean(price)
## [1] 180796.1
sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 150)
sample_means150[i] <- mean(samp)
}
hist(sample_means150)
mean(sample_means150)
## [1] 180823.4
mean(price)
## [1] 180796.1
The sampling distribution based on samples of size 150 has a smaller spread because of the larger sample size, and its mean is also closer to the population mean. If we are concerned with making estimates that are more often close to the true value, we should take a larger sample size.
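# A sketch comparing the two spreads numerically (sample_means50 here holds the price
# means computed above):
sd(sample_means50)    # based on samples of size 50
sd(sample_means150)   # based on samples of size 150 -- noticeably smaller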