In the previous lab, "Sampling Distributions", we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.
load("more/ames.RData")
In this lab we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
Answer: Compared with the population, the distribution of this simple random sample of size 60 sits slightly to the left: the sample mean (1376) is below the population mean (1500). In addition, the sample standard deviation (416.7429) is smaller than the population’s (505.5089), because the sample contains fewer outliers.
summary(population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
sd(population)
## [1] 505.5089
hist(population, breaks=25)
qqnorm(population)
qqline(population)
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     612    1211    1498    1522    1796    2868
sd(samp)
## [1] 492.8303
hist(samp, breaks=25)
qqnorm(samp)
qqline(samp)
#(mean(samp)- mean(population))/mean(population)
Answer: (1) Another student’s distribution could be identical to mine, in which case we would have the same mean and standard deviation. However, it will most likely differ, because each simple random sample contains different houses and sampling variability is noticeable when the sample size is not large. Here I generate a second sample of size 60. The second sample’s mean (1484) is larger than the first sample’s (1376), and its spread (520.4294) is wider than the first’s (401.7573).
samp2 <- sample(population, 60)
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1093    1445    1550    1882    5095
sd(samp2)
## [1] 662.4161
hist(samp2)
qqnorm(samp2)
qqline(samp2)
(2) Here I also generate 5000 sample means to see what the sampling distribution looks like. As a result, most students will get sample means that are close to the population mean.
sample_means60 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(population, 60)
sample_means60[i] <- mean(samp)
}
hist(sample_means60)
Answer: The following conditions must hold: 1) the sample size is at least 30, 2) the sample observations are independent, and 3) the population distribution is not strongly skewed.
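When these conditions hold, the sample means should be roughly centered on the population mean, with a spread close to the population standard deviation divided by the square root of n. The following is a minimal sketch of that check. It assumes a simulated right-skewed population (`population_sim`, a hypothetical stand-in for `ames$Gr.Liv.Area`) so that it runs without the lab's data file.

```r
# Hypothetical stand-in for the Ames living-area data: a right-skewed
# gamma population with mean about 1500 and sd about 500.
set.seed(1)
population_sim <- rgamma(3000, shape = 9, rate = 0.006)

n <- 60
# Draw 5000 samples of size n (without replacement) and keep each mean.
sample_means <- replicate(5000, mean(sample(population_sim, n)))

# Compare the spread of the simulated means with sigma / sqrt(n).
empirical_se   <- sd(sample_means)
theoretical_se <- sd(population_sim) / sqrt(n)
c(empirical = empirical_se, theoretical = theoretical_se)
```

The two values should be close. Sampling without replacement from a finite population makes the empirical value very slightly smaller.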
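Answer: There is a 95% probability that an unobserved normal random variable X falls within 1.96 standard deviations of its mean. Applied to the sample mean, this means that about 95% of random samples of this size yield intervals that contain the true population mean.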
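The 1.96 multiplier can be checked numerically with base R's normal distribution functions. This is only a quick sanity check; the variable names are chosen here for illustration.

```r
# Probability that a standard normal variable lies within 1.96
# standard deviations of its mean.
coverage_196 <- pnorm(1.96) - pnorm(-1.96)
coverage_196   # approximately 0.95

# Multiplier that leaves 2.5% in each tail.
z_95 <- qnorm(0.975)
z_95           # approximately 1.96
```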
mean(population)
## [1] 1499.69
Answer: Yes. My 95% confidence interval is [1317.233, 1528.134], which captures the population mean of 1499.69. My neighbor’s interval may differ from mine, since it depends on the mean and standard deviation of their own sample.
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1327.834 1582.666
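The same calculation can be wrapped in a small function so the sample size is taken from the data rather than typed in by hand. `ci_mean` is a hypothetical helper, not part of the lab materials, and the toy vector `x` is made up so that the sketch runs on its own.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
# z defaults to 1.96, the multiplier used above for 95% confidence.
ci_mean <- function(x, z = 1.96) {
  se <- sd(x) / sqrt(length(x))   # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Toy data (not from the Ames data set).
x <- c(1200, 1350, 1500, 1650, 1800)
ci <- ci_mean(x)
ci
```

With the lab's objects loaded, `ci_mean(samp)` should reproduce the interval computed step by step above.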
Answer: I don’t know which samples my classmates drew, but I expect about 95% of their intervals to capture the true population mean. I can estimate what they would get by random sampling. Assuming a class size of 50, the 95% confidence interval is (1291.933, 1553.433), which captures the population mean of 1499.69.
sample_means50 <- rep(NA, 50)
for(i in 1:50){
samp <- sample(population, 50)
sample_means50[i] <- mean(samp)
}
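se <- sd(samp) / sqrt(50) # standard error from the last of the 50 samples (size 50)
lower <- sample_mean - 1.96 * se # sample_mean still holds the value computed earlier
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1329.151 1581.349
hist(sample_means50, breaks=50)
Here is the rough outline: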
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for(i in 1:50){
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1450.773 1741.760
Answer: Fifty 95% confidence intervals were plotted, one for each of 50 random samples of size 60. Five of them did not capture the true mean, 1499.6904, so the proportion that did is 1 - 5/50 = 90%. A confidence interval is not meant to pin down an exact value. It gives a range of values from a procedure that captures the true population value about 95% of the time.
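The capture proportion can also be computed directly instead of being counted from the plot. The sketch below assumes a simulated population (`population_sim`, a hypothetical stand-in for the Ames data) so that it is self-contained. With the lab's objects, the same comparison could be applied to `lower_vector`, `upper_vector` and `mean(population)`.

```r
# Hypothetical stand-in population (not the Ames data).
set.seed(2)
population_sim <- rgamma(3000, shape = 9, rate = 0.006)
mu <- mean(population_sim)

n <- 60            # size of each sample
n_intervals <- 50  # number of intervals to build

samp_mean <- numeric(n_intervals)
samp_sd   <- numeric(n_intervals)
for (i in seq_len(n_intervals)) {
  s <- sample(population_sim, n)
  samp_mean[i] <- mean(s)
  samp_sd[i]   <- sd(s)
}

lower <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper <- samp_mean + 1.96 * samp_sd / sqrt(n)

# TRUE where an interval contains the population mean.
captured <- lower <= mu & mu <= upper
coverage <- mean(captured)  # proportion of intervals that capture mu
coverage
```

With only 50 intervals the proportion moves in steps of 2%, so values a few points above or below 95% are to be expected.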
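plot_ci(lower_vector, upper_vector, mean(population))
Answer: I pick a 99% confidence level. For a two-tailed interval, the significance level of 0.01 is split into 0.005 per tail, so the critical value is the 0.995 quantile of the standard normal distribution, which is 2.575829.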
z <- qnorm(0.995)
z
## [1] 2.575829
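The same rule gives the critical value for any two-sided confidence level: put half of the leftover probability in each tail and take the matching normal quantile. A short sketch for three common levels, with variable names chosen here for illustration:

```r
conf_levels <- c(0.90, 0.95, 0.99)

# For a two-sided interval, (1 - level) / 2 goes in each tail.
z_values <- qnorm(1 - (1 - conf_levels) / 2)
names(z_values) <- c("90%", "95%", "99%")
round(z_values, 3)
```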
Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
Answer: The 99% confidence intervals are wider than the 95% ones. In my run, the 99% intervals captured the true population mean for all of the random samples.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
samp <- sample(population, n)
samp_mean[i] <- mean(samp)
samp_sd[i] <- sd(samp)
}
lower99 <- samp_mean - 2.58 * samp_sd / sqrt(n)
upper99 <- samp_mean + 2.58 * samp_sd / sqrt(n)
c(lower99[1], upper99[1])
## [1] 1344.499 1701.768
plot_ci(lower99, upper99, mean(population))
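For a given sample, the extra width of a 99% interval comes entirely from the larger multiplier. The sketch below uses made-up values for the sample standard deviation and size (not taken from the lab output) to show that the width ratio depends only on the two multipliers.

```r
# Illustrative values only (hypothetical, not from the Ames samples).
s <- 500   # sample standard deviation
n <- 60    # sample size
se <- s / sqrt(n)

width95 <- 2 * 1.96 * se   # full width of the 95% interval
width99 <- 2 * 2.58 * se   # full width of the 99% interval

width_ratio <- width99 / width95   # equals 2.58 / 1.96, about 1.32
width_ratio
```

A 99% interval is therefore roughly a third wider than the 95% interval built from the same sample.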