#setwd("C:/Users/stina/Documents/CUNY SPS Data Science/Spring 2018 Classes/DATA 606 - Probability and Statistics/Lab4b")
load("./lab4b/more/ames.RData")
The distribution of the sample does not look symmetric: it appears to be multimodal and skewed to the right.
The “typical” size within the sample is about 1450 sqft (the sample mean; the sample median is 1412 sqft).
- mean(samp): 1450.067
- sd(samp): 412.8894
population <- ames$Gr.Liv.Area
set.seed(1)
samp <- sample(population, 60)
#par(mfrow = c(1, 1))
#hist(population, breaks = 25)
hist(samp, breaks = 25)
mean(samp)
## [1] 1450.067
sd(samp)
## [1] 412.8894
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1206 1412 1450 1661 2730
Below I plotted histograms of 4 different samples of size 60. Each distribution has a somewhat different shape, but all of them tend to be skewed to the right. The distributions are not symmetric and appear to be multimodal. The sample means ranged from about 1450 to 1570.
So, to answer the question: no, I do not expect another student’s distribution to be identical to mine, but I do expect it to be similar, because each sample of size 60 is a random sample from the same population.
- mean(samp1): 1450.067
- mean(samp2): 1511.733
- mean(samp3): 1570.067
- mean(samp4): 1516.25
set.seed(1)
samp1 <- sample(population, 60)
set.seed(2)
samp2 <- sample(population, 60)
set.seed(3)
samp3 <- sample(population, 60)
set.seed(4)
samp4 <- sample(population, 60)
par(mfrow = c(2, 2))
hist(samp1, breaks = 25)
hist(samp2, breaks = 25)
hist(samp3, breaks = 25)
hist(samp4, breaks = 25)
mean(samp1)
## [1] 1450.067
mean(samp2)
## [1] 1511.733
mean(samp3)
## [1] 1570.067
mean(samp4)
## [1] 1516.25
From pg. 178 of the book:
Important conditions to help ensure the sampling distribution of x_mean is nearly normal and the estimate of SE sufficiently accurate:
- The sample observations are independent.
- The sample size is large. n >= 30 is a good rule of thumb.
- The population distribution is not strongly skewed. This condition can be difficult to evaluate so just use your best judgement.
From pg. 175 of the book:
But what does “95% confident” mean? Suppose we took many samples and built a confidence interval from each sample. Then about 95% of those intervals would contain the actual mean (of the population).
In the sample below, the 95% confidence interval (1345.591, 1554.542) does contain the population mean of 1499.69.
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(length(samp)) # standard error: sample sd / sqrt of sample size (n = 60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1345.591 1554.542
mean(population)
## [1] 1499.69
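The arithmetic behind this interval can be written out as a worked equation. The symbols are my own notation rather than anything defined above: the sample mean is written as x-bar, the sample standard deviation as s, the sample size as n, and the critical value as z*.

```latex
\[
\bar{x} \pm z^{*}\,\frac{s}{\sqrt{n}}
  = 1450.067 \pm 1.96 \times \frac{412.8894}{\sqrt{60}}
  \approx 1450.067 \pm 104.475
  \approx (1345.59,\ 1554.54)
\]
```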
I would expect about 95% of the confidence intervals built from different samples to capture the true mean of the population.
I am not working in a classroom environment.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
lower_vector <- rep(NA, 50)
upper_vector <- rep(NA, 50)
for(i in 1:50){
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
for (i in 1:50){
lower_vector[i] <- samp_mean[i] - 1.96 * samp_sd[i] / sqrt(n)
upper_vector[i] <- samp_mean[i] + 1.96 * samp_sd[i] / sqrt(n)
}
c(lower_vector[1], upper_vector[1])
## [1] 1384.483 1666.117
In this particular run, 4 of the 50 confidence intervals DO NOT contain the actual population mean, so 46 of 50 (92%) do.
The proportion is not exactly equal to the expected 95%, most likely because only 50 intervals were drawn: the number of intervals that capture the mean varies from run to run, and 46 of 50 is within ordinary sampling variation.
par(mfrow = c(1, 1))
plot_ci(lower_vector, upper_vector, mean(population))
# proportion of confidence intervals that contain population mean.
(50-4)/50
## [1] 0.92
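The count of 4 misses was read off the plot by eye. The proportion could instead be computed from the interval endpoints, as in the rough sketch below. The helper name `coverage` and the toy intervals are hypothetical, made up for illustration. The last lines assume the 50 intervals are independent and that each one captures the mean with probability exactly 0.95, which gives a sense of how far from 95% a single run can land.

```r
# Hypothetical helper: fraction of intervals [lower, upper] that contain mu.
coverage <- function(lower, upper, mu) {
  mean(lower <= mu & mu <= upper)
}

# Toy check with three made-up intervals; two of them contain mu = 10.
toy_coverage <- coverage(c(8, 9, 11), c(12, 10.5, 13), 10)

# Under the stated assumptions, the number of "hits" out of 50 is binomial.
n_intervals   <- 50
expected_hits <- n_intervals * 0.95               # 47.5 intervals on average
sd_hits       <- sqrt(n_intervals * 0.95 * 0.05)  # roughly 1.5 intervals
```

With the vectors from this lab, the call would be `coverage(lower_vector, upper_vector, mean(population))`.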
To create a 99% confidence interval, a critical value of 2.58 would be used.
In the plot below, I do not see any confidence interval that does not contain the population mean.
for (i in 1:50){
lower_vector[i] <- samp_mean[i] - 2.58 * samp_sd[i] / sqrt(n)
upper_vector[i] <- samp_mean[i] + 2.58 * samp_sd[i] / sqrt(n)
}
plot_ci(lower_vector, upper_vector, mean(population))
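The critical values 1.96 and 2.58 used above do not have to be looked up in a table. They can be recovered from the standard normal quantile function `qnorm` in base R. The helper name `critical_value` is my own and is only a small sketch.

```r
# Critical value z* for a two-sided confidence interval at a given level,
# taken from the standard normal quantile function.
critical_value <- function(level) {
  qnorm(1 - (1 - level) / 2)
}

z95 <- critical_value(0.95)  # rounds to 1.96
z99 <- critical_value(0.99)  # rounds to 2.58
round(c(z95, z99), 2)
```

Other confidence levels work the same way, for example `critical_value(0.90)` for a 90% interval.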