#Load ames data into R
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
#Create population and samp variable
#sample() draws a random sample of the specified size from the elements of x, either with or without replacement (without replacement by default)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
Describe the distribution of your sample. What would you say is the “typical” size of your sample? Also state precisely what you interpreted “typical” to mean.
#summary() is a generic function that produces result summaries; the method it invokes depends on the class of its first argument. For a numeric vector it returns the minimum, quartiles, median, mean, and maximum.
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     816    1142    1369    1441    1599    3395
#Standard Deviation
sd(samp)
## [1] 483.4648
#Histogram
hist(samp)
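Based on the histogram, the sample is skewed to the right, which is consistent with the mean (1441) being larger than the median (1369). Taking "typical" to mean the median, which is less affected by the skew, the typical house size in my sample is about 1369 square feet.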
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
#Create the mean and standard deviation of the population variable
mean(population)
## [1] 1499.69
sd(population)
## [1] 505.5089
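After running descriptive statistics on the population variable, it is safe to say that another student's distribution would not be identical to mine, but it should be somewhat similar, because both samples are drawn at random from the same population. As an illustration, the mean of my samp variable (about 1441) differs from the population mean (1499.69) by about 59 square feet, so it is close but not identical.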
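The point about sampling variability can be sketched by drawing two independent samples of 60 and comparing their summaries. This is only a sketch: `population_demo` is a simulated, right-skewed stand-in (its parameters are arbitrary and not fitted to the Ames data) so that the block runs without `ames.RData`; with the lab data loaded, `population` could be used instead.

```r
set.seed(1)  # assumption: seed fixed only so the sketch is reproducible

# Assumption: simulated right-skewed stand-in for ames$Gr.Liv.Area
population_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Two students each draw their own sample of 60 houses
samp_a <- sample(population_demo, 60)
samp_b <- sample(population_demo, 60)

# Compare a few summaries side by side: similar, but not identical
sapply(list(a = samp_a, b = samp_b),
       function(s) c(mean = mean(s), median = median(s), sd = sd(s)))
```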
#Create a variable known as the sample_mean
sample_mean <- mean(samp)
#View the variable
sample_mean
## [1] 1440.6
#We can calculate a 95% confidence interval for the population mean by subtracting 1.96 standard errors from, and adding 1.96 standard errors to, the point estimate (the sample mean)
#sqrt() function computes the square root of a numeric vector
se <- sd(samp)/sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1318.267 1562.933
#We're 95% confident that the true average size of houses in Ames lies between the values lower and upper
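The interval calculation above can be wrapped in a small helper so that the sample size is taken from the data rather than typed in by hand. This is a minimal sketch: `ci_mean` is a hypothetical name, not part of the lab materials, and `x_demo` is simulated purely for illustration.

```r
# Hypothetical helper: interval for the mean as point estimate -/+ z * SE
ci_mean <- function(x, z = 1.96) {
  se <- sd(x) / sqrt(length(x))  # standard error uses the actual sample size
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

set.seed(10)  # assumption: seed fixed only for reproducibility
x_demo <- rnorm(60, mean = 1500, sd = 500)  # assumption: illustrative values, not Ames data
ci_mean(x_demo)
```

With the lab objects loaded, `ci_mean(samp)` should give the same interval as `c(lower, upper)`.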
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?
What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.
#True population mean
mean(population)
## [1] 1499.69
Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor's interval capture this value?
Yes, it does: the population mean of 1499.69 falls between the lower bound (1318.267) and the upper bound (1562.933) of my 95% confidence interval.
Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.
#Create empty vectors where we can save the means and standard deviations that will be calculated from each sample.
samp_mean <- rep(NA,50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50){
  samp <- sample(population, n) #obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) #save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) #save sample sd in ith element of samp_sd
}
#Construct the confidence intervals
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
#View the first interval
c(lower_vector[1],upper_vector[1])
## [1] 1351.271 1579.895
Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
plot_ci(lower_vector, upper_vector, mean(population))
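Besides reading the proportion off the plot, it can be computed directly by checking, for each interval, whether the population mean lies between its bounds. The sketch below rebuilds 50 intervals from a simulated stand-in population so that it runs on its own; all names ending in `_demo` are assumptions for illustration. With the lab objects, the same check would be `mean(lower_vector <= mean(population) & mean(population) <= upper_vector)`.

```r
set.seed(2)  # assumption: seed fixed only for reproducibility

# Assumption: simulated right-skewed stand-in for ames$Gr.Liv.Area
population_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
n <- 60

# 50 samples of size n, stored as the columns of an n x 50 matrix
samples_demo <- replicate(50, sample(population_demo, n))
mean_demo <- colMeans(samples_demo)
sd_demo <- apply(samples_demo, 2, sd)

lower_demo <- mean_demo - 1.96 * sd_demo / sqrt(n)
upper_demo <- mean_demo + 1.96 * sd_demo / sqrt(n)

# TRUE where an interval captures the (stand-in) population mean
mu_demo <- mean(population_demo)
covered <- lower_demo <= mu_demo & mu_demo <= upper_demo
mean(covered)  # proportion of the 50 intervals that capture mu_demo
```

With only 50 intervals the proportion moves in steps of 0.02, so it will usually be near 0.95 rather than exactly equal to it.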
Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
#Confidence level of 98%: the appropriate critical value is approximately 2.33
lower_vector <- samp_mean - 2.33 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 2.33 * samp_sd / sqrt(n)
#View the first interval
c(lower_vector[1],upper_vector[1])
## [1] 1329.692 1601.475
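The critical value of 2.33 used above can be obtained from the standard normal quantile function rather than from a table. As a sketch (`z_star` is a hypothetical helper name): for a confidence level C, the critical value is the quantile that leaves (1 - C)/2 in each tail.

```r
# Hypothetical helper: two-sided normal critical value for a given confidence level
z_star <- function(level) qnorm(1 - (1 - level) / 2)

round(z_star(c(0.90, 0.95, 0.98, 0.99)), 3)
# 1.645 1.960 2.326 2.576
```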
Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
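One way to make this comparison is to reuse the same sample means and standard deviations at both confidence levels and compute the coverage for each. The sketch below uses a simulated stand-in population so that it runs on its own; `coverage` and the `_demo` objects are hypothetical names, not part of the lab materials.

```r
set.seed(3)  # assumption: seed fixed only for reproducibility

# Assumption: simulated right-skewed stand-in for ames$Gr.Liv.Area
population_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
n <- 60
samples_demo <- replicate(50, sample(population_demo, n))
means_demo <- colMeans(samples_demo)
sds_demo <- apply(samples_demo, 2, sd)
mu_demo <- mean(population_demo)

# Proportion of intervals at the given level that capture mu_demo,
# reusing the same 50 means and standard deviations each time
coverage <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  half_width <- z * sds_demo / sqrt(n)
  mean(means_demo - half_width <= mu_demo & mu_demo <= means_demo + half_width)
}

c(level_95 = coverage(0.95), level_98 = coverage(0.98))
```

Because the 98% intervals are wider than the 95% intervals built from the same samples, every interval that captures the mean at 95% also captures it at 98%, so the second proportion can never be smaller than the first.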