load("C:\\Users\\jkuruvilla\\Desktop\\Education\\MS Data Analytics - CUNY\\Lab4a\\more\\ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
Exercise 1: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
hist(samp)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 816 1211 1433 1476 1656 2872
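The distribution is unimodal and right skewed. Taking “typical” to mean the center of the distribution, the typical size in this sample is about 1433 (the median) to 1476 (the mean) square feet; the range is 816 to 2872. The mean, median and range will change each time a new sample is drawn.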
Exercise 2 : Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
I would not expect another student’s distribution to be identical: because the sampling is random, another student will draw a sample with a different set of elements. But since both samples are taken from the same population, I would expect the distributions to be similar.
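The point can be illustrated with a small sketch. Because the Ames data are not bundled here, it uses a simulated right-skewed stand-in population (`fake_population`, a hypothetical name, not the real `Gr.Liv.Area` values), draws two samples of 60, and compares their summaries. `set.seed()` is included only so that the sketch is reproducible.

```r
# Stand-in population: simulated, right-skewed values (NOT the Ames data).
set.seed(1)
fake_population <- rgamma(3000, shape = 9, scale = 165)

# Two students each draw their own random sample of 60.
samp_a <- sample(fake_population, 60)
samp_b <- sample(fake_population, 60)

# The two samples do not contain the same values...
identical(samp_a, samp_b)

# ...but their centers and spreads should be broadly similar.
summary(samp_a)
summary(samp_b)
```

Calling `set.seed()` before `sample()` in the lab itself would also keep the printed output and the written answers in agreement between runs.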
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1367.787 1584.913
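The interval above follows the usual recipe: sample mean plus or minus a critical value times the standard error. As a minimal sketch, the same steps can be wrapped in a helper. `normal_ci` and its `level` argument are hypothetical names introduced here, and the critical value comes from `qnorm()` instead of the hard-coded 1.96.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
normal_ci <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)              # standard error of the sample mean
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Toy data, for illustration only.
toy <- c(1, 2, 3, 4, 5)
normal_ci(toy)
```

With the lab's objects, `normal_ci(samp)` should closely match `c(lower, upper)`; any tiny difference comes from 1.96 being a rounded critical value.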
Exercise 3 : For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/√n. What conditions must be met for this to be true?
Answer : The observations must come from a random sample, the sample size must be at least 30, and the population distribution must not be strongly skewed.
Exercise 4: What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.
Answer : If we take many samples and compute a 95% confidence interval from each one, then about 95% of those intervals will contain the true population mean.
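This long-run interpretation can be checked by simulation. The sketch below is illustrative only: `simulate_coverage` is a hypothetical helper, and `fake_population` is a simulated right-skewed stand-in, not the Ames data. It repeatedly draws samples, builds a normal-approximation interval from each, and reports the share of intervals that contain the population mean.

```r
# Hypothetical helper: share of intervals that contain the population mean.
simulate_coverage <- function(pop, n = 60, reps = 1000, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # critical value for the chosen level
  mu <- mean(pop)                    # true population mean
  hits <- replicate(reps, {
    s  <- sample(pop, n)
    se <- sd(s) / sqrt(n)
    (mean(s) - z * se <= mu) && (mu <= mean(s) + z * se)
  })
  mean(hits)                         # proportion of intervals capturing mu
}

set.seed(42)
fake_population <- rgamma(3000, shape = 9, scale = 165)  # simulated, NOT Ames
coverage <- simulate_coverage(fake_population)
coverage  # should land near 0.95; the exact value depends on the seed
```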
mean(population)
## [1] 1499.69
Exercise 5: Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
Answer : Yes. 1499.69 is in the interval [1367.787, 1584.913].
Exercise 6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.
Answer : I expect about 95% of the intervals to capture the true population mean, because each interval is built at the 95% confidence level.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n)  # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)     # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)         # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
Answer :
plot_ci(lower_vector, upper_vector, mean(population))
Out of 50 intervals, 48 include the true population mean, that is, 96% of them. This is very close to the 95% confidence level but not exactly the same: the confidence level describes the long-run proportion of intervals that capture the mean, so the proportion in any single set of 50 intervals will vary around it.
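Rather than counting intervals by eye from the plot, the capture count can be computed directly. A minimal sketch, where `count_captured` is a hypothetical helper and the three toy intervals are made up for illustration:

```r
# Hypothetical helper: number of intervals [lower, upper] that contain `truth`.
count_captured <- function(lower, upper, truth) {
  sum(lower <= truth & truth <= upper)
}

# Toy intervals: [1, 2], [2, 3], [3, 4]
toy_lower <- c(1.0, 2.0, 3.0)
toy_upper <- c(2.0, 3.0, 4.0)

# Only the middle interval contains 2.5.
count_captured(toy_lower, toy_upper, truth = 2.5)
```

In this lab the call would be `count_captured(lower_vector, upper_vector, mean(population))`, and dividing the result by 50 gives the proportion of intervals that capture the mean.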
Answer :
I picked a 99% confidence level; the critical value for this confidence level is 2.58.
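The critical value can be checked with `qnorm()`: for a confidence level C, it is the standard normal quantile at 1 - (1 - C)/2. A short sketch, where `critical_value` is a hypothetical helper name:

```r
# Hypothetical helper: two-sided normal critical value for a confidence level.
critical_value <- function(level) {
  qnorm(1 - (1 - level) / 2)
}

conf_levels <- c(0.90, 0.95, 0.99)
round(sapply(conf_levels, critical_value), 3)
# [1] 1.645 1.960 2.576
```

These correspond to the multipliers 1.65, 1.96 and 2.58 used in this lab, where 1.65 is the customary rounding of 1.645.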
Answer :
lower_vector_90 <- samp_mean - 1.65 * samp_sd / sqrt(n)
upper_vector_90 <- samp_mean + 1.65 * samp_sd / sqrt(n)
plot_ci(lower_vector_90, upper_vector_90, mean(population))
These intervals are built at the 90% confidence level, so about 90% of them should include the population mean. In this set of samples, 46 out of 50 intervals (92%) include the mean. Since the confidence level is a long-run proportion, this is a good approximation of it.
lower_vector_99 <- samp_mean - 2.58 * samp_sd / sqrt(n)
upper_vector_99 <- samp_mean + 2.58 * samp_sd / sqrt(n)
plot_ci(lower_vector_99, upper_vector_99, mean(population))
All of the confidence intervals (i.e., 100%) include the true population mean. Since I chose a confidence level of 99%, a 100% capture rate among 50 intervals is a good approximation of the confidence level.