load("C:\\Users\\jkuruvilla\\Desktop\\Education\\MS Data Analytics - CUNY\\Lab4a\\more\\ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

Exercise 1: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

hist(samp)

summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1211    1433    1476    1656    2872

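The distribution is right-skewed and unimodal. The mean is 1476, the median is 1433, and the range is 816 to 2872. Taking “typical” to mean the center of the distribution, as measured by the median and mean, a typical size in this sample is roughly 1,430 to 1,480 square feet. The mean, median, and range would all change if a different sample were drawn.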

Exercise 2 : Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another student’s distribution to be identical, because the sampling is random and another student will draw a different set of 60 houses. I would expect it to be similar, though, because both samples come from the same population and should share its overall shape, center, and spread.
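The point about two students' samples can be illustrated with a minimal sketch. It assumes a simulated, right-skewed stand-in population (`population_sim`, drawn with `rlnorm`), not the real Ames data, and the names `samp_a` and `samp_b` are hypothetical.

```r
# Stand-in population (assumption): right-skewed values, NOT the real Ames data.
set.seed(1)
population_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Two "students" each draw their own random sample of 60.
samp_a <- sample(population_sim, 60)
samp_b <- sample(population_sim, 60)

# The samples contain different elements ...
identical(samp_a, samp_b)

# ... but their five-number summaries and means can be compared side by side.
sapply(list(a = samp_a, b = samp_b), summary)
```

The two columns of the summary table will differ in detail, while both reflect the same underlying population.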

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(length(samp))   # standard error of the mean (n = 60)
lower <- sample_mean - 1.96 * se      # 1.96 is the critical value for 95% confidence
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1367.787 1584.913

Exercise 3 : For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/√n. What conditions must be met for this to be true?

Answer : The observations must be independent (a random sample that makes up less than 10% of the population), the sample size must be at least 30, and the population distribution must not be strongly skewed.
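One way to see why these conditions matter is to simulate the sampling distribution of the mean and compare its spread with the standard error formula. The sketch below assumes a simulated stand-in population (`population_sim`), not the Ames data, and all names ending in `_sim` are hypothetical.

```r
# Stand-in population (assumption): right-skewed values, NOT the real Ames data.
set.seed(2)
population_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n_sim <- 60

# Draw many random samples of size 60 and record each sample mean.
means_sim <- replicate(5000, mean(sample(population_sim, n_sim)))

# Spread of the simulated sample means (empirical standard error).
se_empirical <- sd(means_sim)

# Standard error predicted by the formula sigma / sqrt(n).
se_formula <- sd(population_sim) / sqrt(n_sim)

c(empirical = se_empirical, formula = se_formula)
```

The two values should be close. Calling `hist(means_sim)` on the result shows how nearly symmetric the sample means are even though the population itself is skewed.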

Exercise 4: What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

Answer : If we take many samples and compute a 95% confidence interval from each one, then about 95% of those intervals will contain the true population mean.

mean(population)
## [1] 1499.69

Exercise 5: Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Answer : Yes. The population mean, 1499.69, lies within the interval [1367.787, 1584.913].

Exercise 6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

Answer : I expect about 95% of the intervals to capture the true population mean, because each interval is built at the 95% confidence level. With a limited number of students, the observed proportion may differ somewhat from 95%.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

On Your Own

  1. Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

Answer :

plot_ci(lower_vector, upper_vector, mean(population))

Out of 50 intervals, 48 include the true population mean, which is 96%. This is very close to the 95% confidence level but not exactly equal to it. The confidence level describes the long-run proportion over many repeated samples, so the proportion in any one set of 50 intervals varies by chance. In fact, 95% of 50 is 47.5, so exactly 95% is not attainable with 50 intervals.
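The count read off the plot can also be computed directly. The following is a minimal, self-contained sketch that repeats the simulation on a stand-in population (`population_sim`, an assumption, not the Ames data). With the lab's own objects, the same comparison would use `lower_vector`, `upper_vector`, and `mean(population)`.

```r
# Stand-in population (assumption): right-skewed values, NOT the real Ames data.
set.seed(3)
population_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
mu_sim <- mean(population_sim)
n_sim <- 60

# 50 samples: row 1 holds the sample means, row 2 the sample standard deviations.
stats_sim <- replicate(50, {
  s <- sample(population_sim, n_sim)
  c(mean(s), sd(s))
})

# 95% confidence interval bounds for each sample.
lower_sim <- stats_sim[1, ] - 1.96 * stats_sim[2, ] / sqrt(n_sim)
upper_sim <- stats_sim[1, ] + 1.96 * stats_sim[2, ] / sqrt(n_sim)

# TRUE where an interval contains the population mean.
covered <- lower_sim <= mu_sim & upper_sim >= mu_sim
sum(covered)    # number of intervals that capture the mean
mean(covered)   # proportion of intervals that capture the mean
```

Rerunning with a different seed gives a different proportion each time, which is the chance variation described above.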

  2. Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

Answer :

I picked a 99% confidence level, for which the critical value is 2.58. For comparison, I also computed intervals at the 90% level, using a critical value of 1.65.
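The critical value does not have to be looked up in a table. As a small sketch, it can be obtained from the standard normal quantile function `qnorm`; the variable names `conf_level`, `alpha`, and `z_star` are chosen here for illustration.

```r
# Critical value for a two-sided interval at a chosen confidence level.
conf_level <- 0.99
alpha <- 1 - conf_level

# Leave alpha / 2 in each tail of the standard normal distribution.
z_star <- qnorm(1 - alpha / 2)
round(z_star, 2)   # 2.58

# The same calculation for the 90%, 95%, and 99% levels at once.
levels_all <- c(0.90, 0.95, 0.99)
z_all <- qnorm(1 - (1 - levels_all) / 2)
z_all
```

For the 90% level the exact value is about 1.645, so the 1.65 used in this lab is a slightly rounded-up version of it.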

  3. Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

Answer :

lower_vector_90 <- samp_mean - 1.65 * samp_sd / sqrt(n) 
upper_vector_90 <- samp_mean + 1.65 * samp_sd / sqrt(n)
plot_ci(lower_vector_90, upper_vector_90, mean(population))

These intervals are built at the 90% confidence level. In this set, 46 out of 50 intervals (92%) include the population mean. That is close to the nominal 90%, and a difference of this size is expected from chance variation when only 50 intervals are examined.

lower_vector_99 <- samp_mean - 2.58 * samp_sd / sqrt(n) 
upper_vector_99 <- samp_mean + 2.58 * samp_sd / sqrt(n)
plot_ci(lower_vector_99, upper_vector_99, mean(population))

All 50 of the confidence intervals (i.e., 100%) include the true population mean. At the 99% confidence level we would expect about 49.5 of 50 intervals to capture the mean, so observing all 50 is consistent with the chosen level.
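How unusual these counts are can be gauged with a binomial model. The sketch below assumes, as a simplification, that each of the 50 intervals independently captures the mean with probability equal to its nominal confidence level. The names `k`, `p_all_99`, and `p_ge46_90` are hypothetical.

```r
# Number of intervals examined.
k <- 50

# Probability that all 50 intervals capture the mean at the 99% level.
# This equals 0.99^50, which is roughly 0.6.
p_all_99 <- dbinom(50, size = k, prob = 0.99)

# Probability that 46 or more of 50 intervals capture the mean at the 90% level.
p_ge46_90 <- pbinom(45, size = k, prob = 0.90, lower.tail = FALSE)

c(all_50_at_99 = p_all_99, at_least_46_at_90 = p_ge46_90)
```

Under this model, seeing every one of 50 intervals cover the mean at the 99% level is the most likely single outcome rather than a surprise.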