##Confidence Intervals
library(DATA606)
startLab('Lab4b')
## Setting working directory to C:/Users/zahir/Documents/R/win-library/3.5/DATA606/labs/Lab4b
## [1] "C:/Users/zahir/Documents/Data 606/Lab/Lab4b/Lab4b/zahir-confidence_intervals.Rmd"
##The data
load("more/ames.RData")
population <- ames$Gr.Liv.Area
set.seed(420)
samp <- sample(population, 60)
##Ex1: Describe the distribution of your sample. What would you say is the "typical" size within your sample? Also state precisely what you interpreted "typical" to mean.
par(mfrow=c(1,1))
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 792 1254 1498 1539 1736 2726
hist(samp, probability=TRUE)
range <- min(samp) : max(samp)
lines(x=range, y=dnorm(x=range, mean=mean(samp), sd=sd(samp)))

##The distribution is skewed to the right.
##The typical size is the mean, which is about 1539 (the median, 1498, is slightly lower because of the right skew). Here "typical" means the average (mean) size.
##Ex2: Would you expect another student's distribution to be identical to yours? Would you expect it to be similar? Why or why not?
##Not identical, because each sample is drawn at random and so contains a different set of houses.
##It should be similar, though: both samples come from the same population, so with 60 observations each I would expect a similar center, spread, and right-skewed shape.
##Confidence Intervals
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1429.185 1648.415
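##The same calculation can be wrapped in a small function. This is only a sketch: ci_mean is a hypothetical helper (not part of the lab materials), and demo_samp is synthetic stand-in data so the code runs without ames.RData.

```r
# Hypothetical helper: normal-based confidence interval for a mean
ci_mean <- function(x, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # two-sided critical value, about 1.96 for 95%
  se <- sd(x) / sqrt(length(x))      # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Synthetic stand-in sample (an assumption, not the Ames data)
set.seed(1)
demo_samp <- rnorm(60, mean = 1500, sd = 500)
ci_mean(demo_samp)
```

##Using qnorm() instead of a hard-coded 1.96 makes it easy to change the confidence level later.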
##Ex3: For the confidence interval to be valid, the sample mean must be normally distributed and have standard error $s / \sqrt{n}$. What conditions must be met for this to be true?
##From the textbook, the following conditions are needed:
##1. The sample observations need to be independent.
##2. The sample size is equal to or larger than 30
##3. The population distribution is not strongly skewed
##Ex4: What does "95% confidence" mean? If you're not sure, see Section 4.2.2.
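##It means that if we took many samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population mean. We are 95% confident that our interval is one of them.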
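##One way to see what the confidence level describes is to simulate it. A minimal sketch, assuming a synthetic right-skewed population (demo_pop, not the Ames data): draw many samples of 60, build a 95% interval from each, and record how often the interval captures the population mean.

```r
set.seed(2)
# Assumed stand-in population: right-skewed, mean near 1500, sd near 500
demo_pop <- rgamma(3000, shape = 9, rate = 0.006)
mu <- mean(demo_pop)
n_demo <- 60

# For each of 2000 samples, does the 95% interval capture mu?
covers <- replicate(2000, {
  s <- sample(demo_pop, n_demo)
  half <- 1.96 * sd(s) / sqrt(n_demo)   # margin of error
  (mean(s) - half <= mu) && (mu <= mean(s) + half)
})
mean(covers)  # long-run share of intervals that capture mu; should be near 0.95
```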
mean(population)
## [1] 1499.69
##Ex5: Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor's interval capture this value?
##Yes, the true mean (1499.69) lies within the interval (1429.185, 1648.415). Sorry, no neighbour!
##Ex6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.
##We would expect about 95% of the intervals to contain the true population mean, because that is what a 95% confidence level describes.
##Creating Empty Vectors
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
##Creating loop
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
##Construct the Confidence Interval
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1325.32 1568.88
##On your own
##1. Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
plot_ci(lower_vector, upper_vector, mean(population))

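##4 out of 50 intervals do not contain the true population mean, so 46 of 50 (92%) do.
##This is not exactly equal to the 95% confidence level, but it is close. The confidence level describes the long-run proportion over many samples; with only 50 intervals, the observed proportion varies by chance.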
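##The proportion can also be computed rather than counted from the plot, and a binomial model gives a sense of how much the number of misses can vary. This is a sketch with hypothetical endpoints (lo, hi, true_mean), and it assumes the 50 intervals come from independent samples.

```r
# Share of intervals that capture a known mean (illustrative values only)
lo <- c(1400, 1510, 1450, 1380)
hi <- c(1600, 1700, 1650, 1490)
true_mean <- 1500
coverage <- mean(lo <= true_mean & true_mean <= hi)
coverage                                   # 2 of the 4 intervals capture 1500

# Variation in the number of misses across 50 intervals at the 95% level
n_int <- 50
miss_prob <- 0.05
n_int * miss_prob                          # expected misses: 2.5
sqrt(n_int * miss_prob * (1 - miss_prob))  # sd of the miss count, about 1.54
p_four_plus <- 1 - pbinom(3, n_int, miss_prob)
p_four_plus                                # chance of 4 or more misses, about 0.24
```

##Under these assumptions, 4 misses out of 50 is well within ordinary sampling variation.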
##2. Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
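##Let us choose 80%. An 80% interval leaves 10% in each tail, so the critical value is the 0.90 quantile of the standard normal distribution.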
qnorm(0.90)
## [1] 1.281552
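##For a two-sided interval, the quantile passed to qnorm() is not the confidence level itself but 1 - (1 - level) / 2, so qnorm(0.90) is the critical value for an 80% interval. A small sketch, where crit_value is a hypothetical helper:

```r
# Hypothetical helper: two-sided normal critical value for a confidence level
crit_value <- function(level) qnorm(1 - (1 - level) / 2)

conf_levels <- c(0.80, 0.85, 0.90, 0.95)
crit <- sapply(conf_levels, crit_value)
round(crit, 3)  # approximately 1.282, 1.440, 1.645, 1.960
```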
##3. Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
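# Note: this loop draws 50 new samples; the stored samp_mean and samp_sd could be reused instead, as the exercise allows
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
z <- qnorm(0.90) # critical value from the previous question, 1.281552
lower_vector <- samp_mean - z * samp_sd / sqrt(n)
upper_vector <- samp_mean + z * samp_sd / sqrt(n)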
plot_ci(lower_vector, upper_vector, mean(population))

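##8 out of 50 intervals do not contain the true mean (16% do not).
##So 84% contain the true mean. The critical value 1.281552 is qnorm(0.90), which corresponds to an 80% confidence level, so 84% coverage is close to what we would expect; the gap is within the chance variation of 50 intervals.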
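##Intervals at several confidence levels can be built from one set of stored sample means and standard deviations, without drawing new samples. A minimal sketch, assuming a synthetic population (demo_pop) in place of the Ames data; coverage_at is a hypothetical helper.

```r
set.seed(3)
demo_pop <- rgamma(3000, shape = 9, rate = 0.006)       # assumed stand-in population
n_demo <- 60
demo_samps <- replicate(50, sample(demo_pop, n_demo))   # 60 x 50 matrix, one sample per column
demo_mean <- colMeans(demo_samps)                       # 50 sample means
demo_sd <- apply(demo_samps, 2, sd)                     # 50 sample standard deviations

# Proportion of the 50 intervals that capture the population mean at a given level
coverage_at <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  half <- z * demo_sd / sqrt(n_demo)
  mean(demo_mean - half <= mean(demo_pop) & mean(demo_pop) <= demo_mean + half)
}
cov_by_level <- sapply(c(0.80, 0.95), coverage_at)
cov_by_level
```

##Because the same samples are reused, the 95% intervals always capture the mean at least as often as the narrower 80% intervals.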