The Data

setwd("C:/Users/Robert/Documents/R/win-library/3.2/IS606/labs/Lab4a")
load("more/ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

Exercise 1

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

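set.seed(60)  # no effect on samp, which was already drawn above; call set.seed() before sample() to make the draw reproducible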
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     672    1185    1484    1483    1698    2855
hist(samp,breaks=20)

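The distribution is unimodal and roughly bell-shaped, with a slight right skew: the maximum (2855) lies much farther above the third quartile (1698) than the minimum (672) lies below the first quartile (1185). A few large houses stretch the upper tail and widen the range. Taking “typical” to mean the center of the distribution, the typical size is about 1483 square feet, the sample mean, which nearly coincides with the median (1484). Please refer to the summary table for specific results.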

Exercise 2

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

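Not identical, but similar. Each student draws a different random sample of 60 houses, so the histograms and summary statistics will differ in detail. Because every sample comes from the same population, most should show a similar center, spread, and overall shape, although with only 60 observations the shapes can still vary noticeably.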

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1363.341 1602.759
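The interval calculation above can be wrapped in a small helper so that the confidence level is a parameter rather than a hard-coded 1.96. This is a minimal sketch: `ci_mean` is a hypothetical name that is not part of the lab materials, and the vector `x` is made-up data standing in for `samp`.

```r
# Hypothetical helper (not part of the lab materials): normal-approximation
# confidence interval for a mean, generalising the 1.96 * se calculation.
ci_mean <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)               # standard error of the sample mean
  z  <- qnorm(1 - (1 - level) / 2)    # about 1.96 when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Made-up data for illustration only; substitute samp from the lab
set.seed(1)
x <- rnorm(60, mean = 1500, sd = 500)
ci_mean(x)
```

Using `length(x)` instead of a literal 60 keeps the standard error correct if the sample size changes.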

Exercise 3

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?

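The observations must be independent, which holds for a simple random sample drawn from less than 10% of the population. The sample must be large enough (n ≥ 30 is the usual rule of thumb), and the population distribution must not be strongly skewed. Under these conditions the Central Limit Theorem implies that the sample mean is nearly normal with standard error s/sqrt(n).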

Exercise 4

What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

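“95% confidence” describes the method rather than any single interval: if we repeatedly drew random samples and built an interval from each, about 95% of those intervals would contain the true population mean. For the one interval computed from our sample, we say we are 95% confident that it contains the population mean. The population mean is a fixed value, so this is not a 95% probability that the mean falls within this particular range.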
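The repeated-sampling meaning of a confidence level can be checked by simulation. The sketch below assumes a synthetic right-skewed population (`pop`, generated with `rgamma`) as a stand-in for `ames$Gr.Liv.Area`, so its numbers are illustrative and are not results from the Ames data.

```r
set.seed(123)

# Synthetic right-skewed population (assumption, not the Ames data)
pop <- rgamma(3000, shape = 9, rate = 0.006)   # mean close to 1500
true_mean <- mean(pop)

n    <- 60      # sample size, as in the lab
reps <- 2000    # number of repeated samples

# For each repetition, draw a sample and record whether its 95% interval
# (sample mean +/- 1.96 standard errors) contains the true mean
covered <- replicate(reps, {
  s  <- sample(pop, n)
  se <- sd(s) / sqrt(n)
  abs(mean(s) - true_mean) <= 1.96 * se
})

mean(covered)   # share of intervals that contain the true mean
```

The share should land near 0.95, without matching it exactly, because the normal approximation is only approximate for a skewed population.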

Exercise 5

Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

sample_mean
## [1] 1483.05
c(lower, upper)
## [1] 1363.341 1602.759
mean(population)
## [1] 1499.69

The population mean (1499.69) lies between the interval's lower bound (1363.341) and upper bound (1602.759), so this confidence interval does capture the true mean.

Exercise 6

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

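About 95% of the intervals should capture the true population mean (about 1500), because that is what a 95% confidence level means: the procedure captures the mean in roughly 95% of repeated samples. The proportion in a particular class will not be exactly 95%, both because the class is a finite set of random samples and because the normal approximation is not exact for a population that is not perfectly normal. These are all tools for approximation and inference. Two fresh samples illustrate how much individual samples vary: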

hist(sample(population, 60))

hist(sample(population, 60))

On Your Own

1

Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

set.seed(50)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

c(lower_vector[1], upper_vector[1])
## [1] 1320.593 1587.307
plot_ci(lower_vector, upper_vector, mean(population))

#proportion calculation: 2 of the 50 intervals in the plot miss the population mean
p <- 1-(2/50)
p
## [1] 0.96

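The observed proportion (96%) is close to, but not exactly equal to, the 95% confidence level. An exact match is not possible here: with 50 intervals the proportion can only be a multiple of 2%, and the number of intervals that capture the mean varies randomly from one set of samples to the next.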
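The count of missed intervals above was read off the plot. It can also be computed directly from the interval bounds, which avoids counting errors. The sketch below assumes a synthetic population `pop` in place of the Ames data, so its counts are illustrative only and will not match the plot.

```r
set.seed(50)

# Synthetic stand-in for the population (assumption, not the Ames data)
pop <- rgamma(3000, shape = 9, rate = 0.006)
n   <- 60

samp_mean <- rep(NA, 50)
samp_sd   <- rep(NA, 50)
for (i in 1:50) {
  s <- sample(pop, n)
  samp_mean[i] <- mean(s)
  samp_sd[i]   <- sd(s)
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# TRUE where an interval contains the population mean
captured <- lower_vector <= mean(pop) & mean(pop) <= upper_vector

sum(!captured)   # number of intervals that miss the mean
mean(captured)   # proportion of intervals that capture the mean
```

With the lab's own `lower_vector`, `upper_vector`, and `population`, the last three statements apply unchanged once `pop` is replaced by `population`.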

2

Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

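For a confidence level of 99%, the critical value is the 99.5th percentile of the standard normal distribution, since 0.5% is left in each tail: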

criticalvalue<-qnorm(.995)
criticalvalue
## [1] 2.575829
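The same tail-area rule gives the critical value for any confidence level: leave half of the remaining probability in each tail. A short sketch comparing three common levels (the choice of levels is mine, for illustration):

```r
conf_levels <- c(0.90, 0.95, 0.99)

# Upper-tail quantile: half of (1 - level) sits in each tail
z <- qnorm(1 - (1 - conf_levels) / 2)

round(z, 3)
## [1] 1.645 1.960 2.576
```

Higher confidence levels give larger critical values and therefore wider intervals.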

3

Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

set.seed(50)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 50  # sample size used for these intervals (the earlier set used n = 60)

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 50 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - criticalvalue * samp_sd / sqrt(n) 
upper_vector <- samp_mean + criticalvalue * samp_sd / sqrt(n)

#c(lower_vector[1], upper_vector[1])

plot_ci(lower_vector, upper_vector, mean(population))

#proportion calculation: none of the 50 intervals in the plot miss the population mean
p <- 1-(0/50)
p
## [1] 1

In this scenario, all 50 confidence intervals captured the population mean, a proportion of 100% against a confidence level of 99%. The two are not the same, but at 99% confidence we expect only about 0.5 misses in 50 intervals, so seeing none is unsurprising. With 100 intervals we would expect about one miss.
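How likely is it to see no misses at all? A rough answer comes from the binomial distribution, under the simplifying assumptions that the intervals are independent and that each misses with probability exactly 0.01. Neither assumption holds perfectly here, so the figures are approximate.

```r
miss_rate <- 0.01   # nominal miss probability for a 99% interval

# Probability of zero misses among 50 and among 100 intervals
p_none_50  <- dbinom(0, size = 50,  prob = miss_rate)
p_none_100 <- dbinom(0, size = 100, prob = miss_rate)

round(c(p_none_50, p_none_100), 3)
## [1] 0.605 0.366
```

Under these assumptions, zero misses is the most likely outcome for 50 intervals, and at least one miss becomes more likely than not with 100.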