Data 606 Lab 4b

#setwd("C:/Users/stina/Documents/CUNY SPS Data Science/Spring 2018 Classes/DATA 606 - Probability and Statistics/Lab4b")

load("./lab4b/more/ames.RData")

(1) Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

The distribution of the sample is not symmetric: it is skewed to the right and appears to be multimodal.

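I interpreted “typical” to mean the sample mean, so the “typical” size within the sample is about 1450 sqft. The sample median (1412 sqft) is slightly lower, which is consistent with the right skew.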

mean(samp): 1450.067

sd(samp): 412.8894

population <- ames$Gr.Liv.Area
set.seed(1)
samp <- sample(population, 60)

#par(mfrow = c(1, 1))
#hist(population, breaks = 25)
hist(samp, breaks = 25)

mean(samp)

## [1] 1450.067

sd(samp)

## [1] 412.8894

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1206    1412    1450    1661    2730

(2) Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

Below I plotted histograms of 4 different samples of size 60. The shapes differ somewhat, but each distribution tends to be skewed to the right; none is symmetric, and they appear to be multimodal. The sample means ranged from 1450 to 1570.

So, to answer the question: no, I do not expect another student’s distribution to be identical to mine, because each student draws a different random sample. I do expect it to be similar, because every sample is a random sample of size 60 from the same population.

mean(samp1): 1450.067

mean(samp2): 1511.733

mean(samp3): 1570.067

mean(samp4): 1516.25

set.seed(1)
samp1 <- sample(population, 60)

set.seed(2)
samp2 <- sample(population, 60)

set.seed(3)
samp3 <- sample(population, 60)

set.seed(4)
samp4 <- sample(population, 60)

par(mfrow = c(2, 2))

hist(samp1, breaks = 25)
hist(samp2, breaks = 25)
hist(samp3, breaks = 25)
hist(samp4, breaks = 25)

mean(samp1)

## [1] 1450.067

mean(samp2)

## [1] 1511.733

mean(samp3)

## [1] 1570.067

mean(samp4)

## [1] 1516.25

(3) For the confidence interval to be valid, the sample mean must be normally distributed and have standard error sd/√(sample size). What conditions must be met for this to be true?

From pg. 178 of the book:

Important conditions to help ensure the sampling distribution of x_mean is nearly normal and the estimate of SE sufficiently accurate:

The sample observations are independent.

The sample size is large. n >= 30 is a good rule of thumb.

The population distribution is not strongly skewed. This condition can be difficult to evaluate so just use your best judgement.
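These conditions can be checked informally by simulation. The sketch below is only an illustration: it uses a synthetic right-skewed population (`pop`, generated with `rlnorm`, which is my assumption and NOT the Ames data) and compares the observed spread of many sample means against sd/√n.

```r
# Synthetic right-skewed "population" (an assumption; not the Ames data)
set.seed(606)
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
n <- 60

# Draw 5000 independent samples of size n and record each sample mean
sample_means <- replicate(5000, mean(sample(pop, n)))

# Standard error predicted by the formula vs. the spread actually observed
se_theory   <- sd(pop) / sqrt(n)
se_observed <- sd(sample_means)
c(theory = se_theory, observed = se_observed)

# If the sample mean is nearly normal, about 95% of the sample means
# should fall within 1.96 standard errors of the population mean
within_196 <- mean(abs(sample_means - mean(pop)) <= 1.96 * se_theory)
within_196
```

With n = 60 the two standard errors should be close even though the population itself is skewed, which is what the conditions above are meant to guarantee.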

(4) What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

From pg. 175 of the book:

But what does “95% confident” mean? Suppose we took many samples and built a confidence interval from each sample. Then about 95% of those intervals would contain the actual mean (of the population).
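The "many samples" idea in this definition can be sketched directly. The code below is a rough illustration under my own assumptions: `pop` is a synthetic population (not the Ames data), and 2000 intervals is an arbitrary choice that is large enough for the capture rate to settle down.

```r
# Synthetic population (an assumption; not the Ames data)
set.seed(606)
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
n <- 60
n_intervals <- 2000

# For each sample, build a 95% interval and record whether it
# contains the true population mean
captured <- replicate(n_intervals, {
  s  <- sample(pop, n)
  se <- sd(s) / sqrt(n)
  lower <- mean(s) - 1.96 * se
  upper <- mean(s) + 1.96 * se
  lower <= mean(pop) && mean(pop) <= upper
})

# Long-run proportion of intervals that capture the population mean
coverage <- mean(captured)
coverage
```

The proportion should land near 0.95, though it may sit slightly below because the population is skewed and the standard error is estimated from each sample.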

(5) Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

In the sample below, the 95% confidence interval (1345.591, 1554.542) does contain the population mean of 1499.69.

sample_mean <- mean(samp)
  
se <- sd(samp) / sqrt(60)           #sd of sample/ sqrt of sample size
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1345.591 1554.542

mean(population)

## [1] 1499.69
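As a check on the arithmetic, the interval can be reproduced from the rounded summary values reported earlier (mean 1450.067, sd 412.8894, n = 60). This is only a sketch; because the inputs are rounded, the last digit may differ slightly from the output above.

```r
# Summary values as reported in this lab (rounded)
sample_mean <- 1450.067
sample_sd   <- 412.8894
n           <- 60
pop_mean    <- 1499.69

# Standard error and 95% interval: mean +/- 1.96 * SE
se <- sample_sd / sqrt(n)
ci <- sample_mean + c(-1, 1) * 1.96 * se
ci

# TRUE if the interval captures the population mean
captures <- ci[1] <= pop_mean && pop_mean <= ci[2]
captures
```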

(6) Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

I would expect about 95% of the confidence intervals from different samples to capture the true mean of the population, because each one is a 95% confidence interval built from an independent random sample.

I am not working in a classroom environment.

Simulation

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
lower_vector <- rep(NA, 50)
upper_vector <- rep(NA, 50)

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

for (i in 1:50){
  lower_vector[i] <- samp_mean[i] - 1.96 * samp_sd[i] / sqrt(n) 
  upper_vector[i] <- samp_mean[i] + 1.96 * samp_sd[i] / sqrt(n)
}


c(lower_vector[1], upper_vector[1])

## [1] 1384.483 1666.117

On Your Own

(1) Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

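Looking at this particular case, 4 of the 50 confidence intervals do NOT contain the actual population mean, so 46 of 50 (92%) do.

The proportion is not exactly equal to the 95% confidence level. The confidence level describes the long-run capture rate over many samples; with only 50 intervals, the observed proportion varies by chance from run to run. In addition, 95% of 50 is 47.5 intervals, which cannot occur exactly.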

par(mfrow = c(1, 1))
plot_ci(lower_vector, upper_vector, mean(population))

# proportion of confidence intervals that contain population mean. 
(50-4)/50

## [1] 0.92
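Instead of counting misses from the plot by hand, the proportion can be computed with a vectorized comparison, and the binomial distribution gives a sense of how much it should vary across runs. In this sketch the interval endpoints are made-up numbers for illustration only (not this lab's output), and I am assuming each interval independently captures the mean with probability 0.95.

```r
# Hypothetical interval endpoints (made-up values, not the lab's output)
lower <- c(1400, 1350, 1520, 1380)
upper <- c(1600, 1560, 1700, 1490)
true_mean <- 1499.69

# TRUE where an interval contains the true mean; the mean of a
# logical vector is the proportion of TRUE values
contains <- lower <= true_mean & true_mean <= upper
prop_contains <- mean(contains)
prop_contains

# How much should the capture proportion vary with only 50 intervals?
n_int <- 50
p     <- 0.95
se_prop <- sqrt(p * (1 - p) / n_int)   # roughly 0.03

# Chance that 46 or fewer of 50 intervals capture the mean
p_46_or_fewer <- pbinom(46, size = n_int, prob = p)   # roughly 0.24
c(se_prop = se_prop, p_46_or_fewer = p_46_or_fewer)
```

Under that assumption, a result of 92% is about one standard error below 95% and is not unusual for 50 intervals.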

(2) Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

To create a 99% confidence interval, a critical value of about 2.58 would be used.
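The critical value can be looked up with `qnorm` rather than a table. In this short sketch, `conf_level` is a name I chose for illustration; the interval leaves half of the remaining probability in each tail.

```r
# Confidence level chosen for the interval
conf_level <- 0.99
alpha <- 1 - conf_level

# Two-sided critical value: the quantile with alpha/2 in the upper tail
z_star <- qnorm(1 - alpha / 2)
round(z_star, 3)   # 2.576, which rounds to the 2.58 used here

# Same calculation for the 95% level used earlier in the lab
z_95 <- qnorm(0.975)
round(z_95, 2)     # 1.96
```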

(3) Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

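In the plot below, all 50 of the 99% confidence intervals contain the population mean, a capture rate of 100%. This is consistent with the 99% confidence level, under which only about 0.5 of 50 intervals would be expected to miss.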

for (i in 1:50){
  lower_vector[i] <- samp_mean[i] - 2.58 * samp_sd[i] / sqrt(n) 
  upper_vector[i] <- samp_mean[i] + 2.58 * samp_sd[i] / sqrt(n)
}

plot_ci(lower_vector, upper_vector, mean(population))
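To judge whether seeing no misses is plausible at the 99% level, a quick calculation helps. This is a sketch that assumes each of the 50 intervals independently captures the population mean with probability 0.99.

```r
n_int      <- 50
conf_level <- 0.99

# Expected number of intervals that miss the population mean
expected_misses <- n_int * (1 - conf_level)

# Probability that all 50 intervals capture the mean
p_all_capture <- conf_level^n_int   # about 0.605

c(expected_misses = expected_misses, p_all_capture = p_all_capture)
```

Under that assumption, a run of 50 intervals with no misses happens more often than not, so a 100% capture rate here does not contradict the 99% confidence level.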