Lab 4b

Exercises

Exercise 1

population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp, breaks = 10)

The distribution is right skewed, which makes intuitive sense: living area can’t go below 0, but it can get quite large. A typical observation falls between about 750 and 1750 square feet. The histogram bars drop off quickly on either side of that range, so it holds the bulk of the distribution.

Exercise 2

No one is going to have an identical distribution. The number of possible samples is enormous (N choose 60, where N is the number of houses in the population). Using “similar” loosely, I would expect most samples to look alike, since the sample size of 60 is double 30. Because people don’t pick seed numbers at random, though, identical samples are possible (favorite numbers and such).
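To get a feel for how large the count of possible samples is, here is a rough sketch. The population size `N` is an assumption (2930 is the row count usually quoted for the Ames data); with the data loaded, `length(population)` would be the value to use instead.

```r
N <- 2930                                   # assumption: number of houses; use length(population) if ames is loaded
n_possible <- choose(N, 60)                 # number of distinct samples of size 60
log10_possible <- lchoose(N, 60) / log(10)  # same count on a log10 scale, easier to read
n_possible
log10_possible
```

The log10 value is the number of digits in the count, which is why two students drawing the same sample by chance is not a realistic worry.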

Exercise 3

The sample must be drawn at random and be large enough for the sampling distribution of the mean to be nearly normal. The more skewed the population’s distribution, the larger the sample needs to be.

Exercise 4

95% confidence means that if we repeatedly drew random samples of this size and built an interval from each one in the same way, about 95% of those intervals would contain the true population mean. It describes how reliable the method is, not a 95% chance that the mean falls inside any one particular interval.

Exercise 5

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1491.152 1775.748

mean(population)

## [1] 1499.69

The population mean (1499.69) is within our confidence interval (1491.152 to 1775.748).
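The same calculation can be wrapped in a small helper so the confidence level is not hard-coded. This is only a sketch: `ci_mean` is a hypothetical name, not something from the lab, and `fake_samp` is made-up right-skewed data standing in for `samp` so the code runs without the ames data.

```r
# Hypothetical helper (not part of the lab): z-based confidence interval for a mean
ci_mean <- function(x, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)   # critical value, about 1.96 when conf = 0.95
  se <- sd(x) / sqrt(length(x))     # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

set.seed(1)                                          # arbitrary seed, for reproducibility only
fake_samp <- rlnorm(60, meanlog = 7.2, sdlog = 0.3)  # made-up data, stands in for samp
ci <- ci_mean(fake_samp)
ci
```

With the real sample, `ci_mean(samp)` should give an interval very close to the one computed above, since `qnorm(0.975)` is approximately 1.96.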

Exercise 6

You would expect about 95% of those intervals to capture the true population mean. With 20 samples, it could easily be 2 intervals that miss, but with 1 million samples the proportion will be very close to 95%, assuming the sampling distribution is approximately normal.
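A quick way to put numbers on this, assuming each interval independently misses the true mean with probability 0.05 (the names here are just for illustration):

```r
conf <- 0.95
miss <- 1 - conf                       # chance that a single interval misses

# Probability that 2 or more of 20 intervals miss the true mean
p_two_or_more <- 1 - pbinom(1, size = 20, prob = miss)

# Standard deviation of the observed coverage proportion across k intervals
coverage_sd <- function(k) sqrt(conf * miss / k)
sd_20      <- coverage_sd(20)
sd_million <- coverage_sd(1e6)

c(p_two_or_more = p_two_or_more, sd_20 = sd_20, sd_million = sd_million)
```

Under that assumption, 2 or more misses out of 20 happens a bit more than a quarter of the time. The spread of the coverage proportion shrinks from about 5 percentage points with 20 intervals to a small fraction of a percentage point with 1 million.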

On Your Own

Problem 1

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

plot_ci(lower_vector, upper_vector, mean(population))

47/50, or 94%, of the samples have the true mean within their 95% confidence interval. If we assume the sampling distribution is exactly normal, we can use the binomial distribution to determine the probability of getting exactly 3 intervals that do not contain the true mean. It is 0.2198748, which isn’t that unlikely.
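The binomial probability quoted above can be reproduced with `dbinom`, treating each of the 50 intervals as an independent trial with a 5% chance of missing. The variable names are just for illustration.

```r
n_intervals <- 50
p_miss <- 0.05

p_exactly_3 <- dbinom(3, size = n_intervals, prob = p_miss)      # exactly 3 misses
p_3_or_more <- 1 - pbinom(2, size = n_intervals, prob = p_miss)  # 3 or more misses
expected_misses <- n_intervals * p_miss                          # average number of misses

c(p_exactly_3 = p_exactly_3, p_3_or_more = p_3_or_more, expected_misses = expected_misses)
```

The "3 or more" probability is arguably the more useful number for judging whether 47/50 is surprising, since it counts every outcome at least as extreme as the one observed.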

Problem 2

For a 90% confidence level, the critical value is 1.645.
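One way to get this value rather than look it up is `qnorm`: a 90% interval leaves 5% in each tail, so the critical value is the 95th percentile of the standard normal. A minimal sketch:

```r
conf_level <- 0.90
z_star <- qnorm(1 - (1 - conf_level) / 2)   # 95th percentile of the standard normal
round(z_star, 3)
```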

Problem 3

lower_vector <- samp_mean - 1.645 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.645 * samp_sd / sqrt(n)

plot_ci(lower_vector, upper_vector, mean(population))

Here 4 intervals miss the true mean, when we expect about 5 (10% of 50). The probability of exactly 4 is 0.1809045. Again, not that unlikely.
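Rather than counting misses by eye from the plot, they can be counted directly. This sketch follows the same loop as Problem 1, but with a made-up right-skewed population (`fake_population` is an assumption, standing in for `ames$Gr.Liv.Area`), so the count it gives will not match the 4 seen above.

```r
set.seed(42)                                                  # arbitrary seed, for reproducibility only
fake_population <- rlnorm(2000, meanlog = 7.2, sdlog = 0.35)  # made-up stand-in for the real population
n <- 60
z <- 1.645                                                    # 90% critical value

samp_mean <- rep(NA, 50)
samp_sd   <- rep(NA, 50)
for (i in 1:50) {
  samp <- sample(fake_population, n)   # sample of size n from the stand-in population
  samp_mean[i] <- mean(samp)
  samp_sd[i]   <- sd(samp)
}
lower_vector <- samp_mean - z * samp_sd / sqrt(n)
upper_vector <- samp_mean + z * samp_sd / sqrt(n)

mu <- mean(fake_population)
misses <- sum(mu < lower_vector | mu > upper_vector)  # intervals that fail to capture mu
misses

p_exactly_4 <- dbinom(4, size = 50, prob = 0.10)      # chance of exactly 4 misses out of 50
p_exactly_4
```

With the real data, replacing `fake_population` with `population` should reproduce the count read off the plot, and the `dbinom` line gives the probability quoted above.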