Foundations for statistical inference

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like “How big is the typical house in Ames?” and “How much variation is there in the sizes of houses?” If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.

The data

In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.

load("more/ames.RData")

In this lab we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.

set.seed(500)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp)

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
```
The distribution is unimodal, with a single dominant peak, and is right skewed.
The typical size is about 1574. Here “typical” is taken to mean the sample mean, the size we would expect on average for a randomly chosen house in the sample.
```

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     759    1141    1492    1574    1837    3820

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would expect it to be similar but not identical, because their sample is a different random draw from the same population. I would expect their sample mean, like mine, to fall within a certain range of the true population mean.

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,

sample_mean <- mean(samp)

Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.

We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (See Section 4.2.3 if you are unfamiliar with this formula).

se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1433.279 1714.987
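As a quick check on how the interval is put together, the printed bounds can be taken apart again: the midpoint should be the sample mean, and the half-width should be 1.96 standard errors. The sketch below uses only the two bounds shown above and the sample size of 60; the variable names are illustrative and not part of the lab.

```r
# Bounds copied from the printed 95% interval above
reported_lower <- 1433.279
reported_upper <- 1714.987

# The interval is symmetric around the point estimate
midpoint   <- (reported_lower + reported_upper) / 2
half_width <- (reported_upper - reported_lower) / 2

# half_width = 1.96 * se, and se = sd / sqrt(n) with n = 60
implied_se <- half_width / 1.96
implied_sd <- implied_se * sqrt(60)

round(c(midpoint = midpoint, half_width = half_width,
        implied_se = implied_se, implied_sd = implied_sd), 2)
```

The midpoint agrees with the mean of 1574 reported by summary(samp), which is a simple way to confirm that the interval was built from the intended sample.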

This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

The observations must be independent, which holds here because the sample was drawn at random from the population data and is a small fraction of that population.
The sample size must also be large enough (\(n \geq 30\) is a common rule of thumb; here \(n = 60\)), and the population distribution must not be strongly skewed.
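The two numeric conditions can be checked mechanically. The following is only a sketch: it runs on a simulated right-skewed stand-in population (`population_demo`, an assumption made so the block is self-contained), not on the Ames data, and the names `samp_demo`, `n_demo`, and `checks` are hypothetical.

```r
# Stand-in data for illustration only (NOT the Ames population)
set.seed(1)
population_demo <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)
samp_demo <- sample(population_demo, 60)

n_demo <- length(samp_demo)
checks <- c(
  at_least_30      = n_demo >= 30,                            # sample size rule of thumb
  under_10_percent = n_demo / length(population_demo) < 0.10  # sample is a small share of the population
)
checks

# A mean well above the median hints at right skew; a histogram is still the better check
c(mean = mean(samp_demo), median = median(samp_demo))
```

With the lab's objects, the same checks would use samp and population in place of the stand-ins.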

Confidence levels

What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

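A 95% confidence level means that if we took many random samples and built an interval from each one in this way, about 95% of those intervals would contain the true population mean. For this sample, we are 95% confident that the mean house size in Ames, Iowa falls between 1,433.279 and 1,714.987.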

In this case we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:

mean(population)

## [1] 1499.69

Does your confidence interval capture the true average size of houses in Ames?

Yes, it does. The true population mean of 1,499.69 falls between 1,433.279 and 1,714.987, our 95% confidence interval.

If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

This lab was not done in a classroom, but I would expect most other students’ intervals to capture this value, although a few might miss it.

Each student in your class should have gotten a slightly different confidence interval.

Each student uses a different random sample from the same population, so we would expect the confidence intervals to be slightly different. Each sample will have a slightly different mean and standard deviation due to the randomness of sampling.

What proportion of those intervals would you expect to capture the true population mean?

I would expect about 95% of them (though not exactly 95%) to capture the true population mean.

Why? By construction, roughly 5% of random samples will have a sample mean more than 1.96 standard errors away from the true population mean, so I would expect a few intervals to miss it.

If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

This is not applicable for this lab. However, I have simulated 5 random samples to mimic the samples that different classmates would draw. All 5 intervals capture the true population mean. With more samples, I would expect a few intervals to miss it, since some samples will by chance have means relatively far from the true population mean.

Student A

set.seed(100)
samp1 <- sample(population, 60)
samp1_mean <- mean(samp1)
se <- sd(samp1) / sqrt(60)
lower <- samp1_mean - 1.96 * se
upper <- samp1_mean + 1.96 * se
c(lower, upper)

## [1] 1352.864 1588.303

Student B

set.seed(200)
samp2 <- sample(population, 60)
samp2_mean <- mean(samp2)
se <- sd(samp2) / sqrt(60)
lower <- samp2_mean - 1.96 * se
upper <- samp2_mean + 1.96 * se
c(lower, upper)

## [1] 1344.854 1604.646

Student C

set.seed(300)
samp3 <- sample(population, 60)
samp3_mean <- mean(samp3)
se <- sd(samp3) / sqrt(60)
lower <- samp3_mean - 1.96 * se
upper <- samp3_mean + 1.96 * se
c(lower, upper)

## [1] 1311.072 1548.562

Student D

set.seed(400)
samp4 <- sample(population, 60)
samp4_mean <- mean(samp4)
se <- sd(samp4) / sqrt(60)
lower <- samp4_mean - 1.96 * se
upper <- samp4_mean + 1.96 * se
c(lower, upper)

## [1] 1360.048 1589.385

Student E

set.seed(600)
samp5 <- sample(population, 60)
samp5_mean <- mean(samp5)
se <- sd(samp5) / sqrt(60)
lower <- samp5_mean - 1.96 * se
upper <- samp5_mean + 1.96 * se
c(lower, upper)

## [1] 1451.101 1742.999
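The five blocks above repeat the same computation with a different seed each time. One way to avoid the repetition is a small helper function; the sketch below is an illustration only. The helper `ci_mean` is a hypothetical name, and `population_demo` is simulated stand-in data (not the Ames population), so the printed intervals will not match the ones above.

```r
# Hypothetical helper: normal-approximation interval for a sample mean
ci_mean <- function(x, z = 1.96) {
  se <- sd(x) / sqrt(length(x))
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Stand-in population for illustration only (NOT the Ames data)
set.seed(1)
population_demo <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)

# Same seeds as Students A to E above
seeds <- c(100, 200, 300, 400, 600)
intervals <- t(sapply(seeds, function(s) {
  set.seed(s)
  ci_mean(sample(population_demo, 60))
}))
rownames(intervals) <- paste("Student", LETTERS[1:5])
intervals
```

With the lab's data, replacing population_demo with population should give one row per student in a single table.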

Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).

Here is the rough outline:

1. Obtain a random sample.
2. Calculate and store the sample’s mean and standard deviation.
3. Repeat steps (1) and (2) 50 times.
4. Use these stored statistics to calculate many confidence intervals.

But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as n.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

Lastly, we construct the confidence intervals.

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.

c(lower_vector[1], upper_vector[1])

## [1] 1450.684 1767.416

On your own

Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

47 of the 50 confidence intervals include the true population mean, which is 94% of the intervals. This is not exactly equal to the 95% confidence level, but it is within expectations: 95% of 50 would be 47.5 intervals, which is impossible to obtain, so the closest achievable proportions are 94% and 96%. The proportion also varies from one set of 50 samples to the next because the samples are random.

plot_ci(lower_vector, upper_vector, mean(population))
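Besides reading the count off the plot, the proportion of intervals that capture the population mean can be computed directly with a logical comparison. The sketch below rebuilds the whole procedure on simulated stand-in data so that it runs on its own; `population_demo` and every name ending in `_demo` are assumptions for illustration, not objects from the lab.

```r
# Stand-in population for illustration only (NOT the Ames data)
set.seed(2024)
population_demo <- rlnorm(2000, meanlog = 7.25, sdlog = 0.33)
mu_demo <- mean(population_demo)
n_demo  <- 60
n_reps  <- 50

# Same loop structure as in the lab: store each sample's mean and sd
mean_demo <- rep(NA, n_reps)
sd_demo   <- rep(NA, n_reps)
for (i in seq_len(n_reps)) {
  s <- sample(population_demo, n_demo)
  mean_demo[i] <- mean(s)
  sd_demo[i]   <- sd(s)
}
lower_demo <- mean_demo - 1.96 * sd_demo / sqrt(n_demo)
upper_demo <- mean_demo + 1.96 * sd_demo / sqrt(n_demo)

# TRUE where an interval contains the population mean
captured <- lower_demo <= mu_demo & upper_demo >= mu_demo
sum(captured)   # number of intervals that capture the mean
mean(captured)  # proportion of intervals that capture the mean
```

With the lab's objects, the last step reduces to mean(lower_vector <= mean(population) & upper_vector >= mean(population)).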

Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
```
I have chosen a 99% confidence level, for which the critical value is approximately 2.58.
```
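The critical value for any confidence level can be looked up from the standard normal distribution with qnorm, leaving half of the excluded probability in each tail. This is a minimal sketch; `critical_value` and `conf_levels` are hypothetical names introduced here.

```r
# Two-sided critical value of the standard normal for a given confidence level
critical_value <- function(conf_level) {
  qnorm(1 - (1 - conf_level) / 2)
}

conf_levels <- c(0.90, 0.95, 0.99)
z_values <- sapply(conf_levels, critical_value)
round(z_values, 2)
# [1] 1.64 1.96 2.58
```

The middle value recovers the 1.96 used for the 95% intervals earlier in the lab.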
Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
```
49 of the 50 intervals include the true population mean; equivalently, 49 of the sample means lie within 2.58 standard errors of the population mean. This is 98%, close to the 99% confidence level. It is within expectations, since exactly 99% would correspond to 49.5 intervals, which is impossible to obtain; the closest achievable proportions are 98% and 100%.
```

lower_vector <- samp_mean - 2.58 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 2.58 * samp_sd / sqrt(n)

plot_ci(lower_vector, upper_vector, mean(population))
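To get a feel for how much the number of capturing intervals can vary, the count can be treated as binomial. This is a rough sketch under a stated assumption: that the 50 intervals are independent and that each one captures the mean with probability exactly equal to the nominal confidence level, which holds only approximately for these intervals. All names are illustrative.

```r
n_intervals <- 50

# Expected count and standard deviation of the count at each confidence level
expected_95 <- n_intervals * 0.95
expected_99 <- n_intervals * 0.99
sd_95 <- sqrt(n_intervals * 0.95 * 0.05)
sd_99 <- sqrt(n_intervals * 0.99 * 0.01)

# Probability of seeing exactly the counts reported in this lab
p_47_at_95 <- dbinom(47, n_intervals, 0.95)
p_49_at_99 <- dbinom(49, n_intervals, 0.99)

c(expected_95 = expected_95, sd_95 = sd_95, p_47_at_95 = p_47_at_95)
c(expected_99 = expected_99, sd_99 = sd_99, p_49_at_99 = p_49_at_99)
```

Under this assumption, counts one or two away from the expected value are routine, which is consistent with observing 47 and 49 out of 50.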

Foundations for statistical inference - Confidence intervals

Samuel Kigamba
