Lab - Inference for categorical data

Exercises:

Exercise 1:

One variable that is approximately normally distributed in a population is human height. Most people are close to the average height, and very short or very tall heights are progressively less common, so the distribution is symmetric with a bell-curve shape. For the Normal(10, 2) distribution used in this lab, the population mean mu is 10 and the population standard deviation sigma is 2.
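As a quick check of what "bell-curve shape" implies, here is a minimal sketch in base R, assuming the Normal(10, 2) population from this lab. The variable names are my own and are not part of the lab code.

```r
# Population parameters from the lab: Normal(mean = 10, sd = 2)
mu <- 10
sigma <- 2

# Probability that one observation falls within 1, 2 and 3
# standard deviations of the population mean
within_k_sd <- sapply(1:3, function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
})
print(round(within_k_sd, 3))
```

The three probabilities are roughly 68%, 95% and 99.7%, which is why values far from the mean are rare.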

Exercise 2:

set.seed(667733)
at.a.time = 100000
ts = numeric(at.a.time) # preallocate instead of growing the vector inside the loop
for(i in 1:at.a.time)
{
  samp = rnorm(n = 7, mean = 10, sd = 2)
  ts[i] = (mean(samp) - 10) / (2/sqrt(7)) # This line calculates the standardized test statistic (ts)
}
sample.from.pop(mu=10, sigma=2, n=7)

This chunk of code draws a sample of n = 7 observations from the normal population described in Exercise 1. In that example, each dot represents one observation, or one person’s height. The observations come from a normal population, even though the sample contains only 7 observations, which is relatively small.

Exercise 3:

set.seed(094389)
at.a.time = 100000
ts = numeric(at.a.time) # preallocate instead of growing the vector inside the loop
for(i in 1:at.a.time)
{
  samp = rnorm(n = 7, mean = 10, sd = 2)
  ts[i] = (mean(samp) - 10) / (2/sqrt(7)) # This line calculates the standardized test statistic (ts)
}
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7)

The population parameters used to standardize this distribution are the population mean, mu, and the population standard deviation, sigma. The sample statistic being standardized is the sample mean, x bar, and n is the sample size. In words, the standardization is x bar minus mu, divided by the standard error, which is sigma over the square root of the sample size. It would look like this: z = (x bar - mu) / (sigma / sqrt(n)).
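To make the standardization concrete, here is a small sketch in base R. The helper name standardize_mean and the seed are my own choices for illustration; only mu = 10, sigma = 2 and n = 7 come from the lab.

```r
# Known population parameters (from the lab setup)
mu <- 10
sigma <- 2
n <- 7

# Hypothetical helper: standardize a sample mean using the known sigma
standardize_mean <- function(x_bar, mu, sigma, n) {
  (x_bar - mu) / (sigma / sqrt(n))
}

set.seed(1) # arbitrary seed, only for reproducibility
samp <- rnorm(n, mean = mu, sd = sigma)

# Standardized distance of this sample's mean from mu
z <- standardize_mean(mean(samp), mu, sigma, n)
print(z)

# A sample mean exactly equal to mu standardizes to 0
z_at_mu <- standardize_mean(10, mu, sigma, n)
```

This is the same calculation that the loop in the chunk above repeats 100000 times to build ts.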

Exercise 4:

set.seed(462849)
add.normal.overlay = T
add.t.overlay = T
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

add.normal.overlay = F
add.t.overlay = F
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)
add.normal.overlay = F
add.t.overlay = T
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

add.normal.overlay = T
add.t.overlay = F
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

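I think the combination of T and T for the overlays is the most informative, since it lets me compare both curves against the histogram directly. Because the test statistic here is standardized with the known population standard deviation (sigma = 2), its sampling distribution is standard normal, so the normal overlay should follow the histogram most closely. The t overlay has only n - 1 = 6 degrees of freedom, which is low, so it has heavier tails than the standard normal distribution even though the two curves look similar near the center.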

Exercise 5:

When we don’t know the population standard deviation, we use the sample standard deviation s instead, so our mathematical equation would be t = (x bar - mu) / (s / sqrt(n)). The statistics in this equation are x bar and s, which are calculated from the sample. The only parameter is mu, the population mean, and n is the sample size.
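The effect of swapping sigma for s can be checked with a short simulation. This is a sketch in base R under the lab's setup (n = 7, mu = 10, sigma = 2). The seed, the number of replicates and the cutoff of 2 are my own assumptions, chosen only to keep the example quick.

```r
set.seed(2024) # arbitrary seed for reproducibility
n <- 7
mu <- 10
sigma <- 2
reps <- 20000 # fewer than the lab's 100000 to keep it quick

# Each replicate: draw one sample, then standardize its mean
# with the known sigma (z) and with the sample sd (t)
stats <- replicate(reps, {
  samp <- rnorm(n, mean = mu, sd = sigma)
  c(z = (mean(samp) - mu) / (sigma / sqrt(n)),
    t = (mean(samp) - mu) / (sd(samp) / sqrt(n)))
})

# Proportion of simulated statistics farther than 2 from 0
tail_z <- mean(abs(stats["z", ]) > 2)
tail_t <- mean(abs(stats["t", ]) > 2)

# Theoretical tail areas: standard normal, and t with n - 1 df
theory_z <- 2 * pnorm(-2)
theory_t <- 2 * pt(-2, df = n - 1)
print(c(tail_z = tail_z, theory_z = theory_z,
        tail_t = tail_t, theory_t = theory_t))
```

The statistic built with s lands in the tails more often than the one built with sigma, which is the extra spread that the t distribution accounts for.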

Exercise 6:

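set.seed(324234)
# Recompute ts with the sample standard deviation, since sigma is treated as unknown here
at.a.time = 100000
ts = numeric(at.a.time)
for(i in 1:at.a.time)
{
  samp = rnorm(n = 7, mean = 10, sd = 2)
  ts[i] = (mean(samp) - 10) / (sd(samp)/sqrt(7)) # standardized with s instead of sigma
}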
add.normal.overlay = T
add.t.overlay = T
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

add.normal.overlay = F
add.t.overlay = F
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)
add.normal.overlay = F
add.t.overlay = T
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

add.normal.overlay = T
add.t.overlay = F
sampling.dist.of.standardized.sample.mean(ts=ts, n = 7,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

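I believe that the histogram with both overlays set to true is the most useful, since it shows the histogram against both curves at once. When the sample standard deviation replaces sigma, the standardized statistic follows a t distribution with n - 1 = 6 degrees of freedom, so the t overlay should match the histogram better than the normal overlay, especially in the tails. With only 6 degrees of freedom, the t distribution is noticeably more spread out than the standard normal distribution.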

Exercise 7:

The t distribution describes the standardized distance of the sample mean from the population mean when the population standard deviation is not known and is estimated by the sample standard deviation, provided the observations are from a normally distributed population. Compared with the standard normal distribution, it has heavier tails, and its exact shape depends on the degrees of freedom, n - 1. However, the two are similar in that both have a smooth shape, are symmetric, and are centered at 0, and as the sample size increases the t distribution and the standard normal distribution become more similar. The t distribution is appropriate for hypothesis tests for a population mean because it does not assume that the population standard deviation is known and instead uses the sample version of this value. This means that it is the most useful for inference about population means when we do not know the exact standard deviation.
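To show how the t distribution is used in a hypothesis test for a population mean, here is a sketch in base R that computes the test by hand and compares it with the built-in t.test. The seed and the null value mu0 = 10 are assumptions for illustration.

```r
set.seed(42) # arbitrary seed for reproducibility
samp <- rnorm(7, mean = 10, sd = 2)

mu0 <- 10 # hypothesized population mean (assumed for this example)
n <- length(samp)

# Test statistic: sigma is unknown, so the sample sd is used
t_stat <- (mean(samp) - mu0) / (sd(samp) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in one-sample t test should agree with the hand calculation
builtin <- t.test(samp, mu = mu0)
print(c(by_hand = p_value, t_test = builtin$p.value))
```

Both p-values agree, since t.test uses the same statistic and the same n - 1 degrees of freedom.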

On your own:

1:

set.seed(153829)
sample.from.pop(mu=10,sigma=2,n=15)

I picked a sample size of n = 15.

2:

set.seed(999880)
sample.from.pop(mu=10,sigma=2,n=15)

at.a.time = 100000
ts = numeric(at.a.time) # preallocate instead of growing the vector inside the loop
for(i in 1:at.a.time)
{
  samp = rnorm(n = 15, mean = 10, sd = 2)
  ts[i] = (mean(samp) - 10) / (2/sqrt(15)) # This line calculates the standardized test statistic (ts)
}
sampling.dist.of.standardized.sample.mean(ts=ts, n = 15)

3:

add.normal.overlay = T
add.t.overlay = T
sampling.dist.of.standardized.sample.mean(ts=ts, n = 15,
                                          add.normal.overlay =  add.normal.overlay,
                                          add.t.overlay =  add.t.overlay)

This distribution has more degrees of freedom, 14, compared with 6 when n=7. The t distribution is also closer to the standard normal distribution when n=15 than when n=7. This leads me to believe that increasing the sample size, even when it is still under 30, makes the t distribution more similar to the standard normal distribution.
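The comparison between 6 and 14 degrees of freedom can also be put in numbers. This is a sketch in base R that compares the 97.5th percentile of the t distribution with that of the standard normal. The extra values of 29 and 100 degrees of freedom are my own additions to show the trend.

```r
# Degrees of freedom for n = 7 and n = 15, plus two larger values for context
df_values <- c(6, 14, 29, 100)

# 97.5th percentile (two-sided 95% cutoff) for each t distribution
t_crit <- qt(0.975, df = df_values)

# Same percentile for the standard normal distribution
z_crit <- qnorm(0.975)

comparison <- data.frame(
  n = df_values + 1,
  df = df_values,
  t_crit = round(t_crit, 3),
  gap_from_normal = round(t_crit - z_crit, 3)
)
print(comparison)
```

The cutoff is about 2.45 for 6 degrees of freedom and about 2.14 for 14, against about 1.96 for the standard normal, so the gap shrinks as the sample size grows.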