Homework #6: Standard Errors, t-Statistics, Sampling Distributions

Sociology 333: Introduction to Quantitative Analysis

Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)

The textbook talks about populations and samples of populations. A population is everyone (or everything) in a specific group that we want to study: all Americans, NBA basketball players, churches, elderly Swedes, injection drug users in Seattle, Duke undergraduates, etc. In general, it is prohibitively expensive or time-consuming to survey an entire population, so we choose a smaller group from the population to study: a sample. Because a sample includes only part of the population, statistics calculated from it will differ somewhat from the true population values. For example, if we wanted to know the average NBA salary, chose 20 NBA players out of all NBA players, and asked their salaries, we'd get a slightly different answer each time we repeated the process. This variation from sample to sample is called sampling error, and it is especially important to consider with very small samples. When we repeatedly sample from a population and record the mean of each sample, the distribution of those means is called a sampling distribution.

We want to quantify how close a statistic from a sample is likely to be to the true population parameter. To do this, we calculate standard errors. Standard errors rely on the Central Limit Theorem, which states that if we take many random samples of the same size from a population and calculate the mean of each sample, those sample means will be approximately normally distributed around the true population mean, provided the samples are reasonably large.

Therefore: If we take many samples and calculate their means (the sample means), and then calculate the mean of the sample means, the result will be very close to the population mean.

And: If we take many samples and calculate their sample means, and then calculate the standard deviation of those sample means around the population mean (or around the mean of the sample means), that is the standard error.
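The two statements above can be checked with a small simulation. This is only a sketch: the population is invented for illustration (a skewed distribution with a mean near 50), and the names population, sample_means, pop_mean, and pop_sd are assumptions, not part of the homework data.

```r
# Illustrative simulation; this population is invented, not the homework data.
set.seed(333)                                  # make the run reproducible
population <- rexp(100000, rate = 1 / 50)      # skewed hypothetical population

n <- 30                                        # size of each sample

# Draw 5,000 samples of size n and record the mean of each one
sample_means <- replicate(5000, mean(sample(population, n)))

pop_mean <- mean(population)
pop_sd   <- sqrt(mean((population - pop_mean)^2))  # population SD (divides by N)

mean(sample_means)    # should land close to pop_mean
sd(sample_means)      # should land close to pop_sd / sqrt(n)
pop_sd / sqrt(n)      # the theoretical standard error of the mean
```

Running hist(sample_means) afterwards should show a roughly bell-shaped histogram even though the population itself is skewed.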

** You will need to review the material in chapters 5-7 of Urdan for this homework. **

** Also: you will need to execute the following R command to get the data for this homework: **

load(url("http://www.soc.duke.edu/~dee4/soc333data/hw6.data"))

Exercise 1:

  1. The dataframe HW6 contains a variable named samp1.

  2. What is the mean of samp1?

  3. What is the standard deviation of samp1?

  4. What is the standard error of samp1?

  5. What is the best estimate of the population mean?

  6. What is the best estimate of the population standard deviation?
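As a reminder of the R functions involved, here is a minimal sketch using a made-up vector x. The values are hypothetical; for the exercise itself the same calls would be applied to HW6$samp1.

```r
# Hypothetical data for illustration only
x <- c(12, 15, 9, 14, 10, 13, 11, 16)

n      <- length(x)          # number of observations
x_mean <- mean(x)            # sample mean
x_sd   <- sd(x)              # sample standard deviation (divides by n - 1)
x_se   <- x_sd / sqrt(n)     # standard error of the mean

c(mean = x_mean, sd = x_sd, se = x_se)
```

Note that sd() in R already divides by n - 1, so it gives the sample-based estimate of the population standard deviation rather than the descriptive version that divides by n.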

Exercise 2:

  1. The dataframe HW6 contains four variables, samp1, samp2, samp3, samp4, that contain 30 observations each from a population.

  2. What are the sample means?

  3. What are the sample standard deviations?

  4. What is the best estimate of the population mean?

  5. What is the best estimate of the population standard deviation?

  6. What is the standard deviation of the sample means?

  7. What is the standard error of each sample?

  8. What is the standard error for all the samples combined?

Exercise 3: Joe weighs himself throughout the day and obtains the values contained in the variable joe.weighs.

  1. What is the best estimate of Joe's “true” weight?

  2. What is the standard error of this estimate of Joe's “true” weight?

  3. The next day, Joe weighs himself and is shocked to find he has gained weight! The scale reads 215 pounds. As Joe's statistically astute friend, you decide to console him. “Don't worry,” you say. “Given the natural fluctuation in weights, there is a XX.X% chance of getting a value at least this large on the scale with no change in your ‘true’ weight.” Calculate this percentage.
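A "chance of a value at least this large" is an upper-tail probability. The sketch below shows the mechanics with invented numbers (centre, spread, and observed are hypothetical names and values, not Joe's data), assuming the quantity is approximately normally distributed. Deciding which centre and spread apply to Joe is part of the exercise.

```r
# Hypothetical numbers, not Joe's data
centre   <- 150   # assumed centre of the distribution
spread   <- 2     # assumed spread of the distribution
observed <- 153   # the value we want to evaluate

z <- (observed - centre) / spread          # z-score of the observed value
p_upper <- pnorm(z, lower.tail = FALSE)    # P(Z >= z), the upper-tail area

p_upper * 100                              # expressed as a percentage
```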

Exercise 4: A population of men's heights had a population mean = 69 inches and a population standard deviation = 3.2 inches.

  1. If many random samples of size n=4 were collected, and in each case the sample mean Xbar was calculated, how would these sample means fluctuate?

  2. One sample had Xbar = 70. Is it lucky that this sample happened to be so close to the population mean?

  3. Suppose our sample size increased to 16. Repeat part 1.

  4. Does doubling the sample size double the accuracy with which we estimate the population mean?
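When the population standard deviation is known, the standard error of the mean is the population standard deviation divided by the square root of n. A short sketch with a hypothetical population SD of 10 (not the heights in this exercise) shows how the standard error responds as n grows:

```r
# Hypothetical population SD, chosen only to make the arithmetic easy to follow
sigma <- 10
n     <- c(25, 50, 100)       # three candidate sample sizes

se <- sigma / sqrt(n)         # standard error of the mean for each n

data.frame(n = n, se = se)    # compare how se changes as n grows
```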

Exercise 5:

  1. What is a sampling distribution?

  2. The population of American men had incomes that averaged $10,000 per year, with a standard deviation of $8,000. If a random sample of n=100 men was taken to estimate the population mean, what would be the standard error?

  3. If the population of California is 1/10th that of the US, but with the same mean and standard deviation as the US, and we took a sample of n=100 and calculated Xbar, what is the standard error?

  4. If a 1% sample of the 78 million men in the US were taken, what would be the standard error?

  5. If a sample of 20 men is drawn, what is the chance that the sample mean Xbar will be no more than $700 above the true mean? Do this question with:

     i. a z-score and the normal distribution

     ii. a t-statistic and the t-distribution

  6. Why is there a difference between i and ii? Check http://en.wikipedia.org/wiki/Student's_t-distribution for help. Note that for this case, you would use a z-score and a normal distribution because we have the population mean and the population standard deviation.

Exercise 6:

  1. Suppose a large class in statistics has marks contained in the dataframe stat.scores. Find the probability that a student will have a score of 80 or above.

  2. Find the z-score that corresponds to the 60th percentile (to convert cumulative probabilities to z-scores, use qnorm(pvalue,mean=0,sd=1)).

  3. Find the raw score that corresponds to the 60th percentile.

  4. Find the z-score that corresponds to the 20th percentile. Find the corresponding raw score.

  5. Use z-scores for parts 5-10: this assumes that all the test scores represent our population. Find the probability that a random sample of 20 students will have a mean score of 80 or larger.

  6. Find the probability that a random sample of 20 students will have a mean score of 20 or smaller.

  7. Find the probability that a random sample of 20 students will have a mean score between 30 and 60.

  8. Find the range where 95% of the sample means (if n=20) should fall with repeated sampling.

  9. Find the range where 95% of the sample means (if n=50) should fall with repeated sampling.

  10. Find the range where 40% of the sample means (if n=20) should fall with repeated sampling.

  11. Repeat 5-10 using t-statistics. This assumes that the test scores represent only a sample of everyone who could possibly take this test. It is more conservative, but only marginally so.
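To see what "more conservative, but only marginally so" means in practice, it may help to compare the critical values that mark off the middle 95% under each distribution. This sketch assumes n = 20, so the t-distribution has n - 1 = 19 degrees of freedom; z_crit and t_crit are illustrative names.

```r
n <- 20                              # sample size used in several parts above

z_crit <- qnorm(0.975)               # normal cutoff for the middle 95%
t_crit <- qt(0.975, df = n - 1)      # t cutoff with n - 1 degrees of freedom

c(z = z_crit, t = t_crit)            # the t cutoff is slightly larger
```

A larger cutoff produces a slightly wider range, which is why the t-based answers are the more cautious ones. The gap shrinks as n increases.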