Homework #7: Comparing Means and Proportions

#### Sociology 333: Introduction to Quantitative Analysis
#### Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)

** You need these data to do this homework: **

load(url("http://www.soc.duke.edu/~dee4/soc333data/hw7.data"))

This homework will introduce you to t-tests of means and chi-squared tests of proportions. These tests tell you whether there is sufficient evidence to conclude that two groups differ from each other.

If we are comparing proportions, we use a chi-squared test. If we are comparing continuous values, we use t-tests.
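As a minimal sketch of the proportion case, the chunk below builds a small 2x2 table and compares the two group proportions with chisq.test() and prop.test(). The counts and the names tab, chi and pr are made up for illustration; they are not from the homework data.

```r
# Hypothetical counts, invented for illustration (not from the GSS)
tab = matrix(c(30, 70,
               45, 55),
             nrow = 2, byrow = TRUE,
             dimnames = list(group = c("A", "B"), answer = c("yes", "no")))
# Proportion answering "yes" and "no" within each group
prop.table(tab, margin = 1)
# Chi-squared test on the table (continuity correction is the default for 2x2)
chi = chisq.test(tab)
chi$p.value
# prop.test() takes successes and trials, and adds a CI for the difference
pr = prop.test(x = tab[, "yes"], n = rowSums(tab))
pr$p.value
pr$conf.int
```

For a 2x2 table like this one, the two functions should report the same p-value; prop.test() is handy when you also want the confidence interval for the difference in proportions.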

With t-tests, there are three things to determine before conducting the test:

  1. Is this an independent test (you are comparing two separate groups) or a dependent/paired test (you are comparing the same group at two different time points)?

  2. Is this a one-tailed or a two-tailed test? A one-tailed test asks whether one mean is smaller (or larger) than another. A two-tailed test asks whether the means are different from one another (i.e., not equal to each other).

  3. Should you assume unequal variances? This asks whether the groups you are comparing have the same standard deviation. The more conservative approach is to always use the unequal variances test. It generally doesn't make a big difference.
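Each of these three choices corresponds to an argument of t.test(). Here is a rough sketch on simulated scores; group1 and group2 are invented for illustration, and the paired call only makes sense when the two vectors really are the same units measured twice.

```r
set.seed(333)  # make the simulated data reproducible
# Simulated scores, invented for illustration
group1 = rnorm(40, mean = 50, sd = 10)
group2 = rnorm(40, mean = 55, sd = 15)
# 1. Independent (the default) vs. paired
t.test(group1, group2)                 # two separate groups
t.test(group1, group2, paired = TRUE)  # treats row i of each vector as a pair
# 2. Two-tailed (the default) vs. one-tailed
t.test(group1, group2, alternative = "less")  # H1: mean of group1 < mean of group2
# 3. Unequal variances (Welch, the default) vs. equal variances
welch = t.test(group1, group2, var.equal = FALSE)
pooled = t.test(group1, group2, var.equal = TRUE)
# The degrees of freedom differ between the two versions
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```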

The t-test relies on the Central Limit Theorem, which tells us that if we were to measure the average difference between two samples over and over, those differences would be approximately normally distributed. With the t-test, we always set the null hypothesis to be that there is no difference between the means of the two groups or time points we are comparing.

This is the in-class example, where we looked at the differences between sample means and saw that they were normally distributed.

# Only sample people with incomes!
inc.men = subset(HW7, HW7$sex == "male" & HW7$coninc > 0)
inc.women = subset(HW7, HW7$sex == "female" & HW7$coninc > 0)
# Pretend our data represent a population, and simulate sampling from them
# real difference:
mean(inc.women$coninc) - mean(inc.men$coninc)
## [1] -7488
samp.men = NULL
samp.women = NULL
samp.diff = NULL
# Draw a sample of n = 50 from each group, calculate the means and store
# them. Do this 1000 times.
n = 50
for (i in 1:1000) {
    samp.men[i] = mean(sample(inc.men$coninc, n))
    samp.women[i] = mean(sample(inc.women$coninc, n))
    samp.diff[i] = samp.women[i] - samp.men[i]
}
hist(samp.men, density = 100)

[Plot: histogram of samp.men]

mean(inc.men$coninc)
## [1] 53192
hist(samp.women, density = 100)

[Plot: histogram of samp.women]

mean(inc.women$coninc)
## [1] 45705
hist(samp.diff, density = 100)

[Plot: histogram of samp.diff]

# Same result, computed in one step on the whole vectors
samp.diffs = samp.women - samp.men
hist(samp.diffs, density = 100)

[Plot: histogram of samp.diffs]

# The differences are centered around the true difference in means
mean(samp.diff)
## [1] -7386
sd(samp.diff)
## [1] 7865
# Standard deviation of the difference, by formula: sqrt(s1^2/n1 + s2^2/n2).
# Not quite the same as sd(samp.diff), but close.
# Remember, we are testing for differences, so we want to know how many
# standard deviations *above* the mean we need to be to see zero (because
# the mean difference is negative). Study the histogram of the difference
# and prove to yourself that this is true.
pt(-1 * mean(samp.diff)/sd(samp.diff), df = (n + n - 2))
## [1] 0.825
# About 83% of the time, we'll see a difference less than zero.

This shows that the central limit theorem applies both to means of samples and to differences of sample means. However, the t-test does NOT carry over to proportions, because the standard error of a proportion depends on the proportion being estimated.
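To check the formula quoted in the comments above, sqrt(s1^2/n1 + s2^2/n2), you can compare it with the spread of simulated differences. This is only a sketch: pop1 and pop2 are simulated stand-ins invented for illustration, not the GSS incomes.

```r
set.seed(333)  # make the simulation reproducible
# Simulated "populations", invented for illustration (not the GSS incomes)
pop1 = rnorm(5000, mean = 45000, sd = 35000)
pop2 = rnorm(5000, mean = 53000, sd = 40000)
n = 50
# Repeat the sampling experiment 1000 times, keeping the difference in means
diffs = replicate(1000, mean(sample(pop1, n)) - mean(sample(pop2, n)))
# Spread of the simulated differences
sd(diffs)
# Spread predicted by the formula
se.formula = sqrt(sd(pop1)^2/n + sd(pop2)^2/n)
se.formula
```

The two numbers should be close but not identical, because the simulated value is itself based on a finite number of samples.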

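The R function t.test takes several arguments. The main ones are x and y (the two sets of values to compare), alternative ("two.sided", "less", or "greater"), paired (TRUE for a dependent test), var.equal (FALSE by default, which gives the unequal variances test), and conf.level (0.95 by default).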

For example, if we want to do a t.test of the difference between men and women in the full GSS sample:

t.test(inc.women$coninc, inc.men$coninc, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  inc.women$coninc and inc.men$coninc
## t = -9.719, df = 10048, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -8755   Inf
## sample estimates:
## mean of x mean of y 
##     45705     53192
# Quick way to calculate a confidence interval:
t.test(inc.women$coninc, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  inc.women$coninc
## t = 91.37, df = 5957, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  44724 46685
## sample estimates:
## mean of x 
##     45705
# The confidence interval is printed above. To extract it, type:
t.test(inc.women$coninc, conf.level = 0.95)$conf.int[1:2]
## [1] 44724 46685
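The same trick works for a two-sample test and for other confidence levels. Below is a small sketch on simulated values; x, y and res are invented for illustration and are not the GSS incomes. It stores the result of t.test() and pulls out individual pieces.

```r
set.seed(333)  # make the simulated data reproducible
# Simulated incomes, invented for illustration (not the GSS values)
x = rnorm(200, mean = 45000, sd = 30000)
y = rnorm(200, mean = 53000, sd = 35000)
# Store the result instead of printing it
res = t.test(x, y, conf.level = 0.99)
res$estimate                               # the two sample means
unname(res$estimate[1] - res$estimate[2])  # difference in means (x minus y)
res$conf.int[1:2]                          # 99% CI for the difference
res$p.value                                # two-tailed p-value
```

Typing names(res) lists everything that can be extracted this way.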

Exercise 1: Use only men who report more than $0 earnings. We will consider the GSS to be a random sample of the US population. Calculate the following:

  1. You want to find out if men who have only a high school diploma earn more than those who do not have a high school diploma. What is the average difference? What is the probability that high schoolers earn more than those who have not completed high school?

  2. What is the 99% confidence interval for the difference between high schoolers and less than high schoolers?

  3. You want to find out if the incomes of those with a bachelor's degree are higher than the incomes of those with only a high school diploma. What is the average difference? What is the 95% CI of the difference?

  4. You want to find out if the incomes of those with a graduate degree vs. those with only a college degree are different. What is the 97% CI of the difference?

  5. What percent of the time would we expect the difference to be more than $24316? What percent of the time would we expect the difference to be less than $12956?

Exercise 2: Repeat exercise 1, questions 1-4 with women.

Exercise 3: A drug company wishes to see if their drug is effective for treating depression. They recruit 100 people with depression and randomly assign 50 of them to receive the drug and 50 to receive the placebo. To measure depression, they administer a PHQ-9 test and record their score. The PHQ-9 test scores range from 0 to 9, with 9 indicating very high levels of depression, and 0 indicating no depressive symptoms. The test is taken by each participant at the beginning of the trial and at the end.

The scores of the participants are in the data frame drug.trial. This data frame contains four measurements, 2 measures for the placebo group, placebo.begin, placebo.end and 2 measures for the drug group, drug.begin, drug.end.

# Advanced R tip: if you have a data frame with several variables, and you
# want to get the same statistic for each variable, you can use the lapply()
# command. This command applies a function to each variable. It takes two
# arguments: the data frame you want to use and the function you want to
# apply. So,
lapply(drug.trial, mean)
## $placebo.begin
## [1] 4.82
## 
## $drug.begin
## [1] 4.34
## 
## $placebo.end
## [1] 4.1
## 
## $drug.end
## [1] 3.26
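
A related sketch: sapply() does the same job as lapply() but tries to simplify the result to a vector, which is often easier to read. The data frame toy is made up for illustration.

```r
# Toy data frame, invented for illustration
toy = data.frame(before = c(5, 6, 4, 7), after = c(4, 4, 3, 6))
lapply(toy, mean)  # returns a list, one element per variable
sapply(toy, mean)  # same values, simplified to a named numeric vector
sapply(toy, sd)    # any function that takes a single vector works
```
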
  1. To compare the placebo group at the beginning of the trial with the drug group at the beginning of the trial, you will use an independent test. Why?

  2. To compare the placebo group at the beginning of the trial with the placebo group at the end of the trial, you will use a paired test. Why?

  3. At time 1, are the placebo and drug treatment groups significantly different? If they are, your experiment is in trouble! With random assignment, you shouldn't end up with this problem.

  4. How much change is there in the placebo group? Using a 95% Confidence Level, does the placebo group change from time 1 to time 2?

  5. How much change is there in the treatment group from time 1 to time 2? Using a 95% Confidence Level, does the treatment group change from time 1 to time 2? What is the 95% CI of the amount of change?

  6. What is the difference in change between the treatment and the placebo group? Hint: create two new variables, one for the change in placebo, one for the change in the drug group.

  7. Can we say that this drug is more effective than the placebo?

  8. Compare the drug trial group at the beginning and end with a paired test, then with an independent test. Which has a lower p-value? What implications does this have for research?
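
As a hedged illustration of the hint about creating change variables: a paired t-test is the same calculation as a one-sample t-test on the change scores. The vectors before, after and change are simulated for illustration; they are not the drug.trial data, so the numbers say nothing about the exercise answers.

```r
set.seed(333)  # make the simulated data reproducible
# Simulated before/after scores for the same 30 people (invented data)
before = rnorm(30, mean = 5, sd = 2)
after = before - rnorm(30, mean = 0.5, sd = 1)  # each person changes a little
change = after - before
# Paired test on the two measurements
paired = t.test(after, before, paired = TRUE)
# One-sample test on the change scores
one.sample = t.test(change)
# These two p-values match
c(paired = paired$p.value, one.sample = one.sample$p.value)
# An independent test on the same numbers ignores the pairing
independent = t.test(after, before)
independent$p.value
```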

Exercise 4: Create a new data frame containing only women with 1 to 300 male sexual partners. Answer the following questions:

  1. What is the probability that female evangelicals have fewer sexual partners than Catholics? That female Catholics have fewer than evangelicals? That there is a difference between the two?

  2. Repeat this for men with 1-300 female sexual partners.

Exercise 5: Warning: hard

  1. Look at this histogram of the difference in sample means of income for men and women again:

hist(samp.diff)

[Plot: histogram of samp.diff]

What does this histogram show us?

  2. If we pretend that our data are a population, about how big a sample would we need to take to detect that women earn less than men 99% of the time? HINT: All you need to do is re-run the loop above, calculate a p-value for samp.diff, and keep adjusting n until the p-value is consistently around .99. You can change n by hand or set up a while loop.
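
If you want to try the while loop, here is one possible skeleton. It is only a sketch run on simulated populations: pop.women, pop.men, step and n.max are invented for illustration, so the n it finds is not the answer for the GSS data. A single run can also be lucky, so re-run it with different seeds to see whether the p-value is consistently around .99.

```r
set.seed(333)  # make the simulation reproducible
# Simulated "populations", invented for illustration (not the GSS incomes)
pop.women = rnorm(5000, mean = 45000, sd = 35000)
pop.men = rnorm(5000, mean = 53000, sd = 40000)
n = 0         # the first pass through the loop uses n = step
step = 50     # how much to increase n each round (arbitrary choice)
n.max = 2000  # safety cap so the loop always stops
p = 0
while (p < 0.99 && n < n.max) {
    n = n + step
    # 1000 differences in sample means at the current sample size
    diffs = replicate(1000, mean(sample(pop.women, n)) - mean(sample(pop.men, n)))
    # Same calculation as in the in-class example
    p = pt(-1 * mean(diffs)/sd(diffs), df = (n + n - 2))
}
c(n = n, p = p)
```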