We need to test hypotheses about differences between means.
The Quick-R website has a good reference listing of the uses of t.test() in conducting various hypothesis tests concerning means.
Let’s walk through the example beginning on slide 179 of the CMU-OLI material.
The first step is to decide if the assumptions we need are met.
Let’s do a side-by-side boxplot, which lets us check the normality assumption informally and get a first look at the answer to our question at the same time.
load("~/Downloads/looks.RData")
boxplot(looks$Score~looks$Gender)
What do we see in the boxplots? Normality? Comparison?
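If the boxplots leave us unsure, we can go a step further and look at a normal QQ plot for each group, together with a formal Shapiro-Wilk test. This is just a sketch of one way to do it, using the looks data loaded above and assuming Gender has exactly two levels (as the t.test() output below confirms).
# Closer look at normality within each group (optional sketch)
par(mfrow = c(1, 2))                                   # two plots side by side
invisible(tapply(looks$Score, looks$Gender,
                 function(x) { qqnorm(x); qqline(x) })) # QQ plot per group
par(mfrow = c(1, 1))
tapply(looks$Score, looks$Gender, shapiro.test)        # formal normality test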
Now let’s run the actual hypothesis test and see the results.
t.test(looks$Score~looks$Gender)
##
## Welch Two Sample t-test
##
## data: looks$Score by looks$Gender
## t = -4.6574, df = 182.973, p-value = 6.143e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.695865 -1.496292
## sample estimates:
## mean in group Female mean in group Male
## 10.73333 13.32941
Given the tiny p-value, we have sufficient evidence to reject the null hypothesis that the mean score is the same for men and women. We can also see that the average male score is higher than the average female score by about 2.6 points.
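Note that t.test() uses the Welch version by default, so it does not assume the two groups have equal variances. If we were comfortable assuming equal variances, we could ask for the pooled test instead; with a p-value this small, the conclusion would almost certainly not change.
# Pooled (equal-variance) version of the same test
t.test(looks$Score ~ looks$Gender, var.equal = TRUE)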
Here we are in the normal situation faced by a professional statistician. We have raw data and we want to answer a question. In textbooks and classrooms you may be given a very artificial problem: you may be presented with the means, standard deviations, and sample sizes and asked to conduct a hypothesis test. In that case, you need to use your software in pocket-calculator mode to do the computations described at the top of slide 181 and follow that with the pt() or pnorm() function to get the p-value of the test statistic. I want to emphasize that this is strictly a classroom exercise of no practical use.
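Just to show what that pocket-calculator work looks like, here is a sketch using made-up summary numbers (not taken from the slides): compute the test statistic by hand and then get the p-value with pt().
# Pocket-calculator mode: two-sample t-test from summary statistics
# (the numbers below are invented for illustration only)
xbar1 <- 10.7; s1 <- 4.0; n1 <- 150    # group 1: mean, sd, sample size
xbar2 <- 13.3; s2 <- 4.2; n2 <- 85     # group 2: mean, sd, sample size
se <- sqrt(s1^2/n1 + s2^2/n2)          # standard error of the difference
tstat <- (xbar1 - xbar2) / se          # test statistic
df <- min(n1, n2) - 1                  # conservative degrees of freedom
2 * pt(-abs(tstat), df)                # two-sided p-value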
Let’s look at the beer-tasting data in the CMU-OLI material and carry out the hypothesis test presented there using t.test() rather than the pocket-calculator work.
load("~/Downloads/beers.RData")
diff <- beers$After - beers$Before
t.test(diff,mu=0)
##
## One Sample t-test
##
## data: diff
## t = 2.5821, df = 19, p-value = 0.01827
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.09498266 0.90801734
## sample estimates:
## mean of x
## 0.5015
It looks like we can reject the null hypothesis that drinking beer has no impact on reaction time.
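Note that we could have skipped the explicit subtraction and let t.test() handle the pairing itself; this is equivalent to the one-sample test on the differences above.
# Equivalent paired version of the same test
t.test(beers$After, beers$Before, paired = TRUE)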
We’re actually skating on some thin ice here. The samples are not large enough to let us ignore the normality assumption, and we haven’t checked it. At the very least, we should look at the histograms.
hist(beers$Before)
hist(beers$After)
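Since the paired test really rests on the differences being roughly normal, it is also worth looking at the differences directly; a QQ plot and a Shapiro-Wilk test are quick ways to do that.
# Check the distribution of the differences themselves
qqnorm(diff)
qqline(diff)
shapiro.test(diff)   # a small p-value would cast doubt on normality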
In situations where normality is in doubt, there are non-parametric tests to use. A non-parametric test does not depend on the assumption of normality. The appropriate function in R for this situation is wilcox.test(). Let’s run that.
wilcox.test(beers$Before,beers$After,paired=TRUE,exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: beers$Before and beers$After
## V = 42.5, p-value = 0.02062
## alternative hypothesis: true location shift is not equal to 0
We can see that the null hypothesis is also rejected by the non-parametric test, so our conclusion does not depend on a possibly unwarranted assumption of normality.
Differences among proportions.
The exercises in the CMU material were of the classroom variety. In the real professional world, you will have data and a question. R has good resources for tackling real problems, and it is actually easier to do the real problems than the classroom exercises.
Let’s take an example based on the cdc dataset. Do men and women have the same attitude towards their weight? We construct a variable FeelHeavy as a logical expression that is TRUE when actual weight exceeds desired weight.
load("~/Downloads/cdc.RData")
FeelHeavy <- cdc$weight > cdc$wtdesire
# Let's see how many people feel heavy.
sum(FeelHeavy)
## [1] 12764
# What fraction of the sample is this?
mean(FeelHeavy)
## [1] 0.6382
# Let's see what kind of values we have.
table(FeelHeavy)
## FeelHeavy
## FALSE TRUE
## 7236 12764
Now we can test the hypothesis that feeling heavy is independent of gender. First we create a table. Then we run chisq.test() on the table.
mytable <- table(FeelHeavy,cdc$gender)
chisq.test(mytable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable
## X-squared = 639.4812, df = 1, p-value < 2.2e-16
It seems pretty clear that we need to reject the null hypothesis of independence. The p-value tells us that the probability of seeing a chi-square value this large or larger if the null hypothesis were true is close to zero. We can conclude that feeling heavy is associated with gender.
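The same question can also be framed as a two-sample test of proportions with prop.test(), which reports the proportion feeling heavy in each gender along with an equivalent chi-square statistic. This is a sketch; we simply rebuild the table with gender on the rows and the "feels heavy" counts in the first column, which is the layout prop.test() expects.
# Same hypothesis as a difference in proportions between the genders
gendertab <- table(cdc$gender, FeelHeavy)[, c("TRUE", "FALSE")]
prop.test(gendertab)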
Of course, we can apply this same technique to categorical variables with more than two values. For example, genhlth has several values. Let’s test the null hypothesis that FeelHeavy and genhlth are independent using the same process.
table2 <- table(FeelHeavy,cdc$genhlth)
chisq.test(table2)
##
## Pearson's Chi-squared test
##
## data: table2
## X-squared = 176.1406, df = 4, p-value < 2.2e-16
Again, we need to reject the null hypothesis of independence. We can conclude that feeling heavy is associated with general health.
There is a function CrossTable in the package gmodels, which does the table generation and chi-square test in one step. It also has many options for displaying the table.
library(gmodels)
CrossTable(FeelHeavy,cdc$genhlth,chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 20000
##
##
## | cdc$genhlth
## FeelHeavy | excellent | very good | good | fair | poor | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## FALSE | 2052 | 2359 | 1876 | 678 | 271 | 7236 |
## | 79.981 | 10.594 | 15.296 | 3.770 | 2.773 | |
## | 0.284 | 0.326 | 0.259 | 0.094 | 0.037 | 0.362 |
## | 0.441 | 0.338 | 0.331 | 0.336 | 0.400 | |
## | 0.103 | 0.118 | 0.094 | 0.034 | 0.014 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## TRUE | 2605 | 4613 | 3799 | 1341 | 406 | 12764 |
## | 45.342 | 6.006 | 8.671 | 2.137 | 1.572 | |
## | 0.204 | 0.361 | 0.298 | 0.105 | 0.032 | 0.638 |
## | 0.559 | 0.662 | 0.669 | 0.664 | 0.600 | |
## | 0.130 | 0.231 | 0.190 | 0.067 | 0.020 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 4657 | 6972 | 5675 | 2019 | 677 | 20000 |
## | 0.233 | 0.349 | 0.284 | 0.101 | 0.034 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 176.1406 d.f. = 4 p = 5.026704e-37
##
##
##
We can use lm() to ask whether two quantitative variables are independent. We actually get much more out of this exercise. Let’s create a variable wtdiff as the difference between desired and actual weight, then test the null hypothesis that wtdiff is independent of weight.
wtdiff <- cdc$wtdesire - cdc$weight
wtlm <- lm(wtdiff~cdc$weight)
# Display the statistical result
summary(wtlm)
##
## Call:
## lm(formula = wtdiff ~ cdc$weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.98 -9.32 0.08 11.51 518.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.664015 0.590782 78.99 <2e-16 ***
## cdc$weight -0.360986 0.003388 -106.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.21 on 19998 degrees of freedom
## Multiple R-squared: 0.3621, Adjusted R-squared: 0.362
## F-statistic: 1.135e+04 on 1 and 19998 DF, p-value: < 2.2e-16
# Do a scatterplot of the two variables
plot(cdc$weight,wtdiff)
# Add the regression line to the plot
abline(wtlm)
The result says that an increase in weight of one pound decreases the value of wtdiff by about .36 pounds. The p-value for this coefficient is effectively zero, so we can reject the null hypothesis that the true value of the coefficient is zero.
The result also gives us an estimated linear relationship between the two variables.
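If all we wanted were a yes-or-no answer about independence, a correlation test would also do the job, though without the slope and intercept estimates that lm() gives us.
# Test whether the correlation between weight and wtdiff is zero
cor.test(cdc$weight, wtdiff)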