KITADA

Lab Activity #2

Inference for comparing a quantitative variable between two groups

Objectives:

Recognize when the two-sample methods can be used as opposed to the paired methods or one-sample methods
Use R to perform a hypothesis test and construct a confidence interval for the difference in means between two independent samples using the two-sample t-methods
Use the twomeans macro in R to obtain a p-value from a hypothesis test on the difference in population means and population medians and to obtain the bounds of a confidence interval for the difference in population means or medians
Interpret R output from all analyses listed above
Explain when to use the t-methods and when to use the bootstrap/randomization methods
Explain when to perform inference on the median instead of the mean

Part I: Activities

Reminder of Activity 2 in the Week 1 Lab Activity:

(to be performed outside class on your own – you will use the data you collect in Assignment 2)

This activity will take a couple of hours of your time outside of class. You are to take a trip to two different grocery stores (such as WinCo and Safeway). Before visiting the first store, decide on 15 products for which you’d like to compare prices at the two grocery stores. You will have to be specific about the brand name, product, and “size” of container. (Note: the “size of container” may not be determined until the visit to the first store.) For example, you may want to compare the price of a 12 ounce box of Kellogg’s Frosted Flakes between the two grocery stores you chose. You will collect data on the 15 products at each store, record your data into an Excel Worksheet (saving as a CSV or TXT file), and analyze the data in Problem 1 of Assignment 2. Here’s a checklist of what you’ll need to do for this activity:

Decide on the two grocery stores to use in this activity
Decide on the 15 products you want to compare.
- The brand name, product, and size have to be exactly the same at each store. Therefore, do not compare generic brands as they have different names at different stores.
- You may have to wait until your visit to the first store to determine the “size” as you may not be aware of the different size packages for different products.
- Try to come up with a variety of products to get a good representation of products at the stores. (That is, don’t choose all different brands and sizes of cereals to compare!)
At each store, record the price of each product on your list. If you didn’t record the prices in an electronic spreadsheet (such as an Excel worksheet) at each store, do so after you collect all your data. (Note: think about the way the data should be recorded to do the proper analysis of the data in R.)
Use the data to answer the questions in Problem 1 of Assignment 2.

You may work with one other person to collect the data only, but must work on the assignment on your own!!! If you do collect data with someone else, give the name of the person you worked with in Problem 1 of Assignment 2.

Part II: Examples

For Example 1, only the analysis steps for doing inference will be emphasized in lab. You should still go through the first several steps on your own.

Step 1: Identify the variable of interest and populations

Step 2: Assess whether the samples are representative of their respective populations

Step 3: Understand why a particular example is an estimation example or hypothesis test example

Example 1:

Is there a difference in the average amount of time spent exercising each day between male and female OSU students? Use the appropriate variables in the STUDENTSURVEY15 data set to answer this question. Let's assume that the daily amount of exercise for male and female students in the samples is representative of daily amount of exercise for all OSU male and female students (although that could be debated).

# IMPORT DATA
student<-read.csv("/Users/heatherhisako1/Desktop/Teaching/ST352_Summer16/STUDENTSURVEY15.csv",
                   header=TRUE)

# VARIABLE NAMES
names(student)

##  [1] "SEX"              "COLLEGE"          "HEIGHT"          
##  [4] "EYE"              "PARTY"            "APPROVAL"        
##  [7] "YEAR"             "EXERCISE"         "PHONE"           
## [10] "STUDY"            "SLEEP"            "COURSE.MATERIALS"
## [13] "ANXIETY"          "GLASS"            "STATES"          
## [16] "SPEED"            "SOCIAL"

1. When comparing a quantitative variable between two (or more) groups, side-by-side box-and-whisker plots work well to get an idea of the shape of the data in each sample and also get an idea of whether the claim in the null hypothesis will be rejected for the claim in the alternative hypothesis. Obtain a side-by-side box-and-whisker plot.

a.  Describe the shape of amount of time spent exercising for both males and females in the study.

# CREATE SIDE-BY-SIDE BOXPLOT
boxplot(EXERCISE~SEX, data=student)

[Figure: side-by-side boxplots of EXERCISE by SEX]

Alternatively, we can split the data set by sex and plot overlapping histograms.

If you don't already have the dplyr package, install it first:

# INSTALL dplyr (only needed once)
install.packages("dplyr")

library("dplyr")

## Warning: package 'dplyr' was built under R version 3.2.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# SPLIT THE DATA SET
male<-student%>%
  filter(SEX=="Male")

dim(male)

## [1] 29 17

female<-student%>%
  filter(SEX=="Female")

dim(female)

## [1] 36 17

# OVERLAPPING HISTOGRAM (Female = blue and Male = red)
hist(male$EXERCISE, breaks=seq(0, 200, by=25), ylim=c(0, 12), col=rgb(1,0,0,0.5),
     main="Overlapping Histogram", xlab="EXERCISE")
hist(female$EXERCISE, breaks=seq(0, 200, by=25), col=rgb(0,0,1,0.5), add=TRUE)

[Figure: overlapping histograms of EXERCISE for females (blue) and males (red)]

b. Based on the shape of the data in each sample, which parameter (mean or median) and which method (t-methods or randomization/bootstrap methods) would be more appropriate to use? Explain.

It looks like the distributions of exercise times for both sexes are skewed, so we should use the randomization/bootstrap methods.

2. Obtain the summary statistics (mean, median, standard deviation, and sample size) for each comparison group.

n_m<-length(male$EXERCISE)
mu_m<-mean(male$EXERCISE)
med_m<-median(male$EXERCISE)
sd_m<-sd(male$EXERCISE)

n_m

## [1] 29

mu_m

## [1] 50.86207

med_m

## [1] 40

sd_m

## [1] 44.241

n_f<-length(female$EXERCISE)
mu_f<-mean(female$EXERCISE)
med_f<-median(female$EXERCISE)
sd_f<-sd(female$EXERCISE)

n_f

## [1] 36

mu_f

## [1] 42.22222

med_f

## [1] 30

sd_f

## [1] 33.60367
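The same four statistics can also be collected for both groups in one step. The following is a minimal sketch on a small made-up data frame: the `toy` values and the helper `summarize_group` are hypothetical and are not the survey data. With the real data you would use `student$EXERCISE` and `student$SEX` in place of the toy columns.

```r
# Hypothetical toy data, NOT the STUDENTSURVEY15 values
toy <- data.frame(
  SEX      = c("Male", "Male", "Male", "Female", "Female", "Female"),
  EXERCISE = c(30, 60, 90, 20, 40, 45)
)

# Hypothetical helper: sample size, mean, median, and standard deviation
summarize_group <- function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}

# split() separates the response by group; sapply() applies the helper to each group
group_stats <- sapply(split(toy$EXERCISE, toy$SEX), summarize_group)
group_stats
```

Each column of `group_stats` is one group and each row is one statistic, which makes the two groups easy to compare side by side.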

3. Using an appropriate parameter, state the null and alternative hypotheses in words and statistical notation. Define the parameter used in the context of the problem.

$ H_0: m_{male}=m_{female} $ $ H_A: m_{male} \neq m_{female} $

4. As in the Week 1 Lab Activity, both the two-sample t-method and randomization/bootstrap method will be shown regardless of your answers in #1 above.

A. Two-sample t-method (can only be used if comparing means)

Can the pooled two-sample t-methods be used? Explain.

These are two different populations, and the sample standard deviations are quite different ($ SD_{male}=44.24 $ vs. $ SD_{female}=33.60 $), so it doesn't make sense to use the pooled method.

To perform a two-sample t-test in R, the t.test() function needs a response and a grouping variable given as a formula of the form “y~x”, where y=response and x=grouping variable. The t.test() function also takes a null-hypothesized mean difference (mu= ) and a direction for the alternative hypothesis (alternative= ). For all the two-sample problems in this class, including this one, the null-hypothesized mean difference is 0 (mu=0), and for our example here the alternative is two-sided (alternative="two.sided"). Note that mu=0 and alternative="two.sided" are the default options for the t.test() function.

The full syntax for this function will then be:

t.test(student$EXERCISE ~ student$SEX, mu=0, alternative="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  student$EXERCISE by student$SEX
## t = -0.86896, df = 51.223, p-value = 0.3889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -28.59869  11.31900
## sample estimates:
## mean in group Female   mean in group Male 
##             42.22222             50.86207

i. Give the t-statistic with degrees of freedom

t = -0.86896, df = 51.223

Note: By default, R performs an “un-pooled” two-sample t-test. R calculates the degrees of freedom for this test with a rather lengthy formula. The formula gives the correct degrees of freedom for an “un-pooled” two-sample t-test and should be used, but you will not need to know how to get the degrees of freedom by hand using the formula method. Instead, if doing such a problem by hand on an exam, use this method: smaller n – 1. If the “pooled” methods are used, R calculates the degrees of freedom with this formula: n1 + n2 – 2. The “pooled” method can be requested by adding the argument var.equal=TRUE to the t.test() function.

ii. Give the p-value

p-value = 0.3889
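As a check on the output above, the t-statistic, degrees of freedom, and p-value can be reproduced from the summary statistics in #2. This is a sketch, with the group statistics typed in by hand (rounded as printed) and with variable names chosen here for illustration. The "lengthy formula" is the Welch-Satterthwaite approximation used by the Welch test that R reports.

```r
# Summary statistics copied from #2 (Female is Group 1, Male is Group 2)
n_f <- 36; mean_f <- 42.22222; sd_f <- 33.60367
n_m <- 29; mean_m <- 50.86207; sd_m <- 44.241

# Variance of each sample mean and the un-pooled standard error of the difference
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se_diff <- sqrt(v_f + v_m)

# t-statistic for H0: difference in means = 0 (Group 1 - Group 2)
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom (un-pooled test)
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Degrees of freedom for the by-hand shortcut and for the pooled test
df_by_hand <- min(n_f, n_m) - 1
df_pooled  <- n_f + n_m - 2

# Two-sided p-value from the t-distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
c(by_hand = df_by_hand, pooled = df_pooled)
```

The by-hand shortcut gives fewer degrees of freedom than the Welch formula, so it is the more conservative of the two.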

B. Bootstrap/Randomization method using the twomeans macro

Step 1: The STUDENTSURVEY15 data set should be imported into R.

Step 2: Open the twomeans macro in R, highlight and run the full script. This will make the macro available for use.

Step 3: In order to run the twomeans macro, your data set MUST have:

A column of numeric values of the response variable (for both groups)
A column for the explanatory variable (i.e. the “name” of the category each case belongs to) with exactly 2 levels.

Here are the arguments for the twomeans macro:

response: the numeric response variable for both groups
groups: the categorical variable containing the grouping names (only 2 groups allowed)
iterations: the number of randomizations you want the macro to generate (at least 2000 is suggested)
MEAN: a TRUE or FALSE statement determining if the mean is the parameter of interest
MEDIAN: a TRUE or FALSE statement determining if the median is the parameter of interest
ci_level: a decimal value from 0 to 1 for the level of confidence for the confidence interval
Alt_Hyp: a value of 1, 2 or 3 determining the alternative hypothesis as:
- 1 for $ H_A: Group_1 < Group_2 $
- 2 for $ H_A: Group_1 > Group_2 $
- 3 for $ H_A: Group_1 \neq Group_2 $

NOTE: When using this macro, Group 1 and Group 2 are assigned alphabetically from the grouping variable, not necessarily as they appear in the data set. For example, if the two groups of a grouping variable sex are “Male” and “Female”, then this macro will assign “Female” as group 1 and “Male” as group 2. This is important in determining the alternative hypothesis.

NOTE: The updated twomeans macro uses the sort() function to order the groups. This sorts alphabetically; HOWEVER, the sorting is case sensitive (capital starting letters come before lowercase starting letters). TO BE CONSISTENT, ALL LEVELS OF A GROUPING VARIABLE SHOULD BEGIN WITH A CAPITAL LETTER OR ALL SHOULD BEGIN WITH A LOWERCASE LETTER. DO NOT MIX AND MATCH. For example, if your grouping variable has levels “Low” and “high”, then the sort function will put “Low” as group 1 and “high” as group 2, but if the levels are “Low” and “High”, then the sort function will do what we expect alphabetically, putting “High” as group 1 and “Low” as group 2. This is very important for the alternative hypothesis in this macro!
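Before running the macro, it can help to check the group order yourself. The lines below are a small sketch with made-up labels. How the macro calls sort() internally is an assumption here, and the default ordering of mixed-case strings in R can depend on your system's locale, so the case-sensitive ordering is shown with method="radix", which always sorts capitals first.

```r
# Hypothetical grouping labels (not the survey data)
sex <- c("Male", "Female", "Female", "Male")

# Alphabetical order of the distinct labels: the first is Group 1, the second is Group 2
group_order <- sort(unique(sex))
group_order

# Mixed-case labels: a case-sensitive sort puts the capitalized label first
mixed_order <- sort(c("Low", "high"), method = "radix")
mixed_order

# Consistently capitalized labels sort the way we expect alphabetically
consistent_order <- sort(c("Low", "High"))
consistent_order
```

With the real data, `sort(unique(student$SEX))` gives the same kind of check, and the group 1 indicator in the macro output confirms it.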

Fill in the blanks with the correct arguments for this example:

response: student$EXERCISE

groups: student$SEX

iterations: 2000

MEAN: TRUE

MEDIAN: FALSE

ci_level: .95

Alt_Hyp: 3

Step 4: Run the macro by typing in the following with the correct arguments:

exercise2Boot<-twomeans(response=student$EXERCISE,
         groups=student$SEX,
         iterations=2000,
         MEAN=TRUE,
         MEDIAN=FALSE,
         Alt_Hyp=3,
         ci_level=0.95)

[Figure: distribution of the bootstrapped/randomized differences in sample means from the twomeans macro]

Here is what will be displayed in the output

the standard deviation of the difference in sample means (or medians) from the specified number of randomizations.
The distribution of bootstrapped/randomized sample differences (number determined by iterations).
a confidence interval (with the indicated level of confidence in the macro) with bounds determined using the percentile method AND the formula method.
a group 1 indicator to confirm which group the macro calls group 1.
the p-value from the hypothesis test.
The alternative hypothesis (Alt_Hyp) chosen.

A. What is the p-value and what does that mean in the context of the adjusted bootstrap distribution?

exercise2Boot$pval

## [1] 0.3785

Of the bootstrapped differences of means, 37.85% of the values were at least as extreme as the observed difference in sample means.

B. What is the value of the standard deviation of the adjusted bootstrap distribution?

exercise2Boot$Diffs_SD

## [1] 9.618918

C. How was the standard deviation of the adjusted bootstrap distribution calculated?

It is the standard deviation of the bootstrapped differences of means:

sd(exercise2Boot$Diffs_Dist)

## [1] 9.618918
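To see where a p-value and a standard deviation like these come from, here is a minimal sketch of a randomization test for a difference in means. It is not the code of the twomeans macro, whose internals are not shown in this lab. It uses simulated data (not the survey values) and a plain shuffle of the group labels, and all object names are chosen here for illustration.

```r
set.seed(352)

# Simulated, skewed exercise times: hypothetical data, NOT STUDENTSURVEY15
response <- round(c(rexp(36, rate = 1 / 42), rexp(29, rate = 1 / 50)))
groups   <- rep(c("Female", "Male"), times = c(36, 29))

# Observed difference in sample means (Group 1 - Group 2, alphabetical order)
obs_diff <- mean(response[groups == "Female"]) - mean(response[groups == "Male"])

# Shuffle the group labels many times; under H0 the labels are interchangeable
iterations <- 2000
rand_diffs <- replicate(iterations, {
  shuffled <- sample(groups)
  mean(response[shuffled == "Female"]) - mean(response[shuffled == "Male"])
})

# Two-sided p-value: share of shuffled differences at least as extreme as observed
p_value <- mean(abs(rand_diffs) >= abs(obs_diff))

# Standard deviation of the randomization distribution
diffs_sd <- sd(rand_diffs)

c(p_value = p_value, diffs_sd = diffs_sd)
```

Because the shuffles are random, the p-value and standard deviation change slightly from run to run unless a seed is set.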

5. Based on your answers to the questions in #1, which p-value will you use to make a conclusion: the p-value from the two-sample t-test or the p-value from the twomeans macro?

Both p-values are very close. In this case I think it would be OK to use either, because the sample sizes are large enough to believe that the central limit theorem kicks in, which means that we can feel comfortable using the t-methods to make inference on the mean. However, if you are concerned about the skew, you can do the analysis with the bootstrap and make inference on the median.

6. Using that p-value, state a conclusion in the context of the problem.

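We fail to reject the null hypothesis at the $ \alpha=0.05 $ level. There is not enough evidence to conclude that the average amount of time spent exercising each day differs between male and female OSU students.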

7. Constructing a confidence interval for the difference in the mean (or median) number of minutes per day of exercise among OSU students.

A. The two-sample t-methods:

To change the confidence level inside of the t.test() function, you can add an argument called conf.level= inside with a decimal value for the desired level. Note that the default level is 0.95, or 95% confidence.

Note: if the alternative hypothesis in the hypothesis test was NOT “two.sided”, you will need to re-do the two-sample t-procedures using an alternative of “two.sided” to obtain both bounds of the confidence interval.

i. Write the 95% confidence interval for the difference in population means in proper notation.

t.test(student$EXERCISE ~ student$SEX, mu=0, conf.level=0.95,
       alternative="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  student$EXERCISE by student$SEX
## t = -0.86896, df = 51.223, p-value = 0.3889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -28.59869  11.31900
## sample estimates:
## mean in group Female   mean in group Male 
##             42.22222             50.86207

Confidence Interval = (-28.60, 11.32) minutes

ii. Interpret this confidence interval in the context of the problem.

We are 95% confident that the true difference in population mean daily exercise times (female – male) for OSU students is between -28.60 and 11.32 minutes.

Note: To interpret this problem, you need to understand how R subtracts. In the output, how the subtraction was performed is given. The first group mean listed is Group 1 and the second group mean listed is Group 2, and the subtraction is Group 1 – Group 2. The group assignment should always be alphabetical (or numerically ascending), but it is good to confirm this with the output.
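The subtraction order can be confirmed by rebuilding the interval by hand as estimate ± t* × standard error. This is a sketch using the rounded summary statistics from #2, with variable names chosen here for illustration.

```r
# Summary statistics from #2 (Female is Group 1, Male is Group 2)
n_f <- 36; mean_f <- 42.22222; sd_f <- 33.60367
n_m <- 29; mean_m <- 50.86207; sd_m <- 44.241

# Un-pooled standard error and Welch degrees of freedom
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se_diff  <- sqrt(v_f + v_m)
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Point estimate: Group 1 - Group 2, i.e. Female - Male
estimate <- mean_f - mean_m

# Critical value for the chosen confidence level
conf_level <- 0.95
t_star <- qt(1 - (1 - conf_level) / 2, df = df_welch)

# Confidence interval for the difference in population means
ci <- estimate + c(-1, 1) * t_star * se_diff
round(ci, 2)
```

The estimate is negative because the female sample mean is smaller than the male sample mean, which is why most of the interval lies below 0.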

B. The percentile method from the twomeans macro:

i. Write the 95% confidence interval for the difference in population means (or medians) in correct notation using the percentile method.

exercise2Boot$Confidence_Intervals

##   CI_Percent CI_Formula
## 1  -28.17529  -27.49258
## 2    9.80364   10.21289

ii. If the mean is being used, are the bounds from the percentile method the same as from the two-sample t-method? Why or why not?

They are a little different due to randomness and a difference in the way they are constructed.
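The percentile idea can be sketched in a few lines: resample each group with replacement, record the difference in means, and take the middle 95% of those differences. This is an illustration on simulated data (not the survey values) and is not the macro's own code. The object names are hypothetical.

```r
set.seed(2016)

# Simulated, skewed exercise times for each group: hypothetical data
female <- round(rexp(36, rate = 1 / 42))
male   <- round(rexp(29, rate = 1 / 50))

# Bootstrap: resample within each group and record Female - Male each time
iterations <- 2000
boot_diffs <- replicate(iterations, {
  mean(sample(female, replace = TRUE)) - mean(sample(male, replace = TRUE))
})

# Percentile method: cut off (1 - ci_level)/2 of the differences in each tail
ci_level <- 0.95
alpha <- 1 - ci_level
ci_percentile <- quantile(boot_diffs, probs = c(alpha / 2, 1 - alpha / 2))
ci_percentile
```

Rerunning without the seed gives slightly different bounds each time, which is the "randomness" part of the answer above.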

Example 2:

Weights of 9 students (in pounds) before and after a month-long training schedule are given below. Is there evidence to indicate the mean weight after training is less than the mean weight before training?

beforeWeight<-c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight<-c(133, 145, 131, 150, 150, 137, 130, 159, 141)

1. Explain why this is an example of a matched-pairs design.

Each student was weighed twice, before and after the training schedule, so the two measurements come from the same individual and are paired.

2. After entering the data into R, use the following command to obtain the differences in weights (after – before exercise):

diffWeight<-afterWeight-beforeWeight

3. State the null and alternative hypotheses in words and in notation. Define any parameters used in the notation.

$ H_0: \mu_d=0 $ $ H_A: \mu_d<0 $

4. Obtain a graph of the differences. Describe the shape of the differences.

hist(diffWeight, main="Histogram of Differences")

[Figure: histogram of the differences in weight (after – before)]

5. Based on the sample size and the graph, which method is most appropriate to use to perform a hypothesis test? Would it be on the mean or the median? (If on the median, change the parameter in your hypotheses in #3 to the median.)

The sample size is very small, so it's hard to tell what the distribution of the differences is (i.e., we can't tell if it's truly normal). There isn't evidence of extreme skew. To protect ourselves, we might want to use bootstrap methods for the mean.

6. Obtain the p-value two different ways:

A. Use the Paired t-methods in R from the t.test() function.

Note: this method is a one-sample t-test on the differences. It will only perform a test on the mean. In order to perform a paired t-test, you can do one of two things:

OPTION 1: Using separate vectors for the paired responses, add an argument to t.test() stating paired=TRUE. Be careful of the order of the responses in relation to the direction of the alternative.

t.test(afterWeight, beforeWeight, mu=0, paired=TRUE, alternative="less")

## 
##  Paired t-test
## 
## data:  afterWeight and beforeWeight
## t = -0.62767, df = 8, p-value = 0.2739
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 3.052964
## sample estimates:
## mean of the differences 
##               -1.555556

OPTION 2: Since paired methods are one-sample tests on differences, one can perform a one-sample t-test on the differences.

t.test(diffWeight, mu=0, alternative="less")

## 
##  One Sample t-test
## 
## data:  diffWeight
## t = -0.62767, df = 8, p-value = 0.2739
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 3.052964
## sample estimates:
## mean of x 
## -1.555556

i. What is the p-value from the paired t-test?

The p-value is 0.2739. So we will fail to reject the null.
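Both options give the same result because the paired t-statistic is just the mean difference divided by its standard error. A short by-hand sketch using the data above (the names `d`, `se_d`, `t_stat`, and `upper_bound` are chosen here for illustration):

```r
beforeWeight <- c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight  <- c(133, 145, 131, 150, 150, 137, 130, 159, 141)

# Differences (after - before), one per student
d <- afterWeight - beforeWeight
n <- length(d)

# Standard error of the mean difference
se_d <- sd(d) / sqrt(n)

# t-statistic for H0: mu_d = 0, with n - 1 degrees of freedom
t_stat <- (mean(d) - 0) / se_d

# Lower-tail p-value, because the alternative is mu_d < 0
p_value <- pt(t_stat, df = n - 1)

# Upper bound of the one-sided 95% confidence interval
upper_bound <- mean(d) + qt(0.95, df = n - 1) * se_d

round(c(t = t_stat, p = p_value, upper = upper_bound), 4)
```

Reversing the subtraction (before – after) would flip the sign of the t-statistic, so the alternative would have to change to "greater" to test the same claim.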

B. The “adjusted” bootstrap method

Use the commands in the Week 1 Lab Activity to perform a hypothesis test on the mean or median of the differences (whichever was deemed more appropriate) using the onemean macro.

diffBoot<-onemean(original_sample=diffWeight,
                      iterations=2000, 
                      ci_only=FALSE,
                      MEAN=TRUE, 
                      MEDIAN=FALSE, 
                      ci_level=0.95, 
                      null.value=0,
                      Alt_Hyp=1)

[Figure: bootstrap distribution of the mean difference from the onemean macro]

i. Report and interpret the p-value in the context of the problem.

diffBoot$pval

## [1] 0.252

ii. Use the p-value to state a conclusion in the context of the problem.

We fail to reject the null hypothesis that the mean difference in weight (after – before) is 0. There is no evidence to suggest that the mean weight after training is less than the mean weight before training.
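For intuition about the bootstrap p-value, the sketch below assumes that “adjusted” means the sample of differences is shifted so that its mean equals the null value before resampling. That is an assumption about the approach, not the onemean macro's actual code, and the object names are chosen here for illustration.

```r
set.seed(352)

beforeWeight <- c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight  <- c(133, 145, 131, 150, 150, 137, 130, 159, 141)
diffWeight   <- afterWeight - beforeWeight

null_value <- 0
obs_mean   <- mean(diffWeight)

# Assumed adjustment: shift the sample so that its mean equals the null value
shifted <- diffWeight - obs_mean + null_value

# Resample the shifted differences with replacement and record each mean
iterations <- 2000
boot_means <- replicate(iterations, mean(sample(shifted, replace = TRUE)))

# Lower-tail p-value for H_A: mu_d < 0
p_value <- mean(boot_means <= obs_mean)
p_value
```

The value will vary a little with the seed and the number of iterations, but it should land near the p-values reported above.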