KITADA
Lab Activity #2
Objectives:
Part I: Activities
Reminder of Activity 2 in the Week 1 Lab Activity:
(to be performed outside class on your own – you will use the data you collect in Assignment 2)
This activity will take a couple of hours of your time outside of class. You are to take a trip to two different grocery stores (such as WinCo and Safeway). Before visiting the first store, decide on 15 products for which you’d like to compare prices at the two grocery stores. You will have to be specific about the brand name, product, and “size” of container. (Note: the “size of container” may not be determined until the visit to the first store.) For example, you may want to compare the price of a 12 ounce box of Kellogg’s Frosted Flakes between the two grocery stores you chose. You will collect data on the 15 products at each store, record your data into an Excel Worksheet (saving as a CSV or TXT file), and analyze the data in Problem 1 of Assignment 2. Here’s a checklist of what you’ll need to do for this activity:
You may work with one other person to collect the data only, but must work on the assignment on your own!!! If you do collect data with someone else, give the name of the person you worked with in Problem 1 of Assignment 2.
Part II: Examples
For Example 1, only the analysis steps for doing inference will be emphasized in lab. You should still go through the first several steps on your own.
Step 1: Identify the variable of interest and populations
Step 2: Assess whether the samples are representative of their respective populations
Step 3: Understand why a particular example is an estimation example or hypothesis test example
Example 1:
Is there a difference in the average amount of time spent exercising each day between male and female OSU students? Use the appropriate variables in the STUDENTSURVEY15 data set to answer this question. Let's assume that the daily amount of exercise for male and female students in the samples is representative of daily amount of exercise for all OSU male and female students (although that could be debated).
# IMPORT DATA
student<-read.csv("/Users/heatherhisako1/Desktop/Teaching/ST352_Summer16/STUDENTSURVEY15.csv",
header=TRUE)
# VARIABLE NAMES
names(student)
## [1] "SEX" "COLLEGE" "HEIGHT"
## [4] "EYE" "PARTY" "APPROVAL"
## [7] "YEAR" "EXERCISE" "PHONE"
## [10] "STUDY" "SLEEP" "COURSE.MATERIALS"
## [13] "ANXIETY" "GLASS" "STATES"
## [16] "SPEED" "SOCIAL"
1. When comparing a quantitative variable between two (or more) groups, side-by-side box-and-whisker plots work well to get an idea of the shape of the data in each sample and also get an idea of whether the claim in the null hypothesis will be rejected for the claim in the alternative hypothesis. Obtain a side-by-side box-and-whisker plot.
a. Describe the shape of amount of time spent exercising for both males and females in the study.
# CREATE SIDE-BY-SIDE BOXPLOT
boxplot(EXERCISE~SEX, data=student)
Alternatively, we can split the data set by sex and plot each group separately.
If you don't already have the dplyr package installed, install it first:
# INSTALL dplyr (ONLY NEEDED ONCE), THEN LOAD IT
# Naming a CRAN mirror avoids the "trying to use CRAN without setting a mirror" error
install.packages("dplyr", repos="https://cloud.r-project.org")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# SPLIT THE DATA SET
male<-student%>%
filter(SEX=="Male")
dim(male)
## [1] 29 17
female<-student%>%
filter(SEX=="Female")
dim(female)
## [1] 36 17
# OVERLAPPING HISTOGRAM (Female = blue and Male = red)
hist(male$EXERCISE, breaks=seq(0, 200, by=25), ylim=c(0, 12),col=rgb(1,0,0,0.5), main="Overlapping Histogram", xlab="EXERCISE")
hist(female$EXERCISE, breaks=seq(0, 200, by=25), col=rgb(0,0,1,0.5), add=TRUE)
b. Based on the shape of the data in each sample, which parameter (mean or median) and which method (t-methods or randomization/bootstrap methods) would be more appropriate to use? Explain.
The distributions of exercise times for both sexes look skewed, so the randomization/bootstrap method is the safer choice, and the median would be the more appropriate parameter.
2. Obtain the summary statistics (mean, median, standard deviation, and sample size) for each comparison group.
n_m<-length(male$EXERCISE)
mu_m<-mean(male$EXERCISE)
med_m<-median(male$EXERCISE)
sd_m<-sd(male$EXERCISE)
n_m
## [1] 29
mu_m
## [1] 50.86207
med_m
## [1] 40
sd_m
## [1] 44.241
n_f<-length(female$EXERCISE)
mu_f<-mean(female$EXERCISE)
med_f<-median(female$EXERCISE)
sd_f<-sd(female$EXERCISE)
n_f
## [1] 36
mu_f
## [1] 42.22222
med_f
## [1] 30
sd_f
## [1] 33.60367
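The same four statistics can be computed for both groups in one step. The following is only a sketch: the `toy` data frame and the `group_summary` helper are hypothetical stand-ins for illustration, and the values are not taken from STUDENTSURVEY15.

```r
# Hypothetical toy data standing in for the `student` data frame
# (these values are NOT from STUDENTSURVEY15)
toy <- data.frame(
  SEX = c("Female", "Female", "Female", "Male", "Male", "Male"),
  EXERCISE = c(30, 45, 60, 20, 40, 90)
)

# Hypothetical helper: sample size, mean, median, and SD of one group
group_summary <- function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}

# Split the response by group, then summarize each piece
stats_by_sex <- sapply(split(toy$EXERCISE, toy$SEX), group_summary)
print(stats_by_sex)
```

With the real data, replacing `toy` with `student` should give one column of statistics per level of SEX.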
3. Using an appropriate parameter, state the null and alternative hypotheses in words and statistical notation. Define the parameter used in the context of the problem.
\( H_0: \mu_{male}=\mu_{female} \) \( H_A: \mu_{male} \neq \mu_{female} \)
4. As in the Week 1 Lab Activity, both the two-sample t-method and randomization/bootstrap method will be shown regardless of your answers in #1 above.
a. Two-sample t-method (can only be used if comparing means)
These are two different populations and the sample standard deviations are quite different (\( SD_{male}=44.24 \) vs \( SD_{female}=33.60 \)), so it doesn't make sense to use the pooled method.
To perform a two-sample t-test in R, the t.test() function has to have a response input and a grouping variable input in the formula “y~x”, where y=response and x=grouping variable. The t.test() function still needs a specified mean difference for the null hypothesis (mu= ) and a direction for the alternative hypothesis (alternative= ). For all the two-sample problems in this class, including this one, the null-hypothesized mean difference is 0 (mu=0), and for our example here the alternative is two-sided (alternative="two.sided"). Note that mu=0 and alternative="two.sided" are the default options for the t.test() function.
The full syntax for this function will then be:
t.test(student$EXERCISE ~ student$SEX, mu=0, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: student$EXERCISE by student$SEX
## t = -0.86896, df = 51.223, p-value = 0.3889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.59869 11.31900
## sample estimates:
## mean in group Female mean in group Male
## 42.22222 50.86207
i. Give the t-statistic with degrees of freedom
t = -0.86896, df = 51.223
Note: By default, R performs an “un-pooled” two-sample t-test. R calculates the degrees of freedom using a rather lengthy formula. The formula gives the correct degrees of freedom for an “un-pooled” two-sample t-test and should be used, but you will not need to know how to get the degrees of freedom by hand using the formula method. Instead, if doing such a problem by hand on an exam, use this method: smaller n – 1. If the “pooled” method is used, R calculates the degrees of freedom with this formula: n1 + n2 – 2. The “pooled” method can be requested by adding another argument to the t.test() function: var.equal=TRUE.
ii. Give the p-value
p-value = 0.3889
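For the curious, the t-statistic, degrees of freedom, and p-value reported by R can be reproduced from the summary statistics in #2. This sketch assumes the "lengthy formula" is the usual Welch-Satterthwaite approximation; the variable names are chosen here for illustration.

```r
# Summary statistics copied from the output in #2
n_m <- 29; mean_m <- 50.86207; sd_m <- 44.241
n_f <- 36; mean_f <- 42.22222; sd_f <- 33.60367

# Variance of each sample mean
v_m <- sd_m^2 / n_m
v_f <- sd_f^2 / n_f

# Un-pooled standard error and t-statistic (Female - Male, as R orders the groups)
se_diff <- sqrt(v_m + v_f)
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom (assumed to be the formula R uses here)
df_welch <- (v_m + v_f)^2 / (v_m^2 / (n_m - 1) + v_f^2 / (n_f - 1))

# The two shortcuts mentioned in the note
df_by_hand <- min(n_m, n_f) - 1   # smaller n - 1
df_pooled <- n_m + n_f - 2        # n1 + n2 - 2

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The by-hand degrees of freedom (28) are smaller than the Welch value, so the by-hand method is the more conservative of the two.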
b. Bootstrap/Randomization method using the twomeans macro
Step 1: The STUDENTSURVEY data set should be imported into R.
Step 2: Open the twomeans macro in R, highlight and run the full script. This will make the macro available for use.
Step 3: In order to run the twomeans macro, your data set MUST have:
Here are the arguments for the twomeans macro:
NOTE: When using this macro, Group 1 and Group 2 are assigned alphabetically from the grouping variable, not necessarily as they appear in the data set. For example, if the two groups of a grouping variable sex are “Male” and “Female”, then this macro will assign “Female” as group 1 and “Male” as group 2. This is important in determining the alternative hypothesis.
NOTE: The updated twomeans macro uses the sort() function to determine the groups. This sorts alphabetically; HOWEVER, the sorting is case sensitive (capital starting letters come before lowercase starting letters). TO BE CONSISTENT, ALL LEVELS OF A GROUPING VARIABLE SHOULD BEGIN WITH A CAPITAL LETTER OR ALL SHOULD BEGIN WITH A LOWERCASE LETTER. DO NOT MIX AND MATCH. For example, if your grouping variable has levels “Low” and “high”, then the sort function will put “Low” as group 1 and “high” as group 2, but if the levels are “Low” and “High”, then the sort function will do what we expect alphabetically, putting “High” as group 1 and “Low” as group 2. This is very important for the alternative hypothesis in this macro!
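The case-sensitivity issue can be seen in a small sketch. It assumes a reasonably recent version of R in which sort() accepts method = "radix", which always uses the capitals-first ordering described in the note. Whether plain sort() behaves the same way can depend on your session's locale, so it is worth checking the group order in your own output.

```r
# Levels that mix a capital and a lowercase starting letter
mixed_case <- c("Low", "high")
# Levels that both start with a capital letter
same_case <- c("Low", "High")

# method = "radix" sorts capital letters before lowercase letters
mixed_sorted <- sort(mixed_case, method = "radix")
same_sorted <- sort(same_case, method = "radix")

mixed_sorted  # "Low" comes first, so "Low" would be group 1
same_sorted   # "High" comes first, so "High" would be group 1
```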
Fill in the blanks with the correct arguments for this example:
response: student$EXERCISE
groups: student$SEX
iterations: 2000
MEAN: TRUE
MEDIAN: FALSE
ci_level: .95
Alt_Hyp: 3
Step 4: Run the macro by typing in the following with the correct arguments:
exercise2Boot<-twomeans(response=student$EXERCISE,
groups=student$SEX,
iterations=2000,
MEAN=TRUE,
MEDIAN=FALSE,
Alt_Hyp=3,
ci_level=0.95)
Here is what will be displayed in the output
A. What is the p-value and what does that mean in the context of the adjusted bootstrap distribution?
exercise2Boot$pval
## [1] 0.3785
Of the bootstrapped differences in means in the adjusted bootstrap distribution, 37.85% were at least as extreme as the difference in sample means that we observed.
B. What is the value of the standard deviation of the adjusted bootstrap distribution?
exercise2Boot$Diffs_SD
## [1] 9.618918
C. How was the standard deviation of the adjusted bootstrap distribution calculated?
It is the standard deviation of the bootstrapped differences of means
sd(exercise2Boot$Diffs_Dist)
## [1] 9.618918
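To give a feel for where a bootstrap p-value and standard deviation come from, here is a minimal sketch of one common "adjusted" bootstrap test for two means. It is not the twomeans macro itself: the shifting step, the variable names, and the toy data are all assumptions made for illustration.

```r
set.seed(352)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data (NOT the STUDENTSURVEY15 values)
group_a <- c(10, 20, 30, 45, 60, 90, 120)
group_b <- c(0, 15, 30, 30, 40, 60, 75, 100)

observed_diff <- mean(group_a) - mean(group_b)

# "Adjust" each sample so both have the same mean, making the null hypothesis true
pooled_mean <- mean(c(group_a, group_b))
a_null <- group_a - mean(group_a) + pooled_mean
b_null <- group_b - mean(group_b) + pooled_mean

# Resample each adjusted group with replacement and record the difference in means
iterations <- 2000
boot_diffs <- replicate(iterations, {
  mean(sample(a_null, replace = TRUE)) - mean(sample(b_null, replace = TRUE))
})

# Standard deviation of the adjusted bootstrap distribution
boot_sd <- sd(boot_diffs)

# Two-sided p-value: share of bootstrapped differences at least as extreme as observed
p_two_sided <- mean(abs(boot_diffs) >= abs(observed_diff))

c(sd = boot_sd, pval = p_two_sided)
```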
5. Based on your answers to the questions in #1, which p-value will you use to make a conclusion: the p-value from the two-sample t-test or the p-value from the twomeans macro?
Both p-values are very close. In this case it would be OK to use either, because the sample sizes (29 and 36) are large enough to believe that the central limit theorem kicks in, which means that we can feel comfortable using t-methods and making inference about the mean. However, if you are concerned about the skew, you can do the analysis with the bootstrap and make inference about the median.
6. Using that p-value, state a conclusion in the context of the problem.
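We fail to reject the null hypothesis at the \( \alpha=0.05 \) level. There is not sufficient evidence of a difference in the mean daily exercise time between male and female OSU students.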
7. Constructing a confidence interval for the difference in the mean (or median) number of minutes per day of exercise among OSU students.
A. The two-sample t-methods:
To change the confidence level inside of the t.test() function, you can add an argument called conf.level= inside with a decimal value for the desired level. Note that the default level is 0.95, or 95% confidence.
Note: if the alternative hypothesis in the hypothesis test was NOT “two.sided”, you will need to re-do the two-sample t-procedures using an alternative of “two.sided” to obtain both bounds of the confidence interval.
i. Write the 95% confidence interval for the difference in population means in proper notation.
t.test(student$EXERCISE ~ student$SEX, mu=0, conf.level=0.95,
alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: student$EXERCISE by student$SEX
## t = -0.86896, df = 51.223, p-value = 0.3889
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.59869 11.31900
## sample estimates:
## mean in group Female mean in group Male
## 42.22222 50.86207
Confidence Interval = (-28.60, 11.32) minutes
ii. Interpret this confidence interval in the context of the problem.
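We are 95% confident that the true difference in population mean daily exercise times (female minus male) among OSU students is between -28.60 and 11.32 minutes. Because this interval contains 0, it is consistent with failing to reject the null hypothesis.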
Note: To interpret this problem, you need to understand how R subtracts. In the output, how the subtraction was performed is given. The first group mean listed is Group 1 and the second group mean listed is Group 2, and the subtraction is Group 1 – Group 2. The group assignment should always be alphabetical (or numerically ascending), but it is good to confirm this with the output.
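As a check on the subtraction order, the interval can be rebuilt by hand from the summary statistics as (Group 1 mean – Group 2 mean) ± t* × SE. This is a sketch using the numbers reported in the output; the variable names are chosen here for illustration.

```r
# Summary statistics copied from the earlier output
n_f <- 36; mean_f <- 42.22222; sd_f <- 33.60367   # Group 1 (Female)
n_m <- 29; mean_m <- 50.86207; sd_m <- 44.241     # Group 2 (Male)

# Group 1 - Group 2, matching the order in the t.test() output
diff_means <- mean_f - mean_m

# Un-pooled standard error of the difference
se_diff <- sqrt(sd_f^2 / n_f + sd_m^2 / n_m)

# Degrees of freedom as reported by t.test()
df_welch <- 51.223

# t* multiplier for 95% confidence
t_star <- qt(0.975, df = df_welch)

ci <- diff_means + c(-1, 1) * t_star * se_diff
round(ci, 2)
```

Reversing the subtraction (Male – Female) would flip the signs of both bounds and swap them.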
B. The percentile method from the twomeans macro.
i. Write the 95% confidence interval for the difference in population means (or medians) in correct notation using the percentile method.
exercise2Boot$Confidence_Intervals
## CI_Percent CI_Formula
## 1 -28.17529 -27.49258
## 2 9.80364 10.21289
ii. If the mean is being used, are the bounds from the percentile method the same as from the two-sample t-method? Why or why not?
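They are a little different. The bootstrap relies on random resampling, so its bounds change slightly from run to run, and the two intervals are constructed differently: the t-interval comes from a formula based on the t-distribution, while the percentile interval takes the 2.5th and 97.5th percentiles of the bootstrapped differences.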
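The percentile idea can be sketched in a few lines. This is not the twomeans macro's code: it assumes the percentile interval is simply the middle 95% of the bootstrapped differences, and it uses hypothetical toy data rather than the survey values.

```r
set.seed(352)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data (NOT the STUDENTSURVEY15 values)
group_a <- c(10, 20, 30, 45, 60, 90, 120)
group_b <- c(0, 15, 30, 30, 40, 60, 75, 100)

observed_diff <- mean(group_a) - mean(group_b)

# For an interval, resample the ORIGINAL (unadjusted) groups with replacement
iterations <- 2000
boot_diffs <- replicate(iterations, {
  mean(sample(group_a, replace = TRUE)) - mean(sample(group_b, replace = TRUE))
})

# Percentile method: cut off 2.5% in each tail for 95% confidence
ci_percentile <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_percentile
```

Changing the seed changes the bounds slightly, which is the "randomness" part of the answer above.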
Example 2
Weights of 9 students (in pounds) before and after a month-long training schedule are given below. Is there evidence to indicate the mean weight after training is less than the mean weight before training?
beforeWeight<-c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight<-c(133, 145, 131, 150, 150, 137, 130, 159, 141)
1. Explain why this is an example of a matched-pairs design.
Each student is weighed twice, once before and once after training, so the two measurements are paired by individual.
2. After entering the data into R, use the following command to obtain the differences in weights (after – before exercise):
diffWeight<-afterWeight-beforeWeight
3. State the null and alternative hypotheses in words and in notation. Define any parameters used in the notation.
\( H_0: \mu_d=0 \) \( H_A: \mu_d<0 \)
4. Obtain a graph of the differences. Describe the shape of the differences.
hist(diffWeight, main="Histogram of Differences")
5. Based on the sample size and the graph, which method is most appropriate to use to perform a hypothesis test? Would it be on the mean or the median? (If on the median, change the parameter in your hypotheses in #3 to the median.)
The sample size is very small, so it's hard to tell what the distribution of the differences is (i.e., we can't tell if it's truly normal). There isn't evidence of extreme skew. To protect ourselves, we might want to use bootstrap methods for the mean.
6. Obtain the p-value two different ways:
A. Use the Paired t-methods in R from the t.test() function.
Note: this method is a one-sample t-test on the differences. It will only perform a test on the mean. In order to perform a paired t-test, you can do one of two things:
t.test(afterWeight, beforeWeight, mu=0, paired=TRUE, alternative="less")
##
## Paired t-test
##
## data: afterWeight and beforeWeight
## t = -0.62767, df = 8, p-value = 0.2739
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 3.052964
## sample estimates:
## mean of the differences
## -1.555556
t.test(diffWeight, mu=0, alternative="less")
##
## One Sample t-test
##
## data: diffWeight
## t = -0.62767, df = 8, p-value = 0.2739
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf 3.052964
## sample estimates:
## mean of x
## -1.555556
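Both calls give the same result because a paired t-test is just a one-sample t-test on the differences. As a sketch, the same numbers can be reproduced by hand (the variable names below are chosen for illustration):

```r
beforeWeight <- c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight <- c(133, 145, 131, 150, 150, 137, 130, 159, 141)
diffWeight <- afterWeight - beforeWeight

n <- length(diffWeight)
mean_d <- mean(diffWeight)          # mean of the differences
se_d <- sd(diffWeight) / sqrt(n)    # standard error of the mean difference

# t-statistic for H0: mu_d = 0
t_stat <- (mean_d - 0) / se_d

# One-sided ("less") p-value with n - 1 degrees of freedom
p_less <- pt(t_stat, df = n - 1)

# Upper bound of the one-sided 95% confidence interval
upper_bound <- mean_d + qt(0.95, df = n - 1) * se_d

round(c(t = t_stat, p = p_less, upper = upper_bound), 4)
```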
i. What is the p-value from the paired t-test?
The p-value is 0.2739. So we will fail to reject the null.
B. The “adjusted” bootstrap method
Use the commands in the Week 1 Lab Activity to perform a hypothesis test on the mean or median of the differences (whichever was deemed more appropriate) using the onemean macro.
diffBoot<-onemean(original_sample=diffWeight,
iterations=2000,
ci_only=FALSE,
MEAN=TRUE,
MEDIAN=FALSE,
ci_level=0.95,
null.value=0,
Alt_Hyp=1)
i. Report and interpret the p-value in the context of the problem.
diffBoot$pval
## [1] 0.252
ii. Use the p-value to state a conclusion in the context of the problem.
We fail to reject the null hypothesis that the mean difference in weight (after minus before) is 0. There is no evidence to suggest that the mean weight after training is less than the mean weight before training.
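For intuition, an "adjusted" bootstrap test on the mean of the differences might look like the sketch below. This is not the onemean macro's actual code: the shifting step and the variable names are assumptions, and the p-value will vary a little with the seed.

```r
set.seed(352)  # arbitrary seed so the sketch is reproducible

beforeWeight <- c(124, 152, 129, 144, 150, 152, 138, 161, 140)
afterWeight <- c(133, 145, 131, 150, 150, 137, 130, 159, 141)
diffWeight <- afterWeight - beforeWeight

null_value <- 0
observed_mean <- mean(diffWeight)

# "Adjust" the sample so its mean equals the null value
shifted <- diffWeight - observed_mean + null_value

# Resample the adjusted differences with replacement and record each mean
iterations <- 2000
boot_means <- replicate(iterations, mean(sample(shifted, replace = TRUE)))

# "Less than" alternative: share of bootstrapped means at or below the observed mean
p_less <- mean(boot_means <= observed_mean)
p_less
```

A p-value in the neighborhood of the paired t-test's p-value would be expected here, which leads to the same conclusion.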