KITADA
Lab Activity #1
Objectives:
Part I: Outside Class Activity
Activity 1: (performed outside class on your own – you will use the data you collect in Assignment 2)
This activity will take a couple of hours of your time outside of class. You are to take a trip to two different grocery stores (such as WinCo and Safeway). Before visiting the first store, decide on 15 products for which you’d like to compare prices at the two grocery stores. You will have to be specific about the brand name, product, and “size” of container. (Note: the “size of container” may not be determined until the visit to the first store.) For example, you may want to compare the price of a 12 ounce box of Kellogg’s Frosted Flakes between the two grocery stores you chose. You will collect data on the 15 products at each store, record your data into an Excel Worksheet, save the sheet as a .CSV or .TXT file, and analyze the data in Problem 1 of Assignment 2. Here’s a checklist of what you’ll need to do for this activity:
Decide on the two grocery stores to use in this activity
Decide on the 15 products you want to compare.
At each store, record the price of each product on your list. If you didn’t record the prices in an electronic spreadsheet (such as an Excel worksheet) at each store, do so after you collect all your data. (Note: think about the way the data should be recorded to do the proper analysis of the data in R.)
Use the data to answer the questions in Problem 1 of Assignment 2.
You may work with one other person to collect the data only, but must work on the assignment on your own!!! If you do collect data with someone else, give the name of the person you worked with in Problem 1 of Assignment 2.
Part II: Examples
For each example, only the analysis steps for doing inference will be emphasized in lab. You should still go through the first several steps on your own for each example:
Step 1: Identify the variable of interest and population
Step 2: Assess whether the sample is representative of the population
Step 3: Understand why a particular example is an estimation example or hypothesis test example
Example 1:
The American Heart Association recommends that adults get an average of 30 minutes of moderate-intensity exercise 5 days a week AND participate in moderate to high intensity muscle-strengthening activities at least 2 days per week (http://www.heart.org/HEARTORG/GettingHealthy/PhysicalActivity/StartWalking/American-Heart-Association-Guidelines_UCM_307976_Article.jsp). Assuming the moderate to high intensity muscle-strengthening exercises take an average of 30 minutes per day, let's say that adults should “exercise” at least 30 minutes per day every day. Do college students exercise more than 30 minutes per day, on average? A sample of 34 OSU students from the ST 352 class several years ago was taken. Their daily amount of exercise (in minutes) is recorded in the EXERCISE data set. Import this data set in R.
1. Obtain a histogram, dotplot, or stem-and-leaf display AND a normal probability plot of the sample data. Describe the shape of the sample data.
# HISTOGRAM
hist(daily.exercise,
main="Histogram of Exercise Times",
xlab="Daily Amount of Exercise (Minutes)")
# DOTPLOT
stripchart(daily.exercise, method = "stack", offset = .5, at = .15, pch = 19,
main = "Dotplot of Exercise Times", xlab = "Daily Amount of Exercise (Minutes)")
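The question also asks for a stem-and-leaf display and a normal probability plot, which the code above does not produce. A minimal sketch using base R's stem(), qqnorm(), and qqline() follows. The vector exercise_demo is a made-up, right-skewed stand-in for daily.exercise (it is not the class data) so that the block runs on its own; with the real data, replace exercise_demo with daily.exercise.

```r
# Hypothetical stand-in for daily.exercise (NOT the class data):
# 34 right-skewed values generated only so this block runs on its own.
set.seed(352)
exercise_demo <- round(rexp(34, rate = 1/35))

# STEM-AND-LEAF DISPLAY (printed in the console)
stem(exercise_demo)

# NORMAL PROBABILITY PLOT
qq <- qqnorm(exercise_demo,
             main = "Normal Probability Plot of Exercise Times",
             ylab = "Daily Amount of Exercise (Minutes)")
qqline(exercise_demo)   # reference line through the first and third quartiles
```

For right-skewed data, the points in the normal probability plot typically bend upward, away from the reference line, at the right-hand end.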
2. Based on the shape of the sample data, would an inference on the mean or median be more appropriate? Explain. Therefore, can the t-methods be used?
Based on the histogram and dot plot, it appears that the sample data is skewed right. This suggests that inference on the median might be more appropriate.
However, there are 34 data points, and t-procedures are fairly robust to moderate skewness when the sample size is this large (though not to extreme outliers), so inference on the mean using t-methods (t-test and t confidence interval) is also reasonable.
3. Using an appropriate parameter, state the null and alternative hypotheses in words and statistical notation. Define the parameter used in the context of the problem.
For Mean:
\( H_0 : \mu=30 \) (College students exercise 30 minutes a day, on average)
\( H_A : \mu>30 \) (College students exercise more than 30 minutes a day, on average)
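For Median:
\( H_0 : m=30 \) (The median amount of daily exercise among college students is 30 minutes)
\( H_A : m>30 \) (The median amount of daily exercise among college students is more than 30 minutes)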
4. Regardless of your answer to #2, both the bootstrap method and the one-sample t-method will be shown to obtain the p-value.
A. Bootstrap method using the onemean macro
- Step 1: Find the onemean macro and copy the code directly into your R script. Highlight and run all the source code for the macro.
- Step 2: Download the dataset from Canvas
How the onemean macro works
The onemean macro creates an "adjusted" bootstrap distribution of sample means (or medians). (The number of "adjusted" bootstrap sample means or medians to generate is determined by you, but it is suggested to use at least 2000 – more on this below.) These sample means (or medians) are calculated from the several thousand bootstrap samples taken from the sample data after the sample data have been "shifted" so that the mean (or median) of the sample data is equal to the hypothesized value of the population mean (or median). This creates a distribution centered at the value in the null hypothesis (i.e. a distribution for which the null hypothesis is true, which is part of the definition of a p-value). (Recall that a bootstrap sample is a random sample, drawn with replacement from the original sample, of the same size as the original sample.)
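The shifting idea described above can be sketched in a few lines of base R. This is not the source code of the onemean macro; it is only an illustration of the technique under stated assumptions. The vector exercise_demo is made-up data standing in for daily.exercise, and the names shifted, boot_means, and p_value are hypothetical.

```r
# Hypothetical stand-in for daily.exercise (NOT the class data).
set.seed(352)
exercise_demo <- round(rexp(34, rate = 1/35))

null_value <- 30      # hypothesized mean from the null hypothesis
iterations <- 2000    # number of bootstrap samples to draw

# Shift the sample so that its mean equals the hypothesized value.
shifted <- exercise_demo - mean(exercise_demo) + null_value

# Draw bootstrap samples (same size, with replacement) from the shifted
# data and keep the mean of each one.
boot_means <- replicate(iterations,
                        mean(sample(shifted, size = length(shifted),
                                    replace = TRUE)))

# One-sided ("greater than") p-value: the proportion of adjusted bootstrap
# means that are at least as large as the observed sample mean.
p_value <- mean(boot_means >= mean(exercise_demo))
p_value
```

Because the bootstrap samples are random, the p-value changes slightly from run to run unless a seed is set.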
The arguments for the onemean macro are as follows:
iterations is the number of bootstrap sample means (or medians) to be generated
Example: iterations=2000
Fill in the blanks with the arguments for this example:
original_sample= daily.exercise
iterations= 2000
- Step 3: There are a few other arguments that should be specified in the macro:
- Step 4 (Optional): There are two additional arguments in the macro that change the output shown in the console. These are both TRUE by default but may be set to FALSE if you desire.
histogram: if TRUE, a histogram is given of the bootstrapped means (or medians).
Fill in the blanks below for the values to use for each prompted question:
ci_only= FALSE
MEAN= TRUE
MEDIAN= FALSE
ci_level= 0.95
null.value= 30
Alt_Hyp= 2
The full syntax for our example running the onemean macro to do an α=0.05 hypothesis test on the mean parameter would be:
exerciseBoot<-onemean(original_sample=daily.exercise,
iterations=2000,
ci_only=FALSE,
MEAN=TRUE,
MEDIAN=FALSE,
ci_level=0.95,
null.value=30,
Alt_Hyp=2)
After running the macro, the “adjusted” bootstrap distribution (i.e. several thousand sample means, or medians) will be displayed in the console of R, along with the estimated standard deviation of the distribution, the confidence intervals, and a p-value if one was requested. The original sample summary statistics are also shown if they were requested (by default they are).
- Step 5: Interpret the output:
i. What is the p-value and what does that mean in the context of this adjusted bootstrap distribution?
# WHAT IS THE BOOTSTRAP P-VALUE?
exerciseBoot$pval
## [1] 0.088
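This means that 8.8% of the adjusted bootstrap sample means (which are centered at the null value of 30) are at least as large as the observed sample mean of 35.89 minutes.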
ii. What is the value of the standard deviation of the bootstrap sample means (or medians)? Why is this value called the “estimate of the standard deviation of the distribution of sample means (or medians)”?
# WHAT IS THE BOOTSTRAP STANDARD DEVIATION?
exerciseBoot$BootSD
## [1] 4.360147
This is the standard deviation of the distribution of bootstrapped means. It is called the “estimate of the standard deviation of the distribution of sample means” because it approximates how much sample means would vary over repeated sampling; it is close to \( \frac{s}{\sqrt{n}} \), the formula-based estimate of the theoretical standard deviation of the distribution of sample means.
B. One-sample t-method. (Note: the one-sample t-method can only be done on the mean)
In R, the t.test() function runs one-sample, two-sample and paired t-tests. We will discuss two-sample and paired t-tests in Lab Activity 2. The R Help File for the t.test() function can be opened by running ?t.test in the console (a local address such as http://127.0.0.1:14511/library/stats/html/t.test.html only works inside your own running R session).
For a one-sample t-test, the syntax to run the test is: t.test(x=, alternative=, mu=)
x is the variable of interest (same as “original_sample” above)
alternative is the direction of the alternative hypothesis: "two.sided", "greater" or "less" (Note: you must include the quotation marks on your input)
mu is the value of the mean under the null hypothesis.
For our example, the full syntax to perform the one-sided “greater than” one-sample t-test would be:
# PERFORM A ONE SIDED HYPOTHESIS TEST
t.test(daily.exercise, mu=30, alternative="greater")
##
## One Sample t-test
##
## data: daily.exercise
## t = 1.3187, df = 33, p-value = 0.09817
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
## 28.33225 Inf
## sample estimates:
## mean of x
## 35.88656
# CREATE A CONFIDENCE INTERVAL
t.test(daily.exercise, mu=30, alternative="two.sided")
##
## One Sample t-test
##
## data: daily.exercise
## t = 1.3187, df = 33, p-value = 0.1963
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
## 26.80495 44.96817
## sample estimates:
## mean of x
## 35.88656
i. What is the value of the standard error of the sample means? How was this calculated?
# WHAT IS THE STANDARD ERROR OF THE SAMPLE MEANS?
s<-sd(daily.exercise)
s
## [1] 26.02804
n<-length(daily.exercise)
n
## [1] 34
stdErr<-s/sqrt(n)
stdErr
## [1] 4.463773
ii. If the mean was chosen as the parameter to use for the onemean macro, compare the standard error of the sample means from the t-methods to the standard deviation of the bootstrap sample means found above. Are they the same? If they're not exactly the same, why not?
Bootstrap estimated standard error of means
exerciseBoot$BootSD
## [1] 4.360147
T-method standard error
stdErr
## [1] 4.463773
They are not exactly the same due to the randomness of the bootstrap but they are pretty close.
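To see the run-to-run randomness directly, the sketch below computes the bootstrap standard deviation twice with different seeds and sets both next to \( \frac{s}{\sqrt{n}} \). It uses made-up data (exercise_demo, not the class data) and a hypothetical helper, boot_sd(), rather than the onemean macro.

```r
# Hypothetical stand-in for daily.exercise (NOT the class data).
set.seed(352)
exercise_demo <- round(rexp(34, rate = 1/35))

n <- length(exercise_demo)
se_formula <- sd(exercise_demo) / sqrt(n)   # t-method standard error

# Hypothetical helper: SD of the bootstrap sample means
boot_sd <- function(x, iterations = 2000) {
  sd(replicate(iterations, mean(sample(x, replace = TRUE))))
}

# Same data, two different random seeds
set.seed(1)
sd_run1 <- boot_sd(exercise_demo)
set.seed(2)
sd_run2 <- boot_sd(exercise_demo)

c(formula = se_formula, run1 = sd_run1, run2 = sd_run2)
```

The two bootstrap values differ slightly from each other and from the formula value, but all three should be of similar size.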
iii. Give the t-statistic with degrees of freedom
The t-statistic is t = 1.3187 with 33 degrees of freedom.
iv. Give the p-value
The p-value is 0.09817
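As a check, the t-statistic and p-value can be reproduced by hand from the values reported above, using \( t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \) and the upper-tail area of the t distribution. The numbers below are copied from the output shown earlier; the names mu_0, t_stat, and p_val are chosen here for illustration.

```r
# Values copied from the output above
x_bar  <- 35.88656   # sample mean
mu_0   <- 30         # value of the mean under the null hypothesis
stdErr <- 4.463773   # s / sqrt(n)
df     <- 33         # n - 1

# t-statistic: how many standard errors the sample mean is above 30
t_stat <- (x_bar - mu_0) / stdErr
t_stat

# One-sided ("greater") p-value: area to the right of t_stat
p_val <- pt(t_stat, df = df, lower.tail = FALSE)
p_val
```

Both values should agree, up to rounding, with the t = 1.3187 and p-value = 0.09817 reported by t.test().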
5. Based on your answer to #2, which p-value will you use to make a conclusion: the p-value from the one-sample t-test or the p-value from the onemean macro?
Due to the skew, I favor the bootstrap p-value. However, the bootstrap and t-method p-values are very similar.
6. Using that p-value, state a conclusion in the context of the problem.
Since the p-value is greater than \( \alpha=0.05 \), we fail to reject the null hypothesis. There is no convincing evidence that the average amount of daily exercise of college students is greater than 30 minutes.
7. Constructing a confidence interval for the mean (or median) number of minutes per day of exercise among OSU students.
A. The bootstrap methods using the onemean macro.
The output from the onemean macro gives a confidence interval using the percentile method.
i. Write the 95% confidence interval for the mean (or median) in correct notation using the percentile method.
# WHAT ARE THE TWO CONFIDENCE INTERVALS?
# WHICH METHOD IS BEST?
exerciseBoot$Confidence_Intervals
## CI_Percent CI_Formula
## 1 27.64412 27.34083
## 2 44.79506 44.43229
ii. Explain how the bounds are determined from the percentile method.
The percentile method gives the middle 95% of the distribution of bootstrapped sample means (or medians): the lower bound is the 2.5th percentile and the upper bound is the 97.5th percentile.
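A minimal sketch of the percentile method in base R is shown below. It uses made-up data (exercise_demo, not the class data) and assumes the interval is taken from bootstrap means of the unshifted sample; whether the onemean macro computes CI_Percent in exactly this way is not confirmed here.

```r
# Hypothetical stand-in for daily.exercise (NOT the class data).
set.seed(352)
exercise_demo <- round(rexp(34, rate = 1/35))

# Bootstrap distribution of the sample mean (no shifting for an interval)
boot_means <- replicate(2000, mean(sample(exercise_demo, replace = TRUE)))

# 95% percentile interval: cut off 2.5% of the bootstrap means in each tail
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))
ci_percentile
```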
iii. Interpret the confidence interval in the context of the problem.
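We are 95% confident that the mean amount of daily exercise among OSU students is between 27.64 and 44.80 minutes. (The bounds are the values that capture the middle 95% of the bootstrap sample means.)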
iv. If the mean was used as the parameter, use the standard deviation of the bootstrap sample means and the correct t* critical value to construct a 95% confidence interval for the mean by hand. Compare this to the “CI_Formula” output from the onemean macro.
# T CONFIDENCE INTERVAL BY HAND
x_bar<-mean(daily.exercise)
df<-n-1
criticalVal<-qt(0.975, df)
criticalVal
## [1] 2.034515
CI<-x_bar+c(-1,1)*criticalVal*exerciseBoot$BootSD
CI
## [1] 27.01577 44.75734
Compare to
# WHAT ARE THE TWO CONFIDENCE INTERVALS?
# WHICH METHOD IS BEST?
exerciseBoot$Confidence_Intervals
## CI_Percent CI_Formula
## 1 27.64412 27.34083
## 2 44.79506 44.43229
B. The one-sample t-methods
(Once again, the t-methods can only be used on the mean. See example code above to get this output)
i. Write the 95% confidence interval for the mean in proper notation.
\( \bar{x}\pm t^{*}_{df} \times \frac{s}{\sqrt{n}} \)
# T CONFIDENCE INTERVAL BY HAND
x_bar<-mean(daily.exercise)
df<-n-1
criticalVal<-qt(0.975, df)
criticalVal
## [1] 2.034515
CI_T<-x_bar+c(-1,1)*criticalVal*stdErr
CI_T
## [1] 26.80495 44.96817
ii. If the mean was used as the parameter in part (a), compare this confidence interval from the t-methods to the confidence intervals constructed in parts i and iv in part (a) above. Why are the bounds different or the same?
Compare
# T CONFIDENCE INTERVAL BY HAND
CI_T
## [1] 26.80495 44.96817
versus
# BOOTSTRAP CONFIDENCE INTERVALS
exerciseBoot$Confidence_Intervals
## CI_Percent CI_Formula
## 1 27.64412 27.34083
## 2 44.79506 44.43229
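They are different because the bootstrap intervals come from a random resampling process (so they change slightly from run to run), whereas the t-method interval is computed from \( \frac{s}{\sqrt{n}} \) and a t critical value, which rely on normal theory.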
Example 2
A random sample of 20 OSU students was taken. Each was asked the number of hours of TV they watch, on average, each day. The data are in the TV data set on Canvas and in the ST 352 class folder. Use these data to estimate the “average” number of hours of TV watched per day among OSU students by constructing an appropriate confidence interval.