KITADA

Lab Activity #1

Inference for a Single Quantitative Variable

Objectives:

Use R to perform a hypothesis test and construct a confidence interval for a mean using the one-sample t-methods
Use the onemean macro in R to obtain a p-value from a hypothesis test and the bounds of a confidence interval for a mean or median
Interpret R output from the analyses listed above
Explain when to use the t-methods and when to use the bootstrap methods
Explain when to perform inference on the median instead of the mean

Part I: Outside Class Activity

Activity 1: (performed outside class on your own – you will use the data you collect in Assignment 2)

This activity will take a couple of hours of your time outside of class. You are to take a trip to two different grocery stores (such as WinCo and Safeway). Before visiting the first store, decide on 15 products for which you’d like to compare prices at the two grocery stores. You will have to be specific about the brand name, product, and “size” of container. (Note: the “size of container” may not be determined until the visit to the first store.) For example, you may want to compare the price of a 12 ounce box of Kellogg’s Frosted Flakes between the two grocery stores you chose. You will collect data on the 15 products at each store, record your data into an Excel Worksheet, save the sheet as a .CSV or .TXT file, and analyze the data in Problem 1 of Assignment 2. Here’s a checklist of what you’ll need to do for this activity:

Decide on the two grocery stores to use in this activity
Decide on the 15 products you want to compare.
- The brand name, product, and size have to be exactly the same at each store. Therefore, do not compare generic brands as they have different names at different stores.
- You may have to wait until your visit to the first store to determine the “size” as you may not be aware of the different size packages for different products.
- Try to come up with a variety of products to get a good representation of products at the stores. (That is, don’t choose all different brands and sizes of cereals to compare!)
At each store, record the price of each product on your list. If you didn’t record the prices in an electronic spreadsheet (such as an Excel worksheet) at each store, do so after you collect all your data. (Note: think about the way the data should be recorded to do the proper analysis of the data in R.)
Use the data to answer the questions in Problem 1 of Assignment 2.

You may work with one other person to collect the data only, but must work on the assignment on your own!!! If you do collect data with someone else, give the name of the person you worked with in Problem 1 of Assignment 2.

Part II: Examples

For each example, only the analysis steps for doing inference will be emphasized in lab. You should still go through the first several steps on your own for each example:

Step 1: Identify the variable of interest and population

Step 2: Assess whether the sample is representative of the population

Step 3: Understand why a particular example is an estimation example or hypothesis test example

Example 1:

The American Heart Association recommends that adults get an average of 30 minutes of moderate-intensity exercise 5 days a week AND participate in moderate to high intensity muscle-strengthening activities at least 2 days per week (http://www.heart.org/HEARTORG/GettingHealthy/PhysicalActivity/StartWalking/American-Heart-Association-Guidelines_UCM_307976_Article.jsp). Assuming the moderate to high intensity muscle-strengthening exercises take an average of 30 minutes per day, let's say that adults should “exercise” at least 30 minutes per day every day. Do college students exercise more than 30 minutes per day, on average? A sample of 34 OSU students from the ST 352 class several years ago was taken. Their daily amount of exercise (in minutes) is recorded in the EXERCISE data set. Import this data set in R.
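As a minimal sketch of the import step, the code below writes a tiny made-up CSV to a temporary file and reads it back with read.csv(). The temporary file, the column name daily.exercise, and the five values are assumptions for illustration only; with the real EXERCISE file you would pass its path to read.csv() instead.

```r
# Hypothetical stand-in for the EXERCISE file: five made-up values, not the class data
csv_path <- tempfile(fileext = ".csv")
writeLines(c("daily.exercise", "20", "35", "10", "60", "45"), csv_path)

# Read the file into a data frame; header = TRUE treats the first row as column names
exercise <- read.csv(csv_path, header = TRUE)

# Pull the column out as a vector so later code can refer to it by name
daily.exercise <- exercise$daily.exercise
str(daily.exercise)
```

Storing the column as its own vector is one way to make calls such as hist(daily.exercise) work without referring to the data frame each time.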

1. Obtain a histogram, dotplot, or stem-and-leaf display AND a normal probability plot of the sample data. Describe the shape of the sample data.

# HISTOGRAM
hist(daily.exercise, 
     main="Histogram of Exercise Times",
     xlab="Daily Amount of Exercise (Minutes)")

[Figure: Histogram of Exercise Times]

# DOTPLOT
stripchart(daily.exercise, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Exercise Times", xlab = "Daily Amount of Exercise (Minutes)")

[Figure: Dotplot of Exercise Times]
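Question 1 also asks for a normal probability plot. One possible way to draw it uses the base R functions qqnorm() and qqline(). The sketch below runs on a made-up right-skewed vector (example_minutes is a hypothetical name, and the values are not the EXERCISE data); with the real data you would pass daily.exercise instead.

```r
# Made-up right-skewed values for illustration only (not the EXERCISE data)
example_minutes <- c(5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 45, 60, 75, 90, 120)

# Normal probability (Q-Q) plot: points that bend away from the
# reference line suggest the data are not normally distributed
qq <- qqnorm(example_minutes,
             main = "Normal Probability Plot of Exercise Times",
             ylab = "Daily Amount of Exercise (Minutes)")

# Reference line through the first and third quartiles
qqline(example_minutes)
```

For right-skewed data, the points in the upper right of the plot typically curve above the reference line.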

2. Based on the shape of the sample data, would an inference on the mean or median be more appropriate? Explain. Therefore, can the t-methods be used?

Based on the histogram and dotplot, the sample data appear to be skewed right. This suggests that inference on the median might be more appropriate.

However, there are 34 data points, and t-procedures are reasonably robust to moderate skewness at this sample size (though not to extreme outliers), so inference on the mean using the t-methods (t-test and t confidence interval) is also reasonable.

3. Using an appropriate parameter, state the null and alternative hypotheses in words and statistical notation. Define the parameter used in the context of the problem.

For Mean:

\( H_0 : \mu=30 \) (College students exercise 30 minutes a day, on average)

\( H_A : \mu>30 \) (College students exercise more than 30 minutes a day, on average)

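For Median:

\( H_0 : m=30 \) (The median daily exercise time of college students is 30 minutes)

\( H_A : m>30 \) (The median daily exercise time of college students is more than 30 minutes)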

4. Regardless of your answer to #2, both the bootstrap method and the one-sample t-method will be shown to obtain the p-value.

A. Bootstrap method using the onemean macro

- Step 1: Find the onemean macro and copy the code directly into your R script. Highlight and run all the source code for the macro.

- Step 2: Download the dataset from Canvas

How the onemean macro works

    The onemean macro creates an "adjusted" bootstrap distribution of sample means (or medians). You choose how many bootstrap sample means (or medians) to generate; at least 2000 is suggested (more on this below). Before resampling, the sample data are "shifted" so that their mean (or median) equals the hypothesized value of the population mean (or median). Each bootstrap sample is then drawn from the shifted data, and its mean (or median) is recorded. This creates a distribution centered at the value in the null hypothesis (i.e. a distribution for which the null hypothesis is true, which is part of the definition of a p-value). (Recall that a bootstrap sample is a random sample drawn with replacement from the original sample, with the same size as the original sample.)
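The shifting idea described above can be sketched in a few lines of base R. This is not the source code of the onemean macro; it is a simplified illustration for the mean with a "greater than" alternative, and the sample values, seed, and variable names are all made up for the example.

```r
set.seed(352)  # arbitrary seed so the sketch is reproducible

# Made-up sample for illustration only (not the EXERCISE data)
x <- c(5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 45, 60, 75, 90, 120)
null_value <- 30      # hypothesized mean
iterations <- 2000    # number of bootstrap samples

# Shift the data so that its mean equals the hypothesized value
shifted <- x - mean(x) + null_value

# Resample the shifted data with replacement and record each bootstrap mean
boot_means <- replicate(iterations,
                        mean(sample(shifted, size = length(shifted), replace = TRUE)))

# One-sided (greater than) p-value: the share of bootstrap means
# that are at least as large as the observed sample mean
p_value <- mean(boot_means >= mean(x))
p_value
```

Because the shifted data have mean 30, the bootstrap means pile up around 30, and the p-value measures how unusual the observed sample mean would be if the null hypothesis were true.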

The arguments for the onemean macro are as follows:

original_sample is the column in the data set that contains the sample data
iterations is the number of bootstrap sample means (or medians) to be generated
- Example: iterations=2000
  
  Fill in the blanks with the arguments for this example:
```
original_sample= daily.exercise     

iterations= 2000
```

- Step 3: There are a few other arguments that should be specified in the macro:

ci_only: A TRUE/FALSE value for whether the problem is an estimation problem only or a hypothesis test problem
- ci_only=TRUE (only a confidence interval for the mean or median is desired)
- ci_only=FALSE (both a p-value and confidence interval will be given)
MEAN or MEDIAN: A code for the parameter to be used
- MEAN=TRUE (indicating a hypothesis test will be performed on the mean and a confidence interval for the mean will be constructed)
- MEDIAN=TRUE (indicating a hypothesis test will be performed on the median and a confidence interval for the median will be constructed)
- NOTE: By default, both of these are FALSE, so you only need to set one of them to TRUE. If you set neither, the macro will stop and give you a message to pick one. The macro cannot run a bootstrap for both the mean and the median at the same time; if you want both, run the macro twice, indicating on each run which parameter to use.
ci_level: A decimal value between 0 and 1 (not including 0 or 1) for the level of confidence desired for the confidence interval
- Example: ci_level=0.95
null.value: If a hypothesis test is to be performed, a value for the mean or median in the null hypothesis (i.e. the “hypothesized value”)
- Example: null.value=10
Alt_Hyp: If a hypothesis test is to be performed, a code for the alternative hypothesis
- Type 1 if HA: parameter < hypothesized value
- Type 2 if HA: parameter > hypothesized value
- Type 3 if HA: parameter ≠ hypothesized value
- Step 4 (Optional): There are two additional arguments in the macro that change the output shown in the console. Both are TRUE by default but may be set to FALSE if you desire.

Summary_Stats: If TRUE, this displays summary statistics for the original sample.

histogram: if TRUE, a histogram is given of the bootstrapped means (or medians).

    Fill in the blanks below for the values to use for each prompted question:

        ci_only= FALSE          

        MEAN= TRUE          

        MEDIAN= FALSE

        ci_level= 0.95          

        null.value= 30      

        Alt_Hyp= 2          

The full syntax for our example running the onemean macro to do an α=0.05 hypothesis test on the mean parameter would be:

exerciseBoot<-onemean(original_sample=daily.exercise,
                      iterations=2000, 
                      ci_only=FALSE,
                      MEAN=TRUE, 
                      MEDIAN=FALSE, 
                      ci_level=0.95, 
                      null.value=30,
                      Alt_Hyp=2)

[Figure: histogram of the bootstrap sample means]

After running the macro, the “adjusted” bootstrap distribution (i.e. several thousand sample means or medians) will be displayed in the R console, along with the estimated standard deviation of the distribution, the confidence intervals, and a p-value if one was requested. The original sample summary statistics are also shown if they were requested (by default they are).

- Step 5: Interpret the output:

i. What is the p-value and what does that mean in the context of this adjusted bootstrap distribution?

# WHAT IS THE BOOTSTRAP P-VALUE? 
exerciseBoot$pval

## [1] 0.088

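This means that 8.8% of the adjusted bootstrap sample means (which are centered at the hypothesized value of 30) are greater than or equal to the observed sample mean of 35.89 minutes.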

ii. What is the value of the standard deviation of the bootstrap sample means (or medians)? Why is this value called the “estimate of the standard deviation of the distribution of sample means (or medians)”?

# WHAT IS THE BOOTSTRAP STANDARD DEVIATION?
exerciseBoot$BootSD

## [1] 4.360147

This is the standard deviation of the distribution of bootstrapped means. It is called the “estimate of the standard deviation of the distribution of sample means” because it approximates the true standard deviation of the distribution of sample means, \( \frac{\sigma}{\sqrt{n}} \), which is unknown. The t-methods estimate the same quantity with \( \frac{s}{\sqrt{n}} \).

B. One-sample t-method. (Note: the one-sample t-method can only be done on the mean)

In your R script, the t.test() function will run one-sample, two-sample and paired t-tests. We will discuss two-sample and paired t-tests in Lab Activity 2. The R Help File for the t.test() function can be opened by running ?t.test in the R console.

For a one-sample t-test, the syntax to run the test is: t.test(x=, alternative=, mu=)

x is the variable of interest (same as “original_sample” above)
alternative is the direction of the alternative hypothesis: "two.sided", "greater", or "less" (Note: you must include the quotation marks on your input)
mu is the value of the mean in the null hypothesis

To obtain a closed confidence interval (bounded confidence interval), specify the alternative as "two.sided".
The default confidence level for t.test() is 95%, but you can change this with the conf.level argument using a decimal value.
- Example: t.test(x=?, alternative="two.sided", mu=?, conf.level=0.90) The above code will give you a 90% confidence interval for the mean AND perform a two-sided t-test
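The result of t.test() can also be saved to an object so that individual pieces of the output are available by name. The sketch below uses a made-up sample (not the EXERCISE data) and a hypothetical object name, result; the components statistic, parameter, p.value, and conf.int are standard parts of the object that t.test() returns.

```r
# Made-up sample for illustration only (not the EXERCISE data)
x <- c(5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 45, 60, 75, 90, 120)

# Two-sided test of mu = 30 with a 90% confidence interval
result <- t.test(x = x, alternative = "two.sided", mu = 30, conf.level = 0.90)

result$statistic   # t-statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # lower and upper bounds of the confidence interval
```

Pulling values out this way can be handy when you want to reuse a bound or a p-value in later calculations instead of retyping it from the console.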

For our example, the full syntax to perform the one-sided “greater than” one-sample t-test would be:

# PERFORM A ONE SIDED HYPOTHESIS TEST
t.test(daily.exercise, mu=30, alternative="greater")

## 
##  One Sample t-test
## 
## data:  daily.exercise
## t = 1.3187, df = 33, p-value = 0.09817
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  28.33225      Inf
## sample estimates:
## mean of x 
##  35.88656

# CREATE A CONFIDENCE INTERVAL
t.test(daily.exercise, mu=30, alternative="two.sided")

## 
##  One Sample t-test
## 
## data:  daily.exercise
## t = 1.3187, df = 33, p-value = 0.1963
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  26.80495 44.96817
## sample estimates:
## mean of x 
##  35.88656

i. What is the value of the standard error of the sample means? How was this calculated?

# WHAT IS THE STANDARD ERROR OF THE SAMPLE MEANS?
s<-sd(daily.exercise)
s

## [1] 26.02804

n<-length(daily.exercise)
n

## [1] 34

stdErr<-s/sqrt(n)
stdErr

## [1] 4.463773

ii. If the mean was chosen as the parameter to use for the onemean macro, compare the standard error of the sample means from the t-methods to the standard deviation of the bootstrap sample means found above. Are they the same? If they're not exactly the same, why not?

Bootstrap estimated standard error of means

exerciseBoot$BootSD

## [1] 4.360147

T-method standard error

stdErr

## [1] 4.463773

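They are not exactly the same, but they are close. The bootstrap value depends on random resampling, so it changes slightly from run to run. In addition, resampling from the sample itself measures spread with a divisor of n rather than n - 1, so the bootstrap standard deviation tends to be slightly smaller than \( \frac{s}{\sqrt{n}} \).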

iii. Give the t-statistic with degrees of freedom

The t-statistic is t = 1.3187 with 33 degrees of freedom.

iv. Give the p-value

The p-value is 0.09817
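As a check on where these numbers come from, the t-statistic and one-sided p-value can be reproduced by hand from the summary values reported in the R output earlier in this example. The variable names below (x_bar, mu_0, t_stat, p_value) are chosen for illustration.

```r
# Summary values copied from the R output earlier in this example
x_bar <- 35.88656   # sample mean
s     <- 26.02804   # sample standard deviation
n     <- 34         # sample size
mu_0  <- 30         # hypothesized mean

# t-statistic: how many standard errors the sample mean
# lies above the hypothesized mean
t_stat <- (x_bar - mu_0) / (s / sqrt(n))

# One-sided (greater than) p-value: upper-tail area of the
# t distribution with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

round(t_stat, 4)   # 1.3187, matching the t.test() output
p_value
```

The p-value should agree with the 0.09817 reported by t.test(), up to rounding of the summary values.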

5. Based on your answer to #2, which p-value will you use to make a conclusion: the p-value from the one-sample t-test or the p-value from the onemean macro?

Due to the skew, I favor the bootstrap p-value. However, the bootstrap and t-method p-values are very similar.

6. Using that p-value, state a conclusion in the context of the problem.

Since the p-value is greater than \( \alpha=0.05 \), we fail to reject the null hypothesis. There is insufficient evidence to suggest that college students exercise more than 30 minutes per day, on average.

7. Constructing a confidence interval for the mean (or median) number of minutes per day of exercise among OSU students.

A. The bootstrap methods using the onemean macro.

The output from the onemean macro gives two confidence intervals: one using the percentile method (CI_Percent) and one using a formula (CI_Formula).

i. Write the 95% confidence interval for the mean (or median) in correct notation using the percentile method.

# WHAT ARE THE TWO CONFIDENCE INTERVALS?
# WHICH METHOD IS BEST?
exerciseBoot$Confidence_Intervals

##   CI_Percent CI_Formula
## 1   27.64412   27.34083
## 2   44.79506   44.43229

Using the percentile method, the 95% confidence interval for the mean is (27.64, 44.80) minutes.

ii. Explain how the bounds are determined from the percentile method.

The percentile method gives the middle 95% of the distribution of bootstrapped sample means (or medians).
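The percentile method can be sketched with the base R function quantile(). The code below is an illustration on a made-up sample, not the onemean macro itself; it assumes the bootstrap samples for the interval are drawn from the original (unshifted) data, and the seed and variable names are arbitrary.

```r
set.seed(352)  # arbitrary seed so the sketch is reproducible

# Made-up sample for illustration only (not the EXERCISE data)
x <- c(5, 10, 10, 15, 20, 20, 25, 30, 30, 40, 45, 60, 75, 90, 120)

# Bootstrap distribution of the sample mean (resampling the unshifted data)
boot_means <- replicate(2000, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile method: the 2.5th and 97.5th percentiles of the
# bootstrap means bound the middle 95% of the distribution
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))
ci_percentile
```

For a different confidence level, the probabilities change accordingly, for example c(0.05, 0.95) for a 90% interval.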

iii. Interpret the confidence interval in the context of the problem.

We are 95% confident that the mean daily amount of exercise among OSU students is between 27.64 and 44.80 minutes.

iv. If the mean was used as the parameter, use the standard deviation of the bootstrap sample means and the correct t* critical value to construct a 95% confidence interval for the mean by hand. Compare this to the “CI_Formula” output from the onemean macro.

# T CONFIDENCE INTERVAL BY HAND
x_bar<-mean(daily.exercise)
df<-n-1
criticalVal<-qt(0.975, df)
criticalVal

## [1] 2.034515

CI<-x_bar+c(-1,1)*criticalVal*exerciseBoot$BootSD
CI

## [1] 27.01577 44.75734

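Compare to the macro's output. The CI_Formula bounds are slightly narrower than the by-hand interval: their half-width (about 8.55) is roughly 1.96 times the bootstrap standard deviation, which suggests the macro uses the normal critical value z* rather than t*.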

# WHAT ARE THE TWO CONFIDENCE INTERVALS?
# WHICH METHOD IS BEST?
exerciseBoot$Confidence_Intervals

##   CI_Percent CI_Formula
## 1   27.64412   27.34083
## 2   44.79506   44.43229

B. The one-sample t-methods

(Once again, the t-methods can only be used on the mean. See example code above to get this output)

i. Write the 95% confidence interval for the mean in proper notation.

\( \bar{x}\pm t^{*}_{df} \times \frac{s}{\sqrt{n}} \)

# T CONFIDENCE INTERVAL BY HAND
x_bar<-mean(daily.exercise)
df<-n-1
criticalVal<-qt(0.975, df)
criticalVal

## [1] 2.034515

CI_T<-x_bar+c(-1,1)*criticalVal*stdErr
CI_T

## [1] 26.80495 44.96817

ii. If the mean was used as the parameter in part (a), compare this confidence interval from the t-methods to the confidence intervals constructed in parts i and iv in part (a) above. Why are the bounds different or the same?

Compare

# T CONFIDENCE INTERVAL BY HAND
CI_T

## [1] 26.80495 44.96817

versus

# BOOTSTRAP CONFIDENCE INTERVALS
exerciseBoot$Confidence_Intervals

##   CI_Percent CI_Formula
## 1   27.64412   27.34083
## 2   44.79506   44.43229

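The bounds differ because the bootstrap intervals come from random resampling of the observed data, while the t interval relies on normal theory (the t distribution and the standard error \( \frac{s}{\sqrt{n}} \)). Even so, the intervals are similar.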

Example 2:

A random sample of 20 OSU students was taken. Each was asked the number of hours of TV they watch, on average, each day. The data are in the TV data set on Canvas and in the ST 352 class folder. Use these data to estimate the “average” number of hours of TV watched per day among OSU students by constructing an appropriate confidence interval.