KITADA

Lesson #1

Inference for One-Sample problems for means or medians

Motivation:

There are many different methods to use when a problem involves making an inference. Which method is appropriate to use depends on the type of variable of interest (quantitative or categorical) and the number of populations. In this lesson, we’ll focus on appropriate inference methods when a sample is taken from one population and the variable of interest is quantitative. Depending on the number of cases in the sample AND the shape of the sample data, the t-methods and/or the bootstrap methods can be used. In addition, the shape of the sample data can aid in making a decision of whether the mean or the median is the more appropriate parameter to use in making an inference to a population of interest. In this lesson, we’ll learn which method and parameter are more appropriate to use in different situations and how to perform each method.

What you need to know from this lesson:

After completing this lesson, you should be able to

Recognize when a problem involves one-sample inference for a quantitative variable of interest
Distinguish between the t-methods and bootstrap methods for analyzing data from a sample involving a quantitative variable
- use the t-methods to construct a confidence interval for a population mean and perform a hypothesis test about a population mean
- use the bootstrap methods to construct a confidence interval for a population mean and perform a hypothesis test about a population mean
- explain when it is appropriate to use the t-methods and when it is appropriate to use the bootstrap methods.
- explain when it is more appropriate to perform an inference on the median rather than the mean.

*To accomplish the above “What You Need to Know”, do the following:*

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Sections 6.4 (pages 373 – 376), 6.5, and 6.6 in the text
3. Do the Lesson 1 questions at the end of the lesson notes

The Lesson

I. Steps for inference

Step 1: Identify the variable of interest and the population

This step will help determine (or at least narrow down the choices for) the correct inference method to use

Step 2: Assess if the sample is representative of the population

An inference to a population or populations of interest will only be valid if the sample is representative of the population(s). If a sample is not representative of a population, caution should be used in using any inference method to make an inference to the population of interest. Taking a random sample goes a long way towards guaranteeing a sample is representative of a population (but still may not fully guarantee a representative sample!), and a non-random sample could still be representative of a population.

Note: if an experiment is performed, the experimental units may not have been from a sample. To make a valid inference in an experiment, experimental units must be randomly assigned to the different experimental groups.

Step 3: Determine if this is a hypothesis test or estimation problem

“Estimating” involves only constructing a confidence interval for a population parameter. If the question of interest does not hypothesize or make a claim about the value of a population parameter, but rather “wonders” what the value of the population parameter is, then the problem is strictly an estimation problem.

Note: if a problem is a hypothesis testing problem, a confidence interval for the population parameter can still be constructed. (It is often recommended to do both if possible in a hypothesis test problem. In some situations, though, it may not be possible to construct a confidence interval.)

If the problem is a hypothesis test problem:

Step 4: State the null and alternative hypotheses

For quantitative variables, the hypotheses will typically involve either the population mean or median.
For categorical variables with two categories for the response variable, the hypotheses will typically involve the population proportion (proportion in one of the two categories)
For investigating the relationship between two categorical variables, the hypotheses are often written in terms of an “association” between the two variables.
There are exceptions to the above guidelines, especially when more advanced analysis methods are used such as regression and Analysis of Variance

Step 5: Explore the sample data

There are a couple of reasons for exploring the sample data:

1) to get an idea of what decision might be made regarding the null hypothesis (“reject” the claim in the null hypothesis for the claim made in the alternative hypothesis or not)
2) to help assess whether “normal-theory” (i.e. “t-methods”) and/or bootstrap/randomization methods can be used

Note: if the variable of interest is quantitative, exploring the sample data can also help determine if the mean or median is more appropriate to use in the hypotheses and analysis

Step 6: Determine the p-value

We will focus on the most appropriate method to determine the best approximation for the p-value if the exact p-value cannot be determined. (In some analysis methods, such as with proportions, an exact p-value can be calculated. In most analysis methods, though, the best we can do is approximate the exact p-value.)

Step 7: Answer the question of interest

Note: a confidence interval for the population parameter can be constructed to support a conclusion.

If the problem is only an estimation problem:

Step 4: Explore the sample data

As with hypothesis test problems, we need to explore the sample data to help assess whether “normal-theory” (i.e. “t-methods”) and/or bootstrap methods can be used. If the variable of interest is quantitative, exploring the sample data can also help determine if it is more appropriate to construct a confidence interval for the population mean or population median.

Step 5: Construct the appropriate confidence interval using the appropriate method

The “appropriate” method could be based on “normal-theory” (t-methods, for example) or bootstrap methods (using the “formula” method or “percentile” method)

Step 6: Interpret the confidence interval and answer the question of interest

II. Inference from a sample that is symmetric

Example 1: The Wrist Extension Example

Medical research has shown that repeated wrist extension beyond 20 degrees increases the risk of wrist and hand injuries. A study performed among 30 Cornell students used a proposed new computer mouse design. While using the new mouse design, the wrist extension (in degrees) was recorded for each of the 30 students. The data are given below and in the WRIST EXTENSION data set. Is there evidence to indicate the average wrist extension for the new mouse design is more than 20 degrees?

## WRIST EXTENSION DATA
wrist<-c(27, 24, 28, 26, 24, 24, 31, 26,
         24, 27, 25, 26, 27, 25, 22, 20,
         28, 26, 25, 29, 24, 25, 25, 27, 
         28, 26, 25, 27, 27, 28)

Step 1: Identify the variable of interest and population of interest

1. What is the variable of interest? Is it quantitative or categorical?

The variable of interest is the angle of wrist extension (in degrees) while using the new mouse design. It is quantitative.

2. What is the population of interest?

The population of interest is all students who use computer mice.

3. Therefore, what is the appropriate inference method called?

We should perform one-sample inference for a quantitative variable (inference about a population mean or median).

Step 2: Assess if the sample is representative of the population of interest

4. What concerns would you have with saying the data from the sample of 30 students is representative of the data of all students?

The researchers only sampled students at Cornell. It is possible that students at Cornell have different characteristics than the overall population of students.

5. Would you still have concerns about the sample being representative of a population of only Cornell students?

If the students were randomly sampled from all Cornell students, I would not be concerned. However, if these students volunteered for the study, we would still have concerns about whether the sample is representative of the population of Cornell students.

Step 3: Determine if this is an estimation problem or hypothesis test problem.

6. Is this an estimation problem or a hypothesis testing problem? Why?

This is a hypothesis testing problem since we want to test the claim that the average wrist extension is more than 20 degrees.

Step 4: State the hypotheses

7. State the null and alternative hypotheses in words and statistical notation. Define the parameter used in the hypotheses.

$ H_0: \mu=20 $ : The average wrist extension for the new mouse is 20 degrees.

$ H_A: \mu>20 $ : The average wrist extension for the new mouse is more than 20 degrees.

$ \mu $ is the true mean wrist extension for students using the new mouse.

8. Is the hypothesis test a one-sided or two-sided test? Why?

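This is a one-sided test because the alternative hypothesis only considers means greater than 20 degrees, so only sample means in the upper tail count as evidence against the null hypothesis.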

Step 5: Explore the sample data

9. Below is a dotplot of the sample data.

# WHAT IS THE SHAPE OF THE DATA?
# MAKE A DOT PLOT
# DOES IT LOOK NORMAL? IS THERE SKEW?
stripchart(wrist, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Wrist Extension Values", xlab = "Wrist Extension")

[Figure: Dotplot of Wrist Extension Values]

*a. Based on the shape of the sample data, which method(s) is(are) appropriate to use: one-sample t-methods, bootstrap method, or either one?*

The sample data look roughly symmetric and the sample size is 30, so either method can be used.

b. Do you feel there is evidence to reject the claim made in the null hypothesis for the claim made in the alternative hypothesis? Why?

Since all the values (except one) are larger than 20, I think that we will reject the null hypothesis in favor of the alternative, that the mean is larger than 20.

c. Give the following summary statistics:

# HOW MANY OBSERVATIONS ARE IN THE DATA SET?
n<-length(wrist)
n

## [1] 30

# WHAT IS THE MEAN?
x_bar<-mean(wrist)
x_bar

## [1] 25.86667

# WHAT IS THE SAMPLE STANDARD DEVIATION
s<-sd(wrist)
s

## [1] 2.145297

Steps 6 & 7: Determine the p-value and answer the question

Let’s show and compare the one-sample t-methods and the bootstrap method

10. One-sample t-methods

a. What values of sample means are “as or more unusual” than the observed sample mean if the null hypothesis is true?

Sample means that are as or more unusual are those greater than or equal to the observed sample mean of 25.87 degrees.

b. Based on the answer to part a, write the probability of observing a sample mean as or more unusual than the observed sample mean if the null hypothesis is true.

$ Pr(\bar{X} \geq 25.87) $, calculated assuming $ \mu = 20 $

c. Sketch the distribution of sample means.

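The distribution of sample means looks approximately normal, with the following characteristics.

What shape does the distribution of sample means have? Why?

Approximately normal, because the sample data are roughly symmetric and the sample size is 30.

What is the mean of the distribution of sample means? Why?

The mean of the distribution of sample means is the population mean, $ \mu $, which equals 20 if the null hypothesis is true.

Calculate the standard error of the sample means.

The standard deviation of sample means is $ \frac{\sigma}{\sqrt{n}} $. Because $ \sigma $ is unknown, it is estimated by the sample standard deviation $ s $, which gives $ \frac{s}{\sqrt{n}} $.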

The standard error is:

stdErr<-(s/sqrt(n))
stdErr

## [1] 0.3916758

d. Calculate the t-statistic with degrees of freedom.

# PERFORM A T-TEST
# TEST STATISTIC
mu0<-20
testStat<-(x_bar-mu0)/stdErr
testStat

## [1] 14.97837

# HOW MANY DEGREES OF FREEDOM ARE THERE?
df<-n-1
df

## [1] 29

e. Determine the p-value by calculating the probability in part b.

# WHAT IS THE P-VALUE?
pVal<-pt(testStat,df, lower.tail=FALSE)
pVal

## [1] 1.743204e-15

f. Based on the p-value, state a conclusion in the context of the problem.

There is convincing evidence that the average wrist extension with the new mouse design is greater than 20 degrees (p-value = 1.743e-15). Therefore, we reject the null hypothesis at the $ \alpha=0.05 $ significance level.

g. Use the t-methods to construct and interpret a 95% confidence interval for the average wrist extension for people using the new mouse design.

# MAKE A CONFIDENCE INTERVAL
criticalVal<-qt(0.975, df)
criticalVal

## [1] 2.04523

CI<-x_bar+c(-1,1)*criticalVal*stdErr
CI

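## [1] 25.06560 26.66773

We are 95% confident that the mean wrist extension for people using the new mouse design is between 25.07 and 26.67 degrees.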
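The hand calculations in parts d through g can be cross-checked with base R's built-in t.test function. This is only a sketch for checking the numbers above; the object names test_greater and ci_95 are chosen here for illustration.

```r
# Wrist extension data (same values as in the lesson)
wrist <- c(27, 24, 28, 26, 24, 24, 31, 26,
           24, 27, 25, 26, 27, 25, 22, 20,
           28, 26, 25, 29, 24, 25, 25, 27,
           28, 26, 25, 27, 27, 28)

# One-sided test of H0: mu = 20 against HA: mu > 20
test_greater <- t.test(wrist, mu = 20, alternative = "greater")
test_greater$statistic   # t-statistic
test_greater$parameter   # degrees of freedom
test_greater$p.value     # one-sided p-value

# A one-sided call reports a one-sided confidence bound, so the
# two-sided 95% interval is requested with a separate (default) call
ci_95 <- t.test(wrist, conf.level = 0.95)$conf.int
ci_95
```

The t-statistic, degrees of freedom, p-value, and interval bounds should agree with the values computed step by step above.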

11. Construct an “adjusted” bootstrap distribution using the onemean R function from the sample data.

When performing a hypothesis test using the onemean function, the values of the variable of interest (the original data) are “adjusted” so that the mean (or median) of the sample data is equal to the hypothesized value. The function then takes several thousand bootstrap samples from the “adjusted” sample data and displays the mean (or median) of these several thousand bootstrap samples in the Console, which will appear once you run the function. These bootstrap sample means (or medians) will be centered at the hypothesized value of the population mean (or median), which can be seen in the histogram displayed once the function is run. Having the adjusted data centered at the null hypothesized value is necessary when performing a hypothesis test.

The reason the bootstrap distribution must be centered at the hypothesized value is that a hypothesis test starts with the condition that the null hypothesis is true – that’s part of the definition of a p-value!
- Note that the hypotheses are about what’s happening in a population. With the bootstrapping procedure, we’re thinking that the data in the population is just many, many replications of the data in the sample. Therefore, the mean (or median) of a bootstrap distribution should be equal to the mean (or median) of the population as long as the sample is representative of the population, which can usually be guaranteed by taking a random sample from the population.
The way each value in the sample is “adjusted” is to add or subtract the difference between the hypothesized value and the sample mean ($ \mu_0 - \bar{x} $) to each value in the sample.
A confidence interval using the percentile method and formula method will also be given.
- To determine the lower and upper bounds using the percentile method, each “adjusted” sample mean (or median) will have to be “readjusted” back to its original bootstrap sample mean (or median) since the bootstrap distribution must be centered around the sample mean (or median) when constructing a confidence interval. The good news is that the onemean function automatically does this even though the original bootstrap sample means (or medians) are not displayed in the console.
If the parameter used is the mean, a confidence interval for the population mean ($ \mu $) using the formula method is also shown. The function simply uses the standard deviation of the bootstrap distribution as the standard error in the formula for constructing the confidence interval:

$ \bar{x} \pm (t^{*}_{n-1}) \times (Std Err) $

The formula method CANNOT be used to construct a confidence interval for the median! Only the percentile method can be used when constructing a confidence interval for a median!
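The adjust, resample, and readjust steps described above can be sketched in base R. This is only an illustration of the idea and is not the code inside the onemean function; the seed, the object names (shifted, boot_null, boot_means), and the choice of 5000 resamples are assumptions made for this example.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

wrist <- c(27, 24, 28, 26, 24, 24, 31, 26,
           24, 27, 25, 26, 27, 25, 22, 20,
           28, 26, 25, 29, 24, 25, 25, 27,
           28, 26, 25, 27, 27, 28)
mu0 <- 20        # null hypothesized mean
B   <- 5000      # number of bootstrap samples
n   <- length(wrist)

# "Adjust" the data so that its mean equals the hypothesized value
shifted <- wrist + (mu0 - mean(wrist))

# Bootstrap distribution of sample means, centered at mu0
boot_null <- replicate(B, mean(sample(shifted, replace = TRUE)))

# One-sided p-value for HA: mu > mu0
p_value <- mean(boot_null >= mean(wrist))

# "Readjust" so the distribution is centered at the sample mean,
# then take percentiles for the 95% percentile interval
boot_means    <- boot_null - (mu0 - mean(wrist))
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))

# Formula method: bootstrap standard deviation used as the standard error
boot_se    <- sd(boot_null)
ci_formula <- mean(wrist) + c(-1, 1) * qt(0.975, n - 1) * boot_se

p_value
ci_percentile
ci_formula
```

Because the bootstrap samples are random, the interval bounds will differ slightly from run to run and from the onemean output.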

After importing the WRIST data, the first thing you must do is highlight and run all of the onemean function code (see lab activity for more instructions). In order to call the function you read into R, you must use the following command:

    with(NAMEOFDATASET,
         onemean(original_sample, iterations, null.value = 912194, Alt_Hyp = NULL, MEAN = FALSE,
                 MEDIAN = FALSE, ci_level = 0.95, Summary_Stats = TRUE, histogram = TRUE, ci_only = FALSE)
    )

The with(NAMEOFDATASET, ) just indicates that you are wanting to call up the onemean function using a specific data set.

The arguments for “calling” the onemean function are
- original_sample: for this type in the name of the variable you are using. You have the option of typing out original_sample = wrist or simply type wrist for this first argument.
- iterations: type in the number of bootstrap samples you would like to take. In this example we use 5000 (5000 bootstrap sample means are to be generated).
- null.value = (type in the null hypothesized value here)
- Alt_Hyp = (Type in 1 for $ \mu < \mu_0 $, Type in 2 for $ \mu > \mu_0 $, Type in 3 for $ \mu \neq \mu_0 $)
- MEAN = (Type “TRUE” if you are using the mean. If you are using the median you can skip this argument. By default this is set to FALSE)
- MEDIAN = (Type “TRUE” if you are using the median. If you are using the mean you can skip this argument. By default this is set to FALSE)
- ci_level = (Type in confidence level as a decimal)
- Summary_Stats = TRUE (by default the function will give you the summary statistics of your data)
- histogram = TRUE (by default the function will give you a histogram of the adjusted bootstrap distribution)
- ci_only = FALSE, if you wish to ONLY find a confidence interval, you can type in ci_only = TRUE.

For this example the code that was used to call the function is found below. Notice that we first assign the output of the function to my.onemean. All of the output is then saved under the name my.onemean, and any time we want to see one part of the output, we can type my.onemean$___________ and fill in the blank with the item we want to see.
```
my.onemean <- onemean(wrist, 5000, null.value = 20, Alt_Hyp = 2,
                      MEAN = TRUE, ci_level = .95)
```

Now to show all of the items saved as output under the name my.onemean, use the following code:

Note: the summary statistics given are of the original data. Also, because we saved the output under the name my.onemean, there is no need to use the with() statement.

    my.onemean$BootSD

    my.onemean$Summary_Stats

    my.onemean$Confidence_Intervals

    my.onemean$pval

    my.onemean$Alt_Hyp

The output from the onemean function will display

    > my.onemean$BootSD
    [1] 0.3882132

    > my.onemean$Summary_Stats
          Mean Median       SD MIN Q1 Q3 Max
    1 25.86667     26 2.145297  20 25 27  31

    > my.onemean$Confidence_Intervals
      CI_Percent CI_Formula
    1   25.10000   25.10578
    2   26.63333   26.62755

    > my.onemean$pval
    [1] 0

    > my.onemean$Alt_Hyp
    [1] "mu > nullvalue (one sided)"

Notes:

1. A confidence interval (with the level of confidence indicated in the function call) is given with bounds determined in two ways:

- using the percentile method (see CI_Percent above), where the lower bound is the (100 – CL) / 2 percentile and the upper bound is the (100 + CL) / 2 percentile of the bootstrap distribution, with CL = the confidence level
- using the formula method (see CI_Formula above):

$ \bar{x} \pm (t^{*}_{n-1}) \times (Std Err of Bootstrap) $

If the median is used as the parameter, the confidence interval displayed under CI_Formula will read “NA”.

2. The standard deviation of the several thousand bootstrap sample means (or medians) is a way to estimate the true standard deviation of the sample means (or medians) and is called a standard error. If the mean is the choice of parameter to be used, the standard deviation from the bootstrap distribution should be close to the value determined from the formula using the one-sample t-methods, but may not be exactly the same.

a. Is the p-value the same as from the one-sample t-methods?

The p-values are virtually the same and are approximately 0.

b. How close is the standard deviation of the bootstrap sample means to the standard error calculated using the t-methods?

    T-method SE =  0.3916758

    Bootstrap SD = 0.3882132

They are very close.

c. Based on the output, what decision would be made at the 5% significance level?

We would reject the null at an $ \alpha=0.05 $ level.

d. Would the same decision be made at the 1% significance level?

We would reject the null at an $ \alpha=0.01 $ level.

e. State a conclusion in the context of the problem.

There is convincing evidence that the average wrist extension with the new mouse design is greater than 20 degrees, with a p-value that is approximately 0. Therefore, we reject the null hypothesis at the $ \alpha=0.05 $ significance level.

12. Confidence interval for the population mean

a. Construct a 95% confidence interval for the mean wrist extension for people using the new mouse design using the standard deviation of the bootstrap distribution. (This is sometimes called the “boot t-method”.) Are the bounds close to the bounds using one-sample t-methods?

(25.10578, 26.62755) degrees. These bounds are close to the bounds from the one-sample t-methods, (25.06560, 26.66773) degrees.

b. Give the 95% confidence interval for $ \mu $ the mean wrist extension for people using the new mouse design using the percentile method.

(25.10000, 26.63333) degrees

c. Which method (t-method or percentile method) would be better to use in this example? Or, does it matter in this example? Explain.

It does not really matter in this example. The sample data are roughly symmetric, so both methods give nearly the same interval.

III. Inference from a skewed distribution

Example 2: The Date Example

A student was interested in estimating the average cost of a Saturday night date for students attending her college. To do so, she randomly selected 10 men each from 2 randomly selected dormitories around campus. She asked each how much they spent on a date last Saturday. (If they did not go on a date last Saturday, they were excluded from the analysis and another randomly selected male from that dorm was selected. All who were selected that went on a date last Saturday responded.) The data are provided in the table below:

## SATURDAY NIGHT DATE
amount<-c(10, 10, 15, 20, 20, 
          22, 25, 29, 30, 42, 
          35, 39, 49, 56, 45, 
          68, 75, 80, 110, 135)

Step 1: Identify the variable of interest and population of interest

1. What is the variable of interest? Is it quantitative or categorical?

The variable of interest is amount of money spent on Saturday night dates. This is quantitative.

2. What is the population of interest?

The population of interest is all students at her college who go on Saturday night dates.

Step 2: Assess if the sample is representative of the population of interest

3. What concerns would you have with saying the data from the sample of 20 is representative of the student’s population of interest?

The data were taken only from men living in two dormitories. These dormitories may not be representative of men at the college in general. The sample also does not take into account students who do not live on campus.

4. To what population can this student legitimately make a conclusion?

We could estimate the amounts spent by the men in those two dormitories who went on a date last Saturday.

Step 3: Determine if this is an estimation problem or hypothesis test problem.

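5. Is this an estimation problem or a hypothesis testing problem? Why?

This is an estimation problem because the student wants to estimate the average cost of a date and makes no claim about its value.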

Step 4: Explore the sample data

There are two options to use in a one-sample inference problem with a quantitative variable of interest:

1) the bootstrap methods
2) the one-sample t-methods

Which one or ones can be used depends on the shape of the sample data and the sample size. Here are three situations and what methods can be used:

1) if the sample data are symmetric, either method can be used regardless of the sample size
2) if the sample data are slightly skewed, and
- if the sample size is at least 30, either method can still be used
- if the sample size is less than 30, the bootstrap method on the median should be used.
3) if the sample data are heavily skewed and/or there are extreme outliers,
- the t-methods should NOT be used. Only the bootstrap methods should be used.
- a hypothesis test on the median should be performed
- a confidence interval for the population median should be constructed using the percentile method.
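The three situations above can be written as a small lookup function. This is only a study aid that restates the guidelines in code; the function name suggest_method and its shape labels are hypothetical, and judging the shape of the sample data still has to be done by looking at a graph.

```r
# Hypothetical helper that restates the guidelines above.
# shape: "symmetric", "slightly skewed", or "heavily skewed"
#        (heavily skewed also covers extreme outliers)
# n:     sample size
suggest_method <- function(shape = c("symmetric", "slightly skewed", "heavily skewed"), n) {
  shape <- match.arg(shape)
  if (shape == "symmetric") {
    "t-methods or bootstrap methods"
  } else if (shape == "slightly skewed") {
    if (n >= 30) {
      "t-methods or bootstrap methods"
    } else {
      "bootstrap method on the median"
    }
  } else {
    "bootstrap methods on the median (percentile confidence interval)"
  }
}

suggest_method("symmetric", n = 30)        # the wrist extension example
suggest_method("heavily skewed", n = 20)
```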

The histogram of the 20 amounts spent on a date last Saturday from the sample is given below.

# WHAT IS THE SHAPE OF THE DATA?
# MAKE A HISTOGRAM
# DOES IT LOOK NORMAL? IS THERE SKEW?
hist(amount, 
     main = "Histogram of Amount Spent", xlab = "Amount Spent")

[Figure: Histogram of Amount Spent]

6. Describe the shape.

The data is skewed right with possible outliers.

7. Based on the shape of the sample data, should a confidence interval be constructed for the mean or the median? Explain.

Based on the shape of the data and small sample size, I would suggest creating a confidence interval for the median.

Step 5: Construct the confidence interval

With heavily skewed sample data, it is likely the data in the population from which the sample was taken will also be skewed if the sample is representative of the population. Therefore, it is more appropriate to use the median as the measure of the center instead of the mean. So, instead of constructing a confidence interval for the mean, a confidence interval for the median should be constructed using the percentile method. (Likewise, if a hypothesis test was to be performed, it would be more appropriate to test hypotheses involving the median instead of the mean.)

Recall that the percentile method involves finding certain percentiles of the bootstrap sample medians depending on the level of confidence (2.5th and 97.5th for a 95% confidence interval). These percentiles will be the lower and upper bounds of the confidence interval.
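As a rough illustration of the percentile method for a median, the steps can be carried out in base R as follows. This sketch does not use the onemean function; the seed, the number of resamples, and the object names boot_medians and ci_median are assumptions made for this example.

```r
set.seed(2024)  # assumed seed, only so the sketch is reproducible

amount <- c(10, 10, 15, 20, 20,
            22, 25, 29, 30, 42,
            35, 39, 49, 56, 45,
            68, 75, 80, 110, 135)
B <- 5000  # number of bootstrap samples

# Each bootstrap sample has the same size as the original sample
# and is drawn from it with replacement
boot_medians <- replicate(B, median(sample(amount, replace = TRUE)))

# 2.5th and 97.5th percentiles give the 95% percentile interval
ci_median <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_median
```

Because bootstrap samples are random, the bounds can change a little between runs, especially with a sample this small.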

Note: a confidence interval for the mean should NOT be constructed using any method (t-methods or percentile methods) when the population is skewed and the sample size is small. About 72% of all 95% confidence intervals constructed using either of these two methods will capture the population mean with skewed populations and sample sizes under 30. Even when the sample size is more than 30, less than 90% of all 95% confidence intervals will capture the population mean when the population data are skewed. This percentage will only get close to 95% when the sample size is quite large (well over 1000). Therefore, it is suggested to always construct a confidence interval for the median when we feel the data in a population are quite skewed – close to 95% of all 95% confidence intervals for the median will capture the true population median using the bootstrap methods.

Using the onemean function to construct a confidence interval for the median

When only a confidence interval is desired, the onemean function will create a bootstrap distribution of sample means (or medians) by sampling the same number of cases as in the sample from the original sample with replacement. The number of bootstrap sample means (or medians) in the bootstrap distribution will be set by you in the arguments used to call the function.

First you must import the DATE data set. Then using the same function from earlier, the following code was used:

    medCI <- with(DATE, 
                  onemean(amount, 5000, null.value = 0, MEDIAN = TRUE, 
                    ci_level = .95, ci_only = TRUE)
    )

    #To see the confidence interval output:

    medCI$Confidence_Intervals

Here is the output from one such bootstrap distribution using the DATE example:

    > medCI$Confidence_Intervals
      CI_Percent CI_Formula
    1       23.5         NA
    2       52.5         NA

A graph of the bootstrap distribution of the sample median is given in the plot window. You can see two red lines corresponding to the lower and upper bounds of the confidence interval.

8. Write the 95% confidence interval for the median amount spent on a date last Saturday in proper notation.

($23.50, $52.50)

Step 6: Interpret the confidence interval

9. In the context of the problem, interpret the confidence interval.

We are 95% confident that the median amount spent on a date last Saturday by men in the two sampled dormitories is between $23.50 and $52.50.

10. Suppose the student hypothesized that the median amount spent on a date last Saturday among college students was different than $35.

a. State the null and alternative hypotheses in notation. Define the notation.

$ H_0: m=35 $

$ H_A: m \neq 35 $

$ m $ is the population median amount of money spent on a date last Saturday.

b. Using the confidence interval from #8, what decision would be made about the null hypothesis? Why? What significance level are you using in making this decision?

Since $35 is contained in the 95% confidence interval, we fail to reject the null hypothesis at the $ \alpha=0.05 $ significance level. Equivalently, the two-sided p-value would be greater than 0.05.
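The decision can also be checked with a bootstrap hypothesis test on the median, using the same "adjust the data to the null value" idea from Example 1. The sketch below is an assumption about how such a test might be coded in base R, not the onemean function itself; in particular, the two-sided p-value is taken here as the proportion of bootstrap medians at least as far from 35 as the observed median, and other conventions exist.

```r
set.seed(35)  # assumed seed, only so the sketch is reproducible

amount <- c(10, 10, 15, 20, 20,
            22, 25, 29, 30, 42,
            35, 39, 49, 56, 45,
            68, 75, 80, 110, 135)
m0 <- 35                       # null hypothesized median
obs_median <- median(amount)   # observed sample median

# Adjust the data so that its median equals the hypothesized value
shifted <- amount + (m0 - obs_median)

# Bootstrap distribution of sample medians under the null hypothesis
boot_null <- replicate(5000, median(sample(shifted, replace = TRUE)))

# Two-sided p-value: bootstrap medians at least as far from m0
# as the observed median
p_value <- mean(abs(boot_null - m0) >= abs(obs_median - m0))
p_value
```

A p-value above 0.05 here is consistent with failing to reject the null hypothesis based on the confidence interval.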