January 25, 2016

Hypothesis Testing

Hypothesis Testing

  • Fundamental for inferential statistics
    • Use statistical methods to make decisions about real-world problems
  • Conceptual steps for hypothesis testing
    1. Develop research hypothesis to be mathematically tested
    2. State in a formal way the null and alternative hypotheses
    3. Choose a statistical test, get data, perform computations
    4. Take decision based on results

Hypothesis Testing - Example - Evaluating Medication

  • Medication to treat hypertension
  • Establish that it works better than current treatments for hypertension
  • Research hypothesis
    • Hypertension patients treated with new drug X will show greater lowering of blood pressure than those treated with other drugs Y
  • Statistical test
    • Choose \(\mu_1\) as mean lowering of blood pressure in group treated with drug X
    • Choose \(\mu_2\) as mean lowering of blood pressure in group treated with drug Y

Hypothesis Testing - Example - Evaluating Medication

  • Null and Alternative hypotheses
    • \(H_0: \mu_1 \leq \mu_2\)
    • \(H_A: \mu_1 > \mu_2\)
  • \(H_0\) is the null hypothesis
    • Drug \(X\) is no improvement over drug \(Y\) because lowering of blood pressure achieved by drug \(X\) is less than or equal to that achieved by drug \(Y\)
  • \(H_A\) is the alternative hypothesis
    • Drug \(X\) is more effective than drug \(Y\) because patients treated with drug \(X\) show more lowering of blood pressure than patients treated with drug \(Y\)

Hypothesis Testing - Example - Evaluating Medication

  • Null and alternative hypotheses must be mutually exclusive
  • In this case alternative hypothesis is single-tailed
    • Drug \(X\) must achieve greater lowering of blood pressure than drug \(Y\) to reject the null hypothesis
  • In a two-tailed alternative hypothesis
    • \(H_0: \mu_1 = \mu_2\)
    • \(H_A: \mu_1 \neq \mu_2\)
    • Can find a difference in either direction

Hypothesis Testing - Example - Evaluating Medication

  • After we collect data and compute statistics with it
    • Take one of two decisions
      • Reject the null hypothesis
      • Fail to reject the null hypothesis
    • Failing to reject null hypothesis doesn't prove the null hypothesis is true
      • It only shows that our study didn't find enough evidence to reject it

Hypothesis Testing

  • Rejecting the null hypothesis
    • Sometimes called finding significance or finding significant results
    • We show that there are differences in the group means but also that those differences are statistically significant

Statistical Significance

  • Informal meaning of statistical significance
    • probably not due to chance
  • Statistical Testing process
    • Choose a significance level, the threshold against which the p-value is compared
      • Defines how strong the sample results must be to support rejection of the null hypothesis
      • Commonly set at \(p < 0.05\)
      • Other common thresholds: \(p < 0.01\) or \(p < 0.001\)
      • Rarely used: \(p < 0.10\)

Type I and Type II errors

Type I and Type II errors

  • Correct decisions
    • \(H_0\) is true and NOT rejected
    • \(H_0\) is false and IS rejected
  • Errors
    • Type I error or \(\alpha\): \(H_0\) is true and IS rejected
    • Type II error or \(\beta\): \(H_0\) is false but NOT rejected

  • In the matrix it is assumed that the true state of the population is known
    • It is generally unknown to the researcher
    • We make decisions about the population based on the analysis of a sample

Type I and Type II errors

  • Trial example
  • Null hypothesis: defendant is innocent
  • True state of affairs: is the defendant guilty?
  • Jury members take decision based on information presented to them
  • Jury doesn't know the true state of affairs any more than a statistician knows the true state of the population

Type I and Type II errors

  • Jury might make correct decision or might commit Type I or Type II error
  • Finds an innocent defendant guilty
    • Type I error
    • Rejects the null hypothesis of innocence when it should not be…
  • Finds a guilty defendant not guilty
    • Type II error
    • Fails to reject the null hypothesis of innocence when it should have rejected it

Type I and Type II errors

  • Level of acceptability of Type I error
    • Conventionally \(\alpha\) = 0.05
    • We accept 5% probability of Type I error: 5% chance of rejecting the null hypothesis when we should fail to reject it
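
The 5% figure can be checked with a small simulation. This is only a sketch: the population (normal, mean 100, sd 10), the sample size of 15, the seed, and the number of repeated studies are all assumptions chosen for illustration.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
alpha <- 0.05
n_studies <- 5000    # hypothetical number of repeated studies

# Each study samples 15 values from a population where H0 (mean = 100) is true
p_values <- replicate(
  n_studies,
  t.test(rnorm(15, mean = 100, sd = 10), mu = 100)$p.value
)

# Fraction of studies that wrongly reject H0; should be close to alpha
type1_rate <- mean(p_values < alpha)
type1_rate
```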

Type I and Type II errors

  • Level of acceptability of Type II error
    • Conventionally \(\beta\) = 0.10 or \(\beta\) = 0.20
    • We accept 10% (or 20%) probability of a Type II error: 10% (or 20%) chance the null hypothesis is false but will fail to be rejected
    • In other words:
      • In a study that should return significant results based on the true state of the population, there is a 10% chance that the results of the study will not be significant

Type I and Type II errors

  • Complement of the Type II error probability is power
    • \(1 - \beta\)
    • Became more appreciated in recent years ("medical field")
    • Don't invest time, effort, and expense in a study unless it has a reasonable probability of finding significant results
    • Power is important in planning studies
      • Determining sample size required for adequate power
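
As a sketch of how sample size can be planned for a target power, base R's `power.t.test` solves for the missing quantity. The planning values below (a 5-point difference, sd of 10, two independent groups) are hypothetical and not taken from the slides.

```r
# Hypothetical planning values: detect a 5-point difference when sd = 10,
# with alpha = 0.05 and power = 1 - beta = 0.80
plan <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")

# n is returned as a fraction; round up to whole subjects per group
n_per_group <- ceiling(plan$n)
n_per_group
```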

Confidence Intervals

  • Sometimes we don't work with a point estimate statistic such as a mean
    • Not every sample will have the same exact mean
    • How much is a point estimate likely to vary by chance?
    • This can be answered with interval estimates
  • A point estimate is a single number, an interval estimate is a range of numbers
  • Common interval estimate
    • Confidence interval, the interval between two values that represent the upper and lower confidence limits or confidence bounds of a statistic

Confidence Intervals

  • Formula, depends on the statistic used
  • Significance level set to \(\alpha\), commonly set at 0.05
  • Confidence coefficient is: (1 - \(\alpha\)) or 100(1 - \(\alpha\))%
  • With \(\alpha = 0.05\), the confidence coefficient is 95% as in 95% confidence intervals

Confidence Intervals

  • Idea of confidence intervals
    • We repeat a study an infinite number of times
      • Each drawing different sample of the same size from the same population
      • Build a confidence interval based on each sample
    • \(95\%\) of the time the confidence interval would contain the true parameter value we are estimating
    • Here \(95\%\) is the confidence level of the interval
    • If using the mean, 95% of the time the constructed confidence interval would contain the true mean of the population
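
The repeated-sampling idea can be approximated with a finite simulation. This is a sketch under assumed values: a normal population with mean 100 and sd 10, samples of size 15, and 5000 repetitions standing in for "infinite".

```r
set.seed(42)         # arbitrary seed, for reproducibility only
true_mean <- 100     # assumed population mean

# For each simulated study, build a 95% CI and record whether it covers the truth
covers <- replicate(5000, {
  x <- rnorm(15, mean = true_mean, sd = 10)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals containing the true mean; should be near 0.95
coverage <- mean(covers)
coverage
```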

p-Values

  • When working with inferential statistics
    • We try to estimate something that we can't measure directly
    • Can't collect data from the whole population (all hypertensive adults in the world)
    • Can collect a sample and make inferences based on it (sample of hypertensive adults)
  • p-value expresses the probability of obtaining results at least as extreme as those observed in the sample data if chance alone were at work (that is, if the null hypothesis were true)

p-Values

  • Experiment: flip coin we believe to be fair
    • \(P(h) = P(t) = 0.5\)
    • Each flip is a trial
    • Best guess: 5 heads on 10 trials
    • Might have a set of 10 trials with different number of heads
    • 8 heads? The p-value of this result asks how likely it is that a coin with probability 0.5 of heads on a single trial produces 8 heads in 10 trials
      • From a binomial table, exactly 8 heads in 10 trials has probability 0.0439, so we expect 8 heads in 10 flips of a fair coin less than 5% of the time

p-Values

  • Experiment: flip coin we believe to be fair (continues…)
    • 9 heads in 10 trials: 0.0098
    • 10 heads in 10 trials: 0.0010
    • As results are further away from the expected result of 5 heads in 10 trials, they become less likely
  • If we evaluate whether the coin is fair
    • Results far from expectations (5 heads in 10 trials) give evidence that it is unfair
    • We compute probability of results at least as extreme as those obtained: probability of obtaining 8, 9, or 10 heads in 10 flips of a fair coin is: 0.0439 + 0.0098 + 0.0010 = 0.0547
    • This is the p-value for the result of at least 8 heads in 10 trials using a coin where P(heads) = 0.5
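
The binomial-table lookups above can be reproduced in R. The sketch below computes the same one-sided p-value three ways; the variable names are illustrative.

```r
n_flips <- 10
p_heads <- 0.5

# Probabilities of exactly 8, 9, and 10 heads with a fair coin
probs <- dbinom(8:10, size = n_flips, prob = p_heads)
round(probs, 4)

# p-value for "at least 8 heads": sum of the three probabilities
p_value <- sum(probs)

# Same quantity from the upper tail of the binomial distribution
p_upper <- pbinom(7, size = n_flips, prob = p_heads, lower.tail = FALSE)

# Same quantity from the exact binomial test (one-sided)
p_test <- binom.test(8, n_flips, p = p_heads, alternative = "greater")$p.value

round(c(p_value, p_upper, p_test), 4)
```

Note that this is a one-sided p-value; a two-sided test would also count results of 2 or fewer heads.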

p-Values

  • p-values commonly reported in research results involving statistical calculations
    • Intuition –> poor guide to how unusual a result is
  • No statistical definition of what counts as an unusual result
    • By convention, the p-value for our results must be less than 0.05 for us to reject the null hypothesis

The t-test

The t-test

  • Introduced by William Sealy Gosset (Quality Control, Guinness brewery, Ireland)
  • Article under pseudonym Student
    • Student's t-test
  • Tests the difference between means by comparing a test statistic to the \(t\) distribution, which gives the probability of that statistic if the study's null hypothesis is true

The t Distribution

  • In inferential statistics we use known probability distributions to make inferences about real data sets
    • Normal and binomial distributions
  • t distribution
    • Continuous, symmetrical
    • Shape depends on degrees of freedom of a sample
      • Number of values allowed to vary
    • Degrees of freedom influenced by sample size
    • Tests on larger sample sizes generally have more degrees of freedom than small sample sizes

The t Distribution

  • Two main reasons to use the t Distribution to test differences in means
    1. Working with small samples from a population we believe has an approximately normal distribution
    2. We don't know the standard deviation of the population and need to use the standard deviation of the sample as a substitute for the population standard deviation

The t Distribution

  • t distribution –> similar to normal
    • Thicker tails, extreme values more probable in t than in normal
    • As sample size (hence degrees of freedom) increases, t distribution looks more like a normal distribution

The t Distribution

  • Formula: \(t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}\)
    • \(\bar{x}\), sample mean
    • \(\mu\), population mean
    • \(s\), sample standard deviation
    • \(n\), sample size
  • Similar to \(Z\)-statistic
    • But we use sample standard deviation instead of population standard deviation

The t Distribution

  • Upper critical values of the \(t\) distribution for different degrees of freedom
    • "Upper critical values", because t distribution is symmetric, no need to print the lower critical values
    • They are the negatives of numbers in table
  • When only positive values included
    • To find critical value for a two-tailed \(t\)-test with \(\alpha = 0.05\)
      • Use column with \(\alpha = 0.025\)

The t Distribution

  • As sample size increases
    • Critical values for t dist. approach those of standard normal distribution (see tables)
  • Critical value for a two-tailed test with \(\alpha = 0.05\) under the normal distribution is 1.96
  • Two-tailed test with t distribution with \(\alpha = 0.05\)
    • Critical upper value depends on degrees of freedom (df)
    • 1 df, 12.706
    • 10 df, 2.228
    • 30 df, 2.042
    • 50 df, 2.009
    • 100 df, 1.984
    • inf df, 1.96
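
The tabulated values can be reproduced with the quantile function `qt`; this short sketch uses the same degrees of freedom as the list above, with `Inf` standing for the normal limit.

```r
# Degrees of freedom from the list above; Inf gives the standard normal limit
df <- c(1, 10, 30, 50, 100, Inf)

# Upper critical value for a two-tailed test with alpha = 0.05
crit <- qt(1 - 0.05 / 2, df)

data.frame(df = df, critical_value = round(crit, 3))
```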

The One-Sample t-Test

  • Compare the mean of a sample to a population with known mean
    • Null hypothesis
      • There is no significant difference between the mean of the population the sample was drawn from and the known population mean
  • Example: "effects of lead exposure on intelligence in children in USA"
    • Mean intelligence test score for 5-year-olds: 100
    • Sample of 15 5-year-old children (exposed to lead)
    • Did exposure affect their intelligence?
    • We also know: intelligence scores generally assume normal distribution in this population

The One-Sample t-Test

  • Example (continues)
    • Null hypothesis: there is no difference in intelligence scores of lead-exposed group
    • Conduct two-tailed test with \(\alpha = 0.05\)
    • \(t = \frac{\bar{x}- \mu_0}{\frac{s}{\sqrt{n}}}\)
      • \(\bar{x}\), mean of our sample
      • \(\mu_0\), the reference mean (avg intelligence score for 5-year-olds)
      • \(s\), standard deviation of our sample
      • \(n\), sample size

The One-Sample t-Test

  • Example (continues)
    • Formula for the sample mean
      • \(\bar{x}=\frac{\displaystyle\sum_{i=1}^{n}x_i}{n}\)
    • Formula for the sample standard deviation
      • \(s = \sqrt{\frac{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)

The One-Sample t-Test

  • Example (continues)
    • Computational formula for the sample standard deviation
      • \(s = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}x_i^2 - \frac{\left(\displaystyle\sum_{i=1}^n x_i\right)^2}{n}}{n - 1}}\)
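
A quick check that the definitional and computational formulas agree with R's built-in `sd`. The scores in `x` are made up for illustration only.

```r
x <- c(85, 92, 78, 101, 96, 88, 90)   # hypothetical scores, not from the slides
n <- length(x)

# Definitional formula: squared deviations from the sample mean
s_def <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Computational formula: sum of squares minus squared sum over n
s_comp <- sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))

c(definitional = s_def, computational = s_comp, builtin = sd(x))
```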

The One-Sample t-Test

  • Example (continues)
    • With \(\bar{x} = 90\), \(s = 10\), \(n = 15\)
    • \(t = \frac{90 - 100}{\frac{10}{\sqrt{15}}} = -3.87\)
    • degrees of freedom \(df = n - 1\), \(df = 15 - 1 = 14\)
    • From critical values table for \(t\) distribution for two-tailed \(t\)-test, with 14 df, \(\alpha = 0.05\): 2.145
      • We have that \((|-3.87| > 2.145)\)
      • The absolute value of the \(t\)-statistic for our data is greater than the upper critical value
      • We reject the null hypothesis that there is no difference in intelligence scores of lead-exposed group
      • Difference in means and \(t\)-statistic are negative, so we can say that the intelligence score is lower for children exposed to lead than for the average child of the same age in the whole population
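
The worked example can be reproduced from the summary statistics alone. This sketch uses the values given above; the raw scores are not available, so `t.test` itself cannot be called here.

```r
x_bar <- 90    # sample mean
mu_0  <- 100   # reference mean
s     <- 10    # sample standard deviation
n     <- 15    # sample size

t_stat <- (x_bar - mu_0) / (s / sqrt(n))
df     <- n - 1

# Upper critical value for a two-tailed test with alpha = 0.05
t_crit <- qt(1 - 0.05 / 2, df)

# Two-tailed p-value: area in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df)

reject_h0 <- abs(t_stat) > t_crit
round(c(t = t_stat, critical = t_crit), 3)
```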

Confidence Interval for the One-Sample t-Test

  • We also want to report confidence interval with the t-statistic and significance test
    • Range of values
    • If we drew infinitely many samples of the same size from the same population, x% of the confidence intervals calculated from those samples would include the true population mean
    • For a 95% confidence interval: 95% of the intervals calculated from such samples can be expected to contain the true population mean

Confidence Interval for the One-Sample t-Test

  • Information about the precision of a point estimate, such as the mean
  • Wide range
    • With different samples, might get quite different sample mean
  • Narrow range
    • With different samples, sample means would be close to one another
  • Formula for two-tailed confidence interval (CI) for mean for one-sample \(t\)-test
    • \(CI_{1-\alpha} = \bar{x} \pm \left(t_{\frac{\alpha}{2},df}\right) \left(\frac{s}{\sqrt{n}}\right)\)

Confidence Interval for the One-Sample t-Test

  • In the previous example
    • \(\alpha = 0.05\), \(\bar{x} = 90\), \(df = n-1 = 14\), \(s = 10\)
    • \(t_{0.025, 14} = 2.145\), \(n = 15\)
  • \(CI_{0.95} = 90 \pm \left(2.145\right)\left(\frac{10}{\sqrt{15}}\right) = 90 \pm 5.54 = (84.46, 95.54)\)
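
The same interval, computed in R from the summary statistics of the example (a sketch; variable names are illustrative).

```r
x_bar <- 90; s <- 10; n <- 15; alpha <- 0.05

# Margin of error: critical t value times the standard error of the mean
margin <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)

ci <- c(lower = x_bar - margin, upper = x_bar + margin)
round(ci, 2)
```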

Confidence Interval for the One-Sample t-Test

  • In order to compute a one-sided confidence interval
    • Change the \(\pm\) to either plus or minus
    • Use upper critical value from t table for \(\alpha\) instead of \(\frac{\alpha}{2}\)
    • e.g. the upper critical value for t of a one-sided, 90% CI with 20 df is 1.325

The Paired t-Test

  • Used to compare two population means
  • Pair observations in one sample with observations in the other sample
  • When to use
    • Before/after study with observations on the same subjects, e.g. diagnosis of students before and after a particular module or course
    • Comparison of two different methods of measurement or two different treatments where measurements/treatments are applied to the same subjects, e.g. blood pressure using a stethoscope and a Dinamap

The Paired t-Test

  • Sample of \(n\) students
  • Diagnostic test before studying particular module, then test again after completing the module
  • Want to learn if teaching leads to improvements in students' knowledge or skills (measured in test scores)
  • Use our sample to make conclusion
  • \(x\) = test score before module, \(y\) = test after module
  • Null hypothesis: the true mean difference is zero

The Paired t-Test

  • Procedure:
    • Compute difference between the two observations on each pair \(d_i = y_i - x_i\)
    • Compute mean difference \(\bar{d}\)
    • Compute standard deviation of differences: \(s_d\)
    • Compute standard error of mean difference: \(SE(\bar{d}) = \frac{s_d}{\sqrt{n}}\)
    • Compute the t-statistic: \(T = \frac{\bar{d}}{SE(\bar{d})}\), which follows a t-distribution with \(n - 1\) degrees of freedom
    • Compare our T value with critical value of table of t-distribution
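
The procedure can be sketched step by step in R and checked against `t.test`. The before/after scores below are hypothetical, invented only to make the example runnable.

```r
x <- c(12, 15, 9, 14, 11, 13, 10, 16)    # hypothetical scores before the module
y <- c(14, 15, 12, 17, 11, 15, 13, 17)   # hypothetical scores after the module

d     <- y - x              # difference within each pair
n     <- length(d)
d_bar <- mean(d)            # mean difference
s_d   <- sd(d)              # standard deviation of the differences
se_d  <- s_d / sqrt(n)      # standard error of the mean difference

t_manual <- d_bar / se_d    # t-statistic with n - 1 degrees of freedom

# The built-in paired test should give the same statistic
t_builtin <- unname(t.test(y, x, paired = TRUE)$statistic)
c(manual = t_manual, builtin = t_builtin)
```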

The Paired t-Test

  • Important notes for the test to be valid
    • Differences only need to be approximately normally distributed
    • Not advisable to use paired t-test when there are extreme outliers
  • Example
    • \(n = 20\)
    • \(df = n - 1 = 19\)
    • \(\bar{d} = 2.05\)
    • \(s_d = 2.837\)
    • \(SE(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{2.837}{\sqrt{20}} = 0.634\)
    • \(t = \frac{\bar{d}}{SE(\bar{d})} = \frac{2.05}{0.634} = 3.231\)

The Paired t-Test

  • In the table, for 19 df, the closest values to 3.231 are 3.579 on one side and 2.861 on the other
  • For 3.579, the tail area of a one-tailed test (the p-value) is 0.1%; for 2.861 the area is 0.5%
  • So our p-value is in the range 0.001 < p < 0.005
  • At a 5% significance level, we have enough evidence to reject the null hypothesis in favor of the alternative: students perform better after taking the course module
    • Remember that the null hypothesis was: the true mean difference is zero
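
Instead of bracketing the p-value between table entries, it can be computed directly with `pt`. This sketch uses only the summary statistics of the example.

```r
n <- 20; d_bar <- 2.05; s_d <- 2.837

se_d   <- s_d / sqrt(n)
t_stat <- d_bar / se_d

# One-tailed p-value: area to the right of the observed t with n - 1 df
p_one_tailed <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(t = t_stat, p = p_one_tailed)
```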

The Paired t-Test

  • The Confidence interval for the true mean difference
    • A 95% confidence interval for the true mean difference
    • \(\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}\) or \(\bar{d} \pm (t^* \times SE(\bar{d}))\)
    • \(t^*\) is the 2.5% point of the t-distribution on \(n -1\) df
    • Mean difference is 2.05
    • 2.5% point of t-distribution with 19 df is 2.093
    • 95% confidence interval for true mean difference is: \(2.05 \pm (2.093 \times 0.634) = 2.05 \pm 1.33 = (0.72, 3.38)\)
    • Although difference in scores is statistically significant, it is small. We are 95% sure that true mean increase in score is between just under 1 point and over 3 points.
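
A sketch of the same interval in R, again from the summary statistics only.

```r
n <- 20; d_bar <- 2.05; s_d <- 2.837

t_star <- qt(0.975, df = n - 1)   # 2.5% point of the t-distribution, 19 df
se_d   <- s_d / sqrt(n)

ci <- c(lower = d_bar - t_star * se_d, upper = d_bar + t_star * se_d)
round(ci, 2)
```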

The paired t-test in R

The paired t-test in R

  • This is a new example
  • School of athletics has new instructor
  • Test effectiveness of new type of training
  • Compare average times of 10 runners in the 100 meters
  • Times for each athlete:
    • Before training: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
    • After training: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1

The paired t-test in R

  • We have two sets of paired samples
    • Same athletes before and after new type of training
    • Improvement or deterioration in mean times?
  • Null hypothesis: There is no difference in mean time of 100 meters before and after new type of training

The paired t-test in R

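# 100 m times for the same 10 athletes
a <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)  # before training
b <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)  # after training

# Two-sided paired t-test on the differences a - b
ttest <- t.test(a, b, paired = TRUE)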

The paired t-test in R

ttest
## 
##  Paired t-test
## 
## data:  a and b
## t = -0.21331, df = 9, p-value = 0.8358
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5802549  0.4802549
## sample estimates:
## mean of the differences 
##                   -0.05

The paired t-test in R

  • With t.test we found
    • \(t = -0.2133\)
    • \(df = 9\)
    • p-value = 0.8358
    • \(CI = (-0.5802549, 0.4802549)\)
    • \(\bar{d} = -0.05\), mean of differences

The paired t-test in R

  • Tabulated t-value
qt(0.975, 9)
## [1] 2.262157
  • |t-computed| < t-tabulated (0.2133 < 2.262)
    • Fail to reject the null hypothesis

The paired t-test in R

  • Analyzing the result of the t-test
    • p-value is greater than 0.05
      • We fail to reject the null hypothesis \(H_0\) of equality of the averages
    • Conclusion: the new training hasn't produced any significant improvement or deterioration in the athletes' times
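
Both decision rules (critical value and p-value) can be applied programmatically by reading components of the object returned by `t.test`. A sketch on the same data; `fail_to_reject` is an illustrative name.

```r
a <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)  # before
b <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)  # after

ttest <- t.test(a, b, paired = TRUE)

t_stat <- unname(ttest$statistic)            # computed t
t_crit <- qt(0.975, df = ttest$parameter)    # tabulated t for alpha = 0.05

# The two rules agree: |t| below the critical value, p-value above 0.05
fail_to_reject <- abs(t_stat) < t_crit && ttest$p.value > 0.05
fail_to_reject
```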

The paired t-test in R

  • New trainer achieved new times with the athletes' team
  • Times improved
  • Test alternative hypothesis \(H_1\) of improvement in times
  • In R we add the argument alt = "less"

The paired t-test in R

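# 100 m times for the same 10 athletes
a <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)  # before training
b <- c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)  # after training

# One-sided paired t-test; alternative: mean of a - b is less than 0
ttest <- t.test(a, b, paired = TRUE, alt = "less")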

The paired t-test in R

ttest
## 
##  Paired t-test
## 
## data:  a and b
## t = 5.2671, df = 9, p-value = 0.9997
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 2.170325
## sample estimates:
## mean of the differences 
##                    1.61

The paired t-test in R

  • We asked R to check whether the mean of values in a is less than the mean of values in b
  • alternative hypothesis: true difference in means is less than 0
  • \(p\)-value = 0.9997, well above 0.05
  • We can't reject \(H_0\)

The paired t-test in R

  • Now we ask R to evaluate with the argument alt="greater"
    • Ask R to check whether the mean of values contained in a is greater than the mean of values in b
    • We expect a p-value below 0.05, which would let us reject \(H_0\)

The paired t-test in R

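# 100 m times for the same 10 athletes
a <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)  # before training
b <- c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)  # after training

# One-sided paired t-test; alternative: mean of a - b is greater than 0
ttest <- t.test(a, b, paired = TRUE, alt = "greater")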

The paired t-test in R

ttest
## 
##  Paired t-test
## 
## data:  a and b
## t = 5.2671, df = 9, p-value = 0.0002579
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  1.049675      Inf
## sample estimates:
## mean of the differences 
##                    1.61

The paired t-test in R

  • \(t = 5.2671\)
  • Alternative hypothesis: true difference in means is greater than 0
  • \(p\)-value = 0.0002579
  • Now we reject \(H_0\) and accept \(H_A\)
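
A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this on the same data; it spells out the full argument name `alternative` rather than the abbreviation `alt`.

```r
a <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)  # before
b <- c(12.0, 12.2, 11.2, 13.0, 15.0, 15.8, 12.2, 13.4, 12.9, 11.0)  # after

# Paired test as in the slides
paired <- t.test(a, b, paired = TRUE, alternative = "greater")

# One-sample test on the differences, against a mean of 0
one_sample <- t.test(a - b, mu = 0, alternative = "greater")

c(paired = paired$p.value, one_sample = one_sample$p.value)
```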

References