January 15, 2018

Overview

Plan for today

  • Hypothesis Testing
  • Difference in Means
  • Regressions

Hypothesis Testing

Why do we need hypothesis testing?

Purpose of Hypothesis Testing

  • We do not know the true population parameters (we only observe a sample)
  • Using what we observe in the sample, we can approximate the true population parameters

Two Sides of a Coin

  • Either the two groups have different means (or the same)
  • Either Aly has a higher mean than Bill (or Bill \(\geq\) Aly)
  • Either Bill has a higher mean than Aly (or Aly \(\geq\) Bill)
  • All three of these are examples of testable hypotheses!

Using Hypothesis Testing Terminology

  • In hypothesis testing, there are always two hypotheses
  • \(H_0:\) The Null Hypothesis
  • \(H_a:\) The Alternative Hypothesis
  • The Null Hypothesis is what you are actually testing
  • The Alternative Hypothesis is the other possible situation
  • Typically the Null Hypothesis is assumed true until the evidence leads us to reject it

An Example of Hypothesis Testing

  • Let's use the first example from the last slide:
  • "Either the two groups have different means (or the same)"
  • \(H_0: mean(Aly) = mean(Bill)\)
  • \(H_a: mean(Aly) \ne mean(Bill)\)

How to Disprove the Null Hypothesis

We try to minimize Type I errors (rejecting the null hypothesis when it is actually true)


  • If the evidence is not strong enough to reject our null hypothesis, we say:
  • "We failed to reject the null hypothesis"
  • NOT "We proved the null hypothesis"

Rejection Regions

  • Values of the test statistic extreme enough to make us reject \(H_0\) form the rejection region
  • Using Probability Theory to Measure Type I Errors
  • Sample vs Population

Actual Calculations

Formula for differences in means:

\(\frac{{\overline x_1 - \overline x_2}}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}} = t\)

Checking if a mean of one group is greater than a value (V):

\(\frac{{\overline x - V}}{\sqrt{\frac{\sigma^2}{n}}} = t\)
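
A hedged sketch of this one-sample version in R: it maps onto t.test via the mu argument (the comparison value 50 below is purely hypothetical):
t.test(A, mu = 50)  # tests H0: mean(A) = 50
(mean(A) - 50) / sqrt(sd(A)^2 / length(A))  # the same t computed by hand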

Implementation in R: Part 1

Means of Groups:
mean(A) ; mean(B)
## [1] 49.88979
## [1] 44.47433
Standard Deviation of Groups:
sd(A) ; sd(B)
## [1] 4.247403
## [1] 3.729741
Sample Size of Groups:
sum(!is.na(A)) ; sum(!is.na(B))
## [1] 30
## [1] 30

Implementation in R: Part 2

  • \(\frac{{\overline x_1 - \overline x_2}}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}} = t\)
  • \(\frac{49.89_A - 44.47_B}{\sqrt{\frac{4.25^2_A}{30_A}+\frac{3.73^2_B}{30_B}}} = t\)
  • \(|t| = 5.2475\)


  • Since \(|t|\) exceeds the two-sided 5% critical value of about 1.96, we reject \(H_0\): the difference is significant at the 95% confidence level!
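
A sketch of the same calculation in R, reusing the quantities from Part 1:
n_A <- sum(!is.na(A)) ; n_B <- sum(!is.na(B))
t_stat <- (mean(A) - mean(B)) / sqrt(sd(A)^2 / n_A + sd(B)^2 / n_B)
abs(t_stat)  # matches the |t| = 5.2475 above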

Implementation in R: Part 3

  • This is more simply done with the t.test function:
t.test(A, B)
## 
##  Welch Two Sample t-test
## 
## data:  A and B
## t = 5.2475, df = 57.047, p-value = 2.358e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.348931 7.481990
## sample estimates:
## mean of x mean of y 
##  49.88979  44.47433
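
The same numbers can also be pulled out of the returned object programmatically; a small sketch:
res <- t.test(A, B)
res$statistic  # the t value, 5.2475
res$p.value    # the p-value, 2.358e-06
res$conf.int   # the 95 percent confidence interval for the difference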

Show on Distribution

  • Looking at the Distribution
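
One hedged way to draw this in base R, using the df = 57 from the Welch output above:
# Density of the t distribution, with the rejection cutoffs and observed t marked
curve(dt(x, df = 57), from = -6, to = 6, xlab = "t", ylab = "Density")
abline(v = c(-1.96, 1.96), lty = 2)  # approximate two-sided 5% cutoffs
abline(v = 5.2475, col = "red")      # observed t falls far in the tail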

One vs Two Sided Tests

  • Two different kinds of hypothesis tests
  • A two-sided test has an \(H_0\) with an "=" sign; a one-sided test has an \(H_0\) with "\(\geq\)" or "\(\leq\)"
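
In t.test, this distinction is controlled by the alternative argument; a sketch:
t.test(A, B)                           # two-sided: Ha is mean(A) != mean(B)
t.test(A, B, alternative = "greater")  # one-sided: Ha is mean(A) > mean(B)
t.test(A, B, alternative = "less")     # one-sided: Ha is mean(A) < mean(B)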

Assumptions of Difference of Means

  • The two populations (not samples) are normally distributed
  • Each value is sampled independently from each other value. This assumption requires that each subject provide only one value. If a subject provides two scores, then the scores are not independent.

Testing Assumptions of A

Testing Assumptions of B
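
A sketch of how the normality checks for A and B might be run in base R (qqnorm/qqline and shapiro.test are standard stats functions):
# Visual check: points should fall near the reference line if normal
qqnorm(A) ; qqline(A)
qqnorm(B) ; qqline(B)
# Formal check: a small p-value suggests a departure from normality
shapiro.test(A)
shapiro.test(B)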

If Assumptions Fail Tests

  • The first assumption, if violated, increases the chances of rejecting the null hypothesis when it is true
  • The second assumption, if violated, severely undermines the integrity of the test
  • If the data are not normally distributed, assign a penalty to your difference test (for example, demand a stricter significance threshold)
  • If the samples are not independent, aggregate the data by individual, as sketched below
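
A minimal sketch of that aggregation, assuming a hypothetical data frame dat2 with columns id (subject) and score (repeated measurements):
# Collapse repeated measures so each subject contributes exactly one value
per_person <- aggregate(score ~ id, data = dat2, FUN = mean)
# per_person$score can now be used safely in a difference-of-means test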

Regressions

For Assessing Relationships

  • We want to know IF there is a relationship between X and Y
  • What DIRECTION is the relationship?
  • What is the MAGNITUDE of the relationship?
  • Is the relationship STATISTICALLY SIGNIFICANT?

Some Motivation

  • Perhaps we want to know if urban communities are more at risk for Cirrhosis
  • The scatter-plot seems to indicate a positive correlation

Why not use correlation/difference in means?

  • What other variables might be responsible?
  • Liquor consumption
  • Wine consumption
  • Let's try adding them to the model

Piecing Together the Puzzle

  • \(Cirrhosis = \square + \square Urban + \square Wine + \square Liquor\)
  • Add an error term, and that is a regression model!

Example plot

Although for each variable in the regression we increase dimensions, we can still look at a cross section of our plot.

How it Works

  • Minimizes the sum of squared vertical distances between the fitted line and the observations (ordinary least squares)
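
The output on the next slide comes from fitting this model with lm and printing its summary; re-creating the call (mod is the object name the later summary(mod) lines assume):
mod <- lm(Cirrhosis_death_rate ~ Pct_urban + Liquor_consumption_per_capita +
    Wine_consumption_per_capita, data = dat)
summary(mod)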

Model Interpretation 1

## 
## Call:
## lm(formula = Cirrhosis_death_rate ~ Pct_urban + Liquor_consumption_per_capita + 
##     Wine_consumption_per_capita, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5939  -5.0002   0.7397   7.2051  18.1331 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.8706     7.1618   0.540 0.591738    
## Pct_urban                       0.4965     0.1414   3.512 0.001078 ** 
## Liquor_consumption_per_capita   0.2286     0.1002   2.281 0.027702 *  
## Wine_consumption_per_capita     1.6008     0.3919   4.085 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 42 degrees of freedom
## Multiple R-squared:  0.796,  Adjusted R-squared:  0.7814 
## F-statistic: 54.62 on 3 and 42 DF,  p-value: 1.503e-14

Model Interpretation 2

  • How well does our model explain the variance in Y (Cirrhosis Death Rate)?
summary(mod)$r.squared
## [1] 0.7959899
summary(mod)$adj.r.squared
## [1] 0.7814178

Why linear regression

  • Other models may be better for prediction
  • A linear model is better for interpreting the relationship
  • We don't have to worry as much about over-fitting

Regression Assumptions

  • Direction of causality
  • Omitted variable bias
  • Linearity of Parameters
  • Conditional Expected Value of Error is 0
  • No Heteroskedasticity
  • No Perfect Collinearity
  • Data is Randomly 'Drawn' from the Population

How to test assumptions: Part 1

Direction of Causality

  • Need to know if X causes Y
  • Or if Y causes X

How to test assumptions: Part 2

Omitted variable bias

  • Is there something else that is related to our dependent variable not in the model?
  • If so, it can skew the coefficients and our results

How to test assumptions: Part 3

Linearity of Parameters

  • Is the model linear in its coefficients?
  • The predictors can still enter through a log or polynomial transformation, as sketched below
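
A sketch with hypothetical variables y and x; both calls are still linear in their coefficients:
lm(y ~ log(x))      # logarithmic relationship between x and y
lm(y ~ x + I(x^2))  # polynomial (quadratic) relationship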

How to test assumptions: Part 4

Conditional Expected Value of Error is 0

  • Error is centered around 0
  • Is violated with omitted variables
  • An improperly transformed variable can skew the error as well

How to test assumptions: Part 5

No Heteroskedasticity

  • The variance of the error term should be constant across all fitted values
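
A common visual check, sketched here for the mod object fit earlier:
# Residuals vs fitted values: a funnel shape suggests heteroskedasticity
plot(fitted(mod), resid(mod), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)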

How to test assumptions: Part 6

No Perfect Collinearity

  • Example: the male and female dummy columns below always sum to 1, so including both alongside an intercept is perfect collinearity (the dummy variable trap)

##    male female
## 1:    1      0
## 2:    0      1
## 3:    1      0
## 4:    0      1
## 5:    0      1
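
R guards against this automatically: when two columns are perfectly collinear, lm drops one and reports NA for its coefficient. A sketch with hypothetical data:
male <- c(1, 0, 1, 0, 0)
female <- 1 - male            # male + female always equals 1
y <- rnorm(5)
coef(lm(y ~ male + female))   # the female coefficient comes back NA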

How to test assumptions: Part 7

Data is Randomly 'Drawn' from the Population

  • No selection bias
  • No multiple observations from one individual