January 15, 2018

Overview

Plan for today

  • Hypothesis Testing
  • Difference in Means
  • Regressions

Hypothesis Testing

Why do we need hypothesis testing?

Purpose of Hypothesis Testing

  • We do not know the true population parameters (we only observe a sample)
  • Using what we observe in the sample, we can approximate the true population parameters

Two Sides of a Coin

  • Either the two groups have different means (or the same)
  • Either Aly has a higher mean than Bill (or Bill \(\geq\) Aly)
  • Either Bill has a higher mean than Aly (or Aly \(\geq\) Bill)
  • All three of these are examples of testable hypotheses!

Using Hypothesis Testing Terminology

  • In hypothesis testing, there are always two hypotheses
  • \(H_0:\) The Null Hypothesis
  • \(H_a:\) The Alternative Hypothesis
  • The Null Hypothesis is what you are actually testing
  • The Alternative Hypothesis is the other possible situation
  • Typically the Null Hypothesis is assumed true until the evidence leads us to reject it

An Example of Hypothesis Testing

  • Let's use the first example from the last slide:
  • "Either the two groups have different means (or the same)"
  • \(H_0: mean(Aly) = mean(Bill)\)
  • \(H_a: mean(Aly) \ne mean(Bill)\)

How to Disprove the Null Hypothesis

We try to minimize Type I errors (rejecting the null hypothesis when it is actually true)


  • If the evidence is not strong enough to reject our null hypothesis, we say:
  • "We failed to reject the null hypothesis"
  • NOT "We proved the null hypothesis"

Rejection Regions

  • Values of the test statistic extreme enough to make us reject \(H_0\) form the rejection region
  • Using Probability Theory to Measure Type I Errors
  • Sample vs Population

Actual Calculations

Formula for differences in means:

\(\frac{{\overline x_1 - \overline x_2}}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}} = t\)

Checking if a mean of one group is greater than a value (V):

\(\frac{{\overline x - V}}{\sqrt{\frac{\sigma^2}{n}}} = t\)
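
A hedged sketch of this one-sample version in R: it maps onto t.test via the mu argument (the comparison value 50 below is purely hypothetical):
t.test(A, mu = 50)  # tests H0: mean(A) = 50
(mean(A) - 50) / sqrt(sd(A)^2 / length(A))  # the same t computed by hand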

Implementation in R: Part 1

Means of Groups:
mean(A) ; mean(B)
## [1] 49.88979
## [1] 44.47433
Standard Deviation of Groups:
sd(A) ; sd(B)
## [1] 4.247403
## [1] 3.729741
Sample Size of Groups:
sum(!is.na(A)) ; sum(!is.na(B))
## [1] 30
## [1] 30

Implementation in R: Part 2

  • \(\frac{{\overline x_1 - \overline x_2}}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}} = t\)
  • \(\frac{49.89_A - 44.47_B}{\sqrt{\frac{4.25^2_A}{30_A}+\frac{3.73^2_B}{30_B}}} = t\)
  • \(|t| = 5.2475\)


  • Since \(|t|\) exceeds the two-sided 5% critical value of about 1.96, we reject \(H_0\): the difference is significant at the 95% confidence level!
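
A sketch of the same calculation in R, reusing the quantities from Part 1:
n_A <- sum(!is.na(A)) ; n_B <- sum(!is.na(B))
t_stat <- (mean(A) - mean(B)) / sqrt(sd(A)^2 / n_A + sd(B)^2 / n_B)
abs(t_stat)  # matches the |t| = 5.2475 above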

Implementation in R: Part 3

  • This is more simply done with the t.test function:
t.test(A, B)
## 
##  Welch Two Sample t-test
## 
## data:  A and B
## t = 5.2475, df = 57.047, p-value = 2.358e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.348931 7.481990
## sample estimates:
## mean of x mean of y 
##  49.88979  44.47433
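
The same numbers can also be pulled out of the returned object programmatically; a small sketch:
res <- t.test(A, B)
res$statistic  # the t value, 5.2475
res$p.value    # the p-value, 2.358e-06
res$conf.int   # the 95 percent confidence interval for the difference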

Show on Distribution

  • Looking at the Distribution
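
One hedged way to draw this in base R, using the df = 57 from the Welch output above:
# Density of the t distribution, with the rejection cutoffs and observed t marked
curve(dt(x, df = 57), from = -6, to = 6, xlab = "t", ylab = "Density")
abline(v = c(-1.96, 1.96), lty = 2)  # approximate two-sided 5% cutoffs
abline(v = 5.2475, col = "red")      # observed t falls far in the tail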

One vs Two Sided Tests

  • Two different kinds of hypothesis tests
  • A two-sided test has an \(H_0\) with an "=" sign; a one-sided test has an \(H_0\) with "\(\geq\)" or "\(\leq\)"
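
In t.test, this distinction is controlled by the alternative argument; a sketch:
t.test(A, B)                           # two-sided: Ha is mean(A) != mean(B)
t.test(A, B, alternative = "greater")  # one-sided: Ha is mean(A) > mean(B)
t.test(A, B, alternative = "less")     # one-sided: Ha is mean(A) < mean(B)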

Assumptions of Difference of Means

  • The two populations (not samples) are normally distributed
  • Each value is sampled independently from each other value. This assumption requires that each subject provide only one value. If a subject provides two scores, then the scores are not independent.

Testing Assumptions of A

Testing Assumptions of B
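
A sketch of how the normality checks for A and B might be run in base R (qqnorm/qqline and shapiro.test are standard stats functions):
# Visual check: points should fall near the reference line if normal
qqnorm(A) ; qqline(A)
qqnorm(B) ; qqline(B)
# Formal check: a small p-value suggests a departure from normality
shapiro.test(A)
shapiro.test(B)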

If Assumptions Fail Tests

  • The first assumption, if violated, increases the chances of rejecting the null hypothesis when it is true
  • The second assumption, if violated, severely undermines the integrity of the test
  • If the data are not normally distributed, assign a penalty to your difference test (for example, demand a stricter significance threshold)
  • If the samples are not independent, aggregate the data by individual, as sketched below
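
A minimal sketch of that aggregation, assuming a hypothetical data frame dat2 with columns id (subject) and score (repeated measurements):
# Collapse repeated measures so each subject contributes exactly one value
per_person <- aggregate(score ~ id, data = dat2, FUN = mean)
# per_person$score can now be used safely in a difference-of-means test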

Regressions

For Assessing Relationships

  • We want to know IF there is a relationship between X and Y
  • What DIRECTION is the relationship?
  • What is the MAGNITUDE of the relationship?
  • Is the relationship STATISTICALLY SIGNIFICANT?

Some Motivation

  • Perhaps we want to know if urban communities are more at risk for Cirrhosis
  • The scatter-plot seems to indicate a positive correlation

Why not use correlation/difference in means?

  • What other variables might be responsible?
  • Liquor consumption
  • Wine consumption
  • Let's try adding them to the model

Piecing Together the Puzzle

  • \(Cirrhosis = \square + \square Urban + \square Wine + \square Liquor\)
  • Add an error term, and that is a regression model!

Example plot

Although for each variable in the regression we increase dimensions, we can still look at a cross section of our plot.

How it Works

  • Minimizes the sum of squared vertical distances between the fitted line and the observations (ordinary least squares)
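
The output on the next slide comes from fitting this model with lm and printing its summary; re-creating the call (mod is the object name the later summary(mod) lines assume):
mod <- lm(Cirrhosis_death_rate ~ Pct_urban + Liquor_consumption_per_capita +
    Wine_consumption_per_capita, data = dat)
summary(mod)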

Model Interpretation 1

## 
## Call:
## lm(formula = Cirrhosis_death_rate ~ Pct_urban + Liquor_consumption_per_capita + 
##     Wine_consumption_per_capita, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5939  -5.0002   0.7397   7.2051  18.1331 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.8706     7.1618   0.540 0.591738    
## Pct_urban                       0.4965     0.1414   3.512 0.001078 ** 
## Liquor_consumption_per_capita   0.2286     0.1002   2.281 0.027702 *  
## Wine_consumption_per_capita     1.6008     0.3919   4.085 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 42 degrees of freedom
## Multiple R-squared:  0.796,  Adjusted R-squared:  0.7814 
## F-statistic: 54.62 on 3 and 42 DF,  p-value: 1.503e-14

Model Interpretation 2

  • How well does our model explain the variance in Y (Cirrhosis Death Rate)?
summary(mod)$r.squared
## [1] 0.7959899
summary(mod)$adj.r.squared
## [1] 0.7814178

Why linear regression

  • Other models may be better for prediction
  • A linear model is better for interpreting the relationship
  • We don't have to worry as much about over-fitting

Regression Assumptions

  • Direction of causality
  • Omitted variable bias
  • Linearity of Parameters
  • Conditional Expected Value of Error is 0
  • No Heteroskedasticity
  • No Perfect Collinearity
  • Data is Randomly 'Drawn' from the Population

How to test assumptions: Part 1

Direction of Causality

  • Need to know if X causes Y
  • Or if Y causes X

How to test assumptions: Part 2

Omitted variable bias

  • Is there something else that is related to our dependent variable not in the model?
  • If so, it can skew the coefficients and our results

How to test assumptions: Part 3

Linearity of Parameters

  • Is the model linear in its coefficients?
  • The predictors can still enter through a log or polynomial transformation, as sketched below
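
A sketch with hypothetical variables y and x; both calls are still linear in their coefficients:
lm(y ~ log(x))      # logarithmic relationship between x and y
lm(y ~ x + I(x^2))  # polynomial (quadratic) relationship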

How to test assumptions: Part 4

Conditional Expected Value of Error is 0

  • Error is centered around 0
  • Is violated with omitted variables
  • An improperly transformed variable can skew the error as well

How to test assumptions: Part 5

No Heteroskedasticity

  • The variance of the error term should be constant across all fitted values
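
A common visual check, sketched here for the mod object fit earlier:
# Residuals vs fitted values: a funnel shape suggests heteroskedasticity
plot(fitted(mod), resid(mod), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)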

How to test assumptions: Part 6

No Perfect Collinearity

  • Example: the male and female dummy columns below always sum to 1, so including both alongside an intercept is perfect collinearity (the dummy variable trap)

##    male female
## 1:    1      0
## 2:    0      1
## 3:    1      0
## 4:    0      1
## 5:    0      1
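
R guards against this automatically: when two columns are perfectly collinear, lm drops one and reports NA for its coefficient. A sketch with hypothetical data:
male <- c(1, 0, 1, 0, 0)
female <- 1 - male            # male + female always equals 1
y <- rnorm(5)
coef(lm(y ~ male + female))   # the female coefficient comes back NA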

How to test assumptions: Part 7

Data is Randomly 'Drawn' from the Population

  • No selection bias
  • No multiple observations from one individual