Ch1

  1. Quantitative: A variable that takes on numerical values on which it makes sense to perform arithmetic operations.
  2. Categorical: A categorical variable places individuals in one of several groups or categories. These variables are also known as qualitative.
    Types of Categorical Variables:

ordinal: A categorical variable whose categories can be arranged in a meaningful rank order   Examples)
- Size of the shirts (‘S’,‘M’,‘L’)
- Class standing (‘Senior’,‘Junior’,‘Sophomore’,‘Freshman’)

nominal: A categorical variable whose categories cannot be arranged in a meaningful rank order Examples)
- Race (‘Asian’ ,‘White’)
- Religion

A response variable measures an outcome of a study (y). An explanatory variable explains or influences changes in a response variable (x, or \(x_1, x_2, \ldots, x_k\)).

When the response variable Y is quantitative, depending on whether the explanatory variable(s) \(X_i\) are quantitative or categorical, some candidate methods for statistical inference are:
- z test
- t test
- ANOVA
- Linear Regression

When the response variable Y is categorical, depending on whether the explanatory variable(s) \(X_i\) are quantitative or categorical, some candidate methods for statistical inference are:
- z test
- chi square
- Logistic Regression

The Binomial Distribution

The binomial distribution applies when the following 4 conditions hold for an experiment:
1. Fixed number of trials (n)
2. Each trial leads to ‘success’ or ‘failure’
3. Probability of success (\(\pi\)) is fixed from trial to trial
4. Trials are independent


If the number of trials is smaller than expected success

Example: Five fair coins are flipped independently. What is the probability that we obtain exactly two heads?
Solution: For this problem, we can use the binomial distribution (be sure to check that all four conditions of the binomial are satisfied). Letting Y = number of heads observed, we see that Y ∼ Bin(n = 5, π = 0.5). So, we simply need to find P(2).
\(P(y) = \dfrac{n!}{y!(n-y)!} \pi^y(1-\pi)^{n-y}\)
\(P(2) = \dfrac{5!}{2!(5-2)!} 0.5^2(1-0.5)^{5-2}\)
density of binomial (dbinom)

dbinom(x = 2, size = 5, prob=0.5)
## [1] 0.3125

The Expected Value of Y is E(Y) =\(\mu\) =\(n\pi\)
The Variance of Y is V(Y) = \(\sigma^2\) = \(n\pi(1-\pi)\)
The Standard Deviation of Y is SD(Y) =\(\sigma\) =\(\sqrt{n\pi(1-\pi)}\)
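As a quick check of these formulas, here is a minimal R sketch (using the n = 5, \(\pi\) = 0.5 coin example above; the object names are only for illustration) that recovers the moments numerically from dbinom():

n <- 5; prob <- 0.5
y <- 0:n
mu <- sum(y * dbinom(y, n, prob))            # E(Y) = n*prob = 2.5
v  <- sum((y - mu)^2 * dbinom(y, n, prob))   # V(Y) = n*prob*(1-prob) = 1.25
c(mu, v, sqrt(v))                            # mean, variance, SD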

The statistic \(\hat\theta\) (an estimator of the population parameter \(\theta\)) will often be called a point estimator, and the actual observed value of the statistic is called a point estimate.

Example: Suppose we are interested in the average monthly income of all employed students at the university. We will denote this unknown parameter as\(\theta\). If we secure a sample of 100 students through a random fashion, we can consider several different options in determining a good guess for\(\theta\). Let’s list some of them here:
-\(\theta\) = true avg monthly income
-\(\hat\theta_1\) =\(\overline x\) = sample mean
-\(\hat\theta_2\) = sample median
-\(\hat\theta_3\) = MAX(\(\overline x_i\)) * 250

If E(\(\hat\theta\)) = \(\theta\), then \(\hat\theta\) is unbiased.

Our model of interest will be the Bin(n, π) distribution where π will be our unknown parameter. The Binomial distribution will be the foundation used to construct inferential tools such as:

1. \(\hat\pi\), a point estimator of \(\pi\)
2. Confidence interval for \(\pi\)
3. Hypothesis test based on\(\pi\)

Natural Estimator for \(\pi\)

Let Y be the number of successes out of n trials from the binomial experiment. A natural estimator for π would be: \(\hat\pi\) =\(\dfrac{y}{n}\) = p (sample proportion)

Is this estimator unbiased for\(\pi\) ? Is E(\(\hat\pi\)) =\(\pi\)
E(\(\hat\pi\)) = E(p) = E(y/n) =\(\dfrac{1}{n} E(y)\) =\(\dfrac{1}{n} \pi n\) =\(\pi\)
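A minimal simulation sketch (the values n = 20 and \(\pi\) = 0.3 and the object names are chosen only for illustration) that supports this unbiasedness result:

set.seed(1)
n <- 20; true.pi <- 0.3
p <- rbinom(100000, size = n, prob = true.pi)/n   # 100,000 sample proportions
mean(p)                                           # very close to 0.3, consistent with E(p) = pi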

Example) Four people are at a basketball court: Stephen Curry (NBA Pro), Shaquille O’Neal (former NBA Pro), Jimmy Doi (non-NBA Pro), and Elsa (Jimmy’s 72 year old next door neighbor)

Fact:

Curry O’Neal Doi Elsa
90% 50% 70% 10%


n = 10 , y = 7

Likelihood: \(l(\pi) = \dfrac{10!}{7!(10-7)!} \pi^7(1-\pi)^{10-7}\)

Goal: Try to find the optimal \(\pi\) that BEST FITS the observed data y.

Locate the maximum of the likelihood function (the peak of the graph) to find the best \(\pi\); here the maximum occurs at \(\pi = 0.7\).
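A minimal sketch (using the n = 10, y = 7 data above) that evaluates the likelihood on a grid of \(\pi\) values and confirms the maximum at 0.7:

pi.grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(7, size = 10, prob = pi.grid)   # same expression as the likelihood above
pi.grid[which.max(lik)]                       # 0.7, i.e. y/n
# plot(pi.grid, lik, type = "l") would display the likelihood curve with its peak at 0.7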

The Maximum Likelihood Estimate
\(\hat\pi= p = y/n\)
p is the MLE of\(\pi\) =\(\hat\pi_{mle}\)
mean:E(p) =\(\pi\)
variance: V(p) =\(\dfrac{\pi(1-\pi)}{n}\)

A maximum likelihood estimator enjoys some nice mathematical and optimality properties. If the sample size is large, then
1. MLE is approx. unbiased, if not unbiased already
2. MLE achieves minimum variance among all unbiased estimators
3. MLE is approx. normally distributed

Sampling distribution of p: p ~ N(\(\pi\),\(\sqrt{ \dfrac{\pi(1-\pi)}{n}}\)), so the standardized variable is

Z =\(\dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}}\) ~ N(0, 1) if n is large

Confidence Interval of\(\pi\)

Confidence level (C): \(C = 1 - \alpha\)

qnorm(0.975, 0, 1)
## [1] 1.959964

Interpretation of confidence level: the desired percent of confidence intervals generated from repeated sampling that should capture the parameter of interest.
Interpretation of coverage probability: the actual percent of confidence intervals generated from repeated sampling that capture the parameter of interest.

Wald Method

We will now consider different CIs for π:
\(P(-1.96 < Z < 1.96) = 0.95\) where Z ~ N(0, 1)
Since
\(Z = \dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}}\), this becomes

\(-1.96< \dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}} < 1.96\)

Now replace \(\pi\) with \(p\) in the standard error (since \(\hat\pi_{mle} = p\)):

\(-1.96< \dfrac{p-\pi}{\sqrt{ \dfrac{p(1-p)}{n}}} < 1.96\)

\(\Rightarrow p-1.96 \sqrt{ \dfrac{p(1-p)}{n}}< \pi < p+1.96 \sqrt{ \dfrac{p(1-p)}{n}}\)

95% CI for\(\pi\):\(p\pm1.96 \sqrt{ \dfrac{p(1-p)}{n}}\)

In other words:
\(p\pm z_{\alpha/2} \sqrt{ \dfrac{p(1-p)}{n}}\)
Pros:
- Easy to use

Cons:
- Does not work well when \(\pi\) is far from 0.5
- Does not work well when n is small (use when \(n\pi \geq 5\) and \(n(1-\pi)\geq5\))
- The actual coverage probability may be very different from the advertised confidence level
- Does not work when y = 0 (the interval has zero width)
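A minimal simulation sketch (\(\pi\) = 0.1 and n = 20 are chosen only to illustrate a poorly behaved case) that estimates the actual coverage probability of the 95% Wald interval:

set.seed(1)
n <- 20; true.pi <- 0.1
y <- rbinom(50000, n, true.pi)
p <- y/n
lower <- p - 1.96*sqrt(p*(1 - p)/n)
upper <- p + 1.96*sqrt(p*(1 - p)/n)
mean(lower < true.pi & true.pi < upper)   # typically well below 0.95 (liberal coverage)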

Score Interval

\(P(-1.96 < Z < 1.96) = 0.95\) where Z ~ N(0, 1)

\(-1.96< \dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}} < 1.96\)

\(-1.96\sqrt{ \dfrac{\pi(1-\pi)}{n}}< p-\pi < 1.96\sqrt{ \dfrac{\pi(1-\pi)}{n}}\)

Solve for \(\pi\) (the endpoints are the two roots of a quadratic in \(\pi\)); see the sketch below.
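Solving the quadratic gives closed-form endpoints (the Wilson interval). A minimal sketch, using y = 80 and n = 124 only so the result can be compared with the binom.confint() output further below:

y <- 80; n <- 124; z <- 1.96
p <- y/n
center <- (p + z^2/(2*n)) / (1 + z^2/n)
halfwidth <- (z/(1 + z^2/n)) * sqrt(p*(1 - p)/n + z^2/(4*n^2))
c(center - halfwidth, center + halfwidth)   # roughly (0.558, 0.724), the Score/Wilson CI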


Pros:
- Works even when \(\pi\) is far from 0.5
- Works even when n is small

Agresti-Coull Interval / Adjusted Wald

The Wald Interval

\(p\pm1.96 \sqrt{ \dfrac{p(1-p)}{n}}\)
\(p = y/n\)

The A-C Interval
\(\tilde{p}\pm1.96 \sqrt{ \dfrac{\tilde{p}(1-\tilde{p})}{n+4}}\)

\(\tilde{p} = \dfrac{y+2}{n+4}\)
Note: Liberal Coverage Probability. A common problem with all three of the confidence intervals discussed so far is that they can have liberal coverage probabilities. In other words, the actual coverage probabilities may be less than the advertised confidence level.

y 1 2 3 4 5 6 7 8
Low 0 0.012 0.080 0.160 0.249 0.346 0.454 0.572
Up 0.293 0.428 0.546 0.654 0.751 0.840 0.920 0.988

Note: the coverage probability jumps as \(\pi\) changes, because the set of y values whose intervals cover \(\pi\) changes abruptly at interval endpoints.

FACT: The vertical jumps of the coverage probability function perfectly align with the start/end of a new confidence interval.

Clopper-Pearson interval

  • Strict CI
  • The actual coverage probability can be MUCH higher than the advertised confidence level
  • CON: Intervals tend to be long since the coverage probability is high

LCO interval

  • Keeps the coverage probability at least 95%
  • Keeps the intervals short

ALCO interval

Puts the average coverage probability at 95%

Summary Table

Approx. Method Average Coverage
Wald 76.926%
Score 95.408%
Agresti-Coull 96.452%
ALCO 95.000%
# binom.confint() comes from the "binom" package
library(binom)
# Wald CI: method = "asymptotic" 
binom.confint(80, 124, conf.level = 0.95, method = "asymptotic") 
##       method  x   n      mean     lower     upper
## 1 asymptotic 80 124 0.6451613 0.5609468 0.7293758
# Wilson (also known as Score) CI: method = "wilson" 
binom.confint(80, 124, conf.level = 0.95, method = "wilson") 
##   method  x   n      mean     lower     upper
## 1 wilson 80 124 0.6451613 0.5577452 0.7238536
# Agresti-Coull CI: method = "ac" 
binom.confint(80, 124, conf.level = 0.95, method = "ac") 
##          method  x   n      mean     lower     upper
## 1 agresti-coull 80 124 0.6451613 0.5576342 0.7239646
# Clopper-Pearson CI: method = "exact"
binom.confint(80, 124, conf.level = 0.95, method = "exact")
##   method  x   n      mean     lower     upper
## 1  exact 80 124 0.6451613 0.5542296 0.7289832

Which method (approx. or strict) yields shorter intervals? Why does this make sense?
The approximate methods yield shorter intervals because of their liberal coverage.
Among the approximate methods, which yielded the biggest length? Which yielded the shortest length? Is this surprising?
ALCO: shortest
Wald: longest, which is surprising given its liberal coverage
Among the strict methods, which yielded the biggest length? Which yielded the shortest length? Is this surprising?

Clopper-Pearson: longer (as expected)
LCO: shorter (as expected)

Hypothesis Test for\(\pi\)

No method is uniformly most powerful (Related to type II error)

a. Wald Method

H0: \(\pi = \pi_0\). p is the MLE of \(\pi\), with p ~ N(\(\pi\),\(\sqrt{ \dfrac{\pi(1-\pi)}{n}}\))
z =\(\dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}}\)
Assuming H0 is true, z =\(\dfrac{p-\pi_0}{\sqrt{ \dfrac{p(1-p)}{n}}}\)

  • The approximate distribution of Z depends on an important condition: n is large.
  • Rule of thumb: n\(\pi\)\(\ge\) 5 and n(1-\(\pi\))\(\ge\) 5

Mathematical Fact

If Z~N(0,1), then \(Z^2\) ~ \(\chi^2_1\)
In the event that the alternative hypothesis is two-sided (Ha :\(\beta\)\(\ne\)\(\beta_0\)), then we can compute\(Z^2\) and use a table of Chi-square probabilities to determine a corresponding p-value.

Ex) H0:\(\pi\) = 1/2
Ha:\(\pi\)\(\ne\) 1/2
p = 9/10
z (statistic) =\(\dfrac{0.9-0.5}{\sqrt{ \dfrac{0.9(0.1)}{10}}}\) = 4.21637 p-val = 2 * p(z\(\ge\) 4.22) =\(2.48 * 10^{-5}\)
\(\chi^2 = 4.22^2 = 17.8\)
p-val = p(\(\chi^2_1 \ge 17.8\)) =\(2.48 * 10^{-5}\)

# Wald Test Example in Sec. 1.4: n=10, y=9
# TEST H0: pi = 0.5 versus Ha: pi != 0.5
z.wald <- (0.9-.5)/sqrt((.9*.1)/10)
pval.z.wald <- 2*(1-pnorm(z.wald,0,1)) 
pval.z.wald
## [1] 2.482661e-05
#[1] 2.482661e-05

1-pnorm(z.wald,0,1) represents P(Z \(\ge\) z.wald), the upper-tail area of the standard normal; doubling it gives the two-sided p-value.

chisq.wald <- z.wald^2 
#[1] 17.77778

pval.chisq.wald <- 1 - pchisq(chisq.wald,1) 
#> pval.chisq.wald
#[1] 2.482661e-05
# Two p-values above are IDENTICAL
pval.chisq.wald
## [1] 2.482661e-05

1-pchisq(chisq.wald,1) represents P(\(\chi^2_1 \ge\) chisq.wald), the upper-tail chi-square probability; this already equals the two-sided p-value.

b. The Score Method

H0:\(\pi = \pi_0\)
z =\(\dfrac{p-\pi}{\sqrt{ \dfrac{\pi(1-\pi)}{n}}}\)
Assuming H0 is true and substituting \(\pi_0\) for \(\pi\):

z =\(\dfrac{p-\pi_0}{\sqrt{ \dfrac{\pi_0(1-\pi_0)}{n}}}\) = the Score statistic
Ex) H0:\(\pi\) = 1/2 Ha:\(\pi\)\(\ne\) 1/2

z =\(\dfrac{0.9-0.5}{\sqrt{ \dfrac{0.5(0.5)}{10}}}\) = 2.53
p-val = 2 * P(Z \(\ge\) 2.53) = 2(0.0057) = 0.0114

\(z^2 = (2.53)^2 = 6.4\), p-value = \(P(\chi^2_1 \ge 6.4)\) = 0.0114

z.score <- (0.9-.5)/sqrt((.5*.5)/10)
pval.z.score <- 2*(1-pnorm(z.score,0,1)) 
# pval.z.score
# [1] 0.01141204
chisq.score <- z.score**2 
#chisq.score
# [1] 6.4
pval.chisq.score <- 1 - pchisq(chisq.score,1) 
# pval.chisq.score
# [1] 0.01141204
# Two p-values above are IDENTICAL
pval.chisq.score
## [1] 0.01141204

c. The Likelihood Ratio Method

Let \(l(\beta)\) be the generic likelihood function of the observed data.
Let \(l_0\) be the maximized value of the likelihood under the null hypothesis.
Mathematically, \(l_0 = \max_{\beta \in H_0} l(\beta)\).
If the null is very simple, such as H0: \(\beta = \beta_0\), then \(l_0\) is \(l(\beta_0)\).
Let \(l_1\) be the unrestricted maximized value of the likelihood.
Mathematically, \(l_1 = \max_{\beta} l(\beta)\).
If there is only a single parameter \(\beta\), then \(l_1\) is \(l(\hat\beta_{mle})\).
The likelihood ratio test statistic equals:

\(-2log_e( \dfrac{l_0}{l_1}) = -2ln( \dfrac{l_0}{l_1})\)

The approximate distribution of the LRT (Likelihood Ratio Test) statistic is \(\chi^2_1\) (df = 1).
This depends on an important condition: n is large
Ex) H0:\(\pi\) = 1/2 Ha:\(\pi\)\(\ne\) 1/2 n = 10 y = 9

\(l(\pi) = \dfrac{10!}{9!1!}\pi^9(1-\pi)^1\)

\(l_0 = l(\pi_0) = l(1/2) = \dfrac{10!}{9!1!}0.5^90.5^1\) = 0.00977
\(l_1 = l(\hat\pi_{mle}) = l(9/10) = \dfrac{10!}{9!1!}0.9^90.1^1\) = 0.3874
\(-2log_e( \dfrac{l_0}{l_1}) = -2ln(0.0252)\) = 7.36
p-value =\(p(\chi^2_1\ge7.36)\) = 0.0067

# LRT Test Example in Sec. 1.4: n=10, y=9
# TEST H0: pi = 0.5 versus Ha: pi != 0.5
n <- 10
y <- 9
pi.0 <- 0.5
pi.mle <- y/n

l.0 <- factorial(n)/(factorial(y)*factorial(n-y))*(pi.0^y)*((1-pi.0)^(n-y))
l.1 <- factorial(n)/(factorial(y)*factorial(n-y))*(pi.mle^y)*((1-pi.mle)^(n-y)) 
chisq.lrt <- -2*log(l.0/l.1)
# chisq.lrt
# [1] 7.361284
pval.chisq.lrt <- 1 - pchisq(chisq.lrt,1) 
pval.chisq.lrt
## [1] 0.006664318
Wald Test Score Test LR Test
17.8 6.4 7.36

Small Sample Inference (p30)

When the sample size is small, use the actual underlying binomial distribution to perform a hypothesis test. Such a test is known as an Exact Test.

H0:\(\pi = 1/2\)
Let Y be the number of successes out of the 10 fixed trials. What is the distribution of Y?
y ~ Bin(10,\(\pi\))
Assuming H0 is true, what is the distribution of Y? y~Bin(10, 0.5)

Density of the Bin(10, 0.5) Assume we observed y = 9 successes.
a. The alternative is one-sided as in:
\(Ha: \pi>1/2\) p-value =\(p(y \ge 9) = p(9) + p(10) = 0.01+0.001 = 0.011\)
In R,
Test H0 : π = 0.5 versus Ha : π > 0.5
P-value = P(9) + P(10)

dbinom(9,10,0.5) + dbinom(10,10,0.5) 
## [1] 0.01074219
# or  
binom.test(x=9, n=10, p=0.5, alternative="greater")
## 
##  Exact binomial test
## 
## data:  9 and 10
## number of successes = 9, number of trials = 10, p-value = 0.01074
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.6058367 1.0000000
## sample estimates:
## probability of success 
##                    0.9
  b. The alternative is two-sided as in:
    \(Ha: \pi\ne1/2\)
    p-value =\(P(y \ge 9\) or \(y \le 1) = P(9) + P(10) + P(1) + P(0) = 0.011 + 0.011 = 0.022\)
    In R, P-value = P(0) + P(1) + P(9) + P(10)
dbinom(0,10,0.5) + dbinom(1,10,0.5) + dbinom(9,10,0.5) + dbinom(10,10,0.5)
## [1] 0.02148438
# or 
binom.test(x=9, n=10, p=0.5, alternative="two.sided")
## 
##  Exact binomial test
## 
## data:  9 and 10
## number of successes = 9, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5549839 0.9974714
## sample estimates:
## probability of success 
##                    0.9

Ch2 Contingency Table

Notation: Statistic

Belief in Afterlife?

Gender(x) Yes(y) No or Undecided(y)
Female 509 116
Male 398 104

The table above is known as a contingency table
x = gender y = afterlife

When we only have two variables, which yields a table having only two dimensions, we call the resulting table a two-way contingency table. For a two-way table based on variables X and Y:
- Let I denote the number of categories of X (the number of table rows)
- Let J denote the number of categories of Y (the number of table columns)
- We will often refer to the table as an I × J table
- There are a total of IJ outcomes in the table, each position in the table is called a cell.
- Let\(n_{ij}\) be the number of individuals found in row i and column j
- Let\(n_{i+}\) be the row total of individuals for row i
- Let \(n_{+j}\) be the column total of individuals for column j

Gender(x) Yes(y) No or Undecided(y) total
Female 509(\(n_{11}\)) 116(\(n_{12}\)) 625(\(n_{1+}\))
Male 398(\(n_{21}\)) 104(\(n_{22}\)) 502(\(n_{2+}\))
Total 907(\(n_{+1}\)) 220(\(n_{+2}\)) 1127(\(n\))

Marginal frequencies: Totals that appear in the margins of the table

Let \(p_{ij}\) be the sample proportion of individuals found in row i and column j: \(p_{ij} = n_{ij}/n\)

Gender(x) Yes(y) No or Undecided(y) total
Female 509/1127(\(p_{11}\)) 116/1127(\(p_{12}\)) 625/1127(\(p_{1+}\))
Male 398/1127(\(p_{21}\)) 104/1127(\(p_{22}\)) 502/1127(\(p_{2+}\))
Total 907/1127(\(p_{+1}\)) 220/1127(\(p_{+2}\)) 1

Empirical Marginal Distributions: the marginal proportions (\(p_{i+}\) and \(p_{+j}\)) that appear in the margins of the table
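A minimal sketch (assuming the afterlife counts above; the object name is only for illustration) that reproduces the cell proportions and the empirical marginal distributions with prop.table():

afterlife <- matrix(c(509, 116, 398, 104), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Female", "Male"), c("Yes", "No/Undecided")))
prop.table(afterlife)                           # p_ij = n_ij / n
margin.table(afterlife, 1)/sum(afterlife)       # row marginals p_i+
margin.table(afterlife, 2)/sum(afterlife)       # column marginals p_+j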

Notation: Parameters

Let\(\pi_{ij}\) be the probability that an individual will be found in row i and column j
\(\pi_{ij}\) = P(X = i, Y = j)
- Let\(\pi_{i+}\) be the row total of probabilities for row i
- Let\(\pi_{+j}\) be the column total of probabilities for column j

Gender(x) Yes(y) No or Undecided(y) total
Female \(\pi_{11}\) \(\pi_{12}\) \(\pi_{1+}\)
Male \(\pi_{21}\) \(\pi_{22}\) \(\pi_{2+}\)
Total \(\pi_{+1}\) \(\pi_{+2}\) 1

2.2: Comparing Proportions in Two-by-Two Tables

A variable that can only assume one of two outcomes (Y or N) is known as binary or dichotomous.
Let π1 be the probability of success for individuals in group 1
Let π2 be the probability of success for individuals in group 2

\(Y_1\sim Bin(n_1, \pi_1)\)
\(Y_2 \sim Bin(n_2, \pi_2)\)

a. The Difference in Proportions

Let y1 and y2 be the number of observed successes in group 1 and 2 respectively. Let n1 and n2 be the total number of observations in group 1 and 2 respectively. Construct a point estimate for π1 − π2 and use its sampling distribution to construct a confidence interval:

\(p_1 = y_1/n_1, p_2 = y_2/n_2\)
=>\((p_1 - p_2)\) is a point estimate of\((\pi_1 - \pi_2)\)
If \(n_1\) and \(n_2\) are large, then
\((p_1 - p_2)\) ~ N(\(\pi_1 - \pi_2\),\(\sqrt{ \dfrac{\pi_1(1-\pi_1)}{n_1}+ \dfrac{\pi_2(1-\pi_2)}{n_2}}\))
=>\((p_1 - p_2) \pm z_{\alpha/2}\sqrt{ \dfrac{\pi_1(1-\pi_1)}{n_1}+ \dfrac{\pi_2(1-\pi_2)}{n_2}}\)
Replace \((\pi_1, \pi_2)\) with \((p_1, p_2)\):
\((p_1- p_2) \pm z_{\alpha/2}\sqrt{ \dfrac{p_1(1-p_1)}{n_1}+ \dfrac{p_2(1-p_2)}{n_2}}\) is the Wald CI for \(\pi_1 - \pi_2\)

Ex)

Group(x) Yes(y) No(y) Total
Placebo 189 10845 11034
Aspirin 104 10933 11037

Let π1 be the probability of an individual on placebo having a heart attack. Let π2 be the probability of an individual on aspirin having a heart attack. Construct a 95% CI for π1−π2 and interpret the outcome (does it contain 0?). Note: The +1 adjustment NOT required here as n is huge.

\(p_1 = 189/11034\),\(p_2 = 104/11037\)
CI(95%):\((p_1 - p_2) \pm 1.96\sqrt{ \dfrac{p_1(1-p_1)}{n_1}+ \dfrac{p_2(1-p_2)}{n_2}}\) = (0.0047, 0.0107)
Both endpoints are positive, which indicates there is sufficient evidence to suggest \(\pi_1 > \pi_2\) (aspirin helps compared to the placebo).

n11 <-189; n1 <-11034 
pi1.hat <-n11/n1
n21 <-104; n2 <-11037 
pi2.hat <-n21/n2 
c(pi1.hat,pi2.hat)
## [1] 0.01712887 0.00942285
diff <-pi1.hat - pi2.hat 
se.diff <-sqrt(pi1.hat * (1 - pi1.hat)/n1 + pi2.hat * (1 - pi2.hat)/n2) 
level <-.95
alpha <-1-level
WaldCI.diff.lower <-diff - qnorm(1-alpha/2) * se.diff  
WaldCI.diff.upper <-diff + qnorm(1-alpha/2) * se.diff
WaldCI.diff <-c(WaldCI.diff.lower,WaldCI.diff.upper) 
WaldCI.diff
## [1] 0.004687751 0.010724297

OR

# To generate the confidence interval for pi1 - pi2 use
# prop.test(c(n11,n21),c(n1,n2),conf.level=0.95, correct=F)
# where n11, n21 = number of successes for group 1 and 2 respectively
# n1, n2 = sample sizes for group 1 and 2 respectively
# correct=F specifies that a continuity correction SHOULD NOT be used
n11 <-189; n1 <-11034 
n21 <-104; n2 <-11037 
prop.test(c(n11, n21), c(n1, n2), conf.level = 0.95, correct = F)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(n11, n21) out of c(n1, n2)
## X-squared = 25.014, df = 1, p-value = 5.692e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.004687751 0.010724297
## sample estimates:
##     prop 1     prop 2 
## 0.01712887 0.00942285

b. The Ratio of Proportions (Relative Risk)

Suppose a clinical study is measuring the percentage of individuals who have an adverse reaction to one of two drugs (Drug 1 versus Drug 2)

\(\pi_1\) \(\pi_2\) \(\pi_1-\pi_2\) \(\pi_1/\pi_2\)
Scenario 1 0.010 0.001 0.009 10
Scenario 2 0.41 0.401 0.009 1.02

Comments:
1. Both scenarios yield the same difference in probabilities.
2. Scenario 1 is more extreme because of the 10-fold contrast: 10 times as many subjects have an adverse reaction with Drug 1 as with Drug 2.
3. The difference in probabilities fails to capture this contrast.
Tip: Use the ratio when the probabilities are close to zero.

Relative Risk

When each probability π1 and π2 is based on a deleterious(bad) outcome, the ratio is called the relative risk.
Conditional Probabilities

\(\pi_1 = P(\text{heart attack} \mid \text{placebo})\) "prob. of heart attack given placebo"
\(\pi_2 = P(\text{heart attack} \mid \text{aspirin})\)

\(\hat{RR} = p_1/p_2\)

Group(x) Yes(y) No(y) Total
Placebo 189 10845 11034
Aspirin 104 10933 11037

\(p_1 = 0.0171\), \(p_2 = 0.0094\)
\(p_1/p_2 = 1.82\)
The risk of having a heart attack for placebo(\(P_1\)) is 82% higher than for aspirin.

Confidence Interval for\(\pi_1/\pi_2\): A large sample confidence interval for the log of the relative risk is
\(log_e(p_1/p_2) \pm z_{\alpha/2}\sqrt{ \dfrac{1-p_1}{n_1p_1}+ \dfrac{1-p_2}{n_2p_2}}\)

Ex)
Verify that the 95% confidence interval of the relative risk from the Physician’s Health Study is (1.43, 2.30): \(log_e(p_1/p_2) \pm z_{\alpha/2}\sqrt{ \dfrac{1-p_1}{n_1p_1}+ \dfrac{1-p_2}{n_2p_2}}\) = (0.3599, 0.8353). Taking the antilog gives \((e^{0.3599}, e^{0.8353})\) = (1.43, 2.30).
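A minimal sketch verifying this interval by hand (using the counts from the table above):

p1 <- 189/11034; p2 <- 104/11037
log.rr <- log(p1/p2)
se <- sqrt((1 - p1)/(11034*p1) + (1 - p2)/(11037*p2))
exp(log.rr + c(-1.96, 1.96)*se)   # approximately (1.43, 2.30)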

Use riskratio.wald() to generate a CI for the relative risk. IMPORTANT NOTE ABOUT riskratio.wald(): the relative risk is defined as (prop’n of Group.B “Yes”)/(prop’n of Group.A “Yes”).

YES NO
GRP.A 104 10933
GRP.B 189 10845

The data table MUST match the following format: RRtable<-matrix(c(GRP.A.NO,GRP.B.NO,GRP.A.YES,GRP.B.YES),nrow = 2, ncol = 2)

#if (!require("epitools")) install.packages("epitools")
library(epitools)


RRtable<-matrix(c(10933,10845,104,189),nrow = 2, ncol = 2)
riskratio.wald(RRtable) 
## $data
##           Outcome
## Predictor  Disease1 Disease2 Total
##   Exposed1    10933      104 11037
##   Exposed2    10845      189 11034
##   Total       21778      293 22071
## 
## $measure
##           risk ratio with 95% C.I.
## Predictor  estimate    lower    upper
##   Exposed1 1.000000       NA       NA
##   Exposed2 1.817802 1.433031 2.305884
## 
## $p.value
##           two-sided
## Predictor    midp.exact fisher.exact   chi.square
##   Exposed1           NA           NA           NA
##   Exposed2 4.989646e-07 5.032836e-07 5.691897e-07
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

2.3 The Odds Ratio

HIV+ baby HIV- baby Total
Placebo 40 120 160
AZT 13 151 164
Total 53 271 324

π1 = Pr{HIV+ baby|Placebo}, π2 = Pr{HIV+ baby|AZT}
\(p_1 = 40/160\),\(p_2 = 13/164\),\(p_1/p_2 = 3.15\)
A mother on placebo is over 3 times more likely to give birth to an HIV+ baby compared to a mother on AZT.

Odds

The odds of an event is the ratio of the probability that the event happens to the probability that it does not happen. Use the generic notation of π to express the definition of odds. Also, solve for π to express it as a function of the odds.

\(Odds = \pi/(1-\pi)\)
\(\pi = Odds(1-\pi)\)
\(\pi = Odds-Odds\pi\)
\(\pi = Odds/(1+Odds)\)

Ex)
In the game of darts, suppose the probability that I will successfully hit the bullseye is 75%. What is the odds that I will hit the bullseye?

\(\pi = 3/4\), Odds = (3/4)/(1/4) = “3:1”

When odds = 3.0,
we expect to observe three successes for every one failure.
When odds = 1/3,
a failure is three times as likely as a success.

Ex)

HIV+ baby HIV- baby Total
Placebo 40 120 160
AZT 13 151 164
Total 53 271 324

\(\hat{Odds_{AZT}} = (13/164)/(1-(13/164)) = 0.086\)
\(\hat{Odds_{Placebo}} = (40/160)/(1-(40/160)) = 0.333\)

Odds Ratio

The odds ratio is the ratio of the odds under two conditions. It is denoted by\(\theta\)

\(\theta = \dfrac{ \dfrac{\pi_1}{1-\pi_1}}{ \dfrac{\pi_2}{1-\pi_2}}\)
\(\hat\theta = \dfrac{ \dfrac{p_1}{1-p_1}}{ \dfrac{p_2}{1-p_2}}\)

Calculate the odds ratio of having an HIV+ baby between the placebo group and the AZT group and provide an interpretation. \(\hat\theta = \dfrac{\hat{Odds_p}}{\hat{Odds_{AZT}}} = \dfrac{0.333}{0.086}\) = 3.87

And We can see it as..
\(\hat{Odds_p} = 3.87*\hat{Odds_{AZT}}\)
Interpretation:
Estimated odds for HIV+ birth for placebo users is 3.87 times that for AZT users.

\(\hat\theta = (40*151)/(120*13) = 3.87\)

CI for the odds ratio

  • Calculate the sample odds ratio \(\hat\theta\)
  • Take the natural log of the sample odds ratio
  • Then form the CI \(ln(\hat\theta)\pm z_{\alpha/2}\sqrt{ \dfrac{1}{n_{11}}+ \dfrac{1}{n_{12}}+ \dfrac{1}{n_{21}}+ \dfrac{1}{n_{22}}}\) and take the antilog of the endpoints

Practice)

Determine a 95% confidence interval for the population odds ratio of having an HIV+ baby between the placebo group and the AZT group.
\(\hat\theta = 3.87\)
CI for\(ln(\theta)\)
\(ln(3.87)\pm1.96\sqrt( \dfrac{1}{120}+ \dfrac{1}{40}+ \dfrac{1}{13}+ \dfrac{1}{151})\)
\(= 1.353 \pm 0.67\)
\(= (0.683, 2.023)\)
\((e^{0.683}, e^{2.023})\)
\(= (1.98, 7.56)\)

#if (!require("epitools")) install.packages("epitools") 
library(epitools)
n11 <-40
n12 <-120
n21 <-13
n22 <-151
epitools::oddsratio(c(n11, n12, n21, n22),  method = "wald", conf = 0.95, correct = F) 
## $data
##           Outcome
## Predictor  Disease1 Disease2 Total
##   Exposed1       40      120   160
##   Exposed2       13      151   164
##   Total          53      271   324
## 
## $measure
##           odds ratio with 95% C.I.
## Predictor  estimate    lower    upper
##   Exposed1 1.000000       NA       NA
##   Exposed2 3.871795 1.981104 7.566889
## 
## $p.value
##           two-sided
## Predictor    midp.exact fisher.exact  chi.square
##   Exposed1           NA           NA          NA
##   Exposed2 2.919379e-05 3.711464e-05 3.26993e-05
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

Odds Ratio vs Relative Risk

\(\hat\theta = \hat{RR}\cdot \dfrac{1-p_2}{1-p_1}\)

The odds ratio and the relative risk are approximately equal when \(p_1\) and \(p_2\) are close to 0.

When the relative risk is hard to compute directly but the odds ratio is not, we can therefore estimate the relative risk from the odds ratio.
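A minimal sketch checking this relationship with the aspirin study counts used earlier:

p1 <- 189/11034; p2 <- 104/11037
rr <- p1/p2                                   # relative risk, about 1.82
or.direct <- (189*10933)/(10845*104)          # odds ratio from the cross-product
or.via.rr <- rr*(1 - p2)/(1 - p1)             # same value via the identity above
c(or.direct, or.via.rr)                       # both about 1.83; close to RR since p1, p2 are near 0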

Types of Sampling Designs:

• Retrospective – What values did the categorical variables have in the past? (Look back in time)
• Prospective – What values will the categorical variables have in the future?

Example: Suppose there was a study conducted to investigate whether the daily ingestion of multivitamins can help in reducing heart attacks among the elderly. A total of 360 subjects were part of the study. Consider the following fictitious data.

Heart Attack No Heart Attack
Multivitamin 25 150
No Multivitamin 10 175

Describe how the study would be done if it was retrospective or prospective:
Retrospective: interview those who have/have not had a heart attack and ask about multivitamin usage. Prospective: recruit subjects who do/do not use multivitamins and observe them over time.

Types of Study

Experiment - deliberately impose some treatment on individuals to observe their responses. The goal is to study whether there is a causal relationship between the explanatory and response variables.
Examples: Clinical Trials, Lab Experiments, Field Trials. All experimental studies are prospective in nature.
Observational Study - observe individuals and measure variables of interest, but do not attempt to influence responses. Goal is to describe any associations.
Examples: Case-control, Cohort, Cross-Sectional

Case-control (Retrospective)
• Gather subjects with and without a particular condition
• Cases – patients with a rare disease
• Controls – patients without the rare disease
• Match controls with cases according to some other variable (gender, age, …)
• This helps ensure an adequate number of cases in the study - If we randomly select (via simple random sample) from the population, we may obtain a very small number of individuals to fill the cases group.

Cohort (Prospective) • Subjects select themselves into an explanatory variable classification (e.g. smoker or non-smoker)
• Watch subjects over time and measure the response (no treatments are imposed)
• Disadvantage: May require a long period and may suffer from drop-out

Cross-Sectional (Retrospective)
• Subjects are assessed at a single time in their lives … like a snapshot
• Study is often quick and cheap (large sample sizes possible)
• Difficult to assess cause and effect relationship

Chi-Squared Test for Independence

Forming the Hypotheses Let A1 and A2 be events in S(sample space). The events A1 and A2 are said to be independent if and only if
P(\(A_1\) and \(A_2\)) = P(\(A_1\))P(\(A_2\))
An interpretation of this is that the occurrence of\(A_1\) does not affect the likelihood of\(A_2\) (and vice versa).

We say that the categorical variables X and Y are (statistically) independent if:\(\pi_{ij} = \pi_{i+}*\pi_{+j}\)

Using this notation, we state our null and alternative hypotheses as:
H0: \(\pi_{ij} = \pi_{i+}*\pi_{+j}\) for i = 1, …, I and j = 1,.., J
Ha: H0 is not true

Wald Test

Using the sampling distribution of (\(p_1 - p_2\))

Example: Angina pectoris is a chronic heart condition in which the sufferer has periodic attacks of chest pain. In a study to evaluate the effectiveness of the drug Timolol in preventing angina attacks, patients were randomly assigned to receive a daily dosage of either Timolol or placebo for 28 weeks. Here’s the data:
TABLE OF OBSERVED COUNTS

Angina No Angina Total
Timolol 44 116 160
Placebo 19 128 147
Total 63 244 307
y11 <-44
y12 <-116
n1 <-160
y21 <-19
y22 <-128
n2 <-147
p1 <-y11/n1
p2 <-y21/n2
p <-(y11+y21)/(n1+n2)   # pooled sample proportion under H0: pi1 = pi2
z <-(p1 - p2)/sqrt(p*(1-p)*(1/n1 + 1/n2))   # z statistic with pooled SE
z_sq <-z*z 
z_sq 
## [1] 9.978202
p_value_wald <-1 - pchisq(z_sq, 1)
p_value_wald
## [1] 0.001584043

Score Test


\(\mu_{ij}\): Expected frequency count

This leads to the Score based test statistic (note the technical and casual expression)

\(X^2 = \sum_{i,j} \dfrac{(n_{ij}-\hat\mu_{ij})^2}{\hat\mu_{ij}}\)
\(X^2 = \sum_{\text{all categories}} \dfrac{(Obs- Exp)^2}{Exp}\)

This statistic follows a \(\chi^2\) distribution with df = (I-1)(J-1).

Rule of thumb:

Use when\(\hat\mu_{ij} \ge 5\) for all i, j.

row1 <-y11 + y12
row2 <-y21 + y22 
col1 <-y11 + y21
col2 <-y12 + y22
n <-n1 + n2;
e11 <-row1*col1/n
e12 <-row1*col2/n 
e21 <-row2*col1/n
e22 <-row2*col2/n
X2 <-(y11 - e11)^2/e11 + (y12 - e12)^2/e12 + (y21 - e21)^2/e21 + (y22 - e22)^2/e22
X2
## [1] 9.978202
p_value_score <-1 - pchisq(X2, 1)
p_value_score
## [1] 0.001584043
# To generate the Score Test (equivalent to the Wald Test in the 2x2 case) 
# use chisq.test(data.matrix, correct=F) 
# where data.matrix is a matrix containing the observed values 
# correct=F specifies that a continuity correction SHOULD NOT be us
Timolol = matrix(c(44, 116, 19, 128), ncol = 2, byrow = T) 
Timolol
##      [,1] [,2]
## [1,]   44  116
## [2,]   19  128
chisq.test(Timolol, correct=F)
## 
##  Pearson's Chi-squared test
## 
## data:  Timolol
## X-squared = 9.9782, df = 1, p-value = 0.001584
chisq.test(Timolol)$observed 
##      [,1] [,2]
## [1,]   44  116
## [2,]   19  128
chisq.test(Timolol)$expected 
##          [,1]     [,2]
## [1,] 32.83388 127.1661
## [2,] 30.16612 116.8339

Likelihood Ratio Test

if (!require("vcd")) install.packages("vcd")
library('vcd')
Timolol = matrix(c(44, 116, 19, 128), ncol = 2, byrow = T) 
assocstats(Timolol)
##                      X^2 df  P(> X^2)
## Likelihood Ratio 10.2387  1 0.0013753
## Pearson           9.9782  1 0.0015840
## 
## Phi-Coefficient   : 0.18 
## Contingency Coeff.: 0.177 
## Cramer's V        : 0.18

Likelihood Ratio:10.2387
Pearson: Score result (9.9782)

Democrat Republican Independent
Female 495 272 590
Male 330 265 498
total 825 537 1088
GenderGap <-matrix(c(495, 272, 590, 330, 265, 498), ncol = 3, byrow = T)
GenderGap
##      [,1] [,2] [,3]
## [1,]  495  272  590
## [2,]  330  265  498
assocstats(GenderGap)$chisq_tests
##                       X^2 df    P(> X^2)
## Likelihood Ratio 12.60090  2 0.001835477
## Pearson          12.56926  2 0.001864750

Strong evidence against the null hypothesis of independence between x and y. Sufficient evidence to suggest there is a dependency between gender and political party identification.
If columns/rows of the data table are rearranged or even transposed, the corresponding test statistic values [WILL NOT] change.

Review

In the context of regression, we often use residual analysis as part of the diagnostic process. In the space below, briefly describe what a residual is, the point of standardized/studentized residuals and how they are used during the diagnostic check.

\(residual = (obs - pred) = (y_i-\hat{y_i})\)
\(standardized = resid/SE(resid)\)
Purpose: residual analysis is used to assess goodness of fit with the proposed model and is useful in flagging outliers.

Definition: \(\hat\mu_{ij}\) =\(\dfrac{(\text{row total})(\text{col total})}{\text{grand total}}\)
=\(n * \hat\pi_{ij}\)

  • Residual:
    \(n_{ij}-\hat\mu_{ij}\)
  • Pearson Residual: \(\dfrac{n_{ij}-\hat\mu_{ij}}{\sqrt{\hat\mu_{ij}}}\)
  • Standardized Residual:
    \(\dfrac{n_{ij}-\hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1-p_{i+})(1-p_{+j})}}\)

If X ~ Bin(n, \(\pi\)), then \(\mu = n\pi\) and \(\sigma = \sqrt{n\pi(1-\pi)}\).

Purpose of residual analysis: To assess goodness of fit with H0 and detect any unusual observation.

Rule of Thumb:
For a 2x2 table: |Standardized Residual| > 2 suggests lack of fit.
For a larger table: |Standardized Residual| > 3 suggests lack of fit.

GenderGap <-matrix(c(495, 272, 590, 330, 265, 498), ncol = 3, byrow = T)
GenderGap
##      [,1] [,2] [,3]
## [1,]  495  272  590
## [2,]  330  265  498
chisq.test(GenderGap)
## 
##  Pearson's Chi-squared test
## 
## data:  GenderGap
## X-squared = 12.569, df = 2, p-value = 0.001865
# $stdres gives the standardized residuals for the table
chisq.test(GenderGap)$stdres
##           [,1]      [,2]      [,3]
## [1,]  3.272365 -2.498557 -1.032199
## [2,] -3.272365  2.498557  1.032199
Democrat Republican Independent
Female 495 272 590
Male 330 265 498
total 825 537 1088

3.2723653, -3.2723653, -2.4985566, 2.4985566, -1.0321988, 1.0321988

The 1st column shows poor fit because |3.27| is bigger than 3.

The 3rd column shows adequate fit because |1.03| is not bigger than 3.

Note:
Democrats => we have poor fit with X,Y independence
Independents => we have okay fit with X,Y independence
Republicans => we have an unclear fit with X,Y independence

Democrat Republican Independent
Female 495 272 590
Male 330 265 498
total 825 537 1088
GenderGap = matrix(c(495, 272, 590, 330, 265, 498), ncol = 3, byrow = T) 
assocstats(GenderGap)$chisq_tests
##                       X^2 df    P(> X^2)
## Likelihood Ratio 12.60090  2 0.001835477
## Pearson          12.56926  2 0.001864750

\(G^2\) = 12.6009014

Example:
From Page 60 recall \(G^2\) = 12.60. Create the \(G^2_1\) statistic for Independents and Republicans (\(G^2_1\)=1.845). Set up but do not calculate the \(G^2_2\) statistic (\(G^2_2\)= 10.756). Include interpretations of results.

Republican Independent
Female 272 590
Male 265 498
GnderG1 = matrix(c(272, 590, 265, 498), ncol = 2, byrow = T)
assocstats(GnderG1)$chisq_tests
##                       X^2 df  P(> X^2)
## Likelihood Ratio 1.844838  1 0.1743849
## Pearson          1.846127  1 0.1742345

No significant party affiliation difference (between Republican and Independent) across gender.
\(G^2_1\) = 1.8448382

Democrat Republican+Independent
Female 495 862
Male 330 763
GnderG2 = matrix(c(495, 862, 330, 763), ncol = 2, byrow = T)
assocstats(GnderG2)$chisq_tests
##                       X^2 df    P(> X^2)
## Likelihood Ratio 10.75606  1 0.001039382
## Pearson          10.70837  1 0.001066517

Significant party affiliation difference between Democrat and Republican+Independent across gender.
\(G^2_2\) = 10.7560632

\(G^2_1 + G^2_2 = 12.601 \approx G^2\)

Review
In the context of hypothesis testing and the decisions we declare about H0, there are two types of errors we can commit. In the space below, briefly define these errors, their connection with\(\alpha\), and also define power. Discuss this with a partner.

TypeI Error: Reject H0 when H0 is true
\(\alpha\): probability of rejecting H0 when H0 is true(desired to be small)

TypeII Error: Fail to reject H0 when H0 is false
\(\beta\): probability of failing to reject H0 when H0 is false (desired to be small)

Power: 1 - \(\beta\) = probability of rejecting H0 when H0 is false (desired to be high)

Controlling a test through \(\alpha\) alone is not enough
(we need to consider \(\beta\), not only \(\alpha\))

2.5 Ordinal Data

\(M^2\) statistic can detect linear trend.

When you use \(X^2\) and \(G^2\) for ordinal variables, the ordering is ignored, so the power (1-\(\beta\), the probability of rejecting H0 when H0 is false) decreases.

The correlation will be computed using software by directly entering each (\(u_i\),\(v_j\)) instance.

Letting\(\rho\) be the population correlation parameter between X and Y, we can form the following hypotheses.

H0:\(\rho = 0\)
Ha:\(\rho > 0\)
Ha:\(\rho < 0\)
Ha:\(\rho \ne 0\)

The test statistic we will use is
\(M^2 = (n-1)r^2\), where \(M^2\sim \chi^2_1\) (df = 1)
To test a one-sided alternative, use
\(M = \sqrt{M^2}\), M~N(0,1)
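A minimal sketch (using the malformation counts and midpoint row scores from the CMHtest() call below; the column scores 0 and 1 are an arbitrary choice, since any two distinct values give the same \(r^2\)) that computes \(M^2 = (n-1)r^2\) directly:

counts <- matrix(c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1), ncol = 2)
u <- c(0, 0.5, 1.5, 4, 7)                                      # row scores (midpoints)
x <- rep(rep(u, 2), times = as.vector(counts))                 # one row score per individual
y <- rep(rep(c(0, 1), each = 5), times = as.vector(counts))    # one column score per individual
M2 <- (sum(counts) - 1)*cor(x, y)^2
M2   # about 6.57, matching the "cor" row of the CMHtest output below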

library(vcdExtra)
malform <-matrix(c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1), ncol = 2)
# Use CMHtest(data.matrix, rscores=c(...)) to generate the M2 test 
# where data.matrix is a matrix containing the observed counts 
# and rscores contains the "row scores" (we used midpoints)
CMHtest(malform, rscores = c(0, 0.5, 1.5, 4, 7))
## Cochran-Mantel-Haenszel Statistics 
## 
##                  AltHypothesis   Chisq Df     Prob
## cor        Nonzero correlation  6.5699  1 0.010372
## rmeans  Row mean scores differ 12.0817  4 0.016754
## cmeans  Col mean scores differ  6.5699  1 0.010372
## general    General association 12.0817  4 0.016754

Look at the cor (nonzero correlation) row of the output.

2.6 Exact Inference for Small Sample

Fisher’s Exact Test

Ex)

Cancer Controlled Cancer Not Controlled Total
Surgery 21 2 23
Radiation 15 3 18
Total 36 5 41

Find Estimated cell count (\(\hat\mu_{ij}\))

Cancer Controlled Cancer Not Controlled Total
Surgery 20.195122 ((23*36)/41) 2.804878((23*5)/41) 23
Radiation 15.804878((18*36)/41) 2.195122((18*5)/41) 18
Total 36 5 41

Problem: Some of the \(\hat\mu_{ij}\) are smaller than 5, a violation of the chi-square assumption.

Interpretation of\(\theta\)

If\(\theta=1\), Radiation and Surgery are equally effective.

If\(\theta<1\), Radiation is more effective than Surgery.

If\(\theta>1\), Surgery is more effective than Radiation.

p-value = p(\(n_{11}\ge21\))

Hypergeometric Distribution

Suppose Ha: \(\theta>1\). Assuming the marginal totals \(n_{i+}\) and \(n_{+j}\) are fixed, larger values of \(n_{11}\) yield larger values of the sample odds ratio \(\hat\theta\).
p-value = p(\(n_{11} \ge\) Observed\(n_{11}\))

\(n_{11}\)~hypergeom(m, n, k)
m: total # of successes
n: total # of failures
k: total # of group 1 observation

Cancer Controlled Cancer Not Controlled Total
Surgery 21 2 23
Radiation 15 3 18
Total 36 5 41
m = 36
n = 5
k = 23
# Based on the fixed margins of the data table 
# The possible values of n11 are 18, 19, ..., 23 
# In the following, cbind() is used to bind two columns
HG.probs <-cbind(18:23,dhyper(18:23,m,n,k))
colnames(HG.probs) <-c("n11","prob") 
HG.probs
##      n11       prob
## [1,]  18 0.04490137
## [2,]  19 0.21269072
## [3,]  20 0.36157422
## [4,]  21 0.27548512
## [5,]  22 0.09391538
## [6,]  23 0.01143318

For Ha:\(\theta>1\), the p-value = P(\(n_{11}\ge21\)) = p(21)+p(22)+p(23) = 0.3808
For Ha:\(\theta<1\), the p-value = P(\(n_{11}\le21\)) = p(18)+p(19)+p(20)+p(21) = 0.8947

For Ha:\(\theta\ne1\), the p-value is the sum of all probabilities that are as small as or smaller than P(\(n_{11}\) = observed value).

Since\(n_{11} = 21\)

HG.probs
##      n11       prob
## [1,]  18 0.04490137
## [2,]  19 0.21269072
## [3,]  20 0.36157422
## [4,]  21 0.27548512
## [5,]  22 0.09391538
## [6,]  23 0.01143318
sum(0.04490137, 0.21269072, 0.27548512, 0.09391538, 0.01143318)
## [1] 0.6384258

p-value = 0.6384258

cancer <-matrix(c(21,2,15,3),ncol=2,byrow=T) 

fisher.test(cancer, alternative = "greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cancer
## p-value = 0.3808
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.2864828       Inf
## sample estimates:
## odds ratio 
##   2.061731
fisher.test(cancer, alternative = "less")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cancer
## p-value = 0.8947
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##   0.00000 18.44156
## sample estimates:
## odds ratio 
##   2.061731
fisher.test(cancer, alternative = "two.sided")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cancer
## p-value = 0.6384
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.2089115 27.5538747
## sample estimates:
## odds ratio 
##   2.061731

Hypergeometric Distribution

We say this testing method is conservative because, in all 4 cases on the previous page, the actual Type I error rate is always strictly less than the desired Type I error rate \(\alpha\).

2.7 Association in Three-Way Tables (Simpson’s paradox)

Simpson’s Paradox: An association that holds for all of several groups can disappear or even reverse direction when the data are combined to form a single group.

University (Engineering and English)

Admit Deny Total % Admitted
Male 35 45 80 43.75
Female 20 40 60 33.33

Do these data show an association between gender and success in admission?

\(\hat\theta\) = (35x40)/(45x20) = 1.56
This indicates some favoritism for men.

Engineering

Admit Deny Total % Admitted
Male 30 30 60 50
Female 10 10 20 50

\(\hat\theta\) = (30x10)/(30x10) = 1

English

Admit Deny Total % Admitted
Male 5 15 20 25
Female 10 30 40 25

\(\hat\theta\) = (5x30)/(15x10) = 1

No favoritism.

Let X be gender (M,F), Y be decision (admit,deny), Z be program (English,Engineering) • As there are now three dimensions (X, Y, Z), we use the term three-way tables.
X = i, Y = j, Z = k
\(n_{ijk}\): Observed count
\(p_{ijk}\): Observed proportion
\(\pi_{ijk}\): Theoretical (population) probability

Marginal Table: The composite/overall table.
Partial Table: The two-way table for the variables at a fixed value of the remaining variable.

Conditional Associations: Associations we determine for a partial table

Marginal Associations: Associations we determine for marginal table

Conditional and Marginal Odds Ratio

Suppose:
X: dichotomous (Binary)
Y: dichotomous (Binary)
Z: multi-chotomous

Marginal Odds Ratio\(\theta_{XY}\) : Overall odds ratio for X and Y
Conditional Odds Ratio \(\theta_{XY(k)}\): odds ratio between X and Y at level k of the Z variable

Conditional and Marginal Independence

If\(\theta_{XY(k)} = 1\), then X and Y are conditionally independent given Z = k

If \(\theta_{XY(k)} = 1\) for all k, then X and Y are conditionally independent given Z. (All conditional odds ratios = 1)

Homogeneous Association: when all the \(\theta_{XY(k)}\) are the same.

If X and Y are also multi-chotomous…

  • Choose any two levels of X (i and i’) and Y (j and j’)
  • Construct 2x2 table for all Z levels of chosen X (i and i’) and Y (j and j’)
  • If X and Y have a homogeneous association, then all these conditional odds ratios are equal.

Example:
Let I = 3, J = 3, K = 4 and i = 1, i’ = 3, j =2, j’ = 3

Notes about Homogeneous Association:

• When there is homogeneous XY association, this implies there is homogeneous XZ association and there is homogeneous YZ association.
• Once a homogeneous association exists for one pair of variables, it exists for all pairs of variables. This can be mathematically proven.
• Homogeneous association is a symmetric property.
• When this occurs, we say there is no interaction between two variables in their effects on the third variable. In other words, the way in which X impacts Y is not changed by the level of Z.

Example: X = smoking (yes,no), Y = lung cancer (yes,no), Z = age group (< 45, 45–65,> 65)

We get: \(\theta_{XY(1)}=1.2, \theta_{XY(2)}=2.8, \theta_{XY(3)}=6.2\)

Since \(\theta_{XY(k)}\) increases over the levels of Z, the association is not homogeneous: there is an interaction between X and Y in their effects, i.e. the way X impacts Y changes with the level of Z.

Chapter 3 Generalized Linear Models

The expected value of Y given X in linear regression is denoted by \(E(Y) = \alpha+\beta X\).
The variance and standard deviation of Y:

\(V(Y) = V(\alpha+\beta X+\epsilon)\)
\(= V(C + \epsilon)\), where \(C = \alpha+\beta X\) is constant for fixed X
\(= V(\epsilon) = \sigma^2\)

If the error term is assumed to be normal…
\(Y\sim N(\mu = \alpha+\beta X,\ \sigma)\)

The statement above describes three very important characteristics of this model:
1. Y has a random component, namely the normal distribution.

2. It also describes a systematic component involving the explanatory variable: \(\alpha+\beta x\).

3. It also shows there is a relationship between the expected value of Y \((\mu)\) and the systematic component: \(\mu = \alpha+\beta x\).

These 3 points are part of a grand model known as the Generalized Linear Model (GLM)

  1. Random component of Y
  2. Systematic component involving X
  3. A way to connect \(\mu = E(Y)\) to the systematic component involving X

Random Component

Suppose we will be working with n observations with corresponding response variables. (\(y_1, y_2, y_3,..., y_n\))

We assume\(y_i\) are independent

Example:
- If Y is dichotomous
\(y\sim Binomial(n, \pi)\)

  • If Y is a count variable with no specified upper bound
    \(y\sim poisson\)
    \(y\sim NegativeBinomial\)

  • If Y is continuous
    \(y\sim Normal\)

Systematic Component

  • Assume we have the following explanatory variables. (\(x_1, x_2, x_3,...,x_k\))

  • We can specify the explanatory variables via a linear model
    \(\alpha + \beta_1 x_1+\beta_2 x_2+...+\beta_k x_k\)

  • What we mean by “linear model”: the \(\beta_i\) terms enter the model in a linear fashion.

  • This is also called the linear predictor

  • Some \(X_i\) can be based on other X variables, e.g. by adding an interaction term:

\(\alpha + \beta_1 x_1+\beta_2 x_2+\beta_3 x_1x_2\)

Section 3.2: Generalized Linear Models For Binary Data

Logistic Regression

  • Odds:\(\dfrac {\pi(x)}{1-\pi(x)}\)
    -\(\dfrac {\pi(x)}{1-\pi(x)}\) is assumed to be non-negative
  • Simple model to relate odds and X: Odds =\(\alpha+\beta x\)
  • Con: Odds can be negative depending on\(\alpha, \beta, x\)
  • Modified model to relate odds and X: odds =\(e^{\alpha+\beta x}\)

Advantage of this model: Odds will always be non-negative.
Thus, \(e^{\alpha+\beta x}\) is a reasonable model for the odds.

Now use the expression for the odds above and solve for \(\pi(x)\):

\(\dfrac {\pi(x)}{1-\pi(x)}\) =\(e^{\alpha+\beta x}\)

\(\pi(x) = (1-\pi(x))e^{\alpha+\beta x}\)
\(\pi(x) = \dfrac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\)

Establishing the GLM

  • Random Component: Binomial
  • Systematic Component:\(\alpha + \beta x\)
  • Natural Parameter:\(\pi/(1-\pi)\)
  • Link Function:\(logit (\pi) = log_e( \dfrac{\pi}{1-\pi})\)

The transformation is known as logit transformation.

\(\pi(x) = \dfrac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\) we want\(g(\pi)=\alpha+\beta x\)
Claim:\(logit(\pi) = \alpha+\beta x\)

• Estimate\(\alpha\) and\(\beta\) by maximum likelihood methods using the Binomial PMF
• Problem: There is no closed form solution for these estimates
• Solution is found via numerical approximations
• We use software to determine the maximum likelihood estimates for\(\alpha\) and\(\beta\)


Given a fixed value of\(\alpha\), describe the effect of increasing\(\beta\) upon the logistic curve:
If\(\beta\) > 0, as\(\beta\) increases, the curve has a steeper rate of ascent/descent.


Given a fixed value of\(\beta\), describe the effect of increasing\(\alpha\) upon the logistic curve:
If\(\beta\) > 0, as\(\alpha\) increases, there is a horizontal shift to the left.

Given a fixed value of\(\alpha\), describe the effect of decreasing\(\beta\) upon the logistic curve:
If\(\beta\) < 0, as\(\beta\) decreases, the curve has a steeper rate of ascent/descent.


Given a fixed value of\(\beta\), describe the effect of increasing\(\alpha\) upon the logistic curve:
If\(\beta\) < 0, as\(\alpha\) increases, there is a horizontal shift to the right

rats = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/rats.txt', sep='', header = T)

fit = glm(death~toxin, family = binomial, data = rats)
summary(fit)
## 
## Call:
## glm(formula = death ~ toxin, family = binomial, data = rats)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6544  -0.2380  -0.0513   0.2390   1.6582  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -10.3324     5.2417  -1.971   0.0487 *
## toxin         0.3083     0.1546   1.995   0.0461 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23.035  on 16  degrees of freedom
## Residual deviance:  8.438  on 15  degrees of freedom
## AIC: 12.438
## 
## Number of Fisher Scoring iterations: 6

From the output: \(\hat\pi(x) = \dfrac{e^{-10.3324+0.3083x}}{1+e^{-10.3324+0.3083x}}\)

Example: Based on the 23 space shuttle flights before the Challenger disaster in 1986, the following graphs represent the temperature (in ◦F) at the time of flight and whether at least one primary O-ring suffered thermal distress. The flight temperature was 31◦F for Challenger that morning.

o_ring = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/o_ring.txt', sep = '', header = T)  
fit = glm(TDistress~Temp, family = binomial, data = o_ring)
summary(fit)$coefficients
##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) 15.0429016  7.3786301  2.038712 0.04147878
## Temp        -0.2321627  0.1082364 -2.144959 0.03195610

\(\alpha\): 15.0429016
\(\beta\): -0.2321627
What is the predicted probability of thermal distress when the temperature is 31◦F?
\(\hat\pi(31) = 0.9996\)
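A minimal sketch verifying this prediction from the fit above (plogis() is the inverse logit):

predict(fit, newdata = data.frame(Temp = 31), type = "response")   # about 0.9996
# equivalently: plogis(15.0429016 - 0.2321627*31)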

3.3: Generalized Linear Models For Count Data

Poisson Distribution

The Poisson distribution (pronounced ‘pwa-son’) is useful to model a response variable based on counts. Estimating number of counts.

\(P(Y = y) = \dfrac{e^{-\mu}\mu^y}{y!}\)
for y = 0, 1, 2, …
\(\mu\): the intensity parameter (average count)
E(Y) =\(\mu\)
V(Y) =\(\mu\)
SD(Y) =\(\sqrt{\mu}\)

E(Y) =\(\mu\) is assumed to be positive
Interpretation: \(\mu\) is the mean count (mean # of successes).
Modified model that keeps \(\mu\) positive: \(\mu = e^{\alpha+\beta x}\)

Establishing GLM
- Random Component: Poisson
- Systematic Component:\(\alpha + \beta x\)
- Natural Parameter:\(\mu\)
- Link Function: Natural log

Estimate\(\alpha\) and\(\beta\) by maximum likelihood methods using the Poisson PMF (Probability mass function)

Example: In a 1996 study, investigators examined factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response variable for each female crab is her number of satellites. An explanatory variable thought possible to affect this was the female crab’s shell width (measured in centimeters).

crabs = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/crabs.txt', sep='', header =T)
rbind(head(crabs,3),tail(crabs,3))
##     crab sat y weight width color spine
## 1      1   8 1  3.050  28.3     2     3
## 2      2   0 0  1.550  22.5     3     3
## 3      3   9 1  2.300  26.0     1     1
## 171  171   0 0  2.625  28.0     1     1
## 172  172   0 0  2.625  27.0     4     3
## 173  173   0 0  2.000  24.5     2     2
fit = glm(sat~width, family = poisson, data = crabs)
summary(fit)$coefficient
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -3.3047572 0.54224155 -6.094622 1.096964e-09
## width        0.1640451 0.01996535  8.216491 2.095450e-16

\(\hat\mu = e^{\hat\alpha+\hat\beta x}\)
\(= e^{-3.3047572+0.1640451 x}\)

x =26
exp(-3.3048+0.1640*x)
## [1] 2.609608
# x =27
exp(-3.3048+0.1640*x)*exp(0.1640)
## [1] 3.074677
# x =28
exp(-3.3048+0.1640*x)*exp(0.1640)*exp(0.1640)
## [1] 3.622629

\(\hat\beta\) leads to the multiplicative factor of\(e^{\hat\beta}\)
Interpret\(\beta\): For every one unit of increase in x, there is multiplicative change in\(\hat\mu\) by a factor of\(e^{\hat\beta}\)

When \(\beta = 0\), \(e^{\beta} = e^0 = 1\), which suggests no relationship between Y and X (a multiplicative effect of 1).

\(\beta > 0 \Rightarrow e^{\beta}>1\): positive effect
\(\beta < 0 \Rightarrow e^{\beta}<1\): negative effect

Example: We are interested in seeing if temperature has an impact on the number of in-juries incurred by visitors to an amusement park on a given day.

Sun Mon Tue Wed Thu Fri Sat
Injuries 53 48 30 25 27 47 58
High Temp 98.2 88.1 94.5 85.6 89.5 90.2 97.9

\(\mu = e^{\alpha+\beta x}\)

# of injuries will depend on the # of visitors to the park on a given day. We should use the rate of injuries.
model:\(\mu/n=e^{\alpha+\beta x}\)

Count Regression for Rate Data

  • It may NOT be appropriate to use the raw counts of an event for comparative purposes.
  • Instead, the rate of occurrence will be a more reasonable measure to make fair comparisons.
  • Instead of modeling the expected number of counts (\(\mu\)), we will model:\(\dfrac{\mu}{t}\)
  • Here, we will assume t is the corresponding index value
  • Our parameter of interest is now rate of occurrence -\(\dfrac{\mu}{t}\): Assumed to be positive
    -\(\dfrac{\mu}{t}=e^{\alpha+\beta x}\)
    Thus,\(e^{\alpha+\beta x}\) is a reasonable model for\(\dfrac{\mu}{t}\)

Establishing the GLM

Random Component: Poisson
Systematic Component:\(\alpha + \beta x\)
Natural Parameter:\(\dfrac{\mu}{t}\)
Link Function: natural log

\(log_e(\dfrac{\mu}{t}) = \alpha + \beta x\)
\(log_e(\mu)- log_e(t) = \alpha + \beta x\)
\(log_e(\mu) = \alpha + \beta x + log_e(t)\)

log_e(t): Offset term

\(\mu = e^{\alpha+\beta x + log_e(t)}\)
\(\mu = e^{\alpha+\beta x} e^{log_e(t)}\)
\(\mu = te^{\alpha+\beta x}\)

Example: Data was collected on train-related accidents in the UK between 1975 and 2003. Two types of accidents were considered: (1) accidents involving trains alone (t_coll) and (2) collisions between trains and road vehicles (tr_coll). Here, we will only consider the second type of accident. Also included for each year was the number of train-kilometers (trainkm), which is a measure of railway activity indicating millions of kilometers traveled by trains during the year.

Use OFFSET= to tell R to use ratio (log)

trains = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/trains.txt', sep='', header = T)
# Due to the fact that the variable trainkm is a component of the data object trains,  we can extract the variable by using trains$trainkm. In the next line we construct the offset term by finding the logarithm of trainkm. 
log.km = log(trains$trainkm) # Offset term  

fit = glm(tr_coll~time, family = poisson, offset = log.km, data = trains)

summary(fit)$coefficients
##                Estimate Std. Error   z value      Pr(>|z|)
## (Intercept) -4.21141778 0.15891902 -26.50040 9.589817e-155
## time        -0.03291798 0.01075909  -3.05955  2.216698e-03

\(\hat{rate}=\dfrac{\hat\mu}{trainkm}=e^{-4.2114-0.0329x}\)

Interpretation
1975: x = 0
\(\hat{rate}\) = 0.0148 accidents per 1 million km of travel for 1975
2003: x = 28
\(\hat{rate}\) = 0.0059 accidents per 1 million km of travel for 2003
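A minimal sketch verifying these two fitted rates directly from the coefficients above:

exp(-4.21141778 - 0.03291798*0)    # 1975 (x = 0): about 0.0148
exp(-4.21141778 - 0.03291798*28)   # 2003 (x = 28): about 0.0059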

Statistical Inference and Model Checking

Statistical Inference for\(\beta\)
\(z = \dfrac{\hat\beta}{SE_{\hat\beta}}\)

Crabs:
\(H0: \beta = 0\)
\(Ha: \beta \ne 0\)

fit = glm(sat~width, family = poisson, data = crabs)
summary(fit)$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -3.3047572 0.54224155 -6.094622 1.096964e-09
## width        0.1640451 0.01996535  8.216491 2.095450e-16

\(z = \dfrac{0.164045}{0.01996535} = 8.216485\)
\(z^2 = 8.22^2 = 67.57\)
p-val: 2.095450e-16

Calculate the 95% Wald CI:
\(\hat\beta \pm 1.96 SE_{\hat\beta} = (0.125, 0.203)\)

confint.default(fit, level=.95)
##                  2.5 %     97.5 %
## (Intercept) -4.3675312 -2.2419833
## width        0.1249137  0.2031764

Likelihood-based Confidence Intervals

As it turns out, when sample size is small and/or when effect sizes are large, Wald-based methods can perform poorly and are unreliable. We saw this back in Section 1.3 with CIs for the binomial parameter π. Instead of Wald-based methods we can use likelihood-based methods which tend to perform better in those situations.

confint(fit, level=.95)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -4.3662326 -2.2406858
## width        0.1247244  0.2029871

In this situation note that the Wald and likelihood CIs are very similar. But when sample size is small and/or when effect sizes are large the CIs may differ drastically.

\(-2log_e(\dfrac{l_0}{l_1})\)
\(= -2[log_e(l_0)-log_e(l_1)]\)
\(= -2[L_0 - L_1]\), where \(L_i = log_e(l_i)\)
Follows the\(\chi^2_{df=1}\) distribution

To generate the Likelihood Ratio Test

car::Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: sat
##       LR Chisq Df Pr(>Chisq)    
## width   64.913  1  7.828e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LR Chisq(LRT statistics): 64.913

Review

  • For a given data set, suppose we want to consider an ordinary least squares regression model based on the explanatory variables X1,X2,X3. This model is described as
    \(\hat y = \hat\alpha+\hat\beta_1 x_1+\hat\beta_2 x_2+\hat\beta_3 x_3\)(Full model)

  • You are also interested in fitting the simpler regression model that only includes X1. This model is described as

\(\hat y = \hat\alpha+\hat\beta_1 x_1\)(Reduced Model)

  • How would you conduct a hypothesis test to see if the simpler model is sufficient? Describe the steps below and be sure to specify the appropriate null and alternative hypotheses.

Partial F-test: \(F = \dfrac{\dfrac{SSE(red.)-SSE(full)}{errorDF(red.)-errorDF(full)}}{MSE(full)}\)
H0: Reduced model is sufficient relative to the full model
Ha: Reduced model is insufficient relative to the full model

Fullest Regression Model/Saturated Model: A model that contains the maximum number of parameters that can be uniquely estimated from the data.

The Deviance/ Residual Deviance

\(-2log_e(\dfrac{l_0}{l_1})\)
\(= -2[log_e(l_0)-log_e(l_1)]\)
\(= -2[L_0 - L_1]\)
This follows a \(\chi^2\) distribution; the degrees of freedom are given below.

- Suppose we are considering a model, denoted by M, for logistic or Poisson regression.
- Let \(L_M\) denote the maximized log-likelihood under the model M
- Suppose we wanted to compare M to the most complex model possible, which will be denoted by S (for saturated)
- Let \(L_S\) denote the maximized log-likelihood under the saturated model S
- Because S contains at least as many parameters as M, \(L_M \le L_S\)
- We can perform a "lack of fit" test to determine how the reduced model M compares to the saturated model S
- A measure of the sufficiency of the reduced model M compared to S is given by

Deviance:
\(-2log_e(\dfrac{l_M}{l_S})\)
\(= -2[log_e(l_M)-log_e(l_S)]\)
\(= -2[L_M - L_S]\)

  • Note the distribution of this statistic and its degrees of freedom
    \(\chi^2_{DF=\Delta}\)

\(\Delta\): the difference in the number of parameters between the two models (\(\Delta = \) number of parameters in the saturated model \(-\) number of parameters in the reduced model)

Example: Snoring and Heart Disease In a survey of 2,484 subjects, investigators wanted to determine if snoring is a possible risk factor for heart disease. Snoring level (never, occasional, nearly every night, every night) was recorded with corresponding scores (0, 2, 4, 5).

Snoring Yes No Total
Never 24 1355 1379
Occasional 35 603 638
Near Every Night 21 192 213
Every Night 30 224 254

Y = has heart disease (0, 1)
X = snoring level (0, 2, 4, 5) (Ordinal Data)

\(\pi(x) = \dfrac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\)
x = 0, 2, 4, 5

\(logit(\pi(x)) = \alpha+\beta x\)
x = 0, 2, 4, 5

Make table of counts

x\y 1 0
0 \(\pi(0)\) \(1-\pi(0)\)
2 \(\pi(2)\) \(1-\pi(2)\)
4 \(\pi(4)\) \(1-\pi(4)\)
5 \(\pi(5)\) \(1-\pi(5)\)

\(\pi(0)=P(y=1|x=0)\)
\(\pi(2)=P(y=1|x=2)\)
\(\pi(4)=P(y=1|x=4)\)
\(\pi(5)=P(y=1|x=5)\)

4 total\(\pi\)’s as there are only 4 possible x values.

Model S (Full Model)
4 binomial success probabilities => 4 parameters in Saturated Model

Model M (Reduced Model)
\(logit(\pi(x)) = \alpha+\beta x\) with 2 parameters
\(\Delta = 4 -2 = 2\)

Snoring Yes No Total
Never 24 1355 1379
Occasional 35 603 638
Near Every Night 21 192 213
Every Night 30 224 254
snoring = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week6/snoring.txt', sep = '', header = T)
total = snoring$yes + snoring$no

# IMPORTANT NOTE: In the following call to glm, note that the response variable is of the form yes/total. Also the option weight=total is used. Both of these are used to reflect the fact we are modeling multiple binomial probabilities  

fit = glm(yes/total ~ score, family = binomial, weight = total, data = snoring)
summary(fit)
## 
## Call:
## glm(formula = yes/total ~ score, family = binomial, data = snoring, 
##     weights = total)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -0.8346   1.2521   0.2758  -0.6845  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.86625    0.16621 -23.261  < 2e-16 ***
## score        0.39734    0.05001   7.945 1.94e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.9045  on 3  degrees of freedom
## Residual deviance:  2.8089  on 2  degrees of freedom
## AIC: 27.061
## 
## Number of Fisher Scoring iterations: 4

\(\pi(x) = \dfrac{e^{-3.86625 +0.39734 x}}{1+e^{-3.86625 +0.39734 x}}\)

In the R output where Residual deviance is reported, the number for degrees of freedom represents the additional number of parameters the saturated model contains that the model M does not.
“Residual deviance: 2.8089 on 2 degrees of freedom”

\(= -2[L_M - L_S] = 2.8089\)
This is the deviance statistic comparing model M to the saturated model S.
The reported 2 degrees of freedom is the number of parameters that S contains and M does not.

p_value = 1 - pchisq(2.8089, 2)
p_value
## [1] 0.245502

H0: Model M is sufficient compared to the saturated model
Ha: Model M is not sufficient compared to the saturated model

Conclusion: Do not reject H0 at any standard \(\alpha\) level; the saturated model S does not fit significantly better than model M.

If we want to compare two non-saturated models,\(M_0\) and\(M_1\). (\(M_0\) is nested within\(M_1\))

LRT comparing\(M_0\) to\(M_1\):
\(LRT = -2[L_{M_0}-L_{M_1}]\)
\(= -2[L_{M_0} - L_S + L_S - L_{M_1}]\)
\(= -2[L_{M_0} - L_S] + 2 [L_{M_1}-L_S]\)
\(= -2[L_{M_0} - L_S] - ( -2 [L_{M_1}-L_S])\)
\(= Deviance_0 - Deviance_1\)

Follows the \(\chi^2\) distribution with \(\Delta = \) param(\(M_1\)) \(-\) param(\(M_0\)) degrees of freedom

\(dev_1=-2[L_{M_1} - L_S] = 2.8089\) (the deviance of model \(M_1\))

# In many R regression functions, y ~ 1 means "fit an intercept-only model"
fit2 = glm(yes/total ~ 1, family = binomial, weight = total, data = snoring)
summary(fit2)
## 
## Call:
## glm(formula = yes/total ~ 1, family = binomial, data = snoring, 
##     weights = total)
## 
## Deviance Residuals: 
##      1       2       3       4  
## -5.508   1.254   3.339   4.779  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.07185    0.09753   -31.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.904  on 3  degrees of freedom
## Residual deviance: 65.904  on 3  degrees of freedom
## AIC: 88.157
## 
## Number of Fisher Scoring iterations: 5

Deviance: \(-2[l_{M_0} - l_S]\) = 65.904

\(LRT = dev_0 - dev_1\)
\(LRT = 65.904 - 2.8089\)
\(LRT =\) 63.0951

We know
\(M_1\) has 2 parameters: param(\(M_1\)) = 2
\(M_0\) has 1 parameter: param(\(M_0\)) = 1
\(\Delta = \) param(\(M_1\)) \(-\) param(\(M_0\)) \(= 2-1=1\)

H0: \(logit(\pi) = \alpha\) is sufficient relative to \(logit(\pi) = \alpha+ \beta x\)
Ha: \(logit(\pi) = \alpha\) is NOT sufficient relative to \(logit(\pi) = \alpha+ \beta x\)

df = 2-1
LRT = 63.0951
p_value = 1 - pchisq(LRT, df)
p_value
## [1] 1.998401e-15

Reject H0 => keep x in the model; x is a useful predictor

  • param (S): the number of parameters specified in the Saturated Model
  • param (\(M_1\)): the number of parameters specified in the Model 1
  • param (\(M_0\)): the number of parameters specified in the Model 0

DF = 3 = param(S) - param(\(M_0\))
DF = 2 = param(S) - param(\(M_1\))
DF = 1 = param(\(M_1\)) - param(\(M_0\))
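These nested comparisons can also be run directly with anova() on the two glm fits (a sketch, assuming fit2 is the intercept-only model and fit is the score model fit above):

anova(fit2, fit, test = "Chisq")   # deviance drop 65.904 - 2.8089 = 63.095 on 1 df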

See PDF in week6

LRT test

Recall Example: In a 1996 study, investigators examined factors that affect whether a female crab had any other males, called satellites, residing near her. The response variable for each female crab is her number of satellites. An explanatory variable thought to possibly affect this was the female crab's shell width (measured in centimeters).

snoring = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week6/snoring.txt', sep = '', header = T)
total = snoring$yes + snoring$no
fit = glm(yes/total ~ score, family = binomial, weight = total, data = snoring)
Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: yes/total
##       LR Chisq Df Pr(>Chisq)    
## score   63.096  1  1.969e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The corresponding hypotheses and conclusion.
\(logit(\pi) = \alpha + \beta x\) \(H0: \beta = 0\)
\(Ha: \beta \ne 0\)
\(LRT = 63.096\)
\(DF = 1\)
\(p-value = 1.969e-15\)

Reject H0 at any standard\(\alpha\) level.
-> Keep x in the model, x is a useful predictor

Exact same test as above.

Residual:\(y_i-\hat\mu_i\)

Pearson Residual: \(\dfrac{y_i-\hat\mu_i}{\sqrt{Var(\hat\mu_i)}}\)

Standardized Residual:\(\dfrac{y_i-\hat\mu_i}{SE}\)

Rule of thumb: an observation is unusual if |standardized residual| > 2 or 3

o_ring = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/o_ring.txt', sep = '', header = T)  
plot(o_ring$Temp,o_ring$TDistress)

fit = glm(TDistress~Temp, family = binomial, data = o_ring)
std_resids <-rstandard(fit,type="pearson")
cbind(o_ring, std_resids) 
##    Temp TDistress std_resids
## 1    66         0 -0.9096075
## 2    70         1  1.8901296
## 3    69         0 -0.6341654
## 4    68         0 -0.7135886
## 5    67         0 -0.8046721
## 6    72         0 -0.4483076
## 7    73         0 -0.3995756
## 8    70         0 -0.5644830
## 9    57         1  0.4545905
## 10   63         1  0.8770428
## 11   70         1  1.8901296
## 12   78         0 -0.2230624
## 13   67         0 -0.8046721
## 14   53         1  0.2787366
## 15   67         0 -0.8046721
## 16   75         0 -0.3170090
## 17   70         0 -0.5644830
## 18   81         0 -0.1564473
## 19   76         0 -0.2821137
## 20   79         0 -0.1982280
## 21   75         1  3.3888106
## 22   76         0 -0.2821137
## 23   58         1  0.5109346

Observation 21, and possibly 2 and 11, are unusual observations: the failures occurred in unusually warm conditions.

Ch4 Logistic Regression

Interpreting the Logistic Regression Model

\(logit(\pi(x)) = \alpha + \beta x\)
\(\pi(x) = \dfrac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\)

  • Notice that the rate of change depends upon \(\beta\) and \(\pi(x)\)
  • The steepest slope is always at \(\pi(x)=1/2\), where \(x = - \dfrac{\alpha}{\beta}\) (known as \(EL_{50}\), the median effective level)
  • The flattest parts of the curve occur at the extremes, where \(\pi(x)=0\) or \(\pi(x)=1\)

Example: Horseshoe Crabs Revisiting the horseshoe crab example, let’s create a dichotomous variable Y which equals 0 if the number of satellites for the given female is 0. Y will be 1 if the number of satellites exceeds 0. We can fit a logistic regression model based on Y as the response variable and X = width of the female.

knitr::kable(head(crabs))
crab sat y weight width color spine
1 8 1 3.05 28.3 2 3
2 0 0 1.55 22.5 3 3
3 9 1 2.30 26.0 1 1
4 0 0 2.10 24.8 3 3
5 4 1 2.60 26.0 3 3
6 0 0 2.10 23.8 2 3
fit = glm(y~width, family = binomial, data = crabs)
summary(fit)$coefficient  
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -12.3508177  2.6287179 -4.698419 2.621832e-06
## width         0.4972306  0.1017355  4.887482 1.021340e-06

The minimum and maximum widths from the data are 21cm and 33.5cm respectively. Find π for each and note steepness of curve at these points

a = -12.3508
b = 0.4972

x = 21
pi1 = exp(a+b*x)/(1+exp(a+b*x))

x = 33.5
pi2 = exp(a+b*x)/(1+exp(a+b*x))


x = -a/b
pi3 = exp(a+b*x)/(1+exp(a+b*x))

cat('pi(21)', pi1,'\n')
## pi(21) 0.1290258
cat('pi(33.5)',pi2,'\n')
## pi(33.5) 0.9866842
cat('pi(-a/b or 24.84071)',pi3,'\n')
## pi(-a/b or 24.84071) 0.5

pi(21) = 0.1290258: Curve not very steep here
pi(33.5) = 0.9866842: Curve is flat here

Median effective level:
\(\widehat{EL}_{50}= - \dfrac{\hat\alpha}{\hat\beta}\)
\(\widehat{EL}_{50}= - \dfrac{-12.3508}{0.4972}\)
\(\widehat{EL}_{50}=\) 24.840708
pi(-a/b or 24.84071) = 0.5

The steepest part of the curve can be found at x = 24.84071

For X = x:
\(\dfrac{\pi(x)}{1-\pi(x)} = e^{\alpha+\beta x}\)

For X = x+1: \(\dfrac{\pi(x+1)}{1-\pi(x+1)} = e^{\alpha+\beta (x+1)} = e^{\alpha+\beta x}\cdot e^\beta = \dfrac{\pi(x)}{1-\pi(x)}\cdot e^\beta\)

Interpretation of \(\beta\) and \(e^\beta\):
\(e^\beta\) is the multiplicative change in the odds of success when x increases to x + 1
(a 1-unit increase in x).

We can also consider the odds ratio as follows:
\(\dfrac{\pi(x+1)}{1-\pi(x+1)} = \dfrac{\pi(x)}{1-\pi(x)}*e^\beta\)

\(\dfrac{\dfrac{\pi(x+1)}{1-\pi(x+1)}}{\dfrac{\pi(x)}{1-\pi(x)}}\) is the odds ratio comparing x+1 to x
\(\dfrac{\dfrac{\pi(x+1)}{1-\pi(x+1)}}{\dfrac{\pi(x)}{1-\pi(x)}} = e^\beta\)

\(e^\beta\) is the odds ratio comparing x+1 to x

Interpreting Estimated Odds Ratio

  • If \(\hat\theta > 1\) then we can say the odds of the event are estimated to increase by \((e^{\hat\beta} -1) \times 100\%\) for each unit increase in X.
  • If \(\hat\theta < 1\) then we can say the odds of the event are estimated to decrease by \((1-e^{\hat\beta}) \times 100\%\) for each unit increase in X.

Example: Return to the Horseshoe Crab example and provide an interpretation for beta.
\(\hat\beta = 0.4972\),\(e^{0.4972} = 1.64\)
\(164%- 100%=64%\) Estimated odds of at least one satellite (success) increase by 64% for each one unit increase in x (width).

Check with x = 26.3 and x = 27.3:
\(\pi(26.3) = 0.6738\)
\(odds = \dfrac{0.6738}{1-0.6738} = 2.066\)
\(\pi(27.3) = 0.7725\)
\(odds = \dfrac{0.7725}{1-0.7725} = 3.396\)

\(\hat\theta = \dfrac{3.396}{2.066} = 1.64\)
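The same check can be done in R with predict(), which returns \(\hatπ\) when type = "response" (a sketch, assuming fit is the logistic fit of y on width from the crab example above):

p = predict(fit, newdata = data.frame(width = c(26.3, 27.3)), type = "response")
odds = p/(1 - p)
odds[2]/odds[1]   # estimated odds ratio, about 1.64 = exp(0.4972)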

4.2 Inference for Logistic Regression

Recall 95% Wald-based CI

\(\hat\beta \pm 1.96(SE_\beta)\)

CI = (\(e^{\hat\beta - 1.96(SE_\beta)}\),\(e^{\hat\beta + 1.96(SE_\beta)}\))

fit <-glm(y ~ width, family = binomial, data = crabs)
summary(fit)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -12.3508177  2.6287179 -4.698419 2.621832e-06
## width         0.4972306  0.1017355  4.887482 1.021340e-06

Logistic Regression CI

Generate the Wald CI for \(e^\beta\) by doing the computation by hand below:

\(\hat\beta \pm 1.96(SE_{\hat\beta}) = 0.4972 \pm 1.96(0.1017) = (0.298, 0.697)\)

confint.default(fit)
##                   2.5 %     97.5 %
## (Intercept) -17.5030100 -7.1986254
## width         0.2978326  0.6966286

Wald 95% CI for \(e^\beta = (e^{0.298}, e^{0.697}) = (1.35, 2.01)\)

exp(confint.default(fit))
##                    2.5 %       97.5 %
## (Intercept) 2.503452e-08 0.0007476128
## width       1.346936e+00 2.0069749360

Likelihood-based CI

It is similar to the Wald CI when the sample size is large.

confint(fit, level = .95)
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept) -17.8100090 -7.4572470
## width         0.3083806  0.7090167
exp(confint(fit, level = .95))
## Waiting for profiling to be done...
##                    2.5 %       97.5 %
## (Intercept) 1.841668e-08 0.0005772432
## width       1.361219e+00 2.0319922986

We are 95% confident that the odds ratio for success with a 1-unit increase in x (width) is between 1.36 and 2.03.

Hypothesis Test for \(\beta\)

When the sample size is small, the Wald test performs poorly.
An alternative is to use tests generated by the likelihood ratio method:

\(-2[log_e(l_0) - log_e(l_1)] = -2[L_0-L_1]\)

Hypothesis Test for \(\beta\) (Wald)

H0: \(\beta = 0\)
Ha: \(\beta \ne 0\)

fit <-glm(y ~ width, family = binomial, data = crabs)
summary(fit)$coefficients 
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -12.3508177  2.6287179 -4.698419 2.621832e-06
## width         0.4972306  0.1017355  4.887482 1.021340e-06

\(z = \dfrac{0.4972}{0.1017} = 4.89\)
\(z^2 = 4.89^2 = 23.91\)
p-value: \(1.02*10^{-6}\) => Reject H0

Shell width is useful explanatory factor in predicting \(\pi\)

Hypothesis Test for \(\beta\) (Likelihood Ratio Test)

H0: \(\beta = 0\)
Ha: \(\beta \ne 0\)

Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: y
##       LR Chisq Df Pr(>Chisq)    
## width   31.306  1  2.204e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LRT = 31.31
p-value: 2.204e-08

Full vs Reduced Model Test

fit <-glm(y ~ width, family = binomial, data = crabs)
summary(fit)
## 
## Call:
## glm(formula = y ~ width, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0281  -1.0458   0.5480   0.9066   1.6942  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.3508     2.6287  -4.698 2.62e-06 ***
## width         0.4972     0.1017   4.887 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 194.45  on 171  degrees of freedom
## AIC: 198.45
## 
## Number of Fisher Scoring iterations: 4
fit2 <-glm(y ~ 1, family = binomial, data = crabs)
summary(fit2)
## 
## Call:
## glm(formula = y ~ 1, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4326  -1.4326   0.9421   0.9421   0.9421  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.5824     0.1585   3.673 0.000239 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 225.76  on 172  degrees of freedom
## AIC: 227.76
## 
## Number of Fisher Scoring iterations: 4

\(-2[L_{M_0}-L_{M_1}] = 225.76 - 194.45 = 31.31\)
Use \(\chi^2\) with \(\Delta = 2-1=1\) degree of freedom
Exact same LRT value from previous test

How could we have generated the LRT statistic with ONLY the full model output above? The null deviance is the deviance statistic when only an intercept is in the model (no explanatory variables), so LRT = null deviance \(-\) residual deviance.

summary(fit)
## 
## Call:
## glm(formula = y ~ width, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0281  -1.0458   0.5480   0.9066   1.6942  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.3508     2.6287  -4.698 2.62e-06 ***
## width         0.4972     0.1017   4.887 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 194.45  on 171  degrees of freedom
## AIC: 198.45
## 
## Number of Fisher Scoring iterations: 4
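In R the same LRT can be pulled from the stored deviances of the single fit (a sketch, assuming fit is the width model above):

fit$null.deviance - fit$deviance                        # 225.76 - 194.45 = 31.31
1 - pchisq(fit$null.deviance - fit$deviance, df = 1)    # LRT p-value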

4.3 Logistic Regression with Categorical Predictors

  • Suppose Y is a dichotomous response variable (1 = success, 0 = failure)
  • Suppose X and Z are two dichotomous explanatory variables

\[ x = \begin{cases} 0, & \text{if x absent}\\ 1, & \text{if x present} \end{cases} \]
\[ z = \begin{cases} 0, & \text{if z absent}\\ 1, & \text{if z present} \end{cases} \]

The data can be coded with indicator (dummy) variables:

x\y 1 0
1
0

The logit regression model for \(\pi\) is \(logit(\pi)=log_e(\dfrac{\pi}{1-\pi}) = \alpha+\beta_1x+\beta_2z\) : Call this model *

This model is said to have main effects X and Z

x z logit
1 1 \(\alpha+\beta_1+\beta_2\)
1 0 \(\alpha+\beta_1\)
0 1 \(\alpha+\beta_2\)
0 0 \(\alpha\)

Homogeneous Association
- The effect of x is the same at z = 0 and z = 1
- The effect of z is the same at x = 0 and x = 1

Using our model (*), let’s compare the logit (for a fixed z) where X = 0 and X = 1. For convenience, use the notation \(π_{ik}\) for cell probabilities, where i = x, x’ and k = z, z’ (the superscript ’ indicates the factor is absent).

Let z be present
\(log_e(\dfrac {π_{xz}}{1-π_{xz}}) = \alpha + \beta_1+\beta_2\)
\(log_e(\dfrac {π_{x'z}}{1-π_{x'z}}) = \alpha +\beta_2\)

Take difference of two equations
\(log_e(\dfrac {π_{xz}}{1-π_{xz}})-log_e(\dfrac {π_{x'z}}{1-π_{x'z}})= \alpha + \beta_1+\beta_2 -( \alpha +\beta_2) = \beta_1\)
\(log_e \Bigg[ \dfrac{\dfrac {π_{xz}}{1-π_{xz}}}{\dfrac {π_{x'z}}{1-π_{x'z}}}\Bigg] = \beta_1\)
OR
\(\dfrac {π_{xz}(1-π_{x'z})}{π_{x'z}(1-π_{xz})} = e^{\beta_1}\)
OR
\(\dfrac {π_{xz}}{1-π_{xz}} = e^{\beta_1}\dfrac {π_{x'z}}{1-π_{x'z}}\)

Let z be absent
\(log_e(\dfrac {π_{xz'}}{1-π_{xz'}}) = \alpha + \beta_1\)
\(log_e(\dfrac {π_{x'z'}}{1-π_{x'z'}}) = \alpha\)

Take difference of two equations
\(log_e(\dfrac {π_{xz'}}{1-π_{xz'}})-log_e(\dfrac {π_{x'z'}}{1-π_{x'z'}})= \alpha + \beta_1 - \alpha = \beta_1\)
\(log_e \Bigg[ \dfrac{\dfrac {π_{xz'}}{1-π_{xz'}}}{\dfrac {π_{x'z'}}{1-π_{x'z'}}}\Bigg] = \beta_1\)
OR
\(\dfrac {π_{xz'}(1-π_{x'z'})}{π_{x'z'}(1-π_{xz'})} = e^{\beta_1}\)
OR
\(\dfrac {π_{xz'}}{1-π_{xz'}} = e^{\beta_1}\dfrac {π_{x'z'}}{1-π_{x'z'}}\)

Note that the expressions above involve odds ratios. What are these odds ratios? Draw the two 2 × 2 tables (one at Z = 0, one at Z = 1) and use the previous notation \(π_{ik}\) for cell probabilities, where i = x, x’ and k = z, z’.

Z = present

x\y 1 0
1 \(π_{xz}\) \(1-π_{xz}\)
0 \(π_{x'z}\) \(1-π_{x'z}\)

Odds ratio between x and y given z present (z = 1)
\(\theta_{xy(1)}\) = \(\dfrac{π_{xz}(1-π_{x'z})}{π_{x'z}(1-π_{xz})}\)

Z = absent

x\y 1 0
1 \(π_{xz'}\) \(1-π_{xz'}\)
0 \(π_{x'z'}\) \(1-π_{x'z'}\)

Odds ratio between x and y given z absent (z = 0)
\(\theta_{xy(0)}\) = \(\dfrac{π_{xz'}(1-π_{x'z'})}{π_{x'z'}(1-π_{xz'})}\)

This tells us something very important about \(\beta_1\) and the conditional odds ratio between X and Y: \(\theta_{xy(0)} = \theta_{xy(1)} = e^{\beta_1}\)

There is constant odds ratio between x and y across z.

x and y have a homogeneous Association

What condition must be true in order to establish conditional independence (all \(\theta = 1\)) between X and Y?

By definition, conditional independence requires \(\theta_{xy(0)} = \theta_{xy(1)} = 1\), i.e. \(e^{\beta_1} = 1\), which holds when \(\beta_1 = 0\). So \(\beta_1 = 0\) implies conditional independence between x and y.

Given \(\beta_1 = 0\)
\(logit(\pi)= \alpha+\beta_1x+\beta_2z = \alpha+0+\beta_2z = \alpha+\beta_2z\)

Example: AZT Use and AIDS In a study described in the NY Times (1991), researchers investigated the effects of the AZT drug in slowing the development of AIDS symptoms. 338 veterans whose immune systems were beginning to falter after infection with AIDS were randomly assigned to receive AZT immediately or after some delay.

  • Y = whether AIDS symptoms developed (1 = yes, 0 = no)
  • X = Immediate AZT treatment (1 = yes, 0 = no)
    • X is an indicator/dummy variable
  • z = race (1 = white, 0 = black)

z = 1 (Whites)

Treat\Symptoms Yes No
Immed 14 93
Delay 32 81

z = 0 (Blacks)

Treat\Symptoms Yes No
Immed 11 52
Delay 12 43

Reference Levels in R When using a categorical explanatory variable in regression, one of the levels of the variable is called the reference level. Information about all levels except reference appears in the output. You should have seen this in STAT 324 regardless of the software you used. In the example above, we have two categorical explanatory variables (AZT and Race). When categories are text-based, R will designate the level that appears alphabetically first as reference level:
- AZT (‘yes’, ‘no’): ‘no’ comes first alphabetically so ‘no’ is reference for AZT
- Race (‘white’, ‘black’): ‘black’ comes first alphabetically so ‘black’ is reference for Race

aids = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week7/AIDS.txt', sep= '', header = T)
total = aids$yes + aids$no
fit = glm(yes/total~race + azt, weights = total, family = binomial, data = aids)

REFERENCE LEVEL IS DETERMINED BY ALPHABETICAL ORDER FOR TEXT-BASED LEVELS. Default behavior: ‘black’ is the reference for race, ‘no’ is the reference for AZT. In the output below, only information about NON-REFERENCE LEVELS is shown for the parameter estimates.

summary(fit)
## 
## Call:
## glm(formula = yes/total ~ race + azt, family = binomial, data = aids, 
##     weights = total)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -0.5547   0.4253   0.7035  -0.6326  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.07357    0.26294  -4.083 4.45e-05 ***
## racewhite    0.05548    0.28861   0.192  0.84755    
## aztyes      -0.71946    0.27898  -2.579  0.00991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3499  on 3  degrees of freedom
## Residual deviance: 1.3835  on 1  degrees of freedom
## AIC: 24.86
## 
## Number of Fisher Scoring iterations: 4
predicted = fit$fitted.values
predicted
##         1         2         3         4 
## 0.1496245 0.2653998 0.1427012 0.2547241
std_resid = rstandard(fit,type="pearson")
std_resid
##         1         2         3         4 
## -1.179418  1.179418  1.179418 -1.179418
cbind(aids,predicted,std_resid)
##    race azt yes no predicted std_resid
## 1 white yes  14 93 0.1496245 -1.179418
## 2 white  no  32 81 0.2653998  1.179418
## 3 black yes  11 52 0.1427012  1.179418
## 4 black  no  12 43 0.2547241 -1.179418

\(logit(\hatπ) = -1.0736 - 0.7195 AZT + 0.0555Race\)

x(AZT) z(RACE) logit probability
1(y) 1(w) -1.0736 - 0.7195(1) + 0.0555(1) = -1.7376 0.1496245
1(y) 0(b) -1.0736 - 0.7195(1) + 0.0555(0) = -1.7931 0.1427012
0(n) 1(w) -1.0736 - 0.7195(0) + 0.0555(1) = -1.0181 0.2653998
0(n) 0(b) -1.0736 - 0.7195(0) + 0.0555(0) = -1.0736 0.2547241
Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: yes/total
##      LR Chisq Df Pr(>Chisq)   
## race   0.0371  1   0.847295   
## azt    6.8709  1   0.008761 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the R output on Page 153, determine the estimated main effect of AZT and the estimated conditional odds ratio of immediate AZT use (X = 1) on development of AIDS symptoms (Y = 1). Interpret this odds ratio:

Main effect of AZT = -0.71946 = \(\hat\beta_1\)

\(\hat\theta_{xy(0)} = \hat\theta_{xy(1)} = e^{-0.7195} = 0.49\)

The odds of developing AIDS symptoms are roughly halved for those who receive AZT immediately vs. those whose AZT treatment is delayed (controlling for race).
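The same estimate, and a Wald interval for it, can be pulled straight from the fitted object (a sketch, assuming fit is the race + azt model above):

exp(coef(fit)["aztyes"])                # estimated conditional odds ratio, about 0.49
exp(confint.default(fit)["aztyes", ])   # Wald 95% CI for that odds ratio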

Test of AZT

Conduct a hypothesis test of conditional independence of AZT treatment and the development of AIDS symptoms, controlling for race.
In other words, the required null hypothesis with respect to model parameters is:
H0: \(\beta_1 = 0\)
Ha: \(\beta_1 \ne 0\)

summary(fit)$coefficients
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.07357363  0.2629407 -4.0829494 4.446771e-05
## racewhite    0.05548453  0.2886132  0.1922453 8.475501e-01
## aztyes      -0.71945991  0.2789791 -2.5789025 9.911477e-03
Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: yes/total
##      LR Chisq Df Pr(>Chisq)   
## race   0.0371  1   0.847295   
## azt    6.8709  1   0.008761 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wald: z = -2.5789025, p-value = 9.911477e-03
LRT: \(\chi^2 = 6.8709\), p-value = 0.008761

Reject H0 at any standard alpha level.
Reject the hypothesis of conditional independence between x and y.

Test of RACE

H0: \(\beta_2 = 0\)
Ha: \(\beta_2 \ne 0\)

summary(fit)$coefficients
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.07357363  0.2629407 -4.0829494 4.446771e-05
## racewhite    0.05548453  0.2886132  0.1922453 8.475501e-01
## aztyes      -0.71945991  0.2789791 -2.5789025 9.911477e-03
Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: yes/total
##      LR Chisq Df Pr(>Chisq)   
## race   0.0371  1   0.847295   
## azt    6.8709  1   0.008761 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wald: z = 0.1922453, p-value = 8.475501e-01 (0.8475501)
LRT: \(\chi^2 = 0.0371\), p-value = 0.847295

Fail to reject H0 at any standard alpha level.
Fail to reject the hypothesis of conditional independence between z and y.
The data are consistent with conditional independence between z and y.

Reduced Model

fit2 = glm(yes/total ~ race, weights = total, family=binomial, data=aids) 
summary(fit2)
## 
## Call:
## glm(formula = yes/total ~ race, family = binomial, data = aids, 
##     weights = total)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -2.1028   1.8650  -0.4126   0.4294  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.41838    0.23239  -6.103 1.04e-09 ***
## racewhite    0.08797    0.28547   0.308    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3499  on 3  degrees of freedom
## Residual deviance: 8.2544  on 2  degrees of freedom
## AIC: 29.731
## 
## Number of Fisher Scoring iterations: 4

Using the appropriate deviance values and the methods we discussed at the end of Chapter 3, write the value of the LRT statistic and the corresponding conclusion: \(-2[L_{M_0}-L_{M_1}] = Dev_0 - Dev_1 = 8.2544 - 1.3835 = 6.87\)
Distribution \(\chi^2_{\Delta =1}\)

p_val = 1 - pchisq(6.87, 1)
p_val
## [1] 0.008765465

Reject H0 at any standard alpha level.
Reject the hypothesis of conditional independence between x and y.

We stated the output for ONLY the full model was needed to manually construct the LRT statistic. Can we do that here as well? That is, can we solely rely on the output on Page 153 to generate our LRT statistic? Why or why not?

No. Recall that the null deviance corresponds to the model with no explanatory variables. We do not want to test dropping BOTH x and z; here we only want to see if we can drop x.

Does it matter what the reference levels are for categorical explanatory variables?

fit1=glm(yes/total ~ race + azt, weights = total, family=binomial, data=aids)
summary(fit1)$coefficient
##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -1.07357363  0.2629407 -4.0829494 4.446771e-05
## racewhite    0.05548453  0.2886132  0.1922453 8.475501e-01
## aztyes      -0.71945991  0.2789791 -2.5789025 9.911477e-03

Designating the Reference Level in R

aids
##    race azt yes no
## 1 white yes  14 93
## 2 white  no  32 81
## 3 black yes  11 52
## 4 black  no  12 43
aids$race = relevel(factor(aids$race), ref = 'white')
aids$azt = relevel(factor(aids$azt), ref = 'yes')
fit2 = glm(yes/total~race + azt, weights = total, family=binomial, data=aids)
summary(fit2)
## 
## Call:
## glm(formula = yes/total ~ race + azt, family = binomial, data = aids, 
##     weights = total)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -0.5547   0.4253   0.7035  -0.6326  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.73755    0.24038  -7.228 4.89e-13 ***
## raceblack   -0.05548    0.28861  -0.192  0.84755    
## aztno        0.71946    0.27898   2.579  0.00991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3499  on 3  degrees of freedom
## Residual deviance: 1.3835  on 1  degrees of freedom
## AIC: 24.86
## 
## Number of Fisher Scoring iterations: 4

ANOVA-Type Model Representation of Factors

  • x has 2 levels
X x logit(\(\pi\))
level1 1 \(\alpha+\beta*1= \alpha+\beta\)
level2 0 \(\alpha+\beta*0= \alpha\)
  • x has 3 levels
X \(x_1\) \(x_2\) logit(\(\pi\))
level1 1 0 \(\alpha+\beta_1\)
level2 0 1 \(\alpha+\beta_2\)
level3 0 0 \(\alpha\)
  • x has I levels
X \(x_1\) \(x_2\) \(x_{I-1}\) logit(\(\pi\))
level1 1 0 0 \(\alpha+\beta_1\)
level2 0 1 0 \(\alpha+\beta_2\)
level I-1 0 0 1 \(\alpha+\beta_{I-1}\)
level I 0 0 0 \(\alpha\)

\(logit(\pi) = \alpha + \beta_1^xx_1 + \beta_2^xx_2 + ... + \beta_{I-1}^xx_{I-1}\)

Z \(z_1\) \(z_2\) \(z_{I-1}\) logit(\(\pi\))
level1 1 0 0 \(\alpha+\beta_1\)
level2 0 1 0 \(\alpha+\beta_2\)
level I-1 0 0 1 \(\alpha+\beta_{I-1}\)
level I 0 0 0 \(\alpha\)

\(logit(\pi) = \alpha + \beta_1^xx_1 + \beta_2^xx_2 + ... + \beta_{I-1}^xx_{I-1} + \beta_1^zz_1 + \beta_2^zz_2 + ... + \beta_{I-1}^zz_{I-1}\)

The effect of x at level i and z at level k on the logit:

\(logit(\pi) = \alpha + \beta_i^x+\beta_k^z\)
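In R, factor() produces exactly this kind of indicator coding behind the scenes; model.matrix() shows the dummy columns that glm() actually uses (a sketch using the crab color variable):

head(model.matrix(~ factor(color), data = crabs))   # intercept plus one column per non-reference level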

Conditional Independence

Conditional Independence: All odds ratios are 1

Y is dichotomous with explanatory variables X (having I levels) and Z (having K levels). For the model we just discussed, let us see why the following statement is true.

Conditional independence between X and Y, given Z, corresponds to \(\beta_1^x = \beta_2^x= ... = \beta_I^x\)

Ex)

  • Fix the level of Z at k and draw a 4 × 2 table representing the cell probabilities.
  • Show that the conditional odds ratio between X and Y at X = 1 versus X = 2 is exactly \(e^{\beta_1^x-\beta_2^x}\)
    • Show that the conditional odds ratio between X and Y at X = 2 versus X = 3 is exactly \(e^{\beta_2^x-\beta_3^x}\)
    • Show that the conditional odds ratio between X and Y at X = 3 versus X = 4 is exactly \(e^{\beta_3^x-\beta_4^x}\)
x\y 1 2
1 \(π_{11}\) \(1-π_{11}\)
2 \(π_{21}\) \(1-π_{21}\)
3 \(π_{31}\) \(1-π_{31}\)
4 \(π_{41}\) \(1-π_{41}\)

Look at top 2 rows (x = 1 and x = 2)

x\y 1 2
1 \(π_{11}\) \(1-π_{11}\)
2 \(π_{21}\) \(1-π_{21}\)

In general, \(logit(\pi) = \alpha + \beta_i^x+\beta_k^z\)

x = 1 vs x = 2 (at z = k)

\(logit(π_{11}) = log_e \dfrac {π_{11}}{1-π_{11}} = \alpha + \beta_1^x+\beta_k^z\)

\(logit(π_{21}) = log_e \dfrac {π_{21}}{1-π_{21}} = \alpha + \beta_2^x+\beta_k^z\)

Take difference…

\(log_e\Big(\dfrac{odds_1}{odds_2}\Big)=\beta_1^x-\beta_2^x\), i.e. \(log_e(O.R.) = \beta_1^x-\beta_2^x\), so \(\theta_{xy(k)}[x=1 \text{ vs. } x=2] = e^{\beta_1^x-\beta_2^x}\)

By the same argument,
\(\theta_{xy(k)}[x=2 \text{ vs. } x=3] = e^{\beta_2^x-\beta_3^x}\)
\(\theta_{xy(k)}[x=3 \text{ vs. } x=4] = e^{\beta_3^x-\beta_4^x}\)

Under conditional independence, \(e^{\beta_1^x-\beta_2^x} = 1 => \beta_1^x-\beta_2^x = log_e(1) = 0 => \beta_1^x=\beta_2^x\)
Therefore

\(\beta_1^x=\beta_2^x = \beta_3^x = \beta_4^x\)

The Cochran-Mantel-Haenszel Test for 2×2×K Tables

Y is dichotomous with explanatory variables X (having 2 levels) and Z (having K levels).

\(logit(\pi) = \alpha + \beta x+\beta_k^z\)

Using the same argument given on Page 151, the following interpretation holds for β and the conditional odds ratio between X and Y (at any Z = k):

\(\theta_{xy(1)}= \theta_{xy(2)} = ... = \theta_{xy(k)} = e^\beta\)

\(e^\beta = 1 <=> \beta = 0\)
If \(\beta = 0\), x and y are conditionally independent.

Another test for conditional independence is based on the score method. This test is called the Cochran-Mantel-Haenszel Test (CMH).

aids
##    race azt yes no
## 1 white yes  14 93
## 2 white  no  32 81
## 3 black yes  11 52
## 4 black  no  12 43

This test will be conducted via the function mantelhaen.test(). Unfortunately, to use this function, given that our data is a 3-way contingency table, it must first be organized as a 2×2×2 array:

aids.new = array(c(14,32,93,81,11,12,52,43), dim=c(2,2,2), dimnames = list(azt = c("Yes", "No"), aids= c("Yes", "No"), race = c('White', 'Black'))) 
aids.new
## , , race = White
## 
##      aids
## azt   Yes No
##   Yes  14 93
##   No   32 81
## 
## , , race = Black
## 
##      aids
## azt   Yes No
##   Yes  11 52
##   No   12 43
# Perform the CMH test with mantelhaen.test(data, correct=F), where
# correct=F specifies that a continuity correction SHOULD NOT be used
mantelhaen.test(aids.new, correct=F)
## 
##  Mantel-Haenszel chi-squared test without continuity correction
## 
## data:  aids.new
## Mantel-Haenszel X-squared = 6.7624, df = 1, p-value = 0.00931
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2818621 0.8414208
## sample estimates:
## common odds ratio 
##         0.4869955

CMH = 6.7624
p-val = 0.00931

Testing Method Test Statistic p-value
Wald 6.65 0.0099
LRT 6.87 0.0088
Score(CMH) 6.76 0.0093

4.4 Multiple Logistic Regression

  • Simple linear regression \(y = \alpha + \beta x\)

  • Multiple linear regression
    \(y = \alpha + \beta_1 x_1 + \beta_2x_2 + ... + \beta_kx_k\)

  • Simple logistic regression \(logit(π) = \alpha + \beta x\)

  • Multiple logistic regression
    \(logit(π) = \alpha + \beta_1 x_1 + \beta_2x_2 + ... + \beta_kx_k\)

Suppose Y is dichotomous and there are k explanatory variables \(x_1, x_2, ... , x_k\). The corresponding logistic regression model is as given above. In the space below, write the corresponding expression for the odds of success.

\(\dfrac{π(x_1,...,x_k)}{1-π(x_1,...,x_k)} = e^{\alpha + \beta_1x_1+...+\beta_kx_k}\)

Interpretation of \(\beta_i\)

  • \(\beta_i\) is the effect of \(x_i\) on the log odds of success (controlling the other x variables)
  • In other words, \(e^{\beta_i}\) is the multiplicative effect on the odds of a 1-unit increase in \(x_i\), at fixed levels of the other x variables. (*)

Proof:

\(logit[π(x_1,...,x_i,..., x_k)] = \alpha + \beta_1x_1+...+\beta_ix_i+...+\beta_k x_k\)

\(logit[π(x_1,...,x_i+1,..., x_k)] = \alpha + \beta_1x_1+...+\beta_i(x_i+1)+...+\beta_k x_k = \alpha + \beta_1x_1+...+\beta_ix_i+...+\beta_k x_k+\beta_i = logit[π(x_1,...,x_i,..., x_k)]+\beta_i\)

\(log_e( \dfrac{π(x_1, ..., x_i+1, ..., x_k)}{1-π(x_1, ..., x_i+1, ..., x_k)}) - log_e( \dfrac{π(x_1, ..., x_i, ..., x_k)}{1-π(x_1, ..., x_i, ..., x_k)}) = \beta_i\)

\(log_e (O.R. (x_i + 1 vs X_i)) = \beta_i\)
\(O.R. (x_i + 1 vs X_i) = e^{\beta_i}\)
So the interpretation at * is correct.

Example) Horseshoe Crabs
Create a model for x = width of the female, and c = color of the female (medium light, medium, medium dark, and dark)

c1 c2 c3
medium light 1 0 0
medium 0 1 0
medium dark 0 0 1
dark 0 0 0

\(logit(π) = \alpha + \beta_1c_1+\beta_2c_2+ \beta_3c_3 + \beta_4x\)

fit = glm(y~factor(color)+width, family = binomial, data = crabs)
# The value "1" is lowest, so by default R will will set "1" to be the reference level. 
summary(fit)$coefficients
##                    Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)    -11.38519276  2.8734572 -3.96219324 7.426439e-05
## factor(color)2   0.07241694  0.7398922  0.09787498 9.220316e-01
## factor(color)3  -0.22379766  0.7770789 -0.28799862 7.733478e-01
## factor(color)4  -1.32991913  0.8525220 -1.55998209 1.187641e-01
## width            0.46795598  0.1055446  4.43372912 9.261698e-06
car::Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: y
##               LR Chisq Df Pr(>Chisq)    
## factor(color)   6.9956  3    0.07204 .  
## width          24.6038  1  7.041e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(logit(π) = -11.39 + 0.07c_1-0.22c_2 -1.33 c_3 + 0.47x\)

Medium light: \(logit(π) = -11.39 + 0.47x\)
Medium: \(logit(π) = -11.39 + 0.07 + 0.47x\)
Medium dark: \(logit(π) = -11.39 - 0.22 + 0.47x\)
Dark: \(logit(π) = -11.39 - 1.33 + 0.47x\)

The graph exhibits a horizontal shift for the 4 curves

Find the estimated odds of success for the color “medium light” (use “ml” for subscript):

\(\dfrac{\hatπ_{ml}}{1-\hatπ_{ml}} = e^{-11.39+0+0.47x}\)

Find the estimated odds of success for the color “dark” (use “d” for subscript):
\(\dfrac{\hatπ_{d}}{1-\hatπ_{d}} = e^{-11.39-1.33+0.47x}\)

The estimated odds ratio for these two odds

Odds ratio (ml vs. d): \(e^{(-11.39+0+0.47x)-(-11.39-1.33+0.47x)} = e^{0-(-1.33)} = e^{1.33} = 3.78\)

The estimated odds of success for medium light females are 3.78 times the odds for dark females (at any fixed width).

Find the estimated odds of success for the color “medium” (use “m” for subscript):
\(\dfrac{\hatπ_{m}}{1-\hatπ_{m}} = e^{-11.39+0.07+0.47x}\)

Find the estimated odds of success for the color “medium dark” (use “md” for subscript):
\(\dfrac{\hatπ_{md}}{1-\hatπ_{md}} = e^{-11.39-0.22+0.47x}\)

Odds ratio (m vs. md): \(e^{0.07-(-0.22)} = 1.34\)

The estimated odds of success for medium females are 1.34 times the odds for medium dark females (at any fixed width).

In general, we can determine the estimated odds ratio between any two colors from the individual color parameter estimates: the exponentiated difference between the two estimates is the estimated odds ratio for those colors (see the sketch below).
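A sketch of that calculation, assuming fit is the factor-color model fit above:

b = coef(fit)
exp(b["factor(color)2"] - b["factor(color)3"])   # medium vs. medium dark, about 1.34
exp(0 - b["factor(color)4"])                     # medium light (reference) vs. dark, about 3.78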

\(logit(π) = \alpha + \beta_1c_1+\beta_2c_2+ \beta_3c_3 + \beta_4x\)

Given

car::Anova(fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: y
##               LR Chisq Df Pr(>Chisq)    
## factor(color)   6.9956  3    0.07204 .  
## width          24.6038  1  7.041e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For color:
\(H0: \beta_1 = \beta_2 = \beta_3 = 0\)
Ha: At least one \(\beta\) is nonzero
LRT = 6.9956
p-value: 0.072
With a p-value of 0.07, there is weak to moderate evidence that color contributes to predicting success (y = 1, at least one satellite).

For width:
\(H0: \beta_4 = 0\)
\(Ha: \beta_4 \ne 0\)
LRT = 24.6, p-value: 7.0e-07

Strong evidence against H0, suggesting that width is a contributing factor for having at least one satellite.

Quantitative Treatment of Ordinal Predictor

\(logit(π) = \alpha + \beta_1c_1+\beta_2c_2+ \beta_3c_3 + \beta_4x\)

Quantitatively:
\(logit(π) = \alpha + \beta_1c + \beta_2x\)

# Treat color as quantitative
fit = glm(y ~ color + width, family = binomial, data = crabs) 
summary(fit)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -10.0708390  2.8068339 -3.587971 3.332611e-04
## color        -0.5090467  0.2236817 -2.275763 2.286018e-02
## width         0.4583097  0.1040181  4.406056 1.052696e-05

\(logit(π) = -10.07 - 0.51c +0.46 x\)

Predict for c = medium (c = 2) and x = 27 cm
\(logit(π) = -10.07 - 0.51(2) +0.46(27)\)

\(\hatπ = \dfrac{e^{-10.07 - 0.51(2) +0.46(27)}}{1+e^{-10.07 - 0.51(2) +0.46(27)}} = 0.791\)
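The same prediction via predict() (a sketch, assuming fit is the quantitative-color model from the chunk above); the small difference from 0.791 comes from rounding the coefficients in the hand calculation:

predict(fit, newdata = data.frame(color = 2, width = 27), type = "response")   # about 0.78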

Interpretation of \(e^{\hat\beta_2}\):
\(e^{0.46} = 1.58\)
This is the estimated odds ratio of success for a one-unit increase in x (width), controlling for c (color).

Interpretation of \(e^{\hat\beta_1}\):
\(e^{-0.51} = 0.6\)
This is the estimated odds ratio of success for a one-level increase in c (color, i.e., getting one shade darker), controlling for x (width).

ANOVA for factor and ordinal color

Use it as Factor
\(logit(π) = \alpha + \beta_1c_1+\beta_2c_2+ \beta_3c_3 + \beta_4x\)

Use it as Ordinal
\(logit(π) = \alpha + \beta_1c + \beta_2x\)

# FULL MODEL using factor(color) in glm()
fit <-glm(y ~ factor(color) + width, family = binomial, data = crabs)
summary(fit)
## 
## Call:
## glm(formula = y ~ factor(color) + width, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1124  -0.9848   0.5243   0.8513   2.1413  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -11.38519    2.87346  -3.962 7.43e-05 ***
## factor(color)2   0.07242    0.73989   0.098    0.922    
## factor(color)3  -0.22380    0.77708  -0.288    0.773    
## factor(color)4  -1.32992    0.85252  -1.560    0.119    
## width            0.46796    0.10554   4.434 9.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 187.46  on 168  degrees of freedom
## AIC: 197.46
## 
## Number of Fisher Scoring iterations: 4
# REDUCED MODEL using color in glm()
fit <-glm(y ~ color + width, family = binomial, data = crabs)
summary(fit)
## 
## Call:
## glm(formula = y ~ color + width, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1692  -0.9889   0.5429   0.8700   1.9742  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.0708     2.8068  -3.588 0.000333 ***
## color        -0.5090     0.2237  -2.276 0.022860 *  
## width         0.4583     0.1040   4.406 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 189.12  on 170  degrees of freedom
## AIC: 195.12
## 
## Number of Fisher Scoring iterations: 4

H0: Reduced/ordinal model is sufficient relative to the full/factor model
Ha: Reduced/ordinal model is NOT sufficient relative to the full/factor model

\(LRT = Dev_0 - Dev_1 = 189.12-187.46 =1.66\)
\(\chi^2_{\Delta} = 170 - 168=2\)

p_val = 1-pchisq(1.66, 2)
p_val
## [1] 0.4360493

p_val = 0.4360493

Do not reject H0. The reduced model is sufficient relative to the full model => we can treat color as quantitative.

From the factor fit, the estimated color effects are 0 (medium light, reference), 0.07, -0.22, and -1.33.

What if we treat color as two levels, since 0, 0.07, and -0.22 are all close to each other?

We can consider a different coding scheme where the first three categories (med. lt., med., med. dark) are coded as “success” and the last category as “failure” (dark). So set dark = 0 for non-dark and dark = 1 for dark.

Dark model:
\(logit(π) = \alpha + \beta_1d + \beta_2x\)

Modify df

#First, set dark=1 when color=4 and dark=0 otherwise
dark <-(crabs$color==4)*1  
#Add the variable dark as a column to the original data object and store as crabs.new 
crabs.new <-cbind(crabs,dark)  
crabs.new[28:31,]  
##    crab sat y weight width color spine dark
## 28   28   5 1    2.7  26.8     2     1    0
## 29   29   0 0    2.6  27.5     4     3    1
## 30   30   0 0    2.1  24.9     2     3    0
## 31   31   4 1    3.2  29.3     1     1    0
fit3 = glm(y ~ dark + width, family = binomial, data = crabs)
summary(fit3)
## 
## Call:
## glm(formula = y ~ dark + width, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0821  -0.9932   0.5274   0.8606   2.1553  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.6790     2.6925  -4.338 1.44e-05 ***
## dark         -1.3005     0.5259  -2.473   0.0134 *  
## width         0.4782     0.1041   4.592 4.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 187.96  on 170  degrees of freedom
## AIC: 193.96
## 
## Number of Fisher Scoring iterations: 4

H0: Reduced/dark model is sufficient relative to the full/factor model
Ha: Reduced/dark model is NOT sufficient relative to the full/factor model

\(LRT = Dev_0 - Dev_1 = 187.96 - 187.46 = 0.5\)
\(\chi^2_{\Delta} = 170 - 168=2\)

p_val = 1-pchisq(0.5, 2)
p_val
## [1] 0.7788008

p_val = 0.7788008

Do not reject H0. The reduced/dark model is sufficient relative to the full model => we can treat color as dichotomous (dark vs. non-dark).

Since the LRT of 0.5 is closer to 0 than the LRT of 1.66 from the ordinal-color comparison, the dichotomous (dark) model may be slightly preferable, but the two models are very comparable.

Allowing Interaction

\(logit(π) = \alpha + \beta_1d + \beta_2x + \beta_3 dx\)

fit4 <-glm(y ~ dark + width + dark*width, family = binomial, data = crabs) 
summary(fit4)
## 
## Call:
## glm(formula = y ~ dark + width + dark * width, family = binomial, 
##     data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1366  -0.9344   0.4996   0.8554   1.7753  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.8117     2.9576  -4.332 1.48e-05 ***
## dark          6.9578     7.3182   0.951    0.342    
## width         0.5222     0.1146   4.556 5.21e-06 ***
## dark:width   -0.3217     0.2857  -1.126    0.260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 186.79  on 169  degrees of freedom
## AIC: 194.79
## 
## Number of Fisher Scoring iterations: 4
car::Anova(fit4)
## Analysis of Deviance Table (Type II tests)
## 
## Response: y
##            LR Chisq Df Pr(>Chisq)    
## dark         6.4948  1    0.01082 *  
## width       26.8351  1  2.216e-07 ***
## dark:width   1.1715  1    0.27909    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H0: \beta_3 = 0\)
\(Ha: \beta_3 \ne 0\)

LRT: 1.17
p-value = 0.27
Fail to reject H0. The interaction term is not a useful predictor.

\(logit(π) = -12.81 + 6.96d + 0.52x -0.32 dx\)

For dark crabs d = 1:

\(logit(π) = -12.81 + 6.96(1) + 0.52x -0.32(1)x = -5.85 + 0.2x\)

For non-dark crabs d = 0:

\(logit(π) = -12.81 + 6.96(0) + 0.52x -0.32(0)x = -12.81 + 0.52x\)
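A quick way to see the two fitted curves from the interaction model (a sketch, assuming fit4 from above; plogis() is the inverse logit):

b = coef(fit4)
w = seq(21, 33.5, by = 0.1)
plot(w, plogis(b[1] + b[2] + (b[3] + b[4])*w), type = "l", lty = 2,
     xlab = "width", ylab = "estimated P(y = 1)")   # dark crabs (d = 1)
lines(w, plogis(b[1] + b[3]*w))                     # non-dark crabs (d = 0)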

The plot of the 3D logistic regression model

Ch 5 Building and Applying Logistic Regression Models

5.1 Strategies in Model Selection

  • When building regression models, we want to accomplish two goals:
  • Include sufficient complexity so that the model performs adequately
  • Not overly complicate the model, making it difficult to use and interpret

A model that can achieve these two goals is said to be parsimonious.

Multicollinearity

Example: Horseshoe crabs
- Color (four categories)
Specify the indicator variables: \(c_1, c_2, c_3\)
- Spine condition (three categories)
Specify the indicator variables: \(s_1, s_2\)
- Width

The full logit model:

\(logit(π) = \alpha + \beta_1c_1 + \beta_2c_2 + \beta_3 c_3 + \beta_4s_1 + \beta_5s_2+\beta_6wi+\beta_7we\)

Treat color as categorical, and reference level at 4 (“dark”)
Treat spine as categorical, and reference level at 3

crabs = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week5/crabs.txt', sep='', header =T)
crabs$color = relevel(factor(crabs$color), ref = '4')

crabs$spine = relevel(factor(crabs$spine), ref = '3')
fit <-glm(y ~ factor(color) + factor(spine) + width + weight, family = binomial, data = crabs) 
summary(fit)
## 
## Call:
## glm(formula = y ~ factor(color) + factor(spine) + width + weight, 
##     family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1977  -0.9424   0.4849   0.8491   2.1198  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -9.2734     3.8378  -2.416  0.01568 * 
## factor(color)1   1.6087     0.9355   1.720  0.08552 . 
## factor(color)2   1.5058     0.5667   2.657  0.00788 **
## factor(color)3   1.1198     0.5933   1.887  0.05910 . 
## factor(spine)1  -0.4003     0.5027  -0.796  0.42588   
## factor(spine)2  -0.4963     0.6292  -0.789  0.43024   
## width            0.2631     0.1953   1.347  0.17788   
## weight           0.8258     0.7038   1.173  0.24069   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 185.20  on 165  degrees of freedom
## AIC: 201.2
## 
## Number of Fisher Scoring iterations: 4

\(logit(π) = \alpha + \beta_1c_1 + \beta_2c_2 + \beta_3 c_3 + \beta_4s_1 + \beta_5s_2+\beta_6wi+\beta_7we\)

fit2 = glm(y~1, family = binomial, data = crabs)
summary(fit2)
## 
## Call:
## glm(formula = y ~ 1, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4326  -1.4326   0.9421   0.9421   0.9421  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.5824     0.1585   3.673 0.000239 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 225.76  on 172  degrees of freedom
## AIC: 227.76
## 
## Number of Fisher Scoring iterations: 4

\(H0: \beta_1 = \beta_2 = ... = \beta_7 = 0\)
Ha: At least one is non zero

\(LRT: Dev_0 - Dev_1 = 225.76 - 185.20 = 40.56\) \(\Delta = 172 - 165 = 7\)

p_value = 1 - pchisq(40.56,7)
p_value
## [1] 9.832923e-07

Reject H0; conclude that at least one of the explanatory variables has a nonzero coefficient.

We suspect multicollinearity between width and weight.

The correlation between width and weight was 0.887
Let’s drop weight and keep width

Model Notation

Possible Models

Results of LRT
Compare model 1 (CS+CW+SW) and 2 (C+S+W), with H0: model 2 (reduced) is sufficient compared to model 1 (full model); p-value = 0.3. Do not reject H0; there is not much difference between model 2 and model 1.

Compare model 2 (C+S+W) and 3a (C+S), with H0: model 3a (reduced) is sufficient compared to model 2 (full model); p-value < 0.0001. Reject H0; there is a significant difference between model 3a and model 2. Do not drop W.

Compare model 2 (C+S+W) and 3c (C+W), with H0: model 3c (reduced) is sufficient compared to model 2 (full model); p-value = 0.64. Do not reject H0; there is no significant difference between model 3c and model 2. Drop S.

Compare model 3c (C+W) and 5 (D+W), with H0: model 5 (reduced) is sufficient compared to model 3c (full model); p-value = 0.78. Do not reject H0; there is no significant difference between model 5 and model 3c. Drop C, add D.

Akaike Information Criterion

Aim for small AIC score
\(AIC = -2(\text{log-likelihood} - \text{number of parameters})\)
or equivalently
\(AIC = -2(\text{log-likelihood}) + 2(\text{number of parameters})\)

AIC is analogous to a criterion used in least squares regression (STAT 324) which assesses a penalty for a model with a large number of parameters; that quantity is called adjusted \(R^2\).

model_2 <-glm(y ~ factor(color) + factor(spine) + width, family = binomial, data = crabs) 
summary(model_2)$aic
## [1] 200.6119
model_3a <-glm(y ~ factor(color) + factor(spine), family = binomial, data = crabs) 
summary(model_3a)$aic
## [1] 220.8338
model_3b <-glm(y ~ factor(spine) + width, family = binomial, data = crabs) 
summary(model_3b)$aic
## [1] 202.4248
model_3c <-glm(y ~ factor(color) + width, family = binomial, data = crabs) 
summary(model_3c)$aic
## [1] 197.457
crabs$dark <-(crabs$color==4)*1  
model_5 <-glm(y ~ dark + width, family = binomial, data = crabs) 
summary(model_5)$aic
## [1] 193.9579

The lowest AIC values are found for model_3c and model_5.
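As a check of the AIC formula, it can be recomputed by hand from the log-likelihood (a sketch using model_5 from above):

-2*as.numeric(logLik(model_5)) + 2*length(coef(model_5))   # matches summary(model_5)$aic above
AIC(model_5)                                               # built-in equivalent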

5.2 Model Checking

  • Suppose the model of interest is denoted by M
  • We can compare this against the fullest model known as the saturated model (S)
  • We do this comparison via the reported deviance statistic
  • There are two choices for goodness of fit statistics (comparing M to S)

LRT based on the \(G^2\) form (this is deviance):

\(G^2(M) = 2 \sum \text{observed}\; log_e\Big(\dfrac{\text{observed}}{\text{fitted}}\Big)\)

Chi-square statistic based on the \(\chi^2\) form:
\(\chi^2(M) = \sum\dfrac{(\text{observed}-\text{fitted})^2}{\text{fitted}}\)

Example:

Z = 1 (White)

Yes No
Immed 14 93
Delay 32 81

z = 0 (Black)

Yes No
Immed 11 52
Delay 12 43
aids = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week7/AIDS.txt', sep= '', header = T)
total = aids$yes + aids$no
fit = glm(yes/total~race + azt, weights = total, family = binomial, data = aids)
summary(fit)
## 
## Call:
## glm(formula = yes/total ~ race + azt, family = binomial, data = aids, 
##     weights = total)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -0.5547   0.4253   0.7035  -0.6326  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.07357    0.26294  -4.083 4.45e-05 ***
## racewhite    0.05548    0.28861   0.192  0.84755    
## aztyes      -0.71946    0.27898  -2.579  0.00991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3499  on 3  degrees of freedom
## Residual deviance: 1.3835  on 1  degrees of freedom
## AIC: 24.86
## 
## Number of Fisher Scoring iterations: 4
predicted <-fit$fitted.values
# Generate the standardized residuals
std_resids <- rstandard(fit, type="pearson")
# Bind the original data with the predicted probabilities and standardized
# residuals to align values with explanatory variable combinations
cbind(aids, predicted, std_resids)
##    race azt yes no predicted std_resids
## 1 white yes  14 93 0.1496245  -1.179418
## 2 white  no  32 81 0.2653998   1.179418
## 3 black yes  11 52 0.1427012   1.179418
## 4 black  no  12 43 0.2547241  -1.179418

\(logit(π) = -1.0736-0.7195x+0.0555z\)

\(\hatπ(x,z)= \dfrac{e^{-1.0736-0.7195x+0.0555z}}{1+e^{-1.0736-0.7195x+0.0555z}}\)

\(\hatπ(x=1,z=1) = 0.1496\)

Based on the predicted probabilities we can then determine fitted counts for all cells.

Z = 1 (White)

  • For X = 1 (immed) there were a total of 107 individuals
    • \(\hatπ(x=1,z=1) = 0.1496245\)
    • Fitted values at (x=1, z=1, y = 1) = 0.1496245(107) = 16.0098215
    • 107-16.00982 = 90.99018
  • For X = 0 (delay) there were a total of 113 individuals
    • \(\hatπ(x=0,z=1) = 0.2653998\)
    • Fitted value at (x=0, z=1, y = 1) = 0.2653998(113) = 29.99018
    • 113-29.99018 = 83.00982
Yes No Total
x=1 16.00982 90.99018 107
x=0 29.99018 83.00982 113

Z = 0 (Black)

  • For X = 1 (immed) there were a total of 63 individuals
    • \(\hatπ(x=1,z=0) = 0.1427012\)
    • Fitted value at (x=1, z=0, y = 1) = 0.1427012(63) = 8.9901756
    • 63-8.990176 = 54.009824
  • For X = 0 (delay) there were a total of 55 individuals
    • \(\hatπ(x=0,z=0) = 0.2547241\)
    • Fitted value at (x=0, z=0, y = 1) = 0.2547241(55) = 14.0098255
    • 55-14.00983 = 40.99017

z = 0

Yes No Total
x=1 8.990176 54.00982 63
x=0 14.00983 40.99017 55

z = 1

Yes No Total
x=1 16.00982 90.99018 107
x=0 29.99018 83.00982 113

\(G^2(M) = 2\Big[14\, log_e\Big(\dfrac{14}{16.00982}\Big)+93\, log_e\Big(\dfrac{93}{90.99018}\Big)+...+43\, log_e\Big(\dfrac{43}{40.99017}\Big)\Big] = 1.3835\)

summary(fit)$deviance
## [1] 1.38353

Saturated has 4 parameters -> \(π(x=1,z=1),π(x=0,z=1),π(x=1,z=0),π(x=0,z=0)\)
Model M has 3 parameters -> \(\alpha, \beta_1,\beta_2\)
Use \(\chi^2_{\Delta=1}\) distribution
p-value = \(p(\chi^2_{\Delta=1}\ge1.3835) = 0.2395\)
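In R (a sketch; the deviance and its degrees of freedom come from the summary(fit) output above):

1 - pchisq(1.3835, df = 1)   # approximately 0.2395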

Write the form of the null and alternative hypotheses and your conclusion:
H0: Model M is sufficient relative to the Saturated Model
Ha: Model M is NOT sufficient relative to the Saturated Model
Fail to reject H0. Model M is sufficient relative to the Saturated Model.

Ch 6 Multicategory Logit Models

Example: Does the amount of money one contributes to charities serve as a good indicator of that person’s political affiliation?
- Response variable: y = {Republican, Democrat, Other} (nominal level)
- Level probabilities (\(π_1,π_2,π_3\))
- Explanatory variable: X = total amount of charitable contributions in the last tax year

Baseline-Category Logits (Generalized Logit Models)

  • Suppose the response variable, Y, is multichotomous with J levels
  • The models for nominal Y compare each category to a baseline level
  • Let category J be the baseline/reference level
  • Baseline choice is completely arbitrary
  • Logit model comparing level j to level J is given as \(log_e(π_j/π_J)\), \(j = 1,...,J-1\)
  • Interpretation by Agresti and by many in the field: \(log_e(π_j/π_J)\) = “log odds” — but is that really an odds?

In political example: 3 levels of Y (\(π_1,π_2,π_3\))
Claim: \(\dfrac{π_1}{π_3}\) is an odds
\(\dfrac{π_1}{π_3} = \dfrac{\dfrac{π_1}{π_1+π_3}}{\dfrac{π_3}{π_1+π_3}}=\dfrac{\dfrac{π_1}{π_1+π_3}}{1-\dfrac{π_1}{π_1+π_3}}= \dfrac{π}{1-π}\) = Odds, where \(π = \dfrac{π_1}{π_1+π_3}\)

  • The baseline-category logit with predictor x is \(log_e(\dfrac{π_j}{π_J}) = \alpha_j+\beta_jx\)

  • If J = 2, then \(log_e{\dfrac{π_1}{π_2}} = log_e(\dfrac{π_1}{1-π_1}) = logit(π_1)\) = Gives the usual logit form.

  • The baseline-category logit with predictor x model yields all other pairs of categories: \(log_e(\dfrac{π_a}{π_b}) = log_e(\dfrac{π_a/π_J}{π_b/π_J}) = log_e(\dfrac{π_a}{π_J})-log_e(\dfrac{π_b}{π_J})= (\alpha_a+\beta_ax)-(\alpha_b+\beta_bx) = (\alpha_a-\alpha_b)+(\beta_a-\beta_b)x\)

Example: Alligator Food Choice (finally, a non-horseshoe crab example!)
Data was collected on the foods alligators in the wild choose to eat. For 59 alligators sampled in Lake George, Florida, researchers determined the primary food type (Fish, Invertebrate, Other) and also the alligator length.

  • Response variable: Y = primary food choice (3 levels: F, I, O)
  • Explanatory variable: X = alligator length
  • Write the form of the baseline-category logits: \(log_e(\dfrac{π_F}{π_O})= \alpha_1+\beta_1x\)
    \(log_e(\dfrac{π_I}{π_O}) = \alpha_2+\beta_2x\)
gator = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week9/Alligators.txt', sep = '', header = T) 

library(VGAM)   # vglm() and the multinomial() family come from the VGAM package
fit = vglm(y~x, family = multinomial, data = gator)
summary(fit)
## 
## Call:
## vglm(formula = y ~ x, family = multinomial, data = gator)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1   1.6177     1.3073   1.237  0.21591   
## (Intercept):2   5.6974     1.7937   3.176  0.00149 **
## x:1            -0.1101     0.5171  -0.213  0.83137   
## x:2            -2.4654     0.8996      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 98.3412 on 114 degrees of freedom
## 
## Log-likelihood: -49.1706 on 114 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'x:2'
## 
## 
## Reference group is level  3  of the response

Reference group is level 3 of the response

We have two intercepts because we fit two baseline-category logits: \(log_e(\dfrac{π_F}{π_O}) = \alpha_1+\beta_1x\)
\(log_e(\dfrac{π_I}{π_O}) = \alpha_2+\beta_2x\)

Estimated baseline-category logit: \(log_e(\dfrac{π_F}{π_O}) = 1.6177-0.1101x\)
\(log_e(\dfrac{π_I}{π_O}) = 5.6974-2.4654x\)

The estimated logit model for the remaining pairwise comparison follows from: \(log_e(\dfrac{π_a}{π_b}) = log_e(\dfrac{π_a/π_J}{π_b/π_J}) = log_e(\dfrac{π_a}{π_J})-log_e(\dfrac{π_b}{π_J})= (\alpha_a+\beta_ax)-(\alpha_b+\beta_bx) = (\alpha_a-\alpha_b)+(\beta_a-\beta_b)x\)

\(log_e(\dfrac{\hatπ_F}{\hatπ_I}) = (1.6177-5.6974)+(-0.1101-(-2.4654))x = -4.0797 + 2.3553x\)

Interpretation of Parameter Estimate

At X = x+1 \(log_e(\dfrac{\hatπ_F}{\hatπ_I}) = -4.0797 + 2.3553(x+1)\)
At X = x \(log_e(\dfrac{\hatπ_F}{\hatπ_I}) = -4.0797 + 2.3553x\)

Difference: \(log_e\Bigg(\dfrac{\dfrac{\hatπ_F}{\hatπ_I}(\text{at } x+1)}{\dfrac{\hatπ_F}{\hatπ_I}(\text{at } x)}\Bigg) = 2.3553\)

\(\dfrac{\dfrac{\hatπ_F}{\hatπ_I}(\text{at } x+1)}{\dfrac{\hatπ_F}{\hatπ_I}(\text{at } x)} = e^{2.3553}\)

\(e^{2.3553}\) is the multiplicative change in the predicted value of \(\dfrac{\hatπ_F}{\hatπ_I}\) for a 1-unit increase in x. Alternatively, \(e^{2.3553}\) is the multiplicative change in the predicted value of the “odds” for a 1-unit increase in x.

Hypothesis Test

\(log_e(\dfrac{π_F}{π_O}) = \alpha_1+\beta_1x\)
\(log_e(\dfrac{π_I}{π_O}) = \alpha_2+\beta_2x\)
Test the hypothesis that primary food choice is independent of alligator length.

fit = vglm(y~x, family = multinomial, data = gator)
summary(fit)
## 
## Call:
## vglm(formula = y ~ x, family = multinomial, data = gator)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1   1.6177     1.3073   1.237  0.21591   
## (Intercept):2   5.6974     1.7937   3.176  0.00149 **
## x:1            -0.1101     0.5171  -0.213  0.83137   
## x:2            -2.4654     0.8996      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 98.3412 on 114 degrees of freedom
## 
## Log-likelihood: -49.1706 on 114 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'x:2'
## 
## 
## Reference group is level  3  of the response

Residual deviance: 98.3412

Note: No null deviance is shown

fit0 = vglm(y~1, family = multinomial, data = gator)
summary(fit0)
## 
## Call:
## vglm(formula = y ~ 1, family = multinomial, data = gator)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1   1.3545     0.3965   3.416 0.000636 ***
## (Intercept):2   0.9163     0.4183   2.190 0.028495 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 115.1419 on 116 degrees of freedom
## 
## Log-likelihood: -57.5709 on 116 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Reference group is level  3  of the response

Residual deviance: 115.1419

Full vs reduced test (LRT) in vglm

H0: \(\beta_1 = \beta_2 = 0\)
Ha: At least one \(\beta\) is non zero
LRT: \(Dev_0-Dev_1 = 115.1419 - 98.3412 = 16.801\)
df: \(\Delta = 116 - 114 = 2\)

1 - pchisq(16.801, 2)
## [1] 0.0002247549

OR

lrtest(fit, fit0)
## Likelihood ratio test
## 
## Model 1: y ~ x
## Model 2: y ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1 114 -49.171                         
## 2 116 -57.571  2 16.801  0.0002248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclude: Reject H0; primary food choice depends on alligator length.

Designating the Reference Level in R when Using vglm()

The reference level with vglm() is the LAST ALPHABETICAL LEVEL of the response (for the gator data (F,I,O): level 3 = “O”). We can specify the reference level of the response variable by using the refLevel = “” option as shown below.

fit2 = vglm(y~x, family = multinomial(refLevel = 'I'), data = gator)
summary(fit2)
## 
## Call:
## vglm(formula = y ~ x, family = multinomial(refLevel = "I"), data = gator)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1  -4.0797     1.4686  -2.778  0.00547 **
## (Intercept):2  -5.6974     1.7937  -3.176  0.00149 **
## x:1             2.3553     0.8032      NA       NA   
## x:2             2.4654     0.8996   2.741  0.00613 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,2]), log(mu[,3]/mu[,2])
## 
## Residual deviance: 98.3412 on 114 degrees of freedom
## 
## Log-likelihood: -49.1706 on 114 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'x:1'
## 
## 
## Reference group is level  2  of the response

Reference group is level 2 of the response = “I”

\(log_e(\dfrac{π_F}{π_I}) = -4.0797 +2.3553 x\)
\(log_e(\dfrac{π_O}{π_I}) = -5.6974+2.4654 x\)
\(log_e(\dfrac{π_F}{π_O}) = (-4.0797 +2.3553 x)-(-5.6974+2.4654 x) = 1.6177 - 0.1101x\)

Estimating Response Probabilities

It can be shown that the baseline-category logit model with predictor x (as shown on Page 199) yields the following expression for the response probability.

\(π_j = \dfrac{e^{\alpha_j+\beta_jx}}{\sum_{h=1}^{J}e^{\alpha_h+\beta_hx}}\), h = 1,2,…,J, j = 1,2,…,J
Note: The parameters equal zero wherever the index refers to the baseline category. (At the reference level, use \(e^{0+0x} = 1\))

Using the response probability expression, write the estimated probabilities for:

summary(fit)
## 
## Call:
## vglm(formula = y ~ x, family = multinomial, data = gator)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1   1.6177     1.3073   1.237  0.21591   
## (Intercept):2   5.6974     1.7937   3.176  0.00149 **
## x:1            -0.1101     0.5171  -0.213  0.83137   
## x:2            -2.4654     0.8996      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 98.3412 on 114 degrees of freedom
## 
## Log-likelihood: -49.1706 on 114 degrees of freedom
## 
## Number of Fisher scoring iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'x:2'
## 
## 
## Reference group is level  3  of the response

\(log_e(\dfrac{π_F}{π_O}) = 1.6177-0.1101x\)
\(log_e(\dfrac{π_I}{π_O}) = 5.6974-2.4654x\)

Fish
\(\alpha = 1.6177\)
\(\beta = -0.1101\)

\(\hatπ_F = \dfrac{e^{1.6177-0.1101x}}{e^{1.6177-0.1101x}+e^{5.6974-2.4654x} + e^0}\)

At x = 3.89 (Max observed length)
\(\hatπ_F = 0.76\)

Invertebrate
\(\alpha = 5.6974\)
\(\beta = -2.4654\)

\(\hatπ_I = \dfrac{e^{5.6974-2.4654x}}{e^{1.6177-0.1101x}+e^{5.6974-2.4654x} + e^0}\)

At x = 3.89 (Max observed length)
\(\hatπ_I = 0.005\)

Other
\(e^0 =1\)

\(\hatπ_O = \dfrac{e^0}{e^{1.6177-0.1101x}+e^{5.6974-2.4654x} + e^0}\)

At x = 3.89 (Max observed length)
\(\hatπ_O = 0.23\)
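A quick check of these three probabilities (a sketch; the coefficients are the rounded estimates from summary(fit), and a1, b1, a2, b2, den are just illustrative names):

a1 <- 1.6177; b1 <- -0.1101   # logit of pi_F / pi_O
a2 <- 5.6974; b2 <- -2.4654   # logit of pi_I / pi_O
x   <- 3.89                   # maximum observed length
den <- exp(a1 + b1*x) + exp(a2 + b2*x) + 1
c(F = exp(a1 + b1*x)/den, I = exp(a2 + b2*x)/den, O = 1/den)   # approximately 0.76, 0.005, 0.23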


As length increases, primary food source more apt to be fish instead of invertebrates.
As length decreases, primary food source more apt to be invertebrates.

6.2 Cumulative Logit Models for Ordinal Responses

Example: Teacher grades on students

Grade         Level 1 (F)       Level 2 (D)       Level 3 (C)        Level 4 (B)        Level 5 (A)
Probability   \(π_1 = 0.10\)    \(π_2 = 0.20\)    \(π_3 = 0.35\)     \(π_4 = 0.25\)     \(π_5 = 0.10\)

If the response were nominal, cumulative probabilities would not make sense. Since grade is ordinal, it does make sense to find the probability of getting C or lower: \(P(C\text{ or lower}) = P(F) +P(D) +P(C) = π_1 + π_2 + π_3 = 0.10 + 0.20 + 0.35 = 0.65\)

In general, for ordinal Y having J levels, for outcome category j, the cumulative prob.

\(P(Y\le j) = π_1 + π_2+\cdots+π_j\)

Cumulative probabilities reflect the ordinal nature of Y where

\(P(Y ≤ 1) ≤ P(Y ≤ 2) ≤ · · · ≤ P(Y ≤ J − 1) ≤ P(Y ≤ J)\)

P(Y≤J) = 1 (Probability of getting A or less)
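A one-line check of the grade example (a sketch; the category probabilities are the values from the table above):

pi <- c(F = 0.10, D = 0.20, C = 0.35, B = 0.25, A = 0.10)
cumsum(pi)   # cumulative probabilities P(Y <= j); the third entry is P(C or lower) = 0.65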
Logit model for P(Y≤j)

\(logit[P(Y\le j)] = log_e \dfrac{P(Y \le j)}{1-P(Y \le j)}=log_e \dfrac{P(Y \le j)}{P(Y > j)} = log_e\Big[\dfrac{π_1+...+π_j}{π_{j+1}+...+π_J}\Big]\)

Cumulative Logit Models with Proportional Odds Property

The cumulative logit with predictor x is
\(logit[P(Y\le j)] = \alpha_j+\beta x\) (common slope \(\beta\))
where we write \(π = P(Y\le j)\) for short.

Note that model contains a common \(\beta\) for all j = 1, …,J−1. Write all logit models below:

\(logit[P(Y\le 1)] = \alpha_1+\beta x\)
\(logit[P(Y\le 2)] = \alpha_2+\beta x\)
.
.
.
\(logit[P(Y\le J-1)] = \alpha_{J-1}+\beta x\)

What does this imply about the graph of P(Y ≤ 1), P(Y ≤ 2), . . ., P(Y ≤ J − 1)?
The graph looks like J − 1 parallel S-shaped curves that never cross: because the logits share the common slope \(\beta\), the curves are horizontal shifts of one another, ordered by the \(\alpha_j\).

Interpretation of \(\beta\)

Consider fixed level j = 1,…,J-1
For X = x, define \(P(Y \le j) = π(x)\)
For X = x+1, define \(P(Y \le j) = π(x+1)\)

\(log_e \dfrac{π(x+1)}{1-π(x+1)} = \alpha_j + \beta(x+1)\)
\(log_e \dfrac{π(x)}{1-π(x)} = \alpha_j + \beta(x)\)

Take difference: \(log_e\Bigg(\dfrac{\dfrac{π(x+1)}{1-π(x+1)}}{\dfrac{π(x)}{1-π(x)}}\Bigg)=\beta\)
=> \(\dfrac{π(x+1)}{1-π(x+1)} = \dfrac{π(x)}{1-π(x)}*e^\beta\)

Recall: we treat \(P(Y \le j)\) as the π of “success”.

\(e^\beta\) is the multiplicative change in the odds of response in category j or below for a unit increase of x.

Interpretation of common \(\beta\): Proportional Odds Property

Choose two settings of x (\(x_1\) and \(x_2\), where \(x_1 < x_2\))
Consider a fixed level of j = 1,..,J-1

For \(x = x_1\) define \(P(Y\le j) = π_1\)
For \(x = x_2\) define \(P(Y\le j) = π_2\)

\(log_e\dfrac{π_2}{1-π_2} = \alpha_j + \beta x_2\)
\(log_e\dfrac{π_1}{1-π_1} = \alpha_j + \beta x_1\)

Take difference: \(log_e\Bigg(\dfrac{\dfrac{π_2}{1-π_2}}{\dfrac{π_1}{1-π_1}}\Bigg) = \beta(x_2-x_1)\)
=> \(log_e\)[Odds Ratio for \(x_2\) vs. \(x_1\)] = \(\beta(x_2-x_1)\)
\(\beta(x_2-x_1)\) is proportional to the distance between \(x_1\) and \(x_2\)
- the log odds ratio for \(x_2\) vs. \(x_1\) is proportional to \((x_2-x_1)\) for any j = 1,…, J-1
- \(\beta\) is called the proportionality constant
- This is often called the proportional odds model

Cumulative Logit and Cumulative Probability Model
The cumulative logit model with predictor x yields the following expression for P(Y ≤ j):

\(logit[P(Y\le j)] = \alpha_{j}+\beta x\)

<=> \(P(Y\le j) = \dfrac{e^{\alpha_j+\beta x}}{1+e^{\alpha_j+\beta x}}\)

Example: Political Ideology and Party Affiliation
- Y = political Ideology (very liberal, slightly liberal, moderate, slightly conservative, very conservative)
- X = 1 democrat or 0 republican

ideology = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week9/ideology.txt', sep = '', header = T)
ideology
##   party y1 y2  y3 y4 y5
## 1   dem 80 81 171 41 55
## 2 repub 30 46 148 84 99
ideology$party = relevel(factor(ideology$party), ref = 'repub')
# For cumulative logit models with proportional odds property we use family=cumulative(parallel=T) where parallel=T specifies the proportional odds property
fit = vglm(cbind(y1,y2,y3,y4,y5) ~ party, family = cumulative(parallel=T), data = ideology)
summary(fit)
## 
## Call:
## vglm(formula = cbind(y1, y2, y3, y4, y5) ~ party, family = cumulative(parallel = T), 
##     data = ideology)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -2.46899    0.13182 -18.730  < 2e-16 ***
## (Intercept):2 -1.47454    0.10908 -13.517  < 2e-16 ***
## (Intercept):3  0.23712    0.09485   2.500   0.0124 *  
## (Intercept):4  1.06954    0.10457  10.228  < 2e-16 ***
## partydem       0.97452    0.12905   7.551 4.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 3.6877 on 3 degrees of freedom
## 
## Log-likelihood: -24.6201 on 3 degrees of freedom
## 
## Number of Fisher scoring iterations: 3 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
## partydem 
## 2.649882

Let \(P(Y\le j) = π_D\) for democrats, and \(P(Y\le j) = π_R\) for republicans.

Interpretation of the estimate for \(\beta\)

For fixed j, \(logit(\hatπ_D)=\hat\alpha_j+\hat\beta\)
For fixed j, \(logit(\hatπ_R)=\hat\alpha_j\)

Difference: \(log_e\Bigg(\dfrac{\dfrac{\hatπ_D}{1-\hatπ_D}}{\dfrac{\hatπ_R}{1-\hatπ_R}}\Bigg) = \hat\beta\)

\(\dfrac{\hatπ_D}{1-\hatπ_D} = \dfrac{\hatπ_R}{1-\hatπ_R}\, e^{\hat\beta}\)

In the case of ideology: \(\hat\beta = 0.97452\), \(e^{0.97452} = 2.65\)
The estimated odds that a Democrat’s response is in the liberal direction rather than the conservative direction are 2.65 times the corresponding odds for a Republican. Remember X = 1 for Democrat and 0 for Republican.

  • Write the form of the estimated cumulative probabilities based on this output:

\(P(Y\le 1) = \dfrac{e^{-2.46899 + 0.975 x}}{1+e^{-2.46899 + 0.975x}}\)

\(P(Y\le 2) = \dfrac{e^{-1.47454 + 0.975 x}}{1+e^{-1.47454 + 0.975x}}\)

\(P(Y\le 3) = \dfrac{e^{0.23712 + 0.975 x}}{1+e^{0.23712 + 0.975x}}\)

\(P(Y\le 4) = \dfrac{e^{1.06954 + 0.975 x}}{1+e^{1.06954 + 0.975x}}\)

Note: \(\hat\alpha_1<\hat\alpha_2<\hat\alpha_3<\hat\alpha_4\)

For a Democrat (x = 1):

level = j   \(P(Y\le j)\)   \(P(Y=j)\)
1           0.18           0.18
2           0.38           0.20 (0.38-0.18)
3           0.77           0.39 (0.77-0.38)
4           0.88           0.11 (0.88-0.77)
5           1              0.12 (1-0.88)
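These values can be reproduced from the fitted coefficients (a sketch; x = 1 for Democrat, using the estimates in the summary(fit) output above, with alpha, beta, and cum as illustrative names):

alpha <- c(-2.46899, -1.47454, 0.23712, 1.06954)
beta  <- 0.97452
cum   <- c(plogis(alpha + beta * 1), 1)                 # P(Y <= j), j = 1,...,5, at x = 1
cbind(cumulative = cum, category = diff(c(0, cum)))     # matches the table above up to rounding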

Inference about Model Parameters

H0: \(\beta = 0\) Ha: \(\beta \ne 0\)
LRT: 58.645
p-value(\(\chi^2\)): 1.888e-14

fit0 = vglm(cbind(y1,y2,y3,y4,y5) ~ 1, family = cumulative(parallel=T), data = ideology)
lrtest(fit, fit0)
## Likelihood ratio test
## 
## Model 1: cbind(y1, y2, y3, y4, y5) ~ party
## Model 2: cbind(y1, y2, y3, y4, y5) ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   3 -24.620                         
## 2   4 -53.943  1 58.645  1.888e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reject H0.

  • Returning to the 2×5 data table, one could use the methods we learned in Chapter 2 and perform an analysis using the \(\chi^2\) and \(G^2\) methods.
ideology
##   party y1 y2  y3 y4 y5
## 1   dem 80 81 171 41 55
## 2 repub 30 46 148 84 99
  • However, those methods are based on nominal categorical variables and ignore the ordinal nature of the ideology variable.
  • So, the \(\chi^2\) and \(G^2\) methods suffer from lower power compared to the full versus reduced test generated by lrtest(fit,fit0)

Checking Model Fit

  • Based on the estimated category probabilities P(Y = j) we determined from the model, we can find expected counts for each cell
  • We can perform a goodness of fit analysis using the G2(M) method as discussed in Chapter 5 on Page 191
  • This \(G^2(M)\) test statistic is automatically generated in the summary(fit) output in the form of the Residual Deviance.
    • Recall deviance \(-2[L_M-L_S]\) is represented by Residual Deviance.

\(G^2(M)\): 3.6877
Df: 3

summary(fit)
## 
## Call:
## vglm(formula = cbind(y1, y2, y3, y4, y5) ~ party, family = cumulative(parallel = T), 
##     data = ideology)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 -2.46899    0.13182 -18.730  < 2e-16 ***
## (Intercept):2 -1.47454    0.10908 -13.517  < 2e-16 ***
## (Intercept):3  0.23712    0.09485   2.500   0.0124 *  
## (Intercept):4  1.06954    0.10457  10.228  < 2e-16 ***
## partydem       0.97452    0.12905   7.551 4.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
## 
## Residual deviance: 3.6877 on 3 degrees of freedom
## 
## Log-likelihood: -24.6201 on 3 degrees of freedom
## 
## Number of Fisher scoring iterations: 3 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## 
## Exponentiated coefficients:
## partydem 
## 2.649882

In this case, model M is the proportional odds model
H0: M is sufficient relative to saturated
Ha: M is insufficient relative to saturated

Why df = 3?

ideology
##   party y1 y2  y3 y4 y5
## 1   dem 80 81 171 41 55
## 2 repub 30 46 148 84 99

This is a 2x5 table with 10 cells, BUT we are interested in \(logit[P(Y\le j)]= \alpha_j+\beta x\)
For \(logit[P(Y\le j)]\), this is really \(logit[P(Y\le j)|X=x]\) with x = 1 for Democrat, x = 0 for Republican, and j = 1,2,3,4. This yields a 4x2 table, so there are only 8 parameters in the saturated model.
For \(\alpha_j+\beta x\), j = 1,2,3,4. Parameters are \(\alpha_1,\alpha_2,\alpha_3,\alpha_4, \beta\). So 5 parameters in reduced model.

\(\Delta = 8 -5 = 3\)
\(\chi^2 = 3.6877\)

1 - pchisq(3.6877,3)
## [1] 0.2972215

Do NOT reject H0. The reduced model is adequate.

Chapter 7 Loglinear Models for Contingency Tables

Section 7.1: Loglinear Models for Two-Way Tables

Suppose data is cross classified based on X (having I levels) and Y (having J levels). The count data would appear in a 2-way table such as:

  • Row variable X has I levels
  • Column variable Y has J levels
  • Using the same notation as before:
  • \(π_{ij}\) = P(X = i, Y = j)
  • \(\sum_{i,j}π_{ij} = 1\)
  • Let \(π_{i+}\) be the row total of probabilities for row i => \(π_{i+} = \sum_{j}π_{ij}\)
  • Let \(π_{+j}\) be the column total of probabilities for column j => \(π_{+j} = \sum_{i}π_{ij}\)

Under independence of X and Y: \(π_{ij} = (π_{i+})(π_{+j})\)

expected frequency = \(µ_{ij} = nπ_{ij}\)

\(µ_{ij} = n(π_{i+})(π_{+j})\)

Distribution: Poisson
Link function: natural log

\(µ_{ij} = n(π_{i+})(π_{+j})\)
\(log_e(µ_{ij}) = log_e(n(π_{i+})(π_{+j}))\)
\(= log_en+log_eπ_{i+}+log_eπ_{+j}\)
\(= \lambda + \lambda_i^x+\lambda_j^y\)

loglinear model of independence

We call \(\lambda_i^x\) the row effect
- Larger \(\lambda_i^x\) imply larger expected count in row i
We call \(\lambda_j^y\) the column effect
- Larger \(\lambda_j^y\) imply larger expected count in column j

Hypothesis Test for Independence

H0: \(π_{ij} = π_{i+}π_{+j}\)

by testing whether the log-linear independence model \(log_eµ_{ij} = \lambda + \lambda_i^x+\lambda_j^y\) is sufficient.

In other words, we can perform this test by conducting a goodness of fit (i.e. “full versus reduced”) test. If the log-linear independence model is the ‘reduced’ model, in words, describe what would be a reasonable choice for the ‘full’ model?

A model containing the interaction term would be the ‘full’ model.

Now, when conducting any goodness of fit test, we need to compare what we observed to what we would fit from the null hypothesis model. What are the fitted values \(\hatµ_{ij}\) in this case?

\(\hatµ_{ij} = \dfrac{(n_{i+})(n_{+j})}{n} = \dfrac{(row.total)(col.total)}{grand.total}\)

NOTE: If you look back on Page 55, the fitted values above are the EXACT SAME fitted values for the \(\chi^2\) and the \(G^2\) tests of independence from Chapter 2!
- This means ⇒ \(\chi^2\) and \(G^2\) ARE the goodness of fit tests for (7.1)
- This means ⇒ \(\chi^2\) and \(G^2\) are tests based on the GLM Poisson regression model
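As a sketch with hypothetical counts, the fitted values under independence can be computed directly from the margins:

counts <- matrix(c(30, 20, 10,
                   15, 25, 20), nrow = 2, byrow = TRUE)   # hypothetical 2 x 3 table of counts
n      <- sum(counts)
mu_hat <- outer(rowSums(counts), colSums(counts)) / n     # (row total)(column total)/(grand total)
mu_hat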

Example: Suppose X has I levels and Y has 2 levels. The I x 2 table would look like:

For a fixed row i use the log-linear model of independence to determine an expression for \(log(\dfrac{π_{i1}}{π_{i2}})=log(\dfrac{nπ_{i1}}{nπ_{i2}})=log(\dfrac{µ_{i1}}{µ_{i2}}) = log(µ_{i1})-log(µ_{i2}) = (\lambda+\lambda^X_i+\lambda^Y_1)-(\lambda+\lambda^X_i+\lambda^Y_2) = \lambda^Y_1-\lambda^Y_2\)
Does not depend on i

What is the interpretation of \(\lambda^Y_1-\lambda^Y_2\)?

Supplemental Page

\(\dfrac{π_{i1}}{π_{i2}}=e^{\lambda^Y_1-\lambda^Y_2}\)

We may interpret \(e^{\lambda^Y_1-\lambda^Y_2}\) as the odds of response 1 relative to response 2, given row i. This applies regardless of the row.

The GLM output would allow us to provide an estimate for the odds of success (as shown above).

Example: In the 2000 General Social Survey, subjects were asked whether they believed in life after death. Individuals were cross classified based on response Y (Yes/No) and race X (White/Black/Other). The data is given here:

    Yes    No
W   1339   300
B    260    55
O     88    22
afterlife = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week10/life_after_death.txt', sep = '',header = T)
#belief (no, yes): Default reference level is "no" 
# Let’s change reference to "yes" 
afterlife$belief <-relevel(factor(afterlife$belief), ref="no")


# race (black, other, white): Default reference level is "black" 
# Let’s change reference to "other"
afterlife$race <-relevel(factor(afterlife$race), ref="other")
fit <-glm(count~race+belief, family = poisson, data = afterlife)
summary(fit)
## 
## Call:
## glm(formula = count ~ race + belief, family = poisson, data = afterlife)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
## -0.01717   0.03631   0.15781  -0.33688  -0.20194   0.41917  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.00032    0.10611   28.28   <2e-16 ***
## raceblack    1.05209    0.11075    9.50   <2e-16 ***
## racewhite    2.70136    0.09849   27.43   <2e-16 ***
## beliefyes    1.49846    0.05697   26.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2849.21758  on 5  degrees of freedom
## Residual deviance:    0.35649  on 2  degrees of freedom
## AIC: 49.437
## 
## Number of Fisher Scoring iterations: 3

Consider the indicator variables we are using for the race variable X:

\(x_1\) \(x_2\)
black 1 0
white 0 1
other 0 0

How do we interpret the deviance value of 0.3565 and its corresponding DF=2? What is its corresponding p-value and conclusion?
model(7.1): \(log_eµ_{ij} = \lambda + \lambda_i^x+\lambda_j^y\)

H0: model(7.1) is sufficient relative to the saturated model
Ha: model(7.1) is NOT sufficient relative to the saturated model

lrt = 0.3565 
df = 2
1-pchisq(lrt, df)
## [1] 0.8367332

Do not reject H0, the independence model is sufficient relative to saturated model.

The data table on the previous page is simply a 3×2 contingency table. What if we did things from the perspective of Chapter 2? What would the corresponding R output reveal?

belief.data <-matrix(c(1339, 300, 260, 55, 88, 22), ncol = 2, byrow = T) 

vcd::assocstats(belief.data)$chisq_test
##                        X^2 df  P(> X^2)
## Likelihood Ratio 0.3564851  2 0.8367395
## Pearson          0.3600752  2 0.8352388

It is the same LRT of 0.3565 and the same conclusion. Fail to reject H0: X and Y are independent.

summary(fit)$coef
##             Estimate Std. Error   z value      Pr(>|z|)
## (Intercept) 3.000324 0.10610809 28.276113 6.798108e-176
## raceblack   1.052092 0.11074975  9.499726  2.104434e-21
## racewhite   2.701361 0.09849382 27.426708 1.317510e-165
## beliefyes   1.498462 0.05696743 26.303836 1.733401e-152

\(\hat\lambda_1^Y = 1.49846\)
\(\hat\lambda_2^Y = 0\)

interpretation

\(\dfrac{π_{i1}}{π_{i2}}=e^{1.49846-0} = 4.47\)

4.47 is the estimated odds of afterlife belief (yes vs no) for any given race.

Consider the estimates for race W and for race B. What λ parameters correspond to these two quantities? How do we interpret the difference in the estimates?

\(\hat\lambda_w^x = 2.701361\)
\(\hat\lambda_B^x = 1.052092\)

For fixed level j, \(log(\dfrac{π_{wj}}{π_{Bj}}) = log(\dfrac{µ_{Wj}}{µ_{Bj}})= log(µ_{Wj})-log(µ_{Bj}) = (\lambda + \lambda^x_W+\lambda^y_j) - (\lambda + \lambda^x_B+\lambda^y_j) =\lambda^x_W-\lambda^x_B\)
Does not depend on j

\(\dfrac{π_{wj}}{π_{Bj}} = e^{\lambda^x_W-\lambda^x_B}\)

Estimated by \(e^{2.701361-1.052092} = 5.20\)

5.20 is the estimated odds of being white vs. black for any given afterlife belief.

Dependence Model for X and Y

\(logµ_{ij} = \lambda + \lambda_i^x+\lambda^y_j + \lambda^{xy}_{ij}\)  (Model 7.2)

x/y 1 2
1 \(π_{11}\) \(π_{12}\)
2 \(π_{21}\) \(π_{22}\)

\(\theta = \dfrac{π_{11}π_{22}}{π_{21}π_{12}}\), so \(log(\theta) = log(\dfrac{nπ_{11}nπ_{22}}{nπ_{21}nπ_{12}}) = log(\dfrac{µ_{11}µ_{22}}{µ_{21}µ_{12}}) = (log(µ_{11})+log(µ_{22}))-(log(µ_{21})+log(µ_{12}))\)
\(= (\lambda+\lambda^x_1+\lambda^y_1+\lambda^{xy}_{11})+(\lambda+\lambda^x_2+\lambda^y_2+\lambda^{xy}_{22})-(\lambda+\lambda^x_2+\lambda^y_1+\lambda^{xy}_{21})-(\lambda+\lambda^x_1+\lambda^y_2+\lambda^{xy}_{12}) = \lambda^{xy}_{11}+\lambda^{xy}_{22}-\lambda^{xy}_{21}-\lambda^{xy}_{12}\)

model(7.1): \(log_eµ_{ij} = \lambda + \lambda_i^x+\lambda_j^y\) (Model of independence)

\(\lambda^{xy}_{ij} = 0\) for all i, j

Under model(7.1), \(log(\theta) = 0+0-0-0 = 0\), so \(\theta = 1\)

Under independence, \(\theta = 1\)

The two statements above agree with one another.

A Closer Look at the Goodness of Fit Test Comparing Models 7.1 and 7.2

(Generic I × J table of cells, with rows i = 1, …, I and columns j = 1, …, J.)

I×J cells exist in this table, and \(µ_{ij}\) is modeled for each cell, so IJ parameters exist in the saturated model.

  • Variable X has I levels => Required number of indicator variables = I-1
    • There are I-1 \(\lambda^x_i\) parameters for X
  • Variable Y has J levels => Required number of indicator variables = J-1
    • There are J-1 \(\lambda^y_j\) parameters for Y
  • Therefore, there are (I-1)(J-1) interaction parameters
    • The interaction parameters are of the form \(\lambda_{ij}^{xy}\)

Model (7.2): \(\lambda + \lambda^x_i+\lambda^y_j + \lambda^{xy}_{ij}\)

Number of parameters: 1 + (I-1) + (J-1) + (I-1)(J-1) = IJ

Important result:
Model 7.2 fits IJ parameters => 7.2 is a saturated model
Rewrite Model (7.1) and indicate the total number of parameters this model supports:
Model (7.1): \(\lambda + \lambda^x_i+\lambda^y_j\)

Number of parameters: 1 + (I-1) + (J-1) = I+J-1

The deviance-based goodness of fit test comparing Model 7.1 to Model 7.2 yields the following number of degrees of freedom:
\(\Delta = IJ-(I+J-1) = IJ-I-J+1\)
\(\Delta = (I-1)(J-1) = (r-1)(c-1)\)

Example: Let’s return to the afterlife belief data set. Although our initial analysis supported the hypothesis of independence between X and Y, let’s apply the dependence model (7.2) to the data so that we can understand how to implement this in R. Note that we now need to use race*belief in the glm() function to invoke dependence model (7.2):

afterlife = read.table('/Users/shuseiyokoi/Desktop/shutats/notes/STAT418/week10/life_after_death.txt', sep = '',header = T)
#belief (no, yes): Default reference level is "no" 
# Let’s change reference to "yes" 
afterlife$belief <-relevel(factor(afterlife$belief), ref="no")


# race (black, other, white): Default reference level is "black" 
# Let’s change reference to "other"
afterlife$race <-relevel(factor(afterlife$race), ref="other")
fit2 <-glm(count~race+belief+race*belief, family = poisson, data = afterlife)
summary(fit2)
## 
## Call:
## glm(formula = count ~ race + belief + race * belief, family = poisson, 
##     data = afterlife)
## 
## Deviance Residuals: 
## [1]  0  0  0  0  0  0
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           3.0910     0.2132  14.498  < 2e-16 ***
## raceblack             0.9163     0.2523   3.632 0.000281 ***
## racewhite             2.6127     0.2209  11.829  < 2e-16 ***
## beliefyes             1.3863     0.2384   5.816 6.03e-09 ***
## raceblack:beliefyes   0.1671     0.2808   0.595 0.551889    
## racewhite:beliefyes   0.1096     0.2468   0.444 0.656946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance:  2.8492e+03  on 5  degrees of freedom
## Residual deviance: -1.0281e-13  on 0  degrees of freedom
## AIC: 53.081
## 
## Number of Fisher Scoring iterations: 3

Residual deviance: -1.0281e-13 on 0 degrees of freedom

Deviance = \(-2[L_M-L_S] = 0\)
This is the comparison of model 7.2 to the saturated model, but model 7.2 IS the saturated model. So \(L_M=L_S\), and \(\Delta = 0\) df makes sense.

        y=1            y=2
black   \(π_{b1}\)     \(π_{b2}\)
white   \(π_{w1}\)     \(π_{w2}\)
other   \(π_{o1}\)     \(π_{o2}\)

Now that we have applied an interaction model, let us consider the log of the odds ratio for afterlife belief between Whites and Others. Write this expression below:

\(log\Bigg(\dfrac{\dfrac{π_{w1}}{π_{w2}}}{\dfrac{π_{o1}}{π_{o2}}}\Bigg) = log(\dfrac{π_{w1}}{π_{w2}})-log(\dfrac{π_{o1}}{π_{o2}}) = log(\dfrac{µ_{w1}}{µ_{w2}})-log(\dfrac{µ_{o1}}{µ_{o2}})\)

summary(fit2)$coef
##                      Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)         3.0910425  0.2132007 14.4982742 1.242342e-47
## raceblack           0.9162907  0.2522625  3.6322908 2.809162e-04
## racewhite           2.6127400  0.2208798 11.8287875 2.771158e-32
## beliefyes           1.3862944  0.2383656  5.8158312 6.033334e-09
## raceblack:beliefyes 0.1670541  0.2807956  0.5949313 5.518894e-01
## racewhite:beliefyes 0.1096015  0.2467759  0.4441338 6.569459e-01

Applying model 7.2:
The odds ratio of afterlife belief between whites and others:

\(log(OddsRatio) = [(\lambda + \lambda^x_w+\lambda^y_1 + \lambda^{xy}_{w1})-(\lambda + \lambda^x_w+\lambda^y_2 + \lambda^{xy}_{w2})]- [(\lambda + \lambda^x_o+\lambda^y_1 + \lambda^{xy}_{o1})-(\lambda + \lambda^x_o+\lambda^y_2 + \lambda^{xy}_{o2})]\)
\(= \lambda^{xy}_{w1}-\lambda^{xy}_{w2} - \lambda^{xy}_{o1}+\lambda^{xy}_{o2}\)

Estimated by 0.1096 - 0 - 0 + 0 = 0.1096

\(\hat{OR} = e^{0.1096} = 1.12\)
This is the estimated odds ratio of afterlife belief between whites and others.

The odds ratio of afterlife belief between blacks and others:

\(log(OR) = \lambda^{xy}_{b1}-\lambda^{xy}_{b2} - \lambda^{xy}_{o1}+\lambda^{xy}_{o2}\)

Estimated by 0.1671 - 0 - 0 + 0 = 0.1671
\(\hat{OR} = e^{0.1671} = 1.18\)

The odds ratio of afterlife belief between whites and blacks:
\(log(OR) = \lambda^{xy}_{w1}-\lambda^{xy}_{w2} - \lambda^{xy}_{b1}+\lambda^{xy}_{b2}\)
\(log(OR) = 0.1096 - 0 - 0.1671 + 0 = -0.0575\)

\(\hat{OR} = e^{-0.0575} = 0.94\)
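These three estimates can be recovered directly from the interaction coefficients of the saturated fit (a sketch; the coefficient names are taken from the summary(fit2) output above):

b <- coef(fit2)
exp(b["racewhite:beliefyes"])                              # white vs other: about 1.12
exp(b["raceblack:beliefyes"])                              # black vs other: about 1.18
exp(b["racewhite:beliefyes"] - b["raceblack:beliefyes"])   # white vs black: about 0.94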

Under logistic regression, where the probability of success π = P(Y = 1) was of most interest, we’ve used the corresponding R output to determine predicted probabilities.
Now, under Poisson regression, the quantity of interest is μij, the expected count in a given table cell. We can use the R output to determine predicted counts.

        Yes    No
white   1339   300
black    260    55
other     88    22

\(\hatµ_{w1} = e ^{3.0910+2.6127+1.3863+0.1096} = 1339.00007\)

int <-3.0910425
wht <-2.6127400
blk <-.9162907
bel <-1.3862944
W_Y <-.1096015
B_Y <-.1670541
wht_yes <-exp(int + wht + bel + W_Y) 
wht_no <-exp(int + wht + 0.0 + 0.0) 
blk_yes <-exp(int + blk + bel + B_Y) 
blk_no <-exp(int + blk + 0.0 + 0.0) 
oth_yes <-exp(int + 0.0 + bel + 0.0) 
oth_no <-exp(int + 0.0 + 0.0 + 0.0)
wht_yes           
## [1] 1339
wht_no            
## [1] 300
blk_yes
## [1] 260
blk_no
## [1] 55
oth_yes
## [1] 88.00001
oth_no
## [1] 22

In the R code above we are manually determining predicted or fitted counts for each of the 6 cells of the table.
Why is there such an incredible amount of agreement between what is predicted and what was observed? Can you explain what is happening here?

We are fitting a saturated model, so it perfectly reproduces the observed data.

fit2$fitted.values
##    1    2    3    4    5    6 
## 1339  300  260   55   88   22

7.2 Loglinear Models for Three-Way Tables

Suppose data is cross classified based on X (having I levels), Y (having J levels), and Z (having K levels). The count data would appear in multiple 2-way tables such as:

The following will closely mimic our discussion of the loglinear models for two-way tables (see Page 216) but now extended to three-way tables:

  • \(π_{ijk} = P(X=i, Y = j, Z = k)\)
  • \(\sum_{i,j,k}π_{ijk}\) must equal 1
  • \(π_{i++} =\sum_{j,k}π_{ijk}\)
  • \(π_{+j+} =\sum_{i,k}π_{ijk}\)
  • \(π_{++k} =\sum_{i,j}π_{ijk}\)

X,Y, and Z, are said to be mutually independent if \(π_{ijk} = (π_{i++})(π_{+j+})(π_{++k})\) For all i,j,k

Expected frequency count for cell (i, j, k) would be: \(µ_{ijk} = nπ_{ijk}\)

Assuming mutual independence of X, Y, and Z, the expected frequency count would now be: \(µ_{ijk} = n(π_{i++})(π_{+j+})(π_{++k})\)

Link function: natural log
Distribution: Poisson

\(log(µ_{ijk}) = log(nπ_{i++}π_{+j+}π_{++k}) = log(n) + logπ_{i++}+logπ_{+j+}+logπ_{++k} = \lambda + \lambda^x_i+\lambda^y_j+\lambda^z_k\)
As before, note the ANOVA-type model representation as used in Chapter 4.
This model is called the loglinear model of mutual independence. Shorthand notation: (X, Y, Z)
The model with all 2way interaction terms would be: \(\lambda + \lambda^x_i+\lambda^y_j+\lambda^z_k+\lambda^{xy}_{ij} + \lambda^{xz}_{ik}+ \lambda^{yz}_{jk}\)
Shorthand notation: (XY, XZ, YZ). By including an interaction term we also include the corresponding main effects in the model.

Note that the mutual independence model (X, Y, Z) and the 2-way interaction model (XY, XZ, YZ) are identical when all 2-way interaction terms equal 0.
The model with 3-ways interaction term would be: \(\lambda + \lambda^x_i+\lambda^y_j+\lambda^z_k+\lambda^{xy}_{ij} + \lambda^{xz}_{ik}+ \lambda^{yz}_{jk} + \lambda^{xyz}_{ijk}\)

Shorthand notation: (XYZ).

Note that the mutual independence model (X, Y, Z) and the 3-way interaction model (XYZ) are identical when all interaction terms equal 0.
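In R, these shorthand names map onto glm() Poisson formulas; a minimal sketch with a made-up 2×2×2 table (tab3 and its counts are purely hypothetical, for illustration only):

tab3 <- expand.grid(x = factor(1:2), y = factor(1:2), z = factor(1:2))
tab3$count <- c(20, 15, 10, 25, 18, 12, 14, 22)   # hypothetical counts
fit_ind <- glm(count ~ x + y + z,        family = poisson, data = tab3)  # (X, Y, Z): mutual independence
fit_hom <- glm(count ~ x*y + x*z + y*z,  family = poisson, data = tab3)  # (XY, XZ, YZ): homogeneous association
fit_sat <- glm(count ~ x * y * z,        family = poisson, data = tab3)  # (XYZ): saturated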

Example: 2×2×K tables
Fix Z = k and write the corresponding expression for the conditional odds ratio:

x/y 1 2
1 \(π_{11k}\) \(π_{12k}\)
2 \(π_{21k}\) \(π_{22k}\)

\(\theta_{xy(k)}=\dfrac{π_{11k}*π_{22k}}{π_{21k}*π_{12k}}= \dfrac{µ_{11k}*µ_{22k}}{µ_{21k}*µ_{12k}}\)
Apply natural log \(log(\theta_{xy(k)})=log(\dfrac{π_{11k}*π_{22k}}{π_{21k}*π_{12k}})= log( \dfrac{µ_{11k}*µ_{22k}}{µ_{21k}*µ_{12k}}) = logµ_{11k}+logµ_{22k}-logµ_{21k}-logµ_{12k}\)

Apply the 2-way interaction model(XY,XZ,YZ) to obtain an expression for the loglinear model:

\(logµ_{11k}= \lambda + \lambda^x_1+\lambda^y_1+\lambda^z_k+\lambda^{xy}_{11} + \lambda^{xz}_{1k}+ \lambda^{yz}_{1k}\)
\(logµ_{22k}=\lambda + \lambda^x_2+\lambda^y_2+\lambda^z_k+\lambda^{xy}_{22} + \lambda^{xz}_{2k}+ \lambda^{yz}_{2k}\)
\(logµ_{21k} = \lambda + \lambda^x_2+\lambda^y_1+\lambda^z_k+\lambda^{xy}_{21} + \lambda^{xz}_{2k}+ \lambda^{yz}_{1k}\)
\(logµ_{12k} = \lambda + \lambda^x_1+\lambda^y_2+\lambda^z_k+\lambda^{xy}_{12} + \lambda^{xz}_{1k}+ \lambda^{yz}_{2k}\)

\(log(\theta_{xy(k)})=(\lambda + \lambda^x_1+\lambda^y_1+\lambda^z_k+\lambda^{xy}_{11} + \lambda^{xz}_{1k}+ \lambda^{yz}_{1k})\)
\(+(\lambda + \lambda^x_2+\lambda^y_2+\lambda^z_k+\lambda^{xy}_{22} + \lambda^{xz}_{2k}+ \lambda^{yz}_{2k})\)
\(- (\lambda + \lambda^x_2+\lambda^y_1+\lambda^z_k+\lambda^{xy}_{21} + \lambda^{xz}_{2k}+ \lambda^{yz}_{1k})\)
\(-(\lambda + \lambda^x_1+\lambda^y_2+\lambda^z_k+\lambda^{xy}_{12} + \lambda^{xz}_{1k}+ \lambda^{yz}_{2k})\)

Simplifying:

\(log(\theta_{xy(k)})=\lambda^{xy}_{11}+\lambda^{xy}_{22}-\lambda^{xy}_{21}-\lambda^{xy}_{12}\)

\(\theta_{xy(k)} = e^{\lambda^{xy}_{11}+\lambda^{xy}_{22}-\lambda^{xy}_{21}-\lambda^{xy}_{12}}\) (model*)

There is no k term here.
The conditional odds ratio between X and Y given Z = k is the same at every level of Z.

The Derivation applied to the 2 × 2 table for X and Y at level Z = k. But the previous expression for the conditional odds ratio does NOT DEPEND ON k. What does this mean?

\(\theta_{xy(1)} = (model*)\),…,\(\theta_{xy(k)} = (model*)\),…, \(\theta_{xy(K)} = (model*)\)

Therefore, (XY, XZ, YZ) implies homogeneous association.

If we return to the model (X, Y, Z), what are the values of the 2-way interaction terms between X and Y? What does this imply about the conditional odds ratio?

\(\theta_{xy(k)} = e^{0+0-0-0} = e^0 = 1\)

Therefore, (X, Y, Z) implies conditional independence between any 2 variables given the third variable.

The argument above holds true for non-dichotomous X and Y.
(1) Suppose X has I levels and Y has J levels.
(2) Choose two levels of X (i and i’)
(3) Choose two levels of Y (j and j’)

A visual representation is given here:

Suppose X has I levels, Y has J levels, and Z has K levels. What if we only consider the model (XY)? Under this model setting what does this imply about the relationship between (I) X and Y, (II) Y and Z, (III) X and Z?

(I) Relationship between X and Y under model (XY)

Consider the fixed level Z = k. For X focus only on levels 1 and 2. For Y focus only on levels 1 and 2. The corresponding contingency table is shown here:

z = k

x/y 1 2
1 \(π_{11k}\) \(π_{12k}\)
2 \(π_{21k}\) \(π_{22k}\)

What is \(\theta_{XY(k)}\) in this setting?
\(\theta_{xy(k)} = e^{\lambda^{xy}_{11}+\lambda^{xy}_{22}-\lambda^{xy}_{21}-\lambda^{xy}_{12}}\)
We get homogeneous association for X and Y.

(II) Relationship between Y and Z under model (XY)
Consider the fixed level X = i. For Y focus only on levels 1 and 2. For Z focus only on levels 1 and 2. The corresponding contingency table is shown here:

x = i

y/z 1 2
1 \(π_{i11}\) \(π_{i12}\)
2 \(π_{i21}\) \(π_{i22}\)

How do we write the conditional odds ratio in this setting?
\(log(\theta_{(i)YZ}) = log(\dfrac{µ_{i11}µ_{i22}}{µ_{i21}µ_{i12}}) = log(µ_{i11})+log(µ_{i22})-log(µ_{i21})-log(µ_{i12})\)

\(log(\theta_{(i)YZ})=(\lambda + \lambda^x_i+\lambda^y_1+\lambda^z_1+\lambda^{xy}_{i1}) +(\lambda + \lambda^x_i+\lambda^y_2+\lambda^z_2+\lambda^{xy}_{i2}) - (\lambda + \lambda^x_i+\lambda^y_2+\lambda^z_1+\lambda^{xy}_{i2}) -(\lambda + \lambda^x_i+\lambda^y_1+\lambda^z_2+\lambda^{xy}_{i1})\)

\(log(\theta_{(i)YZ}) = 0\)
\(\theta_{(i)YZ} = 1\) True for all i.

Does the expression above depend on i? What does this suggest?
\(\theta_{(1)YZ} = 1\),\(\theta_{(2)YZ} = 1\),…,\(\theta_{(I)YZ} = 1\)
This implies conditional independence for Y and Z given any X.

(III) Relationship between X and Z under model (XY)

Consider the fixed level Y = j. For X focus only on levels 1 and 2. For Z focus only on levels 1 and 2.

(XY) implies conditional independence for X and Z given Y.

For the variables X, Y, and Z, the following hold true in general (assuming a non-three-way interaction model):

  • For any pair of variables whose 2-way interaction is absent from the model, we have conditional independence between that pair given the 3rd variable.
  • For any pair of variables whose 2-way interaction is present in the model, we have homogeneous association between that pair given the 3rd variable.

Three way interaction model

In the two-variable case (X and Y), recall the important note we stated about the interaction model containing the highest order interaction term.
Now let’s again consider the three variables case. Suppose X has I levels, Y has J levels, and Z has K levels.
Given that we are modeling \(μ_{ijk}\), how many total parameters can be found in the saturated model?: IJK

Consider (XYZ) – the model containing the highest order interaction term.
If you had to take a guess, how many parameters can be found in the (XYZ) model?: IxJxK

What can we therefore say about model (XYZ)?: It is the saturated model.

Using (XYZ) how would the fitted counts compare to the observed counts from the data?: Perfect fit.