Logistic Regression: Prediction

Notes

We are often interested in predicting whether a given observation will have a “yes” response
Use logistic regression model to calculate predicted log-odds that an observation has a “yes” response, then use log-oddes to calculate the predicted probability of a “yes”
Use predicted probabilities to classify the observation as having a “yes” or “no” response

Calculating the Predicted Probability

\[log(\frac{\hat{\pi}}{1-\hat{\pi}})=\hat{\beta}+\hat{beta_1x_i} \\ exp(log(\frac{\hat{\pi}}{1-\hat{\pi}})) = exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \frac{\hat{\pi_i}}{1-\hat{\pi_i}}=exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\]

Predicted Probability vs. Predicted Log-Odds

\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1}x_i)}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}=\frac{exp(\hat{log-odds}}{1+exp(\hat{log-odds})}\]

Confusion Matrix

We can use predicted probability to predict the outcome for a given observation: in other words, we can classify the observations into two groups: yes or no
establish threshold so y = 1 if predicted probability is greater than the threshold, else y = 0
confusion matrix to assess accuracy of predictions: misclassification rate: what proportion of observations were misclassified?

Sensitivity & Specificity - ROC Curve

sensitivity: proportion of observations with y = 1 that have predicted probability above specific threshold. true positive rate (y-axis) <>
specificity: proportion of observations with y = 0 that have predicted probability below specified threshold. (1 - specificty) called false positive rate (x-axis)

♥ We want: HIGH sensitivity and LOW values of 1 - specificity ♥

Area Under the Curve (AUC)

AUC = 0.5 : very bad fit

AUC close to 1: good fit

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(yardstick)

leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv")

1. Fit a model using `Age`, `Index` and `Temp` to predict the odds a patient will respond to the treatment.

leukemia <- leukemia %>%
  mutate(Resp = factor(Resp))

model <- glm(Resp ~ Age + Temp + Index, data = leukemia, family = "binomial")
tidy(model)

## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  87.4      35.5         2.46 0.0137 
## 2 Age          -0.0585    0.0256     -2.29 0.0222 
## 3 Temp         -0.0890    0.0361     -2.47 0.0136 
## 4 Index         0.385     0.122       3.17 0.00154

2. Use the augment function to obtain the predicted probability each patient in the data set will respond to the treatment.

model_aug <- augment(model, type.predict = "response")

model_aug %>%
  slice(1:10)

## # A tibble: 10 x 11
##    Resp    Age  Temp Index .fitted .se.fit  .resid    .hat .sigma .cooksd
##    <fct> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 1        20   990     7   0.696 0.165    0.851  0.128    0.960 1.84e-2
##  2 1        25  1030    16   0.609 0.267    0.996  0.300    0.954 9.86e-2
##  3 1        26   982    12   0.957 0.0422   0.295  0.0438   0.969 5.32e-4
##  4 1        26  1000    16   0.955 0.0435   0.304  0.0440   0.969 5.68e-4
##  5 1        27   980     6   0.716 0.154    0.818  0.117    0.961 1.48e-2
##  6 0        27  1010     8   0.274 0.168   -0.800  0.142    0.961 1.82e-2
##  7 1        28   986    20   0.997 0.00557  0.0809 0.00954  0.970 7.97e-6
##  8 1        28  1010    14   0.782 0.130    0.702  0.0984   0.964 8.45e-3
##  9 1        31   988     5   0.400 0.160    1.35   0.106    0.947 5.00e-2
## 10 0        33   986     7   0.605 0.138   -1.36   0.0794   0.947 3.58e-2
## # … with 1 more variable: .std.resid <dbl>

3. Construct a confusion matrix using 0.5 as the decision-making threshold.

model_aug <- model_aug %>%
  mutate(pred_resp = if_else(.fitted > 0.5, "Will Respond", "Will Not Respond"))

model_aug %>%
  count(Resp, pred_resp)

## # A tibble: 4 x 3
##   Resp  pred_resp            n
##   <fct> <chr>            <int>
## 1 0     Will Not Respond    20
## 2 0     Will Respond         7
## 3 1     Will Not Respond     5
## 4 1     Will Respond        19

4. Use your confusion matrix from the previous exercise to calculate the following:

Sensitivity

P(Predict Will respond | Resp = 1) =

19 / (5 + 19)

## [1] 0.7916667

Specificity

P(Predict Will Not Respond | Resp = 0) =

20 / (20 + 7)

## [1] 0.7407407

Misclassification rate

12/51

## [1] 0.2352941

5. Create the ROC curve for this model.

model_aug <- model_aug %>%
  mutate(Resp = fct_relevel(Resp, c("1", "0")))

roc_curve_data <- model_aug %>%
  roc_curve(Resp, .fitted)

autoplot(roc_curve_data)

ggplot(data = model_aug, aes(x = .fitted)) +
  geom_dotplot()

6. What is the area under the curve? Is this model a good fit for the data?

model_aug %>%
  roc_auc(Resp, .fitted) %>%
  pull(.estimate)

## [1] 0.8780864

It is a good fit because it is close to 1.

Logistic Regression: Conditions

Notes

Conditions for Logistic Regression

Linearity: The log-odds have a lienar relationship for the prediction
Randomness: The data were obtained from a random process
Independence: The observations are independent from one another

The Empirical Logit

The empircal logit: The log of the observed odds

\[logit(\hat{p})=log(\frac{\hat{p}}{1-\hat{p}})=log(\frac{YES}{NO})\]

Calculating Emprical Logit (Categorical Predictor)

♥ If the predictor is categorical, we can calculate the empirical logit for each level of the predictor, find linearity

Calculating Empircal Logit (Quantitative Predictor)

Divide the range of the predictor into intervals with approximately equal number of cases. If you have enough observations, use 5-10 intervals
Calculate the mean value of the predictor in each interval.
Compute the empircal logit for each interval.
♥ Then, we can create a scatterplot of the empirical logit vs. the mean value of the predictor in each interval. We’ll use plots to make determination about linearity

Checking Linearity

♥ Linearity condition satisfied if there is a relationship between the empirical logit & the predictor variables

Checking Randomness

Check conditions based on the context of the data & how the observations were collected.
was the sample randomly selected?
If the sample was NOT randomly selected, ask whether there is reason to believe the observations in the sample differ systematically from the population of interest.

♥ Randomnes condition satisfied if we do not have a reason to believe that the participants in the study differ systematically from adults in the U.S. in regards to health characteristics and risk of heart disease

Checking Independence

We can check the independence condition based on the context of the data and how the observations were collected.
Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations.

♥ Independence condition satisfied if it is reasonable to conclude that the participants health characteristics we independent of one another.

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(yardstick)

leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv") %>%
  mutate(Resp = factor(Resp))

Ultimately, we want to use the model to classify patients into two groups: those who are likely to respond to the treatment and those are not. To do so, we’ll fit an ROC curve to help us determine a decision-making threshold.

resp_model <- glm(Resp ~ Age + Index + Temp, data = leukemia, 
                  family = "binomial")

tidy(resp_model) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	87.388	35.458	2.465	0.014
Age	-0.059	0.026	-2.287	0.022
Index	0.385	0.122	3.168	0.002
Temp	-0.089	0.036	-2.467	0.014

#in probabilities instead of log odds
resp_aug <- augment(resp_model, type.predict = "response") %>%
  mutate(Resp = fct_relevel(Resp, c("1", "0")))

resp_aug %>%
  select(.fitted) %>%
  slice(1:5)

## # A tibble: 5 x 1
##   .fitted
##     <dbl>
## 1   0.696
## 2   0.609
## 3   0.957
## 4   0.955
## 5   0.716

# calculate sensitivity and specificity at each threshold
roc_curve_data <- resp_aug %>%
  roc_curve(Resp, .fitted) %>%
  mutate(false_rate = 1 - specificity)

roc_curve_data %>%
  filter(.threshold > .39, .threshold < .40)

## # A tibble: 1 x 4
##   .threshold specificity sensitivity false_rate
##        <dbl>       <dbl>       <dbl>      <dbl>
## 1      0.400       0.741       0.917      0.259

autoplot(roc_curve_data)

2. Check the model conditions - linearity, randomness, independence.

library(Stat2Data)
emplogitplot1(Resp ~ Age, data = leukemia, ngroups = 5)

emplogitplot1(Resp ~ Index, data = leukemia, ngroups = 5)

emplogitplot1(Resp ~ Temp, data = leukemia, ngroups = 5)

The linearity condition is satisfied - The empirical logit plots all show a linear relationship between the log-odds and the predictor variables.

Randomness condition is satisfied - we have no reason to believe that the people in this study are systematically different from the population of people with leukemia.

Independence condition is satisfied - The health and response of one person in the study is not related to the health and response of another person.

Multinomial Logistic Regression: Introduction

Notes

Generalized Linear Models (GLM)

Examples of linear models: Binary (win/lose), Nominal (Democrat, Republican, Independent), Ordered (Movie Rating (1-5))
Broader class of model that generalize the mulitple linear regression model

Binary Response (Logistic)

Given (P(yi=1|xi)) = predicted probability AND P(yi=0|xi)= 1- predicted probability

\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=\hat{\beta_0}+\hat{\beta_1}x_i\] * We can calculate predicted probability by solving the logit equation:

\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\] Binary Response (Logistic) y = 0 the baseline category then logistic regression model is:

\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=log(\frac{\hat{\pi_{i1}}}{\hat{\pi_{i0}}})=\hat{\beta_0}+\hat{\beta_1}x_i\]

Slope: When x increases by 1 unit, the odds of y = 1 vs the baseline y = 0 are expected to multiply by a factor of \(exp(\hat{\beta_1})\)
Intercept: When x = 0, the predicted odds of y = 1 vs. the baseline y = 0 are \(exp(\hat{\beta_0})\)

Multinomial Response Variable

Suppose the response variable y is categorical and can take values 1, 2, …, k such that k > 2

Multinomial Distributions

\[p(y=1)=\pi,p(y=2)=\pi_2..., p(y=k)=\pi_k\] Such that the sum of k = 1 to k of the porbabilities equals 1.

If we have an explanatory variable x, then we want to fit a model such that \(p(y=k)=\pi_k\) is a function of x. Choose a baseline category. If choosing y = 1: Log odds that y = k vs. y = 1:

\[log(\frac{\pi_{ik}}{\pi_{i1}}) = \beta_{0k}+\beta_{1k}x_i\] * Has linear relation with all predictor variables

* In the multinomial logistic model, we have a separate equation for each category of the response relative to the baseline category. If the response has k possible categories, there will be k-1 equations as part of the multinomial logistic model * ex. A, B, or C has outcomes A is baseline

\[log(\frac{\pi_iB}{\pi_iA})=\beta_{0B}+\beta_{1B}x_i \\ log(\frac{\pi_iC}{\pi_iA})=\beta_{0C}+\beta_{jC}x_i\] Have separate coefficients for the model: we need ALL to calculate probabilities

EDA-Code

Use multinom() in library(nnet), exponentiate = FALSE for log odds

Interpretations

ex. \[log(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}) = 0.915 + 0.003age - 1.645PhysActive\]

♥ For each additional year in age, the odds a person rates themselves as having fair health vs. excellent health are expected to multiply by 1.003(exp(0.003)), holding physical activity constant.
♥ The odds a person who does physical activity will rate themselves as having fair health vs. excellent health are expected to be 0.193 (exp(-1.645)) time sthe odds for a person who doesn’t do physical activity, holding age constant.
♥ The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)). NEED TO MEAN CENTER AGE.

Hypothesis Test for B_jk

\[H_0: \beta_{jk}=0 \text{ vs. } H_a: \beta_{jk} \neq 0 \\ \text{Test-Statistic: } z=\frac{\hat{\beta}_{jk}-0}{SE(\hat{\beta}_{jk})} \\ \text{p-value = } P(|Z|>|z|), \text{where Z ~ N(0,1), the standard Normal distribution}\]

Confidence Interval for B_jk

\[\hat{\beta}_{jk} \pm Z^*SE(\hat{\beta}_{jk}) \\ \text{Where Z is calculated from N(0,1) distribution}\]

♥ We are C% confident that for every unit change in x_j, the odds of y = k vs. baseline y = 1 are expected to multiply by a factor of \(exp(\hat{\beta}_{jk} - Z^*SE(\hat{\beta}_{jk})\) to \(exp(\hat{\beta}_{jk} + Z^*SE(\hat{\beta}_{jk}))\), holding all else constant. If confidence interval not include 0 and p-value less than 0.05, strongn evidence

Code

Load Packages & Data

library(tidyverse) library(knitr) library(broom) library(nnet) sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")

As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.

The show is only effective if children watch it, so we want to understand what effect the encouragement had on the frequency children watched the program.

Response:

viewcat

1: rarely watched show

2: once or twice a week

3: three to five times a week

4: watched show on average more than five times a week

Predictors:

age: child’s age in months

prenumb: score on numbers pretest (0 to 54)

prelet: score on letters pretest (0 to 58)

viewenc: 1: encouraged to watch, 0: not encouraged

site:

1: three to five year old from urban area

2: four year old from suburban area

3: from rural area with high socioeconomic status

4: from rural area with low socioeconomic status

5: from Spanish speaking home

# mean-center relevant continuous variables, make categorical variables factors sesame <- sesame %>% mutate(viewcat = factor(viewcat), site = factor(site), prenumbCent = prenumb - mean(prenumb), preletCent = prelet - mean(prelet), ageCent = age - mean(age), viewenc = if_else(viewenc == 1, "1", "0"))

Exploratory data analysis

Create a plot to visualize the relationship between the response and viewenc, the primary variable of interest in this analysis. What do you observe from the plot?

ggplot(sesame, aes(x = viewenc, fill = viewcat)) + geom_bar(position = "fill")

Create a plot to visualize the relationship between the response and a quantitative predictor variable. What do you observe from the plot?

ggplot(data = sesame, aes(x = prenumb, y = viewcat)) + geom_boxplot()

Model selection

Fit a model with all predictors except the primary variable of interest, viewenc.

full_model <- multinom(viewcat ~ site + prenumbCent + preletCent + ageCent, data = sesame)

## # weights: 36 (24 variable) ## initial value 332.710647 ## iter 10 value 318.011845 ## iter 20 value 301.917674 ## final value 301.773530 ## converged

tidy(full_model)

## # A tibble: 24 x 6 ## y.level term estimate std.error statistic p.value ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 2 (Intercept) 1.97 0.423 1.60 0.110 ## 2 2 site2 1.16 0.732 0.199 0.842 ## 3 2 site3 0.878 0.580 -0.224 0.823 ## 4 2 site4 0.243 0.572 -2.47 0.0134 ## 5 2 site5 0.554 0.717 -0.824 0.410 ## 6 2 prenumbCent 1.05 0.0324 1.42 0.155 ## 7 2 preletCent 0.973 0.0378 -0.733 0.464 ## 8 2 ageCent 0.969 0.0371 -0.849 0.396 ## 9 3 (Intercept) 2.31 0.416 2.01 0.0444 ## 10 3 site2 1.46 0.702 0.540 0.589 ## # … with 14 more rows

Conduct backward model selection using the step function and AIC as the model selection criterion. Display the chosen model.

selected_model <- step(full_model, direction = "backward")

## Start: AIC=651.55 ## viewcat ~ site + prenumbCent + preletCent + ageCent ## ## trying - site ## # weights: 20 (12 variable) ## initial value 332.710647 ## iter 10 value 323.332793 ## final value 321.476167 ## converged ## trying - prenumbCent ## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 312.747389 ## iter 20 value 306.064567 ## final value 306.039662 ## converged ## trying - preletCent ## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 312.926919 ## iter 20 value 302.585673 ## final value 302.570483 ## converged ## trying - ageCent ## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 313.637377 ## iter 20 value 302.234054 ## final value 302.185732 ## converged ## Df AIC ## - ageCent 21 646.3715 ## - preletCent 21 647.1410 ## <none> 24 651.5471 ## - prenumbCent 21 654.0793 ## - site 12 666.9523 ## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 313.637377 ## iter 20 value 302.234054 ## final value 302.185732 ## converged ## ## Step: AIC=646.37 ## viewcat ~ site + prenumbCent + preletCent ## ## trying - site ## # weights: 16 (9 variable) ## initial value 332.710647 ## iter 10 value 322.753191 ## final value 322.577181 ## converged ## trying - prenumbCent ## # weights: 28 (18 variable) ## initial value 332.710647 ## iter 10 value 309.307098 ## iter 20 value 306.357823 ## final value 306.357651 ## converged ## trying - preletCent ## # weights: 28 (18 variable) ## initial value 332.710647 ## iter 10 value 304.700664 ## iter 20 value 303.024907 ## final value 303.024804 ## converged ## Df AIC ## - preletCent 18 642.0496 ## <none> 21 646.3715 ## - prenumbCent 18 648.7153 ## - site 9 663.1544 ## # weights: 28 (18 variable) ## initial value 332.710647 ## iter 10 value 304.700664 ## iter 20 value 303.024907 ## final value 303.024804 ## converged ## ## Step: AIC=642.05 ## viewcat ~ site + prenumbCent ## ## trying - site ## # weights: 12 (6 variable) ## initial value 332.710647 ## iter 10 value 323.301870 ## final value 323.301813 ## converged ## trying - prenumbCent ## # weights: 24 (15 variable) ## initial value 332.710647 ## iter 10 value 308.919709 ## final value 308.675610 ## converged ## Df AIC ## <none> 18 642.0496 ## - prenumbCent 15 647.3512 ## - site 6 658.6036

tidy(selected_model, conf.int = T)

## # A tibble: 18 x 8 ## y.level term estimate std.error statistic p.value conf.low conf.high ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 (Intercept) 2.00 0.420 1.66 0.0978 0.880 4.56 ## 2 2 site2 1.19 0.731 0.239 0.811 0.284 4.99 ## 3 2 site3 0.794 0.562 -0.410 0.682 0.264 2.39 ## 4 2 site4 0.231 0.562 -2.61 0.00900 0.0767 0.693 ## 5 2 site5 0.520 0.712 -0.918 0.359 0.129 2.10 ## 6 2 prenumbCent 1.02 0.0219 0.952 0.341 0.978 1.07 ## 7 3 (Intercept) 2.31 0.414 2.02 0.0430 1.03 5.20 ## 8 3 site2 1.56 0.699 0.639 0.523 0.397 6.16 ## 9 3 site3 0.918 0.555 -0.154 0.878 0.309 2.72 ## 10 3 site4 0.124 0.618 -3.38 0.000732 0.0369 0.416 ## 11 3 site5 0.0730 1.16 -2.25 0.0243 0.00747 0.712 ## 12 3 prenumbCent 1.05 0.0221 2.23 0.0255 1.01 1.10 ## 13 4 (Intercept) 1.51 0.448 0.922 0.357 0.628 3.63 ## 14 4 site2 3.08 0.707 1.59 0.111 0.772 12.3 ## 15 4 site3 1.11 0.598 0.169 0.866 0.343 3.57 ## 16 4 site4 0.125 0.706 -2.95 0.00321 0.0314 0.499 ## 17 4 site5 0.523 0.771 -0.842 0.400 0.115 2.37 ## 18 4 prenumbCent 1.07 0.0223 2.89 0.00379 1.02 1.11

Interpretation + conclusions

Interpret the intercept associated with the odds of a child being in the category viewcat == 2 versus viewcat == 1.

The odds that a child who is in Site 1 and who has an average score on the numbers pretest 20.854 watches the show 1 - 2 times per week versus rarely are predicted to be round(exp(0.69501925),3)

Interpret the coefficient of one numeric predictor in terms of the odds of a child being in the category viewcat == 2 versus viewcat == 1. Based on the confidence interval for the coefficient, is the numeric predictor a statistically significant predictor of viewership?

For each additional point on the numbers pretest, the odds a child watches the show 1 - 2 times per week versus rarely are expected to multiply by a factor of round(exp(0.02089701),3), holding the site constant.

The 95% confidence interval is about (-0.022116790, 0.06391081) which includes 0. Therefore, after accounting for the site, the score on the numbers pretest is not a statistically significant predictor of the odds a child is in viewer category 2 versus category 1.

The primary objective of the experiment was to understand the effect of encouragement viewenc on viewership. Does encouragement have a significant effect on viewership after adjusting for other possible factors? If so, describe the effect. Otherwise, explain why not.

new_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)

## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 288.361620 ## iter 20 value 281.624328 ## final value 281.602888 ## converged

tidy(new_model) %>% kable(digits = 3)

y.level term estimate std.error statistic p.value

2 (Intercept) 0.815 0.484 -0.421 0.674

2 site2 0.934 0.774 -0.088 0.929

2 site3 0.343 0.640 -1.670 0.095

2 site4 0.149 0.640 -2.971 0.003

2 site5 0.170 0.830 -2.136 0.033

2 prenumbCent 1.023 0.024 0.967 0.334

2 viewenc1 14.181 0.493 5.378 0.000

3 (Intercept) 1.052 0.467 0.108 0.914

3 site2 1.249 0.739 0.300 0.764

3 site3 0.415 0.629 -1.399 0.162

3 site4 0.085 0.681 -3.621 0.000

3 site5 0.025 1.235 -2.974 0.003

3 prenumbCent 1.053 0.024 2.184 0.029

3 viewenc1 11.782 0.494 4.997 0.000

4 (Intercept) 0.761 0.499 -0.547 0.584

4 site2 2.507 0.741 1.241 0.215

4 site3 0.525 0.663 -0.973 0.330

4 site4 0.089 0.753 -3.211 0.001

4 site5 0.193 0.869 -1.893 0.058

4 prenumbCent 1.069 0.023 2.831 0.005

4 viewenc1 9.886 0.501 4.575 0.000

When we look at the coefficients from encouragement levels from the odds of level 2 vs level 1 (2.6 to 2.2), we can see that after adjusting for the site and pretest score, the impact of the encouragement for watching the show a lot more reduces. So there is a greater impact for the jump from not watching the show to watching the show a little bit, but less of an impact from rarely to everyday.

y.level	term	estimate	std.error	statistic	p.value
2	(Intercept)	0.815	0.484	-0.421	0.674
2	site2	0.934	0.774	-0.088	0.929
2	site3	0.343	0.640	-1.670	0.095
2	site4	0.149	0.640	-2.971	0.003
2	site5	0.170	0.830	-2.136	0.033
2	prenumbCent	1.023	0.024	0.967	0.334
2	viewenc1	14.181	0.493	5.378	0.000
3	(Intercept)	1.052	0.467	0.108	0.914
3	site2	1.249	0.739	0.300	0.764
3	site3	0.415	0.629	-1.399	0.162
3	site4	0.085	0.681	-3.621	0.000
3	site5	0.025	1.235	-2.974	0.003
3	prenumbCent	1.053	0.024	2.184	0.029
3	viewenc1	11.782	0.494	4.997	0.000
4	(Intercept)	0.761	0.499	-0.547	0.584
4	site2	2.507	0.741	1.241	0.215
4	site3	0.525	0.663	-0.973	0.330
4	site4	0.089	0.753	-3.211	0.001
4	site5	0.193	0.869	-1.893	0.058
4	prenumbCent	1.069	0.023	2.831	0.005
4	viewenc1	9.886	0.501	4.575	0.000

Multinomial Logistic Regression pt. 2

Notes

Prediction

We need all probabilities! Predict function & observation number
Actual vs. Predicted
♥ For each observation, the predicted perceived health rating is the category with the highest predicted probability.
♥ If you have categories that are selected much more often than others, the analyzation will pick up on that. * Consider adding more predictor variables * Combine Category with categories not selected as much * Can also combine to make logistic model
♥ Check distribution of predicted variables!

Model Selection

Drop in deviance test \[H_0: \beta_1=\beta_2=\beta_3=0 \\ H_a: \text{at least one B_j is not equal to 0.}\] model_red <– multinom(…)
model_full <– multinom(…)
then anova (model_red, model_full, test = “Chisq”)
Compare models using AIC with step function

Checking Conditions

Linearity: Is there a linear relationship between the log-odds and the predictor variables?

Randomness: Was the sample randomly selected? Or can we reasonably treat it as random?

Independence: There is no obvious relationship between observations

Checking Linearity

Examine empircal logit plots between each level of the response & the quantitative predictor variables.
mutate(x = factor|if_else…) emplogit1() Check for general linear trend
♥ The linearity condition is satisfied. There is a linear relationship between the empircal logit & the quantitative predictor variable

Checking Randomness

Was the sample randomly selected? If the sample was not randomly selected, ask whether there is reason to believe the observations in the smaple differ systemtaicallly from the population of interest.
♥ The randomness condition is satisfied, we do not have any reason to believe that the participants in the study differ systematically from adults in the U.S.

Checking Independence

Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations
♥ The independence condition is satisfied. It is reasonable to conclude that the participants’ health and behavior characteristics are independent of on another.

Multinomial Logistic Regression pt. 2

Log-Linear Models

Notes

Find mean & variance & distribution of predictor variables

We want percent variance (lambda) function of predictor variables x1,…,xp

Why is a multiple linear regression model not appropriate? </mark?>

lambda must be greater than or equal to 0 for any combination of predictor variables

constant variance assumption will be violated!

Poisson Regression

curved upwards

If observed values of Y of poisson, then we can model usingn a poisson regression model:

\[log(x_i)=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+...+\beta_px_{pi}\]

Interpreting Model Coefficients

Slope, B_j:
* Quantitative Predictor: When x_j increases by one unit, the mean of y is expected to multiply by a factor of exp{B_j}, holding all else constant. * Categorical Predictor: The mean of y for category k is expected to be exp{b_j} times the mean of y for the baseline category, holding all else constant.

Intercept, B_0:
* When all of the predictors equal 0, the mean of y is expected to be exp(B_0).
ex. number in household vs. age.
model1 <- glm(total~ageCent, data = hh, family = “poisson”)
tidy(model1, conf.int = T) %>%
kable(digits = 3)
log(mean total) = 1.303-.0047 x ageCent

Interpretations

♥For each additional year the head of the household is, we expect the mean number in the house to multiply by a factor of 0.995(exp(-0.0047)).
♥ For households with a head of the household who is 52.657 years old, we expect the mean number of people in the household to be 3.68 (exp(1.303))

Drop-in-Deviance Test

To compare nested models (similar to logistic regression)

CAN ADD QUADRATIC TERM FOR AGE

Model Assumptions

Poisson Response: The response follows a poisson distribution for each level of the predictor.

Independence: The observations are independent of one another.

Mean = Variance: The mean value of the response equals the variance of the response for each level of the predictor.

Linearity: log(lambda) is a linear function of the predictors

Poisson Response

Create intervals for groups (mutate)

ex. “The condition is satisfied based on the overall distribution of the response (from the EDA) and the distribution of the responsen by age group.”

Skewed right

Independence

Randomly selected

Mean = Variance

Mean & variance for each group

Look if they are close

Linearity

Look at residuals
♥ The raw residuals for the ith observation, y_i-predicted lambda_i, is difficult to interpret since the variable is equal to the mean in the poisson distribution

Instead, we can analyze a standard residual called the Pearson residual:

\[r_i=\frac{y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

Augment function: Plot of predicted mean or pearson residual
* Look for distinguishible patterns

Code

Load Packages & Data

library(tidyverse) library(knitr) library(broom) library(nnet) sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")

# mean-center relevant continuous variables, make categorical variables factors sesame <- sesame %>% mutate(viewcat = factor(viewcat), site = factor(site), prenumbCent = prenumb - mean(prenumb), preletCent = prelet - mean(prelet), ageCent = age - mean(age), viewenc = ifelse(viewenc == 1, "1", "0"))

Part 1. Last time we fit a multinomial logistic model using viewenc, prenumbCent, and site to predict how frequently a child viewed Sesame Street (viewcat).

view_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)

## # weights: 32 (21 variable) ## initial value 332.710647 ## iter 10 value 288.361620 ## iter 20 value 281.624328 ## final value 281.602888 ## converged

tidy(view_model) %>% kable(digits = 3)

y.level term estimate std.error statistic p.value

2 (Intercept) 0.815 0.484 -0.421 0.674

2 site2 0.934 0.774 -0.088 0.929

2 site3 0.343 0.640 -1.670 0.095

2 site4 0.149 0.640 -2.971 0.003

2 site5 0.170 0.830 -2.136 0.033

2 prenumbCent 1.023 0.024 0.967 0.334

2 viewenc1 14.181 0.493 5.378 0.000

3 (Intercept) 1.052 0.467 0.108 0.914

3 site2 1.249 0.739 0.300 0.764

3 site3 0.415 0.629 -1.399 0.162

3 site4 0.085 0.681 -3.621 0.000

3 site5 0.025 1.235 -2.974 0.003

3 prenumbCent 1.053 0.024 2.184 0.029

3 viewenc1 11.782 0.494 4.997 0.000

4 (Intercept) 0.761 0.499 -0.547 0.584

4 site2 2.507 0.741 1.241 0.215

4 site3 0.525 0.663 -0.973 0.330

4 site4 0.089 0.753 -3.211 0.001

4 site5 0.193 0.869 -1.893 0.058

4 prenumbCent 1.069 0.023 2.831 0.005

4 viewenc1 9.886 0.501 4.575 0.000

Let’s get the predicted view category using the augment function .

sesame <- sesame %>% mutate(pred_view = predict(view_model, type = "class"))

Make a table to view the actual vs. predicted view categories. How well did the model perform?

sesame %>% count(viewcat, pred_view) %>% pivot_wider(names_from = pred_view, values_from = n)

## # A tibble: 4 x 5 ## viewcat `1` `2` `3` `4` ## <fct> <int> <int> <int> <int> ## 1 1 37 9 5 3 ## 2 2 8 24 18 10 ## 3 3 7 15 21 21 ## 4 4 5 10 20 27

(37 + 24 + 21 + 27) / nrow(sesame)

## [1] 0.4541667

The numbers on the diagonal are the ones we get correct. The most errors occur in groups 2, 3 and 4. This model is good at picking out the extremes, but not the middle. Could group this differently so that it is people who watch every day vs. not every day.

Part 2: Log-linear models

The data come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

The variables in the data are

age: the age of the head of household

total: the number of people in the household other than the head

location: where the house is located (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas)

numLT5: the number in the household under 5 years of age

roof: the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material, where stronger material can sometimes be used as a proxy for greater wealth)

hh <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/fHH1.csv") %>% filter(age < 95) %>% mutate(ageCent = age - mean(age))

We fit the following model:

hh_model <- glm(total ~ ageCent + I(ageCent^2), data = hh, family = "poisson") tidy(hh_model) %>% kable(digits = 3)

term estimate std.error statistic p.value

(Intercept) 1.436 0.017 82.339 0

ageCent -0.004 0.001 -3.584 0

I(ageCent^2) -0.001 0.000 -10.938 0

1. Interpret the coefficient of ageCent^2 in the context of the data.

Interpreting quadratic terms: talking about the entire effect - choose values of x to change between and report the change for those values.

How response changes between Q1 and Q3 of x. Q1 = 42 and Q3 = 63.

When the age of the head of the household changes from 42 to 63 the log (mean number in household) is expected to change by -0.004(63-42) + -0.001(63^2 - 42^2).

When the household age changes from 42 to 63 years old, the mean number in the household is expected to multiply by a factor of exp(-0.004(63-42) + -0.001(63^2 - 42^2))

2. Conduct a test to assess whether location is a useful predictor of the number of people in the household after accounting for age of the head of the household.

term	estimate	std.error	statistic
(Intercept)	1.436	0.017	82.339
ageCent	-0.004	0.001	-3.584
I(ageCent^2)	-0.001	0.000	-10.938

Quiz 4 Study Guide

Courtney Lee

11/03/2020

Logistic Regression: Prediction

Notes

Calculating the Predicted Probability

Predicted Probability vs. Predicted Log-Odds

Confusion Matrix

Sensitivity & Specificity - ROC Curve

Area Under the Curve (AUC)

Code

Load Packages & Data

1. Fit a model using Age, Index and Temp to predict the odds a patient will respond to the treatment.

2. Use the augment function to obtain the predicted probability each patient in the data set will respond to the treatment.

3. Construct a confusion matrix using 0.5 as the decision-making threshold.

4. Use your confusion matrix from the previous exercise to calculate the following:

5. Create the ROC curve for this model.

6. What is the area under the curve? Is this model a good fit for the data?

Logistic Regression: Conditions

Notes

Conditions for Logistic Regression

The Empirical Logit

Calculating Emprical Logit (Categorical Predictor)

Calculating Empircal Logit (Quantitative Predictor)

Checking Linearity

Checking Randomness

Checking Independence

Code

Load Packages & Data

Ultimately, we want to use the model to classify patients into two groups: those who are likely to respond to the treatment and those are not. To do so, we’ll fit an ROC curve to help us determine a decision-making threshold.

2. Check the model conditions - linearity, randomness, independence.

Multinomial Logistic Regression: Introduction

Notes

Generalized Linear Models (GLM)

Binary Response (Logistic)

Multinomial Response Variable

Multinomial Distributions

EDA-Code

Interpretations

Hypothesis Test for B_jk

Confidence Interval for B_jk

Code

Load Packages & Data

Exploratory data analysis

Model selection

Interpretation + conclusions

Multinomial Logistic Regression pt. 2

Notes

Prediction

Model Selection

Checking Conditions

Checking Linearity

Checking Randomness

Checking Independence

Multinomial Logistic Regression pt. 2

Log-Linear Models

Notes

Poisson Regression

Interpreting Model Coefficients

Interpretations

Drop-in-Deviance Test

Model Assumptions

Poisson Response

Independence

Mean = Variance

Linearity

Code

Load Packages & Data

Part 1. Last time we fit a multinomial logistic model using viewenc, prenumbCent, and site to predict how frequently a child viewed Sesame Street (viewcat).

Part 2: Log-linear models

1. Interpret the coefficient of ageCent^2 in the context of the data.

2. Conduct a test to assess whether location is a useful predictor of the number of people in the household after accounting for age of the head of the household.

1. Fit a model using `Age`, `Index` and `Temp` to predict the odds a patient will respond to the treatment.

Part 1. Last time we fit a multinomial logistic model using `viewenc`, `prenumbCent`, and `site` to predict how frequently a child viewed Sesame Street (`viewcat`).

1. Interpret the coefficient of `ageCent^2` in the context of the data.

2. Conduct a test to assess whether `location` is a useful predictor of the number of people in the household after accounting for age of the head of the household.