Logistic Regression: Prediction

Notes

  • We are often interested in predicting whether a given observation will have a “yes” response
  • Use logistic regression model to calculate predicted log-odds that an observation has a “yes” response, then use log-oddes to calculate the predicted probability of a “yes”
  • Use predicted probabilities to classify the observation as having a “yes” or “no” response

Calculating the Predicted Probability

\[log(\frac{\hat{\pi}}{1-\hat{\pi}})=\hat{\beta}+\hat{beta_1x_i} \\ exp(log(\frac{\hat{\pi}}{1-\hat{\pi}})) = exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \frac{\hat{\pi_i}}{1-\hat{\pi_i}}=exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\]

Predicted Probability vs. Predicted Log-Odds

\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1}x_i)}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}=\frac{exp(\hat{log-odds}}{1+exp(\hat{log-odds})}\]

Confusion Matrix

  • We can use predicted probability to predict the outcome for a given observation: in other words, we can classify the observations into two groups: yes or no
  • establish threshold so y = 1 if predicted probability is greater than the threshold, else y = 0
  • confusion matrix to assess accuracy of predictions: misclassification rate: what proportion of observations were misclassified?

Sensitivity & Specificity - ROC Curve

  • sensitivity: proportion of observations with y = 1 that have predicted probability above specific threshold. true positive rate (y-axis) <>
  • specificity: proportion of observations with y = 0 that have predicted probability below specified threshold. (1 - specificty) called false positive rate (x-axis)

♥ We want: HIGH sensitivity and LOW values of 1 - specificity ♥

Area Under the Curve (AUC)

  • AUC = 0.5 : very bad fit
  • AUC close to 1: good fit

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(yardstick)

leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv")

1. Fit a model using Age, Index and Temp to predict the odds a patient will respond to the treatment.

leukemia <- leukemia %>%
  mutate(Resp = factor(Resp))
model <- glm(Resp ~ Age + Temp + Index, data = leukemia, family = "binomial")
tidy(model)
## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  87.4      35.5         2.46 0.0137 
## 2 Age          -0.0585    0.0256     -2.29 0.0222 
## 3 Temp         -0.0890    0.0361     -2.47 0.0136 
## 4 Index         0.385     0.122       3.17 0.00154

2. Use the augment function to obtain the predicted probability each patient in the data set will respond to the treatment.

model_aug <- augment(model, type.predict = "response")

model_aug %>%
  slice(1:10)
## # A tibble: 10 x 11
##    Resp    Age  Temp Index .fitted .se.fit  .resid    .hat .sigma .cooksd
##    <fct> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 1        20   990     7   0.696 0.165    0.851  0.128    0.960 1.84e-2
##  2 1        25  1030    16   0.609 0.267    0.996  0.300    0.954 9.86e-2
##  3 1        26   982    12   0.957 0.0422   0.295  0.0438   0.969 5.32e-4
##  4 1        26  1000    16   0.955 0.0435   0.304  0.0440   0.969 5.68e-4
##  5 1        27   980     6   0.716 0.154    0.818  0.117    0.961 1.48e-2
##  6 0        27  1010     8   0.274 0.168   -0.800  0.142    0.961 1.82e-2
##  7 1        28   986    20   0.997 0.00557  0.0809 0.00954  0.970 7.97e-6
##  8 1        28  1010    14   0.782 0.130    0.702  0.0984   0.964 8.45e-3
##  9 1        31   988     5   0.400 0.160    1.35   0.106    0.947 5.00e-2
## 10 0        33   986     7   0.605 0.138   -1.36   0.0794   0.947 3.58e-2
## # … with 1 more variable: .std.resid <dbl>

3. Construct a confusion matrix using 0.5 as the decision-making threshold.

model_aug <- model_aug %>%
  mutate(pred_resp = if_else(.fitted > 0.5, "Will Respond", "Will Not Respond"))
model_aug %>%
  count(Resp, pred_resp)
## # A tibble: 4 x 3
##   Resp  pred_resp            n
##   <fct> <chr>            <int>
## 1 0     Will Not Respond    20
## 2 0     Will Respond         7
## 3 1     Will Not Respond     5
## 4 1     Will Respond        19

4. Use your confusion matrix from the previous exercise to calculate the following:

  • Sensitivity

P(Predict Will respond | Resp = 1) =

19 / (5 + 19)
## [1] 0.7916667
  • Specificity

P(Predict Will Not Respond | Resp = 0) =

20 / (20 + 7)
## [1] 0.7407407
  • Misclassification rate
12/51
## [1] 0.2352941

5. Create the ROC curve for this model.

model_aug <- model_aug %>%
  mutate(Resp = fct_relevel(Resp, c("1", "0")))

roc_curve_data <- model_aug %>%
  roc_curve(Resp, .fitted)

autoplot(roc_curve_data)

ggplot(data = model_aug, aes(x = .fitted)) +
  geom_dotplot()

6. What is the area under the curve? Is this model a good fit for the data?

model_aug %>%
  roc_auc(Resp, .fitted) %>%
  pull(.estimate)
## [1] 0.8780864

It is a good fit because it is close to 1.

Logistic Regression: Conditions

Notes

Conditions for Logistic Regression

  1. Linearity: The log-odds have a lienar relationship for the prediction
  2. Randomness: The data were obtained from a random process
  3. Independence: The observations are independent from one another

The Empirical Logit

  • The empircal logit: The log of the observed odds

\[logit(\hat{p})=log(\frac{\hat{p}}{1-\hat{p}})=log(\frac{YES}{NO})\]

Calculating Emprical Logit (Categorical Predictor)

♥ If the predictor is categorical, we can calculate the empirical logit for each level of the predictor, find linearity

Calculating Empircal Logit (Quantitative Predictor)

  1. Divide the range of the predictor into intervals with approximately equal number of cases. If you have enough observations, use 5-10 intervals
  2. Calculate the mean value of the predictor in each interval.
  3. Compute the empircal logit for each interval.
    ♥ Then, we can create a scatterplot of the empirical logit vs. the mean value of the predictor in each interval. We’ll use plots to make determination about linearity

Checking Linearity

♥ Linearity condition satisfied if there is a relationship between the empirical logit & the predictor variables

Checking Randomness

  • Check conditions based on the context of the data & how the observations were collected.
  • was the sample randomly selected?
  • If the sample was NOT randomly selected, ask whether there is reason to believe the observations in the sample differ systematically from the population of interest.

♥ Randomnes condition satisfied if we do not have a reason to believe that the participants in the study differ systematically from adults in the U.S. in regards to health characteristics and risk of heart disease

Checking Independence

  • We can check the independence condition based on the context of the data and how the observations were collected.
  • Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations.

♥ Independence condition satisfied if it is reasonable to conclude that the participants health characteristics we independent of one another.

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(yardstick)

leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv") %>%
  mutate(Resp = factor(Resp))

Ultimately, we want to use the model to classify patients into two groups: those who are likely to respond to the treatment and those are not. To do so, we’ll fit an ROC curve to help us determine a decision-making threshold.

resp_model <- glm(Resp ~ Age + Index + Temp, data = leukemia, 
                  family = "binomial")

tidy(resp_model) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 87.388 35.458 2.465 0.014
Age -0.059 0.026 -2.287 0.022
Index 0.385 0.122 3.168 0.002
Temp -0.089 0.036 -2.467 0.014
#in probabilities instead of log odds
resp_aug <- augment(resp_model, type.predict = "response") %>%
  mutate(Resp = fct_relevel(Resp, c("1", "0")))
resp_aug %>%
  select(.fitted) %>%
  slice(1:5)
## # A tibble: 5 x 1
##   .fitted
##     <dbl>
## 1   0.696
## 2   0.609
## 3   0.957
## 4   0.955
## 5   0.716
# calculate sensitivity and specificity at each threshold
roc_curve_data <- resp_aug %>%
  roc_curve(Resp, .fitted) %>%
  mutate(false_rate = 1 - specificity)

roc_curve_data %>%
  filter(.threshold > .39, .threshold < .40)
## # A tibble: 1 x 4
##   .threshold specificity sensitivity false_rate
##        <dbl>       <dbl>       <dbl>      <dbl>
## 1      0.400       0.741       0.917      0.259
autoplot(roc_curve_data)

1. Suppose a doctor wants to use your model to determine if she should recommend this particular treatment for her patients with leukemia. Based on the data from the ROC curve, what decision-making threshold do you recommend the doctor use to select patients for this treatment? What factors did you consider to determine the threshold?

The threshold is the highest combined value of sensitivity and 1-specificity to maximize both components. This gives us the highest probability of getting the most true positives and the most true negatives. This is 0.399.

2. Check the model conditions - linearity, randomness, independence.

library(Stat2Data)
emplogitplot1(Resp ~ Age, data = leukemia, ngroups = 5)

emplogitplot1(Resp ~ Index, data = leukemia, ngroups = 5)

emplogitplot1(Resp ~ Temp, data = leukemia, ngroups = 5)

The linearity condition is satisfied - The empirical logit plots all show a linear relationship between the log-odds and the predictor variables.

Randomness condition is satisfied - we have no reason to believe that the people in this study are systematically different from the population of people with leukemia.

Independence condition is satisfied - The health and response of one person in the study is not related to the health and response of another person.

Multinomial Logistic Regression: Introduction

Notes

Generalized Linear Models (GLM)

  • Examples of linear models: Binary (win/lose), Nominal (Democrat, Republican, Independent), Ordered (Movie Rating (1-5))
  • Broader class of model that generalize the mulitple linear regression model

Binary Response (Logistic)

  • Given (P(yi=1|xi)) = predicted probability AND P(yi=0|xi)= 1- predicted probability

\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=\hat{\beta_0}+\hat{\beta_1}x_i\] * We can calculate predicted probability by solving the logit equation:

\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\] Binary Response (Logistic) y = 0 the baseline category then logistic regression model is:

\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=log(\frac{\hat{\pi_{i1}}}{\hat{\pi_{i0}}})=\hat{\beta_0}+\hat{\beta_1}x_i\]

  • Slope: When x increases by 1 unit, the odds of y = 1 vs the baseline y = 0 are expected to multiply by a factor of \(exp(\hat{\beta_1})\)
  • Intercept: When x = 0, the predicted odds of y = 1 vs. the baseline y = 0 are \(exp(\hat{\beta_0})\)

Multinomial Response Variable

  • Suppose the response variable y is categorical and can take values 1, 2, …, k such that k > 2

Multinomial Distributions

\[p(y=1)=\pi,p(y=2)=\pi_2..., p(y=k)=\pi_k\] Such that the sum of k = 1 to k of the porbabilities equals 1.

  • If we have an explanatory variable x, then we want to fit a model such that \(p(y=k)=\pi_k\) is a function of x. Choose a baseline category. If choosing y = 1: Log odds that y = k vs. y = 1:

\[log(\frac{\pi_{ik}}{\pi_{i1}}) = \beta_{0k}+\beta_{1k}x_i\] * Has linear relation with all predictor variables

* In the multinomial logistic model, we have a separate equation for each category of the response relative to the baseline category. If the response has k possible categories, there will be k-1 equations as part of the multinomial logistic model * ex. A, B, or C has outcomes A is baseline

\[log(\frac{\pi_iB}{\pi_iA})=\beta_{0B}+\beta_{1B}x_i \\ log(\frac{\pi_iC}{\pi_iA})=\beta_{0C}+\beta_{jC}x_i\] Have separate coefficients for the model: we need ALL to calculate probabilities

EDA-Code

Use multinom() in library(nnet), exponentiate = FALSE for log odds

Interpretations

ex. \[log(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}) = 0.915 + 0.003age - 1.645PhysActive\]

For each additional year in age, the odds a person rates themselves as having fair health vs. excellent health are expected to multiply by 1.003(exp(0.003)), holding physical activity constant.
The odds a person who does physical activity will rate themselves as having fair health vs. excellent health are expected to be 0.193 (exp(-1.645)) time sthe odds for a person who doesn’t do physical activity, holding age constant.
The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)). NEED TO MEAN CENTER AGE.

Hypothesis Test for B_jk

\[H_0: \beta_{jk}=0 \text{ vs. } H_a: \beta_{jk} \neq 0 \\ \text{Test-Statistic: } z=\frac{\hat{\beta}_{jk}-0}{SE(\hat{\beta}_{jk})} \\ \text{p-value = } P(|Z|>|z|), \text{where Z ~ N(0,1), the standard Normal distribution}\]

Confidence Interval for B_jk

\[\hat{\beta}_{jk} \pm Z^*SE(\hat{\beta}_{jk}) \\ \text{Where Z is calculated from N(0,1) distribution}\]

We are C% confident that for every unit change in x_j, the odds of y = k vs. baseline y = 1 are expected to multiply by a factor of \(exp(\hat{\beta}_{jk} - Z^*SE(\hat{\beta}_{jk})\) to \(exp(\hat{\beta}_{jk} + Z^*SE(\hat{\beta}_{jk}))\), holding all else constant. If confidence interval not include 0 and p-value less than 0.05, strongn evidence

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(nnet)

sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")

As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.

The show is only effective if children watch it, so we want to understand what effect the encouragement had on the frequency children watched the program.

Response:

  • viewcat
  • 1: rarely watched show
  • 2: once or twice a week
  • 3: three to five times a week
  • 4: watched show on average more than five times a week

Predictors:

  • age: child’s age in months
  • prenumb: score on numbers pretest (0 to 54)
  • prelet: score on letters pretest (0 to 58)
  • viewenc: 1: encouraged to watch, 0: not encouraged
  • site:
    • 1: three to five year old from urban area
    • 2: four year old from suburban area
    • 3: from rural area with high socioeconomic status
    • 4: from rural area with low socioeconomic status
    • 5: from Spanish speaking home
# mean-center relevant continuous variables, make categorical variables factors
sesame <- sesame %>% 
  mutate(viewcat = factor(viewcat), 
         site = factor(site), 
         prenumbCent = prenumb - mean(prenumb), 
         preletCent = prelet - mean(prelet), 
         ageCent = age - mean(age),
         viewenc = if_else(viewenc == 1, "1", "0"))

Exploratory data analysis

  1. Create a plot to visualize the relationship between the response and viewenc, the primary variable of interest in this analysis. What do you observe from the plot?
ggplot(sesame, aes(x = viewenc, fill = viewcat)) +
  geom_bar(position = "fill")

  1. Create a plot to visualize the relationship between the response and a quantitative predictor variable. What do you observe from the plot?
ggplot(data = sesame, aes(x = prenumb, y = viewcat)) +
  geom_boxplot()

Model selection

  1. Fit a model with all predictors except the primary variable of interest, viewenc.
full_model <- multinom(viewcat ~ site + prenumbCent + preletCent + ageCent, data = sesame)
## # weights:  36 (24 variable)
## initial  value 332.710647 
## iter  10 value 318.011845
## iter  20 value 301.917674
## final  value 301.773530 
## converged
tidy(full_model)
## # A tibble: 24 x 6
##    y.level term        estimate std.error statistic p.value
##    <chr>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 2       (Intercept)    1.97     0.423      1.60   0.110 
##  2 2       site2          1.16     0.732      0.199  0.842 
##  3 2       site3          0.878    0.580     -0.224  0.823 
##  4 2       site4          0.243    0.572     -2.47   0.0134
##  5 2       site5          0.554    0.717     -0.824  0.410 
##  6 2       prenumbCent    1.05     0.0324     1.42   0.155 
##  7 2       preletCent     0.973    0.0378    -0.733  0.464 
##  8 2       ageCent        0.969    0.0371    -0.849  0.396 
##  9 3       (Intercept)    2.31     0.416      2.01   0.0444
## 10 3       site2          1.46     0.702      0.540  0.589 
## # … with 14 more rows
  1. Conduct backward model selection using the step function and AIC as the model selection criterion. Display the chosen model.
selected_model <- step(full_model, direction = "backward")
## Start:  AIC=651.55
## viewcat ~ site + prenumbCent + preletCent + ageCent
## 
## trying - site 
## # weights:  20 (12 variable)
## initial  value 332.710647 
## iter  10 value 323.332793
## final  value 321.476167 
## converged
## trying - prenumbCent 
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 312.747389
## iter  20 value 306.064567
## final  value 306.039662 
## converged
## trying - preletCent 
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 312.926919
## iter  20 value 302.585673
## final  value 302.570483 
## converged
## trying - ageCent 
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 313.637377
## iter  20 value 302.234054
## final  value 302.185732 
## converged
##               Df      AIC
## - ageCent     21 646.3715
## - preletCent  21 647.1410
## <none>        24 651.5471
## - prenumbCent 21 654.0793
## - site        12 666.9523
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 313.637377
## iter  20 value 302.234054
## final  value 302.185732 
## converged
## 
## Step:  AIC=646.37
## viewcat ~ site + prenumbCent + preletCent
## 
## trying - site 
## # weights:  16 (9 variable)
## initial  value 332.710647 
## iter  10 value 322.753191
## final  value 322.577181 
## converged
## trying - prenumbCent 
## # weights:  28 (18 variable)
## initial  value 332.710647 
## iter  10 value 309.307098
## iter  20 value 306.357823
## final  value 306.357651 
## converged
## trying - preletCent 
## # weights:  28 (18 variable)
## initial  value 332.710647 
## iter  10 value 304.700664
## iter  20 value 303.024907
## final  value 303.024804 
## converged
##               Df      AIC
## - preletCent  18 642.0496
## <none>        21 646.3715
## - prenumbCent 18 648.7153
## - site         9 663.1544
## # weights:  28 (18 variable)
## initial  value 332.710647 
## iter  10 value 304.700664
## iter  20 value 303.024907
## final  value 303.024804 
## converged
## 
## Step:  AIC=642.05
## viewcat ~ site + prenumbCent
## 
## trying - site 
## # weights:  12 (6 variable)
## initial  value 332.710647 
## iter  10 value 323.301870
## final  value 323.301813 
## converged
## trying - prenumbCent 
## # weights:  24 (15 variable)
## initial  value 332.710647 
## iter  10 value 308.919709
## final  value 308.675610 
## converged
##               Df      AIC
## <none>        18 642.0496
## - prenumbCent 15 647.3512
## - site         6 658.6036
tidy(selected_model, conf.int = T)
## # A tibble: 18 x 8
##    y.level term        estimate std.error statistic  p.value conf.low conf.high
##    <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 2       (Intercept)   2.00      0.420      1.66  0.0978    0.880       4.56 
##  2 2       site2         1.19      0.731      0.239 0.811     0.284       4.99 
##  3 2       site3         0.794     0.562     -0.410 0.682     0.264       2.39 
##  4 2       site4         0.231     0.562     -2.61  0.00900   0.0767      0.693
##  5 2       site5         0.520     0.712     -0.918 0.359     0.129       2.10 
##  6 2       prenumbCent   1.02      0.0219     0.952 0.341     0.978       1.07 
##  7 3       (Intercept)   2.31      0.414      2.02  0.0430    1.03        5.20 
##  8 3       site2         1.56      0.699      0.639 0.523     0.397       6.16 
##  9 3       site3         0.918     0.555     -0.154 0.878     0.309       2.72 
## 10 3       site4         0.124     0.618     -3.38  0.000732  0.0369      0.416
## 11 3       site5         0.0730    1.16      -2.25  0.0243    0.00747     0.712
## 12 3       prenumbCent   1.05      0.0221     2.23  0.0255    1.01        1.10 
## 13 4       (Intercept)   1.51      0.448      0.922 0.357     0.628       3.63 
## 14 4       site2         3.08      0.707      1.59  0.111     0.772      12.3  
## 15 4       site3         1.11      0.598      0.169 0.866     0.343       3.57 
## 16 4       site4         0.125     0.706     -2.95  0.00321   0.0314      0.499
## 17 4       site5         0.523     0.771     -0.842 0.400     0.115       2.37 
## 18 4       prenumbCent   1.07      0.0223     2.89  0.00379   1.02        1.11

Interpretation + conclusions

  1. Interpret the intercept associated with the odds of a child being in the category viewcat == 2 versus viewcat == 1.

The odds that a child who is in Site 1 and who has an average score on the numbers pretest 20.854 watches the show 1 - 2 times per week versus rarely are predicted to be round(exp(0.69501925),3)

  1. Interpret the coefficient of one numeric predictor in terms of the odds of a child being in the category viewcat == 2 versus viewcat == 1. Based on the confidence interval for the coefficient, is the numeric predictor a statistically significant predictor of viewership?

For each additional point on the numbers pretest, the odds a child watches the show 1 - 2 times per week versus rarely are expected to multiply by a factor of round(exp(0.02089701),3), holding the site constant.

The 95% confidence interval is about (-0.022116790, 0.06391081) which includes 0. Therefore, after accounting for the site, the score on the numbers pretest is not a statistically significant predictor of the odds a child is in viewer category 2 versus category 1.

  1. The primary objective of the experiment was to understand the effect of encouragement viewenc on viewership. Does encouragement have a significant effect on viewership after adjusting for other possible factors? If so, describe the effect. Otherwise, explain why not.
new_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 288.361620
## iter  20 value 281.624328
## final  value 281.602888 
## converged
tidy(new_model) %>%
  kable(digits = 3)
y.level term estimate std.error statistic p.value
2 (Intercept) 0.815 0.484 -0.421 0.674
2 site2 0.934 0.774 -0.088 0.929
2 site3 0.343 0.640 -1.670 0.095
2 site4 0.149 0.640 -2.971 0.003
2 site5 0.170 0.830 -2.136 0.033
2 prenumbCent 1.023 0.024 0.967 0.334
2 viewenc1 14.181 0.493 5.378 0.000
3 (Intercept) 1.052 0.467 0.108 0.914
3 site2 1.249 0.739 0.300 0.764
3 site3 0.415 0.629 -1.399 0.162
3 site4 0.085 0.681 -3.621 0.000
3 site5 0.025 1.235 -2.974 0.003
3 prenumbCent 1.053 0.024 2.184 0.029
3 viewenc1 11.782 0.494 4.997 0.000
4 (Intercept) 0.761 0.499 -0.547 0.584
4 site2 2.507 0.741 1.241 0.215
4 site3 0.525 0.663 -0.973 0.330
4 site4 0.089 0.753 -3.211 0.001
4 site5 0.193 0.869 -1.893 0.058
4 prenumbCent 1.069 0.023 2.831 0.005
4 viewenc1 9.886 0.501 4.575 0.000

When we look at the coefficients from encouragement levels from the odds of level 2 vs level 1 (2.6 to 2.2), we can see that after adjusting for the site and pretest score, the impact of the encouragement for watching the show a lot more reduces. So there is a greater impact for the jump from not watching the show to watching the show a little bit, but less of an impact from rarely to everyday.

Multinomial Logistic Regression pt. 2

Notes

Prediction

We need all probabilities! Predict function & observation number
Actual vs. Predicted
♥ For each observation, the predicted perceived health rating is the category with the highest predicted probability.
♥ If you have categories that are selected much more often than others, the analyzation will pick up on that. * Consider adding more predictor variables * Combine Category with categories not selected as much * Can also combine to make logistic model
♥ Check distribution of predicted variables!

Model Selection

  • Drop in deviance test \[H_0: \beta_1=\beta_2=\beta_3=0 \\ H_a: \text{at least one B_j is not equal to 0.}\] model_red <– multinom(…)
    model_full <– multinom(…)
    then anova (model_red, model_full, test = “Chisq”)
    Compare models using AIC with step function

Checking Conditions

  1. Linearity: Is there a linear relationship between the log-odds and the predictor variables?
  2. Randomness: Was the sample randomly selected? Or can we reasonably treat it as random?
  3. Independence: There is no obvious relationship between observations

Checking Linearity

  • Examine empircal logit plots between each level of the response & the quantitative predictor variables.
    mutate(x = factor|if_else…) emplogit1() Check for general linear trend
    The linearity condition is satisfied. There is a linear relationship between the empircal logit & the quantitative predictor variable

Checking Randomness

  • Was the sample randomly selected? If the sample was not randomly selected, ask whether there is reason to believe the observations in the smaple differ systemtaicallly from the population of interest.
    The randomness condition is satisfied, we do not have any reason to believe that the participants in the study differ systematically from adults in the U.S.

Checking Independence

  • Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations
    The independence condition is satisfied. It is reasonable to conclude that the participants’ health and behavior characteristics are independent of on another.

Multinomial Logistic Regression pt. 2

Log-Linear Models

Notes

  • Find mean & variance & distribution of predictor variables
  • We want percent variance (lambda) function of predictor variables x1,…,xp
  • Why is a multiple linear regression model not appropriate? </mark?>
  • lambda must be greater than or equal to 0 for any combination of predictor variables
  • constant variance assumption will be violated!

Poisson Regression

  • curved upwards
  • If observed values of Y of poisson, then we can model usingn a poisson regression model:

\[log(x_i)=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+...+\beta_px_{pi}\]

Interpreting Model Coefficients

Slope, B_j:
* Quantitative Predictor: When x_j increases by one unit, the mean of y is expected to multiply by a factor of exp{B_j}, holding all else constant. * Categorical Predictor: The mean of y for category k is expected to be exp{b_j} times the mean of y for the baseline category, holding all else constant.

Intercept, B_0:
* When all of the predictors equal 0, the mean of y is expected to be exp(B_0).
ex. number in household vs. age.
model1 <- glm(total~ageCent, data = hh, family = “poisson”)
tidy(model1, conf.int = T) %>%
kable(digits = 3)
log(mean total) = 1.303-.0047 x ageCent

Interpretations

♥For each additional year the head of the household is, we expect the mean number in the house to multiply by a factor of 0.995(exp(-0.0047)).
♥ For households with a head of the household who is 52.657 years old, we expect the mean number of people in the household to be 3.68 (exp(1.303))

Drop-in-Deviance Test

  • To compare nested models (similar to logistic regression)
  • CAN ADD QUADRATIC TERM FOR AGE

Model Assumptions

  1. Poisson Response: The response follows a poisson distribution for each level of the predictor.
  2. Independence: The observations are independent of one another.
  3. Mean = Variance: The mean value of the response equals the variance of the response for each level of the predictor.
  4. Linearity: log(lambda) is a linear function of the predictors

Poisson Response

  • Create intervals for groups (mutate)
  • ex. “The condition is satisfied based on the overall distribution of the response (from the EDA) and the distribution of the responsen by age group.”
  • Skewed right

Independence

  • Randomly selected

Mean = Variance

  • Mean & variance for each group
  • Look if they are close

Linearity

  • Look at residuals
    ♥ The raw residuals for the ith observation, y_i-predicted lambda_i, is difficult to interpret since the variable is equal to the mean in the poisson distribution
  • Instead, we can analyze a standard residual called the Pearson residual:

\[r_i=\frac{y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

Augment function: Plot of predicted mean or pearson residual
* Look for distinguishible patterns

Code

Load Packages & Data

library(tidyverse)
library(knitr)
library(broom)
library(nnet)

sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")
# mean-center relevant continuous variables, make categorical variables factors
sesame <- sesame %>% 
  mutate(viewcat = factor(viewcat), 
         site = factor(site), 
         prenumbCent = prenumb - mean(prenumb), 
         preletCent = prelet - mean(prelet), 
         ageCent = age - mean(age),
         viewenc = ifelse(viewenc == 1, "1", "0"))

Part 1. Last time we fit a multinomial logistic model using viewenc, prenumbCent, and site to predict how frequently a child viewed Sesame Street (viewcat).

view_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)
## # weights:  32 (21 variable)
## initial  value 332.710647 
## iter  10 value 288.361620
## iter  20 value 281.624328
## final  value 281.602888 
## converged
tidy(view_model) %>%
  kable(digits = 3)
y.level term estimate std.error statistic p.value
2 (Intercept) 0.815 0.484 -0.421 0.674
2 site2 0.934 0.774 -0.088 0.929
2 site3 0.343 0.640 -1.670 0.095
2 site4 0.149 0.640 -2.971 0.003
2 site5 0.170 0.830 -2.136 0.033
2 prenumbCent 1.023 0.024 0.967 0.334
2 viewenc1 14.181 0.493 5.378 0.000
3 (Intercept) 1.052 0.467 0.108 0.914
3 site2 1.249 0.739 0.300 0.764
3 site3 0.415 0.629 -1.399 0.162
3 site4 0.085 0.681 -3.621 0.000
3 site5 0.025 1.235 -2.974 0.003
3 prenumbCent 1.053 0.024 2.184 0.029
3 viewenc1 11.782 0.494 4.997 0.000
4 (Intercept) 0.761 0.499 -0.547 0.584
4 site2 2.507 0.741 1.241 0.215
4 site3 0.525 0.663 -0.973 0.330
4 site4 0.089 0.753 -3.211 0.001
4 site5 0.193 0.869 -1.893 0.058
4 prenumbCent 1.069 0.023 2.831 0.005
4 viewenc1 9.886 0.501 4.575 0.000

Let’s get the predicted view category using the augment function .

sesame <- sesame %>%
  mutate(pred_view = predict(view_model, type = "class"))

Make a table to view the actual vs. predicted view categories. How well did the model perform?

sesame %>%
  count(viewcat, pred_view) %>%
  pivot_wider(names_from = pred_view, values_from = n)
## # A tibble: 4 x 5
##   viewcat   `1`   `2`   `3`   `4`
##   <fct>   <int> <int> <int> <int>
## 1 1          37     9     5     3
## 2 2           8    24    18    10
## 3 3           7    15    21    21
## 4 4           5    10    20    27
(37 + 24 + 21 + 27) / nrow(sesame)
## [1] 0.4541667

The numbers on the diagonal are the ones we get correct. The most errors occur in groups 2, 3 and 4. This model is good at picking out the extremes, but not the middle. Could group this differently so that it is people who watch every day vs. not every day.

Part 2: Log-linear models

The data come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

The variables in the data are

  • age: the age of the head of household
  • total: the number of people in the household other than the head
  • location: where the house is located (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas)
  • numLT5: the number in the household under 5 years of age
  • roof: the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material, where stronger material can sometimes be used as a proxy for greater wealth)
hh <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/fHH1.csv") %>%
  filter(age < 95) %>%
  mutate(ageCent = age - mean(age))

We fit the following model:

hh_model <- glm(total ~ ageCent + I(ageCent^2),
              data = hh, family = "poisson")
tidy(hh_model) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 1.436 0.017 82.339 0
ageCent -0.004 0.001 -3.584 0
I(ageCent^2) -0.001 0.000 -10.938 0

1. Interpret the coefficient of ageCent^2 in the context of the data.

Interpreting quadratic terms: talking about the entire effect - choose values of x to change between and report the change for those values.

How response changes between Q1 and Q3 of x. Q1 = 42 and Q3 = 63.

When the age of the head of the household changes from 42 to 63 the log (mean number in household) is expected to change by -0.004(63-42) + -0.001(63^2 - 42^2).

When the household age changes from 42 to 63 years old, the mean number in the household is expected to multiply by a factor of exp(-0.004(63-42) + -0.001(63^2 - 42^2))

2. Conduct a test to assess whether location is a useful predictor of the number of people in the household after accounting for age of the head of the household.