\[log(\frac{\hat{\pi}}{1-\hat{\pi}})=\hat{\beta}+\hat{beta_1x_i} \\ exp(log(\frac{\hat{\pi}}{1-\hat{\pi}})) = exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \frac{\hat{\pi_i}}{1-\hat{\pi_i}}=exp(\hat{\beta}+\hat{\beta_1x_i}) \\ \hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\]
\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1}x_i)}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}=\frac{exp(\hat{log-odds}}{1+exp(\hat{log-odds})}\]
♥ We want: HIGH sensitivity and LOW values of 1 - specificity ♥
library(tidyverse)
library(knitr)
library(broom)
library(yardstick)
leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv")
Age, Index and Temp to predict the odds a patient will respond to the treatment.leukemia <- leukemia %>%
mutate(Resp = factor(Resp))
model <- glm(Resp ~ Age + Temp + Index, data = leukemia, family = "binomial")
tidy(model)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 87.4 35.5 2.46 0.0137
## 2 Age -0.0585 0.0256 -2.29 0.0222
## 3 Temp -0.0890 0.0361 -2.47 0.0136
## 4 Index 0.385 0.122 3.17 0.00154
model_aug <- augment(model, type.predict = "response")
model_aug %>%
slice(1:10)
## # A tibble: 10 x 11
## Resp Age Temp Index .fitted .se.fit .resid .hat .sigma .cooksd
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 20 990 7 0.696 0.165 0.851 0.128 0.960 1.84e-2
## 2 1 25 1030 16 0.609 0.267 0.996 0.300 0.954 9.86e-2
## 3 1 26 982 12 0.957 0.0422 0.295 0.0438 0.969 5.32e-4
## 4 1 26 1000 16 0.955 0.0435 0.304 0.0440 0.969 5.68e-4
## 5 1 27 980 6 0.716 0.154 0.818 0.117 0.961 1.48e-2
## 6 0 27 1010 8 0.274 0.168 -0.800 0.142 0.961 1.82e-2
## 7 1 28 986 20 0.997 0.00557 0.0809 0.00954 0.970 7.97e-6
## 8 1 28 1010 14 0.782 0.130 0.702 0.0984 0.964 8.45e-3
## 9 1 31 988 5 0.400 0.160 1.35 0.106 0.947 5.00e-2
## 10 0 33 986 7 0.605 0.138 -1.36 0.0794 0.947 3.58e-2
## # … with 1 more variable: .std.resid <dbl>
model_aug <- model_aug %>%
mutate(pred_resp = if_else(.fitted > 0.5, "Will Respond", "Will Not Respond"))
model_aug %>%
count(Resp, pred_resp)
## # A tibble: 4 x 3
## Resp pred_resp n
## <fct> <chr> <int>
## 1 0 Will Not Respond 20
## 2 0 Will Respond 7
## 3 1 Will Not Respond 5
## 4 1 Will Respond 19
P(Predict Will respond | Resp = 1) =
19 / (5 + 19)
## [1] 0.7916667
P(Predict Will Not Respond | Resp = 0) =
20 / (20 + 7)
## [1] 0.7407407
12/51
## [1] 0.2352941
model_aug <- model_aug %>%
mutate(Resp = fct_relevel(Resp, c("1", "0")))
roc_curve_data <- model_aug %>%
roc_curve(Resp, .fitted)
autoplot(roc_curve_data)
ggplot(data = model_aug, aes(x = .fitted)) +
geom_dotplot()
model_aug %>%
roc_auc(Resp, .fitted) %>%
pull(.estimate)
## [1] 0.8780864
It is a good fit because it is close to 1.
\[logit(\hat{p})=log(\frac{\hat{p}}{1-\hat{p}})=log(\frac{YES}{NO})\]
♥ If the predictor is categorical, we can calculate the empirical logit for each level of the predictor, find linearity
♥ Linearity condition satisfied if there is a relationship between the empirical logit & the predictor variables
♥ Randomnes condition satisfied if we do not have a reason to believe that the participants in the study differ systematically from adults in the U.S. in regards to health characteristics and risk of heart disease
♥ Independence condition satisfied if it is reasonable to conclude that the participants health characteristics we independent of one another.
library(tidyverse)
library(knitr)
library(broom)
library(yardstick)
leukemia <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/leukemia.csv") %>%
mutate(Resp = factor(Resp))
resp_model <- glm(Resp ~ Age + Index + Temp, data = leukemia,
family = "binomial")
tidy(resp_model) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 87.388 | 35.458 | 2.465 | 0.014 |
| Age | -0.059 | 0.026 | -2.287 | 0.022 |
| Index | 0.385 | 0.122 | 3.168 | 0.002 |
| Temp | -0.089 | 0.036 | -2.467 | 0.014 |
#in probabilities instead of log odds
resp_aug <- augment(resp_model, type.predict = "response") %>%
mutate(Resp = fct_relevel(Resp, c("1", "0")))
resp_aug %>%
select(.fitted) %>%
slice(1:5)
## # A tibble: 5 x 1
## .fitted
## <dbl>
## 1 0.696
## 2 0.609
## 3 0.957
## 4 0.955
## 5 0.716
# calculate sensitivity and specificity at each threshold
roc_curve_data <- resp_aug %>%
roc_curve(Resp, .fitted) %>%
mutate(false_rate = 1 - specificity)
roc_curve_data %>%
filter(.threshold > .39, .threshold < .40)
## # A tibble: 1 x 4
## .threshold specificity sensitivity false_rate
## <dbl> <dbl> <dbl> <dbl>
## 1 0.400 0.741 0.917 0.259
autoplot(roc_curve_data)
The threshold is the highest combined value of sensitivity and 1-specificity to maximize both components. This gives us the highest probability of getting the most true positives and the most true negatives. This is 0.399.
library(Stat2Data)
emplogitplot1(Resp ~ Age, data = leukemia, ngroups = 5)
emplogitplot1(Resp ~ Index, data = leukemia, ngroups = 5)
emplogitplot1(Resp ~ Temp, data = leukemia, ngroups = 5)
The linearity condition is satisfied - The empirical logit plots all show a linear relationship between the log-odds and the predictor variables.
Randomness condition is satisfied - we have no reason to believe that the people in this study are systematically different from the population of people with leukemia.
Independence condition is satisfied - The health and response of one person in the study is not related to the health and response of another person.
\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=\hat{\beta_0}+\hat{\beta_1}x_i\] * We can calculate predicted probability by solving the logit equation:
\[\hat{\pi_i}=\frac{exp(\hat{\beta}+\hat{\beta_1x_i})}{1+exp(\hat{\beta}+\hat{\beta_1x_i})}\] Binary Response (Logistic) y = 0 the baseline category then logistic regression model is:
\[log(\frac{\hat{\pi_i}}{1-\hat{\pi_i}})=log(\frac{\hat{\pi_{i1}}}{\hat{\pi_{i0}}})=\hat{\beta_0}+\hat{\beta_1}x_i\]
\[p(y=1)=\pi,p(y=2)=\pi_2..., p(y=k)=\pi_k\] Such that the sum of k = 1 to k of the porbabilities equals 1.
\[log(\frac{\pi_{ik}}{\pi_{i1}}) = \beta_{0k}+\beta_{1k}x_i\] * Has linear relation with all predictor variables
* In the multinomial logistic model, we have a separate equation for each category of the response relative to the baseline category. If the response has k possible categories, there will be k-1 equations as part of the multinomial logistic model * ex. A, B, or C has outcomes A is baseline
\[log(\frac{\pi_iB}{\pi_iA})=\beta_{0B}+\beta_{1B}x_i \\ log(\frac{\pi_iC}{\pi_iA})=\beta_{0C}+\beta_{jC}x_i\] Have separate coefficients for the model: we need ALL to calculate probabilities
Use multinom() in library(nnet), exponentiate = FALSE for log odds
ex. \[log(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}) = 0.915 + 0.003age - 1.645PhysActive\]
♥ For each additional year in age, the odds a person rates themselves as having fair health vs. excellent health are expected to multiply by 1.003(exp(0.003)), holding physical activity constant.
♥ The odds a person who does physical activity will rate themselves as having fair health vs. excellent health are expected to be 0.193 (exp(-1.645)) time sthe odds for a person who doesn’t do physical activity, holding age constant.
♥ The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)). NEED TO MEAN CENTER AGE.
\[H_0: \beta_{jk}=0 \text{ vs. } H_a: \beta_{jk} \neq 0 \\ \text{Test-Statistic: } z=\frac{\hat{\beta}_{jk}-0}{SE(\hat{\beta}_{jk})} \\ \text{p-value = } P(|Z|>|z|), \text{where Z ~ N(0,1), the standard Normal distribution}\]
\[\hat{\beta}_{jk} \pm Z^*SE(\hat{\beta}_{jk}) \\ \text{Where Z is calculated from N(0,1) distribution}\]
♥ We are C% confident that for every unit change in x_j, the odds of y = k vs. baseline y = 1 are expected to multiply by a factor of \(exp(\hat{\beta}_{jk} - Z^*SE(\hat{\beta}_{jk})\) to \(exp(\hat{\beta}_{jk} + Z^*SE(\hat{\beta}_{jk}))\), holding all else constant. If confidence interval not include 0 and p-value less than 0.05, strongn evidence
library(tidyverse)
library(knitr)
library(broom)
library(nnet)
sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")
As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.
The show is only effective if children watch it, so we want to understand what effect the encouragement had on the frequency children watched the program.
Response:
viewcatPredictors:
age: child’s age in monthsprenumb: score on numbers pretest (0 to 54)prelet: score on letters pretest (0 to 58)viewenc: 1: encouraged to watch, 0: not encouragedsite:
# mean-center relevant continuous variables, make categorical variables factors
sesame <- sesame %>%
mutate(viewcat = factor(viewcat),
site = factor(site),
prenumbCent = prenumb - mean(prenumb),
preletCent = prelet - mean(prelet),
ageCent = age - mean(age),
viewenc = if_else(viewenc == 1, "1", "0"))
viewenc, the primary variable of interest in this analysis. What do you observe from the plot?ggplot(sesame, aes(x = viewenc, fill = viewcat)) +
geom_bar(position = "fill")
ggplot(data = sesame, aes(x = prenumb, y = viewcat)) +
geom_boxplot()
viewenc.full_model <- multinom(viewcat ~ site + prenumbCent + preletCent + ageCent, data = sesame)
## # weights: 36 (24 variable)
## initial value 332.710647
## iter 10 value 318.011845
## iter 20 value 301.917674
## final value 301.773530
## converged
tidy(full_model)
## # A tibble: 24 x 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 (Intercept) 1.97 0.423 1.60 0.110
## 2 2 site2 1.16 0.732 0.199 0.842
## 3 2 site3 0.878 0.580 -0.224 0.823
## 4 2 site4 0.243 0.572 -2.47 0.0134
## 5 2 site5 0.554 0.717 -0.824 0.410
## 6 2 prenumbCent 1.05 0.0324 1.42 0.155
## 7 2 preletCent 0.973 0.0378 -0.733 0.464
## 8 2 ageCent 0.969 0.0371 -0.849 0.396
## 9 3 (Intercept) 2.31 0.416 2.01 0.0444
## 10 3 site2 1.46 0.702 0.540 0.589
## # … with 14 more rows
step function and AIC as the model selection criterion. Display the chosen model.selected_model <- step(full_model, direction = "backward")
## Start: AIC=651.55
## viewcat ~ site + prenumbCent + preletCent + ageCent
##
## trying - site
## # weights: 20 (12 variable)
## initial value 332.710647
## iter 10 value 323.332793
## final value 321.476167
## converged
## trying - prenumbCent
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 312.747389
## iter 20 value 306.064567
## final value 306.039662
## converged
## trying - preletCent
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 312.926919
## iter 20 value 302.585673
## final value 302.570483
## converged
## trying - ageCent
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 313.637377
## iter 20 value 302.234054
## final value 302.185732
## converged
## Df AIC
## - ageCent 21 646.3715
## - preletCent 21 647.1410
## <none> 24 651.5471
## - prenumbCent 21 654.0793
## - site 12 666.9523
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 313.637377
## iter 20 value 302.234054
## final value 302.185732
## converged
##
## Step: AIC=646.37
## viewcat ~ site + prenumbCent + preletCent
##
## trying - site
## # weights: 16 (9 variable)
## initial value 332.710647
## iter 10 value 322.753191
## final value 322.577181
## converged
## trying - prenumbCent
## # weights: 28 (18 variable)
## initial value 332.710647
## iter 10 value 309.307098
## iter 20 value 306.357823
## final value 306.357651
## converged
## trying - preletCent
## # weights: 28 (18 variable)
## initial value 332.710647
## iter 10 value 304.700664
## iter 20 value 303.024907
## final value 303.024804
## converged
## Df AIC
## - preletCent 18 642.0496
## <none> 21 646.3715
## - prenumbCent 18 648.7153
## - site 9 663.1544
## # weights: 28 (18 variable)
## initial value 332.710647
## iter 10 value 304.700664
## iter 20 value 303.024907
## final value 303.024804
## converged
##
## Step: AIC=642.05
## viewcat ~ site + prenumbCent
##
## trying - site
## # weights: 12 (6 variable)
## initial value 332.710647
## iter 10 value 323.301870
## final value 323.301813
## converged
## trying - prenumbCent
## # weights: 24 (15 variable)
## initial value 332.710647
## iter 10 value 308.919709
## final value 308.675610
## converged
## Df AIC
## <none> 18 642.0496
## - prenumbCent 15 647.3512
## - site 6 658.6036
tidy(selected_model, conf.int = T)
## # A tibble: 18 x 8
## y.level term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 (Intercept) 2.00 0.420 1.66 0.0978 0.880 4.56
## 2 2 site2 1.19 0.731 0.239 0.811 0.284 4.99
## 3 2 site3 0.794 0.562 -0.410 0.682 0.264 2.39
## 4 2 site4 0.231 0.562 -2.61 0.00900 0.0767 0.693
## 5 2 site5 0.520 0.712 -0.918 0.359 0.129 2.10
## 6 2 prenumbCent 1.02 0.0219 0.952 0.341 0.978 1.07
## 7 3 (Intercept) 2.31 0.414 2.02 0.0430 1.03 5.20
## 8 3 site2 1.56 0.699 0.639 0.523 0.397 6.16
## 9 3 site3 0.918 0.555 -0.154 0.878 0.309 2.72
## 10 3 site4 0.124 0.618 -3.38 0.000732 0.0369 0.416
## 11 3 site5 0.0730 1.16 -2.25 0.0243 0.00747 0.712
## 12 3 prenumbCent 1.05 0.0221 2.23 0.0255 1.01 1.10
## 13 4 (Intercept) 1.51 0.448 0.922 0.357 0.628 3.63
## 14 4 site2 3.08 0.707 1.59 0.111 0.772 12.3
## 15 4 site3 1.11 0.598 0.169 0.866 0.343 3.57
## 16 4 site4 0.125 0.706 -2.95 0.00321 0.0314 0.499
## 17 4 site5 0.523 0.771 -0.842 0.400 0.115 2.37
## 18 4 prenumbCent 1.07 0.0223 2.89 0.00379 1.02 1.11
viewcat == 2 versus viewcat == 1.The odds that a child who is in Site 1 and who has an average score on the numbers pretest 20.854 watches the show 1 - 2 times per week versus rarely are predicted to be round(exp(0.69501925),3)
viewcat == 2 versus viewcat == 1. Based on the confidence interval for the coefficient, is the numeric predictor a statistically significant predictor of viewership?For each additional point on the numbers pretest, the odds a child watches the show 1 - 2 times per week versus rarely are expected to multiply by a factor of round(exp(0.02089701),3), holding the site constant.
The 95% confidence interval is about (-0.022116790, 0.06391081) which includes 0. Therefore, after accounting for the site, the score on the numbers pretest is not a statistically significant predictor of the odds a child is in viewer category 2 versus category 1.
viewenc on viewership. Does encouragement have a significant effect on viewership after adjusting for other possible factors? If so, describe the effect. Otherwise, explain why not.new_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 288.361620
## iter 20 value 281.624328
## final value 281.602888
## converged
tidy(new_model) %>%
kable(digits = 3)
| y.level | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| 2 | (Intercept) | 0.815 | 0.484 | -0.421 | 0.674 |
| 2 | site2 | 0.934 | 0.774 | -0.088 | 0.929 |
| 2 | site3 | 0.343 | 0.640 | -1.670 | 0.095 |
| 2 | site4 | 0.149 | 0.640 | -2.971 | 0.003 |
| 2 | site5 | 0.170 | 0.830 | -2.136 | 0.033 |
| 2 | prenumbCent | 1.023 | 0.024 | 0.967 | 0.334 |
| 2 | viewenc1 | 14.181 | 0.493 | 5.378 | 0.000 |
| 3 | (Intercept) | 1.052 | 0.467 | 0.108 | 0.914 |
| 3 | site2 | 1.249 | 0.739 | 0.300 | 0.764 |
| 3 | site3 | 0.415 | 0.629 | -1.399 | 0.162 |
| 3 | site4 | 0.085 | 0.681 | -3.621 | 0.000 |
| 3 | site5 | 0.025 | 1.235 | -2.974 | 0.003 |
| 3 | prenumbCent | 1.053 | 0.024 | 2.184 | 0.029 |
| 3 | viewenc1 | 11.782 | 0.494 | 4.997 | 0.000 |
| 4 | (Intercept) | 0.761 | 0.499 | -0.547 | 0.584 |
| 4 | site2 | 2.507 | 0.741 | 1.241 | 0.215 |
| 4 | site3 | 0.525 | 0.663 | -0.973 | 0.330 |
| 4 | site4 | 0.089 | 0.753 | -3.211 | 0.001 |
| 4 | site5 | 0.193 | 0.869 | -1.893 | 0.058 |
| 4 | prenumbCent | 1.069 | 0.023 | 2.831 | 0.005 |
| 4 | viewenc1 | 9.886 | 0.501 | 4.575 | 0.000 |
When we look at the coefficients from encouragement levels from the odds of level 2 vs level 1 (2.6 to 2.2), we can see that after adjusting for the site and pretest score, the impact of the encouragement for watching the show a lot more reduces. So there is a greater impact for the jump from not watching the show to watching the show a little bit, but less of an impact from rarely to everyday.
We need all probabilities! Predict function & observation number
Actual vs. Predicted
♥ For each observation, the predicted perceived health rating is the category with the highest predicted probability.
♥ If you have categories that are selected much more often than others, the analyzation will pick up on that. * Consider adding more predictor variables * Combine Category with categories not selected as much * Can also combine to make logistic model
♥ Check distribution of predicted variables!
\[log(x_i)=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+...+\beta_px_{pi}\]
Slope, B_j:
* Quantitative Predictor: When x_j increases by one unit, the mean of y is expected to multiply by a factor of exp{B_j}, holding all else constant. * Categorical Predictor: The mean of y for category k is expected to be exp{b_j} times the mean of y for the baseline category, holding all else constant.
Intercept, B_0:
* When all of the predictors equal 0, the mean of y is expected to be exp(B_0).
ex. number in household vs. age.
model1 <- glm(total~ageCent, data = hh, family = “poisson”)
tidy(model1, conf.int = T) %>%
kable(digits = 3)
log(mean total) = 1.303-.0047 x ageCent
♥For each additional year the head of the household is, we expect the mean number in the house to multiply by a factor of 0.995(exp(-0.0047)).
♥ For households with a head of the household who is 52.657 years old, we expect the mean number of people in the household to be 3.68 (exp(1.303))
\[r_i=\frac{y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]
Augment function: Plot of predicted mean or pearson residual
* Look for distinguishible patterns
library(tidyverse)
library(knitr)
library(broom)
library(nnet)
sesame <- read_csv("https://raw.githubusercontent.com/sta210-fa19/website/master/docs/appex/data/sesame.csv")
# mean-center relevant continuous variables, make categorical variables factors
sesame <- sesame %>%
mutate(viewcat = factor(viewcat),
site = factor(site),
prenumbCent = prenumb - mean(prenumb),
preletCent = prelet - mean(prelet),
ageCent = age - mean(age),
viewenc = ifelse(viewenc == 1, "1", "0"))
viewenc, prenumbCent, and site to predict how frequently a child viewed Sesame Street (viewcat).view_model <- multinom(viewcat ~ site + prenumbCent + viewenc, data = sesame)
## # weights: 32 (21 variable)
## initial value 332.710647
## iter 10 value 288.361620
## iter 20 value 281.624328
## final value 281.602888
## converged
tidy(view_model) %>%
kable(digits = 3)
| y.level | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| 2 | (Intercept) | 0.815 | 0.484 | -0.421 | 0.674 |
| 2 | site2 | 0.934 | 0.774 | -0.088 | 0.929 |
| 2 | site3 | 0.343 | 0.640 | -1.670 | 0.095 |
| 2 | site4 | 0.149 | 0.640 | -2.971 | 0.003 |
| 2 | site5 | 0.170 | 0.830 | -2.136 | 0.033 |
| 2 | prenumbCent | 1.023 | 0.024 | 0.967 | 0.334 |
| 2 | viewenc1 | 14.181 | 0.493 | 5.378 | 0.000 |
| 3 | (Intercept) | 1.052 | 0.467 | 0.108 | 0.914 |
| 3 | site2 | 1.249 | 0.739 | 0.300 | 0.764 |
| 3 | site3 | 0.415 | 0.629 | -1.399 | 0.162 |
| 3 | site4 | 0.085 | 0.681 | -3.621 | 0.000 |
| 3 | site5 | 0.025 | 1.235 | -2.974 | 0.003 |
| 3 | prenumbCent | 1.053 | 0.024 | 2.184 | 0.029 |
| 3 | viewenc1 | 11.782 | 0.494 | 4.997 | 0.000 |
| 4 | (Intercept) | 0.761 | 0.499 | -0.547 | 0.584 |
| 4 | site2 | 2.507 | 0.741 | 1.241 | 0.215 |
| 4 | site3 | 0.525 | 0.663 | -0.973 | 0.330 |
| 4 | site4 | 0.089 | 0.753 | -3.211 | 0.001 |
| 4 | site5 | 0.193 | 0.869 | -1.893 | 0.058 |
| 4 | prenumbCent | 1.069 | 0.023 | 2.831 | 0.005 |
| 4 | viewenc1 | 9.886 | 0.501 | 4.575 | 0.000 |
Let’s get the predicted view category using the augment function .
sesame <- sesame %>%
mutate(pred_view = predict(view_model, type = "class"))
Make a table to view the actual vs. predicted view categories. How well did the model perform?
sesame %>%
count(viewcat, pred_view) %>%
pivot_wider(names_from = pred_view, values_from = n)
## # A tibble: 4 x 5
## viewcat `1` `2` `3` `4`
## <fct> <int> <int> <int> <int>
## 1 1 37 9 5 3
## 2 2 8 24 18 10
## 3 3 7 15 21 21
## 4 4 5 10 20 27
(37 + 24 + 21 + 27) / nrow(sesame)
## [1] 0.4541667
The numbers on the diagonal are the ones we get correct. The most errors occur in groups 2, 3 and 4. This model is good at picking out the extremes, but not the middle. Could group this differently so that it is people who watch every day vs. not every day.
The data come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.
The variables in the data are
age: the age of the head of householdtotal: the number of people in the household other than the headlocation: where the house is located (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas)numLT5: the number in the household under 5 years of ageroof: the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material, where stronger material can sometimes be used as a proxy for greater wealth)hh <- read_csv("/Users/courtneylee/Documents/STAT210/Quiz 4/fHH1.csv") %>%
filter(age < 95) %>%
mutate(ageCent = age - mean(age))
We fit the following model:
hh_model <- glm(total ~ ageCent + I(ageCent^2),
data = hh, family = "poisson")
tidy(hh_model) %>%
kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.436 | 0.017 | 82.339 | 0 |
| ageCent | -0.004 | 0.001 | -3.584 | 0 |
| I(ageCent^2) | -0.001 | 0.000 | -10.938 | 0 |
ageCent^2 in the context of the data.Interpreting quadratic terms: talking about the entire effect - choose values of x to change between and report the change for those values.
How response changes between Q1 and Q3 of x. Q1 = 42 and Q3 = 63.
When the age of the head of the household changes from 42 to 63 the log (mean number in household) is expected to change by -0.004(63-42) + -0.001(63^2 - 42^2).
When the household age changes from 42 to 63 years old, the mean number in the household is expected to multiply by a factor of exp(-0.004(63-42) + -0.001(63^2 - 42^2))
location is a useful predictor of the number of people in the household after accounting for age of the head of the household.