In casino, we are interested in the outcome of a dice or a flip of a coin. The study and model to explore the Binary outcomes is Logistic Regression.
In Clinical Trials we use factors to predict Responder and Non-Responders
If we let the outcome of probability of success be p (head), then the none success is 1-p ,(tail).
One predictor \[logit[P(x)]=\frac{log(P(x)}{1???P(x))}= ax+b\]
Mutliple Predictors: \[logit[P(x)]=\frac{log(P(xi)}{1???P(xi))}= a1x1+a2x2+a3x3+a4x4+...b\] When a>0 , then p(x) and the log odds increases as x increase
when a<0, then p(x) and the log odds decrease as x increase
Binary Response Variable For Logistic Model: Better : Improved, None
If the variable Improved is is “Some” or “Marked”, we consider the patients’ Arthritis symptom is Improved If the variable Improved is “None”, we consider the patients Arthritis problem remains.
Goal: We want to know how the factors affect the outcome of the Arthritis Symptom: Treatment, Sex, Age :
Treatment (Categorical Data): Treated, Placebo Sex (categorical Data) : Male, Female Age (Continuous Data) : 18-85
data<- Arthritis
#Treatment
data$better <- as.numeric(data$Improved >"None")
data[35:45, ]## ID Treatment Sex Age Improved better
## 35 58 Treated Female 66 Marked 1
## 36 13 Treated Female 67 Marked 1
## 37 61 Treated Female 68 Some 1
## 38 65 Treated Female 68 Marked 1
## 39 11 Treated Female 69 None 0
## 40 56 Treated Female 69 Some 1
## 41 43 Treated Female 70 Some 1
## 42 9 Placebo Male 37 None 0
## 43 14 Placebo Male 44 None 0
## 44 73 Placebo Male 50 None 0
## 45 74 Placebo Male 51 None 0
We want to investigate the relationship between Age and Treatment Outcome variable: Better. As the graph and statistical model suggested: when Age gets Older, Arthritis is harder to cure, and Age is quite significant Factor in the treatment of Arthritis, as p-value=0.011
#Age
artha.log <-glm(better ~Age, data=data, family=binomial)
artha.log##
## Call: glm(formula = better ~ Age, family = binomial, data = data)
##
## Coefficients:
## (Intercept) Age
## -2.64207 0.04925
##
## Degrees of Freedom: 83 Total (i.e. Null); 82 Residual
## Null Deviance: 116.4
## Residual Deviance: 109.2 AIC: 113.2
summary(artha.log)##
## Call:
## glm(formula = better ~ Age, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.51065 -1.12772 0.07936 1.06771 1.76109
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64207 1.07317 -2.462 0.0138 *
## Age 0.04925 0.01936 2.544 0.0110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.45 on 83 degrees of freedom
## Residual deviance: 109.16 on 82 degrees of freedom
## AIC: 113.16
##
## Number of Fisher Scoring iterations: 4
xyvalues <- seq(15, 85, 5)
(pred.logistic <- predict(artha.log, newdata= data.frame(Age= xyvalues), type="response", se.fit= TRUE))## $fit
## 1 2 3 4 5 6 7
## 0.1297314 0.1601529 0.1961007 0.2378324 0.2852931 0.3380245 0.3951142
## 8 9 10 11 12 13 14
## 0.4552154 0.5166502 0.5775858 0.6362479 0.6911201 0.7410832 0.7854731
## 15
## 0.8240594
##
## $se.fit
## 1 2 3 4 5 6
## 0.08940039 0.09411439 0.09601029 0.09435746 0.08881876 0.07982508
## 7 8 9 10 11 12
## 0.06906736 0.05998089 0.05706462 0.06204566 0.07165549 0.08150700
## 13 14 15
## 0.08885431 0.09257500 0.09260412
##
## $residual.scale
## [1] 1
(upper <- pred.logistic$fit + 1.96 * pred.logistic$se.fit)## 1 2 3 4 5 6 7
## 0.3049561 0.3446171 0.3842809 0.4227730 0.4593779 0.4944817 0.5304862
## 8 9 10 11 12 13 14
## 0.5727779 0.6284969 0.6991953 0.7766927 0.8508738 0.9152376 0.9669201
## 15
## 1.0055635
(lower <- pred.logistic$fit - 1.96 * pred.logistic$se.fit)## 1 2 3 4 5
## -0.045493382 -0.024311304 0.007920551 0.052891734 0.111208373
## 6 7 8 9 10
## 0.181567360 0.259742159 0.337652864 0.404803537 0.455976300
## 11 12 13 14 15
## 0.495803156 0.531366350 0.566928749 0.604026145 0.642555364
plot(jitter(better, .1) ~ Age, data= data, xlim = c(15,85), pch= 16, ylab= "Probability (Btter)", main="Probability of Arthritis Response respect to Age")
polygon(c(xyvalues, rev(xyvalues)), c(upper, rev(lower)), col= rgb(0, 0, 1, .2), border= NA)
lines(xyvalues, pred.logistic$fit, lwd= 4, col= "pink")As the visualization and statistical Model showed: Treatment is an extremely important factor, as the p-value=0.0013. If the patients in Treated Group, there is a much higher probability to get better in Arthritis.
#Treatment
arth.log <-glm(better ~Treatment, data=data, family=binomial)
arth.log##
## Call: glm(formula = better ~ Treatment, family = binomial, data = data)
##
## Coefficients:
## (Intercept) TreatmentTreated
## -0.7282 1.4955
##
## Degrees of Freedom: 83 Total (i.e. Null); 82 Residual
## Null Deviance: 116.4
## Residual Deviance: 105.5 AIC: 109.5
summary(arth.log)##
## Call:
## glm(formula = better ~ Treatment, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.51567 -0.88759 -0.00712 0.87335 1.49809
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7282 0.3254 -2.238 0.02524 *
## TreatmentTreated 1.4955 0.4675 3.199 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.45 on 83 degrees of freedom
## Residual deviance: 105.49 on 82 degrees of freedom
## AIC: 109.49
##
## Number of Fisher Scoring iterations: 4
rtreat <- ggplot(data, aes(Age, better, color=Treatment)) + xlim(5, 95) + theme_bw() + geom_point(position = position_jitter(height = 0.02, width = 0)) + geom_smooth(method = "glm", method.args = list(family = "binomial"),alpha= 0.2, aes(fill=Treatment), size= 2.5, fullrange = TRUE)
rtreatThe male patients have a higher probability to cure from Arthritis, as p-value=0.035.
#Sex
arths.log <-glm(better ~Sex, data=data, family=binomial)
arths.log##
## Call: glm(formula = better ~ Sex, family = binomial, data = data)
##
## Coefficients:
## (Intercept) SexMale
## 0.3075 -1.0613
##
## Degrees of Freedom: 83 Total (i.e. Null); 82 Residual
## Null Deviance: 116.4
## Residual Deviance: 111.8 AIC: 115.8
summary(arths.log)##
## Call:
## glm(formula = better ~ Sex, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.31047 -1.31047 0.08584 1.04993 1.50959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3075 0.2635 1.167 0.243
## SexMale -1.0613 0.5032 -2.109 0.035 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.45 on 83 degrees of freedom
## Residual deviance: 111.76 on 82 degrees of freedom
## AIC: 115.76
##
## Number of Fisher Scoring iterations: 4
lsex<- rtreat+facet_wrap(~Sex)
lsex## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Logistic Regression is designed for predict the outcome of Categorical Variable
Advantages : the output is informative, “Yes” or “No” .
handle non-linear effect
Limitations: Logistic Regression Assume all the variables are independent. Logistic Regression Can not predict continuous outcome, unless convert the data to groups (age:65-75, >75)
Future Work: Mixed Effect, such as Sex*Treatment ?