0.1 Background Info

In a casino, we are interested in the outcome of a dice roll or a coin flip. Logistic regression is the model used to study such binary outcomes.

0.1.0.1 Application:

In clinical trials, we use patient factors to predict responders and non-responders.

If we let the probability of success (heads) be p, then the probability of failure (tails) is 1 - p.

0.1.0.2 Formula:

One predictor: \[\mathrm{logit}[P(x)]=\log\left(\frac{P(x)}{1-P(x)}\right)= ax+b\]

Multiple predictors: \[\mathrm{logit}[P(x)]=\log\left(\frac{P(x)}{1-P(x)}\right)= a_1x_1+a_2x_2+a_3x_3+a_4x_4+\dots+b\] When a > 0, P(x) and the log odds increase as x increases.

When a < 0, P(x) and the log odds decrease as x increases.
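To make the link concrete, here is a minimal R sketch (not part of the original analysis) of the logit transform and its inverse, which maps any linear predictor ax + b back to a probability between 0 and 1:

# logit: probability -> log odds; the inverse maps log odds back to a probability
logit <- function(p) log(p / (1 - p))
inv_logit <- function(z) 1 / (1 + exp(-z))   # equivalent to plogis(z)

logit(0.5)        # 0: even odds
inv_logit(2.5)    # about 0.92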

0.2 Arthritis Data Processing

Binary response variable for the logistic model: better (1 = improved, 0 = none).

If the variable Improved is "Some" or "Marked", we consider the patient's arthritis symptoms improved (better = 1). If Improved is "None", we consider the arthritis problem to remain (better = 0).

Goal: we want to know how the following factors affect the arthritis outcome: Treatment, Sex, and Age.

Treatment (categorical): Treated, Placebo
Sex (categorical): Male, Female
Age (continuous): 18-85

library(vcd)   # the Arthritis data set comes from the vcd package
data <- Arthritis

# Recode Improved into a binary response: "None" -> 0, "Some"/"Marked" -> 1
# (the comparison works because Improved is an ordered factor)
data$better <- as.numeric(data$Improved > "None")

data[35:45, ]
##    ID Treatment    Sex Age Improved better
## 35 58   Treated Female  66   Marked      1
## 36 13   Treated Female  67   Marked      1
## 37 61   Treated Female  68     Some      1
## 38 65   Treated Female  68   Marked      1
## 39 11   Treated Female  69     None      0
## 40 56   Treated Female  69     Some      1
## 41 43   Treated Female  70     Some      1
## 42  9   Placebo   Male  37     None      0
## 43 14   Placebo   Male  44     None      0
## 44 73   Placebo   Male  50     None      0
## 45 74   Placebo   Male  51     None      0
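As a quick sanity check on the recoding (a small addition, not in the original report), we can cross-tabulate the original Improved variable against the new better indicator:

# Every "None" row should map to better = 0, and "Some"/"Marked" rows to better = 1
table(data$Improved, data$better)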

0.3 Logistic Regression Model: Age

We want to investigate the relationship between Age and the outcome variable Better. As the plot and the fitted model suggest, the probability of improvement increases with age, and Age is a significant factor in the arthritis outcome (p-value = 0.011).

# Logistic regression of better on Age
artha.log <- glm(better ~ Age, data = data, family = binomial)
artha.log
## 
## Call:  glm(formula = better ~ Age, family = binomial, data = data)
## 
## Coefficients:
## (Intercept)          Age  
##    -2.64207      0.04925  
## 
## Degrees of Freedom: 83 Total (i.e. Null);  82 Residual
## Null Deviance:       116.4 
## Residual Deviance: 109.2     AIC: 113.2
summary(artha.log)
## 
## Call:
## glm(formula = better ~ Age, family = binomial, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.51065  -1.12772   0.07936   1.06771   1.76109  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.64207    1.07317  -2.462   0.0138 *
## Age          0.04925    0.01936   2.544   0.0110 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 116.45  on 83  degrees of freedom
## Residual deviance: 109.16  on 82  degrees of freedom
## AIC: 113.16
## 
## Number of Fisher Scoring iterations: 4
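Since the coefficients are on the log-odds scale, it helps to exponentiate them. The sketch below (an addition, using Wald intervals from confint.default) expresses the Age effect as an odds ratio per 10 years of age, roughly exp(10 * 0.049) ≈ 1.6:

# Odds ratio for a 10-year increase in Age, with a Wald 95% confidence interval
exp(10 * coef(artha.log)["Age"])
exp(10 * confint.default(artha.log)["Age", ])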
# Grid of ages at which to evaluate the fitted probabilities (with standard errors)
xyvalues <- seq(15, 85, 5)
(pred.logistic <- predict(artha.log, newdata= data.frame(Age= xyvalues), type="response", se.fit= TRUE))
## $fit
##         1         2         3         4         5         6         7 
## 0.1297314 0.1601529 0.1961007 0.2378324 0.2852931 0.3380245 0.3951142 
##         8         9        10        11        12        13        14 
## 0.4552154 0.5166502 0.5775858 0.6362479 0.6911201 0.7410832 0.7854731 
##        15 
## 0.8240594 
## 
## $se.fit
##          1          2          3          4          5          6 
## 0.08940039 0.09411439 0.09601029 0.09435746 0.08881876 0.07982508 
##          7          8          9         10         11         12 
## 0.06906736 0.05998089 0.05706462 0.06204566 0.07165549 0.08150700 
##         13         14         15 
## 0.08885431 0.09257500 0.09260412 
## 
## $residual.scale
## [1] 1
# Approximate 95% pointwise band: fitted probability +/- 1.96 standard errors
(upper <- pred.logistic$fit + 1.96 * pred.logistic$se.fit)
##         1         2         3         4         5         6         7 
## 0.3049561 0.3446171 0.3842809 0.4227730 0.4593779 0.4944817 0.5304862 
##         8         9        10        11        12        13        14 
## 0.5727779 0.6284969 0.6991953 0.7766927 0.8508738 0.9152376 0.9669201 
##        15 
## 1.0055635
(lower <- pred.logistic$fit - 1.96 * pred.logistic$se.fit)
##            1            2            3            4            5 
## -0.045493382 -0.024311304  0.007920551  0.052891734  0.111208373 
##            6            7            8            9           10 
##  0.181567360  0.259742159  0.337652864  0.404803537  0.455976300 
##           11           12           13           14           15 
##  0.495803156  0.531366350  0.566928749  0.604026145  0.642555364
# Jittered binary response with the fitted curve and 95% band
plot(jitter(better, .1) ~ Age, data= data, xlim = c(15,85), pch= 16, ylab= "Probability (Better)", main="Probability of Arthritis Response with respect to Age")
polygon(c(xyvalues, rev(xyvalues)), c(upper, rev(lower)), col= rgb(0, 0, 1, .2), border= NA)
lines(xyvalues, pred.logistic$fit, lwd= 4, col= "pink")
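Note that the band above is built as fit ± 1.96 SE on the probability scale, so its upper edge slightly exceeds 1 at the oldest ages. A common alternative (sketched below with the same artha.log and xyvalues objects; not part of the original analysis) is to form the interval on the logit scale and back-transform it with plogis(), which keeps it inside [0, 1]:

# 95% pointwise band computed on the link (logit) scale, then mapped back to probabilities
pred.link <- predict(artha.log, newdata = data.frame(Age = xyvalues), type = "link", se.fit = TRUE)
upper.p <- plogis(pred.link$fit + 1.96 * pred.link$se.fit)
lower.p <- plogis(pred.link$fit - 1.96 * pred.link$se.fit)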

0.4 Logistic Regression Model: Treatment

As the visualization and the fitted model show, Treatment is a highly significant factor (p-value = 0.00138): patients in the Treated group have a much higher probability of improvement than those on Placebo.

# Logistic regression of better on Treatment
arth.log <- glm(better ~ Treatment, data = data, family = binomial)
arth.log
## 
## Call:  glm(formula = better ~ Treatment, family = binomial, data = data)
## 
## Coefficients:
##      (Intercept)  TreatmentTreated  
##          -0.7282            1.4955  
## 
## Degrees of Freedom: 83 Total (i.e. Null);  82 Residual
## Null Deviance:       116.4 
## Residual Deviance: 105.5     AIC: 109.5
summary(arth.log)
## 
## Call:
## glm(formula = better ~ Treatment, family = binomial, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.51567  -0.88759  -0.00712   0.87335   1.49809  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       -0.7282     0.3254  -2.238  0.02524 * 
## TreatmentTreated   1.4955     0.4675   3.199  0.00138 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 116.45  on 83  degrees of freedom
## Residual deviance: 105.49  on 82  degrees of freedom
## AIC: 109.49
## 
## Number of Fisher Scoring iterations: 4
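To put the treatment effect on a more interpretable scale, the coefficient can be exponentiated into an odds ratio (a short addition using Wald intervals from confint.default): exp(1.4955) ≈ 4.5, i.e. the odds of improvement are roughly 4.5 times higher in the Treated group.

# Odds ratio for Treated vs Placebo, with a Wald 95% confidence interval
exp(coef(arth.log)["TreatmentTreated"])
exp(confint.default(arth.log)["TreatmentTreated", ])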
library(ggplot2)   # used for the remaining plots
# Jittered points with fitted logistic curves (and 95% bands) for each treatment group
rtreat <- ggplot(data, aes(Age, better, color = Treatment)) + xlim(5, 95) + theme_bw() +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), alpha = 0.2,
              aes(fill = Treatment), size = 2.5, fullrange = TRUE)
rtreat

0.5 Logistic Regression Model: Sex

Male patients have a lower probability of improvement than female patients (p-value = 0.035), as the negative SexMale coefficient indicates.

# Logistic regression of better on Sex
arths.log <- glm(better ~ Sex, data = data, family = binomial)
arths.log
## 
## Call:  glm(formula = better ~ Sex, family = binomial, data = data)
## 
## Coefficients:
## (Intercept)      SexMale  
##      0.3075      -1.0613  
## 
## Degrees of Freedom: 83 Total (i.e. Null);  82 Residual
## Null Deviance:       116.4 
## Residual Deviance: 111.8     AIC: 115.8
summary(arths.log)
## 
## Call:
## glm(formula = better ~ Sex, family = binomial, data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.31047  -1.31047   0.08584   1.04993   1.50959  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.3075     0.2635   1.167    0.243  
## SexMale      -1.0613     0.5032  -2.109    0.035 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 116.45  on 83  degrees of freedom
## Residual deviance: 111.76  on 82  degrees of freedom
## AIC: 115.76
## 
## Number of Fisher Scoring iterations: 4
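The same pattern is visible in the raw data; a quick cross-tabulation (an added check, not in the original report) gives the observed proportion improved within each sex:

# Observed proportion of patients improved, by sex (each row sums to 1)
with(data, prop.table(table(Sex, better), margin = 1))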
# Facet the treatment plot by Sex; the warnings below likely reflect near-complete
# separation in a small subgroup (fitted probabilities of 0 or 1)
lsex <- rtreat + facet_wrap(~Sex)
lsex
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

0.6 Summary

Logistic regression is designed to predict the outcome of a categorical (binary) variable.

Advantages: the output is easy to interpret as the probability of a "Yes" or "No" outcome.

The S-shaped logistic link also allows the probability to depend non-linearly on the predictors.

Limitations: logistic regression assumes the observations are independent of one another, and it cannot predict a continuous outcome unless that outcome is first converted into groups (e.g., age 65-75, >75).

Future work: interaction effects, such as Sex * Treatment?
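As a sketch of that next step (not fitted in this report), all three predictors and a Sex-by-Treatment interaction could be combined in a single model:

# Hypothetical follow-up model: Age plus a Sex x Treatment interaction
arth.full <- glm(better ~ Age + Sex * Treatment, data = data, family = binomial)
summary(arth.full)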