Logiustic regression with one explanatory variable

The CHDAGE data set.

This is an analysis of the famous chdage data from Prof Stanley Lemeshow’s Cousera Course on applied logistic regression. The package aplore3 is already installed on my computer. Step 1. load the data set, recode variables if needed, examine it produce summary statistics and plot chd by age

##    id age agegrp chd
## 1   1  20  20-39  No
## 2   2  23  20-39  No
## 3   3  24  20-39  No
## 4   4  25  20-39  No
## 5   5  25  20-39 Yes
## 6   6  26  20-39  No
## 7   7  26  20-39  No
## 8   8  28  20-39  No
## 9   9  28  20-39  No
## 10 10  29  20-39  No

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   34.75   44.00   44.38   55.00   69.00
##   Summary x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
## 1      No  20.00     32.00    38.00  39.18     46.00  64.00
## 2     Yes  25.00     44.50    54.00  51.28     58.00  69.00

The subjects have also been put into age categories and we can look at the prevalence of CHD by age group

## Warning in mean.default(X[[1L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[2L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[3L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[4L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[5L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[6L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[7L]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[8L]], ...): argument is not numeric or logical:
## returning NA

Logistic regression Model: chd = Constant+Beta1*Age

# fit1 <- glm(chd ~ age, family=binomial(logit), data = d) 
# summary(fit1)
# durbinWatsonTest(fit1) # DW-Statistic
# crPlots(fit1)
fit2 <- glm(as.integer(chd)-1 ~ age, family = binomial, data = d)
summary(fit2)
## 
## Call:
## glm(formula = as.integer(chd) - 1 ~ age, family = binomial, data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9718  -0.8456  -0.4576   0.8253   2.2859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
## age          0.11092    0.02406   4.610 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 107.35  on 98  degrees of freedom
## AIC: 111.35
## 
## Number of Fisher Scoring iterations: 4
coef(fit2)
## (Intercept)         age 
##  -5.3094534   0.1109211
vcov(fit2)
##             (Intercept)           age
## (Intercept)  1.28517059 -0.0266769747
## age         -0.02667697  0.0005788748

The co-efficients are

exp(coef(fit2))
## (Intercept)         age 
## 0.004944629 1.117306795
exp(coef(fit2))[-1] # gives the odds ratio
##      age 
## 1.117307
exp(confint(fit2))[-1, ] # gives the 95% Ci for the odds ratio
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 1.069222 1.175868
plot(d$age, fit2$fitted.values, type = "l", ylim=c(0,1.1), col="blue",
     main="Fitted values", xlab="Age",ylab="Estimated probability")
points(d$age, as.integer(d$chd)-1)

The likelihood ratio test is given by the improvement in devaince score between the Null Deviance and the Residual Deviance still left after fitting the model (i.e after takling account of age and doing the logistic regression)

This is obtained either from the print out of the summary(fit2), or from the foll R code The Likelihood ratio test is given by G, equal to

fit2$null.deviance - fit2$deviance
## [1] 29.30989

which is a Chi-square statistic with 1 df and has a p-value of

round(pchisq(fit2$null.deviance - fit2$deviance, df=1, lower.tail=FALSE), 5)
## [1] 0

Logistic Regression with more than one explanatory variables.

The Low Birth Weight data set

see also: http://rpubs.com/fernandosansegundo/82577. I learned a lot from this fellow student’s post.

d <- read.table("LOWBWT.txt", header=TRUE)
colnames(d)
##  [1] "ID"    "LOW"   "AGE"   "LWT"   "RACE"  "SMOKE" "PTL"   "HT"   
##  [9] "UI"    "FTV"   "BWT"
names(d) <- tolower(names(d))
str(d)
## 'data.frame':    189 obs. of  11 variables:
##  $ id   : int  4 10 11 13 15 16 17 18 19 20 ...
##  $ low  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  28 29 34 25 25 27 23 24 24 21 ...
##  $ lwt  : int  120 130 187 105 85 150 97 128 132 165 ...
##  $ race : int  3 1 2 3 3 3 3 2 3 1 ...
##  $ smoke: int  1 0 1 0 0 0 0 0 0 1 ...
##  $ ptl  : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ ht   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ ui   : int  1 1 0 0 1 0 1 0 0 0 ...
##  $ ftv  : int  0 2 0 0 0 0 1 1 0 1 ...
##  $ bwt  : int  709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...
fit1 <- glm(low~age+lwt+race+ftv, family= binomial(link="logit"), data=d)
d$low <- as.factor(d$low)
d$race <- as.factor(d$race)
by(d[,c(3:4,10)], d$low, function(x) summary(x))
## d$low: 0
##       age             lwt             ftv        
##  Min.   :14.00   Min.   : 85.0   Min.   :0.0000  
##  1st Qu.:19.00   1st Qu.:113.0   1st Qu.:0.0000  
##  Median :23.00   Median :123.5   Median :1.0000  
##  Mean   :23.66   Mean   :133.3   Mean   :0.8385  
##  3rd Qu.:28.00   3rd Qu.:147.0   3rd Qu.:1.0000  
##  Max.   :45.00   Max.   :250.0   Max.   :6.0000  
## -------------------------------------------------------- 
## d$low: 1
##       age             lwt             ftv        
##  Min.   :14.00   Min.   : 80.0   Min.   :0.0000  
##  1st Qu.:19.50   1st Qu.:104.0   1st Qu.:0.0000  
##  Median :22.00   Median :120.0   Median :0.0000  
##  Mean   :22.31   Mean   :122.1   Mean   :0.6949  
##  3rd Qu.:25.00   3rd Qu.:130.0   3rd Qu.:1.0000  
##  Max.   :34.00   Max.   :200.0   Max.   :4.0000
table(d$low,d$race)
##    
##      1  2  3
##   0 73 15 42
##   1 23 11 25
fit2 <- glm(low~age+lwt+race+ftv, family= binomial(link="logit"), data=d)
fit1$deviance
## [1] 225.3132
fit2$deviance
## [1] 222.5729
summary(fit2)
## 
## Call:
## glm(formula = low ~ age + lwt + race + ftv, family = binomial(link = "logit"), 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4163  -0.8931  -0.7113   1.2454   2.0755  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.295366   1.071443   1.209   0.2267  
## age         -0.023823   0.033730  -0.706   0.4800  
## lwt         -0.014245   0.006541  -2.178   0.0294 *
## race2        1.003898   0.497859   2.016   0.0438 *
## race3        0.433108   0.362240   1.196   0.2318  
## ftv         -0.049308   0.167239  -0.295   0.7681  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 222.57  on 183  degrees of freedom
## AIC: 234.57
## 
## Number of Fisher Scoring iterations: 4

The confidence intervals for the Beta coefficients are:

confint.default(fit2)
##                   2.5 %       97.5 %
## (Intercept) -0.80462420  3.395355695
## age         -0.08993181  0.042285866
## lwt         -0.02706418 -0.001425038
## race2        0.02811220  1.979683451
## race3       -0.27686939  1.143086244
## ftv         -0.37709090  0.278474261

The log-likelihood of the fitted model is given by the logLik() function as follows: the expoenent of this the likeihood or probability function

logLik(fit2)
## 'log Lik.' -111.2865 (df=6)
exp(logLik(fit2))
## 'log Lik.' 4.6656e-49 (df=6)

The likelihhod ratio test statistic G is given by thedeifference of the devaiance between the null model and the model we fitted the object fit2 contains both these statistics as shown in the printout above

LikRatio <- fit2$null.deviance - fit2$deviance
df.LikRatio <- fit2$df.null - fit2$df.residual
p.Value <- pchisq(LikRatio, lower.tail=FALSE, df.LikRatio)

The result: The Likelihood ratio test statistic is 12.0990898 which has follows a Chi-square distribution with 5 degrees of freedom and and so has a a p-value equal to 0.033455

The reduced model is obtained by fitting only the 2 explanatory variables that have signifcant Beta coefficients. These are lwt and race:

fit3 <- glm(low~lwt+race, family=binomial(link="logit"), data=d)
summary(fit3)
## 
## Call:
## glm(formula = low ~ lwt + race, family = binomial(link = "logit"), 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3491  -0.8919  -0.7196   1.2526   2.0993  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.805753   0.845167   0.953   0.3404  
## lwt         -0.015223   0.006439  -2.364   0.0181 *
## race2        1.081066   0.488052   2.215   0.0268 *
## race3        0.480603   0.356674   1.347   0.1778  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 223.26  on 185  degrees of freedom
## AIC: 231.26
## 
## Number of Fisher Scoring iterations: 4

The 2 models can be compared by the likelihood ratio obtained by taking the ation of the deviance of the second model with the first.

Likelihopd ratio test statistic G comparing the two models is given by 0.6861841. This follows a chisquared distribution with 2 degrees of freedom and so has a p-value of:

pchisq(fit3$deviance - fit2$deviance, df=fit3$df.residual-fit2$df.residual, lower.tail=FALSE )
## [1] 0.7095729

The end