The South Africa heart attack example is an interesting example to use when study splines and logit for two features.
first,to determine whether a person is going to have a heart attack or not can be viewed as a classification problem,where logistic is appliable. second,some of the variables do not display a linear relationship with the final probability.
Lets take a closer look!
#heart<-read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",header = TRUE,sep = ",")
#saveRDS(object = heart,file="D:\\rworkspace\\heart.rds")
heart<-readRDS("D:\\rworkspace\\heart.rds") #rds is a much more efficient file type than txt and csv
head(heart)
## row.names sbp tobacco ldl adiposity famhist typea obesity alcohol age
## 1 1 160 12.00 5.73 23.11 Present 49 25.30 97.20 52
## 2 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63
## 3 3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46
## 4 4 170 7.50 6.41 38.03 Present 51 31.99 24.26 58
## 5 5 134 13.60 3.50 27.78 Present 60 25.99 57.34 49
## 6 6 132 6.20 6.47 36.21 Present 62 30.77 14.14 45
## chd
## 1 1
## 2 1
## 3 0
## 4 1
## 5 1
## 6 0
names(heart)
## [1] "row.names" "sbp" "tobacco" "ldl" "adiposity"
## [6] "famhist" "typea" "obesity" "alcohol" "age"
## [11] "chd"
“ldl”,blood fat density,numeric
“famhist”,family desease history,factor
“obesity”,too fat,numeric
“alcohol”,comsuption of wine,numeric
“age”,age,numeric
“chd”,indicator for heart attact,factor the dependent variable
library(sqldf)
heart.var<-sqldf("select sbp,tobacco,ldl,famhist,obesity,alcohol,age,chd from heart")
names(heart.var)
## [1] "sbp" "tobacco" "ldl" "famhist" "obesity" "alcohol" "age"
## [8] "chd"
par(cex=0.5)
pairs(heart.var[,1:7],col=c("red","green")[unclass(factor(heart.var$chd))]) #Attention,must convert chd from numeric to factor
logifit<-glm(chd~sbp+tobacco+ldl+factor(famhist)+obesity+alcohol+age,family="binomial",data = heart.var)
#family=binomial
logifit$coefficients
## (Intercept) sbp tobacco
## -4.1295996883 0.0057606767 0.0795256305
## ldl factor(famhist)Present obesity
## 0.1847793334 0.9391854851 -0.0345434340
## alcohol age
## 0.0006065017 0.0425412093
library(glmnet)
attach(heart.var)
X<-cbind(sbp,tobacco,ldl,factor(famhist),obesity,alcohol,age)#must convert famhist to factor
detach(heart.var)
logi.lasso<-glmnet(y = factor(heart.var[,8]),x=X,family = "binomial",alpha = 1)
summary(logi.lasso)
## Length Class Mode
## a0 55 -none- numeric
## beta 385 dgCMatrix S4
## df 55 -none- numeric
## dim 2 -none- numeric
## lambda 55 -none- numeric
## dev.ratio 55 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## classnames 2 -none- character
## call 5 -none- call
## nobs 1 -none- numeric
The beta and lambda are in the logi.lasso obejct Different lambda means different penalty on the parameter,how to choose the lambda is a tradeoff between variance and bias,flexibility and interpretation.See http://www.r-bloggers.com/lang/chinese/1017 for details.
The function cv.glmnet in glmnet package can do cross validation while doing lasso and logistic,we try that.
library(glmnet)
logi.cv<-cv.glmnet(y = factor(heart.var[,8]),x=X,family = "binomial",alpha = 1,type.measure = "deviance")
summary(logi.cv)
## Length Class Mode
## lambda 53 -none- numeric
## cvm 53 -none- numeric
## cvsd 53 -none- numeric
## cvup 53 -none- numeric
## cvlo 53 -none- numeric
## nzero 53 -none- numeric
## name 1 -none- character
## glmnet.fit 13 lognet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
plot(logi.cv)# automaticlly plot the lambda and deviance relationship, we should choose the smallest lambda
#the final model with smallest lambda has already been prepared for us,nice!
#Attention,the following two are different!!
coef(logi.cv$glmnet.fit,s = logi.cv$lambda.1se)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.95521830
## sbp .
## tobacco 0.03611122
## ldl 0.06361087
## 0.41373618
## obesity .
## alcohol .
## age 0.02885504
coef(logi.cv$glmnet.fit,s = min(logi.cv$glmnet.fit$lambda))
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.0428115016
## sbp 0.0055112708
## tobacco 0.0786722522
## ldl 0.1804313400
## 0.9260374566
## obesity -0.0315536417
## alcohol 0.0004016724
## age 0.0421021175
The first one chooses not the exact minimum \(\lambda\), but the \(\lambda\) with in \(\pm \sigma\) so that some other parameters are zero so the model is neat. The second one is the exact minimin \(\lambda\) model,but we choose the first one because it does variable selection for us.
now we try to fit it as a spline+logistic function, use the variable obesity first.
library(splines)
#use ns() function to get the natrual spline basic functions bsf
attach(heart.var)
sbp<-heart.var[,1]
tobacoo<-heart.var[,2]
ldl.bsf<-heart.var[,3]
famhist<-heart.var[,4]
obesity<-heart.var[,5]
alcohol<-heart.var[,6]
age<-heart.var[,7]
chd<-heart.var[,8]
#logistic plus spline
logispline<-glm(chd~ns(sbp,df=4)+ ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4),family = "binomial")
print( summary(logispline), digits=3 )
##
## Call:
## glm(formula = chd ~ ns(sbp, df = 4) + ns(tobacco, df = 4) + ns(ldl,
## df = 4) + famhist + ns(obesity, df = 4) + ns(alcohol, df = 4) +
## ns(age, df = 4), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.724 -0.827 -0.388 0.887 2.959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1160 2.4124 -0.88 0.38040
## ns(sbp, df = 4)1 -1.4794 0.8450 -1.75 0.07998 .
## ns(sbp, df = 4)2 -1.3197 0.7637 -1.73 0.08397 .
## ns(sbp, df = 4)3 -3.7537 2.0254 -1.85 0.06383 .
## ns(sbp, df = 4)4 1.3973 1.0039 1.39 0.16395
## ns(tobacco, df = 4)1 0.6495 0.4587 1.42 0.15677
## ns(tobacco, df = 4)2 0.4181 0.9033 0.46 0.64346
## ns(tobacco, df = 4)3 3.3626 1.1877 2.83 0.00464 **
## ns(tobacco, df = 4)4 3.8534 2.3775 1.62 0.10506
## ns(ldl, df = 4)1 1.8688 1.3280 1.41 0.15937
## ns(ldl, df = 4)2 1.7217 1.0327 1.67 0.09547 .
## ns(ldl, df = 4)3 4.5209 3.0015 1.51 0.13201
## ns(ldl, df = 4)4 3.3454 1.4526 2.30 0.02127 *
## famhistPresent 1.0787 0.2390 4.51 6.4e-06 ***
## ns(obesity, df = 4)1 -3.1058 1.7216 -1.80 0.07122 .
## ns(obesity, df = 4)2 -2.3753 1.2057 -1.97 0.04884 *
## ns(obesity, df = 4)3 -5.0542 3.8266 -1.32 0.18657
## ns(obesity, df = 4)4 0.0545 1.7499 0.03 0.97516
## ns(alcohol, df = 4)1 0.1367 0.4010 0.34 0.73314
## ns(alcohol, df = 4)2 -0.3561 0.7422 -0.48 0.63144
## ns(alcohol, df = 4)3 -0.1101 0.8081 -0.14 0.89168
## ns(alcohol, df = 4)4 0.5908 1.0934 0.54 0.58893
## ns(age, df = 4)1 2.5975 1.1189 2.32 0.02026 *
## ns(age, df = 4)2 3.1008 0.9179 3.38 0.00073 ***
## ns(age, df = 4)3 7.5511 2.5674 2.94 0.00327 **
## ns(age, df = 4)4 1.5199 0.5964 2.55 0.01082 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 457.63 on 436 degrees of freedom
## AIC: 509.6
##
## Number of Fisher Scoring iterations: 6
Attention Here! the ns() function creates 4 basis functions(constant term ignored,so total 3 internel + 2 boundary =5 knots),though df=4, 4 basis function, the knots are still five in total.
For example, ns(sbp.df=1) creates 4 basis functions: sbp1 sbp2 sbp3 sbp4 each of them has the estimated coefficient in the whole coefficients of the model. the final model would look like this in formula: Watch closely
\(logit(a)=\vec{\theta_0}+\begin{pmatrix} sbp1\\sbp2\\sbp3\\sbp4\\ \end{pmatrix}^T *\begin{pmatrix} \theta_1^sbp\\ \theta_2^spb\\ \theta_3^sbp\\ \theta_4^sbp\\ \end{pmatrix}+...\)
As a comparison,the intercept term is different
form = "chd ~ ns(sbp,df=4) + ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4)"
form = formula(form)
m = glm( form, data=heart.var, family=binomial )
print( summary(m), digits=3 )
##
## Call:
## glm(formula = form, family = binomial, data = heart.var)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.724 -0.827 -0.388 0.887 2.959
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1160 2.4124 -0.88 0.38040
## ns(sbp, df = 4)1 -1.4794 0.8450 -1.75 0.07998 .
## ns(sbp, df = 4)2 -1.3197 0.7637 -1.73 0.08397 .
## ns(sbp, df = 4)3 -3.7537 2.0254 -1.85 0.06383 .
## ns(sbp, df = 4)4 1.3973 1.0039 1.39 0.16395
## ns(tobacco, df = 4)1 0.6495 0.4587 1.42 0.15677
## ns(tobacco, df = 4)2 0.4181 0.9033 0.46 0.64346
## ns(tobacco, df = 4)3 3.3626 1.1877 2.83 0.00464 **
## ns(tobacco, df = 4)4 3.8534 2.3775 1.62 0.10506
## ns(ldl, df = 4)1 1.8688 1.3280 1.41 0.15937
## ns(ldl, df = 4)2 1.7217 1.0327 1.67 0.09547 .
## ns(ldl, df = 4)3 4.5209 3.0015 1.51 0.13201
## ns(ldl, df = 4)4 3.3454 1.4526 2.30 0.02127 *
## famhistPresent 1.0787 0.2390 4.51 6.4e-06 ***
## ns(obesity, df = 4)1 -3.1058 1.7216 -1.80 0.07122 .
## ns(obesity, df = 4)2 -2.3753 1.2057 -1.97 0.04884 *
## ns(obesity, df = 4)3 -5.0542 3.8266 -1.32 0.18657
## ns(obesity, df = 4)4 0.0545 1.7499 0.03 0.97516
## ns(alcohol, df = 4)1 0.1367 0.4010 0.34 0.73314
## ns(alcohol, df = 4)2 -0.3561 0.7422 -0.48 0.63144
## ns(alcohol, df = 4)3 -0.1101 0.8081 -0.14 0.89168
## ns(alcohol, df = 4)4 0.5908 1.0934 0.54 0.58893
## ns(age, df = 4)1 2.5975 1.1189 2.32 0.02026 *
## ns(age, df = 4)2 3.1008 0.9179 3.38 0.00073 ***
## ns(age, df = 4)3 7.5511 2.5674 2.94 0.00327 **
## ns(age, df = 4)4 1.5199 0.5964 2.55 0.01082 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 457.63 on 436 degrees of freedom
## AIC: 509.6
##
## Number of Fisher Scoring iterations: 6
drop1( m, scope=form, test="Chisq" )
## Single term deletions
##
## Model:
## chd ~ ns(sbp, df = 4) + ns(tobacco, df = 4) + ns(ldl, df = 4) +
## famhist + ns(obesity, df = 4) + ns(alcohol, df = 4) + ns(age,
## df = 4)
## Df Deviance AIC LRT Pr(>Chi)
## <none> 457.63 509.63
## ns(sbp, df = 4) 4 466.77 510.77 9.1429 0.0576257 .
## ns(tobacco, df = 4) 4 469.61 513.61 11.9753 0.0175355 *
## ns(ldl, df = 4) 4 470.90 514.90 13.2710 0.0100249 *
## famhist 1 478.76 528.76 21.1319 4.287e-06 ***
## ns(obesity, df = 4) 4 465.41 509.41 7.7749 0.1001811
## ns(alcohol, df = 4) 4 458.09 502.09 0.4562 0.9776262
## ns(age, df = 4) 4 480.37 524.37 22.7414 0.0001426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference results from the initial choice of level of the factor famhist.
Here we can do cv.glmnet also,which will be a combination of logistic,spline, and lasso all together!
ns(obesity, df = 4)1 -3.1058
ns(obesity, df = 4)2 -2.3753
ns(obesity, df = 4)3 -5.0542
ns(obesity, df = 4)4 0.0545
theta <- c(-3.1058,-2.3753,-5.0542,0.0545)
#ns() function creates a matrix, rows are each obs that put in, columns are each numurical result of the obs
#been put into that basis fucntion
obe.ns <- ns(obesity,df=4)
obe.ns<-as.matrix(obe.ns)
theta<-(as.matrix(theta))
#
obe.partial<-obe.ns%*%theta
plot(obesity,obe.partial,col="green3")#no intercept term now