1.Introduction

The South Africa heart attack example is an interesting example to use when study splines and logit for two features.
first,to determine whether a person is going to have a heart attack or not can be viewed as a classification problem,where logistic is appliable. second,some of the variables do not display a linear relationship with the final probability.
Lets take a closer look!

2.Processing the data

Read in the data

   #heart<-read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",header = TRUE,sep = ",")
   #saveRDS(object = heart,file="D:\\rworkspace\\heart.rds")
   heart<-readRDS("D:\\rworkspace\\heart.rds") #rds is a much more efficient file type than txt and csv
   head(heart)

##   row.names sbp tobacco  ldl adiposity famhist typea obesity alcohol age
## 1         1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52
## 2         2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63
## 3         3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46
## 4         4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58
## 5         5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49
## 6         6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45
##   chd
## 1   1
## 2   1
## 3   0
## 4   1
## 5   1
## 6   0

   names(heart)

##  [1] "row.names" "sbp"       "tobacco"   "ldl"       "adiposity"
##  [6] "famhist"   "typea"     "obesity"   "alcohol"   "age"      
## [11] "chd"

“ldl”,blood fat density,numeric
“famhist”,family desease history,factor
“obesity”,too fat,numeric
“alcohol”,comsuption of wine,numeric
“age”,age,numeric
“chd”,indicator for heart attact,factor the dependent variable

Now we extract them out(use SQL delibrately to polish up)

   library(sqldf)
   heart.var<-sqldf("select sbp,tobacco,ldl,famhist,obesity,alcohol,age,chd from heart")
   names(heart.var)

## [1] "sbp"     "tobacco" "ldl"     "famhist" "obesity" "alcohol" "age"    
## [8] "chd"

We apply the fancy pair function displayed in the book to see the variable relations.

   par(cex=0.5)
   pairs(heart.var[,1:7],col=c("red","green")[unclass(factor(heart.var$chd))]) #Attention,must convert chd from numeric to factor

Do a simple linear logistic

\(\pi_x=P(Y=1|X)\)
\(logit(\pi_x)=ln\left(\frac{\pi_x}{1-\pi_x}\right)\)
\(logit(\pi_x)=X^T\beta\)
inner function glm() can perform the logistic regression

   logifit<-glm(chd~sbp+tobacco+ldl+factor(famhist)+obesity+alcohol+age,family="binomial",data = heart.var)
   #family=binomial
   logifit$coefficients

##            (Intercept)                    sbp                tobacco 
##          -4.1295996883           0.0057606767           0.0795256305 
##                    ldl factor(famhist)Present                obesity 
##           0.1847793334           0.9391854851          -0.0345434340 
##                alcohol                    age 
##           0.0006065017           0.0425412093

obesity is a bit weird. ******

Do logistic with lasso
the function glmnet() is available in package “glmnet”,with parameter \(\alpha\) =0 means ridge adn \(\alpha\) =1 means lasso

   library(glmnet)
   attach(heart.var)
   X<-cbind(sbp,tobacco,ldl,factor(famhist),obesity,alcohol,age)#must convert famhist to factor
   detach(heart.var)
   logi.lasso<-glmnet(y = factor(heart.var[,8]),x=X,family = "binomial",alpha = 1)
   summary(logi.lasso)

##            Length Class     Mode     
## a0          55    -none-    numeric  
## beta       385    dgCMatrix S4       
## df          55    -none-    numeric  
## dim          2    -none-    numeric  
## lambda      55    -none-    numeric  
## dev.ratio   55    -none-    numeric  
## nulldev      1    -none-    numeric  
## npasses      1    -none-    numeric  
## jerr         1    -none-    numeric  
## offset       1    -none-    logical  
## classnames   2    -none-    character
## call         5    -none-    call     
## nobs         1    -none-    numeric

The beta and lambda are in the logi.lasso obejct Different lambda means different penalty on the parameter,how to choose the lambda is a tradeoff between variance and bias,flexibility and interpretation.See http://www.r-bloggers.com/lang/chinese/1017 for details.

The function cv.glmnet in glmnet package can do cross validation while doing lasso and logistic,we try that.

   library(glmnet)
   logi.cv<-cv.glmnet(y = factor(heart.var[,8]),x=X,family = "binomial",alpha = 1,type.measure = "deviance")
   summary(logi.cv)

##            Length Class  Mode     
## lambda     53     -none- numeric  
## cvm        53     -none- numeric  
## cvsd       53     -none- numeric  
## cvup       53     -none- numeric  
## cvlo       53     -none- numeric  
## nzero      53     -none- numeric  
## name        1     -none- character
## glmnet.fit 13     lognet list     
## lambda.min  1     -none- numeric  
## lambda.1se  1     -none- numeric

   plot(logi.cv)# automaticlly plot the lambda and deviance relationship,  we should choose the smallest lambda

   #the final model with smallest lambda has already been prepared for us,nice!

   #Attention,the following two are different!!
   coef(logi.cv$glmnet.fit,s = logi.cv$lambda.1se)

## 8 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -2.95521830
## sbp          .         
## tobacco      0.03611122
## ldl          0.06361087
##              0.41373618
## obesity      .         
## alcohol      .         
## age          0.02885504

   coef(logi.cv$glmnet.fit,s = min(logi.cv$glmnet.fit$lambda))

## 8 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -5.0428115016
## sbp          0.0055112708
## tobacco      0.0786722522
## ldl          0.1804313400
##              0.9260374566
## obesity     -0.0315536417
## alcohol      0.0004016724
## age          0.0421021175

The first one chooses not the exact minimum \(\lambda\), but the \(\lambda\) with in \(\pm \sigma\) so that some other parameters are zero so the model is neat. The second one is the exact minimin \(\lambda\) model,but we choose the first one because it does variable selection for us.

now we try to fit it as a spline+logistic function, use the variable obesity first.

  library(splines)
  #use ns() function to get the natrual spline basic functions bsf
  attach(heart.var)
  sbp<-heart.var[,1]
  tobacoo<-heart.var[,2]
  ldl.bsf<-heart.var[,3]
  famhist<-heart.var[,4]
  obesity<-heart.var[,5]
  alcohol<-heart.var[,6]
  age<-heart.var[,7]
  chd<-heart.var[,8]
  
  #logistic plus spline
  logispline<-glm(chd~ns(sbp,df=4)+ ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4),family = "binomial")
  print( summary(logispline), digits=3 )

## 
## Call:
## glm(formula = chd ~ ns(sbp, df = 4) + ns(tobacco, df = 4) + ns(ldl, 
##     df = 4) + famhist + ns(obesity, df = 4) + ns(alcohol, df = 4) + 
##     ns(age, df = 4), family = "binomial")
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.724  -0.827  -0.388   0.887   2.959  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.1160     2.4124   -0.88  0.38040    
## ns(sbp, df = 4)1      -1.4794     0.8450   -1.75  0.07998 .  
## ns(sbp, df = 4)2      -1.3197     0.7637   -1.73  0.08397 .  
## ns(sbp, df = 4)3      -3.7537     2.0254   -1.85  0.06383 .  
## ns(sbp, df = 4)4       1.3973     1.0039    1.39  0.16395    
## ns(tobacco, df = 4)1   0.6495     0.4587    1.42  0.15677    
## ns(tobacco, df = 4)2   0.4181     0.9033    0.46  0.64346    
## ns(tobacco, df = 4)3   3.3626     1.1877    2.83  0.00464 ** 
## ns(tobacco, df = 4)4   3.8534     2.3775    1.62  0.10506    
## ns(ldl, df = 4)1       1.8688     1.3280    1.41  0.15937    
## ns(ldl, df = 4)2       1.7217     1.0327    1.67  0.09547 .  
## ns(ldl, df = 4)3       4.5209     3.0015    1.51  0.13201    
## ns(ldl, df = 4)4       3.3454     1.4526    2.30  0.02127 *  
## famhistPresent         1.0787     0.2390    4.51  6.4e-06 ***
## ns(obesity, df = 4)1  -3.1058     1.7216   -1.80  0.07122 .  
## ns(obesity, df = 4)2  -2.3753     1.2057   -1.97  0.04884 *  
## ns(obesity, df = 4)3  -5.0542     3.8266   -1.32  0.18657    
## ns(obesity, df = 4)4   0.0545     1.7499    0.03  0.97516    
## ns(alcohol, df = 4)1   0.1367     0.4010    0.34  0.73314    
## ns(alcohol, df = 4)2  -0.3561     0.7422   -0.48  0.63144    
## ns(alcohol, df = 4)3  -0.1101     0.8081   -0.14  0.89168    
## ns(alcohol, df = 4)4   0.5908     1.0934    0.54  0.58893    
## ns(age, df = 4)1       2.5975     1.1189    2.32  0.02026 *  
## ns(age, df = 4)2       3.1008     0.9179    3.38  0.00073 ***
## ns(age, df = 4)3       7.5511     2.5674    2.94  0.00327 ** 
## ns(age, df = 4)4       1.5199     0.5964    2.55  0.01082 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 457.63  on 436  degrees of freedom
## AIC: 509.6
## 
## Number of Fisher Scoring iterations: 6

Attention Here! the ns() function creates 4 basis functions(constant term ignored,so total 3 internel + 2 boundary =5 knots),though df=4, 4 basis function, the knots are still five in total.
For example, ns(sbp.df=1) creates 4 basis functions: sbp1 sbp2 sbp3 sbp4 each of them has the estimated coefficient in the whole coefficients of the model. the final model would look like this in formula: Watch closely
\(logit(a)=\vec{\theta_0}+\begin{pmatrix} sbp1\\sbp2\\sbp3\\sbp4\\ \end{pmatrix}^T *\begin{pmatrix} \theta_1^sbp\\ \theta_2^spb\\ \theta_3^sbp\\ \theta_4^sbp\\ \end{pmatrix}+...\)

As a comparison,the intercept term is different

   form = "chd ~ ns(sbp,df=4) + ns(tobacco,df=4) + ns(ldl,df=4) + famhist + ns(obesity,df=4) + ns(alcohol,df=4) + ns(age,df=4)"
   form = formula(form)
   m = glm( form, data=heart.var, family=binomial )
   print( summary(m), digits=3 )

## 
## Call:
## glm(formula = form, family = binomial, data = heart.var)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.724  -0.827  -0.388   0.887   2.959  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.1160     2.4124   -0.88  0.38040    
## ns(sbp, df = 4)1      -1.4794     0.8450   -1.75  0.07998 .  
## ns(sbp, df = 4)2      -1.3197     0.7637   -1.73  0.08397 .  
## ns(sbp, df = 4)3      -3.7537     2.0254   -1.85  0.06383 .  
## ns(sbp, df = 4)4       1.3973     1.0039    1.39  0.16395    
## ns(tobacco, df = 4)1   0.6495     0.4587    1.42  0.15677    
## ns(tobacco, df = 4)2   0.4181     0.9033    0.46  0.64346    
## ns(tobacco, df = 4)3   3.3626     1.1877    2.83  0.00464 ** 
## ns(tobacco, df = 4)4   3.8534     2.3775    1.62  0.10506    
## ns(ldl, df = 4)1       1.8688     1.3280    1.41  0.15937    
## ns(ldl, df = 4)2       1.7217     1.0327    1.67  0.09547 .  
## ns(ldl, df = 4)3       4.5209     3.0015    1.51  0.13201    
## ns(ldl, df = 4)4       3.3454     1.4526    2.30  0.02127 *  
## famhistPresent         1.0787     0.2390    4.51  6.4e-06 ***
## ns(obesity, df = 4)1  -3.1058     1.7216   -1.80  0.07122 .  
## ns(obesity, df = 4)2  -2.3753     1.2057   -1.97  0.04884 *  
## ns(obesity, df = 4)3  -5.0542     3.8266   -1.32  0.18657    
## ns(obesity, df = 4)4   0.0545     1.7499    0.03  0.97516    
## ns(alcohol, df = 4)1   0.1367     0.4010    0.34  0.73314    
## ns(alcohol, df = 4)2  -0.3561     0.7422   -0.48  0.63144    
## ns(alcohol, df = 4)3  -0.1101     0.8081   -0.14  0.89168    
## ns(alcohol, df = 4)4   0.5908     1.0934    0.54  0.58893    
## ns(age, df = 4)1       2.5975     1.1189    2.32  0.02026 *  
## ns(age, df = 4)2       3.1008     0.9179    3.38  0.00073 ***
## ns(age, df = 4)3       7.5511     2.5674    2.94  0.00327 ** 
## ns(age, df = 4)4       1.5199     0.5964    2.55  0.01082 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 457.63  on 436  degrees of freedom
## AIC: 509.6
## 
## Number of Fisher Scoring iterations: 6

   drop1( m, scope=form, test="Chisq" )

## Single term deletions
## 
## Model:
## chd ~ ns(sbp, df = 4) + ns(tobacco, df = 4) + ns(ldl, df = 4) + 
##     famhist + ns(obesity, df = 4) + ns(alcohol, df = 4) + ns(age, 
##     df = 4)
##                     Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>                   457.63 509.63                      
## ns(sbp, df = 4)      4   466.77 510.77  9.1429 0.0576257 .  
## ns(tobacco, df = 4)  4   469.61 513.61 11.9753 0.0175355 *  
## ns(ldl, df = 4)      4   470.90 514.90 13.2710 0.0100249 *  
## famhist              1   478.76 528.76 21.1319 4.287e-06 ***
## ns(obesity, df = 4)  4   465.41 509.41  7.7749 0.1001811    
## ns(alcohol, df = 4)  4   458.09 502.09  0.4562 0.9776262    
## ns(age, df = 4)      4   480.37 524.37 22.7414 0.0001426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference results from the initial choice of level of the factor famhist.

Here we can do cv.glmnet also,which will be a combination of logistic,spline, and lasso all together!

ns(obesity, df = 4)1 -3.1058
ns(obesity, df = 4)2 -2.3753
ns(obesity, df = 4)3 -5.0542
ns(obesity, df = 4)4 0.0545

   theta <- c(-3.1058,-2.3753,-5.0542,0.0545)
   #ns() function creates a matrix, rows are each obs that put in, columns are each numurical result of the obs 
   #been put into that basis fucntion
   obe.ns <- ns(obesity,df=4)
   obe.ns<-as.matrix(obe.ns)
   theta<-(as.matrix(theta))
   #
   obe.partial<-obe.ns%*%theta
   plot(obesity,obe.partial,col="green3")#no intercept term now

South Africa Heart Attack(spline and logit)

Xinyu Jiao

2016年2月23日

1.Introduction

2.Processing the data