I would suggest you read Chapter 6, and especially Section 6.2, of An Introduction to Statistical Learning for a full treatment of this topic.
If you want to read the original sources for the Lasso, check out the paper by Friedman, Hastie and Tibshirani (2010).
Why do I want to talk about this? The Lasso, and regularization more generally, is a popular technique in data science today because it performs both variable selection and optimized prediction in one algorithm. It can also alleviate some problems of multicollinearity between predictors. The Lasso can be applied to linear models, generalized linear models, generalized linear mixed models and the Cox model, and there are R packages that provide this functionality.
We will first have a look at the linear model solution and extend it.
The standard linear regression model can be written:
\[\hat y_i = \hat \beta_0 + \sum_k {\hat \beta_k x_{ik}}\]
where the terms on the right form the linear predictor for the outcome (\(y_i\)), given the values of the predictors (\(x_{ik}\)). The \(\hat \beta\) values are the unbiased estimates of the regression parameters, which for the linear model are typically found by least squares estimation.
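To make the estimation concrete, here is a minimal sketch of least squares using the built-in mtcars data (an illustration only, not the data analyzed below); the coefficients from lm() match the closed-form solution \((X'X)^{-1}X'y\):
# least squares via lm() and via the normal equations, on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
X0 <- model.matrix(mpg ~ wt + hp, data = mtcars)
y0 <- mtcars$mpg
beta_hat <- solve(t(X0) %*% X0) %*% t(X0) %*% y0
cbind(lm = coef(fit), normal_equations = as.vector(beta_hat))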
Extension of the linear modeling concept to non-Gaussian outcomes:
A link function relates the mean of the outcome to the linear predictor \(\eta_i\) through a nonlinear transformation.
For example, in logistic regression:
\[\log \left( \frac{p_i}{1-p_i} \right ) = \eta_i = \sum_k {\beta_k x_{ik}}\] To get the probability from the model, we use the inverse logit transform: \[p_i = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = \frac{1}{1+\exp(-\eta_i)}\]
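As a quick numeric check (a sketch with made-up values of the linear predictor, not output from any model below), the inverse logit can be computed by hand or with the built-in plogis() function:
# a few hypothetical values of the linear predictor
eta <- c(-2, 0, 1.5)
# inverse logit by hand: p = exp(eta) / (1 + exp(eta))
exp(eta) / (1 + exp(eta))
# the same transform via the logistic distribution function
plogis(eta)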
We find the parameters of the model so that the model deviance is minimized. In the linear regression context, the deviance is the same as the residual sum of squares (RSS):
\[\text{Deviance} = \sum_i \left(y_i - \sum_k \hat{\beta}_k x_{ik}\right)^2\]
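For a Gaussian (linear) model the two quantities agree exactly; here is a minimal check, again on the built-in mtcars data rather than the data used below:
# deviance() of a gaussian linear model is just the residual sum of squares
fit0 <- lm(mpg ~ wt, data = mtcars)
deviance(fit0)
sum(residuals(fit0)^2)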
The problem is that when the model contains parameters that are zero or nearly zero, those terms add complexity without improving the fit, and criteria that penalize model size, such as the AIC, will favor a model that drops them.
The question is how to go about removing them.
You may have seen methods of regression subsetting via stepwise, forward or backward selection. These methods iteratively insert or remove predictor variables from the model, and at each step the candidate models are scored via their RSS, AIC, or some other information criterion.
Here is an example where we fit a linear model and then perform variable subset selection.
A common data science approach would be to fit a model with all predictors included, and then winnow the model to include only the significant predictors. We start with this type of model:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
X<-scale(model.matrix(e0total~. , data=prb2))
dat<-data.frame(X[,-1])
dat$y<-prb2$e0total
fm<-glm(y~., data=dat)
summary(fm)
##
## Call:
## glm(formula = y ~ ., data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3165 -1.0635 0.0717 1.0734 8.4634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.96172 0.12695 535.353 < 2e-16 ***
## continentAsia 0.10542 0.21032 0.501 0.61682
## continentEurope 0.34620 0.31736 1.091 0.27676
## continentNorth.America 0.47294 0.20704 2.284 0.02350 *
## continentOceania -0.01321 0.18164 -0.073 0.94212
## continentSouth.America 0.31581 0.16765 1.884 0.06119 .
## population. -0.07204 0.13756 -0.524 0.60109
## cbr -4.88069 3.32297 -1.469 0.14361
## cdr -3.18919 1.35799 -2.348 0.01992 *
## rate.of.natural.increase 4.58698 3.01912 1.519 0.13041
## net.migration.rate -0.09932 0.21244 -0.468 0.64069
## imr -2.25578 0.44254 -5.097 8.55e-07 ***
## womandlifetimeriskmaternaldeath 0.04779 0.22116 0.216 0.82916
## tfr 1.19527 0.86345 1.384 0.16796
## percpoplt15 -1.67783 0.78081 -2.149 0.03296 *
## percpopgt65 3.27295 0.45908 7.129 2.26e-11 ***
## percurban -0.13582 0.24230 -0.561 0.57579
## percpopinurbangt750k 0.36062 0.23572 1.530 0.12777
## percpop1549hivaids2007 -1.39839 0.27793 -5.031 1.16e-06 ***
## percmarwomcontraall -0.54894 0.39554 -1.388 0.16688
## percmarwomcontramodern 0.81473 0.36666 2.222 0.02751 *
## percppundernourished0204 -0.51257 0.19299 -2.656 0.00861 **
## motorvehper1000pop0005 0.39727 0.30291 1.312 0.19132
## percpopwaccessimprovedwatersource 0.46966 0.25031 1.876 0.06221 .
## gnippppercapitausdollars 0.36604 0.35237 1.039 0.30028
## popdenspersqkm 0.38914 0.19757 1.970 0.05039 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.368179)
##
## Null deviance: 25183.69 on 208 degrees of freedom
## Residual deviance: 616.38 on 183 degrees of freedom
## AIC: 873.15
##
## Number of Fisher Scoring iterations: 2
length(coef(fm))
## [1] 26
One traditional approach would then be to do some sort of variable selection; what people used to be taught was stepwise (backward or forward) elimination to obtain the best subset of predictors:
smod<-fm%>%
stepAIC(trace = F, direction = "both")
summary(smod)
##
## Call:
## glm(formula = y ~ continentEurope + continentNorth.America +
## continentSouth.America + cbr + cdr + rate.of.natural.increase +
## imr + tfr + percpoplt15 + percpopgt65 + percpopinurbangt750k +
## percpop1549hivaids2007 + percmarwomcontraall + percmarwomcontramodern +
## percppundernourished0204 + motorvehper1000pop0005 + percpopwaccessimprovedwatersource +
## popdenspersqkm, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2123 -1.1090 0.0285 1.1096 8.6632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9617 0.1254 542.100 < 2e-16 ***
## continentEurope 0.3607 0.2479 1.455 0.14724
## continentNorth.America 0.4237 0.1511 2.804 0.00558 **
## continentSouth.America 0.2574 0.1370 1.878 0.06186 .
## cbr -5.2992 3.2535 -1.629 0.10502
## cdr -3.2596 1.3293 -2.452 0.01511 *
## rate.of.natural.increase 4.7891 2.9565 1.620 0.10693
## imr -2.1870 0.4091 -5.346 2.56e-07 ***
## tfr 1.4301 0.8108 1.764 0.07938 .
## percpoplt15 -1.8012 0.7160 -2.516 0.01271 *
## percpopgt65 3.2943 0.4002 8.232 2.90e-14 ***
## percpopinurbangt750k 0.3451 0.1762 1.959 0.05163 .
## percpop1549hivaids2007 -1.3271 0.2527 -5.252 4.01e-07 ***
## percmarwomcontraall -0.6133 0.3749 -1.636 0.10350
## percmarwomcontramodern 0.8929 0.3238 2.757 0.00639 **
## percppundernourished0204 -0.5095 0.1834 -2.778 0.00601 **
## motorvehper1000pop0005 0.5449 0.2358 2.311 0.02189 *
## percpopwaccessimprovedwatersource 0.4383 0.2386 1.837 0.06782 .
## popdenspersqkm 0.4175 0.1679 2.487 0.01376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.284857)
##
## Null deviance: 25183.69 on 208 degrees of freedom
## Residual deviance: 624.12 on 190 degrees of freedom
## AIC: 861.76
##
## Number of Fisher Scoring iterations: 2
length(smod$coefficients)
## [1] 19
Sure enough, there are fewer predictors in the model, and those that remain have small p values.
We can also see the relative model fits of the two models:
AIC(fm)
## [1] 873.1548
AIC(smod)
## [1] 861.765
This shows that the original model had 26 parameters, while the model that used AIC to find the best subset had 19 parameters, and the model fit, in terms of AIC, went from 873.15 for the full model to 861.76 for the stepwise model.
An alternative to stepwise selection is best subset regression, which considers all possible combinations of the predictors and scores each candidate model with an information criterion (here, the AIC).
library(bestglm)
## Loading required package: leaps
yx<-data.frame(cbind(X[,-1],prb2$e0total))
b1<-bestglm(yx,IC = "AIC", family=gaussian, method = "exhaustive")
summary(b1$BestModel)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2970 -1.0015 0.1472 1.0413 7.9947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9617 0.1259 539.752 < 2e-16 ***
## continentNorth.America 0.4513 0.1403 3.216 0.00152 **
## continentSouth.America 0.3089 0.1340 2.306 0.02218 *
## cbr -5.4216 3.1666 -1.712 0.08847 .
## cdr -2.5159 1.2957 -1.942 0.05362 .
## rate.of.natural.increase 6.2180 2.8878 2.153 0.03254 *
## imr -2.1299 0.3939 -5.407 1.87e-07 ***
## percpoplt15 -1.5118 0.6976 -2.167 0.03143 *
## percpopgt65 3.8075 0.3456 11.018 < 2e-16 ***
## percpop1549hivaids2007 -1.4029 0.2328 -6.025 8.34e-09 ***
## percmarwomcontramodern 0.2580 0.1810 1.426 0.15559
## percppundernourished0204 -0.5003 0.1822 -2.745 0.00662 **
## percpopwaccessimprovedwatersource 0.4891 0.2382 2.054 0.04136 *
## gnippppercapitausdollars 0.9359 0.2036 4.598 7.69e-06 ***
## popdenspersqkm 0.2343 0.1491 1.571 0.11770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.82 on 194 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9726
## F-statistic: 529 on 14 and 194 DF, p-value: < 2.2e-16
This creates a parsimonious model, with only 15 coefficients and an AIC of about 859.93, which is slightly lower than that of the stepwise model.
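The AIC of the selected model can be computed directly from the model object returned by bestglm, which is presumably where the value above comes from:
AIC(b1$BestModel)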
Variable selection methods are one way to approach finding the most parsimonious model in an analysis, but alternative methods also exist. The modern approach to doing this is lumped under the term shrinkage methods.
knitr::include_graphics("/media/corey/extra/predictive_workinggroup/images/shrink2.gif")
As opposed to the variable selection methods, shrinkage models fit a model that contains all the predictors in the data, and constrains or regularizes the coefficients so that they shrink towards 0, effectively eliminating the unnecessary parameters from the model.
There are two main classes of these methods, Ridge regression and the Lasso.
Ridge regression is very similar to ordinary regression, except that the coefficients are estimated with a slightly different objective. The coefficient estimates are chosen so that the quantity:
\[\text{Penalized Deviance} = \sum_i \left(y_i - \sum_k \hat{\beta}_k x_{ik}\right)^2 + \lambda \sum_k \hat{\beta}_k^2 = \text{Deviance} + \lambda \sum_k \hat{\beta}_k^2\] is minimized. This looks a lot like the ordinary deviance for the models above, except that it adds a penalization term to the fit criterion. The term \(\lambda\), with \(\lambda \geq 0\), is a tuning parameter. The ridge regression coefficients are those that both fit the data well and keep the penalty term small.
The penalty term is small when the \(\beta\)’s are close to 0, and as the value of \(\lambda\) increases, the coefficients are forced towards 0. This implies an additional estimation step, so that the “best” value of \(\lambda\) can be found.
Unlike ordinary regression, which produces one set of regression parameters, ridge regression produces a sequence of parameter sets, one for each value of \(\lambda\). The best value of \(\lambda\) is found by a grid search over a range of values.
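Before looking at how glmnet does this, here is a minimal sketch of the closed-form ridge solution, \(\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X'y\), on simulated data; glmnet scales \(\lambda\) differently, so the numbers are purely illustrative of the shrinkage behavior:
# closed-form ridge estimates on simulated, standardized predictors
# (intercept ignored for simplicity; the lambda value is arbitrary)
set.seed(1)
n <- 100; p <- 5
Xs <- scale(matrix(rnorm(n * p), n, p))
ys <- Xs %*% c(3, 0, 0, 1, 0) + rnorm(n) # only two predictors truly matter
lam <- 10
beta_ols <- solve(t(Xs) %*% Xs) %*% t(Xs) %*% ys
beta_ridge <- solve(t(Xs) %*% Xs + lam * diag(p)) %*% t(Xs) %*% ys
cbind(ols = as.vector(beta_ols), ridge = as.vector(beta_ridge)) # ridge estimates are pulled toward 0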
Here is an example using the glmnet package:
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
x<-X[,-1]
y<-as.matrix(yx[, 26])
ridgemod<-glmnet(x=x,y=y, alpha = 0, nlambda = 100)
plot(ridgemod, label = T, xvar = "lambda")
So, the y axis represents the values of the \(\beta\)’s from the model, and the x axis shows \(\log(\lambda)\). As you can see, when \(\lambda\) is small the coefficients can be quite large, and as \(\lambda\) increases the \(\beta\)’s converge to 0. This is nice, but we still need to find a “best” value of \(\lambda\), and this is typically done via cross-validation.
s1<-cv.glmnet(x =x, y=y, family="gaussian", alpha = 0) #alpha = 0 for ridge regression
plot(s1)
s1
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.970 4.870 0.6408 25
## 1se 2.241 5.453 0.7667 25
The plot shows the cross-validated mean squared error of the model at each value of \(\log(\lambda)\). The dotted lines mark the value of \(\lambda\) that minimizes the error and the largest value of \(\lambda\) whose error is within one standard error of that minimum.
Why the two values? On the log scale used in the plot, the minimizing value of \(\lambda\) is about -0.0305 and the 1 se value is about 0.807. The minimum gives the best cross-validated fit, while the 1 se value gives the most heavily penalized, and hence simplest, model whose accuracy is still comparable to that minimum.
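These numbers can be recovered from the cv.glmnet object directly:
log(s1$lambda.min) # about -0.03, the log(lambda) minimizing CV error
log(s1$lambda.1se) # about 0.81, the largest log(lambda) within 1 SE of that minimum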
How do the coefficients compare?
options(scipen=999)
ridgemod2<-glmnet(x=x,y=y, alpha = 0, lambda = s1$lambda.min)
ridgemod2
##
## Call: glmnet(x = x, y = y, alpha = 0, lambda = s1$lambda.min)
##
## Df %Dev Lambda
## 1 25 97.09 0.97
ridgemod2$beta
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia 0.138746997
## continentEurope 0.339095102
## continentNorth America 0.462673475
## continentOceania -0.004004286
## continentSouth America 0.305966695
## population. -0.096063058
## cbr -0.713848006
## cdr -3.269232678
## rate.of.natural.increase 0.811597096
## net.migration.rate -0.202903456
## imr -2.187445540
## womandlifetimeriskmaternaldeath 0.261461312
## tfr -0.278134874
## percpoplt15 -0.788972896
## percpopgt65 1.768386695
## percurban -0.062632754
## percpopinurbangt750k 0.375625628
## percpop1549hivaids2007 -2.164990903
## percmarwomcontraall -0.054160227
## percmarwomcontramodern 0.640392008
## percppundernourished0204 -0.586644083
## motorvehper1000pop0005 0.735190517
## percpopwaccessimprovedwatersource 0.741533812
## gnippppercapitausdollars 0.651949671
## popdenspersqkm 0.474059119
df1<-data.frame(names = names(coef(fm)) , beta=round(coef(fm),4), mod="gm")
df2<-data.frame(names = c(names(coef(fm)[1]), row.names(ridgemod2$beta)), beta=round(as.numeric(coef(s1, s = s1$lambda.min)),4), mod="ridge")
dfa<-rbind(df1, df2)
library(ggplot2)
dfa%>%
ggplot()+geom_point(aes(y=beta,x=names, color=mod))+theme_minimal()+theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
An important note about ridge regression is that no coefficients are forced all the way to zero, so it is not doing variable selection, only shrinkage of the coefficients. The Lasso, on the other hand, can do both.
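We can verify this by counting the nonzero coefficients in the ridge fit from above:
sum(ridgemod2$beta != 0) # 25: every predictor keeps a nonzero coefficient under ridge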
knitr::include_graphics("/media/corey/extra/predictive_workinggroup/images/cowboy.jpeg")
What is the Lasso? Lasso is short for least absolute shrinkage and selection operator. It is designed so that it produces a model that is both strong in the prediction sense, and parsimonious, in that it performs feature (variable) selection as well.
The Lasso is similar to the ridge regression model, in that it penalizes the regression parameter solution, except that the penalty term is not the sum of squared \(\beta\)’s, but the sum of the absolute values of the \(\beta\)’s.
\[ \text{Deviance} + \lambda \sum_k | \beta_k| \]
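The absolute-value penalty is what makes exact zeros possible: in the simplest case of a single standardized predictor, the Lasso estimate is the least squares estimate shrunk toward zero and truncated at zero, the so-called soft-thresholding operator. A minimal sketch of that operator:
# soft-thresholding: sign(b) * max(|b| - lambda, 0)
# (the Lasso solution for a single standardized predictor, up to scaling)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1) # returns -2, 0, 0, 1: small effects are zeroed out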
ridgemod3<-glmnet(x=x,y=y, alpha = 1) #alpha = 1 for Lasso
plot(ridgemod3, label = T, xvar = "lambda")
s2<-cv.glmnet(x =x, y=y, family="gaussian", alpha = 1) #alpha = 1 for Lasso
plot(s2)
s2
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.0638 4.291 0.7199 15
## 1se 0.4102 4.937 0.9468 13
test<-glmnet(x =x, y=y, family="gaussian", alpha = 1, lambda = s2$lambda.1se)
test
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = s2$lambda.1se)
##
## Df %Dev Lambda
## 1 12 96.8 0.4102
print(test$beta)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia .
## continentEurope .
## continentNorth America .
## continentOceania .
## continentSouth America .
## population. .
## cbr .
## cdr -4.65214216
## rate.of.natural.increase .
## net.migration.rate .
## imr -2.29039829
## womandlifetimeriskmaternaldeath .
## tfr .
## percpoplt15 -0.40522499
## percpopgt65 3.26298392
## percurban .
## percpopinurbangt750k 0.13059746
## percpop1549hivaids2007 -1.51825139
## percmarwomcontraall .
## percmarwomcontramodern 0.23071627
## percppundernourished0204 -0.35583423
## motorvehper1000pop0005 0.09962760
## percpopwaccessimprovedwatersource 0.38868056
## gnippppercapitausdollars 0.81254209
## popdenspersqkm 0.01556287
A third method, elastic net regression, blends the shrinkage of ridge regression with the feature selection of the Lasso. It was proposed by Zou and Hastie (2005).
\[\text{Deviance} + \lambda \left[ \alpha \sum_k |\beta_k| + (1-\alpha) \sum_k \beta_k^2 \right]\] Here \(\alpha\) controls the mixture of the two penalties: \(\alpha = 1\) gives the Lasso, \(\alpha = 0\) gives ridge regression, and intermediate values blend the two (this is the convention glmnet uses).
ridgemod4<-glmnet(x=x,y=y, alpha = .5, nlambda = 100) #alpha = .5 for elastic net
plot(ridgemod4, label = T, xvar = "lambda")
s3<-cv.glmnet(x =x, y=y, family="gaussian", alpha = .5) #alpha = .5 for elastic net
plot(s3)
s3
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 0.5)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.1276 4.389 0.8611 16
## 1se 0.9005 5.249 0.9849 12
test3<-glmnet(x =x, y=y, family="gaussian", alpha = .5, lambda = s3$lambda.1se)
test3$beta
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia .
## continentEurope .
## continentNorth America .
## continentOceania .
## continentSouth America .
## population. .
## cbr .
## cdr -3.82226943
## rate.of.natural.increase .
## net.migration.rate .
## imr -2.41981395
## womandlifetimeriskmaternaldeath .
## tfr .
## percpoplt15 -0.84464899
## percpopgt65 2.23532423
## percurban .
## percpopinurbangt750k 0.15596566
## percpop1549hivaids2007 -1.84708503
## percmarwomcontraall .
## percmarwomcontramodern 0.38684024
## percppundernourished0204 -0.41735812
## motorvehper1000pop0005 0.36099859
## percpopwaccessimprovedwatersource 0.62939351
## gnippppercapitausdollars 0.71701776
## popdenspersqkm 0.02748666
In these examples, we will use the Demographic and Health Survey Model Data. These data are based on the DHS surveys and are publicly available for practicing with DHS-style data, but they do not represent a real country.
In this example, we will use contraceptive choice (modern method vs. other/none) as our outcome. An excellent guide to this literature is this article.
library(haven)
dat<-url("https://github.com/coreysparks/data/blob/master/ZZIR62FL.DTA?raw=true")
model.dat<-read_dta(dat)
Here we recode some of our variables and limit our data to those women who are not currently pregnant and who are sexually active.
library(dplyr)
model.dat2<-model.dat%>%
mutate(region = v024,
modcontra= ifelse(v364 ==1,1, 0),
age = cut(v012, breaks = 5),
livchildren=v218,
educ = v106,
currpreg=v213,
wealth = as.factor(v190),
partnered = ifelse(v701<=1, 0, 1),
work = ifelse(v731%in%c(0,1), 0, 1),
knowmodern=ifelse(v301==3, 1, 0),
age2=v012^2,
rural = ifelse(v025==2, 1,0),
wantmore = ifelse(v605%in%c(1,2), 1, 0))%>%
filter(currpreg==0, v536>0, v701!=9)%>% #notpreg, sex active
dplyr::select(caseid, region, modcontra,age, age2,livchildren, educ, knowmodern, rural, wantmore, partnered,wealth, work)
knitr::kable(head(model.dat2))
caseid | region | modcontra | age | age2 | livchildren | educ | knowmodern | rural | wantmore | partnered | wealth | work |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 1 2 | 2 | 0 | (28.6,35.4] | 900 | 4 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
1 4 2 | 2 | 0 | (35.4,42.2] | 1764 | 2 | 0 | 1 | 1 | 0 | 0 | 3 | 1 |
1 4 3 | 2 | 0 | (21.8,28.6] | 625 | 3 | 1 | 1 | 1 | 0 | 0 | 3 | 1 |
1 5 1 | 2 | 0 | (21.8,28.6] | 625 | 2 | 2 | 1 | 1 | 1 | 0 | 2 | 1 |
1 6 2 | 2 | 0 | (35.4,42.2] | 1369 | 2 | 0 | 1 | 1 | 1 | 0 | 3 | 1 |
1 7 2 | 2 | 0 | (15,21.8] | 441 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
library(caret)
## Loading required package: lattice
set.seed(1115)
train<- createDataPartition(y = model.dat2$modcontra , p = .80, list=F)
dtrain<-model.dat2[train,]
dtest<-model.dat2[-train,]
library(glmnet)
library(MASS)
fm<-glm(modcontra~region+factor(age)+livchildren+wantmore+rural+factor(educ)+partnered+work+wealth, data = dtrain, family=binomial)
summary(fm)
##
## Call:
## glm(formula = modcontra ~ region + factor(age) + livchildren +
## wantmore + rural + factor(educ) + partnered + work + wealth,
## family = binomial, data = dtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3727 -0.6498 -0.4970 -0.3590 2.5673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.80866 0.26506 -10.596 < 0.0000000000000002 ***
## region 0.15475 0.03772 4.103 0.000040825388 ***
## factor(age)(21.8,28.6] 0.44413 0.18439 2.409 0.01601 *
## factor(age)(28.6,35.4] 0.54804 0.19226 2.850 0.00437 **
## factor(age)(35.4,42.2] 0.14667 0.20988 0.699 0.48468
## factor(age)(42.2,49] -0.63063 0.23623 -2.670 0.00760 **
## livchildren 0.18805 0.02976 6.318 0.000000000264 ***
## wantmore -0.16630 0.10204 -1.630 0.10316
## rural -0.62567 0.11323 -5.525 0.000000032859 ***
## factor(educ)1 0.30769 0.13045 2.359 0.01834 *
## factor(educ)2 0.59775 0.13288 4.498 0.000006849378 ***
## factor(educ)3 0.62670 0.24653 2.542 0.01102 *
## partnered 0.28746 0.10467 2.746 0.00603 **
## work 0.16342 0.10386 1.573 0.11561
## wealth2 0.13635 0.14047 0.971 0.33170
## wealth3 0.17350 0.14114 1.229 0.21898
## wealth4 0.24257 0.15013 1.616 0.10614
## wealth5 0.17383 0.18138 0.958 0.33786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3818.3 on 4128 degrees of freedom
## Residual deviance: 3536.4 on 4111 degrees of freedom
## AIC: 3572.4
##
## Number of Fisher Scoring iterations: 5
smod<-fm%>%
stepAIC(trace = T, direction = "both")
## Start: AIC=3572.39
## modcontra ~ region + factor(age) + livchildren + wantmore + rural +
## factor(educ) + partnered + work + wealth
##
## Df Deviance AIC
## - wealth 4 3539.2 3567.2
## <none> 3536.4 3572.4
## - work 1 3538.9 3572.9
## - wantmore 1 3539.0 3573.0
## - partnered 1 3543.8 3577.8
## - region 1 3553.0 3587.0
## - factor(educ) 3 3559.4 3589.4
## - rural 1 3566.7 3600.7
## - livchildren 1 3577.1 3611.1
## - factor(age) 4 3609.4 3637.4
##
## Step: AIC=3567.21
## modcontra ~ region + factor(age) + livchildren + wantmore + rural +
## factor(educ) + partnered + work
##
## Df Deviance AIC
## <none> 3539.2 3567.2
## - work 1 3541.7 3567.7
## - wantmore 1 3541.8 3567.8
## + wealth 4 3536.4 3572.4
## - partnered 1 3547.9 3573.9
## - region 1 3555.0 3581.0
## - factor(educ) 3 3564.0 3586.0
## - livchildren 1 3579.4 3605.4
## - rural 1 3590.0 3616.0
## - factor(age) 4 3611.8 3631.8
coef(smod)
## (Intercept) region factor(age)(21.8,28.6]
## -2.6270672 0.1501583 0.4524232
## factor(age)(28.6,35.4] factor(age)(35.4,42.2] factor(age)(42.2,49]
## 0.5612389 0.1621533 -0.6103720
## livchildren wantmore rural
## 0.1861566 -0.1651203 -0.6915328
## factor(educ)1 factor(educ)2 factor(educ)3
## 0.3178917 0.6052986 0.6167321
## partnered work
## 0.3043424 0.1621220
y<-dtrain$modcontra
x<-model.matrix(~factor(region)+factor(age)+livchildren+rural+wantmore+factor(educ)+partnered+work+factor(wealth), data=dtrain)
x<-(x)[,-1]
yx<-as.data.frame(cbind(x[,-1],y))
names(yx)
## [1] "factor(region)3" "factor(region)4" "factor(age)(21.8,28.6]"
## [4] "factor(age)(28.6,35.4]" "factor(age)(35.4,42.2]" "factor(age)(42.2,49]"
## [7] "livchildren" "rural" "wantmore"
## [10] "factor(educ)1" "factor(educ)2" "factor(educ)3"
## [13] "partnered" "work" "factor(wealth)2"
## [16] "factor(wealth)3" "factor(wealth)4" "factor(wealth)5"
## [19] "y"
#names(yx)<-c(paste("X",1:13, sep=""),"y")
library(bestglm)
b1<-bestglm(yx, family=binomial)
b1$BestModels
summary(b1$BestModel)
s1<-cv.glmnet(x = x,y = y, data = dtrain, family="binomial", alpha = 0)
plot(s1)
rmod<-glmnet(x,y, data = dtrain, family="binomial", alpha = 0)
plot(rmod)
rmod1<-glmnet(x,y, data = dtrain, family="binomial", alpha = 0, lambda = s1$lambda.1se)
rmod1$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## factor(region)2 0.11736906
## factor(region)3 0.20097363
## factor(region)4 0.15952400
## factor(age)(21.8,28.6] 0.07831211
## factor(age)(28.6,35.4] 0.23721168
## factor(age)(35.4,42.2] 0.06382724
## factor(age)(42.2,49] -0.27968879
## livchildren 0.06545045
## rural -0.33722253
## wantmore -0.09296988
## factor(educ)1 0.13004018
## factor(educ)2 0.27349312
## factor(educ)3 0.28825836
## partnered 0.22401406
## work 0.06631462
## factor(wealth)2 -0.02330795
## factor(wealth)3 0.00422731
## factor(wealth)4 0.13902383
## factor(wealth)5 0.16285245
s2<-cv.glmnet(x = x, y = y, family="binomial", alpha = 1)
plot(s2)
rmod2<-glmnet(x, y, family="binomial", alpha = 1)
plot(rmod2)
rmod3<-glmnet(x, y, family="binomial", alpha = 1, lambda = s2$lambda.1se)
rmod3$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## factor(region)2 .
## factor(region)3 0.04882613
## factor(region)4 .
## factor(age)(21.8,28.6] .
## factor(age)(28.6,35.4] 0.17720355
## factor(age)(35.4,42.2] .
## factor(age)(42.2,49] -0.37140425
## livchildren 0.09635800
## rural -0.62095437
## wantmore .
## factor(educ)1 .
## factor(educ)2 0.20666302
## factor(educ)3 .
## partnered 0.26927535
## work .
## factor(wealth)2 .
## factor(wealth)3 .
## factor(wealth)4 .
## factor(wealth)5 .
probabilities <- fm %>% predict(dtest, type = "response")
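# classify a woman as a modern contraceptive user when her predicted probability exceeds the observed prevalence in the test data (rather than a 0.5 cutoff)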
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra, na.rm=T), 1, 0)
# Prediction accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes, na.rm=T)
## [1] 0.6889535
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 605 111
## 1 210 106
probabilities <- predict(smod, dtest, type = "response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra, na.rm=T), 1, 0)
# Prediction accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.6860465
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 605 114
## 1 210 103
x.test <- model.matrix(~factor(region)+factor(age)+livchildren+rural+wantmore+factor(educ)+partnered+work+factor(wealth), data=dtest)
x.test<-(x.test)[,-1]
ytest<-dtest$modcontra
probabilities <- predict(object=rmod1, newx = x.test, s=s1$lambda.1se, type="response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra), 1,0)
# Model accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.7131783
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 646 127
## 1 169 90
probabilities <- predict(object=rmod3, newx = x.test, s=s2$lambda.1se, type="response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra), 1,0)
# Model accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.6879845
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 611 118
## 1 204 99
By doing any type of automated variable selection, you are letting the computer tell you which variables go into your models. A lot of people don't like this, but it is very common in the applied world when the goal is to build the best predictive model, especially with high-dimensional data.