I would suggest you read Chapter 6, and especially Section 6.2, of An Introduction to Statistical Learning for a full treatment of this topic.
If you want to read the original sources for the Lasso, check out the paper by Friedman, Hastie and Tibshirani (2010).
Why do I want to talk about this? The Lasso, and regularization more generally, is a popular technique in data science today because it performs both variable selection and optimized prediction in one algorithm. It can also alleviate some problems of multicollinearity between predictors. The Lasso can be applied to linear models, generalized linear models, generalized linear mixed models and the Cox model, and there are R packages that provide this functionality.
We will first have a look at the linear model solution and extend it.
The standard linear regression model can be written:
\[\hat y_i = \hat \beta_0 + \sum_k {\hat \beta_k x_{ik}}\]
where the terms on the right form the linear predictor for the outcome (\(y_i\)), given the values of the predictors (\(x_{ik}\)). The \(\hat \beta\) values are the unbiased estimates of the regression parameters, which for the linear model are typically found by least squares estimation.
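To make the estimation concrete, here is a minimal sketch of least squares using the built-in mtcars data (an illustration only, not the data analyzed below); the coefficients from lm() match the closed-form solution \((X'X)^{-1}X'y\):
# least squares via lm() and via the normal equations, on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
X0 <- model.matrix(mpg ~ wt + hp, data = mtcars)
y0 <- mtcars$mpg
beta_hat <- solve(t(X0) %*% X0) %*% t(X0) %*% y0
cbind(lm = coef(fit), normal_equations = as.vector(beta_hat))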
Extension of the linear modeling concept to non-Gaussian outcomes:
A link function relates the mean of the outcome to the linear predictor \(\eta_i\) through a nonlinear transformation.
For example, in logistic regression:
\[\log \left( \frac{p_i}{1-p_i} \right ) = \eta_i = \sum_k {\beta_k x_{ik}}\] To get the probability from the model, we use the inverse logit transform: \[p_i = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = \frac{1}{1+\exp(-\eta_i)}\]
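As a quick numeric check (a sketch with made-up values of the linear predictor, not output from any model below), the inverse logit can be computed by hand or with the built-in plogis() function:
# a few hypothetical values of the linear predictor
eta <- c(-2, 0, 1.5)
# inverse logit by hand: p = exp(eta) / (1 + exp(eta))
exp(eta) / (1 + exp(eta))
# the same transform via the logistic distribution function
plogis(eta)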
We find the parameters of the model so that the model deviance is minimized. In the linear regression context, the deviance is the same as the residual sum of squares (RSS):
\[\text{Deviance} = \sum_i \left(y_i - \sum_k \hat{\beta}_k x_{ik}\right)^2\]
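For a Gaussian (linear) model the two quantities agree exactly; here is a minimal check, again on the built-in mtcars data rather than the data used below:
# deviance() of a gaussian linear model is just the residual sum of squares
fit0 <- lm(mpg ~ wt, data = mtcars)
deviance(fit0)
sum(residuals(fit0)^2)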
The problem is that when the model contains parameters that are zero or nearly zero, those terms add complexity without improving the fit, and criteria that penalize model size, such as the AIC, will favor a model that drops them.
The question is how to go about removing them.
You may have seen methods of regression subsetting via stepwise, forward or backward selection. These methods iteratively insert or remove predictor variables from the model, and at each step the candidate models are scored via their RSS, AIC, or some other information criterion.
Here is an example where we fit a linear model and then perform variable subset selection.
A common data science approach would be to fit a model with all predictors included, and then winnow the model to include only the significant predictors. We start with this type of model:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
X<-scale(model.matrix(e0total~. , data=prb2))
dat<-data.frame(X[,-1])
dat$y<-prb2$e0total
fm<-glm(y~., data=dat)
summary(fm)
##
## Call:
## glm(formula = y ~ ., data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3165 -1.0635 0.0717 1.0734 8.4634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.96172 0.12695 535.353 < 2e-16 ***
## continentAsia 0.10542 0.21032 0.501 0.61682
## continentEurope 0.34620 0.31736 1.091 0.27676
## continentNorth.America 0.47294 0.20704 2.284 0.02350 *
## continentOceania -0.01321 0.18164 -0.073 0.94212
## continentSouth.America 0.31581 0.16765 1.884 0.06119 .
## population. -0.07204 0.13756 -0.524 0.60109
## cbr -4.88069 3.32297 -1.469 0.14361
## cdr -3.18919 1.35799 -2.348 0.01992 *
## rate.of.natural.increase 4.58698 3.01912 1.519 0.13041
## net.migration.rate -0.09932 0.21244 -0.468 0.64069
## imr -2.25578 0.44254 -5.097 8.55e-07 ***
## womandlifetimeriskmaternaldeath 0.04779 0.22116 0.216 0.82916
## tfr 1.19527 0.86345 1.384 0.16796
## percpoplt15 -1.67783 0.78081 -2.149 0.03296 *
## percpopgt65 3.27295 0.45908 7.129 2.26e-11 ***
## percurban -0.13582 0.24230 -0.561 0.57579
## percpopinurbangt750k 0.36062 0.23572 1.530 0.12777
## percpop1549hivaids2007 -1.39839 0.27793 -5.031 1.16e-06 ***
## percmarwomcontraall -0.54894 0.39554 -1.388 0.16688
## percmarwomcontramodern 0.81473 0.36666 2.222 0.02751 *
## percppundernourished0204 -0.51257 0.19299 -2.656 0.00861 **
## motorvehper1000pop0005 0.39727 0.30291 1.312 0.19132
## percpopwaccessimprovedwatersource 0.46966 0.25031 1.876 0.06221 .
## gnippppercapitausdollars 0.36604 0.35237 1.039 0.30028
## popdenspersqkm 0.38914 0.19757 1.970 0.05039 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.368179)
##
## Null deviance: 25183.69 on 208 degrees of freedom
## Residual deviance: 616.38 on 183 degrees of freedom
## AIC: 873.15
##
## Number of Fisher Scoring iterations: 2
length(coef(fm))
## [1] 26
One traditional approach would then be to do some sort of variable selection; what people used to be taught was stepwise (backward or forward) elimination to obtain the best subset of predictors:
smod<-fm%>%
stepAIC(trace = F, direction = "both")
summary(smod)
##
## Call:
## glm(formula = y ~ continentEurope + continentNorth.America +
## continentSouth.America + cbr + cdr + rate.of.natural.increase +
## imr + tfr + percpoplt15 + percpopgt65 + percpopinurbangt750k +
## percpop1549hivaids2007 + percmarwomcontraall + percmarwomcontramodern +
## percppundernourished0204 + motorvehper1000pop0005 + percpopwaccessimprovedwatersource +
## popdenspersqkm, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2123 -1.1090 0.0285 1.1096 8.6632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9617 0.1254 542.100 < 2e-16 ***
## continentEurope 0.3607 0.2479 1.455 0.14724
## continentNorth.America 0.4237 0.1511 2.804 0.00558 **
## continentSouth.America 0.2574 0.1370 1.878 0.06186 .
## cbr -5.2992 3.2535 -1.629 0.10502
## cdr -3.2596 1.3293 -2.452 0.01511 *
## rate.of.natural.increase 4.7891 2.9565 1.620 0.10693
## imr -2.1870 0.4091 -5.346 2.56e-07 ***
## tfr 1.4301 0.8108 1.764 0.07938 .
## percpoplt15 -1.8012 0.7160 -2.516 0.01271 *
## percpopgt65 3.2943 0.4002 8.232 2.90e-14 ***
## percpopinurbangt750k 0.3451 0.1762 1.959 0.05163 .
## percpop1549hivaids2007 -1.3271 0.2527 -5.252 4.01e-07 ***
## percmarwomcontraall -0.6133 0.3749 -1.636 0.10350
## percmarwomcontramodern 0.8929 0.3238 2.757 0.00639 **
## percppundernourished0204 -0.5095 0.1834 -2.778 0.00601 **
## motorvehper1000pop0005 0.5449 0.2358 2.311 0.02189 *
## percpopwaccessimprovedwatersource 0.4383 0.2386 1.837 0.06782 .
## popdenspersqkm 0.4175 0.1679 2.487 0.01376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.284857)
##
## Null deviance: 25183.69 on 208 degrees of freedom
## Residual deviance: 624.12 on 190 degrees of freedom
## AIC: 861.76
##
## Number of Fisher Scoring iterations: 2
length(smod$coefficients)
## [1] 19
Sure enough, there are fewer predictors in the model, and those that remain have small p values.
We can also see the relative model fits of the two models:
AIC(fm)
## [1] 873.1548
AIC(smod)
## [1] 861.765
This shows that the original model had 26 parameters, while the model that used AIC to find the best subset had 19 parameters, and the model fit, in terms of AIC, went from 873.15 for the full model to 861.76 for the stepwise model.
An alternative to stepwise selection is best subset regression, which considers all possible combinations of the predictors and scores each candidate model with an information criterion (here, the AIC).
library(bestglm)
## Loading required package: leaps
yx<-data.frame(cbind(X[,-1],prb2$e0total))
b1<-bestglm(yx,IC = "AIC", family=gaussian, method = "exhaustive")
summary(b1$BestModel)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2970 -1.0015 0.1472 1.0413 7.9947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9617 0.1259 539.752 < 2e-16 ***
## continentNorth.America 0.4513 0.1403 3.216 0.00152 **
## continentSouth.America 0.3089 0.1340 2.306 0.02218 *
## cbr -5.4216 3.1666 -1.712 0.08847 .
## cdr -2.5159 1.2957 -1.942 0.05362 .
## rate.of.natural.increase 6.2180 2.8878 2.153 0.03254 *
## imr -2.1299 0.3939 -5.407 1.87e-07 ***
## percpoplt15 -1.5118 0.6976 -2.167 0.03143 *
## percpopgt65 3.8075 0.3456 11.018 < 2e-16 ***
## percpop1549hivaids2007 -1.4029 0.2328 -6.025 8.34e-09 ***
## percmarwomcontramodern 0.2580 0.1810 1.426 0.15559
## percppundernourished0204 -0.5003 0.1822 -2.745 0.00662 **
## percpopwaccessimprovedwatersource 0.4891 0.2382 2.054 0.04136 *
## gnippppercapitausdollars 0.9359 0.2036 4.598 7.69e-06 ***
## popdenspersqkm 0.2343 0.1491 1.571 0.11770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.82 on 194 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9726
## F-statistic: 529 on 14 and 194 DF, p-value: < 2.2e-16
This creates a parsimonious model, with only 15 coefficients and an AIC of about 859.93, which is slightly lower than that of the stepwise model.
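The AIC of the selected model can be computed directly from the model object returned by bestglm, which is presumably where the value above comes from:
AIC(b1$BestModel)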
Variable selection methods are one way to approach finding the most parsimonious model in an analysis, but alternative methods also exist. The modern approach to doing this is lumped under the term shrinkage methods.
knitr::include_graphics("/media/corey/extra/predictive_workinggroup/images/shrink2.gif")
As opposed to the variable selection methods, shrinkage models fit a model that contains all the predictors in the data, and constrains or regularizes the coefficients so that they shrink towards 0, effectively eliminating the unnecessary parameters from the model.
There are two main classes of these methods, Ridge regression and the Lasso.
Ridge regression is very similar to ordinary regression, except that the coefficients are estimated with a slightly different objective. The coefficient estimates are chosen so that the quantity:
\[\text{Penalized Deviance} = \sum_i \left(y_i - \sum_k \hat{\beta}_k x_{ik}\right)^2 + \lambda \sum_k \hat{\beta}_k^2 = \text{Deviance} + \lambda \sum_k \hat{\beta}_k^2\] is minimized. This looks a lot like the ordinary deviance for the models above, except that it adds a penalization term to the fit criterion. The term \(\lambda\), with \(\lambda \geq 0\), is a tuning parameter. The ridge regression coefficients are those that both fit the data well and keep the penalty term small.
The penalty term is small when the \(\beta\)’s are close to 0, and as the value of \(\lambda\) increases, the coefficients are forced towards 0. This implies an additional estimation step, so that the “best” value of \(\lambda\) can be found.
Unlike ordinary regression, which produces one set of regression parameters, ridge regression produces a sequence of parameter sets, one for each value of \(\lambda\). The best value of \(\lambda\) is found by a grid search over a range of values.
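Before looking at how glmnet does this, here is a minimal sketch of the closed-form ridge solution, \(\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X'y\), on simulated data; glmnet scales \(\lambda\) differently, so the numbers are purely illustrative of the shrinkage behavior:
# closed-form ridge estimates on simulated, standardized predictors
# (intercept ignored for simplicity; the lambda value is arbitrary)
set.seed(1)
n <- 100; p <- 5
Xs <- scale(matrix(rnorm(n * p), n, p))
ys <- Xs %*% c(3, 0, 0, 1, 0) + rnorm(n) # only two predictors truly matter
lam <- 10
beta_ols <- solve(t(Xs) %*% Xs) %*% t(Xs) %*% ys
beta_ridge <- solve(t(Xs) %*% Xs + lam * diag(p)) %*% t(Xs) %*% ys
cbind(ols = as.vector(beta_ols), ridge = as.vector(beta_ridge)) # ridge estimates are pulled toward 0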
Here is an example using the glmnet package:
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
x<-X[,-1]
y<-as.matrix(yx[, 26])
ridgemod<-glmnet(x=x,y=y, alpha = 0, nlambda = 100)
plot(ridgemod, label = T, xvar = "lambda")
So, the y axis represents the values of the \(\beta\)’s from the model, and the x axis shows \(\log(\lambda)\). As you can see, when \(\lambda\) is small the coefficients can be quite large, and as \(\lambda\) increases the \(\beta\)’s converge to 0. This is nice, but we still need to find a “best” value of \(\lambda\), and this is typically done via cross-validation.
s1<-cv.glmnet(x =x, y=y, family="gaussian", alpha = 0) #alpha = 0 for ridge regression
plot(s1)
s1
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.970 4.870 0.6408 25
## 1se 2.241 5.453 0.7667 25
The plot shows the cross-validated mean squared error of the model at each value of \(\log(\lambda)\). The dotted lines mark the value of \(\lambda\) that minimizes the error and the largest value of \(\lambda\) whose error is within one standard error of that minimum.
Why the two values? On the log scale used in the plot, the minimizing value of \(\lambda\) is about -0.0305 and the 1 se value is about 0.807. The minimum gives the best cross-validated fit, while the 1 se value gives the most heavily penalized, and hence simplest, model whose accuracy is still comparable to that minimum.
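These numbers can be recovered from the cv.glmnet object directly:
log(s1$lambda.min) # about -0.03, the log(lambda) minimizing CV error
log(s1$lambda.1se) # about 0.81, the largest log(lambda) within 1 SE of that minimum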
How do the coefficients compare?
options(scipen=999)
ridgemod2<-glmnet(x=x,y=y, alpha = 0, lambda = s1$lambda.min)
ridgemod2
##
## Call: glmnet(x = x, y = y, alpha = 0, lambda = s1$lambda.min)
##
## Df %Dev Lambda
## 1 25 97.09 0.97
ridgemod2$beta
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia 0.138746997
## continentEurope 0.339095102
## continentNorth America 0.462673475
## continentOceania -0.004004286
## continentSouth America 0.305966695
## population. -0.096063058
## cbr -0.713848006
## cdr -3.269232678
## rate.of.natural.increase 0.811597096
## net.migration.rate -0.202903456
## imr -2.187445540
## womandlifetimeriskmaternaldeath 0.261461312
## tfr -0.278134874
## percpoplt15 -0.788972896
## percpopgt65 1.768386695
## percurban -0.062632754
## percpopinurbangt750k 0.375625628
## percpop1549hivaids2007 -2.164990903
## percmarwomcontraall -0.054160227
## percmarwomcontramodern 0.640392008
## percppundernourished0204 -0.586644083
## motorvehper1000pop0005 0.735190517
## percpopwaccessimprovedwatersource 0.741533812
## gnippppercapitausdollars 0.651949671
## popdenspersqkm 0.474059119
df1<-data.frame(names = names(coef(fm)) , beta=round(coef(fm),4), mod="gm")
df2<-data.frame(names = c(names(coef(fm)[1]), row.names(ridgemod2$beta)), beta=round(as.numeric(coef(s1, s = s1$lambda.min)),4), mod="ridge")
dfa<-rbind(df1, df2)
library(ggplot2)
dfa%>%
ggplot()+geom_point(aes(y=beta,x=names, color=mod))+theme_minimal()+theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))
An important note about ridge regression is that no coefficients are forced all the way to zero, so it is not doing variable selection, only shrinkage of the coefficients. The Lasso, on the other hand, can do both.
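We can verify this by counting the nonzero coefficients in the ridge fit from above:
sum(ridgemod2$beta != 0) # 25: every predictor keeps a nonzero coefficient under ridge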
knitr::include_graphics("/media/corey/extra/predictive_workinggroup/images/cowboy.jpeg")
What is the Lasso? Lasso is short for least absolute shrinkage and selection operator. It is designed so that it produces a model that is both strong in the prediction sense, and parsimonious, in that it performs feature (variable) selection as well.
The Lasso is similar to the ridge regression model, in that it penalizes the regression parameter solution, except that the penalty term is not the sum of squared \(\beta\)’s, but the sum of the absolute values of the \(\beta\)’s.
\[ \text{Deviance} + \lambda \sum_k | \beta_k| \]
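The absolute-value penalty is what makes exact zeros possible: in the simplest case of a single standardized predictor, the Lasso estimate is the least squares estimate shrunk toward zero and truncated at zero, the so-called soft-thresholding operator. A minimal sketch of that operator:
# soft-thresholding: sign(b) * max(|b| - lambda, 0)
# (the Lasso solution for a single standardized predictor, up to scaling)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1) # returns -2, 0, 0, 1: small effects are zeroed out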
ridgemod3<-glmnet(x=x,y=y, alpha = 1) #alpha = 1 for Lasso
plot(ridgemod3, label = T, xvar = "lambda")
s2<-cv.glmnet(x =x, y=y, family="gaussian", alpha = 1) #alpha = 1 for Lasso
plot(s2)
s2
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.0638 4.291 0.7199 15
## 1se 0.4102 4.937 0.9468 13
test<-glmnet(x =x, y=y, family="gaussian", alpha = 1, lambda = s2$lambda.1se)
test
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = s2$lambda.1se)
##
## Df %Dev Lambda
## 1 12 96.8 0.4102
print(test$beta)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia .
## continentEurope .
## continentNorth America .
## continentOceania .
## continentSouth America .
## population. .
## cbr .
## cdr -4.65214216
## rate.of.natural.increase .
## net.migration.rate .
## imr -2.29039829
## womandlifetimeriskmaternaldeath .
## tfr .
## percpoplt15 -0.40522499
## percpopgt65 3.26298392
## percurban .
## percpopinurbangt750k 0.13059746
## percpop1549hivaids2007 -1.51825139
## percmarwomcontraall .
## percmarwomcontramodern 0.23071627
## percppundernourished0204 -0.35583423
## motorvehper1000pop0005 0.09962760
## percpopwaccessimprovedwatersource 0.38868056
## gnippppercapitausdollars 0.81254209
## popdenspersqkm 0.01556287
A third method, elastic net regression, blends the shrinkage of ridge regression with the feature selection of the Lasso. It was proposed by Zou and Hastie (2005).
\[\text{Deviance} + \lambda \left[ \alpha \sum_k |\beta_k| + (1-\alpha) \sum_k \beta_k^2 \right]\] Here \(\alpha\) controls the mixture of the two penalties: \(\alpha = 1\) gives the Lasso, \(\alpha = 0\) gives ridge regression, and intermediate values blend the two (this is the convention glmnet uses).
ridgemod4<-glmnet(x=x,y=y, alpha = .5, nlambda = 100) #alpha = .5 for elastic net
plot(ridgemod4, label = T, xvar = "lambda")
s3<-cv.glmnet(x =x, y=y, family="gaussian", alpha = .5) #alpha = .5 for elastic net
plot(s3)
s3
##
## Call: cv.glmnet(x = x, y = y, family = "gaussian", alpha = 0.5)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.1276 4.389 0.8611 16
## 1se 0.9005 5.249 0.9849 12
test3<-glmnet(x =x, y=y, family="gaussian", alpha = .5, lambda = s3$lambda.1se)
test3$beta
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## continentAsia .
## continentEurope .
## continentNorth America .
## continentOceania .
## continentSouth America .
## population. .
## cbr .
## cdr -3.82226943
## rate.of.natural.increase .
## net.migration.rate .
## imr -2.41981395
## womandlifetimeriskmaternaldeath .
## tfr .
## percpoplt15 -0.84464899
## percpopgt65 2.23532423
## percurban .
## percpopinurbangt750k 0.15596566
## percpop1549hivaids2007 -1.84708503
## percmarwomcontraall .
## percmarwomcontramodern 0.38684024
## percppundernourished0204 -0.41735812
## motorvehper1000pop0005 0.36099859
## percpopwaccessimprovedwatersource 0.62939351
## gnippppercapitausdollars 0.71701776
## popdenspersqkm 0.02748666
In these examples, we will use the Demographic and Health Survey Model Data. These data are based on the DHS surveys and are publicly available for practicing with DHS-style data, but they do not represent a real country.
In this example, we will use contraceptive choice (modern method vs. other/none) as our outcome. An excellent guide to this literature is this article.
library(haven)
dat<-url("https://github.com/coreysparks/data/blob/master/ZZIR62FL.DTA?raw=true")
model.dat<-read_dta(dat)
Here we recode some of our variables and limit our data to those women who are not currently pregnant and who are sexually active.
library(dplyr)
model.dat2<-model.dat%>%
mutate(region = v024,
modcontra= ifelse(v364 ==1,1, 0),
age = cut(v012, breaks = 5),
livchildren=v218,
educ = v106,
currpreg=v213,
wealth = as.factor(v190),
partnered = ifelse(v701<=1, 0, 1),
work = ifelse(v731%in%c(0,1), 0, 1),
knowmodern=ifelse(v301==3, 1, 0),
age2=v012^2,
rural = ifelse(v025==2, 1,0),
wantmore = ifelse(v605%in%c(1,2), 1, 0))%>%
filter(currpreg==0, v536>0, v701!=9)%>% #notpreg, sex active
dplyr::select(caseid, region, modcontra,age, age2,livchildren, educ, knowmodern, rural, wantmore, partnered,wealth, work)
knitr::kable(head(model.dat2))
caseid | region | modcontra | age | age2 | livchildren | educ | knowmodern | rural | wantmore | partnered | wealth | work |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 1 2 | 2 | 0 | (28.6,35.4] | 900 | 4 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
1 4 2 | 2 | 0 | (35.4,42.2] | 1764 | 2 | 0 | 1 | 1 | 0 | 0 | 3 | 1 |
1 4 3 | 2 | 0 | (21.8,28.6] | 625 | 3 | 1 | 1 | 1 | 0 | 0 | 3 | 1 |
1 5 1 | 2 | 0 | (21.8,28.6] | 625 | 2 | 2 | 1 | 1 | 1 | 0 | 2 | 1 |
1 6 2 | 2 | 0 | (35.4,42.2] | 1369 | 2 | 0 | 1 | 1 | 1 | 0 | 3 | 1 |
1 7 2 | 2 | 0 | (15,21.8] | 441 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
library(caret)
## Loading required package: lattice
set.seed(1115)
train<- createDataPartition(y = model.dat2$modcontra , p = .80, list=F)
dtrain<-model.dat2[train,]
dtest<-model.dat2[-train,]
library(glmnet)
library(MASS)
fm<-glm(modcontra~region+factor(age)+livchildren+wantmore+rural+factor(educ)+partnered+work+wealth, data = dtrain, family=binomial)
summary(fm)
##
## Call:
## glm(formula = modcontra ~ region + factor(age) + livchildren +
## wantmore + rural + factor(educ) + partnered + work + wealth,
## family = binomial, data = dtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3727 -0.6498 -0.4970 -0.3590 2.5673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.80866 0.26506 -10.596 < 0.0000000000000002 ***
## region 0.15475 0.03772 4.103 0.000040825388 ***
## factor(age)(21.8,28.6] 0.44413 0.18439 2.409 0.01601 *
## factor(age)(28.6,35.4] 0.54804 0.19226 2.850 0.00437 **
## factor(age)(35.4,42.2] 0.14667 0.20988 0.699 0.48468
## factor(age)(42.2,49] -0.63063 0.23623 -2.670 0.00760 **
## livchildren 0.18805 0.02976 6.318 0.000000000264 ***
## wantmore -0.16630 0.10204 -1.630 0.10316
## rural -0.62567 0.11323 -5.525 0.000000032859 ***
## factor(educ)1 0.30769 0.13045 2.359 0.01834 *
## factor(educ)2 0.59775 0.13288 4.498 0.000006849378 ***
## factor(educ)3 0.62670 0.24653 2.542 0.01102 *
## partnered 0.28746 0.10467 2.746 0.00603 **
## work 0.16342 0.10386 1.573 0.11561
## wealth2 0.13635 0.14047 0.971 0.33170
## wealth3 0.17350 0.14114 1.229 0.21898
## wealth4 0.24257 0.15013 1.616 0.10614
## wealth5 0.17383 0.18138 0.958 0.33786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3818.3 on 4128 degrees of freedom
## Residual deviance: 3536.4 on 4111 degrees of freedom
## AIC: 3572.4
##
## Number of Fisher Scoring iterations: 5
smod<-fm%>%
stepAIC(trace = T, direction = "both")
## Start: AIC=3572.39
## modcontra ~ region + factor(age) + livchildren + wantmore + rural +
## factor(educ) + partnered + work + wealth
##
## Df Deviance AIC
## - wealth 4 3539.2 3567.2
## <none> 3536.4 3572.4
## - work 1 3538.9 3572.9
## - wantmore 1 3539.0 3573.0
## - partnered 1 3543.8 3577.8
## - region 1 3553.0 3587.0
## - factor(educ) 3 3559.4 3589.4
## - rural 1 3566.7 3600.7
## - livchildren 1 3577.1 3611.1
## - factor(age) 4 3609.4 3637.4
##
## Step: AIC=3567.21
## modcontra ~ region + factor(age) + livchildren + wantmore + rural +
## factor(educ) + partnered + work
##
## Df Deviance AIC
## <none> 3539.2 3567.2
## - work 1 3541.7 3567.7
## - wantmore 1 3541.8 3567.8
## + wealth 4 3536.4 3572.4
## - partnered 1 3547.9 3573.9
## - region 1 3555.0 3581.0
## - factor(educ) 3 3564.0 3586.0
## - livchildren 1 3579.4 3605.4
## - rural 1 3590.0 3616.0
## - factor(age) 4 3611.8 3631.8
coef(smod)
## (Intercept) region factor(age)(21.8,28.6]
## -2.6270672 0.1501583 0.4524232
## factor(age)(28.6,35.4] factor(age)(35.4,42.2] factor(age)(42.2,49]
## 0.5612389 0.1621533 -0.6103720
## livchildren wantmore rural
## 0.1861566 -0.1651203 -0.6915328
## factor(educ)1 factor(educ)2 factor(educ)3
## 0.3178917 0.6052986 0.6167321
## partnered work
## 0.3043424 0.1621220
y<-dtrain$modcontra
x<-model.matrix(~factor(region)+factor(age)+livchildren+rural+wantmore+factor(educ)+partnered+work+factor(wealth), data=dtrain)
x<-(x)[,-1]
yx<-as.data.frame(cbind(x[,-1],y))
names(yx)
## [1] "factor(region)3" "factor(region)4" "factor(age)(21.8,28.6]"
## [4] "factor(age)(28.6,35.4]" "factor(age)(35.4,42.2]" "factor(age)(42.2,49]"
## [7] "livchildren" "rural" "wantmore"
## [10] "factor(educ)1" "factor(educ)2" "factor(educ)3"
## [13] "partnered" "work" "factor(wealth)2"
## [16] "factor(wealth)3" "factor(wealth)4" "factor(wealth)5"
## [19] "y"
#names(yx)<-c(paste("X",1:13, sep=""),"y")
library(bestglm)
b1<-bestglm(yx, family=binomial)
b1$BestModels
summary(b1$BestModel)
s1<-cv.glmnet(x = x,y = y, data = dtrain, family="binomial", alpha = 0)
plot(s1)
rmod<-glmnet(x,y, data = dtrain, family="binomial", alpha = 0)
plot(rmod)
rmod1<-glmnet(x,y, data = dtrain, family="binomial", alpha = 0, lambda = s1$lambda.1se)
rmod1$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## factor(region)2 0.11736906
## factor(region)3 0.20097363
## factor(region)4 0.15952400
## factor(age)(21.8,28.6] 0.07831211
## factor(age)(28.6,35.4] 0.23721168
## factor(age)(35.4,42.2] 0.06382724
## factor(age)(42.2,49] -0.27968879
## livchildren 0.06545045
## rural -0.33722253
## wantmore -0.09296988
## factor(educ)1 0.13004018
## factor(educ)2 0.27349312
## factor(educ)3 0.28825836
## partnered 0.22401406
## work 0.06631462
## factor(wealth)2 -0.02330795
## factor(wealth)3 0.00422731
## factor(wealth)4 0.13902383
## factor(wealth)5 0.16285245
s2<-cv.glmnet(x = x, y = y, family="binomial", alpha = 1)
plot(s2)
rmod2<-glmnet(x, y, family="binomial", alpha = 1)
plot(rmod2)
rmod3<-glmnet(x, y, family="binomial", alpha = 1, lambda = s2$lambda.1se)
rmod3$beta
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## factor(region)2 .
## factor(region)3 0.04882613
## factor(region)4 .
## factor(age)(21.8,28.6] .
## factor(age)(28.6,35.4] 0.17720355
## factor(age)(35.4,42.2] .
## factor(age)(42.2,49] -0.37140425
## livchildren 0.09635800
## rural -0.62095437
## wantmore .
## factor(educ)1 .
## factor(educ)2 0.20666302
## factor(educ)3 .
## partnered 0.26927535
## work .
## factor(wealth)2 .
## factor(wealth)3 .
## factor(wealth)4 .
## factor(wealth)5 .
probabilities <- fm %>% predict(dtest, type = "response")
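# classify a woman as a modern contraceptive user when her predicted probability exceeds the observed prevalence in the test data (rather than a 0.5 cutoff)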
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra, na.rm=T), 1, 0)
# Prediction accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes, na.rm=T)
## [1] 0.6889535
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 605 111
## 1 210 106
probabilities <- predict(smod, dtest, type = "response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra, na.rm=T), 1, 0)
# Prediction accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.6860465
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 605 114
## 1 210 103
x.test <- model.matrix(~factor(region)+factor(age)+livchildren+rural+wantmore+factor(educ)+partnered+work+factor(wealth), data=dtest)
x.test<-(x.test)[,-1]
ytest<-dtest$modcontra
probabilities <- predict(object=rmod1, newx = x.test, s=s1$lambda.1se, type="response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra), 1,0)
# Model accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.7131783
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 646 127
## 1 169 90
probabilities <- predict(object=rmod3, newx = x.test, s=s2$lambda.1se, type="response")
predicted.classes <- ifelse(probabilities > mean(dtest$modcontra), 1,0)
# Model accuracy
observed.classes <- dtest$modcontra
mean(predicted.classes == observed.classes)
## [1] 0.6879845
table(predicted.classes, observed.classes)
## observed.classes
## predicted.classes 0 1
## 0 611 118
## 1 204 99
By doing any type of automated variable selection, you are letting the computer tell you which variables go into your models. A lot of people don't like this, but it is very common in the applied world when the goal is to build the best predictive model, especially with high-dimensional data.