In regression problems, when the input variables are high-dimensional (for example, when the number of predictors exceeds the number of observations), simple methods such as least squares estimation (LSE) run into serious problems.
When fitting a model to data, a key goal is to strike a balance between overfitting and underfitting. One way to prevent overfitting is to constrain the complexity of the model, and the number of explanatory variables used is one measure of that complexity. There are many ways to control the number of explanatory variables. One is variable selection (feature selection): using filter or wrapper methods to extract the best subset of explanatory variables. Another is feature construction, also known as dimension reduction, in which the original variables are mapped or transformed, as in principal component analysis and factor analysis. Variable selection is a "hard" approach: a variable either enters the model or it does not, a 0-1 choice. There are also "soft" approaches, namely the regularization family of methods.
Tools covered below: AIC, BIC, summary, leaps::regsubsets, step, glmnet::glmnet.
The goal of variable selection:
The ideal model:
The idea behind variable selection:
Let \(M_{k}\) be the model with the smallest residual MSE among all models containing \(k\) variables. The goal is then to select the best model from \(M_{0}, M_{1}, \ldots, M_{p}\), where \(p\) is the total number of variables in the data. Note that the residual MSE decreases monotonically as \(k\) grows, so this final choice must be based on an estimate of the test error rather than on the training fit alone.
Principles for variable selection methods:
Criteria for estimating the test error in variable selection:
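For reference, the standard definitions of these criteria (e.g., as given in *An Introduction to Statistical Learning*), for a model containing \(d\) predictors fitted to \(n\) observations with residual sum of squares \(RSS\), are:
\[C_{p} = \frac{1}{n}\left(RSS + 2 d \hat{\sigma}^{2}\right)\]
\[AIC = \frac{1}{n \hat{\sigma}^{2}}\left(RSS + 2 d \hat{\sigma}^{2}\right)\]
\[BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^{2}\right)\]
\[\text{Adjusted } R^{2} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\]
where \(\hat{\sigma}^{2}\) is an estimate of the error variance and \(TSS\) is the total sum of squares. Smaller \(C_p\), AIC, and BIC values are better; a larger adjusted \(R^2\) is better.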
regsubsets in the leaps package
Example 1:
Data:
library(leaps)
heights <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/heights.csv", stringsAsFactors = FALSE)
str(heights)
## 'data.frame': 1192 obs. of 6 variables:
## $ earn : num 78500 94200 47100 78500 80070 ...
## $ height: num 74.4 65.5 63.6 63.1 63.4 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 45 58 29 91 39 26 49 46 21 26 ...
## $ race : chr "white" "white" "white" "other" ...
Regression:
subs <- regsubsets(earn ~ ., data = heights)
## Output of best-subset selection
summary(subs)
objects(summary(subs))
summary(subs)$adjr2
summary(subs)$bic
summary(subs)$cp
summary(subs)$rsq
summary(subs)$rss
## Subset selection object
## Call: regsubsets.formula(earn ~ ., data = heights)
## 7 Variables (and intercept)
## Forced in Forced out
## height FALSE FALSE
## sexmale FALSE FALSE
## ed FALSE FALSE
## age FALSE FALSE
## racehispanic FALSE FALSE
## raceother FALSE FALSE
## racewhite FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## height sexmale ed age racehispanic raceother racewhite
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " "
## 3 ( 1 ) " " "*" "*" "*" " " " " " "
## 4 ( 1 ) " " "*" "*" "*" " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## [1] "adjr2" "bic" "cp" "obj" "outmat" "rsq" "rss" "which"
## [1] 0.1148408 0.1932942 0.2135329 0.2162244 0.2163839 0.2159392 0.2152832
## [1] -132.2439 -236.7904 -260.9964 -259.0031 -253.1669 -246.4128 -239.3388
## [1] 154.317842 36.317532 6.649736 3.576249 4.336434 6.009350
## [7] 8.000000
## [1] 0.1155840 0.1946489 0.2155139 0.2188567 0.2196736 0.2198891 0.2198953
## [1] 984468909920 896459519788 873234051286 869513106603 868603814536
## [6] 868363928163 868357070519
Visualizing the best-subset results (Adjusted \(R^2\), \(C_p\), BIC):
The plot suggests that a 3-variable model is already a good choice: BIC reaches its minimum there, and \(C_p\) and the adjusted \(R^2\) improve only marginally beyond that point.
df <- data.frame(
est = c(summary(subs)$adjr2, summary(subs)$cp, summary(subs)$bic),
x = rep(1:7, 3),
type = rep(c("adjr2", "cp", "bic"), each = 7)
)
library(ggplot2)
library(plotly)
p <- ggplot(data = df, aes(x = x, y = est)) +
geom_line(colour = "blue") +
facet_grid(type ~ ., scales = "free_y") +
geom_vline(xintercept = 3, colour = "red")
ggplotly(p)
The variables and coefficients selected by best subset selection are:
library(magrittr)
subs %>% coef(3)
## (Intercept) sexmale ed age
## -42561.280 17944.913 4418.815 281.288
The final model is:
\[earn = -42561.280 + 17944.913*sexmale + 4418.815*ed + 281.288*age\]
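As a quick illustration of using these coefficients directly (the input values below are made up for the example), the predicted earnings for a hypothetical 30-year-old male with 16 years of education would be:
b <- coef(subs, 3)
# hypothetical individual: male (sexmale = 1), ed = 16, age = 30
unname(b["(Intercept)"] + b["sexmale"] * 1 + b["ed"] * 16 + b["age"] * 30)
# -42561.280 + 17944.913 + 4418.815*16 + 281.288*30, about 54523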
step
Example 1:
wages <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/wages.csv", stringsAsFactors = FALSE)
str(wages)
## 'data.frame': 1379 obs. of 6 variables:
## $ earn : num 79571 96397 48711 80478 82089 ...
## $ height: num 73.9 66.2 63.8 63.2 63.1 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ race : chr "white" "white" "white" "other" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 49 62 33 95 43 30 53 50 25 30 ...
# Starting model: assume earn depends only on height
start.mod <- lm(earn ~ height, data = wages)
empty.mod <- lm(earn ~ 1, data = wages)
full.mod <- lm(earn ~ ., data = wages)
(a) Forward stepwise selection:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "forward")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## + ed 1 1.3717e+11 1.0947e+12 28265
## + sex 1 4.6562e+10 1.1853e+12 28375
## + age 1 1.7504e+10 1.2143e+12 28408
## <none> 1.2318e+12 28426
## + race 3 4.3366e+09 1.2275e+12 28427
##
## Step: AIC=28264.95
## earn ~ height + ed
##
## Df Sum of Sq RSS AIC
## + sex 1 5.0882e+10 1.0438e+12 28201
## + age 1 3.1185e+10 1.0635e+12 28227
## <none> 1.0947e+12 28265
## + race 3 2.0128e+09 1.0926e+12 28268
##
## Step: AIC=28201.32
## earn ~ height + ed + sex
##
## Df Sum of Sq RSS AIC
## + age 1 2.8766e+10 1.0150e+12 28165
## <none> 1.0438e+12 28201
## + race 3 3.6979e+09 1.0401e+12 28202
##
## Step: AIC=28164.78
## earn ~ height + ed + sex + age
##
## Df Sum of Sq RSS AIC
## <none> 1.0150e+12 28165
## + race 3 2199277647 1.0128e+12 28168
##
## Call:
## lm(formula = earn ~ height + ed + sex + age, data = wages)
##
## Coefficients:
## (Intercept) height ed sexmale age
## -92107.5 689.6 4403.3 17229.4 294.1
The final model is: \[earn = -92107.5 + 689.6*height + 4403.3*ed + 17229.4*sexmale + 294.1*age\]
(b) Backward stepwise selection:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "backward")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## <none> 1.2318e+12 28426
## - height 1 1.1448e+11 1.3463e+12 28546
##
## Call:
## lm(formula = earn ~ height, data = wages)
##
## Coefficients:
## (Intercept) height
## -126523 2387
The final model is: \[earn = -126523 + 2387*height\]
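Note that because the search started from start.mod, the backward step could only consider dropping height, so it stopped immediately. A genuine backward search starts from the full model instead; a minimal sketch:
step(full.mod, direction = "backward")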
(c) Stepwise selection in both directions:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "both")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## + ed 1 1.3717e+11 1.0947e+12 28265
## + sex 1 4.6562e+10 1.1853e+12 28375
## + age 1 1.7504e+10 1.2143e+12 28408
## <none> 1.2318e+12 28426
## + race 3 4.3366e+09 1.2275e+12 28427
## - height 1 1.1448e+11 1.3463e+12 28546
##
## Step: AIC=28264.95
## earn ~ height + ed
##
## Df Sum of Sq RSS AIC
## + sex 1 5.0882e+10 1.0438e+12 28201
## + age 1 3.1185e+10 1.0635e+12 28227
## <none> 1.0947e+12 28265
## + race 3 2.0128e+09 1.0926e+12 28268
## - height 1 8.6376e+10 1.1810e+12 28368
## - ed 1 1.3717e+11 1.2318e+12 28426
##
## Step: AIC=28201.32
## earn ~ height + ed + sex
##
## Df Sum of Sq RSS AIC
## + age 1 2.8766e+10 1.0150e+12 28165
## <none> 1.0438e+12 28201
## + race 3 3.6979e+09 1.0401e+12 28202
## - height 1 2.5681e+09 1.0463e+12 28203
## - sex 1 5.0882e+10 1.0947e+12 28265
## - ed 1 1.4149e+11 1.1853e+12 28375
##
## Step: AIC=28164.78
## earn ~ height + ed + sex + age
##
## Df Sum of Sq RSS AIC
## <none> 1.0150e+12 28165
## + race 3 2.1993e+09 1.0128e+12 28168
## - height 1 4.7190e+09 1.0197e+12 28169
## - age 1 2.8766e+10 1.0438e+12 28201
## - sex 1 4.8463e+10 1.0635e+12 28227
## - ed 1 1.5463e+11 1.1696e+12 28358
##
## Call:
## lm(formula = earn ~ height + ed + sex + age, data = wages)
##
## Coefficients:
## (Intercept) height ed sexmale age
## -92107.5 689.6 4403.3 17229.4 294.1
The final model is: \[earn = -92107.5 + 689.6*height + 4403.3*ed + 17229.4*sexmale + 294.1*age\]
Note that forward stepwise selection and both-direction stepwise selection arrive at the same model.
Regularization methods:
Ridge
LASSO
SCAD
elastic net
Regularization methods add the coefficients of the explanatory variables to the cost function being minimized, which in essence penalizes models with too many parameters; the methods differ in the penalty function they use. For a linear model, the objective functions of ridge regression and the lasso are:
\[\text{Ridge:}\quad \min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}\]
\[\text{LASSO:}\quad \min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
Here \(\lambda\) is the key tuning parameter, controlling the severity of the penalty. If it is set too large, all coefficients shrink toward 0 and the model underfits; if too small, the model overfits. The value of \(\lambda\) is therefore usually chosen by cross-validation.
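In the glmnet package used below, the alpha argument selects the penalty: alpha = 0 gives ridge and alpha = 1 (the default) gives the lasso. A minimal sketch on simulated data (the variable names here are illustrative only):
library(glmnet)
set.seed(42)
x <- matrix(rnorm(100 * 10), 100, 10)  # 100 observations, 10 predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)  # only the first two predictors matter
fit.ridge <- glmnet(x, y, alpha = 0)   # ridge: L2 penalty
fit.lasso <- glmnet(x, y, alpha = 1)   # lasso: L1 penalty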
library(glmnet)
library(reshape)
# Read in the data
data1 <- read.csv("E:/R/Data/ex2data1.txt", header = F)
names(data1) <- paste0("x", 1:3)
data2 <- read.csv("E:/R/Data/ex2data2.txt", header = F)
names(data2) <- paste0("x", 1:3)
# Scatter plots
p1 <- ggplot() +
geom_point(data = data1,
aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3)))
ggplotly(p1)
p2 <- ggplot(data = data2, aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3))) +
geom_point()
ggplotly(p2)
# Construct sixth-degree polynomial features
degree = 6
X = matrix(rep(1, length(data2$x1)), ncol = 1)
for (i in 1:degree) {
for (j in 0:i) {
X <- cbind(X, (data2$x1^(i-j)) * data2$x2^j)
}
}
x <- X[, -1]
y <- data2$x3
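# Sanity check (illustrative): the expansion produces 1 + (2 + 3 + ... + 7) = 28
# columns in X, including the intercept; this matches the 28 coefficients of the
# barely-penalized fit referenced below
dim(X)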
# Fit models with the glmnet package, comparing five CV loss measures
model1 <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
model2 <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
model3 <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
model4 <- cv.glmnet(x, y, family = "binomial", type.measure = "mse")
model5 <- cv.glmnet(x, y, family = "binomial", type.measure = "mae")
# Plot the CV curves and pick the best lambda
plot(model1)
model1$lambda.1se
## [1] 0.02517923
plot(model2)
model2$lambda.1se
## [1] 0.02517923
plot(model3)
model3$lambda.1se
## [1] 0.02090424
plot(model4)
model4$lambda.1se
## [1] 0.02517923
plot(model5)
model5$lambda.1se
## [1] 0.006845195
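For each cross-validated fit, lambda.min is the value of \(\lambda\) that minimizes the CV error, while lambda.1se (used above) is the largest \(\lambda\) whose CV error is within one standard error of that minimum, a deliberately more parsimonious choice. Both are stored on the cv.glmnet object:
# compare the two candidate lambda values for, e.g., the deviance-based fit
model1$lambda.min
model1$lambda.1se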
# Extract the final fitted model
model.final <- model1$glmnet.fit
# Coefficients of the parsimonious model (at lambda.1se)
model.coef <- coef(model1$glmnet.fit, s = model1$lambda.1se)
# Coefficients of the full model (at the smallest lambda on the path)
all.coef <- coef(model1$glmnet.fit, s = min(model1$lambda))
# predict() can be used for prediction, e.g.:
# pre <- predict(model.final, newx = x, s = model1$lambda.1se, type = "class")
# table(y, pre)
# Everything below is for drawing the decision boundaries
u <- seq(-1, 1.2, len = 200)
v <- seq(-1, 1.2, len = 200)
z28 <- z9 <- matrix(0, length(u), length(v))
mapFeature <- function(u, v, degree = 6) {
  # map a point (u, v) to the same 28 polynomial features used in fitting
  out <- sapply(0:degree, function(i)
    sapply(0:i, function(j)
      u^(i - j) * v^j
    )
  )
  out <- unlist(out)
  return(out)
}
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    features <- mapFeature(u[i], v[j])
    # z9: boundary of the parsimonious model; z28: boundary of the full model
    z9[i, j] <- as.numeric(features %*% model.coef)
    z28[i, j] <- as.numeric(features %*% all.coef)
  }
}
rownames(z9) <- rownames(z28) <- as.character(u)
colnames(z9) <- colnames(z28) <- as.character(v)
z9.melted <- melt(z9)
z28.melted <- melt(z28)
z9.melted <- data.frame(z9.melted, lambda = 9)
z28.melted <- data.frame(z28.melted, lambda = 28)
zz <- rbind(z9.melted, z28.melted)
zz$lambda <- factor(zz$lambda)
colnames(zz) <- c("u", "v", "z", "lambda")
p <- ggplot() +
  geom_point(data = data2,
             aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3)),
             size = 3) +
  geom_contour(data = zz,
               aes(x = u, y = v, z = z, group = lambda, colour = lambda),
               bins = 1, size = 1)
p
Example 2
Which variables predict crime? County demographics: demographic information about 440 US counties.
cts <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/counties.csv",
stringsAsFactors = FALSE)
names(cts)
## [1] "county" "state" "area" "pop" "p18_25"
## [6] "p65" "nphysician" "nhospbeds" "phs" "pcollege"
## [11] "ppoverty" "punemployed" "avg_income" "tot_income" "region"
## [16] "crime"
| Variable | Measures |
|---|---|
| county | Name of county |
| state | Name of state |
| crime | Number of crimes per 1000 people |
| area | Land area of county |
| pop | Population of county |
| p18_25 | % between 18 and 25 years old |
| p65 | % greater than 65 years old |
| nphysician | Number of physicians |
| nhospbeds | Number of hospital beds |
| phs | % with high school degree |
| pcollege | % with college degree |
| ppoverty | % below poverty line |
| punemployed | % unemployed |
| avg_income | Per capita income |
| tot_income | Total income |
| region | Region of country |
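One way to start on the question, as a sketch along the lines of Example 1 (the variable choices here are an assumption, not a prescribed analysis): drop the identifier columns and run best subset selection with crime as the response.
library(leaps)
cts.sub <- subset(cts, select = -c(county, state))  # drop identifiers
# up to 15 columns after dummy-coding region, so allow all subset sizes
subs.crime <- regsubsets(crime ~ ., data = cts.sub, nvmax = 15)
summary(subs.crime)$bic  # compare model sizes by BIC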
Title:
Lasso and Elastic-Net Regularized Generalized Linear Models
Description:
Functions:
Data: beta_CVX
coef.glmnet
plot.cv.glmnet
glmnet.control
library(glmnet)
data(BinomialExample)
data(CVXResults)
data(CoxExample)
data(MultiGaussianExample)
data(MultinomialExample)
data(PoissonExample)
data(QuickStartExample)
data(SparseExample)
glmnet(x, y,
family = c("gaussian", "binomial", "multinomial", "poisson", "cox", "mgaussian"),
weights,
offset = NULL,
alpha = 1,
nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 0.0001),
lambda = NULL,
standardize = TRUE,
intercept = TRUE,
thresh = 1e-07,
dfmax = nvars + 1,
pmax = min(dfmax * 2+20, nvars),
exclude,
penalty.factor = rep(1, nvars),
lower.limits = -Inf, upper.limits = Inf,
maxit = 100000,
type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
type.logistic = c("Newton", "modified.Newton"),
standardize.response = FALSE,
type.multinomial = c("ungrouped", "grouped"))
Arguments:
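The most important tuning arguments are alpha and lambda. Per the package documentation, the elastic-net penalty applied is
\[\lambda \left[ \frac{(1-\alpha)}{2} \lVert \beta \rVert_{2}^{2} + \alpha \lVert \beta \rVert_{1} \right]\]
so alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty.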
Examples:
# Gaussian
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
fit1=glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions
#multivariate gaussian
y=matrix(rnorm(100*3),100,3)
fit1m=glmnet(x,y,family="mgaussian")
plot(fit1m,type.coef="2norm")
#binomial
g2=sample(1:2,100,replace=TRUE)
fit2=glmnet(x,g2,family="binomial")
#multinomial
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
fit3a=glmnet(x,g4,family="multinomial",type.multinomial="grouped")
#poisson
N=500; p=20
nzc=5
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)
fit=glmnet(x,y,family="poisson")
plot(fit)
pfit = predict(fit,x,s=0.001,type="response")
plot(pfit,y)
#Cox
set.seed(10101)
N=1000;p=30
nzc=p/3
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta/3
hx=exp(fx)
ty=rexp(N,hx)
tcens=rbinom(n=N,prob=.3,size=1)# censoring indicator
y=cbind(time=ty,status=1-tcens) # y=Surv(ty,1-tcens) with library(survival)
fit=glmnet(x,y,family="cox")
plot(fit)
# Sparse
n=10000;p=200
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
iz=sample(1:(n*p),size=n*p*.85,replace=FALSE)
x[iz]=0
sx=Matrix(x,sparse=TRUE)
inherits(sx,"sparseMatrix")#confirm that it is sparse
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta
eps=rnorm(n)
y=fx+eps
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)
system.time(fit1<-glmnet(sx,y))
system.time(fit2n<-glmnet(x,y))
Make predictions from a “glmnet” object.
## S3 method for class 'glmnet'
predict(object,
newx,
s = NULL,
type = c("link", "response", "coefficients", "nonzero", "class"),
exact = FALSE,
offset,
...)
Arguments:
Examples:
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
g2=sample(1:2,100,replace=TRUE)
g4=sample(1:4,100,replace=TRUE)
fit1=glmnet(x,y)
predict(fit1,newx=x[1:5,],s=c(0.01,0.005))
predict(fit1,type="coef")
fit2=glmnet(x,g2,family="binomial")
predict(fit2,type="response",newx=x[2:5,])
predict(fit2,type="nonzero")
fit3=glmnet(x,g4,family="multinomial")
predict(fit3,newx=x[1:3,],type="response",s=0.01)
Produces a coefficient profile plot of the coefficient paths for a fitted “glmnet” object
## S3 method for class 'glmnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, ...)
## S3 method for class 'multnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, type.coef = c("coef", "2norm"), ...)
## S3 method for class 'mrelnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, type.coef = c("coef", "2norm"), ...)
Arguments:
Details:
Examples:
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
g2=sample(1:2,100,replace=TRUE)
g4=sample(1:4,100,replace=TRUE)
fit1=glmnet(x,y)
plot(fit1)
plot(fit1,xvar="lambda",label=TRUE)
fit3=glmnet(x,g4,family="multinomial")
plot(fit3,pch=19)
Print a summary of the glmnet path at each step along the path.
## S3 method for class 'glmnet'
print(x, digits = max(3, getOption("digits") - 3), ...)
Arguments:
Details:
Example:
x = matrix(rnorm(100*20), 100, 20)
y = rnorm(100)
fit1 = glmnet(x, y)
print(fit1)
## S3 method for class 'glmnet'
coef(object,
s = NULL,
exact = FALSE,
...)
Arguments:
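A minimal usage sketch, mirroring the glmnet examples above:
x = matrix(rnorm(100*20), 100, 20)
y = rnorm(100)
fit1 = glmnet(x, y)
coef(fit1, s = 0.01)  # coefficients at a single value of lambda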