In regression problems, when the input variables are high-dimensional (for example, when the number of predictors exceeds the number of observations), simple methods such as least squares estimation (LSE) run into serious problems.
When fitting a model to data, a key goal is to strike a balance between overfitting and underfitting. One way to prevent overfitting is to constrain the complexity of the model, and the number of explanatory variables used is one measure of that complexity. There are many ways to control the number of explanatory variables. One is variable selection (feature selection): using filter or wrapper methods to extract the best subset of explanatory variables. Another is feature construction, also known as dimension reduction, in which the original variables are mapped or transformed, as in principal component analysis and factor analysis. Variable selection is a "hard" approach: a variable either enters the model or it does not, a 0-1 choice. There are also "soft" approaches, namely the regularization family of methods.
Tools covered below: AIC, BIC, summary, leaps::regsubsets, step, glmnet::glmnet.
The goal of variable selection:
The ideal model:
The idea behind variable selection:
Let \(M_{k}\) be the model with the smallest residual MSE among all models containing \(k\) variables. The goal is then to select the best model from \(M_{0}, M_{1}, \ldots, M_{p}\), where \(p\) is the total number of variables in the data. Note that the residual MSE decreases monotonically as \(k\) grows, so this final choice must be based on an estimate of the test error rather than on the training fit alone.
Principles for variable selection methods:
Criteria for estimating the test error in variable selection:
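For reference, the standard definitions of these criteria (e.g., as given in *An Introduction to Statistical Learning*), for a model containing \(d\) predictors fitted to \(n\) observations with residual sum of squares \(RSS\), are:
\[C_{p} = \frac{1}{n}\left(RSS + 2 d \hat{\sigma}^{2}\right)\]
\[AIC = \frac{1}{n \hat{\sigma}^{2}}\left(RSS + 2 d \hat{\sigma}^{2}\right)\]
\[BIC = \frac{1}{n}\left(RSS + \log(n)\, d\, \hat{\sigma}^{2}\right)\]
\[\text{Adjusted } R^{2} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\]
where \(\hat{\sigma}^{2}\) is an estimate of the error variance and \(TSS\) is the total sum of squares. Smaller \(C_p\), AIC, and BIC values are better; a larger adjusted \(R^2\) is better.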
regsubsets in the leaps package
Example 1:
Data:
library(leaps)
heights <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/heights.csv", stringsAsFactors = FALSE)
str(heights)
## 'data.frame': 1192 obs. of 6 variables:
## $ earn : num 78500 94200 47100 78500 80070 ...
## $ height: num 74.4 65.5 63.6 63.1 63.4 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 45 58 29 91 39 26 49 46 21 26 ...
## $ race : chr "white" "white" "white" "other" ...
Regression:
subs <- regsubsets(earn ~ ., data = heights)
## Output of best-subset selection
summary(subs)
objects(summary(subs))
summary(subs)$adjr2
summary(subs)$bic
summary(subs)$cp
summary(subs)$rsq
summary(subs)$rss
## Subset selection object
## Call: regsubsets.formula(earn ~ ., data = heights)
## 7 Variables (and intercept)
## Forced in Forced out
## height FALSE FALSE
## sexmale FALSE FALSE
## ed FALSE FALSE
## age FALSE FALSE
## racehispanic FALSE FALSE
## raceother FALSE FALSE
## racewhite FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## height sexmale ed age racehispanic raceother racewhite
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " "
## 3 ( 1 ) " " "*" "*" "*" " " " " " "
## 4 ( 1 ) " " "*" "*" "*" " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## [1] "adjr2" "bic" "cp" "obj" "outmat" "rsq" "rss" "which"
## [1] 0.1148408 0.1932942 0.2135329 0.2162244 0.2163839 0.2159392 0.2152832
## [1] -132.2439 -236.7904 -260.9964 -259.0031 -253.1669 -246.4128 -239.3388
## [1] 154.317842 36.317532 6.649736 3.576249 4.336434 6.009350
## [7] 8.000000
## [1] 0.1155840 0.1946489 0.2155139 0.2188567 0.2196736 0.2198891 0.2198953
## [1] 984468909920 896459519788 873234051286 869513106603 868603814536
## [6] 868363928163 868357070519
Visualizing the best-subset results (Adjusted \(R^2\), \(C_p\), BIC):
The plot suggests that a 3-variable model is already a good choice: BIC reaches its minimum there, and \(C_p\) and the adjusted \(R^2\) improve only marginally beyond that point.
df <- data.frame(
est = c(summary(subs)$adjr2, summary(subs)$cp, summary(subs)$bic),
x = rep(1:7, 3),
type = rep(c("adjr2", "cp", "bic"), each = 7)
)
library(ggplot2)
library(plotly)
p <- ggplot(data = df, aes(x = x, y = est)) +
geom_line(colour = "blue") +
facet_grid(type ~ ., scales = "free_y") +
geom_vline(xintercept = 3, colour = "red")
ggplotly(p)
The variables and coefficients selected by best subset selection are:
library(magrittr)
subs %>% coef(3)
## (Intercept) sexmale ed age
## -42561.280 17944.913 4418.815 281.288
The final model is:
\[earn = -42561.280 + 17944.913*sexmale + 4418.815*ed + 281.288*age\]
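As a quick illustration of using these coefficients directly (the input values below are made up for the example), the predicted earnings for a hypothetical 30-year-old male with 16 years of education would be:
b <- coef(subs, 3)
# hypothetical individual: male (sexmale = 1), ed = 16, age = 30
unname(b["(Intercept)"] + b["sexmale"] * 1 + b["ed"] * 16 + b["age"] * 30)
# -42561.280 + 17944.913 + 4418.815*16 + 281.288*30, about 54523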
step
Example 1:
wages <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/wages.csv", stringsAsFactors = FALSE)
str(wages)
## 'data.frame': 1379 obs. of 6 variables:
## $ earn : num 79571 96397 48711 80478 82089 ...
## $ height: num 73.9 66.2 63.8 63.2 63.1 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ race : chr "white" "white" "white" "other" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 49 62 33 95 43 30 53 50 25 30 ...
# Starting model: assume earn depends only on height
start.mod <- lm(earn ~ height, data = wages)
empty.mod <- lm(earn ~ 1, data = wages)
full.mod <- lm(earn ~ ., data = wages)
(a) Forward stepwise selection:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "forward")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## + ed 1 1.3717e+11 1.0947e+12 28265
## + sex 1 4.6562e+10 1.1853e+12 28375
## + age 1 1.7504e+10 1.2143e+12 28408
## <none> 1.2318e+12 28426
## + race 3 4.3366e+09 1.2275e+12 28427
##
## Step: AIC=28264.95
## earn ~ height + ed
##
## Df Sum of Sq RSS AIC
## + sex 1 5.0882e+10 1.0438e+12 28201
## + age 1 3.1185e+10 1.0635e+12 28227
## <none> 1.0947e+12 28265
## + race 3 2.0128e+09 1.0926e+12 28268
##
## Step: AIC=28201.32
## earn ~ height + ed + sex
##
## Df Sum of Sq RSS AIC
## + age 1 2.8766e+10 1.0150e+12 28165
## <none> 1.0438e+12 28201
## + race 3 3.6979e+09 1.0401e+12 28202
##
## Step: AIC=28164.78
## earn ~ height + ed + sex + age
##
## Df Sum of Sq RSS AIC
## <none> 1.0150e+12 28165
## + race 3 2199277647 1.0128e+12 28168
##
## Call:
## lm(formula = earn ~ height + ed + sex + age, data = wages)
##
## Coefficients:
## (Intercept) height ed sexmale age
## -92107.5 689.6 4403.3 17229.4 294.1
The final model is: \[earn = -92107.5 + 689.6*height + 4403.3*ed + 17229.4*sexmale + 294.1*age\]
(b) Backward stepwise selection:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "backward")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## <none> 1.2318e+12 28426
## - height 1 1.1448e+11 1.3463e+12 28546
##
## Call:
## lm(formula = earn ~ height, data = wages)
##
## Coefficients:
## (Intercept) height
## -126523 2387
The final model is: \[earn = -126523 + 2387*height\]
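Note that because the search started from start.mod, the backward step could only consider dropping height, so it stopped immediately. A genuine backward search starts from the full model instead; a minimal sketch:
step(full.mod, direction = "backward")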
(c) Stepwise selection in both directions:
step(start.mod,
scope = list(upper = full.mod, lower = empty.mod),
direction = "both")
## Start: AIC=28425.76
## earn ~ height
##
## Df Sum of Sq RSS AIC
## + ed 1 1.3717e+11 1.0947e+12 28265
## + sex 1 4.6562e+10 1.1853e+12 28375
## + age 1 1.7504e+10 1.2143e+12 28408
## <none> 1.2318e+12 28426
## + race 3 4.3366e+09 1.2275e+12 28427
## - height 1 1.1448e+11 1.3463e+12 28546
##
## Step: AIC=28264.95
## earn ~ height + ed
##
## Df Sum of Sq RSS AIC
## + sex 1 5.0882e+10 1.0438e+12 28201
## + age 1 3.1185e+10 1.0635e+12 28227
## <none> 1.0947e+12 28265
## + race 3 2.0128e+09 1.0926e+12 28268
## - height 1 8.6376e+10 1.1810e+12 28368
## - ed 1 1.3717e+11 1.2318e+12 28426
##
## Step: AIC=28201.32
## earn ~ height + ed + sex
##
## Df Sum of Sq RSS AIC
## + age 1 2.8766e+10 1.0150e+12 28165
## <none> 1.0438e+12 28201
## + race 3 3.6979e+09 1.0401e+12 28202
## - height 1 2.5681e+09 1.0463e+12 28203
## - sex 1 5.0882e+10 1.0947e+12 28265
## - ed 1 1.4149e+11 1.1853e+12 28375
##
## Step: AIC=28164.78
## earn ~ height + ed + sex + age
##
## Df Sum of Sq RSS AIC
## <none> 1.0150e+12 28165
## + race 3 2.1993e+09 1.0128e+12 28168
## - height 1 4.7190e+09 1.0197e+12 28169
## - age 1 2.8766e+10 1.0438e+12 28201
## - sex 1 4.8463e+10 1.0635e+12 28227
## - ed 1 1.5463e+11 1.1696e+12 28358
##
## Call:
## lm(formula = earn ~ height + ed + sex + age, data = wages)
##
## Coefficients:
## (Intercept) height ed sexmale age
## -92107.5 689.6 4403.3 17229.4 294.1
The final model is: \[earn = -92107.5 + 689.6*height + 4403.3*ed + 17229.4*sexmale + 294.1*age\]
Note that forward stepwise selection and both-direction stepwise selection arrive at the same model.
Regularization methods:
Ridge
LASSO
SCAD
elastic net
Regularization methods add the coefficients of the explanatory variables to the cost function being minimized, which in essence penalizes models with too many parameters; the methods differ in the penalty function they use. For a linear model, the objective functions of ridge regression and the lasso are:
\[\text{Ridge:}\quad \min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}\]
\[\text{LASSO:}\quad \min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
Here \(\lambda\) is the key tuning parameter, controlling the severity of the penalty. If it is set too large, all coefficients shrink toward 0 and the model underfits; if too small, the model overfits. The value of \(\lambda\) is therefore usually chosen by cross-validation.
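In the glmnet package used below, the alpha argument selects the penalty: alpha = 0 gives ridge and alpha = 1 (the default) gives the lasso. A minimal sketch on simulated data (the variable names here are illustrative only):
library(glmnet)
set.seed(42)
x <- matrix(rnorm(100 * 10), 100, 10)  # 100 observations, 10 predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)  # only the first two predictors matter
fit.ridge <- glmnet(x, y, alpha = 0)   # ridge: L2 penalty
fit.lasso <- glmnet(x, y, alpha = 1)   # lasso: L1 penalty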
library(glmnet)
library(reshape)
# Read in the data
data1 <- read.csv("E:/R/Data/ex2data1.txt", header = F)
names(data1) <- paste0("x", 1:3)
data2 <- read.csv("E:/R/Data/ex2data2.txt", header = F)
names(data2) <- paste0("x", 1:3)
# Scatter plots
p1 <- ggplot() +
geom_point(data = data1,
aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3)))
ggplotly(p1)
p2 <- ggplot(data = data2, aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3))) +
geom_point()
ggplotly(p2)
# Construct sixth-degree polynomial features
degree = 6
X = matrix(rep(1, length(data2$x1)), ncol = 1)
for (i in 1:degree) {
for (j in 0:i) {
X <- cbind(X, (data2$x1^(i-j)) * data2$x2^j)
}
}
x <- X[, -1]
y <- data2$x3
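# Sanity check (illustrative): the expansion produces 1 + (2 + 3 + ... + 7) = 28
# columns in X, including the intercept; this matches the 28 coefficients of the
# barely-penalized fit referenced below
dim(X)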
# Fit models with the glmnet package, comparing five CV loss measures
model1 <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
model2 <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
model3 <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
model4 <- cv.glmnet(x, y, family = "binomial", type.measure = "mse")
model5 <- cv.glmnet(x, y, family = "binomial", type.measure = "mae")
# Plot the CV curves and pick the best lambda
plot(model1)
model1$lambda.1se
## [1] 0.02517923
plot(model2)
model2$lambda.1se
## [1] 0.02517923
plot(model3)
model3$lambda.1se
## [1] 0.02090424
plot(model4)
model4$lambda.1se
## [1] 0.02517923
plot(model5)
model5$lambda.1se
## [1] 0.006845195
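For each cross-validated fit, lambda.min is the value of \(\lambda\) that minimizes the CV error, while lambda.1se (used above) is the largest \(\lambda\) whose CV error is within one standard error of that minimum, a deliberately more parsimonious choice. Both are stored on the cv.glmnet object:
# compare the two candidate lambda values for, e.g., the deviance-based fit
model1$lambda.min
model1$lambda.1se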
# Extract the final fitted model
model.final <- model1$glmnet.fit
# Coefficients of the parsimonious model (at lambda.1se)
model.coef <- coef(model1$glmnet.fit, s = model1$lambda.1se)
# Coefficients of the full model (at the smallest lambda on the path)
all.coef <- coef(model1$glmnet.fit, s = min(model1$lambda))
# predict() can be used for prediction, e.g.:
# pre <- predict(model.final, newx = x, s = model1$lambda.1se, type = "class")
# table(y, pre)
# Everything below is for drawing the decision boundaries
u <- seq(-1, 1.2, len = 200)
v <- seq(-1, 1.2, len = 200)
z28 <- z9 <- matrix(0, length(u), length(v))
mapFeature <- function(u, v, degree = 6) {
  # map a point (u, v) to the same 28 polynomial features used in fitting
  out <- sapply(0:degree, function(i)
    sapply(0:i, function(j)
      u^(i - j) * v^j
    )
  )
  out <- unlist(out)
  return(out)
}
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    features <- mapFeature(u[i], v[j])
    # z9: boundary of the parsimonious model; z28: boundary of the full model
    z9[i, j] <- as.numeric(features %*% model.coef)
    z28[i, j] <- as.numeric(features %*% all.coef)
  }
}
rownames(z9) <- rownames(z28) <- as.character(u)
colnames(z9) <- colnames(z28) <- as.character(v)
z9.melted <- melt(z9)
z28.melted <- melt(z28)
z9.melted <- data.frame(z9.melted, lambda = 9)
z28.melted <- data.frame(z28.melted, lambda = 28)
zz <- rbind(z9.melted, z28.melted)
zz$lambda <- factor(zz$lambda)
colnames(zz) <- c("u", "v", "z", "lambda")
p <- ggplot() +
  geom_point(data = data2,
             aes(x = x1, y = x2, colour = factor(x3), shape = factor(x3)),
             size = 3) +
  geom_contour(data = zz,
               aes(x = u, y = v, z = z, group = lambda, colour = lambda),
               bins = 1, size = 1)
p
Example 2
Which variables predict crime? County demographics: demographic information about 440 US counties.
cts <- read.csv("E:/R/Wangzf/Study/Intro-to-R/Day 2/data/counties.csv",
stringsAsFactors = FALSE)
names(cts)
## [1] "county" "state" "area" "pop" "p18_25"
## [6] "p65" "nphysician" "nhospbeds" "phs" "pcollege"
## [11] "ppoverty" "punemployed" "avg_income" "tot_income" "region"
## [16] "crime"
| Variable | Measures |
|---|---|
| county | Name of county |
| state | Name of state |
| crime | Number of crimes per 1000 people |
| area | Land area of county |
| pop | Population of county |
| p18_25 | % between 18 and 25 years old |
| p65 | % greater than 65 years old |
| nphysician | Number of physicians |
| nhospbeds | Number of hospital beds |
| phs | % with high school degree |
| pcollege | % with college degree |
| ppoverty | % below poverty line |
| punemployed | % unemployed |
| avg_income | Per capita income |
| tot_income | Total income |
| region | Region of country |
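One way to start on the question, as a sketch along the lines of Example 1 (the variable choices here are an assumption, not a prescribed analysis): drop the identifier columns and run best subset selection with crime as the response.
library(leaps)
cts.sub <- subset(cts, select = -c(county, state))  # drop identifiers
# up to 15 columns after dummy-coding region, so allow all subset sizes
subs.crime <- regsubsets(crime ~ ., data = cts.sub, nvmax = 15)
summary(subs.crime)$bic  # compare model sizes by BIC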
Title:
Lasso and Elastic-Net Regularized Generalized Linear Models
Description:
Functions:
Data: beta_CVX
coef.glmnet
plot.cv.glmnet
glmnet.control
library(glmnet)
data(BinomialExample)
data(CVXResults)
data(CoxExample)
data(MultiGaussianExample)
data(MultinomialExample)
data(PoissonExample)
data(QuickStartExample)
data(SparseExample)
glmnet(x, y,
family = c("gaussian", "binomial", "multinomial", "poisson", "cox", "mgaussian"),
weights,
offset = NULL,
alpha = 1,
nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 0.0001),
lambda = NULL,
standardize = TRUE,
intercept = TRUE,
thresh = 1e-07,
dfmax = nvars + 1,
pmax = min(dfmax * 2+20, nvars),
exclude,
penalty.factor = rep(1, nvars),
lower.limits = -Inf, upper.limits = Inf,
maxit = 100000,
type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
type.logistic = c("Newton", "modified.Newton"),
standardize.response = FALSE,
type.multinomial = c("ungrouped", "grouped"))
Arguments:
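The most important tuning arguments are alpha and lambda. Per the package documentation, the elastic-net penalty applied is
\[\lambda \left[ \frac{(1-\alpha)}{2} \lVert \beta \rVert_{2}^{2} + \alpha \lVert \beta \rVert_{1} \right]\]
so alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty.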
Examples:
# Gaussian
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
fit1=glmnet(x,y)
print(fit1)
coef(fit1,s=0.01) # extract coefficients at a single value of lambda
predict(fit1,newx=x[1:10,],s=c(0.01,0.005)) # make predictions
#multivariate gaussian
y=matrix(rnorm(100*3),100,3)
fit1m=glmnet(x,y,family="mgaussian")
plot(fit1m,type.coef="2norm")
#binomial
g2=sample(1:2,100,replace=TRUE)
fit2=glmnet(x,g2,family="binomial")
#multinomial
g4=sample(1:4,100,replace=TRUE)
fit3=glmnet(x,g4,family="multinomial")
fit3a=glmnet(x,g4,family="multinomial",type.multinomial="grouped")
#poisson
N=500; p=20
nzc=5
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)
fit=glmnet(x,y,family="poisson")
plot(fit)
pfit = predict(fit,x,s=0.001,type="response")
plot(pfit,y)
#Cox
set.seed(10101)
N=1000;p=30
nzc=p/3
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta/3
hx=exp(fx)
ty=rexp(N,hx)
tcens=rbinom(n=N,prob=.3,size=1)# censoring indicator
y=cbind(time=ty,status=1-tcens) # y=Surv(ty,1-tcens) with library(survival)
fit=glmnet(x,y,family="cox")
plot(fit)
# Sparse
n=10000;p=200
nzc=trunc(p/10)
x=matrix(rnorm(n*p),n,p)
iz=sample(1:(n*p),size=n*p*.85,replace=FALSE)
x[iz]=0
sx=Matrix(x,sparse=TRUE)
inherits(sx,"sparseMatrix")#confirm that it is sparse
beta=rnorm(nzc)
fx=x[,seq(nzc)]%*%beta
eps=rnorm(n)
y=fx+eps
px=exp(fx)
px=px/(1+px)
ly=rbinom(n=length(px),prob=px,size=1)
system.time(fit1<-glmnet(sx,y))
system.time(fit2n<-glmnet(x,y))
Make predictions from a “glmnet” object.
## S3 method for class 'glmnet'
predict(object,
newx,
s = NULL,
type = c("link", "response", "coefficients", "nonzero", "class"),
exact = FALSE,
offset,
...)
Arguments:
Examples:
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
g2=sample(1:2,100,replace=TRUE)
g4=sample(1:4,100,replace=TRUE)
fit1=glmnet(x,y)
predict(fit1,newx=x[1:5,],s=c(0.01,0.005))
predict(fit1,type="coef")
fit2=glmnet(x,g2,family="binomial")
predict(fit2,type="response",newx=x[2:5,])
predict(fit2,type="nonzero")
fit3=glmnet(x,g4,family="multinomial")
predict(fit3,newx=x[1:3,],type="response",s=0.01)
Produces a coefficient profile plot of the coefficient paths for a fitted “glmnet” object
## S3 method for class 'glmnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, ...)
## S3 method for class 'multnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, type.coef = c("coef", "2norm"), ...)
## S3 method for class 'mrelnet'
plot(x, xvar = c("norm", "lambda", "dev"), label = FALSE, type.coef = c("coef", "2norm"), ...)
Arguments:
Details:
Examples:
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
g2=sample(1:2,100,replace=TRUE)
g4=sample(1:4,100,replace=TRUE)
fit1=glmnet(x,y)
plot(fit1)
plot(fit1,xvar="lambda",label=TRUE)
fit3=glmnet(x,g4,family="multinomial")
plot(fit3,pch=19)
Print a summary of the glmnet path at each step along the path.
## S3 method for class 'glmnet'
print(x, digits = max(3, getOption("digits") - 3), ...)
Arguments:
Details:
Example:
x = matrix(rnorm(100*20), 100, 20)
y = rnorm(100)
fit1 = glmnet(x, y)
print(fit1)
## S3 method for class 'glmnet'
coef(object,
s = NULL,
exact = FALSE,
...)
Arguments:
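A minimal usage sketch, mirroring the glmnet examples above:
x = matrix(rnorm(100*20), 100, 20)
y = rnorm(100)
fit1 = glmnet(x, y)
coef(fit1, s = 0.01)  # coefficients at a single value of lambda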