Having a large number of variables in a data set leads to several problems in modeling. For more information, check out this Quora thread where Prof. Yoshua Bengio contributed to the answers: https://www.quora.com/What-is-the-curse-of-dimensionality.
In this note I am going to discuss a few dimensionality reduction techniques. They are typically grouped into Feature Selection (or Elimination) and Feature Extraction. I will use the white wine data set from the UCI machine learning repository, treating `quality` as the dependent variable and the remaining 11 variables as independent variables.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(pastecs)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
library(pander)
library(corrplot)
wine<- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-white.csv")
panderOptions('table.split.table', 120)
panderOptions('table.style', 'grid')
panderOptions('table.alignment.default', 'right')
panderOptions('table.alignment.rownames', 'left')
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed_acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile_acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric_acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual_sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free_sulfur_dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total_sulfur_dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
pander(stat.desc(wine), style = 'simple')
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide |
|---|---|---|---|---|---|---|
| nbr.val | 4898 | 4898 | 4898 | 4898 | 4898 | 4898 |
| nbr.null | 0 | 0 | 19 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 3.8 | 0.08 | 0 | 0.6 | 0.009 | 2 |
| max | 14.2 | 1.1 | 1.66 | 65.8 | 0.346 | 289 |
| range | 10.4 | 1.02 | 1.66 | 65.2 | 0.337 | 287 |
| sum | 33575 | 1363 | 1637 | 31305 | 224.2 | 172939 |
| median | 6.8 | 0.26 | 0.32 | 5.2 | 0.043 | 34 |
| mean | 6.855 | 0.2782 | 0.3342 | 6.391 | 0.04577 | 35.31 |
| SE.mean | 0.01206 | 0.00144 | 0.001729 | 0.07247 | 0.0003122 | 0.243 |
| CI.mean.0.95 | 0.02364 | 0.002823 | 0.00339 | 0.1421 | 0.000612 | 0.4764 |
| var | 0.7121 | 0.01016 | 0.01465 | 25.73 | 0.0004773 | 289.2 |
| std.dev | 0.8439 | 0.1008 | 0.121 | 5.072 | 0.02185 | 17.01 |
| coef.var | 0.1231 | 0.3623 | 0.3621 | 0.7936 | 0.4773 | 0.4817 |
| | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|
| nbr.val | 4898 | 4898 | 4898 | 4898 | 4898 | 4898 |
| nbr.null | 0 | 0 | 0 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 9 | 0.9871 | 2.72 | 0.22 | 8 | 3 |
| max | 440 | 1.039 | 3.82 | 1.08 | 14.2 | 9 |
| range | 431 | 0.05187 | 1.1 | 0.86 | 6.2 | 6 |
| sum | 677690 | 4869 | 15616 | 2399 | 51499 | 28790 |
| median | 134 | 0.9937 | 3.18 | 0.47 | 10.4 | 6 |
| mean | 138.4 | 0.994 | 3.188 | 0.4898 | 10.51 | 5.878 |
| SE.mean | 0.6072 | 4.274e-05 | 0.002158 | 0.001631 | 0.01758 | 0.01265 |
| CI.mean.0.95 | 1.19 | 8.378e-05 | 0.00423 | 0.003197 | 0.03447 | 0.02481 |
| var | 1806 | 8.946e-06 | 0.0228 | 0.01302 | 1.514 | 0.7844 |
| std.dev | 42.5 | 0.002991 | 0.151 | 0.1141 | 1.231 | 0.8856 |
| coef.var | 0.3072 | 0.003009 | 0.04736 | 0.233 | 0.117 | 0.1507 |
Feature selection methods help a modeler choose which features (also known as variables or attributes) to include in the model from a large set of attributes. Note that ideally the independent or predictor variables included in the model should be supported by theory. However, in many real-world applications there is no clear guidance from theory on which specific set of variables to include. In such situations feature selection helps.
The idea here is that we will retain the independent variables that have low correlations with the other variables. Variables with high correlations with other independent variables do not bring in new information; their inclusion in the model leads to the well-known problem of multicollinearity.
To identify the variables with high correlations with other variables, we can directly use the `findCorrelation` function from the `caret` package. The function needs a correlation matrix as input, along with a cutoff for what you consider a high correlation. In the following example I am using a cutoff of 0.5; you are free to increase or decrease it.
The `findCorrelation` code is available on GitHub. After studying the code, I figured out what the function is doing. First, it finds all the pairs whose absolute correlations exceed the cutoff you specify. Next, for each pair it decides which variable to keep and which one to drop: the variable with the higher average absolute correlation with all the other variables is flagged for dropping, while the other variable in the pair is retained. The algorithm performs this comparison for every pair, so some variables may be flagged for dropping more than once. Finally, the function drops all the unique variables that were flagged.
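To make this concrete, here is a minimal sketch of that pairwise procedure in R. This is my own simplified illustration of the description above, not caret's actual source; the function name `drop_high_cor` is made up for this example, and caret handles ties and some edge cases differently.
# A simplified sketch of the pairwise logic described above.
# This is NOT caret's actual code; it only mimics the basic idea.
drop_high_cor <- function(cm, cutoff = 0.5) {
  cm <- abs(cm)
  # average absolute correlation of each variable (diagonal included)
  avg_cor <- rowMeans(cm)
  # all pairs above the diagonal whose absolute correlation exceeds the cutoff
  pairs <- which(cm > cutoff & upper.tri(cm), arr.ind = TRUE)
  flagged <- character(0)
  for (p in seq_len(nrow(pairs))) {
    i <- pairs[p, 1]
    j <- pairs[p, 2]
    # flag the member of the pair with the higher average absolute correlation
    flagged <- c(flagged, if (avg_cor[i] > avg_cor[j]) rownames(cm)[i] else rownames(cm)[j])
  }
  unique(flagged)
}
# Example: drop_high_cor(cor(wine[, -12]), cutoff = 0.5)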
In order to better understand the logic, let's look at the correlations among the variables, excluding `quality` (which is the 12th variable). I am using the `corrplot` package to get a nice plot of the correlation matrix.
# Get the correlation matrix
cormat <- cor(wine[,-12])
# Plot the correlations in a nice plot
corrplot::corrplot(cormat, type = "lower")
From the plot, it looks like `density` and `alcohol` are two variables that have high correlations with many other variables. However, we need the actual correlation values.
corrplot::corrplot(cormat, method = "number")
There are 4 pairs with absolute correlations greater than 0.5, listed in the table below. I am looking at the values above the diagonal because the matrix is symmetric.
| Variable Pair | Absolute Correlation |
|---|---|
| Residual Sugar - Density | 0.84 |
| Free Sulfur Dioxide - Total Sulfur Dioxide | 0.62 |
| Total Sulfur Dioxide - Density | 0.53 |
| Density - Alcohol | 0.78 |
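If you would rather extract these pairs programmatically than read them off the plot, a short sketch like the following works on the `cormat` object created above (the name `high_pairs` is just for illustration).
# Upper-triangle entries with absolute correlation above the 0.5 cutoff
high_pairs <- which(abs(cormat) > 0.5 & upper.tri(cormat), arr.ind = TRUE)
# Attach the variable names and the correlation values for readability
data.frame(var1 = rownames(cormat)[high_pairs[, 1]],
           var2 = colnames(cormat)[high_pairs[, 2]],
           corr = round(cormat[high_pairs], 2))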
For each pair we will decide which variable to keep and which one to drop. As I wrote above, `caret` uses the variable's average absolute correlation to decide this. If you manually calculate the average absolute correlations for these variables, you will find that `density` has a higher average absolute correlation than `residual_sugar`, `total_sulfur_dioxide`, and `alcohol`. Therefore, in pairs 1, 3, and 4 of the table above, we will flag `density` to be dropped. In pair 2, `total_sulfur_dioxide` has a higher average absolute correlation than `free_sulfur_dioxide`.1 Thus, `caret::findCorrelation` will advocate dropping this variable as well. To summarize, we expect the function to recommend dropping `density` and `total_sulfur_dioxide`.
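A quick way to check those average absolute correlations yourself is to take the column means of the absolute correlation matrix. The diagonal is included here, which inflates every mean by the same amount, so the ranking of the variables is unchanged even though the exact numbers will not match what caret computes internally.
# Average absolute correlation of each predictor (diagonal included)
sort(colMeans(abs(cormat)), decreasing = TRUE)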
Now let's use the `caret::findCorrelation` function to verify this logic and see whether it really flags these two variables.
caret::findCorrelation(cor(wine),cutoff = 0.5,names = T, verbose = T)
## Compare row 8 and column 11 with corr 0.78
## Means: 0.329 vs 0.164 so flagging column 8
## Compare row 7 and column 6 with corr 0.616
## Means: 0.228 vs 0.142 so flagging column 7
## All correlations <= 0.5
## [1] "density" "total_sulfur_dioxide"
Yes, indeed it suggested dropping the two variables that we identified earlier. In the call above, I added the `verbose` option and set it to `TRUE`, so the function returned a bit more detail. First it reported that between row 8 (`density`) and column 11 (`alcohol`), `density` has the higher average absolute correlation (0.329 vs. 0.164). Thus, it suggests dropping row 8 of the correlation matrix, which holds `density`. Similarly, in the next line it recommends dropping column 7, which is `total_sulfur_dioxide` (look at the correlation plot above).
If you already know which model you plan to use, you can choose variables based on their importance in that model. The definition of importance is subjective. For example, assume you are estimating a linear regression model. If you have k variables and n observations such that n > (k+1), then your model is identified and you will be able to estimate it. The statistical significance of the regression coefficients can tell you which variables are important and should be included in the model. In what follows, we are going to use this basic idea to select variables. However, instead of running the model only once, we will estimate it 10 times using cross validation.
Let's first create a variable that assigns each observation to one of k equal-sized groups in the sample; for this example, k is 10.
k <- 10
# Repeat the group labels 1:k enough times to cover most of the rows
r <- rep(1:k, times = nrow(wine) %/% k)
# If the number of rows is not a multiple of k, pad with a few extra labels
r <- if (nrow(wine) %% k != 0) {append(r, seq(1, nrow(wine) %% k))} else {r}
set.seed(100)
# Shuffle the labels so each row gets a random cross validation group between 1 and 10
wine$CV_Group <- sample(r, nrow(wine), replace = F)
table(wine$CV_Group)
##
## 1 2 3 4 5 6 7 8 9 10
## 490 490 490 490 490 490 490 490 489 489
Next we will use these cross validation groups to estimate variable importance.
A variable that really affects our dependent variable will show up as statistically significant in the regression model more often than not. The 10-fold cross validation that we use consists of splitting the original sample into 10 equal parts and then estimating the model on 9 of these parts at a time. Thus, a cross validation sample is likely to look different from the original sample. When we generate 10 different cross validation samples, we are in effect creating 10 new samples. Now if we fit the same regression model on these samples, we will obtain different coefficients for the model variables. A variable that is important will show up as statistically significant in many of these samples. Usually we treat \(p \leq 0.05\) as the rule of thumb for labeling a coefficient statistically significant; the corresponding t statistic is roughly \(\pm 2\). Therefore, a variable that shows up as significant in most of the cross validation samples will have a large average absolute t statistic.2
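As a quick sanity check on that rule of thumb, the two-sided 5% critical value of the t distribution with the residual degrees of freedom we will have in this model (4,898 observations, 12 predictor terms plus an intercept) is indeed very close to 2.
# Two-sided 5% critical t value; with thousands of degrees of freedom
# this is essentially the normal quantile of about 1.96
qt(0.975, df = nrow(wine) - 13)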
Finally, let's get the variable importance for our example. For many machine learning models, it's better to use the `varImp` function from `caret`. However, for a linear regression model, `varImp` simply reports the absolute t statistics, which is basically equivalent to using significance levels based on the t statistics.
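For completeness, here is a minimal sketch of what that looks like (the object name `fit_all` is just for illustration): for an `lm` object, caret's `varImp` reports the absolute value of the t statistic for each model term.
# caret's varImp on a plain lm object returns the absolute t statistics
fit_all <- lm(quality ~ ., data = wine)
caret::varImp(fit_all)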
I am going to use a different method for this. I will estimate our linear regression model 10 times, each time using 90% of the sample, and save the t statistics from each estimation in a matrix. Let's first look at what the output from a linear regression looks like.
summary(lm(quality ~., wine))
##
## Call:
## lm(formula = quality ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8357 -0.4922 -0.0379 0.4628 3.1108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.881e+01 7.986 1.73e-15 ***
## fixed_acidity 6.545e-02 2.088e-02 3.135 0.00173 **
## volatile_acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric_acid 2.173e-02 9.579e-02 0.227 0.82056
## residual_sugar 8.149e-02 7.528e-03 10.825 < 2e-16 ***
## chlorides -2.483e-01 5.466e-01 -0.454 0.64965
## free_sulfur_dioxide 3.734e-03 8.442e-04 4.423 9.96e-06 ***
## total_sulfur_dioxide -2.850e-04 3.781e-04 -0.754 0.45111
## density -1.503e+02 1.908e+01 -7.877 4.10e-15 ***
## pH 6.863e-01 1.054e-01 6.512 8.17e-11 ***
## sulphates 6.312e-01 1.004e-01 6.287 3.52e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.69e-15 ***
## CV_Group -8.104e-04 3.744e-03 -0.216 0.82864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4885 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2801
## F-statistic: 159.8 on 12 and 4885 DF, p-value: < 2.2e-16
We want to get the column of t statistics (labeled "t value" in the output). As it turns out, it's stored in the 3rd column of the `coefficients` matrix, which in turn is stored in the list created by the `summary` function.3 Note that the column of t statistics has 13 values: 12 for the predictor terms (the 11 wine variables plus `CV_Group`) and 1 for the intercept. We will store these in a matrix with 13 rows and 10 columns; each column will hold the t statistics from one cross validation sample.
# Create a matrix of NAs. This matrix will hold t values.
m1 <- matrix(data = NA, nrow = 13, ncol = k)
# Assign row names to the matrix to identify the variables correctly.
# You can manually type these row names. An easier way is to simply assign these
# names from the coefficients data frame from the summary function.
rownames(m1) <- rownames(summary(lm(quality ~., wine))$coefficients)
# Store t statistics in m1 matrix.
for (i in 1:k){
m1[,i] <- summary(lm(quality~., wine[wine$CV_Group != i,]))$coefficients[,3]
}
I won’t print out this matrix here. Instead I will get the row means so that we have the average t statistics.
rowMeans(m1)
## (Intercept) fixed_acidity volatile_acidity
## 7.6144449 3.0296170 -15.5370915
## citric_acid residual_sugar chlorides
## 0.2148612 10.2967468 -0.4215721
## free_sulfur_dioxide total_sulfur_dioxide density
## 4.1835162 -0.6828594 -7.5117479
## pH sulphates alcohol
## 6.2198003 5.9864063 7.4939500
## CV_Group
## -0.2082132
From the output, `volatile_acidity` is the most important variable, followed by `residual_sugar`, and so on. Three variables (`citric_acid`, `chlorides`, and `total_sulfur_dioxide`) have average absolute t statistics below 2 and can therefore be dropped from the model. (`CV_Group` also has a small t statistic, but it is just the cross validation group identifier, not a real predictor.)
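If you want to pull those variables out programmatically rather than eyeballing the averages, a one-liner on the `m1` matrix built above does it (note that `CV_Group` will also show up here because its average absolute t statistic is small).
# Terms whose average absolute t statistic falls below the rough threshold of 2
names(which(abs(rowMeans(m1)) < 2))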
This note is still being updated, so kindly bear with me.
1. You need to do these calculations yourself if you want to verify what I wrote.↩
2. Note that instead of the t statistic one could also take the average of the p values. The lower the average p value, the more important the variable is.↩
3. If you want to verify it yourself, just execute the following command: `summary(lm(quality ~ ., wine))$coefficients`↩