Having a large number of variables in a data set leads to several problems in modeling. For more information, check out this Quora thread where Prof. Yoshua Bengio contributed to the answers: https://www.quora.com/What-is-the-curse-of-dimensionality.
In this note I am going to discuss a few dimensionality reduction techniques. They are typically grouped into Feature Selection (or Elimination) and Feature Extraction. I will use the white wine data set from the UCI machine learning repository, treating `quality` as the dependent variable and the remaining 11 variables as independent variables.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(pastecs)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
library(pander)
library(corrplot)
wine<- read.csv("/Volumes/Transcend/Dropbox/Work/Teaching/DA 6813/Course Documents/Data/Wine data/winequality-white.csv")
panderOptions('table.split.table', 120)
panderOptions('table.style', 'grid')
panderOptions('table.alignment.default', 'right')
panderOptions('table.alignment.rownames', 'left')
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed_acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile_acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric_acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual_sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free_sulfur_dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total_sulfur_dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
pander(stat.desc(wine), style = 'simple')
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide |
|---|---|---|---|---|---|---|
| nbr.val | 4898 | 4898 | 4898 | 4898 | 4898 | 4898 |
| nbr.null | 0 | 0 | 19 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 3.8 | 0.08 | 0 | 0.6 | 0.009 | 2 |
| max | 14.2 | 1.1 | 1.66 | 65.8 | 0.346 | 289 |
| range | 10.4 | 1.02 | 1.66 | 65.2 | 0.337 | 287 |
| sum | 33575 | 1363 | 1637 | 31305 | 224.2 | 172939 |
| median | 6.8 | 0.26 | 0.32 | 5.2 | 0.043 | 34 |
| mean | 6.855 | 0.2782 | 0.3342 | 6.391 | 0.04577 | 35.31 |
| SE.mean | 0.01206 | 0.00144 | 0.001729 | 0.07247 | 0.0003122 | 0.243 |
| CI.mean.0.95 | 0.02364 | 0.002823 | 0.00339 | 0.1421 | 0.000612 | 0.4764 |
| var | 0.7121 | 0.01016 | 0.01465 | 25.73 | 0.0004773 | 289.2 |
| std.dev | 0.8439 | 0.1008 | 0.121 | 5.072 | 0.02185 | 17.01 |
| coef.var | 0.1231 | 0.3623 | 0.3621 | 0.7936 | 0.4773 | 0.4817 |
| | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|
| nbr.val | 4898 | 4898 | 4898 | 4898 | 4898 | 4898 |
| nbr.null | 0 | 0 | 0 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 9 | 0.9871 | 2.72 | 0.22 | 8 | 3 |
| max | 440 | 1.039 | 3.82 | 1.08 | 14.2 | 9 |
| range | 431 | 0.05187 | 1.1 | 0.86 | 6.2 | 6 |
| sum | 677690 | 4869 | 15616 | 2399 | 51499 | 28790 |
| median | 134 | 0.9937 | 3.18 | 0.47 | 10.4 | 6 |
| mean | 138.4 | 0.994 | 3.188 | 0.4898 | 10.51 | 5.878 |
| SE.mean | 0.6072 | 4.274e-05 | 0.002158 | 0.001631 | 0.01758 | 0.01265 |
| CI.mean.0.95 | 1.19 | 8.378e-05 | 0.00423 | 0.003197 | 0.03447 | 0.02481 |
| var | 1806 | 8.946e-06 | 0.0228 | 0.01302 | 1.514 | 0.7844 |
| std.dev | 42.5 | 0.002991 | 0.151 | 0.1141 | 1.231 | 0.8856 |
| coef.var | 0.3072 | 0.003009 | 0.04736 | 0.233 | 0.117 | 0.1507 |
Feature selection methods help a modeler choose which features (also known as variables or attributes) to include in the model from a large set of attributes. Note that ideally the independent or predictor variables included in the model should be supported by theory. However, in many real-world applications there is no clear guidance from theory on which specific set of variables to include. In such situations feature selection helps.
The idea here is that we will retain the independent variables that have low correlations with the other variables. Variables with high correlations with other independent variables do not bring in new information; their inclusion in the model leads to the well-known problem of multicollinearity.
To identify the variables with high correlations with other variables, we can directly use the `findCorrelation` function from the `caret` package. The function needs a correlation matrix as input, along with a cutoff for what you consider a high correlation. In the following example I am using a cutoff of 0.5; you are free to increase or decrease it.
The `findCorrelation` code is available on GitHub. After studying the code, I figured out what the function is doing. First, it finds all the pairs whose absolute correlations exceed the cutoff you specify. Next, for each pair it decides which variable to keep and which one to drop: the variable with the higher average absolute correlation with all the other variables is flagged for dropping, while the other variable in the pair is retained. The algorithm performs this comparison for every pair, so some variables may be flagged for dropping more than once. Finally, the function drops all the unique variables that were flagged.
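To make this concrete, here is a minimal sketch of that pairwise procedure in R. This is my own simplified illustration of the description above, not caret's actual source; the function name `drop_high_cor` is made up for this example, and caret handles ties and some edge cases differently.
# A simplified sketch of the pairwise logic described above.
# This is NOT caret's actual code; it only mimics the basic idea.
drop_high_cor <- function(cm, cutoff = 0.5) {
  cm <- abs(cm)
  # average absolute correlation of each variable (diagonal included)
  avg_cor <- rowMeans(cm)
  # all pairs above the diagonal whose absolute correlation exceeds the cutoff
  pairs <- which(cm > cutoff & upper.tri(cm), arr.ind = TRUE)
  flagged <- character(0)
  for (p in seq_len(nrow(pairs))) {
    i <- pairs[p, 1]
    j <- pairs[p, 2]
    # flag the member of the pair with the higher average absolute correlation
    flagged <- c(flagged, if (avg_cor[i] > avg_cor[j]) rownames(cm)[i] else rownames(cm)[j])
  }
  unique(flagged)
}
# Example: drop_high_cor(cor(wine[, -12]), cutoff = 0.5)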
In order to better understand the logic, let's look at the correlations among the variables, excluding `quality` (which is the 12th variable). I am using the `corrplot` package to get a nice plot of the correlation matrix.
# Get the correlation matrix
cormat <- cor(wine[,-12])
# Plot the correlations in a nice plot
corrplot::corrplot(cormat, type = "lower")
From the plot, it looks like `density` and `alcohol` are two variables that have high correlations with many other variables. However, we need the actual correlation values.
corrplot::corrplot(cormat, method = "number")
There are 4 pairs with absolute correlations greater than 0.5, listed in the table below. I am looking at the values above the diagonal because the matrix is symmetric.
| Variable Pair | Absolute Correlation |
|---|---|
| Residual Sugar - Density | 0.84 |
| Free Sulfur Dioxide - Total Sulfur Dioxide | 0.62 |
| Total Sulfur Dioxide - Density | 0.53 |
| Density - Alcohol | 0.78 |
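If you would rather extract these pairs programmatically than read them off the plot, a short sketch like the following works on the `cormat` object created above (the name `high_pairs` is just for illustration).
# Upper-triangle entries with absolute correlation above the 0.5 cutoff
high_pairs <- which(abs(cormat) > 0.5 & upper.tri(cormat), arr.ind = TRUE)
# Attach the variable names and the correlation values for readability
data.frame(var1 = rownames(cormat)[high_pairs[, 1]],
           var2 = colnames(cormat)[high_pairs[, 2]],
           corr = round(cormat[high_pairs], 2))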
For each pair we will decide which variable to keep and which one to drop. As I wrote above, `caret` uses the variable's average absolute correlation to decide this. If you manually calculate the average absolute correlations for these variables, you will find that `density` has a higher average absolute correlation than `residual_sugar`, `total_sulfur_dioxide`, and `alcohol`. Therefore, in pairs 1, 3, and 4 of the table above, we will flag `density` to be dropped. In pair 2, `total_sulfur_dioxide` has a higher average absolute correlation than `free_sulfur_dioxide`.1 Thus, `caret::findCorrelation` will advocate dropping this variable as well. To summarize, we expect the function to recommend dropping `density` and `total_sulfur_dioxide`.
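A quick way to check those average absolute correlations yourself is to take the column means of the absolute correlation matrix. The diagonal is included here, which inflates every mean by the same amount, so the ranking of the variables is unchanged even though the exact numbers will not match what caret computes internally.
# Average absolute correlation of each predictor (diagonal included)
sort(colMeans(abs(cormat)), decreasing = TRUE)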
Now let's use the `caret::findCorrelation` function to verify this logic and see whether it really flags these two variables.
caret::findCorrelation(cor(wine),cutoff = 0.5,names = T, verbose = T)
## Compare row 8 and column 11 with corr 0.78
## Means: 0.329 vs 0.164 so flagging column 8
## Compare row 7 and column 6 with corr 0.616
## Means: 0.228 vs 0.142 so flagging column 7
## All correlations <= 0.5
## [1] "density" "total_sulfur_dioxide"
Yes, indeed it suggested dropping the two variables that we identified earlier. In the call above, I added the `verbose` option and set it to `TRUE`, so the function returned a bit more detail. First it reported that between row 8 (`density`) and column 11 (`alcohol`), `density` has the higher average absolute correlation (0.329 vs. 0.164). Thus, it suggests dropping row 8 of the correlation matrix, which holds `density`. Similarly, in the next line it recommends dropping column 7, which is `total_sulfur_dioxide` (look at the correlation plot above).
If you already know which model you plan to use, you can choose variables based on their importance in that model. The definition of importance is subjective. For example, assume you are estimating a linear regression model. If you have k variables and n observations such that n > (k+1), then your model is identified and you will be able to estimate it. The statistical significance of the regression coefficients can tell you which variables are important and should be included in the model. In what follows, we are going to use this basic idea to select variables. However, instead of running the model only once, we will estimate it 10 times using cross validation.
Let's first create a variable that assigns each observation to one of k equal-sized groups in the sample; for this example, k is 10.
k <- 10
# Repeat the group labels 1:k enough times to cover most of the rows
r <- rep(1:k, times = nrow(wine) %/% k)
# If the number of rows is not a multiple of k, pad with a few extra labels
r <- if (nrow(wine) %% k != 0) {append(r, seq(1, nrow(wine) %% k))} else {r}
set.seed(100)
# Shuffle the labels so each row gets a random cross validation group between 1 and 10
wine$CV_Group <- sample(r, nrow(wine), replace = F)
table(wine$CV_Group)
##
## 1 2 3 4 5 6 7 8 9 10
## 490 490 490 490 490 490 490 490 489 489
Next we will use these cross validation groups to estimate variable importance.
A variable that really affects our dependent variable will show up as statistically significant in the regression model more often than not. The 10-fold cross validation that we use consists of splitting the original sample into 10 equal parts and then estimating the model on 9 of these parts at a time. Thus, a cross validation sample is likely to look different from the original sample. When we generate 10 different cross validation samples, we are in effect creating 10 new samples. Now if we fit the same regression model on these samples, we will obtain different coefficients for the model variables. A variable that is important will show up as statistically significant in many of these samples. Usually we treat \(p \leq 0.05\) as the rule of thumb for labeling a coefficient statistically significant; the corresponding t statistic is roughly \(\pm 2\). Therefore, a variable that shows up as significant in most of the cross validation samples will have a large average absolute t statistic.2
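As a quick sanity check on that rule of thumb, the two-sided 5% critical value of the t distribution with the residual degrees of freedom we will have in this model (4,898 observations, 12 predictor terms plus an intercept) is indeed very close to 2.
# Two-sided 5% critical t value; with thousands of degrees of freedom
# this is essentially the normal quantile of about 1.96
qt(0.975, df = nrow(wine) - 13)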
Finally, let's get the variable importance for our example. For many machine learning models, it's better to use the `varImp` function from `caret`. However, for a linear regression model, `varImp` simply reports the absolute t statistics, which is basically equivalent to using significance levels based on the t statistics.
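For completeness, here is a minimal sketch of what that looks like (the object name `fit_all` is just for illustration): for an `lm` object, caret's `varImp` reports the absolute value of the t statistic for each model term.
# caret's varImp on a plain lm object returns the absolute t statistics
fit_all <- lm(quality ~ ., data = wine)
caret::varImp(fit_all)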
I am going to use a different method for this. I will estimate our linear regression model 10 times, each time using 90% of the sample, and save the t statistics from each estimation in a matrix. Let's first look at what the output from a linear regression looks like.
summary(lm(quality ~., wine))
##
## Call:
## lm(formula = quality ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8357 -0.4922 -0.0379 0.4628 3.1108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.881e+01 7.986 1.73e-15 ***
## fixed_acidity 6.545e-02 2.088e-02 3.135 0.00173 **
## volatile_acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric_acid 2.173e-02 9.579e-02 0.227 0.82056
## residual_sugar 8.149e-02 7.528e-03 10.825 < 2e-16 ***
## chlorides -2.483e-01 5.466e-01 -0.454 0.64965
## free_sulfur_dioxide 3.734e-03 8.442e-04 4.423 9.96e-06 ***
## total_sulfur_dioxide -2.850e-04 3.781e-04 -0.754 0.45111
## density -1.503e+02 1.908e+01 -7.877 4.10e-15 ***
## pH 6.863e-01 1.054e-01 6.512 8.17e-11 ***
## sulphates 6.312e-01 1.004e-01 6.287 3.52e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.69e-15 ***
## CV_Group -8.104e-04 3.744e-03 -0.216 0.82864
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4885 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2801
## F-statistic: 159.8 on 12 and 4885 DF, p-value: < 2.2e-16
We want to get the column of t statistics (labeled "t value" in the output). As it turns out, it's stored in the 3rd column of the `coefficients` matrix, which in turn is stored in the list created by the `summary` function.3 Note that the column of t statistics has 13 values: 12 for the predictor terms (the 11 wine variables plus `CV_Group`) and 1 for the intercept. We will store these in a matrix with 13 rows and 10 columns; each column will hold the t statistics from one cross validation sample.
# Create a matrix of NAs. This matrix will hold t values.
m1 <- matrix(data = NA, nrow = 13, ncol = k)
# Assign row names to the matrix to identify the variables correctly.
# You can manually type these row names. An easier way is to simply assign these
# names from the coefficients data frame from the summary function.
rownames(m1) <- rownames(summary(lm(quality ~., wine))$coefficients)
# Store t statistics in m1 matrix.
for (i in 1:k){
m1[,i] <- summary(lm(quality~., wine[wine$CV_Group != i,]))$coefficients[,3]
}
I won’t print out this matrix here. Instead I will get the row means so that we have the average t statistics.
rowMeans(m1)
## (Intercept) fixed_acidity volatile_acidity
## 7.6144449 3.0296170 -15.5370915
## citric_acid residual_sugar chlorides
## 0.2148612 10.2967468 -0.4215721
## free_sulfur_dioxide total_sulfur_dioxide density
## 4.1835162 -0.6828594 -7.5117479
## pH sulphates alcohol
## 6.2198003 5.9864063 7.4939500
## CV_Group
## -0.2082132
From the output, `volatile_acidity` is the most important variable, followed by `residual_sugar`, and so on. Three variables (`citric_acid`, `chlorides`, and `total_sulfur_dioxide`) have average absolute t statistics below 2 and can therefore be dropped from the model. (`CV_Group` also has a small t statistic, but it is just the cross validation group identifier, not a real predictor.)
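If you want to pull those variables out programmatically rather than eyeballing the averages, a one-liner on the `m1` matrix built above does it (note that `CV_Group` will also show up here because its average absolute t statistic is small).
# Terms whose average absolute t statistic falls below the rough threshold of 2
names(which(abs(rowMeans(m1)) < 2))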
This note is still being updated, so kindly bear with me.
1. You need to do these calculations yourself if you want to verify what I wrote.↩
2. Note that instead of the t statistic one could also take the average of the p values. The lower the average p value, the more important the variable is.↩
3. If you want to verify it yourself, just execute the following command: `summary(lm(quality ~ ., wine))$coefficients`↩