Introduction

This document deals with:

  1. The problems faced with high dimensionality.
  2. The ways in which we can deal with the redundancy and multicollinearity introduced into our dataset by high dimensionality.
  3. Detecting multicollinearity.
  4. Remedies for multicollinearity and dimension reduction using Principal Component Analysis (PCA).
  5. A PCA implementation example with R code.

Causes of dimensionality

Two factors can contribute to high dimensionality. First, categorical variables with many levels; second, redundant variables in the dataset. When we include categorical variables in modeling we use dummy variables, and the more levels a categorical variable has, the more dummy variables we create, which in turn increases the dimensionality. This can be addressed by an approach known as “collapsing the levels”, which will be discussed in a separate document. This document focuses on dealing with redundant variables and multicollinearity.
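As a quick, hypothetical illustration (simulated data, not part of the case study below), the dummy-variable expansion of a single many-level factor already adds one column per extra level:

# illustrative sketch: a 10-level categorical variable expands into 9 dummy columns plus an intercept
f <- factor(rep(letters[1:10], each = 10))   # 100 observations of a 10-level factor
dim(model.matrix(~ f))                       # 100 rows, 10 columns; more levels mean more dummy columns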

Multicollinearity (collinearity) can be defined as a phenomenon in which two or more variables in a multivariate regression are highly correlated, meaning that one variable can be linearly predicted from the others with a substantial degree of accuracy. Collinear variables carry essentially the same information about the dependent variable. If nominally “different” measures actually quantify the same phenomenon, then they are redundant variables.

The presence of redundant variables and multicollinearity among variables can affect our analysis in several ways, some of which are described below:

Small changes in the input data (such as removing or adding a single variable) lead to large changes in the model, even resulting in sign changes of parameter estimates. The presence of multicollinearity also inflates the variance (standard error) of the coefficient estimates, making them sensitive to minor changes and therefore difficult to interpret.

When we talk about the best regression model for a dataset, we mean a model whose predictor variables each correlate with the target/dependent variable but correlate only minimally with each other. Such a model is called a low-noise model; it is statistically robust and can be used for scoring. If, on the other hand, the correlation among the predictors is high and we achieve high predictive power on the training dataset, we may not achieve the same on the test dataset, unless the same collinear relations exist among the variables in the test dataset, in which case multicollinearity won’t affect the predictive ability of the model.
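As a quick, hedged illustration of this instability (simulated data, not the case-study dataset used later), the sketch below builds two nearly collinear predictors and shows how the coefficient estimates and their standard errors change once one of them is dropped:

# illustrative sketch: x2 is almost a copy of x1, so the two are highly collinear
set.seed(123)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)
y  <- 2 * x1 + rnorm(100)

summary(lm(y ~ x1 + x2))$coefficients   # inflated standard errors, unstable estimates
summary(lm(y ~ x1))$coefficients        # dropping x2 changes the estimates and shrinks the errors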

Detecting Multicollinearity

After understanding how multicollinearity affects our analysis and the predictive ability of the model, it is important to learn how to detect its presence in our data.

The following methods can be used to detect multicollinearity:

Variance Inflation Factor (VIF)

One of the points mentioned about detecting multicollinearity is that the correlations between pairs of variables are large; however, looking only at correlations among pairs of predictors can be limiting.

It is possible that the pairwise correlations are small and yet a linear dependence exists among three or more variables; for example, the relation between X1, X2, and X3 might be defined by a linear equation such as X3 = 3X1 + 4X2 + Error. For such cases, we use the Variance Inflation Factor.

VIF is a measure of how much the variance of an estimated regression coefficient is inflated by the existence of correlations among the predictor variables in the model. A VIF value greater than 10 indicates serious correlation between variables.

We won’t delve into the mathematical derivation of VIF. The formula for VIF is given as:

                VIF_k = 1 / (1 - R_k^2)

The VIF for the k-th predictor variable is obtained by computing the R-squared value from regressing that k-th variable on the remaining predictor variables.
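As a minimal sketch of this definition (simulated variables following the X3 = 3X1 + 4X2 + Error relation above, not the case-study data), the VIF of each predictor can be computed directly by regressing it on the others; packages such as car also provide a vif() function for fitted models:

# illustrative VIF computation on simulated data with X3 = 3X1 + 4X2 + Error
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 3 * x1 + 4 * x2 + rnorm(100)
preds <- data.frame(x1, x2, x3)

vif_values <- sapply(names(preds), function(k) {
  # regress the k-th predictor on the remaining predictors and apply VIF = 1/(1 - R^2)
  r2 <- summary(lm(reformulate(setdiff(names(preds), k), response = k), data = preds))$r.squared
  1 / (1 - r2)
})
vif_values   # x3 shows a large VIF, reflecting its linear dependence on x1 and x2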

Remedies for multicollinearity and high dimensionality

In this document, we will discuss Principal Component Analysis with the help of an example.

Principal Component Analysis (PCA)

PCA is also referred to as feature extraction or feature reduction: it creates new features from the many features already present in the dataset. In Principal Component Regression (PCR), instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of the principal components for regression, which makes PCR a kind of regularized procedure. Often the principal components with the highest variances are selected as regressors.

One major use of PCA lies in overcoming the multicollinearity problem: PCA can deal with such situations by excluding some of the low-variance principal components from the regression step. In addition, by regressing on only a subset of the principal components, PCR achieves dimension reduction, substantially lowering the effective number of parameters characterizing the underlying model.
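As a minimal, hypothetical sketch of the PCR idea (simulated data; the actual case study follows below), the component scores returned by prcomp can be used directly as regressors:

# PCR sketch on simulated data: regress y on the first few principal components
set.seed(42)
X <- matrix(rnorm(200 * 5), ncol = 5)
X[, 5] <- X[, 1] + X[, 2] + rnorm(200, sd = 0.01)   # make one column collinear with two others
y <- as.vector(X %*% c(1, 2, 0, 0, 1) + rnorm(200))

pc     <- prcomp(X, scale. = TRUE)   # principal components of the predictors
scores <- pc$x[, 1:3]                # keep only the first 3 components
fit    <- lm(y ~ scores)             # regress on the components instead of the raw predictors
summary(fit)$r.squared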

Learning PCA through Example

To understand how PCA works, the first thing we need is a dataset with high dimensionality. The data can be downloaded using the code below:

# downloading file1 - the dataset with observations
library(RCurl)   # provides getURL()

urlfile <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/gisette/GISETTE/gisette_train.data'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
gisetteRaw <- read.table(textConnection(x), sep = '', header = FALSE, stringsAsFactors = FALSE)


# downloading file2 - the dataset with the target labels
urlfile <- "https://archive.ics.uci.edu/ml/machine-learning-databases/gisette/GISETTE/gisette_train.labels"
x <- getURL(urlfile, ssl.verifypeer = FALSE)
g_labels <- read.table(textConnection(x), sep = '', header = FALSE, stringsAsFactors = FALSE)

print(dim(gisetteRaw))
## [1] 6000 5000
head(gisetteRaw)

On running a check on the dimensions of our data, we can observe that it has 6000 observations across 5000 predictor variables.

As discussed above, PCA creates new features from the many features present in the dataset. A problem arises when variables have zero or near-zero variance, i.e., almost the same value throughout all observations: such columns can break the scaling and PCA steps. From a statistical standpoint, a variable with almost zero variance would not contribute much to the model anyway.

library(caret)   # provides nearZeroVar()

nzv <- nearZeroVar(gisetteRaw, saveMetrics = TRUE) # saveMetrics = TRUE returns the per-column
                                                   # metrics rather than just the column indices
print(paste('Range:', range(nzv$percentUnique)))
## [1] "Range: 0.0166666666666667" "Range: 8.6"
print(paste('Column count before cutoff:',ncol(gisetteRaw)))
## [1] "Column count before cutoff: 5000"
dim(nzv)
## [1] 5000    4
head(nzv) # percentUnique reports the percentage of unique values in each column;
          # zeroVar and nzv flag columns with zero or near-zero variance

# percentUnique ranges from roughly 0.017 to 8.6; based on this we choose a cutoff of 0.1

So our first step is to check the spread of values under each variable and, based on the range obtained above, select a cutoff with which to eliminate the near-zero-variance variables. From the output, we can observe that percentUnique lies between roughly 0.017 and 8.6.

For our analysis, we choose the cutoff as 0.1 and use it to clean our data as shown in the code below:

dim(nzv[nzv$percentUnique > 0.1,]) # from 5000 to 4639; 361 columns eliminated
## [1] 4639    4
gisette_nzv <- gisetteRaw[c(rownames(nzv[nzv$percentUnique > 0.1,]))] 
# rownames(nzv) holds the original column names, so this keeps only the columns above the cutoff
print(paste('Column count after cutoff:',ncol(gisette_nzv)))
## [1] "Column count after cutoff: 4639"

We can see that after removing all the columns that don’t meet the criterion, we reduce the dimensionality by 361 columns, i.e., from 5000 columns to 4639.

After cleaning up, we check how a model performs without the PCA transformation. Since our data has high dimensionality, we use xgboost as the model, evaluated with k-fold cross-validation where k is taken to be 5.

# function to evaluate a model with 5-fold cross-validation, reporting mean RMSE and mean AUC
EvaluateAUC <- function(dfEvaluate) {
  require(xgboost)
  require(Metrics)
  CVs <- 5                                         # number of cross-validation folds
  cvDivider <- floor(nrow(dfEvaluate) / (CVs+1))   # number of rows per fold
  indexCount <- 1
  outcomeName <- c('cluster')
  predictors <- names(dfEvaluate)[!names(dfEvaluate) %in% outcomeName]
  lsErr <- c()
  lsAUC <- c()
  for (cv in seq(1:CVs)) {
    print(paste('cv',cv))
    dataTestIndex <- c((cv * cvDivider):(cv * cvDivider + cvDivider))
    dataTest <- dfEvaluate[dataTestIndex,]
    dataTrain <- dfEvaluate[-dataTestIndex,]
    
    bst <- xgboost(data = as.matrix(dataTrain[,predictors]),
                   label = dataTrain[,outcomeName],
                   max.depth=6, eta = 1, verbose=0,
                   nround=5, nthread=4, 
                   objective = "reg:linear")
    
    predictions <- predict(bst, as.matrix(dataTest[,predictors]), outputmargin=TRUE)
    err <- rmse(dataTest[,outcomeName], predictions)   # rmse() and auc() come from the Metrics package
    auc <- auc(dataTest[,outcomeName],predictions)
    
    lsErr <- c(lsErr, err)
    lsAUC <- c(lsAUC, auc)
    gc()
  }
  print(paste('Mean Error:',mean(lsErr)))
  print(paste('Mean AUC:',mean(lsAUC)))
}


# bind the predictors and the target labels into a single data frame
dfEvaluate <- cbind(as.data.frame(sapply(gisette_nzv, as.numeric)),
                    cluster=g_labels$V1)
EvaluateAUC(dfEvaluate) # AUC of about 0.97; using PCA we want to get as close as possible to this
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.503290785405663"
## [1] "Mean AUC: 0.970094573718909"

From the evaluation of the above model, we received an AUC value of about 0.97. In this case study, our goal is to get as close as possible to this AUC value using PCA.

Before running PCA, the first step is to standardize the data. Since each principal component is a weighted combination of the original features, we do not want a variable’s influence to depend on its scale (the larger the scale, the larger its impact on the components). To overcome this problem, we standardize our variables.

# Preparing data for PCA

pmatrix <- scale(gisette_nzv)   # centre and scale each column
princ <- prcomp(pmatrix)        # compute the principal components

#head(princ) # un-comment this command to see the result of PCA
nComp <- 1
dfComponents <- predict(princ, newdata=pmatrix)[,1:nComp]   # scores on the first nComp component(s)

dfEvaluate <- cbind(as.data.frame(dfComponents),
                    cluster=g_labels$V1)

head(dfEvaluate)
EvaluateAUC(dfEvaluate)
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.925499112970102"
## [1] "Mean AUC: 0.718743151498416"
nComp <- 2
dfComponents <- predict(princ, newdata=pmatrix)[,1:nComp]

dfEvaluate <- cbind(as.data.frame(dfComponents),
                    cluster=g_labels$V1)

head(dfEvaluate)
EvaluateAUC(dfEvaluate)
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.929239857681807"
## [1] "Mean AUC: 0.724171225258702"
nComp <- 5
dfComponents <- predict(princ, newdata=pmatrix)[,1:nComp]

dfEvaluate <- cbind(as.data.frame(dfComponents),
                    cluster=g_labels$V1)

head(dfEvaluate)
EvaluateAUC(dfEvaluate)
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.662873286001254"
## [1] "Mean AUC: 0.927671419929747"
nComp <- 11
dfComponents <- predict(princ, newdata=pmatrix)[,1:nComp]

dfEvaluate <- cbind(as.data.frame(dfComponents),
                    cluster=g_labels$V1)

head(dfEvaluate)
EvaluateAUC(dfEvaluate)
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.553538115728"
## [1] "Mean AUC: 0.962053109379695"
nComp <- 10
dfComponents <- predict(princ, newdata=pmatrix)[,1:nComp]

dfEvaluate <- cbind(as.data.frame(dfComponents),
                    cluster=g_labels$V1)

head(dfEvaluate)
EvaluateAUC(dfEvaluate)
## [1] "cv 1"
## [1] "cv 2"
## [1] "cv 3"
## [1] "cv 4"
## [1] "cv 5"
## [1] "Mean Error: 0.552423564209513"
## [1] "Mean AUC: 0.963849772458523"

The above code tracks the AUC as we increase the number of principal components used to predict the target variable. It can be observed that, after feature extraction, even with a single component we achieve an AUC of about 0.72, and the AUC rises as we add more principal components. The optimum value here is 10 components, for which the AUC is about 0.964; with 11 components the AUC starts to decrease slightly.
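Instead of (or alongside) this trial-and-error search, a common check, not shown in the original workflow, is to look at how much of the total variance each component explains:

# proportion of variance explained by each principal component (from the prcomp object above)
var_explained <- princ$sdev^2 / sum(princ$sdev^2)
head(cumsum(var_explained), 15)         # cumulative proportion for the first 15 components
plot(var_explained[1:50], type = "b",   # scree plot of the first 50 components
     xlab = "Principal component", ylab = "Proportion of variance explained")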

From this case study, it is clear how Principal Component Analysis (and regression on the resulting components) helped us reduce the dimensionality: from the 5000 variables originally available to predict the target variable, we were able to extract 10 features that explain the variation in the target to almost the same degree.