To create a predictive model based on regression, we would like to have as many relevant predictors as possible. The whole difficulty resides in finding relevant predictors: for a predictor to be relevant, it should explain part of the variance of the dependent variable.
With too many predictors (high dimensionality), we run the risk of over-fitting, multicollinearity, etc.

The intuition of Principal Component Analysis is to find new combinations of variables which have larger variances. Why are larger variances important? This is similar to the concept of entropy in information theory. Let’s say you have two variables. One of them (Var 1) follows N(1, 0.01) and the other (Var 2) follows N(1, 1). Which variable do you think carries more information? Var 1 is always pretty much 1, whereas Var 2 can take a wider range of values, like 0 or 2. Thus, Var 2 has more chances to take various values than Var 1, which means Var 2’s entropy is larger than Var 1’s. Thus, we can say Var 2 contains more information than Var 1.
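
A quick illustration of this point (a toy simulation, reading N(1, 0.01) as mean 1 and variance 0.01):

set.seed(1)
var1 <- rnorm(10000, mean = 1, sd = sqrt(0.01)) # N(1, 0.01): almost always very close to 1
var2 <- rnorm(10000, mean = 1, sd = 1)          # N(1, 1): easily takes values like 0 or 2
c(sd(var1), sd(var2))
range(var1)
range(var2)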

PCA tries to find linear combinations of the variables which contain much of the information, by looking at the variance. This is why the standard deviation is one of the important metrics for determining the number of new variables in PCA. Another interesting aspect of the new variables derived by PCA is that they are all orthogonal. You can think of PCA as rotating and translating the data such that the first axis contains the most information, the second axis the second most information, and so forth.
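
Here is a small sketch of this rotation idea on simulated data (not part of the survey example used later): after PCA, the scores on the new axes are uncorrelated and the axes themselves are orthogonal.

set.seed(42)
x <- rnorm(500)
y <- 0.8 * x + rnorm(500, sd = 0.5)       # two correlated variables
pc <- prcomp(cbind(x, y), scale. = TRUE)
round(cor(pc$x), 10)                      # the new variables (scores) are uncorrelated
round(t(pc$rotation) %*% pc$rotation, 10) # the rotation matrix is orthogonal: R'R = I
pc$sdev                                   # the first axis carries the largest variance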

Principal Component Analysis (PCA) is a feature extraction method that uses orthogonal linear projections to capture the underlying variance of the data.
When PCA computes the principal components it is not looking at the response but only at the predictors (it looks for the linear combinations of the predictors that have the highest variance). It makes the assumption that the linear combination of the predictors with the highest variance is associated with the response.

When applied, the algorithm linearly transforms the m-dimensional input space into an n-dimensional (n < m) output space, with the objective of minimizing the amount of information/variance lost by discarding (m - n) dimensions. PCA thus allows us to discard the directions that carry the least variance.
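
As an illustration of that trade-off (a sketch on the built-in iris data, not on the dataset used later), we can keep n of the m components, reconstruct the data, and see how much variance was given up:

X  <- scale(iris[, 1:4])                 # m = 4 standardized predictors
pc <- prcomp(X)
n  <- 2                                  # keep only the first n components
X_hat <- pc$x[, 1:n] %*% t(pc$rotation[, 1:n])
1 - sum(pc$sdev[1:n]^2) / sum(pc$sdev^2) # share of variance lost by dropping m - n dimensions
mean((X - X_hat)^2)                      # the corresponding reconstruction error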

When choosing a principal component, we assume that the regression plane varies along that direction and doesn’t vary in the orthogonal direction. By choosing one component and not the other, we’re ignoring the second direction.

PCR looks in the directions of variation of the predictors to find the places where the response is most likely to vary.

Performing PCA brings some notable advantages, the main one being dimensionality reduction: we keep only the components that carry most of the variance of the original predictors.

The primary disadvantage is that the resulting model is far more difficult to interpret than a regular regression model on the original predictors.

With principal components regression, the new transformed variables (the principal components) are calculated in a totally unsupervised way: the response is never used to define the components.

The PCA method can also dramatically improve estimation and insight in problems where multicollinearity is a large problem – as well as aid in detecting it.
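
A toy illustration of that last point (simulated data, not from the examples below): when two predictors are nearly collinear, the last principal component has near-zero variance, which both flags the problem and tells us which direction can safely be dropped.

set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)               # x2 is almost a copy of x1
x3 <- rnorm(100)
prcomp(cbind(x1, x2, x3), scale. = TRUE)$sdev  # the last standard deviation is close to zero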

PCA on an easy example.

Let’s say we asked 16 participants four questions (on a 7-point scale) about what they care about when choosing a new computer, and got the following results.

library(tibble)
Price <- c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2)
Software <- c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3)
Aesthetics <- c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7)
Brand <- c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
buy_computer <- tibble(Price, Software, Aesthetics, Brand)

Let’s go on with the PCA, using the princomp function from the stats package.

pca <- princomp(buy_computer, cor = TRUE)
names(pca)
## [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"  
## [7] "call"
print(pca)
## Call:
## princomp(x = buy_computer, cor = TRUE)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4 
## 1.5589391 0.9804092 0.6816673 0.3792578 
## 
##  4  variables and  16 observations.
summary(pca, loadings = TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.5589391 0.9804092 0.6816673 0.37925777
## Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
## Cumulative Proportion  0.6075727 0.8478733 0.9640409 1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4
## Price      -0.523         0.848       
## Software   -0.177  0.977 -0.120       
## Aesthetics  0.597  0.134  0.295 -0.734
## Brand       0.583  0.167  0.423  0.674
# OS (Windows or Mac, coded 0/1) is the extra survey question discussed further below;
# here it is only used to colour the observations in the biplot.
OS <- c(0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1)
library(ggbiplot)
g <- ggbiplot(pca, obs.scale = 1, var.scale = 1, groups = as.character(OS),
              ellipse = TRUE, circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

The code below has been taken and modified from GitHub here. The idea is to plot each variable’s coefficients inside a unit circle to get some insight into a possible interpretation of the PCs.

require(ggplot2)
# Unit circle
theta <- seq(0, 2 * pi, length.out = 100)
circle <- data.frame(x = cos(theta), y = sin(theta))
p <- ggplot(circle, aes(x, y)) + geom_path()

# princomp stores the loadings in pca$loadings (a prcomp object would call them $rotation)
loadings <- data.frame(unclass(pca$loadings))
loadings$variable <- rownames(loadings)

p + geom_text(data = loadings,
              mapping = aes(x = Comp.1, y = Comp.2,
                            label = variable, colour = variable)) +
  coord_fixed(ratio = 1) +
  labs(x = "PC1", y = "PC2")

Remember that one of the disadvantages of PCA is how difficult it is to interpret the model (i.e. what does PC1 represent, what does PC2 represent, etc.). The biplot helps somewhat to overcome that.

In the above graph, one can see that Brand and Aesthetics (with Price loading in the opposite direction) explain most of the variance in the new predictor PC1, while Software explains most of the variance in the new predictor PC2.

Once you have done the analysis with PCA, you may want to check whether the new variables can predict some phenomenon well. This is kinda like machine learning: whether the features can classify the data well. Let’s say you asked the participants one more thing in your survey, which OS they are using (Windows or Mac), and the results are like this.

OS <- c(0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1)
# Let's test our model
model1 <- glm(OS ~ pca$scores[,1] + pca$scores[,2], family = binomial)
summary(model1)
## 
## Call:
## glm(formula = OS ~ pca$scores[, 1] + pca$scores[, 2], family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4485  -0.4003   0.1258   0.5652   1.2814  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)      -0.2138     0.7993  -0.268   0.7891  
## pca$scores[, 1]   1.4744     0.6411   2.300   0.0215 *
## pca$scores[, 2]   0.7104     0.8940   0.795   0.4269  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22.181  on 15  degrees of freedom
## Residual deviance: 11.338  on 13  degrees of freedom
## AIC: 17.338
## 
## Number of Fisher Scoring iterations: 5

Let’s see how well this model predicts the kind of OS. You can use the fitted() function to see the predictions.

fitted(model1)
##           1           2           3           4           5           6 
## 0.114201733 0.009372181 0.217716320 0.066009817 0.440016243 0.031640529 
##           7           8           9          10          11          12 
## 0.036189119 0.175766013 0.906761064 0.855587371 0.950088045 0.888272270 
##          13          14          15          16 
## 0.781098710 0.757499202 0.842557931 0.927223453

These values represent the probabilities of being 1. For example, there is an 11% chance that Participant 1 is using OS 1 based on the variables derived by PCA. Thus, in this case, Participant 1 is more likely to be using OS 0, which agrees with the survey response. In this way, PCA can be used with regression models for calculating the probability of a phenomenon or making a prediction.
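
To summarize these predictions in one table (an extra check, not part of the original write-up), we can threshold the fitted probabilities at 0.5 and compare them with the observed OS:

# Confusion table at a 0.5 cutoff
table(observed = OS, predicted = as.numeric(fitted(model1) > 0.5))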

I have tried to do the same after scaling the data with scale(x) and it changed absolutely nothing, which makes sense: with cor = TRUE, princomp already works on the correlation matrix, i.e. on standardized variables.
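
A quick way to check that claim:

pca_scaled <- princomp(scale(buy_computer), cor = TRUE)
all.equal(pca$sdev, pca_scaled$sdev)  # the component standard deviations are identical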

An attempt at PCA on technical indicators.

For this purpose, we have taken a random stock, added a lot of technical indicators as variables, and kept one dependent variable.

# Read the file
library(readr)
stock_data <- read_csv("AugmentedStockData/CVX.csv")

Now on to creating our dependent variable and stripping down the data frame to just the columns that interest us, keeping only rows and columns without NAs.

library(dplyr)
# 1 if the 3-day return exceeds 3%, 0 otherwise; the first 24 rows are dropped because of NAs
binary <- if_else(stock_data$ret3days[25:4150] > 0.03, 1, 0)
# Candidate predictor columns
depend_var <- stock_data[25:4150, 8:34]

The base R function prcomp() is used to perform PCA. PCA only works properly with normalized data, so we center the variables to have mean equal to zero, and with the parameter scale. = TRUE we also scale them to have standard deviation equal to one.

prin_comp <- prcomp(depend_var, scale. = TRUE, center = TRUE)

Let’s have a closer look at that ‘prcomp’ function.

The center and scale components refer to the respective means and standard deviations of the variables used for normalization prior to implementing PCA.

names(prin_comp)
summary(prin_comp)

#outputs the mean of variables
prin_comp$center

#outputs the standard deviation of variables
prin_comp$scale

The rotation component provides the principal component loadings. Each column of the rotation matrix contains one principal component loading vector. This is the measure we should be most interested in.

#because it can be a huge matrix, let's only check the first few rows and columns.
prin_comp$rotation[1:5, 1:5]

Next, let’s quantify how much of the variance each principal component captures, and then plot it.

The prcomp() function also returns the standard deviation of each principal component in the sdev component.

#compute standard deviation of each principal component
std_dev <- prin_comp$sdev

#compute variance
pr_var <- std_dev^2

#check variance of first 10 components
pr_var[1:10]

We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. So, the higher the explained variance, the more information is contained in those components.

To compute the proportion of variance explained by each component, we simply divide each component’s variance by the total variance. This results in:

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]

This shows that the first principal component explains 41.7% of the variance, the second component 23.8%, the third 10.4%, and so on. So, how do we decide how many components to select for the modeling stage?

The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors which explain most of the variability in the data. It represents the values in descending order.

#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

Or we can do a cumulative scree plot.

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

Hence, in this case, the first 6 principal components explain over 90% of the variance of the data. That is, we can use these first 6 PCs as predictors in our next model.
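
To pick that cutoff programmatically rather than reading it off the plot, one could do something like:

# Smallest number of components whose cumulative explained variance reaches 90%
n_comp <- which(cumsum(prop_varex) >= 0.90)[1]
n_comp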

Let’s now apply this to a logistic regression model. For this we need our binary dependent variable, which we created above: a 1 for every ret3days > 3%, 0 otherwise.

mydata <- cbind(binary, prin_comp$x)
mydata <- as.data.frame(mydata)
model1 <- glm(binary ~ PC1 + PC2 + PC5 + PC6 + PC7, data = mydata, family=binomial)
summary(model1)
checkit <- fitted(model1)
checkit <- cbind(binary, checkit)
checkit <- as.data.frame(checkit)
head(checkit %>% filter(binary == 1), 20)
head(checkit %>% filter(checkit > 0.2), 20)

Not a very successful model.

Doing PCA and PCR with the pls package

As before, we cannot have NA data in our set.

library(pls)
# This time we keep the response ret3days (column 35) in the data frame
depend_var <- stock_data[25:4150, 8:35]
pcr_model <- pcr(ret3days ~ ., data = depend_var, scale = TRUE, validation = "CV")

In order to print out the results, simply use the summary function as below.

summary(pcr_model)

As you can see, two main results are printed, namely the validation error and the cumulative percentage of variance explained using n components. The cross-validation results are computed for each number of components used, so that you can easily check the score with a particular number of components without trying each combination on your own. The pls package also provides a set of methods to plot the results of PCR. For example, you can plot the results of cross-validation using the validationplot function. By default, the pcr function computes the root mean squared error and the validationplot function plots this statistic; however, you can choose to plot the usual mean squared error or the R2 by setting the val.type argument to “MSEP” or “R2” respectively.

# Plot the root mean squared error
validationplot(pcr_model)
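
For instance, to plot the other two statistics mentioned above instead of the default RMSEP:

# Plot the mean squared error of prediction
validationplot(pcr_model, val.type = "MSEP")
# Plot the R2
validationplot(pcr_model, val.type = "R2")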

What you would like to see is a low cross-validation error with a lower number of components than the number of variables in your dataset. If this is not the case, or if the smallest cross-validation error occurs with a number of components close to the number of variables in the original data, then no real dimensionality reduction occurs. In the example above, it looks like 3 components are enough to explain more than 90% of the variability in the data. Now you can try to use PCR on a training-test split and evaluate its performance, for example using only 6 components.

# Train-test split
train  <- stock_data[25:3000, 8:35]
test   <- stock_data[3001:4150, 8:34]
y_test <- stock_data$ret3days[3001:4150]

pcr_model <- pcr(ret3days ~ ., data = train, scale = TRUE, validation = "CV")

# Test-set mean squared error using the first 6 components
pcr_pred <- predict(pcr_model, test, ncomp = 6)
mean((as.numeric(pcr_pred) - y_test)^2)

PCA with Dow Jones

We download the different sectors of the Dow Jones.

library(quantmod)
tickers = c("XLY", "XLP", "XLE", "XLF", "XLV", "XLI", "XLB", "XLK", "XLU")
tickers.desc = c("ConsumerCyclicals", "ConsumerStaples", "Energy", "Financials", "HealthCare" ,"Industrials" , "Materials", "Technology", "Utilities")
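
A minimal sketch of a possible next step, assuming daily returns on adjusted prices (the start date below is an arbitrary choice, not part of the original):

# Download the sector ETFs and build a matrix of daily returns
getSymbols(tickers, from = "2015-01-01")
returns <- do.call(merge, lapply(tickers, function(t) dailyReturn(Ad(get(t)))))
colnames(returns) <- tickers
returns <- na.omit(returns)

# PCA on the standardized sector returns
sector_pca <- prcomp(returns, center = TRUE, scale. = TRUE)
summary(sector_pca)  # how much of the market-wide variance the first components capture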

References.

Here are the articles I have consulted for this research.

Although principal component analysis assumes multivariate normality, this is not a very strict assumption, especially when the procedure is used for data reduction or exploratory purposes. Undoubtedly, the correlation and covariance matrices are better measures of similarity if the data is normal, and yet, PCA is often unaffected by mild violations. However, if the new components are to be used in further analyses, such as regression analysis, normality of the data might be more important.