Brian Scott
The function produces a full correlogram of the entire data set, a bar chart showing variables that are significantly correlated with the variable of interest, and a reduced correlogram that only includes the variables from the bar chart.
The CORR_REDUCE function helps the user understand how the variables in a data set are related to each other. This is useful for determining potential variables to include in a regression model. We want to find variables that are highly correlated with our variable of interest (the dependent variable for the model). To avoid multicollinearity, the variables highly correlated with the dependent variable also need to have low correlation with each other if they are to be included together in the model.
This function helps the user narrow in on the best potential variable combinations for the model. The bar chart shows all the variables whose correlation with the variable of interest (in absolute value) exceeds the correlation level specified in the function call. These variables are then placed into a reduced correlogram so the user can see the correlations between the potential predictors themselves.
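To make the idea concrete before turning to the masked data, here is a tiny illustration using R's built-in mtcars data (which has nothing to do with the data set analyzed below), with mpg standing in for the variable of interest:
corMat <- cor(mtcars)

# Correlation of every variable with the dependent variable (mpg).
round(corMat[, "mpg"], 2)

# Two strong candidates should also be checked against each other;
# a high value here is the multicollinearity warning sign.
round(corMat["wt", "disp"], 2)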
The data is masked; each variable was renamed var1, var2, etc.
library(readxl)

# One date column followed by 33 numeric columns
MaskedData <- read_excel("F:/Summer2022/MaskedData.xlsx",
                         col_types = c("date", rep("numeric", 33)))
DF <- MaskedData

The function has three inputs: the data frame, the correlation cutoff level, and the name of the variable of interest. It produces three plots.
CORR_REDUCE <- function(df, corrLev, yVar) {
  require(psych)
  require(dplyr)

  # Keep only complete rows and numeric columns
  newDF <- na.omit(df)
  newDF <- newDF %>%
    select(where(is.numeric))

  # Full data set correlogram
  colSize <- ncol(newDF)
  fullCorrPlot <- corPlot(newDF[, 1:colSize], scale = FALSE,
                          main = "Full Dataset Correlogram")

  # Bar chart of variables whose correlation with yVar clears the cutoff
  corrD <- cor(newDF, use = "all.obs")
  names(corrD) <- gsub(x = names(corrD),
                       pattern = "(\\.)+",
                       replacement = " ")
  corrD <- data.frame(corrD)
  y <- match(yVar, names(corrD))
  negCorrLev <- -1 * corrLev
  corrY <- corrD %>%
    select(all_of(y)) %>%
    filter(corrD[[y]] >= corrLev | corrD[[y]] < negCorrLev)
  corrYt <- t(corrY)

  # Title string for the bar chart
  tM1 <- "Vars Correlated Higher Than: "
  tM2 <- as.character(corrLev)
  tM3 <- "(abs value) with: "
  tM4 <- as.character(yVar)
  titleB <- paste(tM1, tM2, tM3, tM4)
  barplot(corrYt, main = titleB, xlab = "Variables",
          ylab = "Correlation AMT", col = "blue")

  # Refined correlogram, with only the variables highly correlated with yVar
  relevantVars <- colnames(corrYt)
  xCorrDF <- newDF[, relevantVars]
  subColSize <- ncol(xCorrDF)
  s1 <- "Correlations Between Vars Highly Correlated with: "
  s2 <- as.character(yVar)
  titleS <- paste(s1, s2)
  subCorrPlot <- corPlot(xCorrDF[, 1:subColSize], scale = FALSE, main = titleS)
}
CORR_REDUCE(DF, .5, "var1")

The full data set correlogram is significantly reduced by the final step, which makes reading the correlogram much easier.
The bar chart tells us there are 11 variables correlated over .5 with var1. The original data set has 34 variables, which means just by running this function we can eliminate 23 of our data set variables.
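If you want the names of those variables in a form you can reuse, the same cutoff rule can be applied by hand outside the function. This is just a convenience sketch, not part of CORR_REDUCE, and it assumes the DF object and the .5 cutoff used above:
library(dplyr)

# List the variables whose correlation with var1 clears .5 in absolute value
# (var1 itself appears in the list, since its correlation with itself is 1).
numDF <- na.omit(DF) %>% select(where(is.numeric))
corrsWithY <- cor(numDF, use = "all.obs")[, "var1"]
keepVars <- names(corrsWithY)[abs(corrsWithY) >= 0.5]
keepVars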
In the reduced correlogram we are looking for variable combinations that have low correlation to each other.
3a. One example would be var17 and var25: they have a correlation of .28 with each other and are each correlated with var1 at greater than .6.
3b. Var20 and var17 might be a better combination. They have a correlation of .52 with each other, which is somewhat high, but var20's correlation with var1 is .87.
3c. I would probably start with a model containing var17 and var20, even though there is a higher chance of multicollinearity. My reasoning is that var20's correlation with var1 is much higher, and I can use variance inflation factors (VIF), as sketched below, to determine whether there actually is an issue.
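As a rough sketch of that VIF check, assuming the masked data in DF from above and that the car package is installed, one could fit the candidate model and inspect the variance inflation factors:
library(car)

# Candidate model from 3c: var1 explained by var17 and var20.
candidateModel <- lm(var1 ~ var17 + var20, data = DF)
summary(candidateModel)

# Variance inflation factors; values well above 5 (or 10, depending on the
# rule of thumb used) would suggest the var17/var20 overlap is a real problem.
vif(candidateModel)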
This method is only a starting point; it is not conclusive proof that metrics should or shouldn't be in the model.
Some of the metrics that were removed may still have had meaningful correlations after a data transformation, such as taking the log of the variable.
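For instance, a dropped variable could be re-checked after a log transform with something like the following; the choice of var9 here is purely hypothetical, and the log only makes sense if the variable is strictly positive:
# Hypothetical re-check of a dropped variable after a log transform.
cor(DF$var1, DF$var9, use = "complete.obs")        # raw scale
cor(DF$var1, log(DF$var9), use = "complete.obs")   # after logging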
If the data set you are using the function on is very large, then you may need to change the plot margins or the size of the output window to see the full correlogram.
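A simple workaround, not specific to this function, is to enlarge the plotting window or to send the plots to a large graphics file, for example:
# Render the three plots to large PNG files instead of the default plot pane.
# The %d in the file name produces one file per plot.
png("correlogram_%d.png", width = 1600, height = 1600)
CORR_REDUCE(DF, .5, "var1")
dev.off()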
The function assumes that the data set was compiled with economic reasoning. For example, if the dependent variable is a stock's closing price, then the data set should only include variables that you have determined could affect closing prices. This alleviates any concern about variables showing high correlation to the closing price by accident.
4a. For example, if we found a high correlation between closing price and the number of dogs in Europe, we would still not include that variable in the model. There is no economic reasoning that could back up that metric.