Brian Scott
The LOG_LAG function will log and lag an entire data set. The function will only log variables that have a minimum value greater than 0. This way there are no NA’s produced by trying to log negative numbers. The result of this function is a data frame that includes the raw variable, the logged variable (if applicable), and lagged variables for both the raw and the logged data.
Logging and Lagging Data is easy to simply hard code, but when you are unsure what variables will actually go into the final model hard coding the logs and lags can be very time consuming. Especially, when the potential variables are extensive. Additionally, it is easier to call variables from a single database, and the hard coded variables would have to be manually inserted back into that database to do this. This function also does not change the original database, it only adds new databases.
In my experience logged models typically have much higher adjusted R squares than non-logged models. I found myself trying different variations of logged and non logged models quite often. This function makes it easy to quickly switch variables to the desired form.
This data is masked, each variable was renamed as var1,var2, etc.
library(readxl)
MaskedData <- read_excel("F:/Summer2022/MaskedData.xlsx",
col_types = c("date", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric"))
DF <- MaskedDataThe function has two inputs.
The function saves three data frames to the global environment. The FullLagLog dataframe can become very large. I included other data frames as output so that analysis can be done faster using the smaller data frames, if lagged variables are unneeded.
LOG_LAG <- function(df,lagAMT) {
require(zoo)
require(dplyr)
require(tidyverse)
#Create Logged DF, by logging only positive numeric collumns
D <- df
#omitting NA's in order to use min function, otherwise Min will return NA for any collumn that has NA's
newD <- na.omit(D)
#selecting only numeric collumns that are fully positive
newDF <- newD %>%
summarise_if(is.numeric, min)
relCols <- newDF %>%
select_if(newDF > 0)
#Using the positive numeric collumns to create a DB of each of those collumns, NA's back in DB
relColNames <- colnames(relCols)
nDF <- D[,relColNames]
#R returns NA's for logged values of NA's, so no error occurs by logging the whole DB
logNDF <- log(nDF)
names(logNDF) <- paste(names(relCols), "_log")
#Removes the space between "variable name" and "_log"
names(logNDF) <- gsub(" ", "",names(logNDF))
#Save LogDF to global environment
LogDF <<- as.data.frame(logNDF)
#Combine Log dataframe with the original database
dfLogRaw <- cbind(D,logNDF)
LogRawDF <<- dfLogRaw
#Create lagged Variables
#Variables lagged lagAMT units. Notation: "variableName_log_lag"
FullLagLogDF <<- mutate_all(dfLogRaw, funs("Lag" = lag(.,lagAMT)))
}
LOG_LAG(DF,1)The data frame that this function creates is typically lengthy. I selected only one of the variables to show that the function did create the three additional variables.
head(FullLagLogDF %>% select(var1, var1_log, var1_Lag, var1_log_Lag))## var1 var1_log var1_Lag var1_log_Lag
## 1 4384.99 8.385943 NA NA
## 2 4498.25 8.411444 4384.99 8.385943
## 3 4210.72 8.345389 4498.25 8.411444
## 4 4100.00 8.318742 4210.72 8.345389
## 5 3919.78 8.273791 4100.00 8.318742
## 6 4392.71 8.387702 3919.78 8.273791
To show that the two other databases were also made I called variables from each. This is the data frame that includes logged and raw data but no lags
head(LogRawDF %>% select(var1, var1_log))## var1 var1_log
## 1 4384.99 8.385943
## 2 4498.25 8.411444
## 3 4210.72 8.345389
## 4 4100.00 8.318742
## 5 3919.78 8.273791
## 6 4392.71 8.387702
This data frame only includes variables that were able to be logged
head(LogDF %>% select(var1_log,var33_log))## var1_log var33_log
## 1 8.385943 NA
## 2 8.411444 1.3862944
## 3 8.345389 0.6931472
## 4 8.318742 1.0986123
## 5 8.273791 1.0986123
## 6 8.387702 1.7917595
This function is a work in progress, I intend to make updates to fix a few small issues.
The function currently logs categorical variables, if they are numeric categories. This takes additional processing power to make a variable of no use.
The function currently creates a lagged variable for the date, which again takes processing power for a useless variable.
I intend to add in if-statement tests for the function inputs, so the function does not crash if an input value is used incorrectly. For example, specifying lagAMT to 1.5 will cause an error. I want to provide output stating “The lagAMT must be a discrete number”