Overview

The first broad-level idea one needs to know in order to build a model for the calculation engine is what a .RData file does. A .RData file is a saved representation of R variables and functions that can be loaded into an R environment. The data can be any kind of R data structure, and the functions are typically defined in the R script from which the .RData file is saved to disk. .RData files are saved using the save() function and loaded using the load() function.
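For example, a minimal sketch (the object and file names here are illustrative, not part of the engine):

#save two objects (a data frame and a function) to disk...
myData<-data.frame(x=1:10,y=rnorm(10))
myFunction<-function(z){z*2}
save(myData,myFunction,file="example.RData")
#...and restore them, by name, in a fresh R session
load("example.RData")
myFunction(myData$x)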

When the calculation engine sends base64-encoded JSON from PDI to R, it calls a script called “calc.R”, which in turn loads a “runCalc.RData” file. This provides the runCalc() function, which performs all of the data pre- and post-processing steps, loads the .RData file corresponding to the model named in the JSON, and runs that model with backtesting.
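As a rough sketch of that hand-off (the actual calc.R is not reproduced here; the payload variable name and the use of the base64enc and jsonlite packages below are assumptions):

#calc.R (sketch): decode the base64 payload, parse the JSON, and hand off to runCalc()
library(base64enc);library(jsonlite)
load("runCalc.RData")                         #provides runCalc()
json<-rawToChar(base64decode(encodedPayload)) #encodedPayload: the base64 string sent from PDI (assumed name)
input<-fromJSON(json)                         #an R list matching the input JSON spec below
output<-runCalc(input)                        #pre-processing, model dispatch, post-processing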

Specifically, every model has an associated .RData file (e.g. VAR.RData). This .RData file contains the set of functions needed both to backtest the model and to predict forward from the base date; a sketch of how such a file is loaded and used follows the list. The set of functions will include the following:

1. A function to backtest the model on the given data and return error metrics and a score, e.g. "backTestVAR()".
2. A function that calls the backtesting function, runs the model on all of the data up to the base date, predicts forward, and extracts the relevant model-specific diagnostics. This function is always called "runModel()".
3. *OPTIONAL* A function that formats the data into the form the model expects and returns a standardized output, e.g. "makeVARandPredict()", which is sourced from the makeVAR.R script.
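Putting these together, runCalc() can treat each model's .RData file as an interchangeable unit: load the file named by the input and call its runModel(). A minimal sketch of that dispatch (the exact code inside runCalc() is not shown in this document, so treat this as illustrative):

#inside runCalc() (sketch): dispatch to the model named in the input JSON
modelFile<-paste0(input$model_name,".RData") #e.g. "VAR.RData"
load(modelFile)                              #provides runModel() and its helper functions
output<-runModel(input)                      #returns the list that becomes the output JSON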

It should be recalled that the input and output JSON are formatted as follows:

Output:

{
  "model_name": string,
  "diagnostics": {
    "rsq": number, "aic": number, "rmse": number, "mape": number,
    "rmspe": number, "error_score": number, "final_score": number
  },
  "predicted_values_unweighted": [integer, …],
  "predicted_values_weighted": [integer, …],
  "predicted_monthCounters": [integer, …],
  "back_test_base_dates": [integer, …],
  "back_test_predicted": [integer, …],
  "back_test_actual": [integer, …],
  "back_test_month_counters": [integer, …],
  "outcome": string,
  "predictors": [string, …],
  "predictor_id": integer
}

Input:

{
  "model_name": string,
  "predictors": [string, …],
  "predictor_monthCounters": { "predictor_name_A": [integer, …], … },
  "predictor_values": { "predictor_name_A": [number, …], … },
  "predictor_id": integer,
  "outcome": string,
  "employment": [number, …],
  "empl_monthCounter": [integer, …],
  "base_date": integer,
  "long_term_growth_rate": number
}

Here is a breakdown of how this works for VAR, the most complicated model we run.

The first issue with VAR is that it takes one or more time series that must all fall on the same dates. We need to ensure that all of the data being fed to the VAR model is commensurate, and we need to format that data as a single matrix with one variable per column and one row per shared date.

The actual input to the runModel() step will always be an R list derived from the input JSON. Therefore, to “make” each VAR model, we need to make sure the data is subset to all the points where the input time series “intersect”.

The second issue with VAR is how seasonality is projected forward: the model is run on the first-order, 12-month difference of the historical time series, and the previous year’s actual values are then added back to the projected difference values. This means that the seasonality of the last 12 months prior to the base date is essentially replicated across the entire projection period.
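As a concrete, made-up example: if the last historical April had 1,000 jobs and the model forecasts a 12-month difference of +25 for the first projected April, the projected value is 1,025; the second projected April then adds its own forecast difference to that 1,025, and so on, so the historical seasonal shape is carried forward year after year.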

Because of these issues, I made a “makeVARandPredict()” function in a separate script that the runModel() function will source. This function takes care of the second issue (projecting forward and adding back seasonality) and also automatically extracts the summary diagnostics for the model. The function is as follows:

##Function to make VAR model with a given number of predictors and an n-Ahead projection
makeVARandPredict<-function(x,nAhead=24){
  library(vars) #provides VAR(); x should have a monthCounter column, then the outcome, then predictors
  #Subset to remove month counter
  x.noMnthCnt<-subset(x,select=-c(monthCounter))
  x.noMnthCnt.ts<-ts(x.noMnthCnt,frequency=12)
  #year on year difference
  x.noMnthCnt.ts.diff<-diff(x.noMnthCnt.ts,lag=12)
  #make a model w/ up to lag 6 coeffs
  model<-VAR(x.noMnthCnt.ts.diff,p=6)
  #get summary model statistics
  summMod<-summary(model)
  rsq<-summMod$varresult[[1]]$r.squared
  adjusted_rsq<-summMod$varresult[[1]]$adj.r.squared
  aic<-AIC(model)
  #make forecasts of differenced values then de-difference
  preds<-predict(model,n.ahead=nAhead)
  fcstVals<-preds$fcst[[1]][,1]
  fcstVals<-round(fcstVals)
  predicted<-integer(length(fcstVals))
  lastN<-tail(x.noMnthCnt.ts[,1],12)
  for(j in 1:length(fcstVals)){
    #for the first twelve forecast values, add to previous years values
    if(j<=12){
      p<-fcstVals[j]
      prev<-lastN[j]
      predicted[j]<-prev+p
    }
    else{
      # for the rest, add to the 12 values from previous forecasted year
      p<-fcstVals[j]
      prev<-predicted[j-12]
      predicted[j]<-prev+p
      }
    }
  # generate monthCounters moving forward
  timePointsForward<-seq(max(x$monthCounter)+1,max(x$monthCounter)+nAhead)
  diagnostics<-list(rsq=rsq,adjusted_rsq=adjusted_rsq,AIC=aic)
  #gather values in to output
  all<-list(model_name="VAR",diagnostics=diagnostics,predicted_values_unweighted = predicted, predicted_monthCounters = timePointsForward)
  all
}

This function returns an R list with fields that will be passed further downstream to build the output JSON.
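A call might look like this (purely illustrative; forVAR stands for an aligned data table in the format described above):

#example usage (sketch): project 24 months ahead from an aligned data table
res<-makeVARandPredict(forVAR,nAhead=24)
res$diagnostics                 #rsq, adjusted_rsq, AIC
res$predicted_values_unweighted #24 projected values
res$predicted_monthCounters     #24 corresponding month counters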

We use this function inside the backTestVAR script as well:

backTestVAR<-function(forVAR,baseDate){
  library(data.table);library(forecast);library(vars)
  #how many variables in our VAR data? The pre-determined format for this is that the first two columns are monthCounter and outcome,
    #remaining are N number of predictors
  nCols<-ncol(forVAR)
  
  #declare variables
  naicsCode<-colnames(forVAR)[2]
  
  #preset back test dates
  backTestBaseDates<-c(baseDate-23,baseDate-29,baseDate-41)
  
  #track what we're doing while backtesting
  errorScores<-list(rmse=numeric(),rmspe=numeric(),mape=numeric(),nrmse=numeric())
  predictedValues<-list()
  actualValues<-list()
  theMonthCounters<-list()
  
  for(i in 1:length(backTestBaseDates)){
    #our base date monthCounter for back testing
    month<-backTestBaseDates[i]
    timeSeriesBack<-forVAR[monthCounter<month]
    timeSeriesActual<-forVAR[monthCounter>=month]
    
    #the actual employment values
    employment.actual<-timeSeriesActual[,2,with=F]
    employment.actual<-as.integer(unlist(employment.actual))
    employment.actual<-employment.actual[1:24]
    
    res<-makeVARandPredict(timeSeriesBack)
    
    error<-employment.actual-res$predicted_values_unweighted
    error<-as.numeric(unlist(error))
    pctError<-(error/employment.actual)*100
    pctError<-as.numeric(unlist(pctError))
    
    rmse<-(sqrt(mean(error^2)))
    mape<-(mean(abs(pctError)))
    rmspe<-(sqrt(mean(pctError^2)))
    nrmse<-rmse/(max(employment.actual)-min(employment.actual))
    
    errorScores$mape<-c(errorScores$mape,mape);errorScores$rmspe<-c(errorScores$rmspe,rmspe)
    errorScores$rmse<-c(errorScores$rmse,rmse);errorScores$nrmse<-c(errorScores$nrmse,nrmse)
    predictedValues[[i]]<-round(res$predicted_values_unweighted)
    actualValues[[i]]<-employment.actual
    theMonthCounters[[i]]<-res$predicted_monthCounters
  }
  
  #scoreOne takes in to account variance across back testing points and the mean error across backtesting (using root mean square percent error)
  scoreOne<-sd(errorScores$nrmse)*2 + mean(errorScores$nrmse)
  
  #norm it relative to what we think the "worst" score is
  normedScore<-1-(scoreOne)
  
  #this is the final score that is determined by the equation 0.05*Model Complexity Rank + 0.1 * world-relevance + 0.85 * normed  score1
  scoreTwo<-0.05 * 0.66 + 0.1 * 0 + 0.85 * normedScore
  
  out<-makeVARandPredict(forVAR,120)
  #carry the back test error metrics and scores into the diagnostics
  out$diagnostics$rmse<-errorScores$rmse
  out$diagnostics$rmspe<-errorScores$rmspe
  out$diagnostics$mape<-errorScores$mape
  out$diagnostics$nrmse<-errorScores$nrmse
  out$diagnostics$errorScore<-normedScore
  out$diagnostics$finalScore<-scoreTwo
  out$back_test_base_dates<-backTestBaseDates;out$back_test_predicted<-predictedValues
  out$back_test_actuals<-actualValues
  out$back_test_monthCounters<-theMonthCounters
  out$predictors<-colnames(forVAR)[3:nCols]
  out$outcome_naics<-as.character(naicsCode)
  out
}

The structure of the backtesting will look similar to this for all models: you set the back test base dates, iterate through them in a for loop, and collect length-3 vectors of error metrics for the backtested projections (rmse, rmspe, and nrmse, the last of which is used to construct the normedScore and finalScore). For simple models, you can just use the relevant R function to fit and predict directly from the input time series; for less simple ones, you may want to build a helper function that projects the model and ensures correct formatting of the input data. The backtesting function also wraps the values predicted forward from the base date together with the back tested data; whether you do this in backtesting or upon calling “runModel()” is up to you.
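For a simpler univariate model, the same pattern collapses to something like the following sketch (forecast::auto.arima is used here purely as a stand-in model, and backTestSimple/dataDT are illustrative names, not functions that exist in the engine):

#sketch of a generic back test for a simple univariate model
backTestSimple<-function(dataDT,baseDate){
  library(data.table);library(forecast)
  #same back test base dates as the VAR example
  backTestBaseDates<-c(baseDate-23,baseDate-29,baseDate-41)
  nrmseScores<-numeric()
  for(bd in backTestBaseDates){
    #fit on history before the back test date, compare against the next 24 actuals
    history<-dataDT[monthCounter<bd]
    actual<-as.numeric(dataDT[monthCounter>=bd][[2]][1:24])
    fit<-auto.arima(ts(history[[2]],frequency=12))
    predicted<-as.numeric(forecast(fit,h=24)$mean)
    rmse<-sqrt(mean((actual-predicted)^2))
    nrmseScores<-c(nrmseScores,rmse/(max(actual)-min(actual)))
  }
  #same scoring recipe as the VAR example
  normedScore<-1-(sd(nrmseScores)*2+mean(nrmseScores))
  list(error_score=normedScore)
}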

The final function you need to construct is runModel(). Here’s an example of the runModelVAR script:

#source('backTestVAR.R');source('makeVAR.R');source('weightProjections.R')
runModel<-function(input){
  library(data.table)
  formatted<-data.frame(var_name=character(),monthCounter=numeric(),var_value=numeric())
  
  nPredictors<-length(input$predictors)
  
  foo<-data.frame(var_name=input$outcome,monthCounter=(input$empl_monthCounter),var_value=(input$employment))
  formatted<-rbind(formatted,foo)
  
  for(i in 1:nPredictors){
    foo<-data.frame(var_name=input$predictors[i],monthCounter=(input$predictor_monthCounters[[i]]), 
                    var_value=(input$predictor_values[[i]]))
    formatted<-rbind(formatted,foo)
  }
  
  formatted<-data.table(formatted)
  #cast long data (var_name, monthCounter, var_value) to wide format: one column per variable
  formatted<-data.table::dcast(formatted,monthCounter~var_name,value.var="var_value")
  #drop any month where at least one series is missing, so all series cover the same dates
  formatted<-na.omit(formatted)
  
  output<-backTestVAR(formatted,input$base_date)
  output$predictor_id<-input$predictor_id
  output
}

save(makeVARandPredict,backTestVAR,runModel,file="VAR.RData")

When you actually build the .RData file, the first line, which sources the other functions needed, should be uncommented.

Notice that we first source all of the scripts needed to run the model. Then we format the data to be fed to VAR, i.e. we make sure the time series are commensurate. Then we run the back test function, which gives us all the information we need to pass downstream except for the predictor_id. This “output” variable is a list that will eventually be converted back to JSON.

The “formatted” data table fed to backTestVAR() is the aligned matrix described above: all predictors and the outcome are put into columns and subset to the dates they share. This is accomplished by putting the data into long format (one column for the variable name, one for the variable value) and using the dcast() function from the data.table package to “cast” the data into standard columnar format.
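A toy illustration of that long-to-wide cast (the values are made up):

library(data.table)
long<-data.table(var_name=rep(c("emp","permits"),each=3),
                 monthCounter=rep(101:103,2),
                 var_value=c(50,52,51,7,9,8))
data.table::dcast(long,monthCounter~var_name,value.var="var_value")
#result: one row per monthCounter, one column per variable
#   monthCounter emp permits
#            101  50       7
#            102  52       9
#            103  51       8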

Most other models one might want to implement will not be as complicated as VAR. They will typically require two scripts with two functions: one for backtesting the model (and, optionally, forecasting as well), and one for running the model, subsetting the data as needed and forecasting either via the backtest function or within runModel() itself. The runModel and backTest functions, along with any supporting functions, are then saved in a .RData file whose name corresponds to the model_name passed in through the input JSON. This .RData file is loaded when runCalc() is run, and it serves as a “black box” that takes in the input JSON and spits out the metrics and predicted values for the output JSON.
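Concretely, for a hypothetical simpler model the deliverable reduces to two functions and one save() call (the names here are illustrative, mirroring the sketch given earlier):

#backTestSimple() and a matching runModel() defined in their own scripts, then bundled:
save(backTestSimple,runModel,file="SIMPLE.RData") #"SIMPLE" would be the model_name sent in the input JSON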