During each Time Step t, a financial supervisor, B, is able to observe the scores for Risk Factors X1 and X2, as well as the scores for Control Factors Y1 and Y2, for a particular Company A. At the same Time Step t, B can also observe the data for the Net Risk Measure Z, corresponding to A. A sample of the data for Time Step t can be found in Table 1.
B has a hypothesis that the Net Risk Measure Z can be predicted using the Risk and Control Factors.
By using a Multi-Variate Linear Regression Approach, identify a suitable regression model and write out its expression. What is the predicted Net Risk Measure Z for the next Time Step t?
Briefly explain at least 2 issues, from a statistical perspective, which should be examined before one can conclude that a risk model, such as the one above, is robust and can be used.
# Loading Libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## corrplot 0.84 loaded
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Loading Data
table1_data <- read.csv('./Q1 Q2.csv', header = TRUE)
# Cleaning Data
table1_data <- table1_data[, colSums(is.na(table1_data)) == 0]
summary(table1_data)
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Min. : 1.0 Min. :2.620 Min. :1.040 Min. :1.110
## 1st Qu.: 4.5 1st Qu.:3.840 1st Qu.:2.205 1st Qu.:3.040
## Median : 8.0 Median :4.190 Median :3.370 Median :4.360
## Mean : 8.0 Mean :4.164 Mean :3.450 Mean :3.979
## 3rd Qu.:11.5 3rd Qu.:4.445 3rd Qu.:4.690 3rd Qu.:5.230
## Max. :15.0 Max. :5.610 Max. :5.910 Max. :5.920
## Control.Factor.Y2 Net.Risk.Measure.Z
## Min. :1.320 Min. :3.450
## 1st Qu.:2.085 1st Qu.:4.955
## Median :2.530 Median :5.260
## Mean :2.478 Mean :5.435
## 3rd Qu.:2.830 3rd Qu.:6.025
## Max. :3.800 Max. :7.470
Normally, we would split the data into 50% for training and 50% for testing our Multi-Variate Linear Regression Model. Here, our data (table1_data) has a small sample size of only 15 observations, so we will use the full dataset as both the training and the testing set. Note that evaluating a model on its own training data gives an optimistic view of out-of-sample performance.
# Splitting Data (with p = 1, createDataPartition keeps all 15 rows,
# so the training and testing sets are identical)
set.seed(1234)
inTrain <- createDataPartition(table1_data$Time.Step.t, p = 1, list = FALSE)
trainData <- table1_data[inTrain, ]
testData <- table1_data[inTrain, ]
dim(trainData)
## [1] 15 6
dim(testData)
## [1] 15 6
# Ensuring all columns are numeric
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
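Given the small sample, leave-one-out cross-validation would give a less optimistic error estimate than testing on the training data itself. A minimal sketch using caret (not part of the analysis below):
# Optional: leave-one-out cross-validation, a less optimistic alternative
# to re-using the training data as the test set when n = 15
loocvFit <- train(Net.Risk.Measure.Z ~ ., data = trainData, method = "lm",
                  trControl = trainControl(method = "LOOCV"))
print(loocvFit)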
# Plotting a Correlation Matrix for Training Data
trainMatrix <- rcorr(as.matrix(trainData))
print(trainMatrix)
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t 1.00 0.31 0.31 0.03
## Risk.Factor.X1 0.31 1.00 0.61 0.04
## Risk.Factor.X2 0.31 0.61 1.00 0.13
## Control.Factor.Y1 0.03 0.04 0.13 1.00
## Control.Factor.Y2 0.62 0.12 -0.29 0.14
## Net.Risk.Measure.Z 0.32 -0.15 -0.03 0.00
## Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t 0.62 0.32
## Risk.Factor.X1 0.12 -0.15
## Risk.Factor.X2 -0.29 -0.03
## Control.Factor.Y1 0.14 0.00
## Control.Factor.Y2 1.00 0.41
## Net.Risk.Measure.Z 0.41 1.00
##
## n= 15
##
##
## P
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t 0.2572 0.2576 0.9082
## Risk.Factor.X1 0.2572 0.0163 0.8743
## Risk.Factor.X2 0.2576 0.0163 0.6562
## Control.Factor.Y1 0.9082 0.8743 0.6562
## Control.Factor.Y2 0.0143 0.6757 0.3027 0.6217
## Net.Risk.Measure.Z 0.2392 0.6006 0.9197 0.9910
## Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t 0.0143 0.2392
## Risk.Factor.X1 0.6757 0.6006
## Risk.Factor.X2 0.3027 0.9197
## Control.Factor.Y1 0.6217 0.9910
## Control.Factor.Y2 0.1254
## Net.Risk.Measure.Z 0.1254
# Flattening the correlation matrix into (row, column, cor, p) rows
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(row    = rownames(cormat)[row(cormat)[ut]],
             column = rownames(cormat)[col(cormat)[ut]],
             cor    = cormat[ut],
             p      = pmat[ut])
}
flattentrainMatrix <- flattenCorrMatrix(trainMatrix$r, trainMatrix$P)
print(flattentrainMatrix)
## row column cor p
## 1 Time.Step.t Risk.Factor.X1 0.31222177 0.25724040
## 2 Time.Step.t Risk.Factor.X2 0.31197950 0.25762950
## 3 Risk.Factor.X1 Risk.Factor.X2 0.60744462 0.01631352
## 4 Time.Step.t Control.Factor.Y1 0.03260289 0.90817134
## 5 Risk.Factor.X1 Control.Factor.Y1 0.04471071 0.87428435
## 6 Risk.Factor.X2 Control.Factor.Y1 0.12537919 0.65615462
## 7 Time.Step.t Control.Factor.Y2 0.61688389 0.01430005
## 8 Risk.Factor.X1 Control.Factor.Y2 0.11786044 0.67570291
## 9 Risk.Factor.X2 Control.Factor.Y2 -0.28528054 0.30270321
## 10 Control.Factor.Y1 Control.Factor.Y2 0.13883414 0.62169068
## 11 Time.Step.t Net.Risk.Measure.Z 0.32373285 0.23916567
## 12 Risk.Factor.X1 Net.Risk.Measure.Z -0.14719332 0.60063548
## 13 Risk.Factor.X2 Net.Risk.Measure.Z -0.02848653 0.91972842
## 14 Control.Factor.Y1 Net.Risk.Measure.Z -0.00317299 0.99104576
## 15 Control.Factor.Y2 Net.Risk.Measure.Z 0.41357888 0.12542495
The summary above shows the pairwise associations between Net Risk Measure Z and the other variables.
Z is moderately positively correlated with Time Step t (0.324) and Control Factor Y2 (0.414), weakly negatively correlated with Risk Factor X1 (-0.147), and essentially uncorrelated with Risk Factor X2 (-0.028) and Control Factor Y1 (-0.003).
This suggests that Net Risk Measure Z tends to increase as Time Step t and Control Factor Y2 increase, and to decrease slightly as Risk Factor X1 increases; the associations with Risk Factor X2 and Control Factor Y1 are too weak to interpret. Note, however, that none of the correlations with Z is statistically significant at the 0.05 level.
# Plotting a Correlation Plot for Training Data
corrplot(trainMatrix$r, type = "upper", order = "FPC", method = "color",
p.mat = trainMatrix$P, sig.level = 0.05, insig = "pch",
tl.cex = 0.8, tl.col = "black", tl.srt = 45)
In the Correlation Plot shown above, highly correlated variable pairs appear as dark blue cells. We use a significance level of 0.05: correlations with a p-value above 0.05 are treated as not statistically significant and are marked with a cross.
Only two correlations are left unmarked, which suggests that Time Step t vs Control Factor Y2 and Risk Factor X1 vs Risk Factor X2 are the only statistically significant pairwise correlations. Nevertheless, we will use all the variables to build our Multi-Variate Linear Regression Model.
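Since Risk Factors X1 and X2 are themselves significantly correlated (0.61), multicollinearity among the predictors is worth a quick numeric check before relying on individual coefficients. A minimal sketch using the car package (assumed to be installed; not part of the original workflow):
# Checking multicollinearity via variance inflation factors (VIF);
# values well above 5-10 would indicate problematic collinearity
library(car)
vif(lm(Net.Risk.Measure.Z ~ ., data = trainData))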
# Building our Multi-Variate Linear Regression Model
modFit <- train(Net.Risk.Measure.Z ~ Time.Step.t + Risk.Factor.X1 + Risk.Factor.X2 +
                  Control.Factor.Y1 + Control.Factor.Y2,
                method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression
##
## 15 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 15, 15, 15, 15, 15, 15, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.463806 0.1823794 2.045874
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(finMod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Time.Step.t Risk.Factor.X1 Risk.Factor.X2
## 4.87522 -0.03762 -0.77507 0.40960
## Control.Factor.Y1 Control.Factor.Y2
## -0.11189 1.25907
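Reading the coefficients off the fitted model above, the regression expression is:
Z = 4.875 - 0.038 t - 0.775 X1 + 0.410 X2 - 0.112 Y1 + 1.259 Y2
where t is the Time Step, X1 and X2 are the Risk Factors, and Y1 and Y2 are the Control Factors (coefficients rounded to three decimal places).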
# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)
The Regression Diagnostic Plots examine the residuals in four ways: the Residuals vs Fitted Plot, the Normal Q-Q Plot, the Scale-Location Plot, and the Residuals vs Leverage Plot. We consider each in turn.
The linearity assumption can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are the predictions from our model on the training data, while residuals are the differences between the observed and fitted values.
# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)
The Residuals vs Fitted Plot shown above suggests that our data satisfies the linearity assumption. Characteristics to support this include a roughly horizontal red trend line close to zero, with residuals scattered randomly and showing no distinct pattern such as curvature.
However, there are some outliers (X7, X10, X11) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
The normality assumptions can be examined by inspecting the Normal Q-Q Plot.
# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)
The Normal Q-Q Plot shown above suggests that our residuals are approximately normally distributed. Characteristics to support this include the standardised residuals falling roughly along the straight reference line.
However, there are some outliers (X7, X10, X11) that deviate from this line. We need to examine them individually for anything unusual, or check whether they are data entry errors.
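As a complement to the visual check, the normality of the residuals can also be tested formally. A minimal sketch using the Shapiro-Wilk test (not part of the original workflow, and of limited power at n = 15):
# Formal normality test of the residuals; a p-value above 0.05 would be
# consistent with normality, though the test has little power at n = 15
shapiro.test(residuals(finMod))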
Homogeneity of variance (homoscedasticity) of the residuals can be examined by inspecting the Scale-Location Plot.
# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)
The Scale-Location Plot shown above suggests that our residuals have good homoscedasticity. Characteristics to support this include a roughly horizontal red line, with the standardised residuals spread evenly across the range of fitted values.
However, there are some outliers (X7, X10, X11) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
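Homoscedasticity can likewise be tested formally. A minimal sketch using the Breusch-Pagan test from the lmtest package (assumed to be installed; not part of the original workflow):
# Breusch-Pagan test for heteroscedasticity; a small p-value would indicate
# non-constant residual variance
library(lmtest)
bptest(finMod)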
Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is associated with a large residual and its inclusion or exclusion can alter the regression analysis.
Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.
# Plotting the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)
The Cook's Distance and Residuals vs Leverage Plots shown above suggest that our data contains no influential points. Characteristics to support this include all observations lying well within the Cook's distance contours, with no point combining high leverage and a large standardised residual.
However, there are some outliers (X10, X11, X13) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
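As a numeric complement to the plots, the Cook's distances can be inspected directly. A minimal sketch using the common 4/n rule of thumb (the threshold is a convention, not part of the original workflow):
# Flagging observations whose Cook's distance exceeds 4/n
cooksD <- cooks.distance(finMod)
print(round(cooksD, 3))
which(cooksD > 4 / nrow(trainData))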
# Prediction with our Multi-Variate Linear Regression Model
Prediction <- predict(modFit, testData)
# Plotting predicted vs observed Z, coloured by each variable in turn
qplot(Net.Risk.Measure.Z, Prediction, colour = Time.Step.t, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X1, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X2, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y1, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y2, data = testData)
# Summary of our Prediction
summary(Prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.530 4.973 5.483 5.435 5.769 6.469
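The question also asks for the predicted Net Risk Measure Z at the next Time Step (t = 16). The factor scores at t = 16 are not given, so some assumption about them is required; as a purely illustrative choice, the sketch below carries forward the most recently observed values of X1, X2, Y1 and Y2:
# Predicting Z at t = 16, carrying forward the latest observed factor scores
# (an illustrative assumption; any forecast of X1, X2, Y1, Y2 could be used)
nextStep <- trainData[which.max(trainData$Time.Step.t), ]
nextStep$Time.Step.t <- nextStep$Time.Step.t + 1
predict(modFit, nextStep)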
Essentially, we used a Multi-Variate Linear Regression Model to explain the relationship between one continuous dependent variable (the outcome) and two or more independent variables (the predictors). Our outcome, Net Risk Measure Z, is continuous, while our predictors can be continuous or categorical.
Some assumptions made when building our model: linearity of the relationship between the outcome and the predictors, independence of the residuals, homoscedasticity (constant residual variance), normality of the residuals, and no severe multicollinearity among the predictors.
Pros of our model: It can determine the relative influence of one or more predictors on the outcome. It can also identify outliers, high leverage points and influential values.
Cons of our model: It may not provide a good fit if our data comes from a population with a distribution that violates the abovementioned assumptions.
Our model gave a bootstrap-resampled R-squared value of 0.18, a Root Mean Square Error (RMSE) of 2.46, and a Mean Absolute Error (MAE) of 2.05. Given the small sample size of 15 observations, this level of error is not unexpected.
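For reference, the in-sample errors can be computed directly; because the model is evaluated on the same data it was fitted to, these will typically be lower than the bootstrap resampling estimates reported by caret above:
# In-sample RMSE and MAE (optimistic relative to the resampled estimates)
insampleRes <- trainData$Net.Risk.Measure.Z - predict(modFit, trainData)
sqrt(mean(insampleRes^2)) # in-sample RMSE
mean(abs(insampleRes)) # in-sample MAE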
While our R-squared value is not high, we evaluated our model in conjunction with Regression Diagnostic Plots to reveal any residual patterns that may indicate bias.
From our Residuals vs Fitted Plot, our data shows good linearity. From our Normal Q-Q Plot, our data shows good normality of residuals. From our Scale-Location Plot, our data shows good homoscedasticity. From our Residuals vs Leverage Plot, our data does not show any influential values. (Multicollinearity is assessed separately, e.g. from the correlation matrix or variance inflation factors, not from the Scale-Location Plot.)
Essentially, our data satisfies the assumptions underlying our model, and goodness-of-fit would likely improve with a larger sample. On this basis, we are confident that our model is robust, performs adequately, and can be used.