Overview

Question 1

At each Time Step t, a financial supervisor, B, observes the scores for Risk Factors X1 and X2 and for Control Factors Y1 and Y2 for a particular Company A. At the same Time Step t, B also observes the corresponding Net Risk Measure Z for A. A sample of the data is shown in Table 1.

B has a hypothesis that the Net Risk Measure Z can be predicted using the Risk and Control Factors.

Using a Multi-Variate Linear Regression approach, identify a suitable regression model and write out its expression. What is the predicted Net Risk Measure Z for the next Time Step t?

Question 2

Briefly explain at least 2 issues, from a statistical perspective, that should be examined before concluding that a risk model, such as the one above, is robust and fit for use.

Loading Libraries

# Loading Libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster

Loading and Cleaning Data

# Loading Data
table1_data <- read.csv('./Q1 Q2.csv', header = TRUE)

# Cleaning Data: drop any columns that contain missing values
table1_data <- table1_data[, colSums(is.na(table1_data)) == 0]
summary(table1_data)
##   Time.Step.t   Risk.Factor.X1  Risk.Factor.X2  Control.Factor.Y1
##  Min.   : 1.0   Min.   :2.620   Min.   :1.040   Min.   :1.110    
##  1st Qu.: 4.5   1st Qu.:3.840   1st Qu.:2.205   1st Qu.:3.040    
##  Median : 8.0   Median :4.190   Median :3.370   Median :4.360    
##  Mean   : 8.0   Mean   :4.164   Mean   :3.450   Mean   :3.979    
##  3rd Qu.:11.5   3rd Qu.:4.445   3rd Qu.:4.690   3rd Qu.:5.230    
##  Max.   :15.0   Max.   :5.610   Max.   :5.910   Max.   :5.920    
##  Control.Factor.Y2 Net.Risk.Measure.Z
##  Min.   :1.320     Min.   :3.450     
##  1st Qu.:2.085     1st Qu.:4.955     
##  Median :2.530     Median :5.260     
##  Mean   :2.478     Mean   :5.435     
##  3rd Qu.:2.830     3rd Qu.:6.025     
##  Max.   :3.800     Max.   :7.470

Preparing Datasets for Prediction

Normally, we would split the data into 50% for training and 50% for testing our Multi-Variate Linear Regression Model. Here, our data (table1_data) has a small sample size of only 15 observations, so we will use the full dataset as both our training and testing datasets. Note that evaluating a model on the same data it was fitted to gives an optimistic estimate of its performance.

# Splitting Data (p = 1 keeps all 15 rows, so the training and testing sets are identical)
set.seed(1234)
inTrain <- createDataPartition(table1_data$Time.Step.t, p = 1, list = FALSE)
trainData <- table1_data[inTrain, ]
testData <- table1_data[inTrain, ]
dim(trainData)
## [1] 15  6
dim(testData)
## [1] 15  6
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
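
Given the small sample, leave-one-out cross-validation would give a less optimistic error estimate than reusing the training data for testing. A minimal sketch using caret's trainControl, not run as part of this analysis:

# Alternative: estimate out-of-sample error with leave-one-out cross-validation
loocvFit <- train(Net.Risk.Measure.Z ~ ., method = "lm", data = trainData,
                  trControl = trainControl(method = "LOOCV"))
print(loocvFit)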

Plotting a Correlation Matrix for Training Data

# Plotting a Correlation Matrix for Training Data
trainMatrix <- rcorr(as.matrix(trainData))
print(trainMatrix)
##                    Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t               1.00           0.31           0.31              0.03
## Risk.Factor.X1            0.31           1.00           0.61              0.04
## Risk.Factor.X2            0.31           0.61           1.00              0.13
## Control.Factor.Y1         0.03           0.04           0.13              1.00
## Control.Factor.Y2         0.62           0.12          -0.29              0.14
## Net.Risk.Measure.Z        0.32          -0.15          -0.03              0.00
##                    Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t                     0.62               0.32
## Risk.Factor.X1                  0.12              -0.15
## Risk.Factor.X2                 -0.29              -0.03
## Control.Factor.Y1               0.14               0.00
## Control.Factor.Y2               1.00               0.41
## Net.Risk.Measure.Z              0.41               1.00
## 
## n= 15 
## 
## 
## P
##                    Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t                    0.2572         0.2576         0.9082           
## Risk.Factor.X1     0.2572                     0.0163         0.8743           
## Risk.Factor.X2     0.2576      0.0163                        0.6562           
## Control.Factor.Y1  0.9082      0.8743         0.6562                          
## Control.Factor.Y2  0.0143      0.6757         0.3027         0.6217           
## Net.Risk.Measure.Z 0.2392      0.6006         0.9197         0.9910           
##                    Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t        0.0143            0.2392            
## Risk.Factor.X1     0.6757            0.6006            
## Risk.Factor.X2     0.3027            0.9197            
## Control.Factor.Y1  0.6217            0.9910            
## Control.Factor.Y2                    0.1254            
## Net.Risk.Measure.Z 0.1254
flattenCorrMatrix <- function(cormat, pmat) {
  # Flatten the upper triangle of a correlation matrix (and its p-values)
  # into a long data frame with one row per variable pair
  ut <- upper.tri(cormat)
  data.frame(row = rownames(cormat)[row(cormat)[ut]],
             column = rownames(cormat)[col(cormat)[ut]],
             cor = cormat[ut],
             p = pmat[ut])
}
flattentrainMatrix <- flattenCorrMatrix(trainMatrix$r, trainMatrix$P)
print(flattentrainMatrix)
##                  row             column         cor          p
## 1        Time.Step.t     Risk.Factor.X1  0.31222177 0.25724040
## 2        Time.Step.t     Risk.Factor.X2  0.31197950 0.25762950
## 3     Risk.Factor.X1     Risk.Factor.X2  0.60744462 0.01631352
## 4        Time.Step.t  Control.Factor.Y1  0.03260289 0.90817134
## 5     Risk.Factor.X1  Control.Factor.Y1  0.04471071 0.87428435
## 6     Risk.Factor.X2  Control.Factor.Y1  0.12537919 0.65615462
## 7        Time.Step.t  Control.Factor.Y2  0.61688389 0.01430005
## 8     Risk.Factor.X1  Control.Factor.Y2  0.11786044 0.67570291
## 9     Risk.Factor.X2  Control.Factor.Y2 -0.28528054 0.30270321
## 10 Control.Factor.Y1  Control.Factor.Y2  0.13883414 0.62169068
## 11       Time.Step.t Net.Risk.Measure.Z  0.32373285 0.23916567
## 12    Risk.Factor.X1 Net.Risk.Measure.Z -0.14719332 0.60063548
## 13    Risk.Factor.X2 Net.Risk.Measure.Z -0.02848653 0.91972842
## 14 Control.Factor.Y1 Net.Risk.Measure.Z -0.00317299 0.99104576
## 15 Control.Factor.Y2 Net.Risk.Measure.Z  0.41357888 0.12542495

From the summary shown above, there are several associations between Net Risk Measure Z and the other variables:

  1. A positive correlation (0.324) between Net Risk Measure Z and Time Step t.
  2. A negative correlation (-0.147) between Net Risk Measure Z and Risk Factor X1.
  3. A negligible negative correlation (-0.028) between Net Risk Measure Z and Risk Factor X2.
  4. A negligible negative correlation (-0.003) between Net Risk Measure Z and Control Factor Y1.
  5. A positive correlation (0.414) between Net Risk Measure Z and Control Factor Y2.

This suggests that Net Risk Measure Z tends to increase as Time Step t and Control Factor Y2 increase, and to decrease as Risk Factor X1 increases; the correlations with Risk Factor X2 and Control Factor Y1 are too close to zero to read much into. Note that none of the correlations with Net Risk Measure Z are statistically significant at the 0.05 level.
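
Any single pairwise association can also be tested directly with base R's cor.test; a minimal sketch for the strongest association with Z:

# Directly testing the correlation between Control Factor Y2 and Net Risk Measure Z
cor.test(trainData$Control.Factor.Y2, trainData$Net.Risk.Measure.Z)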

Plotting a Correlation Plot for Training Data

# Plotting a Correlation Plot for Training Data
corrplot(trainMatrix$r, type = "upper", order = "FPC", method = "color",
         p.mat = trainMatrix$P, sig.level = 0.05, insig = "pch",
         tl.cex = 0.8, tl.col = "black", tl.srt = 45)

In the Correlation Plot shown above, highly correlated variables are highlighted at the dark blue intersections. We used a significance level of 0.05 to determine which correlations are statistically significant. Correlations with a p-value > 0.05 are not statistically significant and are marked with a cross.

Only two correlations are not marked with a cross: Time Step t vs Control Factor Y2, and Risk Factor X1 vs Risk Factor X2. These are the only statistically significant correlations. Nevertheless, we will use all the variables to build our Multi-Variate Linear Regression Model.
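
Because Risk Factors X1 and X2 are significantly correlated with each other, multicollinearity is worth checking before trusting individual coefficients. A minimal sketch using the vif function from the car package (an assumption: car is installed but not loaded elsewhere in this analysis):

# Checking multicollinearity with Variance Inflation Factors (car package assumed installed)
library(car)
vif(lm(Net.Risk.Measure.Z ~ Time.Step.t + Risk.Factor.X1 + Risk.Factor.X2
       + Control.Factor.Y1 + Control.Factor.Y2, data = trainData))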

Building our Multi-Variate Linear Regression Model

# Building our Multi-Variate Linear Regression Model
modFit <- train(Net.Risk.Measure.Z ~ Time.Step.t + Risk.Factor.X1 + Risk.Factor.X2
                + Control.Factor.Y1 + Control.Factor.Y2, method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression 
## 
## 15 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 15, 15, 15, 15, 15, 15, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.463806  0.1823794  2.045874
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(finMod)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Coefficients:
##       (Intercept)        Time.Step.t     Risk.Factor.X1     Risk.Factor.X2  
##           4.87522           -0.03762           -0.77507            0.40960  
## Control.Factor.Y1  Control.Factor.Y2  
##          -0.11189            1.25907
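
print(finMod) shows only the point estimates. To see each coefficient's standard error, t-statistic, and p-value (relevant to Question 2), one could additionally run:

# Coefficient standard errors, t-statistics, and p-values for the fitted model
summary(finMod)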

Plotting Regression Diagnostic Plots

# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)

The Regression Diagnostic Plots show residuals in four different ways:

  1. Residuals vs Fitted - examines the linearity assumption. A horizontal line without distinct patterns is a good indication of linearity.
  2. Normal Q-Q - examines whether the residuals are normally distributed. Normality is supported if the residual points follow the straight dashed line.
  3. Scale-Location - examines the homogeneity of the residual variances (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.
  4. Residuals vs Leverage - identifies extreme values that may unduly influence the analysis.

Linearity of Data

The linearity assumption can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are the predictions the model produces for the training data, while residuals are the differences between the observed and fitted values.

# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)

The Residuals vs Fitted Plot above suggests that the linearity assumption holds. Characteristics supporting this include:

  1. The residuals are well distributed around the horizontal dashed line at 0, which usually suggests good linearity.
  2. The residuals form a horizontal band around the dashed line at 0, which usually suggests similar variances.

However, a few observations (X7, X10, X11) stand out from the overall pattern. These should be examined individually to check whether they are genuinely unusual or are data entry errors; see the sketch below.
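
Assuming the plot labels X7, X10, and X11 refer to rows 7, 10, and 11 of the training data, the flagged observations can be pulled out directly for inspection:

# Inspecting the observations flagged in the diagnostic plots
trainData[c(7, 10, 11), ]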

Normality of Residuals

The normality assumption can be examined by inspecting the Normal Q-Q Plot.

# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)

The Normal Q-Q Plot above suggests that the residuals are approximately normally distributed. Characteristics supporting this include:

  1. The residual points all fall approximately along the straight dashed line.

As before, observations X7, X10, and X11 stand out from the overall pattern and should be examined individually.

Homoscedasticity

Homogeneity in variances of residuals can be examined by inspecting the Scale-Location Plot.

# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)

The Scale-Location Plot above suggests that the residual variances are homogeneous (homoscedastic). Characteristics supporting this include:

  1. The residuals are equally spread along the range of predictors.
  2. The residuals are equally spread around the horizontal red line.

As before, observations X7, X10, and X11 stand out from the overall pattern and should be examined individually.

Outliers, High Leverage Points and Influential Values

Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook's Distance and the Residuals vs Leverage Plot. An Outlier is a data point with an extreme outcome variable value. A High Leverage Point is a data point with an extreme predictor variable value. An Influential Value is one whose inclusion or exclusion materially alters the regression results; it is typically associated with a large residual.

Not all extreme data points are influential in a regression analysis. Cook's Distance measures the influence of each observation, combining its leverage and residual size.

# Plotting the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)

The Residuals vs Leverage Plot above suggests that the data contains no influential points. Characteristics supporting this include:

  1. All data points lie within the Cook's Distance lines, represented by the red dashed lines in the Residuals vs Leverage Plot.

However, a few observations (X10, X11, X13) stand out from the overall pattern and should be examined individually; the sketch below cross-checks their influence numerically.
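
Cook's Distance values can also be computed directly with base R's cooks.distance; the 4/n cutoff used below is a common heuristic, not a hard rule:

# Cross-checking influence numerically with Cook's Distance
cooksD <- cooks.distance(finMod)
cooksD[cooksD > 4 / length(cooksD)]  # observations exceeding the 4/n rule of thumb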

Prediction with our Multi-Variate Linear Regression Model

# Prediction with our Multi-Variate Linear Regression Model
Prediction <- predict(modFit, testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Time.Step.t, data = testData)

qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X1, data = testData)

qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X2, data = testData)

qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y1, data = testData)

qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y2, data = testData)
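
Because the test set here is identical to the training set, any accuracy computed from these predictions is in-sample and therefore optimistic. caret's postResample summarises it as RMSE, R-squared, and MAE:

# In-sample accuracy of the predictions (optimistic, since test = train)
postResample(pred = Prediction, obs = testData$Net.Risk.Measure.Z)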

Summary of our Prediction

# Summary of our Prediction
summary(Prediction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.530   4.973   5.483   5.435   5.769   6.469

Answers to Questions

Answer to Question 1
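
Using all five variables, the fitted Multi-Variate Linear Regression Model from the output above can be written as:

Z = 4.87522 - 0.03762 * t - 0.77507 * X1 + 0.40960 * X2 - 0.11189 * Y1 + 1.25907 * Y2

Predicting the Net Risk Measure Z for the next Time Step (t = 16) also requires the Risk and Control Factor scores for that step, which are not given in Table 1. A minimal sketch, using placeholder factor values set to roughly the sample means (hypothetical, for illustration only):

# Predicting Z for the next Time Step (the factor values below are hypothetical placeholders)
nextStep <- data.frame(Time.Step.t = 16, Risk.Factor.X1 = 4.16, Risk.Factor.X2 = 3.45,
                       Control.Factor.Y1 = 3.98, Control.Factor.Y2 = 2.48)
predict(modFit, newdata = nextStep)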

Answer to Question 2
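
At least two statistical issues should be examined before concluding that the model is robust:

  1. Sample size and validation. With only 15 observations, and with the model evaluated on the same data it was fitted to, the reported accuracy is optimistic and the model is prone to overfitting. The bootstrapped resampling R-squared of about 0.18 already suggests weak out-of-sample explanatory power; the model should be validated on independent data (or at least via leave-one-out cross-validation) before use.
  2. Statistical significance and multicollinearity. None of the predictors' correlations with Net Risk Measure Z are statistically significant at the 0.05 level, while Risk Factors X1 and X2 are significantly correlated with each other (0.607, p = 0.016), so the individual coefficient estimates may be unstable. The residual diagnostics should also be rechecked after the flagged outliers (e.g. X7, X10, X11) have been investigated.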