During each Time Step t, a financial supervisor, B, is able to observe the scores for Risk Factors X1 and X2, as well as the scores for Control Factors Y1 and Y2, for a particular Company A. At the same Time Step t, B can also observe the data for the Net Risk Measure Z, corresponding to A. A sample of the data for Time Step t can be found in Table 1.
B has a hypothesis that the Net Risk Measure Z can be predicted using the Risk and Control Factors.
By using a Multi-Variate Linear Regression Approach, identify a suitable regression model and write out its expression. What is the predicted Net Risk Measure Z for the next Time Step t?
Briefly explain at least 2 issues, from a statistical perspective, which should be examined before one can conclude that a risk model, such as the one above, is robust and can be used.
# Loading Libraries
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## corrplot 0.84 loaded
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# Loading Data
table1_data <- read.csv('./Q1 Q2.csv', header = TRUE)
# Cleaning Data
table1_data <- table1_data[, colSums(is.na(table1_data)) == 0]
summary(table1_data)
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Min. : 1.0 Min. :2.620 Min. :1.040 Min. :1.110
## 1st Qu.: 4.5 1st Qu.:3.840 1st Qu.:2.205 1st Qu.:3.040
## Median : 8.0 Median :4.190 Median :3.370 Median :4.360
## Mean : 8.0 Mean :4.164 Mean :3.450 Mean :3.979
## 3rd Qu.:11.5 3rd Qu.:4.445 3rd Qu.:4.690 3rd Qu.:5.230
## Max. :15.0 Max. :5.610 Max. :5.910 Max. :5.920
## Control.Factor.Y2 Net.Risk.Measure.Z
## Min. :1.320 Min. :3.450
## 1st Qu.:2.085 1st Qu.:4.955
## Median :2.530 Median :5.260
## Mean :2.478 Mean :5.435
## 3rd Qu.:2.830 3rd Qu.:6.025
## Max. :3.800 Max. :7.470
Normally, we would split the data into 50% for training and 50% for testing our Multi-Variate Linear Regression Model. Here, our data (table1_data) has a small sample size of only 15 observations, so we will use the full dataset as both the training and the testing set. Note that evaluating a model on its own training data gives an optimistic view of out-of-sample performance.
# Splitting Data (with p = 1, createDataPartition keeps all 15 rows,
# so the training and testing sets are identical)
set.seed(1234)
inTrain <- createDataPartition(table1_data$Time.Step.t, p = 1, list = FALSE)
trainData <- table1_data[inTrain, ]
testData <- table1_data[inTrain, ]
dim(trainData)
## [1] 15 6
dim(testData)
## [1] 15 6
# Ensuring all columns are numeric
trainData <- as.data.frame(lapply(trainData, as.numeric))
testData <- as.data.frame(lapply(testData, as.numeric))
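Given the small sample, leave-one-out cross-validation would give a less optimistic error estimate than testing on the training data itself. A minimal sketch using caret (not part of the analysis below):
# Optional: leave-one-out cross-validation, a less optimistic alternative
# to re-using the training data as the test set when n = 15
loocvFit <- train(Net.Risk.Measure.Z ~ ., data = trainData, method = "lm",
                  trControl = trainControl(method = "LOOCV"))
print(loocvFit)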
# Plotting a Correlation Matrix for Training Data
trainMatrix <- rcorr(as.matrix(trainData))
print(trainMatrix)
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t 1.00 0.31 0.31 0.03
## Risk.Factor.X1 0.31 1.00 0.61 0.04
## Risk.Factor.X2 0.31 0.61 1.00 0.13
## Control.Factor.Y1 0.03 0.04 0.13 1.00
## Control.Factor.Y2 0.62 0.12 -0.29 0.14
## Net.Risk.Measure.Z 0.32 -0.15 -0.03 0.00
## Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t 0.62 0.32
## Risk.Factor.X1 0.12 -0.15
## Risk.Factor.X2 -0.29 -0.03
## Control.Factor.Y1 0.14 0.00
## Control.Factor.Y2 1.00 0.41
## Net.Risk.Measure.Z 0.41 1.00
##
## n= 15
##
##
## P
## Time.Step.t Risk.Factor.X1 Risk.Factor.X2 Control.Factor.Y1
## Time.Step.t 0.2572 0.2576 0.9082
## Risk.Factor.X1 0.2572 0.0163 0.8743
## Risk.Factor.X2 0.2576 0.0163 0.6562
## Control.Factor.Y1 0.9082 0.8743 0.6562
## Control.Factor.Y2 0.0143 0.6757 0.3027 0.6217
## Net.Risk.Measure.Z 0.2392 0.6006 0.9197 0.9910
## Control.Factor.Y2 Net.Risk.Measure.Z
## Time.Step.t 0.0143 0.2392
## Risk.Factor.X1 0.6757 0.6006
## Risk.Factor.X2 0.3027 0.9197
## Control.Factor.Y1 0.6217 0.9910
## Control.Factor.Y2 0.1254
## Net.Risk.Measure.Z 0.1254
# Flattening the correlation matrix into (row, column, cor, p) rows
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(row    = rownames(cormat)[row(cormat)[ut]],
             column = rownames(cormat)[col(cormat)[ut]],
             cor    = cormat[ut],
             p      = pmat[ut])
}
flattentrainMatrix <- flattenCorrMatrix(trainMatrix$r, trainMatrix$P)
print(flattentrainMatrix)
## row column cor p
## 1 Time.Step.t Risk.Factor.X1 0.31222177 0.25724040
## 2 Time.Step.t Risk.Factor.X2 0.31197950 0.25762950
## 3 Risk.Factor.X1 Risk.Factor.X2 0.60744462 0.01631352
## 4 Time.Step.t Control.Factor.Y1 0.03260289 0.90817134
## 5 Risk.Factor.X1 Control.Factor.Y1 0.04471071 0.87428435
## 6 Risk.Factor.X2 Control.Factor.Y1 0.12537919 0.65615462
## 7 Time.Step.t Control.Factor.Y2 0.61688389 0.01430005
## 8 Risk.Factor.X1 Control.Factor.Y2 0.11786044 0.67570291
## 9 Risk.Factor.X2 Control.Factor.Y2 -0.28528054 0.30270321
## 10 Control.Factor.Y1 Control.Factor.Y2 0.13883414 0.62169068
## 11 Time.Step.t Net.Risk.Measure.Z 0.32373285 0.23916567
## 12 Risk.Factor.X1 Net.Risk.Measure.Z -0.14719332 0.60063548
## 13 Risk.Factor.X2 Net.Risk.Measure.Z -0.02848653 0.91972842
## 14 Control.Factor.Y1 Net.Risk.Measure.Z -0.00317299 0.99104576
## 15 Control.Factor.Y2 Net.Risk.Measure.Z 0.41357888 0.12542495
The summary above shows the pairwise associations between Net Risk Measure Z and the other variables.
Z is moderately positively correlated with Time Step t (0.324) and Control Factor Y2 (0.414), weakly negatively correlated with Risk Factor X1 (-0.147), and essentially uncorrelated with Risk Factor X2 (-0.028) and Control Factor Y1 (-0.003).
This suggests that Net Risk Measure Z tends to increase as Time Step t and Control Factor Y2 increase, and to decrease slightly as Risk Factor X1 increases; the associations with Risk Factor X2 and Control Factor Y1 are too weak to interpret. Note, however, that none of the correlations with Z is statistically significant at the 0.05 level.
# Plotting a Correlation Plot for Training Data
corrplot(trainMatrix$r, type = "upper", order = "FPC", method = "color",
p.mat = trainMatrix$P, sig.level = 0.05, insig = "pch",
tl.cex = 0.8, tl.col = "black", tl.srt = 45)
In the Correlation Plot shown above, highly correlated variable pairs appear as dark blue cells. We use a significance level of 0.05: correlations with a p-value above 0.05 are treated as not statistically significant and are marked with a cross.
Only two correlations are left unmarked, which suggests that Time Step t vs Control Factor Y2 and Risk Factor X1 vs Risk Factor X2 are the only statistically significant pairwise correlations. Nevertheless, we will use all the variables to build our Multi-Variate Linear Regression Model.
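Since Risk Factors X1 and X2 are themselves significantly correlated (0.61), multicollinearity among the predictors is worth a quick numeric check before relying on individual coefficients. A minimal sketch using the car package (assumed to be installed; not part of the original workflow):
# Checking multicollinearity via variance inflation factors (VIF);
# values well above 5-10 would indicate problematic collinearity
library(car)
vif(lm(Net.Risk.Measure.Z ~ ., data = trainData))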
# Building our Multi-Variate Linear Regression Model
modFit <- train(Net.Risk.Measure.Z ~ Time.Step.t + Risk.Factor.X1 + Risk.Factor.X2 +
                  Control.Factor.Y1 + Control.Factor.Y2,
                method = "lm", data = trainData)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression
##
## 15 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 15, 15, 15, 15, 15, 15, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.463806 0.1823794 2.045874
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(finMod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) Time.Step.t Risk.Factor.X1 Risk.Factor.X2
## 4.87522 -0.03762 -0.77507 0.40960
## Control.Factor.Y1 Control.Factor.Y2
## -0.11189 1.25907
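Reading the coefficients off the fitted model above, the regression expression is:
Z = 4.875 - 0.038 t - 0.775 X1 + 0.410 X2 - 0.112 Y1 + 1.259 Y2
where t is the Time Step, X1 and X2 are the Risk Factors, and Y1 and Y2 are the Control Factors (coefficients rounded to three decimal places).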
# Plotting Regression Diagnostic Plots
par(mfrow = c(2,2))
plot(finMod)
The Regression Diagnostic Plots examine the residuals in four ways: the Residuals vs Fitted Plot, the Normal Q-Q Plot, the Scale-Location Plot, and the Residuals vs Leverage Plot. We consider each in turn.
The linearity assumption can be examined by inspecting the Residuals vs Fitted Plot. Fitted values are the predictions from our model on the training data, while residuals are the differences between the observed and fitted values.
# Plotting the Residuals vs Fitted Plot
plot(finMod, 1, pch = 19, cex = 0.5)
The Residuals vs Fitted Plot shown above suggests that our data satisfies the linearity assumption. Characteristics to support this include a roughly horizontal red trend line close to zero, with residuals scattered randomly and showing no distinct pattern such as curvature.
However, there are some outliers (X7, X10, X11) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
The normality assumptions can be examined by inspecting the Normal Q-Q Plot.
# Plotting the Normal Q-Q Plot
plot(finMod, 2, pch = 19, cex = 0.5)
The Normal Q-Q Plot shown above suggests that our residuals are approximately normally distributed. Characteristics to support this include the standardised residuals falling roughly along the straight reference line.
However, there are some outliers (X7, X10, X11) that deviate from this line. We need to examine them individually for anything unusual, or check whether they are data entry errors.
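As a complement to the visual check, the normality of the residuals can also be tested formally. A minimal sketch using the Shapiro-Wilk test (not part of the original workflow, and of limited power at n = 15):
# Formal normality test of the residuals; a p-value above 0.05 would be
# consistent with normality, though the test has little power at n = 15
shapiro.test(residuals(finMod))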
Homogeneity of variance (homoscedasticity) of the residuals can be examined by inspecting the Scale-Location Plot.
# Plotting the Scale-Location Plot
plot(finMod, 3, pch = 19, cex = 0.5)
The Scale-Location Plot shown above suggests that our residuals have good homoscedasticity. Characteristics to support this include a roughly horizontal red line, with the standardised residuals spread evenly across the range of fitted values.
However, there are some outliers (X7, X10, X11) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
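Homoscedasticity can likewise be tested formally. A minimal sketch using the Breusch-Pagan test from the lmtest package (assumed to be installed; not part of the original workflow):
# Breusch-Pagan test for heteroscedasticity; a small p-value would indicate
# non-constant residual variance
library(lmtest)
bptest(finMod)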
Outliers, High Leverage Points and Influential Values can be identified by inspecting Cook’s Distance and the Residuals vs Leverage Plot. An Outlier is a data point that has an extreme outcome variable value. A High Leverage Point is a data point that has an extreme predictor variable value. An Influential Value is associated with a large residual and its inclusion or exclusion can alter the regression analysis.
Not all extreme data points are influential in regression analysis. Cook’s Distance is a metric used to determine the influence of a value. It defines influence as a combination of leverage and residual size.
# Plotting the Residuals vs Leverage Plot
par(mfrow = c(1,2))
plot(finMod, 4, pch = 19, cex = 0.5)
plot(finMod, 5, pch = 19, cex = 0.5)
The Cook's Distance and Residuals vs Leverage Plots shown above suggest that our data contains no influential points. Characteristics to support this include all observations lying well within the Cook's distance contours, with no point combining high leverage and a large standardised residual.
However, there are some outliers (X10, X11, X13) that stand out from the distribution pattern. We need to examine them individually for anything unusual, or check whether they are data entry errors.
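As a numeric complement to the plots, the Cook's distances can be inspected directly. A minimal sketch using the common 4/n rule of thumb (the threshold is a convention, not part of the original workflow):
# Flagging observations whose Cook's distance exceeds 4/n
cooksD <- cooks.distance(finMod)
print(round(cooksD, 3))
which(cooksD > 4 / nrow(trainData))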
# Prediction with our Multi-Variate Linear Regression Model
Prediction <- predict(modFit, testData)
# Plotting predicted vs observed Z, coloured by each variable in turn
qplot(Net.Risk.Measure.Z, Prediction, colour = Time.Step.t, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X1, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Risk.Factor.X2, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y1, data = testData)
qplot(Net.Risk.Measure.Z, Prediction, colour = Control.Factor.Y2, data = testData)
# Summary of our Prediction
summary(Prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.530 4.973 5.483 5.435 5.769 6.469
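The question also asks for the predicted Net Risk Measure Z at the next Time Step (t = 16). The factor scores at t = 16 are not given, so some assumption about them is required; as a purely illustrative choice, the sketch below carries forward the most recently observed values of X1, X2, Y1 and Y2:
# Predicting Z at t = 16, carrying forward the latest observed factor scores
# (an illustrative assumption; any forecast of X1, X2, Y1, Y2 could be used)
nextStep <- trainData[which.max(trainData$Time.Step.t), ]
nextStep$Time.Step.t <- nextStep$Time.Step.t + 1
predict(modFit, nextStep)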
Essentially, we used a Multi-Variate Linear Regression Model to explain the relationship between one continuous dependent variable (the outcome) and two or more independent variables (the predictors). Our outcome, Net Risk Measure Z, is continuous, while our predictors can be continuous or categorical.
Some assumptions made when building our model: linearity of the relationship between the outcome and the predictors, independence of the residuals, homoscedasticity (constant residual variance), normality of the residuals, and no severe multicollinearity among the predictors.
Pros of our model: It can determine the relative influence of one or more predictors on the outcome. It can also identify outliers, high leverage points and influential values.
Cons of our model: It may not provide a good fit if our data comes from a population with a distribution that violates the abovementioned assumptions.
Our model gave a bootstrap-resampled R-squared value of 0.18, a Root Mean Square Error (RMSE) of 2.46, and a Mean Absolute Error (MAE) of 2.05. Given the small sample size of 15 observations, this level of error is not unexpected.
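For reference, the in-sample errors can be computed directly; because the model is evaluated on the same data it was fitted to, these will typically be lower than the bootstrap resampling estimates reported by caret above:
# In-sample RMSE and MAE (optimistic relative to the resampled estimates)
insampleRes <- trainData$Net.Risk.Measure.Z - predict(modFit, trainData)
sqrt(mean(insampleRes^2)) # in-sample RMSE
mean(abs(insampleRes)) # in-sample MAE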
While our R-squared value is not high, we evaluated our model in conjunction with Regression Diagnostic Plots to reveal any residual patterns that may indicate bias.
From our Residuals vs Fitted Plot, our data shows good linearity. From our Normal Q-Q Plot, our data shows good normality of residuals. From our Scale-Location Plot, our data shows good homoscedasticity. From our Residuals vs Leverage Plot, our data does not show any influential values. (Multicollinearity is assessed separately, e.g. from the correlation matrix or variance inflation factors, not from the Scale-Location Plot.)
Essentially, our data satisfies the assumptions underlying our model, and goodness-of-fit would likely improve with a larger sample. On this basis, we are confident that our model is robust, performs adequately, and can be used.