Data Lineage using LinerRegression

Background

Often in may financial institutions the business users create their own user databases for their reporting and analytical needs. This involves data loads from multiple source systems with multiple data transformation. Data takes multiple hops and multiple transformations/calculations and land in the user data bases for the reporting and analytical purpose.

Over a period of time the business users change and with changing business needs and challenges new attributes are added to these user databases. With this on going process; the number of attributes increases; which might also include duplicate attributes and often it so happens that the transformation that a data attribute has taken to land in the user database is forgotten or lost.

When these systems need to undergo a data migration; a lot of reverse engineering is involved and mapping of the attributes to its source or calculating data elements is much needed to keep all the reports intact. Business users who were involved in sourcing some of these critical data elements may not be available to support this migration and a challenge to map them back to source is huge and cumbersome.

With this Nano project I am attempting to use Multiple Linear Regression to establish a relation between the calculated/transformed data elements back to its source data elements.

Data

For this project I mocked up Personal Loan data. Below is the description of each of the data element. There are 700 observations with two calculated fields MonthlyRate and EMI

CustID        - Customer ID Number (A number generated by the loan issuing bank)  
CustName      - Name of the customer as per the application from  
LoanAmount    - Loan amount disbursed by the bank to the customer  
LoanTenure    - Time in months to repay the loan  
InterestRate  - Interest rate on the lona  
MonthlyRate   - Calculated field, monthly interest rate  
EMI           - Equal monthly installment for the loan

Here in the data set the MonthlyRate and EMI are calculated fileds (Response variables); and variable LoanAmount, LoanTenure and InterestRate are predictors.

Data can be downloaded from here

Opportunity

On the available (mocked-up) data we would be running multiple linear regression to regress the MonthlyRate and EMI variables over LoanAmount, LoanTenure and InterestRate to identify the relationship/correlation between the predictor and response variables.

Assumptions

The data is collected and reviewed with a SME/Business user for its correctness in terms of dependencies.
Data is well understood before the regression is run.

Regression Analysis

Looking a the pair plot below; the EMI and Loan amount have a pretty positive linear relation; i.e. as the amount increases the EMI too increases, and this is obvious. Whereas the plot for LoanAmount against LoanTenure, InterestRate and MonthlyRate is very scattered.

df = read.csv(DataFile,header = TRUE, stringsAsFactors = FALSE)
dim(df)

## [1] 700   7

str(df)

## 'data.frame':    700 obs. of  7 variables:
##  $ CustID      : chr  "JNW7777667" "LDW2731985" "KIB2193328" "WDV6223251" ...
##  $ CustName    : chr  "S  SOORIYO" "SUKHADEVSINH JATVBHA RAHOL" "VIPIN BIHARI RAI" "K RAMACHANDRAN PILLAI" ...
##  $ LoanAmount  : num  308000 244000 284000 507000 722000 815000 322000 342000 472000 111000 ...
##  $ LoanTenure  : int  22 31 17 32 45 21 31 41 28 34 ...
##  $ InterestRate: num  12.9 11.5 12.1 12.7 13.1 11.2 12.1 10.8 12 13.7 ...
##  $ MonthlyRate : num  0.0108 0.0096 0.0101 0.0106 0.0109 0.0093 0.0101 0.009 0.01 0.0114 ...
##  $ EMI         : num  15795 9135 18262 18761 20392 ...

summary(df)

##     CustID            CustName           LoanAmount       LoanTenure  
##  Length:700         Length:700         Min.   :100000   Min.   :11.0  
##  Class :character   Class :character   1st Qu.:299750   1st Qu.:21.0  
##  Mode  :character   Mode  :character   Median :498000   Median :29.0  
##                                        Mean   :493841   Mean   :29.6  
##                                        3rd Qu.:675750   3rd Qu.:39.0  
##                                        Max.   :900000   Max.   :48.0  
##   InterestRate    MonthlyRate           EMI       
##  Min.   :10.00   Min.   :0.00830   Min.   : 2683  
##  1st Qu.:11.10   1st Qu.:0.00930   1st Qu.:12287  
##  Median :12.10   Median :0.01010   Median :19746  
##  Mean   :12.06   Mean   :0.01006   Mean   :22445  
##  3rd Qu.:13.00   3rd Qu.:0.01080   3rd Qu.:28400  
##  Max.   :14.00   Max.   :0.01170   Max.   :83659

pairs(~LoanAmount+LoanTenure+InterestRate+MonthlyRate+EMI, data = df)

Lets regress the MonthlyRate over LoanAmount, LoanTenure and InterestRate.

If we observer the output of the summary; the coefficient of the variables InterestRate has high t-value and very low p-value (<0.05) and hence statistically significant. Also signifies that the confidence of the relation is 99.99%. Whereas the other two varaibles LoanAmount and LoanTenure have a very low t-value and a high p-value resulting a low or no statistical significance. Hence for now we can say (before we evaluate the model assumptions) that the MonthlyRate is highly co-related to InterestRate and can have an association when deriving this value. Building the linear equation would result into a probable equation for MonthlyRate, but we cannot say that its the accurate formula to calculate the same.

As we now understand the co-relation between the two terms; we can reach out to the SME/Business users to get an accurate formula and validate the outcome.

  With unit increase in InterestRate the MonthlyRate increases by 0.00084 units.

Also observe that the R-squared and Adjusted R-squared values are very high (99.9%); close to 1; signifying that this model explains best fit of the data and explains close to all the data in the model.

rm1 = lm(MonthlyRate ~ LoanAmount+LoanTenure+InterestRate, data = df)
summary(rm1)

## 
## Call:
## lm(formula = MonthlyRate ~ LoanAmount + LoanTenure + InterestRate, 
##     data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.891e-05 -2.350e-05 -6.180e-07  2.399e-05  5.277e-05 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.792e-05  1.195e-05  -4.011 6.69e-05 ***
## LoanAmount    4.954e-12  4.750e-12   1.043    0.297    
## LoanTenure    8.837e-08  9.709e-08   0.910    0.363    
## InterestRate  8.374e-04  9.137e-07 916.534  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.79e-05 on 696 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 2.83e+05 on 3 and 696 DF,  p-value: < 2.2e-16

On the similar lines we can regress the EMI on LoanAmount, LoanTenure and InterestRate to find a co-relation.

In this model observe that the coefficients of LoanAmount and LoanTenure have a high t-value (LoanTenure is high negative) and a very low p-value and hence statistically significant. The LoanAmount and LoanTenure have a 99.99% confidence of co-relation.

  1. With one unit increase in LoanAmount the EMI increases by 0.046 units.
  2. With one unit increase in the LoanTenure the EMI will decrease by 0.076 units. This makes sense :)

rm2 = lm(EMI ~ LoanAmount+LoanTenure+InterestRate, data = df)
summary(rm2)

## 
## Call:
## lm(formula = EMI ~ LoanAmount + LoanTenure + InterestRate, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8167  -3482  -1640   1980  29895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.051e+04  2.267e+03   9.047   <2e-16 ***
## LoanAmount    4.592e-02  9.012e-04  50.960   <2e-16 ***
## LoanTenure   -7.566e+02  1.842e+01 -41.074   <2e-16 ***
## InterestRate  1.370e+02  1.734e+02   0.790     0.43    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5294 on 696 degrees of freedom
## Multiple R-squared:  0.8625, Adjusted R-squared:  0.8619 
## F-statistic:  1455 on 3 and 696 DF,  p-value: < 2.2e-16

The above mentioned co-relation stands true only when the model regression assumptions stands true. Lets evaluate these assumptions and validate the correctness of the model.

Normal probability plot of the residuals (Q-Q plot in R); refer to Normal Q-Q Plot. This plot shows a straight line with intercept of Zero (Mean) and a slop (Standard Deviation) of one.

#Q-Q Plot for MonthlyInterest
s.rm1 <- rstandard(rm1) 
qqnorm(s.rm1)

#Q-Q Plot for EMI
s.rm2 <- rstandard(rm2) 
qqnorm(s.rm2)

The below plot shows that the residuals are following a normal distribution, with mean Zero.

plot(density(residuals(rm1)))

plot(density(residuals(rm2)))

The below show the scale-location plot; which depicts that the residuals are following a constant variance (Homoscedastic)

plot(fitted(rm1),rstudent(rm1),type="p",col="red")

plot(fitted(rm2),rstudent(rm2),type="p",col="red")

Below is a acf (Auto correlation) plot for residuals. Clearly depicts that the correlation within the residuals does not exists making our model better.

acf(residuals(rm1))

acf(residuals(rm2))

Conclusion

As the model stands all the regression assumption we can conclude that the relationship we were able establish between the variables stands good.

  1. MonthlyRate is an equation of InterestRate and
  2. EMI is an equation of LoanAmount and LoanTenure