Often in may financial institutions the business users create their own user databases for their reporting and analytical needs. This involves data loads from multiple source systems with multiple data transformation. Data takes multiple hops and multiple transformations/calculations and land in the user data bases for the reporting and analytical purpose.
Over a period of time the business users change and with changing business needs and challenges new attributes are added to these user databases. With this on going process; the number of attributes increases; which might also include duplicate attributes and often it so happens that the transformation that a data attribute has taken to land in the user database is forgotten or lost.
When these systems need to undergo a data migration; a lot of reverse engineering is involved and mapping of the attributes to its source or calculating data elements is much needed to keep all the reports intact. Business users who were involved in sourcing some of these critical data elements may not be available to support this migration and a challenge to map them back to source is huge and cumbersome.
With this Nano project I am attempting to use Multiple Linear Regression to establish a relation between the calculated/transformed data elements back to its source data elements.
For this project I mocked up Personal Loan data. Below is the description of each of the data element. There are 700 observations with two calculated fields MonthlyRate and EMI
CustID - Customer ID Number (A number generated by the loan issuing bank)
CustName - Name of the customer as per the application from
LoanAmount - Loan amount disbursed by the bank to the customer
LoanTenure - Time in months to repay the loan
InterestRate - Interest rate on the lona
MonthlyRate - Calculated field, monthly interest rate
EMI - Equal monthly installment for the loan
Here in the data set the MonthlyRate and EMI are calculated fileds (Response variables); and variable LoanAmount, LoanTenure and InterestRate are predictors.
Data can be downloaded from here
On the available (mocked-up) data we would be running multiple linear regression to regress the MonthlyRate and EMI variables over LoanAmount, LoanTenure and InterestRate to identify the relationship/correlation between the predictor and response variables.
Looking a the pair plot below; the EMI and Loan amount have a pretty positive linear relation; i.e. as the amount increases the EMI too increases, and this is obvious. Whereas the plot for LoanAmount against LoanTenure, InterestRate and MonthlyRate is very scattered.
df = read.csv(DataFile,header = TRUE, stringsAsFactors = FALSE)
dim(df)
## [1] 700 7
str(df)
## 'data.frame': 700 obs. of 7 variables:
## $ CustID : chr "JNW7777667" "LDW2731985" "KIB2193328" "WDV6223251" ...
## $ CustName : chr "S SOORIYO" "SUKHADEVSINH JATVBHA RAHOL" "VIPIN BIHARI RAI" "K RAMACHANDRAN PILLAI" ...
## $ LoanAmount : num 308000 244000 284000 507000 722000 815000 322000 342000 472000 111000 ...
## $ LoanTenure : int 22 31 17 32 45 21 31 41 28 34 ...
## $ InterestRate: num 12.9 11.5 12.1 12.7 13.1 11.2 12.1 10.8 12 13.7 ...
## $ MonthlyRate : num 0.0108 0.0096 0.0101 0.0106 0.0109 0.0093 0.0101 0.009 0.01 0.0114 ...
## $ EMI : num 15795 9135 18262 18761 20392 ...
summary(df)
## CustID CustName LoanAmount LoanTenure
## Length:700 Length:700 Min. :100000 Min. :11.0
## Class :character Class :character 1st Qu.:299750 1st Qu.:21.0
## Mode :character Mode :character Median :498000 Median :29.0
## Mean :493841 Mean :29.6
## 3rd Qu.:675750 3rd Qu.:39.0
## Max. :900000 Max. :48.0
## InterestRate MonthlyRate EMI
## Min. :10.00 Min. :0.00830 Min. : 2683
## 1st Qu.:11.10 1st Qu.:0.00930 1st Qu.:12287
## Median :12.10 Median :0.01010 Median :19746
## Mean :12.06 Mean :0.01006 Mean :22445
## 3rd Qu.:13.00 3rd Qu.:0.01080 3rd Qu.:28400
## Max. :14.00 Max. :0.01170 Max. :83659
pairs(~LoanAmount+LoanTenure+InterestRate+MonthlyRate+EMI, data = df)
Lets regress the MonthlyRate over LoanAmount, LoanTenure and InterestRate.
If we observer the output of the summary; the coefficient of the variables InterestRate has high t-value and very low p-value (<0.05) and hence statistically significant. Also signifies that the confidence of the relation is 99.99%. Whereas the other two varaibles LoanAmount and LoanTenure have a very low t-value and a high p-value resulting a low or no statistical significance. Hence for now we can say (before we evaluate the model assumptions) that the MonthlyRate is highly co-related to InterestRate and can have an association when deriving this value. Building the linear equation would result into a probable equation for MonthlyRate, but we cannot say that its the accurate formula to calculate the same.
As we now understand the co-relation between the two terms; we can reach out to the SME/Business users to get an accurate formula and validate the outcome.
With unit increase in InterestRate the MonthlyRate increases by 0.00084 units.
Also observe that the R-squared and Adjusted R-squared values are very high (99.9%); close to 1; signifying that this model explains best fit of the data and explains close to all the data in the model.
rm1 = lm(MonthlyRate ~ LoanAmount+LoanTenure+InterestRate, data = df)
summary(rm1)
##
## Call:
## lm(formula = MonthlyRate ~ LoanAmount + LoanTenure + InterestRate,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.891e-05 -2.350e-05 -6.180e-07 2.399e-05 5.277e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.792e-05 1.195e-05 -4.011 6.69e-05 ***
## LoanAmount 4.954e-12 4.750e-12 1.043 0.297
## LoanTenure 8.837e-08 9.709e-08 0.910 0.363
## InterestRate 8.374e-04 9.137e-07 916.534 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.79e-05 on 696 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 2.83e+05 on 3 and 696 DF, p-value: < 2.2e-16
On the similar lines we can regress the EMI on LoanAmount, LoanTenure and InterestRate to find a co-relation.
In this model observe that the coefficients of LoanAmount and LoanTenure have a high t-value (LoanTenure is high negative) and a very low p-value and hence statistically significant. The LoanAmount and LoanTenure have a 99.99% confidence of co-relation.
1. With one unit increase in LoanAmount the EMI increases by 0.046 units.
2. With one unit increase in the LoanTenure the EMI will decrease by 0.076 units. This makes sense :)
rm2 = lm(EMI ~ LoanAmount+LoanTenure+InterestRate, data = df)
summary(rm2)
##
## Call:
## lm(formula = EMI ~ LoanAmount + LoanTenure + InterestRate, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8167 -3482 -1640 1980 29895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.051e+04 2.267e+03 9.047 <2e-16 ***
## LoanAmount 4.592e-02 9.012e-04 50.960 <2e-16 ***
## LoanTenure -7.566e+02 1.842e+01 -41.074 <2e-16 ***
## InterestRate 1.370e+02 1.734e+02 0.790 0.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5294 on 696 degrees of freedom
## Multiple R-squared: 0.8625, Adjusted R-squared: 0.8619
## F-statistic: 1455 on 3 and 696 DF, p-value: < 2.2e-16
The above mentioned co-relation stands true only when the model regression assumptions stands true. Lets evaluate these assumptions and validate the correctness of the model.
#Q-Q Plot for MonthlyInterest
s.rm1 <- rstandard(rm1)
qqnorm(s.rm1)
#Q-Q Plot for EMI
s.rm2 <- rstandard(rm2)
qqnorm(s.rm2)
plot(density(residuals(rm1)))
plot(density(residuals(rm2)))
plot(fitted(rm1),rstudent(rm1),type="p",col="red")
plot(fitted(rm2),rstudent(rm2),type="p",col="red")
acf(residuals(rm1))
acf(residuals(rm2))
As the model stands all the regression assumption we can conclude that the relationship we were able establish between the variables stands good.
1. MonthlyRate is an equation of InterestRate and
2. EMI is an equation of LoanAmount and LoanTenure