A description of the dataset we are going to use is shown below.
[Figure: description of the dataset variables]
Let's read the dataset first and check its structure and summary.
rm(list=ls())
setwd("D:/ISB/21-StatAnalysis-2/SA-2 Assignment-1 files-20170219")
NAICExpense <- read.csv("NAICExpense.csv")
dim(NAICExpense)
## [1] 384 15
summary(NAICExpense)
## COMPANY_NAME GROUP MUTUAL
## AAA Mid-Atlantic Ins Co : 1 Min. :0.000 Min. :0.0000
## Acceptance Ind Ins Co : 1 1st Qu.:0.000 1st Qu.:0.0000
## Accredited Surety & Cas Co Inc: 1 Median :1.000 Median :0.0000
## Ace Ins Co : 1 Mean :0.612 Mean :0.1875
## Admiral Ind Co : 1 3rd Qu.:1.000 3rd Qu.:0.0000
## Adriatic Ins Co : 1 Max. :1.000 Max. :1.0000
## (Other) :378
## STOCK RBC EXPENSES STAFFWAGE
## Min. :0.0000 Min. :0.000e+00 Min. :-0.002038 Min. : 51.73
## 1st Qu.:0.0000 1st Qu.:6.257e+08 1st Qu.: 0.001584 1st Qu.: 80.06
## Median :1.0000 Median :2.753e+09 Median : 0.008504 Median : 84.38
## Mean :0.6823 Mean :2.247e+10 Mean : 0.043190 Mean : 87.18
## 3rd Qu.:1.0000 3rd Qu.:1.118e+10 3rd Qu.: 0.029826 3rd Qu.: 93.82
## Max. :1.0000 Max. :8.388e+11 Max. : 1.236946 Max. :137.48
##
## AGENTWAGE LONGLOSS SHORTLOSS
## Min. : 47.47 Min. :-0.070623 Min. :-0.0031685
## 1st Qu.: 74.81 1st Qu.: 0.000000 1st Qu.: 0.0002369
## Median : 78.77 Median : 0.001784 Median : 0.0040240
## Mean : 80.15 Mean : 0.024926 Mean : 0.0373586
## 3rd Qu.: 85.44 3rd Qu.: 0.011280 3rd Qu.: 0.0217943
## Max. :126.17 Max. : 0.853915 Max. : 1.1710587
## NA's :19
## GPWPERSONAL GPWCOMM ASSETS
## Min. :-0.0037514 Min. :-0.000648 Min. :0.000321
## 1st Qu.: 0.0000000 1st Qu.: 0.003838 1st Qu.:0.012758
## Median : 0.0003125 Median : 0.023807 Median :0.056746
## Mean : 0.0531127 Mean : 0.122657 Mean :0.356543
## 3rd Qu.: 0.0272581 3rd Qu.: 0.086440 3rd Qu.:0.197437
## Max. : 1.8224858 Max. : 4.189401 Max. :8.705380
##
## CASH LIQUIDRATIO
## Min. :0.000018 Min. : 1.788
## 1st Qu.:0.011377 1st Qu.: 87.403
## Median :0.050469 Median : 96.027
## Mean :0.332871 Mean : 92.597
## 3rd Qu.:0.184971 3rd Qu.:103.861
## Max. :8.823477 Max. :127.858
##
colnames(NAICExpense)
## [1] "COMPANY_NAME" "GROUP" "MUTUAL" "STOCK"
## [5] "RBC" "EXPENSES" "STAFFWAGE" "AGENTWAGE"
## [9] "LONGLOSS" "SHORTLOSS" "GPWPERSONAL" "GPWCOMM"
## [13] "ASSETS" "CASH" "LIQUIDRATIO"
A sample of the dataset looks like this:
head(NAICExpense)
## COMPANY_NAME GROUP MUTUAL STOCK RBC
## 1 Tift Area Captive Ins Co 0 0 1 228184000
## 2 Alliance Of Nonprofits For Ins RRG 0 0 0 1627708000
## 3 GA Timber Harvesters Mut Captive 0 1 0 422907000
## 4 American Natl Lloyds Ins Co 1 0 0 652906000
## 5 Chubb Natl Ins Co 1 0 1 8124624000
## 6 Harleysville Ins Co of OH 1 0 1 1441725000
## EXPENSES STAFFWAGE AGENTWAGE LONGLOSS SHORTLOSS GPWPERSONAL
## 1 0.0008019802 84.40508 77.46100 0.0001873308 0.000000000 0.000000000
## 2 0.0044878635 81.56754 84.87802 0.0027822909 0.000000000 0.000000000
## 3 0.0019045075 84.40508 77.46100 0.0010121463 0.001329539 0.000000000
## 4 0.0022909382 82.49788 75.71071 0.0000000000 0.002979557 0.029545038
## 5 0.0182956574 79.26495 78.24790 0.0107939577 0.011777314 0.040614120
## 6 0.0049830133 84.35856 77.41831 0.0019387476 0.003797946 0.001448431
## GPWCOMM ASSETS CASH LIQUIDRATIO
## 1 0.001375438 0.002949942 0.003258406 110.45661
## 2 0.012272512 0.022170349 0.019760347 89.12961
## 3 0.005028351 0.004617343 0.003499702 75.79472
## 4 0.001986159 0.043719914 0.040934885 93.62984
## 5 0.058094479 0.144773034 0.138424153 95.61460
## 6 0.013375740 0.029204831 0.029965958 102.60617
There are 384 records and 15 variables in the dataset.
There is a categorical variable called COMPANY_NAME, which can be excluded from our regression analysis since it is unique for every record and is not going to add any value.
The regressors GROUP, MUTUAL and STOCK are also categorical: they indicate whether the company is affiliated with a group, is a mutual company, or is a stock company, respectively. They are coded as dummy variables, with '1' representing 'Yes' and '0' representing 'No'. All the remaining variables are quantitative and are suitable for regression. It can also be seen that some variables are on a scale of millions and others on a scale of thousands.
So, first let's bring the data onto one unit of measurement. Here I convert the millions into thousands.
options("scipen"=100, "digits"=4)
Expenses_T <- NAICExpense$EXPENSES*1000 # Multiplying the variable in Million with 1000 for conversion into thousands.
LONGLOSS_T <- NAICExpense$LONGLOSS*1000
SHORTLOSS_T <- NAICExpense$SHORTLOSS*1000
GPWPERSONAL_T <- NAICExpense$GPWPERSONAL*1000
GPWCOMM_T <- NAICExpense$GPWCOMM*1000
ASSETS_T <- NAICExpense$ASSETS*1000
CASH_T <- NAICExpense$CASH*1000
NAICExpense.scaled <- cbind.data.frame(NAICExpense,Expenses_T,LONGLOSS_T,SHORTLOSS_T,GPWPERSONAL_T,GPWCOMM_T,ASSETS_T,CASH_T)
NAICExpense.scaled <- NAICExpense.scaled[,-c(6,9,10,11,12,13,14)] #removing the original columns in Million's units
colnames(NAICExpense.scaled) <-c("COMPANY_NAME","GROUP","MUTUAL","STOCK","RBC","STAFFWAGE","AGENTWAGE","LIQUIDRATIO","EXPENSES"
,"LONGLOSS","SHORTLOSS","GPWPERSONAL","GPWCOMM","ASSETS","CASH") # reassigning original column names
rm(Expenses_T,LONGLOSS_T,SHORTLOSS_T,GPWPERSONAL_T,GPWCOMM_T,ASSETS_T,CASH_T) # Removing the variables
The scaling conversion is done, so from now on we treat the scaled data as our dataset and assign it back to the original name.
NAICExpense <- NAICExpense.scaled
dim(NAICExpense)
## [1] 384 15
From the summary statistics, it can be seen that there are 19 missing (NA) values in AGENTWAGE. For the purpose of modelling, we need to check our model's performance both with the missing values left in the dataset and with the missing values filled with an appropriate value. In this case I fill the missing values with the mean of AGENTWAGE; the mean-filled copy is called NAICExpense.without.na below. (Note: there are many other ways to handle missing values.)
Regression is performed on both datasets and the R2 values of the two models are compared. The one with the higher R2 value is chosen for further processing.
In the regression analysis, we consider EXPENSES as the dependent (response) variable and the rest of the variables as independent variables (regressors).
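The mean-filling step described above is not shown as code, so here is a minimal sketch of one way it could be done in base R. The helper name `impute_mean` and the toy data frame are hypothetical and only for illustration; the values are made up.

```r
# Hypothetical helper: replace the NAs in one numeric column with that column's mean.
impute_mean <- function(df, col) {
  col_mean <- mean(df[[col]], na.rm = TRUE)   # mean of the observed values only
  df[[col]][is.na(df[[col]])] <- col_mean     # fill the gaps
  df
}

# Toy data standing in for NAICExpense (values are made up)
toy <- data.frame(AGENTWAGE = c(70, NA, 80, 90, NA),
                  STAFFWAGE = c(84, 82, 85, 80, 83))

toy.without.na <- impute_mean(toy, "AGENTWAGE")
toy.without.na$AGENTWAGE
## [1] 70 80 80 90 80
```

Under the same assumption, `NAICExpense.without.na <- impute_mean(NAICExpense, "AGENTWAGE")` would produce the data frame that Model2 is fitted on.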
Model1, shown below, is the linear regression on the dataset that still contains the missing values (lm drops those 19 rows). From the summary below we can see that the R2 for this model is 0.9439.
model1 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)+RBC+STAFFWAGE+AGENTWAGE
+LONGLOSS+SHORTLOSS+GPWPERSONAL+GPWCOMM+ASSETS+CASH+LIQUIDRATIO,data = NAICExpense)
summary(model1)
##
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) +
## RBC + STAFFWAGE + AGENTWAGE + LONGLOSS + SHORTLOSS + GPWPERSONAL +
## GPWCOMM + ASSETS + CASH + LIQUIDRATIO, data = NAICExpense)
##
## Residuals:
## Min 1Q Median 3Q Max
## -214.11 -4.66 -0.63 3.90 154.12
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.0231912290429 16.4684357857614 -0.18
## factor(GROUP)1 1.1739392480503 3.6773445483426 0.32
## factor(MUTUAL)1 -2.3021450659916 5.7366191295391 -0.40
## factor(STOCK)1 -6.3113015800784 5.1025200248204 -1.24
## RBC 0.0000000000688 0.0000000000408 1.69
## STAFFWAGE -0.0227221149306 0.2035984701034 -0.11
## AGENTWAGE 0.0487542702114 0.2652767363692 0.18
## LONGLOSS 0.5193325025076 0.0566007789295 9.18
## SHORTLOSS 0.2197962318176 0.0356373644951 6.17
## GPWPERSONAL 0.0738411002176 0.0209656743521 3.52
## GPWCOMM 0.1165232270759 0.0122443344866 9.52
## ASSETS -0.0292934039757 0.0227998127293 -1.28
## CASH 0.0366885277701 0.0208901903107 1.76
## LIQUIDRATIO 0.0656455080518 0.0983505809014 0.67
## Pr(>|t|)
## (Intercept) 0.85445
## factor(GROUP)1 0.74974
## factor(MUTUAL)1 0.68844
## factor(STOCK)1 0.21695
## RBC 0.09252 .
## STAFFWAGE 0.91120
## AGENTWAGE 0.85429
## LONGLOSS < 0.0000000000000002 ***
## SHORTLOSS 0.0000000019 ***
## GPWPERSONAL 0.00048 ***
## GPWCOMM < 0.0000000000000002 ***
## ASSETS 0.19971
## CASH 0.07992 .
## LIQUIDRATIO 0.50491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.9 on 351 degrees of freedom
## (19 observations deleted due to missingness)
## Multiple R-squared: 0.944, Adjusted R-squared: 0.942
## F-statistic: 454 on 13 and 351 DF, p-value: <0.0000000000000002
Model2 is the linear regression on the dataset without missing values, i.e. with the gaps filled with the mean value of AGENTWAGE. From the summary below we can see that the R2 for this model is 0.9402.
model2 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)+RBC+STAFFWAGE+AGENTWAGE
+LONGLOSS+SHORTLOSS+GPWPERSONAL+GPWCOMM+ASSETS+CASH+LIQUIDRATIO,data = NAICExpense.without.na)
summary(model2)
##
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) +
## RBC + STAFFWAGE + AGENTWAGE + LONGLOSS + SHORTLOSS + GPWPERSONAL +
## GPWCOMM + ASSETS + CASH + LIQUIDRATIO, data = NAICExpense.without.na)
##
## Residuals:
## Min 1Q Median 3Q Max
## -216.90 -4.11 -0.32 4.06 153.07
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -5.2274763688410 16.5430452004589 -0.32
## factor(GROUP)1 0.6677889659980 3.6058682714119 0.19
## factor(MUTUAL)1 -2.2902323999803 5.6997299126064 -0.40
## factor(STOCK)1 -6.3419544819168 5.0152034734682 -1.26
## RBC 0.0000000000789 0.0000000000407 1.94
## STAFFWAGE 0.0456947349780 0.1976552920114 0.23
## AGENTWAGE -0.0106211944562 0.2619536511766 -0.04
## LONGLOSS 0.5085219086773 0.0568438538105 8.95
## SHORTLOSS 0.2278905474506 0.0358009946810 6.37
## GPWPERSONAL 0.0816164932007 0.0210084270342 3.88
## GPWCOMM 0.1162067210338 0.0121012740448 9.60
## ASSETS -0.0475425540967 0.0223314455785 -2.13
## CASH 0.0538094713724 0.0204131919490 2.64
## LIQUIDRATIO 0.0769264413293 0.0972767947106 0.79
## Pr(>|t|)
## (Intercept) 0.75219
## factor(GROUP)1 0.85318
## factor(MUTUAL)1 0.68805
## factor(STOCK)1 0.20683
## RBC 0.05335 .
## STAFFWAGE 0.81730
## AGENTWAGE 0.96768
## LONGLOSS < 0.0000000000000002 ***
## SHORTLOSS 0.00000000058 ***
## GPWPERSONAL 0.00012 ***
## GPWCOMM < 0.0000000000000002 ***
## ASSETS 0.03392 *
## CASH 0.00874 **
## LIQUIDRATIO 0.42957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.1 on 370 degrees of freedom
## Multiple R-squared: 0.94, Adjusted R-squared: 0.938
## F-statistic: 447 on 13 and 370 DF, p-value: <0.0000000000000002
On comparing the two, the R2 value is higher for Model1 (0.944), where the rows with missing values are omitted, than for Model2 (0.940), where those values are filled with the mean. Hence, we consider only Model1 in this case.
So, we continue with the NAICExpense dataset, dropping the rows with missing values.
NAICExpense <- na.omit(NAICExpense)
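As a side-by-side sketch of the comparison just made, the snippet below pulls the number of rows used, R2 and adjusted R2 out of two fitted models. It uses R's built-in `mtcars` data with three values blanked out artificially, so the numbers it prints have nothing to do with the NAIC data; the object names are made up for the example.

```r
# Illustration on built-in data: blank out three hp values on purpose
cars_na <- mtcars
cars_na$hp[c(3, 10, 20)] <- NA

# Mean-filled copy
cars_filled <- cars_na
cars_filled$hp[is.na(cars_filled$hp)] <- mean(cars_na$hp, na.rm = TRUE)

fit_omit   <- lm(mpg ~ wt + hp, data = cars_na)      # lm() drops incomplete rows by default
fit_filled <- lm(mpg ~ wt + hp, data = cars_filled)

# One row per model: rows used, R2 and adjusted R2
comparison <- data.frame(
  model  = c("rows omitted", "mean filled"),
  n      = c(nobs(fit_omit), nobs(fit_filled)),       # 29 and 32 rows respectively
  r2     = c(summary(fit_omit)$r.squared, summary(fit_filled)$r.squared),
  adj_r2 = c(summary(fit_omit)$adj.r.squared, summary(fit_filled)$adj.r.squared)
)
comparison
```

Keeping `n` in the table is a reminder that the two models are fitted on different numbers of rows, so a small difference in R2 should be read with some caution.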
The F-statistic shows that our regression is highly significant, with a very small p-value.
LONGLOSS, SHORTLOSS, GPWPERSONAL and GPWCOMM are very highly significant, and CASH and RBC are significant at the 10% significance level. For the insignificant coefficients we cannot reject the hypothesis that the true value is zero, i.e. those regressors may have no influence (or only a very small influence) on expenses.
Now let's check the assumptions we made when building the regression model: that the errors (residuals) are random and normally distributed.
plot(model1)
Now let's look at the correlations between the regressors and at their scatter plots. The Variance Inflation Factors (VIF) of the Model1 regression are also calculated below.
library(corrplot)
corrmatrix <-round(cor(NAICExpense[,-c(1,2,3,4)]),2)
corrplot(corrmatrix, method = "number")
pairs(NAICExpense[,-c(1,2,3,4)])
car::vif(model1)
## factor(GROUP) factor(MUTUAL) factor(STOCK) RBC STAFFWAGE
## 1.322 2.132 2.351 3.632 2.406
## AGENTWAGE LONGLOSS SHORTLOSS GPWPERSONAL GPWCOMM
## 2.380 9.471 7.826 5.660 6.378
## ASSETS CASH LIQUIDRATIO
## 225.106 172.636 1.095
The correlation matrix of the regressors shows a strong correlation of 0.99 between ASSETS and CASH. This clearly goes against the assumption that our regressors are independent of each other. There is also a strong relationship between LONGLOSS and both SHORTLOSS and GPWPERSONAL.
The same is shown by the VIF scores. The scores for the two regressors ASSETS and CASH are 225.106 and 172.636 respectively, far above the commonly used rule-of-thumb thresholds of 5 or 10. So, there is collinearity between these variables.
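To make the VIF numbers less of a black box: for a numeric regressor j, the VIF is 1 / (1 - R2_j), where R2_j comes from regressing that regressor on all the others. The sketch below computes this by hand in base R. The function name `manual_vif` is made up for the example, it only handles numeric regressors (not factor terms), and it uses the built-in `mtcars` data rather than the NAIC data.

```r
# Hypothetical helper: VIF for each numeric regressor in `vars`,
# obtained by regressing it on the remaining regressors.
manual_vif <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    aux_fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux_fit)$r.squared)     # VIF_j = 1 / (1 - R2_j)
  })
}

vifs <- manual_vif(mtcars, c("wt", "hp", "disp"))
vifs
```

A regressor that is almost a linear combination of the others has R2_j close to 1, which is what drives scores as large as those seen for ASSETS and CASH.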
We need to try different regression models and choose the variables that add the most value. This can be done by trial and error or through domain knowledge.
We should also check the collinearity diagnostics for the given dataset.
perturb::colldiag(NAICExpense[,-c(1,2,3,4)])
## Condition
## Index Variance Decomposition Proportions
## intercept RBC STAFFWAGE AGENTWAGE LIQUIDRATIO EXPENSES
## 1 1.000 0.000 0.003 0.000 0.000 0.000 0.001
## 2 1.568 0.001 0.001 0.000 0.000 0.002 0.000
## 3 3.101 0.000 0.100 0.000 0.000 0.000 0.001
## 4 5.327 0.000 0.441 0.000 0.000 0.000 0.003
## 5 5.763 0.000 0.140 0.000 0.000 0.000 0.007
## 6 9.211 0.000 0.020 0.000 0.000 0.002 0.001
## 7 10.135 0.000 0.006 0.000 0.000 0.000 0.041
## 8 13.817 0.000 0.000 0.001 0.000 0.013 0.909
## 9 16.491 0.001 0.002 0.049 0.023 0.721 0.027
## 10 30.556 0.776 0.001 0.248 0.004 0.194 0.002
## 11 46.326 0.202 0.000 0.647 0.937 0.011 0.000
## 12 58.348 0.021 0.287 0.055 0.034 0.058 0.008
## LONGLOSS SHORTLOSS GPWPERSONAL GPWCOMM ASSETS CASH
## 1 0.001 0.001 0.002 0.002 0.000 0.000
## 2 0.001 0.001 0.001 0.001 0.000 0.000
## 3 0.016 0.003 0.066 0.005 0.000 0.000
## 4 0.006 0.034 0.116 0.088 0.000 0.000
## 5 0.002 0.031 0.026 0.083 0.004 0.008
## 6 0.029 0.515 0.292 0.461 0.000 0.000
## 7 0.534 0.353 0.478 0.003 0.000 0.000
## 8 0.353 0.039 0.002 0.205 0.000 0.000
## 9 0.009 0.000 0.002 0.046 0.000 0.000
## 10 0.000 0.001 0.004 0.001 0.005 0.004
## 11 0.014 0.003 0.004 0.000 0.012 0.015
## 12 0.033 0.018 0.008 0.104 0.978 0.972
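In the table, the row with condition index 30.6 has no set of two or more variables with large variance proportions. For index 46.3, STAFFWAGE and AGENTWAGE form a collinear set, and for index 58.3, ASSETS and CASH are collinear with each other. Note that the data frame passed to colldiag here still contains the response EXPENSES as a column.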
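For intuition about where the condition indices come from, here is a rough base R sketch of a Belsley-style calculation: add an intercept column, scale every column to unit length, take the singular values, and divide the largest by each of them. This is my understanding of the general technique and is not guaranteed to match the output of `perturb::colldiag` exactly; the function name `condition_indices` is made up, and the built-in `mtcars` data is used for illustration.

```r
# Hypothetical helper: condition indices of a set of numeric regressors.
condition_indices <- function(X) {
  X <- cbind(intercept = 1, as.matrix(X))       # add an intercept column
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")     # scale each column to unit length
  d <- svd(X)$d                                 # singular values, largest first
  max(d) / d                                    # smallest index is 1 by construction
}

ci <- condition_indices(mtcars[, c("wt", "hp", "disp")])
ci
```

Large indices (a common rule of thumb is above 30) point to near-dependencies among the columns, which is how the ASSETS and CASH pair shows up in the last row of the table above.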
We should also check which observations are influential in our model, so that we can remove them and rerun the model to see whether they significantly affect our beta values, R2 value and residual standard error.
car::influenceIndexPlot(model1,id.n=5)
The plot shows four diagnostics of how influential each observation is on our model (Cook's distance, studentized residuals, Bonferroni p-values and hat values). We use Cook's distance here. In the Cook's distance panel we can see that two observations stand out as influencing our model.
So, we need to fit our model both with and without these records and choose the better one.
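The with-and-without comparison could be sketched as follows in base R. This is only an illustration on the built-in `mtcars` data with a made-up model; dropping the two observations with the largest Cook's distance mirrors the two points flagged above, but the cut-off of two is an assumption for the example, not a general rule.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation; take the two largest
cd   <- cooks.distance(fit)
top2 <- names(sort(cd, decreasing = TRUE))[1:2]

# Refit without those two rows
fit_trim <- lm(mpg ~ wt + hp, data = mtcars[!(rownames(mtcars) %in% top2), ])

# Compare coefficients, R2 and residual standard error side by side
influence_check <- rbind(
  full    = c(coef(fit),      r2 = summary(fit)$r.squared,      sigma = sigma(fit)),
  trimmed = c(coef(fit_trim), r2 = summary(fit_trim)$r.squared, sigma = sigma(fit_trim))
)
influence_check
```

If the coefficients and fit statistics barely move between the two rows, the flagged observations are not driving the model; a large shift is a reason to look at those records more closely before deciding whether to keep them.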
Finally, we check the residual plots (each independent variable vs. the residuals) for any pattern. Ideally these should be randomly scattered points with no pattern. If we find a pattern, we need to transform the corresponding variable accordingly and rerun our model.
library(car)
residualPlots(model1)
## Test stat Pr(>|t|)
## factor(GROUP) NA NA
## factor(MUTUAL) NA NA
## factor(STOCK) NA NA
## RBC -3.704 0.000
## STAFFWAGE -0.079 0.937
## AGENTWAGE 0.282 0.778
## LONGLOSS -0.241 0.810
## SHORTLOSS -1.415 0.158
## GPWPERSONAL -3.081 0.002
## GPWCOMM 5.096 0.000
## ASSETS 2.363 0.019
## CASH 1.144 0.253
## LIQUIDRATIO -0.036 0.971
## Tukey test 2.958 0.003
From the plots we can observe the following:
RBC: a logarithmic trend.
GPWPERSONAL: a logarithmic trend.
GPWCOMM: a quadratic trend, and also outlier influence.
ASSETS: a logarithmic trend.
CASH: outlier influence.
Based on these findings we try to fit a new model, which will be our improved model.
Before modelling we need to transform our variables. Here squared terms are added for GPWCOMM, LONGLOSS, GPWPERSONAL, CASH, RBC and SHORTLOSS.
NAICExpense$GPWCOMM.sq <- NAICExpense$GPWCOMM*NAICExpense$GPWCOMM
NAICExpense$LONGLOSS.sq <- NAICExpense$LONGLOSS*NAICExpense$LONGLOSS
NAICExpense$GPWPERSONAL.sq <- NAICExpense$GPWPERSONAL*NAICExpense$GPWPERSONAL
NAICExpense$CASH.sq <- NAICExpense$CASH*NAICExpense$CASH
NAICExpense$RBC.sq <- NAICExpense$RBC*NAICExpense$RBC
NAICExpense$SHORTLOSS.sq <- NAICExpense$SHORTLOSS*NAICExpense$SHORTLOSS
#shortloss_Trans <- log(max(SHORTLOSS)+1-SHORTLOSS)
#expenses_trans <- log(max(EXPENSES)+1-EXPENSES)
Below is the improved model. From the summary we can see that R2 has improved (0.96 vs. 0.944 for Model1) and the residual standard error has fallen (25.3 vs. 29.9).
model3 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)
+RBC + RBC.sq
+STAFFWAGE
#+AGENTWAGE
+LONGLOSS + LONGLOSS.sq
+SHORTLOSS + SHORTLOSS.sq
+GPWPERSONAL + GPWPERSONAL.sq
+GPWCOMM + GPWCOMM.sq
+LIQUIDRATIO
#+ASSETS
+CASH
,data = NAICExpense)
summary(model3)
##
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) +
## RBC + RBC.sq + STAFFWAGE + LONGLOSS + LONGLOSS.sq + SHORTLOSS +
## SHORTLOSS.sq + GPWPERSONAL + GPWPERSONAL.sq + GPWCOMM + GPWCOMM.sq +
## LIQUIDRATIO + CASH, data = NAICExpense)
##
## Residuals:
## Min 1Q Median 3Q Max
## -186.97 -4.22 -1.56 3.93 162.99
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -8.077566103859192381264620 12.227883788295828892955797
## factor(GROUP)1 1.755551017233113242355103 3.205056062105439806941831
## factor(MUTUAL)1 -2.728132368771277960206589 4.896473451366778029125726
## factor(STOCK)1 -4.864704259683332310260084 4.334709606787880531442170
## RBC 0.000000000262355087394743 0.000000000077767682501707
## RBC.sq -0.000000000000000000000384 0.000000000000000000000103
## STAFFWAGE 0.062462620771259878826864 0.113340288229061586511293
## LONGLOSS 0.279881133704393991745718 0.065289743205929895442097
## LONGLOSS.sq 0.001276330505868971927977 0.000182361457267686143027
## SHORTLOSS 0.375190170633010577905253 0.050311532334097602836565
## SHORTLOSS.sq -0.000299309776714503793411 0.000046449004813322207824
## GPWPERSONAL 0.205106025705477035270263 0.026847708744324672719417
## GPWPERSONAL.sq -0.000258322668778284065313 0.000037621709517332362852
## GPWCOMM 0.015890487524592763340925 0.015637653484269303100218
## GPWCOMM.sq 0.000025814228281433666996 0.000003760776703845596315
## LIQUIDRATIO 0.079730214985546263295468 0.081704812403278787025229
## CASH 0.015008375642211009212690 0.003020187278405969850958
## t value Pr(>|t|)
## (Intercept) -0.66 0.50932
## factor(GROUP)1 0.55 0.58422
## factor(MUTUAL)1 -0.56 0.57777
## factor(STOCK)1 -1.12 0.26252
## RBC 3.37 0.00083 ***
## RBC.sq -3.74 0.00022 ***
## STAFFWAGE 0.55 0.58191
## LONGLOSS 4.29 0.00002350439422 ***
## LONGLOSS.sq 7.00 0.00000000001333 ***
## SHORTLOSS 7.46 0.00000000000071 ***
## SHORTLOSS.sq -6.44 0.00000000038828 ***
## GPWPERSONAL 7.64 0.00000000000021 ***
## GPWPERSONAL.sq -6.87 0.00000000003039 ***
## GPWCOMM 1.02 0.31026
## GPWCOMM.sq 6.86 0.00000000003081 ***
## LIQUIDRATIO 0.98 0.32983
## CASH 4.97 0.00000105580206 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.3 on 348 degrees of freedom
## Multiple R-squared: 0.96, Adjusted R-squared: 0.958
## F-statistic: 523 on 16 and 348 DF, p-value: <0.0000000000000002
The analysis is not over. We fitted the best model we could within the constraints of this particular dataset. There are many other transformations that could be applied and might improve the model further. Going forward, one can try to get data on other variables that might have an impact on the predictability of expenses. On the other hand, we should not include variables only to achieve a high R2, i.e. we must not overfit our model.