Expenses of the insurance companies

The Goal is to fit a suitable linear regression model that explains the expenses of the insurance companies

Step 1 - Load the data file:

Insurance<-read.csv("C://Neeraj//Neeraj//Neeraj-1//neeraj//personal//Neeraj//ISB - CBA//Term2//SA2//individual Assignment 1//NAICExpense.csv", header=T)

Step 2 - Exploratory Analyis:

This step is required to initial see how the data looks. To see the relation between them, if there is any. And then eventually, see whether all method says same thing or not

First, we will see Correlation matrix

From below data, we can clear see Expenses has a positive correlation * RBC * LONGLOSS * SHORTLOSS * GPWPERSONAL * GPWCOMM * ASSETS * CASH

But has NO LINEAR correlation with * STAFFWAGE * AGENTWAGE * LIQUIDRATIO

We are not interested in looking for any correlation between categorical or dummy variables.

cor(na.omit(Insurance[,-c(1:4)]))
##                      RBC   EXPENSES  STAFFWAGE  AGENTWAGE   LONGLOSS
## RBC          1.000000000 0.65112669 0.09702797 0.04510254 0.52875929
## EXPENSES     0.651126685 1.00000000 0.07345174 0.06937996 0.92205504
## STAFFWAGE    0.097027967 0.07345174 1.00000000 0.76140285 0.05890984
## AGENTWAGE    0.045102535 0.06937996 0.76140285 1.00000000 0.07900190
## LONGLOSS     0.528759287 0.92205504 0.05890984 0.07900190 1.00000000
## SHORTLOSS    0.611035826 0.93081998 0.06747716 0.04484587 0.87720069
## GPWPERSONAL  0.410630219 0.81849057 0.01764025 0.03668741 0.89219827
## GPWCOMM      0.717061417 0.87303753 0.10719766 0.06585602 0.73639845
## ASSETS       0.780363636 0.85063417 0.06827034 0.05817508 0.74453978
## CASH         0.746863898 0.83068700 0.06768256 0.05813373 0.72527444
## LIQUIDRATIO -0.001495437 0.01474347 0.06653607 0.08883819 0.01785654
##               SHORTLOSS GPWPERSONAL     GPWCOMM     ASSETS       CASH
## RBC         0.611035826 0.410630219  0.71706142 0.78036364 0.74686390
## EXPENSES    0.930819984 0.818490566  0.87303753 0.85063417 0.83068700
## STAFFWAGE   0.067477156 0.017640253  0.10719766 0.06827034 0.06768256
## AGENTWAGE   0.044845871 0.036687412  0.06585602 0.05817508 0.05813373
## LONGLOSS    0.877200685 0.892198275  0.73639845 0.74453978 0.72527444
## SHORTLOSS   1.000000000 0.769589733  0.85375523 0.82183785 0.80003391
## GPWPERSONAL 0.769589733 1.000000000  0.58372413 0.63011090 0.62110392
## GPWCOMM     0.853755234 0.583724129  1.00000000 0.85657317 0.82772372
## ASSETS      0.821837846 0.630110896  0.85657317 1.00000000 0.99477222
## CASH        0.800033909 0.621103922  0.82772372 0.99477222 1.00000000
## LIQUIDRATIO 0.004387883 0.007752182 -0.02815168 0.03021100 0.05268059
##              LIQUIDRATIO
## RBC         -0.001495437
## EXPENSES     0.014743470
## STAFFWAGE    0.066536071
## AGENTWAGE    0.088838191
## LONGLOSS     0.017856540
## SHORTLOSS    0.004387883
## GPWPERSONAL  0.007752182
## GPWCOMM     -0.028151679
## ASSETS       0.030210998
## CASH         0.052680586
## LIQUIDRATIO  1.000000000

Now within Exploratory analysis, we will pair wise combination or graphical represntation of correlation to verify the data which we seen above

And this almost say same but data is clustered at one place. In next step, we will do pair-wise comparison

pairs(na.omit(Insurance[,-c(1:4)]))

Step 3 - Training and Test Data-set

Before we move ahead, we will divide data in two segments. One is training data on which we will train our model and second will be test / validation data, on which we will verify the data.

The training set will take 70% of entire data and 30% will go under testing dataset. We will work on Training data setup in the building our model.

## [1] 251  15
## [1] 109  15

Step 4 - Box Plot (Another view of looking at data - Exploratory Analysis)

Here we will see the data with boxplot representaion. If data is more clustered or outliers are too much, we will transformt the data and see whether after transformation does that make any sense.

We will do transformation by taking log.

Step 5 - Pairwise relation between all the combination with Expenses

This will give us the relation between two variable. Where data is more clustered or outliers are too much, we will do the comparison wit tranaformed data as well

We will do transformation by taking log.

Clearly from below diagarm, you can see that there has NO LINEAR correlation with * STAFFWAGE * AGENTWAGE * LIQUIDRATIO

From the charts below, it’s prominent that after transformation, there’s comes a relation between the different dimensions or column.

You can see,

* transformation of Expenses and transformation of RBC seems linearly relation with definitely variablility

* transformation of Expenses and transformation of LONGLOSS seems linearly relation with definitely variablility

* transformation of Expenses and transformation of SHORTLOSS seems linearly relation with definitely variablility

* transformation of Expenses and transformation of GPWPERSONAL seems linearly relation with definitely variablility

* transformation of Expenses and transformation of GPWCOMM seems linearly relation with definitely variablility

* transformation of Expenses and transformation of ASSETS seems linearly relation with definitely variablility

* transformation of Expenses and transformation of CASH seems linearly relation with definitely variablility

par(mfrow=c(1,2))
# log came good
plot(RBC, EXPENSES, main="RBC Vs Expenses")
plot(log(RBC), log(EXPENSES), main="Log - RBC Vs Expenses")
## Warning in log(EXPENSES): NaNs produced

#No Relation
plot(STAFFWAGE, EXPENSES)
plot(log(STAFFWAGE), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

#No Relation
plot(AGENTWAGE, EXPENSES)
plot(log(AGENTWAGE), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

# log came good
plot(LONGLOSS, EXPENSES)
plot(log(LONGLOSS), log(EXPENSES))
## Warning in log(LONGLOSS): NaNs produced

## Warning in log(LONGLOSS): NaNs produced

# log came good
plot(SHORTLOSS, EXPENSES)
plot(log(SHORTLOSS), log(EXPENSES))
## Warning in log(SHORTLOSS): NaNs produced

## Warning in log(SHORTLOSS): NaNs produced

#lot of variability in log - lot of outliers in initial part
plot(GPWPERSONAL, EXPENSES)
plot(log(GPWPERSONAL), log(EXPENSES))
## Warning in log(GPWPERSONAL): NaNs produced

## Warning in log(GPWPERSONAL): NaNs produced

# log came good - lot of outliers in initial part
plot(GPWCOMM, EXPENSES)
plot(log(GPWCOMM), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

# log came good
plot(ASSETS, EXPENSES)
plot(log(ASSETS), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

# log came good
plot(CASH, EXPENSES)
plot(log(CASH), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

# No relation
plot(LIQUIDRATIO, EXPENSES, main = "abc")
plot(log(LIQUIDRATIO), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced

Step 5 - Building a model on Training Data set

Now’s let fit the model. I have reached to this model after number of trials that not a scope of this.

The decision was made after number of iternation with different dimensions / columns.

The factor that was considered are:

Summary of model

1) Median of Residual should be close to ZERO (to make sure residuals follow the normal distribution)
2) Cofficient Matrix. Should have only factors / dimension that are giving value to model. Remove reduandant varibales,
    In this case we only rentained four columns:
    a) LONGLOSS
    b) SHORTLOSS
    c) GPWPERSONAL
    d) GPWCOMM
    e) CASH
3) Shold look for R-square and R-square adjusted. Both should be close. R-square adjusted should take into account redundant variables.
    Our R-square and R-square adjusted are coming close to 95%
    With this we got the Q-Q plot. And that too looks good.
    Cooks distnace is also under range.

Then we looked for ResidualPlot that is a function in Car library. There should be straight line in the graph

Followed by, we looked on Influencial Plot and remove any observation that are outlier and causing the problem. In the entire process, I have removed 12 obseration to reach to a good model (Those can be outlier or specify case but with limited domain knowledge, I can’t comment on that)

VIF function

all the values are under 10 only. So, NO action was required.

Finally I had drawn colldiag table (that is from Pertub table). This table will show the coleniarlity between dimesnion. With this, we found a good collienarity between ASSETS AND CASH. So, we have removed ASSET from our model because that information is captured in CASH data-points. The whole process is then repeated

P.S.: We have used logTransformationDelta variable equals to “.000000000001” and added this value in doing log transfomation. This is required to get the log of ZERO, else our model will fail. There’s are multiple ways of doing this. But I chose this one for building my model.

attach(Insurance_train)
## The following objects are masked from Insurance_train (pos = 3):
## 
##     AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
##     GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
##     SHORTLOSS, STAFFWAGE, STOCK
logTransformationDelta <- .000000000001
par(mfrow=c(2,2))
fit<-lm(EXPENSES~log(logTransformationDelta +LONGLOSS)+log(logTransformationDelta +SHORTLOSS)+GPWPERSONAL+log(logTransformationDelta +GPWCOMM)+CASH-1)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
summary(fit)
## 
## Call:
## lm(formula = EXPENSES ~ log(logTransformationDelta + LONGLOSS) + 
##     log(logTransformationDelta + SHORTLOSS) + GPWPERSONAL + log(logTransformationDelta + 
##     GPWCOMM) + CASH - 1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.083534 -0.003685 -0.000098  0.008668  0.187678 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## log(logTransformationDelta + LONGLOSS)   6.386e-05  3.223e-04   0.198
## log(logTransformationDelta + SHORTLOSS) -3.617e-04  4.534e-04  -0.798
## GPWPERSONAL                              3.671e-01  1.711e-02  21.459
## log(logTransformationDelta + GPWCOMM)   -5.466e-06  7.593e-04  -0.007
## CASH                                     2.326e-02  3.394e-03   6.853
##                                         Pr(>|t|)    
## log(logTransformationDelta + LONGLOSS)     0.843    
## log(logTransformationDelta + SHORTLOSS)    0.427    
## GPWPERSONAL                              < 2e-16 ***
## log(logTransformationDelta + GPWCOMM)      0.994    
## CASH                                    5.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03211 on 101 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.8907, Adjusted R-squared:  0.8853 
## F-statistic: 164.7 on 5 and 101 DF,  p-value: < 2.2e-16
plot(fit)

par(mfrow=c(1,1))

library(car)
residualPlots(fit)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced

## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced

##                                         Test stat Pr(>|t|)
## log(logTransformationDelta + LONGLOSS)     -1.465    0.146
## log(logTransformationDelta + SHORTLOSS)    -1.890    0.062
## GPWPERSONAL                                -3.174    0.002
## log(logTransformationDelta + GPWCOMM)      -0.050    0.960
## CASH                                      -10.549    0.000
## Tukey test                                 -3.387    0.001
#influencial data or outlier
influenceIndexPlot(fit, id.n=5)

#co linerality
vif(fit) #should be less than 20
## Warning in vif.default(fit): No intercept: vifs may not be sensible.
##  log(logTransformationDelta + LONGLOSS) 
##                                2.233720 
## log(logTransformationDelta + SHORTLOSS) 
##                                1.874416 
##                             GPWPERSONAL 
##                                1.268212 
##   log(logTransformationDelta + GPWCOMM) 
##                                2.413359 
##                                    CASH 
##                                1.346911
library(perturb)
colldiag(Insurance[,-c(1:4)]) #greater than .5
## Condition
## Index    Variance Decomposition Proportions
##            intercept RBC   EXPENSES STAFFWAGE AGENTWAGE LONGLOSS SHORTLOSS
## 1    1.000 0.000     0.003 0.001    0.000     0.000     0.001    0.001    
## 2    1.573 0.001     0.001 0.000    0.000     0.000     0.001    0.001    
## 3    3.108 0.000     0.102 0.001    0.000     0.000     0.015    0.003    
## 4    5.370 0.000     0.452 0.002    0.000     0.000     0.005    0.032    
## 5    5.770 0.000     0.127 0.006    0.000     0.000     0.002    0.033    
## 6    9.270 0.000     0.022 0.000    0.000     0.000     0.018    0.539    
## 7   10.253 0.000     0.004 0.039    0.000     0.000     0.534    0.318    
## 8   14.250 0.000     0.000 0.905    0.001     0.001     0.366    0.050    
## 9   16.421 0.001     0.002 0.039    0.047     0.021     0.013    0.000    
## 10  30.252 0.781     0.000 0.001    0.235     0.005     0.002    0.001    
## 11  47.161 0.198     0.000 0.001    0.661     0.936     0.012    0.005    
## 12  58.468 0.019     0.286 0.006    0.056     0.036     0.031    0.017    
##    GPWPERSONAL GPWCOMM ASSETS CASH  LIQUIDRATIO
## 1  0.002       0.002   0.000  0.000 0.000      
## 2  0.001       0.001   0.000  0.000 0.002      
## 3  0.064       0.005   0.000  0.000 0.000      
## 4  0.117       0.087   0.000  0.000 0.000      
## 5  0.027       0.090   0.004  0.008 0.000      
## 6  0.259       0.476   0.000  0.000 0.002      
## 7  0.507       0.008   0.000  0.000 0.000      
## 8  0.002       0.175   0.000  0.000 0.020      
## 9  0.002       0.053   0.000  0.000 0.727      
## 10 0.006       0.001   0.004  0.004 0.184      
## 11 0.004       0.000   0.014  0.017 0.007      
## 12 0.009       0.102   0.977  0.970 0.058

Step 6 - Validating a model on Test Data set

Same steps and graph has been plotted and verify that model fits there as well with 80% of R-square and R-square adjusted. Keeping other valildation in mind

attach(Insurance_test)
## The following objects are masked from Insurance_train (pos = 5):
## 
##     AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
##     GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
##     SHORTLOSS, STAFFWAGE, STOCK
## The following objects are masked from Insurance_train (pos = 6):
## 
##     AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
##     GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
##     SHORTLOSS, STAFFWAGE, STOCK
par(mfrow=c(2,2))
fit<-lm(EXPENSES~log(logTransformationDelta +LONGLOSS)+log(logTransformationDelta +SHORTLOSS)+GPWPERSONAL+log(logTransformationDelta +GPWCOMM)+CASH-1)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
## Warning in log(logTransformationDelta + GPWCOMM): NaNs produced
summary(fit)
## 
## Call:
## lm(formula = EXPENSES ~ log(logTransformationDelta + LONGLOSS) + 
##     log(logTransformationDelta + SHORTLOSS) + GPWPERSONAL + log(logTransformationDelta + 
##     GPWCOMM) + CASH - 1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.190276 -0.002638 -0.000074  0.004291  0.194852 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## log(logTransformationDelta + LONGLOSS)  -0.0001228  0.0002712  -0.453
## log(logTransformationDelta + SHORTLOSS) -0.0001262  0.0002822  -0.447
## GPWPERSONAL                              0.1812048  0.0210533   8.607
## log(logTransformationDelta + GPWCOMM)    0.0004920  0.0010125   0.486
## CASH                                     0.1145793  0.0037967  30.179
##                                         Pr(>|t|)    
## log(logTransformationDelta + LONGLOSS)     0.651    
## log(logTransformationDelta + SHORTLOSS)    0.655    
## GPWPERSONAL                             1.29e-15 ***
## log(logTransformationDelta + GPWCOMM)      0.627    
## CASH                                     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03517 on 226 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.9464, Adjusted R-squared:  0.9452 
## F-statistic: 798.2 on 5 and 226 DF,  p-value: < 2.2e-16
plot(fit)

par(mfrow=c(1,1))

library(car)
qqPlot(fit)

residualPlot(fit)

residualPlots(fit)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
## Warning in log(logTransformationDelta + GPWCOMM): NaNs produced

##                                         Test stat Pr(>|t|)
## log(logTransformationDelta + LONGLOSS)     -0.055    0.956
## log(logTransformationDelta + SHORTLOSS)     0.249    0.804
## GPWPERSONAL                                 0.865    0.388
## log(logTransformationDelta + GPWCOMM)       0.338    0.736
## CASH                                        2.714    0.007
## Tukey test                                  2.325    0.020
#influencial data or outlier
influenceIndexPlot(fit, id.n=5)

#co linerality
vif(fit) #should be less than 20
## Warning in vif.default(fit): No intercept: vifs may not be sensible.
##  log(logTransformationDelta + LONGLOSS) 
##                                2.643462 
## log(logTransformationDelta + SHORTLOSS) 
##                                2.462335 
##                             GPWPERSONAL 
##                                2.836691 
##   log(logTransformationDelta + GPWCOMM) 
##                                4.612298 
##                                    CASH 
##                                2.844274
library(perturb)
colldiag(Insurance[,-c(1:4)]) #greater than .5
## Condition
## Index    Variance Decomposition Proportions
##            intercept RBC   EXPENSES STAFFWAGE AGENTWAGE LONGLOSS SHORTLOSS
## 1    1.000 0.000     0.003 0.001    0.000     0.000     0.001    0.001    
## 2    1.573 0.001     0.001 0.000    0.000     0.000     0.001    0.001    
## 3    3.108 0.000     0.102 0.001    0.000     0.000     0.015    0.003    
## 4    5.370 0.000     0.452 0.002    0.000     0.000     0.005    0.032    
## 5    5.770 0.000     0.127 0.006    0.000     0.000     0.002    0.033    
## 6    9.270 0.000     0.022 0.000    0.000     0.000     0.018    0.539    
## 7   10.253 0.000     0.004 0.039    0.000     0.000     0.534    0.318    
## 8   14.250 0.000     0.000 0.905    0.001     0.001     0.366    0.050    
## 9   16.421 0.001     0.002 0.039    0.047     0.021     0.013    0.000    
## 10  30.252 0.781     0.000 0.001    0.235     0.005     0.002    0.001    
## 11  47.161 0.198     0.000 0.001    0.661     0.936     0.012    0.005    
## 12  58.468 0.019     0.286 0.006    0.056     0.036     0.031    0.017    
##    GPWPERSONAL GPWCOMM ASSETS CASH  LIQUIDRATIO
## 1  0.002       0.002   0.000  0.000 0.000      
## 2  0.001       0.001   0.000  0.000 0.002      
## 3  0.064       0.005   0.000  0.000 0.000      
## 4  0.117       0.087   0.000  0.000 0.000      
## 5  0.027       0.090   0.004  0.008 0.000      
## 6  0.259       0.476   0.000  0.000 0.002      
## 7  0.507       0.008   0.000  0.000 0.000      
## 8  0.002       0.175   0.000  0.000 0.020      
## 9  0.002       0.053   0.000  0.000 0.727      
## 10 0.006       0.001   0.004  0.004 0.184      
## 11 0.004       0.000   0.014  0.017 0.007      
## 12 0.009       0.102   0.977  0.970 0.058