The Goal is to fit a suitable linear regression model that explains the expenses of the insurance companies
Insurance<-read.csv("C://Neeraj//Neeraj//Neeraj-1//neeraj//personal//Neeraj//ISB - CBA//Term2//SA2//individual Assignment 1//NAICExpense.csv", header=T)
This step is required to initial see how the data looks. To see the relation between them, if there is any. And then eventually, see whether all method says same thing or not
From below data, we can clear see Expenses has a positive correlation * RBC * LONGLOSS * SHORTLOSS * GPWPERSONAL * GPWCOMM * ASSETS * CASH
But has NO LINEAR correlation with * STAFFWAGE * AGENTWAGE * LIQUIDRATIO
We are not interested in looking for any correlation between categorical or dummy variables.
cor(na.omit(Insurance[,-c(1:4)]))
## RBC EXPENSES STAFFWAGE AGENTWAGE LONGLOSS
## RBC 1.000000000 0.65112669 0.09702797 0.04510254 0.52875929
## EXPENSES 0.651126685 1.00000000 0.07345174 0.06937996 0.92205504
## STAFFWAGE 0.097027967 0.07345174 1.00000000 0.76140285 0.05890984
## AGENTWAGE 0.045102535 0.06937996 0.76140285 1.00000000 0.07900190
## LONGLOSS 0.528759287 0.92205504 0.05890984 0.07900190 1.00000000
## SHORTLOSS 0.611035826 0.93081998 0.06747716 0.04484587 0.87720069
## GPWPERSONAL 0.410630219 0.81849057 0.01764025 0.03668741 0.89219827
## GPWCOMM 0.717061417 0.87303753 0.10719766 0.06585602 0.73639845
## ASSETS 0.780363636 0.85063417 0.06827034 0.05817508 0.74453978
## CASH 0.746863898 0.83068700 0.06768256 0.05813373 0.72527444
## LIQUIDRATIO -0.001495437 0.01474347 0.06653607 0.08883819 0.01785654
## SHORTLOSS GPWPERSONAL GPWCOMM ASSETS CASH
## RBC 0.611035826 0.410630219 0.71706142 0.78036364 0.74686390
## EXPENSES 0.930819984 0.818490566 0.87303753 0.85063417 0.83068700
## STAFFWAGE 0.067477156 0.017640253 0.10719766 0.06827034 0.06768256
## AGENTWAGE 0.044845871 0.036687412 0.06585602 0.05817508 0.05813373
## LONGLOSS 0.877200685 0.892198275 0.73639845 0.74453978 0.72527444
## SHORTLOSS 1.000000000 0.769589733 0.85375523 0.82183785 0.80003391
## GPWPERSONAL 0.769589733 1.000000000 0.58372413 0.63011090 0.62110392
## GPWCOMM 0.853755234 0.583724129 1.00000000 0.85657317 0.82772372
## ASSETS 0.821837846 0.630110896 0.85657317 1.00000000 0.99477222
## CASH 0.800033909 0.621103922 0.82772372 0.99477222 1.00000000
## LIQUIDRATIO 0.004387883 0.007752182 -0.02815168 0.03021100 0.05268059
## LIQUIDRATIO
## RBC -0.001495437
## EXPENSES 0.014743470
## STAFFWAGE 0.066536071
## AGENTWAGE 0.088838191
## LONGLOSS 0.017856540
## SHORTLOSS 0.004387883
## GPWPERSONAL 0.007752182
## GPWCOMM -0.028151679
## ASSETS 0.030210998
## CASH 0.052680586
## LIQUIDRATIO 1.000000000
And this almost say same but data is clustered at one place. In next step, we will do pair-wise comparison
pairs(na.omit(Insurance[,-c(1:4)]))
Before we move ahead, we will divide data in two segments. One is training data on which we will train our model and second will be test / validation data, on which we will verify the data.
The training set will take 70% of entire data and 30% will go under testing dataset. We will work on Training data setup in the building our model.
## [1] 251 15
## [1] 109 15
Here we will see the data with boxplot representaion. If data is more clustered or outliers are too much, we will transformt the data and see whether after transformation does that make any sense.
We will do transformation by taking log.
This will give us the relation between two variable. Where data is more clustered or outliers are too much, we will do the comparison wit tranaformed data as well
We will do transformation by taking log.
Clearly from below diagarm, you can see that there has NO LINEAR correlation with * STAFFWAGE * AGENTWAGE * LIQUIDRATIO
From the charts below, it’s prominent that after transformation, there’s comes a relation between the different dimensions or column.
You can see,
par(mfrow=c(1,2))
# log came good
plot(RBC, EXPENSES, main="RBC Vs Expenses")
plot(log(RBC), log(EXPENSES), main="Log - RBC Vs Expenses")
## Warning in log(EXPENSES): NaNs produced
#No Relation
plot(STAFFWAGE, EXPENSES)
plot(log(STAFFWAGE), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
#No Relation
plot(AGENTWAGE, EXPENSES)
plot(log(AGENTWAGE), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
# log came good
plot(LONGLOSS, EXPENSES)
plot(log(LONGLOSS), log(EXPENSES))
## Warning in log(LONGLOSS): NaNs produced
## Warning in log(LONGLOSS): NaNs produced
# log came good
plot(SHORTLOSS, EXPENSES)
plot(log(SHORTLOSS), log(EXPENSES))
## Warning in log(SHORTLOSS): NaNs produced
## Warning in log(SHORTLOSS): NaNs produced
#lot of variability in log - lot of outliers in initial part
plot(GPWPERSONAL, EXPENSES)
plot(log(GPWPERSONAL), log(EXPENSES))
## Warning in log(GPWPERSONAL): NaNs produced
## Warning in log(GPWPERSONAL): NaNs produced
# log came good - lot of outliers in initial part
plot(GPWCOMM, EXPENSES)
plot(log(GPWCOMM), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
# log came good
plot(ASSETS, EXPENSES)
plot(log(ASSETS), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
# log came good
plot(CASH, EXPENSES)
plot(log(CASH), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
# No relation
plot(LIQUIDRATIO, EXPENSES, main = "abc")
plot(log(LIQUIDRATIO), log(EXPENSES))
## Warning in log(EXPENSES): NaNs produced
Now’s let fit the model. I have reached to this model after number of trials that not a scope of this.
The decision was made after number of iternation with different dimensions / columns.
The factor that was considered are:
In this case we only rentained four columns:
a) LONGLOSS
b) SHORTLOSS
c) GPWPERSONAL
d) GPWCOMM
e) CASH
Our R-square and R-square adjusted are coming close to 95%
With this we got the Q-Q plot. And that too looks good.
Cooks distnace is also under range.
all the values are under 10 only. So, NO action was required.
P.S.: We have used logTransformationDelta variable equals to “.000000000001” and added this value in doing log transfomation. This is required to get the log of ZERO, else our model will fail. There’s are multiple ways of doing this. But I chose this one for building my model.
attach(Insurance_train)
## The following objects are masked from Insurance_train (pos = 3):
##
## AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
## GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
## SHORTLOSS, STAFFWAGE, STOCK
logTransformationDelta <- .000000000001
par(mfrow=c(2,2))
fit<-lm(EXPENSES~log(logTransformationDelta +LONGLOSS)+log(logTransformationDelta +SHORTLOSS)+GPWPERSONAL+log(logTransformationDelta +GPWCOMM)+CASH-1)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
summary(fit)
##
## Call:
## lm(formula = EXPENSES ~ log(logTransformationDelta + LONGLOSS) +
## log(logTransformationDelta + SHORTLOSS) + GPWPERSONAL + log(logTransformationDelta +
## GPWCOMM) + CASH - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.083534 -0.003685 -0.000098 0.008668 0.187678
##
## Coefficients:
## Estimate Std. Error t value
## log(logTransformationDelta + LONGLOSS) 6.386e-05 3.223e-04 0.198
## log(logTransformationDelta + SHORTLOSS) -3.617e-04 4.534e-04 -0.798
## GPWPERSONAL 3.671e-01 1.711e-02 21.459
## log(logTransformationDelta + GPWCOMM) -5.466e-06 7.593e-04 -0.007
## CASH 2.326e-02 3.394e-03 6.853
## Pr(>|t|)
## log(logTransformationDelta + LONGLOSS) 0.843
## log(logTransformationDelta + SHORTLOSS) 0.427
## GPWPERSONAL < 2e-16 ***
## log(logTransformationDelta + GPWCOMM) 0.994
## CASH 5.82e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03211 on 101 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.8907, Adjusted R-squared: 0.8853
## F-statistic: 164.7 on 5 and 101 DF, p-value: < 2.2e-16
plot(fit)
par(mfrow=c(1,1))
library(car)
residualPlots(fit)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Test stat Pr(>|t|)
## log(logTransformationDelta + LONGLOSS) -1.465 0.146
## log(logTransformationDelta + SHORTLOSS) -1.890 0.062
## GPWPERSONAL -3.174 0.002
## log(logTransformationDelta + GPWCOMM) -0.050 0.960
## CASH -10.549 0.000
## Tukey test -3.387 0.001
#influencial data or outlier
influenceIndexPlot(fit, id.n=5)
#co linerality
vif(fit) #should be less than 20
## Warning in vif.default(fit): No intercept: vifs may not be sensible.
## log(logTransformationDelta + LONGLOSS)
## 2.233720
## log(logTransformationDelta + SHORTLOSS)
## 1.874416
## GPWPERSONAL
## 1.268212
## log(logTransformationDelta + GPWCOMM)
## 2.413359
## CASH
## 1.346911
library(perturb)
colldiag(Insurance[,-c(1:4)]) #greater than .5
## Condition
## Index Variance Decomposition Proportions
## intercept RBC EXPENSES STAFFWAGE AGENTWAGE LONGLOSS SHORTLOSS
## 1 1.000 0.000 0.003 0.001 0.000 0.000 0.001 0.001
## 2 1.573 0.001 0.001 0.000 0.000 0.000 0.001 0.001
## 3 3.108 0.000 0.102 0.001 0.000 0.000 0.015 0.003
## 4 5.370 0.000 0.452 0.002 0.000 0.000 0.005 0.032
## 5 5.770 0.000 0.127 0.006 0.000 0.000 0.002 0.033
## 6 9.270 0.000 0.022 0.000 0.000 0.000 0.018 0.539
## 7 10.253 0.000 0.004 0.039 0.000 0.000 0.534 0.318
## 8 14.250 0.000 0.000 0.905 0.001 0.001 0.366 0.050
## 9 16.421 0.001 0.002 0.039 0.047 0.021 0.013 0.000
## 10 30.252 0.781 0.000 0.001 0.235 0.005 0.002 0.001
## 11 47.161 0.198 0.000 0.001 0.661 0.936 0.012 0.005
## 12 58.468 0.019 0.286 0.006 0.056 0.036 0.031 0.017
## GPWPERSONAL GPWCOMM ASSETS CASH LIQUIDRATIO
## 1 0.002 0.002 0.000 0.000 0.000
## 2 0.001 0.001 0.000 0.000 0.002
## 3 0.064 0.005 0.000 0.000 0.000
## 4 0.117 0.087 0.000 0.000 0.000
## 5 0.027 0.090 0.004 0.008 0.000
## 6 0.259 0.476 0.000 0.000 0.002
## 7 0.507 0.008 0.000 0.000 0.000
## 8 0.002 0.175 0.000 0.000 0.020
## 9 0.002 0.053 0.000 0.000 0.727
## 10 0.006 0.001 0.004 0.004 0.184
## 11 0.004 0.000 0.014 0.017 0.007
## 12 0.009 0.102 0.977 0.970 0.058
attach(Insurance_test)
## The following objects are masked from Insurance_train (pos = 5):
##
## AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
## GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
## SHORTLOSS, STAFFWAGE, STOCK
## The following objects are masked from Insurance_train (pos = 6):
##
## AGENTWAGE, ASSETS, CASH, COMPANY_NAME, EXPENSES, GPWCOMM,
## GPWPERSONAL, GROUP, LIQUIDRATIO, LONGLOSS, MUTUAL, RBC,
## SHORTLOSS, STAFFWAGE, STOCK
par(mfrow=c(2,2))
fit<-lm(EXPENSES~log(logTransformationDelta +LONGLOSS)+log(logTransformationDelta +SHORTLOSS)+GPWPERSONAL+log(logTransformationDelta +GPWCOMM)+CASH-1)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
## Warning in log(logTransformationDelta + GPWCOMM): NaNs produced
summary(fit)
##
## Call:
## lm(formula = EXPENSES ~ log(logTransformationDelta + LONGLOSS) +
## log(logTransformationDelta + SHORTLOSS) + GPWPERSONAL + log(logTransformationDelta +
## GPWCOMM) + CASH - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.190276 -0.002638 -0.000074 0.004291 0.194852
##
## Coefficients:
## Estimate Std. Error t value
## log(logTransformationDelta + LONGLOSS) -0.0001228 0.0002712 -0.453
## log(logTransformationDelta + SHORTLOSS) -0.0001262 0.0002822 -0.447
## GPWPERSONAL 0.1812048 0.0210533 8.607
## log(logTransformationDelta + GPWCOMM) 0.0004920 0.0010125 0.486
## CASH 0.1145793 0.0037967 30.179
## Pr(>|t|)
## log(logTransformationDelta + LONGLOSS) 0.651
## log(logTransformationDelta + SHORTLOSS) 0.655
## GPWPERSONAL 1.29e-15 ***
## log(logTransformationDelta + GPWCOMM) 0.627
## CASH < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03517 on 226 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.9464, Adjusted R-squared: 0.9452
## F-statistic: 798.2 on 5 and 226 DF, p-value: < 2.2e-16
plot(fit)
par(mfrow=c(1,1))
library(car)
qqPlot(fit)
residualPlot(fit)
residualPlots(fit)
## Warning in log(logTransformationDelta + LONGLOSS): NaNs produced
## Warning in log(logTransformationDelta + SHORTLOSS): NaNs produced
## Warning in log(logTransformationDelta + GPWCOMM): NaNs produced
## Test stat Pr(>|t|)
## log(logTransformationDelta + LONGLOSS) -0.055 0.956
## log(logTransformationDelta + SHORTLOSS) 0.249 0.804
## GPWPERSONAL 0.865 0.388
## log(logTransformationDelta + GPWCOMM) 0.338 0.736
## CASH 2.714 0.007
## Tukey test 2.325 0.020
#influencial data or outlier
influenceIndexPlot(fit, id.n=5)
#co linerality
vif(fit) #should be less than 20
## Warning in vif.default(fit): No intercept: vifs may not be sensible.
## log(logTransformationDelta + LONGLOSS)
## 2.643462
## log(logTransformationDelta + SHORTLOSS)
## 2.462335
## GPWPERSONAL
## 2.836691
## log(logTransformationDelta + GPWCOMM)
## 4.612298
## CASH
## 2.844274
library(perturb)
colldiag(Insurance[,-c(1:4)]) #greater than .5
## Condition
## Index Variance Decomposition Proportions
## intercept RBC EXPENSES STAFFWAGE AGENTWAGE LONGLOSS SHORTLOSS
## 1 1.000 0.000 0.003 0.001 0.000 0.000 0.001 0.001
## 2 1.573 0.001 0.001 0.000 0.000 0.000 0.001 0.001
## 3 3.108 0.000 0.102 0.001 0.000 0.000 0.015 0.003
## 4 5.370 0.000 0.452 0.002 0.000 0.000 0.005 0.032
## 5 5.770 0.000 0.127 0.006 0.000 0.000 0.002 0.033
## 6 9.270 0.000 0.022 0.000 0.000 0.000 0.018 0.539
## 7 10.253 0.000 0.004 0.039 0.000 0.000 0.534 0.318
## 8 14.250 0.000 0.000 0.905 0.001 0.001 0.366 0.050
## 9 16.421 0.001 0.002 0.039 0.047 0.021 0.013 0.000
## 10 30.252 0.781 0.000 0.001 0.235 0.005 0.002 0.001
## 11 47.161 0.198 0.000 0.001 0.661 0.936 0.012 0.005
## 12 58.468 0.019 0.286 0.006 0.056 0.036 0.031 0.017
## GPWPERSONAL GPWCOMM ASSETS CASH LIQUIDRATIO
## 1 0.002 0.002 0.000 0.000 0.000
## 2 0.001 0.001 0.000 0.000 0.002
## 3 0.064 0.005 0.000 0.000 0.000
## 4 0.117 0.087 0.000 0.000 0.000
## 5 0.027 0.090 0.004 0.008 0.000
## 6 0.259 0.476 0.000 0.000 0.002
## 7 0.507 0.008 0.000 0.000 0.000
## 8 0.002 0.175 0.000 0.000 0.020
## 9 0.002 0.053 0.000 0.000 0.727
## 10 0.006 0.001 0.004 0.004 0.184
## 11 0.004 0.000 0.014 0.017 0.007
## 12 0.009 0.102 0.977 0.970 0.058