Statistical Analysis-2 Assignment-1

Goal: Fit a suitable linear regression model that explains the expenses of insurance companies.

A description of the dataset we are going to use is shown below.

[Figure: description of the dataset]

Let's read the dataset first and check its structure and summary.

rm(list=ls())
setwd("D:/ISB/21-StatAnalysis-2/SA-2  Assignment-1 files-20170219")
NAICExpense <- read.csv("NAICExpense.csv")
dim(NAICExpense)

## [1] 384  15

summary(NAICExpense)

##                          COMPANY_NAME     GROUP           MUTUAL      
##  AAA Mid-Atlantic Ins Co       :  1   Min.   :0.000   Min.   :0.0000  
##  Acceptance Ind Ins Co         :  1   1st Qu.:0.000   1st Qu.:0.0000  
##  Accredited Surety & Cas Co Inc:  1   Median :1.000   Median :0.0000  
##  Ace Ins Co                    :  1   Mean   :0.612   Mean   :0.1875  
##  Admiral Ind Co                :  1   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Adriatic Ins Co               :  1   Max.   :1.000   Max.   :1.0000  
##  (Other)                       :378                                   
##      STOCK             RBC               EXPENSES           STAFFWAGE     
##  Min.   :0.0000   Min.   :0.000e+00   Min.   :-0.002038   Min.   : 51.73  
##  1st Qu.:0.0000   1st Qu.:6.257e+08   1st Qu.: 0.001584   1st Qu.: 80.06  
##  Median :1.0000   Median :2.753e+09   Median : 0.008504   Median : 84.38  
##  Mean   :0.6823   Mean   :2.247e+10   Mean   : 0.043190   Mean   : 87.18  
##  3rd Qu.:1.0000   3rd Qu.:1.118e+10   3rd Qu.: 0.029826   3rd Qu.: 93.82  
##  Max.   :1.0000   Max.   :8.388e+11   Max.   : 1.236946   Max.   :137.48  
##                                                                           
##    AGENTWAGE         LONGLOSS           SHORTLOSS         
##  Min.   : 47.47   Min.   :-0.070623   Min.   :-0.0031685  
##  1st Qu.: 74.81   1st Qu.: 0.000000   1st Qu.: 0.0002369  
##  Median : 78.77   Median : 0.001784   Median : 0.0040240  
##  Mean   : 80.15   Mean   : 0.024926   Mean   : 0.0373586  
##  3rd Qu.: 85.44   3rd Qu.: 0.011280   3rd Qu.: 0.0217943  
##  Max.   :126.17   Max.   : 0.853915   Max.   : 1.1710587  
##  NA's   :19                                               
##   GPWPERSONAL            GPWCOMM              ASSETS        
##  Min.   :-0.0037514   Min.   :-0.000648   Min.   :0.000321  
##  1st Qu.: 0.0000000   1st Qu.: 0.003838   1st Qu.:0.012758  
##  Median : 0.0003125   Median : 0.023807   Median :0.056746  
##  Mean   : 0.0531127   Mean   : 0.122657   Mean   :0.356543  
##  3rd Qu.: 0.0272581   3rd Qu.: 0.086440   3rd Qu.:0.197437  
##  Max.   : 1.8224858   Max.   : 4.189401   Max.   :8.705380  
##                                                             
##       CASH           LIQUIDRATIO     
##  Min.   :0.000018   Min.   :  1.788  
##  1st Qu.:0.011377   1st Qu.: 87.403  
##  Median :0.050469   Median : 96.027  
##  Mean   :0.332871   Mean   : 92.597  
##  3rd Qu.:0.184971   3rd Qu.:103.861  
##  Max.   :8.823477   Max.   :127.858  
##

colnames(NAICExpense)

##  [1] "COMPANY_NAME" "GROUP"        "MUTUAL"       "STOCK"       
##  [5] "RBC"          "EXPENSES"     "STAFFWAGE"    "AGENTWAGE"   
##  [9] "LONGLOSS"     "SHORTLOSS"    "GPWPERSONAL"  "GPWCOMM"     
## [13] "ASSETS"       "CASH"         "LIQUIDRATIO"

A sample of the dataset looks like this:

head(NAICExpense)

##                         COMPANY_NAME GROUP MUTUAL STOCK        RBC
## 1           Tift Area Captive Ins Co     0      0     1  228184000
## 2 Alliance Of Nonprofits For Ins RRG     0      0     0 1627708000
## 3   GA Timber Harvesters Mut Captive     0      1     0  422907000
## 4        American Natl Lloyds Ins Co     1      0     0  652906000
## 5                  Chubb Natl Ins Co     1      0     1 8124624000
## 6          Harleysville Ins Co of OH     1      0     1 1441725000
##       EXPENSES STAFFWAGE AGENTWAGE     LONGLOSS   SHORTLOSS GPWPERSONAL
## 1 0.0008019802  84.40508  77.46100 0.0001873308 0.000000000 0.000000000
## 2 0.0044878635  81.56754  84.87802 0.0027822909 0.000000000 0.000000000
## 3 0.0019045075  84.40508  77.46100 0.0010121463 0.001329539 0.000000000
## 4 0.0022909382  82.49788  75.71071 0.0000000000 0.002979557 0.029545038
## 5 0.0182956574  79.26495  78.24790 0.0107939577 0.011777314 0.040614120
## 6 0.0049830133  84.35856  77.41831 0.0019387476 0.003797946 0.001448431
##       GPWCOMM      ASSETS        CASH LIQUIDRATIO
## 1 0.001375438 0.002949942 0.003258406   110.45661
## 2 0.012272512 0.022170349 0.019760347    89.12961
## 3 0.005028351 0.004617343 0.003499702    75.79472
## 4 0.001986159 0.043719914 0.040934885    93.62984
## 5 0.058094479 0.144773034 0.138424153    95.61460
## 6 0.013375740 0.029204831 0.029965958   102.60617

There are 384 records and 15 variables in the dataset.

COMPANY_NAME is a categorical variable that identifies each company. It can be excluded from our regression analysis since it is not going to add any explanatory value.

The regressors GROUP, MUTUAL and STOCK are also categorical variables, indicating whether the company is affiliated with a group, is a mutual company, or is a stock company, respectively. They are coded as dummy variables, with ‘1’ representing ‘Yes’ and ‘0’ representing ‘No’. All the remaining variables are quantitative and are suitable for regression. It can also be seen that some variables are expressed in millions and others in thousands.

So, first let's bring the data onto one unit of measurement. Here I convert the variables in millions into thousands.

options("scipen"=100, "digits"=4)
Expenses_T <- NAICExpense$EXPENSES*1000  # Multiplying the variable in Million with 1000 for conversion into thousands.
LONGLOSS_T <- NAICExpense$LONGLOSS*1000
SHORTLOSS_T <- NAICExpense$SHORTLOSS*1000
GPWPERSONAL_T <- NAICExpense$GPWPERSONAL*1000
GPWCOMM_T <- NAICExpense$GPWCOMM*1000
ASSETS_T <- NAICExpense$ASSETS*1000
CASH_T <- NAICExpense$CASH*1000

NAICExpense.scaled <- cbind.data.frame(NAICExpense,Expenses_T,LONGLOSS_T,SHORTLOSS_T,GPWPERSONAL_T,GPWCOMM_T,ASSETS_T,CASH_T)
NAICExpense.scaled <- NAICExpense.scaled[,-c(6,9,10,11,12,13,14)]  #removing the original columns in Million's units
colnames(NAICExpense.scaled) <-c("COMPANY_NAME","GROUP","MUTUAL","STOCK","RBC","STAFFWAGE","AGENTWAGE","LIQUIDRATIO","EXPENSES"
                                 ,"LONGLOSS","SHORTLOSS","GPWPERSONAL","GPWCOMM","ASSETS","CASH") # reassigning original column names
rm(Expenses_T,LONGLOSS_T,SHORTLOSS_T,GPWPERSONAL_T,GPWCOMM_T,ASSETS_T,CASH_T) # Removing the variables

From now on, we treat the scaled dataset as our dataset for all further processing. The scaling is done, so we assign the scaled data back to the original dataset name.

NAICExpense <- NAICExpense.scaled
dim(NAICExpense)

## [1] 384  15

From the summary statistics, it can be seen that there are 19 missing (NA) values in AGENTWAGE. For modelling purposes, we need to compare the model's performance when the missing values are left in the dataset (lm() then drops those rows) with its performance when the missing values are filled with an appropriate value. In this case I fill the missing values with the mean of AGENTWAGE. (Note: there are many other ways to handle missing values.)

Regression is performed on both datasets and the R² values of the two models are compared. The model with the higher R² is chosen for further processing.
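The code that builds the mean-imputed dataset is not shown in this write-up. The following is a minimal sketch of one way it could be done, using a small made-up data frame. The names `toy`, `toy.without.na` and `agentwage.mean` are illustrative assumptions; in the analysis the same steps would be applied to NAICExpense to obtain NAICExpense.without.na.

```r
# Toy data frame standing in for NAICExpense (values are made up)
toy <- data.frame(AGENTWAGE = c(70, NA, 80, 90, NA),
                  EXPENSES  = c(1.2, 0.8, 2.5, 3.1, 0.4))

# Copy the data and replace missing AGENTWAGE values with the observed mean
toy.without.na <- toy
agentwage.mean <- mean(toy$AGENTWAGE, na.rm = TRUE)
toy.without.na$AGENTWAGE[is.na(toy.without.na$AGENTWAGE)] <- agentwage.mean

sum(is.na(toy$AGENTWAGE))             # missing values before imputation
sum(is.na(toy.without.na$AGENTWAGE))  # missing values after imputation
```

Mean imputation keeps all rows, but it also shrinks the variance of the imputed variable, which is one reason the two fits are compared rather than assuming imputation is better.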

Performing Regression

In the regression analysis, we consider ‘EXPENSES’ as the dependent (response) variable and the rest of the variables as independent variables (regressors).

Model-1

Model1, shown below, is the linear regression for the dataset with missing values, which lm() handles by dropping the 19 incomplete rows. From the summary below we can see that the value of R² for this model is 0.9439.

model1 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)+RBC+STAFFWAGE+AGENTWAGE
             +LONGLOSS+SHORTLOSS+GPWPERSONAL+GPWCOMM+ASSETS+CASH+LIQUIDRATIO,data = NAICExpense)
summary(model1)

## 
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) + 
##     RBC + STAFFWAGE + AGENTWAGE + LONGLOSS + SHORTLOSS + GPWPERSONAL + 
##     GPWCOMM + ASSETS + CASH + LIQUIDRATIO, data = NAICExpense)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -214.11   -4.66   -0.63    3.90  154.12 
## 
## Coefficients:
##                         Estimate       Std. Error t value
## (Intercept)     -3.0231912290429 16.4684357857614   -0.18
## factor(GROUP)1   1.1739392480503  3.6773445483426    0.32
## factor(MUTUAL)1 -2.3021450659916  5.7366191295391   -0.40
## factor(STOCK)1  -6.3113015800784  5.1025200248204   -1.24
## RBC              0.0000000000688  0.0000000000408    1.69
## STAFFWAGE       -0.0227221149306  0.2035984701034   -0.11
## AGENTWAGE        0.0487542702114  0.2652767363692    0.18
## LONGLOSS         0.5193325025076  0.0566007789295    9.18
## SHORTLOSS        0.2197962318176  0.0356373644951    6.17
## GPWPERSONAL      0.0738411002176  0.0209656743521    3.52
## GPWCOMM          0.1165232270759  0.0122443344866    9.52
## ASSETS          -0.0292934039757  0.0227998127293   -1.28
## CASH             0.0366885277701  0.0208901903107    1.76
## LIQUIDRATIO      0.0656455080518  0.0983505809014    0.67
##                             Pr(>|t|)    
## (Intercept)                  0.85445    
## factor(GROUP)1               0.74974    
## factor(MUTUAL)1              0.68844    
## factor(STOCK)1               0.21695    
## RBC                          0.09252 .  
## STAFFWAGE                    0.91120    
## AGENTWAGE                    0.85429    
## LONGLOSS        < 0.0000000000000002 ***
## SHORTLOSS               0.0000000019 ***
## GPWPERSONAL                  0.00048 ***
## GPWCOMM         < 0.0000000000000002 ***
## ASSETS                       0.19971    
## CASH                         0.07992 .  
## LIQUIDRATIO                  0.50491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.9 on 351 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.944,  Adjusted R-squared:  0.942 
## F-statistic:  454 on 13 and 351 DF,  p-value: <0.0000000000000002

Model-2

Model2 is the linear regression for the dataset without missing values, i.e. with the missing AGENTWAGE values filled with the mean of AGENTWAGE. From the summary below we can see that the value of R² for this model is 0.9402.

model2 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)+RBC+STAFFWAGE+AGENTWAGE
             +LONGLOSS+SHORTLOSS+GPWPERSONAL+GPWCOMM+ASSETS+CASH+LIQUIDRATIO,data = NAICExpense.without.na)
summary(model2)

## 
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) + 
##     RBC + STAFFWAGE + AGENTWAGE + LONGLOSS + SHORTLOSS + GPWPERSONAL + 
##     GPWCOMM + ASSETS + CASH + LIQUIDRATIO, data = NAICExpense.without.na)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -216.90   -4.11   -0.32    4.06  153.07 
## 
## Coefficients:
##                         Estimate       Std. Error t value
## (Intercept)     -5.2274763688410 16.5430452004589   -0.32
## factor(GROUP)1   0.6677889659980  3.6058682714119    0.19
## factor(MUTUAL)1 -2.2902323999803  5.6997299126064   -0.40
## factor(STOCK)1  -6.3419544819168  5.0152034734682   -1.26
## RBC              0.0000000000789  0.0000000000407    1.94
## STAFFWAGE        0.0456947349780  0.1976552920114    0.23
## AGENTWAGE       -0.0106211944562  0.2619536511766   -0.04
## LONGLOSS         0.5085219086773  0.0568438538105    8.95
## SHORTLOSS        0.2278905474506  0.0358009946810    6.37
## GPWPERSONAL      0.0816164932007  0.0210084270342    3.88
## GPWCOMM          0.1162067210338  0.0121012740448    9.60
## ASSETS          -0.0475425540967  0.0223314455785   -2.13
## CASH             0.0538094713724  0.0204131919490    2.64
## LIQUIDRATIO      0.0769264413293  0.0972767947106    0.79
##                             Pr(>|t|)    
## (Intercept)                  0.75219    
## factor(GROUP)1               0.85318    
## factor(MUTUAL)1              0.68805    
## factor(STOCK)1               0.20683    
## RBC                          0.05335 .  
## STAFFWAGE                    0.81730    
## AGENTWAGE                    0.96768    
## LONGLOSS        < 0.0000000000000002 ***
## SHORTLOSS              0.00000000058 ***
## GPWPERSONAL                  0.00012 ***
## GPWCOMM         < 0.0000000000000002 ***
## ASSETS                       0.03392 *  
## CASH                         0.00874 ** 
## LIQUIDRATIO                  0.42957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.1 on 370 degrees of freedom
## Multiple R-squared:  0.94,   Adjusted R-squared:  0.938 
## F-statistic:  447 on 13 and 370 DF,  p-value: <0.0000000000000002

Model-1 Vs Model-2

On comparing the two, R² is higher for Model-1 (0.9439), where the observations with missing values are omitted, than for Model-2 (0.9402), where they are filled with the mean. Hence, we consider only Model-1 in this case.

So, we continue with the NAICExpense dataset, with the incomplete rows removed.

NAICExpense <- na.omit(NAICExpense)

Model-1 Summary Stats

From the summary stats of Model-1, the value of R² shows that 94.39% of the total variation in EXPENSES is explained by the model.
We can also see that the adjusted R² value (0.942) is close to the R² value (0.944), which suggests that R² is not being inflated merely by the number of regressors in the model.
The residual standard error describes the typical deviation of the observed expenses from the predicted values. Here it is 29.9, so we can say that predicted expenses are typically off by about +/- $29.9 thousand.
The F-statistic, with its very small p-value, says that our regression as a whole is highly significant.
Coming to the regression coefficients, we can say that the regressors LONGLOSS, SHORTLOSS, GPWPERSONAL and GPWCOMM are very highly significant, and CASH and RBC are significant at the 10% significance level.
For every thousand-dollar increase in LONGLOSS, expenses increase by about 0.519 thousand dollars, holding the other regressors fixed. The other coefficients can be interpreted the same way.
For the insignificant coefficients, we cannot reject the hypothesis that the true coefficient is zero, i.e. these regressors might have no influence (or very minimal influence) on the expenses.
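As a quick check on the adjusted R² figure, it can be recomputed by hand from the numbers in the Model-1 summary. This is only a sketch of the standard formula; the variable names (`n`, `p`, `r2`, `adj.r2`) are illustrative.

```r
n  <- 365      # observations used by model1 (384 rows minus 19 dropped for NA)
p  <- 13       # regressors in model1, excluding the intercept
r2 <- 0.9439   # multiple R-squared quoted above

# Adjusted R-squared penalises R-squared for the number of regressors
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj.r2, 3)   # 0.942, matching the summary output
```

With 365 observations and only 13 regressors the penalty factor (n - 1) / (n - p - 1) is close to 1, which is why the two values are so close here.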

Model-1 Plot Interpretation

Now let's check the assumptions we made before building the regression model: that the errors (residuals) are random and normally distributed.

plot(model1)

From the Residuals vs Fitted plot above we can see that the errors are randomly scattered and there appears to be no pattern in the distribution of the residuals.
- There is no evident heteroscedasticity in the plot either.
- From the same plot we can see that a few observations have high residual values.
The Normal Q-Q plot shows that there are observations whose residuals are affecting the normality of the error distribution.
The plot of the square root of the standardized residuals against the fitted values is also random.
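The visual checks above can be complemented by formal tests. The sketch below runs on simulated data, not on NAICExpense, and uses only base R: a Shapiro-Wilk test for normality of the residuals, and a hand-computed Koenker-style (studentized) Breusch-Pagan statistic for heteroscedasticity. All object names are illustrative assumptions.

```r
set.seed(5)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # simulated data with constant error variance

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Normality of the residuals: a small p-value indicates departure from normality
sw <- shapiro.test(res)

# Heteroscedasticity: regress the squared residuals on the regressors;
# n * R-squared of this auxiliary regression is compared with a chi-squared
# distribution whose degrees of freedom equal the number of regressors (2 here)
aux     <- lm(I(res^2) ~ x1 + x2)
bp.stat <- n * summary(aux)$r.squared
bp.p    <- pchisq(bp.stat, df = 2, lower.tail = FALSE)

c(shapiro.p = sw$p.value, bp.p = bp.p)
```

On the real model the same calls would take residuals(model1) and the regressors of model1; packaged versions of the second test also exist, for example in the lmtest package.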

Now let's look at the correlations between the regressors and at their scatter plots. The Variance Inflation Factors (VIF) of the Model-1 regression are also calculated below.

library(corrplot)
corrmatrix <-round(cor(NAICExpense[,-c(1,2,3,4)]),2)
corrplot(corrmatrix, method = "number")

pairs(NAICExpense[,-c(1,2,3,4)])

car::vif(model1)

##  factor(GROUP) factor(MUTUAL)  factor(STOCK)            RBC      STAFFWAGE 
##          1.322          2.132          2.351          3.632          2.406 
##      AGENTWAGE       LONGLOSS      SHORTLOSS    GPWPERSONAL        GPWCOMM 
##          2.380          9.471          7.826          5.660          6.378 
##         ASSETS           CASH    LIQUIDRATIO 
##        225.106        172.636          1.095

The correlation matrix of the regressors shows a strong correlation of 0.99 between ASSETS and CASH. This is clearly against the assumption that our regressors are not linearly dependent on each other. There is also a strong relationship of LONGLOSS with SHORTLOSS and GPWPERSONAL.

The same is shown by the VIF scores. The scores for the two regressors ASSETS and CASH are 225.105648 and 172.636176 respectively, which are far above the commonly used thresholds (such as 5 or 10). So, there is collinearity between these variables.

We need to try different regression models and choose the variables that add the most value. This can be done by trial and error or through domain knowledge.

We should also check the collinearity diagnostics (condition indices and variance decomposition proportions) for the given dataset.
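To make the VIF scores less of a black box, here is a minimal sketch of how a VIF can be computed by hand for numeric regressors: regress each regressor on all the others and take 1 / (1 - R²). The data is simulated (x2 is built to be nearly a copy of x1), and the names `X` and `vif.by.hand` are illustrative assumptions, not part of the analysis above.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest
vif.by.hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  aux    <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})
round(vif.by.hand, 2)
```

The collinear pair gets very large values while the unrelated regressor stays near 1, which is the same picture as ASSETS and CASH against LIQUIDRATIO in the output above.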

perturb::colldiag(NAICExpense[,-c(1,2,3,4)])

## Condition
## Index    Variance Decomposition Proportions
##            intercept RBC   STAFFWAGE AGENTWAGE LIQUIDRATIO EXPENSES
## 1    1.000 0.000     0.003 0.000     0.000     0.000       0.001   
## 2    1.568 0.001     0.001 0.000     0.000     0.002       0.000   
## 3    3.101 0.000     0.100 0.000     0.000     0.000       0.001   
## 4    5.327 0.000     0.441 0.000     0.000     0.000       0.003   
## 5    5.763 0.000     0.140 0.000     0.000     0.000       0.007   
## 6    9.211 0.000     0.020 0.000     0.000     0.002       0.001   
## 7   10.135 0.000     0.006 0.000     0.000     0.000       0.041   
## 8   13.817 0.000     0.000 0.001     0.000     0.013       0.909   
## 9   16.491 0.001     0.002 0.049     0.023     0.721       0.027   
## 10  30.556 0.776     0.001 0.248     0.004     0.194       0.002   
## 11  46.326 0.202     0.000 0.647     0.937     0.011       0.000   
## 12  58.348 0.021     0.287 0.055     0.034     0.058       0.008   
##    LONGLOSS SHORTLOSS GPWPERSONAL GPWCOMM ASSETS CASH 
## 1  0.001    0.001     0.002       0.002   0.000  0.000
## 2  0.001    0.001     0.001       0.001   0.000  0.000
## 3  0.016    0.003     0.066       0.005   0.000  0.000
## 4  0.006    0.034     0.116       0.088   0.000  0.000
## 5  0.002    0.031     0.026       0.083   0.004  0.008
## 6  0.029    0.515     0.292       0.461   0.000  0.000
## 7  0.534    0.353     0.478       0.003   0.000  0.000
## 8  0.353    0.039     0.002       0.205   0.000  0.000
## 9  0.009    0.000     0.002       0.046   0.000  0.000
## 10 0.000    0.001     0.004       0.001   0.005  0.004
## 11 0.014    0.003     0.004       0.000   0.012  0.015
## 12 0.033    0.018     0.008       0.104   0.978  0.972

From the matrix, for condition index 30.6 we can see there is no set of two or more variables with high variance proportions. For index 46.3 we can see that STAFFWAGE and AGENTWAGE form a collinear set, and for index 58.3 we can see that ASSETS and CASH are collinear with each other.
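The condition indices and variance decomposition proportions in such a table can be reproduced with a singular value decomposition. The sketch below follows the usual Belsley-style recipe on simulated data: add an intercept column, scale every column to unit length, and decompose. The object names (`X.scaled`, `cond.index`, `var.prop`) are illustrative assumptions, and the steps may differ in detail from what colldiag does internally.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n, mean = 5)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(n, mean = 5)
X  <- cbind(intercept = 1, x1, x2, x3)

# Scale each column to unit length (no centering), then decompose
X.scaled <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv       <- svd(X.scaled)

# Condition index of each component: largest singular value over that one
cond.index <- sv$d[1] / sv$d

# Variance decomposition proportions: rows = components, columns = variables
phi      <- sweep(sv$v^2, 2, sv$d^2, "/")
var.prop <- t(sweep(phi, 1, rowSums(phi), "/"))
colnames(var.prop) <- colnames(X)

round(cbind(cond.index, var.prop), 3)
```

A collinear set shows up as a row with a large condition index in which two or more variables have proportions close to 1, here x1 and x2.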

We should also check which observations are influential in our model, so that we can remove those observations and re-run the model to check whether they significantly affect our beta values, R² value and residual standard error.

car::influenceIndexPlot(model1,id.n=5)

We can see that the plot shows four types of measures of how influential each observation is on our model. We choose Cook's distance to determine this. In the Cook's distance panel we can see that there are two observations that are influencing our model.

So, we need to fit our model both with and without these records and choose the better one.
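The with/without comparison can be sketched as follows on simulated data with one planted influential point. Base R's cooks.distance() is used, and the 4/n cut-off is a common rule of thumb rather than something taken from the analysis above. All object names are illustrative assumptions.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
x[1] <- 8
y[1] <- -20                      # planted high-leverage, badly fitted point
dat <- data.frame(x, y)

fit.all <- lm(y ~ x, data = dat)
cd      <- cooks.distance(fit.all)
flagged <- which(cd > 4 / n)     # rule-of-thumb threshold (an assumption)

# Refit without the flagged rows and compare coefficients and fit quality
keep     <- setdiff(seq_len(n), flagged)
fit.trim <- lm(y ~ x, data = dat[keep, ])

rbind(all = coef(fit.all), trimmed = coef(fit.trim))
c(all = summary(fit.all)$sigma, trimmed = summary(fit.trim)$sigma)
```

If the coefficients or the residual standard error move a lot between the two fits, the flagged records deserve a closer look before deciding whether to keep them.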

Finally, we check the residual plots (each independent variable vs the residuals) and see whether there is any pattern in them. Ideally these should be randomly scattered points with no pattern. If we find a pattern, we need to transform the corresponding variables accordingly and re-run our model.

library(car)
residualPlots(model1)

##                Test stat Pr(>|t|)
## factor(GROUP)         NA       NA
## factor(MUTUAL)        NA       NA
## factor(STOCK)         NA       NA
## RBC               -3.704    0.000
## STAFFWAGE         -0.079    0.937
## AGENTWAGE          0.282    0.778
## LONGLOSS          -0.241    0.810
## SHORTLOSS         -1.415    0.158
## GPWPERSONAL       -3.081    0.002
## GPWCOMM            5.096    0.000
## ASSETS             2.363    0.019
## CASH               1.144    0.253
## LIQUIDRATIO       -0.036    0.971
## Tukey test         2.958    0.003

From the plots we can observe the following:
- RBC: a logarithmic trend.
- GPWPERSONAL: a logarithmic trend.
- GPWCOMM: a quadratic trend, and also outlier influence.
- ASSETS: a logarithmic trend.
- CASH: outlier influence.

Model-3

From the above findings we try to fit a new model, which will be our improved model.

Before modelling we need to transform our variables. Here we add squared terms for several of the regressors.

NAICExpense$GPWCOMM.sq <- NAICExpense$GPWCOMM*NAICExpense$GPWCOMM
NAICExpense$LONGLOSS.sq <- NAICExpense$LONGLOSS*NAICExpense$LONGLOSS
NAICExpense$GPWPERSONAL.sq <- NAICExpense$GPWPERSONAL*NAICExpense$GPWPERSONAL
NAICExpense$CASH.sq <- NAICExpense$CASH*NAICExpense$CASH
NAICExpense$RBC.sq <- NAICExpense$RBC*NAICExpense$RBC
NAICExpense$SHORTLOSS.sq <- NAICExpense$SHORTLOSS*NAICExpense$SHORTLOSS
#shortloss_Trans <- log(max(SHORTLOSS)+1-SHORTLOSS)
#expenses_trans <- log(max(EXPENSES)+1-EXPENSES)

Below is the improved model. From the summary we can see that R² has improved (0.96, against 0.944 for Model-1) and the residual standard error has dropped (25.3, against 29.9).

model3 <- lm(EXPENSES~factor(GROUP)+factor(MUTUAL)+factor(STOCK)
             +RBC + RBC.sq
             +STAFFWAGE
             #+AGENTWAGE
             +LONGLOSS + LONGLOSS.sq
             +SHORTLOSS + SHORTLOSS.sq
             +GPWPERSONAL + GPWPERSONAL.sq
             +GPWCOMM + GPWCOMM.sq
             +LIQUIDRATIO
             #+ASSETS
             +CASH
             ,data = NAICExpense)
summary(model3)

## 
## Call:
## lm(formula = EXPENSES ~ factor(GROUP) + factor(MUTUAL) + factor(STOCK) + 
##     RBC + RBC.sq + STAFFWAGE + LONGLOSS + LONGLOSS.sq + SHORTLOSS + 
##     SHORTLOSS.sq + GPWPERSONAL + GPWPERSONAL.sq + GPWCOMM + GPWCOMM.sq + 
##     LIQUIDRATIO + CASH, data = NAICExpense)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -186.97   -4.22   -1.56    3.93  162.99 
## 
## Coefficients:
##                                    Estimate                  Std. Error
## (Intercept)     -8.077566103859192381264620 12.227883788295828892955797
## factor(GROUP)1   1.755551017233113242355103  3.205056062105439806941831
## factor(MUTUAL)1 -2.728132368771277960206589  4.896473451366778029125726
## factor(STOCK)1  -4.864704259683332310260084  4.334709606787880531442170
## RBC              0.000000000262355087394743  0.000000000077767682501707
## RBC.sq          -0.000000000000000000000384  0.000000000000000000000103
## STAFFWAGE        0.062462620771259878826864  0.113340288229061586511293
## LONGLOSS         0.279881133704393991745718  0.065289743205929895442097
## LONGLOSS.sq      0.001276330505868971927977  0.000182361457267686143027
## SHORTLOSS        0.375190170633010577905253  0.050311532334097602836565
## SHORTLOSS.sq    -0.000299309776714503793411  0.000046449004813322207824
## GPWPERSONAL      0.205106025705477035270263  0.026847708744324672719417
## GPWPERSONAL.sq  -0.000258322668778284065313  0.000037621709517332362852
## GPWCOMM          0.015890487524592763340925  0.015637653484269303100218
## GPWCOMM.sq       0.000025814228281433666996  0.000003760776703845596315
## LIQUIDRATIO      0.079730214985546263295468  0.081704812403278787025229
## CASH             0.015008375642211009212690  0.003020187278405969850958
##                 t value         Pr(>|t|)    
## (Intercept)       -0.66          0.50932    
## factor(GROUP)1     0.55          0.58422    
## factor(MUTUAL)1   -0.56          0.57777    
## factor(STOCK)1    -1.12          0.26252    
## RBC                3.37          0.00083 ***
## RBC.sq            -3.74          0.00022 ***
## STAFFWAGE          0.55          0.58191    
## LONGLOSS           4.29 0.00002350439422 ***
## LONGLOSS.sq        7.00 0.00000000001333 ***
## SHORTLOSS          7.46 0.00000000000071 ***
## SHORTLOSS.sq      -6.44 0.00000000038828 ***
## GPWPERSONAL        7.64 0.00000000000021 ***
## GPWPERSONAL.sq    -6.87 0.00000000003039 ***
## GPWCOMM            1.02          0.31026    
## GPWCOMM.sq         6.86 0.00000000003081 ***
## LIQUIDRATIO        0.98          0.32983    
## CASH               4.97 0.00000105580206 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.3 on 348 degrees of freedom
## Multiple R-squared:  0.96,   Adjusted R-squared:  0.958 
## F-statistic:  523 on 16 and 348 DF,  p-value: <0.0000000000000002
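Since R² never decreases when terms are added, it can help to compare models on adjusted R², residual standard error and AIC as well. The sketch below does this on simulated data with a genuinely quadratic relationship. It also shows that a squared term can be written inline as I(x^2) instead of creating a separate column. All object names are illustrative assumptions.

```r
set.seed(6)
n   <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2 * dat$x + 0.5 * dat$x^2 + rnorm(n)   # true relation is quadratic

fit.lin  <- lm(y ~ x, data = dat)
fit.quad <- lm(y ~ x + I(x^2), data = dat)          # squared term written inline

# Collect the comparison measures for both fits (same data, so AIC is comparable)
compare <- data.frame(
  model  = c("linear", "quadratic"),
  adj.r2 = c(summary(fit.lin)$adj.r.squared, summary(fit.quad)$adj.r.squared),
  sigma  = c(summary(fit.lin)$sigma, summary(fit.quad)$sigma),
  aic    = c(AIC(fit.lin), AIC(fit.quad))
)
compare
```

Applied to the fits above, the same table could be built from model1 and model3, since both are estimated on the same 365 complete rows.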

Conclusion

The analysis is not over. We fitted the best model we could within the constraints of this particular dataset. Many other transformations could be applied that might improve the model further. Going forward, one can try to get data on other variables that might have an impact on the predictability of expenses. On the other hand, we should not include variables only to achieve a high R², i.e. we must not overfit our model.
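One way to guard against overfitting is to judge models by their out-of-sample error rather than by R² alone. The following is a minimal sketch of k-fold cross-validation in base R on simulated data. The helper `cv.rmse` and the other names are hypothetical, and this step is not part of the analysis above.

```r
set.seed(4)
n   <- 120
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 2)   # true relation is linear

k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold assignment

# Cross-validated root mean squared prediction error for a given formula
cv.rmse <- function(formula) {
  sq.err <- numeric(n)
  for (i in seq_len(k)) {
    test <- fold == i
    fit  <- lm(formula, data = dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ])
    sq.err[test] <- (dat$y[test] - pred)^2
  }
  sqrt(mean(sq.err))
}

rmse.simple  <- cv.rmse(y ~ x)
rmse.complex <- cv.rmse(y ~ poly(x, 10))   # deliberately over-flexible model
c(simple = rmse.simple, complex = rmse.complex)
```

A model that only gains in-sample R² from extra terms will typically show no improvement, or a worse error, on the held-out folds.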
