WQD 7004 Group Assignment

Group Members

Project Title

Prediction of Vehicle Insurance Claims Data Based on Non-negative Fat-tailed Distribution

1. Introduction

The methodology proposed in this assignment focuses on predicting policy payout costs based on various objective factors and analyzing the risk level of customer groups to help insurance companies accurately price their policies and thus ensure the proper operation of their businesses.

1.1 Problem Statement

With the rapid development of insurance business, the uncertainty of insurance claims have been deeply plagued by insurance company. Generally speaking, the amount of insurance claims is directly related to the stability of the financial operations of insurance companies, especially the non-life insurance business. When it comes to the modeling of insurance claim data, finding an appropriate distribution to mine information from the data is an important task to help actuaries calculate premiums. Most insurance claims data have several characteristics, such as unimodal, right skewness, and thick right tail. To be specific, the thick-tailed nature of the data means that once extreme events occur, they will bring huge losses to insurance companies. Correspondingly, the tail dependence will also lead to a higher probability of simultaneous claims in different insurance policies. This creates a risk aggregation problem for insurers. In previous studies, it is found that insurance claims data often has the feature of non-negative thick tail, which means that majority of the claim amount tend to come from small number of claims. Therefore, this paper focuses on the regression application of insurance claims data with non-negative thick tail.

1.2 Data Mining Goal

  • To select an algorithm model to predict whether auto insurance claims will occur based on objective factors such as vehicle price, age, driver age, and gender.
  • To select an algorithm model, such as the generalized gamma regression model, because the generalized gamma regression model also has the feature of non-negative thick tail and has good fitting effect.

1.3 Objectives

  • To classify whether auto insurance claims will occur based on the factors such as vehicle price, age, driver age, and gender.
  • To build a model that can predict the amounts of custom’s claim.

2. Dataset

Auto insurance claims data provided by SAS Institute Inc

Data Source: https://github.com/Alisonin/Insurance

Reference:SAS Institute Inc. Solving business problems using SAS enterprise miner software[White Paper]. Cary.

#Loading necessary library
library("readxl")
library("ggplot2")
library("dplyr")
library("corrplot")
library("hexbin")
library('plyr')
library("rmdformats")
library("tidyr")
library("purrr")
library("gridExtra")
library("ggrepel")
library("pastecs")

3. Data Processing

In data processing, there are several tasks to be performed. They are

  • Removing blank information
  • Removing duplicates
  • Removing redundant data
  • Deleting data where no claim occured

The following code blocks have been used to perform the tasks mentioned above.

Reading the Datset

#claimsData = read.csv('claims_2812_33.csv')
url <- 'https://raw.githubusercontent.com/Alisonin/Insurance/main/claims_2812_33.csv'
claimsData <- read.csv(url)

Checking for Missing Data

# Checking missing data in claimsData
missTotal <- apply(is.na(claimsData)|(claimsData==""),2,sum)
missPercent <- apply(is.na(claimsData)|(claimsData==""),2,function(x) sum(x)/length(x))
missingData <- data.frame(missTotal,missPercent)
missingData <- missingData[order(missTotal,decreasing=TRUE),]

# Retain variables with missing data

Retain Variables with Missing Data

missingData[missingData$missTotal>0,]
##          missTotal missPercent
## CLM_DATE      1921  0.68314367
## SAMEHOME       175  0.06223329
## HOME_VAL       168  0.05974395
## YOJ            153  0.05440967

Delete Some Variables with the Same Meaning but Inconsistent Statistical Methods

claimsData = claimsData[, !(names(claimsData) %in% c("ID", "PLCYDATE", "POLICYNO", "INITDATE", "RETAINED", "MVR_PTS", "CLM_DATE", "CLM_FLAG", "BIRTH", "PARENT1", "HOME_VAL", "SAMEHOME", "YEARQTR"))]

Delete Samples with Missing YOJ Variable

claimsData <- claimsData[complete.cases(claimsData[c("YOJ")]),]

Create a New Column Named CLM_JUDGE, if the CLM_AMT Value is 0, the Corresponding CLM_JUDGE value is no, otherwise yes

claimsData$CLM_JUDGE <- ifelse(claimsData$CLM_AMT == 0, "no", "yes")
#claimsData
data0 = claimsData

4. Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is an essential step in the data science process as it allows for the discovery of patterns, relationships, and insights in the data. It is a way to understand the underlying structure of the data, identify any potential issues or outliers, and generate hypotheses for further analysis. EDA also helps in the selection of appropriate statistical models and techniques for the data at hand. Furthermore, it helps in the communication of findings and results to stakeholders and in the development of a deeper understanding of the problem being addressed. Overall, EDA is crucial for ensuring that the data is properly understood and utilized in decision making.

Reading the Data into a Data Frame

df <- claimsData

Combined Boxplot for Explanatory Variables

#colnames(data1)
con.v=c(1,2,4:5,8,9,11:14) 
totaltu=function(data,a,b)
{
  zhibiao=colnames(data)
  xn=ncol(data)
  par(mfrow = c(a,b))   #a行b列个指标
  for(i in 1:xn){
    boxplot(data[,i],xlab = zhibiao[i])
  }
}
totaltu(df[,con.v],2,5)

Combined boxplot, also known as a side-by-side boxplot or a grouped boxplot, is a type of plot used to compare the distribution of multiple variables (or groups) at the same time. It is particularly useful when working with explanatory variables, as it allows for easy comparison of the distribution of the variables across different groups or levels. The combined boxplot for explanatory variables for our dataset is showing the median and interquartile range of the variable, while the whiskers would extend out to the minimum and maximum values.

Matrix Plot

pairs(df[,con.v])

In the matrix plot above each row and column in the matrix represents a variable in the car insurance claim dataset, and the plot at the intersection of a row and column shows the relationship between those two variables. We have used a matrix plot to investigate the relationships between different variables that may affect the cost of a claim, such as the age of the vehicle, the make and model, the location of the accident, and the driver’s demographics such as age and driving history.

Relative Analysis

corr=cor(df[,c(11,con.v)])  #相关系数矩阵

###相关系数图###install.packages("corrplot")
install.packages("corrplot")
## Warning: package 'corrplot' is in use and will not be installed
library(corrplot)
library(RColorBrewer)
myColors=brewer.pal(8,"Set2")[1:8]
corrplot(corr=corr, tl.col="black",col=myColors, title = 'Correlation Graph
')

From the above Correlation Coefficient Analysis Matrix:

  • The correlation coefficient between OLDCLAM and CLM_FERQ indicates a strong correlation between the two.
  • The attribute associated with the dependent variable CLM_AMT is suggesting a very weak correlation.
  • Linear correlation coefficients of HOMEKIDS and KIDSDRIV, CLM_FREQ and OLDCLAIM, and HOMEKIDS and AGE showing some weak correlation.
  • df <- read_xls("data0.xls")
    View(df)

    Characteristics of the Customer’s Personal Information

    Ringdiagram to Show Gender, Married and Jobclass
    # plot 1
    countsgender <- table(df$GENDER)
    View(countsgender)
    
    
    hsize <- 1
    
    dfcountsgender <- as.data.frame(countsgender)
    names(dfcountsgender)[names(dfcountsgender) == "Var1"] <- "GENDER"
    View(dfcountsgender)
    r1 <- ggplot(dfcountsgender, aes(x = hsize, y = Freq, fill = GENDER)) +
      geom_col(color = "black") +
      coord_polar(theta = "y") +
      xlim(c(0.2, hsize + 0.5)) + 
      labs(x = "") + scale_fill_manual(values = myColors)
    
    
    countsmarried <- table(df$MARRIED)
    View(countsmarried )
    
    # plot 2
    
    
    hsize <- 1
    
    dfcountsmarried  <- as.data.frame(countsmarried)
    names(dfcountsmarried)[names(dfcountsmarried) == "Var1"] <- "MARRIED"
    View(dfcountsmarried)
    r2 <- ggplot(dfcountsmarried, aes(x = hsize, y = Freq, fill = MARRIED)) +
      geom_col(color = "black") +
      coord_polar(theta = "y") +
      xlim(c(0.2, hsize + 0.5)) + 
      labs(x = "")
    
    countsjobclass <- table(df$JOBCLASS)
    View(countsjobclass)
    
    # plot 3
    
    
    hsize <- 1
    
    dfcountsjobclass <- as.data.frame(countsjobclass)
    names(dfcountsjobclass)[names(dfcountsjobclass) == "Var1"] <- "Job_Class"
    View(dfcountsjobclass)
    r3 <- ggplot(dfcountsjobclass, aes(x = hsize, y = Freq, fill = Job_Class)) +
      geom_col(color = "black") +
      coord_polar(theta = "y") +
      xlim(c(0.2, hsize + 0.5)) + 
      labs(x = "") 
    #knitr::include_graphics("C://Users/Turjo/Documents/r4.PNG")
    
    grid.arrange(r1, r2, r3, ncol = 3)

    From the characteristic of the customer’s personal information ring diagrams above:

    • The circle graph above makes it evident that the insured are more evenly distributed by gender and marital status.
    • The Blue collar has a higher number of professional people.

    Violin Plot to Show the Distribution of Age, Income, Homekids

    ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")

    The above Violin Plot displays the probability density of AGE. The age is concentrated between 30 to 50.

    ggplot(df, aes(x=INCOME, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")

    The above Violin Plot displays the probability density of INCOME.

    ggplot(df, aes(x=HOMEKIDS, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")

    The above Violin Plot displays the probability density of HOMEKIDS. Which is concentrated between 2 to 3.

    Bar Plot of YOJ in Different Gender

    ggplot(df, aes(x=YOJ, fill=GENDER)) +
      geom_bar(position="dodge") +
      labs(x="YOJ", y="Count") +
      ggtitle("Count of Claim by YOJ and Gender")

    The bar plot is showing the count of claim YOJ by gender. As we can see for YOJ 0, the number of females is over 60.

    Scatter Plot to Display the Relationship Between Age Amount Claim

    ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_point(color="#9f2042")

    Each point on the scatter plot represents a single data point, with the x-coordinate being the age of the individual and the y-coordinate being the amount of the claim. By plotting all the data points on the scatter plot, we can visually observe the relationship between age and claim amount.

    Hexbin Plot to Display the Relation Between Age and Claim amount

    ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_hex()

    Each hexagon on the plot represents a group of data points with similar x and y values. The color of the hexagon represents the density of data points in that bin, with darker colors indicating a higher density of data points.

    Bar Plot to Display Percentage of Education

    education_percent <- df %>% group_by(MAX_EDUC) %>% dplyr::summarize(n = n()) %>% mutate(percent = n / sum(n))
    ggplot(education_percent, aes(x = MAX_EDUC, y = percent)) +
      geom_bar(stat = "identity",fill="#9f2042",color="#428ac2") +
      ylab("Percentage")

    From the bar plot we can see, High School has the most percentages for MAX_EDUC.

    Histogram of the Amount Claimed

    #Distribution at the time of claim
    data1 <- data.frame(claimsData)
    data1=data1[data1$CLM_AMT>0,]
    #Histogram of the amount claimed
    par(pin = c(6,3))
    hist(data1$CLM_AMT, freq=F, breaks=66,bty= "o", col=0,font.lab=2,
         main="Histogram of CLM_AMT", font=2,xlab="CLM_AMT")
    clsd=sd(data1$CLM_AMT)
    clmean=mean(data1$CLM_AMT)
    xnorm=seq(0,60000,length.out=10000)  
    lines(xnorm,dnorm(xnorm,mean = clmean,sd = clsd),lty=1,lwd=2)
    legend("topright",legend=paste("Normal distribution"),lwd='2', lty=1,cex=1) 
    box() 

    EDA Summary

    summary(data1) 
    ##     KIDSDRIV         TRAVTIME       CAR_USE             BLUEBOOK    
    ##  Min.   :0.0000   Min.   : 5.00   Length:839         Min.   : 1500  
    ##  1st Qu.:0.0000   1st Qu.:25.00   Class :character   1st Qu.: 7310  
    ##  Median :0.0000   Median :35.00   Mode  :character   Median :11890  
    ##  Mean   :0.2825   Mean   :35.08                      Mean   :13661  
    ##  3rd Qu.:0.0000   3rd Qu.:45.00                      3rd Qu.:18225  
    ##  Max.   :3.0000   Max.   :88.00                      Max.   :41440  
    ##     NPOLICY        CAR_TYPE           RED_CAR             OLDCLAIM    
    ##  Min.   :1.000   Length:839         Length:839         Min.   :    0  
    ##  1st Qu.:1.000   Class :character   Class :character   1st Qu.:    0  
    ##  Median :1.000   Mode  :character   Mode  :character   Median : 2712  
    ##  Mean   :1.627                                         Mean   : 6038  
    ##  3rd Qu.:2.000                                         3rd Qu.: 6739  
    ##  Max.   :5.000                                         Max.   :57037  
    ##     CLM_FREQ       REVOLKED            CLM_AMT           AGE       
    ##  Min.   :0.000   Length:839         Min.   :  295   Min.   :16.00  
    ##  1st Qu.:0.000   Class :character   1st Qu.: 2706   1st Qu.:37.00  
    ##  Median :1.000   Mode  :character   Median : 4102   Median :43.00  
    ##  Mean   :1.225                      Mean   : 5561   Mean   :43.26  
    ##  3rd Qu.:2.000                      3rd Qu.: 5644   3rd Qu.:49.50  
    ##  Max.   :5.000                      Max.   :61678   Max.   :67.00  
    ##     HOMEKIDS           YOJ            INCOME          GENDER         
    ##  Min.   :0.0000   Min.   : 0.00   Min.   :     0   Length:839        
    ##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 18857   Class :character  
    ##  Median :0.0000   Median :11.00   Median : 42687   Mode  :character  
    ##  Mean   :0.9976   Mean   :10.01   Mean   : 47235                     
    ##  3rd Qu.:2.0000   3rd Qu.:13.00   3rd Qu.: 67360                     
    ##  Max.   :5.0000   Max.   :18.00   Max.   :273507                     
    ##    MARRIED            JOBCLASS           MAX_EDUC           DENSITY         
    ##  Length:839         Length:839         Length:839         Length:839        
    ##  Class :character   Class :character   Class :character   Class :character  
    ##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
    ##                                                                             
    ##                                                                             
    ##                                                                             
    ##   CLM_JUDGE        
    ##  Length:839        
    ##  Class :character  
    ##  Mode  :character  
    ##                    
    ##                    
    ## 
    ydes=stat.desc(data1[,'CLM_AMT'],basic = T ,desc = T, norm=TRUE, p=0.95 )
    t(as.data.frame(ydes))
    ##      nbr.val nbr.null nbr.na min   max range     sum median     mean  SE.mean
    ## ydes     839        0      0 295 61678 61383 4665938   4102 5561.309 237.8077
    ##      CI.mean.0.95      var  std.dev coef.var skewness skew.2SE kurtosis
    ## ydes     466.7687 47447553 6888.218 1.238597 4.723786 27.97951 27.38481
    ##      kurt.2SE normtest.W   normtest.p
    ## ydes 81.19742  0.5000895 2.469903e-43

    Descriptive Statistics of Explanatory Variables

    #解释变量描述性统计(Descriptive statistics of explanatory variables)
    con.v=c(1,2,4:5,8,9,12:14)   #连续性变量的列号(Column number of the continuity variable)
    stat.desc(data1[,con.v],basic = T ,desc = T )
    ##                  KIDSDRIV     TRAVTIME     BLUEBOOK      NPOLICY     OLDCLAIM
    ## nbr.val      839.00000000 8.390000e+02 8.390000e+02 8.390000e+02 8.390000e+02
    ## nbr.null     678.00000000 0.000000e+00 0.000000e+00 0.000000e+00 3.440000e+02
    ## nbr.na         0.00000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
    ## min            0.00000000 5.000000e+00 1.500000e+03 1.000000e+00 0.000000e+00
    ## max            3.00000000 8.800000e+01 4.144000e+04 5.000000e+00 5.703700e+04
    ## range          3.00000000 8.300000e+01 3.994000e+04 4.000000e+00 5.703700e+04
    ## sum          237.00000000 2.943200e+04 1.146198e+07 1.365000e+03 5.065494e+06
    ## median         0.00000000 3.500000e+01 1.189000e+04 1.000000e+00 2.712000e+03
    ## mean           0.28247914 3.507986e+01 1.366148e+04 1.626937e+00 6.037538e+03
    ## SE.mean        0.02250370 5.216226e-01 2.702293e+02 2.847885e-02 3.470596e+02
    ## CI.mean.0.95   0.04417024 1.023840e+00 5.304059e+02 5.589826e-02 6.812082e+02
    ## var            0.42488358 2.282836e+02 6.126705e+07 6.804669e-01 1.010579e+08
    ## std.dev        0.65183095 1.510906e+01 7.827327e+03 8.249042e-01 1.005275e+04
    ## coef.var       2.30753656 4.307046e-01 5.729488e-01 5.070290e-01 1.665042e+00
    ##                  CLM_FREQ          AGE     HOMEKIDS          YOJ
    ## nbr.val      8.390000e+02 8.390000e+02 839.00000000  839.0000000
    ## nbr.null     3.440000e+02 0.000000e+00 447.00000000  108.0000000
    ## nbr.na       0.000000e+00 0.000000e+00   0.00000000    0.0000000
    ## min          0.000000e+00 1.600000e+01   0.00000000    0.0000000
    ## max          5.000000e+00 6.700000e+01   5.00000000   18.0000000
    ## range        5.000000e+00 5.100000e+01   5.00000000   18.0000000
    ## sum          1.028000e+03 3.629300e+04 837.00000000 8397.0000000
    ## median       1.000000e+00 4.300000e+01   0.00000000   11.0000000
    ## mean         1.225268e+00 4.325745e+01   0.99761621   10.0083433
    ## SE.mean      4.300704e-02 3.201217e-01   0.04240038    0.1593767
    ## CI.mean.0.95 8.441417e-02 6.283346e-01   0.08322341    0.3128244
    ## var          1.551819e+00 8.597899e+01   1.50834753   21.3113862
    ## std.dev      1.245720e+00 9.272485e+00   1.22814801    4.6164257
    ## coef.var     1.016692e+00 2.143558e-01   1.23108265    0.4612577
    #Observe the correlation between variables
    #corr=cor(data1[,c(11,con.v)])  #相关系数矩阵
    #corr

    5. Modeling

    Viewing Duplicate Data

    print('Viewing duplicate data')
    ## [1] "Viewing duplicate data"
    table(duplicated(data1))
    ## 
    ## FALSE 
    ##   839

    Viewing Missing Value

    print('Viewing missing data')
    ## [1] "Viewing missing data"
    table(is.na(data1)) 
    ## 
    ## FALSE 
    ## 17619

    Reading Data, Factor Variables

    #(1)read data
    data1 <- data.frame(claimsData)
    data1=data1[data1$CLM_AMT>0,]
    # factor variables
    as.factor(data1$CAR_USE)
    as.factor(data1$CAR_TYPE)
    as.factor(data1$RED_CAR)
    as.factor(data1$REVOLKED)
    as.factor(data1$GENDER)
    as.factor(data1$MARRIED)
    as.factor(data1$JOBCLASS)
    as.factor(data1$MAX_EDUC)
    as.factor(data1$DENSITY)
    data1=data1[data1$CLM_AMT>0,]

    5.1 Objective 1 - Classification

    To classify whether auto insurance claims will occur based on the factors such as vehicle price, age, driver age, and gender.

    performance = function(xtab, desc=""){
      cat(desc,"\n")
      ACR = sum(diag(xtab))/sum(xtab)
      TPR = xtab[1,1]/sum(xtab[,1]); TNR = xtab[2,2]/sum(xtab[,2])
      PPV = xtab[1,1]/sum(xtab[1,]); NPV = xtab[2,2]/sum(xtab[2,])
      FPR = 1 - TNR ; FNR = 1 - TPR
      RandomAccuracy = (sum(xtab[,2])*sum(xtab[2,]) +
      sum(xtab[,1])*sum(xtab[1,]))/(sum(xtab)^2)
      Kappa = (ACR - RandomAccuracy)/(1 - RandomAccuracy)
      print(xtab)
      cat("\nAccuracy (ACR) :", ACR, "\n")
      cat("Sensitivity(TPR) :", TPR, "\n")
      cat("Specificity (TNR) :", TNR, "\n")
      cat("Positive Predictive Value (PPV) :", PPV, "\n")
      cat("Negative Predictive Value (NPV) :", NPV, "\n")
      cat("False Positive Rate (FPR) :", FPR, "\n")
      cat("False Negative Rate(FNR) :", FNR, "\n")
    }
    data2 <- data0
    data2['CLM'] = ifelse(data2['CLM_AMT']>0,1,0)
    data2 = subset(data2, select = -c(CLM_AMT))
    table(is.na(data2))
    ## 
    ## FALSE 
    ## 55839
    data2['CAR_USE'] = ifelse(data2['CAR_USE'] == 'Private' ,1 ,2)
    data2 = subset(data2, select = -c(RED_CAR,JOBCLASS,MAX_EDUC,CLM_JUDGE,CAR_TYPE))
    data2['GENDER'] = ifelse(data2['GENDER'] == 'F' ,1 ,0)
    data2['REVOLKED'] = ifelse(data2['REVOLKED'] == 'No' ,0 ,1)
    data2['MARRIED'] = ifelse(data2['MARRIED'] == 'Yes' ,1 ,0)
    data2['DENSITY'] = ifelse(data2['DENSITY'] == 'Urban' ,1 ,0)
    pardim=dim(data2)
    n=pardim[1]
    set.seed(123)
    test =sample(n, 0.2*n)
    train.x= data2[test, ]             
    test.x= data2[-test, ]
    train.y=train.x$CLM
    test.y=test.x$CLM
    require(class)

    5.1.1 KNN

    #sqrt(n)
    m1 = knn(train=train.x, test=test.x, cl=train.y, k=51)
    KNN.confusionMatrix = table(test.y, m1)
    #km = as.data.frame(km)
    #KNN.confusionMatrix
    performance(KNN.confusionMatrix)
    ##  
    ##       m1
    ## test.y    0    1
    ##      0 1381   67
    ##      1  608   72
    ## 
    ## Accuracy (ACR) : 0.6828008 
    ## Sensitivity(TPR) : 0.6943188 
    ## Specificity (TNR) : 0.5179856 
    ## Positive Predictive Value (PPV) : 0.9537293 
    ## Negative Predictive Value (NPV) : 0.1058824 
    ## False Positive Rate (FPR) : 0.4820144 
    ## False Negative Rate(FNR) : 0.3056812
    # k.optm=1
    # k_optm=c()
    # for (i in 1:300){
    # knn.mod <- knn(train.x, test.x,cl=train.y, k=i)
    # k.optm [i]<- 100 * sum(test.y == knn.mod)/NROW(test.y)
    # k_optm[i]<-k.optm[i]
    # }
    # k_optm
    plot(KNN.confusionMatrix)

    5.1.2 Naive Bayes

    train.x$CLM = as.factor(train.x$CLM)
    #install.packages("naivebayes") #Naive Bayes
    library(naivebayes)
    Bayes_mod <- naive_bayes(CLM~., data=train.x, usekernel=T)
    
    #par(mfrow = c(3,5)) 
    plot(Bayes_mod)

    Bayestest_pred <- predict(Bayes_mod, test.x)
    ## Warning: predict.naive_bayes(): more features in the newdata are provided as
    ## there are probability tables in the object. Calculation is performed based on
    ## features to be found in the tables.
    Bayes.confusionMatrix = table(Bayestest_pred, test.x$CLM)
    #Bayes.confusionMatrix
    performance(Bayes.confusionMatrix )
    ##  
    ##               
    ## Bayestest_pred    0    1
    ##              0 1380  568
    ##              1   68  112
    ## 
    ## Accuracy (ACR) : 0.7011278 
    ## Sensitivity(TPR) : 0.9530387 
    ## Specificity (TNR) : 0.1647059 
    ## Positive Predictive Value (PPV) : 0.7084189 
    ## Negative Predictive Value (NPV) : 0.6222222 
    ## False Positive Rate (FPR) : 0.8352941 
    ## False Negative Rate(FNR) : 0.04696133
    plot(Bayes.confusionMatrix)

    5.1.3 Decision Tree

    #install.packages("rpart")
    library(rpart)
    #install.packages("rpart.plot")
    library(rpart.plot)
    dt_model <- rpart(CLM~., data=train.x)
    dt_pre <- predict(dt_model, test.x, type="class")
    
    rpart.plot(dt_model, type=2)

    DecisionTree.confusionMatrix = table(dt_pre, test.x$CLM)
    DecisionTree.confusionMatrix =DecisionTree.confusionMatrix [2:1,2:1]
    performance(DecisionTree.confusionMatrix )
    ##  
    ##       
    ## dt_pre    1    0
    ##      1  260  242
    ##      0  420 1206
    ## 
    ## Accuracy (ACR) : 0.6889098 
    ## Sensitivity(TPR) : 0.3823529 
    ## Specificity (TNR) : 0.8328729 
    ## Positive Predictive Value (PPV) : 0.5179283 
    ## Negative Predictive Value (NPV) : 0.7416974 
    ## False Positive Rate (FPR) : 0.1671271 
    ## False Negative Rate(FNR) : 0.6176471
    plot(DecisionTree.confusionMatrix)

    5.2 Objective 2 - Predict Claims-Regression

    To build a model that can predict the amounts of custom’s claim.

    5.2.1 Linear Regression

    lm.fit=lm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE + CAR_USE+
                   GENDER +MARRIED + JOBCLASS , data=data1, ) 
    summary(lm.fit)
    ## 
    ## Call:
    ## lm(formula = CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE + GENDER + 
    ##     MARRIED + JOBCLASS, data = data1)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -11176  -2986  -1385    482  55424 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           6.278e+01  2.031e+03   0.031  0.97535    
    ## BLUEBOOK              1.664e-01  4.147e-02   4.012 6.58e-05 ***
    ## CAR_TYPEPickup        3.093e+03  1.354e+03   2.284  0.02260 *  
    ## CAR_TYPESedan         1.735e+03  1.374e+03   1.263  0.20694    
    ## CAR_TYPESports Car    4.378e+03  1.773e+03   2.469  0.01377 *  
    ## CAR_TYPESUV           4.215e+03  1.703e+03   2.474  0.01355 *  
    ## CAR_TYPEVan           1.473e+03  1.372e+03   1.074  0.28329    
    ## CAR_USEPrivate        1.210e+03  6.707e+02   1.805  0.07145 .  
    ## GENDERM               1.862e+03  8.679e+02   2.146  0.03220 *  
    ## MARRIEDYes           -1.155e+03  4.802e+02  -2.406  0.01636 *  
    ## JOBCLASSClerical     -2.137e+03  8.231e+02  -2.596  0.00961 ** 
    ## JOBCLASSDoctor       -2.084e+03  1.882e+03  -1.107  0.26848    
    ## JOBCLASSHome Maker   -3.324e+02  9.632e+02  -0.345  0.73011    
    ## JOBCLASSLawyer       -1.953e+03  1.042e+03  -1.873  0.06140 .  
    ## JOBCLASSManager      -1.099e+03  1.023e+03  -1.075  0.28286    
    ## JOBCLASSProfessional  5.053e+02  8.635e+02   0.585  0.55861    
    ## JOBCLASSStudent      -8.873e+02  8.025e+02  -1.106  0.26919    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 6795 on 822 degrees of freedom
    ## Multiple R-squared:  0.04544,    Adjusted R-squared:  0.02686 
    ## F-statistic: 2.445 on 16 and 822 DF,  p-value: 0.001263
    par(mfrow = c(2,2))
    plot(lm.fit)

    par(mfrow = c(2,2))
    plot(lm.fit)

    Residuals vs Fitted Plot is mainly used to test the assumed condition of linear properties. The graph shows that the resulting residual value and the fitted value basically have no obvious association, which indicates that the relationship between the independent variable and the dependent variable is linear and that the assumed condition is met. Most of the dots in the Normal Q-Q plot fall on a straight line at a 45-degree angle, essentially satisfying the assumption of normality. The Scale-Location plots indicate a random distribution of points around the horizontal line, satisfying the homoscedasticity assumption.

    5.2.2 Generalized Gamma Regression Model

    y=data1$CLM_AMT
    cat("glm.inverse=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +",
        "OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +",
        "MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1, ",
        "family=Gamma(link = 'inverse_sqrt')) # gamma回归",sep = "\n")
    ## glm.inverse=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +
    ## OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +
    ## MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1, 
    ## family=Gamma(link = 'inverse_sqrt')) # gamma回归
    cat("summary(glm.inverse)",sep = "\n")
    ## summary(glm.inverse)
    glm.log0=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +
                      OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +
                      MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1, 
                    family=Gamma(link = "log")) # gamma回归
    #summary(glm.log0)
    
    glm.identity1=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE  + 
                   GENDER +MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "identity")) 
    #summary(glm.identity1)
    
    glm.log1=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE  + 
                   GENDER +MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "log")) 
    #summary(glm.log1)
    #sum.log1=summary(glm.log1)
    #sum.log1$coefficients
    
    #summary(glm.log1)
    #summary_fit
    
    glm.log1=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE +
                   GENDER +MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "log"))
    #regroup$JOBCLASS
    data1$JOBCLASS[data1$JOBCLASS == c("Home Maker","Lawyer","Doctor")] ="otherj"
    ## Warning in data1$JOBCLASS == c("Home Maker", "Lawyer", "Doctor"): longer object
    ## length is not a multiple of shorter object length
    data1$JOBCLASS[data1$JOBCLASS == c("Manager")] ="otherj"
    data1$JOBCLASS[data1$JOBCLASS == c("Student")] ="otherj"
    data1$JOBCLASS <- factor(data1$JOBCLASS, 
                             levels=c("Clerical","otherj","Professional", "Blue Collar"),
                             labels=c("Clerical","otherj","Professional","Blue Collar"))
    data1$JOBCLASS[is.na(data1$JOBCLASS)]<- "otherj"
    
    
    #regroup$CAR_TYPE
    data1$CAR_TYPE[data1$CAR_TYPE == c("Sedan","Van")] ="othert"
    ## Warning in data1$CAR_TYPE == c("Sedan", "Van"): longer object length is not a
    ## multiple of shorter object length
    data1$CAR_TYPE[data1$CAR_TYPE == c("Panel Truck")] ="othert"
    data1$CAR_TYPE <- factor(data1$CAR_TYPE, 
                             levels=c("othert" ,"Pickup", "Sports Car" ,"SUV"),
                             labels=c("othert","Pickup", "Sports Car" ,"SUV" ))
    data1$CAR_TYPE[is.na(data1$CAR_TYPE)]<- "othert"

    Final Generalized Gamma Regression Model

    #AIC = AIC(glm.log2,glm.log3,glm.log4)
    #BIC = BIC(glm.log2,glm.log3,glm.log4)
    #VIF = VIF(glm.log2,glm.log3,glm.log4)
    #cbind(AIC, BIC)
    #cbind(AIC, BIC, VIF)

    glm.log2=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE +
                   GENDER +MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "log")) 
    glm.log3=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE + CAR_USE+
                   GENDER +MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "log")) 
    glm.log4=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE + CAR_USE+
                   MARRIED + JOBCLASS ,
                 data=data1, family=Gamma(link = "log")) 
    #summary(glm.log2)
    #summary(glm.log4)
    summary(glm.log3)
    ## 
    ## Call:
    ## glm(formula = CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE + GENDER + 
    ##     MARRIED + JOBCLASS, family = Gamma(link = "log"), data = data1)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -2.0448  -0.6101  -0.2814   0.1064   3.9988  
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           7.795e+00  2.165e-01  35.997  < 2e-16 ***
    ## BLUEBOOK              1.928e-05  5.653e-06   3.410 0.000682 ***
    ## CAR_TYPEPickup        2.703e-01  1.140e-01   2.372 0.017928 *  
    ## CAR_TYPESports Car    3.462e-01  1.638e-01   2.114 0.034849 *  
    ## CAR_TYPESUV           3.271e-01  1.447e-01   2.260 0.024062 *  
    ## CAR_USEPrivate        1.672e-01  9.503e-02   1.759 0.078871 .  
    ## GENDERM               1.850e-01  1.250e-01   1.479 0.139456    
    ## MARRIEDYes           -1.718e-01  7.581e-02  -2.266 0.023684 *  
    ## JOBCLASSotherj        1.987e-01  1.111e-01   1.788 0.074078 .  
    ## JOBCLASSProfessional  4.394e-01  1.453e-01   3.025 0.002566 ** 
    ## JOBCLASSBlue Collar   3.615e-01  1.282e-01   2.820 0.004924 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for Gamma family taken to be 1.172686)
    ## 
    ##     Null deviance: 582.36  on 838  degrees of freedom
    ## Residual deviance: 538.08  on 828  degrees of freedom
    ## AIC: 16009
    ## 
    ## Number of Fisher Scoring iterations: 7

    Based on the results of parameter estimation in Table, it can be seen that the vast majority of estimates at all levels of parameters pass the significance test. According to the estimated coefficient, BLUEBOOK (vehicle value) has a positive relationship with the claim amount, that is to say, the higher the value of the car, the greater the amount when the claim occurs, and this conclusion is very common sense. In addition, Pickup, Sports Car and SUV have higher claims compared to Sedan and Van. In terms of car valuation, the value of Sports Car (sports car) is much higher than Sedan Sedan (car) and Van (van), and the coefficient of Sports Car is higher than that of other types of cars, that is, Sports Car is the car type with the largest amount of claim in a given car type. In terms of the use of vehicles, compared with official vehicles (Commercial), the claim amount of private vehicles (Private) is higher.

    In addition, male drivers make higher claims than women, that is, male drivers are more likely to make higher claims. In terms of driver occupation, based on the amount of claim incurred by priest (Clerical), the amount of claim incurred in the remaining occupation is often higher than that of priest (Clerical).

    5.2.3 Model Diagnosis

    residual=residuals(glm.log3)
    plot(residual,ylab='residual',xlab='y',font.lab=1)

    As can be seen from the residual map, all the residual values float randomly up and down 0, mostly concentrated on the 0 scale line, indicating that the model is relatively well-fitted. In addition, the iterative weighted least squares method is used to calculate the parameter estimation of the regression model. In order to further judge the quality of the model, the diagnostic map of the regression model is drawn as follows.

    par(mfrow = c(2,2)) 
    plot(glm.log3)

    #par(mfrow = c(1,2)) 
    #influencePlot(glm.log3)
    #plot(predict(glm.log3,type="response"),residuals(glm.log3,type="deviance"))

    In order to observe the outlier points, leverage value, and strong influence points of the model, this paper uses car in R language and the package to draw a comprehensive diagnostic map. In the graph, the horizontal axis represents the lever value and the vertical axis represents the student chemistry residual value, while the plotted symbol size is proportional to the Cook distance size.

    Points above 2 or less than negative 2 in the graph can be considered outlier points, and points with dashed lines above 0.025 to 0.03 in the horizontal axis have high leverage values. The circle size is proportional to the effect, and the points with a large circle may be strong points of disproportionate effects on the estimation of the model parameters. Of the 810 sets of sample data involved in this paper, there are five obvious outliers in the figure, and a very small part is high leverage values. In a comprehensive judgment, the model established in this paper fits quite well.

    5.2.4 Comparison of Models

    #glm.fit=glm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE + CAR_USE+  GENDER +
    #              MARRIED + JOBCLASS ,data=data1, family=Gamma(link = "log")) 
    #lm.fit=lm(formula =CLM_AMT ~ BLUEBOOK  + CAR_TYPE + CAR_USE+
    #            GENDER +MARRIED + JOBCLASS , data=data1, ) 
    #summary(lm.fit)
    #sumlm=summary(lm.fit)
    #deviance(lm.fit);deviance(glm.fit)
    #AIC(glm.fit,lm.fit)
    #BIC(glm.fit,lm.fit)

    In the significance of the parameter estimates in the Table of this paper, it is clear that the vast majority of the individual parameter estimates pass the significance test with a confidence level of 0.1.

    Based on the common criteria for evaluating the quality of the model: AIC value, BIC value, and the deviation of the sum of squares (Residual deviance). Specifically, AIC values and BIC values are comprehensive evaluations of model goodness of fit and model complexity, and the smaller the value, the better the fit of the model.

    6. Conclusion

    1. Naive Bayes has the highest accuracy, compared with KNN and Decision Tree, which can be a classification model of claims status.

    1. Generalized gamma regression model can predict the amount of custom’s claims based on the most important feature, INCOME, AGE, and BLUEBOOK, with smaller residuals and better model performance, compared with the linear model.