WQD 7004 Group Assignment
Group Members
- Yan Lei (S2178506)
- Liu Ziwei (S2160012)
- Nasir Uddin Ahmed (S2015449)
- Jing Yang (S2147529)
- Jiale Xiong (S2142748)
Project Title
Prediction of Vehicle Insurance Claims Data Based on Non-negative Fat-tailed Distribution
1. Introduction
The methodology proposed in this assignment focuses on predicting policy payout costs based on various objective factors and analyzing the risk level of customer groups to help insurance companies accurately price their policies and thus ensure the proper operation of their businesses.
1.1 Problem Statement
With the rapid development of insurance business, the uncertainty of insurance claims have been deeply plagued by insurance company. Generally speaking, the amount of insurance claims is directly related to the stability of the financial operations of insurance companies, especially the non-life insurance business. When it comes to the modeling of insurance claim data, finding an appropriate distribution to mine information from the data is an important task to help actuaries calculate premiums. Most insurance claims data have several characteristics, such as unimodal, right skewness, and thick right tail. To be specific, the thick-tailed nature of the data means that once extreme events occur, they will bring huge losses to insurance companies. Correspondingly, the tail dependence will also lead to a higher probability of simultaneous claims in different insurance policies. This creates a risk aggregation problem for insurers. In previous studies, it is found that insurance claims data often has the feature of non-negative thick tail, which means that majority of the claim amount tend to come from small number of claims. Therefore, this paper focuses on the regression application of insurance claims data with non-negative thick tail.
1.2 Data Mining Goal
- To select an algorithm model to predict whether auto insurance claims will occur based on objective factors such as vehicle price, age, driver age, and gender.
- To select an algorithm model, such as the generalized gamma regression model, because the generalized gamma regression model also has the feature of non-negative thick tail and has good fitting effect.
1.3 Objectives
- To classify whether auto insurance claims will occur based on the factors such as vehicle price, age, driver age, and gender.
- To build a model that can predict the amounts of custom’s claim.
2. Dataset
Auto insurance claims data provided by SAS Institute Inc
Data Source: https://github.com/Alisonin/Insurance
Reference:SAS Institute Inc. Solving business problems using SAS enterprise miner software[White Paper]. Cary.
#Loading necessary library
library("readxl")
library("ggplot2")
library("dplyr")
library("corrplot")
library("hexbin")
library('plyr')
library("rmdformats")
library("tidyr")
library("purrr")
library("gridExtra")
library("ggrepel")
library("pastecs")3. Data Processing
In data processing, there are several tasks to be performed. They are
- Removing blank information
- Removing duplicates
- Removing redundant data
- Deleting data where no claim occured
The following code blocks have been used to perform the tasks mentioned above.
Reading the Datset
#claimsData = read.csv('claims_2812_33.csv')
url <- 'https://raw.githubusercontent.com/Alisonin/Insurance/main/claims_2812_33.csv'
claimsData <- read.csv(url)Checking for Missing Data
# Checking missing data in claimsData
missTotal <- apply(is.na(claimsData)|(claimsData==""),2,sum)
missPercent <- apply(is.na(claimsData)|(claimsData==""),2,function(x) sum(x)/length(x))
missingData <- data.frame(missTotal,missPercent)
missingData <- missingData[order(missTotal,decreasing=TRUE),]
# Retain variables with missing dataRetain Variables with Missing Data
missingData[missingData$missTotal>0,]## missTotal missPercent
## CLM_DATE 1921 0.68314367
## SAMEHOME 175 0.06223329
## HOME_VAL 168 0.05974395
## YOJ 153 0.05440967
Delete Some Variables with the Same Meaning but Inconsistent Statistical Methods
claimsData = claimsData[, !(names(claimsData) %in% c("ID", "PLCYDATE", "POLICYNO", "INITDATE", "RETAINED", "MVR_PTS", "CLM_DATE", "CLM_FLAG", "BIRTH", "PARENT1", "HOME_VAL", "SAMEHOME", "YEARQTR"))]Delete Samples with Missing YOJ Variable
claimsData <- claimsData[complete.cases(claimsData[c("YOJ")]),]Create a New Column Named CLM_JUDGE, if the CLM_AMT Value is 0, the Corresponding CLM_JUDGE value is no, otherwise yes
claimsData$CLM_JUDGE <- ifelse(claimsData$CLM_AMT == 0, "no", "yes")
#claimsData
data0 = claimsData4. Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is an essential step in the data science process as it allows for the discovery of patterns, relationships, and insights in the data. It is a way to understand the underlying structure of the data, identify any potential issues or outliers, and generate hypotheses for further analysis. EDA also helps in the selection of appropriate statistical models and techniques for the data at hand. Furthermore, it helps in the communication of findings and results to stakeholders and in the development of a deeper understanding of the problem being addressed. Overall, EDA is crucial for ensuring that the data is properly understood and utilized in decision making.
Reading the Data into a Data Frame
df <- claimsDataCombined Boxplot for Explanatory Variables
#colnames(data1)
con.v=c(1,2,4:5,8,9,11:14)
totaltu=function(data,a,b)
{
zhibiao=colnames(data)
xn=ncol(data)
par(mfrow = c(a,b)) #a行b列个指标
for(i in 1:xn){
boxplot(data[,i],xlab = zhibiao[i])
}
}
totaltu(df[,con.v],2,5)Combined boxplot, also known as a side-by-side boxplot or a grouped boxplot, is a type of plot used to compare the distribution of multiple variables (or groups) at the same time. It is particularly useful when working with explanatory variables, as it allows for easy comparison of the distribution of the variables across different groups or levels. The combined boxplot for explanatory variables for our dataset is showing the median and interquartile range of the variable, while the whiskers would extend out to the minimum and maximum values.
Matrix Plot
pairs(df[,con.v])In the matrix plot above each row and column in the matrix represents a variable in the car insurance claim dataset, and the plot at the intersection of a row and column shows the relationship between those two variables. We have used a matrix plot to investigate the relationships between different variables that may affect the cost of a claim, such as the age of the vehicle, the make and model, the location of the accident, and the driver’s demographics such as age and driving history.
Relative Analysis
corr=cor(df[,c(11,con.v)]) #相关系数矩阵
###相关系数图###install.packages("corrplot")
install.packages("corrplot")## Warning: package 'corrplot' is in use and will not be installed
library(corrplot)
library(RColorBrewer)
myColors=brewer.pal(8,"Set2")[1:8]
corrplot(corr=corr, tl.col="black",col=myColors, title = 'Correlation Graph
')From the above Correlation Coefficient Analysis Matrix:
df <- read_xls("data0.xls")
View(df)Characteristics of the Customer’s Personal Information
Ringdiagram to Show Gender, Married and Jobclass
# plot 1
countsgender <- table(df$GENDER)
View(countsgender)
hsize <- 1
dfcountsgender <- as.data.frame(countsgender)
names(dfcountsgender)[names(dfcountsgender) == "Var1"] <- "GENDER"
View(dfcountsgender)
r1 <- ggplot(dfcountsgender, aes(x = hsize, y = Freq, fill = GENDER)) +
geom_col(color = "black") +
coord_polar(theta = "y") +
xlim(c(0.2, hsize + 0.5)) +
labs(x = "") + scale_fill_manual(values = myColors)
countsmarried <- table(df$MARRIED)
View(countsmarried )
# plot 2
hsize <- 1
dfcountsmarried <- as.data.frame(countsmarried)
names(dfcountsmarried)[names(dfcountsmarried) == "Var1"] <- "MARRIED"
View(dfcountsmarried)
r2 <- ggplot(dfcountsmarried, aes(x = hsize, y = Freq, fill = MARRIED)) +
geom_col(color = "black") +
coord_polar(theta = "y") +
xlim(c(0.2, hsize + 0.5)) +
labs(x = "")
countsjobclass <- table(df$JOBCLASS)
View(countsjobclass)
# plot 3
hsize <- 1
dfcountsjobclass <- as.data.frame(countsjobclass)
names(dfcountsjobclass)[names(dfcountsjobclass) == "Var1"] <- "Job_Class"
View(dfcountsjobclass)
r3 <- ggplot(dfcountsjobclass, aes(x = hsize, y = Freq, fill = Job_Class)) +
geom_col(color = "black") +
coord_polar(theta = "y") +
xlim(c(0.2, hsize + 0.5)) +
labs(x = "")
#knitr::include_graphics("C://Users/Turjo/Documents/r4.PNG")
grid.arrange(r1, r2, r3, ncol = 3)From the characteristic of the customer’s personal information ring diagrams above:
- The circle graph above makes it evident that the insured are more evenly distributed by gender and marital status.
- The Blue collar has a higher number of professional people.
Violin Plot to Show the Distribution of Age, Income, Homekids
ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")The above Violin Plot displays the probability density of AGE. The age is concentrated between 30 to 50.
ggplot(df, aes(x=INCOME, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")The above Violin Plot displays the probability density of INCOME.
ggplot(df, aes(x=HOMEKIDS, y=CLM_AMT)) + geom_violin(fill="#428ac2",color="#9f2042")The above Violin Plot displays the probability density of HOMEKIDS. Which is concentrated between 2 to 3.
Bar Plot of YOJ in Different Gender
ggplot(df, aes(x=YOJ, fill=GENDER)) +
geom_bar(position="dodge") +
labs(x="YOJ", y="Count") +
ggtitle("Count of Claim by YOJ and Gender")The bar plot is showing the count of claim YOJ by gender. As we can see for YOJ 0, the number of females is over 60.
Scatter Plot to Display the Relationship Between Age Amount Claim
ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_point(color="#9f2042")Each point on the scatter plot represents a single data point, with the x-coordinate being the age of the individual and the y-coordinate being the amount of the claim. By plotting all the data points on the scatter plot, we can visually observe the relationship between age and claim amount.
Hexbin Plot to Display the Relation Between Age and Claim amount
ggplot(df, aes(x=AGE, y=CLM_AMT)) + geom_hex()Each hexagon on the plot represents a group of data points with similar x and y values. The color of the hexagon represents the density of data points in that bin, with darker colors indicating a higher density of data points.
Bar Plot to Display Percentage of Education
education_percent <- df %>% group_by(MAX_EDUC) %>% dplyr::summarize(n = n()) %>% mutate(percent = n / sum(n))
ggplot(education_percent, aes(x = MAX_EDUC, y = percent)) +
geom_bar(stat = "identity",fill="#9f2042",color="#428ac2") +
ylab("Percentage")From the bar plot we can see, High School has the most percentages for MAX_EDUC.
Histogram of the Amount Claimed
#Distribution at the time of claim
data1 <- data.frame(claimsData)
data1=data1[data1$CLM_AMT>0,]
#Histogram of the amount claimed
par(pin = c(6,3))
hist(data1$CLM_AMT, freq=F, breaks=66,bty= "o", col=0,font.lab=2,
main="Histogram of CLM_AMT", font=2,xlab="CLM_AMT")
clsd=sd(data1$CLM_AMT)
clmean=mean(data1$CLM_AMT)
xnorm=seq(0,60000,length.out=10000)
lines(xnorm,dnorm(xnorm,mean = clmean,sd = clsd),lty=1,lwd=2)
legend("topright",legend=paste("Normal distribution"),lwd='2', lty=1,cex=1)
box() EDA Summary
summary(data1) ## KIDSDRIV TRAVTIME CAR_USE BLUEBOOK
## Min. :0.0000 Min. : 5.00 Length:839 Min. : 1500
## 1st Qu.:0.0000 1st Qu.:25.00 Class :character 1st Qu.: 7310
## Median :0.0000 Median :35.00 Mode :character Median :11890
## Mean :0.2825 Mean :35.08 Mean :13661
## 3rd Qu.:0.0000 3rd Qu.:45.00 3rd Qu.:18225
## Max. :3.0000 Max. :88.00 Max. :41440
## NPOLICY CAR_TYPE RED_CAR OLDCLAIM
## Min. :1.000 Length:839 Length:839 Min. : 0
## 1st Qu.:1.000 Class :character Class :character 1st Qu.: 0
## Median :1.000 Mode :character Mode :character Median : 2712
## Mean :1.627 Mean : 6038
## 3rd Qu.:2.000 3rd Qu.: 6739
## Max. :5.000 Max. :57037
## CLM_FREQ REVOLKED CLM_AMT AGE
## Min. :0.000 Length:839 Min. : 295 Min. :16.00
## 1st Qu.:0.000 Class :character 1st Qu.: 2706 1st Qu.:37.00
## Median :1.000 Mode :character Median : 4102 Median :43.00
## Mean :1.225 Mean : 5561 Mean :43.26
## 3rd Qu.:2.000 3rd Qu.: 5644 3rd Qu.:49.50
## Max. :5.000 Max. :61678 Max. :67.00
## HOMEKIDS YOJ INCOME GENDER
## Min. :0.0000 Min. : 0.00 Min. : 0 Length:839
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 18857 Class :character
## Median :0.0000 Median :11.00 Median : 42687 Mode :character
## Mean :0.9976 Mean :10.01 Mean : 47235
## 3rd Qu.:2.0000 3rd Qu.:13.00 3rd Qu.: 67360
## Max. :5.0000 Max. :18.00 Max. :273507
## MARRIED JOBCLASS MAX_EDUC DENSITY
## Length:839 Length:839 Length:839 Length:839
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CLM_JUDGE
## Length:839
## Class :character
## Mode :character
##
##
##
ydes=stat.desc(data1[,'CLM_AMT'],basic = T ,desc = T, norm=TRUE, p=0.95 )
t(as.data.frame(ydes))## nbr.val nbr.null nbr.na min max range sum median mean SE.mean
## ydes 839 0 0 295 61678 61383 4665938 4102 5561.309 237.8077
## CI.mean.0.95 var std.dev coef.var skewness skew.2SE kurtosis
## ydes 466.7687 47447553 6888.218 1.238597 4.723786 27.97951 27.38481
## kurt.2SE normtest.W normtest.p
## ydes 81.19742 0.5000895 2.469903e-43
Descriptive Statistics of Explanatory Variables
#解释变量描述性统计(Descriptive statistics of explanatory variables)
con.v=c(1,2,4:5,8,9,12:14) #连续性变量的列号(Column number of the continuity variable)
stat.desc(data1[,con.v],basic = T ,desc = T )## KIDSDRIV TRAVTIME BLUEBOOK NPOLICY OLDCLAIM
## nbr.val 839.00000000 8.390000e+02 8.390000e+02 8.390000e+02 8.390000e+02
## nbr.null 678.00000000 0.000000e+00 0.000000e+00 0.000000e+00 3.440000e+02
## nbr.na 0.00000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min 0.00000000 5.000000e+00 1.500000e+03 1.000000e+00 0.000000e+00
## max 3.00000000 8.800000e+01 4.144000e+04 5.000000e+00 5.703700e+04
## range 3.00000000 8.300000e+01 3.994000e+04 4.000000e+00 5.703700e+04
## sum 237.00000000 2.943200e+04 1.146198e+07 1.365000e+03 5.065494e+06
## median 0.00000000 3.500000e+01 1.189000e+04 1.000000e+00 2.712000e+03
## mean 0.28247914 3.507986e+01 1.366148e+04 1.626937e+00 6.037538e+03
## SE.mean 0.02250370 5.216226e-01 2.702293e+02 2.847885e-02 3.470596e+02
## CI.mean.0.95 0.04417024 1.023840e+00 5.304059e+02 5.589826e-02 6.812082e+02
## var 0.42488358 2.282836e+02 6.126705e+07 6.804669e-01 1.010579e+08
## std.dev 0.65183095 1.510906e+01 7.827327e+03 8.249042e-01 1.005275e+04
## coef.var 2.30753656 4.307046e-01 5.729488e-01 5.070290e-01 1.665042e+00
## CLM_FREQ AGE HOMEKIDS YOJ
## nbr.val 8.390000e+02 8.390000e+02 839.00000000 839.0000000
## nbr.null 3.440000e+02 0.000000e+00 447.00000000 108.0000000
## nbr.na 0.000000e+00 0.000000e+00 0.00000000 0.0000000
## min 0.000000e+00 1.600000e+01 0.00000000 0.0000000
## max 5.000000e+00 6.700000e+01 5.00000000 18.0000000
## range 5.000000e+00 5.100000e+01 5.00000000 18.0000000
## sum 1.028000e+03 3.629300e+04 837.00000000 8397.0000000
## median 1.000000e+00 4.300000e+01 0.00000000 11.0000000
## mean 1.225268e+00 4.325745e+01 0.99761621 10.0083433
## SE.mean 4.300704e-02 3.201217e-01 0.04240038 0.1593767
## CI.mean.0.95 8.441417e-02 6.283346e-01 0.08322341 0.3128244
## var 1.551819e+00 8.597899e+01 1.50834753 21.3113862
## std.dev 1.245720e+00 9.272485e+00 1.22814801 4.6164257
## coef.var 1.016692e+00 2.143558e-01 1.23108265 0.4612577
#Observe the correlation between variables
#corr=cor(data1[,c(11,con.v)]) #相关系数矩阵
#corr5. Modeling
Viewing Duplicate Data
print('Viewing duplicate data')## [1] "Viewing duplicate data"
table(duplicated(data1))##
## FALSE
## 839
Viewing Missing Value
print('Viewing missing data')## [1] "Viewing missing data"
table(is.na(data1)) ##
## FALSE
## 17619
Reading Data, Factor Variables
#(1)read data
data1 <- data.frame(claimsData)
data1=data1[data1$CLM_AMT>0,]
# factor variables
as.factor(data1$CAR_USE)
as.factor(data1$CAR_TYPE)
as.factor(data1$RED_CAR)
as.factor(data1$REVOLKED)
as.factor(data1$GENDER)
as.factor(data1$MARRIED)
as.factor(data1$JOBCLASS)
as.factor(data1$MAX_EDUC)
as.factor(data1$DENSITY)data1=data1[data1$CLM_AMT>0,]5.1 Objective 1 - Classification
To classify whether auto insurance claims will occur based on the factors such as vehicle price, age, driver age, and gender.
performance = function(xtab, desc=""){
cat(desc,"\n")
ACR = sum(diag(xtab))/sum(xtab)
TPR = xtab[1,1]/sum(xtab[,1]); TNR = xtab[2,2]/sum(xtab[,2])
PPV = xtab[1,1]/sum(xtab[1,]); NPV = xtab[2,2]/sum(xtab[2,])
FPR = 1 - TNR ; FNR = 1 - TPR
RandomAccuracy = (sum(xtab[,2])*sum(xtab[2,]) +
sum(xtab[,1])*sum(xtab[1,]))/(sum(xtab)^2)
Kappa = (ACR - RandomAccuracy)/(1 - RandomAccuracy)
print(xtab)
cat("\nAccuracy (ACR) :", ACR, "\n")
cat("Sensitivity(TPR) :", TPR, "\n")
cat("Specificity (TNR) :", TNR, "\n")
cat("Positive Predictive Value (PPV) :", PPV, "\n")
cat("Negative Predictive Value (NPV) :", NPV, "\n")
cat("False Positive Rate (FPR) :", FPR, "\n")
cat("False Negative Rate(FNR) :", FNR, "\n")
}data2 <- data0
data2['CLM'] = ifelse(data2['CLM_AMT']>0,1,0)
data2 = subset(data2, select = -c(CLM_AMT))table(is.na(data2))##
## FALSE
## 55839
data2['CAR_USE'] = ifelse(data2['CAR_USE'] == 'Private' ,1 ,2)
data2 = subset(data2, select = -c(RED_CAR,JOBCLASS,MAX_EDUC,CLM_JUDGE,CAR_TYPE))
data2['GENDER'] = ifelse(data2['GENDER'] == 'F' ,1 ,0)
data2['REVOLKED'] = ifelse(data2['REVOLKED'] == 'No' ,0 ,1)
data2['MARRIED'] = ifelse(data2['MARRIED'] == 'Yes' ,1 ,0)
data2['DENSITY'] = ifelse(data2['DENSITY'] == 'Urban' ,1 ,0)pardim=dim(data2)
n=pardim[1]
set.seed(123)
test =sample(n, 0.2*n)
train.x= data2[test, ]
test.x= data2[-test, ]
train.y=train.x$CLM
test.y=test.x$CLM
require(class)5.1.1 KNN
#sqrt(n)
m1 = knn(train=train.x, test=test.x, cl=train.y, k=51)
KNN.confusionMatrix = table(test.y, m1)
#km = as.data.frame(km)
#KNN.confusionMatrix
performance(KNN.confusionMatrix)##
## m1
## test.y 0 1
## 0 1381 67
## 1 608 72
##
## Accuracy (ACR) : 0.6828008
## Sensitivity(TPR) : 0.6943188
## Specificity (TNR) : 0.5179856
## Positive Predictive Value (PPV) : 0.9537293
## Negative Predictive Value (NPV) : 0.1058824
## False Positive Rate (FPR) : 0.4820144
## False Negative Rate(FNR) : 0.3056812
# k.optm=1
# k_optm=c()
# for (i in 1:300){
# knn.mod <- knn(train.x, test.x,cl=train.y, k=i)
# k.optm [i]<- 100 * sum(test.y == knn.mod)/NROW(test.y)
# k_optm[i]<-k.optm[i]
# }
# k_optmplot(KNN.confusionMatrix)5.1.2 Naive Bayes
train.x$CLM = as.factor(train.x$CLM)#install.packages("naivebayes") #Naive Bayes
library(naivebayes)
Bayes_mod <- naive_bayes(CLM~., data=train.x, usekernel=T)
#par(mfrow = c(3,5))
plot(Bayes_mod)Bayestest_pred <- predict(Bayes_mod, test.x)## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
Bayes.confusionMatrix = table(Bayestest_pred, test.x$CLM)
#Bayes.confusionMatrix
performance(Bayes.confusionMatrix )##
##
## Bayestest_pred 0 1
## 0 1380 568
## 1 68 112
##
## Accuracy (ACR) : 0.7011278
## Sensitivity(TPR) : 0.9530387
## Specificity (TNR) : 0.1647059
## Positive Predictive Value (PPV) : 0.7084189
## Negative Predictive Value (NPV) : 0.6222222
## False Positive Rate (FPR) : 0.8352941
## False Negative Rate(FNR) : 0.04696133
plot(Bayes.confusionMatrix)5.1.3 Decision Tree
#install.packages("rpart")
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)
dt_model <- rpart(CLM~., data=train.x)
dt_pre <- predict(dt_model, test.x, type="class")
rpart.plot(dt_model, type=2)DecisionTree.confusionMatrix = table(dt_pre, test.x$CLM)
DecisionTree.confusionMatrix =DecisionTree.confusionMatrix [2:1,2:1]
performance(DecisionTree.confusionMatrix )##
##
## dt_pre 1 0
## 1 260 242
## 0 420 1206
##
## Accuracy (ACR) : 0.6889098
## Sensitivity(TPR) : 0.3823529
## Specificity (TNR) : 0.8328729
## Positive Predictive Value (PPV) : 0.5179283
## Negative Predictive Value (NPV) : 0.7416974
## False Positive Rate (FPR) : 0.1671271
## False Negative Rate(FNR) : 0.6176471
plot(DecisionTree.confusionMatrix)5.2 Objective 2 - Predict Claims-Regression
To build a model that can predict the amounts of custom’s claim.
5.2.1 Linear Regression
lm.fit=lm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE+
GENDER +MARRIED + JOBCLASS , data=data1, )
summary(lm.fit)##
## Call:
## lm(formula = CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE + GENDER +
## MARRIED + JOBCLASS, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11176 -2986 -1385 482 55424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.278e+01 2.031e+03 0.031 0.97535
## BLUEBOOK 1.664e-01 4.147e-02 4.012 6.58e-05 ***
## CAR_TYPEPickup 3.093e+03 1.354e+03 2.284 0.02260 *
## CAR_TYPESedan 1.735e+03 1.374e+03 1.263 0.20694
## CAR_TYPESports Car 4.378e+03 1.773e+03 2.469 0.01377 *
## CAR_TYPESUV 4.215e+03 1.703e+03 2.474 0.01355 *
## CAR_TYPEVan 1.473e+03 1.372e+03 1.074 0.28329
## CAR_USEPrivate 1.210e+03 6.707e+02 1.805 0.07145 .
## GENDERM 1.862e+03 8.679e+02 2.146 0.03220 *
## MARRIEDYes -1.155e+03 4.802e+02 -2.406 0.01636 *
## JOBCLASSClerical -2.137e+03 8.231e+02 -2.596 0.00961 **
## JOBCLASSDoctor -2.084e+03 1.882e+03 -1.107 0.26848
## JOBCLASSHome Maker -3.324e+02 9.632e+02 -0.345 0.73011
## JOBCLASSLawyer -1.953e+03 1.042e+03 -1.873 0.06140 .
## JOBCLASSManager -1.099e+03 1.023e+03 -1.075 0.28286
## JOBCLASSProfessional 5.053e+02 8.635e+02 0.585 0.55861
## JOBCLASSStudent -8.873e+02 8.025e+02 -1.106 0.26919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6795 on 822 degrees of freedom
## Multiple R-squared: 0.04544, Adjusted R-squared: 0.02686
## F-statistic: 2.445 on 16 and 822 DF, p-value: 0.001263
par(mfrow = c(2,2))
plot(lm.fit)par(mfrow = c(2,2))
plot(lm.fit)Residuals vs Fitted Plot is mainly used to test the assumed condition of linear properties. The graph shows that the resulting residual value and the fitted value basically have no obvious association, which indicates that the relationship between the independent variable and the dependent variable is linear and that the assumed condition is met. Most of the dots in the Normal Q-Q plot fall on a straight line at a 45-degree angle, essentially satisfying the assumption of normality. The Scale-Location plots indicate a random distribution of points around the horizontal line, satisfying the homoscedasticity assumption.
5.2.2 Generalized Gamma Regression Model
y=data1$CLM_AMT
cat("glm.inverse=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +",
"OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +",
"MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1, ",
"family=Gamma(link = 'inverse_sqrt')) # gamma回归",sep = "\n")## glm.inverse=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +
## OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +
## MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1,
## family=Gamma(link = 'inverse_sqrt')) # gamma回归
cat("summary(glm.inverse)",sep = "\n")## summary(glm.inverse)
glm.log0=glm(CLM_AMT ~ TRAVTIME + CAR_USE + BLUEBOOK + NPOLICY + CAR_TYPE + RED_CAR +
OLDCLAIM + CLM_FREQ + REVOLKED + AGE + HOMEKIDS + YOJ + INCOME + GENDER +
MARRIED + JOBCLASS + MAX_EDUC + DENSITY, data=data1,
family=Gamma(link = "log")) # gamma回归
#summary(glm.log0)
glm.identity1=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE +
GENDER +MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "identity"))
#summary(glm.identity1)
glm.log1=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE +
GENDER +MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "log"))
#summary(glm.log1)
#sum.log1=summary(glm.log1)
#sum.log1$coefficients
#summary(glm.log1)
#summary_fit
glm.log1=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE +
GENDER +MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "log"))#regroup$JOBCLASS
data1$JOBCLASS[data1$JOBCLASS == c("Home Maker","Lawyer","Doctor")] ="otherj"## Warning in data1$JOBCLASS == c("Home Maker", "Lawyer", "Doctor"): longer object
## length is not a multiple of shorter object length
data1$JOBCLASS[data1$JOBCLASS == c("Manager")] ="otherj"
data1$JOBCLASS[data1$JOBCLASS == c("Student")] ="otherj"
data1$JOBCLASS <- factor(data1$JOBCLASS,
levels=c("Clerical","otherj","Professional", "Blue Collar"),
labels=c("Clerical","otherj","Professional","Blue Collar"))
data1$JOBCLASS[is.na(data1$JOBCLASS)]<- "otherj"
#regroup$CAR_TYPE
data1$CAR_TYPE[data1$CAR_TYPE == c("Sedan","Van")] ="othert"## Warning in data1$CAR_TYPE == c("Sedan", "Van"): longer object length is not a
## multiple of shorter object length
data1$CAR_TYPE[data1$CAR_TYPE == c("Panel Truck")] ="othert"
data1$CAR_TYPE <- factor(data1$CAR_TYPE,
levels=c("othert" ,"Pickup", "Sports Car" ,"SUV"),
labels=c("othert","Pickup", "Sports Car" ,"SUV" ))
data1$CAR_TYPE[is.na(data1$CAR_TYPE)]<- "othert"Final Generalized Gamma Regression Model
#AIC = AIC(glm.log2,glm.log3,glm.log4)
#BIC = BIC(glm.log2,glm.log3,glm.log4)
#VIF = VIF(glm.log2,glm.log3,glm.log4)
#cbind(AIC, BIC)
#cbind(AIC, BIC, VIF)glm.log2=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE +
GENDER +MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "log"))
glm.log3=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE+
GENDER +MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "log"))
glm.log4=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE+
MARRIED + JOBCLASS ,
data=data1, family=Gamma(link = "log"))
#summary(glm.log2)
#summary(glm.log4)
summary(glm.log3)##
## Call:
## glm(formula = CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE + GENDER +
## MARRIED + JOBCLASS, family = Gamma(link = "log"), data = data1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0448 -0.6101 -0.2814 0.1064 3.9988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.795e+00 2.165e-01 35.997 < 2e-16 ***
## BLUEBOOK 1.928e-05 5.653e-06 3.410 0.000682 ***
## CAR_TYPEPickup 2.703e-01 1.140e-01 2.372 0.017928 *
## CAR_TYPESports Car 3.462e-01 1.638e-01 2.114 0.034849 *
## CAR_TYPESUV 3.271e-01 1.447e-01 2.260 0.024062 *
## CAR_USEPrivate 1.672e-01 9.503e-02 1.759 0.078871 .
## GENDERM 1.850e-01 1.250e-01 1.479 0.139456
## MARRIEDYes -1.718e-01 7.581e-02 -2.266 0.023684 *
## JOBCLASSotherj 1.987e-01 1.111e-01 1.788 0.074078 .
## JOBCLASSProfessional 4.394e-01 1.453e-01 3.025 0.002566 **
## JOBCLASSBlue Collar 3.615e-01 1.282e-01 2.820 0.004924 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.172686)
##
## Null deviance: 582.36 on 838 degrees of freedom
## Residual deviance: 538.08 on 828 degrees of freedom
## AIC: 16009
##
## Number of Fisher Scoring iterations: 7
Based on the results of parameter estimation in Table, it can be seen that the vast majority of estimates at all levels of parameters pass the significance test. According to the estimated coefficient, BLUEBOOK (vehicle value) has a positive relationship with the claim amount, that is to say, the higher the value of the car, the greater the amount when the claim occurs, and this conclusion is very common sense. In addition, Pickup, Sports Car and SUV have higher claims compared to Sedan and Van. In terms of car valuation, the value of Sports Car (sports car) is much higher than Sedan Sedan (car) and Van (van), and the coefficient of Sports Car is higher than that of other types of cars, that is, Sports Car is the car type with the largest amount of claim in a given car type. In terms of the use of vehicles, compared with official vehicles (Commercial), the claim amount of private vehicles (Private) is higher.
In addition, male drivers make higher claims than women, that is, male drivers are more likely to make higher claims. In terms of driver occupation, based on the amount of claim incurred by priest (Clerical), the amount of claim incurred in the remaining occupation is often higher than that of priest (Clerical).
5.2.3 Model Diagnosis
residual=residuals(glm.log3)
plot(residual,ylab='residual',xlab='y',font.lab=1)As can be seen from the residual map, all the residual values float randomly up and down 0, mostly concentrated on the 0 scale line, indicating that the model is relatively well-fitted. In addition, the iterative weighted least squares method is used to calculate the parameter estimation of the regression model. In order to further judge the quality of the model, the diagnostic map of the regression model is drawn as follows.
par(mfrow = c(2,2))
plot(glm.log3)#par(mfrow = c(1,2))
#influencePlot(glm.log3)
#plot(predict(glm.log3,type="response"),residuals(glm.log3,type="deviance"))
In order to observe the outlier points, leverage value, and strong influence points of the model, this paper uses car in R language and the package to draw a comprehensive diagnostic map. In the graph, the horizontal axis represents the lever value and the vertical axis represents the student chemistry residual value, while the plotted symbol size is proportional to the Cook distance size.
Points above 2 or less than negative 2 in the graph can be considered outlier points, and points with dashed lines above 0.025 to 0.03 in the horizontal axis have high leverage values. The circle size is proportional to the effect, and the points with a large circle may be strong points of disproportionate effects on the estimation of the model parameters. Of the 810 sets of sample data involved in this paper, there are five obvious outliers in the figure, and a very small part is high leverage values. In a comprehensive judgment, the model established in this paper fits quite well.
5.2.4 Comparison of Models
#glm.fit=glm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE+ GENDER +
# MARRIED + JOBCLASS ,data=data1, family=Gamma(link = "log"))
#lm.fit=lm(formula =CLM_AMT ~ BLUEBOOK + CAR_TYPE + CAR_USE+
# GENDER +MARRIED + JOBCLASS , data=data1, )
#summary(lm.fit)
#sumlm=summary(lm.fit)
#deviance(lm.fit);deviance(glm.fit)
#AIC(glm.fit,lm.fit)
#BIC(glm.fit,lm.fit)In the significance of the parameter estimates in the Table of this paper, it is clear that the vast majority of the individual parameter estimates pass the significance test with a confidence level of 0.1.
Based on the common criteria for evaluating the quality of the model: AIC value, BIC value, and the deviation of the sum of squares (Residual deviance). Specifically, AIC values and BIC values are comprehensive evaluations of model goodness of fit and model complexity, and the smaller the value, the better the fit of the model.
6. Conclusion
- Naive Bayes has the highest accuracy, compared with KNN and Decision Tree, which can be a classification model of claims status.
- Generalized gamma regression model can predict the amount of custom’s claims based on the most important feature, INCOME, AGE, and BLUEBOOK, with smaller residuals and better model performance, compared with the linear model.