First, lets define the elements for our problem. In this scenario we need the probability function:
xval<-c(0:3)
prob<- c(0.1,0,0.4,0.5)
pf<- cbind(xval,prob)
We have the same information than before, as we can see
pf
## xval prob
## [1,] 0 0.1
## [2,] 1 0.0
## [3,] 2 0.4
## [4,] 3 0.5
Now, lets use a library that was developed for other purposes, but that allows for the calculations of convolutions.
library(kSamples)
## Warning: package 'kSamples' was built under R version 4.1.2
## Loading required package: SuppDists
## Warning: package 'SuppDists' was built under R version 4.1.2
dist<- conv(xval,prob,xval,prob)
dist
## [,1] [,2]
## [1,] 0 0.01
## [2,] 1 0.00
## [3,] 2 0.08
## [4,] 3 0.10
## [5,] 4 0.16
## [6,] 5 0.40
## [7,] 6 0.25
The object dist is the 2-fold convolution. Now, in order to estimate the next convolution we have to split the dist object into vectors.
distx<-dist[,1]
distp<-dist[,2]
dist2<-conv(distx,distp,xval,prob)
dist2
## [,1] [,2]
## [1,] 0 0.001
## [2,] 1 0.000
## [3,] 2 0.012
## [4,] 3 0.015
## [5,] 4 0.048
## [6,] 5 0.120
## [7,] 6 0.139
## [8,] 7 0.240
## [9,] 8 0.300
## [10,] 9 0.125
Now, for this new example we work with convolutions. The first part is the create the vector with values and probabilities from the binomial:
nval<-c(0:10)
prob<- 0.3
pn<- rep(0,11)
for (i in 0:11) {
pn[i]<-dbinom(nval[i],length(nval)-1,prob)
}
pn
## [1] 0.0282475249 0.1210608210 0.2334744405 0.2668279320 0.2001209490
## [6] 0.1029193452 0.0367569090 0.0090016920 0.0014467005 0.0001377810
## [11] 0.0000059049
Now, we declare the second distribution in two vectors:
xval<- c(0:3)
px <- c(0.1,0.6, 0.25, 0.05)
And use the same approach:
dist3<- conv(xval,px,nval,pn)
dist3
## [,1] [,2]
## [1,] 0 2.824752e-03
## [2,] 1 2.905460e-02
## [3,] 2 1.030458e-01
## [4,] 3 1.984450e-01
## [5,] 4 2.445305e-01
## [6,] 5 2.087452e-01
## [7,] 6 1.287989e-01
## [8,] 7 5.869020e-02
## [9,] 8 1.988088e-02
## [10,] 9 4.970067e-03
## [11,] 10 8.950188e-04
## [12,] 11 1.103232e-04
## [13,] 12 8.365275e-06
## [14,] 13 2.952450e-07
Laboratory for claim frequency
setwd("C:/Users/23043/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/data")
datahm<-read.csv("insample.csv")
We are going to be working with the year 2010. The variable that represents the number of claims is the Freq variable
datahm<- datahm[which(datahm$Year==2010),]
table(datahm$Freq)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 30
## 707 209 86 40 18 12 9 4 6 1 3 2 1 2 1 2 1 1 1 1
## 39 103 239
## 1 1 1
summary(datahm$Freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.241 1.000 239.000
We can create a variable just to indicate if we received a claim from each policy:
datahm$claim<-ifelse(datahm$Freq>0,1,0)
prop.table(table(datahm$claim))
##
## 0 1
## 0.6369369 0.3630631
It is always a good idea to start with some plots
library(ggplot2)
library(scales)
ggplot(datahm, aes(x=claim))+geom_bar()
ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Claim",
y= "Percent")
We know our variable claim is not a continuous varible, but a nominal one so we can change the information in the plot by defining the variable as a factor variable:
datahm$claim<- factor(datahm$claim, levels=c(0,1),
labels=c("No claim","Claims"))
ggplot(datahm, aes(x=claim))+geom_histogram(stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Claim",
y= "Percent",
title= "Policies with claims (Percent), 2010")
Now, we can plot the frequencies:
ggplot(datahm, aes(x=Freq))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Number of claims",
y= "Percent",
title= "Claims per policy (Percent), 2010")
We have two extreme values, lets get rid of those, just for the class…
datahm<-datahm[which(datahm$Freq<100),]
Create variables for total policies:
totalp<- length(datahm$Freq)
totalp
## [1] 1108
We can repeat the plots…
ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Claim",
y= "Percent",
title= "Policies with claims (Percent), 2010")
ggplot(datahm, aes(x=Freq))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
labs(x = "Number of claims",
y= "Percent",
title= "Claims per policy (Percent), 2010")
We are going to obtain the total number of claims and create a new dataframe to work on
claims<-as.data.frame(table(datahm$Freq))
colnames(claims)<-c("number","Freq")
claims$number<-as.numeric(as.character(claims$number))
claims$Freq<-as.numeric(as.character(claims$Freq))
claims
## number Freq
## 1 0 707
## 2 1 209
## 3 2 86
## 4 3 40
## 5 4 18
## 6 5 12
## 7 6 9
## 8 7 4
## 9 8 6
## 10 9 1
## 11 10 3
## 12 11 2
## 13 13 1
## 14 14 2
## 15 15 1
## 16 16 2
## 17 17 1
## 18 18 1
## 19 19 1
## 20 30 1
## 21 39 1
claims$claims<-claims$number*claims$Freq
totalc<-sum(claims$claims)
totalc
## [1] 1035
We can start exploring the data we have:
media<-mean(datahm$Freq)
media
## [1] 0.9341155
vari<- var(datahm$Freq)
vari
## [1] 6.101346
Create a variable to determine the proportion, this is the probability of observing the number of claims
claims$Prop<-claims$Freq/totalp
claims
## number Freq claims Prop
## 1 0 707 0 0.6380866426
## 2 1 209 209 0.1886281588
## 3 2 86 172 0.0776173285
## 4 3 40 120 0.0361010830
## 5 4 18 72 0.0162454874
## 6 5 12 60 0.0108303249
## 7 6 9 54 0.0081227437
## 8 7 4 28 0.0036101083
## 9 8 6 48 0.0054151625
## 10 9 1 9 0.0009025271
## 11 10 3 30 0.0027075812
## 12 11 2 22 0.0018050542
## 13 13 1 13 0.0009025271
## 14 14 2 28 0.0018050542
## 15 15 1 15 0.0009025271
## 16 16 2 32 0.0018050542
## 17 17 1 17 0.0009025271
## 18 18 1 18 0.0009025271
## 19 19 1 19 0.0009025271
## 20 30 1 30 0.0009025271
## 21 39 1 39 0.0009025271
ggplot(claims, aes(x=number))+geom_line(aes(y=Prop),color="blue")
From lasts session we identify an approach to identify the best distribution to describe some data. Lets try to replicate that perspective:
claims$F1<- 0
claimsM<-as.matrix(claims)
for (i in 2:21) {
claimsM[i,5]<-claimsM[i,1]*(claimsM[i,4]/claimsM[i-1,4])
}
claims<-as.data.frame(claimsM)
names(claims)<- c("Number","Freq","claims","Prop","F1")
ggplot(claims, aes(x=Number))+geom_line(aes(y=F1),color="blue")
While this is not a straight line, we can see a pattern… at least we can tell the slope is positive… the interception with y axis could be zero or different. I think it is close to zero… however, lets ignore this and assume it is different…
The distinction is that with b=0 we have geometric and b!=0 negative binomial. However, remember that geometric is a special case of negative binomial with r=0.
p1<- claimsM[2,4]
p2<- claimsM[3,4]
p3<- claimsM[4,4]
b<- 6*(p2/p1 - p3/p2)
b
## Prop
## -0.3217982
a<- p2/p1-b/2
a
## Prop
## 0.5723823
pnb<- 1-a
rnb<- (b/(1-pnb))+1
pnb
## Prop
## 0.4276177
rnb
## Prop
## 0.4377916
We can see that most likely this is a Geometric
f<-rep(0,21)
x<-claimsM[,1]
for (i in 1:21){
f[i]<-dnbinom(x[i],rnb,pnb)
}
claimsM<-cbind(claimsM,f)
claims<-as.data.frame(claimsM)
names(claims)<- c("Number","Freq","Claims","Prop","F1","Adjusted")
ggplot(claims, aes(x=Number))+geom_line(aes(y=Prop),color="blue")+
geom_line(aes(y=Adjusted),color="red")
Activity 1:
Replicate the process for a chosen year different from 2010. Due: Thursday 27th before midnight