First, lets define the elements for our problem. In this scenario we need the probability function:

xval<-c(0:3)
prob<- c(0.1,0,0.4,0.5)

pf<- cbind(xval,prob)

We have the same information than before, as we can see

pf
##      xval prob
## [1,]    0  0.1
## [2,]    1  0.0
## [3,]    2  0.4
## [4,]    3  0.5

Now, lets use a library that was developed for other purposes, but that allows for the calculations of convolutions.

library(kSamples)
## Warning: package 'kSamples' was built under R version 4.1.2
## Loading required package: SuppDists
## Warning: package 'SuppDists' was built under R version 4.1.2
dist<- conv(xval,prob,xval,prob)

dist
##      [,1] [,2]
## [1,]    0 0.01
## [2,]    1 0.00
## [3,]    2 0.08
## [4,]    3 0.10
## [5,]    4 0.16
## [6,]    5 0.40
## [7,]    6 0.25

The object dist is the 2-fold convolution. Now, in order to estimate the next convolution we have to split the dist object into vectors.

distx<-dist[,1]
distp<-dist[,2]

dist2<-conv(distx,distp,xval,prob)
dist2
##       [,1]  [,2]
##  [1,]    0 0.001
##  [2,]    1 0.000
##  [3,]    2 0.012
##  [4,]    3 0.015
##  [5,]    4 0.048
##  [6,]    5 0.120
##  [7,]    6 0.139
##  [8,]    7 0.240
##  [9,]    8 0.300
## [10,]    9 0.125

Now, for this new example we work with convolutions. The first part is the create the vector with values and probabilities from the binomial:

nval<-c(0:10)
prob<- 0.3
pn<- rep(0,11)

for (i in 0:11) {
  pn[i]<-dbinom(nval[i],length(nval)-1,prob)
}
pn
##  [1] 0.0282475249 0.1210608210 0.2334744405 0.2668279320 0.2001209490
##  [6] 0.1029193452 0.0367569090 0.0090016920 0.0014467005 0.0001377810
## [11] 0.0000059049

Now, we declare the second distribution in two vectors:

xval<- c(0:3)
px <- c(0.1,0.6, 0.25, 0.05)

And use the same approach:

dist3<- conv(xval,px,nval,pn)
dist3
##       [,1]         [,2]
##  [1,]    0 2.824752e-03
##  [2,]    1 2.905460e-02
##  [3,]    2 1.030458e-01
##  [4,]    3 1.984450e-01
##  [5,]    4 2.445305e-01
##  [6,]    5 2.087452e-01
##  [7,]    6 1.287989e-01
##  [8,]    7 5.869020e-02
##  [9,]    8 1.988088e-02
## [10,]    9 4.970067e-03
## [11,]   10 8.950188e-04
## [12,]   11 1.103232e-04
## [13,]   12 8.365275e-06
## [14,]   13 2.952450e-07

Laboratory for claim frequency

setwd("C:/Users/23043/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/data")
datahm<-read.csv("insample.csv")

We are going to be working with the year 2010. The variable that represents the number of claims is the Freq variable

datahm<- datahm[which(datahm$Year==2010),]
table(datahm$Freq)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  13  14  15  16  17  18  19  30 
## 707 209  86  40  18  12   9   4   6   1   3   2   1   2   1   2   1   1   1   1 
##  39 103 239 
##   1   1   1
summary(datahm$Freq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.241   1.000 239.000

We can create a variable just to indicate if we received a claim from each policy:

datahm$claim<-ifelse(datahm$Freq>0,1,0)
prop.table(table(datahm$claim))
## 
##         0         1 
## 0.6369369 0.3630631

It is always a good idea to start with some plots

library(ggplot2)
library(scales)

ggplot(datahm, aes(x=claim))+geom_bar()

ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Claim",
       y= "Percent")

We know our variable claim is not a continuous varible, but a nominal one so we can change the information in the plot by defining the variable as a factor variable:

datahm$claim<- factor(datahm$claim, levels=c(0,1),
                         labels=c("No claim","Claims"))

ggplot(datahm, aes(x=claim))+geom_histogram(stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Claim",
       y= "Percent",
       title= "Policies with claims (Percent), 2010")

Now, we can plot the frequencies:

ggplot(datahm, aes(x=Freq))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Number of claims",
       y= "Percent",
       title= "Claims per policy (Percent), 2010") 

We have two extreme values, lets get rid of those, just for the class…

datahm<-datahm[which(datahm$Freq<100),]

Create variables for total policies:

totalp<- length(datahm$Freq)
totalp
## [1] 1108

We can repeat the plots…

ggplot(datahm, aes(x=claim))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Claim",
       y= "Percent",
       title= "Policies with claims (Percent), 2010")

ggplot(datahm, aes(x=Freq))+geom_bar(aes(y = 100*(..count..)/sum(..count..)))+
  labs(x = "Number of claims",
       y= "Percent",
       title= "Claims per policy (Percent), 2010") 

We are going to obtain the total number of claims and create a new dataframe to work on

claims<-as.data.frame(table(datahm$Freq))
colnames(claims)<-c("number","Freq")
claims$number<-as.numeric(as.character(claims$number))
claims$Freq<-as.numeric(as.character(claims$Freq))
claims
##    number Freq
## 1       0  707
## 2       1  209
## 3       2   86
## 4       3   40
## 5       4   18
## 6       5   12
## 7       6    9
## 8       7    4
## 9       8    6
## 10      9    1
## 11     10    3
## 12     11    2
## 13     13    1
## 14     14    2
## 15     15    1
## 16     16    2
## 17     17    1
## 18     18    1
## 19     19    1
## 20     30    1
## 21     39    1
claims$claims<-claims$number*claims$Freq
totalc<-sum(claims$claims)
totalc
## [1] 1035

We can start exploring the data we have:

media<-mean(datahm$Freq)
media
## [1] 0.9341155
vari<- var(datahm$Freq)
vari
## [1] 6.101346

Create a variable to determine the proportion, this is the probability of observing the number of claims

claims$Prop<-claims$Freq/totalp
claims
##    number Freq claims         Prop
## 1       0  707      0 0.6380866426
## 2       1  209    209 0.1886281588
## 3       2   86    172 0.0776173285
## 4       3   40    120 0.0361010830
## 5       4   18     72 0.0162454874
## 6       5   12     60 0.0108303249
## 7       6    9     54 0.0081227437
## 8       7    4     28 0.0036101083
## 9       8    6     48 0.0054151625
## 10      9    1      9 0.0009025271
## 11     10    3     30 0.0027075812
## 12     11    2     22 0.0018050542
## 13     13    1     13 0.0009025271
## 14     14    2     28 0.0018050542
## 15     15    1     15 0.0009025271
## 16     16    2     32 0.0018050542
## 17     17    1     17 0.0009025271
## 18     18    1     18 0.0009025271
## 19     19    1     19 0.0009025271
## 20     30    1     30 0.0009025271
## 21     39    1     39 0.0009025271
ggplot(claims, aes(x=number))+geom_line(aes(y=Prop),color="blue")

From lasts session we identify an approach to identify the best distribution to describe some data. Lets try to replicate that perspective:

claims$F1<- 0
claimsM<-as.matrix(claims)
for (i in 2:21) {
  claimsM[i,5]<-claimsM[i,1]*(claimsM[i,4]/claimsM[i-1,4])
}

claims<-as.data.frame(claimsM)
names(claims)<- c("Number","Freq","claims","Prop","F1")


ggplot(claims, aes(x=Number))+geom_line(aes(y=F1),color="blue")

While this is not a straight line, we can see a pattern… at least we can tell the slope is positive… the interception with y axis could be zero or different. I think it is close to zero… however, lets ignore this and assume it is different…

The distinction is that with b=0 we have geometric and b!=0 negative binomial. However, remember that geometric is a special case of negative binomial with r=0.

p1<- claimsM[2,4]
p2<- claimsM[3,4]
p3<- claimsM[4,4]

b<- 6*(p2/p1 - p3/p2)
b
##       Prop 
## -0.3217982
a<- p2/p1-b/2
a
##      Prop 
## 0.5723823
pnb<- 1-a
rnb<- (b/(1-pnb))+1

pnb
##      Prop 
## 0.4276177
rnb
##      Prop 
## 0.4377916

We can see that most likely this is a Geometric

f<-rep(0,21)
x<-claimsM[,1]

for (i in 1:21){
  f[i]<-dnbinom(x[i],rnb,pnb)
}

claimsM<-cbind(claimsM,f)
claims<-as.data.frame(claimsM)
names(claims)<- c("Number","Freq","Claims","Prop","F1","Adjusted")

ggplot(claims, aes(x=Number))+geom_line(aes(y=Prop),color="blue")+
  geom_line(aes(y=Adjusted),color="red")

Activity 1:

Replicate the process for a chosen year different from 2010. Due: Thursday 27th before midnight