Background

It is very unlikely that two drivers who have the same auto insurance with the same insurance company will have the same monthly premium since there are many factors (other than the type of insurance) that are taken into account when the monthly premium is calculated.

In this exercise we will explore some of those factors by analyzing data that were collected from a random sample of 50 drivers insured with a certain company and having similar auto insurance coverage. For each driver the monthly premium was recorded along with other relevant information such as gender, age, driving experience, history of auto accidents, model and age of the car. Note that the data set for this exercise contains only a subset of the variables.

Raw Data

The data record basic characteristics of the drivers, as well as their premiums.

rm(list=ls())
load("auto_premiums.RData")
x<-data
head(x)
##   Experience Gender Premium
## 1          1      1      73
## 2          6      0      74
## 3         15      1      37
## 4         20      0      45
## 5          3      0      68
## 6         17      0      71

Q2. It is well known that premiums for males (group 1) are generally higher than those for females (group 2). Do the data provide significant evidence to support that? (In the data, Gender is coded 0 for males and 1 for females.)

plot(factor(x$Gender), x$Premium, main="Premiums by Gender", xlab="Gender", ylab="Premiums", names=c("Male", "Female"))

tapply(x$Premium, factor(x$Gender), summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45.00   62.00   68.00   69.03   80.00   92.00 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   45.00   50.00   54.62   60.00   88.00

We see that the two groups have roughly similar spreads, but the male premiums are centered higher (median 68 vs. 50). In fact, Q1 for males (62) is higher than Q3 for females (60): about 75% of males pay at least 62, which is more than about 75% of females pay.
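The quartile comparison can be reproduced with a short sketch. Because the .RData file is not available here, the block rebuilds only the six rows shown in the head(x) output, so its quartiles are illustrative and will not match the full-sample summary. The names x6, GenderLabel and quartiles are hypothetical, and the coding 0 = Male, 1 = Female is an assumption inferred from the plot labels.

```r
# First six rows of the data, copied from the head(x) output;
# illustration only (the full sample has 50 drivers).
x6 <- data.frame(
  Experience = c(1, 6, 15, 20, 3, 17),
  Gender     = c(1, 0, 1, 0, 0, 0),
  Premium    = c(73, 74, 37, 45, 68, 71)
)

# Assumed coding, inferred from the plot labels: 0 = Male, 1 = Female.
x6$GenderLabel <- factor(x6$Gender, levels = c(0, 1),
                         labels = c("Male", "Female"))

# Quartiles of Premium within each gender (extra arguments go to quantile)
quartiles <- tapply(x6$Premium, x6$GenderLabel, quantile,
                    probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Quartiles reported by summary() for the full sample
q1_male   <- 62
q3_female <- 60
print(q1_male > q3_female)
```

Using a labeled factor instead of the raw 0/1 codes makes the printed output self-describing, which avoids mixing up the two groups when reading the summaries.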

t.test(x$Premium~x$Gender, alternative="greater")
## 
##  Welch Two Sample t-test
## 
## data:  x$Premium by x$Gender
## t = 3.5744, df = 36.221, p-value = 0.0005083
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  7.607671      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        69.03448        54.61905

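The one-sided Welch two-sample t test gives a small p-value (0.0005), so we reject the null hypothesis of equal means at the 5% level. The data provide significant evidence that the mean premium for males (group 0, mean 69.03) is higher than the mean premium for females (group 1, mean 54.62). Note that this shows an association between gender and premium in this sample, not necessarily a causal effect.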
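To make the test less of a black box, here is a minimal sketch of how the Welch statistic and its degrees of freedom can be computed by hand and compared with t.test. The helper welch_t is hypothetical, and the data are only the six rows shown in the head(x) output, so the numbers differ from the full-sample result above.

```r
# Hypothetical helper: Welch t statistic, Welch-Satterthwaite degrees of
# freedom, and the one-sided p-value for H1: mean(a) > mean(b).
welch_t <- function(a, b) {
  va <- var(a) / length(a)   # squared standard error of mean(a)
  vb <- var(b) / length(b)   # squared standard error of mean(b)
  t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)
  df <- (va + vb)^2 /
    (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
  c(t = t_stat, df = df, p_greater = pt(t_stat, df, lower.tail = FALSE))
}

# First six rows from head(x); illustration only, not the full sample.
premium <- c(73, 74, 37, 45, 68, 71)
gender  <- c(1, 0, 1, 0, 0, 0)

# Group 0 minus group 1, the same order the formula interface uses
by_hand  <- welch_t(premium[gender == 0], premium[gender == 1])
built_in <- t.test(premium ~ gender, alternative = "greater")

print(by_hand)
print(c(built_in$statistic, built_in$parameter, p = built_in$p.value))
```

With the formula interface, t.test subtracts the second group level from the first, so alternative="greater" tests whether the mean of group 0 exceeds the mean of group 1.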