Tina Pivk, 9. 1. 2023

Insurance Premium Data

mydata <- read.table("C:/Users/TinaP/Desktop/EF/IMB/multivariate analysis/MVA_2022_2023/DN-R/insurance-data/insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Description:

  • unit: one USA policyholder of basic health insurance (year not given)
  • sample size: 1338

Variables:

  • age: The Age of the Policyholder (years)
  • sex: The Gender of the Policyholder (male/female)
  • bmi: The Body Mass Index of the Policyholder (kg/m²)
  • children: Number of Children of the Policyholder
  • smoker: The Policyholder’s Statues of Being a Smoker (yes/no)
  • region: The Region Where the Policyholder Belongs to (southwest/southeast/northwest/northeast)
  • charges: The Premium Charged to the Policyholder (USD)

Data source:

Jain, S. (2021). Insurance Premium Data. Kaggle. Retrieved from: https://www.kaggle.com/datasets/simranjain17/insurance (original source is Census Bureau)

Main Goal of the Analysis

The main goal of the analysis carried out in homework 1 is to see how different variables (age, sex, bmi, number of children, (not) being a smoker, and region) affect the premium charged for a basic health insurance in US. The research question would therefore be does premium for basic health insurance in US differs with regards to age, gender, BMI, number of children, being or not being a smoker, and the region of the policyholder.

Data Manipulation

colSums(is.na(mydata))
##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0
mydata$ID <- seq.int(nrow(mydata))
mydata$sexF <- as.factor(mydata$sex)
mydata$regionF <- as.factor(mydata$region)
mydata$smokerF <- as.factor(mydata$smoker)

Mydata <- mydata[,c(10,1,8,3,4,11,9,7)]

sapply(Mydata[,c(2,4,5,8)], FUN=max)
##      age      bmi children  charges 
##    64.00    53.13     5.00 63770.43
sapply(Mydata[,c(2,4,5,8)], FUN=min)
##      age      bmi children  charges 
##   18.000   15.960    0.000 1121.874

All minimums and maximums are not such that it would indicate them being outliers. The only one could maybe be the maximum premium charge, which is why a histogram of the variable is presented below.

hist(Mydata$charges, 
     main = "Distribution of premium charges", 
     xlab = "Premium charges",
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 65000, by = 1000))

From about 50,000 USD onwards there is not many values, however, it does not give an impression that they are outliers.

head(Mydata)
##     regionF age ID    bmi children smokerF   sexF   charges
## 1 southwest  19  1 27.900        0     yes female 16884.924
## 2 southeast  18  2 33.770        1      no   male  1725.552
## 3 southeast  28  3 33.000        3      no   male  4449.462
## 4 northwest  33  4 22.705        0      no   male 21984.471
## 5 northwest  32  5 28.880        0      no   male  3866.855
## 6 southeast  31  6 25.740        0      no female  3756.622

Descriptive Statistics and Visualisation

#install.packages("pastecs")
suppressPackageStartupMessages(library(pastecs))
round(stat.desc(Mydata[,c(2,4,5,8)]), 2)
##                   age      bmi children      charges
## nbr.val       1338.00  1338.00  1338.00      1338.00
## nbr.null         0.00     0.00   574.00         0.00
## nbr.na           0.00     0.00     0.00         0.00
## min             18.00    15.96     0.00      1121.87
## max             64.00    53.13     5.00     63770.43
## range           46.00    37.17     5.00     62648.55
## sum          52459.00 41027.62  1465.00  17755824.99
## median          39.00    30.40     1.00      9382.03
## mean            39.21    30.66     1.09     13270.42
## SE.mean          0.38     0.17     0.03       331.07
## CI.mean.0.95     0.75     0.33     0.06       649.47
## var            197.40    37.19     1.45 146652372.15
## std.dev         14.05     6.10     1.21     12110.01
## coef.var         0.36     0.20     1.10         0.91

We can see that the range of the age variable is 46 years, defined by a minimum of 18 and a maximum of 64 years. BMI has a range of 37.17kg/m2, defined by a minimum of 15.96 and a maximum of 53.13kg/m2. The range of the number of children is 5, with the minimum being 0 children and the maximum 5 children. Lastly, the range of the premium charges is 62,648.55USD, with the minimum 1,121.87USD and the maximum of 63,770.43USD.

We can also see that the median of age is 39 years, meaning that within the sample 50% is younger and 50% older than 39 years. The median of BMI is 30.40kg/m2, so 50% has a higher BMI than this and 50% lower. (A BMI above 30 already means obese, so more than 50% of policyholders of basic health insurance included in the sample are obese). 50% of those within the sample have less than 1 child and 50% more, and the premium charge is for 50% lower than 9,382.03USD, while higher then this for the other 50%.

On average those within the sample are 39.21 years old with a BMI of 30.66kg/m2, with 1.09 children and a premium charge of 13,270.42USD. Provided the sample is random the true means of the population would be; for age on the interval (38.46, 39.96), for BMI on the interval (30.33kg/m2, 30.99kg/m2), for the number of children on the interval (1.03, 1.15), and for the premium charge on the interval (12,620.95USD; 13,919.89USD) (all at p=0.05).

The variability of variables can be compared through the use of coefficient of variance, which is the highest for the number of children, a bit smaller for premium charges and then relatively quite small for the age and the BMI.

Smoking

#install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
ggplot(Mydata, aes(x=charges, fill=smokerF))+
  geom_histogram(position = 'dodge', binwidth = 1000, colour='gray')+
  ylab('Frequency')+
  xlab('Premium charges')+
  scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=smokerF, y=charges)) +
  geom_boxplot() +
  xlab("Being a smoker or not") + 
  ylab("Premium charges")

As can be expected those that are smokers tend to get higher premium charges. The lowest premium charge of a smoker is higher than the premium charge of at least 75% of nonsmokers. We can also observe that between roughly 20,000 and 40,000USD there are still a few of non-smokers (less than 25% of them), but practically 50% of all smokers. Additionally, the 25% of smokers with highest premium charges are above 40,000USD, where there are no non-smokers anymore. From this we can conclude that if we ran a regression, we would most likely get a significant effect of smoking on the height of premium charge.

Region

ggplot(Mydata, aes(x=charges, fill=regionF))+
  geom_histogram(position = 'dodge', binwidth = 5000, colour='gray')+
  ylab('Frequency')+
  xlab('Premium charges')+
  scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=regionF, y=charges)) +
  geom_boxplot() +
  xlab("Region") + 
  ylab("Premium charges")

Here, we can observe the very similar premium charges between regions, especially for the 50% of those with lowest charges. There are some differences in the third quartile, which is the highest in southeast area and the lowest in southwest area. The same goes also for the maximum premium charged within each area.

Age

#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
scatterplot(charges ~ age, boxplots=FALSE, 
            ylab = "Premium charges", 
            xlab = "Age", 
            smooth = FALSE,
            data = Mydata)

The relationship between the premium charges and age ir represented in a scatterplot, where we can see a slight positive correlation between the two variables, which is to be expected, since with age one is prone to more illnesses.

BMI

scatterplot(charges ~ bmi, boxplots=FALSE, 
            ylab = "Premium charges", 
            xlab = "BMI (kg/m2)", 
            smooth = FALSE,
            data = Mydata)

scatterplot(charges ~ bmi | smokerF, 
            ylab = "Premium charges", 
            xlab = "BMI (kg/m2)", 
            smooth = FALSE,
            data = Mydata)

One would expect that with a higher BMI you get charged a higher basic health premium (except maybe those with BMI lower than 18, who are considerate underweight). We can see that the sample data indicates a positive correlation between the two variables. Moreover, the correlation seem to be much higher for those that are smokers than for the non-smokers. The latter do still have a positive correlation between premium charges and BMI, but only a slight one, whereas the smokers correlation seems quite strong. Additionally, we can see that smokers under 30kg/m2 in BMI (the limit for obesity) are mostly charged less than 30,000USD whereas those with higher BMI are basically all charged more than 30,000USD. The BMI score of 30kg/m2 is for smokers, therefore, (based on this data) a very crucial ‘limit’, but once you are over this limit the change in premium with the change in BMI is lesser. We can conclude that especially smokers should be careful to not go over BMI of 30kg/m2.

Number of children

scatterplot(charges ~ children, boxplots=FALSE, 
            ylab = "Premium charges", 
            xlab = "Number of children", 
            smooth = FALSE,
            data = Mydata)

ggplot(Mydata, aes(x=as.factor(children), y=charges)) +
  geom_boxplot() +
  xlab("Number of children") + 
  ylab("Premium charges")

From this scatterplot we can observe a slight positive correlation between the premium charge and the number of children. From the boxplot, however, we can see that those with five children seem to have lower charges than what we would expect based on others. Below I searched for a possible reason for this.

MydataSY <- subset(Mydata, smokerF == 'yes')
MydataSN <- subset(Mydata, smokerF != 'yes')

table(Mydata$smokerF, Mydata$children)
##      
##         0   1   2   3   4   5
##   no  459 263 185 118  22  17
##   yes 115  61  55  39   3   1
scatterplot(charges ~ children, boxplots=FALSE, 
            ylab = "Premium charges for non-smokers", 
            xlab = "Number of children of non-smokers", 
            smooth = FALSE,
            data = MydataSN)

scatterplot(charges ~ children, boxplots=FALSE, 
            ylab = "Premium charges for smokers", 
            xlab = "Number of children of smokers", 
            smooth = FALSE,
            data = MydataSY)

From the table we can see that there is much less of those with 4 and 5 children in the sample, and a big majority of those does not smoke. As we have already seen before, non-smokers have much lower premiums. Therefore, the lower charges for those with five children are possibly strongly affected by: - Firstly, lesser number of units in this category (if we had more units here they would possibly follow the same ‘trend’ as units in other categories (defined by the number of children) - majority has lower premiums but then some have higher ones, the latter are more probably obtained in a sample if you have more units). - Secondly, there is only one smoker among those with 5 children - who represents 6% of policyholders with 5 children, among those with 4 children 12% smoke, whereas with other the percentage of smoker is around 20%.

Gender

#install.packages("psych")
suppressPackageStartupMessages(library(psych))
## Warning: package 'psych' was built under R version 4.2.2
describeBy(Mydata$age, Mydata$sexF)
## 
##  Descriptive statistics by group 
## group: female
##    vars   n mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 662 39.5 14.05     40   39.36 17.79  18  64    46 0.03    -1.25 0.55
## ------------------------------------------------------------------------------------------ 
## group: male
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 676 38.92 14.05     39   38.66 17.79  18  64    46 0.08    -1.25 0.54
describeBy(Mydata$bmi, Mydata$sexF)
## 
##  Descriptive statistics by group 
## group: female
##    vars   n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 662 30.38 6.05  30.11   30.23 6.13 16.82 48.07 31.25 0.25    -0.27 0.23
## ------------------------------------------------------------------------------------------ 
## group: male
##    vars   n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 676 30.94 6.14  30.69   30.77 6.34 15.96 53.13 37.17 0.32     0.11 0.24
describeBy(Mydata$children, Mydata$sexF)
## 
##  Descriptive statistics by group 
## group: female
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 662 1.07 1.19      1    0.92 1.48   0   5     5 0.93     0.17 0.05
## ------------------------------------------------------------------------------------------ 
## group: male
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 676 1.12 1.22      1    0.96 1.48   0   5     5 0.94      0.2 0.05
table(Mydata$sexF, Mydata$regionF)
##         
##          northeast northwest southeast southwest
##   female       161       164       175       162
##   male         163       161       189       163
table(Mydata$sexF, Mydata$smokerF)
##         
##           no yes
##   female 547 115
##   male   517 159
describeBy(Mydata$charges, Mydata$sexF)
## 
##  Descriptive statistics by group 
## group: female
##    vars   n     mean      sd  median  trimmed     mad     min      max    range skew kurtosis     se
## X1    1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92 1.72     2.71 432.53
## ------------------------------------------------------------------------------------------ 
## group: male
##    vars   n     mean       sd  median trimmed     mad     min      max range skew kurtosis     se
## X1    1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471 1.33     0.79 498.89

From this we can see that there are not big differences between men and women in the number of men/women in the sample, and also in age, BMI, and the number of children. There is also a very similar distribution of men vs women between different regions. However, the share of smokers between women and the share of smokers between men is a bit different, for women 82.6% and for men 76.5%, yet still not extremely different. Therefore, we can compare the premium charges between the two genders. We can observe the difference in the means, for women being 12,569.58USD and for men being 13,956.75USD, so a bit (1,387,17USD) higher for men. Interestingly, women’s mean is lower even though they have about 6 percentage points higher smoker share. The min and max are higher for women, as well as the median, but not much, only by 43.34USD. The letter is more in line with the smoker share.

ggplot(Mydata, aes(x=charges, fill=sexF))+
  geom_histogram(position = 'dodge', binwidth = 1000, colour='gray')+
  ylab('Frequency')+
  xlab('Premium charges')+
  scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=sexF, y=charges)) +
  geom_boxplot() +
  xlab("Gender") + 
  ylab("Premium Charges (USD)")

As mentioned number of children, BMI, age, and region variables are very similar between men and women based on mean, standard deviation, median, max, and min. Moreover, the number of women in the sample is very similar to the number of men. A bit of a difference can be seen in the share of smokers, women having a higher one, yet the third quartile and mean are higher for men (1st quartile and median being very similar). The reason behind this could be discrimination in favour of women, however, a more likely reason is probably a lack of some variables that might also have an impact on the premium charge, e.g. salary.

Conclusion

We found that smokers seem to get charged a higher premium for basic health insurance in US. The region itself is not that big of a factor in different premiums, with the highest premiums being in southeast and the lowest in southwest. With age premiums seem to get higher. A higher BMI seems to positively correlate with the premium charged, where the most crucial is a limit of 30kg/m2 (especially smokers should avoid crossing it). Number of children seems to have a positive correlation with the premium charged, and lastly, men on average seem to get charged higher premiums, however, this result is possibly a consequence of not all relevant variables being included in the data set.