Main Goal of the Analysis
The main goal of the analysis carried out in homework 1 is to see how
different variables (age, sex, bmi, number of children, (not) being a
smoker, and region) affect the premium charged for a basic health
insurance in US. The research question would therefore be does premium
for basic health insurance in US differs with regards to age, gender,
BMI, number of children, being or not being a smoker, and the region of
the policyholder.
Data Manipulation
colSums(is.na(mydata))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
mydata$ID <- seq.int(nrow(mydata))
mydata$sexF <- as.factor(mydata$sex)
mydata$regionF <- as.factor(mydata$region)
mydata$smokerF <- as.factor(mydata$smoker)
Mydata <- mydata[,c(10,1,8,3,4,11,9,7)]
sapply(Mydata[,c(2,4,5,8)], FUN=max)
## age bmi children charges
## 64.00 53.13 5.00 63770.43
sapply(Mydata[,c(2,4,5,8)], FUN=min)
## age bmi children charges
## 18.000 15.960 0.000 1121.874
All minimums and maximums are not such that it would indicate them
being outliers. The only one could maybe be the maximum premium charge,
which is why a histogram of the variable is presented below.
hist(Mydata$charges,
main = "Distribution of premium charges",
xlab = "Premium charges",
ylab = "Frequency",
breaks = seq(from = 0, to = 65000, by = 1000))

From about 50,000 USD onwards there is not many values, however, it
does not give an impression that they are outliers.
head(Mydata)
## regionF age ID bmi children smokerF sexF charges
## 1 southwest 19 1 27.900 0 yes female 16884.924
## 2 southeast 18 2 33.770 1 no male 1725.552
## 3 southeast 28 3 33.000 3 no male 4449.462
## 4 northwest 33 4 22.705 0 no male 21984.471
## 5 northwest 32 5 28.880 0 no male 3866.855
## 6 southeast 31 6 25.740 0 no female 3756.622
Descriptive Statistics and Visualisation
#install.packages("pastecs")
suppressPackageStartupMessages(library(pastecs))
round(stat.desc(Mydata[,c(2,4,5,8)]), 2)
## age bmi children charges
## nbr.val 1338.00 1338.00 1338.00 1338.00
## nbr.null 0.00 0.00 574.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 18.00 15.96 0.00 1121.87
## max 64.00 53.13 5.00 63770.43
## range 46.00 37.17 5.00 62648.55
## sum 52459.00 41027.62 1465.00 17755824.99
## median 39.00 30.40 1.00 9382.03
## mean 39.21 30.66 1.09 13270.42
## SE.mean 0.38 0.17 0.03 331.07
## CI.mean.0.95 0.75 0.33 0.06 649.47
## var 197.40 37.19 1.45 146652372.15
## std.dev 14.05 6.10 1.21 12110.01
## coef.var 0.36 0.20 1.10 0.91
We can see that the range of the age variable is 46 years, defined by
a minimum of 18 and a maximum of 64 years. BMI has a range of
37.17kg/m2, defined by a minimum of 15.96 and a maximum of 53.13kg/m2.
The range of the number of children is 5, with the minimum being 0
children and the maximum 5 children. Lastly, the range of the premium
charges is 62,648.55USD, with the minimum 1,121.87USD and the maximum of
63,770.43USD.
We can also see that the median of age is 39 years, meaning that
within the sample 50% is younger and 50% older than 39 years. The median
of BMI is 30.40kg/m2, so 50% has a higher BMI than this and 50% lower.
(A BMI above 30 already means obese, so more than 50% of policyholders
of basic health insurance included in the sample are obese). 50% of
those within the sample have less than 1 child and 50% more, and the
premium charge is for 50% lower than 9,382.03USD, while higher then this
for the other 50%.
On average those within the sample are 39.21 years old with a BMI of
30.66kg/m2, with 1.09 children and a premium charge of 13,270.42USD.
Provided the sample is random the true means of the population would be;
for age on the interval (38.46, 39.96), for BMI on the interval
(30.33kg/m2, 30.99kg/m2), for the number of children on the interval
(1.03, 1.15), and for the premium charge on the interval (12,620.95USD;
13,919.89USD) (all at p=0.05).
The variability of variables can be compared through the use of
coefficient of variance, which is the highest for the number of
children, a bit smaller for premium charges and then relatively quite
small for the age and the BMI.
Smoking
#install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
ggplot(Mydata, aes(x=charges, fill=smokerF))+
geom_histogram(position = 'dodge', binwidth = 1000, colour='gray')+
ylab('Frequency')+
xlab('Premium charges')+
scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=smokerF, y=charges)) +
geom_boxplot() +
xlab("Being a smoker or not") +
ylab("Premium charges")

As can be expected those that are smokers tend to get higher premium
charges. The lowest premium charge of a smoker is higher than the
premium charge of at least 75% of nonsmokers. We can also observe that
between roughly 20,000 and 40,000USD there are still a few of
non-smokers (less than 25% of them), but practically 50% of all smokers.
Additionally, the 25% of smokers with highest premium charges are above
40,000USD, where there are no non-smokers anymore. From this we can
conclude that if we ran a regression, we would most likely get a
significant effect of smoking on the height of premium charge.
Region
ggplot(Mydata, aes(x=charges, fill=regionF))+
geom_histogram(position = 'dodge', binwidth = 5000, colour='gray')+
ylab('Frequency')+
xlab('Premium charges')+
scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=regionF, y=charges)) +
geom_boxplot() +
xlab("Region") +
ylab("Premium charges")

Here, we can observe the very similar premium charges between
regions, especially for the 50% of those with lowest charges. There are
some differences in the third quartile, which is the highest in
southeast area and the lowest in southwest area. The same goes also for
the maximum premium charged within each area.
Age
#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
scatterplot(charges ~ age, boxplots=FALSE,
ylab = "Premium charges",
xlab = "Age",
smooth = FALSE,
data = Mydata)

The relationship between the premium charges and age ir represented
in a scatterplot, where we can see a slight positive correlation between
the two variables, which is to be expected, since with age one is prone
to more illnesses.
BMI
scatterplot(charges ~ bmi, boxplots=FALSE,
ylab = "Premium charges",
xlab = "BMI (kg/m2)",
smooth = FALSE,
data = Mydata)

scatterplot(charges ~ bmi | smokerF,
ylab = "Premium charges",
xlab = "BMI (kg/m2)",
smooth = FALSE,
data = Mydata)

One would expect that with a higher BMI you get charged a higher
basic health premium (except maybe those with BMI lower than 18, who are
considerate underweight). We can see that the sample data indicates a
positive correlation between the two variables. Moreover, the
correlation seem to be much higher for those that are smokers than for
the non-smokers. The latter do still have a positive correlation between
premium charges and BMI, but only a slight one, whereas the smokers
correlation seems quite strong. Additionally, we can see that smokers
under 30kg/m2 in BMI (the limit for obesity) are mostly charged less
than 30,000USD whereas those with higher BMI are basically all charged
more than 30,000USD. The BMI score of 30kg/m2 is for smokers, therefore,
(based on this data) a very crucial ‘limit’, but once you are over this
limit the change in premium with the change in BMI is lesser. We can
conclude that especially smokers should be careful to not go over BMI of
30kg/m2.
Number of children
scatterplot(charges ~ children, boxplots=FALSE,
ylab = "Premium charges",
xlab = "Number of children",
smooth = FALSE,
data = Mydata)

ggplot(Mydata, aes(x=as.factor(children), y=charges)) +
geom_boxplot() +
xlab("Number of children") +
ylab("Premium charges")

From this scatterplot we can observe a slight positive correlation
between the premium charge and the number of children. From the boxplot,
however, we can see that those with five children seem to have lower
charges than what we would expect based on others. Below I searched for
a possible reason for this.
MydataSY <- subset(Mydata, smokerF == 'yes')
MydataSN <- subset(Mydata, smokerF != 'yes')
table(Mydata$smokerF, Mydata$children)
##
## 0 1 2 3 4 5
## no 459 263 185 118 22 17
## yes 115 61 55 39 3 1
scatterplot(charges ~ children, boxplots=FALSE,
ylab = "Premium charges for non-smokers",
xlab = "Number of children of non-smokers",
smooth = FALSE,
data = MydataSN)

scatterplot(charges ~ children, boxplots=FALSE,
ylab = "Premium charges for smokers",
xlab = "Number of children of smokers",
smooth = FALSE,
data = MydataSY)

From the table we can see that there is much less of those with 4 and
5 children in the sample, and a big majority of those does not smoke. As
we have already seen before, non-smokers have much lower premiums.
Therefore, the lower charges for those with five children are possibly
strongly affected by: - Firstly, lesser number of units in this category
(if we had more units here they would possibly follow the same ‘trend’
as units in other categories (defined by the number of children) -
majority has lower premiums but then some have higher ones, the latter
are more probably obtained in a sample if you have more units). -
Secondly, there is only one smoker among those with 5 children - who
represents 6% of policyholders with 5 children, among those with 4
children 12% smoke, whereas with other the percentage of smoker is
around 20%.
Gender
#install.packages("psych")
suppressPackageStartupMessages(library(psych))
## Warning: package 'psych' was built under R version 4.2.2
describeBy(Mydata$age, Mydata$sexF)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 662 39.5 14.05 40 39.36 17.79 18 64 46 0.03 -1.25 0.55
## ------------------------------------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 676 38.92 14.05 39 38.66 17.79 18 64 46 0.08 -1.25 0.54
describeBy(Mydata$bmi, Mydata$sexF)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 662 30.38 6.05 30.11 30.23 6.13 16.82 48.07 31.25 0.25 -0.27 0.23
## ------------------------------------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 676 30.94 6.14 30.69 30.77 6.34 15.96 53.13 37.17 0.32 0.11 0.24
describeBy(Mydata$children, Mydata$sexF)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 662 1.07 1.19 1 0.92 1.48 0 5 5 0.93 0.17 0.05
## ------------------------------------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 676 1.12 1.22 1 0.96 1.48 0 5 5 0.94 0.2 0.05
table(Mydata$sexF, Mydata$regionF)
##
## northeast northwest southeast southwest
## female 161 164 175 162
## male 163 161 189 163
table(Mydata$sexF, Mydata$smokerF)
##
## no yes
## female 547 115
## male 517 159
describeBy(Mydata$charges, Mydata$sexF)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92 1.72 2.71 432.53
## ------------------------------------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471 1.33 0.79 498.89
From this we can see that there are not big differences between men
and women in the number of men/women in the sample, and also in age,
BMI, and the number of children. There is also a very similar
distribution of men vs women between different regions. However, the
share of smokers between women and the share of smokers between men is a
bit different, for women 82.6% and for men 76.5%, yet still not
extremely different. Therefore, we can compare the premium charges
between the two genders. We can observe the difference in the means, for
women being 12,569.58USD and for men being 13,956.75USD, so a bit
(1,387,17USD) higher for men. Interestingly, women’s mean is lower even
though they have about 6 percentage points higher smoker share. The min
and max are higher for women, as well as the median, but not much, only
by 43.34USD. The letter is more in line with the smoker share.
ggplot(Mydata, aes(x=charges, fill=sexF))+
geom_histogram(position = 'dodge', binwidth = 1000, colour='gray')+
ylab('Frequency')+
xlab('Premium charges')+
scale_fill_brewer(palette = 'PRGn')

ggplot(Mydata, aes(x=sexF, y=charges)) +
geom_boxplot() +
xlab("Gender") +
ylab("Premium Charges (USD)")

As mentioned number of children, BMI, age, and region variables are
very similar between men and women based on mean, standard deviation,
median, max, and min. Moreover, the number of women in the sample is
very similar to the number of men. A bit of a difference can be seen in
the share of smokers, women having a higher one, yet the third quartile
and mean are higher for men (1st quartile and median being very
similar). The reason behind this could be discrimination in favour of
women, however, a more likely reason is probably a lack of some
variables that might also have an impact on the premium charge,
e.g. salary.