Introduction

In this presentation, we explore an insurance data set with 1338 observations and 7 variables.

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
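
The listing above has the shape of `head()` output followed by `str()` output. As a minimal sketch, the six printed rows can be rebuilt by hand and inspected the same way. The name `df` matches the sampling code later in the deck; how the full 1338-row data set was loaded is not shown, so this small frame is only a stand-in.

```r
# First six rows copied from the printed output above; the full data set
# (1338 rows) is not reproduced here.
df <- data.frame(
  age      = c(19L, 18L, 28L, 33L, 32L, 31L),
  sex      = c("female", "male", "male", "male", "male", "female"),
  bmi      = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  children = c(0L, 1L, 3L, 0L, 0L, 0L),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"),
  charges  = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

head(df)  # first rows of the data frame
str(df)   # column names and types (int, chr, num)
```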

Regression Analysis of BMI vs. Charges

Let’s use BMI as the predictor and charges as the response, fit a simple linear regression, and plot the fitted line:

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 1192.9 + 393.9x\)
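
A fit of this form is what `lm()` produces for a simple linear regression. The sketch below runs it on only the six rows printed in the introduction (retyped as vectors), so its coefficients will not match 1192.9 and 393.9, which come from the full data set. The names `fit`, `b0`, and `b1` are illustrative, and base graphics are used here in place of ggplot2 to keep the sketch free of package dependencies.

```r
# BMI and charges for the six rows printed in the introduction.
bmi     <- c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740)
charges <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

# Ordinary least squares: charges = b0 + b1 * bmi
fit <- lm(charges ~ bmi)
b0 <- unname(coef(fit)[1])  # estimated intercept
b1 <- unname(coef(fit)[2])  # estimated slope

# Scatter plot with the fitted line drawn over it.
plot(bmi, charges, xlab = "BMI", ylab = "Charges")
abline(fit, col = "blue")

# Predicted charges at BMI = 30 using the equation reported above.
1192.9 + 393.9 * 30  # 13009.9
```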

KDE of BMI

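Since BMI is a continuous numerical variable, let’s look at its kernel density estimate (KDE), defined as \(\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} \textbf{K}\left(\frac{x - x_i}{h}\right)\), where \(\textbf{K}\) is the kernel function and \(h\) is the bandwidth.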
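
To make the formula concrete, here is a minimal sketch that evaluates it directly with a Gaussian kernel. The kernel choice, the bandwidth rule `bw.nrd0()` (the default used by R's `density()`), and the helper name `kde_at` are assumptions; the slides do not say which kernel or bandwidth was used. Only the six BMI values printed in the introduction are used, so the curve is not the one for the full data set.

```r
# BMI values for the six rows printed in the introduction.
bmi <- c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740)

# f_hat_h(x) = 1 / (n * h) * sum_i K((x - x_i) / h),
# with K taken to be the standard normal density.
kde_at <- function(x, data, h) {
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (length(data) * h))
}

h <- bw.nrd0(bmi)                       # rule-of-thumb bandwidth
grid <- seq(10, 50, length.out = 200)   # points at which to evaluate the estimate
f_hat <- kde_at(grid, bmi, h)

plot(grid, f_hat, type = "l", xlab = "BMI", ylab = "Estimated density")
# Built-in estimate with the same bandwidth, for visual comparison.
lines(density(bmi, bw = h), col = "red", lty = 2)
```

A larger `h` gives a smoother curve and a smaller `h` a bumpier one, which is why the bandwidth choice matters more than the kernel in practice.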

Annnnnd, a graph

For the sake of completeness!

Let’s look at age vs. BMI using a random sample of 40 observations from the data set.

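library(ggplot2)
# Draw 40 distinct row indices at random (the sample changes on each run unless a seed is set)
rndsmpl <- sample(nrow(df), 40, replace = FALSE)
newdf <- df[rndsmpl, ]
# Age vs. BMI; point size = children, shape = sex, color = region
ggplot(newdf, aes(age, bmi)) +
  geom_point(aes(size = children, shape = sex, color = region))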

Histogram of ages

Let’s look at a histogram of the ages of all 1338 observations.
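
The code behind this plot is not shown. A minimal sketch with base R's `hist()` might look like the following, using only the six ages printed in the introduction; the 5-year bin width and the name `h_age` are assumptions made for illustration.

```r
# Ages for the six rows printed in the introduction.
age <- c(19L, 18L, 28L, 33L, 32L, 31L)

# Bins are (15,20], (20,25], (25,30], (30,35]; the first bin also includes 15.
h_age <- hist(age, breaks = seq(15, 35, by = 5),
              main = "Histogram of ages", xlab = "Age")

h_age$counts  # 2 0 1 3
```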

Barplot of smokers

Finally, a bar plot of the number of smokers and non-smokers.
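
As with the histogram, the plotting code is not shown. One hedged way to produce such a plot in base R is to tabulate the `smoker` column and pass the counts to `barplot()`. The vector below holds only the six values printed in the introduction, and the name `smoker_counts` is illustrative.

```r
# Smoker status for the six rows printed in the introduction.
smoker <- c("yes", "no", "no", "no", "no", "no")

# Count observations in each category, then draw one bar per category.
smoker_counts <- table(smoker)
barplot(smoker_counts, xlab = "Smoker", ylab = "Number of observations")

smoker_counts[["no"]]   # 5
smoker_counts[["yes"]]  # 1
```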