In this presentation, we will be exploring an insurance data set.
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 1192.9 + 393.9x\)
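A fitted line of this form comes from least squares on a single predictor. The slides do not say which variables play the roles of \(x\) and \(y\), so the sketch below assumes `charges` regressed on `bmi`. It uses only the first five rows visible in the `str()` output as a stand-in (`toy` is a hypothetical name), so its coefficients will not match 1192.9 and 393.9.

```r
# Stand-in data: first five rows shown in the str() output.
# Assumption: x = bmi, y = charges; the full data set is not reproduced here.
toy <- data.frame(
  bmi     = c(27.9, 33.8, 33.0, 22.7, 28.9),
  charges = c(16885, 1726, 4449, 21984, 3867)
)

# Closed-form least-squares estimates for one predictor
b1 <- cov(toy$bmi, toy$charges) / var(toy$bmi)  # slope, beta_1 hat
b0 <- mean(toy$charges) - b1 * mean(toy$bmi)    # intercept, beta_0 hat

# lm() should return the same two numbers
fit <- lm(charges ~ bmi, data = toy)
coef(fit)
```

On the full data frame, the same `lm()` call with `data = df` would be the natural way to get the slide's coefficients, if the assumed variables are the right ones.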
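Since BMI is a continuous numerical variable, let’s look at its kernel density estimate (KDE), using:
\(\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)\),
where \(K\) is the kernel, \(h\) is the bandwidth, and \(n\) is the number of observations.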
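To make the formula concrete, here is a minimal sketch that translates it term by term into R. It assumes a Gaussian kernel and an arbitrary bandwidth of `h = 2`; neither choice is stated in the slides. The data are the five BMI values visible in the `str()` output.

```r
# First five BMI values shown in the str() output
bmi <- c(27.9, 33.8, 33.0, 22.7, 28.9)

# Direct translation of the KDE formula with a Gaussian kernel (K = dnorm).
# h = 2 is an illustrative bandwidth, not a value from the slides.
kde <- function(x, data, h) {
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (length(data) * h))
}

# Evaluate the estimate on a grid of BMI values
grid <- seq(15, 45, by = 0.5)
est  <- kde(grid, bmi, h = 2)

# A density estimate should integrate to approximately 1
integrate(kde, lower = 0, upper = 60, data = bmi, h = 2)$value
```

In practice, base R's `density()` computes this kind of estimate on a grid, and `geom_density()` in ggplot2 builds on it.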
Let’s look at age vs bmi by taking a random sample from the data set.
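library(ggplot2)
# Draw 40 distinct row indices so the scatter plot stays readable
rndsmpl <- sample(nrow(df), 40, replace = FALSE)
newdf <- df[rndsmpl, ]
# Age vs. BMI; point size = children, shape = sex, color = region
ggplot(newdf, aes(age, bmi)) +
  geom_point(aes(size = children, shape = sex, color = region))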
Let’s look at the histogram of the ages of all observations
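The code behind this plot is not shown in the slides. A sketch in the same ggplot2 style might look like the following. It uses the ten ages visible in the `str()` output as stand-in data, and both `ages` and the 5-year bin width are illustrative assumptions.

```r
library(ggplot2)

# Stand-in data: the ten ages visible in the str() output
ages <- data.frame(age = c(19, 18, 28, 33, 32, 31, 46, 37, 37, 60))

# binwidth = 5 years is an illustrative choice, not taken from the slides
p_age <- ggplot(ages, aes(age)) +
  geom_histogram(binwidth = 5)
p_age
```

With the full data frame, replacing `ages` by `df` would plot all 1338 observations.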
Finally, a bar plot of smokers
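The plotting code is again not shown. One plausible way to draw it with ggplot2 is sketched below, using the four `smoker` values visible in the `str()` output as stand-in data (`smokers` is a hypothetical name).

```r
library(ggplot2)

# Stand-in data: the four smoker values visible in the str() output
smokers <- data.frame(smoker = c("yes", "no", "no", "no"))

# geom_bar() counts the rows in each category and draws one bar per level
p_smoker <- ggplot(smokers, aes(smoker)) +
  geom_bar()
p_smoker
```

Because `geom_bar()` does the counting itself, no separate `table()` step is needed.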