Problem Statement
Investigate how attributes of insured individuals, such as age, gender, body mass index, number of children, smoking habits, and region, affect premium rates. By analyzing a detailed dataset, we seek to identify the key predictors of premium costs and provide insights that promote equitable pricing in the health insurance industry.
Describe the variables:
Dependent Variable: Health Insurance Premium/Cost
Independent Variables: Age, Gender, Body Mass Index (BMI), Number of children, Smoking status, and region.
Step (1) Install & Load Packages
#install.packages("ggplot2")
#install.packages("cluster")
#install.packages("ggpubr")
#install.packages("tidyverse")
#install.packages("gclus")
#install.packages("readxl")
library(ggplot2)
library(cluster)
library(ggpubr) #Helps Us Create Results that are Publication Ready for a Report
library(tidyverse) #For Data Manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gclus) #Allows Us to Create a Scatter Plot
library(readxl) #Helps Us Read Excel Files
theme_set(theme_pubr()) #Gives a Clear Plot
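The install lines above are commented out, so it can help to check which packages are actually missing before loading them. The following is a minimal sketch; the names required_pkgs and missing_pkgs are our own, and the install call is left commented so nothing is installed by accident.

```r
# Packages this report loads in Step (1)
required_pkgs <- c("ggplot2", "cluster", "ggpubr", "tidyverse", "gclus", "readxl")

# requireNamespace() returns FALSE when a package is not installed
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```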
Step (2) Import the Data
cost_df <- read_excel("Project.xlsx")
head(cost_df) #Gives Us a Snapshot of Our Data
## # A tibble: 6 × 7
## age sex bmi children smoker region charges
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 1 27.9 0 1 4 16885.
## 2 18 0 33.8 1 0 3 1726.
## 3 28 0 33 3 0 3 4449.
## 4 33 0 22.7 0 0 2 21984.
## 5 32 0 28.9 0 0 2 3867.
## 6 31 1 25.7 0 0 3 3757.
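Because cor() and lm() in the later steps need numeric columns with no missing values, a quick structural check right after import is useful. The sketch below rebuilds only the six rows printed by head(cost_df), with charges rounded as displayed, so it runs without Project.xlsx; the name cost_head is our own, and the same checks would apply to the full cost_df.

```r
# First six rows as printed by head(cost_df); charges are rounded as displayed
cost_head <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c(1, 0, 0, 0, 0, 1),
  bmi      = c(27.9, 33.8, 33, 22.7, 28.9, 25.7),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c(1, 0, 0, 0, 0, 0),
  region   = c(4, 3, 3, 2, 2, 3),
  charges  = c(16885, 1726, 4449, 21984, 3867, 3757)
)

str(cost_head)                                    # column types and dimensions
na_counts   <- colSums(is.na(cost_head))          # missing values per column
all_numeric <- all(vapply(cost_head, is.numeric, logical(1)))

print(na_counts)
print(all_numeric)
```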
Step (3) Descriptive Statistics
summary(cost_df) #Summarizes the Data
## age sex bmi children
## Min. :18.00 Min. :0.0000 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 1st Qu.:0.0000 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Median :0.0000 Median :30.40 Median :1.000
## Mean :39.21 Mean :0.4948 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :1.0000 Max. :53.13 Max. :5.000
## smoker region charges
## Min. :0.0000 Min. :1.000 Min. : 1122
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.: 4740
## Median :0.0000 Median :3.000 Median : 9382
## Mean :0.2048 Mean :2.516 Mean :13270
## 3rd Qu.:0.0000 3rd Qu.:3.000 3rd Qu.:16640
## Max. :1.0000 Max. :4.000 Max. :63770
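Interpretation - On average, people are charged $13,270 a year for insurance. The median is lower at $9,382, so the charges are right-skewed: a smaller group of people with very high charges pulls the mean up.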
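The gap between the mean and the median of charges can be made concrete with a little arithmetic on the numbers printed by summary(cost_df). This is only a sketch using the reported quartiles; the 1.5 * IQR fence is the usual boxplot rule of thumb, which we bring in as an assumption rather than something the report itself uses.

```r
# Values copied from the summary(cost_df) output for charges
q1         <- 4740
med        <- 9382
avg        <- 13270
q3         <- 16640
max_charge <- 63770

mean_median_gap <- avg - med          # 3888: mean sits well above the median
iqr             <- q3 - q1            # 11900
upper_fence     <- q3 + 1.5 * iqr     # 34490

# TRUE: the maximum charge is far beyond the upper fence
has_high_outliers <- max_charge > upper_fence
print(c(gap = mean_median_gap, iqr = iqr, upper_fence = upper_fence))
print(has_high_outliers)
```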
Step (4) Data Visualization -> Creating a Scatter Plot
pairs(~age + sex + bmi + children + smoker + region + charges, data = cost_df) #scatter plot

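Interpretation - We can see some visual correlation between some factors. However, the categorical factors (sex, smoker, region) and the child count take only a few values, so their points fall in bands, which makes it hard to tell from the plot alone whether they are truly correlated with charges.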
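For the categorical factors, comparing average charges per category is often easier to read than a scatter plot. The sketch below uses simulated data (sim_df is hypothetical, with effect sizes loosely borrowed from the regression output later in this report), not the Project.xlsx data.

```r
# Simulated stand-in for cost_df; NOT the real dataset
set.seed(42)
n <- 200
sim_df <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2),
  region = sample(1:4, n, replace = TRUE)
)
sim_df$charges <- 2000 + 257 * sim_df$age + 23800 * sim_df$smoker +
  rnorm(n, sd = 6000)

# Mean charges within each category
mean_by_smoker <- tapply(sim_df$charges, sim_df$smoker, mean)
mean_by_region <- tapply(sim_df$charges, sim_df$region, mean)

print(round(mean_by_smoker))
print(round(mean_by_region))

# boxplot(charges ~ smoker, data = sim_df)  # draws the same comparison
```

With the real data, the same two tapply() calls on cost_df would show how far apart the smoker and non-smoker groups are.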
Step (5) Create a Correlation matrix
corr <- cor(cost_df) #Correlation
corr # Outputs the Correlation
## age sex bmi children smoker
## age 1.000000000 0.020855872 0.109271882 0.04246900 -0.025018752
## sex 0.020855872 1.000000000 -0.046371151 -0.01716298 -0.076184817
## bmi 0.109271882 -0.046371151 1.000000000 0.01275890 0.003750426
## children 0.042468999 -0.017162978 0.012758901 1.00000000 0.007673120
## smoker -0.025018752 -0.076184817 0.003750426 0.00767312 1.000000000
## region 0.002127313 -0.004588385 0.157565849 0.01656945 -0.002180682
## charges 0.299008194 -0.057292062 0.198340969 0.06799823 0.787251431
## region charges
## age 0.002127313 0.299008194
## sex -0.004588385 -0.057292062
## bmi 0.157565849 0.198340969
## children 0.016569446 0.067998227
## smoker -0.002180682 0.787251431
## region 1.000000000 -0.006208235
## charges -0.006208235 1.000000000
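Interpretation - No multicollinearity is evident: the largest correlation between any two independent variables is 0.16 (bmi and region). Smoker has the strongest correlation with charges (0.79), followed by age (0.30) and bmi (0.20); sex, children, and region are only weakly correlated with charges.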
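Pairwise correlations only compare two variables at a time, so a common follow-up check for multicollinearity is the variance inflation factor (VIF), computed as 1 / (1 - R^2) from regressing each predictor on the others. The sketch below computes it in base R on simulated data (sim_df is hypothetical, not the Project.xlsx data); the cutoff of 5 is a rule of thumb we are assuming, not a value from this report.

```r
# Simulated stand-in for the predictors in cost_df; NOT the real dataset
set.seed(42)
n <- 200
sim_df <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = rbinom(n, 1, 0.5),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = rbinom(n, 1, 0.2),
  region   = sample(1:4, n, replace = TRUE)
)

predictors <- names(sim_df)

# VIF for one predictor: regress it on all the others, then 1 / (1 - R^2)
vif <- vapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim_df))$r.squared
  1 / (1 - r2)
}, numeric(1))

print(round(vif, 3))
any_high_vif <- any(vif > 5)   # rule-of-thumb cutoff (assumption)
print(any_high_vif)
```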
Step (6) Building a Regression Equation
model <-lm(charges ~ age + sex + bmi + children + smoker + region, data = cost_df)
summary(model)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = cost_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11343 -2807 -1017 1408 29752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11592.92 994.93 -11.652 < 2e-16 ***
## age 257.29 11.89 21.647 < 2e-16 ***
## sex 131.11 332.81 0.394 0.693681
## bmi 332.57 27.72 11.997 < 2e-16 ***
## children 479.37 137.64 3.483 0.000513 ***
## smoker 23820.43 411.84 57.839 < 2e-16 ***
## region -353.64 151.93 -2.328 0.020077 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1331 degrees of freedom
## Multiple R-squared: 0.7507, Adjusted R-squared: 0.7496
## F-statistic: 668.1 on 6 and 1331 DF, p-value: < 2.2e-16
Interpretation - Sex is not a significant factor (p-value = 0.694, which is greater than our alpha of 0.05), therefore we will drop it from our estimated regression equation.
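Dropping a variable can also be checked by comparing the model with and without it using anova(). The sketch below does this on simulated data (sim_df is hypothetical, with effects loosely based on the coefficients above and no true sex effect), so its numbers will not match the report.

```r
# Simulated stand-in for cost_df; NOT the real dataset
set.seed(42)
n <- 200
sim_df <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = rbinom(n, 1, 0.5),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = rbinom(n, 1, 0.2),
  region   = sample(1:4, n, replace = TRUE)
)
sim_df$charges <- -11500 + 257 * sim_df$age + 332 * sim_df$bmi +
  478 * sim_df$children + 23800 * sim_df$smoker + rnorm(n, sd = 6000)

full    <- lm(charges ~ age + sex + bmi + children + smoker + region, data = sim_df)
reduced <- lm(charges ~ age + bmi + children + smoker + region, data = sim_df)

# Partial F-test: does adding sex back improve the fit?
cmp <- anova(reduced, full)
print(cmp)

# For a single dropped term, the F statistic equals the squared t value of sex
t_sex <- summary(full)$coefficients["sex", "t value"]
print(c(F = cmp$F[2], t_squared = t_sex^2))
```

Because only one term is removed, this test gives the same p-value as the t-test for sex in summary(full), so it confirms the decision rather than adding new evidence.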
Step (7) Build Estimated Regression Equation Using Only Significant Factors
model <-lm(charges ~ age + bmi + children + smoker + region, data = cost_df)
summary(model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = cost_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11404 -2805 -992 1400 29694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11513.40 973.93 -11.822 < 2e-16 ***
## age 257.41 11.88 21.670 < 2e-16 ***
## bmi 332.04 27.68 11.995 < 2e-16 ***
## children 478.44 137.58 3.478 0.000522 ***
## smoker 23808.21 410.54 57.992 < 2e-16 ***
## region -353.45 151.88 -2.327 0.020104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6058 on 1332 degrees of freedom
## Multiple R-squared: 0.7507, Adjusted R-squared: 0.7498
## F-statistic: 802.2 on 5 and 1332 DF, p-value: < 2.2e-16
Interpretation: Ŷ = -11513.40 + 257.41 * Age + 332.04 * BMI + 478.44 * Children + 23808.21 * Smoker - 353.45 * Region
The model explains about 75% of the variation in charges (adjusted R^2 = 0.7498). We can conclude that our overall model is statistically significant, as the p-value of the F-test (< 2.2e-16) is less than our alpha of 0.05.
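To show how the estimated equation is used, the sketch below plugs one hypothetical person into it: age 40, BMI 30, two children, region code 2, once as a smoker and once as a non-smoker. The coefficients are copied from the Step (7) output; the profile and the helper name predict_charges are our own assumptions.

```r
# Coefficients copied from the Step (7) regression output
coefs <- c(intercept = -11513.40, age = 257.41, bmi = 332.04,
           children = 478.44, smoker = 23808.21, region = -353.45)

# Hypothetical helper that evaluates the estimated regression equation
predict_charges <- function(age, bmi, children, smoker, region) {
  unname(coefs["intercept"] + coefs["age"] * age + coefs["bmi"] * bmi +
    coefs["children"] * children + coefs["smoker"] * smoker +
    coefs["region"] * region)
}

pred_smoker    <- predict_charges(40, 30, 2, smoker = 1, region = 2)
pred_nonsmoker <- predict_charges(40, 30, 2, smoker = 0, region = 2)

print(pred_smoker)                   # 32802.39
print(pred_nonsmoker)                # 8994.18
print(pred_smoker - pred_nonsmoker)  # 23808.21, the smoker coefficient
```

The difference between the two predictions is exactly the smoker coefficient, which is what "holding the other variables constant" means in practice.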
Step (8) Statistical Assumptions:
Alpha = 0.05. We assume that each independent variable has a linear relationship with the dependent variable (health insurance premium).
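These assumptions can be checked rather than only stated. The sketch below shows two possible checks on simulated data (sim_df is hypothetical, not the Project.xlsx data): a Shapiro-Wilk test on the residuals, and a comparison of region entered as a number (codes 1 to 4, which assumes equal steps between regions) against region entered as a category with factor(). Neither check was run in this report, so treat this as a suggestion.

```r
# Simulated stand-in for cost_df; NOT the real dataset
set.seed(42)
n <- 200
sim_df <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = rbinom(n, 1, 0.2),
  region   = sample(1:4, n, replace = TRUE)
)
sim_df$charges <- -11500 + 257 * sim_df$age + 332 * sim_df$bmi +
  478 * sim_df$children + 23800 * sim_df$smoker + rnorm(n, sd = 6000)

# Region as a numeric code (as in this report) vs. as a category
fit_numeric <- lm(charges ~ age + bmi + children + smoker + region, data = sim_df)
fit_factor  <- lm(charges ~ age + bmi + children + smoker + factor(region),
                  data = sim_df)

# Does treating region as a category fit better than a straight line in the code?
region_test <- anova(fit_numeric, fit_factor)
print(region_test)

# Normality of residuals (a small p-value suggests non-normal residuals)
normality <- shapiro.test(resid(fit_numeric))
print(normality)
```

On the real model, plot(model, which = 1) would additionally draw residuals against fitted values, which is the usual visual check for linearity and constant variance.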
Step (9) Conclusion & Recommendation:
Smoking status (whether or not you are a smoker) has the largest effect in dollars out of all variables: holding the other factors constant, smokers are charged about $23,808 more per year than non-smokers, so smoking is associated with much higher insurance costs. Accurate data and effective visualizations, such as scatter plot matrices, enhance these insights. Age, BMI, and smoking status are the key drivers of insurance costs, so we recommend focusing on health programs to lower premiums.