Understanding How Personal Attributes Affect the Cost of Insurance Using Multiple Linear Regression

Project Introduction

Hospitalization and prescription drugs are expensive, and health insurance exists to offset those costs. 92% of the US population has some form of health insurance, with 6% having personal health insurance. Regular health insurance is provided by an employer or by the government, while personal health insurance is bought directly by the individual to cover the costs of healthcare. When you have personal health insurance, the high costs of healthcare can be covered or greatly reduced.

Problem Statement

This project investigates how attributes of insured individuals, such as age, gender, body mass index (BMI), number of children, smoking habits, and region, affect premium rates. By analyzing a detailed dataset, we seek to identify key predictors of premium costs and provide insights that promote equitable pricing in the health insurance industry.

Project Objectives

Uncover the factors that drive insurance prices up, promote a more affordable and transparent insurance landscape, and provide actionable insights that insurers can use in their risk assessment frameworks.

Use the Multiple Linear Regression Method to Build a Regression Model

Describe the variables:

Dependent Variable: Health Insurance Premium/Cost
Independent Variables: Age, Gender, Body Mass Index (BMI), Number of Children, Smoking Status, and Region.

Step (1) Install & Load Packages

#install.packages("ggplot2")
#install.packages("cluster")
#install.packages("ggpubr")
#install.packages("tidyverse")
#install.packages("gclus")
#install.packages("readxl")

library(ggplot2)
library(cluster)
library(ggpubr) #Helps Us Create Results that are Publication Ready for a Report
library(tidyverse) #For Data Manipulation 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gclus) #Allows Us to Create a Scatter Plot
library(readxl) #Helps Us Read Excel Files 

theme_set(theme_pubr()) #Gives a Clear Plot

Step (2) Import the Data

cost_df <- read_excel("Project.xlsx")
head(cost_df) #Gives Us a Snapshot of Our Data
## # A tibble: 6 × 7
##     age   sex   bmi children smoker region charges
##   <dbl> <dbl> <dbl>    <dbl>  <dbl>  <dbl>   <dbl>
## 1    19     1  27.9        0      1      4  16885.
## 2    18     0  33.8        1      0      3   1726.
## 3    28     0  33          3      0      3   4449.
## 4    33     0  22.7        0      0      2  21984.
## 5    32     0  28.9        0      0      2   3867.
## 6    31     1  25.7        0      0      3   3757.
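The columns sex, smoker, and region are stored as numeric codes (<dbl>), so lm() fits region as a single numeric slope rather than as separate categories. As a hedged alternative, not what this analysis does, the sketch below shows how factor() makes R build one indicator column per region level. The name sample_df is hypothetical; it re-types the six rows printed by head() above, with charges rounded as displayed. The source does not say which code maps to which label, so none are assigned.

```r
# Six rows copied from the head(cost_df) output; charges are rounded as printed.
sample_df <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c(1, 0, 0, 0, 0, 1),
  bmi      = c(27.9, 33.8, 33, 22.7, 28.9, 25.7),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c(1, 0, 0, 0, 0, 0),
  region   = c(4, 3, 3, 2, 2, 3),
  charges  = c(16885, 1726, 4449, 21984, 3867, 3757)
)

# Treat region as a categorical variable instead of a number.
sample_df$region_f <- factor(sample_df$region)
levels(sample_df$region_f)   # only the codes present in these six rows

# Design matrix: an intercept plus one 0/1 column per non-baseline level.
design <- model.matrix(~ region_f, data = sample_df)
colnames(design)
```

With the full data, the same idea would be `lm(charges ~ age + bmi + children + smoker + factor(region), data = cost_df)`, which estimates a separate effect for each region relative to a baseline.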

Step (3) Descriptive Statistics

summary(cost_df) #Summarizes the Data
##       age             sex              bmi           children    
##  Min.   :18.00   Min.   :0.0000   Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:0.0000   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Median :0.0000   Median :30.40   Median :1.000  
##  Mean   :39.21   Mean   :0.4948   Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00   Max.   :1.0000   Max.   :53.13   Max.   :5.000  
##      smoker           region         charges     
##  Min.   :0.0000   Min.   :1.000   Min.   : 1122  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 4740  
##  Median :0.0000   Median :3.000   Median : 9382  
##  Mean   :0.2048   Mean   :2.516   Mean   :13270  
##  3rd Qu.:0.0000   3rd Qu.:3.000   3rd Qu.:16640  
##  Max.   :1.0000   Max.   :4.000   Max.   :63770
Interpretation: On average, people are paying $13,270 a year for insurance. The median is lower ($9,382), which indicates that charges are right-skewed.

Step (4) Data Visualization -> Creating a Scatter Plot

pairs(~age + sex + bmi + children + smoker + region + charges, data = cost_df) #scatter plot

Interpretation: Some factors show a visible correlation, but the categorical factors (sex, smoker, region) and the count of children take only a few distinct values, so their panels appear as stripes and it is hard to tell from the plot whether they are truly correlated.
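For a 0/1 variable such as smoker, a grouped summary is often easier to read than a scatter plot panel. The following is a minimal sketch, assuming only the six rows printed by head() earlier (re-typed into a hypothetical data frame sample_df, with charges rounded as displayed). Six rows are far too few to draw conclusions; the point is the technique.

```r
# smoker and charges columns from the six rows shown by head(cost_df).
sample_df <- data.frame(
  smoker  = c(1, 0, 0, 0, 0, 0),
  charges = c(16885, 1726, 4449, 21984, 3867, 3757)
)

# Median charges within each smoker group (names are the codes "0" and "1").
median_by_smoker <- tapply(sample_df$charges, sample_df$smoker, median)
median_by_smoker

# Graphical version on the full data (not run here):
# boxplot(charges ~ smoker, data = cost_df)
```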

Step (5) Create a Correlation matrix

corr <- cor(cost_df) #Correlation
corr # Outputs the Correlation 
##                   age          sex          bmi    children       smoker
## age       1.000000000  0.020855872  0.109271882  0.04246900 -0.025018752
## sex       0.020855872  1.000000000 -0.046371151 -0.01716298 -0.076184817
## bmi       0.109271882 -0.046371151  1.000000000  0.01275890  0.003750426
## children  0.042468999 -0.017162978  0.012758901  1.00000000  0.007673120
## smoker   -0.025018752 -0.076184817  0.003750426  0.00767312  1.000000000
## region    0.002127313 -0.004588385  0.157565849  0.01656945 -0.002180682
## charges   0.299008194 -0.057292062  0.198340969  0.06799823  0.787251431
##                region      charges
## age       0.002127313  0.299008194
## sex      -0.004588385 -0.057292062
## bmi       0.157565849  0.198340969
## children  0.016569446  0.067998227
## smoker   -0.002180682  0.787251431
## region    1.000000000 -0.006208235
## charges  -0.006208235  1.000000000
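Interpretation: No multicollinearity is present: the largest correlation between any two independent variables is 0.16 (bmi and region). Smoker has the strongest correlation with charges (0.79), followed by age (0.30) and bmi (0.20).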
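Pairwise correlations can miss collinearity that involves several predictors at once, so a common complementary check is the variance inflation factor (VIF). The sketch below is not part of the original analysis. It uses the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The matrix pred_cor (a hypothetical name) is re-typed from the output above, rounded to four decimals.

```r
vars <- c("age", "sex", "bmi", "children", "smoker", "region")

# Predictor-only correlation matrix, copied from the cor(cost_df) output (rounded).
pred_cor <- matrix(c(
   1.0000,  0.0209,  0.1093,  0.0425, -0.0250,  0.0021,
   0.0209,  1.0000, -0.0464, -0.0172, -0.0762, -0.0046,
   0.1093, -0.0464,  1.0000,  0.0128,  0.0038,  0.1576,
   0.0425, -0.0172,  0.0128,  1.0000,  0.0077,  0.0166,
  -0.0250, -0.0762,  0.0038,  0.0077,  1.0000, -0.0022,
   0.0021, -0.0046,  0.1576,  0.0166, -0.0022,  1.0000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the inverse.
vif <- diag(solve(pred_cor))
round(vif, 3)
```

Values close to 1 support the conclusion that the predictors are not collinear; a common rule of thumb treats values above 5 or 10 as a warning sign.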

Step (6) Building a Regression Equation

model <- lm(charges ~ age + sex + bmi + children + smoker + region, data = cost_df)
summary(model)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = cost_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11343  -2807  -1017   1408  29752 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11592.92     994.93 -11.652  < 2e-16 ***
## age            257.29      11.89  21.647  < 2e-16 ***
## sex            131.11     332.81   0.394 0.693681    
## bmi            332.57      27.72  11.997  < 2e-16 ***
## children       479.37     137.64   3.483 0.000513 ***
## smoker       23820.43     411.84  57.839  < 2e-16 ***
## region        -353.64     151.93  -2.328 0.020077 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1331 degrees of freedom
## Multiple R-squared:  0.7507, Adjusted R-squared:  0.7496 
## F-statistic: 668.1 on 6 and 1331 DF,  p-value: < 2.2e-16
Interpretation: sex is not a significant factor (p-value 0.694 > 0.05), so we drop it from the estimated regression equation.

Step (7) Build Estimated Regression Equation Using Only Significant Factors

model <- lm(charges ~ age + bmi + children + smoker + region, data = cost_df)
summary(model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = cost_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11404  -2805   -992   1400  29694 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11513.40     973.93 -11.822  < 2e-16 ***
## age            257.41      11.88  21.670  < 2e-16 ***
## bmi            332.04      27.68  11.995  < 2e-16 ***
## children       478.44     137.58   3.478 0.000522 ***
## smoker       23808.21     410.54  57.992  < 2e-16 ***
## region        -353.45     151.88  -2.327 0.020104 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6058 on 1332 degrees of freedom
## Multiple R-squared:  0.7507, Adjusted R-squared:  0.7498 
## F-statistic: 802.2 on 5 and 1332 DF,  p-value: < 2.2e-16
Interpretation: Ŷ = -11513.40 + 257.41 * Age + 332.04 * BMI + 478.44 * Children + 23808.21 * Smoker Status - 353.45 * Region
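To see how the estimated equation is used, the sketch below plugs in the first row shown by head() (age 19, BMI 27.9, no children, smoker, region 4). The function name predict_charges is hypothetical; the coefficients are copied from the summary(model) output above.

```r
# Coefficients from the reduced model (Step 7).
predict_charges <- function(age, bmi, children, smoker, region) {
  -11513.40 + 257.41 * age + 332.04 * bmi + 478.44 * children +
    23808.21 * smoker - 353.45 * region
}

# First row of head(cost_df): observed charges were about 16885.
pred_smoker <- predict_charges(age = 19, bmi = 27.9, children = 0,
                               smoker = 1, region = 4)
pred_smoker   # about 25035.72

# Same person as a non-smoker: the gap equals the smoker coefficient.
pred_nonsmoker <- predict_charges(age = 19, bmi = 27.9, children = 0,
                                  smoker = 0, region = 4)
pred_smoker - pred_nonsmoker   # about 23808.21
```

The prediction for this individual is well above the observed charge, which is consistent with the residual standard error of about 6058 and the wide residual range reported by the model. With the fitted model object available, `predict(model, newdata = ...)` gives the same result without retyping coefficients.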

The adjusted R-squared of 0.7498 means the model explains about 75% of the variation in charges. We can conclude that the overall model is statistically significant because the F-test p-value (< 2.2e-16) is less than our alpha of 0.05.
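The adjusted R-squared penalizes the ordinary R-squared for the number of predictors: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). As a quick check, the sketch below reproduces the reported values from the summaries above. The function name adj_r2 is hypothetical, and n = 1338 is inferred from the residual degrees of freedom (1332 = n - 5 - 1).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338                               # 1332 residual df + 5 predictors + 1
reduced <- adj_r2(0.7507, n, p = 5)     # model without sex
full    <- adj_r2(0.7507, n, p = 6)     # model with sex

round(c(reduced = reduced, full = full), 4)   # 0.7498 and 0.7496
```

Dropping sex leaves R-squared unchanged at four decimals but raises the adjusted value slightly, which supports using the smaller model.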

Step (8) Statistical Assumptions:

We use a significance level (alpha) of 0.05 and assume that each independent variable has a linear relationship with the dependent variable (health insurance premium).
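Linear regression also assumes that the residuals are roughly normal with constant variance, and these assumptions can be checked rather than only assumed. The following is a minimal sketch on simulated data, because Project.xlsx is not available here. The data frame sim_df and its coefficients are made up for illustration and are not the project's results.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the insurance data (illustrative values only).
sim_df <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2)
)
sim_df$charges <- 250 * sim_df$age + 24000 * sim_df$smoker + rnorm(n, sd = 6000)

fit <- lm(charges ~ age + smoker, data = sim_df)
res <- residuals(fit)

# Normality: a small p-value suggests the residuals are not normal.
shapiro_p <- shapiro.test(res)$p.value

# Constant variance (rough check): the absolute residuals should not
# trend with the fitted values.
spread_cor <- cor(abs(res), fitted(fit))

c(shapiro_p = shapiro_p, spread_cor = spread_cor)

# Graphical versions (not run here):
# plot(fit, which = 1)   # residuals vs fitted
# plot(fit, which = 2)   # normal Q-Q plot
```

Applied to the project's model, the same calls would take `model` in place of `fit`. Given the right-skewed charges and the residual range of -11404 to 29694, the Q-Q plot is worth inspecting.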

Step (9) Conclusion & Recommendation:

Smoking status (whether or not you are a smoker) has the largest effect in dollars of all the variables: holding the other variables constant, being a smoker is associated with about $23,808 higher charges. Accurate data and effective visualizations like scatter plot matrices enhance insights. Age, BMI, and smoking status are the key drivers of insurance costs, so we recommend focusing on health programs that address them to lower premiums.