Goals of Analysis:

- to investigate the relationship between an individual's characteristics and the medical costs billed to health insurance (charges) incurred in a year at a hospital

- to determine if a linear regression model can be used to accurately predict the insurance cost for a person within a year based on the factors of age, sex, BMI, number of children, smoking status, and region in the United States

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Load in the Data

insurance <- read_csv("insurance.csv")
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plotting Quantitative Values

A scatterplot matrix gives a general idea of the relationships between the numeric variables

num_cols <- unlist(lapply(insurance, is.numeric))
numeric_data <- insurance[, num_cols]
pairs(numeric_data)

Correlation Between Quantitative Variables

round(cor(numeric_data), 2)
##           age  bmi children charges
## age      1.00 0.11     0.04    0.30
## bmi      0.11 1.00     0.01    0.20
## children 0.04 0.01     1.00    0.07
## charges  0.30 0.20     0.07    1.00

- no strong correlations among the numeric variables; the largest are age with charges (0.30) and BMI with charges (0.20)
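The correlation matrix above covers only the numeric columns, so smoking status is left out of it. One way to include a two-level variable is to recode it as 0/1 and correlate that indicator with charges. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not rows from insurance.csv) purely to show the mechanics.

```r
# Hypothetical toy data for illustration only; not taken from insurance.csv.
toy <- data.frame(
  smoker  = c("no", "no", "no", "yes", "yes", "yes"),
  charges = c(2000, 3000, 4000, 20000, 25000, 30000)
)

# Recode the two-level character column as a 0/1 indicator.
toy$smoker_num <- as.integer(toy$smoker == "yes")

# Pearson correlation between the indicator and charges
# (often called a point-biserial correlation).
r <- cor(toy$smoker_num, toy$charges)
round(r, 2)
```

The same two lines of recoding and `cor()` could be applied to the `insurance` data frame, assuming `smoker` contains only the values "yes" and "no".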

Boxplots Depicting Relationship Between Qualitative Variables and Charges

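# convert the character columns to factors inside the data frame,
# so that the boxplots and the model below use the converted columns
insurance$smoker <- as.factor(insurance$smoker)
insurance$sex <- as.factor(insurance$sex)
insurance$region <- as.factor(insurance$region)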

boxplot(insurance$charges ~ insurance$smoker, 
        main = "Charges by Smoking Status", 
        xlab = "Smoker", 
        ylab = "Charges", 
        col = c("lightgreen", "red"))

boxplot(insurance$charges ~ insurance$sex, 
        main = "Charges by Sex", 
        xlab = "Sex", 
        ylab = "Charges", 
        col = c("pink","lightblue"))

boxplot(insurance$charges ~ insurance$region, 
        main = "Charges by Region", 
        xlab = "Region", 
        ylab = "Charges", 
        col = rainbow(4))

Insights from Boxplots:

- typical (median) charges for smokers are much higher than for non-smokers
- minimal difference in charges between the sexes
- the southeast region appears to have a higher 3rd quartile and a concentration of high outliers
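The boxplot readings above can be backed with numbers by computing the quartiles of charges within each group. This is a minimal sketch on a made-up data frame (`toy` is hypothetical, not insurance.csv); with the real data, `insurance$charges` and `insurance$smoker` (or `sex`, `region`) would take the place of the toy columns.

```r
# Hypothetical toy data for illustration only; not taken from insurance.csv.
toy <- data.frame(
  smoker  = factor(c("no", "no", "no", "yes", "yes", "yes")),
  charges = c(2000, 3000, 4000, 20000, 25000, 30000)
)

# split() breaks charges into one vector per group; sapply() applies
# quantile() to each and binds the results into a matrix with one
# column per group and one row per requested quantile.
q <- sapply(split(toy$charges, toy$smoker), quantile,
            probs = c(0.25, 0.5, 0.75))
q
```

The "50%" row holds the group medians, which correspond to the thick line inside each box of a boxplot.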

Creating a Linear Regression Model for All Variables in Relation to Charges

model1 <- lm(charges ~ age + sex + bmi + children + smoker + region, data = insurance)
summary(model1)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
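To make the coefficient table concrete, a prediction can be worked out by hand: multiply each coefficient by the matching value for a person and add the results. The sketch below copies the estimates from the output above; the person described is hypothetical, and the baseline levels (female, non-smoker, and the one region without its own coefficient, assumed here to be northeast) are inferred from which dummy variables appear in the table.

```r
# Coefficient estimates copied from the summary(model1) output.
coefs <- c(
  "(Intercept)"   = -11938.5,
  age             = 256.9,
  sexmale         = -131.3,
  bmi             = 339.2,
  children        = 475.5,
  smokeryes       = 23848.5,
  regionnorthwest = -353.0,
  regionsoutheast = -1035.0,
  regionsouthwest = -960.0
)

# Hypothetical person (not a row from insurance.csv): 40-year-old female
# non-smoker, BMI 30, two children, living in the baseline region.
# Each dummy variable is 1 when its level applies and 0 otherwise.
person <- c(
  "(Intercept)" = 1, age = 40, sexmale = 0, bmi = 30, children = 2,
  smokeryes = 0, regionnorthwest = 0, regionsoutheast = 0,
  regionsouthwest = 0
)

# Linear predictor: sum of coefficient * value over all terms.
predicted <- sum(coefs * person[names(coefs)])
predicted           # 9464.5

# The same person as a smoker: only the smokeryes term changes.
predicted_smoker <- predicted + coefs[["smokeryes"]]
predicted_smoker    # 33313
```

In practice `predict(model1, newdata = ...)` does this arithmetic from the fitted model directly; the manual version is only meant to show how the coefficients combine.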

Concluding Insights:

- About 75% of the variance in charges is explained by this model (multiple R-squared = 0.7509, adjusted R-squared = 0.7494).

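- Age, BMI, number of children, and smoking status are statistically significant predictors of charges (p < 0.001 for each). Smoking has by far the largest estimated effect: about 23,849 in additional charges, holding the other variables fixed. The southeast and southwest region coefficients are significant at the 0.05 level, while sex and the northwest region are not.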
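R-squared describes fit on the data used to build the model, so it does not directly answer the second goal, which is about predicting charges for new people. One common check, sketched below, is to fit on part of the data and measure the root mean squared error (RMSE) on the held-out part. The helper `holdout_rmse()` and its arguments are hypothetical names introduced for this illustration, and the demonstration uses R's built-in `mtcars` data so the block runs without insurance.csv.

```r
# Hypothetical helper: fit lm() on a random training subset and
# return the RMSE of its predictions on the remaining rows.
holdout_rmse <- function(formula, data, train_frac = 0.8, seed = 1) {
  set.seed(seed)                                   # reproducible split
  n <- nrow(data)
  train_idx <- sample(n, size = floor(train_frac * n))
  fit <- lm(formula, data = data[train_idx, ])
  test <- data[-train_idx, ]
  pred <- predict(fit, newdata = test)
  actual <- test[[all.vars(formula)[1]]]           # response column
  sqrt(mean((actual - pred)^2))
}

# Demonstration on built-in data (miles per gallon from weight and horsepower).
rmse <- holdout_rmse(mpg ~ wt + hp, mtcars)
rmse
```

For this analysis the call would look like `holdout_rmse(charges ~ age + sex + bmi + children + smoker + region, insurance)`. The result is in the same units as charges and can be compared with the residual standard error of 6062 reported above.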