Regression Model: Insurance Charges

1 Introduction

Regresi linier merupakan teknik analisis statistik yang digunakan untuk memprediksi hubungan antara dua atau lebih variabel. Berikut adalah analisis mengenai data US Health Insurance yang bersumber dari Kaggle. Data ini yang menunjukan pengaruh beragam aspek kehidupan terhadap premi yang harus dibayarkan oleh pemegang polis. Menggunakan linear regression analysis, akan dibuat suatu model untuk memprediksi faktor-faktor yang menjadikan nilai kesehatan, gender dan usia menjadi suatu aspek yang penting dalam penentuan total premi.

2 Data Preparation

2.1 Importing Libraries

Install library yang akan digunakan

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(tidymodels)
library(gridExtra)
library(ggstatsplot)
library(dplyr)
library(grid)
library(performance)
library(MLmetrics)

2.2 Importing Datasets

insurance_data <- read.csv("insurance_data.csv", sep = ',', header = TRUE)
str(insurance_data)
#> 'data.frame':    1338 obs. of  7 variables:
#>  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
#>  $ sex     : chr  "female" "male" "male" "male" ...
#>  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
#>  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
#>  $ smoker  : chr  "yes" "no" "no" "no" ...
#>  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
#>  $ charges : num  16885 1726 4449 21984 3867 ...

Berikut adalah keterangan dari setiap kolom: - age : Usia Pemegang Polis - sex : Jenis Kelamin Pemegang Polis - bmi : Body Mass Index Pemegang Polis - children : Jumlah Tanggungan - smoker : Smoker / Non Smoker - region : Daerah Domisili - charges : Premi

3 Exploratory Data Analysis

3.1 Duplicates

Cek data duplikat

dup <- sum(duplicated(insurance_data))
dup
#> [1] 1

Terdapat data duplicate sehingga perlu dilakukan remove duplicate

insurance_data <- distinct(insurance_data)
dup <- sum(duplicated(insurance_data))
dup
#> [1] 0

3.2 Missing Values

Cek missing values

mv <- colSums(is.na(x=insurance_data))
mv
#>      age      sex      bmi children   smoker   region  charges 
#>        0        0        0        0        0        0        0

Data tidak memiliki missing values

3.3 Data Types

Cek tipe variabel

datypes <- str(insurance_data)
#> 'data.frame':    1337 obs. of  7 variables:
#>  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
#>  $ sex     : chr  "female" "male" "male" "male" ...
#>  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
#>  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
#>  $ smoker  : chr  "yes" "no" "no" "no" ...
#>  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
#>  $ charges : num  16885 1726 4449 21984 3867 ...
datypes
#> NULL

Column sex dan smoker masih menggunakan format chr, perlu di ubah ke format factor

insurance_data$sex_binary <- as.factor(ifelse(insurance_data$sex == "female", 1, 0))
insurance_data$smoker_binary <- as.factor(ifelse(insurance_data$smoker == "yes", 1, 0))
datypes <- str(insurance_data)
#> 'data.frame':    1337 obs. of  9 variables:
#>  $ age          : int  19 18 28 33 32 31 46 37 37 60 ...
#>  $ sex          : chr  "female" "male" "male" "male" ...
#>  $ bmi          : num  27.9 33.8 33 22.7 28.9 ...
#>  $ children     : int  0 1 3 0 0 0 1 3 2 0 ...
#>  $ smoker       : chr  "yes" "no" "no" "no" ...
#>  $ region       : chr  "southwest" "southeast" "southeast" "northwest" ...
#>  $ charges      : num  16885 1726 4449 21984 3867 ...
#>  $ sex_binary   : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 2 2 1 2 ...
#>  $ smoker_binary: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
datypes
#> NULL

Tipe Variable sudah sesuai

3.4 Boxplot

# Boxplot
par(new=TRUE)
bp_age <- boxplot(insurance_data$age, col = "grey", xlab = "Age", ylab = "Value", main = "Age Var Boxplot")

bp_sex <- boxplot(insurance_data$sex_binary, col = "grey", xlab = "Sex", ylab = "Value", main = "Sex Var Boxplot")

bp_bmi <- boxplot(insurance_data$bmi, col = "grey", xlab = "BMI", ylab = "Value", main = "BMI Var Boxplot")

bp_children <- boxplot(insurance_data$children, col = "grey", xlab = "Children", ylab = "Value", main = "Children Var Boxplot")

bp_smoker <- boxplot(insurance_data$smoker_binary, col = "grey", xlab = "Smoker", ylab = "Value", main = "Smoker Var Boxplot")

bp_charges <- boxplot(insurance_data$charges, col = "grey", xlab = "Charges", ylab = "Value", main = "Charges Var Boxplot")

outlier merupakan bagian sah dari variasi alami data, maka tidak perlu menghapusnya.

3.5 Descriptive Statistics

desc_stats <- summary(insurance_data)
desc_stats 
#>       age            sex                 bmi           children    
#>  Min.   :18.00   Length:1337        Min.   :15.96   Min.   :0.000  
#>  1st Qu.:27.00   Class :character   1st Qu.:26.29   1st Qu.:0.000  
#>  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
#>  Mean   :39.22                      Mean   :30.66   Mean   :1.096  
#>  3rd Qu.:51.00                      3rd Qu.:34.70   3rd Qu.:2.000  
#>  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
#>     smoker             region             charges      sex_binary smoker_binary
#>  Length:1337        Length:1337        Min.   : 1122   0:675      0:1063       
#>  Class :character   Class :character   1st Qu.: 4746   1:662      1: 274       
#>  Mode  :character   Mode  :character   Median : 9386                           
#>                                        Mean   :13279                           
#>                                        3rd Qu.:16658                           
#>                                        Max.   :63770

3.6 Correlation

cols_2 <- c("age","sex_binary","bmi","children","smoker_binary","charges")
boxdat <- insurance_data[cols_2]
corr_numvar <- ggcorrmat(
  data = boxdat,
  type = "parametric", # Person Correlation
  colors = c("darkred", "white", "steelblue")
)

corr_numvar <- corr_numvar +  labs(title = NULL, subtitle = NULL) + theme(
  plot.margin = margin(0.15, 0, 0.1, 0.01, "npc")) 

corr_numvar

grid.text(
  "Correlation: Numeric Variable", 
  0, 
  0.900,
  just = c("left", "bottom"),
  gp = gpar(
    fontsize = 22,
    fontface = "bold",
    fontfamily = "Econ Sans Cnd"
  )
)

grid.lines(
  x = c(0, 1),
  y = 1,
  gp = gpar(col = "#e5001c", lwd = 4)
)

grid.rect(
  x = 0,
  y = 1,
  width = 0.05,
  height = 0.025,
  just = c("left", "top"),
  gp = gpar(fill = "#e5001c", lwd = 0)
)

### 3.7 Scatter Plot

# 1. Age dan Charges
plot(insurance_data$age, insurance_data$charges)

# 2. BMI dan Charges
plot(insurance_data$bmi, insurance_data$charges)

# 3. Children dan Charges
plot(insurance_data$children, insurance_data$charges)

4 Modeling

4.1 Initial Model

# Fungsi
model_ols <- lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary,
                data = insurance_data)

# Hasil Model
model_ols
#> 
#> Call:
#> lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary, 
#>     data = insurance_data)
#> 
#> Coefficients:
#>    (Intercept)             age     sex_binary1             bmi        children  
#>       -12176.5           257.7           127.2           322.4           473.9  
#> smoker_binary1  
#>        23822.3

4.2 Assumption

Linearity

# Linearity
plot(model_ols, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")

#### Normality of Residuals

# Normality by Histogram
hist(model_ols$residuals)

# Normality by Shapiro Test
shapiro.test(model_ols$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_ols$residuals
#> W = 0.89954, p-value < 0.00000000000000022

Homoscedasticity of Residuals

# Homoscedasticity by BP Test
bptest(model_ols)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_ols
#> BP = 117.82, df = 5, p-value < 0.00000000000000022

Non Multicollinearity

# Non Multicollinearity
vif(model_ols)
#>           age    sex_binary           bmi      children smoker_binary 
#>      1.015042      1.008915      1.014586      1.002182      1.006546

4.3 Metrics Evaluation

# Goodness of Fit : Ajdusted R-squared
summary(model_ols)$adj.r.squared
#> [1] 0.7486133
# MAPE
MAPE(model_ols$fitted.values, insurance_data$charges)
#> [1] 0.4236605

4.4 Variable Interpretation

summary(model_ols)
#> 
#> Call:
#> lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary, 
#>     data = insurance_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -11838  -2919  -1000   1376  29565 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)    -12176.54     964.71 -12.622 < 0.0000000000000002 ***
#> age               257.66      11.92  21.622 < 0.0000000000000002 ***
#> sex_binary1       127.21     333.61   0.381              0.70303    
#> bmi               322.38      27.43  11.753 < 0.0000000000000002 ***
#> children          473.91     137.94   3.435              0.00061 ***
#> smoker_binary1  23822.32     412.73  57.719 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6072 on 1331 degrees of freedom
#> Multiple R-squared:  0.7496, Adjusted R-squared:  0.7486 
#> F-statistic: 796.7 on 5 and 1331 DF,  p-value: < 0.00000000000000022

5 Conclusion

Model regresi pada Insurance Charges memiliki metriks evaluasi yang baik ditunjukkan dengan nilai Adjusted R-Squared yang tinggi, namun memiliki MAPE yang tinggi. MAPE yang tinggi menunjukan bahwa hasil prediksi model regresi memiliki error yang tinggi. Berdasarkan hasil signifikansi, hanya terdapat 4/5 prediktor signifikan yang mempengaruhi Insurance Charges secara positif yaitu (age,bmi,children dan smoker). Namun, untuk uji asumsi yang memenuhi hanya asumsi Non Multicollinearity, sedangkan asumsi lainnya tidak terpenuhi. Hal ini perlu dilakukan penganganan uji asumsi agar hasil pemodelan lebih valid. Selain itu dapat digunakan model lainnya seperti regresi logistic, KNN atau Random Forest untuk mendapatkan hasil prediksi yang lebih baik.