Regression Model: Insurance Charges
1 Introduction
Regresi linier merupakan teknik analisis statistik yang digunakan untuk memprediksi hubungan antara dua atau lebih variabel. Berikut adalah analisis mengenai data US Health Insurance yang bersumber dari Kaggle. Data ini yang menunjukan pengaruh beragam aspek kehidupan terhadap premi yang harus dibayarkan oleh pemegang polis. Menggunakan linear regression analysis, akan dibuat suatu model untuk memprediksi faktor-faktor yang menjadikan nilai kesehatan, gender dan usia menjadi suatu aspek yang penting dalam penentuan total premi.
2 Data Preparation
2.1 Importing Libraries
Install library yang akan digunakan
2.2 Importing Datasets
#> 'data.frame': 1338 obs. of 7 variables:
#> $ age : int 19 18 28 33 32 31 46 37 37 60 ...
#> $ sex : chr "female" "male" "male" "male" ...
#> $ bmi : num 27.9 33.8 33 22.7 28.9 ...
#> $ children: int 0 1 3 0 0 0 1 3 2 0 ...
#> $ smoker : chr "yes" "no" "no" "no" ...
#> $ region : chr "southwest" "southeast" "southeast" "northwest" ...
#> $ charges : num 16885 1726 4449 21984 3867 ...
Berikut adalah keterangan dari setiap kolom: - age : Usia Pemegang Polis - sex : Jenis Kelamin Pemegang Polis - bmi : Body Mass Index Pemegang Polis - children : Jumlah Tanggungan - smoker : Smoker / Non Smoker - region : Daerah Domisili - charges : Premi
3 Exploratory Data Analysis
3.1 Duplicates
Cek data duplikat
#> [1] 1
Terdapat data duplicate sehingga perlu dilakukan remove duplicate
#> [1] 0
3.2 Missing Values
Cek missing values
#> age sex bmi children smoker region charges
#> 0 0 0 0 0 0 0
Data tidak memiliki missing values
3.3 Data Types
Cek tipe variabel
#> 'data.frame': 1337 obs. of 7 variables:
#> $ age : int 19 18 28 33 32 31 46 37 37 60 ...
#> $ sex : chr "female" "male" "male" "male" ...
#> $ bmi : num 27.9 33.8 33 22.7 28.9 ...
#> $ children: int 0 1 3 0 0 0 1 3 2 0 ...
#> $ smoker : chr "yes" "no" "no" "no" ...
#> $ region : chr "southwest" "southeast" "southeast" "northwest" ...
#> $ charges : num 16885 1726 4449 21984 3867 ...
#> NULL
Column sex dan smoker masih menggunakan format chr, perlu di ubah ke format factor
insurance_data$sex_binary <- as.factor(ifelse(insurance_data$sex == "female", 1, 0))
insurance_data$smoker_binary <- as.factor(ifelse(insurance_data$smoker == "yes", 1, 0))#> 'data.frame': 1337 obs. of 9 variables:
#> $ age : int 19 18 28 33 32 31 46 37 37 60 ...
#> $ sex : chr "female" "male" "male" "male" ...
#> $ bmi : num 27.9 33.8 33 22.7 28.9 ...
#> $ children : int 0 1 3 0 0 0 1 3 2 0 ...
#> $ smoker : chr "yes" "no" "no" "no" ...
#> $ region : chr "southwest" "southeast" "southeast" "northwest" ...
#> $ charges : num 16885 1726 4449 21984 3867 ...
#> $ sex_binary : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 2 2 1 2 ...
#> $ smoker_binary: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
#> NULL
Tipe Variable sudah sesuai
3.4 Boxplot
# Boxplot
par(new=TRUE)
bp_age <- boxplot(insurance_data$age, col = "grey", xlab = "Age", ylab = "Value", main = "Age Var Boxplot")bp_sex <- boxplot(insurance_data$sex_binary, col = "grey", xlab = "Sex", ylab = "Value", main = "Sex Var Boxplot")bp_bmi <- boxplot(insurance_data$bmi, col = "grey", xlab = "BMI", ylab = "Value", main = "BMI Var Boxplot")bp_children <- boxplot(insurance_data$children, col = "grey", xlab = "Children", ylab = "Value", main = "Children Var Boxplot")bp_smoker <- boxplot(insurance_data$smoker_binary, col = "grey", xlab = "Smoker", ylab = "Value", main = "Smoker Var Boxplot")bp_charges <- boxplot(insurance_data$charges, col = "grey", xlab = "Charges", ylab = "Value", main = "Charges Var Boxplot")outlier merupakan bagian sah dari variasi alami data, maka tidak perlu menghapusnya.
3.5 Descriptive Statistics
#> age sex bmi children
#> Min. :18.00 Length:1337 Min. :15.96 Min. :0.000
#> 1st Qu.:27.00 Class :character 1st Qu.:26.29 1st Qu.:0.000
#> Median :39.00 Mode :character Median :30.40 Median :1.000
#> Mean :39.22 Mean :30.66 Mean :1.096
#> 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
#> Max. :64.00 Max. :53.13 Max. :5.000
#> smoker region charges sex_binary smoker_binary
#> Length:1337 Length:1337 Min. : 1122 0:675 0:1063
#> Class :character Class :character 1st Qu.: 4746 1:662 1: 274
#> Mode :character Mode :character Median : 9386
#> Mean :13279
#> 3rd Qu.:16658
#> Max. :63770
3.6 Correlation
cols_2 <- c("age","sex_binary","bmi","children","smoker_binary","charges")
boxdat <- insurance_data[cols_2]corr_numvar <- ggcorrmat(
data = boxdat,
type = "parametric", # Person Correlation
colors = c("darkred", "white", "steelblue")
)
corr_numvar <- corr_numvar + labs(title = NULL, subtitle = NULL) + theme(
plot.margin = margin(0.15, 0, 0.1, 0.01, "npc"))
corr_numvar
grid.text(
"Correlation: Numeric Variable",
0,
0.900,
just = c("left", "bottom"),
gp = gpar(
fontsize = 22,
fontface = "bold",
fontfamily = "Econ Sans Cnd"
)
)
grid.lines(
x = c(0, 1),
y = 1,
gp = gpar(col = "#e5001c", lwd = 4)
)
grid.rect(
x = 0,
y = 1,
width = 0.05,
height = 0.025,
just = c("left", "top"),
gp = gpar(fill = "#e5001c", lwd = 0)
)
### 3.7 Scatter Plot
4 Modeling
4.1 Initial Model
# Fungsi
model_ols <- lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary,
data = insurance_data)
# Hasil Model
model_ols#>
#> Call:
#> lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary,
#> data = insurance_data)
#>
#> Coefficients:
#> (Intercept) age sex_binary1 bmi children
#> -12176.5 257.7 127.2 322.4 473.9
#> smoker_binary1
#> 23822.3
4.2 Assumption
Linearity
#### Normality of Residuals
#>
#> Shapiro-Wilk normality test
#>
#> data: model_ols$residuals
#> W = 0.89954, p-value < 0.00000000000000022
Homoscedasticity of Residuals
#>
#> studentized Breusch-Pagan test
#>
#> data: model_ols
#> BP = 117.82, df = 5, p-value < 0.00000000000000022
4.3 Metrics Evaluation
#> [1] 0.7486133
#> [1] 0.4236605
4.4 Variable Interpretation
#>
#> Call:
#> lm(formula = charges ~ age + sex_binary + bmi + children + smoker_binary,
#> data = insurance_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11838 -2919 -1000 1376 29565
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -12176.54 964.71 -12.622 < 0.0000000000000002 ***
#> age 257.66 11.92 21.622 < 0.0000000000000002 ***
#> sex_binary1 127.21 333.61 0.381 0.70303
#> bmi 322.38 27.43 11.753 < 0.0000000000000002 ***
#> children 473.91 137.94 3.435 0.00061 ***
#> smoker_binary1 23822.32 412.73 57.719 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6072 on 1331 degrees of freedom
#> Multiple R-squared: 0.7496, Adjusted R-squared: 0.7486
#> F-statistic: 796.7 on 5 and 1331 DF, p-value: < 0.00000000000000022
5 Conclusion
Model regresi pada Insurance Charges memiliki metriks
evaluasi yang baik ditunjukkan dengan nilai Adjusted R-Squared yang
tinggi, namun memiliki MAPE yang tinggi. MAPE yang tinggi menunjukan
bahwa hasil prediksi model regresi memiliki error yang tinggi.
Berdasarkan hasil signifikansi, hanya terdapat 4/5 prediktor signifikan
yang mempengaruhi Insurance Charges secara positif yaitu
(age,bmi,children dan
smoker). Namun, untuk uji asumsi yang memenuhi hanya asumsi
Non Multicollinearity, sedangkan asumsi lainnya tidak terpenuhi. Hal ini
perlu dilakukan penganganan uji asumsi agar hasil pemodelan lebih valid.
Selain itu dapat digunakan model lainnya seperti regresi logistic, KNN
atau Random Forest untuk mendapatkan hasil prediksi yang lebih baik.