Se selecciono la base de datos health_insurance con la finalidad de realizar el análisis exploratorio de los datos.
El Supervised Machine Learning es un enfoque dentro
de la Inteligencia Artificial, donde se entrena un modelo utilizando un
conjunto de datos etiquetados. Algunas de sus aplicaciones dentro de
Inteligencia de negocios son:
1) Predicción de ventas.
2) Segmentación de clientes.
3) Optimización de precios.
La R^2 es la métrica utilizada para la evaluación de
la calidad de un modelo de regresión.
La métrica RMSE es una medida de la diferencia entre
los valores observados y los vaores predichos por un modelo.
La principal diferencia entre estos dos es su enfoque, por un lado la
R^2 evalua la calidad, por otro lado el
RMSE mide la magnitud de los errores de predicción.
library(dplyr)
library(ggplot2)
library(corrplot)
library(car)
library(lmtest)
library(lattice)
library(xgboost)
library(rpart)
library(randomForest)
library(neuralnet)
library(caret)
df<- read.csv("C:\\Users\\Diego Pérez\\Downloads\\health_insurance.csv")
df2 <-read.csv("C:\\Users\\Diego Pérez\\Downloads\\health_insurance.csv")
summary(df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :16.00 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.67 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.10 Max. :5.000
## smoker region expenses
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
str(df)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
Para realzar el método OLS es necesario categorizar las variables
df$smoker <- as.numeric(factor(df$smoker, levels = c("yes", "no"), labels = c(1,0)))
df$region <- as.numeric(factor(df$region, levels = c("southwest","southeast","northeast", "northwest"), labels = c(1,2,3,4)))
df$sex <- as.numeric(factor(df$sex, levels = c("female", "male"), labels = c(0,1)))
str(df)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : num 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : num 1 2 2 2 2 2 2 2 2 2 ...
## $ region : num 1 2 2 4 4 2 2 4 3 4 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
Se identifican los NA’s dentro de la base de datos.
na_count <- colSums(is.na(df))
na_count
## age sex bmi children smoker region expenses
## 0 0 0 0 0 0 0
Nos podemos dar cuenta que al realizar el código no contamos con NA’s.
Se analizan las medidas descriptivas con la finalidad de resumir las caracteristicas principales. Porporcionando información sobre la distribución, y forma de los datos.
columnas_numericas <- sapply(df, is.numeric)
media <- colMeans(df[, columnas_numericas], na.rm = TRUE)
cat("Media:", media, "\n")
## Media: 39.20703 1.505232 30.66547 1.094918 1.795217 2.485052 13270.42
mediana <- sapply(df[, columnas_numericas], median, na.rm = TRUE)
cat("Mediana:", mediana, "\n")
## Mediana: 39 2 30.4 1 2 2 9382.03
moda <- apply(df, 2, function(x) {
unique_x <- unique(x[!is.na(x)])
unique_x[which.max(tabulate(match(x, unique_x)))]})
cat("Moda:", media, "\n")
## Moda: 39.20703 1.505232 30.66547 1.094918 1.795217 2.485052 13270.42
# Distribución de géneros:
barplot(table(df$sex), main="Distribución de género", xlab="Sexo", ylab="Frecuencia",ylim=c(0, 800))
# Distribución de fumadores:
barplot(table(df$smoker), main="Distribución de fumadores", xlab="Fumador o no fumador", ylab="Frecuencia",ylim=c(0, 1300))
# Histograma de edad
hist(df$age, main="Histograma de Edad", xlab="Edad", ylab="Frecuencia")
# Histograma de bmi
hist(df$bmi, main="Histograma de BMI", xlab="BMI", ylab="Frecuencia")
# Matriz de correlación
corr_matrix <- cor(df[, c("age", "bmi", "children", "expenses")])
corrplot(corr_matrix, method="color")
# Box plot para (expenses vs smoker)
boxplot(expenses ~ smoker, data=df, main="Box plot de Gastos por Fumador", xlab="Fumador", ylab="Gastos")
# Box plot para (expenses vs region)
boxplot(expenses ~ region, data=df, main="Box plot de Gastos por Región", xlab="Región", ylab="Gastos")
#plot_normality(df, age, bmi, children)
Al análizar los graficos previamente realizados, podemos darnos cuenta en las gráficas de dispersión que existe un número similar de registros en la variable genero. Al analizar la variable de fumadores vemos una gran diferencia entre los que son fumadores y los que no. Cuando observamos la variable edad logramos identificar las frecuencia de la personas con la misma edad. Teniendo mas presencia edades de 20’s. En la matriz de correlación de las variables, podemos notar una relación de entre la variable age y expenses. Sabemos que la variable dependiente es la variable expenses. Es por esto que debemos tener en consideración esta variable en los analisis presentados. En las ultimas dos gráficas de boxplot, vemos a la variable gastos con la variable fumador, nos damos couenta que la mediana de la primer variable es mucho mayor en el caso de si ser fumador. En la última gráfica de región con gastos, observamos que las medianas son muy parecidas.
La base de datos la partimos en (80/20), con la finalidad de utilizar un 80% los datos para el entrenamiento
#set.seed(123)
datos_partidos <- createDataPartition(y = df$expenses, p=0.8, list=F)
train = df[datos_partidos, ]
test <- df[-datos_partidos, ]
El primer modelo a realizar es la regresión.
# Creación de la regresión
ols_model <- lm(expenses ~ age + sex + bmi + children + smoker + region, data = train)
summary(ols_model)
##
## Call:
## lm(formula = expenses ~ age + sex + bmi + children + smoker +
## region, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11738 -2855 -1010 1382 29403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35247.57 1586.60 22.216 < 2e-16 ***
## age 257.08 13.54 18.989 < 2e-16 ***
## sex -46.88 380.94 -0.123 0.90208
## bmi 326.75 31.49 10.375 < 2e-16 ***
## children 476.20 156.98 3.034 0.00248 **
## smoker -24018.91 469.62 -51.145 < 2e-16 ***
## region 276.55 172.05 1.607 0.10828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6203 on 1065 degrees of freedom
## Multiple R-squared: 0.7468, Adjusted R-squared: 0.7454
## F-statistic: 523.5 on 6 and 1065 DF, p-value: < 2.2e-16
vif(ols_model)
## age sex bmi children smoker region
## 1.016539 1.010879 1.040513 1.004390 1.005728 1.024680
Al analizar estos resultados se confirma que NO EXISTE MULTICOLINEALIDAD al tener valores menores a 5
Analizamos si nuestro modelo cuenta con heterocedasticidad, analizando los residuales.
res<- residuals(ols_model)
plot(res, type="l")
En esta grafica vemos la existencia de heterocedasticidad, teniendo
valores muy dispersos, es por esto que realizara la prueba Breusch -
Pagan.
bptest(ols_model)
##
## studentized Breusch-Pagan test
##
## data: ols_model
## BP = 95.63, df = 6, p-value < 2.2e-16
Confirmamos la existencia de heterocedasticidad, en el modelo de regresión. Es por esto que realizaremos la estimación de minimos cuadrados ponderados o “WLS”.
wt <- 1 / lm(abs(ols_model$residuals) ~ ols_model$fitted.values)$fitted.values^2
wls_model<-lm(log(expenses) ~ age + sex + log(bmi) + children + smoker + region , data = train, weights=wt)
summary(wls_model)
##
## Call:
## lm(formula = log(expenses) ~ age + sex + log(bmi) + children +
## smoker + region, data = train, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3.675e-04 -6.271e-05 -2.704e-05 8.120e-06 1.338e-03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.630192 0.307860 31.281 < 2e-16 ***
## age 0.041151 0.001215 33.876 < 2e-16 ***
## sex -0.164060 0.030658 -5.351 1.07e-07 ***
## log(bmi) 0.207024 0.072406 2.859 0.00433 **
## children 0.158997 0.013381 11.882 < 2e-16 ***
## smoker -1.626112 0.087756 -18.530 < 2e-16 ***
## region 0.066613 0.013368 4.983 7.30e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0001603 on 1065 degrees of freedom
## Multiple R-squared: 0.6591, Adjusted R-squared: 0.6572
## F-statistic: 343.2 on 6 and 1065 DF, p-value: < 2.2e-16
Una vez realizado el Modelo WLS, volvemos a realizar la prueba Breush - Pagan
bptest(wls_model)
##
## studentized Breusch-Pagan test
##
## data: wls_model
## BP = 2.2166e-05, df = 6, p-value = 1
Y nos damos cuenta que la heterocedasticidad se elimino con el valor de p-value igual a 1
Volvemos a realizar el el vif para analizar la multicolinealidad
vif(wls_model)
## age sex log(bmi) children smoker region
## 1.078821 1.008367 1.062697 1.027152 1.005358 1.008087
Nos damos cuenta que no contamos con multicolinealidad
Realizamos la normalidad de residuales de los valores de WLS
# Residuales del WLS
res_wls<- residuals(wls_model)
plot(res_wls, type="l")
# Gráfico Q-Q de residuales
qqnorm(res_wls)
qqline(res_wls)
# Gráfica de residuales y los fitted values
plot(fitted(wls_model), resid(wls_model), main="Linear Regression Residual vs. Fitted Values", xlab="Fitted Values", ylab="Residuals")
abline(0,0)
hist(wls_model$residuals, xlab="Estimated Regression Residuals", main='Distribution of OLS Estimated Regression Residuals', col='lightblue', border="white" )
# Prueba Shaphiro
shaphiro <- shapiro.test(res_wls)
shaphiro
##
## Shapiro-Wilk normality test
##
## data: res_wls
## W = 0.80765, p-value < 2.2e-16
Se van a comparar los modelos de OLS, teniendo en cuenta los resultados de la regresión, los datos de logaritmo, y el modelo WLS
# Modelo OLS
predicciones_ols <- predict(ols_model, newdata = test)
# Convertir las predicciones de logaritmo a su escala original
predicciones_ols <- exp(predicciones_ols)
# Predicciones del modelo WLS
predicciones_wls <- predict(wls_model, newdata = test)
# Convertir las predicciones de logaritmo a su escala original
predicciones_wls <- exp(predicciones_wls)
# Calcular el RMSE de cada modelo
rmse_ols <- sqrt(mean((predicciones_ols - test$expenses)^2))
rmse_wls <- sqrt(mean((predicciones_wls - test$expenses)^2))
# Mostrar los resultados
cat("RMSE del modelo OLS:", rmse_ols, "\n")
## RMSE del modelo OLS: Inf
cat("RMSE del modelo WLS:", rmse_wls, "\n")
## RMSE del modelo WLS: 10460.3
Se realiza la regresión XGBoost
train_x <- data.matrix(train[, !colnames(train) %in% "expenses"])
train_y <- log(train$expenses)
test_x <- data.matrix(test[, !colnames(test) %in% "expenses"])
test_y <- log(test$expenses)
xgb_train <- xgb.DMatrix(data = train_x, label = train_y)
xgb_test <- xgb.DMatrix(data = test_x, label = test_y)
watchlist <- list(train = xgb_train, test = xgb_test)
model_xgb <- xgb.train(data = xgb_train, max_depth = 3, watchlist = watchlist, nrounds = 100)
## [1] train-rmse:6.074683 test-rmse:6.042439
## [2] train-rmse:4.271119 test-rmse:4.230960
## [3] train-rmse:3.010621 test-rmse:2.969532
## [4] train-rmse:2.131511 test-rmse:2.086862
## [5] train-rmse:1.520861 test-rmse:1.474968
## [6] train-rmse:1.102275 test-rmse:1.055514
## [7] train-rmse:0.820227 test-rmse:0.772236
## [8] train-rmse:0.634690 test-rmse:0.584757
## [9] train-rmse:0.518597 test-rmse:0.467485
## [10] train-rmse:0.448216 test-rmse:0.397547
## [11] train-rmse:0.408531 test-rmse:0.359911
## [12] train-rmse:0.387221 test-rmse:0.340330
## [13] train-rmse:0.374193 test-rmse:0.330668
## [14] train-rmse:0.366374 test-rmse:0.327095
## [15] train-rmse:0.360391 test-rmse:0.324280
## [16] train-rmse:0.357920 test-rmse:0.323269
## [17] train-rmse:0.355716 test-rmse:0.322455
## [18] train-rmse:0.353692 test-rmse:0.323220
## [19] train-rmse:0.351719 test-rmse:0.323506
## [20] train-rmse:0.348929 test-rmse:0.323700
## [21] train-rmse:0.346773 test-rmse:0.323243
## [22] train-rmse:0.344988 test-rmse:0.322173
## [23] train-rmse:0.343693 test-rmse:0.322766
## [24] train-rmse:0.343039 test-rmse:0.324445
## [25] train-rmse:0.340938 test-rmse:0.323183
## [26] train-rmse:0.339121 test-rmse:0.324061
## [27] train-rmse:0.337656 test-rmse:0.324195
## [28] train-rmse:0.336988 test-rmse:0.325237
## [29] train-rmse:0.335877 test-rmse:0.324856
## [30] train-rmse:0.334380 test-rmse:0.325285
## [31] train-rmse:0.334011 test-rmse:0.325485
## [32] train-rmse:0.332734 test-rmse:0.326217
## [33] train-rmse:0.332107 test-rmse:0.326723
## [34] train-rmse:0.329510 test-rmse:0.325675
## [35] train-rmse:0.328483 test-rmse:0.325968
## [36] train-rmse:0.327534 test-rmse:0.326317
## [37] train-rmse:0.327138 test-rmse:0.327296
## [38] train-rmse:0.325980 test-rmse:0.326591
## [39] train-rmse:0.324877 test-rmse:0.326667
## [40] train-rmse:0.322595 test-rmse:0.328126
## [41] train-rmse:0.320524 test-rmse:0.330252
## [42] train-rmse:0.318389 test-rmse:0.331322
## [43] train-rmse:0.317293 test-rmse:0.331637
## [44] train-rmse:0.316581 test-rmse:0.331913
## [45] train-rmse:0.316201 test-rmse:0.332937
## [46] train-rmse:0.315143 test-rmse:0.334479
## [47] train-rmse:0.313129 test-rmse:0.335352
## [48] train-rmse:0.312880 test-rmse:0.336193
## [49] train-rmse:0.312147 test-rmse:0.336592
## [50] train-rmse:0.311284 test-rmse:0.336818
## [51] train-rmse:0.310016 test-rmse:0.337207
## [52] train-rmse:0.308426 test-rmse:0.338329
## [53] train-rmse:0.308005 test-rmse:0.338277
## [54] train-rmse:0.307230 test-rmse:0.337640
## [55] train-rmse:0.306320 test-rmse:0.337659
## [56] train-rmse:0.305579 test-rmse:0.338920
## [57] train-rmse:0.305235 test-rmse:0.338873
## [58] train-rmse:0.304844 test-rmse:0.339213
## [59] train-rmse:0.304312 test-rmse:0.339172
## [60] train-rmse:0.303719 test-rmse:0.340046
## [61] train-rmse:0.302141 test-rmse:0.339236
## [62] train-rmse:0.301236 test-rmse:0.339908
## [63] train-rmse:0.300912 test-rmse:0.340481
## [64] train-rmse:0.300362 test-rmse:0.340664
## [65] train-rmse:0.299834 test-rmse:0.341313
## [66] train-rmse:0.298660 test-rmse:0.341182
## [67] train-rmse:0.296653 test-rmse:0.340867
## [68] train-rmse:0.295929 test-rmse:0.341822
## [69] train-rmse:0.295726 test-rmse:0.342692
## [70] train-rmse:0.295313 test-rmse:0.343015
## [71] train-rmse:0.295065 test-rmse:0.343006
## [72] train-rmse:0.294644 test-rmse:0.343435
## [73] train-rmse:0.294095 test-rmse:0.343723
## [74] train-rmse:0.293122 test-rmse:0.341626
## [75] train-rmse:0.292500 test-rmse:0.341243
## [76] train-rmse:0.291901 test-rmse:0.341952
## [77] train-rmse:0.291404 test-rmse:0.341893
## [78] train-rmse:0.290910 test-rmse:0.342113
## [79] train-rmse:0.290194 test-rmse:0.341110
## [80] train-rmse:0.289086 test-rmse:0.341192
## [81] train-rmse:0.288154 test-rmse:0.343415
## [82] train-rmse:0.287895 test-rmse:0.343340
## [83] train-rmse:0.286709 test-rmse:0.343550
## [84] train-rmse:0.286447 test-rmse:0.343502
## [85] train-rmse:0.285674 test-rmse:0.343858
## [86] train-rmse:0.285215 test-rmse:0.344505
## [87] train-rmse:0.284607 test-rmse:0.344597
## [88] train-rmse:0.283750 test-rmse:0.345544
## [89] train-rmse:0.283594 test-rmse:0.346257
## [90] train-rmse:0.282919 test-rmse:0.346699
## [91] train-rmse:0.282230 test-rmse:0.346826
## [92] train-rmse:0.281611 test-rmse:0.346866
## [93] train-rmse:0.280147 test-rmse:0.346498
## [94] train-rmse:0.280010 test-rmse:0.347088
## [95] train-rmse:0.279255 test-rmse:0.346808
## [96] train-rmse:0.279033 test-rmse:0.347121
## [97] train-rmse:0.278901 test-rmse:0.347529
## [98] train-rmse:0.278649 test-rmse:0.347934
## [99] train-rmse:0.277913 test-rmse:0.348142
## [100] train-rmse:0.276375 test-rmse:0.347161
reg_xgb <- xgboost(data = xgb_train, max_depth = 3, nrounds = 59, verbose = 0)
prediction_xgb_test <- predict(reg_xgb, xgb_test)
rmse_XGB <- sqrt(mean((exp(prediction_xgb_test) - exp(test_y))^2))
cat("RMSE del modelo XGBoost:", rmse_wls, "\n")
## RMSE del modelo XGBoost: 10460.3
Esto nos indica que las predicciones del modelo son de 11766.29
Ahora relizaremos el árbol de decisión
dt_regresion <- rpart(log(expenses) ~ age + sex + bmi + children + smoker + region, data = train)
plot(dt_regresion, compress = TRUE)
text(dt_regresion, use.n = TRUE)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
rpart.plot(dt_regresion)
decision_tree_prediction <- predict(dt_regresion,test)
rmse_dt <- sqrt(mean((exp(decision_tree_prediction) - df$expenses)^2))
## Warning in exp(decision_tree_prediction) - df$expenses: longitud de objeto
## mayor no es múltiplo de la longitud de uno menor
cat("RMSE de Decision Trees:", rmse_dt, "\n")
## RMSE de Decision Trees: 16484.62
El promedio de error del árbol de decisión es de 16132.53
Por ultimo, realizamos el random forest
# Crear el modelo de Random Forest
library(randomForest)
df2$sex<- factor(df2$sex, levels=c("male", "female"))
df2$smoker<- factor(df2$smoker, levels=c("yes", "no"))
df2$region <- factor(df2$region, levels = c("southwest", "southeast", "northeast", "northwest"))
library(caret)
set.seed(123)
datos_partidosRF <- createDataPartition(y = df2$expenses, p=0.8, list=F)
trainRF = df2[datos_partidos, ]
testRF <- df2[-datos_partidos, ]
random_forest <- randomForest(log(expenses) ~ age + sex + bmi + children + smoker + region, data = trainRF, na.action = na.exclude)
random_forest
##
## Call:
## randomForest(formula = log(expenses) ~ age + sex + bmi + children + smoker + region, data = trainRF, na.action = na.exclude)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1600267
## % Var explained: 81.04
predictions_RF <- predict(random_forest, testRF)
rmse_RF <- sqrt(mean((exp(predictions_RF) - df$expenses)^2))
## Warning in exp(predictions_RF) - df$expenses: longitud de objeto mayor no es
## múltiplo de la longitud de uno menor
cat("RMSE Random Forest:", rmse_RF, "\n")
## RMSE Random Forest: 15662.74
El promedio de error de este modelo es de 15421.87
Se realiza la red neuronal
library(neuralnet)
nn_model <- neuralnet(expenses ~ age + sex + bmi + children + smoker + region, data = train, hidden = c(5, 3), linear.output = TRUE)
plot(nn_model)
predictions_nnet <- predict(nn_model, test)
rmse_nn <- sqrt(mean((exp(predictions_nnet) - test$expenses)^2))
cat("RMSE de Redes Neuronales:", rmse_nn, "\n")
## RMSE de Redes Neuronales: Inf
Una vez realizados todos los modelos, los comparamos para seleccionar el modelo optimo a utilizar.
#rmsesy<- c(rmse_ols,rmse_wls,rmse_XGB,rmse_dt,rmse_RF,rmse_nn)
#rmsesx<- c("RMSE OLS", "RMSE WLS", "RMSE XGBoost", "RMSE Decision Trees", "RMSE Random Forest", "RMSE Neural Network")
#df_RMSES<- data.frame(rmsesx=rmsesx,rmsesy=rmsesy)
#barplot(df_RMSES$rmsesy, names= df_RMSES$rmsesx, col = "skyblue",
# main = "Valores de RMSE para diferentes modelos",
#xlab = "Modelos", ylab = "Valor de RMSE")
resultados <- data.frame(
"1" = c("OLS", rmse_wls),
"2" = c("XGBoost", rmse_XGB),
"3" = c("Árbol de decisión", rmse_dt),
"4" = c("Random Forest", rmse_RF),
"5" = c("Red Neuronal", rmse_nn)
)
rownames(resultados) <- c("Modelo", "RMSE")
resultados
## X1 X2 X3 X4
## Modelo OLS XGBoost Árbol de decisión Random Forest
## RMSE 10460.2985509439 4286.15105517566 16484.6219854509 15662.7422784515
## X5
## Modelo Red Neuronal
## RMSE Inf
A traves de esta tabla podemos comparar los resultados de RMSE de los modelos anteriores. Teniendo a consideración esto elegimos el modelo XGBoost al tener el error más pequeño.
Las variables que contribuyen más a los cambios principales de los modelos son la edad, y si son fumadores o no.
En cuanto a la variable edad por cada unidad que aumente se vera afectado por 247.44. En la variable de fumadores por cada unidad se afectara en 24226