i) What is Supervised Machine Learning, and what are some of its applications in Business Intelligence? Supervised Machine Learning is a type of machine learning that uses labeled data to train a model so that it can make accurate predictions or classifications. In Business Intelligence it is commonly applied to tasks such as sales or demand forecasting (regression) and customer churn or fraud detection (classification).
iii) What is Adjusted R2? What is the RMSE metric? What is the
difference between Adjusted R2 and RMSE?
Both Adjusted R2 and RMSE (Root Mean Square Error) are metrics used to
evaluate the fit of a regression model. They differ in what they
measure. Adjusted R2 indicates the proportion of the variation in the
dependent variable that is explained by the independent variables,
adjusted for the number of predictors in the model, so it shows how well
the model fits the data without rewarding the addition of irrelevant
variables. RMSE, in contrast, is the standard deviation of the model's
residuals, that is, how far the predicted values are from the observed
values on average, expressed in the units of the dependent variable.
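To make the distinction concrete, the following is a minimal sketch that computes both metrics by hand for a small linear model. It uses R's built-in `mtcars` data rather than the insurance dataset, and the model `mpg ~ wt + hp` is only an illustrative assumption.

```r
# Illustrative model on built-in data (not the insurance dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors (intercept excluded)

res <- residuals(fit)
ss_res <- sum(res^2)                               # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares

r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)     # penalizes extra predictors

rmse <- sqrt(mean(res^2))                          # in the units of mpg

c(adj_r2 = adj_r2, rmse = rmse)
```

Adjusted R2 is unitless and comparable across models fitted to the same response, while RMSE is read in the units of the dependent variable.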
References:
Shahzadi, N. (2023, May 29). Supervised Machine Learning: classification
and regression. Medium. https://medium.com/@nimrashahzadisa064/supervised-machine-learning-classification-and-regression-c145129225f8
IBM documentation. (n.d.). https://www.ibm.com/docs/es/cognos-analytics/11.2.0?topic=terms-adjusted-r-squared
Oracle® Fusion Cloud EPM Trabajo con Planning. (n.d.). https://docs.oracle.com/cloud/help/es/pbcs_common/PFUSU/insights_metrics_RMSE.htm#PFUSU-GUID-FD9381A1-81E1-4F6D-8EC4-82A6CE2A6E74
# Import the required libraries
# Data manipulation and visualization
library(foreign)
library(ggplot2)
library(dplyr)
library(mapview)
library(naniar)
library(tmap)
library(RColorBrewer)
library(dlookr)
library(DataExplorer)
# Predictive modeling
library(regclass)
library(mctest)
library(lmtest)
library(spdep)
library(sf)
library(spData)
library(spatialreg)
library(caret)
library(e1071)
library(SparseM)
library(Metrics)
library(randomForest)
library(jtools)
library(xgboost)
library(DiagrammeR)
library(effects)
# Load the dataset
df <- read.csv("C:\\Users\\kathi\\OneDrive\\Escritorio\\MODULO3_IA\\Materiales\\Act 1\\automoble_insurance_claims.csv") # The dataset has 1000 observations and 40 variables.
a. Identifying NAs
sum(is.na(df)) # The dataset contains 1000 NAs in total.
## [1] 1000
colSums(is.na(df)) # All 1000 NAs are in the variable "X_c39", so this column carries no information: it consists entirely of missing values.
## months_as_customer age
## 0 0
## policy_number policy_bind_date
## 0 0
## policy_state policy_csl
## 0 0
## policy_deductable policy_annual_premium
## 0 0
## umbrella_limit insured_zip
## 0 0
## insured_sex insured_education_level
## 0 0
## insured_occupation insured_hobbies
## 0 0
## insured_relationship capital.gains
## 0 0
## capital.loss incident_date
## 0 0
## incident_type collision_type
## 0 0
## incident_severity authorities_contacted
## 0 0
## incident_state incident_city
## 0 0
## incident_location incident_hour_of_the_day
## 0 0
## number_of_vehicles_involved property_damage
## 0 0
## bodily_injuries witnesses
## 0 0
## police_report_available total_claim_amount
## 0 0
## injury_claim property_claim
## 0 0
## vehicle_claim auto_make
## 0 0
## auto_model auto_year
## 0 0
## fraud_reported X_c39
## 0 1000
gg_miss_var(df) # This plot helps to spot the NAs quickly.
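The same check can be done programmatically instead of reading the printed counts. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its columns are made up for illustration); it lists the columns that contain no data at all and drops them.

```r
# Hypothetical toy data frame with one column that is entirely NA
toy <- data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  empty = c(NA, NA, NA, NA)
)

na_counts <- colSums(is.na(toy))                          # NAs per column
all_na_cols <- names(na_counts)[na_counts == nrow(toy)]   # columns with no data at all

na_counts
all_na_cols

# Drop the columns that carry no information
toy_clean <- toy[, !(names(toy) %in% all_na_cols), drop = FALSE]
```

Applied to a data frame like `df`, this would flag a column such as `X_c39` without having to inspect the output by eye.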
b. Replacing NAs
# The variable "X_c39" is removed because it contains no useful information.
df$X_c39 <- NULL
str(df) # Check that the change was applied.
## 'data.frame': 1000 obs. of 39 variables:
## $ months_as_customer : int 328 228 134 256 228 256 137 165 27 212 ...
## $ age : int 48 42 29 41 44 39 34 37 33 42 ...
## $ policy_number : int 521585 342868 687698 227811 367455 104594 413978 429027 485665 636550 ...
## $ policy_bind_date : chr "10/17/2014" "6/27/2006" "9/6/2000" "5/25/1990" ...
## $ policy_state : chr "OH" "IN" "OH" "IL" ...
## $ policy_csl : chr "250/500" "250/500" "100/300" "250/500" ...
## $ policy_deductable : int 1000 2000 2000 2000 1000 1000 1000 1000 500 500 ...
## $ policy_annual_premium : num 1407 1197 1413 1416 1584 ...
## $ umbrella_limit : int 0 5000000 5000000 6000000 6000000 0 0 0 0 0 ...
## $ insured_zip : int 466132 468176 430632 608117 610706 478456 441716 603195 601734 600983 ...
## $ insured_sex : chr "MALE" "MALE" "FEMALE" "FEMALE" ...
## $ insured_education_level : chr "MD" "MD" "PhD" "PhD" ...
## $ insured_occupation : chr "craft-repair" "machine-op-inspct" "sales" "armed-forces" ...
## $ insured_hobbies : chr "sleeping" "reading" "board-games" "board-games" ...
## $ insured_relationship : chr "husband" "other-relative" "own-child" "unmarried" ...
## $ capital.gains : int 53300 0 35100 48900 66000 0 0 0 0 0 ...
## $ capital.loss : int 0 0 0 -62400 -46000 0 -77000 0 0 -39300 ...
## $ incident_date : chr "1/25/2015" "1/21/2015" "2/22/2015" "1/10/2015" ...
## $ incident_type : chr "Single Vehicle Collision" "Vehicle Theft" "Multi-vehicle Collision" "Single Vehicle Collision" ...
## $ collision_type : chr "Side Collision" "?" "Rear Collision" "Front Collision" ...
## $ incident_severity : chr "Major Damage" "Minor Damage" "Minor Damage" "Major Damage" ...
## $ authorities_contacted : chr "Police" "Police" "Police" "Police" ...
## $ incident_state : chr "SC" "VA" "NY" "OH" ...
## $ incident_city : chr "Columbus" "Riverwood" "Columbus" "Arlington" ...
## $ incident_location : chr "9935 4th Drive" "6608 MLK Hwy" "7121 Francis Lane" "6956 Maple Drive" ...
## $ incident_hour_of_the_day : int 5 8 7 5 20 19 0 23 21 14 ...
## $ number_of_vehicles_involved: int 1 1 3 1 1 3 3 3 1 1 ...
## $ property_damage : chr "YES" "?" "NO" "?" ...
## $ bodily_injuries : int 1 0 2 1 0 0 0 2 1 2 ...
## $ witnesses : int 2 0 3 2 1 2 0 2 1 1 ...
## $ police_report_available : chr "YES" "?" "NO" "NO" ...
## $ total_claim_amount : int 71610 5070 34650 63400 6500 64100 78650 51590 27700 42300 ...
## $ injury_claim : int 6510 780 7700 6340 1300 6410 21450 9380 2770 4700 ...
## $ property_claim : int 13020 780 3850 6340 650 6410 7150 9380 2770 4700 ...
## $ vehicle_claim : int 52080 3510 23100 50720 4550 51280 50050 32830 22160 32900 ...
## $ auto_make : chr "Saab" "Mercedes" "Dodge" "Chevrolet" ...
## $ auto_model : chr "92x" "E400" "RAM" "Tahoe" ...
## $ auto_year : int 2004 2007 2007 2014 2009 2003 2012 2015 2012 1996 ...
## $ fraud_reported : chr "Y" "Y" "N" "Y" ...
# Replace the observations recorded as a question mark.
df[df == "?"] <- NA # Convert them to NAs so they can be imputed afterwards.
colSums(is.na(df)) # Identify the NAs generated.
## months_as_customer age
## 0 0
## policy_number policy_bind_date
## 0 0
## policy_state policy_csl
## 0 0
## policy_deductable policy_annual_premium
## 0 0
## umbrella_limit insured_zip
## 0 0
## insured_sex insured_education_level
## 0 0
## insured_occupation insured_hobbies
## 0 0
## insured_relationship capital.gains
## 0 0
## capital.loss incident_date
## 0 0
## incident_type collision_type
## 0 178
## incident_severity authorities_contacted
## 0 0
## incident_state incident_city
## 0 0
## incident_location incident_hour_of_the_day
## 0 0
## number_of_vehicles_involved property_damage
## 0 360
## bodily_injuries witnesses
## 0 0
## police_report_available total_claim_amount
## 343 0
## injury_claim property_claim
## 0 0
## vehicle_claim auto_make
## 0 0
## auto_model auto_year
## 0 0
## fraud_reported
## 0
gg_miss_var(df) # Visualize the NAs.
str(df) # Check that the change was applied and identify each data type to decide how to impute its missing values.
## 'data.frame': 1000 obs. of 39 variables:
## $ months_as_customer : int 328 228 134 256 228 256 137 165 27 212 ...
## $ age : int 48 42 29 41 44 39 34 37 33 42 ...
## $ policy_number : int 521585 342868 687698 227811 367455 104594 413978 429027 485665 636550 ...
## $ policy_bind_date : chr "10/17/2014" "6/27/2006" "9/6/2000" "5/25/1990" ...
## $ policy_state : chr "OH" "IN" "OH" "IL" ...
## $ policy_csl : chr "250/500" "250/500" "100/300" "250/500" ...
## $ policy_deductable : int 1000 2000 2000 2000 1000 1000 1000 1000 500 500 ...
## $ policy_annual_premium : num 1407 1197 1413 1416 1584 ...
## $ umbrella_limit : int 0 5000000 5000000 6000000 6000000 0 0 0 0 0 ...
## $ insured_zip : int 466132 468176 430632 608117 610706 478456 441716 603195 601734 600983 ...
## $ insured_sex : chr "MALE" "MALE" "FEMALE" "FEMALE" ...
## $ insured_education_level : chr "MD" "MD" "PhD" "PhD" ...
## $ insured_occupation : chr "craft-repair" "machine-op-inspct" "sales" "armed-forces" ...
## $ insured_hobbies : chr "sleeping" "reading" "board-games" "board-games" ...
## $ insured_relationship : chr "husband" "other-relative" "own-child" "unmarried" ...
## $ capital.gains : int 53300 0 35100 48900 66000 0 0 0 0 0 ...
## $ capital.loss : int 0 0 0 -62400 -46000 0 -77000 0 0 -39300 ...
## $ incident_date : chr "1/25/2015" "1/21/2015" "2/22/2015" "1/10/2015" ...
## $ incident_type : chr "Single Vehicle Collision" "Vehicle Theft" "Multi-vehicle Collision" "Single Vehicle Collision" ...
## $ collision_type : chr "Side Collision" NA "Rear Collision" "Front Collision" ...
## $ incident_severity : chr "Major Damage" "Minor Damage" "Minor Damage" "Major Damage" ...
## $ authorities_contacted : chr "Police" "Police" "Police" "Police" ...
## $ incident_state : chr "SC" "VA" "NY" "OH" ...
## $ incident_city : chr "Columbus" "Riverwood" "Columbus" "Arlington" ...
## $ incident_location : chr "9935 4th Drive" "6608 MLK Hwy" "7121 Francis Lane" "6956 Maple Drive" ...
## $ incident_hour_of_the_day : int 5 8 7 5 20 19 0 23 21 14 ...
## $ number_of_vehicles_involved: int 1 1 3 1 1 3 3 3 1 1 ...
## $ property_damage : chr "YES" NA "NO" NA ...
## $ bodily_injuries : int 1 0 2 1 0 0 0 2 1 2 ...
## $ witnesses : int 2 0 3 2 1 2 0 2 1 1 ...
## $ police_report_available : chr "YES" NA "NO" "NO" ...
## $ total_claim_amount : int 71610 5070 34650 63400 6500 64100 78650 51590 27700 42300 ...
## $ injury_claim : int 6510 780 7700 6340 1300 6410 21450 9380 2770 4700 ...
## $ property_claim : int 13020 780 3850 6340 650 6410 7150 9380 2770 4700 ...
## $ vehicle_claim : int 52080 3510 23100 50720 4550 51280 50050 32830 22160 32900 ...
## $ auto_make : chr "Saab" "Mercedes" "Dodge" "Chevrolet" ...
## $ auto_model : chr "92x" "E400" "RAM" "Tahoe" ...
## $ auto_year : int 2004 2007 2007 2014 2009 2003 2012 2015 2012 1996 ...
## $ fraud_reported : chr "Y" "Y" "N" "Y" ...
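An alternative to recoding after loading is to declare "?" as a missing-value marker at import time through the `na.strings` argument of `read.csv`. The sketch below reads a small inline CSV whose values are made up for illustration, rather than the real file.

```r
# Inline CSV standing in for the real file (values are made up for illustration)
csv_text <- "collision_type,property_damage,witnesses
Side Collision,YES,2
?,?,0
Rear Collision,NO,3"

# na.strings tells read.csv which strings to treat as missing values
toy <- read.csv(text = csv_text, na.strings = c("?", "NA"))

colSums(is.na(toy))
```

With this approach the question marks never enter the data frame, so no separate replacement step is needed.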
# The columns with NAs to impute are "collision_type", "property_damage" and "police_report_available". All three are character variables, so they cannot be imputed with the median; their mode is used instead.
# Compute the mode of each column.
collision_type_moda <- names(which.max(table(df$collision_type)))
property_damage_moda <- names(which.max(table(df$property_damage)))
police_report_available_moda <- names(which.max(table(df$police_report_available)))
# Replace the NAs with the mode.
df$collision_type[is.na(df$collision_type)] <- collision_type_moda
df$property_damage[is.na(df$property_damage)] <- property_damage_moda
df$police_report_available[is.na(df$police_report_available)] <- police_report_available_moda
str(df) # Check that the change was applied.
## 'data.frame': 1000 obs. of 39 variables:
## $ months_as_customer : int 328 228 134 256 228 256 137 165 27 212 ...
## $ age : int 48 42 29 41 44 39 34 37 33 42 ...
## $ policy_number : int 521585 342868 687698 227811 367455 104594 413978 429027 485665 636550 ...
## $ policy_bind_date : chr "10/17/2014" "6/27/2006" "9/6/2000" "5/25/1990" ...
## $ policy_state : chr "OH" "IN" "OH" "IL" ...
## $ policy_csl : chr "250/500" "250/500" "100/300" "250/500" ...
## $ policy_deductable : int 1000 2000 2000 2000 1000 1000 1000 1000 500 500 ...
## $ policy_annual_premium : num 1407 1197 1413 1416 1584 ...
## $ umbrella_limit : int 0 5000000 5000000 6000000 6000000 0 0 0 0 0 ...
## $ insured_zip : int 466132 468176 430632 608117 610706 478456 441716 603195 601734 600983 ...
## $ insured_sex : chr "MALE" "MALE" "FEMALE" "FEMALE" ...
## $ insured_education_level : chr "MD" "MD" "PhD" "PhD" ...
## $ insured_occupation : chr "craft-repair" "machine-op-inspct" "sales" "armed-forces" ...
## $ insured_hobbies : chr "sleeping" "reading" "board-games" "board-games" ...
## $ insured_relationship : chr "husband" "other-relative" "own-child" "unmarried" ...
## $ capital.gains : int 53300 0 35100 48900 66000 0 0 0 0 0 ...
## $ capital.loss : int 0 0 0 -62400 -46000 0 -77000 0 0 -39300 ...
## $ incident_date : chr "1/25/2015" "1/21/2015" "2/22/2015" "1/10/2015" ...
## $ incident_type : chr "Single Vehicle Collision" "Vehicle Theft" "Multi-vehicle Collision" "Single Vehicle Collision" ...
## $ collision_type : chr "Side Collision" "Rear Collision" "Rear Collision" "Front Collision" ...
## $ incident_severity : chr "Major Damage" "Minor Damage" "Minor Damage" "Major Damage" ...
## $ authorities_contacted : chr "Police" "Police" "Police" "Police" ...
## $ incident_state : chr "SC" "VA" "NY" "OH" ...
## $ incident_city : chr "Columbus" "Riverwood" "Columbus" "Arlington" ...
## $ incident_location : chr "9935 4th Drive" "6608 MLK Hwy" "7121 Francis Lane" "6956 Maple Drive" ...
## $ incident_hour_of_the_day : int 5 8 7 5 20 19 0 23 21 14 ...
## $ number_of_vehicles_involved: int 1 1 3 1 1 3 3 3 1 1 ...
## $ property_damage : chr "YES" "NO" "NO" "NO" ...
## $ bodily_injuries : int 1 0 2 1 0 0 0 2 1 2 ...
## $ witnesses : int 2 0 3 2 1 2 0 2 1 1 ...
## $ police_report_available : chr "YES" "NO" "NO" "NO" ...
## $ total_claim_amount : int 71610 5070 34650 63400 6500 64100 78650 51590 27700 42300 ...
## $ injury_claim : int 6510 780 7700 6340 1300 6410 21450 9380 2770 4700 ...
## $ property_claim : int 13020 780 3850 6340 650 6410 7150 9380 2770 4700 ...
## $ vehicle_claim : int 52080 3510 23100 50720 4550 51280 50050 32830 22160 32900 ...
## $ auto_make : chr "Saab" "Mercedes" "Dodge" "Chevrolet" ...
## $ auto_model : chr "92x" "E400" "RAM" "Tahoe" ...
## $ auto_year : int 2004 2007 2007 2014 2009 2003 2012 2015 2012 1996 ...
## $ fraud_reported : chr "Y" "Y" "N" "Y" ...
# The variables "policy_bind_date" and "incident_date" are dates, but they are kept as character strings to simplify their handling in the regression models.
colSums(is.na(df))
## months_as_customer age
## 0 0
## policy_number policy_bind_date
## 0 0
## policy_state policy_csl
## 0 0
## policy_deductable policy_annual_premium
## 0 0
## umbrella_limit insured_zip
## 0 0
## insured_sex insured_education_level
## 0 0
## insured_occupation insured_hobbies
## 0 0
## insured_relationship capital.gains
## 0 0
## capital.loss incident_date
## 0 0
## incident_type collision_type
## 0 0
## incident_severity authorities_contacted
## 0 0
## incident_state incident_city
## 0 0
## incident_location incident_hour_of_the_day
## 0 0
## number_of_vehicles_involved property_damage
## 0 0
## bodily_injuries witnesses
## 0 0
## police_report_available total_claim_amount
## 0 0
## injury_claim property_claim
## 0 0
## vehicle_claim auto_make
## 0 0
## auto_model auto_year
## 0 0
## fraud_reported
## 0
gg_miss_var(df) # Confirm that no NAs remain.
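The mode imputation used in this section repeats the same two steps for each column. As a minimal sketch, they could be wrapped in a helper and applied to every column with missing values; `impute_mode` and the toy data frame are hypothetical names and values introduced only for illustration.

```r
# Hypothetical helper: replace NAs in a character vector with its most frequent value
impute_mode <- function(x) {
  counts <- table(x)                              # table() ignores NAs by default
  mode_value <- names(counts)[which.max(counts)]  # first value wins in case of a tie
  x[is.na(x)] <- mode_value
  x
}

toy <- data.frame(
  collision_type  = c("Rear Collision", NA, "Rear Collision", "Side Collision"),
  property_damage = c("NO", NA, "NO", "YES"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column that contains missing values
cols_with_na <- names(toy)[colSums(is.na(toy)) > 0]
toy[cols_with_na] <- lapply(toy[cols_with_na], impute_mode)

toy
```

Mode imputation reinforces the dominant category, which matters when many values are missing (360 of 1000 in `property_damage`); keeping the missing values as a separate "unknown" category is another option that could be considered.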
c. Descriptive statistics
summary(df)
## months_as_customer age policy_number policy_bind_date
## Min. : 0.0 Min. :19.00 Min. :100804 Length:1000
## 1st Qu.:115.8 1st Qu.:32.00 1st Qu.:335980 Class :character
## Median :199.5 Median :38.00 Median :533135 Mode :character
## Mean :204.0 Mean :38.95 Mean :546239
## 3rd Qu.:276.2 3rd Qu.:44.00 3rd Qu.:759100
## Max. :479.0 Max. :64.00 Max. :999435
## policy_state policy_csl policy_deductable policy_annual_premium
## Length:1000 Length:1000 Min. : 500 Min. : 433.3
## Class :character Class :character 1st Qu.: 500 1st Qu.:1089.6
## Mode :character Mode :character Median :1000 Median :1257.2
## Mean :1136 Mean :1256.4
## 3rd Qu.:2000 3rd Qu.:1415.7
## Max. :2000 Max. :2047.6
## umbrella_limit insured_zip insured_sex insured_education_level
## Min. :-1000000 Min. :430104 Length:1000 Length:1000
## 1st Qu.: 0 1st Qu.:448405 Class :character Class :character
## Median : 0 Median :466446 Mode :character Mode :character
## Mean : 1101000 Mean :501215
## 3rd Qu.: 0 3rd Qu.:603251
## Max. :10000000 Max. :620962
## insured_occupation insured_hobbies insured_relationship capital.gains
## Length:1000 Length:1000 Length:1000 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Mode :character Median : 0
## Mean : 25126
## 3rd Qu.: 51025
## Max. :100500
## capital.loss incident_date incident_type collision_type
## Min. :-111100 Length:1000 Length:1000 Length:1000
## 1st Qu.: -51500 Class :character Class :character Class :character
## Median : -23250 Mode :character Mode :character Mode :character
## Mean : -26794
## 3rd Qu.: 0
## Max. : 0
## incident_severity authorities_contacted incident_state incident_city
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## incident_location incident_hour_of_the_day number_of_vehicles_involved
## Length:1000 Min. : 0.00 Min. :1.000
## Class :character 1st Qu.: 6.00 1st Qu.:1.000
## Mode :character Median :12.00 Median :1.000
## Mean :11.64 Mean :1.839
## 3rd Qu.:17.00 3rd Qu.:3.000
## Max. :23.00 Max. :4.000
## property_damage bodily_injuries witnesses police_report_available
## Length:1000 Min. :0.000 Min. :0.000 Length:1000
## Class :character 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Mode :character Median :1.000 Median :1.000 Mode :character
## Mean :0.992 Mean :1.487
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :3.000
## total_claim_amount injury_claim property_claim vehicle_claim
## Min. : 100 Min. : 0 Min. : 0 Min. : 70
## 1st Qu.: 41813 1st Qu.: 4295 1st Qu.: 4445 1st Qu.:30293
## Median : 58055 Median : 6775 Median : 6750 Median :42100
## Mean : 52762 Mean : 7433 Mean : 7400 Mean :37929
## 3rd Qu.: 70593 3rd Qu.:11305 3rd Qu.:10885 3rd Qu.:50823
## Max. :114920 Max. :21450 Max. :23670 Max. :79560
## auto_make auto_model auto_year fraud_reported
## Length:1000 Length:1000 Min. :1995 Length:1000
## Class :character Class :character 1st Qu.:2000 Class :character
## Mode :character Mode :character Median :2005 Mode :character
## Mean :2005
## 3rd Qu.:2010
## Max. :2015
describe(df)
## # A tibble: 18 × 26
## described_variables n na mean sd se_mean IQR skewness
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 months_as_customer 1000 0 2.04e+2 1.15e+2 3.64e+0 1.61e2 0.362
## 2 age 1000 0 3.89e+1 9.14e+0 2.89e-1 1.2 e1 0.479
## 3 policy_number 1000 0 5.46e+5 2.57e+5 8.13e+3 4.23e5 0.0390
## 4 policy_deductable 1000 0 1.14e+3 6.12e+2 1.93e+1 1.5 e3 0.478
## 5 policy_annual_premium 1000 0 1.26e+3 2.44e+2 7.72e+0 3.26e2 0.00440
## 6 umbrella_limit 1000 0 1.10e+6 2.30e+6 7.27e+4 0 1.81
## 7 insured_zip 1000 0 5.01e+5 7.17e+4 2.27e+3 1.55e5 0.817
## 8 capital.gains 1000 0 2.51e+4 2.79e+4 8.81e+2 5.10e4 0.479
## 9 capital.loss 1000 0 -2.68e+4 2.81e+4 8.89e+2 5.15e4 -0.391
## 10 incident_hour_of_the_day 1000 0 1.16e+1 6.95e+0 2.20e-1 1.1 e1 -0.0356
## 11 number_of_vehicles_invo… 1000 0 1.84e+0 1.02e+0 3.22e-2 2 e0 0.503
## 12 bodily_injuries 1000 0 9.92e-1 8.20e-1 2.59e-2 2 e0 0.0148
## 13 witnesses 1000 0 1.49e+0 1.11e+0 3.51e-2 1 e0 0.0196
## 14 total_claim_amount 1000 0 5.28e+4 2.64e+4 8.35e+2 2.88e4 -0.595
## 15 injury_claim 1000 0 7.43e+3 4.88e+3 1.54e+2 7.01e3 0.265
## 16 property_claim 1000 0 7.40e+3 4.82e+3 1.53e+2 6.44e3 0.378
## 17 vehicle_claim 1000 0 3.79e+4 1.89e+4 5.97e+2 2.05e4 -0.621
## 18 auto_year 1000 0 2.01e+3 6.02e+0 1.90e-1 1 e1 -0.0483
## # ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## # p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
## # p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
## # p99 <dbl>, p100 <dbl>
b. Measures of dispersion
media <- mean(df$total_claim_amount)
print(paste("The mean of the dependent variable is:", media))
## [1] "The mean of the dependent variable is: 52761.94"
mediana <- median(df$total_claim_amount)
print(paste("The median of the dependent variable is:", mediana))
## [1] "The median of the dependent variable is: 58055"
varianza <- var(df$total_claim_amount)
print(paste("The variance of the dependent variable is:", varianza))
## [1] "The variance of the dependent variable is: 697040954.791191"
desv_est <- sd(df$total_claim_amount)
print(paste("The standard deviation of the dependent variable is:", desv_est))
## [1] "The standard deviation of the dependent variable is: 26401.5331901613"
coef_var <- (desv_est / media) * 100
print(paste("The coefficient of variation of the dependent variable is:", coef_var))
## [1] "The coefficient of variation of the dependent variable is: 50.0389735293307"
rango <- range(df$total_claim_amount)
# Collapse the two values so the label is printed only once
print(paste("The range of the dependent variable is:", paste(rango, collapse = " to ")))
## [1] "The range of the dependent variable is: 100 to 114920"
cuantiles <- quantile(df$total_claim_amount)
# Collapse the five quantiles (0%, 25%, 50%, 75%, 100%) into a single line
print(paste("The quantiles of the dependent variable are:", paste(cuantiles, collapse = ", ")))
## [1] "The quantiles of the dependent variable are: 100, 41812.5, 58055, 70592.5, 114920"
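The individual statistics above can also be gathered in one place. The following is a minimal sketch of a helper that returns them as a single named vector; `dispersion_summary` is a hypothetical name and the input values are made up.

```r
# Hypothetical helper that returns the dispersion measures as one named vector
dispersion_summary <- function(x) {
  c(
    mean     = mean(x),
    median   = median(x),
    variance = var(x),                  # sample variance (n - 1 denominator)
    sd       = sd(x),
    cv_pct   = sd(x) / mean(x) * 100,   # coefficient of variation, in percent
    min      = min(x),
    max      = max(x),
    iqr      = IQR(x)
  )
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values
round(dispersion_summary(x), 3)
```

Because the coefficient of variation divides the standard deviation by the mean, it is only meaningful for variables measured on a ratio scale with a mean well away from zero.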
c. Identifying patterns and/or trends in the data using plots, including bar plots, line plots, pie plots, histograms, correlation matrix, box plot, scatter plot, qqplot, etc. Show at least 4 to 6 plots.
hist(df$total_claim_amount, main = "Distribution of the dependent variable", xlab = "total_claim_amount", col = "#CD1076")
The distribution of the dependent variable, total claim amount, is
skewed to the left, so rescaling or transforming the data could be
considered.
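The direction of the skew can also be checked numerically: a negative skewness coefficient indicates a longer tail to the left, which agrees with the value of about -0.595 reported by `describe(df)` for `total_claim_amount`. The sketch below implements the simple moment-based estimator on made-up vectors; `sample_skewness` is a hypothetical helper, and packaged estimators may apply small-sample corrections that give slightly different numbers.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
# (several slightly different estimators exist; this is the simplest one)
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

left_skewed <- c(1, 6, 7, 8, 8, 9, 9, 9, 10, 10)   # made-up, long tail to the left
symmetric   <- c(1, 2, 3, 4, 5)

sample_skewness(left_skewed)   # negative for a left-skewed sample
sample_skewness(symmetric)     # zero for a symmetric sample
```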
df_num <- df[sapply(df, is.numeric)] # Select only the numeric columns of the data frame
correlation_matrix <- cor(df_num)
library(corrplot)
corrplot(correlation_matrix, method="circle", type="upper", tl.col="black", tl.srt=45)
The variables most strongly correlated with the dependent variable may
be: number of vehicles involved, hour of the day of the incident,
months as customer, and age.
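Reading the strongest correlations off a plot can be complemented by ranking them numerically. The following is a minimal sketch on the built-in `mtcars` data, where `mpg` stands in for the dependent variable (an assumption made only for illustration).

```r
# Built-in data as a stand-in; "mpg" plays the role of the dependent variable
num_data <- mtcars[sapply(mtcars, is.numeric)]
target <- "mpg"

cor_matrix <- cor(num_data)

# Correlations of every other variable with the target, strongest first
cor_with_target <- cor_matrix[, target]
cor_with_target <- cor_with_target[names(cor_with_target) != target]
ranked <- cor_with_target[order(abs(cor_with_target), decreasing = TRUE)]

round(ranked, 3)
```

Sorting by absolute value keeps strong negative correlations at the top of the list alongside strong positive ones.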
ggplot(df, aes(x = months_as_customer, y = total_claim_amount)) +
  geom_point() +
  geom_smooth(color = "#68228B") +
  # Axis labels now match the variables actually mapped to x and y
  labs(title = "Relationship between months as customer and total claim amount", x = "months_as_customer", y = "total_claim_amount") +
  theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
A positive trend can be seen in this plot, suggesting that the variable
could have a significant positive effect.
boxplot(total_claim_amount ~ number_of_vehicles_involved, data = df,
        main = "Boxplot of Total Claim Amount by Number of Vehicles Involved",
        xlab = "Number of Vehicles Involved",
        ylab = "Total Claim Amount")
The total claim amount tends to be highest when 2 vehicles are
involved, and the dispersion is greatest when only 1 vehicle is
involved.
The variables expected to be significant are: number of vehicles involved, hour of the day of the incident, months as customer, and age.
set.seed(123)
# Data partition
partition <- createDataPartition(y = df$total_claim_amount, p = 0.7, list = FALSE)
# Split the data into training and test sets
train_data <- df[partition, ]
test_data <- df[-partition, ]
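For comparison, a plain random 70/30 split can be written in base R. This is a simplified sketch on made-up data: unlike `createDataPartition`, it does not try to balance the distribution of the outcome between the two sets.

```r
set.seed(123)   # makes the random split reproducible

toy <- data.frame(id = 1:100, y = rnorm(100))   # made-up data

# Draw 70% of the row indices at random for training
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

c(train = nrow(train_toy), test = nrow(test_toy))
```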
a. OLS Regression
ols_model <- lm(total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age, data = train_data)
summary(ols_model)
##
## Call:
## lm(formula = total_claim_amount ~ number_of_vehicles_involved +
## incident_hour_of_the_day + months_as_customer + age, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51965 -17471 2657 17194 72963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22237.386 6709.397 3.314 0.000966 ***
## number_of_vehicles_involved 7143.930 926.573 7.710 4.36e-14 ***
## incident_hour_of_the_day 727.099 135.484 5.367 1.09e-07 ***
## months_as_customer -3.025 20.334 -0.149 0.881781
## age 239.806 257.395 0.932 0.351834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24690 on 695 degrees of freedom
## Multiple R-squared: 0.1342, Adjusted R-squared: 0.1292
## F-statistic: 26.93 on 4 and 695 DF, p-value: < 2.2e-16
The results show that the highly significant variables for the dependent variable are number_of_vehicles_involved and incident_hour_of_the_day. Both have a positive effect on the dependent variable. The coefficient of number_of_vehicles_involved indicates that, holding the other predictors constant, the total claim amount increases by 7143.930 for each additional vehicle involved in the incident. The coefficient of incident_hour_of_the_day indicates that the total claim amount increases by 727.099 for each one-hour increase in the time of day of the incident.
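The "per additional unit" reading of a coefficient can be verified directly: if one predictor increases by one unit while the others stay fixed, the prediction changes by exactly that predictor's coefficient. The sketch below shows this on the built-in `mtcars` data; the model and the reference values are assumptions chosen only for illustration.

```r
# Illustrative model on built-in data (not the insurance dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

base_case <- data.frame(wt = 3, hp = 110)   # arbitrary reference values
plus_one  <- data.frame(wt = 4, hp = 110)   # same hp, wt increased by one unit

# Difference in predictions when only wt changes by one unit
delta <- predict(fit, plus_one) - predict(fit, base_case)

unname(delta)
unname(coef(fit)["wt"])   # the same number: the slope of wt
```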
A log model was not used because it did not improve the coefficient results.
AIC_ols <- AIC(ols_model)
print(paste("AIC for the OLS model:", AIC_ols))
## [1] "AIC for the OLS model: 16153.2361453311"
RMSE_ols <- sqrt(mean(ols_model$residuals^2)) # RMSE computed on the training residuals
print(paste("RMSE for the OLS model:", RMSE_ols))
## [1] "RMSE for the OLS model: 24600.347708626"
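The RMSE above is computed from the model's own residuals, so it describes the fit on the training data. A minimal sketch of an out-of-sample RMSE, measured on held-out rows, is shown below. It uses the built-in `mtcars` data as a stand-in, and `calc_rmse` is a hypothetical helper; in this report the equivalent step would presumably use predictions on `test_data`.

```r
set.seed(123)

# Built-in data as a stand-in for the train/test split used in the report
idx <- sample(seq_len(nrow(mtcars)), size = 22)   # roughly 70% of 32 rows
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train_set)

# Hypothetical helper: root mean squared error between two numeric vectors
calc_rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- calc_rmse(train_set$mpg, fitted(fit))                      # in-sample
rmse_test  <- calc_rmse(test_set$mpg, predict(fit, newdata = test_set))  # out-of-sample

c(train = rmse_train, test = rmse_test)
```

Comparing models on the test RMSE puts them on the same footing as the `test-rmse` values that XGBoost reports during training.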
b. SAR # This model cannot be estimated because no coordinates are available; only the city names are provided, which is not enough to apply it.
c. SEM # This model cannot be estimated because no coordinates are available; only the city names are provided, which is not enough to apply it.
d. XGBoost Regression
# Select the variables
df_xgb <- df %>% select(total_claim_amount, number_of_vehicles_involved, incident_hour_of_the_day, months_as_customer, age)
# Set the seed
set.seed(123)
# Create the data partition
partition_xgb <- createDataPartition(y = df_xgb$total_claim_amount, p = 0.7, list = FALSE)
# Create the training and test sets
train_data_xgb <- df_xgb[partition_xgb, ]
test_data_xgb <- df_xgb[-partition_xgb, ]
# Define the explanatory variables (X) and the dependent variable (Y) for the training set
train_x_xgb <- data.matrix(train_data_xgb[, -1])
train_y_xgb <- train_data_xgb[, 1]
# Define the explanatory variables (X) and the dependent variable (Y) for the test set
test_x_xgb <- data.matrix(test_data_xgb[, -1])
test_y_xgb <- test_data_xgb[, 1]
# Build the training and test sets for XGBoost
xgb_train_data <- xgb.DMatrix(data = train_x_xgb, label = train_y_xgb)
xgb_test_data <- xgb.DMatrix(data = test_x_xgb, label = test_y_xgb)
# Train the model
watchlist_xgb <- list(train = xgb_train_data, test = xgb_test_data)
model_xgb <- xgb.train(data = xgb_train_data, max_depth = 3, watchlist = watchlist_xgb, nrounds = 70)
## [1] train-rmse:44320.088612 test-rmse:44806.658999
## [2] train-rmse:34958.741879 test-rmse:35638.384399
## [3] train-rmse:29233.604048 test-rmse:30179.084581
## [4] train-rmse:25882.609112 test-rmse:26935.998682
## [5] train-rmse:24007.260942 test-rmse:25236.314614
## [6] train-rmse:22815.781331 test-rmse:24403.041749
## [7] train-rmse:22200.034655 test-rmse:23972.099596
## [8] train-rmse:21862.669752 test-rmse:23781.564430
## [9] train-rmse:21565.120102 test-rmse:23752.580284
## [10] train-rmse:21379.530347 test-rmse:23793.561940
## [11] train-rmse:21261.457367 test-rmse:23730.908344
## [12] train-rmse:21084.201955 test-rmse:23709.388399
## [13] train-rmse:20951.302755 test-rmse:23738.108441
## [14] train-rmse:20816.917776 test-rmse:23844.913356
## [15] train-rmse:20738.722069 test-rmse:23836.931035
## [16] train-rmse:20679.557662 test-rmse:23797.872814
## [17] train-rmse:20566.074416 test-rmse:23827.177015
## [18] train-rmse:20434.515128 test-rmse:23852.232619
## [19] train-rmse:20369.595815 test-rmse:23898.653442
## [20] train-rmse:20283.501680 test-rmse:23949.453268
## [21] train-rmse:20170.202477 test-rmse:24025.064118
## [22] train-rmse:20071.760151 test-rmse:24095.910454
## [23] train-rmse:20025.329515 test-rmse:24132.888787
## [24] train-rmse:19894.506042 test-rmse:24285.412102
## [25] train-rmse:19874.033912 test-rmse:24275.353056
## [26] train-rmse:19801.707583 test-rmse:24277.946626
## [27] train-rmse:19710.186689 test-rmse:24353.885028
## [28] train-rmse:19592.739755 test-rmse:24153.188413
## [29] train-rmse:19466.007801 test-rmse:24105.464511
## [30] train-rmse:19419.144789 test-rmse:24134.912749
## [31] train-rmse:19281.244429 test-rmse:24088.405386
## [32] train-rmse:19148.188544 test-rmse:24172.016698
## [33] train-rmse:19009.560118 test-rmse:24365.210496
## [34] train-rmse:18947.013461 test-rmse:24404.065869
## [35] train-rmse:18870.533719 test-rmse:24485.204099
## [36] train-rmse:18785.115759 test-rmse:24624.686588
## [37] train-rmse:18702.198537 test-rmse:24686.348372
## [38] train-rmse:18629.114378 test-rmse:24732.660986
## [39] train-rmse:18530.951888 test-rmse:24694.846126
## [40] train-rmse:18467.719249 test-rmse:24718.017736
## [41] train-rmse:18454.470976 test-rmse:24724.093343
## [42] train-rmse:18414.835234 test-rmse:24721.973559
## [43] train-rmse:18336.139982 test-rmse:24739.889439
## [44] train-rmse:18325.362196 test-rmse:24751.595581
## [45] train-rmse:18242.994800 test-rmse:24844.112067
## [46] train-rmse:18173.644068 test-rmse:24856.654271
## [47] train-rmse:18088.786689 test-rmse:24821.280160
## [48] train-rmse:17967.348009 test-rmse:24823.561524
## [49] train-rmse:17914.323924 test-rmse:24840.716916
## [50] train-rmse:17824.202586 test-rmse:24871.285724
## [51] train-rmse:17782.863515 test-rmse:24914.893547
## [52] train-rmse:17701.535092 test-rmse:24902.444318
## [53] train-rmse:17673.256603 test-rmse:24945.697468
## [54] train-rmse:17607.917723 test-rmse:25094.187191
## [55] train-rmse:17574.326848 test-rmse:25101.509042
## [56] train-rmse:17552.750508 test-rmse:25143.037741
## [57] train-rmse:17461.029519 test-rmse:25161.574292
## [58] train-rmse:17374.779678 test-rmse:25093.255802
## [59] train-rmse:17294.006156 test-rmse:25113.550989
## [60] train-rmse:17232.522727 test-rmse:25136.948273
## [61] train-rmse:17219.855068 test-rmse:25163.309711
## [62] train-rmse:17196.036353 test-rmse:25177.619246
## [63] train-rmse:17141.164935 test-rmse:25154.123734
## [64] train-rmse:17079.069605 test-rmse:25178.332423
## [65] train-rmse:17026.357681 test-rmse:25188.075946
## [66] train-rmse:16976.409834 test-rmse:25232.047736
## [67] train-rmse:16905.306661 test-rmse:25345.855479
## [68] train-rmse:16768.733875 test-rmse:25329.475572
## [69] train-rmse:16719.791448 test-rmse:25434.911892
## [70] train-rmse:16685.355497 test-rmse:25431.539221
# Evaluate model performance on the test set
prediction_xgb_test <- predict(model_xgb, xgb_test_data)
rmse_xgb <- sqrt(mean((prediction_xgb_test - test_y_xgb)^2))
print(paste("RMSE para el modelo XGBoost Regresión:", rmse_xgb))
## [1] "RMSE para el modelo XGBoost Regresión: 25431.539258008"
# Diagnose the residuals of the XGBoost model (plotted against observation order)
xgb_reg_residuals <- test_y_xgb - prediction_xgb_test
plot(xgb_reg_residuals, xlab = "Observation Index", ylab = "Residuals",
main = 'XGBoost Regression Residuals')
abline(0, 0)
# Plot the first 3 trees of the model
xgb.plot.tree(model = model_xgb, trees = 0:2)
# Variable importance
importance_matrix_xgb <- xgb.importance(model = model_xgb)
xgb.plot.importance(importance_matrix_xgb, xlab = "Explanatory Variables X's Importance")
The residuals are scattered randomly around the horizontal zero line, which suggests there is no obvious heteroscedasticity in the XGBoost residuals.
The variable incident_hour_of_the_day is the most important variable in the model, followed by months_as_customer and number_of_vehicles_involved. The variable age is the least important variable in the model.
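The training log above shows the train RMSE falling at every round while the test RMSE stops improving early and then rises, which is the usual sign of overfitting. As a small sketch, the round with the lowest test RMSE can be located directly from the logged values. The vector below copies the first 15 test values from the log (rounded to two decimals); `test_rmse`, `best_round` and `best_rmse` are illustrative names.

```r
# Test RMSE for boosting rounds 1 to 15, copied (rounded) from the training log
test_rmse <- c(44806.66, 35638.38, 30179.08, 26936.00, 25236.31,
               24403.04, 23972.10, 23781.56, 23752.58, 23793.56,
               23730.91, 23709.39, 23738.11, 23844.91, 23836.93)

best_round <- which.min(test_rmse)    # round with the lowest test RMSE
best_rmse <- test_rmse[best_round]

print(paste("Best round:", best_round, "with test RMSE:", best_rmse))
```

The lowest value occurs well before round 70, where the log reports 25431.54. Fewer rounds, or the `early_stopping_rounds` argument of `xgb.train`, may therefore give a lower test error. Choosing the number of rounds on the test set itself leaks information, so a separate validation split would be the safer basis for that choice.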
e. Decision Trees
# Load the packages before fitting the tree
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
decision_tree_model <- rpart(total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age, data = train_data)
# Visualize the tree
plot(decision_tree_model, compress = TRUE)
text(decision_tree_model, use.n = TRUE)
rpart.plot(decision_tree_model)
# Predict on the test set
prediction_decision_tree <- predict(decision_tree_model, test_data)
RMSE_decision_tree <- sqrt(mean((test_data$total_claim_amount - prediction_decision_tree)^2))
RMSE_decision_tree
## [1] 23700.46
This is a regression tree, so each leaf reports the mean total claim amount predicted for that group and the percentage of training observations that fall in it (a share of the sample, not a probability). Incidents that occur after hour 10 of the day account for 60% of the observations, with a predicted total claim amount of 166.10. Incidents before hour 10 with more than 2 vehicles involved account for 12% of the observations, with a predicted amount of 171.53. Incidents at or after hour 3 with fewer than 2 vehicles involved account for 22% of the observations, with a predicted amount of 76.39. Incidents before hour 3 with fewer than 2 vehicles involved account for 6% of the observations, with a predicted amount of 147.07.
Decision tree RMSE = 23700.46
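To make the reading of a regression tree concrete, the sketch below extracts the leaf predictions and the share of training rows in each leaf from a fitted `rpart` object. It uses the built-in `mtcars` data purely for illustration, not the report's `train_data`; `leaves` and `share` are hypothetical names.

```r
library(rpart)  # recommended package included with standard R installations

# Illustrative regression tree on a built-in dataset
fit <- rpart(mpg ~ wt + hp, data = mtcars)

# Each row of fit$frame is a node; terminal nodes have var == "<leaf>"
leaves <- fit$frame[fit$frame$var == "<leaf>", c("n", "yval")]

# Share of training rows that end up in each leaf (root node holds all rows)
leaves$share <- leaves$n / fit$frame$n[1]

print(leaves)  # yval is the mean response predicted for the leaf
```

The `yval` column is what `predict()` returns for any row routed to that leaf, and the shares always add up to 1.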
f. Random Forest
library(randomForest)
rf_model <- randomForest(total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age, data = train_data, proximity = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age, data = train_data, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 553045613
## % Var explained: 20.88
# Predict on the test data
rf_prediction_test_data <- predict(rf_model, test_data)
RMSE_rf <- sqrt(mean((rf_prediction_test_data - test_data$total_claim_amount)^2))
print(paste("RMSE para el modelo Random Forest:", RMSE_rf))
## [1] "RMSE para el modelo Random Forest: 24297.3873212328"
# Evaluate variable importance
varImpPlot(rf_model, n.var = 5, main = "Top 5 - Importancia de la variable")
importance(rf_model)
##                             IncNodePurity
## number_of_vehicles_involved   48792822358
## incident_hour_of_the_day      78490742445
## months_as_customer            52434382508
## age                           40923679622
The most important variable is incident_hour_of_the_day, followed by months_as_customer, then number_of_vehicles_involved; the least important is age.
g. Neural Network Regression
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.2
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
nn_model <- neuralnet(total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age,
data = train_data,
hidden = c(3,3), linear.output = TRUE, stepmax = 1000000)
plot(nn_model)
nn_predictions <- predict(nn_model, newdata = test_data)
RMSE_nn <- sqrt(mean((nn_predictions - test_data$total_claim_amount)^2))
print(paste("RMSE para el modelo Neural Networks Regresión:", RMSE_nn)) # RMSE = 26272.51
## [1] "RMSE para el modelo Neural Networks Regresión: 26272.5139956634"
a. Multicollinearity
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:VGAM':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
vif(ols_model)
## number_of_vehicles_involved incident_hour_of_the_day
## 1.015220 1.025722
## months_as_customer age
## 6.414585 6.438011
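All VIF values are below 10, so by this threshold there is no severe multicollinearity in the OLS model. The values for months_as_customer (6.41) and age (6.44) are noticeably higher than the rest, which suggests that these two predictors are correlated with each other.
b. Heteroscedasticity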
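For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The sketch below reproduces the calculation by hand on simulated data in which two predictors are built to be strongly correlated. All names (`age`, `months`, `hour`, `vif_manual`) and values are illustrative assumptions, not the report's data.

```r
set.seed(123)
n <- 500
age <- rnorm(n, mean = 40, sd = 9)
# Tenure constructed to track age closely (illustration only)
months <- 10 * (age - 18) + rnorm(n, sd = 35)
hour <- sample(0:23, n, replace = TRUE)

# VIF of one predictor: regress it on the others and apply 1 / (1 - R^2)
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_age <- vif_manual(age, data.frame(months, hour))
vif_hour <- vif_manual(hour, data.frame(age, months))

print(c(age = vif_age, hour = vif_hour))
```

The correlated predictor gets a large VIF, while the unrelated one stays close to 1, which mirrors the pattern in the `vif(ols_model)` output.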
library(lmtest)
bptest(ols_model)
##
## studentized Breusch-Pagan test
##
## data: ols_model
## BP = 130.33, df = 4, p-value < 2.2e-16
There is heteroscedasticity in the OLS model: the p-value is below 0.05, so the null hypothesis of homoscedasticity (constant error variance) is rejected.
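As a rough illustration of what the studentized Breusch-Pagan statistic measures, it can be computed by hand as n times the R² of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution. The sketch below does this on simulated data whose error spread grows with the predictor; the names (`aux`, `bp_stat`, `p_value`) are hypothetical and the data are not from the report.

```r
set.seed(123)
n <- 300
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = x)  # error standard deviation increases with x
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) version of the statistic: n * R^2, df = number of predictors
bp_stat <- n * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p_value = p_value))
```

A small p-value means the squared residuals are systematically related to the predictors, which is how the result for `ols_model` is interpreted.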
c. Serial Autocorrelation
Not tested because the data are not a time series.
d. Spatial Autocorrelation
Not tested because no coordinates are available for the observations, only the city names.
e. Normality of the Residuals
# For the OLS model
shapiro_test_ols <- shapiro.test(ols_model$residuals)
print(paste("Shapiro-Wilk test para OLS residuales p-value:", shapiro_test_ols$p.value))
## [1] "Shapiro-Wilk test para OLS residuales p-value: 3.13600095695675e-10"
# For the XGBoost model
shapiro_test_xgb <- shapiro.test(xgb_reg_residuals)
print(paste("Shapiro-Wilk test para XGBoost residuales p-value:", shapiro_test_xgb$p.value))
## [1] "Shapiro-Wilk test para XGBoost residuales p-value: 0.107877612445714"
The OLS residuals have a very small p-value (3.136e-10), so the null hypothesis of normality is rejected: the residuals do not follow a normal distribution.
The XGBoost residuals have a relatively large p-value (0.1079), so the null hypothesis of normality is not rejected: the residuals are consistent with a normal distribution.
Addressing the heteroscedasticity of the OLS model by adding variables:
ols_model_modif <- lm(total_claim_amount ~ number_of_vehicles_involved + incident_hour_of_the_day + months_as_customer + age + policy_state + incident_type + policy_deductable + witnesses + umbrella_limit, data = train_data)
# Summary of the new model
summary(ols_model_modif)
##
## Call:
## lm(formula = total_claim_amount ~ number_of_vehicles_involved +
## incident_hour_of_the_day + months_as_customer + age + policy_state +
## incident_type + policy_deductable + witnesses + umbrella_limit,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37950 -8442 -546 7533 49318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.682e+04 7.760e+03 7.322 6.87e-13
## number_of_vehicles_involved -6.082e+02 2.208e+03 -0.275 0.7831
## incident_hour_of_the_day -6.231e+00 8.197e+01 -0.076 0.9394
## months_as_customer -5.646e+00 1.192e+01 -0.474 0.6359
## age 1.661e+02 1.509e+02 1.101 0.2713
## policy_stateIN 3.118e+02 1.373e+03 0.227 0.8205
## policy_stateOH -2.778e+02 1.303e+03 -0.213 0.8312
## incident_typeParked Car -5.784e+04 4.912e+03 -11.775 < 2e-16
## incident_typeSingle Vehicle Collision 4.360e+02 4.574e+03 0.095 0.9241
## incident_typeVehicle Theft -5.790e+04 4.868e+03 -11.893 < 2e-16
## policy_deductable 1.837e+00 8.919e-01 2.060 0.0398
## witnesses -2.482e+01 4.927e+02 -0.050 0.9598
## umbrella_limit -6.592e-05 2.393e-04 -0.275 0.7830
##
## (Intercept) ***
## number_of_vehicles_involved
## incident_hour_of_the_day
## months_as_customer
## age
## policy_stateIN
## policy_stateOH
## incident_typeParked Car ***
## incident_typeSingle Vehicle Collision
## incident_typeVehicle Theft ***
## policy_deductable *
## witnesses
## umbrella_limit
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14380 on 687 degrees of freedom
## Multiple R-squared: 0.7098, Adjusted R-squared: 0.7048
## F-statistic: 140.1 on 12 and 687 DF, p-value: < 2.2e-16
a. Using the RMSE metric computed for each of the models estimated in 4), select the model with the best estimated results.
OLS model RMSE (RMSE_ols) = 24600.348
XGBoost Regression model RMSE (rmse_xgb) = 25431.539
Decision Tree model RMSE (RMSE_decision_tree) = 23700.46
Random Forest model RMSE (RMSE_rf) = 24297.387
Neural Network Regression model RMSE (RMSE_nn) = 26272.514
The best model is the Decision Tree, since it had the lowest RMSE of all the models. Note that RMSE_ols is computed from the residuals of the fitted model, whereas the other values are computed on the test set.
b. Present the RMSE values of each of the models estimated in 4) in a bar chart.
# RMSE values
RMSE_valores <- c(RMSE_ols, rmse_xgb, RMSE_decision_tree, RMSE_rf, RMSE_nn)
modelos <- c("OLS", "XGBoost", "Decision Tree", "Random Forest", "Neural Network")
barplot(RMSE_valores, names.arg = modelos, col = '#CD1076', main = "RMSE de Modelos", ylab = "RMSE", xlab = "Modelo")
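The ranking behind the bar chart can also be checked numerically. The sketch below hard-codes the RMSE values printed in the outputs earlier in the report (rounded to two decimals) in a named vector; `rmse_by_model`, `ranked`, `best_model` and `gap_to_best` are illustrative names rather than objects from the original script.

```r
# RMSE values as printed in the model outputs, rounded to two decimals
rmse_by_model <- c(OLS = 24600.35, XGBoost = 25431.54, `Decision Tree` = 23700.46,
                   `Random Forest` = 24297.39, `Neural Network` = 26272.51)

ranked <- sort(rmse_by_model)                 # lowest RMSE first
best_model <- names(ranked)[1]
gap_to_best <- round(ranked - ranked[1], 2)   # distance of each model from the best

print(best_model)
print(gap_to_best)
```

Sorting makes it easy to see that the models differ by a few thousand units at most, which is small relative to RMSE values of around 24000.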
a. EDA
* The dataset contains 1000 NA values in total.
* All 1000 NAs are in the variable "X_c39", which means the column carries no information because it consists entirely of null values; it was removed because it adds nothing to the analysis.
* The variables "collision_type", "property_damage" and "police_report_available" are character variables, so they could not be imputed with the median and were imputed with the mode instead.
* The variables "policy_bind_date" and "incident_date" are dates but were kept as characters to simplify their use in the models.
* The distribution of the dependent variable, total claim amount, is skewed to the left, so scaling the data could be considered.
* The variables most correlated with the dependent variable appear to be: number of vehicles involved, hour of the day of the incident, months as customer, and age.
b. Selected model:
Using RMSE as the metric for choosing the most accurate model, the Decision Tree is the best model, since it had the lowest RMSE of all the models.
i. Which variables help explain changes in the main variable of interest? The variables the selected model uses to explain changes in the main variable of interest are incident_hour_of_the_day and number_of_vehicles_involved.
ii. What is the impact of these explanatory variables on the dependent variable? Each leaf of the tree gives the predicted total claim amount for a group and the percentage of training observations in that group (a share of the sample, not a probability). Incidents that occur after hour 10 of the day account for 60% of the observations, with a predicted total claim amount of 166.10. Incidents before hour 10 with more than 2 vehicles involved account for 12% of the observations, with a predicted amount of 171.53. Incidents at or after hour 3 with fewer than 2 vehicles involved account for 22% of the observations, with a predicted amount of 76.39. Incidents before hour 3 with fewer than 2 vehicles involved account for 6% of the observations, with a predicted amount of 147.07.
iii. Are the estimated results of the selected model similar to those of the other models? What are the differences? Yes, the RMSE values of all the models are broadly similar, with the Decision Tree having the lowest. The differences are in the thousands, ranging from 23700.46 for the Decision Tree to 26272.51 for the Neural Network. The most important variables differ across models, as do their residuals.