REGRESIÓN POLINÓMICA

1 Configuración y Carga de Datos

Se carga el conjunto de datos de arrendamientos de hidrocarburos del estado de Kansas, EE.UU., registrados por el Kansas Geological Survey.

ruta_archivo <- "C:/Users/thann/OneDrive/Escritorio/ESTADISTICA.LOL/datos_vale.csv"

datos_vale <- read_delim(
  ruta_archivo,
  delim          = ";",
  show_col_types = FALSE
)

cat("Base de datos cargada correctamente.\n")

## Base de datos cargada correctamente.

str(datos_vale)

## spc_tbl_ [104,173 × 95] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ CODE                         : chr [1:104173] NA NA "28-JUL-2000 OPERATOR_NAME=COLT ENERGY INCO\"" NA ...
##  $ DEPT_WATER_RESOURCES_CODE    : chr [1:104173] NA NA "Whitlow Energy, Inc." NA ...
##  $ DEPT_MOTOR_VEHICLES_ABBREV   : chr [1:104173] NA NA NA NA ...
##  $ NAME                         : chr [1:104173] NA NA "800" NA ...
##  $ WIZARD_BASE_REFERENCE_YEAR   : chr [1:104173] NA NA "23100" NA ...
##  $ WIZARD_PREDEVELOPMENT_YEAR   : chr [1:104173] NA NA NA NA ...
##  $ UPDATE_DATE_1                : num [1:104173] NA NA 275319 NA NA ...
##  $ UPDATE_INITIALS_1            : num [1:104173] NA NA 1970 NA NA NA NA NA NA NA ...
##  $ ABBREVIATED_TO_4_CHARS       : num [1:104173] NA NA 2026 NA NA ...
##  $ KCC_DISTRICT_NUMBER          : chr [1:104173] NA NA "DATA_SOURCE" NA ...
##  $ OBJECTID                     : chr [1:104173] "42971829" "42971830" "1001106572" "42971831" ...
##  $ KID                          : chr [1:104173] "1001106903" "1001106572" NA "1001106590" ...
##  $ LEASE_NAME                   : chr [1:104173] "J. G. HAVENER" "EARLY" NA "ELDEN BURKE" ...
##  $ FIELD_KID                    : chr [1:104173] "1000147101" "1000147596" NA "1000147696" ...
##  $ OPERATOR_KID                 : chr [1:104173] "1044772773" "1041810049" "Town Oil Company Inc." "1027997289" ...
##  $ LEASE_CODE_PI_BEENE          : chr [1:104173] "11322" "10661" NA "10687" ...
##  $ LEASE_CODE_DOR               : chr [1:104173] "100743" "100416" NA "100434" ...
##  $ PRODUCES_GAS                 : chr [1:104173] "No" "No" "NAD27" "No" ...
##  $ PRODUCES_OIL                 : chr [1:104173] "Yes" "Yes" NA "Yes" ...
##  $ STATE_CODE                   : chr [1:104173] "15" "15" NA "15" ...
##  $ COUNTY_CODE                  : num [1:104173] 125 45 NA 49 49 49 49 207 31 31 ...
##  $ LATITUDE                     : chr [1:104173] "37.132.046" "3.876.033" NA "37.482.822" ...
##  $ LATITUDE_LEO_FOOTAGES        : logi [1:104173] NA NA NA NA NA NA ...
##  $ LATITUDE_LEO_QUARTER_CALLS   : chr [1:104173] NA NA NA NA ...
##  $ LATITUDE_DIRECTION           : chr [1:104173] NA NA "387.603.303" NA ...
##  $ LONGITUDE                    : chr [1:104173] "-95.882.477" "-95.157.902" "-951.579.018" "-96.298.344" ...
##  $ LONGITUDE_LEO_FOOTAGES       : chr [1:104173] NA NA "387.603.378" NA ...
##  $ LONGITUDE_LEO_QUARTER_CALLS  : chr [1:104173] NA NA "-951.581.503" NA ...
##  $ LONGITUDE_DIRECTION          : chr [1:104173] NA NA "312500.22" NA ...
##  $ LONGITUDE_LATITUDE_SOURCE    : chr [1:104173] "CENTER_OF_SECTION" "CENTER_OF_SECTION" "4292184.41" "CENTER_OF_SECTION" ...
##  $ PRINCIPAL_MERIDIAN           : num [1:104173] 6 6 15 6 6 6 6 6 6 6 ...
##  $ TOWNSHIP                     : num [1:104173] 33 15 312483 29 31 ...
##  $ TOWNSHIP_DIRECTION           : chr [1:104173] "S" "S" "4292393.14" "S" ...
##  $ RANGE                        : num [1:104173] 14 20 15 10 10 10 13 16 14 16 ...
##  $ RANGE_DIRECTION              : chr [1:104173] "E" "E" "EARLY" "E" ...
##  $ SECTION                      : chr [1:104173] "33" "11" "BALDWIN" "34" ...
##  $ SUBDIVISION_1_LARGEST        : chr [1:104173] NA NA "-105.929.576.967" NA ...
##  $ SUBDIVISION_2                : chr [1:104173] NA NA "46.874.008.688" NA ...
##  $ SUBDIVISION_3                : chr [1:104173] NA NA NA NA ...
##  $ SUBDIVISION_4_SMALLEST       : chr [1:104173] NA NA NA NA ...
##  $ SPOT                         : chr [1:104173] NA NA NA NA ...
##  $ FEET_NORTH_FROM_REFERENCE    : num [1:104173] NA NA NA NA NA NA NA NA NA NA ...
##  $ FEET_EAST_FROM_REFERENCE     : num [1:104173] NA NA NA NA NA NA NA NA NA NA ...
##  $ REFERENCE_CORNER             : chr [1:104173] NA NA NA NA ...
##  $ MEETS_AND_BOUNDS             : logi [1:104173] NA NA NA NA NA NA ...
##  $ OLD_SPOT_OR_LOCATION         : chr [1:104173] NA NA NA NA ...
##  $ UPDATE_INITIALS              : chr [1:104173] "DOR#4" "DOR" NA "DOR" ...
##  $ UPDATE_DATE                  : chr [1:104173] "11/23/2013 3:11:58 PM" "5/31/2011 9:48:34 AM" NA "5/31/2011 9:48:35 AM" ...
##  $ SKIP_IT                      : chr [1:104173] NA NA NA NA ...
##  $ COMMENTS                     : chr [1:104173] NA NA NA NA ...
##  $ CORRECTIONS                  : chr [1:104173] "28-JUL-2000 OPERATOR_NAME=HORTON OIL COMPA" "06-FEB-1999 LEASE_NAME=EARLY S." NA "28-JUL-2000 OPERATOR_NAME=DENTON OIL COMPA" ...
##  $ OPERATOR_NAME                : chr [1:104173] "Horton, John" NA NA "Suerte Oil Company" ...
##  $ PRODUCING_FORMATION          : chr [1:104173] NA NA NA NA ...
##  $ DEPTH_OF_WELL                : num [1:104173] 700 NA NA 1400 NA ...
##  $ FIELD_CODE_DOR               : num [1:104173] 63550 NA NA 25250 10800 ...
##  $ DATA_SOURCE                  : chr [1:104173] NA NA NA NA ...
##  $ CUMULATIVE_PRODUCTION        : num [1:104173] 47293 NA NA 82624 10518 ...
##  $ CUMULATIVE_YEAR_STARTED      : num [1:104173] 1970 NA NA 1970 1971 ...
##  $ CUMULATIVE_YEAR_ENDED        : num [1:104173] 2025 NA NA 2017 1983 ...
##  $ FIELD_KID_SOURCE             : chr [1:104173] "DATA_SOURCE" NA NA "DATA_SOURCE" ...
##  $ LEASE_KID                    : num [1:104173] 1e+09 NA NA 1e+09 1e+09 ...
##  $ LEASE_KID_SOURCE             : logi [1:104173] NA NA NA NA NA NA ...
##  $ PRODUCING_FORMATION_OLD      : chr [1:104173] NA NA NA NA ...
##  $ QUARTER_CALLS_SOURCE         : chr [1:104173] NA NA NA NA ...
##  $ OPERATOR_NAME_OLD            : chr [1:104173] "Horton Oil Company, Inc." NA NA "Suerte Oil Company" ...
##  $ PRODUCING_FORMATION_STRAT_KID: num [1:104173] NA NA NA NA NA NA NA NA NA NA ...
##  $ PRODUCING_FORMATION_SOURCE   : chr [1:104173] NA NA NA NA ...
##  $ LONGITUDE_LATITUDE_DATUM     : chr [1:104173] "NAD27" NA NA "NAD27" ...
##  $ GPS_LATITUDE                 : logi [1:104173] NA NA NA NA NA NA ...
##  $ GPS_LONGITUDE                : logi [1:104173] NA NA NA NA NA NA ...
##  $ GPS_DATUM                    : logi [1:104173] NA NA NA NA NA NA ...
##  $ GPS_SOURCE                   : logi [1:104173] NA NA NA NA NA NA ...
##  $ GPS_DATE                     : logi [1:104173] NA NA NA NA NA NA ...
##  $ GPS_ACCURACY_COMMENTS        : logi [1:104173] NA NA NA NA NA NA ...
##  $ NAD27_LATITUDE               : chr [1:104173] "371.320.458" NA NA "374.828.223" ...
##  $ NAD27_LONGITUDE              : chr [1:104173] "-95.882.477" NA NA "-962.983.435" ...
##  $ NAD83_LATITUDE               : chr [1:104173] "371.321.146" NA NA "374.828.544" ...
##  $ NAD83_LONGITUDE              : chr [1:104173] "-958.827.371" NA NA "-96.298.619" ...
##  $ NAD27_UTM_X                  : num [1:104173] 243937 NA NA 738884 741149 ...
##  $ NAD27_UTM_Y                  : num [1:104173] 4113207 NA NA 4151660 4132923 ...
##  $ NAD27_UTM_ZONE               : num [1:104173] 15 NA NA 14 14 14 15 15 15 15 ...
##  $ NAD83_UTM_X                  : num [1:104173] 243920 NA NA 738853 741119 ...
##  $ NAD83_UTM_Y                  : num [1:104173] 4113419 NA NA 4151867 4133131 ...
##  $ NAD83_UTM_ZONE               : num [1:104173] 15 NA NA 14 14 14 15 15 15 15 ...
##  $ LEASE_NAME_DOR               : chr [1:104173] "HAVENER,J.G." NA NA "ELDEN BURKE" ...
##  $ FIELD_NAME                   : chr [1:104173] "WAYSIDE-HAVANA" NA NA "DUNKLEBERGER" ...
##  $ x                            : chr [1:104173] "-106.736.183.151" NA NA "-107.199.141.514" ...
##  $ y                            : chr [1:104173] "44.575.388.408" NA NA "45.066.267.765" ...
##  $ ...89                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...90                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...91                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...92                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...93                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...94                        : logi [1:104173] NA NA NA NA NA NA ...
##  $ ...95                        : logi [1:104173] NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   CODE = col_character(),
##   ..   DEPT_WATER_RESOURCES_CODE = col_character(),
##   ..   DEPT_MOTOR_VEHICLES_ABBREV = col_character(),
##   ..   NAME = col_character(),
##   ..   WIZARD_BASE_REFERENCE_YEAR = col_character(),
##   ..   WIZARD_PREDEVELOPMENT_YEAR = col_character(),
##   ..   UPDATE_DATE_1 = col_double(),
##   ..   UPDATE_INITIALS_1 = col_double(),
##   ..   ABBREVIATED_TO_4_CHARS = col_double(),
##   ..   KCC_DISTRICT_NUMBER = col_character(),
##   ..   OBJECTID = col_character(),
##   ..   KID = col_character(),
##   ..   LEASE_NAME = col_character(),
##   ..   FIELD_KID = col_character(),
##   ..   OPERATOR_KID = col_character(),
##   ..   LEASE_CODE_PI_BEENE = col_character(),
##   ..   LEASE_CODE_DOR = col_character(),
##   ..   PRODUCES_GAS = col_character(),
##   ..   PRODUCES_OIL = col_character(),
##   ..   STATE_CODE = col_character(),
##   ..   COUNTY_CODE = col_double(),
##   ..   LATITUDE = col_character(),
##   ..   LATITUDE_LEO_FOOTAGES = col_logical(),
##   ..   LATITUDE_LEO_QUARTER_CALLS = col_character(),
##   ..   LATITUDE_DIRECTION = col_character(),
##   ..   LONGITUDE = col_character(),
##   ..   LONGITUDE_LEO_FOOTAGES = col_character(),
##   ..   LONGITUDE_LEO_QUARTER_CALLS = col_character(),
##   ..   LONGITUDE_DIRECTION = col_character(),
##   ..   LONGITUDE_LATITUDE_SOURCE = col_character(),
##   ..   PRINCIPAL_MERIDIAN = col_double(),
##   ..   TOWNSHIP = col_double(),
##   ..   TOWNSHIP_DIRECTION = col_character(),
##   ..   RANGE = col_double(),
##   ..   RANGE_DIRECTION = col_character(),
##   ..   SECTION = col_character(),
##   ..   SUBDIVISION_1_LARGEST = col_character(),
##   ..   SUBDIVISION_2 = col_character(),
##   ..   SUBDIVISION_3 = col_character(),
##   ..   SUBDIVISION_4_SMALLEST = col_character(),
##   ..   SPOT = col_character(),
##   ..   FEET_NORTH_FROM_REFERENCE = col_double(),
##   ..   FEET_EAST_FROM_REFERENCE = col_double(),
##   ..   REFERENCE_CORNER = col_character(),
##   ..   MEETS_AND_BOUNDS = col_logical(),
##   ..   OLD_SPOT_OR_LOCATION = col_character(),
##   ..   UPDATE_INITIALS = col_character(),
##   ..   UPDATE_DATE = col_character(),
##   ..   SKIP_IT = col_character(),
##   ..   COMMENTS = col_character(),
##   ..   CORRECTIONS = col_character(),
##   ..   OPERATOR_NAME = col_character(),
##   ..   PRODUCING_FORMATION = col_character(),
##   ..   DEPTH_OF_WELL = col_double(),
##   ..   FIELD_CODE_DOR = col_double(),
##   ..   DATA_SOURCE = col_character(),
##   ..   CUMULATIVE_PRODUCTION = col_double(),
##   ..   CUMULATIVE_YEAR_STARTED = col_double(),
##   ..   CUMULATIVE_YEAR_ENDED = col_double(),
##   ..   FIELD_KID_SOURCE = col_character(),
##   ..   LEASE_KID = col_double(),
##   ..   LEASE_KID_SOURCE = col_logical(),
##   ..   PRODUCING_FORMATION_OLD = col_character(),
##   ..   QUARTER_CALLS_SOURCE = col_character(),
##   ..   OPERATOR_NAME_OLD = col_character(),
##   ..   PRODUCING_FORMATION_STRAT_KID = col_double(),
##   ..   PRODUCING_FORMATION_SOURCE = col_character(),
##   ..   LONGITUDE_LATITUDE_DATUM = col_character(),
##   ..   GPS_LATITUDE = col_logical(),
##   ..   GPS_LONGITUDE = col_logical(),
##   ..   GPS_DATUM = col_logical(),
##   ..   GPS_SOURCE = col_logical(),
##   ..   GPS_DATE = col_logical(),
##   ..   GPS_ACCURACY_COMMENTS = col_logical(),
##   ..   NAD27_LATITUDE = col_character(),
##   ..   NAD27_LONGITUDE = col_character(),
##   ..   NAD83_LATITUDE = col_character(),
##   ..   NAD83_LONGITUDE = col_character(),
##   ..   NAD27_UTM_X = col_double(),
##   ..   NAD27_UTM_Y = col_double(),
##   ..   NAD27_UTM_ZONE = col_double(),
##   ..   NAD83_UTM_X = col_double(),
##   ..   NAD83_UTM_Y = col_double(),
##   ..   NAD83_UTM_ZONE = col_double(),
##   ..   LEASE_NAME_DOR = col_character(),
##   ..   FIELD_NAME = col_character(),
##   ..   x = col_character(),
##   ..   y = col_character(),
##   ..   ...89 = col_logical(),
##   ..   ...90 = col_logical(),
##   ..   ...91 = col_logical(),
##   ..   ...92 = col_logical(),
##   ..   ...93 = col_logical(),
##   ..   ...94 = col_logical(),
##   ..   ...95 = col_logical()
##   .. )
##  - attr(*, "problems")=<pointer: 0x0000024e0b4d0620>

2 Extracción y Depuración de Variables

Se extraen y depuran las dos variables continuas de interés:

Variable independiente (X): DEPTH_OF_WELL — Profundidad del pozo en pies.
Variable dependiente (Y): CUMULATIVE_PRODUCTION — Producción acumulada de petróleo en barriles.

Se eliminan registros con valores nulos, ceros o negativos en cualquiera de las dos variables.

datos_dep <- datos_vale %>%
  select(DEPTH_OF_WELL, CUMULATIVE_PRODUCTION) %>%
  filter(
    !is.na(DEPTH_OF_WELL),        !is.na(CUMULATIVE_PRODUCTION),
    DEPTH_OF_WELL > 0,             CUMULATIVE_PRODUCTION > 0
  ) %>%
  rename(
    profundidad = DEPTH_OF_WELL,
    produccion  = CUMULATIVE_PRODUCTION
  )

n <- nrow(datos_dep)
cat("Observaciones válidas tras depuración:", n, "\n")

## Observaciones válidas tras depuración: 51023

# Resumen estadístico de las variables depuradas
resumen <- data.frame(
  Indicador = c("n (observaciones)", "Media (X̄)", "Mediana (Me)",
                "Desv. Estándar (Sd)", "Mínimo", "Máximo",
                "Coef. Variación (CV%)"),
  Profundidad_pies = c(
    n,
    round(mean(datos_dep$profundidad),   2),
    round(median(datos_dep$profundidad), 2),
    round(sd(datos_dep$profundidad),     2),
    round(min(datos_dep$profundidad),    2),
    round(max(datos_dep$profundidad),    2),
    round((sd(datos_dep$profundidad) / mean(datos_dep$profundidad)) * 100, 2)
  ),
  Produccion_barriles = c(
    n,
    round(mean(datos_dep$produccion),    2),
    round(median(datos_dep$produccion),  2),
    round(sd(datos_dep$produccion),      2),
    round(min(datos_dep$produccion),     2),
    round(max(datos_dep$produccion),     2),
    round((sd(datos_dep$produccion) / mean(datos_dep$produccion)) * 100, 2)
  )
)

resumen %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°1: Estadísticos Descriptivos tras Depuración**"),
    subtitle = md("*Profundidad del Pozo (X) — Producción Acumulada (Y)*")
  ) %>%
  cols_label(
    Indicador           = md("**Indicador**"),
    Profundidad_pies    = md("**Profundidad (pies)**"),
    Produccion_barriles = md("**Producción Acumulada (barriles)**")
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, nrow(resumen), by = 2))
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width                = pct(70),
    heading.title.font.size    = px(16),
    heading.subtitle.font.size = px(12),
    table.font.size            = px(13),
    data_row.padding           = px(6)
  )

Indicador	Profundidad (pies)	Producción Acumulada (barriles)
Tabla N°1: Estadísticos Descriptivos tras Depuración
Profundidad del Pozo (X) — Producción Acumulada (Y)
n (observaciones)	51023.00	51023.00
Media (X̄)	3539.17	695741.18
Mediana (Me)	3445.00	73534.85
Desv. Estándar (Sd)	1660.68	1943106.05
Mínimo	1.00	1.00
Máximo	9999.00	37309517.00
Coef. Variación (CV%)	46.92	279.29
Autor: Araujo Valeska

3 Análisis Gráfico Exploratorio

Se grafica la dispersión entre las variables para identificar visualmente el patrón de la relación (lineal, curvada, etc.) antes de ajustar el modelo.

set.seed(42)
muestra <- datos_dep[sample(nrow(datos_dep), 3000), ]

par(mar = c(6, 6, 5, 2))
plot(
  muestra$profundidad,
  muestra$produccion,
  col  = rgb(0.2, 0.2, 0.2, 0.3),
  pch  = 16,
  cex  = 0.5,
  xlab = "",
  ylab = "",
  main = ""
)
mtext("Profundidad del Pozo (pies)",     side = 1, line = 4,   cex = 1)
mtext("Producción Acumulada (barriles)", side = 2, line = 4.5, cex = 1)
mtext("Gráfica Exploratoria: Dispersión entre Profundidad y Producción Acumulada",
      side = 3, line = 2, adj = 0.5, cex = 0.9, font = 2)

# Línea de tendencia suavizada (lowess)
linea <- lowess(muestra$profundidad, muestra$produccion, f = 0.3)
lines(linea, col = "black", lwd = 2)

legend("topright",
       legend = "Tendencia LOWESS",
       col    = "black", lwd = 2, bty = "n", cex = 0.85)

La dispersión muestra una gran variabilidad en la producción para todos los rangos de profundidad. La tendencia LOWESS sugiere una relación no lineal, lo que motiva el uso de regresión polinómica.

4 Aplicación de Binning

Se aplica binning de igual amplitud (10 intervalos) sobre la variable independiente profundidad. Para cada intervalo se calcula la producción media, lo que permite visualizar la tendencia central de la relación entre las variables y orientar la selección del grado del polinomio.

# Crear 10 bins de igual amplitud sobre profundidad
datos_dep$bin_prof <- cut(datos_dep$profundidad, breaks = 10)

tabla_bins <- datos_dep %>%
  group_by(bin_prof) %>%
  summarise(
    n_obs        = n(),
    media_prof   = round(mean(profundidad),  2),
    media_prod   = round(mean(produccion),   2),
    mediana_prod = round(median(produccion), 2),
    .groups      = "drop"
  ) %>%
  mutate(bin_num = row_number())

# Tabla gt de bins
tabla_bins %>%
  select(bin_num, bin_prof, n_obs, media_prof, media_prod, mediana_prod) %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°2: Binning — Profundidad vs Producción Media**"),
    subtitle = md("*10 intervalos de igual amplitud sobre la profundidad del pozo*")
  ) %>%
  cols_label(
    bin_num      = md("**Bin**"),
    bin_prof     = md("**Intervalo Profundidad (pies)**"),
    n_obs        = md("**n**"),
    media_prof   = md("**Media X (pies)**"),
    media_prod   = md("**Media Y (barriles)**"),
    mediana_prod = md("**Mediana Y (barriles)**")
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, nrow(tabla_bins), by = 2))
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width                = pct(80),
    heading.title.font.size    = px(16),
    heading.subtitle.font.size = px(12),
    table.font.size            = px(13),
    data_row.padding           = px(6)
  )

Bin	Intervalo Profundidad (pies)	n	Media X (pies)	Media Y (barriles)	Mediana Y (barriles)
Tabla N°2: Binning — Profundidad vs Producción Media
10 intervalos de igual amplitud sobre la profundidad del pozo
1	(-9,1e+03]	3684	779.01	50282.94	18962.51
2	(1e+03,2e+03]	5620	1323.57	111456.43	72180.77
3	(2e+03,3e+03]	10021	2729.37	2460583.48	1055483.00
4	(3e+03,4e+03]	12783	3482.68	318479.55	56880.95
5	(4e+03,5e+03]	10747	4519.55	235632.65	31861.67
6	(5e+03,6e+03]	5223	5457.85	395627.68	62859.00
7	(6e+03,7e+03]	2037	6382.90	527305.73	111652.00
8	(7e+03,8e+03]	110	7370.20	710591.07	39114.18
9	(8e+03,9e+03]	165	8682.22	230684.81	51994.00
10	(9e+03,1e+04]	633	9601.23	267750.69	75034.31
Autor: Araujo Valeska

par(mar = c(6, 6, 5, 2))
plot(
  tabla_bins$media_prof,
  tabla_bins$media_prod,
  col  = "gray20",
  pch  = 19,
  cex  = 1.4,
  xlab = "",
  ylab = "",
  main = "",
  type = "b",
  lty  = 2,
  lwd  = 1.5
)
mtext("Punto Medio del Bin — Profundidad (pies)",  side = 1, line = 4,   cex = 1)
mtext("Producción Acumulada Media (barriles)",      side = 2, line = 4.5, cex = 1)
mtext("Gráfica N°2: Tendencia de la Producción Media por Bin de Profundidad",
      side = 3, line = 2, adj = 0.5, cex = 0.9, font = 2)

La gráfica de bins revela una tendencia no lineal (con un pico en los rangos intermedios), lo que refuerza la pertinencia del modelo polinómico.

5 Conjetura del Modelo de Regresión Polinómica

A partir del análisis exploratorio y el binning, se conjetura que la relación entre la profundidad del pozo y la producción acumulada puede representarse mediante un polinomio de grado 5, que tiene la forma general:

\[\hat{Y} = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5\]

donde:

\(\hat{Y}\) = Producción acumulada estimada (barriles)
\(X\) = Profundidad del pozo (pies)
\(\beta_0, \ldots, \beta_5\) = Coeficientes del modelo estimados por mínimos cuadrados ordinarios

Se ajustan los modelos del grado 1 al 5 para comparar su capacidad explicativa.

mod1 <- lm(produccion ~ profundidad,                                                             data = datos_dep)
mod2 <- lm(produccion ~ profundidad + I(profundidad^2),                                          data = datos_dep)
mod3 <- lm(produccion ~ profundidad + I(profundidad^2) + I(profundidad^3),                       data = datos_dep)
mod4 <- lm(produccion ~ profundidad + I(profundidad^2) + I(profundidad^3) + I(profundidad^4),    data = datos_dep)
mod5 <- lm(produccion ~ profundidad + I(profundidad^2) + I(profundidad^3) + I(profundidad^4) +
             I(profundidad^5),                                                                   data = datos_dep)

get_metrics <- function(m, grado) {
  s   <- summary(m)
  pf_val <- pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
  data.frame(
    Grado       = grado,
    R2_ajustado = round(s$adj.r.squared, 6),
    RSE         = round(s$sigma, 2),
    AIC         = round(AIC(m), 2),
    p_valor_F   = format.pval(pf_val, digits = 3, eps = 0.001)
  )
}

comparacion <- bind_rows(
  get_metrics(mod1, 1),
  get_metrics(mod2, 2),
  get_metrics(mod3, 3),
  get_metrics(mod4, 4),
  get_metrics(mod5, 5)
)

mejor_idx <- which.min(comparacion$AIC)

comparacion %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°3: Comparación de Modelos Polinómicos (Grado 1 a 5)**"),
    subtitle = md("*Criterio de selección: menor AIC*")
  ) %>%
  cols_label(
    Grado       = md("**Grado**"),
    R2_ajustado = md("**R² Ajustado**"),
    RSE         = md("**RSE**"),
    AIC         = md("**AIC**"),
    p_valor_F   = md("**p-valor F**")
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, nrow(comparacion), by = 2))
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#D6D6D6"),
                     cell_text(weight = "bold")),
    locations = cells_body(rows = mejor_idx, columns = everything())
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska — Fila resaltada: mejor modelo*")) %>%
  tab_options(
    table.width                = pct(70),
    heading.title.font.size    = px(16),
    heading.subtitle.font.size = px(12),
    table.font.size            = px(13),
    data_row.padding           = px(6)
  )

Grado	R² Ajustado	RSE	AIC	p-valor F
Tabla N°3: Comparación de Modelos Polinómicos (Grado 1 a 5)
Criterio de selección: menor AIC
1	0.004787	1938449	1622162	<0.001
2	0.011709	1931697	1621807	<0.001
3	0.052868	1891045	1619637	<0.001
4	0.087635	1856013	1617730	<0.001
5	0.087652	1855994	1617730	<0.001
Autor: Araujo Valeska — Fila resaltada: mejor modelo

6 Gráfica del Modelo Polinómico (Grado 5)

Se grafica el modelo polinómico de grado 5 ajustado sobre la nube de puntos (muestra de 3 000 observaciones) para evaluar visualmente su capacidad de ajuste.

x_seq  <- seq(min(datos_dep$profundidad), max(datos_dep$profundidad), length.out = 500)
y_pred <- predict(mod5, newdata = data.frame(profundidad = x_seq))

par(mar = c(6, 6, 5, 2))
plot(
  muestra$profundidad,
  muestra$produccion,
  col  = rgb(0.2, 0.2, 0.2, 0.25),
  pch  = 16,
  cex  = 0.45,
  xlab = "",
  ylab = "",
  main = ""
)
lines(x_seq, y_pred, col = "black", lwd = 2.5)

mtext("Profundidad del Pozo (pies)",     side = 1, line = 4,   cex = 1)
mtext("Producción Acumulada (barriles)", side = 2, line = 4.5, cex = 1)
mtext("Gráfica N°3: Ajuste del Modelo Polinómico de Grado 5",
      side = 3, line = 2, adj = 0.5, cex = 0.9, font = 2)

legend("topright",
       legend = paste0("Polinomio grado 5   R²aj = ",
                       round(summary(mod5)$adj.r.squared, 4)),
       col = "black", lwd = 2, bty = "n", cex = 0.85)

7 Test de Bondad del Modelo

7.1 Coeficiente de Correlación

El coeficiente de correlación de Pearson (\(r\)) mide la fuerza y dirección de la relación lineal entre \(X\) y \(Y\).

r <- cor(datos_dep$profundidad, datos_dep$produccion)

cat("Coeficiente de correlación de Pearson (r):", round(r, 6), "\n")

## Coeficiente de correlación de Pearson (r): -0.069332

interpretacion <- dplyr::case_when(
  abs(r) < 0.10 ~ "correlación muy débil o nula",
  abs(r) < 0.30 ~ "correlación débil",
  abs(r) < 0.50 ~ "correlación moderada",
  abs(r) < 0.70 ~ "correlación moderada-fuerte",
  TRUE           ~ "correlación fuerte"
)
direccion <- ifelse(r < 0, "negativa", "positiva")
cat("Interpretación:", interpretacion, direccion, "\n")

## Interpretación: correlación muy débil o nula negativa

data.frame(
  Indicador = c("r de Pearson", "Dirección", "Magnitud"),
  Valor     = c(round(r, 6), direccion, interpretacion)
) %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°4: Coeficiente de Correlación de Pearson**"),
    subtitle = md("*Profundidad del Pozo (X) — Producción Acumulada (Y)*")
  ) %>%
  cols_label(Indicador = md("**Indicador**"), Valor = md("**Valor**")) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, 3, by = 2))
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width             = pct(55),
    heading.title.font.size = px(16), table.font.size = px(13),
    data_row.padding        = px(6)
  )

Indicador	Valor
Tabla N°4: Coeficiente de Correlación de Pearson
Profundidad del Pozo (X) — Producción Acumulada (Y)
r de Pearson	-0.069332
Dirección	negativa
Magnitud	correlación muy débil o nula
Autor: Araujo Valeska

7.2 Coeficiente de Determinación

El coeficiente de determinación (\(R^2\)) indica la proporción de la variabilidad total de \(Y\) que es explicada por el modelo polinómico de grado 5.

s5      <- summary(mod5)
r2      <- s5$r.squared
r2_adj  <- s5$adj.r.squared

cat("R² (coeficiente de determinación):         ", round(r2,     6), "\n")

## R² (coeficiente de determinación):          0.087742

cat("R² ajustado:                               ", round(r2_adj, 6), "\n")

## R² ajustado:                                0.087652

cat("Variabilidad explicada por el modelo:      ", round(r2 * 100, 4), "%\n")

## Variabilidad explicada por el modelo:       8.7742 %

cat("Variabilidad no explicada (residual):      ", round((1 - r2) * 100, 4), "%\n")

## Variabilidad no explicada (residual):       91.2258 %

data.frame(
  Indicador = c("R² (determinación)", "R² Ajustado",
                "Variabilidad explicada (%)", "Variabilidad residual (%)"),
  Valor     = c(round(r2, 6), round(r2_adj, 6),
                paste0(round(r2 * 100, 4), " %"),
                paste0(round((1 - r2) * 100, 4), " %"))
) %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°5: Coeficiente de Determinación — Modelo Grado 5**"),
    subtitle = md("*Proporción de variabilidad explicada por el modelo polinómico*")
  ) %>%
  cols_label(Indicador = md("**Indicador**"), Valor = md("**Valor**")) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, 4, by = 2))
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#D6D6D6"),
                     cell_text(weight = "bold")),
    locations = cells_body(rows = 1, columns = everything())
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width             = pct(60),
    heading.title.font.size = px(16), table.font.size = px(13),
    data_row.padding        = px(6)
  )

Indicador	Valor
Tabla N°5: Coeficiente de Determinación — Modelo Grado 5
Proporción de variabilidad explicada por el modelo polinómico
R² (determinación)	0.087742
R² Ajustado	0.087652
Variabilidad explicada (%)	8.7742 %
Variabilidad residual (%)	91.2258 %
Autor: Araujo Valeska

8 Ecuación del Modelo

Los coeficientes \(\beta\) del modelo polinómico de grado 5 estimados por mínimos cuadrados son:

betas <- coef(mod5)
nombres_beta <- c("β₀ (Intercepto)", "β₁ (X)", "β₂ (X²)", "β₃ (X³)", "β₄ (X⁴)", "β₅ (X⁵)")

data.frame(Coeficiente = nombres_beta, Estimación = format(betas, scientific = TRUE, digits = 5)) %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°6: Coeficientes del Modelo Polinómico Grado 5**"),
    subtitle = md("*Ŷ = β₀ + β₁X + β₂X² + β₃X³ + β₄X⁴ + β₅X⁵*")
  ) %>%
  cols_label(Coeficiente = md("**Coeficiente**"), Estimación = md("**Estimación**")) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, 6, by = 2))
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width             = pct(60),
    heading.title.font.size = px(16), table.font.size = px(13),
    data_row.padding        = px(6)
  )

Coeficiente	Estimación
Tabla N°6: Coeficientes del Modelo Polinómico Grado 5
Ŷ = β₀ + β₁X + β₂X² + β₃X³ + β₄X⁴ + β₅X⁵
β₀ (Intercepto)	-3.5826e+06
β₁ (X)	5.7993e+03
β₂ (X²)	-2.1441e+00
β₃ (X³)	2.8426e-04
β₄ (X⁴)	-1.0638e-08
β₅ (X⁵)	-1.8571e-13
Autor: Araujo Valeska

La ecuación estimada del modelo es:

\[\hat{Y} = -3.583e+06 + 5.799e+03X + -2.144e+00X^2 + 2.843e-04X^3 + -1.064e-08X^4 + -1.857e-13X^5\]

9 Tabla Resumen del Modelo

Se presenta el resumen completo del modelo polinómico de grado 5, incluyendo los coeficientes, errores estándar, estadístico t y p-valores para cada término.

coef_df <- as.data.frame(s5$coefficients)
coef_df$Termino <- nombres_beta
coef_df <- coef_df[, c("Termino", "Estimate", "Std. Error", "t value", "Pr(>|t|)")]

coef_df %>%
  mutate(
    Estimate     = round(Estimate,     4),
    `Std. Error` = round(`Std. Error`, 4),
    `t value`    = round(`t value`,    4),
    `Pr(>|t|)`   = format.pval(`Pr(>|t|)`, digits = 3, eps = 0.001),
    Significancia = case_when(
      as.numeric(gsub("<", "", `Pr(>|t|)`)) < 0.001 ~ "***",
      as.numeric(gsub("<", "", `Pr(>|t|)`)) < 0.01  ~ "**",
      as.numeric(gsub("<", "", `Pr(>|t|)`)) < 0.05  ~ "*",
      as.numeric(gsub("<", "", `Pr(>|t|)`)) < 0.10  ~ ".",
      TRUE ~ ""
    )
  ) %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°7: Resumen del Modelo Polinómico — Grado 5**"),
    subtitle = md("*Variable dependiente: Producción Acumulada (barriles)*")
  ) %>%
  cols_label(
    Termino      = md("**Término**"),
    Estimate     = md("**Estimación (β)**"),
    `Std. Error` = md("**Error Estándar**"),
    `t value`    = md("**Estadístico t**"),
    `Pr(>|t|)`   = md("**p-valor**"),
    Significancia = md("**Sig.**")
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, 6, by = 2))
  ) %>%
  tab_footnote(
    footnote  = "Códigos de significancia: *** p<0.001  ** p<0.01  * p<0.05  . p<0.10",
    locations = cells_column_labels(columns = Significancia)
  ) %>%
  tab_source_note(source_note = md(paste0(
    "*R² = ", round(r2, 6),
    " | R² Ajustado = ", round(r2_adj, 6),
    " | RSE = ", round(s5$sigma, 2),
    " | n = ", n, "*"
  ))) %>%
  tab_options(
    table.width                = pct(85),
    heading.title.font.size    = px(16),
    heading.subtitle.font.size = px(12),
    table.font.size            = px(13),
    data_row.padding           = px(6)
  )

Término	Estimación (β)	Error Estándar	Estadístico t	p-valor	Sig.¹
Tabla N°7: Resumen del Modelo Polinómico — Grado 5
Variable dependiente: Producción Acumulada (barriles)
β₀ (Intercepto)	-3582602.6282	116119.7843	-30.8526	<0.001	**
β₁ (X)	5799.3185	203.3248	28.5224	<0.001	**
β₂ (X²)	-2.1441	0.1157	-18.5323	<0.001	**
β₃ (X³)	0.0003	0.0000	9.9063	<0.001	**
β₄ (X⁴)	0.0000	0.0000	-3.3156	<0.001	**
β₅ (X⁵)	0.0000	0.0000	-1.4135	0.158
¹ Códigos de significancia: * p<0.001 p<0.01 * p<0.05 . p<0.10
R² = 0.087742 \| R² Ajustado = 0.087652 \| RSE = 1855994.49 \| n = 51023

10 Cálculo de Estimaciones

Se realizan estimaciones puntuales de la producción acumulada para distintos valores de profundidad del pozo, utilizando el modelo polinómico de grado 5.

# Valores de profundidad a estimar (representativos del rango de datos)
x_estimar <- c(500, 1000, 1500, 2000, 2500, 3000, 3500, 4000,
               4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000)

y_estimado <- predict(mod5,
                      newdata  = data.frame(profundidad = x_estimar),
                      interval = "prediction",
                      level    = 0.95)

tabla_estimaciones <- data.frame(
  Profundidad_pies    = x_estimar,
  Produccion_estimada = round(y_estimado[, "fit"],   2),
  LI_95               = round(y_estimado[, "lwr"],   2),
  LS_95               = round(y_estimado[, "upr"],   2)
)

tabla_estimaciones %>%
  gt() %>%
  tab_header(
    title    = md("**Tabla N°8: Estimaciones del Modelo Polinómico — Grado 5**"),
    subtitle = md("*Producción acumulada estimada con intervalo de predicción al 95%*")
  ) %>%
  cols_label(
    Profundidad_pies    = md("**Profundidad X (pies)**"),
    Produccion_estimada = md("**Ŷ estimada (barriles)**"),
    LI_95               = md("**LI 95%**"),
    LS_95               = md("**LS 95%**")
  ) %>%
  tab_style(
    style     = list(cell_fill(color = "#2C2C2C"),
                     cell_text(color = "white", weight = "bold")),
    locations = cells_column_labels()
  ) %>%
  tab_style(
    style     = cell_fill(color = "#F5F5F5"),
    locations = cells_body(rows = seq(1, nrow(tabla_estimaciones), by = 2))
  ) %>%
  tab_source_note(source_note = md("*Autor: Araujo Valeska*")) %>%
  tab_options(
    table.width                = pct(75),
    heading.title.font.size    = px(16),
    heading.subtitle.font.size = px(12),
    table.font.size            = px(13),
    data_row.padding           = px(6)
  )

Profundidad X (pies)	Ŷ estimada (barriles)	LI 95%	LS 95%
Tabla N°8: Estimaciones del Modelo Polinómico — Grado 5
Producción acumulada estimada con intervalo de predicción al 95%
500	-1184108.42	-4822916	2454700
1000	346043.82	-3291934	3984022
1500	1196242.03	-2441755	4834239
2000	1537524.52	-2100437	5175486
2500	1522883.63	-2115004	5160771
3000	1286569.21	-2351281	4924419
3500	943392.29	-2694453	4581238
4000	588028.61	-3049820	4225877
4500	294322.20	-3343535	3932179
5000	114588.96	-3523294	3752472
5500	78920.26	-3559017	3716858
6000	194486.47	-3443551	3832524
6500	444840.61	-3193399	4083080
7000	789221.84	-2849435	4427879
7500	1161859.11	-2477534	4801252
8000	1471274.71	-2169085	5111634
Autor: Araujo Valeska

par(mar = c(6, 6, 5, 2))
plot(
  muestra$profundidad,
  muestra$produccion,
  col  = rgb(0.2, 0.2, 0.2, 0.2),
  pch  = 16,
  cex  = 0.45,
  xlab = "",
  ylab = "",
  main = ""
)
lines(x_seq, y_pred, col = "black", lwd = 2)
points(tabla_estimaciones$Profundidad_pies,
       tabla_estimaciones$Produccion_estimada,
       col = "gray20", pch = 17, cex = 1.2)

mtext("Profundidad del Pozo (pies)",     side = 1, line = 4,   cex = 1)
mtext("Producción Acumulada (barriles)", side = 2, line = 4.5, cex = 1)
mtext("Gráfica N°4: Modelo Polinómico Grado 5 con Estimaciones",
      side = 3, line = 2, adj = 0.5, cex = 0.9, font = 2)

legend("topright",
       legend = c("Datos (muestra)", "Curva ajustada", "Estimaciones"),
       col    = c(rgb(0.2, 0.2, 0.2, 0.4), "black", "gray20"),
       pch    = c(16, NA, 17),
       lty    = c(NA, 1, NA),
       lwd    = c(NA, 2, NA),
       bty    = "n", cex = 0.8)

11 Conclusiones

Se ajustó un modelo de regresión polinómica de grado 5 para analizar la relación entre la profundidad del pozo (DEPTH_OF_WELL, variable independiente) y la producción acumulada de petróleo (CUMULATIVE_PRODUCTION, variable dependiente), utilizando 51023 observaciones válidas del conjunto de datos de arrendamientos de hidrocarburos de Kansas.

Los principales hallazgos son:

El coeficiente de correlación de Pearson es r = -0.0693, lo que indica una correlación débil negativa entre las variables.
El modelo polinómico de grado 5 explica el 8.7742% de la variabilidad total de la producción acumulada (R²), con un R² ajustado de 8.7652%.
El Error Estándar Residual (RSE) es de 1.8559945^{6} barriles.
El modelo global resultó estadísticamente significativo (p < 0.05).
El bajo R² indica que la profundidad del pozo, por sí sola, no determina la producción acumulada de manera precisa; otras variables del dataset (formación productiva, operador, localización) contribuyen significativamente a la variabilidad observada.

La aplicación de binning fue clave para identificar la tendencia curvilínea no evidente en la dispersión directa, justificando el uso del modelo polinómico sobre la regresión lineal simple.

Autor: Araujo Valeska