Project description

The Automobile Sales Data dataset, taken from Kaggle, contains information on sales of automotive products, probably collected by a commercial company through a sales management system (ERP) that records transactions automatically. The dataset includes 2,722 records and 20 variables.

This notebook analyzes the Automobile Sales Data dataset to identify patterns, distributions, and relationships that can inform business decisions such as inventory, marketing, and logistics strategies.

The analysis includes: a descriptive analysis of categorical variables (univariate and bivariate), a descriptive analysis of quantitative variables (univariate and bivariate), a bivariate analysis between categorical and quantitative variables, and hypothesis tests.

Libraries and correct data types

The required packages are loaded first. The dataset is then read from an Excel file, the columns are converted to the correct data types (numeric, factor, date), and key variables are checked for missing values to prepare the data for analysis.

# Libraries and correct data types
library(readxl)

## Warning: package 'readxl' was built under R version 4.4.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Adjuntando el paquete: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.4.3

library(treemap)

## Warning: package 'treemap' was built under R version 4.4.3

library(moments)
library(knitr)

## Warning: package 'knitr' was built under R version 4.4.3

# Load the dataset
data <- read_excel("~/Auto_Sales_data_no_outliers1.xlsx")
knitr::opts_chunk$set(echo = TRUE)

# Ensure correct data types
data <- data %>%
  mutate(
    ORDERNUMBER = as.numeric(ORDERNUMBER),
    QUANTITYORDERED = as.numeric(QUANTITYORDERED),
    PRICEEACH = as.numeric(PRICEEACH),
    ORDERLINENUMBER = as.numeric(ORDERLINENUMBER),
    SALES = as.numeric(SALES),
    ORDERDATE = as.Date(ORDERDATE, format = "%Y-%m-%d"),
    DAYS_SINCE_LASTORDER = as.numeric(DAYS_SINCE_LASTORDER),
    STATUS = as.factor(toupper(trimws(STATUS))),
    PRODUCTLINE = as.factor(PRODUCTLINE),
    MSRP = as.numeric(MSRP),
    PRODUCTCODE = as.factor(PRODUCTCODE),
    CUSTOMERNAME = as.factor(CUSTOMERNAME),
    PHONE = as.character(PHONE),
    ADDRESSLINE1 = as.character(ADDRESSLINE1),
    CITY = as.factor(CITY),
    POSTALCODE = as.character(POSTALCODE),
    COUNTRY = as.factor(COUNTRY),
    CONTACTLASTNAME = as.character(CONTACTLASTNAME),
    CONTACTFIRSTNAME = as.character(CONTACTFIRSTNAME),
    DEALSIZE = as.factor(DEALSIZE)
  )

# Check for NA values in key columns
vars_to_check <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER", "PRODUCTLINE", "DEALSIZE", "STATUS")
colSums(is.na(data[vars_to_check]))

##                SALES      QUANTITYORDERED DAYS_SINCE_LASTORDER 
##                    3                    3                    3 
##          PRODUCTLINE             DEALSIZE               STATUS 
##                    3                    3                    3
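
Every key column reports exactly 3 missing values, which suggests (but does not prove) that the file contains three completely empty rows. The sketch below shows one way to check this and drop such rows. It uses a small illustrative data frame, `toy`, rather than the real data, and the names `empty_rows` and `toy_clean` are hypothetical.

```r
# Toy data frame standing in for the sales data (values are illustrative only)
toy <- data.frame(
  SALES = c(2871.0, NA, 3884.3, NA),
  QUANTITYORDERED = c(30, NA, 41, NA),
  STATUS = c("SHIPPED", NA, "SHIPPED", NA)
)

# A row is "empty" when every one of its columns is NA
empty_rows <- rowSums(!is.na(toy)) == 0

n_empty <- sum(empty_rows)        # number of fully empty rows
toy_clean <- toy[!empty_rows, ]   # drop them before the analysis

print(n_empty)
print(colSums(is.na(toy_clean)))
```

If the same check on the real data returns 3, removing those rows once at load time would avoid the repeated "Removed 3 rows" warnings in the plots that follow.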

Descriptive analysis of categorical variables

This section presents a descriptive analysis of the categorical variables PRODUCTLINE, DEALSIZE, and STATUS. Each variable is first analyzed on its own to understand its distribution through frequencies, proportions, and charts such as bar charts and treemaps. A bivariate analysis of DEALSIZE and STATUS then explores possible relationships between these variables using cross-tabulations and a stacked bar chart. The aim is to identify patterns that can support logistics and business decisions.

Univariate: PRODUCTLINE, DEALSIZE, STATUS

Univariate frequency tables are built for the categorical variables, showing the count and proportion of cases in each category.

# Univariate frequency tables
categorical_vars <- c("PRODUCTLINE", "DEALSIZE", "STATUS")
for (var in categorical_vars) {
  freq_table <- data.frame(table(data[[var]]))
  colnames(freq_table) <- c(var, "Frecuencia")
  freq_table$Proporcion <- prop.table(freq_table$Frecuencia)
  print(kable(freq_table, caption = paste("Tabla de frecuencias para", var)))
}

## 
## 
## Table: Tabla de frecuencias para PRODUCTLINE
## 
## |PRODUCTLINE      | Frecuencia| Proporcion|
## |:----------------|----------:|----------:|
## |Classic Cars     |        933|  0.3431409|
## |Motorcycles      |        309|  0.1136447|
## |Planes           |        302|  0.1110702|
## |Ships            |        230|  0.0845899|
## |Trains           |         77|  0.0283192|
## |Trucks and Buses |        295|  0.1084958|
## |Vintage Cars     |        573|  0.2107392|
## 
## 
## Table: Tabla de frecuencias para DEALSIZE
## 
## |DEALSIZE | Frecuencia| Proporcion|
## |:--------|----------:|----------:|
## |Large    |        124|  0.0456050|
## |Medium   |       1349|  0.4961383|
## |Small    |       1246|  0.4582567|
## 
## 
## Table: Tabla de frecuencias para STATUS
## 
## |STATUS     | Frecuencia| Proporcion|
## |:----------|----------:|----------:|
## |CANCELLED  |         60|  0.0220669|
## |DISPUTED   |         12|  0.0044134|
## |IN PROCESS |         40|  0.0147113|
## |ON HOLD    |         43|  0.0158146|
## |RESOLVED   |         47|  0.0172858|
## |SHIPPED    |       2517|  0.9257080|

# Univariate charts with values on the bars
# Bar chart for PRODUCTLINE with values
ggplot(data, aes(x = PRODUCTLINE)) +
  geom_bar(fill = "lightgreen", color = "black") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(title = "Frecuencia de Lineas de Producto", x = "Linea de Producto", y = "Frecuencia") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("bar_productline_with_values.png", width = 8, height = 6)

# Bar chart for DEALSIZE with values
ggplot(data, aes(x = DEALSIZE)) +
  geom_bar(fill = "lightcoral", color = "black") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(title = "Frecuencia de Tamanio del Pedido (DEALSIZE)", x = "Tamanio del Pedido", y = "Frecuencia") +
  theme_minimal()

ggsave("bar_dealsize_with_values.png", width = 8, height = 6)
# Treemap for STATUS (area proportional to total SALES)
treemap(data, index = "STATUS", vSize = "SALES", title = "Treemap: Ventas por Estado del Pedido (STATUS)")

Bivariate: DEALSIZE vs. STATUS

A cross-tabulation of the two variables shows the frequency of each combination. Row proportions are then computed (the share of each STATUS within each DEALSIZE), and a stacked bar chart displays the STATUS frequencies for each DEALSIZE.

# Cross-tabulation of DEALSIZE and STATUS
cross_table <- table(data$DEALSIZE, data$STATUS)
prop_table <- prop.table(cross_table, margin = 1)
print("Tabla cruzada entre DEALSIZE y STATUS:")

## [1] "Tabla cruzada entre DEALSIZE y STATUS:"

print(kable(cross_table))

## 
## 
## |       | CANCELLED| DISPUTED| IN PROCESS| ON HOLD| RESOLVED| SHIPPED|
## |:------|---------:|--------:|----------:|-------:|--------:|-------:|
## |Large  |         0|        3|          2|       4|        1|     114|
## |Medium |        33|        5|         18|      24|       26|    1243|
## |Small  |        27|        4|         20|      15|       20|    1160|

print("Proporciones (por fila) entre DEALSIZE y STATUS:")

## [1] "Proporciones (por fila) entre DEALSIZE y STATUS:"

print(kable(prop_table))

## 
## 
## |       | CANCELLED|  DISPUTED| IN PROCESS|   ON HOLD|  RESOLVED|   SHIPPED|
## |:------|---------:|---------:|----------:|---------:|---------:|---------:|
## |Large  | 0.0000000| 0.0241935|  0.0161290| 0.0322581| 0.0080645| 0.9193548|
## |Medium | 0.0244626| 0.0037064|  0.0133432| 0.0177910| 0.0192735| 0.9214233|
## |Small  | 0.0216693| 0.0032103|  0.0160514| 0.0120385| 0.0160514| 0.9309791|

# Stacked bar chart with values
cross_table_df <- as.data.frame.table(cross_table)
ggplot(cross_table_df, aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Freq), position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Diagrama de Barras Apiladas: DEALSIZE vs. STATUS", x = "Tamanio del Pedido (DEALSIZE)", y = "Frecuencia", fill = "Estado del Pedido (STATUS)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("stacked_bar_dealsize_status_with_values.png", width = 8, height = 6)

Descriptive analysis of quantitative variables

This section covers the descriptive analysis of the quantitative variables SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER. The univariate analysis computes descriptive statistics such as the mean, median, standard deviation, skewness, and kurtosis, and visualizes the distributions with histograms and boxplots to check whether they follow a known distributional model. The bivariate analysis explores the relationship between SALES and DAYS_SINCE_LASTORDER through a correlation test and a scatter plot to determine whether the two variables are associated. Together these describe the behavior of sales and customer activity.

Univariate: SALES, QUANTITYORDERED, DAYS_SINCE_LASTORDER

# Univariate summary statistics
numeric_vars <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER")
numeric_stats <- data %>%
  summarise(across(
    all_of(numeric_vars),
    list(
      Mean = ~mean(., na.rm = TRUE),
      Median = ~median(., na.rm = TRUE),
      SD = ~sd(., na.rm = TRUE),
      Min = ~min(., na.rm = TRUE),
      Q1 = ~quantile(., 0.25, na.rm = TRUE),
      Q3 = ~quantile(., 0.75, na.rm = TRUE),
      Max = ~max(., na.rm = TRUE),
      Skewness = ~skewness(., na.rm = TRUE),
      Kurtosis = ~kurtosis(., na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  ))

# Reshape the statistics into a clearer format
stats_list <- list()
for (var in numeric_vars) {
  stats_var <- data.frame(
    Metric = c("Mean", "Median", "SD", "Min", "Q1", "Q3", "Max", "Skewness", "Kurtosis"),
    Value = c(
      numeric_stats[[paste0(var, "_Mean")]],
      numeric_stats[[paste0(var, "_Median")]],
      numeric_stats[[paste0(var, "_SD")]],
      numeric_stats[[paste0(var, "_Min")]],
      numeric_stats[[paste0(var, "_Q1")]],
      numeric_stats[[paste0(var, "_Q3")]],
      numeric_stats[[paste0(var, "_Max")]],
      numeric_stats[[paste0(var, "_Skewness")]],
      numeric_stats[[paste0(var, "_Kurtosis")]]
    )
  )
  stats_list[[var]] <- stats_var
}

for (var in names(stats_list)) {
  print(kable(stats_list[[var]], caption = paste("Estadísticas descriptivas para", var)))
}

## 
## 
## Table: Estadísticas descriptivas para SALES
## 
## |Metric   |        Value|
## |:--------|------------:|
## |Mean     | 3481.9234939|
## |Median   | 3167.3600000|
## |SD       | 1704.2611303|
## |Min      |  482.1300000|
## |Q1       | 2197.1100000|
## |Q3       | 4437.1000000|
## |Max      | 9160.3600000|
## |Skewness |    0.8364885|
## |Kurtosis |    3.2735167|
## 
## 
## Table: Estadísticas descriptivas para QUANTITYORDERED
## 
## |Metric   |      Value|
## |:--------|----------:|
## |Mean     | 34.9187201|
## |Median   | 34.0000000|
## |SD       |  9.5914528|
## |Min      |  6.0000000|
## |Q1       | 27.0000000|
## |Q3       | 43.0000000|
## |Max      | 97.0000000|
## |Skewness |  0.3137086|
## |Kurtosis |  3.2859502|
## 
## 
## Table: Estadísticas descriptivas para DAYS_SINCE_LASTORDER
## 
## |Metric   |        Value|
## |:--------|------------:|
## |Mean     | 1764.7241633|
## |Median   | 1769.0000000|
## |SD       |  816.5323832|
## |Min      |   42.0000000|
## |Q1       | 1087.0000000|
## |Q3       | 2440.5000000|
## |Max      | 3562.0000000|
## |Skewness |   -0.0061354|
## |Kurtosis |    1.9764772|
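
To read the Skewness and Kurtosis rows above, it helps to see what is being computed. The `skewness` and `kurtosis` functions from the `moments` package are moment-based, and the kurtosis is Pearson's measure, not excess kurtosis, so a normal distribution scores 3 rather than 0. The sketch below reimplements both in base R. The function names `skewness_manual` and `kurtosis_manual` and the toy vector are illustrative assumptions.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skewness_manual <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Pearson kurtosis (not excess kurtosis): a normal distribution gives 3
kurtosis_manual <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)
  m4 / m2^2
}

toy <- c(1, 2, 3, NA)  # illustrative values, not taken from the dataset
print(skewness_manual(toy))  # 0: the values are symmetric around 2
print(kurtosis_manual(toy))  # 1.5: flatter than a normal distribution
```

On this scale, the kurtosis of about 1.98 for DAYS_SINCE_LASTORDER indicates a flatter-than-normal shape, while the values near 3.3 for SALES and QUANTITYORDERED are only slightly heavier-tailed than a normal distribution.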

Univariate charts with fitted density functions

Histograms are drawn for SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER, and a different theoretical density (gamma, normal, and log-normal, respectively) is overlaid on each one using stat_function from ggplot2.

Histogram of SALES

# Histogram of SALES with a gamma density (method-of-moments estimates)
sales_mean <- mean(data$SALES, na.rm = TRUE)
sales_var <- var(data$SALES, na.rm = TRUE)
shape <- sales_mean^2 / sales_var
scale <- sales_var / sales_mean
ggplot(data, aes(x = SALES)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  stat_function(fun = dgamma, args = list(shape = shape, scale = scale), color = "red", linewidth = 1) +
  labs(title = "Distribucion de Ventas (SALES) con Densidad Gamma", x = "Ventas (SALES)", y = "Densidad") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales_gamma.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
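
The gamma curve in the SALES histogram is fitted by the method of moments: a gamma distribution has mean `shape * scale` and variance `shape * scale^2`, so solving for the two parameters gives `shape = mean^2 / variance` and `scale = variance / mean`. The sketch below reproduces that calculation from the summary statistics printed earlier instead of the raw data, so small rounding differences are possible.

```r
# Summary statistics copied from the SALES table above
sales_mean <- 3481.9234939
sales_sd <- 1704.2611303
sales_var <- sales_sd^2

# Method of moments for the gamma distribution:
#   mean = shape * scale,  variance = shape * scale^2
shape <- sales_mean^2 / sales_var
scale <- sales_var / sales_mean

# Sanity check: the fitted gamma reproduces the sample mean and variance
fitted_mean <- shape * scale
fitted_var <- shape * scale^2

print(c(shape = shape, scale = scale))
print(c(fitted_mean = fitted_mean, fitted_var = fitted_var))
```

Method-of-moments estimates are quick to compute. A maximum-likelihood fit would be an alternative if a closer fit to the tails were needed.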

Histogram of QUANTITYORDERED

# Histogram of QUANTITYORDERED with a normal density
ggplot(data, aes(x = QUANTITYORDERED)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(data$QUANTITYORDERED, na.rm = TRUE), sd = sd(data$QUANTITYORDERED, na.rm = TRUE)), color = "red", linewidth = 1) +
  labs(title = "Distribucion de Cantidad Ordenada con Densidad Normal", x = "Cantidad Ordenada", y = "Densidad") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_quantityordered_normal.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Histogram of DAYS_SINCE_LASTORDER

# Histogram of DAYS_SINCE_LASTORDER with a log-normal density (for reference)
ggplot(data, aes(x = DAYS_SINCE_LASTORDER)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black") +
  stat_function(fun = dlnorm, args = list(meanlog = mean(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE), sdlog = sd(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE)), color = "red", linewidth = 1) +
  labs(title = "Distribucion de Dias desde el Ultimo Pedido con Densidad Log-Normal", x = "Dias desde el Ultimo Pedido", y = "Densidad") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_days_since_lastorder_lognormal.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

A Kolmogorov-Smirnov test compares the empirical distribution of DAYS_SINCE_LASTORDER with several theoretical distributions (normal, gamma, log-normal) to check whether the variable rejects these parametric models. Because the parameters are estimated from the same data and the variable contains ties, the reported p-values are approximate.

# KS test for the normal distribution
ks_normal <- ks.test(data$DAYS_SINCE_LASTORDER, "pnorm", mean = mean(data$DAYS_SINCE_LASTORDER, na.rm = TRUE), sd = sd(data$DAYS_SINCE_LASTORDER, na.rm = TRUE))

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "pnorm", mean =
## mean(data$DAYS_SINCE_LASTORDER, : ties should not be present for the one-sample
## Kolmogorov-Smirnov test

print("Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Normal):")

## [1] "Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Normal):"

print(ks_normal)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.050568, p-value = 1.827e-06
## alternative hypothesis: two-sided

# KS test for the gamma distribution (method-of-moments estimates)
days_mean <- mean(data$DAYS_SINCE_LASTORDER, na.rm = TRUE)
days_var <- var(data$DAYS_SINCE_LASTORDER, na.rm = TRUE)
shape_days <- days_mean^2 / days_var
scale_days <- days_var / days_mean
ks_gamma <- ks.test(data$DAYS_SINCE_LASTORDER, "pgamma", shape = shape_days, scale = scale_days)

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "pgamma", shape =
## shape_days, : ties should not be present for the one-sample Kolmogorov-Smirnov
## test

print("Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Gamma):")

## [1] "Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Gamma):"

print(ks_gamma)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.085314, p-value < 2.2e-16
## alternative hypothesis: two-sided

# KS test for the log-normal distribution
ks_lognormal <- ks.test(data$DAYS_SINCE_LASTORDER, "plnorm", meanlog = mean(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE), sdlog = sd(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE))

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "plnorm", meanlog =
## mean(log(data$DAYS_SINCE_LASTORDER), : ties should not be present for the
## one-sample Kolmogorov-Smirnov test

print("Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Log-Normal):")

## [1] "Test de Kolmogorov-Smirnov para DAYS_SINCE_LASTORDER (Distribución Log-Normal):"

print(ks_lognormal)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.099156, p-value < 2.2e-16
## alternative hypothesis: two-sided
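
The standard KS p-value assumes the reference distribution is fully specified in advance, whereas here its parameters come from the same sample. One common workaround is a parametric bootstrap of the KS statistic (a Lilliefors-type correction). The sketch below illustrates the idea for the normal model only, on simulated data. The function `ks_boot_normal`, the number of replicates, the seed, and the uniform stand-in data are all assumptions and are not part of the original analysis.

```r
# Parametric bootstrap of the KS statistic for a normal model whose
# parameters are estimated from the data (a Lilliefors-type correction)
ks_boot_normal <- function(x, n_boot = 500) {
  x <- x[!is.na(x)]
  n <- length(x)
  # Observed KS distance against the fitted normal (ties warning suppressed)
  fit_obs <- suppressWarnings(ks.test(x, "pnorm", mean = mean(x), sd = sd(x)))
  d_obs <- unname(fit_obs$statistic)
  # Distribution of the distance when the normal model is true and
  # the parameters are re-estimated on every simulated sample
  d_sim <- replicate(n_boot, {
    y <- rnorm(n, mean = mean(x), sd = sd(x))
    unname(ks.test(y, "pnorm", mean = mean(y), sd = sd(y))$statistic)
  })
  list(statistic = d_obs, p_value = mean(d_sim >= d_obs))
}

set.seed(42)                              # arbitrary seed
toy_days <- round(runif(300, 42, 3562))   # simulated stand-in, not the real column
result <- ks_boot_normal(toy_days)
print(result)
```

The same pattern extends to the gamma and log-normal models by swapping the simulation and fitting steps. With the very small p-values reported above, the correction would be unlikely to change the conclusion.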

Since all three tests reject their parametric model (p < 0.001), a kernel density estimate (KDE) is a better way to represent this variable, which suggests a complex distribution that calls for a nonparametric approach.

# Histogram of DAYS_SINCE_LASTORDER with a kernel density estimate (KDE)
ggplot(data, aes(x = DAYS_SINCE_LASTORDER)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black") +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Distribucion de Dias desde el Ultimo Pedido con Densidad Kernel (KDE)", x = "Dias desde el Ultimo Pedido", y = "Densidad") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_density()`).

ggsave("histogram_days_since_lastorder_kde.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 3 rows containing non-finite outside the scale range
## (`stat_density()`).

# Boxplot of QUANTITYORDERED
ggplot(data, aes(y = QUANTITYORDERED)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot de Cantidad Ordenada (QUANTITYORDERED)", y = "Cantidad Ordenada") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantityordered.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Bivariate: SALES vs. DAYS_SINCE_LASTORDER

The Pearson correlation between the two variables is computed and tested.

correlation <- cor.test(data$SALES, data$DAYS_SINCE_LASTORDER)
print("Correlación entre SALES y DAYS_SINCE_LASTORDER:")

## [1] "Correlación entre SALES y DAYS_SINCE_LASTORDER:"

print(correlation)

## 
##  Pearson's product-moment correlation
## 
## data:  data$SALES and data$DAYS_SINCE_LASTORDER
## t = -17.98, df = 2717, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3592696 -0.2920724
## sample estimates:
##        cor 
## -0.3260828
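
As a worked check of the output above, the t statistic reported by `cor.test` can be recovered from the correlation coefficient and the degrees of freedom using t = r·sqrt(df) / sqrt(1 − r²). The sketch below uses only the printed values, so it does not need the dataset.

```r
# Values copied from the cor.test output above
r <- -0.3260828
df <- 2717   # n - 2, with n = 2719 complete pairs

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

# Share of the variance in one variable linearly explained by the other
r_squared <- r^2

print(c(t = t_stat, r_squared = r_squared))
print(p_value)
```

The correlation is clearly different from zero, but r² is only about 0.106, so the linear relationship accounts for roughly 11% of the variability: a moderate negative association, not a strong one.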

# Scatter plot
ggplot(data, aes(x = DAYS_SINCE_LASTORDER, y = SALES)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relacion entre Ventas (SALES) y Dias desde el Ultimo Pedido", x = "Dias desde el Ultimo Pedido", y = "Ventas (SALES)") +
  theme_minimal()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggsave("scatter_sales_days.png", width = 8, height = 6)

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Bivariate descriptive analysis between categorical and quantitative variables

This section analyzes the relationship between categorical and quantitative variables. Specifically, it examines how SALES varies by product line (PRODUCTLINE) and how QUANTITYORDERED is distributed by deal size (DEALSIZE). Descriptive statistics (mean, median, standard deviation, skewness, and kurtosis) are computed per category and presented in tables, and the distributions are visualized with boxplots and faceted histograms. This helps show which product lines generate higher sales per order line and how deal size relates to the quantities ordered.

Summary statistics, boxplot, and faceted histogram of SALES vs. PRODUCTLINE

# Summary statistics of SALES by PRODUCTLINE (rows with a missing PRODUCTLINE are dropped)
sales_by_productline <- data %>%
  filter(!is.na(PRODUCTLINE)) %>%
  group_by(PRODUCTLINE) %>%
  summarise(
    Mean_SALES = mean(SALES, na.rm = TRUE),
    Median_SALES = median(SALES, na.rm = TRUE),
    SD_SALES = sd(SALES, na.rm = TRUE),
    Skewness_SALES = skewness(SALES, na.rm = TRUE),
    Kurtosis_SALES = kurtosis(SALES, na.rm = TRUE)
  )
print("Estadígrafos de SALES por PRODUCTLINE:")

## [1] "Estadígrafos de SALES por PRODUCTLINE:"

print(kable(sales_by_productline))

## 
## 
## |PRODUCTLINE      | Mean_SALES| Median_SALES| SD_SALES| Skewness_SALES| Kurtosis_SALES|
## |:----------------|----------:|------------:|--------:|--------------:|--------------:|
## |Classic Cars     |   3940.106|     3729.390| 1885.660|      0.5234855|       2.586711|
## |Motorcycles      |   3441.322|     3113.640| 1686.146|      0.9750865|       3.699903|
## |Planes           |   3143.103|     2835.770| 1421.442|      1.1132958|       4.081103|
## |Ships            |   3043.649|     2884.925| 1058.753|      0.9442692|       4.249737|
## |Trains           |   2938.227|     2445.600| 1456.596|      1.5085860|       5.764990|
## |Trucks and Buses |   3767.997|     3451.000| 1674.056|      0.4868421|       2.579675|
## |Vintage Cars     |   3038.051|     2761.960| 1575.484|      1.0097794|       3.965723|

# Boxplot of SALES by PRODUCTLINE
ggplot(data, aes(x = PRODUCTLINE, y = SALES, fill = PRODUCTLINE)) +
  geom_boxplot() +
  labs(title = "Distribucion de Ventas (SALES) por Linea de Producto", x = "Linea de Producto", y = "Ventas (SALES)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_sales_productline.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Faceted histogram of SALES by PRODUCTLINE
ggplot(data, aes(x = SALES)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~PRODUCTLINE, scales = "free_y") +
  labs(title = "Distribucion de Ventas (SALES) por Linea de Producto", x = "Ventas (SALES)", y = "Frecuencia") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales_productline.png", width = 10, height = 8)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Summary statistics and boxplot of QUANTITYORDERED vs. DEALSIZE

# Summary statistics of QUANTITYORDERED by DEALSIZE (rows with a missing DEALSIZE are dropped)
quantity_by_dealsize <- data %>%
  filter(!is.na(DEALSIZE)) %>%
  group_by(DEALSIZE) %>%
  summarise(
    Mean_QUANTITY = mean(QUANTITYORDERED, na.rm = TRUE),
    Median_QUANTITY = median(QUANTITYORDERED, na.rm = TRUE),
    SD_QUANTITY = sd(QUANTITYORDERED, na.rm = TRUE),
    Skewness_QUANTITY = skewness(QUANTITYORDERED, na.rm = TRUE),
    Kurtosis_QUANTITY = kurtosis(QUANTITYORDERED, na.rm = TRUE)
  )
print("Estadígrafos de QUANTITYORDERED por DEALSIZE:")

## [1] "Estadígrafos de QUANTITYORDERED por DEALSIZE:"

print(kable(quantity_by_dealsize))

## 
## 
## |DEALSIZE | Mean_QUANTITY| Median_QUANTITY| SD_QUANTITY| Skewness_QUANTITY| Kurtosis_QUANTITY|
## |:--------|-------------:|---------------:|-----------:|-----------------:|-----------------:|
## |Large    |      46.04839|              45|    9.875530|         2.1058435|         10.382888|
## |Medium   |      37.96071|              39|    8.448259|        -0.1632404|          2.408981|
## |Small    |      30.51766|              29|    8.495740|         0.5786779|          2.796181|

# Boxplot of QUANTITYORDERED by DEALSIZE
ggplot(data, aes(x = DEALSIZE, y = QUANTITYORDERED, fill = DEALSIZE)) +
  geom_boxplot() +
  labs(title = "Distribucion de Cantidad Ordenada por Tamanio del Pedido", x = "Tamanio del Pedido (DEALSIZE)", y = "Cantidad Ordenada (QUANTITYORDERED)") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantity_dealsize.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Hypothesis tests

This section runs two hypothesis tests to validate assumptions about the data. The first tests whether mean sales (SALES) are at most 3500, a value that could reflect the expected revenue per order line. The second tests whether the proportion of orders with status “CANCELLED” in the STATUS variable is below 5%, as an indicator of the company's logistics performance. The null and alternative hypotheses, the test results, and their interpretation are presented.

Test 1: Mean of SALES

H0: The mean of SALES is less than or equal to 3500

H1: The mean of SALES is greater than 3500

Analyzing average sales is essential for evaluating the profitability and efficiency of the commercial process in the automotive sector. Here, 3500 was set as the minimum reference threshold expected to keep the operation profitable and meet financial targets.

sales_mean_test <- t.test(data$SALES, mu = 3500, alternative = "greater")  # H0: mu <= 3500, H1: mu > 3500
print("Prueba de hipótesis para la media de SALES (H0: mu <= 3500):")

## [1] "Prueba de hipótesis para la media de SALES (H0: mu <= 3500):"

print(sales_mean_test)

## 
##  One Sample t-test
## 
## data:  data$SALES
## t = -0.55307, df = 2718, p-value = 0.7099
## alternative hypothesis: true mean is greater than 3500
## 95 percent confidence interval:
##  3428.145      Inf
## sample estimates:
## mean of x 
##  3481.923
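
To make the decision explicit, the test statistic, the p-value, and the lower confidence bound can be recomputed by hand from the summary statistics reported earlier. The sketch below uses those printed values (n = 2719 non-missing observations) instead of the raw data. The significance level of 0.05 is an assumption that matches the 95% interval in the output.

```r
# Summary statistics copied from the SALES table and the t.test output above
n <- 2719
xbar <- 3481.9234939
s <- 1704.2611303
mu0 <- 3500

# One-sample t statistic and right-tailed p-value (H1: mu > 3500)
t_stat <- (xbar - mu0) / (s / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval
lower_bound <- xbar - qt(0.95, df = n - 1) * s / sqrt(n)

alpha <- 0.05   # assumed significance level
reject_h0 <- p_value < alpha

print(c(t = t_stat, p = p_value, lower = lower_bound))
print(reject_h0)
```

Since the p-value (about 0.71) is far above 0.05, H0 is not rejected: the data give no evidence that mean SALES exceeds the 3500 threshold.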

Test 2: Proportion of “CANCELLED” in STATUS

H0: The proportion of “CANCELLED” is greater than or equal to 0.05

H1: The proportion of “CANCELLED” is less than 0.05

A high cancellation rate can reflect problems in logistics processes, inventory errors, failures in customer communication, or weaknesses in commercial policies. The test therefore evaluates whether the proportion of cancelled orders stays below the critical threshold of 5%.

# 1. Check the unique values of STATUS
unique(data$STATUS)

## [1] SHIPPED    DISPUTED   CANCELLED  ON HOLD    RESOLVED   IN PROCESS <NA>      
## Levels: CANCELLED DISPUTED IN PROCESS ON HOLD RESOLVED SHIPPED

# 2. Make sure STATUS is clean
data$STATUS <- toupper(trimws(data$STATUS))

# 3. Count CANCELLED orders and total orders
num_cancelled <- sum(data$STATUS == "CANCELLED", na.rm = TRUE)
n_total <- sum(!is.na(data$STATUS))

print(paste("Pedidos CANCELLED:", num_cancelled))

## [1] "Pedidos CANCELLED: 60"

print(paste("Total de pedidos:", n_total))

## [1] "Total de pedidos: 2719"

# 4. Hypothesis test
prueba_cancelled <- prop.test(
  x = num_cancelled, 
  n = n_total, 
  p = 0.05, 
  alternative = "less",
  correct = FALSE
)

# 5. Show results
print(prueba_cancelled)

## 
##  1-sample proportions test without continuity correction
## 
## data:  num_cancelled out of n_total, null probability 0.05
## X-squared = 44.663, df = 1, p-value = 1.17e-11
## alternative hypothesis: true p is less than 0.05
## 95 percent confidence interval:
##  0.00000000 0.02719795
## sample estimates:
##          p 
## 0.02206694
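
Without continuity correction, the `X-squared` value reported by `prop.test` is the square of the usual one-sample z statistic for a proportion. The sketch below recomputes it from the printed counts (60 cancelled out of 2719) to show where the numbers come from.

```r
# Counts copied from the output above
x <- 60       # CANCELLED orders
n <- 2719     # orders with a non-missing STATUS
p0 <- 0.05    # threshold under H0

p_hat <- x / n
se0 <- sqrt(p0 * (1 - p0) / n)   # standard error under H0
z <- (p_hat - p0) / se0
p_value <- pnorm(z)              # left tail, since H1: p < 0.05

print(c(p_hat = p_hat, z = z, z_squared = z^2))
print(p_value)
```

The p-value is far below 0.05, so H0 is rejected: the cancellation rate of about 2.2% is significantly below the 5% threshold.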

Normality analysis

This analysis evaluates the normality of the continuous variables in the dataset `Auto_Sales_data_no_outliers1.xlsx`. Normality is a key assumption in many parametric statistical methods, such as t-tests, ANOVA, and linear regression. If the variables do not satisfy it, it may be necessary to apply transformations or switch to nonparametric methods. The analysis is intended to:

- Determine whether the selected continuous variables (`QUANTITYORDERED`, `PRICEEACH`, `ORDERLINENUMBER`, `SALES`, `DAYS_SINCE_LASTORDER`, `MSRP`) follow a normal distribution.

- Identify and treat outliers that may affect normality.

- Apply transformations (logarithmic, square root, Box-Cox, etc.) to try to normalize the data.

- Provide recommendations on how to proceed in later analyses (for example, using nonparametric methods if normality is not achieved).
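
As a rough illustration of the transformation step listed above, the sketch below applies a log, a square-root, and a Box-Cox transform to simulated right-skewed data and compares Shapiro-Wilk p-values. The `box_cox` helper, the fixed `lambda = 0.5`, the seed, and the simulated `toy_sales` values are assumptions for illustration. The notebook may choose lambda differently (for example by maximum likelihood).

```r
# Box-Cox transform for strictly positive data; lambda = 0 is the log transform
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

set.seed(1)                                          # arbitrary seed
toy_sales <- rlnorm(500, meanlog = 8, sdlog = 0.5)   # simulated right-skewed values

transformed <- list(
  raw = toy_sales,
  log = log(toy_sales),
  sqrt = sqrt(toy_sales),
  box_cox_half = box_cox(toy_sales, 0.5)
)

# Shapiro-Wilk p-value for each version of the variable
p_values <- sapply(transformed, function(v) shapiro.test(v)$p.value)
print(round(p_values, 4))
```

Because the toy data are log-normal by construction, the log transform should look closest to normal here. On real data the best transform has to be found empirically.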

# Install (if needed) and load the required libraries
if(!require(tseries)) install.packages("tseries")

## Cargando paquete requerido: tseries

## Warning: package 'tseries' was built under R version 4.4.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tseries)

if(!require(forecast)) install.packages("forecast")

## Cargando paquete requerido: forecast

## Warning: package 'forecast' was built under R version 4.4.3

library(forecast)

library(readxl)
library(dplyr)
library(ggplot2)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.4.3

## 
## Adjuntando el paquete: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(nortest)
library(car)

## Warning: package 'car' was built under R version 4.4.3

## Cargando paquete requerido: carData

## Warning: package 'carData' was built under R version 4.4.3

## 
## Adjuntando el paquete: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(VIM)

## Warning: package 'VIM' was built under R version 4.4.3

## Cargando paquete requerido: colorspace

## Cargando paquete requerido: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Adjuntando el paquete: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

# Load the data
datos <- read_excel("~/Auto_Sales_data_no_outliers1.xlsx", sheet = "Auto Sales data")

# Identify the continuous variables
variables_continuas <- c("QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER", 
                         "SALES", "DAYS_SINCE_LASTORDER", "MSRP")

# Create a subset with only the continuous variables
datos_continuas <- datos[, variables_continuas]

Análisis descriptivo inicial

Se generará un resumen estadístico (mínimo, máximo, media, mediana, cuartiles) para obtener una visión general de las variables continuas. Además, se verificarán valores faltantes y se realizará un diagnóstico adicional para ORDERLINENUMBER (frecuencias y valores únicos) para confirmar si es una variable continua o discreta/ordinal. Esto sirve para entender la estructura básica de los datos y detectar problemas iniciales como datos ausentes o distribuciones sesgadas.

# Statistical summary
summary(datos_continuas)

##  QUANTITYORDERED   PRICEEACH      ORDERLINENUMBER      SALES       
##  Min.   : 6.00   Min.   : 26.88   Min.   : 1.000   Min.   : 482.1  
##  1st Qu.:27.00   1st Qu.: 68.53   1st Qu.: 3.000   1st Qu.:2197.1  
##  Median :34.00   Median : 94.88   Median : 6.000   Median :3167.4  
##  Mean   :34.92   Mean   :100.07   Mean   : 6.506   Mean   :3481.9  
##  3rd Qu.:43.00   3rd Qu.:125.99   3rd Qu.: 9.000   3rd Qu.:4437.1  
##  Max.   :97.00   Max.   :252.87   Max.   :18.000   Max.   :9160.4  
##  NA's   :3       NA's   :3        NA's   :3        NA's   :3       
##  DAYS_SINCE_LASTORDER      MSRP      
##  Min.   :  42         Min.   : 33.0  
##  1st Qu.:1087         1st Qu.: 68.0  
##  Median :1769         Median : 99.0  
##  Mean   :1765         Mean   :100.1  
##  3rd Qu.:2440         3rd Qu.:124.0  
##  Max.   :3562         Max.   :214.0  
##  NA's   :3            NA's   :3

# Count missing values per column
sapply(datos_continuas, function(x) sum(is.na(x)))

##      QUANTITYORDERED            PRICEEACH      ORDERLINENUMBER 
##                    3                    3                    3 
##                SALES DAYS_SINCE_LASTORDER                 MSRP 
##                    3                    3                    3

# Additional diagnostic for ORDERLINENUMBER
cat("\nDiagnóstico adicional para ORDERLINENUMBER:\n")

## 
## Diagnóstico adicional para ORDERLINENUMBER:

orderline_counts <- table(datos_continuas$ORDERLINENUMBER)
print(orderline_counts)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 296 275 257 245 230 210 190 179 159 139 126 109  95  80  55  41  24   9

# Drop NA before counting: unique() would otherwise count it as an extra value
cat("Número de valores únicos:", length(unique(na.omit(datos_continuas$ORDERLINENUMBER))), "\n")

## Número de valores únicos: 18
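`unique()` keeps `NA` as a distinct value and silently ignores an `na.rm` argument, so a distinct-value count on a column with missing entries needs the `NA`s dropped explicitly. A minimal sketch with a hypothetical vector `x` (not taken from the dataset):

```r
# Hypothetical vector with three distinct values and one missing entry
x <- c(1, 2, 2, 3, NA)

# NA counts as its own value; na.rm is swallowed by `...` and has no effect
n_con_na <- length(unique(x, na.rm = TRUE))

# Dropping NA first gives the number of observed distinct values
n_sin_na <- length(unique(x[!is.na(x)]))

cat("With NA:", n_con_na, "| without NA:", n_sin_na, "\n")
```

The same check helps separate a genuinely continuous variable from a line counter such as ORDERLINENUMBER, whose frequency table lists only the integers 1 to 18.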

Initial normality tests

Four normality tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov, and Jarque-Bera) are applied to each continuous variable. In every case the null hypothesis is that the data follow a normal distribution, so a p-value above 0.05 means normality is not rejected at the 5% level. The tests are wrapped in a custom function that handles errors and respects sample-size limits (Shapiro-Wilk is run only when there are at most 5000 observations). The goal is an initial statistical assessment of normality before any treatment.

# Run several normality tests on one variable and return their p-values
pruebas_normalidad <- function(variable, nombre_var) {
  cat("\n=== VARIABLE:", nombre_var, "===\n")
  
  variable <- variable[!is.na(variable) & is.finite(variable)]
  
  if(length(variable) < 3) {
    cat("Tamaño de muestra insuficiente (< 3 observaciones). No se pueden realizar pruebas de normalidad.\n")
    return(list(shapiro = NA, anderson = NA, ks = NA, jarque_bera = NA))
  }
  
  # Shapiro-Wilk (shapiro.test accepts between 3 and 5000 observations)
  shapiro_p <- NA
  if(length(variable) <= 5000) {
    shapiro_test <- try(shapiro.test(variable), silent = TRUE)
    if(!inherits(shapiro_test, "try-error")) {
      shapiro_p <- shapiro_test$p.value
      cat("Shapiro-Wilk p-value:", shapiro_p, "\n")
    } else {
      cat("Shapiro-Wilk: Error al ejecutar la prueba.\n")
    }
  }
  
  # Anderson-Darling
  ad_p <- NA
  ad_test <- try(ad.test(variable), silent = TRUE)
  if(!inherits(ad_test, "try-error")) {
    ad_p <- ad_test$p.value
    cat("Anderson-Darling p-value:", ad_p, "\n")
  } else {
    cat("Anderson-Darling: Error al ejecutar la prueba.\n")
  }
  
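  # Kolmogorov-Smirnov against a normal with the sample mean and sd.
  # Because the parameters are estimated from the same data, the p-value is
  # only approximate; tied values also trigger the warning shown in the output.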
  ks_p <- NA
  ks_test <- try(ks.test(variable, "pnorm", mean(variable, na.rm = TRUE), sd(variable, na.rm = TRUE)), silent = TRUE)
  if(!inherits(ks_test, "try-error")) {
    ks_p <- ks_test$p.value
    cat("Kolmogorov-Smirnov p-value:", ks_p, "\n")
  } else {
    cat("Kolmogorov-Smirnov: Error al ejecutar la prueba.\n")
  }
  
  # Jarque-Bera
  jb_p <- NA
  jb_test <- try(jarque.bera.test(variable), silent = TRUE)
  if(!inherits(jb_test, "try-error")) {
    jb_p <- jb_test$p.value
    cat("Jarque-Bera p-value:", jb_p, "\n")
  } else {
    cat("Jarque-Bera: Error al ejecutar la prueba.\n")
  }
  
  return(list(
    shapiro = shapiro_p,
    anderson = ad_p,
    ks = ks_p,
    jarque_bera = jb_p
  ))
}

# Apply the tests to every variable
resultados_normalidad <- list()
for(i in seq_along(variables_continuas)) {
  var_name <- variables_continuas[i]
  variable <- datos_continuas[[var_name]]
  resultados_normalidad[[var_name]] <- pruebas_normalidad(variable, var_name)
}

## 
## === VARIABLE: QUANTITYORDERED ===
## Shapiro-Wilk p-value: 2.21235e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.123557e-11 
## Jarque-Bera p-value: 2.014722e-12 
## 
## === VARIABLE: PRICEEACH ===
## Shapiro-Wilk p-value: 3.016714e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.101909e-09 
## Jarque-Bera p-value: 0 
## 
## === VARIABLE: ORDERLINENUMBER ===
## Shapiro-Wilk p-value: 2.455972e-32 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.297548e-33 
## Jarque-Bera p-value: 0 
## 
## === VARIABLE: SALES ===
## Shapiro-Wilk p-value: 2.09335e-30 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 9.45032e-15 
## Jarque-Bera p-value: 0 
## 
## === VARIABLE: DAYS_SINCE_LASTORDER ===
## Shapiro-Wilk p-value: 1.534705e-21 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.827449e-06 
## Jarque-Bera p-value: 0 
## 
## === VARIABLE: MSRP ===
## Shapiro-Wilk p-value: 3.160938e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 7.375531e-13 
## Jarque-Bera p-value: 0
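Printed test by test, the p-values are hard to compare across variables. One option is to collect them into a single table. The sketch below does this for two simulated samples using only base R tests; the names `muestras` and `resumen` are illustrative, and the data are simulated rather than taken from the dataset.

```r
set.seed(123)

# Two simulated samples: one drawn from a normal, one clearly right-skewed
muestras <- list(
  normal_simulada      = rnorm(200),
  exponencial_simulada = rexp(200)
)

# One row per sample with the p-value of each test
resumen <- do.call(rbind, lapply(names(muestras), function(nm) {
  x <- muestras[[nm]]
  data.frame(
    variable  = nm,
    shapiro_p = shapiro.test(x)$p.value,
    ks_p      = ks.test(x, "pnorm", mean(x), sd(x))$p.value,
    stringsAsFactors = FALSE
  )
}))

# Flag the samples for which Shapiro-Wilk rejects normality at the 5% level
resumen$rechaza_normalidad <- resumen$shapiro_p < 0.05

print(resumen)
```

Applied to the notebook's own results, `do.call(rbind, lapply(resultados_normalidad, as.data.frame))` should give one row per variable, assuming every test returned a single p-value.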

Histograms with density curves and Q-Q (quantile-quantile) plots are drawn for each continuous variable. The histograms show the shape of each distribution, while the Q-Q plots compare the sample quantiles with those of a theoretical normal distribution. This gives a visual check of normality and helps identify skewness or heavy tails that the statistical tests alone do not describe.

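# Build histograms and Q-Q plots
plots_list <- list()
for(i in seq_along(variables_continuas)) {
  var_name <- variables_continuas[i]
  variable <- datos_continuas[[var_name]]
  variable <- variable[!is.na(variable)]
  
  # Factor that rescales the density to the count axis (n * approximate bin width).
  # It is injected with !! because aes() is evaluated lazily, when the plot is drawn;
  # by then the loop has finished and `variable` holds the last column.
  escala <- length(variable) * diff(range(variable)) / 30
  
  # Histogram with the rescaled density curve
  p1 <- ggplot(data.frame(x = variable), aes(x = x)) +
    geom_histogram(bins = 30, fill = "lightblue", alpha = 0.7) +
    geom_density(aes(y = after_stat(density) * !!escala), 
                 color = "red", linewidth = 1) +
    labs(title = paste("Histograma -", var_name), x = var_name, y = "Frecuencia")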
  
  # Q-Q plot
  p2 <- ggplot(data.frame(x = variable), aes(sample = x)) +
    stat_qq() + stat_qq_line(color = "red") +
    labs(title = paste("Q-Q Plot -", var_name))
  
  plots_list[[paste0(var_name, "_hist")]] <- p1
  plots_list[[paste0(var_name, "_qq")]] <- p2
}

# Display the plots, two variables (histogram and Q-Q plot each) per page
grid.arrange(grobs = plots_list[1:4], ncol = 2)

grid.arrange(grobs = plots_list[5:8], ncol = 2)

grid.arrange(grobs = plots_list[9:12], ncol = 2)
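Skewness and excess kurtosis put numbers on what the histograms and Q-Q plots show. The sketch below computes both from their moment definitions in base R; the helper names `asimetria` and `curtosis_exceso` and the two small vectors are illustrative. The `moments` package loaded earlier offers `skewness()` and `kurtosis()` for the same purpose.

```r
# Sample skewness: third central moment over the second raised to 1.5
asimetria <- function(x) {
  x <- x[is.finite(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3
curtosis_exceso <- function(x) {
  x <- x[is.finite(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

simetrica <- c(-2, -1, 0, 1, 2)   # symmetric around 0
sesgada   <- c(1, 1, 1, 2, 10)    # long right tail

cat("Symmetric sample: skewness =", asimetria(simetrica),
    "| excess kurtosis =", curtosis_exceso(simetrica), "\n")
cat("Right-skewed sample: skewness =", asimetria(sesgada), "\n")
```

A skewness near 0 and an excess kurtosis near 0 are what a normal sample would show; positive skewness indicates a longer right tail.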

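Outliers in each continuous variable are detected and treated with the interquartile range (IQR) method: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged, and the function returns both a version with those values removed and a version winsorized to the two limits.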

# Detect and treat outliers with the 1.5 * IQR rule
tratar_outliers <- function(variable, metodo = "iqr") {
  if(metodo != "iqr") stop("Only metodo = \"iqr\" is implemented")
  
  variable <- variable[!is.na(variable) & is.finite(variable)]
  
  if(length(variable) < 4) {
    cat("Advertencia: Menos de 4 observaciones válidas. No se pueden detectar outliers.\n")
    return(list(
      original = variable,
      sin_outliers = variable,
      winsorized = variable,
      outliers_indices = integer(0),
      n_outliers = 0
    ))
  }
  
  Q1 <- quantile(variable, 0.25, na.rm = TRUE)
  Q3 <- quantile(variable, 0.75, na.rm = TRUE)
  rango_iq <- Q3 - Q1  # named rango_iq so it does not mask stats::IQR()
  
  # If the IQR is 0 (many repeated values), widen it so the limits do not flag everything
  if(rango_iq == 0) {
    rango_iq <- sd(variable, na.rm = TRUE) / 2  # fall back to half the standard deviation
    if(rango_iq == 0) rango_iq <- 1  # constant data: use a minimal non-zero width
  }
  
  limite_inferior <- Q1 - 1.5 * rango_iq
  limite_superior <- Q3 + 1.5 * rango_iq
  
  # Positions of the outliers (relative to the NA-free vector)
  outliers <- which(variable < limite_inferior | variable > limite_superior)
  
  # Option 1: remove the outliers
  variable_sin_outliers <- variable
  if(length(outliers) > 0) {
    variable_sin_outliers <- variable[-outliers]
  }
  
  # Option 2: winsorize (cap values at the IQR limits, not at percentiles)
  variable_winsorized <- variable
  variable_winsorized[variable < limite_inferior] <- limite_inferior
  variable_winsorized[variable > limite_superior] <- limite_superior
  
  return(list(
    original = variable,
    sin_outliers = variable_sin_outliers,
    winsorized = variable_winsorized,
    outliers_indices = outliers,
    n_outliers = length(outliers)
  ))
}

# Apply the outlier treatment to each variable
datos_tratados <- list()
for(var_name in variables_continuas) {
  variable <- datos_continuas[[var_name]]
  tratamiento <- tratar_outliers(variable)
  datos_tratados[[var_name]] <- tratamiento
  
  cat("\nVariable:", var_name)
  cat("\nOutliers detectados:", tratamiento$n_outliers)
  cat("\nObservaciones despues de eliminar outliers:", length(tratamiento$sin_outliers))
  cat("\nPorcentaje de outliers:", round(tratamiento$n_outliers/length(variable[!is.na(variable)])*100, 2), "%\n")
}

## 
## Variable: QUANTITYORDERED
## Outliers detectados: 5
## Observaciones despues de eliminar outliers: 2714
## Porcentaje de outliers: 0.18 %
## 
## Variable: PRICEEACH
## Outliers detectados: 24
## Observaciones despues de eliminar outliers: 2695
## Porcentaje de outliers: 0.88 %
## 
## Variable: ORDERLINENUMBER
## Outliers detectados: 0
## Observaciones despues de eliminar outliers: 2719
## Porcentaje de outliers: 0 %
## 
## Variable: SALES
## Outliers detectados: 56
## Observaciones despues de eliminar outliers: 2663
## Porcentaje de outliers: 2.06 %
## 
## Variable: DAYS_SINCE_LASTORDER
## Outliers detectados: 0
## Observaciones despues de eliminar outliers: 2719
## Porcentaje de outliers: 0 %
## 
## Variable: MSRP
## Outliers detectados: 22
## Observaciones despues de eliminar outliers: 2697
## Porcentaje de outliers: 0.81 %
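The limits behind these counts can be reproduced by hand. Taking the QUANTITYORDERED quartiles reported by `summary()` earlier (Q1 = 27, Q3 = 43), the sketch below computes the 1.5 × IQR limits; the variable names are illustrative.

```r
# Quartiles of QUANTITYORDERED as reported by summary()
q1 <- 27
q3 <- 43

rango_iq <- q3 - q1                      # 16
limite_inferior <- q1 - 1.5 * rango_iq   # 27 - 24 = 3
limite_superior <- q3 + 1.5 * rango_iq   # 43 + 24 = 67

cat("Lower limit:", limite_inferior, "| upper limit:", limite_superior, "\n")
```

Since the observed minimum is 6 and the maximum is 97, no value falls below the lower limit, so the five outliers reported for this variable must all lie above 67.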

Transformations

Common transformations (logarithmic, square root, reciprocal, quadratic, and Box-Cox) are applied to the data with outliers removed in an attempt to normalize the distributions. These transformations are useful for correcting skewness or heavy tails: the logarithm, square root, and reciprocal compress the right tail, while squaring stretches it and is therefore aimed at left-skewed data. The goal is to find a transformed version of each variable that is compatible with normality.

# Apply the candidate transformations to a numeric vector
aplicar_transformaciones <- function(variable) {
  # log and reciprocal need strictly positive input: replace values <= 0 with 0.001
  variable_pos <- ifelse(variable <= 0, 0.001, variable)
  
  transformaciones <- list(
    original = variable,
    log = log(variable_pos),
    sqrt = sqrt(abs(variable)),
    reciproca = 1/(variable_pos),
    cuadratica = variable^2,
    box_cox = NULL  # filled in below when all values are positive
  )
  
  # Box-Cox with lambda estimated by BoxCox.lambda (forecast is already loaded above)
  if(all(variable > 0)) {
    lambda <- BoxCox.lambda(variable)
    transformaciones$box_cox <- BoxCox(variable, lambda)
    transformaciones$lambda <- lambda
  }
  
  return(transformaciones)
}
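For reference, the Box-Cox transform of a positive value x is (x^λ − 1)/λ for λ ≠ 0 and log(x) for λ = 0, so the logarithm is a special case and the square root corresponds to λ = 0.5 up to a shift and scale. A minimal base-R sketch of the formula follows; `box_cox_manual` is an illustrative helper, not the `forecast` implementation, although the two are expected to agree for positive inputs.

```r
# Box-Cox transform for strictly positive data and a given lambda
box_cox_manual <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 4, 9)
print(box_cox_manual(x, 0.5))  # equals 2 * (sqrt(x) - 1)
print(box_cox_manual(x, 0))    # plain logarithm
print(box_cox_manual(x, 1))    # x - 1, which leaves the shape unchanged
```

An estimated λ close to 1 therefore suggests that no power transformation will change the shape of the distribution much.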

A complete analysis is then run for each variable: the normality tests are repeated on the original data, on the data with outliers removed, on the winsorized data, and on each transformation of the outlier-free data. This shows how outlier treatment and the transformations affect normality.

analisis_completo <- function(var_name) {
  cat("\n\n==============================================")
  cat("\nANÁALISIS COMPLETO PARA:", var_name)
  cat("\n==============================================\n")
  
  variable_original <- datos_continuas[[var_name]]
  
  # 1. Normality tests on the original data
  cat("\n1. NORMALIDAD - DATOS ORIGINALES:\n")
  resultados_orig <- pruebas_normalidad(variable_original, paste(var_name, "- Original"))
  
  # 2. Outlier treatment (prints nothing itself; its results feed steps 3 to 5)
  cat("\n2. TRATAMIENTO DE OUTLIERS:\n")
  tratamiento <- tratar_outliers(variable_original)
  
  # 3. Normality tests after removing outliers
  cat("\n3. NORMALIDAD - SIN OUTLIERS:\n")
  resultados_sin_outliers <- pruebas_normalidad(tratamiento$sin_outliers, 
                                                paste(var_name, "- Sin Outliers"))
  
  # 4. Normality tests after winsorizing
  cat("\n4. NORMALIDAD - WINSORIZADO:\n")
  resultados_winsorized <- pruebas_normalidad(tratamiento$winsorized, 
                                              paste(var_name, "- Winsorizado"))
  
  # 5. Transformations
  cat("\n5. TRANSFORMACIONES:\n")
  
  # Transform the data with outliers removed
  transformaciones <- aplicar_transformaciones(tratamiento$sin_outliers)
  
  # Test normality on each transformation
  for(trans_name in names(transformaciones)) {
    if(trans_name %in% c("original", "lambda")) next
    
    trans_data <- transformaciones[[trans_name]]
    if(!is.null(trans_data) && length(trans_data) >= 3) {
      cat("\n--- Transformacion:", trans_name, "---\n")
      pruebas_normalidad(trans_data, paste(var_name, "-", trans_name))
    } else {
      cat("\n--- Transformacion:", trans_name, "---\n")
      cat("Tamanio de muestra insuficiente (< 3 observaciones). No se pueden realizar pruebas.\n")
    }
  }
  
  return(list(
    original = resultados_orig,
    sin_outliers = resultados_sin_outliers,
    winsorized = resultados_winsorized,
    transformaciones = transformaciones,
    tratamiento_outliers = tratamiento
  ))
}
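When every p-value is effectively zero, the p-values alone do not show which treatment comes closest to normality. One option is to compare the Shapiro-Wilk W statistic, which approaches 1 as a sample looks more normal. The sketch below ranks a few candidate transformations of a simulated right-skewed sample; `ranking_w`, `candidatas`, and the simulated data are illustrative and not part of the original analysis.

```r
# Rank candidate versions of a variable by the Shapiro-Wilk W statistic
ranking_w <- function(versiones) {
  w <- vapply(versiones,
              function(x) unname(shapiro.test(x)$statistic),
              numeric(1))
  sort(w, decreasing = TRUE)
}

set.seed(42)
x <- rlnorm(500, meanlog = 3, sdlog = 0.6)  # simulated, positive, right-skewed

candidatas <- list(
  original  = x,
  log       = log(x),
  sqrt      = sqrt(x),
  reciproca = 1 / x
)

ranking <- ranking_w(candidatas)
print(round(ranking, 4))
```

With a sample of roughly 2,700 rows, even small departures from normality produce tiny p-values, so a ranking of this kind describes relative improvement rather than proof of normality.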

# Run the complete analysis for each variable
resultados_finales <- list()
for(var_name in variables_continuas) {
  resultados_finales[[var_name]] <- analisis_completo(var_name)
}

## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: QUANTITYORDERED
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: QUANTITYORDERED - Original ===
## Shapiro-Wilk p-value: 2.21235e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.123557e-11 
## Jarque-Bera p-value: 2.014722e-12 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: QUANTITYORDERED - Sin Outliers ===
## Shapiro-Wilk p-value: 1.451608e-22 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.261787e-12 
## Jarque-Bera p-value: 2.220446e-16 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: QUANTITYORDERED - Winsorizado ===
## Shapiro-Wilk p-value: 1.796798e-22 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 6.80732e-12 
## Jarque-Bera p-value: 8.66196e-13 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: QUANTITYORDERED - log ===
## Shapiro-Wilk p-value: 4.732079e-27 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.002297e-15 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: QUANTITYORDERED - sqrt ===
## Shapiro-Wilk p-value: 1.666434e-22 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 6.273338e-13 
## Jarque-Bera p-value: 1.376677e-13 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: QUANTITYORDERED - reciproca ===
## Shapiro-Wilk p-value: 1.965609e-46 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 6.385529e-33 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: QUANTITYORDERED - cuadratica ===
## Shapiro-Wilk p-value: 1.383127e-30 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.37361e-20 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: QUANTITYORDERED - box_cox ===
## Shapiro-Wilk p-value: 9.83098e-23 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.472233e-13 
## Jarque-Bera p-value: 3.097522e-13 
## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: PRICEEACH
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: PRICEEACH - Original ===
## Shapiro-Wilk p-value: 3.016714e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.101909e-09 
## Jarque-Bera p-value: 0 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: PRICEEACH - Sin Outliers ===
## Shapiro-Wilk p-value: 4.685648e-22 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 6.732738e-09 
## Jarque-Bera p-value: 0 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: PRICEEACH - Winsorizado ===
## Shapiro-Wilk p-value: 1.213928e-23 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.248683e-09 
## Jarque-Bera p-value: 0 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: PRICEEACH - log ===
## Shapiro-Wilk p-value: 1.030649e-16 
## Anderson-Darling p-value: 6.491027e-22

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 5.987258e-05 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: PRICEEACH - sqrt ===
## Shapiro-Wilk p-value: 5.815864e-12 
## Anderson-Darling p-value: 5.968912e-11

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 0.0101587 
## Jarque-Bera p-value: 1.064471e-11 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: PRICEEACH - reciproca ===
## Shapiro-Wilk p-value: 2.820825e-40 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.725163e-29 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: PRICEEACH - cuadratica ===
## Shapiro-Wilk p-value: 1.980009e-40 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.347507e-30 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: PRICEEACH - box_cox ===
## Shapiro-Wilk p-value: 2.226403e-19 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.910822e-06 
## Jarque-Bera p-value: 0 
## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: ORDERLINENUMBER
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: ORDERLINENUMBER - Original ===
## Shapiro-Wilk p-value: 2.455972e-32 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.297548e-33 
## Jarque-Bera p-value: 0 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: ORDERLINENUMBER - Sin Outliers ===
## Shapiro-Wilk p-value: 2.455972e-32 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.297548e-33 
## Jarque-Bera p-value: 0 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: ORDERLINENUMBER - Winsorizado ===
## Shapiro-Wilk p-value: 2.455972e-32 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.297548e-33 
## Jarque-Bera p-value: 0 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: ORDERLINENUMBER - log ===
## Shapiro-Wilk p-value: 3.534741e-34 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.884243e-31 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: ORDERLINENUMBER - sqrt ===
## Shapiro-Wilk p-value: 6.753232e-26 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.934624e-17 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: ORDERLINENUMBER - reciproca ===
## Shapiro-Wilk p-value: 5.030882e-56 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.992701e-148 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: ORDERLINENUMBER - cuadratica ===
## Shapiro-Wilk p-value: 9.209313e-49 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.329013e-91 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: ORDERLINENUMBER - box_cox ===
## Shapiro-Wilk p-value: 3.907063e-32 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.062679e-25 
## Jarque-Bera p-value: 0 
## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: SALES
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: SALES - Original ===
## Shapiro-Wilk p-value: 2.09335e-30 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 9.45032e-15 
## Jarque-Bera p-value: 0 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: SALES - Sin Outliers ===
## Shapiro-Wilk p-value: 5.330525e-27 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.159091e-11 
## Jarque-Bera p-value: 0 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: SALES - Winsorizado ===
## Shapiro-Wilk p-value: 5.363335e-30 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.291527e-14 
## Jarque-Bera p-value: 0 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: SALES - log ===
## Shapiro-Wilk p-value: 8.964633e-16 
## Anderson-Darling p-value: 1.675631e-15

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 0.005580328 
## Jarque-Bera p-value: 1.776357e-15 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: SALES - sqrt ===
## Shapiro-Wilk p-value: 3.881531e-13 
## Anderson-Darling p-value: 2.332817e-14

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 0.02838752 
## Jarque-Bera p-value: 6.694312e-12 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: SALES - reciproca ===
## Shapiro-Wilk p-value: 2.662164e-46 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.343012e-36 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: SALES - cuadratica ===
## Shapiro-Wilk p-value: 1.8239e-45 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 5.905445e-50 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: SALES - box_cox ===
## Shapiro-Wilk p-value: 2.393323e-14 
## Anderson-Darling p-value: 1.285723e-12

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 0.02333544 
## Jarque-Bera p-value: 1.129541e-12 
## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: DAYS_SINCE_LASTORDER
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - Original ===
## Shapiro-Wilk p-value: 1.534705e-21 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.827449e-06 
## Jarque-Bera p-value: 0 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - Sin Outliers ===
## Shapiro-Wilk p-value: 1.534705e-21 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.827449e-06 
## Jarque-Bera p-value: 0 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - Winsorizado ===
## Shapiro-Wilk p-value: 1.534705e-21 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.827449e-06 
## Jarque-Bera p-value: 0 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - log ===
## Shapiro-Wilk p-value: 7.997116e-38 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.205771e-23 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - sqrt ===
## Shapiro-Wilk p-value: 1.296841e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.503254e-11 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - reciproca ===
## Shapiro-Wilk p-value: 1.163934e-67 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.070905e-184 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - cuadratica ===
## Shapiro-Wilk p-value: 2.710041e-34 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 3.613562e-24 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: DAYS_SINCE_LASTORDER - box_cox ===
## Shapiro-Wilk p-value: 4.412931e-21 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.135808e-07 
## Jarque-Bera p-value: 0 
## 
## 
## ==============================================
## ANÁALISIS COMPLETO PARA: MSRP
## ==============================================
## 
## 1. NORMALIDAD - DATOS ORIGINALES:
## 
## === VARIABLE: MSRP - Original ===
## Shapiro-Wilk p-value: 3.160938e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 7.375531e-13 
## Jarque-Bera p-value: 0 
## 
## 2. TRATAMIENTO DE OUTLIERS:
## 
## 3. NORMALIDAD - SIN OUTLIERS:
## 
## === VARIABLE: MSRP - Sin Outliers ===
## Shapiro-Wilk p-value: 9.874999e-23 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 5.512277e-11 
## Jarque-Bera p-value: 0 
## 
## 4. NORMALIDAD - WINSORIZADO:
## 
## === VARIABLE: MSRP - Winsorizado ===
## Shapiro-Wilk p-value: 3.327429e-24 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.142607e-12 
## Jarque-Bera p-value: 0 
## 
## 5. TRANSFORMACIONES:
## 
## --- Transformacion: log ---
## 
## === VARIABLE: MSRP - log ===
## Shapiro-Wilk p-value: 2.892185e-18 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 2.373548e-12 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: sqrt ---
## 
## === VARIABLE: MSRP - sqrt ===
## Shapiro-Wilk p-value: 5.416995e-14 
## Anderson-Darling p-value: 1.938918e-16

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 7.423332e-06 
## Jarque-Bera p-value: 5.130785e-11 
## 
## --- Transformacion: reciproca ---
## 
## === VARIABLE: MSRP - reciproca ===
## Shapiro-Wilk p-value: 6.677639e-40 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 7.571377e-40 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: cuadratica ---
## 
## === VARIABLE: MSRP - cuadratica ===
## Shapiro-Wilk p-value: 1.083818e-40 
## Anderson-Darling p-value: 3.7e-24

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 1.516961e-45 
## Jarque-Bera p-value: 0 
## 
## --- Transformacion: box_cox ---
## 
## === VARIABLE: MSRP - box_cox ===
## Shapiro-Wilk p-value: 6.929169e-16 
## Anderson-Darling p-value: 8.551556e-21

## Warning in ks.test.default(variable, "pnorm", mean(variable, na.rm = TRUE), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## Kolmogorov-Smirnov p-value: 4.712422e-06 
## Jarque-Bera p-value: 3.513856e-13

A final summary reports whether each variable can be treated as normal in its original, outlier-free, or winsorized version. A version is labeled NORMAL when at least one of its normality tests returns a p-value above 0.05. If no version qualifies, the summary recommends nonparametric methods or additional transformations.

cat("\n\n==============================================")

## 
## 
## ==============================================

cat("\nRESUMEN FINAL DE NORMALIDAD")

## 
## RESUMEN FINAL DE NORMALIDAD

cat("\n==============================================\n")

## 
## ==============================================

for(var_name in variables_continuas) {
  cat("\nVariable:", var_name)
  
  # Verificar si alguna transformación logró normalidad (p > 0.05)
  resultados <- resultados_finales[[var_name]]
  
  # Revisar datos originales
  orig_normal <- any(unlist(resultados$original) > 0.05, na.rm = TRUE)
  
  # Revisar sin outliers
  sin_out_normal <- any(unlist(resultados$sin_outliers) > 0.05, na.rm = TRUE)
  
  # Revisar winsorizado
  wins_normal <- any(unlist(resultados$winsorized) > 0.05, na.rm = TRUE)
  
  cat("\n  - Datos originales: ", ifelse(orig_normal, "NORMAL", "NO NORMAL"))
  cat("\n  - Sin outliers: ", ifelse(sin_out_normal, "NORMAL", "NO NORMAL"))
  cat("\n  - Winsorizado: ", ifelse(wins_normal, "NORMAL", "NO NORMAL"))
  
  if(!orig_normal && !sin_out_normal && !wins_normal) {
    cat("\n  - Recomendación: Usar métodos no paramétricos o transformaciones adicionales")
    # Añadir recomendación específica para ORDERLINENUMBER
    if(var_name == "ORDERLINENUMBER") {
      cat("\n    Nota: ORDERLINENUMBER es una variable discreta/ordinal. Considera tratarla como categórica o usar métodos no paramétricos.\n")
    }
  }
  cat("\n")
}

## 
## Variable: QUANTITYORDERED
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
## 
## Variable: PRICEEACH
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
## 
## Variable: ORDERLINENUMBER
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
##     Nota: ORDERLINENUMBER es una variable discreta/ordinal. Considera tratarla como categórica o usar métodos no paramétricos.
## 
## 
## Variable: SALES
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
## 
## Variable: DAYS_SINCE_LASTORDER
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
## 
## Variable: MSRP
##   - Datos originales:  NO NORMAL
##   - Sin outliers:  NO NORMAL
##   - Winsorizado:  NO NORMAL
##   - Recomendación: Usar métodos no paramétricos o transformaciones adicionales
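
The loop above labels a version as NORMAL when any single test exceeds 0.05. As a minimal sketch of how that rule compares with a stricter one that requires every test to agree, the helper below classifies a set of p-values under both criteria. The function name `clasificar_normalidad` and the p-values are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: classify a set of normality-test p-values.
# rule = "any" mirrors the summary loop; rule = "all" is a stricter alternative.
clasificar_normalidad <- function(p_values, alpha = 0.05, rule = c("any", "all")) {
  rule <- match.arg(rule)
  p <- unlist(p_values)
  p <- p[!is.na(p)]                       # drop tests that could not be computed
  if (length(p) == 0) return(NA_character_)
  ok <- if (rule == "any") any(p > alpha) else all(p > alpha)
  if (ok) "NORMAL" else "NO NORMAL"
}

# Illustrative p-values (made up, not taken from the dataset)
p_demo <- list(shapiro = 0.20, ad = 0.01, ks = 0.30, jb = 0.04)
res_any <- clasificar_normalidad(p_demo, rule = "any")
res_all <- clasificar_normalidad(p_demo, rule = "all")
cat("any:", res_any, "| all:", res_all, "\n")
```

With mixed evidence such as this, the two rules disagree, which is why the choice of criterion is worth stating explicitly in the report.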

Hypothesis tests

This analysis applies nonparametric hypothesis tests to compare the distribution and the variability of sales ( SALES ) between two groups defined by the DEALSIZE variable (binarized as “Small” vs “Non-Small”) in the Auto_Sales_data_no_outliers1.xlsx dataset. The Mann-Whitney U test compares the distributions and the Siegel-Tukey test compares the variability, both at a significance level of 0.05.

# Cargar librerías necesarias
library(readxl)
library(dplyr)
library(ggplot2)
library(gridExtra)

# Cargar datos
datos <- read_excel("~/Auto_Sales_data_no_outliers1.xlsx", sheet = "Auto Sales data")

# Binarizar DEALSIZE: "Small" vs "Non-Small"
datos <- datos %>%
  mutate(
    SALES = as.numeric(SALES),
    DEALSIZE = as.factor(DEALSIZE),
    DEALSIZE_BIN = as.factor(ifelse(DEALSIZE == "Small", "Small", "Non-Small"))
  )

# Filtrar datos para los dos grupos
datos_bin <- datos %>%
  filter(!is.na(SALES) & !is.na(DEALSIZE_BIN))

An initial summary reports the group sizes (“Small” vs “Non-Small”), the summary statistics of SALES for each group (minimum, maximum, mean, median, quartiles), and the median per group. It gives a first view of the data, exposes obvious differences in sales, and provides context for interpreting the hypothesis tests.

# Resumen inicial
cat("\n=== Resumen inicial ===\n")

## 
## === Resumen inicial ===

cat("Distribución de los grupos:\n")

## Distribución de los grupos:

print(table(datos_bin$DEALSIZE_BIN))

## 
## Non-Small     Small 
##      1473      1246

cat("\nResumen de SALES para Small:\n")

## 
## Resumen de SALES para Small:

print(summary(datos_bin$SALES[datos_bin$DEALSIZE_BIN == "Small"]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   482.1  1638.3  2114.0  2062.6  2558.7  3000.0

cat("\nResumen de SALES para Non-Small:\n")

## 
## Resumen de SALES para Non-Small:

print(summary(datos_bin$SALES[datos_bin$DEALSIZE_BIN == "Non-Small"]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3002    3561    4297    4682    5511    9160

# Medianas por grupo
medianas <- datos_bin %>%
  group_by(DEALSIZE_BIN) %>%
  summarise(Mediana_SALES = median(SALES, na.rm = TRUE))
cat("\nMedianas de SALES por grupo:\n")

## 
## Medianas de SALES por grupo:

print(medianas)

## # A tibble: 2 × 2
##   DEALSIZE_BIN Mediana_SALES
##   <fct>                <dbl>
## 1 Non-Small            4297.
## 2 Small                2114.

Histograms and Q-Q (quantile-quantile) plots show the distribution of SALES separately for the “Small” and “Non-Small” groups. The histograms show the shape of each distribution and how the groups differ, while the Q-Q plots compare each group against a theoretical normal distribution.

# Visualización de distribuciones (histogramas y Q-Q plots)
cat("\n=== Visualizacion de distribuciones ===\n")

## 
## === Visualizacion de distribuciones ===

p1 <- ggplot(datos_bin, aes(x = SALES, fill = DEALSIZE_BIN)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  labs(title = "Distribucion de SALES por DEALSIZE_BIN", x = "SALES", y = "Frecuencia") +
  theme_minimal()

p2 <- ggplot(datos_bin, aes(sample = SALES, color = DEALSIZE_BIN)) +
  stat_qq() + stat_qq_line() +
  labs(title = "Q-Q Plot de SALES por DEALSIZE_BIN") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

Mann-Whitney U test

The Mann-Whitney U test compares the distributions of SALES between the “Small” and “Non-Small” groups without assuming normality. The null hypothesis (H0) states that the distributions do not differ, and the alternative (H1) states that they do. The test uses a significance level of 0.05.

# Prueba de hipótesis no paramétrica: Mann-Whitney U para comparar distribuciones
cat("\n=== Prueba de Mann-Whitney U ===\n")

## 
## === Prueba de Mann-Whitney U ===

cat("Hipotesis:\n")

## Hipotesis:

cat("H0: No hay diferencia en las distribuciones de SALES entre Small y Non-Small\n")

## H0: No hay diferencia en las distribuciones de SALES entre Small y Non-Small

cat("H1: Hay una diferencia en las distribuciones de SALES entre Small y Non-Small\n")

## H1: Hay una diferencia en las distribuciones de SALES entre Small y Non-Small

cat("Nivel de significancia: α = 0.05\n\n")

## Nivel de significancia: α = 0.05

mw_test <- wilcox.test(SALES ~ DEALSIZE_BIN, data = datos_bin, exact = FALSE)
cat("Prueba de Mann-Whitney U: p-value =", mw_test$p.value, "\n")

## Prueba de Mann-Whitney U: p-value = 0

if (mw_test$p.value < 0.05) {
  cat("Resultado: Rechazamos H0. Hay una diferencia significativa en las distribuciones.\n")
} else {
  cat("Resultado: No rechazamos H0. No hay evidencia de diferencia en las distribuciones.\n")
}

## Resultado: Rechazamos H0. Hay una diferencia significativa en las distribuciones.
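
The p-value above prints as 0 because it is smaller than the smallest number R can represent. The sketch below shows one way to report such a value as an upper bound with `format.pval()` and to add an effect size, the rank-biserial correlation computed from the W statistic. It runs on simulated data, not on the report's dataset, and the names `demo`, `p_txt` and `r_rb` are illustrative.

```r
set.seed(1)
# Simulated data (illustrative only): two fully separated groups
demo <- data.frame(
  SALES = c(runif(50, 500, 3000), runif(60, 3001, 9000)),
  DEALSIZE_BIN = factor(rep(c("Small", "Non-Small"), times = c(50, 60)))
)
mw <- wilcox.test(SALES ~ DEALSIZE_BIN, data = demo, exact = FALSE)

# Report tiny p-values as "< eps" instead of printing 0
p_txt <- format.pval(mw$p.value, eps = 1e-16)

# Rank-biserial correlation: r = 2W / (n1 * n2) - 1,
# where W refers to the first factor level ("Non-Small")
n1 <- sum(demo$DEALSIZE_BIN == "Non-Small")
n2 <- sum(demo$DEALSIZE_BIN == "Small")
r_rb <- unname(2 * mw$statistic / (n1 * n2) - 1)
cat("p-value:", p_txt, "| rank-biserial r:", r_rb, "\n")
```

A value of r close to 1 or -1 means that almost every observation in one group exceeds every observation in the other, which is more informative than the p-value alone when the sample is large.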

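Siegel-Tukey test for comparing variability

The Siegel-Tukey test compares the variability of SALES between “Small” and “Non-Small” without assuming normality. The null hypothesis (H0) states that the two groups have equal variability, and the alternative (H1) states that they differ. The test uses a significance level of 0.05.

The implementation that follows departs from the textbook procedure, so its p-value should be read with caution. It assigns odd ranks to the lower half and even ranks to the upper half of the sorted data instead of alternating from the extremes inward. It does not align the group medians before ranking. Its variance term reduces to n1·n2·(N+1)²/12 rather than the usual n1·n2·(N+1)/12.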

# Prueba no paramétrica para comparar varianzas: Siegel-Tukey
cat("\n=== Prueba de Siegel-Tukey para comparar varianzas ===\n")

## 
## === Prueba de Siegel-Tukey para comparar varianzas ===

cat("Hipotesis:\n")

## Hipotesis:

cat("H0: Las varianzas de SALES entre Small y Non-Small son iguales\n")

## H0: Las varianzas de SALES entre Small y Non-Small son iguales

cat("H1: Las varianzas de SALES entre Small y Non-Small son diferentes\n")

## H1: Las varianzas de SALES entre Small y Non-Small son diferentes

cat("Nivel de significancia: α = 0.05\n\n")

## Nivel de significancia: α = 0.05

# Preparar datos para Siegel-Tukey
sales_small <- datos_bin$SALES[datos_bin$DEALSIZE_BIN == "Small"]
sales_non_small <- datos_bin$SALES[datos_bin$DEALSIZE_BIN == "Non-Small"]

# Combinar datos y asignar rangos
data_combined <- c(sales_small, sales_non_small)
ranks <- rank(data_combined)

# Dividir rangos por grupo
n_small <- length(sales_small)
n_non_small <- length(sales_non_small)
ranks_small <- ranks[1:n_small]
ranks_non_small <- ranks[(n_small + 1):(n_small + n_non_small)]

# Calcular rangos ponderados para Siegel-Tukey
siegel_tukey_ranks <- numeric(length(data_combined))
siegel_tukey_ranks[order(data_combined)] <- c(seq(1, by = 2, length.out = ceiling(length(data_combined)/2)),
                                              seq(2, by = 2, length.out = floor(length(data_combined)/2)))

# Ajustar para los dos grupos
siegel_tukey_small <- siegel_tukey_ranks[1:n_small]
siegel_tukey_non_small <- siegel_tukey_ranks[(n_small + 1):(n_small + n_non_small)]

# Estadistico de prueba
W_small <- sum(siegel_tukey_small)
W_non_small <- sum(siegel_tukey_non_small)

# Calcular el estadistico Z y p-value (aproximacion normal para muestras grandes)
n_total <- n_small + n_non_small
mean_W <- n_small * (n_total + 1) / 2
var_W <- n_small * n_non_small * (n_total + 1) * (n_total^2 - 1) / (12 * (n_total - 1))
Z <- (W_small - mean_W) / sqrt(var_W)
p_value <- 2 * pnorm(-abs(Z))  # Prueba bilateral

# Resultados de Siegel-Tukey
cat("Suma de rangos ponderados (Small):", W_small, "\n")

## Suma de rangos ponderados (Small): 1552516

cat("Suma de rangos ponderados (Non-Small):", W_non_small, "\n")

## Suma de rangos ponderados (Non-Small): 2145324

cat("Estadistico Z:", Z, "\n")

## Estadistico Z: -0.1335317

cat("p-value:", p_value, "\n")

## p-value: 0.8937729

if (p_value < 0.05) {
  cat("Resultado: Rechazamos H0. Las varianzas son significativamente diferentes.\n")
} else {
  cat("Resultado: No rechazamos H0. No hay evidencia de diferencia en las varianzas.\n")
}

## Resultado: No rechazamos H0. No hay evidencia de diferencia en las varianzas.
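
For comparison, the following is a minimal sketch of the textbook Siegel-Tukey ranking: rank 1 goes to the smallest value, ranks 2 and 3 to the two largest, ranks 4 and 5 to the next two smallest, and so on. Each group is first centered on its median so that a location shift is not mistaken for a difference in spread. The helpers `st_ranks` and `siegel_tukey_sketch` are hypothetical names, ties are not averaged, and the data are simulated, so this does not reproduce or replace the figures reported above.

```r
# Hypothetical helper: Siegel-Tukey ranks for n values already sorted ascending.
# Rank 1 goes to the smallest value, ranks 2-3 to the two largest,
# ranks 4-5 to the next two smallest, and so on toward the middle.
st_ranks <- function(n) {
  ranks <- numeric(n)
  lo <- 1; hi <- n          # next free positions at each end
  r <- 1                    # next rank to hand out
  take_low <- TRUE
  k <- 1                    # the first step assigns one rank, later steps two
  while (lo <= hi) {
    for (j in seq_len(k)) {
      if (lo > hi) break
      if (take_low) {
        ranks[lo] <- r; lo <- lo + 1
      } else {
        ranks[hi] <- r; hi <- hi - 1
      }
      r <- r + 1
    }
    take_low <- !take_low
    k <- 2
  }
  ranks
}

# Hypothetical wrapper: center each group on its median, rank the pooled
# sample with st_ranks(), then compare rank sums with a Wilcoxon test.
# Ties are ranked in sort order without averaging (a simplification).
siegel_tukey_sketch <- function(x, y) {
  x <- x - median(x)
  y <- y - median(y)
  pooled <- c(x, y)
  st <- numeric(length(pooled))
  st[order(pooled)] <- st_ranks(length(pooled))
  wilcox.test(st[seq_along(x)], st[length(x) + seq_along(y)], exact = FALSE)
}

# Simulated data (illustrative only): different location and different spread
set.seed(42)
x_demo <- rnorm(200, mean = 0, sd = 1)
y_demo <- rnorm(200, mean = 5, sd = 3)
st_demo <- siegel_tukey_sketch(x_demo, y_demo)
cat("Siegel-Tukey sketch p-value:", format.pval(st_demo$p.value), "\n")
```

Because the extremes receive the lowest ranks, the group with the larger spread ends up with the smaller rank sum, and the usual Wilcoxon rank-sum distribution applies to the result.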

A final conclusion summarizes the Mann-Whitney U and Siegel-Tukey results, stating whether the distributions and the variability of SALES differ significantly between “Small” and “Non-Small”. It also lists commercial implications for marketing, inventory, and logistics based on the median sales of each group.

# Conclusion final
cat("\n=== Conclusion Final ===\n")

## 
## === Conclusion Final ===

cat("Distribuciones: Basado en Mann-Whitney U (p =", mw_test$p.value, "), las distribuciones de SALES son significativamente diferentes.\n")

## Distribuciones: Basado en Mann-Whitney U (p = 0 ), las distribuciones de SALES son significativamente diferentes.

cat("Varianzas: Basado en Siegel-Tukey (p =", p_value, "), las varianzas de SALES no son significativamente diferentes.\n")

## Varianzas: Basado en Siegel-Tukey (p = 0.8937729 ), las varianzas de SALES no son significativamente diferentes.

cat("Implicaciones comerciales:\n")

## Implicaciones comerciales:

cat("- Marketing: Enfocarse en pedidos Non-Small (mediana: 4297) para maximizar ingresos, incentivando pedidos Small (mediana: 2114) a aumentar.\n")

## - Marketing: Enfocarse en pedidos Non-Small (mediana: 4297) para maximizar ingresos, incentivando pedidos Small (mediana: 2114) a aumentar.

cat("- Inventario: Priorizar stock para productos de pedidos Non-Small, optimizar para Small con alta rotacion.\n")

## - Inventario: Priorizar stock para productos de pedidos Non-Small, optimizar para Small con alta rotacion.

cat("- Logistica: Adaptar capacidad para Non-Small, estandarizar para Small.\n")

## - Logistica: Adaptar capacidad para Non-Small, estandarizar para Small.

Goodness-of-fit analysis

This analysis evaluates how well specific distributions fit two variables from the Auto_Sales_data_no_outliers1.xlsx dataset: a qualitative variable ( DEALSIZE ) and a discrete variable ( LINES_PER_ORDER_BINARY ). A chi-squared test determines whether the observed frequencies match the proposed theoretical distributions (multinomial for DEALSIZE and binomial for LINES_PER_ORDER_BINARY ) at a significance level of 0.05.

A histogram of LINES_PER_ORDER is drawn first to inspect its distribution before binarization. This step shows the shape of the data (for example, skewness or peaks) and supports the decision to binarize the variable.

# Instalar y cargar librerías necesarias
# install.packages(c("readxl", "dplyr", "ggplot2", "knitr"))
library(readxl)
library(dplyr)
library(ggplot2)
library(knitr)

# Crear nueva variable: LINES_PER_ORDER (numero de lineas por pedido)
lines_per_order_df <- data %>%
  group_by(ORDERNUMBER) %>%
  summarise(LINES_PER_ORDER = max(ORDERLINENUMBER, na.rm = TRUE)) %>%
  filter(!is.na(LINES_PER_ORDER) & LINES_PER_ORDER > 0)

## Warning: There were 2 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `LINES_PER_ORDER = max(ORDERLINENUMBER, na.rm = TRUE)`.
## ℹ In group 1: `ORDERNUMBER = 8335.981`.
## Caused by warning in `max()`:
## ! ningun argumento finito para max; retornando -Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

# Binarizar LINES_PER_ORDER (0 si <= 5, 1 si > 5)
lines_per_order_df <- lines_per_order_df %>%
  mutate(LINES_PER_ORDER_BINARY = ifelse(LINES_PER_ORDER <= 5, 0, 1))

# Histograma inicial de LINES_PER_ORDER para inspeccionar distribucion
p_hist_lines <- ggplot(lines_per_order_df, aes(x = LINES_PER_ORDER)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  labs(title = "Histograma de LINES_PER_ORDER", x = "Numero de Lineas por Pedido", y = "Frecuencia") +
  theme_minimal()
ggsave("lines_per_order_histogram.png", p_hist_lines, width = 8, height = 6)
print(p_hist_lines)

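Goodness-of-fit analysis for a qualitative variable: DEALSIZE

A multinomial distribution is fitted to the qualitative variable DEALSIZE (categories: Small, Medium, Large) by computing observed frequencies and expected frequencies from the probabilities estimated on the data. A chi-squared test then checks whether the observed frequencies match the multinomial distribution at a significance level of 0.05. Because the probabilities are estimated from the same observed counts, the expected and observed frequencies coincide by construction, so the statistic is 0 and the p-value is 1. The test would only be informative against proportions specified in advance.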

# 1. Análisis de bondad de ajuste para variable cualitativa: DEALSIZE
cat("\n### Bondad de Ajuste para DEALSIZE (Distribución Multinomial) ###\n")

## 
## ### Bondad de Ajuste para DEALSIZE (Distribución Multinomial) ###

# Frecuencias observadas
dealsize_freq <- table(data$DEALSIZE)
dealsize_df <- as.data.frame(dealsize_freq)
colnames(dealsize_df) <- c("Category", "Observed")

# Total de observaciones
n_total <- sum(dealsize_df$Observed)

# Probabilidades estimadas
dealsize_df$Probability <- dealsize_df$Observed / n_total

# Frecuencias esperadas
dealsize_df$Expected <- dealsize_df$Probability * n_total

# Verificar frecuencias esperadas mínimas
if (any(dealsize_df$Expected < 5)) {
  cat("Advertencia: Algunas frecuencias esperadas son < 5. Los resultados de la prueba chi-cuadrado pueden no ser confiables.\n")
}

# Prueba chi-cuadrado
chi_dealsize <- chisq.test(dealsize_df$Observed, p = dealsize_df$Probability, rescale.p = TRUE)

# Resultados
cat("\nResultados de la prueba chi-cuadrado para DEALSIZE:\n")

## 
## Resultados de la prueba chi-cuadrado para DEALSIZE:

cat("Estadistico chi-cuadrado:", chi_dealsize$statistic, "\n")

## Estadistico chi-cuadrado: 0

cat("Grados de libertad:", chi_dealsize$parameter, "\n")

## Grados de libertad: 2

cat("P-valor:", chi_dealsize$p.value, "\n")

## P-valor: 1

if (chi_dealsize$p.value > 0.05) {
  cat("Conclusion: No se rechaza H0. Las frecuencias observadas de DEALSIZE se ajustan a la distribucion multinomial con las probabilidades estimadas (α = 0.05).\n")
} else {
  cat("Conclusion: Se rechaza H0. Las frecuencias observadas de DEALSIZE no se ajustan a la distribucion multinomial con las probabilidades estimadas (α = 0.05).\n")
}

## Conclusion: No se rechaza H0. Las frecuencias observadas de DEALSIZE se ajustan a la distribucion multinomial con las probabilidades estimadas (α = 0.05).

# Tabla de resultados
print(kable(dealsize_df, caption = "Frecuencias Observadas y Esperadas para DEALSIZE"))

## 
## 
## Table: Frecuencias Observadas y Esperadas para DEALSIZE
## 
## |Category | Observed| Probability| Expected|
## |:--------|--------:|-----------:|--------:|
## |Large    |      124|   0.0456050|      124|
## |Medium   |     1349|   0.4961383|     1349|
## |Small    |     1246|   0.4582567|     1246|

# Gráfico de frecuencias
p_dealsize <- ggplot(dealsize_df, aes(x = Category)) +
  geom_bar(aes(y = Observed, fill = "Observada"), stat = "identity", alpha = 0.5) +
  geom_bar(aes(y = Expected, fill = "Esperada"), stat = "identity", alpha = 0.5) +
  labs(title = "Frecuencias Observadas vs Esperadas para DEALSIZE", x = "Categoria", y = "Frecuencia") +
  scale_fill_manual(values = c("Observada" = "blue", "Esperada" = "red")) +
  theme_minimal()
ggsave("dealsize_fit.png", p_dealsize, width = 8, height = 6)
print(p_dealsize)
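
A goodness-of-fit test becomes informative when the reference proportions are fixed before looking at the data. The sketch below reuses the observed DEALSIZE counts from the table above and tests them against a hypothetical reference mix of 5% / 50% / 45%. That mix is an assumption made up for illustration and does not come from the dataset or from any business target. Degrees of freedom are read from the `parameter` element of the `chisq.test()` result.

```r
# Observed counts taken from the DEALSIZE table above
observed <- c(Large = 124, Medium = 1349, Small = 1246)

# Hypothetical reference proportions fixed in advance (assumption for illustration)
p0 <- c(Large = 0.05, Medium = 0.50, Small = 0.45)

gof <- chisq.test(observed, p = p0)

# The degrees of freedom are stored in $parameter
cat("X-squared =", gof$statistic,
    "| df =", gof$parameter,
    "| p-value =", gof$p.value, "\n")

# Expected counts under the reference proportions
print(round(gof$expected, 1))
```

Here the expected counts no longer equal the observed ones, so the statistic is positive and the p-value reflects how far the data sit from the stated reference.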

Goodness-of-fit analysis for a discrete variable: LINES_PER_ORDER_BINARY

A binomial distribution is fitted to the discrete variable LINES_PER_ORDER_BINARY (coded 0 if LINES_PER_ORDER <= 5 and 1 if > 5) by computing observed and expected frequencies. A chi-squared test then checks whether the observed frequencies match the binomial distribution at a significance level of 0.05. Because p is estimated from the same data, the expected and observed counts coincide by construction.

# 2. Análisis de bondad de ajuste para variable discreta: LINES_PER_ORDER_BINARY (Distribución Binomial)
cat("\n### Bondad de Ajuste para LINES_PER_ORDER_BINARY (Distribución Binomial) ###\n")

## 
## ### Bondad de Ajuste para LINES_PER_ORDER_BINARY (Distribución Binomial) ###

# Frecuencias observadas
lines_binary_freq <- table(lines_per_order_df$LINES_PER_ORDER_BINARY)
lines_binary_df <- as.data.frame(lines_binary_freq)
colnames(lines_binary_df) <- c("Category", "Observed")

# Total de observaciones
n_total_lines <- sum(lines_binary_df$Observed)

# Probabilidad estimada (proporción de 1s)
p_binom <- lines_binary_df$Observed[lines_binary_df$Category == 1] / n_total_lines

# Frecuencias esperadas
lines_binary_df$Expected <- c(n_total_lines * (1 - p_binom), n_total_lines * p_binom)

# Verificar frecuencias esperadas mínimas
if (any(lines_binary_df$Expected < 5)) {
  cat("Advertencia: Algunas frecuencias esperadas son < 5. Los resultados de la prueba chi-cuadrado pueden no ser confiables.\n")
}

# Prueba chi-cuadrado
chi_lines_binom <- chisq.test(lines_binary_df$Observed, p = c(1 - p_binom, p_binom))

# Resultados
cat("\nResultados de la prueba chi-cuadrado para LINES_PER_ORDER_BINARY (Binomial):\n")

## 
## Resultados de la prueba chi-cuadrado para LINES_PER_ORDER_BINARY (Binomial):

cat("Estadistico chi-cuadrado:", chi_lines_binom$statistic, "\n")

## Estadistico chi-cuadrado: 2.348237e-30

cat("Grados de libertad:", chi_lines_binom$parameter, "\n")

## Grados de libertad: 1

cat("P-valor:", chi_lines_binom$p.value, "\n")

## P-valor: 1

if (chi_lines_binom$p.value > 0.05) {
  cat("Conclusion: No se rechaza H0. Los datos de LINES_PER_ORDER_BINARY se ajustan a la distribucion Binomial con p =", round(p_binom, 4), "(α = 0.05).\n")
} else {
  cat("Conclusion: Se rechaza H0. Los datos de LINES_PER_ORDER_BINARY no se ajustan a la distribucion Binomial con p =", round(p_binom, 4), "(α = 0.05).\n")
}

## Conclusion: No se rechaza H0. Los datos de LINES_PER_ORDER_BINARY se ajustan a la distribucion Binomial con p = 0.7114 (α = 0.05).

# Tabla de resultados
print(kable(lines_binary_df, caption = "Frecuencias Observadas y Esperadas para LINES_PER_ORDER_BINARY"))

## 
## 
## Table: Frecuencias Observadas y Esperadas para LINES_PER_ORDER_BINARY
## 
## |Category | Observed| Expected|
## |:--------|--------:|--------:|
## |0        |       86|       86|
## |1        |      212|      212|

# Gráfico de frecuencias
p_lines_binom <- ggplot(lines_binary_df, aes(x = Category)) +
  geom_bar(aes(y = Observed, fill = "Observada"), stat = "identity", alpha = 0.5) +
  geom_bar(aes(y = Expected, fill = "Esperada"), stat = "identity", alpha = 0.5) +
  labs(title = "Frecuencias Observadas vs Esperadas para LINES_PER_ORDER_BINARY (Binomial)", x = "Categoria (0 = <= 5, 1 = > 5)", y = "Frecuencia") +
  scale_fill_manual(values = c("Observada" = "blue", "Esperada" = "red")) +
  theme_minimal()
ggsave("lines_per_order_fit_binom.png", p_lines_binom, width = 8, height = 6)
print(p_lines_binom)

An interpretive summary consolidates the chi-squared results for DEALSIZE (multinomial) and LINES_PER_ORDER_BINARY (binomial), including test statistics, p-values, recommendations, and conclusions.

# Resumen interpretativo
cat("\n### Resumen Interpretativo ###\n")

## 
## ### Resumen Interpretativo ###

cat("1. DEALSIZE (Multinomial):\n")

## 1. DEALSIZE (Multinomial):

cat("- Se ajusto una distribucion multinomial a las categorias Small, Medium, Large.\n")

## - Se ajusto una distribucion multinomial a las categorias Small, Medium, Large.

cat("- Resultado chi-cuadrado: Estadistico =", chi_dealsize$statistic, ", Grados de libertad =", chi_dealsize$parameter, ", P-valor =", chi_dealsize$p.value, "\n")

## - Resultado chi-cuadrado: Estadistico = 0 , Grados de libertad = 2 , P-valor = 1

cat("- Conclusion:", if (chi_dealsize$p.value > 0.05) "Las frecuencias observadas se ajustan a la distribucion multinomial." else "Las frecuencias observadas no se ajustan a la distribucion multinomial.", "\n")

## - Conclusion: Las frecuencias observadas se ajustan a la distribucion multinomial.

cat("- Recomendacion: Revisar el grafico dealsize_fit.png para confirmar el ajuste visualmente.\n")

## - Recomendacion: Revisar el grafico dealsize_fit.png para confirmar el ajuste visualmente.

cat("\n2. LINES_PER_ORDER_BINARY (Binomial):\n")

## 
## 2. LINES_PER_ORDER_BINARY (Binomial):

cat("- Se ajusto una distribucion Binomial con p =", round(p_binom, 4), ".\n")

## - Se ajusto una distribucion Binomial con p = 0.7114 .

if (exists("chi_lines_binom")) {
  cat("- Resultado chi-cuadrado: Estadistico =", chi_lines_binom$statistic, ", Grados de libertad =", chi_lines_binom$parameter, ", P-valor =", chi_lines_binom$p.value, "\n")
  cat("- Conclusion:", if (chi_lines_binom$p.value > 0.05) "Los datos se ajustan a la distribucion Binomial." else "Los datos no se ajustan a la distribucion Binomial.", "\n")
} else {
  cat("- No se pudo realizar la prueba chi-cuadrado debido a frecuencias esperadas insuficientes.\n")
}

## - Resultado chi-cuadrado: Estadistico = 2.348237e-30 , Grados de libertad = 1 , P-valor = 1 
## - Conclusion: Los datos se ajustan a la distribucion Binomial.

cat("- Recomendacion: Revisar el grafico lines_per_order_fit_binom.png y el histograma lines_per_order_histogram.png para evaluar el ajuste.\n")

## - Recomendacion: Revisar el grafico lines_per_order_fit_binom.png y el histograma lines_per_order_histogram.png para evaluar el ajuste.

cat("\nNotas Generales:\n")

## 
## Notas Generales:

cat("- LINES_PER_ORDER fue binarizado (0 si <= 5, 1 si > 5) para simplificar el analisis y facilitar el ajuste binomial.\n")

## - LINES_PER_ORDER fue binarizado (0 si <= 5, 1 si > 5) para simplificar el analisis y facilitar el ajuste binomial.

cat("- La binarizacion se eligio tras fallos previos con distribuciones como Geometrica, Poisson y Binomial Negativa.\n")

## - La binarizacion se eligio tras fallos previos con distribuciones como Geometrica, Poisson y Binomial Negativa.

cat("- Si el ajuste binomial no es satisfactorio, considerar ajustar los puntos de corte (por ejemplo, <= 3 vs > 3) o probar otra variable (por ejemplo, DAYS_SINCE_LASTORDER).\n")

## - Si el ajuste binomial no es satisfactorio, considerar ajustar los puntos de corte (por ejemplo, <= 3 vs > 3) o probar otra variable (por ejemplo, DAYS_SINCE_LASTORDER).

Independence analysis

This analysis evaluates whether pairs of categorical variables ( DEALSIZE_BIN , PRODUCTLINE , STATUS ) in the Auto_Sales_data_no_outliers1.xlsx dataset are independent, using chi-squared tests of independence at a significance level of 0.05. Three pairs are analyzed: DEALSIZE_BIN vs PRODUCTLINE , DEALSIZE_BIN vs STATUS , and PRODUCTLINE vs STATUS . When dependence is detected, a post-hoc analysis of standardized residuals identifies the specific combinations that drive the relationship. The goal is to determine whether these variables are related, which can inform commercial decisions such as market segmentation or inventory management.

# Cargar librerías necesarias
library(readxl)
library(dplyr)
library(ggplot2)

# Cargar datos
datos <- read_excel("~/Auto_Sales_data_no_outliers1.xlsx", sheet = "Auto Sales data")

# Preparar datos: eliminar valores faltantes y binarizar DEALSIZE
datos <- datos %>%
  filter(!is.na(DEALSIZE) & !is.na(PRODUCTLINE) & !is.na(STATUS)) %>%
  mutate(DEALSIZE_BIN = as.factor(ifelse(DEALSIZE == "Small", "Small", "Non-Small")))

A custom function runs the chi-squared test of independence between two categorical variables. It builds a contingency table, checks that the expected frequencies are adequate (at least 5), runs the chi-squared test and, if the null hypothesis of independence is rejected, reports the standardized residuals to identify the combinations that contribute most.

# Función para realizar prueba chi-cuadrado y análisis post-hoc
realizar_chi_cuadrado <- function(var1, var2, datos) {
  cat("\n=== Prueba de independencia para", var1, "y", var2, "===\n")
  cat("Hipotesis:\n")
  cat("H0: Las variables", var1, "y", var2, "son independientes.\n")
  cat("H1: Las variables", var1, "y", var2, "no son independientes.\n")
  cat("Nivel de significancia: α = 0.05\n\n")
  
  # Crear tabla de contingencia
  tabla_contingencia <- table(datos[[var1]], datos[[var2]])
  cat("Tabla de contingencia:\n")
  print(tabla_contingencia)
  
  # Verificar que las frecuencias esperadas sean >= 5
  chi_test <- chisq.test(tabla_contingencia, correct = FALSE)
  frecuencias_esperadas <- chi_test$expected
  if (any(frecuencias_esperadas < 5)) {
    cat("Advertencia: Algunas frecuencias esperadas son menores a 5. Considera combinar categorias o usar una prueba alternativa.\n")
  }
  
  # Prueba chi-cuadrado
  cat("\nPrueba chi-cuadrado:\n")
  print(chi_test)
  
  # Interpretación
  if (chi_test$p.value < 0.05) {
    cat("Resultado: Rechazamos H0. Las variables", var1, "y", var2, "no son independientes.\n")
    cat("\n=== Analisis post-hoc de residuales ===\n")
    # Calcular residuales estandarizados manualmente
    residuales <- chi_test$stdres
    cat("Residuales estandarizados ajustados (valores > |2| indican contribuciones significativas):\n")
    print(residuales)
    # Identificar combinaciones significativas
    significativos <- which(abs(residuales) > 2, arr.ind = TRUE)
    if (length(significativos) > 0) {
      cat("\nCombinaciones significativas (residual > |2|):\n")
      for (i in 1:nrow(significativos)) {
        fila <- rownames(residuales)[significativos[i, 1]]
        columna <- colnames(residuales)[significativos[i, 2]]
        valor <- residuales[significativos[i, 1], significativos[i, 2]]
        cat(fila, "-", columna, ": Residual =", round(valor, 2), "\n")
      }
    } else {
      cat("No hay combinaciones con residuales significativos (|residual| > 2).\n")
    }
  } else {
    cat("Resultado: No rechazamos H0. Las variables", var1, "y", var2, "son independientes.\n")
  }
}
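
With samples of this size, even weak associations produce very small p-values, so an effect size helps judge practical relevance. A minimal sketch of Cramér's V, computed from the same Pearson statistic that `chisq.test()` returns, is shown here. The helper name `cramers_v` and the demo table are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: Cramér's V for a two-way contingency table.
# V = sqrt( X^2 / (n * (min(rows, cols) - 1)) ), ranging from 0 to 1.
cramers_v <- function(tab) {
  chi <- suppressWarnings(chisq.test(tab, correct = FALSE))
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi$statistic / (n * k)))
}

# Illustrative 2 x 2 table (made-up counts)
tab_demo <- matrix(c(30, 10, 10, 30), nrow = 2,
                   dimnames = list(Grupo = c("A", "B"),
                                   Categoria = c("X", "Y")))
v_demo <- cramers_v(tab_demo)
cat("Cramér's V:", v_demo, "\n")

# A table with identical row profiles has no association
tab_indep <- matrix(c(20, 20, 30, 30), nrow = 2)
v_indep <- cramers_v(tab_indep)
```

The same helper could be applied to any table built with `table(datos[[var1]], datos[[var2]])` to complement the residual analysis.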

Independence tests for each pair

The realizar_chi_cuadrado function is applied to three pairs of variables: DEALSIZE_BIN vs PRODUCTLINE , DEALSIZE_BIN vs STATUS , and PRODUCTLINE vs STATUS . For each pair it builds a contingency table and runs the chi-squared test. If dependence is detected, it examines the standardized residuals to identify the specific combinations that contribute to the relationship.

# === Independence tests for each pair ===
# 1. DEALSIZE_BIN vs PRODUCTLINE
realizar_chi_cuadrado("DEALSIZE_BIN", "PRODUCTLINE", datos)

## 
## === Prueba de independencia para DEALSIZE_BIN y PRODUCTLINE ===
## Hipotesis:
## H0: Las variables DEALSIZE_BIN y PRODUCTLINE son independientes.
## H1: Las variables DEALSIZE_BIN y PRODUCTLINE no son independientes.
## Nivel de significancia: α = 0.05
## 
## Tabla de contingencia:
##            
##             Classic Cars Motorcycles Planes Ships Trains Trucks and Buses
##   Non-Small          597         162    135   106     27              180
##   Small              336         147    167   124     50              115
##            
##             Vintage Cars
##   Non-Small          266
##   Small              307
## 
## Prueba chi-cuadrado:
## 
##  Pearson's Chi-squared test
## 
## data:  tabla_contingencia
## X-squared = 84.302, df = 6, p-value = 4.604e-16
## 
## Resultado: Rechazamos H0. Las variables DEALSIZE_BIN y PRODUCTLINE no son independientes.
## 
## === Analisis post-hoc de residuales ===
## Residuales estandarizados ajustados (valores > |2| indican contribuciones significativas):
##            
##             Classic Cars Motorcycles     Planes      Ships     Trains
##   Non-Small    7.4224431  -0.6547159 -3.5040955 -2.5728373 -3.4141239
##   Small       -7.4224431   0.6547159  3.5040955  2.5728373  3.4141239
##            
##             Trucks and Buses Vintage Cars
##   Non-Small        2.4981618   -4.1920754
##   Small           -2.4981618    4.1920754
## 
## Combinaciones significativas (residual > |2|):
## Non-Small - Classic Cars : Residual = 7.42 
## Small - Classic Cars : Residual = -7.42 
## Non-Small - Planes : Residual = -3.5 
## Small - Planes : Residual = 3.5 
## Non-Small - Ships : Residual = -2.57 
## Small - Ships : Residual = 2.57 
## Non-Small - Trains : Residual = -3.41 
## Small - Trains : Residual = 3.41 
## Non-Small - Trucks and Buses : Residual = 2.5 
## Small - Trucks and Buses : Residual = -2.5 
## Non-Small - Vintage Cars : Residual = -4.19 
## Small - Vintage Cars : Residual = 4.19
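The |residual| > 2 cutoff is applied to each of the 14 cells separately, so some flags may reflect multiple testing rather than a real contribution. As a hedged sketch that is not part of the original analysis, the block below rebuilds the contingency table from the counts printed above and compares the adjusted standardized residuals against a Bonferroni-corrected critical value. The names `tabla`, `prueba`, `z_crit` and `marcadas` are illustrative assumptions.

```r
# Contingency table rebuilt from the printed counts (DEALSIZE_BIN x PRODUCTLINE)
tabla <- matrix(
  c(597, 162, 135, 106, 27, 180, 266,
    336, 147, 167, 124, 50, 115, 307),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DEALSIZE_BIN = c("Non-Small", "Small"),
    PRODUCTLINE  = c("Classic Cars", "Motorcycles", "Planes", "Ships",
                     "Trains", "Trucks and Buses", "Vintage Cars")
  )
)

# Same Pearson test as in the report, run on the rebuilt table
prueba <- chisq.test(tabla, correct = FALSE)

# Two-sided Bonferroni critical value: alpha = 0.05 shared across all cells
alfa   <- 0.05
z_crit <- qnorm(1 - alfa / (2 * length(tabla)))

# Cells whose adjusted standardized residual exceeds the corrected cutoff
marcadas <- abs(prueba$stdres) > z_crit

print(round(z_crit, 2))
print(marcadas)
```

Under this stricter cutoff (roughly 2.9), Ships and Trucks and Buses, whose residuals are ±2.57 and ±2.50, would no longer be flagged, while Classic Cars, Planes, Trains and Vintage Cars still would. Bonferroni is conservative; it is shown here only as one simple way to account for the number of cells.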

# 2. DEALSIZE_BIN vs STATUS
realizar_chi_cuadrado("DEALSIZE_BIN", "STATUS", datos)

## 
## === Prueba de independencia para DEALSIZE_BIN y STATUS ===
## Hipotesis:
## H0: Las variables DEALSIZE_BIN y STATUS son independientes.
## H1: Las variables DEALSIZE_BIN y STATUS no son independientes.
## Nivel de significancia: α = 0.05
## 
## Tabla de contingencia:
##            
##             Cancelled Disputed In Process On Hold Resolved Shipped
##   Non-Small        33        8         20      28       27    1357
##   Small            27        4         20      15       20    1160
## 
## Prueba chi-cuadrado:
## 
##  Pearson's Chi-squared test
## 
## data:  tabla_contingencia
## X-squared = 3.3971, df = 5, p-value = 0.639
## 
## Resultado: No rechazamos H0. Las variables DEALSIZE_BIN y STATUS son independientes.

# 3. PRODUCTLINE vs STATUS
realizar_chi_cuadrado("PRODUCTLINE", "STATUS", datos)

## 
## === Prueba de independencia para PRODUCTLINE y STATUS ===
## Hipotesis:
## H0: Las variables PRODUCTLINE y STATUS son independientes.
## H1: Las variables PRODUCTLINE y STATUS no son independientes.
## Nivel de significancia: α = 0.05
## 
## Tabla de contingencia:
##                   
##                    Cancelled Disputed In Process On Hold Resolved Shipped
##   Classic Cars            16        2         13      12        8     882
##   Motorcycles              0        5          0       1        0     303
##   Planes                  12        2          0       9       12     267
##   Ships                   18        1          0       8       12     191
##   Trains                   1        0          0       1        0      75
##   Trucks and Buses         0        0         11       4        5     275
##   Vintage Cars            13        2         16       8       10     524

## Warning in chisq.test(tabla_contingencia, correct = FALSE): Chi-squared
## approximation may be incorrect

## Advertencia: Algunas frecuencias esperadas son menores a 5. Considera combinar categorias o usar una prueba alternativa.
## 
## Prueba chi-cuadrado:
## 
##  Pearson's Chi-squared test
## 
## data:  tabla_contingencia
## X-squared = 148.37, df = 30, p-value < 2.2e-16
## 
## Resultado: Rechazamos H0. Las variables PRODUCTLINE y STATUS no son independientes.
## 
## === Analisis post-hoc de residuales ===
## Residuales estandarizados ajustados (valores > |2| indican contribuciones significativas):
##                   
##                      Cancelled    Disputed  In Process     On Hold    Resolved
##   Classic Cars     -1.26172186 -1.29050581 -0.24346346 -0.89204255 -2.51900205
##   Motorcycles      -2.80473464  3.31471830 -2.28149206 -1.88248864 -2.47631578
##   Planes            2.21684618  0.61427853 -2.25223337  2.06641014  3.17479169
##   Ships             6.06345164 -0.01567752 -1.93686661  2.40995468  4.24302919
##   Trains           -0.55022480 -0.59269335 -1.08774526 -0.20176102 -1.18063122
##   Trucks and Buses -2.73253517 -1.21114405  3.41118603 -0.32884298 -0.04698118
##   Vintage Cars      0.11384242 -0.37517731  2.95682659 -0.40020447  0.03436727
##                   
##                        Shipped
##   Classic Cars      2.82103804
##   Motorcycles       3.90695273
##   Planes           -2.92399139
##   Ships            -5.75862413
##   Trains            1.64015300
##   Trucks and Buses  0.45055490
##   Vintage Cars     -1.15308355
## 
## Combinaciones significativas (residual > |2|):
## Motorcycles - Cancelled : Residual = -2.8 
## Planes - Cancelled : Residual = 2.22 
## Ships - Cancelled : Residual = 6.06 
## Trucks and Buses - Cancelled : Residual = -2.73 
## Motorcycles - Disputed : Residual = 3.31 
## Motorcycles - In Process : Residual = -2.28 
## Planes - In Process : Residual = -2.25 
## Trucks and Buses - In Process : Residual = 3.41 
## Vintage Cars - In Process : Residual = 2.96 
## Planes - On Hold : Residual = 2.07 
## Ships - On Hold : Residual = 2.41 
## Classic Cars - Resolved : Residual = -2.52 
## Motorcycles - Resolved : Residual = -2.48 
## Planes - Resolved : Residual = 3.17 
## Ships - Resolved : Residual = 4.24 
## Classic Cars - Shipped : Residual = 2.82 
## Motorcycles - Shipped : Residual = 3.91 
## Planes - Shipped : Residual = -2.92 
## Ships - Shipped : Residual = -5.76
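Because several expected frequencies for PRODUCTLINE vs STATUS are below 5, the asymptotic p-value above should be read with caution, as the warning indicates. One possible alternative, sketched here under the assumption that the printed table is the full data for this pair, is the Monte Carlo p-value that `chisq.test()` offers through `simulate.p.value`. The names `tabla_ps`, `esperadas`, `n_bajas` and `prueba_mc`, the seed, and the number of replicates are illustrative choices, not part of the original analysis.

```r
# Contingency table rebuilt from the printed counts (PRODUCTLINE x STATUS)
tabla_ps <- matrix(
  c(16, 2, 13, 12,  8, 882,
     0, 5,  0,  1,  0, 303,
    12, 2,  0,  9, 12, 267,
    18, 1,  0,  8, 12, 191,
     1, 0,  0,  1,  0,  75,
     0, 0, 11,  4,  5, 275,
    13, 2, 16,  8, 10, 524),
  nrow = 7, byrow = TRUE,
  dimnames = list(
    PRODUCTLINE = c("Classic Cars", "Motorcycles", "Planes", "Ships",
                    "Trains", "Trucks and Buses", "Vintage Cars"),
    STATUS      = c("Cancelled", "Disputed", "In Process",
                    "On Hold", "Resolved", "Shipped")
  )
)

# Count the cells with expected frequency below 5
# (the approximation warning is suppressed for this check only)
esperadas <- suppressWarnings(chisq.test(tabla_ps, correct = FALSE))$expected
n_bajas   <- sum(esperadas < 5)

# Monte Carlo p-value: tables are simulated with the observed margins fixed
set.seed(123)  # arbitrary seed, only for reproducibility
prueba_mc <- chisq.test(tabla_ps, simulate.p.value = TRUE, B = 10000)

print(n_bajas)
print(prueba_mc)
```

The test statistic is the same as in the asymptotic test; only the reference distribution changes. Combining sparse STATUS categories, as the function's own warning suggests, would be another option.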

Example code is included to visualize the relationship between variables when dependence is detected. For instance, a stacked bar chart can show the proportion of each PRODUCTLINE within each DEALSIZE_BIN category.

# === Visualization of the relationships (for non-independent pairs) ===
# Note: add further plots as needed once dependencies have been identified
# Example:
ggplot(datos, aes(x = DEALSIZE_BIN, fill = PRODUCTLINE)) +
  geom_bar(position = "fill") +
  labs(title = "Proporcion de PRODUCTLINE por DEALSIZE_BIN",
       x = "Tamanio del pedido", y = "Proporcion") +
  theme_minimal()
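The heights drawn by `position = "fill"` are the within-category proportions, which can also be read numerically. As a minimal sketch, assuming the counts printed in the first contingency table, the block below computes them with `prop.table()`. The names `tabla` and `proporciones` are illustrative.

```r
# Contingency table rebuilt from the printed counts (DEALSIZE_BIN x PRODUCTLINE)
tabla <- matrix(
  c(597, 162, 135, 106, 27, 180, 266,
    336, 147, 167, 124, 50, 115, 307),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    DEALSIZE_BIN = c("Non-Small", "Small"),
    PRODUCTLINE  = c("Classic Cars", "Motorcycles", "Planes", "Ships",
                     "Trains", "Trucks and Buses", "Vintage Cars")
  )
)

# Row proportions: share of each product line within each deal-size category
proporciones <- prop.table(tabla, margin = 1)

print(round(proporciones, 3))
```

Classic Cars account for about 40.5% of Non-Small orders but only about 27.0% of Small orders, which is consistent with the large residual reported for that product line.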

ANÁLISIS DE AUTOMOBILE SALES DATA PARA OPTIMIZAR VENTAS Y LOGÍSTICA AUTOMOTRIZ

Diego Alejandro Peñarete Rodríguez, Diana Carolina Serrato Florez

2025-05-24
