Project description

The Automobile Sales Data dataset, taken from Kaggle, contains information on sales of automotive products. It was probably collected by a commercial company through a sales management (ERP) system that records transactions automatically. The dataset has 2,722 records and 20 variables.

This notebook analyzes the dataset to identify patterns, distributions, and relationships that can inform business decisions such as inventory, marketing, and logistics strategies.

The analysis covers: descriptive analysis of categorical variables (univariate and bivariate), descriptive analysis of quantitative variables (univariate and bivariate), bivariate analysis between categorical and quantitative variables, and hypothesis tests.

Libraries and data types

The required packages are loaded first. The dataset is then read from an Excel file, each column is converted to its appropriate type (numeric, factor, date, or character), and the key variables are checked for missing values before the analysis.

library(readxl)

## Warning: package 'readxl' was built under R version 4.4.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.4.3

library(treemap)

## Warning: package 'treemap' was built under R version 4.4.3

library(moments)
library(knitr)

## Warning: package 'knitr' was built under R version 4.4.3

# Load the dataset
data <- read_excel("~/Auto_Sales_data_no_outliers1.xlsx")
knitr::opts_chunk$set(echo = TRUE)

# Ensure correct data types
data <- data %>%
  mutate(
    ORDERNUMBER = as.numeric(ORDERNUMBER),
    QUANTITYORDERED = as.numeric(QUANTITYORDERED),
    PRICEEACH = as.numeric(PRICEEACH),
    ORDERLINENUMBER = as.numeric(ORDERLINENUMBER),
    SALES = as.numeric(SALES),
    ORDERDATE = as.Date(ORDERDATE, format = "%Y-%m-%d"),
    DAYS_SINCE_LASTORDER = as.numeric(DAYS_SINCE_LASTORDER),
    STATUS = as.factor(toupper(trimws(STATUS))),
    PRODUCTLINE = as.factor(PRODUCTLINE),
    MSRP = as.numeric(MSRP),
    PRODUCTCODE = as.factor(PRODUCTCODE),
    CUSTOMERNAME = as.factor(CUSTOMERNAME),
    PHONE = as.character(PHONE),
    ADDRESSLINE1 = as.character(ADDRESSLINE1),
    CITY = as.factor(CITY),
    POSTALCODE = as.character(POSTALCODE),
    COUNTRY = as.factor(COUNTRY),
    CONTACTLASTNAME = as.character(CONTACTLASTNAME),
    CONTACTFIRSTNAME = as.character(CONTACTFIRSTNAME),
    DEALSIZE = as.factor(DEALSIZE)
  )

# Check for NA values in key columns
vars_to_check <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER", "PRODUCTLINE", "DEALSIZE", "STATUS")
colSums(is.na(data[vars_to_check]))

##                SALES      QUANTITYORDERED DAYS_SINCE_LASTORDER 
##                    3                    3                    3 
##          PRODUCTLINE             DEALSIZE               STATUS 
##                    3                    3                    3
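
All six key variables show exactly three missing values, which suggests (but does not prove) that the same three rows are empty across columns. A minimal sketch of how this could be checked, using a small hypothetical data frame `toy` in place of the real dataset:

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  SALES = c(100, NA, 250, NA),
  QUANTITYORDERED = c(10, NA, 25, NA),
  STATUS = c("SHIPPED", NA, "SHIPPED", NA)
)

# Number of missing values in each row
na_per_row <- rowSums(is.na(toy))

# Rows where every checked column is missing
fully_missing <- which(na_per_row == ncol(toy))
print(fully_missing)

# Keep only rows with no missing values in the checked columns
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

If the same pattern holds in the real data, dropping those rows once after loading would avoid repeating `na.rm = TRUE` in later steps. Whether to drop them is a judgment call that this sketch does not settle.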

Descriptive analysis of categorical variables

This section describes the categorical variables PRODUCTLINE, DEALSIZE, and STATUS. Each variable is first analyzed on its own through frequencies, proportions, and plots (bar charts and a treemap). A bivariate analysis of DEALSIZE and STATUS then explores whether the two are related, using cross tables and a stacked bar chart. The goal is to identify patterns that are useful for logistics and commercial decisions.

Univariate: PRODUCTLINE, DEALSIZE, STATUS

A frequency table is built for each categorical variable, showing the count and proportion of records in each category.

# Frequency tables
categorical_vars <- c("PRODUCTLINE", "DEALSIZE", "STATUS")
for (var in categorical_vars) {
  freq_table <- data.frame(table(data[[var]]))
  colnames(freq_table) <- c(var, "Frequency")
  freq_table$Proportion <- prop.table(freq_table$Frequency)
  print(kable(freq_table, caption = paste("Frequency table for", var)))
}

## 
## 
## Table: Frequency table for PRODUCTLINE
## 
## |PRODUCTLINE      | Frequency| Proportion|
## |:----------------|---------:|----------:|
## |Classic Cars     |       933|  0.3431409|
## |Motorcycles      |       309|  0.1136447|
## |Planes           |       302|  0.1110702|
## |Ships            |       230|  0.0845899|
## |Trains           |        77|  0.0283192|
## |Trucks and Buses |       295|  0.1084958|
## |Vintage Cars     |       573|  0.2107392|
## 
## 
## Table: Frequency table for DEALSIZE
## 
## |DEALSIZE | Frequency| Proportion|
## |:--------|---------:|----------:|
## |Large    |       124|  0.0456050|
## |Medium   |      1349|  0.4961383|
## |Small    |      1246|  0.4582567|
## 
## 
## Table: Frequency table for STATUS
## 
## |STATUS     | Frequency| Proportion|
## |:----------|---------:|----------:|
## |CANCELLED  |        60|  0.0220669|
## |DISPUTED   |        12|  0.0044134|
## |IN PROCESS |        40|  0.0147113|
## |ON HOLD    |        43|  0.0158146|
## |RESOLVED   |        47|  0.0172858|
## |SHIPPED    |      2517|  0.9257080|

# Univariate plots with counts on the bars
# Bar chart of PRODUCTLINE with counts
ggplot(data, aes(x = PRODUCTLINE)) +
  geom_bar(fill = "lightgreen", color = "black") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(title = "Frequency of Product Lines", x = "Product Line", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("bar_productline_with_values.png", width = 8, height = 6)

# Bar chart of DEALSIZE with counts
ggplot(data, aes(x = DEALSIZE)) +
  geom_bar(fill = "lightcoral", color = "black") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(title = "Frequency of Deal Size (DEALSIZE)", x = "Deal Size", y = "Frequency") +
  theme_minimal()

ggsave("bar_dealsize_with_values.png", width = 8, height = 6)

# Treemap of STATUS, with areas sized by total SALES
treemap(data, index = "STATUS", vSize = "SALES", title = "Treemap: Sales by Order Status (STATUS)")

Bivariate: DEALSIZE vs. STATUS

A cross table of the two variables gives the frequency of each DEALSIZE and STATUS combination. Row proportions (the share of each STATUS within each DEALSIZE) are then computed, and a stacked bar chart shows the counts of each STATUS by DEALSIZE.

# Cross table of DEALSIZE and STATUS
cross_table <- table(data$DEALSIZE, data$STATUS)
prop_table <- prop.table(cross_table, margin = 1)
print("Cross table of DEALSIZE and STATUS:")

## [1] "Cross table of DEALSIZE and STATUS:"

print(kable(cross_table))

## 
## 
## |       | CANCELLED| DISPUTED| IN PROCESS| ON HOLD| RESOLVED| SHIPPED|
## |:------|---------:|--------:|----------:|-------:|--------:|-------:|
## |Large  |         0|        3|          2|       4|        1|     114|
## |Medium |        33|        5|         18|      24|       26|    1243|
## |Small  |        27|        4|         20|      15|       20|    1160|

print("Row proportions of STATUS within DEALSIZE:")

## [1] "Row proportions of STATUS within DEALSIZE:"

print(kable(prop_table))

## 
## 
## |       | CANCELLED|  DISPUTED| IN PROCESS|   ON HOLD|  RESOLVED|   SHIPPED|
## |:------|---------:|---------:|----------:|---------:|---------:|---------:|
## |Large  | 0.0000000| 0.0241935|  0.0161290| 0.0322581| 0.0080645| 0.9193548|
## |Medium | 0.0244626| 0.0037064|  0.0133432| 0.0177910| 0.0192735| 0.9214233|
## |Small  | 0.0216693| 0.0032103|  0.0160514| 0.0120385| 0.0160514| 0.9309791|
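
To make the meaning of `margin = 1` concrete, the sketch below rebuilds the cross table from the counts printed above and recomputes the proportions. The object names (`counts`, `row_props`, `col_props`) are illustrative and not part of the original notebook.

```r
# Counts copied from the DEALSIZE x STATUS cross table above
counts <- matrix(
  c(0, 3, 2, 4, 1, 114,
    33, 5, 18, 24, 26, 1243,
    27, 4, 20, 15, 20, 1160),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    DEALSIZE = c("Large", "Medium", "Small"),
    STATUS = c("CANCELLED", "DISPUTED", "IN PROCESS", "ON HOLD", "RESOLVED", "SHIPPED")
  )
)

# margin = 1: each row sums to 1 (distribution of STATUS within each DEALSIZE)
row_props <- prop.table(counts, margin = 1)

# margin = 2: each column sums to 1 (distribution of DEALSIZE within each STATUS)
col_props <- prop.table(counts, margin = 2)

# Share of Large deals that were shipped: 114 / 124
print(round(row_props["Large", "SHIPPED"], 7))
print(rowSums(row_props))
```

With row proportions, the SHIPPED share is close to 92-93% for all three deal sizes, so the table gives little indication that order status depends on deal size. The Large row rests on only 124 records, so its small cells should be read with caution.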

# Stacked bar chart with counts
cross_table_df <- as.data.frame.table(cross_table)
ggplot(cross_table_df, aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Freq), position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Stacked Bar Chart: DEALSIZE vs. STATUS", x = "Deal Size (DEALSIZE)", y = "Frequency", fill = "Order Status (STATUS)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("stacked_bar_dealsize_status_with_values.png", width = 8, height = 6)

Descriptive analysis of quantitative variables

This section describes the quantitative variables SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER. The univariate analysis computes summary statistics (mean, median, standard deviation, skewness, and kurtosis) and visualizes the distributions with histograms and boxplots to see whether they follow a known distributional model. The bivariate analysis examines the relationship between SALES and DAYS_SINCE_LASTORDER with a correlation test and a scatter plot. Together these describe the behavior of sales and of customer activity.

Univariate: SALES, QUANTITYORDERED, DAYS_SINCE_LASTORDER

# Univariate summary statistics
numeric_vars <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER")
numeric_stats <- data %>%
  summarise(across(
    all_of(numeric_vars),
    list(
      Mean = ~mean(., na.rm = TRUE),
      Median = ~median(., na.rm = TRUE),
      SD = ~sd(., na.rm = TRUE),
      Min = ~min(., na.rm = TRUE),
      Q1 = ~quantile(., 0.25, na.rm = TRUE),
      Q3 = ~quantile(., 0.75, na.rm = TRUE),
      Max = ~max(., na.rm = TRUE),
      Skewness = ~skewness(., na.rm = TRUE),
      Kurtosis = ~kurtosis(., na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  ))

# Reshape the statistics into one long table per variable
stats_list <- list()
for (var in numeric_vars) {
  stats_var <- data.frame(
    Metric = c("Mean", "Median", "SD", "Min", "Q1", "Q3", "Max", "Skewness", "Kurtosis"),
    Value = c(
      numeric_stats[[paste0(var, "_Mean")]],
      numeric_stats[[paste0(var, "_Median")]],
      numeric_stats[[paste0(var, "_SD")]],
      numeric_stats[[paste0(var, "_Min")]],
      numeric_stats[[paste0(var, "_Q1")]],
      numeric_stats[[paste0(var, "_Q3")]],
      numeric_stats[[paste0(var, "_Max")]],
      numeric_stats[[paste0(var, "_Skewness")]],
      numeric_stats[[paste0(var, "_Kurtosis")]]
    )
  )
  stats_list[[var]] <- stats_var
}

for (var in names(stats_list)) {
  print(kable(stats_list[[var]], caption = paste("Descriptive statistics for", var)))
}

## 
## 
## Table: Descriptive statistics for SALES
## 
## |Metric   |        Value|
## |:--------|------------:|
## |Mean     | 3481.9234939|
## |Median   | 3167.3600000|
## |SD       | 1704.2611303|
## |Min      |  482.1300000|
## |Q1       | 2197.1100000|
## |Q3       | 4437.1000000|
## |Max      | 9160.3600000|
## |Skewness |    0.8364885|
## |Kurtosis |    3.2735167|
## 
## 
## Table: Descriptive statistics for QUANTITYORDERED
## 
## |Metric   |      Value|
## |:--------|----------:|
## |Mean     | 34.9187201|
## |Median   | 34.0000000|
## |SD       |  9.5914528|
## |Min      |  6.0000000|
## |Q1       | 27.0000000|
## |Q3       | 43.0000000|
## |Max      | 97.0000000|
## |Skewness |  0.3137086|
## |Kurtosis |  3.2859502|
## 
## 
## Table: Descriptive statistics for DAYS_SINCE_LASTORDER
## 
## |Metric   |        Value|
## |:--------|------------:|
## |Mean     | 1764.7241633|
## |Median   | 1769.0000000|
## |SD       |  816.5323832|
## |Min      |   42.0000000|
## |Q1       | 1087.0000000|
## |Q3       | 2440.5000000|
## |Max      | 3562.0000000|
## |Skewness |   -0.0061354|
## |Kurtosis |    1.9764772|
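
The skewness and kurtosis above come from the `moments` package, which works with ratios of central moments. Its kurtosis is Pearson's kurtosis (a normal distribution gives 3), not excess kurtosis. The base R sketch below spells out those definitions; the helper names are hypothetical and are only meant to mirror what `moments::skewness()` and `moments::kurtosis()` compute.

```r
# k-th central moment, ignoring missing values
central_moment <- function(x, k) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^k)
}

# Skewness: third central moment scaled by the variance to the power 3/2
skewness_sketch <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Pearson kurtosis: fourth central moment scaled by the squared variance
kurtosis_sketch <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x <- c(1, 2, 3, 4, 5, NA)
print(skewness_sketch(x))  # symmetric data, so the skewness is 0
print(kurtosis_sketch(x))  # 6.8 / 2^2 = 1.7
```

Read this way, SALES (skewness 0.84, kurtosis 3.27) is right-skewed with tails slightly heavier than a normal distribution, while DAYS_SINCE_LASTORDER (skewness near 0, kurtosis 1.98) is symmetric but much flatter than a normal distribution.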

Univariate plots with fitted density functions

Histograms are drawn for SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER, each overlaid with a candidate density (gamma, normal, and log-normal, respectively) using stat_function from ggplot2.

Histogram of SALES

# Histogram of SALES with a gamma density fitted by the method of moments
sales_mean <- mean(data$SALES, na.rm = TRUE)
sales_var <- var(data$SALES, na.rm = TRUE)
shape <- sales_mean^2 / sales_var
scale <- sales_var / sales_mean
ggplot(data, aes(x = SALES)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  stat_function(fun = dgamma, args = list(shape = shape, scale = scale), color = "red", linewidth = 1) +
  labs(title = "Distribution of Sales (SALES) with Gamma Density", x = "Sales (SALES)", y = "Density") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales_gamma.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
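
The gamma parameters used for the SALES histogram follow from matching moments: a gamma distribution has mean shape × scale and variance shape × scale², so shape = mean² / variance and scale = variance / mean. The sketch below reproduces that arithmetic from the summary values reported earlier; the `_hat` names are illustrative only.

```r
# Method-of-moments estimates for a gamma distribution:
#   mean = shape * scale,  variance = shape * scale^2
# Values copied from the SALES summary table above
sales_mean <- 3481.9234939
sales_sd   <- 1704.2611303

shape_hat <- sales_mean^2 / sales_sd^2
scale_hat <- sales_sd^2 / sales_mean
print(c(shape = shape_hat, scale = scale_hat))

# Check: the fitted gamma reproduces the sample mean and variance
stopifnot(isTRUE(all.equal(shape_hat * scale_hat, sales_mean)))
stopifnot(isTRUE(all.equal(shape_hat * scale_hat^2, sales_sd^2)))
```

This is a moment-matching fit, not a maximum-likelihood fit. A maximum-likelihood estimate (for example with `MASS::fitdistr`) could give somewhat different parameter values.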

Histogram of QUANTITYORDERED

# Histogram of QUANTITYORDERED with normal density
ggplot(data, aes(x = QUANTITYORDERED)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(data$QUANTITYORDERED, na.rm = TRUE), sd = sd(data$QUANTITYORDERED, na.rm = TRUE)), color = "red", linewidth = 1) +
  labs(title = "Distribution of Quantity Ordered with Normal Density", x = "Quantity Ordered", y = "Density") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_quantityordered_normal.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Histogram of DAYS_SINCE_LASTORDER

# Histogram of DAYS_SINCE_LASTORDER with log-normal density (for reference)
ggplot(data, aes(x = DAYS_SINCE_LASTORDER)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black") +
  stat_function(fun = dlnorm, args = list(meanlog = mean(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE), sdlog = sd(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE)), color = "red", linewidth = 1) +
  labs(title = "Distribution of Days Since Last Order with Log-Normal Density", x = "Days Since Last Order", y = "Density") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_days_since_lastorder_lognormal.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

A Kolmogorov-Smirnov (KS) test compares the empirical distribution of DAYS_SINCE_LASTORDER with three theoretical distributions (normal, gamma, and log-normal) to check whether any parametric model is compatible with the data. Because the parameters are estimated from the same sample and the data contain ties, the reported p-values are approximate.

# KS test against a normal distribution
ks_normal <- ks.test(data$DAYS_SINCE_LASTORDER, "pnorm", mean = mean(data$DAYS_SINCE_LASTORDER, na.rm = TRUE), sd = sd(data$DAYS_SINCE_LASTORDER, na.rm = TRUE))

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "pnorm", mean =
## mean(data$DAYS_SINCE_LASTORDER, : ties should not be present for the one-sample
## Kolmogorov-Smirnov test

print("Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (normal distribution):")

## [1] "Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (normal distribution):"

print(ks_normal)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.050568, p-value = 1.827e-06
## alternative hypothesis: two-sided

# KS test against a gamma distribution (method-of-moments parameters)
days_mean <- mean(data$DAYS_SINCE_LASTORDER, na.rm = TRUE)
days_var <- var(data$DAYS_SINCE_LASTORDER, na.rm = TRUE)
shape_days <- days_mean^2 / days_var
scale_days <- days_var / days_mean
ks_gamma <- ks.test(data$DAYS_SINCE_LASTORDER, "pgamma", shape = shape_days, scale = scale_days)

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "pgamma", shape =
## shape_days, : ties should not be present for the one-sample Kolmogorov-Smirnov
## test

print("Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (gamma distribution):")

## [1] "Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (gamma distribution):"

print(ks_gamma)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.085314, p-value < 2.2e-16
## alternative hypothesis: two-sided

# KS test against a log-normal distribution
ks_lognormal <- ks.test(data$DAYS_SINCE_LASTORDER, "plnorm", meanlog = mean(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE), sdlog = sd(log(data$DAYS_SINCE_LASTORDER), na.rm = TRUE))

## Warning in ks.test.default(data$DAYS_SINCE_LASTORDER, "plnorm", meanlog =
## mean(log(data$DAYS_SINCE_LASTORDER), : ties should not be present for the
## one-sample Kolmogorov-Smirnov test

print("Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (log-normal distribution):")

## [1] "Kolmogorov-Smirnov test for DAYS_SINCE_LASTORDER (log-normal distribution):"

print(ks_lognormal)

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DAYS_SINCE_LASTORDER
## D = 0.099156, p-value < 2.2e-16
## alternative hypothesis: two-sided
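
The three KS tests repeat the same pattern: estimate the parameters, then compare the sample with the fitted distribution. A compact sketch of that pattern follows. It runs on simulated data, which is an assumption made for illustration and not the real DAYS_SINCE_LASTORDER values, and the names `candidates` and `ks_summary` are hypothetical.

```r
set.seed(123)
# Simulated stand-in: values spread fairly evenly over a range
# (an assumption for illustration; not the real DAYS_SINCE_LASTORDER data)
x <- runif(500, min = 50, max = 3500)

# One function per candidate model; parameters are estimated from the sample
candidates <- list(
  normal    = function(v) ks.test(v, "pnorm", mean = mean(v), sd = sd(v)),
  gamma     = function(v) ks.test(v, "pgamma", shape = mean(v)^2 / var(v), scale = var(v) / mean(v)),
  lognormal = function(v) ks.test(v, "plnorm", meanlog = mean(log(v)), sdlog = sd(log(v)))
)

# Collect the D statistic and p-value of each test in one table
ks_summary <- do.call(rbind, lapply(names(candidates), function(nm) {
  res <- candidates[[nm]](x)
  data.frame(model = nm, D = unname(res$statistic), p_value = res$p.value)
}))
print(ks_summary)
```

The D statistic is the largest vertical distance between the empirical and the theoretical cumulative distribution, so a smaller D means a closer fit. In the real results above, the normal model has the smallest D (0.0506), yet it is still rejected.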

All three tests reject their parametric model at any conventional significance level (every p-value is below 0.001). A kernel density estimate (KDE) is therefore used to describe this variable, since its distribution appears to call for a nonparametric approach.

# Histogram of DAYS_SINCE_LASTORDER with kernel density estimate (KDE)
ggplot(data, aes(x = DAYS_SINCE_LASTORDER)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black") +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Distribution of Days Since Last Order with Kernel Density (KDE)", x = "Days Since Last Order", y = "Density") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_density()`).

ggsave("histogram_days_since_lastorder_kde.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 3 rows containing non-finite outside the scale range
## (`stat_density()`).

# Boxplot of QUANTITYORDERED
ggplot(data, aes(y = QUANTITYORDERED)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot of Quantity Ordered (QUANTITYORDERED)", y = "Quantity Ordered") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantityordered.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Bivariate: SALES vs. DAYS_SINCE_LASTORDER

The Pearson correlation between the two variables is computed and tested against zero.

correlation <- cor.test(data$SALES, data$DAYS_SINCE_LASTORDER)
print("Correlation between SALES and DAYS_SINCE_LASTORDER:")

## [1] "Correlation between SALES and DAYS_SINCE_LASTORDER:"

print(correlation)

## 
##  Pearson's product-moment correlation
## 
## data:  data$SALES and data$DAYS_SINCE_LASTORDER
## t = -17.98, df = 2717, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3592696 -0.2920724
## sample estimates:
##        cor 
## -0.3260828
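
The t statistic reported by `cor.test` can be recovered from the correlation coefficient alone, using t = r · sqrt(df) / sqrt(1 − r²) with df = n − 2. The sketch below redoes that arithmetic from the printed values; the variable names are illustrative.

```r
# Values copied from the cor.test output above
r  <- -0.3260828
df <- 2717            # n - 2, so n = 2719 complete pairs

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_two_sided <- 2 * pt(-abs(t_stat), df)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

print(round(t_stat, 2))     # -17.98, matching the output above
print(round(r_squared, 3))  # 0.106
```

The correlation is clearly different from zero, but it is moderate at most: about 10.6% of the variance is shared. Larger SALES values tend to go with fewer days since the last order, and the scatter plot remains the better guide to the shape of that relationship.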

# Scatter plot
ggplot(data, aes(x = DAYS_SINCE_LASTORDER, y = SALES)) +
  geom_point(alpha = 0.5) +
  labs(title = "Sales (SALES) vs. Days Since Last Order", x = "Days Since Last Order", y = "Sales (SALES)") +
  theme_minimal()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggsave("scatter_sales_days.png", width = 8, height = 6)

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Bivariate descriptive analysis between categorical and quantitative variables

This section relates categorical and quantitative variables. Specifically, it examines how SALES varies by product line (PRODUCTLINE) and how QUANTITYORDERED is distributed by deal size (DEALSIZE). The analysis shows which product lines generate higher sales per record and how deal size relates to the quantities ordered.

SALES is summarized by PRODUCTLINE and QUANTITYORDERED by DEALSIZE. The tables report the mean, median, standard deviation, skewness, and kurtosis for each category, and boxplots and faceted histograms show the distribution within each category.

Summary statistics, boxplot, and faceted histogram of SALES by PRODUCTLINE

# Summary statistics of SALES by PRODUCTLINE
# Rows with a missing PRODUCTLINE are dropped so that no all-NA group appears
sales_by_productline <- data %>%
  filter(!is.na(PRODUCTLINE)) %>%
  group_by(PRODUCTLINE) %>%
  summarise(
    Mean_SALES = mean(SALES, na.rm = TRUE),
    Median_SALES = median(SALES, na.rm = TRUE),
    SD_SALES = sd(SALES, na.rm = TRUE),
    Skewness_SALES = skewness(SALES, na.rm = TRUE),
    Kurtosis_SALES = kurtosis(SALES, na.rm = TRUE)
  )
print("Summary statistics of SALES by PRODUCTLINE:")

## [1] "Summary statistics of SALES by PRODUCTLINE:"

print(kable(sales_by_productline))

## 
## 
## |PRODUCTLINE      | Mean_SALES| Median_SALES| SD_SALES| Skewness_SALES| Kurtosis_SALES|
## |:----------------|----------:|------------:|--------:|--------------:|--------------:|
## |Classic Cars     |   3940.106|     3729.390| 1885.660|      0.5234855|       2.586711|
## |Motorcycles      |   3441.322|     3113.640| 1686.146|      0.9750865|       3.699903|
## |Planes           |   3143.103|     2835.770| 1421.442|      1.1132958|       4.081103|
## |Ships            |   3043.649|     2884.925| 1058.753|      0.9442692|       4.249737|
## |Trains           |   2938.227|     2445.600| 1456.596|      1.5085860|       5.764990|
## |Trucks and Buses |   3767.997|     3451.000| 1674.056|      0.4868421|       2.579675|
## |Vintage Cars     |   3038.051|     2761.960| 1575.484|      1.0097794|       3.965723|

# Boxplot of SALES by PRODUCTLINE
ggplot(data, aes(x = PRODUCTLINE, y = SALES, fill = PRODUCTLINE)) +
  geom_boxplot() +
  labs(title = "Distribution of Sales (SALES) by Product Line", x = "Product Line", y = "Sales (SALES)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_sales_productline.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Faceted histogram of SALES by PRODUCTLINE
ggplot(data, aes(x = SALES)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~PRODUCTLINE, scales = "free_y") +
  labs(title = "Distribution of Sales (SALES) by Product Line", x = "Sales (SALES)", y = "Frequency") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales_productline.png", width = 10, height = 8)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Summary statistics and boxplot of QUANTITYORDERED by DEALSIZE

# Summary statistics of QUANTITYORDERED by DEALSIZE
# Rows with a missing DEALSIZE are dropped so that no all-NA group appears
quantity_by_dealsize <- data %>%
  filter(!is.na(DEALSIZE)) %>%
  group_by(DEALSIZE) %>%
  summarise(
    Mean_QUANTITY = mean(QUANTITYORDERED, na.rm = TRUE),
    Median_QUANTITY = median(QUANTITYORDERED, na.rm = TRUE),
    SD_QUANTITY = sd(QUANTITYORDERED, na.rm = TRUE),
    Skewness_QUANTITY = skewness(QUANTITYORDERED, na.rm = TRUE),
    Kurtosis_QUANTITY = kurtosis(QUANTITYORDERED, na.rm = TRUE)
  )
print("Summary statistics of QUANTITYORDERED by DEALSIZE:")

## [1] "Summary statistics of QUANTITYORDERED by DEALSIZE:"

print(kable(quantity_by_dealsize))

## 
## 
## |DEALSIZE | Mean_QUANTITY| Median_QUANTITY| SD_QUANTITY| Skewness_QUANTITY| Kurtosis_QUANTITY|
## |:--------|-------------:|---------------:|-----------:|-----------------:|-----------------:|
## |Large    |      46.04839|              45|    9.875530|         2.1058435|         10.382888|
## |Medium   |      37.96071|              39|    8.448259|        -0.1632404|          2.408981|
## |Small    |      30.51766|              29|    8.495740|         0.5786779|          2.796181|

# Boxplot of QUANTITYORDERED by DEALSIZE
ggplot(data, aes(x = DEALSIZE, y = QUANTITYORDERED, fill = DEALSIZE)) +
  geom_boxplot() +
  labs(title = "Distribution of Quantity Ordered by Deal Size", x = "Deal Size (DEALSIZE)", y = "Quantity Ordered (QUANTITYORDERED)") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantity_dealsize.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Hypothesis tests

This section runs two hypothesis tests. The first tests whether mean sales (SALES) exceed 3500, a reference value for expected revenue per record. The second tests whether the proportion of orders with STATUS “CANCELLED” is below 5%, as an indicator of the company's logistics efficiency. For each test, the null and alternative hypotheses and the results are presented.

Test 1: Mean of SALES

H0: The mean of SALES is less than or equal to 3500 (mu <= 3500)

H1: The mean of SALES is greater than 3500 (mu > 3500)

Average sales per record are central to assessing the profitability and efficiency of the commercial process in the automotive sector. Here, 3500 is set as the minimum reference value expected to keep the operation profitable and meet financial targets.

sales_mean_test <- t.test(data$SALES, mu = 3500, alternative = "greater")  # H0: mu <= 3500, H1: mu > 3500
print("Hypothesis test for the mean of SALES (H0: mu <= 3500):")

## [1] "Hypothesis test for the mean of SALES (H0: mu <= 3500):"

print(sales_mean_test)

## 
##  One Sample t-test
## 
## data:  data$SALES
## t = -0.55307, df = 2718, p-value = 0.7099
## alternative hypothesis: true mean is greater than 3500
## 95 percent confidence interval:
##  3428.145      Inf
## sample estimates:
## mean of x 
##  3481.923

Interpretation: with p = 0.7099, H0 is not rejected at the 5% significance level. The sample mean (3481.92) is below 3500, so the data provide no evidence that the mean of SALES exceeds the reference value.

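The one-sample t statistic can be reproduced by hand as t = (x̄ − μ0) / (s / sqrt(n)), with a one-sided p-value taken from the upper tail of a t distribution with n − 1 degrees of freedom. The sketch below uses the summary values reported in this notebook; the variable names are illustrative.

```r
# Summary values copied from the outputs above
x_bar <- 3481.9234939   # sample mean of SALES
s     <- 1704.2611303   # sample standard deviation of SALES
n     <- 2719           # non-missing observations (df = 2718)
mu0   <- 3500           # reference value under H0

# Test statistic and upper-tail p-value (H1: mu > mu0)
t_stat  <- (x_bar - mu0) / (s / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

print(round(t_stat, 5))   # -0.55307, matching the t.test output
print(round(p_value, 4))  # 0.7099
```

Because the sample mean falls below 3500, the statistic is negative and an upper-tail test cannot reject H0, whatever the sample size.
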
Test 2: Proportion of “CANCELLED” in STATUS

H0: The proportion of “CANCELLED” orders is greater than or equal to 0.05

H1: The proportion of “CANCELLED” orders is less than 0.05

A high cancellation rate can reflect problems in logistics processes, inventory errors, failures in customer communication, or weaknesses in commercial policies. The test therefore evaluates whether the proportion of cancelled orders stays below the critical threshold of 5%.

# 1. Check the unique values of STATUS
unique(data$STATUS)

## [1] SHIPPED    DISPUTED   CANCELLED  ON HOLD    RESOLVED   IN PROCESS <NA>      
## Levels: CANCELLED DISPUTED IN PROCESS ON HOLD RESOLVED SHIPPED

# 2. Normalize STATUS (trim whitespace, upper case); this also turns the factor into a character vector
data$STATUS <- toupper(trimws(data$STATUS))

# 3. Count CANCELLED orders and the total number of orders with a known STATUS
num_cancelled <- sum(data$STATUS == "CANCELLED", na.rm = TRUE)
n_total <- sum(!is.na(data$STATUS))

print(paste("CANCELLED orders:", num_cancelled))

## [1] "CANCELLED orders: 60"

print(paste("Total orders:", n_total))

## [1] "Total orders: 2719"

# 4. Hypothesis test
prueba_cancelled <- prop.test(
  x = num_cancelled, 
  n = n_total, 
  p = 0.05, 
  alternative = "less",
  correct = FALSE
)

# 5. Show the results
print(prueba_cancelled)

## 
##  1-sample proportions test without continuity correction
## 
## data:  num_cancelled out of n_total, null probability 0.05
## X-squared = 44.663, df = 1, p-value = 1.17e-11
## alternative hypothesis: true p is less than 0.05
## 95 percent confidence interval:
##  0.00000000 0.02719795
## sample estimates:
##          p 
## 0.02206694

Interpretation: with p = 1.17e-11, H0 is rejected at the 5% significance level. The estimated cancellation proportion is about 2.2%, with a one-sided 95% upper bound of about 2.7%, so it is significantly below the 5% threshold.

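Without continuity correction, the X-squared value reported by `prop.test` is the square of the usual one-sample z statistic, z = (p̂ − p0) / sqrt(p0 (1 − p0) / n). The sketch below checks this with the counts reported above; the variable names are illustrative.

```r
x  <- 60      # cancelled orders
n  <- 2719    # orders with a non-missing STATUS
p0 <- 0.05    # threshold under H0

# z statistic and lower-tail p-value (H1: p < p0)
p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- pnorm(z)

# prop.test without continuity correction reports z^2 as X-squared
check <- prop.test(x, n, p = p0, alternative = "less", correct = FALSE)
print(c(z = z, z_squared = z^2, p_value = p_value))
print(unname(check$statistic))
```

Both routes give the same statistic and p-value. The z form also makes the direction of the effect explicit, since z is negative when the observed proportion is below the threshold.
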
ANALYSIS OF AUTOMOBILE SALES DATA TO OPTIMIZE AUTOMOTIVE SALES AND LOGISTICS

Diego Alejandro Peñarete Rodríguez, Diana Carolina Serrato Florez

2025-04-28