Project Description

The Automobile Sales Data dataset, taken from Kaggle, contains information on sales of automotive products. It was probably collected by a commercial company through a sales management (ERP) system that records transactions automatically. The dataset includes 2,722 records and 20 variables.

This notebook presents an analysis of the Automobile Sales Data dataset. The goal is to identify patterns in sales, quantities ordered, and order status, segmented by product line and deal size.

The analysis includes:

- Descriptive analysis of categorical variables (univariate and bivariate).

- Descriptive analysis of quantitative variables (univariate and bivariate).

- Bivariate analysis between categorical and quantitative variables.

- Hypothesis tests.

Libraries and correct data types

This section loads the dataset from an Excel file, converts each column to the appropriate data type (numeric, factor, date), and checks the key variables for missing values to prepare the data for analysis. The required packages are loaded in the same step.

# Load the dataset; the readxl:: prefix is needed because the packages are attached below
data <- readxl::read_excel("Auto_Sales_data_no_outliers1.xlsx")
knitr::opts_chunk$set(echo = TRUE)
# Load the packages
library(readxl)
library(dplyr)

## 
## Adjuntando el paquete: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(reshape2)
library(treemap)
library(moments)
library(sass)  # Add this line to load sass

data <- data %>%
  mutate(
    ORDERNUMBER = as.numeric(ORDERNUMBER),
    QUANTITYORDERED = as.numeric(QUANTITYORDERED),
    PRICEEACH = as.numeric(PRICEEACH),
    ORDERLINENUMBER = as.numeric(ORDERLINENUMBER),
    SALES = as.numeric(SALES),
    ORDERDATE = as.Date(ORDERDATE, format = "%Y-%m-%d"),
    DAYS_SINCE_LASTORDER = as.numeric(DAYS_SINCE_LASTORDER),
    STATUS = as.factor(toupper(trimws(STATUS))),  # Standardize STATUS (trim whitespace, upper case)
    PRODUCTLINE = as.factor(PRODUCTLINE),
    MSRP = as.numeric(MSRP),
    PRODUCTCODE = as.factor(PRODUCTCODE),
    CUSTOMERNAME = as.factor(CUSTOMERNAME),
    PHONE = as.character(PHONE),
    ADDRESSLINE1 = as.character(ADDRESSLINE1),
    CITY = as.factor(CITY),
    POSTALCODE = as.character(POSTALCODE),
    COUNTRY = as.factor(COUNTRY),
    CONTACTLASTNAME = as.character(CONTACTLASTNAME),
    CONTACTFIRSTNAME = as.character(CONTACTFIRSTNAME),
    DEALSIZE = as.factor(DEALSIZE)
  )

# Check for NA values in the columns used in the analysis
vars_to_check <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER", "PRODUCTLINE", "DEALSIZE", "STATUS")
colSums(is.na(data[vars_to_check]))

##                SALES      QUANTITYORDERED DAYS_SINCE_LASTORDER 
##                    3                    3                    3 
##          PRODUCTLINE             DEALSIZE               STATUS 
##                    3                    3                    3
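Each of the six checked columns reports exactly 3 missing values, which suggests (but does not prove) that the same three rows are empty across all of them. A minimal sketch of how that could be confirmed and handled in base R is shown below. It uses a small made-up data frame; `toy`, its values, and the helper variables are illustrative and not taken from the dataset.

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  SALES = c(100, NA, 250, NA),
  QUANTITYORDERED = c(10, NA, 25, 30),
  STATUS = c("SHIPPED", NA, "SHIPPED", "ON HOLD"),
  stringsAsFactors = FALSE
)
vars <- c("SALES", "QUANTITYORDERED", "STATUS")

# Count missing values per row across the checked columns
na_per_row <- rowSums(is.na(toy[vars]))

# Rows in which every checked column is missing
fully_empty <- na_per_row == length(vars)

# Drop only the fully empty rows; partially missing rows are kept for inspection
toy_clean <- toy[!fully_empty, , drop = FALSE]

print(sum(fully_empty))
print(nrow(toy_clean))
```

If the count of fully empty rows in the real data equals 3, the missing values come from blank records rather than from scattered gaps, and removing those rows would not discard any information.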

Descriptive analysis of categorical variables

This section presents a descriptive analysis of the categorical variables PRODUCTLINE, DEALSIZE, and STATUS. First, each variable is analyzed on its own to understand its distribution, using frequencies, proportions, and plots such as bar charts and treemaps. Then a bivariate analysis of DEALSIZE and STATUS explores possible relationships between the two, using cross tables and a stacked bar chart. The aim is to identify patterns that may be useful for logistics and commercial decisions.

Univariate: PRODUCTLINE, DEALSIZE, and STATUS are selected

categorical_vars <- c("PRODUCTLINE", "DEALSIZE", "STATUS")
# For each variable, compute counts and proportions (table() drops NA by default)
categorical_stats <- lapply(data[categorical_vars], function(x) {
  freq_table <- table(x)
  prop_table <- prop.table(freq_table)
  list(Frequencies = freq_table, Proportions = prop_table)
})

print("Frecuencias y proporciones de las variables categóricas (univariado):")

## [1] "Frecuencias y proporciones de las variables categóricas (univariado):"

print(categorical_stats)

## $PRODUCTLINE
## $PRODUCTLINE$Frequencies
## x
##     Classic Cars      Motorcycles           Planes            Ships 
##              933              309              302              230 
##           Trains Trucks and Buses     Vintage Cars 
##               77              295              573 
## 
## $PRODUCTLINE$Proportions
## x
##     Classic Cars      Motorcycles           Planes            Ships 
##       0.34314086       0.11364472       0.11107025       0.08458992 
##           Trains Trucks and Buses     Vintage Cars 
##       0.02831924       0.10849577       0.21073924 
## 
## 
## $DEALSIZE
## $DEALSIZE$Frequencies
## x
##  Large Medium  Small 
##    124   1349   1246 
## 
## $DEALSIZE$Proportions
## x
##     Large    Medium     Small 
## 0.0456050 0.4961383 0.4582567 
## 
## 
## $STATUS
## $STATUS$Frequencies
## x
##  CANCELLED   DISPUTED IN PROCESS    ON HOLD   RESOLVED    SHIPPED 
##         60         12         40         43         47       2517 
## 
## $STATUS$Proportions
## x
##   CANCELLED    DISPUTED  IN PROCESS     ON HOLD    RESOLVED     SHIPPED 
## 0.022066936 0.004413387 0.014711291 0.015814638 0.017285767 0.925707981

# Univariate plots
# Bar chart for PRODUCTLINE
ggplot(data, aes(x = PRODUCTLINE)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Frecuencia de Lineas de Producto", x = "Linea de Producto", y = "Frecuencia") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("bar_productline.png", width = 8, height = 6)

# Bar chart for DEALSIZE
ggplot(data, aes(x = DEALSIZE)) +
  geom_bar(fill = "lightcoral", color = "black") +
  labs(title = "Frecuencia de Tamano del Pedido (DEALSIZE)", x = "Tamano del Pedido", y = "Frecuencia") +
  theme_minimal()

ggsave("bar_dealsize.png", width = 8, height = 6)

# Treemap for STATUS (tile area is proportional to total SALES)
treemap(data, index = "STATUS", vSize = "SALES", title = "Treemap: Ventas por Estado del Pedido (STATUS)")

Bivariate: DEALSIZE vs. STATUS

cross_table <- table(data$DEALSIZE, data$STATUS)
prop_table <- prop.table(cross_table, margin = 1)
print("Tabla cruzada entre DEALSIZE y STATUS:")

## [1] "Tabla cruzada entre DEALSIZE y STATUS:"

print(cross_table)

##         
##          CANCELLED DISPUTED IN PROCESS ON HOLD RESOLVED SHIPPED
##   Large          0        3          2       4        1     114
##   Medium        33        5         18      24       26    1243
##   Small         27        4         20      15       20    1160

print("Proporciones (por fila) entre DEALSIZE y STATUS:")

## [1] "Proporciones (por fila) entre DEALSIZE y STATUS:"

print(prop_table)

##         
##            CANCELLED    DISPUTED  IN PROCESS     ON HOLD    RESOLVED
##   Large  0.000000000 0.024193548 0.016129032 0.032258065 0.008064516
##   Medium 0.024462565 0.003706449 0.013343217 0.017790956 0.019273536
##   Small  0.021669342 0.003210273 0.016051364 0.012038523 0.016051364
##         
##              SHIPPED
##   Large  0.919354839
##   Medium 0.921423277
##   Small  0.930979133
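The row proportions can be reproduced from the printed counts, as a check on how `prop.table(..., margin = 1)` works: each cell is divided by its row total, so every row sums to 1. The sketch below rebuilds the cross table by hand from the counts shown above. The object names `counts` and `row_props` are introduced here only for illustration.

```r
# Counts copied from the printed cross table (rows: DEALSIZE, columns: STATUS)
counts <- matrix(
  c( 0, 3,  2,  4,  1,  114,
    33, 5, 18, 24, 26, 1243,
    27, 4, 20, 15, 20, 1160),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    DEALSIZE = c("Large", "Medium", "Small"),
    STATUS = c("CANCELLED", "DISPUTED", "IN PROCESS",
               "ON HOLD", "RESOLVED", "SHIPPED")
  )
)

# margin = 1 divides every cell by the total of its row
row_props <- prop.table(counts, margin = 1)

print(rowSums(counts))                    # row totals used as denominators
print(round(row_props[, "SHIPPED"], 4))   # share of shipped orders per deal size
```

Read this way, the share of shipped orders is roughly 92 to 93% for all three deal sizes, so the table shows no strong difference in shipping rate by deal size.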

# Stacked bar chart
cross_table_df <- as.data.frame.table(cross_table)
ggplot(cross_table_df, aes(x = Var1, y = Freq, fill = Var2)) +
  geom_bar(stat = "identity") +
  labs(title = "Diagrama de Barras Apiladas: DEALSIZE vs. STATUS", x = "Tamaño del Pedido (DEALSIZE)", y = "Frecuencia", fill = "Estado del Pedido (STATUS)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("stacked_bar_dealsize_status.png", width = 8, height = 6)

Descriptive analysis of quantitative variables

This section covers the descriptive analysis of the quantitative variables `SALES`, `QUANTITYORDERED`, and `DAYS_SINCE_LASTORDER`. The univariate analysis computes descriptive statistics such as the mean, median, standard deviation, skewness, and kurtosis, and visualizes the distributions with histograms and boxplots. The bivariate analysis examines the relationship between `SALES` and `DAYS_SINCE_LASTORDER` with a correlation test and a scatter plot, to determine whether the two variables are associated. Together these results describe the behavior of sales and customer activity.

Univariate: SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER are selected

numeric_vars <- c("SALES", "QUANTITYORDERED", "DAYS_SINCE_LASTORDER")
numeric_stats <- data %>%
  summarise(across(
    all_of(numeric_vars),
    list(
      Mean = ~mean(., na.rm = TRUE),
      SD = ~sd(., na.rm = TRUE),
      Min = ~min(., na.rm = TRUE),
      Q1 = ~quantile(., 0.25, na.rm = TRUE),
      Median = ~median(., na.rm = TRUE),
      Q3 = ~quantile(., 0.75, na.rm = TRUE),
      Max = ~max(., na.rm = TRUE),
      Skewness = ~skewness(., na.rm = TRUE),
      Kurtosis = ~kurtosis(., na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  ))

print("Estadísticas descriptivas de las variables cuantitativas (univariado):")

## [1] "Estadísticas descriptivas de las variables cuantitativas (univariado):"

print(numeric_stats)

## # A tibble: 1 × 27
##   SALES_Mean SALES_SD SALES_Min SALES_Q1 SALES_Median SALES_Q3 SALES_Max
##        <dbl>    <dbl>     <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
## 1      3482.    1704.      482.    2197.        3167.    4437.     9160.
## # ℹ 20 more variables: SALES_Skewness <dbl>, SALES_Kurtosis <dbl>,
## #   QUANTITYORDERED_Mean <dbl>, QUANTITYORDERED_SD <dbl>,
## #   QUANTITYORDERED_Min <dbl>, QUANTITYORDERED_Q1 <dbl>,
## #   QUANTITYORDERED_Median <dbl>, QUANTITYORDERED_Q3 <dbl>,
## #   QUANTITYORDERED_Max <dbl>, QUANTITYORDERED_Skewness <dbl>,
## #   QUANTITYORDERED_Kurtosis <dbl>, DAYS_SINCE_LASTORDER_Mean <dbl>,
## #   DAYS_SINCE_LASTORDER_SD <dbl>, DAYS_SINCE_LASTORDER_Min <dbl>, …

Univariate plots: SALES, QUANTITYORDERED, DAYS_SINCE_LASTORDER

Histogram of SALES

ggplot(data, aes(x = SALES)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribucion de Ventas (SALES)", x = "Ventas (SALES)", y = "Frecuencia") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Boxplot of QUANTITYORDERED

ggplot(data, aes(y = QUANTITYORDERED)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot de Cantidad Ordenada (QUANTITYORDERED)", y = "Cantidad Ordenada") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantityordered.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Histogram of DAYS_SINCE_LASTORDER

ggplot(data, aes(x = DAYS_SINCE_LASTORDER)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Distribución de Días desde el Último Pedido", x = "Días desde el Último Pedido", y = "Frecuencia") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_days_since_lastorder.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Assessment of the distribution shape of SALES, QUANTITYORDERED, and DAYS_SINCE_LASTORDER

SALES: right-skewed (skewness > 0).

QUANTITYORDERED: right-skewed (skewness > 0).

DAYS_SINCE_LASTORDER: approximately symmetric (skewness ~ 0).
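These readings rely on the sign of the skewness coefficient. As a sketch of what is being computed, the moment-based definitions can be written in base R: skewness is m3 / m2^(3/2) and kurtosis is m4 / m2^2, where m_k is the k-th central moment. This is assumed to match the definitions behind `moments::skewness()` and `moments::kurtosis()`; if so, a normal distribution has kurtosis near 3 rather than 0. The helper names and example vectors below are hypothetical.

```r
# k-th central moment (divides by n, not n - 1); NA values are dropped
central_moment <- function(x, k) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^k)
}

# Moment-based skewness and (non-excess) kurtosis
skew_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric  <- c(1, 2, 3, 4, 5)    # mirror image around its mean
right_tail <- c(1, 1, 1, 2, 10)   # one large value stretches the right tail

print(skew_manual(symmetric))
print(skew_manual(right_tail))
print(kurt_manual(symmetric))
```

A positive value means the long tail lies to the right of the mean, which is the pattern reported for SALES and QUANTITYORDERED; a value near zero indicates rough symmetry.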

# Bivariate: SALES vs. DAYS_SINCE_LASTORDER
correlation <- cor.test(data$SALES, data$DAYS_SINCE_LASTORDER)
print("Correlacion entre SALES y DAYS_SINCE_LASTORDER:")

## [1] "Correlacion entre SALES y DAYS_SINCE_LASTORDER:"

print(correlation)

## 
##  Pearson's product-moment correlation
## 
## data:  data$SALES and data$DAYS_SINCE_LASTORDER
## t = -17.98, df = 2717, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3592696 -0.2920724
## sample estimates:
##        cor 
## -0.3260828
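The reported test statistic can be reproduced from the correlation coefficient and the degrees of freedom, since for Pearson's test t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. The sketch below plugs in the values printed above; the variable names are illustrative.

```r
# Values copied from the cor.test() output
r  <- -0.3260828
df <- 2717          # n - 2, so n = 2719 complete pairs

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# Share of variance in one variable linearly explained by the other
r_squared <- r^2

print(round(t_stat, 2))
print(p_value)
print(round(r_squared, 3))
```

An r of about -0.33 corresponds to an r^2 of about 0.11, so the association is statistically significant but accounts for only around 11% of the variance in SALES: a weak-to-moderate negative linear relationship.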

# Scatter plot
ggplot(data, aes(x = DAYS_SINCE_LASTORDER, y = SALES)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relacion entre Ventas (SALES) y Dias desde el Ultimo Pedido", x = "Dias desde el Ultimo Pedido", y = "Ventas (SALES)") +
  theme_minimal()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggsave("scatter_sales_days.png", width = 8, height = 6)

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Bivariate descriptive analysis between categorical and quantitative variables

This section analyzes the relationship between categorical and quantitative variables to identify patterns that combine both types of data. Specifically, it examines how `SALES` varies by product line (`PRODUCTLINE`) and how `QUANTITYORDERED` is distributed by deal size (`DEALSIZE`). Descriptive statistics are computed for each category, and the results are visualized with boxplots and overlaid histograms. This helps show which product lines generate higher sales per order line and how deal size relates to the quantities ordered.

Summary statistics, boxplot, and segmented histogram of SALES vs. PRODUCTLINE

sales_by_productline <- data %>%
  group_by(PRODUCTLINE) %>%
  summarise(
    Mean_SALES = mean(SALES),
    Median_SALES = median(SALES),
    SD_SALES = sd(SALES),
    Skewness_SALES = skewness(SALES),
    Kurtosis_SALES = kurtosis(SALES)
  )
print("Estadígrafos de SALES por PRODUCTLINE:")

## [1] "Estadígrafos de SALES por PRODUCTLINE:"

print(sales_by_productline)

## # A tibble: 8 × 6
##   PRODUCTLINE     Mean_SALES Median_SALES SD_SALES Skewness_SALES Kurtosis_SALES
##   <fct>                <dbl>        <dbl>    <dbl>          <dbl>          <dbl>
## 1 Classic Cars         3940.        3729.    1886.          0.523           2.59
## 2 Motorcycles          3441.        3114.    1686.          0.975           3.70
## 3 Planes               3143.        2836.    1421.          1.11            4.08
## 4 Ships                3044.        2885.    1059.          0.944           4.25
## 5 Trains               2938.        2446.    1457.          1.51            5.76
## 6 Trucks and Bus…      3768.        3451     1674.          0.487           2.58
## 7 Vintage Cars         3038.        2762.    1575.          1.01            3.97
## 8 <NA>                   NA           NA       NA          NA              NA
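The `<NA>` row appears because the records with a missing PRODUCTLINE form their own group, and `mean()` returns NA when a group contains missing SALES. One possible way to exclude those records before summarising is sketched below with dplyr on made-up data (`toy_sales` and its values are hypothetical). Whether dropping them is appropriate depends on why the values are missing.

```r
library(dplyr)

# Made-up data: the last record has neither a product line nor a sales value
toy_sales <- data.frame(
  PRODUCTLINE = factor(c("Ships", "Ships", "Trains", "Trains", NA)),
  SALES = c(3000, 3100, 2400, 2600, NA)
)

summary_by_line <- toy_sales %>%
  # Keep only records with both a product line and a sales value
  filter(!is.na(PRODUCTLINE), !is.na(SALES)) %>%
  group_by(PRODUCTLINE) %>%
  summarise(
    Mean_SALES = mean(SALES),
    Median_SALES = median(SALES),
    .groups = "drop"
  )

print(summary_by_line)
```

The same filter placed ahead of the plotting calls would also avoid the "Removed 3 rows" warnings, at the cost of hiding the fact that incomplete records exist.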

# Boxplot of SALES by PRODUCTLINE
ggplot(data, aes(x = PRODUCTLINE, y = SALES, fill = PRODUCTLINE)) +
  geom_boxplot() +
  labs(title = "Distribucion de Ventas (SALES) por Linea de Producto", x = "Linea de Producto", y = "Ventas (SALES)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_sales_productline.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Segmented histogram of SALES by PRODUCTLINE
ggplot(data, aes(x = SALES, fill = PRODUCTLINE)) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  labs(title = "Distribucion de Ventas (SALES) por Linea de Producto", x = "Ventas (SALES)", y = "Frecuencia") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggsave("histogram_sales_productline.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).

Summary statistics and boxplot of QUANTITYORDERED vs. DEALSIZE

quantity_by_dealsize <- data %>%
  group_by(DEALSIZE) %>%
  summarise(
    Mean_QUANTITY = mean(QUANTITYORDERED),
    Median_QUANTITY = median(QUANTITYORDERED),
    SD_QUANTITY = sd(QUANTITYORDERED),
    Skewness_QUANTITY = skewness(QUANTITYORDERED),
    Kurtosis_QUANTITY = kurtosis(QUANTITYORDERED)
  )
print("Estadígrafos de QUANTITYORDERED por DEALSIZE:")

## [1] "Estadígrafos de QUANTITYORDERED por DEALSIZE:"

print(quantity_by_dealsize)

## # A tibble: 4 × 6
##   DEALSIZE Mean_QUANTITY Median_QUANTITY SD_QUANTITY Skewness_QUANTITY
##   <fct>            <dbl>           <dbl>       <dbl>             <dbl>
## 1 Large             46.0              45        9.88             2.11 
## 2 Medium            38.0              39        8.45            -0.163
## 3 Small             30.5              29        8.50             0.579
## 4 <NA>              NA                NA       NA               NA    
## # ℹ 1 more variable: Kurtosis_QUANTITY <dbl>

# Boxplot of QUANTITYORDERED by DEALSIZE
ggplot(data, aes(x = DEALSIZE, y = QUANTITYORDERED, fill = DEALSIZE)) +
  geom_boxplot() +
  labs(title = "Distribucion de Cantidad Ordenada por Tamano del Pedido", x = "Tamano del Pedido (DEALSIZE)", y = "Cantidad Ordenada (QUANTITYORDERED)") +
  theme_minimal()

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggsave("boxplot_quantity_dealsize.png", width = 8, height = 6)

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Hypothesis tests

This section carries out two hypothesis tests to check assumptions about the data. The first tests whether the mean of sales (`SALES`) equals 3500, a value that could reflect the expected revenue per order. The second tests whether the proportion of orders with status “SHIPPED” in the `STATUS` variable equals 70%, as a way to assess the company's logistics efficiency. The null and alternative hypotheses, the test results, and their interpretations are presented.

Test 1: Mean of SALES

H0: The mean of SALES is equal to 3500

H1: The mean of SALES is not equal to 3500

sales_mean_test <- t.test(data$SALES, mu = 3500, alternative = "two.sided")
print("Prueba de hipótesis para la media de SALES (mu = 3500):")

## [1] "Prueba de hipótesis para la media de SALES (mu = 3500):"

print(sales_mean_test)

## 
##  One Sample t-test
## 
## data:  data$SALES
## t = -0.55307, df = 2718, p-value = 0.5803
## alternative hypothesis: true mean is not equal to 3500
## 95 percent confidence interval:
##  3417.836 3546.011
## sample estimates:
## mean of x 
##  3481.923
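The statistic reported by `t.test()` is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (mean - mu0) / (sd / sqrt(n)), with n - 1 degrees of freedom. A minimal sketch on a made-up vector (the values in `x` are illustrative, not from the dataset) compares the manual calculation with the built-in test.

```r
# Made-up sample with one missing value; mu0 is the hypothesized mean
x   <- c(3200, 3650, 3400, 3550, 3300, 3700, NA)
mu0 <- 3500

# Manual calculation on the non-missing observations
x_obs    <- x[!is.na(x)]
n        <- length(x_obs)
se       <- sd(x_obs) / sqrt(n)                     # standard error of the mean
t_manual <- (mean(x_obs) - mu0) / se
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)      # two-sided p-value

# Built-in test (it also discards the NA before computing)
fit <- t.test(x, mu = mu0, alternative = "two.sided")

print(c(manual = t_manual, t.test = unname(fit$statistic)))
print(c(manual = p_manual, t.test = fit$p.value))
```

In the notebook's output, t = -0.55307 with a p-value of 0.5803. At the conventional 5% level (an assumed threshold; the notebook does not state one) H0 is not rejected: the data are consistent with a mean of 3500, and the 95% confidence interval (3417.836, 3546.011) contains that value.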

Test 2: Proportion of “SHIPPED” in STATUS

H0: The proportion of “SHIPPED” is equal to 0.7

H1: The proportion of “SHIPPED” is not equal to 0.7

Check the levels and distribution of STATUS

# Check the levels of STATUS
print("Niveles de STATUS:")

## [1] "Niveles de STATUS:"

print(levels(data$STATUS))

## [1] "CANCELLED"  "DISPUTED"   "IN PROCESS" "ON HOLD"    "RESOLVED"  
## [6] "SHIPPED"

# Check the distribution of STATUS
print("Distribución de STATUS:")

## [1] "Distribución de STATUS:"

print(table(data$STATUS))

## 
##  CANCELLED   DISPUTED IN PROCESS    ON HOLD   RESOLVED    SHIPPED 
##         60         12         40         43         47       2517

Compute shipped_count and total_count

shipped_count <- sum(data$STATUS == "SHIPPED", na.rm = TRUE)
total_count <- length(data$STATUS[!is.na(data$STATUS)])

# Print the counts as a check
print(paste("Número de registros con STATUS == 'SHIPPED':", shipped_count))

## [1] "Número de registros con STATUS == 'SHIPPED': 2517"

print(paste("Número total de registros en STATUS:", total_count))

## [1] "Número total de registros en STATUS: 2719"

# Check whether there is enough data for the test
# (&& is used because an if() condition must be a single TRUE/FALSE value)
if (shipped_count >= 5 && (total_count - shipped_count) >= 5 && total_count > 0) {
  prop_test <- prop.test(shipped_count, total_count, p = 0.7, alternative = "two.sided")
  print("Prueba de hipótesis para la proporción de 'SHIPPED' en STATUS (p = 0.7):")
  print(prop_test)
} else {
  print("No hay suficientes datos para realizar la prueba de proporciones. Se necesitan al menos 5 éxitos y 5 fracasos.")
}

## [1] "Prueba de hipótesis para la proporción de 'SHIPPED' en STATUS (p = 0.7):"
## 
##  1-sample proportions test with continuity correction
## 
## data:  shipped_count out of total_count, null probability 0.7
## X-squared = 658.53, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.7
## 95 percent confidence interval:
##  0.9150441 0.9351483
## sample estimates:
##        p 
## 0.925708
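The X-squared value can be reproduced by hand. For a single proportion with continuity correction, the statistic is (|x - n * p0| - 0.5)^2 / (n * p0 * (1 - p0)), compared against a chi-squared distribution with 1 degree of freedom. The sketch below uses the counts printed above and compares the result with `prop.test()`; the variable names are illustrative.

```r
# Counts taken from the notebook output
x  <- 2517   # orders with STATUS == "SHIPPED"
n  <- 2719   # orders with a non-missing STATUS
p0 <- 0.7    # proportion under H0

# Chi-squared statistic with continuity correction
expected  <- n * p0
x_squared <- (abs(x - expected) - 0.5)^2 / (n * p0 * (1 - p0))

# Upper-tail p-value with 1 degree of freedom
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

# Built-in test for comparison
fit <- prop.test(x, n, p = p0, alternative = "two.sided")

print(round(x_squared, 2))
print(round(unname(fit$statistic), 2))
print(x / n)
```

With a p-value below 2.2e-16, H0 is rejected at any usual significance level. The observed share of shipped orders is about 92.6%, with a 95% confidence interval of roughly 91.5% to 93.5%, well above the hypothesized 70%.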

ANALYSIS OF AUTOMOBILE SALES DATA TO OPTIMIZE AUTOMOTIVE SALES AND LOGISTICS

Diana Serrato, Diego Peñarete

2025-04-14