Analisis y Predicción de Ventas por Cliente Regresión

Author

Jaime Llanos Bardales

Minería de Datos para Analizar y Predecir Ventas por Cliente usando Northwind

Objetivo

Extraer datos de las ventas de los clientes de la base de datos Northwind en SQL Server, analizar y predecir usando Regresión.

Paquetes necesarios

Instalacion de paquetes:

install.packages(c(“DBI”, “odbc”, “tidyverse”, “caret”, “ggplot2”, “corrplot”))

"Instalando Paquetes"

[1] "Instalando Paquetes"

#install.packages(c("DBI", "odbc", "tidyverse", "caret", "ggplot2", "corrplot"))

Cargar paquetes

library(DBI)
library(odbc)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Conexion a base de datos

Conexion a SQL Server:

Realizar la consulta

consulta <-"
SELECT
C.CustomerID,
C.CompanyName,
C.Country,
SUM(OD.Quantity*OD.UnitPrice) AS TotalVentas,
COUNT(DISTINCT O.OrderID) AS NumPedidos
FROM Customers C
JOIN ORDERS O ON C.CustomerID = O.CustomerID
JOIN [Order Details] OD ON OD.OrderID=O.OrderID
GROUP BY C.CustomerID, C.CompanyName, C.Country
"

Crear variable capturar los datos de la conexion

ventas_clientes <-dbGetQuery(conexion, consulta)

Cerrar Conexion

dbDisconnect(conexion)

Visualizar la estructura de los datos

glimpse(ventas_clientes)

Rows: 89
Columns: 5
$ CustomerID  <chr> "BLONP", "OLDWO", "FRANS", "MAGAA", "SIMOB", "FAMIA", "ERN…
$ CompanyName <chr> "Blondesddsl père et fils", "Old World Delicatessen", "Fra…
$ Country     <chr> "France", "USA", "Italy", "Italy", "Denmark", "Brazil", "A…
$ TotalVentas <dbl> 19088.00, 16325.15, 1545.70, 7603.85, 18138.45, 4438.90, 1…
$ NumPedidos  <int> 11, 10, 6, 10, 7, 7, 30, 10, 9, 4, 6, 2, 2, 19, 8, 15, 18,…

Análisis Exploratorio

Resumen General

summary(ventas_clientes)

  CustomerID        CompanyName          Country           TotalVentas      
 Length:89          Length:89          Length:89          Min.   :   100.8  
 Class :character   Class :character   Class :character   1st Qu.:  3361.0  
 Mode  :character   Mode  :character   Mode  :character   Median :  7555.6  
                                                          Mean   : 15218.6  
                                                          3rd Qu.: 18138.5  
                                                          Max.   :117483.4  
   NumPedidos    
 Min.   : 1.000  
 1st Qu.: 5.000  
 Median : 8.000  
 Mean   : 9.326  
 3rd Qu.:12.000  
 Max.   :31.000

Filtrar el top de 10 clientes con mas ventas

ventas_clientes %>%
  arrange(desc(TotalVentas)) %>%
  head(10)

   CustomerID                  CompanyName Country TotalVentas NumPedidos
1       QUICK                   QUICK-Stop Germany   117483.39         28
2       SAVEA           Save-a-lot Markets     USA   115673.39         31
3       ERNSH                 Ernst Handel Austria   113236.68         30
4       HUNGO Hungry Owl All-Night Grocers Ireland    57317.39         19
5       RATTC   Rattlesnake Canyon Grocery     USA    52245.90         18
6       HANAR                Hanari Carnes  Brazil    34101.15         14
7       FOLKO               Folk och fä HB  Sweden    32555.55         19
8       MEREP               Mère Paillarde  Canada    32203.90         13
9       KOENE              Königlich Essen Germany    31745.75         14
10      QUEEN                Queen Cozinha  Brazil    30226.10         13

Visualizacion datos

Ventas por País

library(ggplot2)
ggplot(
  ventas_clientes, # Data Northwind
  aes(x=reorder(Country, TotalVentas),
        y=TotalVentas)) +
    geom_bar(stat = "identity", fill = "steelblue")+
    coord_flip() +
    labs(title = "Ventas Totales por País", x="País"     , y="Ventas"
)

Preparación del Modelo

Convertir a factor

ventas_clientes$Country <-as.factor(ventas_clientes$Country)

Dividir los datos en 2 grupos (Train y Test)

library(caret)

Cargando paquete requerido: lattice


Adjuntando el paquete: 'caret'

The following object is masked from 'package:purrr':

    lift

set.seed(123)
trainIndex<-createDataPartition(ventas_clientes$TotalVentas, p=0.7,list = FALSE)
trainData<-ventas_clientes[trainIndex,]
testData<-ventas_clientes[-trainIndex,]

Modelo Predictivo: Regresión Lineal

modelo_reg <-train(
  TotalVentas ~ NumPedidos + Country,
  data = trainData,
  method = "lm"
)

Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases

Resumen del modelo

summary(modelo_reg)


Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-16869  -4276      0   3543  45591 

Coefficients: (2 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -11150.6    10022.8  -1.113   0.2718    
NumPedidos           2922.2      279.6  10.452 1.28e-13 ***
CountryAustria      36722.6    15680.0   2.342   0.0237 *  
CountryBelgium       1126.1    14047.0   0.080   0.9365    
CountryBrazil       -2843.5    10929.5  -0.260   0.7959    
CountryCanada         373.7    11545.1   0.032   0.9743    
CountryDenmark      -4349.3    14135.7  -0.308   0.7597    
CountryFinland     -11103.9    12270.6  -0.905   0.3703    
CountryFrance       -2618.2    10557.5  -0.248   0.8053    
CountryGermany       1222.2    10571.5   0.116   0.9085    
CountryIreland      12947.0    14571.4   0.889   0.3790    
CountryItaly       -10554.5    11524.1  -0.916   0.3646    
CountryMexico        -714.7    11099.8  -0.064   0.9489    
CountryNorway            NA         NA      NA       NA    
CountryPoland            NA         NA      NA       NA    
CountryPortugal     -1609.1    12162.6  -0.132   0.8953    
CountrySpain         -793.8    11098.3  -0.072   0.9433    
CountrySweden      -14480.1    14498.8  -0.999   0.3233    
CountrySwitzerland   1310.9    12206.7   0.107   0.9150    
CountryUK            -509.4    11099.8  -0.046   0.9636    
CountryUSA           1560.5    10486.1   0.149   0.8824    
CountryVenezuela    -7658.7    11596.2  -0.660   0.5123    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9925 on 45 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.7877 
F-statistic:  13.5 on 19 and 45 DF,  p-value: 8.981e-13

Evaluación del Modelo

predicciones <-predict(modelo_reg, newdata = testData)

Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases

Calcular RMSE y R2

postResample(predicciones, testData$TotalVentas)

        RMSE     Rsquared          MAE 
1.065200e+04 7.940061e-01 6.603592e+03

Comparación de Datos Reales vs Predichos

resultados<-data.frame(
          Real = testData$TotalVentas,
          Predichos = predicciones
          )

Grafica de Reales vs Predichos

ggplot(
  resultados,
  aes(x=Real, y= Predichos))+
  geom_point(color = "red")+
  geom_abline(linetype = "dashed", color = "blue")+
  labs(title = "Ventas Reales vs Ventas Predichas",
       x="Ventas Reales", y = "Ventas Predichas"
       )

Interpretación

El modelo muestra como el numero de pedidos y el pais incluyen en las ventas
Este predicción puede servir para estimar ventas futuras por cliente y por segmento de mercado

Sys.Date()

[1] "2025-08-12"