"Instalando Paquetes"
[1] "Instalando Paquetes"
#install.packages(c("DBI", "odbc", "tidyverse", "caret", "ggplot2", "corrplot"))
Objetivo
Extraer datos de las ventas de los clientes de la base de datos Northwind en SQL Server, analizar y predecir usando Regresión.
Instalacion de paquetes:
install.packages(c(“DBI”, “odbc”, “tidyverse”, “caret”, “ggplot2”, “corrplot”))
"Instalando Paquetes"
[1] "Instalando Paquetes"
#install.packages(c("DBI", "odbc", "tidyverse", "caret", "ggplot2", "corrplot"))
library(DBI)
library(odbc)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Conexion a SQL Server:
Realizar la consulta
<-"
consulta SELECT
C.CustomerID,
C.CompanyName,
C.Country,
SUM(OD.Quantity*OD.UnitPrice) AS TotalVentas,
COUNT(DISTINCT O.OrderID) AS NumPedidos
FROM Customers C
JOIN ORDERS O ON C.CustomerID = O.CustomerID
JOIN [Order Details] OD ON OD.OrderID=O.OrderID
GROUP BY C.CustomerID, C.CompanyName, C.Country
"
Crear variable capturar los datos de la conexion
<-dbGetQuery(conexion, consulta) ventas_clientes
Cerrar Conexion
dbDisconnect(conexion)
glimpse(ventas_clientes)
Rows: 89
Columns: 5
$ CustomerID <chr> "BLONP", "OLDWO", "FRANS", "MAGAA", "SIMOB", "FAMIA", "ERN…
$ CompanyName <chr> "Blondesddsl père et fils", "Old World Delicatessen", "Fra…
$ Country <chr> "France", "USA", "Italy", "Italy", "Denmark", "Brazil", "A…
$ TotalVentas <dbl> 19088.00, 16325.15, 1545.70, 7603.85, 18138.45, 4438.90, 1…
$ NumPedidos <int> 11, 10, 6, 10, 7, 7, 30, 10, 9, 4, 6, 2, 2, 19, 8, 15, 18,…
Resumen General
summary(ventas_clientes)
CustomerID CompanyName Country TotalVentas
Length:89 Length:89 Length:89 Min. : 100.8
Class :character Class :character Class :character 1st Qu.: 3361.0
Mode :character Mode :character Mode :character Median : 7555.6
Mean : 15218.6
3rd Qu.: 18138.5
Max. :117483.4
NumPedidos
Min. : 1.000
1st Qu.: 5.000
Median : 8.000
Mean : 9.326
3rd Qu.:12.000
Max. :31.000
Filtrar el top de 10 clientes con mas ventas
%>%
ventas_clientes arrange(desc(TotalVentas)) %>%
head(10)
CustomerID CompanyName Country TotalVentas NumPedidos
1 QUICK QUICK-Stop Germany 117483.39 28
2 SAVEA Save-a-lot Markets USA 115673.39 31
3 ERNSH Ernst Handel Austria 113236.68 30
4 HUNGO Hungry Owl All-Night Grocers Ireland 57317.39 19
5 RATTC Rattlesnake Canyon Grocery USA 52245.90 18
6 HANAR Hanari Carnes Brazil 34101.15 14
7 FOLKO Folk och fä HB Sweden 32555.55 19
8 MEREP Mère Paillarde Canada 32203.90 13
9 KOENE Königlich Essen Germany 31745.75 14
10 QUEEN Queen Cozinha Brazil 30226.10 13
Ventas por País
library(ggplot2)
ggplot(
# Data Northwind
ventas_clientes, aes(x=reorder(Country, TotalVentas),
y=TotalVentas)) +
geom_bar(stat = "identity", fill = "steelblue")+
coord_flip() +
labs(title = "Ventas Totales por País", x="País" , y="Ventas"
)
Convertir a factor
$Country <-as.factor(ventas_clientes$Country) ventas_clientes
Dividir los datos en 2 grupos (Train y Test)
library(caret)
Cargando paquete requerido: lattice
Adjuntando el paquete: 'caret'
The following object is masked from 'package:purrr':
lift
set.seed(123)
<-createDataPartition(ventas_clientes$TotalVentas, p=0.7,list = FALSE)
trainIndex<-ventas_clientes[trainIndex,]
trainData<-ventas_clientes[-trainIndex,] testData
<-train(
modelo_reg ~ NumPedidos + Country,
TotalVentas data = trainData,
method = "lm"
)
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Resumen del modelo
summary(modelo_reg)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-16869 -4276 0 3543 45591
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11150.6 10022.8 -1.113 0.2718
NumPedidos 2922.2 279.6 10.452 1.28e-13 ***
CountryAustria 36722.6 15680.0 2.342 0.0237 *
CountryBelgium 1126.1 14047.0 0.080 0.9365
CountryBrazil -2843.5 10929.5 -0.260 0.7959
CountryCanada 373.7 11545.1 0.032 0.9743
CountryDenmark -4349.3 14135.7 -0.308 0.7597
CountryFinland -11103.9 12270.6 -0.905 0.3703
CountryFrance -2618.2 10557.5 -0.248 0.8053
CountryGermany 1222.2 10571.5 0.116 0.9085
CountryIreland 12947.0 14571.4 0.889 0.3790
CountryItaly -10554.5 11524.1 -0.916 0.3646
CountryMexico -714.7 11099.8 -0.064 0.9489
CountryNorway NA NA NA NA
CountryPoland NA NA NA NA
CountryPortugal -1609.1 12162.6 -0.132 0.8953
CountrySpain -793.8 11098.3 -0.072 0.9433
CountrySweden -14480.1 14498.8 -0.999 0.3233
CountrySwitzerland 1310.9 12206.7 0.107 0.9150
CountryUK -509.4 11099.8 -0.046 0.9636
CountryUSA 1560.5 10486.1 0.149 0.8824
CountryVenezuela -7658.7 11596.2 -0.660 0.5123
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9925 on 45 degrees of freedom
Multiple R-squared: 0.8508, Adjusted R-squared: 0.7877
F-statistic: 13.5 on 19 and 45 DF, p-value: 8.981e-13
<-predict(modelo_reg, newdata = testData) predicciones
Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Calcular RMSE y R2
postResample(predicciones, testData$TotalVentas)
RMSE Rsquared MAE
1.065200e+04 7.940061e-01 6.603592e+03
<-data.frame(
resultadosReal = testData$TotalVentas,
Predichos = predicciones
)
Grafica de Reales vs Predichos
ggplot(
resultados,aes(x=Real, y= Predichos))+
geom_point(color = "red")+
geom_abline(linetype = "dashed", color = "blue")+
labs(title = "Ventas Reales vs Ventas Predichas",
x="Ventas Reales", y = "Ventas Predichas"
)
El modelo muestra como el numero de pedidos y el pais incluyen en las ventas
Este predicción puede servir para estimar ventas futuras por cliente y por segmento de mercado
Sys.Date()
[1] "2025-08-12"
Derechos Reservados © Inteligencia de Negocios B1