Walmart

Importar base de datos

#file.choose()

stores <- read.csv("/Users/joseramonvazquezguzman/Documents/stores (1).csv")
features <- read.csv("/Users/joseramonvazquezguzman/Documents/features (1).csv")
train <- read.csv("/Users/joseramonvazquezguzman/Documents/train (1).csv")
test <- read.csv("/Users/joseramonvazquezguzman/Documents/test (1).csv")

Instalar paquetes y llamar librerias

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Entender la base de datos

summary(stores)
##      Store        Type                Size       
##  Min.   : 1   Length:45          Min.   : 34875  
##  1st Qu.:12   Class :character   1st Qu.: 70713  
##  Median :23   Mode  :character   Median :126512  
##  Mean   :23                      Mean   :130288  
##  3rd Qu.:34                      3rd Qu.:202307  
##  Max.   :45                      Max.   :219622
count(stores, Type,  sort = TRUE)
##   Type  n
## 1    A 22
## 2    B 17
## 3    C  6
str(stores)
## 'data.frame':    45 obs. of  3 variables:
##  $ Store: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Type : chr  "A" "A" "B" "A" ...
##  $ Size : int  151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...
summary(features)
##      Store        Date            Temperature       Fuel_Price   
##  Min.   : 1   Length:8190        Min.   : -7.29   Min.   :2.472  
##  1st Qu.:12   Class :character   1st Qu.: 45.90   1st Qu.:3.041  
##  Median :23   Mode  :character   Median : 60.71   Median :3.513  
##  Mean   :23                      Mean   : 59.36   Mean   :3.406  
##  3rd Qu.:34                      3rd Qu.: 73.88   3rd Qu.:3.743  
##  Max.   :45                      Max.   :101.95   Max.   :4.468  
##                                                                  
##    MarkDown1        MarkDown2           MarkDown3           MarkDown4       
##  Min.   : -2781   Min.   :  -265.76   Min.   :  -179.26   Min.   :    0.22  
##  1st Qu.:  1578   1st Qu.:    68.88   1st Qu.:     6.60   1st Qu.:  304.69  
##  Median :  4744   Median :   364.57   Median :    36.26   Median : 1176.42  
##  Mean   :  7032   Mean   :  3384.18   Mean   :  1760.10   Mean   : 3292.94  
##  3rd Qu.:  8923   3rd Qu.:  2153.35   3rd Qu.:   163.15   3rd Qu.: 3310.01  
##  Max.   :103185   Max.   :104519.54   Max.   :149483.31   Max.   :67474.85  
##  NA's   :4158     NA's   :5269        NA's   :4577        NA's   :4726      
##    MarkDown5             CPI         Unemployment    IsHoliday      
##  Min.   :  -185.2   Min.   :126.1   Min.   : 3.684   Mode :logical  
##  1st Qu.:  1440.8   1st Qu.:132.4   1st Qu.: 6.634   FALSE:7605     
##  Median :  2727.1   Median :182.8   Median : 7.806   TRUE :585      
##  Mean   :  4132.2   Mean   :172.5   Mean   : 7.827                  
##  3rd Qu.:  4832.6   3rd Qu.:213.9   3rd Qu.: 8.567                  
##  Max.   :771448.1   Max.   :229.0   Max.   :14.313                  
##  NA's   :4140       NA's   :585     NA's   :585
str(features)
## 'data.frame':    8190 obs. of  12 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05/02/10" "12/02/10" "19/02/10" "26/02/10" ...
##  $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI         : num  211 211 211 211 211 ...
##  $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
summary(test)
##      Store            Dept           Date           IsHoliday      
##  Min.   : 1.00   Min.   : 1.00   Length:115064      Mode :logical  
##  1st Qu.:11.00   1st Qu.:18.00   Class :character   FALSE:106136   
##  Median :22.00   Median :37.00   Mode  :character   TRUE :8928     
##  Mean   :22.24   Mean   :44.34                                     
##  3rd Qu.:33.00   3rd Qu.:74.00                                     
##  Max.   :45.00   Max.   :99.00
str(test)
## 'data.frame':    115064 obs. of  4 variables:
##  $ Store    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date     : chr  "02/11/12" "09/11/12" "16/11/12" "23/11/12" ...
##  $ IsHoliday: logi  FALSE FALSE FALSE TRUE FALSE FALSE ...

Observaciones

#1. "features", "test" y "train" tiene la fecha como caracter
#2. "features", hay NA's en más de la mitad de los registros de Markdown (del 1 al 5).
#3. En "features", hay 585 NA's en CPI y Unemployment, y hay 585 registros de IsHoliday = TRUE ¿Tiene
#4. "train" tiene ventas semanales negativas 

#Herramienta "El generador de valor de Datos 
#Paso 1. Definir el área del negocio que buscamos impactaro mejorar su KPI.
#El departamento de mercadotecnia de EUA (con muestra de 45 tiendas) en el indicador de ventas semanales 
#Paso 2. Seleccionar plantilla (-s) para crear valor a partir de los datos de los clientes 
#Visión ** / segmentación / Personalización / Contextualización 
#Paso 3. Generar ideas o conceptos espesificos 
#Elaborar un modelo predictivo de ventas semanales 
#Paso 4.  Reunir los datos requeridos 
#Elaborar una base de datos con la variable dependiente (ventas semanales) y las variables independietes
#Mercadotecnia elaborará plan desplegar modelo predictivo en fases: 
#Fase 1. Piloto(San Antonio xTX). 
#Fase 2. Texas
#Fase 3. EUA
#Sistemas asegurará la captura del Markdown en las bases de datos

#Creación  de la base de datos consolidada

Asegurar “STORES” a “TRAIN”

bd <- merge(train, stores, by= "Store")

Agregar “FEATURES a”BD”

bd1 <- bd 
bd1 <- merge(bd1, features)

Eliminar columnas

bd2 <- bd1
str(bd2)
## 'data.frame':    421570 obs. of  16 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "01/04/11" "01/04/11" "01/04/11" "01/04/11" ...
##  $ IsHoliday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Dept        : int  49 26 81 34 59 30 7 85 8 28 ...
##  $ Weekly_Sales: num  13168 5947 28545 9950 317 ...
##  $ Type        : chr  "A" "A" "A" "A" ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Temperature : num  59.2 59.2 59.2 59.2 59.2 ...
##  $ Fuel_Price  : num  3.52 3.52 3.52 3.52 3.52 ...
##  $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI         : num  215 215 215 215 215 ...
##  $ Unemployment: num  7.68 7.68 7.68 7.68 7.68 ...
    bd2 <- subset(bd2, select = -c (MarkDown1,MarkDown2,MarkDown3,MarkDown4, MarkDown5))
summary(bd2)
##      Store          Date           IsHoliday            Dept      
##  Min.   : 1.0   Length:421570      Mode :logical   Min.   : 1.00  
##  1st Qu.:11.0   Class :character   FALSE:391909    1st Qu.:18.00  
##  Median :22.0   Mode  :character   TRUE :29661     Median :37.00  
##  Mean   :22.2                                      Mean   :44.26  
##  3rd Qu.:33.0                                      3rd Qu.:74.00  
##  Max.   :45.0                                      Max.   :99.00  
##   Weekly_Sales        Type                Size         Temperature    
##  Min.   : -4989   Length:421570      Min.   : 34875   Min.   : -2.06  
##  1st Qu.:  2080   Class :character   1st Qu.: 93638   1st Qu.: 46.68  
##  Median :  7612   Mode  :character   Median :140167   Median : 62.09  
##  Mean   : 15981                      Mean   :136728   Mean   : 60.09  
##  3rd Qu.: 20206                      3rd Qu.:202505   3rd Qu.: 74.28  
##  Max.   :693099                      Max.   :219622   Max.   :100.14  
##    Fuel_Price         CPI         Unemployment   
##  Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.:2.933   1st Qu.:132.0   1st Qu.: 6.891  
##  Median :3.452   Median :182.3   Median : 7.866  
##  Mean   :3.361   Mean   :171.2   Mean   : 7.960  
##  3rd Qu.:3.738   3rd Qu.:212.4   3rd Qu.: 8.572  
##  Max.   :4.468   Max.   :227.2   Max.   :14.313

Cambiar formato de fecha

bd2$Date <- as.Date(bd2$Date,format= "%d/%m/%Y")
str(bd2)
## 'data.frame':    421570 obs. of  11 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Date, format: "0011-04-01" "0011-04-01" ...
##  $ IsHoliday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Dept        : int  49 26 81 34 59 30 7 85 8 28 ...
##  $ Weekly_Sales: num  13168 5947 28545 9950 317 ...
##  $ Type        : chr  "A" "A" "A" "A" ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Temperature : num  59.2 59.2 59.2 59.2 59.2 ...
##  $ Fuel_Price  : num  3.52 3.52 3.52 3.52 3.52 ...
##  $ CPI         : num  215 215 215 215 215 ...
##  $ Unemployment: num  7.68 7.68 7.68 7.68 7.68 ...

Contar ceros y negativos

#install.packages("wordspace")
library(wordspace)
## Loading required package: Matrix
signcount(bd2$Weekly_Sales)
##    pos   zero    neg 
## 420212     73   1285

Eliminar ventas menores a cero

bd3 <- bd2
bd3 <- bd3[bd3$Weekly_Sales > 0,]
summary (bd3)
##      Store           Date            IsHoliday            Dept      
##  Min.   : 1.0   Min.   :0010-02-05   Mode :logical   Min.   : 1.00  
##  1st Qu.:11.0   1st Qu.:0010-10-08   FALSE:390652    1st Qu.:18.00  
##  Median :22.0   Median :0011-06-17   TRUE :29560     Median :37.00  
##  Mean   :22.2   Mean   :0011-06-18                   Mean   :44.24  
##  3rd Qu.:33.0   3rd Qu.:0012-02-24                   3rd Qu.:74.00  
##  Max.   :45.0   Max.   :0012-10-26                   Max.   :99.00  
##   Weekly_Sales        Type                Size         Temperature    
##  Min.   :     0   Length:420212      Min.   : 34875   Min.   : -2.06  
##  1st Qu.:  2120   Class :character   1st Qu.: 93638   1st Qu.: 46.68  
##  Median :  7662   Mode  :character   Median :140167   Median : 62.09  
##  Mean   : 16033                      Mean   :136750   Mean   : 60.09  
##  3rd Qu.: 20271                      3rd Qu.:202505   3rd Qu.: 74.28  
##  Max.   :693099                      Max.   :219622   Max.   :100.14  
##    Fuel_Price         CPI         Unemployment   
##  Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.:2.933   1st Qu.:132.0   1st Qu.: 6.891  
##  Median :3.452   Median :182.4   Median : 7.866  
##  Mean   :3.361   Mean   :171.2   Mean   : 7.960  
##  3rd Qu.:3.738   3rd Qu.:212.4   3rd Qu.: 8.567  
##  Max.   :4.468   Max.   :227.2   Max.   :14.313

Agregar número de la semana

bd4<- bd3
bd4$week_number <- strftime(bd4$Date, format = "%V")
str(bd4)
## 'data.frame':    420212 obs. of  12 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Date, format: "0011-04-01" "0011-04-01" ...
##  $ IsHoliday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Dept        : int  49 26 81 34 59 30 7 85 8 28 ...
##  $ Weekly_Sales: num  13168 5947 28545 9950 317 ...
##  $ Type        : chr  "A" "A" "A" "A" ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Temperature : num  59.2 59.2 59.2 59.2 59.2 ...
##  $ Fuel_Price  : num  3.52 3.52 3.52 3.52 3.52 ...
##  $ CPI         : num  215 215 215 215 215 ...
##  $ Unemployment: num  7.68 7.68 7.68 7.68 7.68 ...
##  $ week_number : chr  "13" "13" "13" "13" ...
bd4$week_number <- as.integer(bd4$week_number)
str(bd4)
## 'data.frame':    420212 obs. of  12 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Date, format: "0011-04-01" "0011-04-01" ...
##  $ IsHoliday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Dept        : int  49 26 81 34 59 30 7 85 8 28 ...
##  $ Weekly_Sales: num  13168 5947 28545 9950 317 ...
##  $ Type        : chr  "A" "A" "A" "A" ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Temperature : num  59.2 59.2 59.2 59.2 59.2 ...
##  $ Fuel_Price  : num  3.52 3.52 3.52 3.52 3.52 ...
##  $ CPI         : num  215 215 215 215 215 ...
##  $ Unemployment: num  7.68 7.68 7.68 7.68 7.68 ...
##  $ week_number : int  13 13 13 13 13 13 13 13 13 13 ...
summary(bd4)
##      Store           Date            IsHoliday            Dept      
##  Min.   : 1.0   Min.   :0010-02-05   Mode :logical   Min.   : 1.00  
##  1st Qu.:11.0   1st Qu.:0010-10-08   FALSE:390652    1st Qu.:18.00  
##  Median :22.0   Median :0011-06-17   TRUE :29560     Median :37.00  
##  Mean   :22.2   Mean   :0011-06-18                   Mean   :44.24  
##  3rd Qu.:33.0   3rd Qu.:0012-02-24                   3rd Qu.:74.00  
##  Max.   :45.0   Max.   :0012-10-26                   Max.   :99.00  
##   Weekly_Sales        Type                Size         Temperature    
##  Min.   :     0   Length:420212      Min.   : 34875   Min.   : -2.06  
##  1st Qu.:  2120   Class :character   1st Qu.: 93638   1st Qu.: 46.68  
##  Median :  7662   Mode  :character   Median :140167   Median : 62.09  
##  Mean   : 16033                      Mean   :136750   Mean   : 60.09  
##  3rd Qu.: 20271                      3rd Qu.:202505   3rd Qu.: 74.28  
##  Max.   :693099                      Max.   :219622   Max.   :100.14  
##    Fuel_Price         CPI         Unemployment     week_number   
##  Min.   :2.472   Min.   :126.1   Min.   : 3.879   Min.   : 1.00  
##  1st Qu.:2.933   1st Qu.:132.0   1st Qu.: 6.891   1st Qu.:14.00  
##  Median :3.452   Median :182.4   Median : 7.866   Median :26.00  
##  Mean   :3.361   Mean   :171.2   Mean   : 7.960   Mean   :25.83  
##  3rd Qu.:3.738   3rd Qu.:212.4   3rd Qu.: 8.567   3rd Qu.:38.00  
##  Max.   :4.468   Max.   :227.2   Max.   :14.313   Max.   :52.00

Separar Año, Mes y Día

bd5 <- bd4
bd5 <- bd5 %>%
  dplyr::mutate(year= lubridate::year(Date), month = lubridate::month(Date), day = lubridate:: day(Date))

Generar regresión lineal

regresion <- lm(Weekly_Sales ~ Store+Dept+IsHoliday+Type+Size+Fuel_Price+CPI+Unemployment+week_number+year+month+day, data = bd5)
summary(regresion)
## 
## Call:
## lm(formula = Weekly_Sales ~ Store + Dept + IsHoliday + Type + 
##     Size + Fuel_Price + CPI + Unemployment + week_number + year + 
##     month + day, data = bd5)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34197 -12897  -5854   5629 671564 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.156e+04  3.149e+03   3.672 0.000240 ***
## Store         -1.434e+02  3.059e+00 -46.869  < 2e-16 ***
## Dept           1.108e+02  1.097e+00 101.013  < 2e-16 ***
## IsHolidayTRUE  8.239e+02  1.382e+02   5.961 2.51e-09 ***
## TypeB         -3.346e+02  1.071e+02  -3.126 0.001775 ** 
## TypeC          5.851e+03  1.838e+02  31.827  < 2e-16 ***
## Size           9.909e-02  9.561e-04 103.637  < 2e-16 ***
## Fuel_Price     4.829e+02  1.480e+02   3.263 0.001101 ** 
## CPI           -2.306e+01  9.808e-01 -23.516  < 2e-16 ***
## Unemployment  -2.467e+02  2.021e+01 -12.207  < 2e-16 ***
## week_number    5.533e+02  4.395e+02   1.259 0.208131    
## year          -4.946e+02  1.452e+02  -3.407 0.000656 ***
## month         -2.229e+03  1.913e+03  -1.165 0.243950    
## day           -9.725e+01  6.292e+01  -1.545 0.122239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21690 on 420198 degrees of freedom
## Multiple R-squared:  0.08981,    Adjusted R-squared:  0.08978 
## F-statistic:  3189 on 13 and 420198 DF,  p-value: < 2.2e-16

Construir un modelo de predicción

datos_nuevos <- data.frame(Store=1, IsHoliday= TRUE, Dept=1, Type="A", Size= 151315, week_number =1, Temperature = 59.17, Fuel_Price = 3.524, CPI= 214.8372, Unemployment = 7.682, year =2012, month = 1, day=1)
predict(regresion,datos_nuevos)
##         1 
## -974781.2

Conclusión

En esta actividad se realizó un modelo predictivo en cuanto a las ventas de wallmart, primero se unieron las bases de datos, todo esto para enfocarnos en los distintos departamentos que tiene la tienda y poder interpretar sus ventas y la manera en las que estas se comportaban, siendo incluso en algunas ocasiones de manera negativa, por lo que en estos casos se tuvieron que eliminar esos valores menores a cero.

Se realizo de igual manera una regresión lineal con el fin de conseguir la relación entre estas variables, en este trabajo tomamos Texas para hacer las pruenbas de los modelos y obtener dichos resultados, aplicando y descargando distitnas librerias que nos llevaron a obtener los pronósitcos necesarios partiendo de las cuatro distintas bases de datos.