Walmart
#file.choose()
stores <- read.csv("/Users/joseramonvazquezguzman/Documents/stores (1).csv")
features <- read.csv("/Users/joseramonvazquezguzman/Documents/features (1).csv")
train <- read.csv("/Users/joseramonvazquezguzman/Documents/train (1).csv")
test <- read.csv("/Users/joseramonvazquezguzman/Documents/test (1).csv")
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(stores)
## Store Type Size
## Min. : 1 Length:45 Min. : 34875
## 1st Qu.:12 Class :character 1st Qu.: 70713
## Median :23 Mode :character Median :126512
## Mean :23 Mean :130288
## 3rd Qu.:34 3rd Qu.:202307
## Max. :45 Max. :219622
count(stores, Type, sort = TRUE)
## Type n
## 1 A 22
## 2 B 17
## 3 C 6
str(stores)
## 'data.frame': 45 obs. of 3 variables:
## $ Store: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Type : chr "A" "A" "B" "A" ...
## $ Size : int 151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...
summary(features)
## Store Date Temperature Fuel_Price
## Min. : 1 Length:8190 Min. : -7.29 Min. :2.472
## 1st Qu.:12 Class :character 1st Qu.: 45.90 1st Qu.:3.041
## Median :23 Mode :character Median : 60.71 Median :3.513
## Mean :23 Mean : 59.36 Mean :3.406
## 3rd Qu.:34 3rd Qu.: 73.88 3rd Qu.:3.743
## Max. :45 Max. :101.95 Max. :4.468
##
## MarkDown1 MarkDown2 MarkDown3 MarkDown4
## Min. : -2781 Min. : -265.76 Min. : -179.26 Min. : 0.22
## 1st Qu.: 1578 1st Qu.: 68.88 1st Qu.: 6.60 1st Qu.: 304.69
## Median : 4744 Median : 364.57 Median : 36.26 Median : 1176.42
## Mean : 7032 Mean : 3384.18 Mean : 1760.10 Mean : 3292.94
## 3rd Qu.: 8923 3rd Qu.: 2153.35 3rd Qu.: 163.15 3rd Qu.: 3310.01
## Max. :103185 Max. :104519.54 Max. :149483.31 Max. :67474.85
## NA's :4158 NA's :5269 NA's :4577 NA's :4726
## MarkDown5 CPI Unemployment IsHoliday
## Min. : -185.2 Min. :126.1 Min. : 3.684 Mode :logical
## 1st Qu.: 1440.8 1st Qu.:132.4 1st Qu.: 6.634 FALSE:7605
## Median : 2727.1 Median :182.8 Median : 7.806 TRUE :585
## Mean : 4132.2 Mean :172.5 Mean : 7.827
## 3rd Qu.: 4832.6 3rd Qu.:213.9 3rd Qu.: 8.567
## Max. :771448.1 Max. :229.0 Max. :14.313
## NA's :4140 NA's :585 NA's :585
str(features)
## 'data.frame': 8190 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05/02/10" "12/02/10" "19/02/10" "26/02/10" ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
summary(test)
## Store Dept Date IsHoliday
## Min. : 1.00 Min. : 1.00 Length:115064 Mode :logical
## 1st Qu.:11.00 1st Qu.:18.00 Class :character FALSE:106136
## Median :22.00 Median :37.00 Mode :character TRUE :8928
## Mean :22.24 Mean :44.34
## 3rd Qu.:33.00 3rd Qu.:74.00
## Max. :45.00 Max. :99.00
str(test)
## 'data.frame': 115064 obs. of 4 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "02/11/12" "09/11/12" "16/11/12" "23/11/12" ...
## $ IsHoliday: logi FALSE FALSE FALSE TRUE FALSE FALSE ...
#1. "features", "test" y "train" tiene la fecha como caracter
#2. "features", hay NA's en más de la mitad de los registros de Markdown (del 1 al 5).
#3. En "features", hay 585 NA's en CPI y Unemployment, y hay 585 registros de IsHoliday = TRUE ¿Tiene
#4. "train" tiene ventas semanales negativas
#Herramienta "El generador de valor de Datos
#Paso 1. Definir el área del negocio que buscamos impactaro mejorar su KPI.
#El departamento de mercadotecnia de EUA (con muestra de 45 tiendas) en el indicador de ventas semanales
#Paso 2. Seleccionar plantilla (-s) para crear valor a partir de los datos de los clientes
#Visión ** / segmentación / Personalización / Contextualización
#Paso 3. Generar ideas o conceptos espesificos
#Elaborar un modelo predictivo de ventas semanales
#Paso 4. Reunir los datos requeridos
#Elaborar una base de datos con la variable dependiente (ventas semanales) y las variables independietes
#Mercadotecnia elaborará plan desplegar modelo predictivo en fases:
#Fase 1. Piloto(San Antonio xTX).
#Fase 2. Texas
#Fase 3. EUA
#Sistemas asegurará la captura del Markdown en las bases de datos
#Creación de la base de datos consolidada
bd <- merge(train, stores, by= "Store")
bd1 <- bd
bd1 <- merge(bd1, features)
bd2 <- bd1
str(bd2)
## 'data.frame': 421570 obs. of 16 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "01/04/11" "01/04/11" "01/04/11" "01/04/11" ...
## $ IsHoliday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Dept : int 49 26 81 34 59 30 7 85 8 28 ...
## $ Weekly_Sales: num 13168 5947 28545 9950 317 ...
## $ Type : chr "A" "A" "A" "A" ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Temperature : num 59.2 59.2 59.2 59.2 59.2 ...
## $ Fuel_Price : num 3.52 3.52 3.52 3.52 3.52 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 215 215 215 215 215 ...
## $ Unemployment: num 7.68 7.68 7.68 7.68 7.68 ...
bd2 <- subset(bd2, select = -c (MarkDown1,MarkDown2,MarkDown3,MarkDown4, MarkDown5))
summary(bd2)
## Store Date IsHoliday Dept
## Min. : 1.0 Length:421570 Mode :logical Min. : 1.00
## 1st Qu.:11.0 Class :character FALSE:391909 1st Qu.:18.00
## Median :22.0 Mode :character TRUE :29661 Median :37.00
## Mean :22.2 Mean :44.26
## 3rd Qu.:33.0 3rd Qu.:74.00
## Max. :45.0 Max. :99.00
## Weekly_Sales Type Size Temperature
## Min. : -4989 Length:421570 Min. : 34875 Min. : -2.06
## 1st Qu.: 2080 Class :character 1st Qu.: 93638 1st Qu.: 46.68
## Median : 7612 Mode :character Median :140167 Median : 62.09
## Mean : 15981 Mean :136728 Mean : 60.09
## 3rd Qu.: 20206 3rd Qu.:202505 3rd Qu.: 74.28
## Max. :693099 Max. :219622 Max. :100.14
## Fuel_Price CPI Unemployment
## Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.:2.933 1st Qu.:132.0 1st Qu.: 6.891
## Median :3.452 Median :182.3 Median : 7.866
## Mean :3.361 Mean :171.2 Mean : 7.960
## 3rd Qu.:3.738 3rd Qu.:212.4 3rd Qu.: 8.572
## Max. :4.468 Max. :227.2 Max. :14.313
bd2$Date <- as.Date(bd2$Date,format= "%d/%m/%Y")
str(bd2)
## 'data.frame': 421570 obs. of 11 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "0011-04-01" "0011-04-01" ...
## $ IsHoliday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Dept : int 49 26 81 34 59 30 7 85 8 28 ...
## $ Weekly_Sales: num 13168 5947 28545 9950 317 ...
## $ Type : chr "A" "A" "A" "A" ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Temperature : num 59.2 59.2 59.2 59.2 59.2 ...
## $ Fuel_Price : num 3.52 3.52 3.52 3.52 3.52 ...
## $ CPI : num 215 215 215 215 215 ...
## $ Unemployment: num 7.68 7.68 7.68 7.68 7.68 ...
#install.packages("wordspace")
library(wordspace)
## Loading required package: Matrix
signcount(bd2$Weekly_Sales)
## pos zero neg
## 420212 73 1285
bd3 <- bd2
bd3 <- bd3[bd3$Weekly_Sales > 0,]
summary (bd3)
## Store Date IsHoliday Dept
## Min. : 1.0 Min. :0010-02-05 Mode :logical Min. : 1.00
## 1st Qu.:11.0 1st Qu.:0010-10-08 FALSE:390652 1st Qu.:18.00
## Median :22.0 Median :0011-06-17 TRUE :29560 Median :37.00
## Mean :22.2 Mean :0011-06-18 Mean :44.24
## 3rd Qu.:33.0 3rd Qu.:0012-02-24 3rd Qu.:74.00
## Max. :45.0 Max. :0012-10-26 Max. :99.00
## Weekly_Sales Type Size Temperature
## Min. : 0 Length:420212 Min. : 34875 Min. : -2.06
## 1st Qu.: 2120 Class :character 1st Qu.: 93638 1st Qu.: 46.68
## Median : 7662 Mode :character Median :140167 Median : 62.09
## Mean : 16033 Mean :136750 Mean : 60.09
## 3rd Qu.: 20271 3rd Qu.:202505 3rd Qu.: 74.28
## Max. :693099 Max. :219622 Max. :100.14
## Fuel_Price CPI Unemployment
## Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.:2.933 1st Qu.:132.0 1st Qu.: 6.891
## Median :3.452 Median :182.4 Median : 7.866
## Mean :3.361 Mean :171.2 Mean : 7.960
## 3rd Qu.:3.738 3rd Qu.:212.4 3rd Qu.: 8.567
## Max. :4.468 Max. :227.2 Max. :14.313
bd4<- bd3
bd4$week_number <- strftime(bd4$Date, format = "%V")
str(bd4)
## 'data.frame': 420212 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "0011-04-01" "0011-04-01" ...
## $ IsHoliday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Dept : int 49 26 81 34 59 30 7 85 8 28 ...
## $ Weekly_Sales: num 13168 5947 28545 9950 317 ...
## $ Type : chr "A" "A" "A" "A" ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Temperature : num 59.2 59.2 59.2 59.2 59.2 ...
## $ Fuel_Price : num 3.52 3.52 3.52 3.52 3.52 ...
## $ CPI : num 215 215 215 215 215 ...
## $ Unemployment: num 7.68 7.68 7.68 7.68 7.68 ...
## $ week_number : chr "13" "13" "13" "13" ...
bd4$week_number <- as.integer(bd4$week_number)
str(bd4)
## 'data.frame': 420212 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "0011-04-01" "0011-04-01" ...
## $ IsHoliday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Dept : int 49 26 81 34 59 30 7 85 8 28 ...
## $ Weekly_Sales: num 13168 5947 28545 9950 317 ...
## $ Type : chr "A" "A" "A" "A" ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Temperature : num 59.2 59.2 59.2 59.2 59.2 ...
## $ Fuel_Price : num 3.52 3.52 3.52 3.52 3.52 ...
## $ CPI : num 215 215 215 215 215 ...
## $ Unemployment: num 7.68 7.68 7.68 7.68 7.68 ...
## $ week_number : int 13 13 13 13 13 13 13 13 13 13 ...
summary(bd4)
## Store Date IsHoliday Dept
## Min. : 1.0 Min. :0010-02-05 Mode :logical Min. : 1.00
## 1st Qu.:11.0 1st Qu.:0010-10-08 FALSE:390652 1st Qu.:18.00
## Median :22.0 Median :0011-06-17 TRUE :29560 Median :37.00
## Mean :22.2 Mean :0011-06-18 Mean :44.24
## 3rd Qu.:33.0 3rd Qu.:0012-02-24 3rd Qu.:74.00
## Max. :45.0 Max. :0012-10-26 Max. :99.00
## Weekly_Sales Type Size Temperature
## Min. : 0 Length:420212 Min. : 34875 Min. : -2.06
## 1st Qu.: 2120 Class :character 1st Qu.: 93638 1st Qu.: 46.68
## Median : 7662 Mode :character Median :140167 Median : 62.09
## Mean : 16033 Mean :136750 Mean : 60.09
## 3rd Qu.: 20271 3rd Qu.:202505 3rd Qu.: 74.28
## Max. :693099 Max. :219622 Max. :100.14
## Fuel_Price CPI Unemployment week_number
## Min. :2.472 Min. :126.1 Min. : 3.879 Min. : 1.00
## 1st Qu.:2.933 1st Qu.:132.0 1st Qu.: 6.891 1st Qu.:14.00
## Median :3.452 Median :182.4 Median : 7.866 Median :26.00
## Mean :3.361 Mean :171.2 Mean : 7.960 Mean :25.83
## 3rd Qu.:3.738 3rd Qu.:212.4 3rd Qu.: 8.567 3rd Qu.:38.00
## Max. :4.468 Max. :227.2 Max. :14.313 Max. :52.00
bd5 <- bd4
bd5 <- bd5 %>%
dplyr::mutate(year= lubridate::year(Date), month = lubridate::month(Date), day = lubridate:: day(Date))
regresion <- lm(Weekly_Sales ~ Store+Dept+IsHoliday+Type+Size+Fuel_Price+CPI+Unemployment+week_number+year+month+day, data = bd5)
summary(regresion)
##
## Call:
## lm(formula = Weekly_Sales ~ Store + Dept + IsHoliday + Type +
## Size + Fuel_Price + CPI + Unemployment + week_number + year +
## month + day, data = bd5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34197 -12897 -5854 5629 671564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.156e+04 3.149e+03 3.672 0.000240 ***
## Store -1.434e+02 3.059e+00 -46.869 < 2e-16 ***
## Dept 1.108e+02 1.097e+00 101.013 < 2e-16 ***
## IsHolidayTRUE 8.239e+02 1.382e+02 5.961 2.51e-09 ***
## TypeB -3.346e+02 1.071e+02 -3.126 0.001775 **
## TypeC 5.851e+03 1.838e+02 31.827 < 2e-16 ***
## Size 9.909e-02 9.561e-04 103.637 < 2e-16 ***
## Fuel_Price 4.829e+02 1.480e+02 3.263 0.001101 **
## CPI -2.306e+01 9.808e-01 -23.516 < 2e-16 ***
## Unemployment -2.467e+02 2.021e+01 -12.207 < 2e-16 ***
## week_number 5.533e+02 4.395e+02 1.259 0.208131
## year -4.946e+02 1.452e+02 -3.407 0.000656 ***
## month -2.229e+03 1.913e+03 -1.165 0.243950
## day -9.725e+01 6.292e+01 -1.545 0.122239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21690 on 420198 degrees of freedom
## Multiple R-squared: 0.08981, Adjusted R-squared: 0.08978
## F-statistic: 3189 on 13 and 420198 DF, p-value: < 2.2e-16
datos_nuevos <- data.frame(Store=1, IsHoliday= TRUE, Dept=1, Type="A", Size= 151315, week_number =1, Temperature = 59.17, Fuel_Price = 3.524, CPI= 214.8372, Unemployment = 7.682, year =2012, month = 1, day=1)
predict(regresion,datos_nuevos)
## 1
## -974781.2
En esta actividad se realizó un modelo predictivo en cuanto a las ventas de wallmart, primero se unieron las bases de datos, todo esto para enfocarnos en los distintos departamentos que tiene la tienda y poder interpretar sus ventas y la manera en las que estas se comportaban, siendo incluso en algunas ocasiones de manera negativa, por lo que en estos casos se tuvieron que eliminar esos valores menores a cero.
Se realizo de igual manera una regresión lineal con el fin de conseguir la relación entre estas variables, en este trabajo tomamos Texas para hacer las pruenbas de los modelos y obtener dichos resultados, aplicando y descargando distitnas librerias que nos llevaron a obtener los pronósitcos necesarios partiendo de las cuatro distintas bases de datos.