TP. Aplicaciones de Software Estadístico 1 - MGIAE

Fuente. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales?select=supermarket_sales+-+Sheet1.csv

Introducción

El objetivo del trabajo es hacer un pequeño analisis exploratorio de una base de datos ventas de un supermercado, la cual cuenta con aproximadamente 1000 registros a lo largo del trimestre de enero a marzo de 2019. A fines practicos, la idea es:

Calcular algunas metricas y lo vinculado a la estadistica descriptiva.
Observar el comportamiento de variables como ingresos (bruto) y total de ventas a lo largo del tiempo segmentado genero y agrupado por ciudad.

———————————————————————————————————————-

Carga de paquetes

library("tidyverse")

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.2

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library("readr")
library("janitor")

## Warning: package 'janitor' was built under R version 4.2.3

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library("dplyr")
library("lubridate")
library("ggdark")

## Warning: package 'ggdark' was built under R version 4.2.3

Carga el dataset

db_sales <- read.csv("supermarket_sales.csv")
head(db_sales)

Limpieza y manipulacion de los datos

Observo el tipo de dato de cada variable y la existencia de Nulls

str(db_sales)

## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr  "A" "C" "A" "A" ...
##  $ City                   : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer.type          : chr  "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
##  $ Product.line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  261.42 3.82 162.16 23.29 302.08 ...
##  $ Total                  : chr  "5.489.715" "80.22" "3.405.255" "489.048" ...
##  $ Date                   : chr  "1/05/2019" "3/08/2019" "3/03/2019" "1/27/2019" ...
##  $ Time                   : chr  "13:08:00" "10:29:00" "13:23:00" "20:33:00" ...
##  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: chr  "4.761.904.762" "4.761.904.762" "4.761.904.762" "4.761.904.762" ...
##  $ gross.income           : num  261.42 3.82 162.16 23.29 302.08 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...

sapply(db_sales, class)

##              Invoice.ID                  Branch                    City 
##             "character"             "character"             "character" 
##           Customer.type                  Gender            Product.line 
##             "character"             "character"             "character" 
##              Unit.price                Quantity                  Tax.5. 
##               "numeric"               "integer"               "numeric" 
##                   Total                    Date                    Time 
##             "character"             "character"             "character" 
##                 Payment                    cogs gross.margin.percentage 
##             "character"               "numeric"             "character" 
##            gross.income                  Rating 
##               "numeric"               "numeric"

sum(is.na(db_sales))

## [1] 0

Limpio y renombro las etiquetas de las variables

db_sales <- db_sales %>%
  clean_names() %>%
  rename_all(~ gsub("\\.", "_", .))

Le asigno a las variable total y date el tipo de dato numerico y fecha respectivamente

db_sales$total <- as.numeric(db_sales$total)

## Warning: NAs introducidos por coerción

db_sales <- db_sales[!is.na(db_sales$total), ]
db_sales$date <- as.Date(db_sales$date, format = "%m/%d/%Y") 
sum(is.na(db_sales))

## [1] 0

Al no haber valores nulos no es necesario imputar datos. Creo dos variables: month y week a partir de la fecha. Finalmente con todo listo, selecciono las variables a utilizar

  db_sales  <- db_sales %>%  
    mutate(month = format(date,"%B")) %>% 
    mutate(week = week(db_sales$date))

  month.name <- c("enero", "febrero", "marzo")
  months_num <- 1:3
  db_sales$month <- months_num[match(db_sales$month, month.name)]
  
db_sales <- db_sales %>% 
  select(city,customer_type,gender,product_line,unit_price, total, date, gross_income, month,week)  
 

  summary(db_sales)

##      city           customer_type         gender          product_line      
##  Length:832         Length:832         Length:832         Length:832        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    unit_price        total              date             gross_income    
##  Min.   :10.08   Min.   :  10.68   Min.   :2019-01-01   Min.   :  1.066  
##  1st Qu.:31.42   1st Qu.: 145.04   1st Qu.:2019-01-24   1st Qu.:  5.083  
##  Median :54.32   Median : 284.21   Median :2019-02-14   Median : 11.256  
##  Mean   :54.71   Mean   : 353.91   Mean   :2019-02-14   Mean   : 22.215  
##  3rd Qu.:76.94   3rd Qu.: 528.19   3rd Qu.:2019-03-08   3rd Qu.: 22.357  
##  Max.   :99.96   Max.   :1042.65   Max.   :2019-03-30   Max.   :985.000  
##      month            week       
##  Min.   :1.000   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.: 4.000  
##  Median :2.000   Median : 7.000  
##  Mean   :2.005   Mean   : 6.924  
##  3rd Qu.:3.000   3rd Qu.:10.000  
##  Max.   :3.000   Max.   :13.000

Analisis de los datos

Medidas de posicion y dispersion agrupadas por ciudad.

metrics_by_city <- db_sales %>%
  group_by(city) %>%
  summarise(total_income = sum(gross_income),
            total_sales = sum(total),
            mean_total = mean(total),
            mean_income = mean(gross_income),
            sd_total = sd(total),
            sd_income = sd(gross_income),
            min_unit_price = min(unit_price), 
            max_unit_price = max(unit_price),
            mean_unit_price = mean(unit_price),
            max_total_week = week[which.max(total)],
            max_total_month = month[which.max(total)],
            max_income_week = week[which.max(gross_income)],
            max_income_month = month[which.max(gross_income)]) 
metrics_by_city

Como vemos y veremos en la grafica si bien la sucursal Yangon esta en primera posicion en el total de ingresos obtenidos, es la sucursal Naypyitaw la que cuenta con mayor ventas.

Medidas de posicion y dispersion generales de las tres sucursales (ciudades)

metrics <- db_sales %>%
  summarise(total_income = sum(gross_income),
            total_sales = sum(total),
            mean_total = mean(total),
            mean_income = mean(gross_income),
            sd_total = sd(total),
            sd_income = sd(gross_income),
            min_unit_price = min(unit_price), 
            max_unit_price = max(unit_price),
            mean_unit_price = mean(unit_price),
            max_total_week = week[which.max(total)],
            max_total_month = month[which.max(total)],
            max_income_week = week[which.max(gross_income)],
            max_income_month = month[which.max(gross_income)]) 
metrics

Total de ventas por ciudad

total_sales_by_city <- db_sales %>%
  group_by(city)%>%
  summarise(total_sales = round(sum(total)))
total_sales_by_city

total_sales_by_city  %>%
  ggplot(aes(x = total_sales, y = city )) +
  geom_col(fill = c("#fecc8f","#fbb91f", "#e85362") ) +
  geom_text(aes(label = total_sales, width = -0.5) )+
  labs(x = "Total de Ventas (en Dolares) " , y = " Ciudad ", 
       title = "Total de Ventas del trimestre por Ciudad ", subtitle = "Enero-Marzo 2019")+
  theme(axis.text.x = element_blank())

## Warning in geom_text(aes(label = total_sales, width = -0.5)): Ignoring unknown
## aesthetics: width

Como vemos en el grafico y adelentamos previamente, la sucursal Naypyitaw tuvo mayor exito en ventas durante el primer trimestre de 2019

Frecuencia media y absoluta de ventas a lo largo del trimestre

cities <- c("Mandalay", "Naypyitaw", "Yangon")
months <- c("Enero", "Febrero", "Marzo")

total_sales_by_city_month <- db_sales %>%
  filter(city %in% cities) %>%
  group_by(city, month) %>%
  summarise(freq_total = n()) %>% 
  group_by(city) %>%
  mutate(mean_freq = round(mean(freq_total)))

## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

total_sales_by_city_month

ggplot(total_sales_by_city_month, aes(y = month, x = freq_total, fill = city)) +
  geom_bar(stat = "identity", orientation = "y", width = 0.7) +
  facet_wrap(~ city, ncol = 1) +
  geom_text(aes(label = mean_freq, width = 0.7, position = "dodge") )+
  labs(y = "Meses", x = "Numero de Ventas", title = "Frecuencia Absoluta de Ventas por Mes y por Ciudad", subtitle = "Enero-Marzo 2019") +
  scale_y_continuous(breaks = 1:3, labels = c("Enero", "Febrero", "Marzo"))+
  theme(axis.text.x = element_blank())+
  dark_theme_classic()

## Warning in geom_text(aes(label = mean_freq, width = 0.7, position = "dodge")):
## Ignoring unknown aesthetics: width and position

## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

ggplot(total_sales_by_city_month, aes(x = month, y = freq_total)) +
  geom_bar(stat = "identity", width = 0.7, position = "dodge") +
  geom_point(aes(y = mean_freq, label = freq_total), color = "#e85362", size = 3) +
  geom_line(aes(y = mean_freq, group = city), color = "#e85362", size = 1) +
  facet_wrap(~ city, ncol = 1) +
  labs(x = "Meses", y = "Frecuencia total", title = "Comparación de Frecuencia Total y Frecuencia Media por Ciudad", subtitle = "Enero-Marzo 2019") +
  dark_theme_classic() +
  scale_x_continuous(
    breaks = 1:length(months),
    labels = months
  )

## Warning in geom_point(aes(y = mean_freq, label = freq_total), color =
## "#e85362", : Ignoring unknown aesthetics: label

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Como vemos, el numero de ventas registradas en cada una de las sucursales se mantienen estables. Pero si compraramos en las tres sucursales el mes de febrero reporta un numero de ventas menor al promedio, habiendo un muy limitado repunte en marzo (especialmente en la sucursal de Yangon)

Total de ventas semanales - enero-marzo 2019

total_sales_per_week <- db_sales %>% 
  group_by(city,week) %>%
  summarise(total_sales = round(sum(total)))

## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

ggplot(total_sales_per_week, aes(x = week, y = total_sales, fill = as.factor(week))) +
  geom_bar(stat = "identity", width = 0.7, position = "dodge") +
  facet_wrap(~ city, ncol = 1) +
  labs(x = "Semanas", y = "Total de ventas", title = "Total de ventas a lo largo de las 13 semanas", subtitle = "Enero-Marzo 2019") +
  theme_light() +
  scale_fill_manual(name = "Semanas", 
                    labels = c("Ene 1-7", "Ene 8-14", "Ene 15-21", "Ene 22-28", "Ene 29-4", "Feb 5-11", "Feb 12-28", "Feb 19-25", "Feb 26-4", "Mar 5-11", "Mar 12-18", "Mar 19-25","Mar 26-1"), 
                    values = 1:13) +
  theme(axis.text.x = element_blank())

Sobre el leve repunte del numero ventas que vimos anteriormente en el mes de marzo, se ve como los ingresos bajan en febrero y repuntan (mas timidamente en Mandalay) en Naypyitaw y Yangon durante marzo. Cuestion que podremos apreciar a continuación

Total de ingresos (en bruto) enero-marzo 2019

total_gross_by_city_week <- db_sales %>%
  group_by(week,city) %>%
  summarise(total_income = round(sum(gross_income) ))

## `summarise()` has grouped output by 'week'. You can override using the
## `.groups` argument.

total_gross_by_city_week

ggplot(total_gross_by_city_week, aes(x = week, y = total_income, fill = as.factor(week))) +
  geom_bar(stat = "identity", width = 0.7, position = "dodge") +
  facet_wrap(~ city, ncol = 1) +
  labs(x = "Semanas", y = "Total de ingresos (en bruto)", title = "Total de ingresos (en bruto) ", subtitle = "enero-marzo 2019") +
  theme_light() +
  geom_line(aes(group = city), size = 0.5, color = "grey") +
  scale_fill_manual(name = "Semanas", labels = c("Ene 1-7", "Ene 8-14", "Ene 15-21", "Ene 22-28", "Ene 29-4", "Feb 5-11", "Feb 12-28", "Feb 19-25", "Feb 26-4", "Mar 5-11", "Mar 12-18", "Mar 19-25", "Mar 26-1"), values = 1:13) +
 
  theme(axis.text.x = element_blank())

Aqui cabe destacar que el crecimiento en ventas en marzo en Yangon se corresponde con el disparo de ingresos durante la semana de 19-25 de marzo y que veremos en la grafica siguiente

Total de ingresos por ciudad y por mes

total_income_by_city_month <- db_sales %>%
  group_by(month,city) %>%
  summarise(total_income = sum(gross_income)) %>%
  arrange(city)

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

total_income_by_city_month

  total_income_by_city_month %>%
    ggplot(aes(x = total_income, y = reorder(as.numeric(month), -total_income), fill = month)) +
    geom_bar(stat = "identity", position = "stack", orientation = "y") +
    geom_text(aes(label = total_income, width = -0.1, position = "dodge", color = "grey", hjust = 1) )+
    facet_wrap(~ city, ncol = 1) +
    labs(x = "Total de Ingresos", y = "",title = "Total de los ingresos por tienda y mes por Ciudad", subtitle = "Enero-Marzo 2019") +  
    theme(axis.text.x = element_blank(), legend.position = "none") +
    scale_y_discrete(
    breaks = 1:length(months),
    labels = months
  )

## Warning in geom_text(aes(label = total_income, width = -0.1, position =
## "dodge", : Ignoring unknown aesthetics: width and position

Total de ventas por Ciudad y genero durante todo el trimestre

total_sales_by_gender <- db_sales %>%
  group_by(city, gender) %>%
  summarise(total_sum = sum(total))

## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

total_sales_by_gender

  ggplot(total_sales_by_gender, aes(total_sum, city, fill = gender )) +
  geom_col(position = "dodge2") +
  labs(x = "Total de Ventas " , y = " Ciudad ", 
       title = "Total de Ventas del trimestre por Ciudad filtrado por genero", subtitle = "enero-marzo 2019") +
      scale_fill_manual(name = "Género", labels = c("Masculino", "Femenino"), values = c("#1f77b4", "#e377c8"))

No vemos una diferencia realmente sustancial en el total de ventas segmentado por genero. Deberiamos adentrarnos en saber que tipo de articulos son los mas elegido en función a dicha variable

Total de ventas por tipo de producto y genero durante todo el trimestre en las tres ciudades

total_product_by_city_gender <- db_sales %>%
  group_by(city, gender, product_line) %>%
  summarise(total_sales = sum(total))

## `summarise()` has grouped output by 'city', 'gender'. You can override using
## the `.groups` argument.

ggplot(total_product_by_city_gender, aes(x = product_line, y = total_sales, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Línea de productos", y = "Total de ventas", title = "Total de ventas por línea de productos y género", subtitle = "Enero a marzo 2019") +
  facet_wrap(~ city, ncol = 1) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_fill_manual(name = "Género", labels = c("Masculino", "Femenino"), values = c("#1f77b4", "#e377c8"))

Finalmente debemos destacar que en funcion al tipo de producto no se encuentra tampoco una diferencia significativa para pensar tipo de consumidor en funcion al genero. Tal vez solo destacar la considerable diferencia en la compra de alimentos y bebidas por parte de la poblacion masculina en Mandalay

Conclusión parcial:

En función al breve analisis sobre el recorte de los datos realizados, podemos concluir que: * El inicio del enero tuvo un resultado magro en las tres sucursales (ciudades) * Se deberia analizar enero mas en particular para detectar la razon de la caida de los ingresos * Analizar en detalle procesos internos en la sucursal Mandalay para entender el mal desempeño durante todo el trimestre * Dar cuenta que las variables como genero y tipo de producto no nos explicarian mucho sobre las pautas de consumo y por ende su vinculo con las ventas y los ingresos del trimestre. Obviamente para lograr determinarlo resta por hacer un mayor analisis estadístico

TP. Aplicaciones de Software Estadístico 1 - MGIAE - UMTREF

Lucas Luiselli

21 de mayo de 2023