El objetivo del trabajo es hacer un pequeño analisis exploratorio de una base de datos ventas de un supermercado, la cual cuenta con aproximadamente 1000 registros a lo largo del trimestre de enero a marzo de 2019. A fines practicos, la idea es:
library("tidyverse")## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library("readr")
library("janitor")## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("dplyr")
library("lubridate")
library("ggdark")## Warning: package 'ggdark' was built under R version 4.2.3
db_sales <- read.csv("supermarket_sales.csv")
head(db_sales)Observo el tipo de dato de cada variable y la existencia de Nulls
str(db_sales)## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ City : chr "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer.type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product.line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 261.42 3.82 162.16 23.29 302.08 ...
## $ Total : chr "5.489.715" "80.22" "3.405.255" "489.048" ...
## $ Date : chr "1/05/2019" "3/08/2019" "3/03/2019" "1/27/2019" ...
## $ Time : chr "13:08:00" "10:29:00" "13:23:00" "20:33:00" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: chr "4.761.904.762" "4.761.904.762" "4.761.904.762" "4.761.904.762" ...
## $ gross.income : num 261.42 3.82 162.16 23.29 302.08 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
sapply(db_sales, class)## Invoice.ID Branch City
## "character" "character" "character"
## Customer.type Gender Product.line
## "character" "character" "character"
## Unit.price Quantity Tax.5.
## "numeric" "integer" "numeric"
## Total Date Time
## "character" "character" "character"
## Payment cogs gross.margin.percentage
## "character" "numeric" "character"
## gross.income Rating
## "numeric" "numeric"
sum(is.na(db_sales))## [1] 0
Limpio y renombro las etiquetas de las variables
db_sales <- db_sales %>%
clean_names() %>%
rename_all(~ gsub("\\.", "_", .)) Le asigno a las variable total y date el tipo de dato numerico y fecha respectivamente
db_sales$total <- as.numeric(db_sales$total)## Warning: NAs introducidos por coerción
db_sales <- db_sales[!is.na(db_sales$total), ]
db_sales$date <- as.Date(db_sales$date, format = "%m/%d/%Y")
sum(is.na(db_sales))## [1] 0
Al no haber valores nulos no es necesario imputar datos. Creo dos variables: month y week a partir de la fecha. Finalmente con todo listo, selecciono las variables a utilizar
db_sales <- db_sales %>%
mutate(month = format(date,"%B")) %>%
mutate(week = week(db_sales$date))
month.name <- c("enero", "febrero", "marzo")
months_num <- 1:3
db_sales$month <- months_num[match(db_sales$month, month.name)]
db_sales <- db_sales %>%
select(city,customer_type,gender,product_line,unit_price, total, date, gross_income, month,week)
summary(db_sales)## city customer_type gender product_line
## Length:832 Length:832 Length:832 Length:832
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## unit_price total date gross_income
## Min. :10.08 Min. : 10.68 Min. :2019-01-01 Min. : 1.066
## 1st Qu.:31.42 1st Qu.: 145.04 1st Qu.:2019-01-24 1st Qu.: 5.083
## Median :54.32 Median : 284.21 Median :2019-02-14 Median : 11.256
## Mean :54.71 Mean : 353.91 Mean :2019-02-14 Mean : 22.215
## 3rd Qu.:76.94 3rd Qu.: 528.19 3rd Qu.:2019-03-08 3rd Qu.: 22.357
## Max. :99.96 Max. :1042.65 Max. :2019-03-30 Max. :985.000
## month week
## Min. :1.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.: 4.000
## Median :2.000 Median : 7.000
## Mean :2.005 Mean : 6.924
## 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :3.000 Max. :13.000
metrics_by_city <- db_sales %>%
group_by(city) %>%
summarise(total_income = sum(gross_income),
total_sales = sum(total),
mean_total = mean(total),
mean_income = mean(gross_income),
sd_total = sd(total),
sd_income = sd(gross_income),
min_unit_price = min(unit_price),
max_unit_price = max(unit_price),
mean_unit_price = mean(unit_price),
max_total_week = week[which.max(total)],
max_total_month = month[which.max(total)],
max_income_week = week[which.max(gross_income)],
max_income_month = month[which.max(gross_income)])
metrics_by_cityComo vemos y veremos en la grafica si bien la sucursal Yangon esta en primera posicion en el total de ingresos obtenidos, es la sucursal Naypyitaw la que cuenta con mayor ventas.
metrics <- db_sales %>%
summarise(total_income = sum(gross_income),
total_sales = sum(total),
mean_total = mean(total),
mean_income = mean(gross_income),
sd_total = sd(total),
sd_income = sd(gross_income),
min_unit_price = min(unit_price),
max_unit_price = max(unit_price),
mean_unit_price = mean(unit_price),
max_total_week = week[which.max(total)],
max_total_month = month[which.max(total)],
max_income_week = week[which.max(gross_income)],
max_income_month = month[which.max(gross_income)])
metricstotal_sales_by_city <- db_sales %>%
group_by(city)%>%
summarise(total_sales = round(sum(total)))
total_sales_by_citytotal_sales_by_city %>%
ggplot(aes(x = total_sales, y = city )) +
geom_col(fill = c("#fecc8f","#fbb91f", "#e85362") ) +
geom_text(aes(label = total_sales, width = -0.5) )+
labs(x = "Total de Ventas (en Dolares) " , y = " Ciudad ",
title = "Total de Ventas del trimestre por Ciudad ", subtitle = "Enero-Marzo 2019")+
theme(axis.text.x = element_blank())## Warning in geom_text(aes(label = total_sales, width = -0.5)): Ignoring unknown
## aesthetics: width
Como vemos en el grafico y adelentamos previamente, la sucursal Naypyitaw tuvo mayor exito en ventas durante el primer trimestre de 2019
cities <- c("Mandalay", "Naypyitaw", "Yangon")
months <- c("Enero", "Febrero", "Marzo")
total_sales_by_city_month <- db_sales %>%
filter(city %in% cities) %>%
group_by(city, month) %>%
summarise(freq_total = n()) %>%
group_by(city) %>%
mutate(mean_freq = round(mean(freq_total)))## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
total_sales_by_city_monthggplot(total_sales_by_city_month, aes(y = month, x = freq_total, fill = city)) +
geom_bar(stat = "identity", orientation = "y", width = 0.7) +
facet_wrap(~ city, ncol = 1) +
geom_text(aes(label = mean_freq, width = 0.7, position = "dodge") )+
labs(y = "Meses", x = "Numero de Ventas", title = "Frecuencia Absoluta de Ventas por Mes y por Ciudad", subtitle = "Enero-Marzo 2019") +
scale_y_continuous(breaks = 1:3, labels = c("Enero", "Febrero", "Marzo"))+
theme(axis.text.x = element_blank())+
dark_theme_classic()## Warning in geom_text(aes(label = mean_freq, width = 0.7, position = "dodge")):
## Ignoring unknown aesthetics: width and position
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
ggplot(total_sales_by_city_month, aes(x = month, y = freq_total)) +
geom_bar(stat = "identity", width = 0.7, position = "dodge") +
geom_point(aes(y = mean_freq, label = freq_total), color = "#e85362", size = 3) +
geom_line(aes(y = mean_freq, group = city), color = "#e85362", size = 1) +
facet_wrap(~ city, ncol = 1) +
labs(x = "Meses", y = "Frecuencia total", title = "Comparación de Frecuencia Total y Frecuencia Media por Ciudad", subtitle = "Enero-Marzo 2019") +
dark_theme_classic() +
scale_x_continuous(
breaks = 1:length(months),
labels = months
)## Warning in geom_point(aes(y = mean_freq, label = freq_total), color =
## "#e85362", : Ignoring unknown aesthetics: label
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Como vemos, el numero de ventas registradas en cada una de las sucursales se mantienen estables. Pero si compraramos en las tres sucursales el mes de febrero reporta un numero de ventas menor al promedio, habiendo un muy limitado repunte en marzo (especialmente en la sucursal de Yangon)
total_sales_per_week <- db_sales %>%
group_by(city,week) %>%
summarise(total_sales = round(sum(total)))## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
ggplot(total_sales_per_week, aes(x = week, y = total_sales, fill = as.factor(week))) +
geom_bar(stat = "identity", width = 0.7, position = "dodge") +
facet_wrap(~ city, ncol = 1) +
labs(x = "Semanas", y = "Total de ventas", title = "Total de ventas a lo largo de las 13 semanas", subtitle = "Enero-Marzo 2019") +
theme_light() +
scale_fill_manual(name = "Semanas",
labels = c("Ene 1-7", "Ene 8-14", "Ene 15-21", "Ene 22-28", "Ene 29-4", "Feb 5-11", "Feb 12-28", "Feb 19-25", "Feb 26-4", "Mar 5-11", "Mar 12-18", "Mar 19-25","Mar 26-1"),
values = 1:13) +
theme(axis.text.x = element_blank())Sobre el leve repunte del numero ventas que vimos anteriormente en el mes de marzo, se ve como los ingresos bajan en febrero y repuntan (mas timidamente en Mandalay) en Naypyitaw y Yangon durante marzo. Cuestion que podremos apreciar a continuación
total_gross_by_city_week <- db_sales %>%
group_by(week,city) %>%
summarise(total_income = round(sum(gross_income) )) ## `summarise()` has grouped output by 'week'. You can override using the
## `.groups` argument.
total_gross_by_city_weekggplot(total_gross_by_city_week, aes(x = week, y = total_income, fill = as.factor(week))) +
geom_bar(stat = "identity", width = 0.7, position = "dodge") +
facet_wrap(~ city, ncol = 1) +
labs(x = "Semanas", y = "Total de ingresos (en bruto)", title = "Total de ingresos (en bruto) ", subtitle = "enero-marzo 2019") +
theme_light() +
geom_line(aes(group = city), size = 0.5, color = "grey") +
scale_fill_manual(name = "Semanas", labels = c("Ene 1-7", "Ene 8-14", "Ene 15-21", "Ene 22-28", "Ene 29-4", "Feb 5-11", "Feb 12-28", "Feb 19-25", "Feb 26-4", "Mar 5-11", "Mar 12-18", "Mar 19-25", "Mar 26-1"), values = 1:13) +
theme(axis.text.x = element_blank())Aqui cabe destacar que el crecimiento en ventas en marzo en Yangon se corresponde con el disparo de ingresos durante la semana de 19-25 de marzo y que veremos en la grafica siguiente
total_income_by_city_month <- db_sales %>%
group_by(month,city) %>%
summarise(total_income = sum(gross_income)) %>%
arrange(city)## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
total_income_by_city_month total_income_by_city_month %>%
ggplot(aes(x = total_income, y = reorder(as.numeric(month), -total_income), fill = month)) +
geom_bar(stat = "identity", position = "stack", orientation = "y") +
geom_text(aes(label = total_income, width = -0.1, position = "dodge", color = "grey", hjust = 1) )+
facet_wrap(~ city, ncol = 1) +
labs(x = "Total de Ingresos", y = "",title = "Total de los ingresos por tienda y mes por Ciudad", subtitle = "Enero-Marzo 2019") +
theme(axis.text.x = element_blank(), legend.position = "none") +
scale_y_discrete(
breaks = 1:length(months),
labels = months
)## Warning in geom_text(aes(label = total_income, width = -0.1, position =
## "dodge", : Ignoring unknown aesthetics: width and position
total_sales_by_gender <- db_sales %>%
group_by(city, gender) %>%
summarise(total_sum = sum(total))## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
total_sales_by_gender ggplot(total_sales_by_gender, aes(total_sum, city, fill = gender )) +
geom_col(position = "dodge2") +
labs(x = "Total de Ventas " , y = " Ciudad ",
title = "Total de Ventas del trimestre por Ciudad filtrado por genero", subtitle = "enero-marzo 2019") +
scale_fill_manual(name = "Género", labels = c("Masculino", "Femenino"), values = c("#1f77b4", "#e377c8"))No vemos una diferencia realmente sustancial en el total de ventas segmentado por genero. Deberiamos adentrarnos en saber que tipo de articulos son los mas elegido en función a dicha variable
total_product_by_city_gender <- db_sales %>%
group_by(city, gender, product_line) %>%
summarise(total_sales = sum(total))## `summarise()` has grouped output by 'city', 'gender'. You can override using
## the `.groups` argument.
ggplot(total_product_by_city_gender, aes(x = product_line, y = total_sales, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Línea de productos", y = "Total de ventas", title = "Total de ventas por línea de productos y género", subtitle = "Enero a marzo 2019") +
facet_wrap(~ city, ncol = 1) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
scale_fill_manual(name = "Género", labels = c("Masculino", "Femenino"), values = c("#1f77b4", "#e377c8"))Finalmente debemos destacar que en funcion al tipo de producto no se encuentra tampoco una diferencia significativa para pensar tipo de consumidor en funcion al genero. Tal vez solo destacar la considerable diferencia en la compra de alimentos y bebidas por parte de la poblacion masculina en Mandalay
En función al breve analisis sobre el recorte de los datos realizados, podemos concluir que: * El inicio del enero tuvo un resultado magro en las tres sucursales (ciudades) * Se deberia analizar enero mas en particular para detectar la razon de la caida de los ingresos * Analizar en detalle procesos internos en la sucursal Mandalay para entender el mal desempeño durante todo el trimestre * Dar cuenta que las variables como genero y tipo de producto no nos explicarian mucho sobre las pautas de consumo y por ende su vinculo con las ventas y los ingresos del trimestre. Obviamente para lograr determinarlo resta por hacer un mayor analisis estadístico