Lidl

Authors: Julia Chyła, Paulina Fereniec, Joanna Kościńska

Introduction

The aim of this project is to analyse sales data for products from three main categories: Furniture, Office Supplies, and Technology. The data include, among other things, information about customers, their location, the quantity and value of the products they ordered, and the payment and shipping methods they chose. The analysis covers data cleansing and data wrangling, data visualization, descriptive analysis, and statistical tests, on the basis of which the final conclusions were drawn.

The project uses the following libraries: dlookr, deducorrect, tidyverse, readr, dplyr, readxl, validate, validatetools, lubridate, naniar, visdat, car, config, kableExtra, ggstatsplot, ggplot2, scales, corrplot.

Data cleaning

Renaming columns for easier analysis

Lidl<-Lidl%>%
  rename(Ship_Date=`Ship.Date`,
         Order_Date=`Order.Date`,
         Ship_Mode=`Ship.Mode`,
         Customer_Name=`Customer.Name`,
         Product_Name=`Product.Name`,
         Sub_Category=`Sub.Category`,
         Payment_Mode=`Payment.Mode`)
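Every rename above swaps a dot for an underscore, so the same step can be sketched more generally with `rename_with()`. The toy data frame `orders` is a hypothetical stand-in for the real data. Unlike the explicit call, this variant would also rename columns such as `Order.ID`.

```r
library(dplyr)

# Hypothetical stand-in for the raw data: dotted column names as produced on import
orders <- data.frame(Ship.Date = "08-11-2019",
                     Order.Date = "03-11-2019",
                     Sub.Category = "Paper")

# Replace every literal dot in a column name with an underscore
orders <- orders %>%
  rename_with(~ gsub(".", "_", .x, fixed = TRUE))

names(orders)
```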

Reviewing the data structure

The data cover 4 regions: Central, East, South, and West.

Lidl%>%
  select(Region)%>% 
group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 4 × 2
## # Groups:   Region [4]
##   Region      n
##   <chr>   <int>
## 1 West     1901
## 2 East     1688
## 3 Central  1381
## 4 South     931

There is only one country, the United States, so the Country column will be removed.

Lidl%>%
  select(Country)%>% 
group_by_all()%>%
  count()

## # A tibble: 1 × 2
## # Groups:   Country [1]
##   Country           n
##   <chr>         <int>
## 1 United States  5901
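A column with a single distinct value can also be detected programmatically. The sketch below uses base R on an invented toy data frame; `orders` and `constant_cols` are illustrative names, not objects from the project.

```r
# Hypothetical toy data: Country holds a single value, Region does not
orders <- data.frame(Country = rep("United States", 4),
                     Region  = c("West", "East", "West", "South"))

# Number of distinct values in each column
n_unique <- sapply(orders, function(x) length(unique(x)))

# Columns with exactly one distinct value carry no information for the analysis
constant_cols <- names(n_unique)[n_unique == 1]
constant_cols
```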

Number of states: 49

Lidl%>%
  select(State)%>% 
group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 49 × 2
## # Groups:   State [49]
##    State              n
##    <chr>          <int>
##  1 California      1189
##  2 New York         672
##  3 Texas            565
##  4 Washington       337
##  5 Pennsylvania     335
##  6 Illinois         293
##  7 Ohio             293
##  8 Florida          212
##  9 North Carolina   159
## 10 Michigan         138
## # ℹ 39 more rows

Number of cities: 452

Lidl%>%
  select(City)%>% 
group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 452 × 2
## # Groups:   City [452]
##    City              n
##    <chr>         <int>
##  1 New York City   563
##  2 Los Angeles     430
##  3 Philadelphia    310
##  4 San Francisco   304
##  5 Seattle         282
##  6 Houston         204
##  7 Chicago         197
##  8 Columbus        138
##  9 Dallas          110
## 10 Springfield     100
## # ℹ 442 more rows

Segments: 3 distinct segments (Consumer, Corporate, Home Office)

Lidl%>%
  select(Segment)%>%
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 3 × 2
## # Groups:   Segment [3]
##   Segment         n
##   <chr>       <int>
## 1 Consumer     2997
## 2 Corporate    1774
## 3 Home Office  1130

4 shipping modes: First Class, Same Day, Second Class, Standard Class

Lidl%>%
  select(Ship_Mode)%>%
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 4 × 2
## # Groups:   Ship_Mode [4]
##   Ship_Mode          n
##   <chr>          <int>
## 1 Standard Class  3451
## 2 Second Class    1147
## 3 First Class      959
## 4 Same Day         344

There are 773 customers.

Lidl%>%
  select(Customer_Name)%>% 
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 773 × 2
## # Groups:   Customer_Name [773]
##    Customer_Name      n
##    <chr>          <int>
##  1 Emily Phan        27
##  2 Edward Hooks      25
##  3 Paul Prost        25
##  4 Seth Vernon       25
##  5 Pete Kriz         24
##  6 Lena Cacioppo     23
##  7 Sally Hughsby     23
##  8 William Brown     23
##  9 Dean percer       22
## 10 Mick Hernandez    22
## # ℹ 763 more rows

There are 1,742 products.

Lidl%>%
  select(Product_Name)%>%
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 1,742 × 2
## # Groups:   Product_Name [1,742]
##    Product_Name                                        n
##    <chr>                                           <int>
##  1 Easy-staple paper                                  27
##  2 Staples                                            24
##  3 Staple envelope                                    22
##  4 Staples in misc. colors                            13
##  5 Chromcraft Round Conference Tables                 12
##  6 Staple remover                                     12
##  7 Storex Dura Pro Binders                            12
##  8 Avery Non-Stick Binders                            11
##  9 Global Wood Trimmed Manager's Task Chair, Khaki    11
## 10 GBC Instant Report Kit                             10
## # ℹ 1,732 more rows

3 payment modes are available: COD, Cards, and Online.

Lidl%>%
  select(Payment_Mode)%>%
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 3 × 2
## # Groups:   Payment_Mode [3]
##   Payment_Mode     n
##   <chr>        <int>
## 1 COD           2453
## 2 Online        2164
## 3 Cards         1284

There are 3 categories: Furniture, Office Supplies, and Technology. Office Supplies has the most order records.

Lidl%>%
  select(Category)%>%
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 3 × 2
## # Groups:   Category [3]
##   Category            n
##   <chr>           <int>
## 1 Office Supplies  3569
## 2 Furniture        1249
## 3 Technology       1083

There are 17 subcategories.

Lidl%>%
  select(Sub_Category)%>% 
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 17 × 2
## # Groups:   Sub_Category [17]
##    Sub_Category     n
##    <chr>        <int>
##  1 Binders        915
##  2 Paper          825
##  3 Furnishings    573
##  4 Phones         519
##  5 Storage        498
##  6 Art            465
##  7 Accessories    461
##  8 Chairs         355
##  9 Appliances     279
## 10 Labels         211
## 11 Tables         190
## 12 Envelopes      133
## 13 Bookcases      131
## 14 Fasteners      124
## 15 Supplies       119
## 16 Machines        65
## 17 Copiers         38

The Furniture category has 4 subcategories: Bookcases, Chairs, Furnishings, Tables.

Lidl%>%
  filter(Category=="Furniture")%>%
  select(Sub_Category)%>% 
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 4 × 2
## # Groups:   Sub_Category [4]
##   Sub_Category     n
##   <chr>        <int>
## 1 Furnishings    573
## 2 Chairs         355
## 3 Tables         190
## 4 Bookcases      131

The Technology category has 4 subcategories: Phones, Accessories, Machines, Copiers.

Lidl%>%
  filter(Category=="Technology")%>%
  select(Sub_Category)%>% 
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 4 × 2
## # Groups:   Sub_Category [4]
##   Sub_Category     n
##   <chr>        <int>
## 1 Phones         519
## 2 Accessories    461
## 3 Machines        65
## 4 Copiers         38

The Office Supplies category has 9 subcategories: Binders, Paper, Storage, Art, Appliances, Labels, Envelopes, Fasteners, Supplies.

Lidl%>%
  filter(Category=="Office Supplies")%>%
  select(Sub_Category)%>% 
  group_by_all()%>%
  count(sort=TRUE)

## # A tibble: 9 × 2
## # Groups:   Sub_Category [9]
##   Sub_Category     n
##   <chr>        <int>
## 1 Binders        915
## 2 Paper          825
## 3 Storage        498
## 4 Art            465
## 5 Appliances     279
## 6 Labels         211
## 7 Envelopes      133
## 8 Fasteners      124
## 9 Supplies       119

Basic tidying

summary(Lidl)

##  Row.ID.O6G3A1.R6   Order.ID          Order_Date         Ship_Date        
##  Min.   :   1     Length:5901        Length:5901        Length:5901       
##  1st Qu.:2486     Class :character   Class :character   Class :character  
##  Median :5091     Mode  :character   Mode  :character   Mode  :character  
##  Mean   :5022                                                             
##  3rd Qu.:7456                                                             
##  Max.   :9994                                                             
##   Ship_Mode         Customer.ID        Customer_Name        Segment         
##  Length:5901        Length:5901        Length:5901        Length:5901       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State              Region         
##  Length:5901        Length:5901        Length:5901        Length:5901       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Product.ID          Category         Sub_Category       Product_Name      
##  Length:5901        Length:5901        Length:5901        Length:5901       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Sales             Quantity          Profit            Returns         
##  Min.   :   0.836   Min.   : 1.000   Min.   :-6599.978   Length:5901       
##  1st Qu.:  71.976   1st Qu.: 2.000   1st Qu.:    1.796   Class :character  
##  Median : 128.648   Median : 3.000   Median :    8.502   Mode  :character  
##  Mean   : 265.346   Mean   : 3.782   Mean   :   29.700                     
##  3rd Qu.: 265.170   3rd Qu.: 5.000   3rd Qu.:   28.615                     
##  Max.   :9099.930   Max.   :14.000   Max.   : 8399.976                     
##  Payment_Mode         ind1           ind2        
##  Length:5901        Mode:logical   Mode:logical  
##  Class :character   NA's:5901      NA's:5901     
##  Mode  :character                                
##                                                  
##                                                  
##

Removing empty columns

Lidl[,c('ind2','ind1')] <-list(NULL)
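Columns that contain only missing values can also be found without naming them. A minimal base R sketch on invented data follows; `orders` and `all_missing` are hypothetical names.

```r
# Hypothetical toy data with two columns that contain only NA
orders <- data.frame(Sales = c(10.5, 20),
                     ind1  = c(NA, NA),
                     ind2  = c(NA, NA))

# TRUE for columns in which every value is missing
all_missing <- colSums(!is.na(orders)) == 0

# Keep only the columns that hold at least one observed value
orders <- orders[, !all_missing, drop = FALSE]
names(orders)
```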

Converting data types

Lidl$Order_Date<-dmy(Lidl$Order_Date)
Lidl$Ship_Date<- dmy(Lidl$Ship_Date)
class(Lidl$Order_Date)

## [1] "Date"

class(Lidl$Ship_Date)

## [1] "Date"

Lidl$Returns <- as.numeric(Lidl$Returns)

## Warning: NAs introduced by coercion

Lidl$Profit <- as.numeric(Lidl$Profit)
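`dmy()` assumes that the text dates are written as day, month, year. A base R counterpart states that layout explicitly. The sample strings below are invented, since the raw format of the date columns is not shown in this report.

```r
# Invented example values; the raw format of Order_Date is not shown in the report
raw_dates <- c("03-11-2019", "25-12-2019")

# Base R counterpart of dmy(): the day-month-year layout is spelled out explicitly
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

class(parsed)
format(parsed, "%Y-%m-%d")
```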

Removing unnecessary columns

Lidl[,c('...1','Row.ID.O6G3A1.R6','Order.ID','Customer.ID','Country','Product.ID')] <-list(NULL)

Cleaning the Profit column

There are 1,098 negative values. They are replaced with missing values (NA).

Lidl%>%
  filter(Profit<0)%>%
  count()

##      n
## 1 1098

Lidl$Profit<-replace(Lidl$Profit,Lidl$Profit<0,NA)

Lidl%>%
  sapply(function(x) is.na(x) %>% 
           sum())

##    Order_Date     Ship_Date     Ship_Mode Customer_Name       Segment 
##             0             0             0             0             0 
##          City         State        Region      Category  Sub_Category 
##             0             0             0             0             0 
##  Product_Name         Sales      Quantity        Profit       Returns 
##             0             0             0          1098          5614 
##  Payment_Mode 
##             0

Missing values: the missing Profit values make up 18.6% of all observations.

n_miss(Lidl)

## [1] 6712

miss_var_summary(Lidl)

## # A tibble: 16 × 3
##    variable      n_miss pct_miss
##    <chr>          <int>    <dbl>
##  1 Returns         5614     95.1
##  2 Profit          1098     18.6
##  3 Order_Date         0      0  
##  4 Ship_Date          0      0  
##  5 Ship_Mode          0      0  
##  6 Customer_Name      0      0  
##  7 Segment            0      0  
##  8 City               0      0  
##  9 State              0      0  
## 10 Region             0      0  
## 11 Category           0      0  
## 12 Sub_Category       0      0  
## 13 Product_Name       0      0  
## 14 Sales              0      0  
## 15 Quantity           0      0  
## 16 Payment_Mode       0      0
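The percentages in the table can be reproduced by hand from the counts in the output above. This is a small arithmetic check, with variable names chosen only for illustration.

```r
# Counts taken from the output above
n_rows         <- 5901
n_miss_profit  <- 1098
n_miss_returns <- 5614

# Share of missing values per variable, in percent
pct_profit  <- round(100 * n_miss_profit  / n_rows, 1)
pct_returns <- round(100 * n_miss_returns / n_rows, 1)

# Total number of missing cells, as reported by n_miss()
total_missing <- n_miss_profit + n_miss_returns
```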

summary(Lidl$Profit)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    5.203   12.701   55.584   38.307 8399.976     1098

Cleaning the Returns column

Returns is converted to a binary variable: a missing value becomes 0 (no return), while 1 marks a return.

Lidl%>%
  sapply(function(x) is.na(x) %>% 
           sum())

##    Order_Date     Ship_Date     Ship_Mode Customer_Name       Segment 
##             0             0             0             0             0 
##          City         State        Region      Category  Sub_Category 
##             0             0             0             0             0 
##  Product_Name         Sales      Quantity        Profit       Returns 
##             0             0             0          1098          5614 
##  Payment_Mode 
##             0

Lidl <- Lidl%>%
  mutate(Returns = ifelse(is.na(Returns), 0, Returns))

# Check that no values other than 0 and 1 remain
Lidl%>%
  filter(Returns!=0 & Returns!=1)%>%
  head(5)

##  [1] Order_Date    Ship_Date     Ship_Mode     Customer_Name Segment      
##  [6] City          State         Region        Category      Sub_Category 
## [11] Product_Name  Sales         Quantity      Profit        Returns      
## [16] Payment_Mode 
## <0 rows> (or 0-length row.names)
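The recoding and the follow-up check can be illustrated on a short vector. The values are invented and assume that, after `as.numeric()`, a return is stored as 1 and everything else as NA.

```r
# Hypothetical Returns values after as.numeric(): 1 marks a return, NA means none recorded
returns <- c(NA, 1, NA, NA, 1)

# Treat a missing entry as "no return"
returns <- ifelse(is.na(returns), 0, returns)

# The recoded variable should contain nothing but 0 and 1
only_binary <- all(returns %in% c(0, 1))
only_binary
```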

Creating a new column for analysis

Lidl <- Lidl%>%
 mutate(Delivery_time=c(difftime(Ship_Date,Order_Date,units="days")))
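For a single pair of invented dates, the delivery time works out as follows. The conversion with `as.numeric()` is an optional extra step, not part of the original pipeline, for cases where a plain number of days is easier to work with.

```r
order_date <- as.Date("2019-11-03")   # invented example dates
ship_date  <- as.Date("2019-11-08")

# difftime() keeps the unit; as.numeric() drops it to a plain number of days
delivery_time <- difftime(ship_date, order_date, units = "days")
delivery_days <- as.numeric(delivery_time, units = "days")
```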

Data structure

vis_dat(Lidl)

vis_miss(Lidl)

Rules

rules <- validator(
Sales >= 0
, Quantity >= 0 
, Delivery_time >= 0)

Visualizing rule compliance

cf <- confront(Lidl, rules, key=NULL)
summary(cf)

##   name items passes fails nNA error warning             expression
## 1   V1  5901   5901     0   0 FALSE   FALSE    Sales - 0 >= -1e-08
## 2   V2  5901   5901     0   0 FALSE   FALSE Quantity - 0 >= -1e-08
## 3   V3  5901   5901     0   0 FALSE   FALSE     Delivery_time >= 0

plot(cf, main="Lidl")
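What the three rules test can be approximated in base R, without the validate package. The toy data below are invented and deliberately contain violations, unlike the real data, where every record passes.

```r
# Hypothetical toy data with one record violating each of two rules
orders <- data.frame(Sales = c(120.5, -3, 40),
                     Quantity = c(2, 1, 5),
                     Delivery_time = c(4, 0, -1))

# Each rule is evaluated row by row to a logical vector
checks <- list(
  sales_non_negative    = orders$Sales >= 0,
  quantity_non_negative = orders$Quantity >= 0,
  delivery_non_negative = orders$Delivery_time >= 0
)

# Number of records failing each rule
fails <- sapply(checks, function(ok) sum(!ok))
fails
```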

Outliers

boxplot(Lidl$Sales)

boxplot(Lidl$Profit)

boxplot(Lidl$Quantity)

scatterplot(Lidl$Sales, Lidl$Profit, smoother=NULL, boxplots = "y", main="Scatter plot of Profit vs. Sales")

Lidl$Profit <- as.numeric(Lidl$Profit)
Lidl$Profit<-imputate_outlier(Lidl, Profit, method="capping", cap_ntiles = c(0.01, 0.99))

summary(Lidl$Profit)

## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##                     Original     Imputation  
## described_variables "value"      "value"     
## n                   "4803"       "4803"      
## na                  "1098"       "1098"      
## mean                "55.58439"   "96.33276"  
## sd                  "239.2053"   "209.8001"  
## se_mean             "3.451553"   "3.027258"  
## IQR                 "33.10415"   "33.10415"  
## skewness            "19.861530"  " 2.264548" 
## kurtosis            "537.606341" "  3.193319"
## p00                 "0"          "0"         
## p01                 "0.158972"   "0.158972"  
## p05                 "1.20092"    "1.20092"   
## p10                 "2.24608"    "2.24608"   
## p20                 "4.03708"    "4.03708"   
## p25                 "5.20275"    "5.20275"   
## p30                 "6.2208"     "6.2208"    
## p40                 "8.93336"    "8.93336"   
## p50                 "12.7008"    "12.7008"   
## p60                 "19.26368"   "19.26368"  
## p70                 "29.3672"    "29.3672"   
## p75                 "38.3069"    "38.3069"   
## p80                 "51.82824"   "51.82824"  
## p90                 "107.9220"   "654.7263"  
## p95                 "206.5925"   "654.7263"  
## p99                 "654.7263"   "654.7263"  
## p100                "8399.9760"  " 654.7263"
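In the output above, the mean of Profit rises from 55.58 to 96.33 after capping, and the 90th percentile jumps to 654.73. This is consistent with every value flagged by the boxplot rule being replaced by the 99th percentile, rather than only the values beyond it being trimmed. The sketch below is a base R approximation of that behaviour, assuming that outliers are flagged with the boxplot rule. `cap_outliers` is a hypothetical helper, not a dlookr function, and the data are invented.

```r
# Hypothetical helper: replace boxplot-rule outliers with the chosen percentiles
cap_outliers <- function(x, probs = c(0.01, 0.99)) {
  outliers <- boxplot.stats(x)$out                      # values beyond 1.5 * IQR from the hinges
  caps     <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  centre   <- median(x, na.rm = TRUE)
  flagged  <- x %in% outliers
  x[flagged & x < centre] <- caps[1]                    # low outliers  -> lower percentile
  x[flagged & x > centre] <- caps[2]                    # high outliers -> upper percentile
  x
}

# Invented data: three high outliers, two of them below the 99th percentile
profit <- c(1:20, 40, 50, 100)
capped <- cap_outliers(profit)

mean(profit)
mean(capped)
```

Because the two smaller outliers are pulled up to the 99th percentile, the mean of the capped vector is higher than the original mean, which mirrors the pattern in the report.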

plot(Lidl$Profit)

Lidl$Sales <- as.numeric(Lidl$Sales)
Lidl$Sales<-imputate_outlier(Lidl, Sales, method="capping", cap_ntiles = c(0.01, 0.99))

summary(Lidl$Sales)

## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##                     Original    Imputation 
## described_variables "value"     "value"    
## n                   "5901"      "5901"     
## na                  "0"         "0"        
## mean                "265.3456"  "386.9278" 
## sd                  "474.2606"  "705.0553" 
## se_mean             "6.173825"  "9.178260" 
## IQR                 "193.194"   "193.194"  
## skewness            "5.949068"  "2.419518" 
## kurtosis            "54.744949" " 4.088888"
## p00                 "0.836"     "0.836"    
## p01                 "3.282"     "3.282"    
## p05                 "9.248"     "9.248"    
## p10                 "15.992"    "15.992"   
## p20                 "43.13"     "43.13"    
## p25                 "71.976"    "71.976"   
## p30                 "94.036"    "94.036"   
## p40                 "112.97"    "112.97"   
## p50                 "128.648"   "128.648"  
## p60                 "149.6"     "149.6"    
## p70                 "204.8"     "204.8"    
## p75                 "265.17"    "265.17"   
## p80                 "334.568"   "334.568"  
## p90                 " 577.916"  "2396.400" 
## p95                 " 956.6648" "2396.4000"
## p99                 "2396.4"    "2396.4"   
## p100                "9099.93"   "2396.40"

plot(Lidl$Sales)

Lidl$Quantity <- as.numeric(Lidl$Quantity)
Lidl$Quantity<-imputate_outlier(Lidl, Quantity, method="mode")

summary(Lidl$Quantity)

## Impute outliers with mode
## 
## * Information of Imputation (before vs after)
##                     Original     Imputation  
## described_variables "value"      "value"     
## n                   "5901"       "5901"      
## na                  "0"          "0"         
## mean                "3.781901"   "3.639722"  
## sd                  "2.212917"   "1.994826"  
## se_mean             "0.02880728" "0.02596822"
## IQR                 "3"          "3"         
## skewness            "1.2168476"  "0.8622149" 
## kurtosis            "1.75451488" "0.05404482"
## p00                 "1"          "1"         
## p01                 "1"          "1"         
## p05                 "1"          "1"         
## p10                 "2"          "2"         
## p20                 "2"          "2"         
## p25                 "2"          "2"         
## p30                 "2"          "2"         
## p40                 "3"          "3"         
## p50                 "3"          "3"         
## p60                 "4"          "4"         
## p70                 "5"          "4"         
## p75                 "5"          "5"         
## p80                 "5"          "5"         
## p90                 "7"          "7"         
## p95                 "8"          "8"         
## p99                 "10"         " 9"        
## p100                "14"         " 9"

plot(Lidl$Quantity)

Visualizations

Scatter plots of Sales and Profit

Lidl%>%
  filter(Sales<700 & Profit<200)%>%
ggplot(aes(x = Sales,  y = Profit, color=Segment)) +
  geom_point() +
  labs(title = "Scatter plot of Profit vs. Sales by category")+ 
  theme(plot.title = element_text(size = 30))+
  facet_wrap(~Category)

Lidl%>%
  filter(Sales<700 & Profit<200)%>%
ggplot(aes(x = Sales,  y = Profit, color=Category)) +
  geom_point()+
  labs(title = "Scatter plot of Profit vs. Sales by region")+ 
  theme(plot.title = element_text(size = 30))+
  facet_wrap(~Region)

Lidl%>%
  filter(Sales<700 & Profit<200)%>%
ggplot(aes (x=Sales, y= Profit, color=Payment_Mode))+
geom_point()+
  labs(title = "Scatter plot of Profit vs. Sales by shipping mode and payment mode")+ 
  theme(plot.title = element_text(size = 25))+
facet_wrap(~Ship_Mode)

Bar charts

Sales value by category

X<- Lidl %>%
  group_by(Category)%>%
  summarise(Sprzedaż_Kategoria=sum(Sales))%>%
arrange(desc(Sprzedaż_Kategoria))%>%
  head(3)
ggplot(X,aes(x=reorder(Category,-Sprzedaż_Kategoria),y=Sprzedaż_Kategoria,fill=Category))+
  geom_col()+guides(fill = "none")+
  labs(title="Product categories by total sales",x="Category",y="Sales")+
theme(plot.title = element_text(size = 30))

X1 <- Lidl %>%
  group_by(Category)%>%
  summarise(Średnia=mean(Sales))

ggplot(X1,aes(x=reorder(Category, -Średnia),y=Średnia,fill=Category))+
  labs(title="Average sales by category",x="Category",y="Average sales")+
  geom_col()+guides(fill = "none")+
theme(plot.title = element_text(size = 30))

Office Supplies has the largest total sales, whereas Technology has the largest average sales value. A likely reason is that the most items are sold in Office Supplies, while prices are higher in Technology.
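How a category can lead on total sales without leading on average sales is easy to see on a toy example. The numbers below are invented and only illustrate the mechanism.

```r
# Invented toy data: many cheap office items, a few expensive technology items
sales <- data.frame(
  Category = c(rep("Office Supplies", 6), rep("Technology", 2)),
  Sales    = c(20, 30, 25, 40, 35, 50, 90, 80)
)

total_sales <- tapply(sales$Sales, sales$Category, sum)
mean_sales  <- tapply(sales$Sales, sales$Category, mean)

names(which.max(total_sales))  # category with the largest total
names(which.max(mean_sales))   # category with the largest average
```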

X2<- Lidl %>%
  group_by(Sub_Category)%>%
  summarise(Sprzedaż_Podkategoria=sum(Sales))%>%
arrange(desc(Sprzedaż_Podkategoria))%>%
  head()
ggplot(X2,aes(x=reorder(Sub_Category,-Sprzedaż_Podkategoria),y=Sprzedaż_Podkategoria,fill=Sub_Category))+
  geom_col()+guides(fill = "none")+
  labs(title="Product subcategories with the largest total sales",x="Product subcategory",y="Sales")+
theme(plot.title = element_text(size = 30))

The largest total sales are observed for the subcategories Phones, Chairs, and Storage.

X3 <- Lidl%>%
  group_by(Sub_Category)%>%
  summarise(Sprzedaż=mean(Sales))%>%
arrange(desc(Sprzedaż))%>%  # sort by mean sales rather than by subcategory name
  head(8)
ggplot(X3,aes(x=reorder(Sub_Category,-Sprzedaż),y=Sprzedaż,fill=Sub_Category))+
  geom_col()+guides(fill = "none")+
labs(title="Average sales value by subcategory",x="Subcategory",y="Average sales")+
theme(plot.title = element_text(size = 30))

The largest total sales come from the subcategories Phones, Chairs, and Storage, whereas, according to the chart above, the largest average sales value is found in Machines and Tables.

X4 <- Lidl%>%
  select(Ship_Mode)%>%
  group_by_all()%>%
  count(sort=TRUE)

ggplot(X4,aes(x=reorder(Ship_Mode,-n),y=n,fill=Ship_Mode))+
  geom_col()+guides(fill = "none") +
  labs(title = "Number of transactions by shipping mode",x="Shipping mode",y="Number of transactions")+
theme(plot.title = element_text(size = 30))

Most transactions use Standard Class as the shipping mode.

X5 <- Lidl %>%
  count(Payment_Mode,sort=TRUE)  # count transactions; wt=Sales would sum sales value instead
ggplot(X5,aes(x=(reorder(Payment_Mode,-n)),y=n,fill=Payment_Mode))+
  geom_col()+guides(fill = "none")+
  labs(title = "Payment modes",x="Payment mode",y="Number of transactions")+
theme(plot.title = element_text(size = 30))

The most frequently chosen payment mode is COD (cash on delivery).

X6 <- Lidl %>%
  count(Segment,wt=Sales,sort=TRUE)

ggplot(X6,aes(x=reorder(Segment,-n),y=n,fill=Segment))+
  geom_col()+guides(fill = "none")+
  labs(title = "Total sales by segment",x="Segment",y="Sales value")+
theme(plot.title = element_text(size = 30))

The Consumer segment dominates.
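In `count()`, the `wt` argument changes what `n` means: without it, `n` is the number of rows per group, and with `wt = Sales` it is the sum of Sales per group. A short sketch on invented data:

```r
library(dplyr)

# Invented toy data
sales <- data.frame(Segment = c("Consumer", "Consumer", "Corporate"),
                    Sales   = c(100, 50, 400))

n_rows    <- count(sales, Segment)              # n = number of rows per segment
sum_sales <- count(sales, Segment, wt = Sales)  # n = sum of Sales per segment
```

In this toy example Consumer has more rows, but Corporate has the larger sales total, so the two charts would rank the segments differently.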

X9<- Lidl %>%
  group_by(City)%>%
  summarise(Sprzedaż_Miasta=sum(Sales))%>%
arrange(desc(Sprzedaż_Miasta))%>%
  head(8)
ggplot(X9,aes(x=reorder(City,-Sprzedaż_Miasta),y=Sprzedaż_Miasta,fill=City))+
  geom_col()+guides(fill = "none")+
  labs(title="Cities with the largest total sales",x="City",y="Sales")+
theme(plot.title = element_text(size = 30))

The highest sales were recorded in New York City, Los Angeles, and Seattle.

X10<- Lidl %>%
  group_by(State)%>%
  summarise(Sprzedaż_State=sum(Sales))%>%
arrange(desc(Sprzedaż_State))%>%
  head(8)
ggplot(X10,aes(x=reorder(State,-Sprzedaż_State),y=Sprzedaż_State,fill=State))+
  geom_col()+guides(fill = "none")+
  labs(title="States with the largest total sales",x="State",y="Sales")+
theme(plot.title = element_text(size = 30))

The highest sales were recorded in California, New York, and Texas.

X11<- Lidl %>%
  group_by(Sub_Category)%>%
  summarise(Zwroty_Podkategoria=sum(Returns))%>%
arrange(desc(Zwroty_Podkategoria))%>%
  head(5)
ggplot(X11,aes(x=reorder(Sub_Category,-Zwroty_Podkategoria),y=Zwroty_Podkategoria,fill=Sub_Category))+
  geom_col()+guides(fill = "none")+
  labs(title="Product subcategories with the most returns",x="Product subcategory",y="Returns")+
theme(plot.title = element_text(size = 30))

The most returns occur in the subcategories Binders, Paper, and Furnishings.

Bubble chart

Lidl_stany <- Lidl %>%
  group_by(Region) %>%
  summarize(Total_Sales = sum(Sales),
            Total_quantity = sum(Quantity),
            Average_Profit = mean(Profit,na.rm = TRUE),
            Average_Profit_in_proc. = paste0(
              round(100*sum(Profit,na.rm = TRUE)/sum(Sales),2),'%'))
print(Lidl_stany)

## # A tibble: 4 × 5
##   Region  Total_Sales Total_quantity Average_Profit Average_Profit_in_proc.
##   <chr>         <dbl>          <dbl>          <dbl> <chr>                  
## 1 Central     502748.           5078           97.4 18.62%                 
## 2 East        630690.           6022          102.  22.05%                 
## 3 South       382642.           3451          102.  20.65%                 
## 4 West        767181.           6927           88.5 19.68%
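The column `Average_Profit_in_proc.` is computed as total profit divided by total sales, so it is a profit margin rather than an average. With invented numbers, the two measures work out as below. Note that the margin's denominator includes the sales of records whose profit is missing.

```r
# Invented toy data; NA marks a profit value that was set to missing
sales  <- c(100, 200, 300)
profit <- c(20, NA, 40)

# Average profit over the records where profit is known
average_profit <- mean(profit, na.rm = TRUE)

# Profit margin: known profit divided by ALL sales, including rows with missing profit
margin_pct <- round(100 * sum(profit, na.rm = TRUE) / sum(sales), 2)
```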

ggplot(Lidl_stany, aes(x = Total_quantity, y = Total_Sales, size = Average_Profit, color = Region)) +
  geom_point(alpha = 1) +
  scale_size_continuous(range = c(4, 8)) +
  labs(title = "Bubble chart for Lidl_states data",
       x = "Total sold quantity in pcs.",
       y = "Total sales in $",
       size = "Mean profit in $") +
  theme_minimal() +
  scale_color_brewer(palette = "RdBu") +
  guides(color = guide_legend(title = "Region")) +
  scale_x_continuous(limits = c(3000, 8000)) +
  scale_y_continuous(
    limits = c(200000, 800000),
    labels = label_number(accuracy = 1e3)  # Round the axis labels to the nearest 1,000
  ) +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  geom_text(aes(label = round(Average_Profit, 2)), check_overlap = TRUE, vjust = 1.5, show.legend = FALSE)

The West region had the highest sales value and the largest quantity of products sold, yet its average profit was the lowest of all regions. The highest average profit was recorded in the South region (about 102, practically level with East), where both sales value and quantity sold were the lowest.

Descriptive statistics

By category

Lidl%>%
  group_by(Category)%>%
  filter(Sales<700 & Profit<200)%>%
  summarize(`Total sales`=sum(Sales),
            `Mean sales`=mean(Sales),
            `Median sales`=median(Sales),
            `Minimum sales`=min(Sales),
            `Maximum sales`=max(Sales))%>%
  arrange(desc(`Total sales`)) %>%  # backticks refer to the column; a quoted string would not sort
  kbl()%>%
  kable_styling(bootstrap_options = c("striped", "hover","responsive"),position="center")

Category	Total sales	Mean sales	Median sales	Minimum sales	Maximum sales
Office Supplies	299073.1	108.5171	107.212	1.344	552.940
Technology	115230.7	175.9248	144.960	1.980	529.768
Furniture	110852.0	172.3982	136.464	4.180	544.008

In terms of value, the most products were sold in the Office Supplies category, but the highest mean sales value is in Technology. Technology also has the highest median (144.96), ahead of Furniture (136.46).
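Column names that contain spaces have to be referenced with backticks inside `arrange()`. A quoted string is treated as a constant, so the rows keep their existing order. A sketch with an invented table shows the difference:

```r
library(dplyr)

# Invented toy summary table with a column name that contains a space
tab <- tibble(Category = c("A", "B", "C"),
              `Total sales` = c(1, 3, 2))

# A quoted string is a constant, so the row order is left unchanged
unsorted <- tab %>% arrange(desc("Total sales"))

# Backticks refer to the column, so rows are sorted from largest to smallest
sorted <- tab %>% arrange(desc(`Total sales`))
```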

By region

Lidl%>%
  group_by(Region)%>%
  filter(Sales<700 & Profit<200)%>%
  summarize(`Total sales`=sum(Sales),
            `Mean sales`=mean(Sales),
            `Median sales`=median(Sales),
            `Minimum sales`=min(Sales),
            `Maximum sales`=max(Sales))%>%
  arrange(desc(`Total sales`)) %>%  # backticks refer to the column; a quoted string would not sort
  kbl()%>%
  kable_styling(bootstrap_options = c("striped", "hover","responsive"),position="center")

Region	Total sales	Mean sales	Median sales	Minimum sales	Maximum sales
West	192835.55	133.9136	116.770	1.408	544.008
East	142626.18	123.2724	114.340	1.504	552.560
Central	105433.49	129.0496	113.984	1.344	541.340
South	84260.64	131.6573	114.812	2.610	552.940

In terms of sales, the West region dominates, with the highest total, mean, and median sales.

By Ship Mode

Lidl%>%
  group_by(Ship_Mode)%>%
  filter(Sales<700 & Profit<200)%>%
  summarize(`Total sales`=sum(Sales),
            `Mean sales`=mean(Sales),
            `Median sales`=median(Sales),
            `Minimum sales`=min(Sales),
            `Maximum sales`=max(Sales))%>%
  arrange(desc(`Total sales`)) %>%  # backticks refer to the column; a quoted string would not sort
  kbl()%>%
  kable_styling(bootstrap_options = c("striped", "hover","responsive"),position="center")

Ship_Mode	Total sales	Mean sales	Median sales	Minimum sales	Maximum sales
Standard Class	306591.75	130.5757	114.390	1.344	552.56
Second Class	106808.41	132.3524	117.784	2.632	552.94
First Class	83959.81	126.8275	115.748	1.680	536.88
Same Day	27795.88	117.2822	114.060	1.810	457.48

In terms of total sales value, Standard Class clearly dominated as the shipping mode. The highest mean sales (132.35) and the highest median sales (117.78) were both observed for Second Class, while Same Day had the lowest mean (117.28).

Correlation analysis

wybrane<- select(Lidl, Sales, Quantity, Profit, Returns)
M<-cor(wybrane, use = "complete.obs")
head(round(M,2))

##          Sales Quantity Profit Returns
## Sales     1.00     0.18   0.64    0.02
## Quantity  0.18     1.00   0.20    0.00
## Profit    0.64     0.20   1.00   -0.01
## Returns   0.02     0.00  -0.01    1.00

The strongest positive correlation is between Sales and Profit. The value of 0.64 indicates a fairly strong correlation: higher sales tend to go together with higher profit. Quantity is much more weakly correlated with both sales (0.18) and profit (0.20). The number of returns is practically uncorrelated with the other variables.
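Because Profit contains missing values, `use = "complete.obs"` drops every row with an NA before the correlations are computed. A toy illustration with invented numbers:

```r
# Invented toy data; profit has one missing value
sales  <- c(10, 20, 30, 40, 50)
profit <- c(1, NA, 3, 4, 5)

# Rows with any NA are dropped before the correlation is computed
M_toy <- cor(cbind(sales, profit), use = "complete.obs")
round(M_toy, 2)
```

In the real data this means that the correlations are based only on the records whose Profit is not missing.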

corrplot(M, method = "circle")

corrplot(M, method = "pie")

corrplot(M, "color")

corrplot(M, method = "number")

corrplot(M, type = "upper")

Statistical tests

ggbetweenstats(data = Lidl, x= Payment_Mode, y=Sales, title="Sales by payment mode")

p-value = 0.81: the differences in sales between payment modes are not statistically significant, so there is no evidence that the choice of payment mode is related to sales value.
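For three or more groups, `ggbetweenstats()` with its default settings reports, to the best of our knowledge, Welch's one-way ANOVA. The base R sketch below runs the same kind of test on simulated data. The data frame `toy` and its values are invented and unrelated to the project data.

```r
set.seed(1)

# Simulated data: sales drawn from the same distribution for every payment mode
toy <- data.frame(
  Payment_Mode = rep(c("COD", "Online", "Cards"), each = 30),
  Sales        = rexp(90, rate = 1 / 250)
)

# Welch's one-way ANOVA (no assumption of equal variances)
welch <- oneway.test(Sales ~ Payment_Mode, data = toy, var.equal = FALSE)
welch$p.value
```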

ggbetweenstats(data = Lidl, x= Segment, y= Sales, title="Sales by segment")

p-value = 0.01: the differences in sales between segments are statistically significant, so sales vary by segment.

ggbetweenstats(data  =Lidl, x = Category, y = Sales, title="Differences in sales between categories")

p-value ≈ 0 (after rounding), which means that the differences in sales between categories are highly significant.

ggbarstats(data=Lidl, x=Category, y=Ship_Mode, title="Shipping mode by product category")

p-value = 0.94: not statistically significant, so there is no evidence of an association between shipping mode and product category.
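For two categorical variables, the reported p-value comes from a test of independence on the contingency table. A base R sketch using Pearson's chi-squared test follows, with invented counts that are not taken from the project data.

```r
# Invented contingency table of order counts: shipping mode by product category
counts <- matrix(c(30, 60, 25,
                   32, 58, 27),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Ship_Mode = c("First Class", "Standard Class"),
                                 Category  = c("Furniture", "Office Supplies", "Technology")))

# Pearson's chi-squared test of independence
test <- chisq.test(counts)
test$p.value
```

A large p-value here would mean that the distribution of categories looks similar across shipping modes.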

ggbarstats(data= Lidl, x = Payment_Mode, y= Delivery_time, title= "Delivery time by payment mode", grouping.var = Category, package= "wesanderson", palette= "Darjeeling2")

p-value = 0.007, which means that delivery time differs significantly by payment mode.

ggbarstats(data= Lidl, x = Ship_Mode, y= Delivery_time, title= "Delivery time by shipping mode", grouping.var = Category, package= "wesanderson", palette= "Darjeeling2")

p-value ≈ 0 (after rounding), which means that delivery time differs significantly by shipping mode.

ggbarstats(data= Lidl, x= Returns,y= Ship_Mode, title= "Returns by shipping mode", grouping.var = Category, package= "wesanderson",palette= "Darjeeling2")

p-value = 0.005, so the association between the number of returns and the shipping mode is statistically significant.

Summary and final conclusions

The analysis led us to the following conclusions. Sales differ significantly by segment: the highest sales value is in the Consumer segment, followed by Corporate, and the lowest is in Home Office. Sales also differ significantly by product category. The highest total sales come from Office Supplies, whereas the highest average sales value occurs in Technology. Delivery time differs significantly depending on the payment mode chosen by the customer, as well as on the shipping mode. Most customers chose cash on delivery and Standard Class shipping. There was also a significant association between the number of returns and the shipping mode chosen by the customer. In percentage terms, customers who chose First Class shipping returned the most. Finally, no significant relationship was found between sales value and payment mode, or between shipping mode and the category of the ordered product.