US Superstore Analysis

1 Preface

Role Play

This exploratory data analysis is set to role play as i am being assigned to act as data scientist to analyze US retail superstore business performance. The superstore is named BigBuzz, a unicorn startup in retail business, currently gained 500 Mio USD of valuation and 75 Mio USD of Funding.

Data Source

Dataset being used was taken from https://www.kaggle.com/datasets/vivek468/superstore-dataset-final and combined with lattitude and longitude to point the location of each US states.

2 Glimpse of The Data

As we always knpw that we need to read the data and place it to an object, so it would be easier to use for the next steps.

bigBuzz  <- read.csv("Superstore.csv")

Any Blank Informations?

anyNA(bigBuzz)
#> [1] FALSE

We All Good!

Columns name

names(bigBuzz)
#>  [1] "Row.ID"        "Order.ID"      "Order.Date"    "Ship.Date"    
#>  [5] "Ship.Mode"     "Customer.ID"   "Customer.Name" "Segment"      
#>  [9] "Country"       "City"          "State"         "Lattitude"    
#> [13] "Longitude"     "Postal.Code"   "Region"        "Product.ID"   
#> [17] "Category"      "Sub.Category"  "Product.Name"  "Sales"        
#> [21] "Quantity"      "Discount"      "Profit"

number of rows

dim(bigBuzz)
#> [1] 9994   23

Heads and Tails

It’s not a coin toss! It’s about the head of first 6 and last 6 rows of the Data

head(bigBuzz)
#>   Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
#> 1      1 CA-2016-152156 11/08/2016 11/11/2016   Second Class    CG-12520
#> 2      2 CA-2016-152156 11/08/2016 11/11/2016   Second Class    CG-12520
#> 3      3 CA-2016-138688 06/12/2016  6/16/2016   Second Class    DV-13045
#> 4      4 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
#> 5      5 US-2015-108966 10/11/2015 10/18/2015 Standard Class    SO-20335
#> 6      6 CA-2014-115812 06/09/2014  6/14/2014 Standard Class    BH-11710
#>     Customer.Name   Segment       Country            City      State Lattitude
#> 1     Claire Gute  Consumer United States       Henderson   Kentucky  37.83933
#> 2     Claire Gute  Consumer United States       Henderson   Kentucky  37.83933
#> 3 Darrin Van Huff Corporate United States     Los Angeles California  36.77826
#> 4  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida  27.66483
#> 5  Sean O'Donnell  Consumer United States Fort Lauderdale    Florida  27.66483
#> 6 Brosina Hoffman  Consumer United States     Los Angeles California  36.77826
#>    Longitude Postal.Code Region      Product.ID        Category Sub.Category
#> 1  -84.27002       42420  South FUR-BO-10001798       Furniture    Bookcases
#> 2  -84.27002       42420  South FUR-CH-10000454       Furniture       Chairs
#> 3 -119.41793       90036   West OFF-LA-10000240 Office Supplies       Labels
#> 4  -81.51575       33311  South FUR-TA-10000577       Furniture       Tables
#> 5  -81.51575       33311  South OFF-ST-10000760 Office Supplies      Storage
#> 6 -119.41793       90032   West FUR-FU-10001487       Furniture  Furnishings
#>                                                       Product.Name    Sales
#> 1                                Bush Somerset Collection Bookcase 261.9600
#> 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
#> 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
#> 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
#> 5                                   Eldon Fold 'N Roll Cart System  22.3680
#> 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
#>   Quantity Discount    Profit
#> 1        2     0.00   41.9136
#> 2        3     0.00  219.5820
#> 3        2     0.00    6.8714
#> 4        5     0.45 -383.0310
#> 5        2     0.20    2.5164
#> 6        7     0.00   14.1694
tail(bigBuzz)
#>      Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
#> 9989   9989 CA-2017-163629 11/17/2017 11/21/2017 Standard Class    RA-19885
#> 9990   9990 CA-2014-110422  1/21/2014  1/23/2014   Second Class    TB-21400
#> 9991   9991 CA-2017-121258  2/26/2017 03/03/2017 Standard Class    DB-13060
#> 9992   9992 CA-2017-121258  2/26/2017 03/03/2017 Standard Class    DB-13060
#> 9993   9993 CA-2017-121258  2/26/2017 03/03/2017 Standard Class    DB-13060
#> 9994   9994 CA-2017-119914 05/04/2017 05/09/2017   Second Class    CC-12220
#>         Customer.Name   Segment       Country        City      State Lattitude
#> 9989     Ruben Ausman Corporate United States      Athens    Georgia  32.15743
#> 9990 Tom Boeckenhauer  Consumer United States       Miami    Florida  27.66483
#> 9991      Dave Brooks  Consumer United States  Costa Mesa California  36.77826
#> 9992      Dave Brooks  Consumer United States  Costa Mesa California  36.77826
#> 9993      Dave Brooks  Consumer United States  Costa Mesa California  36.77826
#> 9994     Chris Cortes  Consumer United States Westminster California  36.77826
#>       Longitude Postal.Code Region      Product.ID        Category Sub.Category
#> 9989  -82.90712       30605  South TEC-PH-10004006      Technology       Phones
#> 9990  -81.51575       33180  South FUR-FU-10001889       Furniture  Furnishings
#> 9991 -119.41793       92627   West FUR-FU-10000747       Furniture  Furnishings
#> 9992 -119.41793       92627   West TEC-PH-10003645      Technology       Phones
#> 9993 -119.41793       92627   West OFF-PA-10004041 Office Supplies        Paper
#> 9994 -119.41793       92683   West OFF-AP-10002684 Office Supplies   Appliances
#>                                                                   Product.Name
#> 9989                                           Panasonic KX - TS880B Telephone
#> 9990                                                    Ultra Door Pull Handle
#> 9991                        Tenex B1-RE Series Chair Mats for Low Pile Carpets
#> 9992                                                     Aastra 57i VoIP phone
#> 9993                         It's Hot Message Books with Stickers, 2 3/4" x 5"
#> 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
#>        Sales Quantity Discount  Profit
#> 9989 206.100        5      0.0 55.6470
#> 9990  25.248        3      0.2  4.1028
#> 9991  91.960        2      0.0 15.6332
#> 9992 258.576        2      0.2 19.3932
#> 9993  29.600        4      0.0 13.3200
#> 9994 243.160        2      0.0 72.9480
str(bigBuzz)
#> 'data.frame':    9994 obs. of  23 variables:
#>  $ Row.ID       : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date   : chr  "11/08/2016" "11/08/2016" "06/12/2016" "10/11/2015" ...
#>  $ Ship.Date    : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#>  $ Ship.Mode    : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Customer.Name: chr  "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
#>  $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Country      : chr  "United States" "United States" "United States" "United States" ...
#>  $ City         : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
#>  $ State        : chr  "Kentucky" "Kentucky" "California" "Florida" ...
#>  $ Lattitude    : num  37.8 37.8 36.8 27.7 27.7 ...
#>  $ Longitude    : num  -84.3 -84.3 -119.4 -81.5 -81.5 ...
#>  $ Postal.Code  : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
#>  $ Region       : chr  "South" "South" "West" "South" ...
#>  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category     : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category : chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...

Summary

  • The Data consist of 9,994 rows and 23 columns.
  • Some of columns data type need to be adjusted in order to create correct analysis

3 About The Data

Brief Summary

summary(bigBuzz)
#>      Row.ID       Order.ID          Order.Date         Ship.Date        
#>  Min.   :   1   Length:9994        Length:9994        Length:9994       
#>  1st Qu.:2499   Class :character   Class :character   Class :character  
#>  Median :4998   Mode  :character   Mode  :character   Mode  :character  
#>  Mean   :4998                                                           
#>  3rd Qu.:7496                                                           
#>  Max.   :9994                                                           
#>   Ship.Mode         Customer.ID        Customer.Name        Segment         
#>  Length:9994        Length:9994        Length:9994        Length:9994       
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>    Country              City              State             Lattitude    
#>  Length:9994        Length:9994        Length:9994        Min.   :27.66  
#>  Class :character   Class :character   Class :character   1st Qu.:35.76  
#>  Mode  :character   Mode  :character   Mode  :character   Median :37.96  
#>                                                           Mean   :38.52  
#>                                                           3rd Qu.:41.60  
#>                                                           Max.   :47.75  
#>    Longitude        Postal.Code       Region           Product.ID       
#>  Min.   :-120.74   Min.   : 1040   Length:9994        Length:9994       
#>  1st Qu.:-119.42   1st Qu.:23223   Class :character   Class :character  
#>  Median : -89.40   Median :56431   Mode  :character   Mode  :character  
#>  Mean   : -94.77   Mean   :55190                                        
#>  3rd Qu.: -78.66   3rd Qu.:90008                                        
#>  Max.   : -69.45   Max.   :99301                                        
#>    Category         Sub.Category       Product.Name           Sales          
#>  Length:9994        Length:9994        Length:9994        Min.   :    0.444  
#>  Class :character   Class :character   Class :character   1st Qu.:   17.280  
#>  Mode  :character   Mode  :character   Mode  :character   Median :   54.490  
#>                                                           Mean   :  229.858  
#>                                                           3rd Qu.:  209.940  
#>                                                           Max.   :22638.480  
#>     Quantity        Discount          Profit         
#>  Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :14.00   Max.   :0.8000   Max.   : 8399.976

This summary looks Wrong, as I mentioned Before, some of the columns need to be corrected. So, here’s what we do :

bigBuzz$Row.ID <- as.character(bigBuzz$Row.ID)
bigBuzz$Order.ID<- as.character (bigBuzz$Order.ID)
bigBuzz$Order.Date<- as.Date(bigBuzz$Order.Date, "%m/%d/%y")
bigBuzz$Ship.Date <- as.Date(bigBuzz$Ship.Date, "%m/%d/%y")
bigBuzz$Customer.ID<-as.character(bigBuzz$Customer.ID)
bigBuzz$Product.ID<- as.character(bigBuzz$Product.ID)
bigBuzz$Product.Name<-as.factor(bigBuzz$Product.Name)
bigBuzz$Category <- as.factor(bigBuzz$Category)
bigBuzz$State <- as.factor(bigBuzz$State)
bigBuzz$City <- as.factor(bigBuzz$City)
bigBuzz$Sub.Category <- as.factor(bigBuzz$Sub.Category)
bigBuzz$Segment<- as.factor(bigBuzz$Segment)
bigBuzz$Ship.Mode <-  as.factor(bigBuzz$Ship.Mode)

The results? Check out the next tab!.

Corrected Brief Summary

These are how the summary has to be look like:

summary(bigBuzz)
#>     Row.ID            Order.ID           Order.Date        
#>  Length:9994        Length:9994        Min.   :2020-01-01  
#>  Class :character   Class :character   1st Qu.:2020-05-20  
#>  Mode  :character   Mode  :character   Median :2020-09-02  
#>                                        Mean   :2020-08-09  
#>                                        3rd Qu.:2020-11-08  
#>                                        Max.   :2020-12-31  
#>                                                            
#>    Ship.Date                   Ship.Mode    Customer.ID       
#>  Min.   :2020-01-01   First Class   :1538   Length:9994       
#>  1st Qu.:2020-05-18   Same Day      : 543   Class :character  
#>  Median :2020-09-01   Second Class  :1945   Mode  :character  
#>  Mean   :2020-08-08   Standard Class:5968                     
#>  3rd Qu.:2020-11-09                                           
#>  Max.   :2020-12-31                                           
#>                                                               
#>  Customer.Name             Segment       Country                     City     
#>  Length:9994        Consumer   :5191   Length:9994        New York City: 915  
#>  Class :character   Corporate  :3020   Class :character   Los Angeles  : 747  
#>  Mode  :character   Home Office:1783   Mode  :character   Philadelphia : 537  
#>                                                           San Francisco: 510  
#>                                                           Seattle      : 428  
#>                                                           Houston      : 377  
#>                                                           (Other)      :6480  
#>           State        Lattitude       Longitude        Postal.Code   
#>  California  :2001   Min.   :27.66   Min.   :-120.74   Min.   : 1040  
#>  New York    :1128   1st Qu.:35.76   1st Qu.:-119.42   1st Qu.:23223  
#>  Texas       : 985   Median :37.96   Median : -89.40   Median :56431  
#>  Pennsylvania: 587   Mean   :38.52   Mean   : -94.77   Mean   :55190  
#>  Washington  : 506   3rd Qu.:41.60   3rd Qu.: -78.66   3rd Qu.:90008  
#>  Illinois    : 492   Max.   :47.75   Max.   : -69.45   Max.   :99301  
#>  (Other)     :4295                                                    
#>     Region           Product.ID                   Category   
#>  Length:9994        Length:9994        Furniture      :2121  
#>  Class :character   Class :character   Office Supplies:6026  
#>  Mode  :character   Mode  :character   Technology     :1847  
#>                                                              
#>                                                              
#>                                                              
#>                                                              
#>       Sub.Category                      Product.Name      Sales          
#>  Binders    :1523   Staple envelope           :  48   Min.   :    0.444  
#>  Paper      :1370   Easy-staple paper         :  46   1st Qu.:   17.280  
#>  Furnishings: 957   Staples                   :  46   Median :   54.490  
#>  Phones     : 889   Avery Non-Stick Binders   :  20   Mean   :  229.858  
#>  Storage    : 846   Staples in misc. colors   :  19   3rd Qu.:  209.940  
#>  Art        : 796   KI Adjustable-Height Table:  18   Max.   :22638.480  
#>  (Other)    :3613   (Other)                   :9797                      
#>     Quantity        Discount          Profit         
#>  Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
#> 

What can wee see?

    1. First order and ship was January 1st 2020
    1. Standard clas is the most selected shipping method (5,968)
    1. Consumer segment dominated with 5.191 transactions
    1. New York city is the city with the most shipping frequency
    1. Office Supply is the best selling category
    1. Binder, paper and furnishings are top 3 grossing products
    1. Average sales ticket is 229,8 $ with 3 items being purchased

4 Some Business Findings

  • 1. Which category and sub category yields the lowest and highest sales?
bigBuzz[bigBuzz$Sales == 0.444,]
#>      Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
#> 4102   4102 US-2017-102288 2020-06-19 2020-06-23 Standard Class    ZC-21910
#>         Customer.Name  Segment       Country    City State Lattitude Longitude
#> 4102 Zuschuss Carroll Consumer United States Houston Texas   31.9686 -99.90181
#>      Postal.Code  Region      Product.ID        Category Sub.Category
#> 4102       77095 Central OFF-AP-10002906 Office Supplies   Appliances
#>                                                                    Product.Name
#> 4102 Hoover Replacement Belt for Commercial Guardsman Heavy-Duty Upright Vacuum
#>      Sales Quantity Discount Profit
#> 4102 0.444        1      0.8  -1.11
bigBuzz[bigBuzz$Sales == 22638.480 ,]
#>      Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
#> 2698   2698 CA-2014-145317 2020-03-18 2020-03-23 Standard Class    SM-20320
#>      Customer.Name     Segment       Country         City   State Lattitude
#> 2698   Sean Miller Home Office United States Jacksonville Florida  27.66483
#>      Longitude Postal.Code Region      Product.ID   Category Sub.Category
#> 2698 -81.51575       32216  South TEC-MA-10002412 Technology     Machines
#>                                               Product.Name    Sales Quantity
#> 2698 Cisco TelePresence System EX90 Videoconferencing Unit 22638.48        6
#>      Discount    Profit
#> 2698      0.5 -1811.078
  • 2 How’s the segment related to sales in each category?
xtabs(Quantity~Segment+Category,bigBuzz)
#>              Category
#> Segment       Furniture Office Supplies Technology
#>   Consumer         4166           11758       3597
#>   Corporate        2495            7018       2095
#>   Home Office      1367            4130       1247
  • 3. Median of sales and profit for every segment and category
df1 <-  as.data.frame(aggregate(Sales~Segment+Category,bigBuzz,median))
df1
#>       Segment        Category    Sales
#> 1    Consumer       Furniture 183.9680
#> 2   Corporate       Furniture 190.8200
#> 3 Home Office       Furniture 148.2725
#> 4    Consumer Office Supplies  26.6320
#> 5   Corporate Office Supplies  28.8030
#> 6 Home Office Office Supplies  28.7520
#> 7    Consumer      Technology 159.9840
#> 8   Corporate      Technology 159.9840
#> 9 Home Office      Technology 199.9800
df2 <-  as.data.frame(aggregate(Profit~Segment+Category,bigBuzz,median))
df2
#>       Segment        Category  Profit
#> 1    Consumer       Furniture  7.1060
#> 2   Corporate       Furniture  7.8031
#> 3 Home Office       Furniture  8.6137
#> 4    Consumer Office Supplies  6.7236
#> 5   Corporate Office Supplies  7.0218
#> 6 Home Office Office Supplies  7.0659
#> 7    Consumer      Technology 25.1860
#> 8   Corporate      Technology 23.3870
#> 9 Home Office      Technology 25.9441
  • 4.Does shipping modes mean something?
df3 <- as.data.frame(xtabs(Profit~Segment+Ship.Mode,bigBuzz))
df3
#>        Segment      Ship.Mode      Freq
#> 1     Consumer    First Class 21374.044
#> 2    Corporate    First Class 14464.472
#> 3  Home Office    First Class 13131.324
#> 4     Consumer       Same Day  9874.205
#> 5    Corporate       Same Day  1818.142
#> 6  Home Office       Same Day  4199.412
#> 7     Consumer   Second Class 24946.911
#> 8    Corporate   Second Class 18225.713
#> 9  Home Office   Second Class 14274.011
#> 10    Consumer Standard Class 77924.049
#> 11   Corporate Standard Class 57470.807
#> 12 Home Office Standard Class 28693.932
df4 <- as.data.frame(xtabs(Profit~Category+Ship.Mode,bigBuzz))
df4
#>           Category      Ship.Mode       Freq
#> 1        Furniture    First Class  3066.9474
#> 2  Office Supplies    First Class 18400.3291
#> 3       Technology    First Class 27502.5634
#> 4        Furniture       Same Day   797.3484
#> 5  Office Supplies       Same Day  6423.5192
#> 6       Technology       Same Day  8670.8913
#> 7        Furniture   Second Class  4226.2614
#> 8  Office Supplies   Second Class 27068.1676
#> 9       Technology   Second Class 26152.2064
#> 10       Furniture Standard Class 10360.7156
#> 11 Office Supplies Standard Class 70598.7849
#> 12      Technology Standard Class 83129.2870
  • 5. Any Loss Suffered?
bigBuzz[bigBuzz$Profit==-6599.978,]
#>      Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
#> 7773   7773 CA-2016-108196 2020-11-25 2020-12-02 Standard Class    CS-12505
#>      Customer.Name  Segment       Country      City State Lattitude Longitude
#> 7773 Cindy Stewart Consumer United States Lancaster  Ohio  40.41729 -82.90712
#>      Postal.Code Region      Product.ID   Category Sub.Category
#> 7773       43130   East TEC-MA-10000418 Technology     Machines
#>                                   Product.Name    Sales Quantity Discount
#> 7773 Cubify CubeX 3D Printer Double Head Print 4499.985        5      0.7
#>         Profit
#> 7773 -6599.978
  • 6. How many transactions were in loss?
number_of_loss = nrow(bigBuzz[bigBuzz$Profit< 0,])
number_of_loss 
#> [1] 1871

5 What Should We Do?

    1. Discount should be more careful, 1,871 transactions were at loss.
    1. Need to Increase the sales amount on corporate and home oofice.
    1. Most shippings were to New Yorl, Los Angeles and Philadelphia, could use theese city to increase sale, meanwhile investigaing on the least shippings areas.