US Superstore Analysis
1 Preface
Role Play
This exploratory data analysis is set to role play as i am being assigned to act as data scientist to analyze US retail superstore business performance. The superstore is named BigBuzz, a unicorn startup in retail business, currently gained 500 Mio USD of valuation and 75 Mio USD of Funding.
Data Source
Dataset being used was taken from https://www.kaggle.com/datasets/vivek468/superstore-dataset-final and combined with lattitude and longitude to point the location of each US states.
2 Glimpse of The Data
As we always knpw that we need to read the data and place it to an object, so it would be easier to use for the next steps.
bigBuzz <- read.csv("Superstore.csv")Any Blank Informations?
anyNA(bigBuzz)#> [1] FALSE
We All Good!
Columns name
names(bigBuzz)#> [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date"
#> [5] "Ship.Mode" "Customer.ID" "Customer.Name" "Segment"
#> [9] "Country" "City" "State" "Lattitude"
#> [13] "Longitude" "Postal.Code" "Region" "Product.ID"
#> [17] "Category" "Sub.Category" "Product.Name" "Sales"
#> [21] "Quantity" "Discount" "Profit"
number of rows
dim(bigBuzz)#> [1] 9994 23
Heads and Tails
It’s not a coin toss! It’s about the head of first 6 and last 6 rows of the Data
head(bigBuzz)#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 1 1 CA-2016-152156 11/08/2016 11/11/2016 Second Class CG-12520
#> 2 2 CA-2016-152156 11/08/2016 11/11/2016 Second Class CG-12520
#> 3 3 CA-2016-138688 06/12/2016 6/16/2016 Second Class DV-13045
#> 4 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
#> 5 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
#> 6 6 CA-2014-115812 06/09/2014 6/14/2014 Standard Class BH-11710
#> Customer.Name Segment Country City State Lattitude
#> 1 Claire Gute Consumer United States Henderson Kentucky 37.83933
#> 2 Claire Gute Consumer United States Henderson Kentucky 37.83933
#> 3 Darrin Van Huff Corporate United States Los Angeles California 36.77826
#> 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida 27.66483
#> 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida 27.66483
#> 6 Brosina Hoffman Consumer United States Los Angeles California 36.77826
#> Longitude Postal.Code Region Product.ID Category Sub.Category
#> 1 -84.27002 42420 South FUR-BO-10001798 Furniture Bookcases
#> 2 -84.27002 42420 South FUR-CH-10000454 Furniture Chairs
#> 3 -119.41793 90036 West OFF-LA-10000240 Office Supplies Labels
#> 4 -81.51575 33311 South FUR-TA-10000577 Furniture Tables
#> 5 -81.51575 33311 South OFF-ST-10000760 Office Supplies Storage
#> 6 -119.41793 90032 West FUR-FU-10001487 Furniture Furnishings
#> Product.Name Sales
#> 1 Bush Somerset Collection Bookcase 261.9600
#> 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
#> 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
#> 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
#> 5 Eldon Fold 'N Roll Cart System 22.3680
#> 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
#> Quantity Discount Profit
#> 1 2 0.00 41.9136
#> 2 3 0.00 219.5820
#> 3 2 0.00 6.8714
#> 4 5 0.45 -383.0310
#> 5 2 0.20 2.5164
#> 6 7 0.00 14.1694
tail(bigBuzz)#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 9989 9989 CA-2017-163629 11/17/2017 11/21/2017 Standard Class RA-19885
#> 9990 9990 CA-2014-110422 1/21/2014 1/23/2014 Second Class TB-21400
#> 9991 9991 CA-2017-121258 2/26/2017 03/03/2017 Standard Class DB-13060
#> 9992 9992 CA-2017-121258 2/26/2017 03/03/2017 Standard Class DB-13060
#> 9993 9993 CA-2017-121258 2/26/2017 03/03/2017 Standard Class DB-13060
#> 9994 9994 CA-2017-119914 05/04/2017 05/09/2017 Second Class CC-12220
#> Customer.Name Segment Country City State Lattitude
#> 9989 Ruben Ausman Corporate United States Athens Georgia 32.15743
#> 9990 Tom Boeckenhauer Consumer United States Miami Florida 27.66483
#> 9991 Dave Brooks Consumer United States Costa Mesa California 36.77826
#> 9992 Dave Brooks Consumer United States Costa Mesa California 36.77826
#> 9993 Dave Brooks Consumer United States Costa Mesa California 36.77826
#> 9994 Chris Cortes Consumer United States Westminster California 36.77826
#> Longitude Postal.Code Region Product.ID Category Sub.Category
#> 9989 -82.90712 30605 South TEC-PH-10004006 Technology Phones
#> 9990 -81.51575 33180 South FUR-FU-10001889 Furniture Furnishings
#> 9991 -119.41793 92627 West FUR-FU-10000747 Furniture Furnishings
#> 9992 -119.41793 92627 West TEC-PH-10003645 Technology Phones
#> 9993 -119.41793 92627 West OFF-PA-10004041 Office Supplies Paper
#> 9994 -119.41793 92683 West OFF-AP-10002684 Office Supplies Appliances
#> Product.Name
#> 9989 Panasonic KX - TS880B Telephone
#> 9990 Ultra Door Pull Handle
#> 9991 Tenex B1-RE Series Chair Mats for Low Pile Carpets
#> 9992 Aastra 57i VoIP phone
#> 9993 It's Hot Message Books with Stickers, 2 3/4" x 5"
#> 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
#> Sales Quantity Discount Profit
#> 9989 206.100 5 0.0 55.6470
#> 9990 25.248 3 0.2 4.1028
#> 9991 91.960 2 0.0 15.6332
#> 9992 258.576 2 0.2 19.3932
#> 9993 29.600 4 0.0 13.3200
#> 9994 243.160 2 0.0 72.9480
str(bigBuzz)#> 'data.frame': 9994 obs. of 23 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/08/2016" "11/08/2016" "06/12/2016" "10/11/2015" ...
#> $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Customer.Name: chr "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Country : chr "United States" "United States" "United States" "United States" ...
#> $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
#> $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
#> $ Lattitude : num 37.8 37.8 36.8 27.7 27.7 ...
#> $ Longitude : num -84.3 -84.3 -119.4 -81.5 -81.5 ...
#> $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
#> $ Region : chr "South" "South" "West" "South" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category : chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Summary
- The Data consist of 9,994 rows and 23 columns.
- Some of columns data type need to be adjusted in order to create correct analysis
3 About The Data
Brief Summary
summary(bigBuzz)#> Row.ID Order.ID Order.Date Ship.Date
#> Min. : 1 Length:9994 Length:9994 Length:9994
#> 1st Qu.:2499 Class :character Class :character Class :character
#> Median :4998 Mode :character Mode :character Mode :character
#> Mean :4998
#> 3rd Qu.:7496
#> Max. :9994
#> Ship.Mode Customer.ID Customer.Name Segment
#> Length:9994 Length:9994 Length:9994 Length:9994
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Country City State Lattitude
#> Length:9994 Length:9994 Length:9994 Min. :27.66
#> Class :character Class :character Class :character 1st Qu.:35.76
#> Mode :character Mode :character Mode :character Median :37.96
#> Mean :38.52
#> 3rd Qu.:41.60
#> Max. :47.75
#> Longitude Postal.Code Region Product.ID
#> Min. :-120.74 Min. : 1040 Length:9994 Length:9994
#> 1st Qu.:-119.42 1st Qu.:23223 Class :character Class :character
#> Median : -89.40 Median :56431 Mode :character Mode :character
#> Mean : -94.77 Mean :55190
#> 3rd Qu.: -78.66 3rd Qu.:90008
#> Max. : -69.45 Max. :99301
#> Category Sub.Category Product.Name Sales
#> Length:9994 Length:9994 Length:9994 Min. : 0.444
#> Class :character Class :character Class :character 1st Qu.: 17.280
#> Mode :character Mode :character Mode :character Median : 54.490
#> Mean : 229.858
#> 3rd Qu.: 209.940
#> Max. :22638.480
#> Quantity Discount Profit
#> Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :14.00 Max. :0.8000 Max. : 8399.976
This summary looks Wrong, as I mentioned Before, some of the columns need to be corrected. So, here’s what we do :
bigBuzz$Row.ID <- as.character(bigBuzz$Row.ID)
bigBuzz$Order.ID<- as.character (bigBuzz$Order.ID)
bigBuzz$Order.Date<- as.Date(bigBuzz$Order.Date, "%m/%d/%y")
bigBuzz$Ship.Date <- as.Date(bigBuzz$Ship.Date, "%m/%d/%y")
bigBuzz$Customer.ID<-as.character(bigBuzz$Customer.ID)
bigBuzz$Product.ID<- as.character(bigBuzz$Product.ID)
bigBuzz$Product.Name<-as.factor(bigBuzz$Product.Name)
bigBuzz$Category <- as.factor(bigBuzz$Category)
bigBuzz$State <- as.factor(bigBuzz$State)
bigBuzz$City <- as.factor(bigBuzz$City)
bigBuzz$Sub.Category <- as.factor(bigBuzz$Sub.Category)
bigBuzz$Segment<- as.factor(bigBuzz$Segment)
bigBuzz$Ship.Mode <- as.factor(bigBuzz$Ship.Mode)The results? Check out the next tab!.
Corrected Brief Summary
These are how the summary has to be look like:
summary(bigBuzz)#> Row.ID Order.ID Order.Date
#> Length:9994 Length:9994 Min. :2020-01-01
#> Class :character Class :character 1st Qu.:2020-05-20
#> Mode :character Mode :character Median :2020-09-02
#> Mean :2020-08-09
#> 3rd Qu.:2020-11-08
#> Max. :2020-12-31
#>
#> Ship.Date Ship.Mode Customer.ID
#> Min. :2020-01-01 First Class :1538 Length:9994
#> 1st Qu.:2020-05-18 Same Day : 543 Class :character
#> Median :2020-09-01 Second Class :1945 Mode :character
#> Mean :2020-08-08 Standard Class:5968
#> 3rd Qu.:2020-11-09
#> Max. :2020-12-31
#>
#> Customer.Name Segment Country City
#> Length:9994 Consumer :5191 Length:9994 New York City: 915
#> Class :character Corporate :3020 Class :character Los Angeles : 747
#> Mode :character Home Office:1783 Mode :character Philadelphia : 537
#> San Francisco: 510
#> Seattle : 428
#> Houston : 377
#> (Other) :6480
#> State Lattitude Longitude Postal.Code
#> California :2001 Min. :27.66 Min. :-120.74 Min. : 1040
#> New York :1128 1st Qu.:35.76 1st Qu.:-119.42 1st Qu.:23223
#> Texas : 985 Median :37.96 Median : -89.40 Median :56431
#> Pennsylvania: 587 Mean :38.52 Mean : -94.77 Mean :55190
#> Washington : 506 3rd Qu.:41.60 3rd Qu.: -78.66 3rd Qu.:90008
#> Illinois : 492 Max. :47.75 Max. : -69.45 Max. :99301
#> (Other) :4295
#> Region Product.ID Category
#> Length:9994 Length:9994 Furniture :2121
#> Class :character Class :character Office Supplies:6026
#> Mode :character Mode :character Technology :1847
#>
#>
#>
#>
#> Sub.Category Product.Name Sales
#> Binders :1523 Staple envelope : 48 Min. : 0.444
#> Paper :1370 Easy-staple paper : 46 1st Qu.: 17.280
#> Furnishings: 957 Staples : 46 Median : 54.490
#> Phones : 889 Avery Non-Stick Binders : 20 Mean : 229.858
#> Storage : 846 Staples in misc. colors : 19 3rd Qu.: 209.940
#> Art : 796 KI Adjustable-Height Table: 18 Max. :22638.480
#> (Other) :3613 (Other) :9797
#> Quantity Discount Profit
#> Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :14.00 Max. :0.8000 Max. : 8399.976
#>
What can wee see?
- First order and ship was January 1st 2020
- Standard clas is the most selected shipping method (5,968)
- Consumer segment dominated with 5.191 transactions
- New York city is the city with the most shipping frequency
- Office Supply is the best selling category
- Binder, paper and furnishings are top 3 grossing products
- Average sales ticket is 229,8 $ with 3 items being purchased
4 Some Business Findings
- 1. Which category and sub category yields the lowest and highest sales?
bigBuzz[bigBuzz$Sales == 0.444,]#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 4102 4102 US-2017-102288 2020-06-19 2020-06-23 Standard Class ZC-21910
#> Customer.Name Segment Country City State Lattitude Longitude
#> 4102 Zuschuss Carroll Consumer United States Houston Texas 31.9686 -99.90181
#> Postal.Code Region Product.ID Category Sub.Category
#> 4102 77095 Central OFF-AP-10002906 Office Supplies Appliances
#> Product.Name
#> 4102 Hoover Replacement Belt for Commercial Guardsman Heavy-Duty Upright Vacuum
#> Sales Quantity Discount Profit
#> 4102 0.444 1 0.8 -1.11
bigBuzz[bigBuzz$Sales == 22638.480 ,]#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 2698 2698 CA-2014-145317 2020-03-18 2020-03-23 Standard Class SM-20320
#> Customer.Name Segment Country City State Lattitude
#> 2698 Sean Miller Home Office United States Jacksonville Florida 27.66483
#> Longitude Postal.Code Region Product.ID Category Sub.Category
#> 2698 -81.51575 32216 South TEC-MA-10002412 Technology Machines
#> Product.Name Sales Quantity
#> 2698 Cisco TelePresence System EX90 Videoconferencing Unit 22638.48 6
#> Discount Profit
#> 2698 0.5 -1811.078
- 2 How’s the segment related to sales in each category?
xtabs(Quantity~Segment+Category,bigBuzz)#> Category
#> Segment Furniture Office Supplies Technology
#> Consumer 4166 11758 3597
#> Corporate 2495 7018 2095
#> Home Office 1367 4130 1247
- 3. Median of sales and profit for every segment and category
df1 <- as.data.frame(aggregate(Sales~Segment+Category,bigBuzz,median))
df1#> Segment Category Sales
#> 1 Consumer Furniture 183.9680
#> 2 Corporate Furniture 190.8200
#> 3 Home Office Furniture 148.2725
#> 4 Consumer Office Supplies 26.6320
#> 5 Corporate Office Supplies 28.8030
#> 6 Home Office Office Supplies 28.7520
#> 7 Consumer Technology 159.9840
#> 8 Corporate Technology 159.9840
#> 9 Home Office Technology 199.9800
df2 <- as.data.frame(aggregate(Profit~Segment+Category,bigBuzz,median))
df2#> Segment Category Profit
#> 1 Consumer Furniture 7.1060
#> 2 Corporate Furniture 7.8031
#> 3 Home Office Furniture 8.6137
#> 4 Consumer Office Supplies 6.7236
#> 5 Corporate Office Supplies 7.0218
#> 6 Home Office Office Supplies 7.0659
#> 7 Consumer Technology 25.1860
#> 8 Corporate Technology 23.3870
#> 9 Home Office Technology 25.9441
- 4.Does shipping modes mean something?
df3 <- as.data.frame(xtabs(Profit~Segment+Ship.Mode,bigBuzz))
df3#> Segment Ship.Mode Freq
#> 1 Consumer First Class 21374.044
#> 2 Corporate First Class 14464.472
#> 3 Home Office First Class 13131.324
#> 4 Consumer Same Day 9874.205
#> 5 Corporate Same Day 1818.142
#> 6 Home Office Same Day 4199.412
#> 7 Consumer Second Class 24946.911
#> 8 Corporate Second Class 18225.713
#> 9 Home Office Second Class 14274.011
#> 10 Consumer Standard Class 77924.049
#> 11 Corporate Standard Class 57470.807
#> 12 Home Office Standard Class 28693.932
df4 <- as.data.frame(xtabs(Profit~Category+Ship.Mode,bigBuzz))
df4#> Category Ship.Mode Freq
#> 1 Furniture First Class 3066.9474
#> 2 Office Supplies First Class 18400.3291
#> 3 Technology First Class 27502.5634
#> 4 Furniture Same Day 797.3484
#> 5 Office Supplies Same Day 6423.5192
#> 6 Technology Same Day 8670.8913
#> 7 Furniture Second Class 4226.2614
#> 8 Office Supplies Second Class 27068.1676
#> 9 Technology Second Class 26152.2064
#> 10 Furniture Standard Class 10360.7156
#> 11 Office Supplies Standard Class 70598.7849
#> 12 Technology Standard Class 83129.2870
- 5. Any Loss Suffered?
bigBuzz[bigBuzz$Profit==-6599.978,]#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 7773 7773 CA-2016-108196 2020-11-25 2020-12-02 Standard Class CS-12505
#> Customer.Name Segment Country City State Lattitude Longitude
#> 7773 Cindy Stewart Consumer United States Lancaster Ohio 40.41729 -82.90712
#> Postal.Code Region Product.ID Category Sub.Category
#> 7773 43130 East TEC-MA-10000418 Technology Machines
#> Product.Name Sales Quantity Discount
#> 7773 Cubify CubeX 3D Printer Double Head Print 4499.985 5 0.7
#> Profit
#> 7773 -6599.978
- 6. How many transactions were in loss?
number_of_loss = nrow(bigBuzz[bigBuzz$Profit< 0,])
number_of_loss #> [1] 1871
5 What Should We Do?
- Discount should be more careful, 1,871 transactions were at loss.
- Need to Increase the sales amount on corporate and home oofice.
- Most shippings were to New Yorl, Los Angeles and Philadelphia, could use theese city to increase sale, meanwhile investigaing on the least shippings areas.