Overall, the goal of this project is to analyze the Superstore data so that the business can gain insights, make informed decisions, and optimize its operations with the help of visualization and statistics. A few of the areas that can be looked into are listed below.
For the Superstore business to grow, I will examine various features so that the business can understand whether it needs to improve its marketing or take any additional steps.
In this part of the project, my aim is to:
- find better ways to increase sales in a particular region, say California;
- check whether categories play a major role in driving sales up or down, i.e. whether average sales are the same across categories;
- examine overall sales over the years and whether any changes need to be implemented to boost sales.
A lot of data exists in the world, and to analyse it and draw inferences from it we need to go through a few steps. Statistics follows a framework which includes:
1. Forming an idea, i.e. a hypothesis about some element of the world (HYPOTHESIS).
2. Observing nature to obtain related data (SAMPLE DATA from a POPULATION).
3. Recognizing that the data collected is only one of many possible observations of the world (PERFORM EXPLORATORY DATA ANALYSIS [EDA]).
4. Looking for evidence against the idea through insights (PERFORM HYPOTHESIS TESTING -> pulling reports).
5. Checking whether the result matches our assumption.
Hence, following the same framework, I will work through a similar scenario on the Superstore data as part of this project in order to infer insights for the business.
Through my research, I came across the Superstore data in the initial week of the semester; it involves quite a lot of categorical and numerical features. This made me think about the various possibilities of what can be interpreted from the data set. The following are the names of the columns:
- Ship Mode
- Segment
- Country
- City
- State
- Region
- Category
- Sub-Category
- Postal Code
- Sales
- Quantity
- Discount
- Profit
Also, while working through the project, I came across a superset of the same data set with additional columns. The additional columns are:
- Order ID
- Order Date
- Ship Date
- Customer ID
- Customer Name
- Product Name
- Product ID
Hence, as part of this project I am working with the compiled superset of the Superstore dataset, which has 21 columns (the 20 listed above plus a Row ID).
First, to get an idea about the Superstore and the orders being placed, I did some data pre-processing to understand it better.
1. There are 9,994 rows in the Superstore data. The data is tidy, with 21 columns, and each row represents a unique and separate observation.
The same can be seen through the data cleaning / pre-processing below.
knitr::opts_chunk$set(echo = FALSE)
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggthemes)
library(ggrepel)
# time series toolkits
library(xts)
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 08-11-2016 11-11-2016 Second Class CG-12520
## 2 2 CA-2016-152156 08-11-2016 11-11-2016 Second Class CG-12520
## 3 3 CA-2016-138688 12-06-2016 16-06-2016 Second Class DV-13045
## 4 4 US-2015-108966 11-10-2015 18-10-2015 Standard Class SO-20335
## 5 5 US-2015-108966 11-10-2015 18-10-2015 Standard Class SO-20335
## 6 6 CA-2014-115812 09-06-2014 14-06-2014 Standard Class BH-11710
## Customer.Name Segment Country City State
## 1 Claire Gute Consumer United States Henderson Kentucky
## 2 Claire Gute Consumer United States Henderson Kentucky
## 3 Darrin Van Huff Corporate United States Los Angeles California
## 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 6 Brosina Hoffman Consumer United States Los Angeles California
## Postal.Code Region Product.ID Category Sub.Category
## 1 42420 South FUR-BO-10001798 Furniture Bookcases
## 2 42420 South FUR-CH-10000454 Furniture Chairs
## 3 90036 West OFF-LA-10000240 Office Supplies Labels
## 4 33311 South FUR-TA-10000577 Furniture Tables
## 5 33311 South OFF-ST-10000760 Office Supplies Storage
## 6 90032 West FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
Change the format of the two date columns (Order Date and Ship Date) to Date format (data cleaning):
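The conversion chunk itself is not echoed in this report. A minimal sketch of the step, assuming the data frame is named `superstore_data` and using lubridate (loaded with the tidyverse):
# sketch only -- the data frame name is an assumption
superstore_data <- superstore_data %>%
  mutate(
    Order.Date = dmy(Order.Date),  # "08-11-2016" is day-month-year
    Ship.Date  = dmy(Ship.Date)
  )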
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 2 2 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 3 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 4 4 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 5 5 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 6 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## Customer.Name Segment Country City State
## 1 Claire Gute Consumer United States Henderson Kentucky
## 2 Claire Gute Consumer United States Henderson Kentucky
## 3 Darrin Van Huff Corporate United States Los Angeles California
## 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 6 Brosina Hoffman Consumer United States Los Angeles California
## Postal.Code Region Product.ID Category Sub.Category
## 1 42420 South FUR-BO-10001798 Furniture Bookcases
## 2 42420 South FUR-CH-10000454 Furniture Chairs
## 3 90036 West OFF-LA-10000240 Office Supplies Labels
## 4 33311 South FUR-TA-10000577 Furniture Tables
## 5 33311 South OFF-ST-10000760 Office Supplies Storage
## 6 90032 West FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
Number of Records -
## [1] 9994
Various columns within the Data set:-
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date"
## [5] "Ship.Mode" "Customer.ID" "Customer.Name" "Segment"
## [9] "Country" "City" "State" "Postal.Code"
## [13] "Region" "Product.ID" "Category" "Sub.Category"
## [17] "Product.Name" "Sales" "Quantity" "Discount"
## [21] "Profit"
Summarize the Data set -
* For categorical columns, this should include unique values and
counts
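The checks below follow one pattern per categorical column; a sketch for Ship Mode, assuming the data frame is named `superstore_data`:
# distinct values and their count for one categorical column
unique(superstore_data$Ship.Mode)      # the distinct shipping modes
n_distinct(superstore_data$Ship.Mode)  # how many distinct values exist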
1.Column Ship Mode :-
Unique values for the column - Ship Mode :-
## [1] "Second Class"
## [1] "Standard Class"
## [1] "First Class"
## [1] "Same Day"
Count of Unique Ship Modes :-
## [1] 4
2.Column Segment :-
Unique values for the column - Segment :-
## [1] "Consumer"
## [1] "Corporate"
## [1] "Home Office"
## [1] "Count of the unique Segments: 3"
3.Column Country :-
Unique values for the column - Country
## [1] "United States"
## [1] "Count of the unique Countries: 1"
From the above, we can see that the data exists only for the United States, so we can remove this column while working with the data, as it is redundant.
4.Column City :-
Unique values for the column - City
## [1] "Henderson" "Los Angeles" "Fort Lauderdale"
## [4] "Concord" "Seattle" "Fort Worth"
## [7] "Madison" "West Jordan" "San Francisco"
## [10] "Fremont" "Philadelphia" "Orem"
## [13] "Houston" "Richardson" "Naperville"
## [16] "Melbourne" "Eagan" "Westland"
## [19] "Dover" "New Albany" "New York City"
## [22] "Troy" "Chicago" "Gilbert"
## [25] "Springfield" "Jackson" "Memphis"
## [28] "Decatur" "Durham" "Columbia"
## [31] "Rochester" "Minneapolis" "Portland"
## [34] "Saint Paul" "Aurora" "Charlotte"
## [37] "Orland Park" "Urbandale" "Columbus"
## [40] "Bristol" "Wilmington" "Bloomington"
## [43] "Phoenix" "Roseville" "Independence"
## [46] "Pasadena" "Newark" "Franklin"
## [49] "Scottsdale" "San Jose" "Edmond"
## [52] "Carlsbad" "San Antonio" "Monroe"
## [55] "Fairfield" "Grand Prairie" "Redlands"
## [58] "Hamilton" "Westfield" "Akron"
## [61] "Denver" "Dallas" "Whittier"
## [64] "Saginaw" "Medina" "Dublin"
## [67] "Detroit" "Tampa" "Santa Clara"
## [70] "Lakeville" "San Diego" "Brentwood"
## [73] "Chapel Hill" "Morristown" "Cincinnati"
## [76] "Inglewood" "Tamarac" "Colorado Springs"
## [79] "Belleville" "Taylor" "Lakewood"
## [82] "Arlington" "Arvada" "Hackensack"
## [85] "Saint Petersburg" "Long Beach" "Hesperia"
## [88] "Murfreesboro" "Layton" "Austin"
## [91] "Lowell" "Manchester" "Harlingen"
## [94] "Tucson" "Quincy" "Pembroke Pines"
## [97] "Des Moines" "Peoria" "Las Vegas"
## [100] "Warwick" "Miami" "Huntington Beach"
## [103] "Richmond" "Louisville" "Lawrence"
## [106] "Canton" "New Rochelle" "Gastonia"
## [109] "Jacksonville" "Auburn" "Norman"
## [112] "Park Ridge" "Amarillo" "Lindenhurst"
## [115] "Huntsville" "Fayetteville" "Costa Mesa"
## [118] "Parker" "Atlanta" "Gladstone"
## [121] "Great Falls" "Lakeland" "Montgomery"
## [124] "Mesa" "Green Bay" "Anaheim"
## [127] "Marysville" "Salem" "Laredo"
## [130] "Grove City" "Dearborn" "Warner Robins"
## [133] "Vallejo" "Mission Viejo" "Rochester Hills"
## [136] "Plainfield" "Sierra Vista" "Vancouver"
## [139] "Cleveland" "Tyler" "Burlington"
## [142] "Waynesboro" "Chester" "Cary"
## [145] "Palm Coast" "Mount Vernon" "Hialeah"
## [148] "Oceanside" "Evanston" "Trenton"
## [151] "Cottage Grove" "Bossier City" "Lancaster"
## [154] "Asheville" "Lake Elsinore" "Omaha"
## [157] "Edmonds" "Santa Ana" "Milwaukee"
## [160] "Florence" "Lorain" "Linden"
## [163] "Salinas" "New Brunswick" "Garland"
## [166] "Norwich" "Alexandria" "Toledo"
## [169] "Farmington" "Riverside" "Torrance"
## [172] "Round Rock" "Boca Raton" "Virginia Beach"
## [175] "Murrieta" "Olympia" "Washington"
## [178] "Jefferson City" "Saint Peters" "Rockford"
## [181] "Brownsville" "Yonkers" "Oakland"
## [184] "Clinton" "Encinitas" "Roswell"
## [187] "Jonesboro" "Antioch" "Homestead"
## [190] "La Porte" "Lansing" "Cuyahoga Falls"
## [193] "Reno" "Harrisonburg" "Escondido"
## [196] "Royal Oak" "Rockville" "Coral Springs"
## [199] "Buffalo" "Boynton Beach" "Gulfport"
## [202] "Fresno" "Greenville" "Macon"
## [205] "Cedar Rapids" "Providence" "Pueblo"
## [208] "Deltona" "Murray" "Middletown"
## [211] "Freeport" "Pico Rivera" "Provo"
## [214] "Pleasant Grove" "Smyrna" "Parma"
## [217] "Mobile" "New Bedford" "Irving"
## [220] "Vineland" "Glendale" "Niagara Falls"
## [223] "Thomasville" "Westminster" "Coppell"
## [226] "Pomona" "North Las Vegas" "Allentown"
## [229] "Tempe" "Laguna Niguel" "Bridgeton"
## [232] "Everett" "Watertown" "Appleton"
## [235] "Bellevue" "Allen" "El Paso"
## [238] "Grapevine" "Carrollton" "Kent"
## [241] "Lafayette" "Tigard" "Skokie"
## [244] "Plano" "Suffolk" "Indianapolis"
## [247] "Bayonne" "Greensboro" "Baltimore"
## [250] "Kenosha" "Olathe" "Tulsa"
## [253] "Redmond" "Raleigh" "Muskogee"
## [256] "Meriden" "Bowling Green" "South Bend"
## [259] "Spokane" "Keller" "Port Orange"
## [262] "Medford" "Charlottesville" "Missoula"
## [265] "Apopka" "Reading" "Broomfield"
## [268] "Paterson" "Oklahoma City" "Chesapeake"
## [271] "Lubbock" "Johnson City" "San Bernardino"
## [274] "Leominster" "Bozeman" "Perth Amboy"
## [277] "Ontario" "Rancho Cucamonga" "Moorhead"
## [280] "Mesquite" "Stockton" "Ormond Beach"
## [283] "Sunnyvale" "York" "College Station"
## [286] "Saint Louis" "Manteca" "San Angelo"
## [289] "Salt Lake City" "Knoxville" "Little Rock"
## [292] "Lincoln Park" "Marion" "Littleton"
## [295] "Bangor" "Southaven" "New Castle"
## [298] "Midland" "Sioux Falls" "Fort Collins"
## [301] "Clarksville" "Sacramento" "Thousand Oaks"
## [304] "Malden" "Holyoke" "Albuquerque"
## [307] "Sparks" "Coachella" "Elmhurst"
## [310] "Passaic" "North Charleston" "Newport News"
## [313] "Jamestown" "Mishawaka" "La Quinta"
## [316] "Tallahassee" "Nashville" "Bellingham"
## [319] "Woodstock" "Haltom City" "Wheeling"
## [322] "Summerville" "Hot Springs" "Englewood"
## [325] "Las Cruces" "Hoover" "Frisco"
## [328] "Vacaville" "Waukesha" "Bakersfield"
## [331] "Pompano Beach" "Corpus Christi" "Redondo Beach"
## [334] "Orlando" "Orange" "Lake Charles"
## [337] "Highland Park" "Hempstead" "Noblesville"
## [340] "Apple Valley" "Mount Pleasant" "Sterling Heights"
## [343] "Eau Claire" "Pharr" "Billings"
## [346] "Gresham" "Chattanooga" "Meridian"
## [349] "Bolingbrook" "Maple Grove" "Woodland"
## [352] "Missouri City" "Pearland" "San Mateo"
## [355] "Grand Rapids" "Visalia" "Overland Park"
## [358] "Temecula" "Yucaipa" "Revere"
## [361] "Conroe" "Tinley Park" "Dubuque"
## [364] "Dearborn Heights" "Santa Fe" "Hickory"
## [367] "Carol Stream" "Saint Cloud" "North Miami"
## [370] "Plantation" "Port Saint Lucie" "Rock Hill"
## [373] "Odessa" "West Allis" "Chula Vista"
## [376] "Manhattan" "Altoona" "Thornton"
## [379] "Champaign" "Texarkana" "Edinburg"
## [382] "Baytown" "Greenwood" "Woonsocket"
## [385] "Superior" "Bedford" "Covington"
## [388] "Broken Arrow" "Miramar" "Hollywood"
## [391] "Deer Park" "Wichita" "Mcallen"
## [394] "Iowa City" "Boise" "Cranston"
## [397] "Port Arthur" "Citrus Heights" "The Colony"
## [400] "Daytona Beach" "Bullhead City" "Portage"
## [403] "Fargo" "Elkhart" "San Gabriel"
## [406] "Margate" "Sandy Springs" "Mentor"
## [409] "Lawton" "Hampton" "Rome"
## [412] "La Crosse" "Lewiston" "Hattiesburg"
## [415] "Danville" "Logan" "Waterbury"
## [418] "Athens" "Avondale" "Marietta"
## [421] "Yuma" "Wausau" "Pasco"
## [424] "Oak Park" "Pensacola" "League City"
## [427] "Gaithersburg" "Lehi" "Tuscaloosa"
## [430] "Moreno Valley" "Georgetown" "Loveland"
## [433] "Chandler" "Helena" "Kirkwood"
## [436] "Waco" "Frankfort" "Bethlehem"
## [439] "Grand Island" "Woodbury" "Rogers"
## [442] "Clovis" "Jupiter" "Santa Barbara"
## [445] "Cedar Hill" "Norfolk" "Draper"
## [448] "Ann Arbor" "La Mesa" "Pocatello"
## [451] "Holland" "Milford" "Buffalo Grove"
## [454] "Lake Forest" "Redding" "Chico"
## [457] "Utica" "Conway" "Cheyenne"
## [460] "Owensboro" "Caldwell" "Kenner"
## [463] "Nashua" "Bartlett" "Redwood City"
## [466] "Lebanon" "Santa Maria" "Des Plaines"
## [469] "Longview" "Hendersonville" "Waterloo"
## [472] "Cambridge" "Palatine" "Beverly"
## [475] "Eugene" "Oxnard" "Renton"
## [478] "Glenview" "Delray Beach" "Commerce City"
## [481] "Texas City" "Wilson" "Rio Rancho"
## [484] "Goldsboro" "Montebello" "El Cajon"
## [487] "Beaumont" "West Palm Beach" "Abilene"
## [490] "Normal" "Saint Charles" "Camarillo"
## [493] "Hillsboro" "Burbank" "Modesto"
## [496] "Garden City" "Atlantic City" "Longmont"
## [499] "Davis" "Morgan Hill" "Clifton"
## [502] "Sheboygan" "East Point" "Rapid City"
## [505] "Andover" "Kissimmee" "Shelton"
## [508] "Danbury" "Sanford" "San Marcos"
## [511] "Greeley" "Mansfield" "Elyria"
## [514] "Twin Falls" "Coral Gables" "Romeoville"
## [517] "Marlborough" "Laurel" "Bryan"
## [520] "Pine Bluff" "Aberdeen" "Hagerstown"
## [523] "East Orange" "Arlington Heights" "Oswego"
## [526] "Coon Rapids" "San Clemente" "San Luis Obispo"
## [529] "Springdale" "Lodi" "Mason"
5.Column State :-
Unique values for the column - State
## [1] "Kentucky" "California" "Florida"
## [4] "North Carolina" "Washington" "Texas"
## [7] "Wisconsin" "Utah" "Nebraska"
## [10] "Pennsylvania" "Illinois" "Minnesota"
## [13] "Michigan" "Delaware" "Indiana"
## [16] "New York" "Arizona" "Virginia"
## [19] "Tennessee" "Alabama" "South Carolina"
## [22] "Oregon" "Colorado" "Iowa"
## [25] "Ohio" "Missouri" "Oklahoma"
## [28] "New Mexico" "Louisiana" "Connecticut"
## [31] "New Jersey" "Massachusetts" "Georgia"
## [34] "Nevada" "Rhode Island" "Mississippi"
## [37] "Arkansas" "Montana" "New Hampshire"
## [40] "Maryland" "District of Columbia" "Kansas"
## [43] "Vermont" "Maine" "South Dakota"
## [46] "Idaho" "North Dakota" "Wyoming"
## [49] "West Virginia"
## [1] "Count of the unique States: 49"
6.Column Region :
Unique values for the column - Region
## [1] "South"
## [1] "West"
## [1] "Central"
## [1] "East"
## [1] "Count of the unique Regions: 4"
7.Column Category :-
Unique values for the column - Category
## [1] "Furniture"
## [1] "Office Supplies"
## [1] "Technology"
## [1] "Count of the unique Categories: 3"
8.Sub-Category :-
Unique values for the column - Sub-category
## [1] "Unique values for the column - Sub-Category :"
## [1] "Bookcases"
## [1] "Chairs"
## [1] "Labels"
## [1] "Tables"
## [1] "Storage"
## [1] "Furnishings"
## [1] "Art"
## [1] "Phones"
## [1] "Binders"
## [1] "Appliances"
## [1] "Paper"
## [1] "Accessories"
## [1] "Envelopes"
## [1] "Fasteners"
## [1] "Supplies"
## [1] "Machines"
## [1] "Copiers"
## [1] "Count of the Sub-Category: 17"
9.Postal Code :-
Even though Postal Code is numeric in nature, treating it as a number has no value, since the mean of a postal code would not make sense. Hence it is considered a categorical variable.
Unique values for the column - Postal code :
## [1] 42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140
## [13] 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024
## [25] 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707
## [37] 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229
## [49] 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105
## [61] 98115 73034 90045 19134 88220 78207 77036 62521 71203 6824 75051 92374
## [73] 45011 7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227
## [85] 38401 33614 95051 55044 92037 77506 94513 27514 7960 45231 94110 90301
## [97] 33319 80906 7109 48180 8701 22204 80004 7601 33710 19143 90805 92345
## [109] 37130 84041 78745 1852 31907 6040 78550 85705 62301 2038 33024 98198
## [121] 61604 89115 2886 33180 28403 92646 40475 80027 1841 39212 48187 10801
## [133] 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340
## [145] 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653
## [157] 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591
## [169] 92691 48307 7060 85635 98661 60505 76017 40214 75081 44105 75701 27217
## [181] 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016
## [193] 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052
## [205] 7036 93905 8901 17602 3301 21044 75043 6360 22304 43615 87401 92503
## [217] 90503 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376
## [229] 61107 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030
## [241] 46350 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503
## [253] 93727 27834 11561 35630 31204 52402 2908 81001 94533 32725 42071 6457
## [265] 11520 90660 84604 84062 30080 24153 44134 36608 2740 75061 8360 85301
## [277] 14304 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677 8302
## [289] 2149 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224
## [301] 60076 75023 23434 46203 7002 28314 27405 21215 53142 66062 98002 74133
## [313] 97756 27604 74403 6450 42104 46614 6010 89015 99207 76248 45014 32127
## [325] 97504 22901 59801 33178 29501 97477 32712 19601 80020 65807 7501 73120
## [337] 23320 79424 65203 37604 36830 92404 1453 59715 85345 44107 8861 91761
## [349] 91730 56560 75150 95207 32174 94086 3820 17403 77840 63116 2169 95336
## [361] 44240 76903 84106 35810 37918 72209 48146 43302 80122 5408 4401 38671
## [373] 47362 48640 57103 80525 47905 37042 95823 91360 2148 1040 87105 89431
## [385] 92236 60126 7055 29406 23602 14701 46544 43402 92253 32303 37211 98226
## [397] 60098 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107
## [409] 53186 93309 33068 45373 78415 90278 32839 7050 70601 60035 11550 46060
## [421] 55124 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369
## [433] 95695 77489 77581 94403 49505 93277 66212 92592 92399 2151 77301 60477
## [445] 52001 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762
## [457] 53214 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712
## [469] 2895 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704
## [481] 2920 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776
## [493] 33063 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321
## [505] 6708 30605 4240 61832 85323 30062 85364 54401 99301 60302 32503 77573
## [517] 20877 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423
## [529] 18018 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020
## [541] 48104 91941 83201 49423 6460 60089 92630 96003 95928 13501 72032 82001
## [553] 42301 83605 70065 3060 38134 94061 37087 93454 60016 98632 37075 50701
## [565] 2138 60067 1915 97405 93030 98059 60025 33445 80022 77590 27893 87124
## [577] 27534 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123
## [589] 91505 95351 67846 8401 80501 95616 26003 95037 7011 53081 30344 57701
## [601] 1810 34741 6484 6810 52302 32771 78666 80634 76063 44035 83301 33134
## [613] 60441 1752 20707 77803 71603 57401 21740 7017 60004 60543 55433 92672
## [625] 94568 93405 72762 95240 77571 45040 30188
10.Order Date :-
Unique order dates for the column - Order Date
## [1] 1237
## [1] 2016 2015 2014 2017
From the above, we can see that the data covers 4 years (2014-2017). A sketch of the check is given below.
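A sketch, assuming the data frame is named `superstore_data` and the dates are already in Date format:
n_distinct(superstore_data$Order.Date)    # number of distinct order dates
unique(year(superstore_data$Order.Date))  # calendar years covered by the orders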
11.Order ID :-
Unique orders placed for the column - Order ID
## [1] 5009
12.Ship Date :-
Unique shipping dates for the column - Ship Date
## [1] 1334
## [1] 2016 2015 2014 2017 2018
From the above, we can see that shipping dates extend up to 2018, i.e. a total of 5 years.
13.Customer ID :-
Unique customers in the column - Customer ID
## [1] 793
14.Customer Name :-
Unique customers, checked using the column - Customer Name
## [1] 793
The counts of unique Customer.ID and Customer.Name values match (793), which suggests the data has no issues here: each customer ID corresponds to a unique customer name.
15.Product Name :- Unique products purchased in the orders placed, checked using Product Name
## [1] 1850
All unique Products :-
16.Product ID :- Unique products purchased in the orders placed, checked using Product ID
## [1] 1862
For products, the primary key is likely Product ID; some product names repeat across different IDs, which is why the Product Name count (1850) is slightly less than the Product ID count (1862).
17.Sales :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.444 17.280 54.490 229.858 209.940 22638.480
18.Quantity :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.79 5.00 14.00
19.Discount :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.1562 0.2000 0.8000
20.Profit :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6599.978 1.729 8.666 28.657 29.364 8399.976
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode
## 9994 5009 1237 1334 4
## Customer.ID Customer.Name Segment Country City
## 793 793 3 1 531
## State Postal.Code Region Product.ID Category
## 49 631 4 1862 3
## Sub.Category Product.Name Sales Quantity Discount
## 17 1850 5825 14 12
## Profit
## 7287
1. There are 16 categorical columns: Ship Mode, Segment, Country, City, State, Postal Code, Region, Category, Sub.Category, Order.ID, Order.Date, Ship.Date, Customer.ID, Customer.Name, Product.ID, and Product.Name.
(Here, Postal Code is considered categorical because its numeric value has no quantitative significance.)
2. The remaining 4 columns are numerical/continuous: Sales (price), Profit, Discount, and Quantity, which take decimal values. Also, Profit has both positive and negative values; negative values indicate a loss.
- Sales ranges from $0.444 to $22,638.48.
- Profit ranges from -$6,599.978 to $8,399.976.
- Discount ranges from 0 to 0.8.
- Quantity is numeric, ranging from 1 to 14.
The aim is to figure out the sales and profit earned by the Superstore in various parts of the USA. This can then be used to find trends and come up with improvements, for example whether there are issues with the current marketing or whether targeted marketing could help boost sales. All of this can be inferred through analysis.
Explanation of each column:
Ship Mode - It has 4 values: First Class, Same Day, Second Class, and Standard Class. Each value defines the shipping-time limit as follows:
Same day: >=0 days
First class: >=2 days
Second class: >=3 days
Standard class: >= 5 days
Segment - The market segment of the customer. There are three segments: Consumer, Corporate, and Home Office.
Country - Contains the country details; it was found to contain only the United States, i.e. one value, so this column can be removed.
City - The various cities in the United States; there are 531 cities.
State - There are 49 states in this dataset.
Postal Code - There are 631 postal codes in this dataset.
Region - There are 4 regions in the dataset: South, West, Central, and East.
Category - There are 3 categories: Furniture, Office Supplies, and Technology.
Sub-Category - There are 17 sub-categories: Bookcases, Chairs, Labels, Tables, Storage, Furnishings, Art, Phones, Binders, Appliances, Paper, Accessories, Envelopes, Fasteners, Supplies, Machines, and Copiers.
Order Date - There are 1,237 unique order dates over a span from 2014 to 2017, and the number of orders placed each year appears to increase year over year (see the yearly counts below, computed as sketched next).
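A sketch of the yearly tally (the data frame name is an assumption, and n() counts order lines rather than distinct Order IDs, which is why the yearly totals sum to 9,994):
superstore_data %>%
  mutate(year_order = year(Order.Date)) %>%
  group_by(year_order) %>%
  summarise(yearly_total_orders = n())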
## # A tibble: 4 × 2
## year_order yearly_total_orders
## <dbl> <int>
## 1 2014 1993
## 2 2015 2102
## 3 2016 2587
## 4 2017 3312
## [1] 5009
## # A tibble: 5 × 2
## ship_order yearly_total_shipped_orders
## <dbl> <int>
## 1 2014 1940
## 2 2015 2131
## 3 2016 2578
## 4 2017 3303
## 5 2018 42
Customer ID & Customer Name - There are 793 customers in the Superstore dataset. The counts of unique Customer.ID and Customer.Name values match (793), which suggests the data has no issues here, since each customer ID corresponds to a unique name.
Product Name & Product ID - 1,862 unique products are purchased across the orders placed; this can be determined from the unique ID of each product.
Sales - The sale amount (price), ranging from $0.444 to $22,638.48.
Profit - The profit earned on the product; a negative value indicates a loss. It ranges from -$6,599.978 to $8,399.976.
Discount - The discount given on the product purchased, ranging from 0 to 0.8.
Quantity - The quantity of products bought, ranging from 1 to 14.
As seen above, there are 49 states, and plotting them all to understand purchase counts is difficult. Hence we narrow it down to the top 10 states and figure out which state has the most purchases.
This answers the question: which are the top 10 states purchasing products from the Superstore? A sketch of the tally is given below.
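A sketch, assuming the data frame is named `superstore_data` (the hidden chunk may name the output columns slightly differently):
superstore_data %>%
  count(State, sort = TRUE, name = "count") %>%  # rows per state, largest first
  slice_head(n = 10)                             # keep the top 10 states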
## state count
## 1 California 2001
## 2 New York 1128
## 3 Texas 985
## 4 Pennsylvania 587
## 5 Washington 506
## 6 Illinois 492
## 7 Ohio 469
## 8 Florida 383
## 9 Michigan 255
## 10 North Carolina 249
## # A tibble: 10 × 2
## state count
## <chr> <int>
## 1 California 663
## 2 New York 352
## 3 Texas 317
## 4 Washington 215
## 5 Pennsylvania 197
## 6 Illinois 172
## 7 Ohio 161
## 8 Florida 126
## 9 North Carolina 85
## 10 Tennessee 81
## # A tibble: 10 × 2
## state count
## <chr> <int>
## 1 Binders 93
## 2 Paper 88
## 3 Art 70
## 4 Furnishings 65
## 5 Phones 56
## 6 Storage 53
## 7 Accessories 49
## 8 Appliances 37
## 9 Chairs 35
## 10 Labels 25
In 2017, binders were the most purchased product in California, followed by paper, art, furnishings, and phones. Offering better discounts could potentially boost sales based on customer needs.
The next question is to determine the average discount given for each sub-category; a sketch of the computation is given below.
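A sketch, assuming the same California / 2017 filter as the sub-category counts above (inferred from the matching counts; all names are assumptions):
superstore_data %>%
  filter(State == "California", year(Order.Date) == 2017) %>%
  group_by(Sub.Category) %>%
  summarise(count = n(), average_discount = mean(Discount)) %>%
  arrange(desc(count))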
## # A tibble: 17 × 3
## Sub_Category count average_discount
## <chr> <int> <dbl>
## 1 Binders 93 0.2
## 2 Paper 88 0
## 3 Art 70 0
## 4 Furnishings 65 0
## 5 Phones 56 0.2
## 6 Storage 53 0
## 7 Accessories 49 0
## 8 Appliances 37 0
## 9 Chairs 35 0.2
## 10 Labels 25 0
## 11 Tables 24 0.2
## 12 Bookcases 16 0.15
## 13 Envelopes 14 0
## 14 Fasteners 12 0
## 15 Supplies 11 0
## 16 Copiers 8 0.2
## 17 Machines 7 0.2
## # A tibble: 9 × 2
## # Groups: Category, Sub.Category [9]
## Category Sub.Category
## <chr> <chr>
## 1 Office Supplies Binders
## 2 Office Supplies Art
## 3 Office Supplies Envelopes
## 4 Office Supplies Paper
## 5 Office Supplies Storage
## 6 Office Supplies Labels
## 7 Office Supplies Appliances
## 8 Office Supplies Fasteners
## 9 Office Supplies Supplies
We can also check the total sales obtained for those top 10 states. This answers questions such as: are the top states different when ranked by total sales rather than by the number of orders placed over the 2014-2017 period?
## State Total_sales
## 1 California 457688
## 2 New York 310876
## 3 Texas 170188
## 4 Washington 138641
## 5 Pennsylvania 116512
## 6 Florida 89474
## 7 Illinois 80166
## 8 Ohio 78258
## 9 Michigan 76270
## 10 Virginia 70637
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 2 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 3 7 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 4 8 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 5 9 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 6 10 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## Customer.Name Segment Country City State Postal.Code
## 1 Darrin Van Huff Corporate United States Los Angeles California 90036
## 2 Brosina Hoffman Consumer United States Los Angeles California 90032
## 3 Brosina Hoffman Consumer United States Los Angeles California 90032
## 4 Brosina Hoffman Consumer United States Los Angeles California 90032
## 5 Brosina Hoffman Consumer United States Los Angeles California 90032
## 6 Brosina Hoffman Consumer United States Los Angeles California 90032
## Region Product.ID Category Sub.Category
## 1 West OFF-LA-10000240 Office Supplies Labels
## 2 West FUR-FU-10001487 Furniture Furnishings
## 3 West OFF-AR-10002833 Office Supplies Art
## 4 West TEC-PH-10002275 Technology Phones
## 5 West OFF-BI-10003910 Office Supplies Binders
## 6 West OFF-AP-10002892 Office Supplies Appliances
## Product.Name Sales
## 1 Self-Adhesive Address Labels for Typewriters by Universal 14.620
## 2 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.860
## 3 Newell 322 7.280
## 4 Mitel 5320 IP Phone VoIP phone 907.152
## 5 DXL Angle-View Binders with Locking Rings by Samsill 18.504
## 6 Belkin F5C206VTEL 6 Outlet Surge 114.900
## Quantity Discount Profit
## 1 2 0.0 6.8714
## 2 7 0.0 14.1694
## 3 4 0.0 1.9656
## 4 6 0.2 90.7152
## 5 3 0.2 5.7825
## 6 5 0.0 34.4700
## # A tibble: 6 × 2
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-13 1680.
## 3 2014-01-20 38.7
## 4 2014-01-27 427.
## 5 2014-01-31 291.
## 6 2014-02-02 12.4
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-07 NA
## 3 2014-01-08 NA
## 4 2014-01-09 NA
## 5 2014-01-10 NA
## 6 2014-01-11 NA
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-07 0
## 3 2014-01-08 0
## 4 2014-01-09 0
## 5 2014-01-10 0
## 6 2014-01-11 0
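The zero-filled daily series above can be produced with tsibble's gap handling; a sketch, assuming `ca_sales` holds the per-date California totals shown earlier:
ca_daily <- ca_sales %>%
  as_tsibble(index = Order.Date) %>%  # regular daily [1D] index
  fill_gaps(Total_sales = 0)          # days with no orders become explicit zero-sales days
head(ca_daily)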
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Looking at a single year's trend of total sales for the state of California (which had the most orders placed in 2017: 663), we can see that sales increased over the course of 2017.
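A sketch of the trend plot behind this observation, assuming `ca_daily` from the previous step (the smoothing method is an assumption; the report only shows the `y ~ x` message):
ca_daily %>%
  filter(year(Order.Date) == 2017) %>%
  ggplot(aes(x = Order.Date, y = Total_sales)) +
  geom_line() +
  geom_smooth(method = "lm") +  # linear trend line over the year
  labs(title = "Daily total sales, California, 2017",
       x = "Order date", y = "Total sales ($)")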
In conclusion, given that the region is primarily tech-based, it appears the company is not currently promoting its products heavily. If that is the case, it may want to consider doing so in the future to draw in clients who are already prepared to purchase office supplies. If it is already marketing, it ought to keep doing so, because sales are increasing in the California area. In addition, since discounts are either nonexistent or extremely limited in California, the business could consider specific tactics to encourage customers to purchase its goods, such as offering discounts on the infrequent days when sales are low and increasing product prices on regular selling days. This should contribute to overall sales growth.
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_profit
## <date> <dbl>
## 1 2014-01-06 9.33
## 2 2014-01-07 0
## 3 2014-01-08 0
## 4 2014-01-09 0
## 5 2014-01-10 0
## 6 2014-01-11 0
## `geom_smooth()` using formula = 'y ~ x'
Overall profit is also increasing over the years in the state of California.
Since California is in the west of the US, we ask the question:
###3. Is there any region-wise trend in how products are bought?
From the above plot, we can see that orders placed in the Western and Eastern regions are somewhat higher than in the other regions. So we can put forward the hypothesis that the average total sales in the Western and Eastern regions are similar. This is also supported by the observation that both regions have a growing number of tech companies. From the earlier analysis, we know that California is in the West of the US, while New York, the state with the second-highest number of orders, is in the East. So similar marketing or other tactics could be implemented in both regions, and region would play no particular role in the average total sales when the two are considered separately, since they are about the same.
Hence, two regions are taken for the purpose of forming a null hypothesis, the Western and Eastern regions, for the two reasons discussed above.
Let us take the null hypothesis to be: average sales of the Western region = average sales of the Eastern region.
\[ H_0: \text{Average sales are equal for the two regions (Western and Eastern).} \]
\[ H_0: \mu_{\text{West}} = \mu_{\text{East}}, \qquad H_1: \mu_{\text{West}} \neq \mu_{\text{East}} \]
Alpha level - \(\alpha\) defines the rejection region: if our observed test statistic falls in this region, we reject the null hypothesis and consider the alternative.
The alpha level \(\alpha\) represents the probability of making a Type I error, i.e. rejecting the null hypothesis when it is actually true.
We take \(\alpha\) = 0.05 so that the false positive rate (rejecting the null hypothesis when it is in fact true) is kept low.
Power level (1-\(\beta\)) is the probability of rejecting the null hypothesis when it is actually false, i.e. of correctly detecting a real difference and avoiding a Type II error. We set it to 0.80 so that the test has a reasonable chance of detecting a true difference.
Minimum effect size - The smallest difference in average sales that we want the test to be able to detect. Let us assume it to be around $100, i.e. a minimum $100 difference between the average sales of the Western and Eastern regions.
Sample size calculation:
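The chunk that produced the number below is not echoed. One way such a sample size could be computed (an assumption; the author may have used a different function, e.g. pwr::pwr.t.test) is with base R's power.t.test, scaling the $100 minimum difference by the overall sales standard deviation:
power.t.test(delta = 100,                     # minimum detectable difference in dollars
             sd = sd(superstore_data$Sales),  # assumed scale: overall SD of Sales
             sig.level = 0.05, power = 0.80,
             type = "two.sample",
             alternative = "two.sided")$n     # required sample size per group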
## [1] 610.7154
From the above calculation, the required sample size is about 611 per group. Each region has a sufficient amount of data: the Western region has 3,203 observations and the Eastern region has 2,848.
So I believe I can apply a Neyman-Pearson hypothesis test (here, a Welch two-sample t-test) to decide whether or not to reject the null hypothesis.
Neyman-Pearson Test:
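A sketch of the test as run, with vector names matching the output below (the data frame name is an assumption); the same pattern, with `tech_sales` and `office_sales`, applies to the category comparison later:
west_sales <- superstore_data %>% filter(Region == "West") %>% pull(Sales)
east_sales <- superstore_data %>% filter(Region == "East") %>% pull(Sales)
t.test(west_sales, east_sales)  # Welch two-sample t-test (unequal variances by default)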
##
## Welch Two Sample t-test
##
## data: west_sales and east_sales
## t = -0.79611, df = 5603.9, p-value = 0.426
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -41.00552 17.31977
## sample estimates:
## mean of x mean of y
## 226.4932 238.3361
## Fail to reject the null hypothesis. There is no significant difference in average sales between Western and Eastern regions for the products sold.
As per the result, we fail to reject the null hypothesis, which states that there is no difference between the average sales of the two regions, West and East. Since we do not reject the null hypothesis, the alternative hypothesis (that the average sales of the Western region differ from those of the Eastern region) is not supported.
So, all in all, we can say that the Eastern and Western regions can be treated in a similar manner when it comes to selling products. Whatever marketing is driving sales in the Western region could be implemented in the Eastern region too, and the two can be treated as a single region with no need to check for separate trends.
The same can also be looked into by checking the total sales for each region:
## # A tibble: 4 × 2
## Region Total_sales
## <chr> <dbl>
## 1 Central 501240.
## 2 East 678781.
## 3 South 391722.
## 4 West 725458.
###6. Hypothesis (Total sales VS each category?)
## # A tibble: 3 × 3
## Category count Total_sales
## <chr> <int> <dbl>
## 1 Technology 1847 836154.
## 2 Furniture 2121 742000.
## 3 Office Supplies 6026 719047.
Looking at the various categories under which products are sold, the total sales for each category appear somewhat similar over 2014-2017. So we can hypothesize that category plays no particular role in determining total sales, as they are roughly the same.
Hence, two categories are taken for the purpose of forming a null hypothesis: Technology and Office Supplies. The reasons for picking them are that Technology products are widely used all around the world, and Office Supplies are in demand because new companies are constantly opening and expanding.
Let us take the null hypothesis to be: average sales of Technology = average sales of Office Supplies.
\[ H_0: \text{Average sales are equal for the two categories (Technology and Office Supplies).} \]
\[ H_0: \mu_{\text{Tech}} = \mu_{\text{Office}}, \qquad H_1: \mu_{\text{Tech}} \neq \mu_{\text{Office}} \]
Alpha level - \(\alpha\) defines the rejection region: if our observed test statistic falls in this region, we reject the null hypothesis and consider the alternative.
The alpha level \(\alpha\) represents the probability of making a Type I error, i.e. rejecting the null hypothesis when it is actually true.
We take \(\alpha\) = 0.05 so that the false positive rate (rejecting the null hypothesis when it is in fact true) is kept low for the two categories, Technology and Office Supplies.
Power level (1-\(\beta\)) is the probability of rejecting the null hypothesis when it is actually false, i.e. of correctly detecting a real difference and avoiding a Type II error. We set it to 0.80 so that the test has a reasonable chance of detecting a true difference.
Minimum effect size - The smallest difference in average sales that we want the test to be able to detect. Let us assume it to be around $100, i.e. a minimum $100 difference between the average sales of Technology and Office Supplies.
Sample size calculation:
## [1] 610.7154
From the above calculation, the required sample size is about 611 per group. Each category has a sufficient amount of data: Technology has 1,847 observations and Office Supplies has 6,026.
So I believe I can apply a Neyman-Pearson hypothesis test (again, a Welch two-sample t-test) to decide whether or not to reject the null hypothesis.
Neyman-Pearson Test:
##
## Welch Two Sample t-test
##
## data: tech_sales and office_sales
## t = 12.694, df = 1982.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 281.8807 384.8897
## sample estimates:
## mean of x mean of y
## 452.7093 119.3241
## Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.
As per the result, we reject the null hypothesis, since there is a large difference between the average sales of the two categories, Technology and Office Supplies. Rejecting the null hypothesis means the alternative hypothesis is supported: the average sales of Technology are not equal to the average sales of Office Supplies.
As per the alternative hypothesis, we cannot assume that the average sales of the two categories will remain the same.