Overall, the goal of this project is to analyze the Superstore data so that the business can gain insights, make informed decisions, and optimize its operations with the help of visualization and statistics. A few of the areas that can be looked into are listed below.
For the Superstore business to grow, I will examine various features so that the business can understand whether it needs to improve its marketing or take any additional steps.
In this part of the project, my aim is to:
- find better ways to increase sales in a particular region, say California;
- check whether categories play a major role in driving sales up or down, i.e. whether average sales are the same across categories;
- examine overall sales over the years and whether any changes need to be implemented to boost sales.
A lot of data exists in the world, and to analyse it and draw inferences from it we need to go through a few steps. Statistics follows a framework which includes:
1. Forming an idea, i.e. a hypothesis about some element of the world (HYPOTHESIS).
2. Observing nature to obtain related data (SAMPLE DATA from a POPULATION).
3. Recognizing that the data collected is only one of many possible observations of the world (PERFORM EXPLORATORY DATA ANALYSIS [EDA]).
4. Looking for evidence against the idea through insights (PERFORM HYPOTHESIS TESTING -> pulling reports).
5. Checking whether the result matches our assumption.
Hence, following the same framework, I will work through a similar scenario on the Superstore data as part of this project in order to infer insights for the business.
Through my research, I came across the Superstore data in the initial week of the semester; it involves quite a lot of categorical and numerical features. This made me think about the various possibilities of what can be interpreted from the data set. The following are the names of the columns:
- Ship Mode
- Segment
- Country
- City
- State
- Region
- Category
- Sub-Category
- Postal Code
- Sales
- Quantity
- Discount
- Profit
Also, while working through the project, I came across a superset of the same data set with additional columns. The additional columns are:
- Order ID
- Order Date
- Ship Date
- Customer ID
- Customer Name
- Product Name
- Product ID
Hence, as part of this project I am working with the compiled superset of the Superstore dataset, which has 21 columns (the 20 listed above plus a Row ID).
First, to get an idea about the Superstore and the orders being placed, I did some data pre-processing to understand it better.
1. There are 9,994 rows in the Superstore data. The data is tidy, with 21 columns, and each row represents a unique and separate observation.
The same can be seen through the data cleaning / pre-processing below.
knitr::opts_chunk$set(echo = FALSE)
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggthemes)
library(ggrepel)
# time series toolkits
library(xts)
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.2
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 08-11-2016 11-11-2016 Second Class CG-12520
## 2 2 CA-2016-152156 08-11-2016 11-11-2016 Second Class CG-12520
## 3 3 CA-2016-138688 12-06-2016 16-06-2016 Second Class DV-13045
## 4 4 US-2015-108966 11-10-2015 18-10-2015 Standard Class SO-20335
## 5 5 US-2015-108966 11-10-2015 18-10-2015 Standard Class SO-20335
## 6 6 CA-2014-115812 09-06-2014 14-06-2014 Standard Class BH-11710
## Customer.Name Segment Country City State
## 1 Claire Gute Consumer United States Henderson Kentucky
## 2 Claire Gute Consumer United States Henderson Kentucky
## 3 Darrin Van Huff Corporate United States Los Angeles California
## 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 6 Brosina Hoffman Consumer United States Los Angeles California
## Postal.Code Region Product.ID Category Sub.Category
## 1 42420 South FUR-BO-10001798 Furniture Bookcases
## 2 42420 South FUR-CH-10000454 Furniture Chairs
## 3 90036 West OFF-LA-10000240 Office Supplies Labels
## 4 33311 South FUR-TA-10000577 Furniture Tables
## 5 33311 South OFF-ST-10000760 Office Supplies Storage
## 6 90032 West FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
Change the format of the two date columns (Order Date and Ship Date) to Date format (data cleaning):
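The conversion chunk itself is not echoed in this report. A minimal sketch of the step, assuming the data frame is named `superstore_data` and using lubridate (loaded with the tidyverse):
# sketch only -- the data frame name is an assumption
superstore_data <- superstore_data %>%
  mutate(
    Order.Date = dmy(Order.Date),  # "08-11-2016" is day-month-year
    Ship.Date  = dmy(Ship.Date)
  )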
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 2 2 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 3 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 4 4 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 5 5 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 6 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## Customer.Name Segment Country City State
## 1 Claire Gute Consumer United States Henderson Kentucky
## 2 Claire Gute Consumer United States Henderson Kentucky
## 3 Darrin Van Huff Corporate United States Los Angeles California
## 4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
## 6 Brosina Hoffman Consumer United States Los Angeles California
## Postal.Code Region Product.ID Category Sub.Category
## 1 42420 South FUR-BO-10001798 Furniture Bookcases
## 2 42420 South FUR-CH-10000454 Furniture Chairs
## 3 90036 West OFF-LA-10000240 Office Supplies Labels
## 4 33311 South FUR-TA-10000577 Furniture Tables
## 5 33311 South OFF-ST-10000760 Office Supplies Storage
## 6 90032 West FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
Number of Records -
## [1] 9994
Various columns within the Data set:-
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date"
## [5] "Ship.Mode" "Customer.ID" "Customer.Name" "Segment"
## [9] "Country" "City" "State" "Postal.Code"
## [13] "Region" "Product.ID" "Category" "Sub.Category"
## [17] "Product.Name" "Sales" "Quantity" "Discount"
## [21] "Profit"
Summarize the Data set -
* For categorical columns, this should include unique values and
counts
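The checks below follow one pattern per categorical column; a sketch for Ship Mode, assuming the data frame is named `superstore_data`:
# distinct values and their count for one categorical column
unique(superstore_data$Ship.Mode)      # the distinct shipping modes
n_distinct(superstore_data$Ship.Mode)  # how many distinct values exist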
1.Column Ship Mode :-
Unique values for the column - Ship Mode :-
## [1] "Second Class"
## [1] "Standard Class"
## [1] "First Class"
## [1] "Same Day"
Count of Unique Ship Modes :-
## [1] 4
2.Column Segment :-
Unique values for the column - Segment :-
## [1] "Consumer"
## [1] "Corporate"
## [1] "Home Office"
## [1] "Count of the unique Segments: 3"
3.Column Country :-
Unique values for the column - Country
## [1] "United States"
## [1] "Count of the unique Countries: 1"
From the above, we can see that the data exists only for the United States, so we can remove this column while working with the data, as it is redundant.
4.Column City :-
Unique values for the column - City
## [1] "Henderson" "Los Angeles" "Fort Lauderdale"
## [4] "Concord" "Seattle" "Fort Worth"
## [7] "Madison" "West Jordan" "San Francisco"
## [10] "Fremont" "Philadelphia" "Orem"
## [13] "Houston" "Richardson" "Naperville"
## [16] "Melbourne" "Eagan" "Westland"
## [19] "Dover" "New Albany" "New York City"
## [22] "Troy" "Chicago" "Gilbert"
## [25] "Springfield" "Jackson" "Memphis"
## [28] "Decatur" "Durham" "Columbia"
## [31] "Rochester" "Minneapolis" "Portland"
## [34] "Saint Paul" "Aurora" "Charlotte"
## [37] "Orland Park" "Urbandale" "Columbus"
## [40] "Bristol" "Wilmington" "Bloomington"
## [43] "Phoenix" "Roseville" "Independence"
## [46] "Pasadena" "Newark" "Franklin"
## [49] "Scottsdale" "San Jose" "Edmond"
## [52] "Carlsbad" "San Antonio" "Monroe"
## [55] "Fairfield" "Grand Prairie" "Redlands"
## [58] "Hamilton" "Westfield" "Akron"
## [61] "Denver" "Dallas" "Whittier"
## [64] "Saginaw" "Medina" "Dublin"
## [67] "Detroit" "Tampa" "Santa Clara"
## [70] "Lakeville" "San Diego" "Brentwood"
## [73] "Chapel Hill" "Morristown" "Cincinnati"
## [76] "Inglewood" "Tamarac" "Colorado Springs"
## [79] "Belleville" "Taylor" "Lakewood"
## [82] "Arlington" "Arvada" "Hackensack"
## [85] "Saint Petersburg" "Long Beach" "Hesperia"
## [88] "Murfreesboro" "Layton" "Austin"
## [91] "Lowell" "Manchester" "Harlingen"
## [94] "Tucson" "Quincy" "Pembroke Pines"
## [97] "Des Moines" "Peoria" "Las Vegas"
## [100] "Warwick" "Miami" "Huntington Beach"
## [103] "Richmond" "Louisville" "Lawrence"
## [106] "Canton" "New Rochelle" "Gastonia"
## [109] "Jacksonville" "Auburn" "Norman"
## [112] "Park Ridge" "Amarillo" "Lindenhurst"
## [115] "Huntsville" "Fayetteville" "Costa Mesa"
## [118] "Parker" "Atlanta" "Gladstone"
## [121] "Great Falls" "Lakeland" "Montgomery"
## [124] "Mesa" "Green Bay" "Anaheim"
## [127] "Marysville" "Salem" "Laredo"
## [130] "Grove City" "Dearborn" "Warner Robins"
## [133] "Vallejo" "Mission Viejo" "Rochester Hills"
## [136] "Plainfield" "Sierra Vista" "Vancouver"
## [139] "Cleveland" "Tyler" "Burlington"
## [142] "Waynesboro" "Chester" "Cary"
## [145] "Palm Coast" "Mount Vernon" "Hialeah"
## [148] "Oceanside" "Evanston" "Trenton"
## [151] "Cottage Grove" "Bossier City" "Lancaster"
## [154] "Asheville" "Lake Elsinore" "Omaha"
## [157] "Edmonds" "Santa Ana" "Milwaukee"
## [160] "Florence" "Lorain" "Linden"
## [163] "Salinas" "New Brunswick" "Garland"
## [166] "Norwich" "Alexandria" "Toledo"
## [169] "Farmington" "Riverside" "Torrance"
## [172] "Round Rock" "Boca Raton" "Virginia Beach"
## [175] "Murrieta" "Olympia" "Washington"
## [178] "Jefferson City" "Saint Peters" "Rockford"
## [181] "Brownsville" "Yonkers" "Oakland"
## [184] "Clinton" "Encinitas" "Roswell"
## [187] "Jonesboro" "Antioch" "Homestead"
## [190] "La Porte" "Lansing" "Cuyahoga Falls"
## [193] "Reno" "Harrisonburg" "Escondido"
## [196] "Royal Oak" "Rockville" "Coral Springs"
## [199] "Buffalo" "Boynton Beach" "Gulfport"
## [202] "Fresno" "Greenville" "Macon"
## [205] "Cedar Rapids" "Providence" "Pueblo"
## [208] "Deltona" "Murray" "Middletown"
## [211] "Freeport" "Pico Rivera" "Provo"
## [214] "Pleasant Grove" "Smyrna" "Parma"
## [217] "Mobile" "New Bedford" "Irving"
## [220] "Vineland" "Glendale" "Niagara Falls"
## [223] "Thomasville" "Westminster" "Coppell"
## [226] "Pomona" "North Las Vegas" "Allentown"
## [229] "Tempe" "Laguna Niguel" "Bridgeton"
## [232] "Everett" "Watertown" "Appleton"
## [235] "Bellevue" "Allen" "El Paso"
## [238] "Grapevine" "Carrollton" "Kent"
## [241] "Lafayette" "Tigard" "Skokie"
## [244] "Plano" "Suffolk" "Indianapolis"
## [247] "Bayonne" "Greensboro" "Baltimore"
## [250] "Kenosha" "Olathe" "Tulsa"
## [253] "Redmond" "Raleigh" "Muskogee"
## [256] "Meriden" "Bowling Green" "South Bend"
## [259] "Spokane" "Keller" "Port Orange"
## [262] "Medford" "Charlottesville" "Missoula"
## [265] "Apopka" "Reading" "Broomfield"
## [268] "Paterson" "Oklahoma City" "Chesapeake"
## [271] "Lubbock" "Johnson City" "San Bernardino"
## [274] "Leominster" "Bozeman" "Perth Amboy"
## [277] "Ontario" "Rancho Cucamonga" "Moorhead"
## [280] "Mesquite" "Stockton" "Ormond Beach"
## [283] "Sunnyvale" "York" "College Station"
## [286] "Saint Louis" "Manteca" "San Angelo"
## [289] "Salt Lake City" "Knoxville" "Little Rock"
## [292] "Lincoln Park" "Marion" "Littleton"
## [295] "Bangor" "Southaven" "New Castle"
## [298] "Midland" "Sioux Falls" "Fort Collins"
## [301] "Clarksville" "Sacramento" "Thousand Oaks"
## [304] "Malden" "Holyoke" "Albuquerque"
## [307] "Sparks" "Coachella" "Elmhurst"
## [310] "Passaic" "North Charleston" "Newport News"
## [313] "Jamestown" "Mishawaka" "La Quinta"
## [316] "Tallahassee" "Nashville" "Bellingham"
## [319] "Woodstock" "Haltom City" "Wheeling"
## [322] "Summerville" "Hot Springs" "Englewood"
## [325] "Las Cruces" "Hoover" "Frisco"
## [328] "Vacaville" "Waukesha" "Bakersfield"
## [331] "Pompano Beach" "Corpus Christi" "Redondo Beach"
## [334] "Orlando" "Orange" "Lake Charles"
## [337] "Highland Park" "Hempstead" "Noblesville"
## [340] "Apple Valley" "Mount Pleasant" "Sterling Heights"
## [343] "Eau Claire" "Pharr" "Billings"
## [346] "Gresham" "Chattanooga" "Meridian"
## [349] "Bolingbrook" "Maple Grove" "Woodland"
## [352] "Missouri City" "Pearland" "San Mateo"
## [355] "Grand Rapids" "Visalia" "Overland Park"
## [358] "Temecula" "Yucaipa" "Revere"
## [361] "Conroe" "Tinley Park" "Dubuque"
## [364] "Dearborn Heights" "Santa Fe" "Hickory"
## [367] "Carol Stream" "Saint Cloud" "North Miami"
## [370] "Plantation" "Port Saint Lucie" "Rock Hill"
## [373] "Odessa" "West Allis" "Chula Vista"
## [376] "Manhattan" "Altoona" "Thornton"
## [379] "Champaign" "Texarkana" "Edinburg"
## [382] "Baytown" "Greenwood" "Woonsocket"
## [385] "Superior" "Bedford" "Covington"
## [388] "Broken Arrow" "Miramar" "Hollywood"
## [391] "Deer Park" "Wichita" "Mcallen"
## [394] "Iowa City" "Boise" "Cranston"
## [397] "Port Arthur" "Citrus Heights" "The Colony"
## [400] "Daytona Beach" "Bullhead City" "Portage"
## [403] "Fargo" "Elkhart" "San Gabriel"
## [406] "Margate" "Sandy Springs" "Mentor"
## [409] "Lawton" "Hampton" "Rome"
## [412] "La Crosse" "Lewiston" "Hattiesburg"
## [415] "Danville" "Logan" "Waterbury"
## [418] "Athens" "Avondale" "Marietta"
## [421] "Yuma" "Wausau" "Pasco"
## [424] "Oak Park" "Pensacola" "League City"
## [427] "Gaithersburg" "Lehi" "Tuscaloosa"
## [430] "Moreno Valley" "Georgetown" "Loveland"
## [433] "Chandler" "Helena" "Kirkwood"
## [436] "Waco" "Frankfort" "Bethlehem"
## [439] "Grand Island" "Woodbury" "Rogers"
## [442] "Clovis" "Jupiter" "Santa Barbara"
## [445] "Cedar Hill" "Norfolk" "Draper"
## [448] "Ann Arbor" "La Mesa" "Pocatello"
## [451] "Holland" "Milford" "Buffalo Grove"
## [454] "Lake Forest" "Redding" "Chico"
## [457] "Utica" "Conway" "Cheyenne"
## [460] "Owensboro" "Caldwell" "Kenner"
## [463] "Nashua" "Bartlett" "Redwood City"
## [466] "Lebanon" "Santa Maria" "Des Plaines"
## [469] "Longview" "Hendersonville" "Waterloo"
## [472] "Cambridge" "Palatine" "Beverly"
## [475] "Eugene" "Oxnard" "Renton"
## [478] "Glenview" "Delray Beach" "Commerce City"
## [481] "Texas City" "Wilson" "Rio Rancho"
## [484] "Goldsboro" "Montebello" "El Cajon"
## [487] "Beaumont" "West Palm Beach" "Abilene"
## [490] "Normal" "Saint Charles" "Camarillo"
## [493] "Hillsboro" "Burbank" "Modesto"
## [496] "Garden City" "Atlantic City" "Longmont"
## [499] "Davis" "Morgan Hill" "Clifton"
## [502] "Sheboygan" "East Point" "Rapid City"
## [505] "Andover" "Kissimmee" "Shelton"
## [508] "Danbury" "Sanford" "San Marcos"
## [511] "Greeley" "Mansfield" "Elyria"
## [514] "Twin Falls" "Coral Gables" "Romeoville"
## [517] "Marlborough" "Laurel" "Bryan"
## [520] "Pine Bluff" "Aberdeen" "Hagerstown"
## [523] "East Orange" "Arlington Heights" "Oswego"
## [526] "Coon Rapids" "San Clemente" "San Luis Obispo"
## [529] "Springdale" "Lodi" "Mason"
5.Column State :-
Unique values for the column - State
## [1] "Kentucky" "California" "Florida"
## [4] "North Carolina" "Washington" "Texas"
## [7] "Wisconsin" "Utah" "Nebraska"
## [10] "Pennsylvania" "Illinois" "Minnesota"
## [13] "Michigan" "Delaware" "Indiana"
## [16] "New York" "Arizona" "Virginia"
## [19] "Tennessee" "Alabama" "South Carolina"
## [22] "Oregon" "Colorado" "Iowa"
## [25] "Ohio" "Missouri" "Oklahoma"
## [28] "New Mexico" "Louisiana" "Connecticut"
## [31] "New Jersey" "Massachusetts" "Georgia"
## [34] "Nevada" "Rhode Island" "Mississippi"
## [37] "Arkansas" "Montana" "New Hampshire"
## [40] "Maryland" "District of Columbia" "Kansas"
## [43] "Vermont" "Maine" "South Dakota"
## [46] "Idaho" "North Dakota" "Wyoming"
## [49] "West Virginia"
## [1] "Count of the unique States: 49"
6.Column Region :
Unique values for the column - Region
## [1] "South"
## [1] "West"
## [1] "Central"
## [1] "East"
## [1] "Count of the unique Regions: 4"
7.Column Category :-
Unique values for the column - Category
## [1] "Furniture"
## [1] "Office Supplies"
## [1] "Technology"
## [1] "Count of the unique Categories: 3"
8.Sub-Category :-
Unique values for the column - Sub-category
## [1] "Unique values for the column - Sub-Category :"
## [1] "Bookcases"
## [1] "Chairs"
## [1] "Labels"
## [1] "Tables"
## [1] "Storage"
## [1] "Furnishings"
## [1] "Art"
## [1] "Phones"
## [1] "Binders"
## [1] "Appliances"
## [1] "Paper"
## [1] "Accessories"
## [1] "Envelopes"
## [1] "Fasteners"
## [1] "Supplies"
## [1] "Machines"
## [1] "Copiers"
## [1] "Count of the Sub-Category: 17"
9.Postal Code :-
Even though Postal Code is numeric in nature, treating it as a number has no value, since the mean of a postal code would not make sense. Hence it is considered a categorical variable.
Unique values for the column - Postal code :
## [1] 42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140
## [13] 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024
## [25] 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707
## [37] 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229
## [49] 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105
## [61] 98115 73034 90045 19134 88220 78207 77036 62521 71203 6824 75051 92374
## [73] 45011 7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227
## [85] 38401 33614 95051 55044 92037 77506 94513 27514 7960 45231 94110 90301
## [97] 33319 80906 7109 48180 8701 22204 80004 7601 33710 19143 90805 92345
## [109] 37130 84041 78745 1852 31907 6040 78550 85705 62301 2038 33024 98198
## [121] 61604 89115 2886 33180 28403 92646 40475 80027 1841 39212 48187 10801
## [133] 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340
## [145] 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653
## [157] 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591
## [169] 92691 48307 7060 85635 98661 60505 76017 40214 75081 44105 75701 27217
## [181] 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016
## [193] 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052
## [205] 7036 93905 8901 17602 3301 21044 75043 6360 22304 43615 87401 92503
## [217] 90503 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376
## [229] 61107 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030
## [241] 46350 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503
## [253] 93727 27834 11561 35630 31204 52402 2908 81001 94533 32725 42071 6457
## [265] 11520 90660 84604 84062 30080 24153 44134 36608 2740 75061 8360 85301
## [277] 14304 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677 8302
## [289] 2149 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224
## [301] 60076 75023 23434 46203 7002 28314 27405 21215 53142 66062 98002 74133
## [313] 97756 27604 74403 6450 42104 46614 6010 89015 99207 76248 45014 32127
## [325] 97504 22901 59801 33178 29501 97477 32712 19601 80020 65807 7501 73120
## [337] 23320 79424 65203 37604 36830 92404 1453 59715 85345 44107 8861 91761
## [349] 91730 56560 75150 95207 32174 94086 3820 17403 77840 63116 2169 95336
## [361] 44240 76903 84106 35810 37918 72209 48146 43302 80122 5408 4401 38671
## [373] 47362 48640 57103 80525 47905 37042 95823 91360 2148 1040 87105 89431
## [385] 92236 60126 7055 29406 23602 14701 46544 43402 92253 32303 37211 98226
## [397] 60098 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107
## [409] 53186 93309 33068 45373 78415 90278 32839 7050 70601 60035 11550 46060
## [421] 55124 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369
## [433] 95695 77489 77581 94403 49505 93277 66212 92592 92399 2151 77301 60477
## [445] 52001 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762
## [457] 53214 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712
## [469] 2895 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704
## [481] 2920 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776
## [493] 33063 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321
## [505] 6708 30605 4240 61832 85323 30062 85364 54401 99301 60302 32503 77573
## [517] 20877 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423
## [529] 18018 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020
## [541] 48104 91941 83201 49423 6460 60089 92630 96003 95928 13501 72032 82001
## [553] 42301 83605 70065 3060 38134 94061 37087 93454 60016 98632 37075 50701
## [565] 2138 60067 1915 97405 93030 98059 60025 33445 80022 77590 27893 87124
## [577] 27534 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123
## [589] 91505 95351 67846 8401 80501 95616 26003 95037 7011 53081 30344 57701
## [601] 1810 34741 6484 6810 52302 32771 78666 80634 76063 44035 83301 33134
## [613] 60441 1752 20707 77803 71603 57401 21740 7017 60004 60543 55433 92672
## [625] 94568 93405 72762 95240 77571 45040 30188
10.Order Date :-
Unique order dates for the column - Order Date
## [1] 1237
## [1] 2016 2015 2014 2017
From the above, we can see that the data covers 4 years (2014-2017). A sketch of the check is given below.
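A sketch, assuming the data frame is named `superstore_data` and the dates are already in Date format:
n_distinct(superstore_data$Order.Date)    # number of distinct order dates
unique(year(superstore_data$Order.Date))  # calendar years covered by the orders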
11.Order ID :-
Unique orders placed for the column - Order ID
## [1] 5009
12.Ship Date :-
Unique shipping dates for the column - Ship Date
## [1] 1334
## [1] 2016 2015 2014 2017 2018
From the above, we can see that shipping dates extend up to 2018, i.e. a total of 5 years.
13.Customer ID :-
Unique customers in the column - Customer ID
## [1] 793
14.Customer Name :-
Unique customers, checked using the column - Customer Name
## [1] 793
The counts of unique Customer.ID and Customer.Name values match (793), which suggests the data has no issues here: each customer ID corresponds to a unique customer name.
15.Product Name :- Unique products purchased in the orders placed, checked using Product Name
## [1] 1850
All unique Products :-
16.Product ID :- Unique products purchased in the orders placed, checked using Product ID
## [1] 1862
For products, the primary key is likely Product ID; some product names repeat across different IDs, which is why the Product Name count (1850) is slightly less than the Product ID count (1862).
17.Sales :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.444 17.280 54.490 229.858 209.940 22638.480
18.Quantity :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.79 5.00 14.00
19.Discount :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.1562 0.2000 0.8000
20.Profit :-
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6599.978 1.729 8.666 28.657 29.364 8399.976
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode
## 9994 5009 1237 1334 4
## Customer.ID Customer.Name Segment Country City
## 793 793 3 1 531
## State Postal.Code Region Product.ID Category
## 49 631 4 1862 3
## Sub.Category Product.Name Sales Quantity Discount
## 17 1850 5825 14 12
## Profit
## 7287
1. There are 16 categorical columns: Ship Mode, Segment, Country, City, State, Postal Code, Region, Category, Sub.Category, Order.ID, Order.Date, Ship.Date, Customer.ID, Customer.Name, Product.ID, and Product.Name.
(Here, Postal Code is considered categorical because its numeric value has no quantitative significance.)
2. The remaining 4 columns are numerical/continuous: Sales (price), Profit, Discount, and Quantity, which take decimal values. Also, Profit has both positive and negative values; negative values indicate a loss.
- Sales ranges from $0.444 to $22,638.48.
- Profit ranges from -$6,599.978 to $8,399.976.
- Discount ranges from 0 to 0.8.
- Quantity is numeric, ranging from 1 to 14.
The aim is to figure out the sales and profit earned by the Superstore in various parts of the USA. This can then be used to find trends and come up with improvements, for example whether there are issues with the current marketing or whether targeted marketing could help boost sales. All of this can be inferred through analysis.
Explanation of each column:
Ship Mode - It has 4 values: First Class, Same Day, Second Class, and Standard Class. Each value defines the shipping-time limit as follows:
Same day: >=0 days
First class: >=2 days
Second class: >=3 days
Standard class: >= 5 days
Segment - The market segment of the customer. There are three segments: Consumer, Corporate, and Home Office.
Country - Contains the country details; it was found to contain only the United States, i.e. one value, so this column can be removed.
City - The various cities in the United States; there are 531 cities.
State - There are 49 states in this dataset.
Postal Code - There are 631 postal codes in this dataset.
Region - There are 4 regions in the dataset: South, West, Central, and East.
Category - There are 3 categories: Furniture, Office Supplies, and Technology.
Sub-Category - There are 17 sub-categories: Bookcases, Chairs, Labels, Tables, Storage, Furnishings, Art, Phones, Binders, Appliances, Paper, Accessories, Envelopes, Fasteners, Supplies, Machines, and Copiers.
Order Date - There are 1,237 unique order dates over a span from 2014 to 2017, and the number of orders placed each year appears to increase year over year (see the yearly counts below, computed as sketched next).
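A sketch of the yearly tally (the data frame name is an assumption, and n() counts order lines rather than distinct Order IDs, which is why the yearly totals sum to 9,994):
superstore_data %>%
  mutate(year_order = year(Order.Date)) %>%
  group_by(year_order) %>%
  summarise(yearly_total_orders = n())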
## # A tibble: 4 × 2
## year_order yearly_total_orders
## <dbl> <int>
## 1 2014 1993
## 2 2015 2102
## 3 2016 2587
## 4 2017 3312
## [1] 5009
## # A tibble: 5 × 2
## ship_order yearly_total_shipped_orders
## <dbl> <int>
## 1 2014 1940
## 2 2015 2131
## 3 2016 2578
## 4 2017 3303
## 5 2018 42
Customer ID & Customer Name - There are 793 customers in the Superstore dataset. The counts of unique Customer.ID and Customer.Name values match (793), which suggests the data has no issues here, since each customer ID corresponds to a unique name.
Product Name & Product ID - 1,862 unique products are purchased across the orders placed; this can be determined from the unique ID of each product.
Sales - The sale amount (price), ranging from $0.444 to $22,638.48.
Profit - The profit earned on the product; a negative value indicates a loss. It ranges from -$6,599.978 to $8,399.976.
Discount - The discount given on the product purchased, ranging from 0 to 0.8.
Quantity - The quantity of products bought, ranging from 1 to 14.
As seen above, there are 49 states, and plotting them all to understand purchase counts is difficult. Hence we narrow it down to the top 10 states and figure out which state has the most purchases.
This answers the question: which are the top 10 states purchasing products from the Superstore? A sketch of the tally is given below.
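A sketch, assuming the data frame is named `superstore_data` (the hidden chunk may name the output columns slightly differently):
superstore_data %>%
  count(State, sort = TRUE, name = "count") %>%  # rows per state, largest first
  slice_head(n = 10)                             # keep the top 10 states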
## state count
## 1 California 2001
## 2 New York 1128
## 3 Texas 985
## 4 Pennsylvania 587
## 5 Washington 506
## 6 Illinois 492
## 7 Ohio 469
## 8 Florida 383
## 9 Michigan 255
## 10 North Carolina 249
## # A tibble: 10 × 2
## state count
## <chr> <int>
## 1 California 663
## 2 New York 352
## 3 Texas 317
## 4 Washington 215
## 5 Pennsylvania 197
## 6 Illinois 172
## 7 Ohio 161
## 8 Florida 126
## 9 North Carolina 85
## 10 Tennessee 81
## # A tibble: 10 × 2
## state count
## <chr> <int>
## 1 Binders 93
## 2 Paper 88
## 3 Art 70
## 4 Furnishings 65
## 5 Phones 56
## 6 Storage 53
## 7 Accessories 49
## 8 Appliances 37
## 9 Chairs 35
## 10 Labels 25
In 2017, binders were the most purchased product in California, followed by paper, art, furnishings, and phones. Offering better discounts could potentially boost sales based on customer needs.
The next question is to determine the average discount given for each sub-category; a sketch of the computation is given below.
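A sketch, assuming the same California / 2017 filter as the sub-category counts above (inferred from the matching counts; all names are assumptions):
superstore_data %>%
  filter(State == "California", year(Order.Date) == 2017) %>%
  group_by(Sub.Category) %>%
  summarise(count = n(), average_discount = mean(Discount)) %>%
  arrange(desc(count))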
## # A tibble: 17 × 3
## Sub_Category count average_discount
## <chr> <int> <dbl>
## 1 Binders 93 0.2
## 2 Paper 88 0
## 3 Art 70 0
## 4 Furnishings 65 0
## 5 Phones 56 0.2
## 6 Storage 53 0
## 7 Accessories 49 0
## 8 Appliances 37 0
## 9 Chairs 35 0.2
## 10 Labels 25 0
## 11 Tables 24 0.2
## 12 Bookcases 16 0.15
## 13 Envelopes 14 0
## 14 Fasteners 12 0
## 15 Supplies 11 0
## 16 Copiers 8 0.2
## 17 Machines 7 0.2
## # A tibble: 9 × 2
## # Groups: Category, Sub.Category [9]
## Category Sub.Category
## <chr> <chr>
## 1 Office Supplies Binders
## 2 Office Supplies Art
## 3 Office Supplies Envelopes
## 4 Office Supplies Paper
## 5 Office Supplies Storage
## 6 Office Supplies Labels
## 7 Office Supplies Appliances
## 8 Office Supplies Fasteners
## 9 Office Supplies Supplies
We can also check the total sales obtained for those top 10 states. This answers questions such as: are the top states different when ranked by total sales rather than by the number of orders placed over the 2014-2017 period?
## State Total_sales
## 1 California 457688
## 2 New York 310876
## 3 Texas 170188
## 4 Washington 138641
## 5 Pennsylvania 116512
## 6 Florida 89474
## 7 Illinois 80166
## 8 Ohio 78258
## 9 Michigan 76270
## 10 Virginia 70637
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 2 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 3 7 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 4 8 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 5 9 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 6 10 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## Customer.Name Segment Country City State Postal.Code
## 1 Darrin Van Huff Corporate United States Los Angeles California 90036
## 2 Brosina Hoffman Consumer United States Los Angeles California 90032
## 3 Brosina Hoffman Consumer United States Los Angeles California 90032
## 4 Brosina Hoffman Consumer United States Los Angeles California 90032
## 5 Brosina Hoffman Consumer United States Los Angeles California 90032
## 6 Brosina Hoffman Consumer United States Los Angeles California 90032
## Region Product.ID Category Sub.Category
## 1 West OFF-LA-10000240 Office Supplies Labels
## 2 West FUR-FU-10001487 Furniture Furnishings
## 3 West OFF-AR-10002833 Office Supplies Art
## 4 West TEC-PH-10002275 Technology Phones
## 5 West OFF-BI-10003910 Office Supplies Binders
## 6 West OFF-AP-10002892 Office Supplies Appliances
## Product.Name Sales
## 1 Self-Adhesive Address Labels for Typewriters by Universal 14.620
## 2 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.860
## 3 Newell 322 7.280
## 4 Mitel 5320 IP Phone VoIP phone 907.152
## 5 DXL Angle-View Binders with Locking Rings by Samsill 18.504
## 6 Belkin F5C206VTEL 6 Outlet Surge 114.900
## Quantity Discount Profit
## 1 2 0.0 6.8714
## 2 7 0.0 14.1694
## 3 4 0.0 1.9656
## 4 6 0.2 90.7152
## 5 3 0.2 5.7825
## 6 5 0.0 34.4700
## # A tibble: 6 × 2
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-13 1680.
## 3 2014-01-20 38.7
## 4 2014-01-27 427.
## 5 2014-01-31 291.
## 6 2014-02-02 12.4
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-07 NA
## 3 2014-01-08 NA
## 4 2014-01-09 NA
## 5 2014-01-10 NA
## 6 2014-01-11 NA
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_sales
## <date> <dbl>
## 1 2014-01-06 19.4
## 2 2014-01-07 0
## 3 2014-01-08 0
## 4 2014-01-09 0
## 5 2014-01-10 0
## 6 2014-01-11 0
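The zero-filled daily series above can be produced with tsibble's gap handling; a sketch, assuming `ca_sales` holds the per-date California totals shown earlier:
ca_daily <- ca_sales %>%
  as_tsibble(index = Order.Date) %>%  # regular daily [1D] index
  fill_gaps(Total_sales = 0)          # days with no orders become explicit zero-sales days
head(ca_daily)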
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Looking at a single year's trend of total sales for the state of California (which had the most orders placed in 2017: 663), we can see that sales increased over the course of 2017.
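A sketch of the trend plot behind this observation, assuming `ca_daily` from the previous step (the smoothing method is an assumption; the report only shows the `y ~ x` message):
ca_daily %>%
  filter(year(Order.Date) == 2017) %>%
  ggplot(aes(x = Order.Date, y = Total_sales)) +
  geom_line() +
  geom_smooth(method = "lm") +  # linear trend line over the year
  labs(title = "Daily total sales, California, 2017",
       x = "Order date", y = "Total sales ($)")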
In conclusion, given that the region is primarily tech-based, it appears the company is not currently promoting its products heavily. If that is the case, it may want to consider doing so in the future to draw in clients who are already prepared to purchase office supplies. If it is already marketing, it ought to keep doing so, because sales are increasing in the California area. In addition, since discounts are either nonexistent or extremely limited in California, the business could consider specific tactics to encourage customers to purchase its goods, such as offering discounts on the infrequent days when sales are low and increasing product prices on regular selling days. This should contribute to overall sales growth.
## # A tsibble: 6 x 2 [1D]
## Order.Date Total_profit
## <date> <dbl>
## 1 2014-01-06 9.33
## 2 2014-01-07 0
## 3 2014-01-08 0
## 4 2014-01-09 0
## 5 2014-01-10 0
## 6 2014-01-11 0
## `geom_smooth()` using formula = 'y ~ x'
Overall profit is also increasing over the years in the state of California.
Since California is in the west of the US, we ask the question:
###3. Is there any region-wise trend in how products are bought?
From the above plot, we can see that orders placed in the Western and Eastern regions are somewhat higher than in the other regions. So we can put forward the hypothesis that the average total sales in the Western and Eastern regions are similar. This is also supported by the observation that both regions have a growing number of tech companies. From the earlier analysis, we know that California is in the West of the US, while New York, the state with the second-highest number of orders, is in the East. So similar marketing or other tactics could be implemented in both regions, and region would play no particular role in the average total sales when the two are considered separately, since they are about the same.
Hence, two regions are taken for the purpose of forming a null hypothesis, the Western and Eastern regions, for the two reasons discussed above.
Let us take the null hypothesis to be: average sales of the Western region = average sales of the Eastern region.
\[ H_0: \text{Average sales are equal for the two regions (Western and Eastern).} \]
\[ H_0: \mu_{\text{West}} = \mu_{\text{East}}, \qquad H_1: \mu_{\text{West}} \neq \mu_{\text{East}} \]
Alpha level - \(\alpha\) defines the rejection region: if our observed test statistic falls in this region, we reject the null hypothesis and consider the alternative.
The alpha level \(\alpha\) represents the probability of making a Type I error, i.e. rejecting the null hypothesis when it is actually true.
We take \(\alpha\) = 0.05 so that the false positive rate (rejecting the null hypothesis when it is in fact true) is kept low.
Power level (1-\(\beta\)) is the probability of rejecting the null hypothesis when it is actually false, i.e. of correctly detecting a real difference and avoiding a Type II error. We set it to 0.80 so that the test has a reasonable chance of detecting a true difference.
Minimum effect size - The smallest difference in average sales that we want the test to be able to detect. Let us assume it to be around $100, i.e. a minimum $100 difference between the average sales of the Western and Eastern regions.
Sample size calculation:
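The chunk that produced the number below is not echoed. One way such a sample size could be computed (an assumption; the author may have used a different function, e.g. pwr::pwr.t.test) is with base R's power.t.test, scaling the $100 minimum difference by the overall sales standard deviation:
power.t.test(delta = 100,                     # minimum detectable difference in dollars
             sd = sd(superstore_data$Sales),  # assumed scale: overall SD of Sales
             sig.level = 0.05, power = 0.80,
             type = "two.sample",
             alternative = "two.sided")$n     # required sample size per group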
## [1] 610.7154
From the above calculation, the required sample size is about 611 per group. Each region has a sufficient amount of data: the Western region has 3,203 observations and the Eastern region has 2,848.
So I believe I can apply a Neyman-Pearson hypothesis test (here, a Welch two-sample t-test) to decide whether or not to reject the null hypothesis.
Neyman-Pearson Test:
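A sketch of the test as run, with vector names matching the output below (the data frame name is an assumption); the same pattern, with `tech_sales` and `office_sales`, applies to the category comparison later:
west_sales <- superstore_data %>% filter(Region == "West") %>% pull(Sales)
east_sales <- superstore_data %>% filter(Region == "East") %>% pull(Sales)
t.test(west_sales, east_sales)  # Welch two-sample t-test (unequal variances by default)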
##
## Welch Two Sample t-test
##
## data: west_sales and east_sales
## t = -0.79611, df = 5603.9, p-value = 0.426
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -41.00552 17.31977
## sample estimates:
## mean of x mean of y
## 226.4932 238.3361
## Fail to reject the null hypothesis. There is no significant difference in average sales between Western and Eastern regions for the products sold.
As per the result, we fail to reject the null hypothesis, which states that there is no difference between the average sales of the two regions, West and East. Since we do not reject the null hypothesis, the alternative hypothesis (that the average sales of the Western region differ from those of the Eastern region) is not supported.
So, all in all, we can say that the Eastern and Western regions can be treated in a similar manner when it comes to selling products. Whatever marketing is driving sales in the Western region could be implemented in the Eastern region too, and the two can be treated as a single region with no need to check for separate trends.
The same can also be looked into by checking the total sales for each region:
## # A tibble: 4 × 2
## Region Total_sales
## <chr> <dbl>
## 1 Central 501240.
## 2 East 678781.
## 3 South 391722.
## 4 West 725458.
###6. Hypothesis (Total sales VS each category?)
## # A tibble: 3 × 3
## Category count Total_sales
## <chr> <int> <dbl>
## 1 Technology 1847 836154.
## 2 Furniture 2121 742000.
## 3 Office Supplies 6026 719047.
Looking at the various categories under which products are sold, the total sales for each category appear somewhat similar over 2014-2017. So we can hypothesize that category plays no particular role in determining total sales, as they are roughly the same.
Hence, two categories are taken for the purpose of forming a null hypothesis: Technology and Office Supplies. The reasons for picking them are that Technology products are widely used all around the world, and Office Supplies are in demand because new companies are constantly opening and expanding.
Let us take the null hypothesis to be: average sales of Technology = average sales of Office Supplies.
\[ H_0: \text{Average sales are equal for the two categories (Technology and Office Supplies).} \]
\[ H_0: \mu_{\text{Tech}} = \mu_{\text{Office}}, \qquad H_1: \mu_{\text{Tech}} \neq \mu_{\text{Office}} \]
Alpha level - \(\alpha\) defines the rejection region: if our observed test statistic falls in this region, we reject the null hypothesis and consider the alternative.
The alpha level \(\alpha\) represents the probability of making a Type I error, i.e. rejecting the null hypothesis when it is actually true.
We take \(\alpha\) = 0.05 so that the false positive rate (rejecting the null hypothesis when it is in fact true) is kept low for the two categories, Technology and Office Supplies.
Power level (1-\(\beta\)) is the probability of rejecting the null hypothesis when it is actually false, i.e. of correctly detecting a real difference and avoiding a Type II error. We set it to 0.80 so that the test has a reasonable chance of detecting a true difference.
Minimum effect size - The smallest difference in average sales that we want the test to be able to detect. Let us assume it to be around $100, i.e. a minimum $100 difference between the average sales of Technology and Office Supplies.
Sample size calculation:
## [1] 610.7154
From the above calculation, the required sample size is about 611 per group. Each category has a sufficient amount of data: Technology has 1,847 observations and Office Supplies has 6,026.
So I believe I can apply a Neyman-Pearson hypothesis test (again, a Welch two-sample t-test) to decide whether or not to reject the null hypothesis.
Neyman-Pearson Test:
##
## Welch Two Sample t-test
##
## data: tech_sales and office_sales
## t = 12.694, df = 1982.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 281.8807 384.8897
## sample estimates:
## mean of x mean of y
## 452.7093 119.3241
## Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.
As per the result, we reject the null hypothesis, since there is a large difference between the average sales of the two categories, Technology and Office Supplies. Rejecting the null hypothesis means the alternative hypothesis is supported: the average sales of Technology are not equal to the average sales of Office Supplies.
As per the alternative hypothesis, we cannot assume that the average sales of the two categories will remain the same.