Data Dive-Summaries

Data Set - Sample Super Store :-

Points to consider A numeric summary of data for at least 10 columns of data

For categorical columns, this should include unique values and counts
For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)
Summaries Combined

A set of at least 5 novel questions to investigate informed by the following:

column summaries (i.e., the above bullet)
data documentation
your project’s goals/purpose

Use of aggregation functions (other than the ones used from the first bullet, above) I.e., use these explore something interesting about your data

A visual summary of at least 5 columns of your data

This should include distributions at least
In addition, you should consider trends, correlations, and interactions between variables
Use different channels (e.g., color) to show how categorical variables interact with continuous variables

Read the Data set

# Load tidyverse
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)

##        Ship.Mode   Segment       Country            City      State Postal.Code
## 1   Second Class  Consumer United States       Henderson   Kentucky       42420
## 2   Second Class  Consumer United States       Henderson   Kentucky       42420
## 3   Second Class Corporate United States     Los Angeles California       90036
## 4 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 5 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 6 Standard Class  Consumer United States     Los Angeles California       90032
##   Region        Category Sub.Category    Sales Quantity Discount    Profit
## 1  South       Furniture    Bookcases 261.9600        2     0.00   41.9136
## 2  South       Furniture       Chairs 731.9400        3     0.00  219.5820
## 3   West Office Supplies       Labels  14.6200        2     0.00    6.8714
## 4  South       Furniture       Tables 957.5775        5     0.45 -383.0310
## 5  South Office Supplies      Storage  22.3680        2     0.20    2.5164
## 6   West       Furniture  Furnishings  48.8600        7     0.00   14.1694

Various columns within the Data set:-

colnames(Superstore_data)

##  [1] "Ship.Mode"    "Segment"      "Country"      "City"         "State"       
##  [6] "Postal.Code"  "Region"       "Category"     "Sub.Category" "Sales"       
## [11] "Quantity"     "Discount"     "Profit"

Number of records :-

nrow(Superstore_data)

## [1] 9994

A numeric summary of data for at least 10 columns of data

Summarize the Data set - * For categorical columns, this should include unique values and counts

1.Column Ship Mode :-

Unique values for the column - Ship Mode :-

for (x in unique(Superstore_data$Ship.Mode)) {
  print(x)
}

## [1] "Second Class"
## [1] "Standard Class"
## [1] "First Class"
## [1] "Same Day"

Count of Unique Ship Modes :-

n_distinct(Superstore_data$Ship.Mode)

## [1] 4

2.Column Segment :-

Unique values for the column - Segment :-

for (x in unique(Superstore_data$Segment)) {
  print(x)
}

## [1] "Consumer"
## [1] "Corporate"
## [1] "Home Office"

sprintf("Count of the unique Segments: %s", n_distinct(Superstore_data$Segment))

## [1] "Count of the unique Segments: 3"

3.Column Country :-

Unique values for the column - Country

for (x in unique(Superstore_data$Country)) {
  print(x)
}

## [1] "United States"

sprintf("Count of the unique Countries: %s", n_distinct(Superstore_data$Country))

## [1] "Count of the unique Countries: 1"

4.Column City :-

Unique values for the column - City

unique(Superstore_data$City)

##   [1] "Henderson"         "Los Angeles"       "Fort Lauderdale"  
##   [4] "Concord"           "Seattle"           "Fort Worth"       
##   [7] "Madison"           "West Jordan"       "San Francisco"    
##  [10] "Fremont"           "Philadelphia"      "Orem"             
##  [13] "Houston"           "Richardson"        "Naperville"       
##  [16] "Melbourne"         "Eagan"             "Westland"         
##  [19] "Dover"             "New Albany"        "New York City"    
##  [22] "Troy"              "Chicago"           "Gilbert"          
##  [25] "Springfield"       "Jackson"           "Memphis"          
##  [28] "Decatur"           "Durham"            "Columbia"         
##  [31] "Rochester"         "Minneapolis"       "Portland"         
##  [34] "Saint Paul"        "Aurora"            "Charlotte"        
##  [37] "Orland Park"       "Urbandale"         "Columbus"         
##  [40] "Bristol"           "Wilmington"        "Bloomington"      
##  [43] "Phoenix"           "Roseville"         "Independence"     
##  [46] "Pasadena"          "Newark"            "Franklin"         
##  [49] "Scottsdale"        "San Jose"          "Edmond"           
##  [52] "Carlsbad"          "San Antonio"       "Monroe"           
##  [55] "Fairfield"         "Grand Prairie"     "Redlands"         
##  [58] "Hamilton"          "Westfield"         "Akron"            
##  [61] "Denver"            "Dallas"            "Whittier"         
##  [64] "Saginaw"           "Medina"            "Dublin"           
##  [67] "Detroit"           "Tampa"             "Santa Clara"      
##  [70] "Lakeville"         "San Diego"         "Brentwood"        
##  [73] "Chapel Hill"       "Morristown"        "Cincinnati"       
##  [76] "Inglewood"         "Tamarac"           "Colorado Springs" 
##  [79] "Belleville"        "Taylor"            "Lakewood"         
##  [82] "Arlington"         "Arvada"            "Hackensack"       
##  [85] "Saint Petersburg"  "Long Beach"        "Hesperia"         
##  [88] "Murfreesboro"      "Layton"            "Austin"           
##  [91] "Lowell"            "Manchester"        "Harlingen"        
##  [94] "Tucson"            "Quincy"            "Pembroke Pines"   
##  [97] "Des Moines"        "Peoria"            "Las Vegas"        
## [100] "Warwick"           "Miami"             "Huntington Beach" 
## [103] "Richmond"          "Louisville"        "Lawrence"         
## [106] "Canton"            "New Rochelle"      "Gastonia"         
## [109] "Jacksonville"      "Auburn"            "Norman"           
## [112] "Park Ridge"        "Amarillo"          "Lindenhurst"      
## [115] "Huntsville"        "Fayetteville"      "Costa Mesa"       
## [118] "Parker"            "Atlanta"           "Gladstone"        
## [121] "Great Falls"       "Lakeland"          "Montgomery"       
## [124] "Mesa"              "Green Bay"         "Anaheim"          
## [127] "Marysville"        "Salem"             "Laredo"           
## [130] "Grove City"        "Dearborn"          "Warner Robins"    
## [133] "Vallejo"           "Mission Viejo"     "Rochester Hills"  
## [136] "Plainfield"        "Sierra Vista"      "Vancouver"        
## [139] "Cleveland"         "Tyler"             "Burlington"       
## [142] "Waynesboro"        "Chester"           "Cary"             
## [145] "Palm Coast"        "Mount Vernon"      "Hialeah"          
## [148] "Oceanside"         "Evanston"          "Trenton"          
## [151] "Cottage Grove"     "Bossier City"      "Lancaster"        
## [154] "Asheville"         "Lake Elsinore"     "Omaha"            
## [157] "Edmonds"           "Santa Ana"         "Milwaukee"        
## [160] "Florence"          "Lorain"            "Linden"           
## [163] "Salinas"           "New Brunswick"     "Garland"          
## [166] "Norwich"           "Alexandria"        "Toledo"           
## [169] "Farmington"        "Riverside"         "Torrance"         
## [172] "Round Rock"        "Boca Raton"        "Virginia Beach"   
## [175] "Murrieta"          "Olympia"           "Washington"       
## [178] "Jefferson City"    "Saint Peters"      "Rockford"         
## [181] "Brownsville"       "Yonkers"           "Oakland"          
## [184] "Clinton"           "Encinitas"         "Roswell"          
## [187] "Jonesboro"         "Antioch"           "Homestead"        
## [190] "La Porte"          "Lansing"           "Cuyahoga Falls"   
## [193] "Reno"              "Harrisonburg"      "Escondido"        
## [196] "Royal Oak"         "Rockville"         "Coral Springs"    
## [199] "Buffalo"           "Boynton Beach"     "Gulfport"         
## [202] "Fresno"            "Greenville"        "Macon"            
## [205] "Cedar Rapids"      "Providence"        "Pueblo"           
## [208] "Deltona"           "Murray"            "Middletown"       
## [211] "Freeport"          "Pico Rivera"       "Provo"            
## [214] "Pleasant Grove"    "Smyrna"            "Parma"            
## [217] "Mobile"            "New Bedford"       "Irving"           
## [220] "Vineland"          "Glendale"          "Niagara Falls"    
## [223] "Thomasville"       "Westminster"       "Coppell"          
## [226] "Pomona"            "North Las Vegas"   "Allentown"        
## [229] "Tempe"             "Laguna Niguel"     "Bridgeton"        
## [232] "Everett"           "Watertown"         "Appleton"         
## [235] "Bellevue"          "Allen"             "El Paso"          
## [238] "Grapevine"         "Carrollton"        "Kent"             
## [241] "Lafayette"         "Tigard"            "Skokie"           
## [244] "Plano"             "Suffolk"           "Indianapolis"     
## [247] "Bayonne"           "Greensboro"        "Baltimore"        
## [250] "Kenosha"           "Olathe"            "Tulsa"            
## [253] "Redmond"           "Raleigh"           "Muskogee"         
## [256] "Meriden"           "Bowling Green"     "South Bend"       
## [259] "Spokane"           "Keller"            "Port Orange"      
## [262] "Medford"           "Charlottesville"   "Missoula"         
## [265] "Apopka"            "Reading"           "Broomfield"       
## [268] "Paterson"          "Oklahoma City"     "Chesapeake"       
## [271] "Lubbock"           "Johnson City"      "San Bernardino"   
## [274] "Leominster"        "Bozeman"           "Perth Amboy"      
## [277] "Ontario"           "Rancho Cucamonga"  "Moorhead"         
## [280] "Mesquite"          "Stockton"          "Ormond Beach"     
## [283] "Sunnyvale"         "York"              "College Station"  
## [286] "Saint Louis"       "Manteca"           "San Angelo"       
## [289] "Salt Lake City"    "Knoxville"         "Little Rock"      
## [292] "Lincoln Park"      "Marion"            "Littleton"        
## [295] "Bangor"            "Southaven"         "New Castle"       
## [298] "Midland"           "Sioux Falls"       "Fort Collins"     
## [301] "Clarksville"       "Sacramento"        "Thousand Oaks"    
## [304] "Malden"            "Holyoke"           "Albuquerque"      
## [307] "Sparks"            "Coachella"         "Elmhurst"         
## [310] "Passaic"           "North Charleston"  "Newport News"     
## [313] "Jamestown"         "Mishawaka"         "La Quinta"        
## [316] "Tallahassee"       "Nashville"         "Bellingham"       
## [319] "Woodstock"         "Haltom City"       "Wheeling"         
## [322] "Summerville"       "Hot Springs"       "Englewood"        
## [325] "Las Cruces"        "Hoover"            "Frisco"           
## [328] "Vacaville"         "Waukesha"          "Bakersfield"      
## [331] "Pompano Beach"     "Corpus Christi"    "Redondo Beach"    
## [334] "Orlando"           "Orange"            "Lake Charles"     
## [337] "Highland Park"     "Hempstead"         "Noblesville"      
## [340] "Apple Valley"      "Mount Pleasant"    "Sterling Heights" 
## [343] "Eau Claire"        "Pharr"             "Billings"         
## [346] "Gresham"           "Chattanooga"       "Meridian"         
## [349] "Bolingbrook"       "Maple Grove"       "Woodland"         
## [352] "Missouri City"     "Pearland"          "San Mateo"        
## [355] "Grand Rapids"      "Visalia"           "Overland Park"    
## [358] "Temecula"          "Yucaipa"           "Revere"           
## [361] "Conroe"            "Tinley Park"       "Dubuque"          
## [364] "Dearborn Heights"  "Santa Fe"          "Hickory"          
## [367] "Carol Stream"      "Saint Cloud"       "North Miami"      
## [370] "Plantation"        "Port Saint Lucie"  "Rock Hill"        
## [373] "Odessa"            "West Allis"        "Chula Vista"      
## [376] "Manhattan"         "Altoona"           "Thornton"         
## [379] "Champaign"         "Texarkana"         "Edinburg"         
## [382] "Baytown"           "Greenwood"         "Woonsocket"       
## [385] "Superior"          "Bedford"           "Covington"        
## [388] "Broken Arrow"      "Miramar"           "Hollywood"        
## [391] "Deer Park"         "Wichita"           "Mcallen"          
## [394] "Iowa City"         "Boise"             "Cranston"         
## [397] "Port Arthur"       "Citrus Heights"    "The Colony"       
## [400] "Daytona Beach"     "Bullhead City"     "Portage"          
## [403] "Fargo"             "Elkhart"           "San Gabriel"      
## [406] "Margate"           "Sandy Springs"     "Mentor"           
## [409] "Lawton"            "Hampton"           "Rome"             
## [412] "La Crosse"         "Lewiston"          "Hattiesburg"      
## [415] "Danville"          "Logan"             "Waterbury"        
## [418] "Athens"            "Avondale"          "Marietta"         
## [421] "Yuma"              "Wausau"            "Pasco"            
## [424] "Oak Park"          "Pensacola"         "League City"      
## [427] "Gaithersburg"      "Lehi"              "Tuscaloosa"       
## [430] "Moreno Valley"     "Georgetown"        "Loveland"         
## [433] "Chandler"          "Helena"            "Kirkwood"         
## [436] "Waco"              "Frankfort"         "Bethlehem"        
## [439] "Grand Island"      "Woodbury"          "Rogers"           
## [442] "Clovis"            "Jupiter"           "Santa Barbara"    
## [445] "Cedar Hill"        "Norfolk"           "Draper"           
## [448] "Ann Arbor"         "La Mesa"           "Pocatello"        
## [451] "Holland"           "Milford"           "Buffalo Grove"    
## [454] "Lake Forest"       "Redding"           "Chico"            
## [457] "Utica"             "Conway"            "Cheyenne"         
## [460] "Owensboro"         "Caldwell"          "Kenner"           
## [463] "Nashua"            "Bartlett"          "Redwood City"     
## [466] "Lebanon"           "Santa Maria"       "Des Plaines"      
## [469] "Longview"          "Hendersonville"    "Waterloo"         
## [472] "Cambridge"         "Palatine"          "Beverly"          
## [475] "Eugene"            "Oxnard"            "Renton"           
## [478] "Glenview"          "Delray Beach"      "Commerce City"    
## [481] "Texas City"        "Wilson"            "Rio Rancho"       
## [484] "Goldsboro"         "Montebello"        "El Cajon"         
## [487] "Beaumont"          "West Palm Beach"   "Abilene"          
## [490] "Normal"            "Saint Charles"     "Camarillo"        
## [493] "Hillsboro"         "Burbank"           "Modesto"          
## [496] "Garden City"       "Atlantic City"     "Longmont"         
## [499] "Davis"             "Morgan Hill"       "Clifton"          
## [502] "Sheboygan"         "East Point"        "Rapid City"       
## [505] "Andover"           "Kissimmee"         "Shelton"          
## [508] "Danbury"           "Sanford"           "San Marcos"       
## [511] "Greeley"           "Mansfield"         "Elyria"           
## [514] "Twin Falls"        "Coral Gables"      "Romeoville"       
## [517] "Marlborough"       "Laurel"            "Bryan"            
## [520] "Pine Bluff"        "Aberdeen"          "Hagerstown"       
## [523] "East Orange"       "Arlington Heights" "Oswego"           
## [526] "Coon Rapids"       "San Clemente"      "San Luis Obispo"  
## [529] "Springdale"        "Lodi"              "Mason"

sprintf("Count of the unique cities: %s", n_distinct(Superstore_data$City))

## [1] "Count of the unique cities: 531"

5.Column State :-

Unique values for the column - State

unique(Superstore_data$State)

##  [1] "Kentucky"             "California"           "Florida"             
##  [4] "North Carolina"       "Washington"           "Texas"               
##  [7] "Wisconsin"            "Utah"                 "Nebraska"            
## [10] "Pennsylvania"         "Illinois"             "Minnesota"           
## [13] "Michigan"             "Delaware"             "Indiana"             
## [16] "New York"             "Arizona"              "Virginia"            
## [19] "Tennessee"            "Alabama"              "South Carolina"      
## [22] "Oregon"               "Colorado"             "Iowa"                
## [25] "Ohio"                 "Missouri"             "Oklahoma"            
## [28] "New Mexico"           "Louisiana"            "Connecticut"         
## [31] "New Jersey"           "Massachusetts"        "Georgia"             
## [34] "Nevada"               "Rhode Island"         "Mississippi"         
## [37] "Arkansas"             "Montana"              "New Hampshire"       
## [40] "Maryland"             "District of Columbia" "Kansas"              
## [43] "Vermont"              "Maine"                "South Dakota"        
## [46] "Idaho"                "North Dakota"         "Wyoming"             
## [49] "West Virginia"

sprintf("Count of the unique States: %s", n_distinct(Superstore_data$State))

## [1] "Count of the unique States: 49"

6.Column Region :

Unique values for the column - Region

for (x in unique(Superstore_data$Region)) {
  print(x)
}

## [1] "South"
## [1] "West"
## [1] "Central"
## [1] "East"

sprintf("Count of the unique Regions: %s", n_distinct(Superstore_data$Region))

## [1] "Count of the unique Regions: 4"

7.Column Category :-

Unique values for the column - Category

for (x in unique(Superstore_data$Category)) {
  print(x)
}

## [1] "Furniture"
## [1] "Office Supplies"
## [1] "Technology"

sprintf("Count of the unique Categories: %s", n_distinct(Superstore_data$Category))

## [1] "Count of the unique Categories: 3"

8.Sub-Category :-

Unique values for the column - Sub-category

sprintf("Unique values for the column - Sub-Category :");

## [1] "Unique values for the column - Sub-Category :"

for (x in unique(Superstore_data$Sub.Category)) {
  print(x)
}

## [1] "Bookcases"
## [1] "Chairs"
## [1] "Labels"
## [1] "Tables"
## [1] "Storage"
## [1] "Furnishings"
## [1] "Art"
## [1] "Phones"
## [1] "Binders"
## [1] "Appliances"
## [1] "Paper"
## [1] "Accessories"
## [1] "Envelopes"
## [1] "Fasteners"
## [1] "Supplies"
## [1] "Machines"
## [1] "Copiers"

sprintf("Count of the Sub-Category: %s", n_distinct(Superstore_data$Sub.Category))

## [1] "Count of the Sub-Category: 17"

9.Postal Code :-

Even though Postal code is numeric in nature, it doesn’t have any value as the mean of a postal code wouldn’t make sense. Hence considering it as a Categorical value.

Unique values for the column - Postal code :

unique(Superstore_data$Postal.Code)

##   [1] 42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140
##  [13] 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024
##  [25] 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707
##  [37] 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229
##  [49] 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105
##  [61] 98115 73034 90045 19134 88220 78207 77036 62521 71203  6824 75051 92374
##  [73] 45011  7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227
##  [85] 38401 33614 95051 55044 92037 77506 94513 27514  7960 45231 94110 90301
##  [97] 33319 80906  7109 48180  8701 22204 80004  7601 33710 19143 90805 92345
## [109] 37130 84041 78745  1852 31907  6040 78550 85705 62301  2038 33024 98198
## [121] 61604 89115  2886 33180 28403 92646 40475 80027  1841 39212 48187 10801
## [133] 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340
## [145] 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653
## [157] 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591
## [169] 92691 48307  7060 85635 98661 60505 76017 40214 75081 44105 75701 27217
## [181] 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016
## [193] 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052
## [205]  7036 93905  8901 17602  3301 21044 75043  6360 22304 43615 87401 92503
## [217] 90503 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376
## [229] 61107 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030
## [241] 46350 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503
## [253] 93727 27834 11561 35630 31204 52402  2908 81001 94533 32725 42071  6457
## [265] 11520 90660 84604 84062 30080 24153 44134 36608  2740 75061  8360 85301
## [277] 14304 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677  8302
## [289]  2149 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224
## [301] 60076 75023 23434 46203  7002 28314 27405 21215 53142 66062 98002 74133
## [313] 97756 27604 74403  6450 42104 46614  6010 89015 99207 76248 45014 32127
## [325] 97504 22901 59801 33178 29501 97477 32712 19601 80020 65807  7501 73120
## [337] 23320 79424 65203 37604 36830 92404  1453 59715 85345 44107  8861 91761
## [349] 91730 56560 75150 95207 32174 94086  3820 17403 77840 63116  2169 95336
## [361] 44240 76903 84106 35810 37918 72209 48146 43302 80122  5408  4401 38671
## [373] 47362 48640 57103 80525 47905 37042 95823 91360  2148  1040 87105 89431
## [385] 92236 60126  7055 29406 23602 14701 46544 43402 92253 32303 37211 98226
## [397] 60098 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107
## [409] 53186 93309 33068 45373 78415 90278 32839  7050 70601 60035 11550 46060
## [421] 55124 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369
## [433] 95695 77489 77581 94403 49505 93277 66212 92592 92399  2151 77301 60477
## [445] 52001 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762
## [457] 53214 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712
## [469]  2895 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704
## [481]  2920 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776
## [493] 33063 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321
## [505]  6708 30605  4240 61832 85323 30062 85364 54401 99301 60302 32503 77573
## [517] 20877 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423
## [529] 18018 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020
## [541] 48104 91941 83201 49423  6460 60089 92630 96003 95928 13501 72032 82001
## [553] 42301 83605 70065  3060 38134 94061 37087 93454 60016 98632 37075 50701
## [565]  2138 60067  1915 97405 93030 98059 60025 33445 80022 77590 27893 87124
## [577] 27534 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123
## [589] 91505 95351 67846  8401 80501 95616 26003 95037  7011 53081 30344 57701
## [601]  1810 34741  6484  6810 52302 32771 78666 80634 76063 44035 83301 33134
## [613] 60441  1752 20707 77803 71603 57401 21740  7017 60004 60543 55433 92672
## [625] 94568 93405 72762 95240 77571 45040 30188

sprintf("Count of the Postal-Code: %s",n_distinct(Superstore_data$Postal.Code))

## [1] "Count of the Postal-Code: 631"

For numeric columns, this includes min/max, central tendency, and some notion of distribution (e.g., quantiles)

10.Sales :-

summary(Superstore_data$Sales)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.444    17.280    54.490   229.858   209.940 22638.480

11.Quantity :-

summary(Superstore_data$Quantity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.79    5.00   14.00

12.Discount :-

summary(Superstore_data$Discount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.1562  0.2000  0.8000

Profit :-

summary(Superstore_data$Profit)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6599.978     1.729     8.666    28.657    29.364  8399.976

Summaries Combined :

summary(Superstore_data)

##   Ship.Mode           Segment            Country              City          
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     State            Postal.Code       Region            Category        
##  Length:9994        Min.   : 1040   Length:9994        Length:9994       
##  Class :character   1st Qu.:23223   Class :character   Class :character  
##  Mode  :character   Median :56431   Mode  :character   Mode  :character  
##                     Mean   :55190                                        
##                     3rd Qu.:90008                                        
##                     Max.   :99301                                        
##  Sub.Category           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976

Distinct Count of each column -

sapply(Superstore_data,function(x) n_distinct(x))

##    Ship.Mode      Segment      Country         City        State  Postal.Code 
##            4            3            1          531           49          631 
##       Region     Category Sub.Category        Sales     Quantity     Discount 
##            4            3           17         5825           14           12 
##       Profit 
##         7287

#Data Documentation:

There are 9994 rows.
The Data is tidy with about 13 columns, Each row represents a unique and separate observation.
It fulfills the criteria of atleast 2 columns being continuous. In my case the columns Sales(Price), Profit and Discount are Decimal Also, Profit is having both positive and negative values. Negative values are indicating Loss.
Sales having a range of $0.444 to $22638.480
Profit having a range starting from $-6599.978 to $8399.976
Discount having a range from 0 to 0.8
Quantity is numeric with a range starting from 1 to 14.
It also fulfills the criteria of atleast 2 columns being categorical. In my case there are 9 categorical columns which are - Ship Mode, Segment, Country, City, State, Postal Code, Region, Category, Sub.Category
(Here, considering Postal Code as Categorical as the numeric in the value does not have any significance.)

Purpose of Data Collection:- To figure out the profit earned by the Superstore via its sales in various parts of USA. Further, it can be used to find trends in the same and come up with improvements.

Explanantion of each Columns :

Ship Mode - It has 4 values, they are First Class, Same Day, Second Class and Standard Class. Each value is defines the limit for being delayed as follows: -
Same day: >=0 days
First class: >=2 days
Second class: >=3 days
Standard class: >= 5 days
Segment - It is the product sector.There are segments which are Consumer, Corporate and Home Office
Country - This column contains the country details. It was found to be only for United States i.e one value. Hence this column can be removed.
City - Various Cities present in United States.There are about 531 cities.
State - There are 49 states in this dataset.
Postal Code - There are 631 postal codes in this dataset
Region - There are 4 regions in the dataset. i.e. South, West, Central and East
Category - There are 3 categories, they are Furniture, Office Supplies and Technology.
Sub-Category - There are 17 sub-categories, some of them are - Bookcases, Chairs, Labels, Tables, Storage, Furnishings, Art, Phones, Binders, Appliances, Paper, Accessories, Envelopes, Fasteners, Supplies, Machines, Copiers
Sales - It is the price. Having a range starting from $0.444 to $22638.480
Profit - It is the profit earned from the product. Can also indicate loss with a negative sign. Having a range -6599.978 to 8399.976
Discount - Discount given for the product purchased, range starts from 0 to 0.8.
Quantity - The quantity of products bought, range starts from 1 to 14.

In order to understand the data better we can try to visualize the data, which will indirectly help us in asking valid questions to investigate further.

visual summary

Univariate Plot:-

1.Ship Mode -

From the plot, can observe that most purchases are done via Ship.Mode = Standard Class, followed by Second Class, First Class and Same Day. About 6000 records out of 9994 i.e. more than 50% purchase are via Standard Class ship mode.

Superstore_data |>
  ggplot(mapping = aes(x=Ship.Mode, color = Ship.Mode, fill=Ship.Mode)) +
  geom_bar()

last_plot() +
  coord_polar()

2.Segment -

From the plot below can figure out that most purchases are done for Consumer needs, which are greater than 5000 purchases.Followed by Corporate purchases which is around 3000 and Home Office with a count around 1800.

So most purchases in the Superstore are done by the Consumer segment.

Superstore_data |>
  ggplot(mapping = aes(x=Segment, color = Segment, fill=Segment)) +
  geom_bar() +
  theme_minimal()

3.Region -

Overall every region has bought products but from the plot below we can observe that customers from West and East part have purchased most products, when compared to Central and southern part of US.

It answers the question -> which region has purchased the most products?

Superstore_data |>
  ggplot(mapping = aes(x=Region, color = Region, fill=Region)) +
  geom_bar() +
  theme_minimal()

4.State -

As it was seen above, there are about 49 states, plotting them all to understand the count of purchase is pretty difficult. Hence narrowing it down to top 10 states.

It would answer the question -> which are the top 10 states that have purchased the products from the Superstore?

state_count<- aggregate(Superstore_data$State,by=list(Superstore_data$State), FUN=length)

top_10_states_count <-state_count %>%
arrange(desc(x)) %>%
    slice(1:10) %>%
  rename (
    state=Group.1,
    count=x
  )
top_10_states_count

##             state count
## 1      California  2001
## 2        New York  1128
## 3           Texas   985
## 4    Pennsylvania   587
## 5      Washington   506
## 6        Illinois   492
## 7            Ohio   469
## 8         Florida   383
## 9        Michigan   255
## 10 North Carolina   249

top_10_states_count |>
  ggplot(mapping = aes(x=state, y=count, fill=state)) +
  geom_bar(stat = "identity") +
  theme_minimal()+
   theme (axis.text.x =  element_text(angle=90))

Can see that California is the state that has the most purchases.

5.City - As West region had most purchases of product, we can verify that by digging deep into the cities. Also there are about 531 city. Drilling down and figuring out the top 10 cities.

It would answer the question -> which are the top 10 cities that purchased the products from the Superstore?

City_count<- aggregate(Superstore_data$City,by=list(Superstore_data$City), FUN=length)

top_10_city_count <-City_count %>%
arrange(desc(x)) %>%
    slice(1:10) %>%
  rename (
    City=Group.1,
    Count=x
  )
top_10_city_count

##             City Count
## 1  New York City   915
## 2    Los Angeles   747
## 3   Philadelphia   537
## 4  San Francisco   510
## 5        Seattle   428
## 6        Houston   377
## 7        Chicago   314
## 8       Columbus   222
## 9      San Diego   170
## 10   Springfield   163

top_10_city_count |>
  ggplot(mapping = aes(x=City, y=Count, fill=City)) +
  geom_bar(stat = "identity") +
  theme_minimal()+
   theme (axis.text.x =  element_text(angle=90))

6.Category - From the plot below can figure out that Office supplies were the products which were bought the most, in all the purchases.

It answers the question -> which category of products were bought the most?

Superstore_data |>
  ggplot(mapping = aes(x=Category, color = Category, fill=Category)) +
  geom_bar() +
  theme_minimal()

last_plot() +
  coord_polar()

7.Sub - Category :- From the previous plot, we had observed that Office Supplies were the most purchased category. In order to understand which item from those Office supplies are being bought the most, we plot the sub-category. In the plot below, we can see that Binders are the heavily purchased item, followed by Paper and Furnishings.

It answers the question -> which sub-category of products were bought the most?

Superstore_data |>
  ggplot(mapping = aes(x=Sub.Category, fill=Sub.Category)) +
  geom_bar() +
  theme_minimal() +
   theme (axis.text.x =  element_text(angle=90))

8.Quantity - Most products bought are in pairs or trios. So majorly people tend to buy in such forms.

Superstore_data %>%
ggplot() +
geom_histogram(mapping = aes(x = Quantity), fill = 'blue', binwidth=1)

Multivariate plotting-

1.Region vs Category :

Superstore_data %>%
ggplot() +
  geom_bar(mapping = aes(x = Region, fill = Category), position="dodge")

Be it any region, all of them have a common category which was purchased the most i.e. Office Supplies While Furniture and Tech were bought little less.

2.Quantity vs Profit

Superstore_data %>%
ggplot() +
  geom_smooth(mapping = aes(x = Quantity, y = Sales))+
  theme_minimal()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

We can see the Sales and Quantity have a Positive Correlation. As the quantity increases the sales also increases.

3.Sub-Category and Mean Sales -

aggregate(Superstore_data$Sales, list(Superstore_data$Sub.Category), FUN=mean) |>
  rename (
    Sub_Category=Group.1,
    Mean_Sales=x
  ) |>
arrange(desc(Mean_Sales))

##    Sub_Category Mean_Sales
## 1       Copiers 2198.94162
## 2      Machines 1645.55331
## 3        Tables  648.79477
## 4        Chairs  532.33242
## 5     Bookcases  503.85963
## 6        Phones  371.21153
## 7       Storage  264.59055
## 8      Supplies  245.65020
## 9    Appliances  230.75571
## 10  Accessories  215.97460
## 11      Binders  133.56056
## 12  Furnishings   95.82567
## 13    Envelopes   64.86772
## 14        Paper   57.28409
## 15       Labels   34.30305
## 16          Art   34.06883
## 17    Fasteners   13.93677

Even though from uni-variate visualization we found that people bought binders and papers. The most costliest item from all of those was a Copier followed by Machines.

4.Region vs Category vs Sales:

Superstore_data %>%
ggplot() +
  geom_bar(mapping = aes(x = Region, y=Sales,fill = Category),stat = "identity", position="dodge")

From the above plot, we can see that Technologies are the costliest items be it any region. But if compared between regions, South has bought tech products which have a cost greater than $2000. Followed by Central, West and East region having cost approximately around $1750, $1450 and $1200 respectively. With Office supplies being the 2nd most costly item bought in all regions. But East regions sees similar cost on Office and furniture.

5.Region vs Category vs Profit:

Superstore_data %>%
ggplot() +
  geom_bar(mapping = aes(x = Region, y=Profit,fill = Category),stat = "identity", position="dodge")

From the above plot, loss is seen for Technology in 3 regions - East (highest loss), South and West. Rest of the categories don’t see much loss and have a little profit. Most profit is obtained in central region for Tech but loss due to Office supplies.

Data Dive-Summaries

Amritha Prakash

2023-09-04