Data Set - Sample Super Store :-
Points to consider A numeric summary of data for at
least 10 columns of data
A set of at least 5 novel questions to investigate informed by the
following:
Use of aggregation functions (other than the ones used from the first
bullet, above) I.e., use these explore something interesting about your
data
A visual summary of at least 5 columns of your data
Read the Data set
# Load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.2 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.3 âś” tibble 3.2.1
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
Various columns within the Data set:-
colnames(Superstore_data)
## [1] "Ship.Mode" "Segment" "Country" "City" "State"
## [6] "Postal.Code" "Region" "Category" "Sub.Category" "Sales"
## [11] "Quantity" "Discount" "Profit"
Number of records :-
nrow(Superstore_data)
## [1] 9994
A numeric summary of data for at least 10 columns of data
Summarize the Data set - * For categorical columns, this should include unique values and counts
1.Column Ship Mode :-
Unique values for the column - Ship Mode :-
for (x in unique(Superstore_data$Ship.Mode)) {
print(x)
}
## [1] "Second Class"
## [1] "Standard Class"
## [1] "First Class"
## [1] "Same Day"
Count of Unique Ship Modes :-
n_distinct(Superstore_data$Ship.Mode)
## [1] 4
2.Column Segment :-
Unique values for the column - Segment :-
for (x in unique(Superstore_data$Segment)) {
print(x)
}
## [1] "Consumer"
## [1] "Corporate"
## [1] "Home Office"
sprintf("Count of the unique Segments: %s", n_distinct(Superstore_data$Segment))
## [1] "Count of the unique Segments: 3"
3.Column Country :-
Unique values for the column - Country
for (x in unique(Superstore_data$Country)) {
print(x)
}
## [1] "United States"
sprintf("Count of the unique Countries: %s", n_distinct(Superstore_data$Country))
## [1] "Count of the unique Countries: 1"
4.Column City :-
Unique values for the column - City
unique(Superstore_data$City)
## [1] "Henderson" "Los Angeles" "Fort Lauderdale"
## [4] "Concord" "Seattle" "Fort Worth"
## [7] "Madison" "West Jordan" "San Francisco"
## [10] "Fremont" "Philadelphia" "Orem"
## [13] "Houston" "Richardson" "Naperville"
## [16] "Melbourne" "Eagan" "Westland"
## [19] "Dover" "New Albany" "New York City"
## [22] "Troy" "Chicago" "Gilbert"
## [25] "Springfield" "Jackson" "Memphis"
## [28] "Decatur" "Durham" "Columbia"
## [31] "Rochester" "Minneapolis" "Portland"
## [34] "Saint Paul" "Aurora" "Charlotte"
## [37] "Orland Park" "Urbandale" "Columbus"
## [40] "Bristol" "Wilmington" "Bloomington"
## [43] "Phoenix" "Roseville" "Independence"
## [46] "Pasadena" "Newark" "Franklin"
## [49] "Scottsdale" "San Jose" "Edmond"
## [52] "Carlsbad" "San Antonio" "Monroe"
## [55] "Fairfield" "Grand Prairie" "Redlands"
## [58] "Hamilton" "Westfield" "Akron"
## [61] "Denver" "Dallas" "Whittier"
## [64] "Saginaw" "Medina" "Dublin"
## [67] "Detroit" "Tampa" "Santa Clara"
## [70] "Lakeville" "San Diego" "Brentwood"
## [73] "Chapel Hill" "Morristown" "Cincinnati"
## [76] "Inglewood" "Tamarac" "Colorado Springs"
## [79] "Belleville" "Taylor" "Lakewood"
## [82] "Arlington" "Arvada" "Hackensack"
## [85] "Saint Petersburg" "Long Beach" "Hesperia"
## [88] "Murfreesboro" "Layton" "Austin"
## [91] "Lowell" "Manchester" "Harlingen"
## [94] "Tucson" "Quincy" "Pembroke Pines"
## [97] "Des Moines" "Peoria" "Las Vegas"
## [100] "Warwick" "Miami" "Huntington Beach"
## [103] "Richmond" "Louisville" "Lawrence"
## [106] "Canton" "New Rochelle" "Gastonia"
## [109] "Jacksonville" "Auburn" "Norman"
## [112] "Park Ridge" "Amarillo" "Lindenhurst"
## [115] "Huntsville" "Fayetteville" "Costa Mesa"
## [118] "Parker" "Atlanta" "Gladstone"
## [121] "Great Falls" "Lakeland" "Montgomery"
## [124] "Mesa" "Green Bay" "Anaheim"
## [127] "Marysville" "Salem" "Laredo"
## [130] "Grove City" "Dearborn" "Warner Robins"
## [133] "Vallejo" "Mission Viejo" "Rochester Hills"
## [136] "Plainfield" "Sierra Vista" "Vancouver"
## [139] "Cleveland" "Tyler" "Burlington"
## [142] "Waynesboro" "Chester" "Cary"
## [145] "Palm Coast" "Mount Vernon" "Hialeah"
## [148] "Oceanside" "Evanston" "Trenton"
## [151] "Cottage Grove" "Bossier City" "Lancaster"
## [154] "Asheville" "Lake Elsinore" "Omaha"
## [157] "Edmonds" "Santa Ana" "Milwaukee"
## [160] "Florence" "Lorain" "Linden"
## [163] "Salinas" "New Brunswick" "Garland"
## [166] "Norwich" "Alexandria" "Toledo"
## [169] "Farmington" "Riverside" "Torrance"
## [172] "Round Rock" "Boca Raton" "Virginia Beach"
## [175] "Murrieta" "Olympia" "Washington"
## [178] "Jefferson City" "Saint Peters" "Rockford"
## [181] "Brownsville" "Yonkers" "Oakland"
## [184] "Clinton" "Encinitas" "Roswell"
## [187] "Jonesboro" "Antioch" "Homestead"
## [190] "La Porte" "Lansing" "Cuyahoga Falls"
## [193] "Reno" "Harrisonburg" "Escondido"
## [196] "Royal Oak" "Rockville" "Coral Springs"
## [199] "Buffalo" "Boynton Beach" "Gulfport"
## [202] "Fresno" "Greenville" "Macon"
## [205] "Cedar Rapids" "Providence" "Pueblo"
## [208] "Deltona" "Murray" "Middletown"
## [211] "Freeport" "Pico Rivera" "Provo"
## [214] "Pleasant Grove" "Smyrna" "Parma"
## [217] "Mobile" "New Bedford" "Irving"
## [220] "Vineland" "Glendale" "Niagara Falls"
## [223] "Thomasville" "Westminster" "Coppell"
## [226] "Pomona" "North Las Vegas" "Allentown"
## [229] "Tempe" "Laguna Niguel" "Bridgeton"
## [232] "Everett" "Watertown" "Appleton"
## [235] "Bellevue" "Allen" "El Paso"
## [238] "Grapevine" "Carrollton" "Kent"
## [241] "Lafayette" "Tigard" "Skokie"
## [244] "Plano" "Suffolk" "Indianapolis"
## [247] "Bayonne" "Greensboro" "Baltimore"
## [250] "Kenosha" "Olathe" "Tulsa"
## [253] "Redmond" "Raleigh" "Muskogee"
## [256] "Meriden" "Bowling Green" "South Bend"
## [259] "Spokane" "Keller" "Port Orange"
## [262] "Medford" "Charlottesville" "Missoula"
## [265] "Apopka" "Reading" "Broomfield"
## [268] "Paterson" "Oklahoma City" "Chesapeake"
## [271] "Lubbock" "Johnson City" "San Bernardino"
## [274] "Leominster" "Bozeman" "Perth Amboy"
## [277] "Ontario" "Rancho Cucamonga" "Moorhead"
## [280] "Mesquite" "Stockton" "Ormond Beach"
## [283] "Sunnyvale" "York" "College Station"
## [286] "Saint Louis" "Manteca" "San Angelo"
## [289] "Salt Lake City" "Knoxville" "Little Rock"
## [292] "Lincoln Park" "Marion" "Littleton"
## [295] "Bangor" "Southaven" "New Castle"
## [298] "Midland" "Sioux Falls" "Fort Collins"
## [301] "Clarksville" "Sacramento" "Thousand Oaks"
## [304] "Malden" "Holyoke" "Albuquerque"
## [307] "Sparks" "Coachella" "Elmhurst"
## [310] "Passaic" "North Charleston" "Newport News"
## [313] "Jamestown" "Mishawaka" "La Quinta"
## [316] "Tallahassee" "Nashville" "Bellingham"
## [319] "Woodstock" "Haltom City" "Wheeling"
## [322] "Summerville" "Hot Springs" "Englewood"
## [325] "Las Cruces" "Hoover" "Frisco"
## [328] "Vacaville" "Waukesha" "Bakersfield"
## [331] "Pompano Beach" "Corpus Christi" "Redondo Beach"
## [334] "Orlando" "Orange" "Lake Charles"
## [337] "Highland Park" "Hempstead" "Noblesville"
## [340] "Apple Valley" "Mount Pleasant" "Sterling Heights"
## [343] "Eau Claire" "Pharr" "Billings"
## [346] "Gresham" "Chattanooga" "Meridian"
## [349] "Bolingbrook" "Maple Grove" "Woodland"
## [352] "Missouri City" "Pearland" "San Mateo"
## [355] "Grand Rapids" "Visalia" "Overland Park"
## [358] "Temecula" "Yucaipa" "Revere"
## [361] "Conroe" "Tinley Park" "Dubuque"
## [364] "Dearborn Heights" "Santa Fe" "Hickory"
## [367] "Carol Stream" "Saint Cloud" "North Miami"
## [370] "Plantation" "Port Saint Lucie" "Rock Hill"
## [373] "Odessa" "West Allis" "Chula Vista"
## [376] "Manhattan" "Altoona" "Thornton"
## [379] "Champaign" "Texarkana" "Edinburg"
## [382] "Baytown" "Greenwood" "Woonsocket"
## [385] "Superior" "Bedford" "Covington"
## [388] "Broken Arrow" "Miramar" "Hollywood"
## [391] "Deer Park" "Wichita" "Mcallen"
## [394] "Iowa City" "Boise" "Cranston"
## [397] "Port Arthur" "Citrus Heights" "The Colony"
## [400] "Daytona Beach" "Bullhead City" "Portage"
## [403] "Fargo" "Elkhart" "San Gabriel"
## [406] "Margate" "Sandy Springs" "Mentor"
## [409] "Lawton" "Hampton" "Rome"
## [412] "La Crosse" "Lewiston" "Hattiesburg"
## [415] "Danville" "Logan" "Waterbury"
## [418] "Athens" "Avondale" "Marietta"
## [421] "Yuma" "Wausau" "Pasco"
## [424] "Oak Park" "Pensacola" "League City"
## [427] "Gaithersburg" "Lehi" "Tuscaloosa"
## [430] "Moreno Valley" "Georgetown" "Loveland"
## [433] "Chandler" "Helena" "Kirkwood"
## [436] "Waco" "Frankfort" "Bethlehem"
## [439] "Grand Island" "Woodbury" "Rogers"
## [442] "Clovis" "Jupiter" "Santa Barbara"
## [445] "Cedar Hill" "Norfolk" "Draper"
## [448] "Ann Arbor" "La Mesa" "Pocatello"
## [451] "Holland" "Milford" "Buffalo Grove"
## [454] "Lake Forest" "Redding" "Chico"
## [457] "Utica" "Conway" "Cheyenne"
## [460] "Owensboro" "Caldwell" "Kenner"
## [463] "Nashua" "Bartlett" "Redwood City"
## [466] "Lebanon" "Santa Maria" "Des Plaines"
## [469] "Longview" "Hendersonville" "Waterloo"
## [472] "Cambridge" "Palatine" "Beverly"
## [475] "Eugene" "Oxnard" "Renton"
## [478] "Glenview" "Delray Beach" "Commerce City"
## [481] "Texas City" "Wilson" "Rio Rancho"
## [484] "Goldsboro" "Montebello" "El Cajon"
## [487] "Beaumont" "West Palm Beach" "Abilene"
## [490] "Normal" "Saint Charles" "Camarillo"
## [493] "Hillsboro" "Burbank" "Modesto"
## [496] "Garden City" "Atlantic City" "Longmont"
## [499] "Davis" "Morgan Hill" "Clifton"
## [502] "Sheboygan" "East Point" "Rapid City"
## [505] "Andover" "Kissimmee" "Shelton"
## [508] "Danbury" "Sanford" "San Marcos"
## [511] "Greeley" "Mansfield" "Elyria"
## [514] "Twin Falls" "Coral Gables" "Romeoville"
## [517] "Marlborough" "Laurel" "Bryan"
## [520] "Pine Bluff" "Aberdeen" "Hagerstown"
## [523] "East Orange" "Arlington Heights" "Oswego"
## [526] "Coon Rapids" "San Clemente" "San Luis Obispo"
## [529] "Springdale" "Lodi" "Mason"
sprintf("Count of the unique cities: %s", n_distinct(Superstore_data$City))
## [1] "Count of the unique cities: 531"
5.Column State :-
Unique values for the column - State
unique(Superstore_data$State)
## [1] "Kentucky" "California" "Florida"
## [4] "North Carolina" "Washington" "Texas"
## [7] "Wisconsin" "Utah" "Nebraska"
## [10] "Pennsylvania" "Illinois" "Minnesota"
## [13] "Michigan" "Delaware" "Indiana"
## [16] "New York" "Arizona" "Virginia"
## [19] "Tennessee" "Alabama" "South Carolina"
## [22] "Oregon" "Colorado" "Iowa"
## [25] "Ohio" "Missouri" "Oklahoma"
## [28] "New Mexico" "Louisiana" "Connecticut"
## [31] "New Jersey" "Massachusetts" "Georgia"
## [34] "Nevada" "Rhode Island" "Mississippi"
## [37] "Arkansas" "Montana" "New Hampshire"
## [40] "Maryland" "District of Columbia" "Kansas"
## [43] "Vermont" "Maine" "South Dakota"
## [46] "Idaho" "North Dakota" "Wyoming"
## [49] "West Virginia"
sprintf("Count of the unique States: %s", n_distinct(Superstore_data$State))
## [1] "Count of the unique States: 49"
6.Column Region :
Unique values for the column - Region
for (x in unique(Superstore_data$Region)) {
print(x)
}
## [1] "South"
## [1] "West"
## [1] "Central"
## [1] "East"
sprintf("Count of the unique Regions: %s", n_distinct(Superstore_data$Region))
## [1] "Count of the unique Regions: 4"
7.Column Category :-
Unique values for the column - Category
for (x in unique(Superstore_data$Category)) {
print(x)
}
## [1] "Furniture"
## [1] "Office Supplies"
## [1] "Technology"
sprintf("Count of the unique Categories: %s", n_distinct(Superstore_data$Category))
## [1] "Count of the unique Categories: 3"
8.Sub-Category :-
Unique values for the column - Sub-category
sprintf("Unique values for the column - Sub-Category :");
## [1] "Unique values for the column - Sub-Category :"
for (x in unique(Superstore_data$Sub.Category)) {
print(x)
}
## [1] "Bookcases"
## [1] "Chairs"
## [1] "Labels"
## [1] "Tables"
## [1] "Storage"
## [1] "Furnishings"
## [1] "Art"
## [1] "Phones"
## [1] "Binders"
## [1] "Appliances"
## [1] "Paper"
## [1] "Accessories"
## [1] "Envelopes"
## [1] "Fasteners"
## [1] "Supplies"
## [1] "Machines"
## [1] "Copiers"
sprintf("Count of the Sub-Category: %s", n_distinct(Superstore_data$Sub.Category))
## [1] "Count of the Sub-Category: 17"
9.Postal Code :-
Even though Postal code is numeric in nature, it doesn’t have any value as the mean of a postal code wouldn’t make sense. Hence considering it as a Categorical value.
Unique values for the column - Postal code :
unique(Superstore_data$Postal.Code)
## [1] 42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140
## [13] 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024
## [25] 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707
## [37] 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229
## [49] 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105
## [61] 98115 73034 90045 19134 88220 78207 77036 62521 71203 6824 75051 92374
## [73] 45011 7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227
## [85] 38401 33614 95051 55044 92037 77506 94513 27514 7960 45231 94110 90301
## [97] 33319 80906 7109 48180 8701 22204 80004 7601 33710 19143 90805 92345
## [109] 37130 84041 78745 1852 31907 6040 78550 85705 62301 2038 33024 98198
## [121] 61604 89115 2886 33180 28403 92646 40475 80027 1841 39212 48187 10801
## [133] 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340
## [145] 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653
## [157] 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591
## [169] 92691 48307 7060 85635 98661 60505 76017 40214 75081 44105 75701 27217
## [181] 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016
## [193] 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052
## [205] 7036 93905 8901 17602 3301 21044 75043 6360 22304 43615 87401 92503
## [217] 90503 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376
## [229] 61107 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030
## [241] 46350 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503
## [253] 93727 27834 11561 35630 31204 52402 2908 81001 94533 32725 42071 6457
## [265] 11520 90660 84604 84062 30080 24153 44134 36608 2740 75061 8360 85301
## [277] 14304 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677 8302
## [289] 2149 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224
## [301] 60076 75023 23434 46203 7002 28314 27405 21215 53142 66062 98002 74133
## [313] 97756 27604 74403 6450 42104 46614 6010 89015 99207 76248 45014 32127
## [325] 97504 22901 59801 33178 29501 97477 32712 19601 80020 65807 7501 73120
## [337] 23320 79424 65203 37604 36830 92404 1453 59715 85345 44107 8861 91761
## [349] 91730 56560 75150 95207 32174 94086 3820 17403 77840 63116 2169 95336
## [361] 44240 76903 84106 35810 37918 72209 48146 43302 80122 5408 4401 38671
## [373] 47362 48640 57103 80525 47905 37042 95823 91360 2148 1040 87105 89431
## [385] 92236 60126 7055 29406 23602 14701 46544 43402 92253 32303 37211 98226
## [397] 60098 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107
## [409] 53186 93309 33068 45373 78415 90278 32839 7050 70601 60035 11550 46060
## [421] 55124 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369
## [433] 95695 77489 77581 94403 49505 93277 66212 92592 92399 2151 77301 60477
## [445] 52001 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762
## [457] 53214 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712
## [469] 2895 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704
## [481] 2920 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776
## [493] 33063 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321
## [505] 6708 30605 4240 61832 85323 30062 85364 54401 99301 60302 32503 77573
## [517] 20877 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423
## [529] 18018 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020
## [541] 48104 91941 83201 49423 6460 60089 92630 96003 95928 13501 72032 82001
## [553] 42301 83605 70065 3060 38134 94061 37087 93454 60016 98632 37075 50701
## [565] 2138 60067 1915 97405 93030 98059 60025 33445 80022 77590 27893 87124
## [577] 27534 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123
## [589] 91505 95351 67846 8401 80501 95616 26003 95037 7011 53081 30344 57701
## [601] 1810 34741 6484 6810 52302 32771 78666 80634 76063 44035 83301 33134
## [613] 60441 1752 20707 77803 71603 57401 21740 7017 60004 60543 55433 92672
## [625] 94568 93405 72762 95240 77571 45040 30188
sprintf("Count of the Postal-Code: %s",n_distinct(Superstore_data$Postal.Code))
## [1] "Count of the Postal-Code: 631"
10.Sales :-
summary(Superstore_data$Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.444 17.280 54.490 229.858 209.940 22638.480
11.Quantity :-
summary(Superstore_data$Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.79 5.00 14.00
12.Discount :-
summary(Superstore_data$Discount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.1562 0.2000 0.8000
summary(Superstore_data$Profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6599.978 1.729 8.666 28.657 29.364 8399.976
summary(Superstore_data)
## Ship.Mode Segment Country City
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## State Postal.Code Region Category
## Length:9994 Min. : 1040 Length:9994 Length:9994
## Class :character 1st Qu.:23223 Class :character Class :character
## Mode :character Median :56431 Mode :character Mode :character
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Sub.Category Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
sapply(Superstore_data,function(x) n_distinct(x))
## Ship.Mode Segment Country City State Postal.Code
## 4 3 1 531 49 631
## Region Category Sub.Category Sales Quantity Discount
## 4 3 17 5825 14 12
## Profit
## 7287
#Data Documentation:
There are 9994 rows.
The Data is tidy with about 13 columns, Each row represents a unique and separate observation.
It fulfills the criteria of atleast 2 columns being continuous.
In my case the columns Sales(Price), Profit and Discount are Decimal
Also, Profit is having both positive and negative values. Negative
values are indicating Loss.
Sales having a range of $0.444 to $22638.480
Profit having a range starting from $-6599.978 to
$8399.976
Discount having a range from 0 to 0.8
Quantity is numeric with a range starting from 1 to 14.
It also fulfills the criteria of atleast 2 columns being
categorical. In my case there are 9 categorical columns which are - Ship
Mode, Segment, Country, City, State, Postal Code, Region, Category,
Sub.Category
(Here, considering Postal Code as Categorical as the numeric in the
value does not have any significance.)
Purpose of Data Collection:- To figure out the profit earned by the Superstore via its sales in various parts of USA. Further, it can be used to find trends in the same and come up with improvements.
Explanantion of each Columns :
Ship Mode - It has 4 values, they are First Class, Same Day,
Second Class and Standard Class. Each value is defines the limit for
being delayed as follows: -
Same day: >=0 days
First class: >=2 days
Second class: >=3 days
Standard class: >= 5 days
Segment - It is the product sector.There are segments which are Consumer, Corporate and Home Office
Country - This column contains the country details. It was found to be only for United States i.e one value. Hence this column can be removed.
City - Various Cities present in United States.There are about 531 cities.
State - There are 49 states in this dataset.
Postal Code - There are 631 postal codes in this dataset
Region - There are 4 regions in the dataset. i.e. South, West, Central and East
Category - There are 3 categories, they are Furniture, Office Supplies and Technology.
Sub-Category - There are 17 sub-categories, some of them are - Bookcases, Chairs, Labels, Tables, Storage, Furnishings, Art, Phones, Binders, Appliances, Paper, Accessories, Envelopes, Fasteners, Supplies, Machines, Copiers
Sales - It is the price. Having a range starting from $0.444 to $22638.480
Profit - It is the profit earned from the product. Can also indicate loss with a negative sign. Having a range -6599.978 to 8399.976
Discount - Discount given for the product purchased, range starts from 0 to 0.8.
Quantity - The quantity of products bought, range starts from 1 to 14.
In order to understand the data better we can try to visualize the data, which will indirectly help us in asking valid questions to investigate further.
visual summary
1.Ship Mode -
From the plot, can observe that most purchases are done via Ship.Mode = Standard Class, followed by Second Class, First Class and Same Day. About 6000 records out of 9994 i.e. more than 50% purchase are via Standard Class ship mode.
Superstore_data |>
ggplot(mapping = aes(x=Ship.Mode, color = Ship.Mode, fill=Ship.Mode)) +
geom_bar()
last_plot() +
coord_polar()
2.Segment -
From the plot below can figure out that most purchases are done for Consumer needs, which are greater than 5000 purchases.Followed by Corporate purchases which is around 3000 and Home Office with a count around 1800.
So most purchases in the Superstore are done by the Consumer segment.
Superstore_data |>
ggplot(mapping = aes(x=Segment, color = Segment, fill=Segment)) +
geom_bar() +
theme_minimal()
3.Region -
Overall every region has bought products but from the plot below we can observe that customers from West and East part have purchased most products, when compared to Central and southern part of US.
It answers the question -> which region has purchased the most products?
Superstore_data |>
ggplot(mapping = aes(x=Region, color = Region, fill=Region)) +
geom_bar() +
theme_minimal()
4.State -
As it was seen above, there are about 49 states, plotting them all to understand the count of purchase is pretty difficult. Hence narrowing it down to top 10 states.
It would answer the question -> which are the top 10 states that have purchased the products from the Superstore?
state_count<- aggregate(Superstore_data$State,by=list(Superstore_data$State), FUN=length)
top_10_states_count <-state_count %>%
arrange(desc(x)) %>%
slice(1:10) %>%
rename (
state=Group.1,
count=x
)
top_10_states_count
## state count
## 1 California 2001
## 2 New York 1128
## 3 Texas 985
## 4 Pennsylvania 587
## 5 Washington 506
## 6 Illinois 492
## 7 Ohio 469
## 8 Florida 383
## 9 Michigan 255
## 10 North Carolina 249
top_10_states_count |>
ggplot(mapping = aes(x=state, y=count, fill=state)) +
geom_bar(stat = "identity") +
theme_minimal()+
theme (axis.text.x = element_text(angle=90))
Can see that California is the state that has the most purchases.
5.City - As West region had most purchases of product, we can verify that by digging deep into the cities. Also there are about 531 city. Drilling down and figuring out the top 10 cities.
It would answer the question -> which are the top 10 cities that purchased the products from the Superstore?
City_count<- aggregate(Superstore_data$City,by=list(Superstore_data$City), FUN=length)
top_10_city_count <-City_count %>%
arrange(desc(x)) %>%
slice(1:10) %>%
rename (
City=Group.1,
Count=x
)
top_10_city_count
## City Count
## 1 New York City 915
## 2 Los Angeles 747
## 3 Philadelphia 537
## 4 San Francisco 510
## 5 Seattle 428
## 6 Houston 377
## 7 Chicago 314
## 8 Columbus 222
## 9 San Diego 170
## 10 Springfield 163
top_10_city_count |>
ggplot(mapping = aes(x=City, y=Count, fill=City)) +
geom_bar(stat = "identity") +
theme_minimal()+
theme (axis.text.x = element_text(angle=90))
6.Category - From the plot below can figure out that Office supplies were the products which were bought the most, in all the purchases.
It answers the question -> which category of products were bought the most?
Superstore_data |>
ggplot(mapping = aes(x=Category, color = Category, fill=Category)) +
geom_bar() +
theme_minimal()
last_plot() +
coord_polar()
7.Sub - Category :- From the previous plot, we had observed that Office Supplies were the most purchased category. In order to understand which item from those Office supplies are being bought the most, we plot the sub-category. In the plot below, we can see that Binders are the heavily purchased item, followed by Paper and Furnishings.
It answers the question -> which sub-category of products were bought the most?
Superstore_data |>
ggplot(mapping = aes(x=Sub.Category, fill=Sub.Category)) +
geom_bar() +
theme_minimal() +
theme (axis.text.x = element_text(angle=90))
8.Quantity - Most products bought are in pairs or trios. So majorly people tend to buy in such forms.
Superstore_data %>%
ggplot() +
geom_histogram(mapping = aes(x = Quantity), fill = 'blue', binwidth=1)
Multivariate plotting-
1.Region vs Category :
Superstore_data %>%
ggplot() +
geom_bar(mapping = aes(x = Region, fill = Category), position="dodge")
Be it any region, all of them have a common category which was purchased the most i.e. Office Supplies While Furniture and Tech were bought little less.
2.Quantity vs Profit
Superstore_data %>%
ggplot() +
geom_smooth(mapping = aes(x = Quantity, y = Sales))+
theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
We can see the Sales and Quantity have a Positive Correlation. As the quantity increases the sales also increases.
3.Sub-Category and Mean Sales -
aggregate(Superstore_data$Sales, list(Superstore_data$Sub.Category), FUN=mean) |>
rename (
Sub_Category=Group.1,
Mean_Sales=x
) |>
arrange(desc(Mean_Sales))
## Sub_Category Mean_Sales
## 1 Copiers 2198.94162
## 2 Machines 1645.55331
## 3 Tables 648.79477
## 4 Chairs 532.33242
## 5 Bookcases 503.85963
## 6 Phones 371.21153
## 7 Storage 264.59055
## 8 Supplies 245.65020
## 9 Appliances 230.75571
## 10 Accessories 215.97460
## 11 Binders 133.56056
## 12 Furnishings 95.82567
## 13 Envelopes 64.86772
## 14 Paper 57.28409
## 15 Labels 34.30305
## 16 Art 34.06883
## 17 Fasteners 13.93677
Even though from uni-variate visualization we found that people bought binders and papers. The most costliest item from all of those was a Copier followed by Machines.
4.Region vs Category vs Sales:
Superstore_data %>%
ggplot() +
geom_bar(mapping = aes(x = Region, y=Sales,fill = Category),stat = "identity", position="dodge")
From the above plot, we can see that Technologies are the costliest items be it any region. But if compared between regions, South has bought tech products which have a cost greater than $2000. Followed by Central, West and East region having cost approximately around $1750, $1450 and $1200 respectively. With Office supplies being the 2nd most costly item bought in all regions. But East regions sees similar cost on Office and furniture.
5.Region vs Category vs Profit:
Superstore_data %>%
ggplot() +
geom_bar(mapping = aes(x = Region, y=Profit,fill = Category),stat = "identity", position="dodge")
From the above plot, loss is seen for Technology in 3 regions - East (highest loss), South and West. Rest of the categories don’t see much loss and have a little profit. Most profit is obtained in central region for Tech but loss due to Office supplies.