With growing demand and cut-throat competition in the market, a Superstore giant is seeking your expertise to understand what works best for them. They would like to know which products, regions, categories and customer segments they should target or avoid.
They also want a regression model to predict Sales or Profit.
The required libraries are loaded.
library(readxl)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(forcats)
library(rpart)
library(rpart.plot)
library(caret)
library(scales)
library(superml)
library(corrplot)
library(reshape2)
library(broom)
library(ggpubr)
First, let’s read the dataset from the XLS file, which contains 3 sheets:
df_order = read_xls('superstorev2.xls', sheet ='Orders', col_names = TRUE)
df_return = read_xls('superstorev2.xls', sheet='Returns', col_names = TRUE)
df_people = read_xls('superstorev2.xls', sheet='People', col_names = TRUE)
“Order” dataframe:
head(df_order)
## # A tibble: 6 x 21
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode`
## <dbl> <chr> <dttm> <dttm> <chr>
## 1 1 CA-2016-152156 2016-11-08 00:00:00 2016-11-11 00:00:00 Second Class
## 2 2 CA-2016-152156 2016-11-08 00:00:00 2016-11-11 00:00:00 Second Class
## 3 3 CA-2016-138688 2016-06-12 00:00:00 2016-06-16 00:00:00 Second Class
## 4 4 US-2015-108966 2015-10-11 00:00:00 2015-10-18 00:00:00 <NA>
## 5 5 US-2015-108966 2015-10-11 00:00:00 2015-10-18 00:00:00 Standard Class
## 6 6 CA-2014-115812 2014-06-09 00:00:00 2014-06-14 00:00:00 Standard Class
## # ... with 16 more variables: Customer ID <chr>, Customer Name <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, Postal Code <dbl>,
## # Region <chr>, Product ID <chr>, Category <chr>, Sub-Category <chr>,
## # Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## # Profit <dbl>
“Return” dataframe:
head(df_return)
## # A tibble: 6 x 2
## Returned `Order ID`
## <chr> <chr>
## 1 Yes CA-2017-153822
## 2 Yes CA-2017-129707
## 3 Yes CA-2014-152345
## 4 Yes CA-2015-156440
## 5 Yes US-2017-155999
## 6 Yes CA-2014-157924
“People” dataframe:
head(df_people)
## # A tibble: 4 x 2
## Person Region
## <chr> <chr>
## 1 Anna Andreadi West
## 2 Chuck Magee East
## 3 Kelly Williams Central
## 4 Cassandra Brandow South
We merge df_return$Returned (and, optionally, df_people$Person) into df_order, using a full join to retain all rows. The code below performs the Returns join on Order ID; a sketch of the People join by Region follows the printed output.
df = dplyr::full_join(df_order, df_return, by="Order ID")
df$`Order Date` = as.Date.character(df$`Order Date`, format="%Y-%m-%d")
df$`Ship Date` = as.Date(df$`Ship Date`, format="%Y-%m-%d")
df
## # A tibble: 9,994 x 22
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode` `Customer ID`
## <dbl> <chr> <date> <date> <chr> <chr>
## 1 1 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 2 2 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 3 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 4 4 US-2015-108966 2015-10-11 2015-10-18 <NA> SO-20335
## 5 5 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 6 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 7 7 CA-2014-115812 2014-06-09 2014-06-14 <NA> BH-11710
## 8 8 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 9 9 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 10 10 CA-2014-115812 2014-06-09 2014-06-14 <NA> BH-11710
## # ... with 9,984 more rows, and 16 more variables: Customer Name <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, Postal Code <dbl>,
## # Region <chr>, Product ID <chr>, Category <chr>, Sub-Category <chr>,
## # Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## # Profit <dbl>, Returned <chr>
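The join above attaches only the Returns information. A minimal sketch of how the People sheet could be attached as well, assuming the Region values match between the sheets:
# Attach the regional manager (Person) to each order by Region
df_with_person = dplyr::full_join(df, df_people, by = "Region")
head(df_with_person$Person)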
After the datasets are merged, df is the dataset we use for this project.
The structure of the dataset is checked.
str(df)
## tibble [9,994 x 22] (S3: tbl_df/tbl/data.frame)
## $ Row ID : num [1:9994] 1 2 3 4 5 6 7 8 9 10 ...
## $ Order ID : chr [1:9994] "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order Date : Date[1:9994], format: "2016-11-08" "2016-11-08" ...
## $ Ship Date : Date[1:9994], format: "2016-11-11" "2016-11-11" ...
## $ Ship Mode : chr [1:9994] "Second Class" "Second Class" "Second Class" NA ...
## $ Customer ID : chr [1:9994] "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Customer Name: chr [1:9994] "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
## $ Segment : chr [1:9994] "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr [1:9994] "United States" "United States" "United States" "United States" ...
## $ City : chr [1:9994] "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr [1:9994] "KENTUCKY" "kentucky" "California" "florida" ...
## $ Postal Code : num [1:9994] 42420 42420 90036 33311 33311 ...
## $ Region : chr [1:9994] "South" "South" "West" "South" ...
## $ Product ID : chr [1:9994] "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr [1:9994] "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub-Category : chr [1:9994] "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product Name : chr [1:9994] "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num [1:9994] 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : num [1:9994] 2 -3 2 5 2 -7 4 6 3 5 ...
## $ Discount : num [1:9994] 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num [1:9994] 41.91 219.58 6.87 -383.03 2.52 ...
## $ Returned : chr [1:9994] NA NA NA NA ...
The merged dataframe has 9,994 instances and 22 features. Now check its summary.
summary(df)
## Row ID Order ID Order Date Ship Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
##
## Ship Mode Customer ID Customer Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Country City State Postal Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
##
## Region Product ID Category Sub-Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Product Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. :-7.000 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.000 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.000 Median :0.2000
## Mean : 229.858 Mean : 3.781 Mean :0.1569
## 3rd Qu.: 209.940 3rd Qu.: 5.000 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.000 Max. :0.8000
## NA's :42
## Profit Returned
## Min. :-6599.978 Length:9994
## 1st Qu.: 1.729 Class :character
## Median : 8.666 Mode :character
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
##
What piqued our interest are the statistics for the Sales, Quantity, Discount and Profit columns: Quantity has a minimum of -7, which should not be possible, Discount has 42 missing values, and Profit ranges from a large loss to a large gain. These issues are addressed in the cleaning steps below. First, identifier, date and other columns that will not be used in the analysis are dropped:
drop<-c('Order Date','Ship Date','Row ID','Order ID','Customer ID','Customer Name','Postal Code', 'Product Name','Product ID','Country')
df <- df[,!(names(df) %in% drop)]
Next, we check the dataset for missing values (NA), as unhandled missing values could distort our analysis.
colSums(is.na(df))
## Ship Mode Segment City State Region Category
## 33 0 0 0 0 0
## Sub-Category Sales Quantity Discount Profit Returned
## 0 0 0 42 0 9194
There are missing values in df. Ship Mode has 33 missing values, Discount has 42, and Returned has 9,194 (orders that do not appear in the Returns sheet). These missing values will be handled accordingly.
Ship Mode is nominal, so its missing values are imputed with the mode. Discount is numerical, so its missing values are imputed with the median. A missing Returned value simply means the order was not returned, so it is filled with 'No'.
df<-as.data.frame(df)
getmode <- function(v){
v=v[nchar(as.character(v))>0]
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Fill Returned first: NA here means the order does not appear in the Returns sheet
df$Returned[is.na(df$Returned)] <- 'No'
# Impute missing values in the remaining character columns with the mode
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], function(x) ifelse(is.na(x), getmode(x), x))
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
# Impute missing Discount values with the median
df$Discount[is.na(df$Discount)] <- median(df$Discount, na.rm = TRUE)
colSums(is.na(df))
## Ship Mode Segment City State Region Category
## 0 0 0 0 0 0
## Sub-Category Sales Quantity Discount Profit Returned
## 0 0 0 0 0 0
From the results, all columns show zero NA values, which confirms that the missing values have been handled properly. With missing values taken care of, the patterns in the data are studied.
(Numerical data)
Numerical data in the dataset includes Sales, Quantity, Discount, and Profit. The distribution of each numerical variable is examined to detect any anomalous values. To do so, histograms are plotted.
hist(df$Discount)
hist(df$Quantity)
hist(df$Profit)
hist(df$Sales)
From the histograms, there are anomalous values in Quantity. Quantity should be a positive whole number because it indicates the number of products ordered, yet some values are negative, most likely due to input error. Since only a small portion of a large dataset is affected, removing these rows will not distort the overall pattern of the data.
The rows of the error data are removed:
df <- df[df$Quantity >= 0, ]
hist(df$Quantity)
From the results, the rows containing erroneous values have been removed successfully.
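A quick check can confirm that no negative quantities remain (a small sketch):
# Should return TRUE now that the erroneous rows are gone
all(df$Quantity >= 0)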
(Categorical data)
Categorical data in this dataset includes Ship Mode, Segment, City, State, Region, Category, and Sub-Category (Country was already dropped, so its table below is empty). The consistency of the values is checked by tabulating the frequency distribution of each variable; odd naming or redundant values can be spotted this way.
table(df$`Ship Mode`)
##
## First Class Same Day Second Class Standard Class
## 1538 543 1941 5959
table(df$Segment)
##
## Consumer Corporate Home Office
## 5183 3017 1781
table(df$Country)
## < table of extent 0 >
table(df$City)
##
## Aberdeen Abilene Akron Albuquerque
## 1 1 21 14
## Alexandria Allen Allentown Altoona
## 16 4 7 2
## Amarillo Anaheim Andover Ann Arbor
## 10 27 4 5
## Antioch Apopka Apple Valley Appleton
## 1 7 9 2
## Arlington Arlington Heights Arvada Asheville
## 60 1 4 7
## Athens Atlanta Atlantic City Auburn
## 8 39 1 24
## Aurora Austin Avondale Bakersfield
## 68 39 6 16
## Baltimore Bangor Bartlett Bayonne
## 43 5 1 3
## Baytown Beaumont Bedford Belleville
## 1 5 5 10
## Bellevue Bellingham Bethlehem Beverly
## 6 3 5 3
## Billings Bloomington Boca Raton Boise
## 1 15 3 4
## Bolingbrook Bossier City Bowling Green Boynton Beach
## 6 6 10 10
## Bozeman Brentwood Bridgeton Bristol
## 2 16 2 13
## Broken Arrow Broomfield Brownsville Bryan
## 5 5 7 6
## Buffalo Buffalo Grove Bullhead City Burbank
## 10 2 2 3
## Burlington Caldwell Camarillo Cambridge
## 25 3 5 6
## Canton Carlsbad Carol Stream Carrollton
## 3 11 5 15
## Cary Cedar Hill Cedar Rapids Champaign
## 6 2 1 1
## Chandler Chapel Hill Charlotte Charlottesville
## 7 1 52 2
## Chattanooga Chesapeake Chester Cheyenne
## 10 18 16 1
## Chicago Chico Chula Vista Cincinnati
## 314 11 3 17
## Citrus Heights Clarksville Cleveland Clifton
## 1 7 42 2
## Clinton Clovis Coachella College Station
## 29 2 2 3
## Colorado Springs Columbia Columbus Commerce City
## 25 81 222 1
## Concord Conroe Conway Coon Rapids
## 31 1 1 2
## Coppell Coral Gables Coral Springs Corpus Christi
## 4 2 8 8
## Costa Mesa Cottage Grove Covington Cranston
## 11 2 4 16
## Cuyahoga Falls Dallas Danbury Danville
## 3 157 1 7
## Davis Daytona Beach Dearborn Dearborn Heights
## 1 4 5 6
## Decatur Deer Park Delray Beach Deltona
## 35 1 3 5
## Denver Des Moines Des Plaines Detroit
## 44 19 6 115
## Dover Draper Dublin Dubuque
## 18 3 14 3
## Durham Eagan East Orange East Point
## 9 8 5 3
## Eau Claire Edinburg Edmond Edmonds
## 6 3 2 14
## El Cajon El Paso Elkhart Elmhurst
## 2 19 2 4
## Elyria Encinitas Englewood Escondido
## 1 5 2 3
## Eugene Evanston Everett Fairfield
## 6 4 22 45
## Fargo Farmington Fayetteville Florence
## 7 3 41 21
## Fort Collins Fort Lauderdale Fort Worth Frankfort
## 7 15 26 2
## Franklin Freeport Fremont Fresno
## 37 10 8 26
## Frisco Gaithersburg Garden City Garland
## 2 2 5 4
## Gastonia Georgetown Gilbert Gladstone
## 4 8 15 2
## Glendale Glenview Goldsboro Grand Island
## 23 1 1 1
## Grand Prairie Grand Rapids Grapevine Great Falls
## 13 7 2 8
## Greeley Green Bay Greensboro Greenville
## 2 4 13 9
## Greenwood Gresham Grove City Gulfport
## 4 5 2 3
## Hackensack Hagerstown Haltom City Hamilton
## 6 1 4 4
## Hampton Harlingen Harrisonburg Hattiesburg
## 11 5 7 9
## Helena Hempstead Henderson Hendersonville
## 3 11 50 5
## Hesperia Hialeah Hickory Highland Park
## 4 18 2 7
## Hillsboro Holland Hollywood Holyoke
## 4 3 11 1
## Homestead Hoover Hot Springs Houston
## 3 4 4 375
## Huntington Beach Huntsville Independence Indianapolis
## 6 36 2 23
## Inglewood Iowa City Irving Jackson
## 16 1 8 82
## Jacksonville Jamestown Jefferson City Johnson City
## 125 2 1 12
## Jonesboro Jupiter Keller Kenner
## 11 1 1 2
## Kenosha Kent Kirkwood Kissimmee
## 9 13 2 1
## Knoxville La Crosse La Mesa La Porte
## 24 5 2 10
## La Quinta Lafayette Laguna Niguel Lake Charles
## 1 30 4 3
## Lake Elsinore Lake Forest Lakeland Lakeville
## 1 6 22 15
## Lakewood Lancaster Lansing Laredo
## 49 46 8 13
## Las Cruces Las Vegas Laurel Lawrence
## 3 13 2 44
## Lawton Layton League City Lebanon
## 4 1 6 3
## Lehi Leominster Lewiston Lincoln Park
## 2 6 4 3
## Linden Lindenhurst Little Rock Littleton
## 1 1 24 1
## Lodi Logan Long Beach Longmont
## 2 5 61 3
## Longview Lorain Los Angeles Louisville
## 3 9 745 57
## Loveland Lowell Lubbock Macon
## 4 16 6 6
## Madison Malden Manchester Manhattan
## 10 3 11 1
## Mansfield Manteca Maple Grove Margate
## 2 3 4 1
## Marietta Marion Marlborough Marysville
## 7 20 2 2
## Mason Mcallen Medford Medina
## 3 19 4 9
## Melbourne Memphis Mentor Meriden
## 1 30 6 11
## Meridian Mesa Mesquite Miami
## 4 28 5 57
## Middletown Midland Milford Milwaukee
## 9 12 6 44
## Minneapolis Miramar Mishawaka Mission Viejo
## 23 8 2 5
## Missoula Missouri City Mobile Modesto
## 1 1 11 3
## Monroe Montebello Montgomery Moorhead
## 21 1 10 2
## Moreno Valley Morgan Hill Morristown Mount Pleasant
## 10 4 7 2
## Mount Vernon Murfreesboro Murray Murrieta
## 8 11 5 1
## Muskogee Naperville Nashua Nashville
## 4 8 2 29
## New Albany New Bedford New Brunswick New Castle
## 4 5 2 2
## New Rochelle New York City Newark Newport News
## 9 915 95 12
## Niagara Falls Noblesville Norfolk Normal
## 3 4 1 1
## Norman North Charleston North Las Vegas North Miami
## 2 2 16 2
## Norwich Oak Park Oakland Oceanside
## 4 3 26 30
## Odessa Oklahoma City Olathe Olympia
## 5 23 5 5
## Omaha Ontario Orange Orem
## 28 1 6 9
## Orland Park Orlando Ormond Beach Oswego
## 1 11 1 3
## Overland Park Owensboro Oxnard Palatine
## 6 2 9 1
## Palm Coast Park Ridge Parker Parma
## 5 3 8 10
## Pasadena Pasco Passaic Paterson
## 42 6 6 20
## Pearland Pembroke Pines Pensacola Peoria
## 3 14 1 23
## Perth Amboy Pharr Philadelphia Phoenix
## 8 4 537 63
## Pico Rivera Pine Bluff Plainfield Plano
## 1 2 13 15
## Plantation Pleasant Grove Pocatello Pomona
## 11 7 7 7
## Pompano Beach Port Arthur Port Orange Port Saint Lucie
## 3 6 1 3
## Portage Portland Providence Provo
## 1 24 31 10
## Pueblo Quincy Raleigh Rancho Cucamonga
## 8 20 22 4
## Rapid City Reading Redding Redlands
## 2 5 1 16
## Redmond Redondo Beach Redwood City Reno
## 11 8 1 4
## Renton Revere Richardson Richmond
## 3 7 2 90
## Rio Rancho Riverside Rochester Rochester Hills
## 2 14 53 1
## Rock Hill Rockford Rockville Rogers
## 1 11 5 1
## Rome Romeoville Roseville Roswell
## 6 1 26 18
## Round Rock Royal Oak Sacramento Saginaw
## 7 2 13 6
## Saint Charles Saint Cloud Saint Louis Saint Paul
## 10 3 5 2
## Saint Peters Saint Petersburg Salem Salinas
## 1 14 40 8
## Salt Lake City San Angelo San Antonio San Bernardino
## 8 4 59 9
## San Clemente San Diego San Francisco San Gabriel
## 2 170 509 3
## San Jose San Luis Obispo San Marcos San Mateo
## 42 1 4 1
## Sandy Springs Sanford Santa Ana Santa Barbara
## 14 2 13 8
## Santa Clara Santa Fe Santa Maria Scottsdale
## 6 2 1 12
## Seattle Sheboygan Shelton Sierra Vista
## 428 4 2 3
## Sioux Falls Skokie Smyrna South Bend
## 9 11 24 6
## Southaven Sparks Spokane Springdale
## 7 3 7 1
## Springfield Sterling Heights Stockton Suffolk
## 163 3 6 10
## Summerville Sunnyvale Superior Tallahassee
## 2 6 9 17
## Tamarac Tampa Taylor Temecula
## 4 36 3 4
## Tempe Texarkana Texas City The Colony
## 13 2 2 3
## Thomasville Thornton Thousand Oaks Tigard
## 3 10 5 8
## Tinley Park Toledo Torrance Trenton
## 1 31 4 13
## Troy Tucson Tulsa Tuscaloosa
## 29 32 26 2
## Twin Falls Tyler Urbandale Utica
## 2 5 3 8
## Vacaville Vallejo Vancouver Vineland
## 1 6 5 8
## Virginia Beach Visalia Waco Warner Robins
## 16 4 6 2
## Warwick Washington Waterbury Waterloo
## 5 10 11 1
## Watertown Waukesha Wausau Waynesboro
## 10 1 4 10
## West Allis West Jordan West Palm Beach Westfield
## 2 5 3 6
## Westland Westminster Wheeling Whittier
## 13 17 7 1
## Wichita Wilmington Wilson Woodbury
## 7 36 6 3
## Woodland Woodstock Woonsocket Yonkers
## 3 5 4 15
## York Yucaipa Yuma
## 5 1 4
table(df$State)
##
## Alabama Arizona Arkansas
## 61 224 60
## California CALIFORNIA Colorado
## 1995 2 182
## Connecticut Delaware District of Columbia
## 82 95 10
## florida Florida Georgia
## 3 380 184
## Idaho Illinois Indiana
## 21 492 148
## INDIANA Iowa Kansas
## 1 30 24
## kentucky Kentucky KENTUCKY
## 0 137 1
## Louisiana Maine Maryland
## 41 8 105
## Massachusetts Michigan MICHIGAN
## 135 254 1
## Minnesota Mississippi Missouri
## 89 53 66
## Montana Nebraska Nevada
## 15 38 39
## New Hampshire New Jersey New Mexico
## 27 130 37
## New York North Carolina North Dakota
## 1128 249 7
## Ohio Oklahoma Oregon
## 468 66 124
## Pennsylvania Rhode Island South Carolina
## 587 56 42
## South Dakota Tennessee Texas
## 12 183 980
## TEXAS Utah UTAH
## 1 52 1
## Vermont Virginia Washington
## 11 224 506
## West Virginia Wisconsin Wyoming
## 4 109 1
table(df$Region)
##
## Central East South West
## 2318 2846 1618 3199
table(df$Category)
##
## Furniture Office Supplies Technology
## 2117 6021 1843
table(df$`Sub-Category`)
##
## Accessories Appliances Art Binders Bookcases Chairs
## 774 464 796 1521 228 616
## Copiers Envelopes Fasteners Furnishings Labels Machines
## 68 254 217 956 363 114
## Paper Phones Storage Supplies Tables
## 1370 887 846 190 317
From the results, State contains redundant values because the letter case is not standardized. For example, “California” and “CALIFORNIA” refer to the same location.
Standardizing the case of the letters in State:
df$State<-tolower(df$State)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
df$State <- capFirst(df$State)
table(df$State)
##
## Alabama Arizona Arkansas
## 61 224 60
## California Colorado Connecticut
## 1997 182 82
## Delaware District of columbia Florida
## 95 10 383
## Georgia Idaho Illinois
## 184 21 492
## Indiana Iowa Kansas
## 149 30 24
## Kentucky Louisiana Maine
## 138 41 8
## Maryland Massachusetts Michigan
## 105 135 255
## Minnesota Mississippi Missouri
## 89 53 66
## Montana Nebraska Nevada
## 15 38 39
## New hampshire New jersey New mexico
## 27 130 37
## New york North carolina North dakota
## 1128 249 7
## Ohio Oklahoma Oregon
## 468 66 124
## Pennsylvania Rhode island South carolina
## 587 56 42
## South dakota Tennessee Texas
## 12 183 981
## Utah Vermont Virginia
## 53 11 224
## Washington West virginia Wisconsin
## 506 4 109
## Wyoming
## 1
From the results, the redundancy in State has been resolved.
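One cosmetic side effect of capFirst is that multi-word states come out as, for example, “New york”. If full title case is preferred, a sketch using stringr (part of the already-loaded tidyverse) would be:
# Title-case every word in State, e.g. "new york" -> "New York"
df$State <- stringr::str_to_title(df$State)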
In this section, we will drill deeper into the data for more insights. But first, we need to know what problems we want to solve and what questions to ask. Taking the POV of the owner of the Superstore:
1. Which product category and sub-category have the highest sales, with and without discount?
df_category_sales <- df %>%
select(Category, Sales) %>%
group_by(Category) %>%
summarise(Sales = sum(Sales))
df_category_sales <- df_category_sales %>%
mutate(Percentage = percent(df_category_sales$Sales/sum(df_category_sales$Sales)))
Bar Chart for Category by Sales Breakdown
bar_category_sales <- ggplot(data=df_category_sales, aes(x=Category, y=Sales, fill=Category)) +
coord_cartesian(ylim = c(700000, 850000))+
ggtitle("Category by Sales Breakdown") +
geom_bar(stat="identity")+
geom_text(aes(label=Percentage), vjust=1.6, color="white", size=3.5)+
theme_minimal()
#bar_category_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
bar_category_sales
2. Which customer segment contributes the highest sales?
df_segment_sales <- df %>%
select(Segment, Sales) %>%
group_by(Segment) %>%
summarise(Sales = sum(Sales))
df_segment_sales[ , 1] <- apply(df_segment_sales[ , 1], 2,
function(x) as.character(x))
sapply(df_segment_sales, class)
## Segment Sales
## "character" "numeric"
Adding a percentage column to the table
df_segment_sales <- df_segment_sales %>%
mutate(Percentage = percent(df_segment_sales$Sales/sum(df_segment_sales$Sales)))
Pie Chart for Customer Segment Sales Contribution
ggplot(df_segment_sales, aes(x = "", y = Sales, fill = Segment)) +
ggtitle("Customer Segment Sales Contribution") +
geom_col() +
geom_label(aes(label = Percentage ),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")
3. Which region, state and city contribute to the highest sales?
Region - Sales
df_region_sales <- df %>%
select(Region, Sales) %>%
group_by(Region) %>%
summarise(Sales = sum(Sales))
Adding a percentage column to the table
df_region_sales <- df_region_sales %>%
mutate(Percentage = percent(df_region_sales$Sales/sum(df_region_sales$Sales)))
Bar Chart for Region by Sales
bar_region_sales <- ggplot(data=df_region_sales, aes(x=Region, y=Sales, fill=Region)) +
coord_cartesian(ylim = c(300000, 800000))+
ggtitle("Region by Sales Breakdown") +
geom_bar(stat="identity")+
geom_text(aes(label=Percentage), vjust=1.6, color="white", size=3.5)+
theme_minimal()
#bar_region_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9","#f5ad42"))
bar_region_sales
State - Sales
df_state_sales <- df %>%
select(State, Sales) %>%
group_by(State) %>%
summarise(Sales = sum(Sales))
df_state_sales <- df_state_sales[order(-df_state_sales$Sales),]
df_state_sales <- df_state_sales %>%
mutate(Percentage = percent(df_state_sales$Sales/sum(df_state_sales$Sales),accuracy = 0.01))
Horizontal Bar Chart for Top 10 States by Sales
bar_state_sales <- ggplot(data=df_state_sales[1:10,], aes(x=State, y=Sales, fill=State)) +
#coord_cartesian(ylim = c(300000, 800000))+
geom_bar(stat="identity")+
ggtitle("Top 10 Regions by Sales Breakdown") +
geom_text(aes(label=Percentage), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()+
coord_flip()
#bar_region_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9","#f5ad42"))
bar_state_sales
City - Sales
df_city_sales <- df %>%
select(City, Sales) %>%
group_by(City) %>%
summarise(Sales = sum(Sales))
df_city_sales <- df_city_sales[order(-df_city_sales$Sales),]
df_city_sales <- df_city_sales %>%
mutate(Percentage = percent(df_city_sales$Sales/sum(df_city_sales$Sales),accuracy = 0.01))
Horizontal Bar Chart for Top 10 Cities by Sales
bar_city_sales <- ggplot(data=df_city_sales[1:10,], aes(x=City, y=Sales, fill=City)) +
geom_bar(stat="identity")+
ggtitle("Sales by Category Breakdown") +
geom_text(aes(label=Percentage), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()+
coord_flip()
bar_city_sales
4. Which category and sub-category have the highest number of returned items?
# Join orders and returns table
joined_df <- df_order %>% inner_join(df_return,
by = "Order ID")
Category Having Highest Returns:
res_1 <- joined_df %>%
filter(Returned == "Yes") %>%
group_by(Category)%>%
summarise(Total_Returns = n())
ggplot(data = res_1,
aes(x = Category,
y = Total_Returns,
fill = Category)) +
geom_bar(stat="identity")+
ggtitle("Returns By Category") +
geom_text(aes(label = Total_Returns), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()
Sub-Category Having Highest Returns:
res_2 <- joined_df %>%
filter(Returned == "Yes") %>%
group_by(`Sub-Category`)%>%
summarise(Total_Returns = n())
ggplot(data = res_2,
aes(x = `Sub-Category`,
y = Total_Returns,
fill = `Sub-Category`)) +
geom_bar(stat="identity")+
ggtitle("Returns By Sub-Category") +
geom_text(aes(label = Total_Returns),
hjust= 1.4,
vjust=0.5,
color="white",
size=3)+
theme_minimal() +
coord_flip()
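# Label encode the categorical columns (superml LabelEncoder) so they can be included in the correlation matrix below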
label <- LabelEncoder$new()
df$Segment <- label$fit_transform(df$Segment)
df$Region <- label$fit_transform(df$Region)
df$Category <- label$fit_transform(df$Category)
df$`Sub-Category` <- label$fit_transform(df$`Sub-Category`)
df$`Returned` <- label$fit_transform(df$`Returned`)
df$`Ship Mode` <- label$fit_transform(df$`Ship Mode`)
df$`City` <- label$fit_transform(df$`City`)
df$`State` <- label$fit_transform(df$`State`)
head(df)
## Ship Mode Segment City State Region Category Sub-Category Sales Quantity
## 1 2 0 194 0 2 0 4 261.9600 2
## 3 2 1 266 1 3 1 10 14.6200 2
## 4 3 0 153 2 2 0 16 957.5775 5
## 5 3 0 153 2 2 1 14 22.3680 2
## 7 3 0 266 1 3 1 2 7.2800 4
## 8 3 0 266 1 3 2 13 907.1520 6
## Discount Profit Returned
## 1 0.00 41.9136 0
## 3 0.00 6.8714 0
## 4 0.45 -383.0310 0
## 5 0.20 2.5164 0
## 7 0.00 1.9656 0
## 8 0.20 90.7152 0
set.seed(7)
correlationMatrix <- cor(df[,1:11])
cm <- melt(correlationMatrix)
ggplot(data = cm, aes(x=Var1, y=Var2,fill=value)) + geom_tile()
print(correlationMatrix)
## Ship Mode Segment City State Region
## Ship Mode 1.0000000000 -0.004582994 -0.0036541966 -0.003634798 -0.023631986
## Segment -0.0045829938 1.000000000 0.0057311185 0.037712219 -0.005296660
## City -0.0036541966 0.005731119 1.0000000000 -0.168428001 0.282360262
## State -0.0036347981 0.037712219 -0.1684280008 1.000000000 -0.254856599
## Region -0.0236319855 -0.005296660 0.2823602616 -0.254856599 1.000000000
## Category -0.0043026605 0.009730685 0.0135427089 0.008036425 -0.004874291
## Sub-Category -0.0034247080 0.011574558 -0.0063490087 -0.006049100 0.009029168
## Sales -0.0017386263 0.010949677 0.0056037646 0.008134196 0.004080363
## Quantity 0.0226146186 0.007670713 -0.0008610499 0.002416884 0.013250294
## Discount -0.0001778966 -0.017964106 -0.1512573723 -0.132165259 -0.212811855
## Profit -0.0065217729 0.013558801 0.0242173827 0.041014897 0.022041231
## Category Sub-Category Sales Quantity Discount
## Ship Mode -0.004302661 -0.003424708 -0.001738626 0.0226146186 -0.0001778966
## Segment 0.009730685 0.011574558 0.010949677 0.0076707127 -0.0179641061
## City 0.013542709 -0.006349009 0.005603765 -0.0008610499 -0.1512573723
## State 0.008036425 -0.006049100 0.008134196 0.0024168842 -0.1321652587
## Region -0.004874291 0.009029168 0.004080363 0.0132502938 -0.2128118549
## Category 1.000000000 -0.075511134 0.040217407 -0.0038286368 -0.0631844760
## Sub-Category -0.075511134 1.000000000 0.059908214 -0.0166216535 -0.1527127006
## Sales 0.040217407 0.059908214 1.000000000 0.2008916831 -0.0280473766
## Quantity -0.003828637 -0.016621654 0.200891683 1.0000000000 0.0104803322
## Discount -0.063184476 -0.152712701 -0.028047377 0.0104803322 1.0000000000
## Profit 0.091670085 -0.029546613 0.479017235 0.0663051434 -0.2192108855
## Profit
## Ship Mode -0.006521773
## Segment 0.013558801
## City 0.024217383
## State 0.041014897
## Region 0.022041231
## Category 0.091670085
## Sub-Category -0.029546613
## Sales 0.479017235
## Quantity 0.066305143
## Discount -0.219210886
## Profit 1.000000000
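Since corrplot is already loaded, the same matrix can also be visualized as a correlogram (a sketch):
# Correlogram of the label-encoded and numeric variables
corrplot(correlationMatrix, method = "color", type = "upper", tl.cex = 0.7)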
Based on the correlation matrix, “Sales” has the strongest positive correlation (about 0.48) with the dependent variable “Profit”.
df_sales_profit <- df %>%
select(Sales,Profit)
Let’s have a look at a linear model fitted to the Sales and Profit data.
ggplot(data = df_sales_profit, aes(x = Sales, y = Profit)) +
geom_point() +
stat_smooth(method = "lm", col = "dodgerblue3") +
theme(panel.background = element_rect(fill = "white"),
axis.line.x=element_line(),
axis.line.y=element_line()) +
ggtitle("Linear Model Fitted to Data")
## `geom_smooth()` using formula 'y ~ x'
Create training set indices with 80% of data
#For reproducibility
set.seed(100)
# Create index for testing and training data
inTrain <- createDataPartition(y = df_sales_profit$Profit,
p = 0.8, list = FALSE)
# subset Sales & Profit data to training
salesprofit_training <- df_sales_profit[inTrain,]
# subset the rest to test
salesprofit_testing <- df_sales_profit[-inTrain,]
Build a linear regression model relating Sales and Profit.
Whether we can use the model to make predictions will depend on whether the relationship between Sales and Profit is statistically significant and on how large the prediction errors are.
linear_fit <- lm(Profit ~ Sales, data = salesprofit_training)
summary(linear_fit)
##
## Call:
## lm(formula = Profit ~ Sales, data = salesprofit_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7352.9 2.5 13.6 21.2 5438.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.483065 2.574812 -4.46 8.32e-06 ***
## Sales 0.169863 0.003798 44.73 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216.8 on 7984 degrees of freedom
## Multiple R-squared: 0.2004, Adjusted R-squared: 0.2003
## F-statistic: 2001 on 1 and 7984 DF, p-value: < 2.2e-16
Based on the result summary shown above, is the hypothesis supported?
The intercept is the expected Profit when Sales is zero. Profit can be negative, which represents a monetary loss.
The slope is the effect of Sales on Profit: for each additional unit of Sales, Profit increases by about RM0.17.
Since the p-value is smaller than the 0.05 cutoff for significance, we reject the null hypothesis and conclude that there is a relationship between Sales and Profit.
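Since broom is already loaded, the coefficient estimates and overall fit statistics can also be extracted as tidy data frames (a sketch):
# Coefficient table with p-values, and model-level statistics such as R-squared
tidy(linear_fit)
glance(linear_fit)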
We use the linear model to predict Profit using Sales as input. We check the residuals to see the performance of the linear model.
pred = predict(linear_fit, salesprofit_testing)
head(pred)
## 10 14 18 23 36 39
## 8.034192 57.816958 -2.055669 -1.233532 174.949040 78.951855
res<-residuals(linear_fit) # Find the residuals
res<-as.data.frame(res) # Convert the residual into a dataframe
head(res) # Prints the residuals
## res
## 1 8.899356
## 3 15.871068
## 4 -534.204911
## 5 10.199970
## 7 12.212063
## 8 -51.893285
We compare the predicted and actual values side by side to see how far off the predictions are from the actual values - that is, the residuals.
# compare the predicted vs actual values
results<-cbind(pred,salesprofit_testing$Profit)
head(results)
## pred
## 10 8.034192 34.4700
## 14 57.816958 132.5922
## 18 -2.055669 9.9900
## 23 -1.233532 15.6884
## 36 174.949040 123.4737
## 39 78.951855 -46.9764
colnames(results)<-c('predicted','real')
results<-as.data.frame(results)
head(results)
## predicted real
## 10 8.034192 34.4700
## 14 57.816958 132.5922
## 18 -2.055669 9.9900
## 23 -1.233532 15.6884
## 36 174.949040 123.4737
## 39 78.951855 -46.9764
Let’s plot the difference in the chart.
# Let’s now, compare the predicted vs actual values.
# The output of the above command is shown below in a graph that shows the predicted Profit.
plot(salesprofit_testing$Profit, type = 'l', lty = 1.8, col = "red")
#Now let’s plot our test revenue with the following command:
lines(pred, type = "l", col = "blue")
As seen above, there are some instances where the model underestimated the profit. The prediction generated by this model is conservative.
Finally, we evaluate the performance of the model using RMSE.
# Calculating the accuracy of this model
rmse <- sqrt(mean(pred-df_sales_profit$Profit)^2)
## Warning in pred - df_sales_profit$Profit: longer object length is not a multiple
## of shorter object length
# Root Mean Square Error is the standard deviation of the residuals
rmse
## [1] 1.04264
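Note the warning above: pred holds predictions for the test set only, while df_sales_profit$Profit covers the full dataset, so the subtraction recycles values; the square is also taken after averaging instead of before. A sketch of the RMSE computed against the held-out test set instead:
# RMSE on the test set: square the errors, average them, then take the square root
test_errors <- salesprofit_testing$Profit - pred
rmse_test <- sqrt(mean(test_errors^2))
rmse_test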
We use the decision tree model to predict Ship Mode using Segment, Region, Category, Sub-category, Sales, Quantity, Discount and Profit as input.
# Remove useless variables
joined_df <- joined_df[, -which(names(joined_df) %in% c("Row ID",
"Order ID",
"Ship Date",
"Order Date",
"Customer ID",
"Customer Name",
"Postal Code",
"Product ID",
"Product Name",
"Returned",
"Country",
"State",
"City"))]
# Check structure of final variable
str(joined_df)
## tibble [800 x 9] (S3: tbl_df/tbl/data.frame)
## $ Ship Mode : chr [1:800] "Second Class" "Second Class" "Second Class" "First Class" ...
## $ Segment : chr [1:800] "Consumer" "Consumer" "Consumer" "Consumer" ...
## $ Region : chr [1:800] "West" "West" "West" "East" ...
## $ Category : chr [1:800] "Office Supplies" "Technology" "Office Supplies" "Office Supplies" ...
## $ Sub-Category: chr [1:800] "Art" "Phones" "Binders" "Storage" ...
## $ Sales : num [1:800] 8.56 213.48 22.72 208.56 32.4 ...
## $ Quantity : num [1:800] 2 3 -4 6 5 5 2 2 4 1 ...
## $ Discount : num [1:800] 0 0.2 0.2 0 0 0.1 0 0 0.2 0 ...
## $ Profit : num [1:800] 2.48 16.01 7.38 52.14 15.55 ...
library(superml)
# Label encode categorical variables
label <- LabelEncoder$new()
joined_df$Segment <- label$fit_transform(joined_df$Segment)
joined_df$Region <- label$fit_transform(joined_df$Region)
joined_df$Category <- label$fit_transform(joined_df$Category)
joined_df$`Sub-Category` <- label$fit_transform(joined_df$`Sub-Category`)
# Function for normalization
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
#apply Min-Max normalization on the dataset
joined_df[, 6:7] <- as.data.frame(lapply(joined_df[, 6:7], min_max_norm))
joined_df$Profit <- min_max_norm(joined_df$Profit)
head(joined_df)
## # A tibble: 6 x 9
## `Ship Mode` Segment Region Category `Sub-Category` Sales Quantity Discount
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Second Class 0 0 0 0 0.000491 0.333 0
## 2 Second Class 0 0 1 1 0.0151 0.389 0.2
## 3 Second Class 0 0 0 2 0.00150 0 0.2
## 4 First Class 0 1 0 3 0.0148 0.556 0
## 5 First Class 0 1 0 4 0.00219 0.5 0
## 6 First Class 0 1 2 5 0.0227 0.5 0.1
## # ... with 1 more variable: Profit <dbl>
colSums(is.na(joined_df))
## Ship Mode Segment Region Category Sub-Category Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 1 0
# Split data into training and testing sets.
set.seed(124)
split1<- sample(c(rep(0, 0.7 * nrow(joined_df)), rep(1, 0.3 * nrow(joined_df))))
train <- joined_df[split1 == 0, ]
test <- joined_df[split1== 1, ]
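Given how imbalanced Ship Mode is (Standard Class dominates), a stratified split using caret's createDataPartition could be used instead, so that the rarer classes are represented in both sets (a sketch):
# Stratified 70/30 split on the outcome variable
idx <- createDataPartition(joined_df$`Ship Mode`, p = 0.7, list = FALSE)
train_strat <- joined_df[idx, ]
test_strat <- joined_df[-idx, ]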
# Fit decision tree model.
set.seed(123)
fit <- rpart(`Ship Mode` ~.,
data = train,
method = 'class')
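Since rpart.plot is already loaded, the fitted tree can be inspected visually (a sketch):
# Visualize the fitted decision tree
rpart.plot(fit)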
# Evaluate performance on test set.
predict_unseen <- predict(fit,
test,
type = 'class')
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 4 0 0 46
## Same Day 0 0 0 18
## Second Class 1 0 0 38
## Standard Class 4 0 0 129
##
## Overall Statistics
##
## Accuracy : 0.5542
## 95% CI : (0.4889, 0.6181)
## No Information Rate : 0.9625
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0283
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.44444 NA NA
## Specificity 0.80087 0.925 0.8375
## Pos Pred Value 0.08000 NA NA
## Neg Pred Value 0.97368 NA NA
## Prevalence 0.03750 0.000 0.0000
## Detection Rate 0.01667 0.000 0.0000
## Detection Prevalence 0.20833 0.075 0.1625
## Balanced Accuracy 0.62266 NA NA
## Class: Standard Class
## Sensitivity 0.55844
## Specificity 0.55556
## Pos Pred Value 0.96992
## Neg Pred Value 0.04673
## Prevalence 0.96250
## Detection Rate 0.53750
## Detection Prevalence 0.55417
## Balanced Accuracy 0.55700
library(caret)
library(randomForest)
mod_rf <-
train(`Ship Mode` ~. , # Equation (outcome and everything else)
data = train, # Training data
method = "ranger", # random forest (ranger is much faster than rf)
na.action = na.omit
)
# Evaluate performance on test set.
predict_unseen <- predict(mod_rf,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 1 0 5 44
## Same Day 0 0 0 18
## Second Class 1 0 2 36
## Standard Class 3 2 2 126
##
## Overall Statistics
##
## Accuracy : 0.5375
## 95% CI : (0.4722, 0.6019)
## No Information Rate : 0.9333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0195
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.200000 0.000000 0.222222
## Specificity 0.791489 0.924370 0.839827
## Pos Pred Value 0.020000 0.000000 0.051282
## Neg Pred Value 0.978947 0.990991 0.965174
## Prevalence 0.020833 0.008333 0.037500
## Detection Rate 0.004167 0.000000 0.008333
## Detection Prevalence 0.208333 0.075000 0.162500
## Balanced Accuracy 0.495745 0.462185 0.531025
## Class: Standard Class
## Sensitivity 0.56250
## Specificity 0.56250
## Pos Pred Value 0.94737
## Neg Pred Value 0.08411
## Prevalence 0.93333
## Detection Rate 0.52500
## Detection Prevalence 0.55417
## Balanced Accuracy 0.56250
library(e1071)
svm <- svm(as.factor(`Ship Mode`) ~.,
data = train,
type = 'C-classification',
kernel = 'linear')
summary(svm)
##
## Call:
## svm(formula = as.factor(`Ship Mode`) ~ ., data = train, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 475
##
## ( 95 102 232 46 )
##
##
## Number of Classes: 4
##
## Levels:
## First Class Same Day Second Class Standard Class
# Evaluate performance on test set.
predict_unseen <- predict(svm,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 0 0 0 50
## Same Day 0 0 0 18
## Second Class 0 0 0 39
## Standard Class 0 0 0 133
##
## Overall Statistics
##
## Accuracy : 0.5542
## 95% CI : (0.4889, 0.6181)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity NA NA NA
## Specificity 0.7917 0.925 0.8375
## Pos Pred Value NA NA NA
## Neg Pred Value NA NA NA
## Prevalence 0.0000 0.000 0.0000
## Detection Rate 0.0000 0.000 0.0000
## Detection Prevalence 0.2083 0.075 0.1625
## Balanced Accuracy NA NA NA
## Class: Standard Class
## Sensitivity 0.5542
## Specificity NA
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 1.0000
## Detection Rate 0.5542
## Detection Prevalence 0.5542
## Balanced Accuracy NA
nb <- naiveBayes(`Ship Mode` ~.,
data = train)
# Evaluate performance on test set.
predict_unseen <- predict(nb,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 1 0 25 24
## Same Day 0 0 4 14
## Second Class 0 3 8 28
## Standard Class 3 1 31 98
##
## Overall Statistics
##
## Accuracy : 0.4458
## 95% CI : (0.3819, 0.5111)
## No Information Rate : 0.6833
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0287
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.250000 0.00000 0.11765
## Specificity 0.792373 0.92373 0.81977
## Pos Pred Value 0.020000 0.00000 0.20513
## Neg Pred Value 0.984211 0.98198 0.70149
## Prevalence 0.016667 0.01667 0.28333
## Detection Rate 0.004167 0.00000 0.03333
## Detection Prevalence 0.208333 0.07500 0.16250
## Balanced Accuracy 0.521186 0.46186 0.46871
## Class: Standard Class
## Sensitivity 0.5976
## Specificity 0.5395
## Pos Pred Value 0.7368
## Neg Pred Value 0.3832
## Prevalence 0.6833
## Detection Rate 0.4083
## Detection Prevalence 0.5542
## Balanced Accuracy 0.5685
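To compare the four classifiers side by side, their test-set accuracies can be gathered into a single table (a sketch, assuming the fitted objects above are still in the workspace):
# Test-set accuracy for each classifier
acc <- function(pred) mean(pred == test$`Ship Mode`)
data.frame(
  model = c("Decision tree", "Random forest", "SVM", "Naive Bayes"),
  accuracy = c(acc(predict(fit, test, type = "class")),
               acc(predict(mod_rf, test)),
               acc(predict(svm, test)),
               acc(predict(nb, test)))
)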
To reiterate, the Superstore owners have the following concerns:
- Which products, regions, categories and customer segments should they target or avoid?
- Can a regression model be built to predict Sales or Profit?
A data scientist’s recommendations to the Superstore owners are as follows:
- In terms of Customer Segment, target the Home Office segment, as this trend is set to grow, accelerated by Covid. The Corporate segment is too competitive, as it is a very established market segment.
- In terms of Sales by Region, take a deep dive into the Superstore’s market presence in the Southern region to understand why sales there are much lower than in the East and West.
- In terms of Returns, the Office Supplies category records the highest number of returned items, driven by the Paper and Binders sub-categories. A survey should be conducted to collect feedback on why these two sub-categories have such a high return rate.
In regard to predictive analytics (regression and classification) on the dataset:
- Correlation exists only between a few selected variables: Profit, Sales, Quantity, Discount and Sub-Category.
- Regression: a predictive model for Profit from Sales was constructed; the RMSE reported above (about 1.04) should be treated with caution given the issues noted after that calculation. The plot also suggests some instances with large errors that need to be rechecked.
- Classification: this dataset is not well suited for classification, as most of the variables are only weakly correlated with one another; further data collection on new features is needed.
- Future work includes a deep dive into the instances that trigger large errors in the regression model, and identifying new features to collect for the classification problem.