With growing demand and cut-throat competition in the market, a Superstore giant is seeking your expertise to understand what works best for them. They would like to know which products, regions, categories and customer segments they should target or avoid.
They also want a regression model to predict Sales or Profit.
The required libraries are loaded.
library(readxl)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(forcats)
library(rpart)
library(rpart.plot)
library(caret)
library(scales)
library(superml)
library(corrplot)
library(reshape2)
library(broom)
library(ggpubr)
First, let’s read the dataset from the XLS file, which contains 3 sheets:
df_order = read_xls('superstorev2.xls', sheet ='Orders', col_names = TRUE)
df_return = read_xls('superstorev2.xls', sheet='Returns', col_names = TRUE)
df_people = read_xls('superstorev2.xls', sheet='People', col_names = TRUE)
“Order” dataframe:
head(df_order)
## # A tibble: 6 x 21
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode`
## <dbl> <chr> <dttm> <dttm> <chr>
## 1 1 CA-2016-152156 2016-11-08 00:00:00 2016-11-11 00:00:00 Second Class
## 2 2 CA-2016-152156 2016-11-08 00:00:00 2016-11-11 00:00:00 Second Class
## 3 3 CA-2016-138688 2016-06-12 00:00:00 2016-06-16 00:00:00 Second Class
## 4 4 US-2015-108966 2015-10-11 00:00:00 2015-10-18 00:00:00 <NA>
## 5 5 US-2015-108966 2015-10-11 00:00:00 2015-10-18 00:00:00 Standard Class
## 6 6 CA-2014-115812 2014-06-09 00:00:00 2014-06-14 00:00:00 Standard Class
## # ... with 16 more variables: Customer ID <chr>, Customer Name <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, Postal Code <dbl>,
## # Region <chr>, Product ID <chr>, Category <chr>, Sub-Category <chr>,
## # Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## # Profit <dbl>
“Return” dataframe:
head(df_return)
## # A tibble: 6 x 2
## Returned `Order ID`
## <chr> <chr>
## 1 Yes CA-2017-153822
## 2 Yes CA-2017-129707
## 3 Yes CA-2014-152345
## 4 Yes CA-2015-156440
## 5 Yes US-2017-155999
## 6 Yes CA-2014-157924
“People” dataframe:
head(df_people)
## # A tibble: 4 x 2
## Person Region
## <chr> <chr>
## 1 Anna Andreadi West
## 2 Chuck Magee East
## 3 Kelly Williams Central
## 4 Cassandra Brandow South
We merge df_return$Returned (and, optionally, df_people$Person) into df_order, using a full join to retain all rows. The code below performs the Returns join on Order ID; a sketch of the People join by Region follows the printed output.
df = dplyr::full_join(df_order, df_return, by="Order ID")
df$`Order Date` = as.Date.character(df$`Order Date`, format="%Y-%m-%d")
df$`Ship Date` = as.Date(df$`Ship Date`, format="%Y-%m-%d")
df
## # A tibble: 9,994 x 22
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode` `Customer ID`
## <dbl> <chr> <date> <date> <chr> <chr>
## 1 1 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 2 2 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520
## 3 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045
## 4 4 US-2015-108966 2015-10-11 2015-10-18 <NA> SO-20335
## 5 5 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335
## 6 6 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 7 7 CA-2014-115812 2014-06-09 2014-06-14 <NA> BH-11710
## 8 8 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 9 9 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710
## 10 10 CA-2014-115812 2014-06-09 2014-06-14 <NA> BH-11710
## # ... with 9,984 more rows, and 16 more variables: Customer Name <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, Postal Code <dbl>,
## # Region <chr>, Product ID <chr>, Category <chr>, Sub-Category <chr>,
## # Product Name <chr>, Sales <dbl>, Quantity <dbl>, Discount <dbl>,
## # Profit <dbl>, Returned <chr>
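The join above attaches only the Returns information. A minimal sketch of how the People sheet could be attached as well, assuming the Region values match between the sheets:
# Attach the regional manager (Person) to each order by Region
df_with_person = dplyr::full_join(df, df_people, by = "Region")
head(df_with_person$Person)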
After the datasets are merged, df is the dataset we use for this project.
The structure of the dataset is checked.
str(df)
## tibble [9,994 x 22] (S3: tbl_df/tbl/data.frame)
## $ Row ID : num [1:9994] 1 2 3 4 5 6 7 8 9 10 ...
## $ Order ID : chr [1:9994] "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order Date : Date[1:9994], format: "2016-11-08" "2016-11-08" ...
## $ Ship Date : Date[1:9994], format: "2016-11-11" "2016-11-11" ...
## $ Ship Mode : chr [1:9994] "Second Class" "Second Class" "Second Class" NA ...
## $ Customer ID : chr [1:9994] "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Customer Name: chr [1:9994] "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
## $ Segment : chr [1:9994] "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr [1:9994] "United States" "United States" "United States" "United States" ...
## $ City : chr [1:9994] "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr [1:9994] "KENTUCKY" "kentucky" "California" "florida" ...
## $ Postal Code : num [1:9994] 42420 42420 90036 33311 33311 ...
## $ Region : chr [1:9994] "South" "South" "West" "South" ...
## $ Product ID : chr [1:9994] "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr [1:9994] "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub-Category : chr [1:9994] "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product Name : chr [1:9994] "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num [1:9994] 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : num [1:9994] 2 -3 2 5 2 -7 4 6 3 5 ...
## $ Discount : num [1:9994] 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num [1:9994] 41.91 219.58 6.87 -383.03 2.52 ...
## $ Returned : chr [1:9994] NA NA NA NA ...
The merged dataframe has 9,994 instances and 22 features. Now check its summary.
summary(df)
## Row ID Order ID Order Date Ship Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
##
## Ship Mode Customer ID Customer Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Country City State Postal Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
##
## Region Product ID Category Sub-Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Product Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. :-7.000 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.000 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.000 Median :0.2000
## Mean : 229.858 Mean : 3.781 Mean :0.1569
## 3rd Qu.: 209.940 3rd Qu.: 5.000 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.000 Max. :0.8000
## NA's :42
## Profit Returned
## Min. :-6599.978 Length:9994
## 1st Qu.: 1.729 Class :character
## Median : 8.666 Mode :character
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
##
What piqued our interest are the statistics for the Sales, Quantity, Discount and Profit columns: Quantity has a minimum of -7, which should not be possible, Discount has 42 missing values, and Profit ranges from a large loss to a large gain. These issues are addressed in the cleaning steps below. First, identifier, date and other columns that will not be used in the analysis are dropped:
drop<-c('Order Date','Ship Date','Row ID','Order ID','Customer ID','Customer Name','Postal Code', 'Product Name','Product ID','Country')
df <- df[,!(names(df) %in% drop)]
Next, we check the dataset for missing values (NA), as unhandled missing values could distort our analysis.
colSums(is.na(df))
## Ship Mode Segment City State Region Category
## 33 0 0 0 0 0
## Sub-Category Sales Quantity Discount Profit Returned
## 0 0 0 42 0 9194
There are missing values in df. Ship Mode has 33 missing values, Discount has 42, and Returned has 9,194 (orders that do not appear in the Returns sheet). These missing values will be handled accordingly.
Ship Mode is nominal, so its missing values are imputed with the mode. Discount is numerical, so its missing values are imputed with the median. A missing Returned value simply means the order was not returned, so it is filled with 'No'.
df<-as.data.frame(df)
getmode <- function(v){
v=v[nchar(as.character(v))>0]
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Fill Returned first: NA here means the order does not appear in the Returns sheet
df$Returned[is.na(df$Returned)] <- 'No'
# Impute missing values in the remaining character columns with the mode
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], function(x) ifelse(is.na(x), getmode(x), x))
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
# Impute missing Discount values with the median
df$Discount[is.na(df$Discount)] <- median(df$Discount, na.rm = TRUE)
colSums(is.na(df))
## Ship Mode Segment City State Region Category
## 0 0 0 0 0 0
## Sub-Category Sales Quantity Discount Profit Returned
## 0 0 0 0 0 0
From the results, all columns show zero NA values, which confirms that the missing values have been handled properly. With missing values taken care of, the patterns in the data are studied.
(Numerical data)
Numerical data in the dataset includes Sales, Quantity, Discount, and Profit. The distribution of each numerical variable is examined to detect any anomalous values. To do so, histograms are plotted.
hist(df$Discount)
hist(df$Quantity)
hist(df$Profit)
hist(df$Sales)
From the histograms, there are anomalous values in Quantity. Quantity should be a positive whole number because it indicates the number of products ordered, yet some values are negative, most likely due to input error. Since only a small portion of a large dataset is affected, removing these rows will not distort the overall pattern of the data.
The rows of the error data are removed:
df <- df[df$Quantity >= 0, ]
hist(df$Quantity)
From the results, the rows containing erroneous values have been removed successfully.
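A quick check can confirm that no negative quantities remain (a small sketch):
# Should return TRUE now that the erroneous rows are gone
all(df$Quantity >= 0)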
(Categorical data)
Categorical data in this dataset includes Ship Mode, Segment, City, State, Region, Category, and Sub-Category (Country was already dropped, so its table below is empty). The consistency of the values is checked by tabulating the frequency distribution of each variable; odd naming or redundant values can be spotted this way.
table(df$`Ship Mode`)
##
## First Class Same Day Second Class Standard Class
## 1538 543 1941 5959
table(df$Segment)
##
## Consumer Corporate Home Office
## 5183 3017 1781
table(df$Country)
## < table of extent 0 >
table(df$City)
##
## Aberdeen Abilene Akron Albuquerque
## 1 1 21 14
## Alexandria Allen Allentown Altoona
## 16 4 7 2
## Amarillo Anaheim Andover Ann Arbor
## 10 27 4 5
## Antioch Apopka Apple Valley Appleton
## 1 7 9 2
## Arlington Arlington Heights Arvada Asheville
## 60 1 4 7
## Athens Atlanta Atlantic City Auburn
## 8 39 1 24
## Aurora Austin Avondale Bakersfield
## 68 39 6 16
## Baltimore Bangor Bartlett Bayonne
## 43 5 1 3
## Baytown Beaumont Bedford Belleville
## 1 5 5 10
## Bellevue Bellingham Bethlehem Beverly
## 6 3 5 3
## Billings Bloomington Boca Raton Boise
## 1 15 3 4
## Bolingbrook Bossier City Bowling Green Boynton Beach
## 6 6 10 10
## Bozeman Brentwood Bridgeton Bristol
## 2 16 2 13
## Broken Arrow Broomfield Brownsville Bryan
## 5 5 7 6
## Buffalo Buffalo Grove Bullhead City Burbank
## 10 2 2 3
## Burlington Caldwell Camarillo Cambridge
## 25 3 5 6
## Canton Carlsbad Carol Stream Carrollton
## 3 11 5 15
## Cary Cedar Hill Cedar Rapids Champaign
## 6 2 1 1
## Chandler Chapel Hill Charlotte Charlottesville
## 7 1 52 2
## Chattanooga Chesapeake Chester Cheyenne
## 10 18 16 1
## Chicago Chico Chula Vista Cincinnati
## 314 11 3 17
## Citrus Heights Clarksville Cleveland Clifton
## 1 7 42 2
## Clinton Clovis Coachella College Station
## 29 2 2 3
## Colorado Springs Columbia Columbus Commerce City
## 25 81 222 1
## Concord Conroe Conway Coon Rapids
## 31 1 1 2
## Coppell Coral Gables Coral Springs Corpus Christi
## 4 2 8 8
## Costa Mesa Cottage Grove Covington Cranston
## 11 2 4 16
## Cuyahoga Falls Dallas Danbury Danville
## 3 157 1 7
## Davis Daytona Beach Dearborn Dearborn Heights
## 1 4 5 6
## Decatur Deer Park Delray Beach Deltona
## 35 1 3 5
## Denver Des Moines Des Plaines Detroit
## 44 19 6 115
## Dover Draper Dublin Dubuque
## 18 3 14 3
## Durham Eagan East Orange East Point
## 9 8 5 3
## Eau Claire Edinburg Edmond Edmonds
## 6 3 2 14
## El Cajon El Paso Elkhart Elmhurst
## 2 19 2 4
## Elyria Encinitas Englewood Escondido
## 1 5 2 3
## Eugene Evanston Everett Fairfield
## 6 4 22 45
## Fargo Farmington Fayetteville Florence
## 7 3 41 21
## Fort Collins Fort Lauderdale Fort Worth Frankfort
## 7 15 26 2
## Franklin Freeport Fremont Fresno
## 37 10 8 26
## Frisco Gaithersburg Garden City Garland
## 2 2 5 4
## Gastonia Georgetown Gilbert Gladstone
## 4 8 15 2
## Glendale Glenview Goldsboro Grand Island
## 23 1 1 1
## Grand Prairie Grand Rapids Grapevine Great Falls
## 13 7 2 8
## Greeley Green Bay Greensboro Greenville
## 2 4 13 9
## Greenwood Gresham Grove City Gulfport
## 4 5 2 3
## Hackensack Hagerstown Haltom City Hamilton
## 6 1 4 4
## Hampton Harlingen Harrisonburg Hattiesburg
## 11 5 7 9
## Helena Hempstead Henderson Hendersonville
## 3 11 50 5
## Hesperia Hialeah Hickory Highland Park
## 4 18 2 7
## Hillsboro Holland Hollywood Holyoke
## 4 3 11 1
## Homestead Hoover Hot Springs Houston
## 3 4 4 375
## Huntington Beach Huntsville Independence Indianapolis
## 6 36 2 23
## Inglewood Iowa City Irving Jackson
## 16 1 8 82
## Jacksonville Jamestown Jefferson City Johnson City
## 125 2 1 12
## Jonesboro Jupiter Keller Kenner
## 11 1 1 2
## Kenosha Kent Kirkwood Kissimmee
## 9 13 2 1
## Knoxville La Crosse La Mesa La Porte
## 24 5 2 10
## La Quinta Lafayette Laguna Niguel Lake Charles
## 1 30 4 3
## Lake Elsinore Lake Forest Lakeland Lakeville
## 1 6 22 15
## Lakewood Lancaster Lansing Laredo
## 49 46 8 13
## Las Cruces Las Vegas Laurel Lawrence
## 3 13 2 44
## Lawton Layton League City Lebanon
## 4 1 6 3
## Lehi Leominster Lewiston Lincoln Park
## 2 6 4 3
## Linden Lindenhurst Little Rock Littleton
## 1 1 24 1
## Lodi Logan Long Beach Longmont
## 2 5 61 3
## Longview Lorain Los Angeles Louisville
## 3 9 745 57
## Loveland Lowell Lubbock Macon
## 4 16 6 6
## Madison Malden Manchester Manhattan
## 10 3 11 1
## Mansfield Manteca Maple Grove Margate
## 2 3 4 1
## Marietta Marion Marlborough Marysville
## 7 20 2 2
## Mason Mcallen Medford Medina
## 3 19 4 9
## Melbourne Memphis Mentor Meriden
## 1 30 6 11
## Meridian Mesa Mesquite Miami
## 4 28 5 57
## Middletown Midland Milford Milwaukee
## 9 12 6 44
## Minneapolis Miramar Mishawaka Mission Viejo
## 23 8 2 5
## Missoula Missouri City Mobile Modesto
## 1 1 11 3
## Monroe Montebello Montgomery Moorhead
## 21 1 10 2
## Moreno Valley Morgan Hill Morristown Mount Pleasant
## 10 4 7 2
## Mount Vernon Murfreesboro Murray Murrieta
## 8 11 5 1
## Muskogee Naperville Nashua Nashville
## 4 8 2 29
## New Albany New Bedford New Brunswick New Castle
## 4 5 2 2
## New Rochelle New York City Newark Newport News
## 9 915 95 12
## Niagara Falls Noblesville Norfolk Normal
## 3 4 1 1
## Norman North Charleston North Las Vegas North Miami
## 2 2 16 2
## Norwich Oak Park Oakland Oceanside
## 4 3 26 30
## Odessa Oklahoma City Olathe Olympia
## 5 23 5 5
## Omaha Ontario Orange Orem
## 28 1 6 9
## Orland Park Orlando Ormond Beach Oswego
## 1 11 1 3
## Overland Park Owensboro Oxnard Palatine
## 6 2 9 1
## Palm Coast Park Ridge Parker Parma
## 5 3 8 10
## Pasadena Pasco Passaic Paterson
## 42 6 6 20
## Pearland Pembroke Pines Pensacola Peoria
## 3 14 1 23
## Perth Amboy Pharr Philadelphia Phoenix
## 8 4 537 63
## Pico Rivera Pine Bluff Plainfield Plano
## 1 2 13 15
## Plantation Pleasant Grove Pocatello Pomona
## 11 7 7 7
## Pompano Beach Port Arthur Port Orange Port Saint Lucie
## 3 6 1 3
## Portage Portland Providence Provo
## 1 24 31 10
## Pueblo Quincy Raleigh Rancho Cucamonga
## 8 20 22 4
## Rapid City Reading Redding Redlands
## 2 5 1 16
## Redmond Redondo Beach Redwood City Reno
## 11 8 1 4
## Renton Revere Richardson Richmond
## 3 7 2 90
## Rio Rancho Riverside Rochester Rochester Hills
## 2 14 53 1
## Rock Hill Rockford Rockville Rogers
## 1 11 5 1
## Rome Romeoville Roseville Roswell
## 6 1 26 18
## Round Rock Royal Oak Sacramento Saginaw
## 7 2 13 6
## Saint Charles Saint Cloud Saint Louis Saint Paul
## 10 3 5 2
## Saint Peters Saint Petersburg Salem Salinas
## 1 14 40 8
## Salt Lake City San Angelo San Antonio San Bernardino
## 8 4 59 9
## San Clemente San Diego San Francisco San Gabriel
## 2 170 509 3
## San Jose San Luis Obispo San Marcos San Mateo
## 42 1 4 1
## Sandy Springs Sanford Santa Ana Santa Barbara
## 14 2 13 8
## Santa Clara Santa Fe Santa Maria Scottsdale
## 6 2 1 12
## Seattle Sheboygan Shelton Sierra Vista
## 428 4 2 3
## Sioux Falls Skokie Smyrna South Bend
## 9 11 24 6
## Southaven Sparks Spokane Springdale
## 7 3 7 1
## Springfield Sterling Heights Stockton Suffolk
## 163 3 6 10
## Summerville Sunnyvale Superior Tallahassee
## 2 6 9 17
## Tamarac Tampa Taylor Temecula
## 4 36 3 4
## Tempe Texarkana Texas City The Colony
## 13 2 2 3
## Thomasville Thornton Thousand Oaks Tigard
## 3 10 5 8
## Tinley Park Toledo Torrance Trenton
## 1 31 4 13
## Troy Tucson Tulsa Tuscaloosa
## 29 32 26 2
## Twin Falls Tyler Urbandale Utica
## 2 5 3 8
## Vacaville Vallejo Vancouver Vineland
## 1 6 5 8
## Virginia Beach Visalia Waco Warner Robins
## 16 4 6 2
## Warwick Washington Waterbury Waterloo
## 5 10 11 1
## Watertown Waukesha Wausau Waynesboro
## 10 1 4 10
## West Allis West Jordan West Palm Beach Westfield
## 2 5 3 6
## Westland Westminster Wheeling Whittier
## 13 17 7 1
## Wichita Wilmington Wilson Woodbury
## 7 36 6 3
## Woodland Woodstock Woonsocket Yonkers
## 3 5 4 15
## York Yucaipa Yuma
## 5 1 4
table(df$State)
##
## Alabama Arizona Arkansas
## 61 224 60
## California CALIFORNIA Colorado
## 1995 2 182
## Connecticut Delaware District of Columbia
## 82 95 10
## florida Florida Georgia
## 3 380 184
## Idaho Illinois Indiana
## 21 492 148
## INDIANA Iowa Kansas
## 1 30 24
## kentucky Kentucky KENTUCKY
## 0 137 1
## Louisiana Maine Maryland
## 41 8 105
## Massachusetts Michigan MICHIGAN
## 135 254 1
## Minnesota Mississippi Missouri
## 89 53 66
## Montana Nebraska Nevada
## 15 38 39
## New Hampshire New Jersey New Mexico
## 27 130 37
## New York North Carolina North Dakota
## 1128 249 7
## Ohio Oklahoma Oregon
## 468 66 124
## Pennsylvania Rhode Island South Carolina
## 587 56 42
## South Dakota Tennessee Texas
## 12 183 980
## TEXAS Utah UTAH
## 1 52 1
## Vermont Virginia Washington
## 11 224 506
## West Virginia Wisconsin Wyoming
## 4 109 1
table(df$Region)
##
## Central East South West
## 2318 2846 1618 3199
table(df$Category)
##
## Furniture Office Supplies Technology
## 2117 6021 1843
table(df$`Sub-Category`)
##
## Accessories Appliances Art Binders Bookcases Chairs
## 774 464 796 1521 228 616
## Copiers Envelopes Fasteners Furnishings Labels Machines
## 68 254 217 956 363 114
## Paper Phones Storage Supplies Tables
## 1370 887 846 190 317
From the results, State contains redundant values because the letter case is not standardized. For example, “California” and “CALIFORNIA” refer to the same location.
Standardizing the case of the letters in State:
df$State<-tolower(df$State)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
df$State <- capFirst(df$State)
table(df$State)
##
## Alabama Arizona Arkansas
## 61 224 60
## California Colorado Connecticut
## 1997 182 82
## Delaware District of columbia Florida
## 95 10 383
## Georgia Idaho Illinois
## 184 21 492
## Indiana Iowa Kansas
## 149 30 24
## Kentucky Louisiana Maine
## 138 41 8
## Maryland Massachusetts Michigan
## 105 135 255
## Minnesota Mississippi Missouri
## 89 53 66
## Montana Nebraska Nevada
## 15 38 39
## New hampshire New jersey New mexico
## 27 130 37
## New york North carolina North dakota
## 1128 249 7
## Ohio Oklahoma Oregon
## 468 66 124
## Pennsylvania Rhode island South carolina
## 587 56 42
## South dakota Tennessee Texas
## 12 183 981
## Utah Vermont Virginia
## 53 11 224
## Washington West virginia Wisconsin
## 506 4 109
## Wyoming
## 1
From the results, the redundancy in State has been resolved.
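One cosmetic side effect of capFirst is that multi-word states come out as, for example, “New york”. If full title case is preferred, a sketch using stringr (part of the already-loaded tidyverse) would be:
# Title-case every word in State, e.g. "new york" -> "New York"
df$State <- stringr::str_to_title(df$State)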
In this section, we will drill deeper into the data for more insights. But first, we need to know what problems we want to solve and what questions to ask. Taking the POV of the owner of the Superstore:
1. Which product category and sub-category have the highest sales, with and without discount?
df_category_sales <- df %>%
select(Category, Sales) %>%
group_by(Category) %>%
summarise(Sales = sum(Sales))
df_category_sales <- df_category_sales %>%
mutate(Percentage = percent(df_category_sales$Sales/sum(df_category_sales$Sales)))
Bar Chart for Category by Sales Breakdown
bar_category_sales <- ggplot(data=df_category_sales, aes(x=Category, y=Sales, fill=Category)) +
coord_cartesian(ylim = c(700000, 850000))+
ggtitle("Category by Sales Breakdown") +
geom_bar(stat="identity")+
geom_text(aes(label=Percentage), vjust=1.6, color="white", size=3.5)+
theme_minimal()
#bar_category_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
bar_category_sales
2. Which customer segment contributes the highest sales?
df_segment_sales <- df %>%
select(Segment, Sales) %>%
group_by(Segment) %>%
summarise(Sales = sum(Sales))
df_segment_sales[ , 1] <- apply(df_segment_sales[ , 1], 2,
function(x) as.character(x))
sapply(df_segment_sales, class)
## Segment Sales
## "character" "numeric"
Adding a percentage column to the table
df_segment_sales <- df_segment_sales %>%
mutate(Percentage = percent(df_segment_sales$Sales/sum(df_segment_sales$Sales)))
Pie Chart for Customer Segment Sales Contribution
ggplot(df_segment_sales, aes(x = "", y = Sales, fill = Segment)) +
ggtitle("Customer Segment Sales Contribution") +
geom_col() +
geom_label(aes(label = Percentage ),
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")
3. Which region, state and city contribute to the highest sales?
Region - Sales
df_region_sales <- df %>%
select(Region, Sales) %>%
group_by(Region) %>%
summarise(Sales = sum(Sales))
Adding a percentage column to the table
df_region_sales <- df_region_sales %>%
mutate(Percentage = percent(df_region_sales$Sales/sum(df_region_sales$Sales)))
Bar Chart for Region by Sales
bar_region_sales <- ggplot(data=df_region_sales, aes(x=Region, y=Sales, fill=Region)) +
coord_cartesian(ylim = c(300000, 800000))+
ggtitle("Region by Sales Breakdown") +
geom_bar(stat="identity")+
geom_text(aes(label=Percentage), vjust=1.6, color="white", size=3.5)+
theme_minimal()
#bar_region_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9","#f5ad42"))
bar_region_sales
State - Sales
df_state_sales <- df %>%
select(State, Sales) %>%
group_by(State) %>%
summarise(Sales = sum(Sales))
df_state_sales <- df_state_sales[order(-df_state_sales$Sales),]
df_state_sales <- df_state_sales %>%
mutate(Percentage = percent(df_state_sales$Sales/sum(df_state_sales$Sales),accuracy = 0.01))
Horizontal Bar Chart for Top 10 States by Sales
bar_state_sales <- ggplot(data=df_state_sales[1:10,], aes(x=State, y=Sales, fill=State)) +
#coord_cartesian(ylim = c(300000, 800000))+
geom_bar(stat="identity")+
ggtitle("Top 10 Regions by Sales Breakdown") +
geom_text(aes(label=Percentage), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()+
coord_flip()
#bar_region_sales + scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9","#f5ad42"))
bar_state_sales
City - Sales
df_city_sales <- df %>%
select(City, Sales) %>%
group_by(City) %>%
summarise(Sales = sum(Sales))
df_city_sales <- df_city_sales[order(-df_city_sales$Sales),]
df_city_sales <- df_city_sales %>%
mutate(Percentage = percent(df_city_sales$Sales/sum(df_city_sales$Sales),accuracy = 0.01))
Horizontal Bar Chart for Top 10 Cities by Sales
bar_city_sales <- ggplot(data=df_city_sales[1:10,], aes(x=City, y=Sales, fill=City)) +
geom_bar(stat="identity")+
ggtitle("Sales by Category Breakdown") +
geom_text(aes(label=Percentage), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()+
coord_flip()
bar_city_sales
4. Which category and sub-category have the highest number of returned items?
# Join orders and returns table
joined_df <- df_order %>% inner_join(df_return,
by = "Order ID")
Category Having Highest Returns:
res_1 <- joined_df %>%
filter(Returned == "Yes") %>%
group_by(Category)%>%
summarise(Total_Returns = n())
ggplot(data = res_1,
aes(x = Category,
y = Total_Returns,
fill = Category)) +
geom_bar(stat="identity")+
ggtitle("Returns By Category") +
geom_text(aes(label = Total_Returns), hjust=1.3, vjust=0.4, color="white", size=3)+
theme_minimal()
Sub-Category Having Highest Returns:
res_2 <- joined_df %>%
filter(Returned == "Yes") %>%
group_by(`Sub-Category`)%>%
summarise(Total_Returns = n())
ggplot(data = res_2,
aes(x = `Sub-Category`,
y = Total_Returns,
fill = `Sub-Category`)) +
geom_bar(stat="identity")+
ggtitle("Returns By Sub-Category") +
geom_text(aes(label = Total_Returns),
hjust= 1.4,
vjust=0.5,
color="white",
size=3)+
theme_minimal() +
coord_flip()
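# Label encode the categorical columns (superml LabelEncoder) so they can be included in the correlation matrix below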
label <- LabelEncoder$new()
df$Segment <- label$fit_transform(df$Segment)
df$Region <- label$fit_transform(df$Region)
df$Category <- label$fit_transform(df$Category)
df$`Sub-Category` <- label$fit_transform(df$`Sub-Category`)
df$`Returned` <- label$fit_transform(df$`Returned`)
df$`Ship Mode` <- label$fit_transform(df$`Ship Mode`)
df$`City` <- label$fit_transform(df$`City`)
df$`State` <- label$fit_transform(df$`State`)
head(df)
## Ship Mode Segment City State Region Category Sub-Category Sales Quantity
## 1 2 0 194 0 2 0 4 261.9600 2
## 3 2 1 266 1 3 1 10 14.6200 2
## 4 3 0 153 2 2 0 16 957.5775 5
## 5 3 0 153 2 2 1 14 22.3680 2
## 7 3 0 266 1 3 1 2 7.2800 4
## 8 3 0 266 1 3 2 13 907.1520 6
## Discount Profit Returned
## 1 0.00 41.9136 0
## 3 0.00 6.8714 0
## 4 0.45 -383.0310 0
## 5 0.20 2.5164 0
## 7 0.00 1.9656 0
## 8 0.20 90.7152 0
set.seed(7)
correlationMatrix <- cor(df[,1:11])
cm <- melt(correlationMatrix)
ggplot(data = cm, aes(x=Var1, y=Var2,fill=value)) + geom_tile()
print(correlationMatrix)
## Ship Mode Segment City State Region
## Ship Mode 1.0000000000 -0.004582994 -0.0036541966 -0.003634798 -0.023631986
## Segment -0.0045829938 1.000000000 0.0057311185 0.037712219 -0.005296660
## City -0.0036541966 0.005731119 1.0000000000 -0.168428001 0.282360262
## State -0.0036347981 0.037712219 -0.1684280008 1.000000000 -0.254856599
## Region -0.0236319855 -0.005296660 0.2823602616 -0.254856599 1.000000000
## Category -0.0043026605 0.009730685 0.0135427089 0.008036425 -0.004874291
## Sub-Category -0.0034247080 0.011574558 -0.0063490087 -0.006049100 0.009029168
## Sales -0.0017386263 0.010949677 0.0056037646 0.008134196 0.004080363
## Quantity 0.0226146186 0.007670713 -0.0008610499 0.002416884 0.013250294
## Discount -0.0001778966 -0.017964106 -0.1512573723 -0.132165259 -0.212811855
## Profit -0.0065217729 0.013558801 0.0242173827 0.041014897 0.022041231
## Category Sub-Category Sales Quantity Discount
## Ship Mode -0.004302661 -0.003424708 -0.001738626 0.0226146186 -0.0001778966
## Segment 0.009730685 0.011574558 0.010949677 0.0076707127 -0.0179641061
## City 0.013542709 -0.006349009 0.005603765 -0.0008610499 -0.1512573723
## State 0.008036425 -0.006049100 0.008134196 0.0024168842 -0.1321652587
## Region -0.004874291 0.009029168 0.004080363 0.0132502938 -0.2128118549
## Category 1.000000000 -0.075511134 0.040217407 -0.0038286368 -0.0631844760
## Sub-Category -0.075511134 1.000000000 0.059908214 -0.0166216535 -0.1527127006
## Sales 0.040217407 0.059908214 1.000000000 0.2008916831 -0.0280473766
## Quantity -0.003828637 -0.016621654 0.200891683 1.0000000000 0.0104803322
## Discount -0.063184476 -0.152712701 -0.028047377 0.0104803322 1.0000000000
## Profit 0.091670085 -0.029546613 0.479017235 0.0663051434 -0.2192108855
## Profit
## Ship Mode -0.006521773
## Segment 0.013558801
## City 0.024217383
## State 0.041014897
## Region 0.022041231
## Category 0.091670085
## Sub-Category -0.029546613
## Sales 0.479017235
## Quantity 0.066305143
## Discount -0.219210886
## Profit 1.000000000
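Since corrplot is already loaded, the same matrix can also be visualized as a correlogram (a sketch):
# Correlogram of the label-encoded and numeric variables
corrplot(correlationMatrix, method = "color", type = "upper", tl.cex = 0.7)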
Based on the correlation matrix, “Sales” has the strongest positive correlation (about 0.48) with the dependent variable “Profit”.
df_sales_profit <- df %>%
select(Sales,Profit)
Let’s have a look at a linear model fitted to the Sales and Profit data.
ggplot(data = df_sales_profit, aes(x = Sales, y = Profit)) +
geom_point() +
stat_smooth(method = "lm", col = "dodgerblue3") +
theme(panel.background = element_rect(fill = "white"),
axis.line.x=element_line(),
axis.line.y=element_line()) +
ggtitle("Linear Model Fitted to Data")
## `geom_smooth()` using formula 'y ~ x'
Create training set indices with 80% of data
#For reproducibility
set.seed(100)
# Create index for testing and training data
inTrain <- createDataPartition(y = df_sales_profit$Profit,
p = 0.8, list = FALSE)
# subset Sales & Profit data to training
salesprofit_training <- df_sales_profit[inTrain,]
# subset the rest to test
salesprofit_testing <- df_sales_profit[-inTrain,]
Build a linear regression model relating Sales and Profit.
Whether we can use the model to make predictions will depend on whether the relationship between Sales and Profit is statistically significant and on how large the prediction errors are.
linear_fit <- lm(Profit ~ Sales, data = salesprofit_training)
summary(linear_fit)
##
## Call:
## lm(formula = Profit ~ Sales, data = salesprofit_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7352.9 2.5 13.6 21.2 5438.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.483065 2.574812 -4.46 8.32e-06 ***
## Sales 0.169863 0.003798 44.73 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216.8 on 7984 degrees of freedom
## Multiple R-squared: 0.2004, Adjusted R-squared: 0.2003
## F-statistic: 2001 on 1 and 7984 DF, p-value: < 2.2e-16
Based on the result summary shown above, is the hypothesis supported?
The intercept is the expected Profit when Sales is zero. Profit can be negative, which represents a monetary loss.
The slope is the effect of Sales on Profit: for each additional unit of Sales, Profit increases by about RM0.17.
Since the p-value is smaller than the 0.05 cutoff for significance, we reject the null hypothesis and conclude that there is a relationship between Sales and Profit.
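Since broom is already loaded, the coefficient estimates and overall fit statistics can also be extracted as tidy data frames (a sketch):
# Coefficient table with p-values, and model-level statistics such as R-squared
tidy(linear_fit)
glance(linear_fit)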
We use the linear model to predict Profit using Sales as input. We check the residuals to see the performance of the linear model.
pred = predict(linear_fit, salesprofit_testing)
head(pred)
## 10 14 18 23 36 39
## 8.034192 57.816958 -2.055669 -1.233532 174.949040 78.951855
res<-residuals(linear_fit) # Find the residuals
res<-as.data.frame(res) # Convert the residual into a dataframe
head(res) # Prints the residuals
## res
## 1 8.899356
## 3 15.871068
## 4 -534.204911
## 5 10.199970
## 7 12.212063
## 8 -51.893285
We compare the predicted and actual values side by side to see how far off the predictions are from the actual values - that is, the residuals.
# compare the predicted vs actual values
results<-cbind(pred,salesprofit_testing$Profit)
head(results)
## pred
## 10 8.034192 34.4700
## 14 57.816958 132.5922
## 18 -2.055669 9.9900
## 23 -1.233532 15.6884
## 36 174.949040 123.4737
## 39 78.951855 -46.9764
colnames(results)<-c('predicted','real')
results<-as.data.frame(results)
head(results)
## predicted real
## 10 8.034192 34.4700
## 14 57.816958 132.5922
## 18 -2.055669 9.9900
## 23 -1.233532 15.6884
## 36 174.949040 123.4737
## 39 78.951855 -46.9764
Let’s plot the difference in the chart.
# Let’s now, compare the predicted vs actual values.
# The output of the above command is shown below in a graph that shows the predicted Profit.
plot(salesprofit_testing$Profit, type = 'l', lty = 1.8, col = "red")
#Now let’s plot our test revenue with the following command:
lines(pred, type = "l", col = "blue")
As seen above, there are some instances where the model underestimated the profit. The prediction generated by this model is conservative.
Finally, we evaluate the performance of the model using RMSE.
# Calculating the accuracy of this model
rmse <- sqrt(mean(pred-df_sales_profit$Profit)^2)
## Warning in pred - df_sales_profit$Profit: longer object length is not a multiple
## of shorter object length
# Root Mean Square Error is the standard deviation of the residuals
rmse
## [1] 1.04264
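Note the warning above: pred holds predictions for the test set only, while df_sales_profit$Profit covers the full dataset, so the subtraction recycles values; the square is also taken after averaging instead of before. A sketch of the RMSE computed against the held-out test set instead:
# RMSE on the test set: square the errors, average them, then take the square root
test_errors <- salesprofit_testing$Profit - pred
rmse_test <- sqrt(mean(test_errors^2))
rmse_test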
We use the decision tree model to predict Ship Mode using Segment, Region, Category, Sub-category, Sales, Quantity, Discount and Profit as input.
# Remove useless variables
joined_df <- joined_df[, -which(names(joined_df) %in% c("Row ID",
"Order ID",
"Ship Date",
"Order Date",
"Customer ID",
"Customer Name",
"Postal Code",
"Product ID",
"Product Name",
"Returned",
"Country",
"State",
"City"))]
# Check structure of final variable
str(joined_df)
## tibble [800 x 9] (S3: tbl_df/tbl/data.frame)
## $ Ship Mode : chr [1:800] "Second Class" "Second Class" "Second Class" "First Class" ...
## $ Segment : chr [1:800] "Consumer" "Consumer" "Consumer" "Consumer" ...
## $ Region : chr [1:800] "West" "West" "West" "East" ...
## $ Category : chr [1:800] "Office Supplies" "Technology" "Office Supplies" "Office Supplies" ...
## $ Sub-Category: chr [1:800] "Art" "Phones" "Binders" "Storage" ...
## $ Sales : num [1:800] 8.56 213.48 22.72 208.56 32.4 ...
## $ Quantity : num [1:800] 2 3 -4 6 5 5 2 2 4 1 ...
## $ Discount : num [1:800] 0 0.2 0.2 0 0 0.1 0 0 0.2 0 ...
## $ Profit : num [1:800] 2.48 16.01 7.38 52.14 15.55 ...
library(superml)
# Label encode categorical variables
label <- LabelEncoder$new()
joined_df$Segment <- label$fit_transform(joined_df$Segment)
joined_df$Region <- label$fit_transform(joined_df$Region)
joined_df$Category <- label$fit_transform(joined_df$Category)
joined_df$`Sub-Category` <- label$fit_transform(joined_df$`Sub-Category`)
# Function for normalization
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
#apply Min-Max normalization on the dataset
joined_df[, 6:7] <- as.data.frame(lapply(joined_df[, 6:7], min_max_norm))
joined_df$Profit <- min_max_norm(joined_df$Profit)
head(joined_df)
## # A tibble: 6 x 9
## `Ship Mode` Segment Region Category `Sub-Category` Sales Quantity Discount
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Second Class 0 0 0 0 0.000491 0.333 0
## 2 Second Class 0 0 1 1 0.0151 0.389 0.2
## 3 Second Class 0 0 0 2 0.00150 0 0.2
## 4 First Class 0 1 0 3 0.0148 0.556 0
## 5 First Class 0 1 0 4 0.00219 0.5 0
## 6 First Class 0 1 2 5 0.0227 0.5 0.1
## # ... with 1 more variable: Profit <dbl>
colSums(is.na(joined_df))
## Ship Mode Segment Region Category Sub-Category Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 1 0
# Split data into training and testing sets.
set.seed(124)
split1<- sample(c(rep(0, 0.7 * nrow(joined_df)), rep(1, 0.3 * nrow(joined_df))))
train <- joined_df[split1 == 0, ]
test <- joined_df[split1== 1, ]
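Given how imbalanced Ship Mode is (Standard Class dominates), a stratified split using caret's createDataPartition could be used instead, so that the rarer classes are represented in both sets (a sketch):
# Stratified 70/30 split on the outcome variable
idx <- createDataPartition(joined_df$`Ship Mode`, p = 0.7, list = FALSE)
train_strat <- joined_df[idx, ]
test_strat <- joined_df[-idx, ]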
# Fit decision tree model.
set.seed(123)
fit <- rpart(`Ship Mode` ~.,
data = train,
method = 'class')
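Since rpart.plot is already loaded, the fitted tree can be inspected visually (a sketch):
# Visualize the fitted decision tree
rpart.plot(fit)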
# Evaluate performance on test set.
predict_unseen <- predict(fit,
test,
type = 'class')
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 4 0 0 46
## Same Day 0 0 0 18
## Second Class 1 0 0 38
## Standard Class 4 0 0 129
##
## Overall Statistics
##
## Accuracy : 0.5542
## 95% CI : (0.4889, 0.6181)
## No Information Rate : 0.9625
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0283
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.44444 NA NA
## Specificity 0.80087 0.925 0.8375
## Pos Pred Value 0.08000 NA NA
## Neg Pred Value 0.97368 NA NA
## Prevalence 0.03750 0.000 0.0000
## Detection Rate 0.01667 0.000 0.0000
## Detection Prevalence 0.20833 0.075 0.1625
## Balanced Accuracy 0.62266 NA NA
## Class: Standard Class
## Sensitivity 0.55844
## Specificity 0.55556
## Pos Pred Value 0.96992
## Neg Pred Value 0.04673
## Prevalence 0.96250
## Detection Rate 0.53750
## Detection Prevalence 0.55417
## Balanced Accuracy 0.55700
library(caret)
library(randomForest)
mod_rf <-
train(`Ship Mode` ~. , # Equation (outcome and everything else)
data = train, # Training data
method = "ranger", # random forest (ranger is much faster than rf)
na.action = na.omit
)
# Evaluate performance on test set.
predict_unseen <- predict(mod_rf,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 1 0 5 44
## Same Day 0 0 0 18
## Second Class 1 0 2 36
## Standard Class 3 2 2 126
##
## Overall Statistics
##
## Accuracy : 0.5375
## 95% CI : (0.4722, 0.6019)
## No Information Rate : 0.9333
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0195
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.200000 0.000000 0.222222
## Specificity 0.791489 0.924370 0.839827
## Pos Pred Value 0.020000 0.000000 0.051282
## Neg Pred Value 0.978947 0.990991 0.965174
## Prevalence 0.020833 0.008333 0.037500
## Detection Rate 0.004167 0.000000 0.008333
## Detection Prevalence 0.208333 0.075000 0.162500
## Balanced Accuracy 0.495745 0.462185 0.531025
## Class: Standard Class
## Sensitivity 0.56250
## Specificity 0.56250
## Pos Pred Value 0.94737
## Neg Pred Value 0.08411
## Prevalence 0.93333
## Detection Rate 0.52500
## Detection Prevalence 0.55417
## Balanced Accuracy 0.56250
library(e1071)
svm <- svm(as.factor(`Ship Mode`) ~.,
data = train,
type = 'C-classification',
kernel = 'linear')
summary(svm)
##
## Call:
## svm(formula = as.factor(`Ship Mode`) ~ ., data = train, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 475
##
## ( 95 102 232 46 )
##
##
## Number of Classes: 4
##
## Levels:
## First Class Same Day Second Class Standard Class
# Evaluate performance on test set.
predict_unseen <- predict(svm,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 0 0 0 50
## Same Day 0 0 0 18
## Second Class 0 0 0 39
## Standard Class 0 0 0 133
##
## Overall Statistics
##
## Accuracy : 0.5542
## 95% CI : (0.4889, 0.6181)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity NA NA NA
## Specificity 0.7917 0.925 0.8375
## Pos Pred Value NA NA NA
## Neg Pred Value NA NA NA
## Prevalence 0.0000 0.000 0.0000
## Detection Rate 0.0000 0.000 0.0000
## Detection Prevalence 0.2083 0.075 0.1625
## Balanced Accuracy NA NA NA
## Class: Standard Class
## Sensitivity 0.5542
## Specificity NA
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 1.0000
## Detection Rate 0.5542
## Detection Prevalence 0.5542
## Balanced Accuracy NA
nb <- naiveBayes(`Ship Mode` ~.,
data = train)
# Evaluate performance on test set.
predict_unseen <- predict(nb,
test)
table_mat <- table(test$`Ship Mode`,
predict_unseen)
confusionMatrix(table_mat)
## Confusion Matrix and Statistics
##
## predict_unseen
## First Class Same Day Second Class Standard Class
## First Class 1 0 25 24
## Same Day 0 0 4 14
## Second Class 0 3 8 28
## Standard Class 3 1 31 98
##
## Overall Statistics
##
## Accuracy : 0.4458
## 95% CI : (0.3819, 0.5111)
## No Information Rate : 0.6833
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0287
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: First Class Class: Same Day Class: Second Class
## Sensitivity 0.250000 0.00000 0.11765
## Specificity 0.792373 0.92373 0.81977
## Pos Pred Value 0.020000 0.00000 0.20513
## Neg Pred Value 0.984211 0.98198 0.70149
## Prevalence 0.016667 0.01667 0.28333
## Detection Rate 0.004167 0.00000 0.03333
## Detection Prevalence 0.208333 0.07500 0.16250
## Balanced Accuracy 0.521186 0.46186 0.46871
## Class: Standard Class
## Sensitivity 0.5976
## Specificity 0.5395
## Pos Pred Value 0.7368
## Neg Pred Value 0.3832
## Prevalence 0.6833
## Detection Rate 0.4083
## Detection Prevalence 0.5542
## Balanced Accuracy 0.5685
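To compare the four classifiers side by side, their test-set accuracies can be gathered into a single table (a sketch, assuming the fitted objects above are still in the workspace):
# Test-set accuracy for each classifier
acc <- function(pred) mean(pred == test$`Ship Mode`)
data.frame(
  model = c("Decision tree", "Random forest", "SVM", "Naive Bayes"),
  accuracy = c(acc(predict(fit, test, type = "class")),
               acc(predict(mod_rf, test)),
               acc(predict(svm, test)),
               acc(predict(nb, test)))
)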
To reiterate, the Superstore owners have the following concerns:
- Which products, regions, categories and customer segments should they target or avoid?
- Can a regression model be built to predict Sales or Profit?
A data scientist’s recommendations to the Superstore owners are as follows:
- In terms of Customer Segment, target the Home Office segment, as this trend is set to grow, accelerated by Covid. The Corporate segment is too competitive, as it is a very established market segment.
- In terms of Sales by Region, take a deep dive into the Superstore’s market presence in the Southern region to understand why sales there are much lower than in the East and West.
- In terms of Returns, the Office Supplies category records the highest number of returned items, driven by the Paper and Binders sub-categories. A survey should be conducted to collect feedback on why these two sub-categories have such a high return rate.
In regard to predictive analytics (regression and classification) on the dataset:
- Correlation exists only between a few selected variables: Profit, Sales, Quantity, Discount and Sub-Category.
- Regression: a predictive model for Profit from Sales was constructed; the RMSE reported above (about 1.04) should be treated with caution given the issues noted after that calculation. The plot also suggests some instances with large errors that need to be rechecked.
- Classification: this dataset is not well suited for classification, as most of the variables are only weakly correlated with one another; further data collection on new features is needed.
- Future work includes a deep dive into the instances that trigger large errors in the regression model, and identifying new features to collect for the classification problem.