With growing demands and cut-throat competitions in the market, a superstore giant is looking for some business insights. In this project, we would like to make this business understand which products, regions, sub-categories and customer segments they should target or avoid.

“Cut-throat competition, also known as destructive or ruinous competition, refers to situations when competition results in prices that do not chronically or for extended periods of time cover costs of production, particularly fixed costs.”

The dataset was obtained from Kaggle.com and it contains sales records for an unknown company.

Source: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final

The dataset is comprised of 17 categorical and 4 quantitative variables and the observations are from 2014 thru 2017. There are 9994 observations in the dataset. Below is a list of all 21 variables and what they represent:

Row ID => Unique ID for each row.

Order ID => Unique Order ID for each Customer.

Order Date => Order Date of the product.

Ship Date => Shipping Date of the Product.

Ship Mode=> Shipping Mode specified by the Customer.

Customer ID => Unique ID to identify each Customer.

Customer Name => Name of the Customer.

Segment => The segment where the Customer belongs.

Country => Country of residence of the Customer.

City => City of residence of of the Customer.

State => State of residence of the Customer.

Postal Code => Postal Code of every Customer.

Region => Region where the Customer belong.

Product ID => Unique ID of the Product.

Category => Category of the product ordered.

Sub-Category => Sub-Category of the product ordered.

Product Name => Name of the Product

Sales => Sales of the Product.

Quantity => Quantity of the Product.

Discount => Discount provided.

Profit => Profit/Loss incurred.

Load necessary libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

Read Dataset into Rstudio and show first few rows

# Set working directory
setwd("~/Data101/Final Project")

salesr <- read_csv("superstore.csv")

## Rows: 9994 Columns: 21

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (15): Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer ...
## dbl  (6): Row ID, Postal Code, Sales, Quantity, Discount, Profit

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(salesr)

## # A tibble: 6 x 21
##   `Row ID` `Order ID`     `Order Date` `Ship Date` `Ship Mode`    `Customer ID`
##      <dbl> <chr>          <chr>        <chr>       <chr>          <chr>        
## 1        1 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
## 2        2 CA-2016-152156 11/8/2016    11/11/2016  Second Class   CG-12520     
## 3        3 CA-2016-138688 6/12/2016    6/16/2016   Second Class   DV-13045     
## 4        4 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
## 5        5 US-2015-108966 10/11/2015   10/18/2015  Standard Class SO-20335     
## 6        6 CA-2014-115812 6/9/2014     6/14/2014   Standard Class BH-11710     
## # ... with 15 more variables: Customer Name <chr>, Segment <chr>,
## #   Country <chr>, City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
## #   Product ID <chr>, Category <chr>, Sub-Category <chr>, Product Name <chr>,
## #   Sales <dbl>, Quantity <dbl>, Discount <dbl>, Profit <dbl>

There are no missing values in our data.

sum(is.na(salesr))

## [1] 0

Clean up the data:

Make all headers lowercase and remove spaces

Look at the data structure

names(salesr) <- tolower(names(salesr))
names(salesr) <- gsub(" ","",names(salesr))
str(salesr)

## spec_tbl_df [9,994 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ rowid       : num [1:9994] 1 2 3 4 5 6 7 8 9 10 ...
##  $ orderid     : chr [1:9994] "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ orderdate   : chr [1:9994] "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
##  $ shipdate    : chr [1:9994] "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
##  $ shipmode    : chr [1:9994] "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ customerid  : chr [1:9994] "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ customername: chr [1:9994] "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
##  $ segment     : chr [1:9994] "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ country     : chr [1:9994] "United States" "United States" "United States" "United States" ...
##  $ city        : chr [1:9994] "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
##  $ state       : chr [1:9994] "Kentucky" "Kentucky" "California" "Florida" ...
##  $ postalcode  : num [1:9994] 42420 42420 90036 33311 33311 ...
##  $ region      : chr [1:9994] "South" "South" "West" "South" ...
##  $ productid   : chr [1:9994] "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ category    : chr [1:9994] "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ sub-category: chr [1:9994] "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ productname : chr [1:9994] "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ sales       : num [1:9994] 262 731.9 14.6 957.6 22.4 ...
##  $ quantity    : num [1:9994] 2 3 2 5 2 7 4 6 3 5 ...
##  $ discount    : num [1:9994] 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ profit      : num [1:9994] 41.91 219.58 6.87 -383.03 2.52 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Row ID` = col_double(),
##   ..   `Order ID` = col_character(),
##   ..   `Order Date` = col_character(),
##   ..   `Ship Date` = col_character(),
##   ..   `Ship Mode` = col_character(),
##   ..   `Customer ID` = col_character(),
##   ..   `Customer Name` = col_character(),
##   ..   Segment = col_character(),
##   ..   Country = col_character(),
##   ..   City = col_character(),
##   ..   State = col_character(),
##   ..   `Postal Code` = col_double(),
##   ..   Region = col_character(),
##   ..   `Product ID` = col_character(),
##   ..   Category = col_character(),
##   ..   `Sub-Category` = col_character(),
##   ..   `Product Name` = col_character(),
##   ..   Sales = col_double(),
##   ..   Quantity = col_double(),
##   ..   Discount = col_double(),
##   ..   Profit = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Orderdate and shipdate are in the wrong format. Let’s fix that.

# convert variable type to date.

salesr$orderdate <-  mdy(salesr$orderdate)
salesr$shipdate <-   mdy(salesr$shipdate)

Check class of variables

class(salesr$orderdate)

## [1] "Date"

class(salesr$shipdate)

## [1] "Date"

Now that our data is clean, we can start asking questions and perform Explanatory Data Analysis (EDA).

Part 1

Is the average profit/sales the same per region?

What type of variation occurs within the segment variable?

EDA- Investigations

Our data set has 21 columns. We will not be using all of them for our Analysis

Create a new object

salesr2 <- salesr%>%
  select(orderdate,shipmode,segment,productid, region, category,`sub-category`, sales, quantity, discount, profit)%>%
  mutate(yearsale =year(orderdate))%>%
  rename(subcategory = `sub-category`)

Preliminary investigation to get a rough idea of what the data set looks like

Sales - Five number summary and region

tapply(salesr2$sales,salesr2$region, summary)

## $Central
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.444    14.620    45.980   215.773   200.012 17499.950 
## 
## $East
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.852    17.520    54.900   238.336   209.617 11199.968 
## 
## $South
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.167    17.187    54.594   241.804   208.722 22638.480 
## 
## $West
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.99    19.44    60.84   226.49   215.81 13999.96

Profit - Five number summary and region

tapply(salesr2$profit,salesr2$region, summary)

## $Central
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3701.893    -5.664     5.184    17.093    22.456  8399.976 
## 
## $East
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6599.978     1.558     8.172    32.136    28.719  5039.986 
## 
## $South
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3839.990     1.947     9.072    28.858    34.234  3177.475 
## 
## $West
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3399.980     3.852    11.166    33.849    33.000  6719.981

Boxplots

salesr2%>%
    ggplot(aes(region, profit, col=region ))+
    geom_boxplot()+
    stat_boxplot(geom = "errorbar", width = 0.2)+
    theme(axis.text.x = element_text(angle = 90))+
    labs(y = "Profit", x = "Region", title = "Boxplot - Profit by Region")+
    theme_bw()

salesr2%>%
    ggplot(aes(region, sales, col=region ))+
    geom_boxplot()+
    stat_boxplot(geom = "errorbar", width = 0.2)+
    theme(axis.text.x = element_text(angle = 90))+
    labs(y = "Sales", x = "Region", title = "Boxplot - Sales by Region")+
    theme_bw()

table(salesr2$segment)

## 
##    Consumer   Corporate Home Office 
##        5191        3020        1783

Visualising the distribution of segment (customer type)

salesr2 %>% 
  count(segment) %>% 
  hchart('column', hcaes(x = 'segment', y = 'n'))%>%
  hc_title(text = "Customer Type") %>% 
  hc_subtitle(text = "https://www.kaggle.com/datasets/vivek468/superstore-dataset-final",color = "#A4CBAF") %>%
  hc_add_theme(hc_theme_darkunica())

So far, our EDA shows 51% of the purchases are from regular consumers, 30% from corporate offices and the rest from home offices.

Also,there are lots of negative sales for all subcategories in every region. However, the median profit/sales seems to be the same for all regions. Let’s randomly pick a subcategory and do some hypothesis test.

Part 2

First, let’s visualize each of the subcategory to see the distribution of profits

salesr2%>%
    ggplot(aes(subcategory, profit, col=category ))+
    geom_boxplot()+
    stat_boxplot(geom = "errorbar", width = 0.2)+
    theme_bw()+
    facet_wrap(~region)+
    ggtitle("Boxplot Profit by Sub-Category") +
    xlab("Sub-Category") + ylab("Profit")+
    theme(axis.text.x = element_text(angle = 90))

Question

Is it more profitable to sell Machines to customers on the East coast vs West coast customers?

Let’s zoom into the Machines subcategory to see the distribution of profits per region

Boxplot of East/West customers - Purchase of Machines

  salesr2 %>%
  #filter(region=="East"|region== "West") %>%
  filter(subcategory=="Machines")%>%
   ggplot(aes(subcategory, profit, color=region))+
  ggtitle("Boxplot Profit Machines - Sub-Category") +
    xlab(" ") + ylab("Profit")+
  facet_wrap(~region)+
  geom_boxplot()+
  theme_bw()

T-Test

We will conduct a T-Test to find out if the average profit from selling Machines on the East coast is equal or not to the West coast.

Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).

H0 : μe = μw

Hα : μe > μw

westp <- salesr2%>%
  filter(region=="West" & subcategory=="Machines")

eastp <- salesr2%>%
  filter(region=="East" & subcategory=="Machines")

t.test(eastp$profit, westp$profit, alternative = "greater", conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  eastp$profit and westp$profit
## t = 0.73843, df = 46.282, p-value = 0.232
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -258.5826       Inf
## sample estimates:
## mean of x mean of y 
## 187.26062 -15.86991

P-value: 0.23 > 0.05

Conclusion: We do not reject the Null Hypothesis H0 : μm = μc

There’s not enough evidence that it is more profitable to sell Machines to East coast customers vs. West coast customers.

Part 3

We’re also intetested in finding out which products are more profitable -

Let’s investigate

Filter all negative sales

deficit <- salesr %>%
  filter(profit<=0)
head (deficit)

## # A tibble: 6 x 21
##   rowid orderid  orderdate  shipdate   shipmode  customerid customername segment
##   <dbl> <chr>    <date>     <date>     <chr>     <chr>      <chr>        <chr>  
## 1     4 US-2015~ 2015-10-11 2015-10-18 Standard~ SO-20335   Sean O'Donn~ Consum~
## 2    15 US-2015~ 2015-11-22 2015-11-26 Standard~ HP-14815   Harold Pawl~ Home O~
## 3    16 US-2015~ 2015-11-22 2015-11-26 Standard~ HP-14815   Harold Pawl~ Home O~
## 4    24 US-2017~ 2017-07-16 2017-07-18 Second C~ SF-20065   Sandra Flan~ Consum~
## 5    28 US-2015~ 2015-09-17 2015-09-21 Standard~ TB-21520   Tracy Blums~ Consum~
## 6    29 US-2015~ 2015-09-17 2015-09-21 Standard~ TB-21520   Tracy Blums~ Consum~
## # ... with 13 more variables: country <chr>, city <chr>, state <chr>,
## #   postalcode <dbl>, region <chr>, productid <chr>, category <chr>,
## #   sub-category <chr>, productname <chr>, sales <dbl>, quantity <dbl>,
## #   discount <dbl>, profit <dbl>

Create object (table) for all items (subcategory) that were sold at a loss

df1 <- table(deficit$`sub-category`)
df1 <- as.data.frame(df1)
df1

##           Var1 Freq
## 1  Accessories   92
## 2   Appliances   67
## 3      Binders  613
## 4    Bookcases  111
## 5       Chairs  255
## 6    Fasteners   17
## 7  Furnishings  176
## 8     Machines   44
## 9       Phones  138
## 10     Storage  185
## 11    Supplies   33
## 12      Tables  205

Table of all items (subcategory)

df2 <- table(salesr2$subcategory)
df2 <- as.data.frame(df2)
df2

##           Var1 Freq
## 1  Accessories  775
## 2   Appliances  466
## 3          Art  796
## 4      Binders 1523
## 5    Bookcases  228
## 6       Chairs  617
## 7      Copiers   68
## 8    Envelopes  254
## 9    Fasteners  217
## 10 Furnishings  957
## 11      Labels  364
## 12    Machines  115
## 13       Paper 1370
## 14      Phones  889
## 15     Storage  846
## 16    Supplies  190
## 17      Tables  319

Subcategory with no negative sales

df3 <-setdiff(df2$Var1, df1$Var1)
df3

## [1] "Art"       "Copiers"   "Envelopes" "Labels"    "Paper"

Create new object showing profits by Sub-category

benefit <- salesr2 %>%
  group_by(subcategory)%>%
  summarise(totprofit=sum(profit))%>%
  arrange(-totprofit)
benefit

## # A tibble: 17 x 2
##    subcategory totprofit
##    <chr>           <dbl>
##  1 Copiers        55618.
##  2 Phones         44516.
##  3 Accessories    41937.
##  4 Paper          34054.
##  5 Binders        30222.
##  6 Chairs         26590.
##  7 Storage        21279.
##  8 Appliances     18138.
##  9 Furnishings    13059.
## 10 Envelopes       6964.
## 11 Art             6528.
## 12 Labels          5546.
## 13 Machines        3385.
## 14 Fasteners        950.
## 15 Supplies       -1189.
## 16 Bookcases      -3473.
## 17 Tables        -17725.

Based on the previous table, we can see the company has been losing money consistently on 3 different kind of items (Bookcases, Supplies, Tables).

Bookcases

book_cases <- salesr2 %>%
  filter(subcategory=="Bookcases" & profit<=0)

head(book_cases, 15)

## # A tibble: 15 x 12
##    orderdate  shipmode   segment   productid  region category subcategory  sales
##    <date>     <chr>      <chr>     <chr>      <chr>  <chr>    <chr>        <dbl>
##  1 2015-09-17 Standard ~ Consumer  FUR-BO-10~ East   Furnitu~ Bookcases   3083. 
##  2 2015-12-27 Standard ~ Home Off~ FUR-BO-10~ Centr~ Furnitu~ Bookcases    532. 
##  3 2015-01-02 Standard ~ Corporate FUR-BO-10~ East   Furnitu~ Bookcases    452. 
##  4 2016-11-20 Second Cl~ Home Off~ FUR-BO-10~ East   Furnitu~ Bookcases     87.0
##  5 2016-04-08 Standard ~ Consumer  FUR-BO-10~ East   Furnitu~ Bookcases    389. 
##  6 2016-09-08 Second Cl~ Consumer  FUR-BO-10~ Centr~ Furnitu~ Bookcases   2396. 
##  7 2017-03-31 Standard ~ Corporate FUR-BO-10~ Centr~ Furnitu~ Bookcases    205. 
##  8 2017-06-10 First Cla~ Home Off~ FUR-BO-10~ West   Furnitu~ Bookcases    514. 
##  9 2014-12-02 First Cla~ Consumer  FUR-BO-10~ East   Furnitu~ Bookcases    884. 
## 10 2015-10-03 Second Cl~ Consumer  FUR-BO-10~ East   Furnitu~ Bookcases     35.5
## 11 2017-12-28 Standard ~ Consumer  FUR-BO-10~ Centr~ Furnitu~ Bookcases     78.9
## 12 2017-12-18 Second Cl~ Consumer  FUR-BO-10~ West   Furnitu~ Bookcases    120. 
## 13 2015-12-24 Standard ~ Consumer  FUR-BO-10~ West   Furnitu~ Bookcases    590. 
## 14 2015-03-24 First Cla~ Home Off~ FUR-BO-10~ Centr~ Furnitu~ Bookcases    359. 
## 15 2017-10-07 Standard ~ Consumer  FUR-BO-10~ West   Furnitu~ Bookcases    308. 
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## #   yearsale <dbl>

Let’s find out which items the company should avoid in the bookcases subcategory

table(book_cases$productid)

## 
## FUR-BO-10000112 FUR-BO-10000330 FUR-BO-10000362 FUR-BO-10000468 FUR-BO-10000780 
##               1               2               1               3               5 
## FUR-BO-10001337 FUR-BO-10001519 FUR-BO-10001567 FUR-BO-10001601 FUR-BO-10001608 
##               7               1               1               2               2 
## FUR-BO-10001798 FUR-BO-10001811 FUR-BO-10001918 FUR-BO-10001972 FUR-BO-10002202 
##               2               5               3               5               2 
## FUR-BO-10002206 FUR-BO-10002213 FUR-BO-10002268 FUR-BO-10002545 FUR-BO-10002613 
##               1               7               3               3               1 
## FUR-BO-10002824 FUR-BO-10002853 FUR-BO-10003034 FUR-BO-10003159 FUR-BO-10003272 
##               2               1               3               5               6 
## FUR-BO-10003433 FUR-BO-10003441 FUR-BO-10003450 FUR-BO-10003546 FUR-BO-10003660 
##               2               2               1               2               1 
## FUR-BO-10003893 FUR-BO-10003894 FUR-BO-10003965 FUR-BO-10003966 FUR-BO-10004015 
##               1               1               4               1               1 
## FUR-BO-10004218 FUR-BO-10004357 FUR-BO-10004360 FUR-BO-10004409 FUR-BO-10004467 
##               3               3               1               3               4 
## FUR-BO-10004690 FUR-BO-10004695 FUR-BO-10004709 FUR-BO-10004834 
##               1               1               3               2

Supplies

supply <- salesr2 %>%
  filter(subcategory=="Supplies" & profit<=0)
head(supply, 15)

## # A tibble: 15 x 12
##    orderdate  shipmode   segment  productid  region category  subcategory  sales
##    <date>     <chr>      <chr>    <chr>      <chr>  <chr>     <chr>        <dbl>
##  1 2015-11-21 Second Cl~ Consumer OFF-SU-10~ East   Office S~ Supplies    1.59e1
##  2 2016-12-18 Second Cl~ Corpora~ OFF-SU-10~ South  Office S~ Supplies    9.61e2
##  3 2016-04-22 Standard ~ Home Of~ OFF-SU-10~ West   Office S~ Supplies    1.85e2
##  4 2015-09-03 Standard ~ Consumer OFF-SU-10~ East   Office S~ Supplies    6.66e2
##  5 2015-11-15 Second Cl~ Corpora~ OFF-SU-10~ Centr~ Office S~ Supplies    8.72e0
##  6 2016-08-27 Standard ~ Consumer OFF-SU-10~ Centr~ Office S~ Supplies    5.15e1
##  7 2017-09-28 Second Cl~ Consumer OFF-SU-10~ Centr~ Office S~ Supplies    1.74e0
##  8 2014-04-06 First Cla~ Home Of~ OFF-SU-10~ East   Office S~ Supplies    1.03e1
##  9 2017-03-02 Standard ~ Corpora~ OFF-SU-10~ Centr~ Office S~ Supplies    6.98e0
## 10 2017-09-22 Second Cl~ Consumer OFF-SU-10~ East   Office S~ Supplies    4.66e3
## 11 2016-11-10 Second Cl~ Consumer OFF-SU-10~ East   Office S~ Supplies    1.47e1
## 12 2014-10-02 First Cla~ Corpora~ OFF-SU-10~ West   Office S~ Supplies    1.54e1
## 13 2014-12-08 Standard ~ Consumer OFF-SU-10~ West   Office S~ Supplies    1.39e1
## 14 2017-11-18 Standard ~ Consumer OFF-SU-10~ South  Office S~ Supplies    2.94e0
## 15 2014-11-03 Standard ~ Consumer OFF-SU-10~ East   Office S~ Supplies    2.86e2
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## #   yearsale <dbl>

Let’s find out which items the company should avoid in the Supplies subcategory

table(supply$productid)

## 
## OFF-SU-10000151 OFF-SU-10000157 OFF-SU-10000432 OFF-SU-10000646 OFF-SU-10001225 
##               1               2               4               3               4 
## OFF-SU-10001935 OFF-SU-10002189 OFF-SU-10002301 OFF-SU-10002881 OFF-SU-10003505 
##               4               1               1               3               3 
## OFF-SU-10003567 OFF-SU-10003936 OFF-SU-10004498 
##               3               1               3

Tables

tables <- salesr2 %>%
  filter(subcategory=="Tables" & profit<=0)
head(tables, 15)

## # A tibble: 15 x 12
##    orderdate  shipmode   segment   productid   region category subcategory sales
##    <date>     <chr>      <chr>     <chr>       <chr>  <chr>    <chr>       <dbl>
##  1 2015-10-11 Standard ~ Consumer  FUR-TA-100~ South  Furnitu~ Tables       958.
##  2 2014-09-20 Standard ~ Consumer  FUR-TA-100~ Centr~ Furnitu~ Tables       618.
##  3 2014-08-03 First Cla~ Consumer  FUR-TA-100~ West   Furnitu~ Tables       219.
##  4 2017-04-07 Standard ~ Home Off~ FUR-TA-100~ South  Furnitu~ Tables       234.
##  5 2017-04-07 Standard ~ Home Off~ FUR-TA-100~ South  Furnitu~ Tables       621.
##  6 2016-06-04 Second Cl~ Consumer  FUR-TA-100~ Centr~ Furnitu~ Tables       177.
##  7 2017-11-19 Standard ~ Consumer  FUR-TA-100~ Centr~ Furnitu~ Tables       219.
##  8 2014-07-12 Standard ~ Corporate FUR-TA-100~ West   Furnitu~ Tables       698.
##  9 2017-08-27 Standard ~ Corporate FUR-TA-100~ East   Furnitu~ Tables      1488.
## 10 2015-11-27 Standard ~ Consumer  FUR-TA-100~ South  Furnitu~ Tables       375.
## 11 2017-12-08 Standard ~ Consumer  FUR-TA-100~ West   Furnitu~ Tables      1004.
## 12 2017-10-20 Standard ~ Corporate FUR-TA-100~ East   Furnitu~ Tables       284.
## 13 2016-04-22 Standard ~ Home Off~ FUR-TA-100~ West   Furnitu~ Tables      1273.
## 14 2016-03-18 Second Cl~ Consumer  FUR-TA-100~ South  Furnitu~ Tables       190.
## 15 2016-06-10 Standard ~ Consumer  FUR-TA-100~ West   Furnitu~ Tables      1336.
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## #   yearsale <dbl>

Let’s find out which items the company should avoid in the Tables subcategory

table(tables$productid)

## 
## FUR-TA-10000198 FUR-TA-10000577 FUR-TA-10000617 FUR-TA-10000688 FUR-TA-10000849 
##               4               5               2               3               4 
## FUR-TA-10001039 FUR-TA-10001086 FUR-TA-10001095 FUR-TA-10001307 FUR-TA-10001520 
##               6               2               7               2               7 
## FUR-TA-10001539 FUR-TA-10001676 FUR-TA-10001705 FUR-TA-10001768 FUR-TA-10001771 
##               5               9               2               4               1 
## FUR-TA-10001857 FUR-TA-10001866 FUR-TA-10001889 FUR-TA-10001932 FUR-TA-10001950 
##               3               1               6               4               2 
## FUR-TA-10002041 FUR-TA-10002228 FUR-TA-10002356 FUR-TA-10002530 FUR-TA-10002533 
##               5               3               4               4               4 
## FUR-TA-10002607 FUR-TA-10002622 FUR-TA-10002645 FUR-TA-10002774 FUR-TA-10002855 
##               7               3               3               3               3 
## FUR-TA-10002903 FUR-TA-10002958 FUR-TA-10003008 FUR-TA-10003238 FUR-TA-10003392 
##               3               6               6               5               2 
## FUR-TA-10003469 FUR-TA-10003473 FUR-TA-10003569 FUR-TA-10003715 FUR-TA-10003748 
##               5               5               2               3               2 
## FUR-TA-10003837 FUR-TA-10003954 FUR-TA-10004086 FUR-TA-10004147 FUR-TA-10004152 
##               2               3               4               1               4 
## FUR-TA-10004154 FUR-TA-10004175 FUR-TA-10004256 FUR-TA-10004289 FUR-TA-10004442 
##               4               2               3               3               3 
## FUR-TA-10004534 FUR-TA-10004575 FUR-TA-10004607 FUR-TA-10004619 FUR-TA-10004767 
##               2               4               6               2               2 
## FUR-TA-10004915 
##               3

Part 4 Chi-Square test

What is the average profit per region and year?

Predict future profits.

salestable <- tapply(salesr2$profit,list(salesr2$region, salesr2$yearsale), sum); salestable

##               2014      2015     2016      2017
## Central   539.5534 11716.802 19899.16  7550.844
## East    17059.6095 21091.013 20141.60 33230.561
## South   11879.1200  8318.594 17702.81  8848.908
## West    20065.6912 20492.195 24051.61 43808.956

table_profit <- salesr2%>%
  select(region,yearsale,profit)%>%
  group_by(region, yearsale)%>%
  summarise(totprofit = round(sum(profit),2))

## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.

Interactive chart

linechart <- table_profit %>%
  ggplot(aes(yearsale, totprofit, colour=region))+
  scale_color_brewer(palette = "Set1")+
  scale_color_discrete(name =  " ")+
  theme_bw()+
  geom_point()+
  geom_line()+
  ggtitle("Business Profit 2014-2017")+
  xlab("Year")+
  ylab("Profit  ($) ")

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

linechart <- ggplotly(linechart)
linechart

Let’s build a Linear Regression Model to predict future Profit.

lprofit <- lm(profit ~ sales, data = salesr2)
summary(lprofit)

## 
## Call:
## lm(formula = profit ~ sales, data = salesr2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7397.5     2.6    14.6    21.7  5261.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.732867   2.192459  -5.808 6.53e-09 ***
## sales         0.180067   0.003301  54.555  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205.6 on 9992 degrees of freedom
## Multiple R-squared:  0.2295, Adjusted R-squared:  0.2294 
## F-statistic:  2976 on 1 and 9992 DF,  p-value: < 2.2e-16

Based on our calculations our model has the following equation: profit = 0.18(sales)-12.73. We also have a very small p-value of 2.2e-16 and an Ajusted R-Square value suggesting that about 78% of the variation in the observations maybe explained by our model.

What ifS…

Annual Profit if the company was not selling Supplies, Tables, and Bookcases

benefit3 <- salesr2 %>%
  group_by(yearsale)%>%
  filter(subcategory != "Supplies"& subcategory != "Tables"& subcategory !="Bookcases" )%>%
  summarise(totsales=sum(sales), totprofit=sum(profit), profitpercentage = round(totprofit/totsales,2))#%>%
  #arrange(-totprofit)
benefit3

## # A tibble: 4 x 4
##   yearsale totsales totprofit profitpercentage
##      <dbl>    <dbl>     <dbl>            <dbl>
## 1     2014  403728.    52524.             0.13
## 2     2015  390886.    67909.             0.17
## 3     2016  507819.    85233.             0.17
## 4     2017  626248.   103119.             0.16

Actual Annual Profit

benefit2 <- salesr2 %>%
  group_by(yearsale)%>%
  summarise(totsales=sum(sales), totprofit=sum(profit), profitpercentage = round(totprofit/totsales,2))#%>%
  #arrange(-totprofit)
benefit2

## # A tibble: 4 x 4
##   yearsale totsales totprofit profitpercentage
##      <dbl>    <dbl>     <dbl>            <dbl>
## 1     2014  484247.    49544.             0.1 
## 2     2015  470533.    61619.             0.13
## 3     2016  609206.    81795.             0.13
## 4     2017  733215.    93439.             0.13

Investigating

Is there any relationship between segment and category?

segment describes where the Customer belongs (regular consumer, corporation, home office).

And Caterory is the Category of the product that is ordered.

Create a visualization that shows the distribution of the contingency table.

table_seg_subcategory <- salesr2 %>%
  count(segment, subcategory)
table_seg_subcategory

## # A tibble: 51 x 3
##    segment  subcategory     n
##    <chr>    <chr>       <int>
##  1 Consumer Accessories   408
##  2 Consumer Appliances    244
##  3 Consumer Art           428
##  4 Consumer Binders       780
##  5 Consumer Bookcases     131
##  6 Consumer Chairs        329
##  7 Consumer Copiers        35
##  8 Consumer Envelopes     129
##  9 Consumer Fasteners     114
## 10 Consumer Furnishings   494
## # ... with 41 more rows

Chi-Square Test (Buying patterns)

Conduct a Chi-Square test with = 0.05. H0: segment and category are independent

Hα: segment and category relate to each other

First let’s create our contingency table for the chi-square test.

tab_seg_sub <- table(salesr$segment,salesr2$category); tab_seg_sub

##              
##               Furniture Office Supplies Technology
##   Consumer         1113            3127        951
##   Corporate         646            1820        554
##   Home Office       362            1079        342

Let’s do a chi-square test

chisq.test(tab_seg_sub)

## 
##  Pearson's Chi-squared test
## 
## data:  tab_seg_sub
## X-squared = 1.4612, df = 4, p-value = 0.8335

P-value: 0.79 > 0.05

Conclusion: We do not reject the Null Hypothesis H0: Segment and category are independent.

The items that were purchased are not influenced by the type of customer.

Part 5 Our findings - Recommendations to the Company

Regular consumers constitute more than 50% of the company total revenues. While the company is profitable in all regions, the total profits on the East/West coast regions are higher than the ones South/Central. In addition, some sub-categories are very profitable, and the company should target them: Art, Copier, Envelopes, Labels, and Paper.

However, the company should stop selling tables, supplies and bookcases. These items are heavily discounted and are being sold at a loss.

Ethical Concerns

My main concern with this dataset is the fact that it contains people’s personal data. The person who published the data put zero effort into masking that information. All analysts, data scientists, information technology professionals or people who handle data, need to be well-versed in the basic principles of data ethics. They have moral obligations of gathering, protecting, and using personally identifiable information ethically.

Because this dataset was not de-identified, all customers’ shipping/home zip codes were made publicly available. This might cause lots of troubles for these customers. Not only does it could tell criminals where they live but it also opens them up for identity theft.

Final Project - Data 101

Daniel Lachaud

4/22/2022

With growing demands and cut-throat competitions in the market, a superstore giant is looking for some business insights. In this project, we would like to make this business understand which products, regions, sub-categories and customer segments they should target or avoid.

“Cut-throat competition, also known as destructive or ruinous competition, refers to situations when competition results in prices that do not chronically or for extended periods of time cover costs of production, particularly fixed costs.”

The dataset was obtained from Kaggle.com and it contains sales records for an unknown company.

Source: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final

The dataset is comprised of 17 categorical and 4 quantitative variables and the observations are from 2014 thru 2017. There are 9994 observations in the dataset. Below is a list of all 21 variables and what they represent:

Row ID => Unique ID for each row.

Order ID => Unique Order ID for each Customer.

Order Date => Order Date of the product.

Ship Date => Shipping Date of the Product.

Ship Mode=> Shipping Mode specified by the Customer.

Customer ID => Unique ID to identify each Customer.

Customer Name => Name of the Customer.

Segment => The segment where the Customer belongs.

Country => Country of residence of the Customer.

City => City of residence of of the Customer.

State => State of residence of the Customer.

Postal Code => Postal Code of every Customer.

Region => Region where the Customer belong.

Product ID => Unique ID of the Product.

Category => Category of the product ordered.

Sub-Category => Sub-Category of the product ordered.

Product Name => Name of the Product

Sales => Sales of the Product.

Quantity => Quantity of the Product.

Discount => Discount provided.

Profit => Profit/Loss incurred.

Load necessary libraries

Read Dataset into Rstudio and show first few rows

There are no missing values in our data.

Clean up the data:

Make all headers lowercase and remove spaces

Look at the data structure

Orderdate and shipdate are in the wrong format. Let’s fix that.

Check class of variables

Now that our data is clean, we can start asking questions and perform Explanatory Data Analysis (EDA).

Part 1

Is the average profit/sales the same per region?

What type of variation occurs within the segment variable?

EDA- Investigations

Our data set has 21 columns. We will not be using all of them for our Analysis

Create a new object

Preliminary investigation to get a rough idea of what the data set looks like

Sales - Five number summary and region

Profit - Five number summary and region

Boxplots

Visualising the distribution of segment (customer type)

So far, our EDA shows 51% of the purchases are from regular consumers, 30% from corporate offices and the rest from home offices.

Also,there are lots of negative sales for all subcategories in every region. However, the median profit/sales seems to be the same for all regions. Let’s randomly pick a subcategory and do some hypothesis test.

Part 2

First, let’s visualize each of the subcategory to see the distribution of profits

Question

Is it more profitable to sell Machines to customers on the East coast vs West coast customers?

Let’s zoom into the Machines subcategory to see the distribution of profits per region

Boxplot of East/West customers - Purchase of Machines

T-Test

We will conduct a T-Test to find out if the average profit from selling Machines on the East coast is equal or not to the West coast.

Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).

H0 : μe = μw

Hα : μe > μw

P-value: 0.23 > 0.05

Conclusion: We do not reject the Null Hypothesis H0 : μm = μc

There’s not enough evidence that it is more profitable to sell Machines to East coast customers vs. West coast customers.

Part 3

We’re also intetested in finding out which products are more profitable -

Let’s investigate

Filter all negative sales

Create object (table) for all items (subcategory) that were sold at a loss

Table of all items (subcategory)

Subcategory with no negative sales

Create new object showing profits by Sub-category

Based on the previous table, we can see the company has been losing money consistently on 3 different kind of items (Bookcases, Supplies, Tables).

Bookcases

Let’s find out which items the company should avoid in the bookcases subcategory

Supplies

Let’s find out which items the company should avoid in the Supplies subcategory

Tables

Let’s find out which items the company should avoid in the Tables subcategory