With growing demand and cut-throat competition in the market, a superstore giant is looking for business insights. In this project, we aim to help the business understand which products, regions, sub-categories and customer segments it should target or avoid.
“Cut-throat competition, also known as destructive or ruinous competition, refers to situations when competition results in prices that do not chronically or for extended periods of time cover costs of production, particularly fixed costs.”
The dataset was obtained from Kaggle.com and contains sales records for an unnamed company.
The dataset comprises 17 categorical and 4 quantitative variables, with 9,994 observations spanning 2014 through 2017. Below is a list of all 21 variables and what they represent:
Row ID => Unique ID for each row.
Order ID => Unique Order ID for each Customer.
Order Date => Order Date of the product.
Ship Date => Shipping Date of the Product.
Ship Mode=> Shipping Mode specified by the Customer.
Customer ID => Unique ID to identify each Customer.
Customer Name => Name of the Customer.
Segment => The segment to which the Customer belongs.
Country => Country of residence of the Customer.
City => City of residence of the Customer.
State => State of residence of the Customer.
Postal Code => Postal Code of every Customer.
Region => Region where the Customer belongs.
Product ID => Unique ID of the Product.
Category => Category of the product ordered.
Sub-Category => Sub-Category of the product ordered.
Product Name => Name of the Product.
Sales => Sales of the Product.
Quantity => Quantity of the Product.
Discount => Discount provided.
Profit => Profit/Loss incurred.
Load necessary libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
Read the dataset into RStudio and show the first few rows
# Set working directory
setwd("~/Data101/Final Project")
salesr <- read_csv("superstore.csv")
## Rows: 9994 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (15): Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer ...
## dbl (6): Row ID, Postal Code, Sales, Quantity, Discount, Profit
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(salesr)
## # A tibble: 6 x 21
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode` `Customer ID`
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/2016 6/16/2016 Second Class DV-13045
## 4 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/2014 6/14/2014 Standard Class BH-11710
## # ... with 15 more variables: Customer Name <chr>, Segment <chr>,
## # Country <chr>, City <chr>, State <chr>, Postal Code <dbl>, Region <chr>,
## # Product ID <chr>, Category <chr>, Sub-Category <chr>, Product Name <chr>,
## # Sales <dbl>, Quantity <dbl>, Discount <dbl>, Profit <dbl>
There are no missing values in our data.
sum(is.na(salesr))
## [1] 0
Clean up the data:
Make all headers lowercase and remove spaces
Look at the data structure
names(salesr) <- tolower(names(salesr))
names(salesr) <- gsub(" ","",names(salesr))
str(salesr)
## spec_tbl_df [9,994 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ rowid : num [1:9994] 1 2 3 4 5 6 7 8 9 10 ...
## $ orderid : chr [1:9994] "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ orderdate : chr [1:9994] "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
## $ shipdate : chr [1:9994] "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
## $ shipmode : chr [1:9994] "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ customerid : chr [1:9994] "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ customername: chr [1:9994] "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
## $ segment : chr [1:9994] "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ country : chr [1:9994] "United States" "United States" "United States" "United States" ...
## $ city : chr [1:9994] "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ state : chr [1:9994] "Kentucky" "Kentucky" "California" "Florida" ...
## $ postalcode : num [1:9994] 42420 42420 90036 33311 33311 ...
## $ region : chr [1:9994] "South" "South" "West" "South" ...
## $ productid : chr [1:9994] "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ category : chr [1:9994] "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ sub-category: chr [1:9994] "Bookcases" "Chairs" "Labels" "Tables" ...
## $ productname : chr [1:9994] "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ sales : num [1:9994] 262 731.9 14.6 957.6 22.4 ...
## $ quantity : num [1:9994] 2 3 2 5 2 7 4 6 3 5 ...
## $ discount : num [1:9994] 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ profit : num [1:9994] 41.91 219.58 6.87 -383.03 2.52 ...
## - attr(*, "spec")=
## .. cols(
## .. `Row ID` = col_double(),
## .. `Order ID` = col_character(),
## .. `Order Date` = col_character(),
## .. `Ship Date` = col_character(),
## .. `Ship Mode` = col_character(),
## .. `Customer ID` = col_character(),
## .. `Customer Name` = col_character(),
## .. Segment = col_character(),
## .. Country = col_character(),
## .. City = col_character(),
## .. State = col_character(),
## .. `Postal Code` = col_double(),
## .. Region = col_character(),
## .. `Product ID` = col_character(),
## .. Category = col_character(),
## .. `Sub-Category` = col_character(),
## .. `Product Name` = col_character(),
## .. Sales = col_double(),
## .. Quantity = col_double(),
## .. Discount = col_double(),
## .. Profit = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Now that our data is clean, we can start asking questions and perform Exploratory Data Analysis (EDA).
Part 1
Are average profit and sales the same across regions?
What type of variation occurs within the segment variable?
EDA- Investigations
Our data set has 21 columns. We will not be using all of them for our analysis.
Create a new object
salesr2 <- salesr%>%
select(orderdate,shipmode,segment,productid, region, category,`sub-category`, sales, quantity, discount, profit)%>%
mutate(orderdate = mdy(orderdate), # parse the character dates so year() works
yearsale = year(orderdate))%>%
rename(subcategory = `sub-category`)
Preliminary investigation to get a rough idea of what the data set looks like
Sales - Five-number summary by region
tapply(salesr2$sales,salesr2$region, summary)
## $Central
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.444 14.620 45.980 215.773 200.012 17499.950
##
## $East
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.852 17.520 54.900 238.336 209.617 11199.968
##
## $South
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.167 17.187 54.594 241.804 208.722 22638.480
##
## $West
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.99 19.44 60.84 226.49 215.81 13999.96
Profit - Five-number summary by region
tapply(salesr2$profit,salesr2$region, summary)
## $Central
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3701.893 -5.664 5.184 17.093 22.456 8399.976
##
## $East
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6599.978 1.558 8.172 32.136 28.719 5039.986
##
## $South
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3839.990 1.947 9.072 28.858 34.234 3177.475
##
## $West
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3399.980 3.852 11.166 33.849 33.000 6719.981
Boxplots
salesr2%>%
ggplot(aes(region, profit, col=region))+
geom_boxplot()+
stat_boxplot(geom = "errorbar", width = 0.2)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+
labs(y = "Profit", x = "Region", title = "Boxplot - Profit by Region")

salesr2%>%
ggplot(aes(region, sales, col=region))+
geom_boxplot()+
stat_boxplot(geom = "errorbar", width = 0.2)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+
labs(y = "Sales", x = "Region", title = "Boxplot - Sales by Region")

table(salesr2$segment)
##
## Consumer Corporate Home Office
## 5191 3020 1783
Visualising the distribution of segment (customer type)
salesr2 %>%
count(segment) %>%
hchart('column', hcaes(x = 'segment', y = 'n'))%>%
hc_title(text = "Customer Type") %>%
hc_subtitle(text = "https://www.kaggle.com/datasets/vivek468/superstore-dataset-final",color = "#A4CBAF") %>%
hc_add_theme(hc_theme_darkunica())
So far, our EDA shows that about 52% of purchases come from regular consumers, 30% from corporate customers, and roughly 18% from home offices.
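As a quick check on these percentages, the segment shares can be computed directly; a small sketch using the salesr2 object created above:
# Share of order lines by customer segment
salesr2 %>%
  count(segment) %>%
  mutate(pct = round(100 * n / sum(n), 1))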
Also, there are many loss-making orders (negative profit) in every region. However, the median profit and sales appear similar across regions. Let’s pick a subcategory and run a hypothesis test.
Part 2
First, let’s visualize each subcategory to see the distribution of profit
salesr2%>%
ggplot(aes(subcategory, profit, col=category ))+
geom_boxplot()+
stat_boxplot(geom = "errorbar", width = 0.2)+
theme_bw()+
facet_wrap(~region)+
ggtitle("Boxplot Profit by Sub-Category") +
xlab("Sub-Category") + ylab("Profit")+
theme(axis.text.x = element_text(angle = 90))

Question
Is it more profitable to sell Machines to East coast customers than to West coast customers?
Let’s zoom into the Machines subcategory to see the distribution of profits per region
Boxplot of East/West customers - Purchase of Machines
salesr2 %>%
#filter(region=="East"|region== "West") %>%
filter(subcategory=="Machines")%>%
ggplot(aes(subcategory, profit, color=region))+
ggtitle("Boxplot Profit Machines - Sub-Category") +
xlab(" ") + ylab("Profit")+
facet_wrap(~region)+
geom_boxplot()+
theme_bw()

T-Test
We will conduct a t-test to find out whether the average profit from selling Machines in the East region is greater than in the West region.
Conduct a two-sample (Welch) t-test at the 5% significance level (alpha = 0.05).
H0 : μe = μw
Hα : μe > μw
westp <- salesr2%>%
filter(region=="West" & subcategory=="Machines")
eastp <- salesr2%>%
filter(region=="East" & subcategory=="Machines")
t.test(eastp$profit, westp$profit, alternative = "greater", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: eastp$profit and westp$profit
## t = 0.73843, df = 46.282, p-value = 0.232
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -258.5826 Inf
## sample estimates:
## mean of x mean of y
## 187.26062 -15.86991
P-value: 0.23 > 0.05
Conclusion: We do not reject the Null Hypothesis H0 : μe = μw
There’s not enough evidence that it is more profitable to sell Machines to East coast customers vs. West coast customers.
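Profit is heavily skewed with large outliers, so as an optional robustness check (not part of the original analysis), a non-parametric rank test could be run on the same two groups:
# Wilcoxon rank-sum test as a distribution-free complement to the Welch t-test
wilcox.test(eastp$profit, westp$profit, alternative = "greater")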
Part 3
We’re also interested in finding out which products are more profitable.
Let’s investigate
Filter all orders sold at a loss (profit <= 0)
deficit <- salesr %>%
filter(profit<=0)
head (deficit)
## # A tibble: 6 x 21
## rowid orderid orderdate shipdate shipmode customerid customername segment
## <dbl> <chr> <date> <date> <chr> <chr> <chr> <chr>
## 1 4 US-2015~ 2015-10-11 2015-10-18 Standard~ SO-20335 Sean O'Donn~ Consum~
## 2 15 US-2015~ 2015-11-22 2015-11-26 Standard~ HP-14815 Harold Pawl~ Home O~
## 3 16 US-2015~ 2015-11-22 2015-11-26 Standard~ HP-14815 Harold Pawl~ Home O~
## 4 24 US-2017~ 2017-07-16 2017-07-18 Second C~ SF-20065 Sandra Flan~ Consum~
## 5 28 US-2015~ 2015-09-17 2015-09-21 Standard~ TB-21520 Tracy Blums~ Consum~
## 6 29 US-2015~ 2015-09-17 2015-09-21 Standard~ TB-21520 Tracy Blums~ Consum~
## # ... with 13 more variables: country <chr>, city <chr>, state <chr>,
## # postalcode <dbl>, region <chr>, productid <chr>, category <chr>,
## # sub-category <chr>, productname <chr>, sales <dbl>, quantity <dbl>,
## # discount <dbl>, profit <dbl>
Create an object (table) counting loss-making orders by subcategory
df1 <- table(deficit$`sub-category`)
df1 <- as.data.frame(df1)
df1
## Var1 Freq
## 1 Accessories 92
## 2 Appliances 67
## 3 Binders 613
## 4 Bookcases 111
## 5 Chairs 255
## 6 Fasteners 17
## 7 Furnishings 176
## 8 Machines 44
## 9 Phones 138
## 10 Storage 185
## 11 Supplies 33
## 12 Tables 205
Table of all items (subcategory)
df2 <- table(salesr2$subcategory)
df2 <- as.data.frame(df2)
df2
## Var1 Freq
## 1 Accessories 775
## 2 Appliances 466
## 3 Art 796
## 4 Binders 1523
## 5 Bookcases 228
## 6 Chairs 617
## 7 Copiers 68
## 8 Envelopes 254
## 9 Fasteners 217
## 10 Furnishings 957
## 11 Labels 364
## 12 Machines 115
## 13 Paper 1370
## 14 Phones 889
## 15 Storage 846
## 16 Supplies 190
## 17 Tables 319
Subcategories with no loss-making orders
df3 <-setdiff(df2$Var1, df1$Var1)
df3
## [1] "Art" "Copiers" "Envelopes" "Labels" "Paper"
Create a new object showing total profit by subcategory
benefit <- salesr2 %>%
group_by(subcategory)%>%
summarise(totprofit=sum(profit))%>%
arrange(-totprofit)
benefit
## # A tibble: 17 x 2
## subcategory totprofit
## <chr> <dbl>
## 1 Copiers 55618.
## 2 Phones 44516.
## 3 Accessories 41937.
## 4 Paper 34054.
## 5 Binders 30222.
## 6 Chairs 26590.
## 7 Storage 21279.
## 8 Appliances 18138.
## 9 Furnishings 13059.
## 10 Envelopes 6964.
## 11 Art 6528.
## 12 Labels 5546.
## 13 Machines 3385.
## 14 Fasteners 950.
## 15 Supplies -1189.
## 16 Bookcases -3473.
## 17 Tables -17725.
Based on the previous table, we can see the company has been consistently losing money on three kinds of items (Supplies, Bookcases, Tables).
Bookcases
book_cases <- salesr2 %>%
filter(subcategory=="Bookcases" & profit<=0)
head(book_cases, 15)
## # A tibble: 15 x 12
## orderdate shipmode segment productid region category subcategory sales
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2015-09-17 Standard ~ Consumer FUR-BO-10~ East Furnitu~ Bookcases 3083.
## 2 2015-12-27 Standard ~ Home Off~ FUR-BO-10~ Centr~ Furnitu~ Bookcases 532.
## 3 2015-01-02 Standard ~ Corporate FUR-BO-10~ East Furnitu~ Bookcases 452.
## 4 2016-11-20 Second Cl~ Home Off~ FUR-BO-10~ East Furnitu~ Bookcases 87.0
## 5 2016-04-08 Standard ~ Consumer FUR-BO-10~ East Furnitu~ Bookcases 389.
## 6 2016-09-08 Second Cl~ Consumer FUR-BO-10~ Centr~ Furnitu~ Bookcases 2396.
## 7 2017-03-31 Standard ~ Corporate FUR-BO-10~ Centr~ Furnitu~ Bookcases 205.
## 8 2017-06-10 First Cla~ Home Off~ FUR-BO-10~ West Furnitu~ Bookcases 514.
## 9 2014-12-02 First Cla~ Consumer FUR-BO-10~ East Furnitu~ Bookcases 884.
## 10 2015-10-03 Second Cl~ Consumer FUR-BO-10~ East Furnitu~ Bookcases 35.5
## 11 2017-12-28 Standard ~ Consumer FUR-BO-10~ Centr~ Furnitu~ Bookcases 78.9
## 12 2017-12-18 Second Cl~ Consumer FUR-BO-10~ West Furnitu~ Bookcases 120.
## 13 2015-12-24 Standard ~ Consumer FUR-BO-10~ West Furnitu~ Bookcases 590.
## 14 2015-03-24 First Cla~ Home Off~ FUR-BO-10~ Centr~ Furnitu~ Bookcases 359.
## 15 2017-10-07 Standard ~ Consumer FUR-BO-10~ West Furnitu~ Bookcases 308.
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## # yearsale <dbl>
Let’s find out which items the company should avoid in the Bookcases subcategory
table(book_cases$productid)
##
## FUR-BO-10000112 FUR-BO-10000330 FUR-BO-10000362 FUR-BO-10000468 FUR-BO-10000780
## 1 2 1 3 5
## FUR-BO-10001337 FUR-BO-10001519 FUR-BO-10001567 FUR-BO-10001601 FUR-BO-10001608
## 7 1 1 2 2
## FUR-BO-10001798 FUR-BO-10001811 FUR-BO-10001918 FUR-BO-10001972 FUR-BO-10002202
## 2 5 3 5 2
## FUR-BO-10002206 FUR-BO-10002213 FUR-BO-10002268 FUR-BO-10002545 FUR-BO-10002613
## 1 7 3 3 1
## FUR-BO-10002824 FUR-BO-10002853 FUR-BO-10003034 FUR-BO-10003159 FUR-BO-10003272
## 2 1 3 5 6
## FUR-BO-10003433 FUR-BO-10003441 FUR-BO-10003450 FUR-BO-10003546 FUR-BO-10003660
## 2 2 1 2 1
## FUR-BO-10003893 FUR-BO-10003894 FUR-BO-10003965 FUR-BO-10003966 FUR-BO-10004015
## 1 1 4 1 1
## FUR-BO-10004218 FUR-BO-10004357 FUR-BO-10004360 FUR-BO-10004409 FUR-BO-10004467
## 3 3 1 3 4
## FUR-BO-10004690 FUR-BO-10004695 FUR-BO-10004709 FUR-BO-10004834
## 1 1 3 2
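The counts above show how often each Bookcases product sold at a loss. As a complementary view (not part of the original write-up), the same products could also be ranked by the size of their total losses:
# Rank loss-making Bookcases products by total loss (most negative first)
book_cases %>%
  group_by(productid) %>%
  summarise(loss_orders = n(), total_loss = sum(profit)) %>%
  arrange(total_loss)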
Supplies
supply <- salesr2 %>%
filter(subcategory=="Supplies" & profit<=0)
head(supply, 15)
## # A tibble: 15 x 12
## orderdate shipmode segment productid region category subcategory sales
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2015-11-21 Second Cl~ Consumer OFF-SU-10~ East Office S~ Supplies 1.59e1
## 2 2016-12-18 Second Cl~ Corpora~ OFF-SU-10~ South Office S~ Supplies 9.61e2
## 3 2016-04-22 Standard ~ Home Of~ OFF-SU-10~ West Office S~ Supplies 1.85e2
## 4 2015-09-03 Standard ~ Consumer OFF-SU-10~ East Office S~ Supplies 6.66e2
## 5 2015-11-15 Second Cl~ Corpora~ OFF-SU-10~ Centr~ Office S~ Supplies 8.72e0
## 6 2016-08-27 Standard ~ Consumer OFF-SU-10~ Centr~ Office S~ Supplies 5.15e1
## 7 2017-09-28 Second Cl~ Consumer OFF-SU-10~ Centr~ Office S~ Supplies 1.74e0
## 8 2014-04-06 First Cla~ Home Of~ OFF-SU-10~ East Office S~ Supplies 1.03e1
## 9 2017-03-02 Standard ~ Corpora~ OFF-SU-10~ Centr~ Office S~ Supplies 6.98e0
## 10 2017-09-22 Second Cl~ Consumer OFF-SU-10~ East Office S~ Supplies 4.66e3
## 11 2016-11-10 Second Cl~ Consumer OFF-SU-10~ East Office S~ Supplies 1.47e1
## 12 2014-10-02 First Cla~ Corpora~ OFF-SU-10~ West Office S~ Supplies 1.54e1
## 13 2014-12-08 Standard ~ Consumer OFF-SU-10~ West Office S~ Supplies 1.39e1
## 14 2017-11-18 Standard ~ Consumer OFF-SU-10~ South Office S~ Supplies 2.94e0
## 15 2014-11-03 Standard ~ Consumer OFF-SU-10~ East Office S~ Supplies 2.86e2
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## # yearsale <dbl>
Let’s find out which items the company should avoid in the Supplies subcategory
table(supply$productid)
##
## OFF-SU-10000151 OFF-SU-10000157 OFF-SU-10000432 OFF-SU-10000646 OFF-SU-10001225
## 1 2 4 3 4
## OFF-SU-10001935 OFF-SU-10002189 OFF-SU-10002301 OFF-SU-10002881 OFF-SU-10003505
## 4 1 1 3 3
## OFF-SU-10003567 OFF-SU-10003936 OFF-SU-10004498
## 3 1 3
Tables
tables <- salesr2 %>%
filter(subcategory=="Tables" & profit<=0)
head(tables, 15)
## # A tibble: 15 x 12
## orderdate shipmode segment productid region category subcategory sales
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2015-10-11 Standard ~ Consumer FUR-TA-100~ South Furnitu~ Tables 958.
## 2 2014-09-20 Standard ~ Consumer FUR-TA-100~ Centr~ Furnitu~ Tables 618.
## 3 2014-08-03 First Cla~ Consumer FUR-TA-100~ West Furnitu~ Tables 219.
## 4 2017-04-07 Standard ~ Home Off~ FUR-TA-100~ South Furnitu~ Tables 234.
## 5 2017-04-07 Standard ~ Home Off~ FUR-TA-100~ South Furnitu~ Tables 621.
## 6 2016-06-04 Second Cl~ Consumer FUR-TA-100~ Centr~ Furnitu~ Tables 177.
## 7 2017-11-19 Standard ~ Consumer FUR-TA-100~ Centr~ Furnitu~ Tables 219.
## 8 2014-07-12 Standard ~ Corporate FUR-TA-100~ West Furnitu~ Tables 698.
## 9 2017-08-27 Standard ~ Corporate FUR-TA-100~ East Furnitu~ Tables 1488.
## 10 2015-11-27 Standard ~ Consumer FUR-TA-100~ South Furnitu~ Tables 375.
## 11 2017-12-08 Standard ~ Consumer FUR-TA-100~ West Furnitu~ Tables 1004.
## 12 2017-10-20 Standard ~ Corporate FUR-TA-100~ East Furnitu~ Tables 284.
## 13 2016-04-22 Standard ~ Home Off~ FUR-TA-100~ West Furnitu~ Tables 1273.
## 14 2016-03-18 Second Cl~ Consumer FUR-TA-100~ South Furnitu~ Tables 190.
## 15 2016-06-10 Standard ~ Consumer FUR-TA-100~ West Furnitu~ Tables 1336.
## # ... with 4 more variables: quantity <dbl>, discount <dbl>, profit <dbl>,
## # yearsale <dbl>
Let’s find out which items the company should avoid in the Tables subcategory
table(tables$productid)
##
## FUR-TA-10000198 FUR-TA-10000577 FUR-TA-10000617 FUR-TA-10000688 FUR-TA-10000849
## 4 5 2 3 4
## FUR-TA-10001039 FUR-TA-10001086 FUR-TA-10001095 FUR-TA-10001307 FUR-TA-10001520
## 6 2 7 2 7
## FUR-TA-10001539 FUR-TA-10001676 FUR-TA-10001705 FUR-TA-10001768 FUR-TA-10001771
## 5 9 2 4 1
## FUR-TA-10001857 FUR-TA-10001866 FUR-TA-10001889 FUR-TA-10001932 FUR-TA-10001950
## 3 1 6 4 2
## FUR-TA-10002041 FUR-TA-10002228 FUR-TA-10002356 FUR-TA-10002530 FUR-TA-10002533
## 5 3 4 4 4
## FUR-TA-10002607 FUR-TA-10002622 FUR-TA-10002645 FUR-TA-10002774 FUR-TA-10002855
## 7 3 3 3 3
## FUR-TA-10002903 FUR-TA-10002958 FUR-TA-10003008 FUR-TA-10003238 FUR-TA-10003392
## 3 6 6 5 2
## FUR-TA-10003469 FUR-TA-10003473 FUR-TA-10003569 FUR-TA-10003715 FUR-TA-10003748
## 5 5 2 3 2
## FUR-TA-10003837 FUR-TA-10003954 FUR-TA-10004086 FUR-TA-10004147 FUR-TA-10004152
## 2 3 4 1 4
## FUR-TA-10004154 FUR-TA-10004175 FUR-TA-10004256 FUR-TA-10004289 FUR-TA-10004442
## 4 2 3 3 3
## FUR-TA-10004534 FUR-TA-10004575 FUR-TA-10004607 FUR-TA-10004619 FUR-TA-10004767
## 2 4 6 2 2
## FUR-TA-10004915
## 3
Part 4
Chi-Square test: are segment and category related?
What is the total profit per region and year?
Predict future profits.
salestable <- tapply(salesr2$profit,list(salesr2$region, salesr2$yearsale), sum); salestable
## 2014 2015 2016 2017
## Central 539.5534 11716.802 19899.16 7550.844
## East 17059.6095 21091.013 20141.60 33230.561
## South 11879.1200 8318.594 17702.81 8848.908
## West 20065.6912 20492.195 24051.61 43808.956
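The question could also be asked in terms of the average profit per order line rather than the total; swapping mean for sum in the same tapply() call gives that view (a quick sketch):
# Average profit per order line, by region and year
tapply(salesr2$profit, list(salesr2$region, salesr2$yearsale), mean)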
table_profit <- salesr2%>%
select(region,yearsale,profit)%>%
group_by(region, yearsale)%>%
summarise(totprofit = round(sum(profit),2))
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
Interactive chart
linechart <- table_profit %>%
ggplot(aes(yearsale, totprofit, colour=region))+
scale_color_brewer(palette = "Set1", name = " ")+
theme_bw()+
geom_point()+
geom_line()+
ggtitle("Business Profit 2014-2017")+
xlab("Year")+
ylab("Profit ($)")
linechart <- ggplotly(linechart)
linechart
Let’s build a Linear Regression Model to predict future Profit.
lprofit <- lm(profit ~ sales, data = salesr2)
summary(lprofit)
##
## Call:
## lm(formula = profit ~ sales, data = salesr2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7397.5 2.6 14.6 21.7 5261.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.732867 2.192459 -5.808 6.53e-09 ***
## sales 0.180067 0.003301 54.555 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 205.6 on 9992 degrees of freedom
## Multiple R-squared: 0.2295, Adjusted R-squared: 0.2294
## F-statistic: 2976 on 1 and 9992 DF, p-value: < 2.2e-16
Based on our calculations, the model has the equation profit = 0.18(sales) - 12.73. The p-value is very small (< 2.2e-16), but the adjusted R-squared is only 0.23, suggesting that sales alone explain about 23% of the variation in profit.
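As a minimal sketch of how the fitted model could be used to predict profit for future orders, predict() can be applied to hypothetical sales amounts (the values below are purely illustrative):
# Predicted profit, with 95% prediction intervals, for a few hypothetical sales values
new_sales <- data.frame(sales = c(100, 500, 1000))
predict(lprofit, newdata = new_sales, interval = "prediction")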
What ifs…
Annual profit if the company had not sold Supplies, Tables, and Bookcases
benefit3 <- salesr2 %>%
group_by(yearsale)%>%
filter(subcategory != "Supplies"& subcategory != "Tables"& subcategory !="Bookcases" )%>%
summarise(totsales=sum(sales), totprofit=sum(profit), profitpercentage = round(totprofit/totsales,2))#%>%
#arrange(-totprofit)
benefit3
## # A tibble: 4 x 4
## yearsale totsales totprofit profitpercentage
## <dbl> <dbl> <dbl> <dbl>
## 1 2014 403728. 52524. 0.13
## 2 2015 390886. 67909. 0.17
## 3 2016 507819. 85233. 0.17
## 4 2017 626248. 103119. 0.16
Actual Annual Profit
benefit2 <- salesr2 %>%
group_by(yearsale)%>%
summarise(totsales=sum(sales), totprofit=sum(profit), profitpercentage = round(totprofit/totsales,2))#%>%
#arrange(-totprofit)
benefit2
## # A tibble: 4 x 4
## yearsale totsales totprofit profitpercentage
## <dbl> <dbl> <dbl> <dbl>
## 1 2014 484247. 49544. 0.1
## 2 2015 470533. 61619. 0.13
## 3 2016 609206. 81795. 0.13
## 4 2017 733215. 93439. 0.13
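To see the uplift from dropping those three sub-categories side by side, the two summaries can be joined; a small sketch assuming the benefit2 and benefit3 tibbles above:
# Compare actual profit with the hypothetical scenario (no Supplies, Tables, Bookcases)
left_join(benefit2, benefit3, by = "yearsale", suffix = c("_actual", "_without3")) %>%
  mutate(profit_gain = totprofit_without3 - totprofit_actual)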
Investigating
Is there any relationship between segment and category?
segment describes the type of customer (regular consumer, corporate, home office).
category is the category of the product that was ordered.
First, build a table of counts by segment and sub-category to get a feel for the data.
table_seg_subcategory <- salesr2 %>%
count(segment, subcategory)
table_seg_subcategory
## # A tibble: 51 x 3
## segment subcategory n
## <chr> <chr> <int>
## 1 Consumer Accessories 408
## 2 Consumer Appliances 244
## 3 Consumer Art 428
## 4 Consumer Binders 780
## 5 Consumer Bookcases 131
## 6 Consumer Chairs 329
## 7 Consumer Copiers 35
## 8 Consumer Envelopes 129
## 9 Consumer Fasteners 114
## 10 Consumer Furnishings 494
## # ... with 41 more rows
Chi-Square Test (Buying patterns)
Conduct a Chi-Square test with α = 0.05. H0: segment and category are independent
Hα: segment and category are not independent (they are related)
First let’s create our contingency table for the chi-square test.
tab_seg_sub <- table(salesr2$segment,salesr2$category); tab_seg_sub
##
## Furniture Office Supplies Technology
## Consumer 1113 3127 951
## Corporate 646 1820 554
## Home Office 362 1079 342
Let’s do a chi-square test
chisq.test(tab_seg_sub)
##
## Pearson's Chi-squared test
##
## data: tab_seg_sub
## X-squared = 1.4612, df = 4, p-value = 0.8335
P-value: 0.83 > 0.05
Conclusion: We do not reject the Null Hypothesis H0: Segment and category are independent.
The categories of items purchased do not appear to depend on the type of customer.
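As an optional check (not part of the original analysis), the expected cell counts under independence can be inspected to confirm the chi-square approximation is reasonable:
# Expected counts under H0; all should comfortably exceed 5
round(chisq.test(tab_seg_sub)$expected, 1)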
Part 5
Our findings - Recommendations to the Company
Regular consumers account for more than half of all purchases. While the company is profitable in all regions, total profits in the East and West regions are higher than in the South and Central regions. In addition, some sub-categories never sell at a loss and should be targeted: Art, Copiers, Envelopes, Labels, and Paper.
However, the company should stop selling Tables, Supplies, and Bookcases. These items are heavily discounted and are frequently sold at a loss.
Ethical Concerns
My main concern with this dataset is that it contains people’s personal data. The person who published it put no effort into masking that information. All analysts, data scientists, information technology professionals, and anyone else who handles data need to be well-versed in the basic principles of data ethics: they have a moral obligation to gather, protect, and use personally identifiable information ethically.
Because this dataset was not de-identified, every customer’s name and shipping/home ZIP code is publicly available. This could cause serious trouble for these customers: it not only reveals roughly where they live, but also exposes them to identity theft.
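As a hedged illustration of the kind of masking that could have been applied before publishing (not a substitute for a proper privacy review), direct identifiers could be dropped and locations coarsened; the column names assume the cleaned salesr object from earlier:
# Minimal de-identification sketch: remove direct identifiers, coarsen ZIP codes
salesr_deid <- salesr %>%
  select(-customername, -customerid) %>%
  mutate(postalcode = substr(as.character(postalcode), 1, 3)) # keep only a 3-digit prefix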