1 Introduction

The following report presents the final project for the Exploratory Data Analysis course in the Industrial Engineering Department at Texas Tech University.

The product information for the following analysis can be found at:

  • https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv
  • https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/7%20SKU%20Usage%20Last%20FY.csv
  • https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/8%20WH%20Replenishment%20Last%20FY.csv

2 Identify Warehouse Commodities

Data was first limited to commodities solely within the warehouse.

Using the epiDisplay package, the top 3 commodities used in the warehouse are:

  • GR00 (Groceries) = 352 items

  • BV05 (Alcohol) = 233 items

  • IC00 (Dummy) = 113 items
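
These counts were tabulated with the tab1() function from the epiDisplay package; in condensed form (the full script appears in Section 7.1):

library(epiDisplay)
dat1<-dat[dat$Flow=="WH", ]    # keep items whose flow is through the warehouse
Commodities<-as.factor(dat1$Commodity)
tab1(Commodities, sort.group = "decreasing", cum.percent = TRUE)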

## Commodities : 
##         Frequency Percent Cum. percent
## GR00          352    24.6         24.6
## BV05          233    16.3         40.9
## IC00          113     7.9         48.8
## BV30           80     5.6         54.4
## DM00           54     3.8         58.2
## PD1005         47     3.3         61.5
## BV0505         40     2.8         64.3
## BV00           39     2.7         67.0
## PD00           34     2.4         69.4
## BV15           32     2.2         71.7
## DS10           29     2.0         73.7
## GR30           28     2.0         75.6
## DS0515         26     1.8         77.5
## BV25           25     1.7         79.2
## BV10           20     1.4         80.6
## PD0505         19     1.3         81.9
## DR00           19     1.3         83.3
## DS1010         15     1.0         84.3
## DS05           14     1.0         85.3
## DS00           14     1.0         86.3
## ME15           13     0.9         87.2
## GR5005         13     0.9         88.1
## GR20           13     0.9         89.0
## ME00           12     0.8         89.9
## GR5010         12     0.8         90.7
## GR45           12     0.8         91.5
## GR10           12     0.8         92.4
## DR10           12     0.8         93.2
## GR5015         10     0.7         93.9
## GR50            9     0.6         94.5
## DS0520          8     0.6         95.1
## DR05            8     0.6         95.7
## PO05            7     0.5         96.2
## DS0525          7     0.5         96.6
## DR15            7     0.5         97.1
## BK00            6     0.4         97.6
## PO00            5     0.3         97.9
## PD1015          5     0.3         98.3
## GR55            4     0.3         98.5
## GR35            4     0.3         98.8
## PO20            3     0.2         99.0
## DS0505          3     0.2         99.2
## ME05            2     0.1         99.4
## DS1015          2     0.1         99.5
## DS0530          2     0.1         99.7
## PD0510          1     0.1         99.7
## PD05            1     0.1         99.8
## DS1025          1     0.1         99.9
## DS1005          1     0.1         99.9
## BV0510          1     0.1        100.0
##   Total      1429   100.0        100.0

3 Temperature of Commodities

Data was filtered to the above commodities, GR00, BV05, and IC00, then analyzed to determine the temperature distribution.

The analysis yielded:

  • 390 items in the ambient air (D = Dry dock)

  • 239 items in the conditioned storage area (L = Alcohol; M = Chocolate and Tea)

  • 69 items in the refrigerated area (P = Perishable)
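
The distribution was obtained by filtering to the three commodities and tabulating the warehouse temperature code; in condensed form (full script in Section 7.1):

dat2<-dat1[dat1$Commodity %in% c("GR00","BV05","IC00"), ]   # top 3 commodities
CommodityTemps<-as.factor(dat2$Whs)                         # D, L, M, P temperature codes
tab1(CommodityTemps, sort.group = "decreasing", cum.percent = TRUE)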

## CommodityTemps : 
##         Frequency Percent Cum. percent
## D             390    55.9         55.9
## L             233    33.4         89.3
## P              69     9.9         99.1
## M               6     0.9        100.0
##   Total       698   100.0        100.0

4 Storage per Temperature Area

The analysis was then extended to determine the unit of measure (UoM) of each item allocated to the 3 storage areas.

The ambient area most notably holds 370 cases that require storage. When planning for expansion, it should be noted that the 390 ambient items, 370 of which are cases, require 534.29 cubic feet of space.

  • For this analysis, only “CA” UoM is considered in the ambient storage area since it is most prominent. Management should be aware of how to plan storage of EA, SJ, and CT UoMs.

The conditioned area contains 189 items with a unit of measure of “each,” plus an additional 50 cases in the storage area. The conditioned area requires 114.62 cubic feet of storage space.

  • Management note: CA and EA will require different storage strategies

The refrigerated area consists of 69 cases in storage. These cases require 45.98 cubic feet of storage space. Management should note storage practices for a cases-only environment.
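
Each area’s UoM table and total cube follow the same pattern; the ambient area, for example (full script in Section 7.1):

dat3<-dat2[dat2$Whs=="D", ]     # ambient (dry dock) items
AmbientUoM<-as.factor(dat3$Uom)
tab1(AmbientUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat3$UomCube)               # total cubic feet of storage required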

## AmbientUoM : 
##         Frequency Percent Cum. percent
## CA            370    94.9         94.9
## EA             14     3.6         98.5
## SJ              5     1.3         99.7
## CT              1     0.3        100.0
##   Total       390   100.0        100.0
## [1] 534.29

## ConditionedUoM : 
##         Frequency Percent Cum. percent
## EA            189    79.1         79.1
## CA             50    20.9        100.0
##   Total       239   100.0        100.0
## [1] 114.62

## RefrigeratedUoM : 
##         Frequency Percent Cum. percent
## CA             69     100          100
##   Total        69     100          100
## [1] 45.98

5 Filtering

The given data from the Fresh Foods Now study is to be filtered according to the following:

  • Remove all rows with “NA” data
  • Keep only data with “UomCube” greater than 0 and less than 2
  • Keep only data with “UomWeight” with data greater than 0 and less than 50
  • Keep only “EA,” “CA,” “LB,” and “PL” UoM types
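
In condensed form, the filtering reduces to the following (the full script appears in Section 7.2):

dat<-na.omit(dat)                                     # remove all rows containing NA
dat<-dat[dat$UomCube>0 & dat$UomCube<2, ]             # keep 0 < UomCube < 2
dat<-dat[dat$UomWeight>0 & dat$UomWeight<50, ]        # keep 0 < UomWeight < 50
dat<-droplevels(dat[dat$Uom %in% c("EA","CA","LB","PL"), ])  # keep only the 4 UoM types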

The following sections contain an analysis that describes the weight and shelf life characteristics of the data set.

5.1 Reasoning for Analysis

The following weight analysis can be used to better understand the products and the overall warehouse needs in order to safely and efficiently handle, transport, and store the products.

5.1.1 Note On Abbreviations

The following abbreviations will be used in the analysis:

  • Unit of Measure (UoM)
  • Each (EA)
  • Case (CA)
  • Pound (LB)
  • Pallet (PL)

5.2 Understanding Overall Weight

5.2.1 Summary Statistics of Overall Weight
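
The statistics below were computed as follows (full script in Section 7.2):

summary(dat$UomWeight)   # five-number summary plus mean
mean(dat$UomWeight)
sd(dat$UomWeight)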

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    2.11    9.90   13.11   19.00   49.90
## [1] 13.11392
## [1] 12.09556

As can be seen from the plot, the weights of all products given range from 0.01 to 49.9 pounds. The median weight of all products is 9.9 while the mean is 13.11. The given data have a standard deviation of 12.09. The first and third quartiles are 2.11 and 19.00, respectively. The breakdown of weight for each Unit of Measure is examined next.

5.3 Understanding Weight Per UoM

5.3.1 Summary Statistics of Weight Per UoM

tapply(dat$UomWeight, dat$Uom, summary)
## $CA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    6.00   12.20   15.58   22.00   49.90 
## 
## $EA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.010   1.580   2.500   9.677  18.000  47.000 
## 
## $LB
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    1.00    6.60    9.93   13.60   44.50 
## 
## $PL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   15.40   15.40   21.19   26.95   48.00
tapply(dat$UomWeight, dat$Uom, sd)
##       CA       EA       LB       PL 
## 11.98067 11.42983 11.27680 11.65432
allsd<-tapply(dat$UomWeight, dat$Uom, sd)
mean(allsd)
## [1] 11.58541
tapply(dat$UomWeight, dat$Uom, mean)
##        CA        EA        LB        PL 
## 15.576191  9.677209  9.930140 21.190000

As can be seen from the plots and summary statistics, “EA” has the lowest median weight, followed by “LB,” “CA,” and “PL,” respectively. Given the individual plots, it can be seen that “PL” has the largest impact on the upper end of the overall product weights. It can also be noted that the range of the product weights decreases as you move down the plots. From the summary statistics, it can be seen that the means of both “EA” and “LB” are very close to their medians, whereas “CA” and “PL” have means significantly higher than the overall median. The standard deviation of each UoM is similar, with the average standard deviation across UoMs being 11.58.

5.4 Understanding Weight and Units Per Case
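
The clusters were identified with k-means, as in the next subsection (full script in Section 7.2):

df<-data.frame(dat$UomWeight,dat$UnitsPerCase)
km.out<-kmeans(df,3)           # partition the points into 3 clusters
plot(df,col=km.out$cluster,
     main="Clusters of Units Per Case by Weight",
     xlab="Weight", ylab="Units Per Case", yaxp=c(0, 10000, 10))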

From the plot, three distinct clusters can be identified within the data. The first cluster is the densest of the three; it varies in weight but stays tightly concentrated at 1 unit per case. The second cluster also varies in weight but is distributed between 1,000 and 3,000 units per case. The final cluster likewise varies in weight and is distributed between 3,000 and 10,000 units per case.

5.5 Understanding Lead Time Per Weight

df2<-data.frame(dat$LeadTime,dat$UomWeight)
#df2
km.out<-kmeans(df2,3)
#km.out$cluster
plot(df2,col=km.out$cluster,
     main="Clusters of Lead Time by Weight",
     xlab="Lead Time", ylab="")

Again, we can see three distinct clusters of data when comparing lead time and weight. From the data, it can be concluded that heavier items do not necessarily produce longer lead times and do not need special treatment aimed at reducing lead time.

5.6 Understanding Shelf Life Per UoM

5.6.1 Summary Statistics of Shelf Life Per UoM

tapply(dat$ShelfLifeDays, dat$Uom, summary)
## $CA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    90.0   240.0   286.9   365.0   999.0 
## 
## $EA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   270.0   365.0   520.8   999.0   999.0 
## 
## $LB
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    47.5   137.7   180.0   900.0 
## 
## $PL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     270     360     360     353     365     365
tapply(dat$ShelfLifeDays, dat$Uom, sd)
##        CA        EA        LB        PL 
## 251.31088 366.12245 192.11068  29.26887
allsd2<- tapply(dat$ShelfLifeDays, dat$Uom, sd)
mean(allsd2)
## [1] 209.7032

As can be seen from the summary statistics and the plots, “EA” and “PL” have the longest shelf life. “EA” also has the largest range of shelf life. All Units of Measure have a relatively large standard deviation, with the exception of “PL.”

6 Density and Inventory Level Analysis

This section contains an analysis based on the density of products, which can indicate an error in the recorded volume, weight, or both for items recorded in inventory.

The levels of inventory received and delivered are then compared to identify any conflicting inventory levels.

6.1 Density Analysis

An analysis is done to find whether there are any impossible products based on UomCube (volume in cubic feet) and UomWeight (weight in pounds) by calculating density and comparing it to the densities of water and air. Anything denser than 1.2 times the density of water is considered impossible, since the products are food; similarly, any item less dense than air must be an error.

Every point outside of the vertical lines is considered an error. There are 824 impossible values, which indicates that UomCube, UomWeight, or both are recorded incorrectly.
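
As a cross-check, the flagged rows can also be counted directly from the density cutoffs. This sketch assumes 75 lbs/ft^3 as the upper cutoff (roughly 1.2 times the 62.3 lbs/ft^3 density of water) and 0.0765 lbs/ft^3, the approximate density of air at room temperature, as the lower cutoff:

density<-dat1$UomWeight/dat1$UomCube            # lbs per cubic foot
sum(density>75 | density<0.0765, na.rm=TRUE)    # count of impossible densities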

6.2 Product Issued versus Received

The levels of inventory received into the warehouse and inventory issued are compared.

Conflicting levels were evaluated by subtracting the quantity received from the quantity issued.

An interesting result of this analysis is that the mean of the conflicting inventory levels is -2.74. This means that, in general, a lot of product sits in the warehouse with nowhere to go, which could indicate either that too much is ordered or an accumulation of errors.

The spread of the data is concentrated close to zero, which is expected.
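
In condensed form, the comparison is (full script in Section 7.3):

conflicting<-dat4$Issued.Quantity-dat4$Received.Qty   # issued minus received
fivenum(conflicting)                                  # five-number summary
mean(conflicting)                                     # negative: more received than issued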

7 Complete R Codes

7.1 Complete R Code “Commodities Analysis”

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message = FALSE)
master<-read.csv("https://raw.githubusercontent.com/moransom/EDA-Repository/main/6%20SKU%20Master.csv")
dat<-na.omit(master)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(epiDisplay)
dat1<-dat[dat$Flow=="WH", ] 
dat1$commodity<-as.factor(dat1$Commodity)
Commodities<-dat1$commodity
tab1(Commodities, sort.group = "decreasing", cum.percent = TRUE)
dat2<-dat1[dat1$Commodity=="GR00" | dat1$Commodity=="BV05" | dat1$Commodity=="IC00", ]
dat2$Whs<-as.factor(dat2$Whs)
CommodityTemps<-dat2$Whs
tab1(CommodityTemps, sort.group = "decreasing", cum.percent = TRUE)
dat3<-dat2[dat2$Whs=="D", ]
dat3$Uom<-as.factor(dat3$Uom)
AmbientUoM<-dat3$Uom
tab1(AmbientUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat3$UomCube) # total cubic feet for the ambient area
dat4<-dat2[dat2$Whs=="L"|dat2$Whs=="M", ]
dat4$Uom<-as.factor(dat4$Uom)
ConditionedUoM<-dat4$Uom
tab1(ConditionedUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat4$UomCube) # total cubic feet for the conditioned area
dat5<-dat2[dat2$Whs=="P", ]
dat5$Uom<-as.factor(dat5$Uom)
RefrigeratedUoM<-dat5$Uom
tab1(RefrigeratedUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat5$UomCube) # total cubic feet for the refrigerated area

7.2 Complete R Code “Weight and Shelf Life” Analysis

# Using SkuMaster Data Set

dat<-read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv")
head(dat) #preview the data

#Remove rows including "NA"
dat<-na.omit(dat)
#Keep only rows where UomCube is greater than 0 and less than 2
dat<-dat[dat$UomCube>0&dat$UomCube<2,]
#Keep only rows where UomWeight is greater than 0 and less than 50
dat<-dat[dat$UomWeight>0&dat$UomWeight<50,]
colnames(dat)[colnames(dat)=="ï..SkuNbr"]<-"SkuNbr" #rename first column (fixes the byte-order-mark artifact)
#view datatypes in columns
str(dat)
#assign data type to UOM
dat$Uom<-as.factor(dat$Uom)
#view 23 levels of UOM
levels(dat$Uom)
#filter keeping only rows in which UOM includes values in vector c("CA","LB","PL","EA")
dat<-dat[dat$Uom%in%c("CA","EA","LB","PL"),]
#drop unused levels
dat<-droplevels(dat)
#view remaining levels
levels(dat$Uom)


# Weight
boxplot(dat$UomWeight,
        main="WEIGHT",
        sub="Weight Regardless of Unit of Measure",
        font.main=2,
        col="gray30",
        border="black",
        horizontal=TRUE,
        ylim=c(0,50),
        staplewex=1,
        pch=21,
        bg="red",
        notch=TRUE)
text(x=fivenum(dat$UomWeight),labels=fivenum(dat$UomWeight),y=1.25,cex=.45)

summary(dat$UomWeight)
mean(dat$UomWeight)
sd(dat$UomWeight)
# As can be seen from the plot, the weight of all products given ranges from 0.01 to 49.9. 
# The median weight of all products is 9.9 while the mean is 13.11. The given data
# has a standard deviation of 12.09. The first and third quartiles are 2.11 and 19.00, 
# respectively. The breakdown of weight for each Unit of Measure is examined next. 

# Weight Per Unit of Measure
plot (dat$UomWeight~dat$Uom,
      main="Weight Per Unit of Measure",
      font.main=2,
      col="gray30",
      border="black",
      horizontal=TRUE,
      ylab="Weight", xlab="Unit of Measure",
      pch=21, bg="red", notch=TRUE)
tapply(dat$UomWeight, dat$Uom, summary)
tapply(dat$UomWeight, dat$Uom, sd)
allsd<-tapply(dat$UomWeight, dat$Uom, sd)
mean(allsd)
tapply(dat$UomWeight, dat$Uom, mean)
# As can be seen from the plots and summary statistics, "EA" has the lowest median weight, followed
# by "LB," "CA," and "PL," respectively. Given the individual plots, it can be seen that 
# "PL" has the largest impact on the larger end of the overall weight of products. 
# It can also be noted that the range of the product weights decreases as you move down the plots.
# From the summary statistics, it can be seen that the mean of both "EA" and "LB" are very close to the
# median; whereas "CA" and "PL" have a mean significantly higher than the overall median. 
# The standard deviation of each Unit of Measure is similar with the average standard deviation 
# of each Unit of Measure being 11.58. 

## Clustering of Weight and Units Per Case
df<-data.frame(dat$UomWeight,dat$UnitsPerCase)
head(df) #preview the clustering data
km.out<-kmeans(df,3)
#km.out$cluster
plot(df,col=km.out$cluster,
     main="Clusters of Units Per Case by Weight",
     xlab="Weight", ylab="Units Per Case", yaxp= c(0, 10000, 10))

# From the plot, three distinct clusters can be identified within the data. The first cluster is the most 
# dense of the three clusters and varies in weight, 
# but stays tightly concentrated at 1 unit per case. The second cluster also varies in weight, but is distributed
# between 1,000 and 3,000 units per case. The final cluster also varies in weight, but is distributed between 
# 3,000 and 10,000 units per case. 


## Cluster of Lead Times by Weight
df2<-data.frame(dat$LeadTime,dat$UomWeight)
#df2
km.out<-kmeans(df2,3)
#km.out$cluster
plot(df2,col=km.out$cluster,
     main="Clusters of Lead Time by Weight",
     xlab="Lead Time", ylab="")
# Again, we can see three distinct clusters of data when comparing Lead Time and Weight. 
# From the data, it can be concluded that heavier items do not necessarily produce larger lead times.

# Shelf Life per UoM
plot(dat$ShelfLifeDays~dat$Uom,
     main="Shelf Life per Unit of Measure",
     font.main=2,
     col="gray30",
     border="black",
     horizontal=TRUE,
     ylab="Shelf Life Days", xlab="Unit of Measure",
     pch=21, bg="red", notch=TRUE)
tapply(dat$ShelfLifeDays, dat$Uom, summary)
tapply(dat$ShelfLifeDays, dat$Uom, sd)
allsd2<- tapply(dat$ShelfLifeDays, dat$Uom, sd)
mean(allsd2)
# As can be seen from the summary statistics and the plots, "EA" and "PL" have the 
# longest shelf life. "EA" also has the largest range of shelf life. All Units of Measure have a
# relatively large standard deviation, with the exception of "PL."

7.3 Complete R Code “Density and Inventory Level” Analysis

# Import data files
dat1<-read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv")
dat2<-read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/7%20SKU%20Usage%20Last%20FY.csv")
dat3 <- read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/8%20WH%20Replenishment%20Last%20FY.csv")
#deleting NA from dat 1
dat1<-na.omit(dat1)
head(dat1)
colnames(dat1)[colnames(dat1)=="ï..SkuNbr"]<-"SkuNbr" #Rename SkuNbr column 
# I will find if there are any impossible products based on UomCube and UomWeight by calculating
#density and comparing to density of water. Anything heavier than 1.2 times the density of water 
#should be impossible since we deal with food.
#Similarly any items less dense than air should be errors

#Creating a dataframe with only UomCube and UomWeight
densedat<- dat1[,c("UomCube","UomWeight")] #select the two columns by name rather than position
density<- dat1$UomWeight/dat1$UomCube
densedat<- data.frame(densedat,density)

densedat<-na.omit(densedat)# deleting NA's
head(densedat)
# density (weight divided by volume) is in pounds per cubic foot
# density of water is 62.3 lbs/ft^3
# any density value greater than 75 lbs/ft^3 (about 1.2 times water) indicates an error in either UomCube or UomWeight
# reindex frame
densedat <-densedat[order(densedat$density),]
head(densedat)
plot(densedat$density,
     ylab = "Density",
     ylim = c(0,9000),
     xlab = "SKU Index",
     pch = 18,
     main = "Density Plot")

abline(v=5186) # vertical line: points to the right are denser than the upper cutoff
abline(v=60)   # vertical line: points to the left are less dense than air
# Every point to the outside of the lines is an impossible value
#There are 824 incorrect observations


#Analysis 2: product issued versus received

# Will compare product issued and product received to analyze discrepancies
# Will subtract product received from product issued 
# A five-number summary is appropriate for this kind of analysis
# the mean of the set should be 0
# a boxplot will tell a story of the data


# First merge dat3 and dat2 by SkuNbr
dat4 <- merge(dat3,dat2,"SkuNbr")
dat4<-na.omit(dat4) #clearing NA's
#have to set issued quantity to a numeric variable
dat4$Issued.Quantity<-as.numeric(gsub(",","",dat4$Issued.Quantity)) # sets issued quantity as numerical
dat4$Received.Qty<-as.numeric(gsub(",","",dat4$Received.Qty)) # sets received as numerical
#difference between the two
conflicting <- dat4$Issued.Quantity-dat4$Received.Qty

dat4 <- data.frame(dat4,conflicting)
# to reduce error, conflicts of over 500 in either direction are not considered
dat4 <-dat4[dat4$conflicting<=500, ]
dat4 <-dat4[dat4$conflicting>=-500, ] 
dat4<-na.omit(dat4)
boxplot(dat4$conflicting,
        main= "Conflicting Inventory Levels",
        col= "darkgoldenrod4")
fivenum(dat4$conflicting)
mean(dat4$conflicting)
# The median is zero as expected but the average of the data set is -2.7423
#Every non-zero value indicates a mismatch between issued and received 
# A positive value means that more product was issued than received 
# A negative value means that more product was delivered than issued.
# The negative mean indicates that in general more product is delivered than issued.