The following information is for the final project for the Exploratory Data Analysis course in the Industrial Engineering Department of Texas Tech University.
The product information for the following analysis can be found at:
https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv
https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/7%20SKU%20Usage%20Last%20FY.csv
https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/8%20WH%20Replenishment%20Last%20FY.csv
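The three files can be read directly from the URLs above; a minimal loading sketch (the object names master, usage, and replen are illustrative; the appendix uses its own names):
master <- read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv") # SKU Master
usage <- read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/7%20SKU%20Usage%20Last%20FY.csv") # SKU usage, last FY
replen <- read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/8%20WH%20Replenishment%20Last%20FY.csv") # WH replenishment, last FY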
The data was first limited to commodities handled solely within the warehouse (Flow = "WH").
Using the epiDisplay package, the top three commodities in the warehouse were identified:
GR00 (Groceries) = 352 items
BV05 (Alcohol) = 233 items
IC00 (Dummy) = 113 items
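A sketch of how this tabulation was produced (mirroring the appendix code; master is the SKU Master file loaded above):
library(epiDisplay)
dat1 <- na.omit(master) # drop rows with missing values
dat1 <- dat1[dat1$Flow == "WH", ] # keep warehouse items only
Commodities <- as.factor(dat1$Commodity)
tab1(Commodities, sort.group = "decreasing", cum.percent = TRUE)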
## Commodities :
## Frequency Percent Cum. percent
## GR00 352 24.6 24.6
## BV05 233 16.3 40.9
## IC00 113 7.9 48.8
## BV30 80 5.6 54.4
## DM00 54 3.8 58.2
## PD1005 47 3.3 61.5
## BV0505 40 2.8 64.3
## BV00 39 2.7 67.0
## PD00 34 2.4 69.4
## BV15 32 2.2 71.7
## DS10 29 2.0 73.7
## GR30 28 2.0 75.6
## DS0515 26 1.8 77.5
## BV25 25 1.7 79.2
## BV10 20 1.4 80.6
## PD0505 19 1.3 81.9
## DR00 19 1.3 83.3
## DS1010 15 1.0 84.3
## DS05 14 1.0 85.3
## DS00 14 1.0 86.3
## ME15 13 0.9 87.2
## GR5005 13 0.9 88.1
## GR20 13 0.9 89.0
## ME00 12 0.8 89.9
## GR5010 12 0.8 90.7
## GR45 12 0.8 91.5
## GR10 12 0.8 92.4
## DR10 12 0.8 93.2
## GR5015 10 0.7 93.9
## GR50 9 0.6 94.5
## DS0520 8 0.6 95.1
## DR05 8 0.6 95.7
## PO05 7 0.5 96.2
## DS0525 7 0.5 96.6
## DR15 7 0.5 97.1
## BK00 6 0.4 97.6
## PO00 5 0.3 97.9
## PD1015 5 0.3 98.3
## GR55 4 0.3 98.5
## GR35 4 0.3 98.8
## PO20 3 0.2 99.0
## DS0505 3 0.2 99.2
## ME05 2 0.1 99.4
## DS1015 2 0.1 99.5
## DS0530 2 0.1 99.7
## PD0510 1 0.1 99.7
## PD05 1 0.1 99.8
## DS1025 1 0.1 99.9
## DS1005 1 0.1 99.9
## BV0510 1 0.1 100.0
## Total 1429 100.0 100.0
The data was then filtered to the top three commodities (GR00, BV05, and IC00) and analyzed to determine how the items are distributed across storage temperature zones.
The analysis yielded:
390 items in the ambient area (D = Dry dock)
239 items in the conditioned storage area (L = Alcohol; M = Chocolate and Tea)
69 items in the refrigerated area (P = Perishable)
## CommodityTemps :
## Frequency Percent Cum. percent
## D 390 55.9 55.9
## L 233 33.4 89.3
## P 69 9.9 99.1
## M 6 0.9 100.0
## Total 698 100.0 100.0
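The zone breakdown above was produced by filtering to the three commodities and tabulating the warehouse temperature code (a sketch mirroring the appendix code):
dat2 <- dat1[dat1$Commodity %in% c("GR00", "BV05", "IC00"), ] # top three commodities
CommodityTemps <- as.factor(dat2$Whs) # warehouse temperature zone code
tab1(CommodityTemps, sort.group = "decreasing", cum.percent = TRUE)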
The analysis was then extended to determine the unit of measure (UoM) of each item allocated to the three storage areas.
The ambient area most notably holds 370 cases. When planning for expansion, it should be noted that the ambient area requires 534.29 cubic feet of space.
The conditioned area contains 189 items with a UoM of "each" plus an additional 50 cases; the conditioned area requires 114.62 cubic feet of storage space.
The refrigerated area consists of 69 cases, which require 45.98 cubic feet of storage space. Management should note storage practices for a cases-only environment.
## AmbientUoM :
## Frequency Percent Cum. percent
## CA 370 94.9 94.9
## EA 14 3.6 98.5
## SJ 5 1.3 99.7
## CT 1 0.3 100.0
## Total 390 100.0 100.0
## [1] 534.29
## ConditionedUoM :
## Frequency Percent Cum. percent
## EA 189 79.1 79.1
## CA 50 20.9 100.0
## Total 239 100.0 100.0
## [1] 114.62
## RefrigeratedUoM :
## Frequency Percent Cum. percent
## CA 69 100 100
## Total 69 100 100
## [1] 45.98
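Each zone's table and cube total follow the same pattern: subset by zone code, tabulate the UoM, and sum UomCube. A sketch for the ambient zone (the conditioned zone uses Whs codes L and M, the refrigerated zone uses P):
dat3 <- dat2[dat2$Whs == "D", ] # ambient (dry) zone
AmbientUoM <- as.factor(dat3$Uom)
tab1(AmbientUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat3$UomCube) # total cubic feet required by the zone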
The following sections contain an analysis that describes the weight and shelf life characteristics of the data set. These findings, especially regarding weight, can be used to better understand the products and the overall warehouse needs in order to safely and efficiently handle, transport, and store the products.
The data from the Fresh Foods Now study was filtered as follows: rows with missing values were removed, volumes (UomCube) were restricted to between 0 and 2 cubic feet, weights (UomWeight) to between 0 and 50 pounds, and units of measure to the four listed below.
The following abbreviations will be used in the analysis:
Unit of Measure (UoM)
Each (EA)
Case (CA)
Pound (LB)
Pallet (PL)
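A sketch of the filter and the overall weight summary (mirroring the appendix code):
dat <- na.omit(read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv"))
dat <- dat[dat$UomCube > 0 & dat$UomCube < 2, ] # keep volumes between 0 and 2 cubic feet
dat <- dat[dat$UomWeight > 0 & dat$UomWeight < 50, ] # keep weights between 0 and 50 lbs
dat$Uom <- factor(dat$Uom) # treat UoM as categorical
dat <- droplevels(dat[dat$Uom %in% c("CA", "EA", "LB", "PL"), ]) # keep the four UoMs of interest
summary(dat$UomWeight) # five-number summary plus mean
mean(dat$UomWeight)
sd(dat$UomWeight)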
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 2.11 9.90 13.11 19.00 49.90
## [1] 13.11392
## [1] 12.09556
As can be seen from the plot, the weights of the products range from 0.01 to 49.9 lbs. The median weight is 9.9 lbs and the mean is 13.11 lbs, with a standard deviation of 12.09 lbs. The first and third quartiles are 2.11 lbs and 19.00 lbs, respectively. The breakdown of weight by UoM is examined next.
tapply(dat$UomWeight, dat$Uom, summary)
## $CA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 6.00 12.20 15.58 22.00 49.90
##
## $EA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 1.580 2.500 9.677 18.000 47.000
##
## $LB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 1.00 6.60 9.93 13.60 44.50
##
## $PL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 15.40 15.40 21.19 26.95 48.00
tapply(dat$UomWeight, dat$Uom, sd)
## CA EA LB PL
## 11.98067 11.42983 11.27680 11.65432
allsd<-tapply(dat$UomWeight, dat$Uom, sd)
mean(allsd)
## [1] 11.58541
tapply(dat$UomWeight, dat$Uom, mean)
## CA EA LB PL
## 15.576191 9.677209 9.930140 21.190000
As can be seen from the plots and summary statistics, "EA" has the lowest median weight, followed by "LB," "CA," and "PL," respectively. The individual plots show that "PL" contributes most at the heavy end of the overall weight distribution, and that the range of product weights narrows moving down the plots. The summary statistics show that the means of "EA" and "LB" are very close to the overall median (9.9 lbs), whereas "CA" and "PL" have means significantly higher than the overall median. The standard deviations of the UoMs are similar, with an average standard deviation of 11.58 lbs.
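The per-UoM plots referenced in this discussion are side-by-side boxplots (a sketch mirroring the appendix code):
plot(dat$UomWeight ~ dat$Uom,
     main = "Weight Per Unit of Measure",
     xlab = "Unit of Measure", ylab = "Weight")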
From the cluster plot, three distinct clusters can be identified in the data. The first cluster is the densest of the three; it varies in weight but remains close to 1 unit per case. The second cluster also varies in weight and is distributed between 1,000 and 3,000 units per case. The final cluster varies in weight and is distributed between 3,000 and 10,000 units per case.
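The clusters were found with k-means (k = 3) on weight and units per case; a sketch mirroring the appendix code. Note that because UnitsPerCase spans a far larger range than weight, the unscaled clustering is driven mainly by units per case:
df <- data.frame(dat$UomWeight, dat$UnitsPerCase)
km.out <- kmeans(df, centers = 3) # k-means with 3 clusters
plot(df, col = km.out$cluster,
     main = "Clusters of Units Per Case by Weight",
     xlab = "Weight", ylab = "Units Per Case")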
df2<-data.frame(dat$LeadTime,dat$UomWeight)
#df2
km.out<-kmeans(df2,3)
#km.out$cluster
plot(df2,col=km.out$cluster,
main="Clusters of Lead Time by Weight",
xlab="Lead Time", ylab="Weight")
Again, we can see three distinct clusters when comparing lead time and weight. From the data, it can be concluded that heavier items do not necessarily have longer lead times and therefore do not require special treatment to reduce lead time.
tapply(dat$ShelfLifeDays, dat$Uom, summary)
## $CA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 90.0 240.0 286.9 365.0 999.0
##
## $EA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 270.0 365.0 520.8 999.0 999.0
##
## $LB
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 47.5 137.7 180.0 900.0
##
## $PL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 270 360 360 353 365 365
tapply(dat$ShelfLifeDays, dat$Uom, sd)
## CA EA LB PL
## 251.31088 366.12245 192.11068 29.26887
allsd2<- tapply(dat$ShelfLifeDays, dat$Uom, sd)
mean(allsd2)
## [1] 209.7032
As can be seen from the summary statistics and the plots, “EA” and “PL” have the longest shelf life. “EA” also has the largest range of shelf life. All Units of Measure have a relatively large standard deviation, with the exception of “PL.”
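The shelf-life plots are boxplots per UoM (a sketch mirroring the appendix code):
plot(dat$ShelfLifeDays ~ dat$Uom,
     main = "Shelf Life per Unit of Measure",
     xlab = "Unit of Measure", ylab = "Shelf Life Days")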
This section contains an analysis of product density, which can indicate an error in the recorded volume, the recorded weight, or both.
The levels of inventory received and delivered are also compared to identify any conflicting inventory levels.
The analysis identifies impossible products based on UomCube (volume in cubic feet) and UomWeight (weight in pounds) by computing density and comparing it to the densities of water and air. Anything denser than 1.2 times water is considered impossible, since only food items are involved; similarly, any item less dense than air is considered an error.
Every point outside of the vertical lines is considered an error. There are 824 impossible values, indicating that UomCube, UomWeight, or both were recorded incorrectly.
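The same check can be expressed directly as density thresholds rather than positions on the sorted plot (a sketch; the 0.08 lbs/ft^3 figure for air is an assumed round value):
density <- dat1$UomWeight / dat1$UomCube # lbs per cubic foot
upper <- 1.2 * 62.3 # 1.2 x density of water (62.3 lbs/ft^3), about 74.8
lower <- 0.08 # approximate density of air (assumed)
sum(density > upper | density < lower, na.rm = TRUE) # count of impossible densities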
The levels of inventory received into the warehouse and inventory issued are compared.
Conflicting levels were evaluated by subtracting the quantity received from the quantity issued for each SKU.
The analysis yields a mean conflict of -2.74. On average, more product is received than issued, so product sits in the warehouse without destination plans, which could indicate either an ordering surplus or an accumulation of recording errors.
The spread of the data is concentrated near zero, which is expected.
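A sketch of the comparison, assuming usage and replen are the files loaded in the first sketch (mirrors the appendix code):
inv <- merge(replen, usage, "SkuNbr") # join on SKU number
inv$Issued.Quantity <- as.numeric(gsub(",", "", inv$Issued.Quantity)) # strip thousands separators
inv$Received.Qty <- as.numeric(gsub(",", "", inv$Received.Qty))
conflicting <- inv$Issued.Quantity - inv$Received.Qty # issued minus received
conflicting <- conflicting[abs(conflicting) <= 500 & !is.na(conflicting)] # trim extreme conflicts
fivenum(conflicting)
mean(conflicting)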
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message = FALSE)
master<-read.csv("https://raw.githubusercontent.com/moransom/EDA-Repository/main/6%20SKU%20Master.csv")
dat<-na.omit(master)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(epiDisplay)
dat1<-dat[dat$Flow=="WH", ]
dat1$commodity<-as.factor(dat1$Commodity)
Commodities<-dat1$commodity
tab1(Commodities, sort.group = "decreasing", cum.percent = TRUE)
dat2<-dat1[dat1$Commodity=="GR00" | dat1$Commodity=="BV05" | dat1$Commodity=="IC00", ]
dat2$Whs<-as.factor(dat2$Whs)
CommodityTemps<-dat2$Whs
tab1(CommodityTemps, sort.group = "decreasing", cum.percent = TRUE)
dat3<-dat2[dat2$Whs=="D", ]
dat3$Uom<-as.factor(dat3$Uom)
AmbientUoM<-dat3$Uom
tab1(AmbientUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat3$UomCube)
dat4<-dat2[dat2$Whs=="L"|dat2$Whs=="M", ]
dat4$Uom<-as.factor(dat4$Uom)
ConditionedUoM<-dat4$Uom
tab1(ConditionedUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat4$UomCube)
dat5<-dat2[dat2$Whs=="P", ]
dat5$Uom<-as.factor(dat5$Uom)
RefrigeratedUoM<-dat5$Uom
tab1(RefrigeratedUoM, sort.group = "decreasing", cum.percent = TRUE)
sum(dat5$UomCube)
# Using SkuMaster Data Set
dat<-read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv")
head(dat) # preview the data rather than printing the full data set
#Remove rows including "NA"
dat<-na.omit(dat)
#Keep rows where UomCube is between 0 and 2 cubic feet
dat<-dat[dat$UomCube>0&dat$UomCube<2,]
#Keep rows where UomWeight is between 0 and 50 pounds
dat<-dat[dat$UomWeight>0&dat$UomWeight<50,]
colnames(dat)[colnames(dat)=="ï..SkuNbr"]<-"SkuNbr" #Rename the BOM-mangled SkuNbr column
#view datatypes in columns
str(dat)
#assign data type to UOM
dat$Uom<-as.factor(dat$Uom)
#view 23 levels of UOM
levels(dat$Uom)
#filter keeping only rows in which UOM includes values in vector c("CA","LB","PL","EA")
dat<-dat[dat$Uom%in%c("CA","EA","LB","PL"),]
#drop unused levels
dat<-droplevels(dat)
#view remaining levels
levels(dat$Uom)
# Weight
boxplot(dat$UomWeight,
main="WEIGHT",
sub="Weight Regardless of Unit of Measure",
font.main=2,
col="gray30",
border="black",
horizontal=TRUE,
ylim=c(0,50),
staplewex=1,
pch=21,
bg="red",
notch=TRUE)
text(x=fivenum(dat$UomWeight),labels=fivenum(dat$UomWeight),y=1.25,cex=.45)
summary(dat$UomWeight)
mean(dat$UomWeight)
sd(dat$UomWeight)
# As can be seen from the plot, the weights of the products range from 0.01 to 49.9.
# The median weight is 9.9 and the mean is 13.11, with a standard deviation of 12.09.
# The first and third quartiles are 2.11 and 19.00, respectively.
# The breakdown of weight for each Unit of Measure is examined next.
# Weight Per Unit of Measure
plot(dat$UomWeight~dat$Uom,
main="Weight Per Unit of Measure",
font.main=2,
col="gray30",
border="black",
horizontal=TRUE,
ylab="Weight", xlab="Unit of Measure",
pch=21, bg="red", notch=TRUE)
tapply(dat$UomWeight, dat$Uom, summary)
tapply(dat$UomWeight, dat$Uom, sd)
allsd<-tapply(dat$UomWeight, dat$Uom, sd)
mean(allsd)
tapply(dat$UomWeight, dat$Uom, mean)
# As can be seen from the plots and summary statistics, "EA" has the lowest median weight, followed
# by "LB," "CA," and "PL," respectively. Given the individual plots, it can be seen that
# "PL" has the largest impact on the larger end of the overall weight of products.
# It can also be noted that the range of the product weights decreases as you move down the plots.
# From the summary statistics, it can be seen that the means of "EA" and "LB" are very close to the
# overall median; whereas "CA" and "PL" have means significantly higher than the overall median.
# The standard deviation of each Unit of Measure is similar, with the average standard deviation
# of each Unit of Measure being 11.58.
## Clustering of Weight and Units Per Case
df<-data.frame(dat$UomWeight,dat$UnitsPerCase)
#df
km.out<-kmeans(df,3)
#km.out$cluster
plot(df,col=km.out$cluster,
main="Clusters of Units Per Case by Weight",
xlab="Weight", ylab="Units Per Case", yaxp= c(0, 10000, 10))
# From the cluster plot, three distinct clusters can be identified in the data. The first cluster is the
# densest of the three; it varies in weight but remains close to 1 unit per case. The second cluster also
# varies in weight and is distributed between 1,000 and 3,000 units per case. The final cluster varies in
# weight and is distributed between 3,000 and 10,000 units per case.
## Cluster of Lead Times by Weight
df2<-data.frame(dat$LeadTime,dat$UomWeight)
#df2
km.out<-kmeans(df2,3)
#km.out$cluster
plot(df2,col=km.out$cluster,
main="Clusters of Lead Time by Weight",
xlab="Lead Time", ylab="Weight")
# Again, we can see three distinct clusters of data when comparing Lead Time and Weight.
# From the data, it can be concluded that heavier items do not necessarily produce larger lead times.
# Shelf Life per UoM
plot(dat$ShelfLifeDays~dat$Uom,
main="Shelf Life per Unit of Measure",
font.main=2,
col="gray30",
border="black",
horizontal=TRUE,
ylab="Shelf Life Days", xlab="Unit of Measure",
pch=21, bg="red", notch=TRUE)
tapply(dat$ShelfLifeDays, dat$Uom, summary)
tapply(dat$ShelfLifeDays, dat$Uom, sd)
allsd2<- tapply(dat$ShelfLifeDays, dat$Uom, sd)
mean(allsd2)
# As can be seen from the summary statistics and the plots, "EA" and "PL" have the
# longest shelf life. "EA" also has the largest range of shelf life. All Units of Measure have a
# relatively large standard deviation, with the exception of "PL."
# Import data files
dat1<-read.csv("https://raw.githubusercontent.com/Garcia-Ry/IE-4331-EDA/main/SKU%20Master.csv")
dat2<-read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/7%20SKU%20Usage%20Last%20FY.csv")
dat3 <- read.csv("https://raw.githubusercontent.com/luaraiza/IE-Exploratory-Data-Analysis/main/8%20WH%20Replenishment%20Last%20FY.csv")
#deleting NA rows from dat1
dat1<-na.omit(dat1)
head(dat1)
colnames(dat1)[colnames(dat1)=="ï..SkuNbr"]<-"SkuNbr" #Rename SkuNbr column
# I will find if there are any impossible products based on UomCube and UomWeight by calculating
#density and comparing to the density of water. Anything heavier than 1.2 times the density of water
#should be impossible since we deal with food.
#Similarly any items less dense than air should be errors
#Creating a dataframe with only UomCube and UomWeight
densedat<- dat1[,c("UomCube","UomWeight")]
#finding density of products by dividing weight by volume
density<- dat1$UomWeight/dat1$UomCube
densedat<- data.frame(densedat,density)
densedat<-na.omit(densedat) # deleting NA's
head(densedat)
# density is in pounds per cubic foot
#density of water is 62.3 lbs/ft^3
#any density value greater than 75 lbs/ft^3 indicates an error in either UomCube or UomWeight
# reindex frame
densedat <-densedat[order(densedat$density),]
head(densedat)
plot(densedat$density,
ylab = "Denseity",
ylim= c(0,9000),
xlab = "SKU Index",
pch = 18,
main = "Density Plot"
)
abline(v=5186) # vertical cut-off near where the sorted densities cross the upper bound
abline(v=60) # vertical cut-off near where the sorted densities cross the lower bound
# Every point to the outside of the lines is an impossible value
#There are 824 incorrect observations
#Analysis 2: conflicting inventory levels
# Compare product issued and product received to analyze discrepancies
# Subtract product received from product issued
# A five-number summary is appropriate for this kind of analysis
# the mean of the set should be 0
# a boxplot will tell the story of the data
# First merge dat3 and dat2 by SkuNbr
dat4 <- merge(dat3,dat2,"SkuNbr")
dat4<-na.omit(dat4) #clearing NA's
#have to set issued quantity to numeric var
dat4$Issued.Quantity<-as.numeric(gsub(",","",dat4$Issued.Quantity)) # sets issued quantity as numerical
dat4$Received.Qty<-as.numeric(gsub(",","",dat4$Received.Qty)) # sets received as numerical
#difference between the two
conflicting <- dat4$Issued.Quantity-dat4$Received.Qty
dat4 <- data.frame(dat4,conflicting)
# to reduce error, conflicts beyond +/- 500 are not considered
dat4 <-dat4[dat4$conflicting<=500, ]
dat4 <-dat4[dat4$conflicting>=-500, ]
dat4<-na.omit(dat4)
boxplot(dat4$conflicting,
main= "Conflicting Inventory Levels",
col= "darkgoldenrod4")
fivenum(dat4$conflicting)
mean(dat4$conflicting)
# The median is zero as expected but the average of the data set is -2.7423
#Every non-zero value indicates a mismatch between issued and received
# A positive value means that more product was issued than received
# A negative value means that more product was delivered than issued.
# The negative mean indicates that in general more product is delivered than issued.