Decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing
We begin by revisiting the problem statement with some background. The issue we are trying to address occurs during rotogravure printing. Rotogravure printing involves rotating a chrome-plated copper cylinder in a bath of ink, scraping off the excess ink, and pressing a continuous supply of paper against the inked image with a rubber roller. Once the job is printed, the engraved image is removed from the cylinder, which is replated to be engraved for another job.

Sometimes a series of grooves, called a "band", appears in the cylinder during printing, ruining the finished product. These grooves are not present at the start of the printing run, but once they appear the printing press must be shut down. A technician then removes the band by polishing it out of the cylinder, or by transporting the cylinder to a plating station where the chrome surface is removed, the band is polished out of the copper subsurface, and a chrome finish is replated. The occurrence of bands results in process delays, plant shutdowns, and losses of labour and time.

When process delays have known causes, they can be mitigated by acquiring causal rules from human experts and then applying sensors and automated real-time diagnostic devices to the process. However, for some delays the experts have only "weak" causal knowledge or none at all. In such cases, machine learning tools can collect training data and process it through an induction engine in search of "diagnostic" knowledge.

Our aim in this analysis is therefore to find the most probable causes of band formation and report those parameters so that they can be controlled, in order to mitigate the effects of banding. Our analysis follows the different paths in the flowchart shown below. We first take the data that has been provided and perform data clean-up (pre-processing) to make it usable for a machine learning algorithm. This clean-up involves several sub-steps, and we have tried different ways to address the missing data values, described in more detail in the relevant sections. After this activity, we aim to build a robust prediction model using various ML algorithms.
We check how the data is spread and what it contains. The variable of interest is "band_type"; in particular, we want to understand the reasons for banding and how well it can be predicted. The dataset gives the following description of the "band_type" variable:

## variable = band_type
## type = factor
## na = 0 of 540 (0%)
## unique = 2
## band = 228 (42.2%)
## noband = 312 (57.8%)

Of the 540 records in the dataset, "band_type" is completely available (na = 0 of 540, i.e., no missing values). The variable has two unique levels (band and noband), and the split between them is fairly balanced (42.2% versus 57.8%), with only a slight tilt towards noband. An overall plot of the data is shown below. It shows that only 51.3% of rows are complete, i.e., have no missing values at all; this works out to 277 of the 540 records. We also see that about 4.6% of all values are missing observations.
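These completeness figures can be reproduced directly; a minimal sketch, assuming the data frame has been loaded as dataset and the "?" placeholders have already been recoded to NA (both steps appear later in this report):

# Proportion and count of complete rows (sketch; run after loading and NA recoding)
mean(complete.cases(dataset))   # share of rows with no missing values
sum(complete.cases(dataset))    # number of complete rows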
1. timestamp: numeric; 19500101 - 21001231
2. cylinder number: nominal
3. customer: nominal
4. job number: nominal
5. grain screened: nominal; yes, no
6. ink color: nominal; key, type
7. proof on ctd ink: nominal; yes, no
8. blade mfg: nominal; benton, daetwyler, uddeholm
9. cylinder division: nominal; gallatin, warsaw, mattoon
10. paper type: nominal; uncoated, coated, super
11. ink type: nominal; uncoated, coated, cover
12. direct steam: nominal; yes, no
13. solvent type: nominal; xylol, lactol, naptha, line, other
14. type on cylinder: nominal; yes, no
15. press type: nominal; 70 wood hoe, 70 motter, 70 albert, 94 motter
16. press: nominal; 821, 802, 813, 824, 815, 816, 827, 828
17. unit number: nominal; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
18. cylinder size: nominal; catalog, spiegel, tabloid
19. paper mill location: nominal; north us, south us, canadian, scandanavian, mid european
20. plating tank: nominal; 1910, 1911, other
21. proof cut: numeric; 0-100
22. viscosity: numeric; 0-100
23. caliper: numeric; 0-1.0
24. ink temperature: numeric; 5-30
25. humifity: numeric; 5-120
26. roughness: numeric; 0-2
27. blade pressure: numeric; 10-75
28. varnish pct: numeric; 0-100
29. press speed: numeric; 0-4000
30. ink pct: numeric; 0-100
31. solvent pct: numeric; 0-100
32. ESA Voltage: numeric; 0-16
33. ESA Amperage: numeric; 0-10
34. wax: numeric; 0-4.0
35. hardener: numeric; 0-3.0
36. roller durometer: numeric; 15-120
37. current density: numeric; 20-50
38. anode space ratio: numeric; 70-130
39. chrome content: numeric; 80-120
40. band type: nominal (class); band, no band
getwd()
## [1] "E:/fall sem 2022-2023/Data science programming"
setwd("E:/fall sem 2022-2023/Data science programming")
dataset = read.csv("bands.csv")
View(dataset)
str(dataset)
## 'data.frame': 512 obs. of 40 variables:
## $ timestamp : int 19910108 19910109 19910104 19910104 19910111 19910104 19910111 19910111 19910112 19910114 ...
## $ cylinder.number : chr "X126" "X266" "B7" "T133" ...
## $ customer : chr "TVGUIDE" "TVGUIDE" "MODMAT" "MASSEY" ...
## $ job.number : int 25503 25503 47201 39039 37351 38039 35751 35751 47201 37000 ...
## $ grain.screened : chr "YES" "YES" "YES" "YES" ...
## $ ink.color : chr "KEY" "KEY" "KEY" "KEY" ...
## $ proof.on.ctd.ink : chr "YES" "YES" "YES" "YES" ...
## $ blade.mfg : chr "BENTON" "BENTON" "BENTON" "BENTON" ...
## $ cylinder.division : chr "GALLATIN" "GALLATIN" "GALLATIN" "GALLATIN" ...
## $ paper.type : chr "UNCOATED" "UNCOATED" "UNCOATED" "UNCOATED" ...
## $ ink.type : chr "UNCOATED" "UNCOATED" "COATED" "UNCOATED" ...
## $ direct.steam : chr "NO" "NO" "NO" "NO" ...
## $ solvent.type : chr "LINE" "LINE" "LINE" "LINE" ...
## $ type.on.cylinder : chr "YES" "YES" "YES" "YES" ...
## $ press.type : chr "Motter94" "Motter94" "WoodHoe70" "WoodHoe70" ...
## $ press : int 821 821 815 816 816 816 827 827 802 815 ...
## $ unit.number : int 2 2 9 9 2 2 2 9 7 2 ...
## $ cylinder.size : chr "TABLOID" "TABLOID" "CATALOG" "CATALOG" ...
## $ paper.mill.location: chr "NorthUS" "NorthUS" "NorthUS" "NorthUS" ...
## $ plating.tank : chr "1911" "?" "?" "1910" ...
## $ proof.cut : chr "55" "55" "62" "52" ...
## $ viscosity : chr "46" "46" "40" "40" ...
## $ caliper : chr "0.2" "0.3" "0.433" "0.3" ...
## $ ink.temperature : chr "17" "15" "16" "16" ...
## $ humifity : chr "78" "80" "80" "75" ...
## $ roughness : chr "0.75" "0.75" "?" "0.3125" ...
## $ blade.pressure : chr "20" "20" "30" "30" ...
## $ varnish.pct : chr "13.1" "6.6" "6.5" "5.6" ...
## $ press.speed : chr "1700" "1900" "1850" "1467" ...
## $ ink.pct : chr "50.5" "54.9" "53.8" "55.6" ...
## $ solvent.pct : chr "36.4" "38.5" "39.8" "38.8" ...
## $ ESA.Voltage : chr "0" "0" "0" "0" ...
## $ ESA.Amperage : chr "0" "0" "0" "0" ...
## $ wax : chr "2.5" "2.5" "2.8" "2.5" ...
## $ hardener : chr "1" "0.7" "0.9" "1.3" ...
## $ roller.durometer : chr "34" "34" "40" "40" ...
## $ current.density : chr "40" "40" "40" "40" ...
## $ anode.space.ratio : chr "105" "105" "103.87" "108.06" ...
## $ chrome.content : chr "100" "100" "100" "100" ...
## $ band.type : chr "band" "noband" "noband" "noband" ...
dim(dataset)
## [1] 512 40
# Bar plot of the class distribution (band vs noband)
count = table(dataset$band.type)
count
##
## band noband
## 200 312
barplot(count)
# Pie chart of paper mill locations
slice = table(dataset$paper.mill.location)
pie(slice)
Our first step is to clean the data and remove duplicates. Duplicates and inconsistencies can arise from entry mistakes (upper/lower case), misspellings, and similar slips made by the operators who recorded the parameter levels while making the cylinders. Our target variable in this dataset is "band_type", which has two levels: band or noband. Our explanatory variables are the remaining 39 variables, of which 20 are numeric and 19 are nominal. After this common step, we proceed to analyze the dataset and derive more meaningful information, starting with a duplicate check as sketched below.
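A minimal sketch of the duplicate check; note that duplicated() flags only verbatim repeat rows, not near-duplicates caused by case or spelling differences:

# Count exact duplicate rows; uncomment the second line to drop them
sum(duplicated(dataset))
# dataset <- dataset[!duplicated(dataset), ]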
# Recode the "?" placeholders as NA values
idx <- dataset == "?"
is.na(dataset) <- idx
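Alternatively, the same recoding can be done at load time; a sketch of an equivalent one-liner using read.csv's na.strings argument:

# Alternative: treat "?" as NA while reading the file, instead of recoding afterwards
dataset <- read.csv("bands.csv", na.strings = "?")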
#loading dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
From the problem description, we identify 11 variables that can safely be dropped from the dataset; they are identifiers or otherwise uninformative, and keeping them would only degrade the model's capability. The variables we removed from the training data are shown in the code below.
dataset<- dataset %>% select(-c(customer,job.number,cylinder.number,cylinder.division,ink.color,unit.number,plating.tank,blade.mfg,chrome.content,ESA.Voltage,ESA.Amperage))
dim(dataset) # dimensions of the dataset after removing the unwanted columns
## [1] 512 29
dataset$band.type = factor(dataset$band.type,levels=c("band","noband"),labels = c(1,0))
dataset$paper.type = factor(dataset$paper.type,levels=c("UNCOATED","COATED","SUPER"),labels = c(1,2,3))
dataset$ink.type = factor(dataset$ink.type,levels=c("UNCOATED","COATED","COVER"),labels = c(1,2,3))
dataset$direct.steam = factor(dataset$direct.steam,levels=c("NO","YES"),labels =c(0,1))
dataset$press.type = factor(dataset$press.type,levels=c("Motter94","WoodHoe70","Albert70","Motter70" ),labels =c(1,2,3,4))
dataset$grain.screened = factor(dataset$grain.screened,levels=c("YES","NO"),labels = c(1,0))
dataset$proof.on.ctd.ink = factor(dataset$proof.on.ctd.ink,levels=c("YES","NO"),labels=c(1,0))
dataset$solvent.type = factor(dataset$solvent.type,levels=c("LINE","XYLOL","NAPTHA"),labels=c(1,2,3))
dataset$type.on.cylinder = factor(dataset$type.on.cylinder,levels=c("YES","NO"),labels=c(1,0))
dataset$paper.mill.location= factor(dataset$paper.mill.location,levels=c("NorthUS","CANADIAN","SCANDANAVIAN","SouthUS","mideuropean"),labels=c(1,2,3,4,5))
dataset$cylinder.size = factor(dataset$cylinder.size,levels=c("TABLOID","CATALOG","SPIEGEL"),labels=c(1,2,3))
# Convert the first 28 columns to numeric. Note that as.numeric() on a factor
# returns the level index (1, 2, ...), not the label, which is why e.g.
# grain.screened shows values 1/2 in the output below rather than 1/0.
dataset <- dataset %>%
  mutate_at(c(1:28), as.numeric)
str(dataset)
## 'data.frame': 512 obs. of 29 variables:
## $ timestamp : num 19910108 19910109 19910104 19910104 19910111 ...
## $ grain.screened : num 1 1 1 1 2 1 2 2 1 1 ...
## $ proof.on.ctd.ink : num 1 1 1 1 1 1 1 1 1 1 ...
## $ paper.type : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ink.type : num 1 1 2 1 2 1 2 2 1 1 ...
## $ direct.steam : num 1 1 1 1 1 1 1 1 1 1 ...
## $ solvent.type : num 1 1 1 1 1 1 1 1 2 1 ...
## $ type.on.cylinder : num 1 1 1 1 1 1 1 1 1 1 ...
## $ press.type : num 1 1 2 2 2 2 1 1 3 2 ...
## $ press : num 821 821 815 816 816 816 827 827 802 815 ...
## $ cylinder.size : num 1 1 2 2 1 2 1 1 2 2 ...
## $ paper.mill.location: num 1 1 1 1 NA 1 2 2 1 1 ...
## $ proof.cut : num 55 55 62 52 50 50 50 50 50 65 ...
## $ viscosity : num 46 46 40 40 46 40 46 46 45 43 ...
## $ caliper : num 0.2 0.3 0.433 0.3 0.3 0.267 0.3 0.2 0.367 0.333 ...
## $ ink.temperature : num 17 15 16 16 17 16.8 16.5 16.5 12 16 ...
## $ humifity : num 78 80 80 75 80 76 75 75 70 75 ...
## $ roughness : num 0.75 0.75 NA 0.312 0.75 ...
## $ blade.pressure : num 20 20 30 30 30 28 30 28 60 32 ...
## $ varnish.pct : num 13.1 6.6 6.5 5.6 0 8.6 0 0 0 22.7 ...
## $ press.speed : num 1700 1900 1850 1467 2100 ...
## $ ink.pct : num 50.5 54.9 53.8 55.6 57.5 53.8 62.5 62.5 60.2 45.5 ...
## $ solvent.pct : num 36.4 38.5 39.8 38.8 42.5 37.6 37.5 37.5 39.8 31.8 ...
## $ wax : num 2.5 2.5 2.8 2.5 2.3 2.5 2.5 2.5 3 3 ...
## $ hardener : num 1 0.7 0.9 1.3 0.6 0.8 0.6 1.1 1 1 ...
## $ roller.durometer : num 34 34 40 40 35 40 30 30 40 38 ...
## $ current.density : num 40 40 40 40 40 40 40 40 40 40 ...
## $ anode.space.ratio : num 105 105 104 108 107 ...
## $ band.type : Factor w/ 2 levels "1","0": 1 2 2 2 2 2 2 2 1 2 ...
# Impute the remaining NA values with the column mean
dataset <- dataset %>%
  mutate(across(c(grain.screened, proof.on.ctd.ink, solvent.type, type.on.cylinder,
                  cylinder.size, paper.mill.location, proof.cut, viscosity, caliper,
                  ink.temperature, humifity, roughness, blade.pressure, varnish.pct,
                  press.speed, ink.pct, solvent.pct, wax, hardener, roller.durometer,
                  current.density, anode.space.ratio),
                ~ tidyr::replace_na(., mean(., na.rm = TRUE))))
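As a sanity check (a minimal sketch), we can confirm that the mean imputation left no missing values in the imputed columns:

# Count of remaining NA values per column; the imputed columns should show 0
colSums(is.na(dataset))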
A density plot is a representation of the distribution of a numeric variable. It helps us see how normal or skewed the data is and whether any variables are wildly distributed. As can be gleaned from the plots, not all variables are relevant here (unit_number, job_number, etc.). The rest of the data seems well distributed, without too many extremes.
den <- density(dataset$cylinder.size)
plot(den, frame = FALSE, col = "blue",main = "Density plot")
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' theorem, with a "naïve" assumption of conditional independence between features. It is used in a wide variety of classification tasks, and we apply it here as our first classifier.
library(caTools)
set.seed(123)
split = sample.split(dataset$band.type,SplitRatio = 0.75)
training_set = subset(dataset,split == T)
dim(training_set)
## [1] 384 29
test_set = subset(dataset,split == F)
dim(test_set)
## [1] 128 29
Feature scaling is a method used to normalize the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step.
training_set[-29] = scale(training_set[-29])
View(training_set)
test_set[-29] = scale(test_set[-29])
View(test_set)
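One caveat worth noting: scaling the test set with its own means and standard deviations, as above, puts the two sets on slightly different reference frames. A sketch of the alternative, reusing the training set's centring parameters (this assumes the raw, unscaled training_set is still available, and sc is a hypothetical name introduced here):

# Scale the test set using the *training* means and standard deviations
sc <- scale(training_set[-29])                # scale training data, keep parameters
training_set[-29] <- sc
test_set[-29] <- scale(test_set[-29],
                       center = attr(sc, "scaled:center"),
                       scale  = attr(sc, "scaled:scale"))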
e1071 is an R package that provides functions for statistical and probabilistic algorithms such as the fuzzy classifier, naive Bayes classifier, bagged clustering, short-time Fourier transform, and support vector machines. We use its naiveBayes() implementation here.
library(e1071)
classifier = naiveBayes(x = training_set[-29],y = training_set$band.type)
summary(classifier)
## Length Class Mode
## apriori 2 table numeric
## tables 28 -none- list
## levels 2 -none- character
## isnumeric 28 -none- logical
## call 3 -none- call
y_pred = predict(object=classifier,newdata=test_set)
y_pred
## [1] 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
## [38] 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
## [75] 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 0
## [112] 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## Levels: 1 0
cm = table(test_set[,29],y_pred)
cm
## y_pred
## 1 0
## 1 31 19
## 0 27 51
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.640625
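Beyond raw accuracy, per-class measures can be read off the confusion matrix; a sketch for the band class (level 1), recalling that the rows of cm are the actual labels and the columns the predictions:

# Precision and recall for the "band" class (level 1)
precision <- cm["1", "1"] / sum(cm[, "1"])  # of predicted bands, how many were real
recall    <- cm["1", "1"] / sum(cm["1", ])  # of actual bands, how many were caught
precision
recall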
KNN (k-nearest neighbours) is a supervised learning algorithm that uses a labelled input data set to predict the output for new data points. It is one of the simplest machine learning algorithms and can be implemented easily for a varied set of problems, as it is based mainly on feature similarity. We start by loading the required packages.
library(e1071)
library(caTools)
library(class)
# Note: sample.split() expects the label vector; passing the whole data frame
# gives a split recycled across rows rather than one stratified by class.
# sample.split(dataset$band.type, SplitRatio = 0.7) would be the conventional call.
split <- sample.split(dataset, SplitRatio = 0.7)
train_cl <- subset(dataset, split == "TRUE")
dim(train_cl)
## [1] 353 29
test_cl <- subset(dataset, split == "FALSE")
dim(test_cl)
## [1] 159 29
train_scale <- scale(train_cl[, 1:28])
View(train_scale)
test_scale <- scale(test_cl[, 1:28])
View(test_scale)
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$band.type,
k = 3)
classifier_knn
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1
## [38] 1 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 1 1
## [75] 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0
## [112] 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1
## [149] 1 0 1 0 0 1 1 1 1 1 1
## Levels: 1 0
cm <- table(test_cl$band.type, classifier_knn)
cm
## classifier_knn
## 1 0
## 1 43 27
## 0 12 77
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.754717
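The value k = 3 above was an arbitrary choice; a sketch of a simple grid search over k, tracking test-set accuracy (the grid 1:15 is chosen purely for illustration):

# Re-fit KNN for k = 1..15 and record the test accuracy for each
accs <- sapply(1:15, function(k) {
  pred <- knn(train = train_scale, test = test_scale,
              cl = train_cl$band.type, k = k)
  mean(pred == test_cl$band.type)
})
which.max(accs)  # the k giving the highest test accuracy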
We next consider a decision tree, modelled using the rpart function. To grow a large tree, the complexity parameter (cp) can be made very small. The model output lists all the trees considered, giving for each the complexity parameter, the number of splits, the re-substitution error rate, the cross-validated error rate, and the associated standard error.
library(caTools)
set.seed(204)
split = sample.split(dataset$band.type,SplitRatio =0.8)
training_set = subset(dataset,split == T)
test_set = subset(dataset,split == F)
dim(training_set)
## [1] 410 29
dim(test_set)
## [1] 102 29
library(rpart)
library(rpart.plot)
fit <- rpart(formula=band.type~.,data = training_set,method = "class")
plot(fit)
text(fit)
rpart.plot(fit,type = 4,extra=101)
# Make a prediction on the test set (note: predict.rpart takes newdata, not new_data)
pred = predict(object = fit, newdata = test_set, type = "class")
cm = table(test_set$band.type, pred)
cm
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.754717
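The complexity parameter mentioned above can be inspected and used for pruning; a sketch using rpart's cp table and the common rule of picking the cp with the minimum cross-validated error:

# Inspect the cp table, then prune the tree at the cp minimising xerror
printcp(fit)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
rpart.plot(pruned, type = 4, extra = 101)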
ggplot2 is a plotting package that provides helpful commands for creating complex plots from data in a data frame. It offers a programmatic interface for specifying which variables to plot, how they are displayed, and their general visual properties.
# Box plot of cylinder size by band type
library(ggplot2)
ggplot(data = dataset, mapping = aes(x = band.type, y = cylinder.size)) +
geom_boxplot()
ggplot(data = dataset, aes(x = paper.mill.location, y = paper.type)) +
geom_point()
# Scatter plot coloured by paper mill location
ggplot(dataset, aes(x = paper.mill.location, y = paper.type)) +
geom_point(aes(color = factor(paper.mill.location)))
dataset$paper.mill.location %>% boxplot()
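For an actual histogram-with-density view of a numeric variable, a sketch using press.speed (chosen here purely for illustration):

# Histogram of press speed with a density curve overlaid
ggplot(dataset, aes(x = press.speed)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "grey80") +
  geom_density(colour = "blue")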
We started with 39 candidate explanatory variables as potential reasons for banding. With the different methods used, we have now narrowed this down to a select few variables that are most likely to contribute to banding, taking the variables that appeared most frequently and most widely across all the methods. The four variables, in descending order of importance as reasons for banding, are:

1. press_speed - this variable appeared in the majority of the models
2. viscosity - the second most frequently seen variable
3. solvent_pct (and ink_pct) - since the two are correlated, this variable is important
4. humidity

Management is therefore advised to pay special attention to these variables, and to control them, during the rotogravure printing process.