Decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing
We begin by revisiting the problem statement with some background. The issue we are trying to address occurs during rotogravure printing. Rotogravure printing involves rotating a chrome-plated copper cylinder in a bath of ink, scraping off the excess ink, and pressing a continuous supply of paper against the inked image with a rubber roller. Once the job is printed, the engraved image is removed from the cylinder, which is replated to be engraved for another job.

Sometimes a series of grooves, called a "band", appears in the cylinder during printing, ruining the finished product. These grooves are not present at the start of the printing run, but once they appear the printing press must be shut down. A technician then removes the band by polishing it out of the cylinder, or by transporting the cylinder to a plating station where the chrome surface is removed, the band is polished out of the copper subsurface, and a chrome finish is replated. The occurrence of bands results in process delays, plant shutdowns, and losses of labour and time.

When process delays have known causes, they can be mitigated by acquiring causal rules from human experts and then applying sensors and automated real-time diagnostic devices to the process. However, for some delays the experts have only "weak" causal knowledge or none at all. In such cases, machine learning tools can collect training data and process it through an induction engine in search of "diagnostic" knowledge.

Our aim in this analysis is therefore to find the most probable causes of band formation and report those parameters so that they can be controlled, in order to mitigate the effects of banding. Our analysis follows the different paths in the flowchart shown below. We first take the data that has been provided and perform data clean-up (pre-processing) to make it usable for a machine learning algorithm. This clean-up involves several sub-steps, and we have tried different ways to address the missing data values, described in more detail in the relevant sections. After this activity, we aim to build a robust prediction model using various ML algorithms.
We check how the data is spread and what it contains. The variable of interest is "band_type"; in particular, we want to understand the reasons for banding and how well it can be predicted. The dataset gives the following description of the "band_type" variable:

## variable = band_type
## type = factor
## na = 0 of 540 (0%)
## unique = 2
## band = 228 (42.2%)
## noband = 312 (57.8%)

Of the 540 records in the dataset, "band_type" is completely available (na = 0 of 540, i.e., no missing values). The variable has two unique levels (band and noband), and the split between them is fairly balanced (42.2% versus 57.8%), with only a slight tilt towards noband. An overall plot of the data is shown below. It shows that only 51.3% of rows are complete, i.e., have no missing values at all; this works out to 277 of the 540 records. We also see that about 4.6% of all values are missing observations.
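These completeness figures can be reproduced directly; a minimal sketch, assuming the data frame has been loaded as dataset and the "?" placeholders have already been recoded to NA (both steps appear later in this report):

# Proportion and count of complete rows (sketch; run after loading and NA recoding)
mean(complete.cases(dataset))   # share of rows with no missing values
sum(complete.cases(dataset))    # number of complete rows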
1. timestamp: numeric; 19500101 - 21001231
2. cylinder number: nominal
3. customer: nominal
4. job number: nominal
5. grain screened: nominal; yes, no
6. ink color: nominal; key, type
7. proof on ctd ink: nominal; yes, no
8. blade mfg: nominal; benton, daetwyler, uddeholm
9. cylinder division: nominal; gallatin, warsaw, mattoon
10. paper type: nominal; uncoated, coated, super
11. ink type: nominal; uncoated, coated, cover
12. direct steam: nominal; yes, no
13. solvent type: nominal; xylol, lactol, naptha, line, other
14. type on cylinder: nominal; yes, no
15. press type: nominal; 70 wood hoe, 70 motter, 70 albert, 94 motter
16. press: nominal; 821, 802, 813, 824, 815, 816, 827, 828
17. unit number: nominal; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
18. cylinder size: nominal; catalog, spiegel, tabloid
19. paper mill location: nominal; north us, south us, canadian, scandanavian, mid european
20. plating tank: nominal; 1910, 1911, other
21. proof cut: numeric; 0-100
22. viscosity: numeric; 0-100
23. caliper: numeric; 0-1.0
24. ink temperature: numeric; 5-30
25. humifity: numeric; 5-120
26. roughness: numeric; 0-2
27. blade pressure: numeric; 10-75
28. varnish pct: numeric; 0-100
29. press speed: numeric; 0-4000
30. ink pct: numeric; 0-100
31. solvent pct: numeric; 0-100
32. ESA Voltage: numeric; 0-16
33. ESA Amperage: numeric; 0-10
34. wax: numeric; 0-4.0
35. hardener: numeric; 0-3.0
36. roller durometer: numeric; 15-120
37. current density: numeric; 20-50
38. anode space ratio: numeric; 70-130
39. chrome content: numeric; 80-120
40. band type: nominal (class); band, no band
getwd()
## [1] "E:/fall sem 2022-2023/Data science programming"
setwd("E:/fall sem 2022-2023/Data science programming")
dataset = read.csv("bands.csv")
View(dataset)
str(dataset)
## 'data.frame': 512 obs. of 40 variables:
## $ timestamp : int 19910108 19910109 19910104 19910104 19910111 19910104 19910111 19910111 19910112 19910114 ...
## $ cylinder.number : chr "X126" "X266" "B7" "T133" ...
## $ customer : chr "TVGUIDE" "TVGUIDE" "MODMAT" "MASSEY" ...
## $ job.number : int 25503 25503 47201 39039 37351 38039 35751 35751 47201 37000 ...
## $ grain.screened : chr "YES" "YES" "YES" "YES" ...
## $ ink.color : chr "KEY" "KEY" "KEY" "KEY" ...
## $ proof.on.ctd.ink : chr "YES" "YES" "YES" "YES" ...
## $ blade.mfg : chr "BENTON" "BENTON" "BENTON" "BENTON" ...
## $ cylinder.division : chr "GALLATIN" "GALLATIN" "GALLATIN" "GALLATIN" ...
## $ paper.type : chr "UNCOATED" "UNCOATED" "UNCOATED" "UNCOATED" ...
## $ ink.type : chr "UNCOATED" "UNCOATED" "COATED" "UNCOATED" ...
## $ direct.steam : chr "NO" "NO" "NO" "NO" ...
## $ solvent.type : chr "LINE" "LINE" "LINE" "LINE" ...
## $ type.on.cylinder : chr "YES" "YES" "YES" "YES" ...
## $ press.type : chr "Motter94" "Motter94" "WoodHoe70" "WoodHoe70" ...
## $ press : int 821 821 815 816 816 816 827 827 802 815 ...
## $ unit.number : int 2 2 9 9 2 2 2 9 7 2 ...
## $ cylinder.size : chr "TABLOID" "TABLOID" "CATALOG" "CATALOG" ...
## $ paper.mill.location: chr "NorthUS" "NorthUS" "NorthUS" "NorthUS" ...
## $ plating.tank : chr "1911" "?" "?" "1910" ...
## $ proof.cut : chr "55" "55" "62" "52" ...
## $ viscosity : chr "46" "46" "40" "40" ...
## $ caliper : chr "0.2" "0.3" "0.433" "0.3" ...
## $ ink.temperature : chr "17" "15" "16" "16" ...
## $ humifity : chr "78" "80" "80" "75" ...
## $ roughness : chr "0.75" "0.75" "?" "0.3125" ...
## $ blade.pressure : chr "20" "20" "30" "30" ...
## $ varnish.pct : chr "13.1" "6.6" "6.5" "5.6" ...
## $ press.speed : chr "1700" "1900" "1850" "1467" ...
## $ ink.pct : chr "50.5" "54.9" "53.8" "55.6" ...
## $ solvent.pct : chr "36.4" "38.5" "39.8" "38.8" ...
## $ ESA.Voltage : chr "0" "0" "0" "0" ...
## $ ESA.Amperage : chr "0" "0" "0" "0" ...
## $ wax : chr "2.5" "2.5" "2.8" "2.5" ...
## $ hardener : chr "1" "0.7" "0.9" "1.3" ...
## $ roller.durometer : chr "34" "34" "40" "40" ...
## $ current.density : chr "40" "40" "40" "40" ...
## $ anode.space.ratio : chr "105" "105" "103.87" "108.06" ...
## $ chrome.content : chr "100" "100" "100" "100" ...
## $ band.type : chr "band" "noband" "noband" "noband" ...
dim(dataset)
## [1] 512 40
# Bar plot of the class distribution (band vs noband)
count = table(dataset$band.type)
count
##
## band noband
## 200 312
barplot(count)
# Pie chart of paper mill locations
slice = table(dataset$paper.mill.location)
pie(slice)
Our first step is to clean the data and remove duplicates. Duplicates and inconsistencies can arise from entry mistakes (upper/lower case), misspellings, and similar slips made by the operators who recorded the parameter levels while making the cylinders. Our target variable in this dataset is "band_type", which has two levels: band or noband. Our explanatory variables are the remaining 39 variables, of which 20 are numeric and 19 are nominal. After this common step, we proceed to analyze the dataset and derive more meaningful information, starting with a duplicate check as sketched below.
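A minimal sketch of the duplicate check; note that duplicated() flags only verbatim repeat rows, not near-duplicates caused by case or spelling differences:

# Count exact duplicate rows; uncomment the second line to drop them
sum(duplicated(dataset))
# dataset <- dataset[!duplicated(dataset), ]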
# Recode the "?" placeholders as NA values
idx <- dataset == "?"
is.na(dataset) <- idx
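Alternatively, the same recoding can be done at load time; a sketch of an equivalent one-liner using read.csv's na.strings argument:

# Alternative: treat "?" as NA while reading the file, instead of recoding afterwards
dataset <- read.csv("bands.csv", na.strings = "?")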
#loading dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
From the problem description, we identify 11 variables that can safely be dropped from the dataset; they are identifiers or otherwise uninformative, and keeping them would only degrade the model's capability. The variables we removed from the training data are shown in the code below.
dataset<- dataset %>% select(-c(customer,job.number,cylinder.number,cylinder.division,ink.color,unit.number,plating.tank,blade.mfg,chrome.content,ESA.Voltage,ESA.Amperage))
dim(dataset) # dimensions of the dataset after removing the unwanted columns
## [1] 512 29
dataset$band.type = factor(dataset$band.type,levels=c("band","noband"),labels = c(1,0))
dataset$paper.type = factor(dataset$paper.type,levels=c("UNCOATED","COATED","SUPER"),labels = c(1,2,3))
dataset$ink.type = factor(dataset$ink.type,levels=c("UNCOATED","COATED","COVER"),labels = c(1,2,3))
dataset$direct.steam = factor(dataset$direct.steam,levels=c("NO","YES"),labels =c(0,1))
dataset$press.type = factor(dataset$press.type,levels=c("Motter94","WoodHoe70","Albert70","Motter70" ),labels =c(1,2,3,4))
dataset$grain.screened = factor(dataset$grain.screened,levels=c("YES","NO"),labels = c(1,0))
dataset$proof.on.ctd.ink = factor(dataset$proof.on.ctd.ink,levels=c("YES","NO"),labels=c(1,0))
dataset$solvent.type = factor(dataset$solvent.type,levels=c("LINE","XYLOL","NAPTHA"),labels=c(1,2,3))
dataset$type.on.cylinder = factor(dataset$type.on.cylinder,levels=c("YES","NO"),labels=c(1,0))
dataset$paper.mill.location= factor(dataset$paper.mill.location,levels=c("NorthUS","CANADIAN","SCANDANAVIAN","SouthUS","mideuropean"),labels=c(1,2,3,4,5))
dataset$cylinder.size = factor(dataset$cylinder.size,levels=c("TABLOID","CATALOG","SPIEGEL"),labels=c(1,2,3))
# Convert the first 28 columns to numeric. Note that as.numeric() on a factor
# returns the level index (1, 2, ...), not the label, which is why e.g.
# grain.screened shows values 1/2 in the output below rather than 1/0.
dataset <- dataset %>%
  mutate_at(c(1:28), as.numeric)
str(dataset)
## 'data.frame': 512 obs. of 29 variables:
## $ timestamp : num 19910108 19910109 19910104 19910104 19910111 ...
## $ grain.screened : num 1 1 1 1 2 1 2 2 1 1 ...
## $ proof.on.ctd.ink : num 1 1 1 1 1 1 1 1 1 1 ...
## $ paper.type : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ink.type : num 1 1 2 1 2 1 2 2 1 1 ...
## $ direct.steam : num 1 1 1 1 1 1 1 1 1 1 ...
## $ solvent.type : num 1 1 1 1 1 1 1 1 2 1 ...
## $ type.on.cylinder : num 1 1 1 1 1 1 1 1 1 1 ...
## $ press.type : num 1 1 2 2 2 2 1 1 3 2 ...
## $ press : num 821 821 815 816 816 816 827 827 802 815 ...
## $ cylinder.size : num 1 1 2 2 1 2 1 1 2 2 ...
## $ paper.mill.location: num 1 1 1 1 NA 1 2 2 1 1 ...
## $ proof.cut : num 55 55 62 52 50 50 50 50 50 65 ...
## $ viscosity : num 46 46 40 40 46 40 46 46 45 43 ...
## $ caliper : num 0.2 0.3 0.433 0.3 0.3 0.267 0.3 0.2 0.367 0.333 ...
## $ ink.temperature : num 17 15 16 16 17 16.8 16.5 16.5 12 16 ...
## $ humifity : num 78 80 80 75 80 76 75 75 70 75 ...
## $ roughness : num 0.75 0.75 NA 0.312 0.75 ...
## $ blade.pressure : num 20 20 30 30 30 28 30 28 60 32 ...
## $ varnish.pct : num 13.1 6.6 6.5 5.6 0 8.6 0 0 0 22.7 ...
## $ press.speed : num 1700 1900 1850 1467 2100 ...
## $ ink.pct : num 50.5 54.9 53.8 55.6 57.5 53.8 62.5 62.5 60.2 45.5 ...
## $ solvent.pct : num 36.4 38.5 39.8 38.8 42.5 37.6 37.5 37.5 39.8 31.8 ...
## $ wax : num 2.5 2.5 2.8 2.5 2.3 2.5 2.5 2.5 3 3 ...
## $ hardener : num 1 0.7 0.9 1.3 0.6 0.8 0.6 1.1 1 1 ...
## $ roller.durometer : num 34 34 40 40 35 40 30 30 40 38 ...
## $ current.density : num 40 40 40 40 40 40 40 40 40 40 ...
## $ anode.space.ratio : num 105 105 104 108 107 ...
## $ band.type : Factor w/ 2 levels "1","0": 1 2 2 2 2 2 2 2 1 2 ...
# Impute the remaining NA values with the column mean
dataset <- dataset %>%
  mutate(across(c(grain.screened, proof.on.ctd.ink, solvent.type, type.on.cylinder,
                  cylinder.size, paper.mill.location, proof.cut, viscosity, caliper,
                  ink.temperature, humifity, roughness, blade.pressure, varnish.pct,
                  press.speed, ink.pct, solvent.pct, wax, hardener, roller.durometer,
                  current.density, anode.space.ratio),
                ~ tidyr::replace_na(., mean(., na.rm = TRUE))))
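As a sanity check (a minimal sketch), we can confirm that the mean imputation left no missing values in the imputed columns:

# Count of remaining NA values per column; the imputed columns should show 0
colSums(is.na(dataset))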
A density plot is a representation of the distribution of a numeric variable. It helps us see how normal or skewed the data is and whether any variables are wildly distributed. As can be gleaned from the plots, not all variables are relevant here (unit_number, job_number, etc.). The rest of the data seems well distributed, without too many extremes.
den <- density(dataset$cylinder.size)
plot(den, frame = FALSE, col = "blue",main = "Density plot")
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' theorem, with a "naïve" assumption of conditional independence between features. It is used in a wide variety of classification tasks, and we apply it here as our first classifier.
library(caTools)
set.seed(123)
split = sample.split(dataset$band.type,SplitRatio = 0.75)
training_set = subset(dataset,split == T)
dim(training_set)
## [1] 384 29
test_set = subset(dataset,split == F)
dim(test_set)
## [1] 128 29
Feature scaling is a method used to normalize the range of independent variables or features of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step.
training_set[-29] = scale(training_set[-29])
View(training_set)
test_set[-29] = scale(test_set[-29])
View(test_set)
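One caveat worth noting: scaling the test set with its own means and standard deviations, as above, puts the two sets on slightly different reference frames. A sketch of the alternative, reusing the training set's centring parameters (this assumes the raw, unscaled training_set is still available, and sc is a hypothetical name introduced here):

# Scale the test set using the *training* means and standard deviations
sc <- scale(training_set[-29])                # scale training data, keep parameters
training_set[-29] <- sc
test_set[-29] <- scale(test_set[-29],
                       center = attr(sc, "scaled:center"),
                       scale  = attr(sc, "scaled:scale"))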
e1071 is an R package that provides functions for statistical and probabilistic algorithms such as the fuzzy classifier, naive Bayes classifier, bagged clustering, short-time Fourier transform, and support vector machines. We use its naiveBayes() implementation here.
library(e1071)
classifier = naiveBayes(x = training_set[-29],y = training_set$band.type)
summary(classifier)
## Length Class Mode
## apriori 2 table numeric
## tables 28 -none- list
## levels 2 -none- character
## isnumeric 28 -none- logical
## call 3 -none- call
y_pred = predict(object=classifier,newdata=test_set)
y_pred
## [1] 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
## [38] 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
## [75] 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 0
## [112] 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## Levels: 1 0
cm = table(test_set[,29],y_pred)
cm
## y_pred
## 1 0
## 1 31 19
## 0 27 51
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.640625
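Beyond raw accuracy, per-class measures can be read off the confusion matrix; a sketch for the band class (level 1), recalling that the rows of cm are the actual labels and the columns the predictions:

# Precision and recall for the "band" class (level 1)
precision <- cm["1", "1"] / sum(cm[, "1"])  # of predicted bands, how many were real
recall    <- cm["1", "1"] / sum(cm["1", ])  # of actual bands, how many were caught
precision
recall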
KNN (k-nearest neighbours) is a supervised learning algorithm that uses a labelled input data set to predict the output for new data points. It is one of the simplest machine learning algorithms and can be implemented easily for a varied set of problems, as it is based mainly on feature similarity. We start by loading the required packages.
library(e1071)
library(caTools)
library(class)
# Note: sample.split() expects the label vector; passing the whole data frame
# gives a split recycled across rows rather than one stratified by class.
# sample.split(dataset$band.type, SplitRatio = 0.7) would be the conventional call.
split <- sample.split(dataset, SplitRatio = 0.7)
train_cl <- subset(dataset, split == "TRUE")
dim(train_cl)
## [1] 353 29
test_cl <- subset(dataset, split == "FALSE")
dim(test_cl)
## [1] 159 29
train_scale <- scale(train_cl[, 1:28])
View(train_scale)
test_scale <- scale(test_cl[, 1:28])
View(test_scale)
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$band.type,
k = 3)
classifier_knn
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1
## [38] 1 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 1 1
## [75] 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0
## [112] 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1
## [149] 1 0 1 0 0 1 1 1 1 1 1
## Levels: 1 0
cm <- table(test_cl$band.type, classifier_knn)
cm
## classifier_knn
## 1 0
## 1 43 27
## 0 12 77
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.754717
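The value k = 3 above was an arbitrary choice; a sketch of a simple grid search over k, tracking test-set accuracy (the grid 1:15 is chosen purely for illustration):

# Re-fit KNN for k = 1..15 and record the test accuracy for each
accs <- sapply(1:15, function(k) {
  pred <- knn(train = train_scale, test = test_scale,
              cl = train_cl$band.type, k = k)
  mean(pred == test_cl$band.type)
})
which.max(accs)  # the k giving the highest test accuracy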
We next consider a decision tree, modelled using the rpart function. To grow a large tree, the complexity parameter (cp) can be made very small. The model output lists all the trees considered, giving for each the complexity parameter, the number of splits, the re-substitution error rate, the cross-validated error rate, and the associated standard error.
library(caTools)
set.seed(204)
split = sample.split(dataset$band.type,SplitRatio =0.8)
training_set = subset(dataset,split == T)
test_set = subset(dataset,split == F)
dim(training_set)
## [1] 410 29
dim(test_set)
## [1] 102 29
library(rpart)
library(rpart.plot)
fit <- rpart(formula=band.type~.,data = training_set,method = "class")
plot(fit)
text(fit)
rpart.plot(fit,type = 4,extra=101)
# Make a prediction on the test set (note: predict.rpart takes newdata, not new_data)
pred = predict(object = fit, newdata = test_set, type = "class")
cm = table(test_set$band.type, pred)
cm
acc = sum(diag(cm))/sum(cm)
acc
## [1] 0.754717
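The complexity parameter mentioned above can be inspected and used for pruning; a sketch using rpart's cp table and the common rule of picking the cp with the minimum cross-validated error:

# Inspect the cp table, then prune the tree at the cp minimising xerror
printcp(fit)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
rpart.plot(pruned, type = 4, extra = 101)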
ggplot2 is a plotting package that provides helpful commands for creating complex plots from data in a data frame. It offers a programmatic interface for specifying which variables to plot, how they are displayed, and their general visual properties.
# Box plot of cylinder size by band type
library(ggplot2)
ggplot(data = dataset, mapping = aes(x = band.type, y = cylinder.size)) +
geom_boxplot()
ggplot(data = dataset, aes(x = paper.mill.location, y = paper.type)) +
geom_point()
# Scatter plot coloured by paper mill location
ggplot(dataset, aes(x = paper.mill.location, y = paper.type)) +
geom_point(aes(color = factor(paper.mill.location)))
dataset$paper.mill.location %>% boxplot()
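For an actual histogram-with-density view of a numeric variable, a sketch using press.speed (chosen here purely for illustration):

# Histogram of press speed with a density curve overlaid
ggplot(dataset, aes(x = press.speed)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "grey80") +
  geom_density(colour = "blue")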
We started with 39 candidate explanatory variables as potential reasons for banding. With the different methods used, we have now narrowed this down to a select few variables that are most likely to contribute to banding, taking the variables that appeared most frequently and most widely across all the methods. The four variables, in descending order of importance as reasons for banding, are:

1. press_speed - this variable appeared in the majority of the models
2. viscosity - the second most frequently seen variable
3. solvent_pct (and ink_pct) - since the two are correlated, this variable is important
4. humidity

Management is therefore advised to pay special attention to these variables, and to control them, during the rotogravure printing process.