WQD7004 Group Project - Group 7

Lecturer: Assoc. Prof. Dr. Ang Tan Fong

Group Members:

No Name Matric No
1 LEONG JIA LING S2157416
2 MD TUHIN HOSSAIN S2116205
3 SEAH TEONG TAT S2130478
4 ROZAIDAWATI ZAINUL AZNAM S2153846
5 R PREMANAN RATHAKRISHNAN S2151940

Table of Contents

Project Sections:

  1. Introduction

  2. Aim and Objectives

  3. Dataset Summary & Explanation

  4. Data Preprocessing & Cleaning

  5. Feature Engineering

  6. EDA & Visualization

  7. Feature Importance & Selection

  8. Model Evaluation & Selection

  9. Prediction and Results

  10. Conclusion

  11. References

1.0 Introduction

Basic human necessities including food, clothes, and shelter are well-known concepts. As we all know, the need for clothing also indicates that people need clothing (Mubarok, 2017). If these needs weren’t satisfied, that would be impossible. In addition to the functional requirement for clothing, the sale or business of clothing is also highly powerful. The garment business is one of the most labour-intensive sectors in the world, aside from economic factors. Around 75 million people worldwide work directly in the textile, apparel, and footwear industries (United, 2021). In many developing nations, such as Bangladesh, which is currently the second-largest exporter of apparel in the world behind China, the ready-to-wear garment industry plays a significant role in manufacturing production, employment, and trade (Chaerani, 2018). According to newly released Bureau of Export Promotion Data, ready-to-wear exports from Bangladesh generate over $ 30.61 billion in income, accounting for about 14.07% of the country’s GDP and 81% of all export earnings. (M Saiful Islam and colleagues, 2019) The improvement in production quality in the apparel business must be maintained considering the rising global demand for clothing. Productivity is one of the instruments used to measure business success, and the definition of productivity is a comparison between output and input (Sri & Margareta, 2020).

In our study, we figured out that machine learning would be the most suitable approach in measuring the performance of the employees of this industry. Machine learning (ML), a subfield of artificial intelligence, enables computers to make autonomous predictions by rapidly learning from training data and past experiences without the use of explicit programming. The goal of ML is to replicate the human brain’s capacity for problem-solving and experience-based analysis. Therefore, ML strategies involve applying various algorithms to data to uncover certain patterns that improve decision-making. According to Zhang 2010, there are different kinds of machine learning, including supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Each type of ML algorithm is used to address a certain class of issues; for example, some algorithms can be applied to classification, others to regression, and yet others to clustering. The kind of the problem and numerous other parameters, like parametrization, learning time, prediction time, overfitting propensity, and memory size, all play a role in selecting the most appropriate algorithm (Mahesh, 2019). All machine learning algorithms are practical tools that help people in a variety of tasks, including data mining, image processing, and prediction analysis (Mona M. Jamjoom, 2021).

2.0 Aim and Objectives

2.1 Aim

To construct the means of determining the garment productivity and employee productivity using classification and regression models.

2.2 Objectives

The objectives for this project are:

  1. To understand the factors that affect garment productivity.

  2. To build, evaluate and choose the best regression model to predict garment employee’s productivity.

  3. To build, evaluate and choose the best classification model in order to determine whether actual productivity reached the target productivity.

3.0 Dataset Summary & Explanation

Using the head() function, the top 6 results from the dataset are returned.

# Import Dataset
dataset <- read.csv("garments_worker_productivity.csv")

head(dataset)        # first 6 rows of dataset
##       date  quarter department      day team targeted_productivity   smv  wip
## 1 1/1/2015 Quarter1     sweing Thursday    8                  0.80 26.16 1108
## 2 1/1/2015 Quarter1 finishing  Thursday    1                  0.75  3.94   NA
## 3 1/1/2015 Quarter1     sweing Thursday   11                  0.80 11.41  968
## 4 1/1/2015 Quarter1     sweing Thursday   12                  0.80 11.41  968
## 5 1/1/2015 Quarter1     sweing Thursday    6                  0.80 25.90 1170
## 6 1/1/2015 Quarter1     sweing Thursday    7                  0.80 25.90  984
##   over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1      7080        98         0        0                  0          59.0
## 2       960         0         0        0                  0           8.0
## 3      3660        50         0        0                  0          30.5
## 4      3660        50         0        0                  0          30.5
## 5      1920        50         0        0                  0          56.0
## 6      6720        38         0        0                  0          56.0
##   actual_productivity
## 1           0.9407254
## 2           0.8865000
## 3           0.8005705
## 4           0.8005705
## 5           0.8003819
## 6           0.8001250

Using the summary() function, it returns the length, class and mode for alphanumeric data while for numeric data it returns the minimum, maximum, 1st quartile, median, 3rd quartile and mean of that column.

summary(dataset)     # summary of dataset
##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       team        targeted_productivity      smv             wip         
##  Min.   : 1.000   Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  1st Qu.: 3.000   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Median : 6.000   Median :0.7500        Median :15.26   Median : 1039.0  
##  Mean   : 6.427   Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##  3rd Qu.: 9.000   3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##  Max.   :12.000   Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                         NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
## 

Using the glimpse( ) function returns the data types and values of each attribute.

glimpse(dataset)     # show the features data types and values
## Rows: 1,197
## Columns: 15
## $ date                  <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", …
## $ quarter               <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department            <chr> "sweing", "finishing ", "sweing", "sweing", "swe…
## $ day                   <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
## $ team                  <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ smv                   <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ wip                   <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681…
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…
skim(dataset)        # show more details about features (Missing Values)
Data summary
Name dataset
Number of rows 1197
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 8 9 0 59 0
quarter 0 1 8 8 0 5 0
department 0 1 6 10 0 3 0
day 0 1 6 9 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
team 0 1.00 6.43 3.46 1.00 3.00 6.00 9.00 12.00 ▇▅▅▆▇
targeted_productivity 0 1.00 0.73 0.10 0.07 0.70 0.75 0.80 0.80 ▁▁▁▁▇
smv 0 1.00 15.06 10.94 2.90 3.94 15.26 24.26 54.56 ▇▅▃▁▁
wip 506 0.58 1190.47 1837.46 7.00 774.50 1039.00 1252.50 23122.00 ▇▁▁▁▁
over_time 0 1.00 4567.46 3348.82 0.00 1440.00 3960.00 6960.00 25920.00 ▇▆▁▁▁
incentive 0 1.00 38.21 160.18 0.00 0.00 0.00 50.00 3600.00 ▇▁▁▁▁
idle_time 0 1.00 0.73 12.71 0.00 0.00 0.00 0.00 300.00 ▇▁▁▁▁
idle_men 0 1.00 0.37 3.27 0.00 0.00 0.00 0.00 45.00 ▇▁▁▁▁
no_of_style_change 0 1.00 0.15 0.43 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
no_of_workers 0 1.00 34.61 22.20 2.00 9.00 34.00 57.00 89.00 ▇▂▂▇▁
actual_productivity 0 1.00 0.74 0.17 0.23 0.65 0.77 0.85 1.12 ▂▂▆▇▂

The skim( ) function shows the user more details about the features as seen above. The column type frequency for character and numeric columns are shown. Detailed contents for character and variable type variables are shown in a table format, indicating any missing values or even number of unique data. In this dataset, the ‘wip’ column has 506 missing attributes while there are 59 unique dates of data collected in this dataset.

# Correlation Marix before data cleaning
corrplot(cor(dataset[, 5:15]),
  method = "square",
  type = "upper",
  tl.col = "black",
  tl.cex = 0.6,
  col = colorRampPalette(c("purple", "dark green"))(200)
)

A correlation matrix has also been generated as shown above before the data cleaning process using the corrplot() function. It can be seen that there are question marks showing between relationships such as wip - over_time and smv - wip and several other attributes, these will be handled in the following section.

4.0 Data Preprocessing & Cleaning

First, the column names for column 7 and 8 will be renamed as “standard_minute_value” and “work_in_progress”.

# Renaming the columns name
colnames(dataset)[7] = "standard_minute_value"
colnames(dataset)[8] = "work_in_progress"


# Changing Department value Sweing to Sewing
unique(dataset[c("department")])
##    department
## 1      sweing
## 2  finishing 
## 19  finishing

After checking the unique values in the “department” column, it can be seen that the term “sewing” was spelt as “sweing” and there is “finishing” and “finishing” with one having a space at the end which will need to be corrected.

dataset <- within(dataset, department[department == 'sweing'] <- 'sewing')
dataset <- within(dataset, department[department == 'finishing '] <- 'finishing')
unique(dataset[c("department")])
##   department
## 1     sewing
## 2  finishing

After making the changes in the “department” column, it can be seen that it is left with just “sewing” and “finishing” after checking with unique().

# check the missing values appearance
sum(is.na(dataset))
## [1] 506
colSums(is.na(dataset))
##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
## standard_minute_value      work_in_progress             over_time 
##                     0                   506                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0

Moving on, the researchers checked for missing values in the dataset and found that the “work_in_progress” attribute has 506 missing data.

With the missmap() function, it draws a map of the missingness in a dataset using the image function. The columns are reordered to put the most missing variable farthest to the left. As illustrated in the missingness map, “work_in_progress” has a high proportion of NaN values. Hence, the cells will be filled with 0 that means there is no number of items currently being worked on.

missmap(dataset, col = c("red", "blue"), legend = FALSE)

As we can see work_in_progress has a high proportion of NaN values. So we are going to fill that cell as 0 that means there is no number of items currently being worked on.

dataset[is.na(dataset)] <- 0
colSums(is.na(dataset))
##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
## standard_minute_value      work_in_progress             over_time 
##                     0                     0                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0
# Shape of dataset
nrow(dataset)
## [1] 1197
ncol(dataset)
## [1] 15

After checking the dataset shape, it can be seen that the dataset is small. As such, the outliers will not be removed.

5.0 Feature Engineering

Feature engineering will be conducted in this section. The sapply() function takes list, vector or data frame as input and gives output in vector or matrix. It is useful for operations on list objects and returns a list object of the same length as the original set.

# Date Data Type Conversion
sapply(dataset, class)
##                  date               quarter            department 
##           "character"           "character"           "character" 
##                   day                  team targeted_productivity 
##           "character"             "integer"             "numeric" 
## standard_minute_value      work_in_progress             over_time 
##             "numeric"             "numeric"             "integer" 
##             incentive             idle_time              idle_men 
##             "integer"             "numeric"             "integer" 
##    no_of_style_change         no_of_workers   actual_productivity 
##             "integer"             "numeric"             "numeric"

The first attribute to be converted is the “date” which will be converted from month/day/year (1/01/2015) format to Year-Month-Day (2015-01-01) format.

dataset$date <- as.POSIXct( dataset$date, format="%m/%d/%Y" )
sapply(dataset, class)
## $date
## [1] "POSIXct" "POSIXt" 
## 
## $quarter
## [1] "character"
## 
## $department
## [1] "character"
## 
## $day
## [1] "character"
## 
## $team
## [1] "integer"
## 
## $targeted_productivity
## [1] "numeric"
## 
## $standard_minute_value
## [1] "numeric"
## 
## $work_in_progress
## [1] "numeric"
## 
## $over_time
## [1] "integer"
## 
## $incentive
## [1] "integer"
## 
## $idle_time
## [1] "numeric"
## 
## $idle_men
## [1] "integer"
## 
## $no_of_style_change
## [1] "integer"
## 
## $no_of_workers
## [1] "numeric"
## 
## $actual_productivity
## [1] "numeric"
head(dataset)
##         date  quarter department      day team targeted_productivity
## 1 2015-01-01 Quarter1     sewing Thursday    8                  0.80
## 2 2015-01-01 Quarter1  finishing Thursday    1                  0.75
## 3 2015-01-01 Quarter1     sewing Thursday   11                  0.80
## 4 2015-01-01 Quarter1     sewing Thursday   12                  0.80
## 5 2015-01-01 Quarter1     sewing Thursday    6                  0.80
## 6 2015-01-01 Quarter1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity
## 1                  0          59.0           0.9407254
## 2                  0           8.0           0.8865000
## 3                  0          30.5           0.8005705
## 4                  0          30.5           0.8005705
## 5                  0          56.0           0.8003819
## 6                  0          56.0           0.8001250
# Quarter Data Type Conversion
unique(dataset[, 2])
## [1] "Quarter1" "Quarter2" "Quarter3" "Quarter4" "Quarter5"
quarter_cat <- dataset[, 2]
dataset <- cbind(dataset, quarter_cat)
head(dataset)
##         date  quarter department      day team targeted_productivity
## 1 2015-01-01 Quarter1     sewing Thursday    8                  0.80
## 2 2015-01-01 Quarter1  finishing Thursday    1                  0.75
## 3 2015-01-01 Quarter1     sewing Thursday   11                  0.80
## 4 2015-01-01 Quarter1     sewing Thursday   12                  0.80
## 5 2015-01-01 Quarter1     sewing Thursday    6                  0.80
## 6 2015-01-01 Quarter1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1

Next, the “quarter” attribute will be changed from ‘Quarter 1’, ‘Quarter 2’ until ‘Quarter 5’ into numeric format of ‘1’, ‘2’ and until ‘5’.

dataset <- within(dataset, quarter[quarter == 'Quarter1'] <- 1)
dataset <- within(dataset, quarter[quarter == 'Quarter2'] <- 2)
dataset <- within(dataset, quarter[quarter == 'Quarter3'] <- 3)
dataset <- within(dataset, quarter[quarter == 'Quarter4'] <- 4)
dataset <- within(dataset, quarter[quarter == 'Quarter5'] <- 5)

unique(dataset[, 2])
## [1] "1" "2" "3" "4" "5"
str(dataset[, 2])
##  chr [1:1197] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
dataset$quarter <- as.integer(dataset$quarter)
str(dataset[, 2])
##  int [1:1197] 1 1 1 1 1 1 1 1 1 1 ...
# Department Data Type Conversion
unique(dataset$department)
## [1] "sewing"    "finishing"
department_cat <- dataset$department
dataset <- cbind(dataset, department_cat)
head(dataset)
##         date quarter department      day team targeted_productivity
## 1 2015-01-01       1     sewing Thursday    8                  0.80
## 2 2015-01-01       1  finishing Thursday    1                  0.75
## 3 2015-01-01       1     sewing Thursday   11                  0.80
## 4 2015-01-01       1     sewing Thursday   12                  0.80
## 5 2015-01-01       1     sewing Thursday    6                  0.80
## 6 2015-01-01       1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat
## 1         sewing
## 2      finishing
## 3         sewing
## 4         sewing
## 5         sewing
## 6         sewing

The target attribute of “department” will also represent “sewing” with 0 and “finishing” as 1, making it into integers for easier model building.

dataset <- within(dataset, department[department == 'sewing'] <- 0)
dataset <- within(dataset, department[department == 'finishing'] <- 1)

unique(dataset$department)
## [1] "0" "1"
str(dataset$department)
##  chr [1:1197] "0" "1" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "1" "1" ...
dataset$department <- as.integer(dataset$department)
str(dataset$department)
##  int [1:1197] 0 1 0 0 0 0 1 0 0 0 ...
# Day Data Type Conversion
unique(dataset$day)
## [1] "Thursday"  "Saturday"  "Sunday"    "Monday"    "Tuesday"   "Wednesday"
day_cat <- dataset$day
dataset <- cbind(dataset, day_cat)
head(dataset)
##         date quarter department      day team targeted_productivity
## 1 2015-01-01       1          0 Thursday    8                  0.80
## 2 2015-01-01       1          1 Thursday    1                  0.75
## 3 2015-01-01       1          0 Thursday   11                  0.80
## 4 2015-01-01       1          0 Thursday   12                  0.80
## 5 2015-01-01       1          0 Thursday    6                  0.80
## 6 2015-01-01       1          0 Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat  day_cat
## 1         sewing Thursday
## 2      finishing Thursday
## 3         sewing Thursday
## 4         sewing Thursday
## 5         sewing Thursday
## 6         sewing Thursday

The “day” which represent the days of the week (Sunday to Saturday) are also converted into numeric format which 0 starts with Sunday with 1 representing Monday, until 6 depicting Saturday.

dataset <- within(dataset, day[day == 'Sunday'] <- 0)
dataset <- within(dataset, day[day == 'Monday'] <- 1)
dataset <- within(dataset, day[day == 'Tuesday'] <- 2)
dataset <- within(dataset, day[day == 'Wednesday'] <- 3)
dataset <- within(dataset, day[day == 'Thursday'] <- 4)
dataset <- within(dataset, day[day == 'Friday'] <- 5)
dataset <- within(dataset, day[day == 'Saturday'] <- 6)

unique(dataset$day)
## [1] "4" "6" "0" "1" "2" "3"
str(dataset$day)
##  chr [1:1197] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" ...
dataset$day <- as.integer(dataset$day)
str(dataset$day)
##  int [1:1197] 4 4 4 4 4 4 4 4 4 4 ...
glimpse(dataset)
## Rows: 1,197
## Columns: 18
## $ date                  <dttm> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,…
## $ quarter               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ department            <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
## $ day                   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ team                  <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ standard_minute_value <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ work_in_progress      <dbl> 1108, 0, 968, 968, 1170, 984, 0, 795, 733, 681, …
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…
## $ quarter_cat           <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department_cat        <chr> "sewing", "finishing", "sewing", "sewing", "sewi…
## $ day_cat               <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
head(dataset)
##         date quarter department day team targeted_productivity
## 1 2015-01-01       1          0   4    8                  0.80
## 2 2015-01-01       1          1   4    1                  0.75
## 3 2015-01-01       1          0   4   11                  0.80
## 4 2015-01-01       1          0   4   12                  0.80
## 5 2015-01-01       1          0   4    6                  0.80
## 6 2015-01-01       1          0   4    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat  day_cat
## 1         sewing Thursday
## 2      finishing Thursday
## 3         sewing Thursday
## 4         sewing Thursday
## 5         sewing Thursday
## 6         sewing Thursday

A new column is added for the targeted productivity which will be used in model building and training in the following section later.

# Adding a new classification column based on actual_productivity and targeted_productivity
target <- dataset$actual_productivity < dataset$targeted_productivity + 0.001
dataset['target'] <- target

dataset <- within(dataset, target[target == TRUE] <- 1)
dataset <- within(dataset, target[target == FALSE] <- 0)


#Separate Categorical and Continous variables
G_con = dataset[ ,!sapply(dataset, is.character)]
G_cat = dataset[ ,sapply(dataset, is.character)]


# Correlation Matrix after Data Preprocessing
corrplot(cor(dataset[, 2:15]),
         method = "square",
         type = "upper",
         tl.col = "black",
         tl.cex = 0.6,
         col = colorRampPalette(c("purple", "dark green"))(200)
)

The categorical and continuous variables are also separated with new variables “G_con” and “G_cat”.

After the whole data preprocessing and feature processing is completed, another correlation matrix is generated as shown below to better visualize the correlation between the attributes. It can be seen that the question mark symbols as seen in the previous section are now instead shown with the actual correlation between attributes.

It can be seen that the correlation between standard_minute_value - no_of_workers has the highest positive correlation with standard_minute_value - over_time and over_time - no_of_workers also having high positive correlation.

It is also to be highlighted that department - standard_minute_value, department - over_time and department - no_of_workers have a high negative correlation.

6.0 EDA & Visualization

For the exploratory data analysis and visualisation, various types of charts were constructed and plotted to get a better understanding on the variables of our data as well as the relationship between them.

### Targeted Productivity over Time
dates <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$Group.1
targeted_productivity <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$x

plot_ly(
  x = dates, 
  y = targeted_productivity,
  type = "scatter",
  mode = "lines" 
) |> 
layout(yaxis = list(range = c(0.65, 0.81)), 
       title = list(text='Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Actual Productivity over Time
dates <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$Group.1
actual_productivity <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$x

plot_ly(
  x = dates, 
  y = actual_productivity,
  type = "scatter",
  mode = "lines" 
) |> 
layout(yaxis = list(range = c(0.5, 0.9)), 
       title = list(text='Actual Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Actual & Targeted Productivity over Time
plot_ly(
  x = dates, 
  y = targeted_productivity,
  type = "scatter",
  mode = "lines",
  name = "targeted_productivity"
) %>% 
add_trace(y = actual_productivity, name = "actual_productivity") %>% 
layout(legend = list(x = 0.73, y = 0.95), 
       title = list(text='Actual & Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Ratio of Actual to Targeted Productivity over Time
prod_diff = actual_productivity - targeted_productivity
prod_diff
##  [1] -0.0339209786  0.0265828621  0.0039838318  0.0232998741  0.0221744682
##  [6] -0.0178785815 -0.0033075971  0.0167596471 -0.0183467855  0.0138895031
## [11]  0.0260206630 -0.0154420297  0.0607074814 -0.0168894555 -0.0545498789
## [16]  0.0107424270  0.0010644784  0.0115748421 -0.0258462651  0.0438446866
## [21]  0.0583350081  0.0931096937  0.0840465143  0.0337149014  0.0791443043
## [26]  0.1320382726  0.0472739451 -0.0311872794  0.0315611188  0.0563255852
## [31]  0.0876255877  0.0092534128  0.0014276525  0.0300904176  0.0470107688
## [36]  0.0790646426 -0.0204833649 -0.0359534660 -0.0382238699 -0.0206541136
## [41] -0.0137440371 -0.0624003944 -0.0484842466 -0.0820019806 -0.0762448567
## [46] -0.0430553499 -0.0285546069 -0.0730003276 -0.0580416762 -0.0462517230
## [51] -0.0002677061 -0.0452763715 -0.0122840756  0.0131424168  0.0165971511
## [56]  0.0328347056  0.0128943875  0.0352160328 -0.0136933439
plot_ly(
  x = dates, 
  y = prod_diff,
  type = "scatter",
  mode = "lines" 
) |> 
layout(title = list(text='Ratio of Actual to Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
# Distribution of Work in Progress vs day of the week
data_frame_wip <- subset(dataset, work_in_progress > 0)
nrow(data_frame_wip)
## [1] 691
result <- aggregate(data_frame_wip$work_in_progress, by=list(data_frame_wip$day_cat), mean)
result
##     Group.1         x
## 1    Monday 2156.3103
## 2  Saturday 1003.0192
## 3    Sunday 1016.1983
## 4  Thursday  974.0254
## 5   Tuesday 1012.4831
## 6 Wednesday  973.7731
barplot(result$x, names.arg=result$Group.1, xlab="Weekdays", ylab="Total Work in Progress", col=rainbow(6),
        main="Total Work in Progress by Weekdays",border="black")

# Quarter & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$quarter_cat), mean)
result
##    Group.1        x
## 1 Quarter1 4480.917
## 2 Quarter2 4355.015
## 3 Quarter3 4896.000
## 4 Quarter4 4851.250
## 5 Quarter5 3725.455
barplot(result$x, names.arg=result$Group.1, xlab="Quarter", ylab="Average Worktime", col=rainbow(6),
        main="Quarter vs Overtime",border="black")

# Team & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$team), mean)
result
##    Group.1        x
## 1        1 4793.429
## 2        2 4384.954
## 3        3 5375.684
## 4        4 5449.714
## 5        5 5330.968
## 6        6 3369.096
## 7        7 4857.188
## 8        8 4312.294
## 9        9 4519.038
## 10      10 4736.700
## 11      11 4342.500
## 12      12 3317.929
barplot(result$x, names.arg=result$Group.1, xlab="Team", ylab="Average Overtime", col=rainbow(12),
        main="Team vs Overtime",border="black")

# Stacked Bar Plot with Colors and Legend
counts <- table(G_cat$department,G_cat$day)
barplot(counts, main="Department vs day",
        xlab="Number of Departments", col=c("darkblue","red"),
        legend = rownames(counts))

# Quarter vs Department
counts <- table(G_cat$department,G_cat$quarter)
barplot(counts, main="Quarter vs Department",
        xlab="Number of Departments", col=c("pink","lightblue"),
        legend = rownames(counts))

# Quarter Propagation
PieChart(quarter_cat, hole = 0, values = "%", data = dataset,
         fill = rainbow(5), main = "Quarter Propagation")
## >>> Note: quarter_cat is not in a data frame (table)
## >>> Note: quarter_cat is not in a data frame (table)

## >>> suggestions
## piechart(quarter_cat, hole=0)  # traditional pie chart
## piechart(quarter_cat, values="%")  # display %'s on the chart
## piechart(quarter_cat)  # bar chart
## plot(quarter_cat)  # bubble plot
## plot(quarter_cat, values="count")  # lollipop plot 
## 
## --- quarter_cat --- 
## 
##                Quarter1  Quarter2  Quarter3  Quarter4  Quarter5      Total 
## Frequencies:        360       335       210       248        44       1197 
## Proportions:      0.301     0.280     0.175     0.207     0.037      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 262.336, df = 4, p-value = 0.000
### Department Propagation
PieChart(department_cat, hole = 0, values = "%", data = dataset,
         fill = c("lightblue", "blue"), main = "Department Propagation")
## >>> Note: department_cat is not in a data frame (table)
## >>> Note: department_cat is not in a data frame (table)

## >>> suggestions
## piechart(department_cat, hole=0)  # traditional pie chart
## piechart(department_cat, values="%")  # display %'s on the chart
## piechart(department_cat)  # bar chart
## plot(department_cat)  # bubble plot
## plot(department_cat, values="count")  # lollipop plot 
## 
## --- department_cat --- 
## 
##                finishing  sewing      Total 
## Frequencies:         506     691       1197 
## Proportions:       0.423   0.577      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 28.592, df = 1, p-value = 0.000
# Day Propagation
PieChart(day_cat, hole = 0.6, values = "%", data = dataset,
         fill = rainbow(5), main = "Day Propagation")
## >>> Note: day_cat is not in a data frame (table)
## >>> Note: day_cat is not in a data frame (table)

## >>> suggestions
## piechart(day_cat, hole=0)  # traditional pie chart
## piechart(day_cat, values="%")  # display %'s on the chart
## piechart(day_cat)  # bar chart
## plot(day_cat)  # bubble plot
## plot(day_cat, values="count")  # lollipop plot 
## 
## --- day_cat --- 
## 
##                Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday      Total 
## Frequencies:      199       187     203       199      201        208       1197 
## Proportions:    0.166     0.156   0.170     0.166    0.168      0.174      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 1.221, df = 5, p-value = 0.943
# Team Propagation
PieChart(team, hole = 0.8, values = "%", data = dataset,
         fill = rainbow(12), main = "Team Propagation")

## >>> suggestions
## piechart(team, hole=0)  # traditional pie chart
## piechart(team, values="%")  # display %'s on the chart
## piechart(team)  # bar chart
## plot(team)  # bubble plot
## plot(team, values="count")  # lollipop plot 
## 
## --- team --- 
## 
##  team  Count   Prop 
## -------------------- 
##     1    105   0.088 
##     2    109   0.091 
##     3     95   0.079 
##     4    105   0.088 
##     5     93   0.078 
##     6     94   0.079 
##     7     96   0.080 
##     8    109   0.091 
##     9    104   0.087 
##    10    100   0.084 
##    11     88   0.074 
##    12     99   0.083 
## -------------------- 
## Total   1197   1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 4.995, df = 11, p-value = 0.931
# Generate box plots of all numeric variables
new_df_num <- data.frame(dataset$targeted_productivity,
                           dataset$standard_minute_value,
                           dataset$work_in_progress,
                           dataset$over_time,
                           dataset$incentive,
                           dataset$idle_time,
                           dataset$idle_men,
                           dataset$no_of_style_change,
                           dataset$no_of_workers,
                           dataset$actual_productivity)
col <- c("targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "actual_productivity")

colnames(new_df_num) <- col

ggplot(gather(new_df_num), aes(key,value)) +
  geom_boxplot(color="purple", fill="red") +
  facet_wrap(~key, scales="free") +
  labs(title = "Box plots showing the distribution of all variables", 
       subtitle ="except the first 3 categorical variables")

par(mfrow = c(1, 1))


# Plot a histogram or bar chart of each numeric variable
ggplot(gather(new_df_num), aes(value)) +
  geom_histogram(bins=20,color='blue', fill='pink') +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Histograms showing the distribution of all variables", 
       subtitle ="except the first 3 categorical variables")

According to the diagram above, the overtime boxplot comparatively is smaller than the others. This suggests that over time did not really contribute to the garment productivity. On the other hand, the number of workers boxplot comparatively is larger than the others. This suggests that the number of workers played an important role in the garment productivity. Moreover, if one were to compare the median of the targeted productivity and actual productivity, it can be observed that the actual productivity has a slightly higher median compared to the targeted productivity. Therefore, just by looking at this box-plot it can be infered that the actual productivity was more than the targeted productivity. It can be observed the same in Figure showing the Line graph of “Actual and Targeted Productivity over Time” whereby the actual productivity had a big difference from 25th January to 8th February.

par(mfrow = c(1, 1))


# Plot a bar chart for each of the first 4 categorical variables
new_df_cat <- data.frame(dataset$day_cat,
                         dataset$department_cat,
                         dataset$quarter_cat,
                         dataset$team)
col <- c("day_cat", "department_cat", "quarter_cat", "team")

colnames(new_df_cat) <- col

ggplot(gather(new_df_cat), aes(value)) +
  geom_bar(color='black', fill='green') +
  facet_wrap(~key, scales ='free') +
  labs(title = "Bar charts showing the distribution of the first 4 categorical variables")

par(mfrow = c(1, 1))


# Actual productivity based on dates
ggplot(dataset,aes(x = reorder(date,actual_productivity,median),y = actual_productivity)) +
  geom_boxplot(outlier.shape=NA, mapping=aes(fill=quarter)) +
  theme(legend.position = "bottom", legend.box = "horizontal") +
  labs(x="Date (ordered by median actual_productivity)", title="Boxplots showing actual_productivity distribution for all dates") +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

# Relationship between Actual and Targeted Productivity
ggplot(dataset, aes(x=targeted_productivity,y=actual_productivity, color=department_cat)) +
  geom_point() +
  geom_smooth(method=lm,se=FALSE) +
  labs(title = "Scatterplot showing the relationship between actual productivity and targeted_productivity")
## `geom_smooth()` using formula = 'y ~ x'

# Investigate the relationship between actual_productivity and no_of_workers
scatterplot(actual_productivity ~ no_of_workers, data=dataset, 
            main="Scatterplot showing relationship between actual_productivity and no_of_workers", grid=FALSE)

Based on the above diagram it can be inferred that the sewing department contributed more to the garment productivity compared to the finishing department. This is because the slope of the sewing department has a stronger linear relationship compared to the finishing department.

7.0 Feature Importance & Selection

7.1 Feature Importance & Selection for Regression Model

new_df_reg <- data.frame(dataset$quarter, 
                     dataset$department,
                     dataset$day,
                     dataset$team,
                     dataset$targeted_productivity,
                     dataset$standard_minute_value,
                     dataset$work_in_progress,
                     dataset$over_time,
                     dataset$incentive,
                     dataset$idle_time,
                     dataset$idle_men,
                     dataset$no_of_style_change,
                     dataset$no_of_workers,
                     dataset$actual_productivity)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "actual_productivity")

colnames(new_df_reg) <- col


cl <- makePSOCKcluster(5)
registerDoParallel(cl)

start.time <- proc.time()
Model <- train(actual_productivity ~ ., 
               data = new_df_reg, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
##    user  system elapsed 
##    1.60    0.02   47.11
stopCluster(cl)

Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Regression Model")

As illustrated in the plot, it can be observed that idle_time has the lowest feature importance score. Therefore, this variable can be dropped for the regression model.

drops <- c("idle_time")
new_df_reg <- new_df_reg[ , !(names(new_df_reg) %in% drops)]

7.2 Feature Importance & Selection for Classification Model

new_df_class <- data.frame(dataset$quarter, 
                     dataset$department,
                     dataset$day,
                     dataset$team,
                     dataset$targeted_productivity,
                     dataset$standard_minute_value,
                     dataset$work_in_progress,
                     dataset$over_time,
                     dataset$incentive,
                     dataset$idle_time,
                     dataset$idle_men,
                     dataset$no_of_style_change,
                     dataset$no_of_workers,
                     dataset$target)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "target")

colnames(new_df_class) <- col


cl <- makePSOCKcluster(5)
registerDoParallel(cl)

start.time <- proc.time()
Model <- train(target ~ ., 
               data = new_df_class, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
##    user  system elapsed 
##    1.17    0.01   46.11
stopCluster(cl)

Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Classification Model")

From the diagram above, it can be stated that idle_men has the lowest feature importance score followed by idle time. Therefore, these two variables can be dropped for the classification model.

drops <- c("idle_time", "idle_men")
new_df_class <- new_df_class[ , !(names(new_df_class) %in% drops)]

8.0 Model Evaluation & Selection

8.1 Regression Model Selection

set.seed(100)

TrainingIndex <- createDataPartition(new_df_reg$actual_productivity, p=0.8, list = FALSE)
TrainingSet <- new_df_reg[TrainingIndex,]
TestingSet <- new_df_reg[-TrainingIndex,]


# Linear Regression:
lin_model <- train(actual_productivity ~ ., data = TrainingSet,
               method = "lm",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"))

lin_MAE <- mae(TestingSet$actual_productivity, predict(lin_model, TestingSet))


# Ridge regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                "no_of_workers")])
y <- TrainingSet$actual_productivity

cv_model <- cv.glmnet(x, y, alpha = 0)
best_lambda <- cv_model$lambda.min
ridge_mod = glmnet(x, y, alpha = 0, lambda = best_lambda)

x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                 "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                 "no_of_workers")])
rid_MAE <- mae(TestingSet$actual_productivity, predict(ridge_mod, s = best_lambda, newx = x))


# Lasso regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                 "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                 "no_of_workers")])
y <- TrainingSet$actual_productivity

lambdas <- 10^seq(2, -3, by = -.1)
lasso_mod = glmnet(x, y, alpha = 1, lambda = lambdas)

x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                "no_of_workers")])
lasso_MAE <- mae(TestingSet$actual_productivity, predict(lasso_mod, newx = x))


# XGBoost
train_x = data.matrix(TrainingSet[, -13])
train_y = TrainingSet[,13]
test_x = data.matrix(TestingSet[, -13])
test_y = TestingSet[, 13]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

xgb_model = xgboost(data = xgb_train, max.depth = 3, nrounds = 56, verbose = 0)

xgb_MAE <- mae(test_y, predict(xgb_model, test_x))
print(paste0("Linear MAE: ", lin_MAE))
## [1] "Linear MAE: 0.111510357523804"
print(paste0("Ridge MAE: ", rid_MAE))
## [1] "Ridge MAE: 0.111521518701029"
print(paste0("Lasso MAE: ", lasso_MAE))
## [1] "Lasso MAE: 0.129368765462842"
print(paste0("XGBoost MAE: ", xgb_MAE))
## [1] "XGBoost MAE: 0.0842882112443574"

Regression Model Performance Table

regression_algorithm_name <- c('Linear Regression','Ridge Regression','Lasso Regression','XGBoost', "")
mae <- c(lin_MAE, rid_MAE, lasso_MAE, xgb_MAE, "")

rowName <- c('Model 1', 'Model 2', 'Model 3', 'Model 4', "")
model_result <- data.frame(regression_algorithm_name, mae, row.names = rowName)

knitr::kable(model_result, align = "lccrr")
regression_algorithm_name mae
Model 1 Linear Regression 0.111510357523804
Model 2 Ridge Regression 0.111521518701029
Model 3 Lasso Regression 0.129368765462842
Model 4 XGBoost 0.0842882112443574

Based on the table above, XGboost has the lowest mean absolute error of 0.084. Therefore, XGboost would be the most suitable regression algorithm to use for this study. Other than that, Lasso regression has the highest mean absolute error of 0.129. Therefore, lasso regression would not be a suitable regression algorithm for this study.

8.2 Classification Model Selection

TrainingIndex <- createDataPartition(new_df_class$target, p=0.8, list = FALSE)
TrainingSet <- new_df_class[TrainingIndex,]
TestingSet <- new_df_class[-TrainingIndex,]

# Logistic Regression
logic_model <- glm(target ~., data = TrainingSet, family = binomial) %>%
  stepAIC(trace = FALSE)

probabilities <- logic_model %>% predict(TestingSet, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
logic_accurecy <- mean(predicted.classes==TestingSet$target)


# Random Forest
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

rf_model <- train(target ~ ., 
               data = TrainingSet, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5)))

probabilities <- rf_model %>% predict(TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
rm_accurecy <- mean(predicted.classes==TestingSet$target)


# SVM
svm_model <- train(target ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"),
               tuneGrid = data.frame(degree=1,scale=1,C=1))

probabilities <-predict(svm_model, TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
svm_accurecy <- mean(predicted.classes==TestingSet$target)
print(paste0("Logistic Accurecy: ", logic_accurecy))
## [1] "Logistic Accurecy: 0.719665271966527"
print(paste0("Random Forest Accurecy: ", rm_accurecy))
## [1] "Random Forest Accurecy: 0.832635983263598"
print(paste0("SVM Accurecy: ", svm_accurecy))
## [1] "SVM Accurecy: 0.635983263598326"

Classification Model Performance Table

classification_algorithm_name <- c('Logistic Regression','Random Forest','SVM', "")
accurecy <- c(logic_accurecy, rm_accurecy, svm_accurecy, "")

rowName <- c('Model 1', 'Model 2', 'Model 3', "")
model_result <- data.frame(classification_algorithm_name, accurecy, row.names = rowName)

knitr::kable(model_result, align = "lccrr")
classification_algorithm_name accurecy
Model 1 Logistic Regression 0.719665271966527
Model 2 Random Forest 0.832635983263598
Model 3 SVM 0.635983263598326

Based on the table above, random forest has the highest accuracy of 83.26%. Therefore, random forest would be the most suitable classification algorithm for this study. On the other hand, support vector machine has an accuracy of 63.60%. Therefore, support vector machine would not be suitable classification algorithm for this study.

9.0 Prediction & Results

In this section, prediction will be conducted using the selected models.

9.1 Regression Prediction (Objective 2 Solution)

test_data <- data.frame(quarter=1, 
                        department=0, 
                        day=4, 
                        team=8, 
                        targeted_productivity=0.80, 
                        standard_minute_value=26.16, 
                        work_in_progress=1108, 
                        over_time=7080, 
                        incentive=98, 
                        idle_men=0, 
                        no_of_style_change=0, 
                        no_of_workers=59)

print(paste0("Actual Production will be: ", predict(xgb_model, as.matrix(test_data))))
## [1] "Actual Production will be: 0.947649240493774"

As per the second objective in regards to build, evaluate and choose the best regression model to predict garment employee’s productivity, the actual production as obtained by the model is 94.76%.

9.2 Classification Prediction (Objective 3 Solution)

test_data <- data.frame(quarter=1, 
                        department=0, 
                        day=4, 
                        team=7, 
                        targeted_productivity=0.80, 
                        standard_minute_value=25.90, 
                        work_in_progress=984, 
                        over_time=6720, 
                        incentive=38, 
                        no_of_style_change=0, 
                        no_of_workers=56)

predict_val <- predict(rf_model, as.matrix(test_data))
prediction <- ifelse(predict_val > 0.5, "YES", "NO")

print(paste0("Is the Actual Productivity reached the Target Productivity: ", prediction))
## [1] "Is the Actual Productivity reached the Target Productivity: YES"

Following with the third objective on whether actual productivity has reached the target productivity, the selected classification model has provided “Yes” as the answer.

10.0 Conclusion

In approaching model selection, different selection criteria are being used. For regression problem, we prioritize mean absolute error in our model selection. Mean absolute error will measure average of errors in the forecasts. As comparison between 4 models mean absolute value, Linear regression(0.11), Ridge regression (0.11), Lasso regression (0.120), XGboost(0.08), the lowest MAE is chosen. Hence, we are adopting XGboost for our regression problem.

Meanwhile, for classification problem, accuracy of the prediction is being used as criteria selection. Generally, the higher accuracy, the better model will be. Comparison from 3 training model accuracy, Logistic regression(0.71), Random Forest(0.83), SVM(0.63), Random Forest are being adopted.

Both model selections will help the garment prediction to achieve optimum productivity.

11.0 References

Chaerani, N. (2018). Peran International Labour Organization Terhadap Peningkatan Lingkungan Kerja Di Sektor Industri Garmen Di Bangladesh. Universitas Hasanuddin, 151(2), 10–17.

M Saiful Islam, Rakib, M. A., & Adnan, A. (2019). Ready-Made Garments Sector of Bangladesh: Its Growth, Contribution, and Challenges. Economics World, 7(1). https://doi.org/10.17265/2328- 7144/2019.01.004

Mahesh, Batta. 2019. Machine Learning Algorithms - A Review DOI: 10.21275/ART20203995

Mubarok, N. (2017). Strategi Pemasaran Islami Dalam Meningkatkan Penjualan Pada Butik Calista. I-Economics, 3(1), 73–92.

Sri, D., & Margareta, C. (2020). Pengaruh Pelatihan Kewirausahaan , Kemampuan Memanfaatkan Teknologi Dan Pendidikan Terhadap Produktifitas Wanita. Economic and Education Journal, 42, 142–158.

United, F. (2021). Global fashion industry statistics - International apparel. Https://Fashionunited.Com/. https://fashionunited.com/global-fashion[1]industry-statistics/

Zhang, Yagang. 2010. New advances in machine learning (BoD–Books on Demand).