Introduction

Project Sections:

  1. Objectives
  2. Dataset Summary & Explanation
  3. Data Preprocessing & Cleaning
  4. Feature Engineering
  5. EDA & Visualization
  6. Feature Importance & Selection
  7. Model Evaluation & Selection
  8. Prediction
  9. Conclusion

1. Objectives

Will try to solve the following questions:

  1. Garments productivity prediction. (Number) Here we will use Regression models. Ex. Linear or Lasso or Ridge or Decision tree or Random Forest or XGBoost.

  2. Whether actual productivity reached the target productivity or not (Yes or No). Here we will use classification models. Ex. Logistic or Random Forest or SVM

2. Dataset Summary & Explanation

# Import Dataset
dataset <- read.csv("C:\\Users\\tuhin\\MY PRACTICE\\R-Programming\\R ML Models\\1.0 DataSet ML Model\\Garments Productivity\\garments_worker_productivity.csv")

head(dataset)        # first 6 rows of dataset
##       date  quarter department      day team targeted_productivity   smv  wip
## 1 1/1/2015 Quarter1     sweing Thursday    8                  0.80 26.16 1108
## 2 1/1/2015 Quarter1 finishing  Thursday    1                  0.75  3.94   NA
## 3 1/1/2015 Quarter1     sweing Thursday   11                  0.80 11.41  968
## 4 1/1/2015 Quarter1     sweing Thursday   12                  0.80 11.41  968
## 5 1/1/2015 Quarter1     sweing Thursday    6                  0.80 25.90 1170
## 6 1/1/2015 Quarter1     sweing Thursday    7                  0.80 25.90  984
##   over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1      7080        98         0        0                  0          59.0
## 2       960         0         0        0                  0           8.0
## 3      3660        50         0        0                  0          30.5
## 4      3660        50         0        0                  0          30.5
## 5      1920        50         0        0                  0          56.0
## 6      6720        38         0        0                  0          56.0
##   actual_productivity
## 1           0.9407254
## 2           0.8865000
## 3           0.8005705
## 4           0.8005705
## 5           0.8003819
## 6           0.8001250
summary(dataset)     # summary of dataset
##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       team        targeted_productivity      smv             wip         
##  Min.   : 1.000   Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  1st Qu.: 3.000   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Median : 6.000   Median :0.7500        Median :15.26   Median : 1039.0  
##  Mean   : 6.427   Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##  3rd Qu.: 9.000   3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##  Max.   :12.000   Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                         NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
## 
glimpse(dataset)     # show the features data types and values
## Rows: 1,197
## Columns: 15
## $ date                  <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", ~
## $ quarter               <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department            <chr> "sweing", "finishing ", "sweing", "sweing", "swe~
## $ day                   <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
## $ team                  <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ smv                   <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ wip                   <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681~
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~
skim(dataset)        # show more details about features (Missing Values)
Data summary
Name dataset
Number of rows 1197
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 8 9 0 59 0
quarter 0 1 8 8 0 5 0
department 0 1 6 10 0 3 0
day 0 1 6 9 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
team 0 1.00 6.43 3.46 1.00 3.00 6.00 9.00 12.00 ▇▅▅▆▇
targeted_productivity 0 1.00 0.73 0.10 0.07 0.70 0.75 0.80 0.80 ▁▁▁▁▇
smv 0 1.00 15.06 10.94 2.90 3.94 15.26 24.26 54.56 ▇▅▃▁▁
wip 506 0.58 1190.47 1837.46 7.00 774.50 1039.00 1252.50 23122.00 ▇▁▁▁▁
over_time 0 1.00 4567.46 3348.82 0.00 1440.00 3960.00 6960.00 25920.00 ▇▆▁▁▁
incentive 0 1.00 38.21 160.18 0.00 0.00 0.00 50.00 3600.00 ▇▁▁▁▁
idle_time 0 1.00 0.73 12.71 0.00 0.00 0.00 0.00 300.00 ▇▁▁▁▁
idle_men 0 1.00 0.37 3.27 0.00 0.00 0.00 0.00 45.00 ▇▁▁▁▁
no_of_style_change 0 1.00 0.15 0.43 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
no_of_workers 0 1.00 34.61 22.20 2.00 9.00 34.00 57.00 89.00 ▇▂▂▇▁
actual_productivity 0 1.00 0.74 0.17 0.23 0.65 0.77 0.85 1.12 ▂▂▆▇▂
# Correlation Marix before data cleaning
corrplot(cor(dataset[, 5:15]),
  method = "square",
  type = "upper",
  tl.col = "black",
  tl.cex = 0.6,
  col = colorRampPalette(c("purple", "dark green"))(200)
)

3. Data Preprocessing & Cleaning

# Renaming the columns name
colnames(dataset)[7] = "standard_minute_value"
colnames(dataset)[8] = "work_in_progress"


# Changing Department value Sweing to Sewing
unique(dataset[c("department")])
##    department
## 1      sweing
## 2  finishing 
## 19  finishing
dataset <- within(dataset, department[department == 'sweing'] <- 'sewing')
dataset <- within(dataset, department[department == 'finishing '] <- 'finishing')
unique(dataset[c("department")])
##   department
## 1     sewing
## 2  finishing
# check the missing values appearance
sum(is.na(dataset))
## [1] 506
colSums(is.na(dataset))
##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
## standard_minute_value      work_in_progress             over_time 
##                     0                   506                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0
missmap(dataset, col = c("red", "blue"), legend = FALSE)

As we can see work_in_progress has a high proportion of NaN values. So we are

going to fill that cell as 0 that means there is no number of items currently being worked on.

dataset[is.na(dataset)] <- 0
colSums(is.na(dataset))
##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
## standard_minute_value      work_in_progress             over_time 
##                     0                     0                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0
# Shape of dataset
nrow(dataset)
## [1] 1197
ncol(dataset)
## [1] 15

As our dataset is small so we are not going to remove outliers

4. Feature Engineering

# Date Data Type Conversion
sapply(dataset, class)
##                  date               quarter            department 
##           "character"           "character"           "character" 
##                   day                  team targeted_productivity 
##           "character"             "integer"             "numeric" 
## standard_minute_value      work_in_progress             over_time 
##             "numeric"             "numeric"             "integer" 
##             incentive             idle_time              idle_men 
##             "integer"             "numeric"             "integer" 
##    no_of_style_change         no_of_workers   actual_productivity 
##             "integer"             "numeric"             "numeric"
dataset$date <- as.POSIXct( dataset$date, format="%m/%d/%Y" )
sapply(dataset, class)
## $date
## [1] "POSIXct" "POSIXt" 
## 
## $quarter
## [1] "character"
## 
## $department
## [1] "character"
## 
## $day
## [1] "character"
## 
## $team
## [1] "integer"
## 
## $targeted_productivity
## [1] "numeric"
## 
## $standard_minute_value
## [1] "numeric"
## 
## $work_in_progress
## [1] "numeric"
## 
## $over_time
## [1] "integer"
## 
## $incentive
## [1] "integer"
## 
## $idle_time
## [1] "numeric"
## 
## $idle_men
## [1] "integer"
## 
## $no_of_style_change
## [1] "integer"
## 
## $no_of_workers
## [1] "numeric"
## 
## $actual_productivity
## [1] "numeric"
head(dataset)
##         date  quarter department      day team targeted_productivity
## 1 2015-01-01 Quarter1     sewing Thursday    8                  0.80
## 2 2015-01-01 Quarter1  finishing Thursday    1                  0.75
## 3 2015-01-01 Quarter1     sewing Thursday   11                  0.80
## 4 2015-01-01 Quarter1     sewing Thursday   12                  0.80
## 5 2015-01-01 Quarter1     sewing Thursday    6                  0.80
## 6 2015-01-01 Quarter1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity
## 1                  0          59.0           0.9407254
## 2                  0           8.0           0.8865000
## 3                  0          30.5           0.8005705
## 4                  0          30.5           0.8005705
## 5                  0          56.0           0.8003819
## 6                  0          56.0           0.8001250
# Quarter Data Type Conversion
unique(dataset[, 2])
## [1] "Quarter1" "Quarter2" "Quarter3" "Quarter4" "Quarter5"
quarter_cat <- dataset[, 2]
dataset <- cbind(dataset, quarter_cat)
head(dataset)
##         date  quarter department      day team targeted_productivity
## 1 2015-01-01 Quarter1     sewing Thursday    8                  0.80
## 2 2015-01-01 Quarter1  finishing Thursday    1                  0.75
## 3 2015-01-01 Quarter1     sewing Thursday   11                  0.80
## 4 2015-01-01 Quarter1     sewing Thursday   12                  0.80
## 5 2015-01-01 Quarter1     sewing Thursday    6                  0.80
## 6 2015-01-01 Quarter1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
dataset <- within(dataset, quarter[quarter == 'Quarter1'] <- 1)
dataset <- within(dataset, quarter[quarter == 'Quarter2'] <- 2)
dataset <- within(dataset, quarter[quarter == 'Quarter3'] <- 3)
dataset <- within(dataset, quarter[quarter == 'Quarter4'] <- 4)
dataset <- within(dataset, quarter[quarter == 'Quarter5'] <- 5)

unique(dataset[, 2])
## [1] "1" "2" "3" "4" "5"
str(dataset[, 2])
##  chr [1:1197] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
dataset$quarter <- as.integer(dataset$quarter)
str(dataset[, 2])
##  int [1:1197] 1 1 1 1 1 1 1 1 1 1 ...
# Department Data Type Conversion
unique(dataset$department)
## [1] "sewing"    "finishing"
department_cat <- dataset$department
dataset <- cbind(dataset, department_cat)
head(dataset)
##         date quarter department      day team targeted_productivity
## 1 2015-01-01       1     sewing Thursday    8                  0.80
## 2 2015-01-01       1  finishing Thursday    1                  0.75
## 3 2015-01-01       1     sewing Thursday   11                  0.80
## 4 2015-01-01       1     sewing Thursday   12                  0.80
## 5 2015-01-01       1     sewing Thursday    6                  0.80
## 6 2015-01-01       1     sewing Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat
## 1         sewing
## 2      finishing
## 3         sewing
## 4         sewing
## 5         sewing
## 6         sewing
dataset <- within(dataset, department[department == 'sewing'] <- 0)
dataset <- within(dataset, department[department == 'finishing'] <- 1)

unique(dataset$department)
## [1] "0" "1"
str(dataset$department)
##  chr [1:1197] "0" "1" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "1" "1" ...
dataset$department <- as.integer(dataset$department)
str(dataset$department)
##  int [1:1197] 0 1 0 0 0 0 1 0 0 0 ...
# Day Data Type Conversion
unique(dataset$day)
## [1] "Thursday"  "Saturday"  "Sunday"    "Monday"    "Tuesday"   "Wednesday"
day_cat <- dataset$day
dataset <- cbind(dataset, day_cat)
head(dataset)
##         date quarter department      day team targeted_productivity
## 1 2015-01-01       1          0 Thursday    8                  0.80
## 2 2015-01-01       1          1 Thursday    1                  0.75
## 3 2015-01-01       1          0 Thursday   11                  0.80
## 4 2015-01-01       1          0 Thursday   12                  0.80
## 5 2015-01-01       1          0 Thursday    6                  0.80
## 6 2015-01-01       1          0 Thursday    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat  day_cat
## 1         sewing Thursday
## 2      finishing Thursday
## 3         sewing Thursday
## 4         sewing Thursday
## 5         sewing Thursday
## 6         sewing Thursday
dataset <- within(dataset, day[day == 'Sunday'] <- 0)
dataset <- within(dataset, day[day == 'Monday'] <- 1)
dataset <- within(dataset, day[day == 'Tuesday'] <- 2)
dataset <- within(dataset, day[day == 'Wednesday'] <- 3)
dataset <- within(dataset, day[day == 'Thursday'] <- 4)
dataset <- within(dataset, day[day == 'Friday'] <- 5)
dataset <- within(dataset, day[day == 'Saturday'] <- 6)

unique(dataset$day)
## [1] "4" "6" "0" "1" "2" "3"
str(dataset$day)
##  chr [1:1197] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" ...
dataset$day <- as.integer(dataset$day)
str(dataset$day)
##  int [1:1197] 4 4 4 4 4 4 4 4 4 4 ...
glimpse(dataset)
## Rows: 1,197
## Columns: 18
## $ date                  <dttm> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,~
## $ quarter               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ department            <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, ~
## $ day                   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ team                  <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ standard_minute_value <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ work_in_progress      <dbl> 1108, 0, 968, 968, 1170, 984, 0, 795, 733, 681, ~
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~
## $ quarter_cat           <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department_cat        <chr> "sewing", "finishing", "sewing", "sewing", "sewi~
## $ day_cat               <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
head(dataset)
##         date quarter department day team targeted_productivity
## 1 2015-01-01       1          0   4    8                  0.80
## 2 2015-01-01       1          1   4    1                  0.75
## 3 2015-01-01       1          0   4   11                  0.80
## 4 2015-01-01       1          0   4   12                  0.80
## 5 2015-01-01       1          0   4    6                  0.80
## 6 2015-01-01       1          0   4    7                  0.80
##   standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1                 26.16             1108      7080        98         0        0
## 2                  3.94                0       960         0         0        0
## 3                 11.41              968      3660        50         0        0
## 4                 11.41              968      3660        50         0        0
## 5                 25.90             1170      1920        50         0        0
## 6                 25.90              984      6720        38         0        0
##   no_of_style_change no_of_workers actual_productivity quarter_cat
## 1                  0          59.0           0.9407254    Quarter1
## 2                  0           8.0           0.8865000    Quarter1
## 3                  0          30.5           0.8005705    Quarter1
## 4                  0          30.5           0.8005705    Quarter1
## 5                  0          56.0           0.8003819    Quarter1
## 6                  0          56.0           0.8001250    Quarter1
##   department_cat  day_cat
## 1         sewing Thursday
## 2      finishing Thursday
## 3         sewing Thursday
## 4         sewing Thursday
## 5         sewing Thursday
## 6         sewing Thursday
# Adding a new classification column based on actual_productivity and targeted_productivity
target <- dataset$actual_productivity < dataset$targeted_productivity + 0.001
dataset['target'] <- target

dataset <- within(dataset, target[target == TRUE] <- 1)
dataset <- within(dataset, target[target == FALSE] <- 0)


#Separate Categorical and Continous variables
G_con = dataset[ ,!sapply(dataset, is.character)]
G_cat = dataset[ ,sapply(dataset, is.character)]


# Correlation Matrix after Data Preprocessing
corrplot(cor(dataset[, 2:15]),
         method = "square",
         type = "upper",
         tl.col = "black",
         tl.cex = 0.6,
         col = colorRampPalette(c("purple", "dark green"))(200)
)

5. EDA & Visualization

### Targeted Productivity over Time
dates <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$Group.1
targeted_productivity <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$x

plot_ly(
  x = dates, 
  y = targeted_productivity,
  type = "scatter",
  mode = "lines" 
) |> 
layout(yaxis = list(range = c(0.65, 0.81)), 
       title = list(text='Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Actual Productivity over Time
dates <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$Group.1
actual_productivity <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$x

plot_ly(
  x = dates, 
  y = actual_productivity,
  type = "scatter",
  mode = "lines" 
) |> 
layout(yaxis = list(range = c(0.5, 0.9)), 
       title = list(text='Actual Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Actual & Targeted Productivity over Time
plot_ly(
  x = dates, 
  y = targeted_productivity,
  type = "scatter",
  mode = "lines",
  name = "targeted_productivity"
) %>% 
add_trace(y = actual_productivity, name = "actual_productivity") %>% 
layout(legend = list(x = 0.73, y = 0.95), 
       title = list(text='Actual & Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
### Ratio of Actual to Targeted Productivity over Time
prod_diff = actual_productivity - targeted_productivity
prod_diff
##  [1] -0.0339209786  0.0265828621  0.0039838318  0.0232998741  0.0221744682
##  [6] -0.0178785815 -0.0033075971  0.0167596471 -0.0183467855  0.0138895031
## [11]  0.0260206630 -0.0154420297  0.0607074814 -0.0168894555 -0.0545498789
## [16]  0.0107424270  0.0010644784  0.0115748421 -0.0258462651  0.0438446866
## [21]  0.0583350081  0.0931096937  0.0840465143  0.0337149014  0.0791443043
## [26]  0.1320382726  0.0472739451 -0.0311872794  0.0315611188  0.0563255852
## [31]  0.0876255877  0.0092534128  0.0014276525  0.0300904176  0.0470107688
## [36]  0.0790646426 -0.0204833649 -0.0359534660 -0.0382238699 -0.0206541136
## [41] -0.0137440371 -0.0624003944 -0.0484842466 -0.0820019806 -0.0762448567
## [46] -0.0430553499 -0.0285546069 -0.0730003276 -0.0580416762 -0.0462517230
## [51] -0.0002677061 -0.0452763715 -0.0122840756  0.0131424168  0.0165971511
## [56]  0.0328347056  0.0128943875  0.0352160328 -0.0136933439
plot_ly(
  x = dates, 
  y = prod_diff,
  type = "scatter",
  mode = "lines" 
) |> 
layout(title = list(text='Ratio of Actual to Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor =  'top'))
# Distribution of Work in Progress vs day of the week
data_frame_wip <- subset(dataset, work_in_progress > 0)
nrow(data_frame_wip)
## [1] 691
result <- aggregate(data_frame_wip$work_in_progress, by=list(data_frame_wip$day_cat), mean)
result
##     Group.1         x
## 1    Monday 2156.3103
## 2  Saturday 1003.0192
## 3    Sunday 1016.1983
## 4  Thursday  974.0254
## 5   Tuesday 1012.4831
## 6 Wednesday  973.7731
barplot(result$x, names.arg=result$Group.1, xlab="Weekdays", ylab="Total Work in Progress", col=rainbow(6),
        main="Total Work in Progress by Weekdays",border="black")

# Quarter & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$quarter_cat), mean)
result
##    Group.1        x
## 1 Quarter1 4480.917
## 2 Quarter2 4355.015
## 3 Quarter3 4896.000
## 4 Quarter4 4851.250
## 5 Quarter5 3725.455
barplot(result$x, names.arg=result$Group.1, xlab="Quarter", ylab="Average Worktime", col=rainbow(6),
        main="Quarter vs Overtime",border="black")

# Team & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$team), mean)
result
##    Group.1        x
## 1        1 4793.429
## 2        2 4384.954
## 3        3 5375.684
## 4        4 5449.714
## 5        5 5330.968
## 6        6 3369.096
## 7        7 4857.188
## 8        8 4312.294
## 9        9 4519.038
## 10      10 4736.700
## 11      11 4342.500
## 12      12 3317.929
barplot(result$x, names.arg=result$Group.1, xlab="Team", ylab="Average Overtime", col=rainbow(12),
        main="Team vs Overtime",border="black")

# Stacked Bar Plot with Colors and Legend
counts <- table(G_cat$department,G_cat$day)
barplot(counts, main="Department vs day",
        xlab="Number of Departments", col=c("darkblue","red"),
        legend = rownames(counts))

# Quarter vs Department
counts <- table(G_cat$department,G_cat$quarter)
barplot(counts, main="quarter vs Department",
        xlab="Number of Departments", col=c("pink","lightblue"),
        legend = rownames(counts))

# Quarter Propagation
PieChart(quarter_cat, hole = 0, values = "%", data = dataset,
         fill = rainbow(5), main = "Quarter Propagation")
## >>> Note: quarter_cat is not in a data frame (table)
## >>> Note: quarter_cat is not in a data frame (table)

## >>> suggestions
## piechart(quarter_cat, hole=0)  # traditional pie chart
## piechart(quarter_cat, values="%")  # display %'s on the chart
## piechart(quarter_cat)  # bar chart
## plot(quarter_cat)  # bubble plot
## plot(quarter_cat, values="count")  # lollipop plot 
## 
## --- quarter_cat --- 
## 
##                Quarter1  Quarter2  Quarter3  Quarter4  Quarter5      Total 
## Frequencies:        360       335       210       248        44       1197 
## Proportions:      0.301     0.280     0.175     0.207     0.037      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 262.336, df = 4, p-value = 0.000
### Department Propagation
PieChart(department_cat, hole = 0, values = "%", data = dataset,
         fill = c("lightblue", "blue"), main = "Department Propagation")
## >>> Note: department_cat is not in a data frame (table)
## >>> Note: department_cat is not in a data frame (table)

## >>> suggestions
## piechart(department_cat, hole=0)  # traditional pie chart
## piechart(department_cat, values="%")  # display %'s on the chart
## piechart(department_cat)  # bar chart
## plot(department_cat)  # bubble plot
## plot(department_cat, values="count")  # lollipop plot 
## 
## --- department_cat --- 
## 
##                finishing  sewing      Total 
## Frequencies:         506     691       1197 
## Proportions:       0.423   0.577      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 28.592, df = 1, p-value = 0.000
# Day Propagation
PieChart(day_cat, hole = 0.6, values = "%", data = dataset,
         fill = rainbow(5), main = "Day Propagation")
## >>> Note: day_cat is not in a data frame (table)
## >>> Note: day_cat is not in a data frame (table)

## >>> suggestions
## piechart(day_cat, hole=0)  # traditional pie chart
## piechart(day_cat, values="%")  # display %'s on the chart
## piechart(day_cat)  # bar chart
## plot(day_cat)  # bubble plot
## plot(day_cat, values="count")  # lollipop plot 
## 
## --- day_cat --- 
## 
##                Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday      Total 
## Frequencies:      199       187     203       199      201        208       1197 
## Proportions:    0.166     0.156   0.170     0.166    0.168      0.174      1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 1.221, df = 5, p-value = 0.943
# Team Propagation
PieChart(team, hole = 0.8, values = "%", data = dataset,
         fill = rainbow(12), main = "Team Propagation")

## >>> suggestions
## piechart(team, hole=0)  # traditional pie chart
## piechart(team, values="%")  # display %'s on the chart
## piechart(team)  # bar chart
## plot(team)  # bubble plot
## plot(team, values="count")  # lollipop plot 
## 
## --- team --- 
## 
##  team  Count   Prop 
## -------------------- 
##     1    105   0.088 
##     2    109   0.091 
##     3     95   0.079 
##     4    105   0.088 
##     5     93   0.078 
##     6     94   0.079 
##     7     96   0.080 
##     8    109   0.091 
##     9    104   0.087 
##    10    100   0.084 
##    11     88   0.074 
##    12     99   0.083 
## -------------------- 
## Total   1197   1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 4.995, df = 11, p-value = 0.931
# Generate box plots of all numeric variables
new_df_num <- data.frame(dataset$targeted_productivity,
                           dataset$standard_minute_value,
                           dataset$work_in_progress,
                           dataset$over_time,
                           dataset$incentive,
                           dataset$idle_time,
                           dataset$idle_men,
                           dataset$no_of_style_change,
                           dataset$no_of_workers,
                           dataset$actual_productivity)
col <- c("targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "actual_productivity")

colnames(new_df_num) <- col

ggplot(gather(new_df_num), aes(key,value)) +
  geom_boxplot(color="purple", fill="red") +
  facet_wrap(~key, scales="free") +
  labs(title = "Box plots showing the distribution of all variables", 
       subtitle ="except the first 3 categorical variables")

par(mfrow = c(1, 1))


# Plot a histogram or bar chart of each numeric variable
ggplot(gather(new_df_num), aes(value)) +
  geom_histogram(bins=20,color='blue', fill='pink') +
  facet_wrap(~key, scales = 'free') +
  labs(title = "Histograms showing the distribution of all variables", 
       subtitle ="except the first 3 categorical variables")

par(mfrow = c(1, 1))


# Plot a bar chart for each of the first 4 categorical variables
new_df_cat <- data.frame(dataset$day_cat,
                         dataset$department_cat,
                         dataset$quarter_cat,
                         dataset$team)
col <- c("day_cat", "department_cat", "quarter_cat", "team")

colnames(new_df_cat) <- col

ggplot(gather(new_df_cat), aes(value)) +
  geom_bar(color='black', fill='green') +
  facet_wrap(~key, scales ='free') +
  labs(title = "Bar charts showing the distribution of the first 4 categorical variables")

par(mfrow = c(1, 1))


# Actual productivity based on dates
ggplot(dataset,aes(x = reorder(date,actual_productivity,median),y = actual_productivity)) +
  geom_boxplot(outlier.shape=NA, mapping=aes(fill=quarter)) +
  theme(legend.position = "bottom", legend.box = "horizontal") +
  labs(x="Date (ordered by median actual_productivity)", title="Boxplots showing actual_productivity distribution for all dates") +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

# Relationship between Actual and Targeted Productivity
ggplot(dataset, aes(x=targeted_productivity,y=actual_productivity, color=department_cat)) +
  geom_point() +
  geom_smooth(method=lm,se=FALSE) +
  labs(title = "Scatterplot showing the relationship between actual productivity and targeted_productivity")
## `geom_smooth()` using formula = 'y ~ x'

# Investigate the relationship between actual_productivity and no_of_workers
scatterplot(actual_productivity ~ no_of_workers, data=dataset, 
            main="Scatterplot showing relationship between actual_productivity and no_of_workers", grid=FALSE)

6. Feature Importance & Selection

Feature Importance & Selection for Regression Model

new_df_reg <- data.frame(dataset$quarter, 
                     dataset$department,
                     dataset$day,
                     dataset$team,
                     dataset$targeted_productivity,
                     dataset$standard_minute_value,
                     dataset$work_in_progress,
                     dataset$over_time,
                     dataset$incentive,
                     dataset$idle_time,
                     dataset$idle_men,
                     dataset$no_of_style_change,
                     dataset$no_of_workers,
                     dataset$actual_productivity)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "actual_productivity")

colnames(new_df_reg) <- col


cl <- makePSOCKcluster(5)
registerDoParallel(cl)

start.time <- proc.time()
Model <- train(actual_productivity ~ ., 
               data = new_df_reg, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
##    user  system elapsed 
##    1.89    0.01   58.80
stopCluster(cl)

Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Regression Model")

So from the plot we can see idle_time not important so we are going to drop that column

drops <- c("idle_time")
new_df_reg <- new_df_reg[ , !(names(new_df_reg) %in% drops)]

Feature Importance & Selection for Classification Model

new_df_class <- data.frame(dataset$quarter, 
                     dataset$department,
                     dataset$day,
                     dataset$team,
                     dataset$targeted_productivity,
                     dataset$standard_minute_value,
                     dataset$work_in_progress,
                     dataset$over_time,
                     dataset$incentive,
                     dataset$idle_time,
                     dataset$idle_men,
                     dataset$no_of_style_change,
                     dataset$no_of_workers,
                     dataset$target)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
         "work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change", 
         "no_of_workers", "target")

colnames(new_df_class) <- col


cl <- makePSOCKcluster(5)
registerDoParallel(cl)

start.time <- proc.time()
Model <- train(target ~ ., 
               data = new_df_class, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
##    user  system elapsed 
##    1.53    0.03   57.19
stopCluster(cl)

Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Classification Model")

So from the plot we can see idle_time and idle_men not important so we are going to drop that columns

drops <- c("idle_time", "idle_men")
new_df_class <- new_df_class[ , !(names(new_df_class) %in% drops)]

7. Model Evaluation & Selection

Regression Model Selection

set.seed(100)

TrainingIndex <- createDataPartition(new_df_reg$actual_productivity, p=0.8, list = FALSE)
TrainingSet <- new_df_reg[TrainingIndex,]
TestingSet <- new_df_reg[-TrainingIndex,]


# Linear Regression:
lin_model <- train(actual_productivity ~ ., data = TrainingSet,
               method = "lm",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"))

lin_MAE <- mae(TestingSet$actual_productivity, predict(lin_model, TestingSet))


# Ridge regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                "no_of_workers")])
y <- TrainingSet$actual_productivity

cv_model <- cv.glmnet(x, y, alpha = 0)
best_lambda <- cv_model$lambda.min
ridge_mod = glmnet(x, y, alpha = 0, lambda = best_lambda)

x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                 "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                 "no_of_workers")])
rid_MAE <- mae(TestingSet$actual_productivity, predict(ridge_mod, s = best_lambda, newx = x))


# Lasso regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                 "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                 "no_of_workers")])
y <- TrainingSet$actual_productivity

lambdas <- 10^seq(2, -3, by = -.1)
lasso_mod = glmnet(x, y, alpha = 1, lambda = lambdas)

x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value", 
                                "work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change", 
                                "no_of_workers")])
lasso_MAE <- mae(TestingSet$actual_productivity, predict(lasso_mod, newx = x))


# XGBoost
train_x = data.matrix(TrainingSet[, -13])
train_y = TrainingSet[,13]
test_x = data.matrix(TestingSet[, -13])
test_y = TestingSet[, 13]

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

xgb_model = xgboost(data = xgb_train, max.depth = 3, nrounds = 56, verbose = 0)

xgb_MAE <- mae(test_y, predict(xgb_model, test_x))
print(paste0("Linear MAE: ", lin_MAE))
## [1] "Linear MAE: 0.111510357523804"
print(paste0("Ridge MAE: ", rid_MAE))
## [1] "Ridge MAE: 0.111521518701029"
print(paste0("Lasso MAE: ", lasso_MAE))
## [1] "Lasso MAE: 0.129368765462842"
print(paste0("XGBoost MAE: ", xgb_MAE))
## [1] "XGBoost MAE: 0.0842882112443574"

Regression Model Performance Table

regression_algorithm_name <- c('Linear Regression','Ridge Regression','Lasso Regression','XGBoost', "")
mae <- c(lin_MAE, rid_MAE, lasso_MAE, xgb_MAE, "")

rowName <- c('Model 1', 'Model 2', 'Model 3', 'Model 4', "")
model_result <- data.frame(regression_algorithm_name, mae, row.names = rowName)

knitr::kable(model_result, align = "lccrr")
regression_algorithm_name mae
Model 1 Linear Regression 0.111510357523804
Model 2 Ridge Regression 0.111521518701029
Model 3 Lasso Regression 0.129368765462842
Model 4 XGBoost 0.0842882112443574

For Regression model we are going to use XGBoost.

Classification Model Selection

TrainingIndex <- createDataPartition(new_df_class$target, p=0.8, list = FALSE)
TrainingSet <- new_df_class[TrainingIndex,]
TestingSet <- new_df_class[-TrainingIndex,]

# Logistic Regression
logic_model <- glm(target ~., data = TrainingSet, family = binomial) %>%
  stepAIC(trace = FALSE)

probabilities <- logic_model %>% predict(TestingSet, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
logic_accurecy <- mean(predicted.classes==TestingSet$target)


# Random Forest
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

rf_model <- train(target ~ ., 
               data = TrainingSet, 
               method = "rf",
               tuneGrid = data.frame(mtry = seq(5,15, by=5)))

probabilities <- rf_model %>% predict(TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
rm_accurecy <- mean(predicted.classes==TestingSet$target)


# SVM
svm_model <- train(target ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"),
               tuneGrid = data.frame(degree=1,scale=1,C=1))

probabilities <-predict(svm_model, TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
svm_accurecy <- mean(predicted.classes==TestingSet$target)
print(paste0("Logistic Accurecy: ", logic_accurecy))
## [1] "Logistic Accurecy: 0.719665271966527"
print(paste0("Random Forest Accurecy: ", rm_accurecy))
## [1] "Random Forest Accurecy: 0.832635983263598"
print(paste0("SVM Accurecy: ", svm_accurecy))
## [1] "SVM Accurecy: 0.635983263598326"

Classification Model Performance Table

classification_algorithm_name <- c('Logistic Regression','Random Forest','SVM', "")
accurecy <- c(logic_accurecy, rm_accurecy, svm_accurecy, "")

rowName <- c('Model 1', 'Model 2', 'Model 3', "")
model_result <- data.frame(classification_algorithm_name, accurecy, row.names = rowName)

knitr::kable(model_result, align = "lccrr")
classification_algorithm_name accurecy
Model 1 Logistic Regression 0.719665271966527
Model 2 Random Forest 0.832635983263598
Model 3 SVM 0.635983263598326

For Classification model we are going to use Random Forest.

8. Prediction

Regression Prediction (Question 1 Solution)

test_data <- data.frame(quarter=1, 
                        department=0, 
                        day=4, 
                        team=8, 
                        targeted_productivity=0.80, 
                        standard_minute_value=26.16, 
                        work_in_progress=1108, 
                        over_time=7080, 
                        incentive=98, 
                        idle_men=0, 
                        no_of_style_change=0, 
                        no_of_workers=59)

print(paste0("Actual Production will be: ", predict(xgb_model, as.matrix(test_data))))
## [1] "Actual Production will be: 0.947649240493774"

Classification Prediction (Question 2 Solution)

test_data <- data.frame(quarter=1, 
                        department=0, 
                        day=4, 
                        team=7, 
                        targeted_productivity=0.80, 
                        standard_minute_value=25.90, 
                        work_in_progress=984, 
                        over_time=6720, 
                        incentive=38, 
                        no_of_style_change=0, 
                        no_of_workers=56)

predict_val <- predict(rf_model, as.matrix(test_data))
prediction <- ifelse(predict_val > 0.5, "YES", "NO")

print(paste0("Is the Actual Productivity reached the Target Productivity: ", prediction))
## [1] "Is the Actual Productivity reached the Target Productivity: YES"

9. Conclusion

In approaching model selection, different selection criteria are being used. For regression problem, we prioritize mean absolute error in our model selection. Mean absolute error will measure average of errors in the forecasts. As comparison between 4 models mean absolute value, Linear regression(0.11), Ridge regression (0.11), Lasso regression (0.120), XGboost(0.08), the lowest MAE is chosen. Hence, we are adopting XGboost for our regression problem.

Meanwhile, for classification problem, accuracy of the prediction is being used as criteria selection. Generally, the higher accuracy, the better model will be. Comparison from 3 training model accuracy, Logistic regression(0.71), Random Forest(0.83), SVM(0.63), Random Forest are being adopted.

Both model selection will helps the garment prediction to achieve optimum productivity.