Garments productivity prediction. (Number) Here we will use Regression models. Ex. Linear or Lasso or Ridge or Decision tree or Random Forest or XGBoost.
Whether actual productivity reached the target productivity or not (Yes or No). Here we will use classification models. Ex. Logistic or Random Forest or SVM
# Import Dataset
dataset <- read.csv("C:\\Users\\tuhin\\MY PRACTICE\\R-Programming\\R ML Models\\1.0 DataSet ML Model\\Garments Productivity\\garments_worker_productivity.csv")
head(dataset) # first 6 rows of dataset
## date quarter department day team targeted_productivity smv wip
## 1 1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108
## 2 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA
## 3 1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968
## 4 1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968
## 5 1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170
## 6 1/1/2015 Quarter1 sweing Thursday 7 0.80 25.90 984
## over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1 7080 98 0 0 0 59.0
## 2 960 0 0 0 0 8.0
## 3 3660 50 0 0 0 30.5
## 4 3660 50 0 0 0 30.5
## 5 1920 50 0 0 0 56.0
## 6 6720 38 0 0 0 56.0
## actual_productivity
## 1 0.9407254
## 2 0.8865000
## 3 0.8005705
## 4 0.8005705
## 5 0.8003819
## 6 0.8001250
summary(dataset) # summary of dataset
## date quarter department day
## Length:1197 Length:1197 Length:1197 Length:1197
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## team targeted_productivity smv wip
## Min. : 1.000 Min. :0.0700 Min. : 2.90 Min. : 7.0
## 1st Qu.: 3.000 1st Qu.:0.7000 1st Qu.: 3.94 1st Qu.: 774.5
## Median : 6.000 Median :0.7500 Median :15.26 Median : 1039.0
## Mean : 6.427 Mean :0.7296 Mean :15.06 Mean : 1190.5
## 3rd Qu.: 9.000 3rd Qu.:0.8000 3rd Qu.:24.26 3rd Qu.: 1252.5
## Max. :12.000 Max. :0.8000 Max. :54.56 Max. :23122.0
## NA's :506
## over_time incentive idle_time idle_men
## Min. : 0 Min. : 0.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1440 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 3960 Median : 0.00 Median : 0.0000 Median : 0.0000
## Mean : 4567 Mean : 38.21 Mean : 0.7302 Mean : 0.3693
## 3rd Qu.: 6960 3rd Qu.: 50.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :25920 Max. :3600.00 Max. :300.0000 Max. :45.0000
##
## no_of_style_change no_of_workers actual_productivity
## Min. :0.0000 Min. : 2.00 Min. :0.2337
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:0.6503
## Median :0.0000 Median :34.00 Median :0.7733
## Mean :0.1504 Mean :34.61 Mean :0.7351
## 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:0.8503
## Max. :2.0000 Max. :89.00 Max. :1.1204
##
glimpse(dataset) # show the features data types and values
## Rows: 1,197
## Columns: 15
## $ date <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", ~
## $ quarter <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department <chr> "sweing", "finishing ", "sweing", "sweing", "swe~
## $ day <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
## $ team <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ smv <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ wip <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681~
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~
skim(dataset) # show more details about features (Missing Values)
| Name | dataset |
| Number of rows | 1197 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 8 | 9 | 0 | 59 | 0 |
| quarter | 0 | 1 | 8 | 8 | 0 | 5 | 0 |
| department | 0 | 1 | 6 | 10 | 0 | 3 | 0 |
| day | 0 | 1 | 6 | 9 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| team | 0 | 1.00 | 6.43 | 3.46 | 1.00 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▅▅▆▇ |
| targeted_productivity | 0 | 1.00 | 0.73 | 0.10 | 0.07 | 0.70 | 0.75 | 0.80 | 0.80 | ▁▁▁▁▇ |
| smv | 0 | 1.00 | 15.06 | 10.94 | 2.90 | 3.94 | 15.26 | 24.26 | 54.56 | ▇▅▃▁▁ |
| wip | 506 | 0.58 | 1190.47 | 1837.46 | 7.00 | 774.50 | 1039.00 | 1252.50 | 23122.00 | ▇▁▁▁▁ |
| over_time | 0 | 1.00 | 4567.46 | 3348.82 | 0.00 | 1440.00 | 3960.00 | 6960.00 | 25920.00 | ▇▆▁▁▁ |
| incentive | 0 | 1.00 | 38.21 | 160.18 | 0.00 | 0.00 | 0.00 | 50.00 | 3600.00 | ▇▁▁▁▁ |
| idle_time | 0 | 1.00 | 0.73 | 12.71 | 0.00 | 0.00 | 0.00 | 0.00 | 300.00 | ▇▁▁▁▁ |
| idle_men | 0 | 1.00 | 0.37 | 3.27 | 0.00 | 0.00 | 0.00 | 0.00 | 45.00 | ▇▁▁▁▁ |
| no_of_style_change | 0 | 1.00 | 0.15 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| no_of_workers | 0 | 1.00 | 34.61 | 22.20 | 2.00 | 9.00 | 34.00 | 57.00 | 89.00 | ▇▂▂▇▁ |
| actual_productivity | 0 | 1.00 | 0.74 | 0.17 | 0.23 | 0.65 | 0.77 | 0.85 | 1.12 | ▂▂▆▇▂ |
# Correlation Marix before data cleaning
corrplot(cor(dataset[, 5:15]),
method = "square",
type = "upper",
tl.col = "black",
tl.cex = 0.6,
col = colorRampPalette(c("purple", "dark green"))(200)
)
# Renaming the columns name
colnames(dataset)[7] = "standard_minute_value"
colnames(dataset)[8] = "work_in_progress"
# Changing Department value Sweing to Sewing
unique(dataset[c("department")])
## department
## 1 sweing
## 2 finishing
## 19 finishing
dataset <- within(dataset, department[department == 'sweing'] <- 'sewing')
dataset <- within(dataset, department[department == 'finishing '] <- 'finishing')
unique(dataset[c("department")])
## department
## 1 sewing
## 2 finishing
# check the missing values appearance
sum(is.na(dataset))
## [1] 506
colSums(is.na(dataset))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## standard_minute_value work_in_progress over_time
## 0 506 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
missmap(dataset, col = c("red", "blue"), legend = FALSE)
dataset[is.na(dataset)] <- 0
colSums(is.na(dataset))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## standard_minute_value work_in_progress over_time
## 0 0 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
# Shape of dataset
nrow(dataset)
## [1] 1197
ncol(dataset)
## [1] 15
# Date Data Type Conversion
sapply(dataset, class)
## date quarter department
## "character" "character" "character"
## day team targeted_productivity
## "character" "integer" "numeric"
## standard_minute_value work_in_progress over_time
## "numeric" "numeric" "integer"
## incentive idle_time idle_men
## "integer" "numeric" "integer"
## no_of_style_change no_of_workers actual_productivity
## "integer" "numeric" "numeric"
dataset$date <- as.POSIXct( dataset$date, format="%m/%d/%Y" )
sapply(dataset, class)
## $date
## [1] "POSIXct" "POSIXt"
##
## $quarter
## [1] "character"
##
## $department
## [1] "character"
##
## $day
## [1] "character"
##
## $team
## [1] "integer"
##
## $targeted_productivity
## [1] "numeric"
##
## $standard_minute_value
## [1] "numeric"
##
## $work_in_progress
## [1] "numeric"
##
## $over_time
## [1] "integer"
##
## $incentive
## [1] "integer"
##
## $idle_time
## [1] "numeric"
##
## $idle_men
## [1] "integer"
##
## $no_of_style_change
## [1] "integer"
##
## $no_of_workers
## [1] "numeric"
##
## $actual_productivity
## [1] "numeric"
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 Quarter1 sewing Thursday 8 0.80
## 2 2015-01-01 Quarter1 finishing Thursday 1 0.75
## 3 2015-01-01 Quarter1 sewing Thursday 11 0.80
## 4 2015-01-01 Quarter1 sewing Thursday 12 0.80
## 5 2015-01-01 Quarter1 sewing Thursday 6 0.80
## 6 2015-01-01 Quarter1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity
## 1 0 59.0 0.9407254
## 2 0 8.0 0.8865000
## 3 0 30.5 0.8005705
## 4 0 30.5 0.8005705
## 5 0 56.0 0.8003819
## 6 0 56.0 0.8001250
# Quarter Data Type Conversion
unique(dataset[, 2])
## [1] "Quarter1" "Quarter2" "Quarter3" "Quarter4" "Quarter5"
quarter_cat <- dataset[, 2]
dataset <- cbind(dataset, quarter_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 Quarter1 sewing Thursday 8 0.80
## 2 2015-01-01 Quarter1 finishing Thursday 1 0.75
## 3 2015-01-01 Quarter1 sewing Thursday 11 0.80
## 4 2015-01-01 Quarter1 sewing Thursday 12 0.80
## 5 2015-01-01 Quarter1 sewing Thursday 6 0.80
## 6 2015-01-01 Quarter1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
dataset <- within(dataset, quarter[quarter == 'Quarter1'] <- 1)
dataset <- within(dataset, quarter[quarter == 'Quarter2'] <- 2)
dataset <- within(dataset, quarter[quarter == 'Quarter3'] <- 3)
dataset <- within(dataset, quarter[quarter == 'Quarter4'] <- 4)
dataset <- within(dataset, quarter[quarter == 'Quarter5'] <- 5)
unique(dataset[, 2])
## [1] "1" "2" "3" "4" "5"
str(dataset[, 2])
## chr [1:1197] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
dataset$quarter <- as.integer(dataset$quarter)
str(dataset[, 2])
## int [1:1197] 1 1 1 1 1 1 1 1 1 1 ...
# Department Data Type Conversion
unique(dataset$department)
## [1] "sewing" "finishing"
department_cat <- dataset$department
dataset <- cbind(dataset, department_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 sewing Thursday 8 0.80
## 2 2015-01-01 1 finishing Thursday 1 0.75
## 3 2015-01-01 1 sewing Thursday 11 0.80
## 4 2015-01-01 1 sewing Thursday 12 0.80
## 5 2015-01-01 1 sewing Thursday 6 0.80
## 6 2015-01-01 1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat
## 1 sewing
## 2 finishing
## 3 sewing
## 4 sewing
## 5 sewing
## 6 sewing
dataset <- within(dataset, department[department == 'sewing'] <- 0)
dataset <- within(dataset, department[department == 'finishing'] <- 1)
unique(dataset$department)
## [1] "0" "1"
str(dataset$department)
## chr [1:1197] "0" "1" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "1" "1" ...
dataset$department <- as.integer(dataset$department)
str(dataset$department)
## int [1:1197] 0 1 0 0 0 0 1 0 0 0 ...
# Day Data Type Conversion
unique(dataset$day)
## [1] "Thursday" "Saturday" "Sunday" "Monday" "Tuesday" "Wednesday"
day_cat <- dataset$day
dataset <- cbind(dataset, day_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 0 Thursday 8 0.80
## 2 2015-01-01 1 1 Thursday 1 0.75
## 3 2015-01-01 1 0 Thursday 11 0.80
## 4 2015-01-01 1 0 Thursday 12 0.80
## 5 2015-01-01 1 0 Thursday 6 0.80
## 6 2015-01-01 1 0 Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat day_cat
## 1 sewing Thursday
## 2 finishing Thursday
## 3 sewing Thursday
## 4 sewing Thursday
## 5 sewing Thursday
## 6 sewing Thursday
dataset <- within(dataset, day[day == 'Sunday'] <- 0)
dataset <- within(dataset, day[day == 'Monday'] <- 1)
dataset <- within(dataset, day[day == 'Tuesday'] <- 2)
dataset <- within(dataset, day[day == 'Wednesday'] <- 3)
dataset <- within(dataset, day[day == 'Thursday'] <- 4)
dataset <- within(dataset, day[day == 'Friday'] <- 5)
dataset <- within(dataset, day[day == 'Saturday'] <- 6)
unique(dataset$day)
## [1] "4" "6" "0" "1" "2" "3"
str(dataset$day)
## chr [1:1197] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" ...
dataset$day <- as.integer(dataset$day)
str(dataset$day)
## int [1:1197] 4 4 4 4 4 4 4 4 4 4 ...
glimpse(dataset)
## Rows: 1,197
## Columns: 18
## $ date <dttm> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,~
## $ quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ department <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, ~
## $ day <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ team <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,~
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, ~
## $ standard_minute_value <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2~
## $ work_in_progress <dbl> 1108, 0, 968, 968, 1170, 984, 0, 795, 733, 681, ~
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60~
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50~
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55~
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80~
## $ quarter_cat <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", ~
## $ department_cat <chr> "sewing", "finishing", "sewing", "sewing", "sewi~
## $ day_cat <chr> "Thursday", "Thursday", "Thursday", "Thursday", ~
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 0 4 8 0.80
## 2 2015-01-01 1 1 4 1 0.75
## 3 2015-01-01 1 0 4 11 0.80
## 4 2015-01-01 1 0 4 12 0.80
## 5 2015-01-01 1 0 4 6 0.80
## 6 2015-01-01 1 0 4 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat day_cat
## 1 sewing Thursday
## 2 finishing Thursday
## 3 sewing Thursday
## 4 sewing Thursday
## 5 sewing Thursday
## 6 sewing Thursday
# Adding a new classification column based on actual_productivity and targeted_productivity
target <- dataset$actual_productivity < dataset$targeted_productivity + 0.001
dataset['target'] <- target
dataset <- within(dataset, target[target == TRUE] <- 1)
dataset <- within(dataset, target[target == FALSE] <- 0)
#Separate Categorical and Continous variables
G_con = dataset[ ,!sapply(dataset, is.character)]
G_cat = dataset[ ,sapply(dataset, is.character)]
# Correlation Matrix after Data Preprocessing
corrplot(cor(dataset[, 2:15]),
method = "square",
type = "upper",
tl.col = "black",
tl.cex = 0.6,
col = colorRampPalette(c("purple", "dark green"))(200)
)
### Targeted Productivity over Time
dates <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$Group.1
targeted_productivity <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$x
plot_ly(
x = dates,
y = targeted_productivity,
type = "scatter",
mode = "lines"
) |>
layout(yaxis = list(range = c(0.65, 0.81)),
title = list(text='Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
### Actual Productivity over Time
dates <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$Group.1
actual_productivity <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$x
plot_ly(
x = dates,
y = actual_productivity,
type = "scatter",
mode = "lines"
) |>
layout(yaxis = list(range = c(0.5, 0.9)),
title = list(text='Actual Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
### Actual & Targeted Productivity over Time
plot_ly(
x = dates,
y = targeted_productivity,
type = "scatter",
mode = "lines",
name = "targeted_productivity"
) %>%
add_trace(y = actual_productivity, name = "actual_productivity") %>%
layout(legend = list(x = 0.73, y = 0.95),
title = list(text='Actual & Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
### Ratio of Actual to Targeted Productivity over Time
prod_diff = actual_productivity - targeted_productivity
prod_diff
## [1] -0.0339209786 0.0265828621 0.0039838318 0.0232998741 0.0221744682
## [6] -0.0178785815 -0.0033075971 0.0167596471 -0.0183467855 0.0138895031
## [11] 0.0260206630 -0.0154420297 0.0607074814 -0.0168894555 -0.0545498789
## [16] 0.0107424270 0.0010644784 0.0115748421 -0.0258462651 0.0438446866
## [21] 0.0583350081 0.0931096937 0.0840465143 0.0337149014 0.0791443043
## [26] 0.1320382726 0.0472739451 -0.0311872794 0.0315611188 0.0563255852
## [31] 0.0876255877 0.0092534128 0.0014276525 0.0300904176 0.0470107688
## [36] 0.0790646426 -0.0204833649 -0.0359534660 -0.0382238699 -0.0206541136
## [41] -0.0137440371 -0.0624003944 -0.0484842466 -0.0820019806 -0.0762448567
## [46] -0.0430553499 -0.0285546069 -0.0730003276 -0.0580416762 -0.0462517230
## [51] -0.0002677061 -0.0452763715 -0.0122840756 0.0131424168 0.0165971511
## [56] 0.0328347056 0.0128943875 0.0352160328 -0.0136933439
plot_ly(
x = dates,
y = prod_diff,
type = "scatter",
mode = "lines"
) |>
layout(title = list(text='Ratio of Actual to Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
# Distribution of Work in Progress vs day of the week
data_frame_wip <- subset(dataset, work_in_progress > 0)
nrow(data_frame_wip)
## [1] 691
result <- aggregate(data_frame_wip$work_in_progress, by=list(data_frame_wip$day_cat), mean)
result
## Group.1 x
## 1 Monday 2156.3103
## 2 Saturday 1003.0192
## 3 Sunday 1016.1983
## 4 Thursday 974.0254
## 5 Tuesday 1012.4831
## 6 Wednesday 973.7731
barplot(result$x, names.arg=result$Group.1, xlab="Weekdays", ylab="Total Work in Progress", col=rainbow(6),
main="Total Work in Progress by Weekdays",border="black")
# Quarter & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$quarter_cat), mean)
result
## Group.1 x
## 1 Quarter1 4480.917
## 2 Quarter2 4355.015
## 3 Quarter3 4896.000
## 4 Quarter4 4851.250
## 5 Quarter5 3725.455
barplot(result$x, names.arg=result$Group.1, xlab="Quarter", ylab="Average Worktime", col=rainbow(6),
main="Quarter vs Overtime",border="black")
# Team & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$team), mean)
result
## Group.1 x
## 1 1 4793.429
## 2 2 4384.954
## 3 3 5375.684
## 4 4 5449.714
## 5 5 5330.968
## 6 6 3369.096
## 7 7 4857.188
## 8 8 4312.294
## 9 9 4519.038
## 10 10 4736.700
## 11 11 4342.500
## 12 12 3317.929
barplot(result$x, names.arg=result$Group.1, xlab="Team", ylab="Average Overtime", col=rainbow(12),
main="Team vs Overtime",border="black")
# Stacked Bar Plot with Colors and Legend
counts <- table(G_cat$department,G_cat$day)
barplot(counts, main="Department vs day",
xlab="Number of Departments", col=c("darkblue","red"),
legend = rownames(counts))
# Quarter vs Department
counts <- table(G_cat$department,G_cat$quarter)
barplot(counts, main="quarter vs Department",
xlab="Number of Departments", col=c("pink","lightblue"),
legend = rownames(counts))
# Quarter Propagation
PieChart(quarter_cat, hole = 0, values = "%", data = dataset,
fill = rainbow(5), main = "Quarter Propagation")
## >>> Note: quarter_cat is not in a data frame (table)
## >>> Note: quarter_cat is not in a data frame (table)
## >>> suggestions
## piechart(quarter_cat, hole=0) # traditional pie chart
## piechart(quarter_cat, values="%") # display %'s on the chart
## piechart(quarter_cat) # bar chart
## plot(quarter_cat) # bubble plot
## plot(quarter_cat, values="count") # lollipop plot
##
## --- quarter_cat ---
##
## Quarter1 Quarter2 Quarter3 Quarter4 Quarter5 Total
## Frequencies: 360 335 210 248 44 1197
## Proportions: 0.301 0.280 0.175 0.207 0.037 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 262.336, df = 4, p-value = 0.000
### Department Propagation
PieChart(department_cat, hole = 0, values = "%", data = dataset,
fill = c("lightblue", "blue"), main = "Department Propagation")
## >>> Note: department_cat is not in a data frame (table)
## >>> Note: department_cat is not in a data frame (table)
## >>> suggestions
## piechart(department_cat, hole=0) # traditional pie chart
## piechart(department_cat, values="%") # display %'s on the chart
## piechart(department_cat) # bar chart
## plot(department_cat) # bubble plot
## plot(department_cat, values="count") # lollipop plot
##
## --- department_cat ---
##
## finishing sewing Total
## Frequencies: 506 691 1197
## Proportions: 0.423 0.577 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 28.592, df = 1, p-value = 0.000
# Day Propagation
PieChart(day_cat, hole = 0.6, values = "%", data = dataset,
fill = rainbow(5), main = "Day Propagation")
## >>> Note: day_cat is not in a data frame (table)
## >>> Note: day_cat is not in a data frame (table)
## >>> suggestions
## piechart(day_cat, hole=0) # traditional pie chart
## piechart(day_cat, values="%") # display %'s on the chart
## piechart(day_cat) # bar chart
## plot(day_cat) # bubble plot
## plot(day_cat, values="count") # lollipop plot
##
## --- day_cat ---
##
## Monday Saturday Sunday Thursday Tuesday Wednesday Total
## Frequencies: 199 187 203 199 201 208 1197
## Proportions: 0.166 0.156 0.170 0.166 0.168 0.174 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 1.221, df = 5, p-value = 0.943
# Team Propagation
PieChart(team, hole = 0.8, values = "%", data = dataset,
fill = rainbow(12), main = "Team Propagation")
## >>> suggestions
## piechart(team, hole=0) # traditional pie chart
## piechart(team, values="%") # display %'s on the chart
## piechart(team) # bar chart
## plot(team) # bubble plot
## plot(team, values="count") # lollipop plot
##
## --- team ---
##
## team Count Prop
## --------------------
## 1 105 0.088
## 2 109 0.091
## 3 95 0.079
## 4 105 0.088
## 5 93 0.078
## 6 94 0.079
## 7 96 0.080
## 8 109 0.091
## 9 104 0.087
## 10 100 0.084
## 11 88 0.074
## 12 99 0.083
## --------------------
## Total 1197 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 4.995, df = 11, p-value = 0.931
# Generate box plots of all numeric variables
new_df_num <- data.frame(dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$actual_productivity)
col <- c("targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "actual_productivity")
colnames(new_df_num) <- col
ggplot(gather(new_df_num), aes(key,value)) +
geom_boxplot(color="purple", fill="red") +
facet_wrap(~key, scales="free") +
labs(title = "Box plots showing the distribution of all variables",
subtitle ="except the first 3 categorical variables")
par(mfrow = c(1, 1))
# Plot a histogram or bar chart of each numeric variable
ggplot(gather(new_df_num), aes(value)) +
geom_histogram(bins=20,color='blue', fill='pink') +
facet_wrap(~key, scales = 'free') +
labs(title = "Histograms showing the distribution of all variables",
subtitle ="except the first 3 categorical variables")
par(mfrow = c(1, 1))
# Plot a bar chart for each of the first 4 categorical variables
new_df_cat <- data.frame(dataset$day_cat,
dataset$department_cat,
dataset$quarter_cat,
dataset$team)
col <- c("day_cat", "department_cat", "quarter_cat", "team")
colnames(new_df_cat) <- col
ggplot(gather(new_df_cat), aes(value)) +
geom_bar(color='black', fill='green') +
facet_wrap(~key, scales ='free') +
labs(title = "Bar charts showing the distribution of the first 4 categorical variables")
par(mfrow = c(1, 1))
# Actual productivity based on dates
ggplot(dataset,aes(x = reorder(date,actual_productivity,median),y = actual_productivity)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=quarter)) +
theme(legend.position = "bottom", legend.box = "horizontal") +
labs(x="Date (ordered by median actual_productivity)", title="Boxplots showing actual_productivity distribution for all dates") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
# Relationship between Actual and Targeted Productivity
ggplot(dataset, aes(x=targeted_productivity,y=actual_productivity, color=department_cat)) +
geom_point() +
geom_smooth(method=lm,se=FALSE) +
labs(title = "Scatterplot showing the relationship between actual productivity and targeted_productivity")
## `geom_smooth()` using formula = 'y ~ x'
# Investigate the relationship between actual_productivity and no_of_workers
scatterplot(actual_productivity ~ no_of_workers, data=dataset,
main="Scatterplot showing relationship between actual_productivity and no_of_workers", grid=FALSE)
new_df_reg <- data.frame(dataset$quarter,
dataset$department,
dataset$day,
dataset$team,
dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$actual_productivity)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "actual_productivity")
colnames(new_df_reg) <- col
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
start.time <- proc.time()
Model <- train(actual_productivity ~ .,
data = new_df_reg,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
## user system elapsed
## 1.89 0.01 58.80
stopCluster(cl)
Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Regression Model")
drops <- c("idle_time")
new_df_reg <- new_df_reg[ , !(names(new_df_reg) %in% drops)]
new_df_class <- data.frame(dataset$quarter,
dataset$department,
dataset$day,
dataset$team,
dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$target)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "target")
colnames(new_df_class) <- col
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
start.time <- proc.time()
Model <- train(target ~ .,
data = new_df_class,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
## user system elapsed
## 1.53 0.03 57.19
stopCluster(cl)
Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Classification Model")
drops <- c("idle_time", "idle_men")
new_df_class <- new_df_class[ , !(names(new_df_class) %in% drops)]
set.seed(100)
TrainingIndex <- createDataPartition(new_df_reg$actual_productivity, p=0.8, list = FALSE)
TrainingSet <- new_df_reg[TrainingIndex,]
TestingSet <- new_df_reg[-TrainingIndex,]
# Linear Regression:
lin_model <- train(actual_productivity ~ ., data = TrainingSet,
method = "lm",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"))
lin_MAE <- mae(TestingSet$actual_productivity, predict(lin_model, TestingSet))
# Ridge regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
y <- TrainingSet$actual_productivity
cv_model <- cv.glmnet(x, y, alpha = 0)
best_lambda <- cv_model$lambda.min
ridge_mod = glmnet(x, y, alpha = 0, lambda = best_lambda)
x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
rid_MAE <- mae(TestingSet$actual_productivity, predict(ridge_mod, s = best_lambda, newx = x))
# Lasso regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
y <- TrainingSet$actual_productivity
lambdas <- 10^seq(2, -3, by = -.1)
lasso_mod = glmnet(x, y, alpha = 1, lambda = lambdas)
x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
lasso_MAE <- mae(TestingSet$actual_productivity, predict(lasso_mod, newx = x))
# XGBoost
train_x = data.matrix(TrainingSet[, -13])
train_y = TrainingSet[,13]
test_x = data.matrix(TestingSet[, -13])
test_y = TestingSet[, 13]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
xgb_model = xgboost(data = xgb_train, max.depth = 3, nrounds = 56, verbose = 0)
xgb_MAE <- mae(test_y, predict(xgb_model, test_x))
print(paste0("Linear MAE: ", lin_MAE))
## [1] "Linear MAE: 0.111510357523804"
print(paste0("Ridge MAE: ", rid_MAE))
## [1] "Ridge MAE: 0.111521518701029"
print(paste0("Lasso MAE: ", lasso_MAE))
## [1] "Lasso MAE: 0.129368765462842"
print(paste0("XGBoost MAE: ", xgb_MAE))
## [1] "XGBoost MAE: 0.0842882112443574"
regression_algorithm_name <- c('Linear Regression','Ridge Regression','Lasso Regression','XGBoost', "")
mae <- c(lin_MAE, rid_MAE, lasso_MAE, xgb_MAE, "")
rowName <- c('Model 1', 'Model 2', 'Model 3', 'Model 4', "")
model_result <- data.frame(regression_algorithm_name, mae, row.names = rowName)
knitr::kable(model_result, align = "lccrr")
| regression_algorithm_name | mae | |
|---|---|---|
| Model 1 | Linear Regression | 0.111510357523804 |
| Model 2 | Ridge Regression | 0.111521518701029 |
| Model 3 | Lasso Regression | 0.129368765462842 |
| Model 4 | XGBoost | 0.0842882112443574 |
TrainingIndex <- createDataPartition(new_df_class$target, p=0.8, list = FALSE)
TrainingSet <- new_df_class[TrainingIndex,]
TestingSet <- new_df_class[-TrainingIndex,]
# Logistic Regression
logic_model <- glm(target ~., data = TrainingSet, family = binomial) %>%
stepAIC(trace = FALSE)
probabilities <- logic_model %>% predict(TestingSet, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
logic_accurecy <- mean(predicted.classes==TestingSet$target)
# Random Forest
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
rf_model <- train(target ~ .,
data = TrainingSet,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5)))
probabilities <- rf_model %>% predict(TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
rm_accurecy <- mean(predicted.classes==TestingSet$target)
# SVM
svm_model <- train(target ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1))
probabilities <-predict(svm_model, TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
svm_accurecy <- mean(predicted.classes==TestingSet$target)
print(paste0("Logistic Accurecy: ", logic_accurecy))
## [1] "Logistic Accurecy: 0.719665271966527"
print(paste0("Random Forest Accurecy: ", rm_accurecy))
## [1] "Random Forest Accurecy: 0.832635983263598"
print(paste0("SVM Accurecy: ", svm_accurecy))
## [1] "SVM Accurecy: 0.635983263598326"
classification_algorithm_name <- c('Logistic Regression','Random Forest','SVM', "")
accurecy <- c(logic_accurecy, rm_accurecy, svm_accurecy, "")
rowName <- c('Model 1', 'Model 2', 'Model 3', "")
model_result <- data.frame(classification_algorithm_name, accurecy, row.names = rowName)
knitr::kable(model_result, align = "lccrr")
| classification_algorithm_name | accurecy | |
|---|---|---|
| Model 1 | Logistic Regression | 0.719665271966527 |
| Model 2 | Random Forest | 0.832635983263598 |
| Model 3 | SVM | 0.635983263598326 |
test_data <- data.frame(quarter=1,
department=0,
day=4,
team=8,
targeted_productivity=0.80,
standard_minute_value=26.16,
work_in_progress=1108,
over_time=7080,
incentive=98,
idle_men=0,
no_of_style_change=0,
no_of_workers=59)
print(paste0("Actual Production will be: ", predict(xgb_model, as.matrix(test_data))))
## [1] "Actual Production will be: 0.947649240493774"
test_data <- data.frame(quarter=1,
department=0,
day=4,
team=7,
targeted_productivity=0.80,
standard_minute_value=25.90,
work_in_progress=984,
over_time=6720,
incentive=38,
no_of_style_change=0,
no_of_workers=56)
predict_val <- predict(rf_model, as.matrix(test_data))
prediction <- ifelse(predict_val > 0.5, "YES", "NO")
print(paste0("Is the Actual Productivity reached the Target Productivity: ", prediction))
## [1] "Is the Actual Productivity reached the Target Productivity: YES"
In approaching model selection, different selection criteria are being used. For regression problem, we prioritize mean absolute error in our model selection. Mean absolute error will measure average of errors in the forecasts. As comparison between 4 models mean absolute value, Linear regression(0.11), Ridge regression (0.11), Lasso regression (0.120), XGboost(0.08), the lowest MAE is chosen. Hence, we are adopting XGboost for our regression problem.
Meanwhile, for classification problem, accuracy of the prediction is being used as criteria selection. Generally, the higher accuracy, the better model will be. Comparison from 3 training model accuracy, Logistic regression(0.71), Random Forest(0.83), SVM(0.63), Random Forest are being adopted.
Both model selection will helps the garment prediction to achieve optimum productivity.