Lecturer: Assoc. Prof. Dr. Ang Tan Fong
Group Members:
| No | Name | Matric No |
|---|---|---|
| 1 | LEONG JIA LING | S2157416 |
| 2 | MD TUHIN HOSSAIN | S2116205 |
| 3 | SEAH TEONG TAT | S2130478 |
| 4 | ROZAIDAWATI ZAINUL AZNAM | S2153846 |
| 5 | R PREMANAN RATHAKRISHNAN | S2151940 |
Project Sections:
Introduction
Aim and Objectives
Dataset Summary & Explanation
Data Preprocessing & Cleaning
Feature Engineering
EDA & Visualization
Feature Importance & Selection
Model Evaluation & Selection
Prediction and Results
Conclusion
References
Basic human necessities including food, clothes, and shelter are well-known concepts. As we all know, the need for clothing also indicates that people need clothing (Mubarok, 2017). If these needs weren’t satisfied, that would be impossible. In addition to the functional requirement for clothing, the sale or business of clothing is also highly powerful. The garment business is one of the most labour-intensive sectors in the world, aside from economic factors. Around 75 million people worldwide work directly in the textile, apparel, and footwear industries (United, 2021). In many developing nations, such as Bangladesh, which is currently the second-largest exporter of apparel in the world behind China, the ready-to-wear garment industry plays a significant role in manufacturing production, employment, and trade (Chaerani, 2018). According to newly released Bureau of Export Promotion Data, ready-to-wear exports from Bangladesh generate over $ 30.61 billion in income, accounting for about 14.07% of the country’s GDP and 81% of all export earnings. (M Saiful Islam and colleagues, 2019) The improvement in production quality in the apparel business must be maintained considering the rising global demand for clothing. Productivity is one of the instruments used to measure business success, and the definition of productivity is a comparison between output and input (Sri & Margareta, 2020).
In our study, we figured out that machine learning would be the most suitable approach in measuring the performance of the employees of this industry. Machine learning (ML), a subfield of artificial intelligence, enables computers to make autonomous predictions by rapidly learning from training data and past experiences without the use of explicit programming. The goal of ML is to replicate the human brain’s capacity for problem-solving and experience-based analysis. Therefore, ML strategies involve applying various algorithms to data to uncover certain patterns that improve decision-making. According to Zhang 2010, there are different kinds of machine learning, including supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Each type of ML algorithm is used to address a certain class of issues; for example, some algorithms can be applied to classification, others to regression, and yet others to clustering. The kind of the problem and numerous other parameters, like parametrization, learning time, prediction time, overfitting propensity, and memory size, all play a role in selecting the most appropriate algorithm (Mahesh, 2019). All machine learning algorithms are practical tools that help people in a variety of tasks, including data mining, image processing, and prediction analysis (Mona M. Jamjoom, 2021).
To construct the means of determining the garment productivity and employee productivity using classification and regression models.
The objectives for this project are:
To understand the factors that affect garment productivity.
To build, evaluate and choose the best regression model to predict garment employee’s productivity.
To build, evaluate and choose the best classification model in order to determine whether actual productivity reached the target productivity.
Using the head() function, the top 6 results from the dataset are returned.
# Import Dataset
dataset <- read.csv("garments_worker_productivity.csv")
head(dataset) # first 6 rows of dataset
## date quarter department day team targeted_productivity smv wip
## 1 1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108
## 2 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA
## 3 1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968
## 4 1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968
## 5 1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170
## 6 1/1/2015 Quarter1 sweing Thursday 7 0.80 25.90 984
## over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1 7080 98 0 0 0 59.0
## 2 960 0 0 0 0 8.0
## 3 3660 50 0 0 0 30.5
## 4 3660 50 0 0 0 30.5
## 5 1920 50 0 0 0 56.0
## 6 6720 38 0 0 0 56.0
## actual_productivity
## 1 0.9407254
## 2 0.8865000
## 3 0.8005705
## 4 0.8005705
## 5 0.8003819
## 6 0.8001250
Using the summary() function, it returns the length, class and mode for alphanumeric data while for numeric data it returns the minimum, maximum, 1st quartile, median, 3rd quartile and mean of that column.
summary(dataset) # summary of dataset
## date quarter department day
## Length:1197 Length:1197 Length:1197 Length:1197
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## team targeted_productivity smv wip
## Min. : 1.000 Min. :0.0700 Min. : 2.90 Min. : 7.0
## 1st Qu.: 3.000 1st Qu.:0.7000 1st Qu.: 3.94 1st Qu.: 774.5
## Median : 6.000 Median :0.7500 Median :15.26 Median : 1039.0
## Mean : 6.427 Mean :0.7296 Mean :15.06 Mean : 1190.5
## 3rd Qu.: 9.000 3rd Qu.:0.8000 3rd Qu.:24.26 3rd Qu.: 1252.5
## Max. :12.000 Max. :0.8000 Max. :54.56 Max. :23122.0
## NA's :506
## over_time incentive idle_time idle_men
## Min. : 0 Min. : 0.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1440 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 3960 Median : 0.00 Median : 0.0000 Median : 0.0000
## Mean : 4567 Mean : 38.21 Mean : 0.7302 Mean : 0.3693
## 3rd Qu.: 6960 3rd Qu.: 50.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :25920 Max. :3600.00 Max. :300.0000 Max. :45.0000
##
## no_of_style_change no_of_workers actual_productivity
## Min. :0.0000 Min. : 2.00 Min. :0.2337
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:0.6503
## Median :0.0000 Median :34.00 Median :0.7733
## Mean :0.1504 Mean :34.61 Mean :0.7351
## 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:0.8503
## Max. :2.0000 Max. :89.00 Max. :1.1204
##
Using the glimpse( ) function returns the data types and values of each attribute.
glimpse(dataset) # show the features data types and values
## Rows: 1,197
## Columns: 15
## $ date <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", …
## $ quarter <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department <chr> "sweing", "finishing ", "sweing", "sweing", "swe…
## $ day <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
## $ team <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ smv <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ wip <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681…
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…
skim(dataset) # show more details about features (Missing Values)
| Name | dataset |
| Number of rows | 1197 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 8 | 9 | 0 | 59 | 0 |
| quarter | 0 | 1 | 8 | 8 | 0 | 5 | 0 |
| department | 0 | 1 | 6 | 10 | 0 | 3 | 0 |
| day | 0 | 1 | 6 | 9 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| team | 0 | 1.00 | 6.43 | 3.46 | 1.00 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▅▅▆▇ |
| targeted_productivity | 0 | 1.00 | 0.73 | 0.10 | 0.07 | 0.70 | 0.75 | 0.80 | 0.80 | ▁▁▁▁▇ |
| smv | 0 | 1.00 | 15.06 | 10.94 | 2.90 | 3.94 | 15.26 | 24.26 | 54.56 | ▇▅▃▁▁ |
| wip | 506 | 0.58 | 1190.47 | 1837.46 | 7.00 | 774.50 | 1039.00 | 1252.50 | 23122.00 | ▇▁▁▁▁ |
| over_time | 0 | 1.00 | 4567.46 | 3348.82 | 0.00 | 1440.00 | 3960.00 | 6960.00 | 25920.00 | ▇▆▁▁▁ |
| incentive | 0 | 1.00 | 38.21 | 160.18 | 0.00 | 0.00 | 0.00 | 50.00 | 3600.00 | ▇▁▁▁▁ |
| idle_time | 0 | 1.00 | 0.73 | 12.71 | 0.00 | 0.00 | 0.00 | 0.00 | 300.00 | ▇▁▁▁▁ |
| idle_men | 0 | 1.00 | 0.37 | 3.27 | 0.00 | 0.00 | 0.00 | 0.00 | 45.00 | ▇▁▁▁▁ |
| no_of_style_change | 0 | 1.00 | 0.15 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| no_of_workers | 0 | 1.00 | 34.61 | 22.20 | 2.00 | 9.00 | 34.00 | 57.00 | 89.00 | ▇▂▂▇▁ |
| actual_productivity | 0 | 1.00 | 0.74 | 0.17 | 0.23 | 0.65 | 0.77 | 0.85 | 1.12 | ▂▂▆▇▂ |
The skim( ) function shows the user more details about the features as seen above. The column type frequency for character and numeric columns are shown. Detailed contents for character and variable type variables are shown in a table format, indicating any missing values or even number of unique data. In this dataset, the ‘wip’ column has 506 missing attributes while there are 59 unique dates of data collected in this dataset.
# Correlation Marix before data cleaning
corrplot(cor(dataset[, 5:15]),
method = "square",
type = "upper",
tl.col = "black",
tl.cex = 0.6,
col = colorRampPalette(c("purple", "dark green"))(200)
)
A correlation matrix has also been generated as shown above before the data cleaning process using the corrplot() function. It can be seen that there are question marks showing between relationships such as wip - over_time and smv - wip and several other attributes, these will be handled in the following section.
First, the column names for column 7 and 8 will be renamed as “standard_minute_value” and “work_in_progress”.
# Renaming the columns name
colnames(dataset)[7] = "standard_minute_value"
colnames(dataset)[8] = "work_in_progress"
# Changing Department value Sweing to Sewing
unique(dataset[c("department")])
## department
## 1 sweing
## 2 finishing
## 19 finishing
After checking the unique values in the “department” column, it can be seen that the term “sewing” was spelt as “sweing” and there is “finishing” and “finishing” with one having a space at the end which will need to be corrected.
dataset <- within(dataset, department[department == 'sweing'] <- 'sewing')
dataset <- within(dataset, department[department == 'finishing '] <- 'finishing')
unique(dataset[c("department")])
## department
## 1 sewing
## 2 finishing
After making the changes in the “department” column, it can be seen that it is left with just “sewing” and “finishing” after checking with unique().
# check the missing values appearance
sum(is.na(dataset))
## [1] 506
colSums(is.na(dataset))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## standard_minute_value work_in_progress over_time
## 0 506 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
Moving on, the researchers checked for missing values in the dataset and found that the “work_in_progress” attribute has 506 missing data.
With the missmap() function, it draws a map of the missingness in a dataset using the image function. The columns are reordered to put the most missing variable farthest to the left. As illustrated in the missingness map, “work_in_progress” has a high proportion of NaN values. Hence, the cells will be filled with 0 that means there is no number of items currently being worked on.
missmap(dataset, col = c("red", "blue"), legend = FALSE)
As we can see work_in_progress has a high proportion of NaN values. So we are going to fill that cell as 0 that means there is no number of items currently being worked on.
dataset[is.na(dataset)] <- 0
colSums(is.na(dataset))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## standard_minute_value work_in_progress over_time
## 0 0 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
# Shape of dataset
nrow(dataset)
## [1] 1197
ncol(dataset)
## [1] 15
After checking the dataset shape, it can be seen that the dataset is small. As such, the outliers will not be removed.
Feature engineering will be conducted in this section. The sapply() function takes list, vector or data frame as input and gives output in vector or matrix. It is useful for operations on list objects and returns a list object of the same length as the original set.
# Date Data Type Conversion
sapply(dataset, class)
## date quarter department
## "character" "character" "character"
## day team targeted_productivity
## "character" "integer" "numeric"
## standard_minute_value work_in_progress over_time
## "numeric" "numeric" "integer"
## incentive idle_time idle_men
## "integer" "numeric" "integer"
## no_of_style_change no_of_workers actual_productivity
## "integer" "numeric" "numeric"
The first attribute to be converted is the “date” which will be converted from month/day/year (1/01/2015) format to Year-Month-Day (2015-01-01) format.
dataset$date <- as.POSIXct( dataset$date, format="%m/%d/%Y" )
sapply(dataset, class)
## $date
## [1] "POSIXct" "POSIXt"
##
## $quarter
## [1] "character"
##
## $department
## [1] "character"
##
## $day
## [1] "character"
##
## $team
## [1] "integer"
##
## $targeted_productivity
## [1] "numeric"
##
## $standard_minute_value
## [1] "numeric"
##
## $work_in_progress
## [1] "numeric"
##
## $over_time
## [1] "integer"
##
## $incentive
## [1] "integer"
##
## $idle_time
## [1] "numeric"
##
## $idle_men
## [1] "integer"
##
## $no_of_style_change
## [1] "integer"
##
## $no_of_workers
## [1] "numeric"
##
## $actual_productivity
## [1] "numeric"
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 Quarter1 sewing Thursday 8 0.80
## 2 2015-01-01 Quarter1 finishing Thursday 1 0.75
## 3 2015-01-01 Quarter1 sewing Thursday 11 0.80
## 4 2015-01-01 Quarter1 sewing Thursday 12 0.80
## 5 2015-01-01 Quarter1 sewing Thursday 6 0.80
## 6 2015-01-01 Quarter1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity
## 1 0 59.0 0.9407254
## 2 0 8.0 0.8865000
## 3 0 30.5 0.8005705
## 4 0 30.5 0.8005705
## 5 0 56.0 0.8003819
## 6 0 56.0 0.8001250
# Quarter Data Type Conversion
unique(dataset[, 2])
## [1] "Quarter1" "Quarter2" "Quarter3" "Quarter4" "Quarter5"
quarter_cat <- dataset[, 2]
dataset <- cbind(dataset, quarter_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 Quarter1 sewing Thursday 8 0.80
## 2 2015-01-01 Quarter1 finishing Thursday 1 0.75
## 3 2015-01-01 Quarter1 sewing Thursday 11 0.80
## 4 2015-01-01 Quarter1 sewing Thursday 12 0.80
## 5 2015-01-01 Quarter1 sewing Thursday 6 0.80
## 6 2015-01-01 Quarter1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
Next, the “quarter” attribute will be changed from ‘Quarter 1’, ‘Quarter 2’ until ‘Quarter 5’ into numeric format of ‘1’, ‘2’ and until ‘5’.
dataset <- within(dataset, quarter[quarter == 'Quarter1'] <- 1)
dataset <- within(dataset, quarter[quarter == 'Quarter2'] <- 2)
dataset <- within(dataset, quarter[quarter == 'Quarter3'] <- 3)
dataset <- within(dataset, quarter[quarter == 'Quarter4'] <- 4)
dataset <- within(dataset, quarter[quarter == 'Quarter5'] <- 5)
unique(dataset[, 2])
## [1] "1" "2" "3" "4" "5"
str(dataset[, 2])
## chr [1:1197] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
dataset$quarter <- as.integer(dataset$quarter)
str(dataset[, 2])
## int [1:1197] 1 1 1 1 1 1 1 1 1 1 ...
# Department Data Type Conversion
unique(dataset$department)
## [1] "sewing" "finishing"
department_cat <- dataset$department
dataset <- cbind(dataset, department_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 sewing Thursday 8 0.80
## 2 2015-01-01 1 finishing Thursday 1 0.75
## 3 2015-01-01 1 sewing Thursday 11 0.80
## 4 2015-01-01 1 sewing Thursday 12 0.80
## 5 2015-01-01 1 sewing Thursday 6 0.80
## 6 2015-01-01 1 sewing Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat
## 1 sewing
## 2 finishing
## 3 sewing
## 4 sewing
## 5 sewing
## 6 sewing
The target attribute of “department” will also represent “sewing” with 0 and “finishing” as 1, making it into integers for easier model building.
dataset <- within(dataset, department[department == 'sewing'] <- 0)
dataset <- within(dataset, department[department == 'finishing'] <- 1)
unique(dataset$department)
## [1] "0" "1"
str(dataset$department)
## chr [1:1197] "0" "1" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "1" "1" ...
dataset$department <- as.integer(dataset$department)
str(dataset$department)
## int [1:1197] 0 1 0 0 0 0 1 0 0 0 ...
# Day Data Type Conversion
unique(dataset$day)
## [1] "Thursday" "Saturday" "Sunday" "Monday" "Tuesday" "Wednesday"
day_cat <- dataset$day
dataset <- cbind(dataset, day_cat)
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 0 Thursday 8 0.80
## 2 2015-01-01 1 1 Thursday 1 0.75
## 3 2015-01-01 1 0 Thursday 11 0.80
## 4 2015-01-01 1 0 Thursday 12 0.80
## 5 2015-01-01 1 0 Thursday 6 0.80
## 6 2015-01-01 1 0 Thursday 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat day_cat
## 1 sewing Thursday
## 2 finishing Thursday
## 3 sewing Thursday
## 4 sewing Thursday
## 5 sewing Thursday
## 6 sewing Thursday
The “day” which represent the days of the week (Sunday to Saturday) are also converted into numeric format which 0 starts with Sunday with 1 representing Monday, until 6 depicting Saturday.
dataset <- within(dataset, day[day == 'Sunday'] <- 0)
dataset <- within(dataset, day[day == 'Monday'] <- 1)
dataset <- within(dataset, day[day == 'Tuesday'] <- 2)
dataset <- within(dataset, day[day == 'Wednesday'] <- 3)
dataset <- within(dataset, day[day == 'Thursday'] <- 4)
dataset <- within(dataset, day[day == 'Friday'] <- 5)
dataset <- within(dataset, day[day == 'Saturday'] <- 6)
unique(dataset$day)
## [1] "4" "6" "0" "1" "2" "3"
str(dataset$day)
## chr [1:1197] "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" ...
dataset$day <- as.integer(dataset$day)
str(dataset$day)
## int [1:1197] 4 4 4 4 4 4 4 4 4 4 ...
glimpse(dataset)
## Rows: 1,197
## Columns: 18
## $ date <dttm> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,…
## $ quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ department <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, …
## $ day <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ team <int> 8, 1, 11, 12, 6, 7, 2, 3, 2, 1, 9, 10, 5, 10, 8,…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ standard_minute_value <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ work_in_progress <dbl> 1108, 0, 968, 968, 1170, 984, 0, 795, 733, 681, …
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…
## $ quarter_cat <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department_cat <chr> "sewing", "finishing", "sewing", "sewing", "sewi…
## $ day_cat <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
head(dataset)
## date quarter department day team targeted_productivity
## 1 2015-01-01 1 0 4 8 0.80
## 2 2015-01-01 1 1 4 1 0.75
## 3 2015-01-01 1 0 4 11 0.80
## 4 2015-01-01 1 0 4 12 0.80
## 5 2015-01-01 1 0 4 6 0.80
## 6 2015-01-01 1 0 4 7 0.80
## standard_minute_value work_in_progress over_time incentive idle_time idle_men
## 1 26.16 1108 7080 98 0 0
## 2 3.94 0 960 0 0 0
## 3 11.41 968 3660 50 0 0
## 4 11.41 968 3660 50 0 0
## 5 25.90 1170 1920 50 0 0
## 6 25.90 984 6720 38 0 0
## no_of_style_change no_of_workers actual_productivity quarter_cat
## 1 0 59.0 0.9407254 Quarter1
## 2 0 8.0 0.8865000 Quarter1
## 3 0 30.5 0.8005705 Quarter1
## 4 0 30.5 0.8005705 Quarter1
## 5 0 56.0 0.8003819 Quarter1
## 6 0 56.0 0.8001250 Quarter1
## department_cat day_cat
## 1 sewing Thursday
## 2 finishing Thursday
## 3 sewing Thursday
## 4 sewing Thursday
## 5 sewing Thursday
## 6 sewing Thursday
A new column is added for the targeted productivity which will be used in model building and training in the following section later.
# Adding a new classification column based on actual_productivity and targeted_productivity
target <- dataset$actual_productivity < dataset$targeted_productivity + 0.001
dataset['target'] <- target
dataset <- within(dataset, target[target == TRUE] <- 1)
dataset <- within(dataset, target[target == FALSE] <- 0)
#Separate Categorical and Continous variables
G_con = dataset[ ,!sapply(dataset, is.character)]
G_cat = dataset[ ,sapply(dataset, is.character)]
# Correlation Matrix after Data Preprocessing
corrplot(cor(dataset[, 2:15]),
method = "square",
type = "upper",
tl.col = "black",
tl.cex = 0.6,
col = colorRampPalette(c("purple", "dark green"))(200)
)
The categorical and continuous variables are also separated with new variables “G_con” and “G_cat”.
After the whole data preprocessing and feature processing is completed, another correlation matrix is generated as shown below to better visualize the correlation between the attributes. It can be seen that the question mark symbols as seen in the previous section are now instead shown with the actual correlation between attributes.
It can be seen that the correlation between standard_minute_value - no_of_workers has the highest positive correlation with standard_minute_value - over_time and over_time - no_of_workers also having high positive correlation.
It is also to be highlighted that department - standard_minute_value, department - over_time and department - no_of_workers have a high negative correlation.
For the exploratory data analysis and visualisation, various types of charts were constructed and plotted to get a better understanding on the variables of our data as well as the relationship between them.
### Targeted Productivity over Time
dates <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$Group.1
targeted_productivity <- aggregate(dataset$targeted_productivity, by=list(dataset$date), mean)$x
plot_ly(
x = dates,
y = targeted_productivity,
type = "scatter",
mode = "lines"
) |>
layout(yaxis = list(range = c(0.65, 0.81)),
title = list(text='Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
### Actual Productivity over Time
dates <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$Group.1
actual_productivity <- aggregate(dataset$actual_productivity, by=list(dataset$date), mean)$x
plot_ly(
x = dates,
y = actual_productivity,
type = "scatter",
mode = "lines"
) |>
layout(yaxis = list(range = c(0.5, 0.9)),
title = list(text='Actual Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
### Actual & Targeted Productivity over Time
plot_ly(
x = dates,
y = targeted_productivity,
type = "scatter",
mode = "lines",
name = "targeted_productivity"
) %>%
add_trace(y = actual_productivity, name = "actual_productivity") %>%
layout(legend = list(x = 0.73, y = 0.95),
title = list(text='Actual & Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
We can observe the same in the graph above whereby the actual productivity had a big difference from 25th January to 8th February. There was a massive drop from 8th February to 8th March in the actual productivity but however, the targeted productivity was still raging at a moderate level. Therefore, we can still assume that the actual productivity is higher than the targeted productivity based on the graph. Although the actual productivity seems to be more volatile compared to the targeted productivity, the graph gives us a clear indication that the actual productivity was higher than the targeted productivity on many occasions within the timeframe.
### Ratio of Actual to Targeted Productivity over Time
prod_diff = actual_productivity - targeted_productivity
prod_diff
## [1] -0.0339209786 0.0265828621 0.0039838318 0.0232998741 0.0221744682
## [6] -0.0178785815 -0.0033075971 0.0167596471 -0.0183467855 0.0138895031
## [11] 0.0260206630 -0.0154420297 0.0607074814 -0.0168894555 -0.0545498789
## [16] 0.0107424270 0.0010644784 0.0115748421 -0.0258462651 0.0438446866
## [21] 0.0583350081 0.0931096937 0.0840465143 0.0337149014 0.0791443043
## [26] 0.1320382726 0.0472739451 -0.0311872794 0.0315611188 0.0563255852
## [31] 0.0876255877 0.0092534128 0.0014276525 0.0300904176 0.0470107688
## [36] 0.0790646426 -0.0204833649 -0.0359534660 -0.0382238699 -0.0206541136
## [41] -0.0137440371 -0.0624003944 -0.0484842466 -0.0820019806 -0.0762448567
## [46] -0.0430553499 -0.0285546069 -0.0730003276 -0.0580416762 -0.0462517230
## [51] -0.0002677061 -0.0452763715 -0.0122840756 0.0131424168 0.0165971511
## [56] 0.0328347056 0.0128943875 0.0352160328 -0.0136933439
plot_ly(
x = dates,
y = prod_diff,
type = "scatter",
mode = "lines"
) |>
layout(title = list(text='Ratio of Actual to Targeted Productivity over Time', y = 0.95, x = 0.5, xanchor = 'center', yanchor = 'top'))
# Distribution of Work in Progress vs day of the week
data_frame_wip <- subset(dataset, work_in_progress > 0)
nrow(data_frame_wip)
## [1] 691
result <- aggregate(data_frame_wip$work_in_progress, by=list(data_frame_wip$day_cat), mean)
result
## Group.1 x
## 1 Monday 2156.3103
## 2 Saturday 1003.0192
## 3 Sunday 1016.1983
## 4 Thursday 974.0254
## 5 Tuesday 1012.4831
## 6 Wednesday 973.7731
barplot(result$x, names.arg=result$Group.1, xlab="Weekdays", ylab="Total Work in Progress", col=rainbow(6),
main="Total Work in Progress by Weekdays",border="black")
# Quarter & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$quarter_cat), mean)
result
## Group.1 x
## 1 Quarter1 4480.917
## 2 Quarter2 4355.015
## 3 Quarter3 4896.000
## 4 Quarter4 4851.250
## 5 Quarter5 3725.455
barplot(result$x, names.arg=result$Group.1, xlab="Quarter", ylab="Average Worktime", col=rainbow(6),
main="Quarter vs Overtime",border="black")
# Team & Overtime
data_frame_ot <- dataset
nrow(data_frame_ot)
## [1] 1197
result <- aggregate(data_frame_ot$over_time, by=list(data_frame_ot$team), mean)
result
## Group.1 x
## 1 1 4793.429
## 2 2 4384.954
## 3 3 5375.684
## 4 4 5449.714
## 5 5 5330.968
## 6 6 3369.096
## 7 7 4857.188
## 8 8 4312.294
## 9 9 4519.038
## 10 10 4736.700
## 11 11 4342.500
## 12 12 3317.929
barplot(result$x, names.arg=result$Group.1, xlab="Team", ylab="Average Overtime", col=rainbow(12),
main="Team vs Overtime",border="black")
# Stacked Bar Plot with Colors and Legend
counts <- table(G_cat$department,G_cat$day)
barplot(counts, main="Department vs day",
xlab="Number of Departments", col=c("darkblue","red"),
legend = rownames(counts))
# Quarter vs Department
counts <- table(G_cat$department,G_cat$quarter)
barplot(counts, main="Quarter vs Department",
xlab="Number of Departments", col=c("pink","lightblue"),
legend = rownames(counts))
# Quarter Propagation
PieChart(quarter_cat, hole = 0, values = "%", data = dataset,
fill = rainbow(5), main = "Quarter Propagation")
## >>> Note: quarter_cat is not in a data frame (table)
## >>> Note: quarter_cat is not in a data frame (table)
## >>> suggestions
## piechart(quarter_cat, hole=0) # traditional pie chart
## piechart(quarter_cat, values="%") # display %'s on the chart
## piechart(quarter_cat) # bar chart
## plot(quarter_cat) # bubble plot
## plot(quarter_cat, values="count") # lollipop plot
##
## --- quarter_cat ---
##
## Quarter1 Quarter2 Quarter3 Quarter4 Quarter5 Total
## Frequencies: 360 335 210 248 44 1197
## Proportions: 0.301 0.280 0.175 0.207 0.037 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 262.336, df = 4, p-value = 0.000
### Department Propagation
PieChart(department_cat, hole = 0, values = "%", data = dataset,
fill = c("lightblue", "blue"), main = "Department Propagation")
## >>> Note: department_cat is not in a data frame (table)
## >>> Note: department_cat is not in a data frame (table)
## >>> suggestions
## piechart(department_cat, hole=0) # traditional pie chart
## piechart(department_cat, values="%") # display %'s on the chart
## piechart(department_cat) # bar chart
## plot(department_cat) # bubble plot
## plot(department_cat, values="count") # lollipop plot
##
## --- department_cat ---
##
## finishing sewing Total
## Frequencies: 506 691 1197
## Proportions: 0.423 0.577 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 28.592, df = 1, p-value = 0.000
# Day Propagation
PieChart(day_cat, hole = 0.6, values = "%", data = dataset,
fill = rainbow(5), main = "Day Propagation")
## >>> Note: day_cat is not in a data frame (table)
## >>> Note: day_cat is not in a data frame (table)
## >>> suggestions
## piechart(day_cat, hole=0) # traditional pie chart
## piechart(day_cat, values="%") # display %'s on the chart
## piechart(day_cat) # bar chart
## plot(day_cat) # bubble plot
## plot(day_cat, values="count") # lollipop plot
##
## --- day_cat ---
##
## Monday Saturday Sunday Thursday Tuesday Wednesday Total
## Frequencies: 199 187 203 199 201 208 1197
## Proportions: 0.166 0.156 0.170 0.166 0.168 0.174 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 1.221, df = 5, p-value = 0.943
# Team Propagation
PieChart(team, hole = 0.8, values = "%", data = dataset,
fill = rainbow(12), main = "Team Propagation")
## >>> suggestions
## piechart(team, hole=0) # traditional pie chart
## piechart(team, values="%") # display %'s on the chart
## piechart(team) # bar chart
## plot(team) # bubble plot
## plot(team, values="count") # lollipop plot
##
## --- team ---
##
## team Count Prop
## --------------------
## 1 105 0.088
## 2 109 0.091
## 3 95 0.079
## 4 105 0.088
## 5 93 0.078
## 6 94 0.079
## 7 96 0.080
## 8 109 0.091
## 9 104 0.087
## 10 100 0.084
## 11 88 0.074
## 12 99 0.083
## --------------------
## Total 1197 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 4.995, df = 11, p-value = 0.931
# Generate box plots of all numeric variables
new_df_num <- data.frame(dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$actual_productivity)
col <- c("targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "actual_productivity")
colnames(new_df_num) <- col
ggplot(gather(new_df_num), aes(key,value)) +
geom_boxplot(color="purple", fill="red") +
facet_wrap(~key, scales="free") +
labs(title = "Box plots showing the distribution of all variables",
subtitle ="except the first 3 categorical variables")
par(mfrow = c(1, 1))
# Plot a histogram or bar chart of each numeric variable
ggplot(gather(new_df_num), aes(value)) +
geom_histogram(bins=20,color='blue', fill='pink') +
facet_wrap(~key, scales = 'free') +
labs(title = "Histograms showing the distribution of all variables",
subtitle ="except the first 3 categorical variables")
According to the diagram above, the overtime boxplot comparatively is smaller than the others. This suggests that over time did not really contribute to the garment productivity. On the other hand, the number of workers boxplot comparatively is larger than the others. This suggests that the number of workers played an important role in the garment productivity. Moreover, if one were to compare the median of the targeted productivity and actual productivity, it can be observed that the actual productivity has a slightly higher median compared to the targeted productivity. Therefore, just by looking at this box-plot it can be infered that the actual productivity was more than the targeted productivity. It can be observed the same in Figure showing the Line graph of “Actual and Targeted Productivity over Time” whereby the actual productivity had a big difference from 25th January to 8th February.
par(mfrow = c(1, 1))
# Plot a bar chart for each of the first 4 categorical variables
new_df_cat <- data.frame(dataset$day_cat,
dataset$department_cat,
dataset$quarter_cat,
dataset$team)
col <- c("day_cat", "department_cat", "quarter_cat", "team")
colnames(new_df_cat) <- col
ggplot(gather(new_df_cat), aes(value)) +
geom_bar(color='black', fill='green') +
facet_wrap(~key, scales ='free') +
labs(title = "Bar charts showing the distribution of the first 4 categorical variables")
par(mfrow = c(1, 1))
# Actual productivity based on dates
ggplot(dataset,aes(x = reorder(date,actual_productivity,median),y = actual_productivity)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=quarter)) +
theme(legend.position = "bottom", legend.box = "horizontal") +
labs(x="Date (ordered by median actual_productivity)", title="Boxplots showing actual_productivity distribution for all dates") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
# Relationship between Actual and Targeted Productivity
ggplot(dataset, aes(x=targeted_productivity,y=actual_productivity, color=department_cat)) +
geom_point() +
geom_smooth(method=lm,se=FALSE) +
labs(title = "Scatterplot showing the relationship between actual productivity and targeted_productivity")
## `geom_smooth()` using formula = 'y ~ x'
# Investigate the relationship between actual_productivity and no_of_workers
scatterplot(actual_productivity ~ no_of_workers, data=dataset,
main="Scatterplot showing relationship between actual_productivity and no_of_workers", grid=FALSE)
Based on the above diagram it can be inferred that the sewing department contributed more to the garment productivity compared to the finishing department. This is because the slope of the sewing department has a stronger linear relationship compared to the finishing department.
new_df_reg <- data.frame(dataset$quarter,
dataset$department,
dataset$day,
dataset$team,
dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$actual_productivity)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "actual_productivity")
colnames(new_df_reg) <- col
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
start.time <- proc.time()
Model <- train(actual_productivity ~ .,
data = new_df_reg,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
## user system elapsed
## 1.62 0.01 48.88
stopCluster(cl)
Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Regression Model")
As illustrated in the plot, it can be observed that idle_time has the lowest feature importance score. Therefore, this variable can be dropped for the regression model.
drops <- c("idle_time")
new_df_reg <- new_df_reg[ , !(names(new_df_reg) %in% drops)]
new_df_class <- data.frame(dataset$quarter,
dataset$department,
dataset$day,
dataset$team,
dataset$targeted_productivity,
dataset$standard_minute_value,
dataset$work_in_progress,
dataset$over_time,
dataset$incentive,
dataset$idle_time,
dataset$idle_men,
dataset$no_of_style_change,
dataset$no_of_workers,
dataset$target)
col <- c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_time", "idle_men", "no_of_style_change",
"no_of_workers", "target")
colnames(new_df_class) <- col
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
start.time <- proc.time()
Model <- train(target ~ .,
data = new_df_class,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5))
)
stop.time <- proc.time()
run.time <- stop.time - start.time
print(run.time)
## user system elapsed
## 1.28 0.00 49.56
stopCluster(cl)
Importance <- varImp(Model)
plot(Importance, col = "red", main = "Feature Importance & Selection for Classification Model")
From the diagram above, it can be stated that idle_men has the lowest feature importance score followed by idle time. Therefore, these two variables can be dropped for the classification model.
drops <- c("idle_time", "idle_men")
new_df_class <- new_df_class[ , !(names(new_df_class) %in% drops)]
set.seed(100)
TrainingIndex <- createDataPartition(new_df_reg$actual_productivity, p=0.8, list = FALSE)
TrainingSet <- new_df_reg[TrainingIndex,]
TestingSet <- new_df_reg[-TrainingIndex,]
# Linear Regression:
lin_model <- train(actual_productivity ~ ., data = TrainingSet,
method = "lm",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"))
lin_MAE <- mae(TestingSet$actual_productivity, predict(lin_model, TestingSet))
# Ridge regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
y <- TrainingSet$actual_productivity
cv_model <- cv.glmnet(x, y, alpha = 0)
best_lambda <- cv_model$lambda.min
ridge_mod = glmnet(x, y, alpha = 0, lambda = best_lambda)
x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
rid_MAE <- mae(TestingSet$actual_productivity, predict(ridge_mod, s = best_lambda, newx = x))
# Lasso regression
x <- data.matrix(TrainingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
y <- TrainingSet$actual_productivity
lambdas <- 10^seq(2, -3, by = -.1)
lasso_mod = glmnet(x, y, alpha = 1, lambda = lambdas)
x <- data.matrix(TestingSet[, c("quarter", "department", "day", "team", "targeted_productivity", "standard_minute_value",
"work_in_progress", "over_time", "incentive", "idle_men", "no_of_style_change",
"no_of_workers")])
lasso_MAE <- mae(TestingSet$actual_productivity, predict(lasso_mod, newx = x))
# XGBoost
train_x = data.matrix(TrainingSet[, -13])
train_y = TrainingSet[,13]
test_x = data.matrix(TestingSet[, -13])
test_y = TestingSet[, 13]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
xgb_model = xgboost(data = xgb_train, max.depth = 3, nrounds = 56, verbose = 0)
xgb_MAE <- mae(test_y, predict(xgb_model, test_x))
print(paste0("Linear MAE: ", lin_MAE))
## [1] "Linear MAE: 0.111510357523804"
print(paste0("Ridge MAE: ", rid_MAE))
## [1] "Ridge MAE: 0.111521518701029"
print(paste0("Lasso MAE: ", lasso_MAE))
## [1] "Lasso MAE: 0.129368765462842"
print(paste0("XGBoost MAE: ", xgb_MAE))
## [1] "XGBoost MAE: 0.0842882112443574"
Regression Model Performance Table
regression_algorithm_name <- c('Linear Regression','Ridge Regression','Lasso Regression','XGBoost', "")
mae <- c(lin_MAE, rid_MAE, lasso_MAE, xgb_MAE, "")
rowName <- c('Model 1', 'Model 2', 'Model 3', 'Model 4', "")
model_result <- data.frame(regression_algorithm_name, mae, row.names = rowName)
knitr::kable(model_result, align = "lccrr")
| regression_algorithm_name | mae | |
|---|---|---|
| Model 1 | Linear Regression | 0.111510357523804 |
| Model 2 | Ridge Regression | 0.111521518701029 |
| Model 3 | Lasso Regression | 0.129368765462842 |
| Model 4 | XGBoost | 0.0842882112443574 |
Based on the table above, XGboost has the lowest mean absolute error of 0.084. Therefore, XGboost would be the most suitable regression algorithm to use for this study. Other than that, Lasso regression has the highest mean absolute error of 0.129. Therefore, lasso regression would not be a suitable regression algorithm for this study.
TrainingIndex <- createDataPartition(new_df_class$target, p=0.8, list = FALSE)
TrainingSet <- new_df_class[TrainingIndex,]
TestingSet <- new_df_class[-TrainingIndex,]
# Logistic Regression
logic_model <- glm(target ~., data = TrainingSet, family = binomial) %>%
stepAIC(trace = FALSE)
probabilities <- logic_model %>% predict(TestingSet, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
logic_accurecy <- mean(predicted.classes==TestingSet$target)
# Random Forest
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
rf_model <- train(target ~ .,
data = TrainingSet,
method = "rf",
tuneGrid = data.frame(mtry = seq(5,15, by=5)))
probabilities <- rf_model %>% predict(TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
rm_accurecy <- mean(predicted.classes==TestingSet$target)
# SVM
svm_model <- train(target ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1))
probabilities <-predict(svm_model, TestingSet)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
svm_accurecy <- mean(predicted.classes==TestingSet$target)
print(paste0("Logistic Accurecy: ", logic_accurecy))
## [1] "Logistic Accurecy: 0.719665271966527"
print(paste0("Random Forest Accurecy: ", rm_accurecy))
## [1] "Random Forest Accurecy: 0.832635983263598"
print(paste0("SVM Accurecy: ", svm_accurecy))
## [1] "SVM Accurecy: 0.635983263598326"
Classification Model Performance Table
classification_algorithm_name <- c('Logistic Regression','Random Forest','SVM', "")
accurecy <- c(logic_accurecy, rm_accurecy, svm_accurecy, "")
rowName <- c('Model 1', 'Model 2', 'Model 3', "")
model_result <- data.frame(classification_algorithm_name, accurecy, row.names = rowName)
knitr::kable(model_result, align = "lccrr")
| classification_algorithm_name | accurecy | |
|---|---|---|
| Model 1 | Logistic Regression | 0.719665271966527 |
| Model 2 | Random Forest | 0.832635983263598 |
| Model 3 | SVM | 0.635983263598326 |
Based on the table above, random forest has the highest accuracy of 83.26%. Therefore, random forest would be the most suitable classification algorithm for this study. On the other hand, support vector machine has an accuracy of 63.60%. Therefore, support vector machine would not be suitable classification algorithm for this study.
In this section, prediction will be conducted using the selected models.
test_data <- data.frame(quarter=1,
department=0,
day=4,
team=8,
targeted_productivity=0.80,
standard_minute_value=26.16,
work_in_progress=1108,
over_time=7080,
incentive=98,
idle_men=0,
no_of_style_change=0,
no_of_workers=59)
print(paste0("Actual Production will be: ", predict(xgb_model, as.matrix(test_data))))
## [1] "Actual Production will be: 0.947649240493774"
As per the second objective in regards to build, evaluate and choose the best regression model to predict garment employee’s productivity, the actual production as obtained by the model is 94.76%.
test_data <- data.frame(quarter=1,
department=0,
day=4,
team=7,
targeted_productivity=0.80,
standard_minute_value=25.90,
work_in_progress=984,
over_time=6720,
incentive=38,
no_of_style_change=0,
no_of_workers=56)
predict_val <- predict(rf_model, as.matrix(test_data))
prediction <- ifelse(predict_val > 0.5, "YES", "NO")
print(paste0("Is the Actual Productivity reached the Target Productivity: ", prediction))
## [1] "Is the Actual Productivity reached the Target Productivity: YES"
Following with the third objective on whether actual productivity has reached the target productivity, the selected classification model has provided “Yes” as the answer.
In approaching model selection, different selection criteria are being used. For regression problem, we prioritize mean absolute error in our model selection. Mean absolute error will measure average of errors in the forecasts. As comparison between 4 models mean absolute value, Linear regression(0.11), Ridge regression (0.11), Lasso regression (0.120), XGboost(0.08), the lowest MAE is chosen. Hence, we are adopting XGboost for our regression problem.
Meanwhile, for classification problem, accuracy of the prediction is being used as criteria selection. Generally, the higher accuracy, the better model will be. Comparison from 3 training model accuracy, Logistic regression(0.71), Random Forest(0.83), SVM(0.63), Random Forest are being adopted.
Both model selections will help the garment prediction to achieve optimum productivity.
Chaerani, N. (2018). Peran International Labour Organization Terhadap Peningkatan Lingkungan Kerja Di Sektor Industri Garmen Di Bangladesh. Universitas Hasanuddin, 151(2), 10–17.
M Saiful Islam, Rakib, M. A., & Adnan, A. (2019). Ready-Made Garments Sector of Bangladesh: Its Growth, Contribution, and Challenges. Economics World, 7(1). https://doi.org/10.17265/2328- 7144/2019.01.004
Mahesh, Batta. 2019. Machine Learning Algorithms - A Review DOI: 10.21275/ART20203995
Mubarok, N. (2017). Strategi Pemasaran Islami Dalam Meningkatkan Penjualan Pada Butik Calista. I-Economics, 3(1), 73–92.
Sri, D., & Margareta, C. (2020). Pengaruh Pelatihan Kewirausahaan , Kemampuan Memanfaatkan Teknologi Dan Pendidikan Terhadap Produktifitas Wanita. Economic and Education Journal, 42, 142–158.
United, F. (2021). Global fashion industry statistics - International apparel. Https://Fashionunited.Com/. https://fashionunited.com/global-fashion[1]industry-statistics/
Zhang, Yagang. 2010. New advances in machine learning (BoD–Books on Demand).