Image from CNN
The dataset is from UC Irvine Machine Learning Repository and has various information regarding the productivity of employees in a garment manufacturing companies. Originally, the data was published in the International Journal of Business Intelligence and Data Mining.
The variables I will use and their definitions:
Research Question - Which factors best predict actual productivity among garment manufacturing workers?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
setwd("C:/Users/tonge/Downloads")
productivity <- read_csv("garments_worker_productivity.csv")
## Rows: 1197 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): date, quarter, department, day
## dbl (11): team, targeted_productivity, smv, wip, over_time, incentive, idle_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(productivity)
## # A tibble: 6 × 15
## date quarter department day team targeted_productivity smv wip
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/2015 Quarter1 sweing Thursday 8 0.8 26.2 1108
## 2 1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA
## 3 1/1/2015 Quarter1 sweing Thursday 11 0.8 11.4 968
## 4 1/1/2015 Quarter1 sweing Thursday 12 0.8 11.4 968
## 5 1/1/2015 Quarter1 sweing Thursday 6 0.8 25.9 1170
## 6 1/1/2015 Quarter1 sweing Thursday 7 0.8 25.9 984
## # ℹ 7 more variables: over_time <dbl>, incentive <dbl>, idle_time <dbl>,
## # idle_men <dbl>, no_of_style_change <dbl>, no_of_workers <dbl>,
## # actual_productivity <dbl>
str(productivity)
## spc_tbl_ [1,197 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : chr [1:1197] "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
## $ quarter : chr [1:1197] "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
## $ department : chr [1:1197] "sweing" "finishing" "sweing" "sweing" ...
## $ day : chr [1:1197] "Thursday" "Thursday" "Thursday" "Thursday" ...
## $ team : num [1:1197] 8 1 11 12 6 7 2 3 2 1 ...
## $ targeted_productivity: num [1:1197] 0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
## $ smv : num [1:1197] 26.16 3.94 11.41 11.41 25.9 ...
## $ wip : num [1:1197] 1108 NA 968 968 1170 ...
## $ over_time : num [1:1197] 7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
## $ incentive : num [1:1197] 98 0 50 50 50 38 0 45 34 45 ...
## $ idle_time : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ idle_men : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_style_change : num [1:1197] 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_workers : num [1:1197] 59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
## $ actual_productivity : num [1:1197] 0.941 0.886 0.801 0.801 0.8 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. quarter = col_character(),
## .. department = col_character(),
## .. day = col_character(),
## .. team = col_double(),
## .. targeted_productivity = col_double(),
## .. smv = col_double(),
## .. wip = col_double(),
## .. over_time = col_double(),
## .. incentive = col_double(),
## .. idle_time = col_double(),
## .. idle_men = col_double(),
## .. no_of_style_change = col_double(),
## .. no_of_workers = col_double(),
## .. actual_productivity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
No NAs
colSums(is.na(productivity))
## date quarter department
## 0 0 0
## day team targeted_productivity
## 0 0 0
## smv wip over_time
## 0 506 0
## incentive idle_time idle_men
## 0 0 0
## no_of_style_change no_of_workers actual_productivity
## 0 0 0
Percentage of Productivity based on day of the week
Each day of the week fluctuates similarly. The most productive day seems to be Sunday, but they are all pretty close.
highchart() |>
hc_add_series(data = productivity,
type = "line",
hcaes(y = actual_productivity,
group = day)) |>
hc_yAxis(title = list(text = "Percent of Actual Work Productivity"))
Histogram of the actual productivity using R Shiny -
Slightly left skewed, the company mostly experiences a high productivity numbers. The most frequent productivity number is around 80%
library(shiny)
# Define UI for app that draws a histogram ----
ui <- fluidPage(
# App title ----
titlePanel("Hello Shiny!"),
# Sidebar layout with input and output definitions ----
sidebarLayout(
# Sidebar panel for inputs ----
sidebarPanel(
# Input: Slider for the number of bins ----
sliderInput(inputId = "bins",
label = "Number of bins:",
min = 5,
max = 50,
value = 30)
),
# Main panel for displaying outputs ----
mainPanel(
# Output: Histogram ----
plotOutput(outputId = "distPlot")
)
)
)
server <- function(input, output) {
# Histogram of the Old Faithful Geyser Data ----
# with requested number of bins
# This expression that generates a histogram is wrapped in a call
# to renderPlot to indicate that:
#
# 1. It is "reactive" and therefore should be automatically
# re-executed when inputs (input$bins) change
# 2. Its output type is a plot
output$distPlot <- renderPlot({
x <- productivity$actual_productivity
bins <- seq(min(x), max(x), length.out = input$bins + 1)
hist(x, breaks = bins, col = "dodgerblue", border = "cyan",
xlab = "Percentages of Productivity",
main = "Histogram of Percentages of Productivity")
})
}
shinyApp(ui = ui, server = server)
Scatterplot testing if incentive has any effect on the productivity-
There is no clear upward or downward curve meaning a high financial incentive does not seem to result in more productivity. Low incentives experience a range of productivity levels. There are a few outliers around 3000.
p2 <- ggplot(productivity, aes(x = incentive, y = actual_productivity)) +
labs(title = "Scatterplot of Incentive and Productivity Percentages",
caption = "Data from the UC Irvine Machine Learning Repository",
x = "Amount of financial incentive (in BDT)",
y = "Percentage of Productivity") +
theme_minimal(base_size = 12)
p2 + geom_point()
Adjusted R Squared : About 0.007 meaning about 0.7% of productivity variance is explained by this model
Number of workers and minutes worked over time have the least significant p values. I will be removing these variables and testing out new variables to see if my adjusted R squared will improve.
fit1 <- lm(actual_productivity ~ incentive + no_of_workers + over_time, data = productivity)
summary(fit1)
##
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_workers +
## over_time, data = productivity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56467 -0.08050 0.02895 0.11920 0.37506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.493e-01 9.495e-03 78.922 < 2e-16 ***
## incentive 8.588e-05 3.148e-05 2.728 0.00646 **
## no_of_workers -3.811e-04 3.345e-04 -1.139 0.25482
## over_time -9.499e-07 2.215e-06 -0.429 0.66809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1738 on 1193 degrees of freedom
## Multiple R-squared: 0.009834, Adjusted R-squared: 0.007344
## F-statistic: 3.949 on 3 and 1193 DF, p-value: 0.00812
All the variables(incentive, number of workers, numbers of style changes in a product, targeted productivity, and idle time) have resulted in a statistically significant p value. The smallest p value(< 2e-16) is seen in targeted_productivity.
The adjust R squared has improved and is now 19.7%.Therefore, 19.7% of productivity variance is explained by this model
fit2 <- lm(actual_productivity ~ incentive + no_of_style_change + targeted_productivity + idle_time, data = productivity)
summary(fit2)
##
## Call:
## lm(formula = actual_productivity ~ incentive + no_of_style_change +
## targeted_productivity + idle_time, data = productivity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55175 -0.05208 0.00734 0.09007 0.51753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.334e-01 3.521e-02 6.630 5.08e-11 ***
## incentive 6.505e-05 2.825e-05 2.303 0.0215 *
## no_of_style_change -5.090e-02 1.081e-02 -4.708 2.79e-06 ***
## targeted_productivity 6.955e-01 4.733e-02 14.693 < 2e-16 ***
## idle_time -8.191e-04 3.564e-04 -2.298 0.0217 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1564 on 1192 degrees of freedom
## Multiple R-squared: 0.1998, Adjusted R-squared: 0.1971
## F-statistic: 74.39 on 4 and 1192 DF, p-value: < 2.2e-16
Conclusion - According to this dataset the strongest predictor of work productivity is the targeted productivity, which is set by the Authority for each team for each day. The coefficient(6.955e-01) is positive showing the higher the target set the more productivity.
Productivity Prediction of Garment Employees [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C51S6D.
Wright, R., & Saeed, S. (2020, April 22). Bangladeshi garment workers face ruin as global brands ditch clothing contracts amid coronavirus pandemic. CNN. https://www.cnn.com/2020/04/22/business/bangladesh-garment-factories