02/16/2025
The purpose of this week’s data dive is for you to think critically about the importance of documenting your model, but also the importance of referencing the documentation for the data you’re using.
Documentation in a dataset serves as an essential reference, clarifying the meaning of specialized terms and numerical values. For the garment worker dataset, columns such as SMV, WIP, Incentive, Idle Men, and No of Style Change represent complex aspects of the manufacturing process that are unclear without the context that documentation provides. Here’s why these columns are crucial and what might happen if they are misunderstood:
1. SMV (Standard Minute Value)
Explanation: Measures the required time to produce one unit of a garment under standard operating conditions.
Importance in Dataset: Essential for evaluating worker efficiency and planning production schedules.
Reflection: Misunderstanding the data can lead to unrealistic goals and wasting workers’ time.
Why Encoded This Way: It gives a clear way to measure task difficulty and efficiency, helping to set realistic production goals.
Risks of Not Reading Documentation: It can lead to impossible goals, worker stress, or wasting resources.
2. WIP (Work In Progress)
Explanation: Indicates the number of items currently in production but not yet completed.
Importance in Dataset: Offers insights into the production flow and potential delays.
Reflection: Ignoring WIP data can hide production delays and cause inventory problems.
Why Encoded This Way: Enables effective monitoring of production stages and bottleneck identification.
Risks of Not Reading Documentation: Failing to monitor could lead to unnoticed bottlenecks, causing delays and inventory pile-ups.
3. Incentive
Explanation: Additional pay for workers who meet or exceed performance targets.
Importance in Dataset: Important for understanding what drives worker performance.
Reflection: Incorrect assumptions about incentives can give a false idea of how well they work.
Why Encoded This Way: Connects performance to rewards, encouraging workers and increasing productivity.
Risks of Not Reading Documentation: Not understanding incentive structures can reduce motivation and lower productivity.
4. Idle Men
Explanation: Workers who are available but not working, usually due to delays or inefficiencies.
Importance in Dataset: Shows possible inefficiencies or production downtime.
Reflection: Not monitoring could lead to increased labor costs and reduced productivity.
Why Encoded This Way: Helps identify reasons for reduced productivity, optimizing labor costs and efficiency.
Risks of Not Reading Documentation: Ignoring idle time can increase labor costs and reduce productivity.
5. No of Style Change
Explanation: How often production changes to fit different garment styles.
Importance in Dataset: Shows the impact of changes on production workflow and downtime.
Reflection: Overlooking the impact can lead to poor planning and increased operational costs.
Why Encoded This Way: Manages production complexity and ensures flexibility in responding to market demands.
Risks of Not Reading Documentation: Ignoring style changes can disrupt production timelines and misuse resources.
These examples underline the importance of detailed documentation for correctly understanding and using your dataset, which helps avoid mistakes and errors in analyzing the garment manufacturing process.
Addressing Unclear Elements in the Dataset
1. WIP (Work in Progress)
The dataset includes missing values in the WIP column, but the documentation does not specify why these values are missing. This omission leaves it unclear whether to treat these as zeros (indicating no work in progress), as actual missing data points (possibly due to collection issues), or as non-applicable entries under certain conditions. The uncertainty around how to handle these missing values could significantly affect analyses of production efficiency and workload management.
2. Incentive
The documentation does not explain how incentives are calculated or if they differ based on factors like productivity levels, departments, or other conditions. Without understanding the structure of incentive calculations, analyzing the impact of incentives on labor motivation and productivity becomes challenging, as assumptions may not align with actual policy.
3. Quarter
The dataset divides a month into four quarters but does not specify the duration or definition of each quarter. Since quarters typically span three months, this unusual division could lead to confusion in conducting temporal analyses or making seasonal adjustments. A wrong understanding of how quarters are defined can lead to incorrect analyses of time-based trends and operational cycles, potentially skewing results.
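To illustrate why the WIP ambiguity described above matters, here is a minimal sketch using a made-up toy data frame (the values are hypothetical and not taken from the dataset). It compares the average WIP when missing values are ignored with the average when they are assumed to mean zero.

```r
# Toy data: hypothetical values chosen only for illustration
toy <- data.frame(
  department = c("sewing", "sewing", "finishing", "finishing"),
  wip = c(1000, 1200, NA, NA)
)

# Interpretation 1: missing means "unknown", so drop it from the average
mean_ignoring_na <- mean(toy$wip, na.rm = TRUE)

# Interpretation 2: missing means "no work in progress", so treat it as zero
wip_zero_filled <- ifelse(is.na(toy$wip), 0, toy$wip)
mean_zero_filled <- mean(wip_zero_filled)

print(c(ignoring_na = mean_ignoring_na, zero_filled = mean_zero_filled))
```

In this toy case the two interpretations give averages that differ by a factor of two, which is why the documentation needs to say which one is intended.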
library(readr)
data <- read.csv("C:/Users/rbada/Downloads/productivity+prediction+of+garment+employees/garments_worker_productivity.csv")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create a summary of missing vs. available WIP data
wip_missing_summary <- data %>%
mutate(wip_missing = ifelse(is.na(wip), "Missing", "Available")) %>%
count(wip_missing)
# Plot Missing vs. Available WIP
ggplot(wip_missing_summary, aes(x = wip_missing, y = n, fill = wip_missing)) +
geom_bar(stat = "identity") +
labs(title = "Missing vs. Available WIP Data", x = "WIP Data", y = "Count") +
scale_fill_manual(values = c("Available" = "blue", "Missing" = "red")) + # name the colors so "Missing" is always red
theme_minimal()
The bar chart shows that a significant portion of WIP data is missing, which could affect production analysis. Whether these missing values represent zero work in progress or data collection errors must be clarified before they are used. Ensuring consistent WIP tracking and addressing missing data will improve accuracy in production monitoring and decision-making.
library(dplyr) # Load dplyr for data manipulation
# Standardize department names
data$department <- tolower(trimws(data$department)) # Convert to lowercase and remove spaces
data$department <- recode(data$department, "sweing" = "sewing") # Fix misspelled department names
# Check unique department names after correction
unique(data$department)
## [1] "sewing" "finishing"
library(dplyr) # Load dplyr for data manipulation
library(tidyr) # Load tidyr for pivot_wider function
# Create a summary of missing vs. available WIP
wip_missing_by_dept <- data %>%
mutate(wip_status = ifelse(is.na(wip), "Missing", "Available")) %>%
count(department, wip_status) %>%
pivot_wider(names_from = wip_status, values_from = n, values_fill = list(n = 0)) # Ensure missing values are replaced with 0
# View the corrected summary
print(wip_missing_by_dept)
## # A tibble: 2 × 3
## department Missing Available
## <chr> <int> <int>
## 1 finishing 506 0
## 2 sewing 0 691
library(ggplot2) # Load ggplot2 for visualization
# Plot missing WIP count per department
ggplot(wip_missing_by_dept, aes(x = department, y = Missing, fill = department)) +
geom_bar(stat = "identity") +
labs(title = "Missing WIP by Department",
x = "Department",
y = "Count of Missing WIP") +
theme_minimal()
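The bar chart shows that every missing WIP value falls in the finishing department: all 506 finishing records lack WIP, while all 691 sewing records have it. Missingness that lines up this cleanly with one department points to a systematic cause, such as WIP not being tracked or not applying to finishing, rather than scattered data entry gaps, although the documentation does not confirm this. The raw data also contained duplicate and misspelled department names (e.g., ‘sweing’ instead of ‘sewing’), which would have caused incorrect grouping and misleading analysis had they not been standardized first. Standardizing department names, investigating why finishing has no WIP values, and improving data entry practices will enhance data accuracy and ensure reliable production insights.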
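One way to check whether missingness follows a department boundary is to compute the share of missing WIP per department. The sketch below uses a small made-up data frame (hypothetical values) and base R only; the name `structural` is introduced here purely for illustration.

```r
# Toy data: hypothetical values chosen only for illustration
toy <- data.frame(
  department = c("sewing", "sewing", "sewing", "finishing", "finishing"),
  wip = c(900, 1100, 1050, NA, NA)
)

# Share of missing WIP values within each department (0 = none, 1 = all)
missing_share <- tapply(is.na(toy$wip), toy$department, mean)
print(missing_share)

# If every department is either fully missing or fully present,
# the missingness is likely tied to the department itself
structural <- all(missing_share %in% c(0, 1))
print(structural)
```

A share of exactly 0 or 1 in every department suggests the gap is structural, which calls for a documentation answer rather than imputation.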
library(ggplot2)
data$department <- as.factor(data$department) # Ensuring department is a factor
ggplot(data, aes(x = department, y = incentive, fill = department)) +
geom_boxplot() +
labs(title = "Incentive Distribution by Department",
x = "Department",
y = "Incentive Amount") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot reveals high variability in incentive distribution, especially in the Finishing department, where some employees receive very high incentives while others receive little or none. This raises fairness and transparency concerns, as unclear incentive calculations can lead to dissatisfaction. Additionally, inconsistent department labels in the raw data (e.g., “finishing” appearing twice) indicate data quality issues that may mislead analysis if left uncorrected. To address this, department names should be standardized, high outliers investigated, and incentive criteria clarified to ensure a fair and transparent reward system. Outliers in incentives may be due to special bonuses, data mistakes, or inconsistencies in the system.
library(ggplot2)
ggplot(data, aes(x = actual_productivity, y = incentive)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Incentive Amount vs. Actual Productivity",
x = "Actual Productivity",
y = "Incentive Amount") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Insight: The scatter plot with a regression line shows no strong linear relationship between incentives and productivity, suggesting that higher incentives do not consistently accompany better performance. Many employees receive low or no incentives, while a few get very high amounts, raising concerns about fairness and data accuracy. This highlights the need to review how incentives are assigned, check for missing or incorrect data, and assess whether incentives effectively drive productivity.
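The visual impression from a scatter plot can be backed by a correlation coefficient. The sketch below uses made-up numbers (hypothetical, not from the dataset) to show how a single very large incentive can weaken the Pearson correlation while the rank-based Spearman correlation stays high, so reporting both is a reasonable precaution.

```r
# Toy data: hypothetical values with one extreme incentive
toy <- data.frame(
  actual_productivity = c(0.3, 0.5, 0.7, 0.8, 0.9, 1.0),
  incentive = c(0, 0, 30, 45, 60, 960)
)

# Pearson measures linear association and is sensitive to extreme values
pearson_r <- cor(toy$actual_productivity, toy$incentive, method = "pearson")

# Spearman works on ranks, so a single extreme value has less influence
spearman_r <- cor(toy$actual_productivity, toy$incentive, method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_r))
```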
library(ggplot2)
library(dplyr)
# Count the number of records in each quarter
quarter_counts <- data %>%
group_by(quarter) %>%
summarise(count = n())
# Plot the distribution of records per quarter
ggplot(quarter_counts, aes(x = quarter, y = count, fill = quarter)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Data Across Quarters",
x = "Quarter",
y = "Count of Records") +
theme_minimal()
The dataset divides each month into four quarters, but the presence of Quarter 5 suggests a possible data classification error. The uneven distribution of records across quarters shows inconsistencies in time division. To ensure accurate trend analysis, the quarter system should be clarified and standardized. Maintaining consistent time grouping will help prevent confusion in time-based reporting.
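Before treating Quarter 5 as an error, it may help to see which days of the month each quarter label covers. The sketch below assumes the dataset has a `date` column stored as month/day/year text; both that format and the toy rows are assumptions made for illustration, not values taken from the data.

```r
# Toy data: hypothetical dates and labels, assuming an m/d/Y text format
toy <- data.frame(
  date = c("1/1/2015", "1/8/2015", "1/15/2015", "1/22/2015", "1/29/2015", "1/31/2015"),
  quarter = c("Quarter1", "Quarter2", "Quarter3", "Quarter4", "Quarter5", "Quarter5"),
  stringsAsFactors = FALSE
)

# Extract the day of the month from the parsed date
toy$day_of_month <- as.integer(format(as.Date(toy$date, format = "%m/%d/%Y"), "%d"))

# First and last day of the month observed for each quarter label
first_day <- tapply(toy$day_of_month, toy$quarter, min)
last_day <- tapply(toy$day_of_month, toy$quarter, max)
day_range <- data.frame(
  quarter = names(first_day),
  first_day = as.vector(first_day),
  last_day = as.vector(last_day)
)
print(day_range)
```

If the same table built from the real data showed Quarter 5 confined to the last few days of each month, the label would be a definition to document rather than an error to delete.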
# Calculate average productivity per quarter
productivity_trend <- data %>%
group_by(quarter) %>%
summarise(avg_productivity = mean(actual_productivity, na.rm = TRUE))
# Plot the trend of average productivity per quarter
ggplot(productivity_trend, aes(x = quarter, y = avg_productivity, group = 1)) +
geom_line(color = "orange", linewidth = 1) + # linewidth replaces the deprecated size argument for lines
geom_point(color = "red", size = 3) +
labs(title = "Average Productivity Across Quarters",
x = "Quarter",
y = "Average Productivity") +
theme_minimal()
# Ensure Quarter 5 is removed
data_filtered <- data %>%
filter(quarter != "Quarter5") # Removing incorrect quarter
# Calculate average productivity per quarter
productivity_trend <- data_filtered %>%
group_by(quarter) %>%
summarise(avg_productivity = mean(actual_productivity, na.rm = TRUE))
# Plot the trend of average productivity per quarter
ggplot(productivity_trend, aes(x = quarter, y = avg_productivity, group = 1)) +
geom_line(color = "orange", linewidth = 1) + # linewidth replaces the deprecated size argument for lines
geom_point(color = "red", size = 3) +
labs(title = "Average Productivity Across Quarters",
x = "Quarter",
y = "Average Productivity") +
theme_minimal()
The initial chart displayed five quarters, which conflicts with the dataset's description of four quarters per month and could have led to misleading trend analysis. The sharp jump in Quarter 5 suggested a data entry or classification error, so records labeled Quarter5 were filtered out before the averages were recalculated. The chart now shows average productivity across four quarters, allowing for a clearer analysis of trends over time. Because the documentation does not define Quarter 5, removing it is an assumption that should be confirmed before relying on these results for data-driven decisions.
I examined four categorical columns: “department,” “quarter,” “day,” and “team” for missing and empty data issues. Here’s what I found:
1. Explicitly Missing Rows
No missing values (NA) were found in any of the selected columns.
Every record has a valid department, quarter, day, and team assigned.
2. Implicitly Missing Rows
Quarters: All expected quarters (Quarter1, Quarter2, Quarter3, Quarter4) are present.
Days: Friday is missing from the dataset, meaning no records were assigned to this day. This could indicate:
A scheduling issue (e.g., no work on Fridays).
Incomplete data collection (Friday’s records were not recorded or were removed).
3. Empty Groups
No departments or teams were found with zero records, meaning all listed categories have data.
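The checks above can be sketched in a few lines of base R. The toy data frame below is made up for illustration, and `expected_days` is a hypothetical reference list; the idea is that explicit gaps show up as NA, while implicit gaps only appear when the observed values are compared against what should exist.

```r
# Toy data: hypothetical records with no NA values but no Friday entries
toy <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Saturday", "Sunday"),
  stringsAsFactors = FALSE
)

# Explicitly missing: count NA values per column
explicit_na <- colSums(is.na(toy))
print(explicit_na)

# Implicitly missing: expected categories that never appear in the data
expected_days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")
missing_days <- setdiff(expected_days, unique(toy$day))
print(missing_days)

# Empty groups: counting against the full set of levels exposes zero counts
print(table(factor(toy$day, levels = expected_days)))
```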
library(ggplot2)
# Assuming 'data' is your dataframe and 'over_time' is the column of interest
Q1 <- quantile(data$over_time, 0.25, na.rm = TRUE)
Q3 <- quantile(data$over_time, 0.75, na.rm = TRUE)
iqr_value <- Q3 - Q1 # named iqr_value to avoid masking the built-in IQR() function
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value
library(ggplot2)
# Sample plot code assuming lower_bound and upper_bound have been calculated
ggplot(data, aes(x = over_time)) +
geom_histogram(binwidth = 100, fill = "blue", color = "black") +
geom_vline(xintercept = lower_bound, color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = upper_bound, color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = lower_bound, y = Inf, label = "Lower Bound", color = "red", vjust = 1.5, hjust = -0.1) + # positive vjust keeps the label inside the panel
annotate("text", x = upper_bound, y = Inf, label = "Upper Bound", color = "red", vjust = 1.5, hjust = 1.1) +
ggtitle("Overtime Distribution with Outlier Bounds") +
xlab("Over Time (minutes)") +
ylab("Frequency") +
theme_minimal()
Our review of overtime shows that most employees usually work only a little extra beyond their regular hours. A few cases, however, are much higher than usual and fall beyond the upper red dashed line, which marks 1.5 times the interquartile range above the third quartile. These outliers could be due to particularly busy days or special projects that require more work hours. To ensure employees aren’t overwhelmed and productivity remains high, we suggest taking a closer look at these high overtime incidents to figure out their causes. Depending on what we find, we might need to adjust work schedules or add temporary staff during peak times. These actions will help maintain a balanced workload and keep everyone productive.
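To take that closer look, the records outside the bounds can be pulled out for review. This is a minimal sketch of the 1.5 × IQR rule on a made-up data frame; the `team` and `over_time` values are hypothetical and chosen only to show the mechanics.

```r
# Toy data: hypothetical overtime minutes with one extreme value
toy <- data.frame(
  team = 1:7,
  over_time = c(960, 1440, 2880, 3960, 5040, 6960, 25920)
)

# Quartiles and the 1.5 * IQR bounds (names = FALSE returns plain numbers)
q1 <- quantile(toy$over_time, 0.25, names = FALSE)
q3 <- quantile(toy$over_time, 0.75, names = FALSE)
iqr_value <- q3 - q1
lower <- q1 - 1.5 * iqr_value
upper <- q3 + 1.5 * iqr_value

# Keep only the rows that fall outside the bounds
flagged <- toy[toy$over_time < lower | toy$over_time > upper, ]
print(flagged)
```

Listing the flagged rows alongside their date and department in the real data would show whether the extremes cluster around particular days or teams.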
library(ggplot2)
library(dplyr)
# Assuming 'data' is your DataFrame
percentiles <- data %>%
summarise(
Prod_1 = quantile(actual_productivity, 0.01, na.rm = TRUE),
Prod_99 = quantile(actual_productivity, 0.99, na.rm = TRUE),
WIP_1 = quantile(wip, 0.01, na.rm = TRUE),
WIP_99 = quantile(wip, 0.99, na.rm = TRUE)
) %>%
unlist()
print(percentiles)
## Prod_1 Prod_99 WIP_1 WIP_99
## 0.263593 1.005156 27.600000 9072.000000
library(ggplot2)
ggplot(data, aes(y = actual_productivity)) +
geom_boxplot() +
ggtitle("Distribution of Actual Productivity") +
ylab("Actual Productivity") +
xlab("") +
theme_minimal()
The box plot for Actual Productivity shows that most values fall within a normal range, with a few lower outliers below the main distribution. These lower outliers indicate instances where productivity was significantly lower than usual. Possible reasons could include machine downtime, skill gaps among workers, or unexpected delays in production. Since no high outliers are present, it suggests that productivity rarely exceeds expectations. To improve overall productivity, it’s important to investigate the factors causing low productivity outliers and address them. This may involve enhancing training, optimizing workflows, or identifying production slowdowns. Monitoring these trends can help ensure stable and efficient operations.
# Ensure percentiles are calculated
percentiles <- data %>%
summarise(
WIP_1 = quantile(wip, 0.01, na.rm = TRUE),
WIP_99 = quantile(wip, 0.99, na.rm = TRUE)
) %>%
unlist()
# Create the status column in the dataset
data <- data %>%
mutate(status = case_when(
is.na(wip) ~ NA_character_, # leave missing WIP unclassified instead of labeling it "normal"
wip <= percentiles["WIP_1"] ~ "low",
wip >= percentiles["WIP_99"] ~ "high",
TRUE ~ "normal"
))
# Convert status to a factor for proper color mapping
data$status <- factor(data$status, levels = c("low", "normal", "high"))
ggplot(data, aes(x = "", y = wip, color = status)) +
geom_boxplot() +
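geom_hline(yintercept = percentiles["WIP_1"], color = "blue", linetype = "dashed") + # WIP is on the y axis, so thresholds are horizontal lines
geom_hline(yintercept = percentiles["WIP_99"], color = "red", linetype = "dashed") + # line colors match the low/high status colors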
scale_color_manual(values = c("high" = "red", "low" = "blue", "normal" = "green")) +
labs(title = "WIP with Dynamic Percentile Thresholds", x = "", y = "WIP (Units)") +
theme_minimal()
## Warning: Removed 506 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The box plot for “Work in Progress” (WIP) shows three categories of workload: low, normal, and high. Most work fits within the normal range, but the plot also highlights times when workload is unusually high or low. High periods may indicate times when too much is demanded from resources, suggesting a need for better staffing during these peaks. Low periods might show times when we’re not using our resources fully, suggesting we could be more efficient. Keeping an eye on these patterns helps us manage workloads better and improve how we operate. Outliers in the data show extreme cases of high or low workloads, which can help identify potential issues or opportunities for improvement.
library(ggplot2)
ggplot(data, aes(x = smv)) + # Plot the distribution of the smv column
geom_histogram(color = "white", fill = "#3182bd", bins = 30) + # Adjust bin size as needed
labs(title = "Distribution of SMV",
x = "Standard Minute Value (SMV)",
y = "Count") +
theme_classic()
The histogram of Standard Minute Value (SMV) shows a skewed distribution, where most tasks require low to moderate SMV, but a few take significantly longer. This suggests that while most tasks are completed quickly, some may take longer due to complexity or inefficiencies in the process. To better understand these variations, we should investigate what factors contribute to high SMV values, such as the type of garment or worker skill level. It’s also important to assess whether these longer tasks cause production delays or bottlenecks. If so, adjustments in workflow or targeted training programs could help reduce SMV for time-consuming tasks. By identifying and addressing these outliers, we can improve efficiency, balance workloads, and enhance overall production planning.
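A quick numeric companion to the histogram is to compare the mean with the median: a mean noticeably above the median is a rough sign of a right-skewed distribution. The values below are made up for illustration and are not taken from the dataset.

```r
# Toy data: hypothetical SMV values with a long right tail
toy_smv <- c(2.9, 3.94, 4.15, 11.41, 15.26, 22.52, 30.1, 54.56)

mean_smv <- mean(toy_smv)
median_smv <- median(toy_smv)

# A mean above the median hints at right skew (a rough check, not a formal test)
right_skewed <- mean_smv > median_smv
print(c(mean = mean_smv, median = median_smv))
print(right_skewed)
```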