My final project is centered on Analyzing Productivity Variation across different days of the week. The objective of this project is to analyze productivity variation in a manufacturing setting based on the day of the week. By examining historical data, we aim to identify whether certain days exhibit consistently higher or lower productivity levels. Additionally, we seek to understand the underlying factors contributing to these variations and leverage predictive modeling techniques to forecast productivity.
First, the data set and neccessary libraries were loaded
data <- read.csv("C:\\Users\\Krishna\\Downloads\\productivity+prediction+of+garment+employees\\garments_worker_productivity.csv")
# Set the CRAN mirror
# Install the caret package specifying the CRAN mirror
install.packages("caret", repos = "https://cloud.r-project.org/")
## Installing package into 'C:/Users/Krishna/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'caret' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'caret'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Krishna\AppData\Local\R\win-library\4.3\00LOCK\caret\libs\x64\caret.dll
## to C:\Users\Krishna\AppData\Local\R\win-library\4.3\caret\libs\x64\caret.dll:
## Permission denied
## Warning: restored 'caret'
##
## The downloaded binary packages are in
## C:\Users\Krishna\AppData\Local\Temp\Rtmpq6CB3o\downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
library(ggplot2) # For data visualization
library(tidyr) # For data tidying
library(stats) # For statistical tests
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
data <- na.omit(data)
Next, we have Converted the day of the week variable into an ordered factor. By default, the day of the week variable may be stored as a character or categorical variable in the dataset. However, treating it as a simple categorical variable without considering its ordinal nature could lead to incorrect interpretations or analyses. For example, a model or statistical test may treat Monday and Friday as equally distant from each other, which may not accurately reflect the sequential nature of weekdays.
Converting the day of the week variable into an ordered factor assigns numerical values to each day based on their natural order . This transformation explicitly captures the inherent sequence of days, allowing statistical analyses to correctly interpret and analyze the data.
data$day <- factor(data$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Saturday", "Sunday"))
summary(data)
## date quarter department day
## Length:691 Length:691 Length:691 Monday :116
## Class :character Class :character Class :character Tuesday :118
## Mode :character Mode :character Mode :character Wednesday:119
## Thursday :118
## Saturday :104
## Sunday :116
## team targeted_productivity smv wip
## Min. : 1.000 Min. :0.070 Min. :10.05 Min. : 7.0
## 1st Qu.: 4.000 1st Qu.:0.700 1st Qu.:18.79 1st Qu.: 774.5
## Median : 6.000 Median :0.750 Median :22.52 Median : 1039.0
## Mean : 6.521 Mean :0.724 Mean :23.25 Mean : 1190.5
## 3rd Qu.:10.000 3rd Qu.:0.800 3rd Qu.:28.08 3rd Qu.: 1252.5
## Max. :12.000 Max. :0.800 Max. :54.56 Max. :23122.0
## over_time incentive idle_time idle_men
## Min. : 0 Min. : 0.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 4560 1st Qu.: 30.00 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 6840 Median : 45.00 Median : 0.000 Median : 0.0000
## Mean : 6508 Mean : 44.48 Mean : 1.265 Mean : 0.6397
## 3rd Qu.: 7200 3rd Qu.: 60.00 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :25920 Max. :138.00 Max. :300.000 Max. :45.0000
## no_of_style_change no_of_workers actual_productivity
## Min. :0.0000 Min. :26.00 Min. :0.2337
## 1st Qu.:0.0000 1st Qu.:52.00 1st Qu.:0.6615
## Median :0.0000 Median :57.00 Median :0.7506
## Mean :0.2605 Mean :52.45 Mean :0.7220
## 3rd Qu.:0.0000 3rd Qu.:58.00 3rd Qu.:0.8004
## Max. :2.0000 Max. :89.00 Max. :1.1005
head(data)
## date quarter department day team targeted_productivity smv wip
## 1 1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108
## 3 1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968
## 4 1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968
## 5 1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170
## 6 1/1/2015 Quarter1 sweing Thursday 7 0.80 25.90 984
## 8 1/1/2015 Quarter1 sweing Thursday 3 0.75 28.08 795
## over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1 7080 98 0 0 0 59.0
## 3 3660 50 0 0 0 30.5
## 4 3660 50 0 0 0 30.5
## 5 1920 50 0 0 0 56.0
## 6 6720 38 0 0 0 56.0
## 8 6900 45 0 0 0 57.5
## actual_productivity
## 1 0.9407254
## 3 0.8005705
## 4 0.8005705
## 5 0.8003819
## 6 0.8001250
## 8 0.7536835
Here, we have used Boxplot to represent the productivity across different days of week because it displays key summary statistics such as the median, quartiles, and potential outliers, boxplots provide a comprehensive view of the productivity distribution for each day.
it also helps you to compare the distribution of productivity across different days of the week. This comparison allows you to identify patterns or trends in productivity levels, such as whether certain days consistently exhibit higher or lower productivity compared to others. here, the Days with higher median values typically indicate higher productivity levels, while days with lower median values suggest lower productivity levels.
ggplot(data, aes(x = day, y = actual_productivity)) +
geom_boxplot(fill = "skyblue", color = "blue") +
labs(x = "Day of the Week", y = "Productivity", title = "Distribution of Productivity Across Different Days") +
theme_minimal()
Here, we have used ANOVA to determine whether there are significant differences in productivity means across the different days of the week. it enables simultaneous comparison of productivity means across all days of the week, rather than pairwise comparisons.
1)df:
Day: This represents the degrees of freedom associated with the variable “day,” which is the categorical variable representing different days of the week. In this case, there are 5 degrees of freedom, indicating that there are 6 levels (days) of the variable minus 1.
Residuals::This represents the degrees of freedom associated with the error term or residuals, which are the differences between the observed values and the values predicted by the model. In this case, there are 685 degrees of freedom, which is the total number of observations minus the total number of levels of the “day” variable.
# ANOVA test
anova_result <- aov(actual_productivity ~ day, data = data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## day 5 0.043 0.008671 0.36 0.876
## Residuals 685 16.488 0.024070
After conducting ANOVA and obtaining an overall assessment of the variability in productivity means across different days of the week, it’s common to perform post-hoc tests to identify specific pairs of days with significantly different means. Tukey’s Honestly Significant Difference (HSD) test is one such post-hoc test commonly used for this purpose. the main use of this Tukey’s test is pairwise comparisons.
ANOVA determines whether there are significant differences in productivity means across groups (days) but does not specify which pairs of groups differ significantly. Tukey’s HSD test allows for efficient pairwise comparisons between all combinations of days, pinpointing specific pairs with significantly different means.
A significant difference between the means of two days indicates that there is strong evidence to suggest that the productivity levels on those days are indeed different. This information is valuable for identifying specific days that exhibit notable variations in productivity and understanding the factors contributing to these differences.
tukey_test <- TukeyHSD(anova_result)
tukey_test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = actual_productivity ~ day, data = data)
##
## $day
## diff lwr upr p adj
## Tuesday-Monday 0.003945265 -0.05402513 0.06191566 0.9999616
## Wednesday-Monday -0.001457702 -0.05930722 0.05639182 0.9999997
## Thursday-Monday -0.013397665 -0.07136806 0.04457273 0.9860848
## Saturday-Monday 0.009910882 -0.04996253 0.06978429 0.9970544
## Sunday-Monday -0.010620008 -0.06883761 0.04759759 0.9953368
## Wednesday-Tuesday -0.005402967 -0.06300370 0.05219776 0.9998128
## Thursday-Tuesday -0.017342929 -0.07506505 0.04037919 0.9560277
## Saturday-Tuesday 0.005965617 -0.05366745 0.06559868 0.9997431
## Sunday-Tuesday -0.014565272 -0.07253566 0.04340512 0.9797555
## Thursday-Wednesday -0.011939962 -0.06954069 0.04566077 0.9915395
## Saturday-Wednesday 0.011368584 -0.04814698 0.07088415 0.9942129
## Sunday-Wednesday -0.009162305 -0.06701182 0.04868721 0.9976131
## Saturday-Thursday 0.023308546 -0.03632452 0.08294161 0.8743598
## Sunday-Thursday 0.002777657 -0.05519273 0.06074805 0.9999933
## Sunday-Saturday -0.020530889 -0.08040430 0.03934252 0.9242781
Here, the bar plot provides a visual comparison of mean productivity levels among the different days of the week. It allows stakeholders to quickly identify any noticeable patterns or differences in productivity across the days. Stakeholders can visually identify days with relatively high or low mean productivity based on the height of the bars. Higher bars indicate higher mean productivity, while lower bars indicate lower mean productivity.
Insights from the visualization can inform decisions related to resource allocation and workforce scheduling. For example, if certain weekdays consistently exhibit lower mean productivity, management may consider adjusting staffing levels or operational procedures to improve efficiency on those days.
mean_productivity <- data %>%
group_by(day) %>%
summarize(mean_productivity = mean(actual_productivity))
ggplot(mean_productivity, aes(x = day, y = mean_productivity)) +
geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
labs(x = "Day of the Week", y = "Mean Productivity", title = "Mean Productivity Across Different Days") +
theme_minimal()
model_data <- data %>% select(actual_productivity, smv, wip) # Select relevant columns
# Split data into training and testing sets
set.seed(123) # for reproducibility
train_index <- createDataPartition(model_data$actual_productivity, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
# Train the linear regression model
lm_model <- lm(actual_productivity ~ smv + wip, data = train_data)
# Summarize the model
summary(lm_model)
##
## Call:
## lm(formula = actual_productivity ~ smv + wip, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46328 -0.05872 0.02643 0.09720 0.37400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.966e-01 2.305e-02 34.555 < 2e-16 ***
## smv -3.727e-03 9.370e-04 -3.977 7.9e-05 ***
## wip 9.878e-06 3.542e-06 2.789 0.00546 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1514 on 552 degrees of freedom
## Multiple R-squared: 0.0425, Adjusted R-squared: 0.03903
## F-statistic: 12.25 on 2 and 552 DF, p-value: 6.235e-06
To predict productivity, we employ linear regression, a statistical technique that models the relationship between predictor variables (SMV, WIP) and the target variable (actual productivity). We train the linear regression model using the training dataset, aiming to capture underlying patterns and relationships. The trained model allows us to predict productivity based on SMV and WIP values for each observation. Linear regression assumes that there is a linear relationship between the independent variables and the dependent variable. It means that changes in the independent variables result in proportional changes in the dependent variable.
The coefficients indicate the strength and direction of the relationship between each independent variable and productivity. A positive coefficient suggests a positive relationship (increasing the independent variable leads to an increase in productivity), while a negative coefficient suggests a negative relationship.
predictions <- predict(lm_model, newdata = test_data)
print(predictions)
## 5 28 44 49 50 60 69 72
## 0.7116478 0.7074584 0.7586695 0.7006390 0.7093255 0.6351449 0.7124874 0.7313468
## 96 108 122 134 138 176 192 199
## 0.7642212 0.7122001 0.7187786 0.7102253 0.7310280 0.7127829 0.6140384 0.7099774
## 205 210 223 225 229 235 238 252
## 0.7318578 0.6122504 0.7105405 0.7112131 0.7638459 0.7007416 0.7201966 0.7286663
## 267 284 288 289 294 295 304 312
## 0.7127443 0.7646307 0.6168840 0.7102639 0.7241750 0.7239478 0.6207627 0.7523828
## 323 327 339 340 342 355 359 370
## 0.7137222 0.7218733 0.7644332 0.7142557 0.7223969 0.7208361 0.7251683 0.7296773
## 377 379 381 390 406 424 425 431
## 0.7258049 0.7249751 0.7352072 0.6976720 0.7251925 0.7272823 0.7277367 0.7525450
## 456 464 468 472 473 474 478 486
## 0.6993646 0.7236119 0.7296037 0.7541848 0.7500174 0.6866585 0.7280869 0.7294566
## 501 511 515 518 544 559 566 571
## 0.7253999 0.7291602 0.7375205 0.7170842 0.7264767 0.7180165 0.9254688 0.7769171
## 593 602 611 612 616 629 636 648
## 0.7116338 0.7236317 0.7358827 0.7207330 0.6876580 0.6986499 0.7177300 0.7346183
## 666 673 678 681 690 700 705 724
## 0.6970694 0.7076638 0.7275633 0.7493246 0.6927553 0.7005664 0.6928837 0.6949977
## 729 735 744 745 746 760 786 795
## 0.7127950 0.7485738 0.6934172 0.7643541 0.6882281 0.6925775 0.7267335 0.6934468
## 799 801 804 814 833 841 842 849
## 0.6845068 0.7398839 0.7269607 0.7645023 0.6899004 0.6921550 0.7343243 0.6871641
## 859 863 864 875 883 892 906 908
## 0.7424223 0.6918883 0.7371989 0.6983671 0.6888830 0.6914415 0.7502334 0.6933381
## 912 913 917 934 952 954 968 975
## 0.7602402 0.6986239 0.6977052 0.6954134 0.7366631 0.7310117 0.7515373 0.7295991
## 980 990 991 1000 1009 1010 1017 1019
## 0.7115617 0.7383721 0.6983375 0.7040935 0.7123421 0.7499667 0.6924894 0.7054765
## 1021 1031 1033 1047 1054 1057 1089 1093
## 0.7302214 0.6984659 0.7066422 0.7412188 0.7583678 0.6930714 0.7494826 0.7084661
## 1116 1137 1153 1157 1161 1185 1187 1188
## 0.6970623 0.7576170 0.7122101 0.7075672 0.7574293 0.7294518 0.7097207 0.7070733
After fitting the linear regression model using train data set, we can use the model to make predictions on new data(test_data). The predictions allow us to assess how well the model generalizes to new data. By comparing the predicted values to the actual values of productivity, we can evaluate the accuracy and effectiveness of the linear regression model in predicting actual productivity levels based on SMV and WIP.
rmse <- sqrt(mean((test_data$actual_productivity - predictions)^2))
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 0.153976005669743"
Next we have calculated root mean square error(RMSE) which is a common method for evaluating the performance of a predictive model, such as a linear regression model. The RMSE provides a measure of how well the model’s predictions align with the actual observed values.
visualization_data <- data.frame(Actual = test_data$actual_productivity, Predicted = predictions)
print(visualization_data)
## Actual Predicted
## 5 0.8003819 0.7116478
## 28 0.8003186 0.7074584
## 44 0.8797145 0.7586695
## 49 0.8005981 0.7006390
## 50 0.8003186 0.7093255
## 60 0.6034322 0.6351449
## 69 0.8006844 0.7124874
## 72 0.8002510 0.7313468
## 96 0.8001410 0.7642212
## 108 0.8505023 0.7122001
## 122 0.7004808 0.7187786
## 134 0.8501368 0.7102253
## 138 0.8002738 0.7310280
## 176 0.8505023 0.7127829
## 192 0.4731348 0.6140384
## 199 0.8505222 0.7099774
## 205 0.8002738 0.7318578
## 210 0.6885576 0.6122504
## 223 0.8505023 0.7105405
## 225 0.8501368 0.7112131
## 229 0.8002468 0.7638459
## 235 0.5823010 0.7007416
## 238 0.3502065 0.7201966
## 252 0.7000956 0.7286663
## 267 0.8501368 0.7127443
## 284 0.8003438 0.7646307
## 288 0.8001171 0.6168840
## 289 0.8001171 0.7102639
## 294 0.6501987 0.7241750
## 295 0.6002918 0.7239478
## 304 0.8001287 0.6207627
## 312 0.6502996 0.7523828
## 323 0.8001626 0.7137222
## 327 0.7500680 0.7218733
## 339 0.8002615 0.7644332
## 340 0.8001287 0.7142557
## 342 0.7500680 0.7223969
## 355 0.9005563 0.7208361
## 359 0.8004157 0.7251683
## 370 0.5002413 0.7296773
## 377 0.8000206 0.7258049
## 379 0.7500680 0.7249751
## 381 0.7002508 0.7352072
## 390 0.5005475 0.6976720
## 406 0.8003519 0.7251925
## 424 0.8502238 0.7272823
## 425 0.8009096 0.7277367
## 431 0.7004367 0.7525450
## 456 0.3500670 0.6993646
## 464 0.8503127 0.7236119
## 468 0.8002632 0.7296037
## 472 0.7001647 0.7541848
## 473 0.6503071 0.7500174
## 474 0.6228281 0.6866585
## 478 1.0002304 0.7280869
## 486 0.8005345 0.7294566
## 501 0.9501860 0.7253999
## 511 0.8004737 0.7291602
## 515 0.6005976 0.7375205
## 518 0.4879200 0.7170842
## 544 1.0006713 0.7264767
## 559 0.4668212 0.7180165
## 566 0.9001298 0.9254688
## 571 0.7999632 0.7769171
## 593 0.7005136 0.7116338
## 602 0.9999952 0.7236317
## 611 0.7007104 0.7358827
## 612 0.7006121 0.7207330
## 616 0.6508348 0.6876580
## 629 0.7999632 0.6986499
## 636 0.4965497 0.7177300
## 648 0.7505201 0.7346183
## 666 0.7999632 0.6970694
## 673 0.3555345 0.7076638
## 678 0.9005090 0.7275633
## 681 0.8004020 0.7493246
## 690 0.3503017 0.6927553
## 700 0.8008000 0.7005664
## 705 0.7503561 0.6928837
## 724 0.7144105 0.6949977
## 729 0.5369018 0.7127950
## 735 0.8004020 0.7485738
## 744 0.5356780 0.6934172
## 745 0.5001234 0.7643541
## 746 0.4978851 0.6882281
## 760 0.7002061 0.6925775
## 786 1.0002304 0.7267335
## 795 0.7503561 0.6934468
## 799 0.6219718 0.6845068
## 801 0.3554280 0.7398839
## 804 1.0002304 0.7269607
## 814 0.7006140 0.7645023
## 833 0.7505327 0.6899004
## 841 0.3281316 0.6921550
## 842 0.3035745 0.7343243
## 849 0.7999829 0.6871641
## 859 0.6386144 0.7424223
## 863 0.5853158 0.6918883
## 864 0.2494167 0.7371989
## 875 0.7006140 0.6983671
## 883 0.3075015 0.6888830
## 892 0.7507701 0.6914415
## 906 0.8004020 0.7502334
## 908 0.8000718 0.6933381
## 912 0.7504373 0.7602402
## 913 0.7501770 0.6986239
## 917 0.7005185 0.6977052
## 934 0.7003621 0.6954134
## 952 0.7503719 0.7366631
## 954 0.7002518 0.7310117
## 968 0.8002608 0.7515373
## 975 0.7002366 0.7295991
## 980 0.5503497 0.7115617
## 990 0.7506510 0.7383721
## 991 0.7504737 0.6983375
## 1000 0.6002398 0.7040935
## 1009 0.8005795 0.7123421
## 1010 0.8004020 0.7499667
## 1017 0.7504737 0.6924894
## 1019 0.7502549 0.7054765
## 1021 0.7001360 0.7302214
## 1031 0.8003333 0.6984659
## 1033 0.7507701 0.7066422
## 1047 0.3027704 0.7412188
## 1054 0.8007018 0.7583678
## 1057 0.8000558 0.6930714
## 1089 0.8505206 0.7494826
## 1093 0.8007466 0.7084661
## 1116 0.8005111 0.6970623
## 1137 0.8000345 0.7576170
## 1153 0.9000610 0.7122101
## 1157 0.8500842 0.7075672
## 1161 0.8000345 0.7574293
## 1185 0.7503473 0.7294518
## 1187 0.7500508 0.7097207
## 1188 0.7500508 0.7070733
Here i have made a visualization of actual data and predicted data in a scatter plot with a reference line. The purpose of adding this reference line is to visually represent perfect prediction. In other words, if the actual productivity and predicted productivity were exactly the same for all data points, they would fall on this reference line. By including this line in the plot, it allows us to quickly assess how closely the predicted values align with the actual values.
Data points that fall close to the red dashed line indicate that the model’s predictions are accurate, as they closely match the actual productivity values.
Data points that deviate significantly from the red dashed line suggest discrepancies between the predicted and actual productivity values
# Scatterplot of actual vs. predicted productivity
ggplot(visualization_data, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") + # Add reference line (y = x)
labs(x = "Actual Productivity", y = "Predicted Productivity", title = "Actual vs. Predicted Productivity") +
theme_minimal()