Final_proj

My final project is centered on Analyzing Productivity Variation across different days of the week. The objective of this project is to analyze productivity variation in a manufacturing setting based on the day of the week. By examining historical data, we aim to identify whether certain days exhibit consistently higher or lower productivity levels. Additionally, we seek to understand the underlying factors contributing to these variations and leverage predictive modeling techniques to forecast productivity.

First, the data set and neccessary libraries were loaded

data <- read.csv("C:\\Users\\Krishna\\Downloads\\productivity+prediction+of+garment+employees\\garments_worker_productivity.csv")

# Set the CRAN mirror


# Install the caret package specifying the CRAN mirror
install.packages("caret", repos = "https://cloud.r-project.org/")

## Installing package into 'C:/Users/Krishna/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'caret' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'caret'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Krishna\AppData\Local\R\win-library\4.3\00LOCK\caret\libs\x64\caret.dll
## to C:\Users\Krishna\AppData\Local\R\win-library\4.3\caret\libs\x64\caret.dll:
## Permission denied

## Warning: restored 'caret'

## 
## The downloaded binary packages are in
##  C:\Users\Krishna\AppData\Local\Temp\Rtmpq6CB3o\downloaded_packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)     # For reading CSV files
library(dplyr)     # For data manipulation
library(ggplot2)   # For data visualization
library(tidyr)     # For data tidying
library(stats)     # For statistical tests
library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

data <- na.omit(data)

Next, we have Converted the day of the week variable into an ordered factor. By default, the day of the week variable may be stored as a character or categorical variable in the dataset. However, treating it as a simple categorical variable without considering its ordinal nature could lead to incorrect interpretations or analyses. For example, a model or statistical test may treat Monday and Friday as equally distant from each other, which may not accurately reflect the sequential nature of weekdays.

Converting the day of the week variable into an ordered factor assigns numerical values to each day based on their natural order . This transformation explicitly captures the inherent sequence of days, allowing statistical analyses to correctly interpret and analyze the data.

data$day <- factor(data$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Saturday", "Sunday"))

summary(data)

##      date             quarter           department               day     
##  Length:691         Length:691         Length:691         Monday   :116  
##  Class :character   Class :character   Class :character   Tuesday  :118  
##  Mode  :character   Mode  :character   Mode  :character   Wednesday:119  
##                                                           Thursday :118  
##                                                           Saturday :104  
##                                                           Sunday   :116  
##       team        targeted_productivity      smv             wip         
##  Min.   : 1.000   Min.   :0.070         Min.   :10.05   Min.   :    7.0  
##  1st Qu.: 4.000   1st Qu.:0.700         1st Qu.:18.79   1st Qu.:  774.5  
##  Median : 6.000   Median :0.750         Median :22.52   Median : 1039.0  
##  Mean   : 6.521   Mean   :0.724         Mean   :23.25   Mean   : 1190.5  
##  3rd Qu.:10.000   3rd Qu.:0.800         3rd Qu.:28.08   3rd Qu.: 1252.5  
##  Max.   :12.000   Max.   :0.800         Max.   :54.56   Max.   :23122.0  
##    over_time       incentive        idle_time          idle_men      
##  Min.   :    0   Min.   :  0.00   Min.   :  0.000   Min.   : 0.0000  
##  1st Qu.: 4560   1st Qu.: 30.00   1st Qu.:  0.000   1st Qu.: 0.0000  
##  Median : 6840   Median : 45.00   Median :  0.000   Median : 0.0000  
##  Mean   : 6508   Mean   : 44.48   Mean   :  1.265   Mean   : 0.6397  
##  3rd Qu.: 7200   3rd Qu.: 60.00   3rd Qu.:  0.000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :138.00   Max.   :300.000   Max.   :45.0000  
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   :26.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.:52.00   1st Qu.:0.6615     
##  Median :0.0000     Median :57.00   Median :0.7506     
##  Mean   :0.2605     Mean   :52.45   Mean   :0.7220     
##  3rd Qu.:0.0000     3rd Qu.:58.00   3rd Qu.:0.8004     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1005

head(data)

##       date  quarter department      day team targeted_productivity   smv  wip
## 1 1/1/2015 Quarter1     sweing Thursday    8                  0.80 26.16 1108
## 3 1/1/2015 Quarter1     sweing Thursday   11                  0.80 11.41  968
## 4 1/1/2015 Quarter1     sweing Thursday   12                  0.80 11.41  968
## 5 1/1/2015 Quarter1     sweing Thursday    6                  0.80 25.90 1170
## 6 1/1/2015 Quarter1     sweing Thursday    7                  0.80 25.90  984
## 8 1/1/2015 Quarter1     sweing Thursday    3                  0.75 28.08  795
##   over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1      7080        98         0        0                  0          59.0
## 3      3660        50         0        0                  0          30.5
## 4      3660        50         0        0                  0          30.5
## 5      1920        50         0        0                  0          56.0
## 6      6720        38         0        0                  0          56.0
## 8      6900        45         0        0                  0          57.5
##   actual_productivity
## 1           0.9407254
## 3           0.8005705
## 4           0.8005705
## 5           0.8003819
## 6           0.8001250
## 8           0.7536835

Here, we have used Boxplot to represent the productivity across different days of week because it displays key summary statistics such as the median, quartiles, and potential outliers, boxplots provide a comprehensive view of the productivity distribution for each day.

it also helps you to compare the distribution of productivity across different days of the week. This comparison allows you to identify patterns or trends in productivity levels, such as whether certain days consistently exhibit higher or lower productivity compared to others. here, the Days with higher median values typically indicate higher productivity levels, while days with lower median values suggest lower productivity levels.

ggplot(data, aes(x = day, y = actual_productivity)) +
  geom_boxplot(fill = "skyblue", color = "blue") +
  labs(x = "Day of the Week", y = "Productivity", title = "Distribution of Productivity Across Different Days") +
  theme_minimal()

Here, we have used ANOVA to determine whether there are significant differences in productivity means across the different days of the week. it enables simultaneous comparison of productivity means across all days of the week, rather than pairwise comparisons.

1)df:

Day: This represents the degrees of freedom associated with the variable “day,” which is the categorical variable representing different days of the week. In this case, there are 5 degrees of freedom, indicating that there are 6 levels (days) of the variable minus 1.

Residuals::This represents the degrees of freedom associated with the error term or residuals, which are the differences between the observed values and the values predicted by the model. In this case, there are 685 degrees of freedom, which is the total number of observations minus the total number of levels of the “day” variable.

# ANOVA test
anova_result <- aov(actual_productivity ~ day, data = data)
summary(anova_result)

##              Df Sum Sq  Mean Sq F value Pr(>F)
## day           5  0.043 0.008671    0.36  0.876
## Residuals   685 16.488 0.024070

After conducting ANOVA and obtaining an overall assessment of the variability in productivity means across different days of the week, it’s common to perform post-hoc tests to identify specific pairs of days with significantly different means. Tukey’s Honestly Significant Difference (HSD) test is one such post-hoc test commonly used for this purpose. the main use of this Tukey’s test is pairwise comparisons.

ANOVA determines whether there are significant differences in productivity means across groups (days) but does not specify which pairs of groups differ significantly. Tukey’s HSD test allows for efficient pairwise comparisons between all combinations of days, pinpointing specific pairs with significantly different means.

A significant difference between the means of two days indicates that there is strong evidence to suggest that the productivity levels on those days are indeed different. This information is valuable for identifying specific days that exhibit notable variations in productivity and understanding the factors contributing to these differences.

tukey_test <- TukeyHSD(anova_result)
tukey_test

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = actual_productivity ~ day, data = data)
## 
## $day
##                            diff         lwr        upr     p adj
## Tuesday-Monday      0.003945265 -0.05402513 0.06191566 0.9999616
## Wednesday-Monday   -0.001457702 -0.05930722 0.05639182 0.9999997
## Thursday-Monday    -0.013397665 -0.07136806 0.04457273 0.9860848
## Saturday-Monday     0.009910882 -0.04996253 0.06978429 0.9970544
## Sunday-Monday      -0.010620008 -0.06883761 0.04759759 0.9953368
## Wednesday-Tuesday  -0.005402967 -0.06300370 0.05219776 0.9998128
## Thursday-Tuesday   -0.017342929 -0.07506505 0.04037919 0.9560277
## Saturday-Tuesday    0.005965617 -0.05366745 0.06559868 0.9997431
## Sunday-Tuesday     -0.014565272 -0.07253566 0.04340512 0.9797555
## Thursday-Wednesday -0.011939962 -0.06954069 0.04566077 0.9915395
## Saturday-Wednesday  0.011368584 -0.04814698 0.07088415 0.9942129
## Sunday-Wednesday   -0.009162305 -0.06701182 0.04868721 0.9976131
## Saturday-Thursday   0.023308546 -0.03632452 0.08294161 0.8743598
## Sunday-Thursday     0.002777657 -0.05519273 0.06074805 0.9999933
## Sunday-Saturday    -0.020530889 -0.08040430 0.03934252 0.9242781

Here, the bar plot provides a visual comparison of mean productivity levels among the different days of the week. It allows stakeholders to quickly identify any noticeable patterns or differences in productivity across the days. Stakeholders can visually identify days with relatively high or low mean productivity based on the height of the bars. Higher bars indicate higher mean productivity, while lower bars indicate lower mean productivity.

Insights from the visualization can inform decisions related to resource allocation and workforce scheduling. For example, if certain weekdays consistently exhibit lower mean productivity, management may consider adjusting staffing levels or operational procedures to improve efficiency on those days.

mean_productivity <- data %>%
  group_by(day) %>%
  summarize(mean_productivity = mean(actual_productivity))

ggplot(mean_productivity, aes(x = day, y = mean_productivity)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  labs(x = "Day of the Week", y = "Mean Productivity", title = "Mean Productivity Across Different Days") +
  theme_minimal()

model_data <- data %>% select(actual_productivity, smv, wip)  # Select relevant columns
# Split data into training and testing sets
set.seed(123)  # for reproducibility
train_index <- createDataPartition(model_data$actual_productivity, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
# Train the linear regression model
lm_model <- lm(actual_productivity ~ smv + wip, data = train_data)
# Summarize the model
summary(lm_model)

## 
## Call:
## lm(formula = actual_productivity ~ smv + wip, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46328 -0.05872  0.02643  0.09720  0.37400 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.966e-01  2.305e-02  34.555  < 2e-16 ***
## smv         -3.727e-03  9.370e-04  -3.977  7.9e-05 ***
## wip          9.878e-06  3.542e-06   2.789  0.00546 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1514 on 552 degrees of freedom
## Multiple R-squared:  0.0425, Adjusted R-squared:  0.03903 
## F-statistic: 12.25 on 2 and 552 DF,  p-value: 6.235e-06

To predict productivity, we employ linear regression, a statistical technique that models the relationship between predictor variables (SMV, WIP) and the target variable (actual productivity). We train the linear regression model using the training dataset, aiming to capture underlying patterns and relationships. The trained model allows us to predict productivity based on SMV and WIP values for each observation. Linear regression assumes that there is a linear relationship between the independent variables and the dependent variable. It means that changes in the independent variables result in proportional changes in the dependent variable.

The coefficients indicate the strength and direction of the relationship between each independent variable and productivity. A positive coefficient suggests a positive relationship (increasing the independent variable leads to an increase in productivity), while a negative coefficient suggests a negative relationship.

predictions <- predict(lm_model, newdata = test_data)
print(predictions)

##         5        28        44        49        50        60        69        72 
## 0.7116478 0.7074584 0.7586695 0.7006390 0.7093255 0.6351449 0.7124874 0.7313468 
##        96       108       122       134       138       176       192       199 
## 0.7642212 0.7122001 0.7187786 0.7102253 0.7310280 0.7127829 0.6140384 0.7099774 
##       205       210       223       225       229       235       238       252 
## 0.7318578 0.6122504 0.7105405 0.7112131 0.7638459 0.7007416 0.7201966 0.7286663 
##       267       284       288       289       294       295       304       312 
## 0.7127443 0.7646307 0.6168840 0.7102639 0.7241750 0.7239478 0.6207627 0.7523828 
##       323       327       339       340       342       355       359       370 
## 0.7137222 0.7218733 0.7644332 0.7142557 0.7223969 0.7208361 0.7251683 0.7296773 
##       377       379       381       390       406       424       425       431 
## 0.7258049 0.7249751 0.7352072 0.6976720 0.7251925 0.7272823 0.7277367 0.7525450 
##       456       464       468       472       473       474       478       486 
## 0.6993646 0.7236119 0.7296037 0.7541848 0.7500174 0.6866585 0.7280869 0.7294566 
##       501       511       515       518       544       559       566       571 
## 0.7253999 0.7291602 0.7375205 0.7170842 0.7264767 0.7180165 0.9254688 0.7769171 
##       593       602       611       612       616       629       636       648 
## 0.7116338 0.7236317 0.7358827 0.7207330 0.6876580 0.6986499 0.7177300 0.7346183 
##       666       673       678       681       690       700       705       724 
## 0.6970694 0.7076638 0.7275633 0.7493246 0.6927553 0.7005664 0.6928837 0.6949977 
##       729       735       744       745       746       760       786       795 
## 0.7127950 0.7485738 0.6934172 0.7643541 0.6882281 0.6925775 0.7267335 0.6934468 
##       799       801       804       814       833       841       842       849 
## 0.6845068 0.7398839 0.7269607 0.7645023 0.6899004 0.6921550 0.7343243 0.6871641 
##       859       863       864       875       883       892       906       908 
## 0.7424223 0.6918883 0.7371989 0.6983671 0.6888830 0.6914415 0.7502334 0.6933381 
##       912       913       917       934       952       954       968       975 
## 0.7602402 0.6986239 0.6977052 0.6954134 0.7366631 0.7310117 0.7515373 0.7295991 
##       980       990       991      1000      1009      1010      1017      1019 
## 0.7115617 0.7383721 0.6983375 0.7040935 0.7123421 0.7499667 0.6924894 0.7054765 
##      1021      1031      1033      1047      1054      1057      1089      1093 
## 0.7302214 0.6984659 0.7066422 0.7412188 0.7583678 0.6930714 0.7494826 0.7084661 
##      1116      1137      1153      1157      1161      1185      1187      1188 
## 0.6970623 0.7576170 0.7122101 0.7075672 0.7574293 0.7294518 0.7097207 0.7070733

After fitting the linear regression model using train data set, we can use the model to make predictions on new data(test_data). The predictions allow us to assess how well the model generalizes to new data. By comparing the predicted values to the actual values of productivity, we can evaluate the accuracy and effectiveness of the linear regression model in predicting actual productivity levels based on SMV and WIP.

rmse <- sqrt(mean((test_data$actual_productivity - predictions)^2))
print(paste("Root Mean Squared Error (RMSE):", rmse))

## [1] "Root Mean Squared Error (RMSE): 0.153976005669743"

Next we have calculated root mean square error(RMSE) which is a common method for evaluating the performance of a predictive model, such as a linear regression model. The RMSE provides a measure of how well the model’s predictions align with the actual observed values.

visualization_data <- data.frame(Actual = test_data$actual_productivity, Predicted = predictions)
print(visualization_data)

##         Actual Predicted
## 5    0.8003819 0.7116478
## 28   0.8003186 0.7074584
## 44   0.8797145 0.7586695
## 49   0.8005981 0.7006390
## 50   0.8003186 0.7093255
## 60   0.6034322 0.6351449
## 69   0.8006844 0.7124874
## 72   0.8002510 0.7313468
## 96   0.8001410 0.7642212
## 108  0.8505023 0.7122001
## 122  0.7004808 0.7187786
## 134  0.8501368 0.7102253
## 138  0.8002738 0.7310280
## 176  0.8505023 0.7127829
## 192  0.4731348 0.6140384
## 199  0.8505222 0.7099774
## 205  0.8002738 0.7318578
## 210  0.6885576 0.6122504
## 223  0.8505023 0.7105405
## 225  0.8501368 0.7112131
## 229  0.8002468 0.7638459
## 235  0.5823010 0.7007416
## 238  0.3502065 0.7201966
## 252  0.7000956 0.7286663
## 267  0.8501368 0.7127443
## 284  0.8003438 0.7646307
## 288  0.8001171 0.6168840
## 289  0.8001171 0.7102639
## 294  0.6501987 0.7241750
## 295  0.6002918 0.7239478
## 304  0.8001287 0.6207627
## 312  0.6502996 0.7523828
## 323  0.8001626 0.7137222
## 327  0.7500680 0.7218733
## 339  0.8002615 0.7644332
## 340  0.8001287 0.7142557
## 342  0.7500680 0.7223969
## 355  0.9005563 0.7208361
## 359  0.8004157 0.7251683
## 370  0.5002413 0.7296773
## 377  0.8000206 0.7258049
## 379  0.7500680 0.7249751
## 381  0.7002508 0.7352072
## 390  0.5005475 0.6976720
## 406  0.8003519 0.7251925
## 424  0.8502238 0.7272823
## 425  0.8009096 0.7277367
## 431  0.7004367 0.7525450
## 456  0.3500670 0.6993646
## 464  0.8503127 0.7236119
## 468  0.8002632 0.7296037
## 472  0.7001647 0.7541848
## 473  0.6503071 0.7500174
## 474  0.6228281 0.6866585
## 478  1.0002304 0.7280869
## 486  0.8005345 0.7294566
## 501  0.9501860 0.7253999
## 511  0.8004737 0.7291602
## 515  0.6005976 0.7375205
## 518  0.4879200 0.7170842
## 544  1.0006713 0.7264767
## 559  0.4668212 0.7180165
## 566  0.9001298 0.9254688
## 571  0.7999632 0.7769171
## 593  0.7005136 0.7116338
## 602  0.9999952 0.7236317
## 611  0.7007104 0.7358827
## 612  0.7006121 0.7207330
## 616  0.6508348 0.6876580
## 629  0.7999632 0.6986499
## 636  0.4965497 0.7177300
## 648  0.7505201 0.7346183
## 666  0.7999632 0.6970694
## 673  0.3555345 0.7076638
## 678  0.9005090 0.7275633
## 681  0.8004020 0.7493246
## 690  0.3503017 0.6927553
## 700  0.8008000 0.7005664
## 705  0.7503561 0.6928837
## 724  0.7144105 0.6949977
## 729  0.5369018 0.7127950
## 735  0.8004020 0.7485738
## 744  0.5356780 0.6934172
## 745  0.5001234 0.7643541
## 746  0.4978851 0.6882281
## 760  0.7002061 0.6925775
## 786  1.0002304 0.7267335
## 795  0.7503561 0.6934468
## 799  0.6219718 0.6845068
## 801  0.3554280 0.7398839
## 804  1.0002304 0.7269607
## 814  0.7006140 0.7645023
## 833  0.7505327 0.6899004
## 841  0.3281316 0.6921550
## 842  0.3035745 0.7343243
## 849  0.7999829 0.6871641
## 859  0.6386144 0.7424223
## 863  0.5853158 0.6918883
## 864  0.2494167 0.7371989
## 875  0.7006140 0.6983671
## 883  0.3075015 0.6888830
## 892  0.7507701 0.6914415
## 906  0.8004020 0.7502334
## 908  0.8000718 0.6933381
## 912  0.7504373 0.7602402
## 913  0.7501770 0.6986239
## 917  0.7005185 0.6977052
## 934  0.7003621 0.6954134
## 952  0.7503719 0.7366631
## 954  0.7002518 0.7310117
## 968  0.8002608 0.7515373
## 975  0.7002366 0.7295991
## 980  0.5503497 0.7115617
## 990  0.7506510 0.7383721
## 991  0.7504737 0.6983375
## 1000 0.6002398 0.7040935
## 1009 0.8005795 0.7123421
## 1010 0.8004020 0.7499667
## 1017 0.7504737 0.6924894
## 1019 0.7502549 0.7054765
## 1021 0.7001360 0.7302214
## 1031 0.8003333 0.6984659
## 1033 0.7507701 0.7066422
## 1047 0.3027704 0.7412188
## 1054 0.8007018 0.7583678
## 1057 0.8000558 0.6930714
## 1089 0.8505206 0.7494826
## 1093 0.8007466 0.7084661
## 1116 0.8005111 0.6970623
## 1137 0.8000345 0.7576170
## 1153 0.9000610 0.7122101
## 1157 0.8500842 0.7075672
## 1161 0.8000345 0.7574293
## 1185 0.7503473 0.7294518
## 1187 0.7500508 0.7097207
## 1188 0.7500508 0.7070733

Here i have made a visualization of actual data and predicted data in a scatter plot with a reference line. The purpose of adding this reference line is to visually represent perfect prediction. In other words, if the actual productivity and predicted productivity were exactly the same for all data points, they would fall on this reference line. By including this line in the plot, it allows us to quickly assess how closely the predicted values align with the actual values.

Data points that fall close to the red dashed line indicate that the model’s predictions are accurate, as they closely match the actual productivity values.

Data points that deviate significantly from the red dashed line suggest discrepancies between the predicted and actual productivity values

# Scatterplot of actual vs. predicted productivity
ggplot(visualization_data, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +  # Add reference line (y = x)
  labs(x = "Actual Productivity", y = "Predicted Productivity", title = "Actual vs. Predicted Productivity") +
  theme_minimal()

Final_proj

2024-04-16