Week 8

Load CSV file

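Load the CSV file into the garment_prod variable.

# Read the garment productivity data (the path is specific to the author's machine)
garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
# Treat team as a label rather than a number
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)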

##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      team           targeted_productivity      smv             wip         
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0  
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                           NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
##

Load required libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

Response variable

actual_productivity is a continuous numeric variable, which makes it a good candidate for the response variable. Productivity is a key performance metric for any factory: it directly affects output, revenue and profit, so improving it is likely a top priority for factory management. The goal of this analysis is to understand which factors influence productivity and to identify ways to improve it, so modeling actual_productivity as the response variable aligns well with that goal.

response <- garment_prod$actual_productivity

Explanatory variable

Of the categorical columns, department is the most suitable explanatory variable. It has 3 categories: ‘sewing’, ‘finishing’ and ‘finishing ’ (the same label with a trailing space). I will consolidate the two ‘finishing’ spellings into one category, since they refer to the same department.

garment_prod <- garment_prod %>%
  mutate(department = ifelse(department %in% c("finishing", "finishing "), "finishing", department))

explanatory <- garment_prod$department
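A more general way to merge labels that differ only by stray whitespace is to trim them before comparing. The sketch below is an illustration on a small made-up data frame (`toy` and its values are hypothetical, not the garment data) using base R's `trimws()` and `table()`.

```r
# Hypothetical toy data: two spellings of the same label, one with a trailing space
toy <- data.frame(
  department = c("sewing", "finishing", "finishing ", "sewing", "finishing "),
  stringsAsFactors = FALSE
)

# Count the raw labels; useNA = "ifany" also reveals missing values
raw_counts <- table(toy$department, useNA = "ifany")

# Strip leading and trailing whitespace so the variants collapse into one level
toy$department <- trimws(toy$department)
clean_counts <- table(toy$department, useNA = "ifany")

print(raw_counts)
print(clean_counts)
```

Running `table()` on the real department column before and after cleaning is a quick way to confirm the exact labels stored in the data.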

\[ H_0 : \text{There is no difference in mean productivity across departments.} \]

ANOVA test

anova_mod = aov(actual_productivity ~ department, data=garment_prod)
summary(anova_mod)

##               Df Sum Sq Mean Sq F value  Pr(>F)   
## department     1   0.28 0.27958   9.246 0.00241 **
## Residuals   1195  36.13 0.03024                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.00241) is below 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean productivity across departments. The ANOVA table does not show which department has the higher mean; that requires comparing the group means directly. This means that the department a team belongs to appears to be associated with its productivity: one department is more efficient than the other. For people interested in improving productivity in this factory, this analysis shows they should look deeper into the differences between the sewing and finishing departments. There may be opportunities to understand why one department is more productive and to implement its best practices in the other.
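To see which department has the higher mean, the group means can be computed directly. The sketch below uses invented numbers in a hypothetical `toy` data frame (not the garment data) and also illustrates that, with exactly two groups, the ANOVA F-test matches an equal-variance two-sample t-test.

```r
# Hypothetical toy data standing in for garment_prod (values are invented)
toy <- data.frame(
  department = c("sewing", "sewing", "sewing", "finishing", "finishing", "finishing"),
  actual_productivity = c(0.70, 0.75, 0.80, 0.60, 0.65, 0.70)
)

# Mean, standard deviation and count of productivity per department
group_means <- tapply(toy$actual_productivity, toy$department, mean)
group_sds   <- tapply(toy$actual_productivity, toy$department, sd)
group_n     <- table(toy$department)
print(round(group_means, 3))

# With exactly two groups, the ANOVA F value equals the squared t-statistic
tt <- t.test(actual_productivity ~ department, data = toy, var.equal = TRUE)
f_value <- summary(aov(actual_productivity ~ department, data = toy))[[1]][["F value"]][1]
print(c(t_squared = unname(tt$statistic)^2, F = f_value))
```

On the real data, the same `tapply()` call on `garment_prod$actual_productivity` and `garment_prod$department` would give the direction of the difference.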

Another continuous variable

Looking at the data, the ‘smv’ (standard minute value) column seems like a potential continuous predictor of productivity.

ggplot(garment_prod, aes(x=smv, y=actual_productivity)) + 
  geom_point() +
  geom_smooth(method='lm')

## `geom_smooth()` using formula = 'y ~ x'

The plot shows a weak, roughly linear negative relationship between productivity and smv: the fitted line slopes slightly downward, with wide scatter around it.

Linear model

model = lm(actual_productivity ~ smv, data=garment_prod)
summary(model)

## 
## Call:
## lm(formula = actual_productivity ~ smv, data = garment_prod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52095 -0.08825  0.04300  0.11334  0.37991 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7644124  0.0085220  89.699  < 2e-16 ***
## smv         -0.0019467  0.0004578  -4.252 2.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01408 
## F-statistic: 18.08 on 1 and 1195 DF,  p-value: 2.281e-05

The coefficients table gives the estimated intercept and slope of the linear regression model, which describe the relationship between the independent and dependent variables.
Intercept: The estimated intercept is 0.7644124. This is the predicted value of actual_productivity when smv is 0.
smv: The estimated coefficient for smv is -0.0019467. This indicates that for each one-unit increase in smv, the predicted actual_productivity decreases by approximately 0.0019467 units.
The multiple R-squared value, 0.01491, represents the proportion of the variance in the dependent variable (actual_productivity) that is explained by the independent variable (smv). In this case, only about 1.49% of the variance is explained by the model.
The p-value associated with the F-statistic is 2.281e-05, which is very close to zero. This indicates that the model is statistically significant.

In summary, the model suggests that there is a statistically significant relationship between smv and actual_productivity, but the model’s explanatory power is quite limited, with only about 1.49% of the variance in actual_productivity being explained by smv. The negative coefficient for smv suggests a negative relationship, meaning that higher values of smv are associated with lower values of actual_productivity.
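To put the size of the smv effect in context, the fitted line can be evaluated at a few smv values. The sketch below copies the coefficients from the summary(model) output and the smv quartiles from summary(garment_prod); the helper name `predict_productivity` is hypothetical and introduced only for illustration.

```r
# Coefficients copied from the summary(model) output above
intercept <- 0.7644124
slope_smv <- -0.0019467

# Hypothetical helper: fitted productivity for a given smv under this model
predict_productivity <- function(smv) {
  intercept + slope_smv * smv
}

# smv quartiles taken from summary(garment_prod): 1st Qu., median, 3rd Qu.
smv_values <- c(3.94, 15.26, 24.26)
fitted_values <- predict_productivity(smv_values)
print(round(fitted_values, 4))

# Change in fitted productivity across the interquartile range of smv
iqr_effect <- slope_smv * (24.26 - 3.94)
print(round(iqr_effect, 4))
```

Moving from the first to the third quartile of smv changes the fitted productivity by only about -0.04, which is consistent with the low R-squared.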

Diagnostic plots

par(mfrow = c(2, 2))  
plot(model, which = 1)
plot(model, which = 2)
plot(model, which = 3)
plot(model, which = 4)

In the residuals vs fitted plot, the scatter is random with no pattern. In the normal Q-Q plot, the points lie reasonably close to the diagonal line. In the scale-location plot, the spread of the residuals is roughly constant. The fourth plot (which = 4) is the Cook’s distance plot rather than the residuals vs leverage plot (that one is which = 5); it shows no observations with unusually large influence. So, the diagnostic plots do not indicate any major issues with the model assumptions or fit. The linear model seems reasonably well-specified for the smv predictor.
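In plot.lm, which = 4 draws Cook's distance and which = 5 draws residuals against leverage. The visual check can be complemented with the underlying numbers. The sketch below uses R's built-in `cars` data as a stand-in model (not the garment data), and the 4 / n cutoff is a common rule of thumb assumed here, not a formal test.

```r
# Stand-in model on R's built-in cars data (not the garment data)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance per observation; which = 4 in plot.lm draws these values
cooks <- cooks.distance(fit)

# Rule of thumb (an assumption, not a strict test): flag values above 4 / n
n <- nrow(cars)
flagged <- which(cooks > 4 / n)
print(flagged)

# Leverage (hat values); which = 5 in plot.lm plots residuals against these
lev <- hatvalues(fit)
print(summary(lev))

# Draw the residuals-vs-leverage panel on a null device (no window, no file)
pdf(NULL)
plot(fit, which = 5)
invisible(dev.off())
```

Replacing `fit` with the garment model would list any rows worth inspecting individually.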

Hypothesis for smv coefficient

\[ H_0 : \beta_{\text{smv}} = 0 \\ H_a : \beta_{\text{smv}} \neq 0 \]

f_test <- summary(model)
cat("Overall model F-test:\n")

## Overall model F-test:

cat("F-statistic =", f_test$fstatistic[1], ", p-value =", f_test$fstatistic[4], "\n")

## F-statistic = 18.08182 , p-value = NA

t_test <- summary(model)
cat("\nCoefficient t-test:\n")

## 
## Coefficient t-test:

cat("t-statistic =", t_test$coefficients["smv", "t value"], ", p-value =", t_test$coefficients["smv", "Pr(>|t|)"], "\n")

## t-statistic = -4.252272 , p-value = 2.28113e-05

F-test interpretation: The F-statistic tests whether there is a significant linear relationship between the predictors and the response variable. If the p-value associated with the F-statistic is small (typically less than a significance level, e.g., 0.05), you would reject the null hypothesis and conclude that the model, as a whole, is statistically significant. The cat() call above prints “NA” only because summary(model)$fstatistic holds three elements (the F value and its two degrees of freedom), so index [4] does not exist; the p-value is not stored there. The summary(model) output reports it as 2.281e-05, so the overall model is statistically significant. With a single predictor, this F-test is equivalent to the t-test on smv (F = t²: 4.252272² ≈ 18.08).
T-test interpretation: The t-statistic measures how many standard errors the estimated coefficient is away from zero. A larger absolute t-statistic indicates stronger evidence against the null hypothesis. The p-value is the probability of observing a t-statistic at least this extreme if the true coefficient were zero. In this case, the very small p-value (2.28113e-05) leads us to reject the null hypothesis and conclude that smv has a statistically significant association with the response variable (actual_productivity).
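The overall F-test p-value can be computed from the F-statistic and its degrees of freedom with `pf()`. The sketch below uses the values printed in this document; the helper `overall_p_value` is a hypothetical name added for illustration, not part of the original analysis.

```r
# Values copied from the summary(model) output in this document
f_value <- 18.08182
num_df  <- 1
den_df  <- 1195

# Upper-tail probability of the F distribution gives the overall p-value
f_p_value <- pf(f_value, num_df, den_df, lower.tail = FALSE)
cat("F-statistic =", f_value, ", p-value =", f_p_value, "\n")

# With one predictor, F equals the squared t-statistic of the slope
t_value <- -4.252272
t_p_value <- 2 * pt(abs(t_value), df = den_df, lower.tail = FALSE)
cat("t^2 =", t_value^2, ", p-value =", t_p_value, "\n")

# Hypothetical helper: the same calculation for any fitted lm object
overall_p_value <- function(fit) {
  f <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}
```

Calling `overall_p_value(model)` on the fitted model would replace the `f_test$fstatistic[4]` lookup that returned NA.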

In summary, the hypothesis tests confirm smv has a statistically significant relationship with productivity. The diagnostic plots validate the linear model assumptions.

Interaction term

Model 1

Based on the ANOVA results, department seems to be associated with productivity. So I will add it to the model. I’ll create a dummy variable with the name ‘dept_sewing’.

garment_prod$dept_sewing = as.numeric(garment_prod$department == "sewing")

Let’s fit a multiple linear regression model with smv and dept_sewing.

model_1 = lm(actual_productivity ~ smv + dept_sewing, data=garment_prod)
summary(model_1)

## 
## Call:
## lm(formula = actual_productivity ~ smv + dept_sewing, data = garment_prod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52095 -0.08825  0.04300  0.11334  0.37991 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7644124  0.0085220  89.699  < 2e-16 ***
## smv         -0.0019467  0.0004578  -4.252 2.28e-05 ***
## dept_sewing         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01408 
## F-statistic: 18.08 on 1 and 1195 DF,  p-value: 2.281e-05

The estimated intercept is 0.7644124. This is the predicted value of the response variable when the predictor variables are zero (in this case, smv and dept_sewing).
The estimated coefficient for smv is -0.0019467. It represents the change in the response variable associated with a one-unit change in smv.
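The coefficient for dept_sewing is not defined (“NA”) because of singularities: the column is an exact linear combination of the other terms in the model. Because smv is continuous, the only plausible way this happens is that dept_sewing is constant, i.e. the comparison department == "sewing" returns the same value for every row. The ANOVA above shows that department has two levels (Df = 1), so the likely cause is that the label stored in the data does not exactly match the string "sewing" (for example a different spelling or stray whitespace). As a result, this model is identical to the smv-only model, as the matching residuals, coefficients and R-squared confirm.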
The residual standard error is 0.1733. It’s an estimate of the standard deviation of the residuals, indicating the typical size of errors in the model’s predictions.
Multiple R-squared is 0.01491, indicating that approximately 1.49% of the variance in the response variable is explained by the predictors in the model.
Adjusted R-squared is slightly lower at 0.01408, accounting for the number of predictors in the model.
The F-statistic is 18.08, and its associated p-value (2.281e-05) tests the overall significance of the model. The low p-value suggests that the model is statistically significant as a whole, meaning that at least one of the predictors has a significant effect on the response variable.

In summary, the model indicates that smv is a significant predictor of the response variable. However, dept_sewing adds nothing as coded: its coefficient is NA because the column does not vary, so the fit is the same as the smv-only model. The model’s explanatory power remains limited, with only about 1.49% of the variance in the response variable explained.
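An NA coefficient "because of singularities" can be reproduced with a dummy variable that never changes value. The sketch below uses a hypothetical `toy` data frame with invented values; the misspelled label "sweing" is an assumption chosen only to create a mismatch with the comparison string, and the real labels should be checked with `table(garment_prod$department)`.

```r
# Hypothetical toy data; the label "sweing" is assumed here to force a mismatch
set.seed(1)
toy <- data.frame(
  smv = runif(20, 3, 30),
  department = rep(c("sweing", "finishing"), each = 10)
)
toy$actual_productivity <- 0.8 - 0.002 * toy$smv + rnorm(20, sd = 0.05)

# The comparison string matches no row, so the dummy is 0 everywhere
toy$dept_sewing <- as.numeric(toy$department == "sewing")
print(table(toy$dept_sewing))

# A constant column is aliased with the intercept, so lm() reports NA for it
bad_fit <- lm(actual_productivity ~ smv + dept_sewing, data = toy)
print(coef(bad_fit))

# Letting lm() build the contrast from the observed labels avoids the mismatch
good_fit <- lm(actual_productivity ~ smv + factor(department), data = toy)
print(coef(good_fit))
```

Passing the column as a factor means the dummy coding follows whatever labels are actually present, so a spelling difference cannot silently produce a constant column.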

Diagnostic plots

par(mfrow = c(2, 2))  
plot(model_1, which = 1)
plot(model_1, which = 2)
plot(model_1, which = 3)
plot(model_1, which = 4)

Model 2

I will also add an interaction term between smv and department. This allows the smv effect to vary by department.

model_2 = lm(actual_productivity ~ smv*dept_sewing, data=garment_prod)
summary(model_2)

## 
## Call:
## lm(formula = actual_productivity ~ smv * dept_sewing, data = garment_prod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52095 -0.08825  0.04300  0.11334  0.37991 
## 
## Coefficients: (2 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.7644124  0.0085220  89.699  < 2e-16 ***
## smv             -0.0019467  0.0004578  -4.252 2.28e-05 ***
## dept_sewing             NA         NA      NA       NA    
## smv:dept_sewing         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01408 
## F-statistic: 18.08 on 1 and 1195 DF,  p-value: 2.281e-05

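The coefficients table notes that two coefficients are not defined because of singularities: dept_sewing, which does not vary across rows, and the smv:dept_sewing interaction, which is built from that same constant column. The remaining estimates are identical to the smv-only model.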
The residual standard error, 0.1733, represents the typical size of the errors (residuals) in the model’s predictions. It provides a measure of the model’s goodness of fit.
The multiple R-squared value, 0.01491, represents the proportion of the variance in the dependent variable (actual_productivity) that is explained by the independent variable (smv). In this case, only about 1.491% of the variance is explained by the model.
The adjusted R-squared, 0.01408, is a version of R-squared that accounts for the number of predictors in the model. It is similar to the multiple R-squared.
The F-statistic, 18.08, tests the overall significance of the model. A high F-statistic with a low p-value (2.281e-05) suggests that the model is statistically significant.

Diagnostic plots

par(mfrow = c(2, 2))  
plot(model_2, which = 1)
plot(model_2, which = 2)
plot(model_2, which = 3)
plot(model_2, which = 4)

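The interaction term could not be estimated (NA), so this model gives no evidence on whether the smv effect differs by department. The smv coefficient is still significant and negative, and its magnitude is unchanged (-0.0019467), because model_2 reduces to the smv-only model. The diagnostic plots are therefore the same as for that model and show no major issues.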

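In summary, adding the department dummy and the interaction term did not change the model fit as coded: R-squared stays at 0.01491 in all three models because dept_sewing carries no information. Whether the effect of smv on productivity varies by department remains an open question until the dummy is rebuilt so that it takes both values; answering it could then guide optimization efforts by department.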
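When the department indicator takes both values, an interaction model can be compared with the additive model using a nested F-test. The sketch below runs on simulated data in a hypothetical `toy` data frame; the slopes, noise level and seed are invented for illustration and say nothing about the garment data.

```r
# Hypothetical toy data in which the smv slope differs by department
set.seed(42)
n <- 100
toy <- data.frame(
  smv = runif(n, 3, 30),
  department = factor(rep(c("finishing", "sewing"), each = n / 2))
)
slope <- ifelse(toy$department == "sewing", -0.004, -0.001)
toy$actual_productivity <- 0.8 + slope * toy$smv + rnorm(n, sd = 0.02)

# Additive model versus model with an smv-by-department interaction
fit_add <- lm(actual_productivity ~ smv + department, data = toy)
fit_int <- lm(actual_productivity ~ smv * department, data = toy)

# All four coefficients are estimable when department takes both values
print(coef(fit_int))

# Nested-model F-test: does the interaction term improve the fit?
comparison <- anova(fit_add, fit_int)
print(comparison)
```

A small p-value in the second row of the comparison table would indicate that the smv slope differs between departments.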

Week 8

2023-10-23
