Load the CSV file into the garment_prod variable.
garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)
##      date             quarter           department            day
##  Length:1197        Length:1197        Length:1197        Length:1197
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##      team           targeted_productivity      smv             wip
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0
##                                                           NA's   :506
##    over_time       incentive         idle_time           idle_men
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000
##
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503
##  Median :0.0000     Median :34.00   Median :0.7733
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204
##
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
actual_productivity is a continuous numeric variable, which makes it a good candidate for the response variable. Productivity is a key performance metric for any factory because it directly affects output, revenue and profit, so improving it is likely a top priority for factory management. The goal of this analysis is to understand the factors that influence productivity and to identify ways to improve it, and modeling actual_productivity as the response aligns with that goal.
response <- garment_prod$actual_productivity
Among the categorical columns, department is the most suitable explanatory variable. It has three labels: ‘sewing’, ‘finishing’ and ‘finishing ’ (with a trailing space). I will merge the two ‘finishing’ labels into one category, since they differ only by that trailing space and refer to the same department.
garment_prod <- garment_prod %>%
  mutate(department = ifelse(department %in% c("finishing", "finishing "), "finishing", department))
explanatory <- garment_prod$department
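A more general way to merge labels that differ only by stray whitespace is to trim them. The sketch below uses a small made-up vector (not the factory data) to show the idea with base R's trimws(); the variable names dept and dept_clean are hypothetical.

```r
# Toy labels, invented for illustration; not taken from garment_prod.csv
dept <- c("sewing", "finishing", "finishing ", "sewing", "finishing ")

# Strip leading/trailing whitespace so "finishing " and "finishing" coincide
dept_clean <- trimws(dept)

# Count rows per label before and after cleaning
table(dept)
table(dept_clean)
```

Tabulating the column before and after cleaning is a quick way to confirm how many distinct labels remain and how each one is spelled.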
\[ H_0 : \text{There is no difference in mean productivity across departments.} \]
## ANOVA test
anova_mod <- aov(actual_productivity ~ department, data = garment_prod)
summary(anova_mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## department 1 0.28 0.27958 9.246 0.00241 **
## Residuals 1195 36.13 0.03024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (0.00241) is below 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean productivity between the two departments. The ANOVA table alone does not show which department has the higher mean, so the group means should be compared before concluding that one department is more efficient than the other. Either way, department appears to be associated with productivity. For people interested in improving productivity in this factory, this analysis shows they should look deeper into the differences between the sewing and finishing departments. There may be opportunities to understand why one department is more productive and to carry those practices over to the other.
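To see the direction of a difference that ANOVA flags, the group means can be tabulated directly. The sketch below uses a small invented data frame (the name toy and its values are assumptions, not the factory data). It also checks a known property: with exactly two groups, the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic.

```r
# Toy data, invented for illustration; replace with garment_prod in practice
toy <- data.frame(
  department = rep(c("A", "B"), each = 4),
  actual_productivity = c(0.70, 0.72, 0.68, 0.74, 0.80, 0.78, 0.82, 0.76)
)

# Mean productivity per department shows the direction of the difference
group_means <- aggregate(actual_productivity ~ department, data = toy, FUN = mean)
print(group_means)

# With exactly two groups, one-way ANOVA matches a pooled-variance t-test
fit <- aov(actual_productivity ~ department, data = toy)
tt <- t.test(actual_productivity ~ department, data = toy, var.equal = TRUE)
f_value <- summary(fit)[[1]][["F value"]][1]
all.equal(f_value, unname(tt$statistic)^2)
```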
Looking at the data, the ‘smv’ (standard minute value) column seems like a potential continuous predictor of productivity.
ggplot(garment_prod, aes(x=smv, y=actual_productivity)) +
geom_point() +
geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'
The plot shows a weak, roughly linear negative relationship between productivity and smv.
model = lm(actual_productivity ~ smv, data=garment_prod)
summary(model)
##
## Call:
## lm(formula = actual_productivity ~ smv, data = garment_prod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52095 -0.08825 0.04300 0.11334 0.37991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7644124 0.0085220 89.699 < 2e-16 ***
## smv -0.0019467 0.0004578 -4.252 2.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared: 0.01491, Adjusted R-squared: 0.01408
## F-statistic: 18.08 on 1 and 1195 DF, p-value: 2.281e-05
The Coefficients table gives the estimated coefficients of the linear regression model. These coefficients describe the relationship between the independent and dependent variables.
Intercept: The estimated intercept is 0.7644124. This is the predicted value of actual_productivity when smv is 0. Since the smallest observed smv is 2.90, this is an extrapolation rather than a value seen in the data.
smv: The estimated coefficient for smv is -0.0019467. This indicates that for each one-unit increase in smv, the predicted actual_productivity decreases by approximately 0.0019467 units.
The multiple R-squared value, 0.01491, represents the proportion of the variance in the dependent variable (actual_productivity) that is explained by the independent variable(s) (smv). In this case, only about 1.49% of the variance is explained by the model.
The p-value associated with the F-statistic is 2.281e-05, which is very close to zero. This indicates that the model is statistically significant.
In summary, the model suggests that there is a statistically significant relationship between smv and actual_productivity, but the model’s explanatory power is quite limited, with only about 1.49% of the variance in actual_productivity being explained by smv. The negative coefficient for smv suggests a negative relationship, meaning that higher values of smv are associated with lower values of actual_productivity.
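The size of the smv effect is easier to judge with a worked example. The sketch below plugs the coefficients from the summary output into the fitted line at the smallest and largest observed smv values (2.90 and 54.56, taken from the summary table), and recovers R-squared from the F-statistic using the one-predictor identity R² = F / (F + residual df). The variable names are hypothetical.

```r
# Coefficients copied from the summary(model) output above
b0 <- 0.7644124
b1 <- -0.0019467

# Predicted productivity at the smallest and largest observed smv
pred_low <- b0 + b1 * 2.90
pred_high <- b0 + b1 * 54.56
c(pred_low, pred_high)

# In a one-predictor model, R-squared can be recovered from F and the residual df
f_stat <- 18.08182
df_res <- 1195
r_squared <- f_stat / (f_stat + df_res)
r_squared
```

Across the whole observed range of smv, the predicted productivity moves from about 0.76 to about 0.66, which is small next to the residual standard error of 0.1733.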
par(mfrow = c(2, 2))
plot(model, which = 1)
plot(model, which = 2)
plot(model, which = 3)
plot(model, which = 4)
In the residuals vs fitted plot, the scatter looks random with no clear pattern. In the normal Q-Q plot, the points lie reasonably close to the diagonal line. In the scale-location plot, the spread of the residuals is roughly constant. The fourth plot (which = 4) is the Cook’s distance plot rather than the residuals vs leverage plot (that one is which = 5), and it shows no observations with unusually large influence. So, the diagnostic plots do not indicate any major issues with the model assumptions or fit. The linear model seems reasonably well-specified for the smv predictor.
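The influence measures behind these plots can also be inspected numerically. The sketch below uses simulated data (not the factory data) and the base R functions cooks.distance() and hatvalues(). The 4/n cutoff is a common rule of thumb, used here as an assumption rather than a fixed standard.

```r
# Simulated data for illustration only; not the factory data
set.seed(1)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# Cook's distance per observation (what plot(fit, which = 4) displays)
cd <- cooks.distance(fit)

# Leverage (hat values), used by the residuals vs leverage plot, plot(fit, which = 5)
lev <- hatvalues(fit)

# Flag points above the 4/n rule of thumb for Cook's distance
flagged <- which(cd > 4 / length(cd))
flagged
```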
\[ H_0 : \beta_{\text{smv}} = 0 \\ H_a : \beta_{\text{smv}} \neq 0 \]
f_test <- summary(model)
cat("Overall model F-test:\n")
## Overall model F-test:
cat("F-statistic =", f_test$fstatistic[1], ", p-value =", pf(f_test$fstatistic[1], f_test$fstatistic[2], f_test$fstatistic[3], lower.tail = FALSE), "\n")
## F-statistic = 18.08182 , p-value = 2.28113e-05
t_test <- summary(model)
cat("\nCoefficient t-test:\n")
##
## Coefficient t-test:
cat("t-statistic =", t_test$coefficients["smv", "t value"], ", p-value =", t_test$coefficients["smv", "Pr(>|t|)"], "\n")
## t-statistic = -4.252272 , p-value = 2.28113e-05
F-test interpretation: The F-statistic tests whether there is a significant linear relationship between the predictors and the response variable. If the p-value associated with the F-statistic is small (typically less than a significance level, e.g., 0.05), you reject the null hypothesis and conclude that the model, as a whole, is statistically significant. Here the F-statistic is 18.08 on 1 and 1195 degrees of freedom, and the summary(model) output reports its p-value as 2.281e-05, so the overall model is significant. Note that summary(model)$fstatistic stores only three elements (the F value, the numerator degrees of freedom and the denominator degrees of freedom), so indexing a fourth element returns NA; the p-value has to be computed from those three with pf().
T-test interpretation: The t-statistic measures how many standard errors the estimated coefficient is away from zero. A larger absolute t-statistic indicates stronger evidence against the null hypothesis. The p-value is the probability of observing a t-statistic at least this extreme if the true coefficient were zero. In this case, the very small p-value (2.28113e-05) indicates that the smv coefficient is statistically significant. Since the p-value is very low, you reject the null hypothesis and conclude that the smv predictor has a significant association with the response variable (actual_productivity). With a single predictor, this t-test and the overall F-test are equivalent.
In summary, the hypothesis tests confirm smv has a statistically significant relationship with productivity. The diagnostic plots do not contradict the linear model assumptions.
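Both p-values can be reproduced from the printed statistics. The sketch below copies the F- and t-statistics from the outputs above and uses base R's pf() and pt(); the variable names are hypothetical. It also illustrates that, with one predictor, F equals t squared.

```r
# Values copied from the outputs above
f_stat <- 18.08182   # overall F-statistic
df_num <- 1          # numerator degrees of freedom
df_den <- 1195       # denominator (residual) degrees of freedom
t_stat <- -4.252272  # t-statistic for smv

# Upper-tail probability of the F distribution gives the F-test p-value
p_f <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# Two-sided p-value for the t-test
p_t <- 2 * pt(abs(t_stat), df_den, lower.tail = FALSE)

# With one predictor, F equals t squared, so the two p-values agree closely
c(F = f_stat, t_squared = t_stat^2, p_F = p_f, p_t = p_t)
```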
Based on the ANOVA results, department seems to be associated with productivity. So I will add it to the model. I’ll create a dummy variable with the name ‘dept_sewing’.
garment_prod$dept_sewing = as.numeric(garment_prod$department == "sewing")
Let’s fit a multiple linear regression model with smv and dept_sewing.
model_1 = lm(actual_productivity ~ smv + dept_sewing, data=garment_prod)
summary(model_1)
##
## Call:
## lm(formula = actual_productivity ~ smv + dept_sewing, data = garment_prod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52095 -0.08825 0.04300 0.11334 0.37991
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7644124 0.0085220 89.699 < 2e-16 ***
## smv -0.0019467 0.0004578 -4.252 2.28e-05 ***
## dept_sewing NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared: 0.01491, Adjusted R-squared: 0.01408
## F-statistic: 18.08 on 1 and 1195 DF, p-value: 2.281e-05
The estimated intercept is 0.7644124. This is the predicted value of the response variable when the predictor variables are zero (in this case, smv and dept_sewing).
The estimated coefficient for smv is -0.0019467. It represents the change in the response variable associated with a one-unit change in smv.
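The coefficient for dept_sewing is not defined (“NA”) because of singularities. A singularity means the dept_sewing column is an exact linear combination of the other columns in the model, here the intercept and smv. Since the ANOVA above found two department levels, the most likely cause is that dept_sewing is 0 in every row, which happens when no department label is exactly equal to "sewing" (in the public UCI copy of this dataset, for instance, the label is spelled ‘sweing’). A constant column carries no information beyond the intercept, so R drops it. This points to a coding problem with the dummy rather than ordinary multicollinearity between predictors.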
The residual standard error is 0.1733. It’s an estimate of the standard deviation of the residuals, indicating the typical size of errors in the model’s predictions.
Multiple R-squared is 0.01491, indicating that approximately 1.49% of the variance in the response variable is explained by the predictors in the model.
Adjusted R-squared is slightly lower at 0.01408, accounting for the number of predictors in the model.
The F-statistic is 18.08, and its associated p-value (2.281e-05) tests the overall significance of the model. The low p-value suggests that the model is statistically significant as a whole, meaning that at least one of the predictors has a significant effect on the response variable.
In summary, the model indicates that smv is a significant predictor of the response variable. However, dept_sewing contributes nothing: its coefficient is NA, and every number in the output (coefficients, residuals, R-squared, F-statistic) matches the simple regression on smv alone. The dummy should be checked, for example with table(garment_prod$dept_sewing), before drawing any conclusion about department from this model. The model’s explanatory power remains limited, with only about 1.49% of the variance in the response variable explained.
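The following sketch reproduces this kind of NA coefficient on simulated data (not the factory data). It assumes, for illustration only, that the department label is misspelled as "sweing", so the comparison against "sewing" never matches and the dummy is zero in every row.

```r
# Simulated data for illustration only; "sweing" is a deliberate misspelling
set.seed(42)
toy <- data.frame(
  department = rep(c("sweing", "finishing"), each = 10),
  smv = runif(20, 3, 30)
)
toy$y <- 0.8 - 0.002 * toy$smv + rnorm(20, sd = 0.05)

# Comparing against a spelling that never occurs gives a dummy of all zeros
toy$dept_sewing <- as.numeric(toy$department == "sewing")
table(toy$dept_sewing)

# A constant column is redundant with the intercept, so lm() reports NA for it
fit <- lm(y ~ smv + dept_sewing, data = toy)
coef(fit)
```

Running table() on the dummy before fitting is a cheap check: a dummy that takes only one value cannot be estimated.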
par(mfrow = c(2, 2))
plot(model_1, which = 1)
plot(model_1, which = 2)
plot(model_1, which = 3)
plot(model_1, which = 4)
I will also add an interaction term between smv and department. This allows the smv effect to vary by department.
model_2 = lm(actual_productivity ~ smv*dept_sewing, data=garment_prod)
summary(model_2)
##
## Call:
## lm(formula = actual_productivity ~ smv * dept_sewing, data = garment_prod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52095 -0.08825 0.04300 0.11334 0.37991
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7644124 0.0085220 89.699 < 2e-16 ***
## smv -0.0019467 0.0004578 -4.252 2.28e-05 ***
## dept_sewing NA NA NA NA
## smv:dept_sewing NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared: 0.01491, Adjusted R-squared: 0.01408
## F-statistic: 18.08 on 1 and 1195 DF, p-value: 2.281e-05
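The coefficient table notes that two coefficients (dept_sewing and smv:dept_sewing) are not defined because of singularities. If dept_sewing is constant, its product with smv is redundant as well, so both terms are dropped and the fit is identical to the simple regression on smv.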
The residual standard error, 0.1733, represents the typical size of the errors (residuals) in the model’s predictions. It provides a measure of the model’s goodness of fit.
The multiple R-squared value, 0.01491, represents the proportion of the variance in the dependent variable (actual_productivity) that is explained by the independent variable (smv). In this case, only about 1.491% of the variance is explained by the model.
The adjusted R-squared, 0.01408, is a version of R-squared that accounts for the number of predictors in the model. It is similar to the multiple R-squared.
The F-statistic, 18.08, tests the overall significance of the model. A high F-statistic with a low p-value (2.281e-05) suggests that the model is statistically significant.
par(mfrow = c(2, 2))
plot(model_2, which = 1)
plot(model_2, which = 2)
plot(model_2, which = 3)
plot(model_2, which = 4)
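The interaction term could not be estimated: both dept_sewing and smv:dept_sewing are NA, so model_2 reduces to the same fit as the simple regression. The smv coefficient is unchanged (-0.0019467, still negative and significant), and the residuals, R-squared and F-statistic are identical across all three models. These results therefore say nothing about whether department changes the relationship between smv and productivity. The diagnostic plots are the same as for the first model.
In summary, adding the department dummy and the interaction term as coded here does not change the model fit. Whether the effect of smv on productivity varies by department remains an open question until the dummy is rebuilt so that it actually distinguishes the two departments, for example by matching the department label exactly as it appears in the data or by passing department to lm() as a factor. Only then can the model guide optimization efforts by department.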
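One way to avoid hand-built dummies is to pass the categorical column to lm() as a factor and compare the additive and interaction models with a nested F-test. The sketch below shows this on simulated data, where the two groups are given different slopes on purpose. The data frame toy and its values are assumptions and say nothing about the factory data.

```r
# Simulated data for illustration only; slopes differ by group on purpose
set.seed(7)
toy <- data.frame(
  department = factor(rep(c("finishing", "sewing"), each = 30)),
  smv = c(runif(30, 3, 5), runif(30, 15, 30))
)
toy$y <- ifelse(toy$department == "sewing",
                0.9 - 0.006 * toy$smv,
                0.8 - 0.010 * toy$smv) + rnorm(60, sd = 0.03)

# Passing the factor directly lets lm() build the dummy and interaction columns
fit_add <- lm(y ~ smv + department, data = toy)
fit_int <- lm(y ~ smv * department, data = toy)

# All four coefficients are estimable because both levels are present
coef(fit_int)

# Nested F-test: does the interaction improve on the additive model?
anova(fit_add, fit_int)
```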