Week 11

Load CSV file

Load the CSV file into the garment_prod data frame and store team as character, since team numbers are labels rather than quantities.

garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)

Load required libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

library(modelr)

## 
## Attaching package: 'modelr'
## 
## The following objects are masked from 'package:yardstick':
## 
##     mae, mape, rmse
## 
## The following object is masked from 'package:broom':
## 
##     bootstrap

Exploring data

glimpse(garment_prod)

## Rows: 1,197
## Columns: 15
## $ date                  <chr> "1/1/15", "1/1/15", "1/1/15", "1/1/15", "1/1/15"…
## $ quarter               <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department            <chr> "sweing", "finishing ", "sweing", "sweing", "swe…
## $ day                   <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
## $ team                  <chr> "8", "1", "11", "12", "6", "7", "2", "3", "2", "…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ smv                   <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ wip                   <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681…
## $ over_time             <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive             <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers         <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity   <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…

summary(garment_prod)

##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      team           targeted_productivity      smv             wip         
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0  
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                           NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
##
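
The summary above reports 506 missing values in ‘wip’, which matters later because lm() silently drops incomplete rows. A minimal sketch of counting missing values per column follows. It uses R's built-in airquality data as a stand-in (an assumption made only so the example runs without the CSV); on the real data the same calls would take garment_prod.

```r
# Stand-in data: the built-in airquality data set, NOT the garment data.
dat <- airquality

# Number of missing values in each column.
na_counts <- colSums(is.na(dat))
print(na_counts)

# Share of rows with no missing value in any column; a model that uses
# every column would be fitted on only this share of the rows.
complete_share <- mean(complete.cases(dat))
print(complete_share)
```

Checking this before modelling makes it clear how many rows a model will actually use.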

Linear Regression Model

To build a linear regression model, I use ‘actual_productivity’ as the response variable and ‘smv’, ‘wip’, and ‘no_of_workers’ as the explanatory variables. Because ‘wip’ is missing in 506 rows, lm() drops those rows and fits the model on the remaining 691.

# Building model on garment_prod data
model <- lm(actual_productivity ~ smv + wip + no_of_workers, data = garment_prod)

# summarizing the model
summary(model)

## 
## Call:
## lm(formula = actual_productivity ~ smv + wip + no_of_workers, 
##     data = garment_prod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48317 -0.05458  0.03700  0.09007  0.36298 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.144e-01  3.268e-02  21.862  < 2e-16 ***
## smv           -5.029e-03  1.012e-03  -4.968 8.56e-07 ***
## wip            9.991e-06  3.139e-06   3.183  0.00152 ** 
## no_of_workers  2.148e-03  7.497e-04   2.865  0.00430 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1511 on 687 degrees of freedom
##   (506 observations deleted due to missingness)
## Multiple R-squared:  0.05128,    Adjusted R-squared:  0.04714 
## F-statistic: 12.38 on 3 and 687 DF,  p-value: 6.825e-08

Coefficient interpretation

Intercept (Intercept Estimate = 7.144e-01):
- The intercept is the estimated value of ‘actual_productivity’ when all predictors are zero, approximately 0.7144.
- It has no practical interpretation in this model, because none of the predictors is zero in the data (the minimums are 2.90 for ‘smv’, 7 for ‘wip’, and 2 for ‘no_of_workers’). It serves only as a baseline reference point.
smv (smv Estimate = -5.029e-03):
- Holding the other predictors constant, a one-unit increase in ‘smv’ is associated with an estimated decrease of approximately 0.00503 in ‘actual_productivity’.
- The estimate is negative and statistically significant (p-value < 0.001), so higher ‘smv’ values are associated with lower ‘actual_productivity’.
wip (wip Estimate = 9.991e-06):
- Holding the other predictors constant, a one-unit increase in ‘wip’ is associated with an estimated increase of approximately 9.991e-06 in ‘actual_productivity’.
- The estimate is positive and statistically significant (p-value = 0.00152), but the effect is small: a 1,000-unit increase in ‘wip’ corresponds to an increase of only about 0.01.
no_of_workers (no_of_workers Estimate = 2.148e-03):
- Holding the other predictors constant, a one-unit increase in ‘no_of_workers’ is associated with an estimated increase of approximately 0.00215 in ‘actual_productivity’.
- The estimate is positive and statistically significant (p-value = 0.00430), suggesting that having more workers is associated with higher ‘actual_productivity’.
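
To make the interpretation concrete, the coefficients can be combined into a prediction by hand. The sketch below copies the estimates from the summary(model) output; the observation (smv = 15, wip = 1000, no_of_workers = 34) is a hypothetical example chosen near the medians, not a row from the data.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(intercept     = 7.144e-01,
           smv           = -5.029e-03,
           wip           = 9.991e-06,
           no_of_workers = 2.148e-03)

# Hypothetical observation (illustrative values, not taken from the data).
x <- c(smv = 15, wip = 1000, no_of_workers = 34)

# Prediction = intercept + sum over predictors of coefficient * value.
pred <- coefs[["intercept"]] + sum(coefs[names(x)] * x)
print(round(pred, 4))
```

With the fitted object available, predict(model, newdata = data.frame(smv = 15, wip = 1000, no_of_workers = 34)) should give nearly the same value, differing only because the printed coefficients are rounded.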

Diagnostic plots

# Diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
plot(model)
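
plot() on an lm object draws four diagnostic plots one after another. A small sketch of arranging them in a single 2 x 2 panel is shown below. It fits a model on the built-in mtcars data purely as a stand-in so that it runs on its own; the same par() and plot() calls would apply to model.

```r
# Stand-in model on built-in data (NOT the garment model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Show the four diagnostic plots in one 2 x 2 grid, then restore the
# previous graphics settings so later plots are unaffected.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

Seeing all four panels side by side makes it easier to compare the residual pattern with the leverage plot.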

Scatter plot

ggplot(garment_prod, aes(x = smv, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "smv") + 
  ggtitle("Scatter Plot of actual_productivity vs. smv") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(garment_prod, aes(x = wip, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "wip") + 
  ggtitle("Scatter Plot of actual_productivity vs. wip") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 506 rows containing non-finite values (`stat_smooth()`).

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 506 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 506 rows containing missing values (`geom_point()`).

ggplot(garment_prod, aes(x = no_of_workers, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "no_of_workers") + 
  ggtitle("Scatter Plot of actual_productivity vs. no_of_workers") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Transformation

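The scatter plots above suggest that the relationships between the predictors and ‘actual_productivity’ are not entirely linear. We therefore apply a logarithmic transformation to ‘smv’ (a transformation of the predictor, not of the whole model) to linearize its relationship with the response.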

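We also remove rows whose ‘wip’ value is at or above the 95th percentile, to eliminate high-leverage points. Rows with missing ‘wip’ are dropped in the same step, which lm() would do anyway.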

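# Work on a copy so garment_prod stays unchanged
df <- garment_prod
# Natural log of smv; the column keeps the name 'smv'
df$smv <- log(df$smv)
# Drop rows with missing wip and rows at or above the 95th percentile of wip
df <- df %>% filter(!is.na(wip) & wip < quantile(wip, 0.95, na.rm = TRUE))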

model_t <- lm(actual_productivity ~ smv + wip + no_of_workers, data = df)
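
Trimming the top 5% of ‘wip’ is a blunt way to deal with leverage. As a complementary check, hat values can be inspected directly. The sketch below uses the built-in mtcars data as a stand-in so that it runs without the CSV. The 2p/n cutoff is a common rule of thumb, not something taken from this analysis.

```r
# Stand-in model on built-in data (NOT the garment model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hat values measure the leverage of each observation.
h <- hatvalues(fit)

# Rule-of-thumb cutoff: twice the average leverage, i.e. 2 * p / n,
# where p counts the coefficients including the intercept.
p <- length(coef(fit))
n <- nobs(fit)
threshold <- 2 * p / n

# Names of the observations flagged as high leverage.
high_leverage <- names(h)[h > threshold]
print(high_leverage)
```

Applied to model, this would show whether the large ‘wip’ values are in fact the high-leverage rows before they are removed.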

Diagnostic plots

plot(model_t)

Scatter plots

ggplot(df, aes(x = smv, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "log(smv)") + 
  ggtitle("Scatter Plot of actual_productivity vs. log(smv)") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(df, aes(x = wip, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "wip") + 
  ggtitle("Scatter Plot of actual_productivity vs. wip") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

summary(model_t)

## 
## Call:
## lm(formula = actual_productivity ~ smv + wip + no_of_workers, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52733 -0.06730  0.04090  0.09264  0.31971 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.071e-01  5.801e-02  13.913  < 2e-16 ***
## smv           -9.953e-02  2.265e-02  -4.395 1.29e-05 ***
## wip            1.170e-04  1.684e-05   6.952 8.77e-12 ***
## no_of_workers  2.108e-03  7.661e-04   2.752  0.00609 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1459 on 652 degrees of freedom
## Multiple R-squared:  0.09745,    Adjusted R-squared:  0.09329 
## F-statistic: 23.46 on 3 and 652 DF,  p-value: 1.969e-14

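Comparing the plots before and after the transformation, the ‘smv’ and ‘wip’ relationships look closer to linear. The fit also improves: adjusted R-squared rises from 0.047 to 0.093 and the residual standard error falls from 0.1511 to 0.1459. The model still explains less than 10% of the variance, and the two models are fitted on different rows (691 versus 656), so the comparison is only indicative.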
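
Goodness-of-fit measures are easiest to compare when both models are fitted on exactly the same rows. The sketch below illustrates the idea on the built-in mtcars data, a stand-in chosen only so the example is self-contained; hp plays the role of the log-transformed predictor.

```r
# Stand-in data (NOT the garment data); both models use the same rows.
dat <- mtcars

fit_raw <- lm(mpg ~ hp + wt, data = dat)       # predictor on its raw scale
fit_log <- lm(mpg ~ log(hp) + wt, data = dat)  # predictor log-transformed

# Confirm that the two fits really use the same number of observations.
stopifnot(nobs(fit_raw) == nobs(fit_log))

# Adjusted R-squared of each model, side by side.
adj_r2 <- c(raw = summary(fit_raw)$adj.r.squared,
            log = summary(fit_log)$adj.r.squared)
print(round(adj_r2, 3))
```

For the garment data, this would mean refitting the untransformed model on the trimmed rows before comparing it with model_t.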

Coefficient interpretation

Intercept (Intercept Estimate = 0.8071):
- The intercept is the estimated value of ‘actual_productivity’ when log(smv), ‘wip’, and ‘no_of_workers’ are all zero, that is, when ‘smv’ equals 1 and there is no work in progress and no workers. This combination lies outside the observed data, so the intercept is a baseline rather than a meaningful prediction.
smv (smv Estimate = -0.0995):
- ‘smv’ is log-transformed in this model, so the coefficient applies to log(smv). Holding the other predictors constant, a one-unit increase in log(smv), which corresponds to multiplying ‘smv’ by about 2.72, is associated with an estimated decrease of approximately 0.0995 in ‘actual_productivity’.
- The estimate is negative, indicating that higher ‘smv’ values are associated with lower ‘actual_productivity’.
- The p-value (1.29e-05) is very small, indicating that this effect is highly statistically significant.
wip (wip Estimate = 0.000117):
- Holding the other predictors constant, a one-unit increase in ‘wip’ is associated with an estimated increase of approximately 0.000117 in ‘actual_productivity’.
- The estimate is positive, suggesting that higher ‘wip’ values are associated with slightly higher ‘actual_productivity’.
- The p-value (8.77e-12) is very small, indicating that this effect is highly statistically significant.
no_of_workers (no_of_workers Estimate = 0.0021):
- Holding the other predictors constant, a one-unit increase in ‘no_of_workers’ is associated with an estimated increase of approximately 0.0021 in ‘actual_productivity’.
- The estimate is positive, suggesting that having more workers is associated with higher ‘actual_productivity’.
- The p-value (0.00609) is less than 0.01, indicating that this effect is statistically significant, though the evidence is weaker than for ‘smv’ and ‘wip’.
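
Because ‘smv’ enters model_t on the natural-log scale, its coefficient is often easier to read as the effect of a proportional change in ‘smv’. The sketch below works this out from the estimate reported in summary(model_t); the doubling and 10% scenarios are illustrative choices.

```r
# Coefficient on log(smv), copied from the summary(model_t) output above.
b_log_smv <- -9.953e-02

# A change from smv to k * smv shifts log(smv) by log(k), so the
# estimated change in actual_productivity is b * log(k).
effect_double <- b_log_smv * log(2)     # smv doubles
effect_10pct  <- b_log_smv * log(1.10)  # smv rises by 10%

print(round(c(doubling = effect_double, ten_percent = effect_10pct), 4))
```

Under these assumptions, doubling ‘smv’ corresponds to an estimated drop of roughly 0.07 in ‘actual_productivity’, with the other predictors held constant.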

2023-11-06
