Load the CSV file into the garment_prod variable.
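# Read the raw data (the path is specific to the author's machine)
garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
# Store team as a label rather than a numeric value
garment_prod$team <- as.character(garment_prod$team)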
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(modelr)
##
## Attaching package: 'modelr'
##
## The following objects are masked from 'package:yardstick':
##
## mae, mape, rmse
##
## The following object is masked from 'package:broom':
##
## bootstrap
glimpse(garment_prod)
## Rows: 1,197
## Columns: 15
## $ date <chr> "1/1/15", "1/1/15", "1/1/15", "1/1/15", "1/1/15"…
## $ quarter <chr> "Quarter1", "Quarter1", "Quarter1", "Quarter1", …
## $ department <chr> "sweing", "finishing ", "sweing", "sweing", "swe…
## $ day <chr> "Thursday", "Thursday", "Thursday", "Thursday", …
## $ team <chr> "8", "1", "11", "12", "6", "7", "2", "3", "2", "…
## $ targeted_productivity <dbl> 0.80, 0.75, 0.80, 0.80, 0.80, 0.80, 0.75, 0.75, …
## $ smv <dbl> 26.16, 3.94, 11.41, 11.41, 25.90, 25.90, 3.94, 2…
## $ wip <int> 1108, NA, 968, 968, 1170, 984, NA, 795, 733, 681…
## $ over_time <int> 7080, 960, 3660, 3660, 1920, 6720, 960, 6900, 60…
## $ incentive <int> 98, 0, 50, 50, 50, 38, 0, 45, 34, 45, 44, 45, 50…
## $ idle_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ idle_men <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_style_change <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ no_of_workers <dbl> 59.0, 8.0, 30.5, 30.5, 56.0, 56.0, 8.0, 57.5, 55…
## $ actual_productivity <dbl> 0.9407254, 0.8865000, 0.8005705, 0.8005705, 0.80…
summary(garment_prod)
## date quarter department day
## Length:1197 Length:1197 Length:1197 Length:1197
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## team targeted_productivity smv wip
## Length:1197 Min. :0.0700 Min. : 2.90 Min. : 7.0
## Class :character 1st Qu.:0.7000 1st Qu.: 3.94 1st Qu.: 774.5
## Mode :character Median :0.7500 Median :15.26 Median : 1039.0
## Mean :0.7296 Mean :15.06 Mean : 1190.5
## 3rd Qu.:0.8000 3rd Qu.:24.26 3rd Qu.: 1252.5
## Max. :0.8000 Max. :54.56 Max. :23122.0
## NA's :506
## over_time incentive idle_time idle_men
## Min. : 0 Min. : 0.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1440 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 3960 Median : 0.00 Median : 0.0000 Median : 0.0000
## Mean : 4567 Mean : 38.21 Mean : 0.7302 Mean : 0.3693
## 3rd Qu.: 6960 3rd Qu.: 50.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :25920 Max. :3600.00 Max. :300.0000 Max. :45.0000
##
## no_of_style_change no_of_workers actual_productivity
## Min. :0.0000 Min. : 2.00 Min. :0.2337
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:0.6503
## Median :0.0000 Median :34.00 Median :0.7733
## Mean :0.1504 Mean :34.61 Mean :0.7351
## 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:0.8503
## Max. :2.0000 Max. :89.00 Max. :1.1204
##
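The summary above reports 506 missing values in ‘wip’ and none in the other columns. Since lm() drops incomplete rows by default, any model that includes ‘wip’ is fit on the remaining 1197 - 506 = 691 rows. A minimal sketch of how the missing values could be counted is shown below; the toy data frame and the names `toy`, `na_counts` and `complete_rows` are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame standing in for garment_prod (values are illustrative only)
toy <- data.frame(
  smv           = c(26.16, 3.94, 11.41, 25.90),
  wip           = c(1108, NA, 968, NA),
  no_of_workers = c(59, 8, 30.5, 56)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing values, i.e. the rows that lm() keeps
# under the default na.action (na.omit)
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Applied to the real data, the same two calls would be run on garment_prod instead of the toy data frame.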
To build a linear regression model, I’ll consider ‘actual_productivity’ as the response variable and ‘smv’, ‘wip’ and ‘no_of_workers’ as the explanatory variables.
# Building model on garment_prod data
model <- lm(actual_productivity ~ smv + wip + no_of_workers, data = garment_prod)
# summarizing the model
summary(model)
##
## Call:
## lm(formula = actual_productivity ~ smv + wip + no_of_workers,
## data = garment_prod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48317 -0.05458 0.03700 0.09007 0.36298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.144e-01 3.268e-02 21.862 < 2e-16 ***
## smv -5.029e-03 1.012e-03 -4.968 8.56e-07 ***
## wip 9.991e-06 3.139e-06 3.183 0.00152 **
## no_of_workers 2.148e-03 7.497e-04 2.865 0.00430 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1511 on 687 degrees of freedom
## (506 observations deleted due to missingness)
## Multiple R-squared: 0.05128, Adjusted R-squared: 0.04714
## F-statistic: 12.38 on 3 and 687 DF, p-value: 6.825e-08
Intercept (Intercept Estimate = 7.144e-01):
The intercept represents the estimated value of ‘actual_productivity’ when all predictor variables are zero. In this case, it is approximately 0.7144.
In this model the intercept has no practical interpretation, since the predictors (‘smv’, ‘wip’, ‘no_of_workers’) are unlikely to be zero in practice. Instead, it serves as a baseline reference point for the model.
smv (smv Estimate = -5.029e-03):
For every one-unit increase in ‘smv’, ‘actual_productivity’ is estimated to decrease by approximately 0.00503, holding ‘wip’ and ‘no_of_workers’ constant.
The estimate is negative and statistically significant (p-value < 0.001), so higher ‘smv’ values are associated with lower ‘actual_productivity’.
wip (wip Estimate = 9.991e-06):
For every one-unit increase in ‘wip’, ‘actual_productivity’ is estimated to increase by approximately 9.991e-06, holding the other predictors constant.
The estimate is positive and statistically significant (p-value = 0.00152), indicating that higher ‘wip’ values are associated with slightly higher ‘actual_productivity’.
no_of_workers (no_of_workers Estimate = 2.148e-03):
For every one-unit increase in ‘no_of_workers’, ‘actual_productivity’ is estimated to increase by approximately 0.00215, holding the other predictors constant.
The estimate is positive and statistically significant (p-value = 0.00430), suggesting that having more workers is associated with higher ‘actual_productivity’.
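The "one-unit increase" reading of a coefficient can be checked directly: in a linear model, raising one predictor by one unit while keeping the others fixed changes the prediction by exactly that predictor's coefficient. The sketch below demonstrates this on simulated data; the variables `x1`, `x2`, `y` and the object `fit` are hypothetical and only stand in for the garment model.

```r
set.seed(1)

# Simulated data with a known linear structure (not the garment data)
n <- 200
sim <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 5))
sim$y <- 1 + 0.5 * sim$x1 - 0.2 * sim$x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = sim)

# Two hypothetical observations that differ only by one unit of x1
base   <- data.frame(x1 = 4, x2 = 2)
bumped <- data.frame(x1 = 5, x2 = 2)

# Difference in predictions equals the x1 coefficient
diff_pred <- unname(predict(fit, bumped) - predict(fit, base))
coef_x1   <- unname(coef(fit)["x1"])
print(c(diff_pred = diff_pred, coef_x1 = coef_x1))
```

The same comparison could be made with `model` and two rows that differ only in ‘smv’.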
# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
plot(model)
ggplot(garment_prod, aes(x = smv, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "smv") +
  ggtitle("Scatter Plot of actual_productivity vs. smv") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(garment_prod, aes(x = wip, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "wip") +
  ggtitle("Scatter Plot of actual_productivity vs. wip") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 506 rows containing non-finite values (`stat_smooth()`).
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 506 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 506 rows containing missing values (`geom_point()`).
ggplot(garment_prod, aes(x = no_of_workers, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "no_of_workers") +
  ggtitle("Scatter Plot of actual_productivity vs. no_of_workers") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The above scatter plots show that the relationships between the predictors and ‘actual_productivity’ are not entirely linear. So, we apply a logarithmic transformation to ‘smv’ to linearize its relationship with ‘actual_productivity’.
We also remove the top 5% of ‘wip’ values to eliminate high-leverage points.
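df <- garment_prod
# Log-transform smv; from here on, df$smv holds log(smv)
df$smv <- log(df$smv)
# Drop rows with missing wip and rows at or above the 95th percentile of wip
df <- df %>% filter(!is.na(wip) & wip < quantile(wip, 0.95, na.rm = TRUE))
# Refit the same model on the transformed, trimmed data
model_t <- lm(actual_productivity ~ smv + wip + no_of_workers, data = df)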
plot(model_t)
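One way to check whether trimming the largest values of a predictor actually reduces leverage is to inspect the hat values of the fitted model. The sketch below does this on simulated data with a few extreme predictor values. The 2p/n cutoff is a common rule of thumb and an assumption here, not something used in the original analysis, and all names (`sim`, `fit`, `fit_trim`) are hypothetical.

```r
set.seed(42)

# Simulated predictor: 95 ordinary values plus 5 extreme ones (not the garment data)
n <- 100
sim <- data.frame(
  x = c(rnorm(n - 5, mean = 1000, sd = 200), 9000, 12000, 15000, 20000, 23000)
)
sim$y <- 0.7 + 0.00001 * sim$x + rnorm(n, sd = 0.1)

fit <- lm(y ~ x, data = sim)

# Leverage (hat value) of each observation
lev <- hatvalues(fit)

# Rule of thumb: flag leverage above 2 * p / n, where p = number of coefficients
p <- length(coef(fit))
high_lev <- which(lev > 2 * p / n)
print(high_lev)

# Trim at the 95th percentile of x, mirroring the filter applied to wip
cutoff   <- quantile(sim$x, 0.95)
trimmed  <- sim[sim$x < cutoff, ]
fit_trim <- lm(y ~ x, data = trimmed)

# Compare the largest leverage before and after trimming
max_before <- max(hatvalues(fit))
max_after  <- max(hatvalues(fit_trim))
print(c(before = max_before, after = max_after))
```

For the garment data, the analogous check would compare hatvalues(model) with hatvalues(model_t).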
# df$smv is already log-transformed, so label the axis accordingly
ggplot(df, aes(x = smv, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "log(smv)") +
  ggtitle("Scatter Plot of actual_productivity vs. log(smv)") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(df, aes(x = wip, y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(y = "actual_productivity", x = "wip") +
  ggtitle("Scatter Plot of actual_productivity vs. wip") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
summary(model_t)
##
## Call:
## lm(formula = actual_productivity ~ smv + wip + no_of_workers,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52733 -0.06730 0.04090 0.09264 0.31971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.071e-01 5.801e-02 13.913 < 2e-16 ***
## smv -9.953e-02 2.265e-02 -4.395 1.29e-05 ***
## wip 1.170e-04 1.684e-05 6.952 8.77e-12 ***
## no_of_workers 2.108e-03 7.661e-04 2.752 0.00609 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1459 on 652 degrees of freedom
## Multiple R-squared: 0.09745, Adjusted R-squared: 0.09329
## F-statistic: 23.46 on 3 and 652 DF, p-value: 1.969e-14
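Comparing the plots before and after the transformation, the relationships look noticeably closer to linear: ‘actual_productivity’ against log(‘smv’), and against ‘wip’ once the top 5% of ‘wip’ values are removed. The adjusted R-squared also rises from 0.04714 to 0.09329, although the model still explains less than 10% of the variation in ‘actual_productivity’.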
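To illustrate why a log transform can straighten a curved relationship, the sketch below simulates a response that depends on the logarithm of a predictor and compares the fit of a straight line in x with a straight line in log(x). The data are simulated (the range of `x` only loosely mimics the 2.90 to 54.56 range of ‘smv’), and the names `r2_raw` and `r2_log` are hypothetical.

```r
set.seed(7)

# Simulated data: y is linear in log(x), not in x (not the garment data)
n <- 300
x <- runif(n, 3, 55)
y <- 1 - 0.1 * log(x) + rnorm(n, sd = 0.02)

# R-squared of a straight-line fit on the raw and on the log scale
r2_raw <- summary(lm(y ~ x))$r.squared
r2_log <- summary(lm(y ~ log(x)))$r.squared
print(c(raw = r2_raw, log = r2_log))
```

When the underlying relationship is logarithmic, the fit on the log scale should capture more of the variation than the fit on the raw scale.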
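Intercept (Intercept Estimate = 0.8071):
The intercept is the estimated ‘actual_productivity’ when log(smv), ‘wip’ and ‘no_of_workers’ are all zero, that is, for smv = 1 with no work in progress and no workers. As in the first model, it is a baseline reference point rather than a practically meaningful scenario.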
smv (smv Estimate = -0.0995):
Because ‘smv’ was log-transformed before fitting, this coefficient is on the log scale: a one-unit increase in log(smv), i.e. multiplying ‘smv’ by about 2.72, is associated with an estimated decrease of approximately 0.0995 in ‘actual_productivity’, holding the other predictors constant.
The estimate is negative, indicating that higher ‘smv’ values are associated with lower ‘actual_productivity’.
The p-value (1.29e-05) is very small, indicating that this effect is highly statistically significant.
wip (wip Estimate = 0.000117):
For every one-unit increase in ‘wip’, ‘actual_productivity’ is estimated to increase by approximately 0.000117, holding the other predictors constant. This is roughly twelve times the estimate in the first model (9.991e-06), which suggests the extreme ‘wip’ values were pulling the slope toward zero.
The estimate is positive, suggesting that higher ‘wip’ values are associated with slightly higher ‘actual_productivity’.
The p-value (8.77e-12) is very small, indicating that this effect is highly statistically significant.
no_of_workers (no_of_workers Estimate = 0.0021):
For every one-unit increase in ‘no_of_workers,’ ‘actual_productivity’ is estimated to increase by approximately 0.0021 units.
The estimate is positive, suggesting that having more workers is associated with higher ‘actual_productivity.’
The p-value (0.00609) is less than 0.01, indicating that this effect is statistically significant, though the evidence is weaker than for ‘smv’ and ‘wip’.
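Since df$smv holds log(smv) when model_t is fit, the ‘smv’ coefficient is easier to read in terms of proportional changes in the original ‘smv’: a change from smv to k * smv shifts the prediction by the coefficient times log(k). The short calculation below works this out for a 1% increase and for a doubling, using the estimate reported in summary(model_t); the variable names are illustrative.

```r
# smv estimate from summary(model_t); the predictor is log(smv)
beta_log_smv <- -0.09953

# Estimated change in actual_productivity for a 1% increase in smv
change_per_1pct <- beta_log_smv * log(1.01)

# Estimated change in actual_productivity when smv doubles
change_per_doubling <- beta_log_smv * log(2)

print(round(change_per_1pct, 5))      # about -0.00099
print(round(change_per_doubling, 4))  # about -0.069
```

So, under this model, doubling ‘smv’ is associated with a drop of roughly 0.069 in ‘actual_productivity’, holding ‘wip’ and ‘no_of_workers’ constant.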