Week 10

Load CSV file

Load the CSV file into the garment_prod variable and convert team to character so that it is treated as a label rather than a number.

garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)

##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      team           targeted_productivity      smv             wip         
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0  
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                           NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
##

Binary columns

Binary columns typically refer to categorical variables that have only two possible categories or levels. These variables are also known as binary factors. The two categories are often coded as 0 and 1, where 0 represents one category and 1 represents the other.

Upon reviewing my dataset, I didn’t find any column that is already binary and suitable for modelling. So, I’ll convert a useful column in my data into a binary column by classifying its values against a criterion.

Here, I’ll derive the binary column from over_time. This column captures the amount of overtime worked. It can be converted into a binary variable indicating whether any overtime was worked or not. Modeling overtime could help understand operational capacity and staffing needs. Criterion: 1 if over_time > 0, 0 otherwise.

garment_prod$overtime <- as.numeric(garment_prod$over_time >0)
View(garment_prod)
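Before modelling a derived binary column, it can help to check how the two classes are split. The sketch below repeats the conversion on a few made-up over_time values (they are not taken from the dataset) and tabulates the result.

```r
# Toy over_time values, made up for illustration only.
over_time <- c(0, 1440, 3960, 0, 6960, 25920)

# The comparison yields TRUE/FALSE; as.numeric() turns that into 1/0.
overtime <- as.numeric(over_time > 0)

# Counts and shares per class. A very lopsided split leaves little
# information about the rarer class and widens coefficient intervals.
class_counts <- table(overtime)
class_share <- prop.table(class_counts)

print(class_counts)
print(round(class_share, 3))
```

On the real data, the same two calls would be applied to garment_prod$overtime.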

Logistic regression model

In the above chunk, I have converted the “over_time” column to a binary variable called “overtime”, which is 1 if overtime was worked and 0 if not. For explanatory variables, I’ll consider “day” and “no_of_workers”.

model <- glm(overtime ~ day +no_of_workers, data = garment_prod, family = "binomial")
summary(model)

## 
## Call:
## glm(formula = overtime ~ day + no_of_workers, family = "binomial", 
##     data = garment_prod)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.237256   0.462947   6.993  2.7e-12 ***
## daySaturday    2.552170   1.043441   2.446   0.0144 *  
## daySunday      2.645373   1.043206   2.536   0.0112 *  
## dayThursday    1.529516   0.649467   2.355   0.0185 *  
## dayTuesday     0.665567   0.480986   1.384   0.1664    
## dayWednesday   0.857244   0.504951   1.698   0.0896 .  
## no_of_workers -0.015159   0.008834  -1.716   0.0862 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 287.71  on 1196  degrees of freedom
## Residual deviance: 264.28  on 1190  degrees of freedom
## AIC: 278.28
## 
## Number of Fisher Scoring iterations: 8
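The coefficient table lists five days but not the baseline they are compared against. A minimal sketch of how R chooses that baseline, assuming the day column holds the six labels below (inferred from the coefficient names, not checked against the data):

```r
# Assumed day labels: the five that appear in the coefficient table
# plus the one that has no coefficient of its own.
day <- c("Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday")

# A character predictor is converted to a factor whose levels are sorted
# alphabetically; the first level is the reference under the default
# treatment contrasts.
day_factor <- factor(day)
reference_level <- levels(day_factor)[1]
print(reference_level)

# relevel() selects a different baseline if another comparison is wanted.
day_releveled <- relevel(day_factor, ref = "Sunday")
print(levels(day_releveled)[1])
```

With a different baseline the day coefficients change, because each one is a contrast against the reference level.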

Summary interpretation

Based on the model output, here is an interpretation of the coefficients:

(Intercept): The intercept (3.237) is the log-odds of overtime on the reference day with no_of_workers equal to 0. The reference day is Monday, the alphabetically first level of day, which is why it has no coefficient of its own. A positive log-odds means the baseline probability of overtime is above 0.5; here it is about 0.96.
daySaturday: The coefficient for daySaturday is 2.552. On Saturdays the log-odds of overtime are 2.552 higher than on Monday, holding no_of_workers constant. This corresponds to an odds ratio of about exp(2.552) ≈ 12.8 and is significant at the 0.05 level (p = 0.0144).
daySunday: The coefficient for daySunday is 2.645 (odds ratio ≈ 14.1, p = 0.0112). The estimate is slightly larger than Saturday's, but the gap of about 0.09 is small relative to the standard errors (about 1.04), so the model does not show that Sundays differ from Saturdays.
dayThursday: The coefficient for dayThursday is 1.530 (odds ratio ≈ 4.6, p = 0.0185). Thursdays have significantly higher odds of overtime than Monday, with a smaller point estimate than Saturday or Sunday.
dayTuesday: The coefficient for dayTuesday (0.666, p = 0.166) is not statistically significant, so Tuesdays are not distinguishable from Monday in terms of overtime odds.
dayWednesday: The coefficient for dayWednesday (0.857, p = 0.0896) is positive but significant only at the 0.10 level, so the evidence for higher overtime odds on Wednesdays is weak.
no_of_workers: The coefficient of -0.015 means each additional worker lowers the log-odds of overtime by 0.015 (odds ratio ≈ 0.985), holding day constant. With p = 0.0862 this effect is not significant at the 0.05 level, so it is only weak evidence that overtime is less likely with larger staffing.

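In summary, Saturday, Sunday and Thursday have significantly higher overtime odds than the reference day (Monday) at the 0.05 level, while Tuesday and Wednesday do not. The estimated odds of overtime fall slightly as the number of workers rises, but that effect is significant only at the 0.10 level.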
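Log-odds are easier to read once converted to odds ratios and probabilities. The sketch below copies the estimates from the summary output above into a plain vector (so it runs without the dataset) and applies exp() and plogis(). The choice of 34 workers is an illustrative assumption, taken from the median shown in the data summary.

```r
# Estimates copied from the summary(model) output above.
est <- c("(Intercept)"  = 3.237256,
         daySaturday    = 2.552170,
         daySunday      = 2.645373,
         dayThursday    = 1.529516,
         dayTuesday     = 0.665567,
         dayWednesday   = 0.857244,
         no_of_workers  = -0.015159)

# exp() of a log-odds coefficient is an odds ratio against the reference day
# (or per additional worker for no_of_workers).
odds_ratios <- exp(est[-1])
print(round(odds_ratios, 3))

# plogis() is the inverse logit: it maps log-odds to a probability.
p_baseline <- plogis(est[["(Intercept)"]])

# Predicted probability for a Sunday with 34 workers (illustrative value).
eta_sunday <- est[["(Intercept)"]] + est[["daySunday"]] +
  34 * est[["no_of_workers"]]
p_sunday_34 <- plogis(eta_sunday)

print(round(c(baseline = p_baseline, sunday_34 = p_sunday_34), 3))
```

With the fitted model object available, predict(model, newdata = ..., type = "response") would give the same kind of probability without copying numbers by hand.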

Confidence interval

conf_int <- confint(model)

## Waiting for profiling to be done...

print(conf_int)

##                     2.5 %      97.5 %
## (Intercept)    2.39591586 4.223253060
## daySaturday    0.92203818 5.457323644
## daySunday      1.01596908 5.550301912
## dayThursday    0.37649355 3.015073706
## dayTuesday    -0.25175387 1.665182073
## dayWednesday  -0.09490365 1.922284750
## no_of_workers -0.03335261 0.001628879

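The output contains the 95% confidence intervals for the coefficients of the logistic regression model, on the log-odds scale, with the lower bound (2.5 %) and upper bound (97.5 %) of each. For a glm, confint() computes these by profiling the likelihood, which is what the “Waiting for profiling to be done...” message refers to. An interval that excludes zero corresponds to a coefficient that differs from zero at the 0.05 level; here that holds for the intercept, daySaturday, daySunday and dayThursday, but not for dayTuesday, dayWednesday or no_of_workers.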
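The same intervals can be read on the odds-ratio scale by exponentiating both bounds. A short sketch using three rows copied from the confint() output above (hard-coded so it runs on its own):

```r
# Interval bounds copied from the confint(model) output above (log-odds scale).
ci_log_odds <- rbind(
  daySaturday   = c(0.92203818, 5.457323644),
  daySunday     = c(1.01596908, 5.550301912),
  no_of_workers = c(-0.03335261, 0.001628879)
)
colnames(ci_log_odds) <- c("2.5 %", "97.5 %")

# exp() is monotone, so exponentiating the bounds gives the interval
# for the odds ratio.
ci_odds_ratio <- exp(ci_log_odds)
print(round(ci_odds_ratio, 3))

# On this scale "no effect" is an odds ratio of 1 rather than 0.
excludes_one <- ci_odds_ratio[, "2.5 %"] > 1 | ci_odds_ratio[, "97.5 %"] < 1
print(excludes_one)
```

With the model object in memory, exp(confint(model)) gives the full table in one step.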

Using standard error for daySunday coefficient

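coefs <- coef(model)             # coefficient estimates (log-odds scale)
se <- sqrt(diag(vcov(model)))    # standard errors from the covariance matrix

b_sun <- coefs['daySunday']
se_sun <- se['daySunday']

# Wald interval: estimate +/- z * standard error
ci <- b_sun + c(-1, 1) * qnorm(0.975) * se_sun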
ci

## [1] 0.600726 4.690019

In the above chunk, the daySunday coefficient is used. The output is the 95% Wald confidence interval (CI) for that coefficient, computed as the estimate plus or minus about 1.96 standard errors.

Lower bound is approximately 0.6007.
Upper bound is approximately 4.6900.

With 95% confidence, the log-odds of “overtime” on a Sunday are estimated to be between approximately 0.6007 and 4.6900 higher than on the reference day. Because the CI does not include zero, the effect of “daySunday” is statistically significant at the 0.05 level: the log-odds of overtime are significantly higher on Sunday than on the reference day.

This Wald interval is symmetric around the estimate by construction, which is why it differs from the profile-likelihood interval for daySunday reported by confint() above (1.016 to 5.550).

The entire interval is positive, so the true effect is likely an increase, but its width shows considerable uncertainty about the size of the effect. One possible reason is limited data for Sundays; with more informative observations we would expect the CI to narrow, giving greater precision on the overtime odds.
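To see how a Wald interval and a profile-likelihood interval can disagree, the sketch below fits a small logistic model to simulated data and computes both. The data, the group labels and the probabilities are all made up for illustration; confint.default() gives the Wald interval and confint() the profile one.

```r
set.seed(1)

# Simulated data: overtime is common in both groups, more so in "other".
n <- 200
group <- factor(rep(c("ref", "other"), each = n / 2), levels = c("ref", "other"))
p <- ifelse(group == "ref", 0.80, 0.95)
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ group, family = binomial)

# Wald interval: estimate +/- z * SE, symmetric around the estimate.
wald_ci <- confint.default(fit)["groupother", ]

# Profile-likelihood interval: follows the shape of the likelihood,
# so it need not be symmetric.
profile_ci <- suppressMessages(confint(fit))["groupother", ]

print(rbind(wald = wald_ci, profile = profile_ci))
```

The two intervals tend to diverge most when one outcome is rare within a group, which is the situation in which the Wald approximation is least reliable.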

Transformation

Let’s consider “no_of_workers” as the explanatory variable and “actual_productivity” as the response. A plot can be devised using “overtime” as the filter.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

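# Use bare column names so lm() takes the variables from the filtered data;
# garment_prod$... inside the formula would ignore `data` and fit all rows.
lm_no_ot <- lm(actual_productivity ~ no_of_workers,
               data = filter(garment_prod, overtime == 0))

rsquared <- summary(lm_no_ot)$r.squared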

garment_prod |> 
  filter(overtime == 0) |>
  ggplot(mapping = aes(x = no_of_workers, 
                       y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', 
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Actual productivity vs no of workers",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plot interpretation

The plot above shows the rows with no overtime as a scatter plot of the number of workers (no_of_workers) against actual productivity (actual_productivity), overlaid with two fitted lines: a dashed grey linear regression line and a solid blue smoothed curve. The R-squared value in the subtitle quantifies the goodness of fit of the linear model.

The grey dashed line represents the linear regression model fitted to the data. It is a visual representation of the best-fit linear relationship between the predictor variable (no_of_workers) and the response variable (actual_productivity) when overtime is equal to 0. In other words, it shows the estimated linear relationship between the number of workers and actual productivity for cases with no overtime.

The grey line is straight by construction, since it comes from a linear model. Its inclination indicates a non-zero slope, and in the plot above the slope is positive, suggesting that productivity tends to rise with the number of workers.

The blue solid line is a second smoother, drawn by geom_smooth() without an explicit method. As the message above shows, ggplot2 then defaults to LOESS, a local regression that follows the data without assuming a straight line. It is fitted to the same no-overtime rows as the grey line.

The curves in the blue line suggest that the actual relationship between the variables may not be strictly linear, so a simple linear model may not adequately capture the underlying relationship in the data.
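When a linear model is meant to describe only a subset of rows, the formula has to use bare column names. Writing df$column in the formula makes lm() take the full vectors from the original data frame and silently ignore the data argument. A minimal sketch with a made-up data frame (toy, workers, productivity and flag are hypothetical names):

```r
set.seed(42)

# Toy data: 40 rows, half of them flagged.
toy <- data.frame(
  workers = rep(c(10, 20, 30, 40), times = 10),
  flag    = rep(c(0, 1), each = 20)
)
toy$productivity <- 0.5 + 0.005 * toy$workers + 0.2 * toy$flag +
  rnorm(40, sd = 0.05)

subset_rows <- toy[toy$flag == 0, ]

# `toy$...` is looked up in the full data frame, so `data` has no effect.
fit_dollar <- lm(toy$productivity ~ toy$workers, data = subset_rows)

# Bare names are looked up in `data`, so only the subset is used.
fit_bare <- lm(productivity ~ workers, data = subset_rows)

print(c(dollar = nobs(fit_dollar), bare = nobs(fit_bare)))
print(c(dollar = summary(fit_dollar)$r.squared,
        bare   = summary(fit_bare)$r.squared))
```

Comparing nobs() for the two fits is a quick way to confirm which rows a model actually used.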

Transformation 2

Let’s consider “targeted_productivity” as the explanatory variable and “actual_productivity” as the response, and check for linearity using a scatter plot.

# x = targeted, y = actual, so the points match the axis labels and the fitted line
plot(garment_prod$targeted_productivity, garment_prod$actual_productivity,
     xlab = "Targeted Productivity", ylab = "Actual Productivity")
abline(lm(actual_productivity ~ targeted_productivity, data = garment_prod), col="red")

The red line is the fitted least-squares regression line. It meets the Y-axis somewhere between 0.2 and 0.4 and slopes upward, indicating a positive association. Since abline() always draws a straight line, linearity has to be judged from how the points scatter around it; on that basis, the relationship between actual_productivity and targeted_productivity looks positive and roughly linear.

Although the plot shows linearity, let’s also apply a log transformation to both variables.

garment_prod$log_targeted <- log(garment_prod$targeted_productivity)
garment_prod$log_actual <- log(garment_prod$actual_productivity)

plot(garment_prod$log_targeted, garment_prod$log_actual,
     xlab = "Log Targeted Productivity", ylab = "Log Actual Productivity")
abline(lm(garment_prod$log_actual ~ garment_prod$log_targeted), col="red")

Plot interpretation

The log-log plot also shows a positive linear trend, similar to the untransformed plot.
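A log-log transformation helps most when the underlying relationship is multiplicative, y ≈ a * x^b, because taking logs turns it into a straight line with slope b. The sketch below illustrates this on simulated data; the exponent 0.5, the scale 0.9 and the noise level are made-up values, not estimates from the garment data.

```r
set.seed(7)

# Simulated power-law relationship with multiplicative noise.
x <- runif(200, min = 0.1, max = 1)
y <- 0.9 * x^0.5 * exp(rnorm(200, sd = 0.05))

# Both variables must be strictly positive, since log(0) is -Inf.
stopifnot(all(x > 0), all(y > 0))

# On the log-log scale the model is log(y) = log(a) + b * log(x).
fit_loglog <- lm(log(y) ~ log(x))
slope <- coef(fit_loglog)[["log(x)"]]

# For comparison, a straight-line fit on the original scale.
fit_raw <- lm(y ~ x)

print(round(c(loglog_slope = slope,
              loglog_r2 = summary(fit_loglog)$r.squared,
              raw_r2 = summary(fit_raw)$r.squared), 3))
```

If the untransformed scatter already looks linear, as reported for the two productivity columns, the log-log fit is expected to look similar rather than better.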