Load the CSV file into the garment_prod variable.
garment_prod <- read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)
##      date             quarter           department            day
##  Length:1197        Length:1197        Length:1197        Length:1197
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##      team           targeted_productivity      smv             wip
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0
##                                                           NA's   :506
##    over_time       incentive         idle_time           idle_men
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000
##
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503
##  Median :0.0000     Median :34.00   Median :0.7733
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204
##
Binary columns are categorical variables that have only two possible categories or levels. These variables are also known as binary factors. The two categories are often coded as 0 and 1, where 0 represents one category and 1 represents the other.
Upon reviewing my dataset, I didn't find any column that is already binary and suitable for modelling. So, I’ll convert a useful column in my data into a binary column by classifying its values against a criterion.
Here, I’ll derive the binary column from over_time. This column records the amount of overtime worked for each record. It can be converted into a binary variable indicating whether any overtime was worked or not. Modelling overtime could help in understanding operational capacity and staffing needs. Criterion: 1 if over_time > 0, 0 otherwise.
garment_prod$overtime <- as.numeric(garment_prod$over_time > 0)
View(garment_prod)
In the above chunk, I have converted the “over_time” column to a binary variable called “overtime”, which is 1 if overtime was worked and 0 if not. For explanatory variables, I’ll consider “day” and “no_of_workers”.
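Before fitting a model, it can help to check how the two classes of the new binary column are distributed. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not taken from garment_prod); the same calls could be applied to garment_prod$overtime.

```r
# Hypothetical stand-in for the over_time column (values are illustrative only)
toy <- data.frame(over_time = c(0, 1440, 3960, 0, 6960, 25920))

# The comparison yields TRUE/FALSE, which as.numeric() turns into 1/0
toy$overtime <- as.numeric(toy$over_time > 0)

# Number of rows in each class (0 = no overtime, 1 = overtime)
overtime_counts <- table(toy$overtime)
print(overtime_counts)

# Share of rows with any overtime; the mean of a 0/1 column is a proportion
overtime_share <- mean(toy$overtime)
print(overtime_share)
```

If one class is very rare, the logistic regression coefficients tend to come with wide standard errors, so this check is worth doing before interpreting the model.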
model <- glm(overtime ~ day + no_of_workers, data = garment_prod, family = "binomial")
summary(model)
##
## Call:
## glm(formula = overtime ~ day + no_of_workers, family = "binomial",
## data = garment_prod)
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    3.237256   0.462947   6.993  2.7e-12 ***
## daySaturday    2.552170   1.043441   2.446   0.0144 *
## daySunday      2.645373   1.043206   2.536   0.0112 *
## dayThursday    1.529516   0.649467   2.355   0.0185 *
## dayTuesday     0.665567   0.480986   1.384   0.1664
## dayWednesday   0.857244   0.504951   1.698   0.0896 .
## no_of_workers -0.015159   0.008834  -1.716   0.0862 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 287.71 on 1196 degrees of freedom
## Residual deviance: 264.28 on 1190 degrees of freedom
## AIC: 278.28
##
## Number of Fisher Scoring iterations: 8
Based on the model output, the coefficients can be interpreted as follows:
(Intercept): The intercept (3.237) is the log-odds of overtime on the reference day when no_of_workers is 0. The reference day is the level that is missing from the coefficient table (Monday, assuming the data contain no Friday records, since R orders the levels alphabetically). A log-odds of 3.237 corresponds to a probability of about 0.96, although zero workers lies outside the observed range (the minimum is 2).
daySaturday: The coefficient for daySaturday is 2.552. On Saturdays, the log-odds of overtime are 2.552 higher than on the reference day, holding the number of workers constant. This is an odds ratio of about exp(2.552) ≈ 12.8, and the effect is significant at the 5% level (p = 0.014).
daySunday: The coefficient for daySunday is 2.645 (odds ratio ≈ 14.1, p = 0.011), so Sundays also have higher odds of overtime than the reference day. The estimate is nearly the same as for Saturday; with standard errors of about 1.04, the model does not distinguish between the two days.
dayThursday: The coefficient for dayThursday is 1.530 (odds ratio ≈ 4.6, p = 0.019). Thursdays have higher odds of overtime than the reference day. The point estimate is smaller than for Saturday and Sunday, but the intervals overlap considerably.
dayTuesday: The coefficient for dayTuesday (0.666, p = 0.166) is not statistically significant, so there is no evidence that Tuesdays differ from the reference day in overtime odds.
dayWednesday: The coefficient for dayWednesday (0.857, p = 0.090) is positive but significant only at the 10% level, so the evidence for higher overtime odds on Wednesdays is weak.
no_of_workers: The coefficient of -0.015 means that each additional worker multiplies the odds of overtime by about 0.985, holding the day constant. The direction suggests overtime is less likely with larger teams, but the effect is not significant at the 5% level (p = 0.086).
In summary, Saturdays, Sundays and Thursdays have significantly higher overtime odds than the reference day, while the negative association between overtime and the number of workers is only weakly supported by the data.
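Log-odds are hard to read directly, so it is common to exponentiate the coefficients into odds ratios and to convert the intercept into a probability. The sketch below does this on simulated data; the data frame `toy`, the object `toy_model` and the "true" coefficients used to simulate the outcome are assumptions made purely for illustration, not values from garment_prod.

```r
set.seed(1)
n <- 600

# Hypothetical data with the same column names as the model above
toy <- data.frame(
  day = sample(c("Monday", "Saturday", "Sunday"), n, replace = TRUE),
  no_of_workers = sample(2:89, n, replace = TRUE)
)

# Simulate a binary outcome from made-up coefficients on the log-odds scale
eta <- 0.5 + 1.0 * (toy$day == "Sunday") - 0.01 * toy$no_of_workers
toy$overtime <- rbinom(n, 1, plogis(eta))

toy_model <- glm(overtime ~ day + no_of_workers, data = toy, family = "binomial")

# Character predictors are treated as factors; the first level in
# alphabetical order becomes the reference category
ref_day <- levels(factor(toy$day))[1]
print(ref_day)

# exp() turns log-odds coefficients into odds ratios
odds_ratios <- exp(coef(toy_model))
print(round(odds_ratios, 3))

# plogis() is the inverse logit: the intercept becomes the predicted
# probability on the reference day with no_of_workers = 0
p_baseline <- plogis(coef(toy_model)[["(Intercept)"]])
print(round(p_baseline, 3))
```

An odds ratio above 1 means higher odds of overtime than on the reference day, and a value below 1 means lower odds.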
conf_int <- confint(model)
## Waiting for profiling to be done...
print(conf_int)
##                     2.5 %      97.5 %
## (Intercept)    2.39591586 4.223253060
## daySaturday    0.92203818 5.457323644
## daySunday      1.01596908 5.550301912
## dayThursday    0.37649355 3.015073706
## dayTuesday    -0.25175387 1.665182073
## dayWednesday  -0.09490365 1.922284750
## no_of_workers -0.03335261 0.001628879
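The output here contains the 95% confidence intervals for the coefficients in the logistic regression model, with the lower bound (2.5%) and upper bound (97.5%) for each. For a glm, confint() computes these by profiling the likelihood, which is why the message “Waiting for profiling to be done...” appears. The intervals give a range of values within which we can be reasonably confident that the true parameter values lie. The intervals for dayTuesday, dayWednesday and no_of_workers include zero, which agrees with their p-values above 0.05.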
coef <- coef(model)
se <- sqrt(diag(vcov(model)))
b_sun <- coef['daySunday']
se_sun <- se['daySunday']
ci <- b_sun + c(-1, 1) * qnorm(0.975) * se_sun
ci
## [1] 0.600726 4.690019
In the above chunk, the “daySunday” coefficient has been used. The output is the 95% Wald confidence interval (CI) for that coefficient, computed as the estimate plus or minus 1.96 standard errors.
Lower bound is approximately 0.6007.
Upper bound is approximately 4.6900.
With 95% confidence, the log-odds of “overtime” on a Sunday are estimated to be between approximately 0.6007 and 4.6900 higher than on the reference day. This interval differs from the one reported by confint() for daySunday (1.016 to 5.550) because confint() uses the profile likelihood rather than the normal approximation. If a CI does not include zero, the effect is statistically significant at the 0.05 level.
In this case, the CI does not include zero, so we can conclude that there is a statistically significant difference in the log-odds of “overtime” between Sunday and the reference day, with the log-odds on Sunday being higher.
The positive range indicates the true effect is likely a positive increase, but the interval is wide, so the size of the effect is uncertain. A likely reason is that records without overtime are rare, which leaves little information for estimating the contrast. With more such observations, we would expect the CI to narrow, giving greater precision on the overtime odds.
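The manual calculation above can be cross-checked against built-in functions. As a sketch on simulated data (the data frame `toy`, the model `toy_model` and the coefficients used for simulation are hypothetical), confint.default() returns Wald intervals, which should match the estimate ± 1.96 × SE formula, while confint() on a glm returns profile-likelihood intervals.

```r
set.seed(2)
n <- 400

# Hypothetical two-level day variable and a simulated binary outcome
toy <- data.frame(day = sample(c("Monday", "Sunday"), n, replace = TRUE))
toy$overtime <- rbinom(n, 1, plogis(0.3 + 0.8 * (toy$day == "Sunday")))

toy_model <- glm(overtime ~ day, data = toy, family = "binomial")

# Wald intervals: estimate +/- z * standard error
wald_ci <- confint.default(toy_model)

# Profile-likelihood intervals (the glm method of confint())
profile_ci <- suppressMessages(confint(toy_model))

# Manual Wald interval for the Sunday coefficient
b_sun <- coef(toy_model)[["daySunday"]]
se_sun <- sqrt(diag(vcov(toy_model)))[["daySunday"]]
manual_ci <- b_sun + c(-1, 1) * qnorm(0.975) * se_sun

print(wald_ci["daySunday", ])
print(manual_ci)
print(profile_ci["daySunday", ])
```

The two kinds of interval are usually close when the outcome is well balanced and tend to diverge when one class is rare, in which case the profile-likelihood interval is generally regarded as the more reliable of the two.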
Let’s consider “no_of_workers” as the explanatory variable and “actual_productivity” as the response. A plot can be drawn for the rows where “overtime” is 0.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Fit on the no-overtime rows only; bare column names are needed so that
# lm() looks the variables up in the filtered data
lm_fit <- lm(actual_productivity ~ no_of_workers,
             data = filter(garment_prod, overtime == 0))
rsquared <- summary(lm_fit)$r.squared
garment_prod |>
  filter(overtime == 0) |>
  ggplot(mapping = aes(x = no_of_workers,
                       y = actual_productivity)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Actual productivity vs no of workers",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The plot above shows the data points with no overtime as a scatter plot of actual productivity (actual_productivity) against the number of workers (no_of_workers), with two fitted lines: a linear regression line and a LOESS smoother. The R-squared value in the subtitle quantifies the goodness of fit of the linear model.
The grey dashed line represents the linear regression model fitted to the data. It is a visual representation of the best-fit linear relationship between the predictor variable (no_of_workers) and the response variable (actual_productivity) when overtime is equal to 0. In other words, it shows the estimated linear relationship between the number of workers and actual productivity for cases with no overtime.
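One way to confirm that a fit really uses only the no-overtime rows is to compare nobs() of the model with the number of rows in the subset. The sketch below uses a hypothetical simulated data frame (`toy`, with made-up values) and also illustrates a common pitfall: when the formula refers to columns with `$`, lm() takes the full columns from that data frame and the `data` argument has no effect.

```r
set.seed(3)

# Hypothetical data with the same column names as above
toy <- data.frame(
  no_of_workers = sample(2:89, 200, replace = TRUE),
  overtime = rbinom(200, 1, 0.7)
)
toy$actual_productivity <- 0.6 + 0.002 * toy$no_of_workers +
  rnorm(200, sd = 0.1)

# Rows with no overtime
no_ot <- subset(toy, overtime == 0)

# Bare column names: variables are looked up in `data`, so only the
# subset is used
fit_subset <- lm(actual_productivity ~ no_of_workers, data = no_ot)

# `toy$...` in the formula: the full columns are used and `data` is ignored
fit_dollar <- lm(toy$actual_productivity ~ toy$no_of_workers, data = no_ot)

print(nrow(no_ot))       # rows in the subset
print(nobs(fit_subset))  # same as the subset size
print(nobs(fit_dollar))  # size of the full data frame
```

Using bare column names together with `data =` keeps the fitted line and the reported R-squared consistent with the rows shown in the plot.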
The grey line is straight because it comes from a linear model, so its shape alone does not show that the relationship is linear. Its inclination indicates a non-zero slope, and in the plot above the slope is positive, suggesting a positive relationship.
The blue solid line is a second smoothed line for which no method was specified, so geom_smooth() falls back to its default, which here is LOESS (as the message in the output confirms). Unlike the grey line, it is not a linear fit: it is a local regression that follows the shape of the data, again only for the cases with no overtime.
The curves in the blue line suggest that the actual relationship between the variables may not be strictly linear, and that a simple linear model may not adequately capture the underlying relationship in the data.
Let’s now consider ‘targeted_productivity’ as the explanatory variable and ‘actual_productivity’ as the response, and check for linearity using a scatter plot.
# x is the predictor (targeted), y is the response (actual), matching the labels
plot(garment_prod$targeted_productivity, garment_prod$actual_productivity,
     xlab = "Targeted Productivity", ylab = "Actual Productivity")
abline(lm(actual_productivity ~ targeted_productivity, data = garment_prod), col = "red")
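The red line is the least-squares regression line of actual productivity on targeted productivity. It is straight by construction, so linearity has to be judged from how evenly the points scatter around it rather than from the shape of the line itself. The line meets the Y-axis somewhere between 0.2 and 0.4, which is the intercept, and it slopes upward, which indicates a positive association. All in all, the relationship between actual_productivity and targeted_productivity appears to be positive and roughly linear.
Although the plot suggests linearity, let’s also apply a log transformation to both variables.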
garment_prod$log_targeted <- log(garment_prod$targeted_productivity)
garment_prod$log_actual <- log(garment_prod$actual_productivity)
plot(garment_prod$log_targeted, garment_prod$log_actual,
xlab = "Log Targeted Productivity", ylab = "Log Actual Productivity")
abline(lm(garment_prod$log_actual ~ garment_prod$log_targeted), col="red")
The log-log plot shows a positive, roughly linear relationship as well, similar to the plot on the original scale.
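Beyond the visual check, the slope of a log-log fit has a convenient reading: it approximates the percentage change in the response for a 1% change in the predictor. The sketch below illustrates this on simulated data; the vectors `x` and `y` and the exponent 0.8 are hypothetical and stand in for targeted and actual productivity only for the purpose of the example.

```r
set.seed(4)

# Hypothetical positive predictor and a response following a power law
# with multiplicative noise: y = 1.1 * x^0.8 * error
x <- runif(300, min = 0.1, max = 0.8)
y <- 1.1 * x^0.8 * exp(rnorm(300, sd = 0.05))

# Taking logs of both sides makes the relationship linear:
# log(y) = log(1.1) + 0.8 * log(x) + noise
loglog_fit <- lm(log(y) ~ log(x))

# The slope estimates the exponent (the elasticity of y with respect to x)
slope <- coef(loglog_fit)[["log(x)"]]
print(round(slope, 3))

# Scatter plot on the log scale with the fitted line
plot(log(x), log(y), xlab = "log(x)", ylab = "log(y)")
abline(loglog_fit, col = "red")
```

Both variables must be strictly positive for the log transformation to be defined, which holds for the two productivity columns according to the summary output.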