Linear Regression for Garment Worker Productivity (Part 2)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

As we continue analyzing garment worker productivity, we need to improve our previous regression model by incorporating more variables. The goal is to make the model more accurate and better understand the factors that affect productivity. By adding more variables, we can identify deeper patterns and relationships in the data. Last week, we used a simple linear regression model with a single predictor. This week, we aim to:

1-Enhance model accuracy by adding 1-3 more variables that significantly impact productivity.

2-Explore interactions between variables to identify how factors like overtime and targeted productivity work together.

3-Check for multicollinearity to ensure each predictor contributes unique information to the model.

Refer to the simple linear regression model you built last week.

1-Include 1-3 more variables into your regression model.

2-Try out either an interaction term or a binary term to start.

3-Consider adding other integer or continuous variables. For each new variable you try, explain why you should include it, or not. E.g., are there any issues with multicollinearity?

To improve our regression model, we are adding three key variables that will help us better understand the factors affecting garment worker productivity.

The first variable is department_binary, which represents the department type, with 1 for Sewing and 0 for Finishing. Different departments may have different productivity levels due to variations in tasks and workflows. By including this binary variable, we can compare productivity between sewing and finishing teams. An unbalanced split between departments would not bias the coefficient by itself, but it would make the comparison less precise, so we check the group sizes when we create the variable.

The second variable is an interaction term, targeted_productivity * over_time, which measures how the effect of overtime on productivity changes across target levels. Some workers may handle overtime well when productivity targets are high, while others may struggle. This interaction term helps capture these differences. However, a product term is usually strongly correlated with its own components (here over_time and targeted_productivity), which can create multicollinearity issues, so we will check the Variance Inflation Factor (VIF).

The third variable is no_of_workers, which represents the number of workers in a team. Larger teams might improve productivity due to more manpower, but they could also face coordination challenges. If no_of_workers is strongly correlated with department_binary, it might affect how we interpret the model’s results.

data <- read.csv("C:/Users/rbada/Downloads/productivity+prediction+of+garment+employees/garments_worker_productivity.csv")

data$department <- tolower(trimws(data$department))

data$department[data$department == "sweing"] <- "sewing"

unique(data$department)

## [1] "sewing"    "finishing"

Before building a reliable regression model, we cleaned the data set to ensure accuracy and consistency. The main issue was in the department column: after trimming whitespace and converting to lower case, one spelling mistake remained, where “sweing” should have been “sewing”. After the fix, the column contains only the two expected values.

colSums(is.na(data))

##                  date               quarter            department 
##                     0                     0                     0 
##                   day                  team targeted_productivity 
##                     0                     0                     0 
##                   smv                   wip             over_time 
##                     0                   506                     0 
##             incentive             idle_time              idle_men 
##                     0                     0                     0 
##    no_of_style_change         no_of_workers   actual_productivity 
##                     0                     0                     0

We checked our selected variables (department_binary, targeted_productivity * over_time, and no_of_workers) and found no missing values in the columns they are built from, so no rows need to be dropped for this model. The only column with missing values is wip, with 506 missing entries. We decided not to include it: besides the missing data, our chosen variables already capture key factors affecting productivity, and adding wip could introduce multicollinearity and make interpretation more difficult.
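Before dropping a column with missing values, it can help to see whether the missingness is concentrated in one group. The sketch below cross-tabulates missingness against department on a small made-up data frame (toy_data and its values are hypothetical, not the garment data); on the real data the same table() call would take data$wip and data$department.

```r
# Hypothetical toy data: wip is missing only for the finishing rows
toy_data <- data.frame(
  department = c("sewing", "sewing", "finishing", "finishing", "sewing"),
  wip        = c(1108, 968, NA, NA, 1170)
)

# Rows: is wip missing? Columns: department
missing_by_dept <- table(wip_missing = is.na(toy_data$wip),
                         department  = toy_data$department)
print(missing_by_dept)
```

If all missing values fall in a single department, the column is not missing at random, which is a further reason to treat it with care.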

# Creating department_binary for Regression Analysis

data$department <- tolower(trimws(data$department))  
data$department_binary <- ifelse(data$department == "sewing", 1, 0)
table(data$department_binary, data$department)

##    
##     finishing sewing
##   0       506      0
##   1         0    691

To compare productivity between departments, we need to convert the categorical department variable into a binary format. This allows us to include it as a predictor in our regression model.
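As a side note, lm() can also take the department as a factor and build the 0/1 coding itself. The sketch below, on simulated data (the toy data frame and its numbers are assumptions for illustration), shows that the hand-made binary variable and the factor give the same slope, and that this slope is simply the difference in group means.

```r
set.seed(1)
# Simulated stand-in for the two departments (not the garment data)
toy <- data.frame(department = rep(c("finishing", "sewing"), each = 50))
toy$department_binary <- ifelse(toy$department == "sewing", 1, 0)
toy$y <- 0.75 - 0.05 * toy$department_binary + rnorm(100, sd = 0.1)

fit_binary <- lm(y ~ department_binary, data = toy)
fit_factor <- lm(y ~ factor(department), data = toy)  # "finishing" is the reference level

# Both slopes estimate the sewing-minus-finishing difference in means
slope_binary <- unname(coef(fit_binary)[2])
slope_factor <- unname(coef(fit_factor)[2])
mean_diff <- mean(toy$y[toy$department == "sewing"]) -
             mean(toy$y[toy$department == "finishing"])
print(c(slope_binary, slope_factor, mean_diff))
```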

# Create the interaction term

data$interaction_term <- data$targeted_productivity * data$over_time
head(data$interaction_term)

## [1] 5664  720 2928 2928 1536 5376

After generating the interaction term as the product of targeted_productivity and over_time, we examined the first few values. Each value combines the productivity target and the amount of overtime for one observation. A high value means that overtime is high, the target is high, or both; because over_time is recorded on a much larger scale than targeted_productivity, the product is driven mostly by overtime. A low value means that overtime is low, the target is low, or both. Including this product in the regression lets the effect of overtime on productivity differ depending on the target, which is what the interaction is meant to test.

cor(data$no_of_workers, data$department_binary)

## [1] 0.9393596

The high correlation (0.939) between no_of_workers and department_binary suggests multicollinearity, meaning both variables may be providing similar information. This could distort the regression model. To resolve this, we could remove one of the variables or check the Variance Inflation Factor (VIF) to confirm if multicollinearity is an issue. Reducing this redundancy will make the model more reliable and easier to interpret.
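The pairwise correlation already gives a lower bound on the VIF of these two variables: if the other variable were the only other predictor, the VIF would be 1 / (1 - r^2). A quick calculation with the correlation reported above gives roughly 8.5, and adding further predictors can only raise it.

```r
r <- 0.9393596             # correlation reported above
vif_pair <- 1 / (1 - r^2)  # VIF if the other variable were the only other predictor
print(round(vif_pair, 2))
```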

# Fit the expanded regression model
model_expanded <- lm(actual_productivity ~ department_binary + targeted_productivity * over_time + no_of_workers, data = data)

summary(model_expanded)

## 
## Call:
## lm(formula = actual_productivity ~ department_binary + targeted_productivity * 
##     over_time + no_of_workers, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53980 -0.04080  0.01266  0.08536  0.46779 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.102e-01  5.712e-02   7.182 1.21e-12 ***
## department_binary               -1.146e-01  2.658e-02  -4.312 1.76e-05 ***
## targeted_productivity            4.318e-01  7.648e-02   5.646 2.05e-08 ***
## over_time                       -4.532e-05  8.704e-06  -5.207 2.26e-07 ***
## no_of_workers                    2.166e-03  6.416e-04   3.377 0.000757 ***
## targeted_productivity:over_time  6.295e-05  1.190e-05   5.290 1.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1555 on 1191 degrees of freedom
## Multiple R-squared:  0.2092, Adjusted R-squared:  0.2059 
## F-statistic: 63.01 on 5 and 1191 DF,  p-value: < 2.2e-16

The regression results show that targeted productivity, overtime, team size, and department type are all statistically significant predictors of actual productivity. Holding the other predictors fixed, the Sewing department has lower productivity than Finishing by about 0.115, and larger teams have a small positive association with productivity (0.0022 per additional worker). Because the model includes the targeted_productivity:over_time interaction, the coefficients on targeted_productivity and over_time are not stand-alone effects: the over_time coefficient (-4.532e-05) is the overtime slope when the target is 0, and the slope rises by 6.295e-05 for each unit of target. The overtime slope is therefore negative at low targets and becomes slightly positive once the target exceeds roughly 0.72. Although the model is statistically significant, it explains only about 21% of the variation in productivity, so other factors also influence performance. These are associations in observational data, so they do not by themselves show that raising targets or cutting overtime would change productivity.
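With an interaction in the model, the overtime slope is not a single number: it equals the over_time coefficient plus the interaction coefficient times the target. The short calculation below uses the coefficients from the summary above; the two target levels (0.5 and 0.8) are chosen only for illustration.

```r
# Coefficients copied from summary(model_expanded) above
b_over_time   <- -4.532e-05
b_interaction <-  6.295e-05

# Slope of actual_productivity with respect to over_time at a given target
over_time_slope <- function(target) b_over_time + b_interaction * target

targets <- c(0.5, 0.8)   # illustrative target levels (an assumption for the example)
slopes  <- over_time_slope(targets)
print(setNames(slopes, targets))

# Target at which the overtime slope changes sign
break_even <- -b_over_time / b_interaction
print(round(break_even, 2))   # about 0.72
```

The slope is negative at a target of 0.5 and slightly positive at 0.8, and in both cases it is very small per unit of overtime.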

# check for multicollinearity

library(car)

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:boot':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

model_expanded <- lm(actual_productivity ~ department_binary + targeted_productivity * over_time + no_of_workers, data = data)
vif(model_expanded)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

##               department_binary           targeted_productivity 
##                        8.535105                        2.772676 
##                       over_time                   no_of_workers 
##                       42.028043                       10.033261 
## targeted_productivity:over_time 
##                       42.053406

The VIF values indicate multicollinearity, particularly for over_time and the interaction term targeted_productivity:over_time, both of which have VIFs above 42. Much of this is structural: a product term is naturally correlated with its components, and R’s message about higher-order terms warns that ordinary VIFs are hard to interpret in a model with interactions. The no_of_workers variable has a VIF of 10.03, which is at the commonly used threshold of 10, and department_binary is close behind at 8.54; this matches the 0.939 correlation between the two found earlier. To improve the model, we may consider removing or combining the highly correlated variables, or centering the predictors before forming the interaction. Reducing multicollinearity makes the coefficient estimates more stable and easier to interpret.
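Part of the inflation around an interaction is structural and usually shrinks when the predictors are centered before being multiplied. The sketch below illustrates this on simulated data; target and overtime are made-up stand-ins, and vif_manual is a hypothetical helper that computes 1 / (1 - R^2) in base R rather than calling car::vif.

```r
set.seed(42)
n <- 500
# Simulated stand-ins (assumptions, not the garment data)
target   <- runif(n, 0.35, 0.8)
overtime <- runif(n, 0, 10000)

# VIF of each column: 1 / (1 - R^2) from regressing it on the other columns
vif_manual <- function(X) {
  out <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  setNames(out, colnames(X))
}

raw <- cbind(target = target, overtime = overtime,
             product = target * overtime)

target_c   <- target - mean(target)
overtime_c <- overtime - mean(overtime)
centered <- cbind(target = target_c, overtime = overtime_c,
                  product = target_c * overtime_c)

vif_raw      <- vif_manual(raw)
vif_centered <- vif_manual(centered)
print(round(vif_raw, 2))
print(round(vif_centered, 2))
```

Centering leaves the interaction coefficient and the overall fit unchanged; it changes the meaning of the main-effect coefficients, which then describe the slope at the mean of the other variable.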

cor(data$actual_productivity, data$incentive, use = "complete.obs")

## [1] 0.07653763

cor(data$actual_productivity, data$idle_time, use = "complete.obs")

## [1] -0.08085081

cor(data$actual_productivity, data$smv, use = "complete.obs")

## [1] -0.1220888

cor(data$actual_productivity, data$idle_men, use = "complete.obs")

## [1] -0.1817343

cor(data$actual_productivity, data$no_of_style_change, use = "complete.obs")

## [1] -0.2073656

The correlation results show how different factors relate to actual productivity. No_of_style_change has the strongest negative correlation (-0.2074), which is consistent with style changes disrupting workflow. Idle_men (-0.1817) is also negatively correlated with productivity. Other variables, like SMV (-0.1221), idle_time (-0.0809), and incentive (0.0765), have weaker relationships. All of these correlations are small in absolute terms, and a correlation alone does not establish an effect. Since no_of_style_change and idle_men have the strongest relationships with productivity among these candidates, we consider adding them to the regression model. Incentive has a very weak correlation, so including it may not noticeably improve the model.

# Update the regression model with no_of_style_change and idle_men
model_updated <- lm(actual_productivity ~ targeted_productivity + interaction_term + 
                    no_of_style_change + idle_men, data = data)
summary(model_updated)

## 
## Call:
## lm(formula = actual_productivity ~ targeted_productivity + interaction_term + 
##     no_of_style_change + idle_men, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55443 -0.05370  0.00683  0.08814  0.51551 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.348e-01  3.489e-02   6.731 2.62e-11 ***
## targeted_productivity  6.967e-01  4.719e-02  14.762  < 2e-16 ***
## interaction_term       4.069e-07  1.844e-06   0.221    0.825    
## no_of_style_change    -4.327e-02  1.081e-02  -4.004 6.62e-05 ***
## idle_men              -7.812e-03  1.384e-03  -5.643 2.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.155 on 1192 degrees of freedom
## Multiple R-squared:  0.2137, Adjusted R-squared:  0.211 
## F-statistic: 80.98 on 4 and 1192 DF,  p-value: < 2.2e-16

library(car)
vif(model_updated)

## targeted_productivity      interaction_term    no_of_style_change 
##              1.062629              1.016606              1.064810 
##              idle_men 
##              1.019657

The model results show that targeted productivity has the strongest positive association with actual productivity. For every one-unit increase in targeted productivity, predicted actual productivity goes up by 0.6967; since targets are expressed as fractions, a 0.1 increase in the target corresponds to an increase of about 0.07. Style changes are negatively associated with productivity, with each additional change lowering predicted productivity by 0.04327, which is consistent with workflow disruptions reducing efficiency. For every additional idle worker, predicted productivity decreases by 0.007812. The interaction_term coefficient was not significant (p = 0.825). Note that this model includes the product targeted_productivity * over_time without over_time itself, so the result is not directly comparable to the interaction in the expanded model. The VIFs are all close to 1, so multicollinearity is not a concern here. The model explains 21.1% of the variation in productivity (Adjusted R-squared = 0.211), a small improvement over the expanded model (0.2059) that still leaves most of the variation unexplained. The model’s F-statistic (80.98) with p-value < 2.2e-16 indicates the model is statistically significant overall.

We added several variables to the model to understand their impact on actual productivity.

Targeted productivity has the strongest positive effect, meaning that as productivity targets increase, actual productivity improves. It is highly significant and doesn’t cause multicollinearity, so we should keep it in the model.

Style changes have a negative impact on productivity, as frequent style changes disrupt workflow. This variable is significant and does not suffer from multicollinearity, so it should also be kept in the model.

Idle workers are also associated with lower productivity. This variable is highly significant and doesn’t cause multicollinearity, so it should be kept in the model as well.

The interaction term between targeted productivity and overtime was removed because it was not statistically significant in the updated model (p = 0.825). Its VIF in that model was low (1.02), so the reason for dropping it is the lack of significance rather than multicollinearity; the very high VIFs (above 42) appeared only in the earlier expanded model, where over_time was included alongside the interaction. Removing this term eliminates unnecessary complexity. By selecting these variables and removing the interaction term, we obtain a model that is simpler to interpret and whose predictors all had VIFs close to 1 in the updated model.

# Create the simplified model with 3 terms: targeted_productivity, no_of_style_change, and idle_men

model_simplified <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data)
summary(model_simplified)

## 
## Call:
## lm(formula = actual_productivity ~ targeted_productivity + no_of_style_change + 
##     idle_men, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55554 -0.05361  0.00682  0.08715  0.51487 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.235229   0.034830   6.754 2.25e-11 ***
## targeted_productivity  0.697941   0.046816  14.908  < 2e-16 ***
## no_of_style_change    -0.043163   0.010793  -3.999 6.74e-05 ***
## idle_men              -0.007821   0.001383  -5.654 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1549 on 1193 degrees of freedom
## Multiple R-squared:  0.2137, Adjusted R-squared:  0.2117 
## F-statistic:   108 on 3 and 1193 DF,  p-value: < 2.2e-16

The regression model indicates that targeted productivity, style changes, and idle workers are important predictors of actual productivity. After removing the interaction term (between targeted_productivity and over_time), the model remains statistically significant and continues to explain about 21% of the variability in productivity (Multiple R-squared is 0.2137 in both models). This suggests that simplifying the model did not cost any explanatory power. Targeted productivity continues to have a strong positive association with productivity: for every 1-unit increase in targeted productivity, predicted actual productivity increases by 0.698. Each additional style change is associated with a decrease of 0.043 in productivity, and each extra idle worker with a reduction of 0.0078. These findings point to the importance of setting clear productivity targets, minimizing style changes, and reducing the number of idle workers.
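A formal way to check that dropping a term costs nothing is an F-test between the nested models, which anova() performs when given two lm fits. The sketch below runs on simulated data (sim, x1, x2 and extra are made-up names, and extra has no real effect by construction); with the report's objects the call would be anova(model_simplified, model_updated).

```r
set.seed(7)
n <- 300
# Simulated stand-ins (assumptions): `extra` has no real effect on y
sim <- data.frame(x1 = runif(n), x2 = rpois(n, 1), extra = rnorm(n))
sim$y <- 0.2 + 0.7 * sim$x1 - 0.04 * sim$x2 + rnorm(n, sd = 0.15)

fit_small <- lm(y ~ x1 + x2, data = sim)
fit_large <- lm(y ~ x1 + x2 + extra, data = sim)

# F-test of the null hypothesis that the extra term adds nothing
comparison <- anova(fit_small, fit_large)
print(comparison)
```

A large p-value in the second row means the data give no evidence that the larger model fits better.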

Evaluate this model

At the very least, use the 5 diagnostic plots discussed in class to identify any issues with your model.

For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.

Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.

# Residuals vs. Fitted Values

library(ggplot2)
model_simplified <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data)
ggplot(data = data, aes(x = fitted(model_simplified), y = residuals(model_simplified))) +
  geom_point() +
  geom_smooth(se = FALSE, color = "red") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme_minimal()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The Residuals vs. Fitted Values plot indicates that the residuals (errors) from the model are not evenly spread. As the fitted values increase, the spread of residuals also increases, suggesting heteroscedasticity. This violates the assumption of constant variance, meaning the model’s errors are not consistent across all predicted values. The severity of this issue seems moderate, as the spread of residuals increases noticeably but not drastically. As a result, the model might be less reliable for higher predictions, especially in the areas where the spread is wider. I’m moderately confident that this assumption is violated, given the clear increase in residual spread. To address this issue, we may need to transform the data or use robust standard errors that can handle heteroscedasticity.
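Robust standard errors keep the fitted coefficients and replace only their variance estimate. The sketch below computes the simplest version (White's HC0 sandwich estimator) by hand in base R on simulated heteroscedastic data; the data and variable names are assumptions for illustration. In practice, packages such as sandwich provide vcovHC() with additional small-sample variants.

```r
set.seed(3)
n <- 400
x <- runif(n, 0.3, 0.8)
# Simulated heteroscedastic data (an assumption): noise grows with x
y <- 0.2 + 0.7 * x + rnorm(n, sd = 0.05 + 0.3 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)

bread <- solve(crossprod(X))          # (X'X)^-1
meat  <- crossprod(X * e)             # X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread  # White (HC0) sandwich estimator

classical_se <- sqrt(diag(vcov(fit)))
robust_se    <- sqrt(diag(vcov_hc0))
print(rbind(classical_se, robust_se))
```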

# Residuals vs. $x$ values
library(ggplot2)
model_simplified <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data)

residuals <- residuals(model_simplified)
ggplot(data, aes(x = targeted_productivity, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE, color = "red") +
  labs(title = "Residuals vs Targeted Productivity")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.

ggplot(data, aes(x = no_of_style_change, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs Style Changes")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.

ggplot(data, aes(x = idle_men, y = residuals)) +
  geom_point() +
  geom_smooth(se = FALSE, color = "green") +
  labs(title = "Residuals vs Idle Men")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals vs. Targeted Productivity: The plot shows clustering in the residuals, which is expected because targeted_productivity takes only a few distinct values (the smoother failed for this reason, as the warning shows, so no trend line is drawn). The pattern suggests a non-linear relationship or heteroscedasticity (changing variance) between targeted and actual productivity. This indicates a possible violation of the assumption of constant variance, which could affect the model’s reliability, especially at higher values of targeted productivity. The issue is moderate in severity. I am moderately confident that transforming the variables (e.g., using logarithms) might improve the model.

Residuals vs. Style Changes: The residuals concentrate at low style changes, with a few outliers at higher values. The smoother also failed here because no_of_style_change has too few unique values, so the reading rests on the point spread alone. The pattern suggests that style changes may have a non-linear effect on productivity that the model does not fully capture. This could be a moderate issue, and the model might need non-linear adjustments (e.g., adding quadratic terms). I am moderately confident that this effect exists and should be explored further.

Residuals vs. Idle Men: The residuals appear to be evenly distributed, but there are some larger residuals indicating potential outliers. For the majority of data points, the assumption of constant variance is met. The outliers are a low-to-moderate issue, and their impact on the model should be investigated further. I am highly confident that, overall, the assumption of constant variance is largely met, but the outliers may require further attention.
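When a predictor has only a handful of distinct values, a smoother cannot be fitted, but the same question can be answered with a per-level summary of the residuals. The sketch below does this on simulated data (style_changes here is a made-up stand-in with three levels, not the real column).

```r
set.seed(11)
# Simulated stand-in (assumption): a predictor with only three distinct values
style_changes <- sample(0:2, 300, replace = TRUE, prob = c(0.8, 0.12, 0.08))
y <- 0.75 - 0.04 * style_changes + rnorm(300, sd = 0.1 + 0.05 * style_changes)
fit <- lm(y ~ style_changes)
res <- residuals(fit)

# Means near 0 at every level support linearity; similar SDs support constant variance
by_level <- data.frame(
  level = sort(unique(style_changes)),
  n     = as.vector(table(style_changes)),
  mean  = as.vector(tapply(res, style_changes, mean)),
  sd    = as.vector(tapply(res, style_changes, sd))
)
print(by_level)
```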

# Residual Histogram

library(ggplot2)
model_simplified <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data)

residuals <- residuals(model_simplified)
ggplot(data = data, aes(x = residuals)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "blue") +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency") +
  theme_minimal()

The histogram shows that most residuals are small and centered around zero, which is a good sign. However, the right tail is longer, suggesting positive skewness. This indicates that some residuals are larger than expected, which means that the errors are not perfectly normally distributed. This could slightly affect statistical tests, as normality is an important assumption. The issue is moderate in severity, as the distribution is not highly skewed but could affect model performance, especially in significance tests. I am moderately confident that applying a transformation (like a log transformation) might improve normality.
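The visual impression of skewness can be backed up with a number. The sketch below defines a small skewness helper (a hypothetical function, not part of base R) and applies it to two simulated residual vectors; on the report's model it would be called as skewness(residuals(model_simplified)).

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(5)
symmetric_res <- rnorm(1000)      # simulated symmetric residuals
right_skewed  <- rexp(1000) - 1   # simulated residuals with a long right tail

print(round(c(symmetric    = skewness(symmetric_res),
              right_skewed = skewness(right_skewed)), 2))
```

Values near 0 indicate a symmetric distribution, positive values a longer right tail, and negative values a longer left tail.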

# QQ-Plots

library(ggplot2)
model_simplified <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data)

ggplot(data = data, aes(sample = residuals(model_simplified))) +
  geom_qq() +
  geom_qq_line(color = "red") +
  labs(title = "Normal QQ Plot of Residuals")

The QQ-Plot shows that most of the residuals align with the reference line, indicating that they are approximately normally distributed in the middle. However, there is some deviation at the extremes, especially with outliers, suggesting non-normality in the residuals. This means that the assumption of normality is not fully met, which might affect the reliability of statistical tests, especially the significance of coefficients. The issue is moderate in severity, as the residuals are close to normal but still have some noticeable deviation. I am moderately confident that using a transformation (e.g., log) or robust methods could help address this issue.

# Cook's distance

data$cooks_d <- cooks.distance(model_simplified)
high_influence_points <- data %>%
  mutate(high_influence = cooks_d > 0.05) %>%
  filter(high_influence == TRUE)
print(high_influence_points)

##        date  quarter department      day team targeted_productivity   smv wip
## 1  2/7/2015 Quarter1     sewing Saturday    7                   0.7 24.26 658
## 2 2/17/2015 Quarter3     sewing  Tuesday    8                   0.6 29.40 179
##   over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1      6960         0       270       45                  0            58
## 2         0        23         5       30                  2            58
##   actual_productivity department_binary interaction_term    cooks_d
## 1           0.6622701                 1             4872 0.20043944
## 2           0.6009829                 1                0 0.06841388
##   high_influence
## 1           TRUE
## 2           TRUE

library(ggplot2)
library(ggrepel)

# Assuming model_simplified is already created
# Calculate Cook's D
cooks_d <- cooks.distance(model_simplified)

# Add Cook's D to the dataset
data$cooks_d <- cooks_d

# Identify the high influence points (those with Cook's D > 0.05)
data$high_influence <- cooks_d > 0.05

# Generate the Cook's D plot
ggplot(data, aes(x = 1:nrow(data), y = cooks_d)) +
  geom_point(aes(color = high_influence), size = 2) + 
  geom_text_repel(aes(label = ifelse(high_influence, as.character(row.names(data)), '')), 
                  color = 'darkred', max.overlaps = 50) +  # Adding labels to high influence points
  labs(title = "Cook's Distance Plot",
       x = "Observation Number",
       y = "Cook's Distance") +
  theme_minimal()

The Cook’s Distance plot highlights influential points in the dataset that can significantly affect the regression model. Two observations exceed the 0.05 cutoff used here and are labelled 819 and 651 in the plot: the row dated 2/7/2015 has a Cook’s D of 0.200 and the row dated 2/17/2015 has 0.068. These points are highlighted by color on the plot. Their actual productivity is close to target (0.662 against 0.7, and 0.601 against 0.6), but both have large idle_men values (45 and 30), so their influence most likely comes from their leverage on the idle_men coefficient. The issue is moderate in severity: the cutoff of 0.05 is a judgment call and both values are far below 1, but the larger one is about three times the second largest. I am highly confident that these points should be investigated further to determine whether they should be adjusted or removed.
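The 0.05 cutoff is one choice among several; two common rules of thumb are 4/n and 1. With n = 1197 observations, 4/n is about 0.0033, which would flag more points than the two shown here. The sketch below applies the 4/n rule to simulated data with one planted influential point (the data and the planted values are assumptions for illustration).

```r
set.seed(9)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
# Plant one high-leverage, off-trend point (an assumption for illustration)
x[n] <- 8
y[n] <- -10

fit <- lm(y ~ x)
cooks <- cooks.distance(fit)

cutoff_4n <- 4 / n                 # a common rule of thumb
flagged   <- which(cooks > cutoff_4n)
print(flagged)
print(round(max(cooks), 2))
```

Whichever cutoff is used, flagged points are candidates for inspection rather than automatic removal.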

# Identify the high influence rows (819 and 651)

high_influence_rows <- data[c(819, 651), ]
ggplot(data = slice(data, c(819, 651))) +
  geom_point(data = data,
             mapping = aes(x = targeted_productivity, y = actual_productivity)) +
  geom_point(mapping = aes(x = targeted_productivity, y = actual_productivity),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = targeted_productivity, 
                                y = actual_productivity,
                                label = rownames(high_influence_rows)),
             color = 'darkred') +
  labs(title="Investigating High Influence Points (Targeted Productivity vs Actual Productivity)",
       subtitle="Label = Row Numbers (819, 651)")

ggplot(data = slice(data, c(819, 651))) +
  geom_point(data = data,
             mapping = aes(x = no_of_style_change, y = actual_productivity)) +
  geom_point(mapping = aes(x = no_of_style_change, y = actual_productivity),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = no_of_style_change, 
                                y = actual_productivity,
                                label = rownames(high_influence_rows)),
             color = 'darkred') +
  labs(title="Investigating High Influence Points (Style Changes vs Actual Productivity)",
       subtitle="Label = Row Numbers (819, 651)")

ggplot(data = slice(data, c(819, 651))) +
  geom_point(data = data,
             mapping = aes(x = idle_men, y = actual_productivity)) +
  geom_point(mapping = aes(x = idle_men, y = actual_productivity),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = idle_men, 
                                y = actual_productivity,
                                label = rownames(high_influence_rows)),
             color = 'darkred') +
  labs(title="Investigating High Influence Points (Idle Men vs Actual Productivity)",
       subtitle="Label = Row Numbers (819, 651)")

The plots show where rows 819 and 651 sit relative to the rest of the data for each predictor. Based on the values printed earlier, the two points have ordinary targets (0.7 and 0.6) and 0 and 2 style changes, and they stand out mainly through their large idle_men values (45 and 30). It’s important to check these points further to determine whether they are data-entry errors or genuine observations. Adjusting or removing them may help improve the model’s accuracy.

# Fit the model without high influence points

data_without_high_influence <- data %>% filter(!row_number() %in% c(819, 651))
model_without_high_influence <- lm(actual_productivity ~ targeted_productivity + no_of_style_change + idle_men, data = data_without_high_influence)

summary(model_simplified)

## 
## Call:
## lm(formula = actual_productivity ~ targeted_productivity + no_of_style_change + 
##     idle_men, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55554 -0.05361  0.00682  0.08715  0.51487 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.235229   0.034830   6.754 2.25e-11 ***
## targeted_productivity  0.697941   0.046816  14.908  < 2e-16 ***
## no_of_style_change    -0.043163   0.010793  -3.999 6.74e-05 ***
## idle_men              -0.007821   0.001383  -5.654 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1549 on 1193 degrees of freedom
## Multiple R-squared:  0.2137, Adjusted R-squared:  0.2117 
## F-statistic:   108 on 3 and 1193 DF,  p-value: < 2.2e-16

summary(model_without_high_influence)

## 
## Call:
## lm(formula = actual_productivity ~ targeted_productivity + no_of_style_change + 
##     idle_men, data = data_without_high_influence)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55582 -0.05313  0.00654  0.08787  0.51488 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.234998   0.034736   6.765 2.08e-11 ***
## targeted_productivity  0.698576   0.046691  14.962  < 2e-16 ***
## no_of_style_change    -0.043520   0.010827  -4.020 6.19e-05 ***
## idle_men              -0.009927   0.001564  -6.347 3.12e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1545 on 1191 degrees of freedom
## Multiple R-squared:  0.2187, Adjusted R-squared:  0.2167 
## F-statistic: 111.1 on 3 and 1191 DF,  p-value: < 2.2e-16

Comparing the two models (one with the high influence points and one without) shows that removing rows 819 and 651 changes little for most of the model. The coefficients for targeted productivity (0.6979 vs 0.6986) and number of style changes (-0.0432 vs -0.0435) remain almost the same. The idle_men coefficient changes more, from -0.0078 to -0.0099, an increase of about 27% in magnitude, which shows that the two points had a noticeable effect on this variable. The R-squared values are very similar (0.2137 vs 0.2187), so the fit improves only marginally when the high influence points are excluded.
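The size of each change is easier to judge as a percentage of the original coefficient. The calculation below uses the estimates from the two summaries above.

```r
# Coefficients copied from the two summaries above
with_points    <- c(targeted_productivity = 0.697941,
                    no_of_style_change    = -0.043163,
                    idle_men              = -0.007821)
without_points <- c(targeted_productivity = 0.698576,
                    no_of_style_change    = -0.043520,
                    idle_men              = -0.009927)

# Change relative to the size of the original estimate, in percent
pct_change <- 100 * (without_points - with_points) / abs(with_points)
print(round(pct_change, 1))
```

A negative value means the coefficient moved in the negative direction; for idle_men the shift is roughly 27% of the original estimate, while the other two move by less than 1%.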

Conclusion:

The regression model, even with the inclusion of high influence points, identifies targeted productivity, style changes, and idle workers as significant predictors of actual productivity, although it explains only about 21% of the variation. Some diagnostic plots suggest potential issues with heteroscedasticity and non-linearity, and the residuals deviate from normality in the tails. These findings indicate that further investigation is needed to confirm the model’s assumptions. To improve the model, we could consider exploring data transformations or using more advanced methods to address the identified issues. Nonetheless, the model provides useful insights into the factors associated with productivity, and we can proceed with cautious confidence, while keeping in mind the need for further refinement.