library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
data <- read.csv("C:/Users/rbada/Downloads/productivity+prediction+of+garment+employees/garments_worker_productivity.csv")

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable

Actual productivity is the key measure of how efficiently workers perform in the garment factory. It shows how much work they get done compared to what was expected. This is crucial for managers because it helps them see where things are going well and where they need to improve to boost production and reduce costs. It’s the main number that everyone in the business cares about because it directly affects the company’s success.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable

1-Choosing “department” as a categorical variable for analyzing actual productivity in a garment factory provides a structured way to examine how different factors influence operational efficiency. Each department, such as sewing and finishing, has unique operational processes and faces distinct challenges that significantly impact productivity. For instance, the sewing department might work under different pressures and at a different pace compared to the finishing department. Additionally, the way resources are allocated and managed can vary widely between departments, directly affecting the efficiency of task completion. By understanding these variances, it becomes possible to pinpoint specific areas within each department that may benefit from targeted improvements or adjustments, ultimately enhancing overall productivity.

2-Choosing “day of the week” as a categorical variable for analyzing productivity in a garment factory helps understand how work output varies across different days. This analysis provides valuable insights into weekly production cycles and human factors that influence productivity, such as energy levels and motivation. With this information, managers can optimize work schedules and strategically plan the most demanding tasks for the days when productivity typically peaks, ensuring efficiency and effectiveness in operations.

3-Choosing “quarter” as a categorical variable for analyzing productivity in a garment factory helps identify recurring patterns in productivity across periods. Note that this column has five levels (Quarter1 to Quarter5), so it cannot denote calendar quarters; it more plausibly marks a portion of each month. By understanding how productivity fluctuates across these periods, managers can better align production schedules and resource allocation with the production cycle.

4-Choosing “team” as a categorical variable for analyzing productivity in a garment factory helps understand how different teams contribute to overall productivity. This analysis can highlight variations in performance between teams, possibly due to differences in skills, experience, or management styles. By identifying which teams are performing well and which are underperforming, managers can tailor training programs, redistribute resources, or implement best practices across teams to boost productivity.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results (e.g., use box plots). Be clear about how the R output relates to your conclusions.

For the ANOVA test, we will use team as the categorical variable to explore how different teams within the garment factory vary in terms of actual productivity. This approach will help us determine if there are significant differences in productivity across teams, indicating which teams might need more support or changes in strategy to enhance their performance.

Null Hypothesis (H₀): The mean actual productivity is the same for all teams.

Alternative Hypothesis (H₁): At least one team’s mean actual productivity differs from the others.

Prepare the Data(Team)

data$team <- as.factor(data$team) 
str(data$team) 
##  Factor w/ 12 levels "1","2","3","4",..: 8 1 11 12 6 7 2 3 2 1 ...
data$actual_productivity <- as.numeric(data$actual_productivity)  
str(data$actual_productivity)
##  num [1:1197] 0.941 0.886 0.801 0.801 0.8 ...
sum(is.na(data$team))  
## [1] 0
sum(is.na(data$actual_productivity))  
## [1] 0
summary(data$actual_productivity) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2337  0.6503  0.7733  0.7351  0.8503  1.1204

Before running ANOVA, we ensured the data set was clean and properly formatted:

1- Converted team to a Factor – So R treats it as a categorical variable for group comparison.

2- Checked actual_productivity is Numeric – Ensures ANOVA can correctly compare productivity across teams.

3- Checked for Missing Data – Found no missing values in team or actual_productivity, so we didn’t need to remove or replace any data.

4- Checked for Negative Values – All productivity values were positive (the minimum is 0.2337), meaning the data was valid.

Check ANOVA Assumptions

Before running ANOVA, we need to check a few assumptions. First, we check if the data is normally distributed within each team using histograms and the Shapiro-Wilk test. Next, we check if the variance (spread of the data) is equal across teams using Levene’s test. If the variances are unequal, we would need to use Welch’s ANOVA instead of standard ANOVA. Finally, we need to confirm that the data is independent, meaning the productivity of one team is not influenced by another. This is usually ensured during data collection. If these assumptions hold, we can proceed with ANOVA; otherwise, we adjust our approach.

Independence(Team):

sum(duplicated(data))  
## [1] 0
team_date_counts <- data %>%
  group_by(team, date) %>%
  summarise(count = n()) %>%
  filter(count > 1)
## `summarise()` has grouped output by 'team'. You can override using the
## `.groups` argument.
team_date_counts
## # A tibble: 496 × 3
## # Groups:   team [12]
##    team  date      count
##    <fct> <chr>     <int>
##  1 1     1/1/2015      2
##  2 1     1/10/2015     2
##  3 1     1/11/2015     2
##  4 1     1/12/2015     2
##  5 1     1/13/2015     2
##  6 1     1/14/2015     2
##  7 1     1/15/2015     2
##  8 1     1/17/2015     2
##  9 1     1/21/2015     2
## 10 1     1/22/2015     2
## # ℹ 486 more rows

After reviewing the data set, we found no duplicate rows (sum(duplicated(data)) returns 0), so no record was entered twice. Unique rows alone do not establish independence, however. The team_date_counts table shows 496 team and date combinations with more than one row, and teams are observed repeatedly across many dates, so observations from the same team are likely to be correlated over time. We therefore treat independence as a working assumption rather than a verified fact, and the ANOVA p-values below should be read with that caveat.

Normality(Team):

library(ggplot2)

ggplot(data, aes(x = actual_productivity)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.5) +
  facet_wrap(~team) +
  theme_minimal() +
  labs(title = "Distribution of Actual Productivity by Team")

shapiro.test(data$actual_productivity)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$actual_productivity
## W = 0.94394, p-value < 2.2e-16

The histograms of actual_productivity show skewed distributions for most teams, with a longer tail on one side rather than a symmetric bell shape. The Shapiro-Wilk test above was run on the pooled actual_productivity column rather than within each team; its p-value (< 2.2e-16) indicates that the pooled data is not normally distributed, which is consistent with the histograms but is not a per-team test. Strictly, the ANOVA normality assumption concerns the values within each group (equivalently, the model residuals), so a per-team check is the more direct evidence.
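A per-team version of the normality check could look like the following minimal sketch. It uses a small simulated data frame (`sim`, with hypothetical values) in place of the garment data, so the numbers it prints say nothing about the real teams; only the pattern of splitting by team and testing each subset is the point.

```r
# Simulated stand-in for the garment data (hypothetical values, not the real file)
set.seed(1)
sim <- data.frame(
  team = factor(rep(1:3, each = 40)),
  actual_productivity = c(rnorm(40, 0.75, 0.10),
                          rbeta(40, 8, 2),
                          rnorm(40, 0.70, 0.15))
)

# Run Shapiro-Wilk separately within each team and collect W and the p-value
shapiro_by_team <- do.call(rbind, lapply(split(sim, sim$team), function(d) {
  res <- shapiro.test(d$actual_productivity)
  data.frame(team = d$team[1], n = nrow(d),
             W = unname(res$statistic), p_value = res$p.value)
}))
print(shapiro_by_team)
```

With the real data, `sim` would be replaced by `data`; a small p-value for a team would suggest non-normality within that team.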

Homoscedasticity(Team):

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
levene_test_result <- leveneTest(actual_productivity ~ team, data = data)
levene_test_result
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group   11  3.6719 3.768e-05 ***
##       1185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result of Levene’s Test indicates that the variances are unequal across teams. The p-value is 3.768e-05, which is much smaller than 0.05, so we reject the null hypothesis that the variances are equal. This means the assumption of homoscedasticity is violated.

To summarize the assumption checks: independence is assumed from the way the data was collected; normality appears to be violated, based on the histograms and the Shapiro-Wilk test; and equal variances are rejected by Levene’s Test. Welch’s ANOVA is therefore the more appropriate test, because it does not assume equal variances. It does not by itself address non-normality, although with 1,197 observations the F-test is reasonably robust to moderate departures from normality; a rank-based test such as Kruskal-Wallis is an alternative if that is a concern.
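The two alternatives named in the assumption summary can be sketched as follows. This is an illustration on simulated data (`sim` is hypothetical, with deliberately unequal spreads), not a rerun of the team analysis; `oneway.test()` and `kruskal.test()` are both in base R's stats package.

```r
# Simulated groups with unequal variances (hypothetical values)
set.seed(2)
sim <- data.frame(
  team = factor(rep(c("A", "B", "C"), each = 50)),
  actual_productivity = c(rnorm(50, 0.80, 0.05),
                          rnorm(50, 0.72, 0.15),
                          rnorm(50, 0.70, 0.25))
)

# Welch's one-way test: does not assume equal variances across groups
welch_team <- oneway.test(actual_productivity ~ team, data = sim, var.equal = FALSE)
print(welch_team)

# Kruskal-Wallis: rank-based alternative that does not assume normality
kw_team <- kruskal.test(actual_productivity ~ team, data = sim)
print(kw_team)
```

On the real data, the same two calls with `data = data` would give a check that does not depend on the equal-variance assumption.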

ANOVA Test(Team)

anova_model <- aov(actual_productivity ~ team, data = data)
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## team          11   3.15 0.28668   10.21 <2e-16 ***
## Residuals   1185  33.26 0.02807                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results indicate a significant difference in mean productivity across the teams. The F-statistic of 10.21 means that the variation between team means is about ten times the variation expected from within-team noise alone. The p-value is < 2e-16, far below the significance level of 0.05, so we reject the null hypothesis and conclude that at least one team’s mean productivity differs from the others. Note that this is the standard ANOVA from aov(), which assumes equal variances; since Levene’s Test rejected that assumption, the result should be confirmed with Welch’s test. The ANOVA also does not tell us which teams differ.
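Statistical significance does not say how large the team effect is. One common effect-size measure, eta squared, can be computed directly from the sums of squares printed in the table above; the short calculation below is a worked illustration using those two rounded numbers.

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_team  <- 3.15
ss_resid <- 33.26

# Eta squared: share of the total variation in productivity attributable to team
eta_sq <- ss_team / (ss_team + ss_resid)
round(eta_sq, 3)
```

This gives roughly 0.087, so team membership accounts for about 9% of the variation in actual productivity; the difference is significant, but most of the variation lies within teams.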

ggplot(data, aes(x = factor(team), y = actual_productivity, fill = factor(team))) +
  geom_boxplot() +
  labs(title = "Productivity Across Teams", x = "Team", y = "Actual Productivity") +
  scale_fill_brewer(palette = "Set3") +  
  theme_minimal()

The box plot shows that productivity varies across teams, with some teams having higher median productivity than others. Outliers indicate extreme performance within teams. The spread of the boxes reveals variability, with some teams being more consistent in productivity than others. This is consistent with the ANOVA results, which showed significant differences in mean productivity between teams.

Prepare the Data(Department):

data$department <- gsub("sweing", "sewing", data$department)  
data$department <- trimws(data$department)  
unique(data$department)
## [1] "sewing"    "finishing"

Prepare the Data(Day,Quarter):

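# na.omit() drops every row containing any NA (506 rows here, leaving 691).
# Note: the tests below are run on `data`, not on data_clean.
data_clean <- na.omit(data)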
data_clean$day <- as.factor(data_clean$day)
data_clean$quarter <- as.factor(data_clean$quarter)
str(data_clean)
## 'data.frame':    691 obs. of  15 variables:
##  $ date                 : chr  "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
##  $ quarter              : Factor w/ 5 levels "Quarter1","Quarter2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ department           : chr  "sewing" "sewing" "sewing" "sewing" ...
##  $ day                  : Factor w/ 6 levels "Monday","Saturday",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ team                 : Factor w/ 12 levels "1","2","3","4",..: 8 11 12 6 7 3 2 1 9 10 ...
##  $ targeted_productivity: num  0.8 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.7 0.75 ...
##  $ smv                  : num  26.2 11.4 11.4 25.9 25.9 ...
##  $ wip                  : int  1108 968 968 1170 984 795 733 681 872 578 ...
##  $ over_time            : int  7080 3660 3660 1920 6720 6900 6000 6900 6900 6480 ...
##  $ incentive            : int  98 50 50 50 38 45 34 45 44 45 ...
##  $ idle_time            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idle_men             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_style_change   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_workers        : num  59 30.5 30.5 56 56 57.5 55 57.5 57.5 54 ...
##  $ actual_productivity  : num  0.941 0.801 0.801 0.8 0.8 ...
##  - attr(*, "na.action")= 'omit' Named int [1:506] 2 7 14 15 16 17 19 20 21 22 ...
##   ..- attr(*, "names")= chr [1:506] "2" "7" "14" "15" ...

Independence(Department,Days,Quarter):

The independence assumption for ANOVA concerns the observations: the productivity recorded in one row should not depend on the productivity recorded in another. We assume this holds for the department, day of the week, and quarter analyses. This is an assumption about how the data was collected and cannot be fully verified from the data; because the same teams and departments are recorded on many dates, some dependence between rows is plausible, and the results below should be interpreted with that in mind.

Normality(Department,Days,Quarter)

shapiro.test(data$actual_productivity[data$department %in% unique(data$department)])
## 
##  Shapiro-Wilk normality test
## 
## data:  data$actual_productivity[data$department %in% unique(data$department)]
## W = 0.94394, p-value < 2.2e-16
shapiro.test(data$actual_productivity[data$day %in% unique(data$day)])
## 
##  Shapiro-Wilk normality test
## 
## data:  data$actual_productivity[data$day %in% unique(data$day)]
## W = 0.94394, p-value < 2.2e-16
shapiro.test(data$actual_productivity[data$quarter %in% unique(data$quarter)])
## 
##  Shapiro-Wilk normality test
## 
## data:  data$actual_productivity[data$quarter %in% unique(data$quarter)]
## W = 0.94394, p-value < 2.2e-16

In each call above, the condition column %in% unique(column) is TRUE for every row, so all three tests are run on the same pooled actual_productivity values. This is why they return the identical result (W = 0.94394, p-value < 2.2e-16), which matches the pooled test from the team analysis. The result shows that the pooled response is not normally distributed, but it does not test normality within each department, day, or quarter. While ANOVA can still be reliable with large sample sizes despite non-normality, it is advisable to consider data transformations or non-parametric tests if necessary.

After checking for normality, the next assumption to test is homogeneity of variance using Levene’s test. If Levene’s test shows no significant difference in variances (p-value > 0.05), we can proceed with standard ANOVA. If the variances are unequal, other approaches, such as Welch’s ANOVA or a data transformation, may be needed.
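Since the ANOVA normality assumption is about the model residuals, one way to check it for a given factor is to fit the model first and then inspect the residuals. The sketch below does this for a two-level department factor on simulated data (`sim` is hypothetical); the same pattern would apply to day or quarter.

```r
# Simulated two-department data (hypothetical values)
set.seed(3)
sim <- data.frame(
  department = factor(rep(c("sewing", "finishing"), times = c(60, 40))),
  actual_productivity = c(rnorm(60, 0.72, 0.10), rnorm(40, 0.75, 0.18))
)

fit <- aov(actual_productivity ~ department, data = sim)
res <- residuals(fit)

# Q-Q plot of residuals: points close to the line suggest approximate normality
qqnorm(res, main = "Q-Q plot of ANOVA residuals")
qqline(res)

# Shapiro-Wilk on the residuals rather than on the raw pooled response
shapiro_res <- shapiro.test(res)
print(shapiro_res)
```

Because the residuals have each group's mean removed, this check is not confounded by differences between group means, unlike a test on the pooled response.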

ggplot(data, aes(x = actual_productivity, fill = department)) +
  geom_histogram(binwidth = 0.05, position = "stack", alpha = 0.6) +
  labs(title = "Distribution of Actual Productivity by Department", x = "Actual Productivity", y = "Frequency") +
  scale_fill_brewer(palette = "Set2")

ggplot(data, aes(x = actual_productivity, fill = day)) +
  geom_histogram(binwidth = 0.05, position = "stack", alpha = 0.6) +
  labs(title = "Distribution of Actual Productivity by Day of the Week", x = "Actual Productivity", y = "Frequency") +
  scale_fill_brewer(palette = "Set2")

ggplot(data, aes(x = actual_productivity, fill = quarter)) +
  geom_histogram(binwidth = 0.05, position = "stack", alpha = 0.6) +
  labs(title = "Distribution of Actual Productivity by Quarter", x = "Actual Productivity", y = "Frequency") +
  scale_fill_brewer(palette = "Set2")

The plots above are stacked histograms, not box plots: the x-axis is actual productivity and the y-axis is the number of observations, so the height of a coloured segment reflects how many rows fall in that bin for that group, not how productive the group is. They are useful for seeing the overall shape of the distribution and the relative size of each group. For departments, the two distributions overlap heavily. For days of the week, the segments look similar across the productivity range, suggesting relatively stable performance throughout the week. For quarters, a taller segment means that quarter contributes more observations; differences in productivity level between quarters are better judged from the box plots and ANOVA later in this section.

Homoscedasticity(Department,Days,Quarter):

library(car)
leveneTest(actual_productivity ~ department, data = data)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  42.142 1.241e-10 ***
##       1195                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(actual_productivity ~ day, data = data)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    5  0.4285  0.829
##       1191
leveneTest(actual_productivity ~ quarter, data = data)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.6876 0.1505
##       1192

The results from Levene’s Test for Homogeneity of Variance show that for the department, the p-value is 1.241e-10, which is less than 0.05, indicating that the variances differ significantly between departments. Therefore, the homogeneity of variance assumption is violated, and Welch’s ANOVA should be used instead of standard ANOVA. For the day of the week and quarter, the p-values are 0.829 and 0.1505, respectively, both greater than 0.05, so there is no evidence that the variances differ across groups. The homogeneity of variance assumption is treated as satisfied for these variables, and standard ANOVA can be applied.

ANOVA Test(Department)

welch_anova_department <- oneway.test(actual_productivity ~ department, data = data)

welch_anova_department
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  actual_productivity and department
## F = 8.593, num df = 1.00, denom df = 926.17, p-value = 0.003458

The Welch’s ANOVA results for the department variable show an F-value of 8.593, with 1 numerator degree of freedom and 926.17 denominator degrees of freedom. The p-value is 0.003458, which is less than 0.05, so we reject the null hypothesis and conclude that mean productivity differs between the sewing and finishing departments. Because there are only two departments, no post-hoc test is needed to identify which groups differ, and the test is equivalent to a Welch two-sample t-test. The ANOVA itself does not show the direction of the difference; that comes from comparing the group means or the box plot.
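The equivalence between a two-group Welch ANOVA and a Welch t-test can be illustrated with a short sketch on simulated data (`sim` is hypothetical). With two groups, the F statistic equals the square of the t statistic, and the p-values match.

```r
# Simulated two-department data (hypothetical values)
set.seed(4)
sim <- data.frame(
  department = factor(rep(c("sewing", "finishing"), times = c(70, 50))),
  actual_productivity = c(rnorm(70, 0.72, 0.15), rnorm(50, 0.77, 0.20))
)

welch_f <- oneway.test(actual_productivity ~ department, data = sim, var.equal = FALSE)
welch_t <- t.test(actual_productivity ~ department, data = sim, var.equal = FALSE)

# With two groups, the Welch F statistic is the square of the Welch t statistic
c(F = unname(welch_f$statistic), t_squared = unname(welch_t$statistic)^2)

# Group means show the direction of the difference
tapply(sim$actual_productivity, sim$department, mean)
```

The t-test output has the practical advantage of also reporting a confidence interval for the difference in means.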

ggplot(data, aes(x = department, y = actual_productivity, fill = department)) +
  geom_boxplot() +
  labs(title = "Productivity by Department", x = "Department", y = "Actual Productivity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  scale_fill_brewer(palette = "Set3")  

The boxplot above shows the distribution of actual productivity for the two departments, finishing and sewing. The plot indicates that productivity is generally higher in the finishing department than in the sewing department. The sewing department has a wider range of values and several outliers, which suggests more variability in productivity, while the finishing department has a more consistent productivity level. This insight could help managers identify areas for improvement, especially in the sewing department, where variability may be addressed by optimizing workflows, improving team communication, or investigating the causes of the outliers.

ANOVA Test(Day)

anova_day <- aov(actual_productivity ~ day, data = data)
summary(anova_day)
##               Df Sum Sq Mean Sq F value Pr(>F)
## day            5   0.11 0.02171   0.712  0.614
## Residuals   1191  36.30 0.03048

The ANOVA results for day of the week show that the p-value is 0.614, which is greater than the significance level of 0.05. This indicates that there is no significant difference in mean productivity across the different days of the week. The F-value of 0.712 shows that the variability between the days is small compared to the within-group variability. Therefore, we fail to reject the null hypothesis: the data provides no evidence that the day of the week affects mean productivity.

ggplot(data, aes(x = day, y = actual_productivity, fill = day)) +
  geom_boxplot() +
  labs(title = "Productivity by Day of the Week", x = "Day", y = "Actual Productivity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  scale_fill_brewer(palette = "Set3")  

The boxplot for productivity by day of the week shows the distribution of productivity for each day. Each box spans the interquartile range (IQR), which contains the middle 50% of the data, and the line inside the box marks the median. Consistent with the ANOVA result (p = 0.614) and Levene’s test (p = 0.829), neither the centre nor the spread of productivity differs significantly between days. The plot remains useful for spotting outliers on individual days.

ANOVA Test(Quarter)

anova_quarter <- aov(actual_productivity ~ quarter, data = data)
summary(anova_quarter)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quarter        4   0.85 0.21219   7.112 1.17e-05 ***
## Residuals   1192  35.56 0.02984                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results for “Quarter” show a p-value of 1.17e-05, well below 0.05, indicating differences in mean productivity across quarters. The F-value of 7.112 means the variation between quarter means is about seven times what within-quarter variation alone would produce. Therefore, we reject the null hypothesis and conclude that productivity differs significantly by quarter. A post-hoc comparison such as Tukey’s HSD could reveal which quarters differ.
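A significant ANOVA says only that some quarter differs. A post-hoc comparison can be sketched as below using `TukeyHSD()` from base R on an `aov` fit; the data frame `sim` is simulated and hypothetical, so the printed differences are not results for the garment data.

```r
# Simulated data with four quarter levels (hypothetical values)
set.seed(5)
sim <- data.frame(
  quarter = factor(rep(paste0("Quarter", 1:4), each = 30)),
  actual_productivity = c(rnorm(30, 0.75, 0.1), rnorm(30, 0.74, 0.1),
                          rnorm(30, 0.70, 0.1), rnorm(30, 0.82, 0.1))
)

fit <- aov(actual_productivity ~ quarter, data = sim)

# Tukey's HSD compares every pair of quarters while controlling
# the family-wise error rate
tukey <- TukeyHSD(fit, "quarter", conf.level = 0.95)
print(tukey)
```

Each row of the output is one pair of quarters, with the estimated difference in means, a confidence interval, and an adjusted p-value. Applied to the real model, the call would be `TukeyHSD(anova_quarter)`, provided `quarter` has been converted to a factor first.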

library(ggplot2)

ggplot(data, aes(x = quarter, y = actual_productivity, fill = quarter)) +
  geom_boxplot() +
  labs(title = "Productivity by Quarter", x = "Quarter", y = "Actual Productivity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability

The boxplot for productivity by quarter reveals some variation in productivity across different quarters. Productivity in Quarter 5 appears to be higher compared to other quarters, while Quarters 1 and 4 show lower productivity levels. The spread of the data, as indicated by the interquartile ranges, is relatively consistent across quarters, although there are some outliers in each quarter. This suggests that while there are noticeable differences in productivity, further analysis is needed to pinpoint the specific factors driving these variations across quarters.

Conducting ANOVA for team, department, day of the week, and quarter helps identify which factors are associated with differences in productivity. In this analysis, team, department, and quarter showed significant differences, while day of the week did not. This allows the business to target specific areas for improvement, such as supporting lower-performing teams and addressing department-specific challenges, while there is no evidence here to justify rescheduling work by day of the week. The quarter variable has five levels, so it does not represent calendar quarters, and its differences should not be read as seasonal effects over a year without checking how the variable is defined.

If there are more than 10 categories, consolidate them before running the test using the methods we’ve learned in class.

Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [—-], so it would be safe to assume that we can [——]”.

We have 12 teams, which is more than 10 categories. The code below assigns each observation (row), rather than each team, to one of 3 categories based on its actual productivity:

1- High Productivity: Observations with productivity greater than 0.8.

2- Medium Productivity: Observations with productivity between 0.5 and 0.8.

3- Low Productivity: Observations with productivity less than 0.5.

Because these categories are defined from the response variable itself, they do not consolidate the team factor, and an ANOVA of actual productivity on them is expected to be significant by construction.

data <- data |>
  mutate(productivity_group = case_when(
    actual_productivity > 0.8 ~ "High",  
    actual_productivity >= 0.5 & actual_productivity <= 0.8 ~ "Medium",  
    actual_productivity < 0.5 ~ "Low"  
  ))
head(data)
##       date  quarter department      day team targeted_productivity   smv  wip
## 1 1/1/2015 Quarter1     sewing Thursday    8                  0.80 26.16 1108
## 2 1/1/2015 Quarter1  finishing Thursday    1                  0.75  3.94   NA
## 3 1/1/2015 Quarter1     sewing Thursday   11                  0.80 11.41  968
## 4 1/1/2015 Quarter1     sewing Thursday   12                  0.80 11.41  968
## 5 1/1/2015 Quarter1     sewing Thursday    6                  0.80 25.90 1170
## 6 1/1/2015 Quarter1     sewing Thursday    7                  0.80 25.90  984
##   over_time incentive idle_time idle_men no_of_style_change no_of_workers
## 1      7080        98         0        0                  0          59.0
## 2       960         0         0        0                  0           8.0
## 3      3660        50         0        0                  0          30.5
## 4      3660        50         0        0                  0          30.5
## 5      1920        50         0        0                  0          56.0
## 6      6720        38         0        0                  0          56.0
##   actual_productivity productivity_group
## 1           0.9407254               High
## 2           0.8865000               High
## 3           0.8005705               High
## 4           0.8005705               High
## 5           0.8003819               High
## 6           0.8001250               High
data_grouped <- data |> group_by(productivity_group)

anova_model <- aov(actual_productivity ~ productivity_group, data = data_grouped)
summary(anova_model)
##                      Df Sum Sq Mean Sq F value Pr(>F)    
## productivity_group    2 29.647  14.824    2616 <2e-16 ***
## Residuals          1194  6.766   0.006                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the result, the F-value of 2616 indicates that the variation in productivity between the High, Medium, and Low groups is much greater than the variation within the groups, and the p-value of < 2e-16 leads us to reject the null hypothesis. However, this outcome is circular: productivity_group was created by cutting actual_productivity at 0.5 and 0.8, so the group means must differ and the within-group variation must be small. The test therefore says nothing about teams. (The group_by() call also has no effect on aov().) A meaningful consolidation would assign whole teams to groups and then test whether mean productivity differs between those groups.

ggplot(data, aes(x = productivity_group, y = actual_productivity, fill = productivity_group)) +
  geom_boxplot() +
  labs(title = "Productivity Across Teams by Category", 
       x = "Productivity Group", 
       y = "Actual Productivity") +
  scale_fill_brewer(palette = "Set3") +  
  theme_minimal()

The box plot shows that the High group has the highest median and the Low group the lowest, with some extreme low values. This ordering follows directly from how the groups were defined (cut points at 0.5 and 0.8 on actual productivity), so the plot illustrates the binning rather than providing independent evidence of differences between teams.
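One way to consolidate the 12 teams themselves, rather than individual rows, is to classify each team by its mean productivity and then map that label back to the rows. The sketch below uses simulated data (`sim`, `team_effect`, and `team_tier` are hypothetical names) and reuses the 0.5 and 0.8 cut points from the text. It is an illustration, not the method used in class, and the tiers are still derived from the response, so the resulting ANOVA remains optimistic.

```r
# Simulated data: 12 teams with different underlying means (hypothetical values)
set.seed(6)
sim <- data.frame(team = factor(rep(1:12, each = 20)))
team_effect <- seq(0.45, 0.85, length.out = 12)
sim$actual_productivity <- rnorm(240, mean = team_effect[as.integer(sim$team)], sd = 0.1)

# Mean productivity per team (one value per team, not per row)
team_means <- sapply(split(sim$actual_productivity, sim$team), mean)

# Assign each TEAM to a tier using the 0.5 / 0.8 cut points
team_tier <- ifelse(team_means > 0.8, "High",
                    ifelse(team_means >= 0.5, "Medium", "Low"))

# Map the team-level tier back onto every row via its team label
sim$team_tier <- factor(team_tier[as.character(sim$team)],
                        levels = c("Low", "Medium", "High"))

# Each team should appear in exactly one tier
table(sim$team, sim$team_tier)
summary(aov(actual_productivity ~ team_tier, data = sim))
```

A grouping based on something other than the response (for example, department or team size) would avoid the remaining circularity.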

Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

In this analysis, we have selected overtime (the over_time column) as the predictor variable for actual productivity. The values in this column run into the thousands (for example 7080), so they are more plausibly recorded in minutes than in hours; the coefficient below is per unit of over_time, whatever that unit is. To check that the relationship is roughly linear, we visualize the data using a scatter plot with a fitted regression line. The Pearson correlation can also be used to quantify the strength of the linear relationship. We then fit a linear regression model to explore the association between overtime and productivity.
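The Pearson correlation mentioned above can be computed with `cor.test()` from base R. The following sketch uses simulated vectors (hypothetical values chosen to resemble a weak negative relationship), so its output is illustrative only.

```r
# Simulated predictor and response (hypothetical values)
set.seed(7)
n <- 200
over_time <- runif(n, 0, 10000)
actual_productivity <- 0.75 - 3e-06 * over_time + rnorm(n, 0, 0.17)

# Pearson correlation with a test of H0: rho = 0 and a 95% confidence interval
ct <- cor.test(over_time, actual_productivity, method = "pearson")
print(ct)

# For a one-predictor regression, R-squared equals the squared Pearson correlation
fit <- lm(actual_productivity ~ over_time)
all.equal(unname(ct$estimate)^2, summary(fit)$r.squared)
```

On the real data the call would be `cor.test(data$over_time, data$actual_productivity)`; the p-value from this test matches the p-value of the slope in the simple regression.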

Build a linear regression model of the response using just this column, and evaluate its fit.

Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

lm_model <- lm(actual_productivity ~ over_time, data = data)
summary(lm_model)
## 
## Call:
## lm(formula = actual_productivity ~ over_time, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50813 -0.08040  0.03328  0.11930  0.37516 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.480e-01  8.523e-03  87.764   <2e-16 ***
## over_time   -2.824e-06  1.505e-06  -1.877   0.0608 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1743 on 1195 degrees of freedom
## Multiple R-squared:  0.002938,   Adjusted R-squared:  0.002104 
## F-statistic: 3.522 on 1 and 1195 DF,  p-value: 0.06082

The linear regression results show a negative but very small association between overtime and productivity. The coefficient for over_time is -2.824e-06, meaning that each additional unit of over_time is associated with a decrease of about 0.0000028 in productivity, or roughly 0.0028 per 1,000 units. The p-value for over_time is 0.0608, which is above the usual significance level of 0.05, so the slope is not statistically distinguishable from zero in this model.

Additionally, the R-squared value is very low (0.0029), meaning that overtime explains only 0.29% of the variability in productivity; this corresponds to a Pearson correlation of about -0.054. Overtime alone is therefore a poor predictor of productivity, and most of the variation is due to other factors not included in this model.

Interpretation of the Coefficients:

The intercept of 0.748 is the predicted productivity when no overtime is worked, which is close to the overall mean of 0.7351. The coefficient of -2.824e-06 indicates a very slight decrease in predicted productivity with each additional unit of over_time. However, the effect is not statistically significant (p = 0.0608), so overtime is not a strong factor influencing productivity on its own.
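To make the size and uncertainty of the slope more concrete, the short calculation below rescales it and builds an approximate 95% confidence interval from the estimate, standard error, and residual degrees of freedom printed in the regression output above. It is a worked illustration using those printed values only.

```r
# Estimates copied from the regression output above
slope <- -2.824e-06
se    <- 1.505e-06
df    <- 1195

# Expected change in productivity per 1,000 additional units of over_time
slope * 1000

# Approximate 95% confidence interval for the slope
ci <- slope + c(-1, 1) * qt(0.975, df) * se
ci
```

The interval spans zero, which is the same conclusion as the p-value of 0.0608: the data is compatible with no linear effect of overtime. With the fitted model available, `confint(lm_model)` would give the interval directly.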

Recommendations:

Reevaluate Overtime: Consider reducing overtime by improving workflows or team efficiency. Over-relying on overtime may not significantly boost productivity and could cause burnout.

Explore Other Factors: Since overtime alone doesn’t explain much variation in productivity, look into other factors such as team dynamics and employee engagement.

Optimize Scheduling: There might be an optimal amount of overtime that balances productivity with well-being. Data-driven scheduling could ensure efficient use of overtime.

Improve the Model: Given the low R-squared value, future models should include additional predictors, like team size or workplace conditions, to better explain productivity.

ggplot(data, aes(x = over_time, y = actual_productivity)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Overtime Hours vs Actual Productivity", x = "Overtime Hours", y = "Actual Productivity")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows the relationship between overtime and actual productivity, with the blue regression line sloping slightly downward. As overtime increases, productivity decreases only marginally, indicating a weak negative correlation. This matches the linear regression model, where the slope is negative but very small and not statistically significant at the 0.05 level.
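To put a number on "weak", the correlation can be recovered from the reported model output: in a regression with a single predictor, the correlation coefficient is the square root of R-squared with the sign of the slope. The snippet below is a small sketch using the values printed in the summary; the variable names are illustrative.

```r
# Values copied from the summary() output for actual_productivity ~ over_time
r_squared <- 0.002938
slope     <- -2.824e-06

# With one predictor, r = sign(slope) * sqrt(R-squared)
r <- sign(slope) * sqrt(r_squared)
print(round(r, 4))
```

The result is a correlation of about -0.054, which is very close to zero and consistent with the nearly flat line in the plot.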

lm_team_size <- lm(actual_productivity ~ no_of_workers, data = data)

summary(lm_team_size)
## 
## Call:
## lm(formula = actual_productivity ~ no_of_workers, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51143 -0.08052  0.02730  0.12004  0.37560 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.7508678  0.0093327  80.456   <2e-16 ***
## no_of_workers -0.0004558  0.0002270  -2.008   0.0449 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1743 on 1195 degrees of freedom
## Multiple R-squared:  0.003363,   Adjusted R-squared:  0.002529 
## F-statistic: 4.032 on 1 and 1195 DF,  p-value: 0.04486

The linear regression model reveals that team size (measured by the number of workers) has a small but statistically significant negative relationship with productivity. The coefficient for team size is -0.0004558, indicating that for each additional worker in the team, predicted productivity decreases by 0.0004558 units. The p-value for team size is 0.0449, which is just below 0.05, so a slope this far from zero would be unlikely if there were no relationship at all. However, the R-squared value is 0.003363, meaning that team size explains only about 0.34% of the variation in productivity. Team size alone therefore has little practical effect on productivity, and other factors likely contribute more to the variation.
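The t value and p-value in the coefficient table follow directly from the estimate and its standard error. As a check, the sketch below recomputes both for no_of_workers from the printed (rounded) values, so small differences from the summary in the last digit are expected. The variable names are illustrative.

```r
# Values copied from the summary() output for actual_productivity ~ no_of_workers
estimate  <- -0.0004558
std_error <- 0.0002270
df_resid  <- 1195

# t statistic: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(round(p_value, 4))
```

Because the p-value is only slightly below 0.05, the evidence for a team-size effect is modest even though it passes the conventional threshold.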

Interpretation of the Coefficients:

Intercept: The predicted productivity is 0.7509 when team size is zero. A team with no workers cannot exist, so this value is an extrapolation that anchors the regression line rather than a meaningful baseline.

Coefficient for Team Size: The coefficient of -0.0004558 indicates a very slight decrease in productivity with each additional worker. One possible explanation is that larger teams face coordination problems or communication breakdowns, although the model itself cannot confirm the cause.

Larger teams may not be more productive: In this data, larger teams are slightly less productive on average, which is consistent with challenges such as coordination difficulties and role ambiguity.

Optimize team dynamics: Focus on improving communication and task allocation rather than simply increasing team size. Smaller, well-organized teams might be more effective.

Recommendations:

Reevaluate Team Size: Focus on optimizing team structure, communication, and role clarity rather than increasing team size to avoid diminishing returns.

Improve Team Efficiency: Invest in team-building and tools to enhance collaboration and task allocation, particularly in larger teams.

Focus on Other Factors: Since team size explains little of the variation in productivity, explore other factors like employee engagement, workplace conditions, and task complexity.

Model Improvement: Future models should include predictors like workplace environment, team skills, or incentives to better explain productivity.

ggplot(data, aes(x = no_of_workers, y = actual_productivity)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Team Size (No. of Workers) vs Actual Productivity", x = "Team Size (No. of Workers)", y = "Actual Productivity")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows the relationship between team size (number of workers) and actual productivity, with a blue regression line indicating the trend. The line is almost flat, so although the slope is statistically significant, team size has very little practical effect on productivity. This is supported by the linear regression results, which show a small negative coefficient for team size and a very low R-squared value (0.003363). Team size explains very little of the variation in productivity, and other factors likely play a larger role.

lm_incentive <- lm(actual_productivity ~ incentive, data = data)
summary(lm_incentive)
## 
## Call:
## lm(formula = actual_productivity ~ incentive, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54788 -0.08366  0.03694  0.11276  0.38853 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.319e-01  5.172e-03 141.515  < 2e-16 ***
## incentive   8.337e-05  3.142e-05   2.654  0.00807 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.174 on 1195 degrees of freedom
## Multiple R-squared:  0.005858,   Adjusted R-squared:  0.005026 
## F-statistic: 7.042 on 1 and 1195 DF,  p-value: 0.00807

The linear regression model shows that incentives have a positive relationship with actual productivity. The coefficient for incentives is 8.337e-05, meaning that for each unit increase in incentives, productivity is expected to increase slightly. The p-value for incentives is 0.00807, which is less than 0.05, suggesting that incentives are a statistically significant predictor of productivity. However, the R-squared value is 0.005858, indicating that incentives explain only about 0.59% of the variation in productivity. In short, the effect of incentives is statistically significant but small, and other factors likely contribute more to productivity. The model could be improved by including additional predictors, and organizations may also raise productivity by improving workflows, enhancing team communication, and creating a better work environment.

Interpretation of the Coefficients:

Intercept: The intercept represents the baseline productivity when incentives are zero, indicating a baseline productivity level of around 0.7319.

Coefficient for Incentives: The coefficient of 8.337e-05 means that for each unit increase in incentives, productivity is expected to increase slightly, by 0.00008337 units. This shows a positive relationship between incentives and productivity, suggesting that higher incentives are associated with slightly higher productivity.

Incentives have a positive but small effect on productivity. While they may motivate workers, the increase in productivity is minimal, suggesting that incentives alone aren’t enough to drive significant improvements.

Incentives are useful but not the primary driver of productivity. Other factors like work environment, team dynamics, and employee engagement likely play a more significant role.

The low R-squared value (0.005858) indicates that incentives explain only a small fraction of productivity variation, highlighting the importance of considering other factors.
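The t value, F-statistic, and R-squared reported for this model are three views of the same quantity. With a single predictor, F equals the squared t value, and R-squared equals F / (F + residual degrees of freedom). The sketch below checks both identities against the printed values; the variable names are illustrative, and the inputs are rounded, so agreement is only approximate.

```r
# Values copied from the summary() output for actual_productivity ~ incentive
t_value  <- 2.654
f_stat   <- 7.042
df_resid <- 1195

# In simple linear regression the F-statistic is the square of the slope's t value
f_from_t <- t_value^2

# R-squared can be recovered from F and the residual degrees of freedom
r2_from_f <- f_stat / (f_stat + df_resid)

print(round(f_from_t, 3))
print(round(r2_from_f, 6))
```

This is why a model can be clearly significant (large n, F around 7) and still explain well under 1% of the variation.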

Recommendations:

Optimize Incentive Programs: Enhance incentives by offering tiered rewards for higher performance to encourage consistent improvement.

Explore Other Influencing Factors: Investigate factors like team collaboration, employee engagement, and workplace conditions, which may have a larger impact on productivity.

Combine Strategies: Pair incentives with strategies like skill development, communication improvement, and workflow optimization to boost productivity.

Improve the Model: Include additional predictors such as team size, task complexity, and workplace environment to better explain productivity.

ggplot(data, aes(x = incentive, y = actual_productivity)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Incentives vs Actual Productivity", x = "Incentives", y = "Actual Productivity")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a positive relationship between incentives and actual productivity. As incentives increase, productivity also tends to increase, which is reflected by the upward sloping blue regression line. However, there are some outliers where incentives are low but productivity remains high. This indicates that while incentives have a positive impact on productivity, their effect is small and other factors might also play a role in influencing productivity. The model’s low R-squared value suggests that incentives explain only a small portion of the variability in productivity.

lm_idle_time <- lm(actual_productivity ~ idle_time, data = data)
summary(lm_idle_time)
## 
## Call:
## lm(formula = actual_productivity ~ idle_time, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50220 -0.08549  0.03743  0.11441  0.38454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7359016  0.0050372 146.092  < 2e-16 ***
## idle_time   -0.0011100  0.0003958  -2.804  0.00513 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.174 on 1195 degrees of freedom
## Multiple R-squared:  0.006537,   Adjusted R-squared:  0.005706 
## F-statistic: 7.863 on 1 and 1195 DF,  p-value: 0.005128

The linear regression model shows that idle time has a negative relationship with actual productivity. The coefficient for idle time is -0.0011100, indicating that as idle time increases, productivity decreases slightly. The p-value for idle time is 0.00513, which is less than 0.05, suggesting that idle time is a statistically significant predictor of productivity. However, the R-squared value is 0.006537, meaning that idle time explains only about 0.65% of the variation in productivity, so other factors likely play a larger role. Despite its statistical significance, the effect of idle time on productivity is small, and the model could be improved by including other relevant variables.

Interpretation of the Coefficients:

Intercept: The intercept of 0.7359 is the predicted baseline productivity when idle time is zero.

Coefficient for Idle Time: The coefficient of -0.0011100 indicates that each additional unit of idle time results in a slight decrease in productivity, suggesting a negative relationship.

Higher idle time is associated with lower productivity. The p-value of 0.00513 shows the relationship is statistically significant, meaning a slope this large would be unlikely if there were no relationship.

Idle time is a significant predictor of productivity. Minimizing downtime can improve productivity.

The low R-squared value (0.006537) suggests that other factors, like team dynamics and employee motivation, play a bigger role in productivity.
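To make the size of the idle time coefficient concrete, the fitted line can be evaluated at a few idle time values. The sketch below uses the intercept and slope from the summary; the helper `predict_productivity()` and the idle time values 0, 5, and 10 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary() output for actual_productivity ~ idle_time
intercept <- 0.7359016
slope     <- -0.0011100

# Hypothetical helper: predicted productivity on the fitted line
predict_productivity <- function(idle_time) {
  intercept + slope * idle_time
}

# Illustrative idle time values (not taken from the data)
idle_values <- c(0, 5, 10)
predictions <- data.frame(
  idle_time = idle_values,
  predicted = predict_productivity(idle_values)
)
print(predictions)
```

Moving from 0 to 10 units of idle time lowers the predicted productivity by only about 0.011, which shows how small the effect is in practical terms.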

Recommendations:

Minimize Idle Time: Optimize workflows and task allocation to reduce idle time and keep employees engaged.

Explore Other Factors: Investigate factors like employee engagement, workplace conditions, and task complexity to gain a fuller understanding of productivity.

Combine Strategies: Reduce idle time while improving team collaboration, incentives, and communication to boost productivity.

Improve the Model: Include additional predictors like team size or employee skills to enhance the model’s explanatory power.

ggplot(data, aes(x = idle_time, y = actual_productivity)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Idle Time vs Actual Productivity", x = "Idle Time", y = "Actual Productivity")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a negative relationship between idle time and productivity. As idle time increases, productivity tends to decrease, which is reflected in the downward-sloping blue regression line. Some outliers are visible, especially at the left end, where idle time is low but productivity remains high. These data points may require further investigation to understand their effect on the model. The regression analysis is consistent with the plot: idle time is negatively associated with productivity, though the effect is small. While the relationship is statistically significant, idle time explains only a small proportion of the variation in productivity, as indicated by the low R-squared value. This suggests that other factors may have a larger influence on productivity.
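One possible way to follow up on the outliers is Cook's distance, which measures how much each observation changes the fitted line. The sketch below is an assumption about how such a check could look, not part of the original analysis. The helper `flag_influential()` is hypothetical, the 4/n cutoff is a common rule of thumb rather than a strict test, and the `toy` data frame is synthetic and only reuses the column names from this report.

```r
# Hypothetical helper: indices of observations whose Cook's distance exceeds 4/n
flag_influential <- function(model) {
  d <- cooks.distance(model)
  threshold <- 4 / length(d)  # rule-of-thumb cutoff, not a formal test
  which(d > threshold)
}

# Synthetic stand-in data (NOT the garment dataset): mostly zero idle time
# plus one extreme idle time value
set.seed(1)
toy <- data.frame(idle_time = c(rep(0, 45), 2, 3, 4, 5, 150))
toy$actual_productivity <- 0.74 - 0.001 * toy$idle_time +
  rnorm(nrow(toy), sd = 0.05)

toy_model <- lm(actual_productivity ~ idle_time, data = toy)
flagged <- flag_influential(toy_model)
print(flagged)
```

On the real data, the same helper could be applied to `lm_idle_time`, and the model refit without the flagged rows to see whether the slope changes noticeably.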

Conclusion:

After investigating the four models (overtime, team size, incentives, and idle time), we found that none of the factors alone explains even 1% of the variation in productivity. Each factor showed a small effect: overtime, team size, and idle time were negatively associated with productivity, while incentives showed a slight positive association. Team size, incentives, and idle time are statistically significant at the 0.05 level, whereas overtime is not (p = 0.0608). Because each model explains so little, other factors, such as team dynamics, employee engagement, and workplace conditions, may play a larger role, and future models should incorporate additional predictors. Improving workflows, enhancing team communication, and creating a better work environment remain reasonable ways for organizations to try to increase productivity.
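A natural next step is to put the four predictors into one model and see how much variation they explain together. The sketch below is one possible form of that model, not a result from this report. The helper `fit_combined()` is hypothetical, and the `toy` data frame is synthetic: it only mimics the column names so the code runs on its own, and its coefficients say nothing about the garment data.

```r
# Hypothetical helper: multiple regression with all four predictors from this report
fit_combined <- function(df) {
  lm(actual_productivity ~ over_time + no_of_workers + incentive + idle_time,
     data = df)
}

# Synthetic stand-in data (NOT the garment dataset), used only to make the
# example self-contained
set.seed(42)
n <- 200
toy <- data.frame(
  over_time     = runif(n, 0, 10000),
  no_of_workers = round(runif(n, 2, 60)),
  incentive     = runif(n, 0, 100),
  idle_time     = rpois(n, 1)
)
toy$actual_productivity <- 0.7 + 0.001 * toy$incentive -
  0.005 * toy$idle_time + rnorm(n, sd = 0.1)

combined_model <- fit_combined(toy)
print(coef(combined_model))
print(summary(combined_model)$adj.r.squared)
```

With the real data, `fit_combined(data)` would fit the same formula. Comparing its adjusted R-squared with the single-predictor values above would show whether the predictors add information beyond one another.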