Week 7

Potential questions

Potential questions after exploring the data set are:

What factors influence whether actual productivity meets targeted productivity? The data shows actual often falls short of targeted.
How do characteristics like idle time, idle workers, and day of week impact productivity? The visualizations hinted at relationships.
Does the number of style changes or workers affect efficiency and output? There may be optimization insights.
How do incentives relate to productivity rates? The meaning of the incentive values is still unclear.

Load CSV file

Loading the csv file to garment_prod variable.

garment_prod <-read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)

##      date             quarter           department            day           
##  Length:1197        Length:1197        Length:1197        Length:1197       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      team           targeted_productivity      smv             wip         
##  Length:1197        Min.   :0.0700        Min.   : 2.90   Min.   :    7.0  
##  Class :character   1st Qu.:0.7000        1st Qu.: 3.94   1st Qu.:  774.5  
##  Mode  :character   Median :0.7500        Median :15.26   Median : 1039.0  
##                     Mean   :0.7296        Mean   :15.06   Mean   : 1190.5  
##                     3rd Qu.:0.8000        3rd Qu.:24.26   3rd Qu.: 1252.5  
##                     Max.   :0.8000        Max.   :54.56   Max.   :23122.0  
##                                                           NA's   :506      
##    over_time       incentive         idle_time           idle_men      
##  Min.   :    0   Min.   :   0.00   Min.   :  0.0000   Min.   : 0.0000  
##  1st Qu.: 1440   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 0.0000  
##  Median : 3960   Median :   0.00   Median :  0.0000   Median : 0.0000  
##  Mean   : 4567   Mean   :  38.21   Mean   :  0.7302   Mean   : 0.3693  
##  3rd Qu.: 6960   3rd Qu.:  50.00   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
##  Max.   :25920   Max.   :3600.00   Max.   :300.0000   Max.   :45.0000  
##                                                                        
##  no_of_style_change no_of_workers   actual_productivity
##  Min.   :0.0000     Min.   : 2.00   Min.   :0.2337     
##  1st Qu.:0.0000     1st Qu.: 9.00   1st Qu.:0.6503     
##  Median :0.0000     Median :34.00   Median :0.7733     
##  Mean   :0.1504     Mean   :34.61   Mean   :0.7351     
##  3rd Qu.:0.0000     3rd Qu.:57.00   3rd Qu.:0.8503     
##  Max.   :2.0000     Max.   :89.00   Max.   :1.1204     
##

Load required libraries

library(pwr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Hypothesis 1:

H0: There is no significant difference in average productivity on Mondays compared to other weekdays. Ha: There is a significant difference in average productivity on Mondays compared to other weekdays.

mondays <- subset(garment_prod, weekdays(as.Date(garment_prod$date, "%m/%d/%y")) == "Monday")
other_days <- subset(garment_prod, weekdays(as.Date(garment_prod$date, "%m/%d/%y")) != "Monday")
mean_mondays <- mean(mondays$actual_productivity)
mean_other <- mean(other_days$actual_productivity)

Sample size calculation

Alpha level (sig.level): 0.05 (standard for statistical significance)

Power (power): 0.8 (standard for adequate power)

Minimum detectable effect size (d) : A 0.2 difference in mean productivity (a small but potentially meaningful difference)

pwr.t.test(n=NULL, d=0.2, sig.level=0.05, power=0.8)

## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

nrow(mondays)

## [1] 199

nrow(other_days)

## [1] 998

Two Sample T-test

t.test(mondays$actual_productivity, other_days$actual_productivity)

## 
##  Welch Two Sample t-test
## 
## data:  mondays$actual_productivity and other_days$actual_productivity
## t = 0.03473, df = 279.28, p-value = 0.9723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02653974  0.02749302
## sample estimates:
## mean of x mean of y 
## 0.7354885 0.7350119

Fisher’s test

mean_diff <- mean_mondays - mean_other

se <- sqrt(var(mondays$actual_productivity)/nrow(mondays) +
           var(other_days$actual_productivity)/nrow(other_days)) 

test_stat <- mean_diff / se

p_value <- 2*pt(-abs(test_stat), df = nrow(mondays) + nrow(other_days) - 2)

p_value

## [1] 0.9723011

The p-value here is 0.97 which is much greater than the significance level (0.05). Hence, we fail to reject the null hypothesis H0. We can conclude that there is no statistically significant evidence of a difference in average productivity between Mondays and other weekdays. We would retain the null hypothesis.

Visualization

# Convert the 'date' column to a Date class
garment_prod$date <- as.Date(garment_prod$date, format = "%m/%d/%y")
ggplot(garment_prod, aes(x = weekdays(date), y = actual_productivity)) +
  geom_boxplot()

ggplot(garment_prod, aes(x = weekdays(date), y = actual_productivity)) +
  geom_jitter(color = "blue", alpha = 0.6, width = 0.2) +
  labs(title = "Actual Productivity by Weekdays",
       x = "Weekdays",
       y = "Actual Productivity")

We create a histogram with a vertical line indicating the observed test statistics and the p-value.

# Create a sequence of test statistics under the null hypothesis
null_test_stats <- seq(-4, 4, by = 0.1)  # Adjust the range and increment as needed

# Calculate the null distribution (assuming no difference in means)
null_distribution <- dt(null_test_stats, df = nrow(mondays) + nrow(other_days) - 2)

# Create a histogram of the null distribution
hist(null_distribution, 
     main = "Hypothesis Test for Mean Difference",
     xlab = "Test Statistic",
     ylab = "Frequency",
     col = "lightblue")

# Add a vertical line to indicate the observed test statistic
abline(v = test_stat, col = "red", lwd = 2)

# Add text to indicate the p-value
text(2, 15, paste("p-value =", round(p_value, 4)), col = "red", adj = 0)

The visualization help understand where the observed test statistics falls within the distribution under the null hypothesis and provides us in sight into the significance of the result.

Hypothesis 2:

H0:There is no significant difference in average productivity between quarters 1 and 2. Ha:There is a significant difference in average productivity between quarters 1 and 2. ### Creating subset for Quarter1 and Quarter 2

q1 <- subset( garment_prod, quarter == "Quarter1")
q2 <- subset( garment_prod, quarter == "Quarter2")

Sample size Calculation

Alpha (sig.level) = 0.05 Minimum effective size (d) = 0.3 Power = 0.8

pwr.t.test(n=NULL, d=0.3, sig.level=0.05, power=0.8)

## 
##      Two-sample t test power calculation 
## 
##               n = 175.3847
##               d = 0.3
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

nrow(q1)

## [1] 360

nrow(q2)

## [1] 335

Here, n value is less than both the subset counts. We need approximately 175 entries for this sample, but we have 360 for quarter 1 group and 335 entries for quarter 2 group, which are sufficient for performing Neyman-Pearson test.

Neyman-Pearson Test

# Define null and alternative hypotheses
null_hypothesis <- "There is no significant difference in actual_productivity between Quarter1 and Quarter2"
alternative_hypothesis <- "There is a significant difference in actual_productivity between Quarter1 and Quarter2"

alpha <- 0.05
power <- 0.80

# Perform a two-sample t-test
t_test_result <- t.test(q1$actual_productivity, q2$actual_productivity)

# Calculate critical values based on alpha
critical_value <- qt(1 - alpha / 2, df = min(length(q1$actual_productivity), length(q2$actual_productivity)))

# Make a decision
if (abs(t_test_result$statistic) > critical_value) {
  cat("Reject the null hypothesis:", alternative_hypothesis, "\n")
} else {
  cat("Fail to reject the null hypothesis:", null_hypothesis, "\n")
}

## Fail to reject the null hypothesis: There is no significant difference in actual_productivity between Quarter1 and Quarter2

# Calculate power (simulated power based on observed data)
simulated_data_q1 <- rnorm(10000, mean = mean(q1$actual_productivity), sd = sd(q1$actual_productivity))
simulated_data_q2 <- rnorm(10000, mean = mean(q2$actual_productivity), sd = sd(q2$actual_productivity))
t_statistic_q1 <- (mean(simulated_data_q1) - mean(q2$actual_productivity)) /
  sqrt((var(simulated_data_q1) / length(q1$actual_productivity)) + (var(simulated_data_q2) / length(q2$actual_productivity)))
t_statistic_q2 <- (mean(q1$actual_productivity) - mean(simulated_data_q2)) /
  sqrt((var(simulated_data_q1) / length(q1$actual_productivity)) + (var(simulated_data_q2) / length(q2$actual_productivity)))
power_observed <- mean(abs(t_statistic_q1) > critical_value | abs(t_statistic_q2) > critical_value)
cat("Observed Power:", power_observed, "\n")

## Observed Power: 0

Fisher’s test

mean_q1 <- mean(q1$actual_productivity)
mean_q2 <- mean(q2$actual_productivity)

mean_diff <- mean_q1 - mean_q2

se <- sqrt(var(q1$actual_productivity)/nrow(q1) + 
           var(q2$actual_productivity)/nrow(q2))

test_stat <- mean_diff / se

p_value <- 2*pt(-abs(test_stat), df = nrow(q1) + nrow(q2) - 2)

p_value

## [1] 0.5346714

In Fisher’s test as well, the p-value is 0.53 which is significantly greater than the significance level (0.05). Which means we fail to reject the null hypothesis: There is no significant difference in actual_productivity between Quarter1 and Quarter2. we would retain the null hypothesis H0.

Visualization

means_by_quarter <- garment_prod %>%
  group_by(quarter) %>%
  summarize(mean_prod = mean(actual_productivity))

ggplot(means_by_quarter, aes(x = quarter, y = mean_prod)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "blue") +
  labs(title = "Mean Productivity by Quarter",
       x = "Quarter",
       y = "Mean Actual Productivity")

We create a histogram with a vertical line indicating the observed test statistics and the p-value.

null_test_stats <- seq(-4, 4, by = 0.1)  # Adjust the range and increment as needed

# Calculate the null distribution (assuming no difference in means)
null_distribution <- dt(null_test_stats, df = nrow(q1) + nrow(q2) - 2)

# Create a histogram of the null distribution
hist(null_distribution, 
     main = "Null Distribution of Test Statistic",
     xlab = "Test Statistic",
     ylab = "Frequency",
     col = "lightblue")

# Add a vertical line to indicate the observed test statistic
abline(v = test_stat, col = "red", lwd = 2)

# Add text to indicate the p-value
text(2, 15, paste("p-value =", round(p_value, 4)), col = "red", adj = 0)

The visualization help understand where the observed test statistics falls within the distribution under the null hypothesis and provides us in sight into the significance of the result.

Week 7

2023-10-09

Potential questions

Load CSV file

Load required libraries

Hypothesis 1:

Sample size calculation

Two Sample T-test

Fisher’s test

Visualization

Hypothesis 2:

Sample size Calculation

Neyman-Pearson Test

Fisher’s test

Visualization