Potential questions after exploring the data set are:
Loading the csv file to garment_prod variable.
garment_prod <-read.csv("/Users/lakshmimounikab/Desktop/Stats with R/R practice/garment_prod.csv")
garment_prod$team <- as.character(garment_prod$team)
summary(garment_prod)
## date quarter department day
## Length:1197 Length:1197 Length:1197 Length:1197
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## team targeted_productivity smv wip
## Length:1197 Min. :0.0700 Min. : 2.90 Min. : 7.0
## Class :character 1st Qu.:0.7000 1st Qu.: 3.94 1st Qu.: 774.5
## Mode :character Median :0.7500 Median :15.26 Median : 1039.0
## Mean :0.7296 Mean :15.06 Mean : 1190.5
## 3rd Qu.:0.8000 3rd Qu.:24.26 3rd Qu.: 1252.5
## Max. :0.8000 Max. :54.56 Max. :23122.0
## NA's :506
## over_time incentive idle_time idle_men
## Min. : 0 Min. : 0.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1440 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 3960 Median : 0.00 Median : 0.0000 Median : 0.0000
## Mean : 4567 Mean : 38.21 Mean : 0.7302 Mean : 0.3693
## 3rd Qu.: 6960 3rd Qu.: 50.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :25920 Max. :3600.00 Max. :300.0000 Max. :45.0000
##
## no_of_style_change no_of_workers actual_productivity
## Min. :0.0000 Min. : 2.00 Min. :0.2337
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.:0.6503
## Median :0.0000 Median :34.00 Median :0.7733
## Mean :0.1504 Mean :34.61 Mean :0.7351
## 3rd Qu.:0.0000 3rd Qu.:57.00 3rd Qu.:0.8503
## Max. :2.0000 Max. :89.00 Max. :1.1204
##
library(pwr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
H0: There is no significant difference in average productivity on Mondays compared to other weekdays. Ha: There is a significant difference in average productivity on Mondays compared to other weekdays.
mondays <- subset(garment_prod, weekdays(as.Date(garment_prod$date, "%m/%d/%y")) == "Monday")
other_days <- subset(garment_prod, weekdays(as.Date(garment_prod$date, "%m/%d/%y")) != "Monday")
mean_mondays <- mean(mondays$actual_productivity)
mean_other <- mean(other_days$actual_productivity)
Alpha level (sig.level): 0.05 (standard for statistical significance)
Power (power): 0.8 (standard for adequate power)
Minimum detectable effect size (d) : A 0.2 difference in mean productivity (a small but potentially meaningful difference)
pwr.t.test(n=NULL, d=0.2, sig.level=0.05, power=0.8)
##
## Two-sample t test power calculation
##
## n = 393.4057
## d = 0.2
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
nrow(mondays)
## [1] 199
nrow(other_days)
## [1] 998
t.test(mondays$actual_productivity, other_days$actual_productivity)
##
## Welch Two Sample t-test
##
## data: mondays$actual_productivity and other_days$actual_productivity
## t = 0.03473, df = 279.28, p-value = 0.9723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02653974 0.02749302
## sample estimates:
## mean of x mean of y
## 0.7354885 0.7350119
mean_diff <- mean_mondays - mean_other
se <- sqrt(var(mondays$actual_productivity)/nrow(mondays) +
var(other_days$actual_productivity)/nrow(other_days))
test_stat <- mean_diff / se
p_value <- 2*pt(-abs(test_stat), df = nrow(mondays) + nrow(other_days) - 2)
p_value
## [1] 0.9723011
The p-value here is 0.97 which is much greater than the significance level (0.05). Hence, we fail to reject the null hypothesis H0. We can conclude that there is no statistically significant evidence of a difference in average productivity between Mondays and other weekdays. We would retain the null hypothesis.
# Convert the 'date' column to a Date class
garment_prod$date <- as.Date(garment_prod$date, format = "%m/%d/%y")
ggplot(garment_prod, aes(x = weekdays(date), y = actual_productivity)) +
geom_boxplot()
ggplot(garment_prod, aes(x = weekdays(date), y = actual_productivity)) +
geom_jitter(color = "blue", alpha = 0.6, width = 0.2) +
labs(title = "Actual Productivity by Weekdays",
x = "Weekdays",
y = "Actual Productivity")
We create a histogram with a vertical line indicating the observed test statistics and the p-value.
# Create a sequence of test statistics under the null hypothesis
null_test_stats <- seq(-4, 4, by = 0.1) # Adjust the range and increment as needed
# Calculate the null distribution (assuming no difference in means)
null_distribution <- dt(null_test_stats, df = nrow(mondays) + nrow(other_days) - 2)
# Create a histogram of the null distribution
hist(null_distribution,
main = "Hypothesis Test for Mean Difference",
xlab = "Test Statistic",
ylab = "Frequency",
col = "lightblue")
# Add a vertical line to indicate the observed test statistic
abline(v = test_stat, col = "red", lwd = 2)
# Add text to indicate the p-value
text(2, 15, paste("p-value =", round(p_value, 4)), col = "red", adj = 0)
The visualization help understand where the observed test statistics falls within the distribution under the null hypothesis and provides us in sight into the significance of the result.
H0:There is no significant difference in average productivity between quarters 1 and 2. Ha:There is a significant difference in average productivity between quarters 1 and 2. ### Creating subset for Quarter1 and Quarter 2
q1 <- subset( garment_prod, quarter == "Quarter1")
q2 <- subset( garment_prod, quarter == "Quarter2")
Alpha (sig.level) = 0.05 Minimum effective size (d) = 0.3 Power = 0.8
pwr.t.test(n=NULL, d=0.3, sig.level=0.05, power=0.8)
##
## Two-sample t test power calculation
##
## n = 175.3847
## d = 0.3
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
nrow(q1)
## [1] 360
nrow(q2)
## [1] 335
Here, n value is less than both the subset counts. We need approximately 175 entries for this sample, but we have 360 for quarter 1 group and 335 entries for quarter 2 group, which are sufficient for performing Neyman-Pearson test.
# Define null and alternative hypotheses
null_hypothesis <- "There is no significant difference in actual_productivity between Quarter1 and Quarter2"
alternative_hypothesis <- "There is a significant difference in actual_productivity between Quarter1 and Quarter2"
alpha <- 0.05
power <- 0.80
# Perform a two-sample t-test
t_test_result <- t.test(q1$actual_productivity, q2$actual_productivity)
# Calculate critical values based on alpha
critical_value <- qt(1 - alpha / 2, df = min(length(q1$actual_productivity), length(q2$actual_productivity)))
# Make a decision
if (abs(t_test_result$statistic) > critical_value) {
cat("Reject the null hypothesis:", alternative_hypothesis, "\n")
} else {
cat("Fail to reject the null hypothesis:", null_hypothesis, "\n")
}
## Fail to reject the null hypothesis: There is no significant difference in actual_productivity between Quarter1 and Quarter2
# Calculate power (simulated power based on observed data)
simulated_data_q1 <- rnorm(10000, mean = mean(q1$actual_productivity), sd = sd(q1$actual_productivity))
simulated_data_q2 <- rnorm(10000, mean = mean(q2$actual_productivity), sd = sd(q2$actual_productivity))
t_statistic_q1 <- (mean(simulated_data_q1) - mean(q2$actual_productivity)) /
sqrt((var(simulated_data_q1) / length(q1$actual_productivity)) + (var(simulated_data_q2) / length(q2$actual_productivity)))
t_statistic_q2 <- (mean(q1$actual_productivity) - mean(simulated_data_q2)) /
sqrt((var(simulated_data_q1) / length(q1$actual_productivity)) + (var(simulated_data_q2) / length(q2$actual_productivity)))
power_observed <- mean(abs(t_statistic_q1) > critical_value | abs(t_statistic_q2) > critical_value)
cat("Observed Power:", power_observed, "\n")
## Observed Power: 0
mean_q1 <- mean(q1$actual_productivity)
mean_q2 <- mean(q2$actual_productivity)
mean_diff <- mean_q1 - mean_q2
se <- sqrt(var(q1$actual_productivity)/nrow(q1) +
var(q2$actual_productivity)/nrow(q2))
test_stat <- mean_diff / se
p_value <- 2*pt(-abs(test_stat), df = nrow(q1) + nrow(q2) - 2)
p_value
## [1] 0.5346714
In Fisher’s test as well, the p-value is 0.53 which is significantly greater than the significance level (0.05). Which means we fail to reject the null hypothesis: There is no significant difference in actual_productivity between Quarter1 and Quarter2. we would retain the null hypothesis H0.
means_by_quarter <- garment_prod %>%
group_by(quarter) %>%
summarize(mean_prod = mean(actual_productivity))
ggplot(means_by_quarter, aes(x = quarter, y = mean_prod)) +
geom_bar(stat = "identity", fill = "lightblue", color = "blue") +
labs(title = "Mean Productivity by Quarter",
x = "Quarter",
y = "Mean Actual Productivity")
We create a histogram with a vertical line indicating the observed test statistics and the p-value.
null_test_stats <- seq(-4, 4, by = 0.1) # Adjust the range and increment as needed
# Calculate the null distribution (assuming no difference in means)
null_distribution <- dt(null_test_stats, df = nrow(q1) + nrow(q2) - 2)
# Create a histogram of the null distribution
hist(null_distribution,
main = "Null Distribution of Test Statistic",
xlab = "Test Statistic",
ylab = "Frequency",
col = "lightblue")
# Add a vertical line to indicate the observed test statistic
abline(v = test_stat, col = "red", lwd = 2)
# Add text to indicate the p-value
text(2, 15, paste("p-value =", round(p_value, 4)), col = "red", adj = 0)
The visualization help understand where the observed test statistics falls within the distribution under the null hypothesis and provides us in sight into the significance of the result.