week7..

data <- read.csv("C:\\Users\\SHREYA\\OneDrive\\Documents\\Gitstuff\\modified_dataset.csv")

library(ggplot2)

Hypothesis 1

Null Hypothesis: There is no difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.
Alternative Hypothesis: There is a difference in the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

For this hypothesis, we will use a two-sample t-test to compare the mean ratings of chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

Neyman-Pearson Framework:

Test: Two-sample t-test
Alpha level (Type 1 Error rate): 0.05
Beta level (Type 2 Error rate): 0.2 (power = 0.8)
Minimum Effect Size (Cohen’s d): 0.5 (moderate effect size)
Alpha level: We use the common alpha level of 0.05, which corresponds to a 5% chance of making a Type 1 Error (rejecting a true null hypothesis). This is a standard threshold in many scientific studies.
Beta level and power: We specify a Type 2 Error rate of 0.2, corresponding to a power of 0.8. This means we want an 80% chance of correctly rejecting the null hypothesis when it is false. A power of 0.8 is often considered acceptable in hypothesis testing.
Minimum Effect Size (Cohen’s d): We chose a minimum effect size of 0.5, which corresponds to a moderate effect size according to Cohen’s guidelines. This means we want to be able to detect differences that are at least moderate in magnitude, as smaller effects may not be practically meaningful.
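
As a rough illustration of how these three design choices interact, base R's `power.t.test()` can turn alpha, power, and a standardized effect size into a required sample size per group. The sketch below assumes equal group sizes and a standard deviation of 1, so that `delta` is on Cohen's d scale; these are illustrative assumptions, not properties of the chocolate dataset.

```r
# Design parameters from the Neyman-Pearson setup above
alpha <- 0.05
power <- 0.80
min_effect_size <- 0.5  # Cohen's d

# Assumption: sd = 1, so delta is expressed in standard-deviation units (Cohen's d)
design <- power.t.test(delta = min_effect_size, sd = 1,
                       sig.level = alpha, power = power,
                       type = "two.sample", alternative = "two.sided")

# Round up, since a fraction of an observation cannot be collected
n_per_group <- ceiling(design$n)
cat("Required sample size per group:", n_per_group, "\n")
```

Because this calculation uses the t distribution, it can come out slightly larger than a formula based on the normal approximation.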

Performing Neyman-Pearson hypothesis test

# Subset the data into two groups based on cocoa percentage
above_70 <- data[data$cocoa_percent > 0.7, ]
below_or_equal_70 <- data[data$cocoa_percent <= 0.7, ]

# Perform two-sample t-test
t_test_result <- t.test(above_70$rating, below_or_equal_70$rating)

# Print the result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  above_70$rating and below_or_equal_70$rating
## t = -6.2712, df = 2257.1, p-value = 4.279e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14750859 -0.07723218
## sample estimates:
## mean of x mean of y 
##  3.132741  3.245112

The test statistic (t) is -6.2712.
The degrees of freedom (df) are approximately 2257.1.
The p-value is very small (4.279e-10), indicating strong evidence against the null hypothesis.
The 95% confidence interval for the difference in means ranges from -0.1475 to -0.0772.
The sample mean rating for chocolate bars with a cocoa percentage above 70% is approximately 3.133.
The sample mean rating for chocolate bars with a cocoa percentage below or equal to 70% is approximately 3.245.

Since the p-value is less than the chosen significance level (alpha = 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference in mean ratings between the two groups. The negative t-value indicates that the mean rating for chocolate bars with a cocoa percentage above 70% is lower than that for bars with a cocoa percentage below or equal to 70%, by about 0.11 rating points.
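
Statistical significance alone does not say whether the difference reaches the minimum effect size of 0.5 chosen above. One way to check would be to compute Cohen's d from the two groups. The helper `cohens_d()` below is a hypothetical function written for illustration (it is not part of the original analysis), it assumes a pooled standard deviation, and it is demonstrated on toy vectors rather than the chocolate ratings.

```r
# Hypothetical helper: Cohen's d using a pooled standard deviation
cohens_d <- function(x, y) {
  # Drop missing values so that var() and mean() are well defined
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy vectors for illustration only (NOT the chocolate ratings)
toy_x <- c(1, 2, 3)
toy_y <- c(2, 3, 4)
d <- cohens_d(toy_x, toy_y)
print(d)
```

With the data from this lab, the call would be `cohens_d(above_70$rating, below_or_equal_70$rating)`, and the absolute value of the result could then be compared with 0.5.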

Performing a Fisher’s style test for significance on the same hypothesis

# Obtained p-value from the Welch Two Sample t-test
p_value <- 4.279e-10

# Significance level (alpha)
alpha <- 0.05

# Perform Fisher's Significance Testing
if (p_value <= alpha) {
  # Reject the null hypothesis
  cat("Reject the null hypothesis: There is a statistically significant difference between the groups.\n")
} else {
  # Fail to reject the null hypothesis
  cat("Fail to reject the null hypothesis: There is not enough evidence to conclude a statistically significant difference between the groups.\n")
}

## Reject the null hypothesis: There is a statistically significant difference between the groups.

Under Fisher’s significance testing framework, the p-value is read as a measure of the strength of evidence against the null hypothesis. The p-value obtained from the Welch Two Sample t-test (4.279e-10) is far below 0.05, so we reject the null hypothesis. This indicates very strong evidence of a difference in mean ratings between chocolate bars with a cocoa percentage above 70% and those with a cocoa percentage below or equal to 70%.

Visualization to illustrate the results of hypothesis 1

# Create a box plot of ratings by cocoa group, with one fill color per group
ggplot(data = data, aes(x = factor(cocoa_percent > 0.7), y = rating, fill = factor(cocoa_percent > 0.7))) +
  geom_boxplot() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "lightgreen")) +
  labs(x = "Cocoa Percentage (Above 70%)", y = "Rating", fill = "Above 70%") +
  theme_minimal() +
  ggtitle("Comparison of Ratings for Chocolate Bars") +
  theme(plot.title = element_text(hjust = 0.5))

The x-axis represents whether the cocoa percentage is above 70% (TRUE) or below/equal to 70% (FALSE).
The y-axis represents the rating of chocolate bars.
The box plots show the distribution of ratings for each group.
The 95% confidence interval for the difference in means (-0.1475 to -0.0772) is on the scale of a difference between ratings, not of the ratings themselves, so it is reported in the text above rather than drawn on the rating axis.
Boxes colored in light blue represent chocolate bars with a cocoa percentage below or equal to 70%, and boxes colored in light green represent bars with a cocoa percentage above 70%.

Hypothesis 2:
Null Hypothesis: There is no difference in the mean ratings of chocolate bars from the United States and those from France.
Alternative Hypothesis: There is a difference in the mean ratings of chocolate bars from the United States and those from France.

Neyman-Pearson Framework:

Checking if we have enough data for both groups

# Set parameters
alpha <- 0.05  
beta <- 0.2    
min_effect_size <- 0.5  

# Calculate the sample size needed per group (normal approximation:
# qt() with df = Inf gives the same quantiles as qnorm())
power <- 1 - beta
effect_size <- min_effect_size
t_critical <- qt(1 - alpha/2, df = Inf)
sample_size <- (2 * (t_critical + qt(power, df = Inf))^2) / effect_size^2

# Print sample size needed
cat("Sample size needed:", ceiling(sample_size), "\n")

## Sample size needed: 63
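
The figure of 63 is a requirement per group, so it still has to be compared with the number of bars actually available for each country. A minimal sketch of that comparison follows; `enough_data()` is a hypothetical helper, and the group labels are toy values chosen for illustration, not counts from the chocolate dataset.

```r
# Hypothetical helper: compare observed group sizes with the required n per group
enough_data <- function(groups, n_required) {
  counts <- table(groups)
  data.frame(group = names(counts),
             n = as.integer(counts),
             enough = as.integer(counts) >= n_required)
}

# Toy labels for illustration only (NOT the real country counts)
toy_groups <- c(rep("A", 70), rep("B", 40))
check <- enough_data(toy_groups, n_required = 63)
print(check)
```

With the lab's data, the same helper could be applied to the `company_location` values for the two countries of interest, for example `data$company_location[data$company_location %in% c("U.S.A.", "France")]`.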

Performing Neyman-Pearson hypothesis test

# Subset data for chocolate bars from the United States and France
chocolate_us <- subset(data, company_location == "U.S.A.")
chocolate_fr <- subset(data, company_location == "France")

# Perform two-sample t-test
t_test_result <- t.test(chocolate_us$rating, chocolate_fr$rating)

# Print the result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  chocolate_us$rating and chocolate_fr$rating
## t = -1.6583, df = 213.32, p-value = 0.09873
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14822078  0.01277744
## sample estimates:
## mean of x mean of y 
##  3.190801  3.258523

t-value: -1.6583
Degrees of Freedom (df): 213.32
p-value: 0.09873
95% Confidence Interval for the Difference in Means: (-0.14822078, 0.01277744)
Sample Estimates:
Mean rating for chocolate bars from the United States (x): 3.190801
Mean rating for chocolate bars from France (y): 3.258523

Since the p-value (0.09873) is greater than the significance level (α = 0.05), we fail to reject the null hypothesis. Consistent with this, the 95% confidence interval includes 0. This means that there is not enough evidence to conclude that there is a statistically significant difference in the mean ratings of chocolate bars between the United States and France.
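
To make the quantities in this output less of a black box, the Welch statistic, its Welch-Satterthwaite degrees of freedom, and the two-sided p-value can be reproduced by hand. The sketch below uses a hypothetical helper `welch_by_hand()` and simulated ratings (not the chocolate data), and compares the result with `t.test()`.

```r
# Hypothetical helper: Welch's t statistic, degrees of freedom, and p-value
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  c(t = t_stat, df = df, p_value = p_value)
}

# Simulated ratings for illustration only (NOT the chocolate data)
set.seed(1)
sim_x <- rnorm(50, mean = 3.2, sd = 0.4)
sim_y <- rnorm(30, mean = 3.3, sd = 0.5)

manual <- welch_by_hand(sim_x, sim_y)
builtin <- t.test(sim_x, sim_y)

print(manual)
print(c(builtin$statistic, builtin$parameter, p_value = builtin$p.value))
```

The two printed vectors should agree, which shows that unequal group sizes and variances enter the test only through the two squared standard errors.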

Performing a Fisher’s style test for significance on the same hypothesis

# t-statistic and degrees of freedom, copied from the Welch test output above
t_statistic <- -1.6583
df <- 213.32

# Calculate p-value
p_value <- 2 * pt(abs(t_statistic), df, lower.tail = FALSE)

# Print p-value
print(p_value)

## [1] 0.09872661

The p-value calculated for the Fisher-style test is approximately 0.0987. This matches the p-value reported by the Welch Two Sample t-test, as expected, because it is computed from the same t statistic and degrees of freedom. At the 0.05 level, it does not indicate a statistically significant difference in the mean ratings of chocolate bars between the United States and France.

Interpretation:

The null hypothesis states that there is no difference in the mean ratings of chocolate bars between the United States and France.
A p-value of 0.0987 means that, if the null hypothesis were true, there would be a 9.87% chance of observing a difference in mean ratings at least as extreme as the one in our sample.
This p-value is greater than the common significance level of 0.05, so the result is not statistically significant at that level.
Although 0.0987 is not far above 0.05, it provides only weak evidence against the null hypothesis.

Visualization to illustrate the results of hypothesis 2

# Load necessary library for visualization
library(ggplot2)

# Build a data frame with one row per rating, labelled by country
boxplot_data <- data.frame(
  Country = c(rep("USA", length(chocolate_us$rating)), rep("France", length(chocolate_fr$rating))),
  Rating = c(chocolate_us$rating, chocolate_fr$rating)
)

# Plot
ggplot(boxplot_data, aes(x = Country, y = Rating, fill = Country)) +
  geom_boxplot() +
  labs(title = "Comparison of Chocolate Bar Ratings",
       x = "Country",
       y = "Rating") +
  theme_minimal()

Conclusion

There is no statistically significant difference in the mean ratings of chocolate bars between the United States and France (p = 0.0987).
Within the dataset analyzed, this suggests that ratings do not differ detectably between bars from companies located in these two countries.
Therefore, we fail to reject the null hypothesis. This is not the same as accepting it: the data do not show that the two means are equal, only that a difference could not be demonstrated.

Further investigation or analysis may be necessary to explore other factors that could influence chocolate bar ratings.

week7 lab7

2024-04-08
