The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.

Your RMarkdown notebook for this data dive should contain the following:

After having explored your dataset over the past few weeks, you should already have some questions.
Devise two different null hypotheses based on two different aspects (e.g., columns) of your data.
- For Hypothesis 1: Determine if you have enough data to perform a hypothesis test using the Neyman-Pearson framework (i.e., choose a test, alpha level, Type 2 Error, etc., and reject or fail to reject …). If you have enough data, show your sample size calculation, perform the test, and interpret the results. If there is not enough data, explain why. Make sure your alpha level, power level, and minimum effect size are intentional, i.e., explain why you chose them.
- For Hypothesis 2: Perform a hypothesis test using Fisher’s Significance Testing framework (i.e., p-value, recommendation, etc.). For this, you should only need to choose a test, interpret the p-value, and give some reasoning for why we should be confident in your data and your conclusions.
Build two visualizations that best illustrate your results, one for each null hypothesis. For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

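library(gapminder)
library(ggthemes)
library(pwr)  # power analysis (pwr.anova.test)

# ggplot2 and dplyr are already attached by library(tidyverse) above,
# so they do not need to be loaded again.
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")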

Hypothesis 1

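Neyman-Pearson Framework

For Hypothesis 1, the null hypothesis is that mean revenue growth is the same in every sector. The test is a one-way ANOVA with alpha = 0.05, power = 0.80 (a Type 2 error rate of 0.20), and a minimum effect size of Cohen's f = 0.25, which is Cohen's conventional "medium" effect. We reject the null hypothesis if the p-value falls below alpha and fail to reject it otherwise.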

ANOVA Results

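The one-way ANOVA compares mean “Revenue Growth” across the 11 sectors (k = 11 in the power calculation). The power analysis calls for n = 24.47, rounded up to 25 observations per sector, or 275 in total. The dataset contains 352,097 observations (residual df of 352,086 plus 11 groups), far more than required overall, although the balanced-design calculation assumes that each sector individually reaches 25. The F-value is very large (14554) and the p-value is extremely small (< 2e-16), so the result is statistically significant. The null hypothesis (no difference between the groups) can be rejected.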

Interpretation

Since the p-value is much smaller than our alpha level (0.05), we reject the null hypothesis and conclude that average revenue growth differs between at least two sectors. The ANOVA alone does not show which sectors differ from one another.

Plot

effect_size <- 0.25
alpha <- 0.05
power <- 0.80
num_groups <- length(unique(data$Sector))  # Number of sectors

sample_size <- pwr.anova.test(k = num_groups, f = effect_size, sig.level = alpha, power = power)
print(sample_size)

## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 11
##               n = 24.46702
##               f = 0.25
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group
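
The power calculation gives a requirement per group, so the "enough data" question is really about the smallest sector, not the total row count. Below is a minimal sketch of that check. The `toy` data frame and its sector counts are hypothetical stand-ins, not values from the real file; with the actual data, `toy$Sector` would be replaced by `data$Sector`.

```r
# Hypothetical stand-in for the real data: three sectors with unequal counts.
toy <- data.frame(
  Sector = rep(c("Energy", "Technology", "Utilities"), times = c(40, 120, 12))
)

# Required observations per group, rounded up from the power analysis (n = 24.47).
required_n <- ceiling(24.46702)

# Count rows per sector and flag any sector below the requirement.
counts <- table(toy$Sector)
enough <- counts >= required_n

print(data.frame(Sector = names(counts),
                 n = as.vector(counts),
                 enough = as.vector(enough)))

# TRUE only if every sector meets the per-group requirement.
all(enough)
```

In this toy example the smallest sector falls short, so the check fails even though the total row count is well above 11 × 25.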

# One-way ANOVA: does mean revenue growth differ across sectors?
anova_result <- aov(Revenuegrowth ~ Sector, data = data)
summary(anova_result)

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Sector          10   1254  125.38   14554 <2e-16 ***
## Residuals   352086   3033    0.01                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
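
With this many rows, a small p-value is almost guaranteed, so it helps to put the result on the same effect-size scale used in the power analysis. The sketch below derives eta squared and Cohen's f from the sums of squares printed in the table above. Because those printed values are rounded, the results are approximate.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
# (rounded as printed, so the results are approximate).
ss_sector   <- 1254
ss_residual <- 3033
df_sector   <- 10
df_residual <- 352086

# Eta squared: share of total variance in revenue growth explained by Sector.
eta_sq <- ss_sector / (ss_sector + ss_residual)

# Cohen's f, the effect-size scale used by pwr.anova.test().
cohen_f <- sqrt(eta_sq / (1 - eta_sq))

# Recompute F as a consistency check against the printed value (14554).
f_value <- (ss_sector / df_sector) / (ss_residual / df_residual)

round(c(eta_sq = eta_sq, cohen_f = cohen_f, f_value = f_value), 3)
```

Eta squared comes out near 0.29 and Cohen's f near 0.64, well above the minimum effect size of 0.25 set in the power analysis. The recomputed F differs slightly from 14554 only because of the rounding in the printed table.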

# Visualization: Boxplot of Revenue Growth by Sector
ggplot(data, aes(x = Sector, y = Revenuegrowth, fill = Sector)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Revenue Growth by Sector", y = "Revenue Growth", x = "Sector")

Hypothesis 2

Fisher’s Significance Testing Framework

For Hypothesis 2, the null hypothesis is that there is no linear correlation between “Current Price” and “Market Cap.” We run a Pearson correlation test and treat the p-value as a measure of the strength of evidence against that null hypothesis.

Interpretation:

Despite the very weak correlation (r = 0.0108), the p-value is extremely small, so the result is statistically significant. We can reject the null hypothesis that there is no correlation between “Current Price” and “Market Cap,” but the relationship itself is very weak.
The significance of this weak correlation is most likely due to the large sample size (over 350,000 observations), which makes even very small effects detectable.
- p-value: 1.279e-10, which is much smaller than the typical significance level of 0.05.
- 95% confidence interval for the correlation: (0.0075, 0.0141). The interval excludes zero, but every value in it is close to zero.
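
To make the sample-size point concrete, the following sketch estimates the smallest correlation that a test on this many observations could reliably detect. It uses the Fisher z-transformation approximation in base R. The `alpha` and `power` values are assumptions borrowed from Hypothesis 1, since Fisher's framework does not itself fix a power level.

```r
# Sample size implied by the test's degrees of freedom (df = n - 2).
n <- 352095 + 2

# Assumed values, borrowed from the Hypothesis 1 setup.
alpha <- 0.05
power <- 0.80

# Approximate smallest correlation detectable with the stated power,
# using the Fisher z-transformation (standard error 1 / sqrt(n - 3)).
z_needed <- (qnorm(1 - alpha / 2) + qnorm(power)) / sqrt(n - 3)
r_min <- tanh(z_needed)

r_min
```

Under these assumptions the detectable correlation is below 0.005, which is smaller than the observed r of about 0.0108. That is consistent with the tiny p-value, and it shows why statistical significance says little about practical importance here.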

Plot

cor_test <- cor.test(data$Currentprice, data$Marketcap, method = "pearson")

# Print the results of the correlation test
print(cor_test)

## 
##  Pearson's product-moment correlation
## 
## data:  data$Currentprice and data$Marketcap
## t = 6.4298, df = 352095, p-value = 1.279e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.007532568 0.014137924
## sample estimates:
##        cor 
## 0.01083536
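
As a check on how the pieces of this output fit together, the sketch below rebuilds the t statistic, p-value, and confidence interval from just the reported correlation and degrees of freedom, using the standard formulas for a Pearson correlation test. It also reports r squared, the share of variance the linear relationship accounts for.

```r
# Values copied from the cor.test() output above.
r  <- 0.01083536
df <- 352095

# Coefficient of determination: share of variance in one variable
# linearly associated with the other.
r_sq <- r^2

# t statistic for H0: rho = 0, and its two-sided p-value.
t_stat  <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation (n = df + 2).
n  <- df + 2
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

c(r_sq = r_sq, t = t_stat, p = p_value, lower = ci[1], upper = ci[2])
```

The r squared value is roughly 0.0001, meaning the linear relationship accounts for about one hundredth of one percent of the variance.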

# Visualization: Scatter plot with regression line
ggplot(data, aes(x = Currentprice, y = Marketcap)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  theme_minimal() +
  labs(title = "Scatter Plot of Current Price vs Market Cap",
       x = "Current Price", y = "Market Cap") +
  theme(plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

Results

Hypothesis 1: There are statistically significant differences in average revenue growth between sectors.

Hypothesis 2: While “Current Price” and “Market Cap” have a statistically significant correlation, the effect size is extremely small (r ≈ 0.011), suggesting a minimal practical relationship.
