The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggthemes)
library(pwr)  # power analysis: pwr.anova.test()
# ggplot2 and dplyr are already attached by library(tidyverse) above
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")

Hypothesis 1

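Neyman-Pearson Framework

For Hypothesis 1, the significance level (alpha = 0.05), the target power (0.80), and the minimum effect size of interest (Cohen's f = 0.25) are fixed before testing. The ANOVA p-value is then compared with alpha to decide whether to reject or fail to reject the null hypothesis that mean revenue growth is the same in every sector.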

ANOVA Results

The one-way ANOVA compares “Revenue Growth” across 11 sectors (k = 11). The F-value is very large (F = 14554 on 10 and 352,086 degrees of freedom) and the p-value is below 2e-16, so the difference in mean revenue growth across sectors is statistically significant. The null hypothesis of equal sector means can be rejected.

Interpretation

Since the p-value is far below the alpha level of 0.05, we reject the null hypothesis and conclude that average revenue growth differs significantly across sectors. The test shows only that at least one sector mean differs from the others; it does not identify which sectors differ.
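
With a sample this large, the p-value alone says little about how big the sector differences are. As a rough complement, eta-squared can be computed from the sums of squares in the ANOVA table. The sketch below hard-codes the rounded values printed by `summary(anova_result)` (1254 and 3033), so the result is approximate; the variable names are illustrative and not part of the original analysis.

```r
# Sums of squares as printed (rounded) in the ANOVA table
ss_sector   <- 1254   # between-sector sum of squares
ss_residual <- 3033   # within-sector (residual) sum of squares

# Eta-squared: share of total variance in revenue growth explained by sector
eta_sq <- ss_sector / (ss_sector + ss_residual)

# Cohen's f, the effect-size scale used by pwr.anova.test()
cohens_f <- sqrt(eta_sq / (1 - eta_sq))

print(round(c(eta_sq = eta_sq, cohens_f = cohens_f), 3))
```

Under these rounded inputs, sector accounts for a bit under 30% of the variance in revenue growth, which suggests the observed effect is larger than the f = 0.25 assumed in the power calculation.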

Plot

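effect_size <- 0.25  # Cohen's f; 0.25 is conventionally a "medium" effect
alpha <- 0.05        # significance level (Type I error rate)
power <- 0.80        # target power (1 - Type II error rate)
num_groups <- length(unique(data$Sector))  # Number of sectors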

sample_size <- pwr.anova.test(k = num_groups, f = effect_size, sig.level = alpha, power = power)
print(sample_size)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 11
##               n = 24.46702
##               f = 0.25
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group
# Once you have enough data, perform ANOVA
anova_result <- aov(Revenuegrowth ~ Sector, data = data)
summary(anova_result)
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Sector          10   1254  125.38   14554 <2e-16 ***
## Residuals   352086   3033    0.01                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Visualization: Boxplot of Revenue Growth by Sector
ggplot(data, aes(x = Sector, y = Revenuegrowth, fill = Sector)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Revenue Growth by Sector", y = "Revenue Growth", x = "Sector")
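
The power calculation returns a fractional n per group, which has to be rounded up in practice. The sketch below does that arithmetic with the values printed above and compares the result with the number of observations implied by the ANOVA degrees of freedom. It assumes a balanced design, as `pwr.anova.test()` does, and does not check the actual per-sector counts; the variable names are illustrative.

```r
n_per_group <- 24.46702  # n reported by pwr.anova.test()
k <- 11                  # number of sectors

# Round up: a fraction of an observation cannot be collected
n_required_per_group <- ceiling(n_per_group)
n_required_total <- n_required_per_group * k

# Observations used by aov(): Sector Df + Residual Df + 1
n_observed <- 10 + 352086 + 1

print(c(per_group = n_required_per_group,
        total_required = n_required_total,
        observed = n_observed))
```

To confirm that each sector individually meets the per-group requirement, `table(data$Sector)` would show the per-sector counts.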

Hypothesis 2

Fisher’s Significance Testing Framework

For Hypothesis 2, the p-value from the Pearson correlation test is interpreted as a measure of the strength of evidence against the null hypothesis of no correlation between “Current Price” and “Market Cap.”

Interpretation:

  • The correlation is very weak (r = 0.0108), yet the p-value is extremely small, so the result is statistically significant and we can reject the null hypothesis that there is no correlation between “Current Price” and “Market Cap.”

  • The significance of such a weak correlation reflects the large sample size (df = 352,095), which makes even very small effects detectable.

  • p-value: The p-value is 1.279e-10, which is much smaller than the typical significance level of 0.05.

  • 95% Confidence Interval: The confidence interval for the correlation is (0.0075, 0.0141).
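
The figures in the bullets above are tied together by standard formulas, and recomputing them makes the gap between statistical and practical significance concrete. The sketch below hard-codes `r` and `df` from the printed `cor.test()` output and recomputes the t statistic, the p-value, the Fisher z-transformation confidence interval, and r-squared. The variable names are illustrative, and small differences from the printed values can arise from rounding.

```r
r  <- 0.01083536  # sample correlation from cor.test()
df <- 352095      # degrees of freedom (n - 2)
n  <- df + 2

# Share of variance in one variable linearly explained by the other
r_sq <- r^2

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(signif(c(r_sq = r_sq, t = t_stat, p = p_value), 4))
print(round(ci, 4))
```

The r-squared value works out to roughly 0.01% of variance explained, which is why the relationship is negligible in practice even though the p-value is tiny.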

Plot

cor_test <- cor.test(data$Currentprice, data$Marketcap, method = "pearson")

# Print the results of the correlation test
print(cor_test)
## 
##  Pearson's product-moment correlation
## 
## data:  data$Currentprice and data$Marketcap
## t = 6.4298, df = 352095, p-value = 1.279e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.007532568 0.014137924
## sample estimates:
##        cor 
## 0.01083536
# Visualization: Scatter plot with regression line
ggplot(data, aes(x = Currentprice, y = Marketcap)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  theme_minimal() +
  labs(title = "Scatter Plot of Current Price vs Market Cap",
       x = "Current Price", y = "Market Cap") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'

Results

Hypothesis 1: There are significant differences in revenue growth between sectors.

Hypothesis 2: While “Current Price” and “Market Cap” have a statistically significant correlation, the effect size is extremely small, suggesting a minimal practical relationship.