The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.
Your RMarkdown notebook for this data dive should contain the following:
After having explored your dataset over the past few weeks, you should already have some questions.
Devise two different null hypotheses based on two different aspects (e.g., columns) of your data.
For Hypothesis 1: Determine if you have enough data to perform a hypothesis test using the Neyman-Pearson framework (i.e., choose a test, alpha level, Type 2 Error, etc., and reject or fail to reject …). If you have enough data, show your sample size calculation, perform the test, and interpret the results. If there is not enough data, explain why. Make sure your alpha level, power level, and minimum effect size are intentional, i.e., explain why you chose them.
For Hypothesis 2: Perform a hypothesis test using Fisher’s Significance Testing framework (i.e., p-value, recommendation, etc.). For this, you should only need to choose a test, interpret the p-value, and give some reasoning for why we should be confident in your data and your conclusions.
Build two visualizations that best illustrate your results, one for each null hypothesis. For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggthemes)
library(pwr)  # power analysis: pwr.anova.test()
# ggplot2 and dplyr are already attached by library(tidyverse) above
data <- read.csv("C:/Users/aiden/OneDrive/mergedfile.csv")
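Neyman-Pearson Framework
For Hypothesis 1, the null hypothesis is that mean revenue growth is the same in every sector; the alternative is that at least one sector's mean differs. The test is a one-way ANOVA with alpha = 0.05, power = 0.80 (a Type 2 error rate of 0.20), and a minimum effect size of f = 0.25, all fixed before running the test. We reject the null hypothesis if the p-value falls below alpha and fail to reject it otherwise.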
The power calculation requires n = 24.47 observations per group, i.e., at least 25 per sector (275 across the 11 sectors). The ANOVA degrees of freedom (10 + 352086) imply 352,097 rows in total, so there is enough data as long as no individual sector has fewer than 25 rows.
The ANOVA compares “Revenue Growth” across the 11 sectors (k = 11). The F-value is very large (14554) and the p-value is extremely small (< 2e-16), far below our alpha level of 0.05. We therefore reject the null hypothesis of no difference between the groups and conclude that average revenue growth differs significantly across sectors.
effect_size <- 0.25
alpha <- 0.05
power <- 0.80
num_groups <- length(unique(data$Sector)) # Number of sectors
sample_size <- pwr.anova.test(k = num_groups, f = effect_size, sig.level = alpha, power = power)
print(sample_size)
##
## Balanced one-way analysis of variance power calculation
##
## k = 11
## n = 24.46702
## f = 0.25
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
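Because the power calculation reports n per group, "enough data" is a condition on each sector rather than on the total row count. The following is a minimal sketch of that check. `toy` is a hypothetical stand-in for the real data frame (the actual sector names and counts are not shown in this notebook), and `required_n` is copied from the printed output rather than recomputed.

```r
# Per-group n copied from the pwr.anova.test() output, rounded up
required_n <- ceiling(24.46702)

# Hypothetical stand-in for `data`: 11 made-up sectors with unequal row counts
toy <- data.frame(
  Sector = rep(paste0("Sector_", 1:11), times = seq(30, 130, by = 10))
)

# Rows available in each sector
sector_counts <- table(toy$Sector)

# TRUE only if every sector meets the per-group requirement
enough_data <- all(sector_counts >= required_n)
print(sector_counts)
print(enough_data)
```

On the real data, the same check would use `data$Sector` in place of `toy$Sector`.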
# Once you have enough data, perform ANOVA
anova_result <- aov(Revenuegrowth ~ Sector, data = data)
summary(anova_result)
##                 Df Sum Sq Mean Sq F value Pr(>F)
## Sector          10   1254  125.38   14554 <2e-16 ***
## Residuals   352086   3033    0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
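With this many rows almost any difference is statistically significant, so it helps to compare the observed effect size with the minimum effect size used in the power calculation (f = 0.25). The sketch below derives eta squared and Cohen's f from the sums of squares printed in the ANOVA table. Those printed values are rounded, so the results are approximate, and the variable names are illustrative.

```r
# Sums of squares as printed (rounded) in the ANOVA table
ss_between <- 1254   # Sector
ss_within  <- 3033   # Residuals

# Eta squared: share of total variance explained by sector
eta_sq <- ss_between / (ss_between + ss_within)

# Cohen's f, the effect-size scale used by pwr.anova.test()
cohen_f <- sqrt(eta_sq / (1 - eta_sq))

print(round(c(eta_sq = eta_sq, cohen_f = cohen_f), 3))
```

Both estimates (roughly 0.29 and 0.64) are well above the f = 0.25 threshold, which suggests the sector differences are large in practical terms and not only detectable because of the sample size.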
# Visualization: Boxplot of Revenue Growth by Sector
ggplot(data, aes(x = Sector, y = Revenuegrowth, fill = Sector)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Revenue Growth by Sector", y = "Revenue Growth", x = "Sector")
Fisher’s Significance Testing Framework
For Hypothesis 2, the null hypothesis is that there is no linear correlation between “Current Price” and “Market Cap.” We run a Pearson correlation test and interpret the p-value as a measure of the evidence against that null hypothesis.
Despite the very weak correlation (r = 0.0108), the p-value is extremely small, indicating that the result is statistically significant. This means that, although the correlation is very weak, we can reject the null hypothesis that there is no correlation between “Current Price” and “Market Cap.”
The significance of this weak correlation could be due to the large sample size, which makes even small effects detectable.
p-value: The p-value is 1.279e-10, which is much smaller than the typical significance level of 0.05.
95% Confidence Interval: The confidence interval for the correlation is (0.0075, 0.0141), which excludes zero but contains only very small correlations.
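To illustrate how sample size drives this p-value, the sketch below holds the observed correlation fixed and recomputes the two-sided p-value at several sample sizes, using the standard t statistic for a Pearson correlation. The helper `p_for_r` and the smaller sample sizes are hypothetical, introduced only for illustration.

```r
# Two-sided p-value for a Pearson correlation r observed on n pairs,
# using the t statistic with n - 2 degrees of freedom
p_for_r <- function(r, n) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = df)
}

r_obs <- 0.01083536   # sample correlation reported by cor.test()

# Hypothetical sample sizes, plus the actual one (df + 2 = 352097)
n_values <- c(100, 1000, 10000, 352097)
p_values <- sapply(n_values, function(n) p_for_r(r_obs, n))

print(data.frame(n = n_values, p_value = signif(p_values, 4)))
```

At the three smaller sample sizes the same correlation would not come close to the 0.05 level. Only at the full sample size does the p-value become tiny.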
cor_test <- cor.test(data$Currentprice, data$Marketcap, method = "pearson")
# Print the results of the correlation test
print(cor_test)
##
## Pearson's product-moment correlation
##
## data: data$Currentprice and data$Marketcap
## t = 6.4298, df = 352095, p-value = 1.279e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.007532568 0.014137924
## sample estimates:
## cor
## 0.01083536
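As a rough check on practical significance, the sketch below squares the reported correlation to get the share of variance explained and reconstructs the 95% confidence interval with the Fisher z-transformation. The inputs are copied from the printed output above, and the variable names are illustrative.

```r
r <- 0.01083536       # sample estimate from the output
n <- 352095 + 2       # df + 2 pairs

# Share of variance in one variable linearly explained by the other
r_squared <- r^2

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(r_squared)
print(ci)
```

The squared correlation is on the order of 0.01% of the variance, so the linear relationship is statistically detectable but explains almost nothing.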
# Visualization: Scatter plot with regression line
ggplot(data, aes(x = Currentprice, y = Marketcap)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  theme_minimal() +
  labs(title = "Scatter Plot of Current Price vs Market Cap",
       x = "Current Price", y = "Market Cap") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
Hypothesis 1: There are statistically significant differences in revenue growth between sectors.
Hypothesis 2: While “Current Price” and “Market Cap” have a statistically significant correlation, the effect size is extremely small, suggesting a minimal practical relationship.