(The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.)
Part 1: Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:
Part 2: Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data=read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
1. Part 1 -Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis
sales_total <- Superstore_data |> group_by(Category) |>
summarise(count= n(),Total_sales = sum(Sales),
.groups = 'drop') |> arrange(desc(Total_sales))
tail(sales_total)
## # A tibble: 3 × 3
## Category count Total_sales
## <chr> <int> <dbl>
## 1 Technology 1847 836154.
## 2 Furniture 2121 742000.
## 3 Office Supplies 6026 719047.
\[ H_0: \text{Average Sale remains equal for 2 type of categories (i.e Technology and Office Supplies) within products.} \]
\[ H_0: \text{AvgSales_tech = AvgSales_offsupplies} \]
alpha <- 0.05
power <- 0.8
min_effect_size <- 100
#install.packages("pwr")
library(pwr)
# Assuming you have access to the population standard deviation
population_sd <- sd(Superstore_data$Sales)
# Calculate the required sample size
sample_size <- pwr.t.test(d = min_effect_size / population_sd, sig.level = alpha, power = power, type = "two.sample")
# Print the required sample size
sample_size$n
## [1] 610.7154
From above sample size calculation, it is about 611. Since each of the categories have sufficient count of data. i.e. Tech = 1847, Office Supplies = 6026
So I believe I can implement Neyman-Pearson hypothesis test.
Neyman-Pearson Test:
tech_sales <- Superstore_data$Sales[Superstore_data$Category == "Technology"]
office_sales <- Superstore_data$Sales[Superstore_data$Category == "Office Supplies"]
# Performing a two-sample t-test
t_test_result <- t.test(tech_sales, office_sales)
# Print t-test results
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: tech_sales and office_sales
## t = 12.694, df = 1982.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 281.8807 384.8897
## sample estimates:
## mean of x mean of y
## 452.7093 119.3241
# Interpret the results
if (t_test_result$p.value < alpha) {
cat("Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.")
} else {
cat("Fail to reject the null hypothesis. There is no significant difference in average sales between Technology and Office products.")
}
## Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.As per the result, We need to reject the Null Hypothesis as there is a large difference between the average of sales in both category -> Technology and Office Supplies. If we reject the Null hypothesis then the alternate hypothesis can be considered as True, which is - Average sales of tech is not equal to Average sales of Office Supplies.
As per the alternate Hypothesis, We cant determine if the average sales done on both the categories will remain
Fishers Style Test :
# Filter the data to only include rows where the ship mode is Same Day or Standard Class
contingency_table <- Superstore_data |>
filter(Category == "Technology" | Category == "Office Supplies")
# Perform the Wilcox on rank sum test
wilcox.test(Sales ~ Category, data = contingency_table, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Sales by Category
## W = 2234566, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The p-value is very very small, which is lesser than the alpha level of 0.05. Therefore, provides strong evidence that the null hypothesis is false. In this case, the null hypothesis is that there is no difference in average profit between different Categories, namely Technology and Office Supplies
Used the wilcox test above because was getting an error with Fishers exact test.
Further able to solve the error for Fishers Exact Test,Have added the code below and commented the same. I am not quite sure on the reason but have added the code. As per the error message even tried making the values finite by rounding it off. But still observe the error.
```r
contingency_table <- Superstore_data |>
group_by(Category) |>
filter(Category == "Technology" | Category == "Office Supplies" ) |>
summarise(sales = round(mean(Sales)))
contingency_table
```
```
## # A tibble: 2 × 2
## Category sales
## <chr> <dbl>
## 1 Office Supplies 119
## 2 Technology 453
```
```r
# Perform the Fisher's exact test
#fisher.test(contingency_table , alternative = "two.sided")
# fisher.test(contingency_table$Category, contingency_table$sales , alternative = "two.sided")
```
2. Part 2 - - Build a visualizations that best illustrate the results from the two pairs of hypothesis tests.
count_category <- Superstore_data |> group_by(Category) |>
summarise(Category_count=n(),Total_sales = sum(Sales), Average_Sales = mean(Sales),
.groups = 'drop') |> arrange(desc(Category_count))
tail(count_category)
## # A tibble: 3 × 4
## Category Category_count Total_sales Average_Sales
## <chr> <int> <dbl> <dbl>
## 1 Office Supplies 6026 719047. 119.
## 2 Furniture 2121 742000. 350.
## 3 Technology 1847 836154. 453.
# Create a bar chart
ggplot(contingency_table, aes(x = Category, y = sales, fill = Category)) +
geom_bar(stat = "identity") +
labs(title = "Average Sales by Product Category", x = "Category", y = "Average Sales") +
theme_minimal()
From the above plot, can see that Average Sales of Office Supplies is far less than what it is for the Technology. So can say that the idea of rejecting Null hypothesis stating the difference between both is 0, is totally sensible. That can be clearly seen in the above plot.
Box plot also helps visualize the Sales for Tech and Office Supplies better. It is spread across the range for both technology and Sales.
count_box <- Superstore_data |>
filter(Category == "Technology" | Category == "Office Supplies")
tail(count_box)
## Ship.Mode Segment Country City State Postal.Code
## 7868 Standard Class Consumer United States Los Angeles California 90008
## 7869 Standard Class Corporate United States Athens Georgia 30605
## 7870 Standard Class Corporate United States Athens Georgia 30605
## 7871 Standard Class Consumer United States Costa Mesa California 92627
## 7872 Standard Class Consumer United States Costa Mesa California 92627
## 7873 Second Class Consumer United States Westminster California 92683
## Region Category Sub.Category Sales Quantity Discount Profit
## 7868 West Technology Accessories 36.240 1 0.0 15.2208
## 7869 South Technology Accessories 79.990 1 0.0 28.7964
## 7870 South Technology Phones 206.100 5 0.0 55.6470
## 7871 West Technology Phones 258.576 2 0.2 19.3932
## 7872 West Office Supplies Paper 29.600 4 0.0 13.3200
## 7873 West Office Supplies Appliances 243.160 2 0.0 72.9480
ggplot(count_box, aes(x = Category, y = Sales)) +
geom_boxplot() +
labs(title = " Sales by Catgeory", x = "Category", y = "Sales") +
theme_minimal()
profit_mean <- Superstore_data |> group_by(Ship.Mode) |>
summarise(count= n(),Mean_Profit = mean(Profit),
.groups = 'drop') |> arrange(desc(Mean_Profit))
tail(profit_mean)
## # A tibble: 4 × 3
## Ship.Mode count Mean_Profit
## <chr> <int> <dbl>
## 1 First Class 1538 31.8
## 2 Second Class 1945 29.5
## 3 Same Day 543 29.3
## 4 Standard Class 5968 27.5
Considering the various shipmode under which the products are sold and Total profit in each shipmode, to be somewhat similar.
Let us assume the first Null Hypothesis to be : There is no difference in average profit between different ship modes, here Same Day and Standard Class.
\[ H_0: \text{Average Profit remains equal for 2 type of shipmode (i.e Same Day and Standard Class) within products.} \]
\[ H_0: \text{AvgProfit_Sameday = AvgProfit_Standard} \]
alpha <- 0.05
power <- 0.8
min_effect_size <- 100
#install.packages("pwr")
library(pwr)
# Assuming you have access to the population standard deviation
population_sd <- sd(Superstore_data$Profit)
# Calculate the required sample size
sample_size <- pwr.t.test(d = min_effect_size / population_sd, sig.level = alpha, power = power, type = "two.sample")
# Print the required sample size
sample_size$n
## [1] 87.11648
From above sample size calculation, it is about 87. Since each of the ship mode have sufficient count of data. i.e. Same Day = 543, Standard Class = 5968
So I believe I can implement Neyman-Pearson hypothesis test.
Neyman-Pearson Test:
tech_sales <- Superstore_data$Sales[Superstore_data$Ship.Mode == "Same Day"]
office_sales <- Superstore_data$Sales[Superstore_data$Ship.Mode == "Standard Class"]
# Performing a two-sample t-test
t_test_result <- t.test(tech_sales, office_sales)
# Print t-test results
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: tech_sales and office_sales
## t = 0.34895, df = 683.12, p-value = 0.7272
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -40.77616 58.40239
## sample estimates:
## mean of x mean of y
## 236.3962 227.5831
# Interpret the results
if (t_test_result$p.value < alpha) {
cat("Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.")
} else {
cat("Fail to reject the null hypothesis. There is no significant difference in average sales between Technology and Office products.")
}
## Fail to reject the null hypothesis. There is no significant difference in average sales between Technology and Office products.As per the result, We need to Accept the Null Hypothesis as there
is a less difference between the average of profit in both ship modes
-> Same Day and Standard Class.
In other words, the average profit for Same Day shipping is not significantly higher than the average profit for Standard Class shipping.
Fishers Style Test :
# Filter the data to only include rows where the ship mode is Same Day or Standard Class
contingency_table <- Superstore_data |>
filter(Ship.Mode == "Same Day" | Ship.Mode == "Standard Class")
# Perform the Wilcoxon rank sum test
wilcox.test(Profit ~ Ship.Mode, data = contingency_table, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: Profit by Ship.Mode
## W = 1653851, p-value = 0.4238
## alternative hypothesis: true location shift is not equal to 0
The p-value is 0.4238, which is greater than the alpha level of 0.05. Therefore, we accept the null hypothesis and conclude that there is a not a significant difference in average profit between the two ship modes.
Used the wilcox test for testing significance above, because was getting an error with Fishers exact test. Have added the Fishers exact code below and commented the same.
library(stats)
contingency_table <- Superstore_data |>
filter(Ship.Mode == "Same Day" | Ship.Mode == "Standard Class") |>
group_by(Ship.Mode) |>
summarise(profit = mean(Profit))
contingency_table
## # A tibble: 2 × 2
## Ship.Mode profit
## <chr> <dbl>
## 1 Same Day 29.3
## 2 Standard Class 27.5
# Perform the Fisher's exact test
#fisher.test(contingency_table, alternative = "two.sided")
# fisher.test(contingency_table$Ship.Mode, contingency_table$profit , alternative = "two.sided")
2. Part 2 -- Build a visualizations that best
illustrate the results from the two pairs of hypothesis tests.
- Bar graph alone helps represent the visualization better
count_shipmode <- Superstore_data |> group_by(Ship.Mode) |>
filter() |>
summarise(ShipMode_count=n(),Total_proft = sum(Profit), Average_Profit = mean(Profit),
.groups = 'drop') |> arrange(desc(ShipMode_count))
tail(count_shipmode)
## # A tibble: 4 × 4
## Ship.Mode ShipMode_count Total_proft Average_Profit
## <chr> <int> <dbl> <dbl>
## 1 Standard Class 5968 164089. 27.5
## 2 Second Class 1945 57447. 29.5
## 3 First Class 1538 48970. 31.8
## 4 Same Day 543 15892. 29.3
# Create a bar chart
ggplot(contingency_table, aes(x = Ship.Mode, y = profit, fill = Ship.Mode)) +
geom_bar(stat = "identity") +
labs(title = "Average Profit by Product shipMode", x = "Ship.Mode", y = "Average Profit") +
theme_minimal()
From the above plot, can see that Average Profit of Ship mode Standard Class is almost equal to ShipMode - Same Day. So can say that the idea of accepting the Null hypothesis stating the difference between both is 0 is totally sensible. That can be clearly seen in the above plot.