ASSIGNMENT 7

Task(s)

(The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.)

  • Part 1: Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:

    • Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.
    • Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, perform one and interpret results. If not, explain why.
    • Perform a Fisher’s style test for significance, and interpret the p-value.
  • So, technically, you have two hypothesis tests for each hypothesis, equating to four total tests.
  • Part 2: Build two visualizations that best illustrate the results from the two pairs of hypothesis tests, one for each null hypothesis.


Read the Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Superstore_data <- read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
##        Ship.Mode   Segment       Country            City      State Postal.Code
## 1   Second Class  Consumer United States       Henderson   Kentucky       42420
## 2   Second Class  Consumer United States       Henderson   Kentucky       42420
## 3   Second Class Corporate United States     Los Angeles California       90036
## 4 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 5 Standard Class  Consumer United States Fort Lauderdale    Florida       33311
## 6 Standard Class  Consumer United States     Los Angeles California       90032
##   Region        Category Sub.Category    Sales Quantity Discount    Profit
## 1  South       Furniture    Bookcases 261.9600        2     0.00   41.9136
## 2  South       Furniture       Chairs 731.9400        3     0.00  219.5820
## 3   West Office Supplies       Labels  14.6200        2     0.00    6.8714
## 4  South       Furniture       Tables 957.5775        5     0.45 -383.0310
## 5  South Office Supplies      Storage  22.3680        2     0.20    2.5164
## 6   West       Furniture  Furnishings  48.8600        7     0.00   14.1694

1. Part 1 - Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:

  1. Null Hypothesis -
    sales_total <- Superstore_data |>
      group_by(Category) |>
      summarise(count = n(), Total_sales = sum(Sales), .groups = "drop") |>
      arrange(desc(Total_sales))
    tail(sales_total)
## # A tibble: 3 × 3
##   Category        count Total_sales
##   <chr>           <int>       <dbl>
## 1 Technology       1847     836154.
## 2 Furniture        2121     742000.
## 3 Office Supplies  6026     719047.
  • Looking at the categories under which products are sold, the total sales in each category appear roughly similar for 2016.
  • Two categories are chosen for the purpose of generating a null hypothesis: Technology and Office Supplies. Technology, because its products are widely used around the world; Office Supplies, because new companies open and expand every day.
  • Let us take the first null hypothesis to be: Average Sales of Technology = Average Sales of Office Supplies.

\[ H_0: \text{Average sales are equal for the two categories (Technology and Office Supplies).} \]

\[ H_0: \mu_{\text{Tech}} = \mu_{\text{Office Supplies}} \qquad H_a: \mu_{\text{Tech}} \neq \mu_{\text{Office Supplies}} \]

  • Alpha level - \(\alpha\) defines the rejection region: if the observed test statistic falls in this region, we reject the null hypothesis in favor of the alternative.
  • Equivalently, \(\alpha\) is the probability of making a Type I error, which is rejecting the null hypothesis when it is actually true.
  • Taking \(\alpha = 0.05\) keeps the false positive rate (rejecting the null hypothesis when it is in fact true) low for the comparison of the two categories, Technology and Office Supplies.
   alpha <- 0.05
  • Power level - power (1-\(\beta\)) is the probability of rejecting the null hypothesis when it is actually false, i.e., of avoiding a Type II error. We want this to be high so the test detects a real difference when one exists. Let us set it to the conventional 0.80.
    power <- 0.8
  • Minimum effect size - the smallest difference in average sales between the two categories that we want the test to be able to detect. Let us set it to $100, i.e., a minimum difference of $100 between the average sales of Technology and Office Supplies.
    min_effect_size <- 100
  • Sample size calculation:
    • To determine whether there is enough data to perform a Neyman-Pearson hypothesis test.
    • Using the pwr library to calculate the required sample size: how many observations are needed to achieve the chosen power for the chosen effect size at the chosen significance level.
   #install.packages("pwr")
   library(pwr)
    # Use the overall sample standard deviation of Sales as a stand-in for the population SD
    population_sd <- sd(Superstore_data$Sales)

    # Calculate the required sample size
    sample_size <- pwr.t.test(d = min_effect_size / population_sd,
                              sig.level = alpha, power = power,
                              type = "two.sample")

    # Print the required sample size
    sample_size$n
## [1] 610.7154
  • The calculation above calls for about 611 observations per group. Both categories comfortably exceed this count (Technology = 1847, Office Supplies = 6026).

  • So I believe I can perform a Neyman-Pearson hypothesis test.

  • Neyman-Pearson Test:

    tech_sales <- Superstore_data$Sales[Superstore_data$Category == "Technology"]
    office_sales <- Superstore_data$Sales[Superstore_data$Category == "Office Supplies"]
    
    # Performing a two-sample t-test
    t_test_result <- t.test(tech_sales, office_sales)
    
    # Print t-test results
    print(t_test_result)
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  tech_sales and office_sales
    ## t = 12.694, df = 1982.1, p-value < 2.2e-16
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  281.8807 384.8897
    ## sample estimates:
    ## mean of x mean of y 
    ##  452.7093  119.3241
    # Interpret the results
    if (t_test_result$p.value < alpha) {
      cat("Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.")
    } else {
      cat("Fail to reject the null hypothesis. There is no significant difference in average   sales between Technology and Office products.")
    }
    ## Reject the null hypothesis. There is a significant difference in average sales between Technology and Office products.
  • As per the result, we reject the null hypothesis: there is a large difference between the average sales in the two categories, Technology and Office Supplies. If we reject the null hypothesis, the alternative can be considered true: average sales of Technology are not equal to average sales of Office Supplies.

  • Under the alternative hypothesis, we can no longer assume that the average sales of the two categories remain equal. A quick check of the observed difference against the pre-chosen minimum effect size is sketched below.
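
As a quick check (a sketch, not part of the original output), the observed difference in means from the t-test above can be compared with the $100 minimum effect size and expressed as a standardized effect size, reusing `tech_sales`, `office_sales`, and `min_effect_size` from above:

```r
# Sketch: compare the observed difference in means with the pre-chosen minimum effect size
obs_diff   <- mean(tech_sales) - mean(office_sales)  # observed difference in average sales
overall_sd <- sd(Superstore_data$Sales)              # same SD used in the power calculation
cohens_d   <- obs_diff / overall_sd                  # standardized effect size

obs_diff > min_effect_size  # TRUE: the observed gap (~$333) exceeds the $100 threshold
cohens_d
```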

  • Fisher's Style Test:

    • Perform a Fisher’s style test for significance, and interpret the p-value.
# Filter the data to only include rows where the category is Technology or Office Supplies
tech_office <- Superstore_data |>
  filter(Category == "Technology" | Category == "Office Supplies")

# Perform the Wilcoxon rank sum test
wilcox.test(Sales ~ Category, data = tech_office, alternative = "two.sided")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Sales by Category
## W = 2234566, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
  • The p-value is extremely small, far below the alpha level of 0.05, so it provides strong evidence against the null hypothesis. Here the null hypothesis is that there is no difference in average sales between the two categories, Technology and Office Supplies.

  • The Wilcoxon rank sum test was used above as the Fisher-style test because Fisher's exact test was producing an error.

  • The cause of the error: fisher.test() operates on a contingency table of counts, while the table built below contains category names and mean sales, which are not counts. As the error message suggested, I even tried making the values finite by rounding them off, but the error persists for this reason. The attempted code is included below, commented out.

```r
# Mean sales per category (a summary table, not a true contingency table of counts)
avg_sales_by_category <- Superstore_data |>
  filter(Category == "Technology" | Category == "Office Supplies") |>
  group_by(Category) |>
  summarise(sales = round(mean(Sales)))

avg_sales_by_category
```

```
## # A tibble: 2 × 2
##   Category        sales
##   <chr>           <dbl>
## 1 Office Supplies   119
## 2 Technology        453
```



```r
# Fisher's exact test cannot run on this table: fisher.test() requires a
# contingency table of counts, not a column of category means.
#fisher.test(avg_sales_by_category, alternative = "two.sided")
# fisher.test(avg_sales_by_category$Category, avg_sales_by_category$sales, alternative = "two.sided")
```
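
For reference, a version of Fisher's exact test that does run is sketched below. fisher.test() needs a contingency table of counts, so Sales must first be discretized; splitting at the overall median is an illustrative choice here, not part of the original analysis.

```r
# Sketch: build a count-based contingency table that fisher.test() accepts.
# The median split of Sales is a hypothetical, illustrative choice.
sales_level <- ifelse(tech_office$Sales > median(tech_office$Sales),
                      "Above median", "At/below median")
counts <- table(tech_office$Category, sales_level)
counts

# Fisher's exact test on the 2x2 table of counts
fisher.test(counts, alternative = "two.sided")
```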

2. Part 2 - Build visualizations that best illustrate the results from the two pairs of hypothesis tests.

    count_category <- Superstore_data |>
      group_by(Category) |>
      summarise(Category_count = n(), Total_sales = sum(Sales),
                Average_Sales = mean(Sales), .groups = "drop") |>
      arrange(desc(Category_count))
    tail(count_category)
## # A tibble: 3 × 4
##   Category        Category_count Total_sales Average_Sales
##   <chr>                    <int>       <dbl>         <dbl>
## 1 Office Supplies           6026     719047.          119.
## 2 Furniture                 2121     742000.          350.
## 3 Technology                1847     836154.          453.
# Create a bar chart of the category means computed above
ggplot(avg_sales_by_category, aes(x = Category, y = sales, fill = Category)) +
  geom_col() +
  labs(title = "Average Sales by Product Category", x = "Category", y = "Average Sales") +
  theme_minimal()

  • From the above plot, we can see that the average sales of Office Supplies are far lower than those of Technology, so rejecting the null hypothesis that the difference between them is 0 is entirely sensible. That is clearly visible in the plot.

  • A box plot also helps visualize Sales for Technology and Office Supplies: it shows how Sales are spread across their range in each category.

    count_box <- Superstore_data |>
      filter(Category == "Technology" | Category == "Office Supplies")
    tail(count_box)
##           Ship.Mode   Segment       Country        City      State Postal.Code
## 7868 Standard Class  Consumer United States Los Angeles California       90008
## 7869 Standard Class Corporate United States      Athens    Georgia       30605
## 7870 Standard Class Corporate United States      Athens    Georgia       30605
## 7871 Standard Class  Consumer United States  Costa Mesa California       92627
## 7872 Standard Class  Consumer United States  Costa Mesa California       92627
## 7873   Second Class  Consumer United States Westminster California       92683
##      Region        Category Sub.Category   Sales Quantity Discount  Profit
## 7868   West      Technology  Accessories  36.240        1      0.0 15.2208
## 7869  South      Technology  Accessories  79.990        1      0.0 28.7964
## 7870  South      Technology       Phones 206.100        5      0.0 55.6470
## 7871   West      Technology       Phones 258.576        2      0.2 19.3932
## 7872   West Office Supplies        Paper  29.600        4      0.0 13.3200
## 7873   West Office Supplies   Appliances 243.160        2      0.0 72.9480
    ggplot(count_box, aes(x = Category, y = Sales)) +
      geom_boxplot() +
      labs(title = "Sales by Category", x = "Category", y = "Sales") +
      theme_minimal()
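
Since Sales spans a very wide range in both categories, a log-scaled version of the same box plot can make the comparison easier to read. This variant is a sketch, not part of the original output:

```r
# Optional sketch: the same box plot on a log10 scale to handle the wide range of Sales
ggplot(count_box, aes(x = Category, y = Sales)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Sales by Category (log scale)", x = "Category", y = "Sales (log10)") +
  theme_minimal()
```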

----

  2. Null Hypothesis -
    profit_mean <- Superstore_data |>
      group_by(Ship.Mode) |>
      summarise(count = n(), Mean_Profit = mean(Profit), .groups = "drop") |>
      arrange(desc(Mean_Profit))
    tail(profit_mean)
## # A tibble: 4 × 3
##   Ship.Mode      count Mean_Profit
##   <chr>          <int>       <dbl>
## 1 First Class     1538        31.8
## 2 Second Class    1945        29.5
## 3 Same Day         543        29.3
## 4 Standard Class  5968        27.5

\[ H_0: \text{Average profit is equal for the two ship modes (Same Day and Standard Class).} \]

\[ H_0: \mu_{\text{Same Day}} = \mu_{\text{Standard Class}} \qquad H_a: \mu_{\text{Same Day}} \neq \mu_{\text{Standard Class}} \]

    # Same alpha, power, and minimum effect size as for the first hypothesis, for the same reasons
    alpha <- 0.05
    power <- 0.8
    min_effect_size <- 100

    #install.packages("pwr")
    library(pwr)

    # Use the overall sample standard deviation of Profit as a stand-in for the population SD
    population_sd <- sd(Superstore_data$Profit)

    # Calculate the required sample size
    sample_size <- pwr.t.test(d = min_effect_size / population_sd,
                              sig.level = alpha, power = power,
                              type = "two.sample")

    # Print the required sample size
    sample_size$n
## [1] 87.11648
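The required sample size is about 88 per group, and both ship modes exceed it (Same Day = 543, Standard Class = 5968), so a Neyman-Pearson test is feasible here too. A sketch of the corresponding two-sample t-test (its output was not part of the original run):

```r
# Sketch: Neyman-Pearson style two-sample t-test for Profit by ship mode
sameday_profit  <- Superstore_data$Profit[Superstore_data$Ship.Mode == "Same Day"]
standard_profit <- Superstore_data$Profit[Superstore_data$Ship.Mode == "Standard Class"]

t_test_profit <- t.test(sameday_profit, standard_profit)
t_test_profit

# Decision at the pre-chosen alpha level
if (t_test_profit$p.value < alpha) {
  cat("Reject the null hypothesis of equal average profit between Same Day and Standard Class.")
} else {
  cat("Fail to reject the null hypothesis of equal average profit between Same Day and Standard Class.")
}
```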
# Filter the data to only include rows where the ship mode is Same Day or Standard Class
same_standard <- Superstore_data |>
  filter(Ship.Mode == "Same Day" | Ship.Mode == "Standard Class")

# Perform the Wilcoxon rank sum test
wilcox.test(Profit ~ Ship.Mode, data = same_standard, alternative = "two.sided")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Profit by Ship.Mode
## W = 1653851, p-value = 0.4238
## alternative hypothesis: true location shift is not equal to 0
  • The p-value (0.4238) is well above the alpha level of 0.05, so this test provides no evidence against the null hypothesis: we fail to reject the claim that average profit is equal for Same Day and Standard Class shipping.

    # stats is attached by default, so no extra library() call is needed
    # Mean profit per ship mode (a summary table, not a true contingency table of counts)
    avg_profit_by_shipmode <- Superstore_data |>
      filter(Ship.Mode == "Same Day" | Ship.Mode == "Standard Class") |>
      group_by(Ship.Mode) |>
      summarise(profit = mean(Profit))

    avg_profit_by_shipmode
## # A tibble: 2 × 2
##   Ship.Mode      profit
##   <chr>           <dbl>
## 1 Same Day         29.3
## 2 Standard Class   27.5
    # Fisher's exact test cannot run on this table either: fisher.test()
    # requires a contingency table of counts, not group means.
    #fisher.test(avg_profit_by_shipmode, alternative = "two.sided")
    # fisher.test(avg_profit_by_shipmode$Ship.Mode, avg_profit_by_shipmode$profit, alternative = "two.sided")
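
As with the first hypothesis, fisher.test() would run on a table of counts. A natural illustrative split for Profit is whether an order made a profit at all; this sketch was not part of the original analysis:

```r
# Sketch: contingency table of ship mode vs. whether the order was profitable
profit_flag <- ifelse(same_standard$Profit > 0, "Profitable", "Not profitable")
counts <- table(same_standard$Ship.Mode, profit_flag)
counts

# Fisher's exact test on the 2x2 table of counts
fisher.test(counts, alternative = "two.sided")
```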

2. Part 2 - Build visualizations that best illustrate the results from the hypothesis tests.
- A bar chart alone illustrates this comparison well.

    count_shipmode <- Superstore_data |>
      group_by(Ship.Mode) |>
      summarise(ShipMode_count = n(), Total_profit = sum(Profit),
                Average_Profit = mean(Profit), .groups = "drop") |>
      arrange(desc(ShipMode_count))
    tail(count_shipmode)
## # A tibble: 4 × 4
##   Ship.Mode      ShipMode_count Total_profit Average_Profit
##   <chr>                   <int>        <dbl>          <dbl>
## 1 Standard Class           5968      164089.           27.5
## 2 Second Class             1945       57447.           29.5
## 3 First Class              1538       48970.           31.8
## 4 Same Day                  543       15892.           29.3
# Create a bar chart of the ship-mode means computed above
ggplot(avg_profit_by_shipmode, aes(x = Ship.Mode, y = profit, fill = Ship.Mode)) +
  geom_col() +
  labs(title = "Average Profit by Ship Mode", x = "Ship Mode", y = "Average Profit") +
  theme_minimal()

From the above plot, we can see that the average profit for Standard Class is almost equal to that for Same Day. So failing to reject the null hypothesis, which states that the difference between them is 0, is entirely sensible; that is clearly visible in the plot.