Online_Retail <- read.csv('C:/Users/laasy/Documents/Fall 2023/Intro to Statistics in R/Datasets for Final Project/OnlineRetail.csv')
summary(Online_Retail)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##  InvoiceDate          UnitPrice           CustomerID       Country         
##  Length:541909      Min.   :-11062.06   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.25   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.08   Median :15152    Mode  :character  
##                     Mean   :     4.61   Mean   :15288                      
##                     3rd Qu.:     4.13   3rd Qu.:16791                      
##                     Max.   : 38970.00   Max.   :18287                      
##                                         NA's   :135080
library(infer)
library(ggplot2)
# Load dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Preparation:

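# Keep only rows with a positive Quantity (drops zero and negative quantities).
# Rows with a missing CustomerID are kept at this stage and removed later, before the CustomerID analysis.
df <- subset(Online_Retail, Quantity > 0)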

Hypothesis Testing:

Hypothesis 1: There is a relationship between Quantity and UnitPrice in the data.

Null Hypothesis 1: There is no relationship between Quantity and UnitPrice in the data.

Hypothesis 2: There is a relationship between CustomerID and UnitPrice in the data.

Null Hypothesis 2: There is no relationship between CustomerID and UnitPrice in the data.

Null Hypothesis 1:

I hypothesize there is no relationship between Quantity and UnitPrice.

# Sample Size for Null Hypothesis 1
sample_size_hypothesis_1 <- nrow(Online_Retail[!is.na(Online_Retail$Quantity) & !is.na(Online_Retail$UnitPrice), ])
# Print the calculated values
cat("Sample Size for Null Hypothesis 1:", sample_size_hypothesis_1, "\n")
## Sample Size for Null Hypothesis 1: 541909

Alpha, Power, Effect size:

# Set alpha, power, effect size
alpha <- 0.05
power <- 0.8 
effect_size <- 0.2

I chose an alpha of 0.05 because it is the most widely used Type I error rate in statistical analysis: at 5%, it offers a reasonable balance between detecting true effects and avoiding false positives. A power of 0.8 means the test has an 80% probability of correctly rejecting the null hypothesis when an effect of the assumed size is real, which is the conventional standard for limiting Type II errors. An effect size of 0.2 lies between Cohen’s small (0.1) and medium (0.3) benchmarks for a correlation. I expected any relationship between Quantity and UnitPrice to be small, so designing the test around this value keeps it sensitive to modest but substantive effects.
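To see what these three settings imply together, the sketch below estimates the sample size needed to detect a correlation of the assumed size. It relies on the Fisher z approximation, which is an assumption of this example and not a method used elsewhere in the report; the helper name `n_for_correlation` is hypothetical.

```r
# Approximate sample size needed to detect a correlation r with a two-sided
# test, using the Fisher z approximation (an assumption of this sketch).
n_for_correlation <- function(r, alpha, power) {
  z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
  z_power <- qnorm(power)          # normal quantile matching the target power
  ceiling(((z_alpha + z_power) / atanh(r))^2 + 3)
}

# Settings chosen for Null Hypothesis 1
n_required <- n_for_correlation(r = 0.2, alpha = 0.05, power = 0.8)
print(n_required)
```

Under this approximation, about 194 observations would be enough, so a dataset with hundreds of thousands of rows has far more power than the design calls for.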

Neyman-Pearson:

The full dataset contains 541909 rows with non-missing values for Quantity and UnitPrice; after filtering to Quantity > 0, 531285 rows remain for the analysis (the correlation test below reports df = 531283, which is n - 2). A common rule of thumb calls for at least 30-50 observations for a Pearson correlation test between two continuous variables, so the sample is far more than sufficient to test this hypothesis under the Neyman-Pearson approach.

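# Neyman-Pearson approach: F test for the slope of UnitPrice on Quantity,
# with the resulting p-value compared against the pre-set alpha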
model <- lm(UnitPrice ~ Quantity, data=df)
f_stat <- anova(model)$`F value`[1] 
p_value <- pf(f_stat, 1, nrow(df)-2, lower.tail=FALSE)
cat("Neyman-Pearson Test p-value:", p_value, "\n")
## Neyman-Pearson Test p-value: 0.0192005

Fisher’s test

# Fisher's test
cor.test(df$Quantity, df$UnitPrice, method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  df$Quantity and df$UnitPrice
## t = -2.3416, df = 531283, p-value = 0.0192
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0059014730 -0.0005236067
## sample estimates:
##          cor 
## -0.003212563

The correlation coefficient from the Fisher’s test is -0.003212563. This is a very small negative correlation.
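To put the size of this coefficient in perspective, the sketch below recovers r from the t statistic and degrees of freedom printed in the output above, then squares it to get the share of variance involved. The variable names are illustrative; the two input values are copied from the `cor.test()` output.

```r
# Values copied from the cor.test() output above
t_stat <- -2.3416
dof    <- 531283

# Recover r from t and the degrees of freedom: r = t / sqrt(t^2 + df)
r <- t_stat / sqrt(t_stat^2 + dof)

# Share of the variance in UnitPrice that is linearly associated with Quantity
r_squared <- r^2

cat("r =", signif(r, 4), " r^2 =", signif(r_squared, 3), "\n")
```

The squared correlation is on the order of 1e-05, so Quantity accounts for roughly a thousandth of a percent of the variation in UnitPrice.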

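Because the p-value (0.0192) is below the chosen alpha of 0.05, we reject the null hypothesis. Although the correlation is very weak, it is statistically significant, meaning there is evidence of a relationship between Quantity and UnitPrice in the data. With more than half a million observations, even a negligible correlation can reach significance, so this result says little about practical importance.

Visualization: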

ggplot(df, aes(Quantity, UnitPrice)) +
  geom_point() +
  geom_smooth(method='lm')
## `geom_smooth()` using formula = 'y ~ x'

In conclusion, while the correlation is small, the hypothesis test indicates it is significant, so we reject the null hypothesis of no relationship.

Null Hypothesis 2

I hypothesize no relationship between CustomerID and UnitPrice.

Alpha, Power, Effect size:

alpha <- 0.01 
power <- 0.9
effect_size <- 0.4

For this test, I used a stricter alpha of 0.01 to further control Type I errors, since I had less basis for expecting a relationship here. The higher power of 0.9 offsets the stricter alpha by keeping a good probability of detecting a relationship if one is present. I assumed a larger potential effect size because differences between customers could produce broader gaps in UnitPrice. A value of 0.4 falls between the medium (0.3) and large (0.5) benchmarks that Cohen gives for a correlation.

Neyman-Pearson:

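The dataset contains 4373 unique CustomerID values; because unique() counts NA as a value and many rows have a missing CustomerID, this corresponds to 4372 identified customers. With 397924 rows used in the test (the output below reports 397922 residual degrees of freedom), the average is on the order of 90 rows per customer, although the number of rows for each customer was not checked individually. On that basis, the sample size appears sufficient to run the ANOVA for this hypothesis under the Neyman-Pearson approach.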
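The count of unique IDs and the per-customer group sizes can both be checked directly. The sketch below uses a small hypothetical vector in place of `Online_Retail$CustomerID`, and a threshold of 2 rows in place of 30, so that the toy data gives a non-trivial answer; all names in it are illustrative.

```r
# Hypothetical toy IDs standing in for Online_Retail$CustomerID
ids <- c(101, 101, 101, 102, 102, 103, NA, NA)

# unique() treats NA as a distinct value, so this count includes it
n_unique_with_na <- length(unique(ids))

# Count only the real (non-missing) IDs
n_customers <- length(unique(ids[!is.na(ids)]))

# Rows per customer; table() drops NA by default
rows_per_customer <- as.vector(table(ids))

# Share of customers with at least `min_rows` rows
min_rows <- 2
share_large_enough <- mean(rows_per_customer >= min_rows)

cat("unique incl. NA:", n_unique_with_na,
    " customers:", n_customers,
    " share with enough rows:", round(share_large_enough, 2), "\n")
```

Applied to the real column with `min_rows <- 30`, the same steps would show what fraction of customers actually meets the 30-row rule of thumb.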

# Number of unique CustomerID values for Null Hypothesis 2
# (unique() counts NA as one value, so missing IDs add 1 to the total)
unique_customer_ids <- unique(Online_Retail$CustomerID)
num_unique_customer_ids <- length(unique_customer_ids)

cat("Number of Unique CustomerID Values for Null Hypothesis 2:", num_unique_customer_ids, "\n")
## Number of Unique CustomerID Values for Null Hypothesis 2: 4373
df <- df[complete.cases(df[,c("CustomerID", "UnitPrice")]),]

# Run ANOVA (CustomerID is numeric here, so it enters the model as a single term with Df = 1)
model <- aov(UnitPrice ~ CustomerID, data=df)

# Get summary
summary(model)
##                 Df    Sum Sq Mean Sq F value   Pr(>F)    
## CustomerID       1     22927   22927   46.96 7.25e-12 ***
## Residuals   397922 194270157     488                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pr(>F) is the p-value. A p-value of 7.25e-12 is extremely small and indicates a statistically significant result. The *** notation means the p-value is less than 0.001.

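Since this p-value is below the 0.01 significance level chosen for this hypothesis, we reject the null hypothesis. Note that CustomerID is stored as a number, so aov() fits a single linear trend in UnitPrice across the ID values (Df = 1) rather than comparing the mean UnitPrice of each customer group. The result therefore shows a statistically significant linear association between the numeric CustomerID and UnitPrice, not a difference between customer groups.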
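One detail in the table above is worth checking: the CustomerID term has Df = 1. aov() assigns one degree of freedom to a numeric predictor and k - 1 to a factor with k levels. The sketch below illustrates the difference on simulated data (not the retail data); the `toy` data frame and its values are hypothetical.

```r
set.seed(1)  # reproducible simulated data, not the retail data

# Hypothetical toy data: 5 customer IDs with 20 rows each
toy <- data.frame(
  CustomerID = rep(c(12346, 12347, 12348, 12349, 12350), each = 20),
  UnitPrice  = rexp(100, rate = 0.5)
)

# Numeric ID: aov() fits one slope, so the term has 1 degree of freedom
fit_numeric <- aov(UnitPrice ~ CustomerID, data = toy)
df_numeric  <- summary(fit_numeric)[[1]]$Df[1]

# Factor ID: aov() compares group means, so the term has k - 1 = 4 df
fit_factor <- aov(UnitPrice ~ factor(CustomerID), data = toy)
df_factor  <- summary(fit_factor)[[1]]$Df[1]

cat("Df with numeric ID:", df_numeric, " Df with factor ID:", df_factor, "\n")
```

A group comparison on the real data would use `factor(CustomerID)` in the same way, though with several thousand levels the model would be considerably heavier to fit.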

In summary:

The ANOVA p-value is 7.25e-12, which is extremely small and denotes high significance. We reject the null hypothesis at alpha = 0.01. The test indicates a statistically significant association between UnitPrice and CustomerID.

Fisher’s test

# Fisher's test
cor.test(df$CustomerID, df$UnitPrice, method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  df$CustomerID and df$UnitPrice
## t = -6.8528, df = 397922, p-value = 7.25e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.013969481 -0.007756114
## sample estimates:
##        cor 
## -0.0108629

The correlation coefficient from the Fisher’s test is -0.0108629. This is a very small negative correlation. A coefficient this close to 0 suggests a very weak linear relationship between the variables, and it is unlikely to have practical significance.
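The Pearson test reports the same p-value as the ANOVA table earlier (7.25e-12), and its squared t statistic, (-6.8528)^2, is approximately 46.96, which matches the F value. This is expected whenever the model has one numeric predictor. The sketch below shows the equivalence on simulated data; the variables `x` and `y` are illustrative and unrelated to the retail dataset.

```r
set.seed(42)  # simulated data for illustration only

x <- rnorm(200)
y <- 0.1 * x + rnorm(200)

# Pearson correlation test (based on a t statistic)
ct <- cor.test(x, y, method = "pearson")

# F test for the single slope in a simple linear regression
f_table <- anova(lm(y ~ x))
f_stat  <- f_table$`F value`[1]
p_f     <- f_table$`Pr(>F)`[1]

# For one numeric predictor, F equals t^2 and the p-values coincide
cat("t^2 =", unname(ct$statistic)^2, " F =", f_stat, "\n")
cat("p (cor.test) =", ct$p.value, " p (F test) =", p_f, "\n")
```

Because the two procedures are the same test in different forms, agreement between them does not count as independent confirmation of the result.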

Visualization:

boxplot(UnitPrice ~ CustomerID, data=df[1:100,])

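In conclusion, the hypothesis test indicates that the correlation is statistically significant (p = 7.25e-12, below the alpha of 0.01), so we reject the null hypothesis of no relationship between CustomerID and UnitPrice. The correlation itself is negligible (-0.0108629), so the relationship has little practical meaning.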

Here are the overall conclusions for the hypothesis testing on the Online Retail dataset:

Overall Conclusions

The hypothesis testing revealed a very small but statistically significant negative correlation between Quantity and UnitPrice, with a correlation coefficient of -0.003212563.

The test also found a statistically significant relationship between CustomerID and UnitPrice (p = 7.25e-12), but the correlation coefficient of -0.0108629 is negligible, so the relationship is unlikely to matter in practice.