Online_Retail <- read.csv('C:/Users/laasy/Documents/Fall 2023/Intro to Statistics in R/Datasets for Final Project/OnlineRetail.csv')
summary(Online_Retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID Country
## Length:541909 Min. :-11062.06 Min. :12346 Length:541909
## Class :character 1st Qu.: 1.25 1st Qu.:13953 Class :character
## Mode :character Median : 2.08 Median :15152 Mode :character
## Mean : 4.61 Mean :15288
## 3rd Qu.: 4.13 3rd Qu.:16791
## Max. : 38970.00 Max. :18287
## NA's :135080
library(infer)
library(ggplot2)
# Load dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data Preparation:
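# Keep only rows with a positive Quantity (drops zero and negative quantities).
# Quantity and UnitPrice have no missing values; rows with a missing CustomerID
# are removed later, before the second test.
df <- subset(Online_Retail, Quantity > 0)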
Hypothesis 1: There is a relationship between Quantity and UnitPrice in the data.
Null Hypothesis 1: There is no relationship between Quantity and UnitPrice in the data.
Hypothesis 2: There is a relationship between CustomerID and UnitPrice in the data.
Null Hypothesis 2: There is no relationship between CustomerID and UnitPrice in the data.
Test 1: I test the null hypothesis that there is no relationship between Quantity and UnitPrice.
# Sample Size for Null Hypothesis 1
sample_size_hypothesis_1 <- nrow(Online_Retail[!is.na(Online_Retail$Quantity) & !is.na(Online_Retail$UnitPrice), ])
# Print the calculated values
cat("Sample Size for Null Hypothesis 1:", sample_size_hypothesis_1, "\n")
## Sample Size for Null Hypothesis 1: 541909
### Alpha, Power, Effect size:
# Set alpha, power, effect size
alpha <- 0.05
power <- 0.8
effect_size <- 0.2
I chose an alpha of 0.05 because it is the most widely accepted Type I error rate in statistical analysis. At 5%, it gives a reasonable balance between detecting true effects and avoiding false positives. A power of 0.8 means the test has an 80% probability of correctly rejecting the null hypothesis when an effect of the assumed size is real. This conventional level caps the Type II error rate at 20% without demanding an impractically large sample. An effect size of 0.2 sits between Cohen’s small (0.1) and medium (0.3) benchmarks for a correlation. I expected any relationship between Quantity and UnitPrice to be small, so this seems appropriate: the test is planned to detect effects of at least this size.
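As a rough check on these settings, the sample size needed to detect a correlation of a given size can be approximated with the Fisher z transformation. The sketch below assumes the effect size is read as a Pearson correlation and that the test is two-sided; `required_n_for_correlation` is a hypothetical helper and is not part of the original analysis.

```r
# Approximate n needed to detect a correlation r with a two-sided test,
# using the Fisher z approximation: n = ((z_alpha + z_power) / atanh(r))^2 + 3.
# Assumption: the effect size is interpreted as a Pearson correlation.
required_n_for_correlation <- function(r, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
  z_power <- qnorm(power)          # normal quantile matching the desired power
  ceiling(((z_alpha + z_power) / atanh(r))^2 + 3)
}

n_needed <- required_n_for_correlation(r = 0.2, alpha = 0.05, power = 0.8)
print(n_needed)
```

Under these assumptions the requirement is on the order of a couple of hundred observations, which the dataset exceeds many times over.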
### Neyman-Pearson:
The full dataset contains 541,909 rows, none of which has a missing Quantity or UnitPrice. After filtering to Quantity > 0, 531,285 rows remain for the test (the 531,283 degrees of freedom reported below, plus 2). A common rule of thumb for a Pearson correlation test between two continuous variables is at least 30 to 50 observations, so the sample is far larger than needed to run the test under the Neyman-Pearson framework.
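# Neyman-Pearson approach: F test for the regression slope, judged against the pre-set alpha
model <- lm(UnitPrice ~ Quantity, data = df)
f_stat <- anova(model)$`F value`[1]  # F statistic for the Quantity term
p_value <- pf(f_stat, 1, nrow(df) - 2, lower.tail = FALSE)  # upper-tail p-value on (1, n - 2) df
cat("Neyman-Pearson Test p-value:", p_value, "\n")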
## Neyman-Pearson Test p-value: 0.0192005
# Fisher's approach: Pearson correlation test, with the p-value read as evidence
cor.test(df$Quantity, df$UnitPrice, method='pearson')
##
## Pearson's product-moment correlation
##
## data: df$Quantity and df$UnitPrice
## t = -2.3416, df = 531283, p-value = 0.0192
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0059014730 -0.0005236067
## sample estimates:
## cor
## -0.003212563
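The correlation test returns the same p-value (0.0192) as the regression F test above. This is expected: for a single predictor, the F statistic for the slope equals the square of the correlation t statistic. The following is a minimal sketch of that equivalence on simulated data; the variables `x` and `y` are made up for illustration and are not taken from the retail dataset.

```r
# Simulated data (hypothetical values, not the retail dataset)
set.seed(1)
x <- rnorm(200)
y <- 0.1 * x + rnorm(200)

# Regression F test for the slope
fit <- lm(y ~ x)
f_stat <- anova(fit)$`F value`[1]
p_from_f <- pf(f_stat, 1, length(x) - 2, lower.tail = FALSE)

# Pearson correlation t test
ct <- cor.test(x, y, method = "pearson")

# The F statistic equals t squared, so the two p-values coincide
print(all.equal(f_stat, unname(ct$statistic)^2))
print(all.equal(p_from_f, ct$p.value))
```

The two procedures therefore differ in how the result is interpreted (a decision against a fixed alpha versus a p-value read as evidence), not in the statistic they compute.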
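The correlation coefficient from the Fisher’s test is -0.003212563, a very small negative correlation.
The p-value of 0.0192 is below the alpha of 0.05, so we reject the null hypothesis. Although the correlation is extremely weak, and far below the planned effect size of 0.2, it is statistically significant: with more than 500,000 observations, even a negligible correlation can be distinguished from zero. There is therefore evidence of a relationship between Quantity and UnitPrice in the data, though not of a practically important one.
### Visualization: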
ggplot(df, aes(Quantity, UnitPrice)) +
  geom_point() +
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
In conclusion, while the correlation is small, the hypothesis test indicates it is significant, so we reject the null hypothesis of no relationship.
Test 2: I test the null hypothesis that there is no relationship between CustomerID and UnitPrice.
alpha <- 0.01
power <- 0.9
effect_size <- 0.4
For this test, I used a stricter alpha of 0.01 to further control Type I errors, because I had less basis for expecting a relationship here. The higher power of 0.9 balances the stricter alpha and gives a good probability of detecting a relationship if one is present. I expected a larger potential effect size, since differences between customers could create broader UnitPrice gaps. The value of 0.4 lies between Cohen’s medium (0.3) and large (0.5) benchmarks for a correlation, and equals the large benchmark on the f scale used for ANOVA.
The dataset contains 4373 unique CustomerID values. Because unique() counts NA as a value and CustomerID has missing entries, this figure includes one entry for missing IDs, leaving 4372 actual customers. Whether most customers have 30 or more records each is not checked here and would need to be confirmed from the per-customer counts. With roughly 400,000 rows that have a CustomerID, the overall sample size is sufficient to perform the test for this hypothesis.
# Number of Unique CustomerID Values for Null Hypothesis 2
unique_customer_ids <- unique(Online_Retail$CustomerID)
num_unique_customer_ids <- length(unique_customer_ids)
cat("Number of Unique CustomerID Values for Null Hypothesis 2:", num_unique_customer_ids, "\n")
## Number of Unique CustomerID Values for Null Hypothesis 2: 4373
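The number of distinct customers and the number of records per customer can be checked directly. The sketch below uses a small made-up data frame, `toy`, in place of the retail data, and a threshold `min_records` of 2 so the toy example stays readable; on the real data the threshold would be 30.

```r
# Toy data standing in for the retail records (hypothetical values)
toy <- data.frame(
  CustomerID = c(1, 1, 1, 2, 2, 3, NA, NA),
  UnitPrice  = c(2.5, 3.0, 1.2, 4.1, 0.9, 7.5, 1.0, 2.0)
)

# unique() counts NA as a value; dropping NA first gives the number of real customers
n_with_na    <- length(unique(toy$CustomerID))
n_without_na <- length(unique(na.omit(toy$CustomerID)))

# Records per customer; table() leaves out NA by default
group_sizes <- table(toy$CustomerID)

# Share of customers with at least min_records rows
min_records <- 2
share_large <- mean(group_sizes >= min_records)

cat("Unique IDs incl. NA:", n_with_na, " excl. NA:", n_without_na, "\n")
cat("Share of customers with >=", min_records, "records:", share_large, "\n")
```

Applied to `df`, the same `table()` call would show how many customers actually reach the 30-record threshold.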
df <- df[complete.cases(df[,c("CustomerID", "UnitPrice")]),]
# Run ANOVA (CustomerID is numeric here, so it enters the model as one continuous predictor)
model <- aov(UnitPrice ~ CustomerID, data=df)
# Get summary
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## CustomerID 1 22927 22927 46.96 7.25e-12 ***
## Residuals 397922 194270157 488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
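Pr(>F) is the p-value. At 7.25e-12 it is far below the 0.01 significance level set for this test, so we reject the null hypothesis. The *** notation marks p-values below 0.001.
Note what the test compares, however. CustomerID is stored as a number, so aov() fits it as a single continuous predictor (Df = 1 in the table) rather than as thousands of separate groups. The result is therefore evidence of a very slight linear trend in UnitPrice along the numeric value of CustomerID, not of differences between individual customers. A one-way ANOVA across customers would require treating the ID as a factor, for example factor(CustomerID).
In summary:
The ANOVA p-value is 7.25e-12.
This is extremely small and statistically significant.
We reject the null hypothesis at alpha = 0.01.
The test indicates that UnitPrice varies with the numeric value of CustomerID, although only through a linear trend.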
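The Df of 1 for CustomerID in the table above shows how aov() treated the variable. A minimal sketch on made-up data illustrates the difference between a numeric ID and a factor ID; the data frame `toy` and its three customer IDs are hypothetical.

```r
# Toy data (hypothetical): three customers with different typical prices
set.seed(42)
toy <- data.frame(
  CustomerID = rep(c(101, 102, 103), each = 20),
  UnitPrice  = c(rnorm(20, mean = 2), rnorm(20, mean = 5), rnorm(20, mean = 3))
)

# Numeric ID: aov() fits a single slope, so the term has 1 degree of freedom
fit_numeric <- aov(UnitPrice ~ CustomerID, data = toy)
df_numeric <- summary(fit_numeric)[[1]]$Df[1]

# Factor ID: one-way ANOVA across groups, with (number of groups - 1) degrees of freedom
fit_factor <- aov(UnitPrice ~ factor(CustomerID), data = toy)
df_factor <- summary(fit_factor)[[1]]$Df[1]

cat("Df with numeric CustomerID:", df_numeric, "\n")
cat("Df with factor CustomerID:", df_factor, "\n")
```

On the full retail data, the factor version would have thousands of levels and is likely to be slow and memory-intensive, so it is shown here only on a toy example.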
# Fisher's approach: Pearson correlation test, with the p-value read as evidence
cor.test(df$CustomerID, df$UnitPrice, method='pearson')
##
## Pearson's product-moment correlation
##
## data: df$CustomerID and df$UnitPrice
## t = -6.8528, df = 397922, p-value = 7.25e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.013969481 -0.007756114
## sample estimates:
## cor
## -0.0108629
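The correlation coefficient from the Fisher’s test is -0.0108629, a very small negative correlation. Its p-value (7.25e-12) is identical to the ANOVA result above because, with a numeric CustomerID, both procedures test the same linear relationship. A coefficient this close to 0 indicates a very weak linear relationship, and it is unlikely to have practical significance. CustomerID is also an arbitrary identifier rather than a measured quantity, so a linear trend in its numeric value has no clear interpretation.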
boxplot(UnitPrice ~ CustomerID, data=df[1:100,])
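In conclusion, the p-value of 7.25e-12 is below the alpha of 0.01, so the correlation is statistically significant and we reject the null hypothesis of no relationship between CustomerID and UnitPrice. The correlation itself (-0.0108629) is negligible, however, so the relationship has little practical importance.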
Here are the overall conclusions for the hypothesis testing on the Online Retail dataset:
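The hypothesis testing revealed a very small but statistically significant negative correlation between Quantity and UnitPrice, with a correlation coefficient of -0.003212563 (p = 0.0192, alpha = 0.05).
The test also found a statistically significant but negligible negative correlation between CustomerID and UnitPrice, with a correlation coefficient of -0.0108629 (p = 7.25e-12, alpha = 0.01). In both cases the very large sample makes a tiny effect detectable, so statistical significance here does not imply practical importance.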
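To see how such small correlations produce the reported test statistics, the t values can be recomputed from the correlation coefficients and degrees of freedom printed by cor.test(). The sketch below copies those numbers from the outputs in this report; `t_from_r` is a hypothetical helper implementing the standard relation t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Values copied from the cor.test() outputs in this report
r1 <- -0.003212563; df1 <- 531283   # Quantity vs UnitPrice
r2 <- -0.0108629;   df2 <- 397922   # CustomerID vs UnitPrice

# t statistic implied by a correlation r with df = n - 2
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t1 <- t_from_r(r1, df1)
t2 <- t_from_r(r2, df2)

# Share of variance in one variable linearly explained by the other
r_squared <- c(test1 = r1^2, test2 = r2^2)

print(round(c(t1 = t1, t2 = t2), 4))
print(signif(r_squared, 3))
```

The sqrt(df) factor is what turns a tiny r into a sizeable t statistic, while the r-squared values show that neither variable explains more than a small fraction of a percent of the variance in UnitPrice.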