I. INTRODUCTION
a. Hypothesis Testing - Impact in
the “Manufacturing - Supply Chain”
The hypotheses testing is
making a great impact in the manufacturing industry as it does in the
other sectors. It helps to challenge the existing processes and test
their impact effectiveness in the other dependent or independent fields.
Specifically, as I come from a supply chain background, learning
hypothesis testing and its application in the research areas of this
field is explorative. It is like reverse learning – after experiencing
the practical approach and then learning its theoretical approach, which
is a different exposure. I wanted to highlight one statement I read in
an article: The need for statistical tests in analysis and
experimentation is always unquestioned (Carrasco et al., 2020). It also
helps to analyze the performance of supply chain factors in optimizing
the working capital and cash conversion cycle (Mulchandani et al.,
2022).
b. Two Samples Comparison -
Various applications of ‘z’ test, ‘t’ test, and ‘F’ test
The
two-sample comparison is a great technique to understand the significant
and non-significant differences or relations between two sample groups
to understand its population’s characteristics and parameters.
The z-test is mainly used for a larger sample size, which is
approximately over 30 in number. If we know the population variance, it
is less widely used than the t-test since it is hard for researchers to
know the population parameters, and it is relatively more
straightforward to collect samples and derive inferences about the
population from it (King et al., 2019).
The t-test is based
explicitly on a smaller sample size, where it is lower than 30. In this
test, the sample distribution of the mean will follow t-distribution.
There are cases where the sample size is relatively larger than 30 and
may follow a t-test, but the distribution would be normal distribution
instead of t-distribution (King et al., 2019).
The F-test is
used in two normally distributed sample groups to compare their
variances and identify if they are equal (Tan, 2017). It also considered
ANOVA (Analysis of Variance) since inferences of means are made by
analysis of variance (Mishra, 2019).
c. Academic Writing -
Importance of Proper References
1. The significance of
citing resources in the research articles and studies is critical to
respect and give credit to the original authors (MIT, 2023).
2.
To avoid plagiarism in the research paper or an article we work on (MIT,
2023).
3. Every submitted academic manuscript is fully reviewed
before it is accepted for any publication (Santini, 2018).
d. Description of
Dataset
The dataset used in this report is about the
customers visiting two grocery markets to purchase products. It contains
the variables such as customer’s gender, age, visiting period (am/pm),
shipping duration (in mins). These details are collected for two grocery
markets with the sample size of 25 each. Therefore, it contains
variables in the form of categorical nominal, categorical ordinal, and
numerical discrete.
II.
ANALYSIS SECTION
1.1 Research Context
a.
Problem Statement
Although e-commerce booming in the current
era, it couldn’t be used everyday considering its cost and time. Several
customers visiting grocery markets for their daily needs. However, their
shopping time differs based on their age and literacy factors. Hence,
this study is about to identify whether there is a difference in
shopping duration of the customers visiting two different grocery
markets or not.
b. Research Question
1. Is there a
difference in mean shopping duration of customers visiting two grocery
markets?
2. Can I significantly reject null hypothesis (no
difference in shopping duration of customers visiting two different
shops)?
1.2 Dataset Variables
a. Variables
Collected
1. Customer Name/Count
2. Gender
3. Age
4. Period
5. Shopping duration (in mins)
b. Support from “Variables” to answer research question
To answer our research questions, the variable “Shopping duration
(in mins)” support extremely well to make comparison between two sample
groups. However, other variables which were collected may not be
required since our questions mainly focus on shopping duration.
1.3 Dataset: Tabular and Graphical Presentation
In
this section, dataset will be presented in the tabular format with all
of its variables. Furthermore, two of its numerical variables displayed
in the graphical formats.
#using kable() to present the dataset in table
kable(shop1dataset, table.attr = "style='width:90%;'", align = "c", format = "html")%>%
kable_styling(bootstrap_options = "bordered", latex_options = "striped", font_size = NULL)
| Customer | Gender | Age | Period | Shopping duration (in mins) |
|---|---|---|---|---|
| x1 | Male | 30 | Morning | 14 |
| x2 | Female | 22 | Afternoon | 6 |
| x3 | Female | 24 | Afternoon | 8 |
| x4 | Male | 32 | Morning | 5 |
| x5 | Male | 16 | Afternoon | 5 |
| x6 | Female | 18 | Afternoon | 10 |
| x7 | Female | 56 | Evening | 15 |
| x8 | Male | 45 | Afternoon | 20 |
| x9 | Female | 60 | Evening | 12 |
| x10 | Male | 21 | Evening | 11 |
| x11 | Male | 34 | Evening | 3 |
| x12 | Female | 16 | Afternoon | 6 |
| x13 | Female | 43 | Evening | 13 |
| x14 | Male | 34 | Afternoon | 5 |
| x15 | Female | 76 | Afternoon | 16 |
| x16 | Male | 38 | Morning | 12 |
| x17 | Female | 29 | Evening | 5 |
| x18 | Male | 15 | Evening | 8 |
| x19 | Male | 36 | Afternoon | 10 |
| x20 | Female | 20 | Evening | 12 |
| x21 | Female | 41 | Evening | 10 |
| x22 | Female | 14 | Evening | 5 |
| x23 | Male | 36 | Evening | 5 |
| x24 | Female | 39 | Afternoon | 8 |
| x25 | Male | 31 | Afternoon | 8 |
#presenting histogram for the variable "age"
hist(shop1dataset$Age, main = "Shop 1 - Customer's Age Vs Freq of visit", col = brewer.pal(8,"Set3"), ylim = c(0,10), xlab = "Age group", breaks = "Sturges")
#presenting boxplot for the variable "shopping duration (in mins)"
boxplot(shop1dataset$`Shopping duration (in mins)`, horizontal = TRUE, col = brewer.pal(8,"Set2"), main = "Shop 1 - Customer's Shopping duration (mins)")
Observation:
Table:
While observing
the tabular presentation of the dataset, it is understandable that the
variables are in three different types: nomina, ordinal, and
discrete.
Median - From the variable “Shopping duration (in mins)”,
we can understand that the median is 9-10 mins as its range is between 5
and 20.
Mode - From the variable “Shoppping duration (in mins)”
which is the context of hypothesis in this study, mode is 5 as it has
multiple repetitions compared to other values.
Histogram:
This histogram presentation is based on the “Age”
of the customers visiting the shop 1. Here are the observations from
descriptive statistics point of view. Range: The Age group range is
between 10 and 80 Maximum: I see the customers in the age of 30-40
visiting the shop 1 most often compared to other age groups Minimum: I
see the customers in the age of 70-80 visiting the shop 1 less often
compared to other age groups Mean: Based on the visual presentation, the
average in the frequency of the shop visit lies in between 2 and 4
Boxplot:
The boxplot is presented from the context of
understanding the variable “Shopping duration (in mins)” for shop 1.
Here are the observations from descriptive statistics point of view.
Range: The range of this variable lies in between 2 and 20
Inter-quartile range: The IQ1 and IQ3 stands at 5 and 12, respectively
Median: The median of this variable data lies at approximately 7 or
8.
1.4 Sampling Method and Descriptive Statistics
a.
Sampling Method used for data collection
I have used convenience
sampling method to the pick sample data. Since I stay at remote
location, and I have found two different grocery shops and thought of
observing the number of customers visit, their shopping duration, and
age groups. Based on my observation, the convenience sampling considered
as best approach considering the limited resources availability. b.
Sample Size
The sample size chosen for this data collection
process is 25 for each sample groups. In this case, 25 observations
performed for shop 1 and 25 observations collected for shop 2
#samplesize
n1=25
n2=25
n=n1+n2
#descriptive statistics shop 1
mean_custshoppingtime_shop1 = mean(shop1dataset$`Shopping duration (in mins)`)
sd_customershoppingtime_shop1 = sd(shop1dataset$`Shopping duration (in mins)`)
#descriptive statistics shop 2
mean_custshoppingtime_shop2 = mean(shop2dataset$`Shopping duration (in mins)`)
sd_customershoppingtime_shop2 = sd(shop2dataset$`Shopping duration (in mins)`)
Observation:
The dataset will be described by the ‘mean’
parameter.
The reason for choosing the ‘mean’ parameter for
this hypothesis testing is to identify the average shopping duration of
customers visiting two different shopping markets and from its context,
we try to prove null hypothesis wrong that significantly denies
differences in customer’s shopping time in two different shops.
Hence, average shopping duration of customers formulated in the above R
chunk code to use it as estimator to derive inferences of population and
make comparison.
1.5 Hypothesis
a. Null and Alternative
Hypothesis (Bluman, 2018)
1. Ho: µ1 = µ2, There is no
difference in shopping duration of customers visiting shop 1 and shop 2
2. Ha: µ1 ≠ µ2, There is a difference in shopping duration of
customers visiting shop 1 and shop 2
b. Importance of well-presented hypothesis
Hypothesis Testing and its importance is essential for any research and
case studies. Some of its critical advantages include (Williamson,
2002), 1. Being brief and as clear as possible in our research
2. It helps the readers to understand and test for validation
3. Provide enough content and analysis in explaining the relationship or
comparison of two or more variables
4. It has to be grounded
from past experience or knowledge, and focus much on literature reviews
or theory for exploration in the study in wide angle
1.6 Two Tailed Hypothesis - CL, Alpha, Critical Values
1. This hypothesis testing is two-tailed since we are trying to
identify the true population means have any differences in customer’s
shopping duration of shop 1 and shop 2, and not about one or the other
higher or lower.
2. Confidence level chosen as 0.95 by allowing
5% significance due to that we can be 95% confident to reject the null
hypothesis.
CL = 0.95
alpha = 1-CL
alpha_2 = alpha/2
df=n1-1
cvleft = qt(alpha_2, df)
cvright = qt(1-alpha_2, df)
The confidence level of this hypothesis testing is 0.95
The
significance level of this hypothesis testing at each tail is 0.025
The critical values of the confidence level 0.95 are left critical
value -2.06 and right critical value 2.06
1.7 Density Distribution
In this section,
presenting the density plot of the variable “Shopping duration (in
mins)” from Sample group 1 after normalizing the data. Besides, the
density distribution also highlight the confidence level, significance
value, and critical values.
shop1dataframe <- data.frame(shop1dataset$`Shopping duration (in mins)`)
norm_sample1 <- (shop1dataframe$shop1dataset..Shopping.duration..in.mins.. - mean_custshoppingtime_shop1) / sd_customershoppingtime_shop1
norm_sample1 %>%
density(data = data.frame(norm_sample1), adjust = 2) %>%
plot(main = "", xlab = "Density Curve - Shop 1 Customer shopping time", cex.axis = 1)
abline(v=c(cvleft, cvright), col = c("red", "red"))
text(x=cvleft, y=0.20, round(cvleft,2), srt=90, cex=0.8, adj = c(0.86,0))
text(x=cvright, y=0.20, round(cvright,2), srt=90, cex=0.8, adj = c(0.86,0))
text(x=0, y=0.15, labels = as.character(CL), srt=0, cex=0.8, adj = c(0.86,0))
text(x=-2.2, y=0.02, labels = as.character(alpha_2), srt=0, cex=0.6, adj = c(0.86,0))
text(x=2.6, y=0.02, labels = as.character(alpha_2), srt=0, cex=0.6, adj = c(0.86,0))
Data Distribution - 3 Basic Descriptive Statistics
Range -
The range of this data in normal distribution
of shopping duration (in mins) is clearlty visible as it stands in
between -3 and 3.
Standard deviation -
The
standard deviation of this normalized distribution is 2. Therefore, the
sample data is distributed in visualization between -4 and 4.
Critical Value -
Since the confidence level chosen for
this hypothesis testing as 0.95, which lefts the significance level as
0.05, and according to the two-tailed test, significance level at lower
tail and upper tail is 0.025, respectively. The critical values stands
at -2.06 and 2.06. Therefore, it allows the confidence level stands in
between the sample distribution -2.06 and 2.06.
1.8 T-test Value
Ttest = ((mean_custshoppingtime_shop1 - mean_custshoppingtime_shop2)-0) / sqrt((sd_customershoppingtime_shop1^2/n1) + (sd_customershoppingtime_shop2^2/n2))
Observation:
The T test value of the hypothesis
testing in an attempt to prove null hypothesis wrong is -1.37
In the above T-test formula, differences of mean of customer shopping time in shop 1 and shop2 evaluated and it is divided by the square root of the sum of square root of standard deviation of two shops divided by its respective sample size.
1.9 Comparison: T-test and Critical Value
resultHo <- Ttest<cvleft
Observation:
The T value is negative -1.37, hence comparing it with
negative critical value -2.06
Is T test value is less than left critical value?
FALSE
Since the t value -1.37is greater than the left Critical value -2.06, there is not enough evidence and have failed to reject the null hypothesis that is claiming as there is no significant difference in shopping duration of the customers visiting shop1 and shop2.
1.10 p value
pvalue = 2*pt(abs(Ttest), df=df, lower.tail = FALSE)
Observation:
The p value of the hypothesis testing in an attempt to prove null
hypothesis wrong is 0.18
The main purpose of the pvalue is to identify the probability of
sample statistic (like sample mean) or extreme sample statistic in the
direction of the alternative hypothesis when null hypothesis is true
(Bluman, 2018).
In the above hypothesis testing, we failed to reject the null hypothesis and found that p value is around 0.18, which gives clear path to understand and analyze whether Type 1 error occured in this study.
1.11 Density Distribution - T test value
norm_sample1 <- (shop1dataframe$shop1dataset..Shopping.duration..in.mins.. - mean_custshoppingtime_shop1) / sd_customershoppingtime_shop1
norm_sample1 %>%
density(data = data.frame(norm_sample1), adjust = 2) %>%
plot(main = "", xlab = "Density Curve - Shop 1 Customer shopping time", cex.axis = 1)
abline(v=c(cvleft, Ttest, cvright), col = c("red","blue","red"))
text(x=cvleft, y=0.20, round(cvleft,2), srt=90, cex=0.8, adj = c(0.86,0))
text(x=cvright, y=0.20, round(cvright,2), srt=90, cex=0.8, adj = c(0.86,0))
text(x=Ttest, y=0.10, round(Ttest,2), srt=90, cex=0.8, adj = c(0.86,0))
text(x=0, y=0.15, labels = as.character(CL), srt=0, cex=0.8, adj = c(0.86,0))
text(x=-2.2, y=0.02, labels = as.character(alpha_2), srt=0, cex=0.6, adj = c(0.86,0))
text(x=2.6, y=0.02, labels = as.character(alpha_2), srt=0, cex=0.6, adj = c(0.86,0))
text(x=2.6, y=0, labels = as.character(round(pvalue,2)), srt=0, cex=0.6, adj = c(0.86,0))
Observation:
Based on my understanding, T-test value is
clearly outside the critical value and it is the resemblence that we
failed to reject hypothesis. Still, we need concrete sample data to work
on this hypothesis testing to try to prove null hypothesis is wrong.
1.12 Density Distribution - Two Sample Group Comparison
density(shop1dataset$`Shopping duration (in mins)`)%>%
plot(main="")
lines(density(shop2dataset$`Shopping duration (in mins)`),col= c("red") )
mean_custshoppingtime_shop2 = mean(shop2dataset$`Shopping duration (in mins)`)
abline(v=c(mean_custshoppingtime_shop1, mean_custshoppingtime_shop2), col=c("blue", "green"))
meanshop1 <- mean_custshoppingtime_shop1
meanshop2 <- mean_custshoppingtime_shop2
text(x=meanshop1, y=0.035, labels = as.character(meanshop1), srt=90, cex = 0.9, adj = c(0.8,0))
text(x=meanshop2, y=0.035, labels = as.character(meanshop2), srt=90, cex = 0.9, adj = c(0.8,0))
Observation:
This density distribution of raw data of
sample group 1 and group 2 gives basic understanding how well these two
groups’ data varied at the scale of data range.
by incorporating
the mean of two groups shows how far they lie in the distribution and
helps us to understand its differences and relativity to understand its
patterns.
III. CONCLUSION
In
conclusion, hypothesis testing has emerged as a crucial tool in the
manufacturing-supply chain sector, enabling a systematic evaluation of
existing processes and their impact on various facets of the industry.
Particularly in the realm of supply chain research, the exploration of
hypothesis testing represents a unique approach, involving practical
experiences before delving into theoretical understanding.
The
need for statistical tests in analysis and experimentation is deemed
unquestionable, emphasizing the pivotal role of hypothesis testing in
assessing and optimizing supply chain factors. The utilization of
two-sample comparison techniques, such as the ‘z’ test, ‘t’ test, and
‘F’ test, offers a nuanced understanding of significant differences and
relationships between sample groups. The choice of these tests depends
on factors like sample size and knowledge of population variance.
Overall, hypothesis testing, in conjunction with specific comparison
tests, provides a robust framework for decision-making and continuous
improvement in the manufacturing-supply chain, contributing to enhanced
efficiency and effectiveness in operations.
IV. BIBLIOGRAPHY
1. J.
Carrasco, S. García, M.M. Rueda, S. Das, F. Herrera, Recent trends in
the use of statistical tests for comparing swarm and evolutionary
computing algorithms: Practical guidelines and a critical review, Swarm
and Evolutionary Computation, Volume 54, 2020, 100665, ISSN 210-6502, https://doi.org/10.1016/j.swevo.2020.100665.
2.
Mulchandani, K., Singh J.S., and Mulchandani, K., “Determining Supply
Chain Effectiveness for Indian MSMEs: A Structural Equation Modelling
Approach.” Asia Pacific Management Review 28.2 (2023): 90-8. ProQuest.
Web. 4 Dec. 2023.
3. King, A.P., Eckersley, A.K, 2019,
Inferential Statistics II: Parametric Hypothesis Testing, Statistics for
Biomedical Engineers and Scientists, Student’s t-Test - an overview |
ScienceDirect Topics.
4. Massachusetts Institute of Technology,
2023, URL: https://libguides.mit.edu/citing.
5. Santini,
A., 2018, The Importance of Referencing, National Library of Medicine,
PMCID: PMC5953266, PMID: 29967893, URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5953266/.
7. Mishra P, Singh U, Pandey CM, Mishra P, Pandey G.
Application of student’s t-test, analysis of variance, and covariance.
Ann Card Anaesth. 2019 Oct-Dec;22(4):407-411. doi:
10.4103/aca.ACA_94_19. PMID: 31621677; PMCID: PMC6813708.
8.
Bluman, A. (2018), Testing the differences between Two means, Two
Proportions, and Two Variances, Elementary Statistics: a step-by-step
approach. In Bluman, A., Descriptive and Inferential Statistics,
(pp. 488-490)
9. Williamson, K., 2002, The beginning stages of
research, Good Hypothesis, Research Methods for Cyber Security, 2017,
URL: https://www.sciencedirect.com/topics/computer-science/good-hypothesis
V. APPENDIX
An R
Markdown report is enclosed in the submission. The name of the file is
Project4_Jayakumar.rmd