An Overview
What Statistics Can Do for You?
Statistics is the art and science of collecting, analyzing,
interpreting, and presenting data to understand patterns, make informed
decisions, and uncover insights about various phenomena in the
world.
“All methods of acquiring knowledge are essentially statistics.” (by
C.R. Rao)
Statistics plays a crucial role in various aspects of our lives. In
natural and life sciences, statistics is used to design experiments,
analyze experimental results, and draw meaningful conclusions.
Statistics also finds extensive uses in other areas, such as Business
and Economics, Social Sciences, Quality Control and Manufacturing,
Environmental Studies, Sports and Entertainment, Public Policy and
Government, Risk Management and Insurance, and Education.
Topics to Cover
- Probability: Classical Probability, Random Variables and Their
Distributions
- Study design: Observational or experimental
- Descriptive statistics: numerical or graphical
- Inferential statistics: confidence intervals and tests of
hypotheses
- Regression and one-way analysis of variance
- Quality control basics
Software for This Course
We will use R through the integrated development environment (IDE) called RStudio, which is developed by Posit (the company formerly named RStudio), to do statistical analysis and create project reports. There are two ways to use it:
- Visit https://posit.co/download/rstudio-desktop/ and install R and RStudio on your personal computer. Many SCSU computers already have both installed. There can be some issues with this use of R/RStudio.
- Visit https://posit.cloud/. You do not need to install anything on your computer. Just register on this webpage. Once registered, log in and start a new project. You do not need to create a new project each time you use Posit Cloud, since you can create many folders and files under the first project you created.
Use of AI
I use ChatGPT to prepare my notes.
“The use of AI in creating learning in this course is supported. Be
clear about your use of AI. Ensure that your work is truly your work.
Think critically about when AI should be used vs when it is an easy out.
Ask questions to your peers and me. AI may continue to impact our
society in ways that we cannot imagine and part of your experience here
is to grow your understanding of what could be, not what was. Also,
please refer to the Academic Integrity policies that apply to all
courses within our institution, and ask me for clarification on how
Academic Integrity and AI intersect in this course.”
Here are the SCSU guidelines on the use of AI: https://services.stcloudstate.edu/TDClient/1919/Portal/KB/?CategoryID=24236
Examples of Using ChatGPT
- Explain what the variance of a sample is. Give me a data example in biology using R. Once ChatGPT gives you a response, type “The code is too complex. Please simplify.” If you want a simple printout, type “no use of cat() function.” The code should then be straightforward to follow.
- Explain what a histogram is. Give me a data example in health science using R. Interpret the plot.
- What is a confidence interval for a population proportion? Give me a data example in environmental science using R. Interpret the result.
ChatGPT is not perfect and it makes mistakes. Therefore, use ChatGPT with caution!
A quick quiz: why should I consider a career in data science?
A ChatGPT answer:
Considering a career in data science can be a rewarding decision for
several reasons:
High Demand: Data science professionals are in high demand across
industries as organizations recognize the value of data-driven
decision-making.
Variety of Industries: Data science is applicable in diverse fields
such as finance, healthcare, marketing, technology, and more, providing
opportunities to work in areas of personal interest.
Well-Paying Jobs: Data scientists often enjoy competitive salaries
due to their specialized skills and the increasing demand for data
expertise.
Impactful Insights: Data analysis enables you to extract insights
from complex data, helping businesses make informed decisions, optimize
processes, and identify trends.
Innovation: Data scientists play a key role in developing innovative
solutions using machine learning, artificial intelligence, and
predictive analytics.
Continuous Learning: The field of data science is constantly evolving
with new technologies and techniques, offering a dynamic learning
environment.
Problem Solving: Data science involves solving real-world problems
using data-driven approaches, making it intellectually stimulating and
fulfilling.
Flexibility: Many data science roles offer remote work options and
flexible schedules, allowing for a better work-life balance.
Career Growth: Data science roles often have clear career paths with
opportunities to advance to senior or specialized positions.
Interdisciplinary Skills: Data science combines skills from
programming, statistics, mathematics, and domain expertise, making it a
well-rounded career option.
Global Relevance: Data science is a global field with opportunities
to work with international organizations and collaborate on cross-border
projects.
Community and Networking: Joining the data science community offers
opportunities to connect with like-minded professionals, share
knowledge, and collaborate on projects.
While data science offers numerous advantages, it’s important to note
that it requires a strong foundation in programming, mathematics, and
statistics. Pursuing relevant education, gaining practical experience,
and staying updated with industry trends are crucial steps toward
building a successful career in data science.
The Role of Statistics in Engineering
Primary Roles of Statistics in General
Data Collection: Statistics helps in designing surveys, experiments, and observational studies to collect relevant and reliable data.
Data Analysis: Statistics provides methods for analyzing and summarizing data.
Inference: Statistics helps determine whether observed
differences or relationships have occurred by chance.
Modeling and Prediction: Statistics allows for the development of
mathematical models to represent complex real-world phenomena. These
models can be used for prediction.
Mechanistic and Empirical Models
Mechanistic models, also known as deterministic models, are built on
a deep understanding of the underlying principles and mechanisms
governing the system.
Empirical models, also known as statistical models or data-driven
models, are developed based on observed data without a deep
understanding of the underlying mechanisms.
Engineers often use a combination of these models to achieve a
comprehensive understanding of complex systems.
Probability
This chapter will lay a theoretical foundation for statistics.
Random Experiments
A random experiment refers to a process or procedure that can result
in multiple possible outcomes with uncertainty.
Tossing a Coin: When flipping a fair coin, the possible outcomes “Heads” and “Tails” are uncertain and depend on many factors.
Testing the Strength of Materials: In a material strength test, the
sample may exhibit different strengths due to inherent variations in its
composition. The strength of each sample tested is subject to
chance.
Sample Space and Events
Sample Space: The set of all possible outcomes of a random experiment
is called the sample space.
An event is a subset of the sample space. It represents a particular
outcome or a combination of outcomes. Events are often represented by
capital letters such as A, B, C, …
In engineering, experiments can involve various factors and
parameters that lead to different outcomes. For example, if you are
testing the tensile strength of a material, the sample space is \([0, \infty)\). An event might be “the
tensile strength of a material is greater than 5 pounds per square
inch.”
Basic Operations on Events
Union of Events \(A \cup B\):
The union of two events \(A\) and \(B\), denoted as \(A \cup B\), represents the event that
either \(A\) occurs, or \(B\) occurs, or both occur.
Intersection of Events \(A \cap
B\) or \(AB\): The intersection
of two events \(A\) and \(B\), denoted as \(A \cap B\) or \(AB\), represents the event that both \(A\) and \(B\) occur simultaneously.
Complement of an Event \(A'\): The complement of an event \(A\), denoted as \(A'\), \(A^c\), or \(\bar{A}\), represents the event that \(A\) does not occur.
These basic operations can be extended to more than two events as well. A graphical method for operations on events (or sets), using so-called Venn diagrams, is demonstrated at https://www.youtube.com/watch?v=YYM_Wju0-so.
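As a small illustration of these operations, the sketch below uses Python sets (the course itself works in R; Python is used here only as a compact, self-contained example). The die sample space and the events A and B are assumptions chosen for illustration.

```python
# Sample space for one roll of a six-sided die (illustrative assumption)
sample_space = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # event: the outcome is odd
B = {1, 2, 3}   # event: the outcome is at most 3

union = A | B                     # A occurs, or B occurs, or both
intersection = A & B              # both A and B occur
complement_A = sample_space - A   # A does not occur

print(union, intersection, complement_A)
```

The complement is always taken relative to the sample space, which is why it is computed as a set difference.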
Counting Techniques
Counting techniques in probability involve methods to determine the
number of possible outcomes in a sample space or the number of ways
events can occur. These techniques are essential for calculating some
probabilities.
Permutations
Permutations are arrangements of a set of objects in a specific
order. When dealing with a set of n distinct objects and selecting r of
them in a specific order, the number of permutations is denoted as \(_nP_r\) or \(P(n,
r)\) and calculated as \(\frac{n!}{ (n
- r)!}\). Permutations are used in various scenarios, such as
arranging people in a line, arranging letters in a word, or selecting a
specific order of events.
Example:
Suppose you have 5 different books, and you want to arrange 3 of them
on a shelf. The number of ways to arrange these books is \(_5P_3 = \frac{5!}{ (5 - 3)!} = \frac{5!}{ 2!} =
60\).
Combinations
Combinations are selections of a subset of objects from a larger set,
where the order of selection does not matter. When dealing with a set of
\(n\) distinct objects and selecting
\(r\) of them without regard to order,
the number of combinations is denoted as \(_nC_r\) or \(C(n,
r)\) or \(\binom{n}{r}\) and
calculated as \(\frac{n!} {r! (n -
r)!}\). Combinations are used when the order of elements does not
influence the outcome, such as selecting a team of players from a pool
of candidates.
Example:
Suppose there are 8 candidates running for a committee, and you want
to select 4 of them. The number of ways to form the committee is \(\binom{8}{4} = \frac{8!} {4! (8 - 4)!} =
70\).
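The two worked examples (arranging books and forming a committee) can be checked numerically. A minimal sketch using Python's standard `math` module, which provides `math.perm` and `math.comb` in Python 3.8 and later; the variable names are illustrative.

```python
import math

# Ordered arrangements of 3 books chosen from 5: n! / (n - r)!
book_arrangements = math.perm(5, 3)
print(book_arrangements)   # 60

# Unordered committees of 4 chosen from 8: n! / (r! (n - r)!)
committees = math.comb(8, 4)
print(committees)          # 70
```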
Multiplication Principle:
The multiplication principle, also known as the fundamental counting
principle, states that if there are \(m\) ways to do one thing and \(n\) ways to do another thing, then there
are \(m \cdot n\) ways to do both
things together. This principle applies when the choices are independent, meaning that the number of ways to do the second thing does not depend on how the first thing was done.
Example:
Suppose you have 3 different shirts and 2 different pairs of pants.
The number of ways to choose a shirt and a pair of pants to wear is
\(3 \cdot 2 = 6\).
Addition Principle:
The addition principle states that if there are \(m\) ways to do one thing and \(n\) ways to do another thing, and these
events are mutually exclusive (cannot happen together), then there are
\(m + n\) ways to do either one thing
or the other. This principle underlies casework, which involves splitting a problem into several non-overlapping cases, counting each case individually, and then adding the counts together.
Example:
Suppose we want to find the number of 4-digit numbers that are
divisible by 5. Since all integers that are divisible by 5 must end with
0 or 5, we can use casework based on the ending digit. Let’s consider
two cases:
Case 1: 0 is at the end. In this case, we have the following arrangement: _ _ _ 0. For the remaining three digits, there are 9 ways to choose the first digit (note: 0 can’t be the leading digit), 10 ways to choose the second digit, and 10 ways to choose the third digit. By the multiplication principle, there are \(9\cdot 10\cdot 10 = 900\) such numbers.
Case 2: 5 is at the end. In this case, we have the following arrangement: _ _ _ 5. By the same reasoning, there are \(9\cdot 10\cdot 10 = 900\) numbers in this case.
Now, we add up the counts from both cases: 900 + 900 = 1800. So, there are 1800 4-digit positive integers that are divisible by 5.
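A casework count like this one is easy to sanity-check by brute force. The sketch below (Python, standard library only; variable names are illustrative) enumerates all 4-digit integers, counts each case separately, and compares the sum with a direct count.

```python
four_digit = range(1000, 10000)   # all 4-digit positive integers

# Case 1: last digit is 0; Case 2: last digit is 5
ends_in_0 = sum(1 for n in four_digit if n % 10 == 0)
ends_in_5 = sum(1 for n in four_digit if n % 10 == 5)

# Direct count of multiples of 5, for comparison with the casework total
divisible_by_5 = sum(1 for n in four_digit if n % 5 == 0)

print(ends_in_0, ends_in_5, ends_in_0 + ends_in_5, divisible_by_5)
```

Because the two cases are mutually exclusive and together cover every multiple of 5, the case counts add up to the direct count.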
Probability
Probability is a fundamental concept in engineering that plays a
crucial role in decision-making, risk analysis, and designing robust
systems. It is a branch of mathematics.
Key Concepts in Probability:
Sample Space: The set of all possible outcomes of an experiment is
called the sample space. For example, if you are rolling a six-sided
die, the sample space would be {1, 2, 3, 4, 5, 6}.
Event: An event is a subset of the sample space. It represents a
particular outcome or a combination of outcomes.
Probability of an Event: The probability of an event represents the
likelihood of that event occurring. It is a number between 0 and 1,
where 0 indicates an impossible event, and 1 represents a certain event.
The probability of an event \(A\) is
denoted by \(P(A)\).
Basic Probability Rules: Probability follows certain rules. The sum
of probabilities of all possible outcomes in the sample space is always
1. The probability of an event not occurring is 1 minus the probability
of the event occurring. For mutually exclusive events (or disjoint
events, events that do not happen simultaneously), the probability of
either event occurring is the sum of their individual probabilities.
Random Variables: In engineering, we often deal with random
variables, which are variables whose values are determined by chance.
Random variables can be discrete (taking specific values) or continuous
(taking any value within a range).
Conditional Probability
Conditional Probability: Conditional probability is the probability
of an event occurring given that another event has already occurred. It
is denoted as P(A|B), which is the probability of A given B. To
calculate P(A|B), we use the formula:
\[P(A|B)=\frac{P(A\cap B)}{P(B)}\]
Example: Throw a 6-sided die once. If the outcome is an odd number, what
is the probability that the die landed on 1, 2, or 3?
Solution. Let \(A\) denote
the event that the outcome is odd. Let \(B\) denote the event that the outcome is 1,
2, or 3. It’s easy to see that \(A\cap B\) is the event that the outcome is 1 or 3. Furthermore, \(P(A)=\frac{3}{6}\) and \(P(A\cap B)=\frac{2}{6}\). We need to find \(P(B|A)\). By the definition of conditional probability,
\[P(B|A)=\frac{P(A\cap B)}{P(A)}=\frac{2/6}{3/6}=\frac{2}{3}.\] That is, given the outcome is odd, the probability that the die landed on 1, 2, or 3 is 2/3.
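The same calculation can be reproduced by enumerating the equally likely outcomes. A minimal Python sketch using exact fractions; the helper `prob` is a hypothetical name introduced here and assumes a fair die.

```python
from fractions import Fraction

outcomes = range(1, 7)                     # fair six-sided die
A = {x for x in outcomes if x % 2 == 1}    # the outcome is odd
B = {1, 2, 3}                              # the outcome is 1, 2, or 3

def prob(event):
    # Equally likely outcomes: favorable count over total count
    return Fraction(len(event), 6)

p_B_given_A = prob(A & B) / prob(A)        # P(B|A) = P(A and B) / P(A)
print(p_B_given_A)   # 2/3
```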
Total Probability Rules
The Total Probability Rule, also known as the
Law of Total Probability, is a fundamental concept in
probability theory that allows us to calculate the probability of an
event by considering all possible ways or scenarios that lead to that
event. It is particularly useful when the event of interest depends on
different conditions or sub-events. The Total Probability Rule is
expressed as follows:
Suppose we have a partition of the sample space \(\Omega\), i.e., a set of mutually exclusive events \(\{B_1, B_2, \ldots, B_n\}\) whose union is the entire sample space. Then, for any event \(A\), its probability can be calculated
as:
\[P(A) = \sum_{i=1}^n[P(A|B_i) \cdot
P(B_i)].\]
In simpler terms, the probability of event \(A\) is the sum of the probabilities of
event \(A\) occurring given each
condition \(B_i\) multiplied by the
probability of each condition \(B_i\).
Example 1
Suppose a factory produces light bulbs, and there are two machines
used to manufacture them: Machine A and Machine B. Machine A produces
60% of the bulbs, and Machine B produces the remaining 40%. The
probability that a bulb is defective, given it was produced by Machine
A, is 0.03, and the probability that a bulb is defective, given it was
produced by Machine B, is 0.05. What is the probability that a randomly
selected bulb is defective?
Solution:
Let \(A\) be the event that a bulb
is defective, and let \(B_1\) be the
event that the bulb is produced by Machine A, and \(B_2\) be the event that the bulb is
produced by Machine B. The partition \(\{B_1,
B_2\}\) covers the entire sample space (all bulbs).
Using the Total Probability Rule: \[P(A) =
P(A|B_1) \cdot P(B_1) + P(A|B_2) \cdot P(B_2)\] \[P(A) = (0.03 \cdot 0.60) + (0.05 \cdot
0.40)\] \[P(A) = 0.018 +
0.020\] \[P(A) = 0.038\]
The probability that a randomly selected bulb is defective is 0.038
(or 3.8%).
Example 2
Suppose the weather conditions in a certain city can be categorized
into three types: Sunny (S), Cloudy (C), and Rainy (R). Historical data
shows that the probabilities of these conditions are \(P(S) = 0.4, P(C) = 0.3\), and \(P(R) = 0.3\). The probability of carrying
an umbrella on a Sunny day is 0.1, on a Cloudy day is 0.3, and on a
Rainy day is 0.8. What is the overall probability of carrying an
umbrella in this city?
Solution:
Let \(A\) be the event of carrying
an umbrella, and let \(B_1, B_2\), and
\(B_3\) represent the events of having
a Sunny, Cloudy, and Rainy day, respectively. The partition \(\{B_1, B_2, B_3\}\) covers the entire sample
space (all possible weather conditions).
Using the Total Probability Rule:
\[P(A) = P(A|B_1) \cdot P(B_1) + P(A|B_2)
\cdot P(B_2) + P(A|B_3) \cdot P(B_3)\]
\[P(A) = (0.1 \cdot 0.4) + (0.3 \cdot 0.3)
+ (0.8 \cdot 0.3)\] \[P(A) = 0.04 +
0.09 + 0.24\] \[P(A) =
0.37\]
The overall probability of carrying an umbrella in this city is 0.37
(or 37%).
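Both examples apply the same weighted sum, so they can be computed with one small function. The following Python sketch is illustrative only; `total_probability` is a hypothetical helper name, and it assumes the priors describe a genuine partition (they sum to one).

```python
def total_probability(priors, likelihoods):
    """Return the sum of P(A | B_i) * P(B_i) over a partition B_1, ..., B_n."""
    return sum(like * prior for prior, like in zip(priors, likelihoods))

# Example 1: defective bulbs from Machine A (60%) and Machine B (40%)
p_defective = total_probability([0.60, 0.40], [0.03, 0.05])

# Example 2: carrying an umbrella on Sunny, Cloudy, and Rainy days
p_umbrella = total_probability([0.4, 0.3, 0.3], [0.1, 0.3, 0.8])

print(round(p_defective, 3), round(p_umbrella, 2))
```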
Bayes’ Theorem
Bayes’ Theorem, also known as Bayes’ Rule or Bayes’ Law, is a
fundamental concept in probability theory and statistics. It provides a
way to update the probability of an event based on new evidence or
information. The theorem is named after the Reverend Thomas Bayes, an
18th-century mathematician and theologian, who first formulated the
idea.
Bayes’ Theorem is stated as follows:
\[P(A|B) = \frac{P(B|A) \cdot P(A)}
{P(B)}\]
where:
\(P(A|B)\) is the conditional
probability of event \(A\) occurring
given that event \(B\) has occurred.
\(P(B|A)\) is the conditional
probability of event \(B\) occurring
given that event \(A\) has occurred.
\(P(A)\) is the probability of event
\(A\) occurring without considering
event \(B\). \(P(B)\) is the probability of event \(B\) occurring without considering event
\(A\).
In simpler terms, Bayes’ Theorem allows us to update our prior belief about the probability of event \(A\) (i.e., \(P(A)\)) based on new evidence or information provided by event \(B\). The resulting probability, \(P(A|B)\), is called the posterior probability.
Example 1.
Suppose we have a rare disease that affects 1 in 10,000 people (i.e.,
\(P(A) = 0.0001\) with \(A\) representing the event of having the
disease). A medical test is conducted to diagnose the disease, and the
test has a false-positive rate of 1% (i.e., \(P(B|A') = 0.01\), where \(A'\) represents not having the disease
and \(B\) represents the test being
positive). The test also has a true-positive rate of 99% (i.e., \(P(B|A) = 0.99\)).
Now, we want to find the probability that a person has the disease
given that the test result is positive (P(A|B)).
Using Bayes’ Theorem:
\[P(A|B) = \frac{P(B|A) \cdot
P(A)} {P(B)}\] \[P(A|B) = \frac{0.99
\cdot 0.0001}{0.01 \cdot 0.9999 + 0.99 \cdot 0.0001}\] \[P(A|B) = \frac{0.000099}{0.009999 +
0.000099}\] \[P(A|B) \approx
0.0098039\]
The probability that a person has the disease given a positive test
result is approximately 0.0098 (or 0.98%). Bayes’ Theorem allows us to
incorporate the test’s true-positive and false-positive rates to arrive
at a more accurate probability of having the disease after the test
result.
Example 2. Assume:
- 1 in 100,000 passengers is actually a threat.
- Security correctly detects threats 99% of the time.
- 1% of innocent passengers are falsely flagged.
Determine the probability that a passenger is actually a threat given
that they triggered a security alert.
Solution.
We are given
- \(P(T)=0.00001\)
- \(P(+|T)=0.99\)
- \(P(+|T^c)=0.01\)
By the Total Probability Formula, we get \(P(+)=(0.99\times 0.00001)+(0.01\times(1-0.00001))=0.0100098\).
By Bayes’ Formula, we get \(P(T|+)=\frac{0.99\times 0.00001}{0.0100098}\approx 0.00099\).
Interpretation:
Even if a passenger is flagged by security, the probability that
they are actually a threat is only 0.099%—less than 1%! This shows how
false positives can overwhelm true positives when the actual number of
threats is very low.
This illustrates why security personnel conduct secondary
screenings—to further investigate flagged passengers and reduce false
alarms.
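Examples 1 and 2 share the same structure: a rare condition, a true-positive rate, and a false-positive rate. A minimal Python sketch of that calculation follows; `posterior` and its parameter names are hypothetical, introduced only for illustration.

```python
def posterior(prior, true_pos, false_pos):
    """P(condition | positive result) by Bayes' Theorem."""
    # Denominator: Law of Total Probability over {condition, no condition}
    p_positive = true_pos * prior + false_pos * (1 - prior)
    return true_pos * prior / p_positive

p_disease = posterior(prior=0.0001, true_pos=0.99, false_pos=0.01)   # Example 1
p_threat = posterior(prior=0.00001, true_pos=0.99, false_pos=0.01)   # Example 2

print(round(p_disease, 7), round(p_threat, 5))
```

Trying a larger prior with the same test rates shows how strongly the posterior depends on how rare the condition is.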
Example 3.
A quality-control program at a plastic bottle production line
involves inspecting finished bottles for flaws such as microscopic
holes. The proportion of bottles that actually have such a flaw is only
0.0002. If a bottle has a flaw, the probability is 0.995 that it will
fail the inspection. If a bottle does not have a flaw, the probability
is 0.99 that it will pass the inspection.
(a) If a bottle fails inspection, what is the probability that it has a flaw?
(b) Which of the following is the more correct interpretation of the answer to part (a)?
i. Most bottles that fail inspection do not have a flaw.
ii. Most bottles that pass inspection do have a flaw.
(c) If a bottle passes inspection, what is the probability that it does not have a flaw?
(d) Which of the following is the more correct interpretation of the answer to part (c)?
i. Most bottles that fail inspection do have a flaw.
ii. Most bottles that pass inspection do not have a flaw.
Solution.
Denote \(A\) = “flaw” and \(B\) = “fail”. We are given that \(P(A)=0.0002\), \(P(B|A)=0.995\), and \(P(B^c|A^c)=0.99\). By the Total Probability
Formula, we have: \[P(B)=P(B|A)\cdot P(A) +
P(B|A^c)\cdot P(A^c)=(0.995)(0.0002)+(1-0.99)(1-0.0002)=
0.010197.\]
(a) \(P(A|B)=\frac{P(A\cap
B)}{P(B)}=\frac{P(B|A)\cdot
P(A)}{P(B)}=\frac{(0.995)(0.0002)}{0.010197}\approx 0.01952\)
(b) Most bottles that fail inspection do not have a flaw.
(c) \(P(A^c|B^c)=\frac{P(A^c\cap
B^c)}{P(B^c)}=\frac{P(B^c|A^c)\cdot
P(A^c)}{P(B^c)}=\frac{(0.99)(1-0.0002)}{1-0.010197}\approx 0.999999\)
(d) Most bottles that pass inspection do not have a flaw.
Bayes’ Rule can be extended to more general situations. Let a
sample space \(S\) (as a universal set)
be decomposed into \(k\) disjoint subsets (events) denoted \(A_1, A_2, \cdots, A_k\), whose union is \(S\). The collection
\(\{A_1, A_2, \cdots, A_k\}\) is called a
decomposition (or partition) of the sample space.
Let \(B\) be an event. The Law of
Total Probability states that
\[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) +
\cdots + P(A_k)P(B|A_k)\] How can we calculate these conditional
probabilities \(P(A_1|B), P(A_2|B), \cdots,
\text{and}~P(A_k|B)\)?
By the definition of conditional probability, we have \(P(A_1|B)=\frac{P(A_1\cap B)}{P(B)}\). Then,
applying the conditional probability formula again to the numerator and
applying the Law of Total Probability to the denominator yield
\[P(A_1|B)=\frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) +
P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)}\] Note that the
numerator is one of the terms in the denominator!
Example 4. \(20\%\) of a school’s computers are manufactured by company A, \(30\%\) by company B, and the remaining \(50\%\) by company C. Suppose that \(2\%\) of A computers are defective, \(3\%\) of B computers are defective, and \(2.5\%\) of C computers are defective. A computer is randomly selected from the school.
(a) If the chosen computer is defective, what is the probability that the computer is from company A?
(b) If the chosen computer is defective, what is the probability that the computer is from company B?
(c) If the chosen computer is defective, what is the probability that the computer is from company C?
Solution.
Each of the school’s computers has a unique ID, so all computers are
distinct. Let \(S\) be the set of all
computers in the school. \(S\) is a
sample space. Let \(A_1\) be the event
that a randomly selected computer is from company A. Let \(A_2\) be the event that a randomly selected
computer is from company B. Let \(A_3\)
be the event that a randomly selected computer is from company C. We
already know that \(P(A_1) = 0.20,
P(A_2)=0.30, P(A_3)=0.50\), and the collection \(\{A_1, A_2, A_3\}\) is a decomposition of
the sample space. Let \(B\) be the
event that a randomly selected computer is defective. We also know that
\(P(B|A_1)=0.02, P(B|A_2)=0.03,
P(B|A_3)=0.025\). By the Law of Total Probability, we have \[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) +
P(A_3)P(B|A_3)=(0.20)(0.02)+(0.30)(0.03)+(0.50)(0.025)=0.004+0.009+0.0125=0.0255.\]
\(P(A_1|B)=\frac{0.004}{0.0255}=0.1569\)
\(P(A_2|B)=\frac{0.009}{0.0255}=0.3529\)
\(P(A_3|B)=\frac{0.0125}{0.0255}=0.4902\)
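The general form of Bayes' Rule with \(k\) hypotheses translates directly into a few lines of code. The Python sketch below is one possible way to organize the computation for Example 4; `bayes_posteriors` is a hypothetical helper name, and the inputs are the priors \(P(A_i)\) and likelihoods \(P(B|A_i)\) given in the example.

```python
def bayes_posteriors(priors, likelihoods):
    """Return [P(A_i | B)] for a decomposition A_1, ..., A_k of the sample space."""
    joint = [p * like for p, like in zip(priors, likelihoods)]  # P(A_i)P(B|A_i)
    p_b = sum(joint)                                            # Law of Total Probability
    return [j / p_b for j in joint]

# Companies A, B, C: market shares and defect rates from Example 4
post = bayes_posteriors([0.20, 0.30, 0.50], [0.02, 0.03, 0.025])
print([round(x, 4) for x in post])
```

Each numerator is one term of the denominator, so the posterior probabilities always sum to one.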
DIY: Construct a probability tree for the problem to find the total
probability (\(P(B)\)) as the
denominator of the Bayes’ Rule following the videos: https://www.youtube.com/watch?v=ql2qLe4UYK0 (Right click
and open in new window) and https://www.youtube.com/watch?v=dRdCUUgrwVw and https://www.youtube.com/watch?v=XvaS2GO6MGk as well.
Independence
In probability theory, two events A and B are considered independent
if the occurrence of one event does not affect the probability of the
other event occurring. In other words, the outcome of one event provides
no information or influence on the outcome of the other event.
Mathematically, events \(A\) and \(B\) are independent if and only if \(P(A \cap B) = P(A) \cdot P(B)\),
where \(P(A \cap B)\) represents the
probability of both events \(A\) and
\(B\) happening together, \(P(A)\) is the probability of event \(A\) occurring, and \(P(B)\) is the probability of event \(B\) occurring.
If two events \(A\) and \(B\) are independent, then \(P(A|B) = P(A)\) and \(P(B|A) = P(B)\) (whenever the conditioning event has positive probability).
Example:
Rolling a fair six-sided die twice, the outcomes of each roll are
independent events. The probability of rolling a 3 on the first roll is
1/6, and the probability of rolling a 3 on the second roll is also 1/6.
The probability of rolling a 3 on both rolls (both events occurring
together) is 1/6 * 1/6 = 1/36.
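The product rule for this example can be verified by listing all 36 equally likely pairs of rolls. A short Python sketch with exact fractions; the helper `prob` is a hypothetical name and assumes a fair die.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely (first, second) pairs

def prob(event):
    # Fraction of pairs for which the event (a function of the pair) is true
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

p_first = prob(lambda r: r[0] == 3)                 # 3 on the first roll
p_second = prob(lambda r: r[1] == 3)                # 3 on the second roll
p_both = prob(lambda r: r[0] == 3 and r[1] == 3)    # 3 on both rolls

print(p_first, p_second, p_both, p_both == p_first * p_second)
```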
Exercise
- Suppose cyberattacks occur 1% of the time. An intrusion detection system (IDS) detects 95% of real attacks and falsely flags 5% of normal activity as an attack. If the IDS triggers an alert, the probability that it is a real attack is ____.
Discrete Random Variables and Probability Distributions
In probability and statistics, a random variable is a variable that
can take on different values, each with a certain probability, due to
underlying random processes or uncertainty. Random variables are a
fundamental concept in probability theory and play a crucial role in
modeling and analyzing uncertain events and probabilistic phenomena.
They serve as a bridge between the theoretical mathematics of
probability and real-world applications in various fields, including
engineering, economics, physics, and social sciences.
Random variables can be categorized into two main types: discrete
random variables and continuous random variables. We focus on discrete
random variables in this chapter.
A discrete random variable is a random variable whose set of possible values is countable. A random variable is usually denoted by an uppercase letter such as \(X, Y\), or \(Z\).
Examples:
A lab has 10 computers. Let \(X\) denote the number of computers that
fail to work. Then \(X\) is a random
variable.
The engineers in a large company can be mechanical engineers, electrical engineers, or other. Randomly choose an engineer from this company. Let \(T\) denote the type of the selected engineer. \(T\) is a discrete random variable, taking values in the set {me, ee, ot}.
The lifetime of a randomly chosen computer is denoted by \(Y\). \(Y\) is NOT a discrete random variable. It is called a continuous random variable.
Probability Mass Functions
If \(X\) is a discrete random
variable, then we
- denote the probability that \(X=x\)
by \(p(x)\) and
- call \(p(x)\) the
probability mass function (PMF) of \(X\).
The PMF fully describes the distribution of the random
variable \(X\).
- The set of values \(x\) for which \(p(x)>0\) is called the support of the distribution (or of the random variable).
- The sum of the probabilities \(p(x)\) over all values in the support is equal to one.
Example 1:
Flip a fair coin twice. Let \(X\) be
the number of heads. Then \(X\) can
take values 0, 1, 2 with probabilities 0.25, 0.5, 0.25. The sum of the
probabilities equals one.
Example 2:
Randomly choose a value from the set \(\{2,
3, 5, 5, 6, 2, 5, 8\}\). Denote the chosen value by \(X\). Then \(X\) is a discrete random variable. Then
\(X\) can take values 2, 3, 5, 6, and 8
with probabilities 2/8, 1/8, 3/8, 1/8, and 1/8. The sum of the
probabilities equals one.
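For Example 2, the PMF is just the relative frequency of each distinct value in the set. A minimal Python sketch, assuming each of the eight entries is equally likely to be chosen; variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

values = [2, 3, 5, 5, 6, 2, 5, 8]
counts = Counter(values)   # how many times each distinct value appears

# PMF: p(x) = (number of entries equal to x) / (total number of entries)
pmf = {x: Fraction(c, len(values)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
print(sum(pmf.values()))   # the probabilities sum to 1
```

Note that `Fraction` reduces 2/8 to 1/4 automatically; the value is the same.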
Cumulative Distribution Functions
A random variable can be categorical or numerical (discrete or
continuous). The PMF can be used to describe its distribution. If \(X\) is a numeric random variable, we can
also equivalently use the cumulative distribution function (CDF),
usually denoted by \(F(x)\), to
describe its distribution. This CDF is defined as the probability that
\(X\) is no greater than \(x\); that is, \(F(x)=P(X\le x)\).
An example:
Suppose \(X\) is a random variable
taking values \(-2\), 0, 3, and 5 with
probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the CDF \(F(x)\) can be determined as follows:
\[\text{when} ~x< -2, F(x) = P(X\le x)
=P(\phi)= 0\] \[\text{when} ~-2\le
x< 0, F(x) = P(X\le x) =P(X=-2)= 0.3\] \[\text{when} ~0\le x< 3, F(x) = P(X\le x)
=P(X=-2 ~\text{or} ~0)= 0.3+0.1=0.4\] \[\text{when} ~3\le x< 5, F(x) = P(X\le x)
=P(X=-2, ~0, ~\text{or} ~3)= 0.3+0.1+0.2=0.6\] \[\text{when} ~x\ge 5, F(x) = P(X\le x) =P(X=-2,
~0, ~3, \text{or} ~5)= 0.3+0.1+0.2+0.4=1\] The above can be
written as a piece-wise (right-continuous) function:
\[F(x)=\begin{cases}
0, & x<-2 \\
0.3, & -2\le x< 0 \\
0.4, & 0\le x< 3\\
0.6, & 3\le x< 5\\
1, & x\ge5
\end{cases}
\]

This piece-wise function has a graph that is step-wise and
right-continuous. This observation is in general true for all discrete
random variables.
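The step-function behavior is easy to see by evaluating the CDF at a few points. The Python sketch below builds \(F(x)\) from the PMF of the example above; the function name `cdf` and the chosen test points are illustrative.

```python
pmf = {-2: 0.3, 0: 0.1, 3: 0.2, 5: 0.4}   # values and probabilities from the example

def cdf(x):
    """F(x) = P(X <= x): add the probabilities of all values not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

# F is flat between possible values and jumps at each of them
for x in (-3, -2, 1, 3, 4.9, 5):
    print(x, round(cdf(x), 10))
```

Evaluating at 4.9 and at 5 shows the jump of size 0.4 at \(x=5\), and the value at the jump point belongs to the upper step (right-continuity).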
Mean and Variance of a Discrete Random Variable
The mean of a discrete random variable (or a
discrete distribution) is defined to be the sum of products of values
and probabilities. In other words, the mean is the weighted average of
the values with weights being the corresponding probabilities. Use the
Greek letter \(\mu\) to denote the
mean. The mean describes the long-run average value taken by the
random variable.
Example 1:
Suppose \(X\) is a random variable
taking values \(-2\), 0, 3, and 5 with
probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the mean of
\(X\) can be determined as follows:
The mean is
\[\mu_X =
(-2)(0.3)+(0)(0.1)+(3)(0.2)+(5)(0.4)=-0.6+0+0.6+2=2\]
Each possible value of the random variable is a certain distance away
from the mean. These signed distances, \(x-\mu\), are called deviations. The
variance of a discrete random variable (or a discrete
distribution) is the weighted average of the squared deviations with
weights being the corresponding probabilities. That is,
\[\sigma^2=\sum_x (x-\mu)^2 p(x),\] where
the sum runs over all possible values \(x\) of
the random variable and \(p(x)\) is the PMF.
Use the Greek letter \(\sigma^2\) to
denote the variance. The square root of the variance is called the
standard deviation, which describes how far a typical value of the
random variable is from the mean. Both variance and standard deviation
describe the variation of the random variable.
The variance is
\[\sigma^2_X =
(-2-2)^2(0.3)+(0-2)^2(0.1)+(3-2)^2(0.2)+(5-2)^2(0.4)=4.8+0.4+0.2+3.6=9\]
When calculating the variance of a discrete random variable, you can
use the following formula instead:
\[\sigma^2=\sum_x x^2 p(x) - \mu^2\]
The above variance can be calculated as follows:
\[\sigma^2_X =
(-2)^2(0.3)+(0)^2(0.1)+(3)^2(0.2)+(5)^2(0.4)-2^2=1.2+0+1.8+10-4=9\]
The standard deviation is
\[\sigma_X=\sqrt{9}=3.\]
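The shortcut formula follows from expanding the square in the definition of the variance. A short derivation, writing \(p(x)\) for the PMF and using \(\sum_x p(x)=1\) and \(\sum_x x\,p(x)=\mu\):
\[\sum_x (x-\mu)^2 p(x)=\sum_x x^2 p(x)-2\mu\sum_x x\,p(x)+\mu^2\sum_x p(x)=\sum_x x^2 p(x)-2\mu^2+\mu^2=\sum_x x^2 p(x)-\mu^2.\]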
Example 2:
Randomly choose a value from the set \(\{2,
3, 5, 5, 6, 2, 5, 8\}\). Denote the chosen value by \(X\). Then \(X\) is a discrete random variable. The
random variable \(X\) can take values
2, 3, 5, 6, and 8 with probabilities 2/8, 1/8, 3/8, 1/8, and 1/8.
- The mean is \(\mu=(2)(2/8)+(3)(1/8)+(5)(3/8)+(6)(1/8)+(8)(1/8)=4.5\),
which is equal to the mean of the given set (called a population).
- The variance is \(\sigma^2=(2^2)(2/8)+(3^2)(1/8)+(5^2)(3/8)+(6^2)(1/8)+(8^2)(1/8)-4.5^2=3.75\)
and the standard deviation is approximately 1.9365.
- The above mean, variance, and standard deviation are also said to be
those of the population.
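As a quick check of Example 2, the population mean and variance can be computed directly from the eight values. This is a minimal sketch; the names `pop`, `pmf`, `mu`, and `sigma2` are chosen here for illustration. Note that R's `var()` divides by \(n-1\) because it is meant for samples, so it does not return the population variance.

```r
# The population from Example 2
pop <- c(2, 3, 5, 5, 6, 2, 5, 8)

# pmf of X: relative frequency of each distinct value
pmf <- table(pop) / length(pop)

# Population mean and variance (divide by N, not N - 1)
mu <- mean(pop)
sigma2 <- mean((pop - mu)^2)
sigma <- sqrt(sigma2)

# For comparison: var() divides by N - 1, so it is larger
sample_var <- var(pop)

pmf
c(mean = mu, variance = sigma2, sd = sigma, var_function = sample_var)
```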
A useful result: if \(X\) is a
random variable with mean \(\mu\) and
variance \(\sigma^2\), then \(Y=cX\) is a new random variable, where
\(c\) is a constant. Furthermore,
the mean of \(Y\) is \(c\mu\),
the variance of \(Y\) is \(c^2\sigma^2\), and
the standard deviation of \(Y\)
is \(|c|\sigma\) (which equals \(c\sigma\) when \(c\ge 0\)).
Another useful result: if \(X\) is a
random variable with mean \(\mu\) and
variance \(\sigma^2\), then \(Y=X+c\) is a new random variable, where
\(c\) is a constant. Furthermore,
the mean of \(Y\) is \(\mu+c\),
the variance of \(Y\) is \(\sigma^2\) as well, and
the standard deviation of \(Y\)
is \(\sigma\) as well.
Binomial
Distribution
Consider two situations:
Let’s say we have a coin (may not be fair). The coin lands on
heads with probability of \(p\) (maybe
0.5 or not). Flip it \(n\) times. Let
\(X\) denote the number of times it
lands on heads. Then, \(X\) is a
discrete random variable, since it can only take the values \(0, 1, 2, \cdots, n\). What is the probability mass function
(pmf) of \(X\)?
There is a sea of products. The proportion of defective products
is \(p\). Randomly choose \(n\) products from this sea of products. Let
\(Y\) denote the number of defective
products. Then, \(Y\) is a discrete
random variable, since it can only take the values \(0, 1, 2, \cdots, n\). What is the probability mass function
(pmf) of \(Y\)?
It turns out that the two random variables have the same distribution
(choosing a defective product is like flipping a head, with the same
probability \(p\)), with the pmf given
by
\[p(x)=P(X=x)=\binom{n}{x}p^x(1-p)^{n-x},
~~~x = 0, 1, 2, \cdots, n\] where \(\binom{n}{x}=\frac{n!}{x!\cdot(n-x)!}\) and
\(n!\) is the product of the first
\(n\) consecutive positive integers.
For example, \(5! = 5\cdot
4\cdot3\cdot2\cdot1=120\), \(4!=24\), \(3!=6\), \(2!=2\), \(1!=1\), and \(0!=1\), and \(\binom{5}{2}=\frac{5!}{2!\cdot(5-2)!}=\frac{120}{2\cdot
6}=10\).
The binomial coefficient \(\binom{n}{x}\) in
the above formula counts the number of ways of
selecting/having \(x\) events (heads or
defectives) among the \(n\) trials. The factors \(p^x\) and \((1-p)^{n-x}\) are products of individual probabilities, which is valid because the trials (flips or
selections) are all independent.
The distribution is called the binomial distribution, with parameters
\(n\) and \(p\). A parameter is not a variable, but is
(usually) an unknown quantity.
The mean of this distribution is \(n\cdot
p\) and the variance is \(n\cdot p\cdot
(1-p)\).
An example:
Products manufactured by XYZ company have a defective rate of 0.01.
Randomly pick a set of 100 products.
What is the probability that exactly 3 of these 100 selected
products are defective?
What is the probability that less than 3 of these 100 selected
products are defective?
How many of these 100 selected products are expected to be
defective?
Solution.
Let \(X\) denote the number of
defective products out of the 100 selected ones. Then, \(X\) has a binomial distribution with \(n=100\) and \(p=0.01\).
\(p(3)=\binom{100}{3}0.01^3(1-0.01)^{100-3}=0.0610\)
\(F(2)=P(X<
3)=P(X=0)+P(X=1)+P(X=2)=p(0)+p(1)+p(2)=0.9206\). Here “less than
3” means “less than or equal to 2”. You need to use the binomial
probability formula 3 times!
The expected number of defective products is the same as the
mean, and it is \(n\cdot
p=100(0.01)=1\), as expected.
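Instead of applying the binomial formula by hand, we can let R do the work. The sketch below uses the base R functions `dbinom()` (pmf), `pbinom()` (cumulative probability), and `choose()` (binomial coefficient); the variable names are chosen here for illustration.

```r
n <- 100   # number of products sampled
p <- 0.01  # defective rate

# (a) P(X = 3): binomial pmf
p3 <- dbinom(3, size = n, prob = p)

# The same probability from the formula, using choose() for the coefficient
p3_formula <- choose(n, 3) * p^3 * (1 - p)^(n - 3)

# (b) P(X < 3) = P(X <= 2): cumulative probability
p_less3 <- pbinom(2, size = n, prob = p)

# (c) expected number of defective products
expected <- n * p

round(c(p3 = p3, p3_formula = p3_formula, p_less3 = p_less3), 4)
expected
```

Using `pbinom(2, ...)` rather than `pbinom(3, ...)` matters here, because "less than 3" excludes the value 3.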
Geometric and
Negative Binomial Distributions
In quality control, we might be interested in the number of products
to be sampled in order to see the first defective product from a sea of
products.
Again, consider two situations:
Flip a coin until a head is seen. Denote the number of flips by
\(X\). Assume the probability of a head
to be \(p\).
Sample products until a defective product is seen. Denote the
number of products sampled as \(Y\).
Assume the defective rate is \(p\).
Both \(X\) and \(Y\) have the same distribution whose
probability mass function is given as follows:
\[p(x)=(1-p)^{x-1}\cdot p, ~~x=1, 2, 3,
\cdots\] since selections are independent, and there are
\(x-1\) normal ones followed by exactly one
defective.
The distribution is called the geometric distribution. The mean is
shown to be \(\frac{1}{p}\) and the
variance is \(\frac{1-p}{p^2}\).
The negative binomial distribution is for the
situation in which you keep flipping or sampling until you see \(k\) heads or
defectives; the random variable is the total number \(x\) of flips or samplings needed. The last one is for sure a head or
defective, and you apply the binomial formula to the previous \(x-1\) flips or
samplings, which must contain \(k-1\) heads or defectives.
Thus the formula is
\[p(x) =
\binom{x-1}{k-1}p^{k-1}(1-p)^{x-k}\cdot p, ~~x= k, k+1, k+2,
\cdots\]
The mean of the negative binomial distribution is \(\frac{k}{p}\) and variance is \(\frac{1-p}{p^2}\cdot k\).
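R has built-in functions for both distributions, but with one catch: `dgeom()` and `dnbinom()` count the number of failures before the success, not the total number of trials. The sketch below compares the formulas above with the R functions; the values `p = 0.2`, `x = 6`, and `k = 2` are assumed here purely for illustration.

```r
p <- 0.2  # probability of a head/defective (assumed value)
x <- 6    # total number of flips/samplings (assumed value)
k <- 2    # number of heads/defectives wanted (assumed value)

# Geometric: first head/defective occurs on trial x
geom_formula <- (1 - p)^(x - 1) * p
geom_r <- dgeom(x - 1, prob = p)            # x - 1 failures come first

# Negative binomial: k-th head/defective occurs on trial x
nb_formula <- choose(x - 1, k - 1) * p^(k - 1) * (1 - p)^(x - k) * p
nb_r <- dnbinom(x - k, size = k, prob = p)  # x - k failures in total

c(geom_formula = geom_formula, geom_r = geom_r,
  nb_formula = nb_formula, nb_r = nb_r)
```

Shifting the argument by the number of successes (`x - 1` or `x - k`) is what converts between "number of trials" in these notes and "number of failures" in R.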
Poisson
Distribution
When modeling the number of rare events (such as earthquakes and car
accidents), the Poisson distribution is often used. Let \(X\) denote the number of rare events in a
given dimension (space, area, or a time interval). The probability mass
function can be chosen to be
\[p(x) = \frac{\lambda^x}{x!}e^{-\lambda},
~~x = 0, 1, 2, \cdots, \infty\] where \(\lambda\) is the average number of rare
events per unit dimension (such as per cubic feet, per square feet, per
minute). The mean of this distribution is just \(\lambda\) and the variance is also \(\lambda\).
An example:
A website is attacked 3 times on average in each year.
What is the probability that the website will be attacked in the
next year?
What is the probability that the website will be attacked more
than 10 times in the next 5 years?
How many times will the website be expected to be attacked in the
following 10 years?
Solution.
Let \(X\) denote the number of
attacks in a year. Then \(X\) has a
Poisson distribution with mean \(\lambda=3\) attacks per year.
Let \(Y\) denote the number of
attacks in 5 years. Then \(Y\) has a
Poisson distribution with mean \(\lambda=3\cdot 5=15\) attacks every 5
years.
- The probability that the website will be attacked in the next year
is given by
\[P(X>0)=1-P(X=0)=1-\frac{3^0}{0!}e^{-3}=1-e^{-3}\approx
0.9502. \]
- The probability that the website will be attacked
more than 10 times in the next 5 years is given by
\[P(Y>10)=1-P(Y\le
10)=1-p(0)-p(1)-p(2)-\cdots - p(10)\approx 0.8815, \] where \(\lambda = 15\) should be used when
calculating \(p(0), p(1), \cdots,
p(10)\).
- The website will be expected to have \(3\cdot 10=30\) attacks in the next 10
years.
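Adding eleven Poisson probabilities by hand is tedious, so software helps here. The sketch below uses the base R functions `dpois()` (pmf) and `ppois()` (cumulative probability); the variable names are chosen here for illustration.

```r
# (a) P(X > 0) with lambda = 3 attacks per year
p_a <- 1 - dpois(0, lambda = 3)

# (b) P(Y > 10) with lambda = 15 attacks per 5 years
p_b <- 1 - ppois(10, lambda = 15)

# The same upper-tail probability, requested directly
p_b_upper <- ppois(10, lambda = 15, lower.tail = FALSE)

# (c) expected number of attacks in 10 years
expected_10 <- 3 * 10

round(c(p_a = p_a, p_b = p_b, p_b_upper = p_b_upper), 4)
expected_10
```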
Chapter Practice
Problems
- Poisson Distribution
- A bakery sells an average of 4 loaves of bread per hour. What is the
probability that they sell exactly 2 loaves in the next hour?
- The number of accidents at a traffic intersection follows a Poisson
distribution with an average rate of 1.5 accidents per week. What is the
probability of having no accidents in a given week?
- A researcher finds that an average of 5 emails are received per
hour. What is the probability that 7 emails will be received in the next
hour?
- Geometric Distribution
- A light bulb has a 20% chance of burning out each day. What is the
probability that it lasts exactly 3 days?
- A basketball player makes 70% of their free throws. What is the
probability that they make their first successful free throw on the 5th
attempt?
- The probability of a customer purchasing a product during their
first visit to a store is 0.3. What is the probability that a customer
makes a purchase for the first time on their third visit?
- Binomial Distribution
- A student has a 60% chance of passing an exam. If they take the exam
5 times, what is the probability that they pass exactly 3 times?
- A fair coin is flipped 10 times. What is the probability of getting
heads exactly 7 times?
- In a quality control process, 90% of products pass inspection. If 12
products are inspected, what is the probability that exactly 10
pass?
- Discrete Distribution
- A dataset consists of the following values: 3, 3, 6, 8, 10. Randomly
choose a value from the set and denote the value by X. Calculate the
mean and variance of the random variable X.
- A dataset consists of the following values: 3, 3, 6, 8, 10. Randomly
choose two values from the set without replacement and denote the
average of the two values by Y. Calculate the mean and variance of the
random variable Y.
Here are the answers double-checked:
Poisson Distribution
Geometric Distribution
Binomial Distribution
Continuous Random
Variables and Probability Distributions
A continuous random variable is one that can take on any value within
a certain range. The concept of probability mass function will not work
for continuous random variables, since the probability that \(X\) equals a single value is always 0
(finding a needle in a sea). This does not mean it is hopeless. Instead,
calculus is the savior.
The probabilities associated with continuous random variables are
represented by a probability density function (PDF). The area under the
graph of the PDF gives the likelihood of the random variable falling
within a specific interval. The area under the PDF curve over the entire
range is equal to 1.
Examples of a continuous random variable:
The temperature of a fluid in a system.
The time it takes for a customer service representative to handle
a call.
The distance a car travels before its engine fails.
Probability
Distributions and Probability Density Functions
For a continuous random variable, the counterpart of the probability
mass function for a discrete random variable is the probability
density function, denoted \(f(x)\), which is defined as the derivative
of the cumulative distribution function \(F(x)\). That is
\[f(x)=F'(x)\] The probability
density function has the following properties:
\(f(x)\) is never
negative.
The area under the curve of \(f(x)\) and above the \(x\)-axis is always 1; that is, \(\int_{-\infty}^{\infty}f(x)dx=1\).
The probability that \(X\) falls
in the interval \([a, b]\), \((a, b)\), \([a,
b)\), \((a, b]\) is given by
\(P(X<b)-P(X<a)\), or \(F(b)-F(a)\), or \(\int_{a}^{b}f(x)dx\), regardless of the
boundaries of the interval.
Cumulative
Distribution Functions
The cumulative distribution function (CDF) of a random variable \(X\) has range between 0 and 1, and is
always non-decreasing. For a continuous random variable, the CDF is a
continuous function.
To get the CDF from the pdf, we use the following formula:
\[F(x)=\int_{-\infty}^{x}f(t)dt\]
Mean and Variance of
a Continuous Random Variable
The mean (or expectation, or expected value) of a continuous random
variable whose probability density function is \(f(x)\) is given by
\[\mu =
\int_{-\infty}^{\infty}xf(x)dx\]
In this formula:
x represents the possible values of the random variable.
f(x) represents the probability density function of the random
variable at each value x.
The integral sums up the product of each possible value x and its
corresponding density f(x) over the entire range of possible
values.
The variance is given by
\[\sigma^2 =
\int_{-\infty}^{\infty}(x-\mu)^2f(x)dx ~~~\text{or}
~~\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2\]
The integral above computes the weighted sum of squared deviations
from the mean, where each squared deviation is weighted by its
corresponding probability density.
The mean is also denoted by \(E(X)\)
and variance by \(Var(X)\).
An example:
The probability density function of \(X\) is given by
\[f(x)=3x^2, ~~0<x<1\] \[f(x)=0, ~~\text{otherwise}\]
Find the mean and variance.
Find the probability that \(X<0.3\).
Find the probability that \(X>0.2\).
Find the probability that \(0.25<X<0.6\).
Solution.
- The mean is calculated as
\[\mu =
\int_{-\infty}^{\infty}xf(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}xf(x)dx=\int_{0}^{1}x\cdot
3x^2dx\stackrel{\text{?}}{=}\frac{3}{4}\]
where the first “?” is due to the fact that \(f(x)\) is only non-zero between 0 and 1,
and the second “?” is due to the fact that the anti-derivative of \(3x^3\) is \(\frac{3x^4}{4}\).
The variance is calculated as
\[\sigma ^2 =
\int_{-\infty}^{\infty}(x-\mu)^2f(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}(x-\mu)^2f(x)dx=\int_{0}^{1}(x-\frac{3}{4})^2\cdot
3x^2dx\stackrel{\text{?}}{=}3\int_{0}^{1}(x^4-\frac{3}{2}x^3+\frac{9}{16}x^2)dx=\frac{3}{80}\]
- The probability that \(X<0.3\)
is obtained by
\[P(X<0.3) =
\int_{-\infty}^{0.3}f(x)dx\stackrel{\text{?}}{=}\int_{0}^{0.3}f(x)dx=\int_{0}^{0.3}3x^2dx\stackrel{\text{?}}{=}0.027\]
- The probability that \(X>0.2\)
is obtained by
\[P(X>0.2) =
\int_{0.2}^{\infty}f(x)dx\stackrel{\text{?}}{=}\int_{0.2}^{1}f(x)dx=\int_{0.2}^{1}3x^2dx\stackrel{\text{?}}{=}0.992\]
- The probability that \(0.25<X<0.6\) is obtained by
\[P(0.25<X<0.6) =
\int_{0.25}^{0.6}f(x)dx=\int_{0.25}^{0.6}3x^2dx\stackrel{\text{?}}{\approx}0.2\]
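If you want to check integrals like these without doing the calculus, the base R function `integrate()` performs numerical integration. The following is a minimal sketch for the pdf \(f(x)=3x^2\) on \((0,1)\); the function name `f` and the other variable names are chosen here for illustration.

```r
# pdf from the example: 3x^2 on (0, 1) and 0 otherwise
f <- function(x) ifelse(x > 0 & x < 1, 3 * x^2, 0)

# (a) mean and variance
mu <- integrate(function(x) x * f(x), lower = 0, upper = 1)$value
sigma2 <- integrate(function(x) (x - mu)^2 * f(x), lower = 0, upper = 1)$value

# (b) P(X < 0.3)
p_b <- integrate(f, lower = 0, upper = 0.3)$value

# (c) P(X > 0.2)
p_c <- integrate(f, lower = 0.2, upper = 1)$value

# (d) P(0.25 < X < 0.6)
p_d <- integrate(f, lower = 0.25, upper = 0.6)$value

c(mean = mu, variance = sigma2, p_b = p_b, p_c = p_c, p_d = p_d)
```

The results are numerical approximations, so expect agreement with the exact answers only up to a small rounding error.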
If \(X\) is a random variable, then
\(X^2\) is too. Thus, we can talk about
the mean of \(X^2\), and it is denoted
by \(E(X^2)\). With this notation, the
variance formula of a random variable \(X\) can be written \(Var(X)=E(X^2)-[E(X)]^2\).
Normal
Distribution
When a random variable \(X\) has the
probability density function given by
\[f(x)=\frac{1}{\sqrt{2\pi}\cdot\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},
~~~-\infty<x<\infty\]
\(X\) is said to have a normal
distribution, denoted \(X\sim N(\mu,
\sigma^2)\) or \(X\sim N(\mu,
\sigma)\) according to different books. The mean is just equal to
\(\mu\) and the standard deviation is
just equal to \(\sigma\).
When \(X\sim N(\mu, \sigma^2)\), the
new random variable \(Z=\frac{X-\mu}{\sigma}\) is called the
\(Z\)-score or standardized score. It
turns out that \(Z\) has a normal
distribution with mean 0 and standard deviation 1.
To calculate the probability associated with a normal distribution,
we can use a standard normal distribution table or software.
An example:
If \(X\sim N(\mu=100, \sigma=15)\),
find
P(X<110)
P(X>120)
P(90<X<130)
Solution.
\(P(X<110)=P(X-\mu<110-\mu)=P(\frac{X-\mu}{\sigma}<\frac{110-\mu}{\sigma})=P(Z<0.67)\approx0.7486\)
by a standard normal table.
Your textbook also has such a table in the appendix.
\(P(X>120)=P(X-\mu>120-\mu)=P(\frac{X-\mu}{\sigma}>\frac{120-\mu}{\sigma})=P(Z>1.33)\approx
1- 0.9082=0.0918\)
The probability is obtained as follows:
\[P(90<X<130)=P(90-\mu<X-\mu<130-\mu)=P(\frac{90-\mu}{\sigma}<\frac{X-\mu}{\sigma}<\frac{130-\mu}{\sigma})\]
\[=P(-0.67<Z<2)=P(Z<2)-P(Z<-0.67)\approx0.9772-0.2514=0.7258\]
where I used the result that \(P(a<X<b)=P(X<b)-P(X<a)\) when
\(X\) is a continuous random
variable.
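The same three probabilities can be obtained with the base R function `pnorm()`, which takes the mean and the standard deviation (not the variance). This is a minimal sketch with variable names chosen here for illustration. Because R does not round the \(Z\)-score to two decimals, its answers differ slightly from the table-based ones.

```r
mu <- 100
sigma <- 15

# (a) P(X < 110)
p_a <- pnorm(110, mean = mu, sd = sigma)

# (b) P(X > 120): upper tail
p_b <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)

# (c) P(90 < X < 130) = P(X < 130) - P(X < 90)
p_c <- pnorm(130, mean = mu, sd = sigma) - pnorm(90, mean = mu, sd = sigma)

# Table-style calculation for (a), with the Z-score rounded to 0.67
p_a_table <- pnorm(0.67)

round(c(p_a = p_a, p_b = p_b, p_c = p_c, p_a_table = p_a_table), 4)
```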
Exponential
Distribution
If the probability density function of a random variable \(X\) is given by \[f(x)=\lambda e^{-\lambda x}, ~~
\text{for}~~x>0\] \[f(x)=0, ~~
\text{otherwise}\] then the random variable is said to have an
exponential distribution with parameter \(\lambda\). This distribution is a
right-skewed distribution, as seen from the following graph:

It can be shown that the mean of this distribution is \(\frac{1}{\lambda}\) and the standard
deviation is also \(\frac{1}{\lambda}\).
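To see the right-skewed shape and to check the mean and standard deviation numerically, here is a small sketch using the base R functions `dexp()`, `curve()`, and `integrate()`. The rate `lambda = 2` is an assumed value used only for illustration.

```r
lambda <- 2  # rate parameter (assumed value)

# Plot the density f(x) = lambda * exp(-lambda * x) for x > 0
curve(dexp(x, rate = lambda), from = 0, to = 4,
      xlab = "x", ylab = "f(x)", main = "Exponential density")

# Mean: integral of x f(x) over (0, Inf)
mu <- integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value

# Standard deviation: square root of the integral of (x - mu)^2 f(x)
sigma <- sqrt(integrate(function(x) (x - mu)^2 * dexp(x, rate = lambda),
                        0, Inf)$value)

c(mean = mu, sd = sigma, one_over_lambda = 1 / lambda)
```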
The t
Distribution
The t-distribution is one with a pdf that is very complicated and
thus not given here. Each t-distribution is associated with a number
that is called the number of degrees of freedom. The plots of some
t-distributions are given below. More are here: https://en.wikipedia.org/wiki/Student%27s_t-distribution

Each of the distributions is controlled by the number of degrees of
freedom (\(df = n-1\)). A \(t\)-distribution has mean 0 when \(df>1\) and variance
\(\frac{df}{df-2}\) when \(df>2\), which is larger than 1. As \(n\) gets larger and larger, a \(t\)-distribution gets closer and closer to
the standard normal distribution.
We can use software to find the probability that a t-random variable
with “df” degrees of freedom is no greater than a given number, say x. The R
code is “pt(x, df)”. For the probability that it is greater than x, use “1 - pt(x, df)”.
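In R, `pt(x, df)` returns the lower-tail probability \(P(T\le x)\), and the argument `lower.tail = FALSE` gives the upper tail. The sketch below also compares a few t probabilities with the standard normal one; the values `x = 1.5` and `df = 10` are assumed here for illustration.

```r
x <- 1.5   # cutoff (assumed value)
df <- 10   # degrees of freedom (assumed value)

lower <- pt(x, df)                      # P(T <= x)
upper <- pt(x, df, lower.tail = FALSE)  # P(T > x)

# As df grows, the t probability approaches the standard normal one
compare <- c(df5 = pt(x, 5), df30 = pt(x, 30),
             df1000 = pt(x, 1000), normal = pnorm(x))

c(lower = lower, upper = upper)
round(compare, 4)
```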
Reliability
Function
The reliability function \(R(t)\),
also known as the reliability or survivor function, gives the
probability that a system, component, or individual will survive beyond
a specific time t. In other words, it calculates the probability that
the event of interest (such as system failure) has not occurred by time
t. Mathematically, it’s defined as:
\[R(t)=1−F(t)\]
where \(F(t)\) is the cumulative
distribution function (CDF) of the event times. The reliability function
decreases over time as more events occur.
In different fields and literature, you might encounter either the
reliability function or the survival function being used, depending on
the context and the field’s convention. For example, in reliability
engineering, the term “reliability” is often used, while in medical and
survival analysis contexts, the term “survival” is commonly used.
Example: Reliability of a Weibull-Distributed Component
Suppose we have a component whose lifetime (in thousand hours)
follows a Weibull distribution, a distribution with cumulative
distribution function (CDF) given by:
\[F(t)=1-e^{-(t/\lambda)^k}\]
Assume that \(k=2\) and \(\lambda = 0.8\).
Find the probability density function (pdf) of the life time of
the component.
Find the reliability function of the life time of the
component.
The ratio of the pdf to reliability is called the failure rate or
the hazard function, denoted by \(h(t)\), which is interpreted as the
instantaneous rate of failure at time \(t\). Plot the graph of this
function.
Solution.
Since \(k=2\) and \(\lambda = 0.8\), \(F(t)=1-e^{-(t/0.8)^2}\). Then the
probability density function (pdf) is the derivative of \(F(t)\), or \(f(t)=\frac{2t}{0.8^2}e^{-(t/0.8)^2}=\frac{t}{0.32}e^{
-(t/0.8)^2}\) for \(t>0\).
The reliability function \(R(t)=1-F(t)=
e^{ −(t/0.8)^2}\).
The failure rate or the hazard function \(h(t)=
\frac{f(t)}{R(t)}=\frac{t}{0.32}\).
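The example also asks for a plot of the hazard function. One way to draw it in R is sketched below; the function names `f`, `R`, and `h` are chosen here for illustration, and the built-in `dweibull()` and `pweibull()` are used only to double-check the derived formulas at one time point.

```r
k <- 2        # shape parameter
lambda <- 0.8 # scale parameter

f <- function(t) (t / 0.32) * exp(-(t / lambda)^k)  # pdf derived above
R <- function(t) exp(-(t / lambda)^k)               # reliability function
h <- function(t) f(t) / R(t)                        # hazard function

# Check the formulas against R's Weibull functions at t = 1
t0 <- 1
check_pdf <- all.equal(f(t0), dweibull(t0, shape = k, scale = lambda))
check_rel <- all.equal(R(t0), pweibull(t0, shape = k, scale = lambda,
                                       lower.tail = FALSE))

# Plot the hazard function; it is the straight line t / 0.32
curve(h(x), from = 0, to = 3,
      xlab = "t (thousand hours)", ylab = "h(t)", main = "Hazard function")

c(pdf_ok = check_pdf, reliability_ok = check_rel, h_at_1 = h(t0))
```

The increasing straight line says that the component becomes more likely to fail the longer it has been in service.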
The reliability can also be defined for a system of components
connected in some way. There are some special systems:
A series system is defined as a system whose components are
connected end-to-end in a series, so the system operates only if every component is functioning.
A parallel system in reliability engineering refers to a
configuration where multiple components or paths are connected in
parallel, and the system as a whole operates if at least one of the
parallel components or paths is functioning. This setup increases the
system’s overall reliability since the system can continue to function
even if one or more of the parallel components fail.
Chapter Practice
Problems
- Uniform Distribution
- A random variable X is uniformly distributed between 0 and 8. What
is the probability that X is greater than 5?
- A car rental service charges a uniform rate between $30 and $50 per
day. What is the probability that a randomly chosen rental costs less
than $40?
- If a student randomly selects a number from the range [1, 100], what
is the probability that the number is between 20 and 50?
- Normal Distribution
- The weights of a certain type of fruit are normally distributed with
a mean of 150 grams and a standard deviation of 20 grams. What
percentage of fruits weigh more than 170 grams?
- A factory’s output is normally distributed with a mean of 200 units
per day and a standard deviation of 15 units. What is the probability
that the factory produces fewer than 190 units in a day?
- Test scores in a class are normally distributed with a mean of 75
and a standard deviation of 10. What is the z-score for a test score of
85?
- Exponential Distribution
- The time between arrivals of buses at a station is exponentially
distributed with an average arrival rate of 1 bus every 15 minutes. What
is the probability that the next bus arrives in less than 10
minutes?
- A factory machine has a mean time to failure of 12 hours. What is
the probability that it operates for more than 10 hours before
failing?
- The average lifespan of a battery is 2 years. What is the
probability that a randomly chosen battery lasts less than 1 year?
- Reliability Function
- A product has an exponential lifetime distribution with a standard
deviation of 10 years. What is the reliability at 4 years?
- The average time until failure of a system is 8 months. What is the
probability that the system will function for more than 5 months?
- A certain machine has a mean time to failure of 3 years. Assuming
failure time has an exponential distribution, what is the reliability of
the machine at 1 year?
Joint Probability
Distributions
Many times we need to consider two or more random variables at the
same time. In this situation, we need to consider the joint distribution
of these random variables.
Joint Probability
Distributions for Two Random Variables
We only consider the joint distribution of two discrete random
variables \(X\) and \(Y\). The joint probability mass function
(pmf) of them is defined to be \(f(x,y)=P(X=x,
Y=y)\). In contrast, each of the two individual distributions is
called a marginal distribution.
Example 1. \(X\) takes values 0, 2,
and 4. \(Y\) takes values 1, 2, and 3.
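The following table shows the joint pmf of the two discrete random
variables (rows are the values of \(X\) and columns are the values of \(Y\)).
\[\begin{array}{c|ccc} f(x,y) & y=1 & y=2 & y=3\\ \hline x=0 & 1/4 & 1/8 & 1/8\\ x=2 & 0 & 1/8 & 0\\ x=4 & 1/4 & 0 & 1/8 \end{array}\]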
Based on the results given, we can derive many results, such as
\(f(0, 1)= 1/4, f(4, 1)= 1/4, f(0,
3)=1/8\),
\(P(X=2)=P(X=2, Y=1~\text{or}
~2~\text{or} ~3)=f(2,1)+f(2,2)+f(2,3)=0+1/8+0=1/8\), and
\(P(X<3, Y>2)=P(X=0 ~\text{or}
~2, Y=3)=f(0,3)+f(2,3)=1/8+0=1/8\).
We can also find the PMF of \(X\)
and the PMF of \(Y\), respectively.
These two distributions are the marginal distributions of the joint
distribution. The marginal distribution of \(X\) is the values 0, 2, and 4 with
corresponding probabilities 1/2, 1/8, and 3/8. The probabilities are
obtained by adding the joint probabilities on the three rows of the
table. The marginal distribution of \(Y\) is the values 1, 2, and 3 with
corresponding probabilities 1/2, 1/4, and 1/4. The probabilities are
obtained by adding the joint probabilities on the three columns of the
table.
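A joint pmf can be stored in R as a matrix, and the marginal distributions are then just row sums and column sums. The sketch below uses the joint probabilities of Example 1 (the same nine values that appear in the calculations in these notes); the object names `joint`, `px`, and `py` are chosen here for illustration.

```r
# Joint pmf: rows are values of X (0, 2, 4), columns are values of Y (1, 2, 3)
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("0", "2", "4"), Y = c("1", "2", "3")))

total <- sum(joint)    # a valid joint pmf sums to 1

px <- rowSums(joint)   # marginal pmf of X
py <- colSums(joint)   # marginal pmf of Y

# P(X < 3, Y > 2) = f(0, 3) + f(2, 3)
p_event <- sum(joint[c("0", "2"), "3"])

joint
px
py
p_event
```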
Conditional
Probability Distributions and Independence
When the joint probability mass function (pmf) of \(X\) and \(Y\) is given, we can find the pmf of \(X\) and the pmf of \(Y\) separately. Take the joint
pmf in Example 1 above as an example.
To find the pmf of \(X\), we just
add the 3 probabilities on each row and we end up with sums of 1/2, 1/8,
and 3/8, which are the probabilities that \(X\) takes the values of 0, 2, and 4,
respectively.
Similarly, to find the pmf of \(Y\),
we just add the 3 probabilities on each column and we end up with sums
of 1/2, 1/4, and 1/4, which are the probabilities that \(Y\) takes the values of 1, 2, and 3,
respectively.
The conditional probability of \(Y=y\) given \(X=x\) can be written as \(f(y|x)\), which is defined by
\[f(y|x)=\frac{f(x,y)}{f_X(x)}=\frac{\text{the joint
pmf}}{\text{the marginal pmf of}~X}\] \(X\) and \(Y\) are said to be independent, if \(f(y|x)=f_Y(y)\), for any \(x\) and \(y\) with \(f_X(x)>0\). In other words, when two random
variables are independent, the conditional probability mass function is
the same as the corresponding marginal probability mass function.
Equivalently, \(X\) and \(Y\) are independent if and only if \(f(x,y)=f_X(x)f_Y(y)\), for all \(x\) and \(y\).
Note: some books use \(p\) instead
of \(f\).
Covariance and
Correlation
When given the joint pmf of \(X\)
and \(Y\), how can we calculate \(E(X\cdot Y)\)?
Here is an example: for the joint pmf given in Example 1 above,
calculate \(E(X\cdot Y)\).
Solution.
We need to find all possible products of \(X\) and \(Y\), then multiply them by the
corresponding joint probabilities, and finally add the results up. That
is,
\[(0)(1)(1/4)+(0)(2)(1/8)+(0)(3)(1/8)\]
\[+(2)(1)(0)+(2)(2)(1/8)+(2)(3)(0)\]
\[+(4)(1)(1/4)+(4)(2)(0)+(4)(3)(1/8)\]
The result is 3.
Previously, we introduced the mean of a random variable \(X\). We use \(\mu\) or \(E(X)\) to denote the mean.
The covariance of two discrete random variables is
defined to be \[cov(X,Y)=E([X-E(X)][Y-E(Y)])\] It can be
shown that \(cov(X,Y)=E(X\cdot Y)-E(X)\cdot
E(Y)\).
In the above example, \(X\) takes
the values of 0, 2, and 4, with probabilities of 1/2, 1/8, and 3/8,
respectively. So the mean of \(X\) is
\(\mu=(0)(1/2)+(2)(1/8)+(4)(3/8)=1.75\). The
variance of \(X\), by the equivalent
formula, \(V(X)=\sum x^2\cdot
p(x)-\mu^2\), is calculated as \((0^2)(1/2)+(2^2)(1/8)+(4^2)(3/8)-1.75^2=3.4375\).
Similarly, \(Y\) takes the values
of 1, 2, and 3, with probabilities of 1/2, 1/4, and 1/4, respectively.
The mean of \(Y\) is \((1)(1/2)+(2)(1/4)+(3)(1/4)=1.75\). The
variance of \(Y\) is \((1^2)(1/2)+(2^2)(1/4)+(3^2)(1/4)-(7/4)^2=0.6875\).
Now, the covariance of \(X\) and
\(Y\) is \(3-(1.75)(1.75)=-0.0625\).
The correlation between \(X\) and \(Y\) is defined by
\[\rho_{X,Y}=corr(X,Y)=\frac{cov(X,Y)}{\sqrt{Var(X)}\cdot
\sqrt{Var(Y)}}\]
Continue the previous example. The correlation between \(X\) and \(Y\) is
\[\rho_{X,Y} =
\frac{-0.0625}{\sqrt{3.4375}\cdot \sqrt{0.6875}}\approx
-0.04\]
Remark: Covariance can take any value, but correlation must be
between \(-1\) and 1. In addition,
correlation has no unit.
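The whole covariance and correlation calculation can be reproduced in a few lines of R. This is a minimal sketch that assumes the joint pmf is stored as a matrix with rows for \(X\) and columns for \(Y\); the object names are chosen here for illustration.

```r
x <- c(0, 2, 4)  # values of X
y <- c(1, 2, 3)  # values of Y

# Joint pmf used in the example: rows are X, columns are Y
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE)

px <- rowSums(joint)  # marginal pmf of X
py <- colSums(joint)  # marginal pmf of Y

# outer(x, y) holds every product x * y, matching the layout of joint
EXY <- sum(outer(x, y) * joint)

EX <- sum(x * px)
EY <- sum(y * py)
VX <- sum(x^2 * px) - EX^2
VY <- sum(y^2 * py) - EY^2

cov_xy <- EXY - EX * EY
rho <- cov_xy / (sqrt(VX) * sqrt(VY))

c(EXY = EXY, EX = EX, EY = EY, VX = VX, VY = VY, cov = cov_xy, rho = rho)
```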
Linear Functions of
Random Variables
Some useful properties of the mean and variance are given below:
If \(Y\) and \(X\) are random variables and \(a, b\), and \(c\) are constants, then
\(E(c)=c\)
\(E(aX+bY) =
aE(X)+bE(Y)\)
\(E(X+b)= E(X) + b\), but \(Var(X+b)= Var(X)\)
\(Var(aX)= a^2Var(X)\)
\(Var(X+Y)=Var(X)+Var(Y)+2\cdot
cov(X,Y)\)
If \(X\) and \(Y\) are independent, then \(Var(X+Y)=Var(X)+Var(Y)\) and \(E(XY)=E(X)E(Y)\) and thus \(Cov(X,Y)=0\).
If both \(X\) and \(Y\) are normally distributed and are
independent, then \(X+Y\) is also
normally distributed with mean that equals the sum of individual means
and variance that equals the sum of individual variances.
An example:
If \(E(X) = 2\) and \(Var(X)=3\), determine
\(E(4X)\)
\(Var(4X)\)
\(E(X+5)\)
\(Var(X-6)\)
\(E(3X-2)\)
\(Var(3X-2)\)
Solution.
\(E(4X)=4E(X)=8\)
\(Var(4X)=16Var(X)=48\)
\(E(X+5)=E(X)+5=7\)
\(Var(X-6)=Var(X)=3\)
\(E(3X-2)=3E(X)-2=4\)
\(Var(3X-2)=Var(3X)=3^2Var(X)=27\)
Another example:
If \(X_1\), \(X_2\), …, \(X_n\) are independent and identically
distributed (i.i.d) with mean 100 and standard deviation 15,
determine
the mean of \(\bar{X}\), where
\(\bar{X}=\frac{X_1+X_2+\cdots+X_n}{n}\).
the variance of \(\bar{X}\).
the mean of \(S^2\), where \(S^2 = \frac{1}{n-1} \sum_{i=1}^{n}
(X_i-\bar{X})^2\).
Solution.
- the mean of \(\bar{X}\) can be
found as follows:
\[E(\bar{X})=E(\frac{X_1+X_2+\cdots+X_n}{n})=E(\frac{1}{n}(X_1+X_2+\cdots+X_n))\]
\[=\frac{1}{n}E(X_1+X_2+\cdots+X_n)\]
\[=\frac{1}{n}(E(X_1)+E(X_2)+\cdots+E(X_n))=\frac{1}{n}(100n)=100\]
- the variance of \(\bar{X}\) can be
found as follows:
\[V(\bar{X})=V(\frac{X_1+X_2+\cdots+X_n}{n})\]
\[=\frac{1}{n^2}V(X_1+X_2+\cdots+X_n)\]
\[=\frac{1}{n^2}(V(X_1)+V(X_2)+\cdots+V(X_n))=\frac{1}{n^2}(15^2\cdot
n)=\frac{15^2}{n}\]
- the mean of \(S^2\) is \(15^2\). The derivation is lengthy, but can
you do it?
The above 3 results are also true in general when the mean is \(\mu\) and standard deviation is \(\sigma\). Just substitute 100 by \(\mu\) and 15 by \(\sigma\).
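These three results can be illustrated by simulation. The sketch below draws many samples and looks at the sample means and sample variances. It assumes a normal population and a sample size of \(n=25\), neither of which is specified in the example (the results hold for any distribution with this mean and standard deviation), and the seed is arbitrary.

```r
set.seed(353)  # arbitrary seed so the run is reproducible
n <- 25        # sample size (assumed value)
reps <- 10000  # number of simulated samples

# Each column is one sample of size n from a normal population (assumed)
samples <- matrix(rnorm(n * reps, mean = 100, sd = 15), nrow = n)

xbar <- colMeans(samples)     # sample mean of each sample
s2 <- apply(samples, 2, var)  # sample variance of each sample

c(mean_xbar = mean(xbar),  # should be close to 100
  var_xbar  = var(xbar),   # should be close to 15^2 / n = 9
  mean_s2   = mean(s2))    # should be close to 15^2 = 225
```

A simulation only gives approximate values, so expect numbers near 100, 9, and 225 rather than exact matches.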
A third example:
If \(X\) has a normal distribution
with mean 10 and variance 2 and, independently of \(X\), \(Y\)
has a normal distribution with mean 8 and variance 3, find
the distribution of \(X+Y\).
the probability that \(X+Y>15\).
Solution.
The distribution of \(X+Y\) is
also normal with mean \(10+8=18\) and
variance \(2+3=5\).
Let \(T = X+Y\). \(T\) is normal
with mean \(10+8=18\), variance \(2+3=5\), and standard deviation \(\sqrt{5}\). To find the probability \(P(T>15)\), we convert \(T\) to \(Z\), which has the standard normal
distribution and thus allows us to use a normal table to find
probabilities.
\[P(T>15)=P(\frac{T-18}{\sqrt{5}}>\frac{15-18}{\sqrt{5}})=P(Z>\frac{15-18}{\sqrt{5}})\]
\[=P(Z>-1.34)=1-P(Z<-1.34)\approx 1-0.0901=0.9099\]
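The probability can also be found with `pnorm()` without standardizing first. This is a minimal sketch with variable names chosen here for illustration; remember that `pnorm()` expects the standard deviation, so the variance 5 must be converted with `sqrt()`.

```r
mu_t <- 10 + 8        # mean of T = X + Y
sd_t <- sqrt(2 + 3)   # standard deviation of T (variances add)

# P(T > 15): upper-tail probability
p_direct <- pnorm(15, mean = mu_t, sd = sd_t, lower.tail = FALSE)

# Same probability through the Z-score
z <- (15 - mu_t) / sd_t
p_z <- 1 - pnorm(z)

round(c(z = z, p_direct = p_direct, p_z = p_z), 4)
```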
Descriptive
Statistics
In practice, we often need to estimate a quantity (called a
parameter) that describes a group (called a
population). Such a group can be the collection of all
light bulbs manufactured by a company. The corresponding parameter can
be the proportion of defective light bulbs or the average life time of
all light bulbs. It’s often unrealistic to reach every member in a
population. Instead, people draw a random sample (called data) from the
population of interest. After sample data are obtained, a descriptive
analysis of the data is usually desirable.
We will introduce summary methods for data. Depending on data, we can
calculate measures of central tendency (mean, median, mode), measures of
dispersion (range, variance, standard deviation), and we can visualize
the data with graphical displays. This will help us gain insights into
the characteristics of the data, understand how descriptive statistics
can be applied in various fields, and make data-driven decisions.
An Introduction to
R
To analyze data, it’s better to use software.
- Different kinds of data analysis software have been developed in the
past 50 years. We will use R.
- You can use R and RStudio (now called Posit) either on-premises (install
R, then RStudio from this page https://posit.co/download/rstudio-desktop/) or in the
cloud (go to https://posit.cloud and register).
- Let’s assume that you use the cloud-based RStudio. After logging in,
click the “New Project” tab in the upper-right corner and select “New
RStudio Project.” Once your project is created, rename it from “Untitled
Project” to something meaningful, like “STAT353.”
- Your R code will be entered in the upper-left panel. Highlight the
part of your code you wish to run and click the “Run” button; the
results will appear in the lower-left panel alongside the code. The
lower-right panel allows you to manage files (click Files) and install
packages (click Packages). You can upload files by selecting the “Files”
tab and clicking the “Upload” button. To download files, click “More”
and choose “Export.”
To create an R Markdown document for generating project reports in
PDF, Word, or HTML format, go to the upper-left panel and click “File,”
then “New File,” and select “R Markdown.” Fill in the title and author
fields in the dialog box and click “OK” to create a template. The first
few lines between the pair of “---” lines are written in YAML (Yet Another
Markup Language), and lines 7-9 set up the environment; it’s best not to
modify these unless you’re familiar with RStudio. Lines 11 and 21 create
sections marked by “#” signs; ensure there’s at least one space between the “#” and
each section title. You can customize the titles and include text with
formatting, such as using a pair of *’s to enclose text in order to
emphasize it, using a pair of $’s to enclose mathematical expressions,
or enclosing webpage links in between angle brackets. R code should be
placed between an opening line of three backticks followed by {r} and a closing line of three backticks
to create a code chunk, which you
can insert by clicking the green “+C” button in the menu bar on the
upper-left panel.
When you’re finished editing, click the “Knit” dropdown in the menu bar
on the upper-left panel and choose your desired output format (html,
pdf, word, …). If there are no errors, the generated output file will
appear in the lower-right panel’s file list.
Here is a tutorial to get you started: https://www.youtube.com/watch?v=TQMAKGDIe_8
Start a new project and practice!
Numerical Summaries
of Data
Consider the data 800, 820, 900, 950, 780, 690, 860, 880,
representing the lifetimes of 8 light bulbs randomly selected from a
sea of light bulbs.
What is the average life time? In statistics, this is called the
sample mean of the data, which describes the center of
the data. We use \(\bar{x}\) to denote
the sample mean. For the above data, \(\bar{x}=835\).
What is the spread (or variability) of the data? In statistics,
this is measured by the sample variance defined
as
\[s^2=\frac{\sum (x_i -
\bar{x})^2}{n-1}\]
or measured by the sample standard deviation, the
square root of the sample variance. In the formula, each \(x_i\) represents an observation, \(\bar{x}\) is the sample mean, \(n\) is the number of observations (called
the sample size), and the notation \(\Sigma\) means to obtain the sum of the
squared differences between observations and their mean.
For the above data, \(s^2=6514.286\)
and \(s=80.71\). To get the sample
variance, you can follow these steps:
- Calculate the sample mean (average):
Mean = (800 + 820 + 900 + 950 + 780 + 690 + 860 + 880) / 8 = 6680 / 8
= 835
- Calculate the squared differences between each data point and the
mean:
\[(800 - 835)^2 = 1225\] \[(820 - 835)^2 = 225\] \[(900 - 835)^2 = 4225\] \[(950 - 835)^2 = 13225\] \[(780 - 835)^2 = 3025\] \[(690 - 835)^2 = 21025\] \[(860 - 835)^2 = 625\] \[(880 - 835)^2 = 2025\]
- Calculate the sum of squared differences:
Sum of Squared Differences = 1225 + 225 + 4225 + 13225 + 3025 + 21025
+ 625 + 2025 = 45600
- Calculate the sample variance by dividing the sum of squared
differences by (n-1), where n is the number of data points (8):
Sample Variance = Sum of Squared Differences / (n-1) = 45600 / (8-1)
= 45600 / 7 ≈ 6514.286
We can use the R function var() to find the variance of data and the
sd() function to find the standard deviation.
# Data
x = c(800, 820, 900, 950, 780, 690, 860, 880)
# Find the variance
var(x)
## [1] 6514.286
# Find the standard deviation
sd(x)
## [1] 80.71113
The R functions give the same answers!
The difference between the largest observation and the smallest
observation is called the range of data. The range of
the above data is \(950-690\) or
260.
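As a quick check of the range, here is a minimal R sketch using the light bulb data from above. The variable name data_range is only an illustrative choice.

```r
# Lifetimes of the 8 light bulbs from the example above
x <- c(800, 820, 900, 950, 780, 690, 860, 880)

# range() returns the smallest and the largest observation
range(x)

# The range as a single number: largest minus smallest
data_range <- max(x) - min(x)
data_range
```

Note that R's range() function returns the two extreme values rather than their difference, so the subtraction has to be done separately.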
We may also be interested in what percentage of the data values are
less than or equal to a value. For example,
If 95% of all data values are less than or equal to 800,
the number 800 is called the 95th percentile of the
data.
If 90% of all data values are less than or equal to 750,
the number 750 is called the 90th percentile.
If 25% of all data values are less than or equal to 260,
the number 260 is called the 25th percentile, or the first
quartile, denoted by \(Q_1\).
If 50% of all data values are less than or equal to 420,
the number 420 is called the 50th percentile, the second quartile, or
the median, denoted by \(Q_2\) or \(m\).
If 75% of all data values are less than or equal to 570,
the number 570 is called the 75th percentile, or the third
quartile, denoted by \(Q_3\).
The median is the easiest percentile to find. Sort the data from
smallest to largest; the median is the middle value when the number of
observations is odd, or the average of the middle two values when it is even.
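The sorting rule for the median can be sketched in R as follows, again with the light bulb data. The names sorted_x and median_by_hand are illustrative, and the indexing below assumes an even number of observations, as is the case here.

```r
# Lifetimes of the 8 light bulbs
x <- c(800, 820, 900, 950, 780, 690, 860, 880)

# Step 1: sort from smallest to largest
sorted_x <- sort(x)
n <- length(sorted_x)

# Step 2: n is even here, so average the two middle values
middle_two <- sorted_x[c(n / 2, n / 2 + 1)]
median_by_hand <- mean(middle_two)
median_by_hand

# The built-in function handles both odd and even n
median(x)
```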
Percentiles are also called quantiles. To find other
percentiles (including quartiles), different software packages often use
different approaches, so their answers may differ slightly. To use JMP, visit https://www.uvm.edu/~rsingle/other/JMP-intro/default13.html.
R users might do the following to find quantiles:
# Data
x = c(2,4, 5, 11, 20, 22, 26, 29, 32, 35, 40, 45, 48, 50)
# Find the 95th percentile
quantile(x, 0.95)
## 95%
## 48.7
Stem-and-Leaf
Diagrams
This video: https://www.youtube.com/watch?v=faiGE_J_dww would be
great for introducing the diagram. It also introduces other commonly
used plots.
R software uses the stem function to create a stem-and-leaf plot.
x = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 86, 125, 85, 87, 102, 96, 76, 85)
stem(x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 7 | 6
## 8 | 5567
## 9 | 36689
## 10 | 12449
## 11 | 7
## 12 | 1335
Frequency
Distributions and Histograms
We can create consecutive bins (intervals) of values so that we can
count how many data values fall into each bin. This will help us
understand the distribution of the data. The bins along with the counts
of values form a table, called the frequency distribution of the data.
The counts can also be replaced by proportions and the resulting table
is called the relative frequency distribution.
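A frequency distribution can be built in R with cut() and table(). The sketch below uses the IQ values from the histogram example in this section; the bin edges 80, 90, ..., 130 are an assumption chosen only for illustration.

```r
# IQ scores used in the histogram example
IQ <- c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104,
        121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# Assign each value to a bin: (80,90], (90,100], ..., (120,130]
bins <- cut(IQ, breaks = seq(80, 130, by = 10))

# Frequency distribution: counts per bin
freq <- table(bins)
freq

# Relative frequency distribution: proportions per bin
rel_freq <- prop.table(freq)
rel_freq
```

By default cut() makes intervals that are open on the left and closed on the right, so a value of exactly 90 would be counted in (80,90].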
Again, this video: https://www.youtube.com/watch?v=faiGE_J_dww helps us make
a histogram.
R software uses the hist function to create a histogram.
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
hist(IQ,
main = "Distribution of IQ", # The main sets the title of the plot.
col = "blue", # The col fills the blue color inside each bar
xlab = "IQ", # The title of the x-axis is set to IQ
ylab = "Count" # The title of the y-axis is set to Count
)

You can set your own breaks when constructing a histogram. The breaks
should be evenly spaced. The number of breaks is usually between 5 and
25. The smallest break should be slightly smaller than the minimal value
of your data. The largest break should be slightly larger than the
maximal value of your data.
hist(IQ,
breaks = c(70,80, 90, 100, 110, 120, 130),
main = "Distribution of IQ",
col = "blue",
xlab = "IQ"
)

In practice, it is uncommon to give specific breaks when creating a
histogram. Instead, we specify something like “breaks = 12” to guide R to
create a histogram with around 12 bins.
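Here is a minimal sketch of the single-number form of breaks, using the same IQ data. R treats the number only as a suggestion, so the object returned by hist() is saved (under the illustrative name h) to inspect the breaks that were actually used.

```r
# IQ scores used in the histogram example
IQ <- c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104,
        121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# Ask for roughly 12 bins; R picks "pretty" break points near that number
h <- hist(IQ,
          breaks = 12,
          main = "Distribution of IQ",
          col = "blue",
          xlab = "IQ")

# Break points and bin counts that R actually used
h$breaks
h$counts
```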
Box Plots
From sample data, you can calculate the minimum value, first
quartile, median, third quartile, and maximum. These are called the
5-number summary, which can be displayed through the so-called box
plot.

The five vertical lines correspond, from left to right, to the
minimum, first (or lower) quartile, median, third (or upper) quartile,
and maximum of the data. Keep in mind that the mean of the data is not shown here.
Since the line extending from the third quartile to the maximum is longer
than the line from the first quartile to the minimum, the distribution
of the data is said to be right-skewed. These two lines are called
whiskers.
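The 5-number summary itself can be computed in R with fivenum(). The sketch below uses the IQ data from the histogram example; the name five is illustrative.

```r
# IQ scores used in the histogram example
IQ <- c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104,
        121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(IQ)
five

# quantile() reports quartiles with a slightly different rule,
# so its 25% and 75% values may differ a little from the hinges
quantile(IQ)
```

The boxplot function draws its box from the hinges returned by fivenum(), which is why the two sets of quartiles are shown side by side here.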
R software uses the “boxplot” function to create a boxplot.
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
boxplot(IQ,
main = "Distribution of IQ",
col = "blue",
xlab = "IQ",
horizontal = TRUE # If set to FALSE, the boxplot will be vertical.
)

More complicated boxplots take into account outliers (values that are
more than 1.5 IQRs above \(Q_3\) or more than 1.5
IQRs below \(Q_1\), where \(IQR=Q_3-Q_1\) is called the inter-quartile
range), as shown below:

Boxplots are especially useful for comparing groups, as in the one
below:

These are called side-by-side boxplots.
You need two variables to create comparative boxplots: the
outcome variable and the group variable. It is best to store the two
variables in a data frame (much like a spreadsheet in
Excel). Here is possible R code:
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
gender = c("F", "M", "F", "F", "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")
boxplot(IQ~gender)

Here are some summary findings based on the boxplots:
IQ Distribution by Gender: The boxplot displays the distribution
of IQ scores for two gender groups: “F” (Female) and “M”
(Male).
Median IQ: The horizontal line inside each box represents the
median IQ score for each gender group. It appears that the median IQ for
females (“F”) is almost the same as the median for males (“M”).
Interquartile Range (IQR): The height of each box represents the
interquartile range (IQR), which is a measure of the spread of the IQ
scores within each gender group. The IQR for females appears to be
larger than for males.
Outliers: A boxplot draws any data points that fall outside
the whiskers as individual points. These data points are potential outliers.
There seems to be no outlier in IQ score for either group.
Skewness: The distribution of IQ scores for either group is
slightly skewed to the right (to larger values).
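The visual findings can be backed up with group-wise numerical summaries. This is a minimal sketch using tapply() on the same IQ and gender data; the data frame name D and the result names medians and iqrs are illustrative.

```r
IQ <- c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104,
        121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
gender <- c("F", "M", "F", "F", "F", "F", "M", "F", "M", "M",
            "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")

# Keep the outcome and the group variable together in a data frame
D <- data.frame(IQ, gender)

# Median IQ within each gender group
medians <- tapply(D$IQ, D$gender, median)
medians

# Interquartile range within each gender group
iqrs <- tapply(D$IQ, D$gender, IQR)
iqrs
```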
Time Sequence
Plots
When plotting data involving time, we use the time sequence or time
series plot. The target quantity would be on the y-axis and the time on
the x-axis.
When examining a time series plot, pay attention to a possible trend
and/or cyclic variation.

The plot shows an increasing linear trend and cyclic variation.
Scatter Diagrams
Video: https://www.youtube.com/watch?v=ORPOMJzaKFM
When data points in a scatterplot tend to be on a straight line, we
can calculate the so-called correlation coefficient,
which quantifies the strength of the linear relationship between the two quantitative
variables.
The correlation coefficient, denoted by \(r\), is calculated by the following
formula:
\[r_{xy} =
\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\]
which can be reduced to
\[r_{xy} =
\frac{\sum(x_i-\bar{x})y_i}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\]
or even simpler
\[r_{xy} = \frac{\sum(x_i
y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum
y_i^2-n\bar{y}^2}}\]
The value of \(r\) is always between
\(-1\) and 1. A value closer to 1
indicates a strong, positive, linear relationship, while a value closer
to \(-1\) indicates a strong, negative,
linear relationship. The following shows some typical scatterplots.

An example. The following data are from an article on the
quality of different young red wines in the Journal of the Science of
Food and Agriculture (1974, Vol. 25(11), pp. 1369–1379) by T.C. Somers
and M.E. Evans. The authors reported quality along with several other
descriptive variables. We show only quality, pH, total SO2 (in ppm),
color density, and wine color for a sample of their wines.
Quality = c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0, 15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)
pH = c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98, 3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)
TotalSO2 = c(66, 79, 73, 86, 178, 108, 96, 59, 22, 58, 120, 144, 100, 104, 67, 89, 192, 301, 99, 66)
ColorDensity = c(9.35, 11.15, 9.40, 6.40, 3.60, 5.80, 5.00, 10.25, 8.20, 10.15, 8.80, 5.60, 5.55, 8.70, 7.41, 5.35, 6.35, 4.25, 12.85, 4.90)
Color = c(5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00, 5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75)
D=data.frame(Quality, pH, TotalSO2, ColorDensity, Color)
Quality | pH | TotalSO2 | ColorDensity | Color
19.2 | 3.85 | 66 | 9.35 | 5.65
18.3 | 3.75 | 79 | 11.15 | 6.95
17.1 | 3.88 | 73 | 9.40 | 5.75
15.2 | 3.66 | 86 | 6.40 | 4.00
14.0 | 3.47 | 178 | 3.60 | 2.25
13.8 | 3.75 | 108 | 5.80 | 3.20
12.8 | 3.92 | 96 | 5.00 | 2.70
17.3 | 3.97 | 59 | 10.25 | 6.10
16.3 | 3.76 | 22 | 8.20 | 5.00
16.0 | 3.98 | 58 | 10.15 | 6.00
15.7 | 3.75 | 120 | 8.80 | 5.50
15.3 | 3.77 | 144 | 5.60 | 3.35
14.3 | 3.76 | 100 | 5.55 | 3.25
14.0 | 3.76 | 104 | 8.70 | 5.10
13.8 | 3.90 | 67 | 7.41 | 4.40
12.5 | 3.80 | 89 | 5.35 | 3.15
11.5 | 3.65 | 192 | 6.35 | 3.90
14.2 | 3.60 | 301 | 4.25 | 2.40
17.3 | 3.86 | 99 | 12.85 | 7.70
15.8 | 3.93 | 66 | 4.90 | 2.75
Calculate the correlation between Quality and pH.
Plot all pairs.
Solution.
- The details are given below, where \(x\) denotes Quality and \(y\) denotes pH:
\[\bar{x}=15.22,\quad \bar{y}=3.7885,\quad \sum x_i^2=4708.82,\quad \sum y_i^2=287.3713,\quad \sum x_i y_i=1154.931\]
\[r = \frac{\sum(x_i
y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum
y_i^2-n\bar{y}^2}}=\frac{1154.931-20\cdot 15.22\cdot
3.7885}{\sqrt{4708.82-20\cdot 15.22^2}\cdot\sqrt{287.3713-20\cdot
3.7885^2}}=0.3492\]
- The set of scatterplots for all pairs of two variables is called a
scatterplot matrix:
plot(D)

# To plot one pair, say the pair for Quality versus pH, we do
plot(D$pH, D$Quality, xlab = "pH", ylab = "Quality") # pH will be on the x-axis

# The following also works
plot(D$Quality ~ D$pH, xlab = "pH", ylab = "Quality") # pH will be on the x-axis

The plots suggest that there is a strong positive correlation between
ColorDensity and Color.
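The hand calculation of the correlation between Quality and pH can be checked in R. The sketch below first applies the shortcut formula term by term and then calls the built-in cor() function; the name r_by_formula is illustrative.

```r
Quality <- c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
             15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)
pH <- c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98,
        3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)

n <- length(Quality)

# Shortcut formula: numerator and the two square-root factors
numerator <- sum(Quality * pH) - n * mean(Quality) * mean(pH)
denominator <- sqrt(sum(Quality^2) - n * mean(Quality)^2) *
  sqrt(sum(pH^2) - n * mean(pH)^2)
r_by_formula <- numerator / denominator
r_by_formula

# Built-in Pearson correlation; should agree with the formula
cor(Quality, pH)
```

With a data frame such as D from the example, cor(D) would give the correlations for all pairs at once, which is the numerical counterpart of the scatterplot matrix.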
Point Estimation of
Parameters and Sampling Distributions
A population is a collection of subjects or objects of interest.
Examples of populations are all SCSU students and all animals in a forest.
For a given population, we might be interested in a quantity that
describes the population. Such a quantity is called a
parameter. For the population of all SCSU students, we
might be interested in the proportion of students who would like to be a
CEO in their future. For the population of all animals in a forest, we
might be interested in the mean age of all animals.
In general, the proportion (denoted by \(p\)) and the mean (denoted by \(\mu\)) are two commonly studied types of
parameters.
Point Estimation
How can we estimate a parameter? It’s usually unrealistic to check
each individual in a population in order to calculate a parameter.
Instead, a random sample is drawn. If the sampling method is sound, the
sample is expected to be representative of the population and thus the
sample counterpart of the population parameter can be a good estimate
for the parameter. For example, to estimate the mean of a population, we
can use the sample mean; to estimate the proportion of a population, we
can use the sample proportion. These quantities depend on observations
and are called statistics, and they are the
point estimates of the corresponding parameters. We
will introduce interval estimates in the next chapter, where we give a range
for the parameter and state how confident we are that the interval
covers the unknown parameter.
The following are commonly used point estimates:
the sample mean (denoted \(\bar{x}\)) for the population mean
the sample proportion (denoted \(\hat{p}\)) for the population
proportion
the difference in sample means \(\bar{x}_1-\bar{x}_2\) for the difference in
two population means
the difference in sample proportions (denoted \(\hat{p}_1-\hat{p}_2\)) for the difference
in two population proportions
the sample variance for the population variance
Sampling
Distributions and the Central Limit Theorem
When estimating a population parameter, we start with the best
estimate: the sample statistic, which is the sample counterpart of the
parameter. For example, if the parameter is a population mean, then the
sample statistic is the sample mean; if the parameter is a population
proportion, then the sample statistic is the sample proportion.
Since the sample statistic depends on the sample and the sample is
random, the sample statistic must be random. The distribution of all the
possible values of the sample statistic is called the sampling
distribution. It describes how the sample statistic varies across
different samples of the same size taken from the population.
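A sampling distribution can also be explored by simulation in R. The sketch below is only an illustration under stated assumptions: the population is a hypothetical set of 100,000 values drawn from an exponential distribution with mean 2, the sample size is 40, and 1000 samples are taken. None of these choices come from the course data.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Hypothetical right-skewed population (assumption for illustration)
population <- rexp(100000, rate = 1 / 2)

n <- 40  # sample size (assumed)

# Draw 1000 samples of size n and record each sample mean
xbars <- replicate(1000, mean(sample(population, n)))

# Center: the mean of the sample means is close to the population mean
mean(population)
mean(xbars)

# Spread: the sd of the sample means is close to sd(population) / sqrt(n)
sd(population) / sqrt(n)
sd(xbars)

# Histogram of the simulated sampling distribution of the sample mean
hist(xbars, main = "Sampling distribution of the sample mean", xlab = "Sample mean")
```

Changing n and rerunning the code shows how the spread of the sample means shrinks as the sample size grows.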
Here is an app showing what a sampling distribution is: https://www.lock5stat.com/StatKey/
To use this app, in the row indicated by “Sampling Distribution”,
click either “Mean” or “Proportion”. Let’s say you have clicked “Mean”.
From the dropdown menu at the upper-left corner, choose a dataset which
can be viewed as a population. Let’s say you have chosen “Baseball
Players-3e”. Set an appropriate sample size, and then click “Generate
1000 samples”. Now, you have some results including 3 graphs. The first
graph on the right side of the screen shows the histogram of the
population, the second graph shows the histogram of the most recent
sample, and the third graph or the main graph shows the dotplot of all
the sample means. The first two graphs should be similar if the sample
size is large. The (third) graph of the sampling distribution may not
necessarily be similar to that of the population, but they do have
similar center if the sample size is large. In addition, the spread
(described by the standard error, std. error) of the sampling
distribution gets smaller if the sample size is larger. Try the app
out!
Two general results:
The mean of the sample mean (\(\bar{x}\)) is always equal to the
population mean (\(\mu\)). The standard
deviation of the sample mean is always equal to the population standard
deviation divided by the square root of the sample size.
The mean of the sample proportion (\(\hat{p}\)) is always equal to the
population proportion (\(p\)). The
standard deviation of the sample proportion is always equal to \(\sqrt{\frac{p(1-p)}{n}}\).
For a continuous population with mean \(\mu\) and standard deviation \(\sigma\), in general,
if the population is normally distributed, the sample mean (\(\bar{x}\)) has a normal distribution with
mean \(\mu\) and standard deviation
\(\frac{\sigma}{\sqrt{n}}\).
if the population is not normally distributed, the sample mean
(\(\bar{x}\)), when the sample size
\(n\) is large (say > 30),
approximately has a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
For a discrete population with proportion \(p\), in general,
- the sample proportion (\(\hat{p}\)), when the sample size \(n\) is large (say > 20), approximately
has a normal distribution with mean \(p\) and standard deviation \(\sqrt{\frac{p(1-p)}{n}}\).
These results are called the Central Limit Theorems
(CLT’s).
Example 1.
A random sample of size 64 is selected from a population with an
exponential distribution having mean 2. What is the probability that the
sample mean exceeds 1.8?
Solution.
Note that the population standard deviation is also 2, since for the
exponential distribution, the mean and standard deviation are the
same.
Since the sample size is relatively large, by the CLT, the sample
mean has an approximately normal distribution with mean 2 and standard
deviation \(\frac{2}{\sqrt{64}}=0.25\).
That is, \(\bar{X}\) is approximately
normally distributed with mean 2 and standard deviation 0.25. We need to
calculate \(P(\bar{X}>1.8)\).
\[P(\bar{X}>1.8)=P\left(Z>\frac{1.8-2}{0.25}\right)=P(Z>-0.8)=0.7881\]
You can use a standard normal table (https://www.z-table.com/) or the following R code to
find \(P(Z>-0.8)\):
1 - pnorm(-0.8, 0, 1) # 0 is the mean and 1 is the standard deviation.
## [1] 0.7881446
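To see how good the CLT approximation is in this example, it can be compared with an exact calculation. This sketch relies on the fact that the sum of n independent exponential variables with a common rate has a gamma distribution with shape n and that rate. The names clt_approx and exact are illustrative.

```r
n <- 64    # sample size
mu <- 2    # mean (and standard deviation) of the exponential population

# CLT approximation: the sample mean is roughly normal(mu, mu / sqrt(n))
clt_approx <- 1 - pnorm(1.8, mean = mu, sd = mu / sqrt(n))
clt_approx

# Exact: the sample mean exceeds 1.8 exactly when the sum exceeds n * 1.8,
# and the sum follows a gamma distribution with shape n and rate 1 / mu
exact <- 1 - pgamma(n * 1.8, shape = n, rate = 1 / mu)
exact
```

The two probabilities should be close, which illustrates that a sample of size 64 is large enough for the normal approximation to work reasonably well here.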
General Concepts of
Point Estimation
We use a function of the data to estimate an unknown parameter. Such
a function is called a statistic and gives an estimate (called a point
estimate) of the parameter. We prefer the statistic exhibiting favorable
properties, including lack of bias and minimal variance. All statistics
are random variables.
Unbiased
Estimators
A statistic is said to be unbiased, if the mean of
the distribution of the statistic is equal to the parameter to be
estimated.
The sample mean (\(\bar{X}\)), the
sample proportion (\(\hat{p}\)), and
the sample variance (\(s^2\)) are all
unbiased for their population counterparts.
In general, we would like a statistic to be unbiased and to have a low
variance. In practice, these goals are usually difficult to achieve at
the same time (this phenomenon is called the bias-variance
trade-off) unless we collect more data. To achieve one of the two
goals, we usually have to sacrifice the other.
Variance of a Point
Estimator
We can derive the following results:
- The variance of the sample mean (as a random variable) is always
equal to the population variance divided by the sample size.
\[V(\bar{X})=\frac{\sigma^2}{n}\]
- The variance of the sample proportion (as a random variable) is
always equal to
\[V(\hat{p})=\frac{p(1-p)}{n}\]
where \(p\) is the population
proportion.
Standard Error:
Reporting a Point Estimate
- The standard deviation of the sample mean, \(\frac{\sigma}{\sqrt{n}}\), depends
on \(\sigma\), which is usually unknown. A good
estimate of \(\sigma\) is the sample
standard deviation \(s\). Therefore,
\(\frac{\sigma}{\sqrt{n}}\) can be
estimated by \(\frac{s}{\sqrt{n}}\).
The latter is called the standard error of the sample mean.
- Similarly, the standard deviation of the sample proportion,
\(\sqrt{\frac{p(1-p)}{n}}\), can be
estimated by \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\),
which is called the standard error of the sample proportion.
The standard error of a point estimator is a measure of the
variability or precision of the estimator.
Example.
An article in the Journal of Heat Transfer (Trans. ASME, Sec. C, 96,
1974, p. 59) described a new method of measuring the thermal
conductivity of Armco iron. Using a temperature of 100°F and a power
input of 550 watts, the following 10 measurements of thermal
conductivity (in Btu/hr-ft-°F) were obtained:
41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81,
42.04
Obtain a point estimate of the mean thermal conductivity of Armco
iron.
What is the standard error of such an estimate?
Solution.
The sample mean \(\bar{x}=41.924\)
and the sample standard deviation is 0.2841.
A point estimate of the mean thermal conductivity of Armco iron
is 41.924.
The standard error of the point estimate obtained in part (a) is
\(\frac{s}{\sqrt{n}}=\frac{0.2841}{\sqrt{10}}=0.0898\).
The standard error of 0.0898 is small relative to the sample mean of 41.924,
which suggests that the sample mean is relatively
precise: it would not vary much from sample to sample. The
10 measurements are fairly consistent, and the sample
mean is likely a good representation of the true mean.
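The same numbers can be obtained in R. This is a minimal sketch; the names xbar, s, and se are illustrative.

```r
# Thermal conductivity measurements of Armco iron (Btu/hr-ft-°F)
x <- c(41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04)

# (a) Point estimate of the mean: the sample mean
xbar <- mean(x)
xbar

# (b) Standard error of the sample mean: s / sqrt(n)
s <- sd(x)
se <- s / sqrt(length(x))
s
se
```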
Example.
In a random sample of 50 PC’s, 12 are Dell’s PCs.
Obtain a point estimate of the proportion of PCs that are Dell’s
PCs.
What is the standard error of such an estimate?
Solution.
The sample proportion \(\hat{p}=\frac{12}{50}=0.24\).
A point estimate of the proportion of PCs that are Dell’s PCs is
0.24.
The standard error of the point estimate obtained in part (a) is
\(\sqrt{\frac{\hat{p}\cdot
(1-\hat{p})}{n}}=\sqrt{\frac{0.24\cdot (1-0.24)}{50}}=0.06\).
This standard error represents the variability associated with the point
estimate of the proportion of Dell’s PCs in the population.
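For the proportion example, the calculation can be sketched in R as follows. The names p_hat and se_p are illustrative.

```r
n <- 50        # number of PCs sampled
dell <- 12     # number of Dell PCs in the sample

# (a) Point estimate of the proportion: the sample proportion
p_hat <- dell / n
p_hat

# (b) Standard error of the sample proportion
se_p <- sqrt(p_hat * (1 - p_hat) / n)
se_p
```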
Statistical Intervals
for a Single Sample
The point estimate of a parameter does not tell how accurate the
estimation is. Another way of estimating a parameter is to use a range
of values, called an interval estimate. An interval estimate is
associated with a certain level of confidence, which tells how sure we are that the
interval covers the unknown parameter, so such an interval is often
called a confidence interval.
In addition to the confidence interval method, we later will
introduce the method of testing hypotheses about a population. Both
methods belong to the so-called statistical inference, which is the
process of making an inference about a population using a sample.
Confidence Interval
on the Mean of a Normal Distribution, Variance Known
The \(1-\alpha\) confidence interval
on the mean of a normal distribution with a known variance (\(\sigma^2\)) is given by
\[\bar{x}\pm z_{\alpha/2}\cdot
\frac{\sigma}{\sqrt{n}} \]
where \(z_{\alpha/2}\) is called the
critical value which is the cutoff of the standard normal distribution
separating the top \(\alpha/2\) tail
area from the other area.
The part \(z_{\alpha /2}\cdot
\frac{\sigma}{\sqrt{n}}\) is called the margin of error.

The length of a confidence interval is a measure of precision of
estimation.
The above confidence interval is called a z confidence interval,
since it is based on the standard normal density, which is also known as
the z-density.
The value of \(\alpha\) is usually
0.1, 0.05, or 0.01, and the corresponding (right) cutoffs (called the
critical z-values, denoted by \(z_{\alpha/2}\)) are 1.645, 1.96, and 2.576,
respectively. They are obtained by a standard normal table or the
following R code:
qnorm(1-(0.1/2)) # alpha = 0.1 and the code gives 1.645
## [1] 1.644854
qnorm(1-(0.05/2)) # alpha = 0.05 and the code gives 1.96
## [1] 1.959964
qnorm(1-(0.01/2)) # alpha = 0.01 and the code gives 2.576
## [1] 2.575829
Example 1.
ASTM Standard E23 defines standard test methods for notched bar
impact testing of metallic materials. The Charpy V-notch (CVN) technique
measures impact energy and is often used to determine whether or not a
material experiences a ductile-to-brittle transition with decreasing
temperature. Ten measurements of impact energy (J) on specimens of A238
steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3,
64.6, 64.8, 64.2, and 64.3. Assume that impact energy is normally
distributed with \(\sigma\) = 1 J. We
want to find a 95% CI for \(\mu\), the
mean impact energy.
Solution.
The required quantities are \(n = 10,
\sigma = 1, \bar{x}=64.46\), and \(\alpha = 0.05\). The critical value \(z_{\alpha/2}\) is 1.96. So the 95%
confidence interval for the mean impact energy is \(64.46\pm 1.96\cdot \frac{1}{\sqrt{10}}=64.46\pm
0.6198\) or \(63.84 < \mu<
65.08\).
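The interval in this example can be reproduced step by step in R. This is a minimal sketch of the z interval formula; the names z_crit, margin, and ci are illustrative.

```r
# Impact energy measurements (J)
x <- c(64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3)

sigma <- 1           # known population standard deviation
conf_level <- 0.95
alpha <- 1 - conf_level

# Critical value z_{alpha/2}
z_crit <- qnorm(1 - alpha / 2)

# Margin of error and the two-sided confidence interval
margin <- z_crit * sigma / sqrt(length(x))
ci <- mean(x) + c(-1, 1) * margin
margin
ci
```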
Interpretation of confidence intervals:
The confidence interval is a random interval, so a new
sample will give a different interval. The confidence level \(1-\alpha\) reflects the proportion of all
possible confidence intervals that would cover the true population
parameter. So, we can say “we are 95% confident that the mean impact
energy is between 63.84 J and 65.08 J.”
The following is a graphical interpretation of confidence intervals
(assuming the sample size is 60 and the true population mean is
2.3):

This picture shows that based on 100 confidence intervals of level
95%, about 5 intervals fail to cover the true value of \(\mu\), which is indicated by the horizontal
red line.
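The idea behind the picture can be imitated with a small simulation. The sketch below uses the sample size 60 and true mean 2.3 stated above; the normal population, the value sigma = 1, and the seed are assumptions made only for illustration.

```r
set.seed(2)  # arbitrary seed so the simulation is reproducible

mu <- 2.3            # true population mean (from the text)
sigma <- 1           # population standard deviation (assumed)
n <- 60              # sample size (from the text)
conf_level <- 0.95
z_crit <- qnorm((1 + conf_level) / 2)

# Build 100 z confidence intervals and record whether each covers mu
covers <- replicate(100, {
  x <- rnorm(n, mean = mu, sd = sigma)
  margin <- z_crit * sigma / sqrt(n)
  (mean(x) - margin < mu) && (mu < mean(x) + margin)
})

# Number of intervals (out of 100) that cover the true mean
sum(covers)
```

The count changes from run to run, but over many repetitions it should average about 95 out of 100.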
Is the following statement correct?
If a 95% confidence interval on the mean has a lower limit of 10 and
an upper limit of 15, this implies that 95% of the time the true value
of the mean is between 10 and 15.
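No. The true value of the mean is a fixed (though unknown) number, not a random quantity, so it either lies between 10 and 15 or it does not. The 95% refers to the procedure: about 95% of the intervals constructed this way from repeated samples would cover the true mean.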
The confidence interval can be impacted by a few factors:
the sample size (\(n\)): the
larger the \(n\), the shorter the
interval and thus the more precise the interval estimate.
the population standard deviation (\(\sigma\)): the smaller the \(\sigma\), the shorter the
interval.
the confidence level \(1-\alpha\): the smaller the \(\alpha\), the more confident, and the
wider the interval.
Determining the Sample Size for Specified Error on the Mean, Variance
Known:
- If the sample mean is used as an estimate of the population mean, we
can be \(1-\alpha\) confident that the
error \(|\bar{x}-\mu|\) will not exceed
a specified amount \(E\) when the
sample size is \(n=(\frac{z_{\alpha/2}
}{E}\cdot \sigma)^2\).
Example 2.
To get a 95% confidence interval for a population mean with an
error of at most 0.6, what sample size should be used? Assume that the
population standard deviation is 2.
Solution.
\[n=(\frac{z_{\alpha/2} }{E}\cdot
\sigma)^2=(\frac{1.96 }{0.6}\cdot 2)^2=42.68\]
The sample size should be at least 43.
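The sample size formula is easy to wrap in a small R helper. The function name sample_size_for_mean and its arguments are hypothetical, written here only to illustrate the formula; rounding up with ceiling() guarantees that the error does not exceed E.

```r
# Hypothetical helper: smallest n so that the margin of error is at most E
sample_size_for_mean <- function(E, sigma, conf_level = 0.95) {
  z_crit <- qnorm((1 + conf_level) / 2)   # z_{alpha/2}
  ceiling((z_crit * sigma / E)^2)         # always round up
}

# Example 2: error at most 0.6, sigma = 2, 95% confidence
n_needed <- sample_size_for_mean(E = 0.6, sigma = 2)
n_needed
```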
Sometimes, we may only want a lower bound or an upper bound for a
confidence interval.
To construct a \(1-\alpha\)
lower-confidence bound on the population mean, use \(\mu >
\bar{x}-z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}\).
To construct a \(1-\alpha\)
upper-confidence bound on the population mean, use \(\mu <
\bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}\).
Example.
In a location, the temperatures measured in June were obtained as
follows:
13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0
Assume that the population standard deviation is 0.3.
Construct a 90% lower-confidence bound on the mean
temperature.
Construct a 90% upper-confidence bound on the mean
temperature.
Solution.
The sample mean is 13.93636. We are given \(\alpha = 0.10\), so \(z_{\alpha}=1.28\).
A 90% lower-confidence bound on the mean temperature is \(\bar{x}-z_{\alpha}\cdot
\frac{\sigma}{\sqrt{n}}=13.93636-1.28\cdot
\frac{0.3}{\sqrt{11}}=13.82\). That is, the 90% one-sided
confidence interval with lower bound is \((13.82, \infty)\).
A 90% upper-confidence bound on the mean temperature is \(\bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}=13.93636+1.28\cdot\frac{0.3}{\sqrt{11}}=14.05\).
That is, the 90% one-sided confidence interval with upper bound is \(( -\infty, 14.05)\).
The following R function written by the instructor can help you find
a \(z\) confidence interval:
z.test = function(x, alternative = c("two.sided", "less", "greater"), mu = 0, sigma = 1, conf.level = 0.95){
n=length(x); m = mean(x); z = (m-mu)/(sigma/sqrt(n)); p = pnorm(z)
if (alternative[1]=="less"){
pvalue = p
LB = -Inf
UB = m+qnorm(conf.level)*sigma/sqrt(n)
} else if (alternative[1]=="greater"){
pvalue = 1- p
LB = m-qnorm(conf.level)*sigma/sqrt(n)
UB = Inf
} else {
pvalue = 2*min(p, 1-p)
LB = m-qnorm((1+conf.level)/2)*sigma/sqrt(n)
UB = m+qnorm((1+conf.level)/2)*sigma/sqrt(n)
}
m = round(m, 5)
z = round(z, 5)
pvalue = round(pvalue, 5)
LB = round(LB, 5)
UB = round(UB, 5)
cat(paste(" One Sample z-test\n\n", "data: ", deparse(substitute(x))), "\n",
"z =", z, "p-value =", pvalue, "\n",
paste("alternative hypothesis: true mean is",
ifelse(alternative[1] == "two.sided", "not equal to", alternative[1]), mu, "\n",
paste(100*conf.level, "percent confidence interval:\n"), " ", LB, UB,
"\n", "Sample estimates:\n", "mean of x\n", " ", m), "\n")
}
How can you use the function to do the previous example?
x = c(13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0)
# (a)
z.test(x, conf.level = 0.9, alternative = "greater", sigma=0.3)
## One Sample z-test
##
## data: x
## z = 154.0723 p-value = 0
## alternative hypothesis: true mean is greater 0
## 90 percent confidence interval:
## 13.82044 Inf
## Sample estimates:
## mean of x
## 13.93636
# (b)
z.test(x, conf.level = 0.9, alternative = "less", sigma=0.3)
## One Sample z-test
##
## data: x
## z = 154.0723 p-value = 1
## alternative hypothesis: true mean is less 0
## 90 percent confidence interval:
## -Inf 14.05228
## Sample estimates:
## mean of x
## 13.93636
The results are almost the same as those done by hand. They are not
exactly the same because the code uses a more accurate
critical value. For example, you use 1.28 when doing the problem by
hand, but the code uses 1.281552.
Confidence Interval
on the Mean of a Normal Distribution, Variance Unknown
When the population variance is unknown, it will have to be estimated
before a confidence interval can be constructed. Such estimation might
further introduce uncertainty to the estimation process, so the interval
is expected to be longer than the situation where the population
variance is known. This is reflected in the change of the critical value
in the formula. The new critical value is related to the \(t\)-distribution.
t Distribution
Each \(t\)-distribution is determined by its number of degrees of
freedom (\(df = n-1\)). Every \(t\)-distribution with \(df > 1\) has mean 0, and its variance is
\(\frac{df}{df-2}\) when \(df > 2\). As \(n\) gets larger and larger, a \(t\)-distribution gets closer and closer to
the standard normal distribution.

t Confidence
Interval on the Population Mean
The \(1-\alpha\) confidence interval
on the mean of a normal distribution with an unknown variance is given
by
\[\bar{x}\pm t_{\alpha/2}\cdot
\frac{s}{\sqrt{n}} \]
where
\(\bar{x}\) is the sample
mean
\(s\) is the sample standard
deviation
\(t_{\alpha/2}\) is called the
t-critical value which is the cutoff of the \(t\)-distribution separating the top \(\alpha/2\) tail area from the other
area.
the degrees of freedom of the t-distribution is \(n-1\).
The following gives part of a t table that gives the corresponding
critical values and right tail areas for each given number of degrees of
freedom.

To use the table, let’s give some examples:
when sample size is 10, the degrees of freedom is 9. For a 95%
confidence interval, \(\alpha\) would
be 0.05, and \(\alpha/2\) is 0.025.
From the graph and the t table above, the critical t-value is
2.262.
when sample size is 8, the degrees of freedom is 7. For a 90%
confidence interval, \(\alpha\) would
be 0.10, and \(\alpha/2\) is 0.05. From
the table above, the critical t-value is 1.895.
A more detailed table can be found here: https://www.craftonhills.edu/current-students/tutoring-center/mathematics-tutoring/distribution_tables_normal_studentt_chisquared.pdf.
For example,
when df = 15 and the upper tail area is 0.075, the critical value
is 1.517;
when df = 22 and the upper tail area is 0.025, the critical value
is 2.074;
when df = 35 and the critical value is 2.438, the upper tail area
is 0.01;
when df = 12 and the critical value is 2.234, the upper tail area
is between 0.01 and 0.025.
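The table lookups above can also be done in R: qt() turns a tail area into a critical value, and pt() turns a critical value into a tail area. This is a minimal sketch; the variable names are illustrative.

```r
# Critical values from upper tail areas: qt(1 - area, df)
t_df9  <- qt(1 - 0.025, df = 9)    # 95% CI with n = 10
t_df7  <- qt(1 - 0.05,  df = 7)    # 90% CI with n = 8
t_df15 <- qt(1 - 0.075, df = 15)
t_df22 <- qt(1 - 0.025, df = 22)
c(t_df9, t_df7, t_df15, t_df22)

# Upper tail areas from critical values: pt(value, df, lower.tail = FALSE)
area_df35 <- pt(2.438, df = 35, lower.tail = FALSE)
area_df12 <- pt(2.234, df = 12, lower.tail = FALSE)
c(area_df35, area_df12)
```

Unlike a printed table, pt() gives the exact tail area for df = 12 rather than only a range.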
Example 1.
Engineers want to determine the average tensile strength of a new
type of steel produced by a company. They take a random sample of 50
steel rods and measure their tensile strength. The data are:
501, 504, 499, 503, 502, 505, 498, 500, 506, 503, 505, 501, 499, 502,
504, 500, 502, 503, 506, 499, 502, 503, 501, 504, 502, 498, 499, 503,
504, 501, 503, 498, 506, 500, 502, 499, 505, 504, 503, 501, 500, 504,
502, 506, 499, 503, 505, 501, 498, 502
Calculate a 95% confidence interval for the mean tensile strength of
all steel rods produced by the company.
Solution.
By hand: the sample mean is 502 and the sample standard
deviation is 2.3561. With \(\alpha = 0.05\) and \(df = 49\),
the critical t-value \(t_{\alpha/2}\) based on a t-table is 2.01, so the interval is
\(502\pm 2.01\cdot\frac{2.3561}{\sqrt{50}}=502\pm 0.67\). In R:
x = c(501, 504, 499, 503, 502, 505, 498, 500, 506, 503,
505, 501, 499, 502, 504, 500, 502, 503, 506, 499,
502, 503, 501, 504, 502, 498, 499, 503, 504, 501,
503, 498, 506, 500, 502, 499, 505, 504, 503, 501,
500, 504, 502, 506, 499, 503, 505, 501, 498, 502)
t.test(x, conf.level = 0.95)
##
## One Sample t-test
##
## data: x
## t = 1506.6, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 501.3304 502.6696
## sample estimates:
## mean of x
## 502
Confidence Interval: 95% confidence interval for the mean tensile
strength of the steel rods is (501.33 MPa, 502.67 MPa).
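To connect the software output with the formula \(\bar{x}\pm t_{\alpha/2}\cdot s/\sqrt{n}\), the interval can be rebuilt step by step. The following is a sketch of that by-hand route in R; the names `xbar`, `s`, `t_crit`, `margin`, and `ci` are illustrative choices, not part of `t.test`.

```r
x <- c(501, 504, 499, 503, 502, 505, 498, 500, 506, 503,
       505, 501, 499, 502, 504, 500, 502, 503, 506, 499,
       502, 503, 501, 504, 502, 498, 499, 503, 504, 501,
       503, 498, 506, 500, 502, 499, 505, 504, 503, 501,
       500, 504, 502, 506, 499, 503, 505, 501, 498, 502)

n      <- length(x)                      # sample size (50)
xbar   <- mean(x)                        # sample mean
s      <- sd(x)                          # sample standard deviation
t_crit <- qt(1 - 0.05 / 2, df = n - 1)   # t critical value for 95% confidence
margin <- t_crit * s / sqrt(n)           # margin of error
ci     <- c(xbar - margin, xbar + margin)
round(ci, 2)
```

The result should agree with the interval reported by `t.test(x, conf.level = 0.95)`.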
Interpretation:
With 95% confidence, we estimate that the true average tensile
strength of all steel rods produced by the company falls within the
range of approximately 501.33 megapascals (MPa) to 502.67 MPa.
This means that if we were to take many random samples of 50
steel rods and calculate a 95% confidence interval from each sample, we
would expect approximately 95% of those intervals to contain the true
population mean tensile strength.
The interval is narrow (about 1.3 MPa wide), which indicates that the
sample estimates the mean tensile strength precisely. It describes where
the population mean is likely to be, not the range of strengths of
individual rods.
Example 2.
Environmental engineers monitor air pollution levels in a city. They
want to estimate the average concentration of a pollutant in the air
over a specific period. They collect air quality measurements at various
locations throughout the city. The data are:
18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2, 19.1,
18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7, 19.0, 18.9, 18.8,
19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0, 18.6, 19.2, 19.0, 18.8, 19.3,
18.7, 18.9, 19.1, 18.5, 18.6, 19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9,
18.6, 19.0, 18.5
Calculate a 90% confidence interval for the mean pollutant
concentration in the city based on these 50 measurements.
Solution.
x = c(18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2,
19.1, 18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7,
19.0, 18.9, 18.8, 19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0,
18.6, 19.2, 19.0, 18.8, 19.3, 18.7, 18.9, 19.1, 18.5, 18.6,
19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9, 18.6, 19.0, 18.5)
t.test(x, conf.level = 0.90)
##
## One Sample t-test
##
## data: x
## t = 549.73, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 18.85632 18.97168
## sample estimates:
## mean of x
## 18.914
Confidence Interval: 90% confidence interval for the mean pollutant
concentration in the city is (18.86 ppm, 18.97 ppm).
Interpretation:
With 90% confidence, we estimate that the true average concentration
of the pollutant in the air over the specified period falls within the
range of approximately 18.86 parts per million (ppm) to 18.97 ppm.
This means that if we were to take multiple random samples of 50 air
quality measurements and calculate 90% confidence intervals from each
sample, we would expect approximately 90% of those intervals to contain
the true population mean pollutant concentration.
The interval is relatively narrow, indicating that the average
pollutant concentration is estimated with good precision. A narrow
interval reflects the precision of the estimate of the mean; it does not
by itself show that air quality is consistent over the monitoring period.
Large-Sample
Confidence Interval for a Population Proportion
The \(1-\alpha\) confidence interval
for a population proportion is given by
\[\hat{p}\pm z_{\alpha/2}\cdot
\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Example 3.
In a random sample of 85 automobile engine crankshaft bearings, 10
have a surface finish that is rougher than the specifications allow.
Find a point estimate of the proportion of bearings in the
population (denoted \(p\)) that exceeds
the roughness specification.
Construct a 95% two-sided confidence interval for \(p\).
Solution.
\(\hat{p}=\frac{10}{85}=0.1176\).
The 95% confidence interval is
\[\hat{p}\pm z_{\alpha/2}\cdot
\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=0.1176\pm 1.96\cdot
\sqrt{\frac{0.1176(1-0.1176)}{85}}=0.1176\pm 0.0685\] or between
0.0491 and 0.1861.
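The arithmetic in this solution can be reproduced in R. This sketch applies the large-sample formula above directly; the variable names are illustrative. Note that `prop.test()` uses a different interval formula, so its output may differ slightly from this by-hand result.

```r
x <- 10                      # bearings rougher than the specification
n <- 85                      # sample size
p_hat  <- x / n              # point estimate of p
z_crit <- qnorm(1 - 0.05 / 2)                    # about 1.96
margin <- z_crit * sqrt(p_hat * (1 - p_hat) / n) # margin of error
ci <- c(p_hat - margin, p_hat + margin)
round(c(p_hat, margin, ci), 4)
```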
Sample size determination for a \(1-\alpha\) confidence interval of the
population proportion \(p\)
- In order for the error \(|\hat{p}-p|\) to be no greater than \(E\), the sample size should be at least \((\frac{z_{\alpha/2}}{E})^2
\hat{p}(1-\hat{p})\) if there is already an estimate \(\hat{p}\) for \(p\), or \((\frac{z_{\alpha/2}}{E})^2 (0.25)\) if
there is no such estimate for \(p\).
Always round up to the next whole number!
Example 4.
To have a 95% confidence interval for a population proportion with
error at most 0.02, what is the minimum required sample size?
Solution.
\((\frac{z_{\alpha/2}}{E})^2
(0.25)=(\frac{1.96}{0.02})^2 (0.25)=2401\)
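The sample-size rule can be wrapped in a small helper. This is a sketch only: the function name `sample_size_prop` and its arguments are hypothetical, and it uses `qnorm()` for \(z_{\alpha/2}\) instead of the rounded value 1.96.

```r
# Minimum n so that a (conf_level) CI for p has error at most E.
# p_guess = 0.5 gives p(1 - p) = 0.25, the conservative choice used
# when no prior estimate of p is available.
sample_size_prop <- function(E, conf_level = 0.95, p_guess = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling((z / E)^2 * p_guess * (1 - p_guess))  # always round up
}

n_needed <- sample_size_prop(E = 0.02)
n_needed
```

For Example 4 this gives the same answer as the by-hand calculation, 2401.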
Finding Confidence
Intervals Using Software
Example 5.
A sample of 9 faculty are selected from a large university. Their
years of service are 23, 34, 12, 40, 34, 52, 27, 28, 40. Find a 95%
confidence interval for the population mean.
The following is the R code for finding a \(t\)-confidence interval for a population
mean:
x = c(23, 34, 12, 40, 34, 52, 27, 28, 40) # data vector
t.test(x, conf.level = 0.95) # call the function t.test to do the calculation
The 95% confidence interval is: 23.38 to 41.06.
To use JMP, you may follow this video: https://www.youtube.com/watch?v=gDi4XWdCIbw
Example 6. A sample of 80 faculty is selected from a large
university. 52 of these people have had COVID. Find a 90% confidence
interval for the population proportion.
The following is the R code for finding a confidence interval for a
population proportion:
n = 80
x = 52
prop.test(x, n, conf.level = 0.90)
The 90% confidence interval is: 0.55 to 0.74.
Example 7.
An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005,
Vol. 13, pp. 273–279) “Arthroscopic meniscal repair with an absorbable
screw: results and surgical technique” showed that only 25 out of 37
tears (67.6%) located between 3 and 6 mm from the meniscus rim were
healed.
Calculate a 95% two-sided confidence interval on the proportion of
such tears that will heal. Round the answers to 3 decimal places.
The following is the R code for finding a confidence interval for a
population proportion:
n = 37
x = 25
prop.test(x, n, conf.level = 0.95)
Tests of Hypotheses for
a Single Sample
Confidence intervals are estimation methods for parameters. In
practice, people may be interested in testing claims or hypotheses about
parameters.
Warning: this chapter covers the most difficult
concept in an introductory statistics course. Stay awake!
Review the concept of sampling distribution for the sample mean
covered in chapter 8: https://www.youtube.com/watch?v=0zqNGDVNKgA
Consider the following situation:
A company selling juice claims that on average, each bottle of their
juice is 295 ml. Let’s first assume that the claim is correct. We will
randomly select 50 bottles of juice. Which of the following sample means
would provide the most evidence against the claim (or for the
opposite of the claim)? Which one provides the least evidence?
(a) 291 ml
(b) 296 ml
(c) 299 ml
Use the following sampling distribution to help you answer these
questions.

Answer: (a)=(c)>(b)
If the sample mean is very far away from the center of the sampling
distribution, we should reject the assumption. Otherwise, we don’t
reject it. Then, how far is far? We need a certain decision rule. One
such rule is that the standardized score or \(z\)-score of the sample mean is very
different from 0 (i.e., \(|z|>c\)
for some \(c\), called the critical
value).
Hypothesis
Testing
Watch this video: https://www.youtube.com/watch?v=zR2QLacylqQ a few times
until it makes good sense to you.
Then watch this (the first 15 minutes): https://www.youtube.com/watch?v=VK-rnA3-41c
Statistical
Hypotheses
A statistical hypothesis is a statement about the parameters of one
or more populations.
The following are examples of statistical hypotheses:
A company selling juice claims that on average, each bottle of
their juice is 295 ml.
A presidential candidate claims that more than 50% of registered
voters would support her.
A company producing light bulbs claims that less than one percent
of their products are defective.
When dealing with a problem involving statistical hypotheses, a pair
of competing hypotheses, called the null and alternative hypotheses,
respectively, is first formed. In the next subsection, we will talk about
how such a pair is formed.
A procedure leading to a decision about whether the null hypothesis
should be rejected is called a test of a
hypothesis.
Tests of
Statistical Hypotheses
A test of hypotheses can be summarized in four steps as
follows:
Step 1: Specify the null and alternative hypotheses.
Step 2: Calculate the value of a test statistic.
Step 3: Calculate the critical value or \(p\)-value.
Step 4: Make a decision about whether to reject the null
hypothesis and draw a conclusion.
To test each of the following claims,
A company selling juice claims that on average, each bottle of
their juice is 295 ml.
A presidential candidate claims that more than 50% of registered
voters would support her.
A company producing light bulbs claims that less than one percent
of their products are defective.
The standard deviation of the life times of all light bulbs
produced by a company is less than 30 hours.
The difference in the mean house income between Minnesota and
Iowa is greater than $500.
the null and alternative hypotheses are, in the same order,
\(H_0: \mu = 295\) vs. \(H_a: \mu \ne 295\)
\(H_0: p \le 0.5\) vs. \(H_a: p > 0.5\)
\(H_0: p\ge 0.01\) vs. \(H_a: p<0.01\)
\(H_0: \sigma = 30\) vs. \(H_a: \sigma < 30\)
\(H_0: \mu_{\text{MN}}-\mu_{\text{IA}}
= 500\) vs. \(H_a:
\mu_{\text{MN}}-\mu_{\text{IA}} > 500\)
Technically, we can use an equality sign under all null hypotheses
(indicated by \(H_0\)), but can never
use any equality sign under the alternative hypothesis (indicated by
\(H_a\) or \(H_1\)). That is, the only possible signs
under an alternative hypothesis are “>”, “<”, and “\(\ne\).”
We can only use a parameter but never use a statistic to specify a
hypothesis. The following would be all wrong:
\(H_0: \bar{x} = 295\) vs. \(H_a: \bar{x} \ne 295\)
\(H_0: \hat{p} \le 0.5\)
vs. \(H_a: \hat{p} > 0.5\)
\(H_0: s = 30\) vs. \(H_a: s < 30\)
because \(\bar{x}, \hat{p}\) and
\(s\) all represent statistics.
Once the null and alternative hypotheses are determined, a test
statistic will be used as a judge between the null and alternative
hypotheses. The test statistic is an expression involving summary
statistics and the parameter(s). It can be different, depending on the
context. We will use an argument similar to proof by contradiction
in mathematics. Specifically, we will first assume that the null
hypothesis is true. We then calculate a quantity (a critical value or
\(p\)-value) which indicates whether
there is a contradiction. If yes, we reject the null hypothesis.
Otherwise, we do not reject the null hypothesis.
Our decision might be wrong. There are two types of errors we could
make: Type I and type II errors, which are tabled below.

If the null hypothesis is true but rejected, we have made a type
I error.
If the null hypothesis is false but not rejected, we have made a
type II error.
The probability of making a type I error is denoted by \(\alpha\). The probability of making a type
II error is denoted by \(\beta\). We
would like both \(\alpha\) and \(\beta\) to be small. How can we reduce
both? For a fixed sample size, lowering one generally raises the other,
so the usual way to reduce both is to increase the sample size.
The value or an upper bound of \(\alpha\) is usually pre-set (or controlled)
by the investigator and is called the level of
significance.
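A short simulation can make the meaning of \(\alpha\) concrete: when the null hypothesis is true, a test at level \(\alpha\) should reject in about a fraction \(\alpha\) of repeated samples. The sketch below reuses the juice claim (\(\mu = 295\), samples of 50 bottles) and a two-sided \(z\)-test; the population standard deviation of 5 ml, the number of repetitions, and the seed are assumptions made only for this illustration.

```r
set.seed(1)                 # assumed seed, for reproducibility only
alpha <- 0.05
mu0   <- 295                # null value from the juice claim
sigma <- 5                  # ASSUMED population standard deviation (ml)
n     <- 50
reps  <- 5000               # number of simulated samples

reject <- replicate(reps, {
  x  <- rnorm(n, mean = mu0, sd = sigma)          # H0 is true here
  z0 <- (mean(x) - mu0) / (sigma / sqrt(n))       # z test statistic
  abs(z0) > qnorm(1 - alpha / 2)                  # two-sided rejection rule
})

type1_rate <- mean(reject)  # proportion of false rejections
type1_rate
```

The observed rate should be close to, but not exactly, 0.05 because it is itself an estimate from simulated data.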
One-Sided and
Two-Sided Hypotheses
When the alternative hypothesis has a “\(<\)” or a “>” sign, the test is
called a one-sided test. Furthermore, if it is “<”, the test is
left-sided (or left-tailed); if it is “>”, the test is right-sided
(or right-tailed).
When the alternative hypothesis has a “\(\ne\)” sign, the test is said to be
two-sided (or two-tailed).
P-Values in
Hypothesis Tests
When we make a decision regarding whether the null hypothesis should
be rejected, we can use a few methods. One of such methods is the \(p\)-value method. The \(p\)-value is the probability of obtaining
test results at least as extreme as the result actually observed, under
the assumption that the null hypothesis is correct. A very small p-value
(say < 5%) means that an outcome like the one observed would be very
unlikely to occur by chance if the null hypothesis were true. A
relatively large p-value (say > 5%) means that such an observed
outcome may well occur by chance under the null hypothesis.
More specifically,
Consider an observed test-statistic \(t\) from unknown distribution \(T\). Then the \(p\)-value is \(P(T\ge t|H_{0})\) for a one-sided
right-tail test, \(P(T\le t|H_{0})\)
for a one-sided left-tail test, or \(2\cdot
\text{min}\{P(T\ge t|H_{0}), P(T\le t|H_{0})\}\) for a two-sided
test.
When the \(p\)-value is less than or
equal to a pre-selected \(\alpha\)
(called the significance level), reject the null hypothesis. In this
case, we also say that the result is statistically
significant.
Statistical significance does not always imply practical
significance. For example, if we find that the difference in mean
income between Minnesota and Iowa residents is $3 and it is
statistically significant, such a small difference would not matter
practically speaking.
The \(p\)-values under one-sided or
two-sided alternatives are related. If the p-value of a one sided test
is \(p\), the p-value of the
corresponding two-sided test must be \(2p\) or \(2(1-p)\), whichever is between 0 and 1.
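The relationship between one-sided and two-sided \(p\)-values can be seen numerically. This sketch assumes the test statistic follows a standard normal distribution under the null hypothesis and uses a hypothetical observed value of 2.4; the variable names are illustrative.

```r
z0 <- 2.4                                  # hypothetical observed statistic

A <- pnorm(z0, lower.tail = FALSE)         # P(Z >= z0 | H0)
p_right <- A                               # right-tailed test
p_left  <- 1 - A                           # left-tailed test
p_two   <- 2 * min(A, 1 - A)               # two-tailed test

round(c(right = p_right, left = p_left, two_sided = p_two), 4)
```

Here the two-sided \(p\)-value is twice the right-tail \(p\)-value, because the right tail is the smaller of the two tail areas.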
Connection between
Hypothesis Tests and Confidence Intervals
If a \(1-\alpha\) confidence
interval does not include the value that is under the null hypothesis of
a two-sided (or two tailed) test, we can reject the null hypothesis at
the \(\alpha\) level of
significance.
Example.
Based on a sample, a 90% confidence interval for a population mean
\(\mu\) is \((23.67, 27.92)\). If we want to use the
sample to test the following hypotheses
\[H_0: \mu=22 ~~~ vs ~~~ H_a: \mu\ne
22\] do we reject the null hypothesis at the significance level
0.10?
Solution.
Yes, reject the null hypothesis at level 0.10, since the hypothesized
value 22 falls out of the 90% confidence interval.
You can also use a one-sided confidence bound to draw a conclusion
about a one-sided test. For a left-tail test, you can develop a level
\(1-\alpha\) upper bound; For
a right-tail test, you can develop a level \(1-\alpha\) lower bound. When the
value under the null hypothesis is beyond the bound, reject the null
hypothesis at the significance level \(\alpha\).
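The link between a two-sided test and a confidence interval can be checked with software. The sketch below reuses the faculty years-of-service data from Example 5; the null values 22 and 30 are hypothetical choices, one outside and one inside the 95% interval.

```r
x <- c(23, 34, 12, 40, 34, 52, 27, 28, 40)   # data from Example 5

ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% CI, about (23.38, 41.06)

# Two-sided tests at alpha = 0.05 for two hypothetical null values
p_outside <- t.test(x, mu = 22)$p.value      # 22 lies outside the interval
p_inside  <- t.test(x, mu = 30)$p.value      # 30 lies inside the interval

ci
c(p_outside = p_outside, p_inside = p_inside)
```

The null value outside the interval gives a \(p\)-value below 0.05 (reject), and the null value inside gives a \(p\)-value above 0.05 (do not reject).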
General Procedure
for Hypothesis Tests
There are two procedures for hypothesis tests:
- The critical value method: We need to determine the critical region
(or rejection region). If the test is left-tailed (the alternative
hypothesis reads something like \(H_a:
\mu<24\) or \(H_a:
p<0.35\)), we find a cutoff (called \(c\)) that separates the lower \(\alpha\) area under the (null) distribution
of the test statistic from the other area. The critical region is \((-\infty, c)\). If the test is right-tailed
(the alternative hypothesis reads something like \(H_a: \mu>24\) or \(H_a: p>0.35\)), we find a cutoff (called
\(c\)) that separates the upper \(\alpha\) area under the distribution of the
test statistic from the other area. The critical region is \((c, \infty)\). If the test is two-tailed,
we find two cutoffs (called \(c\) and
\(-c\)) with \(c\) separating the upper \(\alpha/2\) area under the distribution of
the test statistic from the other area and \(-c\) separating the lower \(\alpha/2\) area under the distribution of
the test statistic from the other area. The critical region is \((-\infty, -c)\cup (c, \infty)\). In each
case, if the value of the test statistic falls in the corresponding
critical region, reject the null hypothesis.
For a test of hypotheses about a population mean with known
population standard deviation, watch this video: https://www.youtube.com/watch?v=04rhu_56O5g.
- The p-value method: Calculate the area that is under the (null)
distribution of the test statistic and is greater than the value of the
test statistic. Such area can be represented by \(A=P(T\ge t|H_0)\). For a right-tailed test,
the \(p\)-value equals \(A\); for a left-tailed test, the \(p\)-value equals \(1-A\); for a two-tailed test, the \(p\)-value equals \(2\cdot \text{min} \{1-A, A\}\), twice the smaller
of \(A\) and \(1-A\).
Watch this video: https://www.youtube.com/watch?v=W3rxXa7YNqk. Note that
in the video, the null hypothesis is written with an inequality sign
rather than as \(H_0: \mu=23\); for carrying out the test, the two forms
are technically the same.
Tests on the Mean of
a Normal Distribution, Variance Known
We assume that a random sample \(X_1, X_2,
\cdots, X_n\) has been taken from a normal population with known
variance \(\sigma^2\).
Hypothesis Tests on
the Mean
The null hypothesis is always written as \(H_0: \mu=\mu_0\), where \(\mu_0\) is called the hypothesized value or
null value.
The alternative hypothesis may look like one of the following:
\(H_a: \mu<\mu_0\)
\(H_a: \mu>\mu_0\)
\(H_a: \mu\ne \mu_0\)
The test statistic is always
\[Z_0=\frac{\bar{X}-\mu_0}{\sigma
/\sqrt{n}}\]
The observed value of the test statistic is always
\[z_0=\frac{\bar{x}-\mu_0}{\sigma
/\sqrt{n}}\]
For a left-tail test, the critical value is given by \(c\), the cutoff on the number line that
makes the area of the left region be \(\alpha\) under the curve of the standard
normal distribution, and the \(p\)-value is given by the area of the left
region beyond \(z_0\) under the
standard normal curve.
For a right-tail test, the critical value is given by \(c\), the cutoff on the number line that
makes the area of the right region be \(\alpha\) under the curve of the standard
normal distribution, and the \(p\)-value is given by the area of the right
region beyond \(z_0\) under the
standard normal curve.
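For a two-tail test, the critical values are given by \(c\) and \(-c\), two cutoffs on the number line that
make the left-tail and right-tail areas each \(\alpha/2\) under the standard normal curve, and the \(p\)-value is twice the tail area beyond \(|z_0|\).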
To find \(c\), watch this video: https://www.youtube.com/watch?v=p_KApjpyBHE (starting at
1:55).
Example 1. A left-sided test for a population mean with
\(\sigma\) known.
https://www.youtube.com/watch?v=oEW8Hd_xy1k
Example 2. A right-sided test for a population mean with
\(\sigma\) known.
Example 3. A two-sided test for a population mean with \(\sigma\) known.
https://www.youtube.com/watch?v=BWJRsY-G8u0
The following example gives some details:
Suppose we want to test whether a population mean is greater than 20. A
random sample of size 36 gives a sample mean of 22. If the population
standard deviation is 5, test, at level 0.05, whether the population mean
exceeds 20.
Solution.
The null and alternative hypotheses are:
\[H_0:\mu = 20 ~~~ vs ~~~ H_a: \mu >
20\] The test statistic value is
\[z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{22-20}{5/\sqrt{36}}=2.4\]
Since a larger sample mean or a larger \(z_0\) suggests rejection of the null
hypothesis, the rejection region looks like \((c, \infty)\) with the critical value \(c\). By the standard normal table or the R
code \(qnorm(1-\alpha)\) with \(\alpha=0.05\) (run it on the
R console), \(c=1.645\).
Since the test statistic value falls in the rejection region, reject
the null hypothesis.
Equivalently, we can use the \(p\)-value approach. The \(p\)-value is the area to the right of the
statistic value under the standard normal curve. By the standard normal
table or the R code \(1-pnorm(2.4)\),
the \(p\)-value is 0.0082. Since the
\(p\)-value is less than the
significance level, reject the null hypothesis.
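The whole calculation for this example fits in a few lines of R. This is a sketch using only `qnorm()` and `pnorm()`; the variable names are illustrative.

```r
xbar  <- 22      # sample mean
mu0   <- 20      # null value
sigma <- 5       # known population standard deviation
n     <- 36      # sample size
alpha <- 0.05    # significance level

z0      <- (xbar - mu0) / (sigma / sqrt(n))   # observed test statistic
crit    <- qnorm(1 - alpha)                   # right-tail critical value
p_value <- pnorm(z0, lower.tail = FALSE)      # area to the right of z0

reject_by_critical_value <- z0 > crit
reject_by_p_value        <- p_value <= alpha
round(c(z0 = z0, crit = crit, p_value = p_value), 4)
```

Both decision rules lead to the same conclusion, as they always do for the same test.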
Tests on the Mean of
a Normal Distribution, Variance Unknown
We assume that a random sample \(X_1, X_2,
\cdots, X_n\) has been taken from a normal population with
unknown variance \(\sigma^2\).
Hypothesis Tests on
the Mean
The test statistic is always
\[T_0=\frac{\bar{X}-\mu_0}{S
/\sqrt{n}}\]
The observed value of the test statistic is always
\[t_0=\frac{\bar{x}-\mu_0}{s
/\sqrt{n}}\]
For a left-tail test, the critical value is given by \(c\), the cutoff on the number line that
makes the area of the left region be \(\alpha\) under the curve of the t
distribution with \(n-1\) degrees of
freedom, and the \(p\)-value is given
by the area of the left region beyond \(t_0\) under the t curve.
For a right-tail test, the critical value is given by \(c\), the cutoff on the number line that
makes the area of the right region be \(\alpha\) under the curve of the t
distribution with \(n-1\) degrees of
freedom, and the \(p\)-value is given
by the area of the right region beyond \(t_0\) under the t curve.
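For a two-tail test, the critical values are given by \(c\) and \(-c\), two cutoffs on the number line that
make the left-tail and right-tail areas each \(\alpha/2\) under the t curve with \(n-1\) degrees of freedom, and the \(p\)-value is twice the tail area beyond \(|t_0|\).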
Example.
A very useful video: https://www.youtube.com/watch?v=VPd8DOL13Iw
Example.
Your company wants to improve sales. Past sales data indicate that
the average sales was $100 per transaction. After training your sales
force, recent sales data (taken from a random sample of 25 salesmen)
indicates an average of $130, with a standard deviation of $15. Did the
training work? Test your hypothesis at a 0.05 significance level.
Solution.
The population mean \(\mu\) is the
parameter of interest. To test whether sales have improved, we
set up the null and alternative hypotheses as follows:
\[H_0: \mu=100 ~~~ vs ~~~ H_a:
\mu>100\] The value of the test statistic is
\[t_0 =
\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{130-100}{15/\sqrt{25}}=10\]
with \(n-1\) or 24 degrees of
freedom.
Since larger \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null
hypothesis, the rejection (or critical) region looks like \((c, \infty)\), where \(c=t_{\alpha, n-1}\). We are given \(\alpha=0.05\), so the critical value based
on the \(t_{24}\) distribution is
1.711, which is obtained by R code \(qt(1-\alpha, n-1)\) or by a \(t\)-table.
Since the test statistic value 10 falls in the rejection region, we
reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the
\(t_{24}\) distribution to the right of
the test statistic value. Using the \(t\) table or the R code \(1-pt(10, 24)\), we know the \(p\)-value is smaller than 0.001 and thus
smaller than the significance level 0.05. Again, we reject the null
hypothesis.
In conclusion, the data provide sufficient evidence that sales
have improved after training.
The following is a video explaining the above procedure:
https://www.youtube.com/watch?v=7ty2bO6VrUI
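Because only summary statistics are given in the sales example, `t.test()` cannot be called on raw data. The steps can instead be computed directly, as in this sketch (variable names are illustrative):

```r
xbar  <- 130     # sample mean sales per transaction
mu0   <- 100     # null value (past average)
s     <- 15      # sample standard deviation
n     <- 25      # sample size
alpha <- 0.05

t0      <- (xbar - mu0) / (s / sqrt(n))            # observed test statistic
crit    <- qt(1 - alpha, df = n - 1)               # right-tail critical value
p_value <- pt(t0, df = n - 1, lower.tail = FALSE)  # area to the right of t0

c(t0 = t0, crit = round(crit, 3))
p_value < 0.001
```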
Example.
A firm claims that their product on average weighs 19 pounds. A
supervisory authority suspects that the average weight is below 19 pounds,
so it collects a random sample of 51 products made by the company from
the market. The sample mean is 18.5 pounds with a standard deviation of 3.2
pounds. Test appropriate hypotheses at the significance level 0.01. In
order to protect itself from being sued by the company, should the
authority use a larger or smaller significance level?
Solution.
The null and alternative hypotheses are:
\[H_0: \mu=19 ~~~ vs ~~~ H_a:
\mu<19\] The value of the test statistic is
\[t_0 =
\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{18.5-19}{3.2/\sqrt{51}}=-1.1158\]
with \(n-1\) or 50 degrees of
freedom.
Since smaller \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null
hypothesis, the rejection (or critical) region looks like \((-\infty, c)\), where \(c=-t_{\alpha, n-1}\). We are given \(\alpha=0.01\), so the critical value based
on the \(t_{50}\) distribution is \(-2.4033\), which is obtained by R code
\(qt(\alpha, n-1)\) with \(\alpha = 0.01, n=51\) or by a \(t\)-table.
Since the test statistic value \(-1.1158\) does not fall in the rejection
region, we fail to reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the
\(t_{50}\) distribution to the left of
the test statistic value. Using the \(t\) table or the R code \(pt(-1.1158, 50)\), we know the \(p\)-value is 0.1349 and thus NOT smaller
than the significance level 0.01. Again, we fail to reject the null
hypothesis.
In conclusion, the data do not provide sufficient evidence that the
average weight of the firm’s products is below 19 pounds.
The following is a video explaining the above procedure: https://www.youtube.com/watch?v=ZY5XxJ2aJNc
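The left-tailed test for the product weights can be computed from the summary statistics in the same way. This sketch uses the significance level 0.01 stated in the problem; variable names are illustrative.

```r
xbar  <- 18.5    # sample mean weight (pounds)
mu0   <- 19      # null value (the firm's claim)
s     <- 3.2     # sample standard deviation
n     <- 51      # sample size
alpha <- 0.01

t0      <- (xbar - mu0) / (s / sqrt(n))   # observed test statistic
crit    <- qt(alpha, df = n - 1)          # left-tail critical value (negative)
p_value <- pt(t0, df = n - 1)             # area to the left of t0

reject <- t0 < crit
round(c(t0 = t0, crit = crit, p_value = p_value), 4)
reject
```

The statistic does not fall below the critical value and the \(p\)-value exceeds 0.01, so the null hypothesis is not rejected.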
Tests on a Population
Proportion
It is often necessary to test hypotheses on a population proportion.
For example, suppose that a random sample of size \(n\) has been taken from a large (possibly
infinite) population and that \(X\)
observations in this sample belong to a class of interest. Then \(\hat{P}=\frac{X}{n}\) is a point estimator
of the proportion \(p\) of the population that belongs to this class.
Typically, we require that \(np\) and
\(n(1-p)\) be greater than or equal
to 5.
The null hypothesis always looks like \(H_0:
p=p_0\), where \(p_0\) is called
the hypothesized value or null value.
The alternative hypothesis may look like one of the following:
\(H_a: p<p_0\)
\(H_a: p>p_0\)
\(H_a: p\ne p_0\)
The test statistic is always
\[Z_0=\frac{\hat{P}-p_0}{\sqrt{\frac{p_0
(1-p_0)}{n}}}\]
The observed value of the test statistic is always \[z_0=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0
(1-p_0)}{n}}}\]
The critical value and p-value are calculated in the same way as the
normal distribution case for \(\mu\)
with known \(\sigma\).
Large-Sample Tests
on a Proportion
Example.
A semiconductor manufacturer produces controllers used in automobile
engine applications. The customer requires that the process fallout or
fraction defective at a critical manufacturing step not exceed 0.045 and
that the manufacturer demonstrate process capability at this level of
quality using \(\alpha= 0.05\). The
semiconductor manufacturer takes a random sample of 200 devices and
finds that 4 of them are defective. Can the manufacturer demonstrate
process capability for the customer?
Solution.
We may solve this problem using the following steps:
Parameter of interest: The parameter of interest is the process
fraction defective \(p\).
Null hypothesis: \(H_0: p =
0.045\)
Alternative hypothesis: \(H_a: p <
0.045\) This formulation of the problem will allow the
manufacturer to make a strong claim about process capability if the null
hypothesis \(H_0: p = 0.045\) is
rejected.
Test statistic: The test statistic is \(z_0=\frac{x-np_0}{\sqrt{np_0
(1-p_0)}}=-1.7055\)
where \(x = 4, n = 200\), and \(p_0 = 0.045\).
\(p\)-value: 0.044 (the
left-tail area under the standard normal curve with cutoff \(-1.7055\)).
Decision & conclusion: Reject \(H_0\) since the \(p\)-value is less
than 0.05. We conclude that the process fraction defective \(p\) is less
than 0.045. Practical Interpretation: We conclude that the process is
capable.
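The test statistic and \(p\)-value in this example can be reproduced in R. This is a sketch of the large-sample \(z\)-test computed by hand with `pnorm()`; the variable names are illustrative. `prop.test()` reports a chi-square statistic with a continuity correction by default, so its output would not match these numbers exactly.

```r
x     <- 4       # defective devices in the sample
n     <- 200     # sample size
p0    <- 0.045   # null value
alpha <- 0.05

p_hat   <- x / n
z0      <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # observed test statistic
p_value <- pnorm(z0)                               # left-tail area

reject <- p_value <= alpha
round(c(z0 = z0, p_value = p_value), 4)
reject
```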
Example. An article in Fortune (September 21, 1992) claimed
that nearly one-half of all engineers continue academic studies beyond
the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree.
Data from an article in Engineering Horizons (Spring 1990) indicate that
118 of 484 new engineering graduates were planning graduate study.
Test the hypothesis \(H_0: p=0.5\).
What is the P-value for this test? Round your answer to 4 decimal
places.
Testing for Goodness
of Fit
The hypothesis-testing procedures that we have discussed in previous
sections are designed for problems in which the population or
probability distribution is known and the hypotheses involve the
parameters of the distribution. Another kind of hypothesis is often
encountered: We do not know the underlying distribution of the
population, and we wish to test the hypothesis that a particular
distribution will be satisfactory as a population model.
The test method, called the goodness-of-fit test, is based on the
chi-square distribution.

Each chi-squared distribution is associated with a number called the
number of degrees of freedom. The above graph shows 6 different
chi-squared distributions. All chi-squared distributions are skewed to
the right.
In this chapter, we will consider hypothesis testing problems that
involve calculating p-values based on chi-squared distributions.
We will focus on the situation that the population distribution is
discrete with only a few categories. So, the null hypothesis is
\[H_0: p_1 = p_{10}, ~p_2 = p_{20},
\cdots, ~p_k = p_{k0} \] where \(p_{10}, p_{20}, \cdots, p_{k0}\) are given
proportions, and the alternative hypothesis is
\[H_a: \text{At least one of the
proportions is not as specified}\]
The test procedure requires a random sample of size \(n\) from the population whose probability
distribution is unknown. These \(n\)
observations are arranged in a frequency table with \(k\) classes/categories.
Let \(O_i\) be the observed
frequency in the \(i\)th class. Under
the null hypothesis, we compute the expected frequency in the \(i\)th class, denoted \(E_i = n\cdot p_{i0}\), \(i = 1, 2, ..., k\). The test statistic
is
\[\chi_0^2=\sum_{i=1}^k \frac{(O_i
-E_i)^2}{E_i}\]
Under the null hypothesis, \(\chi_0^2\) has, approximately, a chi-square
distribution with \(k − 1\) degrees of
freedom.
We can again use one of the following two methods for making a
decision:
- The critical value method: The critical region is always \((c, \infty)\), where \(c\) is the cutoff for the chi-square
distribution such that the upper tail is \(\alpha\) (the significance level).
- The \(p\)-value method: The \(p\)-value is the upper-tail area under the
chi-square curve with cutoff \(\chi_0^2\).
Example.
Throw a 6-sided die 100 times. The observations are
17 ones
18 twos
13 threes
17 fours
22 fives
13 sixes
Test, at the significance level 0.05, whether the die is fair.
Solution.
The null hypothesis is \(H_0:
p_1=p_2=\cdots=p_6=1/6\) and the alternative hypothesis is \(H_a: \text{At least one of the probabilities is
not 1/6}\). Under the null hypothesis, the expected frequencies
are all \((\frac{1}{6})(100)=16.67\).
The observed frequencies are \(O_1=17, O_2=18,
O_3=13, O_4=17, O_5=22, O_6=13\).
The test statistic
\[\chi_0^2=\sum_{i=1}^k \frac{(O_i
-E_i)^2}{E_i}\]
\[\chi_0^2=\frac{(17
-16.67)^2}{16.67}+\frac{(18 -16.67)^2}{16.67}+\frac{(13
-16.67)^2}{16.67}+\frac{(17 -16.67)^2}{16.67}+\frac{(22
-16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}\] \[\chi_0^2=3.44\] with \(6-1=5\) degrees of freedom.
The critical value is \(\chi_{0.05,
5}^2=11.0705\). Since the test statistic value is not greater
than the critical value, we do not reject the null hypothesis. The \(p\)-value is \(P(\chi^2>3.44)= 0.6325\), the area of
the right region under the chi-squared density curve \((df=5)\) (Watch: https://www.youtube.com/watch?v=HwD7ekD5l0g).
Decision & conclusion: Since the \(p\)-value is greater than the significance
level 0.05, the null hypothesis is NOT rejected. We conclude that we
don’t have enough evidence to say that the die is unfair.
The R code:
chisq.test(x=c(17, 18, 13, 17, 22, 13 ))
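To see where the numbers in the solution come from, the goodness-of-fit statistic can also be built up by hand. This sketch uses `qchisq()` and `pchisq()` for the critical value and \(p\)-value; the variable names are illustrative.

```r
observed <- c(17, 18, 13, 17, 22, 13)      # counts for faces 1 to 6
p0       <- rep(1 / 6, 6)                  # probabilities under H0 (fair die)
expected <- sum(observed) * p0             # expected counts, each 100/6

chi_sq  <- sum((observed - expected)^2 / expected)  # test statistic
df      <- length(observed) - 1                     # k - 1 degrees of freedom
crit    <- qchisq(1 - 0.05, df)                     # upper-tail critical value
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)   # upper-tail area

round(c(chi_sq = chi_sq, crit = crit, p_value = p_value), 4)
```

When no probabilities are supplied, `chisq.test()` assumes equal probabilities for all categories, which is why the one-line call above tests fairness of the die.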
Contingency Table
Tests
Many times the \(n\) elements of a
sample from a population may be classified according to two different
criteria. It is then of interest to know whether the two methods of
classification are statistically independent; for example, we may
consider the population of graduating engineers and may wish to
determine whether starting salary is independent of academic
disciplines. Assume that the first method of classification has \(r\) levels and that the second method has
\(c\) levels. We will let \(O_{ij}\) be the observed frequency for
level \(i\) of the first classification
method and level \(j\) of the second
classification method. The data would, in general, appear as shown in
the following Table. Such a table is usually called an \(r × c\) contingency table.

To test the independence of the two categorical variables, the null
and alternative hypotheses are
\[H_0: \text{The two categorical variables
are independent vs.} ~ H_a:\text{The two categorical variables are
dependent}\]
We again use the chi-square test and the test statistic is
\[\chi_0^2=\sum_{i,j} \frac{(O_{ij}
-E_{ij})^2}{E_{ij}}\]
where the expected frequency \(E_{ij}\) is calculated as the sum of the
\(i\)th row multiplied by the sum of
the \(j\)th column, then divided by the
sum of all frequencies.
Under the null hypothesis, this test statistic has an approximate
chi-square distribution with \((r - 1)(c -
1)\) degrees of freedom.
The critical value and the p-value are calculated in the same way as
for goodness of fit.
Example.
A company has to choose among three health insurance plans.
Management wishes to know whether the preference for plans is
independent of job classification and wants to use α = 0.05. The
opinions of a random sample of 500 employees are shown in table
below:

Solution.
\(H_0: \text{Job classification and
health insurance plan are independent}\) and \(H_a:\text{Job classification and health insurance
plan are dependent}\)
The expected frequencies are 136, 136, 68, 64, 64, 32,
respectively.
The Chi-square statistic is 49.63.
The critical value is 5.99 (the cutoff of the chi-square
distribution that separates the upper tail area of 0.05).
The p-value is essentially 0.
Decision & conclusion: By either method, we reject the null
hypothesis. We conclude that Job classification and health insurance
plan are dependent.
R code:
M=matrix(c(160, 40, 140, 60, 40, 60), 2, 3)
chisq.test(M)
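The expected frequencies and the test statistic quoted in the solution can be checked directly from the matrix `M`. The sketch below uses only base R; the names `E`, `chi_sq`, and `critical_value` are illustrative.

```r
# Observed counts, entered column by column exactly as in the example
M <- matrix(c(160, 40, 140, 60, 40, 60), nrow = 2, ncol = 3)

# E_ij = (row i total) * (column j total) / (grand total)
E <- outer(rowSums(M), colSums(M)) / sum(M)

chi_sq <- sum((M - E)^2 / E)
df <- (nrow(M) - 1) * (ncol(M) - 1)
critical_value <- qchisq(0.95, df)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

E
round(c(statistic = chi_sq, df = df, critical = critical_value), 2)
p_value
```

Reading `E` row by row gives the expected frequencies listed in the solution.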
Statistical Inference
for Two Samples
Case Study: Paint Drying Time
A product developer is interested in reducing the drying time of a
primer paint. Two formulations of the paint are tested:
- formulation 1 is the standard chemistry, and
- formulation 2 has a new drying ingredient that should reduce the
drying time.
From experience, it is known that the standard deviation of drying
time is 8 minutes, and this inherent variability should be unaffected by
the addition of the new ingredient. Ten specimens are painted with
formulation 1, and another 10 specimens are painted with formulation 2;
the 20 specimens are painted in random order. The two sample average
drying times are 121 minutes and 112 minutes, respectively. What
conclusions can the product developer draw about the effectiveness of
the new ingredient?
In the above case study, the objective is to compare two different
conditions to determine whether either condition produces a significant
effect on the response that is observed. These conditions are sometimes
called treatments. The two different treatments are two
paint formulations, and the response is the drying time. The purpose of
the study is to determine whether the new formulation results in a
significant effect—reducing drying time. In this situation, the product
developer (the experimenter) randomly assigned 10 test specimens to one
formulation and 10 test specimens to the other formulation. Then the
paints were applied to the test specimens in random order until all 20
specimens were painted. This is an example of a completely
randomized experiment.
When statistical significance is observed in a
randomized experiment, the experimenter can be
confident in the conclusion that the difference in treatments resulted
in the difference in response. That is, we can be confident that a
cause-and-effect relationship has been found.
Another case study:
Suppose an engineer wants to compare the strength of two types of
metal alloys (Alloy A and Alloy B). The strengths of 10 randomly
selected Alloy A’s and 12 randomly selected Alloy B’s (in MPa) are
- Alloy A: 404.97, 398.62, 406.48, 415.23, 397.66, 397.66, 415.79,
407.67, 395.31, 405.43
- Alloy B: 409.12, 405.57, 417.75, 395.43, 410.26, 424.42, 421.87,
402.85, 407.05, 409.43, 397.45, 402.33
Is there a significant difference in the mean strength between Alloy
A and Alloy B?
Most of the practical applications of the procedures to be covered in
this chapter arise in the context of simple comparative experiments in
which the objective is to study the difference in the parameters of the
two populations involved.
The general situation is shown in the figure below:

Population 1 has mean \(\mu_1\) and
variance \(\sigma_1^2\), and population
2 has mean \(\mu_2\) and variance \(\sigma_2^2\). Inferences will be based on
two random samples of sizes \(n_1\) and
\(n_2\), respectively.
There are many studies that are not randomized experiments. Those
studies do not involve the use of treatments and are called
observational studies. It is difficult to identify
causality in observational studies because the observed statistically
significant difference in response for the two groups may be due to some
other underlying factor (or group of factors) that was not equalized by
randomization and not due to the treatments. For example, if an
observational study finds a difference in heart attack risk between
groups with different iron levels, that difference could be attributable
to the iron levels or to other underlying factors that offer a
reasonable explanation for the observed results, such as cholesterol
levels or hypertension.
The following are more examples comparing the parameters of two
populations:
Use Cases for T-Tests
| Use Case | Description |
|---|---|
| Comparing Two Groups | Assess the difference in average test scores between two different classes. |
| Before-and-After Studies | Evaluate the impact of a training program on employee performance. |
| Medical Trials | Compare the effects of two medications on patient recovery times. |
| Quality Control | Assess the quality of products from two different suppliers. |
| Customer Satisfaction Surveys | Analyze satisfaction ratings between two store locations. |
| Performance Comparison | Compare the athletic performance of male and female athletes. |
| Marketing Campaign Effectiveness | Assess the sales impact of two different advertising strategies. |
| Educational Research | Investigate the effectiveness of a new teaching method versus a traditional approach. |
| Environmental Studies | Compare pollution levels before and after implementing a new regulation. |
| Consumer Behavior Analysis | Examine spending habits between two age groups. |
Which of the above studies could be experimental studies?
Inference on the
Difference in Means of Two Normal Distributions, Variances Known
In this section, we consider statistical inferences on the difference
in means \(\mu_1 - \mu_2\) of two
normal distributions where the variances \(\sigma_1^2\) and \(\sigma_2^2\) are known. The assumptions for
this section are summarized as follows.
There is a random sample from population 1.
There is a random sample from population 2.
The two samples are independent.
Both populations are normal.
Although there are some textbooks that discuss this situation, we
will skip this case. For students who have interest, some reference can
be found here: https://www.youtube.com/watch?v=NL9o1dKrh8o.
Inference on the
Difference in Means of Two Normal Distributions, Variances Unknown
We now consider tests of hypotheses on the difference in means \(\mu_1 - \mu_2\) of two normal distributions
where the variances \(\sigma_1^2\) and
\(\sigma_2^2\) are unknown. A \(t\)-statistic is used to test these
hypotheses. This is the most important situation in practice.
Hypotheses Tests
on the Difference in Means, Variances Unknown
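We will follow the 4-step procedure introduced in the previous
chapter.
Step 1: State the hypotheses. The null hypothesis is \(H_0: \mu_1=\mu_2\) and the alternative hypothesis is one of the following three:
\[H_a: \mu_1<\mu_2\] \[H_a: \mu_1>\mu_2\] \[H_a: \mu_1\ne\mu_2\]
Step 2: Calculate the test statistic.
\[T_0=\frac{\bar{X}_1-\bar{X}_2-0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}\]
Under the null hypothesis, the test statistic has approximately a \(t\)-distribution with \(\nu\) degrees of freedom given by
\[\nu =
\frac{(A+B)^2}{\frac{A^2}{n_1-1}+\frac{B^2}{n_2-1}}\] with \(A=\frac{S_1^2}{n_1}\) and \(B=\frac{S_2^2}{n_2}\).
If \(\nu\) is not an integer,
round down to the nearest integer when using a \(t\)-table. (R's t.test() uses the fractional value of \(\nu\) directly.)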
Step 3: Determine the \(p\)-value. The \(p\)-value is determined by the \(t\)-distribution with \(\nu\) degrees of freedom. The procedure is
similar to the previous chapter.
Step 4: Make a decision and draw a conclusion in the
context.
If observations are available, you can use the following R code to do
the analysis:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
Here, x is the vector representing the first sample and y is the
vector representing the second sample. We always use mu = 0, meaning
that the null hypothesis states that the two means are equal. If the two
samples are independent, we set paired = FALSE, which is the default
setting. We always use var.equal = FALSE, meaning that the two
underlying populations are not assumed to have equal variances.
Example 1.
The overall distance traveled by a golf ball is tested by hitting the
ball with Iron Byron, a mechanical golfer with a swing that is said to
emulate the legendary champion, Byron Nelson. Ten randomly selected
balls of two brands are tested and the overall distance measured. The
data follow:
Brand 1: 287 277 287 271 283 271 279 275 263 267
Brand 2: 259 248 260 265 273 281 271 270 263 268
Is there evidence to show that there is a difference in the mean
overall distance of brands? Use 0.05 as the significance level.
Solution.
Since we are testing whether the two population means are equal, the
alternative hypothesis is that they are unequal.
The R code is:
x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: x and y
## t = 2.6438, df = 17.817, p-value = 0.0166
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.088609 18.311391
## sample estimates:
## mean of x mean of y
## 276.0 265.8
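Since the \(p\)-value is 0.0166, which is less than 0.05, we
reject the null hypothesis at the significance level 0.05. There is
evidence of a difference in the mean overall distance of the two brands.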
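To see where the numbers in the R output come from, the test statistic \(T_0\) and the degrees of freedom \(\nu\) can be computed from the formulas in Step 2. This is a minimal sketch using the golf ball data; the names `A`, `B`, `t0`, and `nu` simply mirror the notation above.

```r
x <- c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)  # Brand 1
y <- c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)  # Brand 2

A <- var(x) / length(x)   # S_1^2 / n_1
B <- var(y) / length(y)   # S_2^2 / n_2

# Test statistic and degrees of freedom from the formulas above
t0 <- (mean(x) - mean(y) - 0) / sqrt(A + B)
nu <- (A + B)^2 / (A^2 / (length(x) - 1) + B^2 / (length(y) - 1))

# Two-sided p-value, using the fractional nu as t.test() does
p_value <- 2 * pt(abs(t0), df = nu, lower.tail = FALSE)

round(c(t0 = t0, nu = nu, p_value = p_value), 4)
```

Rounding \(\nu\) down to 17, as one would for a \(t\)-table, changes the \(p\)-value only slightly.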
More examples:
Example 2: Material Strength Testing
In a civil engineering project, two types of concrete mixtures (Type
A and Type B) are being considered for constructing a bridge. Tensile
strength is a critical factor for the bridge’s durability. The project
team collects tensile strength measurements for samples of each concrete
type. The goal is to determine if there’s a significant difference in
tensile strength between the two types of concrete.
# Data for two types of concrete mixtures
concrete_type_A <- c(29.6, 31.2, 30.5, 32.1, 31.8, 29.9, 30.4, 30.2, 32.5, 31.3,
31.7, 30.8, 30.1, 32.3, 31.6, 30.9, 30.7, 31.5, 31.0, 32.0)
concrete_type_B <- c(28.3, 29.1, 28.5, 29.9, 30.5, 29.0, 28.9, 29.8, 30.2, 29.4,
28.7, 29.3, 29.7, 29.5, 30.1, 28.8, 29.6, 29.2, 30.4, 28.6)
# Perform two-sample t-test
t_test_result <- t.test(concrete_type_A, concrete_type_B)
t_test_result
##
## Welch Two Sample t-test
##
## data: concrete_type_A and concrete_type_B
## t = 7.3405, df = 35.8, p-value = 1.22e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.251928 2.208072
## sample estimates:
## mean of x mean of y
## 31.105 29.375
Example 3: Energy Efficiency Analysis
In the automotive industry, a car manufacturer is testing the energy
efficiency of two engine types: a traditional combustion engine and a
new hybrid engine. The company records fuel consumption data for both
engine types while running under identical conditions. The objective is
to determine if the hybrid engine is significantly more
fuel-efficient.
# Data for energy efficiency analysis
traditional_method <- c(8.2, 8.5, 8.4, 8.7, 8.6, 8.3, 8.2, 8.4, 8.6, 8.5,
8.4, 8.7, 8.8, 8.4, 8.5, 8.6, 8.3, 8.2, 8.6, 8.4)
new_technology <- c(6.9, 7.1, 7.0, 7.2, 7.3, 6.8, 7.1, 6.9, 7.2, 7.0,
7.1, 7.0, 6.8, 7.3, 7.2, 6.9, 7.0, 6.8, 7.1, 7.3)
# Perform two-sample t-test
t_test_result <- t.test(traditional_method, new_technology, alternative = "greater")
t_test_result
##
## Welch Two Sample t-test
##
## data: traditional_method and new_technology
## t = 26.116, df = 37.906, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.323648 Inf
## sample estimates:
## mean of x mean of y
## 8.465 7.050
Note: “the hybrid engine is significantly more fuel-efficient” is
equivalent to “the traditional engine consumes more fuel.”
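The direction of a one-sided alternative depends on the order in which the samples are passed to t.test(). As a small check using the same data, swapping the samples and switching "greater" to "less" tests the same hypothesis; the names `p_greater` and `p_less` are illustrative.

```r
traditional_method <- c(8.2, 8.5, 8.4, 8.7, 8.6, 8.3, 8.2, 8.4, 8.6, 8.5,
                        8.4, 8.7, 8.8, 8.4, 8.5, 8.6, 8.3, 8.2, 8.6, 8.4)
new_technology <- c(6.9, 7.1, 7.0, 7.2, 7.3, 6.8, 7.1, 6.9, 7.2, 7.0,
                    7.1, 7.0, 6.8, 7.3, 7.2, 6.9, 7.0, 6.8, 7.1, 7.3)

# H_a: mean(traditional) - mean(new) > 0
test_greater <- t.test(traditional_method, new_technology, alternative = "greater")

# Same hypothesis with the samples swapped: mean(new) - mean(traditional) < 0
test_less <- t.test(new_technology, traditional_method, alternative = "less")

p_greater <- test_greater$p.value
p_less <- test_less$p.value

# The t statistics differ only in sign, and the p-values coincide
c(t_greater = unname(test_greater$statistic), t_less = unname(test_less$statistic))
all.equal(p_greater, p_less)
```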
Example 4: Product Performance Testing
A consumer electronics company is developing two models of
smartphones. Each model uses a different type of battery chemistry. The
company tests the battery life of both models by continuously using the
devices until the batteries are drained. They want to know if there’s a
significant difference in battery life between the two models.
# Data for battery life comparison
battery_chemistry_A <- c(14.5, 15.0, 14.7, 15.2, 14.9, 15.1, 14.8, 15.0, 14.6, 15.3,
14.8, 15.2, 15.1, 14.9, 15.0, 14.7, 15.2, 14.5, 15.1, 14.9)
battery_chemistry_B <- c(13.5, 14.0, 13.7, 14.2, 13.9, 14.1, 13.8, 14.0, 13.6, 14.3,
13.8, 14.2, 14.1, 13.9, 14.0, 13.7, 14.2, 13.5, 14.1, 13.9)
# Perform two-sample t-test
t_test_result <- t.test(battery_chemistry_A, battery_chemistry_B)
t_test_result
##
## Welch Two Sample t-test
##
## data: battery_chemistry_A and battery_chemistry_B
## t = 13.279, df = 38, p-value = 7.489e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8475502 1.1524498
## sample estimates:
## mean of x mean of y
## 14.925 13.925
Confidence
Interval on the Difference in Means, Variances Unknown
If observations are available, you can use the following R code to do
analysis:
t.test(x, y = NULL, conf.level = 0.95)
Example.
The overall distance traveled by a golf ball is tested by hitting the
ball with Iron Byron, a mechanical golfer with a swing that is said to
emulate the legendary champion, Byron Nelson. Ten randomly selected
balls of two brands are tested and the overall distance measured. The
data follow:
Brand 1: 287 277 287 271 283 271 279 275 263 267
Brand 2: 259 248 260 265 273 281 271 270 263 268
Calculate a 95% two-sided confidence interval on the difference in
mean overall distance. Round your answer to one decimal place
(e.g. 98.7).
Solution.
R code:
x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 2.6438, df = 17.817, p-value = 0.0166
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.088609 18.311391
## sample estimates:
## mean of x mean of y
## 276.0 265.8
The 95 percent confidence interval is \((2.09, 18.31)\), or \((2.1, 18.3)\) when rounded to one decimal place.
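The interval reported by t.test() can also be built by hand. The sketch below assumes the usual Welch form, \((\bar{x}_1-\bar{x}_2)\pm t_{\alpha/2,\nu}\sqrt{s_1^2/n_1+s_2^2/n_2}\), with the fractional \(\nu\) that R uses; the variable names are illustrative.

```r
x <- c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)  # Brand 1
y <- c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)  # Brand 2

A <- var(x) / length(x)
B <- var(y) / length(y)
se <- sqrt(A + B)                                   # standard error of the difference
nu <- (A + B)^2 / (A^2 / (length(x) - 1) + B^2 / (length(y) - 1))

t_crit <- qt(1 - 0.05 / 2, df = nu)                 # t_{alpha/2, nu} for 95% confidence
ci <- (mean(x) - mean(y)) + c(-1, 1) * t_crit * se

round(ci, 1)
```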
Paired t-Test
Case Study: Shear Strength of Steel Girder
An article in the Journal of Strain Analysis for Engineering Design
[“Model Studies on Plate Girders” (1983, Vol. 18(2), pp. 111–117)]
reports a comparison of several methods for predicting the shear
strength for steel plate girders. Data for two of these methods, the
Karlsruhe and Lehigh procedures, when applied to nine specific girders,
are shown in Table 10.3. We wish to determine whether there is any
difference (on the average) for the two methods.

For such a study, the analysis is to use the one-sample \(t\) method (with an unknown variance) based
on the differences for testing hypotheses or constructing confidence
intervals about the mean difference, denoted by \(\mu_d=\mu_1-\mu_2\).
Specifically, the null hypothesis is \(H_0:\mu_d=0\) and the alternative
hypothesis is one of the following three:
\[H_a: \mu_d<0\] \[H_a: \mu_d>0\] \[H_a: \mu_d\ne0\] The test statistic is
\[T=\frac{\bar{d}-0}{s_d/\sqrt{n}}\] where
\(\bar{d}\) is the mean of the
differences and \(s_d\) is the
corresponding standard deviation. Under the null hypothesis, the test
statistic has a \(t\)-distribution with
\(n-1\) degrees of freedom.
When doing the calculation in R, there is no need to carry out the above
procedure by hand. Instead, we can use the following R code:
t.test(x, y,
alternative = c("two.sided", "less", "greater"), ## Use only one of the 3 options
paired = TRUE,
conf.level = 0.95)
where \(x\) and \(y\) are the original samples.
The \(1-\alpha\) confidence interval
formula for \(\mu_d\) is the same as
the one-sample \(t\) confidence
interval with unknown variance. That is,
\[\bar{d}\pm
t_{\alpha/2}\cdot\frac{s_d}{\sqrt{n}}\] where \(n\) is the number of pairs.
Let’s demonstrate using the above case study.
The null hypothesis is \(H_0:\mu_d = 0\) and the alternative hypothesis is \(H_a:\mu_d \ne 0\).
The R code is:
x = c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y = c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)
t.test(x, y, alternative = "two.sided", paired = TRUE, conf.level = 0.95)
The \(t\)-statistic is 6.08. The
\(p\)-value is 0.0003, indicating a
significant difference at any commonly used significance level such as
0.05 .
The 95% confidence interval for \(\mu_d\) is \((0.1700, 0.3777)\).
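Because the paired \(t\)-test is a one-sample \(t\)-test on the differences, the reported values can be reproduced from \(\bar{d}\) and \(s_d\). A minimal sketch with the girder data follows; the names `d`, `t_stat`, and `ci` are illustrative.

```r
x <- c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y <- c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)

d <- x - y            # pairwise differences
n <- length(d)        # number of pairs

# Test statistic T = (d_bar - 0) / (s_d / sqrt(n)) with n - 1 degrees of freedom
t_stat <- (mean(d) - 0) / (sd(d) / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 95% confidence interval: d_bar +/- t_{alpha/2} * s_d / sqrt(n)
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)

# A one-sample t-test on the differences gives the same statistic
one_sample <- t.test(d, mu = 0)

round(c(t = t_stat, p_value = p_value, lower = ci[1], upper = ci[2]), 4)
```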
Inference on Two
Population Proportions
Suppose we have two discrete populations, each containing the same
class (category) of interest, with proportions \(p_1\) and \(p_2\), respectively. Suppose that two
independent random samples of sizes \(n_1\) and \(n_2\) are taken from the two populations, and
let \(X_1\) and \(X_2\) represent the number of observations
that belong to the class of interest in samples 1 and 2,
respectively.
Large-Sample Tests
on the Difference in Population Proportions
We are interested in testing the hypotheses
\[H_0: p_1=p_2\] against one of the
following: \[H_a: p_1>p_2\] \[H_a: p_1<p_2\] \[H_a: p_1\ne p_2\]
The test statistic is:
\[Z_0=\frac{(\hat{P}_1-\hat{P}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\]
where \(\hat{p}=\frac{X_1+X_2}{n_1+n_2}\) is called
the pooled sample proportion.
R code:
prop.test(x = c(x1, x2),
n = c(n1, n2),
alternative = c("two.sided", "less", "greater"), # Use one of the 3 options
correct = FALSE
)
where \(n_1, n_2\) are the sample sizes and \(x_1, x_2\) are the numbers of successes. Note that the test statistic that R reports is \(\chi^2\), which equals \(Z_0\) squared.
Example 1.
Two different types of injection-molding machines are used to form
plastic parts. A part is considered defective if it has excessive
shrinkage or is discolored. Two random samples, each of size 300, are
selected, and 15 defective parts are found in the sample from machine 1,
while 10 defective parts are found in the sample from machine 2. Is it
reasonable to conclude that both machines produce the same proportion of
defective parts, using the 0.05 significance level? Answer by finding the
\(p\)-value for the test. Round your answer to 3 decimal places.
Solution.
The null and alternative hypotheses are:
\[H_0: p_1=p_2 ~~ \text{vs} ~~ H_a:p_1\ne p_2\]
R code:
x = c(15, 10)
n = c(300, 300)
prop.test(x, n, alternative = "two.sided", correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 1.0435, df = 1, p-value = 0.307
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01528380 0.04861713
## sample estimates:
## prop 1 prop 2
## 0.05000000 0.03333333
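The \(p\)-value is 0.307, which is greater than 0.05, so we fail
to reject the null hypothesis. There is not enough evidence to conclude
that the two machines produce different proportions of defective parts.
Note that the test statistic that R reports is \(\chi^2\), which is the square of the \(z\) statistic computed by
hand.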
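The relationship between \(Z_0\) and the X-squared value in the R output can be verified with a few lines. This sketch applies the pooled-proportion formula above to the machine data; the variable names are illustrative.

```r
x1 <- 15; n1 <- 300   # machine 1: defectives, sample size
x2 <- 10; n2 <- 300   # machine 2: defectives, sample size

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled sample proportion

# Z_0 from the formula above
z0 <- (p1_hat - p2_hat - 0) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z0), lower.tail = FALSE)

round(c(z0 = z0, z0_squared = z0^2, p_value = p_value), 4)
```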
Confidence
Interval on the Difference in Population Proportions
The confidence interval on the difference in population proportions
is given below:
\[(\hat{p}_1-\hat{p}_2)\pm
z_{\alpha/2}\cdot
\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]
R code:
prop.test(x = c(x1, x2),
n = c(n1, n2),
conf.level = 0.95,
correct = FALSE)
Always set correct = FALSE so that R applies no continuity correction and its interval matches the formula above.
Example.
Two different types of injection-molding machines are used to form
plastic parts. A part is considered defective if it has excessive
shrinkage or is discolored. Two random samples, each of size 300, are
selected, and 15 defective parts are found in the sample from machine 1,
while 10 defective parts are found in the sample from machine 2.
Construct a 95% confidence interval for the difference in proportions
between machine 1 and machine 2.
Solution.
To construct the confidence interval for the difference in
proportions, we can use the formula:
\(\text{Confidence interval} = (\hat{p}_1 -
\hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} +
\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
where:
- \(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions of
defective parts for machines 1 and 2, respectively,
- \(n_1\) and \(n_2\) are the sample sizes for machines 1
and 2, respectively, and
- \(z_{\alpha/2}\) is the critical value for a 95% confidence interval, which is 1.96.
Using the given information, we have:
\(\hat{p}_1 = \frac{15}{300} =
0.05\)
\(\hat{p}_2 = \frac{10}{300} =
0.0333\)
\(n_1 = n_2 = 300\)
Plugging in the values, we get:
\(\text{Confidence interval} = (0.05 -
0.0333) \pm 1.96 \sqrt{\frac{0.05(1-0.05)}{300} +
\frac{0.0333(1-0.0333)}{300}}\)
Simplifying, we get:
\(\text{Confidence interval} = 0.0167 \pm
0.0320\)
Therefore, the 95% confidence interval for the difference in
proportions between machine 1 and machine 2 is (0.0167 - 0.0320, 0.0167
+ 0.0320), or approximately (-0.0153, 0.0486). We can interpret this as
follows: we are 95% confident that the true difference in proportions of
defective parts between the two machines is between -0.0153 and 0.0486.
Since the interval contains zero, we cannot conclude that there is a
significant difference in the proportions of defective parts between the
two machines at a 95% confidence level.
R code:
x = c(15, 10)
n = c(300, 300)
prop.test(x, n, conf.level = 0.95, correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 1.0435, df = 1, p-value = 0.307
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01528380 0.04861713
## sample estimates:
## prop 1 prop 2
## 0.05000000 0.03333333
A 95% confidence interval for \(p_1-p_2\) is approximately \((-0.0153, 0.0486)\).
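The margin of error in the formula can be evaluated directly, which is a useful check on the arithmetic. A minimal sketch with the machine data follows; the names `se`, `margin`, and `ci` are illustrative.

```r
n1 <- 300; n2 <- 300
p1_hat <- 15 / n1   # sample proportion, machine 1
p2_hat <- 10 / n2   # sample proportion, machine 2

# Unpooled standard error used in the confidence interval formula
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

margin <- qnorm(0.975) * se              # z_{alpha/2} * standard error
ci <- (p1_hat - p2_hat) + c(-1, 1) * margin

round(c(margin = margin, lower = ci[1], upper = ci[2]), 4)
```

Note that the interval uses the unpooled standard error, whereas the test statistic \(Z_0\) uses the pooled proportion.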
Example.
An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005,
Vol. 13, pp. 273-279), considered arthroscopic meniscal repair with an
absorbable screw. Results showed that for tears greater than 25
millimeters, 14 of 19 repairs were successful while for shorter tears,
22 of 29 repairs were successful.
With \(\alpha=0.05\), is there
evidence that the success rate is greater for longer tears? What is
the \(p\)-value?
Solution.
To test whether the success rate is greater for longer tears, we can
use a one-sided hypothesis test:
\[H_0: p_1 = p_2 ~~vs ~~H_a: p_1 >
p_2\]
where \(p_1\) is the proportion of
successful repairs for tears > 25 mm and \(p_2\) is the proportion of successful
repairs for tears ≤ 25 mm.
The test statistic is:
\(z = \frac{(\hat{p}_1 -
\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\)
where \(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions of
successful repairs, \(n_1\) and \(n_2\) are the sample sizes, and \(\hat{p}\) is the pooled sample
proportion:
\(\hat{p} =
\frac{x_1+x_2}{n_1+n_2}\)
where \(x_1\) and \(x_2\) are the number of successful repairs
in each sample.
Plugging in the values from the problem, we get:
\(\hat{p}_1 = \frac{14}{19} \approx 0.737,
\hat{p}_2 = \frac{22}{29} \approx 0.759, n_1 = 19, n_2 = 29, \hat{p} =
\frac{14+22}{19+29} =0.75\)
The test statistic is:
\[z = \frac{0.737 -
0.759}{\sqrt{0.75(1-0.75)(\frac{1}{19}+\frac{1}{29})}} \approx
-0.17\]
Using a standard normal distribution table, the p-value for this test
is \(p \approx 0.567\).
Since the p-value is greater than \(\alpha=0.05\), we fail to reject the null
hypothesis. There is not enough evidence to conclude that the success
rate is greater for longer tears at the 5% significance level.
R code:
x = c(14, 22)
n = c(19, 29)
prop.test(x, n, alternative = "greater", correct = FALSE)
## Warning in prop.test(x, n, alternative = "greater", correct = FALSE):
## Chi-squared approximation may be incorrect
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 0.029038, df = 1, p-value = 0.5677
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.2331912 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.7368421 0.7586207
The \(p\)-value is 0.5677, so we
fail to reject the null hypothesis.
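The warning in the output above is a reminder that the large-sample approximation depends on the expected counts. A common rule of thumb asks for at least 5 in every cell, and the warning is typically raised when some expected count under \(H_0\) falls below that. The sketch below computes the expected counts and the one-sided \(p\)-value by hand; the variable names are illustrative.

```r
x <- c(14, 22)   # successful repairs: longer tears, shorter tears
n <- c(19, 29)   # number of repairs in each group

p_pool <- sum(x) / sum(n)   # pooled success proportion under H_0

# Expected successes and failures in each group if H_0 is true
expected <- cbind(success = n * p_pool, failure = n * (1 - p_pool))
expected
min(expected)    # smallest expected count

# z statistic and upper-tail p-value for H_a: p_1 > p_2
z <- (x[1] / n[1] - x[2] / n[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))
p_value <- pnorm(z, lower.tail = FALSE)

round(c(z = z, p_value = p_value), 4)
```

With samples this small, the \(p\)-value should be read as approximate.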
Simple Linear
Regression and Correlation
Simple Linear Regression and Correlation are two closely related
concepts used in statistics to examine the relationship between two
continuous variables. They help us understand how changes in one
variable are associated with changes in another and allow us to make
predictions and draw inferences about their relationship. Let’s explore
each concept:
Simple Linear Regression:
Simple Linear Regression is a statistical method that models the
relationship between two continuous variables by fitting a linear
equation to the data. The goal is to find the best-fitting line that
minimizes the distance between the observed data points and the
predicted values on the line. The linear equation for simple linear
regression is of the form: \[y = \beta_0 +
\beta_1 x\]
where:
- \(y\) is the dependent variable
(also called the response or outcome variable).
- \(x\) is the independent variable (also
called the predictor or explanatory variable).
- \(\beta_0\) and \(\beta_1\) are the regression coefficients,
representing the intercept and slope of the line, respectively.
The coefficients \(\beta_0\) and
\(\beta_1\) are estimated from the data
using methods such as the least squares method, which aims to minimize
the sum of squared differences between the observed and predicted
values.
Simple linear regression allows us to make predictions about the
value of the dependent variable (\(y\))
based on the value of the independent variable (\(x\)). It also provides insights into the
strength and direction of the relationship between the two
variables.
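As an illustration of the least squares method, the sketch below fits a line to a small data set. The data are hypothetical, made up only for this example, and the closed-form expressions \(\hat{\beta}_1=\sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2\) and \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\) are the standard least squares solutions, stated here as an addition to the text.

```r
# Hypothetical data, made up only to illustrate the calculation
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates from the closed-form formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)
coef(fit)

# Predicted response at a new value of x
predict(fit, newdata = data.frame(x = 6))
```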
Correlation:
Correlation is a statistical measure that quantifies the strength and
direction of the linear relationship between two continuous variables.
It assesses how closely the data points tend to cluster around a
straight line, indicating the degree of association between the
variables.
The most commonly used measure of correlation is the Pearson
correlation coefficient (\(r\)), which
ranges from -1 to +1:
\(r = +1\) indicates a perfect
positive linear relationship.
\(r = -1\) indicates a perfect
negative (inverse) linear relationship.
\(r = 0\) indicates no linear
relationship (variables are not linearly correlated).
The Pearson correlation coefficient is calculated using the
formula:
\[r = \frac{cov(x,y)}{s_x
s_y}\]
where \((x_i, y_i)\) are the
individual data points,
\[cov(x,y) = \frac{1}{n-1}\sum(x_i -
\bar{x})(y_i-\bar{y})\] is the covariance between \(x\) and \(y\), and \(s_x\) and \(s_y\) are the standard deviations of the \(x\) values and \(y\) values, respectively. Correlation
measures the strength of the linear relationship between two
quantitative variables but does not provide information about causation.
A high correlation coefficient does not imply causation; it merely
indicates a strong association between the variables.
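The formula for \(r\) can be checked against R's built-in functions. The data below are hypothetical, made up only for this illustration.

```r
# Hypothetical data, made up only to illustrate the calculation
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sample covariance with divisor n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

# Pearson correlation: covariance divided by the two standard deviations
r <- cov_xy / (sd(x) * sd(y))

c(by_hand = r, built_in = cor(x, y))
```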
In summary, simple linear regression and correlation are valuable
tools for understanding and quantifying the relationship between two
continuous variables. Simple linear regression allows us to model and
predict one variable based on another, while correlation measures the
strength and direction of the linear association between the variables.
Both concepts are widely used in various fields, including data
analysis, scientific research, and engineering.
Hypothesis Tests in Simple Linear Regression:
In simple linear regression, hypothesis tests are used to make
inferences about the regression coefficients and assess the significance
of the relationship between the dependent variable and the independent
variable. The two main hypotheses tested in simple linear regression are
related to the slope (\(\beta_1\)) of
the regression line. The hypothesis tests are based on the underlying
assumptions of the regression model, such as the normality of errors and
constant variance.
Let’s go through the key hypotheses and the corresponding hypothesis
tests in simple linear regression.
Null Hypothesis (\(H_0\)):
The null hypothesis in simple linear regression states that there is
no significant linear relationship between the independent variable
(\(x\)) and the dependent variable
(\(y\)). In mathematical terms, it is
expressed as:
\[H_0: \beta_1 = 0\]
This implies that the slope of the regression line is zero,
indicating no association between \(x\)
and \(y\).
Alternative Hypothesis (\(H_1\) or
\(H_a\)):
The alternative hypothesis in simple linear regression states that
there is a significant linear relationship between the independent
variable (\(x\)) and the dependent
variable (\(y\)). In mathematical
terms, it is expressed as: \[H_1: \beta_1 \ne
0\]
This implies that the slope of the regression line is not zero,
indicating a non-zero association between \(x\) and \(y\).
Hypothesis Tests:
The most common hypothesis test used in simple linear regression is
the t-test. The t-test assesses whether the estimated slope coefficient
(\(\hat{\beta}_1\)) is significantly
different from zero.
The t-statistic for testing the null hypothesis is calculated as:
\[t = \frac{\hat{\beta}_1 -
0}{se(\hat{\beta}_1)}\]
where:
\(\hat{\beta}_1\) is the estimated
slope coefficient obtained from the regression analysis and \(se(\hat{\beta}_1)\) is the standard error
of the slope coefficient, which estimates the variability of \(\hat{\beta}_1\).
The t-statistic has a t-distribution with \((n-2)\) degrees of freedom (\(n\) is the sample size). The \(p\)-value is twice the tail area under the
t-distribution curve beyond the absolute value of the t statistic. If
the \(p\)-value is no greater than the
significance level, we reject the null hypothesis in favor of the
alternative hypothesis. This indicates that there is a significant
linear relationship between \(x\) and
\(y\) at the chosen significance
level.
Additionally, the t-test can be used to calculate a confidence
interval for the slope coefficient. The confidence interval provides a
range of values within which the true population slope is likely to lie
with a certain level of confidence.
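The slope test and its confidence interval can be computed step by step and compared with the output of summary() and confint(). The data are hypothetical, and the standard error formula \(se(\hat{\beta}_1)=s/\sqrt{\sum(x_i-\bar{x})^2}\), with \(s^2\) the residual mean square, is the standard one, added here as an assumption since the text does not state it.

```r
# Hypothetical data, made up only to illustrate the calculation
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
n <- length(x)
b1 <- coef(fit)[["x"]]

# Residual standard error and standard error of the slope
s <- sqrt(sum(resid(fit)^2) / (n - 2))
se_b1 <- s / sqrt(sum((x - mean(x))^2))

# t statistic, two-sided p-value, and 95% confidence interval for the slope
t_stat <- (b1 - 0) / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
ci <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

# Compare with R's own summary table and interval
summary(fit)$coefficients
confint(fit, "x", level = 0.95)
```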
It is important to note that these hypothesis tests assume that the
residuals (errors) in the regression model are normally distributed and
have constant variance. Violations of these assumptions may impact the
validity of the tests, and alternative approaches may be needed.
In summary, hypothesis tests in simple linear regression help
determine whether the relationship between the dependent variable and
the independent variable is statistically significant. They are
essential in interpreting the results of the regression analysis and
making informed conclusions about the data.
The Adequacy of the Regression Model:
The model we introduced has assumptions (LINE), including linearity,
independence of errors, normality of residuals, and equal variances of
residuals (homoscedasticity). Violations of these assumptions may
indicate inadequacy of the model.
Assessing the adequacy of a regression model is a crucial step in
analyzing the model’s performance and making sure it provides meaningful
and reliable results. Adequacy checks involve evaluating how well the
model fits the data, identifying potential issues or violations of
assumptions, and determining the overall quality of the model’s
predictions. Several techniques can be used to assess the adequacy of a
regression model:
Residual Analysis:
Residuals are the differences between the observed values and the
predicted values from the regression model. Residual analysis involves
examining the pattern of residuals to check for any systematic
deviations from randomness. Ideally, residuals should be randomly
distributed around zero, indicating that the model captures the
underlying relationships in the data. Patterns in the residuals may
indicate problems with the model, such as non-linearity,
heteroscedasticity (varying spread of residuals), or outliers.
R-squared (Coefficient of Determination):
R-squared measures the proportion of the total variation in the
dependent variable that is explained by the regression model. It ranges
from 0 to 1, with higher values indicating a better fit of the model to
the data. However, high R-squared alone does not guarantee model
adequacy, as it can increase even with the addition of irrelevant
variables. Therefore, it is essential to interpret R-squared in
conjunction with other model assessment techniques.
Adjusted R-squared:
The adjusted R-squared takes into account the number of predictor
variables in the model, penalizing the addition of unnecessary
variables. It provides a more conservative measure of model fit and is
often preferred when comparing models with different numbers of
predictors.
F-Test (Overall Significance Test):
The F-test is used to assess the overall significance of the
regression model. It tests whether the explained variation in the
dependent variable (sum of squares due to regression) is significantly
larger than the unexplained variation (sum of squares due to residuals).
A significant F-test suggests that the model as a whole is useful in
explaining the variation in the dependent variable.
Outliers and Influential Points:
Identifying outliers and influential points is essential for
understanding how individual data points affect the model. Outliers are
observations whose response values lie far from the pattern of the rest
of the data (large residuals), while influential points are observations
whose removal would noticeably change the estimated coefficients. Robust
regression techniques (not covered in this course) may be used to
mitigate the effect of outliers, and sensitivity analysis can be
performed to assess the impact of influential points.
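The numerical measures above can be reproduced by hand from the
residuals. The following is a minimal sketch, assuming R's built-in
`cars` data set (stopping distance versus speed), which is used here only
for illustration and is not part of these notes; the object names are
hypothetical.

```r
# Illustration only: built-in 'cars' data (not from these notes)
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)   # number of observations
p <- 1            # number of predictors

res <- residuals(fit)                         # observed minus fitted values
sse <- sum(res^2)                             # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)   # total variation
ssr <- sst - sse                              # explained variation

r2     <- 1 - sse / sst                          # R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared
f_stat <- (ssr / p) / (sse / (n - p - 1))        # overall F statistic

# Compare with the values reported by summary()
s <- summary(fit)
c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat)
c(s$r.squared, s$adj.r.squared, s$fstatistic["value"])

# Residuals versus fitted values: look for curvature, funnels, or outliers
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The hand-computed values should agree with those from `summary()`, which
makes explicit that all three measures are built from the same sums of
squares.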
Example 1: Engineering - Load vs. Deformation
In this example, we’ll analyze the relationship between the load
applied to a material and the resulting deformation. We have data from
25 tests conducted on a specific material.
# Sample data
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
2.2, 2.1, 2.0, 1.9, 2.3)
df = data.frame(load, deformation)
# Perform simple linear regression
lm_model <- lm(deformation ~ load, data = df)
# Print the summary of the regression
summary(lm_model)
##
## Call:
## lm(formula = deformation ~ load, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13077 -0.05839 -0.01877 0.05360 0.20778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.20223 0.18761 -6.408 1.54e-06 ***
## load 0.19272 0.01239 15.560 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08878 on 23 degrees of freedom
## Multiple R-squared: 0.9132, Adjusted R-squared: 0.9095
## F-statistic: 242.1 on 1 and 23 DF, p-value: 1.058e-13
Insights:
The regression summary provides information about the estimated
intercept and slope coefficients. The “Estimate” for the slope
coefficient indicates how much deformation changes for a unit change in
load. Here, the “Estimate” (0.19272) for the slope coefficient indicates
deformation increases by about 0.19 for a unit increase in
load.
The “Residual standard error” (0.08878) estimates the standard
deviation of the errors; roughly, it is the typical distance between the
observed and predicted deformation values.
The “R-squared” value measures the proportion of variability in
the dependent variable explained by the independent variable. Here, the
R-squared value (0.9132) indicates that 91.32% of total variation in
deformation is explained (or accounted for) by load.
The p-value associated with the slope coefficient tests if the
relationship is statistically significant. Here, the p-value (1.06e-13,
essentially zero) associated with the slope coefficient indicates that
the linear relationship is statistically significant at any commonly
used significance level (for example, 0.01).
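The confidence interval for the slope described earlier can be obtained
with `confint()`. The sketch below refits the load and deformation model
from Example 1 and checks the result against the formula
estimate ± (t critical value) × (standard error). The 95% level and the
object names `fit`, `ci`, and `manual` are choices made here for
illustration.

```r
# Data from Example 1
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
          16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
          17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
                 1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
                 2.2, 2.1, 2.0, 1.9, 2.3)
example1 <- data.frame(load, deformation)
fit <- lm(deformation ~ load, data = example1)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci

# The same interval for the slope, computed by hand
est    <- coef(summary(fit))["load", "Estimate"]
se     <- coef(summary(fit))["load", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))   # 23 degrees of freedom here
manual <- c(est - t_crit * se, est + t_crit * se)
manual
```

If the interval for the slope excludes zero, the two-sided t-test at the
matching significance level (0.05 here) rejects the hypothesis of a zero
slope.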
Example 2: Engineering - Speed vs. Fuel Efficiency
Consider a scenario where we’re analyzing the relationship between the
speed of an engine and its fuel efficiency. We have data from 22 tests
conducted on different engine configurations.
# Sample data
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
33.6, 34.3)
df = data.frame(speed, fuel_efficiency)
df
## speed fuel_efficiency
## 1 1500 18.2
## 2 1800 19.5
## 3 2000 20.1
## 4 2200 21.2
## 5 2400 22.0
## 6 2500 23.1
## 7 2700 24.5
## 8 2800 25.0
## 9 3000 25.8
## 10 3200 26.4
## 11 3400 27.0
## 12 3500 27.5
## 13 3700 28.3
## 14 3800 28.7
## 15 4000 29.5
## 16 4200 30.2
## 17 4400 31.0
## 18 4600 31.5
## 19 4800 32.2
## 20 5000 32.9
## 21 5200 33.6
## 22 5400 34.3
Perform simple linear regression:
# Perform simple linear regression
lm_model <- lm(fuel_efficiency ~ speed, data = df)
# Print the summary of the regression
summary(lm_model)
##
## Call:
## lm(formula = fuel_efficiency ~ speed, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7848 -0.5353 0.1559 0.3661 0.7997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260e+01 3.666e-01 34.36 <2e-16 ***
## speed 4.144e-03 1.008e-04 41.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5315 on 20 degrees of freedom
## Multiple R-squared: 0.9883, Adjusted R-squared: 0.9877
## F-statistic: 1691 on 1 and 20 DF, p-value: < 2.2e-16
Insights:
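The “Estimate” (0.004144) for the slope coefficient indicates that
fuel_efficiency increases by about 0.004144 for a unit increase in
speed, or about 4.1 for a 1000-unit increase in speed.
The “Residual standard error” (0.5315) estimates the standard
deviation of the errors; roughly, it is the typical distance between the
observed and predicted fuel efficiency values.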
The R-squared value (0.9883) indicates that 98.83% of total
variation in fuel-efficiency is explained (or accounted for) by
speed.
The p-value (<2e-16, essentially zero) associated with the slope
coefficient indicates that the linear relationship is statistically
significant at any commonly used significance level (for example, 0.01).
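The residual standard error reported by `summary()` is the square root
of the residual sum of squares divided by the residual degrees of freedom
(n − 2 in simple linear regression). A short sketch using the Example 2
data, with object names chosen here for illustration:

```r
# Data from Example 2
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
           3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
           5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
                     27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
                     33.6, 34.3)
example2 <- data.frame(speed, fuel_efficiency)
fit <- lm(fuel_efficiency ~ speed, data = example2)

sse    <- sum(residuals(fit)^2)   # residual sum of squares
df_res <- nrow(example2) - 2      # n - 2 = 20 degrees of freedom
rse    <- sqrt(sse / df_res)      # residual standard error by hand
rse

sigma(fit)                        # the same quantity extracted from the fit
```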
Multiple Linear
Regression
Multiple Linear Regression is an extension of simple linear
regression that allows for the analysis of the relationship between a
dependent variable and multiple independent variables. It is a
statistical technique used to model the linear relationship between the
dependent variable and two or more predictor variables. Multiple linear
regression is widely used in various fields, including statistics,
economics, social sciences, and engineering, to analyze complex data and
make predictions based on multiple factors.
The multiple linear regression model can be expressed as follows:
\[y =
\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_rx_r+\epsilon\]
where:
\(y\) is the dependent variable
(response or outcome variable),
\(x_1, x_2, \cdots, x_r\) are
the independent variables (predictors or explanatory variables),
\(\beta_0, \beta_1, \cdots, \beta_r\) are the
regression coefficients, where \(\beta_0\) is the intercept and
\(\beta_1, \cdots, \beta_r\) are the slopes for the respective
independent variables,
and
\(\epsilon\) is the random error term, representing the difference
between the observed \(y\) and the value given by the linear part of the
model.
The multiple linear regression model estimates the values of the
regression coefficients based on the observed data using the method of
least squares. The goal is to minimize the sum of squared differences
between the observed and predicted values.
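Written out, the least squares criterion chooses the estimates
\(b_0, b_1, \cdots, b_r\) that minimize the sum of squared residuals over
the \(n\) observations. The notation \(b_j\), \(x_{ij}\), and
\(\hat{y}_i\) is introduced here for illustration only:
\[SSE(b_0, b_1, \cdots, b_r) =
\sum_{i=1}^{n}\left(y_i - b_0 - b_1x_{i1} - b_2x_{i2} - \cdots - b_rx_{ir}\right)^2\]
The predicted value for observation \(i\) is then
\(\hat{y}_i = b_0 + b_1x_{i1} + \cdots + b_rx_{ir}\), and the residual is
\(y_i - \hat{y}_i\).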
Key aspects of multiple linear regression:
Interpretation of Coefficients:
The regression coefficients represent the change in the dependent
variable (\(y\)) associated with a
one-unit change in each independent variable, assuming all other
variables remain constant. Positive coefficients indicate a positive
relationship with the dependent variable, while negative coefficients
indicate a negative relationship.
Adjusted R-squared:
Similar to simple linear regression, multiple linear regression uses
the R-squared statistic to measure the proportion of variance in the
dependent variable explained by the model. The adjusted R-squared takes
into account the number of predictors and provides a more accurate
measure of the model’s goodness of fit when comparing models with
different numbers of variables.
Model Assumptions:
Multiple linear regression relies on several assumptions (LINE),
including linearity, independence of errors, constant variance of
residuals (homoscedasticity), and normality of residuals. Violations of
these assumptions can impact the validity and accuracy of the regression
model.
Model Selection:
The process of selecting variables to include in the multiple linear
regression model is an essential part of the analysis. Variables that
are irrelevant, or that are highly correlated with other predictors, may
be excluded to avoid multicollinearity and improve the interpretability
of the model.
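One simple way to screen for multicollinearity is to look at the
correlation between predictors and at the variance inflation factor
(VIF), which equals 1 / (1 − R²), where R² comes from regressing one
predictor on the others. The sketch below uses the temperature and time
values from the heat-treatment example later in this section; the cutoff
mentioned in the comment is a common rule of thumb, not a rule from these
notes.

```r
# Predictor values from the heat-treatment example
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)

# Pairwise correlation between the two predictors
r <- cor(temperature, time)
r

# VIF for temperature: regress it on the other predictor(s)
r2_temp  <- summary(lm(temperature ~ time))$r.squared
vif_temp <- 1 / (1 - r2_temp)
vif_temp   # values well above roughly 5 to 10 are often taken as a warning sign
```

With only two predictors, the VIF is the same for both and depends only
on their correlation.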
Multiple linear regression is a powerful tool for analyzing the
relationship between a dependent variable and multiple predictors. It
enables researchers and analysts to explore complex data, identify
significant predictors, make predictions, and gain valuable insights
into the factors influencing the outcome of interest. However, careful
attention to the assumptions and model diagnostics is necessary to
ensure the validity and adequacy of the regression model.
Let’s consider a data example in engineering involving the
relationship between the tensile strength of a metal and two factors:
temperature and time of heat treatment. We’ll create a hypothetical
dataset to illustrate the concept of multiple linear regression in
engineering.
Suppose an engineer is studying the effect of temperature (in degrees
Celsius) and time (in hours) of heat treatment on the tensile strength
(in megapascals, MPa) of a metal sample. The engineer performs
experiments at different combinations of temperature and time and
records the tensile strength for each experiment. The data is as
follows:
Temperature | Time | Tensile.Strength
200 | 1 | 150
250 | 2 | 180
300 | 2 | 210
250 | 1 | 160
300 | 3 | 230
200 | 2 | 170
280 | 2 | 190
270 | 3 | 220
220 | 1 | 155
290 | 2 | 200
Using multiple linear regression, the engineer can build a model to
predict the tensile strength (y) based on temperature (x₁) and time (x₂)
as predictors. The multiple linear regression model will have the
form:
y = β₀ + β₁ * x₁ + β₂ * x₂ + ε
where:
- y is the tensile strength (dependent variable).
- x₁ is the temperature (independent variable 1).
- x₂ is the time (independent variable 2).
- β₀, β₁, and β₂ are the regression coefficients.
- ε is the error term.
The goal of multiple linear regression is to estimate the values of the
regression coefficients (β₀, β₁, β₂) based on the observed data to build
the best-fitting model.
To perform multiple linear regression in R, we can use the lm()
function, which stands for “linear model.” The lm() function fits a
linear regression model to the data and provides estimates for the
regression coefficients, as well as various statistics and diagnostics
to assess the model’s performance. Let’s enter the example data shown
above in R and perform multiple linear regression:
# Create the example data
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
# Combine the data into a data frame
data <- data.frame(temperature, time, tensile_strength)
# Perform multiple linear regression
model <- lm(tensile_strength ~ temperature + time, data = data)
# Print the summary of the regression model
summary(model)
##
## Call:
## lm(formula = tensile_strength ~ temperature + time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9653 -1.9750 0.8815 2.3290 6.5125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39499 11.55468 4.794 0.001980 **
## temperature 0.33044 0.05428 6.088 0.000497 ***
## time 24.47977 2.84277 8.611 5.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 7 degrees of freedom
## Multiple R-squared: 0.9754, Adjusted R-squared: 0.9684
## F-statistic: 138.7 on 2 and 7 DF, p-value: 2.337e-06
The output of summary(model) will display various statistics and
information about the multiple linear regression model. It will include
the estimated regression coefficients along with their standard errors,
t-values, and p-values (the column named “Pr(>|t|)”). The p-values
indicate the significance of each coefficient, with smaller p-values
suggesting that the corresponding predictor variable has a significant
impact on the dependent variable.
How can we interpret each of the coefficients and each
p-value?
- The coefficient 0.33044 means that for each one-degree (°C) increase
in temperature, the tensile strength increases by about 0.33 MPa on
average, holding time constant.
- The coefficient 24.47977 means that for each additional hour of heat
treatment, the tensile strength increases by about 24.48 MPa on average,
holding temperature constant.
- The small p-values (0.000497 and 5.68e-05) indicate that both
variables have a statistically significant effect on tensile strength.
How can we interpret the R-squared and adjusted R-squared
values?
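They measure the goodness of fit of the model. The larger the values,
the better the model fits the data. Here, the R-squared value (0.9754)
indicates that about 97.5% of the total variation in tensile strength is
explained by temperature and time together. The adjusted R-squared
(0.9684) is slightly smaller because it penalizes the number of
predictors in the model.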
What are model assumptions?
Before making inferences with a multiple linear regression model,
several key assumptions must be met to ensure the validity of the
results. These assumptions include:
Linearity: The relationship between each predictor (independent
variable) and the outcome (dependent variable) is linear. This means
that the effect of each predictor on the outcome is additive and
proportional.
Independence of Errors: The residuals (errors) are independent of
each other. This assumption is especially important in time-series data,
where consecutive observations may be correlated. Independence implies
that the error terms for one observation are not correlated with the
errors of other observations.
Normality of Errors: The residuals should be approximately
normally distributed, especially important when making confidence
intervals and hypothesis tests. Normality of errors is less critical for
estimating the coefficients themselves but is important for reliable
inference.
Equal Variances of Errors: The variance of the residuals should
be constant across all levels of the independent variables. This means
that the spread of residuals is roughly the same for all predicted
values of the outcome. If this assumption is violated, the coefficient
estimates become inefficient and the usual standard errors, tests, and
confidence intervals can be unreliable.
These are called LINE assumptions. The assumptions
can be checked as demonstrated below:
plot(model, 1, main = "Checking Linearity and Equal Variances\n (If satisfied, the residuals should show no systematic pattern)\n")

The graph shows that there is a slight issue with linearity and equal
variances.
plot(model, 2, main = "Checking Normality\n (If satisfied, the points should show a straight line pattern)\n")

Since points tend to be on a straight line, the normality assumption
is not an issue.
plot(model, 4, main = "Cook's Distance Showing How Influential Each Observation Is \n (Labeled observations are influential)\n")

The plot labels observations 1, 3, and 7, which have the largest Cook’s
distances and are therefore the most influential observations in this
data set.
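The Cook’s distances shown in the plot can also be inspected numerically
with `cooks.distance()`. This sketch refits the tensile strength model so
that it runs on its own; the cutoff 4/n is a common rule of thumb adopted
here as an assumption, not a rule stated in these notes.

```r
# Heat-treatment data from the example above
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
heat <- data.frame(temperature, time, tensile_strength)
fit <- lm(tensile_strength ~ temperature + time, data = heat)

cd <- cooks.distance(fit)          # one value per observation
sort(cd, decreasing = TRUE)[1:3]   # the three largest Cook's distances

# Rule-of-thumb screen (an assumption, not a formal test)
cutoff <- 4 / length(cd)
which(cd > cutoff)
```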
Using the model for Prediction:
The multiple linear regression model can be used to make predictions
for new data points using the predict() function:
# Predicting tensile strength for a new combination of temperature and time
new_data <- data.frame(temperature = 260, time = 2.5)
predicted_tensile_strength <- predict(model, newdata = new_data)
print(predicted_tensile_strength)
## 1
## 202.5096
This will provide the predicted tensile strength for the new
combination of temperature = 260°C and time = 2.5 hours based on the
multiple linear regression model.
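The point prediction is simply the fitted equation evaluated at the new
values, and `predict()` can also attach an interval to it. A minimal
sketch, refitting the same model so the block runs on its own; the 95%
level is the default and the object names are illustrative.

```r
# Heat-treatment data from the example above
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
heat <- data.frame(temperature, time, tensile_strength)
fit <- lm(tensile_strength ~ temperature + time, data = heat)

new_data <- data.frame(temperature = 260, time = 2.5)

# Point prediction by hand: b0 + b1 * 260 + b2 * 2.5
b <- coef(fit)
manual <- unname(b["(Intercept)"] + b["temperature"] * 260 + b["time"] * 2.5)
manual

# Interval for the mean tensile strength at these settings
ci_mean <- predict(fit, newdata = new_data, interval = "confidence")
# Interval for a single new specimen at these settings (always wider)
pi_new <- predict(fit, newdata = new_data, interval = "prediction")
ci_mean
pi_new
```

Both settings lie inside the ranges used in the experiment (200 to 300°C
and 1 to 3 hours); predictions far outside those ranges would be
extrapolation and should be treated with caution.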
Design and Analysis of
Single-Factor Experiments: The Analysis of Variance
Designing
Engineering Experiments
Designing engineering experiments is a crucial process that involves
planning, executing, and analyzing experiments to gather meaningful data
and make informed decisions. Proper experimental design ensures that the
collected data is reliable, relevant, and can lead to accurate
conclusions. Here are the key steps and considerations in designing
engineering experiments:
Define Objectives and Research Questions: Clearly
articulate the objectives of the experiment and the specific research
questions you want to address. The objectives will guide the entire
experimental design process and help determine the appropriate variables
to measure and control.
Identify Variables and Factors: Identify the key
variables that may influence the outcome of the experiment. Variables
can be classified into two types:
- Independent Variables (Factors): Variables that you intentionally
manipulate or control in the experiment.
- Dependent Variables: Variables that you measure to observe the
response or outcome.
Formulate Hypotheses: Based on your objectives and
variables, develop hypotheses that state the expected relationships or
differences between the factor and the dependent variable.
- Null Hypothesis (H₀): This hypothesis states that the means of the
dependent variable are the same across all levels of the factor.
- Alternative Hypothesis (H₁): This hypothesis states that the means
of the dependent variable are NOT the same across all levels of the
factor.
Choose Experimental Design: Select the appropriate
experimental design based on your research questions, resources, and
constraints. Common types of experimental designs include:
a. Completely Randomized Design: Randomly assign the levels of the
factor to experimental units. This is the design covered in this
chapter.
b. Randomized Block Design: Group similar experimental units into blocks
and randomize levels of the factor within each block.
c. Factorial Design: Investigate the effects of multiple factors
simultaneously.
Determine Sample Size: Calculate the required sample
size to achieve adequate statistical power and precision in your
results. A larger sample size generally provides more reliable
estimates.
Control Variables: Ensure that all extraneous
factors that could influence the outcome are controlled or minimized.
This may involve using control groups, blinding, or randomization.
Conduct the Experiment: Perform the experiment
according to the experimental design, carefully following the procedures
and recording the data accurately. Document any unexpected events or
observations.
Analyze Data: Use appropriate statistical methods to
analyze the data and test the hypotheses. This may involve regression
analysis, ANOVA, t-tests, or other relevant techniques.
Interpret Results: Interpret the results of the data
analysis in the context of your research questions and hypotheses. Draw
conclusions based on the evidence provided by the data.
Draw Engineering Inferences: Apply the findings of
the experiment to make engineering inferences and decisions. Determine
how the results impact the engineering problem or system you are
investigating.
Communicate Findings: Present the experimental
design, results, and conclusions in a clear and concise manner. Clearly
communicate any implications for future research or engineering
applications.
By following these steps and considerations, engineers can design
experiments that provide valuable insights, support decision-making, and
advance the understanding of engineering systems and processes.
Well-designed experiments are essential for making progress in
engineering research and development.
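For the sample size step, base R provides `power.anova.test()` for a
balanced one-way design. The numbers below (4 groups, between-group
variance 1, within-group variance 3, 5% significance level, 80% power)
are hypothetical planning values chosen only to illustrate the call; in
practice they would come from pilot data or subject-matter knowledge.

```r
# Hypothetical planning values (assumptions for illustration)
plan <- power.anova.test(groups = 4,
                         between.var = 1,   # variance of the group means
                         within.var = 3,    # error variance within a group
                         sig.level = 0.05,
                         power = 0.80)
plan

# Round up to a whole number of experimental units per group
n_per_group <- ceiling(plan$n)
n_per_group
```

Leaving `n` unspecified asks the function to solve for the sample size
per group; supplying `n` and omitting `power` would instead return the
power of a planned design.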
Completely
Randomized Single-Factor Experiment
A Completely Randomized Single-Factor Experiment is a type of
experimental design used to study the effect of a single independent
variable (also known as a factor) on a dependent variable. In this
design, the experimental units are randomly assigned to different
treatment levels of the factor, and the response of each unit to the
treatments is measured. The objective is to compare the mean responses
of the different treatment groups to determine if there are significant
differences between them.
Key features of a Completely Randomized Single-Factor Experiment:
One Independent Variable (Factor): The experiment
involves only one independent variable (factor) that has two or more
treatment levels. Each treatment level represents a specific condition
or value of the factor being tested.
Randomization: The assignment of experimental units
to different treatments is done randomly to ensure that any extraneous
or unknown factors are evenly distributed among the treatment groups.
This helps reduce bias and allows for valid statistical inference.
Control: The experiment is designed to control any
potential confounding variables or sources of variation that could
influence the results. By randomly assigning treatments, the experiment
aims to create similar groups with comparable characteristics.
Replication: Each treatment level is applied to
multiple experimental units (replicates) to account for natural
variability and provide more precise estimates of treatment effects.
Statistical Analysis: The data collected from the
experiment is analyzed using statistical methods, such as analysis of
variance (ANOVA), to test for significant differences between treatment
means.
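Randomization is easy to carry out in R with `sample()`. The sketch below
randomly assigns 20 hypothetical experimental units to 4 treatment levels
with 5 replicates each; the unit labels, treatment labels, seed, and
group sizes are assumptions made for illustration.

```r
set.seed(123)  # only so that the assignment can be reproduced

units <- 1:20                                        # hypothetical unit IDs
treatments <- rep(c("A", "B", "C", "D"), each = 5)   # 5 replicates per level

# Shuffle the treatment labels and pair them with the units
assignment <- data.frame(unit = units, treatment = sample(treatments))
assignment

table(assignment$treatment)   # each level appears 5 times
```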
Example 1 (Completely Randomized Single-Factor
Experiment):
Let’s consider an example where an engineer wants to investigate the
effect of different cooling times on the hardness of a
metal alloy. The engineer selects a sample of the metal alloy and
divides it into four groups:
- Group 1: Cooling time of 1 hour.
- Group 2: Cooling time of 2 hours.
- Group 3: Cooling time of 3 hours.
- Group 4: Cooling time of 4 hours.
Each group represents a treatment level of the factor “Cooling Time.”
The engineer randomly assigns several metal specimens to each group. The
hardness of each specimen is measured after the designated cooling
time.
The data collected can be analyzed using the analysis of variance
(ANOVA) method to test if there are significant differences in hardness
among the different cooling times. If ANOVA reveals a significant
effect, post-hoc tests can be performed to identify specific pairs of
cooling times that differ significantly in terms of hardness.
The results of the experiment will help the engineer understand how
cooling time affects the hardness of the metal alloy and make informed
decisions in industrial applications, such as selecting the optimal
cooling time to achieve the desired hardness properties.
Let’s create a data example for the Completely Randomized
Single-Factor Experiment related to cooling times and the hardness of a
metal alloy. In this example, we will investigate the effect of four
different cooling times on the hardness of the metal alloy.
Assume we have the following data for the hardness of the metal alloy
(measured in Vickers hardness units, HV) after cooling for different
durations:
Cooling Time (hours) | Hardness (HV)
1 | 300
2 | 350
3 | 380
4 | 400
2 | 340
3 | 370
1 | 290
4 | 410
3 | 375
2 | 335
1 | 295
3 | 385
4 | 395
2 | 345
1 | 305
4 | 420
3 | 380
2 | 330
4 | 415
1 | 310
In this example, the independent variable is the “Cooling Time” (in
hours), and the dependent variable is the “Hardness” of the metal alloy
after the specified cooling time.
Each row in the data represents one metal alloy specimen that
underwent a specific cooling time. The experiment involves four
different cooling times (1, 2, 3, and 4 hours), which serve as treatment
levels of the factor “Cooling Time.” The metal alloy specimens were
randomly assigned to each cooling time group to ensure a completely
randomized experiment.
To analyze the data, we can use one-way ANOVA to test if there are
significant differences in the mean hardness values among the different
cooling times. If ANOVA indicates a significant effect, we can conduct
post-hoc tests (e.g., Tukey’s HSD) to identify specific pairs of cooling
times that result in significantly different hardness values.
The results of the experiment will help us understand how cooling
time affects the hardness of the metal alloy. We can use this
information to optimize the cooling process to achieve the desired
hardness properties for specific engineering applications. For instance,
we might find that longer cooling times lead to higher hardness values,
which can be beneficial for applications requiring greater strength and
wear resistance.
We now carry out this analysis in R: a one-way analysis of variance
(ANOVA) to test for differences in the mean hardness values among the
cooling times, followed by a post-hoc test (Tukey’s HSD) to identify
which pairs of cooling times differ.
# Store the cooling times into an R object
cooling_times <- c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1)
# Cooling time is treated here as a categorical factor with four treatment
# levels rather than as a numeric predictor, so we convert the
# "cooling_times" variable to a factor
cooling_times <- as.factor(cooling_times)
# Store the hardness values into an R object
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335, 295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
# Form a data frame
myData <- data.frame(Cooling_Time = cooling_times,
Hardness_HV = hardness_values
)
# Perform one-way ANOVA
model <- aov(Hardness_HV ~ Cooling_Time, data = myData)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Cooling_Time 3 32895 10965 165.5 2.97e-12 ***
## Residuals 16 1060 66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The aov() function in R performs the one-way ANOVA. The output will
show the ANOVA table with the F-statistic and p-value, indicating
whether there are significant differences in the mean hardness values
among the cooling times.
Since the \(p\)-value is basically
zero, the data indicate there are significant differences in the mean
hardness values among the cooling times.
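The entries of the ANOVA table can be reproduced from the group means,
which shows what the F statistic measures: variation between the
cooling-time groups relative to variation within them. A sketch using the
same data, with object names chosen here for illustration:

```r
cooling_times <- factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2,
                          1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)

group_means <- tapply(hardness_values, cooling_times, mean)
group_sizes <- tapply(hardness_values, cooling_times, length)
grand_mean  <- mean(hardness_values)

# Between-group (treatment) sum of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares; ave() returns each
# observation's own group mean
ss_within <- sum((hardness_values - ave(hardness_values, cooling_times))^2)

df_between <- nlevels(cooling_times) - 1                         # 4 - 1 = 3
df_within  <- length(hardness_values) - nlevels(cooling_times)   # 20 - 4 = 16

f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(ss_between = ss_between, ss_within = ss_within,
  f_stat = f_stat, p_value = p_value)
```

The sums of squares match the “Sum Sq” column of the ANOVA table (32895
and 1060), and their ratio after dividing by the degrees of freedom gives
the reported F value.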
To see which cooling times differ from one another, we can conduct
a post-hoc Tukey’s HSD test.
# Conduct post-hoc Tukey's HSD test
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Hardness_HV ~ Cooling_Time, data = myData)
##
## $Cooling_Time
## diff lwr upr p adj
## 2-1 40 25.272 54.728 4.40e-06
## 3-1 78 63.272 92.728 0.00e+00
## 4-1 108 93.272 122.728 0.00e+00
## 3-2 38 23.272 52.728 8.50e-06
## 4-2 68 53.272 82.728 0.00e+00
## 4-3 30 15.272 44.728 1.37e-04
The TukeyHSD() function conducts the post-hoc Tukey’s HSD test to
compare all possible pairs of cooling times. The result shows which
cooling times result in significantly different hardness values.
Since all adjusted p-values are quite small (smaller than the
commonly used significance levels), the mean hardness differs
significantly between every pair of cooling times. Because every
estimated difference in the table is positive, mean hardness increases
with each additional hour of cooling over the range tested. This
information can guide the optimization of the cooling process to achieve
the desired hardness properties for engineering applications.
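For a balanced design, each Tukey interval is the difference in group
means plus or minus a common margin, q × sqrt(MSE / n), where q is a
quantile of the studentized range distribution and n is the number of
observations per group. The sketch below reproduces that margin for the
cooling-time data; object names are illustrative.

```r
cooling_times <- factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2,
                          1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
fit <- aov(hardness_values ~ cooling_times)

mse <- sum(residuals(fit)^2) / df.residual(fit)   # 1060 / 16
n_per_group <- 5
q_crit <- qtukey(0.95, nmeans = 4, df = df.residual(fit))
margin <- q_crit * sqrt(mse / n_per_group)
margin

# Margin implied by TukeyHSD(): upper limit minus estimated difference
tk <- TukeyHSD(fit)$cooling_times
tk[, "upr"] - tk[, "diff"]
```

All six intervals share the same margin because every group has the same
number of specimens.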
Just like regression models, all ANOVA models should be subject to a
residual analysis for model checking.
We first plot residuals versus fitted values to check constant
variance assumption:
plot(model, 1)

plot(model, 2)

In the first plot (residuals versus fitted values), the residuals are
spread evenly around zero along the range of fitted values (predicted
values), so the variance of the residuals appears constant across the
cooling-time groups.
In the second plot (normal Q-Q plot), the points fall close to a
straight line, so normality appears to be met.
We can even conduct a formal test for normality:
# Perform Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(model$residuals)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.95737, p-value = 0.4928
The large p-value (0.4928) gives no evidence against normality, so the
normality assumption appears reasonable.
Overall, when interpreting a residual plot, look for signs of
homoscedasticity, linearity, and normality. If the plot shows no clear
patterns, the model assumptions are likely met, and the model is a good
fit for the data. However, if you observe any systematic patterns or
deviations from assumptions, it may indicate that further model
adjustments are necessary or that the model may not be appropriate for
the data.
Remember that residual plots are visual aids, and it is crucial to
complement their interpretation with formal statistical tests and
diagnostic procedures to make sound conclusions about the regression
model’s validity and reliability.
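As one example of such a formal check, the equal-variance assumption in a
one-way layout can be examined with Bartlett’s test, available in base R
as `bartlett.test()`. This test is not discussed in these notes and is
sensitive to non-normality, so treat the sketch as an optional supplement
to the residual plots.

```r
myData <- data.frame(
  Cooling_Time = factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2,
                          1, 3, 4, 2, 1, 4, 3, 2, 4, 1)),
  Hardness_HV = c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                  295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
)

# Sample variance of hardness within each cooling-time group
group_vars <- tapply(myData$Hardness_HV, myData$Cooling_Time, var)
group_vars

# Null hypothesis: all groups share the same variance
bt <- bartlett.test(Hardness_HV ~ Cooling_Time, data = myData)
bt
```

A large p-value gives no evidence against equal variances; a small one
suggests the constant-variance assumption deserves a closer look.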
Quality Control
Basics
Quality control (QC) is a systematic process used to ensure that
products or services meet specified standards and adhere to established
guidelines. It involves monitoring, assessing, and managing the
production or delivery process to maintain consistent quality and
prevent defects. Here are some basics of quality control:
Objectives of
Quality Control
Ensure products meet customer expectations and
specifications.
Minimize defects, errors, and variations in production.
Optimize processes for efficiency and consistency.
Enhance customer satisfaction and loyalty.
Reduce waste and associated costs.
Key Concepts
Defect: Any deviation from the desired specifications or
standards.
Variation: Differences between actual measurements and ideal
values.
Process Control: Monitoring and adjusting processes to maintain
quality.
Statistical Process Control (SPC): Using statistical methods to
monitor and control processes.
Sampling: Evaluating a subset of items from a larger batch to
infer quality.
Quality Assurance (QA): Actions taken to ensure quality before
products are made.
Quality Control (QC): Activities performed to ensure quality
during production.
Six Sigma: A data-driven approach to minimize defects and improve
processes.
Continuous Improvement: The ongoing effort to enhance processes
and quality.
Quality Control
Steps
Plan: Define quality standards, methods, and resources.
Do: Implement quality control processes according to the
plan.
Check: Evaluate and monitor quality using various methods,
including inspections and tests.
Act: Take corrective actions to address deviations and improve
processes.
Quality Control
Techniques
Inspection: Visual or physical assessment of products.
Testing: Using various tests to assess product
attributes.
Statistical Analysis: Applying statistical methods to monitor and
control processes.
Control Charts: Graphical tools to monitor variations and
identify trends.
Root Cause Analysis: Identifying underlying causes of
defects.
Failure Mode and Effects Analysis (FMEA): Identifying potential
failure points and their impact.
Benefits of Quality
Control
Consistency in product quality and performance.
Reduced defects and waste.
Improved customer satisfaction and loyalty.
Enhanced brand reputation.
Efficient resource utilization.
Regulatory compliance.
Quality Control in
Different Industries
Manufacturing: Ensuring products meet specifications.
Healthcare: Ensuring patient safety and accurate
diagnoses.
Software Development: Identifying and fixing software
defects.
Construction: Ensuring buildings adhere to safety and quality
standards.
Quality control is essential for maintaining customer trust, ensuring
product reliability, and achieving operational excellence. It involves a
combination of methods, processes, and continuous improvement efforts to
deliver consistent and high-quality products or services.
Control Charts for
Proportions
A control chart for proportions (also known as a p-chart) is a
graphical tool used in quality control to monitor the stability of a
process that produces discrete outcomes or proportions. It’s commonly
used when dealing with attributes data, such as the proportion of
defective items in a sample.
For the theory behind p-charts, see https://sixsigmastudyguide.com/p-attribute-charts/
A p-chart is typically constructed in the following steps:
Step 1: calculate the proportion (\(\hat{p}\)) of defective items in each
sample and the overall proportion of defective items across all
samples (\(\bar{p}\)), that is, the total number of defectives divided by
the total number of items inspected.
Step 2: calculate the upper control limit (UCL) and lower control
limit (LCL) as
\[\bar{p}\pm z\cdot
\sqrt{\frac{\bar{p}\cdot (1-\bar{p})}{\bar{n}}}\]
where \(\bar{n}\) is the average sample size and \(z\) is typically 3 for a
3-\(\sigma\) control chart. Note: if the
LCL is negative, use 0 instead; if the UCL is larger than 1, use 1 instead.
If the individual sample sizes are used instead of \(\bar{n}\), the two limits
vary from sample to sample whenever the sample sizes differ, as
demonstrated in the reference https://sixsigmastudyguide.com/p-attribute-charts/.
Step 3: plot \(\hat{p}\) versus
sample serial number (1, 2, 3, … on the x-axis).
Step 4: add a center line (CL) corresponding to the overall
proportion \(\bar{p}\).
Step 5: add the UCL and LCL.
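The five steps above can be sketched in base R without any add-on
package. The data here are made up purely for illustration, and the
names (n, d, plot_p_chart, and so on) are assumptions of this sketch
rather than part of any library. The limits use the average sample size
\(\bar{n}\), so they are constant across samples.

```r
# Hypothetical data for illustration only
n <- c(100, 120, 110, 90, 100, 80)  # sample sizes
d <- c(4, 6, 3, 5, 2, 4)            # number of defectives in each sample

# Step 1: sample proportions and overall proportion
p_hat <- d / n
p_bar <- sum(d) / sum(n)

# Step 2: 3-sigma limits based on the average sample size,
# truncated so that they stay between 0 and 1
z <- 3
n_bar <- mean(n)
se <- sqrt(p_bar * (1 - p_bar) / n_bar)
ucl <- min(1, p_bar + z * se)
lcl <- max(0, p_bar - z * se)

# Steps 3-5 wrapped in a small helper (the name is chosen for this sketch)
plot_p_chart <- function(p_hat, p_bar, lcl, ucl) {
  # Step 3: plot the sample proportions against the sample number
  plot(seq_along(p_hat), p_hat, type = "b",
       ylim = range(c(p_hat, lcl, ucl)),
       xlab = "Sample number", ylab = "Proportion defective",
       main = "p-chart (hand-built sketch)")
  abline(h = p_bar)                 # Step 4: center line
  abline(h = c(lcl, ucl), lty = 2)  # Step 5: control limits
}

# Show the center line and the two limits
round(c(CL = p_bar, LCL = lcl, UCL = ucl), 4)
```

Calling plot_p_chart(p_hat, p_bar, lcl, ucl) in the console draws the
chart. With these made-up numbers the center line is 0.04 and the lower
limit would be negative, so it is replaced by 0.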
A video showing how you can create a p-chart in Excel: https://www.youtube.com/watch?v=mO7fcV4R_LY.
Here’s an example showing how you can create and interpret a p-chart
using R.
Example: Defective Products in a Manufacturing
Process
Let’s assume you are monitoring the proportion of defective products
in a manufacturing process. You collect data over time to track the
proportion of defects in each sample.
The sample sizes are: 50, 60, 55, 65, 70, 75, 60, 80
The respective numbers of defectives are: 2, 4, 1, 5, 3, 6, 2, 4
Here is the R code for constructing a p-chart:
# Sample data:
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 4)
# Calculate proportion in each sample
proportions <- defective_counts / sample_sizes
# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)
# Load required library with the library() function. Before loading the library,
# you need to install the library first by typing:
# install.packages("qcc")
# on the console of RStudio/Posit.
library(qcc)
# Create the p-chart
qcc_obj <- qcc(defective_counts, type = "p", sizes = sample_sizes,
               title = "P-Chart: Defective Products with Different Sample Sizes")

In this example, the p-chart shows the proportion of defective
products in each sample along with the control limits. The center line
represents the overall proportion of defects across all samples. The
control limits lie three standard errors above and below the center line
(truncated at 0 and 1) and mark the range of variation expected from a
stable process; because the sample sizes differ, the limits vary from
sample to sample.
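To make the calculation of the limits concrete, they can be reproduced
by hand for the data of this example. This is a sketch in base R under
the assumption that each sample's own size is used in the standard
error (so the limits differ from sample to sample); the names se_i and
limits_table are ours.

```r
# Data from the example
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 4)

# Sample proportions and center line
proportions <- defective_counts / sample_sizes
p_bar <- sum(defective_counts) / sum(sample_sizes)

# Standard error and 3-sigma limits for each sample,
# truncated so that they stay between 0 and 1
se_i <- sqrt(p_bar * (1 - p_bar) / sample_sizes)
lcl <- pmax(0, p_bar - 3 * se_i)
ucl <- pmin(1, p_bar + 3 * se_i)

# One row per sample: proportion, limits, and whether it is inside them
limits_table <- data.frame(
  sample = seq_along(sample_sizes),
  n = sample_sizes,
  p_hat = round(proportions, 4),
  LCL = round(lcl, 4),
  UCL = round(ucl, 4),
  in_control = proportions >= lcl & proportions <= ucl
)
print(limits_table)
```

Larger samples give a smaller standard error and therefore a tighter
upper limit. For these data every lower limit would be negative, so all
of them are set to 0.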
Interpretation:
Points within the control limits suggest that the process is stable
and variation is consistent. Points outside the control limits indicate
potential issues or changes in the process. Trends or patterns in the
chart, such as a long run of points on one side of the center line, can
provide insights into process behavior. Here, no point falls outside the
control limits and no obvious pattern can be observed, so the process
appears to be in control.
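One simple way to look for a pattern is to count how many consecutive
points fall on the same side of the center line. The sketch below does
this in base R for the data of this example. The threshold of 7 is only
one common rule of thumb and is an assumption here; other conventions
use different run lengths.

```r
# Data from the example
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 4)
proportions <- defective_counts / sample_sizes
p_bar <- sum(defective_counts) / sum(sample_sizes)

# +1 if a point is above the center line, -1 if below, 0 if exactly on it
side <- sign(proportions - p_bar)

# Lengths of consecutive runs on the same side of the center line
runs <- rle(side)
longest_run <- max(runs$lengths[runs$values != 0])

# Assumed rule of thumb: a run of 7 or more points suggests a shift
run_threshold <- 7
longest_run
longest_run >= run_threshold
```

For these eight samples the points alternate around the center line and
the longest run has length 2, well below the assumed threshold.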
Remember that control charts are most effective when used as part of
a comprehensive quality control system, and they help identify
deviations that warrant investigation and corrective action.
Example 2: Control chart with out-of-control points
# Sample data: sample sizes and number of defective products in each sample
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)
# Calculate proportions
proportions <- defective_counts / sample_sizes
# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)
# Load required library
library(qcc)
# Create the p-chart
qcc(defective_counts, type = "p", sizes = sample_sizes,
    title = "P-Chart: Defective Products with Different Sample Sizes")

## List of 11
## $ call : language qcc(data = defective_counts, type = "p", sizes = sample_sizes, title = "P-Chart: Defective Products with Differen| __truncated__
## $ type : chr "p"
## $ data.name : chr "defective_counts"
## $ data : num [1:8, 1] 2 4 1 5 3 6 2 12
## ..- attr(*, "dimnames")=List of 2
## $ statistics: Named num [1:8] 0.04 0.0667 0.0182 0.0769 0.0429 ...
## ..- attr(*, "names")= chr [1:8] "1" "2" "3" "4" ...
## $ sizes : num [1:8] 50 60 55 65 70 75 60 55
## $ center : num 0.0714
## $ std.dev : num 0.258
## $ nsigmas : num 3
## $ limits : num [1:8, 1:2] 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## $ violations:List of 2
## - attr(*, "class")= chr "qcc"
In a control chart, an out-of-control point indicates that the process
has shifted or varied by more than the normal expected variation. Here
the 8th sample has 12 defectives out of 55 items, a proportion of about
0.218, which lies above its upper control limit of about 0.176. This
could be due to various factors, such as equipment malfunction, changes
in the production process, operator error, or other special causes.
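The out-of-control sample can also be identified numerically. This
sketch recomputes the limits for Example 2 in base R, assuming 3-sigma
limits based on each sample's own size, and lists the samples that fall
outside them; the name out_of_control is ours.

```r
# Data from Example 2
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)

# Sample proportions and center line
proportions <- defective_counts / sample_sizes
p_bar <- sum(defective_counts) / sum(sample_sizes)

# 3-sigma limits for each sample, truncated to stay between 0 and 1
se_i <- sqrt(p_bar * (1 - p_bar) / sample_sizes)
lcl <- pmax(0, p_bar - 3 * se_i)
ucl <- pmin(1, p_bar + 3 * se_i)

# Serial numbers of the samples outside their control limits
out_of_control <- which(proportions < lcl | proportions > ucl)
out_of_control

# Proportion and upper limit for the flagged sample(s)
round(cbind(p_hat = proportions, UCL = ucl)[out_of_control, , drop = FALSE], 3)
```

Only the 8th sample is flagged. Such a point is a signal to investigate
the process around the time that sample was taken.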