1 An Overview

1.1 What Can Statistics Do for You?

Statistics is the art and science of collecting, analyzing, interpreting, and presenting data to understand patterns, make informed decisions, and uncover insights about various phenomena in the world.

“All methods of acquiring knowledge are essentially statistics.” (by C.R. Rao)

Statistics plays a crucial role in various aspects of our lives. In natural and life sciences, statistics is used to design experiments, analyze experimental results, and draw meaningful conclusions.

Statistics also finds extensive uses in other areas, such as Business and Economics, Social Sciences, Quality Control and Manufacturing, Environmental Studies, Sports and Entertainment, Public Policy and Government, Risk Management and Insurance, and Education.

1.2 Topics to Cover

Probability: Classical Probability, Random Variables and Their Distributions
Study design: Observational or experimental
Descriptive statistics: numerical or graphical
Inferential statistics: confidence intervals and tests of hypotheses
Regression and one-way analysis of variance
Quality control basics

1.3 Software for This Course

We will use R through the RStudio integrated development environment (IDE), developed by Posit (the company formerly named RStudio), to do statistical analysis and create project reports. There are two ways to work:

Visit https://posit.co/download/rstudio-desktop/ and install R and RStudio on your personal computer. Many SCSU computers already have both installed. There can be some issues with a local installation of R and RStudio.
Visit https://posit.cloud/ to use Posit Cloud. You do not need to install anything on your computer. Just register on this webpage. Once registered, log in and start a new project. You do not need to create a new project each time, since you can create many folders and files under the first project you created.

1.4 Use of AI

I use ChatGPT to prepare my notes.

“The use of AI in creating learning in this course is supported. Be clear about your use of AI. Ensure that your work is truly your work. Think critically about when AI should be used vs when it is an easy out. Ask questions to your peers and me. AI may continue to impact our society in ways that we cannot imagine and part of your experience here is to grow your understanding of what could be, not what was. Also, please refer to the Academic Integrity policies that apply to all courses within our institution, and ask me for clarification on how Academic Integrity and AI intersect in this course.”

Here are the SCSU guidelines on the use of AI: https://services.stcloudstate.edu/TDClient/1919/Portal/KB/?CategoryID=24236

1.5 Examples of Using ChatGPT

Explain what is variance of a sample. Give me a data example in biology using R. Once ChatGPT gives you a response, type “The code is too complex. Please simplify.” If you want a simple printout, type “no use of cat() function.” The resulting code should then be straightforward to follow.
Explain what is a histogram. Give me a data example in health science using R. Interpret the plot.
What is a confidence interval for a population proportion? Give me a data example in environmental science using R. Interpret the result.

ChatGPT is not perfect and it makes mistakes. Therefore, use ChatGPT with caution!

A quick quiz: why should I consider a career in data science?

A ChatGPT answer:

Considering a career in data science can be a rewarding decision for several reasons:

High Demand: Data science professionals are in high demand across industries as organizations recognize the value of data-driven decision-making.

Variety of Industries: Data science is applicable in diverse fields such as finance, healthcare, marketing, technology, and more, providing opportunities to work in areas of personal interest.

Well-Paying Jobs: Data scientists often enjoy competitive salaries due to their specialized skills and the increasing demand for data expertise.

Impactful Insights: Data analysis enables you to extract insights from complex data, helping businesses make informed decisions, optimize processes, and identify trends.

Innovation: Data scientists play a key role in developing innovative solutions using machine learning, artificial intelligence, and predictive analytics.

Continuous Learning: The field of data science is constantly evolving with new technologies and techniques, offering a dynamic learning environment.

Problem Solving: Data science involves solving real-world problems using data-driven approaches, making it intellectually stimulating and fulfilling.

Flexibility: Many data science roles offer remote work options and flexible schedules, allowing for a better work-life balance.

Career Growth: Data science roles often have clear career paths with opportunities to advance to senior or specialized positions.

Interdisciplinary Skills: Data science combines skills from programming, statistics, mathematics, and domain expertise, making it a well-rounded career option.

Global Relevance: Data science is a global field with opportunities to work with international organizations and collaborate on cross-border projects.

Community and Networking: Joining the data science community offers opportunities to connect with like-minded professionals, share knowledge, and collaborate on projects.

While data science offers numerous advantages, it’s important to note that it requires a strong foundation in programming, mathematics, and statistics. Pursuing relevant education, gaining practical experience, and staying updated with industry trends are crucial steps toward building a successful career in data science.

1.6 The Role of Statistics in Engineering

1.6.1 Primary Roles of Statistics in General

Data Collection: Statistics helps in designing surveys, experiments, and observational studies to collect relevant and reliable data.
Data Analysis: Statistics provides methods for analyzing and summarizing data.
Inference: Statistics helps determine whether observed differences or relationships have occurred by chance.
Modeling and Prediction: Statistics allows for the development of mathematical models to represent complex real-world phenomena. These models can be used for prediction.

1.6.2 Mechanistic and Empirical Models

Mechanistic models, also known as deterministic models, are built on a deep understanding of the underlying principles and mechanisms governing the system.

Empirical models, also known as statistical models or data-driven models, are developed based on observed data without a deep understanding of the underlying mechanisms.

Engineers often use a combination of these models to achieve a comprehensive understanding of complex systems.

2 Probability

This chapter will lay a theoretical foundation for statistics.

2.1 Random Experiments

A random experiment refers to a process or procedure that can result in multiple possible outcomes with uncertainty.

Tossing a Coin: When flipping a fair coin, the outcome (“Heads” or “Tails”) is uncertain and depends on many factors.

Testing the Strength of Materials: In a material strength test, the sample may exhibit different strengths due to inherent variations in its composition. The strength of each sample tested is subject to chance.

2.2 Sample Space and Events

Sample Space: The set of all possible outcomes of a random experiment is called the sample space.

An event is a subset of the sample space. It represents a particular outcome or a combination of outcomes. Events are often represented by capital letters such as A, B, C, …

In engineering, experiments can involve various factors and parameters that lead to different outcomes. For example, if you are testing the tensile strength of a material, the sample space is $[0, \infty)$. An event might be “the tensile strength of a material is greater than 5 pounds per square inch.”

2.3 Basic Operations on Events

Union of Events $A \cup B$: The union of two events $A$ and $B$, denoted as $A \cup B$, represents the event that either $A$ occurs, or $B$ occurs, or both occur.
Intersection of Events $A \cap B$ or $AB$: The intersection of two events $A$ and $B$, denoted as $A \cap B$ or $AB$, represents the event that both $A$ and $B$ occur simultaneously.
Complement of an Event $A'$: The complement of an event $A$, denoted as $A'$, $A^c$, or $\bar{A}$, represents the event that $A$ does not occur.

These basic operations can be extended to more than two events as well. A graphical method for operations on events (or sets), using Venn diagrams, is demonstrated at https://www.youtube.com/watch?v=YYM_Wju0-so.

2.4 Counting Techniques

Counting techniques in probability involve methods to determine the number of possible outcomes in a sample space or the number of ways events can occur. These techniques are essential for calculating some probabilities.

2.5 Permutations

Permutations are arrangements of a set of objects in a specific order. When dealing with a set of n distinct objects and selecting r of them in a specific order, the number of permutations is denoted as $_nP_r$ or $P(n, r)$ and calculated as $\frac{n!}{ (n - r)!}$. Permutations are used in various scenarios, such as arranging people in a line, arranging letters in a word, or selecting a specific order of events.

Example:

Suppose you have 5 different books, and you want to arrange 3 of them on a shelf. The number of ways to arrange these books is $_5P_3 = \frac{5!}{ (5 - 3)!} = \frac{5!}{ 2!} = 60$.
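The permutation count can be checked by listing every arrangement. The sketch below is in Python purely for illustration (the course itself uses R), and the book labels are hypothetical placeholders.

```python
from itertools import permutations
from math import factorial, perm

books = ["A", "B", "C", "D", "E"]  # hypothetical labels for the 5 books

# Count ordered arrangements of 3 books by listing all of them.
count_by_listing = len(list(permutations(books, 3)))

# Count with the formula n! / (n - r)!.
count_by_formula = factorial(5) // factorial(5 - 3)

print(count_by_listing, count_by_formula, perm(5, 3))  # 60 60 60
```

Listing becomes impractical for large $n$, which is why the formula is the usual tool.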

2.6 Combinations

Combinations are selections of a subset of objects from a larger set, where the order of selection does not matter. When dealing with a set of $n$ distinct objects and selecting $r$ of them without regard to order, the number of combinations is denoted as $_nC_r$ or $C(n, r)$ or $\binom{n}{r}$ and calculated as $\frac{n!} {r! (n - r)!}$. Combinations are used when the order of elements does not influence the outcome, such as selecting a team of players from a pool of candidates.

Example:

Suppose there are 8 candidates running for a committee, and you want to select 4 of them. The number of ways to form the committee is $\binom{8}{4} = \frac{8!} {4! (8 - 4)!} = 70$.
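As a quick check, the committees can be enumerated directly. This is a small Python sketch for illustration (the course itself uses R); the candidate IDs 1 to 8 are an assumption made only to have something to enumerate.

```python
from itertools import combinations
from math import comb, factorial

candidates = range(1, 9)  # hypothetical candidate IDs 1..8

# Each combination is an unordered group of 4 candidates.
committees = list(combinations(candidates, 4))

# Count with the formula n! / (r! (n - r)!).
count_by_formula = factorial(8) // (factorial(4) * factorial(8 - 4))

print(len(committees), count_by_formula, comb(8, 4))  # 70 70 70
```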

2.7 Multiplication Principle

The multiplication principle, also known as the fundamental counting principle, states that if there are $m$ ways to do one thing and $n$ ways to do another thing, then there are $m \cdot n$ ways to do both things together. This principle is often used when events are independent, meaning that the outcomes of one event do not affect the outcomes of others.

Example:

Suppose you have 3 different shirts and 2 different pairs of pants. The number of ways to choose a shirt and a pair of pants to wear is $3 \cdot 2 = 6$.
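The outfits can also be listed explicitly. The following Python sketch is only an illustration (the course uses R), and the item names are hypothetical.

```python
from itertools import product

shirts = ["shirt 1", "shirt 2", "shirt 3"]  # hypothetical names
pants = ["pants 1", "pants 2"]              # hypothetical names

# Every (shirt, pants) pair is one outfit.
outfits = list(product(shirts, pants))

print(len(outfits))  # 6
```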

2.8 Addition Principle

The addition principle states that if there are $m$ ways to do one thing and $n$ ways to do another thing, and the two cannot happen together (they are mutually exclusive), then there are $m + n$ ways to do one or the other. This principle underlies casework, which involves splitting a problem into several non-overlapping parts, counting each part individually, and then adding the counts.

Example:

Suppose we want to find the number of 4-digit numbers that are divisible by 5. Since every integer divisible by 5 must end in 0 or 5, we can use casework based on the last digit. Let’s consider two cases:

Case 1: 0 is at the end. In this case, we have the following arrangement: _ _ _ 0. For the remaining three digits, there are 9 ways to choose the first digit (note: 0 can’t be the leading digit), 10 ways to choose the second digit, and 10 ways to choose the third digit. By the multiplication principle, there are $9\cdot 10\cdot 10 = 900$ such numbers.

Case 2: 5 is at the end. In this case, we have the following arrangement: _ _ _ 5. By the same token, there are $9\cdot 10\cdot 10 = 900$ numbers in this case.

Now, we add up the counts from both cases: 900 + 900 = 1800. So, there are 1800 4-digit positive integers that are divisible by 5.
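A brute-force count is a useful sanity check for casework arguments like this one. The short Python sketch below (for illustration only; the course uses R) counts each case separately and then counts all 4-digit multiples of 5 directly.

```python
four_digit = range(1000, 10000)  # all 4-digit positive integers

# Case 1: numbers ending in 0. Case 2: numbers ending in 5.
ends_in_0 = [n for n in four_digit if n % 10 == 0]
ends_in_5 = [n for n in four_digit if n % 10 == 5]

# Direct count of all 4-digit numbers divisible by 5.
divisible_by_5 = [n for n in four_digit if n % 5 == 0]

print(len(ends_in_0), len(ends_in_5), len(divisible_by_5))  # 900 900 1800
```

The two cases do not overlap, so their counts add up to the direct count.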

2.9 Probability

Probability is a fundamental concept in engineering that plays a crucial role in decision-making, risk analysis, and designing robust systems. It is a branch of mathematics.

Key Concepts in Probability:

Sample Space: The set of all possible outcomes of an experiment is called the sample space. For example, if you are rolling a six-sided die, the sample space would be {1, 2, 3, 4, 5, 6}.

Event: An event is a subset of the sample space. It represents a particular outcome or a combination of outcomes.

Probability of an Event: The probability of an event represents the likelihood of that event occurring. It is a number between 0 and 1, where 0 indicates an impossible event, and 1 represents a certain event. The probability of an event $A$ is denoted by $P(A)$.

Basic Probability Rules: Probability follows certain rules. The sum of probabilities of all possible outcomes in the sample space is always 1. The probability of an event not occurring is 1 minus the probability of the event occurring. For mutually exclusive events (or disjoint events, events that do not happen simultaneously), the probability of either event occurring is the sum of their individual probabilities.

Random Variables: In engineering, we often deal with random variables, which are variables whose values are determined by chance. Random variables can be discrete (taking specific values) or continuous (taking any value within a range).

2.10 Conditional Probability

Conditional Probability: Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as $P(A|B)$, which is the probability of $A$ given $B$. To calculate $P(A|B)$ when $P(B) > 0$, we use the formula:

\[P(A|B)=\frac{P(A\cap B)}{P(B)}\]

Example: Throw a fair 6-sided die once. If the outcome is an odd number, what is the probability that the die landed on 1, 2, or 3?

Solution. Let $A$ denote the event that the outcome is odd. Let $B$ denote the event that the outcome is 1, 2, or 3. It’s easy to see that $A\cap B$ is the event that the outcome is 1 or 3. Furthermore, $P(A)=\frac{3}{6}$ and $P(A\cap B)=\frac{2}{6}$. We need to find $P(B|A)$:

\[P(B|A)=\frac{P(A\cap B)}{P(A)}=\frac{2/6}{3/6}=\frac{2}{3}\]

That is, given the outcome is odd, the probability that the die landed on 1, 2, or 3 is 2/3.
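For equally likely outcomes, the conditional probability formula reduces to counting. The Python sketch below (illustrative only; the course uses R) represents events as sets and assumes a fair die. The helper name `prob` is made up for this example.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 1}  # outcome is odd
B = {1, 2, 3}                                # outcome is 1, 2, or 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(B|A) = P(A and B) / P(A); set intersection plays the role of "and".
p_B_given_A = prob(A & B) / prob(A)

print(p_B_given_A)  # 2/3
```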

2.11 Total Probability Rules

The Total Probability Rule, also known as the Law of Total Probability, is a fundamental concept in probability theory that allows us to calculate the probability of an event by considering all possible ways or scenarios that lead to that event. It is particularly useful when the event of interest depends on different conditions or sub-events. The Total Probability Rule is expressed as follows:

Suppose we have a partition of the sample space, i.e., a set of mutually exclusive events $\{B_1, B_2, \ldots, B_n\}$ that covers the entire sample space $\Omega$. Then, for any event $A$, its probability can be calculated as:

\[P(A) = \sum_{i=1}^n[P(A|B_i) \cdot P(B_i)].\]

In simpler terms, the probability of event $A$ is the sum of the probabilities of event $A$ occurring given each condition $B_i$ multiplied by the probability of each condition $B_i$.

Example 1

Suppose a factory produces light bulbs, and there are two machines used to manufacture them: Machine A and Machine B. Machine A produces 60% of the bulbs, and Machine B produces the remaining 40%. The probability that a bulb is defective, given it was produced by Machine A, is 0.03, and the probability that a bulb is defective, given it was produced by Machine B, is 0.05. What is the probability that a randomly selected bulb is defective?

Solution:

Let $A$ be the event that a bulb is defective, and let $B_1$ be the event that the bulb is produced by Machine A, and $B_2$ be the event that the bulb is produced by Machine B. The partition $\{B_1, B_2\}$ covers the entire sample space (all bulbs).

Using the Total Probability Rule: \[P(A) = P(A|B_1) \cdot P(B_1) + P(A|B_2) \cdot P(B_2)\] \[P(A) = (0.03 \cdot 0.60) + (0.05 \cdot 0.40)\] \[P(A) = 0.018 + 0.020\] \[P(A) = 0.038\]

The probability that a randomly selected bulb is defective is 0.038 (or 3.8%).

Example 2

Suppose the weather conditions in a certain city can be categorized into three types: Sunny (S), Cloudy (C), and Rainy (R). Historical data shows that the probabilities of these conditions are $P(S) = 0.4, P(C) = 0.3$, and $P(R) = 0.3$. The probability of carrying an umbrella on a Sunny day is 0.1, on a Cloudy day is 0.3, and on a Rainy day is 0.8. What is the overall probability of carrying an umbrella in this city?

Solution:

Let $A$ be the event of carrying an umbrella, and let $B_1, B_2$, and $B_3$ represent the events of having a Sunny, Cloudy, and Rainy day, respectively. The partition $\{B_1, B_2, B_3\}$ covers the entire sample space (all possible weather conditions).

Using the Total Probability Rule:

\[P(A) = P(A|B_1) \cdot P(B_1) + P(A|B_2) \cdot P(B_2) + P(A|B_3) \cdot P(B_3)\]

\[P(A) = (0.1 \cdot 0.4) + (0.3 \cdot 0.3) + (0.8 \cdot 0.3)\] \[P(A) = 0.04 + 0.09 + 0.24\] \[P(A) = 0.37\]

The overall probability of carrying an umbrella in this city is 0.37 (or 37%).
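Both examples follow the same recipe: multiply each conditional probability by the probability of its condition, then add. A minimal Python sketch of that recipe is given below (for illustration; the course uses R). The function name `total_probability` is an assumption of this sketch, not a library function.

```python
def total_probability(priors, likelihoods):
    """Return the sum of P(A|B_i) * P(B_i) over a partition B_1, ..., B_n.

    priors:      the probabilities P(B_i), which must sum to 1
    likelihoods: the conditional probabilities P(A|B_i), in the same order
    """
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("the P(B_i) must sum to 1 for a partition")
    return sum(p * l for p, l in zip(priors, likelihoods))

# Example 1: Machine A (60%, 3% defective) and Machine B (40%, 5% defective).
p_defective = total_probability([0.60, 0.40], [0.03, 0.05])

# Example 2: Sunny, Cloudy, Rainy days and the chance of carrying an umbrella.
p_umbrella = total_probability([0.4, 0.3, 0.3], [0.1, 0.3, 0.8])

print(round(p_defective, 4), round(p_umbrella, 4))  # 0.038 0.37
```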

2.12 Bayes’ Theorem

Bayes’ Theorem, also known as Bayes’ Rule or Bayes’ Law, is a fundamental concept in probability theory and statistics. It provides a way to update the probability of an event based on new evidence or information. The theorem is named after the Reverend Thomas Bayes, an 18th-century mathematician and theologian, who first formulated the idea.

Bayes’ Theorem is stated as follows:

\[P(A|B) = \frac{P(B|A) \cdot P(A)} {P(B)}\]

where:

$P(A|B)$ is the conditional probability of event $A$ occurring given that event $B$ has occurred.
$P(B|A)$ is the conditional probability of event $B$ occurring given that event $A$ has occurred.
$P(A)$ is the probability of event $A$ occurring without considering event $B$.
$P(B)$ is the probability of event $B$ occurring without considering event $A$.

In simpler terms, Bayes’ Theorem allows us to update our prior belief about the probability of event $A$ (i.e., $P(A)$) based on new evidence, namely the occurrence of event $B$. The resulting probability, $P(A|B)$, is called the posterior probability.

Example 1.

Suppose we have a rare disease that affects 1 in 10,000 people (i.e., $P(A) = 0.0001$ with $A$ representing the event of having the disease). A medical test is conducted to diagnose the disease, and the test has a false-positive rate of 1% (i.e., $P(B|A') = 0.01$, where $A'$ represents not having the disease and $B$ represents the test being positive). The test also has a true-positive rate of 99% (i.e., $P(B|A) = 0.99$).

Now, we want to find the probability that a person has the disease given that the test result is positive, $P(A|B)$.

Using Bayes’ Theorem, with $P(B)$ expanded by the Total Probability Rule as $P(B) = P(B|A') \cdot P(A') + P(B|A) \cdot P(A)$:

\[P(A|B) = \frac{P(B|A) \cdot P(A)} {P(B)}\]
\[P(A|B) = \frac{0.99 \cdot 0.0001}{0.01 \cdot 0.9999 + 0.99 \cdot 0.0001}\]
\[P(A|B) = \frac{0.000099}{0.009999 + 0.000099}\]
\[P(A|B) \approx 0.0098039\]

The probability that a person has the disease given a positive test result is approximately 0.0098 (or 0.98%). Bayes’ Theorem allows us to incorporate the test’s true-positive and false-positive rates to arrive at a more accurate probability of having the disease after the test result.
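The calculation for a yes/no test can be wrapped in a small function. This is a Python sketch under the assumptions of the example (the course itself uses R); the function name `posterior` and its parameter names are chosen here for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(A|B) for a binary test, via Bayes' Theorem.

    prior:               P(A), the probability of having the condition
    true_positive_rate:  P(B|A), positive test given the condition
    false_positive_rate: P(B|A'), positive test given no condition
    """
    # Denominator P(B) from the Total Probability Rule.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

p_disease_given_positive = posterior(0.0001, 0.99, 0.01)
print(round(p_disease_given_positive, 4))  # 0.0098
```

Trying other values of `prior` shows how strongly the answer depends on how rare the condition is.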

Example 2. Assume:

1 in 100,000 passengers is actually a threat.
Security correctly detects threats 99% of the time.
1% of innocent passengers are falsely flagged.

Determine the probability that a passenger is actually a threat given that they triggered a security alert.

Solution.

We are given

$P(T)=0.00001$
$P(+|T)=0.99$
$P(+|T^c)=0.01$

By the Total Probability Formula, we get $P(+)=(0.99\times 0.00001)+(0.01\times(1-0.00001))=0.0100098$. By Bayes’ Formula, we get $P(T|+)=\frac{0.99\times 0.00001}{0.0100098}\approx 0.00099$.

Interpretation:

Even if a passenger is flagged by security, the probability that they are actually a threat is only about 0.099%, far less than 1%! This shows how false positives can overwhelm true positives when the actual number of threats is very low.
This illustrates why security personnel conduct secondary screenings: to further investigate flagged passengers and reduce false alarms.
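One way to see why false positives dominate is to count expected alerts in a large group. The Python sketch below (illustrative; the course uses R) assumes a hypothetical population of 10,000,000 passengers, chosen only so that every expected count is a whole number.

```python
passengers = 10_000_000  # hypothetical population size

threats = passengers // 100_000   # 1 in 100,000 is a threat -> 100
innocents = passengers - threats  # everyone else -> 9,999,900

true_alerts = threats * 99 // 100   # 99% of threats are detected -> 99
false_alerts = innocents // 100     # 1% of innocents are flagged -> 99,999

# Among all flagged passengers, the share that are real threats.
share_real = true_alerts / (true_alerts + false_alerts)

print(true_alerts, false_alerts, round(share_real, 5))  # 99 99999 0.00099
```

About a thousand innocent passengers are flagged for every real threat that is flagged, which matches the small posterior probability.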

Example 3.

A quality-control program at a plastic bottle production line involves inspecting finished bottles for flaws such as microscopic holes. The proportion of bottles that actually have such a flaw is only 0.0002. If a bottle has a flaw, the probability is 0.995 that it will fail the inspection. If a bottle does not have a flaw, the probability is 0.99 that it will pass the inspection.

(a) If a bottle fails inspection, what is the probability that it has a flaw?
(b) Which of the following is the more correct interpretation of the answer to part (a)?

i. Most bottles that fail inspection do not have a flaw.

ii. Most bottles that pass inspection do have a flaw.

(c) If a bottle passes inspection, what is the probability that it does not have a flaw?
(d) Which of the following is the more correct interpretation of the answer to part (c)?

i. Most bottles that fail inspection do have a flaw.

ii. Most bottles that pass inspection do not have a flaw.

Solution.

Denote $A$ = “flaw” and $B$ = “fail”. We are given that $P(A)=0.0002$, $P(B|A)=0.995$, and $P(B^c|A^c)=0.99$. By the Total Probability Formula, we have:

\[P(B)=P(B|A)\cdot P(A) + P(B|A^c)\cdot P(A^c)=(0.995)(0.0002)+(1-0.99)(1-0.0002)= 0.010197.\]

(a) $P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B|A)\cdot P(A)}{P(B)}=\frac{(0.995)(0.0002)}{0.010197}\approx 0.01952$
(b) Most bottles that fail inspection do not have a flaw.
(c) $P(A^c|B^c)=\frac{P(A^c\cap B^c)}{P(B^c)}=\frac{P(B^c|A^c)\cdot P(A^c)}{P(B^c)}=\frac{(0.99)(1-0.0002)}{1-0.010197}\approx 0.999999$
(d) Most bottles that pass inspection do not have a flaw.
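Because the numbers in this example are very small, it is easy to slip a digit. The following Python sketch (for illustration only; the course uses R) recomputes the three quantities from the given inputs. The variable names are made up for this sketch.

```python
p_flaw = 0.0002             # P(A): bottle has a flaw
p_fail_given_flaw = 0.995   # P(B|A): flawed bottle fails inspection
p_pass_given_ok = 0.99      # P(B^c|A^c): good bottle passes inspection

# Total Probability Formula for P(B), the probability of failing inspection.
p_fail = p_fail_given_flaw * p_flaw + (1 - p_pass_given_ok) * (1 - p_flaw)

# Bayes' Rule in both directions.
p_flaw_given_fail = p_fail_given_flaw * p_flaw / p_fail
p_ok_given_pass = p_pass_given_ok * (1 - p_flaw) / (1 - p_fail)

print(round(p_fail, 6), round(p_flaw_given_fail, 5), round(p_ok_given_pass, 6))
```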

Bayes’ Rule can be extended to more general situations. Let a sample space $S$ (as a universal set) be decomposed into $k$ disjoint subsets (events) denoted $A_1, A_2, \cdots, A_k$. The collection $\{A_1, A_2, \cdots, A_k\}$ is called a decomposition of the sample space.

Let $B$ be an event. The Law of Total Probability states that

\[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)\]

How can we calculate the conditional probabilities $P(A_1|B), P(A_2|B), \cdots, \text{and}~P(A_k|B)$?

By the definition of conditional probability, we have $P(A_1|B)=\frac{P(A_1\cap B)}{P(B)}$. Then, applying the conditional probability formula again to the numerator and applying the Law of Total Probability to the denominator yield

\[P(A_1|B)=\frac{P(A_1)P(B|A_1)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_k)P(B|A_k)}\]

Note that the numerator is one of the terms in the denominator!

Example 4. $20\%$ of a school’s computers are manufactured by company A, $30\%$ by company B, and remaining $50\%$ by company C. Suppose that $2\%$ of A computers are defective, $3\%$ of B computers are defective, and $2.5\%$ of C computers are defective. A computer is randomly selected from the school.

If the chosen computer is defective, what is the probability that the computer is from company A?
If the chosen computer is defective, what is the probability that the computer is from company B?
If the chosen computer is defective, what is the probability that the computer is from company C?

Solution.

Each of the school’s computers has a unique ID, so all computers are distinct. Let $S$ be the set of all computers in the school. $S$ is a sample space. Let $A_1$ be the event that a randomly selected computer is from company A. Let $A_2$ be the event that a randomly selected computer is from company B. Let $A_3$ be the event that a randomly selected computer is from company C. We already know that $P(A_1) = 0.20, P(A_2)=0.30, P(A_3)=0.50$, and the collection $\{A_1, A_2, A_3\}$ is a decomposition of the sample space. Let $B$ be the event that a randomly selected computer is defective. We also know that $P(B|A_1)=0.02, P(B|A_2)=0.03, P(B|A_3)=0.025$. By the Law of Total Probability, we have

\[P(B)=P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + P(A_3)P(B|A_3)=(0.20)(0.02)+(0.30)(0.03)+(0.50)(0.025)=0.004+0.009+0.0125=0.0255.\]

$P(A_1|B)=\frac{0.004}{0.0255}=0.1569$
$P(A_2|B)=\frac{0.009}{0.0255}=0.3529$
$P(A_3|B)=\frac{0.0125}{0.0255}=0.4902$
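The general form of Bayes’ Rule can be sketched as a function that returns all the posterior probabilities at once. The Python code below is an illustration (the course uses R), and the function name `posteriors` is an assumption of this sketch.

```python
def posteriors(priors, likelihoods):
    """Return [P(A_1|B), ..., P(A_k|B)] for a decomposition A_1, ..., A_k.

    priors:      the probabilities P(A_i)
    likelihoods: the conditional probabilities P(B|A_i), in the same order
    """
    # Each numerator P(A_i) * P(B|A_i) is one term of the denominator P(B).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Companies A, B, C with their shares and defect rates.
result = posteriors([0.20, 0.30, 0.50], [0.02, 0.03, 0.025])
print([round(r, 4) for r in result])  # [0.1569, 0.3529, 0.4902]
```

The posterior probabilities always add up to 1, because together the events $A_1, \cdots, A_k$ cover every way that $B$ can occur.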

DIY: Construct a probability tree for the problem to find the total probability ($P(B)$) as the denominator of the Bayes’ Rule following the videos: https://www.youtube.com/watch?v=ql2qLe4UYK0 (Right click and open in new window) and https://www.youtube.com/watch?v=dRdCUUgrwVw and https://www.youtube.com/watch?v=XvaS2GO6MGk as well.

2.13 Independence

In probability theory, two events $A$ and $B$ are considered independent if the occurrence of one event does not affect the probability of the other event occurring. In other words, the outcome of one event provides no information or influence on the outcome of the other event. Mathematically, events $A$ and $B$ are independent if and only if $P(A \cap B) = P(A) \cdot P(B)$, where $P(A \cap B)$ represents the probability of both events $A$ and $B$ happening together, $P(A)$ is the probability of event $A$ occurring, and $P(B)$ is the probability of event $B$ occurring.

If two events $A$ and $B$ are independent (and both have positive probability), then $P(A|B) = P(A)$ and $P(B|A) = P(B)$.

Example:

Rolling a fair six-sided die twice, the outcomes of each roll are independent events. The probability of rolling a 3 on the first roll is 1/6, and the probability of rolling a 3 on the second roll is also 1/6. The probability of rolling a 3 on both rolls (both events occurring together) is 1/6 * 1/6 = 1/36.
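The product rule for this example can be verified by listing all 36 equally likely pairs of rolls. This is a Python sketch for illustration (the course uses R); the helper `prob` is a name made up here.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first roll, second roll) of a fair die.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a True/False test on a pair of rolls."""
    favorable = [r for r in rolls if event(r)]
    return Fraction(len(favorable), len(rolls))

p_first = prob(lambda r: r[0] == 3)                 # 3 on the first roll
p_second = prob(lambda r: r[1] == 3)                # 3 on the second roll
p_both = prob(lambda r: r[0] == 3 and r[1] == 3)    # 3 on both rolls

print(p_first, p_second, p_both, p_both == p_first * p_second)  # 1/6 1/6 1/36 True
```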

2.14 Exercise

Suppose cyberattacks occur 1% of the time. An intrusion detection system (IDS) detects 95% of real attacks and falsely flags 5% of normal activity as an attack. If the IDS triggers an alert, the probability that it is a real attack is ____.

3 Discrete Random Variables and Probability Distributions

In probability and statistics, a random variable is a variable that can take on different values, each with a certain probability, due to underlying random processes or uncertainty. Random variables are a fundamental concept in probability theory and play a crucial role in modeling and analyzing uncertain events and probabilistic phenomena. They serve as a bridge between the theoretical mathematics of probability and real-world applications in various fields, including engineering, economics, physics, and social sciences.

Random variables can be categorized into two main types: discrete random variables and continuous random variables. We focus on discrete random variables in this chapter.

A discrete random variable is a random variable whose set of possible values is countable. A random variable can be denoted by an upper-case letter such as $X, Y$, and $Z$.

Examples:

A lab has 10 computers. Let $X$ denote the number of computers that fail to work. Then $X$ is a random variable.
The engineers in a large company can be mechanical engineers, electrical engineers, or other. Randomly choose an engineer from this company. Let $T$ denote the type of the selected engineer. $T$ is a discrete random variable, taking values in the set {me, ee, ot}.
The lifetime of a randomly chosen computer is denoted by $Y$. $Y$ is NOT a discrete random variable. It is called a continuous random variable.

3.1 Probability Mass Functions

If $X$ is a discrete random variable, then we

denote the probability that $X=x$ by $p(x)$ and
call $p(x)$ the probability mass function (PMF) of $X$.

The PMF fully describes the distribution of the random variable $X$.

The set of values $x$ for which $p(x) > 0$ is called the support of the distribution (or of the random variable).
The sum of all probabilities in the PMF is equal to one.

Example 1:

Flip a fair coin twice. Let $X$ be the number of heads. Then $X$ can take values 0, 1, 2 with probabilities 0.25, 0.5, 0.25. The sum of the probabilities equals one.

Example 2:

Randomly choose a value from the set $\{2, 3, 5, 5, 6, 2, 5, 8\}$, with each of the 8 entries equally likely. Denote the chosen value by $X$. Then $X$ is a discrete random variable that can take the values 2, 3, 5, 6, and 8 with probabilities 2/8, 1/8, 3/8, 1/8, and 1/8. The sum of the probabilities equals one.
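A PMF like the one in Example 2 can be tabulated by counting how often each value appears. The Python sketch below is only an illustration (the course uses R). Note that `Fraction` reduces fractions automatically, so 2/8 is displayed as 1/4.

```python
from collections import Counter
from fractions import Fraction

values = [2, 3, 5, 5, 6, 2, 5, 8]

# p(x) = (number of times x appears) / (number of entries).
counts = Counter(values)
pmf = {x: Fraction(c, len(values)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)

print(sum(pmf.values()))  # 1
```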

3.2 Cumulative Distribution Functions

A random variable can be categorical or numerical (discrete or continuous). The PMF can be used to describe its distribution. If $X$ is a numeric random variable, we can also equivalently use the cumulative distribution function (CDF), usually denoted by $F(x)$, to describe its distribution. This CDF is defined as the probability that $X$ is no greater than $x$; that is, $F(x)=P(X\le x)$.

An example:

Suppose $X$ is a random variable taking values $-2$, 0, 3, and 5 with probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the CDF $F(x)$ can be determined as follows:

\[\text{when} ~x< -2, F(x) = P(X\le x) =P(\emptyset)= 0\]
\[\text{when} ~-2\le x< 0, F(x) = P(X\le x) =P(X=-2)= 0.3\]
\[\text{when} ~0\le x< 3, F(x) = P(X\le x) =P(X=-2 ~\text{or} ~0)= 0.3+0.1=0.4\]
\[\text{when} ~3\le x< 5, F(x) = P(X\le x) =P(X=-2, ~0, ~\text{or} ~3)= 0.3+0.1+0.2=0.6\]
\[\text{when} ~x\ge 5, F(x) = P(X\le x) =P(X=-2, ~0, ~3, \text{or} ~5)= 0.3+0.1+0.2+0.4=1\]

The above can be written as a piece-wise (right-continuous) function:

\[F(x)=\begin{cases} 0, & x<-2 \\ 0.3, & -2\le x< 0 \\ 0.4, & 0\le x< 3\\ 0.6, & 3\le x< 5\\ 1, & x\ge5 \end{cases} \]

This piece-wise function has a graph that is step-wise and right-continuous. This observation is in general true for all discrete random variables.
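The step behavior can be seen by evaluating the CDF at a few points, including points between the possible values. Below is a minimal Python sketch (illustrative; the course uses R) that builds $F(x)$ directly from the PMF of the example. The function name `cdf` is chosen here for illustration.

```python
pmf = {-2: 0.3, 0: 0.1, 3: 0.2, 5: 0.4}

def cdf(x):
    """F(x) = P(X <= x): add up p(value) over all values not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

# Evaluate at jump points and at points in between.
for x in [-3, -2, -1, 0, 2.5, 3, 5, 10]:
    print(x, round(cdf(x), 10))
```

The value of `cdf` changes only at the possible values $-2$, 0, 3, and 5, and at each of those points it already includes the jump, which is what right-continuity means.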

3.3 Mean and Variance of a Discrete Random Variable

The mean of a discrete random variable (or a discrete distribution) is defined to be the sum of products of values and probabilities; that is, $\mu = \sum_x x\, p(x)$. In other words, the mean is the weighted average of the values with weights being the corresponding probabilities. Use the Greek letter $\mu$ to denote the mean. The mean describes the value that the random variable takes on average.

Example 1:

Suppose $X$ is a random variable taking values $-2$, 0, 3, and 5 with probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the mean of $X$ can be determined as follows:

The mean is

\[\mu_X = (-2)(0.3)+(0)(0.1)+(3)(0.2)+(5)(0.4)=-0.6+0+0.6+2=2\]

Each possible value of the random variable is a certain distance away from the mean. These distances are called deviations. The variance of a discrete random variable (or a discrete distribution) is the weighted average of the squared deviations with weights being the corresponding probabilities. That is,

\[\sigma^2 = \sum_x (x-\mu)^2 p(x)\]

where the sum runs over all possible values $x$ of the random variable and $p(x)$ is the PMF.

Use the Greek letter $\sigma^2$ to denote the variance. The square root of the variance is called the standard deviation, which describes, on average, how far each possible value is from the mean. Both variance and standard deviation describe the variation of the random variable.

The variance is

\[\sigma^2_X = (-2-2)^2(0.3)+(0-2)^2(0.1)+(3-2)^2(0.2)+(5-2)^2(0.4)=4.8+0.4+0.2+3.6=9\]

When calculating the variance of a discrete random variable, you can use the following formula instead:

\[\sigma^2 = \sum_x x^2 p(x) - \mu^2\]

The above variance can be calculated as follows:

\[\sigma^2_X = (-2)^2(0.3)+(0)^2(0.1)+(3)^2(0.2)+(5)^2(0.4)-2^2=1.2+0+1.8+10-4=9\]

The standard deviation is

\[\sigma_X=\sqrt{9}=3.\]

Example 2:

Randomly choose a value from the set $\{2, 3, 5, 5, 6, 2, 5, 8\}$. Denote the chosen value by $X$. Then $X$ is a discrete random variable. The random variable $X$ can take values 2, 3, 5, 6, and 8 with probabilities 2/8, 1/8, 3/8, 1/8, and 1/8.

The mean is $\mu=(2)(2/8)+(3)(1/8)+(5)(3/8)+(6)(1/8)+(8)(1/8)=4.5$, which is equal to the mean of the given set (called a population).
The variance is $\sigma^2=(2^2)(2/8)+(3^2)(1/8)+(5^2)(3/8)+(6^2)(1/8)+(8^2)(1/8)-4.5^2=3.75$ and the standard deviation is approximately 1.9365.
The above mean, variance, and standard deviation are also said to be those of the population.
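The calculations in Examples 1 and 2 can be reproduced with a few lines of code. The Python sketch below (for illustration; the course uses R) computes the variance both from the definition and from the shortcut formula, using exact fractions so the two results can be compared directly. The function names are made up for this sketch.

```python
from fractions import Fraction
from math import sqrt

def mean(pmf):
    """Weighted average of the values, with the probabilities as weights."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Definition: weighted average of the squared deviations from the mean."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def variance_shortcut(pmf):
    """Shortcut: weighted average of the squared values, minus the squared mean."""
    return sum(x ** 2 * p for x, p in pmf.items()) - mean(pmf) ** 2

example_1 = {-2: Fraction(3, 10), 0: Fraction(1, 10),
             3: Fraction(2, 10), 5: Fraction(4, 10)}
example_2 = {2: Fraction(2, 8), 3: Fraction(1, 8), 5: Fraction(3, 8),
             6: Fraction(1, 8), 8: Fraction(1, 8)}

for pmf in (example_1, example_2):
    var = variance(pmf)
    same = var == variance_shortcut(pmf)  # both formulas should agree
    print(float(mean(pmf)), float(var), round(sqrt(var), 4), same)
# 2.0 9.0 3.0 True
# 4.5 3.75 1.9365 True
```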

A useful result: if $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, then $Y=cX$ is a new random variable, where $c$ is a constant. Furthermore,

the mean of $Y$ is $c\mu$,
the variance of $Y$ is $c^2\sigma^2$, and
the standard deviation of $Y$ is $|c|\sigma$ (a standard deviation is never negative).

Another useful result: if $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, then $Y=X+c$ is a new random variable, where $c$ is a constant. Furthermore,

the mean of $Y$ is $\mu+c$,
the variance of $Y$ is $\sigma^2$ as well, and
the standard deviation of $Y$ is $\sigma$ as well.

3.4 Discrete Uniform Distribution

If the values of a discrete random variable are equally likely, the distribution is a discrete uniform distribution.

An example:

Throw a 6-sided fair die once. Let $X$ denote the outcome. Then,

$X$ takes each of the values 1, 2, 3, 4, 5, and 6, with probabilities all being 1/6. Thus,
$X$ has a discrete uniform distribution.

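For a discrete uniform distribution, the mean is just the average of all possible values of the random variable. Keep in mind that this is false in general: for other distributions the mean is the probability-weighted average, which usually differs from the simple average of the possible values.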

3.5 Binomial Distribution

Consider two situations:

Let’s say we have a coin (may not be fair). The coin lands on heads with probability $p$ (maybe 0.5 or not). Flip it $n$ times. Let $X$ denote the number of times it lands on heads. Then, $X$ is a discrete random variable, since it can only take the values $0, 1, 2, \cdots, n$. What is the probability mass function (pmf) of $X$?
There is a sea of products. The proportion of defective products is $p$. Randomly choose $n$ products from this sea of products. Let $Y$ denote the number of defective products. Then, $Y$ is a discrete random variable, since it can only take the values $0, 1, 2, \cdots, n$. What is the probability mass function (pmf) of $Y$?

It turns out that the two random variables have the same distribution (choosing a defective product is like flipping a head, with the same probability $p$), with the pmf given by

\[p(x)=P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}, ~~~x = 0, 1, 2, \cdots, n\] where $\binom{n}{x}=\frac{n!}{x!\cdot(n-x)!}$ and $n!$ is the product of the first $n$ positive integers. For example, $5! = 5\cdot 4\cdot3\cdot2\cdot1=120$, $4!=24$, $3!=6$, $2!=2$, $1!=1$, and $0!=1$, and $\binom{5}{2}=\frac{5!}{2!\cdot(5-2)!}=\frac{120}{2\cdot 6}=10$.

The combination $\binom{n}{x}$ in the above formula counts the number of ways of having $x$ events (heads or defectives) among the $n$ trials. The factors $p^x$ and $(1-p)^{n-x}$ arise because the trials are independent, so the probabilities of the $x$ events and the $n-x$ non-events multiply.

The distribution is called the binomial distribution, with parameters $n$ and $p$. A parameter is not a variable, but is (usually) an unknown quantity.

The mean of this distribution is $n\cdot p$ and the variance is $n\cdot p\cdot (1-p)$.

An example:

Products manufactured by XYZ company have a defective rate of 0.01. Randomly pick a set of 100 products.

What is the probability that exactly 3 of these 100 selected products are defective?
What is the probability that less than 3 of these 100 selected products are defective?
How many of these 100 selected products are expected to be defective?

Solution.

Let $X$ denote the number of defective products out of the 100 selected ones. Then, $X$ has a binomial distribution with $n=100$ and $p=0.01$.

$p(3)=\binom{100}{3}0.01^3(1-0.01)^{100-3}=0.0610$
$F(2)=P(X< 3)=P(X=0)+P(X=1)+P(X=2)=p(0)+p(1)+p(2)=0.9206$. Here “less than 3” means “less than or equal to 2”. You need to use the binomial probability formula 3 times!
The expected number of defective products is the same as the mean, and it is $n\cdot p=100(0.01)=1$, as expected.
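
Instead of applying the binomial formula by hand, the same quantities can be obtained with R's built-in functions `dbinom()` (pmf) and `pbinom()` (CDF). The sketch below assumes the values $n=100$ and $p=0.01$ from the example; the variable names are chosen only for illustration.

```r
n <- 100    # number of products selected
p <- 0.01   # defective rate

# P(X = 3): binomial pmf at 3
p_exactly_3 <- dbinom(3, size = n, prob = p)

# The same value from the formula choose(n, x) * p^x * (1 - p)^(n - x)
p_exactly_3_formula <- choose(n, 3) * p^3 * (1 - p)^(n - 3)

# P(X < 3) = P(X <= 2): binomial CDF at 2
p_less_than_3 <- pbinom(2, size = n, prob = p)

# Expected number of defective products
expected_defectives <- n * p

round(c(p_exactly_3, p_less_than_3, expected_defectives), 4)
```

Note that `pbinom(2, ...)` adds up $p(0)+p(1)+p(2)$ in one call, so the formula does not need to be used three times.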

3.6 Geometric and Negative Binomial Distributions

In quality control, we might be interested in the number of products to be sampled in order to see the first defective product from a sea of products.

Again, consider two situations:

Flip a coin until a head is seen. Denote the number of flips by $X$. Assume the probability of a head to be $p$.
Sample products until a defective product is seen. Denote the number of products sampled as $Y$. Assume the defective rate is $p$.

Both $X$ and $Y$ have the same distribution, whose probability mass function is given as follows:

\[p(x)=(1-p)^{x-1}\cdot p, ~~x=1, 2, 3, \cdots\] since selections are independent, and there are $x-1$ normal products followed by exactly one defective one.

The distribution is called the geometric distribution. The mean can be shown to be $\frac{1}{p}$ and the variance $\frac{1-p}{p^2}$.

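The negative binomial distribution is for the situation in which you flip or sample until you see $k$ heads or defectives; the random variable is the total number of flips or samplings needed. The last one is for sure a head or defective, and you apply the binomial formula to the previous $x-1$ flips or samplings, which must contain exactly $k-1$ heads or defectives.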

Thus the formula is

\[p(x) = \binom{x-1}{k-1}p^{k-1}(1-p)^{x-k}\cdot p, ~~x= k, k+1, k+2, \cdots\]

The mean of the negative binomial distribution is $\frac{k}{p}$ and variance is $\frac{1-p}{p^2}\cdot k$.
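
R has built-in functions `dgeom()` and `dnbinom()` for these two distributions, but they count the number of failures before the success rather than the total number of trials, so the argument has to be shifted. The sketch below compares the built-in functions with the formulas above; the values of `p`, `x`, and `k` are hypothetical and chosen only for illustration.

```r
p <- 0.01   # hypothetical defective rate

# Geometric: first defective appears on trial x
x <- 5
geom_formula <- (1 - p)^(x - 1) * p
geom_builtin <- dgeom(x - 1, prob = p)   # x - 1 normal products before the first defective

# Negative binomial: k-th defective appears on trial x
k <- 3
x <- 10
nb_formula <- choose(x - 1, k - 1) * p^(k - 1) * (1 - p)^(x - k) * p
nb_builtin <- dnbinom(x - k, size = k, prob = p)  # x - k normal products before the k-th defective

c(geom_formula, geom_builtin, nb_formula, nb_builtin)

# Means and variances from the formulas in the text
c(mean_geom = 1 / p, var_geom = (1 - p) / p^2,
  mean_nb = k / p,   var_nb = k * (1 - p) / p^2)
```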

3.7 Poisson Distribution

When modeling the number of rare events (such as earthquakes and car accidents), the Poisson distribution is often used. Let $X$ denote the number of rare events in a given dimension (space, area, or a time interval). The probability mass function can be chosen to be

\[p(x) = \frac{\lambda^x}{x!}e^{-\lambda}, ~~x = 0, 1, 2, \cdots\] where $\lambda$ is the average number of rare events per unit dimension (such as per cubic foot, per square foot, or per minute). The mean of this distribution is just $\lambda$ and the variance is also $\lambda$.

An example:

A website is attacked 3 times on average in each year.

What is the probability that the website will be attacked at least once in the next year?
What is the probability that the website will be attacked more than 10 times in the next 5 years?
How many times will the website be expected to be attacked in the following 10 years?

Solution.

Let $X$ denote the number of attacks in a year. Then $X$ has a Poisson distribution with mean $\lambda=3$ attacks per year.

Let $Y$ denote the number of attacks in 5 years. Then $Y$ has a Poisson distribution with mean $\lambda=3\cdot 5=15$ attacks every 5 years.

(a) The probability that the website will be attacked at least once in the next year is given by

\[P(X>0)=1-P(X=0)=1-\frac{3^0}{0!}e^{-3}=1-e^{-3}\approx 0.9502. \]

(b) The probability that the website will be attacked more than 10 times in the next 5 years is given by

\[P(Y>10)=1-P(Y\le 10)=1-p(0)-p(1)-p(2)-\cdots - p(10)\approx 0.8815, \]

where $\lambda = 15$ should be used when calculating $p(0), p(1), \cdots, p(10)$.

(c) The website is expected to have $3\cdot 10=30$ attacks in the next 10 years.
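
Summing eleven Poisson terms by hand is tedious, so it may help to check these answers with R's built-in functions `dpois()` (pmf) and `ppois()` (CDF). The sketch below assumes the rate of 3 attacks per year from the example; the variable names are illustrative.

```r
lambda_year <- 3   # average number of attacks per year

# Probability of at least one attack in one year: 1 - P(X = 0)
p_at_least_one <- 1 - dpois(0, lambda = lambda_year)

# Probability of more than 10 attacks in 5 years: P(Y > 10) with lambda = 15
lambda_5yr <- 5 * lambda_year
p_more_than_10 <- ppois(10, lambda = lambda_5yr, lower.tail = FALSE)

# The same value by summing p(0), ..., p(10) explicitly
p_more_than_10_sum <- 1 - sum(dpois(0:10, lambda = lambda_5yr))

# Expected number of attacks in 10 years
expected_10yr <- 10 * lambda_year

round(c(p_at_least_one, p_more_than_10, expected_10yr), 4)
```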

3.8 Chapter Practice Problems

Poisson Distribution

A bakery sells an average of 4 loaves of bread per hour. What is the probability that they sell exactly 2 loaves in the next hour?
The number of accidents at a traffic intersection follows a Poisson distribution with an average rate of 1.5 accidents per week. What is the probability of having no accidents in a given week?
A researcher finds that an average of 5 emails are received per hour. What is the probability that 7 emails will be received in the next hour?

Geometric Distribution

A light bulb has a 20% chance of burning out each day. What is the probability that it lasts exactly 3 days?
A basketball player makes 70% of their free throws. What is the probability that they make their first successful free throw on the 5th attempt?
The probability of a customer purchasing a product during their first visit to a store is 0.3. What is the probability that a customer makes a purchase for the first time on their third visit?

Binomial Distribution

A student has a 60% chance of passing an exam. If they take the exam 5 times, what is the probability that they pass exactly 3 times?
A fair coin is flipped 10 times. What is the probability of getting heads exactly 7 times?
In a quality control process, 90% of products pass inspection. If 12 products are inspected, what is the probability that exactly 10 pass?

Discrete Distribution

A dataset consists of the following values: 3, 3, 6, 8, 10. Randomly choose a value from the set and denote the value by X. Calculate the mean and variance of the random variable X.
A dataset consists of the following values: 3, 3, 6, 8, 10. Randomly choose two values from the set without replacement and denote the average of the two values by Y. Calculate the mean and variance of the random variable Y.

Here are the answers double-checked:

Poisson Distribution

0.1465
0.2231
0.1044

Geometric Distribution

0.128
0.0057
0.147

Binomial Distribution

0.3456
0.1172
0.2301

4 Continuous Random Variables and Probability Distributions

A continuous random variable is one that can take on any value within a certain range. The concept of probability mass function will not work for continuous random variables, since the probability that $X$ equals a single value is always 0 (finding a needle in a sea). This does not mean it is hopeless. Instead, calculus is the savior.

The probabilities associated with continuous random variables are represented by a probability density function (PDF). The area under the graph of the PDF over an interval gives the probability that the random variable falls within that interval. The area under the PDF curve over the entire range is equal to 1.

Examples of a continuous random variable:

The temperature of a fluid in a system.
The time it takes for a customer service representative to handle a call.
The distance a car travels before its engine fails.

4.1 Probability Distributions and Probability Density Functions

For a continuous random variable, the counterpart of the probability mass function for a discrete random variable is the probability density function, denoted $f(x)$, which is defined as the derivative of the cumulative distribution function $F(x)$. That is

\[f(x)=F'(x)\]

The probability density function has the following properties:

$f(x)$ is never negative.
The area under the curve of $f(x)$ and above the $x$-axis is always 1; that is, $\int_{-\infty}^{\infty}f(x)dx=1$.
The probability that $X$ falls in the interval $[a, b]$, $(a, b)$, $[a, b)$, $(a, b]$ is given by $P(X<b)-P(X<a)$, or $F(b)-F(a)$, or $\int_{a}^{b}f(x)dx$, regardless of the boundaries of the interval.

4.2 Cumulative Distribution Functions

The cumulative distribution function (CDF) of a random variable $X$ has range between 0 and 1, and is always non-decreasing. For a continuous random variable, the CDF is a continuous function.

To get the CDF from the pdf, we use the following formula:

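\[F(x)=\int_{-\infty}^{x}f(t)dt\] where $t$ is a dummy variable of integration, used so that it is not confused with the upper limit $x$.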

4.3 Mean and Variance of a Continuous Random Variable

The mean (or expectation, or expected value) of a continuous random variable whose probability density function is $f(x)$ is given by

\[\mu = \int_{-\infty}^{\infty}xf(x)dx\]

In this formula:

$x$ represents the possible values of the random variable.
$f(x)$ represents the probability density function of the random variable at each value $x$.
The integral sums up the product of each possible value x and its corresponding density f(x) over the entire range of possible values.

The variance is given by

\[\sigma^2 = \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx ~~~\text{or} ~~\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2\]

The integral above computes the weighted sum of squared deviations from the mean, where each squared deviation is weighted by its corresponding probability density.

The mean is also denoted by $E(X)$ and variance by $Var(X)$.

An example:

The probability density function of $X$ is given by

\[f(x)=3x^2, ~~0<x<1\] \[f(x)=0, ~~\text{otherwise}\]

Find the mean and variance.
Find the probability that $X<0.3$.
Find the probability that $X>0.2$.
Find the probability that $0.25<X<0.6$.

Solution.

The mean is calculated as

\[\mu = \int_{-\infty}^{\infty}xf(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}xf(x)dx=\int_{0}^{1}x\cdot 3x^2dx\stackrel{\text{?}}{=}\frac{3}{4}\]

where the first “?” is due to the fact that $f(x)$ is only non-zero between 0 and 1, and the second “?” is due to the fact that the anti-derivative of $3x^3$ is $\frac{3x^4}{4}$.

The variance is calculated as

\[\sigma ^2 = \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}(x-\mu)^2f(x)dx=\int_{0}^{1}(x-\frac{3}{4})^2\cdot 3x^2dx\stackrel{\text{?}}{=}3\int_{0}^{1}(x^4-\frac{3}{2}x^3+\frac{9}{16}x^2)dx=\frac{3}{80}\]

(b) The probability that $X<0.3$ is obtained by

\[P(X<0.3) = \int_{-\infty}^{0.3}f(x)dx\stackrel{\text{?}}{=}\int_{0}^{0.3}f(x)dx=\int_{0}^{0.3}3x^2dx\stackrel{\text{?}}{=}0.027\]

(c) The probability that $X>0.2$ is obtained by

\[P(X>0.2) = \int_{0.2}^{\infty}f(x)dx\stackrel{\text{?}}{=}\int_{0.2}^{1}f(x)dx=\int_{0.2}^{1}3x^2dx\stackrel{\text{?}}{=}0.992\]

(d) The probability that $0.25<X<0.6$ is obtained by

\[P(0.25<X<0.6) = \int_{0.25}^{0.6}f(x)dx=\int_{0.25}^{0.6}3x^2dx\stackrel{\text{?}}{\approx}0.2\]

If $X$ is a random variable, then $X^2$ is too. Thus, we can talk about the mean of $X^2$, and it is denoted by $E(X^2)$. With this notation, the variance formula of a random variable $X$ can be written $Var(X)=E(X^2)-[E(X)]^2$.
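
The integrals in the example above can also be evaluated numerically with R's `integrate()` function. This is only a sketch for the density $f(x)=3x^2$ on $(0,1)$; the function and variable names are chosen for illustration.

```r
# Density from the example: 3x^2 on (0, 1), and 0 otherwise
f <- function(x) ifelse(x > 0 & x < 1, 3 * x^2, 0)

# Mean: integral of x * f(x) over the support
mu <- integrate(function(x) x * f(x), lower = 0, upper = 1)$value

# Variance via Var(X) = E(X^2) - [E(X)]^2
e_x2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 1)$value
sigma2 <- e_x2 - mu^2

# Probabilities are areas under the density curve
p_less_0.3 <- integrate(f, lower = 0,    upper = 0.3)$value  # P(X < 0.3)
p_more_0.2 <- integrate(f, lower = 0.2,  upper = 1)$value    # P(X > 0.2)
p_between  <- integrate(f, lower = 0.25, upper = 0.6)$value  # P(0.25 < X < 0.6)

c(mean = mu, variance = sigma2)
c(p_less_0.3, p_more_0.2, p_between)
```

Because `integrate()` is a numerical method, its results agree with the exact answers only up to a small numerical error.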

4.4 Continuous Uniform Distribution

When $f(x)$ is a non-zero constant on the interval $(a,b)$ and 0 outside the interval, the distribution is a continuous uniform distribution with support $(a,b)$. Because the total area under the density must be 1, the constant is $\frac{1}{b-a}$. Whether the endpoints of the interval are included does not matter.

For a continuous uniform random variable with support $(a,b)$, the mean equals $\frac{a+b}{2}$ and the variance equals $\frac{(b-a)^2}{12}$.

An example:

Suppose the random variable $X$ has the following probability density function:

\[f(x)=c, ~~~ 1<x<5\]

\[f(x)=0, ~~~ \text{otherwise}\]

(a) Find the constant $c$.
(b) Find the mean and variance.
(c) Find the probability that $X$ is between 2 and 4.

Solution.

(a) Sketch the graph of the function. Since the area under the curve and above the $x$-axis must be 1, the constant $c$ must be $\frac{1}{4}$.
(b) The mean is $\frac{1+5}{2}=3$. The variance is $\frac{(5-1)^2}{12}=\frac{4}{3}$.
(c) The probability that $X$ is between 2 and 4 is obtained by

\[P(2<X<4)=\int_{2}^{4}f(x)dx=\int_2^4 \frac{1}{4}dx=\frac{1}{2}\]

You can avoid using calculus here by drawing a graph to find the area in between 2 and 4 under the density curve and above the $x$-axis.
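
The same answers can be checked in R with the built-in uniform functions `dunif()` (density) and `punif()` (CDF). This is a small sketch for the support $(1,5)$ used in the example; variable names are illustrative.

```r
a <- 1
b <- 5

# Height of the density: the constant c = 1 / (b - a)
c_const <- dunif(3, min = a, max = b)

# Mean and variance from the formulas for the continuous uniform distribution
mu <- (a + b) / 2
sigma2 <- (b - a)^2 / 12

# P(2 < X < 4) = F(4) - F(2)
p_2_to_4 <- punif(4, min = a, max = b) - punif(2, min = a, max = b)

c(c = c_const, mean = mu, variance = sigma2, prob = p_2_to_4)
```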

4.5 Normal Distribution

When a random variable $X$ has the probability density function given by

\[f(x)=\frac{1}{\sqrt{2\pi}\cdot\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}, ~~~-\infty<x<\infty\]

$X$ is said to have a normal distribution, denoted $X\sim N(\mu, \sigma^2)$ or $X\sim N(\mu, \sigma)$ according to different books. The mean is just equal to $\mu$ and the standard deviation is just equal to $\sigma$.

When $X\sim N(\mu, \sigma^2)$, the new random variable $Z=\frac{X-\mu}{\sigma}$ is called the $Z$-score or standardized score. It turns out that $Z$ has a normal distribution with mean 0 and standard deviation 1.

To calculate the probability associated with a normal distribution, we can use a standard normal distribution table or software.

An example:

If $X\sim N(\mu=100, \sigma=15)$, find

$P(X<110)$
$P(X>120)$
$P(90<X<130)$

Solution.

$P(X<110)=P(X-\mu<110-\mu)=P(\frac{X-\mu}{\sigma}<\frac{110-\mu}{\sigma})=P(Z<0.67)\approx0.7486$ by a standard normal table. Your textbook also has such a table in the appendix.
$P(X>120)=P(X-\mu>120-\mu)=P(\frac{X-\mu}{\sigma}>\frac{120-\mu}{\sigma})=P(Z>1.33)\approx 1- 0.9082=0.0918$
The probability is obtained as follows:

\[P(90<X<130)=P(90-\mu<X-\mu<130-\mu)=P(\frac{90-\mu}{\sigma}<\frac{X-\mu}{\sigma}<\frac{130-\mu}{\sigma})\] \[=P(-0.67<Z<2)=P(Z<2)-P(Z<-0.67)\approx0.9772-0.2514=0.7258\]

where I used the result that $P(a<X<b)=P(X<b)-P(X<a)$ when $X$ is a continuous random variable.
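
With software there is no need to standardize first: R's `pnorm()` accepts the mean and standard deviation directly. The sketch below uses the values $\mu=100$ and $\sigma=15$ from the example. The results differ slightly from the table-based answers because the table method rounds the $z$-score to two decimal places.

```r
mu <- 100
sigma <- 15

# P(X < 110): lower-tail probability
p_a <- pnorm(110, mean = mu, sd = sigma)

# P(X > 120): upper-tail probability
p_b <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)

# P(90 < X < 130) = F(130) - F(90)
p_c <- pnorm(130, mean = mu, sd = sigma) - pnorm(90, mean = mu, sd = sigma)

# Standardizing by hand gives the same probability
z <- (110 - mu) / sigma
p_a_from_z <- pnorm(z)

round(c(p_a, p_b, p_c), 4)
```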

4.6 Exponential Distribution

If the probability density function of a random variable $X$ is given by

\[f(x)=\lambda e^{-\lambda x}, ~~ \text{for}~~x>0\]

\[f(x)=0, ~~ \text{otherwise}\]

then the random variable is said to have an exponential distribution with parameter $\lambda$. This distribution is a right-skewed distribution, as seen from the following graph:

It can be shown that the mean of this distribution is $\frac{1}{\lambda}$ and the standard deviation is also $\frac{1}{\lambda}$.
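
To see the right-skewed shape and to check the mean and standard deviation numerically, one can use R's `dexp()` (density) and `pexp()` (CDF). In this sketch the rate $\lambda=0.5$ is a hypothetical value chosen only for illustration.

```r
rate <- 0.5   # hypothetical value of lambda

# Plot the density: highest near 0, with a long tail to the right
curve(dexp(x, rate = rate), from = 0, to = 10,
      xlab = "x", ylab = "f(x)", main = "Exponential density")

# P(X < 3) from the CDF, which equals 1 - exp(-rate * 3)
p_less_3 <- pexp(3, rate = rate)

# Mean and standard deviation by numerical integration
mean_num <- integrate(function(x) x * dexp(x, rate = rate), 0, Inf)$value
e_x2_num <- integrate(function(x) x^2 * dexp(x, rate = rate), 0, Inf)$value
sd_num <- sqrt(e_x2_num - mean_num^2)

c(mean = mean_num, sd = sd_num, one_over_lambda = 1 / rate)
```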

4.7 The t Distribution

The t-distribution is one with a pdf that is very complicated and thus not given here. Each t-distribution is associated with a number that is called the number of degrees of freedom. The plots of some t-distributions are given below. More are here: https://en.wikipedia.org/wiki/Student%27s_t-distribution

Each of these distributions is controlled by its number of degrees of freedom ($df = n-1$). Every $t$-distribution with $df>1$ has mean 0, and for $df>2$ the variance is $\frac{df}{df-2}=\frac{n-1}{n-3}$. As $n$ gets larger and larger, a $t$-distribution gets closer and closer to the standard normal distribution.

We can use software to find the probability that a t-random variable with “df” degrees of freedom is no greater than a given number, say x. The R code is “pt(x, df)”. For the probability of being greater than x, use “1 - pt(x, df)”.
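
A short sketch of how `pt()` behaves is given below. In R, `pt(x, df)` returns the lower-tail probability $P(T\le x)$, and the `lower.tail = FALSE` argument gives the upper tail. The values of `x` and `df` are hypothetical.

```r
x <- 2     # hypothetical cut-off
df <- 10   # hypothetical degrees of freedom

p_lower <- pt(x, df)                       # P(T <= x)
p_upper <- pt(x, df, lower.tail = FALSE)   # P(T > x), same as 1 - pt(x, df)

# As df grows, t probabilities approach standard normal probabilities
t_probs <- sapply(c(2, 5, 30, 1000), function(d) pt(1.96, d))
z_prob <- pnorm(1.96)

c(p_lower, p_upper)
round(c(t_probs, normal = z_prob), 4)
```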

4.8 Reliability Function

The reliability function $R(t)$, also known as the reliability or survivor function, gives the probability that a system, component, or individual will survive beyond a specific time t. In other words, it calculates the probability that the event of interest (such as system failure) has not occurred by time t. Mathematically, it’s defined as:

\[R(t)=1-F(t)\]

where $F(t)$ is the cumulative distribution function (CDF) of the event times. The reliability function decreases over time as more events occur.

In different fields and literature, you might encounter either the reliability function or the survival function being used, depending on the context and the field’s convention. For example, in reliability engineering, the term “reliability” is often used, while in medical and survival analysis contexts, the term “survival” is commonly used.

Example: Reliability of a Weibull-Distributed Component

Suppose we have a component whose lifetime (in thousand hours) follows a Weibull distribution, a distribution with cumulative distribution function (CDF) given by:

\[F(t)=1-e^{-(t/\lambda)^k}, ~~t>0\]

Assume that $k=2$ and $\lambda = 0.8$.

Find the probability density function (pdf) of the life time of the component.
Find the reliability function of the life time of the component.
The ratio of the pdf to reliability is called the failure rate or the hazard function, denoted by $h(t)$, which is interpreted as the instantaneous rate of failure at time $t$. Plot the graph of this function.

Solution.

Since $k=2$ and $\lambda = 0.8$, $F(t)=1-e^{-(t/0.8)^2}$. Then the probability density function (pdf) is the derivative of $F(t)$, or $f(t)=\frac{2t}{0.8^2}e^{-(t/0.8)^2}=\frac{t}{0.32}e^{-(t/0.8)^2}$.
The reliability function is $R(t)=1-F(t)= e^{-(t/0.8)^2}$.
The failure rate or the hazard function $h(t)= \frac{f(t)}{R(t)}=\frac{t}{0.32}$.
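
The question also asks for a plot of the hazard function. One possible way to draw it in R is sketched below; the function names `pdf_w`, `rel_w`, and `haz_w` are made up for this illustration, and the built-in `dweibull()` and `pweibull()` are used only as a cross-check.

```r
k <- 2         # shape parameter
lambda <- 0.8  # scale parameter (thousand hours)

# pdf, reliability, and hazard functions derived in the solution
pdf_w <- function(t) (k / lambda) * (t / lambda)^(k - 1) * exp(-(t / lambda)^k)
rel_w <- function(t) exp(-(t / lambda)^k)
haz_w <- function(t) pdf_w(t) / rel_w(t)

# Plot the hazard function; for k = 2 it is the straight line t / 0.32
t_grid <- seq(0, 3, by = 0.01)
plot(t_grid, haz_w(t_grid), type = "l",
     xlab = "t (thousand hours)", ylab = "h(t)", main = "Weibull hazard function")

# Cross-check against R's built-in Weibull functions at t = 1
check_pdf <- dweibull(1, shape = k, scale = lambda)
check_rel <- pweibull(1, shape = k, scale = lambda, lower.tail = FALSE)
c(pdf_w(1), check_pdf, rel_w(1), check_rel)
```

The increasing hazard indicates that the component becomes more likely to fail as it ages.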

The reliability can also be defined for a system of components connected in some way. There are some special systems:

A series system is defined as a system whose components are connected end-to-end in a series, so the system as a whole operates only if every component is functioning.
A parallel system in reliability engineering refers to a configuration where multiple components or paths are connected in parallel, and the system as a whole operates if at least one of the parallel components or paths is functioning. This setup increases the system’s overall reliability since the system can continue to function even if one or more of the parallel components fail.

4.9 Chapter Practice Problems

Uniform Distribution

A random variable X is uniformly distributed between 0 and 8. What is the probability that X is greater than 5?
A car rental service charges a uniform rate between $30 and $50 per day. What is the probability that a randomly chosen rental costs less than $40?
If a student randomly selects a number from the range [1, 100], what is the probability that the number is between 20 and 50?

Normal Distribution

The weights of a certain type of fruit are normally distributed with a mean of 150 grams and a standard deviation of 20 grams. What percentage of fruits weigh more than 170 grams?
A factory’s output is normally distributed with a mean of 200 units per day and a standard deviation of 15 units. What is the probability that the factory produces fewer than 190 units in a day?
Test scores in a class are normally distributed with a mean of 75 and a standard deviation of 10. What is the z-score for a test score of 85?

Exponential Distribution

The time between arrivals of buses at a station is exponentially distributed with an average arrival rate of 1 bus every 15 minutes. What is the probability that the next bus arrives in less than 10 minutes?
A factory machine has a mean time to failure of 12 hours. What is the probability that it operates for more than 10 hours before failing?
The average lifespan of a battery is 2 years. What is the probability that a randomly chosen battery lasts less than 1 year?

Reliability Function

A product has an exponential lifetime distribution with a standard deviation of 10 years. What is the reliability at 4 years?
The average time until failure of a system is 8 months. What is the probability that the system will function for more than 5 months?
A certain machine has a mean time to failure of 3 years. Assuming failure time has an exponential distribution, what is the reliability of the machine at 1 year?

5 Joint Probability Distributions

Many times we need to consider two or more random variables at the same time. In this situation, we need to consider the joint distribution of these random variables.

5.1 Joint Probability Distributions for Two Random Variables

We only consider the joint distribution of two discrete random variables $X$ and $Y$. The joint probability mass function (pmf) of them is defined to be $f(x,y)=P(X=x, Y=y)$. In contrast, each of the two individual distributions is called a marginal distribution.

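Example 1. $X$ takes values 0, 2, and 4. $Y$ takes values 1, 2, and 3. The following table shows the joint pmf of the two discrete random variables (rows are the values of $X$ and columns are the values of $Y$).

\[\begin{array}{c|ccc} & y=1 & y=2 & y=3\\ \hline x=0 & 1/4 & 1/8 & 1/8\\ x=2 & 0 & 1/8 & 0\\ x=4 & 1/4 & 0 & 1/8 \end{array}\]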

Based on the results given, we can derive many results, such as

$f(0, 1)= 1/4, f(4, 1)= 1/4, f(0, 3)=1/8$,
$P(X=2)=P(X=2, Y=1~\text{or} ~2~\text{or} ~3)=f(2,1)+f(2,2)+f(2,3)=0+1/8+0=1/8$, and
$P(X<3, Y>2)=P(X=0 ~\text{or} ~2, Y=3)=f(0,3)+f(2,3)=1/8+0=1/8$.

We can also find the PMF of $X$ and the PMF of $Y$, respectively. These two distributions are the marginal distributions of the joint distribution. The marginal distribution of $X$ is the values 0, 2, and 4 with corresponding probabilities 1/2, 1/8, and 3/8. The probabilities are obtained by adding the joint probabilities on the three rows of the table. The marginal distribution of $Y$ is the values 1, 2, and 3 with corresponding probabilities 1/2, 1/4, and 1/4. The probabilities are obtained by adding the joint probabilities on the three columns of the table.
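
In R, a joint pmf of this kind can be stored as a matrix, and the marginal distributions are then the row and column sums. The sketch below enters the nine joint probabilities used in the calculations of this chapter, with rows for $X$ and columns for $Y$; the object names are illustrative.

```r
# Joint pmf: rows are X = 0, 2, 4 and columns are Y = 1, 2, 3
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("0", "2", "4"), Y = c("1", "2", "3")))

# A valid joint pmf must sum to 1
total <- sum(joint)

# Marginal pmfs: add across each row (for X) or each column (for Y)
marg_x <- rowSums(joint)
marg_y <- colSums(joint)

# P(X < 3, Y > 2) = f(0, 3) + f(2, 3)
p_event <- sum(joint[c("0", "2"), "3"])

marg_x
marg_y
p_event
```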

5.2 Conditional Probability Distributions and Independence

When the joint probability mass function (pmf) of $X$ and $Y$ is given, we can find the pmf of $X$ and the pmf of $Y$ separately. Take the joint pmf of Example 1 in Section 5.1 as an example.

To find the pmf of $X$, we just add the 3 probabilities on each row and we end up with sums of 1/2, 1/8, and 3/8, which are the probabilities that $X$ takes the values of 0, 2, and 4, respectively.

Similarly, to find the pmf of $Y$, we just add the 3 probabilities on each column and we end up with sums of 1/2, 1/4, and 1/4, which are the probabilities that $Y$ takes the values of 1, 2, and 3, respectively.

The conditional probability of $Y=y$ given $X=x$ can be written as $f(y|x)$, which is defined by

\[f(y|x)=\frac{f(x,y)}{f_X(x)}=\frac{\text{the joint pmf}}{\text{the marginal pmf of}~X}\]

$X$ and $Y$ are said to be independent if $f(y|x)=f_Y(y)$ for any $x$ and $y$ with $f_X(x)>0$. In other words, when two random variables are independent, the conditional probability mass function is the same as the corresponding marginal probability mass function. Equivalently, $X$ and $Y$ are independent if and only if $f(x,y)=f_X(x)f_Y(y)$, for all $x$ and $y$.

Note: some books use $p$ instead of $f$.
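
The definitions above translate directly into a few lines of R. The sketch below uses the joint probabilities of the running example (rows for $X=0,2,4$, columns for $Y=1,2,3$) to compute the conditional pmf of $Y$ given each value of $X$ and to check the product condition for independence. The object names are illustrative.

```r
# Joint pmf of the running example: rows are X = 0, 2, 4; columns are Y = 1, 2, 3
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("0", "2", "4"), Y = c("1", "2", "3")))

f_x <- rowSums(joint)   # marginal pmf of X
f_y <- colSums(joint)   # marginal pmf of Y

# Conditional pmf f(y | x): divide each row of the joint pmf by its row total
cond_y_given_x <- sweep(joint, 1, f_x, "/")

# Independence holds only if f(x, y) = f_X(x) * f_Y(y) in every cell
product_of_marginals <- outer(f_x, f_y)
independent <- isTRUE(all.equal(joint, product_of_marginals,
                                check.attributes = FALSE))

cond_y_given_x
independent
```

Each row of `cond_y_given_x` is itself a pmf, so it sums to 1. Comparing the joint pmf with the product of the marginals cell by cell shows whether the two variables are independent.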

5.3 Covariance and Correlation

When given the joint pmf of $X$ and $Y$, how can we calculate $E(X\cdot Y)$?

Here is an example: for the joint pmf of Example 1 in Section 5.1, calculate $E(X\cdot Y)$.

Solution.

We need to find all possible products of $X$ and $Y$, then multiply them by the corresponding joint probabilities, and finally add the results up. That is,

\[(0)(1)(1/4)+(0)(2)(1/8)+(0)(3)(1/8)\] \[+(2)(1)(0)+(2)(2)(1/8)+(2)(3)(0)\] \[+(4)(1)(1/4)+(4)(2)(0)+(4)(3)(1/8)\]

The result is 3.

Previously, we introduced the mean of a random variable $X$. We use $\mu$ or $E(X)$ to denote the mean.

The covariance of two discrete random variables is defined to be

\[cov(X,Y)=E([X-E(X)][Y-E(Y)])\]

It can be shown that $cov(X,Y)=E(X\cdot Y)-E(X)\cdot E(Y)$.

In the above example, $X$ takes the values of 0, 2, and 4, with probabilities of 1/2, 1/8, and 3/8, respectively. So the mean of $X$ is $\mu=(0)(1/2)+(2)(1/8)+(4)(3/8)=1.75$. The variance of $X$, by the equivalent formula, $V(X)=\sum x^2\cdot p(x)-\mu^2$, is calculated as $(0^2)(1/2)+(2^2)(1/8)+(4^2)(3/8)-1.75^2=3.4375$.

Similarly, $Y$ takes the values of 1, 2, and 3, with probabilities of 1/2, 1/4, and 1/4, respectively. The mean of $Y$ is $(1)(1/2)+(2)(1/4)+(3)(1/4)=1.75$. The variance of $Y$ is $(1^2)(1/2)+(2^2)(1/4)+(3^2)(1/4)-(7/4)^2=0.6875$.

Now, the covariance of $X$ and $Y$ is $3-(1.75)(1.75)=-0.0625$.

The correlation between $X$ and $Y$ is defined by

\[\rho_{X,Y}=corr(X,Y)=\frac{cov(X,Y)}{\sqrt{Var(X)}\cdot \sqrt{Var(Y)}}\]

Continue the previous example. The correlation between $X$ and $Y$ is

\[\rho_{X,Y} = \frac{-0.0625}{\sqrt{3.4375}\cdot \sqrt{0.6875}}\approx -0.04\]

Remark: Covariance can take any value, but correlation must be between $-1$ and 1. In addition, correlation has no unit.
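
The whole covariance and correlation calculation can be reproduced in R from the joint pmf. This is a sketch using the joint probabilities of the running example; the object names are illustrative.

```r
x_vals <- c(0, 2, 4)
y_vals <- c(1, 2, 3)

# Joint pmf: rows follow x_vals, columns follow y_vals
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE)

f_x <- rowSums(joint)   # marginal pmf of X
f_y <- colSums(joint)   # marginal pmf of Y

# E(XY): every product x * y weighted by its joint probability
e_xy <- sum(outer(x_vals, y_vals) * joint)

# Means and variances from the marginal pmfs
e_x <- sum(x_vals * f_x)
e_y <- sum(y_vals * f_y)
var_x <- sum(x_vals^2 * f_x) - e_x^2
var_y <- sum(y_vals^2 * f_y) - e_y^2

# Covariance and correlation
cov_xy <- e_xy - e_x * e_y
rho_xy <- cov_xy / (sqrt(var_x) * sqrt(var_y))

c(E_XY = e_xy, cov = cov_xy, corr = rho_xy)
```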

5.4 Linear Functions of Random Variables

Some useful properties of the mean and variance are given below:

If $Y$ and $X$ are random variables and $a, b$, and $c$ are constants, then

$E(c)=c$
$E(aX+bY) = aE(X)+bE(Y)$
$E(X+b)= E(X) + b$, but $Var(X+b)= Var(X)$
$Var(aX)= a^2Var(X)$
$Var(X+Y)=Var(X)+Var(Y)+2\cdot cov(X,Y)$
If $X$ and $Y$ are independent, then $Var(X+Y)=Var(X)+Var(Y)$ and $E(XY)=E(X)E(Y)$ and thus $Cov(X,Y)=0$.
If both $X$ and $Y$ are normally distributed and are independent, then $X+Y$ is also normally distributed with mean that equals the sum of individual means and variance that equals the sum of individual variances.

An example:

If $E(X) = 2$ and $Var(X)=3$, determine

$E(4X)$
$Var(4X)$
$E(X+5)$
$Var(X-6)$
$E(3X-2)$
$Var(3X-2)$

Solution.

$E(4X)=4E(X)=8$
$Var(4X)=16Var(X)=48$
$E(X+5)=E(X)+5=7$
$Var(X-6)=Var(X)=3$
$E(3X-2)=3E(X)-2=4$
$Var(3X-2)=Var(3X)=3^2Var(X)=27$

Another example:

If $X_1$, $X_2$, …, $X_n$ are independent and identically distributed (i.i.d) with mean 100 and standard deviation 15, determine

the mean of $\bar{X}$, where $\bar{X}=\frac{X_1+X_2+\cdots+X_n}{n}$.
the variance of $\bar{X}$.
the mean of $S^2$, where $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i-\bar{X})^2$.

Solution.

(a) The mean of $\bar{X}$ can be found as follows:

\[E(\bar{X})=E(\frac{X_1+X_2+\cdots+X_n}{n})=E(\frac{1}{n}(X_1+X_2+\cdots+X_n))\] \[=\frac{1}{n}E(X_1+X_2+\cdots+X_n)\] \[=\frac{1}{n}(E(X_1)+E(X_2)+\cdots+E(X_n))=\frac{1}{n}(100n)=100\]

(b) The variance of $\bar{X}$ can be found as follows, where the second step uses the independence of the $X_i$:

\[V(\bar{X})=V(\frac{X_1+X_2+\cdots+X_n}{n})\] \[=\frac{1}{n^2}V(X_1+X_2+\cdots+X_n)\] \[=\frac{1}{n^2}(V(X_1)+V(X_2)+\cdots+V(X_n))=\frac{1}{n^2}(15^2\cdot n)=\frac{15^2}{n}\]

(c) The mean of $S^2$ is $15^2$. The derivation is lengthy, but can you do it?

The above 3 results are also true in general when the mean is $\mu$ and standard deviation is $\sigma$. Just substitute 100 by $\mu$ and 15 by $\sigma$.

A third example:

If $X$ has a normal distribution with mean 10 and variance 2, $Y$ has a normal distribution with mean 8 and variance 3, and $X$ and $Y$ are independent, find

the distribution of $X+Y$.
the probability that $X+Y>15$.

Solution.

The distribution of $X+Y$ is also normal with mean $10+8=18$ and variance $2+3=5$.
Let $T = X+Y$. $T$ is normal with mean $10+8=18$, variance $2+3=5$, and standard deviation $\sqrt{5}$. To find the probability $P(T>15)$, we convert $T$ to $Z$, which has the standard normal distribution and thus allows us to use a normal table to find probabilities.

\[P(T>15)=P(\frac{T-18}{\sqrt{5}}>\frac{15-18}{\sqrt{5}})=P(Z>\frac{15-18}{\sqrt{5}})\] \[=P(Z>-1.34)=1-P(Z<-1.34)\approx 1-0.0901=0.9099\]
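
This probability can also be computed in R without a normal table. The sketch below assumes, as in the example, that $X$ and $Y$ are independent, so the variances add; note that `pnorm()` takes the standard deviation, not the variance. The simulation at the end is only an approximate check, and its sample size and seed are arbitrary choices.

```r
mean_t <- 10 + 8       # mean of T = X + Y
var_t <- 2 + 3         # variance of T (requires independence)
sd_t <- sqrt(var_t)    # pnorm() expects the standard deviation

# P(T > 15): upper-tail probability
p_exact <- pnorm(15, mean = mean_t, sd = sd_t, lower.tail = FALSE)

# Approximate check by simulating X and Y independently
set.seed(1)
x <- rnorm(100000, mean = 10, sd = sqrt(2))
y <- rnorm(100000, mean = 8,  sd = sqrt(3))
p_sim <- mean(x + y > 15)

c(exact = p_exact, simulated = p_sim)
```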

6 Descriptive Statistics

In practice, we often need to estimate a quantity (called a parameter) that describes a group (called a population). Such a group can be the collection of all light bulbs manufactured by a company. The corresponding parameter can be the proportion of defective light bulbs or the average life time of all light bulbs. It’s often unrealistic to reach every member in a population. Instead, people draw a random sample (called data) from the population of interest. After sample data are obtained, a descriptive analysis of the data is usually desirable.

We will introduce summary methods for data. Depending on data, we can calculate measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), and we can visualize the data with graphical displays. This will help us gain insights into the characteristics of the data, understand how descriptive statistics can be applied in various fields, and make data-driven decisions.

6.1 An Introduction to R

To analyze data, it’s better to use software.

Different kinds of data analysis software have been developed in the past 50 years. We will use R.
You can use R and RStudio (now called Posit) either on-premises (install R, then RStudio, from this page: https://posit.co/download/rstudio-desktop/) or in the cloud (go to https://posit.cloud and register).
Let’s assume that you use the cloud-based RStudio. After logging in, click the “New Project” tab in the upper-right corner and select “New RStudio Project.” Once your project is created, rename it from “Untitled Project” to something meaningful, like “STAT353.”
Your R code will be entered in the upper-left panel. Highlight the part of your code you wish to run and click the “Run” button; the results will appear in the lower-left panel alongside the code. The lower-right panel allows you to manage files (click Files) and install packages (click Packages). You can upload files by selecting the “Files” tab and clicking the “Upload” button. To download files, click “More” and choose “Export.”

To create an R Markdown document for generating project reports in PDF, Word, or HTML format, go to the upper-left panel and click “File,” then “New File,” and select “R Markdown.” Fill in the title and author fields in the dialog box and click “OK” to create a template. The first few lines, between the pair of “---” lines, are called the YAML (Yet Another Markup Language) header, and lines 7-9 set up the environment; it’s best not to modify these unless you’re familiar with RStudio. Lines 11 and 21 create sections marked by “#” signs; ensure there’s at least one space between the “#” signs and the section title. You can customize the titles and include text with formatting, such as using a pair of *’s to enclose text in order to emphasize it, using a pair of $’s to enclose mathematical expressions, or enclosing webpage links in between angle brackets. R code should be placed between an opening line of three backticks followed by {r} and a closing line of three backticks to create a code chunk, which you can insert by clicking the green “+C” button in the menu bar on the upper-left panel.

When you’re finished editing, click the “Knit” dropdown in menu bar on the upper-left panel and choose your desired output format (html, pdf, word, …). If there are no errors, the generated output file will appear in the lower-right panel’s file list.

Here is a tutorial to get you started: https://www.youtube.com/watch?v=TQMAKGDIe_8

Start a new project and practice!

6.2 Numerical Summaries of Data

Suppose we are given data such as 800, 820, 900, 950, 780, 690, 860, 880, representing the lifetimes of 8 light bulbs randomly selected from a large population of light bulbs.

What is the average lifetime? In statistics, this is called the sample mean of the data, which describes the center of the data. We use $\bar{x}$ to denote the sample mean. For the above data, $\bar{x}=835$.
What is the spread (or variability) of the data? In statistics, this is measured by the sample variance defined as

\[s^2=\frac{\sum (x_i - \bar{x})^2}{n-1}\]

or measured by the sample standard deviation, the square root of the sample variance. In the formula, each $x_i$ represents an observation, $\bar{x}$ is the sample mean, $n$ is the number of observations (called the sample size), and the notation $\Sigma$ means to obtain the sum of the squared differences between observations and their mean.

For the above data, $s^2=6514.286$ and $s=80.71$. To get the sample variance, you can follow these steps:

Calculate the sample mean (average):

Mean = (800 + 820 + 900 + 950 + 780 + 690 + 860 + 880) / 8 = 6680 / 8 = 835

Calculate the squared differences between each data point and the mean:

\[(800 - 835)^2 = 1225\] \[(820 - 835)^2 = 225\] \[(900 - 835)^2 = 4225\] \[(950 - 835)^2 = 13225\] \[(780 - 835)^2 = 3025\] \[(690 - 835)^2 = 21025\] \[(860 - 835)^2 = 625\] \[(880 - 835)^2 = 2025\]

Calculate the sum of squared differences:

Sum of Squared Differences = 1225 + 225 + 4225 + 13225 + 3025 + 21025 + 625 + 2025 = 45600

Calculate the sample variance by dividing the sum of squared differences by (n-1), where n is the number of data points (8):

Sample Variance = Sum of Squared Differences / (n-1) = 45600 / (8-1) = 45600 / 7 ≈ 6514.286

We can use the R function var() to find the variance of data and the sd() function to find the standard deviation.

# Data
x = c(800, 820, 900, 950, 780, 690, 860, 880)

# Find the variance
var(x)

## [1] 6514.286

# Find the standard deviation
sd(x)

## [1] 80.71113

The R functions give the same answers!

The difference between the largest observation and the smallest observation is called the range of the data. The range of the above data is $950-690$ or 260.
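As a quick check, the sample mean and the range of the light bulb data can also be computed in R. This is a minimal sketch using the base R functions mean(), min(), max(), range(), and diff(); the variable names are only illustrative.

```r
# Light bulb lifetimes from the example above
x = c(800, 820, 900, 950, 780, 690, 860, 880)

# Sample mean
x_bar = mean(x)

# Range as a single number: largest observation minus smallest observation
data_range = max(x) - min(x)

# range() returns the pair (min, max); diff() turns that pair into one number
data_range_alt = diff(range(x))

print(x_bar)
print(data_range)
```

Note that range() by itself returns two numbers (the minimum and the maximum), not their difference.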

We may also be interested in what percentage of the data values are less than or equal to a value. For example,

If we know 95% of all data values are less than or equal to 800, the value 800 is called the 95th percentile of the data.
If we know 90% of all data values are less than or equal to 750, the value 750 is called the 90th percentile.
If we know 25% of all data values are less than or equal to 260, the value 260 is called the 25th percentile, or the first quartile, denoted by $Q_1$.
If we know 50% of all data values are less than or equal to 420, the value 420 is called the 50th percentile, the second quartile, or the median, denoted by $Q_2$ or $m$.
If we know 75% of all data values are less than or equal to 570, the value 570 is called the 75th percentile, or the third quartile, denoted by $Q_3$.

The median is the easiest percentile to find. Simply sort the data from smallest to largest: the median is the middle value (when the number of observations is odd) or the average of the two middle values (when it is even).
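The sort-and-pick-the-middle rule can be sketched in R as follows. The light bulb data from Section 6.2 are reused here, and the hand-written steps are compared with the built-in median() function; the variable names are only illustrative.

```r
# Light bulb lifetimes from Section 6.2
x = c(800, 820, 900, 950, 780, 690, 860, 880)

# Step 1: sort the data from smallest to largest
x_sorted = sort(x)
n = length(x_sorted)

# Step 2: take the middle value (n odd) or average the two middle values (n even)
if (n %% 2 == 1) {
  med = x_sorted[(n + 1) / 2]
} else {
  med = (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
}

print(med)        # median computed step by step
print(median(x))  # built-in function for comparison
```

Since there are 8 observations, the median is the average of the 4th and 5th sorted values, 820 and 860.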

Percentiles are also called quantiles. To find other percentiles (including quartiles), different software packages often use different approaches, so their answers may differ slightly. To use JMP, visit https://www.uvm.edu/~rsingle/other/JMP-intro/default13.html. R users might do the following to find quantiles:

# Data
x = c(2,4, 5, 11, 20, 22, 26, 29, 32, 35, 40, 45, 48, 50)

# Find the 95th percentile
quantile(x, 0.95)

##  95% 
## 48.7
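To illustrate that different approaches can give different quartiles, here is a small sketch using the type argument of quantile(). R's default is type = 7; type = 6 is one of the alternative definitions R offers. The choice of type = 6 for comparison is only an illustration, not a recommendation.

```r
# Same data as in the percentile example above
x = c(2, 4, 5, 11, 20, 22, 26, 29, 32, 35, 40, 45, 48, 50)

# The three quartiles
probs = c(0.25, 0.50, 0.75)

# R's default definition (type = 7)
q_default = quantile(x, probs)

# An alternative definition (type = 6)
q_type6 = quantile(x, probs, type = 6)

print(q_default)
print(q_type6)
```

For these data the two definitions agree on the median but give different first and third quartiles, which is why results from different software may not match exactly.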

6.3 Stem-and-Leaf Diagrams

This video: https://www.youtube.com/watch?v=faiGE_J_dww would be great for introducing the diagram. It also introduces other commonly used plots.

R software uses the stem function to create a stem-and-leaf plot.

x = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 86, 125, 85, 87, 102, 96,  76, 85)

stem(x)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    7 | 6
##    8 | 5567
##    9 | 36689
##   10 | 12449
##   11 | 7
##   12 | 1335

6.4 Frequency Distributions and Histograms

We can create consecutive bins (intervals) of values so that we can count how many data values fall into each bin. This will help us understand the distribution of the data. The bins along with the counts of values form a table, called the frequency distribution of the data. The counts can also be replaced by proportions and the resulting table is called the relative frequency distribution.

Again, this video: https://www.youtube.com/watch?v=faiGE_J_dww helps us make a histogram.

R software uses the hist function to create a histogram.

IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96,  106, 105)

hist(IQ, 
     main = "Distribution of IQ", # The main sets the title of the plot.
     col = "blue",  # The col fills the blue color inside each bar
     xlab = "IQ",  # The title of the x-axis is set to IQ
     ylab = "Count" # The title of the y-axis is set to Count
    )

You can set your own breaks when constructing a histogram. The breaks should be evenly spaced. The number of breaks is usually between 5 and 25. The smallest break should be slightly smaller than the minimal value of your data. The largest break should be slightly larger than the maximal value of your data.

hist(IQ, 
     breaks = c(70,80, 90, 100,  110, 120, 130), 
     main = "Distribution of IQ", 
     col = "blue", 
     xlab = "IQ"
    )

In general, it is uncommon to give specific breaks when creating histograms. Instead, we specify something like “breaks = 12” to guide R to create a histogram with around 12 bins.
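The frequency distribution and relative frequency distribution described at the start of this section can be tabulated directly. The following is a minimal sketch using cut(), table(), and prop.table() with the same IQ data and the same bin boundaries (70, 80, ..., 130) as in the histogram above; by default cut() makes each bin include its right endpoint.

```r
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# Assign each value to a bin such as (90,100]; bins include their right endpoint
bins = cut(IQ, breaks = seq(70, 130, by = 10))

# Frequency distribution: count of values in each bin
freq = table(bins)

# Relative frequency distribution: proportion of values in each bin
rel_freq = prop.table(freq)

print(freq)
print(round(rel_freq, 2))
```

The counts in the table are the bar heights of the histogram drawn with the same breaks.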

6.5 Box Plots

From sample data, you can calculate the minimum value, first quartile, median, third quartile, and maximum. These are called the 5-number summary, which can be displayed through the so-called box plot.

The five vertical lines correspond, respectively, to the minimum, first (or lower) quartile, median, third (or upper) quartile, and maximum of the data. Keep in mind that the mean of the data is not shown here. Since the line extending from the third quartile to the maximum is longer than the line from the first quartile to the minimum, the distribution of the data is said to be right-skewed. These two lines are called whiskers.

R software uses the “boxplot” function to create a boxplot.

IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96,  106, 105)

boxplot(IQ, 
     main = "Distribution of IQ", 
     col = "blue", 
     xlab = "IQ",
     horizontal = TRUE  # If set to FALSE, the boxplot will be vertical.
    )
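The five numbers behind the box plot can also be printed. This sketch uses the base R function fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum; the hinges are Tukey's version of the quartiles and can differ slightly from the values given by quantile() or summary().

```r
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# Five-number summary: min, lower hinge, median, upper hinge, max
five = fivenum(IQ)
print(five)

# summary() reports quartiles from quantile() plus the mean, for comparison
print(summary(IQ))
```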

More complicated boxplots take into account outliers (values that are more than 1.5 IQRs above $Q_3$ or more than 1.5 IQRs below $Q_1$, where $IQR=Q_3-Q_1$ is called the inter-quartile range), as shown below:

It’s better to create comparative boxplots, such as the one below:

These are called side-by-side boxplots.

You need to use two variables when creating comparative boxplots, one being the outcome variable and one being the group variable. It is more convenient to store the two variables in a data frame (much like a spreadsheet in Excel). Here is possible R code:

IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96,  106, 105)

gender = c("F", "M", "F", "F", "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")

boxplot(IQ~gender)

Here are some summary findings based on the boxplots:

IQ Distribution by Gender: The boxplot displays the distribution of IQ scores for two gender groups: “F” (Female) and “M” (Male).
Median IQ: The horizontal line inside each box represents the median IQ score for each gender group. It appears that the median IQ for females (“F”) is almost the same as the median for males (“M”).
Interquartile Range (IQR): The height of each box represents the interquartile range (IQR), which is a measure of the spread of the IQ scores within each gender group. The IQR for females appears to be slightly larger than for males.
Outliers: A boxplot displays any data points that fall outside the whiskers as individual points. Such points are potential outliers. There seems to be no outlier in IQ score for either group.
Skewness: The distribution of IQ scores for either group is slightly skewed to the right (to larger values).
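Some of these findings can be checked numerically. The sketch below stores the two variables in a data frame (the name D_iq is hypothetical), redraws the side-by-side boxplots, and then computes the median and the points flagged as outliers in each group with tapply() and boxplot.stats().

```r
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
gender = c("F", "M", "F", "F", "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")

# Put the outcome and the group variable in one data frame (D_iq is a hypothetical name)
D_iq = data.frame(IQ, gender)

# Side-by-side boxplots using the data frame
boxplot(IQ ~ gender, data = D_iq, xlab = "Gender", ylab = "IQ")

# Median IQ in each group
group_medians = tapply(D_iq$IQ, D_iq$gender, median)

# Values beyond 1.5 IQRs from the box in each group (empty if there are none)
group_outliers = tapply(D_iq$IQ, D_iq$gender, function(v) boxplot.stats(v)$out)

print(group_medians)
print(group_outliers)
```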

6.6 Time Sequence Plots

When plotting data involving time, we use the time sequence or time series plot. The target quantity would be on the y-axis and the time on the x-axis.

When examining a time series plot, pay attention to a possible trend and/or cyclic variation.

The plot shows an increasing linear trend and cyclic variation.
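To try a time series plot yourself, here is a minimal sketch using AirPassengers, a dataset that ships with R (monthly airline passenger totals, 1949 to 1960). It is not the data behind the plot discussed above; it is used only because it also shows an upward trend together with a repeating yearly pattern.

```r
# AirPassengers is a built-in monthly time series (not the data from the notes)
plot(AirPassengers,
     main = "Monthly airline passengers",
     xlab = "Year",                    # time goes on the x-axis
     ylab = "Passengers (thousands)"   # the target quantity goes on the y-axis
    )
```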

6.7 Scatter Diagrams

Video: https://www.youtube.com/watch?v=ORPOMJzaKFM

When the data points in a scatterplot tend to fall along a straight line, we can calculate the so-called correlation coefficient, which quantifies the strength of the linear relationship between the two quantitative variables.

The correlation coefficient, denoted by $r$, is calculated by the following formula:

\[r_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\]

which can be reduced to

\[r_{xy} = \frac{\sum(x_i-\bar{x})y_i}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\] or even simpler

\[r_{xy} = \frac{\sum(x_i y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum y_i^2-n\bar{y}^2}}\]
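A short sketch of why these reductions hold, using only the fact that deviations from the mean sum to zero, $\sum(x_i-\bar{x})=\sum x_i-n\bar{x}=0$:

\[\sum(x_i-\bar{x})(y_i-\bar{y})=\sum(x_i-\bar{x})y_i-\bar{y}\sum(x_i-\bar{x})=\sum(x_i-\bar{x})y_i\]

\[\sum(x_i-\bar{x})y_i=\sum x_i y_i-\bar{x}\sum y_i=\sum x_i y_i-n\bar{x}\bar{y}\]

\[\sum(x_i-\bar{x})^2=\sum x_i^2-2\bar{x}\sum x_i+n\bar{x}^2=\sum x_i^2-n\bar{x}^2\]

The last identity applies in the same way to $\sum(y_i-\bar{y})^2$.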

The value of $r$ is always between $-1$ and 1. A value closer to 1 indicates a strong, positive, linear relationship, while a value closer to $-1$ indicates a strong, negative, linear relationship. The following shows some typical scatterplots.

An example. The following data are from an article on the quality of different young red wines in the Journal of the Science of Food and Agriculture (1974, Vol. 25(11), pp. 1369–1379) by T.C. Somers and M.E. Evans. The authors reported quality along with several other descriptive variables. We show only quality, pH, total SO2 (in ppm), color density, and wine color for a sample of their wines.

Quality = c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0, 15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)

pH = c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98, 3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)

TotalSO2 = c(66, 79, 73, 86, 178, 108, 96, 59, 22, 58, 120, 144, 100, 104, 67, 89, 192, 301, 99, 66)

ColorDensity = c(9.35, 11.15, 9.40, 6.40, 3.60, 5.80, 5.00, 10.25, 8.20, 10.15, 8.80, 5.60, 5.55, 8.70, 7.41, 5.35, 6.35, 4.25, 12.85, 4.90)

Color = c(5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00, 5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75)

D=data.frame(Quality, pH, TotalSO2, ColorDensity, Color)

Quality	pH	TotalSO2	ColorDensity	Color
19.2	3.85	66	9.35	5.65
18.3	3.75	79	11.15	6.95
17.1	3.88	73	9.40	5.75
15.2	3.66	86	6.40	4.00
14.0	3.47	178	3.60	2.25
13.8	3.75	108	5.80	3.20
12.8	3.92	96	5.00	2.70
17.3	3.97	59	10.25	6.10
16.3	3.76	22	8.20	5.00
16.0	3.98	58	10.15	6.00
15.7	3.75	120	8.80	5.50
15.3	3.77	144	5.60	3.35
14.3	3.76	100	5.55	3.25
14.0	3.76	104	8.70	5.10
13.8	3.90	67	7.41	4.40
12.5	3.80	89	5.35	3.15
11.5	3.65	192	6.35	3.90
14.2	3.60	301	4.25	2.40
17.3	3.86	99	12.85	7.70
15.8	3.93	66	4.90	2.75

Calculate the correlation between Quality and pH.
Plot all pairs.

Solution.

The details are given below, where $x$ denotes Quality, $y$ denotes pH, and $n=20$:

\[\bar{x}=15.22,\quad \bar{y}=3.7885,\quad \sum x_i^2=4708.82,\quad \sum y_i^2=287.3713,\quad \sum x_i y_i=1154.931\]

\[r = \frac{\sum(x_i y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum y_i^2-n\bar{y}^2}}=\frac{1154.931-20\cdot 15.22\cdot 3.7885}{\sqrt{4708.82-20\cdot 15.22^2}\cdot\sqrt{287.3713-20\cdot 3.7885^2}}=0.3492\]
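The hand calculation can be checked in R. The sketch below evaluates the computational formula term by term and compares it with the built-in cor() function; the variable names r_formula and r_builtin are only illustrative.

```r
Quality = c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
            15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)
pH = c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98,
       3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)

n = length(Quality)

# Computational formula: (sum xy - n*xbar*ybar) / (sqrt(Sxx) * sqrt(Syy))
numerator = sum(Quality * pH) - n * mean(Quality) * mean(pH)
denominator = sqrt(sum(Quality^2) - n * mean(Quality)^2) *
  sqrt(sum(pH^2) - n * mean(pH)^2)
r_formula = numerator / denominator

# Built-in correlation for comparison
r_builtin = cor(Quality, pH)

print(r_formula)
print(r_builtin)
```

Both values should agree with the hand calculation up to rounding.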

The set of scatterplots for all pairs of two variables is called a scatterplot matrix:

plot(D)

# To plot one pair, say the pair for Quality versus pH, we do
plot(D$pH, D$Quality, xlab = "pH", ylab = "Quality") # pH will be on the x-axis

# The following also works
plot(D$Quality ~ D$pH, xlab = "pH", ylab = "Quality") # pH will be on the x-axis

The plots suggest that there is a strong positive correlation between ColorDensity and Color.

7 Point Estimation of Parameters and Sampling Distributions

A population is a collection of subjects or objects of interest. Examples of populations are:

All SCSU students
All cars manufactured by Tesla.
All animals in a forest.

For a given population, we might be interested in a quantity that describes the population. Such a quantity is called a parameter. For the population of all SCSU students, we might be interested in the proportion of students who would like to become a CEO in the future. For the population of all animals in a forest, we might be interested in the mean age of all animals.

In general, the proportion (denoted by $p$) and the mean (denoted by $\mu$) are two commonly studied types of parameters.

7.1 Point Estimation

How can we estimate a parameter? It’s usually unrealistic to check each individual in a population in order to calculate a parameter. Instead, a random sample is drawn. If the sampling method is sound, the sample is expected to be representative of the population, and thus the sample counterpart of the population parameter can be a good estimate of the parameter. For example, to estimate the mean of a population, we can use the sample mean; to estimate the proportion of a population, we can use the sample proportion. These quantities depend on the observations and are called statistics, and their observed values are the point estimates of the corresponding parameters. We will introduce interval estimates in the next chapter, where we give a range for the parameter and state how confident we are that the interval covers the unknown parameter.

The following are commonly used point estimates:

the sample mean (denoted $\bar{x}$) for the population mean
the sample proportion (denoted $\hat{p}$) for the population proportion
the difference in sample means $\bar{x}_1-\bar{x}_2$ for the difference in two population means
the difference in sample proportions (denoted $\hat{p}_1-\hat{p}_2$) for the difference in two population proportions
the sample variance for the population variance

7.2 Sampling Distributions and the Central Limit Theorem

When estimating a population parameter, we start with the best estimate: the sample statistic, which is the sample counterpart of the parameter. For example, if the parameter is a population mean, then the sample statistic is the sample mean; if the parameter is a population proportion, then the sample statistic is the sample proportion.

Since the sample statistic depends on the sample and the sample is random, the sample statistic must be random. The distribution of all the possible values of the sample statistic is called the sampling distribution. It describes how the sample statistic varies across different samples of the same size taken from the population.

Here is an app showing what a sampling distribution is: https://www.lock5stat.com/StatKey/

To use this app, in the row indicated by “Sampling Distribution”, click either “Mean” or “Proportion”. Let’s say you have clicked “Mean”. From the dropdown menu at the upper-left corner, choose a dataset which can be viewed as a population. Let’s say you have chosen “Baseball Players-3e”. Set an appropriate sample size, and then click “Generate 1000 samples”. Now, you have some results including 3 graphs. The first graph on the right side of the screen shows the histogram of the population, the second graph shows the histogram of the most recent sample, and the third graph or the main graph shows the dotplot of all the sample means. The first two graphs should be similar if the sample size is large. The (third) graph of the sampling distribution may not necessarily be similar to that of the population, but they do have similar center if the sample size is large. In addition, the spread (described by the standard error, std. error) of the sampling distribution gets smaller if the sample size is larger. Try the app out!

Two general results:

The mean of the sample mean ($\bar{x}$) is always equal to the population mean ($\mu$). The standard deviation of the sample mean is always equal to the population standard deviation divided by the square root of the sample size.
The mean of the sample proportion ($\hat{p}$) is always equal to the population proportion ($p$). The standard deviation of the sample proportion is always equal to $\sqrt{\frac{p(1-p)}{n}}$.

For a continuous population with mean $\mu$ and standard deviation $\sigma$, in general,

if the population is normally distributed, the sample mean ($\bar{x}$) has a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
if the population is not normally distributed, the sample mean ($\bar{x}$), when the sample size $n$ is large (say > 30), approximately has a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.

For a discrete population with proportion $p$, in general,

the sample proportion ($\hat{p}$), when the sample size $n$ is large (say > 20), approximately has a normal distribution with mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.

These results are called the Central Limit Theorems (CLT’s).

Example 1.

A random sample of size 64 is selected from a population with an exponential distribution having mean 2. What is the probability that the sample mean exceeds 1.8?

Solution.

Note that the population standard deviation is also 2, since for the exponential distribution, the mean and standard deviation are the same.

Since the sample size is relatively large, by the CLT, the sample mean has an approximately normal distribution with mean 2 and standard deviation $\frac{2}{\sqrt{64}}=0.25$. That is, $\bar{X}$ is approximately normally distributed with mean 2 and standard deviation 0.25. We need to calculate $P(\bar{X}>1.8)$.

\[P(\bar{X}>1.8)=P\left(\frac{\bar{X}-2}{0.25}>\frac{1.8-2}{0.25}\right)=P(Z>-0.8)=0.7881\]

You can use a standard normal table (https://www.z-table.com/) or the following R code to find $P(Z>-0.8)$:

1 - pnorm(-0.8, 0, 1)  # 0 is the mean and 1 is the standard deviation.

## [1] 0.7881446
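The CLT answer can be compared with a simulation. The following sketch draws many samples of size 64 from an exponential population with mean 2 and looks at the resulting sample means. The seed and the number of repetitions (10000) are arbitrary choices, and the simulated probability will only be close to, not exactly equal to, the normal approximation.

```r
set.seed(353)   # arbitrary seed, only for reproducibility

n = 64          # sample size from the example
n_rep = 10000   # number of simulated samples (arbitrary choice)

# Each repetition: draw one sample and record its mean (rate = 1/mean for rexp)
xbar = replicate(n_rep, mean(rexp(n, rate = 1/2)))

sim_mean = mean(xbar)        # should be close to the population mean 2
sim_sd = sd(xbar)            # should be close to 2/sqrt(64) = 0.25
sim_prob = mean(xbar > 1.8)  # simulated P(sample mean > 1.8)

print(c(sim_mean, sim_sd, sim_prob))

# Histogram of the simulated sampling distribution
hist(xbar, breaks = 30, main = "Simulated sampling distribution", xlab = "Sample mean")
```

The histogram should look roughly bell-shaped even though the exponential population itself is strongly right-skewed.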

7.3 General Concepts of Point Estimation

We use a function of the data to estimate an unknown parameter. Such a function is called a statistic and gives an estimate (called a point estimate) of the parameter. We prefer the statistic exhibiting favorable properties, including lack of bias and minimal variance. All statistics are random variables.

7.4 Unbiased Estimators

A statistic is said to be unbiased, if the mean of the distribution of the statistic is equal to the parameter to be estimated.

The sample mean ($\bar{X}$), the sample proportion ($\hat{p}$), and the sample variance ($s^2$) are all unbiased for their population counterparts.

In general, we would like a statistic to be unbiased and to have a low variance. In practice, these goals are usually difficult to achieve at the same time (this phenomenon is called the bias-variance trade-off) unless we collect more data. To achieve one of the two goals, we usually have to sacrifice the other.
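Unbiasedness can be illustrated by simulation. The sketch below assumes a hypothetical normal population with variance 4 and samples of size 10, and compares the sample variance $s^2$ (divisor $n-1$) with the version that divides by $n$. All settings here are illustrative choices, not part of the course examples.

```r
set.seed(353)   # arbitrary seed, only for reproducibility

n = 10          # sample size (hypothetical)
sigma = 2       # population standard deviation, so the true variance is 4

# For each simulated sample, compute both versions of the variance estimate
sims = replicate(10000, {
  x = rnorm(n, mean = 0, sd = sigma)
  c(unbiased = var(x),               # divides by n - 1
    biased = var(x) * (n - 1) / n)   # divides by n
})

# Average of each estimator across all simulated samples
avg = rowMeans(sims)
print(avg)
```

The average of the $n-1$ version should be close to the true variance 4, while the divide-by-$n$ version should be close to $4\cdot\frac{9}{10}=3.6$, that is, it underestimates on average.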

7.5 Variance of a Point Estimator

We can derive the following results:

The variance of the sample mean (as a random variable) is always equal to the population variance divided by the sample size.

\[V(\bar{X})=\frac{\sigma^2}{n}\]

The variance of the sample proportion (as a random variable) is always equal to

\[V(\hat{p})=\frac{p(1-p)}{n}\] where $p$ is the population proportion.
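A sketch of the derivation, assuming the observations $X_1,\dots,X_n$ are independent and each has variance $\sigma^2$:

\[V(\bar{X})=V\left(\frac{1}{n}\sum X_i\right)=\frac{1}{n^2}\sum V(X_i)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}\]

The sample proportion is a special case: if each $X_i$ equals 1 with probability $p$ and 0 otherwise, then $\hat{p}=\bar{X}$ and $V(X_i)=p(1-p)$, so

\[V(\hat{p})=\frac{p(1-p)}{n}\]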

7.6 Standard Error: Reporting a Point Estimate

The standard deviation of the sample mean

\[\frac{\sigma}{\sqrt{n}}\] depends on $\sigma$, which is usually unknown. A good estimate of $\sigma$ is the sample standard deviation $s$. Therefore, $\frac{\sigma}{\sqrt{n}}$ can be estimated by $\frac{s}{\sqrt{n}}$. The latter is called the standard error of the sample mean.

Similarly, the standard deviation of the sample proportion

\[\sqrt{\frac{p(1-p)}{n}}\] can be estimated by \[\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] which is called the standard error of the sample proportion.

The standard error of a point estimator is a measure of the variability or precision of the estimator.

Example.

An article in the Journal of Heat Transfer (Trans. ASME, Sec. C, 96, 1974, p. 59) described a new method of measuring the thermal conductivity of Armco iron. Using a temperature of 100°F and a power input of 550 watts, the following 10 measurements of thermal conductivity (in Btu/hr-ft-°F) were obtained:

41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04

Obtain a point estimate of the mean thermal conductivity of Armco iron.
What is the standard error of such an estimate?

Solution.

The sample mean $\bar{x}=41.924$ and the sample standard deviation is 0.2841.

A point estimate of the mean thermal conductivity of Armco iron is 41.924.
The standard error of the point estimate obtained in part (a) is $\frac{s}{\sqrt{n}}=\frac{0.2841}{\sqrt{10}}=0.0898$. The standard error of 0.0898 suggests that the sample mean is relatively precise because it doesn’t vary much from sample to sample. It means that the 10 measurements taken are fairly consistent, and the sample mean is likely a good representation of the true mean.
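These numbers can be reproduced in R. This is a minimal sketch; the variable name conductivity is only illustrative.

```r
# Thermal conductivity measurements from the example
conductivity = c(41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04)

x_bar = mean(conductivity)               # point estimate of the mean
s = sd(conductivity)                     # sample standard deviation
se = s / sqrt(length(conductivity))      # standard error of the sample mean

print(x_bar)
print(s)
print(se)
```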

Example.

In a random sample of 50 PCs, 12 are Dell’s PCs.

Obtain a point estimate of the proportion of PCs that are Dell’s PCs.
What is the standard error of such an estimate?

Solution.

The sample proportion $\hat{p}=\frac{12}{50}=0.24$.

A point estimate of the proportion of PCs that are Dell’s PCs is 0.24.
The standard error of the point estimate obtained in part (a) is $\sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}=\sqrt{\frac{0.24\cdot (1-0.24)}{50}}=0.06$. This standard error represents the variability associated with the point estimate of the proportion of Dell’s PCs in the population.
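The same calculation in R, as a minimal sketch with illustrative variable names:

```r
n = 50      # sample size
dell = 12   # number of Dell PCs in the sample

p_hat = dell / n                        # point estimate of the proportion
se_p = sqrt(p_hat * (1 - p_hat) / n)    # standard error of the sample proportion

print(p_hat)
print(round(se_p, 4))
```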

8 Statistical Intervals for a Single Sample

The point estimate of a parameter does not tell how accurate the estimation is. Another way of estimating a parameter is to use a range of values, called an interval estimate. An interval estimate is associated with a certain level of confidence, which tells how sure we are that the interval covers the unknown parameter; such an interval is therefore often called a confidence interval.

In addition to the confidence interval method, we later will introduce the method of testing hypotheses about a population. Both methods belong to the so-called statistical inference, which is the process of making an inference about a population using a sample.

8.1 Confidence Interval on the Mean of a Normal Distribution, Variance Known

The $1-\alpha$ confidence interval on the mean of a normal distribution with a known variance ($\sigma^2$) is given by

\[\bar{x}\pm z_{\alpha/2}\cdot \frac{\sigma}{\sqrt{n}} \]

where $z_{\alpha/2}$ is called the critical value which is the cutoff of the standard normal distribution separating the top $\alpha/2$ tail area from the other area.

The part $z_{\alpha /2}\cdot \frac{\sigma}{\sqrt{n}}$ is called the margin of error.

The length of a confidence interval is a measure of precision of estimation.

The above confidence interval is called a z confidence interval, since it is based on the standard normal density, which is also known as the z-density.

The value of $\alpha$ is usually 0.1, 0.05, or 0.01, and the corresponding (right) cutoffs (called the critical z-values, denoted by $z_{\alpha/2}$) are 1.645, 1.96, and 2.576, respectively. They are obtained by a standard normal table or the following R code:

qnorm(1-(0.1/2)) # alpha = 0.1 and the code gives 1.645

## [1] 1.644854

qnorm(1-(0.05/2)) # alpha = 0.05 and the code gives 1.96

## [1] 1.959964

qnorm(1-(0.01/2)) # alpha = 0.01 and the code gives 2.576

## [1] 2.575829

Example 1.

ASTM Standard E23 defines standard test methods for notched bar impact testing of metallic materials. The Charpy V-notch (CVN) technique measures impact energy and is often used to determine whether or not a material experiences a ductile-to-brittle transition with decreasing temperature. Ten measurements of impact energy (J) on specimens of A238 steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, and 64.3. Assume that impact energy is normally distributed with $\sigma$ = 1 J. We want to find a 95% CI for $\mu$, the mean impact energy.

Solution.

The required quantities are $n = 10, \sigma = 1, \bar{x}=64.46$, and $\alpha = 0.05$. The critical value $z_{\alpha/2}$ is 1.96. So the 95% confidence interval for the mean impact energy is $64.46\pm 1.96\cdot \frac{1}{\sqrt{10}}=64.46\pm 0.6198$ or $63.84 < \mu< 65.08$.
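The interval can be computed step by step in R. This sketch follows the formula $\bar{x}\pm z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}$ directly; the variable names are only illustrative.

```r
# Impact energy measurements (J) from the example
energy = c(64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3)

sigma = 1            # known population standard deviation
conf_level = 0.95
alpha = 1 - conf_level

z_crit = qnorm(1 - alpha / 2)                    # critical value z_{alpha/2}
margin = z_crit * sigma / sqrt(length(energy))   # margin of error
ci = mean(energy) + c(-1, 1) * margin            # lower and upper limits

print(round(ci, 2))
```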

Interpretation of confidence intervals:

The confidence interval is a random interval, so a new sample will give a different interval. The confidence level $1-\alpha$ reflects the proportion of all possible confidence intervals that would cover the true population parameter. So, we can say “we are 95% confident that the mean impact energy is between 63.84 J and 65.08 J.”

The following is a graphical interpretation of confidence intervals (assuming the sample size is 60 and the true population mean is 2.3):

This picture shows that based on 100 confidence intervals of level 95%, about 5 intervals fail to cover the true value of $\mu$, which is indicated by the horizontal red line.
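The same idea can be checked by simulation. The sketch below uses the sample size 60 and true mean 2.3 mentioned above; the normal population, the standard deviation of 1, the seed, and the number of repetitions are assumptions made only for this illustration.

```r
set.seed(353)   # arbitrary seed, only for reproducibility

mu = 2.3        # true population mean (from the text)
n = 60          # sample size (from the text)
sigma = 1       # population standard deviation (assumed for this sketch)
n_rep = 10000   # number of simulated confidence intervals

z_crit = qnorm(0.975)               # critical value for a 95% interval
margin = z_crit * sigma / sqrt(n)   # margin of error (same for every sample)

# For each simulated sample, check whether the interval covers the true mean
covered = replicate(n_rep, {
  x_bar = mean(rnorm(n, mean = mu, sd = sigma))
  (x_bar - margin < mu) && (mu < x_bar + margin)
})

coverage = mean(covered)
print(coverage)
```

The proportion of intervals covering the true mean should be close to 0.95.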

Is the following statement correct?

If a 95% confidence interval on the mean has a lower limit of 10 and an upper limit of 15, this implies that 95% of the time the true value of the mean is between 10 and 15.

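No. The true value of the mean is not random; it is a fixed (though unknown) number, so it either lies between 10 and 15 or it does not. The 95% refers to the procedure: about 95% of the intervals constructed this way from repeated samples would cover the true mean.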

The confidence interval can be impacted by a few factors:

the sample size ($n$): the larger the $n$, the shorter the interval and thus the more precise the interval estimate.
the population standard deviation ($\sigma$): the smaller the $\sigma$, the shorter the interval.
the confidence level $1-\alpha$: the smaller the $\alpha$, the higher the confidence level, and the wider the interval.

Determining the Sample Size for Specified Error on the Mean, Variance Known:

If the sample mean is used as an estimate of the population mean, we can be $1-\alpha$ confident that the error $|\bar{x}-\mu|$ will not exceed a specified amount $E$ when the sample size is $n=(\frac{z_{\alpha/2} }{E}\cdot \sigma)^2$.

Example 2.

To get a 95% confidence interval for a population mean with a margin of error of at most 0.6, what sample size should be used? Assume that the population standard deviation is 2.

Solution.

\[n=(\frac{z_{\alpha/2} }{E}\cdot \sigma)^2=(\frac{1.96 }{0.6}\cdot 2)^2=42.68\]

The sample size should be at least 43.
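The sample size formula is easy to wrap into a few lines of R. In this sketch, ceiling() rounds up to the next whole number because a sample size must be an integer; the variable names are only illustrative.

```r
sigma = 2           # population standard deviation
E = 0.6             # largest acceptable error
conf_level = 0.95

z_crit = qnorm(1 - (1 - conf_level) / 2)   # critical value z_{alpha/2}
n_exact = (z_crit * sigma / E)^2           # value given by the formula
n_needed = ceiling(n_exact)                # round up to a whole number

print(n_exact)
print(n_needed)
```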

Sometimes, we may only want a lower bound or an upper bound for a confidence interval.

To construct a $1-\alpha$ lower-confidence bound on the population mean, use $\mu > \bar{x}-z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}$.
To construct a $1-\alpha$ upper-confidence bound on the population mean, use $\mu < \bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}$.

Example.

In a location, the temperatures measured in June were obtained as follows:

13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0

Assume that the population standard deviation is 0.3.

Construct a 90% lower-confidence bound on the mean temperature.
Construct a 90% upper-confidence bound on the mean temperature.

Solution.

The sample mean is 13.93636. We are given $\alpha = 0.10$, so $z_{\alpha}=1.28$.

A 90% lower-confidence bound on the mean temperature is $\bar{x}-z_{\alpha}\cdot \frac{\sigma}{\sqrt{n}}=13.93636-1.28\cdot \frac{0.3}{\sqrt{11}}=13.82$. That is, the 90% one-sided confidence interval with lower bound is $(13.82, \infty)$.
A 90% upper-confidence bound on the mean temperature is $\bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}=13.93636+1.28\cdot\frac{0.3}{\sqrt{11}}=14.05$. That is, the 90% one-sided confidence interval with upper bound is $( -\infty, 14.05)$.

The following R function written by the instructor can help you find a $z$ confidence interval:

z.test = function(x, alternative = c("two.sided", "less", "greater"), mu = 0, sigma = 1, conf.level = 0.95){
  n=length(x); m = mean(x); z = (m-mu)/(sigma/sqrt(n)); p = pnorm(z)
  
  if (alternative[1]=="less"){
    pvalue = p
    LB = -Inf
    UB = m+qnorm(conf.level)*sigma/sqrt(n)
  } else if (alternative[1]=="greater"){
    pvalue = 1- p
    LB = m-qnorm(conf.level)*sigma/sqrt(n)
    UB = Inf
  } else {
    pvalue = 2*min(p, 1-p)
    LB = m-qnorm((1+conf.level)/2)*sigma/sqrt(n)
    UB = m+qnorm((1+conf.level)/2)*sigma/sqrt(n)
  }
  
  m = round(m, 5)
  z = round(z, 5)
  pvalue = round(pvalue, 5)
  LB = round(LB, 5)
  UB = round(UB, 5)
  
  cat(paste("     One Sample z-test\n\n", "data: ", deparse(substitute(x))), "\n",
     "z =", z, "p-value =", pvalue, "\n",
      paste("alternative hypothesis: true mean is",
            ifelse(alternative[1] == "two.sided", "not equal to", alternative[1]), mu, "\n", 
            paste(100*conf.level, "percent confidence interval:\n"), "   ", LB, UB, 
            "\n", "Sample estimates:\n", "mean of x\n", "   ", m), "\n")
            
}

How can you use the function to do the previous example?

x = c(13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0)

# (a) 
z.test(x, conf.level = 0.9, alternative = "greater", sigma=0.3)

##      One Sample z-test
## 
##  data:  x 
##  z = 154.0723 p-value = 0 
##  alternative hypothesis: true mean is greater 0 
##  90 percent confidence interval:
##      13.82044 Inf 
##  Sample estimates:
##  mean of x
##      13.93636

# (b) 
z.test(x, conf.level = 0.9, alternative = "less", sigma=0.3)

##      One Sample z-test
## 
##  data:  x 
##  z = 154.0723 p-value = 1 
##  alternative hypothesis: true mean is less 0 
##  90 percent confidence interval:
##      -Inf 14.05228 
##  Sample estimates:
##  mean of x
##      13.93636

The results are almost the same as those obtained by hand. They are not exactly the same because the code uses a more accurate critical value: the hand calculation uses 1.28, while the code uses 1.281552.

8.2 Confidence Interval on the Mean of a Normal Distribution, Variance Unknown

When the population variance is unknown, it has to be estimated before a confidence interval can be constructed. This estimation introduces additional uncertainty, so the interval is expected to be longer than when the population variance is known. This is reflected in the change of the critical value in the formula. The new critical value is related to the $t$-distribution.

8.2.1 t Distribution

8.2.2 t Confidence Interval on the Population Mean

The $1-\alpha$ confidence interval on the mean of a normal distribution with an unknown variance is given by

\[\bar{x}\pm t_{\alpha/2}\cdot \frac{s}{\sqrt{n}} \]

where

$\bar{x}$ is the sample mean
$s$ is the sample standard deviation
$t_{\alpha/2}$ is called the t-critical value which is the cutoff of the $t$-distribution separating the top $\alpha/2$ tail area from the other area.
the degrees of freedom of the t-distribution is $n-1$.

The following gives part of a t table that shows the critical values and right-tail areas for each given number of degrees of freedom.

To use the table, let’s give some examples:

when sample size is 10, the degrees of freedom is 9. For a 95% confidence interval, $\alpha$ would be 0.05, and $\alpha/2$ is 0.025. From the graph and the t table above, the critical t-value is 2.262.
when sample size is 8, the degrees of freedom is 7. For a 90% confidence interval, $\alpha$ would be 0.10, and $\alpha/2$ is 0.05. From the table above, the critical t-value is 1.895.

A more detailed table can be found here: https://www.craftonhills.edu/current-students/tutoring-center/mathematics-tutoring/distribution_tables_normal_studentt_chisquared.pdf. For example,

when df = 15 and the upper tail area is 0.075, the critical value is 1.517;
when df = 22 and the upper tail area is 0.025, the critical value is 2.074;
when df = 35 and the critical value is 2.438, the upper tail area is 0.01;
when df = 12 and the critical value is 2.234, the upper tail area is between 0.01 and 0.025.
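The table lookups above can also be reproduced in R. The sketch below uses the base functions qt() and pt(); the variable names are illustrative. Note that qt() expects a left-tail probability, so an upper tail area $a$ is entered as $1-a$.

```r
# Critical values: qt() takes the LEFT-tail probability,
# so use 1 - (upper tail area)
t_df9  <- qt(1 - 0.025, df = 9)    # 95% interval, n = 10
t_df7  <- qt(1 - 0.05,  df = 7)    # 90% interval, n = 8
t_df15 <- qt(1 - 0.075, df = 15)
t_df22 <- qt(1 - 0.025, df = 22)

# Upper tail areas: pt() with lower.tail = FALSE
area_df35 <- pt(2.438, df = 35, lower.tail = FALSE)
area_df12 <- pt(2.234, df = 12, lower.tail = FALSE)

round(c(t_df9, t_df7, t_df15, t_df22), 3)
round(c(area_df35, area_df12), 4)
```

Software gives the exact tail area for df = 12, whereas the table only brackets it between two columns.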

Example 1.

Engineers want to determine the average tensile strength of a new type of steel produced by a company. They take a random sample of 50 steel rods and measure their tensile strength. The data are:

501, 504, 499, 503, 502, 505, 498, 500, 506, 503, 505, 501, 499, 502, 504, 500, 502, 503, 506, 499, 502, 503, 501, 504, 502, 498, 499, 503, 504, 501, 503, 498, 506, 500, 502, 499, 505, 504, 503, 501, 500, 504, 502, 506, 499, 503, 505, 501, 498, 502

Calculate a 95% confidence interval for the mean tensile strength of all steel rods produced by the company.

Solution.

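When doing this by hand, the sample mean is 502 and the sample standard deviation is 2.3561. With $\alpha = 0.05$ and $n-1=49$ degrees of freedom, the critical t-value $t_{\alpha /2}$ based on a t-table is 2.01, so the interval is $502\pm 2.01\cdot \frac{2.3561}{\sqrt{50}}=502\pm 0.67$, or (501.33, 502.67). The same interval can be obtained in R: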

x = c(501, 504, 499, 503, 502, 505, 498, 500, 506, 503,
505, 501, 499, 502, 504, 500, 502, 503, 506, 499,
502, 503, 501, 504, 502, 498, 499, 503, 504, 501,
503, 498, 506, 500, 502, 499, 505, 504, 503, 501,
500, 504, 502, 506, 499, 503, 505, 501, 498, 502)

t.test(x, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  x
## t = 1506.6, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  501.3304 502.6696
## sample estimates:
## mean of x 
##       502

Confidence Interval: 95% confidence interval for the mean tensile strength of the steel rods is (501.33 MPa, 502.67 MPa).

Interpretation:

With 95% confidence, we estimate that the true average tensile strength of all steel rods produced by the company falls within the range of approximately 501.33 megapascals (MPa) to 502.67 MPa.

This means that if we were to take multiple random samples of 50 steel rods and calculate 95% confidence intervals from each sample, we would expect approximately 95% of those intervals to contain the true population mean tensile strength.

The interval is narrow (about 1.3 MPa wide), which indicates that the mean tensile strength is estimated precisely. Note that the interval describes uncertainty about the population mean, not the range of strengths of individual rods.
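To see how the formula $\bar{x}\pm t_{\alpha/2}\cdot s/\sqrt{n}$ produces the interval in Example 1, the following sketch repeats the calculation step by step. It re-enters the Example 1 data; the variable names are illustrative and not part of the original notes.

```r
# Example 1 data (tensile strength, 50 rods)
x <- c(501, 504, 499, 503, 502, 505, 498, 500, 506, 503,
       505, 501, 499, 502, 504, 500, 502, 503, 506, 499,
       502, 503, 501, 504, 502, 498, 499, 503, 504, 501,
       503, 498, 506, 500, 502, 499, 505, 504, 503, 501,
       500, 504, 502, 506, 499, 503, 505, 501, 498, 502)

n      <- length(x)                     # sample size
xbar   <- mean(x)                       # sample mean
s      <- sd(x)                         # sample standard deviation
t_crit <- qt(1 - 0.05 / 2, df = n - 1)  # t critical value for 95% confidence
margin <- t_crit * s / sqrt(n)          # margin of error

ci <- c(lower = xbar - margin, upper = xbar + margin)
round(ci, 2)
```

The result should agree with the interval reported by t.test(x, conf.level = 0.95).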

Example 2.

Environmental engineers monitor air pollution levels in a city. They want to estimate the average concentration of a pollutant in the air over a specific period. They collect air quality measurements at various locations throughout the city. The data are:

18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2, 19.1, 18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7, 19.0, 18.9, 18.8, 19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0, 18.6, 19.2, 19.0, 18.8, 19.3, 18.7, 18.9, 19.1, 18.5, 18.6, 19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9, 18.6, 19.0, 18.5

Calculate a 90% confidence interval for the mean pollutant concentration in the city based on these 50 measurements.

Solution.

x = c(18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2,
19.1, 18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7,
19.0, 18.9, 18.8, 19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0,
18.6, 19.2, 19.0, 18.8, 19.3, 18.7, 18.9, 19.1, 18.5, 18.6,
19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9, 18.6, 19.0, 18.5)

t.test(x, conf.level = 0.90)

## 
##  One Sample t-test
## 
## data:  x
## t = 549.73, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  18.85632 18.97168
## sample estimates:
## mean of x 
##    18.914

Confidence Interval: 90% confidence interval for the mean pollutant concentration in the city is (18.86 ppm, 18.97 ppm).

Interpretation:

With 90% confidence, we estimate that the true average concentration of the pollutant in the air over the specified period falls within the range of approximately 18.86 parts per million (ppm) to 18.97 ppm.

This means that if we were to take multiple random samples of 50 air quality measurements and calculate 90% confidence intervals from each sample, we would expect approximately 90% of those intervals to contain the true population mean pollutant concentration.

The interval is relatively narrow (about 0.12 ppm wide), indicating that the mean pollutant concentration is estimated with high precision. The interval describes uncertainty about the mean, not the variability of individual measurements.

8.3 Large-Sample Confidence Interval for a Population Proportion

The $1-\alpha$ confidence interval for a population proportion is given by

\[\hat{p}\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

Example 3.

In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface finish that is rougher than the specifications allow.

Find a point estimate of the proportion of bearings in the population (denoted $p$) that exceeds the roughness specification.
Construct a 95% two-sided confidence interval for $p$.

Solution.

$\hat{p}=\frac{10}{85}=0.1176$.
The 95% confidence interval is

\[\hat{p}\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=0.1176\pm 1.96\cdot \sqrt{\frac{0.1176(1-0.1176)}{85}}=0.1176\pm 0.0685\] or between 0.0491 and 0.1861.
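The arithmetic in Example 3 can be checked with a few lines of R. This is a minimal sketch of the large-sample formula above; the variable names are illustrative.

```r
x <- 10   # bearings rougher than the specification
n <- 85   # sample size

p_hat  <- x / n                              # point estimate of p
z      <- qnorm(1 - 0.05 / 2)                # z critical value for 95% confidence
margin <- z * sqrt(p_hat * (1 - p_hat) / n)  # margin of error

ci <- c(lower = p_hat - margin, upper = p_hat + margin)
round(ci, 4)
```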

Sample size determination for a $1-\alpha$ confidence interval of the population proportion $p$

In order for the error $|\hat{p}-p|$ to be no greater than $E$, the sample size should be at least $(\frac{z_{\alpha/2}}{E})^2 \hat{p}(1-\hat{p})$ if an estimate $\hat{p}$ of $p$ is already available, or $(\frac{z_{\alpha/2}}{E})^2 (0.25)$ if there is no such estimate. Always round up to the next whole number!

Example 4.

To have a 95% confidence interval for a population proportion with an error of at most 0.02, what is the minimum required sample size?

Solution.

$(\frac{z_{\alpha/2}}{E})^2 (0.25)=(\frac{1.96}{0.02})^2 (0.25)=2401$
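The sample size formula is easy to wrap in a small function. The helper below, sample_size_prop(), is a hypothetical name introduced here for illustration; it is not a built-in R function. With no prior estimate it uses $p = 0.5$, which gives the factor 0.25.

```r
# Hypothetical helper (not part of base R): minimum sample size so that
# the error of a proportion interval is at most E
sample_size_prop <- function(E, conf.level = 0.95, p_guess = 0.5) {
  z <- qnorm(1 - (1 - conf.level) / 2)          # z critical value
  ceiling((z / E)^2 * p_guess * (1 - p_guess))  # always round up
}

n_needed <- sample_size_prop(E = 0.02)          # Example 4
n_needed

# With a prior estimate of p, fewer observations are needed
sample_size_prop(E = 0.02, p_guess = 0.1)
```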

8.4 Finding Confidence Intervals Using Software

Example 5.

A sample of 9 faculty are selected from a large university. Their years of service are 23, 34, 12, 40, 34, 52, 27, 28, 40. Find a 95% confidence interval for the population mean.

The following is the R code for finding a $t$-confidence interval for a population mean:

x = c(23, 34, 12, 40, 34, 52, 27, 28, 40) # This is the data vector

t.test(x, conf.level = 0.95) # Call the function "t.test" to do the calculation

The 95% confidence interval is: 23.38 to 41.06.

To use JMP, you may follow this video: https://www.youtube.com/watch?v=gDi4XWdCIbw

Example 6.

A sample of 80 faculty is selected from a large university. 52 of these people have had COVID. Find a 90% confidence interval for the population proportion.

The following is the R code for finding a confidence interval for a population proportion:

n = 80

x = 52

prop.test(x, n, conf.level = 0.90)

The 90% confidence interval is: 0.55 to 0.74.
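One point worth knowing: prop.test() does not use the formula from Section 8.3. By default it returns a score-type (Wilson) interval with a continuity correction, so its output differs slightly from a hand calculation. The sketch below compares the two, using the values x = 52 and n = 80 from the code above; the variable names are illustrative.

```r
x <- 52
n <- 80
p_hat <- x / n

# Interval from the Section 8.3 formula (often called the Wald interval)
z    <- qnorm(1 - 0.10 / 2)
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Interval reported by prop.test() (score interval, continuity-corrected)
score <- as.numeric(prop.test(x, n, conf.level = 0.90)$conf.int)

round(rbind(wald = wald, prop.test = score), 3)
```

Both intervals round to roughly the same answer here, but the difference can be larger for small samples or proportions near 0 or 1.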

Example 7.

An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005, Vol. 13, pp. 273–279) “Arthroscopic meniscal repair with an absorbable screw: results and surgical technique” showed that only 25 out of 37 tears (67.6%) located between 3 and 6 mm from the meniscus rim were healed.

Calculate a 95% two-sided confidence interval on the proportion of such tears that will heal. Round the answers to 3 decimal places.

The following is the R code for finding a confidence interval for a population proportion:

n = 37

x = 25

prop.test(x, n, conf.level = 0.95)

9 Tests of Hypotheses for a Single Sample

Confidence intervals are estimation methods for parameters. In practice, people may be interested in testing claims or hypotheses about parameters.

Warning: this chapter covers the most difficult concept in an introductory statistics course. Stay awake!

Review the concept of sampling distribution for the sample mean covered in chapter 8: https://www.youtube.com/watch?v=0zqNGDVNKgA

Consider the following situation:

A company selling juice claims that on average, each bottle of their juice is 295 ml. Let’s first assume that the claim is correct. We will randomly select 50 bottles of juice. Which of the following sample means would provide the most evidence against the claim (or for the opposite of the claim)? Which one provides the least evidence?

(a) 291 ml
(b) 296 ml
(c) 299 ml

Use the following sampling distribution to help you answer these questions.

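Answer: (a)=(c)>(b). The sample means 291 ml and 299 ml are equally far from 295 ml (4 ml each), so they provide the most evidence against the claim; 296 ml is only 1 ml away and provides the least.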

If the sample mean is very far from the center of the sampling distribution, we should reject the assumption. Otherwise, we do not reject it. But how far is far? We need a decision rule. One such rule is to reject the assumption when the standardized score or $z$-score of the sample mean is very different from 0 (i.e., $|z|>c$ for some $c$, called the critical value).

9.1 Hypothesis Testing

Watch this video: https://www.youtube.com/watch?v=zR2QLacylqQ a few times until it makes good sense to you.

Then watch this (the first 15 minutes): https://www.youtube.com/watch?v=VK-rnA3-41c

9.1.1 Statistical Hypotheses

A statistical hypothesis is a statement about the parameters of one or more populations.

The following are examples of statistical hypotheses:

A company selling juice claims that on average, each bottle of their juice is 295 ml.
A presidential candidate claims that more than 50% of registered voters would support her.
A company producing light bulbs claims that less than one percent of their products are defective.

When dealing with a problem involving statistical hypotheses, a pair of competing hypotheses, called the null and alternative hypotheses, respectively, is formed first. In the next subsection, we will talk about how such a pair is formed.

A procedure leading to a decision about whether the null hypothesis should be rejected is called a test of a hypothesis.

9.1.2 Tests of Statistical Hypotheses

A test of hypotheses can be summarized in four steps as follows:

Step 1: Specify the null and alternative hypotheses.
Step 2: Calculate the value of a test statistic.
Step 3: Calculate the critical value or $p$-value.
Step 4: Make a decision about whether to reject the null hypothesis and draw a conclusion.

To test each of the following claims,

A company selling juice claims that on average, each bottle of their juice is 295 ml.
A presidential candidate claims that more than 50% of registered voters would support her.
A company producing light bulbs claims that less than one percent of their products are defective.
The standard deviation of the lifetimes of all light bulbs produced by a company is less than 30 hours.
The difference in the mean household income between Minnesota and Iowa is greater than $500.

the null and alternative hypotheses are

$H_0: \mu = 295$ vs. $H_a: \mu \ne 295$
$H_0: p \le 0.5$ vs. $H_a: p > 0.5$
$H_0: p\ge 0.01$ vs. $H_a: p<0.01$
$H_0: \sigma = 30$ vs. $H_a: \sigma < 30$
$H_0: \mu_{\text{MN}}-\mu_{\text{IA}} = 500$ vs. $H_a: \mu_{\text{MN}}-\mu_{\text{IA}} > 500$

Technically, we can use an equality sign under all null hypotheses (indicated by $H_0$), but can never use any equality sign under the alternative hypothesis (indicated by $H_a$ or $H_1$). That is, the only possible signs under an alternative hypothesis are “>”, “<”, and “$\ne$.”

We can only use a parameter but never use a statistic to specify a hypothesis. The following would be all wrong:

$H_0: \bar{x} = 295$ vs. $H_a: \bar{x} \ne 295$
$H_0: \hat{p} \le 0.5$ vs. $H_a: \hat{p} > 0.5$
$H_0: s = 30$ vs. $H_a: s < 30$

because $\bar{x}, \hat{p}$ and $s$ all represent statistics.

Once the null and alternative hypotheses are determined, a test statistic is used to judge between them. The test statistic is an expression involving summary statistics and the parameter(s); its form depends on the context. We will use an argument similar to proof by contradiction in mathematics. Specifically, we first assume that the null hypothesis is true. We then calculate a quantity (a critical value or $p$-value) which indicates whether there is a contradiction. If yes, we reject the null hypothesis. Otherwise, we do not reject the null hypothesis.

Our decision might be wrong. There are two types of errors we could make, type I and type II errors, which are described below.

If the null hypothesis is true but rejected, we have made a type I error.
If the null hypothesis is false but not rejected, we have made a type II error.

The probability of making a type I error is denoted by $\alpha$. The probability of making a type II error is denoted by $\beta$. We would like both $\alpha$ and $\beta$ to be small. How can we reduce both? For a given test procedure, lowering one tends to raise the other, so increasing the sample size is the only way to reduce both at once.

The value or an upper bound of $\alpha$ is usually pre-set (or controlled) by the investigator and is called the level of significance.

9.1.3 One-Sided and Two-Sided Hypotheses

When the alternative hypothesis has a “<” or a “>” sign, the test is called a one-sided test. Furthermore, if it is “<”, the test is left-sided (or left-tailed); if it is “>”, the test is right-sided (or right-tailed).

When the alternative hypothesis has a “$\ne$” sign, the test is said to be two-sided (or two-tailed).

9.1.4 P-Values in Hypothesis Tests

When we decide whether the null hypothesis should be rejected, we can use one of several methods. One such method is the $p$-value method. The $p$-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small $p$-value (say, less than 5%) means that an outcome this extreme would be very unlikely to occur by chance if the null hypothesis were true. A relatively large $p$-value (say, greater than 5%) means that such an outcome could plausibly occur by chance under the null hypothesis.

More specifically,

Consider an observed value $t$ of a test statistic $T$. Then the $p$-value is $P(T\ge t|H_{0})$ for a one-sided right-tail test, $P(T\le t|H_{0})$ for a one-sided left-tail test, or $2\cdot \min\{P(T\ge t|H_{0}), P(T\le t|H_{0})\}$ for a two-sided test.

When the $p$-value is less than or equal to a pre-selected $\alpha$ (called the significance level), reject the null hypothesis. In this case, we also say that the result is statistically significant.

Statistical significance does not always imply practical significance. For example, if we find that the difference in mean income between Minnesota and Iowa residents is $3 and it is statistically significant, such a small difference would not matter in practice.

The $p$-values under one-sided and two-sided alternatives are related. When the null distribution of the test statistic is symmetric (as for the $z$ and $t$ tests), if the $p$-value of a one-sided test is $p$, the $p$-value of the corresponding two-sided test must be $2p$ or $2(1-p)$, whichever is between 0 and 1.

9.1.5 Connection between Hypothesis Tests and Confidence Intervals

If a $1-\alpha$ confidence interval does not include the value that is under the null hypothesis of a two-sided (or two tailed) test, we can reject the null hypothesis at the $\alpha$ level of significance.

Example.

Based on a sample, a 90% confidence interval for a population mean $\mu$ is $(23.67, 27.92)$. If we want to use the sample to test the following hypotheses

\[H_0: \mu=22 ~~~ vs ~~~ H_a: \mu\ne 22\] do we reject the null hypothesis at the significance level 0.10?

Solution.

Yes, reject the null hypothesis at level 0.10, since the hypothesized value 22 falls outside the 90% confidence interval.

You can also use a one-sided confidence bound to draw a conclusion about a one-sided test. For a left-tail test, you can develop a level $1-\alpha$ upper bound; for a right-tail test, you can develop a level $1-\alpha$ lower bound. When the value under the null hypothesis is beyond the bound, reject the null hypothesis at the significance level $\alpha$.

9.1.6 General Procedure for Hypothesis Tests

There are two procedures for hypothesis tests:

The critical value method: We need to determine the critical region (or rejection region). If the test is left-tailed (the alternative hypothesis reads something like $H_a: \mu<24$ or $H_a: p<0.35$), we find a cutoff (called $c$) that separates the lower $\alpha$ area under the (null) distribution of the test statistic from the other area. The critical region is $(-\infty, c)$. If the test is right-tailed (the alternative hypothesis reads something like $H_a: \mu>24$ or $H_a: p>0.35$), we find a cutoff (called $c$) that separates the upper $\alpha$ area under the distribution of the test statistic from the other area. The critical region is $(c, \infty)$. If the test is two-tailed, we find two cutoffs (called $c$ and $-c$) with $c$ separating the upper $\alpha/2$ area under the distribution of the test statistic from the other area and $-c$ separating the lower $\alpha/2$ area under the distribution of the test statistic from the other area. The critical region is $(-\infty, -c)\cup (c, \infty)$. In each case, if the value of the test statistic falls in the corresponding critical region, reject the null hypothesis.

For a test of hypotheses about a population mean with known population standard deviation, watch this video: https://www.youtube.com/watch?v=04rhu_56O5g.

The p-value method: Calculate the area that is under the (null) distribution of the test statistic and is greater than the value of the test statistic. This area can be represented by $A=P(T\ge t|H_0)$. For a right-tailed test, the $p$-value equals $A$; for a left-tailed test, the $p$-value equals $1-A$; for a two-tailed test, the $p$-value equals $2\cdot \min \{1-A, A\}$, twice the smaller of $A$ and $1-A$.
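The rule for turning the right-tail area $A$ into a $p$-value can be sketched as a small R function. The function name p_value() and its arguments are hypothetical, introduced here only for illustration; it assumes a continuous null distribution and defaults to the standard normal. The value 2.4 is just an illustrative test statistic.

```r
# Hypothetical helper (not part of base R): p-value from an observed statistic
# t0, given the null distribution's CDF (standard normal by default)
p_value <- function(t0, alternative = c("two.sided", "less", "greater"),
                    cdf = pnorm) {
  alternative <- match.arg(alternative)
  A <- 1 - cdf(t0)   # area to the right of the observed statistic
  switch(alternative,
         greater   = A,
         less      = 1 - A,
         two.sided = 2 * min(A, 1 - A))
}

p_right <- p_value(2.4, "greater")
p_left  <- p_value(2.4, "less")
p_two   <- p_value(2.4, "two.sided")
c(right = p_right, left = p_left, two.sided = p_two)

# For a t test, pass the t CDF instead, e.g. with 24 degrees of freedom
p_value(2.4, "greater", cdf = function(q) pt(q, df = 24))
```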

Watch this video: https://www.youtube.com/watch?v=W3rxXa7YNqk. Note that in the video, the null hypothesis is written with an inequality sign rather than as $H_0: \mu=23$; for the purpose of the test, the two forms are technically the same.

9.2 Tests on the Mean of a Normal Distribution, Variance Known

We assume that a random sample $X_1, X_2, \cdots, X_n$ has been taken from a normal population with known variance $\sigma^2$.

9.2.1 Hypothesis Tests on the Mean

The null hypothesis is always written as $H_0: \mu=\mu_0$, where $\mu_0$ is called the hypothesized value or null value.

The alternative hypothesis may look like one of the following:

$H_a: \mu<\mu_0$
$H_a: \mu>\mu_0$
$H_a: \mu\ne \mu_0$

The test statistic is always

\[Z_0=\frac{\bar{X}-\mu_0}{\sigma /\sqrt{n}}\]

The observed value of the test statistic is always

\[z_0=\frac{\bar{x}-\mu_0}{\sigma /\sqrt{n}}\]

For a left-tail test, the critical value is given by $c$, the cutoff on the number line that makes the area of the left region be $\alpha$ under the curve of the standard normal distribution, and the $p$-value is given by the area of the left region beyond $z_0$ under the standard normal curve.

For a right-tail test, the critical value is given by $c$, the cutoff on the number line that makes the area of the right region be $\alpha$ under the curve of the standard normal distribution, and the $p$-value is given by the area of the right region beyond $z_0$ under the standard normal curve.

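For a two-tail test, the critical values are given by $c$ and $-c$, the two cutoffs on the number line that make the left-tail and right-tail areas each $\alpha/2$ under the standard normal curve, and the $p$-value is twice the tail area beyond $|z_0|$.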

To find $c$, watch this video: https://www.youtube.com/watch?v=p_KApjpyBHE (starting at 1:55).

Example 1. A left-sided test for a population mean with $\sigma$ known.

https://www.youtube.com/watch?v=oEW8Hd_xy1k

Example 2. A right-sided test for a population mean with $\sigma$ known.

Example 3. A two-sided test for a population mean with $\sigma$ known.

https://www.youtube.com/watch?v=BWJRsY-G8u0

The following example gives some details:

Suppose we want to test whether a population mean is greater than 20. A random sample of size 36 gives a sample mean of 22. If the population standard deviation is 5, test, at level 0.05, whether the population mean exceeds 20.

Solution.

The null and alternative hypotheses are:

\[H_0:\mu = 20 ~~~ vs ~~~ H_a: \mu > 20\] The test statistic value is

\[z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{22-20}{5/\sqrt{36}}=2.4\]

Since a larger sample mean or a larger $z_0$ suggests rejection of the null hypothesis, the rejection region looks like $(c, \infty)$ with the critical value $c$. By the standard normal table or the R code $qnorm(1-\alpha)$ with $\alpha=0.05$ (run it on the R console), $c=1.645$.

Since the test statistic value falls in the rejection region, reject the null hypothesis.

Equivalently, we can use the $p$-value approach. The $p$-value is the area to the right of the statistic value under the standard normal curve. By the standard normal table or the R code $1-pnorm(2.4)$, the $p$-value is 0.0082. Since the $p$-value is less than the significance level, reject the null hypothesis.
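Both approaches in this example can be reproduced with a few lines of R. This is a minimal sketch from the summary statistics given in the problem; the variable names are illustrative.

```r
xbar  <- 22     # sample mean
mu0   <- 20     # hypothesized mean
sigma <- 5      # known population standard deviation
n     <- 36     # sample size
alpha <- 0.05   # significance level

z0     <- (xbar - mu0) / (sigma / sqrt(n))  # observed test statistic
c_crit <- qnorm(1 - alpha)                  # right-tail critical value
p_val  <- 1 - pnorm(z0)                     # right-tail p-value

z0 > c_crit      # TRUE means z0 is in the rejection region
p_val <= alpha   # TRUE means reject by the p-value method
```

The two comparisons always agree, since they are two ways of stating the same decision rule.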

9.3 Tests on the Mean of a Normal Distribution, Variance Unknown

We assume that a random sample $X_1, X_2, \cdots, X_n$ has been taken from a normal population with unknown variance $\sigma^2$.

9.3.1 Hypothesis Tests on the Mean

The test statistic is always

\[T_0=\frac{\bar{X}-\mu_0}{S /\sqrt{n}}\]

The observed value of the test statistic is always

\[t_0=\frac{\bar{x}-\mu_0}{s /\sqrt{n}}\]

For a left-tail test, the critical value is given by $c$, the cutoff on the number line that makes the area of the left region be $\alpha$ under the curve of the t distribution with $n-1$ degrees of freedom, and the $p$-value is given by the area of the left region beyond $t_0$ under the t curve.

For a right-tail test, the critical value is given by $c$, the cutoff on the number line that makes the area of the right region be $\alpha$ under the curve of the t distribution with $n-1$ degrees of freedom, and the $p$-value is given by the area of the right region beyond $t_0$ under the t curve.

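For a two-tail test, the critical values are given by $c$ and $-c$, the two cutoffs on the number line that make the left-tail and right-tail areas each $\alpha/2$ under the curve of the t distribution with $n-1$ degrees of freedom, and the $p$-value is twice the tail area beyond $|t_0|$.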

Example.

A very useful video: https://www.youtube.com/watch?v=VPd8DOL13Iw

Example.

Your company wants to improve sales. Past sales data indicate that the average sales was $100 per transaction. After training your sales force, recent sales data (taken from a random sample of 25 salesmen) indicates an average of $130, with a standard deviation of $15. Did the training work? Test your hypothesis at a 0.05 significance level.

Solution.

The population mean $\mu$ is the parameter of interest. To test whether sales have improved, we should have the null and alternative hypotheses as follows:

\[H_0: \mu=100 ~~~ vs ~~~ H_a: \mu>100\] The value of the test statistic is

\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{130-100}{15/\sqrt{25}}=10\] with $n-1$ or 24 degrees of freedom.

Since larger $\bar{x}$’s or $t_0$’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like $(c, \infty)$, where $c=t_{\alpha, n-1}$. We are given $\alpha=0.05$, so the critical value based on the $t_{24}$ distribution is 1.711, which is obtained by R code $qt(1-\alpha, n-1)$ or by a $t$-table.

Since the test statistic value 10 falls in the rejection region, we reject the null hypothesis.

Equivalently, we can calculate the $p$-value, which is the area under the $t_{24}$ distribution to the right of the test statistic value. Using the $t$ table or the R code $1-pt(10, 24)$, we know the $p$-value is smaller than 0.001 and thus smaller than the significance level 0.05. Again, we reject the null hypothesis.

In conclusion, the data provide sufficient evidence that sales have improved after training.
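Because only summary statistics are given in this example, t.test() cannot be called directly (it needs the raw data). The following sketch computes the test from the summaries instead; the variable names are illustrative.

```r
xbar  <- 130    # sample mean after training
mu0   <- 100    # historical mean (null value)
s     <- 15     # sample standard deviation
n     <- 25     # sample size
alpha <- 0.05   # significance level

t0     <- (xbar - mu0) / (s / sqrt(n))             # observed test statistic
t_crit <- qt(1 - alpha, df = n - 1)                # right-tail critical value
p_val  <- pt(t0, df = n - 1, lower.tail = FALSE)   # right-tail p-value

c(t0 = t0, critical = t_crit, p.value = p_val)
t0 > t_crit   # TRUE means reject the null hypothesis
```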

The following is a video explaining the above procedure:

https://www.youtube.com/watch?v=7ty2bO6VrUI

Example.

A firm claims that their product on average weighs 19 pounds. A supervisory authority suspects that the average weight is below 19 pounds, so it collects a random sample of 51 products made by the company from the market. The sample mean is 18.5 pounds with a standard deviation of 3.2 pounds. Test appropriate hypotheses at the significance level 0.01. In order to prevent themselves from being sued by the company, should the authority use a larger or smaller significance level?

Solution.

The null and alternative hypotheses are:

\[H_0: \mu=19 ~~~ vs ~~~ H_a: \mu<19\] The value of the test statistic is

\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{18.5-19}{3.2/\sqrt{51}}=-1.1158\] with $n-1$ or 50 degrees of freedom.

Since smaller $\bar{x}$’s or $t_0$’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like $(-\infty, c)$, where $c=-t_{\alpha, n-1}$. We are given $\alpha=0.01$, so the critical value based on the $t_{50}$ distribution is $-2.4033$, which is obtained by R code $qt(\alpha, n-1)$ with $\alpha = 0.01, n=51$ or by a $t$-table.

Since the test statistic value $-1.1158$ does not fall in the rejection region, we fail to reject the null hypothesis.

Equivalently, we can calculate the $p$-value, which is the area under the $t_{50}$ distribution to the left of the test statistic value. Using the $t$ table or the R code $pt(-1.1158, 50)$, we know the $p$-value is 0.1349 and thus NOT smaller than the significance level 0.01. Again, we fail to reject the null hypothesis.

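In conclusion, the data do not provide sufficient evidence that the average weight of the firm’s products is below 19 pounds. As for the last question, the authority should use a smaller significance level: a smaller $\alpha$ lowers the probability of a type I error, that is, of wrongly concluding that the average weight is below 19 pounds.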
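To explore how the choice of significance level affects the decision in this example, the sketch below computes the left-tail critical value at several levels, using the summary statistics from the problem. The levels 0.05 and 0.10 are added here for comparison only; the problem itself specifies 0.01. Variable names are illustrative.

```r
xbar <- 18.5   # sample mean
mu0  <- 19     # claimed mean (null value)
s    <- 3.2    # sample standard deviation
n    <- 51     # sample size

t0    <- (xbar - mu0) / (s / sqrt(n))   # observed test statistic
p_val <- pt(t0, df = n - 1)             # left-tail p-value

alphas <- c(0.01, 0.05, 0.10)           # 0.01 is the level given in the problem
crit   <- qt(alphas, df = n - 1)        # left-tail critical values
result <- data.frame(alpha = alphas, critical = crit, reject = t0 < crit)

round(c(t0 = t0, p.value = p_val), 4)
result
```

A smaller $\alpha$ pushes the critical value further into the left tail, which makes rejecting the firm's claim harder.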

The following is a video explaining the above procedure: https://www.youtube.com/watch?v=ZY5XxJ2aJNc

9.4 Tests on a Population Proportion

It is often necessary to test hypotheses on a population proportion. For example, suppose that a random sample of size $n$ has been taken from a large (possibly infinite) population and that $X$ observations in this sample belong to a class of interest. Then $\hat{P}=\frac{X}{n}$ is a point estimator of the proportion $p$ of the population that belongs to this class. Typically, we require that $np$ and $n(1-p)$ be greater than or equal to 5.

The null hypothesis always looks like $H_0: p=p_0$, where $p_0$ is called the hypothesized value or null value.

The alternative hypothesis may look like one of the following:

$H_a: p<p_0$
$H_a: p>p_0$
$H_a: p\ne p_0$

The test statistic is always

\[Z_0=\frac{\hat{P}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}\]

The observed value of the test statistic is always \[z_0=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}\]

The critical value and p-value are calculated in the same way as the normal distribution case for $\mu$ with known $\sigma$.

9.4.1 Large-Sample Tests on a Proportion

Example.

A semiconductor manufacturer produces controllers used in automobile engine applications. The customer requires that the process fallout or fraction defective at a critical manufacturing step not exceed 0.045 and that the manufacturer demonstrate process capability at this level of quality using $\alpha= 0.05$. The semiconductor manufacturer takes a random sample of 200 devices and finds that 4 of them are defective. Can the manufacturer demonstrate process capability for the customer?

Solution.

We may solve this problem using the following steps:

Parameter of interest: The parameter of interest is the process fraction defective $p$.
Null hypothesis: $H_0: p = 0.045$
Alternative hypothesis: $H_a: p < 0.045$. This formulation of the problem will allow the manufacturer to make a strong claim about process capability if the null hypothesis $H_0: p = 0.045$ is rejected.
Test statistic: The test statistic is $z_0=\frac{x-np_0}{\sqrt{np_0 (1-p_0)}}=-1.7055$

where $x = 4, n = 200$, and $p_0 = 0.045$.

$p$-value: 0.044 (the left-tail area under the standard normal curve with cutoff $-1.7055$).
Decision & conclusion: Reject $H_0$ since the $p$-value is less than 0.05. We conclude that the process fraction defective $p$ is less than 0.045. Practical interpretation: We conclude that the process is capable.
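The steps above can be checked in R. The sketch below computes the test statistic and $p$-value by hand, then compares with prop.test(); setting correct = FALSE turns off the continuity correction so that the built-in test matches the formula in this section. Variable names are illustrative.

```r
x     <- 4       # defective devices in the sample
n     <- 200     # sample size
p0    <- 0.045   # null value of the fraction defective
alpha <- 0.05    # significance level

z0    <- (x - n * p0) / sqrt(n * p0 * (1 - p0))  # observed test statistic
p_val <- pnorm(z0)                               # left-tail p-value

# Built-in version; without the continuity correction it reproduces p_val
builtin <- prop.test(x, n, p = p0, alternative = "less", correct = FALSE)

round(c(z0 = z0, p.value = p_val, builtin = builtin$p.value), 4)
p_val < alpha   # TRUE means reject the null hypothesis
```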

Example. An article in Fortune (September 21, 1992) claimed that nearly one-half of all engineers continue academic studies beyond the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an article in Engineering Horizons (Spring 1990) indicate that 118 of 484 new engineering graduates were planning graduate study.

Test the hypothesis $H_0: p=0.5$
What is the P-value for this test? Round your answer to 4 decimal places.

9.5 Testing for Goodness of Fit

The hypothesis-testing procedures that we have discussed in previous sections are designed for problems in which the population or probability distribution is known and the hypotheses involve the parameters of the distribution. Another kind of hypothesis is often encountered: We do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.

The test method, called the goodness-of-fit test, is based on the chi-square distribution.

Each chi-squared distribution is associated with a number called the number of degrees of freedom. The above graph shows 6 different chi-squared distributions. All chi-squared distributions are skewed to the right.

In this chapter, we will consider hypothesis testing problems that involve calculating p-values based on chi-squared distributions.

We will focus on the situation that the population distribution is discrete with only a few categories. So, the null hypothesis is

\[H_0: p_1 = p_{10}, ~p_2 = p_{20}, \cdots, ~p_k = p_{k0} \] where $p_{10}, p_{20}, \cdots, p_{k0}$ are given proportions, and the alternative hypothesis is

\[H_a: \text{At least one of the proportions is not as specified}\]

The test procedure requires a random sample of size $n$ from the population whose probability distribution is unknown. These $n$ observations are arranged in a frequency table with $k$ classes/categories.

Let $O_i$ be the observed frequency in the $i$th class. Under the null hypothesis, we compute the expected frequency in the $i$th class, denoted $E_i = n\cdot p_{i0}$, $i = 1, 2, ..., k$. The test statistic is

\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]

Under the null hypothesis, $\chi_0^2$ has, approximately, a chi-square distribution with $k-1$ degrees of freedom.

We can again use one of the following two methods for making a decision:

The critical value method: The critical region is always $(c, \infty)$, where $c$ is the cutoff for the chi-square distribution such that the upper tail is $\alpha$ (the significance level).

The $p$-value method: The $p$-value is the upper-tail area under the chi-square curve with cutoff $\chi_0^2$.

Example.

Throw a 6-sided die 100 times. The observations are

17 ones
18 twos
13 threes
17 fours
22 fives
13 sixes

Test, at the significance level 0.05, whether the die is fair.

Solution.

The null hypothesis is $H_0: p_1=p_2=\cdots=p_6=1/6$ and the alternative hypothesis is $H_a: \text{At least one of the probabilities is not 1/6}$. Under the null hypothesis, the expected frequencies are all $(\frac{1}{6})(100)=16.67$. The observed frequencies are $O_1=17, O_2=18, O_3=13, O_4=17, O_5=22, O_6=13$.
The test statistic is

\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]

\[\chi_0^2=\frac{(17 -16.67)^2}{16.67}+\frac{(18 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}+\frac{(17 -16.67)^2}{16.67}+\frac{(22 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}\] \[\chi_0^2=3.44\] with $6-1=5$ degrees of freedom.

The critical value is $\chi_{0.05, 5}^2=11.0705$. Since the test statistic value is not greater than the critical value, we do not reject the null hypothesis. The $p$-value is $P(\chi^2>3.44)= 0.6325$, the area of the right region under the chi-squared density curve $(df=5)$ (Watch: https://www.youtube.com/watch?v=HwD7ekD5l0g).
Decision & conclusion: Since the $p$-value is greater than the significance level 0.05, the null hypothesis is NOT rejected. We conclude that we don’t have enough evidence to say that the die is unfair.

The R code:

chisq.test(x=c(17, 18, 13, 17, 22, 13 ))
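
The hand calculation above can also be reproduced step by step. The following is a minimal sketch; the variable names (`observed`, `expected`, `chi_sq`, `crit`, `p_val`) are illustrative choices and do not come from the notes.

```r
# Observed counts for faces 1 to 6 from the example
observed <- c(17, 18, 13, 17, 22, 13)
n <- sum(observed)               # total number of throws
k <- length(observed)            # number of classes
expected <- n * rep(1 / k, k)    # E_i = n * p_i0 under H0 (fair die)

chi_sq <- sum((observed - expected)^2 / expected)        # test statistic
crit   <- qchisq(0.05, df = k - 1, lower.tail = FALSE)   # critical value
p_val  <- pchisq(chi_sq, df = k - 1, lower.tail = FALSE) # upper-tail area

round(c(statistic = chi_sq, critical = crit, p.value = p_val), 4)
```

The statistic and $p$-value should agree with what `chisq.test()` reports, since its default null hypothesis is equal probabilities for all classes.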

9.6 Contingency Table Tests

Many times the $n$ elements of a sample from a population may be classified according to two different criteria. It is then of interest to know whether the two methods of classification are statistically independent; for example, we may consider the population of graduating engineers and may wish to determine whether starting salary is independent of academic disciplines. Assume that the first method of classification has $r$ levels and that the second method has $c$ levels. We will let $O_{ij}$ be the observed frequency for level $i$ of the first classification method and level $j$ of the second classification method. The data would, in general, appear as shown in the following Table. Such a table is usually called an $r × c$ contingency table.

To test the independence of the two categorical variables, the null and alternative hypotheses are

\[H_0: \text{The two categorical variables are independent vs.} ~ H_a:\text{The two categorical variables are dependent}\]

We again use the chi-square test and the test statistic is

\[\chi_0^2=\sum_{i,j} \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\]

where the expected frequency $E_{ij}$ is calculated as the sum of the $i$th row multiplied by the sum of the $j$th column, then divided by the sum of all frequencies.

Under the null hypothesis, this test statistic has an approximate chi-square distribution with $(r − 1)(c − 1)$ degrees of freedom.

The critical value and the p-value are calculated in the same way as for goodness of fit.

Example.

A company has to choose among three health insurance plans. Management wishes to know whether the preference for plans is independent of job classification and wants to use α = 0.05. The opinions of a random sample of 500 employees are shown in table below:

Solution.

$H_0: \text{Job classification and health insurance plan are independent}$ and $H_a:\text{Job classification and health insurance plan are dependent}$
The expected frequencies are 136, 136, 68, 64, 64, 32, respectively.
The Chi-square statistic is 49.63.
The critical value is 5.99 (the cutoff of the chi-square distribution that separates the upper tail area of 0.05).
The p-value is essentially 0.
Decision & conclusion: By either method, we reject the null hypothesis. We conclude that Job classification and health insurance plan are dependent.

R code:

M=matrix(c(160, 40, 140, 60, 40, 60), 2, 3)

chisq.test(M)
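
To see where the expected frequencies and the test statistic come from, the calculation can be sketched by hand. This is only an illustration using the same matrix `M` as above; the names `E`, `chi_sq`, `df`, `crit`, and `p_val` are illustrative.

```r
# Observed counts: the same matrix as in the notes (filled column by column)
M <- matrix(c(160, 40, 140, 60, 40, 60), nrow = 2, ncol = 3)

# E_ij = (row i total) * (column j total) / (grand total)
E <- outer(rowSums(M), colSums(M)) / sum(M)

chi_sq <- sum((M - E)^2 / E)                          # test statistic
df     <- (nrow(M) - 1) * (ncol(M) - 1)               # (r - 1)(c - 1)
crit   <- qchisq(0.05, df = df, lower.tail = FALSE)   # critical value
p_val  <- pchisq(chi_sq, df = df, lower.tail = FALSE) # upper-tail area

E
round(c(statistic = chi_sq, df = df, critical = crit), 2)
```

The matrix `E` holds the expected frequencies row by row (136, 136, 68 in the first row and 64, 64, 32 in the second).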

10 Statistical Inference for Two Samples

Case Study: Paint Drying Time

A product developer is interested in reducing the drying time of a primer paint. Two formulations of the paint are tested:

formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time.

From experience, it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient. Ten specimens are painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are painted in random order. The two sample average drying times are 121 minutes and 112 minutes, respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient?

In the above case study, the objective is to compare two different conditions to determine whether either condition produces a significant effect on the response that is observed. These conditions are sometimes called treatments. The two different treatments are two paint formulations, and the response is the drying time. The purpose of the study is to determine whether the new formulation results in a significant effect—reducing drying time. In this situation, the product developer (the experimenter) randomly assigned 10 test specimens to one formulation and 10 test specimens to the other formulation. Then the paints were applied to the test specimens in random order until all 20 specimens were painted. This is an example of a completely randomized experiment.

When statistical significance is observed in a randomized experiment, the experimenter can be confident in the conclusion that the difference in treatments resulted in the difference in response. That is, we can be confident that a cause-and-effect relationship has been found.

Another case study:

Suppose an engineer wants to compare the strength of two types of metal alloys (Alloy A and Alloy B). The strengths of 10 randomly selected Alloy A’s and 12 randomly selected Alloy B’s (in MPa) are

Alloy A: 404.97, 398.62, 406.48, 415.23, 397.66, 397.66, 415.79, 407.67, 395.31, 405.43
Alloy B: 409.12, 405.57, 417.75, 395.43, 410.26, 424.42, 421.87, 402.85, 407.05, 409.43, 397.45, 402.33

Is there a significant difference in the mean strength between Alloy A and Alloy B?

Most of the practical applications of the procedures to be covered in this chapter arise in the context of simple comparative experiments in which the objective is to study the difference in the parameters of the two populations involved.

The general situation is shown in the figure below:

Population 1 has mean $\mu_1$ and variance $\sigma_1^2$, and population 2 has mean $\mu_2$ and variance $\sigma_2^2$. Inferences will be based on two random samples of sizes $n_1$ and $n_2$, respectively.

There are many studies that are not randomized experiments. Those studies do not involve the use of treatments and are called observational studies. It is difficult to identify causality in observational studies because the observed statistically significant difference in response for the two groups may be due to some other underlying factor (or group of factors) that was not equalized by randomization and not due to the treatments. For example, the difference in heart attack risk could be attributable to the difference in iron levels or to other underlying factors that form a reasonable explanation for the observed results—such as cholesterol levels or hypertension.

The following are more examples comparing the parameters of two populations:

Use Cases for T-Tests
Use Case	Description
Comparing Two Groups	Assess the difference in average test scores between two different classes.
Before-and-After Studies	Evaluate the impact of a training program on employee performance.
Medical Trials	Compare the effects of two medications on patient recovery times.
Quality Control	Assess the quality of products from two different suppliers.
Customer Satisfaction Surveys	Analyze satisfaction ratings between two store locations.
Performance Comparison	Compare the athletic performance of male and female athletes.
Marketing Campaign Effectiveness	Assess the sales impact of two different advertising strategies.
Educational Research	Investigate the effectiveness of a new teaching method versus a traditional approach.
Environmental Studies	Compare pollution levels before and after implementing a new regulation.
Consumer Behavior Analysis	Examine spending habits between two age groups.

Which of the above studies could be experimental studies?

10.1 Inference on the Difference in Means of Two Normal Distributions, Variances Known

In this section, we consider statistical inferences on the difference in means $\mu_1 − \mu_2$ of two normal distributions where the variances $\sigma_1^2$ and $\sigma_2^2$ are known. The assumptions for this section are summarized as follows.

There is a random sample from population 1.
There is a random sample from population 2.
The two samples are independent.
Both populations are normal.

Although there are some textbooks that discuss this situation, we will skip this case. For students who have interest, some reference can be found here: https://www.youtube.com/watch?v=NL9o1dKrh8o.
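
For readers who want to see how the paint drying case study would be handled, here is a minimal sketch. It assumes the standard two-sample $z$ statistic for known variances, $Z_0 = (\bar{X}_1-\bar{X}_2)/\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}$, and a one-sided alternative $H_a: \mu_1 > \mu_2$ (the new ingredient reduces drying time). Neither the formula nor the choice of alternative is spelled out in these notes, and the variable names are illustrative.

```r
# Summary statistics from the paint drying case study
xbar1 <- 121; xbar2 <- 112   # sample mean drying times (minutes)
sigma <- 8                   # known standard deviation for both formulations
n1 <- 10; n2 <- 10           # specimens per formulation

# z statistic for H0: mu1 = mu2 versus Ha: mu1 > mu2 (assumed alternative)
z0 <- (xbar1 - xbar2) / sqrt(sigma^2 / n1 + sigma^2 / n2)
p_val <- pnorm(z0, lower.tail = FALSE)   # upper-tail area

c(z = z0, p.value = p_val)
```

The statistic is about 2.52 and the $p$-value is below 0.01, so under these assumptions the data support the claim that the new ingredient shortens the mean drying time.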

10.2 Inference on the Difference in Means of Two Normal Distributions, Variances Unknown

We now consider tests of hypotheses on the difference in means $\mu_1 − \mu_2$ of two normal distributions where the variances $\sigma_1^2$ and $\sigma_2^2$ are unknown. A $t$-statistic is used to test these hypotheses. This is the most important situation in practice.

10.2.1 Hypotheses Tests on the Difference in Means, Variances Unknown

We will follow the 4-step procedure introduced in the previous chapter.

Step 1: Specify the null and alternative hypotheses. \[H_0: \mu_1=\mu_2 ~~~ vs~~~ H_a: \mu_1\ne \mu_2 (~\text{or}~~ \mu_1< \mu_2 ~~\text{or}~~ \mu_1> \mu_2)\]
Step 2: Calculate the test statistic.

\[T_0=\frac{\bar{X}_1-\bar{X}_2-0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}\]

Under the null hypothesis, the test statistic has a $t$-distribution with $\nu$ degrees of freedom given by

\[\nu = \frac{(A+B)^2}{\frac{A^2}{n_1-1}+\frac{B^2}{n_2-1}}\] with $A=\frac{S_1^2}{n_1}$ and $B=\frac{S_2^2}{n_2}$

If $\nu$ is not an integer, round down to the nearest integer.

Step 3: Determine the $p$-value. The $p$-value is determined by the $t$-distribution with $\nu$ degrees of freedom. The procedure is similar to the previous chapter.
Step 4: Make a decision and draw a conclusion in the context.

If observations are available, you can use the following R code to do the analysis:

 t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

Here, x is the vector representing the first sample and y is the vector representing the second sample. We always consider the case that “mu = 0” meaning that the two means tested are equal. If the two samples are independent, we set “paired = FALSE” and this is the default setting. We always consider the situation that “var.equal = FALSE” meaning that the two underlying populations have unequal variances.

Example 1.

The overall distance traveled by a golf ball is tested by hitting the ball with Iron Byron, a mechanical golfer with a swing that is said to emulate the legendary champion, Byron Nelson. Ten randomly selected balls of two brands are tested and the overall distance measured. The data follow:

Brand 1: 287 277 287 271 283 271 279 275 263 267
Brand 2: 259 248 260 265 273 281 271 270 263 268

Is there evidence to show that there is a difference in the mean overall distance of brands? Use 0.05 as the significance level.

Solution.

Since we are testing whether the two population means are equal, the alternative hypothesis is that they are unequal.

The R code is:

x = c(287,  277,    287,    271,    283,    271,    279,    275,    263,    267)

y = c(259,  248,    260,    265,    273,    281,    271,    270,    263,    268)

t.test(x, y, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 2.6438, df = 17.817, p-value = 0.0166
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.088609 18.311391
## sample estimates:
## mean of x mean of y 
##     276.0     265.8

Since the $p$-value is 0.0166, we reject the null hypothesis at the significance level 0.05.
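
The formulas in Steps 2 and 3 can be checked against this output. The sketch below recomputes $T_0$, $\nu$, and the two-sided $p$-value by hand for the golf ball data; the names `A`, `B`, `t0`, `nu`, and `p_val` are illustrative.

```r
x <- c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)  # Brand 1
y <- c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)  # Brand 2

A <- var(x) / length(x)   # S1^2 / n1
B <- var(y) / length(y)   # S2^2 / n2

t0 <- (mean(x) - mean(y)) / sqrt(A + B)   # test statistic T0
nu <- (A + B)^2 /
  (A^2 / (length(x) - 1) + B^2 / (length(y) - 1))   # degrees of freedom
p_val <- 2 * pt(abs(t0), df = nu, lower.tail = FALSE)  # two-sided p-value

round(c(t = t0, df = nu, p.value = p_val), 4)
```

Note that `t.test()` uses the unrounded value of $\nu$ (here 17.817). Rounding $\nu$ down to 17, as is done when reading a $t$ table, changes the $p$-value only slightly.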

More examples:

Example 2: Material Strength Testing

In a civil engineering project, two types of concrete mixtures (Type A and Type B) are being considered for constructing a bridge. Tensile strength is a critical factor for the bridge’s durability. The project team collects tensile strength measurements for samples of each concrete type. The goal is to determine if there’s a significant difference in tensile strength between the two types of concrete.

# Data for two types of concrete mixtures
concrete_type_A <- c(29.6, 31.2, 30.5, 32.1, 31.8, 29.9, 30.4, 30.2, 32.5, 31.3,
                     31.7, 30.8, 30.1, 32.3, 31.6, 30.9, 30.7, 31.5, 31.0, 32.0)
concrete_type_B <- c(28.3, 29.1, 28.5, 29.9, 30.5, 29.0, 28.9, 29.8, 30.2, 29.4,
                     28.7, 29.3, 29.7, 29.5, 30.1, 28.8, 29.6, 29.2, 30.4, 28.6)

# Perform two-sample t-test
t_test_result <- t.test(concrete_type_A, concrete_type_B)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  concrete_type_A and concrete_type_B
## t = 7.3405, df = 35.8, p-value = 1.22e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.251928 2.208072
## sample estimates:
## mean of x mean of y 
##    31.105    29.375

Example 3: Energy Efficiency Analysis

In the automotive industry, a car manufacturer is testing the energy efficiency of two engine types: a traditional combustion engine and a new hybrid engine. The company records fuel consumption data for both engine types while running under identical conditions. The objective is to determine if the hybrid engine is significantly more fuel-efficient.

# Data for energy efficiency analysis
traditional_method <- c(8.2, 8.5, 8.4, 8.7, 8.6, 8.3, 8.2, 8.4, 8.6, 8.5,
                        8.4, 8.7, 8.8, 8.4, 8.5, 8.6, 8.3, 8.2, 8.6, 8.4)
new_technology <- c(6.9, 7.1, 7.0, 7.2, 7.3, 6.8, 7.1, 6.9, 7.2, 7.0,
                    7.1, 7.0, 6.8, 7.3, 7.2, 6.9, 7.0, 6.8, 7.1, 7.3)

# Perform two-sample t-test
t_test_result <- t.test(traditional_method, new_technology, alternative = "greater")
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  traditional_method and new_technology
## t = 26.116, df = 37.906, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  1.323648      Inf
## sample estimates:
## mean of x mean of y 
##     8.465     7.050

Note: “the hybrid engine is significantly more fuel-efficient” is equivalent to “the traditional engine consumes more fuel.”

Example 4: Product Performance Testing

A consumer electronics company is developing two models of smartphones. Each model uses a different type of battery chemistry. The company tests the battery life of both models by continuously using the devices until the batteries are drained. They want to know if there’s a significant difference in battery life between the two models.

# Data for battery life comparison
battery_chemistry_A <- c(14.5, 15.0, 14.7, 15.2, 14.9, 15.1, 14.8, 15.0, 14.6, 15.3,
                         14.8, 15.2, 15.1, 14.9, 15.0, 14.7, 15.2, 14.5, 15.1, 14.9)
battery_chemistry_B <- c(13.5, 14.0, 13.7, 14.2, 13.9, 14.1, 13.8, 14.0, 13.6, 14.3,
                         13.8, 14.2, 14.1, 13.9, 14.0, 13.7, 14.2, 13.5, 14.1, 13.9)

# Perform two-sample t-test
t_test_result <- t.test(battery_chemistry_A, battery_chemistry_B)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  battery_chemistry_A and battery_chemistry_B
## t = 13.279, df = 38, p-value = 7.489e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8475502 1.1524498
## sample estimates:
## mean of x mean of y 
##    14.925    13.925
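
The alloy strength case study from the beginning of this chapter can be analyzed in the same way. The following is a sketch only: the object names `alloy_A`, `alloy_B`, and `alloy_test` are illustrative, and the 0.05 significance level is an assumption because the case study does not state one.

```r
# Strength measurements (MPa) from the alloy case study
alloy_A <- c(404.97, 398.62, 406.48, 415.23, 397.66,
             397.66, 415.79, 407.67, 395.31, 405.43)
alloy_B <- c(409.12, 405.57, 417.75, 395.43, 410.26, 424.42,
             421.87, 402.85, 407.05, 409.43, 397.45, 402.33)

# Welch two-sample t-test of H0: mu_A = mu_B versus Ha: mu_A != mu_B
alloy_test <- t.test(alloy_A, alloy_B, alternative = "two.sided")
alloy_test

# TRUE means H0 is rejected at the (assumed) 0.05 level
alloy_test$p.value < 0.05
```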

10.2.2 Confidence Interval on the Difference in Means, Variances Unknown

If observations are available, you can use the following R code to do analysis:

t.test(x, y = NULL, conf.level = 0.95)

Example.

Brand 1: 287 277 287 271 283 271 279 275 263 267
Brand 2: 259 248 260 265 273 281 271 270 263 268

Calculate a 95% two-sided confidence interval on the difference in mean overall distance. Round your answer to one decimal place (e.g. 98.7).

Solution.

R code:

x = c(287,  277,    287,    271,    283,    271,    279,    275,    263,    267)
y = c(259,  248,    260,    265,    273,    281,    271,    270,    263,    268)
t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 2.6438, df = 17.817, p-value = 0.0166
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.088609 18.311391
## sample estimates:
## mean of x mean of y 
##     276.0     265.8

The 95 percent confidence interval is $(2.09, 18.31)$; rounded to one decimal place as requested, it is $(2.1, 18.3)$.
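
The interval reported by `t.test()` has the form $(\bar{x}_1-\bar{x}_2)\pm t_{\alpha/2,\nu}\sqrt{s_1^2/n_1+s_2^2/n_2}$, with $\nu$ the degrees of freedom from Section 10.2.1. A minimal sketch that rebuilds it by hand (the names `se`, `nu`, `t_crit`, and `ci` are illustrative):

```r
x <- c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)  # Brand 1
y <- c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)  # Brand 2

se <- sqrt(var(x) / length(x) + var(y) / length(y))  # standard error
nu <- unname(t.test(x, y)$parameter)                 # Welch degrees of freedom
t_crit <- qt(0.975, df = nu)                         # t_{alpha/2, nu} for 95%

ci <- (mean(x) - mean(y)) + c(-1, 1) * t_crit * se
round(ci, 1)
```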

10.3 Paired t-Test

Case Study: Shear Strength of Steel Girder

An article in the Journal of Strain Analysis for Engineering Design [“Model Studies on Plate Girders” (1983, Vol. 18(2), pp. 111–117)] reports a comparison of several methods for predicting the shear strength for steel plate girders. Data for two of these methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in Table 10.3. We wish to determine whether there is any difference (on the average) for the two methods.

For such a study, we apply the one-sample $t$ method (with unknown variance) to the paired differences in order to test hypotheses or construct confidence intervals about the mean difference, denoted by $\mu_D=\mu_1-\mu_2$.

Specifically, the null hypothesis looks like $H_0:\mu_D=0$ and the alternative hypothesis looks like one of the following three:

\[H_a: \mu_D<0\]
\[H_a: \mu_D>0\]
\[H_a: \mu_D\ne0\]

The test statistic is

\[T=\frac{\bar{d}-0}{s_d/\sqrt{n}}\] where $\bar{d}$ is the mean of the differences and $s_d$ is the corresponding standard deviation. Under the null hypothesis, the test statistic has a $t$-distribution with $n-1$ degrees of freedom.

When doing the calculation in R, there is no need to carry out the above procedure by hand. Instead, we can use the following R code:

t.test(x, y, 
       alternative = c("two.sided", "less", "greater"),  ## Use only one of the 3 options
       paired = TRUE, 
       conf.level = 0.95)

where $x$ and $y$ are the original samples.

The $1-\alpha$ confidence interval formula for $\mu_D$ is the same as the one-sample $t$ confidence interval with unknown variance. That is,

\[\bar{d}\pm t_{\alpha/2, n-1}\cdot\frac{s_d}{\sqrt{n}}\] where $n$ is the number of pairs.

Let’s demonstrate using the above case study.

The null hypothesis is $H_0:\mu_D = 0$ and the alternative hypothesis is $H_a:\mu_D \ne 0$.

x = c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y = c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)
t.test(x, y, alternative = "two.sided", paired = TRUE, conf.level = 0.95)

The $t$-statistic is 6.08. The $p$-value is 0.0003, indicating a significant difference at any commonly used significance level such as 0.05 .

The 95% confidence interval for $\mu_D$ is $(0.1700, 0.3777)$.
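
To connect these numbers with the formulas, the paired analysis can be sketched by hand from the differences. The names `d`, `t0`, `p_val`, and `ci` are illustrative.

```r
# The two samples from the girder case study (same x and y as above)
x <- c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y <- c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)

d <- x - y          # paired differences
n <- length(d)      # number of pairs

t0 <- mean(d) / (sd(d) / sqrt(n))                         # test statistic
p_val <- 2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)  # two-sided p-value
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)

round(c(t = t0, p.value = p_val), 4)
round(ci, 4)
```

Equivalently, `t.test(d)` (a one-sample test on the differences) gives the same statistic, $p$-value, and interval as the paired call.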

10.4 Inference on Two Population Proportions

Suppose we have two populations, each containing a class (category) of interest, with proportions $p_1$ and $p_2$, respectively. Suppose that two independent random samples of sizes $n_1$ and $n_2$ are taken from the two populations, and let $X_1$ and $X_2$ represent the number of observations that belong to the class of interest in samples 1 and 2, respectively.

10.4.1 Large-Sample Tests on the Difference in Population Proportions

We are interested in testing the hypotheses

\[H_0: p_1=p_2\] against one of the following: \[H_a: p_1>p_2\] \[H_a: p_1<p_2\] \[H_a: p_1\ne p_2\]

The test statistic is:

\[Z_0=\frac{(\hat{P}_1-\hat{P}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\] where $\hat{p}=\frac{X_1+X_2}{n_1+n_2}$ is called the pooled sample proportion.

R code:

prop.test(x = c(x1, x2), 
          n = c(n1, n2), 
          alternative = c("two.sided", "less", "greater"),   # Use one of the 3 options
          correct = FALSE
         )

where $n_1, n_2$ are sample sizes, and $x_1, x_2$ are numbers of successes. Note that the test statistic that R reports is $\chi^2$, which equals $Z_0$ squared. 


Example 1.

Two different types of injection-molding machines are used to form plastic parts. A part is considered defective if it has excessive shrinkage or is discolored. Two random samples, each of size 300, are selected, and 15 defective parts are found in the sample from machine 1, while 10 defective parts are found in the sample from machine 2. Is it reasonable to conclude that both machines produce the same proportion of defective parts, using the 0.05 significance level? Answer by finding the P-value for the test. Round your answer to 3 decimal places.

Solution.

The null and alternative hypotheses are:

\[H_0: p_1=p_2 ~~ vs ~~ H_a:p_1\ne p_2\]

R code:

x = c(15, 10)
n = c(300, 300)
prop.test(x, n, alternative = "two.sided", correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 1.0435, df = 1, p-value = 0.307
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01528380  0.04861713
## sample estimates:
##     prop 1     prop 2 
## 0.05000000 0.03333333

The $p$-value is 0.307, so we fail to reject the null hypothesis.

Note that the test statistic that R reports is $\chi^2$, which is the square of the $Z_0$ value obtained by hand.
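
The relationship between $Z_0$ and the reported $\chi^2$ can be verified directly. A minimal sketch of the hand calculation for this example (the variable names are illustrative):

```r
x1 <- 15; n1 <- 300   # defectives and sample size, machine 1
x2 <- 10; n2 <- 300   # defectives and sample size, machine 2

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled sample proportion

z0 <- (p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_val <- 2 * pnorm(abs(z0), lower.tail = FALSE)   # two-sided p-value

round(c(z = z0, z_squared = z0^2, p.value = p_val), 4)
```

Here `z_squared` should match the `X-squared` value in the `prop.test()` output, and the two-sided $p$-values agree.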

10.4.2 Confidence Interval on the Difference in Population Proportions

The confidence interval on the difference in population proportions is given below:

\[(\hat{p}_1-\hat{p}_2)\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]

R code:

prop.test(x = c(x1, x2), 
          n = c(n1, n2), 
          conf.level = 0.95, 
          correct = FALSE)

Always set “correct = FALSE”.

Example.

Two different types of injection-molding machines are used to form plastic parts. A part is considered defective if it has excessive shrinkage or is discolored. Two random samples, each of size 300, are selected, and 15 defective parts are found in the sample from machine 1, while 10 defective parts are found in the sample from machine 2. Construct a 95% confidence interval for the difference in proportions between machine 1 and machine 2.

Solution.

To construct the confidence interval for the difference in proportions, we can use the formula:

$\text{Confidence interval} = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

where:

$\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of defective parts for machines 1 and 2, respectively,
$n_1$ and $n_2$ are the sample sizes for machines 1 and 2, respectively, and
$z_{\alpha/2}$ is the critical value for a 95% confidence interval, which is 1.96.

Using the given information, we have:

$\hat{p}_1 = \frac{15}{300} = 0.05$

$\hat{p}_2 = \frac{10}{300} = 0.0333$

$n_1 = n_2 = 300$

Plugging in the values, we get:

$\text{Confidence interval} = (0.05 - 0.0333) \pm 1.96 \sqrt{\frac{0.05(1-0.05)}{300} + \frac{0.0333(1-0.0333)}{300}}$

Simplifying, we get:

$\text{Confidence interval} = 0.0167 \pm 0.0320$

Therefore, the 95% confidence interval for the difference in proportions between machine 1 and machine 2 is (0.0167 - 0.0320, 0.0167 + 0.0320), or approximately (-0.0153, 0.0486) when unrounded values are carried through. We can interpret this as follows: we are 95% confident that the true difference in proportions of defective parts between the two machines is between -0.0153 and 0.0486. Since the interval contains zero, we cannot conclude that there is a significant difference in the proportions of defective parts between the two machines at the 0.05 significance level.

R code:

x = c(15, 10)
n = c(300, 300)
prop.test(x, n, conf.level = 0.95, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 1.0435, df = 1, p-value = 0.307
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01528380  0.04861713
## sample estimates:
##     prop 1     prop 2 
## 0.05000000 0.03333333

A 95% confidence interval for $p_1-p_2$ is $(-0.01528380, 0.04861713)$.
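
The interval in the R output can be reproduced from the formula at the start of this section. A minimal sketch (the names `se` and `ci` are illustrative):

```r
x1 <- 15; n1 <- 300   # defectives and sample size, machine 1
x2 <- 10; n2 <- 300   # defectives and sample size, machine 2

p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Unpooled standard error of p1_hat - p2_hat
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# 95% interval: qnorm(0.975) is z_{alpha/2}, approximately 1.96
ci <- (p1_hat - p2_hat) + c(-1, 1) * qnorm(0.975) * se
round(ci, 4)
```

Unlike the test statistic, the interval uses the unpooled standard error, because it does not assume $p_1 = p_2$.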

Example.

An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005, Vol. 13, pp. 273-279), considered arthroscopic meniscal repair with an absorbable screw. Results showed that for tears greater than 25 millimeters, 14 of 19 repairs were successful while for shorter tears, 22 of 29 repairs were successful.

With $\alpha=0.05$, is there evidence that the success rate is greater for longer tears? What is the $p$-value?

Solution.

To test whether the success rate is greater for longer tears, we can use a one-sided hypothesis test:

\[H_0: p_1 = p_2 ~~vs ~~H_a: p_1 > p_2\]

where $p_1$ is the proportion of successful repairs for tears > 25 mm and $p_2$ is the proportion of successful repairs for tears ≤ 25 mm.

The test statistic is:

$z = \frac{(\hat{p}_1 - \hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}$

where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of successful repairs, $n_1$ and $n_2$ are the sample sizes, and $\hat{p}$ is the pooled sample proportion:

$\hat{p} = \frac{x_1+x_2}{n_1+n_2}$

where $x_1$ and $x_2$ are the number of successful repairs in each sample.

Plugging in the values from the problem, we get:

$\hat{p}_1 = \frac{14}{19} \approx 0.737, \hat{p}_2 = \frac{22}{29} \approx 0.759, n_1 = 19, n_2 = 29, \hat{p} = \frac{14+22}{19+29} =0.75$

The test statistic is:

\[z = \frac{0.737 - 0.759}{\sqrt{0.75(1-0.75)(\frac{1}{19}+\frac{1}{29})}} \approx -0.17\]

Because the alternative hypothesis is $p_1 > p_2$, the $p$-value is the upper-tail area. Using a standard normal distribution table, $P(Z > -0.17) \approx 0.567$.

Since the p-value is greater than $\alpha=0.05$, we fail to reject the null hypothesis. There is not enough evidence to conclude that the success rate is greater for longer tears at the 5% significance level.

R code:

x = c(14, 22)
n = c(19, 29)
prop.test(x, n, alternative = "greater", correct = FALSE)

## Warning in prop.test(x, n, alternative = "greater", correct = FALSE):
## Chi-squared approximation may be incorrect

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 0.029038, df = 1, p-value = 0.5677
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.2331912  1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.7368421 0.7586207

The $p$-value is 0.5677, which is greater than 0.05, so we fail to reject the null hypothesis.
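
For a one-sided alternative, the $p$-value is a single tail area of the standard normal distribution. The sketch below redoes the hand calculation for this example; the variable names are illustrative.

```r
x1 <- 14; n1 <- 19   # successful repairs and sample size, longer tears
x2 <- 22; n2 <- 29   # successful repairs and sample size, shorter tears

p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)   # pooled sample proportion

z0 <- (p1_hat - p2_hat) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_val <- pnorm(z0, lower.tail = FALSE)   # upper tail, because Ha: p1 > p2

round(c(z = z0, p.value = p_val), 4)
```

The warning in the `prop.test()` output above is a reminder that the normal (chi-squared) approximation may be rough for samples this small.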

11 Simple Linear Regression and Correlation

Simple Linear Regression and Correlation are two closely related concepts used in statistics to examine the relationship between two continuous variables. They help us understand how changes in one variable are associated with changes in another and allow us to make predictions and draw inferences about their relationship. Let’s explore each concept:

Simple Linear Regression:

Simple Linear Regression is a statistical method that models the relationship between two continuous variables by fitting a linear equation to the data. The goal is to find the best-fitting line that minimizes the distance between the observed data points and the predicted values on the line. The linear equation for simple linear regression is of the form: \[y = \beta_0 + \beta_1 x\]

where:

$y$ is the dependent variable (also called the response or outcome variable).
$x$ is the independent variable (also called the predictor or explanatory variable).
$\beta_0$ and $\beta_1$ are the regression coefficients, representing the intercept and slope of the line, respectively.

The coefficients $\beta_0$ and $\beta_1$ are estimated from the data using methods such as the least squares method, which aims to minimize the sum of squared differences between the observed and predicted values.

Simple linear regression allows us to make predictions about the value of the dependent variable ($y$) based on the value of the independent variable ($x$). It also provides insights into the strength and direction of the relationship between the two variables.
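
As an illustration, the least squares estimates can be computed from the usual closed-form expressions $\hat{\beta}_1=\sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2$ and $\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$, and compared with `lm()`. This sketch uses R's built-in `cars` data set (speed and stopping distance), which is an assumption made purely for illustration and is not part of these notes; the names `b0`, `b1`, and `fit` are likewise illustrative.

```r
# Built-in data set, used only for illustration (not from these notes)
x <- cars$speed   # independent variable
y <- cars$dist    # dependent variable

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# The same fit using lm()
fit <- lm(dist ~ speed, data = cars)

c(b0 = b0, b1 = b1)
coef(fit)

# Predicted value of y at x = 15
predict(fit, newdata = data.frame(speed = 15))
```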

Correlation:

Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. It assesses how closely the data points tend to cluster around a straight line, indicating the degree of association between the variables.

The most commonly used measure of correlation is the Pearson correlation coefficient ($r$), which ranges from -1 to +1:

$r = +1$ indicates a perfect positive linear relationship.
$r = -1$ indicates a perfect negative (inverse) linear relationship.
$r = 0$ indicates no linear relationship (variables are not linearly correlated).

The Pearson correlation coefficient is calculated using the formula:

\[r = \frac{cov(x,y)}{s_x s_y}\]

where $(x_i, y_i)$ are the individual data points,

\[cov(x,y) = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i-\bar{y})\] is the covariance between $x$ and $y$, and $s_x$ and $s_y$ are the standard deviations of the $x$ values and $y$ values, respectively.

Correlation measures the strength of the linear relationship between two quantitative variables but does not provide information about causation. A high correlation coefficient does not imply causation; it merely indicates a strong linear association between the variables.
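
The formula for $r$ can be checked against R's `cor()` function. The sketch below again uses the built-in `cars` data set, which is an assumption for illustration only; the names `cov_xy` and `r` are illustrative.

```r
# Built-in data set, used only for illustration (not from these notes)
x <- cars$speed
y <- cars$dist
n <- length(x)

cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance
r <- cov_xy / (sd(x) * sd(y))                           # Pearson correlation

c(by_formula = r, built_in = cor(x, y))
```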

In summary, simple linear regression and correlation are valuable tools for understanding and quantifying the relationship between two continuous variables. Simple linear regression allows us to model and predict one variable based on another, while correlation measures the strength and direction of the linear association between the variables. Both concepts are widely used in various fields, including data analysis, scientific research, and engineering.

Hypothesis Tests in Simple Linear Regression:

In simple linear regression, hypothesis tests are used to make inferences about the regression coefficients and assess the significance of the relationship between the dependent variable and the independent variable. The two main hypotheses tested in simple linear regression are related to the slope ($\beta_1$) of the regression line. The hypothesis tests are based on the underlying assumptions of the regression model, such as the normality of errors and constant variance.

Let’s go through the key hypotheses and the corresponding hypothesis tests in simple linear regression.

Null Hypothesis ($H_0$):

The null hypothesis in simple linear regression states that there is no significant linear relationship between the independent variable ($x$) and the dependent variable ($y$). In mathematical terms, it is expressed as:

\[H_0: \beta_1 = 0\]

This implies that the slope of the regression line is zero, indicating no linear association between $x$ and $y$.

Alternative Hypothesis ($H_1$ or $H_a$):

The alternative hypothesis in simple linear regression states that there is a significant linear relationship between the independent variable ($x$) and the dependent variable ($y$). In mathematical terms, it is expressed as: \[H_1: \beta_1 \ne 0\]

This implies that the slope of the regression line is not zero, indicating a non-zero association between $x$ and $y$.

Hypothesis Tests:

The most common hypothesis test used in simple linear regression is the t-test. The t-test assesses whether the estimated slope coefficient ($\hat{\beta}_1$) is significantly different from zero.

The t-statistic for testing the null hypothesis is calculated as:

\[t = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}\]

where:

$\hat{\beta}_1$ is the estimated slope coefficient obtained from the regression analysis and $se(\hat{\beta}_1)$ is the standard error of the slope coefficient, which estimates the variability of $\hat{\beta}_1$.

The t-statistic has a t-distribution with $(n-2)$ degrees of freedom ($n$ is the sample size). The $p$-value is twice the tail area under the t-distribution curve beyond the absolute value of the t statistic. If the $p$-value is no greater than the significance level, we reject the null hypothesis in favor of the alternative hypothesis. This indicates that there is a significant linear relationship between $x$ and $y$ at the chosen significance level.

Additionally, the same $t$-distribution can be used to construct a confidence interval for the slope coefficient, $\hat{\beta}_1 \pm t_{\alpha/2, n-2}\cdot se(\hat{\beta}_1)$. The confidence interval provides a range of values within which the true population slope is likely to lie with a certain level of confidence.
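
The slope test and confidence interval can be sketched in R as follows. The built-in `cars` data set is an assumption used only for illustration, and the names `b1`, `se_b1`, `t0`, `p_val`, and `ci` are illustrative.

```r
# Built-in data set, used only for illustration (not from these notes)
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)

# Row of the coefficient table for the slope
est   <- coef(summary(fit))["speed", ]
b1    <- est[["Estimate"]]     # estimated slope
se_b1 <- est[["Std. Error"]]   # standard error of the slope

t0    <- (b1 - 0) / se_b1                                 # t statistic
p_val <- 2 * pt(abs(t0), df = n - 2, lower.tail = FALSE)  # two-sided p-value
ci    <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1    # 95% CI for slope

c(t = t0, p.value = p_val)
ci
confint(fit, "speed", level = 0.95)   # built-in interval for comparison
```

In practice, `summary(fit)` reports the same $t$ statistic and $p$-value in its coefficient table, so the hand calculation is only needed to see how the pieces fit together.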

It is important to note that these hypothesis tests assume that the residuals (errors) in the regression model are normally distributed and have constant variance. Violations of these assumptions may impact the validity of the tests, and alternative approaches may be needed.

In summary, hypothesis tests in simple linear regression help determine whether the relationship between the dependent variable and the independent variable is statistically significant. They are essential in interpreting the results of the regression analysis and making informed conclusions about the data.

The Adequacy of the Regression Model:

The model we introduced has assumptions (LINE), including linearity, independence of errors, normality of residuals, and equal variances of residuals (homoscedasticity). Violations of these assumptions may indicate inadequacy of the model.

Assessing the adequacy of a regression model is a crucial step in analyzing the model’s performance and making sure it provides meaningful and reliable results. Adequacy checks involve evaluating how well the model fits the data, identifying potential issues or violations of assumptions, and determining the overall quality of the model’s predictions. Several techniques can be used to assess the adequacy of a regression model:

Residual Analysis:

Residuals are the differences between the observed values and the predicted values from the regression model. Residual analysis involves examining the pattern of residuals to check for any systematic deviations from randomness. Ideally, residuals should be randomly distributed around zero, indicating that the model captures the underlying relationships in the data. Patterns in the residuals may indicate problems with the model, such as non-linearity, heteroscedasticity (varying spread of residuals), or outliers.

R-squared (Coefficient of Determination):

R-squared measures the proportion of the total variation in the dependent variable that is explained by the regression model. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data. However, high R-squared alone does not guarantee model adequacy, as it can increase even with the addition of irrelevant variables. Therefore, it is essential to interpret R-squared in conjunction with other model assessment techniques.

Adjusted R-squared:

The adjusted R-squared takes into account the number of predictor variables in the model, penalizing the addition of unnecessary variables. It provides a more conservative measure of model fit and is often preferred when comparing models with different numbers of predictors.
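
As a compact summary of the two measures, write $SS_E$ for the residual sum of squares, $SS_T$ for the total sum of squares, $n$ for the sample size, and $p$ for the number of predictors (these symbols are introduced here only for illustration). The usual definitions are:

\[R^2 = 1 - \frac{SS_E}{SS_T}, \qquad R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\]

In simple linear regression $p = 1$, so the adjusted value is only slightly smaller than $R^2$ unless $n$ is small.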

F-Test (Overall Significance Test):

The F-test is used to assess the overall significance of the regression model. It tests whether the explained variation in the dependent variable (sum of squares due to regression) is significantly larger than the unexplained variation (sum of squares due to residuals). A significant F-test suggests that the model as a whole is useful in explaining the variation in the dependent variable.
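
In symbols, with $SS_R$ the regression sum of squares, $SS_E$ the residual sum of squares, $n$ the sample size, and $p$ the number of predictors (notation introduced here only for illustration), the F statistic can be written as:

\[F = \frac{SS_R/p}{SS_E/(n-p-1)}\]

In simple linear regression ($p = 1$), the F statistic equals the square of the t-statistic for the slope, so the two tests give the same $p$-value.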

Outliers and Influential Points:

Identifying outliers and influential points is essential for understanding how individual data points affect the model. Outliers are observations whose response values lie far from the pattern of the rest of the data (they have large residuals), while influential points are observations whose removal would noticeably change the estimated coefficients. Robust regression techniques (not covered in this course) may be used to mitigate the effect of outliers, and sensitivity analysis can be performed to assess the impact of influential points.

Example 1: Engineering - Load vs. Deformation

In this example, we’ll analyze the relationship between the load applied to a material and the resulting deformation. We have data from 25 tests conducted on a specific material.

# Sample data
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
          16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
          17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
                 1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
                 2.2, 2.1, 2.0, 1.9, 2.3)

df = data.frame(load, deformation)

# Perform simple linear regression
lm_model <- lm(deformation ~ load, data = df)

# Print the summary of the regression
summary(lm_model)

## 
## Call:
## lm(formula = deformation ~ load, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13077 -0.05839 -0.01877  0.05360  0.20778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.20223    0.18761  -6.408 1.54e-06 ***
## load         0.19272    0.01239  15.560 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08878 on 23 degrees of freedom
## Multiple R-squared:  0.9132, Adjusted R-squared:  0.9095 
## F-statistic: 242.1 on 1 and 23 DF,  p-value: 1.058e-13

Insights:

The regression summary provides the estimated intercept and slope coefficients. The “Estimate” for the slope (0.19272) indicates that deformation increases by about 0.19 units, on average, for each one-unit increase in load.
The “Residual standard error” (0.08878) estimates the standard deviation of the errors; it is the typical distance between observed and predicted deformation values.
The “R-squared” value measures the proportion of variability in the dependent variable explained by the independent variable. Here, the R-squared value (0.9132) indicates that 91.32% of the total variation in deformation is explained (or accounted for) by load.
The p-value associated with the slope coefficient tests whether the linear relationship is statistically significant. Here, the p-value (1.06e-13, practically zero) indicates that the relationship is statistically significant at any reasonable significance level (for example, 0.01).
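
The t value and p-value for the slope in the summary can be reproduced from the formulas given earlier. The following is a minimal sketch, not part of the original analysis; the object names (`fit`, `coefs`, `b1`, `se_b1`, `t_stat`, `p_value`, `ci_manual`) are chosen here for illustration, and the data are copied from Example 1.

```r
# Example 1 data (copied from above)
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
          16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
          17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
                 1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
                 2.2, 2.1, 2.0, 1.9, 2.3)
fit <- lm(deformation ~ load)

# Slope estimate and its standard error from the coefficient table
coefs <- summary(fit)$coefficients
b1    <- coefs["load", "Estimate"]
se_b1 <- coefs["load", "Std. Error"]

# t statistic and two-sided p-value with n - 2 degrees of freedom
n       <- length(load)
t_stat  <- (b1 - 0) / se_b1
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval for the slope, computed by hand
ci_manual <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

t_stat
p_value
ci_manual

# The built-in function should give the same interval
confint(fit, "load", level = 0.95)
```

The values of `t_stat` and `p_value` should agree with the “t value” and “Pr(>|t|)” columns of the summary, and `ci_manual` should agree with `confint()`.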

Example 2: Engineering - Speed vs. Fuel Efficiency

Consider a scenario where we’re analyzing the relationship between the speed of an engine and its fuel efficiency. We have data from 22 tests conducted on different engine configurations.

# Sample data
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
           3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
           5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
                     27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
                     33.6, 34.3)

df = data.frame(speed, fuel_efficiency)
df

##    speed fuel_efficiency
## 1   1500            18.2
## 2   1800            19.5
## 3   2000            20.1
## 4   2200            21.2
## 5   2400            22.0
## 6   2500            23.1
## 7   2700            24.5
## 8   2800            25.0
## 9   3000            25.8
## 10  3200            26.4
## 11  3400            27.0
## 12  3500            27.5
## 13  3700            28.3
## 14  3800            28.7
## 15  4000            29.5
## 16  4200            30.2
## 17  4400            31.0
## 18  4600            31.5
## 19  4800            32.2
## 20  5000            32.9
## 21  5200            33.6
## 22  5400            34.3

Perform simple linear regression:

# Perform simple linear regression
lm_model <- lm(fuel_efficiency ~ speed, data = df)

# Print the summary of the regression
summary(lm_model)

## 
## Call:
## lm(formula = fuel_efficiency ~ speed, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7848 -0.5353  0.1559  0.3661  0.7997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.260e+01  3.666e-01   34.36   <2e-16 ***
## speed       4.144e-03  1.008e-04   41.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5315 on 20 degrees of freedom
## Multiple R-squared:  0.9883, Adjusted R-squared:  0.9877 
## F-statistic:  1691 on 1 and 20 DF,  p-value: < 2.2e-16

Insights:

The “Estimate” (0.004144) for the slope coefficient indicates that fuel_efficiency increases by about 0.004144 units, on average, for each one-unit increase in speed (about 4.1 units per 1000 units of speed).
The “Residual standard error” (0.5315) estimates the standard deviation of the errors; it is the typical distance between observed and predicted fuel-efficiency values.
The R-squared value (0.9883) indicates that 98.83% of the total variation in fuel efficiency is explained (or accounted for) by speed.
The p-value (<2e-16, practically zero) associated with the slope coefficient indicates that the relationship is statistically significant at any reasonable significance level (for example, 0.01).
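
A high R-squared does not by itself confirm that a straight line is adequate. As a minimal sketch of the residual analysis described earlier in this chapter (the object names `fit` and `res` are chosen here for illustration), we can plot the residuals of Example 2 against the fitted values and look for a systematic pattern:

```r
# Example 2 data (copied from above)
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
           3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
           5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
                     27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
                     33.6, 34.3)
fit <- lm(fuel_efficiency ~ speed)

# Residuals = observed values minus fitted values
res <- residuals(fit)

# Residuals vs. fitted values; points should scatter randomly around zero
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values")
abline(h = 0, lty = 2)

# Residuals listed in order of increasing speed;
# long runs of the same sign would hint at curvature
round(res, 2)
```

If the residuals are mostly negative at both ends of the speed range and positive in the middle (or the reverse), a curved relationship may describe the data better than a straight line.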

12 Multiple Linear Regression

Multiple Linear Regression is an extension of simple linear regression that allows for the analysis of the relationship between a dependent variable and multiple independent variables. It is a statistical technique used to model the linear relationship between the dependent variable and two or more predictor variables. Multiple linear regression is widely used in various fields, including statistics, economics, social sciences, and engineering, to analyze complex data and make predictions based on multiple factors.

The multiple linear regression model can be expressed as follows:

\[y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_rx_r+\epsilon\]

where:

$y$ is the dependent variable (response or outcome variable),
$x_1, x_2, \cdots, x_r$ are the independent variables (predictors or explanatory variables),
$\beta_0$ is the intercept and $\beta_1, \beta_2, \cdots, \beta_r$ are the regression coefficients (slopes) for the respective independent variables, and
$\epsilon$ is the error term, representing the difference between the observed $y$ and its mean value given the predictors; the residuals (observed minus predicted $y$ values) estimate these errors.

The multiple linear regression model estimates the values of the regression coefficients based on the observed data using the method of least squares. The goal is to minimize the sum of squared differences between the observed and predicted values.
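
Written out, with $x_{ij}$ denoting the value of the $j$-th predictor for the $i$-th observation and $b_0, b_1, \cdots, b_r$ denoting candidate coefficient values (notation introduced here only for illustration), the least squares estimates $\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_r$ are the values that minimize:

\[SS_E = \sum_{i=1}^{n}\left(y_i - b_0 - b_1x_{i1} - b_2x_{i2} - \cdots - b_rx_{ir}\right)^2\]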

Key aspects of multiple linear regression:

Interpretation of Coefficients:

The regression coefficients represent the change in the dependent variable ($y$) associated with a one-unit change in each independent variable, assuming all other variables remain constant. Positive coefficients indicate a positive relationship with the dependent variable, while negative coefficients indicate a negative relationship.

Adjusted R-squared:

Similar to simple linear regression, multiple linear regression uses the R-squared statistic to measure the proportion of variance in the dependent variable explained by the model. The adjusted R-squared takes into account the number of predictors and provides a more accurate measure of the model’s goodness of fit when comparing models with different numbers of variables.

Model Assumptions:

Multiple linear regression relies on several assumptions (LINE), including linearity, independence of errors, constant variance of residuals (homoscedasticity), and normality of residuals. Violations of these assumptions can impact the validity and accuracy of the regression model.

Model Selection:

The process of selecting variables to include in the multiple linear regression model is an essential part of the analysis. Variables that are irrelevant, or that are highly correlated with other predictors, may be excluded to avoid multicollinearity and improve the interpretability of the model.

Multiple linear regression is a powerful tool for analyzing the relationship between a dependent variable and multiple predictors. It enables researchers and analysts to explore complex data, identify significant predictors, make predictions, and gain valuable insights into the factors influencing the outcome of interest. However, careful attention to the assumptions and model diagnostics is necessary to ensure the validity and adequacy of the regression model.

A few very nice videos on multiple regression:

Short (6 minutes): https://www.youtube.com/watch?v=mno47Jn4gaU
Long (40 minutes): https://www.youtube.com/watch?v=eYTumjgE2IY
Long (45 minutes): https://www.youtube.com/watch?v=0m-rs2M7K-Y

Let’s consider a data example in engineering involving the relationship between the tensile strength of a metal and two factors: temperature and time of heat treatment. We’ll create a hypothetical dataset to illustrate the concept of multiple linear regression in engineering.

Suppose an engineer is studying the effect of temperature (in degrees Celsius) and time (in hours) of heat treatment on the tensile strength (in megapascals, MPa) of a metal sample. The engineer performs experiments at different combinations of temperature and time and records the tensile strength for each experiment. The data is as follows:

Temperature	Time	Tensile.Strength
200	1	150
250	2	180
300	2	210
250	1	160
300	3	230
200	2	170
280	2	190
270	3	220
220	1	155
290	2	200

Using multiple linear regression, the engineer can build a model to predict the tensile strength ($y$) based on temperature ($x_1$) and time ($x_2$) as predictors. The multiple linear regression model will have the form:

\[y = \beta_0+\beta_1x_1+\beta_2x_2+\epsilon\]

where:

$y$ is the tensile strength (dependent variable),
$x_1$ is the temperature (independent variable 1),
$x_2$ is the time (independent variable 2),
$\beta_0$, $\beta_1$, and $\beta_2$ are the regression coefficients, and
$\epsilon$ is the error term.

The goal of multiple linear regression is to estimate the values of the regression coefficients ($\beta_0$, $\beta_1$, $\beta_2$) from the observed data to build the best-fitting model.

To perform multiple linear regression in R, we can use the lm() function, which stands for “linear model.” The lm() function fits a linear regression model to the data and provides estimates for the regression coefficients, as well as various statistics and diagnostics to assess the model’s performance. Let’s enter the example data above into R and perform multiple linear regression:

# Create the example data
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)

# Combine the data into a data frame
data <- data.frame(temperature, time, tensile_strength)

# Perform multiple linear regression
model <- lm(tensile_strength ~ temperature + time, data = data)

# Print the summary of the regression model
summary(model)

## 
## Call:
## lm(formula = tensile_strength ~ temperature + time, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9653 -1.9750  0.8815  2.3290  6.5125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.39499   11.55468   4.794 0.001980 ** 
## temperature  0.33044    0.05428   6.088 0.000497 ***
## time        24.47977    2.84277   8.611 5.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.997 on 7 degrees of freedom
## Multiple R-squared:  0.9754, Adjusted R-squared:  0.9684 
## F-statistic: 138.7 on 2 and 7 DF,  p-value: 2.337e-06

The output of summary(model) will display various statistics and information about the multiple linear regression model. It will include the estimated regression coefficients along with their standard errors, t-values, and p-values (the column named “Pr(>|t|)”). The p-values indicate the significance of each coefficient, with smaller p-values suggesting that the corresponding predictor variable has a significant impact on the dependent variable.

How can we interpret each of the coefficients and each p-value?

The coefficient 0.33044 means that for each one-degree (Celsius) increase in temperature, the tensile strength increases by about 0.33 MPa on average, holding time constant.
The coefficient 24.47977 means that for each additional hour of heat treatment, the tensile strength increases by about 24.48 MPa on average, holding temperature constant.
The small p-values (0.000497 and 5.68e-05) indicate that both variables have a statistically significant effect on tensile strength.

How can we interpret the R-squared and adjusted R-squared values?

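They measure the goodness of fit of the model: the larger the values, the better the model fits the data. Here, the R-squared value (0.9754) indicates that about 97.5% of the total variation in tensile strength is explained by temperature and time together. The adjusted R-squared (0.9684) is slightly smaller because it penalizes the number of predictors, which makes it the better choice when comparing models with different numbers of predictors.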
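
To make the two measures concrete, the sketch below recomputes them from the sums of squares for the tensile strength model. It is an illustration only; the object names (`fit`, `sse`, `sst`, `r2`, `adj_r2`) are chosen here and do not appear in the analysis above.

```r
# Heat-treatment data (copied from above)
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
fit <- lm(tensile_strength ~ temperature + time)

n <- length(tensile_strength)  # number of observations
p <- 2                         # number of predictors

# Residual (unexplained) and total sums of squares
sse <- sum(residuals(fit)^2)
sst <- sum((tensile_strength - mean(tensile_strength))^2)

# R-squared: proportion of total variation explained by the model
r2 <- 1 - sse / sst

# Adjusted R-squared: penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(R_squared = r2, Adjusted_R_squared = adj_r2), 4)
```

Both values should match the “Multiple R-squared” and “Adjusted R-squared” entries in the summary output.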

What are model assumptions?

Before making inferences with a multiple linear regression model, several key assumptions must be met to ensure the validity of the results. These assumptions include:

Linearity: The relationship between each predictor (independent variable) and the outcome (dependent variable) is linear. This means that the effect of each predictor on the outcome is additive and proportional.
Independence of Errors: The residuals (errors) are independent of each other. This assumption is especially important in time-series data, where consecutive observations may be correlated. Independence implies that the error terms for one observation are not correlated with the errors of other observations.
Normality of Errors: The residuals should be approximately normally distributed, especially important when making confidence intervals and hypothesis tests. Normality of errors is less critical for estimating the coefficients themselves but is important for reliable inference.
Equal Variances of Errors: The variance of the residuals should be constant across all levels of the independent variables. This means that the spread of residuals is roughly the same for all predicted values of the outcome. If this assumption is violated, the coefficient estimates become inefficient and the reported standard errors, confidence intervals, and p-values may be unreliable.

These are called LINE assumptions. The assumptions can be checked as demonstrated below:

plot(model, 1, main = "Checking Linearity and Equal Variances\n (If satisfied, the residuals should show no systematic pattern)\n")

The graph shows a slight issue with linearity and equal variances.

plot(model, 2, main = "Checking Normality\n (If satisfied, the points should show a straight line pattern)\n")

Since points tend to be on a straight line, the normality assumption is not an issue.

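plot(model, 4, main = "Cook's Distance Showing How Influential Each Observation Is \n (Labeled observations have the largest Cook's distances)\n")

The plot labels observations 1, 3, and 7, which have the largest Cook’s distances and are therefore the most influential observations in this data set.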
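
The Cook’s distances behind this plot can also be listed numerically. The sketch below is an illustration only: the cutoff of 4/n is one common rule of thumb and is an assumption here, not something prescribed in these notes, and the object names (`fit`, `cooks_d`, `cutoff`) are chosen for this example.

```r
# Heat-treatment data (copied from above)
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
fit <- lm(tensile_strength ~ temperature + time)

# Cook's distance for each observation
cooks_d <- cooks.distance(fit)

# Observations sorted from most to least influential
round(sort(cooks_d, decreasing = TRUE), 3)

# Rule-of-thumb cutoff (an assumption of this sketch): 4 / n
cutoff <- 4 / length(cooks_d)
names(cooks_d)[cooks_d > cutoff]
```

Observations flagged this way deserve a closer look (for example, checking for recording errors); they should not be deleted automatically.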

Using the model for Prediction:

The multiple linear regression model can be used to make predictions for new data points using the predict() function:

# Predicting tensile strength for a new combination of temperature and time
new_data <- data.frame(temperature = 260, time = 2.5)
predicted_tensile_strength <- predict(model, newdata = new_data)
print(predicted_tensile_strength)

##        1 
## 202.5096

This will provide the predicted tensile strength for the new combination of temperature = 260°C and time = 2.5 hours based on the multiple linear regression model.
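
A point prediction says nothing about its uncertainty. As a minimal sketch, the `interval` argument of predict() can add a confidence interval for the mean tensile strength or a prediction interval for a single new specimen; the object names (`heat_data`, `fit`, `conf_int`, `pred_int`) are chosen here for illustration.

```r
# Heat-treatment data (copied from above)
heat_data <- data.frame(
  temperature = c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290),
  time = c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2),
  tensile_strength = c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
)
fit <- lm(tensile_strength ~ temperature + time, data = heat_data)

new_data <- data.frame(temperature = 260, time = 2.5)

# 95% confidence interval for the MEAN tensile strength at these settings
conf_int <- predict(fit, newdata = new_data, interval = "confidence", level = 0.95)

# 95% prediction interval for ONE new specimen at these settings
pred_int <- predict(fit, newdata = new_data, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both results share the same fitted value; the prediction interval is wider because it also accounts for the variability of an individual specimen around the mean.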

13 Design and Analysis of Single-Factor Experiments: The Analysis of Variance

13.1 Designing Engineering Experiments

Designing engineering experiments is a crucial process that involves planning, executing, and analyzing experiments to gather meaningful data and make informed decisions. Proper experimental design ensures that the collected data is reliable, relevant, and can lead to accurate conclusions. Here are the key steps and considerations in designing engineering experiments:

Define Objectives and Research Questions: Clearly articulate the objectives of the experiment and the specific research questions you want to address. The objectives will guide the entire experimental design process and help determine the appropriate variables to measure and control.

Identify Variables and Factors: Identify the key variables that may influence the outcome of the experiment. Variables can be classified into two types:

Independent Variables (Factors): Variables that you intentionally manipulate or control in the experiment.
Dependent Variables: Variables that you measure to observe the response or outcome.

Formulate Hypotheses: Based on your objectives and variables, develop hypotheses that state the expected relationships or differences between the factor and the dependent variable.

Null Hypothesis (H₀): This hypothesis states that the means of the dependent variable are the same across all levels of the factor.
Alternative Hypothesis (H₁): This hypothesis states that the means of the dependent variable are NOT the same across all levels of the factor.

Choose Experimental Design: Select the appropriate experimental design based on your research questions, resources, and constraints. Common types of experimental designs include:

a. Completely Randomized Design: Randomly assign the levels of the factor to experimental units. This is the design covered in this chapter.
b. Randomized Block Design: Group similar experimental units into blocks and randomize levels of the factor within each block.
c. Factorial Design: Investigate the effects of multiple factors simultaneously.

Determine Sample Size: Calculate the required sample size to achieve adequate statistical power and precision in your results. A larger sample size generally provides more reliable estimates.

Control Variables: Ensure that all extraneous factors that could influence the outcome are controlled or minimized. This may involve using control groups, blinding, or randomization.

Conduct the Experiment: Perform the experiment according to the experimental design, carefully following the procedures and recording the data accurately. Document any unexpected events or observations.

Analyze Data: Use appropriate statistical methods to analyze the data and test the hypotheses. This may involve regression analysis, ANOVA, t-tests, or other relevant techniques.

Interpret Results: Interpret the results of the data analysis in the context of your research questions and hypotheses. Draw conclusions based on the evidence provided by the data.

Draw Engineering Inferences: Apply the findings of the experiment to make engineering inferences and decisions. Determine how the results impact the engineering problem or system you are investigating.

Communicate Findings: Present the experimental design, results, and conclusions in a clear and concise manner. Clearly communicate any implications for future research or engineering applications.

By following these steps and considerations, engineers can design experiments that provide valuable insights, support decision-making, and advance the understanding of engineering systems and processes. Well-designed experiments are essential for making progress in engineering research and development.
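
For the “Determine Sample Size” step above, R’s power.anova.test() function can give a rough sample size for a one-way ANOVA. The sketch below uses hypothetical planning values (the anticipated group means and the within-group standard deviation are invented for illustration and do not come from any data in these notes):

```r
# Hypothetical planning values (assumptions for illustration only)
anticipated_means <- c(100, 105, 110, 115)  # guessed mean response at each factor level
within_sd <- 8                              # guessed standard deviation within each group

# Solve for the number of replicates per group needed to reach 90% power
# at the 5% significance level
result <- power.anova.test(groups = length(anticipated_means),
                           between.var = var(anticipated_means),
                           within.var = within_sd^2,
                           sig.level = 0.05,
                           power = 0.90)
result

# Round up to a whole number of experimental units per group
ceiling(result$n)
```

Because the answer depends heavily on the guessed means and standard deviation, it is worth repeating the calculation over a range of plausible values.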

13.2 Completely Randomized Single-Factor Experiment

A Completely Randomized Single-Factor Experiment is a type of experimental design used to study the effect of a single independent variable (also known as a factor) on a dependent variable. In this design, the experimental units are randomly assigned to different treatment levels of the factor, and the response of each unit to the treatments is measured. The objective is to compare the mean responses of the different treatment groups to determine if there are significant differences between them.

Key features of a Completely Randomized Single-Factor Experiment:

One Independent Variable (Factor): The experiment involves only one independent variable (factor) that has two or more treatment levels. Each treatment level represents a specific condition or value of the factor being tested.

Randomization: The assignment of experimental units to different treatments is done randomly to ensure that any extraneous or unknown factors are evenly distributed among the treatment groups. This helps reduce bias and allows for valid statistical inference.

Control: The experiment is designed to control any potential confounding variables or sources of variation that could influence the results. By randomly assigning treatments, the experiment aims to create similar groups with comparable characteristics.

Replication: Each treatment level is applied to multiple experimental units (replicates) to account for natural variability and provide more precise estimates of treatment effects.

Statistical Analysis: The data collected from the experiment is analyzed using statistical methods, such as analysis of variance (ANOVA), to test for significant differences between treatment means.

Example 1 (Completely Randomized Single-Factor Experiment):

Let’s consider an example where an engineer wants to investigate the effect of different cooling times on the hardness of a metal alloy. The engineer selects a sample of the metal alloy and divides it into four groups:

Group 1: Cooling time of 1 hour.
Group 2: Cooling time of 2 hours.
Group 3: Cooling time of 3 hours.
Group 4: Cooling time of 4 hours.

Each group represents a treatment level of the factor “Cooling Time.” The engineer randomly assigns several metal specimens to each group. The hardness of each specimen is measured after the designated cooling time.
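
The random assignment can be carried out in R with sample(). The sketch below assumes 20 specimens with 5 per cooling time (the description above only says “several”); the seed and the object names (`specimen_id`, `cooling_time`, `assignment`) are arbitrary choices for illustration.

```r
# Fix the random seed so the assignment can be reproduced (arbitrary value)
set.seed(2024)

# Assumed setup: 20 specimens, 4 cooling times, 5 specimens per cooling time
specimen_id <- 1:20
treatments <- rep(c(1, 2, 3, 4), each = 5)

# Randomly permute the treatment labels across the specimens
cooling_time <- sample(treatments)
assignment <- data.frame(specimen_id, cooling_time)
assignment

# Check that each cooling time received 5 specimens
table(assignment$cooling_time)
```

Shuffling a balanced vector of labels keeps the group sizes equal while leaving the choice of which specimen gets which cooling time to chance.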

The data collected can be analyzed using the analysis of variance (ANOVA) method to test if there are significant differences in hardness among the different cooling times. If ANOVA reveals a significant effect, post-hoc tests can be performed to identify specific pairs of cooling times that differ significantly in terms of hardness.

The results of the experiment will help the engineer understand how cooling time affects the hardness of the metal alloy and make informed decisions in industrial applications, such as selecting the optimal cooling time to achieve the desired hardness properties.

Let’s create a data example for the Completely Randomized Single-Factor Experiment related to cooling times and the hardness of a metal alloy. In this example, we will investigate the effect of four different cooling times on the hardness of the metal alloy.

Assume we have the following data for the hardness of the metal alloy (measured in Vickers hardness units, HV) after cooling for different durations:

Cooling Time (hours) | Hardness (HV)
1 | 300
2 | 350
3 | 380
4 | 400
2 | 340
3 | 370
1 | 290
4 | 410
3 | 375
2 | 335
1 | 295
3 | 385
4 | 395
2 | 345
1 | 305
4 | 420
3 | 380
2 | 330
4 | 415
1 | 310

In this example, the independent variable is the “Cooling Time” (in hours), and the dependent variable is the “Hardness” of the metal alloy after the specified cooling time.

Each row in the data represents one metal alloy specimen that underwent a specific cooling time. The experiment involves four different cooling times (1, 2, 3, and 4 hours), which serve as treatment levels of the factor “Cooling Time.” The metal alloy specimens were randomly assigned to each cooling time group to ensure a completely randomized experiment.

To analyze the data, we can use one-way ANOVA to test if there are significant differences in the mean hardness values among the different cooling times. If ANOVA indicates a significant effect, we can conduct post-hoc tests (e.g., Tukey’s HSD) to identify specific pairs of cooling times that result in significantly different hardness values.

The results of the experiment will help us understand how cooling time affects the hardness of the metal alloy. We can use this information to optimize the cooling process to achieve the desired hardness properties for specific engineering applications. For instance, we might find that longer cooling times lead to higher hardness values, which can be beneficial for applications requiring greater strength and wear resistance.

To analyze these data in R, we perform a one-way analysis of variance (ANOVA) to test for differences in mean hardness among the cooling times, followed by Tukey’s HSD post-hoc test to identify which pairs of cooling times differ:

# Store the cooling times into an R object
cooling_times <- c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1)
# Each cooling time is a treatment level (a category), not a numeric predictor,
# so we convert the "cooling_times" variable to a categorical variable (a factor)
cooling_times <- as.factor(cooling_times)

# Store the hardness values into an R object
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335, 295, 385, 395, 345, 305, 420, 380, 330, 415, 310)

# Form a data frame
myData <- data.frame(Cooling_Time = cooling_times,
                     Hardness_HV = hardness_values)

# Perform one-way ANOVA
model <- aov(Hardness_HV ~ Cooling_Time, data = myData)
summary(model)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Cooling_Time  3  32895   10965   165.5 2.97e-12 ***
## Residuals    16   1060      66                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The aov() function in R performs the one-way ANOVA. The output will show the ANOVA table with the F-statistic and p-value, indicating whether there are significant differences in the mean hardness values among the cooling times.

Since the $p$-value (2.97e-12) is practically zero, the data indicate that there are significant differences in the mean hardness values among the cooling times.
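
To see where the numbers in the ANOVA table come from, the sketch below rebuilds the sums of squares and the F statistic by hand. It is an illustration only; the object names (`ss_between`, `ss_within`, `f_stat`, and so on) are chosen here and are not part of the analysis above.

```r
# Cooling time data (copied from above)
cooling_times <- factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)

n_total <- length(hardness_values)   # total number of specimens
k <- nlevels(cooling_times)          # number of treatment levels

grand_mean  <- mean(hardness_values)
group_means <- tapply(hardness_values, cooling_times, mean)
group_sizes <- tapply(hardness_values, cooling_times, length)

# Between-group (treatment) sum of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group (residual) sum of squares; ave() returns each specimen's group mean
ss_within <- sum((hardness_values - ave(hardness_values, cooling_times))^2)

# Mean squares, F statistic, and p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n_total - k)
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = n_total - k, lower.tail = FALSE)

c(SS_between = ss_between, SS_within = ss_within, F = f_stat, p = p_value)
```

The two sums of squares correspond to the “Sum Sq” column of the ANOVA table (32895 and 1060), and their mean squares give the reported F value of about 165.5.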

To see which cooling times differ in mean hardness, we can conduct a post-hoc Tukey’s HSD test.

# Conduct post-hoc Tukey's HSD test
TukeyHSD(model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Hardness_HV ~ Cooling_Time, data = myData)
## 
## $Cooling_Time
##     diff    lwr     upr    p adj
## 2-1   40 25.272  54.728 4.40e-06
## 3-1   78 63.272  92.728 0.00e+00
## 4-1  108 93.272 122.728 0.00e+00
## 3-2   38 23.272  52.728 8.50e-06
## 4-2   68 53.272  82.728 0.00e+00
## 4-3   30 15.272  44.728 1.37e-04

The TukeyHSD() function conducts the post-hoc Tukey’s HSD test to compare all possible pairs of cooling times. The result shows which cooling times result in significantly different hardness values.

Since all adjusted p-values are very small (smaller than the commonly used significance levels), the mean hardness values differ significantly between every pair of cooling times. This information can guide the optimization of the cooling process to achieve the desired hardness properties for engineering applications.

Just like regression models, all ANOVA models should be subject to a residual analysis for model checking.

We first plot residuals versus fitted values to check the constant variance assumption:

plot(model, 1)

The residuals are spread evenly around zero along the range of fitted values (predicted values), so the variance of the residuals appears constant across all cooling times.

We then draw a normal Q-Q plot of the residuals to check the normality assumption:

plot(model, 2)

Based on the Q-Q plot, normality appears to be met.

We can even conduct a formal test for normality:

# Perform Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(model$residuals)
print(shapiro_test)

## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.95737, p-value = 0.4928

The large p-value (0.4928) gives no evidence against normality, so the normality assumption appears reasonable.

Overall, when interpreting a residual plot, look for signs of homoscedasticity, linearity, and normality. If the plot shows no clear patterns, the model assumptions are likely met, and the model is a good fit for the data. However, if you observe any systematic patterns or deviations from assumptions, it may indicate that further model adjustments are necessary or that the model may not be appropriate for the data.

Remember that residual plots are visual aids, and it is crucial to complement their interpretation with formal statistical tests and diagnostic procedures to make sound conclusions about the regression model’s validity and reliability.
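
As one example of such a formal test, the equal-variance assumption for the cooling time experiment could be checked with Bartlett’s test, available in R as bartlett.test(). This choice of test is an assumption of this sketch rather than a method used in these notes, and Bartlett’s test is itself sensitive to departures from normality.

```r
# Cooling time data (copied from above)
myData <- data.frame(
  Cooling_Time = factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1)),
  Hardness_HV = c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                  295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
)

# Sample variance of hardness within each cooling time
tapply(myData$Hardness_HV, myData$Cooling_Time, var)

# Bartlett's test: the null hypothesis is that all group variances are equal
bartlett_result <- bartlett.test(Hardness_HV ~ Cooling_Time, data = myData)
bartlett_result
```

A large p-value would give no evidence against equal variances, in line with what the residuals-versus-fitted plot suggests.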

14 Quality Control Basics

Quality control (QC) is a systematic process used to ensure that products or services meet specified standards and adhere to established guidelines. It involves monitoring, assessing, and managing the production or delivery process to maintain consistent quality and prevent defects. Here are some basics of quality control:

14.1 Objectives of Quality Control

Ensure products meet customer expectations and specifications.
Minimize defects, errors, and variations in production.
Optimize processes for efficiency and consistency.
Enhance customer satisfaction and loyalty.
Reduce waste and associated costs.

14.2 Key Concepts

Defect: Any deviation from the desired specifications or standards.
Variation: Differences between actual measurements and ideal values.
Process Control: Monitoring and adjusting processes to maintain quality.
Statistical Process Control (SPC): Using statistical methods to monitor and control processes.
Sampling: Evaluating a subset of items from a larger batch to infer quality.
Quality Assurance (QA): Actions taken to ensure quality before products are made.
Quality Control (QC): Activities performed to ensure quality during production.
Six Sigma: A data-driven approach to minimize defects and improve processes.
Continuous Improvement: The ongoing effort to enhance processes and quality.

14.3 Quality Control Steps

Plan: Define quality standards, methods, and resources.
Do: Implement quality control processes according to the plan.
Check: Evaluate and monitor quality using various methods, including inspections and tests.
Act: Take corrective actions to address deviations and improve processes.

14.4 Quality Control Techniques

Inspection: Visual or physical assessment of products.
Testing: Using various tests to assess product attributes.
Statistical Analysis: Applying statistical methods to monitor and control processes.
Control Charts: Graphical tools to monitor variations and identify trends.
Root Cause Analysis: Identifying underlying causes of defects.
Failure Mode and Effects Analysis (FMEA): Identifying potential failure points and their impact.

14.5 Benefits of Quality Control

Consistency in product quality and performance.
Reduced defects and waste.
Improved customer satisfaction and loyalty.
Enhanced brand reputation.
Efficient resource utilization.
Regulatory compliance.

14.6 Quality Control in Different Industries

Manufacturing: Ensuring products meet specifications.
Healthcare: Ensuring patient safety and accurate diagnoses.
Software Development: Identifying and fixing software defects.
Construction: Ensuring buildings adhere to safety and quality standards.

Quality control is essential for maintaining customer trust, ensuring product reliability, and achieving operational excellence. It involves a combination of methods, processes, and continuous improvement efforts to deliver consistent and high-quality products or services.

14.7 Control Charts for Proportions

A control chart for proportions (also known as a p-chart) is a graphical tool used in quality control to monitor the stability of a process that produces discrete outcomes or proportions. It’s commonly used when dealing with attributes data, such as the proportion of defective items in a sample.

For the theory behind p-charts, see: https://sixsigmastudyguide.com/p-attribute-charts/

A p-chart is typically constructed as follows:

Step 1: calculate the proportion ($\hat{p}$) of defective items in each sample and calculate the overall proportion of defective items across all samples ($\bar{p}$).
Step 2: calculate the upper control limit (UCL) and lower control limit (LCL) as

\[\bar{p}\pm z\cdot \sqrt{\frac{\bar{p}\cdot (1-\bar{p})}{\bar{n}}}\]

where $\bar{n}$ is the average sample size and $z$ is typically 3 for a 3-$\sigma$ control chart. Note: if LCL is negative, use 0 instead; if UCL is larger than 1, use 1 instead. If individual sample sizes are used instead of $\bar{n}$, the two limits may not be constant, as demonstrated in the reference https://sixsigmastudyguide.com/p-attribute-charts/.

Step 3: plot $\hat{p}$ versus sample serial number (1, 2, 3, … on the x-axis).
Step 4: add a center line (CL) corresponding to the overall proportion $\bar{p}$.
Step 5: add the UCL and LCL.
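
The five steps above can be sketched in base R without any extra package. The helper name `p_chart_limits` and the input counts are assumptions made for illustration; the limits use the average sample size $\bar{n}$, as in the formula above.

```r
# Illustrative inputs: sample sizes and numbers of defectives
sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defectives <- c(2, 4, 1, 5, 3, 6, 2, 4)

# Hypothetical helper: center line and limits based on the average sample size
p_chart_limits <- function(defectives, sizes, z = 3) {
  p_hat <- defectives / sizes              # Step 1: proportion in each sample
  p_bar <- sum(defectives) / sum(sizes)    # Step 1: overall proportion
  n_bar <- mean(sizes)                     # average sample size
  half_width <- z * sqrt(p_bar * (1 - p_bar) / n_bar)
  list(p_hat = p_hat,
       center = p_bar,
       lcl = max(0, p_bar - half_width),   # Step 2: truncate LCL at 0
       ucl = min(1, p_bar + half_width))   # Step 2: truncate UCL at 1
}

lim <- p_chart_limits(defectives, sizes)

# Steps 3 to 5: plot the proportions, then add CL, LCL, and UCL
plot(seq_along(lim$p_hat), lim$p_hat, type = "b",
     ylim = c(0, 1.1 * max(lim$ucl, lim$p_hat)),
     xlab = "Sample number", ylab = "Proportion defective")
abline(h = lim$center)
abline(h = c(lim$lcl, lim$ucl), lty = 2)
```

Using $\bar{n}$ keeps the limits constant across samples, which is a reasonable simplification when the sample sizes do not differ much.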

A video showing how you can create a p-chart in Excel: https://www.youtube.com/watch?v=mO7fcV4R_LY.

Here’s an example showing how you can create and interpret a p-chart using R.

Example: Defective Products in a Manufacturing Process

Let’s assume you are monitoring the proportion of defective products in a manufacturing process. You collect data over time to track the proportion of defects in each sample.

The sample sizes are: 50, 60, 55, 65, 70, 75, 60, 80

The respective numbers of defectives are: 2, 4, 1, 5, 3, 6, 2, 4

Here is the R code for constructing a p-chart:

# Sample data:
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 4)

# Calculate proportion in each sample
proportions <- defective_counts / sample_sizes

# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)

# Load required library with the library() function. Before loading the library, 
# you need to install the library first by typing:
# install.packages("qcc") 
# on the console of RStudio/Posit.
library(qcc)

# Create the p-chart 
qcc_obj <- qcc(defective_counts, type = "p", sizes = sample_sizes,
               title = "P-Chart: Defective Products with Different Sample Sizes")

In this example, the p-chart shows the proportion of defective products in each sample along with control limits. The center line represents the overall proportion of defects across all samples. Control limits are calculated based on statistical methods to identify points that fall outside expected variation.

Interpretation:

Points within the control limits suggest that the process is stable and variation is consistent. Points outside the control limits indicate potential issues or changes in the process. Trends or patterns in the chart can provide insights into process behavior. Here, there is no point outside the control limits. No obvious pattern can be observed either.

Remember that control charts are most effective when used as part of a comprehensive quality control system, and they help identify deviations that warrant investigation and corrective action.

Example 2: Control Chart with Out-of-Control Points

# Sample data: sample sizes and numbers of defectives in each sample
# (the 8th sample has an unusually high number of defectives)
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)

# Calculate proportions
proportions <- defective_counts / sample_sizes

# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)

# Load required library
library(qcc)

# Create the p-chart 
qcc(defective_counts, type = "p", sizes = sample_sizes,
    title = "P-Chart: Defective Products with Different Sample Sizes")

## List of 11
##  $ call      : language qcc(data = defective_counts, type = "p", sizes = sample_sizes, title = "P-Chart: Defective Products with Differen| __truncated__
##  $ type      : chr "p"
##  $ data.name : chr "defective_counts"
##  $ data      : num [1:8, 1] 2 4 1 5 3 6 2 12
##   ..- attr(*, "dimnames")=List of 2
##  $ statistics: Named num [1:8] 0.04 0.0667 0.0182 0.0769 0.0429 ...
##   ..- attr(*, "names")= chr [1:8] "1" "2" "3" "4" ...
##  $ sizes     : num [1:8] 50 60 55 65 70 75 60 55
##  $ center    : num 0.0714
##  $ std.dev   : num 0.258
##  $ nsigmas   : num 3
##  $ limits    : num [1:8, 1:2] 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ violations:List of 2
##  - attr(*, "class")= chr "qcc"

In a control chart, an out-of-control point (here the 8th point) typically indicates a situation where the process has experienced a significant shift or variation that goes beyond normal expected variation. This could be due to various factors, such as equipment malfunction, changes in the production process, operator error, or other special causes.
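
To see why the 8th point is flagged, the limits for Example 2 can be recomputed by hand. This is a minimal sketch that assumes 3-$\sigma$ limits based on each sample’s own size (so the limits vary from sample to sample); the variable names `p_hat`, `p_bar`, `ucl`, `lcl`, and `out_of_control` are introduced here for illustration.

```r
# Data from Example 2
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)

p_hat <- defective_counts / sample_sizes              # proportion in each sample
p_bar <- sum(defective_counts) / sum(sample_sizes)    # center line

# Per-sample 3-sigma limits, truncated to the interval [0, 1]
se <- sqrt(p_bar * (1 - p_bar) / sample_sizes)
ucl <- pmin(1, p_bar + 3 * se)
lcl <- pmax(0, p_bar - 3 * se)

# Indices of samples falling outside their own limits
out_of_control <- which(p_hat > ucl | p_hat < lcl)
out_of_control

# Compare the flagged proportion with its upper limit
round(c(p_hat = p_hat[8], ucl = ucl[8]), 4)
```

The 8th sample has a proportion of 12/55, which lies above its upper control limit, while every other sample stays inside its limits.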

14.8 R Chart and X-bar Chart

Refer to this excellent reference: https://www.r-bloggers.com/2018/08/using-control-charts-in-r/

Temperature	Time	Tensile.Strength
200	1	150
250	2	180
300	2	210
250	1	160
300	3	230
200	2	170
280	2	190
270	3	220
220	1	155
290	2	200

Stat 353 Notes

SZ

2025-08-29

1 An Overview

1.1 What Statistics Can Do for You?

1.2 Topics to Cover

1.3 Software for This Course

1.4 Use of AI

1.5 Examples of Using ChatGpt

1.6 The Role of Statistics in Engineering

1.6.1 Primary Roles of Statistics in General

1.6.2 Mechanical and Empirical Models

2 Probability

2.1 Random Experiments

2.2 Sample Space and Events

2.3 Basic Operations on Events

2.4 Counting Techniques

2.5 Permutations

2.6 Combinations

2.7 Multiplication Principle:

2.8 Addition Principle:

2.9 Probability

2.10 Conditional Probability

2.11 Total Probability Rules

2.12 Bayes’ Theorem

2.13 Independence

2.14 Exercise

3 Discrete Random Variables and Probability Distributions

3.1 Probability Mass Functions

3.2 Cumulative Distribution Functions

3.3 Mean and Variance of a Discrete Random Variable

3.4 Discrete Uniform Distribution

3.5 Binomial Distribution

3.6 Geometric and Negative Binomial Distributions

3.7 Poisson Distribution

3.8 Chapter Practice Problems

4 Continuous Random Variables and Probability Distributions

4.1 Probability Distributions and Probability Density Functions

4.2 Cumulative Distribution Functions

4.3 Mean and Variance of a Continuous Random Variable

4.4 Continuous Uniform Distribution

4.5 Normal Distribution

4.6 Exponential Distribution

4.7 The t Distribution

4.8 Reliability Function

4.9 Chapter Practice Problems

5 Joint Probability Distributions

5.1 Joint Probability Distributions for Two Random Variables

5.2 Conditional Probability Distributions and Independence

5.3 Covariance and Correlation

5.4 Linear Functions of Random Variables

6 Descriptive Statistics

6.1 An Introduction to R

6.2 Numerical Summaries of Data

6.3 Stem-and-Leaf Diagrams

6.4 Frequency Distributions and Histograms

6.5 Box Plots

6.6 Time Sequence Plots

6.7 Scatter Diagrams

7 Point Estimation of Parameters and Sampling Distributions

7.1 Point Estimation

7.2 Sampling Distributions and the Central Limit Theorem

7.3 General Concepts of Point Estimation

7.4 Unbiased Estimators

7.5 Variance of a Point Estimator

7.6 Standard Error: Reporting a Point Estimate

8 Statistical Intervals for a Single Sample

8.1 Confidence Interval on the Mean of a Normal Distribution, Variance Known

8.2 Confidence Interval on the Mean of a Normal Distribution, Variance Unknown

8.2.1 t Distribution

8.2.2 t Confidence Interval on the Population Mean

8.3 Large-Sample Confidence Interval for a Population Proportion

8.4 Finding Confidence Intervals Using Software

9 Tests of Hypotheses for a Single Sample

9.1 Hypothesis Testing

9.1.1 Statistical Hypotheses

9.1.2 Tests of Statistical Hypotheses

9.1.3 One-Sided and Two-Sided Hypotheses

9.1.4 P-Values in Hypothesis Tests

9.1.5 Connection between Hypothesis Tests and Confidence Intervals
