SZ
8/14/2023
Statistics is the art and science of collecting, analyzing, interpreting, and presenting data to understand patterns, make informed decisions, and uncover insights about various phenomena in the world.
“All methods of acquiring knowledge are essentially statistics.” (by C.R. Rao)
Statistics plays a crucial role in various aspects of our lives. In natural and life sciences, statistics is used to design experiments, analyze experimental results, and draw meaningful conclusions.
Statistics also finds extensive uses in other areas, such as Business and Economics, Social Sciences, Quality Control and Manufacturing, Environmental Studies, Sports and Entertainment, Public Policy and Government, Risk Management and Insurance, and Education.
We will use R through RStudio, an integrated development environment (IDE) made by Posit (the company formerly named RStudio), to do statistical analysis and create project reports. There are two ways to use it:
Visit https://posit.co/download/rstudio-desktop/ and install R and RStudio Desktop on your personal computer. Many SCSU computers already have both installed. There can be some issues with a local installation of R/RStudio.
Visit https://posit.cloud/. You do not need to install anything on your computer. Just register on this webpage. Once registered, log in and start a new project. You do not need to create a new project each time, since you can create many folders and files under the first project you have created.
I use ChatGPT to prepare my notes.
“The use of AI in creating learning in this course is supported. Be clear about your use of AI. Ensure that your work is truly your work. Think critically about when AI should be used vs when it is an easy out. Ask questions to your peers and me. AI may continue to impact our society in ways that we cannot imagine and part of your experience here is to grow your understanding of what could be, not what was. Also, please refer to the Academic Integrity policies that apply to all courses within our institution, and ask me for clarification on how Academic Integrity and AI intersect in this course.”
Here are the SCSU guidelines on the use of AI: https://services.stcloudstate.edu/TDClient/1919/Portal/KB/?CategoryID=24236
Explain what is variance of a sample. Give me a data example in biology using R. Once ChatGPT gives you a response, type “The code is too complex. please simplify.” If you want a simple printout, type “no use of cat() function.” The resulting code should then be easier to follow.
Explain what is a histogram. Give me a data example in health science using R. Interpret the plot.
What is a confidence interval for a population proportion? Give me a data example in environmental science using R. Interpret the result.
ChatGPT is not perfect and it makes mistakes. Therefore, use ChatGPT with caution!
A quick quiz: why should I consider a career in data science?
A ChatGPT answer:
Considering a career in data science can be a rewarding decision for several reasons:
High Demand: Data science professionals are in high demand across industries as organizations recognize the value of data-driven decision-making.
Variety of Industries: Data science is applicable in diverse fields such as finance, healthcare, marketing, technology, and more, providing opportunities to work in areas of personal interest.
Well-Paying Jobs: Data scientists often enjoy competitive salaries due to their specialized skills and the increasing demand for data expertise.
Impactful Insights: Data analysis enables you to extract insights from complex data, helping businesses make informed decisions, optimize processes, and identify trends.
Innovation: Data scientists play a key role in developing innovative solutions using machine learning, artificial intelligence, and predictive analytics.
Continuous Learning: The field of data science is constantly evolving with new technologies and techniques, offering a dynamic learning environment.
Problem Solving: Data science involves solving real-world problems using data-driven approaches, making it intellectually stimulating and fulfilling.
Flexibility: Many data science roles offer remote work options and flexible schedules, allowing for a better work-life balance.
Career Growth: Data science roles often have clear career paths with opportunities to advance to senior or specialized positions.
Interdisciplinary Skills: Data science combines skills from programming, statistics, mathematics, and domain expertise, making it a well-rounded career option.
Global Relevance: Data science is a global field with opportunities to work with international organizations and collaborate on cross-border projects.
Community and Networking: Joining the data science community offers opportunities to connect with like-minded professionals, share knowledge, and collaborate on projects.
While data science offers numerous advantages, it’s important to note that it requires a strong foundation in programming, mathematics, and statistics. Pursuing relevant education, gaining practical experience, and staying updated with industry trends are crucial steps toward building a successful career in data science.
Following this slide, you will begin learning statistical methods.
Data Collection: Statistics helps in designing surveys, experiments, and observational studies to collect relevant and reliable data.
Data Analysis: Statistics provides methods for analyzing and summarizing data.
Inference: Statistics helps determine whether observed differences or relationships have occurred by chance.
Modeling and Prediction: Statistics allows for the development of mathematical models to represent complex real-world phenomena. These models can be used for prediction.
Mechanistic models, also known as deterministic models, are built on a deep understanding of the underlying principles and mechanisms governing the system.
Empirical models, also known as statistical models or data-driven models, are developed based on observed data without a deep understanding of the underlying mechanisms.
Engineers often use a combination of these models to achieve a comprehensive understanding of complex systems.
This chapter will lay a theoretical foundation for statistics.
A random experiment refers to a process or procedure that can result in multiple possible outcomes with uncertainty.
Tossing a Coin: When flipping a fair coin, the possible outcomes “Heads” and “Tails” are uncertain and depend on many factors.
Testing the Strength of Materials: In a material strength test, the sample may exhibit different strengths due to inherent variations in its composition. The strength of each sample tested is subject to chance.
Sample Space: The set of all possible outcomes of a random experiment is called the sample space.
An event is a subset of the sample space. It represents a particular outcome or a combination of outcomes.
In engineering, experiments can involve various factors and parameters that lead to different outcomes. For example, if you are testing the tensile strength of a material, the sample space is \([0, \infty)\). An event might be “the tensile strength of a material is greater than 5 pounds per square inch.”
Union of Events \(A \cup B\): The union of two events \(A\) and \(B\), denoted as \(A \cup B\), represents the event that either \(A\) occurs, or \(B\) occurs, or both occur.
Intersection of Events \(A \cap B\) or \(AB\): The intersection of two events \(A\) and \(B\), denoted as \(A \cap B\) or \(AB\), represents the event that both \(A\) and \(B\) occur simultaneously.
Complement of an Event \(A'\): The complement of an event \(A\), denoted as \(A'\), \(A^c\), or \(\bar{A}\), represents the event that \(A\) does not occur.
These basic operations can be extended to more than two events as well.
Counting techniques in probability involve methods to determine the number of possible outcomes in a sample space or the number of ways events can occur. These techniques are essential for calculating some probabilities.
Permutations are arrangements of a set of objects in a specific order. When dealing with a set of \(n\) distinct objects and selecting \(r\) of them in a specific order, the number of permutations is denoted as \(_nP_r\) or \(P(n, r)\) and calculated as \(\frac{n!}{ (n - r)!}\). Permutations are used in various scenarios, such as arranging people in a line, arranging letters in a word, or selecting a specific order of events.
Example:
Suppose you have 5 different books, and you want to arrange 3 of them on a shelf. The number of ways to arrange these books is \(_5P_3 = \frac{5!}{ (5 - 3)!} = \frac{5!}{ 2!} = 60\).
Combinations are selections of a subset of objects from a larger set, where the order of selection does not matter. When dealing with a set of \(n\) distinct objects and selecting \(r\) of them without regard to order, the number of combinations is denoted as \(_nC_r\) or \(C(n, r)\) or \(\binom{n}{r}\) and calculated as \(\frac{n!} {r! (n - r)!}\). Combinations are used when the order of elements does not influence the outcome, such as selecting a team of players from a pool of candidates.
Example:
Suppose there are 8 candidates running for a committee, and you want to select 4 of them. The number of ways to form the committee is \(\binom{8}{4} = \frac{8!} {4! (8 - 4)!} = 70\).
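The two counting examples above can be checked numerically. The following is a minimal sketch in Python (chosen here only for illustration; the variable names are hypothetical), using the standard-library functions `math.perm` and `math.comb`:

```python
import math

# Permutations: ordered arrangements of 3 out of 5 different books.
book_arrangements = math.perm(5, 3)   # 5! / (5 - 3)!

# Combinations: unordered committees of 4 out of 8 candidates.
committees = math.comb(8, 4)          # 8! / (4! * (8 - 4)!)

print(book_arrangements)  # 60
print(committees)         # 70
```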
The multiplication principle, also known as the fundamental counting principle, states that if there are \(m\) ways to do one thing and \(n\) ways to do another thing, then there are \(m \cdot n\) ways to do both things together. This principle is often used when events are independent, meaning that the outcomes of one event do not affect the outcomes of others.
Example:
Suppose you have 3 different shirts and 2 different pairs of pants. The number of ways to choose a shirt and a pair of pants to wear is \(3 \cdot 2 = 6\).
The addition principle states that if there are \(m\) ways to do one thing and \(n\) ways to do another thing, and these events are mutually exclusive (cannot happen together), then there are \(m + n\) ways to do either one thing or the other. This principle is the basis of casework, which involves splitting a problem into several non-overlapping parts, counting these parts individually, then adding together the totals of each part.
Example:
Suppose we want to find the number of ways to arrange the letters in the word “BOOK.” The two O’s are identical, so we can use casework based on the position of the leftmost O. These cases are mutually exclusive and together cover every arrangement:
Case 1: The leftmost O is in the first position: O _ _ _. The remaining three letters (B, O, K) can be arranged in 3! = 6 ways. Total outcomes for Case 1: 6 ways
Case 2: The leftmost O is in the second position: _ O _ _. The first position holds B or K (2 ways), and the remaining two letters (the other O and the other consonant) can be arranged in the last two positions in 2! = 2 ways. Total outcomes for Case 2: 2 · 2 = 4 ways
Case 3: The leftmost O is in the third position: _ _ O O. The letters B and K fill the first two positions in 2! = 2 ways. Total outcomes for Case 3: 2 ways
Now, we add up the outcomes from all cases: 6 + 4 + 2 = 12. So, there are 12 ways to arrange the letters in the word “BOOK.”
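A brute-force check of this count is possible because the word is short. The sketch below (Python, with a hypothetical variable name) lists every ordering of the four letters and keeps only the distinct words:

```python
from itertools import permutations

# permutations() treats the two O's as different objects, so each distinct
# word appears twice; collecting the results in a set removes the repeats.
distinct_words = {"".join(p) for p in permutations("BOOK")}

print(len(distinct_words))  # 12
print(sorted(distinct_words))
```

The count also agrees with \(\frac{4!}{2!} = 12\), where dividing by \(2!\) accounts for the two identical O’s.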
Probability is a fundamental concept in engineering that plays a crucial role in decision-making, risk analysis, and designing robust systems. It is a branch of mathematics.
Key Concepts in Probability:
Sample Space: The set of all possible outcomes of an experiment is called the sample space. For example, if you are rolling a six-sided die, the sample space would be {1, 2, 3, 4, 5, 6}.
Event: An event is a subset of the sample space. It represents a particular outcome or a combination of outcomes.
Probability of an Event: The probability of an event represents the likelihood of that event occurring. It is a number between 0 and 1, where 0 indicates an impossible event, and 1 represents a certain event. The probability of an event \(A\) is denoted by \(P(A)\).
Basic Probability Rules: Probability follows certain rules. The sum of probabilities of all possible outcomes in the sample space is always 1. The probability of an event not occurring is 1 minus the probability of the event occurring. For mutually exclusive events (or disjoint events, events that do not happen simultaneously), the probability of either event occurring is the sum of their individual probabilities.
Random Variables: In engineering, we often deal with random variables, which are variables whose values are determined by chance. Random variables can be discrete (taking specific values) or continuous (taking any value within a range).
Conditional Probability: Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as \(P(A|B)\), which is the probability of \(A\) given \(B\). To calculate \(P(A|B)\) when \(P(B)>0\), we use the formula:
\[P(A|B)=\frac{P(A\cap B)}{P(B)}\]
Example: Throw a 6-sided die once. If the outcome is an odd number, what is the probability that the die landed on 1, 2, or 3?
Solution. Let \(A\) denote the event that the outcome is odd. Let \(B\) denote the event that the outcome is 1, 2, or 3. It’s easy to see that \(A\cap B\) is the event that the outcome is 1 or 3. Furthermore, \(P(A)=\frac{3}{6}\) and \(P(A\cap B)=\frac{2}{6}\). We need to find \(P(B|A)\):
\[P(B|A)=\frac{P(A\cap B)}{P(A)}=\frac{2/6}{3/6}=\frac{2}{3}\]
That is, given the outcome is odd, the probability that the die landed on 1, 2, or 3 is 2/3.
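The die example can be reproduced by listing the sample space and counting outcomes. This is a minimal sketch in Python, assuming all six faces are equally likely; the helper `prob` is a hypothetical name introduced for illustration:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 1}  # the outcome is odd
B = {1, 2, 3}                                # the outcome is 1, 2, or 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(B|A) = P(A and B) / P(A)
p_B_given_A = prob(A & B) / prob(A)
print(p_B_given_A)  # 2/3
```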
The Total Probability Rule, also known as the Law of Total Probability, is a fundamental concept in probability theory that allows us to calculate the probability of an event by considering all possible ways or scenarios that lead to that event. It is particularly useful when the event of interest depends on different conditions or sub-events. The Total Probability Rule is expressed as follows:
Suppose we have a partition of the sample space, i.e., a set of mutually exclusive events \(\{B_1, B_2, \ldots, B_n\}\) that together cover the entire sample space \(\Omega\). Then, for any event \(A\), its probability can be calculated as:
\[P(A) = \sum_{i=1}^n[P(A|B_i) \cdot P(B_i)].\]
In simpler terms, the probability of event \(A\) is the sum of the probabilities of event \(A\) occurring given each condition \(B_i\) multiplied by the probability of each condition \(B_i\).
Example 1
Suppose a factory produces light bulbs, and there are two machines used to manufacture them: Machine A and Machine B. Machine A produces 60% of the bulbs, and Machine B produces the remaining 40%. The probability that a bulb is defective, given it was produced by Machine A, is 0.03, and the probability that a bulb is defective, given it was produced by Machine B, is 0.05. What is the probability that a randomly selected bulb is defective?
Solution:
Let \(A\) be the event that a bulb is defective, and let \(B_1\) be the event that the bulb is produced by Machine A, and \(B_2\) be the event that the bulb is produced by Machine B. The partition \(\{B_1, B_2\}\) covers the entire sample space (all bulbs).
Using the Total Probability Rule: \[P(A) = P(A|B_1) \cdot P(B_1) + P(A|B_2) \cdot P(B_2)\] \[P(A) = (0.03 \cdot 0.60) + (0.05 \cdot 0.40)\] \[P(A) = 0.018 + 0.020\] \[P(A) = 0.038\]
The probability that a randomly selected bulb is defective is 0.038 (or 3.8%).
Example 2
Suppose the weather conditions in a certain city can be categorized into three types: Sunny (S), Cloudy (C), and Rainy (R). Historical data shows that the probabilities of these conditions are \(P(S) = 0.4, P(C) = 0.3\), and \(P(R) = 0.3\). The probability of carrying an umbrella on a Sunny day is 0.1, on a Cloudy day is 0.3, and on a Rainy day is 0.8. What is the overall probability of carrying an umbrella in this city?
Solution:
Let \(A\) be the event of carrying an umbrella, and let \(B_1, B_2\), and \(B_3\) represent the events of having a Sunny, Cloudy, and Rainy day, respectively. The partition \(\{B_1, B_2, B_3\}\) covers the entire sample space (all possible weather conditions).
Using the Total Probability Rule:
\[P(A) = P(A|B_1) \cdot P(B_1) + P(A|B_2) \cdot P(B_2) + P(A|B_3) \cdot P(B_3)\]
\[P(A) = (0.1 \cdot 0.4) + (0.3 \cdot 0.3) + (0.8 \cdot 0.3)\] \[P(A) = 0.04 + 0.09 + 0.24\] \[P(A) = 0.37\]
The overall probability of carrying an umbrella in this city is 0.37 (or 37%).
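Both examples apply the same weighted sum, so one small function covers them. The sketch below is in Python; the function name `total_probability` and its argument names are hypothetical:

```python
import math

def total_probability(conditionals, priors):
    """Return the sum of P(A|B_i) * P(B_i) over a partition B_1, ..., B_n."""
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("the P(B_i) of a partition must sum to 1")
    return sum(c * p for c, p in zip(conditionals, priors))

# Example 1: Machine A (60%, 3% defective) and Machine B (40%, 5% defective)
p_defective = total_probability([0.03, 0.05], [0.60, 0.40])

# Example 2: Sunny, Cloudy, and Rainy days
p_umbrella = total_probability([0.1, 0.3, 0.8], [0.4, 0.3, 0.3])

print(round(p_defective, 3))  # 0.038
print(round(p_umbrella, 2))   # 0.37
```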
Bayes’ Theorem, also known as Bayes’ Rule or Bayes’ Law, is a fundamental concept in probability theory and statistics. It provides a way to update the probability of an event based on new evidence or information. The theorem is named after the Reverend Thomas Bayes, an 18th-century mathematician and theologian, who first formulated the idea.
Bayes’ Theorem is stated as follows:
\[P(A|B) = \frac{P(B|A) \cdot P(A)} {P(B)}\]
where:
\(P(A|B)\) is the conditional probability of event \(A\) occurring given that event \(B\) has occurred. \(P(B|A)\) is the conditional probability of event \(B\) occurring given that event \(A\) has occurred. \(P(A)\) is the probability of event \(A\) occurring without considering event \(B\). \(P(B)\) is the probability of event \(B\) occurring without considering event \(A\).
In simpler terms, Bayes’ Theorem allows us to update our prior belief about the probability of event \(A\) (i.e., \(P(A)\)) based on the new evidence or information that event \(B\) has occurred. The resulting probability, \(P(A|B)\), is called the posterior probability.
Example 1
Suppose we have a rare disease that affects 1 in 10,000 people (i.e., \(P(A) = 0.0001\) with \(A\) representing the event of having the disease). A medical test is conducted to diagnose the disease, and the test has a false-positive rate of 1% (i.e., \(P(B|A') = 0.01\), where \(A'\) represents not having the disease and \(B\) represents the test being positive). The test also has a true-positive rate of 99% (i.e., \(P(B|A) = 0.99\)).
Now, we want to find the probability that a person has the disease given that the test result is positive, \(P(A|B)\).
Using Bayes’ Theorem, with the denominator expanded by the Total Probability Rule as \(P(B) = P(B|A') \cdot P(A') + P(B|A) \cdot P(A)\):
\[P(A|B) = \frac{P(B|A) \cdot P(A)} {P(B)}\] \[P(A|B) = \frac{0.99 \cdot 0.0001}{0.01 \cdot 0.9999 + 0.99 \cdot 0.0001}\] \[P(A|B) = \frac{0.000099}{0.009999 + 0.000099}\] \[P(A|B) \approx 0.0098039\]
The probability that a person has the disease given a positive test result is approximately 0.0098 (or 0.98%). Bayes’ Theorem allows us to incorporate the test’s true-positive and false-positive rates to arrive at a more accurate probability of having the disease after the test result.
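The calculation can be wrapped in a short function to see how the posterior depends on the three inputs. This is a sketch in Python; the function name `posterior` and its parameter names are hypothetical:

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(A|B) by Bayes' Theorem, with P(B) from the Total Probability Rule."""
    p_b = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_b

p_disease_given_positive = posterior(
    prior=0.0001, true_positive_rate=0.99, false_positive_rate=0.01
)
print(round(p_disease_given_positive, 4))  # 0.0098
```

Even with an accurate test, the posterior stays small because the disease is so rare: false positives among the many healthy people far outnumber true positives.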
Example 2
A quality-control program at a plastic bottle production line involves inspecting finished bottles for flaws such as microscopic holes. The proportion of bottles that actually have such a flaw is only 0.0002. If a bottle has a flaw, the probability is 0.995 that it will fail the inspection. If a bottle does not have a flaw, the probability is 0.99 that it will pass the inspection.
(a) If a bottle fails inspection, what is the probability that it has a flaw?
(b) Which of the following is the more correct interpretation of the answer to part (a)?
(i) Most bottles that fail inspection do not have a flaw.
(ii) Most bottles that pass inspection do have a flaw.
(c) If a bottle passes inspection, what is the probability that it does not have a flaw?
(d) Which of the following is the more correct interpretation of the answer to part (c)?
(i) Most bottles that fail inspection do have a flaw.
(ii) Most bottles that pass inspection do not have a flaw.
Solution.
Denote \(A\) = “flaw” and \(B\) = “fail”. We are given that \(P(A)=0.0002\), \(P(B|A)=0.995\), and \(P(B^c|A^c)=0.99\). By the Total Probability Rule, we have: \[P(B)=P(B|A)\cdot P(A) + P(B|A^c)\cdot P(A^c)=(0.995)(0.0002)+(1-0.99)(1-0.0002)= 0.010197.\]
(a) \(P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B|A)\cdot P(A)}{P(B)}=\frac{(0.995)(0.0002)}{0.010197}\approx 0.01952\)
(b) Most bottles that fail inspection do not have a flaw.
(c) \(P(A^c|B^c)=\frac{P(A^c\cap B^c)}{P(B^c)}=\frac{P(B^c|A^c)\cdot P(A^c)}{P(B^c)}=\frac{(0.99)(1-0.0002)}{1-0.010197}\approx 0.999999\)
(d) Most bottles that pass inspection do not have a flaw.
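The arithmetic in this solution is easy to get wrong by hand, so it may help to recompute it. The sketch below is in Python with hypothetical variable names; it uses only the three probabilities given in the problem:

```python
p_flaw = 0.0002               # P(A)
p_fail_given_flaw = 0.995     # P(B|A)
p_pass_given_no_flaw = 0.99   # P(B^c|A^c)

# Total Probability Rule: P(B) = P(B|A) P(A) + P(B|A^c) P(A^c)
p_fail = p_fail_given_flaw * p_flaw + (1 - p_pass_given_no_flaw) * (1 - p_flaw)

# Bayes' Theorem for the two conditional probabilities asked for
p_flaw_given_fail = p_fail_given_flaw * p_flaw / p_fail
p_no_flaw_given_pass = p_pass_given_no_flaw * (1 - p_flaw) / (1 - p_fail)

print(p_fail)
print(p_flaw_given_fail)
print(p_no_flaw_given_pass)
```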
In probability theory, two events \(A\) and \(B\) are considered independent if the occurrence of one event does not affect the probability of the other event occurring. In other words, the outcome of one event provides no information about the outcome of the other event. Mathematically, events \(A\) and \(B\) are independent if and only if \(P(A \cap B) = P(A) \cdot P(B)\),
where \(P(A \cap B)\) represents the probability of both events \(A\) and \(B\) happening together, \(P(A)\) is the probability of event \(A\) occurring, and \(P(B)\) is the probability of event \(B\) occurring.
If two events \(A\) and \(B\) are independent, then \(P(A|B) = P(A)\) (provided \(P(B)>0\)) and \(P(B|A) = P(B)\) (provided \(P(A)>0\)).
Example:
Rolling a fair six-sided die twice, the outcomes of each roll are independent events. The probability of rolling a 3 on the first roll is 1/6, and the probability of rolling a 3 on the second roll is also 1/6. The probability of rolling a 3 on both rolls (both events occurring together) is 1/6 * 1/6 = 1/36.
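The product rule in this example can be verified by listing all 36 equally likely pairs of rolls. The following Python sketch does so; the helper `prob` is a hypothetical name:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 (first, second) pairs

def prob(event):
    """Fraction of the 36 pairs for which the event holds."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

p_first = prob(lambda r: r[0] == 3)                # 3 on the first roll
p_second = prob(lambda r: r[1] == 3)               # 3 on the second roll
p_both = prob(lambda r: r[0] == 3 and r[1] == 3)   # 3 on both rolls

print(p_first, p_second, p_both)      # 1/6 1/6 1/36
print(p_both == p_first * p_second)   # True
```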
In probability and statistics, a random variable is a variable that can take on different values, each with a certain probability, due to underlying random processes or uncertainty. Random variables are a fundamental concept in probability theory and play a crucial role in modeling and analyzing uncertain events and probabilistic phenomena. They serve as a bridge between the theoretical mathematics of probability and real-world applications in various fields, including engineering, economics, physics, and social sciences.
Random variables can be categorized into two main types: discrete random variables and continuous random variables. We focus on discrete random variables in this chapter.
A discrete random variable is a random variable whose set of possible values is countable. A random variable can be denoted by an uppercase letter such as \(X, Y\), and \(Z\).
Examples:
A lab has 10 computers. Let \(X\) denote the number of computers that fail to work. Then \(X\) is a random variable.
The engineers in a large company can be mechanical engineers, electrical engineers, or other. Randomly choose an engineer from this company. Let T denote the type of the selected engineer. \(T\) is a discrete random variable, assuming values in the set of {me, ee, ot}.
The life time of a randomly chosen computer is denoted by \(Y\). \(Y\) is NOT a discrete random variable. It is called a continuous random variable.
If \(X\) is a discrete random variable, then we denote the probability that \(X=x\) by \(p(x)\) and we call this function the probability mass function (PMF) of \(X\). This function fully describes the distribution of the random variable \(X\). The set of values \(x\) with \(p(x)>0\) is called the support of the distribution (or of the random variable). The sum of all probabilities in the PMF is equal to one.
An example:
Flip a fair coin twice. Let \(X\) be the number of heads. Then \(X\) can take values 0, 1, 2 with probabilities 0.25, 0.5, 0.25. The sum of the probabilities equals one.
A random variable can be categorical or numerical (discrete or continuous). The PMF describes the distribution of a categorical or discrete numerical random variable. If \(X\) is a numeric random variable, we can also equivalently use the cumulative distribution function (CDF), usually denoted by \(F(x)\), to describe its distribution. This CDF is defined as the probability that \(X\) is no greater than \(x\); that is, \(F(x)=P(X\le x)\).
An example:
Suppose \(X\) is a random variable taking values \(-2\), 0, 3, and 5 with probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the CDF \(F(x)\) can be determined as follows:
\[\text{when} ~x< -2, F(x) = P(X\le x) =P(\emptyset)= 0\]
\[\text{when} ~-2\le x< 0, F(x) = P(X\le x) =P(X=-2)= 0.3\]
\[\text{when} ~0\le x< 3, F(x) = P(X\le x) =P(X=-2 ~\text{or} ~0)= 0.3+0.1=0.4\]
\[\text{when} ~3\le x< 5, F(x) = P(X\le x) =P(X=-2, ~0, ~\text{or} ~3)= 0.3+0.1+0.2=0.6\]
\[\text{when} ~x\ge 5, F(x) = P(X\le x) =P(X=-2, ~0, ~3, \text{or} ~5)= 0.3+0.1+0.2+0.4=1\]
The above can be written as a piece-wise (right-continuous) function:
\[F(x)=\begin{cases} 0, & x<-2 \\ 0.3, & -2\le x< 0 \\ 0.4, & 0\le x< 3\\ 0.6, & 3\le x< 5\\ 1, & x\ge5 \end{cases} \]
This piece-wise function has a graph that is step-wise and right-continuous. This observation is in general true for all discrete random variables.
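The step-wise CDF in this example can be built directly from the PMF. Below is a minimal Python sketch; storing the PMF as a dictionary and the function name `cdf` are choices made only for illustration:

```python
pmf = {-2: 0.3, 0: 0.1, 3: 0.2, 5: 0.4}

def cdf(x):
    """F(x) = P(X <= x): add up the PMF over all values no greater than x."""
    return sum(p for value, p in pmf.items() if value <= x)

# Evaluate F at points inside each piece and at the jump points themselves.
for x in (-3, -2, 1, 3, 4.9, 5, 10):
    print(x, round(cdf(x), 10))
```

Evaluating at a jump point such as \(x=3\) gives the upper value 0.6, not 0.4, which is what right-continuity means here.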
The mean of a discrete random variable (or a discrete distribution) is defined to be the sum of products of values and probabilities. In other words, the mean is the weighted average of the values with weights being the corresponding probabilities. Use the Greek letter \(\mu\) to denote the mean. The mean describes the value that the random variable takes on average.
An example:
Suppose \(X\) is a random variable taking values \(-2\), 0, 3, and 5 with probabilities 0.3, 0.1, 0.2, and 0.4, respectively. Then, the mean of \(X\) can be determined as follows:
The mean is
\[\mu_X = (-2)(0.3)+(0)(0.1)+(3)(0.2)+(5)(0.4)=-0.6+0+0.6+2=2\]
Each possible value of the random variable is a certain distance away from the mean. These distances are called deviations. The variance of a discrete random variable (or a discrete distribution) is the weighted average of the squared deviations with weights being the corresponding probabilities. That is,
\[\sigma^2=\sum_x (x-\mu)^2 p(x)\] where \(x\) ranges over all possible values of the random variable and \(p(x)\) is the PMF.
The symbol \(\sigma^2\) denotes the variance. The square root of the variance is called the standard deviation, which describes how far, on average, the possible values are from the mean. Both variance and standard deviation describe the variation of the random variable.
For the random variable \(X\) in the example above, the variance is
\[\sigma^2_X = (-2-2)^2(0.3)+(0-2)^2(0.1)+(3-2)^2(0.2)+(5-2)^2(0.4)=4.8+0.4+0.2+3.6=9\]
When calculating the variance of a discrete random variable, you can use the following equivalent formula instead:
\[\sigma^2=\sum_x x^2 p(x) - \mu^2\]
The above variance can be calculated as follows:
\[\sigma^2_X = (-2)^2(0.3)+(0)^2(0.1)+(3)^2(0.2)+(5)^2(0.4)-2^2=1.2+0+1.8+10-4=9\]
The standard deviation is
\[\sigma_X=\sqrt{9}=3.\]
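The mean, both variance formulas, and the standard deviation for this example can be computed in a few lines. This is a sketch in Python with hypothetical variable names; results are rounded when printed because floating-point sums are not exact:

```python
import math

pmf = {-2: 0.3, 0: 0.1, 3: 0.2, 5: 0.4}

mean = sum(x * p for x, p in pmf.items())
# Definition: weighted average of squared deviations from the mean
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
# Shortcut: weighted average of squares minus the squared mean
variance_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
sd = math.sqrt(variance)

print(round(mean, 10), round(variance, 10), round(variance_shortcut, 10), round(sd, 10))
# 2.0 9.0 9.0 3.0
```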
A useful result: if \(X\) is a random variable with mean \(\mu\) and variance \(\sigma^2\), then \(Y=cX\) is a new random variable, where \(c\) is a constant. Furthermore,
the mean of \(Y\) is \(c\mu\),
the variance of \(Y\) is \(c^2\sigma^2\), and
the standard deviation of \(Y\) is \(|c|\sigma\) (the absolute value is needed because a standard deviation cannot be negative).
Another useful result: if \(X\) is a random variable with mean \(\mu\) and variance \(\sigma^2\), then \(Y=X+c\) is a new random variable, where \(c\) is a constant. Furthermore,
the mean of \(Y\) is \(\mu+c\),
the variance of \(Y\) is \(\sigma^2\) as well, and
the standard deviation of \(Y\) is \(\sigma\) as well.
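The rules for the mean and variance under scaling and shifting can be checked on the earlier example distribution (mean 2, variance 9). The Python sketch below uses a hypothetical constant \(c=-3\) and a hypothetical helper `mean_and_variance`:

```python
def mean_and_variance(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance

pmf_x = {-2: 0.3, 0: 0.1, 3: 0.2, 5: 0.4}   # mean 2, variance 9
c = -3                                       # hypothetical constant

pmf_scaled = {c * x: p for x, p in pmf_x.items()}    # distribution of cX
pmf_shifted = {x + c: p for x, p in pmf_x.items()}   # distribution of X + c

mean_scaled, var_scaled = mean_and_variance(pmf_scaled)
mean_shifted, var_shifted = mean_and_variance(pmf_shifted)

print(round(mean_scaled, 10), round(var_scaled, 10))     # -6.0 81.0
print(round(mean_shifted, 10), round(var_shifted, 10))   # -1.0 9.0
```

Scaling multiplies the mean by \(c\) and the variance by \(c^2\), while shifting moves the mean by \(c\) and leaves the variance unchanged.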
If the values of a discrete random variable are equally likely, the distribution is a discrete uniform distribution.
An example:
Throw a 6-sided fair die once. Let \(X\) denote the outcome. Then, \(X\) takes each of the values 1, 2, 3, 4, 5, and 6, with probabilities all being 1/6. Thus, \(X\) has a discrete uniform distribution.
For a discrete uniform distribution, the mean is just the average of all possible values of the random variable. Keep in mind that this is generally false for distributions that are not uniform.
Consider two situations:
Let’s say we have a coin (may not be fair). The coin lands on heads with probability \(p\) (maybe 0.5 or not). Flip it \(n\) times. Let \(X\) denote the number of times it lands on heads. Then, \(X\) is a discrete random variable, since it can only take the values \(0, 1, 2, \cdots\), and \(n\). What is the probability mass function (pmf) of \(X\)?
There is a sea of products. The proportion of defective products is \(p\). Randomly choose \(n\) products from this sea of products. Let \(Y\) denote the number of defective products. Then, \(Y\) is a discrete random variable, since it can only take the values \(0, 1, 2, \cdots\), and \(n\). What is the probability mass function (pmf) of \(Y\)?
It turns out that the two random variables have the same distribution (choosing a defective product is like flipping a head, with the same probability \(p\)), with the pmf given by
\[p(x)=P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}, ~~~x = 0, 1, 2, \cdots, n\] where \(\binom{n}{x}=\frac{n!}{x!\cdot(n-x)!}\) and \(n!\) is the product of the first \(n\) consecutive positive integers. For example, \(5! = 5\cdot 4\cdot3\cdot2\cdot1=120\), \(4!=24\), \(3!=6\), \(2!=2\), \(1!=1\), and \(0!=1\), and \(\binom{5}{2}=\frac{5!}{2!\cdot(5-2)!}=\frac{120}{2\cdot 6}=10\).
The combination \(\binom{n}{x}\) in the above formula counts the ways of choosing which \(x\) of the \(n\) trials are events (heads or defectives). The factors \(p^x\) and \((1-p)^{n-x}\) arise because the trials are independent, so their probabilities multiply.
The distribution is called the binomial distribution, with parameters \(n\) and \(p\). A parameter is not a variable, but is (usually) an unknown quantity.
The mean of this distribution is \(n\cdot p\) and the variance is \(n\cdot p\cdot (1-p)\).
An example:
Products manufactured by XYZ company have a defective rate of 0.01. Randomly pick a set of 100 products.
(a) What is the probability that exactly 3 of these 100 selected products are defective?
(b) What is the probability that less than 3 of these 100 selected products are defective?
(c) How many of these 100 selected products are expected to be defective?
Solution.
Let \(X\) denote the number of defective products out of the 100 selected ones. Then, \(X\) has a binomial distribution with \(n=100\) and \(p=0.01\).
(a) \(p(3)=\binom{100}{3}0.01^3(1-0.01)^{100-3}=0.0610\)
(b) \(F(2)=P(X< 3)=P(X=0)+P(X=1)+P(X=2)=p(0)+p(1)+p(2)=0.9206\). Here “less than 3” means “less than or equal to 2”. You need to use the binomial probability formula 3 times!
(c) The expected number of defective products is the same as the mean, and it is \(n\cdot p=100(0.01)=1\).
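The three answers can be reproduced by coding the binomial formula directly. The sketch below is in Python; the function name `binomial_pmf` and the variable names are hypothetical:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with parameters n and p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 100, 0.01
p_exactly_3 = binomial_pmf(3, n, p)
p_less_than_3 = sum(binomial_pmf(x, n, p) for x in range(3))  # x = 0, 1, 2
expected_defective = n * p

print(round(p_exactly_3, 4))    # 0.061
print(round(p_less_than_3, 4))  # 0.9206
print(expected_defective)       # 1.0
```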
In quality control, we might be interested in the number of products to be sampled in order to see the first defective product from a sea of products.
Again, consider two situations:
Flip a coin until a head is seen. Denote the number of flips by \(X\). Assume the probability of a head to be \(p\).
Sample products until a defective product is seen. Denote the number of products sampled as \(Y\). Assume the defective rate is \(p\).
Both \(X\) and \(Y\) have the same distribution, whose probability mass function is given as follows:
\[p(x)=(1-p)^{x-1}\cdot p, ~~x=1, 2, 3, \cdots\] since selections are independent, and there is only one defective and \(x-1\) normal ones.
The distribution is called the geometric distribution. The mean is shown to be \(\frac{1}{p}\) and the variance is \(\frac{1-p}{p^2}\).
The negative binomial distribution describes the number of flips or samplings, \(X\), needed to see \(k\) heads or defectives. The last flip or sampling is certain to be a head or defective; the binomial formula applies to the previous \(x-1\) flips or samplings, which must contain exactly \(k-1\) heads or defectives.
Thus the formula is
\[p(x) = \binom{x-1}{k-1}p^{k-1}(1-p)^{x-k}\cdot p, ~~x= k, k+1, k+2, \cdots\]
The mean of the negative binomial distribution is \(\frac{k}{p}\) and variance is \(\frac{1-p}{p^2}\cdot k\).
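The geometric distribution is the special case \(k=1\) of the negative binomial distribution, so one function covers both. The Python sketch below uses hypothetical values \(p=0.2\) and \(k=3\) and approximates the infinite sums by truncating them at a large number of trials:

```python
import math

def negative_binomial_pmf(x, k, p):
    """P(the k-th head/defective is seen on trial x), for x = k, k+1, ..."""
    return math.comb(x - 1, k - 1) * p ** k * (1 - p) ** (x - k)

def geometric_pmf(x, p):
    """Special case k = 1: the first head/defective is seen on trial x."""
    return negative_binomial_pmf(x, 1, p)

p, k = 0.2, 3              # hypothetical values for illustration
trials = range(k, 2000)    # truncation of the infinite range of x

total = sum(negative_binomial_pmf(x, k, p) for x in trials)
mean = sum(x * negative_binomial_pmf(x, k, p) for x in trials)
variance = sum((x - mean) ** 2 * negative_binomial_pmf(x, k, p) for x in trials)

print(round(total, 6))     # close to 1
print(round(mean, 6))      # close to k / p = 15
print(round(variance, 6))  # close to k * (1 - p) / p**2 = 60
```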
In statistics, a population is a group whose members are of interest to us. Consider a finite population whose size is \(N\).
Let \(K\) denote the number of interesting things (say defectives) in the population.
Randomly draw \(n\) things from the population without replacement. Let \(X\) denote the number of interesting things among these \(n\) drawn ones. The probability mass function (pmf) of \(X\) is given as follows:
\[p(x)=\frac{\binom{K}{x}\cdot \binom{N-K}{n-x}}{\binom{N}{n}}, ~~x~~ \text{goes from} ~~ \max(0, n+K-N) ~~\text{to}~~ \min(n, K)\] where
\(N\) is the population size
\(K\) is the number of interesting things in the population
\(n\) is the number of draws
\(x\) is the number of interesting things observed
The mean of this distribution is \(n\cdot \frac{K}{N}\) and variance is \(n\cdot p(1-p)\frac{N-n}{N-1}\), with \(p=\frac{K}{N}\), the proportion of interesting things.
An example:
A bag has 100 balls, 5 of which are defective. Randomly draw 3 balls from the bag.
What is the probability that exactly 2 defective balls are observed?
What is the expected number of defective balls observed?
Solution.
We are given \(N=100, K=5\), and \(n = 3\).
\(p(2)=\frac{\binom{5}{2}\cdot \binom{100-5}{3-2}}{\binom{100}{3}}=0.005875\).
The expected number of defective balls observed is \(n\cdot \frac{K}{N}=3\cdot\frac{5}{100}=0.15\).
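The bag example can be reproduced in R. This is a minimal sketch; the variable names are my own. In `dhyper(x, m, n, k)`, `m` is the number of interesting items, `n` is the number of other items, and `k` is the number of draws.

```r
# Hypergeometric example: N = 100 balls, K = 5 defective, 3 drawn.
N <- 100
K <- 5
n_draw <- 3

# Probability of exactly 2 defectives, from the formula and from dhyper().
p2_formula <- choose(K, 2) * choose(N - K, n_draw - 2) / choose(N, n_draw)
p2_builtin <- dhyper(2, m = K, n = N - K, k = n_draw)
c(p2_formula, p2_builtin)

# Expected number of defectives among the draws.
expected_defectives <- n_draw * K / N
expected_defectives
```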
When modeling the number of rare events (such as earthquakes and car accidents), the Poisson distribution is often used. Let \(X\) denote the number of rare events in a given dimension (space, area, or a time interval). The probability mass function can be chosen to be
\[p(x) = \frac{\lambda^x}{x!}e^{-\lambda}, ~~x = 0, 1, 2, \cdots\] where \(\lambda\) is the average number of rare events per unit dimension (such as per cubic foot, per square foot, per minute). The mean of this distribution is just \(\lambda\) and the variance is also \(\lambda\).
An example:
A website is attacked 3 times on average in each year.
What is the probability that the website will be attacked in the next year?
What is the probability that the website will be attacked more than 10 times in the next 5 years?
How many times will the website be expected to be attacked in the following 10 years?
Solution.
Let \(X\) denote the number of attacks in a year. Then \(X\) has a Poisson distribution with mean \(\lambda=3\) attacks per year.
Let \(Y\) denote the number of attacks in 5 years. Then \(Y\) has a Poisson distribution with mean \(\lambda=3\cdot 5=15\) attacks every 5 years.
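(a) The probability that the website will be attacked in the next year is
\[P(X>0)=1-P(X=0)=1-\frac{3^0}{0!}e^{-3}=1-e^{-3}\approx 0.9502. \]
(b) The probability that the website will be attacked more than 10 times in the next 5 years is given by
\[P(Y>10)=1-P(Y\le 10)=1-p(0)-p(1)-p(2)-\cdots - p(10)\approx 0.8815, \] where \(\lambda = 15\) should be used when calculating \(p(0), p(1), \cdots, p(10)\).
(c) The number of attacks in 10 years has a Poisson distribution with mean \(3\cdot 10=30\), so the website is expected to be attacked 30 times in the following 10 years.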
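The Poisson probabilities in this example can be computed with `dpois()` and `ppois()`. The following is a minimal sketch; the variable names are my own.

```r
# Website attacks: on average 3 per year.
rate_per_year <- 3

# (a) P(at least one attack next year) = 1 - P(X = 0), with lambda = 3.
p_a <- 1 - dpois(0, lambda = rate_per_year)

# (b) P(more than 10 attacks in 5 years), with lambda = 3 * 5 = 15.
lambda_5yr <- rate_per_year * 5
p_b <- ppois(10, lambda = lambda_5yr, lower.tail = FALSE)
p_b_check <- 1 - sum(dpois(0:10, lambda = lambda_5yr))  # same thing, term by term

# (c) Expected number of attacks in 10 years.
expected_10yr <- rate_per_year * 10

c(p_a, p_b, p_b_check, expected_10yr)
```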
A continuous random variable is one that can take on any value within a certain range. The concept of probability mass function will not work for continuous random variables, since the probability that \(X\) equals a single value is always 0 (finding a needle in a sea). This does not mean it is hopeless. Instead, calculus is the savior.
The probabilities associated with continuous random variables are represented by a probability density function (PDF). The area under the graph of the PDF gives the likelihood of the random variable falling within a specific interval. The area under the PDF curve over the entire range is equal to 1.
Examples of a continuous random variable:
The temperature of a fluid in a system.
The time it takes for a customer service representative to handle a call.
The distance a car travels before its engine fails.
For a continuous random variable, the counterpart of the probability mass function for a discrete random variable is the probability density function, denoted \(f(x)\), which is defined as the derivative of the cumulative distribution function \(F(x)\). That is
\[f(x)=F'(x)\] The probability density function has the following properties:
\(f(x)\) is never negative.
The area under the curve of \(f(x)\) and above the \(x\)-axis is always 1; that is, \(\int_{-\infty}^{\infty}f(x)dx=1\).
The probability that \(X\) falls in the interval \([a, b]\), \((a, b)\), \([a, b)\), \((a, b]\) is given by \(P(X<b)-P(X<a)\), or \(F(b)-F(a)\), or \(\int_{a}^{b}f(x)dx\), regardless of the boundaries of the interval.
The cumulative distribution function (CDF) of a random variable \(X\) has range between 0 and 1, and is always non-decreasing. For a continuous random variable, the CDF is a continuous function.
To get the CDF from the pdf, we use the following formula:
\[F(x)=\int_{-\infty}^{x}f(t)dt\]
The mean (or expectation, or expected value) of a continuous random variable whose probability density function is \(f(x)\) is given by
\[\mu = \int_{-\infty}^{\infty}xf(x)dx\]
In this formula:
x represents the possible values of the random variable.
f(x) represents the probability density function of the random variable at each value x.
The integral sums up the product of each possible value x and its corresponding density f(x) over the entire range of possible values.
The variance is given by
\[\sigma^2 = \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx ~~~\text{or} ~~\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2\]
The integral above computes the weighted sum of squared deviations from the mean, where each squared deviation is weighted by its corresponding probability density.
The mean is also denoted by \(E(X)\) and variance by \(Var(X)\).
An example:
The probability density function of \(X\) is given by
\[f(x)=3x^2, ~~0<x<1\] \[f(x)=0, ~~\text{otherwise}\]
Find the mean and variance.
Find the probability that \(X<0.3\).
Find the probability that \(X>0.2\).
Find the probability that \(0.25<X<0.6\).
Solution.
\[\mu = \int_{-\infty}^{\infty}xf(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}xf(x)dx=\int_{0}^{1}x\cdot 3x^2dx\stackrel{\text{?}}{=}\frac{3}{4}\]
where the first “?” is due to the fact that \(f(x)\) is only non-zero between 0 and 1, and the second “?” is due to the fact that the anti-derivative of \(3x^3\) is \(\frac{3x^4}{4}\).
The variance is calculated as
\[\sigma ^2 = \int_{-\infty}^{\infty}(x-\mu)^2f(x)dx\stackrel{\text{?}}{=}\int_{0}^{1}(x-\mu)^2f(x)dx=\int_{0}^{1}(x-\frac{3}{4})^2\cdot 3x^2dx\stackrel{\text{?}}{=}3\int_{0}^{1}(x^4-\frac{3}{2}x^3+\frac{9}{16}x^2)dx=\frac{3}{80}\]
(b) The probability that \(X<0.3\) is obtained by
\[P(X<0.3) = \int_{-\infty}^{0.3}f(x)dx\stackrel{\text{?}}{=}\int_{0}^{0.3}f(x)dx=\int_{0}^{0.3}3x^2dx\stackrel{\text{?}}{=}0.027\]
(c) The probability that \(X>0.2\) is obtained by
\[P(X>0.2) = \int_{0.2}^{\infty}f(x)dx\stackrel{\text{?}}{=}\int_{0.2}^{1}f(x)dx=\int_{0.2}^{1}3x^2dx\stackrel{\text{?}}{=}0.992\]
(d) The probability that \(0.25<X<0.6\) is obtained by
\[P(0.25<X<0.6) = \int_{0.25}^{0.6}f(x)dx=\int_{0.25}^{0.6}3x^2dx=0.6^3-0.25^3\approx 0.2\]
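The integrals in this example can also be evaluated numerically with R's `integrate()` function. This is a minimal sketch; the function and variable names are my own.

```r
# Density f(x) = 3x^2 on (0, 1), and 0 otherwise.
f <- function(x) ifelse(x > 0 & x < 1, 3 * x^2, 0)

# (a) Mean and variance: integrate over (0, 1), where f is non-zero.
mu <- integrate(function(x) x * f(x), lower = 0, upper = 1)$value
sigma2 <- integrate(function(x) (x - mu)^2 * f(x), lower = 0, upper = 1)$value

# (b)-(d) Probabilities are areas under the density curve.
p_b <- integrate(f, lower = 0, upper = 0.3)$value     # P(X < 0.3)
p_c <- integrate(f, lower = 0.2, upper = 1)$value     # P(X > 0.2)
p_d <- integrate(f, lower = 0.25, upper = 0.6)$value  # P(0.25 < X < 0.6)

c(mean = mu, variance = sigma2, p_b = p_b, p_c = p_c, p_d = p_d)
```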
If \(X\) is a random variable, then \(X^2\) is too. Thus, we can talk about the mean of \(X^2\), and it is denoted by \(E(X^2)\). With this notation, the variance formula of a random variable \(X\) can be written \(Var(X)=E(X^2)-[E(X)]^2\).
When \(f(x)\) is a non-zero constant on the interval \((a,b)\) and 0 outside the interval, the distribution is a continuous uniform distribution with support \((a,b)\). The boundaries of the interval can be open or closed, which does not matter.
For a continuous uniform random variable with support \((a,b)\), the mean equals \(\frac{a+b}{2}\) and the variance equals \(\frac{(b-a)^2}{12}\).
An example:
If random variable \(X\) has the following probability density function
\[f(x)=c, ~~~ 1<x<5\]
\[f(x)=0, ~~~ \text{otherwise}\]
(a) Find the constant \(c\).
(b) Find the mean and variance.
(c) Find the probability that \(X\) is between 2 and 4.
Solution.
Sketch the graph of the function. Since the area under the curve and above the \(x\)-axis must be 1, the constant \(c\) must be \(\frac{1}{4}\).
The mean is \(\frac{1+5}{2}=3\). The variance is \(\frac{(5-1)^2}{12}=\frac{4}{3}\).
The probability that \(X\) is between 2 and 4 is obtained by
\[P(2<X<4)=\int_2^4 f(x)dx=\int_2^4 \frac{1}{4}dx=\frac{1}{2}\]
You can avoid using calculus here by drawing a graph to find the area in between 2 and 4 under the density curve and above the \(x\)-axis.
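The same uniform example can be checked with R's `punif()` function, which returns the CDF of a continuous uniform distribution. This is a minimal sketch with variable names of my own.

```r
# Continuous uniform distribution with support (1, 5).
a <- 1
b <- 5

# The constant density is 1 / (b - a) so that the total area is 1.
c_const <- 1 / (b - a)

# Mean and variance from the formulas in the notes.
unif_mean <- (a + b) / 2
unif_var <- (b - a)^2 / 12

# P(2 < X < 4) = F(4) - F(2).
p_2_4 <- punif(4, min = a, max = b) - punif(2, min = a, max = b)

c(c_const, unif_mean, unif_var, p_2_4)
```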
When a random variable \(X\) has the probability density function given by
\[f(x)=\frac{1}{\sqrt{2\pi}\cdot\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}, ~~~-\infty<x<\infty\]
\(X\) is said to have a normal distribution, denoted \(X\sim N(\mu, \sigma^2)\) or \(X\sim N(\mu, \sigma)\) according to different books. The mean is just equal to \(\mu\) and the standard deviation is just equal to \(\sigma\).
When \(X\sim N(\mu, \sigma^2)\), the new random variable \(Z=\frac{X-\mu}{\sigma}\) is called the \(Z\)-score or standardized score. It turns out that \(Z\) has a normal distribution with mean 0 and standard deviation 1.
To calculate the probability associated with a normal distribution, we can use a standard normal distribution table or software.
An example:
If \(X\sim N(\mu=100, \sigma=15)\), find
P(X<110)
P(X>120)
P(90<X<130)
Solution.
\(P(X<110)=P(X-\mu<110-\mu)=P(\frac{X-\mu}{\sigma}<\frac{110-\mu}{\sigma})=P(Z<0.67)\approx0.7486\) by a standard normal table. Your textbook also has such a table in the appendix.
\(P(X>120)=P(X-\mu>120-\mu)=P(\frac{X-\mu}{\sigma}>\frac{120-\mu}{\sigma})=P(Z>1.33)\approx 1- 0.9082=0.0918\)
The probability is obtained as follows:
\[P(90<X<130)=P(90-\mu<X-\mu<130-\mu)=P(\frac{90-\mu}{\sigma}<\frac{X-\mu}{\sigma}<\frac{130-\mu}{\sigma})\] \[=P(-0.67<Z<2)=P(Z<2)-P(Z<-0.67)\approx0.9772-0.2514=0.7258\]
where I used the result that \(P(a<X<b)=P(X<b)-P(X<a)\) when \(X\) is a continuous random variable.
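Software avoids the table lookup. Here is a minimal sketch using R's `pnorm()` function; the variable names are my own. Because `pnorm()` does not round the \(Z\)-score to two decimals, its answers can differ from the table-based ones in the third decimal place.

```r
# X ~ Normal with mean 100 and standard deviation 15.
mu <- 100
sigma <- 15

# P(X < 110)
p_a <- pnorm(110, mean = mu, sd = sigma)

# P(X > 120): upper tail
p_b <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)

# P(90 < X < 130) = P(X < 130) - P(X < 90)
p_c <- pnorm(130, mean = mu, sd = sigma) - pnorm(90, mean = mu, sd = sigma)

round(c(p_a, p_b, p_c), 4)
```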
Skip!!
If the probability density function of a random variable \(X\) is given by \[f(x)=\lambda e^{-\lambda x}, ~~ \text{for}~~x>0\] \[f(x)=0, ~~ \text{otherwise}\] then the random variable is said to have an exponential distribution with parameter \(\lambda\). This distribution is a right-skewed distribution, as seen from the following graph:
It can be shown that the mean of this distribution is \(\frac{1}{\lambda}\) and the standard deviation is also \(\frac{1}{\lambda}\).
The reliability function \(R(t)\), also known as the survival or survivor function, gives the probability that a system, component, or individual will survive beyond a specific time \(t\). In other words, it calculates the probability that the event of interest (such as system failure) has not occurred by time \(t\). Mathematically, it’s defined as:
\[R(t)=1-F(t)\]
where \(F(t)\) is the cumulative distribution function (CDF) of the event times. The reliability function decreases over time as more events occur.
In different fields and literature, you might encounter either the reliability function or the survival function being used, depending on the context and the field’s convention. For example, in reliability engineering, the term “reliability” is often used, while in medical and survival analysis contexts, the term “survival” is commonly used.
Example: Reliability of a Weibull-Distributed Component
Suppose we have a component whose lifetime (in thousand hours) follows a Weibull distribution, a distribution with cumulative distribution function (CDF) given by:
\[F(t)=1-e^{-(t/\lambda)^k}\]
Assume that \(k=2\) and \(\lambda = 0.8\).
Find the probability density function (pdf) of the life time of the component.
Find the reliability function of the life time of the component.
The ratio of the pdf to reliability is called the failure rate or the hazard function, denoted by \(h(t)\), which is interpreted as the instantaneous rate of failure at time \(t\). Plot the graph of this function.
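One possible way to work through these three questions in R is sketched below. It assumes that differentiating the CDF gives the pdf \(f(t)=\frac{k}{\lambda}(t/\lambda)^{k-1}e^{-(t/\lambda)^k}\), which matches R's `dweibull()` with `shape = k` and `scale = lambda`. The function names and the plotting grid are my own choices.

```r
# Weibull lifetime (in thousand hours) with k = 2 and lambda = 0.8.
k <- 2
lambda <- 0.8

# pdf: derivative of F(t) = 1 - exp(-(t / lambda)^k).
pdf_t <- function(t) (k / lambda) * (t / lambda)^(k - 1) * exp(-(t / lambda)^k)

# Reliability: R(t) = 1 - F(t).
rel_t <- function(t) exp(-(t / lambda)^k)

# Hazard (failure rate): pdf divided by reliability.
haz_t <- function(t) pdf_t(t) / rel_t(t)

# Compare the hand-derived functions with R's built-in Weibull functions.
t_grid <- seq(0.01, 3, by = 0.01)
all.equal(pdf_t(t_grid), dweibull(t_grid, shape = k, scale = lambda))
all.equal(rel_t(t_grid), pweibull(t_grid, shape = k, scale = lambda, lower.tail = FALSE))

# Plot the hazard function; for k = 2 it is a straight line through the origin.
plot(t_grid, haz_t(t_grid), type = "l",
     xlab = "t (thousand hours)", ylab = "h(t)", main = "Weibull hazard function")
```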
The reliability can also be defined for a system of components connected in some way. There are some special systems:
A series system is defined as a system whose components are connected end-to-end in a series, so the system as a whole operates only if every component is functioning.
A parallel system in reliability engineering refers to a configuration where multiple components or paths are connected in parallel, and the system as a whole operates if at least one of the parallel components or paths is functioning. This setup increases the system’s overall reliability since the system can continue to function even if one or more of the parallel components fail.
Many times we need to consider two or more random variables at the same time. In this situation, we need to consider the joint distribution of these random variables.
We only consider the joint distribution of two discrete random variables \(X\) and \(Y\). The joint probability mass function (pmf) of them is defined to be \(f(x,y)=P(X=x, Y=y)\). In contrast, each of the two individual distributions is called a marginal distribution.
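Example 1. \(X\) takes values 0, 2, and 4. \(Y\) takes values 1, 2, and 3. The following table shows the joint pmf of the two discrete random variables (rows are values of \(X\), columns are values of \(Y\)).
\(X\) \ \(Y\) | 1 | 2 | 3 |
---|---|---|---|
0 | 1/4 | 1/8 | 1/8 |
2 | 0 | 1/8 | 0 |
4 | 1/4 | 0 | 1/8 |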
Based on the results given, we have
\(f(0, 1)= 1/4, f(4, 1)= 1/4, f(0, 3)=1/8\), and
\(P(X<3, Y>2)=P(X=0 \text{ or } 2, Y=3)=f(0,3)+f(2,3)=1/8+0=1/8\).
When the joint probability mass function (pmf) of \(X\) and \(Y\) is given, we can find the pmf of \(X\) and the pmf of \(Y\) separately. Take the following joint pmf as an example:
To find the pmf of \(X\), we just add the 3 probabilities on each row and we end up with sums of 1/2, 1/8, and 3/8, which are the probabilities that \(X\) takes the values of 0, 2, and 4, respectively.
Similarly, to find the pmf of \(Y\), we just add the 3 probabilities on each column and we end up with sums of 1/2, 1/4, and 1/4, which are the probabilities that \(Y\) takes the values of 1, 2, and 3, respectively.
The conditional probability of \(Y=y\) given \(X=x\) can be written as \(f(y|x)\), which is defined by
\[f(y|x)=\frac{f(x,y)}{f(x)}=\frac{\text{the joint pmf}}{\text{the marginal pmf of}~X}\] \(X\) and \(Y\) are said to be independent, if \(f(y|x)=f(y)\), for any \(x\) and \(y\). In other words, when two random variables are independent, the conditional probability mass function is the same as the corresponding marginal probability mass function. Equivalently, \(X\) and \(Y\) are independent if and only if \(f(x,y)=f_X(x)f_Y(y)\), for all \(x\) and \(y\).
Note: some books use \(p\) instead of \(f\).
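The marginal pmfs, a conditional pmf, and the independence check can be sketched in R by storing the joint pmf as a matrix. The entries below are the joint probabilities quoted in these notes for this example (for instance \(f(0,1)=1/4\) and \(f(2,3)=0\)); the object names are my own.

```r
# Joint pmf: rows are values of X (0, 2, 4), columns are values of Y (1, 2, 3).
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("0", "2", "4"), Y = c("1", "2", "3")))

# Marginal pmfs: add across each row for X, down each column for Y.
pmf_X <- rowSums(joint)
pmf_Y <- colSums(joint)

# Conditional pmf of Y given X = 2: joint row divided by the marginal of X.
cond_Y_given_X2 <- joint["2", ] / pmf_X["2"]

# Independence requires f(x, y) = f_X(x) * f_Y(y) for every cell.
independent <- isTRUE(all.equal(joint, outer(pmf_X, pmf_Y), check.attributes = FALSE))

pmf_X
pmf_Y
cond_Y_given_X2
independent
```

Since the conditional pmf of \(Y\) given \(X=2\) differs from the marginal pmf of \(Y\), the two variables in this example are not independent.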
Skip!
When given the joint pmf of \(X\) and \(Y\), how can we calculate \(E(X\cdot Y)\)?
Here is an example: for the following given joint pmf,
calculate \(E(X\cdot Y)\).
Solution.
We need to find all possible products of \(X\) and \(Y\), then multiply them by the corresponding joint probabilities, and finally add the results up. That is,
\[(0)(1)(1/4)+(0)(2)(1/8)+(0)(3)(1/8)\] \[+(2)(1)(0)+(2)(2)(1/8)+(2)(3)(0)\] \[+(4)(1)(1/4)+(4)(2)(0)+(4)(3)(1/8)\]
The result is 3.
Previously, we introduced the mean of a random variable \(X\). We use \(\mu\) or \(E(X)\) to denote the mean.
The covariance of two discrete random variables is defined to be \[cov(X,Y)=E([X-E(X)][Y-E(Y)])\] It can be shown that \(cov(X,Y)=E(X\cdot Y)-E(X)\cdot E(Y)\).
In the above example, \(X\) takes the values of 0, 2, and 4, with probabilities of 1/2, 1/8, and 3/8, respectively. So the mean of \(X\) is \((0)(1/2)+(2)(1/8)+(4)(3/8)=1.75\). The variance of \(X\) is \((0^2)(1/2)+(2^2)(1/8)+(4^2)(3/8)-1.75^2=3.4375\).
Similarly, \(Y\) takes the values of 1, 2, and 3, with probabilities of 1/2, 1/4, and 1/4, respectively. The mean of \(Y\) is \((1)(1/2)+(2)(1/4)+(3)(1/4)=1.75\). The variance of \(Y\) is \((1^2)(1/2)+(2^2)(1/4)+(3^2)(1/4)-1.75^2=0.6875\).
Now, the covariance of \(X\) and \(Y\) is \(3-(1.75)(1.75)=-0.0625\).
The correlation between \(X\) and \(Y\) is defined by
\[\rho_{X,Y}=corr(X,Y)=\frac{cov(X,Y)}{\sqrt{Var(X)}\cdot \sqrt{Var(Y)}}\]
Continue the previous example. The correlation between \(X\) and \(Y\) is
\[\rho_{X,Y} = \frac{-0.0625}{\sqrt{3.4375}\cdot \sqrt{0.6875}}\approx -0.04\]
Remark: Covariance can take any value, but correlation must be between \(-1\) and 1. In addition, correlation has no unit.
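The calculation of \(E(X\cdot Y)\), the covariance, and the correlation for this example can be sketched in R as follows. The joint probabilities are the ones used in the hand calculation above; the object names are my own.

```r
# Values of X and Y, and the joint pmf (rows: X = 0, 2, 4; columns: Y = 1, 2, 3).
x_vals <- c(0, 2, 4)
y_vals <- c(1, 2, 3)
joint <- matrix(c(1/4, 1/8, 1/8,
                  0,   1/8, 0,
                  1/4, 0,   1/8),
                nrow = 3, byrow = TRUE)

# E(XY): every product x * y weighted by its joint probability.
E_XY <- sum(outer(x_vals, y_vals) * joint)

# Means and variances from the marginal pmfs.
pmf_X <- rowSums(joint)
pmf_Y <- colSums(joint)
E_X <- sum(x_vals * pmf_X)
E_Y <- sum(y_vals * pmf_Y)
var_X <- sum(x_vals^2 * pmf_X) - E_X^2
var_Y <- sum(y_vals^2 * pmf_Y) - E_Y^2

# Covariance and correlation.
cov_XY <- E_XY - E_X * E_Y
rho <- cov_XY / (sqrt(var_X) * sqrt(var_Y))

c(E_XY = E_XY, cov = cov_XY, rho = rho)
```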
Skip!
Some useful properties of the mean and variance are given below:
If \(Y\) and \(X\) are random variables and \(a, b\), and \(c\) are constants, then
\(E(c)=c\)
\(E(aX+bY) = aE(X)+bE(Y)\)
\(E(X+b)= E(X) + b\), but \(Var(X+b)= Var(X)\)
\(Var(aX)= a^2Var(X)\)
If \(X\) and \(Y\) are independent, then \(Var(X+Y)=Var(X)+Var(Y)\) and \(E(XY)=E(X)E(Y)\) and thus \(Cov(X,Y)=0\).
If both \(X\) and \(Y\) are normally distributed and are independent, then \(X+Y\) is also normally distributed with mean that equals the sum of individual means and variance that equals the sum of individual variances.
An example:
If \(E(X) = 2\) and \(Var(X)=3\), determine
\(E(4X)\)
\(Var(4X)\)
\(E(X+5)\)
\(Var(X-6)\)
\(E(3X-2)\)
\(Var(3X-2)\)
Solution.
\(E(4X)=4E(X)=8\)
\(Var(4X)=16Var(X)=48\)
\(E(X+5)=E(X)+5=7\)
\(Var(X-6)=Var(X)=3\)
\(E(3X-2)=3E(X)-2=4\)
\(Var(3X-2)=Var(3X)=3^2Var(X)=27\)
Another example:
If \(X_1\), \(X_2\), …, \(X_n\) are independent and identically distributed with mean 100 and standard deviation 15, determine
the mean of \(\bar{X}\), where \(\bar{X}=\frac{X_1+X_2+\cdots+X_n}{n}\).
the variance of \(\bar{X}\).
(c)* the mean of \(S^2\), where \(S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i-\bar{X})^2\).
Solution.
(a) The mean of \(\bar{X}\) can be found as follows:
\[E(\bar{X})=E(\frac{X_1+X_2+\cdots+X_n}{n})\] \[=\frac{1}{n}E(X_1+X_2+\cdots+X_n)\] \[=\frac{1}{n}(E(X_1)+E(X_2)+\cdots+E(X_n))=\frac{1}{n}(100n)=100\]
(b) The variance of \(\bar{X}\) can be found as follows:
\[V(\bar{X})=V(\frac{X_1+X_2+\cdots+X_n}{n})\] \[=\frac{1}{n^2}V(X_1+X_2+\cdots+X_n)\] \[=\frac{1}{n^2}(V(X_1)+V(X_2)+\cdots+V(X_n))=\frac{1}{n^2}(15^2n)=\frac{15^2}{n}\]
(c) It can be shown that the mean of \(S^2\) equals the population variance, so \(E(S^2)=15^2=225\).
The above 3 results are also true in general when the mean is \(\mu\) and standard deviation is \(\sigma\). Just substitute 100 by \(\mu\) and 15 by \(\sigma\).
A third example:
If \(X\) has a normal distribution with mean 10 and variance 2, \(Y\) has a normal distribution with mean 8 and variance 3, and \(X\) and \(Y\) are independent,
What is the distribution of \(X+Y\)?
Find the probability that \(X+Y>15\).
Solution.
The distribution of \(X+Y\) is also normal with mean \(10+8=18\) and variance \(2+3=5\).
Let T = \(X+Y\). T is normal with mean \(10+8=18\), variance \(2+3=5\), and standard deviation \(\sqrt{5}\). To find probability \(P(T>15)\), we convert \(T\) to \(Z\), which has the standard normal distribution thus allows us to use a normal table to find probabilities.
\[P(T>15)=P(\frac{T-18}{\sqrt{5}}>\frac{15-18}{\sqrt{5}})=P(Z>\frac{15-18}{\sqrt{5}})\] \[=P(Z>-1.34)=1-P(Z<-1.34)\approx 1-0.0901=0.9099\]
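This probability can also be obtained directly from `pnorm()`. The following is a minimal sketch with variable names of my own. Since the \(Z\)-score is not rounded here, the result can differ slightly from the table-based value.

```r
# T = X + Y with X ~ N(mean 10, variance 2) and Y ~ N(mean 8, variance 3), independent.
mean_T <- 10 + 8
sd_T <- sqrt(2 + 3)   # variances add; the standard deviation is the square root

# P(T > 15): upper-tail probability.
p_T_gt_15 <- pnorm(15, mean = mean_T, sd = sd_T, lower.tail = FALSE)
p_T_gt_15
```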
We will introduce summary methods for data. Depending on data, we can calculate measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), and we can visualize the data with graphical displays. This will help us gain insights into the characteristics of the data, understand how descriptive statistics can be applied in various fields, and make data-driven decisions.
In practice, we often need to estimate a quantity (called a parameter) that describes a group (called a population). Such a group can be the collection of all light bulbs manufactured by a company. The corresponding parameter can be the proportion of defective light bulbs or the average life time of all light bulbs. It’s often unrealistic to reach every member in a population. Instead, people draw a random sample from the population of interest. After sample data are obtained, a descriptive analysis of the data is usually desirable.
Different kinds of data analysis software have been developed in the past 50 years. If you use JMP, here is a tutorial: https://www.youtube.com/watch?v=1PFPMYccxLU
We will use the R programming language and RStudio (now called Posit), the integrated development environment (IDE). It can be on-premises or in the cloud. To use the RStudio in the cloud, go to https://posit.cloud. You need to get registered first. Here is a tutorial to get you started: https://www.youtube.com/watch?v=TQMAKGDIe_8
Start a new project and practice! Under this project, you can create different files.
Given data such as: 800, 820, 900, 950, 780, 690, 860, 880 representing the life times of 8 randomly selected light bulbs from a sea of light bulbs.
What is the average life time? In statistics, this is called the sample mean of the data, which describes the center of the data. We use \(\bar{x}\) to denote the sample mean. For the above data, \(\bar{x}=835\).
What is the spread (or variability) of the data? In statistics, this is measured by the sample variance defined as
\[s^2=\frac{\sum (x_i - \bar{x})^2}{n-1}\]
or measured by the sample standard deviation, the square root of the sample variance. In the formula, each \(x_i\) represents an observation, \(\bar{x}\) is the sample mean, \(n\) is the number of observations (called the sample size), and the notation \(\Sigma\) means to obtain the sum of the squared differences between observations and their mean.
For the above data, \(s^2=6514.286\) and \(s=80.71\). To get the sample variance, you can follow these steps:
Mean = (800 + 820 + 900 + 950 + 780 + 690 + 860 + 880) / 8 = 6680 / 8 = 835
(800 - 835)^2 = 1225, (820 - 835)^2 = 225, (900 - 835)^2 = 4225, (950 - 835)^2 = 13225, (780 - 835)^2 = 3025, (690 - 835)^2 = 21025, (860 - 835)^2 = 625, (880 - 835)^2 = 2025
Sum of Squared Differences = 1225 + 225 + 4225 + 13225 + 3025 + 21025 + 625 + 2025 = 45600
Sample Variance = Sum of Squared Differences / (n-1) = 45600 / (8-1) = 45600 / 7 ≈ 6514.286
We can use the R function var() to find the variance of data and the sd() function to find the standard deviation.
## [1] 6514.286
## [1] 80.71113
The R functions give the same answers!
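The output above can be reproduced with a few lines of R. This is a minimal sketch; the vector name `life` is my own choice.

```r
# Lifetimes of the 8 sampled light bulbs.
life <- c(800, 820, 900, 950, 780, 690, 860, 880)

# Sample mean, sample variance, and sample standard deviation.
mean(life)
var(life)
sd(life)

# The same variance, following the step-by-step formula with n - 1.
manual_var <- sum((life - mean(life))^2) / (length(life) - 1)
manual_var
```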
The difference between the largest observation and the smallest observation is called the range of the data. The range of the above data is \(950-690\), or 260.
We may also be interested in what percentage of the data values are less than or equal to a value. For example,
If we know 95% of all data values are less than or equal to 800, the value 800 is called the 95th percentile of the data.
If we know 90% of all data values are less than or equal to 750, the value 750 is called the 90th percentile.
If we know 25% of all data values are less than or equal to 260, the value 260 is called the 25th percentile, or the first quartile, denoted by \(Q_1\).
If we know 50% of all data values are less than or equal to 420, the value 420 is called the 50th percentile, the second quartile, or the median, denoted by \(Q_2\) or \(m\).
If we know 75% of all data values are less than or equal to 570, the value 570 is called the 75th percentile, or the third quartile, denoted by \(Q_3\).
The median is the easiest of the percentiles to find. Simply sort the data from smallest to largest; the middle value, or the average of the two middle values when the number of observations is even, is the median.
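For the light bulb data, the range and the median can be found in R as sketched below. The manual steps mirror the sort-and-take-the-middle description; the variable names are my own.

```r
# Lifetimes of the 8 sampled light bulbs.
life <- c(800, 820, 900, 950, 780, 690, 860, 880)

# Range of the data: largest observation minus smallest observation.
data_range <- max(life) - min(life)

# Median by hand: sort, then average the two middle values (n = 8 is even).
sorted <- sort(life)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# Built-in median for comparison.
c(data_range, manual_median, median(life))
```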
Percentiles are also called quantiles. To find other percentiles (including quartiles), different software often use different approaches. To use JMP, visit https://www.uvm.edu/~rsingle/other/JMP-intro/default13.html. R users might do the following to find quantiles:
# Data
x = c(2,4, 5, 11, 20, 22, 26, 29, 32, 35, 40, 45, 48, 50)
# Find the 95th percentile
quantile(x, 0.95)
## 95%
## 48.7
This video: https://www.youtube.com/watch?v=faiGE_J_dww is a good introduction to the stem-and-leaf diagram. It also introduces other commonly used plots.
R software uses the stem function to create a stem-and-leaf plot.
x = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 86, 125, 85, 87, 102, 96, 76, 85)
stem(x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 7 | 6
## 8 | 5567
## 9 | 36689
## 10 | 12449
## 11 | 7
## 12 | 1335
We can create consecutive bins (intervals) of values so that we can count how many data values fall into each bin. This will help us understand the distribution of the data. The bins along with the counts of values form a table, called the frequency distribution of the data. The counts can also be replaced by proportions and the resulting table is called the relative frequency distribution.
Again, this video: https://www.youtube.com/watch?v=faiGE_J_dww helps us make a histogram.
R software uses the hist function to create a histogram.
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
hist(IQ,
main = "Distribution of IQ", # The main sets the title of the plot.
col = "blue", # The col fills the blue color inside each bar
xlab = "IQ", # The title of the x-axis is set to IQ
ylab = "Count" # The title of the y-axis is set to Count
)
You can set your own breaks when constructing a histogram. The breaks should be evenly spaced. The number of breaks is usually between 5 and 25. The smallest break should be slightly smaller than the minimal value of your data. The largest break should be slightly larger than the maximal value of your data.
hist(IQ,
breaks = c(70,80, 90, 100, 110, 120, 130),
main = "Distribution of IQ",
col = "blue",
xlab = "IQ"
)
From sample data, you can calculate the minimum value, first quartile, median, third quartile, and maximum. These are called the 5-number summary, which can be displayed through the so-called box plot.
The five vertical lines correspond, respectively, to the minimum, first (or lower) quartile, median, third (or upper) quartile, and maximum of the data. Keep in mind that the mean of the data is not shown here. Since the line extending from the third quartile to the maximum is longer than the line from the first quartile to the minimum, the distribution of the data is said to be right-skewed. These two lines are called whiskers.
R software uses the “boxplot” function to create a boxplot.
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
boxplot(IQ,
main = "Distribution of IQ",
col = "blue",
xlab = "IQ",
horizontal = TRUE # If set to FALSE, the boxplot will be vertical.
)
More complicated boxplots take into account outliers (values that are more than 1.5 IQRs above \(Q_3\) or more than 1.5 IQRs below \(Q_1\), where \(IQR=Q_3-Q_1\) is called the inter-quartile range), as shown below:
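The 1.5 IQR rule can be sketched in R for the IQ data used earlier. The fence names are my own. Note that `boxplot()` computes its quartiles (hinges) with a slightly different rule than `quantile()`, so the fences it uses may differ a little from the ones below.

```r
# IQ data from the earlier boxplot example.
IQ <- c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104,
        121, 104, 96, 125, 95, 87, 112, 96, 106, 105)

# First and third quartiles, and the inter-quartile range.
q1 <- unname(quantile(IQ, 0.25))
q3 <- unname(quantile(IQ, 0.75))
iqr <- q3 - q1

# Fences: values beyond them are flagged as potential outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- IQ[IQ < lower_fence | IQ > upper_fence]

c(q1 = q1, q3 = q3, iqr = iqr, lower = lower_fence, upper = upper_fence)
outliers   # an empty vector means no value is flagged
```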
It’s better to create comparative boxplots, such as the one below:
These are called side-by-side boxplots.
You need to use two variables when creating comparative boxplots, one being the outcome variable and one being the group variable. The two variables are best stored in a data frame (just like a spreadsheet in Excel). Here is possible R code:
IQ = c(117, 98, 101, 99, 123, 123, 109, 93, 96, 104, 121, 104, 96, 125, 95, 87, 112, 96, 106, 105)
gender = c("F", "M", "F", "F", "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")
boxplot(IQ~gender)
Here are some summary findings based on the boxplots:
IQ Distribution by Gender: The boxplot displays the distribution of IQ scores for two gender groups: “F” (Female) and “M” (Male).
Median IQ: The horizontal line inside each box represents the median IQ score for each gender group. It appears that the median IQ for females (“F”) is almost the same as the median for males (“M”).
Interquartile Range (IQR): The height of each box represents the interquartile range (IQR), which is a measure of the spread of the IQ scores within each gender group. The IQR for females appears to be slightly larger than for males.
Outliers: The plot shows individual data points that fall outside the whiskers of the boxplots. These data points are potential outliers. There seems to be no outlier in IQ score for either group.
Skewness: The distribution of IQ scores for either group is slightly skewed to the right (to larger values).
When plotting data involving time, we use the time sequence or time series plot. The target quantity would be on the y-axis and the time on the x-axis.
When examining a time series plot, pay attention to a possible trend and/or cyclic variation.
The plot shows an increasing linear trend and cyclic variation.
Video: https://www.youtube.com/watch?v=ORPOMJzaKFM
When data points in a scatterplot tend to be on a straight line, we can calculate the so-called correlation coefficient, which quantifies the linear relationship between the two quantitative variables.
The correlation coefficient, denoted by \(r\), is calculated by the following formula:
\[r_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\]
which can be reduced to
\[r_{xy} = \frac{\sum(x_i-\bar{x})y_i}{\sqrt{\sum(x_i-\bar{x})^2}\cdot\sqrt{\sum(y_i-\bar{y})^2}}\] or even simpler
\[r_{xy} = \frac{\sum(x_i y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum y_i^2-n\bar{y}^2}}\]
The value of \(r\) is always between \(-1\) and 1. A value closer to 1 indicates a strong, positive, linear relationship, while a value closer to \(-1\) indicates a strong, negative, linear relationship. The following shows some typical scatterplots.
An example. The following data are from an article on the quality of different young red wines in the Journal of the Science of Food and Agriculture (1974, Vol. 25(11), pp. 1369–1379) by T.C. Somers and M.E. Evans. The authors reported quality along with several other descriptive variables. We show only quality, pH, total SO2 (in ppm), color density, and wine color for a sample of their wines.
Quality = c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0, 15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)
pH = c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98, 3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)
TotalSO2 = c(66, 79, 73, 86, 178, 108, 96, 59, 22, 58, 120, 144, 100, 104, 67, 89, 192, 301, 99, 66)
ColorDensity = c(9.35, 11.15, 9.40, 6.40, 3.60, 5.80, 5.00, 10.25, 8.20, 10.15, 8.80, 5.60, 5.55, 8.70, 7.41, 5.35, 6.35, 4.25, 12.85, 4.90)
Color = c(5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00, 5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75)
D=data.frame(Quality, pH, TotalSO2, ColorDensity, Color)
Quality | pH | TotalSO2 | ColorDensity | Color |
---|---|---|---|---|
19.2 | 3.85 | 66 | 9.35 | 5.65 |
18.3 | 3.75 | 79 | 11.15 | 6.95 |
17.1 | 3.88 | 73 | 9.40 | 5.75 |
15.2 | 3.66 | 86 | 6.40 | 4.00 |
14.0 | 3.47 | 178 | 3.60 | 2.25 |
13.8 | 3.75 | 108 | 5.80 | 3.20 |
12.8 | 3.92 | 96 | 5.00 | 2.70 |
17.3 | 3.97 | 59 | 10.25 | 6.10 |
16.3 | 3.76 | 22 | 8.20 | 5.00 |
16.0 | 3.98 | 58 | 10.15 | 6.00 |
15.7 | 3.75 | 120 | 8.80 | 5.50 |
15.3 | 3.77 | 144 | 5.60 | 3.35 |
14.3 | 3.76 | 100 | 5.55 | 3.25 |
14.0 | 3.76 | 104 | 8.70 | 5.10 |
13.8 | 3.90 | 67 | 7.41 | 4.40 |
12.5 | 3.80 | 89 | 5.35 | 3.15 |
11.5 | 3.65 | 192 | 6.35 | 3.90 |
14.2 | 3.60 | 301 | 4.25 | 2.40 |
17.3 | 3.86 | 99 | 12.85 | 7.70 |
15.8 | 3.93 | 66 | 4.90 | 2.75 |
Calculate the correlation between Quality and pH.
Plot all pairs.
Solution.
Let \(x\) denote Quality and \(y\) denote pH. With \(n=20\), we have \(\bar{x}=15.22\), \(\bar{y}=3.7885\), \(\sum x_i^2=4708.82\), \(\sum y_i^2=287.3713\), and \(\sum x_i y_i=1154.931\). Therefore,
\[r = \frac{\sum(x_i y_i)-n\bar{x}\bar{y}}{\sqrt{\sum x_i^2-n\bar{x}^2}\cdot\sqrt{\sum y_i^2-n\bar{y}^2}}=\frac{1154.931-20\cdot 15.22\cdot 3.7885}{\sqrt{4708.82-20\cdot 15.22^2}\cdot\sqrt{287.3713-20\cdot 3.7885^2}}=0.3492\]
# To plot one pair, say the pair for Quality versus pH, we do
plot(D$pH, D$Quality, xlab = "pH", ylab = "Quality") # pH will be on the x-axis
# The following also works
plot(D$Quality ~ D$pH, xlab = "pH", ylab = "Quality") # pH will be on the x-axis
The plots suggest that there is a strong positive correlation between ColorDensity and Color.
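Both parts of this example can be done in R with the built-in `cor()` and `pairs()` functions. The following is a minimal sketch that re-enters the data so it runs on its own; `r_builtin` and `r_manual` are names of my own.

```r
# Wine data from the example.
Quality <- c(19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
             15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8)
pH <- c(3.85, 3.75, 3.88, 3.66, 3.47, 3.75, 3.92, 3.97, 3.76, 3.98,
        3.75, 3.77, 3.76, 3.76, 3.90, 3.80, 3.65, 3.60, 3.86, 3.93)
TotalSO2 <- c(66, 79, 73, 86, 178, 108, 96, 59, 22, 58,
              120, 144, 100, 104, 67, 89, 192, 301, 99, 66)
ColorDensity <- c(9.35, 11.15, 9.40, 6.40, 3.60, 5.80, 5.00, 10.25, 8.20, 10.15,
                  8.80, 5.60, 5.55, 8.70, 7.41, 5.35, 6.35, 4.25, 12.85, 4.90)
Color <- c(5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00,
           5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75)
D <- data.frame(Quality, pH, TotalSO2, ColorDensity, Color)

# Correlation between Quality and pH: built-in function.
r_builtin <- cor(D$Quality, D$pH)

# The same correlation from the computational formula.
n <- nrow(D)
r_manual <- (sum(Quality * pH) - n * mean(Quality) * mean(pH)) /
  (sqrt(sum(Quality^2) - n * mean(Quality)^2) * sqrt(sum(pH^2) - n * mean(pH)^2))
c(r_builtin, r_manual)

# Scatterplots of all pairs of variables, and the full correlation matrix.
pairs(D)
round(cor(D), 2)
```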
A population is a collection of subjects or objects of interest. Examples of populations are:
All SCSU students
All cars manufactured by Tesla.
All animals in a forest.
For a given population, we might be interested in a quantity that describes the population. Such a quantity is called a parameter. For the population of all SCSU students, we might be interested in the proportion of students who would like to be a CEO in their future. For the population of all animals in a forest, we might be interested in the mean age of all animals.
In general, the proportion (denoted by \(p\)) and the mean (denoted by \(\mu\)) are two commonly studied types of parameters.
How can we estimate a parameter? It’s usually unrealistic to check each individual in a population in order to calculate a parameter. Instead, a random sample is drawn. If the sampling method is sound, the sample is expected to be representative of the population and thus the sample counterpart of the population parameter can be a good estimate for the parameter. For example, to estimate the mean of a population, we can use the sample mean; to estimate the proportion of a population, we can use the sample proportion. These quantities depend on observations and are called statistics, and they are the point estimates of the corresponding parameters. We will introduce interval estimates in the next chapter, where we give a range for the parameter and tell how confident we are that the interval would cover the unknown parameter.
The following are commonly used point estimates:
the sample mean (denoted \(\bar{x}\)) for the population mean
the sample proportion (denoted \(\hat{p}\)) for the population proportion
the difference in sample means \(\bar{x}_1-\bar{x}_2\) for the difference in two population means
the difference in sample proportions (denoted \(\hat{p}_1-\hat{p}_2\)) for the difference in two population proportions
the sample variance for the population variance
When estimating a population parameter, we start with the best estimate: the sample statistic, which is the sample counterpart of the parameter. For example, if the parameter is a population mean, then the sample statistic is the sample mean; if the parameter is a population proportion, then the sample statistic is the sample proportion.
Since the sample statistic depends on the sample and the sample is random, the sample statistic must be random. The distribution of all the possible values of the sample statistic is called the sampling distribution. It describes how the sample statistic varies across different samples of the same size taken from the population.
Here is an app showing what a sampling distribution is: https://www.lock5stat.com/StatKey/
To use this app, in the row indicated by “Sampling Distribution”, click either “Mean” or “Proportion”. Let’s say you have clicked “Mean”. From the dropdown menu at the upper-left corner, choose a dataset which can be viewed as a population. Let’s say you have chosen “Baseball Players-3e”. Set an appropriate sample size, and then click “Generate 1000 samples”. Now, you have some results including 3 graphs. The first graph on the right side of the screen shows the histogram of the population, the second graph shows the histogram of the most recent sample, and the third graph or the main graph shows the dotplot of all the sample means. The first two graphs should be similar if the sample size is large. The (third) graph of the sampling distribution may not necessarily be similar to that of the population, but they do have similar center if the sample size is large. In addition, the spread (described by the standard error, std. error) of the sampling distribution gets smaller if the sample size is larger. Try the app out!
Two general results:
The mean of the sample mean (\(\bar{x}\)) is always equal to the population mean (\(\mu\)). The standard deviation of the sample mean is always equal to the population standard deviation divided by the square root of the sample size.
The mean of the sample proportion (\(\hat{p}\)) is always equal to the population proportion (\(p\)). The standard deviation of the sample proportion is always equal to \(\sqrt{\frac{p(1-p)}{n}}\).
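The first of these results can be checked by simulation. The sketch below is only an illustration: the exponential population with mean 2, the sample size of 25, and the 10,000 repetitions are assumptions chosen for the demonstration.

```r
set.seed(1)                 # for reproducibility
mu = 2; sigma = 2           # an exponential population with mean 2 also has sd 2
n = 25                      # sample size (assumed for illustration)
B = 10000                   # number of simulated samples

# Draw B samples of size n and record the sample mean of each one
xbar = replicate(B, mean(rexp(n, rate = 1/mu)))

mean(xbar)                  # should be close to mu = 2
sd(xbar)                    # should be close to sigma/sqrt(n)
sigma/sqrt(n)               # theoretical value, 0.4
```

The simulated mean and standard deviation of the sample means will not match the theoretical values exactly, but they should get closer as the number of repetitions increases.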
For a continuous population with mean \(\mu\) and standard deviation \(\sigma\), in general,
if the population is normally distributed, the sample mean (\(\bar{x}\)) has a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
if the population is not normally distributed, the sample mean (\(\bar{x}\)), when the sample size \(n\) is large (say > 30), approximately has a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
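For a discrete population with proportion \(p\), in general, when the sample size \(n\) is large, the sample proportion (\(\hat{p}\)) approximately has a normal distribution with mean \(p\) and standard deviation \(\sqrt{\frac{p(1-p)}{n}}\).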
These results are called the Central Limit Theorems (CLT’s).
Example 1.
A random sample of size 64 is selected from a population with an exponential distribution having mean 2. What is the probability that the sample mean exceeds 1.8?
Solution.
Note that the population standard deviation is also 2, since for the exponential distribution, the mean and standard deviation are the same.
Since the sample size is relatively large, by the CLT, the sample mean has an approximately normal distribution with mean 2 and standard deviation \(\frac{2}{\sqrt{64}}=0.25\). That is, \(\bar{X}\) is approximately normally distributed with mean 2 and standard deviation 0.25. We need to calculate \(P(\bar{X}>1.8)\).
\[P(\bar{X}>1.8)=P\left(Z>\frac{1.8-2}{0.25}\right)=P(Z>-0.8)=0.7881\]
You can use a standard normal table (https://www.z-table.com/) or the following R code to find \(P(Z>-0.8)\):
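A minimal call that gives this probability, assuming the base R function `pnorm` is what is intended here (the variable name `p.upper` is only for illustration):

```r
# P(Z > -0.8) for a standard normal Z: the upper-tail area beyond -0.8
p.upper = pnorm(-0.8, lower.tail = FALSE)
p.upper
```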
## [1] 0.7881446
We prefer the statistic (i.e., point estimate) of a parameter exhibiting favorable properties, including lack of bias and minimal variance.
A statistic is said to be unbiased, if its mean is equal to the parameter to be estimated.
The sample mean (\(\bar{X}\)), the sample proportion (\(\hat{p}\)), and the sample variance (\(s^2\)) are all unbiased for their population counterparts.
In general, we would like a statistic to be unbiased and to have a low variance. In practice, these two goals are usually difficult to achieve at the same time (this is called the bias-variance trade-off) unless we collect more data. To achieve one of the two goals, we usually have to sacrifice the other.
The variances of the sample mean and the sample proportion are
\[V(\bar{X})=\frac{\sigma^2}{n}\]
\[V(\hat{p})=\frac{p(1-p)}{n}\] where \(p\) is the population proportion.
The standard deviation of the sample mean, \[\frac{\sigma}{\sqrt{n}},\] depends on \(\sigma\), which is unknown. A good estimate of \(\sigma\) is the sample standard deviation \(s\). Therefore, \(\frac{\sigma}{\sqrt{n}}\) can be estimated by \(\frac{s}{\sqrt{n}}\). The latter is called the standard error of the sample mean.
Similarly, the standard deviation of the sample proportion
\[\sqrt{\frac{p(1-p)}{n}}\] can be estimated by \[\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] which is called the standard error of the sample proportion.
The standard error of a point estimator is a measure of the variability or precision of the estimator.
Example.
An article in the Journal of Heat Transfer (Trans. ASME, Sec. C, 96, 1974, p. 59) described a new method of measuring the thermal conductivity of Armco iron. Using a temperature of 100°F and a power input of 550 watts, the following 10 measurements of thermal conductivity (in Btu/hr-ft-°F) were obtained:
41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04
Obtain a point estimate of the mean thermal conductivity of Armco iron.
What is the standard error of such an estimate?
Solution.
The sample mean \(\bar{x}=41.924\) and the sample standard deviation is 0.2841.
A point estimate of the mean thermal conductivity of Armco iron is 41.924.
The standard error of the point estimate obtained in part (a) is \(\frac{s}{\sqrt{n}}=\frac{0.2841}{\sqrt{10}}=0.0898\). The standard error of 0.0898 is small relative to the sample mean, which suggests that the sample mean is relatively precise: it would not vary much from sample to sample. It means that the 10 measurements taken are fairly consistent, and the sample mean is likely a good representation of the true mean.
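The hand calculation above can be reproduced in R. This is a minimal sketch using base R functions; the variable names are chosen only for illustration.

```r
x = c(41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04)
n = length(x)          # sample size
xbar = mean(x)         # point estimate of the mean thermal conductivity
s = sd(x)              # sample standard deviation
se = s/sqrt(n)         # standard error of the sample mean
round(c(mean = xbar, sd = s, se = se), 4)
```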
Example.
In a random sample of 50 PCs, 12 are Dell’s PCs.
Obtain a point estimate of the proportion of PCs that are Dell’s PCs.
What is the standard error of such an estimate?
Solution.
The sample proportion \(\hat{p}=\frac{12}{50}=0.24\).
A point estimate of the proportion of PCs that are Dell’s PCs is 0.24.
The standard error of the point estimate obtained in part (a) is \(\sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n}}=\sqrt{\frac{0.24\cdot (1-0.24)}{50}}=0.06\). This standard error represents the variability associated with the point estimate of the proportion of Dell’s PCs in the population.
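The same calculation can be sketched in R as follows (the variable names are illustrative):

```r
x = 12; n = 50                   # number of Dell PCs and sample size
phat = x/n                       # point estimate of the proportion
se = sqrt(phat*(1 - phat)/n)     # standard error of the sample proportion
round(c(phat = phat, se = se), 4)
```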
The point estimate of a parameter does not tell how accurate the estimation is. Another way of estimating a parameter is to use a range of values, called an interval estimate. An interval estimate is associated with a certain level of confidence, which tells how sure we are that the interval covers the unknown parameter, so such an interval is often called a confidence interval.
In addition to the confidence interval method, we later will introduce the method of testing hypotheses about a population. Both methods belong to the so-called statistical inference, which is the process of making an inference about a population using a sample.
The \(1-\alpha\) confidence interval on the mean of a normal distribution with a known variance (\(\sigma^2\)) is given by
\[\bar{x}\pm z_{\alpha/2}\cdot \frac{\sigma}{\sqrt{n}} \]
where \(z_{\alpha/2}\) is called the critical value which is the cutoff of the standard normal distribution separating the top \(\alpha/2\) tail area from the other area.
The part \(z_{\alpha /2}\cdot \frac{\sigma}{\sqrt{n}}\) is called the margin of error.
The length of a confidence interval is a measure of precision of estimation.
The above confidence interval is called a z confidence interval, since it is based on the standard normal density, which is also known as the z-density.
The value of \(\alpha\) is usually 0.1, 0.05, or 0.01, and the corresponding (right) cutoffs (called the critical z-values, denoted by \(z_{\alpha/2}\)) are 1.645, 1.96, and 2.576, respectively. They are obtained by a standard normal table or the following R code:
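A minimal set of calls that gives these cutoffs, assuming the base R quantile function `qnorm` is what is intended here (the names `z10`, `z05`, and `z01` are only for illustration):

```r
# Critical value z_{alpha/2}: the cutoff with area alpha/2 to its right
z10 = qnorm(1 - 0.10/2)   # alpha = 0.10
z05 = qnorm(1 - 0.05/2)   # alpha = 0.05
z01 = qnorm(1 - 0.01/2)   # alpha = 0.01
z10
z05
z01
```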
## [1] 1.644854
## [1] 1.959964
## [1] 2.575829
Example 1.
ASTM Standard E23 defines standard test methods for notched bar impact testing of metallic materials. The Charpy V-notch (CVN) technique measures impact energy and is often used to determine whether or not a material experiences a ductile-to-brittle transition with decreasing temperature. Ten measurements of impact energy (J) on specimens of A238 steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, and 64.3. Assume that impact energy is normally distributed with \(\sigma\) = 1 J. We want to find a 95% CI for \(\mu\), the mean impact energy.
Solution.
The required quantities are \(n = 10, \sigma = 1\), \(\bar{x}=64.46\), and \(\alpha = 0.05\). The critical value \(z_{\alpha/2}\) is 1.96. So the 95% confidence interval for the mean impact energy is \(64.46\pm 1.96\cdot \frac{1}{\sqrt{10}}=64.46\pm 0.6198\) or \(63.84 < \mu< 65.08\).
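The interval can also be computed step by step in R. This is a sketch of the formula above with illustrative variable names; it uses the more accurate critical value 1.959964 rather than 1.96, so the last digits may differ slightly from the hand calculation.

```r
x = c(64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3)
sigma = 1                          # known population standard deviation
conf.level = 0.95
z = qnorm((1 + conf.level)/2)      # critical value z_{alpha/2}
E = z*sigma/sqrt(length(x))        # margin of error
ci = mean(x) + c(-1, 1)*E          # lower and upper limits
round(ci, 2)
```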
Interpretation of confidence intervals:
The confidence interval is a random interval: a new sample gives a different interval. The confidence level \(1-\alpha\) is the proportion of all possible confidence intervals that would cover the true population parameter. So, we can say "we are 95% confident that the mean impact energy is between 63.84 J and 65.08 J."
The following is a graphical interpretation of confidence intervals (assuming the sample size is 3 and the true population mean is 10):
This picture shows that based on 100 confidence intervals of level 95%, about 5 intervals fail to cover the true value of \(\mu\).
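The idea can also be explored by simulation. The sketch below builds 100 intervals of level 95% from samples of size 3 drawn from a normal population with mean 10. The population standard deviation of 1 is an assumption made for this illustration only.

```r
set.seed(2023)                       # for reproducibility
mu = 10; sigma = 1                   # true mean; sigma = 1 is assumed
n = 3; alpha = 0.05
z = qnorm(1 - alpha/2)               # critical value
E = z*sigma/sqrt(n)                  # margin of error (same for every sample)

# For each of 100 samples, check whether the interval covers mu
covers = replicate(100, {
  xbar = mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - E < mu) & (mu < xbar + E)
})
sum(covers)                          # number of intervals that cover mu
```

The count varies from run to run (change the seed to see this), but it should usually be close to 95.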
Is the following statement correct?
If a 95% confidence interval on the mean has a lower limit of 10 and an upper limit of 15, this implies that 95% of the time the true value of the mean is between 10 and 15.
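No. The true value of the mean is unknown but fixed, not random, so it either is between 10 and 15 or it is not. The 95% refers to the procedure: about 95% of the intervals constructed in this way from repeated samples would cover the true mean.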
The confidence interval can be impacted by a few factors:
the sample size (\(n\)): the larger the \(n\), the shorter the interval and thus the more precise the interval estimate.
the population standard deviation (\(\sigma\)): the smaller the \(\sigma\), the shorter the interval.
the confidence level \(1-\alpha\): the smaller the \(\alpha\), the higher the confidence level, and the wider the interval.
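Determining the Sample Size for Specified Error on the Mean, Variance Known: to make the margin of error at most \(E\), choose the smallest integer \(n\) such that \[n\ge \left(\frac{z_{\alpha/2}}{E}\cdot \sigma\right)^2\]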
Example 2.
To get a 95% confidence interval for a population mean with a margin of error of at most 0.6, what sample size should be used? Assume that the population standard deviation is 2.
Solution.
\[n=(\frac{z_{\alpha/2} }{E}\cdot \sigma)^2=(\frac{1.96 }{0.6}\cdot 2)^2=42.68\]
The sample size should be at least 43.
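The calculation can be wrapped in a small helper function. The function `n.for.mean` below is a hypothetical name written for this illustration; it is not part of base R.

```r
# Smallest sample size so that the margin of error of a z interval is at most E
n.for.mean = function(E, sigma, conf.level = 0.95){
  z = qnorm((1 + conf.level)/2)    # critical value z_{alpha/2}
  ceiling((z*sigma/E)^2)           # always round up to the next integer
}
n.for.mean(E = 0.6, sigma = 2)
```

Rounding up (rather than to the nearest integer) guarantees that the margin of error does not exceed the target.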
Sometimes, we may only want a lower bound or an upper bound for a confidence interval.
To construct a \(1-\alpha\) lower-confidence bound on the population mean, use \(\mu > \bar{x}-z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}\).
To construct a \(1-\alpha\) upper-confidence bound on the population mean, use \(\mu < \bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}\).
Example.
In a location, the temperatures measured in June were obtained as follows:
13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0
Assume that the population standard deviation is 0.3.
Construct a 90% lower-confidence bound on the mean temperature.
Construct a 90% upper-confidence bound on the mean temperature.
Solution.
The sample mean is 13.93636. We are given \(\alpha = 0.10\), so \(z_{\alpha}=1.28\).
A 90% lower-confidence bound on the mean temperature is \(\bar{x}-z_{\alpha}\cdot \frac{\sigma}{\sqrt{n}}=13.93636-1.28\cdot \frac{0.3}{\sqrt{11}}=13.82\). That is, the 90% one-sided confidence interval with lower bound is \((13.82, \infty)\).
A 90% upper-confidence bound on the mean temperature is \(\bar{x}+z_{\alpha}\cdot\frac{\sigma}{\sqrt{n}}=13.93636+1.28\cdot\frac{0.3}{\sqrt{11}}=14.05\). That is, the 90% one-sided confidence interval with upper bound is \(( -\infty, 14.05)\).
The following R function written by the instructor can help you find a \(z\) confidence interval:
# One-sample z-test and z confidence interval for a mean (sigma known)
z.test = function(x, alternative = c("two.sided", "less", "greater"), mu = 0, sigma = 1, conf.level = 0.95){
  # sample size, sample mean, test statistic, and lower-tail probability
  n=length(x); m = mean(x); z = (m-mu)/(sigma/sqrt(n)); p = pnorm(z)
  if (alternative[1]=="less"){
    # left-tailed test with an upper confidence bound
    pvalue = p
    LB = -Inf
    UB = m+qnorm(conf.level)*sigma/sqrt(n)
  } else if (alternative[1]=="greater"){
    # right-tailed test with a lower confidence bound
    pvalue = 1- p
    LB = m-qnorm(conf.level)*sigma/sqrt(n)
    UB = Inf
  } else {
    # two-tailed test with a two-sided confidence interval
    pvalue = 2*min(p, 1-p)
    LB = m-qnorm((1+conf.level)/2)*sigma/sqrt(n)
    UB = m+qnorm((1+conf.level)/2)*sigma/sqrt(n)
  }
  # round the results for printing
  m = round(m, 5)
  z = round(z, 5)
  pvalue = round(pvalue, 5)
  LB = round(LB, 5)
  UB = round(UB, 5)
  # print a report in the style of t.test()
  cat(paste(" One Sample z-test\n\n", "data: ", deparse(substitute(x))), "\n",
      "z =", z, "p-value =", pvalue, "\n",
      paste("alternative hypothesis: true mean is",
            ifelse(alternative[1] == "two.sided", "not equal to", alternative[1]), mu, "\n",
            paste(100*conf.level, "percent confidence interval:\n"), " ", LB, UB,
            "\n", "Sample estimates:\n", "mean of x\n", " ", m), "\n")
}
How can you use the function to do the previous example?
x = c(13.1, 14.4, 15.4, 15.7, 12.5, 15.8, 14.6, 12.5, 13.3, 12.0, 14.0)
# (a)
z.test(x, conf.level = 0.9, alternative = "greater", sigma=0.3)
## One Sample z-test
##
## data: x
## z = 154.0723 p-value = 0
## alternative hypothesis: true mean is greater 0
## 90 percent confidence interval:
## 13.82044 Inf
## Sample estimates:
## mean of x
## 13.93636
# (b)
z.test(x, conf.level = 0.9, alternative = "less", sigma=0.3)
## One Sample z-test
##
## data: x
## z = 154.0723 p-value = 1
## alternative hypothesis: true mean is less 0
## 90 percent confidence interval:
## -Inf 14.05228
## Sample estimates:
## mean of x
## 13.93636
The results are almost the same as those obtained by hand. They are not exactly the same because my code uses a more accurate critical value. For example, you use 1.28 when doing the problem by hand, but the code uses 1.281552.
When the population variance is unknown, it will have to be estimated before a confidence interval can be constructed. Such estimation might further introduce uncertainty to the estimation process, so the interval is expected to be longer than the situation where the population variance is known. This is reflected in the change of the critical value in the formula. The new critical value is related to a new distribution called the \(t\)-distribution.
Each \(t\)-distribution is determined by its number of degrees of freedom (\(df = n-1\)). A \(t\)-distribution has mean 0 (when \(df>1\)) and variance \(\frac{df}{df-2}\) (when \(df>2\)), which is slightly larger than 1. As \(n\) gets larger and larger, the \(t\)-distribution gets closer and closer to the standard normal distribution.
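One way to see how the \(t\)-distribution approaches the standard normal distribution is to compare critical values. The sketch below uses the base R functions `qt` and `qnorm`; the particular degrees of freedom are chosen only for illustration.

```r
df = c(2, 5, 10, 30, 100, 1000)    # increasing degrees of freedom
t.crit = qt(0.975, df)             # t cutoffs with upper-tail area 0.025
z.crit = qnorm(0.975)              # standard normal cutoff, about 1.96
round(t.crit, 3)
round(z.crit, 3)
```

The \(t\) critical values are all larger than the \(z\) critical value and decrease toward it as the degrees of freedom grow.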
The \(1-\alpha\) confidence interval on the mean of a normal distribution with an unknown variance is given by
\[\bar{x}\pm t_{\alpha/2}\cdot \frac{s}{\sqrt{n}} \]
where
\(\bar{x}\) is the sample mean
\(s\) is the sample standard deviation
\(t_{\alpha/2}\) is called the t-critical value which is the cutoff of the \(t\)-distribution separating the top \(\alpha/2\) tail area from the other area.
the degrees of freedom of the t-distribution is \(n-1\).
The following is part of a t table that gives the critical values and the corresponding right tail areas for each given number of degrees of freedom.
To use the table, let’s give some examples:
when sample size is 10, the degrees of freedom is 9. For a 95% confidence interval, \(\alpha\) would be 0.05, and \(\alpha/2\) is 0.025. From the graph and the t table above, the critical t-value is 2.262.
when sample size is 8, the degrees of freedom is 7. For a 90% confidence interval, \(\alpha\) would be 0.10, and \(\alpha/2\) is 0.05. From the table above, the critical t-value is 1.895.
A more detailed table can be found here: https://www.craftonhills.edu/current-students/tutoring-center/mathematics-tutoring/distribution_tables_normal_studentt_chisquared.pdf. For example,
when df = 15 and the upper tail area is 0.075, the critical value is 1.517;
when df = 22 and the upper tail area is 0.025, the critical value is 2.074;
when df = 35 and the critical value is 2.438, the upper tail area is 0.01;
when df = 12 and the critical value is 2.234, the upper tail area is between 0.01 and 0.025.
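The table lookups above can also be done in R. This is a sketch using the base R functions `qt` (critical value from a tail area) and `pt` (tail area from a critical value); the variable names are illustrative.

```r
# Critical values from upper tail areas
t1 = qt(1 - 0.025, df = 9)     # sample size 10, 95% CI; table value 2.262
t2 = qt(1 - 0.05,  df = 7)     # sample size 8, 90% CI; table value 1.895
t3 = qt(1 - 0.075, df = 15)    # table value 1.517
t4 = qt(1 - 0.025, df = 22)    # table value 2.074

# Upper tail areas from critical values
a1 = pt(2.438, df = 35, lower.tail = FALSE)   # table value 0.01
a2 = pt(2.234, df = 12, lower.tail = FALSE)   # between 0.01 and 0.025

round(c(t1, t2, t3, t4), 3)
round(c(a1, a2), 4)
```

Unlike the table, `pt` returns the tail area itself rather than a range that contains it.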
Example 1.
Engineers want to determine the average tensile strength of a new type of steel produced by a company. They take a random sample of 50 steel rods and measure their tensile strength. The data are:
501, 504, 499, 503, 502, 505, 498, 500, 506, 503, 505, 501, 499, 502, 504, 500, 502, 503, 506, 499, 502, 503, 501, 504, 502, 498, 499, 503, 504, 501, 503, 498, 506, 500, 502, 499, 505, 504, 503, 501, 500, 504, 502, 506, 499, 503, 505, 501, 498, 502
Calculate a 95% confidence interval for the mean tensile strength of all steel rods produced by the company.
Solution.
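When doing this by hand, the sample size is 50, the sample mean is 502, and the sample standard deviation is 2.3561. The \(\alpha\) is 0.05, and the critical t-value \(t_{\alpha /2}\) with 49 degrees of freedom, based on a t-table, is 2.01. The interval is \(502\pm 2.01\cdot \frac{2.3561}{\sqrt{50}}=502\pm 0.67\). The same interval can be obtained in R: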
x = c(501, 504, 499, 503, 502, 505, 498, 500, 506, 503,
505, 501, 499, 502, 504, 500, 502, 503, 506, 499,
502, 503, 501, 504, 502, 498, 499, 503, 504, 501,
503, 498, 506, 500, 502, 499, 505, 504, 503, 501,
500, 504, 502, 506, 499, 503, 505, 501, 498, 502)
t.test(x, conf.level = 0.95)
##
## One Sample t-test
##
## data: x
## t = 1506.6, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 501.3304 502.6696
## sample estimates:
## mean of x
## 502
Confidence Interval: 95% confidence interval for the mean tensile strength of the steel rods is (501.33 MPa, 502.67 MPa).
Interpretation:
With 95% confidence, we estimate that the true average tensile strength of all steel rods produced by the company falls within the range of approximately 501.33 megapascals (MPa) to 502.67 MPa.
This means that if we were to take many random samples of 50 steel rods and calculate a 95% confidence interval from each sample, we would expect approximately 95% of those intervals to contain the true population mean tensile strength.
The interval is narrow (a margin of error of about 0.67 MPa), so the mean tensile strength is estimated precisely. This reflects the small sample standard deviation, which suggests that the steel rods are manufactured with a high degree of consistency in terms of tensile strength.
Example 2.
Environmental engineers monitor air pollution levels in a city. They want to estimate the average concentration of a pollutant in the air over a specific period. They collect air quality measurements at various locations throughout the city. The data are:
18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2, 19.1, 18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7, 19.0, 18.9, 18.8, 19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0, 18.6, 19.2, 19.0, 18.8, 19.3, 18.7, 18.9, 19.1, 18.5, 18.6, 19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9, 18.6, 19.0, 18.5
Calculate a 90% confidence interval for the mean pollutant concentration in the city based on these 50 measurements.
Solution.
x = c(18.7, 19.2, 18.5, 19.0, 19.1, 18.8, 18.9, 19.3, 18.6, 19.2,
19.1, 18.7, 18.8, 19.0, 19.2, 18.9, 18.6, 19.1, 19.3, 18.7,
19.0, 18.9, 18.8, 19.1, 19.2, 18.7, 18.5, 18.9, 19.1, 19.0,
18.6, 19.2, 19.0, 18.8, 19.3, 18.7, 18.9, 19.1, 18.5, 18.6,
19.0, 18.8, 19.2, 19.1, 18.7, 19.3, 18.9, 18.6, 19.0, 18.5)
t.test(x, conf.level = 0.90)
##
## One Sample t-test
##
## data: x
## t = 549.73, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 18.85632 18.97168
## sample estimates:
## mean of x
## 18.914
Confidence Interval: 90% confidence interval for the mean pollutant concentration in the city is (18.86 ppm, 18.97 ppm).
Interpretation:
With 90% confidence, we estimate that the true average concentration of the pollutant in the air over the specified period falls within the range of approximately 18.86 parts per million (ppm) to 18.97 ppm.
This means that if we were to take multiple random samples of 50 air quality measurements and calculate 90% confidence intervals from each sample, we would expect approximately 90% of those intervals to contain the true population mean pollutant concentration.
The interval is relatively narrow, indicating a relatively precise estimate of the average pollutant concentration. The narrow interval reflects the small variability among the measurements, which suggests that the city’s air quality, as measured by this pollutant, is relatively consistent over the monitoring period.
skip
The \(1-\alpha\) confidence interval for a population proportion is given by
\[\hat{p}\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Example 3.
In a random sample of 85 automobile engine crankshaft bearings, 10 have a surface finish that is rougher than the specifications allow.
Find a point estimate of the proportion of bearings in the population (denoted \(p\)) that exceeds the roughness specification.
Construct a 95% two-sided confidence interval for \(p\).
Solution.
\(\hat{p}=\frac{10}{85}=0.1176\).
The 95% confidence interval is
\[\hat{p}\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=0.1176\pm 1.96\cdot \sqrt{\frac{0.1176(1-0.1176)}{85}}=0.1176\pm 0.0685\] or between 0.0491 and 0.1861.
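The formula can be applied directly in R. The sketch below follows the hand calculation step by step, with illustrative variable names.

```r
x = 10; n = 85                       # rough bearings and sample size
conf.level = 0.95
phat = x/n                           # point estimate
z = qnorm((1 + conf.level)/2)        # critical value z_{alpha/2}
E = z*sqrt(phat*(1 - phat)/n)        # margin of error
ci = phat + c(-1, 1)*E               # lower and upper limits
round(c(phat = phat, lower = ci[1], upper = ci[2]), 4)
```

Note that R's built-in `prop.test` does not use this formula. By default it inverts a score test with a continuity correction, so its interval can differ slightly from the one computed by hand.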
Sample size determination for a \(1-\alpha\) confidence interval of the population proportion \(p\)
Example 4.
To have a 95% confidence interval for a population proportion with a margin of error of at most 0.02, what is the smallest required sample size?
Solution.
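\(n=(\frac{z_{\alpha/2}}{E})^2 (0.25)=(\frac{1.96}{0.02})^2 (0.25)=2401\). Here 0.25 is the largest possible value of \(p(1-p)\) (attained at \(p=0.5\)); it is used because \(p\) is unknown. The sample size should be at least 2401.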
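A small helper function can do this calculation. The function `n.for.prop` below is a hypothetical name written for this illustration; its default `p = 0.5` corresponds to the conservative value 0.25 for \(p(1-p)\).

```r
# Smallest sample size so that the margin of error for a proportion is at most E
n.for.prop = function(E, conf.level = 0.95, p = 0.5){
  z = qnorm((1 + conf.level)/2)        # critical value z_{alpha/2}
  ceiling(z^2 * p*(1 - p) / E^2)       # always round up
}
n.for.prop(E = 0.02)
```

If a rough guess of \(p\) is available, passing it through the `p` argument gives a smaller required sample size than the conservative default.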
Example 5.
A sample of 9 faculty are selected from a large university. Their years of service are 23, 34, 12, 40, 34, 52, 27, 28, 40. Find a 95% confidence interval for the population mean.
The following is the R code for finding a \(t\)-confidence interval for a population mean:
x = c(23, 34, 12, 40, 34, 52, 27, 28, 40) # This is data vector
t.test(x, conf.level = 0.95) # Call the function “t.test” to do the calculation
The 95% confidence interval is: 23.38 to 41.06.
To use JMP, you may follow this video: https://www.youtube.com/watch?v=gDi4XWdCIbw
Example 6. A sample of 80 faculty are selected from a large university. Of these, 52 have had covid. Find a 90% confidence interval for the population proportion.
The following is the R code for finding a confidence interval for a population proportion:
n = 80
x = 52
prop.test(x, n, conf.level = 0.90)
The 90% confidence interval is: 0.55 to 0.74.
Example 7.
An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005, Vol. 13, pp. 273–279) “Arthroscopic meniscal repair with an absorbable screw: results and surgical technique” showed that only 25 out of 37 tears (67.6%) located between 3 and 6 mm from the meniscus rim were healed.
Calculate a 95% two-sided confidence interval on the proportion of such tears that will heal. Round the answers to 3 decimal places.
The following is the R code for finding a confidence interval for a population proportion:
n = 37
x = 25
prop.test(x, n, conf.level = 0.95)
Confidence intervals are estimation methods for parameters. In practice, people may be interested in testing claims or hypotheses about parameters.
Warning: this chapter covers the most difficult concept in an introductory statistics course. Stay awake!
Review the concept of sampling distribution for the sample mean covered in chapter 8: https://www.youtube.com/watch?v=0zqNGDVNKgA
Consider the following situation:
A company selling juice claims that on average, each bottle of their juice is 295 ml. Let’s first assume that the claim is correct. We will randomly select 50 bottles of juice. Which of the following sample means would provide the most evidence against the claim (or for the opposite of the claim)? Which one provides the least evidence?
291 ml
296 ml
299 ml
Use the following sampling distribution to help you answer these questions.
Answer: (a) and (c) provide the most evidence against the claim, and equally so, since both are 4 ml away from 295 ml; (b) provides the least, since it is only 1 ml away.
If the sample mean is very far away from the center of the sampling distribution, we should reject the assumption. Otherwise, we don’t reject it. Then, how far is far? We need a decision rule. One such rule is to reject the assumption when the standardized score or \(z\)-score of the sample mean is very different from 0 (i.e., \(|z|>c\) for some \(c\), called the critical value).
Watch this video: https://www.youtube.com/watch?v=zR2QLacylqQ a few times until it makes good sense to you.
Then watch this (the first 15 minutes): https://www.youtube.com/watch?v=VK-rnA3-41c
A statistical hypothesis is a statement about the parameters of one or more populations.
The following are examples of statistical hypotheses:
A company selling juice claims that on average, each bottle of their juice is 295 ml.
A presidential candidate claims that more than 50% of registered voters would support her.
A company producing light bulbs claims that less than one percent of their products are defective.
When dealing with a problem involving statistical hypotheses, a pair of competing hypotheses, called the null and alternative hypotheses, respectively, is first formed. In the next subsection, we will talk about how such a pair is formed.
A procedure leading to a decision about whether the null hypothesis should be rejected is called a test of a hypothesis.
The test of hypotheses can be summarized in four steps as follows:
Step 1: Specify the null and alternative hypotheses.
Step 2: Calculate the value of a test statistic.
Step 3: Calculate the critical value or \(p\)-value.
Step 4: Make a decision about whether to reject the null hypothesis and draw a conclusion.
To test each of the following claims,
A company selling juice claims that on average, each bottle of their juice is 295 ml.
A presidential candidate claims that more than 50% of registered voters would support her.
A company producing light bulbs claims that less than one percent of their products are defective.
The standard deviation of the life times of all light bulbs produced by a company is less than 30 hours.
The difference in the mean house income between Minnesota and Iowa is greater than $500.
the null and alternative hypotheses are
\(H_0: \mu = 295\) vs. \(H_a: \mu \ne 295\)
\(H_0: p \le 0.5\) vs. \(H_a: p > 0.5\)
\(H_0: p\ge 0.01\) vs. \(H_a: p<0.01\)
\(H_0: \sigma = 30\) vs. \(H_a: \sigma < 30\)
\(H_0: \mu_{\text{MN}}-\mu_{\text{IA}} = 500\) vs. \(H_a: \mu_{\text{MN}}-\mu_{\text{IA}} > 500\)
Technically, we can use an equality sign under all null hypotheses (indicated by \(H_0\)), but can never use an equality sign under the alternative hypothesis (indicated by \(H_a\) or \(H_1\)). That is, the only possible signs under an alternative hypothesis are “>”, “<”, and “\(\ne\).”
We can only use a parameter but never use a statistic to specify a hypothesis. The following would be all wrong:
\(H_0: \bar{x} = 295\) vs. \(H_a: \bar{x} \ne 295\)
\(H_0: \hat{p} \le 0.5\) vs. \(H_a: \hat{p} > 0.5\)
\(H_0: s = 30\) vs. \(H_a: s < 30\)
because \(\bar{x}, \hat{p}\) and \(s\) all represent statistics.
Once the null and alternative hypotheses are determined, a test statistic will be used as a judge between the null and alternative hypotheses. The test statistic is an expression involving summary statistics and the parameter(s). It can be different, depending on the context. We will use an argument similar to proof by contradiction in mathematics. Specifically, we will first assume that the null hypothesis is true. We then calculate a quantity (a critical value or \(p\)-value) which indicates whether there is a contradiction. If yes, we reject the null hypothesis. Otherwise, we do not reject the null hypothesis.
Our decision might be wrong. There are two types of errors we could make, type I and type II errors, which are described below.
If the null hypothesis is true but rejected, we have made a type I error.
If the null hypothesis is false but not rejected, we have made a type II error.
The probability of making a type I error is denoted by \(\alpha\). The probability of making a type II error is denoted by \(\beta\). We would like both \(\alpha\) and \(\beta\) to be small. How can we reduce both? Increasing the sample size would be the only way.
The value or an upper bound of \(\alpha\) is usually pre-set (or controlled) by the investigator and is called the level of significance.
When the alternative hypothesis has a “\(<\)” or a “>” sign, the test is called a one-sided test. Furthermore, if it is “<”, the test is left-sided (or left-tailed); if it is “>”, the test is right-sided (or right-tailed).
When the alternative hypothesis has a “\(\ne\)” sign, the test is said to be a two-sided (or two-tailed) test.
When we make a decision regarding whether the null hypothesis should be rejected, we can use a few methods. One such method is the \(p\)-value method. The \(p\)-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small \(p\)-value (say < 5%) means that such an observed outcome would be very unlikely to occur by chance alone under the null hypothesis. A relatively large \(p\)-value (say > 5%) means that such an observed outcome may occur by chance under the null hypothesis.
More specifically,
Consider an observed value \(t\) of a test statistic \(T\). Then the \(p\)-value is \(P(T\ge t|H_{0})\) for a one-sided right-tail test, \(P(T\le t|H_{0})\) for a one-sided left-tail test, or \(2\cdot \text{min}\{P(T\ge t|H_{0}), P(T\le t|H_{0})\}\) for a two-sided test.
When the \(p\)-value is less than or equal to a pre-selected \(\alpha\) (called the significance level), reject the null hypothesis. In this case, we also say that the result is statistically significant.
Statistical significance does not always imply practical significance. For example, if we find that the difference in mean income between Minnesota and Iowa residents is $3 and it is statistically significant, such a small difference would not matter in practice.
The \(p\)-values under one-sided and two-sided alternatives are related. If the \(p\)-value of a one-sided test is \(p\), the \(p\)-value of the corresponding two-sided test must be \(2p\) or \(2(1-p)\), whichever is between 0 and 1.
If a \(1-\alpha\) confidence interval does not include the value that is under the null hypothesis of a two-sided (or two tailed) test, we can reject the null hypothesis at the \(\alpha\) level of significance.
Example.
Based on a sample, a 90% confidence interval for a population mean \(\mu\) is \((23.67, 27.92)\). If we want to use the sample to test the following hypotheses
\[H_0: \mu=22 ~~~ vs ~~~ H_a: \mu\ne 22\] do we reject the null hypothesis at the significance level 0.10?
Solution.
Yes, reject the null hypothesis at level 0.10, since the hypothesized value 22 falls out of the 90% confidence interval.
You can also use a one-sided confidence bound to draw a conclusion about a one-sided test. For a left-tail test, you can develop a level \(1-\alpha\) upper bound; for a right-tail test, you can develop a level \(1-\alpha\) lower bound. When the value under the null hypothesis is beyond the bound, reject the null hypothesis at the significance level \(\alpha\).
There are two procedures for hypothesis tests:
The critical value method: We need to determine the critical region (or rejection region). If the test is left-tailed, we find a cutoff (called \(c\)) that separates the lower \(\alpha\) area under the (null) distribution of the test statistic from the other area. The critical region is \((-\infty, c)\). If the test is right-tailed, we find a cutoff (called \(c\)) that separates the upper \(\alpha\) area under the distribution of the test statistic from the other area. The critical region is \((c, \infty)\). If the test is two-tailed, we find two cutoffs (called \(c\) and \(-c\)) with \(c\) separating the upper \(\alpha/2\) area under the distribution of the test statistic from the other area and \(-c\) separating the lower \(\alpha/2\) area under the distribution of the test statistic from the other area. The critical region is \((-\infty, -c)\cup (c, \infty)\). In each case, if the value of the test statistic falls in the corresponding critical region, reject the null hypothesis.
The p-value method: Calculate the area that is under the (null) distribution of the test statistic and is greater than the value of the test statistic. Such area can be represented by \(A=P(T\ge t|H_0)\). For a left-tailed test, the \(p\)-value equals \(1-A\); for a right-tailed test, the \(p\)-value equals \(A\); for a two-tailed test, the \(p\)-value equals \(2\cdot \text{min} \{1-A, A\}\).
We assume that a random sample \(X_1, X_2, \cdots, X_n\) has been taken from a normal population with known variance \(\sigma^2\).
The null hypothesis always looks like \(H_0: \mu=\mu_0\), where \(\mu_0\) is called the hypothesized value or null value.
The alternative hypothesis may look like one of the following:
\(H_a: \mu<\mu_0\)
\(H_a: \mu>\mu_0\)
\(H_a: \mu\ne \mu_0\)
The test statistic is always
\[Z_0=\frac{\bar{X}-\mu_0}{\sigma /\sqrt{n}}\]
The observed value of the test statistic is always
\[z_0=\frac{\bar{x}-\mu_0}{\sigma /\sqrt{n}}\]
For a left-tail test, the critical value is \(c\) such that \(P(Z_0\le c)=\alpha\), that is, \(c=-z_{\alpha}\), where \(z_{\alpha}\) is the cutoff for the upper \(\alpha\) tail of the standard normal distribution; the \(p\)-value is \(P(Z_0\le z_0)\).
For a right-tail test, the critical value is \(c\) such that \(P(Z_0\ge c)=\alpha\), that is, \(c=z_{\alpha}\); the \(p\)-value is \(P(Z_0\ge z_0)\).
For a two-tail test, the critical values are \(c\) and \(-c\) such that \(P(Z_0\ge c)=\alpha/2\), that is, \(c=z_{\alpha/2}\); the \(p\)-value is \(2\cdot \text{min}\{P(Z_0\le z_0), P(Z_0\ge z_0)\}\).
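The three cases can be collected in one small R function. This is a minimal sketch working from summary statistics; the function name `z_test_summary()` and its argument names are hypothetical, and the numbers in the call are made up for illustration.

```r
# One-sample z-test from summary statistics (sigma known).
# z_test_summary() and its arguments are hypothetical names for illustration.
z_test_summary <- function(xbar, mu0, sigma, n,
                           alternative = c("two.sided", "less", "greater"),
                           alpha = 0.05) {
  alternative <- match.arg(alternative)
  z0 <- (xbar - mu0) / (sigma / sqrt(n))   # observed test statistic
  if (alternative == "less") {
    critical <- qnorm(alpha)               # -z_alpha
    p_value  <- pnorm(z0)                  # lower-tail area
    reject   <- z0 < critical
  } else if (alternative == "greater") {
    critical <- qnorm(1 - alpha)           # z_alpha
    p_value  <- 1 - pnorm(z0)              # upper-tail area
    reject   <- z0 > critical
  } else {
    critical <- qnorm(1 - alpha / 2)       # z_{alpha/2}
    p_value  <- 2 * min(pnorm(z0), 1 - pnorm(z0))
    reject   <- abs(z0) > critical
  }
  list(z0 = z0, critical = critical, p_value = p_value, reject = reject)
}

# Illustrative call with made-up numbers (not taken from the notes)
res <- z_test_summary(xbar = 10.3, mu0 = 10, sigma = 1, n = 25,
                      alternative = "two.sided")
res
```

Both decision rules are returned together so that you can see they always agree: `reject` is based on the critical value, and the same decision follows from comparing `p_value` with `alpha`.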
The above is summarized below:
Example 1. A two-sided test for a population mean with \(\sigma\) known.
https://www.youtube.com/watch?v=BWJRsY-G8u0
Example 2. A left-sided test for a population mean with \(\sigma\) known.
https://www.youtube.com/watch?v=oEW8Hd_xy1k
Example 3. A right-sided test for a population mean with \(\sigma\) known.
We want to test whether a population mean is greater than 20. A random sample of size 36 gives a sample mean of 22. If the population standard deviation is 5, test, at level 0.05, whether the population mean exceeds 20.
Solution.
The null and alternative hypotheses are:
\[H_0:\mu = 20 ~~~ vs ~~~ H_a: \mu > 20\] The test statistic value is
\[z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{22-20}{5/\sqrt{36}}=2.4\]
Since a larger sample mean, or equivalently a larger \(z_0\), suggests rejection of the null hypothesis, the rejection region looks like \((c, \infty)\) with the critical value \(c=z_{\alpha}\). By the standard normal table or the R code \(qnorm(1-\alpha)\), \(c=1.645\).
Since the test statistic value falls in the rejection region, reject the null hypothesis.
Equivalently, we can use the \(p\)-value approach. The \(p\)-value is the area to the right of the statistic value under the standard normal curve. By the standard normal table or the R code \(1-pnorm(2.4)\), the \(p\)-value is 0.0082. Since the \(p\)-value is less than the significance level, reject the null hypothesis.
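The numbers in this example can be reproduced with a few lines of R. The variable names below are chosen here for illustration only.

```r
# Right-tailed z-test for Example 3 (sigma known)
xbar <- 22; mu0 <- 20; sigma <- 5; n <- 36; alpha <- 0.05

z0      <- (xbar - mu0) / (sigma / sqrt(n))  # observed test statistic
crit    <- qnorm(1 - alpha)                  # critical value z_alpha
p_value <- 1 - pnorm(z0)                     # area to the right of z0

c(z0 = z0, critical = crit, p_value = p_value)
z0 > crit         # TRUE means z0 falls in the rejection region
p_value < alpha   # TRUE means reject by the p-value method
```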
Skip
Skip
We assume that a random sample \(X_1, X_2, \cdots, X_n\) has been taken from a normal population with unknown variance \(\sigma^2\).
The test statistic is always
\[T_0=\frac{\bar{X}-\mu_0}{S /\sqrt{n}}\]
The observed value of the test statistic is always
\[t_0=\frac{\bar{x}-\mu_0}{s /\sqrt{n}}\]
For a left-tail test, the critical value is \(c\) such that \(P(T_0\le c)=\alpha\), that is, \(c=-t_{\alpha, n-1}\), where \(t_{\alpha, n-1}\) is the cutoff for the upper \(\alpha\) tail of the \(t\)-distribution with \(n-1\) degrees of freedom; the \(p\)-value is \(P(T_0\le t_0)\).
For a right-tail test, the critical value is given by \(c\) such that \(P(T_0\ge c)=\alpha\) and can be determined using R code \(qt(1-\alpha, n-1)\), and the \(p\)-value is given by \(P(T_0\ge t_0)\) and can be calculated using R code by \(1-pt(t_0, n-1)\).
For a two-tail test, the critical values are \(c\) and \(-c\) such that \(P(T_0\ge c)=\alpha/2\), that is, \(c=t_{\alpha/2, n-1}\); the \(p\)-value is \(2\cdot \text{min}\{P(T_0\le t_0), P(T_0\ge t_0)\}\).
The above is summarized below:
Example.
A very useful video: https://www.youtube.com/watch?v=VPd8DOL13Iw
Example.
Your company wants to improve sales. Past sales data indicate that the average sale was $100 per transaction. After training your sales force, recent sales data (taken from a random sample of 25 salesmen) indicate an average of $130, with a standard deviation of $15. Did the training work? Test your hypothesis at a 0.05 significance level.
Solution.
The population mean \(\mu\) is the parameter of interest. To test whether sales have improved, we use the following null and alternative hypotheses:
\[H_0: \mu=100 ~~~ vs ~~~ H_a: \mu>100\] The value of the test statistic is
\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{130-100}{15/\sqrt{25}}=10\] with \(n-1\) or 24 degrees of freedom.
Since larger \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like \((c, \infty)\), where \(c=t_{\alpha, n-1}\). We are given \(\alpha=0.05\), so the critical value based on the \(t_{24}\) distribution is 1.711, which is obtained by R code \(qt(1-\alpha, n-1)\) or by a \(t\)-table.
Since the test statistic value 10 falls in the rejection region, we reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the \(t_{24}\) distribution to the right of the test statistic value. Using the \(t\) table or the R code \(1-pt(10, 24)\), we know the \(p\)-value is smaller than 0.001 and thus smaller than the significance level 0.05. Again, we reject the null hypothesis.
In conclusion, the data provide sufficient evidence that sales have improved after training.
The following is a video explaining the above procedure:
https://www.youtube.com/watch?v=7ty2bO6VrUI
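As a quick check of the sales example, the test statistic, critical value, and \(p\)-value can be computed from the summary statistics in R. The variable names are illustrative only.

```r
# Right-tailed one-sample t-test from summary statistics (sales example)
xbar <- 130; mu0 <- 100; s <- 15; n <- 25; alpha <- 0.05

t0      <- (xbar - mu0) / (s / sqrt(n))   # observed test statistic
crit    <- qt(1 - alpha, df = n - 1)      # critical value t_{alpha, n-1}
p_value <- 1 - pt(t0, df = n - 1)         # area to the right of t0

c(t0 = t0, critical = crit, p_value = p_value)
t0 > crit   # TRUE means t0 falls in the rejection region
```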
Example.
A firm claims that their product on average weighs 19 pounds. A supervisory authority suspects that the average weight is below 19 pounds, so it collects a random sample of 51 products made by the company from the market. The sample mean is 18.5 pounds with a standard deviation of 3.2 pounds. Test appropriate hypotheses at the significance level 0.01. In order to prevent themselves from being sued by the company, should the authority use a larger or smaller significance level?
Solution.
The null and alternative hypotheses are:
\[H_0: \mu=19 ~~~ vs ~~~ H_a: \mu<19\] The value of the test statistic is
\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{18.5-19}{3.2/\sqrt{51}}=-1.1158\] with \(n-1\) or 50 degrees of freedom.
Since smaller \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like \((-\infty, c)\), where \(c=-t_{\alpha, n-1}\). We are given \(\alpha=0.01\), so the critical value based on the \(t_{50}\) distribution is \(-2.4033\), which is obtained by R code \(qt(\alpha, n-1)\) with \(\alpha = 0.01, n=51\) or by a \(t\)-table.
Since the test statistic value \(-1.1158\) does not fall in the rejection region, we fail to reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the \(t_{50}\) distribution to the left of the test statistic value. Using the \(t\) table or the R code \(pt(-1.1158, 50)\), we know the \(p\)-value is 0.1349 and thus NOT smaller than the significance level 0.01. Again, we fail to reject the null hypothesis.
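In conclusion, the data do not provide sufficient evidence that the average weight of the firm’s products is below 19 pounds. To limit the risk of wrongly accusing the firm (that is, rejecting a true null hypothesis), the authority should use a smaller significance level.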
The following is a video explaining the above procedure: https://www.youtube.com/watch?v=ZY5XxJ2aJNc
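The left-tailed calculation for the product weight example can be reproduced in R along these lines, using \(\alpha = 0.01\) as stated in the problem. The variable names are illustrative only.

```r
# Left-tailed one-sample t-test from summary statistics (product weight example)
xbar <- 18.5; mu0 <- 19; s <- 3.2; n <- 51; alpha <- 0.01

t0      <- (xbar - mu0) / (s / sqrt(n))   # observed test statistic
crit    <- qt(alpha, df = n - 1)          # critical value -t_{alpha, n-1}
p_value <- pt(t0, df = n - 1)             # area to the left of t0

c(t0 = t0, critical = crit, p_value = p_value)
t0 < crit   # FALSE means t0 is not in the rejection region
```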
Skip!
Skip!
Skip!
Skip!
It is often necessary to test hypotheses on a population proportion. For example, suppose that a random sample of size \(n\) has been taken from a large (possibly infinite) population and that \(X\) observations in this sample belong to a class of interest. Then \(\hat{P}=\frac{X}{n}\) is a point estimator of the proportion of the population p that belongs to this class. Typically, we require that \(np\) and \(n(1 − p)\) be greater than or equal to 5.
The null hypothesis always looks like \(H_0: p=p_0\), where \(p_0\) is called the hypothesized value or null value.
The alternative hypothesis may look like one of the following:
\(H_a: p<p_0\)
\(H_a: p>p_0\)
\(H_a: p\ne p_0\)
The test statistic is always
\[Z_0=\frac{\hat{P}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}\] or equivalently, \[Z_0=\frac{X-np_0}{\sqrt{np_0 (1-p_0)}}\]
The observed value of the test statistic is always \[z_0=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}}\] or, \[z_0=\frac{x-np_0}{\sqrt{np_0 (1-p_0)}}\]
For a left-tail test, the critical value is \(c\) such that \(P(Z_0\le c)=\alpha\), that is, \(c=-z_{\alpha}\), where \(z_{\alpha}\) is the cutoff for the upper \(\alpha\) tail of the standard normal distribution; the \(p\)-value is \(P(Z_0\le z_0)\).
For a right-tail test, the critical value is \(c\) such that \(P(Z_0\ge c)=\alpha\), that is, \(c=z_{\alpha}\); the \(p\)-value is \(P(Z_0\ge z_0)\).
For a two-tail test, the critical values are \(c\) and \(-c\) such that \(P(Z_0\ge c)=\alpha/2\), that is, \(c=z_{\alpha/2}\); the \(p\)-value is \(2\cdot \text{min}\{P(Z_0\le z_0), P(Z_0\ge z_0)\}\).
The above is summarized below:
Example.
A semiconductor manufacturer produces controllers used in automobile engine applications. The customer requires that the process fallout or fraction defective at a critical manufacturing step not exceed 0.045 and that the manufacturer demonstrate process capability at this level of quality using \(\alpha= 0.05\). The semiconductor manufacturer takes a random sample of 200 devices and finds that 4 of them are defective. Can the manufacturer demonstrate process capability for the customer?
Solution.
We may solve this problem using the following steps:
Parameter of interest: The parameter of interest is the process fraction defective \(p\).
Null hypothesis: \(H_0: p = 0.045\)
Alternative hypothesis: \(H_a: p < 0.045\) This formulation of the problem will allow the manufacturer to make a strong claim about process capability if the null hypothesis \(H_0: p = 0.045\) is rejected.
Test statistic: The test statistic is \(z_0=\frac{x-np_0}{\sqrt{np_0 (1-p_0)}}=-1.7055\)
where \(x = 4, n = 200\), and \(p_0 = 0.045\).
\(p\)-value: 0.044 (the left-tail area under the standard normal curve with cutoff \(-1.7055\)).
Decision & conclusion: Reject \(H_0\) since the \(p\)-value is less than 0.05. We conclude that the process fraction defective \(p\) is less than 0.045. Practical Interpretation: We conclude that the process is capable.
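The semiconductor calculation can be checked in R. The sketch below computes both equivalent forms of the test statistic and the left-tail \(p\)-value; the variable names are illustrative only.

```r
# Left-tailed large-sample test on a proportion (semiconductor example)
x <- 4; n <- 200; p0 <- 0.045; alpha <- 0.05

# Check the usual rule of thumb: both values should be at least 5
c(n * p0, n * (1 - p0))

p_hat     <- x / n
z0        <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # proportion form
z0_counts <- (x - n * p0) / sqrt(n * p0 * (1 - p0))    # count form, same value
p_value   <- pnorm(z0)                                 # area to the left of z0

c(z0 = z0, z0_counts = z0_counts, p_value = p_value)
p_value < alpha   # TRUE means reject H0
```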
Example. An article in Fortune (September 21, 1992) claimed that nearly one-half of all engineers continue academic studies beyond the B.S. degree, ultimately receiving either an M.S. or a Ph.D. degree. Data from an article in Engineering Horizons (Spring 1990) indicate that 118 of 484 new engineering graduates were planning graduate study.
Test the hypothesis \(H_0: p=0.5\).
What is the P-value for this test? Round your answer to 4 decimal places.
Skip!
Skip
The hypothesis-testing procedures that we have discussed in previous sections are designed for problems in which the population or probability distribution is known and the hypotheses involve the parameters of the distribution. Another kind of hypothesis is often encountered: We do not know the underlying distribution of the population, and we wish to test the hypothesis that a particular distribution will be satisfactory as a population model.
The test method, called the goodness-of-fit test, is based on the chi-square distribution.
Each chi-squared distribution is associated with a number called the number of degrees of freedom. The above graph shows 6 different chi-squared distributions. All chi-squared distributions are skewed to the right.
In this chapter, we will consider hypothesis testing problems that involve calculating p-values based on chi-squared distributions.
We will focus on the situation that the population distribution is discrete with only a few categories. So, the null hypothesis is
\[H_0: p_1 = p_{10}, ~p_2 = p_{20}, \cdots, ~p_k = p_{k0} \] where \(p_{10}, p_{20}, \cdots, p_{k0}\) are given proportions, and the alternative hypothesis is
\[H_a: \text{At least one of the proportions is not as specified}\]
The test procedure requires a random sample of size \(n\) from the population whose probability distribution is unknown. These \(n\) observations are arranged in a frequency table with \(k\) classes/categories.
Let \(O_i\) be the observed frequency in the \(i\)th class. Under the null hypothesis, we compute the expected frequency in the \(i\)th class, denoted \(E_i = n\cdot p_{i0}\), \(i = 1, 2, ..., k\). The test statistic is
\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]
Under the null hypothesis, \(\chi_0^2\) has, approximately, a chi-square distribution with \(k − 1\) degrees of freedom.
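We can again use one of the following two methods for making a decision:
Critical value method: the critical value is \(\chi^2_{\alpha, k-1}\), the cutoff for the upper \(\alpha\) tail of the chi-square distribution with \(k-1\) degrees of freedom. Reject the null hypothesis if \(\chi_0^2\) is greater than the critical value.
\(p\)-value method: the \(p\)-value is the upper-tail area under the chi-square curve with cutoff \(\chi_0^2\).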
Example.
Throw a 6-sided die 100 times. The observations are
17 ones
18 twos
13 threes
17 fours
22 fives
13 sixes
Test, at the significance level 0.05, whether the die is fair.
Solution.
The null hypothesis is \(H_0: p_1=p_2=\cdots=p_6=1/6\) and the alternative hypothesis is \(H_a: \text{At least one of the probabilities is not 1/6}\). Under the null hypothesis, the expected frequencies are all \((\frac{1}{6})(100)=16.67\). The observed frequencies are \(O_1=17, O_2=18, O_3=13, O_4=17, O_5=22, O_6=13\).
The test statistic
\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]
\[\chi_0^2=\frac{(17 -16.67)^2}{16.67}+\frac{(18 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}+\frac{(17 -16.67)^2}{16.67}+\frac{(22 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}\] \[\chi_0^2=3.44\] with \(6-1=5\) degrees of freedom.
The critical value is \(\chi_{0.05, 5}^2=11.0705\). Since the test statistic value is not greater than the critical value, we do not reject the null hypothesis. The \(p\)-value is \(P(\chi^2>3.44)= 0.6325\), the area of the right region under the chi-squared density curve \((df=5)\) (Watch: https://www.youtube.com/watch?v=HwD7ekD5l0g).
Decision & conclusion: Since the \(p\)-value is greater than the significance level 0.05, the null hypothesis is NOT rejected. We conclude that we don’t have enough evidence to say that the die is unfair.
The R code:
chisq.test(x=c(17, 18, 13, 17, 22, 13 ))
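To see what `chisq.test()` does here, the statistic can also be built step by step from the formula. The sketch below uses illustrative variable names and compares the hand calculation with the built-in result.

```r
# Goodness-of-fit test for the die example, computed step by step
observed <- c(17, 18, 13, 17, 22, 13)
p_null   <- rep(1 / 6, 6)                  # probabilities under H0 (fair die)
expected <- sum(observed) * p_null         # E_i = n * p_i0

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1            # k - 1 degrees of freedom
crit    <- qchisq(1 - 0.05, df)            # critical value at alpha = 0.05
p_value <- 1 - pchisq(chi_sq, df)          # upper-tail area

c(chi_sq = chi_sq, critical = crit, p_value = p_value)

# Built-in test; its statistic should match chi_sq
fit <- chisq.test(x = observed, p = p_null)
fit$statistic
```

When `p` is not supplied, `chisq.test()` assumes equal probabilities for all categories, which is why the shorter call above tests for a fair die.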
Many times the \(n\) elements of a sample from a population may be classified according to two different criteria. It is then of interest to know whether the two methods of classification are statistically independent; for example, we may consider the population of graduating engineers and may wish to determine whether starting salary is independent of academic disciplines. Assume that the first method of classification has \(r\) levels and that the second method has \(c\) levels. We will let \(O_{ij}\) be the observed frequency for level \(i\) of the first classification method and level \(j\) of the second classification method. The data would, in general, appear as shown in the following Table. Such a table is usually called an \(r × c\) contingency table.
To test the independence of the two categorical variables, the null and alternative hypotheses are
\[H_0: \text{The two categorical variables are independent vs.} ~ H_a:\text{The two categorical variables are dependent}\]
We again use the chi-square test and the test statistic is
\[\chi_0^2=\sum_{i,j} \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\]
where the expected frequency \(E_{ij}\) is calculated as the sum of the \(i\)th row multiplied by the sum of the \(j\)th column, then divided by the sum of all frequencies.
Under the null hypothesis, this test statistic has an approximate chi-square distribution with \((r − 1)(c − 1)\) degrees of freedom.
The critical value and the p-value are calculated in the same way as for goodness of fit.
Example.
A company has to choose among three health insurance plans. Management wishes to know whether the preference for plans is independent of job classification and wants to use α = 0.05. The opinions of a random sample of 500 employees are shown in table below:
Solution.
\(H_0: \text{Job classification and health insurance plan are independent}\) and \(H_a:\text{Job classification and health insurance plan are dependent}\)
The expected frequencies are 136, 136, 68, 64, 64, 32, respectively.
The Chi-square statistic is 49.63.
The critical value is 5.99 (the cutoff of the chi-square distribution that separates the upper tail area of 0.05).
The p-value is essentially 0.
Decision & conclusion: By either method, we reject the null hypothesis. We conclude that Job classification and health insurance plan are dependent.
R code:
M=matrix(c(160, 40, 140, 60, 40, 60), 2, 3)
chisq.test(M)
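The expected frequencies and the statistic reported in the solution can also be computed by hand in R. The sketch below uses the same matrix of observed counts as the code above; the other variable names are illustrative.

```r
# Test of independence for the health insurance example, step by step
M <- matrix(c(160, 40, 140, 60, 40, 60), nrow = 2, ncol = 3)

# E_ij = (row i total) * (column j total) / (grand total)
expected <- outer(rowSums(M), colSums(M)) / sum(M)
expected

chi_sq  <- sum((M - expected)^2 / expected)
df      <- (nrow(M) - 1) * (ncol(M) - 1)             # (r - 1)(c - 1)
crit    <- qchisq(0.95, df)                          # critical value at alpha = 0.05
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)    # upper-tail area

c(chi_sq = chi_sq, df = df, critical = crit, p_value = p_value)

# The built-in test stores the same expected frequencies
chisq.test(M)$expected
```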
Case Study: Paint Drying Time
A product developer is interested in reducing the drying time of a primer paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formulation 2 has a new drying ingredient that should reduce the drying time. From experience, it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient. Ten specimens are painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are painted in random order. The two sample average drying times are 121 minutes and 112 minutes, respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient?
In the above case study, the objective is to compare two different conditions to determine whether either condition produces a significant effect on the response that is observed. These conditions are sometimes called treatments. The two different treatments are two paint formulations, and the response is the drying time. The purpose of the study is to determine whether the new formulation results in a significant effect—reducing drying time. In this situation, the product developer (the experimenter) randomly assigned 10 test specimens to one formulation and 10 test specimens to the other formulation. Then the paints were applied to the test specimens in random order until all 20 specimens were painted. This is an example of a completely randomized experiment.
When statistical significance is observed in a randomized experiment, the experimenter can be confident in the conclusion that the difference in treatments resulted in the difference in response. That is, we can be confident that a cause-and-effect relationship has been found.
Most of the practical applications of the procedures to be covered in this chapter arise in the context of simple comparative experiments in which the objective is to study the difference in the parameters of the two populations.
The general situation is shown in the figure below:
Population 1 has mean \(\mu_1\) and variance \(\sigma_1^2\), and population 2 has mean \(\mu_2\) and variance \(\sigma_2^2\). Inferences will be based on two random samples of sizes \(n_1\) and \(n_2\), respectively.
There are many studies that are not randomized experiments. Those studies do not involve the use of treatments and are called observational studies. It is difficult to identify causality in observational studies because the observed statistically significant difference in response for the two groups may be due to some other underlying factor (or group of factors) that was not equalized by randomization and not due to the treatments. For example, the difference in heart attack risk could be attributable to the difference in iron levels or to other underlying factors that form a reasonable explanation for the observed results—such as cholesterol levels or hypertension.
In this section, we consider statistical inferences on the difference in means \(\mu_1 − \mu_2\) of two normal distributions where the variances \(\sigma_1^2\) and \(\sigma_2^2\) are known. The assumptions for this section are summarized as follows.
There is a random sample from population 1.
There is a random sample from population 2.
The two samples are independent.
Both populations are normal.
Example 1.
A product developer is interested in reducing the drying time of a primer paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formulation 2 has a new drying ingredient that should reduce the drying time. From experience, it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient. Ten specimens are painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are painted in random order. The two sample average drying times are 121 minutes and 112 minutes, respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient, using \(\alpha=0.05\)?
Solution.
Parameter of interest: The quantity of interest is the difference in mean drying times, \(\mu_1 − \mu_2\).
Null hypothesis: \(H_0: \mu_1 − \mu_2 = 0\), or \(H_0: \mu_1 = \mu_2\).
Alternative hypothesis: \(H_a: \mu_1 - \mu_2 > 0\) or \(H_a: \mu_1 > \mu_2\).
Test statistic:
\[Z_0=\frac{(\bar{X}_1-\bar{X}_2)-0}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}\] having a standard normal distribution under the null hypothesis. The value of the test statistic is \[z_0=\frac{(\bar{x}_1-\bar{x}_2)-0}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}=\frac{(121-112)-0}{\sqrt{\frac{8^2}{10}+\frac{8^2}{10}}}=2.52\]
The \(p\)-value: 0.006 (the area of the right-tail with cutoff 2.52 under the standard normal curve).
Decision & conclusion: Since the \(p\)-value is less than 0.05, we reject the null hypothesis. We conclude that adding the new ingredient to the paint significantly reduces the drying time. This is a strong conclusion, since the \(p\)-value is much less than 0.05.
If we use the critical value method, the critical value would be \(z_{0.05}=1.645\). Since the critical region is \([1.645, \infty)\) and the value of the test statistic falls in it, the null hypothesis is again rejected.
R code for \(p\)-value:
1-pnorm(2.52)
R code for the critical value:
qnorm(0.95)
Skip
When the interest is in estimating the difference in means between two populations, we take a random sample from each population. We use the difference in sample means to estimate the difference in population means. Since there will be a sampling error, we need to construct a confidence interval in order to take into account uncertainty.
We consider the case where the population variances are known. The level \(1-\alpha\) confidence interval for \(\mu_1 - \mu_2\) is
\[(\bar{x}_1 - \bar{x}_2)\pm z_{\alpha/2}\cdot \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]
where
\(\bar{x}_1\) and \(\bar{x}_2\) are sample means,
\(n_1\) and \(n_2\) are sample sizes, and
\(\sigma_1^2\) and \(\sigma_2^2\) are population variances.
For \(\alpha = 0.05\), \(z_{\alpha/2}=1.96\); for \(\alpha = 0.10\), \(z_{\alpha/2}=1.645\).
Example.
Suppose you want to estimate with 95 percent confidence the difference between the mean (average) length of the cobs of two varieties of sweet corn (allowing them to grow the same number of days under the same conditions). Call the two varieties Corn-e-stats and Stats-o-sweet. Assume by prior research that the population standard deviations for Corn-e-stats and Stats-o-sweet are 0.35 inches and 0.45 inches, respectively.
Suppose your random sample of 100 cobs of the Corn-e-stats variety averages 8.5 inches, and your random sample of 110 cobs of Stats-o-sweet averages 7.5 inches.
Construct a 95% confidence interval for the difference in means between the two varieties.
Solution.
Given: \(n_1=100, n_2 =110 , \bar{x}_1=8.5, \bar{x}_2=7.5, \sigma_1=0.35, \sigma_2=0.45, z_{\alpha/2}=1.96\)
The 95% confidence interval for \(\mu_1 -\mu_2\) is \(1\pm 0.1085\) or (0.8915, 1.1085).
Interpretation: with 95 percent confidence, the Corn-e-stats variety is longer, on average, than the Stats-o-sweet variety, by somewhere between 0.8915 inch and 1.1085 inches, based on the samples.
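The margin of error in this example comes directly from the confidence interval formula. A short R sketch, with illustrative variable names, is:

```r
# 95% confidence interval for mu1 - mu2 with known variances (sweet corn example)
xbar1 <- 8.5;   xbar2 <- 7.5
sigma1 <- 0.35; sigma2 <- 0.45
n1 <- 100;      n2 <- 110
conf_level <- 0.95

z      <- qnorm(1 - (1 - conf_level) / 2)        # z_{alpha/2}
se     <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)    # standard error of the difference
margin <- z * se
ci     <- (xbar1 - xbar2) + c(-1, 1) * margin

margin
ci
```

Changing `conf_level` to 0.90 would use \(z_{\alpha/2}=1.645\) and give a narrower interval.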
A level \(1-\alpha\) lower confidence bound for \(\mu_1 - \mu_2\) is
\[(\bar{x}_1 - \bar{x}_2)- z_{\alpha}\cdot \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\] A level \(1-\alpha\) upper confidence bound for \(\mu_1 - \mu_2\) is
\[(\bar{x}_1 - \bar{x}_2)+ z_{\alpha}\cdot \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]
We now consider tests of hypotheses on the difference in means \(\mu_1 − \mu_2\) of two normal distributions where the variances \(\sigma_1^2\) and \(\sigma_2^2\) are unknown. A \(t\)-statistic is used to test these hypotheses.
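Consider two situations:
Case 1: The variances are unknown but assumed equal (\(\sigma_1^2=\sigma_2^2\)). To test \(H_0: \mu_1-\mu_2=\Delta_0\), where \(\Delta_0\) is the hypothesized difference (often 0), the test statistic is
\[T_0=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{\sqrt{\frac{s_p^2}{n_1}+\frac{s_p^2}{n_2}}}=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]
where \(s_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}\) is the pooled sample variance. Under the null hypothesis, the test statistic has a \(t\)-distribution with \(n_1+n_2-2\) degrees of freedom.
Case 2: The variances are unknown and not assumed equal (\(\sigma_1^2\ne\sigma_2^2\)). The test statistic is
\[T_0=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}}\]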
Under the null hypothesis, the test statistic has a \(t\)-distribution with \(\nu\) degrees of freedom given by
\[\nu = \frac{(A+B)^2}{\frac{A^2}{n_1-1}+\frac{B^2}{n_2-1}}\] with \(A=\frac{S_1^2}{n_1}\) and \(B=\frac{S_2^2}{n_2}\)
If \(\nu\) is not an integer, round down to the nearest integer.
When we are not sure if the population variances are equal, we always consider the second case.
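When only summary statistics are available, both versions of the test statistic and their degrees of freedom can be computed directly from the formulas. The following is a minimal sketch: `two_sample_t()` is a hypothetical helper name, and the summary statistics in the calls are made up for illustration.

```r
# Two-sample t statistic and degrees of freedom from summary statistics.
# two_sample_t() is a hypothetical helper, not a base R function.
two_sample_t <- function(xbar1, xbar2, s1, s2, n1, n2,
                         delta0 = 0, var.equal = FALSE) {
  if (var.equal) {
    # Pooled variance and n1 + n2 - 2 degrees of freedom
    sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
    se  <- sqrt(sp2 * (1 / n1 + 1 / n2))
    df  <- n1 + n2 - 2
  } else {
    # Separate variances; nu is rounded down as described in the notes
    A  <- s1^2 / n1
    B  <- s2^2 / n2
    se <- sqrt(A + B)
    df <- floor((A + B)^2 / (A^2 / (n1 - 1) + B^2 / (n2 - 1)))
  }
  t0 <- (xbar1 - xbar2 - delta0) / se
  list(t0 = t0, df = df)
}

# Made-up summary statistics (not from the notes)
pooled <- two_sample_t(50, 47, s1 = 3, s2 = 6, n1 = 10, n2 = 15, var.equal = TRUE)
welch  <- two_sample_t(50, 47, s1 = 3, s2 = 6, n1 = 10, n2 = 15, var.equal = FALSE)
pooled
welch
```

Note that R's `t.test()` with `var.equal = FALSE` keeps the fractional value of \(\nu\) instead of rounding down, so its \(p\)-value can differ slightly from one based on the rounded degrees of freedom.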
If observations are available, you can use the following R code to do analysis:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
Example.
The overall distance traveled by a golf ball is tested by hitting the ball with Iron Byron, a mechanical golfer with a swing that is said to emulate the legendary champion, Byron Nelson. Ten randomly selected balls of two brands are tested and the overall distance measured. Assume that the variances are equal. The data follow:
Brand 1: 287 277 287 271 283 271 279 275 263 267 Brand 2: 259 248 260 265 273 281 271 270 263 268
Is there evidence to show that there is a difference in the mean overall distance of brands? Use 0.05 as the significance level.
Solution.
The R code is:
x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = TRUE)
The sample means are 276 and 265.8, and the test statistic value is about 2.64 with 18 degrees of freedom. Since the \(p\)-value (about 0.02) is less than 0.05, we reject the null hypothesis.
The critical region looks like \((-\infty, -c)\cup (c, \infty)\) with \(c = t_{\alpha/2, n_1+n_2-2}\). With \(\alpha = 0.05, n_1=10, n_2=10\), the R code qt(1-0.05/2, 10+10-2) gives the value 2.10 for \(c\). Since the test statistic value 2.64 falls in the critical region, we reject the null hypothesis.
Example.
In the previous example, if we do not make the assumption of equal variance, is there evidence to show that there is a difference in the mean overall distance of brands? Use 0.05 as the significance level.
Solution.
The R code is:
x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE)
The test statistic value is again about 2.64, since the two sample sizes are equal. Since the \(p\)-value (about 0.02) is less than 0.05, we reject the null hypothesis.
The critical region looks like \((-\infty, -c)\cup (c, \infty)\) with \(c = t_{\alpha/2, df}\). Here \(\nu \approx 17.8\), which is rounded down to \(df=17\). With \(\alpha = 0.05, df=17\), the R code qt(1-0.05/2, 17) gives the value 2.11 for \(c\). Since the test statistic value 2.64 falls in the critical region, we reject the null hypothesis.
More examples:
Example 1: Material Strength Testing
In a civil engineering project, two types of concrete mixtures (Type A and Type B) are being considered for constructing a bridge. Tensile strength is a critical factor for the bridge’s durability. The project team collects tensile strength measurements for samples of each concrete type. The goal is to determine if there’s a significant difference in tensile strength between the two types of concrete.
# Data for two types of concrete mixtures
concrete_type_A <- c(29.6, 31.2, 30.5, 32.1, 31.8, 29.9, 30.4, 30.2, 32.5, 31.3,
31.7, 30.8, 30.1, 32.3, 31.6, 30.9, 30.7, 31.5, 31.0, 32.0)
concrete_type_B <- c(28.3, 29.1, 28.5, 29.9, 30.5, 29.0, 28.9, 29.8, 30.2, 29.4,
28.7, 29.3, 29.7, 29.5, 30.1, 28.8, 29.6, 29.2, 30.4, 28.6)
# Perform two-sample t-test
t_test_result <- t.test(concrete_type_A, concrete_type_B)
Example 2: Quality Control in Manufacturing
A manufacturing company produces components for medical devices. Before implementing a new machining process, the company collects measurements of a critical dimension. After implementing the process, they again measure the dimension to ensure that the process doesn’t negatively impact the component’s size.
# Data for before and after the manufacturing process
before_process <- c(18.2, 18.3, 18.4, 18.5, 18.4, 18.2, 18.3, 18.6, 18.5, 18.4,
18.2, 18.4, 18.5, 18.6, 18.3, 18.4, 18.5, 18.2, 18.4, 18.6)
after_process <- c(18.3, 18.4, 18.5, 18.6, 18.4, 18.2, 18.4, 18.5, 18.6, 18.3,
18.2, 18.4, 18.5, 18.4, 18.3, 18.6, 18.5, 18.2, 18.4, 18.6)
# Perform two-sample t-test
t_test_result <- t.test(before_process, after_process)
Example 3: Energy Efficiency Analysis
In the automotive industry, a car manufacturer is testing the energy efficiency of two engine types: a traditional combustion engine and a new hybrid engine. The company records fuel consumption data for both engine types while running under identical conditions. The objective is to determine if the hybrid engine is significantly more fuel-efficient.
# Data for energy efficiency analysis
traditional_method <- c(8.2, 8.5, 8.4, 8.7, 8.6, 8.3, 8.2, 8.4, 8.6, 8.5,
8.4, 8.7, 8.8, 8.4, 8.5, 8.6, 8.3, 8.2, 8.6, 8.4)
new_technology <- c(6.9, 7.1, 7.0, 7.2, 7.3, 6.8, 7.1, 6.9, 7.2, 7.0,
7.1, 7.0, 6.8, 7.3, 7.2, 6.9, 7.0, 6.8, 7.1, 7.3)
# Perform two-sample t-test
t_test_result <- t.test(traditional_method, new_technology)
Example 4: Product Performance Testing
A consumer electronics company is developing two models of smartphones. Each model uses a different type of battery chemistry. The company tests the battery life of both models by continuously using the devices until the batteries are drained. They want to know if there’s a significant difference in battery life between the two models.
# Data for battery life comparison
battery_chemistry_A <- c(14.5, 15.0, 14.7, 15.2, 14.9, 15.1, 14.8, 15.0, 14.6, 15.3,
14.8, 15.2, 15.1, 14.9, 15.0, 14.7, 15.2, 14.5, 15.1, 14.9)
battery_chemistry_B <- c(13.5, 14.0, 13.7, 14.2, 13.9, 14.1, 13.8, 14.0, 13.6, 14.3,
13.8, 14.2, 14.1, 13.9, 14.0, 13.7, 14.2, 13.5, 14.1, 13.9)
# Perform two-sample t-test
t_test_result <- t.test(battery_chemistry_A, battery_chemistry_B)
Example 5: Environmental Impact Assessment
A chemical plant implements a new air pollution control system to reduce emissions of a harmful pollutant. Before and after the implementation, the plant measures the pollutant concentration in the surrounding air. The goal is to assess whether the new control system has significantly reduced the pollutant levels and improved air quality in the area.
# Data for pollution levels
before_intervention <- c(45, 50, 52, 47, 48, 50, 46, 48, 51, 49,
47, 50, 49, 48, 46, 45, 47, 50, 52, 49)
after_intervention <- c(35, 40, 42, 38, 39, 41, 37, 39, 43, 40,
38, 40, 39, 38, 37, 36, 38, 40, 42, 41)
# Perform two-sample t-test
t_test_result <- t.test(before_intervention, after_intervention)
Skip
Consider two situations:
Equal variance
Unequal variance
If observations are available, you can use the following R code to do analysis:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
Example.
The overall distance traveled by a golf ball is tested by hitting the ball with Iron Byron, a mechanical golfer with a swing that is said to emulate the legendary champion, Byron Nelson. Ten randomly selected balls of two brands are tested and the overall distance measured. Assume that the variances are equal. The data follow:
Brand 1: 287 277 287 271 283 271 279 275 263 267
Brand 2: 259 248 260 265 273 281 271 270 263 268
Calculate a 95% two-sided confidence interval on the difference in mean overall distance. Round your answer to one decimal place (e.g. 98.7).
Solution.
R code:
x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y, var.equal = TRUE)
The 95 percent confidence interval for the difference in means is approximately \((2.1, 18.3)\).
Example.
In the previous example, if we do not make the assumption of equal variance, calculate a 95% two-sided confidence interval on the difference in mean overall distance. Round your answer to one decimal place (e.g. 98.7).
Solution.
R code:
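x = c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
y = c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
t.test(x, y, var.equal = FALSE)
The 95 percent confidence interval for the difference in means is approximately \((2.1, 18.3)\). Because the two samples have the same size, this interval is almost identical to the one obtained under the equal variance assumption; only the degrees of freedom change slightly.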
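To see what `t.test()` does in the two situations, the intervals can also be computed from the standard pooled and Welch formulas. This is a sketch only; the names `brand1`, `brand2`, `ci_pooled`, and `ci_welch` are chosen here for illustration and are not part of the original solution.

```r
brand1 <- c(287, 277, 287, 271, 283, 271, 279, 275, 263, 267)
brand2 <- c(259, 248, 260, 265, 273, 281, 271, 270, 263, 268)
n1 <- length(brand1)
n2 <- length(brand2)
diff_means <- mean(brand1) - mean(brand2)

# Equal variances: pooled variance with n1 + n2 - 2 degrees of freedom
sp2       <- ((n1 - 1) * var(brand1) + (n2 - 1) * var(brand2)) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
ci_pooled <- diff_means + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se_pooled

# Unequal variances: Welch standard error and approximate degrees of freedom
v1 <- var(brand1) / n1
v2 <- var(brand2) / n2
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
ci_welch <- diff_means + c(-1, 1) * qt(0.975, df_welch) * sqrt(v1 + v2)

round(ci_pooled, 1)
round(ci_welch, 1)

# The same intervals from t.test()
t.test(brand1, brand2, var.equal = TRUE)$conf.int
t.test(brand1, brand2, var.equal = FALSE)$conf.int
```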
More Examples
An article in Nature (2003, Vol. 48, p. 1013) described an experiment in which subjects consumed different types of chocolate to determine the effect of eating chocolate on a measure of cardiovascular health. We consider the results for only dark chocolate and milk chocolate. In the experiment, 12 subjects consumed 100 grams of dark chocolate and 200 grams of milk chocolate, one type of chocolate per day, and after one hour, the total antioxidant capacity of their blood plasma was measured in an assay. The subjects consisted of seven women and five men with an average age range of 32.2 ± 1 years, an average weight of 65.8 ± 3.1 kg, and average body mass index of 21.9 ± 0.4 kg/m2. Data similar to that reported in the article follows.
Dark Chocolate: 118.8, 122.6, 115.6, 113.6, 119.5, 115.9, 115.8, 115.1, 116.9, 115.4, 115.6, 107.9
Milk Chocolate: 102.1, 105.8, 99.6, 102.7, 98.8, 100.9, 102.8, 98.7, 94.7, 97.8, 99.7, 98.6
Is there evidence to support the claim that consuming dark chocolate produces a higher mean level of total blood plasma antioxidant capacity than consuming milk chocolate? Use a 0.05 significance level.
Solution.
Let \(\mu_1\) be the mean blood plasma antioxidant capacity resulting from eating dark chocolate and \(\mu_2\) be the mean blood plasma antioxidant capacity resulting from eating milk chocolate.
There is no assumption about population variances. The sample variances are 12.48447 and 8.35061. The number of degrees of freedom is \(\nu=21.167\) and is rounded down to 21.
The parameter of interest: \(\mu_1-\mu_2\)
Null hypothesis is \(H_0:\mu_1=\mu_2\)
Alternative hypothesis is \(H_a:\mu_1>\mu_2\)
Test statistic: 12.05
\(p\)-value: less than 0.001 (right-tail area under the t curve beyond 12.05)
Decision & conclusion:
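Since the \(p\)-value is less than the significance level 0.05, we reject the null hypothesis and conclude that consuming dark chocolate produces a higher mean level of total blood plasma antioxidant capacity than consuming milk chocolate. If we use the critical value method, the critical value would be \(t_{0.05, 21}=1.72\). Since the critical region is \([1.72, \infty)\) and the value of the test statistic (12.05) falls in it, the null hypothesis is again rejected.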
R code for \(p\)-value:
1-pt(12.05, 21)
R code for the critical value:
qt(0.95, 21)
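The test statistic and degrees of freedom quoted in the solution can be reproduced from the data. The sketch below applies the usual Welch formulas and then checks them against `t.test()`; the variable names (`dark`, `milk`, `t_stat`, `df_welch`) are chosen here for illustration.

```r
dark <- c(118.8, 122.6, 115.6, 113.6, 119.5, 115.9, 115.8, 115.1, 116.9, 115.4, 115.6, 107.9)
milk <- c(102.1, 105.8, 99.6, 102.7, 98.8, 100.9, 102.8, 98.7, 94.7, 97.8, 99.7, 98.6)

# Estimated variance of each sample mean
v1 <- var(dark) / length(dark)
v2 <- var(milk) / length(milk)

# Welch test statistic and approximate degrees of freedom
t_stat   <- (mean(dark) - mean(milk)) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (length(dark) - 1) + v2^2 / (length(milk) - 1))

# Right-tail p-value for H_a: mu_1 > mu_2
p_value <- pt(t_stat, df_welch, lower.tail = FALSE)

# The same test in one call (R keeps the unrounded degrees of freedom)
welch <- t.test(dark, milk, alternative = "greater")
welch
```

Note that `t.test()` uses the fractional degrees of freedom directly, whereas the hand solution rounds down to 21; the conclusion is the same either way.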
Skip
Skip
Skip
Skip
Case Study: Shear Strength of Steel Girder
An article in the Journal of Strain Analysis for Engineering Design [“Model Studies on Plate Girders” (1983, Vol. 18(2), pp. 111–117)] reports a comparison of several methods for predicting the shear strength for steel plate girders. Data for two of these methods, the Karlsruhe and Lehigh procedures, when applied to nine specific girders, are shown in Table 10.3. We wish to determine whether there is any difference (on the average) for the two methods.
For such a study, the two methods are applied to the same girders, so the observations are paired. The analysis uses the one-sample \(t\) method (with an unknown variance) based on the differences for testing hypotheses or constructing confidence intervals about the mean difference denoted by \(\mu_D=\mu_1-\mu_2\).
Specifically, the null hypothesis looks like \(H_0:\mu_D=0\) and the alternative hypothesis looks like one of the following three:
\[H_a: \mu_D<0, ~~ \text{or}~~ H_a: \mu_D>0, ~~ \text{or} ~~H_a: \mu_D\ne0\]
The test statistic is
\[T=\frac{\bar{d}-0}{s_d/\sqrt{n}}\]
where \(\bar{d}\) is the mean of the differences and \(s_d\) is the corresponding standard deviation. Under the null hypothesis, the test statistic has a \(t\)-distribution with \(n-1\) degrees of freedom.
When doing the calculation in R, there is no need to follow the above procedure by hand. Instead, we can use the following R code:
t.test(x, y, alternative = c("two.sided", "less", "greater"), paired = TRUE, conf.level = 0.95)
where \(x\) and \(y\) are the original samples.
If you do want to use the differences \(d = x - y\), you can also do
t.test(d, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
The confidence interval formula for \(\mu_D\) is the same as the one-sample \(t\) confidence interval with unknown variance. That is,
\[\bar{d}\pm t_{\alpha/2, n-1}\cdot\frac{s_d}{\sqrt{n}}\]
Let’s demonstrate using the above case study.
The null hypothesis is \(H_0:\mu_D = 0\) and the alternative hypothesis is \(H_a:\mu_D \ne 0\).
x = c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y = c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)
t.test(x, y, paired = TRUE, conf.level = 0.95)
The \(t\)-statistic is 6.08. The \(p\)-value is 0.0003. The critical region looks like \((-\infty, -c)\cup (c, \infty)\) with critical values \(c=t_{\alpha/2, n-1}=t_{0.025, 9-1}=2.31\) and \(-c = -2.31\). Since the test statistic falls in the critical region (equivalently, the \(p\)-value is less than 0.05), we reject the null hypothesis and conclude that the two methods give different results on average.
The 95% confidence interval for \(\mu_D\) is \((0.1700, 0.3777)\).
To get a 95% lower confidence bound for \(\mu_D\), use R code:
t.test(x, y, paired = TRUE, conf.level = 0.95, alternative = "greater")
The lower bound is 0.1901.
To get a 95% upper confidence bound for \(\mu_D\), use R code:
t.test(x, y, paired = TRUE, conf.level = 0.95, alternative = "less")
The upper bound is 0.3576.
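As a check on the paired procedure, the statistic, critical value, and two-sided interval can be computed directly from the differences with the formulas given earlier. This is a sketch; the names `d`, `se_d`, `t_stat`, `t_crit`, and `ci` are introduced here for illustration.

```r
x <- c(1.186, 1.151, 1.322, 1.339, 1.200, 1.402, 1.365, 1.537, 1.559)
y <- c(1.061, 0.992, 1.063, 1.062, 1.065, 1.178, 1.037, 1.086, 1.052)

d    <- x - y                  # paired differences
n    <- length(d)
se_d <- sd(d) / sqrt(n)        # standard error of the mean difference

t_stat  <- (mean(d) - 0) / se_d
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
t_crit  <- qt(0.975, df = n - 1)           # two-sided critical value
ci      <- mean(d) + c(-1, 1) * t_crit * se_d

round(c(t = t_stat, p = p_value, critical = t_crit), 4)
round(ci, 4)

# The paired test and the one-sample test on d give the same answer
paired     <- t.test(x, y, paired = TRUE)
one_sample <- t.test(d)
```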
Skip
Suppose we have two discrete populations, each having the same interesting class/category with proportions \(p_1\) and \(p_2\), respectively. Suppose that two independent random samples of sizes \(n_1\) and \(n_2\) are taken from two populations, and let \(X_1\) and \(X_2\) represent the number of observations that belong to the class of interest in samples 1 and 2, respectively.
We are interested in testing the hypotheses
\[H_0: p_1=p_2\] against one of the following: \[H_a: p_1<p_2, ~~H_a: p_1>p_2, ~~\text{or}~~H_a: p_1\ne p_2\]
The test statistic is:
\[Z_0=\frac{(\hat{P}_1-\hat{P}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\] where \(\hat{p}=\frac{X_1+X_2}{n_1+n_2}\) is called the pooled sample proportion.
The test for two proportions is carried out in R as follows.
R code:
prop.test(x = c(x1, x2), n = c(n1, n2), alternative = c("two.sided", "less", "greater"))
where \(n_1, n_2\) are sample sizes, and \(x_1, x_2\) are numbers of successes.
Example.
Two different types of injection-molding machines are used to form plastic parts. A part is considered defective if it has excessive shrinkage or is discolored. Two random samples, each of size 300, are selected, and 15 defective parts are found in the sample from machine 1, while 10 defective parts are found in the sample from machine 2. Is it reasonable to conclude that both machines produce the same proportion of defective parts, using the 0.05 significance level? Answer by finding the P-value for the test. Round your answer to 3 decimal places.
Solution.
The null and alternative hypotheses are:
\[H_0: p_1=p_2 ~~ vs ~~ H_a:p_1\ne p_2\]
R code:
x = c(15, 10)
n = c(300, 300)
prop.test(x, n, alternative = "two.sided", correct = FALSE)
The \(p\)-value is 0.307, so we fail to reject the null hypothesis: the data are consistent with both machines producing the same proportion of defective parts.
Note that the test statistic that R reports is \(\chi^2 = 1.0435\), which is just the square of the statistic \(z_0 = 1.0215\) computed by hand.
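The relationship between \(z_0\) and the \(\chi^2\) value reported by R can be verified directly. The following sketch computes the pooled \(z\) statistic from the formula above and compares it with `prop.test()`; the scalar names (`x1`, `n1`, `p_pooled`, and so on) are chosen here for illustration.

```r
x1 <- 15; n1 <- 300   # defectives and sample size, machine 1
x2 <- 10; n2 <- 300   # defectives and sample size, machine 2

p1_hat   <- x1 / n1
p2_hat   <- x2 / n2
p_pooled <- (x1 + x2) / (n1 + n2)

# Pooled z statistic and two-sided p-value
z0      <- (p1_hat - p2_hat) / sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
p_value <- 2 * pnorm(abs(z0), lower.tail = FALSE)

res <- prop.test(c(x1, x2), c(n1, n2), alternative = "two.sided", correct = FALSE)

round(c(z0 = z0, z0_squared = z0^2, chi_squared = unname(res$statistic)), 4)
round(c(by_hand = p_value, prop_test = res$p.value), 3)
```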
Skip
The confidence interval on the difference in population proportions is given below:
\[(\hat{p}_1-\hat{p}_2)\pm z_{\alpha/2}\cdot \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]
R code:
prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.95, correct = FALSE)
Always set “correct = FALSE” so that R does not apply a continuity correction and the interval matches the formula above.
Example.
Two different types of injection-molding machines are used to form plastic parts. A part is considered defective if it has excessive shrinkage or is discolored. Two random samples, each of size 300, are selected, and 15 defective parts are found in the sample from machine 1, while 10 defective parts are found in the sample from machine 2. Construct a 95% confidence interval for the difference in proportions between machine 1 and machine 2.
Solution.
To construct the confidence interval for the difference in proportions, we can use the formula:
\(\text{Confidence interval} = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
where:
\(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions of defective parts for machines 1 and 2, respectively,
\(n_1\) and \(n_2\) are the sample sizes for machines 1 and 2, respectively, and
\(z_{\alpha/2}\) is the critical value for a 95% confidence interval, which is 1.96.
Using the given information, we have:
\(\hat{p}_1 = \frac{15}{300} = 0.05\)
\(\hat{p}_2 = \frac{10}{300} = 0.0333\)
\(n_1 = n_2 = 300\)
Plugging in the values, we get:
\(\text{Confidence interval} = (0.05 - 0.0333) \pm 1.96 \sqrt{\frac{0.05(1-0.05)}{300} + \frac{0.0333(1-0.0333)}{300}}\)
Simplifying, we get:
\(\text{Confidence interval} = 0.0167 \pm 0.0320\)
Therefore, the 95% confidence interval for the difference in proportions between machine 1 and machine 2 is (0.0167 - 0.0320, 0.0167 + 0.0320), or approximately (-0.0153, 0.0486). We can interpret this as follows: we are 95% confident that the true difference in proportions of defective parts between the two machines is between -0.0153 and 0.0486. Since the interval contains zero, we cannot conclude that there is a significant difference in the proportions of defective parts between the two machines at the 0.05 significance level.
R code:
x = c(15, 10)
n = c(300, 300)
prop.test(x, n, conf.level = 0.95, correct = FALSE)
A 95% confidence interval for \(p_1-p_2\) is \((-0.01528380, 0.04861713)\).
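The hand calculation and the R output can be compared side by side. A minimal sketch, assuming the unpooled standard error from the confidence interval formula above (the names `se_diff`, `margin`, and `ci` are introduced here for illustration):

```r
x <- c(15, 10)      # defective parts: machine 1, machine 2
n <- c(300, 300)    # sample sizes
p_hat <- x / n

# Unpooled standard error of p1_hat - p2_hat
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))
margin  <- qnorm(0.975) * se_diff
ci      <- (p_hat[1] - p_hat[2]) + c(-1, 1) * margin

round(margin, 4)
round(ci, 4)

# prop.test() without continuity correction uses the same formula
res <- prop.test(x, n, conf.level = 0.95, correct = FALSE)
res$conf.int
```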
Example.
An article in Knee Surgery, Sports Traumatology, Arthroscopy (2005, Vol. 13, pp. 273-279), considered arthroscopic meniscal repair with an absorbable screw. Results showed that for tears greater than 25 millimeters, 14 of 19 repairs were successful while for shorter tears, 22 of 29 repairs were successful.
(a) With \(\alpha=0.05\), is there evidence that the success rate is greater for longer tears? What is the \(P\)-value?
(b) Calculate a one-sided 95% confidence bound on the difference in proportions that can be used to answer the question in part (a). The bound should be a lower bound.
Solution.
\[H_0: p_1 = p_2 ~~vs ~~H_a: p_1 > p_2\]
where \(p_1\) is the proportion of successful repairs for tears > 25 mm and \(p_2\) is the proportion of successful repairs for tears ≤ 25 mm.
The test statistic is:
\(z = \frac{(\hat{p}_1 - \hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\)
where \(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions of successful repairs, \(n_1\) and \(n_2\) are the sample sizes, and \(\hat{p}\) is the pooled sample proportion:
\(\hat{p} = \frac{x_1+x_2}{n_1+n_2}\)
where \(x_1\) and \(x_2\) are the number of successful repairs in each sample.
Plugging in the values from the problem, we get:
\(\hat{p}_1 = \frac{14}{19} \approx 0.737, \hat{p}_2 = \frac{22}{29} \approx 0.759, n_1 = 19, n_2 = 29, \hat{p} = \frac{14+22}{19+29} =0.75\)
The test statistic is:
\[z = \frac{0.737 - 0.759}{\sqrt{0.75(1-0.75)(\frac{1}{19}+\frac{1}{29})}} \approx -0.17\]
Using a standard normal distribution table, the p-value for this test is \(p \approx 0.567\).
Since the p-value is greater than \(\alpha=0.05\), we fail to reject the null hypothesis. There is not enough evidence to conclude that the success rate is greater for longer tears at the 5% significance level.
For part (b), the one-sided lower confidence bound on \(p_1 - p_2\) is
\(\text{Lower bound} = \hat{p_1} - \hat{p_2} - z_{0.05} \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}\)
where \(\hat{p_1}\) and \(\hat{p_2}\) are the sample proportions, \(n_1\) and \(n_2\) are the sample sizes, and \(z_{0.05}\) is the 95th percentile of the standard normal distribution, which is 1.645.
For tears greater than 25 millimeters, \(\hat{p_1} = \frac{14}{19} \approx 0.737\) and \(n_1 = 19\). For tears shorter than 25 millimeters, \(\hat{p_2} = \frac{22}{29} \approx 0.759\) and \(n_2 = 29\).
Substituting these values into the formula, we get:
\(\text{Lower bound} = 0.737 - 0.759 - 1.645 \sqrt{\frac{0.737(1-0.737)}{19} + \frac{0.759(1-0.759)}{29}} \approx -0.233\)
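Therefore, the one-sided 95% lower confidence bound on the difference in proportions \(p_1 - p_2\) is -0.233. This means we can be 95% confident that \(p_1 - p_2\) is at least -0.233; that is, the success rate for tears greater than 25 millimeters is at most 23.3 percentage points lower than the success rate for shorter tears. Since the bound is below zero, we cannot conclude that the success rate is greater for longer tears, which agrees with the test result.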
R code:
x = c(14, 22)
n = c(19, 29)
prop.test(x, n, alternative = "greater", correct = FALSE)
The \(p\)-value is 0.5677, so we fail to reject the null hypothesis.
The 95 percent lower confidence bound is -0.2331912.
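For the one-sided case, the only changes to the hand calculation are that the \(p\)-value is a single right-tail area and the bound uses \(z_{0.05}\) instead of \(z_{0.025}\). A short sketch (variable names are chosen here for illustration):

```r
x <- c(14, 22)   # successful repairs: longer tears, shorter tears
n <- c(19, 29)   # number of repairs in each group
p_hat    <- x / n
p_pooled <- sum(x) / sum(n)

# Pooled z statistic and right-tail p-value for H_a: p1 > p2
z       <- (p_hat[1] - p_hat[2]) / sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
p_value <- pnorm(z, lower.tail = FALSE)

# One-sided 95% lower bound with the unpooled standard error
lower <- (p_hat[1] - p_hat[2]) - qnorm(0.95) * sqrt(sum(p_hat * (1 - p_hat) / n))

# R may warn that the chi-squared approximation may be incorrect,
# because the samples here are small
res <- prop.test(x, n, alternative = "greater", correct = FALSE)

round(c(z = z, p_value = p_value, lower_bound = lower), 4)
```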
Skip
Simple Linear Regression and Correlation are two closely related concepts used in statistics to examine the relationship between two continuous variables. They help us understand how changes in one variable are associated with changes in another and allow us to make predictions and draw inferences about their relationship. Let’s explore each concept:
Simple Linear Regression:
Simple Linear Regression is a statistical method that models the relationship between two continuous variables by fitting a linear equation to the data. The goal is to find the best-fitting line that minimizes the distance between the observed data points and the predicted values on the line. The linear equation for simple linear regression is of the form: \[y = \beta_0 + \beta_1 x\]
where:
\(y\) is the dependent variable (also called the response or outcome variable),
\(x\) is the independent variable (also called the predictor or explanatory variable), and
\(\beta_0\) and \(\beta_1\) are the regression coefficients, representing the intercept and slope of the line, respectively.
The coefficients \(\beta_0\) and \(\beta_1\) are estimated from the data using methods such as the least squares method, which aims to minimize the sum of squared differences between the observed and predicted values.
Simple linear regression allows us to make predictions about the value of the dependent variable (\(y\)) based on the value of the independent variable (\(x\)). It also provides insights into the strength and direction of the relationship between the two variables.
Correlation:
Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. It assesses how closely the data points tend to cluster around a straight line, indicating the degree of association between the variables.
The most commonly used measure of correlation is the Pearson correlation coefficient (\(r\)), which ranges from -1 to +1:
\(r = +1\) indicates a perfect positive linear relationship.
\(r = -1\) indicates a perfect negative (inverse) linear relationship.
\(r = 0\) indicates no linear relationship (variables are not linearly correlated).
The Pearson correlation coefficient is calculated using the formula:
\[r = \frac{cov(x,y)}{s_x s_y}\]
where \((x_i, y_i)\) are the individual data points,
\[cov(x,y) = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i-\bar{y})\]
is the covariance between \(x\) and \(y\), and \(s_x\) and \(s_y\) are the standard deviations of the \(x\) values and \(y\) values, respectively.
Correlation measures the strength of the linear relationship between two quantitative variables but does not provide information about causation. A high correlation coefficient does not imply causation; it merely indicates a strong association between the variables.
In summary, simple linear regression and correlation are valuable tools for understanding and quantifying the relationship between two continuous variables. Simple linear regression allows us to model and predict one variable based on another, while correlation measures the strength and direction of the linear association between the variables. Both concepts are widely used in various fields, including data analysis, scientific research, and engineering.
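The correlation formula can be checked against R's built-in `cor()` function. The sketch below uses the `cars` data set that ships with R (speed and stopping distance); this data set is not part of these notes and is used here only as a convenient illustration.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Sample covariance and Pearson correlation from the formulas above
cov_xy   <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r_manual <- cov_xy / (sd(x) * sd(y))

# Built-in equivalents
r_builtin <- cor(x, y)
c(manual = r_manual, builtin = r_builtin)

# A scatter plot shows whether a straight line is a sensible description
plot(x, y, xlab = "Speed", ylab = "Stopping distance")
```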
Hypothesis Tests in Simple Linear Regression:
In simple linear regression, hypothesis tests are used to make inferences about the regression coefficients and assess the significance of the relationship between the dependent variable and the independent variable. The two main hypotheses tested in simple linear regression are related to the slope (\(\beta_1\)) of the regression line. The hypothesis tests are based on the underlying assumptions of the regression model, such as the normality of errors and constant variance.
Let’s go through the key hypotheses and the corresponding hypothesis tests in simple linear regression.
Null Hypothesis (\(H_0\)):
The null hypothesis in simple linear regression states that there is no significant linear relationship between the independent variable (\(x\)) and the dependent variable (\(y\)). In mathematical terms, it is expressed as:
\[H_0: \beta_1 = 0\]
This implies that the slope of the regression line is zero, indicating no association between \(x\) and \(y\).
Alternative Hypothesis (\(H_1\) or \(H_a\)):
The alternative hypothesis in simple linear regression states that there is a significant linear relationship between the independent variable (\(x\)) and the dependent variable (\(y\)). In mathematical terms, it is expressed as: \[H_1: \beta_1 \ne 0\]
This implies that the slope of the regression line is not zero, indicating a non-zero association between \(x\) and \(y\).
Hypothesis Tests:
The most common hypothesis test used in simple linear regression is the t-test. The t-test assesses whether the estimated slope coefficient (\(\hat{\beta}_1\)) is significantly different from zero.
The t-statistic for testing the null hypothesis is calculated as:
\[t = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}\]
where:
\(\hat{\beta}_1\) is the estimated slope coefficient obtained from the regression analysis and \(se(\hat{\beta}_1)\) is the standard error of the slope coefficient, which estimates the variability of \(\hat{\beta}_1\).
The t-statistic has a t-distribution with \((n-2)\) degrees of freedom (\(n\) is the sample size). The \(p\)-value is twice the tail area under the t-distribution curve beyond the absolute value of the t statistic. If the \(p\)-value is no greater than the significance level, we reject the null hypothesis in favor of the alternative hypothesis. This indicates that there is a significant linear relationship between \(x\) and \(y\) at the chosen significance level.
Additionally, the t-test can be used to calculate a confidence interval for the slope coefficient. The confidence interval provides a range of values within which the true population slope is likely to lie with a certain level of confidence.
It is important to note that these hypothesis tests assume that the residuals (errors) in the regression model are normally distributed and have constant variance. Violations of these assumptions may impact the validity of the tests, and alternative approaches may be needed.
In summary, hypothesis tests in simple linear regression help determine whether the relationship between the dependent variable and the independent variable is statistically significant. They are essential in interpreting the results of the regression analysis and making informed conclusions about the data.
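To make the slope test concrete, the sketch below computes the least squares estimates, the standard error of the slope, the \(t\)-statistic, the two-sided \(p\)-value, and a 95% confidence interval by hand, then compares them with `lm()`. It again uses R's built-in `cars` data as an illustrative assumption; the formulas \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and \(se(\hat{\beta}_1) = s/\sqrt{S_{xx}}\) are the standard ones and are not spelled out in the notes above.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Least squares estimates
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1  <- Sxy / Sxx
b0  <- mean(y) - b1 * mean(x)

# Residual standard error (n - 2 degrees of freedom) and se of the slope
residuals_hand <- y - (b0 + b1 * x)
s     <- sqrt(sum(residuals_hand^2) / (n - 2))
se_b1 <- s / sqrt(Sxx)

# t-test of H0: beta_1 = 0 and a 95% confidence interval for the slope
t_stat  <- (b1 - 0) / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
ci_b1   <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

# The same quantities from lm()
fit <- lm(dist ~ speed, data = cars)
coef(summary(fit))["speed", ]
confint(fit)["speed", ]
```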
The Adequacy of the Regression Model:
The model we introduced has assumptions (LINE), including linearity, independence of errors, normality of residuals, and equal variances of residuals (homoscedasticity). Violations of these assumptions may indicate inadequacy of the model.
Assessing the adequacy of a regression model is a crucial step in analyzing the model’s performance and making sure it provides meaningful and reliable results. Adequacy checks involve evaluating how well the model fits the data, identifying potential issues or violations of assumptions, and determining the overall quality of the model’s predictions. Several techniques can be used to assess the adequacy of a regression model:
Residual Analysis:
Residuals are the differences between the observed values and the predicted values from the regression model. Residual analysis involves examining the pattern of residuals to check for any systematic deviations from randomness. Ideally, residuals should be randomly distributed around zero, indicating that the model captures the underlying relationships in the data. Patterns in the residuals may indicate problems with the model, such as non-linearity, heteroscedasticity (varying spread of residuals), or outliers.
R-squared (Coefficient of Determination):
R-squared measures the proportion of the total variation in the dependent variable that is explained by the regression model. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data. However, high R-squared alone does not guarantee model adequacy, as it can increase even with the addition of irrelevant variables. Therefore, it is essential to interpret R-squared in conjunction with other model assessment techniques.
Adjusted R-squared:
The adjusted R-squared takes into account the number of predictor variables in the model, penalizing the addition of unnecessary variables. It provides a more conservative measure of model fit and is often preferred when comparing models with different numbers of predictors.
F-Test (Overall Significance Test):
The F-test is used to assess the overall significance of the regression model. It tests whether the explained variation in the dependent variable (sum of squares due to regression) is significantly larger than the unexplained variation (sum of squares due to residuals). A significant F-test suggests that the model as a whole is useful in explaining the variation in the dependent variable.
Outliers and Influential Points:
Identifying outliers and influential points is essential for understanding how individual data points affect the model. Outliers are extreme values that can disproportionately influence the model’s estimates, while influential points can greatly impact the model’s coefficients. Robust regression techniques may be used to mitigate the effect of outliers, and sensitivity analysis can be performed to assess the impact of influential points.
Cross-Validation:
Cross-validation techniques, such as k-fold cross-validation, can be used to assess the predictive performance of the model. By partitioning the data into training and validation sets, cross-validation helps estimate how well the model generalizes to new data.
Assessing the adequacy of a regression model is an iterative process, and it may involve model refinement, the inclusion of additional variables, or the consideration of alternative modeling approaches to improve the model’s performance and reliability. The ultimate goal is to build a model that accurately represents the data and provides meaningful insights for the specific research or decision-making context.
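Several of these checks take only a few lines of R. The following sketch draws two common residual plots and recomputes R-squared, adjusted R-squared, and the overall F-statistic from their definitions. It uses R's built-in `cars` data as an illustrative assumption, and `k` denotes the number of predictors (a name introduced here).

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residual analysis: look for curvature, funnel shapes, or outliers
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res)      # points near the line support the normality assumption
qqline(res)

# Sums of squares
sse <- sum(res^2)                                # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)      # total variation
n   <- nrow(cars)
k   <- 1                                         # number of predictors

r2     <- 1 - sse / sst
adj_r2 <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Overall F-test: explained versus unexplained mean squares
f_stat <- ((sst - sse) / k) / (sse / (n - k - 1))
f_p    <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat, f_p = f_p)
```

In simple linear regression the F-statistic equals the square of the slope's \(t\)-statistic, so the two tests always agree.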
Example 1: Engineering - Load vs. Deformation
In this example, we’ll analyze the relationship between the load applied to a material and the resulting deformation. We have data from 25 tests conducted on a specific material.
# Sample data
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
2.2, 2.1, 2.0, 1.9, 2.3)
df = data.frame(load, deformation)
# Perform simple linear regression
lm_model <- lm(deformation ~ load, data = df)
# Print the summary of the regression
summary(lm_model)
##
## Call:
## lm(formula = deformation ~ load, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.13077 -0.05839 -0.01877  0.05360  0.20778
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.20223    0.18761  -6.408 1.54e-06 ***
## load         0.19272    0.01239  15.560 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08878 on 23 degrees of freedom
## Multiple R-squared: 0.9132, Adjusted R-squared: 0.9095
## F-statistic: 242.1 on 1 and 23 DF, p-value: 1.058e-13
Insights:
The regression summary provides information about the estimated intercept and slope coefficients. The “Estimate” for the slope coefficient indicates how much deformation changes for a unit change in load. Here, the “Estimate” (0.19272) for the slope coefficient indicates deformation increases by about 0.19 for a unit increase in load.
The “Residual standard error” estimates the standard deviation of the errors, that is, the typical size of the difference between observed and predicted values. Here, the “Residual standard error” (0.08878) indicates that observed deformation values typically differ from the fitted line by about 0.09.
The “R-squared” value measures the proportion of variability in the dependent variable explained by the independent variable. Here, the R-squared value (0.9132) indicates that 91.32% of total variation in deformation is explained (or accounted for) by load.
The p-value associated with the slope coefficient tests if the relationship is statistically significant. Here, the p-value associated with the slope coefficient (1.06e-13, essentially 0) indicates that the relationship is statistically significant at any reasonable significance level (say at level 0.01).
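Beyond `summary()`, the fitted model can be used to obtain a confidence interval for the slope and a prediction at a new load. A minimal sketch with the data from this example; the load value 15 is a hypothetical input chosen inside the observed range, and the names `slope_ci` and `pred` are introduced here for illustration.

```r
load <- c(12.5, 14.2, 13.8, 12.9, 15.7, 14.0, 13.2, 14.5, 12.8, 13.6,
          16.4, 15.2, 14.8, 16.0, 15.5, 14.7, 15.9, 16.6, 15.3, 14.6,
          17.2, 16.8, 17.0, 16.3, 17.5)
deformation <- c(1.2, 1.5, 1.4, 1.3, 1.7, 1.6, 1.5, 1.8, 1.2, 1.4,
                 1.9, 1.7, 1.6, 1.8, 1.7, 1.5, 1.9, 2.0, 1.8, 1.6,
                 2.2, 2.1, 2.0, 1.9, 2.3)
df <- data.frame(load, deformation)
lm_model <- lm(deformation ~ load, data = df)

# 95% confidence interval for the slope
slope_ci <- confint(lm_model, "load", level = 0.95)
slope_ci

# Estimated mean deformation at load = 15 (hypothetical value),
# with a 95% confidence interval for that mean
pred <- predict(lm_model, newdata = data.frame(load = 15), interval = "confidence")
pred
```

Predictions are only trustworthy for loads within the range of the data (here 12.5 to 17.5).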
Example 2: Engineering - Speed vs. Fuel Efficiency
Consider a scenario where we’re analyzing the relationship between the speed of an engine and its fuel efficiency. We have data from 22 tests conducted on different engine configurations.
# Sample data
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
33.6, 34.3)
df = data.frame(speed, fuel_efficiency)
# Perform simple linear regression
lm_model <- lm(fuel_efficiency ~ speed, data = df)
# Print the summary of the regression
summary(lm_model)
##
## Call:
## lm(formula = fuel_efficiency ~ speed, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.7848 -0.5353  0.1559  0.3661  0.7997
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.260e+01  3.666e-01   34.36   <2e-16 ***
## speed       4.144e-03  1.008e-04   41.12   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5315 on 20 degrees of freedom
## Multiple R-squared: 0.9883, Adjusted R-squared: 0.9877
## F-statistic: 1691 on 1 and 20 DF, p-value: < 2.2e-16
Insights:
The “Estimate” (0.004144) for the slope coefficient indicates fuel efficiency increases by about 0.004144 for a unit increase in speed, or about 4.1 for an increase of 1000 in speed.
The “Residual standard error” (0.5315) estimates the typical size of the difference between observed and predicted values: observed fuel efficiency typically differs from the fitted line by about 0.53.
The R-squared value (0.9883) indicates that 98.83% of total variation in fuel efficiency is explained (or accounted for) by speed.
The p-value associated with the slope coefficient (<2e-16, essentially 0) indicates that the relationship is statistically significant at any reasonable significance level (say at level 0.01).
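Because speed is measured on a large scale, the slope per single unit is a very small number. One option, sketched below, is to refit the model with speed expressed in thousands, which multiplies the slope by 1000 without changing the fit. The name `fit_per_1000` is introduced here for illustration.

```r
speed <- c(1500, 1800, 2000, 2200, 2400, 2500, 2700, 2800, 3000, 3200,
           3400, 3500, 3700, 3800, 4000, 4200, 4400, 4600, 4800, 5000,
           5200, 5400)
fuel_efficiency <- c(18.2, 19.5, 20.1, 21.2, 22.0, 23.1, 24.5, 25.0, 25.8, 26.4,
                     27.0, 27.5, 28.3, 28.7, 29.5, 30.2, 31.0, 31.5, 32.2, 32.9,
                     33.6, 34.3)
df <- data.frame(speed, fuel_efficiency)

# Original model: slope is the change per 1 unit of speed
lm_model <- lm(fuel_efficiency ~ speed, data = df)

# Rescaled model: slope is the change per 1000 units of speed
fit_per_1000 <- lm(fuel_efficiency ~ I(speed / 1000), data = df)

coef(lm_model)
coef(fit_per_1000)

# R-squared, t-statistics, and p-values are unaffected by the rescaling
c(summary(lm_model)$r.squared, summary(fit_per_1000)$r.squared)
```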
Multiple Linear Regression is an extension of simple linear regression that allows for the analysis of the relationship between a dependent variable and multiple independent variables. It is a statistical technique used to model the linear relationship between the dependent variable and two or more predictor variables. Multiple linear regression is widely used in various fields, including statistics, economics, social sciences, and engineering, to analyze complex data and make predictions based on multiple factors.
The multiple linear regression model can be expressed as follows:
\[y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_rx_r+\epsilon\]
where:
\(y\) is the dependent variable (response or outcome variable),
\(x_1, x_2, \cdots, x_r\) are the independent variables (predictors or explanatory variables),
\(\beta_0, \beta_1, \cdots, \beta_r\) are the regression coefficients, where \(\beta_0\) is the intercept and \(\beta_1, \cdots, \beta_r\) are the slopes for the respective independent variables, and
\(\epsilon\) is the error term, representing the part of \(y\) not explained by the predictors. It is estimated by the residual, the difference between the observed \(y\) and the predicted \(y\) values.
The multiple linear regression model estimates the values of the regression coefficients based on the observed data using the method of least squares. The goal is to minimize the sum of squared differences between the observed and predicted values.
Key aspects of multiple linear regression:
Interpretation of Coefficients:
The regression coefficients represent the change in the dependent variable (\(y\)) associated with a one-unit change in each independent variable, assuming all other variables remain constant. Positive coefficients indicate a positive relationship with the dependent variable, while negative coefficients indicate a negative relationship.
Adjusted R-squared:
Similar to simple linear regression, multiple linear regression uses the R-squared statistic to measure the proportion of variance in the dependent variable explained by the model. The adjusted R-squared takes into account the number of predictors and provides a more accurate measure of the model’s goodness of fit when comparing models with different numbers of variables.
Assumptions:
Multiple linear regression relies on several assumptions (LINE), including linearity, independence of errors, constant variance of residuals (homoscedasticity), and normality of residuals. Violations of these assumptions can impact the validity and accuracy of the regression model.
Model Selection:
The process of selecting variables to include in the multiple linear regression model is an essential part of the analysis. Variables that are not relevant, or that are highly correlated with other predictors, may be excluded to avoid multicollinearity and improve the interpretability of the model.
Multiple linear regression is a powerful tool for analyzing the relationship between a dependent variable and multiple predictors. It enables researchers and analysts to explore complex data, identify significant predictors, make predictions, and gain valuable insights into the factors influencing the outcome of interest. However, careful attention to the assumptions and model diagnostics is necessary to ensure the validity and adequacy of the regression model.
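The method of least squares has a compact matrix form: with a design matrix \(X\) whose first column is all ones, the estimates solve the normal equations \((X^TX)\hat{\beta} = X^Ty\). The sketch below illustrates this with R's built-in `mtcars` data (fuel consumption modeled by weight and horsepower), which is an assumption made here for illustration and not a data set from these notes.

```r
y <- mtcars$mpg
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)   # design matrix

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat

# lm() returns the same estimates (it uses a numerically more stable method)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# R-squared and adjusted R-squared for the fitted model
summary(fit)$r.squared
summary(fit)$adj.r.squared
```

Each slope is interpreted with the other predictor held constant, for example the change in `mpg` per unit of `wt` at a fixed `hp`.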
A few very nice videos on multiple regression:
Short (6 minutes): https://www.youtube.com/watch?v=mno47Jn4gaU
Long (40 minutes): https://www.youtube.com/watch?v=eYTumjgE2IY
Long (45 minutes): https://www.youtube.com/watch?v=0m-rs2M7K-Y
Let’s consider a data example in engineering involving the relationship between the tensile strength of a metal and two factors: temperature and time of heat treatment. We’ll create a hypothetical dataset to illustrate the concept of multiple linear regression in engineering.
Suppose an engineer is studying the effect of temperature (in degrees Celsius) and time (in hours) of heat treatment on the tensile strength (in megapascals, MPa) of a metal sample. The engineer performs experiments at different combinations of temperature and time and records the tensile strength for each experiment. The data is as follows:
Temperature | Time | Tensile.Strength |
---|---|---|
200 | 1 | 150 |
250 | 2 | 180 |
300 | 2 | 210 |
250 | 1 | 160 |
300 | 3 | 230 |
200 | 2 | 170 |
280 | 2 | 190 |
270 | 3 | 220 |
220 | 1 | 155 |
290 | 2 | 200 |
Using multiple linear regression, the engineer can build a model to predict the tensile strength (y) based on temperature (x₁) and time (x₂) as predictors. The multiple linear regression model will have the form:
y = β₀ + β₁ * x₁ + β₂ * x₂ + ε
where:
y is the observed tensile strength (the dependent variable).
x₁ is the temperature (independent variable 1).
x₂ is the time (independent variable 2).
β₀, β₁, and β₂ are the regression coefficients.
ε is the error term.
The goal of multiple linear regression is to estimate the values of the regression coefficients (β₀, β₁, β₂) based on the observed data to build the best-fitting model.
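To make "estimating the coefficients" more concrete: least squares chooses the coefficients that minimize the sum of squared errors, which amounts to solving the normal equations (XᵀX)β = Xᵀy. The sketch below does this directly for the table above. The names `X`, `y`, and `beta_hat` are chosen here for illustration only.

```r
# Data from the table above
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)

# Design matrix: a column of 1s for the intercept, then the two predictors
X <- cbind(Intercept = 1, temperature, time)
y <- tensile_strength

# Solve the normal equations (X'X) beta = X'y for beta
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
print(beta_hat)

# lm() should give the same estimates up to numerical rounding
lm_coef <- coef(lm(tensile_strength ~ temperature + time))
print(lm_coef)
```

In practice `lm()` is preferable because it uses a numerically more stable method and also returns standard errors and diagnostics.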
To perform multiple linear regression in R, we can use the lm() function, which stands for “linear model.” The lm() function fits a linear regression model to the data and provides estimates for the regression coefficients, as well as various statistics and diagnostics to assess the model’s performance. Let’s enter the example data from the table above into R and perform multiple linear regression:
# Create the example data
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
# Combine the data into a data frame
data <- data.frame(temperature, time, tensile_strength)
# Perform multiple linear regression
model <- lm(tensile_strength ~ temperature + time, data = data)
# Print the summary of the regression model
summary(model)
##
## Call:
## lm(formula = tensile_strength ~ temperature + time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9653 -1.9750 0.8815 2.3290 6.5125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39499 11.55468 4.794 0.001980 **
## temperature 0.33044 0.05428 6.088 0.000497 ***
## time 24.47977 2.84277 8.611 5.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 7 degrees of freedom
## Multiple R-squared: 0.9754, Adjusted R-squared: 0.9684
## F-statistic: 138.7 on 2 and 7 DF, p-value: 2.337e-06
The output of summary(model) will display various statistics and information about the multiple linear regression model. It will include the estimated regression coefficients along with their standard errors, t-values, and p-values (the column named “Pr(>|t|)”). The p-values indicate the significance of each coefficient, with smaller p-values suggesting that the corresponding predictor variable has a significant impact on the dependent variable.
How can we interpret each of the coefficients? The coefficient 0.33044 means that for each one-degree Celsius increase in temperature, the tensile strength increases by about 0.33 MPa on average, holding time constant. The coefficient 24.47977 means that for each additional hour of heat treatment, the tensile strength increases by about 24.48 MPa on average, holding temperature constant. The small p-values (both below 0.001) indicate that both variables have a statistically significant effect on tensile strength.
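Point estimates are easier to interpret alongside interval estimates. A minimal sketch, assuming the same data and model as above, uses `confint()` from base R to obtain 95% confidence intervals for the coefficients:

```r
# Recreate the example data and model so this block runs on its own
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
data <- data.frame(temperature, time, tensile_strength)
model <- lm(tensile_strength ~ temperature + time, data = data)

# 95% confidence intervals for the intercept and both slopes
coef_intervals <- confint(model, level = 0.95)
print(coef_intervals)
```

An interval for a slope that excludes zero is consistent with a p-value below 0.05 for that coefficient.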
The output also includes the R-squared and adjusted R-squared values, which measure the goodness of fit of the model; values closer to 1 indicate a better fit. Here the R-squared of 0.9754 means the model explains about 97.5% of the variation in tensile strength in this sample. Additionally, the summary provides information about the residuals, which can be used to assess the assumptions of the model, such as the normality and constant variance of errors.
The multiple linear regression model can be used to make predictions for new data points using the predict() function:
# Predicting tensile strength for a new combination of temperature and time
new_data <- data.frame(temperature = 260, time = 2.5)
predicted_tensile_strength <- predict(model, newdata = new_data)
print(predicted_tensile_strength)
## 1
## 202.5096
This will provide the predicted tensile strength for the new combination of temperature = 260°C and time = 2.5 hours based on the multiple linear regression model.
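A single predicted value says nothing about its uncertainty. As a sketch under the same data and model, the `interval` argument of `predict()` gives either a confidence interval for the mean response or a wider prediction interval for one new specimen:

```r
# Recreate the example data and model so this block runs on its own
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
data <- data.frame(temperature, time, tensile_strength)
model <- lm(tensile_strength ~ temperature + time, data = data)

new_data <- data.frame(temperature = 260, time = 2.5)

# Interval for the MEAN tensile strength at 260 degrees C and 2.5 hours
ci_mean <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
print(ci_mean)

# Interval for the tensile strength of ONE new specimen at the same settings
pi_new <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
print(pi_new)
```

The prediction interval is always wider than the confidence interval because it also accounts for the variability of an individual observation around the mean.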
Keep in mind that in real-world applications, data analysis using multiple linear regression may involve additional steps, such as data exploration, diagnostics, and validation. It is essential to carefully interpret the results and assess the model’s adequacy and assumptions to ensure the reliability of the predictions.
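As one possible starting point for such diagnostics, the sketch below checks how strongly the two predictors are correlated and plots the residuals against the fitted values for the tensile-strength model. The variable names `r` and `vif_two_predictors` are chosen here for illustration.

```r
# Recreate the example data and model so this block runs on its own
temperature <- c(200, 250, 300, 250, 300, 200, 280, 270, 220, 290)
time <- c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2)
tensile_strength <- c(150, 180, 210, 160, 230, 170, 190, 220, 155, 200)
data <- data.frame(temperature, time, tensile_strength)
model <- lm(tensile_strength ~ temperature + time, data = data)

# Correlation between the predictors; values near +1 or -1 suggest multicollinearity
r <- cor(data$temperature, data$time)
print(r)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2)
vif_two_predictors <- 1 / (1 - r^2)
print(vif_two_predictors)

# Residuals vs. fitted values: look for curvature or a funnel shape
plot(fitted(model), resid(model),
     main = "Residuals vs. Fitted (regression model)",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
```

A common rule of thumb treats a variance inflation factor above 5 or 10 as a sign of problematic multicollinearity.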
Designing engineering experiments is a crucial process that involves planning, executing, and analyzing experiments to gather meaningful data and make informed decisions. Proper experimental design ensures that the collected data is reliable, relevant, and can lead to accurate conclusions. Here are the key steps and considerations in designing engineering experiments:
Define Objectives and Research Questions: Clearly articulate the objectives of the experiment and the specific research questions you want to address. The objectives will guide the entire experimental design process and help determine the appropriate variables to measure and control.
Identify Variables and Factors: Identify the key variables that may influence the outcome of the experiment. Variables can be classified into two types: a. Independent Variables (Factors): Variables that you intentionally manipulate or control in the experiment. b. Dependent Variables: Variables that you measure to observe the response or outcome.
Formulate Hypotheses: Based on your objectives and variables, develop hypotheses that state the expected relationships or differences between the independent and dependent variables. Hypotheses will guide the data analysis and interpretation.
Choose Experimental Design: Select the appropriate experimental design based on your research questions, resources, and constraints. Common types of experimental designs include: a. Completely Randomized Design: Randomly assign treatments to experimental units. b. Randomized Block Design: Group similar experimental units into blocks and randomize treatments within each block. c. Factorial Design: Investigate the effects of multiple independent variables simultaneously.
Determine Sample Size: Calculate the required sample size to achieve adequate statistical power and precision in your results. A larger sample size generally provides more reliable estimates.
Control Variables: Ensure that all extraneous factors that could influence the outcome are controlled or minimized. This may involve using control groups, blinding, or randomization.
Conduct the Experiment: Perform the experiment according to the experimental design, carefully following the procedures and recording the data accurately. Document any unexpected events or observations.
Analyze Data: Use appropriate statistical methods to analyze the data and test the hypotheses. This may involve regression analysis, ANOVA, t-tests, or other relevant techniques.
Interpret Results: Interpret the results of the data analysis in the context of your research questions and hypotheses. Draw conclusions based on the evidence provided by the data.
Draw Engineering Inferences: Apply the findings of the experiment to make engineering inferences and decisions. Determine how the results impact the engineering problem or system you are investigating.
Communicate Findings: Present the experimental design, results, and conclusions in a clear and concise manner. Clearly communicate any implications for future research or engineering applications.
By following these steps and considerations, engineers can design experiments that provide valuable insights, support decision-making, and advance the understanding of engineering systems and processes. Well-designed experiments are essential for making progress in engineering research and development.
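The sample-size step above can be sketched with `power.anova.test()` from base R. The planning values below (four groups, between-group variance 1, within-group variance 3, 80% power) are hypothetical and would normally come from pilot data or engineering judgment.

```r
# Hypothetical planning values (assumptions for illustration only):
# 4 treatment groups, variance of the group means = 1, within-group variance = 3
power_result <- power.anova.test(groups = 4,
                                 between.var = 1,
                                 within.var = 3,
                                 sig.level = 0.05,
                                 power = 0.80)
print(power_result)

# Round up to a whole number of experimental units per group
n_per_group <- ceiling(power_result$n)
print(n_per_group)
```

Leaving out `n` tells the function to solve for the sample size per group; supplying `n` and leaving out `power` would instead return the power of a given design.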
A Completely Randomized Single-Factor Experiment is a type of experimental design used to study the effect of a single independent variable (also known as a factor) on a dependent variable. In this design, the experimental units are randomly assigned to different treatment levels of the factor, and the response of each unit to the treatments is measured. The objective is to compare the mean responses of the different treatment groups to determine if there are significant differences between them.
Key features of a Completely Randomized Single-Factor Experiment:
One Independent Variable (Factor): The experiment involves only one independent variable (factor) that has two or more treatment levels. Each treatment level represents a specific condition or value of the factor being tested.
Randomization: The assignment of experimental units to different treatments is done randomly to ensure that any extraneous or unknown factors are evenly distributed among the treatment groups. This helps reduce bias and allows for valid statistical inference.
Control: The experiment is designed to control any potential confounding variables or sources of variation that could influence the results. By randomly assigning treatments, the experiment aims to create similar groups with comparable characteristics.
Replication: Each treatment level is applied to multiple experimental units (replicates) to account for natural variability and provide more precise estimates of treatment effects.
Statistical Analysis: The data collected from the experiment is analyzed using statistical methods, such as analysis of variance (ANOVA), to test for significant differences between treatment means.
Example of a Completely Randomized Single-Factor Experiment:
Let’s consider an example where an engineer wants to investigate the effect of different cooling times on the hardness of a metal alloy. The engineer selects a sample of the metal alloy and divides it into four groups:
Group 1: Cooling time of 1 hour.
Group 2: Cooling time of 2 hours.
Group 3: Cooling time of 3 hours.
Group 4: Cooling time of 4 hours.
Each group represents a treatment level of the factor “Cooling Time.” The engineer randomly assigns several metal specimens to each group. The hardness of each specimen is measured after the designated cooling time.
The data collected can be analyzed using ANOVA to test if there are significant differences in hardness among the different cooling times. If ANOVA reveals a significant effect, post-hoc tests (e.g., Tukey’s HSD) can be performed to identify specific pairs of cooling times that differ significantly in terms of hardness.
The results of the experiment will help the engineer understand how cooling time affects the hardness of the metal alloy and make informed decisions in industrial applications, such as selecting the optimal cooling time to achieve the desired hardness properties.
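The random assignment described above can be sketched with `sample()`. The setup is an assumption for illustration: 20 specimens, five per cooling time, and an arbitrary seed so the assignment can be reproduced.

```r
# Arbitrary seed so that the random assignment is reproducible
set.seed(123)

# 20 specimens, to be split evenly across the 4 cooling times (5 each)
specimen_id <- 1:20
cooling_time <- sample(rep(c(1, 2, 3, 4), each = 5))  # shuffle the treatment labels

assignment <- data.frame(Specimen = specimen_id, Cooling_Time = cooling_time)
print(assignment)

# Check that each cooling time received exactly 5 specimens
table(assignment$Cooling_Time)
```

Shuffling a balanced vector of treatment labels guarantees equal group sizes while leaving the allocation of each specimen to chance.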
Let’s create a data example for the Completely Randomized Single-Factor Experiment related to cooling times and the hardness of a metal alloy. In this example, we will investigate the effect of four different cooling times on the hardness of the metal alloy.
Assume we have the following data for the hardness of the metal alloy (measured in Vickers hardness units, HV) after cooling for different durations:
Cooling.Time | Hardness
---|---
1 | 300
2 | 350
3 | 380
4 | 400
2 | 340
3 | 370
1 | 290
4 | 410
3 | 375
2 | 335
1 | 295
3 | 385
4 | 395
2 | 345
1 | 305
4 | 420
3 | 380
2 | 330
4 | 415
1 | 310
In this example, the independent variable is the “Cooling Time” (in hours), and the dependent variable is the “Hardness” of the metal alloy after the specified cooling time.
Each row in the data represents one metal alloy specimen that underwent a specific cooling time. The experiment involves four different cooling times (1, 2, 3, and 4 hours), which serve as treatment levels of the factor “Cooling Time.” The metal alloy specimens were randomly assigned to each cooling time group to ensure a completely randomized experiment.
To analyze the data, we can use one-way ANOVA to test if there are significant differences in the mean hardness values among the different cooling times. If ANOVA indicates a significant effect, we can conduct post-hoc tests (e.g., Tukey’s HSD) to identify specific pairs of cooling times that result in significantly different hardness values.
The results of the experiment will help us understand how cooling time affects the hardness of the metal alloy. We can use this information to optimize the cooling process to achieve the desired hardness properties for specific engineering applications. For instance, we might find that longer cooling times lead to higher hardness values, which can be beneficial for applications requiring greater strength and wear resistance.
To analyze the data example of the Completely Randomized Single-Factor Experiment related to cooling times and the hardness of a metal alloy in R, we can perform a one-way analysis of variance (ANOVA). This will help us test if there are significant differences in the mean hardness values among the different cooling times. Additionally, we can conduct post-hoc tests (Tukey’s HSD) to identify specific pairs of cooling times that result in significantly different hardness values. Let’s go ahead and perform the analysis in R:
# Store the cooling times into an R object
cooling_times <- c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1)
# Cooling time is numeric, but for one-way ANOVA we treat each cooling time
# as a separate treatment group, so we convert "cooling_times" to a factor
# (R's type for categorical variables)
cooling_times <- as.factor(cooling_times)
# Store the hardness values into an R object
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335, 295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
# Form a data frame
myData <- data.frame(Cooling_Time = cooling_times,
Hardness_HV = hardness_values
)
# Perform one-way ANOVA
model <- aov(Hardness_HV ~ Cooling_Time, data = myData)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Cooling_Time 3 32895 10965 165.5 2.97e-12 ***
## Residuals 16 1060 66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The aov() function in R performs the one-way ANOVA. The output will show the ANOVA table with the F-statistic and p-value, indicating whether there are significant differences in the mean hardness values among the cooling times.
Since the \(p\)-value (2.97e-12) is far below any commonly used significance level, the data indicate that there are significant differences in the mean hardness values among the cooling times.
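To see where the numbers in the ANOVA table come from, the sums of squares and the F-statistic can be computed by hand. This is a sketch using the same data; names such as `ss_between` and `ss_within` are chosen here for illustration.

```r
# Same data as in the example above
cooling_times <- as.factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)

# Group means, group sizes, and the overall mean
group_means <- tapply(hardness_values, cooling_times, mean)
group_sizes <- tapply(hardness_values, cooling_times, length)
grand_mean <- mean(hardness_values)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((hardness_values - ave(hardness_values, cooling_times))^2)

# Degrees of freedom, F-statistic, and p-value
df_between <- nlevels(cooling_times) - 1
df_within <- length(hardness_values) - nlevels(cooling_times)
f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(SS_between = ss_between, SS_within = ss_within, F = f_stat, p = p_value))
```

The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a large value means the group means differ by more than the scatter within groups would explain.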
To see which cooling times differ from one another in mean hardness, we can conduct a post-hoc Tukey’s HSD test.
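The notes show only the output of this test, so the exact call is an assumption. A call along the following lines, applied to the fitted `aov` model, produces a comparison table of this form:

```r
# Recreate the data and the ANOVA model so this block runs on its own
cooling_times <- as.factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
myData <- data.frame(Cooling_Time = cooling_times, Hardness_HV = hardness_values)
model <- aov(Hardness_HV ~ Cooling_Time, data = myData)

# Tukey's HSD: all pairwise comparisons with a 95% family-wise confidence level
tukey_result <- TukeyHSD(model, conf.level = 0.95)
print(tukey_result)
```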
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Hardness_HV ~ Cooling_Time, data = myData)
##
## $Cooling_Time
## diff lwr upr p adj
## 2-1 40 25.272 54.728 4.40e-06
## 3-1 78 63.272 92.728 0.00e+00
## 4-1 108 93.272 122.728 0.00e+00
## 3-2 38 23.272 52.728 8.50e-06
## 4-2 68 53.272 82.728 0.00e+00
## 4-3 30 15.272 44.728 1.37e-04
The TukeyHSD() function conducts the post-hoc Tukey’s HSD test to compare all possible pairs of cooling times. The result shows which cooling times result in significantly different hardness values.
Since all adjusted p-values are very small (well below the commonly used significance levels), the mean hardness differs significantly between every pair of cooling times. This information can guide the optimization of the cooling process to achieve the desired hardness properties for engineering applications.
Just like regression models, all ANOVA models should be subject to a residual analysis for model checking.
We first plot the residuals versus the fitted values to check the constant variance assumption:
# Extract residuals
residuals <- resid(model)
# Extract fitted values
fitted_values <- fitted(model)
# Plot residuals vs. fitted values (to check for constant variance)
plot(residuals ~ fitted_values,
main = "Residuals vs. Fitted",
xlab = "Fitted Values",
ylab = "Residuals"
)
# Add a horizontal line through y = 0 (i.e., add the x-axis)
abline(h = 0, col = "red")
The residuals are spread fairly evenly around zero across the range of fitted values, so the variance of the residuals appears roughly constant across the four cooling-time groups.
In a regression model with numeric predictors, the same residual plot also provides insight into the linearity assumption, which states that the relationship between the dependent variable and the predictors is linear.
Interpreting Linearity: In a residual plot, if the points exhibit a random scatter around the horizontal line at zero (no distinct pattern), it suggests that the linearity assumption is likely met. In this one-way ANOVA the fitted values are simply the four group means, so no curvature can appear and linearity is not a concern here.
We next plot a histogram of the residuals to check the normality assumption:
# Plot histogram of residuals (to check for normality)
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")
Interpreting Normality: In a histogram of residuals, a roughly bell-shaped distribution supports the normality assumption; in a Q-Q plot (quantile-quantile plot), points that closely follow the straight reference line do the same. Here, the histogram suggests that normality is reasonably met.
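The Q-Q plot mentioned above can be drawn with `qqnorm()` and `qqline()` from base R. This is a minimal sketch on the same ANOVA model; the name `res` is used here to avoid masking R's built-in `residuals()` function.

```r
# Recreate the data and the ANOVA model so this block runs on its own
cooling_times <- as.factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
myData <- data.frame(Cooling_Time = cooling_times, Hardness_HV = hardness_values)
model <- aov(Hardness_HV ~ Cooling_Time, data = myData)

# Normal Q-Q plot of the residuals with a reference line
res <- resid(model)
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = "red")
```

With only 20 observations, some wobble around the reference line is expected; strong curvature or outlying points in the tails would be more concerning.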
We can even conduct a formal test for normality:
# Perform Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(residuals)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.95737, p-value = 0.4928
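The large p-value (0.4928) means that we do not reject the null hypothesis that the residuals are normally distributed, so the test provides no evidence against the normality assumption.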
Overall, when interpreting a residual plot, look for signs of homoscedasticity, linearity, and normality. If the plot shows no clear patterns, the model assumptions are likely met, and the model is a good fit for the data. However, if you observe any systematic patterns or deviations from assumptions, it may indicate that further model adjustments are necessary or that the model may not be appropriate for the data.
Remember that residual plots are visual aids, and it is important to complement their interpretation with formal statistical tests and diagnostic procedures before drawing conclusions about the model’s validity and reliability.
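As one example of such a formal test, Bartlett's test checks whether the group variances are equal, complementing the residuals vs. fitted plot. The choice of Bartlett's test is an assumption made here for illustration, since the notes do not name a specific test; it is sensitive to departures from normality.

```r
# Recreate the data so this block runs on its own
cooling_times <- as.factor(c(1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 3, 2, 4, 1))
hardness_values <- c(300, 350, 380, 400, 340, 370, 290, 410, 375, 335,
                     295, 385, 395, 345, 305, 420, 380, 330, 415, 310)
myData <- data.frame(Cooling_Time = cooling_times, Hardness_HV = hardness_values)

# Sample variance of hardness within each cooling-time group
group_variances <- tapply(myData$Hardness_HV, myData$Cooling_Time, var)
print(group_variances)

# Bartlett's test: the null hypothesis is that all group variances are equal
bartlett_result <- bartlett.test(Hardness_HV ~ Cooling_Time, data = myData)
print(bartlett_result)
```

A large p-value here would mean there is no evidence against the constant variance assumption.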
Quality control (QC) is a systematic process used to ensure that products or services meet specified standards and adhere to established guidelines. It involves monitoring, assessing, and managing the production or delivery process to maintain consistent quality and prevent defects. Here are some basics of quality control:
Objectives:
Ensure products meet customer expectations and specifications.
Minimize defects, errors, and variations in production.
Optimize processes for efficiency and consistency.
Enhance customer satisfaction and loyalty.
Reduce waste and associated costs.
Key concepts:
Defect: Any deviation from the desired specifications or standards.
Variation: Differences between actual measurements and ideal values.
Process Control: Monitoring and adjusting processes to maintain quality.
Statistical Process Control (SPC): Using statistical methods to monitor and control processes.
Sampling: Evaluating a subset of items from a larger batch to infer quality.
Quality Assurance (QA): Actions taken to ensure quality before products are made.
Quality Control (QC): Activities performed to ensure quality during production.
Six Sigma: A data-driven approach to minimize defects and improve processes.
Continuous Improvement: The ongoing effort to enhance processes and quality.
Process (Plan-Do-Check-Act):
Plan: Define quality standards, methods, and resources.
Do: Implement quality control processes according to the plan.
Check: Evaluate and monitor quality using various methods, including inspections and tests.
Act: Take corrective actions to address deviations and improve processes.
Methods and tools:
Inspection: Visual or physical assessment of products.
Testing: Using various tests to assess product attributes.
Statistical Analysis: Applying statistical methods to monitor and control processes.
Control Charts: Graphical tools to monitor variations and identify trends.
Root Cause Analysis: Identifying underlying causes of defects.
Failure Mode and Effects Analysis (FMEA): Identifying potential failure points and their impact.
Benefits:
Consistency in product quality and performance.
Reduced defects and waste.
Improved customer satisfaction and loyalty.
Enhanced brand reputation.
Efficient resource utilization.
Regulatory compliance.
Applications:
Manufacturing: Ensuring products meet specifications.
Healthcare: Ensuring patient safety and accurate diagnoses.
Software Development: Identifying and fixing software defects.
Construction: Ensuring buildings adhere to safety and quality standards.
Quality control is essential for maintaining customer trust, ensuring product reliability, and achieving operational excellence. It involves a combination of methods, processes, and continuous improvement efforts to deliver consistent and high-quality products or services.
A control chart for proportions (also known as a p-chart) is a graphical tool used in quality control to monitor the stability of a process that produces discrete outcomes or proportions. It’s commonly used when dealing with attributes data, such as the proportion of defective items in a sample.
For the theory behind p-charts, see: https://sixsigmastudyguide.com/p-attribute-charts/
Here’s an example showing how you can create and interpret a p-chart using R.
Example: Defective Products in a Manufacturing Process
Let’s assume you are monitoring the proportion of defective products in a manufacturing process. You collect data over time to track the proportion of defects in each sample.
The sample sizes are: 50, 60, 55, 65, 70, 75, 60, 80
The respective numbers of defectives are: 2, 4, 1, 5, 3, 6, 2, 4
Here is the R code for constructing a p-chart:
# Sample data:
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 80)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 4)
# Calculate proportion in each sample
proportions <- defective_counts / sample_sizes
# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)
# Load required library with the library() function. Before loading the library,
# you need to install the library first by typing:
# install.packages("qcc")
# on the console of RStudio/Posit.
library(qcc)
# Create the p-chart
qcc_obj <- qcc(defective_counts, type = "p", sizes = sample_sizes,
title = "P-Chart: Defective Products with Different Sample Sizes")
In this example, the p-chart shows the proportion of defective products in each sample along with control limits. The center line represents the overall proportion of defects across all samples. Control limits are calculated based on statistical methods to identify points that fall outside expected variation.
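For reference, the usual three-sigma limits for a p-chart with varying sample sizes can be written as follows. The notation is assumed here: \(d_i\) is the number of defectives and \(n_i\) the size of sample \(i\).

```latex
\bar{p} = \frac{\sum_i d_i}{\sum_i n_i},
\qquad
\mathrm{UCL}_i = \bar{p} + 3\sqrt{\frac{\bar{p}\,(1 - \bar{p})}{n_i}},
\qquad
\mathrm{LCL}_i = \max\!\left(0,\; \bar{p} - 3\sqrt{\frac{\bar{p}\,(1 - \bar{p})}{n_i}}\right)
```

Because the limits depend on \(n_i\), they are wider for smaller samples, which is why the control limits on the chart appear as steps rather than straight lines.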
Interpretation:
Points within the control limits suggest that the process is stable and variation is consistent.
Points outside the control limits indicate potential issues or changes in the process.
Trends or patterns in the chart can provide insights into process behavior.
Here, no point falls outside the control limits, and no obvious pattern can be observed either.
Remember that control charts are most effective when used as part of a comprehensive quality control system, and they help identify deviations that warrant investigation and corrective action.
Example 2: Control chart with an out-of-control point
# Sample data: sample sizes and number of defective products in each sample
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)
# Calculate proportions
proportions <- defective_counts / sample_sizes
# Calculate the overall proportion of defects
overall_proportion <- sum(defective_counts) / sum(sample_sizes)
# Load required library
library(qcc)
# Create the p-chart
qcc(defective_counts, type = "p", sizes = sample_sizes,
title = "P-Chart: Defective Products with Different Sample Sizes")
## List of 11
## $ call : language qcc(data = defective_counts, type = "p", sizes = sample_sizes, title = "P-Chart: Defective Products with Differen| __truncated__
## $ type : chr "p"
## $ data.name : chr "defective_counts"
## $ data : num [1:8, 1] 2 4 1 5 3 6 2 12
## ..- attr(*, "dimnames")=List of 2
## $ statistics: Named num [1:8] 0.04 0.0667 0.0182 0.0769 0.0429 ...
## ..- attr(*, "names")= chr [1:8] "1" "2" "3" "4" ...
## $ sizes : num [1:8] 50 60 55 65 70 75 60 55
## $ center : num 0.0714
## $ std.dev : num 0.258
## $ nsigmas : num 3
## $ limits : num [1:8, 1:2] 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## $ violations:List of 2
## - attr(*, "class")= chr "qcc"
In a control chart, an out-of-control point typically indicates that the process has experienced a shift or variation that goes beyond normal expected variation. Here the 8th sample has 12 defectives out of 55 items (a proportion of about 0.218), which lies above its upper control limit of about 0.176. Such a point could be due to various factors, such as equipment malfunction, changes in the production process, operator error, or other special causes.
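The flagged point can also be confirmed without the qcc package by computing the three-sigma limits directly. This is a sketch using the Example 2 data; the names `p_bar`, `ucl`, `lcl`, and `out_of_control` are chosen here for illustration.

```r
# Example 2 data
sample_sizes <- c(50, 60, 55, 65, 70, 75, 60, 55)
defective_counts <- c(2, 4, 1, 5, 3, 6, 2, 12)

# Sample proportions and the overall proportion (the center line)
proportions <- defective_counts / sample_sizes
p_bar <- sum(defective_counts) / sum(sample_sizes)

# Three-sigma limits; they vary with the sample size, and the lower limit
# is truncated at 0 because a proportion cannot be negative
sigma_i <- sqrt(p_bar * (1 - p_bar) / sample_sizes)
ucl <- p_bar + 3 * sigma_i
lcl <- pmax(0, p_bar - 3 * sigma_i)

# Tabulate the results and flag samples outside the limits
limits_table <- data.frame(Sample = seq_along(sample_sizes),
                           Proportion = round(proportions, 4),
                           LCL = round(lcl, 4),
                           UCL = round(ucl, 4))
print(limits_table)

out_of_control <- which(proportions > ucl | proportions < lcl)
print(out_of_control)
```

Only the 8th sample is flagged, which agrees with the chart.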
Refer to this excellent reference: https://www.r-bloggers.com/2018/08/using-control-charts-in-r/